Skip to content

Enrollment

The enrollment module handles continuous enrollment validation, gap analysis, and censoring date calculation for survival analysis.

When to Use

Use the enrollment module when you need to:

  • Validate continuous enrollment for a patient cohort
  • Calculate enrollment gaps and identify drop-outs
  • Determine censoring dates for survival analysis
  • Apply enrollment-based inclusion criteria

Integrated with Cohort Module

When you use EnrollmentCriteria in get_cohort(), enrollment analysis is handled automatically. Use this module directly for custom enrollment analysis or when working outside the cohort workflow.

Quick Example

from alx_heor.database import RedshiftConnection
from alx_heor.enrollment import (
    get_enrollment,
    calculate_enrollment_gaps,
    filter_continuous_enrollment,
    get_censor_dates,
)

conn = RedshiftConnection().connect()

# Get enrollment data for a cohort
df_enroll = get_enrollment(
    conn,
    source="iqvia",
    schema="iqvia_pharmetrics_2024q3",
    patient_ids=df_cohort["pat_id"].tolist(),
    start_year=2015,
    end_year=2024,
)

# Calculate gaps relative to index date
df_gaps = calculate_enrollment_gaps(
    df_enroll,
    df_cohort,
    patient_id_col="pat_id",
    index_date_col="index_date",
    months_pre=6,
    months_post=12,
)

# Filter to continuous enrollment (max 1 month gap)
df_continuous = filter_continuous_enrollment(df_gaps, max_gap=1)

# Get censoring dates for survival analysis
df_censor = get_censor_dates(
    df_gaps,
    study_end="2024-03-31",
    max_gap_for_censor=3,
)

Common Patterns

High-Level Enrollment Analysis

from alx_heor.enrollment import analyze_enrollment

# All-in-one function
result = analyze_enrollment(
    conn,
    df_cohort,
    source="iqvia",
    schema="iqvia_pharmetrics_2024q3",
    months_pre=6,
    months_post=12,
    max_gap=1,
    study_end="2024-03-31",
)

# Access results
df_enrolled = result.df_enrolled  # Patients with continuous enrollment
df_censor = result.df_censor      # Censoring dates

Censoring for Survival Analysis

# Calculate censoring dates
# Patients are censored at the earlier of:
# 1. First enrollment gap > 3 months
# 2. Study end date

df_censor = get_censor_dates(
    df_gaps,
    study_end="2024-03-31",
    max_gap_for_censor=3,
)

# Merge with cohort for survival analysis
df_survival = df_cohort.merge(df_censor, on="pat_id")
df_survival["follow_up_days"] = (
    df_survival["censor_date"] - df_survival["index_date"]
).dt.days
  • cohort - Uses enrollment via EnrollmentCriteria
  • claims - Claims data access
  • config - Enrollment table patterns

enrollment

Enrollment and censoring utilities for longitudinal RWE studies.

This module handles one of the most critical aspects of retrospective database studies: ensuring patients are continuously observable throughout the study period. Without continuous enrollment, patients may appear to have no healthcare utilization when they've actually just left the database - a major source of bias.

Key Concepts:

Enrollment: In claims databases, patients are only observable when they have active insurance coverage captured in the database. Enrollment data tracks which months each patient was "enrolled" (observable).

Continuous enrollment: A requirement that patients have uninterrupted enrollment for a specified period (e.g., 6 months before index, 12 months after). Small gaps (1 month) are often allowed due to data processing delays.

Enrollment gap: A period where a patient has no enrollment record. Large gaps (>1-3 months) suggest the patient left the database and may bias outcome assessment.

Censoring: In survival analysis, patients are "censored" when they become unobservable (enrollment gap or study end). Censoring dates are essential for time-to-event analyses.

Why Enrollment Matters:

False negatives: A patient without claims might be healthy OR might have dropped enrollment. Without enrollment checks, you can't tell the difference.

Baseline assessment: To assess comorbidities/exclusion criteria, patients must be observable before index date.

Follow-up assessment: To measure outcomes, patients must remain observable after index date.

Survival analysis: Censoring must occur at enrollment loss, not just study end.

Core Functions:

  • get_enrollment: Fetch enrollment data from database
  • calculate_enrollment_gaps: Identify enrollment gaps relative to index date
  • filter_continuous_enrollment: Select patients meeting enrollment requirements
  • get_censor_dates: Calculate censoring dates for survival analysis
  • analyze_enrollment: High-level workflow combining all steps

Typical Workflow:

  1. Build initial cohort with cohort.get_cohort() (already includes enrollment if specified)
  2. OR use these functions directly for custom enrollment analysis: a. get_enrollment() - Fetch raw enrollment data b. calculate_enrollment_gaps() - Identify gaps c. filter_continuous_enrollment() - Apply requirements d. get_censor_dates() - Calculate censoring for survival analysis
Example

from alx_heor.enrollment import ( ... get_enrollment, ... calculate_enrollment_gaps, ... filter_continuous_enrollment, ... get_censor_dates, ... ) df_enroll = get_enrollment( ... conn, ... source='iqvia', ... schema='iqvia_pharmetrics_2024q3', ... patient_ids=df_cohort['pat_id'].tolist(), ... start_year=2015, ... end_year=2024, ... ) df_gaps = calculate_enrollment_gaps( ... df_enroll, ... df_cohort, ... patient_id_col='pat_id', ... index_date_col='index_date', ... months_pre=6, ... months_post=12, ... ) df_continuous = filter_continuous_enrollment(df_gaps, max_gap=1) print(f"Patients with continuous enrollment: {df_continuous['pat_id'].nunique():,}") df_censor = get_censor_dates( ... df_gaps, ... study_end='2024-03-31', ... max_gap_for_censor=3, ... patient_id_col='pat_id', ... ) print(f"Median follow-up: {df_censor['days_to_censor'].median():.0f} days")

See Also

cohort.get_cohort : High-level cohort function (includes enrollment if specified) cohort.EnrollmentCriteria : Enrollment requirements for get_cohort()

Notes
  • IQVIA enrollment tables: enroll2_2006 through enroll2_2025
  • Enrollment is stored by month (month_id = YYYYMM format)
  • A gap of 1 month is often allowed (data processing delays)
  • Censoring at enrollment loss is critical for unbiased survival analysis

IQVIA Patient ID Pitfall:

IQVIA has two patient ID columns that can cause confusion:

  • pat_id: The primary identifier used in claims tables (claims_YYYY) and enrollment tables (enroll2_YYYY). Always use this for joins.
  • iq_patient_id: A secondary identifier found only in the enroll table (not enroll2_YYYY). Legacy code sometimes uses this for patient-level grouping after joining claims to enroll to get demographics.

When joining claims to enrollment data, always use pat_id. Using iq_patient_id for claims joins will produce empty results because claims tables don't have this column.

EnrollmentResult dataclass

Result from enrollment analysis.

Attributes:

Name Type Description
df_enrollment DataFrame

Raw enrollment data with patient_id and month_id.

df_gaps DataFrame

Enrollment with gap calculations.

df_continuous DataFrame

Patients meeting continuous enrollment criteria.

df_censor DataFrame

Censoring dates for each patient.

attrition dict[str, int]

Attrition counts at each filtering step.

Source code in alx_heor\enrollment\__init__.py
@dataclass
class EnrollmentResult:
    """Result from enrollment analysis.

    Attributes
    ----------
    df_enrollment : pd.DataFrame
        Raw enrollment data with patient_id and month_id.
    df_gaps : pd.DataFrame
        Enrollment with gap calculations.
    df_continuous : pd.DataFrame
        Patients meeting continuous enrollment criteria.
    df_censor : pd.DataFrame
        Censoring dates for each patient.
    attrition : dict[str, int]
        Attrition counts at each filtering step.
    """

    df_enrollment: pd.DataFrame
    df_gaps: pd.DataFrame = field(default_factory=pd.DataFrame)
    df_continuous: pd.DataFrame = field(default_factory=pd.DataFrame)
    df_censor: pd.DataFrame = field(default_factory=pd.DataFrame)
    attrition: dict[str, int] = field(default_factory=dict)

    def summary(self) -> str:
        """Generate human-readable summary of enrollment analysis."""
        lines = ["Enrollment Analysis Summary", "=" * 50]

        if self.attrition:
            lines.append("\nAttrition:")
            first_count = None
            for step, count in self.attrition.items():
                if first_count is None:
                    first_count = count
                    lines.append(f"  {step}: {count:,}")
                else:
                    pct = (count / first_count * 100) if first_count > 0 else 0
                    lines.append(f"  {step}: {count:,} ({pct:.1f}%)")

        if not self.df_censor.empty:
            lines.append(f"\nCensoring summary:")
            lines.append(f"  Total patients: {len(self.df_censor):,}")
            if "is_censored_by_gap" in self.df_censor.columns:
                gap_censored = self.df_censor["is_censored_by_gap"].sum()
                lines.append(f"  Censored by gap: {gap_censored:,}")
                lines.append(
                    f"  Censored at study end: {len(self.df_censor) - gap_censored:,}"
                )

        return "\n".join(lines)

summary

summary() -> str

Generate human-readable summary of enrollment analysis.

Source code in alx_heor\enrollment\__init__.py
def summary(self) -> str:
    """Generate human-readable summary of enrollment analysis."""
    lines = ["Enrollment Analysis Summary", "=" * 50]

    if self.attrition:
        lines.append("\nAttrition:")
        first_count = None
        for step, count in self.attrition.items():
            if first_count is None:
                first_count = count
                lines.append(f"  {step}: {count:,}")
            else:
                pct = (count / first_count * 100) if first_count > 0 else 0
                lines.append(f"  {step}: {count:,} ({pct:.1f}%)")

    if not self.df_censor.empty:
        lines.append(f"\nCensoring summary:")
        lines.append(f"  Total patients: {len(self.df_censor):,}")
        if "is_censored_by_gap" in self.df_censor.columns:
            gap_censored = self.df_censor["is_censored_by_gap"].sum()
            lines.append(f"  Censored by gap: {gap_censored:,}")
            lines.append(
                f"  Censored at study end: {len(self.df_censor) - gap_censored:,}"
            )

    return "\n".join(lines)

get_enrollment

get_enrollment(conn: RedshiftConnection, source: str, schema: str, patient_ids: list[str], start_year: int, end_year: int, columns: list[str] | None = None) -> pd.DataFrame

Fetch enrollment data from database for specified patients.

Enrollment data tracks which months each patient was "observable" in the database (had active insurance coverage). This is essential for: - Verifying patients were observable during baseline/follow-up periods - Calculating continuous enrollment for inclusion criteria - Determining censoring dates for survival analysis

The function queries yearly enrollment tables (e.g., enroll2_2020, enroll2_2021) and filters to the specified patients.

Workflow
  1. get_enrollment() <-- you are here
  2. calculate_enrollment_gaps() - Identify gaps relative to index date
  3. filter_continuous_enrollment() - Apply enrollment requirements
  4. get_censor_dates() - Calculate censoring for survival analysis

Parameters:

Name Type Description Default
conn RedshiftConnection

Active database connection. Must be connected before calling.

required
source str

Data source name: 'iqvia', 'optum', 'komodo'. Determines enrollment table pattern and column names.

required
schema str

Database schema name (e.g., 'iqvia_pharmetrics_2024q3').

required
patient_ids list[str]

List of patient IDs to fetch enrollment for. Typically from df_cohort['pat_id'].tolist().

required
start_year int

Start year for enrollment tables (e.g., 2015).

required
end_year int

End year for enrollment tables (e.g., 2024).

required
columns list[str]

Columns to select. If None, selects patient_id and month_id. Include 'pay_type' if you need payer classification.

None

Returns:

Type Description
DataFrame

Enrollment data with one row per patient-month, containing: - patient_id column (pat_id for IQVIA, patid for Optum) - month_id: Enrollment month in YYYYMM format (e.g., 202301) - Additional columns if specified

See Also

calculate_enrollment_gaps : Next step - identify enrollment gaps cohort.get_cohort : High-level function that handles enrollment automatically

Notes
  • IQVIA enrollment tables: enroll2_2006 through enroll2_2025
  • month_id format: YYYYMM (e.g., 202301 = January 2023)
  • Query can be slow for large patient lists (consider batching)
  • Memory usage scales with patient count × months

Examples:

Basic enrollment fetch:

>>> df_enroll = get_enrollment(
...     conn,
...     source='iqvia',
...     schema='iqvia_pharmetrics_2024q3',
...     patient_ids=df_cohort['pat_id'].tolist(),
...     start_year=2015,
...     end_year=2024,
... )
>>> print(f"Enrollment records: {len(df_enroll):,}")
>>> print(f"Patients with enrollment: {df_enroll['pat_id'].nunique():,}")

Include payer type for payer classification:

>>> df_enroll = get_enrollment(
...     conn,
...     source='iqvia',
...     schema='iqvia_pharmetrics_2024q3',
...     patient_ids=patient_ids,
...     start_year=2020,
...     end_year=2024,
...     columns=['pat_id', 'month_id', 'pay_type'],
... )
Source code in alx_heor\enrollment\__init__.py
def get_enrollment(
    conn: RedshiftConnection,
    source: str,
    schema: str,
    patient_ids: list[str],
    start_year: int,
    end_year: int,
    columns: list[str] | None = None,
) -> pd.DataFrame:
    """Fetch enrollment data from database for specified patients.

    Enrollment data tracks which months each patient was "observable" in the
    database (had active insurance coverage). This is essential for:
    - Verifying patients were observable during baseline/follow-up periods
    - Calculating continuous enrollment for inclusion criteria
    - Determining censoring dates for survival analysis

    The function queries yearly enrollment tables (e.g., enroll2_2020, enroll2_2021)
    and filters to the specified patients.

    Workflow
    --------
    1. **get_enrollment()** <-- you are here
    2. calculate_enrollment_gaps() - Identify gaps relative to index date
    3. filter_continuous_enrollment() - Apply enrollment requirements
    4. get_censor_dates() - Calculate censoring for survival analysis

    Parameters
    ----------
    conn : RedshiftConnection
        Active database connection. Must be connected before calling.
    source : str
        Data source name: 'iqvia', 'optum', 'komodo'.
        Determines enrollment table pattern and column names.
    schema : str
        Database schema name (e.g., 'iqvia_pharmetrics_2024q3').
    patient_ids : list[str]
        List of patient IDs to fetch enrollment for.
        Typically from `df_cohort['pat_id'].tolist()`.
    start_year : int
        Start year for enrollment tables (e.g., 2015).
    end_year : int
        End year for enrollment tables (e.g., 2024).
    columns : list[str], optional
        Columns to select. If None, selects patient_id and month_id.
        Include 'pay_type' if you need payer classification.

    Returns
    -------
    pd.DataFrame
        Enrollment data with one row per patient-month, containing:
        - patient_id column (pat_id for IQVIA, patid for Optum)
        - month_id: Enrollment month in YYYYMM format (e.g., 202301)
        - Additional columns if specified

    See Also
    --------
    calculate_enrollment_gaps : Next step - identify enrollment gaps
    cohort.get_cohort : High-level function that handles enrollment automatically

    Notes
    -----
    - IQVIA enrollment tables: enroll2_2006 through enroll2_2025
    - month_id format: YYYYMM (e.g., 202301 = January 2023)
    - Query can be slow for large patient lists (consider batching)
    - Memory usage scales with patient count × months

    Examples
    --------
    Basic enrollment fetch:

    >>> df_enroll = get_enrollment(
    ...     conn,
    ...     source='iqvia',
    ...     schema='iqvia_pharmetrics_2024q3',
    ...     patient_ids=df_cohort['pat_id'].tolist(),
    ...     start_year=2015,
    ...     end_year=2024,
    ... )
    >>> print(f"Enrollment records: {len(df_enroll):,}")
    >>> print(f"Patients with enrollment: {df_enroll['pat_id'].nunique():,}")

    Include payer type for payer classification:

    >>> df_enroll = get_enrollment(
    ...     conn,
    ...     source='iqvia',
    ...     schema='iqvia_pharmetrics_2024q3',
    ...     patient_ids=patient_ids,
    ...     start_year=2020,
    ...     end_year=2024,
    ...     columns=['pat_id', 'month_id', 'pay_type'],
    ... )
    """
    config = get_source_config(source)
    patient_id_col = config["columns"]["patient_id"]
    enrollment_pattern = config.get("enrollment_table_pattern", "enroll2_{year}")

    # Default columns: patient_id and month_id
    if columns is None:
        columns = [patient_id_col, "month_id"]

    # Build UNION of enrollment tables
    tables = []
    for year in range(start_year, end_year + 1):
        table_name = enrollment_pattern.format(year=year)
        tables.append(f"{schema}.{table_name}")

    # Build SQL with patient filter
    cols_str = ", ".join(columns)
    patient_list = ", ".join(f"'{pid}'" for pid in patient_ids)

    union_parts = [
        f"SELECT {cols_str} FROM {table} WHERE {patient_id_col} IN ({patient_list})"
        for table in tables
    ]
    sql = " UNION ALL ".join(union_parts)

    return conn.query(sql)

build_enrollment_sql

build_enrollment_sql(source: str, schema: str, start_year: int, end_year: int, columns: list[str] | None = None) -> str

Build SQL for enrollment table union (without patient filter).

Useful for joining enrollment data in larger queries.

Parameters:

Name Type Description Default
source str

Data source name.

required
schema str

Database schema name.

required
start_year int

Start year.

required
end_year int

End year.

required
columns list[str]

Columns to select.

None

Returns:

Type Description
str

SQL UNION ALL query for enrollment tables.

Source code in alx_heor\enrollment\__init__.py
def build_enrollment_sql(
    source: str,
    schema: str,
    start_year: int,
    end_year: int,
    columns: list[str] | None = None,
) -> str:
    """Build SQL for enrollment table union (without patient filter).

    Useful for joining enrollment data in larger queries.

    Parameters
    ----------
    source : str
        Data source name.
    schema : str
        Database schema name.
    start_year : int
        Start year.
    end_year : int
        End year.
    columns : list[str], optional
        Columns to select.

    Returns
    -------
    str
        SQL UNION ALL query for enrollment tables.
    """
    config = get_source_config(source)
    patient_id_col = config["columns"]["patient_id"]
    enrollment_pattern = config.get("enrollment_table_pattern", "enroll2_{year}")

    if columns is None:
        columns = [patient_id_col, "month_id"]

    cols_str = ", ".join(columns)

    union_parts = []
    for year in range(start_year, end_year + 1):
        table_name = enrollment_pattern.format(year=year)
        union_parts.append(f"SELECT {cols_str} FROM {schema}.{table_name}")

    return " UNION ALL ".join(union_parts)

calculate_enrollment_gaps

calculate_enrollment_gaps(df_enrollment: DataFrame, df_index: DataFrame, patient_id_col: str = 'pat_id', index_date_col: str = 'index_date', months_pre: int = 6, months_post: int = 12) -> pd.DataFrame

Calculate enrollment gaps relative to each patient's index date.

This function is the core of continuous enrollment assessment. It identifies gaps in each patient's enrollment record, measured relative to their index date. This allows you to: - Verify patients were observable during baseline period - Verify patients were observable during follow-up period - Identify where gaps occurred for censoring

Gap Calculation Logic

For each patient, the function calculates: 1. Gap from baseline start: Days between required baseline start (index - months_pre) and first enrollment month 2. Gaps between months: Days between consecutive enrollment records 3. Gap to follow-up end: Days between last enrollment and required follow-up end (index + months_post)

A max_gap_months column summarizes the largest gap for each patient, making it easy to filter for continuous enrollment.

Workflow
  1. get_enrollment() - Fetch raw enrollment data
  2. calculate_enrollment_gaps() <-- you are here
  3. filter_continuous_enrollment() - Apply requirements (max_gap <= N)
  4. get_censor_dates() - Calculate censoring for survival analysis

Parameters:

Name Type Description Default
df_enrollment DataFrame

Enrollment data with patient_id and month_id columns. Output from get_enrollment().

required
df_index DataFrame

Patient cohort with patient_id and index_date columns. Output from claims.get_index_dates() or cohort.get_cohort().

required
patient_id_col str

Name of patient ID column. Use 'patid' for Optum.

'pat_id'
index_date_col str

Name of index date column in df_index.

'index_date'
months_pre int

Required months of enrollment BEFORE index date (baseline period). Common values: 6, 12.

6
months_post int

Required months of enrollment AFTER index date (follow-up period). Common values: 6, 12, 24.

12

Returns:

Type Description
DataFrame

Enrollment data with added gap calculation columns: - gap_months: Gap from previous enrollment month (in months) - max_gap_months: Maximum gap for this patient (for filtering) - rank: Enrollment month rank within patient (1 = earliest) - is_padded_last: True for synthetic "gap to follow-up end" rows

See Also

get_enrollment : Previous step - fetch raw enrollment data filter_continuous_enrollment : Next step - apply gap threshold get_censor_dates : Calculate censoring based on gaps

Notes
  • A gap of 1 month is often allowed (data processing delays)
  • The function creates synthetic rows to measure gap to follow-up end
  • Patients missing from enrollment data are dropped (inner join)
  • Large gaps early in baseline vs late in follow-up have different implications

Examples:

Calculate gaps for 6-month baseline + 12-month follow-up:

>>> df_gaps = calculate_enrollment_gaps(
...     df_enrollment,
...     df_cohort,
...     patient_id_col='pat_id',
...     index_date_col='index_date',
...     months_pre=6,
...     months_post=12,
... )
>>>
>>> # Check gap distribution
>>> print(df_gaps.groupby('pat_id')['max_gap_months'].first().describe())
count    45678.000000
mean         1.234567
std          2.345678
min          1.000000
50%          1.000000
max         36.000000

Filter to continuous enrollment (max gap <= 1 month):

>>> df_continuous = df_gaps[df_gaps['max_gap_months'] <= 1]
>>> print(f"Continuous enrollment: {df_continuous['pat_id'].nunique():,}")

Check how many patients fail at each gap threshold:

>>> for threshold in [1, 2, 3]:
...     n = df_gaps[df_gaps['max_gap_months'] <= threshold]['pat_id'].nunique()
...     print(f"Gap <= {threshold}: {n:,} patients")
Source code in alx_heor\enrollment\__init__.py
def calculate_enrollment_gaps(
    df_enrollment: pd.DataFrame,
    df_index: pd.DataFrame,
    patient_id_col: str = "pat_id",
    index_date_col: str = "index_date",
    months_pre: int = 6,
    months_post: int = 12,
) -> pd.DataFrame:
    """Calculate enrollment gaps relative to each patient's index date.

    This function is the core of continuous enrollment assessment. It identifies
    gaps in each patient's enrollment record, measured relative to their index
    date. This allows you to:
    - Verify patients were observable during baseline period
    - Verify patients were observable during follow-up period
    - Identify where gaps occurred for censoring

    Gap Calculation Logic
    ---------------------
    For each patient, the function calculates:
    1. **Gap from baseline start**: Days between required baseline start
       (index - months_pre) and first enrollment month
    2. **Gaps between months**: Days between consecutive enrollment records
    3. **Gap to follow-up end**: Days between last enrollment and required
       follow-up end (index + months_post)

    A max_gap_months column summarizes the largest gap for each patient,
    making it easy to filter for continuous enrollment.

    Workflow
    --------
    1. get_enrollment() - Fetch raw enrollment data
    2. **calculate_enrollment_gaps()** <-- you are here
    3. filter_continuous_enrollment() - Apply requirements (max_gap <= N)
    4. get_censor_dates() - Calculate censoring for survival analysis

    Parameters
    ----------
    df_enrollment : pd.DataFrame
        Enrollment data with patient_id and month_id columns.
        Output from get_enrollment().
    df_index : pd.DataFrame
        Patient cohort with patient_id and index_date columns.
        Output from claims.get_index_dates() or cohort.get_cohort().
    patient_id_col : str, default='pat_id'
        Name of patient ID column. Use 'patid' for Optum.
    index_date_col : str, default='index_date'
        Name of index date column in df_index.
    months_pre : int, default=6
        Required months of enrollment BEFORE index date (baseline period).
        Common values: 6, 12.
    months_post : int, default=12
        Required months of enrollment AFTER index date (follow-up period).
        Common values: 6, 12, 24.

    Returns
    -------
    pd.DataFrame
        Enrollment data with added gap calculation columns:
        - gap_months: Gap from previous enrollment month (in months)
        - max_gap_months: Maximum gap for this patient (for filtering)
        - rank: Enrollment month rank within patient (1 = earliest)
        - is_padded_last: True for synthetic "gap to follow-up end" rows

    See Also
    --------
    get_enrollment : Previous step - fetch raw enrollment data
    filter_continuous_enrollment : Next step - apply gap threshold
    get_censor_dates : Calculate censoring based on gaps

    Notes
    -----
    - A gap of 1 month is often allowed (data processing delays)
    - The function creates synthetic rows to measure gap to follow-up end
    - Patients missing from enrollment data are dropped (inner join)
    - Large gaps early in baseline vs late in follow-up have different implications

    Examples
    --------
    Calculate gaps for 6-month baseline + 12-month follow-up:

    >>> df_gaps = calculate_enrollment_gaps(
    ...     df_enrollment,
    ...     df_cohort,
    ...     patient_id_col='pat_id',
    ...     index_date_col='index_date',
    ...     months_pre=6,
    ...     months_post=12,
    ... )
    >>>
    >>> # Check gap distribution
    >>> print(df_gaps.groupby('pat_id')['max_gap_months'].first().describe())
    count    45678.000000
    mean         1.234567
    std          2.345678
    min          1.000000
    50%          1.000000
    max         36.000000

    Filter to continuous enrollment (max gap <= 1 month):

    >>> df_continuous = df_gaps[df_gaps['max_gap_months'] <= 1]
    >>> print(f"Continuous enrollment: {df_continuous['pat_id'].nunique():,}")

    Check how many patients fail at each gap threshold:

    >>> for threshold in [1, 2, 3]:
    ...     n = df_gaps[df_gaps['max_gap_months'] <= threshold]['pat_id'].nunique()
    ...     print(f"Gap <= {threshold}: {n:,} patients")
    """
    df = df_enrollment.copy()

    # Merge index dates
    df = df.merge(
        df_index[[patient_id_col, index_date_col]],
        on=patient_id_col,
        how="inner",
    )

    # Convert month_id to numeric months (months since year 1900)
    df["year"] = df["month_id"].astype(str).str[:4].astype(int)
    df["month"] = df["month_id"].astype(str).str[-2:].astype(int)
    df["months_since_1900"] = (df["year"] - 1900) * 12 + df["month"]

    # Sort and rank
    df = df.sort_values([patient_id_col, "months_since_1900"])
    df["rank"] = df.groupby(patient_id_col)["months_since_1900"].rank(
        "dense", ascending=True
    )

    # Calculate gap between consecutive months
    df["gap_months"] = df.groupby(patient_id_col)["months_since_1900"].diff()

    # Calculate required enrollment period boundaries
    df[index_date_col] = pd.to_datetime(df[index_date_col])

    # Baseline start (months_pre before index)
    baseline_start = df[index_date_col] - pd.DateOffset(months=months_pre)
    df["baseline_start_months"] = (
        (baseline_start.dt.year - 1900) * 12 + baseline_start.dt.month
    )

    # Follow-up end (months_post after index)
    followup_end = df[index_date_col] + pd.DateOffset(months=months_post)
    df["followup_end_months"] = (
        (followup_end.dt.year - 1900) * 12 + followup_end.dt.month
    )

    # Pad first month gap (distance from required baseline start)
    first_month_mask = df["rank"] == 1
    df.loc[first_month_mask, "gap_months"] = (
        df.loc[first_month_mask, "months_since_1900"]
        - df.loc[first_month_mask, "baseline_start_months"]
    )

    # Create padded last month entries (distance to required follow-up end)
    last_month_idx = (
        df.groupby(patient_id_col)["months_since_1900"].transform("max")
        == df["months_since_1900"]
    )
    df_last = df[last_month_idx].copy()
    df_last["gap_months"] = df_last["followup_end_months"] - df_last["months_since_1900"]
    df_last["rank"] = df_last["rank"] + 0.5  # Marks this as the padded last entry
    df_last["is_padded_last"] = True
    df["is_padded_last"] = False

    # Combine original and padded
    df_out = pd.concat([df, df_last], ignore_index=True)
    df_out = df_out.sort_values([patient_id_col, "rank"])

    # Calculate max gap per patient
    max_gaps = (
        df_out.groupby(patient_id_col)["gap_months"]
        .max()
        .reset_index()
        .rename(columns={"gap_months": "max_gap_months"})
    )
    df_out = df_out.merge(max_gaps, on=patient_id_col, how="left")

    return df_out

get_max_enrollment_gap

get_max_enrollment_gap(df_enrollment: DataFrame, df_index: DataFrame, patient_id_col: str = 'pat_id', index_date_col: str = 'index_date', months_pre: int = 6, months_post: int = 12) -> pd.DataFrame

Get maximum enrollment gap for each patient.

Convenience wrapper around calculate_enrollment_gaps that returns just the max gap per patient.

Parameters:

Name Type Description Default
df_enrollment DataFrame

Enrollment data with patient_id and month_id columns.

required
df_index DataFrame

Patient index dates.

required
patient_id_col str

Name of patient ID column.

'pat_id'
index_date_col str

Name of index date column.

'index_date'
months_pre int

Required months before index.

6
months_post int

Required months after index.

12

Returns:

Type Description
DataFrame

One row per patient with columns: patient_id, max_gap_months.

Example

df_max_gap = get_max_enrollment_gap(df_enrollment, df_cohort) continuous = df_max_gap[df_max_gap['max_gap_months'] <= 1]

Source code in alx_heor\enrollment\__init__.py
def get_max_enrollment_gap(
    df_enrollment: pd.DataFrame,
    df_index: pd.DataFrame,
    patient_id_col: str = "pat_id",
    index_date_col: str = "index_date",
    months_pre: int = 6,
    months_post: int = 12,
) -> pd.DataFrame:
    """Get maximum enrollment gap for each patient.

    Convenience wrapper around calculate_enrollment_gaps that returns
    just the max gap per patient.

    Parameters
    ----------
    df_enrollment : pd.DataFrame
        Enrollment data with `patient_id` and `month_id` columns.
    df_index : pd.DataFrame
        Patient index dates.
    patient_id_col : str
        Name of patient ID column.
    index_date_col : str
        Name of index date column.
    months_pre : int
        Required months before index.
    months_post : int
        Required months after index.

    Returns
    -------
    pd.DataFrame
        One row per patient with columns: patient_id, max_gap_months.

    Example
    -------
    >>> df_max_gap = get_max_enrollment_gap(df_enrollment, df_cohort)
    >>> continuous = df_max_gap[df_max_gap['max_gap_months'] <= 1]
    """
    df_gaps = calculate_enrollment_gaps(
        df_enrollment,
        df_index,
        patient_id_col=patient_id_col,
        index_date_col=index_date_col,
        months_pre=months_pre,
        months_post=months_post,
    )

    return (
        df_gaps[[patient_id_col, "max_gap_months"]]
        .drop_duplicates()
        .reset_index(drop=True)
    )

filter_continuous_enrollment

filter_continuous_enrollment(df_gaps: DataFrame, max_gap: int = 1, patient_id_col: str = 'pat_id') -> pd.DataFrame

Filter to patients with continuous enrollment.

Parameters:

Name Type Description Default
df_gaps DataFrame

Output from calculate_enrollment_gaps with max_gap_months column.

required
max_gap int

Maximum allowed gap in months (default: 1, meaning consecutive months).

1
patient_id_col str

Name of patient ID column.

'pat_id'

Returns:

Type Description
DataFrame

Filtered to patients where max_gap_months <= max_gap.

Example

df_continuous = filter_continuous_enrollment(df_gaps, max_gap=1) print(f"Patients with continuous enrollment: {df_continuous[patient_id_col].nunique()}")

Source code in alx_heor\enrollment\__init__.py
def filter_continuous_enrollment(
    df_gaps: pd.DataFrame,
    max_gap: int = 1,
    patient_id_col: str = "pat_id",
) -> pd.DataFrame:
    """Filter to patients with continuous enrollment.

    Parameters
    ----------
    df_gaps : pd.DataFrame
        Output from calculate_enrollment_gaps with max_gap_months column.
    max_gap : int
        Maximum allowed gap in months (default: 1, meaning consecutive months).
    patient_id_col : str
        Name of patient ID column.

    Returns
    -------
    pd.DataFrame
        Filtered to patients where max_gap_months <= max_gap.

    Example
    -------
    >>> df_continuous = filter_continuous_enrollment(df_gaps, max_gap=1)
    >>> print(f"Patients with continuous enrollment: {df_continuous[patient_id_col].nunique()}")
    """
    return df_gaps[df_gaps["max_gap_months"] <= max_gap].copy()

get_censor_dates

get_censor_dates(df_gaps: DataFrame, study_end: str, max_gap_for_censor: int = 3, patient_id_col: str = 'pat_id', index_date_col: str = 'index_date') -> pd.DataFrame

Calculate censoring dates for each patient for survival analysis.

In survival analysis (time-to-event analysis), patients must be "censored" when they become unobservable. This function calculates the censoring date as the earlier of: 1. First enrollment gap exceeding the threshold (patient left database) 2. Study end date (administrative censoring)

Proper censoring is critical for unbiased survival estimates. Without it, patients who leave the database would appear to never have the event, biasing results toward better outcomes.

Censoring Logic

For each patient: 1. Find first enrollment gap > max_gap_for_censor 2. If found: censor_date = start of gap period + tolerance 3. If not found: censor_date = last enrollment + tolerance OR study_end 4. Final censor_date = min(calculated date, study_end)

The tolerance (max_gap_for_censor) allows for some gap before censoring, as small gaps may be data processing issues rather than true disenrollment.

Workflow
  1. get_enrollment() - Fetch raw enrollment data
  2. calculate_enrollment_gaps() - Identify gaps
  3. filter_continuous_enrollment() - Apply requirements
  4. get_censor_dates() <-- you are here

Parameters:

Name Type Description Default
df_gaps DataFrame

Output from calculate_enrollment_gaps(). Must contain gap_months, month_id, and patient_id columns.

required
study_end str

Study end date in 'YYYY-MM-DD' format (e.g., '2024-03-31'). All patients are censored by this date at latest.

required
max_gap_for_censor int

Maximum gap (in months) allowed before censoring. Gaps exceeding this trigger censoring at the start of the gap. Common values: 3 (lenient), 1 (strict).

3
patient_id_col str

Name of patient ID column.

'pat_id'
index_date_col str

Name of index date column.

'index_date'

Returns:

Type Description
DataFrame

One row per patient with columns: - patient_id_col: Patient identifier - index_date_col: Patient's index date - censor_date: Date patient is censored (datetime) - is_censored_by_gap: True if censored due to enrollment gap, False if censored at study end - days_to_censor: Days from index_date to censor_date (int)

See Also

calculate_enrollment_gaps : Previous step - calculate gaps filter_continuous_enrollment : Filter for continuous enrollment

Notes
  • Censoring is RIGHT censoring (event may occur after censor date)
  • Patients censored by gap were likely lost to follow-up
  • Median days_to_censor is a key metric for study feasibility
  • High proportion censored by gap suggests data quality issues

Examples:

Calculate censoring dates with 3-month tolerance:

>>> df_censor = get_censor_dates(
...     df_gaps,
...     study_end='2024-03-31',
...     max_gap_for_censor=3,
...     patient_id_col='pat_id',
... )
>>>
>>> # Summary statistics
>>> print(f"Patients: {len(df_censor):,}")
>>> print(f"Median follow-up: {df_censor['days_to_censor'].median():.0f} days")
>>> print(f"Mean follow-up: {df_censor['days_to_censor'].mean():.0f} days")
>>>
>>> # Censoring breakdown
>>> n_gap = df_censor['is_censored_by_gap'].sum()
>>> n_end = len(df_censor) - n_gap
>>> print(f"Censored by enrollment gap: {n_gap:,} ({n_gap/len(df_censor)*100:.1f}%)")
>>> print(f"Censored at study end: {n_end:,} ({n_end/len(df_censor)*100:.1f}%)")

Use for survival analysis:

>>> # Prepare data for lifelines or survival analysis
>>> df_survival = df_cohort.merge(df_censor, on='pat_id')
>>> df_survival['duration'] = df_survival['days_to_censor']
>>> df_survival['event'] = df_survival['had_event'] & ~df_survival['is_censored_by_gap']
Source code in alx_heor\enrollment\__init__.py
def get_censor_dates(
    df_gaps: pd.DataFrame,
    study_end: str,
    max_gap_for_censor: int = 3,
    patient_id_col: str = "pat_id",
    index_date_col: str = "index_date",
) -> pd.DataFrame:
    """Calculate censoring dates for each patient for survival analysis.

    In survival analysis (time-to-event analysis), patients must be "censored"
    when they become unobservable. This function calculates the censoring date
    as the earlier of:
    1. First enrollment gap exceeding the threshold (patient left database)
    2. Study end date (administrative censoring)

    Proper censoring is critical for unbiased survival estimates. Without it,
    patients who leave the database would appear to never have the event,
    biasing results toward better outcomes.

    Censoring Logic
    ---------------
    For each patient:
    1. Find first enrollment gap > max_gap_for_censor
    2. If found: censor_date = start of gap period + tolerance
    3. If not found: censor_date = last enrollment + tolerance OR study_end
    4. Final censor_date = min(calculated date, study_end)

    The tolerance (max_gap_for_censor) allows for some gap before censoring,
    as small gaps may be data processing issues rather than true disenrollment.

    Workflow
    --------
    1. get_enrollment() - Fetch raw enrollment data
    2. calculate_enrollment_gaps() - Identify gaps
    3. filter_continuous_enrollment() - Apply requirements
    4. **get_censor_dates()** <-- you are here

    Parameters
    ----------
    df_gaps : pd.DataFrame
        Output from calculate_enrollment_gaps(). Must contain gap_months,
        month_id, and patient_id columns.
    study_end : str
        Study end date in 'YYYY-MM-DD' format (e.g., '2024-03-31').
        All patients are censored by this date at latest.
    max_gap_for_censor : int, default=3
        Maximum gap (in months) allowed before censoring. Gaps exceeding
        this trigger censoring at the start of the gap.
        Common values: 3 (lenient), 1 (strict).
    patient_id_col : str, default='pat_id'
        Name of patient ID column.
    index_date_col : str, default='index_date'
        Name of index date column.

    Returns
    -------
    pd.DataFrame
        One row per patient with columns:
        - patient_id_col: Patient identifier
        - index_date_col: Patient's index date
        - censor_date: Date patient is censored (datetime)
        - is_censored_by_gap: True if censored due to enrollment gap,
          False if censored at study end
        - days_to_censor: Days from index_date to censor_date (int)

    See Also
    --------
    calculate_enrollment_gaps : Previous step - calculate gaps
    filter_continuous_enrollment : Filter for continuous enrollment

    Notes
    -----
    - Censoring is RIGHT censoring (event may occur after censor date)
    - Patients censored by gap were likely lost to follow-up
    - Median days_to_censor is a key metric for study feasibility
    - High proportion censored by gap suggests data quality issues

    Examples
    --------
    Calculate censoring dates with 3-month tolerance:

    >>> df_censor = get_censor_dates(
    ...     df_gaps,
    ...     study_end='2024-03-31',
    ...     max_gap_for_censor=3,
    ...     patient_id_col='pat_id',
    ... )
    >>>
    >>> # Summary statistics
    >>> print(f"Patients: {len(df_censor):,}")
    >>> print(f"Median follow-up: {df_censor['days_to_censor'].median():.0f} days")
    >>> print(f"Mean follow-up: {df_censor['days_to_censor'].mean():.0f} days")
    >>>
    >>> # Censoring breakdown
    >>> n_gap = df_censor['is_censored_by_gap'].sum()
    >>> n_end = len(df_censor) - n_gap
    >>> print(f"Censored by enrollment gap: {n_gap:,} ({n_gap/len(df_censor)*100:.1f}%)")
    >>> print(f"Censored at study end: {n_end:,} ({n_end/len(df_censor)*100:.1f}%)")

    Use for survival analysis:

    >>> # Prepare data for lifelines or survival analysis
    >>> df_survival = df_cohort.merge(df_censor, on='pat_id')
    >>> df_survival['duration'] = df_survival['days_to_censor']
    >>> df_survival['event'] = df_survival['had_event'] & ~df_survival['is_censored_by_gap']
    """
    # Work with original records only (not padded last entries)
    df = df_gaps[~df_gaps.get("is_padded_last", False)].copy()

    # Convert month_id to date
    df["month_date"] = pd.to_datetime(df["month_id"].astype(str), format="%Y%m")

    study_end_dt = pd.Timestamp(study_end)

    results = []

    for pat_id, group in df.groupby(patient_id_col):
        group = group.sort_values("rank")
        index_date = group[index_date_col].iloc[0]

        # Find first gap exceeding threshold
        gaps_exceeding = group[group["gap_months"] > max_gap_for_censor]

        if len(gaps_exceeding) > 0:
            # Censored by gap - censor at start of gap period + tolerance
            first_gap_row = gaps_exceeding.iloc[0]
            gap_month_date = first_gap_row["month_date"]
            gap_size = int(first_gap_row["gap_months"])

            # Censor date = month before gap + tolerance
            censor_date = (
                gap_month_date.to_period("M") - gap_size + max_gap_for_censor
            ).to_timestamp()
            is_censored_by_gap = True
        else:
            # Not censored by gap - use last enrollment + tolerance or study end
            last_month = group["month_date"].max()
            censor_date = (last_month.to_period("M") + max_gap_for_censor).to_timestamp()
            is_censored_by_gap = False

        # Cap at study end
        censor_date = min(censor_date, study_end_dt)

        # Calculate days to censor
        days_to_censor = (censor_date - pd.Timestamp(index_date)).days

        results.append(
            {
                patient_id_col: pat_id,
                index_date_col: index_date,
                "censor_date": censor_date,
                "is_censored_by_gap": is_censored_by_gap,
                "days_to_censor": days_to_censor,
            }
        )

    return pd.DataFrame(results)

analyze_enrollment

analyze_enrollment(conn: RedshiftConnection, source: str, schema: str, df_cohort: DataFrame, start_year: int, end_year: int, patient_id_col: str = 'pat_id', index_date_col: str = 'index_date', months_pre: int = 6, months_post: int = 12, max_gap_continuous: int = 1, max_gap_censor: int = 3, study_end: str | None = None) -> EnrollmentResult

Complete enrollment analysis workflow.

This high-level function performs: 1. Fetch enrollment data for cohort 2. Calculate enrollment gaps 3. Filter for continuous enrollment 4. Calculate censoring dates

Parameters:

Name Type Description Default
conn RedshiftConnection

Active database connection.

required
source str

Data source name: 'iqvia', 'optum', 'komodo'.

required
schema str

Database schema name.

required
df_cohort DataFrame

Cohort with patient_id and index_date columns.

required
start_year int

Start year for enrollment tables.

required
end_year int

End year for enrollment tables.

required
patient_id_col str

Name of patient ID column (default: 'pat_id').

'pat_id'
index_date_col str

Name of index date column (default: 'index_date').

'index_date'
months_pre int

Required baseline months (default: 6).

6
months_post int

Required follow-up months (default: 12).

12
max_gap_continuous int

Max gap for continuous enrollment filter (default: 1).

1
max_gap_censor int

Max gap before censoring (default: 3).

3
study_end str

Study end date for censoring. If None, uses end of end_year.

None

Returns:

Type Description
EnrollmentResult

Complete enrollment analysis results with attrition tracking.

Example

result = analyze_enrollment( ... conn, source='iqvia', schema='iqvia_pharmetrics_2024q3', ... df_cohort=df_cohort, start_year=2015, end_year=2024, ... months_pre=6, months_post=12, ... ) print(result.summary()) df_final = result.df_continuous # Patients with continuous enrollment

Source code in alx_heor\enrollment\__init__.py
def analyze_enrollment(
    conn: RedshiftConnection,
    source: str,
    schema: str,
    df_cohort: pd.DataFrame,
    start_year: int,
    end_year: int,
    patient_id_col: str = "pat_id",
    index_date_col: str = "index_date",
    months_pre: int = 6,
    months_post: int = 12,
    max_gap_continuous: int = 1,
    max_gap_censor: int = 3,
    study_end: str | None = None,
) -> EnrollmentResult:
    """Complete enrollment analysis workflow.

    This high-level function performs:
    1. Fetch enrollment data for cohort
    2. Calculate enrollment gaps
    3. Filter for continuous enrollment
    4. Calculate censoring dates

    Parameters
    ----------
    conn : RedshiftConnection
        Active database connection.
    source : str
        Data source name: 'iqvia', 'optum', 'komodo'.
    schema : str
        Database schema name.
    df_cohort : pd.DataFrame
        Cohort with patient_id and index_date columns.
    start_year : int
        Start year for enrollment tables.
    end_year : int
        End year for enrollment tables.
    patient_id_col : str
        Name of patient ID column (default: 'pat_id').
    index_date_col : str
        Name of index date column (default: 'index_date').
    months_pre : int
        Required baseline months (default: 6).
    months_post : int
        Required follow-up months (default: 12).
    max_gap_continuous : int
        Max gap for continuous enrollment filter (default: 1).
    max_gap_censor : int
        Max gap before censoring (default: 3).
    study_end : str, optional
        Study end date for censoring. If None, uses end of end_year.

    Returns
    -------
    EnrollmentResult
        Complete enrollment analysis results with attrition tracking.

    Example
    -------
    >>> result = analyze_enrollment(
    ...     conn, source='iqvia', schema='iqvia_pharmetrics_2024q3',
    ...     df_cohort=df_cohort, start_year=2015, end_year=2024,
    ...     months_pre=6, months_post=12,
    ... )
    >>> print(result.summary())
    >>> df_final = result.df_continuous  # Patients with continuous enrollment
    """
    attrition = {}

    # Step 1: Track initial cohort
    initial_count = df_cohort[patient_id_col].nunique()
    attrition["Initial cohort"] = initial_count

    # Step 2: Fetch enrollment data
    patient_ids = df_cohort[patient_id_col].unique().tolist()
    df_enrollment = get_enrollment(
        conn,
        source=source,
        schema=schema,
        patient_ids=patient_ids,
        start_year=start_year,
        end_year=end_year,
    )

    enrolled_count = df_enrollment[patient_id_col].nunique()
    attrition["With enrollment data"] = enrolled_count

    # Memory cleanup after fetching enrollment
    gc.collect()

    # Step 3: Calculate gaps
    df_gaps = calculate_enrollment_gaps(
        df_enrollment,
        df_cohort,
        patient_id_col=patient_id_col,
        index_date_col=index_date_col,
        months_pre=months_pre,
        months_post=months_post,
    )

    # Step 4: Filter for continuous enrollment
    df_continuous = filter_continuous_enrollment(
        df_gaps,
        max_gap=max_gap_continuous,
        patient_id_col=patient_id_col,
    )

    continuous_count = df_continuous[patient_id_col].nunique()
    attrition[
        f"Continuous enrollment ({months_pre}m pre, {months_post}m post, gap <= {max_gap_continuous})"
    ] = continuous_count

    # Memory cleanup after filtering
    gc.collect()

    # Step 5: Calculate censoring dates
    if study_end is None:
        study_end = f"{end_year}-12-31"

    df_censor = get_censor_dates(
        df_gaps,
        study_end=study_end,
        max_gap_for_censor=max_gap_censor,
        patient_id_col=patient_id_col,
        index_date_col=index_date_col,
    )

    return EnrollmentResult(
        df_enrollment=df_enrollment,
        df_gaps=df_gaps,
        df_continuous=df_continuous,
        df_censor=df_censor,
        attrition=attrition,
    )