Cohort

The cohort module is the primary entry point for building patient cohorts. It provides a declarative way to specify complex inclusion/exclusion criteria and returns structured results with attrition tracking.

When to Use

Use the cohort module when you need to:

  • Identify patients matching diagnosis, procedure, or medication criteria
  • Apply demographic filters (age, sex)
  • Enforce continuous enrollment requirements
  • Track attrition at each selection step
  • Get a single cohort DataFrame ready for downstream analysis

Quick Example

from alx_heor.database import RedshiftConnection
from alx_heor.cohort import (
    get_cohort, CohortCriteria, DiagnosisCriteria, EnrollmentCriteria
)

conn = RedshiftConnection().connect()

result = get_cohort(
    conn,
    source="iqvia",
    schema="iqvia_pharmetrics_2024q3",
    criteria=CohortCriteria(
        primary_diagnosis=DiagnosisCriteria(
            codes=["G700", "G7000", "G7001"],
            min_count=2,
            days_apart=30,
        ),
        enrollment=EnrollmentCriteria(months_before=6, months_after=12),
        min_age=18,
    ),
    start_year=2015,
    end_year=2024,
)

print(result.summary())
df_cohort = result.df_cohort

Common Patterns

Diagnosis with Time Windows

# Baseline malignancy (to exclude)
DiagnosisCriteria(
    codes=["C00", "C01", "C02"],  # Cancer codes
    window_start=-365,            # 1 year before index
    window_end=0,                 # Up to index
    label="Malignancy in baseline",
)

Treatment-Naive Patients

# Exclude patients with prior biologic use
CohortCriteria(
    primary_diagnosis=...,
    excluded_medications=[
        MedicationCriteria(
            generic_names=["rituximab", "eculizumab"],
            window_end=-1,  # Before index date
            label="Prior biologic",
        ),
    ],
)

Multiple Required Criteria

CohortCriteria(
    primary_diagnosis=DiagnosisCriteria(codes=["G700"]),
    required_diagnoses=[
        DiagnosisCriteria(codes=["G7001"], label="MG crisis"),  # G70.01: MG with acute exacerbation
    ],
    required_medications=[
        MedicationCriteria(generic_names=["pyridostigmine"]),
    ],
)

cohort

Cohort identification with comprehensive inclusion/exclusion criteria.

This module provides a unified, high-level interface for identifying patient cohorts in retrospective healthcare database studies. Cohort identification is the foundation of any Real-World Evidence (RWE) study - it defines the patient population being studied based on clinical criteria.

Key Concepts:

Inclusion criteria: Conditions that MUST be met to enter the cohort (e.g., diagnosis of gMG, age ≥18, continuous enrollment).

Exclusion criteria: Conditions that REMOVE patients from the cohort (e.g., malignancy in baseline, pregnancy, prior use of study drug).

Index date: The anchor point for each patient's study timeline. Usually the first (or second) qualifying diagnosis date.

Baseline period: Time before index date (e.g., 6 months) used to assess patient characteristics and exclusion criteria.

Follow-up period: Time after index date for outcome assessment.

Attrition table: Tracking how many patients are lost at each selection step (essential for study transparency and reproducibility).
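
The index date, baseline period, and follow-up period described above can be sketched as plain date arithmetic. This is a minimal illustration and not part of alx_heor: the helper name `study_windows` and its parameters are hypothetical. It assumes the baseline ends the day before index and follow-up starts on the index date, which matches the `window_end=-1` / `window_start=0` convention used in the examples on this page.

```python
from __future__ import annotations

from datetime import date, timedelta


def study_windows(
    index_date: date, baseline_days: int, followup_days: int
) -> dict[str, tuple[date, date]]:
    """Return inclusive (start, end) dates for baseline and follow-up.

    Hypothetical helper: baseline covers day -baseline_days through day -1,
    follow-up covers day 0 (index) through day +followup_days.
    """
    baseline = (
        index_date - timedelta(days=baseline_days),
        index_date - timedelta(days=1),
    )
    follow_up = (index_date, index_date + timedelta(days=followup_days))
    return {"baseline": baseline, "follow_up": follow_up}


windows = study_windows(date(2020, 3, 15), baseline_days=180, followup_days=365)
print(windows["baseline"])
print(windows["follow_up"])
```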

Supported Criteria Types:

  • Diagnosis-based: ICD-9/ICD-10 codes with count and time requirements
  • Procedure-based: CPT/HCPCS codes for surgical/medical procedures
  • Medication-based: NDC codes, J-codes, or generic drug names
  • Demographic: Age at index, sex (M/F)
  • Provider specialty: Exclude diagnoses from certain specialties
  • Enrollment: Continuous enrollment requirements pre/post index

Why Use This Module?

The get_cohort() function automates the entire cohort identification workflow that would otherwise require multiple manual steps:

  1. Query claims → 2. Apply diagnosis criteria → 3. Calculate index dates →
  4. Add demographics → 5. Apply exclusions → 6. Check enrollment → 7. Track attrition

Each step is tracked in an attrition table, providing full transparency.
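
To make the attrition idea concrete, the sketch below turns an ordered mapping of step label to patient count into report lines showing patients dropped and percent retained relative to the previous step. The function `format_attrition` is hypothetical and is not the library's `summary()` implementation. The counts are borrowed from the docstring example on this page, and the percent-of-previous-step interpretation is an assumption that happens to reproduce those figures.

```python
from __future__ import annotations


def format_attrition(attrition: dict[str, int]) -> list[str]:
    """Format step counts with drop and percent retained vs. the previous step."""
    lines: list[str] = []
    previous: int | None = None
    for step, count in attrition.items():
        if previous is None:
            # First step has nothing to compare against.
            lines.append(f"{step}: {count:,}")
        else:
            dropped = previous - count
            retained = 100 * count / previous if previous else 0.0
            lines.append(f"{step}: {count:,} (-{dropped:,}, {retained:.1f}%)")
        previous = count
    return lines


attrition = {
    "≥1 diagnosis claim": 89_123,
    "gMG ≥2 Dx, 30 days apart": 45_678,
    "Age ≥18": 42_103,
}
lines = format_attrition(attrition)
for line in lines:
    print(line)
```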

Example

Build a gMG cohort with standard RWE criteria:

>>> from alx_heor import RedshiftConnection
>>> from alx_heor.cohort import get_cohort, CohortCriteria, DiagnosisCriteria
>>>
>>> conn = RedshiftConnection().connect()
>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(
...         codes=["G700", "G7000", "G7001"],  # gMG ICD-10 codes
...         min_count=2,  # Require 2+ diagnoses
...         days_apart=30,  # At least 30 days apart
...         label="gMG ≥2 Dx, 30 days apart",
...     ),
...     min_age=18,
...     exclude_specialties=["OPHTHAL", "OPTOMTRY"],  # Exclude ocular-only MG
... )
>>> result = get_cohort(
...     conn,
...     source="iqvia",
...     schema="iqvia_pharmetrics_2024q3",
...     criteria=criteria,
...     start_year=2015,
...     end_year=2024,
... )
>>> print(result.summary())
Attrition Table
============================================================
≥1 diagnosis claim:         89,123
gMG ≥2 Dx, 30 days apart:   45,678 (-43,445, 51.3%)
Age ≥18:                    42,103 (-3,575, 92.2%)
Valid sex (M/F):            41,892 (-211, 99.5%)
Non-excluded specialty:     38,456 (-3,436, 91.8%)

See Also

claims.get_claims : Lower-level function for querying claims
claims.get_index_dates : Lower-level function for index dates
enrollment.analyze_enrollment : Detailed enrollment analysis
medications.lookup_medications : Medication lookup utilities

Notes
  • For studies spanning Oct 2015, include BOTH ICD-9 and ICD-10 codes
  • The "2+ Dx 30 days apart" criterion is standard to reduce false positives
  • Provider specialty exclusion addresses ocular MG misclassification
  • Always verify attrition percentages against protocol expectations

DiagnosisCriteria dataclass

Diagnosis-based inclusion or exclusion criteria.

This dataclass defines criteria for selecting patients based on ICD diagnosis codes. It supports sophisticated requirements common in RWE studies, such as requiring multiple diagnoses over time (to reduce false positives from rule-out testing) and time-windowed criteria (e.g., checking for malignancy in baseline period only).

The "2+ diagnoses 30 days apart" pattern is the industry standard for reducing false positives. A single diagnosis may represent rule-out testing, while repeated diagnoses over time indicate a confirmed condition.
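
The rule can be sketched for a single patient as follows. This is an illustration under stated assumptions, not the library's implementation: the function name `meets_repeat_dx_rule` is hypothetical, diagnoses are counted by distinct service date, and `days_apart` is measured between the earliest and latest date, following the parameter description on this page.

```python
from __future__ import annotations

from datetime import date


def meets_repeat_dx_rule(
    dx_dates: list[date], min_count: int = 2, days_apart: int = 30
) -> bool:
    """Check the 'min_count diagnoses, days_apart days apart' pattern."""
    unique_dates = sorted(set(dx_dates))  # assumption: count distinct dates
    if len(unique_dates) < min_count:
        return False
    if min_count <= 1:
        return True  # spacing only applies when more than one dx is required
    span = (unique_dates[-1] - unique_dates[0]).days
    return span >= days_apart


# Two diagnoses 15 days apart fail; adding one 36 days after the first passes.
print(meets_repeat_dx_rule([date(2020, 1, 5), date(2020, 1, 20)]))
print(meets_repeat_dx_rule([date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 10)]))
```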

Use Cases:

Primary inclusion: Define the condition of interest
    DiagnosisCriteria(codes=["G700"], min_count=2, days_apart=30)

Baseline exclusion: Exclude patients with certain conditions before index
    DiagnosisCriteria(codes=["C00-C96"], window_start=-365, window_end=-1)

Follow-up requirement: Require certain events after index
    DiagnosisCriteria(codes=["F32"], window_start=0, window_end=365)

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| codes | list[str] | ICD-9 or ICD-10 diagnosis codes to match. Include BOTH versions for studies spanning Oct 2015 (US ICD-10 transition). Examples: ['G700', 'G7000', 'G7001'] for gMG, ['G35'] for MS. | required |
| min_count | int | Minimum number of diagnosis occurrences required. 1: any single diagnosis (more sensitive, more false positives); 2: standard RWE criterion (fewer false positives); 3+: very restrictive (use for high-frequency conditions). | 1 |
| days_apart | int | Minimum days between first and last diagnosis when min_count > 1. Common values: 0 (any 2+ dx), 30 (standard), 60 (restrictive). | 0 |
| window_start | int | Days relative to index date for window start. Negative = before index. Example: -365 means "up to 1 year before index". None means no lower bound (all-time). | None |
| window_end | int | Days relative to index date for window end. Negative = before index. Example: -1 means "up to 1 day before index" (baseline only). None means no upper bound (all-time). | None |
| diagnosis_position | str | Which diagnosis positions to check. "any": all diagnosis fields (diag1-12, diag_admit); "primary": only primary diagnosis (diag1); "admit": only admitting diagnosis (diag_admit). | "any" |
| require_inpatient | bool | If True, only count diagnoses from inpatient encounters. Useful for more specific criteria. | False |
| require_outpatient | bool | If True, only count diagnoses from outpatient encounters. | False |
| label | str | Human-readable label for the attrition table. Example: "gMG ≥2 Dx, 30 days apart" or "Malignancy in baseline". | "" |

See Also

CohortCriteria : Container that holds DiagnosisCriteria objects
ProcedureCriteria : Similar criteria for procedures
MedicationCriteria : Similar criteria for medications

Notes
  • Codes are matched EXACTLY - 'G70' will NOT match 'G700'
  • Use window_start/window_end for baseline/follow-up criteria
  • Always provide a descriptive label for clear attrition reporting
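
The exact-match and window notes above can be illustrated with two small predicates. Both function names are hypothetical and the code is a sketch, not the library's matching logic. It assumes window bounds are inclusive day offsets from the index date, which is consistent with `window_end=-1` meaning "up to the day before index".

```python
from __future__ import annotations


def matches_exact(claim_code: str, criteria_codes: list[str]) -> bool:
    """Exact string match: a 3-character parent code does not match its children."""
    return claim_code in set(criteria_codes)


def in_window(
    days_from_index: int, window_start: int | None, window_end: int | None
) -> bool:
    """True if the event offset falls inside the (assumed inclusive) window.

    None on either side means that side is unbounded.
    """
    if window_start is not None and days_from_index < window_start:
        return False
    if window_end is not None and days_from_index > window_end:
        return False
    return True


print(matches_exact("G700", ["G70"]))        # parent code listed, child on claim
print(matches_exact("G700", ["G700", "G7000", "G7001"]))
print(in_window(-30, window_start=-365, window_end=-1))  # baseline event
print(in_window(0, window_start=-365, window_end=-1))    # index-day event
```

Because matching is exact, every billable child code has to be listed explicitly in `codes`.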

Examples:

Primary inclusion - gMG with 2+ diagnoses 30 days apart:

>>> primary = DiagnosisCriteria(
...     codes=["G700", "G7000", "G7001"],
...     min_count=2,
...     days_apart=30,
...     label="gMG ≥2 Dx, 30 days apart",
... )

Baseline exclusion - malignancy in year before index:

>>> exclude_cancer = DiagnosisCriteria(
...     codes=["C00", "C01", "C02"],  # Add all cancer codes
...     window_start=-365,            # 1 year before
...     window_end=-1,                # Up to day before index
...     label="Malignancy in baseline",
... )

Baseline exclusion - pregnancy:

>>> exclude_pregnancy = DiagnosisCriteria(
...     codes=["O00", "O26", "Z33", "Z34"],
...     window_start=-270,  # ~9 months
...     window_end=0,
...     label="Pregnancy",
... )

Post-index requirement - depression diagnosis within 1 year:

>>> require_depression = DiagnosisCriteria(
...     codes=["F32", "F33"],
...     window_start=0,
...     window_end=365,
...     label="Depression post-index",
... )
Source code in alx_heor\cohort\__init__.py
@dataclass
class DiagnosisCriteria:
    """Diagnosis-based inclusion or exclusion criteria.

    This dataclass defines criteria for selecting patients based on ICD
    diagnosis codes. It supports sophisticated requirements common in RWE
    studies, such as requiring multiple diagnoses over time (to reduce
    false positives from rule-out testing) and time-windowed criteria
    (e.g., checking for malignancy in baseline period only).

    The "2+ diagnoses 30 days apart" pattern is the industry standard for
    reducing false positives. A single diagnosis may represent rule-out
    testing, while repeated diagnoses over time indicate a confirmed condition.

    **Use Cases:**

    **Primary inclusion**: Define the condition of interest
        DiagnosisCriteria(codes=["G700"], min_count=2, days_apart=30)

    **Baseline exclusion**: Exclude patients with certain conditions before index
        DiagnosisCriteria(codes=["C00-C96"], window_start=-365, window_end=-1)

    **Follow-up requirement**: Require certain events after index
        DiagnosisCriteria(codes=["F32"], window_start=0, window_end=365)

    Parameters
    ----------
    codes : list[str]
        ICD-9 or ICD-10 diagnosis codes to match. Include BOTH versions for
        studies spanning Oct 2015 (US ICD-10 transition).
        Examples: ['G700', 'G7000', 'G7001'] for gMG, ['G35'] for MS.
    min_count : int, default=1
        Minimum number of diagnosis occurrences required:
        - 1: Any single diagnosis (more sensitive, more false positives)
        - 2: Standard RWE criterion (fewer false positives)
        - 3+: Very restrictive (use for high-frequency conditions)
    days_apart : int, default=0
        Minimum days between first and last diagnosis when min_count > 1.
        Common values: 0 (any 2+ dx), 30 (standard), 60 (restrictive).
    window_start : int, optional
        Days relative to index date for window start. Negative = before index.
        Example: -365 means "up to 1 year before index".
        None means no lower bound (all-time).
    window_end : int, optional
        Days relative to index date for window end. Negative = before index.
        Example: -1 means "up to 1 day before index" (baseline only).
        None means no upper bound (all-time).
    diagnosis_position : str, default="any"
        Which diagnosis positions to check:
        - "any": All diagnosis fields (diag1-12, diag_admit)
        - "primary": Only primary diagnosis (diag1)
        - "admit": Only admitting diagnosis (diag_admit)
    require_inpatient : bool, default=False
        If True, only count diagnoses from inpatient encounters.
        Useful for more specific criteria.
    require_outpatient : bool, default=False
        If True, only count diagnoses from outpatient encounters.
    label : str, default=""
        Human-readable label for the attrition table.
        Example: "gMG ≥2 Dx, 30 days apart" or "Malignancy in baseline".

    See Also
    --------
    CohortCriteria : Container that holds DiagnosisCriteria objects
    ProcedureCriteria : Similar criteria for procedures
    MedicationCriteria : Similar criteria for medications

    Notes
    -----
    - Codes are matched EXACTLY - 'G70' will NOT match 'G700'
    - Use window_start/window_end for baseline/follow-up criteria
    - Always provide a descriptive label for clear attrition reporting

    Examples
    --------
    Primary inclusion - gMG with 2+ diagnoses 30 days apart:

    >>> primary = DiagnosisCriteria(
    ...     codes=["G700", "G7000", "G7001"],
    ...     min_count=2,
    ...     days_apart=30,
    ...     label="gMG ≥2 Dx, 30 days apart",
    ... )

    Baseline exclusion - malignancy in year before index:

    >>> exclude_cancer = DiagnosisCriteria(
    ...     codes=["C00", "C01", "C02"],  # Add all cancer codes
    ...     window_start=-365,            # 1 year before
    ...     window_end=-1,                # Up to day before index
    ...     label="Malignancy in baseline",
    ... )

    Baseline exclusion - pregnancy:

    >>> exclude_pregnancy = DiagnosisCriteria(
    ...     codes=["O00", "O26", "Z33", "Z34"],
    ...     window_start=-270,  # ~9 months
    ...     window_end=0,
    ...     label="Pregnancy",
    ... )

    Post-index requirement - depression diagnosis within 1 year:

    >>> require_depression = DiagnosisCriteria(
    ...     codes=["F32", "F33"],
    ...     window_start=0,
    ...     window_end=365,
    ...     label="Depression post-index",
    ... )
    """

    codes: list[str]
    min_count: int = 1
    days_apart: int = 0
    window_start: int | None = None
    window_end: int | None = None
    diagnosis_position: Literal["any", "primary", "admit"] = "any"
    require_inpatient: bool = False
    require_outpatient: bool = False
    label: str = ""

ProcedureCriteria dataclass

Procedure-based inclusion or exclusion criteria.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| codes | list[str] | CPT/HCPCS procedure codes to match. | required |
| min_count | int | Minimum number of procedure occurrences required. | 1 |
| window_start | int | Days relative to index date for window start. | None |
| window_end | int | Days relative to index date for window end. | None |
| label | str | Human-readable label for attrition reporting. | "" |

Source code in alx_heor\cohort\__init__.py
@dataclass
class ProcedureCriteria:
    """Procedure-based inclusion or exclusion criteria.

    Parameters
    ----------
    codes : list[str]
        CPT/HCPCS procedure codes to match.
    min_count : int, default=1
        Minimum number of procedure occurrences required.
    window_start : int, optional
        Days relative to index date for window start.
    window_end : int, optional
        Days relative to index date for window end.
    label : str, default=""
        Human-readable label for attrition reporting.
    """

    codes: list[str]
    min_count: int = 1
    window_start: int | None = None
    window_end: int | None = None
    label: str = ""

MedicationCriteria dataclass

Medication-based inclusion or exclusion criteria.

Can specify medications by generic name, NDC code, or procedure code (J-codes). At least one of generic_names, ndc_codes, or procedure_codes must be provided.
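
The "at least one" rule is enforced in the class's `__post_init__`, shown in the source listing for this class. The sketch below reproduces that check on a stand-in dataclass so it runs without alx_heor installed. `MedicationCriteriaSketch` and `is_valid` are hypothetical names, and only the three identifying fields are mirrored.

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass
class MedicationCriteriaSketch:
    """Stand-in mirroring MedicationCriteria's validation (not the real class)."""

    generic_names: list[str] | None = None
    ndc_codes: list[str] | None = None
    procedure_codes: list[str] | None = None

    def __post_init__(self) -> None:
        # any() treats None and empty lists alike: both are falsy.
        if not any([self.generic_names, self.ndc_codes, self.procedure_codes]):
            raise ValueError(
                "At least one of generic_names, ndc_codes, or procedure_codes "
                "must be provided"
            )


def is_valid(**kwargs) -> bool:
    """Return True if the stand-in accepts the given arguments."""
    try:
        MedicationCriteriaSketch(**kwargs)
        return True
    except ValueError:
        return False


print(is_valid(generic_names=["eculizumab"]))
print(is_valid())
print(is_valid(ndc_codes=[]))
```

Note that an empty list is rejected in the same way as an omitted argument, since the check relies on truthiness.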

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| generic_names | list[str] | Generic drug names to match (case-insensitive). E.g., ['eculizumab', 'ravulizumab']. | None |
| ndc_codes | list[str] | NDC codes to match. | None |
| procedure_codes | list[str] | HCPCS/J-codes to match (e.g., ['J1300', 'J1303']). | None |
| min_count | int | Minimum number of medication claims required. | 1 |
| window_start | int | Days relative to index date for window start. | None |
| window_end | int | Days relative to index date for window end. | None |
| label | str | Human-readable label for attrition reporting. | "" |

Examples:

Any C5 inhibitor post-index:

>>> MedicationCriteria(
...     generic_names=["eculizumab", "ravulizumab"],
...     window_start=0,  # On or after index
...     label="C5 inhibitor post-index",
... )

Treatment-naive (exclude prior biologics):

>>> MedicationCriteria(
...     generic_names=["eculizumab", "ravulizumab", "rituximab"],
...     window_start=-365,
...     window_end=-1,  # Up to day before index
...     label="Prior biologic use",
... )
Source code in alx_heor\cohort\__init__.py
@dataclass
class MedicationCriteria:
    """Medication-based inclusion or exclusion criteria.

    Can specify medications by generic name, NDC code, or procedure code (J-codes).
    At least one of generic_names, ndc_codes, or procedure_codes must be provided.

    Parameters
    ----------
    generic_names : list[str], optional
        Generic drug names to match (case-insensitive).
        E.g., ['eculizumab', 'ravulizumab'].
    ndc_codes : list[str], optional
        NDC codes to match.
    procedure_codes : list[str], optional
        HCPCS/J-codes to match (e.g., ['J1300', 'J1303']).
    min_count : int, default=1
        Minimum number of medication claims required.
    window_start : int, optional
        Days relative to index date for window start.
    window_end : int, optional
        Days relative to index date for window end.
    label : str, default=""
        Human-readable label for attrition reporting.

    Examples
    --------
    Any C5 inhibitor post-index:
    >>> MedicationCriteria(
    ...     generic_names=["eculizumab", "ravulizumab"],
    ...     window_start=0,  # On or after index
    ...     label="C5 inhibitor post-index",
    ... )

    Treatment-naive (exclude prior biologics):
    >>> MedicationCriteria(
    ...     generic_names=["eculizumab", "ravulizumab", "rituximab"],
    ...     window_start=-365,
    ...     window_end=-1,  # Up to day before index
    ...     label="Prior biologic use",
    ... )
    """

    generic_names: list[str] | None = None
    ndc_codes: list[str] | None = None
    procedure_codes: list[str] | None = None
    min_count: int = 1
    window_start: int | None = None
    window_end: int | None = None
    label: str = ""

    def __post_init__(self):
        if not any([self.generic_names, self.ndc_codes, self.procedure_codes]):
            raise ValueError(
                "At least one of generic_names, ndc_codes, or procedure_codes "
                "must be provided"
            )

EnrollmentCriteria dataclass

Continuous enrollment requirements.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| months_before | int | Required months of continuous enrollment before index date. | 0 |
| months_after | int | Required months of continuous enrollment after index date. | 0 |
| max_gap_months | int | Maximum allowed gap in enrollment (in months). | 1 |
| label | str | Human-readable label for attrition reporting. | "" |

Examples:

6 months baseline, 12 months follow-up:

>>> EnrollmentCriteria(months_before=6, months_after=12)
Source code in alx_heor\cohort\__init__.py
@dataclass
class EnrollmentCriteria:
    """Continuous enrollment requirements.

    Parameters
    ----------
    months_before : int, default=0
        Required months of continuous enrollment before index date.
    months_after : int, default=0
        Required months of continuous enrollment after index date.
    max_gap_months : int, default=1
        Maximum allowed gap in enrollment (in months).
    label : str, default=""
        Human-readable label for attrition reporting.

    Examples
    --------
    6 months baseline, 12 months follow-up:
    >>> EnrollmentCriteria(months_before=6, months_after=12)
    """

    months_before: int = 0
    months_after: int = 0
    max_gap_months: int = 1
    label: str = ""
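
How `max_gap_months` is applied is not spelled out on this page, so the following is only one plausible reading, sketched for a single patient. It assumes enrollment is represented as a set of month offsets relative to the index month (0 = index month), that the required span runs from `-months_before` through `+months_after`, and that a run of up to `max_gap_months` consecutive missing months is tolerated. The function name `continuously_enrolled` is hypothetical.

```python
from __future__ import annotations


def continuously_enrolled(
    enrolled_months: set[int],
    months_before: int,
    months_after: int,
    max_gap_months: int = 1,
) -> bool:
    """Check enrollment over the required span, tolerating short gaps."""
    gap = 0  # length of the current run of missing months
    for month in range(-months_before, months_after + 1):
        if month in enrolled_months:
            gap = 0
        else:
            gap += 1
            if gap > max_gap_months:
                return False
    return True


# 6 months baseline + 12 months follow-up with a single missing month.
enrolled = set(range(-6, 13)) - {3}
print(continuously_enrolled(enrolled, months_before=6, months_after=12))
```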

CohortCriteria dataclass

Complete specification of cohort inclusion and exclusion criteria.

This dataclass is the "study protocol in code" - it defines all the criteria that determine which patients enter your cohort. By specifying criteria declaratively, you get reproducible cohort definitions that can be version controlled, shared, and audited.

The criteria are applied in a specific order:

  1. Primary diagnosis (identifies initial population)
  2. Additional required diagnoses
  3. Required procedures
  4. Required medications
  5. Excluded diagnoses (removes patients)
  6. Excluded procedures
  7. Excluded medications
  8. Age filter
  9. Sex filter
  10. Provider specialty filter
  11. Continuous enrollment

Clinical Considerations

Why exclude specialties? Some conditions like Myasthenia Gravis (MG) have subtypes (ocular MG vs generalized MG). Diagnoses from ophthalmology/optometry may represent ocular-only MG, which has different treatment patterns.

Why require 2+ diagnoses? A single diagnosis may be rule-out testing. The patient presents with symptoms, gets tested, but doesn't have the condition. Requiring 2+ diagnoses separated by time increases diagnostic confidence.

Why check enrollment? Patients must be observable for the study period. A patient who drops enrollment can't be followed for outcomes.

Parameters:

| Name | Type | Description | Default |
|---|---|---|---|
| primary_diagnosis | DiagnosisCriteria | Primary diagnosis criteria for cohort identification (required). This defines the target condition. | required |
| required_diagnoses | list[DiagnosisCriteria] | Additional diagnosis criteria that must be met. Patients must have ALL of these in addition to the primary diagnosis. | [] |
| required_procedures | list[ProcedureCriteria] | Procedure criteria that must be met (e.g., require thymectomy). | [] |
| required_medications | list[MedicationCriteria] | Medication criteria that must be met (e.g., require C5 inhibitor). | [] |
| excluded_diagnoses | list[DiagnosisCriteria] | Diagnosis criteria for exclusion (e.g., malignancy, pregnancy). Patients meeting ANY of these are removed. | [] |
| excluded_procedures | list[ProcedureCriteria] | Procedure criteria for exclusion. | [] |
| excluded_medications | list[MedicationCriteria] | Medication criteria for exclusion (e.g., prior biologic use). | [] |
| min_age | int | Minimum age at index date. None to skip age filter. 18 is standard for adult-only studies. | 18 |
| max_age | int | Maximum age at index date. None means no upper limit. | None |
| valid_sex_only | bool | If True, exclude patients with unknown/missing sex ('U'). Set False for sensitivity analyses. | True |
| exclude_specialties | list[str] | Provider specialties to exclude from index diagnosis. Common: ['OPHTHAL', 'OPTOMTRY'] for gMG studies. | None |
| require_specialty_confirmation | bool | If True, require at least one diagnosis from a non-excluded specialty. | False |
| enrollment | EnrollmentCriteria | Continuous enrollment requirements (baseline + follow-up months). | None |
| index_date_method | str | How to determine index date. "first_dx": first diagnosis date (most common); "second_dx": second diagnosis date (useful for 2+ Dx criterion); "first_rx": first qualifying medication date (for treatment studies). | "first_dx" |

See Also

get_cohort : Function that applies these criteria to build a cohort
DiagnosisCriteria : Detailed diagnosis criteria specification
EnrollmentCriteria : Continuous enrollment requirements

Notes
  • Criteria are applied sequentially, with attrition tracked at each step
  • Use descriptive labels in each criterion for clear attrition tables
  • Test with smaller date ranges first to verify criteria before full run
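
The three `index_date_method` options can be sketched for a single patient as below. This is an illustration only: `pick_index_date` is a hypothetical helper, and treating "second_dx" as the second distinct diagnosis date is an assumption rather than documented behavior.

```python
from __future__ import annotations

from datetime import date


def pick_index_date(
    dx_dates: list[date], rx_dates: list[date], method: str = "first_dx"
) -> date | None:
    """Return the index date for one patient, or None if it cannot be set."""
    dx = sorted(set(dx_dates))
    rx = sorted(set(rx_dates))
    if method == "first_dx":
        return dx[0] if dx else None
    if method == "second_dx":
        # Assumption: second *distinct* diagnosis date.
        return dx[1] if len(dx) >= 2 else None
    if method == "first_rx":
        return rx[0] if rx else None
    raise ValueError(f"Unknown index_date_method: {method!r}")


dx = [date(2019, 5, 1), date(2019, 7, 10), date(2019, 5, 1)]
rx = [date(2019, 8, 2)]
for method in ("first_dx", "second_dx", "first_rx"):
    print(method, pick_index_date(dx, rx, method))
```

Choosing "second_dx" moves the index date later, which shortens the available follow-up but ensures the patient already satisfies a 2+ diagnosis criterion at index.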

Examples:

Basic gMG cohort (adults, 2+ Dx, exclude ocular specialists):

>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(
...         codes=["G700", "G7000", "G7001"],
...         min_count=2,
...         days_apart=30,
...         label="gMG ≥2 Dx, 30 days apart",
...     ),
...     min_age=18,
...     exclude_specialties=["OPHTHAL", "OPTOMTRY"],
... )

Treatment-naive cohort with enrollment requirements:

>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(
...         codes=["G700", "G7000", "G7001"],
...         min_count=2,
...         days_apart=30,
...         label="gMG ≥2 Dx",
...     ),
...     excluded_medications=[
...         MedicationCriteria(
...             generic_names=["eculizumab", "ravulizumab"],
...             window_start=-365,
...             window_end=-1,
...             label="Prior C5 inhibitor",
...         ),
...     ],
...     enrollment=EnrollmentCriteria(
...         months_before=6,
...         months_after=12,
...         label="6m baseline + 12m follow-up",
...     ),
...     min_age=18,
... )

Complex criteria with multiple exclusions:

>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(codes=["G700"], min_count=2, days_apart=30),
...     excluded_diagnoses=[
...         DiagnosisCriteria(codes=["C00-C96"], window_start=-365, window_end=-1, label="Malignancy"),
...         DiagnosisCriteria(codes=["O00-O99", "Z33"], window_start=-270, window_end=0, label="Pregnancy"),
...         DiagnosisCriteria(codes=["N18.5", "N18.6"], label="ESRD"),
...     ],
...     min_age=18,
...     max_age=89,  # Cap for data quality
... )
Source code in alx_heor\cohort\__init__.py
@dataclass
class CohortCriteria:
    """Complete specification of cohort inclusion and exclusion criteria.

    This dataclass is the "study protocol in code" - it defines all the criteria
    that determine which patients enter your cohort. By specifying criteria
    declaratively, you get reproducible cohort definitions that can be version
    controlled, shared, and audited.

    The criteria are applied in a specific order:
    1. Primary diagnosis (identifies initial population)
    2. Additional required diagnoses
    3. Required procedures
    4. Required medications
    5. Excluded diagnoses (removes patients)
    6. Excluded procedures
    7. Excluded medications
    8. Age filter
    9. Sex filter
    10. Provider specialty filter
    11. Continuous enrollment

    Clinical Considerations
    -----------------------
    **Why exclude specialties?** Some conditions like Myasthenia Gravis (MG) have
    subtypes (ocular MG vs generalized MG). Diagnoses from ophthalmology/optometry
    may represent ocular-only MG, which has different treatment patterns.

    **Why require 2+ diagnoses?** A single diagnosis may be rule-out testing.
    The patient presents with symptoms, gets tested, but doesn't have the condition.
    Requiring 2+ diagnoses separated by time increases diagnostic confidence.

    **Why check enrollment?** Patients must be observable for the study period.
    A patient who drops enrollment can't be followed for outcomes.

    Parameters
    ----------
    primary_diagnosis : DiagnosisCriteria
        Primary diagnosis criteria for cohort identification (required).
        This defines the target condition.
    required_diagnoses : list[DiagnosisCriteria], default=[]
        Additional diagnosis criteria that must be met. Patients must have
        ALL of these in addition to the primary diagnosis.
    required_procedures : list[ProcedureCriteria], default=[]
        Procedure criteria that must be met (e.g., require thymectomy).
    required_medications : list[MedicationCriteria], default=[]
        Medication criteria that must be met (e.g., require C5 inhibitor).
    excluded_diagnoses : list[DiagnosisCriteria], default=[]
        Diagnosis criteria for exclusion (e.g., malignancy, pregnancy).
        Patients meeting ANY of these are removed.
    excluded_procedures : list[ProcedureCriteria], default=[]
        Procedure criteria for exclusion.
    excluded_medications : list[MedicationCriteria], default=[]
        Medication criteria for exclusion (e.g., prior biologic use).
    min_age : int, optional, default=18
        Minimum age at index date. None to skip age filter.
        18 is standard for adult-only studies.
    max_age : int, optional
        Maximum age at index date. None means no upper limit.
    valid_sex_only : bool, default=True
        If True, exclude patients with unknown/missing sex ('U').
        Set False for sensitivity analyses.
    exclude_specialties : list[str], optional
        Provider specialties to exclude from index diagnosis.
        Common: ['OPHTHAL', 'OPTOMTRY'] for gMG studies.
    require_specialty_confirmation : bool, default=False
        If True, require at least one diagnosis from a non-excluded specialty.
    enrollment : EnrollmentCriteria, optional
        Continuous enrollment requirements (baseline + follow-up months).
    index_date_method : str, default="first_dx"
        How to determine index date:
        - "first_dx": First diagnosis date (most common)
        - "second_dx": Second diagnosis date (useful for 2+ Dx criterion)
        - "first_rx": First qualifying medication date (for treatment studies)

    See Also
    --------
    get_cohort : Function that applies these criteria to build a cohort
    DiagnosisCriteria : Detailed diagnosis criteria specification
    EnrollmentCriteria : Continuous enrollment requirements

    Notes
    -----
    - Criteria are applied sequentially, with attrition tracked at each step
    - Use descriptive labels in each criterion for clear attrition tables
    - Test with smaller date ranges first to verify criteria before full run

    Examples
    --------
    Basic gMG cohort (adults, 2+ Dx, exclude ocular specialists):

    >>> criteria = CohortCriteria(
    ...     primary_diagnosis=DiagnosisCriteria(
    ...         codes=["G700", "G7000", "G7001"],
    ...         min_count=2,
    ...         days_apart=30,
    ...         label="gMG ≥2 Dx, 30 days apart",
    ...     ),
    ...     min_age=18,
    ...     exclude_specialties=["OPHTHAL", "OPTOMTRY"],
    ... )

    Treatment-naive cohort with enrollment requirements:

    >>> criteria = CohortCriteria(
    ...     primary_diagnosis=DiagnosisCriteria(
    ...         codes=["G700", "G7000", "G7001"],
    ...         min_count=2,
    ...         days_apart=30,
    ...         label="gMG ≥2 Dx",
    ...     ),
    ...     excluded_medications=[
    ...         MedicationCriteria(
    ...             generic_names=["eculizumab", "ravulizumab"],
    ...             window_start=-365,
    ...             window_end=-1,
    ...             label="Prior C5 inhibitor",
    ...         ),
    ...     ],
    ...     enrollment=EnrollmentCriteria(
    ...         months_before=6,
    ...         months_after=12,
    ...         label="6m baseline + 12m follow-up",
    ...     ),
    ...     min_age=18,
    ... )

    Complex criteria with multiple exclusions:

    >>> criteria = CohortCriteria(
    ...     primary_diagnosis=DiagnosisCriteria(codes=["G700"], min_count=2, days_apart=30),
    ...     excluded_diagnoses=[
    ...         DiagnosisCriteria(codes=["C00-C96"], window_start=-365, window_end=-1, label="Malignancy"),
    ...         DiagnosisCriteria(codes=["O00-O99", "Z33"], window_start=-270, window_end=0, label="Pregnancy"),
    ...         DiagnosisCriteria(codes=["N18.5", "N18.6"], label="ESRD"),
    ...     ],
    ...     min_age=18,
    ...     max_age=89,  # Cap for data quality
    ... )
    """

    # Inclusion criteria
    primary_diagnosis: DiagnosisCriteria
    required_diagnoses: list[DiagnosisCriteria] = field(default_factory=list)
    required_procedures: list[ProcedureCriteria] = field(default_factory=list)
    required_medications: list[MedicationCriteria] = field(default_factory=list)

    # Exclusion criteria
    excluded_diagnoses: list[DiagnosisCriteria] = field(default_factory=list)
    excluded_procedures: list[ProcedureCriteria] = field(default_factory=list)
    excluded_medications: list[MedicationCriteria] = field(default_factory=list)

    # Demographic criteria
    min_age: int | None = 18
    max_age: int | None = None
    valid_sex_only: bool = True

    # Specialty criteria
    exclude_specialties: list[str] | None = None
    require_specialty_confirmation: bool = False

    # Enrollment criteria
    enrollment: EnrollmentCriteria | None = None

    # Index date options
    index_date_method: Literal["first_dx", "second_dx", "first_rx"] = "first_dx"
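
The list-valued fields above are declared with `field(default_factory=list)` rather than a bare `[]` default. The sketch below shows why that matters, using a hypothetical `ExampleCriteria` class (a stand-in, not part of the library): each instance receives its own list, so appending to one criteria object does not leak into another.

```python
from dataclasses import dataclass, field


@dataclass
class ExampleCriteria:
    """Hypothetical stand-in for CohortCriteria, reduced to two fields."""

    # default_factory builds a fresh list for every instance.
    excluded_codes: list = field(default_factory=list)
    min_age: int = 18


a = ExampleCriteria()
b = ExampleCriteria()
a.excluded_codes.append("C00")

print(a.excluded_codes)  # ['C00']
print(b.excluded_codes)  # []
```

A plain `excluded_codes: list = []` would be rejected by `dataclasses` with a `ValueError`, precisely to prevent a shared mutable default.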

CohortResult dataclass

Results from cohort identification with attrition tracking.

Attributes:

  • df_cohort (DataFrame): Final cohort with patient demographics and index dates.
  • df_claims (DataFrame): All diagnosis claims for the cohort (for downstream analysis). Empty when get_cohort is called with include_claims=False.
  • attrition (dict[str, int]): Patient counts at each step of the selection process.
  • criteria (CohortCriteria): The criteria used to generate this cohort.
  • df_enrollment (DataFrame): Enrollment data (empty unless enrollment criteria were applied).
  • df_censor (DataFrame): Censoring dates (empty unless enrollment criteria were applied).
  • df_payer (DataFrame): Payer type classification (empty unless enrollment criteria were applied).

Source code in alx_heor\cohort\__init__.py
@dataclass
class CohortResult:
    """Results from cohort identification with attrition tracking.

    Attributes
    ----------
    df_cohort : pd.DataFrame
        Final cohort with patient demographics and index dates.
    df_claims : pd.DataFrame
        All diagnosis claims for the cohort (for downstream analysis).
    attrition : dict[str, int]
        Patient counts at each step of the selection process.
    criteria : CohortCriteria
        The criteria used to generate this cohort.
    df_enrollment : pd.DataFrame
        Enrollment data (if enrollment criteria were applied).
    df_censor : pd.DataFrame
        Censoring dates (if enrollment criteria were applied).
    df_payer : pd.DataFrame
        Payer type classification (if enrollment criteria were applied).
    """

    df_cohort: pd.DataFrame
    df_claims: pd.DataFrame
    attrition: dict[str, int]
    criteria: CohortCriteria
    df_enrollment: pd.DataFrame = field(default_factory=pd.DataFrame)
    df_censor: pd.DataFrame = field(default_factory=pd.DataFrame)
    df_payer: pd.DataFrame = field(default_factory=pd.DataFrame)

    def summary(self) -> str:
        """Generate attrition table as formatted string.

        Returns
        -------
        str
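            Formatted attrition table showing patient counts, the change
            from the previous step, and the percentage retained relative to
            the previous step. Censoring and payer summaries are appended
            when available.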
        """
        lines = ["", "Attrition Table", "=" * 60]
        prev_count = None

        for step, count in self.attrition.items():
            if prev_count is not None and prev_count > 0:
                diff = count - prev_count
                pct = (count / prev_count * 100)
                lines.append(f"{step}: {count:,} ({diff:+,}, {pct:.1f}%)")
            else:
                lines.append(f"{step}: {count:,}")
            prev_count = count

        lines.append("=" * 60)

        # Add censoring summary if available
        if len(self.df_censor) > 0 and "is_censored_by_gap" in self.df_censor.columns:
            lines.append("")
            lines.append("Censoring Summary")
            lines.append("-" * 40)
            censored_by_gap = self.df_censor["is_censored_by_gap"].sum()
            censored_at_end = (~self.df_censor["is_censored_by_gap"]).sum()
            lines.append(f"  Censored by enrollment gap: {censored_by_gap:,}")
            lines.append(f"  Censored at study end: {censored_at_end:,}")

        # Add payer summary if available
        if len(self.df_payer) > 0 and "payer_type" in self.df_payer.columns:
            lines.append("")
            lines.append("Payer Distribution")
            lines.append("-" * 40)
            for payer, count in self.df_payer["payer_type"].value_counts().items():
                pct = count / len(self.df_payer) * 100
                lines.append(f"  {payer}: {count:,} ({pct:.1f}%)")

        return "\n".join(lines)

    def __repr__(self) -> str:
        return (
            f"CohortResult(n_patients={len(self.df_cohort):,}, "
            f"n_claims={len(self.df_claims):,}, "
            f"steps={len(self.attrition)})"
        )

summary

summary() -> str

Generate attrition table as formatted string.

Returns:

Type Description
str

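Formatted attrition table showing patient counts, the change from the previous step, and the percentage retained relative to the previous step. Censoring and payer summaries are appended when available.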
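
To make the line format concrete, here is a minimal standalone sketch that reuses the same loop logic as `summary()` on a plain dictionary. The function name `format_attrition` and the counts are illustrative only; they are not part of the library or real study results.

```python
def format_attrition(attrition: dict) -> str:
    """Format step counts the same way as the attrition loop in summary()."""
    lines = []
    prev_count = None
    for step, count in attrition.items():
        if prev_count is not None and prev_count > 0:
            diff = count - prev_count          # change from the previous step
            pct = count / prev_count * 100     # percent retained vs. previous step
            lines.append(f"{step}: {count:,} ({diff:+,}, {pct:.1f}%)")
        else:
            # First step (or a previous count of zero) has no baseline.
            lines.append(f"{step}: {count:,}")
        prev_count = count
    return "\n".join(lines)


# Illustrative counts only.
example = {"Any claim": 1000, "Two claims": 600, "Adults": 570}
print(format_attrition(example))
# Any claim: 1,000
# Two claims: 600 (-400, 60.0%)
# Adults: 570 (-30, 95.0%)
```

Note that each percentage is relative to the immediately preceding step, not to the starting population.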

Source code in alx_heor\cohort\__init__.py
def summary(self) -> str:
    """Generate attrition table as formatted string.

    Returns
    -------
    str
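        Formatted attrition table showing patient counts, the change
        from the previous step, and the percentage retained relative to
        the previous step. Censoring and payer summaries are appended
        when available.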
    """
    lines = ["", "Attrition Table", "=" * 60]
    prev_count = None

    for step, count in self.attrition.items():
        if prev_count is not None and prev_count > 0:
            diff = count - prev_count
            pct = (count / prev_count * 100)
            lines.append(f"{step}: {count:,} ({diff:+,}, {pct:.1f}%)")
        else:
            lines.append(f"{step}: {count:,}")
        prev_count = count

    lines.append("=" * 60)

    # Add censoring summary if available
    if len(self.df_censor) > 0 and "is_censored_by_gap" in self.df_censor.columns:
        lines.append("")
        lines.append("Censoring Summary")
        lines.append("-" * 40)
        censored_by_gap = self.df_censor["is_censored_by_gap"].sum()
        censored_at_end = (~self.df_censor["is_censored_by_gap"]).sum()
        lines.append(f"  Censored by enrollment gap: {censored_by_gap:,}")
        lines.append(f"  Censored at study end: {censored_at_end:,}")

    # Add payer summary if available
    if len(self.df_payer) > 0 and "payer_type" in self.df_payer.columns:
        lines.append("")
        lines.append("Payer Distribution")
        lines.append("-" * 40)
        for payer, count in self.df_payer["payer_type"].value_counts().items():
            pct = count / len(self.df_payer) * 100
            lines.append(f"  {payer}: {count:,} ({pct:.1f}%)")

    return "\n".join(lines)

get_cohort

get_cohort(conn: RedshiftConnection, source: str, schema: str, criteria: CohortCriteria, start_year: int, end_year: int, study_start: str | None = None, study_end: str | None = None, include_claims: bool = True) -> CohortResult

Identify a patient cohort with comprehensive inclusion/exclusion criteria.

This is the primary high-level function for cohort identification in RWE studies. It automates the entire workflow of querying claims, applying inclusion/exclusion criteria, calculating index dates, filtering by demographics, checking enrollment, and tracking attrition at each step.

The function applies criteria in a deterministic order, allowing you to specify your study protocol once and reproduce results consistently. The returned CohortResult includes an attrition table showing how many patients were excluded at each step, which is essential for study transparency.
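
The idea of applying filters in a fixed order while recording counts can be sketched without a database. The example below is a simplified illustration, not the library's implementation: `run_steps`, the patient records, and the "Starting population" label are all hypothetical.

```python
from typing import Callable


def run_steps(patients: list, steps: list) -> tuple:
    """Apply each (label, predicate) filter in order and record remaining counts."""
    attrition = {"Starting population": len(patients)}
    for label, keep in steps:
        patients = [p for p in patients if keep(p)]
        attrition[label] = len(patients)  # dicts preserve insertion order
    return patients, attrition


# Hypothetical patient records for illustration.
patients = [
    {"id": 1, "age": 45, "sex": "F"},
    {"id": 2, "age": 16, "sex": "M"},
    {"id": 3, "age": 70, "sex": "U"},
]
steps: list = [
    ("Age >= 18", lambda p: p["age"] >= 18),
    ("Valid sex (M/F)", lambda p: p["sex"] in {"M", "F"}),
]

cohort, attrition = run_steps(patients, steps)
print(attrition)  # {'Starting population': 3, 'Age >= 18': 2, 'Valid sex (M/F)': 1}
```

Because counts are keyed by label, two steps that share a label collapse into a single entry. The same holds for the `attrition` dict built in `get_cohort`, so giving each criterion a distinct `label` keeps every step visible.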

Workflow (Automated by this Function)
  1. Query claims matching primary diagnosis codes
  2. Apply min_count and days_apart criteria
  3. Calculate index dates
  4. Add demographics (age, sex)
  5. Apply required diagnosis/procedure/medication criteria
  6. Apply excluded diagnosis/procedure/medication criteria
  7. Filter by age and sex
  8. Apply provider specialty filter
  9. Check continuous enrollment requirements
  10. Generate attrition table
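
Step 2 above is implemented by a private helper whose body is not shown on this page. As a rough sketch of one plausible reading of `min_count` and `days_apart`, the function below assumes a patient qualifies when they have at least `min_count` distinct claim dates and the earliest and latest of those dates are at least `days_apart` days apart. The function name and this exact rule are assumptions, not the library's confirmed behavior.

```python
from datetime import date


def meets_dx_rule(claim_dates: list, min_count: int = 2, days_apart: int = 30) -> bool:
    """Assumed rule: >= min_count distinct dates, first and last >= days_apart apart."""
    unique_dates = sorted(set(claim_dates))
    if len(unique_dates) < min_count:
        return False
    if min_count < 2:
        # A single required claim has no spacing requirement.
        return True
    return (unique_dates[-1] - unique_dates[0]).days >= days_apart


print(meets_dx_rule([date(2020, 1, 5), date(2020, 1, 20)]))  # False: 15 days apart
print(meets_dx_rule([date(2020, 1, 5), date(2020, 3, 1)]))   # True: 56 days apart
print(meets_dx_rule([date(2020, 1, 5)]))                     # False: only one claim
```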

Parameters:

  • conn (RedshiftConnection, required): Active database connection. Must be connected before calling.
  • source (str, required): Data source name: 'iqvia', 'optum', 'komodo'. Determines column mappings and table patterns via config.
  • schema (str, required): Database schema (e.g., 'iqvia_pharmetrics_2024q3'). Use conn.get_schemas('iqvia') to find available schemas.
  • criteria (CohortCriteria, required): Complete specification of inclusion and exclusion criteria. See CohortCriteria documentation for all available options.
  • start_year (int, required): First year of claims data to query (e.g., 2015).
  • end_year (int, required): Last year of claims data to query (e.g., 2024).
  • study_start (str, default None): Study period start date (e.g., '2015-01-01'). If provided, excludes claims before this date. Useful for aligning with protocol dates.
  • study_end (str, default None): Study period end date (e.g., '2024-03-31'). Used for censoring and excluding claims after this date.
  • include_claims (bool, default True): If True, include full claims data in result for downstream analysis. Set False to save memory when only cohort demographics are needed.

Returns:

CohortResult: Object containing:

  • df_cohort: Final filtered cohort (one row per patient)
  • df_claims: All diagnosis claims for cohort patients
  • attrition: Dict tracking patient counts at each step
  • criteria: The CohortCriteria used (for reproducibility)
  • df_enrollment: Enrollment data (if enrollment criteria applied)
  • df_censor: Censoring dates (if enrollment criteria applied)
  • df_payer: Payer classification (if enrollment criteria applied)

See Also

  • CohortCriteria : Specification of all inclusion/exclusion criteria
  • DiagnosisCriteria : Diagnosis-based criteria
  • EnrollmentCriteria : Continuous enrollment requirements
  • claims.get_claims : Lower-level function if you need custom queries
  • enrollment.analyze_enrollment : Detailed enrollment analysis

Notes
  • Execution time varies by cohort size (rare diseases: minutes; common conditions: 30+ minutes)
  • Memory usage can be high for large cohorts (use include_claims=False if needed)
  • Always check attrition percentages against protocol expectations
  • The function uses gc.collect() internally to manage memory
  • For debugging, try with start_year=end_year first to reduce data volume
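
One way to act on the advice to check attrition percentages is to scan the attrition dict for steps that retain less than some threshold of the previous step. The helper below is a hypothetical sketch (`flag_steep_drops` is not part of the library); in practice the input would be `result.attrition`, and the threshold would come from your protocol.

```python
def flag_steep_drops(attrition: dict, min_retained_pct: float = 50.0) -> list:
    """Return labels of steps retaining less than min_retained_pct of the prior step."""
    flagged = []
    prev_count = None
    for step, count in attrition.items():
        # Skip the first step and any step whose predecessor is zero.
        if prev_count and count / prev_count * 100 < min_retained_pct:
            flagged.append(step)
        prev_count = count
    return flagged


# Counts copied from the example attrition output shown in this section.
attrition = {
    "≥1 diagnosis claim": 89123,
    "gMG ≥2 Dx, 30 days apart": 45678,
    "Age ≥18": 42103,
}
print(flag_steep_drops(attrition, min_retained_pct=60.0))
# ['gMG ≥2 Dx, 30 days apart']
```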

Examples:

Basic gMG cohort with standard RWE criteria:

>>> from alx_heor.cohort import get_cohort, CohortCriteria, DiagnosisCriteria
>>>
>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(
...         codes=["G700", "G7000", "G7001"],
...         min_count=2,
...         days_apart=30,
...         label="gMG ≥2 Dx, 30 days apart",
...     ),
...     min_age=18,
...     exclude_specialties=["OPHTHAL", "OPTOMTRY"],
... )
>>>
>>> result = get_cohort(
...     conn,
...     source="iqvia",
...     schema="iqvia_pharmetrics_2024q3",
...     criteria=criteria,
...     start_year=2015,
...     end_year=2024,
... )
>>>
>>> print(result.summary())
Attrition Table
============================================================
≥1 diagnosis claim: 89,123
gMG ≥2 Dx, 30 days apart: 45,678 (-43,445, 51.3%)
Age ≥18: 42,103 (-3,575, 92.2%)
...
>>>
>>> # Access the final cohort
>>> df_cohort = result.df_cohort
>>> print(f"Final cohort: {len(df_cohort):,} patients")

Cohort with enrollment requirements and exclusions:

>>> from alx_heor.cohort import EnrollmentCriteria, MedicationCriteria
>>>
>>> criteria = CohortCriteria(
...     primary_diagnosis=DiagnosisCriteria(
...         codes=["G700", "G7000", "G7001"],
...         min_count=2,
...         days_apart=30,
...     ),
...     excluded_diagnoses=[
...         DiagnosisCriteria(
...             codes=["C00", "C01", "C02"],  # Malignancy codes
...             window_start=-365,
...             window_end=0,
...             label="Malignancy in baseline",
...         ),
...     ],
...     excluded_medications=[
...         MedicationCriteria(
...             generic_names=["eculizumab", "ravulizumab"],
...             window_start=-365,
...             window_end=-1,
...             label="Prior C5 inhibitor",
...         ),
...     ],
...     enrollment=EnrollmentCriteria(
...         months_before=6,
...         months_after=12,
...     ),
...     min_age=18,
... )
>>>
>>> result = get_cohort(conn, source="iqvia", ...)
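
The `window_start` and `window_end` values in the criteria above are day offsets relative to the index date. The sketch below illustrates that arithmetic under two assumptions that this page does not confirm: both bounds are inclusive, and `None` means unbounded on that side. The `in_window` helper is hypothetical.

```python
from datetime import date
from typing import Optional


def in_window(
    event_date: date,
    index_date: date,
    window_start: Optional[int] = None,
    window_end: Optional[int] = None,
) -> bool:
    """Check whether an event falls within [window_start, window_end] days of index.

    Assumes inclusive bounds; None leaves that side of the window open.
    """
    offset = (event_date - index_date).days  # negative means before index
    if window_start is not None and offset < window_start:
        return False
    if window_end is not None and offset > window_end:
        return False
    return True


index = date(2021, 6, 1)
print(in_window(date(2021, 5, 31), index, window_start=-365, window_end=-1))  # True (day -1)
print(in_window(date(2021, 6, 1), index, window_start=-365, window_end=-1))   # False (day 0)
print(in_window(date(2021, 6, 1), index, window_start=-365, window_end=0))    # True (day 0)
```

Under this reading, `window_end=-1` excludes events on the index date itself, while `window_end=0` includes them.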

Quick test with limited data (for debugging):

>>> # Test with one year first
>>> result_test = get_cohort(
...     conn, source="iqvia", schema="iqvia_pharmetrics_2024q3",
...     criteria=criteria, start_year=2024, end_year=2024,  # Single year
... )
>>> print(f"Test cohort: {len(result_test.df_cohort)} patients")

Source code in alx_heor\cohort\__init__.py
def get_cohort(
    conn: RedshiftConnection,
    source: str,
    schema: str,
    criteria: CohortCriteria,
    start_year: int,
    end_year: int,
    study_start: str | None = None,
    study_end: str | None = None,
    include_claims: bool = True,
) -> CohortResult:
    """
    Identify a patient cohort with comprehensive inclusion/exclusion criteria.

    This is the primary high-level function for cohort identification in RWE studies.
    It automates the entire workflow of querying claims, applying inclusion/exclusion
    criteria, calculating index dates, filtering by demographics, checking enrollment,
    and tracking attrition at each step.

    The function applies criteria in a deterministic order, allowing you to specify
    your study protocol once and reproduce results consistently. The returned
    `CohortResult` includes an attrition table showing how many patients were
    excluded at each step, which is essential for study transparency.

    Workflow (Automated by this Function)
    -------------------------------------
    1. Query claims matching primary diagnosis codes
    2. Apply min_count and days_apart criteria
    3. Calculate index dates
    4. Add demographics (age, sex)
    5. Apply required diagnosis/procedure/medication criteria
    6. Apply excluded diagnosis/procedure/medication criteria
    7. Filter by age and sex
    8. Apply provider specialty filter
    9. Check continuous enrollment requirements
    10. Generate attrition table

    Parameters
    ----------
    conn : RedshiftConnection
        Active database connection. Must be connected before calling.
    source : str
        Data source name: 'iqvia', 'optum', 'komodo'.
        Determines column mappings and table patterns via config.
    schema : str
        Database schema (e.g., 'iqvia_pharmetrics_2024q3').
        Use `conn.get_schemas('iqvia')` to find available schemas.
    criteria : CohortCriteria
        Complete specification of inclusion and exclusion criteria.
        See CohortCriteria documentation for all available options.
    start_year : int
        First year of claims data to query (e.g., 2015).
    end_year : int
        Last year of claims data to query (e.g., 2024).
    study_start : str, optional
        Study period start date (e.g., '2015-01-01'). If provided, excludes
        claims before this date. Useful for aligning with protocol dates.
    study_end : str, optional
        Study period end date (e.g., '2024-03-31'). Used for censoring and
        excluding claims after this date.
    include_claims : bool, default=True
        If True, include full claims data in result for downstream analysis.
        Set False to save memory when only cohort demographics are needed.

    Returns
    -------
    CohortResult
        Object containing:
        - df_cohort: Final filtered cohort (one row per patient)
        - df_claims: All diagnosis claims for cohort patients
        - attrition: Dict tracking patient counts at each step
        - criteria: The CohortCriteria used (for reproducibility)
        - df_enrollment: Enrollment data (if enrollment criteria applied)
        - df_censor: Censoring dates (if enrollment criteria applied)
        - df_payer: Payer classification (if enrollment criteria applied)

    See Also
    --------
    CohortCriteria : Specification of all inclusion/exclusion criteria
    DiagnosisCriteria : Diagnosis-based criteria
    EnrollmentCriteria : Continuous enrollment requirements
    claims.get_claims : Lower-level function if you need custom queries
    enrollment.analyze_enrollment : Detailed enrollment analysis

    Notes
    -----
    - Execution time varies by cohort size (rare diseases: minutes; common conditions: 30+ minutes)
    - Memory usage can be high for large cohorts (use include_claims=False if needed)
    - Always check attrition percentages against protocol expectations
    - The function uses gc.collect() internally to manage memory
    - For debugging, try with start_year=end_year first to reduce data volume

    Examples
    --------
    Basic gMG cohort with standard RWE criteria:

    >>> from alx_heor.cohort import get_cohort, CohortCriteria, DiagnosisCriteria
    >>>
    >>> criteria = CohortCriteria(
    ...     primary_diagnosis=DiagnosisCriteria(
    ...         codes=["G700", "G7000", "G7001"],
    ...         min_count=2,
    ...         days_apart=30,
    ...         label="gMG ≥2 Dx, 30 days apart",
    ...     ),
    ...     min_age=18,
    ...     exclude_specialties=["OPHTHAL", "OPTOMTRY"],
    ... )
    >>>
    >>> result = get_cohort(
    ...     conn,
    ...     source="iqvia",
    ...     schema="iqvia_pharmetrics_2024q3",
    ...     criteria=criteria,
    ...     start_year=2015,
    ...     end_year=2024,
    ... )
    >>>
    >>> print(result.summary())
    Attrition Table
    ============================================================
    ≥1 diagnosis claim: 89,123
    gMG ≥2 Dx, 30 days apart: 45,678 (-43,445, 51.3%)
    Age ≥18: 42,103 (-3,575, 92.2%)
    ...
    >>>
    >>> # Access the final cohort
    >>> df_cohort = result.df_cohort
    >>> print(f"Final cohort: {len(df_cohort):,} patients")

    Cohort with enrollment requirements and exclusions:

    >>> from alx_heor.cohort import EnrollmentCriteria, MedicationCriteria
    >>>
    >>> criteria = CohortCriteria(
    ...     primary_diagnosis=DiagnosisCriteria(
    ...         codes=["G700", "G7000", "G7001"],
    ...         min_count=2,
    ...         days_apart=30,
    ...     ),
    ...     excluded_diagnoses=[
    ...         DiagnosisCriteria(
    ...             codes=["C00", "C01", "C02"],  # Malignancy codes
    ...             window_start=-365,
    ...             window_end=0,
    ...             label="Malignancy in baseline",
    ...         ),
    ...     ],
    ...     excluded_medications=[
    ...         MedicationCriteria(
    ...             generic_names=["eculizumab", "ravulizumab"],
    ...             window_start=-365,
    ...             window_end=-1,
    ...             label="Prior C5 inhibitor",
    ...         ),
    ...     ],
    ...     enrollment=EnrollmentCriteria(
    ...         months_before=6,
    ...         months_after=12,
    ...     ),
    ...     min_age=18,
    ... )
    >>>
    >>> result = get_cohort(conn, source="iqvia", ...)

    Quick test with limited data (for debugging):

    >>> # Test with one year first
    >>> result_test = get_cohort(
    ...     conn, source="iqvia", schema="iqvia_pharmetrics_2024q3",
    ...     criteria=criteria, start_year=2024, end_year=2024,  # Single year
    ... )
    >>> print(f"Test cohort: {len(result_test.df_cohort)} patients")
    """
    config = get_source_config(source)
    cols = config["columns"]
    id_col = cols["patient_id"]
    attrition = {}

    # -------------------------------------------------------------------------
    # Step 1: Query and apply PRIMARY DIAGNOSIS inclusion
    # -------------------------------------------------------------------------
    df_claims = _query_diagnosis_claims(
        conn, source, schema,
        criteria.primary_diagnosis.codes,
        start_year, end_year,
        study_start, study_end
    )

    # Get unique patients with at least 1 diagnosis
    n_any_dx = df_claims[id_col].nunique()
    attrition["≥1 diagnosis claim"] = n_any_dx

    # Apply min_count and days_apart criteria
    df_cohort = _apply_diagnosis_criteria(
        df_claims, source, criteria.primary_diagnosis
    )

    label = criteria.primary_diagnosis.label or (
        f"≥{criteria.primary_diagnosis.min_count} Dx, "
        f"{criteria.primary_diagnosis.days_apart} days apart"
    )
    attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 2: Calculate index dates
    # -------------------------------------------------------------------------
    df_cohort = _calculate_index_dates(
        df_cohort, df_claims, source, criteria.index_date_method
    )

    # -------------------------------------------------------------------------
    # Step 3: Add demographics
    # -------------------------------------------------------------------------
    df_cohort = _add_demographics(df_cohort, df_claims, source, conn, schema)

    # Memory cleanup after initial processing
    gc.collect()

    # -------------------------------------------------------------------------
    # Step 4: Apply additional DIAGNOSIS inclusions
    # -------------------------------------------------------------------------
    for dx_criteria in criteria.required_diagnoses:
        df_cohort = _apply_required_diagnosis(
            conn, df_cohort, source, schema, dx_criteria,
            start_year, end_year
        )
        label = dx_criteria.label or "Required diagnosis"
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 5: Apply PROCEDURE inclusions
    # -------------------------------------------------------------------------
    for proc_criteria in criteria.required_procedures:
        df_cohort = _apply_required_procedure(
            conn, df_cohort, source, schema, proc_criteria,
            start_year, end_year
        )
        label = proc_criteria.label or "Required procedure"
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 6: Apply MEDICATION inclusions
    # -------------------------------------------------------------------------
    for med_criteria in criteria.required_medications:
        df_cohort = _apply_required_medication(
            conn, df_cohort, source, schema, med_criteria,
            start_year, end_year
        )
        label = med_criteria.label or "Required medication"
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 7: Apply DIAGNOSIS exclusions
    # -------------------------------------------------------------------------
    for dx_criteria in criteria.excluded_diagnoses:
        df_cohort = _apply_diagnosis_exclusion(
            conn, df_cohort, source, schema, dx_criteria,
            start_year, end_year
        )
        label = f"Exclude: {dx_criteria.label or 'diagnosis'}"
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 8: Apply PROCEDURE exclusions
    # -------------------------------------------------------------------------
    for proc_criteria in criteria.excluded_procedures:
        df_cohort = _apply_procedure_exclusion(
            conn, df_cohort, source, schema, proc_criteria,
            start_year, end_year
        )
        label = f"Exclude: {proc_criteria.label or 'procedure'}"
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 9: Apply MEDICATION exclusions
    # -------------------------------------------------------------------------
    for med_criteria in criteria.excluded_medications:
        df_cohort = _apply_medication_exclusion(
            conn, df_cohort, source, schema, med_criteria,
            start_year, end_year
        )
        label = f"Exclude: {med_criteria.label or 'medication'}"
        attrition[label] = len(df_cohort)

    # Memory cleanup after exclusions
    gc.collect()

    # -------------------------------------------------------------------------
    # Step 10: Apply AGE filter
    # -------------------------------------------------------------------------
    if criteria.min_age is not None or criteria.max_age is not None:
        df_cohort = _filter_age(df_cohort, criteria.min_age, criteria.max_age)
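        # Compare against None so that a bound of 0 is still treated as set
        if criteria.min_age is not None and criteria.max_age is not None:
            label = f"Age {criteria.min_age}-{criteria.max_age}"
        elif criteria.min_age is not None:
            label = f"Age ≥{criteria.min_age}"
        else:
            label = f"Age ≤{criteria.max_age}"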
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 11: Apply SEX filter
    # -------------------------------------------------------------------------
    if criteria.valid_sex_only:
        df_cohort = _filter_valid_sex(df_cohort, source)
        attrition["Valid sex (M/F)"] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 12: Apply SPECIALTY filter
    # -------------------------------------------------------------------------
    if criteria.exclude_specialties:
        df_cohort = _filter_specialty(
            df_cohort, df_claims, source,
            criteria.exclude_specialties,
            criteria.require_specialty_confirmation
        )
        attrition["Non-excluded specialty"] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 13: Apply ENROLLMENT requirements
    # -------------------------------------------------------------------------
    df_enrollment = pd.DataFrame()
    df_censor = pd.DataFrame()
    df_payer = pd.DataFrame()

    if criteria.enrollment:
        df_cohort, df_enrollment, df_censor, df_payer = _apply_enrollment_with_outputs(
            conn, df_cohort, source, schema, criteria.enrollment,
            start_year, end_year, study_end
        )
        enroll = criteria.enrollment
        label = enroll.label or (
            f"Enrolled {enroll.months_before}m pre / {enroll.months_after}m post"
        )
        attrition[label] = len(df_cohort)

    # -------------------------------------------------------------------------
    # Step 14: Add first specialty to output
    # -------------------------------------------------------------------------
    df_cohort = _add_first_specialty(df_cohort, df_claims, source)

    # -------------------------------------------------------------------------
    # Finalize
    # -------------------------------------------------------------------------
    # Standardize output column names
    df_cohort = _standardize_output_columns(df_cohort, source)

    # Prepare claims DataFrame for output
    if include_claims:
        output_claims = df_claims
    else:
        output_claims = pd.DataFrame()
        del df_claims
        gc.collect()

    return CohortResult(
        df_cohort=df_cohort,
        df_claims=output_claims,
        attrition=attrition,
        criteria=criteria,
        df_enrollment=df_enrollment,
        df_censor=df_censor,
        df_payer=df_payer,
    )
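
Step 13 delegates to `_apply_enrollment_with_outputs`, whose body is not shown here. As a rough illustration of what a `months_before` / `months_after` requirement can mean, the sketch below assumes a patient qualifies when a single enrollment span covers the whole window around the index date. Both helper functions and the single-span rule are assumptions, not the library's confirmed logic.

```python
import calendar
from datetime import date


def add_months(d: date, months: int) -> date:
    """Shift a date by whole months, clamping to the last day of the target month."""
    total = d.year * 12 + (d.month - 1) + months
    year, month0 = divmod(total, 12)
    month = month0 + 1
    day = min(d.day, calendar.monthrange(year, month)[1])
    return date(year, month, day)


def continuously_enrolled(
    index_date: date,
    enroll_start: date,
    enroll_end: date,
    months_before: int = 6,
    months_after: int = 12,
) -> bool:
    """Assumed rule: one enrollment span must cover the full required window."""
    required_start = add_months(index_date, -months_before)
    required_end = add_months(index_date, months_after)
    return enroll_start <= required_start and enroll_end >= required_end


index = date(2020, 8, 31)
print(add_months(index, -6))  # 2020-02-29
print(continuously_enrolled(index, date(2020, 1, 1), date(2021, 12, 31)))  # True
print(continuously_enrolled(index, date(2020, 4, 1), date(2021, 12, 31)))  # False
```

A real implementation would also need a policy for short gaps between enrollment spans, which this sketch does not attempt.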