Cohort Building Tutorial
This tutorial covers the complete cohort building workflow using get_cohort() and the criteria dataclasses.
Overview
Cohort building in alx-heor uses a declarative approach: you define criteria using dataclasses, and the library handles all the SQL generation, data retrieval, and filtering.
from alx_heor.cohort import (
    get_cohort,
    CohortCriteria,
    DiagnosisCriteria,
    ProcedureCriteria,
    MedicationCriteria,
    EnrollmentCriteria,
)
Step 1: Define Diagnosis Criteria
The primary entry point for most cohorts is a diagnosis code requirement.
Basic Diagnosis
Multiple Claims Requirement
# 2+ diagnoses, at least 30 days apart
primary_dx = DiagnosisCriteria(
    codes=["G700", "G7000", "G7001"],
    min_count=2,
    days_apart=30,
    label="gMG ≥2 Dx, 30 days apart",
)
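To make the `min_count` / `days_apart` rule concrete, here is a minimal standalone sketch of one way such a rule could be evaluated for a single patient. It is not the library's implementation: the helper name `meets_count_rule` is hypothetical, and it assumes the rule means "at least `min_count` claims can be picked so that each is `days_apart` or more days after the previous pick".

```python
from datetime import date, timedelta


def meets_count_rule(claim_dates, min_count=2, days_apart=30):
    """Return True if `min_count` claims can be chosen with each one at least
    `days_apart` days after the previously chosen claim (assumed semantics)."""
    chosen = 0
    last = None
    for d in sorted(set(claim_dates)):
        # Greedy choice: take the earliest date that is far enough from the last pick.
        if last is None or d - last >= timedelta(days=days_apart):
            chosen += 1
            last = d
            if chosen >= min_count:
                return True
    return False


# Two claims 45 days apart qualify; two claims 10 days apart do not.
qualifies = meets_count_rule([date(2020, 1, 5), date(2020, 2, 19)])
too_close = meets_count_rule([date(2020, 1, 5), date(2020, 1, 15)])
```

Picking the earliest eligible date at each step leaves the most room for later claims, so the greedy pass finds a valid selection whenever one exists.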
Diagnosis Position
# Primary diagnosis only (diag1)
primary_dx = DiagnosisCriteria(
    codes=["G700"],
    diagnosis_position="primary",
)

# Admit diagnosis only
admit_dx = DiagnosisCriteria(
    codes=["G700"],
    diagnosis_position="admit",
)
Care Setting
# Inpatient only
inpatient_dx = DiagnosisCriteria(
    codes=["G700"],
    require_inpatient=True,
)

# Outpatient only
outpatient_dx = DiagnosisCriteria(
    codes=["G700"],
    require_outpatient=True,
)
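Step 2: Add Time Windows
Time windows are expressed in days relative to the index date (by default, the first qualifying diagnosis). Negative values fall before the index date and positive values after it.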
# Baseline comorbidity (1 year before index)
baseline_malignancy = DiagnosisCriteria(
    codes=["C00", "C01", "C02"],  # Cancer codes
    window_start=-365,            # 365 days before index
    window_end=0,                 # Up to and including index
    label="Malignancy in baseline",
)

# Follow-up event (within 1 year after index)
followup_crisis = DiagnosisCriteria(
    codes=["G7001"],   # G70.01: myasthenia gravis with (acute) exacerbation
    window_start=0,    # On or after index
    window_end=365,    # Within 1 year
    label="MG crisis in follow-up",
)
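The window semantics above can be illustrated with a small standalone sketch. The helper `in_window` is hypothetical and not part of alx-heor. It treats both bounds as inclusive, matching the "up to and including index" comment, and it assumes that an omitted bound means no limit on that side.

```python
from datetime import date


def in_window(event_date, index_date, window_start=None, window_end=None):
    """True if the event's day offset from index falls in [window_start, window_end].
    None means that side is unbounded (an assumption of this sketch)."""
    offset = (event_date - index_date).days
    if window_start is not None and offset < window_start:
        return False
    if window_end is not None and offset > window_end:
        return False
    return True


index = date(2021, 6, 1)
# A claim 137 days before index falls inside a one-year baseline window.
in_baseline = in_window(date(2021, 1, 15), index, window_start=-365, window_end=0)
# With window_end=-1 an event on the index date itself is excluded.
on_index = in_window(index, index, window_start=-365, window_end=-1)
```

The second call shows why the choice between `window_end=0` and `window_end=-1` matters: it decides whether events on the index date count toward the baseline.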
Step 3: Medication Criteria
Require or exclude patients based on medication use.
By Generic Name
# Require C5 inhibitor treatment
c5_inhibitor = MedicationCriteria(
    generic_names=["eculizumab", "ravulizumab"],
    window_start=0,  # Post-index
    label="C5 inhibitor",
)
By J-Code
# Require specific HCPCS J-codes (J1300: eculizumab, J1303: ravulizumab)
c5_jcode = MedicationCriteria(
    procedure_codes=["J1300", "J1303"],
    window_start=0,
    label="C5 inhibitor (J-code)",
)
Treatment-Naive (Exclude Prior Use)
# Exclude patients with prior biologic use
prior_biologic = MedicationCriteria(
    generic_names=["rituximab", "eculizumab", "ravulizumab"],
    window_start=-365,  # 1 year before
    window_end=-1,      # Up to day before index
    label="Prior biologic",
)
Step 4: Procedure Criteria
# Exclude patients with prior thymectomy (CPT 60520-60522)
prior_thymectomy = ProcedureCriteria(
    codes=["60520", "60521", "60522"],
    window_end=-1,  # Any time up to the day before index
    label="Prior thymectomy",
)
Step 5: Enrollment Requirements
# Require continuous enrollment
enrollment = EnrollmentCriteria(
    months_before=6,   # 6 months baseline
    months_after=12,   # 12 months follow-up
    max_gap_months=1,  # Allow 1-month gaps
)
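As a rough illustration of what a continuous-enrollment rule with an allowed gap could look like, the sketch below checks month-level coverage for one patient. It is a sketch under stated assumptions rather than the library's logic: the function names are hypothetical, enrollment is tracked per calendar month, and the required span is taken to run from `months_before` months before the index month through `months_after` months after it, inclusive.

```python
def month_index(year, month):
    """Map a calendar month to a single integer so months can be compared."""
    return year * 12 + (month - 1)


def continuously_enrolled(enrolled_months, index_year, index_month,
                          months_before=6, months_after=12, max_gap_months=1):
    """Check every month in the required span, allowing runs of missing
    months no longer than `max_gap_months` (assumed semantics)."""
    covered = {month_index(y, m) for (y, m) in enrolled_months}
    idx = month_index(index_year, index_month)
    gap = 0
    for m in range(idx - months_before, idx + months_after + 1):
        if m in covered:
            gap = 0
        else:
            gap += 1
            if gap > max_gap_months:
                return False
    return True


# Enrolled July 2019 through January 2021, with March 2020 missing.
months = [(y, m) for y in (2019, 2020, 2021) for m in range(1, 13)
          if (2019, 7) <= (y, m) <= (2021, 1) and (y, m) != (2020, 3)]
one_gap_ok = continuously_enrolled(months, 2020, 1)
no_gaps_allowed = continuously_enrolled(months, 2020, 1, max_gap_months=0)
```

With an index month of January 2020, the single missing month passes when `max_gap_months=1` and fails when no gaps are allowed.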
Step 6: Put It All Together
criteria = CohortCriteria(
    # Required: primary diagnosis
    primary_diagnosis=DiagnosisCriteria(
        codes=["G700", "G7000", "G7001"],
        min_count=2,
        days_apart=30,
        label="gMG diagnosis",
    ),
    # Additional required diagnoses
    required_diagnoses=[
        # None in this example
    ],
    # Required medications
    required_medications=[
        MedicationCriteria(
            generic_names=["eculizumab", "ravulizumab"],
            window_start=0,
            label="C5 inhibitor",
        ),
    ],
    # Excluded diagnoses
    excluded_diagnoses=[
        DiagnosisCriteria(
            codes=["C00", "C01"],  # Cancer
            window_start=-365,
            window_end=0,
            label="Baseline malignancy",
        ),
    ],
    # Excluded procedures
    excluded_procedures=[
        ProcedureCriteria(
            codes=["60520", "60521"],
            window_end=-1,
            label="Prior thymectomy",
        ),
    ],
    # Excluded medications
    excluded_medications=[
        MedicationCriteria(
            generic_names=["rituximab"],
            window_start=-365,
            window_end=-1,
            label="Prior rituximab",
        ),
    ],
    # Demographics
    min_age=18,
    max_age=None,  # No upper limit
    valid_sex_only=True,
    # Enrollment
    enrollment=EnrollmentCriteria(
        months_before=6,
        months_after=12,
    ),
)
Step 7: Build the Cohort
from alx_heor.database import RedshiftConnection

conn = RedshiftConnection().connect()
result = get_cohort(
    conn,
    source="iqvia",
    schema="iqvia_pharmetrics_2024q3",
    criteria=criteria,
    start_year=2015,
    end_year=2024,
)
Understanding CohortResult
The result contains multiple DataFrames and metadata:
Output:
Attrition Table
============================================================
gMG diagnosis:                    45,231
Adults (18+):                     42,105  (-3,126, 93.1%)
Valid sex:                        41,892  (-213, 99.5%)
Baseline malignancy (excluded):   41,234  (-658, 98.4%)
Prior thymectomy (excluded):      40,891  (-343, 99.2%)
Prior rituximab (excluded):       39,234  (-1,657, 95.9%)
C5 inhibitor:                      8,234  (-31,000, 21.0%)
Continuous enrollment:             6,891  (-1,343, 83.7%)
============================================================
Accessing Data
# Final cohort
df_cohort = result.df_cohort
# Columns: pat_id, index_date, der_yob, der_sex, age_at_index
# All diagnosis claims for cohort patients
df_claims = result.df_claims
# Censoring dates (for survival analysis)
df_censor = result.df_censor
# Columns: pat_id, censor_date, is_censored_by_gap
# Payer type
df_payer = result.df_payer
# Columns: pat_id, pay_type, payer_type
# Attrition as dict
attrition_dict = result.attrition
# {'gMG diagnosis': 45231, 'Adults (18+)': 42105, ...}
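The per-step drops and retention percentages shown in the attrition table can be recomputed from an ordered step-to-count dict like the one above. The following standalone sketch uses a hypothetical helper, `attrition_rows`, and a plain dict filled with the counts from the table. The key names beyond the first two are assumed to match the table labels.

```python
def attrition_rows(attrition):
    """Turn an ordered {step: count} dict into (step, count, dropped, pct_kept) rows.
    `dropped` and `pct_kept` are relative to the previous step; the first row has None."""
    rows = []
    previous = None
    for step, count in attrition.items():
        if previous is None:
            rows.append((step, count, None, None))
        else:
            pct = round(100.0 * count / previous, 1) if previous else None
            rows.append((step, count, count - previous, pct))
        previous = count
    return rows


# Counts copied from the attrition table above.
attrition = {
    "gMG diagnosis": 45231,
    "Adults (18+)": 42105,
    "Valid sex": 41892,
    "Baseline malignancy (excluded)": 41234,
    "Prior thymectomy (excluded)": 40891,
    "Prior rituximab (excluded)": 39234,
    "C5 inhibitor": 8234,
    "Continuous enrollment": 6891,
}
rows = attrition_rows(attrition)
for step, count, dropped, pct in rows:
    detail = "" if dropped is None else f" ({dropped:,}, {pct}%)"
    print(f"{step}: {count:,}{detail}")
```

Each percentage is the share of the previous step's patients that remain, which is why the C5 inhibitor requirement shows 21.0% even though it is applied late in the sequence.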
Index Date Options
By default, the index date is the date of the first qualifying diagnosis. Set index_date_method to change this:
criteria = CohortCriteria(
    primary_diagnosis=...,
    index_date_method="second_dx",  # Use second diagnosis date
)
# Options:
# - "first_dx": First diagnosis date (default)
# - "second_dx": Second diagnosis date (when min_count >= 2)
# - "first_rx": First qualifying medication date
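To show how the three options differ for a single patient, here is a minimal standalone sketch. The function `pick_index_date` is hypothetical and only mirrors the option names listed above. It assumes "second" means the second distinct diagnosis date, which may differ from how the library applies `days_apart`.

```python
from datetime import date


def pick_index_date(dx_dates, rx_dates, method="first_dx"):
    """Return the index date for one patient, or None if it cannot be determined."""
    dx = sorted(set(dx_dates))
    rx = sorted(set(rx_dates))
    if method == "first_dx":
        return dx[0] if dx else None
    if method == "second_dx":
        return dx[1] if len(dx) >= 2 else None
    if method == "first_rx":
        return rx[0] if rx else None
    raise ValueError(f"unknown index_date_method: {method!r}")


dx = [date(2020, 3, 10), date(2020, 1, 5)]
rx = [date(2020, 6, 1)]
first = pick_index_date(dx, rx)                # date(2020, 1, 5)
second = pick_index_date(dx, rx, "second_dx")  # date(2020, 3, 10)
```

Because every time window is measured from the index date, changing the method shifts all baseline and follow-up windows for the patient.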
Next Steps
- Enrollment Analysis Tutorial - Deep dive into enrollment and censoring
- Medication Analysis Tutorial - Treatment patterns and adherence
- API Reference: Cohort - Complete function documentation