Technical Documentation for ReadmitRisk Hospital Readmission Risk Prediction Pipeline
| Attribute | Value |
|---|---|
| Source | UCI Machine Learning Repository |
| Time Period | 1999-2008 (10 years) |
| Hospitals | 130 US hospitals |
| Raw Records | 101,766 hospital encounters |
| Variables | 50 features including demographics, diagnoses, medications, lab results |
| Target Variable | Readmission status (NO, >30 days, <30 days) |

State-level and hospital-level readmission rates and penalty data are generated to follow CMS Hospital Readmissions Reduction Program (HRRP) patterns. The geographic data covers all 50 states plus the District of Columbia.

| Column | Missing Rate | Treatment | Rationale |
|---|---|---|---|
| `weight` | 97% | Dropped | Too sparse to impute meaningfully |
| `payer_code` | 40% | Dropped | High missingness, administrative variable |
| `medical_specialty` | 49% | Dropped | High missingness, many categories |
| `race` | 2% | Filled with "Unknown" | Low missingness, categorical |
| Numeric features | <1% | Filled with median | Preserves distribution |
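The treatments in the table could be applied with pandas roughly as follows. This is a minimal sketch, not the pipeline's actual code: the helper name `handle_missing` and the demo frame are hypothetical, and it assumes missing values are encoded as `?`, as in the UCI distribution of this dataset.

```python
import numpy as np
import pandas as pd

# Columns dropped for high missingness, per the table above
DROP_COLUMNS = ["weight", "payer_code", "medical_specialty"]


def handle_missing(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the missing-value treatments from the table (hypothetical helper)."""
    # Assumption: "?" marks a missing value in the raw file
    out = df.replace("?", np.nan)
    # Drop sparse columns; errors="ignore" tolerates columns that are already absent
    out = out.drop(columns=DROP_COLUMNS, errors="ignore")
    # Low-missingness categorical: keep the rows, add an explicit "Unknown" level
    if "race" in out.columns:
        out["race"] = out["race"].fillna("Unknown")
    # Numeric features: fill each column with its own median
    numeric_cols = out.select_dtypes(include="number").columns
    out[numeric_cols] = out[numeric_cols].fillna(out[numeric_cols].median())
    return out


# Tiny made-up frame, for illustration only
demo = pd.DataFrame({
    "weight": ["?", "?", "[75-100)"],
    "payer_code": ["MC", "?", "?"],
    "medical_specialty": ["?", "Cardiology", "?"],
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "num_lab_procedures": [41.0, np.nan, 59.0],
})
cleaned = handle_missing(demo)
```

Filling `race` with an explicit level keeps those rows in the data and lets the model learn a separate coefficient for the unknown group.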

The raw dataset contains multiple encounters per patient. To ensure independence of observations and prevent data leakage, records are ordered by `encounter_id` and deduplicated on `patient_nbr`, keeping the first encounter only. This leaves 71,518 records, one per patient.

The original target had three categories. It was transformed to a binary label for logistic regression:

| Original Value | Binary Value | Count | Percentage |
|---|---|---|---|
| "<30" (readmitted within 30 days) | 1 (Positive) | 6,293 | 8.8% |
| "NO" or ">30" | 0 (Negative) | 65,225 | 91.2% |
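The deduplication and target transformation can be sketched as below. The function name and the output column `readmitted_binary` are hypothetical, and the sketch assumes that "first encounter" means the lowest `encounter_id` per `patient_nbr`; the pipeline's own ordering rule may differ.

```python
import pandas as pd


def first_encounter_binary_target(df: pd.DataFrame) -> pd.DataFrame:
    """Keep one encounter per patient and add a binary target (hypothetical helper)."""
    out = (
        df.sort_values("encounter_id")  # assumption: lowest ID is the first encounter
          .drop_duplicates(subset="patient_nbr", keep="first")
          .copy()
    )
    # "<30" -> 1 (positive); "NO" and ">30" -> 0 (negative)
    out["readmitted_binary"] = (out["readmitted"] == "<30").astype(int)
    return out


# Made-up encounters: patient 1 appears twice
demo = pd.DataFrame({
    "encounter_id": [30, 10, 20, 40],
    "patient_nbr": [1, 1, 2, 3],
    "readmitted": ["NO", "<30", ">30", "<30"],
})
result = first_encounter_binary_target(demo)
```

Keeping a single row per patient prevents the same patient from appearing in both the training and test sets.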
Original age values are categorical ranges. Converted to numeric midpoints:
```python
age_mapping = {
    '[0-10)': 5, '[10-20)': 15, '[20-30)': 25,
    '[30-40)': 35, '[40-50)': 45, '[50-60)': 55,
    '[60-70)': 65, '[70-80)': 75, '[80-90)': 85,
    '[90-100)': 95
}
```
| Feature | Formula | Clinical Rationale |
|---|---|---|
| `total_visits` | `number_outpatient + number_emergency + number_inpatient` | Overall healthcare utilization indicator |
| `medication_intensity` | `num_medications / (time_in_hospital + 1)` | Treatment complexity normalized by stay length |
| `num_med_changes` | Count of medications with "Up" or "Down" dosage changes | Medication optimization during stay |
| `A1Cresult_abnormal` | 1 if `A1Cresult` is ">7" or ">8", else 0 | Diabetes control indicator |
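The engineered features in the table might be computed along these lines. This is a sketch under stated assumptions: `engineer_features` is a hypothetical helper, the medication columns are passed in explicitly through the hypothetical `med_cols` parameter, and the demo uses only two medication columns (`metformin`, `insulin`) for brevity.

```python
import pandas as pd

# Midpoint of each 10-year age bracket, e.g. "[70-80)" -> 75
AGE_MAPPING = {f"[{lo}-{lo + 10})": lo + 5 for lo in range(0, 100, 10)}


def engineer_features(df: pd.DataFrame, med_cols: list[str]) -> pd.DataFrame:
    """Add the derived columns described in the table (hypothetical helper)."""
    out = df.copy()
    out["age_numeric"] = out["age"].map(AGE_MAPPING)
    out["total_visits"] = (
        out["number_outpatient"] + out["number_emergency"] + out["number_inpatient"]
    )
    # The +1 keeps the denominator positive even if a stay length were 0
    out["medication_intensity"] = out["num_medications"] / (out["time_in_hospital"] + 1)
    # Count medications whose dosage went "Up" or "Down" during the stay
    out["num_med_changes"] = out[med_cols].isin(["Up", "Down"]).sum(axis=1)
    out["A1Cresult_abnormal"] = out["A1Cresult"].isin([">7", ">8"]).astype(int)
    return out


# Two made-up patients, for illustration only
demo = pd.DataFrame({
    "age": ["[70-80)", "[40-50)"],
    "number_outpatient": [1, 0],
    "number_emergency": [0, 2],
    "number_inpatient": [2, 0],
    "num_medications": [12, 9],
    "time_in_hospital": [3, 2],
    "metformin": ["Up", "No"],
    "insulin": ["Down", "Steady"],
    "A1Cresult": [">8", "Norm"],
})
features = engineer_features(demo, med_cols=["metformin", "insulin"])
```

Rows with no A1C measurement fall through `isin` as `False`, so they are coded 0 alongside normal results.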

Numeric features: `time_in_hospital`, `num_lab_procedures`, `num_procedures`, `num_medications`, `number_outpatient`, `number_emergency`, `number_inpatient`, `number_diagnoses`, `age_numeric`, `total_visits`, `medication_intensity`, `num_med_changes`

Categorical features: `race`, `gender`, `admission_type_id`, `discharge_disposition_id`, `admission_source_id`, `diabetesMed`, `change`, `A1Cresult_abnormal`

Categorical features are one-hot encoded with `drop_first=True`.

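As an illustration of one-hot encoding with `drop_first=True`, the sketch below encodes two of the categorical columns with `pandas.get_dummies`. The source does not say which encoder the pipeline uses, so treat the choice of `get_dummies` and the made-up demo frame as assumptions.

```python
import pandas as pd

# Subset of the categorical columns, for illustration only
CATEGORICAL = ["race", "gender"]

demo = pd.DataFrame({
    "race": ["Caucasian", "AfricanAmerican", "Unknown"],
    "gender": ["Female", "Male", "Female"],
    "time_in_hospital": [3, 5, 2],
})

# drop_first=True drops the first level of each column (in sorted order),
# which then acts as the reference category
encoded = pd.get_dummies(demo, columns=CATEGORICAL, drop_first=True)
```

Passing `columns=` explicitly matters for integer-coded fields such as `admission_type_id`: by default `get_dummies` only encodes object, string, and categorical columns, so numeric IDs would otherwise be left as plain numbers.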
| Set | Size | Positive Class | Negative Class |
|---|---|---|---|
| Training (80%) | 57,214 | 5,034 (8.8%) | 52,180 (91.2%) |
| Test (20%) | 14,304 | 1,259 (8.8%) | 13,045 (91.2%) |
Stratified split used to maintain class proportions in both sets.
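A stratified 80/20 split can be sketched with scikit-learn's `train_test_split`. The data here is synthetic with roughly the same 8.8% positive rate, and `random_state=42` is an assumption; the source does not state the seed used for the split.

```python
import numpy as np
from sklearn.model_selection import train_test_split

# Synthetic stand-in for the feature matrix and binary target
rng = np.random.default_rng(42)
X = rng.normal(size=(1000, 4))
y = np.array([1] * 88 + [0] * 912)  # 8.8% positives

# stratify=y keeps the positive rate (nearly) identical in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.20, stratify=y, random_state=42
)
```

Without `stratify`, a rare positive class can end up noticeably over- or under-represented in the test set purely by chance.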
Synthetic Minority Over-sampling Technique (SMOTE) applied to training data only:
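```python
from imblearn.over_sampling import SMOTE

# Oversample the minority class in the training set only; the test set is left untouched
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)
```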
| Class | Before SMOTE | After SMOTE |
|---|---|---|
| Negative (0) | 52,180 | 52,180 |
| Positive (1) | 5,034 | 52,180 |
| Total | 57,214 | 104,360 |

Model configuration:

| Parameter | Value | Rationale |
|---|---|---|
| Algorithm | Logistic Regression | Interpretable coefficients for clinical explainability |
| Regularization | L2 (Ridge) | Prevents overfitting with many features |
| C (inverse regularization) | 0.1 | Moderate regularization strength |
| max_iter | 1000 | Ensure convergence |
| Scaling | StandardScaler | Zero mean, unit variance for each feature |
```python
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

# Fit the scaler on the balanced training data, then reuse it for the test set
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

# L2 is scikit-learn's default penalty for LogisticRegression
model = LogisticRegression(max_iter=1000, random_state=42, C=0.1)
model.fit(X_train_scaled, y_train_balanced)
```

Risk Score = P(readmission | features) × 100

Where P(readmission | features) is the predicted probability from the logistic regression model's `predict_proba()` method.

Estimated Cost = (Risk Score / 100) × $15,000

The $15,000 figure represents the average cost of a hospital readmission based on industry benchmarks.
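The two formulas above translate directly into code. The function and constant names below are hypothetical; only the formulas and the $15,000 figure come from the text.

```python
AVG_READMISSION_COST = 15_000  # USD, the average readmission cost quoted above


def risk_score(probability: float) -> float:
    """Convert a predicted readmission probability (0 to 1) into a 0-100 score."""
    if not 0.0 <= probability <= 1.0:
        raise ValueError("probability must be between 0 and 1")
    return probability * 100


def estimated_cost(score: float) -> float:
    """Expected readmission cost in USD for a given risk score."""
    return (score / 100) * AVG_READMISSION_COST


score = risk_score(0.75)      # 75.0
cost = estimated_cost(score)  # 11250.0
```

With a fitted scikit-learn classifier whose classes are `[0, 1]`, the positive-class probability is the second column of the output, i.e. `model.predict_proba(X)[:, 1]`.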
| Tier | Risk Score Range | Patient Count | Percentage | Action Level |
|---|---|---|---|---|
| Critical | 80-100% | 1,931 | 2.7% | Immediate intervention |
| Very High | 70-80% | 2,232 | 3.1% | Priority outreach |
| High | 60-70% | 2,920 | 4.1% | Proactive monitoring |
| Moderate | 40-60% | 9,224 | 12.9% | Standard care |
| Low | 0-40% | 55,211 | 77.2% | Routine follow-up |
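Tier assignment can be expressed as a simple threshold lookup. The table's ranges share their endpoints (for example, 80 appears in two rows), so this sketch assumes each tier includes its lower bound and excludes its upper bound, with 100 falling in Critical. The function name is hypothetical.

```python
def risk_tier(score: float) -> str:
    """Map a 0-100 risk score to a tier name.

    Assumption: a score exactly on a boundary belongs to the higher tier.
    """
    if not 0 <= score <= 100:
        raise ValueError("score must be between 0 and 100")
    if score >= 80:
        return "Critical"
    if score >= 70:
        return "Very High"
    if score >= 60:
        return "High"
    if score >= 40:
        return "Moderate"
    return "Low"


tiers = [risk_tier(s) for s in (85, 72.5, 60, 41, 12)]
```

Under this convention, the "high-risk" group used elsewhere in the document corresponds to every score of 60 or above.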

Features with the largest positive coefficients (associated with higher predicted risk):

| Rank | Feature | Coefficient | Interpretation |
|---|---|---|---|
| 1 | Total Prior Visits | +4.3579 | High utilization = complex patient |
| 2 | Number of Medications | +0.2819 | Polypharmacy indicates severity |
| 3 | Lab Procedures | +0.1021 | More testing = diagnostic complexity |
| 4 | Number of Diagnoses | +0.0078 | Comorbidity burden |

Features with the largest negative coefficients (associated with lower predicted risk):

| Rank | Feature | Coefficient | Interpretation |
|---|---|---|---|
| 1 | Outpatient Visits | -3.1170 | Continuity of care is protective |
| 2 | Prior Inpatient Visits | -1.9585 | Established care relationships |
| 3 | Emergency Visits | -1.5405 | Access to acute care when needed |
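Rankings like the ones above can be produced by sorting a fitted model's coefficients. The sketch below uses synthetic data and hypothetical feature names (`feature_a`, `feature_b`, `feature_c`); it illustrates the technique and is not the pipeline's reporting code.

```python
import numpy as np
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler


def rank_coefficients(model: LogisticRegression, feature_names: list[str]) -> pd.Series:
    """Return coefficients sorted from most positive to most negative."""
    coefs = pd.Series(model.coef_[0], index=feature_names)
    return coefs.sort_values(ascending=False)


# Synthetic data: feature_a raises risk, feature_b lowers it, feature_c is noise
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
logits = 2.0 * X[:, 0] - 2.0 * X[:, 1]
y = (logits + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(C=0.1, max_iter=1000, random_state=42)
model.fit(StandardScaler().fit_transform(X), y)
ranked = rank_coefficients(model, ["feature_a", "feature_b", "feature_c"])
```

Because the inputs are standardized, each coefficient is the change in log-odds per one standard deviation of the feature. Note that `total_visits` is defined as the sum of `number_outpatient`, `number_emergency`, and `number_inpatient`, so the coefficients of those four features are not independent of one another and are best read together rather than one at a time.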

State-level readmission rates and penalties are simulated: each state's base rate, chosen to follow actual CMS HRRP patterns, is perturbed with random variation, and the penalty range depends on the resulting rate:

```python
import numpy as np

for state_code, info in STATE_DATA.items():
    # Add realistic variation around the state's base rate
    rate_variation = np.random.uniform(-0.8, 0.8)
    avg_rate = info['base_rate'] + rate_variation

    # Draw a penalty percentage from a range determined by the rate
    if avg_rate > 15.5:
        penalty_pct = np.random.uniform(0.5, 2.0)
    elif avg_rate > 14.5:
        penalty_pct = np.random.uniform(0.2, 0.8)
    else:
        penalty_pct = np.random.uniform(0, 0.3)

    # Estimate total penalty; `hospitals` is defined outside this excerpt
    total_penalty = hospitals * 5_000_000 * (penalty_pct / 100)
```

CMS penalizes hospitals with Excess Readmission Ratios (ERR) greater than 1.0. Penalties are calculated as a percentage of Medicare payments, capped at 3%.

Validation checks on the pipeline outputs:

| Validation | Expected | Actual | Status |
|---|---|---|---|
| Total patients | 71,518 | 71,518 | PASS |
| High-risk patients (60%+) | 7,083 | 7,083 | PASS |
| Sum of risk tiers | 7,083 | 1,931 + 2,232 + 2,920 = 7,083 | PASS |
| Risk distribution sum | 71,518 | 37,663 + 17,548 + 9,224 + 5,152 + 1,931 = 71,518 | PASS |
| Total cost exposure | $78,501,391.71 | $25,096,147 + $25,025,321 + $28,379,923 = $78,501,391 | PASS |
| States analyzed | 51 | 51 | PASS |
| Patient export count | 7,083 | 7,083 | PASS |
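The arithmetic behind several of these checks can be re-run in a few lines. The counts and dollar amounts are copied from the tables in this document; the variable names and the `checks` dictionary are hypothetical.

```python
# Patient counts per risk tier, from the risk tier table
tier_counts = {
    "Critical": 1_931,
    "Very High": 2_232,
    "High": 2_920,
    "Moderate": 9_224,
    "Low": 55_211,
}
# Cost exposure components for the high-risk group, in whole dollars
high_risk_costs = [25_096_147, 25_025_321, 28_379_923]

high_risk = sum(tier_counts[t] for t in ("Critical", "Very High", "High"))

checks = {
    "total_patients": sum(tier_counts.values()) == 71_518,
    "high_risk_patients": high_risk == 7_083,
    "total_cost_exposure": sum(high_risk_costs) == 78_501_391,
}
failed = [name for name, ok in checks.items() if not ok]
```

An empty `failed` list means the figures are internally consistent. The cost total is compared in whole dollars, which is why it differs from the expected $78,501,391.71 only by the cents.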

Pipeline output files:

| File | Description | Records/Keys | Size |
|---|---|---|---|
| `patient_risks.json` | High-risk patient details | 7,083 patients | ~1.2 MB |
| `risk_summary.json` | Dashboard statistics | 15 keys | ~3 KB |
| `state_summary.json` | State-level metrics | 51 states | ~8 KB |
| `hospital_metrics.json` | Hospital-level data | 746 hospitals | ~45 KB |

```bash
# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn

# 2. Download UCI Diabetes dataset to data/raw/

# 3. Run analysis pipeline
python run_analysis_v2.py

# 4. Copy outputs to dashboard
cp data/processed/*.json dashboard/lib/

# 5. Start dashboard
cd dashboard && npm run dev
```