Data Analysis Methodology Report

Technical Documentation for ReadmitRisk Hospital Readmission Risk Prediction Pipeline

Important Notice: This project uses historical data (1999-2008) from the UCI Machine Learning Repository for demonstration and portfolio purposes. The risk model and patterns identified reflect healthcare practices from that era and may not accurately represent current clinical realities. Cost exposure calculations use 2024 industry benchmarks applied to historical risk scores. For production deployment, the model should be retrained on recent EHR data (e.g., MIMIC-IV 2008-2019).

1. Data Sources

1.1 Primary Dataset: UCI Diabetes 130-US Hospitals

Attribute Value
Source UCI Machine Learning Repository
Time Period 1999-2008 (10 years)
Hospitals 130 US hospitals
Raw Records 101,766 hospital encounters
Variables 50 features including demographics, diagnoses, medications, lab results
Target Variable Readmission status (NO, >30 days, <30 days)

1.2 Secondary Dataset: CMS Hospital Readmissions Reduction Program

Simulated state-level and hospital-level readmission rates and penalty data based on CMS HRRP patterns (see Section 7). Geographic data covers all 50 states plus the District of Columbia.

2. Data Cleaning and Preprocessing

2.1 Missing Value Analysis

Column Missing Rate Treatment Rationale
weight 97% Dropped Too sparse to impute meaningfully
payer_code 40% Dropped High missingness, administrative variable
medical_specialty 49% Dropped High missingness, many categories
race 2% Filled with "Unknown" Low missingness, categorical
Numeric features <1% Filled with median Preserves distribution
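
The treatments above can be sketched in pandas. A toy frame stands in for the raw encounters (values are illustrative, not real records); note the UCI file encodes missing values as "?":

```python
import pandas as pd
import numpy as np

# Toy frame with the relevant columns (illustrative values only)
df = pd.DataFrame({
    "weight": ["?", "?", "[75-100)"],
    "payer_code": ["?", "MC", "?"],
    "medical_specialty": ["?", "?", "Cardiology"],
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "num_lab_procedures": [41.0, np.nan, 59.0],
})

# The raw file marks missing values with "?"
df = df.replace("?", np.nan)

# Drop the sparse columns; fill the rest per the table above
df = df.drop(columns=["weight", "payer_code", "medical_specialty"])
df["race"] = df["race"].fillna("Unknown")
df["num_lab_procedures"] = df["num_lab_procedures"].fillna(
    df["num_lab_procedures"].median()
)
```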

2.2 Patient Deduplication

The raw dataset contains multiple encounters per patient. To ensure independence of observations and prevent data leakage, only one encounter per patient (identified by patient_nbr) was retained, reducing 101,766 encounters to 71,518 unique patients.
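
A minimal sketch of this step, assuming the encounters are loaded into a pandas DataFrame keyed by patient_nbr (column names follow the UCI schema; the example rows are hypothetical):

```python
import pandas as pd

# Hypothetical encounters: patient 101 appears twice
encounters = pd.DataFrame({
    "patient_nbr":  [101, 101, 202, 303],
    "encounter_id": [1, 2, 3, 4],
})

# Keep only the first encounter per patient so each row is independent
patients = (encounters.sort_values("encounter_id")
                      .drop_duplicates("patient_nbr", keep="first"))
```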

2.3 Target Variable Transformation

Original target had 3 categories. Transformed to binary for logistic regression:

Original Value Binary Value Count Percentage
"<30" (readmitted within 30 days) 1 (Positive) 6,293 8.8%
"NO" or ">30" 0 (Negative) 65,225 91.2%

Class Imbalance Note: The 8.8% positive class rate represents significant class imbalance, addressed via SMOTE oversampling.
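
The binarization is a one-liner in pandas:

```python
import pandas as pd

readmitted = pd.Series(["<30", "NO", ">30", "<30"])

# "<30" -> 1 (positive); "NO" and ">30" -> 0 (negative)
y = (readmitted == "<30").astype(int)
```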

3. Feature Engineering

3.1 Age Transformation

Original age values are categorical ranges. Converted to numeric midpoints:

age_mapping = {
    '[0-10)': 5,   '[10-20)': 15,  '[20-30)': 25,
    '[30-40)': 35, '[40-50)': 45,  '[50-60)': 55,
    '[60-70)': 65, '[70-80)': 75,  '[80-90)': 85,
    '[90-100)': 95
}
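
The mapping is applied with pandas' Series.map, e.g.:

```python
import pandas as pd

age_mapping = {
    '[0-10)': 5,   '[10-20)': 15,  '[20-30)': 25,
    '[30-40)': 35, '[40-50)': 45,  '[50-60)': 55,
    '[60-70)': 65, '[70-80)': 75,  '[80-90)': 85,
    '[90-100)': 95
}

ages = pd.Series(['[70-80)', '[0-10)', '[50-60)'])
age_numeric = ages.map(age_mapping)
```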

3.2 Derived Features

Feature Formula Clinical Rationale
total_visits number_outpatient + number_emergency + number_inpatient Overall healthcare utilization indicator
medication_intensity num_medications / (time_in_hospital + 1) Treatment complexity normalized by stay length
num_med_changes Count of medications with "Up" or "Down" dosage changes Medication optimization during stay
A1Cresult_abnormal 1 if A1Cresult is ">7" or ">8", else 0 Diabetes control indicator
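
The derived features can be computed directly from the base columns; a sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "number_outpatient": [2, 0],
    "number_emergency":  [1, 0],
    "number_inpatient":  [0, 3],
    "num_medications":   [12, 20],
    "time_in_hospital":  [3, 4],
})

df["total_visits"] = (df["number_outpatient"]
                      + df["number_emergency"]
                      + df["number_inpatient"])
# The +1 in the denominator avoids division by zero for very short stays
df["medication_intensity"] = df["num_medications"] / (df["time_in_hospital"] + 1)
```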

3.3 Final Feature Set

Numeric Features (12)

  • time_in_hospital
  • num_lab_procedures
  • num_procedures
  • num_medications
  • number_outpatient
  • number_emergency
  • number_inpatient
  • number_diagnoses
  • age_numeric
  • total_visits
  • medication_intensity
  • num_med_changes

Categorical Features (8)

  • race
  • gender
  • admission_type_id
  • discharge_disposition_id
  • admission_source_id
  • diabetesMed
  • change
  • A1Cresult_abnormal

One-hot encoded with drop_first=True
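
An encoding sketch with pandas; drop_first=True removes one level per feature to avoid perfect multicollinearity among the dummy columns:

```python
import pandas as pd

df = pd.DataFrame({
    "race":   ["Caucasian", "AfricanAmerican", "Unknown"],
    "gender": ["Female", "Male", "Female"],
})

# One column per category level, minus the first level of each feature
encoded = pd.get_dummies(df, columns=["race", "gender"], drop_first=True)
```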

4. Machine Learning Pipeline

4.1 Train-Test Split

Set Size Positive Class Negative Class
Training (80%) 57,214 5,034 (8.8%) 52,180 (91.2%)
Test (20%) 14,304 1,259 (8.8%) 13,045 (91.2%)

A stratified split was used to maintain class proportions in both sets.
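
With scikit-learn, stratification is the stratify argument of train_test_split. A small synthetic example with a 10% positive class, mirroring the imbalance here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)  # 10% positives

# stratify=y preserves the 10% positive rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```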

4.2 SMOTE Oversampling

Synthetic Minority Over-sampling Technique (SMOTE) applied to training data only:

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

Class Before SMOTE After SMOTE
Negative (0) 52,180 52,180
Positive (1) 5,034 52,180
Total 57,214 104,360

Why SMOTE? With only 8.8% positive class, a naive model predicting "no readmission" for all patients would achieve 91.2% accuracy but be clinically useless. SMOTE creates synthetic positive examples to help the model learn minority class patterns.

4.3 Model Selection and Training

Parameter Value Rationale
Algorithm Logistic Regression Interpretable coefficients for clinical explainability
Regularization L2 (Ridge) Prevents overfitting with many features
C (inverse regularization) 0.1 Moderate regularization strength
max_iter 1000 Ensure convergence
Scaling StandardScaler Zero mean, unit variance for each feature

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000, random_state=42, C=0.1)
model.fit(X_train_scaled, y_train_balanced)

4.4 Model Evaluation

Metric Value
ROC-AUC Score 0.564
Average Precision 0.110

Performance Context: The modest AUC of 0.564 is consistent with published literature on hospital readmission prediction. Readmissions are influenced by many factors outside the EHR (social determinants, outpatient care quality, patient compliance) that are not captured in this dataset.

5. Risk Score Calculation

5.1 Probability to Score Conversion

Risk Score = P(readmission | features) x 100

Where P(readmission | features) is the predicted probability from logistic regression's predict_proba() method.

5.2 Cost Exposure Estimation

Estimated Cost = (Risk Score / 100) x $15,000

The $15,000 figure represents the average cost of a hospital readmission based on industry benchmarks.
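
Both formulas together, using hypothetical predicted probabilities in place of model.predict_proba output:

```python
AVG_READMISSION_COST = 15_000  # industry benchmark used throughout this report

# Hypothetical P(readmission) values, as would come from predict_proba(X)[:, 1]
probs = [0.25, 0.5]

risk_scores = [p * 100 for p in probs]
estimated_costs = [(score / 100) * AVG_READMISSION_COST for score in risk_scores]
```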

5.3 Risk Stratification Tiers

Tier Risk Score Range Patient Count Percentage Action Level
Critical 80-100% 1,931 2.7% Immediate intervention
Very High 70-80% 2,232 3.1% Priority outreach
High 60-70% 2,920 4.1% Proactive monitoring
Moderate 40-60% 9,224 12.9% Standard care
Low 0-40% 55,211 77.2% Routine follow-up
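
The tiers map naturally onto pd.cut. The table's shared boundaries (e.g. a score of exactly 40) are treated as right-inclusive here, an assumption where the report leaves the edge cases unstated:

```python
import pandas as pd

scores = pd.Series([15.0, 45.0, 65.0, 75.0, 90.0])

tiers = pd.cut(
    scores,
    bins=[0, 40, 60, 70, 80, 100],
    labels=["Low", "Moderate", "High", "Very High", "Critical"],
    include_lowest=True,  # so a score of exactly 0 falls into "Low"
)
```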

6. Feature Importance Analysis

6.1 Top Risk-Increasing Factors

Rank Feature Coefficient Interpretation
1 Total Prior Visits +4.3579 High utilization = complex patient
2 Number of Medications +0.2819 Polypharmacy indicates severity
3 Lab Procedures +0.1021 More testing = diagnostic complexity
4 Number of Diagnoses +0.0078 Comorbidity burden

6.2 Protective Factors

Rank Feature Coefficient Interpretation
1 Outpatient Visits -3.1170 Continuity of care is protective
2 Prior Inpatient Visits -1.9585 Established care relationships
3 Emergency Visits -1.5405 Access to acute care when needed

Clinical Insight: The strong protective effect of outpatient visits (-3.117) suggests that patients with regular outpatient follow-up are less likely to be readmitted, even if they have complex conditions. This supports investment in transitional care and follow-up appointment scheduling.
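
Both tables come from ranking the fitted model's coefficients. A sketch with a tiny synthetic fit (feature names and data are illustrative, not the report's actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Outcome driven positively by feature 0 and negatively by feature 2
y = (X[:, 0] - X[:, 2] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

features = ["total_visits", "num_medications", "number_outpatient"]
# Sort features by coefficient, largest (risk-increasing) first
ranked = sorted(zip(features, model.coef_[0]), key=lambda t: t[1], reverse=True)
```

ranked[0] is then the strongest risk-increasing factor and ranked[-1] the strongest protective one.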

7. Geographic Data Generation

7.1 State-Level Data

State-level readmission rates and penalties are generated based on actual CMS HRRP patterns:

import numpy as np

for state_code, info in STATE_DATA.items():
    hospitals = info['hospitals']  # number of hospitals in the state

    # Add realistic variation around the state's base readmission rate
    rate_variation = np.random.uniform(-0.8, 0.8)
    avg_rate = info['base_rate'] + rate_variation

    # Calculate penalty based on rate
    if avg_rate > 15.5:
        penalty_pct = np.random.uniform(0.5, 2.0)
    elif avg_rate > 14.5:
        penalty_pct = np.random.uniform(0.2, 0.8)
    else:
        penalty_pct = np.random.uniform(0, 0.3)

    # Estimate total penalty (assumes ~$5M in Medicare payments per hospital)
    total_penalty = hospitals * 5_000_000 * (penalty_pct / 100)

7.2 Penalty Calculation Logic

CMS penalizes hospitals with Excess Readmission Ratios (ERR) greater than 1.0. Penalties are calculated as a percentage of Medicare payments, capped at 3% under HRRP.
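
A hypothetical sketch of that logic. The real CMS formula is payment-weighted and condition-specific; this only illustrates the ERR > 1.0 trigger and the 3% cap:

```python
def penalty_pct(err: float, scale: float = 10.0) -> float:
    """Map an Excess Readmission Ratio to a payment penalty percentage.

    Illustrative only: penalties apply when ERR > 1.0 and are capped at 3%.
    The `scale` factor is an assumption, not a CMS parameter.
    """
    if err <= 1.0:
        return 0.0
    return min(3.0, (err - 1.0) * scale)
```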

8. Output Data Verification

8.1 Data Integrity Checks

Validation Expected Actual Status
Total patients 71,518 71,518 PASS
High-risk patients (60%+) 7,083 7,083 PASS
Sum of risk tiers 7,083 1,931 + 2,232 + 2,920 = 7,083 PASS
Risk distribution sum 71,518 37,663 + 17,548 + 9,224 + 5,152 + 1,931 = 71,518 PASS
Total cost exposure $78,501,391.71 $25,096,147 + $25,025,321 + $28,379,923 = $78,501,391 PASS
States analyzed 51 51 PASS
Patient export count 7,083 7,083 PASS
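
These checks are straightforward to automate; a sketch using the figures from this report:

```python
# Tier counts from Section 5.3
tier_counts = {
    "Critical": 1_931, "Very High": 2_232, "High": 2_920,
    "Moderate": 9_224, "Low": 55_211,
}
TOTAL_PATIENTS = 71_518

# High-risk = risk score 60%+ (High, Very High, Critical)
high_risk = (tier_counts["Critical"] + tier_counts["Very High"]
             + tier_counts["High"])

assert sum(tier_counts.values()) == TOTAL_PATIENTS
assert high_risk == 7_083
```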

8.2 Output Files

File Description Records/Keys Size
patient_risks.json High-risk patient details 7,083 patients ~1.2 MB
risk_summary.json Dashboard statistics 15 keys ~3 KB
state_summary.json State-level metrics 51 states ~8 KB
hospital_metrics.json Hospital-level data 746 hospitals ~45 KB

9. Limitations and Considerations

  1. Data Age: UCI dataset is from 1999-2008; healthcare patterns may have evolved significantly
  2. Model Performance: AUC of 0.564 indicates modest predictive power, typical for readmission models
  3. Geographic Data: State-level metrics are simulated based on CMS patterns, not actual current data
  4. Cost Estimates: Based on $15,000 average; actual costs vary widely by condition and facility
  5. External Factors: Model cannot capture social determinants, patient compliance, or outpatient care quality
  6. Single Disease Focus: Dataset is diabetes-specific; generalization to other populations requires validation

10. Reproduction Steps

# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn

# 2. Download UCI Diabetes dataset to data/raw/

# 3. Run analysis pipeline
python run_analysis_v2.py

# 4. Copy outputs to dashboard
cp data/processed/*.json dashboard/lib/

# 5. Start dashboard
cd dashboard && npm run dev