Data Analysis Methodology Report

Technical Documentation for ReadmitRisk Hospital Readmission Risk Prediction Pipeline

Important Notice: This project uses historical data (1999-2008) from the UCI Machine Learning Repository for demonstration and portfolio purposes. The risk model and patterns identified reflect healthcare practices from that era and may not accurately represent current clinical realities. Cost exposure calculations use 2024 industry benchmarks applied to historical risk scores. For production deployment, the model should be retrained on recent EHR data (e.g., MIMIC-IV 2008-2019).

1. Data Sources

1.1 Primary Dataset: UCI Diabetes 130-US Hospitals

Attribute Value
Source UCI Machine Learning Repository
Time Period 1999-2008 (10 years)
Hospitals 130 US hospitals
Raw Records 101,766 hospital encounters
Variables 50 features including demographics, diagnoses, medications, lab results
Target Variable Readmission status (NO, >30 days, <30 days)

1.2 Secondary Dataset: CMS Hospital Readmissions Reduction Program

Simulated state-level and hospital-level readmission rates and penalty data based on CMS HRRP patterns (see Section 7). Geographic data covers all 50 states plus the District of Columbia.

2. Data Cleaning and Preprocessing

2.1 Missing Value Analysis

Column Missing Rate Treatment Rationale
weight 97% Dropped Too sparse to impute meaningfully
payer_code 40% Dropped High missingness, administrative variable
medical_specialty 49% Dropped High missingness, many categories
race 2% Filled with "Unknown" Low missingness, categorical
Numeric features <1% Filled with median Preserves distribution
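
The treatments above can be sketched in pandas. A toy frame stands in for the raw encounters (values are illustrative, not real records); note the UCI file encodes missing values as "?":

```python
import pandas as pd
import numpy as np

# Toy frame with the relevant columns (illustrative values only)
df = pd.DataFrame({
    "weight": ["?", "?", "[75-100)"],
    "payer_code": ["?", "MC", "?"],
    "medical_specialty": ["?", "?", "Cardiology"],
    "race": ["Caucasian", "?", "AfricanAmerican"],
    "num_lab_procedures": [41.0, np.nan, 59.0],
})

# The raw file marks missing values with "?"
df = df.replace("?", np.nan)

# Drop the sparse columns; fill the rest per the table above
df = df.drop(columns=["weight", "payer_code", "medical_specialty"])
df["race"] = df["race"].fillna("Unknown")
df["num_lab_procedures"] = df["num_lab_procedures"].fillna(
    df["num_lab_procedures"].median()
)
```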

2.2 Patient Deduplication

The raw dataset contains multiple encounters per patient. To ensure independence of observations and prevent data leakage, only one encounter per patient (identified by patient_nbr) was retained, reducing 101,766 encounters to 71,518 unique patients.
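
A minimal sketch of this step, assuming the encounters are loaded into a pandas DataFrame keyed by patient_nbr (column names follow the UCI schema; the example rows are hypothetical):

```python
import pandas as pd

# Hypothetical encounters: patient 101 appears twice
encounters = pd.DataFrame({
    "patient_nbr":  [101, 101, 202, 303],
    "encounter_id": [1, 2, 3, 4],
})

# Keep only the first encounter per patient so each row is independent
patients = (encounters.sort_values("encounter_id")
                      .drop_duplicates("patient_nbr", keep="first"))
```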

2.3 Target Variable Transformation

Original target had 3 categories. Transformed to binary for logistic regression:

Original Value Binary Value Count Percentage
"<30" (readmitted within 30 days) 1 (Positive) 6,293 8.8%
"NO" or ">30" 0 (Negative) 65,225 91.2%

Class Imbalance Note: The 8.8% positive class rate represents significant class imbalance, addressed via SMOTE oversampling.
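
The binarization is a one-liner in pandas:

```python
import pandas as pd

readmitted = pd.Series(["<30", "NO", ">30", "<30"])

# "<30" -> 1 (positive); "NO" and ">30" -> 0 (negative)
y = (readmitted == "<30").astype(int)
```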

3. Feature Engineering

3.1 Age Transformation

Original age values are categorical ranges. Converted to numeric midpoints:

age_mapping = {
    '[0-10)': 5,   '[10-20)': 15,  '[20-30)': 25,
    '[30-40)': 35, '[40-50)': 45,  '[50-60)': 55,
    '[60-70)': 65, '[70-80)': 75,  '[80-90)': 85,
    '[90-100)': 95
}
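
The mapping is applied with pandas' Series.map, e.g.:

```python
import pandas as pd

age_mapping = {
    '[0-10)': 5,   '[10-20)': 15,  '[20-30)': 25,
    '[30-40)': 35, '[40-50)': 45,  '[50-60)': 55,
    '[60-70)': 65, '[70-80)': 75,  '[80-90)': 85,
    '[90-100)': 95
}

ages = pd.Series(['[70-80)', '[0-10)', '[50-60)'])
age_numeric = ages.map(age_mapping)
```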

3.2 Derived Features

Feature Formula Clinical Rationale
total_visits number_outpatient + number_emergency + number_inpatient Overall healthcare utilization indicator
medication_intensity num_medications / (time_in_hospital + 1) Treatment complexity normalized by stay length
num_med_changes Count of medications with "Up" or "Down" dosage changes Medication optimization during stay
A1Cresult_abnormal 1 if A1Cresult is ">7" or ">8", else 0 Diabetes control indicator
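
The derived features can be computed directly from the base columns; a sketch with illustrative values:

```python
import pandas as pd

df = pd.DataFrame({
    "number_outpatient": [2, 0],
    "number_emergency":  [1, 0],
    "number_inpatient":  [0, 3],
    "num_medications":   [12, 20],
    "time_in_hospital":  [3, 4],
})

df["total_visits"] = (df["number_outpatient"]
                      + df["number_emergency"]
                      + df["number_inpatient"])
# The +1 in the denominator avoids division by zero for very short stays
df["medication_intensity"] = df["num_medications"] / (df["time_in_hospital"] + 1)
```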

3.3 Final Feature Set

Numeric Features (12)

  • time_in_hospital
  • num_lab_procedures
  • num_procedures
  • num_medications
  • number_outpatient
  • number_emergency
  • number_inpatient
  • number_diagnoses
  • age_numeric
  • total_visits
  • medication_intensity
  • num_med_changes

Categorical Features (8)

  • race
  • gender
  • admission_type_id
  • discharge_disposition_id
  • admission_source_id
  • diabetesMed
  • change
  • A1Cresult_abnormal

One-hot encoded with drop_first=True
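
An encoding sketch with pandas; drop_first=True removes one level per feature to avoid perfect multicollinearity among the dummy columns:

```python
import pandas as pd

df = pd.DataFrame({
    "race":   ["Caucasian", "AfricanAmerican", "Unknown"],
    "gender": ["Female", "Male", "Female"],
})

# One column per category level, minus the first level of each feature
encoded = pd.get_dummies(df, columns=["race", "gender"], drop_first=True)
```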

4. Machine Learning Pipeline

4.1 Train-Test Split

Set Size Positive Class Negative Class
Training (80%) 57,214 5,034 (8.8%) 52,180 (91.2%)
Test (20%) 14,304 1,259 (8.8%) 13,045 (91.2%)

A stratified split was used to maintain class proportions in both sets.
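
With scikit-learn, stratification is the stratify argument of train_test_split. A small synthetic example with a 10% positive class, mirroring the imbalance here:

```python
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(100).reshape(-1, 1)
y = np.array([1] * 10 + [0] * 90)  # 10% positives

# stratify=y preserves the 10% positive rate in both splits
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```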

4.2 SMOTE Oversampling

Synthetic Minority Over-sampling Technique (SMOTE) applied to training data only:

from imblearn.over_sampling import SMOTE
smote = SMOTE(random_state=42)
X_train_balanced, y_train_balanced = smote.fit_resample(X_train, y_train)

Class Before SMOTE After SMOTE
Negative (0) 52,180 52,180
Positive (1) 5,034 52,180
Total 57,214 104,360

Why SMOTE? With only 8.8% positive class, a naive model predicting "no readmission" for all patients would achieve 91.2% accuracy but be clinically useless. SMOTE creates synthetic positive examples to help the model learn minority class patterns.

4.3 Model Selection and Training

Parameter Value Rationale
Algorithm Logistic Regression Interpretable coefficients for clinical explainability
Regularization L2 (Ridge) Prevents overfitting with many features
C (inverse regularization) 0.1 Moderate regularization strength
max_iter 1000 Ensure convergence
Scaling StandardScaler Zero mean, unit variance for each feature

from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train_balanced)
X_test_scaled = scaler.transform(X_test)

model = LogisticRegression(max_iter=1000, random_state=42, C=0.1)
model.fit(X_train_scaled, y_train_balanced)

4.4 Model Evaluation

Metric Value
ROC-AUC Score 0.564
Average Precision 0.110

Performance Context: The modest AUC of 0.564 is consistent with published literature on hospital readmission prediction. Readmissions are influenced by many factors outside the EHR (social determinants, outpatient care quality, patient compliance) that are not captured in this dataset.

5. Risk Score Calculation

5.1 Probability to Score Conversion

Risk Score = P(readmission | features) x 100

Where P(readmission | features) is the predicted probability from logistic regression's predict_proba() method.

5.2 Cost Exposure Estimation

Estimated Cost = (Risk Score / 100) x $15,000

The $15,000 figure represents the average cost of a hospital readmission based on industry benchmarks.
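
Both formulas together, using hypothetical predicted probabilities in place of model.predict_proba output:

```python
AVG_READMISSION_COST = 15_000  # industry benchmark used throughout this report

# Hypothetical P(readmission) values, as would come from predict_proba(X)[:, 1]
probs = [0.25, 0.5]

risk_scores = [p * 100 for p in probs]
estimated_costs = [(score / 100) * AVG_READMISSION_COST for score in risk_scores]
```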

5.3 Risk Stratification Tiers

Tier Risk Score Range Patient Count Percentage Action Level
Critical 80-100% 1,931 2.7% Immediate intervention
Very High 70-80% 2,232 3.1% Priority outreach
High 60-70% 2,920 4.1% Proactive monitoring
Moderate 40-60% 9,224 12.9% Standard care
Low 0-40% 55,211 77.2% Routine follow-up
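
The tiers map naturally onto pd.cut. The table's shared boundaries (e.g. a score of exactly 40) are treated as right-inclusive here, an assumption where the report leaves the edge cases unstated:

```python
import pandas as pd

scores = pd.Series([15.0, 45.0, 65.0, 75.0, 90.0])

tiers = pd.cut(
    scores,
    bins=[0, 40, 60, 70, 80, 100],
    labels=["Low", "Moderate", "High", "Very High", "Critical"],
    include_lowest=True,  # so a score of exactly 0 falls into "Low"
)
```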

6. Feature Importance Analysis

6.1 Top Risk-Increasing Factors

Rank Feature Coefficient Interpretation
1 Total Prior Visits +4.3579 High utilization = complex patient
2 Number of Medications +0.2819 Polypharmacy indicates severity
3 Lab Procedures +0.1021 More testing = diagnostic complexity
4 Number of Diagnoses +0.0078 Comorbidity burden

6.2 Protective Factors

Rank Feature Coefficient Interpretation
1 Outpatient Visits -3.1170 Continuity of care is protective
2 Prior Inpatient Visits -1.9585 Established care relationships
3 Emergency Visits -1.5405 Access to acute care when needed

Clinical Insight: The strong protective effect of outpatient visits (-3.117) suggests that patients with regular outpatient follow-up are less likely to be readmitted, even if they have complex conditions. This supports investment in transitional care and follow-up appointment scheduling.
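
Both tables come from ranking the fitted model's coefficients. A sketch with a tiny synthetic fit (feature names and data are illustrative, not the report's actual model):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(500, 3))
# Outcome driven positively by feature 0 and negatively by feature 2
y = (X[:, 0] - X[:, 2] + rng.normal(size=500) > 0).astype(int)

model = LogisticRegression(max_iter=1000).fit(X, y)

features = ["total_visits", "num_medications", "number_outpatient"]
# Sort features by coefficient, largest (risk-increasing) first
ranked = sorted(zip(features, model.coef_[0]), key=lambda t: t[1], reverse=True)
```

ranked[0] is then the strongest risk-increasing factor and ranked[-1] the strongest protective one.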

7. Geographic Data Generation

7.1 State-Level Data

State-level readmission rates and penalties are generated based on actual CMS HRRP patterns:

import numpy as np

for state_code, info in STATE_DATA.items():
    hospitals = info['hospitals']  # number of hospitals in the state

    # Add realistic variation around the state's base readmission rate
    rate_variation = np.random.uniform(-0.8, 0.8)
    avg_rate = info['base_rate'] + rate_variation

    # Calculate penalty based on rate
    if avg_rate > 15.5:
        penalty_pct = np.random.uniform(0.5, 2.0)
    elif avg_rate > 14.5:
        penalty_pct = np.random.uniform(0.2, 0.8)
    else:
        penalty_pct = np.random.uniform(0, 0.3)

    # Estimate total penalty (assumes ~$5M in Medicare payments per hospital)
    total_penalty = hospitals * 5_000_000 * (penalty_pct / 100)

7.2 Penalty Calculation Logic

CMS penalizes hospitals with Excess Readmission Ratios (ERR) greater than 1.0. Penalties are calculated as a percentage of Medicare payments, capped at 3% under HRRP.
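
A hypothetical sketch of that logic. The real CMS formula is payment-weighted and condition-specific; this only illustrates the ERR > 1.0 trigger and the 3% cap:

```python
def penalty_pct(err: float, scale: float = 10.0) -> float:
    """Map an Excess Readmission Ratio to a payment penalty percentage.

    Illustrative only: penalties apply when ERR > 1.0 and are capped at 3%.
    The `scale` factor is an assumption, not a CMS parameter.
    """
    if err <= 1.0:
        return 0.0
    return min(3.0, (err - 1.0) * scale)
```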

8. Output Data Verification

8.1 Data Integrity Checks

Validation Expected Actual Status
Total patients 71,518 71,518 PASS
High-risk patients (60%+) 7,083 7,083 PASS
Sum of risk tiers 7,083 1,931 + 2,232 + 2,920 = 7,083 PASS
Risk distribution sum 71,518 37,663 + 17,548 + 9,224 + 5,152 + 1,931 = 71,518 PASS
Total cost exposure $78,501,391.71 $25,096,147 + $25,025,321 + $28,379,923 = $78,501,391 PASS
States analyzed 51 51 PASS
Patient export count 7,083 7,083 PASS
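
These checks are straightforward to automate; a sketch using the figures from this report:

```python
# Tier counts from Section 5.3
tier_counts = {
    "Critical": 1_931, "Very High": 2_232, "High": 2_920,
    "Moderate": 9_224, "Low": 55_211,
}
TOTAL_PATIENTS = 71_518

# High-risk = risk score 60%+ (High, Very High, Critical)
high_risk = (tier_counts["Critical"] + tier_counts["Very High"]
             + tier_counts["High"])

assert sum(tier_counts.values()) == TOTAL_PATIENTS
assert high_risk == 7_083
```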

8.2 Output Files

File Description Records/Keys Size
patient_risks.json High-risk patient details 7,083 patients ~1.2 MB
risk_summary.json Dashboard statistics 15 keys ~3 KB
state_summary.json State-level metrics 51 states ~8 KB
hospital_metrics.json Hospital-level data 746 hospitals ~45 KB

9. Limitations and Considerations

  1. Data Age: UCI dataset is from 1999-2008; healthcare patterns may have evolved significantly
  2. Model Performance: AUC of 0.564 indicates modest predictive power, typical for readmission models
  3. Geographic Data: State-level metrics are simulated based on CMS patterns, not actual current data
  4. Cost Estimates: Based on $15,000 average; actual costs vary widely by condition and facility
  5. External Factors: Model cannot capture social determinants, patient compliance, or outpatient care quality
  6. Single Disease Focus: Dataset is diabetes-specific; generalization to other populations requires validation

10. Reproduction Steps

# 1. Install dependencies
pip install pandas numpy scikit-learn imbalanced-learn

# 2. Download UCI Diabetes dataset to data/raw/

# 3. Run analysis pipeline
python run_analysis_v2.py

# 4. Copy outputs to dashboard
cp data/processed/*.json dashboard/lib/

# 5. Start dashboard
cd dashboard && npm run dev