Assignment 2 - Data science
To prepare the dataset for machine learning modeling, we handled all missing values using a
structured imputation strategy. This ensures the final dataset is free of NaN values, which most
scikit-learn models cannot process. Our strategy was tailored to the type and clinical meaning of
each variable:
1. Categorical Imputation
• For the two ADL variables (From Patient and From Relative), which represent ordered
levels of patient independence, we imputed missing values using the mode (most
frequent value). This is appropriate because these variables are ordinal and the mode
helps maintain interpretability while reflecting the most common functional level in the
data.
2. Numeric Imputation
• A number of clinical test results had a substantial number of missing values (e.g., Glucose, Urine Output, P/F Ratio). For each of these variables, we first converted the values to numeric with pd.to_numeric(errors='coerce') and then imputed the remaining missing values with the median, which is robust to outliers.
✅ Summary
The resulting cleaned dataset (cleaned_patient_data.pkl) is now ready for modeling in Question
2.
Features are scaled for Logistic Regression and KNN, since these models are sensitive to differences in scale (scale affects distance computations). Since trees split on thresholds, there is no need to scale for them.
For KNN, K was chosen according to the rule of thumb K = sqrt(n); with n = 7923 this gives K = 89. We also tested values in the range K ± 20.
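The rule-of-thumb choice of K can be written out as a small sketch (n = 7923 is the training-set size stated above; the ± 20 search range is likewise from the text):

```python
import math

n_train = 7923                        # number of training samples (from the text)
k_rule = round(math.sqrt(n_train))    # sqrt-of-n rule of thumb
print(k_rule)                         # 89

# Also evaluate neighbouring values, K ± 20 around the rule of thumb
candidate_ks = range(k_rule - 20, k_rule + 21)
```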
Random forests and Logistic Regression display similar accuracy, with the random forest slightly outperforming. The similar performance suggests some linearity in our data. Considering that Logistic Regression is sensitive to NaN values, and our classification dataframe contains NaN values, the random forest is the superior model in this case. The relatively high accuracy of the Logistic Regression may also be an artifact of the imputation strategy: median imputation for numeric features creates artificially "smoother" distributions, which can bias the Logistic Regression model but not the random forest, since tree splits depend only on the ordering of values rather than their exact magnitudes.
Because of this, the random forest model is preferred.
To predict whether a patient will die during their hospital stay, we tested and compared four
classification models: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree,
and Random Forest. The goal was to identify the model with the highest accuracy while
maintaining interpretability and robustness.
Data Preparation
✅ Imputation Strategy
1. Categorical Variables:
o income_bracket: Missing entries were filled with "Unknown" to retain rows
without injecting bias.
o years_of_education: Filled with the median, which is robust to outliers.
2. Ordinal Variables (ADL):
o Activities of Daily Living (From Patient) and Activities of Daily Living (From
Relative) were imputed using the mode, preserving their ordinal nature.
3. Numeric Clinical Features:
o For features such as Urine Output, Blood Urea Nitrogen, Serum Albumin,
etc., missing values were imputed using the median after converting values to
numeric using pd.to_numeric(errors='coerce').
This allowed us to retain almost all patient rows while minimizing potential bias, resulting in a
complete, clean dataset (cleaned_patient_data.pkl).
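The imputation strategy above can be sketched as follows (a minimal sketch: `df` is assumed to be the raw patient DataFrame, and the column lists are representative, not exhaustive):

```python
import pandas as pd

def impute_patient_data(df: pd.DataFrame) -> pd.DataFrame:
    """Apply the three-part imputation strategy described above to a copy of df."""
    df = df.copy()

    # 1. Categorical: "Unknown" keeps rows without injecting a value;
    #    median for years of education is robust to outliers
    df["income_bracket"] = df["income_bracket"].fillna("Unknown")
    df["years_of_education"] = df["years_of_education"].fillna(
        df["years_of_education"].median()
    )

    # 2. Ordinal ADL variables: the mode preserves the ordinal scale
    for col in ["Activities of Daily Living (From Patient)",
                "Activities of Daily Living (From Relative)"]:
        df[col] = df[col].fillna(df[col].mode().iloc[0])

    # 3. Numeric clinical features: coerce to numeric, then median-impute
    for col in ["Urine Output", "Blood Urea Nitrogen", "Serum Albumin"]:
        df[col] = pd.to_numeric(df[col], errors="coerce")
        df[col] = df[col].fillna(df[col].median())

    return df
```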
We excluded patients admitted on or after Jan 1, 2023 (as these make up the classification set)
and removed admission_date and patient_id for training.
Feature scaling was applied only for KNN and Logistic Regression, as these models are
sensitive to feature magnitude. Scaling was unnecessary for decision trees and random forests,
which split data based on thresholds.
We trained and evaluated all models using stratified 80/20 train-test splits,
measuring accuracy as the performance metric.
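The evaluation protocol above can be sketched with scikit-learn as follows (synthetic stand-in data; in the assignment, `X` and `y` come from cleaned_patient_data.pkl):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Stand-in data; the real X, y are the cleaned patient features and labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)

# Stratified 80/20 split keeps the class balance identical in both sets
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Scaling only for the scale-sensitive models (LR, KNN); trees skip it
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)
acc = accuracy_score(y_test, model.predict(X_test))
```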
1. Logistic Regression
4. Random Forest
• With n_estimators=100, accuracy reached 0.8031, the highest among all models.
• A test of multiple values (n = 10, 50, 100, 200, 300) showed that 100–200
estimators consistently provided strong results.
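The n_estimators comparison above can be reproduced with a sketch like the following (synthetic stand-in data; the real run used the cleaned patient dataset and its stratified split):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

# Stand-in data for the patient features and mortality labels
X, y = make_classification(n_samples=1000, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Test accuracy for each candidate forest size mentioned in the text
results = {}
for n in (10, 50, 100, 200, 300):
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)
    results[n] = rf.score(X_test, y_test)
```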
The decision tree visualizations (e.g., Tree #1 from the forest) provided useful interpretability
while maintaining the ensemble model's robustness.
Conclusion
We chose Random Forest as the final model for predicting in-hospital mortality. It achieved
the highest accuracy (0.8031), balanced performance across different configurations, and is
more resilient to the effects of imputation and feature scaling. Hyperparameter tuning
of n_estimators confirmed the model’s stability and optimal performance near 100–200 trees.
This model is best suited for the task of mortality prediction in this heterogeneous, high-
dimensional medical dataset.
2.4 Feature Importance
To understand which variables had the greatest influence on our model's predictions, we
extracted the feature importances from the trained Random Forest classifier.
These features are clinically meaningful indicators of a patient’s overall condition and mortality
risk:
• Simplified Acute Physiology Score III (SAPS III) is a well-established scoring system
for predicting ICU mortality and integrates multiple clinical variables (vitals, labs, etc.),
making it a strong overall predictor.
• SPS Score likely reflects severity of illness, organ dysfunction, or systemic stress — all
critical mortality determinants.
• Mean Arterial Blood Pressure helps assess circulatory stability. Persistent hypotension
is often a sign of shock or critical illness.
• White Blood Cell Count is a key marker of infection, inflammation, and immune
response, commonly linked to conditions like sepsis.
• Age is a fundamental risk factor, as older patients generally face worse outcomes due to
comorbidities and frailty.
Together, these features capture physiological severity, organ function, immune status, and
patient vulnerability — which helps explain why they are so predictive in estimating the
likelihood of in-hospital death.
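Extracting and ranking the importances amounts to a few lines (a sketch with synthetic stand-in data; in the assignment, `rf` is the fitted Random Forest and `feature_names` the training columns):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# Stand-in data and placeholder names for the clinical features
X, y = make_classification(n_samples=500, n_features=5, random_state=0)
feature_names = [f"feature_{i}" for i in range(5)]

rf = RandomForestClassifier(n_estimators=100, random_state=42).fit(X, y)

# Importances sum to 1; sorting descending surfaces the strongest predictors
importances = pd.Series(rf.feature_importances_, index=feature_names)
top = importances.sort_values(ascending=False)
```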
Justification of Chosen Cut-Off Threshold
In this task, the goal is to maximize KindCorp’s profit by deciding which patient insurance
claims to offload to EvilCorp. Each patient who dies in hospital results in a cost of €500,000 to
KindCorp, unless their policy has been offloaded for a €150,000 fee. This setup makes False
Negatives (missed deaths) extremely costly, whereas False Positives (unnecessarily offloaded
survivors) are relatively inexpensive.
To account for this imbalance in cost, we optimized the model’s classification threshold not for
accuracy, but for profit. The analysis shows that the optimal cut-off lies around 0.19,
significantly lower than the default threshold of 0.5. This means that patients with even a
moderate predicted probability of dying are flagged for offloading.
This low threshold is intentional: it increases sensitivity (recall) and reduces the number of false
negatives. While this leads to more false positives, the economic trade-off is favorable, since
avoiding a €500,000 loss is worth incurring a few extra €150,000 payments. In short, the model
is biased towards caution — preferring to offload “risky” patients — because the financial
impact of missing a death is far greater than offloading a survivor.
By setting the threshold to 0.19, we ensure that the classification decisions are aligned not just
with model performance, but with business value.
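The threshold search described above can be sketched as follows (costs from the text: €500,000 per retained patient who dies, €150,000 per offloaded policy; the toy `y_true` and `p_death` stand in for the model's predicted probabilities):

```python
import numpy as np

COST_DEATH = 500_000    # paid when a retained patient dies (false negative)
COST_OFFLOAD = 150_000  # fee for offloading a policy (true or false positive)

def total_cost(y_true, p_death, threshold):
    """Cost to KindCorp when offloading every patient with p >= threshold."""
    offload = p_death >= threshold
    # Offloaded patients cost the fee; retained patients cost 500k if they die
    return COST_OFFLOAD * offload.sum() + COST_DEATH * (y_true & ~offload).sum()

# Toy example; the assignment used the Random Forest's predicted probabilities
y_true = np.array([1, 0, 1, 0, 0], dtype=bool)
p_death = np.array([0.9, 0.3, 0.25, 0.1, 0.05])

thresholds = np.linspace(0.01, 0.99, 99)
costs = [total_cost(y_true, p_death, t) for t in thresholds]
best = thresholds[int(np.argmin(costs))]
```

Even in this toy example the cost-minimizing threshold lands well below 0.5, for the same reason as in the report: a missed death is more than three times as expensive as an unnecessary offload.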
Note: the analysis in Q3.3 is performed on the train/test data, since the outcomes for these patients are already known.
Imputation
To address missing values in the dataset — particularly in key clinical variables — we
implemented a K-Nearest Neighbors (KNN) imputation strategy focused exclusively
on numeric features. This was followed by proper encoding of categorical variables and
preparation for machine learning modeling.
Step-by-Step Explanation:
1. Restrict the imputation to numeric features.
2. Apply KNN imputation, filling each missing value from the values of the most similar patients.
3. Encode the categorical variables.
4. Assemble the final feature matrix for modeling.
This approach allowed us to preserve all patients and all relevant numeric features,
including White Blood Cell Count, while improving model robustness. Compared to simpler
strategies like median imputation, KNN better captures the multivariate structure of the data —
and in this case, resulted in the best model accuracy and ROC AUC performance.
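A minimal sketch of the numeric-only KNN imputation step, using scikit-learn's KNNImputer (the toy columns stand in for the clinical variables; n_neighbors is illustrative):

```python
import pandas as pd
from sklearn.impute import KNNImputer

# Toy numeric frame; in the assignment these are the clinical variables
df = pd.DataFrame({
    "White Blood Cell Count": [8.0, None, 12.0, 9.0],
    "Serum Albumin": [3.5, 4.0, None, 3.8],
})

# KNN imputation is applied exclusively to numeric features
numeric_cols = df.select_dtypes(include="number").columns

# Each missing value is filled from the k most similar rows,
# using the multivariate structure rather than a per-column median
imputer = KNNImputer(n_neighbors=2)
df[numeric_cols] = imputer.fit_transform(df[numeric_cols])
```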