Chat, Assignment 2 - Data science

The document outlines a structured imputation strategy for handling missing values in a dataset used for predicting in-hospital mortality, ensuring no NaN values remain. It details the methods for categorical, ordinal, and numeric imputation, emphasizing the importance of retaining data integrity and clinical interpretability. Ultimately, a Random Forest model was chosen for its superior accuracy and robustness against imputation artifacts compared to other models tested.

Imputation Strategy Explanation

To prepare the dataset for machine learning modeling, we handled all missing values using a
structured imputation strategy. This ensures the final dataset is free of NaN values, which most
scikit-learn models cannot process. Our strategy was tailored to the type and clinical meaning of
each variable:

1. Categorical Imputation

• income_bracket: This variable is categorical, with missing values likely due to unrecorded or sensitive financial information. We filled missing entries with the placeholder "Unknown" to retain these rows without biasing the distribution toward the more common categories.
• years_of_education: As a numeric socio-demographic feature, we imputed missing values using the median, which is robust to outliers and preserves the overall distribution.
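
A minimal sketch of these two fills, assuming the data sits in a pandas DataFrame named df with the column names used above:

```python
import pandas as pd

# Categorical: keep the rows by filling with an explicit "Unknown" category
df["income_bracket"] = df["income_bracket"].fillna("Unknown")

# Numeric socio-demographic feature: the median is robust to outliers
df["years_of_education"] = df["years_of_education"].fillna(
    df["years_of_education"].median()
)
```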

2. Ordinal Imputation – Activities of Daily Living (ADL)

• For the two ADL variables (From Patient and From Relative), which represent ordered
levels of patient independence, we imputed missing values using the mode (most
frequent value). This is appropriate because these variables are ordinal and the mode
helps maintain interpretability while reflecting the most common functional level in the
data.
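
A short sketch of the mode fill; the exact ADL column names here are an assumption:

```python
# .mode() can return ties, so take the first (most frequent) value
for col in ["Activities of Daily Living (From Patient)",
            "Activities of Daily Living (From Relative)"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```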

3. Numeric Imputation for Clinical Measurements

Several clinical test results had a substantial number of missing values (e.g., Glucose, Urine Output, P/F Ratio). For each of these variables, we:

• First ensured they were correctly converted to numeric format using pd.to_numeric() with errors='coerce'. This step was necessary because many of these values had been read in as strings (e.g., due to inconsistent data entry).
• Then used median imputation to fill in the missing values. The median is robust to outliers and maintains the central tendency of each feature without skewing its distribution.

Variables handled in this step include:

• Urine Output, Glucose, Blood Urea Nitrogen, Serum Albumin, Bilirubin Level, P/F Ratio, Arterial Blood PH, and Serum Creatinine Level.
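
A sketch of the two-step clean-up, looping over the listed columns (names copied from the report):

```python
clinical_cols = [
    "Urine Output", "Glucose", "Blood Urea Nitrogen", "Serum Albumin",
    "Bilirubin Level", "P/F Ratio", "Arterial Blood PH",
    "Serum Creatinine Level",
]

for col in clinical_cols:
    # Coerce stray strings to NaN so the column becomes numeric
    df[col] = pd.to_numeric(df[col], errors="coerce")
    # Median imputation: robust to the heavy tails typical of lab values
    df[col] = df[col].fillna(df[col].median())
```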

Any remaining variables with unresolved missing values are dropped at a later stage.

✅ Summary

This imputation strategy allows us to:

• Retain as much data as possible (especially for valuable patient rows)
• Avoid introducing bias from overly aggressive row-dropping
• Maintain clinical interpretability and consistency across features

The resulting cleaned dataset (cleaned_patient_data.pkl) is now ready for modeling in Question
2.

Features are scaled for LR and KNN, since these models are sensitive to feature scale (scale affects distance calculations and coefficient magnitudes). Tree-based models split on thresholds, so scaling is not needed for them.

In LR, the regularization strength is left at its default (C = 1.0, the inverse of the L2 penalty strength).

For KNN, k is chosen by the rule of thumb k ≈ √n; with n = 7923, this gives k ≈ 89. We also tested values within about ±20 of this.

Conclusion (before hyperparameter tuning):

Random Forest and LR display similar accuracy, with RF slightly ahead; the strength of the linear baseline suggests a degree of linearity in our data. However, LR cannot handle NaN values, and the classification DataFrame still contains them, so Random Forest is the more practical model in this case. The relatively high LR accuracy may also be partly an artifact of the imputation strategy: median imputation produces smoother, less variable numeric features, which can flatter a linear model, whereas threshold-based trees are largely insensitive to such shifts. Because of this, the Random Forest model is preferred.


2.1 Predicting In-Hospital Mortality – Model Choice, Variables, and Tuning

To predict whether a patient will die during their hospital stay, we tested and compared four
classification models: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree,
and Random Forest. The goal was to identify the model with the highest accuracy while
maintaining interpretability and robustness.

Data Preparation

We began by constructing a comprehensive patient dataset by joining the Patient, Study, and PatientExamination tables. The resulting dataset was wide-format, with one row per patient and multiple clinical, demographic, and physiological variables as columns.

✅ Imputation Strategy

Missing values were handled through a structured strategy:

1. Categorical Variables:
o income_bracket: Missing entries were filled with "Unknown" to retain rows
without injecting bias.
o years_of_education: Filled with the median, which is robust to outliers.
2. Ordinal Variables (ADL):
o Activities of Daily Living (From Patient) and Activities of Daily Living (From Relative) were imputed using the mode, preserving their ordinal nature.
3. Numeric Clinical Features:
o For features such as Urine Output, Blood Urea Nitrogen, Serum Albumin,
etc., missing values were imputed using the median after converting values to
numeric using pd.to_numeric(errors='coerce').

This allowed us to retain almost all patient rows while minimizing potential bias, resulting in a
complete, clean dataset (cleaned_patient_data.pkl).

Feature Engineering and Scaling

Categorical variables were numerically encoded:

• gender: Male = 0, Female = 1
• Has Cancer: No = 0, Yes = 1, Metastatic = 2
• income_bracket: Very Low = 0 → High = 3, Unknown = 4
• Zodiac Sign: encoded alphabetically (e.g., Capricorn = 3, Pisces = 7)

We excluded patients admitted on or after Jan 1, 2023 (as these make up the classification set)
and removed admission_date and patient_id for training.
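
A sketch of this encoding and filtering; the two intermediate income labels ("Low", "Medium") and the datetime handling are assumptions:

```python
import pandas as pd

df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df["Has Cancer"] = df["Has Cancer"].map({"No": 0, "Yes": 1, "Metastatic": 2})
df["income_bracket"] = df["income_bracket"].map(
    {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3, "Unknown": 4})

# Alphabetical integer codes for the twelve signs (Capricorn -> 3, Pisces -> 7)
signs = sorted(df["Zodiac Sign"].dropna().unique())
df["Zodiac Sign"] = df["Zodiac Sign"].map({s: i for i, s in enumerate(signs)})

# Patients admitted on/after 2023-01-01 form the classification set
df["admission_date"] = pd.to_datetime(df["admission_date"])
train_df = df[df["admission_date"] < "2023-01-01"]
train_df = train_df.drop(columns=["admission_date", "patient_id"])
```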

Feature scaling was applied only for KNN and Logistic Regression, as these models are
sensitive to feature magnitude. Scaling was unnecessary for decision trees and random forests,
which split data based on thresholds.
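
A minimal sketch of the scaling step, fitting the scaler on the training split only; the X_train/X_test matrices (defined by the split in the next subsection) are assumed:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters
```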

Model Comparison and Tuning

We trained and evaluated all models using stratified 80/20 train-test splits,
measuring accuracy as the performance metric.
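
A sketch of the evaluation setup, assuming X holds the features and y the in-hospital death label (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# Stratify on y so both splits keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```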

1. Logistic Regression

• The default L2-penalized model (C=1.0) achieved ~0.8013 accuracy.
• We also tested L1 regularization (LASSO) for feature selection. Accuracy decreased slightly (~0.799–0.788), but fewer features were retained at lower C values.
• Observation: LASSO did not improve performance but confirmed that many features contribute meaningfully to predictions. A sketch of both variants follows.
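
A sketch of the two logistic regression variants; the exact C grid is an assumption:

```python
from sklearn.linear_model import LogisticRegression

# Default L2-penalized baseline (C=1.0)
lr_l2 = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print("L2 accuracy:", lr_l2.score(X_test_scaled, y_test))

# L1 (LASSO) variants: stronger penalties zero out more coefficients
for C in [0.01, 0.1, 1.0]:
    lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lr_l1.fit(X_train_scaled, y_train)
    n_kept = int((lr_l1.coef_ != 0).sum())  # features surviving the penalty
    print(f"C={C}: accuracy={lr_l1.score(X_test_scaled, y_test):.4f}, "
          f"features kept={n_kept}")
```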

2. K-Nearest Neighbors (KNN)


• Using the rule of thumb k = √n, we selected k=89 (n ≈ 8000).
• Accuracy: ~0.7748
• Grid test from k=84 to 94 showed stable but slightly lower accuracy compared to LR and
RF.
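
A sketch of the rule-of-thumb choice and the surrounding grid, reusing the scaled splits from above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

n = len(X_train) + len(X_test)
print("rule-of-thumb k:", round(np.sqrt(n)))   # ~89 for n ≈ 8000

# Small grid around sqrt(n), matching the k = 84..94 test above
for k in range(84, 95):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
    print(f"k={k}: accuracy={knn.score(X_test_scaled, y_test):.4f}")
```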

3. Decision Tree

• A single unpruned tree yielded accuracy of ~0.7174.
• While interpretable, it overfitted the training data and underperformed relative to ensemble methods.

4. Random Forest ✅ (Chosen Model)

• With n_estimators=100, accuracy reached 0.8031, the highest among all models.
• A test of multiple values (n = 10, 50, 100, 200, 300) showed that 100–200
estimators consistently provided strong results.
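
A sketch of the n_estimators sweep (the seed is an assumption; trees need no feature scaling):

```python
from sklearn.ensemble import RandomForestClassifier

for n in [10, 50, 100, 200, 300]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)          # unscaled features are fine for trees
    print(f"n_estimators={n}: accuracy={rf.score(X_test, y_test):.4f}")
```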

Why Random Forest Was Chosen

Although Logistic Regression performed similarly in accuracy, Random Forest was ultimately chosen because:

• It is more robust to imputation artifacts: LR performance may be inflated by the smoothing effect of median imputation (which biases linear models), while tree-based models are less sensitive to this.
• RF handles missing structure and nonlinear interactions better than LR.
• Unlike LR, RF does not assume linearity or feature independence, assumptions that are unlikely to hold in a real-world ICU dataset.
• LR is more sensitive to scaling, NaN values, and violations of its distributional assumptions.

The decision tree visualizations (e.g., Tree #1 from the forest) provided useful interpretability
while maintaining the ensemble model's robustness.

Conclusion

We chose Random Forest as the final model for predicting in-hospital mortality. It achieved
the highest accuracy (0.8031), balanced performance across different configurations, and is
more resilient to the effects of imputation and feature scaling. Hyperparameter tuning
of n_estimators confirmed the model’s stability and optimal performance near 100–200 trees.

This model is best suited for the task of mortality prediction in this heterogeneous, high-dimensional medical dataset.

2.4 Feature Importance

To understand which variables had the greatest influence on our model's predictions, we
extracted the feature importances from the trained Random Forest classifier.

The five most important features were:

1. Simplified Acute Physiology Score III
2. SPS Score
3. Mean Arterial Blood Pressure
4. White Blood Cell Count
5. Age
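
A sketch of how this ranking can be read off the fitted forest, assuming rf is the trained RandomForestClassifier and X_train a DataFrame:

```python
import pandas as pd

# Impurity-based importances, one value per feature, highest first
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
```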

These features are clinically meaningful indicators of a patient’s overall condition and mortality
risk:

• Simplified Acute Physiology Score III (SAPS III) is a well-established scoring system
for predicting ICU mortality and integrates multiple clinical variables (vitals, labs, etc.),
making it a strong overall predictor.
• SPS Score likely reflects severity of illness, organ dysfunction, or systemic stress — all
critical mortality determinants.
• Mean Arterial Blood Pressure helps assess circulatory stability. Persistent hypotension
is often a sign of shock or critical illness.
• White Blood Cell Count is a key marker of infection, inflammation, and immune
response, commonly linked to conditions like sepsis.
• Age is a fundamental risk factor, as older patients generally face worse outcomes due to
comorbidities and frailty.

Together, these features capture physiological severity, organ function, immune status, and patient vulnerability, which helps explain why they are so predictive of the likelihood of in-hospital death.

Justification of the Chosen Cut-Off Threshold

In this task, the goal is to maximize KindCorp’s profit by deciding which patient insurance
claims to offload to EvilCorp. Each patient who dies in hospital results in a cost of €500,000 to
KindCorp, unless their policy has been offloaded for a €150,000 fee. This setup makes False
Negatives (missed deaths) extremely costly, whereas False Positives (unnecessarily offloaded
survivors) are relatively inexpensive.

To account for this imbalance in cost, we optimized the model’s classification threshold not for
accuracy, but for profit. The analysis shows that the optimal cut-off lies around 0.19,
significantly lower than the default threshold of 0.5. This means that patients with even a
moderate predicted probability of dying are flagged for offloading.

This low threshold is intentional: it increases sensitivity (recall) and reduces the number of false
negatives. While this leads to more false positives, the economic trade-off is favorable, since
avoiding a €500,000 loss is worth incurring a few extra €150,000 payments. In short, the model
is biased towards caution — preferring to offload “risky” patients — because the financial
impact of missing a death is far greater than offloading a survivor.

By setting the threshold to 0.19, we ensure that the classification decisions are aligned not just
with model performance, but with business value.
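
A sketch of the profit-driven threshold search under the stated costs; the variable names are assumptions:

```python
import numpy as np

COST_DEATH = 500_000    # paid for each in-hospital death we keep
COST_OFFLOAD = 150_000  # fee per policy offloaded to EvilCorp

probs = rf.predict_proba(X_test)[:, 1]   # predicted P(death) per patient
y = np.asarray(y_test)

best_t, best_cost = None, np.inf
for t in np.arange(0.01, 1.00, 0.01):
    offload = probs >= t
    # Total cost = offloading fees + deaths among the patients we kept
    cost = (COST_OFFLOAD * offload.sum()
            + COST_DEATH * (y[~offload] == 1).sum())
    if cost < best_cost:
        best_t, best_cost = t, cost

print(f"optimal threshold ≈ {best_t:.2f}, total cost = EUR {best_cost:,}")
```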

Note: Q3.3 is performed on the train/test data, since the true outcomes are already known for those patients.

KNN-Based Imputation
To address missing values in the dataset — particularly in key clinical variables — we
implemented a K-Nearest Neighbors (KNN) imputation strategy focused exclusively
on numeric features. This was followed by proper encoding of categorical variables and
preparation for machine learning modeling.

Step-by-Step Explanation:

1. Ensuring WBC is Numeric:
o Before imputation, we converted the "White Blood Cell Count" column explicitly to numeric using pd.to_numeric(), ensuring that it was correctly included in the list of imputed features.
2. Selection of Numeric Features:
o We extracted all columns of numerical type
using df.select_dtypes(include=['number']). This ensured that only
continuous or ordinal variables — appropriate for distance-based imputation —
were considered.
3. Standardization (Z-score Normalization):
o Prior to KNN imputation, all numeric variables were standardized
using StandardScaler. This is essential because KNN imputation relies on
Euclidean distance — without scaling, variables with larger ranges (e.g., urine
output vs. pH) would dominate the distance calculations.
4. KNN Imputation:
o We applied KNNImputer with n_neighbors=40, meaning each missing value was estimated using the average of its 40 closest neighbors in the feature space. This allows the model to leverage local structure in the data for more context-aware imputations.
5. Inverse Transformation:
o After imputation, we reversed the scaling with scaler.inverse_transform() to
return the data to its original units — ensuring interpretability and compatibility
with downstream models.
6. Reintegration of Imputed Values:
o The imputed and rescaled values were inserted back into a copy of the original
cleaned DataFrame (df_clean), creating a new dataset: df_knn_imputed.
7. Categorical Encoding (for modeling):
o Before training the Random Forest model, all relevant categorical variables were
numerically encoded:
▪ "gender": male = 0, female = 1
▪ "Has Cancer": no = 0, yes = 1, metastatic = 2
▪ "income_bracket": ordinal encoding (Very Low to Unknown)
▪ "Zodiac Sign": alphabetical integer encoding

This approach allowed us to preserve all patients and all relevant numeric features,
including White Blood Cell Count, while improving model robustness. Compared to simpler
strategies like median imputation, KNN better captures the multivariate structure of the data —
and in this case, resulted in the best model accuracy and ROC AUC performance.
