Chat, Assignment 2 - Data science

The document outlines a structured imputation strategy for handling missing values in a dataset used for predicting in-hospital mortality, ensuring no NaN values remain. It details the methods for categorical, ordinal, and numeric imputation, emphasizing the importance of retaining data integrity and clinical interpretability. Ultimately, a Random Forest model was chosen for its superior accuracy and robustness against imputation artifacts compared to other models tested.

Imputation Strategy Explanation

To prepare the dataset for machine learning modeling, we handled all missing values using a
structured imputation strategy. This ensures the final dataset is free of NaN values, which most
scikit-learn models cannot process. Our strategy was tailored to the type and clinical meaning of
each variable:

1. Categorical Imputation

• income_bracket: This variable is categorical, with missing values likely due to unrecorded or sensitive financial information. We filled missing entries with the placeholder "Unknown" to retain these rows without biasing the distribution toward the more common categories.
• years_of_education: As a numeric socio-demographic feature, we imputed missing values using the median, which is robust to outliers and preserves the overall distribution.
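
A minimal sketch of these two fills, assuming the data sits in a pandas DataFrame named df with the column names used above:

```python
import pandas as pd

# Categorical: keep the rows by filling with an explicit "Unknown" category
df["income_bracket"] = df["income_bracket"].fillna("Unknown")

# Numeric socio-demographic feature: the median is robust to outliers
df["years_of_education"] = df["years_of_education"].fillna(
    df["years_of_education"].median()
)
```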

2. Ordinal Imputation – Activities of Daily Living (ADL)

• For the two ADL variables (From Patient and From Relative), which represent ordered
levels of patient independence, we imputed missing values using the mode (most
frequent value). This is appropriate because these variables are ordinal and the mode
helps maintain interpretability while reflecting the most common functional level in the
data.
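
A short sketch of the mode fill; the exact ADL column names here are an assumption:

```python
# .mode() can return ties, so take the first (most frequent) value
for col in ["Activities of Daily Living (From Patient)",
            "Activities of Daily Living (From Relative)"]:
    df[col] = df[col].fillna(df[col].mode()[0])
```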

3. Numeric Imputation for Clinical Measurements

Several clinical test results had a substantial number of missing values (e.g., Glucose, Urine Output, P/F Ratio). For each of these variables, we:

• First ensured they were correctly converted to numeric format using pd.to_numeric() with errors='coerce'. This step was necessary because many of these values had been read in as strings (e.g., due to inconsistent data entry).
• Then used median imputation to fill in the missing values. The median is robust to outliers and maintains the central tendency of each feature without skewing its distribution.

Variables handled in this step include:

• Urine Output, Glucose, Blood Urea Nitrogen, Serum Albumin, Bilirubin Level, P/F Ratio, Arterial Blood PH, and Serum Creatinine Level.
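
A sketch of the two-step clean-up, looping over the listed columns (names copied from the report):

```python
clinical_cols = [
    "Urine Output", "Glucose", "Blood Urea Nitrogen", "Serum Albumin",
    "Bilirubin Level", "P/F Ratio", "Arterial Blood PH",
    "Serum Creatinine Level",
]

for col in clinical_cols:
    # Coerce stray strings to NaN so the column becomes numeric
    df[col] = pd.to_numeric(df[col], errors="coerce")
    # Median imputation: robust to the heavy tails typical of lab values
    df[col] = df[col].fillna(df[col].median())
```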

Any remaining variables with unresolved missing values are dropped at a later stage.

✅ Summary

This imputation strategy allows us to:

• Retain as much data as possible (especially for valuable patient rows)
• Avoid introducing bias from overly aggressive row-dropping
• Maintain clinical interpretability and consistency across features

The resulting cleaned dataset (cleaned_patient_data.pkl) is now ready for modeling in Question
2.

Features are scaled for LR and KNN, since these models are sensitive to feature scale (scale affects distance calculations and coefficient magnitudes). Tree-based models split on thresholds, so scaling is not needed for them.

In LR, the regularization strength is left at its default (C = 1.0, the inverse of the L2 penalty strength).

For KNN, k is chosen by the rule of thumb k ≈ √n; with n = 7923, this gives k ≈ 89. We also tested values within about ±20 of this.

Conclusion (before hyperparameter tuning):

Random Forest and LR display similar accuracy, with RF slightly ahead; the strength of the linear baseline suggests a degree of linearity in our data. However, LR cannot handle NaN values, and the classification DataFrame still contains them, so Random Forest is the more practical model in this case. The relatively high LR accuracy may also be partly an artifact of the imputation strategy: median imputation produces smoother, less variable numeric features, which can flatter a linear model, whereas threshold-based trees are largely insensitive to such shifts. Because of this, the Random Forest model is preferred.


2.1 Predicting In-Hospital Mortality – Model Choice, Variables, and Tuning

To predict whether a patient will die during their hospital stay, we tested and compared four
classification models: Logistic Regression, K-Nearest Neighbors (KNN), Decision Tree,
and Random Forest. The goal was to identify the model with the highest accuracy while
maintaining interpretability and robustness.

Data Preparation

We began by constructing a comprehensive patient dataset by joining the Patient, Study, and PatientExamination tables. The resulting dataset was wide-format, with one row per patient and multiple clinical, demographic, and physiological variables as columns.

✅ Imputation Strategy

Missing values were handled through a structured strategy:

1. Categorical Variables:
o income_bracket: Missing entries were filled with "Unknown" to retain rows
without injecting bias.
o years_of_education: Filled with the median, which is robust to outliers.
2. Ordinal Variables (ADL):
o Activities of Daily Living (From Patient) and Activities of Daily Living (From Relative) were imputed using the mode, preserving their ordinal nature.
3. Numeric Clinical Features:
o For features such as Urine Output, Blood Urea Nitrogen, Serum Albumin,
etc., missing values were imputed using the median after converting values to
numeric using pd.to_numeric(errors='coerce').

This allowed us to retain almost all patient rows while minimizing potential bias, resulting in a
complete, clean dataset (cleaned_patient_data.pkl).

Feature Engineering and Scaling

Categorical variables were numerically encoded:

• gender: Male = 0, Female = 1
• Has Cancer: No = 0, Yes = 1, Metastatic = 2
• income_bracket: Very Low = 0 → High = 3, Unknown = 4
• Zodiac Sign: encoded alphabetically (e.g., Capricorn = 3, Pisces = 7)

We excluded patients admitted on or after Jan 1, 2023 (as these make up the classification set)
and removed admission_date and patient_id for training.
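
A sketch of this encoding and filtering; the two intermediate income labels ("Low", "Medium") and the datetime handling are assumptions:

```python
import pandas as pd

df["gender"] = df["gender"].map({"Male": 0, "Female": 1})
df["Has Cancer"] = df["Has Cancer"].map({"No": 0, "Yes": 1, "Metastatic": 2})
df["income_bracket"] = df["income_bracket"].map(
    {"Very Low": 0, "Low": 1, "Medium": 2, "High": 3, "Unknown": 4})

# Alphabetical integer codes for the twelve signs (Capricorn -> 3, Pisces -> 7)
signs = sorted(df["Zodiac Sign"].dropna().unique())
df["Zodiac Sign"] = df["Zodiac Sign"].map({s: i for i, s in enumerate(signs)})

# Patients admitted on/after 2023-01-01 form the classification set
df["admission_date"] = pd.to_datetime(df["admission_date"])
train_df = df[df["admission_date"] < "2023-01-01"]
train_df = train_df.drop(columns=["admission_date", "patient_id"])
```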

Feature scaling was applied only for KNN and Logistic Regression, as these models are
sensitive to feature magnitude. Scaling was unnecessary for decision trees and random forests,
which split data based on thresholds.
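
A minimal sketch of the scaling step, fitting the scaler on the training split only; the X_train/X_test matrices (defined by the split in the next subsection) are assumed:

```python
from sklearn.preprocessing import StandardScaler

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)  # fit on training data only
X_test_scaled = scaler.transform(X_test)        # reuse the same parameters
```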

Model Comparison and Tuning

We trained and evaluated all models using stratified 80/20 train-test splits,
measuring accuracy as the performance metric.
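
A sketch of the evaluation setup, assuming X holds the features and y the in-hospital death label (the random seed is an assumption):

```python
from sklearn.model_selection import train_test_split

# Stratify on y so both splits keep the original class balance
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)
```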

1. Logistic Regression

• The default L2-penalized model (C=1.0) achieved ~0.8013 accuracy.
• We also tested L1 regularization (LASSO) for feature selection. Accuracy decreased slightly (~0.799–0.788), but fewer features were retained at lower C values.
• Observation: LASSO did not improve performance but confirmed that many features contribute meaningfully to predictions. A sketch of both variants follows.
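
A sketch of the two logistic regression variants; the exact C grid is an assumption:

```python
from sklearn.linear_model import LogisticRegression

# Default L2-penalized baseline (C=1.0)
lr_l2 = LogisticRegression(max_iter=1000).fit(X_train_scaled, y_train)
print("L2 accuracy:", lr_l2.score(X_test_scaled, y_test))

# L1 (LASSO) variants: stronger penalties zero out more coefficients
for C in [0.01, 0.1, 1.0]:
    lr_l1 = LogisticRegression(penalty="l1", solver="liblinear", C=C)
    lr_l1.fit(X_train_scaled, y_train)
    n_kept = int((lr_l1.coef_ != 0).sum())  # features surviving the penalty
    print(f"C={C}: accuracy={lr_l1.score(X_test_scaled, y_test):.4f}, "
          f"features kept={n_kept}")
```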

2. K-Nearest Neighbors (KNN)


• Using the rule of thumb k = √n, we selected k=89 (n ≈ 8000).
• Accuracy: ~0.7748
• Grid test from k=84 to 94 showed stable but slightly lower accuracy compared to LR and
RF.
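
A sketch of the rule-of-thumb choice and the surrounding grid, reusing the scaled splits from above:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

n = len(X_train) + len(X_test)
print("rule-of-thumb k:", round(np.sqrt(n)))   # ~89 for n ≈ 8000

# Small grid around sqrt(n), matching the k = 84..94 test above
for k in range(84, 95):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_train_scaled, y_train)
    print(f"k={k}: accuracy={knn.score(X_test_scaled, y_test):.4f}")
```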

3. Decision Tree

• A single unpruned tree yielded accuracy of ~0.7174.
• While interpretable, it overfitted the training data and underperformed relative to ensemble methods.

4. Random Forest ✅ (Chosen Model)

• With n_estimators=100, accuracy reached 0.8031, the highest among all models.
• A test of multiple values (n = 10, 50, 100, 200, 300) showed that 100–200
estimators consistently provided strong results.
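
A sketch of the n_estimators sweep (the seed is an assumption; trees need no feature scaling):

```python
from sklearn.ensemble import RandomForestClassifier

for n in [10, 50, 100, 200, 300]:
    rf = RandomForestClassifier(n_estimators=n, random_state=42)
    rf.fit(X_train, y_train)          # unscaled features are fine for trees
    print(f"n_estimators={n}: accuracy={rf.score(X_test, y_test):.4f}")
```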

Why Random Forest Was Chosen

Although Logistic Regression performed similarly in accuracy, Random Forest was ultimately chosen because:

• It is more robust to imputation artifacts: LR performance may be inflated by the smoothing effect of median imputation (which biases linear models), while tree-based models are less sensitive to this.
• RF handles missing structure and nonlinear interactions better than LR.
• Unlike LR, RF does not assume linearity or feature independence, assumptions that are unlikely to hold in a real-world ICU dataset.
• LR is more sensitive to scaling, NaN values, and violations of its distributional assumptions.

The decision tree visualizations (e.g., Tree #1 from the forest) provided useful interpretability
while maintaining the ensemble model's robustness.

Conclusion

We chose Random Forest as the final model for predicting in-hospital mortality. It achieved
the highest accuracy (0.8031), balanced performance across different configurations, and is
more resilient to the effects of imputation and feature scaling. Hyperparameter tuning
of n_estimators confirmed the model’s stability and optimal performance near 100–200 trees.

This model is best suited for the task of mortality prediction in this heterogeneous, high-dimensional medical dataset.

2.4 Feature Importance

To understand which variables had the greatest influence on our model's predictions, we
extracted the feature importances from the trained Random Forest classifier.

The five most important features were:

1. Simplified Acute Physiology Score III
2. SPS Score
3. Mean Arterial Blood Pressure
4. White Blood Cell Count
5. Age
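
A sketch of how this ranking can be read off the fitted forest, assuming rf is the trained RandomForestClassifier and X_train a DataFrame:

```python
import pandas as pd

# Impurity-based importances, one value per feature, highest first
importances = pd.Series(rf.feature_importances_, index=X_train.columns)
print(importances.sort_values(ascending=False).head(5))
```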

These features are clinically meaningful indicators of a patient’s overall condition and mortality
risk:

• Simplified Acute Physiology Score III (SAPS III) is a well-established scoring system
for predicting ICU mortality and integrates multiple clinical variables (vitals, labs, etc.),
making it a strong overall predictor.
• SPS Score likely reflects severity of illness, organ dysfunction, or systemic stress — all
critical mortality determinants.
• Mean Arterial Blood Pressure helps assess circulatory stability. Persistent hypotension
is often a sign of shock or critical illness.
• White Blood Cell Count is a key marker of infection, inflammation, and immune
response, commonly linked to conditions like sepsis.
• Age is a fundamental risk factor, as older patients generally face worse outcomes due to
comorbidities and frailty.

Together, these features capture physiological severity, organ function, immune status, and patient vulnerability, which helps explain why they are so predictive of the likelihood of in-hospital death.

Justification of the Chosen Cut-Off Threshold

In this task, the goal is to maximize KindCorp’s profit by deciding which patient insurance
claims to offload to EvilCorp. Each patient who dies in hospital results in a cost of €500,000 to
KindCorp, unless their policy has been offloaded for a €150,000 fee. This setup makes False
Negatives (missed deaths) extremely costly, whereas False Positives (unnecessarily offloaded
survivors) are relatively inexpensive.

To account for this imbalance in cost, we optimized the model’s classification threshold not for
accuracy, but for profit. The analysis shows that the optimal cut-off lies around 0.19,
significantly lower than the default threshold of 0.5. This means that patients with even a
moderate predicted probability of dying are flagged for offloading.

This low threshold is intentional: it increases sensitivity (recall) and reduces the number of false
negatives. While this leads to more false positives, the economic trade-off is favorable, since
avoiding a €500,000 loss is worth incurring a few extra €150,000 payments. In short, the model
is biased towards caution — preferring to offload “risky” patients — because the financial
impact of missing a death is far greater than offloading a survivor.

By setting the threshold to 0.19, we ensure that the classification decisions are aligned not just
with model performance, but with business value.
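
A sketch of the profit-driven threshold search under the stated costs; the variable names are assumptions:

```python
import numpy as np

COST_DEATH = 500_000    # paid for each in-hospital death we keep
COST_OFFLOAD = 150_000  # fee per policy offloaded to EvilCorp

probs = rf.predict_proba(X_test)[:, 1]   # predicted P(death) per patient
y = np.asarray(y_test)

best_t, best_cost = None, np.inf
for t in np.arange(0.01, 1.00, 0.01):
    offload = probs >= t
    # Total cost = offloading fees + deaths among the patients we kept
    cost = (COST_OFFLOAD * offload.sum()
            + COST_DEATH * (y[~offload] == 1).sum())
    if cost < best_cost:
        best_t, best_cost = t, cost

print(f"optimal threshold ≈ {best_t:.2f}, total cost = EUR {best_cost:,}")
```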

Note: Q3.3 is performed on the train/test data, since the true outcomes are already known for those patients.

KNN-Based Imputation
To address missing values in the dataset — particularly in key clinical variables — we
implemented a K-Nearest Neighbors (KNN) imputation strategy focused exclusively
on numeric features. This was followed by proper encoding of categorical variables and
preparation for machine learning modeling.

Step-by-Step Explanation:

1. Ensuring WBC is Numeric:
o Before imputation, we converted the "White Blood Cell Count" column explicitly to numeric using pd.to_numeric(), ensuring that it was correctly included in the list of imputed features.
2. Selection of Numeric Features:
o We extracted all columns of numerical type
using df.select_dtypes(include=['number']). This ensured that only
continuous or ordinal variables — appropriate for distance-based imputation —
were considered.
3. Standardization (Z-score Normalization):
o Prior to KNN imputation, all numeric variables were standardized
using StandardScaler. This is essential because KNN imputation relies on
Euclidean distance — without scaling, variables with larger ranges (e.g., urine
output vs. pH) would dominate the distance calculations.
4. KNN Imputation:
o We applied KNNImputer with n_neighbors=40, meaning each missing value was estimated using the average of its 40 closest neighbors in the feature space. This allows the model to leverage local structure in the data for more context-aware imputations.
5. Inverse Transformation:
o After imputation, we reversed the scaling with scaler.inverse_transform() to
return the data to its original units — ensuring interpretability and compatibility
with downstream models.
6. Reintegration of Imputed Values:
o The imputed and rescaled values were inserted back into a copy of the original
cleaned DataFrame (df_clean), creating a new dataset: df_knn_imputed.
7. Categorical Encoding (for modeling):
o Before training the Random Forest model, all relevant categorical variables were
numerically encoded:
▪ "gender": male = 0, female = 1
▪ "Has Cancer": no = 0, yes = 1, metastatic = 2
▪ "income_bracket": ordinal encoding (Very Low to Unknown)
▪ "Zodiac Sign": alphabetical integer encoding

This approach allowed us to preserve all patients and all relevant numeric features,
including White Blood Cell Count, while improving model robustness. Compared to simpler
strategies like median imputation, KNN better captures the multivariate structure of the data —
and in this case, resulted in the best model accuracy and ROC AUC performance.
