Preprocessing1.ipynb - Colab
import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import OneHotEncoder, StandardScaler, MinMaxScaler
from sklearn.impute import SimpleImputer
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import fbeta_score, precision_recall_curve, roc_auc_score, classification_report, confusion_matrix
import os
train_path = "/content/train.csv"
test_path = "/content/test.csv"
print(os.path.exists(train_path))
print(os.path.exists(test_path))
True
True
if not os.path.isfile(train_path):
    raise FileNotFoundError(f"Train file not found at {train_path}")
if not os.path.isfile(test_path):
    raise FileNotFoundError(f"Test file not found at {test_path}")
train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)
smoking_status stroke
count 2555 2554
unique 4 4
top never smoked 0
freq 945 2429
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN
Residence_type 0
avg_glucose_level 0
bmi 0
smoking_status 0
stroke 1
dtype: int64
Since there are no negative values in the data, code to handle negative values would be redundant.
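A quick sanity check confirms this; a minimal sketch (the column names are taken from this dataset):

# Confirm there are no negative values in the numeric columns
for col in ["age", "avg_glucose_level", "bmi"]:
    print(col, (train_df[col] < 0).sum())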
Handling Outliers
For age, I applied the Interquartile Range (IQR) method to remove extreme values. The rationale is that extremely high ages
might be biologically unrealistic and could distort the model’s learning process. The IQR method effectively identifies and
removes such extreme values while preserving most of the data distribution.
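The filtering cell itself is not visible in this export; a minimal sketch of the approach, assuming the conventional 1.5×IQR cutoff:

# IQR-based removal of extreme ages (the 1.5*IQR multiplier is the conventional choice, assumed here)
q1, q3 = train_df["age"].quantile([0.25, 0.75])
iqr = q3 - q1
lower, upper = q1 - 1.5 * iqr, q3 + 1.5 * iqr
train_df = train_df[train_df["age"].between(lower, upper)]  # rows dropped from train only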
For BMI, I used Winsorization (capping at the 1st and 99th percentiles) instead of removing outliers. Since BMI naturally varies
among individuals, especially in medical datasets, completely removing high or low values could lead to loss of important
information. Instead, capping prevents extreme values from dominating the model while retaining valuable patterns.
For average glucose level, I applied a log transformation to address the right-skewed distribution observed in the data. This
transformation helps stabilize variance and ensures that large glucose values do not disproportionately affect the model. Unlike
outright removal, log transformation allows the model to learn from high glucose levels while mitigating their impact.
These methods ensure that we retain critical medical data while improving model robustness and preventing outliers from
biasing the predictions.
# Capping extreme BMI values using Winsorization (1st and 99th percentile)
bmi_lower_cap = train_df['bmi'].quantile(0.01)
bmi_upper_cap = train_df['bmi'].quantile(0.99)
train_df['bmi'] = np.clip(train_df['bmi'], bmi_lower_cap, bmi_upper_cap)
test_df['bmi'] = np.clip(test_df['bmi'], bmi_lower_cap, bmi_upper_cap)
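The glucose transformation cell is likewise not shown; a minimal sketch, assuming np.log1p (log(1 + x), which is well defined at zero) as the variant:

# Log-transform avg_glucose_level to reduce right skew; the exact transform used is an assumption
train_df["avg_glucose_level"] = np.log1p(train_df["avg_glucose_level"])
test_df["avg_glucose_level"] = np.log1p(test_df["avg_glucose_level"])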
I chose to keep the low-age children in the dataset because they represent a valid demographic group that could still be at risk
of stroke, especially due to congenital conditions. Instead of removing them, I verified their presence and ensured that extremely
small, likely erroneous values were corrected, preserving the integrity of medically relevant data.
Handling Abnormal Categorical Values: I standardized the gender column by converting all values to lowercase to prevent duplicate categories (e.g., "Male" and "male" being treated separately). Additionally, I replaced "other" with the most frequent gender in the dataset. This ensures consistency and prevents issues during encoding while maintaining the integrity of the data.
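The cleaning cell does not appear in the export; a minimal sketch of the two steps described above:

# Lowercase gender to merge case variants, then map the rare "other" value to the modal gender
train_df["gender"] = train_df["gender"].str.lower()
test_df["gender"] = test_df["gender"].str.lower()
most_frequent = train_df["gender"].mode()[0]  # mode computed on train only
train_df["gender"] = train_df["gender"].replace("other", most_frequent)
test_df["gender"] = test_df["gender"].replace("other", most_frequent)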
# Checking Unique Values and Their Frequencies for Each Categorical Column
for feature in cat_features:
    print(f"{feature} unique values count: {train_df[feature].nunique()}")
    print(f"{feature} unique values: {list(train_df[feature].unique())}")
    print(f"{feature} value counts:\n{train_df[feature].value_counts()}\n")
gender value counts:
gender
female 1491
male 1060
Name: count, dtype: int64
I chose to retain the "Unknown" category in smoking_status because it represents a significant proportion of the dataset (759
instances) and removing or replacing it could introduce bias. By keeping it as a separate category, the model can learn patterns
from individuals with missing smoking data rather than making incorrect assumptions about their smoking habits.
The stroke rates for males (5.00%) and females (4.76%) are very close, indicating that there is no significant gender bias in
stroke occurrence based on this dataset.
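Those rates can be reproduced with a one-line groupby (a sketch; it assumes stroke is numeric at this point, as it is cast to int later in the notebook):

# Stroke rate per gender, in percent
print(train_df.groupby("gender")["stroke"].mean().mul(100).round(2))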
Since there are no duplicate rows, code to remove duplicates would be redundant.
plt.figure(figsize=(10,6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Matrix")
plt.show()
[Figure: Feature Correlation Matrix heatmap]
# Evaluate model ('model' is the classifier fitted in an earlier cell not visible in this export)
y_pred = model.predict(X_test)
print(classification_report(y_test, y_pred))
Feature Importance
1  hypertension       0.654611
3  avg_glucose_level  0.379540
2  heart_disease      0.220947
0  age                0.095052
4  bmi                0.004861
I scaled the two numerical features, age and BMI, because they have different magnitudes, and logistic regression performs better when features are on a similar scale. Standardizing these variables ensures that no single feature dominates the model, improving optimization and stability during training.
scaler = StandardScaler()
scaled_train = pd.DataFrame(scaler.fit_transform(train_df[num_features]), columns=num_features)
scaled_test = pd.DataFrame(scaler.transform(test_df[num_features]), columns=num_features)
smoking_status_smokes
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0
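The encoding cell that produced these columns is not visible in the export; a minimal sketch using pd.get_dummies (OneHotEncoder from the imports would work equally well), with cat_features as defined earlier:

# One-hot encode the categorical columns; "Unknown" stays as its own smoking_status dummy
train_encoded = pd.get_dummies(train_df[cat_features], dtype=float)
test_encoded = pd.get_dummies(test_df[cat_features], dtype=float)
# Align test to the train columns so both frames share the same dummy set
test_encoded = test_encoded.reindex(columns=train_encoded.columns, fill_value=0.0)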
y_train_final = train_df["stroke"].astype(int)
# Train-Test Split
X_train, X_val, y_train, y_val = train_test_split(
    X_train_final, y_train_final, test_size=0.2, random_state=42, stratify=y_train_final
)
print("Preprocessing complete.")
Preprocessing complete.
smoking_status stroke
count 2551 2551.000000
unique 4 NaN
top never smoked NaN
freq 944 NaN
mean NaN 0.048608
std NaN 0.215090
min NaN 0.000000
25% NaN 0.000000
50% NaN 0.000000
75% NaN 0.000000
max NaN 1.000000
# Predictions
y_pred = model.predict(X_val)
y_probs = model.predict_proba(X_val)[:, 1]
# Evaluation Metrics
auc_score = roc_auc_score(y_val, y_probs)
f_beta = fbeta_score(y_val, y_pred, beta=10)
class_report = classification_report(y_val, y_pred)
conf_matrix = confusion_matrix(y_val, y_pred)
# Display Metrics
print(f"AUC Score: {auc_score}")
print(f"F-beta Score (β=10): {f_beta}")
print("Classification Report:")
print(class_report)
print("Confusion Matrix:")
print(conf_matrix)
AUC Score: 0.8042798353909465
F-beta Score (β=10): 0.688115064345193
Classification Report:
              precision    recall  f1-score   support

           0       0.98      0.74      0.85       486
           1       0.13      0.72      0.22        25

    accuracy                           0.74       511
   macro avg       0.55      0.73      0.53       511
weighted avg       0.94      0.74      0.82       511
Confusion Matrix:
[[362 124]
[ 7 18]]
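The reported F-beta checks out against the confusion matrix. With β = 10 the score weights recall β² = 100 times more than precision, which suits stroke screening, where false negatives are the costly error:

# F-beta from the confusion matrix: precision = 18/142, recall = 18/25
p, r, beta = 18 / 142, 18 / 25, 10
print((1 + beta**2) * p * r / (beta**2 * p + r))  # ~0.6881, matching the score above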