Model2.ipynb - Colab

The document outlines a data preprocessing workflow for a machine learning project using Python and libraries such as pandas, numpy, and scikit-learn. It includes steps for loading data, handling missing values, standardizing categorical variables, and visualizing distributions and outliers. The dataset focuses on health-related features and aims to prepare the data for further analysis and modeling.

Uploaded by

gacia der


2/23/25, 2:52 PM Model2.ipynb - Colab

import pandas as pd
import numpy as np
import matplotlib.pyplot as plt
import seaborn as sns
from imblearn.over_sampling import SMOTE
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import OneHotEncoder, StandardScaler
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
from xgboost import XGBClassifier
from sklearn.metrics import fbeta_score, roc_auc_score, classification_report, confusion_matrix

from google.colab import files

uploaded = files.upload()

test.csv(text/csv) - 152301 bytes, last modified: 2/21/2025 - 100% done
train.csv(text/csv) - 146010 bytes, last modified: 2/21/2025 - 100% done
Saving test.csv to test (1).csv
Saving train.csv to train (1).csv

train_path = "/content/train.csv"
test_path = "/content/test.csv"

import os
print(os.path.exists(train_path))
print(os.path.exists(test_path))

True
True

# Load Data (for manually uploaded files in Colab or Jupyter)

train_path = "/content/train.csv"
test_path = "/content/test.csv"

if not os.path.isfile(train_path):
    raise FileNotFoundError(f"Train file not found at {train_path}")
if not os.path.isfile(test_path):
    raise FileNotFoundError(f"Test file not found at {test_path}")

train_df = pd.read_csv(train_path)
test_df = pd.read_csv(test_path)

# Display dataset description

print(train_df.describe(include='all'))

gender age hypertension heart_disease ever_married \

count 2555 2555.000000 2555.000000 2555.000000 2555
unique 5 NaN NaN NaN 2
top Female NaN NaN NaN Yes
freq 1309 NaN NaN NaN 1699
mean NaN 46.373777 0.099804 0.053620 NaN
std NaN 149.971251 0.299798 0.225311 NaN
min NaN 0.000100 0.000000 0.000000 NaN
25% NaN 26.000000 0.000000 0.000000 NaN
50% NaN 44.000000 0.000000 0.000000 NaN
75% NaN 60.000000 0.000000 0.000000 NaN
max NaN 7500.000000 1.000000 1.000000 NaN

work_type Residence_type avg_glucose_level bmi \

count 2555 2555 2555.000000 2454.000000
unique 5 2 NaN NaN
top Private Urban NaN NaN
freq 1461 1288 NaN NaN
mean NaN NaN 105.534755 28.898248
std NaN NaN 44.689250 7.958036
min NaN NaN 55.220000 10.300000
25% NaN NaN 77.000000 23.500000
50% NaN NaN 91.450000 28.000000
75% NaN NaN 113.160000 33.200000
max NaN NaN 271.740000 92.000000

smoking_status stroke
count 2555 2554
unique 4 4
top never smoked 0
freq 945 2429
mean NaN NaN
std NaN NaN
min NaN NaN
25% NaN NaN
50% NaN NaN
75% NaN NaN
max NaN NaN

# Handling Missing Values

print("Missing values before handling:")
print(train_df.isnull().sum())

Missing values before handling:

gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 101
smoking_status 0
stroke 1
dtype: int64

# Impute missing values for numerical features using median

num_features = ["age", "avg_glucose_level", "bmi"]
num_imputer = SimpleImputer(strategy="median")
train_df[num_features] = num_imputer.fit_transform(train_df[num_features])
test_df[num_features] = num_imputer.transform(test_df[num_features])
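
The median imputation above fits the imputer on the training data only and reuses the learned medians for the test data. A minimal sketch of that pattern on made-up values (the `toy_train` and `toy_test` frames are illustrative, not part of the notebook's data):

```python
import numpy as np
import pandas as pd
from sklearn.impute import SimpleImputer

# Toy frames standing in for the real train/test data (hypothetical values).
toy_train = pd.DataFrame({"bmi": [20.0, 30.0, np.nan, 40.0]})
toy_test = pd.DataFrame({"bmi": [np.nan, 25.0]})

imputer = SimpleImputer(strategy="median")

# fit_transform learns the median (30.0) from the training column only.
toy_train[["bmi"]] = imputer.fit_transform(toy_train[["bmi"]])

# transform reuses the training median, so no test statistics are used.
toy_test[["bmi"]] = imputer.transform(toy_test[["bmi"]])

print(toy_train["bmi"].tolist())
print(toy_test["bmi"].tolist())
```

Reusing the training median keeps the test set from influencing preprocessing statistics.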

# Handling Categorical Missing Values by Filling with Most Frequent Value

cat_features = ["gender", "ever_married", "work_type", "Residence_type", "smoking_status"]
cat_imputer = SimpleImputer(strategy="most_frequent")
train_df[cat_features] = cat_imputer.fit_transform(train_df[cat_features])
test_df[cat_features] = cat_imputer.transform(test_df[cat_features])

# Double-check missing values after handling

print("Missing values after handling:")
print(train_df.isnull().sum())

Missing values after handling:

gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0

bmi 0
smoking_status 0
stroke 1
dtype: int64

# Drop row with missing stroke value

train_df = train_df.dropna(subset=["stroke"])

# Double-check missing values after handling

print("Missing values after handling:")
print(train_df.isnull().sum())

Missing values after handling:

gender 0
age 0
hypertension 0
heart_disease 0
ever_married 0
work_type 0
Residence_type 0
avg_glucose_level 0
bmi 0
smoking_status 0
stroke 0
dtype: int64

# Standardizing target variable 'stroke'

train_df["stroke"] = train_df["stroke"].replace({"Yes": 1, "yes": 1, "No": 0, "no": 0}).astype(int)

# Checking Unique Values and Their Frequencies for Target Variable

print("Stroke Unique Values Count:", train_df['stroke'].nunique())
print("Stroke Unique Values:", list(train_df['stroke'].unique()))
print("Stroke Value Counts:\n", train_df['stroke'].value_counts(), "\n")

Stroke Unique Values Count: 2

Stroke Unique Values: [0, 1]
Stroke Value Counts:
stroke
0 2429
1 125
Name: count, dtype: int64

# Checking for Negative Values

print("Checking for Negative Values:")
for feature in num_features:
    negative_count = (train_df[feature] < 0).sum()
    print(f"{feature}: {negative_count} negative values")

Checking for Negative Values:

age: 0 negative values
avg_glucose_level: 0 negative values
bmi: 0 negative values

# Visualizing Outliers using Boxplots

plt.figure(figsize=(12,6))
for i, feature in enumerate(num_features):
    plt.subplot(1, len(num_features), i+1)
    sns.boxplot(y=train_df[feature])
    plt.title(f"Boxplot of {feature}")
plt.tight_layout()
plt.show()

# Visualizing Distribution of avg_glucose_level

plt.figure(figsize=(6,4))
sns.histplot(train_df['avg_glucose_level'], kde=True, bins=30)
plt.title("Distribution of avg_glucose_level")
plt.show()

# Handling Outliers using IQR method for Age Only

Q1 = train_df['age'].quantile(0.25)
Q3 = train_df['age'].quantile(0.75)
IQR = Q3 - Q1
lower_bound = Q1 - 1.5 * IQR
upper_bound = Q3 + 1.5 * IQR
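# .copy() makes the filtered frame independent, so later column assignments do not trigger SettingWithCopyWarning
train_df = train_df[(train_df['age'] >= lower_bound) & (train_df['age'] <= upper_bound)].copy()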
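
The IQR rule above keeps rows whose age lies within 1.5 IQRs of the quartiles. With the quartiles shown in the `describe()` output (26 and 60), the fences would be roughly -25 and 111, which removes an age such as 7500 but keeps very small values such as 0.0001. A small sketch on made-up ages (the helper name `iqr_bounds` is hypothetical):

```python
import pandas as pd

def iqr_bounds(values: pd.Series, k: float = 1.5):
    """Return (lower, upper) fences using the k * IQR rule."""
    q1 = values.quantile(0.25)
    q3 = values.quantile(0.75)
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative ages, including one obviously invalid entry.
ages = pd.Series([26.0, 44.0, 60.0, 35.0, 7500.0])

lower, upper = iqr_bounds(ages)
kept = ages[(ages >= lower) & (ages <= upper)]

print(lower, upper)
print(kept.tolist())
```

Because the lower fence is negative here, the rule does not flag implausibly small ages; those need a separate check.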

# Capping extreme BMI values using Winsorization (1st and 99th percentile)
bmi_lower_cap = train_df['bmi'].quantile(0.01)
bmi_upper_cap = train_df['bmi'].quantile(0.99)
train_df['bmi'] = np.clip(train_df['bmi'], bmi_lower_cap, bmi_upper_cap)
test_df['bmi'] = np.clip(test_df['bmi'], bmi_lower_cap, bmi_upper_cap)
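
The capping step computes the 1st and 99th percentiles on the training data and applies the same caps to both sets. A minimal sketch with synthetic values (the series below are made up for illustration):

```python
import numpy as np
import pandas as pd

# Synthetic "training" values 1.0, 2.0, ..., 101.0.
train_values = pd.Series(np.arange(1.0, 102.0))

# Caps are learned from the training values only.
lower_cap = train_values.quantile(0.01)
upper_cap = train_values.quantile(0.99)

# New values outside the caps are pulled back to the nearest cap.
new_values = pd.Series([0.5, 50.0, 150.0])
capped = np.clip(new_values, lower_cap, upper_cap)

print(lower_cap, upper_cap)
print(capped.tolist())
```

Unlike the IQR filter used for age, clipping keeps every row and only changes the extreme values.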

# Applying Log Transformation to avg_glucose_level

train_df['avg_glucose_level'] = np.log1p(train_df['avg_glucose_level']) # log1p to avoid log(0)
test_df['avg_glucose_level'] = np.log1p(test_df['avg_glucose_level'])
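
`np.log1p(x)` computes log(1 + x), which compresses the long right tail of the glucose values while staying defined at zero. A short sketch using the minimum, median, and maximum reported by `describe()` above, with `np.expm1` as the inverse:

```python
import numpy as np

# Min, median, and max of avg_glucose_level from the describe() output.
glucose = np.array([55.22, 91.45, 271.74])

logged = np.log1p(glucose)    # log(1 + x)
restored = np.expm1(logged)   # exp(x) - 1, the inverse of log1p

# The raw range spans about 216 units; the log-scaled range is much narrower.
raw_range = glucose.max() - glucose.min()
log_range = logged.max() - logged.min()

print(logged)
print(raw_range, log_range)
```

The transformation preserves ordering, so tree-based models see the same splits, but it changes distances, which matters for the scaling and SMOTE steps later on.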

# Rechecking Outlier Handling with Boxplots

plt.figure(figsize=(12,6))
for i, feature in enumerate(['age', 'bmi', 'avg_glucose_level']):
    plt.subplot(1, 3, i+1)
    sns.boxplot(y=train_df[feature])
    plt.title(f"Boxplot of {feature} After Outlier Handling")
plt.tight_layout()
plt.show()

# Checking for small or incorrect age values

print("Smallest age values:")
print(train_df[['age']].sort_values(by='age').head(10))

Smallest age values:

age
621 0.0001
1184 0.0800
2272 0.2400
113 0.2400
2006 0.3200
1467 0.3200
817 0.3200
701 0.3200
1036 0.4000
2042 0.4800

# Checking Unique Values and Their Frequencies for Each Categorical Column
for feature in cat_features:
    print(f"{feature} unique values count: {train_df[feature].nunique()}")
    print(f"{feature} unique values: {list(train_df[feature].unique())}")
    print(f"{feature} value counts:\n{train_df[feature].value_counts()}\n")

gender unique values count: 5

gender unique values: ['Male', 'Female', 'female', 'male', 'Other']
gender value counts:
gender
Female 1308
Male 943
female 182
male 117
Other 1
Name: count, dtype: int64

ever_married unique values count: 2

ever_married unique values: ['Yes', 'No']
ever_married value counts:
ever_married
Yes 1696
No 855
Name: count, dtype: int64

work_type unique values count: 5

work_type unique values: ['Self-employed', 'Private', 'Govt_job', 'children', 'Never_worked']
work_type value counts:
work_type
Private 1458
Self-employed 412
children 341
Govt_job 330
Never_worked 10
Name: count, dtype: int64

Residence_type unique values count: 2

Residence_type unique values: ['Urban', 'Rural']
Residence_type value counts:
Residence_type
Urban 1286
Rural 1265
Name: count, dtype: int64

smoking_status unique values count: 4

smoking_status unique values: ['smokes', 'never smoked', 'Unknown', 'formerly smoked']
smoking_status value counts:
smoking_status
never smoked 944
Unknown 759
formerly smoked 452
smokes 396
Name: count, dtype: int64

# Checking Frequency of Gender Categories Before Standardization

print("Gender Distribution Before Standardization:")
print(train_df['gender'].value_counts())

Gender Distribution Before Standardization:

gender
Female 1308
Male 943
female 182
male 117
Other 1
Name: count, dtype: int64

# Standardizing Gender Values

train_df["gender"] = train_df["gender"].str.lower()
test_df["gender"] = test_df["gender"].str.lower()

# Replacing 'other' gender with the most frequent gender

most_frequent_gender = train_df["gender"].mode()[0]
train_df["gender"] = train_df["gender"].replace("other", most_frequent_gender)
test_df["gender"] = test_df["gender"].replace("other", most_frequent_gender)
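
The two gender steps above (lowercasing, then folding the rare 'other' label into the most frequent category) can be tried on a small made-up series. This is only a sketch of the same idea, not the notebook's data:

```python
import pandas as pd

# Made-up labels with inconsistent casing and one rare category.
gender = pd.Series(["Male", "Female", "female", "male", "Other", "Female"])

# Step 1: normalize casing so 'Female' and 'female' become one category.
normalized = gender.str.lower()

# Step 2: replace the rare label with the mode of the normalized labels.
most_frequent = normalized.mode()[0]
cleaned = normalized.replace("other", most_frequent)

print(cleaned.value_counts().to_dict())
```

Computing the mode after lowercasing matters: before normalization, the counts are split across differently cased labels.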

# Checking Frequency of Gender Categories After Standardization

print("Gender Distribution After Standardization:")
print(train_df['gender'].value_counts())

Gender Distribution After Standardization:

gender
female 1491
male 1060
Name: count, dtype: int64

# Checking Stroke Distribution by Gender

print("Stroke Distribution by Gender:")
print(train_df.groupby('gender')['stroke'].value_counts())

Stroke Distribution by Gender:

gender stroke
female 0 1420
1 71
male 0 1007
1 53
Name: count, dtype: int64

# Checking for duplicate rows before removal

print("Duplicate rows before removal:", train_df.duplicated().sum())

Duplicate rows before removal: 0

# Checking for redundant features

print("Correlation Matrix:")

# Selecting only numeric columns for correlation

numeric_cols = train_df.select_dtypes(include=['number'])

plt.figure(figsize=(10,6))
sns.heatmap(numeric_cols.corr(), annot=True, cmap="coolwarm", fmt=".2f", linewidths=0.5)
plt.title("Feature Correlation Matrix")
plt.show()

Correlation Matrix:

scaler = StandardScaler()
scaled_train = pd.DataFrame(scaler.fit_transform(train_df[num_features]), columns=num_features)
scaled_test = pd.DataFrame(scaler.transform(test_df[num_features]), columns=num_features)

# Checking if Age and BMI are properly scaled

print("Mean and Standard Deviation after Scaling:")
print(pd.DataFrame(scaled_train, columns=['age', 'bmi']).describe().loc[['mean', 'std']])

Mean and Standard Deviation after Scaling:

age bmi
mean 2.228280e-17 -3.676662e-16
std 1.000196e+00 1.000196e+00
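
The reported standard deviation of 1.000196 rather than exactly 1 is expected. `StandardScaler` divides by the population standard deviation (ddof=0), while pandas `describe()` reports the sample standard deviation (ddof=1), so the ratio is sqrt(n / (n - 1)). A sketch with n = 2551 rows of synthetic data (the random values are an assumption for illustration only):

```python
import numpy as np
import pandas as pd
from sklearn.preprocessing import StandardScaler

n = 2551  # number of training rows after outlier removal
rng = np.random.default_rng(0)
x = pd.DataFrame({"age": rng.normal(45, 20, size=n)})  # synthetic stand-in

scaled = pd.DataFrame(StandardScaler().fit_transform(x), columns=["age"])

pop_std = scaled["age"].std(ddof=0)     # what StandardScaler normalizes to 1
sample_std = scaled["age"].std(ddof=1)  # what describe() prints
expected = np.sqrt(n / (n - 1))

print(pop_std, sample_std, expected)
```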

# Encoding Categorical Features

encoder = OneHotEncoder(drop='first', sparse_output=False)
encoded_train = pd.DataFrame(encoder.fit_transform(train_df[cat_features]), columns=encoder.get_feature_names_out(cat_
encoded_test = pd.DataFrame(encoder.transform(test_df[cat_features]), columns=encoder.get_feature_names_out(cat_featur

# Reset index to align with other features

encoded_train.reset_index(drop=True, inplace=True)
encoded_test.reset_index(drop=True, inplace=True)

# Verifying One-Hot Encoding

print("Encoded train data shape:", encoded_train.shape)
print("Encoded test data shape:", encoded_test.shape)

# Display first few rows to check encoding

print("First few rows of encoded train data:")
print(encoded_train.head())

Encoded train data shape: (2551, 10)

Encoded test data shape: (2555, 10)
First few rows of encoded train data:
gender_male ever_married_Yes work_type_Never_worked work_type_Private \
0 1.0 1.0 0.0 0.0
1 1.0 1.0 0.0 1.0
2 1.0 0.0 0.0 0.0
3 1.0 1.0 0.0 1.0
4 1.0 0.0 0.0 0.0

work_type_Self-employed work_type_children Residence_type_Urban \

0 1.0 0.0 1.0
1 0.0 0.0 0.0
2 0.0 0.0 0.0
3 0.0 0.0 1.0
4 0.0 1.0 0.0

smoking_status_formerly smoked smoking_status_never smoked \

0 0.0 0.0
1 0.0 1.0
2 0.0 1.0
3 0.0 0.0
4 1.0 0.0

smoking_status_smokes
0 1.0
1 0.0
2 0.0
3 0.0
4 0.0

# Combining Processed Features

X_train_final = pd.concat([scaled_train, encoded_train, train_df[["hypertension", "heart_disease"]].reset_index(drop=T
X_test_final = pd.concat([scaled_test, encoded_test, test_df[["hypertension", "heart_disease"]].reset_index(drop=True)

y_train_final = train_df["stroke"].astype(int)

# Train-Test Split
X_train, X_val, y_train, y_val = train_test_split(X_train_final, y_train_final, test_size=0.2, random_state=42, strati

print("Preprocessing complete.")

Preprocessing complete.

# Applying SMOTE for oversampling

smote = SMOTE(sampling_strategy=0.5, random_state=42)
X_train, y_train = smote.fit_resample(X_train, y_train)
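
With a float `sampling_strategy`, SMOTE oversamples the minority class until the minority-to-majority ratio reaches that value, so 0.5 targets about one minority sample for every two majority samples. The arithmetic can be sketched as follows; the helper name and the class counts are hypothetical, since the notebook does not print the class counts of the training split:

```python
def smote_target_counts(n_majority: int, n_minority: int, ratio: float):
    """Approximate minority count after oversampling and the synthetic rows needed."""
    target_minority = int(n_majority * ratio)
    n_synthetic = target_minority - n_minority
    if n_synthetic <= 0:
        raise ValueError("minority class already meets the requested ratio")
    return target_minority, n_synthetic

# Illustrative counts only, not taken from the notebook output.
target, synthetic = smote_target_counts(n_majority=1940, n_minority=100, ratio=0.5)
print(target, synthetic)
```

The majority class is left unchanged; only synthetic minority rows are added, and only to the training split.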

# Train XGBoost Classifier with Hyperparameter Tuning

xgb_params = {
'n_estimators': [100, 200],
'max_depth': [3, 5, 7],
'learning_rate': [0.01, 0.05, 0.1],
'scale_pos_weight': [10, 15]
}

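# Note: recent XGBoost releases ignore use_label_encoder and emit the
# 'Parameters: { "use_label_encoder" } are not used.' warning seen in the output below.
xgb = XGBClassifier(use_label_encoder=False, eval_metric="logloss")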

grid_search_xgb = GridSearchCV(xgb, xgb_params, scoring='roc_auc', cv=3, verbose=1)
grid_search_xgb.fit(X_train, y_train)

Fitting 3 folds for each of 36 candidates, totalling 108 fits
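
The "36 candidates, 108 fits" message follows directly from the grid: 2 x 3 x 3 x 2 parameter combinations, each evaluated on 3 folds. This can be checked with scikit-learn's `ParameterGrid`, using the same grid as above:

```python
from sklearn.model_selection import ParameterGrid

xgb_params = {
    "n_estimators": [100, 200],
    "max_depth": [3, 5, 7],
    "learning_rate": [0.01, 0.05, 0.1],
    "scale_pos_weight": [10, 15],
}

cv_folds = 3
n_candidates = len(ParameterGrid(xgb_params))  # every combination of values
n_fits = n_candidates * cv_folds               # one fit per candidate per fold

print(n_candidates, n_fits)
```

The count excludes the final refit on the full training data that `GridSearchCV` performs by default after selecting the best candidate.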

/usr/local/lib/python3.11/dist-packages/xgboost/core.py:158: UserWarning: [12:48:32] WARNING: /workspace/src/learn
Parameters: { "use_label_encoder" } are not used.

warnings.warn(smsg, UserWarning)
