
Obesity Prediction Notebook

March 2, 2024

1 Introduction
This notebook is created to facilitate EDA and to build a machine learning model to predict multi-class obesity risk.

[1]: import os

     new_directory = 'C:/Users/sm94c/Documents/Data Science Projects/Multi Class Obesity Prediction'
     os.chdir(new_directory)

[2]: import pandas as pd

df = pd.read_csv("train.csv")

[3]: df.head()

[3]:    id  Gender        Age    Height      Weight family_history_with_overweight  \
     0   0    Male  24.443011  1.699998   81.669950                            yes
     1   1  Female  18.000000  1.560000   57.000000                            yes
     2   2  Female  18.000000  1.711460   50.165754                            yes
     3   3  Female  20.952737  1.710730  131.274851                            yes
     4   4    Male  31.641081  1.914186   93.798055                            yes

       FAVC      FCVC       NCP        CAEC SMOKE      CH2O SCC       FAF  \
     0  yes  2.000000  2.983297   Sometimes    no  2.763573  no  0.000000
     1  yes  2.000000  3.000000  Frequently    no  2.000000  no  1.000000
     2  yes  1.880534  1.411685   Sometimes    no  1.910378  no  0.866045
     3  yes  3.000000  3.000000   Sometimes    no  1.674061  no  1.467863
     4  yes  2.679664  1.971472   Sometimes    no  1.979848  no  1.967973

            TUE       CALC                 MTRANS           NObeyesdad
     0  0.976473  Sometimes  Public_Transportation  Overweight_Level_II
     1  1.000000         no             Automobile        Normal_Weight
     2  1.673584         no  Public_Transportation  Insufficient_Weight
     3  0.780199  Sometimes  Public_Transportation     Obesity_Type_III
     4  0.931721  Sometimes  Public_Transportation  Overweight_Level_II

[4]: df.info()
#no missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20758 non-null int64
1 Gender 20758 non-null object
2 Age 20758 non-null float64
3 Height 20758 non-null float64
4 Weight 20758 non-null float64
5 family_history_with_overweight 20758 non-null object
6 FAVC 20758 non-null object
7 FCVC 20758 non-null float64
8 NCP 20758 non-null float64
9 CAEC 20758 non-null object
10 SMOKE 20758 non-null object
11 CH2O 20758 non-null float64
12 SCC 20758 non-null object
13 FAF 20758 non-null float64
14 TUE 20758 non-null float64
15 CALC 20758 non-null object
16 MTRANS 20758 non-null object
17 NObeyesdad 20758 non-null object
dtypes: float64(8), int64(1), object(9)
memory usage: 2.9+ MB

[5]: def summary(df):
         print(f'data shape: {df.shape}')
         summ = pd.DataFrame(df.dtypes, columns=['data type'])
         summ['#missing'] = df.isnull().sum().values
         summ['%missing'] = df.isnull().sum().values / len(df) * 100
         summ['#unique'] = df.nunique().values
         desc = pd.DataFrame(df.describe(include='all').transpose())
         summ['min'] = desc['min'].values
         summ['max'] = desc['max'].values
         return summ

     summary(df)

data shape: (20758, 18)

[5]:                                data type  #missing  %missing  #unique   min  \
     id                                 int64         0       0.0    20758   0.0
     Gender                            object         0       0.0        2   NaN
     Age                              float64         0       0.0     1703  14.0
     Height                           float64         0       0.0     1833  1.45
     Weight                           float64         0       0.0     1979  39.0
     family_history_with_overweight    object         0       0.0        2   NaN
     FAVC                              object         0       0.0        2   NaN
     FCVC                             float64         0       0.0      934   1.0
     NCP                              float64         0       0.0      689   1.0
     CAEC                              object         0       0.0        4   NaN
     SMOKE                             object         0       0.0        2   NaN
     CH2O                             float64         0       0.0     1506   1.0
     SCC                               object         0       0.0        2   NaN
     FAF                              float64         0       0.0     1360   0.0
     TUE                              float64         0       0.0     1297   0.0
     CALC                              object         0       0.0        3   NaN
     MTRANS                            object         0       0.0        5   NaN
     NObeyesdad                        object         0       0.0        7   NaN

                                            max
     id                                 20757.0
     Gender                                 NaN
     Age                                   61.0
     Height                            1.975663
     Weight                          165.057269
     family_history_with_overweight         NaN
     FAVC                                   NaN
     FCVC                                   3.0
     NCP                                    4.0
     CAEC                                   NaN
     SMOKE                                  NaN
     CH2O                                   3.0
     SCC                                    NaN
     FAF                                    3.0
     TUE                                    2.0
     CALC                                   NaN
     MTRANS                                 NaN
     NObeyesdad                             NaN

[6]: df.size

[6]: 373644

[7]: df.shape
#18 columns, 20K entries.

[7]: (20758, 18)

2 EDA
Let’s explore the relationship between different variables within the dataset.

[8]: import matplotlib.pyplot as plt
     import seaborn as sns

[9]: plt.style.use('ggplot')

[10]: sns.scatterplot(df, x='Age',y='Weight',hue='NObeyesdad',alpha=0.05)

[10]: <Axes: xlabel='Age', ylabel='Weight'>

• Most people in the dataset fall in the Obesity_Type_III class
• A noticeably higher share of people aged 25-35 have Type II obesity
• Most people in the Insufficient_Weight class are between ages 15 and 25

[11]: sns.countplot(df, x='NObeyesdad', hue='SMOKE')
      plt.xlabel("Weight Class")
      plt.xticks(rotation=90)

[11]: (array([0, 1, 2, 3, 4, 5, 6]),
[Text(0, 0, 'Overweight_Level_II'),
Text(1, 0, 'Normal_Weight'),
Text(2, 0, 'Insufficient_Weight'),
Text(3, 0, 'Obesity_Type_III'),
Text(4, 0, 'Obesity_Type_II'),
Text(5, 0, 'Overweight_Level_I'),
Text(6, 0, 'Obesity_Type_I')])

• People in the Obesity_Type_II class have the highest proportion of smokers

[12]: sns.countplot(df, x='NObeyesdad', hue='MTRANS')
plt.xlabel("Weight Class")
plt.xticks(rotation=90)

[12]: (array([0, 1, 2, 3, 4, 5, 6]),
[Text(0, 0, 'Overweight_Level_II'),
Text(1, 0, 'Normal_Weight'),
Text(2, 0, 'Insufficient_Weight'),
Text(3, 0, 'Obesity_Type_III'),
Text(4, 0, 'Obesity_Type_II'),
Text(5, 0, 'Overweight_Level_I'),
Text(6, 0, 'Obesity_Type_I')])

• Obesity_Type_II entries use automobiles more than any other class
• Normal_Weight entries walk more than other classes

[13]: sns.scatterplot(df, x='Height', y='Weight', alpha=0.05, hue='family_history_with_overweight')

[13]: <Axes: xlabel='Height', ylabel='Weight'>

2.1 Feature Engineering

Let's create some new features to act as predictors for obesity. We noticed above that some variables separate certain obesity classes well, so let's engineer features that capture them.

[14]: df_eng = df.copy()

      # Height is recorded in metres (max ~1.98), so BMI = weight (kg) / height (m)^2.
      # The original cell divided Height by 100 as if it were in centimetres.
      df_eng['BMI'] = df_eng['Weight'] / df_eng['Height'] ** 2
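As a quick sanity check (not part of the original run), the engineered BMI can be compared against the labelled classes; if the feature is computed correctly, mean BMI should increase from Insufficient_Weight through Obesity_Type_III:

[ ]: # Hypothetical check: mean/min/max BMI per labelled class, sorted by mean
     df_eng.groupby('NObeyesdad')['BMI'].agg(['mean', 'min', 'max']).sort_values('mean')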

[15]: df_eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20758 non-null int64
1 Gender 20758 non-null object
2 Age 20758 non-null float64
3 Height 20758 non-null float64

4 Weight 20758 non-null float64
5 family_history_with_overweight 20758 non-null object
6 FAVC 20758 non-null object
7 FCVC 20758 non-null float64
8 NCP 20758 non-null float64
9 CAEC 20758 non-null object
10 SMOKE 20758 non-null object
11 CH2O 20758 non-null float64
12 SCC 20758 non-null object
13 FAF 20758 non-null float64
14 TUE 20758 non-null float64
15 CALC 20758 non-null object
16 MTRANS 20758 non-null object
17 NObeyesdad 20758 non-null object
18 BMI 20758 non-null float64
dtypes: float64(9), int64(1), object(9)
memory usage: 3.0+ MB

3 Machine Learning Model

[16]: import xgboost as xgb
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      from xgboost import XGBClassifier

      X = df_eng.drop('NObeyesdad', axis=1)
      y = df['NObeyesdad']

[17]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[18]: from sklearn.compose import ColumnTransformer
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder, OneHotEncoder

[38]: # Label encode the target variable y
      label_encoder = LabelEncoder()
      y_train_encoded = label_encoder.fit_transform(y_train)
      y_test_encoded = label_encoder.transform(y_test)
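An optional way to see how LabelEncoder assigned the integers: classes_ is sorted alphabetically, so the mapping is deterministic (sketch, output not shown in the original):

[ ]: # Optional: inspect the class-to-integer mapping used for y
     dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))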

[20]: import optuna

[37]: from sklearn.pipeline import Pipeline
      from sklearn.compose import ColumnTransformer
      from sklearn.preprocessing import StandardScaler, OrdinalEncoder
      from xgboost import XGBClassifier
      from sklearn.metrics import accuracy_score

      # Separate categorical and numerical columns
      categorical_cols = X_train.select_dtypes(include=['object', 'bool']).columns
      numerical_cols = X_train.select_dtypes(exclude=['object', 'bool']).columns

      # Preprocessing for numerical data
      numerical_transformer = StandardScaler()

      # Preprocessing for categorical data
      categorical_transformer = Pipeline(steps=[
          ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
      ])

      preprocessor = ColumnTransformer(
          transformers=[
              ('num', numerical_transformer, numerical_cols),
              ('cat', categorical_transformer, categorical_cols)
          ],
          remainder='passthrough'
      )

      def objective_xgb(trial):
          params = {
              "eval_metric": "mlogloss",
              "objective": "multi:softmax",
              "booster": trial.suggest_categorical("booster", ["gbtree"]),
              "grow_policy": trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"]),
              "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
              "learning_rate": trial.suggest_float("learning_rate", 0.01, 1.0, log=True),
              "gamma": trial.suggest_float("gamma", 1e-9, 0.5),
              "subsample": trial.suggest_float("subsample", 0.3, 1.0),
              "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 1.0),
              "max_depth": trial.suggest_int("max_depth", 5, 24),
              "min_child_weight": trial.suggest_int("min_child_weight", 1, 7),
              "reg_lambda": trial.suggest_float("reg_lambda", 1e-9, 100.0, log=True),
              "reg_alpha": trial.suggest_float("reg_alpha", 1e-9, 100.0, log=True)
          }

          # Create the XGBoost classifier with the current set of hyperparameters
          model_xgb = XGBClassifier(**params)

          # Construct the full pipeline with the classifier
          pipeline = Pipeline([
              ('preprocessor', preprocessor),
              ('classifier', model_xgb)
          ])

          # Fit the pipeline on the training data
          pipeline.fit(X_train, y_train_encoded)

          # Predict and calculate accuracy on the validation set
          y_pred_c = pipeline.predict(X_test)
          return accuracy_score(y_test_encoded, y_pred_c)

      study_xgboost = optuna.create_study(
          study_name="Study_XGB_Obesity", direction="maximize"
      )
      optuna.logging.set_verbosity(optuna.logging.WARNING)
      study_xgboost.optimize(objective_xgb, n_trials=1000, show_progress_bar=True)

      # Get the best hyperparameters
      best_params_xgb = study_xgboost.best_params
      print("Best Hyperparameters:", best_params_xgb)

Best Hyperparameters: {'booster': 'gbtree', 'grow_policy': 'lossguide',
'n_estimators': 575, 'learning_rate': 0.03289752563023217, 'gamma':
0.10216792503966325, 'subsample': 0.9781875703877861, 'colsample_bytree':
0.3000971735782186, 'max_depth': 20, 'min_child_weight': 1, 'reg_lambda':
7.404755224243125e-08, 'reg_alpha': 2.2740450488861006}
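To look beyond the single best trial, Optuna can export the whole search history. A small optional sketch (output not shown in the original):

[ ]: # Optional: summarise the 1000 trials. trials_dataframe() returns one
     # row per trial with its sampled parameters and objective value.
     trials = study_xgboost.trials_dataframe()
     print(trials['value'].describe())
     print("Best accuracy:", study_xgboost.best_value)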

[39]: # Separate categorical and numerical columns
      categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
      numerical_cols = X.select_dtypes(exclude=['object', 'bool']).columns

      # Preprocessing for numerical data
      numerical_transformer = StandardScaler()

      # Preprocessing for categorical data
      categorical_transformer = Pipeline(steps=[
          ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
      ])

      preprocessor = ColumnTransformer(
          transformers=[
              ('num', numerical_transformer, numerical_cols),
              ('cat', categorical_transformer, categorical_cols)
          ],
          remainder='passthrough'
      )

      new_params = {
          'booster': 'gbtree',
          'grow_policy': 'lossguide',
          'n_estimators': 575,
          'learning_rate': 0.03289752563023217,
          'gamma': 0.10216792503966325,
          'subsample': 0.9781875703877861,
          'colsample_bytree': 0.3000971735782186,
          'max_depth': 20,
          'min_child_weight': 1,
          'reg_lambda': 7.404755224243125e-08,
          'reg_alpha': 2.2740450488861006
      }

      # Renamed from `xgb` to avoid shadowing the xgboost module alias imported earlier
      xgb_clf = XGBClassifier(random_state=42, **new_params)

      # Update the pipeline
      pipeline = Pipeline([
          ('preprocessor', preprocessor),
          ('classifier', xgb_clf)
      ])

      # Fit the pipeline on the training data
      pipeline.fit(X_train, y_train_encoded)

[39]: Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num', StandardScaler(),
Index(['id', 'Age', 'Height',
'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE',
'BMI'],
dtype='object')),
('cat',
Pipeline(steps=[('encoder',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1))]),
Index(['Gender',
'family_history_with_over…
grow_policy='lossguide', importance_type=None,
interaction_constraints=None,
learning_rate=0.03289752563023217, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=20,
max_leaves=None, min_child_weight=1, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=575, n_jobs=None,
num_parallel_tree=None,
objective='multi:softprob', …))])

[40]: from sklearn.model_selection import cross_val_score

      cross_val_scores = cross_val_score(pipeline, X_train, y_train_encoded, cv=5, scoring='accuracy')

      # Print the cross-validation scores
      print("Cross-validation scores:", cross_val_scores)

      # Print the mean accuracy and standard deviation
      print("Mean Accuracy: {:.2f}%".format(cross_val_scores.mean() * 100))
      print("Standard Deviation: {:.2f}".format(cross_val_scores.std()))

Cross-validation scores: [0.90246839 0.90966576 0.90725685 0.90755796 0.91779584]
Mean Accuracy: 90.89%
Standard Deviation: 0.01
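Accuracy alone can hide weak classes. An optional sketch (not part of the original run) for per-class recall on the held-out split:

[ ]: # Optional: per-class recall from a confusion matrix on the test split
     from sklearn.metrics import confusion_matrix

     cm = confusion_matrix(y_test_encoded, pipeline.predict(X_test))
     per_class_recall = cm.diagonal() / cm.sum(axis=1)
     for cls, rec in zip(label_encoder.classes_, per_class_recall):
         print(f"{cls}: {rec:.3f}")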

[41]: import matplotlib.pyplot as plt
      from xgboost import plot_importance

      # 'pipeline' is the fitted pipeline from above; grab the trained classifier
      xgb_classifier = pipeline.named_steps['classifier']

      # Plot feature importances
      plot_importance(xgb_classifier)
      plt.show()
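plot_importance labels the bars f0, f1, ... because the classifier was fit on the NumPy array produced by the ColumnTransformer, not on named columns. A sketch for recovering readable names, assuming scikit-learn >= 1.0 (for get_feature_names_out):

[ ]: # Hypothetical helper: map XGBoost's f0, f1, ... back to column names
     feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
     scores = xgb_classifier.get_booster().get_score(importance_type='weight')
     readable = {feature_names[int(k[1:])]: v for k, v in scores.items()}
     print(sorted(readable.items(), key=lambda kv: kv[1], reverse=True)[:5])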

4 Time to Make Predictions
[42]: df_test = pd.read_csv("test.csv")

[43]: df_test_eng = df_test.copy()

      # Same BMI feature as the training set (Height is in metres)
      df_test_eng['BMI'] = df_test_eng['Weight'] / df_test_eng['Height'] ** 2

[44]: df_test_eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13840 entries, 0 to 13839
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 13840 non-null int64
1 Gender 13840 non-null object
2 Age 13840 non-null float64
3 Height 13840 non-null float64
4 Weight 13840 non-null float64

5 family_history_with_overweight 13840 non-null object
6 FAVC 13840 non-null object
7 FCVC 13840 non-null float64
8 NCP 13840 non-null float64
9 CAEC 13840 non-null object
10 SMOKE 13840 non-null object
11 CH2O 13840 non-null float64
12 SCC 13840 non-null object
13 FAF 13840 non-null float64
14 TUE 13840 non-null float64
15 CALC 13840 non-null object
16 MTRANS 13840 non-null object
17 BMI 13840 non-null float64
dtypes: float64(9), int64(1), object(8)
memory usage: 1.9+ MB

[45]: predictions = pipeline.predict(df_test_eng)

[46]: # 'label_encoder' is the LabelEncoder fit on the training labels; map
      # the encoded predictions back to the original class names
      predicted_original_values = label_encoder.inverse_transform(predictions)
      predicted_original_values

[46]: array(['Obesity_Type_II', 'Overweight_Level_I', 'Obesity_Type_III', …,
             'Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_II'],
            dtype=object)

[47]: # Build the submission DataFrame with 'id' and 'Prediction'
      result_df = pd.DataFrame({
          'id': df_test_eng['id'],
          'Prediction': predicted_original_values})

[48]: result_df

[48]: id Prediction
0 20758 Obesity_Type_II
1 20759 Overweight_Level_I
2 20760 Obesity_Type_III
3 20761 Obesity_Type_I
4 20762 Obesity_Type_III
… … …
13835 34593 Overweight_Level_II
13836 34594 Normal_Weight
13837 34595 Insufficient_Weight
13838 34596 Normal_Weight
13839 34597 Obesity_Type_II

[13840 rows x 2 columns]

[49]: result_df.to_csv("Submission_7_Optuna.csv",index=False)
