
Obesity Prediction Notebook

March 2, 2024

1 Introduction
This notebook is created to facilitate EDA and to build a machine learning model to predict multi-class obesity risk.

[1]: import os

     new_directory = 'C:/Users/sm94c/Documents/Data Science Projects/Multi Class Obesity Prediction'
     os.chdir(new_directory)

[2]: import pandas as pd

df = pd.read_csv("train.csv")

[3]: df.head()

[3]:    id  Gender        Age    Height      Weight family_history_with_overweight  \
     0   0    Male  24.443011  1.699998   81.669950                            yes
     1   1  Female  18.000000  1.560000   57.000000                            yes
     2   2  Female  18.000000  1.711460   50.165754                            yes
     3   3  Female  20.952737  1.710730  131.274851                            yes
     4   4    Male  31.641081  1.914186   93.798055                            yes

       FAVC      FCVC       NCP        CAEC SMOKE      CH2O SCC       FAF  \
     0  yes  2.000000  2.983297   Sometimes    no  2.763573  no  0.000000
     1  yes  2.000000  3.000000  Frequently    no  2.000000  no  1.000000
     2  yes  1.880534  1.411685   Sometimes    no  1.910378  no  0.866045
     3  yes  3.000000  3.000000   Sometimes    no  1.674061  no  1.467863
     4  yes  2.679664  1.971472   Sometimes    no  1.979848  no  1.967973

            TUE       CALC                 MTRANS           NObeyesdad
     0  0.976473  Sometimes  Public_Transportation  Overweight_Level_II
     1  1.000000         no             Automobile        Normal_Weight
     2  1.673584         no  Public_Transportation  Insufficient_Weight
     3  0.780199  Sometimes  Public_Transportation     Obesity_Type_III
     4  0.931721  Sometimes  Public_Transportation  Overweight_Level_II

[4]: df.info()
#no missing values

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20758 non-null int64
1 Gender 20758 non-null object
2 Age 20758 non-null float64
3 Height 20758 non-null float64
4 Weight 20758 non-null float64
5 family_history_with_overweight 20758 non-null object
6 FAVC 20758 non-null object
7 FCVC 20758 non-null float64
8 NCP 20758 non-null float64
9 CAEC 20758 non-null object
10 SMOKE 20758 non-null object
11 CH2O 20758 non-null float64
12 SCC 20758 non-null object
13 FAF 20758 non-null float64
14 TUE 20758 non-null float64
15 CALC 20758 non-null object
16 MTRANS 20758 non-null object
17 NObeyesdad 20758 non-null object
dtypes: float64(8), int64(1), object(9)
memory usage: 2.9+ MB

[5]: def summary(df):
         print(f'data shape: {df.shape}')
         summ = pd.DataFrame(df.dtypes, columns=['data type'])
         summ['#missing'] = df.isnull().sum().values
         summ['%missing'] = df.isnull().sum().values / len(df) * 100
         summ['#unique'] = df.nunique().values
         desc = pd.DataFrame(df.describe(include='all').transpose())
         summ['min'] = desc['min'].values
         summ['max'] = desc['max'].values
         return summ

     summary(df)

data shape: (20758, 18)

[5]:                                data type  #missing  %missing  #unique   min  \
     id                                 int64         0       0.0    20758   0.0
     Gender                            object         0       0.0        2   NaN
     Age                              float64         0       0.0     1703  14.0
     Height                           float64         0       0.0     1833  1.45
     Weight                           float64         0       0.0     1979  39.0
     family_history_with_overweight    object         0       0.0        2   NaN
     FAVC                              object         0       0.0        2   NaN
     FCVC                             float64         0       0.0      934   1.0
     NCP                              float64         0       0.0      689   1.0
     CAEC                              object         0       0.0        4   NaN
     SMOKE                             object         0       0.0        2   NaN
     CH2O                             float64         0       0.0     1506   1.0
     SCC                               object         0       0.0        2   NaN
     FAF                              float64         0       0.0     1360   0.0
     TUE                              float64         0       0.0     1297   0.0
     CALC                              object         0       0.0        3   NaN
     MTRANS                            object         0       0.0        5   NaN
     NObeyesdad                        object         0       0.0        7   NaN

                                            max
     id                                 20757.0
     Gender                                 NaN
     Age                                   61.0
     Height                            1.975663
     Weight                          165.057269
     family_history_with_overweight         NaN
     FAVC                                   NaN
     FCVC                                   3.0
     NCP                                    4.0
     CAEC                                   NaN
     SMOKE                                  NaN
     CH2O                                   3.0
     SCC                                    NaN
     FAF                                    3.0
     TUE                                    2.0
     CALC                                   NaN
     MTRANS                                 NaN
     NObeyesdad                             NaN

[6]: df.size

[6]: 373644

[7]: df.shape
#18 columns, 20K entries.

[7]: (20758, 18)

2 EDA
Let’s explore the relationship between different variables within the dataset.

[8]: import matplotlib.pyplot as plt
     import seaborn as sns

[9]: plt.style.use('ggplot')

[10]: sns.scatterplot(df, x='Age',y='Weight',hue='NObeyesdad',alpha=0.05)

[10]: <Axes: xlabel='Age', ylabel='Weight'>

• Most people in the dataset fall in the Obesity_Type_III class
• A noticeably higher share of people aged 25-35 have Type II obesity
• Most people in the Insufficient_Weight class are between ages 15 and 25

[11]: sns.countplot(df, x='NObeyesdad', hue='SMOKE')
      plt.xlabel("Weight Class")
      plt.xticks(rotation=90)

[11]: (array([0, 1, 2, 3, 4, 5, 6]),
[Text(0, 0, 'Overweight_Level_II'),
Text(1, 0, 'Normal_Weight'),
Text(2, 0, 'Insufficient_Weight'),
Text(3, 0, 'Obesity_Type_III'),
Text(4, 0, 'Obesity_Type_II'),
Text(5, 0, 'Overweight_Level_I'),
Text(6, 0, 'Obesity_Type_I')])

• People in the Obesity_Type_II class have the highest proportion of smokers

[12]: sns.countplot(df, x='NObeyesdad', hue='MTRANS')
plt.xlabel("Weight Class")
plt.xticks(rotation=90)

[12]: (array([0, 1, 2, 3, 4, 5, 6]),
[Text(0, 0, 'Overweight_Level_II'),
Text(1, 0, 'Normal_Weight'),
Text(2, 0, 'Insufficient_Weight'),
Text(3, 0, 'Obesity_Type_III'),
Text(4, 0, 'Obesity_Type_II'),
Text(5, 0, 'Overweight_Level_I'),
Text(6, 0, 'Obesity_Type_I')])

• Obesity_Type_II entries use automobiles more than any other class
• Normal_Weight entries walk more than other classes

[13]: sns.scatterplot(df, x='Height', y='Weight', alpha=0.05, hue='family_history_with_overweight')

[13]: <Axes: xlabel='Height', ylabel='Weight'>

2.1 Feature Engineering

Let's create some new features to act as predictors for obesity. We noticed above that some variables separate certain obesity classes well, so let's engineer features that capture them.

[14]: df_eng = df.copy()

      # Height is recorded in metres (max ~1.98), so BMI = weight (kg) / height (m)^2.
      # The original cell divided Height by 100 as if it were in centimetres.
      df_eng['BMI'] = df_eng['Weight'] / df_eng['Height'] ** 2
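As a quick sanity check (not part of the original run), the engineered BMI can be compared against the labelled classes; if the feature is computed correctly, mean BMI should increase from Insufficient_Weight through Obesity_Type_III:

[ ]: # Hypothetical check: mean/min/max BMI per labelled class, sorted by mean
     df_eng.groupby('NObeyesdad')['BMI'].agg(['mean', 'min', 'max']).sort_values('mean')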

[15]: df_eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 20758 entries, 0 to 20757
Data columns (total 19 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 20758 non-null int64
1 Gender 20758 non-null object
2 Age 20758 non-null float64
3 Height 20758 non-null float64

4 Weight 20758 non-null float64
5 family_history_with_overweight 20758 non-null object
6 FAVC 20758 non-null object
7 FCVC 20758 non-null float64
8 NCP 20758 non-null float64
9 CAEC 20758 non-null object
10 SMOKE 20758 non-null object
11 CH2O 20758 non-null float64
12 SCC 20758 non-null object
13 FAF 20758 non-null float64
14 TUE 20758 non-null float64
15 CALC 20758 non-null object
16 MTRANS 20758 non-null object
17 NObeyesdad 20758 non-null object
18 BMI 20758 non-null float64
dtypes: float64(9), int64(1), object(9)
memory usage: 3.0+ MB

3 Machine Learning Model

[16]: import xgboost as xgb
      from sklearn.model_selection import train_test_split
      from sklearn.metrics import accuracy_score
      from xgboost import XGBClassifier

      X = df_eng.drop('NObeyesdad', axis=1)
      y = df['NObeyesdad']

[17]: X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

[18]: from sklearn.compose import ColumnTransformer
      from sklearn.pipeline import Pipeline
      from sklearn.preprocessing import StandardScaler, OrdinalEncoder, LabelEncoder, OneHotEncoder

[38]: # Label encode the target variable y
      label_encoder = LabelEncoder()
      y_train_encoded = label_encoder.fit_transform(y_train)
      y_test_encoded = label_encoder.transform(y_test)
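An optional way to see how LabelEncoder assigned the integers: classes_ is sorted alphabetically, so the mapping is deterministic (sketch, output not shown in the original):

[ ]: # Optional: inspect the class-to-integer mapping used for y
     dict(zip(label_encoder.classes_, range(len(label_encoder.classes_))))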

[20]: import optuna

[37]: from sklearn.pipeline import Pipeline
      from sklearn.compose import ColumnTransformer
      from sklearn.preprocessing import StandardScaler, OrdinalEncoder
      from xgboost import XGBClassifier
      from sklearn.metrics import accuracy_score

      # Separate categorical and numerical columns
      categorical_cols = X_train.select_dtypes(include=['object', 'bool']).columns
      numerical_cols = X_train.select_dtypes(exclude=['object', 'bool']).columns

      # Preprocessing for numerical data
      numerical_transformer = StandardScaler()

      # Preprocessing for categorical data
      categorical_transformer = Pipeline(steps=[
          ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
      ])

      preprocessor = ColumnTransformer(
          transformers=[
              ('num', numerical_transformer, numerical_cols),
              ('cat', categorical_transformer, categorical_cols)
          ],
          remainder='passthrough'
      )

      def objective_xgb(trial):
          params = {
              "eval_metric": "mlogloss",
              "objective": "multi:softmax",
              "booster": trial.suggest_categorical("booster", ["gbtree"]),
              "grow_policy": trial.suggest_categorical("grow_policy", ["depthwise", "lossguide"]),
              "n_estimators": trial.suggest_int("n_estimators", 100, 1000),
              "learning_rate": trial.suggest_float("learning_rate", 0.01, 1.0, log=True),
              "gamma": trial.suggest_float("gamma", 1e-9, 0.5),
              "subsample": trial.suggest_float("subsample", 0.3, 1.0),
              "colsample_bytree": trial.suggest_float("colsample_bytree", 0.3, 1.0),
              "max_depth": trial.suggest_int("max_depth", 5, 24),
              "min_child_weight": trial.suggest_int("min_child_weight", 1, 7),
              "reg_lambda": trial.suggest_float("reg_lambda", 1e-9, 100.0, log=True),
              "reg_alpha": trial.suggest_float("reg_alpha", 1e-9, 100.0, log=True)
          }

          # Create the XGBoost classifier with the current set of hyperparameters
          model_xgb = XGBClassifier(**params)

          # Construct the full pipeline with the classifier
          pipeline = Pipeline([
              ('preprocessor', preprocessor),
              ('classifier', model_xgb)
          ])

          # Fit the pipeline on the training data
          pipeline.fit(X_train, y_train_encoded)

          # Predict and calculate accuracy on the validation set
          y_pred_c = pipeline.predict(X_test)
          return accuracy_score(y_test_encoded, y_pred_c)

      study_xgboost = optuna.create_study(
          study_name="Study_XGB_Obesity", direction="maximize"
      )
      optuna.logging.set_verbosity(optuna.logging.WARNING)
      study_xgboost.optimize(objective_xgb, n_trials=1000, show_progress_bar=True)

      # Get the best hyperparameters
      best_params_xgb = study_xgboost.best_params
      print("Best Hyperparameters:", best_params_xgb)

Best Hyperparameters: {'booster': 'gbtree', 'grow_policy': 'lossguide',
'n_estimators': 575, 'learning_rate': 0.03289752563023217, 'gamma':
0.10216792503966325, 'subsample': 0.9781875703877861, 'colsample_bytree':
0.3000971735782186, 'max_depth': 20, 'min_child_weight': 1, 'reg_lambda':
7.404755224243125e-08, 'reg_alpha': 2.2740450488861006}
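To look beyond the single best trial, Optuna can export the whole search history. A small optional sketch (output not shown in the original):

[ ]: # Optional: summarise the 1000 trials. trials_dataframe() returns one
     # row per trial with its sampled parameters and objective value.
     trials = study_xgboost.trials_dataframe()
     print(trials['value'].describe())
     print("Best accuracy:", study_xgboost.best_value)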

[39]: # Separate categorical and numerical columns
      categorical_cols = X.select_dtypes(include=['object', 'bool']).columns
      numerical_cols = X.select_dtypes(exclude=['object', 'bool']).columns

      # Preprocessing for numerical data
      numerical_transformer = StandardScaler()

      # Preprocessing for categorical data
      categorical_transformer = Pipeline(steps=[
          ('encoder', OrdinalEncoder(handle_unknown='use_encoded_value', unknown_value=-1)),
      ])

      preprocessor = ColumnTransformer(
          transformers=[
              ('num', numerical_transformer, numerical_cols),
              ('cat', categorical_transformer, categorical_cols)
          ],
          remainder='passthrough'
      )

      new_params = {
          'booster': 'gbtree',
          'grow_policy': 'lossguide',
          'n_estimators': 575,
          'learning_rate': 0.03289752563023217,
          'gamma': 0.10216792503966325,
          'subsample': 0.9781875703877861,
          'colsample_bytree': 0.3000971735782186,
          'max_depth': 20,
          'min_child_weight': 1,
          'reg_lambda': 7.404755224243125e-08,
          'reg_alpha': 2.2740450488861006
      }

      # Renamed from `xgb` to avoid shadowing the xgboost module alias imported earlier
      xgb_clf = XGBClassifier(random_state=42, **new_params)

      # Update the pipeline
      pipeline = Pipeline([
          ('preprocessor', preprocessor),
          ('classifier', xgb_clf)
      ])

      # Fit the pipeline on the training data
      pipeline.fit(X_train, y_train_encoded)

[39]: Pipeline(steps=[('preprocessor',
ColumnTransformer(remainder='passthrough',
transformers=[('num', StandardScaler(),
Index(['id', 'Age', 'Height',
'Weight', 'FCVC', 'NCP', 'CH2O', 'FAF', 'TUE',
'BMI'],
dtype='object')),
('cat',
Pipeline(steps=[('encoder',
OrdinalEncoder(handle_unknown='use_encoded_value',
unknown_value=-1))]),
Index(['Gender',
'family_history_with_over…
grow_policy='lossguide', importance_type=None,
interaction_constraints=None,
learning_rate=0.03289752563023217, max_bin=None,
max_cat_threshold=None, max_cat_to_onehot=None,
max_delta_step=None, max_depth=20,
max_leaves=None, min_child_weight=1, missing=nan,
monotone_constraints=None, multi_strategy=None,
n_estimators=575, n_jobs=None,
num_parallel_tree=None,
objective='multi:softprob', …))])

[40]: from sklearn.model_selection import cross_val_score

      cross_val_scores = cross_val_score(pipeline, X_train, y_train_encoded, cv=5, scoring='accuracy')

      # Print the cross-validation scores
      print("Cross-validation scores:", cross_val_scores)

      # Print the mean accuracy and standard deviation
      print("Mean Accuracy: {:.2f}%".format(cross_val_scores.mean() * 100))
      print("Standard Deviation: {:.2f}".format(cross_val_scores.std()))

Cross-validation scores: [0.90246839 0.90966576 0.90725685 0.90755796 0.91779584]
Mean Accuracy: 90.89%
Standard Deviation: 0.01
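Accuracy alone can hide weak classes. An optional sketch (not part of the original run) for per-class recall on the held-out split:

[ ]: # Optional: per-class recall from a confusion matrix on the test split
     from sklearn.metrics import confusion_matrix

     cm = confusion_matrix(y_test_encoded, pipeline.predict(X_test))
     per_class_recall = cm.diagonal() / cm.sum(axis=1)
     for cls, rec in zip(label_encoder.classes_, per_class_recall):
         print(f"{cls}: {rec:.3f}")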

[41]: import matplotlib.pyplot as plt
      from xgboost import plot_importance

      # 'pipeline' is the fitted pipeline from above; grab the trained classifier
      xgb_classifier = pipeline.named_steps['classifier']

      # Plot feature importances
      plot_importance(xgb_classifier)
      plt.show()
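plot_importance labels the bars f0, f1, ... because the classifier was fit on the NumPy array produced by the ColumnTransformer, not on named columns. A sketch for recovering readable names, assuming scikit-learn >= 1.0 (for get_feature_names_out):

[ ]: # Hypothetical helper: map XGBoost's f0, f1, ... back to column names
     feature_names = pipeline.named_steps['preprocessor'].get_feature_names_out()
     scores = xgb_classifier.get_booster().get_score(importance_type='weight')
     readable = {feature_names[int(k[1:])]: v for k, v in scores.items()}
     print(sorted(readable.items(), key=lambda kv: kv[1], reverse=True)[:5])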

4 Time to Make Predictions
[42]: df_test = pd.read_csv("test.csv")

[43]: df_test_eng = df_test.copy()

      # Same BMI feature as the training set (Height is in metres)
      df_test_eng['BMI'] = df_test_eng['Weight'] / df_test_eng['Height'] ** 2

[44]: df_test_eng.info()

<class 'pandas.core.frame.DataFrame'>
RangeIndex: 13840 entries, 0 to 13839
Data columns (total 18 columns):
# Column Non-Null Count Dtype
--- ------ -------------- -----
0 id 13840 non-null int64
1 Gender 13840 non-null object
2 Age 13840 non-null float64
3 Height 13840 non-null float64
4 Weight 13840 non-null float64

5 family_history_with_overweight 13840 non-null object
6 FAVC 13840 non-null object
7 FCVC 13840 non-null float64
8 NCP 13840 non-null float64
9 CAEC 13840 non-null object
10 SMOKE 13840 non-null object
11 CH2O 13840 non-null float64
12 SCC 13840 non-null object
13 FAF 13840 non-null float64
14 TUE 13840 non-null float64
15 CALC 13840 non-null object
16 MTRANS 13840 non-null object
17 BMI 13840 non-null float64
dtypes: float64(9), int64(1), object(8)
memory usage: 1.9+ MB

[45]: predictions = pipeline.predict(df_test_eng)

[46]: # 'label_encoder' is the LabelEncoder fit on the training labels; map
      # the encoded predictions back to the original class names
      predicted_original_values = label_encoder.inverse_transform(predictions)
      predicted_original_values

[46]: array(['Obesity_Type_II', 'Overweight_Level_I', 'Obesity_Type_III', …,
             'Insufficient_Weight', 'Normal_Weight', 'Obesity_Type_II'],
            dtype=object)

[47]: # Build the submission DataFrame with 'id' and 'Prediction'
      result_df = pd.DataFrame({
          'id': df_test_eng['id'],
          'Prediction': predicted_original_values})

[48]: result_df

[48]: id Prediction
0 20758 Obesity_Type_II
1 20759 Overweight_Level_I
2 20760 Obesity_Type_III
3 20761 Obesity_Type_I
4 20762 Obesity_Type_III
… … …
13835 34593 Overweight_Level_II
13836 34594 Normal_Weight
13837 34595 Insufficient_Weight
13838 34596 Normal_Weight
13839 34597 Obesity_Type_II

[13840 rows x 2 columns]

[49]: result_df.to_csv("Submission_7_Optuna.csv",index=False)
