Experiment - 2

Name: Ansari Mohammed Shanouf Valijan


Class: B.E. Computer Engineering, Semester - VII
UID: 2021300004
Batch: M

Aim:
To implement decision trees for classification and regression on healthcare datasets.

Objective:
To write programs in Python that demonstrate the working of decision trees for
classification and regression tasks, using appropriate medical datasets to build the trees
and then using them to perform prediction on new data.

Outcomes:
▪ To be able to configure a decision tree based on the dataset under consideration.
▪ To be able to train the decision tree model and test it using various performance
metrics.
▪ To be able to use the model to predict the solution to the problem under consideration
and interpret it well.

Theory:
Decision trees are a popular and versatile machine learning algorithm used for both
classification and regression tasks. They work by recursively splitting the data into subsets
based on certain criteria, forming a tree-like structure of decisions. Each internal node of the
tree represents a test or decision on an attribute (feature), each branch corresponds to the
outcome of that test, and each leaf node represents a class label (in classification) or a
continuous value (in regression). The goal of a decision tree is to create a model that predicts
the value of a target variable by learning simple decision rules inferred from the data features.
Decision trees are favoured for their interpretability, as the decision-making process is
transparent and can be easily visualized, making them an excellent choice when transparency
in model decision-making is essential.
The construction of a decision tree involves selecting the best attribute to split the data at
each node, a process typically guided by measures like Gini impurity, entropy, or information
gain in classification tasks, and variance reduction in regression. The algorithm evaluates all
possible splits across all features and chooses the one that best separates the data according
to the chosen criterion. This process is repeated recursively for each subset of data, forming
a tree until a stopping condition is met, such as reaching a maximum depth or having too few
samples to split further. One of the key advantages of decision trees is their ability to handle
both numerical and categorical data and their robustness to irrelevant features. However,
decision trees are prone to overfitting, especially with noisy data, as they can grow very deep
and complex, capturing random fluctuations in the data rather than the underlying pattern.
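
As a concrete illustration of these split criteria (an addition for clarity, not part of the original write-up), the short Python sketch below computes Gini impurity, entropy and the resulting information gain for a hypothetical split of stroke/no-stroke labels:

import numpy as np

def gini(labels):
    # Gini impurity: 1 - sum of squared class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy: -sum of p * log2(p) over the class proportions
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# Hypothetical labels at a parent node and at the two children of a candidate split
parent = np.array([0, 0, 0, 0, 1, 1, 1, 1])
left, right = np.array([0, 0, 0, 1]), np.array([0, 1, 1, 1])

# Information gain = parent entropy - weighted average entropy of the children
weighted_children = (len(left) * entropy(left) + len(right) * entropy(right)) / len(parent)
print("Gini(parent):", gini(parent))                              # 0.5
print("Information gain:", entropy(parent) - weighted_children)   # about 0.19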

Decision trees are widely used in various fields, including finance, healthcare, marketing, and
more, due to their simplicity and interpretability. They can be used in tasks such as credit
scoring, medical diagnosis, and customer segmentation. Despite their strengths, decision
trees have some limitations. They tend to be unstable, meaning small changes in the data can
lead to significantly different trees. This sensitivity to data variations can reduce the model's
generalization ability. Moreover, decision trees can be biased towards features with more
levels (categories) and are not always the best performers in terms of predictive accuracy,
particularly when compared to more complex models like ensemble methods (e.g., Random
Forests or Gradient Boosting). To mitigate these issues, techniques such as pruning, which
involves cutting back the tree to prevent overfitting, and ensemble methods, which combine
multiple trees to improve stability and accuracy, are often employed. Despite these
challenges, decision trees remain a fundamental tool in the machine learning toolbox, valued
for their clarity and ease of use.
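
To make the pruning idea concrete, scikit-learn exposes cost-complexity pruning on decision trees through the ccp_alpha parameter. The sketch below is an illustrative addition (it uses scikit-learn's built-in breast-cancer dataset, not the datasets of this experiment) that selects a pruning strength by cross-validation:

from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score, train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# The pruning path lists the effective alphas at which subtrees get collapsed
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

# Score each candidate alpha with 5-fold cross-validation and keep the best one
scores = [(a, cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X_train, y_train, cv=5).mean())
          for a in path.ccp_alphas[:-1]]   # the last alpha prunes the tree to a single node
best_alpha = max(scores, key=lambda s: s[1])[0]

pruned = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X_train, y_train)
print("Best alpha:", best_alpha, "Test accuracy:", pruned.score(X_test, y_test))
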
Dataset Description:
[1] For classification task-
For the task of constructing the decision tree, the Stroke Prediction Dataset was taken into
consideration.
(https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data)

The task at hand was to predict whether a particular person would suffer from a
stroke based on 10 clinical features that characterize the person. These features include
gender, age, whether the person suffers from hypertension, whether the person has a history
of heart disease, the person’s work type (private, government, self-employed), marriage history,
residence type (rural, urban), average glucose level, BMI and the person’s smoking status
(never smoked, formerly smoked).

The dataset has about 5,100 records, providing initial confidence for building a good
decision tree model.

[2] For regression task-


Here, in order to construct the decision tree, the Body Mass Index Detection dataset was
utilized.
(https://www.kaggle.com/datasets/sayanroy058/body-mass-index-detection)

The idea was to predict the BMI of a person given his/her age, weight, bio-impedance, gender
and height. The dataset has about 741 records.

Code:
Following is a step-by-step implementation of the task at hand-

[1] Classification task:


Link to Notebook -> DecisionTreeClassification

Importing the required libraries


import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.preprocessing import LabelEncoder
from sklearn.tree import DecisionTreeClassifier, export_text, plot_tree
from sklearn.metrics import accuracy_score, classification_report

Importing the dataset


dataset = pd.read_csv('/content/healthcare-dataset-stroke-data.csv')

Viewing the dataset at a glance


dataset.describe(include='all')
Dropping irrelevant column ‘id’
df = dataset.drop(columns=['id'])

Plotting the distribution of features in the dataset


numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

for col in numeric_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.show()

categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col)
    plt.title(f'Count of {col}')
    plt.show()
The above plots helped me understand the dataset under consideration a bit better. I
was able to use them to preprocess the data more effectively, removing certain rows based on
the distributions seen above.
Plotting the correlation matrix
corr_matrix = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
The feature ‘age’ seems to be highly correlated to the possibility of a person suffering from a
stroke. The information about residence seems to have the least effect on knowing whether
a particular person will or will not suffer a stroke. Accordingly, that feature may be dropped.

Plotting pair-wise features in an attempt to find further patterns in the dataset


sns.pairplot(df, hue='stroke')
plt.show()

In the above plots, orange data points represent people who suffered a stroke, while blue
points represent those who did not. One can clearly make out that older people are more likely
to suffer from strokes in all the scenarios (as represented by the high density of orange points
in the row labelled age).

Encoding the categorical attributes to their numerical counterparts


columns_to_encode = ['gender', 'ever_married', 'work_type', 'Residence_type',
                     'smoking_status']

for column in columns_to_encode:
    unique_values = df[column].unique()
    value_to_int = {value: idx + 1 for idx, value in enumerate(unique_values)}
    df[column] = df[column].map(value_to_int)
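
As an aside, the LabelEncoder imported earlier could achieve the same result; the loop below is an alternative sketch (not part of the original flow, and its integer codes start at 0 rather than 1):

# Alternative: let scikit-learn assign the integer codes for each categorical column
for column in columns_to_encode:
    df[column] = LabelEncoder().fit_transform(df[column])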

Dropping the rows where BMI value is missing


df['bmi'].isna().sum()         # 201 missing BMI values
(201/5110)*100                 # 3.9334637964774952, i.e. about 3.9% of the rows
df_cleaned = df.dropna(subset=['bmi'])

Viewing if the dataset is unbalanced


stroke_counts = df_cleaned['stroke'].value_counts()

plt.figure(figsize=(8, 6))
plt.pie(stroke_counts, labels=stroke_counts.index, autopct='%1.1f%%',
colors=['#ff9999','#66b3ff'], startangle=140)
plt.title('Distribution of Stroke Cases')
plt.show()

Since the dataset is unbalanced, random sampling was performed to balance it.
Balancing the dataset
stroke_one_df = df_cleaned[df_cleaned['stroke'] == 1]
stroke_zero_df = df_cleaned[df_cleaned['stroke'] == 0].sample(n=211,
random_state=1)
new_df = pd.concat([stroke_one_df, stroke_zero_df])
new_df.reset_index(drop=True, inplace=True)
stroke_counts = new_df['stroke'].value_counts()

plt.figure(figsize=(8, 6))
plt.pie(stroke_counts, labels=stroke_counts.index, autopct='%1.1f%%',
colors=['#ff9999','#66b3ff'], startangle=140)
plt.title('Distribution of Stroke Cases')
plt.show()

Training the decision tree model


df = new_df
X = df.drop('stroke', axis=1)
y = df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3,
random_state=42)
param_grid = {
'criterion': ['gini', 'entropy'],
'max_depth': [10, 20, 30, 40, 50],
'min_samples_split': [2, 5, 10],
'min_samples_leaf': [1, 2, 4]
}
clf = DecisionTreeClassifier()
grid_search = GridSearchCV(estimator=clf, param_grid=param_grid,
cv=5, n_jobs=-1, verbose=2)
grid_search.fit(X_train, y_train)

print("Best parameters found: ", grid_search.best_params_)


best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
y_pred_ontrained = best_model.predict(X_train)
Evaluating the model
y_pred = best_model.predict(X_test)
print("Accuracy Score: ", accuracy_score(y_test, y_pred))
tree_rules = export_text(best_model, feature_names=list(X.columns))
print("\nDecision Tree Rules:\n", tree_rules)
accuracy_score(y_train, y_pred_ontrained)
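
Because accuracy alone can hide per-class behaviour even on a balanced sample, a fuller evaluation can reuse the classification_report imported earlier and add a confusion matrix. The snippet below is a suggested addition rather than part of the original notebook:

from sklearn.metrics import confusion_matrix

print(classification_report(y_test, y_pred))   # per-class precision, recall and F1

cm = confusion_matrix(y_test, y_pred)
plt.figure(figsize=(6, 5))
sns.heatmap(cm, annot=True, fmt='d', cmap='Blues',
            xticklabels=['No stroke', 'Stroke'],
            yticklabels=['No stroke', 'Stroke'])
plt.xlabel('Predicted')
plt.ylabel('Actual')
plt.title('Confusion Matrix')
plt.show()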

[2] Regression task:


Link to Notebook -> DecisionTreeRegression

Importing the necessary libraries


import numpy as np
import pandas as pd
import matplotlib.pyplot as plt
from sklearn.model_selection import train_test_split, GridSearchCV
from sklearn.tree import DecisionTreeRegressor, plot_tree
from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
from sklearn.preprocessing import LabelEncoder
import seaborn as sns

Importing the dataset


df = pd.read_csv('/content/Body Mass Index.csv')

Dropping irrelevant columns and encoding the categorical columns


df = df.drop(columns=['BmiClass'])
label_encoder = LabelEncoder()

df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])
df = df.drop(columns=['Gender'])

Visualizing the various features of the dataset to better understand it


numeric_columns = df.select_dtypes(include=['float64', 'int64']).columns

for col in numeric_columns:
    plt.figure(figsize=(8, 4))
    sns.histplot(df[col], kde=True, bins=30)
    plt.title(f'Distribution of {col}')
    plt.show()

categorical_columns = df.select_dtypes(include=['object']).columns

for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col)
    plt.title(f'Count of {col}')
    plt.show()

Viewing the correlation among different features present in the dataset


corr_matrix = df.corr()

plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()

The above plot clearly depicts a high dependence of BMI on weight, which is quite logical.
Further, height shows a correlation almost half as strong as weight, still an important factor
to take into consideration. Age seems to have the least positive correlation with the BMI.

Viewing pair-wise plots


sns.pairplot(df, hue='Bmi')
plt.show()
In the above plots, darker hues (purple in colour) depict higher BMI values and, as can be
observed, almost all features with values towards the higher end point towards a high BMI
value. An exception to this is the Bio-Impedance v/s Height plot, where high BMI values seem
to be scattered.

Splitting the processed and analysed dataset into train and test sets
X = df.drop(columns='Bmi')
y = df['Bmi']

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    random_state=42)
Defining the decision tree regressor model and training it (parameters were chosen after
experimenting with different configurations and choosing the ones that avoided overfitting)
regressor = DecisionTreeRegressor(
max_depth=25,
min_samples_split=40,
min_samples_leaf=15,
max_features='sqrt',
random_state=10
)

regressor.fit(X_train, y_train)
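
The hyperparameters above were tuned by hand. An alternative (shown only as a sketch with illustrative parameter ranges; it is not what was done in the original notebook) is to reuse the GridSearchCV approach from the classification task:

param_grid = {
    'max_depth': [5, 10, 25, None],
    'min_samples_split': [10, 20, 40],
    'min_samples_leaf': [5, 10, 15],
    'max_features': ['sqrt', None]
}
reg_search = GridSearchCV(DecisionTreeRegressor(random_state=10), param_grid,
                          cv=5, scoring='neg_mean_squared_error', n_jobs=-1)
reg_search.fit(X_train, y_train)
print("Best parameters found: ", reg_search.best_params_)
# regressor = reg_search.best_estimator_   # would replace the hand-tuned model above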

Evaluating the model


y_pred = regressor.predict(X_test)

mae = mean_absolute_error(y_test, y_pred)
mse = mean_squared_error(y_test, y_pred)
rmse = np.sqrt(mse)
r2 = r2_score(y_test, y_pred)

print(f"Mean Absolute Error (MAE): {mae}")
print(f"Mean Squared Error (MSE): {mse}")
print(f"Root Mean Squared Error (RMSE): {rmse}")
print(f"R-squared (R^2): {r2}")

Plotting the hypothesized decision tree

plt.figure(figsize=(20, 10))
plot_tree(regressor,
          feature_names=X.columns,
          filled=True,
          rounded=True)
plt.title('Decision Tree Visualization')
plt.show()

Output:
[1] Classification task:
Upon evaluating the model, the following accuracy was obtained on the test set-
Accuracy Score: 0.8263492063492064
Training accuracy was as follows-
Accuracy Score: 0.9149659863945578

Following is the decision tree structure that was obtained after training-
Decision Tree Rules:
|--- age <= 44.50
| |--- avg_glucose_level <= 58.25
| | |--- class: 1
| |--- avg_glucose_level > 58.25
| | |--- age <= 31.50
| | | |--- class: 0
| | |--- age > 31.50
| | | |--- work_type <= 1.50
| | | | |--- age <= 33.00
| | | | | |--- class: 0
| | | | |--- age > 33.00
| | | | | |--- class: 0
| | | |--- work_type > 1.50
| | | | |--- class: 0
|--- age > 44.50
| |--- age <= 75.50
| | |--- bmi <= 25.55
| | | |--- avg_glucose_level <= 79.36
| | | | |--- class: 1
| | | |--- avg_glucose_level > 79.36
| | | | |--- avg_glucose_level <= 94.08
| | | | | |--- class: 0
| | | | |--- avg_glucose_level > 94.08
| | | | | |--- bmi <= 23.85
| | | | | | |--- class: 0
| | | | | |--- bmi > 23.85
| | | | | | |--- class: 0
| | |--- bmi > 25.55
| | | |--- bmi <= 32.15
| | | | |--- avg_glucose_level <= 70.97
| | | | | |--- class: 0
| | | | |--- avg_glucose_level > 70.97
| | | | | |--- age <= 67.50
| | | | | | |--- smoking_status <= 2.50
| | | | | | | |--- avg_glucose_level <= 80.30
| | | | | | | | |--- class: 1
| | | | | | | |--- avg_glucose_level > 80.30
| | | | | | | | |--- Residence_type <= 1.50
| | | | | | | | | |--- avg_glucose_level <= 140.18
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- avg_glucose_level > 140.18
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- Residence_type > 1.50
| | | | | | | | | |--- class: 1
| | | | | | |--- smoking_status > 2.50
| | | | | | | |--- age <= 51.50
| | | | | | | | |--- class: 0
| | | | | | | |--- age > 51.50
| | | | | | | | |--- class: 1
| | | | | |--- age > 67.50
| | | | | | |--- class: 1
| | | |--- bmi > 32.15
| | | | |--- bmi <= 33.80
| | | | | |--- class: 0
| | | | |--- bmi > 33.80
| | | | | |--- age <= 55.50
| | | | | | |--- bmi <= 39.55
| | | | | | | |--- class: 0
| | | | | | |--- bmi > 39.55
| | | | | | | |--- bmi <= 42.70
| | | | | | | | |--- class: 1
| | | | | | | |--- bmi > 42.70
| | | | | | | | |--- class: 0
| | | | | |--- age > 55.50
| | | | | | |--- avg_glucose_level <= 191.15
| | | | | | | |--- heart_disease <= 0.50
| | | | | | | | |--- Residence_type <= 1.50
| | | | | | | | | |--- gender <= 1.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- gender > 1.50
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- Residence_type > 1.50
| | | | | | | | | |--- class: 1
| | | | | | | |--- heart_disease > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- avg_glucose_level > 191.15
| | | | | | | |--- class: 1
| |--- age > 75.50
| | |--- bmi <= 26.95
| | | |--- class: 1
| | |--- bmi > 26.95
| | | |--- bmi <= 31.30
| | | | |--- avg_glucose_level <= 166.65
| | | | | |--- class: 1
| | | | |--- avg_glucose_level > 166.65
| | | | | |--- class: 1
| | | |--- bmi > 31.30
| | | | |--- class: 1

Class 0 represents no stroke, while class 1 represents a possibility of stroke.

[2] Regression task:


The following performance parameters were obtained on the training dataset-
Mean Absolute Error (MAE): 1.85
Mean Squared Error (MSE): 10.16
Root Mean Squared Error (RMSE): 3.19
R-squared (R^2): 0.89

The following performance parameters were obtained on the test dataset-


Mean Absolute Error (MAE): 2.1160518106723467
Mean Squared Error (MSE): 10.597756621559329
Root Mean Squared Error (RMSE): 3.255419576883958
R-squared (R^2): 0.8517373327150053

The decision tree hypothesized for the regression task is as follows-
Conclusion:
By performing this experiment, I was able to understand the basic concepts associated with
building a decision tree. I was able to build, train and test decision trees in Python and came
up with the following inferences-
▪ In the case of the classification task, the analysis revealed a strong dependence on age
as a factor in determining whether a person would suffer from a stroke.
▪ The trained decision tree model showed an accuracy of 82.63 percent on the validation
set, while an accuracy of 91.49 percent was obtained on the training set.
▪ Printing the hypothesized decision tree further supported the first inference, owing to
a large number of decision nodes being based on age.
▪ In the case of the regression task, the analysis, logically, revealed a heavy dependence on
weight and height as features for predicting the body mass index of an individual.
▪ The model trained initially had a test R-squared value of 0.98, which was identified as
overfitting. The rectified model had a test R-squared value of around 0.8517, while the
R-squared value on the training data was approximately 0.89.
