Experiment 2
Aim:
To implement decision trees for classification and regression on healthcare datasets.
Objective:
To write programs in Python that demonstrate the working of decision trees for
classification and regression tasks, using appropriate medical datasets to build the trees
and then using them to perform predictions on new data.
Outcomes:
▪ To be able to configure a decision tree based on the dataset under consideration.
▪ To be able to train and test the decision tree model and evaluate it using various
performance metrics.
▪ To be able to use the model to predict outcomes for the problem under consideration
and interpret them well.
Theory:
Decision trees are a popular and versatile machine learning algorithm used for both
classification and regression tasks. They work by recursively splitting the data into subsets
based on certain criteria, forming a tree-like structure of decisions. Each internal node of the
tree represents a test or decision on an attribute (feature), each branch corresponds to the
outcome of that test, and each leaf node represents a class label (in classification) or a
continuous value (in regression). The goal of a decision tree is to create a model that predicts
the value of a target variable by learning simple decision rules inferred from the data features.
Decision trees are favoured for their interpretability, as the decision-making process is
transparent and can be easily visualized, making them an excellent choice when transparency
in model decision-making is essential.
The construction of a decision tree involves selecting the best attribute to split the data at
each node, a process typically guided by measures like Gini impurity, entropy, or information
gain in classification tasks, and variance reduction in regression. The algorithm evaluates all
possible splits across all features and chooses the one that best separates the data according
to the chosen criterion. This process is repeated recursively for each subset of data, forming
a tree until a stopping condition is met, such as reaching a maximum depth or having too few
samples to split further. One of the key advantages of decision trees is their ability to handle
both numerical and categorical data and their robustness to irrelevant features. However,
decision trees are prone to overfitting, especially with noisy data, as they can grow very deep
and complex, capturing random fluctuations in the data rather than the underlying pattern.
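To make the split criteria concrete, the following is a minimal Python sketch (written for this report, not taken from any library) that computes Gini impurity and entropy for a node's class labels:

import numpy as np

def gini_impurity(labels):
    # Gini = 1 - sum(p_k^2) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return 1.0 - np.sum(p ** 2)

def entropy(labels):
    # Entropy = -sum(p_k * log2(p_k)) over the class proportions p_k
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

# A pure node has impurity 0; a 50/50 node is maximally impure
print(gini_impurity([0, 0, 0, 0]))  # 0.0
print(gini_impurity([0, 0, 1, 1]))  # 0.5
print(entropy([0, 0, 1, 1]))        # 1.0

Information gain for a candidate split is then the parent node's impurity minus the weighted average impurity of the resulting child nodes.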
Decision trees are widely used in various fields, including finance, healthcare, marketing, and
more, due to their simplicity and interpretability. They can be used in tasks such as credit
scoring, medical diagnosis, and customer segmentation. Despite their strengths, decision
trees have some limitations. They tend to be unstable, meaning small changes in the data can
lead to significantly different trees. This sensitivity to data variations can reduce the model's
generalization ability. Moreover, decision trees can be biased towards features with more
levels (categories) and are not always the best performers in terms of predictive accuracy,
particularly when compared to more complex models like ensemble methods (e.g., Random
Forests or Gradient Boosting). To mitigate these issues, techniques such as pruning, which
involves cutting back the tree to prevent overfitting, and ensemble methods, which combine
multiple trees to improve stability and accuracy, are often employed. Despite these
challenges, decision trees remain a fundamental tool in the machine learning toolbox, valued
for their clarity and ease of use.
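As an illustration of the pruning idea mentioned above, scikit-learn exposes cost-complexity pruning through the ccp_alpha parameter of its tree estimators. The sketch below uses synthetic data purely for demonstration, and the alpha value is an arbitrary assumption:

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic data purely for illustration
X, y = make_classification(n_samples=500, n_features=8, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# An unpruned tree typically overfits the training data...
full_tree = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# ...while a cost-complexity pruned tree trades training fit for generalization
pruned_tree = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

print('Unpruned test accuracy:', full_tree.score(X_test, y_test))
print('Pruned test accuracy:', pruned_tree.score(X_test, y_test))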
Dataset Description:
[1] For classification task-
For the task of constructing the decision tree, the Stroke Prediction Dataset was taken into
consideration.
(https://www.kaggle.com/datasets/fedesoriano/stroke-prediction-dataset/data)
The task at hand was to predict whether a particular person would suffer a
stroke based on 10 clinical features that characterize the person. These features include
gender, age, whether the person suffers from hypertension, whether the person has a history
of heart disease, the person's work type (private, government, self-employed), marital status,
residence type (rural, urban), average glucose level, BMI and the person's smoking status
(never smoked, formerly smoked).
The dataset has about 5000 records, providing reasonable initial confidence that a good
decision tree model can be learned.
[2] For regression task-
The idea was to predict the BMI of a person given his/her age, weight, bio-impedance, gender
and height. The dataset has about 741 records.
Code:
The following is a step-by-step implementation of the task at hand-
[1] Classification task-

import pandas as pd
import matplotlib.pyplot as plt
import seaborn as sns

df = pd.read_csv('healthcare-dataset-stroke-data.csv')  # dataset path assumed
categorical_columns = df.select_dtypes(include=['object']).columns
corr_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
The feature ‘age’ appears to be highly correlated with the possibility of a person suffering a
stroke. The information about residence type seems to have the least effect on whether
a particular person will or will not suffer a stroke. Accordingly, that feature may be dropped.
In the above plots, orange data points represent people who suffered a stroke, while blue
points represent those who did not. One can clearly see that older people are more
likely to suffer strokes in all the scenarios (as indicated by the high density of orange points
in the row labelled ‘age’).
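The scatter plots described above can be reproduced with a seaborn pairplot; the following is a minimal sketch, assuming the orange/blue colouring comes from using the 'stroke' column as the hue:

sns.pairplot(df, hue='stroke',
             vars=['age', 'avg_glucose_level', 'bmi'],
             palette={0: 'tab:blue', 1: 'tab:orange'})
plt.show()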
# Encode each categorical column as integer codes (mapping reconstructed)
for column in categorical_columns:
    value_to_int = {value: i for i, value in enumerate(df[column].unique())}
    df[column] = df[column].map(value_to_int)

# Visualize the class balance of the target variable
stroke_counts = df['stroke'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(stroke_counts, labels=stroke_counts.index, autopct='%1.1f%%',
        colors=['#ff9999', '#66b3ff'], startangle=140)
plt.title('Distribution of Stroke Cases')
plt.show()
Since the dataset is imbalanced (far fewer stroke cases than non-stroke cases), random
undersampling of the majority class was performed to balance it.
Balancing the dataset-
df_cleaned = df.dropna()  # assumed: df with rows containing missing values removed
stroke_one_df = df_cleaned[df_cleaned['stroke'] == 1]
stroke_zero_df = df_cleaned[df_cleaned['stroke'] == 0].sample(n=211, random_state=1)
new_df = pd.concat([stroke_one_df, stroke_zero_df])
new_df.reset_index(drop=True, inplace=True)

stroke_counts = new_df['stroke'].value_counts()
plt.figure(figsize=(8, 6))
plt.pie(stroke_counts, labels=stroke_counts.index, autopct='%1.1f%%',
        colors=['#ff9999', '#66b3ff'], startangle=140)
plt.title('Distribution of Stroke Cases')
plt.show()
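The accuracy scores and tree rules shown in the Output section come from a trained classifier whose code is not reproduced above. The following is a minimal sketch of how such results could be generated; the split ratio and max_depth are assumptions, the latter chosen to be consistent with the depth of the printed rules:

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier, export_text
from sklearn.metrics import accuracy_score

X = new_df.drop(columns=['id', 'stroke'])  # drop identifier and target; 'id' column assumed present
y = new_df['stroke']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

clf = DecisionTreeClassifier(max_depth=10, random_state=1)  # depth is an assumed hyperparameter
clf.fit(X_train, y_train)

print('Accuracy Score:', accuracy_score(y_test, clf.predict(X_test)))    # test accuracy
print('Accuracy Score:', accuracy_score(y_train, clf.predict(X_train)))  # training accuracy
print('Decision Tree Rules:')
print(export_text(clf, feature_names=list(X.columns)))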
[2] Regression task- the BMI dataset was loaded and its categorical features encoded-

from sklearn.preprocessing import LabelEncoder

df = pd.read_csv('bmi.csv')  # dataset path assumed
label_encoder = LabelEncoder()
df['Gender_encoded'] = label_encoder.fit_transform(df['Gender'])
df = df.drop(columns=['Gender'])

categorical_columns = df.select_dtypes(include=['object']).columns
for col in categorical_columns:
    plt.figure(figsize=(8, 4))
    sns.countplot(data=df, x=col)
    plt.title(f'Count of {col}')
    plt.show()
corr_matrix = df.corr(numeric_only=True)
plt.figure(figsize=(10, 8))
sns.heatmap(corr_matrix, annot=True, cmap='coolwarm', fmt='.2f')
plt.title('Correlation Matrix')
plt.show()
The above plot clearly depicts a high dependence of BMI on weight, which is quite logical.
Further, height shows a correlation almost half as strong as weight's, but it is still an important
factor to take into consideration. Age appears to have the least positive correlation with BMI.
Splitting the processed and analysed dataset into train and test sets-

from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeRegressor, plot_tree

X = df.drop(columns='Bmi')
y = df['Bmi']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=1)

# max_depth is an assumed constraint, added to curb the overfitting noted in the Conclusion
regressor = DecisionTreeRegressor(max_depth=5, random_state=1)
regressor.fit(X_train, y_train)

plt.figure(figsize=(20, 10))
plot_tree(regressor,
          feature_names=X.columns,
          filled=True,
          rounded=True)
plt.title('Decision Tree Visualization')
plt.show()
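The r-squared values reported in the Conclusion can be computed from this split; a short sketch:

from sklearn.metrics import r2_score

print('Training R^2:', r2_score(y_train, regressor.predict(X_train)))
print('Test R^2:', r2_score(y_test, regressor.predict(X_test)))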
Output:
[1] Classification task:
Upon evaluating the model on the test set, the following accuracy was obtained-
Accuracy Score: 0.8263492063492064
The training accuracy was as follows-
Accuracy Score: 0.9149659863945578
The following is the decision tree structure obtained after training-
Decision Tree Rules:
|--- age <= 44.50
| |--- avg_glucose_level <= 58.25
| | |--- class: 1
| |--- avg_glucose_level > 58.25
| | |--- age <= 31.50
| | | |--- class: 0
| | |--- age > 31.50
| | | |--- work_type <= 1.50
| | | | |--- age <= 33.00
| | | | | |--- class: 0
| | | | |--- age > 33.00
| | | | | |--- class: 0
| | | |--- work_type > 1.50
| | | | |--- class: 0
|--- age > 44.50
| |--- age <= 75.50
| | |--- bmi <= 25.55
| | | |--- avg_glucose_level <= 79.36
| | | | |--- class: 1
| | | |--- avg_glucose_level > 79.36
| | | | |--- avg_glucose_level <= 94.08
| | | | | |--- class: 0
| | | | |--- avg_glucose_level > 94.08
| | | | | |--- bmi <= 23.85
| | | | | | |--- class: 0
| | | | | |--- bmi > 23.85
| | | | | | |--- class: 0
| | |--- bmi > 25.55
| | | |--- bmi <= 32.15
| | | | |--- avg_glucose_level <= 70.97
| | | | | |--- class: 0
| | | | |--- avg_glucose_level > 70.97
| | | | | |--- age <= 67.50
| | | | | | |--- smoking_status <= 2.50
| | | | | | | |--- avg_glucose_level <= 80.30
| | | | | | | | |--- class: 1
| | | | | | | |--- avg_glucose_level > 80.30
| | | | | | | | |--- Residence_type <= 1.50
| | | | | | | | | |--- avg_glucose_level <= 140.18
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- avg_glucose_level > 140.18
| | | | | | | | | | |--- class: 1
| | | | | | | | |--- Residence_type > 1.50
| | | | | | | | | |--- class: 1
| | | | | | |--- smoking_status > 2.50
| | | | | | | |--- age <= 51.50
| | | | | | | | |--- class: 0
| | | | | | | |--- age > 51.50
| | | | | | | | |--- class: 1
| | | | | |--- age > 67.50
| | | | | | |--- class: 1
| | | |--- bmi > 32.15
| | | | |--- bmi <= 33.80
| | | | | |--- class: 0
| | | | |--- bmi > 33.80
| | | | | |--- age <= 55.50
| | | | | | |--- bmi <= 39.55
| | | | | | | |--- class: 0
| | | | | | |--- bmi > 39.55
| | | | | | | |--- bmi <= 42.70
| | | | | | | | |--- class: 1
| | | | | | | |--- bmi > 42.70
| | | | | | | | |--- class: 0
| | | | | |--- age > 55.50
| | | | | | |--- avg_glucose_level <= 191.15
| | | | | | | |--- heart_disease <= 0.50
| | | | | | | | |--- Residence_type <= 1.50
| | | | | | | | | |--- gender <= 1.50
| | | | | | | | | | |--- class: 0
| | | | | | | | | |--- gender > 1.50
| | | | | | | | | | |--- class: 0
| | | | | | | | |--- Residence_type > 1.50
| | | | | | | | | |--- class: 1
| | | | | | | |--- heart_disease > 0.50
| | | | | | | | |--- class: 1
| | | | | | |--- avg_glucose_level > 191.15
| | | | | | | |--- class: 1
| |--- age > 75.50
| | |--- bmi <= 26.95
| | | |--- class: 1
| | |--- bmi > 26.95
| | | |--- bmi <= 31.30
| | | | |--- avg_glucose_level <= 166.65
| | | | | |--- class: 1
| | | | |--- avg_glucose_level > 166.65
| | | | | |--- class: 1
| | | |--- bmi > 31.30
| | | | |--- class: 1
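Per the objective, the trained classifier can also be used to predict on new data. The record below is hypothetical, and its values must follow the same integer encodings applied during preprocessing:

# Hypothetical patient record; column names and order must match the training features
new_patient = pd.DataFrame([{
    'gender': 1, 'age': 67, 'hypertension': 0, 'heart_disease': 1,
    'ever_married': 1, 'work_type': 2, 'Residence_type': 1,
    'avg_glucose_level': 228.69, 'bmi': 36.6, 'smoking_status': 2,
}])
print('Predicted stroke label:', clf.predict(new_patient)[0])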
[2] Regression task:
The decision tree hypothesized for the regression task was obtained via the plot_tree
visualization generated in the Code section.
Conclusion:
By performing this experiment, I was able to understand the basic concepts associated with
building a decision tree. I was able to build, train and test decision trees in Python and
arrived at the following inferences-
▪ In the case of the classification task, the analysis steps revealed a strong dependence on
age as a factor in determining whether a person would suffer a stroke.
▪ The trained decision tree model showed an accuracy of 82.63 percent on the validation
set, while an accuracy of 91.49 percent was obtained on the training set.
▪ Printing the hypothesized decision tree further supported the first inference, owing to
a large number of decision nodes being based on age.
▪ In the case of the regression task, the analysis, logically, revealed a heavy dependence on
weight and height as features for predicting the body mass index of an individual.
▪ The model trained initially had an r-squared value of 0.98 on the training data, which
was identified as overfitting. The rectified model then had a test r-squared value of
around 0.8517, while the r-squared value on the training data was approximately 0.89.