UNIVERSITY OF SCIENCE

VIETNAM NATIONAL UNIVERSITY, HO CHI MINH CITY

Project 2: Decision Tree

Members:
22125002 - Pham Minh Anh
22125016 - Nguyen Manh Dinh
22125056 - Le Phat Minh
22125062 - Pham Ha Nam
Artificial Intelligence 22TT2

Contents
1 Self-assessment of Completion & Task-assignment of each member

2 The UCI Breast Cancer Wisconsin (Diagnostic) dataset
  2.1 Dataset Overview & Trivial preparation
    2.1.1 Dataset Description
    2.1.2 Trivial preparation
  2.2 Main flow
    2.2.1 Preparing the datasets
    2.2.2 Building the decision tree classifiers
    2.2.3 Evaluating the decision tree classifiers
    2.2.4 Relationship between the depth and accuracy of a decision tree

3 The UCI Wine Quality dataset
  3.1 Dataset Overview
  3.2 Main flow
    3.2.1 Preparing the datasets
    3.2.2 Building the decision tree classifiers
    3.2.3 Evaluating the decision tree classifiers
    3.2.4 The depth and accuracy of a decision tree

4 The Mobile Price Classification Dataset
  4.1 Dataset Overview
  4.2 Main flow
    4.2.1 Preparing the datasets
    4.2.2 Building the decision tree classifiers
    4.2.3 Evaluating the decision tree classifiers

1 Self-assessment of Completion & Task-assignment of each member

No. | Task | Assigned To | Completion level
1 | Data preparation for the Wine Quality Dataset | Pham Minh Anh | 100%
2 | Decision tree classifier implementation for the Wine Quality Dataset | Pham Minh Anh | 100%
3 | Performance evaluation of decision tree for the Wine Quality Dataset | | 100%
4 | Depth and accuracy for the Wine Quality Dataset | | 100%
5 | Data preparation for the Breast Cancer Dataset | Pham Ha Nam | 100%
6 | Decision tree classifier implementation for the Breast Cancer Dataset | Pham Ha Nam | 100%
7 | Performance evaluation of decision tree for the Breast Cancer Dataset | | 100%
8 | Depth and accuracy for the Breast Cancer Dataset | | 100%
9 | Data preparation for the Mobile Price Classification Dataset (additional dataset) | Le Phat Minh | 100%
10 | Decision tree classifier implementation for the Mobile Price Classification Dataset (additional dataset) | Le Phat Minh | 100%
11 | Performance evaluation of decision tree for the Mobile Price Classification Dataset (additional dataset) | | 100%
12 | Depth and accuracy for the Mobile Price Classification Dataset (additional dataset) | | 100%
13 | Documentation for the Wine Quality Dataset | | 100%
14 | Documentation for the Breast Cancer Dataset | | 100%
15 | Documentation for the Mobile Price Classification Dataset | Nguyen Manh Dinh | 100%
16 | Comparative analysis of three datasets | | 100%
17 | Review Document | | 100%

2 The UCI Breast Cancer Wisconsin (Diagnostic) dataset

2.1 Dataset Overview & Trivial preparation

2.1.1 Dataset Description

Source: Created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at
the University of Wisconsin.
Description: The dataset contains features computed from digitized images of fine needle
aspirate (FNA) samples of breast masses. It aims to classify breast tumors as either malignant
(M) or benign (B) based on 30 real-valued features derived from characteristics of cell nuclei.
Number of Instances: 569
Number of Attributes: 32 (ID, diagnosis, 30 real-valued features)
Attributes:


• ID number

• Diagnosis: M = Malignant, B = Benign

• 30 computed features for each cell nucleus:

– Mean, standard error, and worst (largest) values of:


∗ Radius (mean distance from center to perimeter points)
∗ Texture (gray-scale value standard deviation)
∗ Perimeter
∗ Area
∗ Smoothness (local radius length variation)
∗ Compactness (perimeter^2 / area - 1.0)
∗ Concavity (severity of concave portions of the contour)
∗ Concave points (number of concave portions of the contour)
∗ Symmetry
∗ Fractal dimension ("coastline approximation" - 1)

Class Distribution: 357 benign, 212 malignant


Key Notes: No missing values. The dataset has been used extensively in medical and
machine learning research for breast cancer diagnosis.
Naming Convention: In the implementation or visualization, each attribute is represented
as follows:

• <attribute name>1: Represents the mean value.

• <attribute name>2: Represents the standard error value.

• <attribute name>3: Represents the worst (largest) value.

2.1.2 Trivial preparation

First, we manually set the feature names.

# List the feature names
feature_names = [
    "radius1", "texture1", "perimeter1", "area1", "smoothness1", "compactness1",
    "concavity1", "concave_points1", "symmetry1", "fractal_dimension1",
    "radius2", "texture2", "perimeter2", "area2", "smoothness2", "compactness2",
    "concavity2", "concave_points2", "symmetry2", "fractal_dimension2",
    "radius3", "texture3", "perimeter3", "area3", "smoothness3", "compactness3",
    "concavity3", "concave_points3", "symmetry3", "fractal_dimension3"
]
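Because the names follow the naming convention described earlier, the same list could also be generated rather than typed out. The following is a minimal sketch; the names `base_names` and `generated_names` are hypothetical and are not part of the original notebook.

```python
# Base attribute names, in the order used by the dataset description.
base_names = [
    "radius", "texture", "perimeter", "area", "smoothness",
    "compactness", "concavity", "concave_points", "symmetry",
    "fractal_dimension",
]

# Suffix 1 = mean, 2 = standard error, 3 = worst (largest) value.
generated_names = [
    f"{name}{suffix}" for suffix in (1, 2, 3) for name in base_names
]

print(len(generated_names), generated_names[:3], generated_names[-1])
```

Generating the list this way avoids typos in the 30 names, at the cost of hiding the explicit column order from the reader.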


Then, we load the dataset with the pandas function read_csv and name the columns according to the convention above.

import pandas as pd

# Load the dataset (the raw file has no header row)
data_file_path = "breast+cancer+wisconsin+diagnostic/wdbc.data"
column_names = ['ID', 'Diagnosis'] + feature_names
data = pd.read_csv(data_file_path, header=None, names=column_names)
# Print the first 10 rows of data
print(data[:10])

Finally, we drop the ID column since it carries no diagnostic information. The Diagnosis column is also mapped to numerical values (M = 1, B = 0) to facilitate computation.

# Drop the ID column and encode the Diagnosis column
data = data.drop(columns=['ID'])
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})

# Define features and labels
X = data.drop(columns=['Diagnosis'])
y = data['Diagnosis']

2.2 Main flow

2.2.1 Preparing the datasets

According to the requirements, we use the function train_test_split from sklearn.model_selection to split the dataset into four train/test proportions: 40/60, 60/40, 80/20, and 90/10.
Each split yields four subsets: X_train (training features), y_train (training labels), X_test (test features), and y_test (test labels).

from sklearn.model_selection import train_test_split

# Training fractions for the 40/60, 60/40, 80/20 and 90/10 splits
splits = [0.4, 0.6, 0.8, 0.9]
train_test_splits = {}

# Split the dataset and store the results
for train_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=42
    )

    # Store the split results in the dictionary
    train_test_splits[train_size] = {
        'X_train': X_train,
        'X_test': X_test,
        'y_train': y_train,
        'y_test': y_test
    }

The result is a shuffled, random split for each proportion:

Train size: 40.0%
Training set size: 227, Test set size: 342
Train size: 60.0%
Training set size: 341, Test set size: 228
Train size: 80.0%
Training set size: 455, Test set size: 114
Train size: 90.0%
Training set size: 512, Test set size: 57

The visualizations of the class distribution in the original and split datasets below show that the ratio of Benign to Malignant cases is preserved in every subset. Hence, the dataset is split in a stratified fashion.
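The same check can also be done numerically. The sketch below uses a hypothetical helper, `class_proportions`, which is not part of the original code. It is applied here to labels rebuilt from the class counts quoted in the dataset description (357 benign, 212 malignant); in the notebook it could be called on `y`, `y_train`, and `y_test` to compare the ratios.

```python
from collections import Counter

def class_proportions(labels):
    """Return a dict mapping each label to its fraction of the total."""
    counts = Counter(labels)
    total = sum(counts.values())
    return {label: count / total for label, count in counts.items()}

# Labels rebuilt from the documented class distribution (0 = Benign, 1 = Malignant).
original_labels = [0] * 357 + [1] * 212
original_proportions = class_proportions(original_labels)

for label, share in original_proportions.items():
    print(f"class {label}: {share:.3f}")
```

If the split is stratified, the proportions computed on each training and test subset should agree with these values up to rounding of the sample counts.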

Figure 1: Distribution of Diagnosis in Original Dataset

Figure 2: Distribution of Diagnosis in 40/60 (Train/Test) Split

Figure 3: Distribution of Diagnosis in 60/40 (Train/Test) Split

Figure 4: Distribution of Diagnosis in 80/20 (Train/Test) Split

Figure 5: Distribution of Diagnosis in 90/10 (Train/Test) Split

2.2.2 Building the decision tree classifiers

According to the requirements, we fit four decision tree classifiers, one on each of the four training sets. In this report, we walk through the construction and visualization of one classifier in detail as an example, then show the remaining ones.
First, we use the DecisionTreeClassifier from the library sklearn.tree to construct the decision tree.
In the code below, we set criterion='entropy', so each split is chosen to minimize the weighted average entropy of the child nodes, which is equivalent to maximizing the Information Gain. Information Gain is the main criterion for choosing the test at the root of the tree and of each subtree. The random_state parameter ensures reproducibility of the results by controlling the randomness involved in the algorithm. Specifically, scikit-learn randomly permutes the features at each split, so when several candidate splits give the same improvement, the one selected can vary between runs. Setting random_state=42 fixes the seed of the random number generator, giving consistent and repeatable results across multiple runs of the code.

from sklearn.tree import DecisionTreeClassifier

# Train a DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

Next, we visualize the Decision Tree classifier by using export_graphviz from the library sklearn.tree.

import graphviz
from sklearn.tree import export_graphviz

# Visualize the decision tree
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=X.columns,
    class_names=["Benign", "Malignant"],
    filled=True,
    rounded=True,
    special_characters=True
)
graph = graphviz.Source(dot_data)
display(graph)  # display() is available in Jupyter/IPython

Figure 6: Breast Cancer Diagnostic Decision Tree Classifier on 40/60 (Train/Test) Split

The decision tree visualizes the classification of breast cancer samples into two classes:
Benign (orange nodes) and Malignant (blue nodes).

• Each internal node in the tree represents a decision point based on a threshold value for a specific feature (e.g., radius3, concave_points3).

• The values displayed in each node include:

– Entropy: The impurity of the samples in that node.


– Samples: The number of data points reaching that node.
– Value: The distribution of the two classes ([Benign, Malignant]).
– Class: The predicted class for that node.

• At each split, the decision criterion tests whether the feature value is less than or equal
to the threshold. If true, the left child node is traversed; otherwise, the right child node
is chosen.

• The tree terminates at leaf nodes, where entropy equals 0, indicating pure classification
with no further splitting required.
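To make the Entropy and Value fields concrete, the sketch below computes the entropy of a node from its class counts and the information gain of a split. It is an illustration only: the function names `entropy` and `information_gain` are hypothetical, and the child counts are made up for demonstration rather than read from Figure 6. Only the root counts correspond to the class distribution of the full dataset (357 benign, 212 malignant).

```python
import math

def entropy(counts):
    """Shannon entropy (in bits) of a node given its per-class sample counts."""
    total = sum(counts)
    return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

def information_gain(parent, children):
    """Parent entropy minus the sample-weighted average entropy of the children."""
    total = sum(parent)
    weighted = sum(sum(child) / total * entropy(child) for child in children)
    return entropy(parent) - weighted

# Root counts use the documented class distribution [Benign, Malignant].
root = [357, 212]
# Hypothetical split, for illustration only (not taken from the report's tree).
left, right = [340, 20], [17, 192]

print(f"root entropy: {entropy(root):.3f}")
print(f"information gain: {information_gain(root, [left, right]):.3f}")
print(f"pure leaf entropy: {entropy([50, 0]):.3f}")
```

A node containing a single class has entropy 0, which is why the unconstrained tree stops splitting there.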


This decision tree shows how features like radius3, concave_points3, and texture3 contribute to separating the data into benign and malignant classes, with some paths classifying the samples perfectly. The fitted tree differs from split to split. Below are the classifiers for the other splits.

Figure 7: Breast Cancer Diagnostic Decision Tree Classifier on 60/40 (Train/Test) Split

Comment: With a larger training set, and therefore more diverse cases, the decision tree classifier may need longer paths to separate the diagnoses until only one class remains in each leaf.

Figure 8: Breast Cancer Diagnostic Decision Tree Classifier on 80/20 (Train/Test) Split

Figure 9: Breast Cancer Diagnostic Decision Tree Classifier on 90/10 (Train/Test) Split

The accuracy of the classifiers built on the different train/test splits is examined in the following section.

2.2.3 Evaluating the decision tree classifiers

Evaluation of Decision Tree classifier on 40/60 split:

First, we construct the classification report and confusion matrix using the sklearn.metrics library, based on the predictions made by the Decision Tree classifier.

from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Make predictions and evaluate the model
y_pred = clf.predict(X_test)

# Get classification report and confusion matrix
report = classification_report(
    y_test, y_pred, target_names=["Benign", "Malignant"], output_dict=True
)
conf_matrix = confusion_matrix(y_test, y_pred)
accuracy = accuracy_score(y_test, y_pred)

Figure 10: Confusion Matrix for Train/Test Split: 40/60

The confusion matrix shows the performance of the model in classifying the two classes:
Benign and Malignant. The matrix contains the following:

• True Positives (TP): Malignant cases correctly predicted as Malignant (106).

• True Negatives (TN): Benign cases correctly predicted as Benign (206).

• False Positives (FP): Benign cases incorrectly predicted as Malignant (9).

• False Negatives (FN): Malignant cases incorrectly predicted as Benign (21).

The confusion matrix shows that the model predicts well overall, with few errors (30 of 342 test samples). It performs better on Benign cases (206 correct, 9 false positives) than on Malignant cases (106 correct, 21 false negatives).

Figure 11: Classification Report for Train/Test Split: 40/60

The classification report provides the following metrics for each class:

• Precision: Measures the proportion of predictions for a class that were actually correct.

– Benign: 0.907 (206 / (206 + 21))
– Malignant: 0.922 (106 / (106 + 9))

• Recall: Measures the proportion of actual members of a class that were correctly identified.

– Benign: 0.958 (206 / (206 + 9))
– Malignant: 0.835 (106 / (106 + 21))

• F1-score: Harmonic mean of precision and recall.

– Benign: 0.932
– Malignant: 0.876

• Accuracy: Overall correctness of the model (0.912).

• Macro Average: Average of metrics across both classes without considering class im-
balance.

• Weighted Average: Average of metrics across both classes, weighted by the number of
instances in each class.
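The per-class metrics can be recomputed by hand from the four counts in the 40/60 confusion matrix. The following is a minimal sketch that treats Malignant as the positive class; the variable and function names are chosen for illustration and do not come from the original notebook.

```python
# Counts read from the 40/60 confusion matrix (Malignant is the positive class).
tp, tn, fp, fn = 106, 206, 9, 21

def f1(precision, recall):
    """Harmonic mean of precision and recall."""
    return 2 * precision * recall / (precision + recall)

# Malignant: share of predicted-malignant that is correct, and share of
# actual-malignant that is found.
precision_malignant = tp / (tp + fp)
recall_malignant = tp / (tp + fn)

# Benign: the same formulas with the roles of the two classes swapped.
precision_benign = tn / (tn + fn)
recall_benign = tn / (tn + fp)

accuracy = (tp + tn) / (tp + tn + fp + fn)

for name, p, r in [
    ("Benign", precision_benign, recall_benign),
    ("Malignant", precision_malignant, recall_malignant),
]:
    print(f"{name}: precision={p:.3f} recall={r:.3f} f1={f1(p, r):.3f}")
print(f"accuracy={accuracy:.3f}")
```

In a diagnostic setting, the Malignant recall is usually the figure to watch, since a false negative means a malignant tumor is reported as benign.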

The above explains in detail how each metric conveys the performance of the classifier. We now go through the results and visualizations of the Decision Tree classifiers trained on the other splits.
Evaluation of Decision Tree classifier on 60/40 split:
The overall accuracy on this split is about 0.939, which is higher than that of the previous classifier (0.912). This suggests that the Breast Cancer decision tree classifier benefits from a larger amount of training data.

Figure 12: Confusion Matrix for Train/Test Split: 60/40

Figure 13: Classification Report for Train/Test Split: 60/40

Evaluation of Decision Tree classifier on 80/20 split:

The accuracy on this split is higher than on the two previous splits, which suggests that more training data further improves the classifier.

Figure 14: Confusion Matrix for Train/Test Split: 80/20

Figure 15: Classification Report for Train/Test Split: 80/20

Evaluation of Decision Tree classifier on 90/10 split:

The accuracy on this split is not higher than on the previous one. One possible reason is that the test set is now small (57 samples), so each misclassified sample changes the accuracy by about 1.75 percentage points and the estimate becomes less reliable. Among the four splits tried, 80/20 therefore appears to be the most suitable for this dataset.
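To illustrate how the test set size affects the reliability of a measured accuracy, the sketch below computes the accuracy step caused by a single sample and a binomial standard error for each test set size reported earlier. The binomial approximation and the reference accuracy of 0.95 are assumptions made for illustration, not results from the report, and `accuracy_resolution` is a hypothetical helper.

```python
import math

def accuracy_resolution(n_test, accuracy=0.95):
    """Return (step per sample, binomial standard error) for a test set of size n_test."""
    step = 1 / n_test
    std_error = math.sqrt(accuracy * (1 - accuracy) / n_test)
    return step, std_error

# Test set sizes of the 40/60, 60/40, 80/20 and 90/10 splits.
for n_test in (342, 228, 114, 57):
    step, std_error = accuracy_resolution(n_test)
    print(f"n_test={n_test:3d}  step={step:.4f}  std_error={std_error:.4f}")
```

Under these assumptions, the uncertainty grows as the test set shrinks, so a small difference in accuracy between the 80/20 and 90/10 splits should be read with caution.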

Figure 16: Confusion Matrix for Train/Test Split: 90/10

Figure 17: Classification Report for Train/Test Split: 90/10

All the metrics above point to the same conclusion: the Decision Tree classifier constructed on this dataset is generally effective (accuracy higher than 0.9) in predicting breast cancer.

2.2.4 Relationship between the depth and accuracy of a decision tree

First, we construct decision tree classifiers on the 80/20 split with max_depth set to None, 2, 3, 4, 5, 6, and 7.

# max_depth takes each of the values listed above in turn
clf = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, random_state=42)
clf.fit(X_train, y_train)
Then, we plot accuracy against max_depth.

import matplotlib.pyplot as plt

# Plot the accuracy scores for the different max_depth values.
# Depths are converted to strings so that None gets its own tick on the x-axis.
depths = [str(depth) for depth in accuracy_scores.keys()]
accuracies = list(accuracy_scores.values())

plt.figure(figsize=(8, 6))
plt.plot(depths, accuracies, marker='o', linestyle='-', color='b')
plt.title("Effect of max_depth on Accuracy (80/20 Split)", fontsize=14)
plt.xlabel("Max Depth", fontsize=12)
plt.ylabel("Accuracy", fontsize=12)
plt.grid(True)
plt.show()

Overall, the accuracy increases from 0.886 at max_depth = 2 to 0.956 at max_depth = 5, although not monotonically: max_depth = 4 and max_depth = 6 both give 0.930.

max_depth   None    2       3       4       5       6       7
Accuracy    0.956   0.886   0.939   0.930   0.956   0.930   0.956

Table 1: Max Depth vs Accuracy (80/20 Split)

Figure 18: Effect of max_depth on accuracy of Breast Cancer Decision Tree classifier on 80/20 split

Beyond a certain point (e.g., max_depth = 5), the accuracy reaches a plateau. Further increases in max_depth do not result in significant improvements in accuracy. This occurs because the tree has already captured most of the meaningful patterns in the data, and deeper splits primarily capture noise, which does not enhance test performance.
Below are the visualizations of the Decision Tree classifier for each max_depth:
Decision tree with max_depth = None: there is no limit on the depth of the decision tree.

Figure 19: Breast Cancer Diagnostic Decision Tree Classifier with max_depth = None

We can see that every path of the decision tree ends in a pure leaf containing a single diagnosis.
Decision tree with max_depth = 2:

Figure 20: Breast Cancer Diagnostic Decision Tree Classifier with max_depth = 2

Some of the leaves of the decision tree contain both types of the diagnosis, which means
that the max depth of 2 is not suitable. However, the overall accurac

3 The UCI Wine Quality dataset

3.1 Dataset Overview

Source: Created by Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida, Telmo Matos and Jose Reis (CVRVV), 2009.
Description: The two datasets are related to red and white variants of the Portuguese "Vinho Verde" wine. For more details, consult http://www.vinhoverde.pt/en/ or the reference [Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and sensory (the output) variables are available (e.g., there is no data about grape types, wine brand, wine selling price, etc.).
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort
of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity


2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None

3.2 Main flow

3.2.1 Preparing the datasets

Install Required Libraries

pip install pandas scikit-learn matplotlib seaborn

Download and Prepare the Dataset

import pandas as pd
import numpy as np
from sklearn.model_selection import train_test_split
import matplotlib.pyplot as plt
import seaborn as sns

# Load the White Wine dataset
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/wine-quality/winequality-white.csv"
wine_data = pd.read_csv(url, sep=";")

Group data
Since the quality score in the Wine Quality dataset can take 11 values (0-10), we group the scores into 3 broader categories for analysis: Low quality (scores 0-4), Standard quality (scores 5-6), and High quality (scores 7-10).

# Features and Labels
X = wine_data.iloc[:, :-1]  # All columns except 'quality'

# Group quality into broader categories
def group_quality(quality):
    if quality <= 4:
        return "Low"
    elif 5 <= quality <= 6:
        return "Standard"
    else:
        return "High"

y = wine_data["quality"].apply(group_quality)

According to the requirements, we use the function train_test_split from sklearn.model_selection to split the dataset into four train/test proportions: 40/60, 60/40, 80/20, and 90/10.
Each split yields four subsets: X_train (training features), y_train (training labels), X_test (test features), and y_test (test labels).
Prepare datasets

# Splitting proportions (train, test)
proportions = [(0.4, 0.6), (0.6, 0.4), (0.8, 0.2), (0.9, 0.1)]

# Prepare datasets
datasets = {}

for train_size, test_size in proportions:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, test_size=test_size,
        stratify=y, random_state=42
    )
    train_percentage = int(train_size * 100)
    test_percentage = int(test_size * 100)

    # Keys look like "80_train_features" and "20_test_labels"
    datasets[f"{train_percentage}_train_features"] = X_train
    datasets[f"{train_percentage}_train_labels"] = y_train
    datasets[f"{test_percentage}_test_features"] = X_test
    datasets[f"{test_percentage}_test_labels"] = y_test

Visualization

# Visualization function
def visualize_class_distribution(labels, title):
    sns.countplot(x=labels, palette="viridis")
    plt.title(title)
    plt.xlabel("Wine Quality Category")
    plt.ylabel("Count")
    plt.show()

Figure 21: Grouped Original Class Distribution

Figure 22: Grouped Training 80/20 Class Distribution

Figure 23: Grouped Test 80/20 Class Distribution

Figure 24: Grouped Training 90/10 Class Distribution

Figure 25: Grouped Test 90/10 Class Distribution

Figure 26: Grouped Training 40/60 Class Distribution

Figure 27: Grouped Test 40/60 Class Distribution

Figure 28: Grouped Training 60/40 Class Distribution

Figure 29: Grouped Test 60/40 Class Distribution

Comment:
The Low category is underrepresented, which might introduce class imbalance issues in modeling.
The test and training sets have the same class distribution, which confirms that they are split in a stratified fashion.

3.2.2 Building the decision tree classifiers

Install Required Libraries

pip install graphviz matplotlib scikit-learn

Train and Visualize Decision Tree Classifiers
According to the requirements, we fit four decision tree classifiers, one on each of the four training sets.
First, we use the DecisionTreeClassifier from the library sklearn.tree to construct the decision tree.
In the code below, we set criterion='entropy', so that each split maximizes the Information Gain that we learnt in theory class. The random_state parameter ensures reproducibility of the results by controlling the randomness involved in the algorithm. Specifically, scikit-learn randomly permutes the features at each split, so when several candidate splits give the same improvement, the one selected can vary between runs. Setting random_state=42 fixes the seed of the random number generator, giving consistent and repeatable results across multiple runs of the code.

from sklearn.tree import DecisionTreeClassifier, export_graphviz
from graphviz import Source

# Train a DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)

Next, we visualize the Decision Tree classifier by using export_graphviz from the library sklearn.tree.

# Visualize the decision tree
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=X_train.columns,
    # Class names are listed in sorted order to match the classifier's classes
    class_names=[str(i) for i in sorted(y_train.unique())],
    filled=True, rounded=True, special_characters=True
)
graph = Source(dot_data)
display(graph)  # display() is available in Jupyter/IPython

3.2.3 Evaluating the decision tree classifiers

from sklearn.metrics import classification_report, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns

# Function to evaluate the classifier and generate reports
def evaluate_classifier(clf, features_test, labels_test, split_name, class_order):
    # Predict the test set
    predictions = clf.predict(features_test)

    # Classification report. Passing labels=class_order keeps each row name
    # aligned with its class; without it, rows follow sorted label order.
    print(f"Classification Report for {split_name}:\n")
    report = classification_report(
        labels_test, predictions, labels=class_order,
        target_names=class_order, digits=2
    )
    print(report)

    # Confusion matrix
    cm = confusion_matrix(labels_test, predictions, labels=class_order)
    print(f"Confusion Matrix for {split_name}:\n{cm}\n")

    # Visualize the confusion matrix
    plt.figure(figsize=(8, 6))
    sns.heatmap(cm, annot=True, fmt="d", cmap="Blues",
                xticklabels=class_order, yticklabels=class_order)
    plt.title(f"Confusion Matrix ({split_name})")
    plt.xlabel("Predicted Label")
    plt.ylabel("True Label")
    plt.show()

Evaluation of Decision Tree classifier on 90/10 split:

Classification Report for Train 90% / Test 10%:

              precision    recall  f1-score   support

        High       0.60      0.64      0.62       106
         Low       0.39      0.39      0.39        18
    Standard       0.87      0.85      0.86       366

    accuracy                           0.79       490
   macro avg       0.62      0.63      0.62       490
weighted avg       0.79      0.79      0.79       490

Confusion Matrix for Train 90% / Test 10%:
[[  7   9   2]
 [ 11 312  43]
 [  0  38  68]]

In the confusion matrix, rows are true labels and columns are predicted labels, both ordered Low, Standard, High.
The model does a reasonable job identifying High quality wines (recall 0.64) but misclassifies a notable proportion as Standard.
Performance is poor for the Low class (precision and recall both 0.39).
The model performs best for the Standard class, which is expected given its dominance in the dataset. While the high performance on this class lifts the overall accuracy, it might indicate a bias toward the majority class.
Accuracy (79%): The model achieves decent accuracy but is heavily influenced by the Standard class. Accuracy alone is not a reliable metric for imbalanced datasets.
Macro Average (F1-Score: 0.62): This highlights the imbalance in performance across classes. The lower F1-scores for the Low and High classes pull down the macro average.
Weighted Average (F1-Score: 0.79): This metric reflects the dominance of the Standard class and inflates the perception of performance.
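The difference between the macro and weighted averages can be reproduced directly from the 90/10 confusion matrix. The sketch below assumes the rows (true labels) and columns (predicted labels) are ordered Low, Standard, High, which is consistent with the support values in the report. The helper name `per_class_f1` is hypothetical.

```python
# 90/10 confusion matrix from the report: rows = true, columns = predicted.
# Assumed class order: Low, Standard, High.
classes = ["Low", "Standard", "High"]
cm = [
    [7, 9, 2],
    [11, 312, 43],
    [0, 38, 68],
]

def per_class_f1(matrix, index):
    """F1-score of one class computed from a square confusion matrix."""
    true_positive = matrix[index][index]
    predicted = sum(row[index] for row in matrix)  # column sum
    actual = sum(matrix[index])                    # row sum (support)
    precision = true_positive / predicted
    recall = true_positive / actual
    return 2 * precision * recall / (precision + recall)

support = [sum(row) for row in cm]
total = sum(support)
f1_scores = [per_class_f1(cm, i) for i in range(len(classes))]

accuracy = sum(cm[i][i] for i in range(len(classes))) / total
macro_f1 = sum(f1_scores) / len(f1_scores)
weighted_f1 = sum(f * s for f, s in zip(f1_scores, support)) / total

print(f"accuracy={accuracy:.2f} macro_f1={macro_f1:.2f} weighted_f1={weighted_f1:.2f}")
```

The macro average weights the three classes equally, so the weak Low class lowers it, while the weighted average is dominated by the 366 Standard samples.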

Figure 30: Confusion Matrix for Train/Test Split: 90/10

Many High wines (38 of 106) are misclassified as Standard.
Half of the Low wines (9 of 18) are misclassified as Standard, while only 7 are correctly classified.
The Standard wines dominate predictions, with relatively few misclassifications into High or Low.
Evaluation of Decision Tree classifier on 80/20 split:

Classification Report for Train 80% / Test 20%:

              precision    recall  f1-score   support

        High       0.56      0.58      0.57       212
         Low       0.42      0.41      0.41        37
    Standard       0.86      0.85      0.85       731

    accuracy                           0.78       980
   macro avg       0.61      0.61      0.61       980
weighted avg       0.78      0.78      0.78       980

Confusion Matrix for Train 80% / Test 20%:
[[ 15  19   3]
 [ 17 621  93]
 [  4  84 124]]

The row sums of the confusion matrix give the class supports (Low: 37, Standard: 731, High: 212), and the per-class results below are matched to the classes accordingly.
Low-quality class: This class shows the weakest performance, with precision 0.42 and recall 0.41. This is likely due to the small sample size (37 instances in the test set), which leads to class imbalance and difficulty in learning distinctive features for this group.
Standard-quality class: The model performs very well for this class, achieving high precision (0.86) and recall (0.85). This is expected since Standard-quality wines dominate the dataset (731 of 980 test instances) and contribute most to the model's learning.
High-quality class: The precision (0.56) and recall (0.58) indicate moderate performance. The model struggles to separate High-quality wines from the others, although it recovers slightly more than half of them.
Accuracy: The overall accuracy of 78% is decent but is driven primarily by the performance on the Standard-quality class, not balanced across all categories.
Macro Average: The macro average (0.61) shows the overall weakness in handling the underrepresented classes (Low and High).
Weighted Average: The weighted average F1-score (0.78) reflects the strong influence of the dominant Standard-quality class in the dataset.

Figure 31: Confusion Matrix for Train/Test Split: 80/20

There is significant misclassification between the Low and Standard classes, with many
Low-quality wines being incorrectly predicted as Standard.
Most errors for the High-quality class are in predicting wines as Standard, which might
indicate overlapping characteristics between Standard and High.


3.2.4 The depth and accuracy of a decision tree

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
from graphviz import Source
import matplotlib.pyplot as plt

# Extract the 80/20 train and test datasets
train_features = datasets["80_train_features"]
train_labels = datasets["80_train_labels"]
test_features = datasets["20_test_features"]
test_labels = datasets["20_test_labels"]

# Depth values to test: unlimited depth, then 2 through 25
depth_values = [None] + list(range(2, 26))
# String labels for reporting and plotting ("None", "2", ..., "25")
depth_labels = [str(depth) for depth in depth_values]
accuracies = []

print("Results for Train 80% / Test 20%")

for depth in depth_values:
    # Initialize and train the classifier
    # (criterion is left at its default here, which is 'gini')
    clf = DecisionTreeClassifier(max_depth=depth, random_state=42)
    clf.fit(train_features, train_labels)

    # Predict and evaluate
    y_pred = clf.predict(test_features)
    accuracy = accuracy_score(test_labels, y_pred)
    accuracies.append(accuracy)

    # Visualize the decision tree using Graphviz
    dot_data = export_graphviz(
        clf,
        out_file=None,
        feature_names=train_features.columns,
        class_names=[str(cls) for cls in sorted(train_labels.unique())],
        filled=True,
        rounded=True,
        special_characters=True
    )
    graph = Source(dot_data)
    # graph.render(f"decision_tree_max_depth_{depth}", format="png", cleanup=True)  # Saves as PNG
    # print(f"Decision tree for max_depth={depth} saved as 'decision_tree_max_depth_{depth}.png'.")

# Report accuracy scores
accuracy_df = pd.DataFrame({"max_depth": depth_labels, "Accuracy": accuracies})
print(accuracy_df)

# Plot accuracy vs depth
plt.figure(figsize=(8, 5))
plt.plot(depth_labels, accuracies, marker="o", color="b", label="Accuracy")
plt.title("Effect of max_depth on Accuracy (Train 80% / Test 20%)")
plt.xlabel("max_depth")
plt.ylabel("Accuracy")
plt.grid(True)
plt.legend()
plt.show()

Result:
Results for Train 80% / Test 20%
    max_depth  Accuracy
0        None  0.791837
1           2  0.767347
2           3  0.763265
3           4  0.771429
4           5  0.772449
5           6  0.772449
6           7  0.759184
7           8  0.767347
8           9  0.772449
9          10  0.769388
10         11  0.782653
11         12  0.770408
12         13  0.781633
13         14  0.773469
14         15  0.794898
15         16  0.798980
16         17  0.787755
17         18  0.792857
18         19  0.797959
19         20  0.790816
20         21  0.791837
21         22  0.791837
22         23  0.791837
23         24  0.791837
24         25  0.791837

Visualization:
Comments:


1. Overall trends:

• The accuracy does not rise steadily with max_depth. For max_depth between 2 and 14 it fluctuates between 0.7592 (max_depth = 7) and 0.7827 (max_depth = 11). It then increases to 0.7949 at max_depth = 15 and peaks at max_depth = 16 with an accuracy of 0.7990 (approximately 79.90%).
• After max_depth = 16, the accuracy fluctuates slightly between 0.7878 and 0.7980, and from max_depth = 21 onward it stays at 0.7918, the same value as the unrestricted tree (max_depth = None). This suggests that the tree is fully grown by that depth, so raising the limit further no longer changes the model.

2. Impact of Small max_depth Values:

• At lower max_depth values (e.g., 2, 3), the accuracy is relatively low, which suggests underfitting: the model is likely too simple to capture the structure of the dataset. For example, max_depth = 2 yields 0.7673 (76.73%), about 3.2 percentage points below the peak accuracy, and max_depth = 3 reduces accuracy further to 0.7633 (76.33%).
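The trends described in these comments can be checked directly against the reported table. The sketch below is illustrative only: it copies the accuracy values from the result listing into a plain dictionary (the names `reported`, `limited`, and `plateau_start` are ours, not part of the original notebook) and looks up the best depth, the worst depth, and the shallowest depth from which the score stays equal to that of the unrestricted tree.

```python
# Accuracies copied from the result table (max_depth -> test accuracy).
reported = {
    None: 0.791837, 2: 0.767347, 3: 0.763265, 4: 0.771429, 5: 0.772449,
    6: 0.772449, 7: 0.759184, 8: 0.767347, 9: 0.772449, 10: 0.769388,
    11: 0.782653, 12: 0.770408, 13: 0.781633, 14: 0.773469, 15: 0.794898,
    16: 0.798980, 17: 0.787755, 18: 0.792857, 19: 0.797959, 20: 0.790816,
    21: 0.791837, 22: 0.791837, 23: 0.791837, 24: 0.791837, 25: 0.791837,
}

# Keep only the runs with an explicit depth limit
limited = {d: a for d, a in reported.items() if d is not None}

best_depth = max(limited, key=limited.get)
worst_depth = min(limited, key=limited.get)

# Shallowest depth from which every deeper tree matches the unrestricted one
depths = sorted(limited)
plateau_start = next(
    d for d in depths
    if all(limited[k] == reported[None] for k in depths if k >= d)
)

print("best:", best_depth, "worst:", worst_depth, "plateau from:", plateau_start)
```

On the reported numbers this gives a best depth of 16, a worst depth of 7, and a plateau starting at depth 21. Since all values come from a single 80/20 split, differences of a fraction of a percentage point should be read with caution.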

4 The Mobile Price Classification Dataset

4.1 Dataset Overview

Source: Abhishek Sharma


Description: The dataset contains sales data of mobile phones from various companies,
with features like RAM, internal memory, and battery power. The goal is to classify mobile
phones into price ranges (low, medium, high, very high) based on their features, providing
insights into the relationship between features and price range.
Number of Instances: 2000
Number of Attributes: 21 (price_range and 20 other features)
Attribute Information:

1. price_range: The target variable, with values 0 (low cost), 1 (medium cost), 2 (high
cost) and 3 (very high cost)

2. battery_power: Total energy a battery can store at one time, measured in mAh

3. blue: Has Bluetooth or not

4. clock_speed: Speed at which the microprocessor executes instructions

5. dual_sim: Has dual SIM support or not

6. fc: Front camera megapixels

7. four_g: Has 4G or not

8. int_memory: Internal memory in gigabytes

9. m_dep: Mobile depth in cm

10. mobile_wt: Weight of the mobile phone


Note: clock_speed and m_dep are of float type, while the remaining variables are of int
type. The other 11 features are not listed here for brevity.
Class Distribution:

Figure 32: Distribution of Price Range

Key Note: Since the data contains no null or missing values, no data preprocessing is
needed.
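A check of this kind can be done with pandas in a few lines. The snippet below is a minimal sketch, not the project's actual code: `df_demo` is a hypothetical three-row stand-in with made-up values, and in the project the same calls would be applied to the full 2000-row DataFrame.

```python
import pandas as pd

# Hypothetical stand-in frame with illustrative values only
df_demo = pd.DataFrame({
    "battery_power": [842, 1021, 563],
    "clock_speed": [2.2, 0.5, 0.5],
    "price_range": [1, 2, 2],
})

# Count null entries per column, then in total
missing_per_column = df_demo.isnull().sum()
total_missing = int(missing_per_column.sum())

print(missing_per_column)
print("Total missing values:", total_missing)
```

A total of zero means that no imputation or row removal is required before splitting the data.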

4.2 Main flow

4.2.1 Preparing the datasets

The following code splits the data into a train set (features and labels) and a test set
(features and labels) for each of the four train/test proportions.
from sklearn.model_selection import train_test_split

subsets = []
x = df.drop('price_range', axis=1)
y = df['price_range']

# Different train/test split ratios
split_ratios = [(0.4, 0.6), (0.6, 0.4), (0.8, 0.2), (0.9, 0.1)]

for train_size, test_size in split_ratios:
    # Stratified, shuffled split with a fixed seed for reproducibility
    feature_train, feature_test, label_train, label_test = train_test_split(
        x, y, train_size=train_size, test_size=test_size,
        stratify=y, shuffle=True, random_state=82
    )
    subsets.append({
        'feature_train': feature_train,
        'label_train': label_train,
        'feature_test': feature_test,
        'label_test': label_test
    })

We define the DecisionTree class as follows:


import graphviz
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.tree import DecisionTreeClassifier, export_graphviz


class DecisionTree:
    def __init__(self, criterion, max_depth=None):
        """
        Initialize a decision tree with the given criterion and maximum depth
        (the default depth is None, i.e. unlimited)
        """
        self.classifier = DecisionTreeClassifier(criterion=criterion, max_depth=max_depth)
        self.max_depth = max_depth

    def train(self, feature_train, label_train):
        """
        Train the decision tree classifier on a training set
        """
        self.classifier.fit(feature_train, label_train)

    def predict(self, feature_test):
        """
        Predict the output of a feature test set
        """
        return self.classifier.predict(feature_test)

    def evaluate(self, label_test, predictions):
        """
        Print out the classification report and the confusion matrix
        of the decision tree output
        """
        print(classification_report(label_test, predictions))
        print(confusion_matrix(label_test, predictions))

    def accuracy(self, label_test, predictions):
        """
        Return the accuracy of the decision tree output (compared to
        the labels of the test set)
        """
        return accuracy_score(label_test, predictions)

    def visualize(self, feature_train, export_name):
        """
        Visualize the decision tree after training on a certain training set
        """
        dot_data = export_graphviz(self.classifier, out_file=None,
                                   feature_names=feature_train.columns,
                                   class_names=['0', '1', '2', '3'],
                                   filled=True, rounded=True,
                                   special_characters=True)
        graph = graphviz.Source(dot_data)
        graph.render(export_name, format='png', cleanup=True)

The following code plots the class distribution of the original dataset and of each training
and test split:
import matplotlib.pyplot as plt
import pandas as pd
import seaborn as sns


def plot_class_distribution(data, title):
    plt.figure(figsize=(6, 4))
    # Convert a Series to a DataFrame with a 'price_range' column
    if isinstance(data, pd.Series):
        data = data.to_frame(name='price_range')
    sns.countplot(x='price_range', data=data)  # 'price_range' is the target variable
    plt.title(title)
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.show()


# Original dataset
plot_class_distribution(df, 'Original Dataset Class Distribution')

# Train and test sets for each proportion
for i, (train_size, test_size) in enumerate(split_ratios):
    # Format the ratios as whole percentages, e.g. "(Train 40% / Test 60%)"
    split_name = f'(Train {train_size:.0%} / Test {test_size:.0%})'
    plot_class_distribution(subsets[i]['label_train'],
                            f'Train Set Class Distribution {split_name}')
    plot_class_distribution(subsets[i]['label_test'],
                            f'Test Set Class Distribution {split_name}')

The following visualizations show the class distributions for the training and testing sets for
different train/test split ratios.


(a) Train Set (40% Train / 60% Test) (b) Test Set (40% Train / 60% Test)

(c) Train Set (60% Train / 40% Test) (d) Test Set (60% Train / 40% Test)

(e) Train Set (80% Train / 20% Test) (f) Test Set (80% Train / 20% Test)

(g) Train Set (90% Train / 10% Test) (h) Test Set (90% Train / 10% Test)

Figure 33: Class Distributions for Train and Test Sets for Different Train/Test Split Ratios

Figure 33 shows that the proportion of each price range is the same in every training and
test set, for all four split ratios. This is the expected effect of passing stratify=y to
train_test_split, which preserves the original class proportions in both parts of each
split.
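The same property can be confirmed numerically instead of visually. The sketch below assumes a balanced label vector of 4 classes with 500 samples each, which matches the supports reported in the evaluation section; `labels` and `features` are hypothetical stand-ins for `y` and `x`, and the dummy features carry no meaning.

```python
from collections import Counter

from sklearn.model_selection import train_test_split

# Hypothetical balanced labels standing in for price_range (4 classes x 500)
labels = [c for c in range(4) for _ in range(500)]
features = [[i] for i in range(len(labels))]  # dummy single-column features

# Same arguments as the 80/20 split in the report
_, _, y_train, y_test = train_test_split(
    features, labels, train_size=0.8, test_size=0.2,
    stratify=labels, shuffle=True, random_state=82,
)

train_counts = Counter(y_train)
test_counts = Counter(y_test)
print(sorted(train_counts.items()))
print(sorted(test_counts.items()))
```

With stratification each class contributes the same number of samples to the train set and to the test set, so the count plots for every split have bars of equal height.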

4.2.2 Building the decision tree classifiers

In this section, decision_tree_X_Y denotes the decision tree classifier trained on the split
that uses X% of the data for training and Y% for testing. All four trees use the entropy
criterion (criterion='entropy'), so each split is chosen to maximize the information gain,
i.e. the reduction in the size-weighted average entropy of the child nodes.
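To make the criterion concrete, the following sketch computes entropy and information gain by hand for a toy split. It is an illustration of the standard definitions, not scikit-learn's internal code; the helper names `entropy` and `information_gain` and the toy labels are our own.

```python
import math
from collections import Counter


def entropy(labels):
    """Shannon entropy (in bits) of a non-empty list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())


def information_gain(parent, children):
    """Parent entropy minus the size-weighted average entropy of the children."""
    n = len(parent)
    weighted = sum(len(child) / n * entropy(child) for child in children)
    return entropy(parent) - weighted


# Toy node with four equally frequent price ranges, split into two halves
parent = [0, 0, 1, 1, 2, 2, 3, 3]
left, right = [0, 0, 1, 1], [2, 2, 3, 3]

parent_entropy = entropy(parent)
gain = information_gain(parent, [left, right])
print(parent_entropy, gain)  # 2.0 1.0
```

Here the parent has an entropy of 2 bits, each child has 1 bit, and the split therefore gains 1 bit. At every node the tree selects the feature and threshold with the largest such gain.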

1. Decision Tree for the first training set (ratio: 40% train set, 60% test set)
decision_tree_40_60 = DecisionTree(criterion='entropy')
decision_tree_40_60.train(subsets[0]['feature_train'], subsets[0]['label_train'])
decision_tree_40_60.visualize(subsets[0]['feature_train'], "Decision Tree 4-6")

Figure 34: Decision Tree 4-6

2. Decision Tree for the second training set (ratio: 60% train set, 40% test set)
decision_tree_60_40 = DecisionTree(criterion='entropy')
decision_tree_60_40.train(subsets[1]['feature_train'], subsets[1]['label_train'])
decision_tree_60_40.visualize(subsets[1]['feature_train'], "Decision Tree 6-4")

Figure 35: Decision Tree 6-4

3. Decision Tree for the third training set (ratio: 80% train set, 20% test set)
decision_tree_80_20 = DecisionTree(criterion='entropy')
decision_tree_80_20.train(subsets[2]['feature_train'], subsets[2]['label_train'])
decision_tree_80_20.visualize(subsets[2]['feature_train'], "Decision Tree 8-2")


Figure 36: Decision Tree 8-2

4. Decision Tree for the fourth training set (ratio: 90% train set, 10% test set)
decision_tree_90_10 = DecisionTree(criterion='entropy')
decision_tree_90_10.train(subsets[3]['feature_train'], subsets[3]['label_train'])
decision_tree_90_10.visualize(subsets[3]['feature_train'], "Decision Tree 9-1")

Figure 37: Decision Tree 9-1

4.2.3 Evaluating the decision tree classifiers

1. Evaluation on the Decision Tree 40-60


predictions = decision_tree_40_60.predict(subsets[0]['feature_test'])
decision_tree_40_60.evaluate(subsets[0]['label_test'], predictions)

Output:

              precision    recall  f1-score   support

           0       0.92      0.92      0.92       300
           1       0.80      0.81      0.80       300
           2       0.79      0.77      0.78       300
           3       0.90      0.91      0.90       300

    accuracy                           0.85      1200
   macro avg       0.85      0.85      0.85      1200
weighted avg       0.85      0.85      0.85      1200

[[277  23   0   0]
 [ 25 242  33   0]
 [  0  38 231  31]
 [  0   0  27 273]]


The evaluation of the decision tree classifier trained with a 40-60 train-test split
shows the following:

• The overall accuracy of the model is 85%, indicating that the decision tree performs
well in classifying the price range.
• The precision, recall, and f1-score of the individual classes are fairly balanced,
ranging from 0.77 to 0.92, so no class is handled markedly worse than the others.
• Class 0 (low cost) and Class 3 (very high cost) achieve higher precision and recall
than Classes 1 and 2, i.e. the two extreme price ranges are easier to predict than
the two middle ones.
• The confusion matrix shows that:
– Most samples are correctly classified (e.g., 277 out of 300 for Class 0).
– The largest error groups are 33 samples of Class 1 misclassified as Class 2 and
38 samples of Class 2 misclassified as Class 1.
– All errors fall between adjacent price ranges: no sample is predicted two or
more classes away from its true class.
• The macro avg and weighted avg scores coincide (0.85) because every class has the
same support of 300 samples; both also match the overall accuracy.


Overall, the decision tree classifier is effective on this split: 1023 of the 1200 test
samples (85.25%) are assigned to the correct price range.
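The per-class figures in the classification report can be recomputed from the confusion matrix alone. The sketch below does this in plain Python for the 40-60 matrix reported in this subsection, reading rows as true classes and columns as predicted classes (the scikit-learn convention); the variable names are our own.

```python
# Confusion matrix of Decision Tree 40-60 (rows: true class, columns: predicted)
cm = [
    [277, 23, 0, 0],
    [25, 242, 33, 0],
    [0, 38, 231, 31],
    [0, 0, 27, 273],
]
n_classes = len(cm)
total = sum(sum(row) for row in cm)

# Accuracy: correctly classified samples (diagonal) over all samples
accuracy = sum(cm[i][i] for i in range(n_classes)) / total

for k in range(n_classes):
    predicted_k = sum(cm[i][k] for i in range(n_classes))  # column sum
    actual_k = sum(cm[k])                                   # row sum
    precision = cm[k][k] / predicted_k
    recall = cm[k][k] / actual_k
    print(f"class {k}: precision={precision:.2f} recall={recall:.2f}")

# Errors that skip over at least one price class
non_adjacent = sum(
    cm[i][j] for i in range(n_classes) for j in range(n_classes) if abs(i - j) > 1
)
print(f"accuracy={accuracy:.4f}, non-adjacent errors={non_adjacent}")
```

The accuracy comes out as 1023/1200 = 0.8525 and the count of non-adjacent errors is zero, so every mistake confuses a price range with one of its direct neighbours.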

2. Evaluation on the Decision Tree 60-40


predictions = decision_tree_60_40.predict(subsets[1]['feature_test'])
decision_tree_60_40.evaluate(subsets[1]['label_test'], predictions)

Output:

              precision    recall  f1-score   support

           0       0.92      0.94      0.93       200
           1       0.84      0.81      0.82       200
           2       0.76      0.81      0.78       200
           3       0.89      0.85      0.87       200

    accuracy                           0.85       800
   macro avg       0.85      0.85      0.85       800
weighted avg       0.85      0.85      0.85       800

[[187  13   0   0]
 [ 17 162  21   0]
 [  0  18 161  21]
 [  0   0  30 170]]


3. Evaluation on the Decision Tree 80-20


predictions = decision_tree_80_20.predict(subsets[2]['feature_test'])
decision_tree_80_20.evaluate(subsets[2]['label_test'], predictions)

Output:

              precision    recall  f1-score   support

           0       0.90      0.90      0.90       100
           1       0.82      0.80      0.81       100
           2       0.80      0.81      0.81       100
           3       0.87      0.89      0.88       100

    accuracy                           0.85       400
   macro avg       0.85      0.85      0.85       400
weighted avg       0.85      0.85      0.85       400

[[90 10  0  0]
 [10 80 10  0]
 [ 0  6 81 13]
 [ 0  1 10 89]]

4. Evaluation on the Decision Tree 90-10


predictions = decision_tree_90_10.predict(subsets[3]['feature_test'])
decision_tree_90_10.evaluate(subsets[3]['label_test'], predictions)

Output:

              precision    recall  f1-score   support

           0       0.90      0.92      0.91        50
           1       0.84      0.82      0.83        50
           2       0.81      0.76      0.78        50
           3       0.85      0.90      0.87        50

    accuracy                           0.85       200
   macro avg       0.85      0.85      0.85       200
weighted avg       0.85      0.85      0.85       200

[[46  4  0  0]
 [ 5 41  4  0]
 [ 0  4 38  8]
 [ 0  0  5 45]]
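The four confusion matrices reported in this section can be put side by side to compare the splits. The sketch below is a small illustration in plain Python; the dictionary `matrices` simply restates the reported values, and the names are our own.

```python
# Confusion matrices reported for the four train/test splits
matrices = {
    "40/60": [[277, 23, 0, 0], [25, 242, 33, 0], [0, 38, 231, 31], [0, 0, 27, 273]],
    "60/40": [[187, 13, 0, 0], [17, 162, 21, 0], [0, 18, 161, 21], [0, 0, 30, 170]],
    "80/20": [[90, 10, 0, 0], [10, 80, 10, 0], [0, 6, 81, 13], [0, 1, 10, 89]],
    "90/10": [[46, 4, 0, 0], [5, 41, 4, 0], [0, 4, 38, 8], [0, 0, 5, 45]],
}

split_accuracy = {}
for name, cm in matrices.items():
    correct = sum(cm[i][i] for i in range(len(cm)))  # diagonal entries
    total = sum(map(sum, cm))                        # all test samples
    split_accuracy[name] = correct / total
    print(f"{name}: {correct}/{total} = {split_accuracy[name]:.4f}")
```

The accuracies are 0.8525, 0.85, 0.85 and 0.85, so on these runs the test accuracy is practically independent of the split ratio. Because DecisionTree does not fix a random_state for the classifier, a rerun may give slightly different numbers.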
