Decision Tree
Members:
22125002 - Pham Minh Anh
22125016 - Nguyen Manh Dinh
22125056 - Le Phat Minh
22125062 - Pham Ha Nam
Artificial Intelligence 22TT2
Contents
1 Self-assessment of Completion & Task-assignment of each member
Source: Created by Dr. William H. Wolberg, W. Nick Street, and Olvi L. Mangasarian at
the University of Wisconsin.
Description: The dataset contains features computed from digitized images of fine needle
aspirate (FNA) samples of breast masses. It aims to classify breast tumors as either malignant
(M) or benign (B) based on 30 real-valued features derived from characteristics of cell nuclei.
Number of Instances: 569
Number of Attributes: 32 (ID, diagnosis, 30 real-valued features)
Attributes:
• ID number
• Diagnosis (M = malignant, B = benign)
• 30 real-valued features: ten characteristics of each cell nucleus (radius, texture, perimeter, area, smoothness, compactness, concavity, concave points, symmetry, fractal dimension), each reported as its mean, standard error, and worst value
Then, we load the dataset with the pandas function read_csv and name the columns according to the convention above.
# Load the dataset
data_file_path = "breast+cancer+wisconsin+diagnostic/wdbc.data"
# feature_names is the list of 30 feature names defined by the convention above
column_names = ['ID', 'Diagnosis'] + [f'{feature_names[i - 1]}' for i in range(1, 31)]
data = pd.read_csv(data_file_path, header=None, names=column_names)

# Print out the first 10 rows of data
print(data[:10])
Finally, we drop the ID column since it does not contribute to the decision tree. Also, the
Diagnosis column is transformed to numerical values to facilitate computation.
# Drop the ID column and encode the Diagnosis column
data = data.drop(columns=['ID'])
data['Diagnosis'] = data['Diagnosis'].map({'M': 1, 'B': 0})
According to the requirements, we use the function train_test_split from sklearn.model_selection to split the dataset into different portions, specifically 40/60, 60/40, 80/20, and 90/10 (train/test).
Each split yields the subsets X_train (training features), y_train (training labels), X_test (test features), and y_test (test labels).
# Split the dataset and store the results
# (the definitions of X, y, and `splits` below are assumed; the original cell
#  defining them is not shown in the report)
X = data.drop(columns=['Diagnosis'])
y = data['Diagnosis']
splits = [0.4, 0.6, 0.8, 0.9]

for train_size in splits:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, stratify=y, random_state=42
    )
According to the visualizations of the class distribution of the original and split datasets below, we observe that the ratio of Benign to Malignant samples is unchanged. Hence, the dataset is split in a stratified fashion.
According to the requirements, we have to fit four decision tree classifiers on the four training sets we have. In this report, to keep the document concise, we construct one decision tree classifier and comment on one visualization of the classifier as an example.
Firstly, we use the DecisionTreeClassifier from the library sklearn.tree to construct the decision tree.
In the code below, we use the entropy criterion, which chooses the split that maximizes the Information Gain, the main criterion for selecting the attribute at the root of the tree and of each subtree. The random_state parameter ensures reproducibility of the results by controlling the randomness involved in the algorithm. Specifically, in the DecisionTreeClassifier, it influences the random selection of features during tree splitting when handling ties. Setting random_state=42 fixes the seed of the random number generator, allowing consistent and repeatable results across multiple runs of the code.
# Train a DecisionTreeClassifier
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(X_train, y_train)
Next, we visualize the Decision Tree classifier by using export_graphviz from the library sklearn.tree.
# Visualize the decision tree
# (class_names and the closing arguments are assumed to mirror the identical
#  call shown later in the Wine Quality section)
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=X.columns,
    class_names=['Benign', 'Malignant'],
    filled=True, rounded=True, special_characters=True
)
graph = Source(dot_data)
display(graph)
Figure 6: Breast Cancer Diagnostic Decision Tree Classifier on 40/60 (Train/Test) Split
The decision tree visualizes the classification of breast cancer samples into two classes:
Benign (orange nodes) and Malignant (blue nodes).
• Each node in the tree represents a decision point based on a threshold value for a specific
feature (e.g., radius3, concave points3).
• At each split, the decision criterion tests whether the feature value is less than or equal
to the threshold. If true, the left child node is traversed; otherwise, the right child node
is chosen.
• The tree terminates at leaf nodes, where entropy equals 0, indicating pure classification
with no further splitting required.
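To make the node descriptions above concrete, the learned features, thresholds, and entropies can be read directly from the fitted estimator. This is a minimal sketch, assuming clf and X are the fitted classifier and the feature DataFrame from the code above:
# Walk the fitted tree and print the split or leaf information of each node
tree = clf.tree_
for node in range(tree.node_count):
    if tree.children_left[node] == -1:  # -1 marks a leaf node
        print(f"node {node}: leaf, entropy = {tree.impurity[node]:.3f}")
    else:
        feature = X.columns[tree.feature[node]]
        threshold = tree.threshold[node]
        print(f"node {node}: {feature} <= {threshold:.3f}, entropy = {tree.impurity[node]:.3f}")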
This decision tree shows how features like radius3, concave points3, and texture3 contribute to separating the data into benign and malignant classes, with some paths perfectly classifying the samples. The decision tree classifier can differ between splits. Below are some alternative classifiers.
Figure 7: Breast Cancer Diagnostic Decision Tree Classifier on 60/40 (Train/Test) Split
Comment: With a larger training set, as well as more diverse data and cases, the decision tree classifier may require longer paths to split the diagnoses until only one class remains at each leaf.
Figure 8: Breast Cancer Diagnostic Decision Tree Classifier on 80/20 (Train/Test) Split
Figure 9: Breast Cancer Diagnostic Decision Tree Classifier on 90/10 (Train/Test) Split
The accuracy of the classifiers built on the different train/test splits is discussed in the following section.
The confusion matrix shows the performance of the model in classifying the two classes, Benign and Malignant. It contains the counts of true negatives (TN), false positives (FP), false negatives (FN), and true positives (TP).
The confusion matrix highlights that the model performs well overall, with only a small number of misclassifications, and that it performs better on Benign cases (higher TN and fewer FP) than on Malignant cases (higher FN).
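These figures can be reproduced with scikit-learn's metric helpers. A minimal sketch, assuming clf, X_test, and y_test come from the 40/60 split above:
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix

# Evaluate the fitted classifier on the held-out test set
y_pred = clf.predict(X_test)
print("Accuracy:", accuracy_score(y_test, y_pred))
print(confusion_matrix(y_test, y_pred))  # rows = true classes, columns = predicted classes
print(classification_report(y_test, y_pred, target_names=['Benign', 'Malignant'], digits=3))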
The classification report provides the following metrics for each class:
• Precision: Measures the proportion of positive identifications that were actually correct.
• Recall: Measures the proportion of actual positives that were correctly identified.
– Benign: 0.932
– Malignant: 0.876
• Macro Average: Average of metrics across both classes without considering class imbalance.
• Weighted Average: Average of metrics across both classes, weighted by the number of
instances in each class.
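For reference, with TP, FP, and FN counted per class, F1,c the F1-score of class c, and nc its support, these quantities can be written as:
\[
\text{Precision} = \frac{TP}{TP + FP}, \qquad
\text{Recall} = \frac{TP}{TP + FN}, \qquad
F_1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}}
\]
\[
\text{Macro avg} = \frac{1}{C} \sum_{c=1}^{C} F_{1,c}, \qquad
\text{Weighted avg} = \frac{\sum_{c=1}^{C} n_c \, F_{1,c}}{\sum_{c=1}^{C} n_c}
\]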
Above is the detailed explanation of how the metrics convey the performance of the classifier. We now go through the results and visualizations of the other Decision Tree classifiers, which are trained on the other training splits.
Evaluation of Decision Tree classifier on 60/40 split:
The overall accuracy of this split is about 0.939, which is higher than that of the previous decision tree classifier. This suggests that the Breast Cancer decision tree classifier benefits from a larger amount of training data.
All the metrics above converge to the same conclusion: the Decision Tree classifier constructed on this dataset is generally effective (accuracy higher than 0.9) in predicting Breast Cancer.
Firstly, we construct decision tree classifiers, limiting the maximum depth of the tree to None, 2, 3, 4, 5, 6, and 7.
clf = DecisionTreeClassifier(criterion='entropy', max_depth=max_depth, random_state=42)
clf.fit(X_train, y_train)
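The accuracy curve in Figure 18 below can be produced by repeating this fit for each depth value. A minimal sketch of such a loop, assuming X_train, X_test, y_train, and y_test come from the 80/20 split:
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score
import matplotlib.pyplot as plt

depths = [None, 2, 3, 4, 5, 6, 7]
accuracies = []
for depth in depths:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
    clf.fit(X_train, y_train)
    accuracies.append(accuracy_score(y_test, clf.predict(X_test)))

# Plot accuracy against max_depth (None is shown as the first tick)
plt.plot(range(len(depths)), accuracies, marker='o')
plt.xticks(range(len(depths)), [str(d) for d in depths])
plt.xlabel('max_depth')
plt.ylabel('Accuracy')
plt.show()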
Generally, the accuracy increases (the classifier performs better) as the max depth increases.
Figure 18: Effect of max depth on accuracy of Breast Cancer Decision Tree classifier on 80/20
split
Beyond a certain point (e.g., max depth = 5), the accuracy reaches a plateau. Further
increases in max depth do not result in significant improvements in accuracy. This occurs
because the tree has already captured most of the meaningful patterns in the data, and deeper
splits primarily capture noise, which does not enhance test performance.
Below are the visualizations of the Decision Tree classifier for each max depth value.
Decision tree with max depth = None: there is no limit on the depth of the decision tree.
Figure 19: Breast Cancer Diagnostic Decision Tree Classifier with max depth = None
We can see that all diagnoses are split into pure leaves along the paths of the decision tree.
Decision tree with max depth = 2:
Figure 20: Breast Cancer Diagnostic Decision Tree Classifier with max depth = 2
Some of the leaves of the decision tree contain both types of diagnosis, which means that a max depth of 2 is not suitable. However, the overall accuracy at max depth = 2 can still be read from Figure 18 above.
Source: Created by: Paulo Cortez (Univ. Minho), Antonio Cerdeira, Fernando Almeida,
Telmo Matos and Jose Reis (CVRVV) @ 2009.
Description: The two datasets are related to red and white variants of the Portuguese
"Vinho Verde" wine. For more details, consult: http://www.vinhoverde.pt/en/ or the reference
[Cortez et al., 2009]. Due to privacy and logistic issues, only physicochemical (inputs) and
sensory (the output) variables are available (e.g. there is no data about grape types, wine
brand, wine selling price, etc.).
Number of Instances: red wine - 1599; white wine - 4898.
Number of Attributes: 11 + output attribute
Note: several of the attributes may be correlated, thus it makes sense to apply some sort
of feature selection.
Attribute information:
For more information, read [Cortez et al., 2009].
Input variables (based on physicochemical tests):
1 - fixed acidity
2 - volatile acidity
3 - citric acid
4 - residual sugar
5 - chlorides
6 - free sulfur dioxide
7 - total sulfur dioxide
8 - density
9 - pH
10 - sulphates
11 - alcohol
Output variable (based on sensory data):
12 - quality (score between 0 and 10)
Missing Attribute Values: None
Group data
Since the Wine Quality dataset scores quality on a 0-10 scale, we group the scores into 3 broader categories for analysis: Low quality (classes 0-4), Standard quality (classes 5-6), and High quality (classes 7-10).
# Features and Labels
X = wine_data.iloc[:, :-1]  # All columns except 'quality'

# Group quality into broader categories
# (the body of this function was truncated; it is completed here to match the grouping above)
def group_quality(quality):
    if quality <= 4:
        return 'Low'
    if quality <= 6:
        return 'Standard'
    return 'High'
According to the requirements, we use the function train_test_split from sklearn.model_selection to split the dataset into different portions, specifically 40/60, 60/40, 80/20, and 90/10 (train/test).
Each split yields the subsets X_train (training features), y_train (training labels), X_test (test features), and y_test (test labels).
Prepare datasets
# Splitting proportions
proportions = [(0.4, 0.6), (0.6, 0.4), (0.8, 0.2), (0.9, 0.1)]

# Prepare datasets
datasets = {}
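The loop that fills datasets is not shown above; a minimal sketch of one way to write it, assuming X and y are the grouped features and labels from the previous step:
from sklearn.model_selection import train_test_split

for train_size, test_size in proportions:
    X_train, X_test, y_train, y_test = train_test_split(
        X, y, train_size=train_size, test_size=test_size,
        stratify=y, random_state=42
    )
    # Store each split under a readable key, e.g. "Train 80% / Test 20%"
    key = f"Train {round(train_size * 100)}% / Test {round(test_size * 100)}%"
    datasets[key] = (X_train, X_test, y_train, y_test)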
Visualization
# Visualization function
def visualize_class_distribution(labels, title):
    sns.countplot(x=labels, palette="viridis")
    plt.title(title)
    plt.xlabel("Wine Quality Category")
    plt.ylabel("Count")
    plt.show()
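For example, the function can be applied to the training and test labels of each stored split (variable names follow the sketch above):
# Compare the label distribution of the training and test portions of every split
for split_name, (X_train, X_test, y_train, y_test) in datasets.items():
    visualize_class_distribution(y_train, f"{split_name} - Training labels")
    visualize_class_distribution(y_test, f"{split_name} - Test labels")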
Comment:
The Low category is underrepresented, which might introduce class-imbalance issues in modeling.
The training and test sets have the same class distribution, which means that they are split in a stratified fashion.
Train and Visualize Decision Tree Classifiers
According to the requirements, we have to fit four decision tree classifiers on the four training sets. Firstly, we use the DecisionTreeClassifier from the library sklearn.tree to construct the decision tree.
In the code below, we use the entropy criterion to maximize the Information Gain, as learnt in the theory class. The random_state parameter ensures reproducibility of the results by controlling the randomness involved in the algorithm. Specifically, in the DecisionTreeClassifier, it influences the random selection of features during tree splitting when handling ties. Setting random_state=42 fixes the seed of the random number generator, allowing consistent and repeatable results across multiple runs of the code.
# Train a DecisionTreeClassifier
# (the rest of this block is assumed to mirror the earlier breast cancer example)
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(features_train, labels_train)
Next, we visualize the Decision Tree classifier by using export_graphviz from the library sklearn.tree.
# Visualize the decision tree
dot_data = export_graphviz(
    clf,
    out_file=None,
    feature_names=features_train.columns,
    class_names=[str(i) for i in sorted(labels_train.unique())],
    filled=True, rounded=True, special_characters=True
)
graph = Source(dot_data)
display(graph)
# Predictions on the test set
# (split_name, class_order, and features_test are defined in the part of this block
#  not shown here; the predict call below is assumed)
predictions = clf.predict(features_test)

# Classification report
print(f"Classification Report for {split_name}:\n")
report = classification_report(labels_test, predictions, target_names=class_order, digits=2)
print(report)

# Confusion matrix
cm = confusion_matrix(labels_test, predictions, labels=class_order)
print(f"Confusion Matrix for {split_name}:\n{cm}\n")
The model does a reasonable job identifying High quality wines but misclassifies a notable
proportion as Standard.
Performance is poor for the Low class.
The model performs best for the Standard class, which is expected given its dominance in
the dataset.
While the high performance here contributes to overall accuracy, it might indicate a bias
toward the majority class.
Accuracy (79%): The model achieves decent accuracy but is heavily influenced by the
Standard class. Accuracy alone is not a reliable metric for imbalanced datasets.
Macro Average (F1-Score: 0.62): This highlights the imbalance in performance across
classes. The lower F1-score for the Low and High classes pulls down the macro average.
Weighted Average (F1-Score: 0.79): This metric reflects the dominance of the Standard class and inflates the perception of performance.
Low-quality class: The precision (0.56) and recall (0.58) indicate moderate performance, with recall slightly higher than precision; the model still struggles to distinguish Low-quality wines from the other categories.
Standard-quality class: This class shows the weakest performance, with precision (0.42)
and recall (0.41). This is likely due to the small sample size (37 instances in the test set),
leading to class imbalance and difficulty in learning distinctive features for this group.
High-quality class: The model performs very well for this class, achieving high precision
(0.86) and recall (0.85). This is expected since High-quality wines dominate the dataset and
contribute most to the model’s learning.
Accuracy: The overall accuracy of 78% is decent but is driven primarily by the performance
on the High-quality class, not balanced across all categories.
Macro Average: The macro average (0.61) shows the overall weakness in handling underrepresented classes (Low and Standard).
Weighted Average: The weighted average F1-score (0.78) reflects the strong influence of
the dominant High-quality class in the dataset.
There is significant misclassification between the Low and Standard classes, with many
Low-quality wines being incorrectly predicted as Standard.
Most errors for the High-quality class are in predicting wines as Standard, which might
indicate overlapping characteristics between Standard and High.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_graphviz
from sklearn.metrics import accuracy_score
from graphviz import Source
import matplotlib.pyplot as plt
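The loop that produced the table below is not shown; a minimal sketch of how it might look, assuming the 80/20 wine split is stored in features_train, features_test, labels_train, and labels_test, and using the imports above:
# Fit one tree per max_depth value (None means unlimited depth) and collect the accuracy
depth_values = [None] + list(range(2, 26))
records = []
for depth in depth_values:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
    clf.fit(features_train, labels_train)
    acc = accuracy_score(labels_test, clf.predict(features_test))
    records.append({'max_depth': depth, 'Accuracy': acc})

results = pd.DataFrame(records)
print("Results for Train 80% / Test 20%")
print(results)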
Result:
Results for Train 80% / Test 20%
    max_depth  Accuracy
0        None  0.791837
1           2  0.767347
2           3  0.763265
3           4  0.771429
4           5  0.772449
5           6  0.772449
6           7  0.759184
7           8  0.767347
8           9  0.772449
9          10  0.769388
10         11  0.782653
11         12  0.770408
12         13  0.781633
13         14  0.773469
14         15  0.794898
15         16  0.798980
16         17  0.787755
17         18  0.792857
18         19  0.797959
19         20  0.790816
20         21  0.791837
21         22  0.791837
22         23  0.791837
23         24  0.791837
24         25  0.791837
Visualization:
Comments:
1. Overall trends:
• The accuracy generally improves as max_depth increases from 2 to 16, reaching a peak at max_depth = 16 with an accuracy of 0.7989 (approximately 79.89%).
• After max_depth = 16, the accuracy fluctuates slightly but remains stable, suggesting diminishing returns for increasing max_depth beyond this point.
• At lower max_depth values (e.g., 2, 3), the accuracy is relatively low, indicating underfitting: the model is likely too simplistic to capture the complexities of the dataset. For example, max_depth = 2 results in 0.7673 (76.73%), which is significantly lower than the peak accuracy, and max_depth = 3 reduces accuracy further to 0.7633 (76.33%).
1. price_range: the target variable, with values 0 (low cost), 1 (medium cost), 2 (high cost), and 3 (very high cost).
2. battery_power: the total energy a battery can store at one time, measured in mAh.
Note: clock_speed and m_dep are of float type, while the remaining variables are of int type. There are 11 other variables not listed here for brevity.
Class Distribution:
Key Note: Since there are no null or missing values in this data, we do not need to perform data preprocessing.
The following code blocks demonstrate the process of splitting the data into a train set (the feature and label train sets) and a test set (the feature and label test sets).
subsets = []
x = df.drop('price_range', axis=1)
y = df['price_range']
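The rest of this block is not shown; a minimal sketch of how subsets might be filled for the four required ratios (the dictionary keys match those used later in the report):
from sklearn.model_selection import train_test_split

for train_size in [0.4, 0.6, 0.8, 0.9]:
    feature_train, feature_test, label_train, label_test = train_test_split(
        x, y, train_size=train_size, stratify=y, random_state=42
    )
    subsets.append({
        'feature_train': feature_train, 'feature_test': feature_test,
        'label_train': label_train, 'label_test': label_test,
    })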
def visualize(self, feature_train, export_name):
    """
    Visualize the decision tree learned from the given training set
    """
    dot_data = export_graphviz(self.classifier, out_file=None,
                               feature_names=feature_train.columns,
                               class_names=['0', '1', '2', '3'],
                               filled=True, rounded=True,
                               special_characters=True)
    graph = graphviz.Source(dot_data)
    graph.render(export_name, format='png', cleanup=True)
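The method above belongs to a small wrapper class whose constructor and training method are not reproduced here. A minimal sketch consistent with how the class is used later (the random_state value is an assumption; the visualize method shown above completes the class):
from sklearn.tree import DecisionTreeClassifier

class DecisionTree:
    def __init__(self, criterion='entropy'):
        # Wrap a scikit-learn tree so that training and visualization stay together
        self.classifier = DecisionTreeClassifier(criterion=criterion, random_state=42)

    def train(self, feature_train, label_train):
        # Fit the underlying classifier on one training split
        self.classifier.fit(feature_train, label_train)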
Below is the visualization of the datasets, including the original dataset and the training
and testing splits:
def plot_class_distribution(data, title):
    plt.figure(figsize=(6, 4))
    # Convert a Series to a DataFrame with a 'price_range' column
    if isinstance(data, pd.Series):
        data = data.to_frame(name='price_range')
    sns.countplot(x='price_range', data=data)  # 'price_range' is the target variable
    plt.title(title)
    plt.xlabel('Class')
    plt.ylabel('Count')
    plt.show()

# Original dataset
plot_class_distribution(df, 'Original Dataset Class Distribution')
The following visualizations show the class distributions for the training and testing sets for
different train/test split ratios.
Figure 33: Class Distributions for Train and Test Sets for Different Train/Test Split Ratios. Panels (a)-(h) show the train and test sets for the 40/60, 60/40, 80/20, and 90/10 (Train/Test) splits.
According to the Class Distributions for Train and Test Sets for Different Train/Test Split
Ratios, we observe that the ratio of all types of price range remains unchanged across all
splits. This indicates that the dataset is split in a stratified fashion, preserving the original
class proportions in both the training and test sets.
In the following section, decision_tree_X_Y denotes the decision tree classifier trained on the split with X% of the data for training and Y% for testing. We again use the entropy criterion to maximize the Information Gain.
1. Decision Tree for the first training set (ratio: 40% train set, 60% test set)
decision_tree_40_60 = DecisionTree(criterion='entropy')
decision_tree_40_60.train(subsets[0]['feature_train'], subsets[0]['label_train'])
decision_tree_40_60.visualize(subsets[0]['feature_train'], "Decision Tree 4-6")
2. Decision Tree for the second training set (ratio: 60% train set, 40% test set)
decision_tree_60_40 = DecisionTree(criterion='entropy')
decision_tree_60_40.train(subsets[1]['feature_train'], subsets[1]['label_train'])
decision_tree_60_40.visualize(subsets[1]['feature_train'], "Decision Tree 6-4")
3. Decision Tree for the third training set (ratio: 80% train set, 20% test set)
decision_tree_80_20 = DecisionTree(criterion='entropy')
decision_tree_80_20.train(subsets[2]['feature_train'], subsets[2]['label_train'])
decision_tree_80_20.visualize(subsets[2]['feature_train'], "Decision Tree 8-2")
4. Decision Tree for the fourth training set (ratio: 90% train set, 10% test set)
decision_tree_90_10 = DecisionTree(criterion='entropy')
decision_tree_90_10.train(subsets[3]['feature_train'], subsets[3]['label_train'])
decision_tree_90_10.visualize(subsets[3]['feature_train'], "Decision Tree 9-1")
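The evaluation code itself is not shown; a minimal sketch of how each classifier can be scored, assuming the wrapper exposes its fitted model as classifier (as in the visualize method above) and that the subsets contain the test portions as sketched earlier:
from sklearn.metrics import classification_report, confusion_matrix

# Evaluate the 40/60 classifier on its held-out test set
predictions = decision_tree_40_60.classifier.predict(subsets[0]['feature_test'])
print(classification_report(subsets[0]['label_test'], predictions))
print(confusion_matrix(subsets[0]['label_test'], predictions))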
Output:
[[277 23 0 0]
[ 25 242 33 0]
[ 0 38 231 31]
[ 0 0 27 273]]
The evaluation of the decision tree classifier trained with a 40-60 train-test split demonstrates the following key observations:
• The overall accuracy of the model is 85%, indicating that the decision tree performs
well in classifying the price range.
• The precision, recall, and f1-score metrics for each class are relatively balanced, with values around 0.80 to 0.92, suggesting the model's capability to handle multiple classes effectively.
• Low Cost (0) and Class 3 achieve higher precision and recall compared to Classes
1 and 2. This indicates that the model performs slightly better at predicting these
classes.
• The confusion matrix shows that:
– Most samples are correctly classified in their respective classes (e.g., 277 out of
300 for Low Cost (0)).
– Some misclassifications are observed, such as 33 samples of Class 1 being misclassified as Class 2, and 38 samples of Class 2 being misclassified as Class 1.
• The macro avg and weighted avg scores are consistent with the overall accuracy,
reinforcing that the model treats each class fairly, regardless of their support sizes.
This evaluation indicates that the decision tree classifier is effective and provides valuable
insights into the price range classification task.
Output:
[[187 13 0 0]
[ 17 162 21 0]
[ 0 18 161 21]
[ 0 0 30 170]]
Output:
[[90 10 0 0]
[10 80 10 0]
[ 0 6 81 13]
[ 0 1 10 89]]
[[46 4 0 0]
[ 5 41 4 0]
[ 0 4 38 8]
[ 0 0 5 45]]