UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY
REPORT LAB2
Contents
1 General information
1.1 Student information
2 Project report
2.1 Preparing the data sets
2.1.1 Manually merge the two files
2.1.2 Prepare 16 subsets
2.1.3 Visualize the distributions of classes in all the data sets
2.1.4 Result of running the program
2.2 Building the decision tree classifiers
2.2.1 Result of running the program
2.3 Evaluating the decision tree classifiers
2.3.1 Result of running the program
2.4 The depth and accuracy of a decision tree
2.4.1 Result of running the program
References
1 General information
1.1 Student information
• Student name: Phạm Khánh Toàn
• Student ID: 21127704
• Student class: 21CLC05
• Phone number: 0866099829
• Email: [email protected]
2 Project report
2.1 Preparing the data sets
2.1.1 Manually merge the two files
• Step 1: Import the two files poker-hand-training-true.data and poker-hand-testing.data into an Excel file and save it.
• Step 2: Open another Excel file and import the previously saved Excel file.
• Step 3: Select "Select multiple items", choose the imported items from the displayed list, and then select "Transform Data".
• Step 4: In the Applied Steps pane, remove every step except Source. Then, in the Kind column, keep only the rows of kind Table.
• Step 5: Remove the other columns, keeping only the Data column.
• Step 6: Click on Close & Load.
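For reference, the same merge can also be done programmatically. The following is a minimal sketch using pandas; the assumption that the files have no header row and the output file name are illustrative rather than taken from the report.

import pandas as pd

# Read both Poker Hand files (assumed to have no header row) and concatenate them.
train_part = pd.read_csv("poker-hand-training-true.data", header=None)
test_part = pd.read_csv("poker-hand-testing.data", header=None)
merged = pd.concat([train_part, test_part], ignore_index=True)

# Save the merged data set (output file name is illustrative).
merged.to_csv("poker-hand-merged.data", header=False, index=False)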
2.1.2 Prepare 16 subsets
This code uses the scikit-learn function train_test_split to split the dataset into a training set and a test set. It takes the input variables data and label, and uses the train_size parameter to specify the proportion of data to include in the training set. The stratify parameter ensures that the proportion of each class is maintained in both the training and test sets, the shuffle parameter is set to True to randomly shuffle the data before splitting, and the random_state parameter sets the random seed for reproducibility. The resulting splits are stored in a list called subset_with_train_ratio.
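A minimal sketch of the split described above, assuming the four train ratios 40%, 60%, 80%, and 90% and that data and label already hold the features and class labels; names other than those mentioned in the report, and the seed value, are illustrative.

from sklearn.model_selection import train_test_split

subset_with_train_ratio = []
for train_ratio in (0.4, 0.6, 0.8, 0.9):
    # stratify=label keeps each class's proportion the same in both splits,
    # shuffle=True shuffles the rows first, and random_state fixes the seed.
    feature_train, feature_test, label_train, label_test = train_test_split(
        data, label,
        train_size=train_ratio,
        stratify=label,
        shuffle=True,
        random_state=42,
    )
    subset_with_train_ratio.append((feature_train, feature_test, label_train, label_test))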
2.1.3 Visualize the distributions of classes in all the data sets
This code defines a function called draw_distribution that takes four arguments: original_set, training_set, test_set, and train_ratio, and plots a bar graph showing the class distribution of the data. It first counts the number of instances of each class in the original, training, and test sets. It then sets the x positions and bar width for the graph and plots three bars per class: one for the original set, one for the training set, and one for the test set. The x-axis labels are set to the class names and rotated vertically. Finally, the function sets the graph title based on the train_ratio input and shows the graph with plt.show().
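A sketch of how such a function could look, assuming matplotlib and numpy and that the three set arguments are arrays of class labels; the internal variable names are illustrative.

import numpy as np
import matplotlib.pyplot as plt

def draw_distribution(original_set, training_set, test_set, train_ratio):
    # Count how many instances of each class appear in each set.
    classes = np.unique(original_set)
    orig_counts = [np.sum(original_set == c) for c in classes]
    train_counts = [np.sum(training_set == c) for c in classes]
    test_counts = [np.sum(test_set == c) for c in classes]

    # One group of three bars per class: original, training, test.
    x = np.arange(len(classes))
    width = 0.25
    plt.bar(x - width, orig_counts, width, label="Original")
    plt.bar(x, train_counts, width, label="Training")
    plt.bar(x + width, test_counts, width, label="Test")

    # Class names on the x-axis, rotated vertically, plus title and legend.
    plt.xticks(x, classes, rotation="vertical")
    plt.title(f"Class distribution (train ratio = {train_ratio})")
    plt.legend()
    plt.show()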
2.1.4 Result of running the program
2.2 Building the decision tree classifiers
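A minimal sketch of how one of the classifiers could be built and drawn, assuming scikit-learn's DecisionTreeClassifier; the entropy (information gain) criterion, the seed, and the variable names (which follow the split sketch above) are assumptions rather than details taken from the report.

from sklearn import tree
import matplotlib.pyplot as plt

# Fit a decision tree on one of the prepared training subsets.
clf = tree.DecisionTreeClassifier(criterion="entropy", random_state=42)
clf.fit(feature_train, label_train)

# Visualize the fitted tree (limiting the drawn depth keeps the plot readable).
plt.figure(figsize=(20, 10))
tree.plot_tree(clf, filled=True, max_depth=3)
plt.show()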
2.2.1 Result of running the program
b) Data set 60/40
2.3 Evaluating the decision tree classifiers
The classification report and the confusion matrix are two important tools to evaluate the
performance of a classification model.
The confusion matrix is a table that summarizes the number of correct and incorrect predictions made by the model on a set of test data. In the binary case it presents four values: true positives (TP), false positives (FP), true negatives (TN), and false negatives (FN). TP and TN are the numbers of samples correctly classified as positive and negative, respectively, while FP and FN are the numbers of samples incorrectly classified as positive and negative. For a multi-class problem, the matrix has one row and one column per class, with the diagonal holding the correctly classified samples. The confusion matrix allows us to calculate performance metrics such as accuracy, precision, recall, and F1 score.
The classification report provides a summary of these performance metrics for each class in the dataset. It includes precision, recall, F1 score, and support. Precision measures the proportion of positive predictions that are correct, while recall measures the proportion of actual positives that are correctly identified. The F1 score is the harmonic mean of precision and recall, so it takes both metrics into account. Support is the number of samples in each class.
Together, the classification report and confusion matrix provide a detailed evaluation of the
model’s performance, allowing us to identify areas of improvement and make informed decisions
about how to adjust the model’s parameters or features.
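A minimal sketch of how these two tools can be produced with scikit-learn; clf, feature_test, and label_test stand for a fitted classifier and one of the prepared test splits, and are illustrative names.

from sklearn.metrics import classification_report, confusion_matrix

# Predict the test labels with the fitted classifier.
label_pred = clf.predict(feature_test)

# Rows of the confusion matrix are the true classes, columns the predicted ones.
print(confusion_matrix(label_test, label_pred))

# Per-class precision, recall, F1 score, and support.
print(classification_report(label_test, label_pred))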
2.3.1 Result of running the program
a) Data set 40/60
b) Data set 60/40
c) Data set 80/20
d) Data set 90/10
2.4 The depth and accuracy of a decision tree
Max_depth    None     2        3        4        5        6        7
Accuracy     0.6399   0.5062   0.5083   0.5247   0.5566   0.5566   0.5569
As the table shows, the accuracy of the decision tree classifier changes as we change the max_depth parameter. When max_depth is None, the accuracy is 0.6399, the highest among all the values tested. However, as max_depth increases from 2 to 7, the accuracy improves only modestly, from about 0.51 to 0.56, and never approaches the score of the unrestricted tree.
This behavior indicates that the decision tree model may not be the best model for predicting the classes of this particular dataset. It also suggests that the model may be overfitting the data when max_depth is not limited, achieving a very high accuracy score on the training set that does not generalize as well to the test set. Therefore, it is important to find the value of max_depth that balances the bias-variance tradeoff and avoids both overfitting and underfitting.
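A sketch of the experiment behind the table above, assuming scikit-learn and one of the prepared train/test splits; the entropy criterion, the seed, and the variable names are assumptions rather than details taken from the report.

from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Train and evaluate one tree per max_depth value used in the table.
for depth in (None, 2, 3, 4, 5, 6, 7):
    clf = DecisionTreeClassifier(criterion="entropy", max_depth=depth, random_state=42)
    clf.fit(feature_train, label_train)
    acc = accuracy_score(label_test, clf.predict(feature_test))
    print(f"max_depth={depth}: accuracy={acc:.4f}")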
2.4.1 Result of running the program
a) Depth = None
b) Depth = 2
c) Depth = 3
d) Depth = 4
e) Depth = 5
f) Depth = 6
g) Depth = 7
References
[1] Stuart J. Russell and Peter Norvig, Artificial Intelligence: A Modern Approach, 1995