
HO CHI MINH NATIONAL UNIVERSITY
UNIVERSITY OF SCIENCE
FACULTY OF INFORMATION TECHNOLOGY

LAB 2 REPORT

COURSE: INTRODUCTION TO ARTIFICIAL INTELLIGENCE

Student Name: Phạm Khánh Toàn
Student ID: 21127704
Class: 21CLC05

Ho Chi Minh City, May 22, 2023


Table of Contents

1 General information
  1.1 Student information
  1.2 Self-evaluation of the level of completion

2 Project report
  2.1 Preparing the data sets
    2.1.1 Manually merge the two files
    2.1.2 Prepare 16 subsets
    2.1.3 Visualize the distributions of classes in all the data sets
    2.1.4 Result of running the program
  2.2 Building the decision tree classifiers
    2.2.1 Result of running the program
  2.3 Evaluating the decision tree classifiers
    2.3.1 Result of running the program
  2.4 The depth and accuracy of a decision tree
    2.4.1 Result of running the program
1 General information
1.1 Student information
• Student name: Phạm Khánh Toàn
• Student ID: 21127704
• Student class: 21CLC05
• Phone number: 0866099829
• Email: [email protected]

1.2 Self-evaluation of the level of completion

No.  Specification                               Completion
1    Preparing the data sets                     100%
2    Building the decision tree classifiers      100%
3    Evaluating the decision tree classifiers    100%
4    The depth and accuracy of a decision tree   100%

2 Project report
2.1 Preparing the data sets
2.1.1 Manually merge the two files
• Step 1: Import the two files poker-hand-training-true.data and poker-hand-testing.data into an Excel workbook and save it.

• Step 2: Open another Excel workbook and import the previously saved one.

• Step 3: Next, check "Select multiple items", choose the imported sheets from the displayed list, and then click "Transform Data".

• Step 4: Next, in the Applied Steps pane, remove every step except Source. Then, in the Kind column, filter to keep only the Table entries.

• Step 5: Next, remove the other columns, keeping only the Data column.

• Step 6: Click on Close & Load.

2.1.2 Prepare 16 subsets

This code uses the scikit-learn function train_test_split to split the dataset into a training set and a test set. It takes the input variables data and label, and uses the train_size parameter to specify the proportion of data to include in the training set. The stratify parameter ensures that the proportion of each class in the data is preserved in both the training and test sets. The shuffle parameter is set to True to randomly shuffle the data before splitting it, and the random_state parameter sets the random seed for reproducibility. The resulting splits are stored in a list called subset_with_train_ratio.
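A minimal sketch of how this might look (the names data, label, and subset_with_train_ratio come from the description above; the exact ratio list and seed value are assumptions):

from sklearn.model_selection import train_test_split

# Assumed: `data` holds the ten card attributes, `label` the poker-hand class.
train_ratios = [0.4, 0.6, 0.8, 0.9]  # assumed: the 40/60, 60/40, 80/20, 90/10 splits

subset_with_train_ratio = []
for ratio in train_ratios:
    # stratify=label keeps each class's proportion identical in both splits;
    # shuffle=True randomizes the rows first; random_state fixes the seed.
    feature_train, feature_test, label_train, label_test = train_test_split(
        data, label,
        train_size=ratio,
        stratify=label,
        shuffle=True,
        random_state=42,
    )
    subset_with_train_ratio.append((feature_train, feature_test, label_train, label_test))

With four ratios and four arrays per split, this yields the sixteen subsets mentioned in the section heading.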

2.1.3 Visualize the distributions of classes in all the data sets

This code defines a function called draw_distribution that takes four arguments: original_set, training_set, test_set, and train_ratio. The function uses these inputs to plot a bar graph showing the class distribution of the data. It first counts the number of instances of each class in the original, training, and test sets. It then sets the x-axis positions and bar width for the graph and plots three bars per class: one for the original set, one for the training set, and one for the test set. The x-axis labels are set to the class names and rotated vertically. Finally, the function sets the graph title based on the train_ratio input and displays the graph with plt.show().
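A sketch of such a function, assuming the three *_set arguments are 1-D arrays of class labels (the function name and arguments come from the description above; the body is an illustration, not the author's exact code):

import numpy as np
import matplotlib.pyplot as plt

def draw_distribution(original_set, training_set, test_set, train_ratio):
    # Assumed: the poker-hand dataset has the ten classes 0..9.
    classes = np.arange(10)
    orig_counts = np.array([np.sum(original_set == c) for c in classes])
    train_counts = np.array([np.sum(training_set == c) for c in classes])
    test_counts = np.array([np.sum(test_set == c) for c in classes])

    x = np.arange(len(classes))   # one group of bars per class
    width = 0.25                  # width of each bar within a group

    plt.bar(x - width, orig_counts, width, label='Original')
    plt.bar(x, train_counts, width, label='Training')
    plt.bar(x + width, test_counts, width, label='Test')

    plt.xticks(x, classes, rotation='vertical')  # class names on the x-axis
    plt.legend()
    plt.title(f'Class distribution (train ratio = {train_ratio})')
    plt.show()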

2.1.4 Result of running the program

2.2 Building the decision tree classifiers

DecisionTreeClassifier is a class in the scikit-learn library that implements a decision tree algorithm for classification problems. It takes several parameters, including the criterion for splitting the tree (e.g., entropy or Gini impurity), the maximum depth of the tree, the minimum number of samples required to split an internal node, and the random state for reproducibility.
The fit method trains the decision tree model on a given dataset. It takes two inputs: the features (independent variables) and the target (dependent variable). Once the model is trained, it can be used to make predictions on new data.
tree.export_graphviz is a function in the scikit-learn library that exports a trained decision tree model to the Graphviz format. Graphviz is open-source graph visualization software that can be used to render decision trees. The function takes several parameters, including the decision tree model, the names of the features, the names of the target classes, and various formatting options for the output visualization. Its output is a Graphviz representation of the decision tree model, which can be rendered with Graphviz tools.
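A sketch combining these three pieces (the parameter values and the feature_names and class_names lists are assumptions for illustration; feature_train and label_train are taken from one of the splits prepared earlier):

from sklearn import tree
from sklearn.tree import DecisionTreeClassifier

# Train a decision tree using information gain (entropy) as the split criterion.
clf = DecisionTreeClassifier(criterion='entropy', random_state=42)
clf.fit(feature_train, label_train)

# Export the trained tree to Graphviz DOT format; the returned string can be
# rendered with Graphviz tools (e.g., the `graphviz` Python package or `dot`).
dot_data = tree.export_graphviz(
    clf,
    feature_names=feature_names,  # assumed: list of the ten card-attribute names
    class_names=class_names,      # assumed: list of the ten poker-hand class names
    filled=True,
    rounded=True,
)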

2.2.1 Result of running the program


a) Data set 40/60

b) Data set 60/40

c) Data set 80/20

d) Data set 90/10

2.3 Evaluating the decision tree classifiers
The classification report and the confusion matrix are two important tools to evaluate the
performance of a classification model.

The confusion matrix is a table that summarizes the number of correct and incorrect pre-
dictions made by the model on a set of test data. It presents four values: true positives (TP),
false positives (FP), true negatives (TN), and false negatives (FN). TP and TN represent the
number of samples that were correctly classified as positive and negative, respectively. FP and
FN represent the number of samples that were incorrectly classified as positive and negative,
respectively. The confusion matrix allows us to calculate performance metrics such as accuracy,
precision, recall, and F1 score.

The classification report provides a summary of these performance metrics for each class in the dataset. It includes precision, recall, F1 score, and support. Precision measures the proportion of positive predictions that are correct, while recall measures the proportion of actual positives that are correctly identified. The F1 score is the harmonic mean of precision and recall, combining both metrics into a single value. Support is the number of samples in each class.
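For reference, these metrics follow directly from the confusion-matrix counts:

precision = TP / (TP + FP)
recall    = TP / (TP + FN)
F1        = 2 · precision · recall / (precision + recall)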
Together, the classification report and confusion matrix provide a detailed evaluation of the
model’s performance, allowing us to identify areas of improvement and make informed decisions
about how to adjust the model’s parameters or features.
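A minimal sketch of producing both outputs with scikit-learn (clf, feature_test, and label_test are assumed to come from the previous steps):

from sklearn.metrics import classification_report, confusion_matrix

label_pred = clf.predict(feature_test)

# Rows of the matrix are the true classes, columns the predicted classes.
print(confusion_matrix(label_test, label_pred))

# Per-class precision, recall, F1 score, and support, plus overall averages.
print(classification_report(label_test, label_pred))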

2.3.1 Result of running the program
a) Data set 40/60

b) Data set 60/40

c) Data set 80/20

d) Data set 90/10

2.4 The depth and accuracy of a decision tree

Max_depth   None     2        3        4        5        6        7
Accuracy    0.6399   0.5062   0.5083   0.5247   0.5566   0.5566   0.5569

As the table shows, the accuracy of the decision tree classifier changes as we change the max_depth parameter. When max_depth is None, the accuracy is 0.6399, the highest among all the values tested. However, as max_depth increases from 2 to 7, the accuracy stays almost flat, around 0.50 to 0.56, without any significant improvement.

This behavior suggests that a decision tree may not be the best model for predicting the classes of this particular dataset: even the best configuration reaches only about 64% accuracy. It also suggests that shallow trees (depth 2 to 7) underfit the data, while an unlimited-depth tree can grow deep enough to memorize the training set and risks overfitting, achieving a high accuracy score on the training data that may not generalize to unseen data. Therefore, it is important to find the value of max_depth that balances the bias-variance tradeoff, avoiding both overfitting and underfitting.
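A sketch of the experiment behind the table (which split is used is an assumption; the 80/20 split is taken here for illustration, with variable names following the earlier steps):

from sklearn.metrics import accuracy_score
from sklearn.tree import DecisionTreeClassifier

# Train one tree per max_depth setting and measure test-set accuracy.
for depth in [None, 2, 3, 4, 5, 6, 7]:
    clf = DecisionTreeClassifier(criterion='entropy', max_depth=depth, random_state=42)
    clf.fit(feature_train, label_train)
    accuracy = accuracy_score(label_test, clf.predict(feature_test))
    print(f'max_depth={depth}: accuracy={accuracy:.4f}')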

2.4.1 Result of running the program
a) Depth = None

b) Depth = 2

c) Depth = 3

d) Depth = 4

e) Depth = 5

f) Depth = 6

g) Depth = 7

