IST 557 Data Mining
1.1:
In question 1, I wrote a decision_tree function that prints each split of the tree.
A pair (X, Y) denotes an internal split: X is the feature index, so the split is on the (X+1)th feature, and Y is the split value for that feature. For example, (0, 4) means the first split is on the 1st feature (Cell size) with split value 4. Here I do not use the midpoint of two adjacent values c1 and c2 as the split point, because every feature is integer-valued, so the midpoint (c1+c2)/2 plays the same role as c2 itself: with adjacent values 1 and 2 the midpoint is (1+2)/2 = 1.5, and the rule "x < 1.5 vs. x >= 1.5" partitions the data exactly the same way as "x < 2 vs. x >= 2". To reduce the computational cost, I therefore use c2 directly as the split point.
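A minimal sketch of this split-point choice (the helper below is illustrative, not the actual decision_tree code):

import numpy as np

def candidate_split_points(feature_values):
    # For integer-valued features, splitting at the midpoint (c1 + c2) / 2 of two
    # adjacent values gives exactly the same partition as splitting at the larger
    # value c2 with the rule "x < c2 vs. x >= c2", so only the sorted unique
    # values themselves are used as candidates.
    values = np.unique(feature_values)   # sorted unique feature values
    return values[1:]                    # the smallest value cannot split the data

# Example: with values 1 and 2, splitting at 1.5 or at 2 yields the same partition.
x = np.array([1, 1, 2, 2, 2])
for c in candidate_split_points(x):
    print(c, x[x < c], x[x >= c])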
A triple (X, Y, Z) denotes a leaf node: X is the majority class of the node, Y is the number of samples in the node, and Z is the proportion of samples belonging to that majority class. For example, (4, 41, 1) means the majority class of this leaf is class 4, the leaf contains 41 samples, and 100% of them are class 4.
The numbers 0 and 1 that appear outside the parentheses mark the two branches of each split, i.e. the tree is a binary split tree.
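For reference, a small sketch of how such a leaf tuple could be computed (illustrative code, not the original implementation):

from collections import Counter

def leaf_summary(y):
    # Return (majority class, number of samples, proportion of the majority class).
    major_class, major_count = Counter(y).most_common(1)[0]
    return major_class, len(y), major_count / len(y)

# A pure leaf with 41 samples of class 4 yields (4, 41, 1.0).
print(leaf_summary([4] * 41))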
1.2:
Tree plot with the class counts at each node: [50, 50], [50, 9].
2.1:
First, the splitting criterion: I use entropy as the impurity function, and the midpoint between each pair of adjacent values of each feature as a candidate split point c. For each node selected for splitting, I compute the entropy of every possible split (every feature and every candidate split point) and choose the feature and split point c with the minimum weighted entropy.
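A minimal sketch of this entropy-based split search (the helper names are mine, not the actual code):

import numpy as np

def entropy(y):
    # Shannon entropy of the class labels y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    # Try every feature and every midpoint of adjacent values; return the split
    # with the minimum weighted entropy of the two children.
    best = (None, None, np.inf)          # (feature index, split value, weighted entropy)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for c in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] < c], y[X[:, j] >= c]
            score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if score < best[2]:
                best = (j, c, score)
    return best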
Stopping Criteria: In this question I use two stopping criteria.
The first is the maximum depth of the tree: if the depth of the tree exceeds 4, the algorithm stops splitting.
The second is the minimum number of samples required to split a node: if an internal node contains fewer than 30 samples, it is not split further.
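These two rules correspond to the max_depth and min_samples_split parameters of scikit-learn's DecisionTreeClassifier; a minimal sketch, assuming the scikit-learn implementation is an acceptable stand-in for the custom tree:

from sklearn.tree import DecisionTreeClassifier

# Entropy splitting criterion with the two stopping criteria described above:
# no split beyond depth 4, and nodes with fewer than 30 samples are not split.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=30)
# clf.fit(X_train, y_train)   # X_train, y_train: the training data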
Tree plot:
Stopping Criterion 1: max depth = 4
Stopping Criterion 2: min samples split = 30
Summary:
From these three plots, we can see that the entropy splitting criterion is better than the Gini criterion: for all three stopping criteria, the testing error with entropy is lower than the testing error with Gini. I therefore use entropy as the splitting criterion in the rest of the questions.
From these three plots and the testing error with entropy, the chosen parameters for the three stopping criteria are:
max depth = 3; minimum samples per leaf = 5; minimum samples to split = 50; the testing error is about 3.5%.
When using 5-fold cross-validation, the best "min samples split" value is 5, 10, or 15.
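A minimal sketch of this 5-fold cross-validation over min_samples_split (the candidate grid and the scikit-learn breast cancer data are stand-ins for illustration):

from sklearn.datasets import load_breast_cancer        # stand-in data set
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Average 5-fold cross-validation classification error for each candidate value.
for m in [2, 5, 10, 15, 20, 30, 40, 50]:               # illustrative grid
    clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=m, random_state=0)
    acc = cross_val_score(clf, X, y, cv=5)
    print(m, round(1 - acc.mean(), 4))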
2.4:
For outer 5-fold cv:
    For inner 5-fold cv:
        For max depth in […….]:
            For min samples splits in […………..]:
                Testing error
In this question there are two stopping-criterion parameters to select and two nested layers of cross-validation: in each inner 5-fold CV loop I select the best parameters, and the plots below show how the parameters are selected, where the error is the average error over the inner 5-fold CV.
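A minimal sketch of this nested 5-fold cross-validation (the parameter grids and the data set are illustrative placeholders; the grids actually searched are the ones reflected in the results below):

from sklearn.datasets import load_breast_cancer                  # stand-in data set
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"max_depth": [2, 3, 4, 5, 6, 7],                   # illustrative grids
              "min_samples_split": [2, 5, 10, 15, 20, 40]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The inner loop picks (max depth, min samples split); the outer loop
# estimates the test error of the whole selection procedure.
search = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=0),
                      param_grid, cv=inner_cv)
outer_acc = cross_val_score(search, X, y, cv=outer_cv)
print("outer-fold test errors:", 1 - outer_acc)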
Training error for the two stopping criteria:
Fold 2: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 2:
(2, 10); (3, 5); (4, 2); (5, 5); (6, 5); (7, 2)
From the error on the outer CV, the best parameters are (3, 5), (5, 5), (2, 10), (6, 5) = (max depth, min samples split).
The test error of (3, 5) is 2.8%.
Fold 3: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 3:
(2, 2); (3, 40); (4, 10); (5, 2); (6, 2); (7, 2)
From the error on the outer CV, the best parameter is (3, 40) = (max depth, min samples split).
The test error of (3, 40) is 2.8%.
Fold 4: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 4:
(2, 2); (3, 5); (4, 20); (5, 2); (6, 2); (7, 10)
From the error on the outer CV, the best parameters are (4, 20), (6, 2) = (max depth, min samples split).
The test error of (4, 20) is 5.7%.
Fold 2: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 2:
(2, 2); (3, 10); (4, 5); (5, 2); (6, 2); (7, 5)
From the error on the outer CV, the best parameter is (6, 2) = (max depth, min samples split).
The test error of (6, 2) is 0%.
Summary: the testing errors of the outer 5-fold CV (with the corresponding parameters listed above).
Average classification error vs. 1/(minimum samples split): the error is the average error over the 5-fold cross-validation.
The best testing error is 2.3%, obtained when the max depth is 3 and the minimum samples split is 10 or 15 for each tree.
Training error vs. 1/(minimum samples split) for each fold: the error is the training error of each of the 5 folds.
Testing error vs. 1/(minimum samples split) for each fold: the error is the testing error of each of the 5 folds.
Parameter Setting:
In this question I used the xgboost package. Initially, the learning rate is 0.3, the maximum depth of a tree is 3, the minimum loss reduction required to split a leaf node of the tree (gamma) is 2, and the number of boosting iterations is 20.
The objective is "multi:softprob", which is used for multi-class classification. We have 3 classes in this question, so the prediction for each sample is a list of 3 class probabilities, e.g. [0.2, 0.3, 0.5], and the class with the maximum probability is the predicted class for that sample. In this example the predicted class is class 3.
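A minimal sketch of how these softprob outputs are turned into a predicted class (the data here are random placeholders; only the stated parameter values are taken from the text):

import numpy as np
import xgboost as xgb

# With objective "multi:softprob" and num_class=3, predict() returns one
# probability per class for every sample; the predicted class is the argmax.
params = {"objective": "multi:softprob", "num_class": 3,
          "eta": 0.3, "max_depth": 3, "gamma": 2}

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # placeholder features
y = rng.integers(0, 3, size=100)         # placeholder labels 0, 1, 2

dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train(params, dtrain, num_boost_round=20)

proba = bst.predict(xgb.DMatrix(X))      # shape (100, 3), e.g. [0.2, 0.3, 0.5]
pred = proba.argmax(axis=1)              # index of the largest probability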
I also tune the number of iterations and the gamma value.
Average classification error vs. number of iterations: the error is the average error over the 5-fold cross-validation.
The best number of iterations is 15.
Average classification error vs. 1/gamma: the error is the average error over the 5-fold cross-validation.
The best gamma is 1 or 2.
Finally, the learning rate is 0.3, the maximum depth of a tree is 3, the minimum loss reduction required to split a leaf node of the tree (gamma) is 1, and the number of iterations is 15. The objective is "multi:softprob".
After setting these parameters, I run 5-fold cross-validation again to evaluate the model; the training error and testing error of each fold are:
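A minimal sketch of how this per-fold evaluation could be implemented with the final settings (the iris data and the fold splitting shown here are assumptions for illustration, not the original script):

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris                  # stand-in 3-class data set
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Final parameter setting from above: learning rate 0.3, max depth 3,
# gamma 1, 15 boosting iterations, objective "multi:softprob".
params = {"objective": "multi:softprob", "num_class": 3,
          "eta": 0.3, "max_depth": 3, "gamma": 1}

folds = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(folds.split(X), 1):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
    bst = xgb.train(params, dtrain, num_boost_round=15)
    train_err = np.mean(bst.predict(dtrain).argmax(axis=1) != y[train_idx])
    test_err = np.mean(bst.predict(dtest).argmax(axis=1) != y[test_idx])
    print(f"fold {i}: training error {train_err:.3f}, testing error {test_err:.3f}")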