
IST 557 DATA MINING

1.1:
In Question 1, I wrote a decision_tree function that prints each split of the tree:

A pair (X, Y) denotes an internal split: X is the (zero-based) feature index, so it refers to the (X+1)-th feature, and Y is the split value of that feature. For example, (0, 4) means the first split uses the 1st feature, Cell size, with a split value of 4. The reason I do not use the midpoint of two adjacent values as the split point is that every feature takes integer values, so (c1+c2)/2 plays the same role as c2 (for example, 1 < (1+2)/2 = 1.5 and 2 > 1.5 partition the data exactly as 1 < 2 and 2 >= 2 do). To reduce the computational cost, I simply use c2 as the split point.
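To make this concrete, here is a minimal sketch of how the candidate split points can be enumerated for an integer-valued feature; the function name candidate_splits and the use of NumPy are my own choices, not taken from the report's code:

```python
import numpy as np

def candidate_splits(x):
    """Candidate split points for an integer-valued feature column x.

    Instead of midpoints (c1 + c2) / 2, each distinct value c2 is used
    directly (the smallest value is skipped, since `x < smallest` would
    leave one side empty).
    """
    values = np.unique(x)    # sorted distinct values c1 < c2 < ...
    return values[1:]        # every c2; the test is `x < c2` vs `x >= c2`

# Example: for feature values [1, 2, 2, 4] the candidates are [2, 4];
# splitting at 2 separates the data exactly as a midpoint of 1.5 would.
```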
A triple (X, Y, Z) denotes a leaf node: X is the majority class of the node, Y is the number of samples in the node, and Z is the proportion of samples that belong to the majority class. For example, (4, 41, 1) means the majority class of the node is 4, the node contains 41 samples, and 100% of those samples are class 4.
The numbers 0 and 1 outside the parentheses mark the two branches of each binary split.
1.2:

Class counts at the nodes: [50, 50] and [50, 9].
2.1:
First, the splitting criterion: I use entropy as the impurity function, and I take the midpoint between each pair of adjacent values of each feature as a candidate split point c. For each node selected for splitting, I compute the entropy of every possible split of that node over all candidate split points, and I choose the feature and split point c that give the minimum entropy.
Stopping criteria: In this question, I use two stopping criteria (a code sketch of both follows below).
The first is the maximum depth of the tree: once the depth of the tree reaches 4, the algorithm stops splitting.
The second is the minimum number of samples required to split a node: if an internal node contains fewer than 30 samples, it is not selected for splitting.
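A minimal sketch of how the entropy-based split search and these two stopping criteria can fit together; the function names, the tuple representation of nodes, and the assumption that X and y are NumPy arrays are mine rather than the report's actual code:

```python
import numpy as np

def entropy(y):
    """Entropy of a label vector y."""
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    """Feature index and midpoint threshold with the lowest weighted child entropy."""
    best_j, best_c, best_h = None, None, np.inf
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for c in (values[:-1] + values[1:]) / 2.0:   # midpoints of adjacent values
            left = X[:, j] < c
            h = (left.sum() * entropy(y[left]) +
                 (~left).sum() * entropy(y[~left])) / n
            if h < best_h:
                best_j, best_c, best_h = j, c, h
    return best_j, best_c

def build_tree(X, y, depth=0, max_depth=4, min_samples_split=30):
    """Grow recursively; stop on maximum depth, small nodes, or pure nodes."""
    if depth >= max_depth or len(y) < min_samples_split or len(np.unique(y)) == 1:
        classes, counts = np.unique(y, return_counts=True)
        return ('leaf', classes[np.argmax(counts)], len(y))    # majority-class leaf
    j, c = best_split(X, y)
    if j is None:                                              # no valid split exists
        classes, counts = np.unique(y, return_counts=True)
        return ('leaf', classes[np.argmax(counts)], len(y))
    left = X[:, j] < c
    return (j, c,
            build_tree(X[left], y[left], depth + 1, max_depth, min_samples_split),
            build_tree(X[~left], y[~left], depth + 1, max_depth, min_samples_split))
```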
Tree plot:
Stopping Criteria 1: max depth = 4
Stopping Criteria 2: min samples split = 30

The next questions select the values of these parameters.


2.2:
In this question, I use the testing error to select the parameters of the stopping criteria. For each setting, I also compute the Gini index alongside entropy to compare which splitting criterion performs better.
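For reference, a minimal sketch of the two impurity functions being compared; the labels are assumed to be non-negative integer class codes, and the helper names are mine:

```python
import numpy as np

def entropy_impurity(y):
    """Entropy impurity of an integer label vector."""
    p = np.bincount(y) / len(y)
    p = p[p > 0]                       # ignore classes that do not appear
    return -np.sum(p * np.log2(p))

def gini_impurity(y):
    """Gini impurity of an integer label vector."""
    p = np.bincount(y) / len(y)
    return 1.0 - np.sum(p ** 2)
```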

Training and testing error vs 1/(minimum samples split)

Training and testing error vs maximum depth of tree

Summary:
From these three plots, the entropy splitting criterion is better than the Gini criterion, because the testing error with entropy is lower than the testing error with Gini for all three stopping criteria. I therefore use entropy as the splitting criterion in the rest of the questions.
Based on these three plots and the testing error with entropy, the selected values of the three stopping criteria are: max depth = 3, minimum samples per leaf = 5, minimum samples split = 50; the testing error is about 3.5%.

Next, I use 5-fold cross-validation to finalize these parameters.


2.3: 5-fold cross-validation:

Using 5-fold cross-validation, the best max depth is 3 or 4.

Using 5-fold cross-validation, the best min samples split is 5, 10, or 15.
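A minimal sketch of this selection, using scikit-learn's DecisionTreeClassifier as a stand-in for the report's own tree; the parameter grids and the assumption that X and y are NumPy arrays are mine:

```python
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import cross_val_score

def select_tree_params(X, y, depths=(2, 3, 4, 5, 6, 7),
                       splits=(2, 5, 10, 15, 20, 40)):
    """Average 5-fold CV error for each (max_depth, min_samples_split) pair."""
    errors = {}
    for d in depths:
        for s in splits:
            clf = DecisionTreeClassifier(criterion='entropy',
                                         max_depth=d, min_samples_split=s)
            acc = cross_val_score(clf, X, y, cv=5).mean()   # mean CV accuracy
            errors[(d, s)] = 1.0 - acc                      # mean CV error
    best = min(errors, key=errors.get)
    return best, errors
```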
2.4:
The selection is a nested cross-validation:

For each outer fold (5-fold CV):
    For each inner fold (5-fold CV):
        For max depth in [.......]:
            For min samples split in [.............]:
                compute the testing error

Because I have two stopping-criterion parameters to select and two nested layers, for each inner 5-fold CV loop I select the best parameters; the plots below show how the parameters are selected, where the error is the average error over the inner 5-fold CV. A code sketch of the whole procedure follows the plots.
Training error of two stop criteria

Testing error of two stop criteria
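A sketch of this nested procedure under the same assumptions (scikit-learn's tree as a stand-in, hypothetical parameter grids, X and y as NumPy arrays):

```python
import numpy as np
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import KFold, cross_val_score

def nested_cv(X, y, depths=(2, 3, 4, 5, 6, 7), splits=(2, 5, 10, 15, 20, 40)):
    """Outer 5-fold CV; an inner 5-fold CV selects (max_depth, min_samples_split)."""
    outer = KFold(n_splits=5, shuffle=True, random_state=0)
    outer_errors = []
    for train_idx, test_idx in outer.split(X):
        X_tr, y_tr = X[train_idx], y[train_idx]
        # Inner loop: pick the pair with the lowest average inner-CV error.
        best_pair, best_err = None, np.inf
        for d in depths:
            for s in splits:
                clf = DecisionTreeClassifier(criterion='entropy',
                                             max_depth=d, min_samples_split=s)
                err = 1.0 - cross_val_score(clf, X_tr, y_tr, cv=5).mean()
                if err < best_err:
                    best_pair, best_err = (d, s), err
        # Refit with the selected pair and evaluate on the held-out outer fold.
        clf = DecisionTreeClassifier(criterion='entropy',
                                     max_depth=best_pair[0],
                                     min_samples_split=best_pair[1])
        clf.fit(X_tr, y_tr)
        outer_errors.append(1.0 - clf.score(X[test_idx], y[test_idx]))
    return np.mean(outer_errors), outer_errors
```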


From the above plots, we obtain the best parameters of each inner 5-fold cross-validation (we have 5 groups of best parameters, and each group contains 6 (max depth, min samples split) points). Then, for each outer fold, I use the best parameters selected above to report the test error of the outer CV; the error is the average error over the 5 outer folds.
Fold 1: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 1:
(2,2); (3,2); (4,2); (5,5); (6,10); (7,2)
Based on the outer CV error, the best parameter pair is (3,2) = (max depth, min samples split).
The test error of (3,2) is 14.3%.

Fold 2: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 2:
(2,10); (3,5); (4,2); (5,5); (6,5); (7,2)
Based on the outer CV error, the best parameter pairs are (3,5), (5,5), (2,10), (6,5) = (max depth, min samples split).
The test error of (3,5) is 2.8%.
Fold 3: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 3:
(2,2); (3,40); (4,10); (5,2); (6,2); (7,2)
Based on the outer CV error, the best parameter pair is (3,40) = (max depth, min samples split).
The test error of (3,40) is 2.8%.

Fold 4: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 4:
(2,2); (3,5); (4,20); (5,2); (6,2); (7,10)
Based on the outer CV error, the best parameter pairs are (4,20), (6,2) = (max depth, min samples split).
The test error of (4,20) is 5.7%.
Fold 5: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 5:
(2,2); (3,10); (4,5); (5,2); (6,2); (7,5)
Based on the outer CV error, the best parameter pair is (6,2) = (max depth, min samples split).
The test error of (6,2) is 0%.

Summary: the testing errors of the outer 5-fold CV (with the corresponding parameters listed above).

Testing error of each outer fold:

The average error over the outer 5-fold CV is 5.1%.
From all these results, the best value of max depth is 3 (used in the next question).
3: Random Forest
Parameter setting:
Because we use the same dataset, I carry over the decision-tree result and fix the max depth of each tree at 3, and I again use 5-fold cross-validation to choose the second parameter, the minimum samples split of each tree. Entropy is used as the splitting criterion and the forest contains 100 trees.
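A minimal sketch of this sweep, assuming scikit-learn's RandomForestClassifier (the report does not say which implementation it used) and a hypothetical grid of min_samples_split values:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

def sweep_forest_min_split(X, y, splits=(2, 5, 10, 15, 20, 30, 40, 50)):
    """Average 5-fold CV error of a 100-tree forest for each min_samples_split."""
    errors = {}
    for s in splits:
        rf = RandomForestClassifier(n_estimators=100, max_depth=3,
                                    min_samples_split=s, criterion='entropy',
                                    random_state=0)
        errors[s] = 1.0 - cross_val_score(rf, X, y, cv=5).mean()
    return errors
```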

Average classification error vs 1/(minimum samples split): the error is the average error of the 5-fold cross-validation.

Training error vs 1/(minimum samples split) and testing error vs 1/(minimum samples split):

The best testing error is 2.3% when the max depth is 3, and the minimum samples split is 10 or 15 for each tree.
Training error of classification vs 1/(minimum samples split) for each fold: the error is the per-fold training error of the 5-fold cross-validation.
Testing error of classification vs 1/(minimum samples split) for each fold: the error is the per-fold testing error of the 5-fold cross-validation.

Here, the final parameters are:

Max depth of each tree is 3 and the minimum samples split of each tree is 10; entropy is the splitting criterion and the forest contains 100 trees.
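A sketch of evaluating this final configuration, again assuming scikit-learn; cross_validate with return_train_score=True reports the per-fold training and testing scores referred to below:

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_validate

def evaluate_final_forest(X, y):
    """Per-fold training and testing error of the final forest configuration."""
    rf = RandomForestClassifier(n_estimators=100, max_depth=3,
                                min_samples_split=10, criterion='entropy',
                                random_state=0)
    res = cross_validate(rf, X, y, cv=5, return_train_score=True)
    train_err = 1.0 - res['train_score']   # one entry per fold
    test_err = 1.0 - res['test_score']
    return train_err, test_err
```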
After setting these parameters, I run 5-fold cross-validation again; the training and testing errors of each fold are:

The testing error is 2.3%


4: XGBoost:
As in the previous questions, I first use 5-fold cross-validation to confirm that the max depth of each tree is 3:

Parameter setting:
In this question, I use the xgboost package. Initially, the learning rate is 0.3, the max depth of a tree is 3, the minimum loss reduction (gamma) required to split a leaf node is 2, and the number of boosting iterations is 20.

The objective is softprob, which handles multi-class classification. In this question we have 3 classes, so the prediction for each sample is a 3-element probability vector such as [0.2, 0.3, 0.5], and the class with the maximum probability is the predicted class of that sample. In this example, the predicted class is class 3.
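A minimal sketch of this setup with the xgboost package; the variable names and the assumption that the labels are encoded as 0, 1, 2 are mine:

```python
import numpy as np
import xgboost as xgb

def fit_predict_softprob(X_train, y_train, X_test):
    """Train with the multi-class softprob objective and take the argmax per sample."""
    dtrain = xgb.DMatrix(X_train, label=y_train)
    dtest = xgb.DMatrix(X_test)
    params = {'objective': 'multi:softprob', 'num_class': 3,
              'eta': 0.3,          # learning rate
              'max_depth': 3,
              'gamma': 2}          # minimum loss reduction to split a leaf
    booster = xgb.train(params, dtrain, num_boost_round=20)
    proba = booster.predict(dtest).reshape(-1, 3)   # one [p0, p1, p2] row per sample
    return np.argmax(proba, axis=1)                 # class with the highest probability
```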
I also choose to tune the number of iterations and the gamma value.
Average classification error vs number of iterations (the error is the average error of the 5-fold cross-validation):
The selected number of iterations is 15.
Average classification error vs 1/gamma (the error is the average error of the 5-fold cross-validation):

The selected gamma is 1 or 2.
Finally, the learning rate is 0.3, the max depth of a tree is 3, the minimum loss reduction (gamma) required to split a leaf node is 1, and the number of iterations is 15. The objective is "softprob".
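A sketch of this final check using xgboost's built-in 5-fold cross-validation; the 'merror' (multi-class error) metric and the helper name are my choices:

```python
import xgboost as xgb

def final_xgb_cv(X, y):
    """5-fold CV with the final parameters; returns per-round train/test merror."""
    dtrain = xgb.DMatrix(X, label=y)
    params = {'objective': 'multi:softprob', 'num_class': 3,
              'eta': 0.3, 'max_depth': 3, 'gamma': 1}
    history = xgb.cv(params, dtrain, num_boost_round=15, nfold=5,
                     metrics='merror', seed=0)
    return history[['train-merror-mean', 'test-merror-mean']]
```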
After setting these parameters, I run 5-fold cross-validation again; the training and testing errors of each fold are:

The final testing error is 2.8%
