IST 557 Data Mining
1.1:
In question 1, I wrote a decision_tree function that prints each split of the tree.
A pair (X, Y) denotes an internal split: X is the feature index, so the split is on the (X+1)th feature, and Y is the split value for that feature. For example, (0, 4) means the first split is on the 1st feature (Cell size) with split value 4. Here I do not use the midpoint of two adjacent values c1 and c2 as the split point, because every feature is integer-valued, so the midpoint (c1+c2)/2 plays the same role as c2 itself: with adjacent values 1 and 2 the midpoint is (1+2)/2 = 1.5, and the rule "x < 1.5 vs. x >= 1.5" partitions the data exactly the same way as "x < 2 vs. x >= 2". To reduce the computational cost, I therefore use c2 directly as the split point.
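A minimal sketch of this split-point choice (the helper below is illustrative, not the actual decision_tree code):

import numpy as np

def candidate_split_points(feature_values):
    # For integer-valued features, splitting at the midpoint (c1 + c2) / 2 of two
    # adjacent values gives exactly the same partition as splitting at the larger
    # value c2 with the rule "x < c2 vs. x >= c2", so only the sorted unique
    # values themselves are used as candidates.
    values = np.unique(feature_values)   # sorted unique feature values
    return values[1:]                    # the smallest value cannot split the data

# Example: with values 1 and 2, splitting at 1.5 or at 2 yields the same partition.
x = np.array([1, 1, 2, 2, 2])
for c in candidate_split_points(x):
    print(c, x[x < c], x[x >= c])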
A triple (X, Y, Z) denotes a leaf node: X is the majority class of the node, Y is the number of samples in the node, and Z is the proportion of samples belonging to that majority class. For example, (4, 41, 1) means the majority class of this leaf is class 4, the leaf contains 41 samples, and 100% of them are class 4.
The numbers 0 and 1 that appear outside the parentheses mark the two branches of each split, i.e. the tree is a binary split tree.
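For reference, a small sketch of how such a leaf tuple could be computed (illustrative code, not the original implementation):

from collections import Counter

def leaf_summary(y):
    # Return (majority class, number of samples, proportion of the majority class).
    major_class, major_count = Counter(y).most_common(1)[0]
    return major_class, len(y), major_count / len(y)

# A pure leaf with 41 samples of class 4 yields (4, 41, 1.0).
print(leaf_summary([4] * 41))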
1.2:
Tree plot with the class counts at each node: [50, 50], [50, 9].
2.1:
First, the splitting criterion: I use entropy as the impurity function, and the midpoint between each pair of adjacent values of each feature as a candidate split point c. For each node selected for splitting, I compute the entropy of every possible split (every feature and every candidate split point) and choose the feature and split point c with the minimum weighted entropy.
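A minimal sketch of this entropy-based split search (the helper names are mine, not the actual code):

import numpy as np

def entropy(y):
    # Shannon entropy of the class labels y.
    _, counts = np.unique(y, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_split(X, y):
    # Try every feature and every midpoint of adjacent values; return the split
    # with the minimum weighted entropy of the two children.
    best = (None, None, np.inf)          # (feature index, split value, weighted entropy)
    n = len(y)
    for j in range(X.shape[1]):
        values = np.unique(X[:, j])
        for c in (values[:-1] + values[1:]) / 2.0:
            left, right = y[X[:, j] < c], y[X[:, j] >= c]
            score = len(left) / n * entropy(left) + len(right) / n * entropy(right)
            if score < best[2]:
                best = (j, c, score)
    return best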
Stopping Criteria: In this question I use two stopping criteria.
The first is the maximum depth of the tree: if the depth of the tree exceeds 4, the algorithm stops splitting.
The second is the minimum number of samples required to split a node: if an internal node contains fewer than 30 samples, it is not split further.
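These two rules correspond to the max_depth and min_samples_split parameters of scikit-learn's DecisionTreeClassifier; a minimal sketch, assuming the scikit-learn implementation is an acceptable stand-in for the custom tree:

from sklearn.tree import DecisionTreeClassifier

# Entropy splitting criterion with the two stopping criteria described above:
# no split beyond depth 4, and nodes with fewer than 30 samples are not split.
clf = DecisionTreeClassifier(criterion="entropy", max_depth=4, min_samples_split=30)
# clf.fit(X_train, y_train)   # X_train, y_train: the training data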
Tree plot:
Stopping Criterion 1: max depth = 4
Stopping Criterion 2: min samples split = 30
Summary:
From these three plots, we can see that the entropy splitting criterion is better than the Gini criterion: for all three stopping criteria, the testing error with entropy is lower than the testing error with Gini. I therefore use entropy as the splitting criterion in the rest of the questions.
From these three plots and the testing error with entropy, the chosen parameters for the three stopping criteria are:
max depth = 3; minimum samples per leaf = 5; minimum samples to split = 50; the testing error is about 3.5%.
When using 5-fold cross-validation, the best "min samples split" value is 5, 10, or 15.
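A minimal sketch of this 5-fold cross-validation over min_samples_split (the candidate grid and the scikit-learn breast cancer data are stand-ins for illustration):

from sklearn.datasets import load_breast_cancer        # stand-in data set
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Average 5-fold cross-validation classification error for each candidate value.
for m in [2, 5, 10, 15, 20, 30, 40, 50]:               # illustrative grid
    clf = DecisionTreeClassifier(criterion="entropy", min_samples_split=m, random_state=0)
    acc = cross_val_score(clf, X, y, cv=5)
    print(m, round(1 - acc.mean(), 4))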
2.4:
For outer 5-fold cv:
    For inner 5-fold cv:
        For max depth in […….]:
            For min samples splits in […………..]:
                Testing error
In this question there are two stopping-criterion parameters to select and two nested layers of cross-validation: in each inner 5-fold CV loop I select the best parameters, and the plots below show how the parameters are selected, where the error is the average error over the inner 5-fold CV.
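A minimal sketch of this nested 5-fold cross-validation (the parameter grids and the data set are illustrative placeholders; the grids actually searched are the ones reflected in the results below):

from sklearn.datasets import load_breast_cancer                  # stand-in data set
from sklearn.model_selection import GridSearchCV, KFold, cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

param_grid = {"max_depth": [2, 3, 4, 5, 6, 7],                   # illustrative grids
              "min_samples_split": [2, 5, 10, 15, 20, 40]}

inner_cv = KFold(n_splits=5, shuffle=True, random_state=0)
outer_cv = KFold(n_splits=5, shuffle=True, random_state=1)

# The inner loop picks (max depth, min samples split); the outer loop
# estimates the test error of the whole selection procedure.
search = GridSearchCV(DecisionTreeClassifier(criterion="entropy", random_state=0),
                      param_grid, cv=inner_cv)
outer_acc = cross_val_score(search, X, y, cv=outer_cv)
print("outer-fold test errors:", 1 - outer_acc)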
Training error for the two stopping criteria:
Fold 2: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 2:
(2, 10); (3, 5); (4, 2); (5, 5); (6, 5); (7, 2)
From the error on the outer CV, the best parameters are (3, 5), (5, 5), (2, 10), (6, 5) = (max depth, min samples split).
The test error of (3, 5) is 2.8%.
Fold 3: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 3:
(2, 2); (3, 40); (4, 10); (5, 2); (6, 2); (7, 2)
From the error on the outer CV, the best parameter is (3, 40) = (max depth, min samples split).
The test error of (3, 40) is 2.8%.
Fold 4: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 4:
(2, 2); (3, 5); (4, 20); (5, 2); (6, 2); (7, 10)
From the error on the outer CV, the best parameters are (4, 20), (6, 2) = (max depth, min samples split).
The test error of (4, 20) is 5.7%.
Fold 2: the best parameters obtained from the inner 5-fold CV when the outer fold is fold 2:
(2, 2); (3, 10); (4, 5); (5, 2); (6, 2); (7, 5)
From the error on the outer CV, the best parameter is (6, 2) = (max depth, min samples split).
The test error of (6, 2) is 0%.
Summary: the testing errors of the outer 5-fold CV (with the corresponding parameters listed above).
Average classification error vs. 1/(minimum samples split): the error is the average error over the 5-fold cross-validation.
The best testing error is 2.3%, obtained when the max depth is 3 and the minimum samples split is 10 or 15 for each tree.
Training error vs. 1/(minimum samples split) for each fold: the error is the training error of each of the 5 folds.
Testing error vs. 1/(minimum samples split) for each fold: the error is the testing error of each of the 5 folds.
Parameter Setting:
In this question I used the xgboost package. Initially, the learning rate is 0.3, the maximum depth of a tree is 3, the minimum loss reduction required to split a leaf node of the tree (gamma) is 2, and the number of boosting iterations is 20.
The objective is "multi:softprob", which is used for multi-class classification. We have 3 classes in this question, so the prediction for each sample is a list of 3 class probabilities, e.g. [0.2, 0.3, 0.5], and the class with the maximum probability is the predicted class for that sample. In this example the predicted class is class 3.
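A minimal sketch of how these softprob outputs are turned into a predicted class (the data here are random placeholders; only the stated parameter values are taken from the text):

import numpy as np
import xgboost as xgb

# With objective "multi:softprob" and num_class=3, predict() returns one
# probability per class for every sample; the predicted class is the argmax.
params = {"objective": "multi:softprob", "num_class": 3,
          "eta": 0.3, "max_depth": 3, "gamma": 2}

rng = np.random.default_rng(0)
X = rng.normal(size=(100, 4))            # placeholder features
y = rng.integers(0, 3, size=100)         # placeholder labels 0, 1, 2

dtrain = xgb.DMatrix(X, label=y)
bst = xgb.train(params, dtrain, num_boost_round=20)

proba = bst.predict(xgb.DMatrix(X))      # shape (100, 3), e.g. [0.2, 0.3, 0.5]
pred = proba.argmax(axis=1)              # index of the largest probability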
I also tune the number of iterations and the gamma value.
Average classification error vs. number of iterations: the error is the average error over the 5-fold cross-validation.
The best number of iterations is 15.
Average classification error vs. 1/gamma: the error is the average error over the 5-fold cross-validation.
The best gamma is 1 or 2.
Finally, the learning rate is 0.3, the maximum depth of a tree is 3, the minimum loss reduction required to split a leaf node of the tree (gamma) is 1, and the number of iterations is 15. The objective is "multi:softprob".
After setting these parameters, I run 5-fold cross-validation again to evaluate the model; the training error and testing error of each fold are:
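A minimal sketch of how this per-fold evaluation could be implemented with the final settings (the iris data and the fold splitting shown here are assumptions for illustration, not the original script):

import numpy as np
import xgboost as xgb
from sklearn.datasets import load_iris                  # stand-in 3-class data set
from sklearn.model_selection import KFold

X, y = load_iris(return_X_y=True)

# Final parameter setting from above: learning rate 0.3, max depth 3,
# gamma 1, 15 boosting iterations, objective "multi:softprob".
params = {"objective": "multi:softprob", "num_class": 3,
          "eta": 0.3, "max_depth": 3, "gamma": 1}

folds = KFold(n_splits=5, shuffle=True, random_state=0)
for i, (train_idx, test_idx) in enumerate(folds.split(X), 1):
    dtrain = xgb.DMatrix(X[train_idx], label=y[train_idx])
    dtest = xgb.DMatrix(X[test_idx], label=y[test_idx])
    bst = xgb.train(params, dtrain, num_boost_round=15)
    train_err = np.mean(bst.predict(dtrain).argmax(axis=1) != y[train_idx])
    test_err = np.mean(bst.predict(dtest).argmax(axis=1) != y[test_idx])
    print(f"fold {i}: training error {train_err:.3f}, testing error {test_err:.3f}")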