Decision Tree - All Cost Functions

Statistics 202: Data Mining
Jonathan Taylor
Classification

Problem description
We are given a data matrix X with either continuous or discrete variables, such that each row X_i ∈ F, and a set of labels Y with each Y_i ∈ L.

For a k-class problem, #L = k and we can think of L = {1, . . . , k}.

Our goal is to find a classifier

    f : F \to L
Classification

A supervised problem
Classification is a supervised problem.

Usually, we use a subset of the data, the training set, to learn or estimate the classifier, yielding f̂ = f̂_training.

The performance of f̂ is measured by applying it to each case in the test set and computing

    \sum_{j \in \text{test}} L\left(\hat{f}_{\text{training}}(X_j),\, Y_j\right)
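As a concrete R sketch of this procedure (the 2/3 : 1/3 split, the rpart classifier, and the 0-1 loss are illustrative assumptions, not choices made in the slides):

    library(rpart)

    set.seed(1)
    n <- nrow(iris)
    train <- sample(n, size = round(2 * n / 3))   # 2/3 training, 1/3 test

    fit <- rpart(Species ~ ., data = iris[train, ], method = "class")

    # 0-1 loss summed over the test set
    pred <- predict(fit, newdata = iris[-train, ], type = "class")
    sum(pred != iris$Species[-train])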
Classification

Common techniques

Decision tree-based methods
Rule-based methods
Discriminant analysis
Memory-based reasoning
Neural networks
Naïve Bayes
Support vector machines
Classification trees

[figures omitted]
Applying a decision tree rule

[sequence of figures omitted]
Decision boundary for tree

[figure omitted]
Decision tree for iris data using all features

[figure omitted]
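A minimal sketch of how such a tree can be fit in R, assuming the tree library (whose default split criterion is deviance/entropy); the two-feature tree on the next slide just replaces . with Petal.Length + Petal.Width:

    library(tree)

    # Classification tree for iris using all four features
    fit <- tree(Species ~ ., data = iris)

    plot(fit)
    text(fit)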
Decision tree for iris data using petal.length, petal.width

[figure omitted]
Regions in petal.length, petal.width plane

[figure omitted: partition of the plane with Petal length on the x-axis and Petal width on the y-axis]
Decision boundary for tree

[figures omitted]
Learning the tree

Hunt's algorithm (generic structure)
Let D_t be the set of training records that reach a node t.

If D_t contains records that all belong to the same class y_t, then t is a leaf node labeled as y_t.

If D_t = ∅, then t is a leaf node labeled by the default class, y_d.

If D_t contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset (see the sketch below).

This splitting procedure is what can vary between different tree-learning algorithms . . .
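A minimal R sketch of this recursion. The attribute test here is only a stand-in (a binary split at the median of the first column); real algorithms choose the test with an impurity criterion, discussed below. hunt and its return structure are illustrative names, not from any library:

    # Hunt's algorithm, generic structure (illustrative sketch)
    hunt <- function(X, y, default = names(which.max(table(y)))) {
      # D_t empty -> leaf labeled with the default class y_d
      if (length(y) == 0) return(list(leaf = TRUE, label = default))
      # all records in the same class -> leaf labeled with that class
      if (length(unique(y)) == 1) return(list(leaf = TRUE, label = as.character(y[1])))
      # otherwise: attribute test splitting D_t into smaller subsets
      cut <- median(X[[1]])                    # stand-in for the best split
      left <- X[[1]] < cut
      if (!any(left) || all(left))             # no usable split: majority-class leaf
        return(list(leaf = TRUE, label = names(which.max(table(y)))))
      list(leaf = FALSE, var = names(X)[1], cut = cut,
           left  = hunt(X[left,  , drop = FALSE], y[left],  default),
           right = hunt(X[!left, , drop = FALSE], y[!left], default))
    }

    tree_iris <- hunt(iris[, 1:4], iris$Species)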
Learning the tree

Issues

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.

What is the best split: what criterion do we use? The previous example chose first to split on Refund . . .

How to split the records: binary or multi-way? The previous example split Taxable Income at ≥ 80K . . .

When do we stop? Should we continue splitting until each node is pure, if possible? The previous example stopped with all nodes being completely homogeneous . . .
Different splits: ordinal / nominal

[figure omitted]
Different splits: continuous

[figure omitted]
Choosing a variable to split on

[figures omitted]
Learning the tree

GINI Index

Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, . . . , p_{k,t}).

Criterion:

    GINI(t) = \sum_{(j,j') \in \{1,\ldots,k\} : j \neq j'} p_{j,t} \, p_{j',t} = 1 - \sum_{j=1}^{k} p_{j,t}^2.
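An illustrative one-line version in R (a helper for these notes, not a library function), taking the vector of labels that reach a node:

    # GINI index of a node
    gini <- function(y) {
      p <- table(y) / length(y)   # class frequencies p_t
      1 - sum(p^2)
    }

    gini(rep(c("D", "R"), c(7, 3)))  # 1 - 0.7^2 - 0.3^2 = 0.42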
Learning the tree

Gain in GINI Index for a potential split

Suppose t is to be split into j new child nodes (t_l)_{1≤l≤j}.

Each child node has a count n_l and a vector of frequencies (p_{1,t_l}, . . . , p_{k,t_l}); hence each has its own GINI index, GINI(t_l).

The gain in GINI Index for this split is

    \text{Gain}(GINI, t \to (t_l)_{1 \le l \le j}) = GINI(t) - \frac{\sum_{l=1}^{j} n_l \, GINI(t_l)}{\sum_{l=1}^{j} n_l}.
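A sketch of this gain computation, reusing gini from above; the list-of-children interface is an assumption for illustration, and the crit argument lets the same helper work with the other criteria introduced below:

    # Gain in impurity criterion `crit` when parent labels y are split
    # into a list of child label vectors
    impurity_gain <- function(y, children, crit = gini) {
      n <- sapply(children, length)                       # child counts n_l
      crit(y) - sum(n * sapply(children, crit)) / sum(n)  # drop in weighted average
    }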
Decision tree for iris data using all features with GINI

[figure omitted]
Learning the tree

Entropy / Deviance / Information

Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, . . . , p_{k,t}).

Criterion:

    H(t) = -\sum_{j=1}^{k} p_{j,t} \log p_{j,t}
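The analogous illustrative helper in R, with the usual convention that 0 log 0 = 0:

    # Entropy of a node
    entropy <- function(y) {
      p <- table(y) / length(y)
      p <- p[p > 0]               # convention: 0 log 0 = 0
      -sum(p * log(p))
    }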
Decision tree for iris data using all features with Entropy

[figure omitted]
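Both iris trees above can be re-fit with rpart, whose parms argument selects the splitting criterion ("gini" and "information" are rpart's actual options; other settings are left at their defaults here, so the result may differ slightly from the figures):

    library(rpart)

    fit_gini <- rpart(Species ~ ., data = iris, method = "class",
                      parms = list(split = "gini"))
    fit_info <- rpart(Species ~ ., data = iris, method = "class",
                      parms = list(split = "information"))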
Learning the tree

Gain in entropy for a potential split

Suppose t is to be split into j new child nodes (t_l)_{1≤l≤j}.

Each child node has a count n_l and a vector of frequencies (p_{1,t_l}, . . . , p_{k,t_l}); hence each has its own entropy, H(t_l).

The gain in entropy for this split is

    \text{Gain}(H, t \to (t_l)_{1 \le l \le j}) = H(t) - \frac{\sum_{l=1}^{j} n_l \, H(t_l)}{\sum_{l=1}^{j} n_l}.
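With the impurity_gain helper above, this is just a change of criterion, e.g. impurity_gain(y, children, crit = entropy).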
Learning the tree

Misclassification Error

Suppose we have k classes and node t has frequencies p_t = (p_{1,t}, . . . , p_{k,t}).

The mode is

    \hat{k}(t) = \text{argmax}_k \, p_{k,t}.

Criterion:

    MC(t) = 1 - p_{\hat{k}(t),t} = 1 - \max_k p_{k,t}.
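And the corresponding illustrative helper:

    # Misclassification error of a node: 1 - max_k p_{k,t}
    mc <- function(y) {
      p <- table(y) / length(y)
      1 - max(p)
    }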
Different criteria: GINI, H, MC

[figure omitted]
Learning the tree

Misclassification Error

Example: suppose the parent node has 10 cases: {7D, 3R}.

A candidate split produces two nodes: {3D, 0R} and {4D, 3R}.

The gain in MC is 0, but the gain in GINI is 0.42 − 0.34 > 0.

Similarly, entropy will also show an improvement . . .
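Checking these numbers with the helpers defined above:

    parent   <- rep(c("D", "R"), c(7, 3))
    children <- list(rep("D", 3), rep(c("D", "R"), c(4, 3)))

    impurity_gain(parent, children, crit = mc)       # 0
    impurity_gain(parent, children, crit = gini)     # ~0.077
    impurity_gain(parent, children, crit = entropy)  # ~0.133 > 0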
Choosing the split for a continuous variable

[figure omitted]
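For a numeric attribute, the usual approach is to sort the distinct values and score each candidate cutpoint by the resulting gain; a sketch using the helpers above (an exhaustive scan, which real implementations speed up with running counts):

    # Scan candidate cutpoints of numeric column x, scoring each by GINI gain
    best_cut <- function(x, y) {
      cuts <- sort(unique(x))
      gains <- sapply(cuts, function(cc) {
        left <- x < cc
        if (!any(left) || all(left)) return(-Inf)
        impurity_gain(y, list(y[left], y[!left]))
      })
      list(cut = cuts[which.max(gains)], gain = max(gains))
    }

    best_cut(iris$Petal.Length, iris$Species)  # separates setosa cleanly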
Learning the tree

Stopping training

As trees get deeper, or if splits are multi-way, the number of data points per leaf node drops very quickly.

Trees that are too deep tend to overfit the data.

A common strategy is to "prune" the tree by removing some internal nodes.
Learning the tree

Cost-complexity pruning (tree library)

Given a criterion Q like H or GINI, we define the cost-complexity of a tree T with terminal nodes (t_j)_{1≤j≤m} as

    C_\alpha(T) = \sum_{j=1}^{m} n_j \, Q(t_j) + \alpha m.

The pruned tree is chosen by minimizing over subtrees of the fully-grown tree T_L:

    \hat{T}_\alpha = \text{argmin}_{T \le T_L} C_\alpha(T).
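In R, the tree library exposes this via prune.tree (and prune.misclass for Q = misclassification error); the value of k (the α above) is an arbitrary illustration:

    library(tree)

    fit <- tree(Species ~ ., data = iris)

    pruned <- prune.tree(fit, k = 2)          # prune at cost-complexity alpha = 2

    cv <- cv.tree(fit, FUN = prune.misclass)  # choose alpha by cross-validation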
Pre-pruning (rpart library)

These methods stop the algorithm before it becomes a fully-grown tree.

Examples (see the sketch below):

Stop if all instances belong to the same class (kind of obvious).

Stop if the number of instances is less than some user-specified threshold. Both tree and rpart have rules like this.

Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test).

Stop if expanding the current node does not improve impurity measures (e.g., GINI or information gain). This relates to cp in rpart.
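A sketch of these thresholds in rpart (minsplit, minbucket, and cp are rpart's actual control parameters; the values shown are its defaults):

    library(rpart)

    fit <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(minsplit = 20,  # don't split nodes with < 20 cases
                                         minbucket = 7,  # every leaf must keep >= 7 cases
                                         cp = 0.01))     # a split must improve fit by a factor of cp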
Training and test error as a function of cp

[figure omitted]
Evaluating a classifier

Metrics for performance evaluation

Confusion matrix:

                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

The most widely-used metric is accuracy . . .
Evaluating a classifier

Measures of performance

Simplest is accuracy:

    \text{Accuracy} = \frac{TP + TN}{TP + TN + FP + FN} = \text{SMC}(\text{Actual}, \text{Predicted}) = 1 - \text{Misclassification Rate}
Evaluating a classifier

Cost Matrix

[table omitted: a cost matrix C(i|j), giving the cost of predicting class i when the true class is j, laid out like the confusion matrix with PREDICTED CLASS as columns]

Measures of performance

The classification rule changes to

    \text{Label}(p, C) = \text{argmin}_i \sum_j C(i|j) \, p_j
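A sketch of this rule in R; C has rows indexed by the predicted class i and columns by the actual class j, and p is the vector of class probabilities at a node (the names are illustrative):

    # Cost-sensitive label: minimize the expected cost sum_j C[i, j] * p[j]
    cost_label <- function(p, C) rownames(C)[which.min(C %*% p)]

    # The cost matrix from the next slide: false negatives cost 100
    C <- matrix(c(-1,   1,
                  100,  0), nrow = 2, byrow = TRUE,
                dimnames = list(predicted = c("+", "-"), actual = c("+", "-")))

    cost_label(c(0.3, 0.7), C)  # costs 0.4 vs 30: predict "+" even though p(+) < p(-)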
Evaluating a classifier

Computing cost of classification

Cost matrix C(i|j):

                        PREDICTED CLASS
                        +       -
ACTUAL        +        -1     100
CLASS         -         1       0
Evaluating a classifier

Measures of performance

Other common ones:

    \text{Precision} = \frac{TP}{TP + FP}

    \text{Specificity} = \frac{TN}{TN + FP} = \text{TNR}

    \text{Sensitivity} = \text{Recall} = \frac{TP}{TP + FN} = \text{TPR}

    F = \frac{2 \cdot \text{Recall} \cdot \text{Precision}}{\text{Recall} + \text{Precision}} = \frac{2 \cdot TP}{2 \cdot TP + FN + FP}
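An illustrative helper collecting these measures from the four confusion-matrix counts:

    perf <- function(TP, FP, TN, FN) {
      c(accuracy    = (TP + TN) / (TP + TN + FP + FN),
        precision   = TP / (TP + FP),
        specificity = TN / (TN + FP),
        recall      = TP / (TP + FN),
        F           = 2 * TP / (2 * TP + FN + FP))
    }

    perf(TP = 40, FP = 10, TN = 45, FN = 5)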
Evaluating a classifier

Measures of performance

Precision emphasizes P(p = Y, a = Y) and P(p = Y, a = N).

Recall emphasizes P(p = Y, a = Y) and P(p = N, a = Y).

FPR = 1 − TNR and FNR = 1 − TPR.
Evaluating a classifier

Measures of performance

We have done some simple training / test splits to see how well our classifier is doing.

More accurately, this procedure measures how well our algorithm for learning the classifier is doing.

How well this works may depend on:

Model: are we using the right type of classifier model?
Cost: is our algorithm sensitive to the cost of misclassification?
Data size: do we have enough data to learn a model?
Evaluating a classifier

Learning curve

A learning curve shows how accuracy changes with varying sample size.

It requires a sampling schedule for creating the learning curve:

Arithmetic sampling (Langley, et al.)
Geometric sampling (Provost et al.)

[figure omitted]
Evaluating a classifier

Estimating performance

Holdout: split into test and training (e.g., 1/3 test, 2/3 training).

Random subsampling: repeated replicates of holdout, averaging results.

Cross-validation: partition the data into K disjoint subsets. For each subset S_i, train on all but S_i, then test on S_i (see the sketch below).

Stratified sampling: may be helpful to sample so the Y/N classes are roughly balanced in the training data.

0.632 bootstrap: combine training error and bootstrap error . . .
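A minimal sketch of K-fold cross-validation for a tree classifier (rpart on iris; K = 5 and the random fold assignment are illustrative choices):

    library(rpart)

    set.seed(1)
    K <- 5
    fold <- sample(rep(1:K, length.out = nrow(iris)))   # random fold labels

    err <- sapply(1:K, function(k) {
      fit  <- rpart(Species ~ ., data = iris[fold != k, ], method = "class")
      pred <- predict(fit, newdata = iris[fold == k, ], type = "class")
      mean(pred != iris$Species[fold == k])             # error on held-out fold
    })
    mean(err)  # cross-validated misclassification rate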