
Statistics 202: Data Mining

Classification & Decision Trees

Jonathan Taylor

Based in part on slides from the textbook and slides of Susan Holmes

October 19, 2012

Classification

Problem description

We are given a data matrix X with either continuous or discrete variables, such that each row Xi ∈ F, together with labels Y with each Yi ∈ L.
For a k-class problem, #L = k and we can think of L = {1, . . . , k}.
Our goal is to find a classifier

    f : F → L

that allows us to predict the label of a new observation given its features.

Classification

A supervised problem

Classification is a supervised problem.
Usually, we use a subset of the data, the training set, to learn or estimate the classifier, yielding f̂ = f̂_training.
The performance of f̂ is measured by applying it to each case in the test set and computing

    Σ_{j ∈ test} L(f̂_training(X_j), Y_j)

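As a concrete illustration (not from the slides), here is a minimal R sketch of this train/test recipe, using the iris data and the rpart library that appear later in the deck; the 0-1 loss plays the role of L.

    ## Fit a classifier on a training set, evaluate 0-1 loss on a held-out test set.
    library(rpart)

    set.seed(1)
    train <- sample(nrow(iris), 100)           # 100 training cases, 50 test cases
    fit <- rpart(Species ~ ., data = iris[train, ], method = "class")

    pred <- predict(fit, newdata = iris[-train, ], type = "class")
    test_loss <- sum(pred != iris$Species[-train])   # sum of 0-1 losses over the test set
    test_loss / (nrow(iris) - length(train))         # test misclassification rate
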
Classification

Examples of classification tasks

Predicting whether a tumor is benign or malignant.
Classifying credit card transactions as fraudulent or legitimate.
Predicting the type of a given tumor among several types.
Categorizing a document or news story as one of {finance, weather, sports, etc.}

Classification

Common techniques

Decision Tree based Methods
Rule-based Methods
Discriminant Analysis
Memory based reasoning
Neural Networks
Naïve Bayes
Support Vector Machines

Classification trees

[Figures]

Applying a decision tree rule

[Sequence of figures applying the tree rule to a test record]

Decision boundary for tree

[Figure]

Decision tree for iris data using all features

[Figure]

Decision tree for iris data using petal.length, petal.width

[Figure]

Regions in petal.length, petal.width plane

[Figure: decision regions; x-axis Petal length (0–8), y-axis Petal width (0.0–3.0)]

Decision boundary for tree

Figure: Trees have trouble capturing structure not parallel to the axes.

Learning the tree

Hunt's algorithm (generic structure)

Let Dt be the set of training records that reach a node t.
If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
If Dt = ∅, then t is a leaf node labeled by the default class, yd.
If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets, and recursively apply the procedure to each subset.
This splitting procedure is what varies across different tree-learning algorithms . . .

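A minimal R sketch of this recursion, not from the slides: the split rule used here ("take the first numeric attribute, threshold at its median") is only a placeholder for whatever criterion a real algorithm would use.

    ## Sketch of Hunt's algorithm (generic structure) with a placeholder split rule.
    hunt <- function(D, y, default = "none") {
      if (length(y) == 0)                        # D_t empty: leaf labeled by default class
        return(list(leaf = TRUE, label = default))
      if (length(unique(y)) == 1)                # all records in the same class: leaf
        return(list(leaf = TRUE, label = y[1]))
      majority <- names(which.max(table(y)))
      x <- D[[1]]                                # placeholder attribute test:
      s <- median(x)                             # first column, split at its median
      left <- x <= s
      if (all(left) || all(!left))               # no useful split available: leaf
        return(list(leaf = TRUE, label = majority))
      list(leaf = FALSE, var = names(D)[1], split = s,
           left  = hunt(D[left,  , drop = FALSE], y[left],  majority),
           right = hunt(D[!left, , drop = FALSE], y[!left], majority))
    }

    tree <- hunt(iris[, 1:4], as.character(iris$Species))
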
Learning the tree

Issues

Greedy strategy: split the records based on an attribute test that optimizes a certain criterion.
What is the best split: what criterion do we use? The previous example chose first to split on Refund . . .
How to split the records: binary or multi-way? The previous example split Taxable Income at ≥ 80K . . .
When do we stop: should we continue until each node is pure, if possible? The previous example stopped with all leaf nodes completely homogeneous . . .

Different splits: ordinal / nominal

Figure: Binary or multi-way?

Different splits: continuous

Figure: Binary or multi-way?

Choosing a variable to split on

Figure: Which variable should we start splitting on?

Learning the tree

Choosing the best split

Need some numerical criterion to choose among possible splits.
Criterion should favor homogeneous or pure nodes.
Common cost functions:
  Gini Index
  Entropy / Deviance / Information
  Misclassification Error

Choosing a variable to split on

[Figure]

Learning the tree

GINI Index

Suppose we have k classes and node t has frequencies pt = (p1,t, . . . , pk,t).
Criterion:

    GINI(t) = Σ_{j ≠ j′} pj,t pj′,t = 1 − Σ_{j=1}^{k} pj,t²

Maximized when pj,t = 1/k for all j, with value 1 − 1/k.
Minimized when all records belong to a single class.
Minimizing GINI will favour pure nodes . . .

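A small R helper mirroring the formula above; it assumes the node is described by its class counts (or proportions).

    ## Gini index of a node, given the class counts (or proportions) at that node.
    gini <- function(counts) {
      p <- counts / sum(counts)
      1 - sum(p^2)
    }

    gini(c(5, 5))    # two balanced classes: 1 - 1/k = 0.5 (maximal for k = 2)
    gini(c(10, 0))   # pure node: 0 (minimal)
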
Learning the tree

Gain in GINI Index for a potential split

Suppose t is to be split into j new child nodes (tl), 1 ≤ l ≤ j.
Each child node has a count nl and a vector of frequencies (p1,tl, . . . , pk,tl), hence its own GINI index, GINI(tl).
The gain in GINI Index for this split is

    Gain(GINI, t → (tl)) = GINI(t) − ( Σ_{l=1}^{j} nl GINI(tl) ) / ( Σ_{l=1}^{j} nl )

The greedy algorithm chooses the biggest gain in GINI index among a list of possible splits.

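A sketch of this gain computation in R, written generically so the same function also works for the entropy and misclassification criteria later; `children` is assumed to be a list of class-count vectors, one per child node.

    ## Gain of a split under an impurity criterion (defaults to the Gini index above).
    ## parent: class counts at node t; children: list of class-count vectors for t_1..t_j.
    split_gain <- function(parent, children,
                           criterion = function(cnt) { p <- cnt / sum(cnt); 1 - sum(p^2) }) {
      n <- sapply(children, sum)
      weighted <- sum(n * sapply(children, criterion)) / sum(n)
      criterion(parent) - weighted
    }

    ## Example: split a {7, 3} parent into {3, 0} and {4, 3}.
    split_gain(c(7, 3), list(c(3, 0), c(4, 3)))   # Gini gain, about 0.077
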
Decision tree for iris data using all features with GINI

[Figure]

Learning the tree

Entropy / Deviance / Information

Suppose we have k classes and node t has frequencies pt = (p1,t, . . . , pk,t).
Criterion:

    H(t) = − Σ_{j=1}^{k} pj,t log pj,t

Maximized when pj,t = 1/k for all j, with value log k.
Minimized (value 0) when all records belong to a single class.
Minimizing entropy will favour pure nodes . . .

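The entropy criterion as a small R helper, with the usual 0 · log 0 = 0 convention; the natural log is assumed, matching the maximum value log k quoted above.

    ## Entropy of a node, given the class counts (or proportions) at that node.
    ## Uses the convention 0 * log(0) = 0 by dropping empty classes.
    entropy <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]
      -sum(p * log(p))
    }

    entropy(c(5, 5))    # maximal for k = 2: log(2), about 0.693
    entropy(c(10, 0))   # pure node: 0
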
Decision tree for iris data using all features with Entropy

[Figure]

Learning the tree

Gain in entropy for a potential split

Suppose t is to be split into j new child nodes (tl), 1 ≤ l ≤ j.
Each child node has a count nl and a vector of frequencies (p1,tl, . . . , pk,tl), hence its own entropy H(tl).
The gain in entropy for this split is

    Gain(H, t → (tl)) = H(t) − ( Σ_{l=1}^{j} nl H(tl) ) / ( Σ_{l=1}^{j} nl )

The greedy algorithm chooses the biggest gain in H among a list of possible splits.

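Since the split_gain() sketch above takes the impurity function as an argument, the same call covers the entropy gain; the value assumes the entropy() helper defined earlier.

    ## Entropy gain for the same {7, 3} -> {3, 0}, {4, 3} split, reusing the
    ## split_gain() and entropy() helpers sketched above.
    split_gain(c(7, 3), list(c(3, 0), c(4, 3)), criterion = entropy)   # about 0.13
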
Learning the tree

Misclassification Error

Suppose we have k classes and node t has frequencies pt = (p1,t, . . . , pk,t).
The modal class is

    k̂(t) = argmax_j pj,t.

Criterion:

    Misclassification Error(t) = 1 − pk̂(t),t

Not smooth in pt, unlike GINI and H, so it can be more difficult to optimize numerically.

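The same helper style for the misclassification criterion, again assuming class counts as input.

    ## Misclassification error of a node: one minus the proportion of the modal class.
    misclass <- function(counts) {
      p <- counts / sum(counts)
      1 - max(p)
    }

    misclass(c(7, 3))   # 0.3: labeling the node with the majority class errs on 3 of 10
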
Different criteria: GINI, H, MC

[Figure]

Learning the tree

Misclassification Error

Example: suppose the parent node has 10 cases, {7D, 3R}.
A candidate split produces two nodes: {3D, 0R} and {4D, 3R}.
The gain in MC is 0, but the gain in GINI is 0.42 − 0.343 > 0.
Similarly, entropy will also show an improvement . . .

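A quick check of these numbers with the helpers sketched earlier (gini, entropy, misclass, split_gain):

    ## Verify the example: the split improves Gini (and entropy) but not
    ## misclassification error.
    parent   <- c(7, 3)                      # {7D, 3R}
    children <- list(c(3, 0), c(4, 3))       # {3D, 0R} and {4D, 3R}

    split_gain(parent, children, criterion = misclass)   # 0
    split_gain(parent, children, criterion = gini)       # 0.42 - 0.343, about 0.077
    split_gain(parent, children, criterion = entropy)    # about 0.13
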
Choosing the split for a continuous variable

[Figure]

Learning the tree

Stopping training

As trees get deeper, or if splits are multi-way, the number of data points per leaf node drops very quickly.
Trees that are too deep tend to overfit the data.
A common strategy is to “prune” the tree by removing some internal nodes.

Learning the tree

Figure: Underfitting corresponds to the left-hand side, overfitting to the right.

Learning the tree

Cost-complexity pruning (tree library)

Given a criterion Q like H or GINI, we define the cost-complexity of a tree T with terminal nodes (tj), 1 ≤ j ≤ m:

    Cα(T) = Σ_{j=1}^{m} nj Q(tj) + α m

Given a large tree TL we might compute Cα(T) for any subtree T of TL.
The optimal tree is defined as

    T̂α = argmin_{T ≤ TL} Cα(T).

It can be found by “weakest-link” pruning. See The Elements of Statistical Learning for more . . .

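In R, the rpart library exposes this idea through its complexity parameter cp, which plays the role of α (up to rescaling); a minimal sketch, assuming the iris data as before:

    ## Grow a deliberately overgrown tree, inspect the cost-complexity sequence,
    ## then prune back.
    library(rpart)

    big <- rpart(Species ~ ., data = iris, method = "class",
                 control = rpart.control(cp = 0, minsplit = 2))
    printcp(big)                      # table of cp values and cross-validated error
    pruned <- prune(big, cp = 0.05)   # drop splits not worth a cp of 0.05
    pruned
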
Learning the tree

Pre-pruning (rpart library)

These methods stop the algorithm before it grows a fully-grown tree.
Examples:
  Stop if all instances belong to the same class (kind of obvious).
  Stop if the number of instances is less than some user-specified threshold. Both tree and rpart have rules like this.
  Stop if the class distribution of instances is independent of the available features (e.g., using a χ² test).
  Stop if expanding the current node does not improve the impurity measure (e.g., Gini or information gain). This relates to cp in rpart.

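A sketch of how these stopping rules are exposed as rpart.control() arguments; the particular thresholds below are arbitrary illustrations, not recommendations.

    ## Pre-pruning via rpart's control parameters: minimum node size and a
    ## minimum improvement (cp) required before a split is attempted.
    library(rpart)

    ctrl <- rpart.control(minsplit = 20,   # don't try to split nodes with < 20 cases
                          minbucket = 7,   # every leaf must contain at least 7 cases
                          cp = 0.01)       # a split must improve the fit by at least cp
    fit <- rpart(Species ~ ., data = iris, method = "class", control = ctrl)
    fit
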
Training and test error as a function of cp

[Figure]

Evaluating a classifier

Metrics for performance evaluation

                         PREDICTED CLASS
                         Class=Yes   Class=No
    ACTUAL   Class=Yes    a (TP)      b (FN)
    CLASS    Class=No     c (FP)      d (TN)

Most widely-used metric: accuracy.

Evaluating a classifier

Measures of performance

Simplest is accuracy:

    Accuracy = (TP + TN) / (TP + TN + FP + FN)
             = SMC(Actual, Predicted)
             = 1 − Misclassification Rate

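Computed in R from a confusion matrix, using the same train/test split sketched earlier:

    ## Accuracy from a confusion matrix on the held-out test set.
    library(rpart)

    set.seed(1)
    train <- sample(nrow(iris), 100)
    fit   <- rpart(Species ~ ., data = iris[train, ], method = "class")
    pred  <- predict(fit, newdata = iris[-train, ], type = "class")

    conf <- table(Actual = iris$Species[-train], Predicted = pred)
    conf
    sum(diag(conf)) / sum(conf)   # accuracy = 1 - misclassification rate
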
Evaluating a classifier

Accuracy isn’t everything

Consider an unbalanced 2-class problem with #1’s = 10 and #0’s = 9990.
Simply labelling everything 0 yields 99.9% accuracy.
But this classifier misses every case in class 1.

Evaluating a classifier

Cost Matrix

                         PREDICTED CLASS
    C(i|j)               Class=Yes    Class=No
    ACTUAL   Class=Yes   C(Yes|Yes)   C(No|Yes)
    CLASS    Class=No    C(Yes|No)    C(No|No)

C(i|j): cost of misclassifying a class j example as class i.

© Tan, Steinbach & Kumar, Introduction to Data Mining, 4/18/2004.



Learning the tree

Measures of performance

With a cost matrix, the classification rule changes to

    Label(p, C) = argmin_i Σ_j C(i|j) pj

Minimizing cost is equivalent to maximizing accuracy if C(Y|Y) = C(N|N) = c1 and C(Y|N) = C(N|Y) = c2 (with c2 > c1).

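A sketch of this rule in R: given a matrix of predicted class probabilities (rows = cases, columns = actual classes) and a cost matrix C with C[i, j] the cost of predicting class i when the truth is class j, pick the label with the smallest expected cost. The matrices below are toy illustrations.

    ## Cost-sensitive labeling: for each case choose argmin_i sum_j C(i|j) p_j.
    cost_label <- function(prob, C) {
      expected <- prob %*% t(C)                    # expected cost of each predicted label
      rownames(C)[apply(expected, 1, which.min)]
    }

    ## Toy example: false negatives are 10x as costly as false positives.
    C <- matrix(c(0, 1,
                  10, 0), nrow = 2, byrow = TRUE,
                dimnames = list(predicted = c("Yes", "No"), actual = c("Yes", "No")))
    prob <- rbind(c(Yes = 0.2, No = 0.8), c(Yes = 0.05, No = 0.95))
    cost_label(prob, C)   # first case gets "Yes" even though P(Yes) is only 0.2
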
Evaluating a classifier

Computing Cost of Classification

Cost matrix:

                    PREDICTED CLASS
    C(i|j)            +       -
    ACTUAL    +      -1     100
    CLASS     -       1       0

    Model M1        PREDICTED CLASS
                      +       -
    ACTUAL    +     150      40
    CLASS     -      60     250
    Accuracy = 80%,  Cost = 3910

    Model M2        PREDICTED CLASS
                      +       -
    ACTUAL    +     250      45
    CLASS     -       5     200
    Accuracy = 90%,  Cost = 4255

© Tan, Steinbach & Kumar, Introduction to Data Mining, 4/18/2004.

Evaluating a classifier

Measures of performance

Other common ones:

    Precision = TP / (TP + FP)

    Specificity = TN / (TN + FP) = TNR

    Sensitivity = Recall = TP / (TP + FN) = TPR

    F = 2 · Recall · Precision / (Recall + Precision)
      = 2 · TP / (2 · TP + FN + FP)

Evaluating a classifier

Measures of performance

Precision emphasizes P(p = Y, a = Y) and P(p = Y, a = N).
Recall emphasizes P(p = Y, a = Y) and P(p = N, a = Y).
FPR = 1 − TNR
FNR = 1 − TPR

Evaluating a classifier

Measure of performance

We have done some simple training / test splits to see how well our classifier is doing.
More accurately, this procedure measures how well our algorithm for learning the classifier is doing.
How well this works may depend on:
  Model: are we using the right type of classifier model?
  Cost: is our algorithm sensitive to the cost of misclassification?
  Data size: do we have enough data to learn a model?

Evaluating a classifier

Learning Curve

A learning curve shows how accuracy changes with varying sample size.
Requires a sampling schedule for creating the learning curve:
  Arithmetic sampling (Langley, et al.)
  Geometric sampling (Provost et al.)
Effect of small sample size:
  Bias in the estimate
  Variance of the estimate

Figure: As the amount of data increases, our estimate of accuracy improves, as does the variability of our estimate . . .

© Tan, Steinbach & Kumar, Introduction to Data Mining, 4/18/2004.

Evaluating a classifier

Estimating performance

Holdout: split into test and training sets (e.g. 1/3 test, 2/3 training).
Random subsampling: repeated replicates of holdout, averaging the results.
Cross validation: partition the data into K disjoint subsets. For each subset Si, train on all but Si, then test on Si.
Stratified sampling: it may be helpful to sample so that the Y/N classes are roughly balanced in the training data.
0.632 Bootstrap: combine training error and bootstrap error . . .

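A minimal sketch of K-fold cross-validation for the tree classifier, assuming the iris data; each fold's misclassification rate is averaged at the end.

    ## K-fold cross-validation of a classification tree (K = 5 here).
    library(rpart)

    set.seed(1)
    K <- 5
    folds <- sample(rep(1:K, length.out = nrow(iris)))   # assign each row to a fold

    cv_error <- sapply(1:K, function(k) {
      fit  <- rpart(Species ~ ., data = iris[folds != k, ], method = "class")
      pred <- predict(fit, newdata = iris[folds == k, ], type = "class")
      mean(pred != iris$Species[folds == k])             # fold-k misclassification rate
    })
    mean(cv_error)   # cross-validated estimate of the misclassification rate
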