CH 6
Outline
• Classification Definition
• Classification Techniques
• Decision Trees
• Practical Issues of Classification
A Programming Task
Classification: Definition
• Given a collection of records (the training set).
• Each record contains a set of attributes; one of the attributes is the class.
• Find a model for the class attribute as a function of the values of the other attributes.
• Goal: previously unseen records should be assigned a class as accurately as possible.
• A test set is used to determine the accuracy of the model. Usually, the given data set is divided into training and test sets: the training set is used to build the model and the test set is used to validate it (a code sketch of this workflow follows below).
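A minimal sketch of the train/validate workflow, assuming scikit-learn is available; the toy records and feature encoding are invented for illustration and are not part of the slides.

```python
# Minimal sketch of "split into training/test sets, build model, validate it".
# Assumes scikit-learn; the toy records below are invented for illustration.
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

# Each record: [refund (1 = Yes, 0 = No), marital status code, taxable income in K]
X = [[1, 0, 125], [0, 1, 100], [0, 0, 70], [1, 1, 120], [0, 2, 95],
     [0, 1, 60], [1, 2, 220], [0, 0, 85], [0, 1, 75], [0, 0, 90]]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]  # class attribute

# Divide the given data set into training and test sets.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

# Build the model on the training set, then validate it on the test set.
model = DecisionTreeClassifier()
model.fit(X_train, y_train)
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))
```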
[Figure: Illustrating the classification task. A labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a model; the model is then applied to an unlabeled Test Set (e.g., Tid 11: No, Small, 55K, ? and Tid 15: No, Large, 67K, ?) to predict the missing class labels.]
KNN
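The slide only names KNN (k-nearest neighbours). As a hedged illustration, a minimal nearest-neighbour classifier using scikit-learn; k = 3 and the toy points are arbitrary choices, not from the slides.

```python
# Illustrative k-nearest-neighbour classifier; k=3 and the data are invented.
from sklearn.neighbors import KNeighborsClassifier

X_train = [[1.0, 1.0], [1.2, 0.8], [0.9, 1.1], [5.0, 5.0], [5.2, 4.8], [4.9, 5.1]]
y_train = ["A", "A", "A", "B", "B", "B"]

knn = KNeighborsClassifier(n_neighbors=3)  # classify by majority vote of the 3 closest records
knn.fit(X_train, y_train)
print(knn.predict([[1.1, 0.9], [5.1, 5.0]]))  # -> ['A' 'B']
```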
Classification Techniques
• Decision Tree based Methods
• Rule-based Methods
• Memory based Reasoning
• Neural Networks
• Naïve Bayes and Bayesian Belief Networks
• Support Vector Machines
[Figure: Decision tree classification task. A Decision Tree model is induced from the labeled Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) and then applied, by deduction, to the unlabeled Test Set records (Tid 11–15, Class = ?).]
[Figure, repeated over several slides: applying the model to a test record (Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?). The tree tests Refund (Yes → NO; No → MarSt), then MarSt (Single, Divorced → TaxInc; Married → NO), then TaxInc (< 80K → NO; > 80K → YES). Since Refund = No and MarSt = Married, the record reaches the NO leaf: assign Cheat to “No”.]
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
• If Dt contains records that all belong to the same class yt, then t is a leaf node labeled as yt.
• If Dt is an empty set, then t is a leaf node labeled with the default class yd.
• If Dt contains records that belong to more than one class, use an attribute test to split the data into smaller subsets. Recursively apply the procedure to each subset (see the sketch below).
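A compact sketch of this recursive procedure in plain Python. The attribute-test chooser here is a deliberately naive placeholder (a real implementation would pick the test that minimizes Gini or maximizes information gain); the helper names and the tiny example are my own.

```python
from collections import Counter

def choose_split_attribute(records, labels):
    # Placeholder test-condition chooser: pick any attribute that still has more
    # than one distinct value (a real chooser would optimize Gini / information gain).
    for attr in records[0]:
        if len(set(r[attr] for r in records)) > 1:
            return attr
    return None

def hunt(records, labels, default_class=None):
    """Recursive structure of Hunt's algorithm over categorical attributes.
    records: list of dicts {attribute: value}; labels: class label per record."""
    if not records:                          # Dt is empty -> leaf labeled with default class yd
        return default_class
    if len(set(labels)) == 1:                # all records share one class yt -> leaf labeled yt
        return labels[0]
    majority = Counter(labels).most_common(1)[0][0]
    attr = choose_split_attribute(records, labels)
    if attr is None:                         # no attribute can split Dt any further
        return majority
    tree = {"split on": attr, "branches": {}}
    for value in set(r[attr] for r in records):          # one branch per attribute value
        keep = [i for i, r in enumerate(records) if r[attr] == value]
        sub_records = [{k: v for k, v in records[i].items() if k != attr} for i in keep]
        sub_labels = [labels[i] for i in keep]
        tree["branches"][value] = hunt(sub_records, sub_labels, default_class=majority)
    return tree

# Tiny example:
data = [{"Refund": "Yes", "Marital": "Single"},
        {"Refund": "No",  "Marital": "Married"},
        {"Refund": "No",  "Marital": "Single"}]
print(hunt(data, ["No", "No", "Yes"]))
```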
Hunt’s Algorithm
[Figure: step-by-step construction of the tree on the training data (Tid, Refund, Marital Status, Taxable Income, Cheat), starting from a single “Don’t Cheat” leaf and repeatedly splitting impure nodes until the leaves are labeled “Don’t Cheat” or “Cheat”.]
Tree Induction
• Greedy strategy
• Split the records based on an attribute test that optimizes a certain criterion
• Issues
• Determine how to split the records
• How to specify the attribute test condition?
• How to determine the best split?
• Determine when to stop splitting
Splitting on a nominal attribute (CarType):
• Multi-way split: one partition per distinct value (Family, Sports, Luxury).
• Binary split: divide the values into two subsets, e.g. {Sports, Luxury} vs. {Family}, or {Family, Luxury} vs. {Sports}; need to find the optimal partitioning.
Splitting on an ordinal attribute (Size):
• Multi-way split: one partition per distinct value (Small, Medium, Large).
• Binary split: divides values into two subsets; need to find the optimal partitioning, e.g. {Small, Medium} vs. {Large}, or {Small} vs. {Medium, Large}.
• What about the split {Small, Large} vs. {Medium}? (It does not respect the order of the values.)
Splitting on a continuous attribute (Taxable Income):
• Binary split: a threshold test such as Taxable Income > 80K? (Yes / No).
• Multi-way split: discretize the values into ordered ranges (the figure shows bins from < 10K up to > 80K).
How to determine the best split: a node with class distribution C0: 5, C1: 5 is non-homogeneous (high degree of impurity), while a node with C0: 9, C1: 1 is homogeneous (low degree of impurity). Splits that produce purer (more homogeneous) child nodes are preferred, which requires a measure of node impurity.
Measures of Node Impurity
• Gini index
• Entropy
• Misclassification error
[Figure: the parent node has impurity M0. Candidate split A? (Yes/No) produces children with impurities M1 and M2, whose weighted combination is M12; candidate split B? produces M3 and M4, combining to M34. Compare Gain = M0 – M12 vs. M0 – M34 and choose the split with the larger gain.]
GINI(t) = 1 − Σj [p(j | t)]², where p(j | t) is the relative frequency of class j at node t.

Examples of computing the Gini index of a node from its class counts:
C1: 0, C2: 6 → Gini = 0.000
C1: 1, C2: 5 → Gini = 0.278
C1: 2, C2: 4 → Gini = 0.444
C1: 3, C2: 3 → Gini = 0.500
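A direct translation of this formula into plain Python, reproducing the example values above (a sketch; the function name is my own).

```python
def gini(counts):
    """GINI(t) = 1 - sum_j p(j|t)^2, computed from per-class record counts at node t."""
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

# Reproduces the table above:
for c1, c2 in [(0, 6), (1, 5), (2, 4), (3, 3)]:
    print(f"C1={c1}, C2={c2}: Gini={gini([c1, c2]):.3f}")   # 0.000, 0.278, 0.444, 0.500
```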
Binary Attributes: Computing GINI Index
• Splits the records into two partitions.
• Effect of weighting the partitions: larger and purer partitions are sought.

Parent: C1 = 6, C2 = 6, Gini = 0.500.
Split on B?: Node N1 (Yes) has C1 = 5, C2 = 2; Node N2 (No) has C1 = 1, C2 = 4.
Gini(N1) = 1 − (5/7)² − (2/7)² = 0.408
Gini(N2) = 1 − (1/5)² − (4/5)² = 0.320
Gini(Children) = 7/12 × 0.408 + 5/12 × 0.320 = 0.371
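The weighted-children computation for this split, as a self-contained sketch (the helper names are my own; the figures match the corrected numbers above).

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def gini_split(partitions):
    """Weighted Gini of a candidate split: sum over children of (n_i / n) * Gini(child_i)."""
    n = sum(sum(p) for p in partitions)
    return sum(sum(p) / n * gini(p) for p in partitions)

# Split B?: N1 has (C1=5, C2=2), N2 has (C1=1, C2=4)
print(round(gini([6, 6]), 3))                  # parent node: 0.5
print(round(gini([5, 2]), 3))                  # N1: 0.408
print(round(gini([1, 4]), 3))                  # N2: 0.32
print(round(gini_split([[5, 2], [1, 4]]), 3))  # weighted children: 0.371
```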
Continuous attributes: to compute the Gini index efficiently, sort the attribute values and linearly scan the candidate split positions, updating the class counts on each side. [Table: for the sorted Taxable Income values, the candidate splits give Gini values 0.420, 0.400, 0.375, 0.343, 0.417, 0.400, 0.300, 0.343, 0.375, 0.400, 0.420; the position with the smallest Gini (0.300) is chosen.]
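A sketch of this sorted-scan idea: try the midpoint between each adjacent pair of sorted values as a threshold and keep the one with the lowest weighted Gini. The income/class values mirror the deck's running example but are listed here by assumption; the function names are my own.

```python
def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_numeric_split(values, labels):
    """Scan candidate thresholds (midpoints of adjacent sorted values) and
    return (weighted Gini, threshold) for the best one."""
    pairs = sorted(zip(values, labels))
    classes = sorted(set(labels))
    best = (float("inf"), None)
    for i in range(1, len(pairs)):
        if pairs[i][0] == pairs[i - 1][0]:
            continue                                 # same value: no valid threshold here
        threshold = (pairs[i][0] + pairs[i - 1][0]) / 2
        left = [y for v, y in pairs if v <= threshold]
        right = [y for v, y in pairs if v > threshold]
        w = (len(left) * gini([left.count(c) for c in classes]) +
             len(right) * gini([right.count(c) for c in classes])) / len(pairs)
        best = min(best, (w, threshold))
    return best

income = [60, 70, 75, 85, 90, 95, 100, 120, 125, 220]
cheat  = ["No", "No", "No", "Yes", "Yes", "Yes", "No", "No", "No", "No"]
print(best_numeric_split(income, cheat))             # -> (0.3, 97.5)
```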
Entropy(t) = − Σj p(j | t) log₂ p(j | t), where p(j | t) is the relative frequency of class j at node t.

Information gain of a split:
GAINsplit = Entropy(p) − Σi=1..k (ni / n) Entropy(i),
where the parent node p is split into k partitions and ni is the number of records in partition i.
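The same two formulas in code (a sketch in plain Python; log base 2, class counts per node, function names my own).

```python
from math import log2

def entropy(counts):
    """Entropy(t) = -sum_j p(j|t) * log2 p(j|t), from per-class counts at node t."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0) if n else 0.0

def information_gain(parent_counts, children_counts):
    """GAIN_split = Entropy(parent) - sum_i (n_i / n) * Entropy(child_i)."""
    n = sum(parent_counts)
    weighted = sum(sum(c) / n * entropy(c) for c in children_counts)
    return entropy(parent_counts) - weighted

print(round(entropy([3, 3]), 4))                           # 1.0 (maximum for two classes)
print(round(information_gain([7, 3], [[3, 0], [4, 3]]), 4))  # ~0.1916
```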
Splitting Based on INFO...
• Gain Ratio:
GainRATIOsplit = GAINsplit / SplitINFO, with SplitINFO = − Σi=1..k (ni / n) log₂ (ni / n),
where the parent node is split into k partitions and ni is the number of records in partition i. SplitINFO is the entropy of the partitioning itself, so splits into many small partitions are penalized.

Splitting criteria based on classification error:
Error(t) = 1 − maxi P(i | t)
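And the remaining two formulas as a self-contained sketch (the entropy helper is repeated so the snippet runs on its own; names and example counts are my own).

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0) if n else 0.0

def gain_ratio(parent_counts, children_counts):
    """GainRATIO_split = GAIN_split / SplitINFO,
    SplitINFO = -sum_i (n_i/n) log2(n_i/n)."""
    n = sum(parent_counts)
    sizes = [sum(c) for c in children_counts]
    gain = entropy(parent_counts) - sum(s / n * entropy(c) for s, c in zip(sizes, children_counts))
    split_info = -sum((s / n) * log2(s / n) for s in sizes if s > 0)
    return gain / split_info if split_info else 0.0

def classification_error(counts):
    """Error(t) = 1 - max_i P(i|t)."""
    n = sum(counts)
    return 1.0 - max(counts) / n if n else 0.0

print(round(gain_ratio([7, 3], [[3, 0], [4, 3]]), 3))   # ~0.217 (gain ~0.192 / SplitINFO ~0.881)
print(round(classification_error([3, 7]), 3))           # 0.3
```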
Example: splitting a node with 10 records into N1 (C1 = 3, C2 = 0) and N2 (C1 = 4, C2 = 3):
Gini(N1) = 1 − (3/3)² − (0/3)² = 0
Gini(N2) = 1 − (4/7)² − (3/7)² = 0.489
Gini(Children) = 3/10 × 0 + 7/10 × 0.489 = 0.342
Example: C4.5
• Simple depth-first construction
• Uses information gain
• Sorts continuous attributes at each node
• Needs the entire data set to fit in memory
• Unsuitable for large data sets: needs out-of-core sorting
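scikit-learn does not implement C4.5 itself (no gain ratio, no rule-based post-pruning), but its in-memory, entropy-criterion decision tree is a convenient stand-in for experimenting with the ideas above; the toy data below is an assumption, loosely following the deck's running example.

```python
# Not C4.5, but an in-memory, information-gain-driven tree for experimentation.
from sklearn.tree import DecisionTreeClassifier, export_text

X = [[1, 125], [0, 100], [0, 70], [1, 120], [0, 95],
     [0, 60], [1, 220], [0, 85], [0, 75], [0, 90]]   # [refund, taxable income in K]
y = ["No", "No", "No", "No", "Yes", "No", "No", "Yes", "No", "Yes"]

tree = DecisionTreeClassifier(criterion="entropy")
tree.fit(X, y)
print(export_text(tree, feature_names=["refund", "income"]))
```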
Practical Issues of Classification
• Underfitting and Overfitting
• Missing Values
• Costs of Classification
Circular points: 0.5 ≤ sqrt(x1² + x2²) ≤ 1
Triangular points: sqrt(x1² + x2²) > 1 or sqrt(x1² + x2²) < 0.5
Underfitting: when model is too simple, both training and test errors are large
Lack of data points in the lower half of the diagram makes it difficult to correctly predict the class labels in that region.
- An insufficient number of training records in the region causes the decision tree to predict the test examples using other training records that are irrelevant to the classification task.
Notes on Overfitting
• Overfitting results in decision trees that are more
complex than necessary
Occam’s Razor
• Given two models of similar generalization errors,
one should prefer the simpler model over the
more complex model
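A quick way to see this numerically (a sketch, assuming scikit-learn and a synthetic data set of my own choosing): train trees of increasing depth and compare training vs. test error; past some depth, training error keeps falling while test error typically rises.

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic two-class data with label noise so that a deep tree can overfit.
X, y = make_classification(n_samples=1000, n_features=10, flip_y=0.2, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.5, random_state=0)

for depth in (1, 2, 4, 8, 16, None):          # None = grow the tree fully
    t = DecisionTreeClassifier(max_depth=depth, random_state=0).fit(X_tr, y_tr)
    train_err = round(1 - t.score(X_tr, y_tr), 3)
    test_err = round(1 - t.score(X_te, y_te), 3)
    print(f"depth={depth}: train error {train_err}, test error {test_err}")
```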
Example of Post-Pruning
[Figure: a trained subtree with internal test nodes A1–A4; training error before splitting = 10/30. The training and pessimistic error estimates before and after splitting determine whether the subtree is pruned.]
Examples of Post-Pruning
• Case 1: two nodes with class counts (C0: 11, C1: 3) and (C0: 2, C1: 4).
• Optimistic error? Don’t prune for both cases.
• Pessimistic error?
Computing the impurity measure when an attribute value is missing (the record with the missing value is left out of the child counts, and the gain is scaled by the fraction of records with a known value):
Entropy(Children) = 0.3 × (0) + 0.6 × (0.9183) = 0.551
Gain = 0.9 × (0.8813 − 0.551) = 0.297
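A short arithmetic check of these two lines in plain Python, assuming the child class counts behind the 0 and 0.9183 entropies are (3, 0) and (2, 4) and the parent counts are (3 Yes, 7 No).

```python
from math import log2

def entropy(counts):
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)

parent = entropy([3, 7])                                    # -> 0.8813
children = 0.3 * entropy([3, 0]) + 0.6 * entropy([2, 4])    # -> 0.551
gain = 0.9 * (parent - children)                            # -> 0.2973
print(round(parent, 4), round(children, 4), round(gain, 4))
```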
Distribute Instances
[Figure: training records with known Refund values (e.g., Tid 1: Yes, Single, 125K, No; Tid 2: No, Married, 100K, No; Tid 3: No, Single, 70K, No) are sent down the matching branch; the record with a missing Refund value (Tid 10: ?, Single, 90K, Yes) is distributed to both the Refund = Yes and Refund = No children with fractional weights.]
Classify Instances
[Figure: a new record with a missing attribute value is classified by following the relevant branches and weighting the class distributions at the leaves; the class-count table gives, for Class = No: Married 3, Single 1, Divorced 0, Total 4.]