Classification: Lecture Notes For Chapters 4 & 5
Training Set

Tid  Attrib1  Attrib2  Attrib3  Class
1    Yes      Large    125K     No
2    No       Medium   100K     No
3    No       Small    70K      No
4    Yes      Medium   120K     No
5    No       Large    95K      Yes
6    No       Medium   60K      No
7    Yes      Large    220K     No
8    No       Small    85K      Yes
9    No       Medium   75K      No
10   No       Small    90K      Yes

Induction: the learning algorithm learns a model from the training set.
Apply Model: the learned model is applied to the test set, whose class labels are unknown.

Test Set

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
Examples of Classification
Predicting tumor cells as benign or malignant
Another Example

Training Data

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Model: Decision Tree

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
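The remark that several trees can fit the same data can be checked directly. The sketch below hand-codes two trees as plain Python functions: one with MarSt at the root, and one with Refund at the root (the ordering used when the model is applied to test data later in these notes). The function names are illustrative, and sending an income of exactly 80K down the "> 80K" branch is an assumption, since the slides only label the branches "< 80K" and "> 80K".

```python
# Training data: (Refund, Marital Status, Taxable Income in K, Cheat)
TRAINING = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def income_leaf(income):
    # Assumption: exactly 80K is treated as belonging to the "> 80K" branch.
    return "No" if income < 80 else "Yes"


def tree_marst_first(refund, marital, income):
    """Tree with MarSt at the root, then Refund, then TaxInc."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return income_leaf(income)


def tree_refund_first(refund, marital, income):
    """Tree with Refund at the root, then MarSt, then TaxInc."""
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return income_leaf(income)


def training_errors(tree):
    """Count training records whose predicted class differs from Cheat."""
    return sum(
        1
        for refund, marital, income, cheat in TRAINING
        if tree(refund, marital, income) != cheat
    )


print(training_errors(tree_marst_first), training_errors(tree_refund_first))
```

Both trees make zero errors on the ten training records, so the data alone does not single out one tree.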
Decision Tree Classification
[Figure: the Training Set (Tid, Attrib1, Attrib2, Attrib3, Class) is passed to the Tree Induction algorithm, which produces the Model, a decision tree. Apply Model then uses that tree on the Test Set (Tid 11: No, Small, 55K, ?; Tid 15: No, Large, 67K, ?).]
Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Model

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO

Start from the root of the tree.
  Refund = No: follow the No branch to MarSt.
  MarSt = Married: follow the Married branch to the leaf NO.
Assign Cheat to “No”.
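The root-to-leaf walk can be sketched in a few lines of Python. The nested-tuple representation, the `branch_value` helper and the field names are illustrative choices, not part of the slides, and routing an income of exactly 80K to the upper branch is an assumption (it does not affect this test record, which reaches a leaf before TaxInc is tested).

```python
# Internal nodes are (attribute, {branch value: subtree}); leaves are class labels.
INCOME_TEST = ("TaxInc", {"< 80K": "No", ">= 80K": "Yes"})
TREE = ("Refund", {
    "Yes": "No",
    "No": ("MarSt", {
        "Married": "No",
        "Single": INCOME_TEST,
        "Divorced": INCOME_TEST,
    }),
})


def branch_value(attribute, record):
    """Map a record's attribute value to the name of the branch it follows."""
    if attribute == "TaxInc":
        # Assumption: exactly 80K follows the upper branch.
        return "< 80K" if record["TaxInc"] < 80 else ">= 80K"
    return record[attribute]


def classify(tree, record):
    """Walk from the root to a leaf; return the label and the tests taken."""
    path = []
    node = tree
    while not isinstance(node, str):
        attribute, branches = node
        value = branch_value(attribute, record)
        path.append(f"{attribute} = {value}")
        node = branches[value]
    return node, path


label, path = classify(TREE, {"Refund": "No", "MarSt": "Married", "TaxInc": 80})
print(" -> ".join(path), "=>", label)  # Refund = No -> MarSt = Married => No
```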
Decision Tree Induction
Many Algorithms:
Hunt’s Algorithm (one of the earliest)
CART
ID3, C4.5
SLIQ, SPRINT
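General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t
General Procedure:
If all records in Dt belong to the same class yt, then t is a leaf node labeled yt
If Dt is an empty set, then t is a leaf node labeled with the default class yd
If Dt contains records that belong to more than one class, use an attribute test to split Dt into smaller subsets, and recursively apply the procedure to each subset
[Figure: Hunt’s algorithm applied to the Refund / Marital Status / Taxable Income / Cheat training records: successive trees with a Marital Status test (Single, Divorced vs. Married) and leaves labeled Don’t Cheat and Cheat.]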
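A minimal recursive sketch of Hunt’s algorithm follows, assuming its usual formulation: a node becomes a leaf when its records are pure or no attributes remain, gets the default class when its record set is empty, and otherwise is split and processed recursively. Several choices here are assumptions rather than part of the slides: the split attribute is picked by lowest weighted Gini, every split is multi-way on a categorical attribute, and Taxable Income is pre-binarized into a hypothetical `Income>=80K` flag so that all attributes are categorical.

```python
from collections import Counter


def gini(labels):
    """Gini impurity of a list of class labels."""
    n = len(labels)
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def hunt(records, attributes, target="Cheat", default="No"):
    """Return a leaf label or an (attribute, children, majority) node."""
    if not records:                      # D_t is empty: use the default class
        return default
    labels = [r[target] for r in records]
    majority = Counter(labels).most_common(1)[0][0]
    if len(set(labels)) == 1 or not attributes:
        return majority                  # pure node, or nothing left to test

    def split_impurity(attr):
        groups = {}
        for r in records:
            groups.setdefault(r[attr], []).append(r[target])
        return sum(len(g) / len(records) * gini(g) for g in groups.values())

    best = min(attributes, key=split_impurity)   # greedy, local choice
    rest = [a for a in attributes if a != best]
    children = {}
    for value in {r[best] for r in records}:
        subset = [r for r in records if r[best] == value]
        children[value] = hunt(subset, rest, target, majority)
    return (best, children, majority)


def classify(tree, record):
    """Follow branches until a leaf label is reached."""
    while isinstance(tree, tuple):
        attr, children, majority = tree
        tree = children.get(record[attr], majority)
    return tree


ROWS = [
    ("Yes", "Single", 125, "No"), ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"), ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"), ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"), ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"), ("No", "Single", 90, "Yes"),
]
RECORDS = [
    {"Refund": r, "MarSt": m, "Income>=80K": inc >= 80, "Cheat": c}
    for r, m, inc, c in ROWS
]
TREE = hunt(RECORDS, ["Refund", "MarSt", "Income>=80K"])
print(TREE[0])  # attribute chosen at the root
```

With these assumptions the induced tree reproduces the class of every training record; a different selection criterion or a different handling of the income attribute could yield a different tree.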
Tree Induction
Greedy strategy
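Split the records based on an attribute test that optimizes a certain criterion
Greedy algorithms work in phases
At each phase, a decision is made that improves the current state and appears to be the best, without regard for future consequences
Risk: local maxima! The locally best choice at each phase need not lead to the globally best result
E.g. finding the shortest path: always taking the cheapest next step can miss the shortest overall route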
Issues
Determine how to split the records
How to specify the attribute test condition?
How to determine the best split (E.g. which attribute)?
Determine when to stop splitting
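How to Specify Test Condition?
Depends on attribute types
Categorical
Continuous
A binary split on a categorical attribute groups its values into two subsets, e.g. Size with values Small, Medium, Large
What about this split? {Small, Large} vs. {Medium}
If Size is ordinal (Small < Medium < Large), this grouping does not preserve the order of the values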
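To make the grouping question concrete, the sketch below enumerates every binary partition of a categorical attribute's values and flags which ones keep an ordinal ordering intact. The function names are illustrative, and the ordering Small < Medium < Large is an assumption about how Size should be read.

```python
from itertools import combinations

SIZES = ["Small", "Medium", "Large"]  # assumed order: Small < Medium < Large


def binary_partitions(values):
    """Yield each way to split `values` into two non-empty groups, once."""
    values = list(values)
    first, rest = values[0], values[1:]
    # Keeping the first value in the left group avoids mirrored duplicates.
    for size in range(len(rest) + 1):
        for extra in combinations(rest, size):
            left = [first, *extra]
            right = [v for v in values if v not in left]
            if right:
                yield left, right


def preserves_order(group, ordered_values):
    """True if `group` is a contiguous run at one end of the ordering."""
    positions = sorted(ordered_values.index(v) for v in group)
    contiguous = positions == list(
        range(positions[0], positions[0] + len(positions))
    )
    at_an_end = positions[0] == 0 or positions[-1] == len(ordered_values) - 1
    return contiguous and at_an_end


for left, right in binary_partitions(SIZES):
    print(left, right, preserves_order(left, SIZES))
```

For k distinct values there are 2^(k-1) - 1 binary partitions, so the enumeration grows quickly; for Size only {Small, Large} vs. {Medium} fails the order check.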
Splitting on Continuous Attributes
Different ways of handling
Binary Decision: (A < v) or (A ≥ v)
consider all possible splits and find the best cut
can be more computationally intensive
Example test condition: Taxable Income > 80K? (Yes / No)
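A short sketch of the binary decision on a continuous attribute: list the candidate cut values v, then partition the records by (A < v) versus (A ≥ v). Taking the midpoints between consecutive distinct sorted values as candidates is an assumption (a common choice, not stated in the slides); the incomes are the Taxable Income column of the training data, in thousands.

```python
INCOMES = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]  # Taxable Income (K)


def candidate_cuts(values):
    """Midpoints between consecutive distinct sorted values (assumed rule)."""
    distinct = sorted(set(values))
    return [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]


def split_binary(values, v):
    """Partition values by the test (A < v) versus (A >= v)."""
    below = [a for a in values if a < v]
    at_or_above = [a for a in values if a >= v]
    return below, at_or_above


cuts = candidate_cuts(INCOMES)
print(cuts)
print(split_binary(INCOMES, 80))
```

Each candidate cut still has to be scored with an impurity measure, which is where the extra computational cost comes from.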
How to Determine the Best Split
Before Splitting: 10 records of class 0, 10 records of class 1
Measures of node impurity:
GINI index
Entropy
Misclassification error
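How to Find the Best Split
Before Splitting: class counts C0 = X, C1 = Y, with impurity M0

Split on A?
  Yes: C0 = XA, C1 = YA, impurity M1
  No: C0 = X!A, C1 = Y!A, impurity M2
  M12: impurity of the two children of A combined

Split on B?
  Yes: C0 = XB, C1 = YB, impurity M3
  No: C0 = X!B, C1 = Y!B, impurity M4
  M34: impurity of the two children of B combined

Gain = M0 – M12 vs. M0 – M34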
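The comparison Gain = M0 – M12 vs. M0 – M34 can be sketched with entropy and misclassification error as the impurity measure. The parent counts (10, 10) come from the slide; the child counts for splits A and B are hypothetical numbers chosen only for illustration, and combining the children by a size-weighted average is the usual convention, assumed here.

```python
from math import log2


def entropy(counts):
    """Entropy of a node given its per-class record counts."""
    n = sum(counts)
    return -sum((c / n) * log2(c / n) for c in counts if c > 0)


def misclassification_error(counts):
    """1 minus the fraction of records in the majority class."""
    return 1 - max(counts) / sum(counts)


def combined_impurity(children, measure):
    """Size-weighted average impurity of the child nodes (M12 or M34)."""
    n = sum(sum(c) for c in children)
    return sum(sum(c) / n * measure(c) for c in children)


def gain(parent, children, measure):
    """Impurity before the split minus combined impurity after it."""
    return measure(parent) - combined_impurity(children, measure)


PARENT = (10, 10)             # 10 records of class 0, 10 of class 1
SPLIT_A = [(7, 3), (3, 7)]    # hypothetical child counts for test A?
SPLIT_B = [(9, 1), (1, 9)]    # hypothetical child counts for test B?

for measure in (entropy, misclassification_error):
    print(measure.__name__,
          round(gain(PARENT, SPLIT_A, measure), 3),
          round(gain(PARENT, SPLIT_B, measure), 3))
```

With these made-up counts, split B yields the larger gain under both measures, so a greedy induction step would choose B.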
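Measure of Impurity: GINI
Gini Index for a given node t:

$$\mathrm{GINI}(t) = 1 - \sum_{j} [p(j \mid t)]^2$$

where p(j | t) is the relative frequency of class j at node t.

When a node is split into k partitions (children), the quality of the split is

$$\mathrm{GINI}_{split} = \sum_{i=1}^{k} \frac{n_i}{n} \, \mathrm{GINI}(i)$$

where n_i is the number of records at child i and n is the number of records at the parent node.

For a continuous attribute, use a binary decision based on one splitting value. Candidate split positions on Taxable Income, with records sorted by income (number of Cheat = No records below and above each position, and the Gini of the split):

Split position  No (below)  No (above)  Gini
1               0           7           0.420
2               1           6           0.400
3               2           5           0.375
4               3           4           0.343
5               3           4           0.417
6               3           4           0.400
7               3           4           0.300
8               4           3           0.343
9               5           2           0.375
10              6           1           0.400
11              7           0           0.420

The lowest value, 0.300, marks the best split position.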
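The Gini row can be recomputed from the training data. The sketch below sorts the records by Taxable Income, places a split before the first record, between each pair of neighbours, and after the last record, and evaluates the weighted Gini of the two sides. The function names are illustrative, and treating an empty side as having impurity 0 is an assumption made so the two outermost positions are defined.

```python
from collections import Counter

# (Taxable Income in K, Cheat) for the ten training records
RECORDS = [
    (125, "No"), (100, "No"), (70, "No"), (120, "No"), (95, "Yes"),
    (60, "No"), (220, "No"), (85, "Yes"), (75, "No"), (90, "Yes"),
]


def gini(labels):
    """GINI(t) = 1 - sum_j p(j|t)^2; an empty node counts as 0 (assumption)."""
    n = len(labels)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())


def gini_split(partitions):
    """Weighted sum of child Gini values, weights n_i / n."""
    n = sum(len(p) for p in partitions)
    return sum(len(p) / n * gini(p) for p in partitions)


def sweep(records):
    """Gini of the binary split at every position in the sorted records."""
    ordered = sorted(records)
    scores = []
    for k in range(len(ordered) + 1):
        below = [label for _, label in ordered[:k]]
        above = [label for _, label in ordered[k:]]
        scores.append(gini_split([below, above]))
    return scores


scores = sweep(RECORDS)
print([round(s, 3) for s in scores])
best = min(range(len(scores)), key=scores.__getitem__)
print("best position:", best + 1)
```

The minimum falls at the seventh position, between the records with incomes 95K and 100K.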
Another Example
#   Outlook   Temperature  Humidity  Windy  Play (class)
1   sunny     100          high      no     N
2   sunny     110          high      yes    N
3   overcast  110          high      no     Y
4   rainy     75           high      no     Y
5   rainy     40           normal    no     Y
6   rainy     40           normal    yes    N
7   overcast  45           normal    yes    Y
8   sunny     70           high      no     N
9   sunny     40           normal    no     Y
10  rainy     70           normal    no     Y
11  sunny     70           normal    yes    Y
12  overcast  70           high      yes    Y
13  overcast  95           normal    no     Y
14  rainy     65           high      yes    N
[Figure: decision tree whose root tests "age?" with branches <=30, 31..40 and >40, and leaves labeled no / yes.]
Algorithm for Decision Tree Induction
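Information needed to classify a record in D after using attribute A to split D into v partitions D_1, ..., D_v:

$$Info_A(D) = \sum_{j=1}^{v} \frac{|D_j|}{|D|} \times Info(D_j)$$

Information gained by branching on attribute A:

$$Gain(A) = Info(D) - Info_A(D)$$

where Info(D) is the entropy of the class distribution in D.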
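As a worked illustration, the sketch below computes the information gained by branching on each categorical attribute of the 14-record weather table from the earlier example. It assumes the usual entropy-based definition, Info(D) = -sum p_i log2 p_i, with the gain equal to Info(D) minus the size-weighted Info of the partitions; the function and variable names are illustrative.

```python
from collections import Counter
from math import log2

# (Outlook, Temperature, Humidity, Windy, Play) from the weather table
WEATHER = [
    ("sunny", 100, "high", "no", "N"), ("sunny", 110, "high", "yes", "N"),
    ("overcast", 110, "high", "no", "Y"), ("rainy", 75, "high", "no", "Y"),
    ("rainy", 40, "normal", "no", "Y"), ("rainy", 40, "normal", "yes", "N"),
    ("overcast", 45, "normal", "yes", "Y"), ("sunny", 70, "high", "no", "N"),
    ("sunny", 40, "normal", "no", "Y"), ("rainy", 70, "normal", "no", "Y"),
    ("sunny", 70, "normal", "yes", "Y"), ("overcast", 70, "high", "yes", "Y"),
    ("overcast", 95, "normal", "no", "Y"), ("rainy", 65, "high", "yes", "N"),
]
COLUMNS = {"Outlook": 0, "Humidity": 2, "Windy": 3}  # categorical attributes


def info(labels):
    """Entropy of a list of class labels, in bits."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())


def info_after_split(rows, column):
    """Size-weighted entropy of the partitions induced by one attribute."""
    groups = {}
    for row in rows:
        groups.setdefault(row[column], []).append(row[-1])
    return sum(len(g) / len(rows) * info(g) for g in groups.values())


def gain(rows, column):
    """Information gained by branching on the attribute in `column`."""
    return info([row[-1] for row in rows]) - info_after_split(rows, column)


gains = {name: gain(WEATHER, col) for name, col in COLUMNS.items()}
for name, value in gains.items():
    print(f"{name}: {value:.3f}")
```

On this table Outlook has the largest gain of the three categorical attributes, so a gain-based induction step would test it first; Temperature is left out because it is continuous and would first need a split value.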