Decision Tree
https://sites.google.com/view/nsaini1
Supervised Learning
▪ An algorithm learns from example data and the associated labels, as if under the supervision of a teacher.
▪ (In unsupervised learning, on the other hand, the system attempts to find patterns in the data without such supervision.)
▪ There are input variables (x) and output variables (Y), and an algorithm is used to learn the mapping from the input to the output.
Training Set (used to learn the model)

Tid  Attrib1  Attrib2  Attrib3  Class
3    No       Small    70K      No
6    No       Medium   60K      No

Test Set (the model is applied to predict Class)

Tid  Attrib1  Attrib2  Attrib3  Class
11   No       Small    55K      ?
15   No       Large    67K      ?
Examples of Classification Task
Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Decision tree (splitting attributes: MarSt, Refund, TaxInc):

MarSt
  Married: NO
  Single, Divorced: Refund
    Yes: NO
    No: TaxInc
      < 80K: NO
      > 80K: YES

There could be more than one tree that fits the same data!
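To make the remark that more than one tree can fit the same data concrete, the sketch below encodes the ten training records and two trees as plain Python functions and checks each tree against every record. One tree tests marital status first; the other tests Refund first. The function names, the tuple layout, and the choice to send an income of exactly 80K to the YES leaf are assumptions made for illustration; the slides only label the branches "< 80K" and "> 80K".

```python
# Training records from the slides: (refund, marital_status, income_in_K, cheat)
RECORDS = [
    ("Yes", "Single", 125, "No"),
    ("No", "Married", 100, "No"),
    ("No", "Single", 70, "No"),
    ("Yes", "Married", 120, "No"),
    ("No", "Divorced", 95, "Yes"),
    ("No", "Married", 60, "No"),
    ("Yes", "Divorced", 220, "No"),
    ("No", "Single", 85, "Yes"),
    ("No", "Married", 75, "No"),
    ("No", "Single", 90, "Yes"),
]


def income_leaf(income_k):
    # Assumption: an income of exactly 80K is sent to the YES leaf.
    return "No" if income_k < 80 else "Yes"


def tree_marital_first(refund, marital, income_k):
    """Tree whose root tests marital status."""
    if marital == "Married":
        return "No"
    if refund == "Yes":
        return "No"
    return income_leaf(income_k)


def tree_refund_first(refund, marital, income_k):
    """Tree whose root tests Refund."""
    if refund == "Yes":
        return "No"
    if marital == "Married":
        return "No"
    return income_leaf(income_k)


def fits(tree, records):
    """True if the tree reproduces the class label of every record."""
    return all(tree(r, m, i) == cheat for r, m, i, cheat in records)


print(fits(tree_marital_first, RECORDS), fits(tree_refund_first, RECORDS))
```

Both checks succeed, so the two differently shaped trees are equally consistent with this training set.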
Decision Tree Classification Task
[Figure: the training set (Tid, Attrib1, Attrib2, Attrib3, Class) is used to learn a decision tree, which is then applied to the test set to predict the unknown Class values.]
Apply Model to Test Data

Test Data

Refund  Marital Status  Taxable Income  Cheat
No      Married         80K             ?

Refund
  Yes: NO
  No: MarSt
    Single, Divorced: TaxInc
      < 80K: NO
      > 80K: YES
    Married: NO

▪ Start from the root of the tree (Refund).
▪ Refund = No, so follow the No branch to MarSt.
▪ Marital Status = Married, so follow the Married branch to the leaf NO.
▪ Assign Cheat to "No".
Decision Tree Induction
Many Algorithms:
– Hunt’s Algorithm (one of the earliest)
– CART (Classification and Regression Tree)
– ID3, C4.5
– SLIQ (fast, scalable algorithm for large applications)
◆Can handle both numeric and categorical attributes
– SPRINT (scalable parallel classifier for data mining)
General Structure of Hunt’s Algorithm
Let Dt be the set of training records that reach a node t.
[Figure: the Refund / Marital Status / Taxable Income / Cheat training table next to a tree grown step by step, with branches "Single, Divorced" and "Married" and leaves "Don't Cheat" and "Cheat".]
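The recursive structure usually attributed to Hunt's algorithm can be sketched as follows. This is an illustration under stated assumptions, not the slide's own pseudocode: a node becomes a leaf when all records in Dt share one class (or when no attributes remain, in which case the majority class is used); otherwise Dt is split on an attribute and the procedure recurses on each subset. The function name `hunt`, the dict-based record format, and the naive choice of always splitting on the first remaining attribute are assumptions; a practical implementation would pick the attribute with an impurity measure. The four records are taken from the training table with Taxable Income left out.

```python
from collections import Counter


def hunt(records, attributes, target="Cheat"):
    """Grow a tree from `records` (list of dicts) using categorical attributes.

    Returns a class label (leaf) or a tuple (attribute, {value: subtree}).
    """
    labels = [r[target] for r in records]
    # Case 1: every record in Dt has the same class -> leaf with that class.
    if len(set(labels)) == 1:
        return labels[0]
    # Case 2: nothing left to test -> leaf with the majority class.
    if not attributes:
        return Counter(labels).most_common(1)[0][0]
    # Case 3: mixed classes -> split on an attribute and recurse.
    # Assumption: take the first attribute instead of the best one.
    attribute, remaining = attributes[0], attributes[1:]
    children = {}
    for value in sorted({r[attribute] for r in records}):
        subset = [r for r in records if r[attribute] == value]
        children[value] = hunt(subset, remaining, target)
    return (attribute, children)


records = [
    {"Refund": "Yes", "MarSt": "Single", "Cheat": "No"},
    {"Refund": "No", "MarSt": "Married", "Cheat": "No"},
    {"Refund": "No", "MarSt": "Divorced", "Cheat": "Yes"},
    {"Refund": "No", "MarSt": "Single", "Cheat": "Yes"},
]
tree = hunt(records, ["Refund", "MarSt"])
print(tree)
```

On this small sample the Refund = Yes branch is already pure and becomes a leaf, while the Refund = No branch is split again on marital status.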
Tree Induction
Greedy strategy
– Split the records based on an attribute test that optimizes a certain criterion
Issues
– Determine how to split the records
◆How to specify the attribute test condition?
◆How to determine the best split?
Size: {Small, Large} vs. {Medium}
What about this split?
Splitting Based on Continuous Attributes
▪ Binary split: Taxable Income > 80K? (Yes / No)
▪ Multi-way split: Taxable Income? (ranges from < 10K up to > 80K)
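The two ways of splitting a continuous attribute can be sketched as below, using the Taxable Income values from the training table. The helper names and the intermediate cut points 25K and 50K are assumptions for illustration; only the outer ranges "< 10K" and "> 80K" and the binary test "> 80K?" appear on the slide.

```python
from bisect import bisect_right

# Taxable Income (in K) of the ten training records, in Tid order.
incomes = [125, 100, 70, 120, 95, 60, 220, 85, 75, 90]


def binary_split(values, threshold):
    """Partition values by the test `value > threshold` into (yes, no)."""
    yes = [v for v in values if v > threshold]
    no = [v for v in values if v <= threshold]
    return yes, no


def multiway_split(values, edges):
    """Assign each value to a range; `edges` must be sorted ascending.

    With edges [e1, ..., ek] the ranges are (-inf, e1), [e1, e2), ..., [ek, inf).
    """
    bins = [[] for _ in range(len(edges) + 1)]
    for v in values:
        bins[bisect_right(edges, v)].append(v)
    return bins


yes, no = binary_split(incomes, 80)
print("binary:", yes, no)

# Assumption: 25 and 50 are made-up intermediate cut points.
print("multi-way:", multiway_split(incomes, [10, 25, 50, 80]))
```

A binary split yields two children per test, while the multi-way version discretizes the attribute into as many children as there are ranges.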
Greedy approach:
– Nodes with homogeneous class distribution
are preferred
Need a measure of node impurity:
C0: 5, C1: 5    Non-homogeneous, high degree of impurity
C0: 9, C1: 1    Homogeneous, low degree of impurity
Measures of Node Impurity
Gini Index
Entropy
Misclassification error
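As a quick illustration, the sketch below evaluates all three measures on the class counts (5, 5) and (9, 1). The formulas used are the standard textbook definitions (Gini: 1 minus the sum of squared class proportions; entropy: minus the sum of p log2 p; misclassification error: 1 minus the largest class proportion); the function names are assumptions.

```python
from math import log2


def proportions(counts):
    """Turn class counts into class proportions."""
    total = sum(counts)
    return [c / total for c in counts]


def gini(counts):
    return 1.0 - sum(p * p for p in proportions(counts))


def entropy(counts):
    # Classes with p == 0 are skipped (0 * log 0 is taken as 0).
    return -sum(p * log2(p) for p in proportions(counts) if p > 0)


def misclassification_error(counts):
    return 1.0 - max(proportions(counts))


for counts in [(5, 5), (9, 1)]:
    print(
        counts,
        round(gini(counts), 3),
        round(entropy(counts), 3),
        round(misclassification_error(counts), 3),
    )
```

All three measures are largest for the evenly mixed node and drop for the nearly pure one, which is the behavior a split criterion needs.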
How to Find the Best Split
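Before splitting: class counts C0 = N00, C1 = N01, with impurity M0.

Split on A?   Yes: impurity M1   No: impurity M2   children combined: M12
Split on B?   Yes: impurity M3   No: impurity M4   children combined: M34

Gain = M0 – M12 vs. M0 – M34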
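The comparison of M0 – M12 against M0 – M34 can be tried on the ten training records. The sketch assumes that the impurity measure is the Gini index, that the combined impurity of the children is the average of their impurities weighted by the number of records in each, and that the two candidate tests are "Refund = Yes" (as A?) and "Marital Status = Married" (as B?). The slides do not fix any of these choices.

```python
from collections import Counter

# (refund, marital_status, cheat) for the ten training records on the slides.
RECORDS = [
    ("Yes", "Single", "No"), ("No", "Married", "No"),
    ("No", "Single", "No"), ("Yes", "Married", "No"),
    ("No", "Divorced", "Yes"), ("No", "Married", "No"),
    ("Yes", "Divorced", "No"), ("No", "Single", "Yes"),
    ("No", "Married", "No"), ("No", "Single", "Yes"),
]


def gini(labels):
    """Gini index of a list of class labels."""
    total = len(labels)
    return 1.0 - sum((n / total) ** 2 for n in Counter(labels).values())


def weighted_impurity(groups):
    """Children's Gini, weighted by each child's share of the records."""
    total = sum(len(g) for g in groups)
    return sum(len(g) / total * gini(g) for g in groups if g)


def gain(records, test):
    """Parent impurity minus combined child impurity for a yes/no test."""
    parent = [r[-1] for r in records]
    yes = [r[-1] for r in records if test(r)]
    no = [r[-1] for r in records if not test(r)]
    return gini(parent) - weighted_impurity([yes, no])


gain_a = gain(RECORDS, lambda r: r[0] == "Yes")      # A?: Refund = Yes
gain_b = gain(RECORDS, lambda r: r[1] == "Married")  # B?: MarSt = Married
print(round(gain_a, 3), round(gain_b, 3))
```

Under these assumptions the marital-status test has the larger gain (about 0.12 against about 0.077), so a greedy learner would prefer it at the root.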
Measure of Impurity: GINI
\[ \mathrm{GINI}(t) = 1 - \sum_{j} [\,p(j \mid t)\,]^2 \]
where p(j | t) is the relative frequency of class j at node t.
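As a worked example, the Gini index can be evaluated by hand for two nodes with class counts (C0: 5, C1: 5) and (C0: 9, C1: 1). The notation n_c for the number of classes is an assumption introduced here.

```latex
\[
\mathrm{GINI}(t) = 1 - \sum_{j} [\,p(j \mid t)\,]^2
\]
\[
\text{C0: 5, C1: 5:}\quad 1 - (0.5^2 + 0.5^2) = 1 - 0.50 = 0.50
\]
\[
\text{C0: 9, C1: 1:}\quad 1 - (0.9^2 + 0.1^2) = 1 - 0.82 = 0.18
\]
\[
0 \;\le\; \mathrm{GINI}(t) \;\le\; 1 - \frac{1}{n_c}
\]
```

The lower bound is reached when all records at the node belong to one class, and the upper bound when the records are spread evenly over the n_c classes, which for two classes gives the value 0.5 of the first node.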
Advantages:
– Inexpensive to construct
– Extremely fast at classifying unknown records
– Easy to interpret for small-sized trees
– Accuracy is comparable to other classification
techniques for many simple data sets
Metrics for Performance Evaluation
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a           b
CLASS     Class=No      c           d

a: TP (true positive)
b: FN (false negative)
c: FP (false positive)
d: TN (true negative)
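The four cells a, b, c and d can be counted directly from a list of actual labels and a list of predicted labels. The sketch below is one way to do that; the function name and the six example labels are made up purely for illustration.

```python
def confusion_counts(actual, predicted, positive="Yes"):
    """Return (a, b, c, d) = (TP, FN, FP, TN) for two equal-length label lists."""
    a = b = c = d = 0
    for y_true, y_pred in zip(actual, predicted):
        if y_true == positive:
            if y_pred == positive:
                a += 1  # actual Yes, predicted Yes: true positive
            else:
                b += 1  # actual Yes, predicted No: false negative
        elif y_pred == positive:
            c += 1      # actual No, predicted Yes: false positive
        else:
            d += 1      # actual No, predicted No: true negative
    return a, b, c, d


# Made-up labels, purely for illustration.
actual = ["Yes", "Yes", "No", "No", "No", "Yes"]
predicted = ["Yes", "No", "No", "Yes", "No", "Yes"]
print(confusion_counts(actual, predicted))
```

Rows of the matrix follow the actual class and columns the predicted class, so b and c are the two kinds of mistakes.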
Metrics for Performance Evaluation…
                        PREDICTED CLASS
                        Class=Yes   Class=No
ACTUAL    Class=Yes     a (TP)      b (FN)
CLASS     Class=No      c (FP)      d (TN)

\[ \text{Accuracy} = \frac{a + d}{a + b + c + d} = \frac{TP + TN}{TP + TN + FP + FN} \]
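The accuracy formula translates directly into code. In this minimal sketch the function name and the example counts are hypothetical; the guard against an empty matrix is an added assumption so the division is always defined.

```python
def accuracy(tp, fn, fp, tn):
    """Fraction of records classified correctly: (TP + TN) / all records."""
    total = tp + fn + fp + tn
    if total == 0:
        raise ValueError("confusion matrix is empty")
    return (tp + tn) / total


# Hypothetical counts: a = 2, b = 1, c = 1, d = 2.
print(accuracy(tp=2, fn=1, fp=1, tn=2))
```

With these counts four of the six records are classified correctly, giving an accuracy of 4/6.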