Datamining-Lect5 Decision Tree
Decision Trees and Decision Rules
Data Classification and Prediction
Data classification
classification
prediction
Methods of classification
decision tree induction
Forward and Back-propagation (Neural Network)
Bayesian classification
Association rule mining
Classification: Definition
Given a collection of records (training set)
Each record contains a set of attributes; one of the attributes is the class.
Find a model for class attribute as a function of
the values of other attributes.
Goal: previously unseen records should be
assigned a class as accurately as possible.
A test set is used to determine the accuracy of the model.
Usually, the given data set is divided into training and test sets,
with the training set used to build the model and the test set used to
validate it (a minimal sketch of this workflow follows).
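As an illustration of this train/test workflow only (scikit-learn and the Iris data are my own choice, not part of the lecture), a minimal sketch:

```python
# Hedged sketch: split data into training and test sets, build a model on the
# training set, and validate it on the test set.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier
from sklearn.metrics import accuracy_score

X, y = load_iris(return_X_y=True)                 # records + class attribute

# Hold out 30% of the records as a test set; learn only from the rest.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)

model = DecisionTreeClassifier().fit(X_train, y_train)     # build the model
print("test accuracy:", accuracy_score(y_test, model.predict(X_test)))  # validate it
```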
Illustrating Classification Task
[Figure: a learning algorithm is applied to the Training Set (records with attributes Attrib1-Attrib3 and a known Class label) to learn a model; the model is then applied to the Test Set, whose Class values are unknown ("?"), to predict them.]
Examples of Classification Task
Predicting tumor cells as benign or malignant
Example: the 10-record training data used throughout this lecture, and one decision tree that fits it.

Tid  Refund  Marital Status  Taxable Income  Cheat
1    Yes     Single          125K            No
2    No      Married         100K            No
3    No      Single          70K             No
4    Yes     Married         120K            No
5    No      Divorced        95K             Yes
6    No      Married         60K             No
7    Yes     Divorced        220K            No
8    No      Single          85K             Yes
9    No      Married         75K             No
10   No      Single          90K             Yes

Tree: MarSt at the root; Married → NO; Single, Divorced → Refund; Refund = Yes → NO; Refund = No → TaxInc; TaxInc < 80K → NO; TaxInc > 80K → YES.
There could be more than one tree that fits the same data!
Decision Tree Classification Task
[Figure: a tree induction algorithm learns a decision tree model from the Training Set; the decision tree is then applied to the Test Set to assign a class to each unlabeled record.]
How They Work
Decision rules partition the sample of data.
A terminal node (leaf) indicates the class assignment.
The tree partitions the samples into mutually exclusive groups, one group for each terminal node.
All paths
start at the root node
end at a leaf
Each path represents a decision rule
the joining (AND) of all the tests along that path
separate paths that result in the same class are disjunctions (ORs)
All paths are mutually exclusive
for any one case, only one path will be followed
By convention, false decisions go on the left branch and true decisions on the right branch.
A sketch after this list shows how the paths of such a tree read off as rules.
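To make the path-to-rule correspondence concrete, here is a minimal Python sketch (the nested-dict encoding and names are mine, not from the lecture) that walks every root-to-leaf path of the Refund-rooted tree used on the following slides and prints each path as an AND of tests:

```python
# A decision tree as nested tuples/dicts: ("attribute", {branch_label: subtree_or_leaf})
tree = ("Refund", {
    "Yes": "NO",
    "No": ("MarSt", {
        "Single, Divorced": ("TaxInc", {"< 80K": "NO", "> 80K": "YES"}),
        "Married": "NO",
    }),
})

def rules(node, path=()):
    """Yield one (tests, class) pair per root-to-leaf path."""
    if isinstance(node, str):                  # a leaf names the class
        yield path, node
        return
    attribute, branches = node
    for label, child in branches.items():
        yield from rules(child, path + (f"{attribute}: {label}",))

for tests, cls in rules(tree):
    print(" AND ".join(tests), "=>", cls)
```

Paths that end in the same class can then be OR-ed together into one rule per class.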
Apply Model to Test Data
Test data: Refund = No, Marital Status = Married, Taxable Income = 80K, Cheat = ?
The model is the decision tree with Refund at the root: Refund = Yes → NO; Refund = No → MarSt; MarSt = Single, Divorced → TaxInc (< 80K → NO, > 80K → YES); MarSt = Married → NO.
Start from the root of the tree and follow the branch matching each attribute value:
Refund = No, so take the "No" branch to the MarSt node.
MarSt = Married, so take the "Married" branch, which ends at the leaf NO.
Assign Cheat to "No".
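The same walk expressed as Python (a sketch assuming the Refund-rooted tree above; the function name is illustrative): each if corresponds to one internal node, and the test record falls through to the NO leaf.

```python
def classify(refund, marital_status, taxable_income_k):
    """Walk the decision tree from the root; each if is one internal node."""
    if refund == "Yes":                        # Refund node (root)
        return "NO"
    if marital_status == "Married":            # MarSt node
        return "NO"
    # Single or Divorced with Refund = No: test Taxable Income
    return "NO" if taxable_income_k < 80 else "YES"

# The test record from this slide: Refund = No, Married, Taxable Income = 80K
print(classify("No", "Married", 80))           # NO, so Cheat is assigned "No"
```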
Entropy
S is a sample of training examples
p is the number of positive examples in S
n is the number of negative examples in S
Calculate
I(p, n) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n))
For an attribute A that splits S into subsets with pi positive and ni negative examples each, calculate
Entropy(A) = Σ over all subsets of ((pi + ni)/(p + n)) · I(pi, ni)
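A minimal Python sketch of these two formulas (the function names and example values are mine); it reproduces the figures used in the worked example that follows, e.g. I(3, 7) ≈ 0.88 for the Cheat attribute.

```python
import math

def I(p, n):
    """I(p, n): entropy of a node with p positive and n negative examples."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)  # 0·log2(0) taken as 0

def entropy_of_attribute(counts):
    """Weighted entropy of an attribute whose values give (pi, ni) counts."""
    total = sum(p + n for p, n in counts)
    return sum((p + n) / total * I(p, n) for p, n in counts)

print(round(I(3, 7), 2))                                 # 0.88 -> E(Cheat)
print(round(entropy_of_attribute([(0, 3), (3, 4)]), 3))  # 0.69 -> E(Refund); the slide gets 0.686 by rounding I(3,4) to 0.98
```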
Splitting Based on INFO...
Information Gain:
GAIN_split = Entropy(p) - Σ_{i=1..k} (n_i / n) · Entropy(i)
where the parent node p, holding n records, is split into k partitions and n_i is the number of records in partition i. The split with the largest GAIN is chosen.
Entropy (Refund)
Split the 10-record Cheat data on Refund:

Refund  pi  ni  I(pi, ni)
Yes     0   3   0
No      3   4   0.98

I(0,3) = -0/3·log2(0/3) - 3/3·log2(3/3) = 0
I(3,4) = -3/7·log2(3/7) - 4/7·log2(4/7) = 0.98
E(Refund) = 3/10 × 0 + 7/10 × 0.98 = 0.686
E(Cheat) = I(3, 7) = 0.88 (3 of the 10 records have Cheat = Yes)
Gain(Refund) = E(Cheat) - E(Refund) = 0.88 - 0.686 = 0.19
Entropy (Marital Status)
Single:   I(2,2) = -2/4·log2(2/4) - 2/4·log2(2/4) = 1
Divorced: I(1,1) = -1/2·log2(1/2) - 1/2·log2(1/2) = 1
Married:  I(0,4) = -0/4·log2(0/4) - 4/4·log2(4/4) = 0
E(Marital Status) = 4/10 × 1 + 4/10 × 0 + 2/10 × 1 = 0.6
Gain(Marital Status) = E(Cheat) - E(Marital Status) = 0.88 - 0.6 = 0.28
Entropy (Taxable Inc.)
Split Taxable Income at 80K:
≥ 80K: I(3,4) = -3/7·log2(3/7) - 4/7·log2(4/7) = 0.98
< 80K: I(0,3) = -0/3·log2(0/3) - 3/3·log2(3/3) = 0
E(Taxable Inc.) = 7/10 × 0.98 + 3/10 × 0 = 0.686
Gain(Taxable Inc.) = E(Cheat) - E(Taxable Inc.) = 0.88 - 0.686 = 0.19
Comparing the information gain of the three attributes:
Gain(Refund) = 0.19
Gain(Marital Status) = 0.28 (the largest)
Gain(Taxable Inc.) = 0.19
Marital Status gives the largest gain, so it is chosen for the first split. A short sketch verifying these three numbers follows.
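A short Python check of these three gains on the 10-record table (the record literals and helper names are mine, not from the slides); up to rounding it reproduces 0.19, 0.28 and 0.19.

```python
import math
from collections import defaultdict

# The 10 training records: (Refund, Marital Status, Taxable Income in K, Cheat)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def I(p, n):
    """I(p, n) as defined on the Entropy slide."""
    total = p + n
    return sum(-c / total * math.log2(c / total) for c in (p, n) if c)

def gain(value_of):
    """Information gain of grouping the records by value_of(record)."""
    groups = defaultdict(lambda: [0, 0])            # value -> [pi, ni]
    for rec in records:
        groups[value_of(rec)][0 if rec[3] == "Yes" else 1] += 1
    parent = I(sum(r[3] == "Yes" for r in records),
               sum(r[3] == "No" for r in records))
    children = sum((p + n) / len(records) * I(p, n) for p, n in groups.values())
    return parent - children

print(round(gain(lambda r: r[0]), 2))          # Refund             -> 0.19
print(round(gain(lambda r: r[1]), 2))          # Marital Status     -> 0.28
print(round(gain(lambda r: r[2] >= 80), 2))    # Taxable Inc >= 80K -> 0.19
```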
Step – 3
Calculate the entropy of Refund within the records where Marital Status = "Single":
Refund = Yes (1 record): I(0,1) = -0/1·log2(0/1) - 1/1·log2(1/1) = 0
Refund = No (3 records): I(2,1) = -2/3·log2(2/3) - 1/3·log2(1/3) = 0.91
E(Refund | Marital Status = Single) = 1/10 × 0 + 3/10 × 0.91 = 0.27
Gain(Refund | Marital Status = Single) = E(Cheat) - E(Refund | Marital Status = Single) = 0.88 - 0.27 = 0.60
Building Decision Tree
Decision Rules:
Marital St. = Married then Cheat = No
Marital St. = (Single Or Divorced) And Refund = Yes then Cheat = No
Marital St. = (Single Or Divorced) And Refund = No And TaxInc <= 80K then Cheat = No
Marital St. = (Single Or Divorced) And Refund = No And TaxInc > 80K then Cheat = Yes
A sketch checking these rules against the 10 training records follows.
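A small Python check (the record list and function name are mine) that applies these four rules to the 10 training records and confirms they reproduce every Cheat label:

```python
# Each record: (Refund, Marital Status, Taxable Income in K, Cheat)
records = [
    ("Yes", "Single",   125, "No"),  ("No", "Married", 100, "No"),
    ("No",  "Single",    70, "No"),  ("Yes","Married", 120, "No"),
    ("No",  "Divorced",  95, "Yes"), ("No", "Married",  60, "No"),
    ("Yes", "Divorced", 220, "No"),  ("No", "Single",   85, "Yes"),
    ("No",  "Married",   75, "No"),  ("No", "Single",   90, "Yes"),
]

def predict(refund, marital, taxinc):
    """The four decision rules, read top to bottom; exactly one fires."""
    if marital == "Married":
        return "No"
    if refund == "Yes":                        # Single or Divorced
        return "No"
    return "No" if taxinc <= 80 else "Yes"     # Single or Divorced, Refund = No

assert all(predict(r, m, t) == cheat for r, m, t, cheat in records)
print("all 10 training records reproduced")
```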
Building Decision Tree
Step – 1
Training data (Age, Income, Cartype → Class):

Age      Income  Cartype  Class
Adult    low     Family   High
Adult    high    Sports   High
Younger  low     Sports   Low
Old      low     Family   Low
Younger  high    Truck    High
Old      high    Truck    High

Calculate the "Class" attribute entropy:
p = 4, n = 2, p + n = 6
Entropy(Class) = -(p/(p+n))·log2(p/(p+n)) - (n/(p+n))·log2(n/(p+n)) = -(4/6)·log2(4/6) - (2/6)·log2(2/6) = 0.918
Step – 2
Entropy of Age = 0.667, Gain of Age = 0.25
Entropy of Income = 0.459, Gain of Income = 0.46
Entropy of Cartype = 0.667, Gain of Cartype = 0.25
Which attribute's Gain is the largest?
Split the table according to the largest Gain.
Calculate Entropy and Gain, then build the tree.
Income has the largest Gain, so split the table on Income:

Income = high: (Adult, Sports, High), (Younger, Truck, High), (Old, Truck, High)
Income = low:  (Adult, Family, High), (Younger, Sports, Low), (Old, Family, Low)

The Income = high partition is pure (every record has Class = High), so that branch becomes a High leaf; repeat the entropy and gain calculation on the Income = low partition.
Building Decision Tree
The finished tree:
Income = High → Class = High
Income = Low → test Age:
Age = Adult → Class = High
Age = Younger or Old → Class = Low
Decision Rules:
Income = High then Class = High
Income = Low And Age = Adult then Class = High
Income = Low And (Age = Younger Or Old) then Class = Low
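A compact recursive sketch of this procedure (my own illustration of the entropy/gain recipe, not code from the lecture); run on the six-record car table it first splits on Income and then on Age, matching the tree and rules above:

```python
import math
from collections import Counter, defaultdict

# Six training records: Age, Income, Cartype -> Class
rows = [
    {"Age": "Adult",   "Income": "low",  "Cartype": "Family", "Class": "High"},
    {"Age": "Adult",   "Income": "high", "Cartype": "Sports", "Class": "High"},
    {"Age": "Younger", "Income": "low",  "Cartype": "Sports", "Class": "Low"},
    {"Age": "Old",     "Income": "low",  "Cartype": "Family", "Class": "Low"},
    {"Age": "Younger", "Income": "high", "Cartype": "Truck",  "Class": "High"},
    {"Age": "Old",     "Income": "high", "Cartype": "Truck",  "Class": "High"},
]

def entropy(rows):
    counts = Counter(r["Class"] for r in rows)
    return sum(-c / len(rows) * math.log2(c / len(rows)) for c in counts.values())

def gain(rows, attr):
    groups = defaultdict(list)
    for r in rows:
        groups[r[attr]].append(r)
    weighted = sum(len(g) / len(rows) * entropy(g) for g in groups.values())
    return entropy(rows) - weighted

def build(rows, attrs, depth=0):
    """Greedy entropy/gain tree induction (ID3-style) on categorical attributes."""
    if len({r["Class"] for r in rows}) == 1 or not attrs:   # pure node or no attributes left
        majority = Counter(r["Class"] for r in rows).most_common(1)[0][0]
        print("  " * depth + "-> Class =", majority)
        return
    best = max(attrs, key=lambda a: gain(rows, a))           # attribute with largest gain
    for value in sorted({r[best] for r in rows}):
        print("  " * depth + f"{best} = {value}")
        subset = [r for r in rows if r[best] == value]
        build(subset, [a for a in attrs if a != best], depth + 1)

build(rows, ["Age", "Income", "Cartype"])
```

The printed splits correspond to the decision rules above: Income = High gives Class = High, and within Income = Low the Age test separates Adult from Younger and Old.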
Building Decision Tree
Try Yourself:
https://kindsonthegenius.com/blog/2018/04/how-to-build-a-decision-tree-for-classification-step-by-step-procedure-using-entropy-and-gain.html
Under-fitting and Over-fitting
Producing a model that doesn't perform well even on the training data is called under-fitting. Typically, when this happens, we decide the model isn't good enough and keep looking for a better one. Conversely, a model that fits the training data very closely but performs poorly on previously unseen (test) data is said to be over-fitting.