Dtree
Sunita Sarawagi
IIT Bombay
http://www.it.iitb.ac.in/~sunita
Decision tree classifiers
• Widely used learning method
• Easy to interpret: can be re-represented as if-then-else
rules
• Approximates the target function by piecewise-constant regions
• Does not require any prior knowledge of data distribution,
works well on noisy data.
• Has been applied to:
  • classifying medical patients by disease,
  • equipment malfunctions by cause,
  • loan applicants by likelihood of repayment,
  • and many other applications.
Setting
• Given old data about customers and payments,
predict new applicant’s loan eligibility.
[Example tree: the root tests Salary < 1 M; a branch then tests Age (<=30, 30..40, >40); leaves give the loan decision (yes / no).]
Weather Data: Play or not Play?
Outlook    Temperature  Humidity  Windy  Play?
sunny      hot          high      false  No
sunny      hot          high      true   No
overcast   hot          high      false  Yes
rain       mild         high      false  Yes
rain       cool         normal    false  Yes
rain       cool         normal    true   No
overcast   cool         normal    true   Yes
sunny      mild         high      false  No
sunny      cool         normal    false  Yes
rain       mild         normal    false  Yes
sunny      mild         normal    true   Yes
overcast   mild         high      true   Yes
overcast   hot          normal    false  Yes
rain       mild         high      true   No

Note: "Outlook" here is the weather forecast; it has no relation to the Microsoft email program.
Example Tree for “Play?”
[Tree: root tests Outlook. sunny -> test Humidity (high -> No, normal -> Yes); overcast -> Yes; rain -> test Windy (true -> No, false -> Yes).]
Topics to be covered
• Tree construction:
• Basic tree learning algorithm
• Measures of predictive ability
• High-performance decision tree construction: SPRINT
• Tree pruning:
• Why prune
• Methods of pruning
• Other issues:
• Handling missing data
• Continuous class labels
• Effect of training size
Tree learning algorithms
• ID3 (Quinlan 1986)
• Successor C4.5 (Quinlan 1993)
• CART (Breiman et al. 1984)
• SLIQ (Mehta et al. 1996)
• SPRINT (Shafer et al. 1996)
Basic algorithm for tree building
• Greedy top-down construction:
  • If the node meets the stopping condition, make it a leaf and stop.
  • Otherwise, use a selection criterion to find the best attribute and the best split on that attribute.
  • Partition the data on the split condition and recurse on each partition.
Selection criteria
• Gini index:
    Gini(S) = 1 - \sum_{i=1}^{k} p_i^2
  where p_i is the fraction of records in S that belong to class i.
• Entropy:
    Entropy(S) = - \sum_{i=1}^{k} p_i \log_2 p_i
[Plot: entropy and Gini as functions of the class probability p1; both are 0 at p1 = 0 and p1 = 1 and peak at p1 = 0.5.]
• Information gain on partitioning S into r subsets S1..Sr: the impurity of S minus the weighted impurity of the subsets:
    Gain(S, S1..Sr) = Entropy(S) - \sum_{j=1}^{r} \frac{|S_j|}{|S|} Entropy(S_j)
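As a small illustration (not code from any of the systems discussed in these slides), the measures can be written directly from the formulas; the base-2 logarithm and the list-of-labels representation are the only assumptions:

    import math
    from collections import Counter

    def entropy(labels):
        """Entropy(S) = -sum_i p_i * log2(p_i) over the class distribution of S."""
        n = len(labels)
        return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

    def gini(labels):
        """Gini(S) = 1 - sum_i p_i^2."""
        n = len(labels)
        return 1.0 - sum((c / n) ** 2 for c in Counter(labels).values())

    def information_gain(labels, subsets):
        """Gain(S, S1..Sr) = Entropy(S) - sum_j (|Sj|/|S|) * Entropy(Sj)."""
        n = len(labels)
        return entropy(labels) - sum((len(s) / n) * entropy(s) for s in subsets)

    # e.g. the 14 weather-data class labels split by Outlook:
    play = ["No","No","Yes","Yes","Yes","No","Yes","No","Yes","Yes","Yes","Yes","Yes","No"]
    sunny, overcast, rainy = (["No","No","No","Yes","Yes"],
                              ["Yes","Yes","Yes","Yes"],
                              ["Yes","Yes","No","Yes","No"])
    print(information_gain(play, [sunny, overcast, rainy]))   # ~0.247 bits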
*Properties of the entropy
• The multistage property:
    entropy(p, q, r) = entropy(p, q + r) + (q + r) \cdot entropy\left(\frac{q}{q+r}, \frac{r}{q+r}\right)
• Simplification of computation: with raw class counts n_1..n_k (n = \sum_i n_i),
    entropy = \frac{-\sum_i n_i \log_2 n_i + n \log_2 n}{n}
witten&eibe
Information gain: example
K = 2, |S| = 100, p1 = 0.6, p2 = 0.4
E(S) = -0.6 log(0.6) - 0.4 log(0.4) = 0.29 (base-10 logs; equivalently 0.971 bits with base-2 logs)
[S is partitioned into subsets S1 and S2.]
witten&eibe
Example: attribute “Outlook”
• “Outlook” = “Sunny”:
    info([2,3]) = entropy(2/5, 3/5) = -2/5 log2(2/5) - 3/5 log2(3/5) = 0.971 bits
• “Outlook” = “Overcast”:
    info([4,0]) = entropy(1, 0) = -1 log2(1) - 0 log2(0) = 0 bits
  (Note: log(0) is not defined, but we evaluate 0 * log(0) as zero.)
• “Outlook” = “Rainy”:
    info([3,2]) = entropy(3/5, 2/5) = -3/5 log2(3/5) - 2/5 log2(2/5) = 0.971 bits
• Expected information for the attribute:
    info([2,3],[4,0],[3,2]) = (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971 = 0.693 bits
witten&eibe
Computing the information gain
• Information gain:
(information before split) – (information after split)
witten&eibe
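Plugging in the weather-data numbers (the class distribution before splitting is [9 Yes, 5 No]; the 0.693 value was computed on the previous slide):

    gain(Outlook) = info([9,5]) - info([2,3],[4,0],[3,2])
                  = 0.940 - 0.693
                  = 0.247 bits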
The final decision tree
witten&eibe
Highly-branching attributes
• Problematic: attributes with a large number of
values (extreme case: ID code)
• Subsets are more likely to be pure if there is a
large number of values
• Information gain is biased towards choosing attributes with a large number of values
• This may result in overfitting (selection of an attribute that is non-optimal for prediction)
witten&eibe
Weather Data with ID code
ID  Outlook    Temperature  Humidity  Windy  Play?
A   sunny      hot          high      false  No
B   sunny      hot          high      true   No
C   overcast   hot          high      false  Yes
D   rain       mild         high      false  Yes
E   rain       cool         normal    false  Yes
F   rain       cool         normal    true   No
G   overcast   cool         normal    true   Yes
H   sunny      mild         high      false  No
I   sunny      cool         normal    false  Yes
J   rain       mild         normal    false  Yes
K   sunny      mild         normal    true   Yes
L   overcast   mild         high      true   Yes
M   overcast   hot          normal    false  Yes
N   rain       mild         high      true   No
Split for ID Code Attribute
[Figure: splitting on ID code yields 14 branches, one per record; every branch is pure, so the information gain is maximal.]
witten&eibe
Gain ratio
• Gain ratio: a modification of the information gain
that reduces its bias on high-branch attributes
• The correction term (the intrinsic information of the split) is:
  • Large when the data is spread evenly over many branches
  • Small when all the data belong to one branch
  • Dividing by it therefore penalizes highly-branching attributes such as ID code
• Gain ratio takes number and size of branches
into account when choosing an attribute
• It corrects the information gain by taking the intrinsic
information of a split into account (i.e. how much info
do we need to tell which branch an instance belongs
to)
witten&eibe
Gain Ratio and Intrinsic Info.
• Intrinsic information: entropy of distribution of
instances into branches
    IntrinsicInfo(S, A) = - \sum_{i} \frac{|S_i|}{|S|} \log_2 \frac{|S_i|}{|S|}
• Gain ratio (Quinlan'86) normalizes the information gain by it:
    GainRatio(S, A) = \frac{Gain(S, A)}{IntrinsicInfo(S, A)}
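A short Python sketch of these two formulas (base-2 logs assumed; the gain value itself is taken as an input here):

    import math

    def intrinsic_info(subset_sizes):
        """IntrinsicInfo(S, A) = -sum_i (|Si|/|S|) * log2(|Si|/|S|)."""
        n = sum(subset_sizes)
        return -sum((s / n) * math.log2(s / n) for s in subset_sizes if s > 0)

    def gain_ratio(gain, subset_sizes):
        """GainRatio(S, A) = Gain(S, A) / IntrinsicInfo(S, A)."""
        return gain / intrinsic_info(subset_sizes)

    # ID code: 14 singleton branches, information gain 0.940 bits (next slide)
    print(intrinsic_info([1] * 14))     # ~3.807 bits
    print(gain_ratio(0.940, [1] * 14))  # ~0.247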
Computing the gain ratio
• Example: intrinsic information of the ID code attribute (14 singleton branches):
    info([1,1,...,1]) = 14 * (-(1/14) log2(1/14)) = 3.807 bits
• The importance of an attribute decreases as its intrinsic information gets larger
• Definition of gain ratio:
    gain_ratio("Attribute") = gain("Attribute") / intrinsic_info("Attribute")
• Example:
    gain_ratio("ID code") = 0.940 bits / 3.807 bits = 0.247
witten&eibe
Gain ratios for weather data
Attribute     Info   Gain                   Split info              Gain ratio
Outlook       0.693  0.940 - 0.693 = 0.247  info([5,4,5]) = 1.577   0.247 / 1.577 = 0.157
Temperature   0.911  0.940 - 0.911 = 0.029  info([4,6,4]) = 1.557   0.029 / 1.557 = 0.019
Humidity      0.788  0.940 - 0.788 = 0.152  info([7,7]) = 1.000     0.152 / 1.000 = 0.152
Windy         0.892  0.940 - 0.892 = 0.048  info([8,6]) = 0.985     0.048 / 0.985 = 0.049
witten&eibe
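The split-info column can be reproduced directly from the branch sizes in the weather data (5/4/5 for Outlook, 4/6/4 for Temperature, 7/7 for Humidity, 8/6 for Windy) using the intrinsic_info helper sketched earlier:

    print(intrinsic_info([5, 4, 5]))   # ~1.577  (Outlook)
    print(intrinsic_info([4, 6, 4]))   # ~1.557  (Temperature)
    print(intrinsic_info([7, 7]))      # 1.000   (Humidity)
    print(intrinsic_info([8, 6]))      # ~0.985  (Windy)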
More on the gain ratio
• “Outlook” still comes out top
• However: “ID code” has greater gain ratio
• Standard fix: ad hoc test to prevent splitting on that type of
attribute
• Problem with gain ratio: it may overcompensate
• May choose an attribute just because its intrinsic information is
very low
• Standard fix:
• First, only consider attributes with greater than average information
gain
• Then, compare them on gain ratio
witten&eibe
SPRINT (Scalable PaRallelizable INduction of decision Trees)
• Design goals:
• Able to handle large disk-resident training sets
• No restrictions on training-set size
• Easily parallelizable
Partition(Data D)
  if (all points in D belong to the same class) then
    return;
  for each attribute A do
    evaluate splits on attribute A;
  use the best split found to partition D into D1 and D2;
  Partition(D1);
  Partition(D2);
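The pseudocode maps onto a small in-memory Python sketch, shown below; this illustrates the recursion with binary splits and the Gini criterion, not SPRINT's disk-resident attribute-list implementation. Rows are plain lists with the class label last; column indices and helper names are made up for the example.

    from collections import Counter

    def gini(rows):
        """Gini(D) = 1 - sum_j p_j^2, with the class label in the last column."""
        n = len(rows)
        return 1.0 - sum((c / n) ** 2 for c in Counter(r[-1] for r in rows).values())

    def partition(rows, a, v):
        """Numeric attribute: D1 = {r: r[a] <= v}; categorical: D1 = {r: r[a] == v}."""
        test = (lambda r: r[a] <= v) if isinstance(v, (int, float)) else (lambda r: r[a] == v)
        return [r for r in rows if test(r)], [r for r in rows if not test(r)]

    def best_split(rows, attrs):
        """Evaluate candidate splits on every attribute; keep the lowest weighted Gini."""
        best = None
        for a in attrs:
            for v in set(r[a] for r in rows):
                d1, d2 = partition(rows, a, v)
                if not d1 or not d2:
                    continue
                w = (len(d1) * gini(d1) + len(d2) * gini(d2)) / len(rows)
                if best is None or w < best[0]:
                    best = (w, a, v)
        return best

    def build(rows, attrs):
        """Partition(D): stop if D is pure, else split into D1, D2 and recurse."""
        if len(set(r[-1] for r in rows)) == 1:
            return rows[0][-1]
        found = best_split(rows, attrs)
        if found is None:
            return Counter(r[-1] for r in rows).most_common(1)[0][0]
        _, a, v = found
        d1, d2 = partition(rows, a, v)
        return {"attr": a, "value": v, "left": build(d1, attrs), "right": build(d2, attrs)}

    # The six (Age, CarType, Risk) records used in the attribute lists below:
    data = [[23, "family", "High"], [17, "sports", "High"], [43, "sports", "High"],
            [68, "family", "Low"], [32, "truck", "Low"], [20, "family", "High"]]
    print(build(data, attrs=[0, 1]))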
Evaluating Split Points
• Gini index: if data D contains examples from c classes,
    Gini(D) = 1 - \sum_{j=1}^{c} p_j^2
  where p_j is the relative frequency of class j in D
• A binary split of D into D1 and D2 is scored by the weighted sum
    Gini_split(D) = (|D1|/|D|) Gini(D1) + (|D2|/|D|) Gini(D2)
Initial attribute lists for the root node:

Age list:               Car Type list:
Age  Risk  RID          Car Type  Risk  RID
17   High  1            family    High  0
20   High  5            family    High  5
23   High  0            family    Low   3
32   Low   4            sports    High  2
43   High  2            sports    High  1
68   Low   3            truck     Low   4
Split Points: Continuous Attributes
Scan the sorted Age attribute list, maintaining class histograms (High, Low counts) for the would-be left and right children; at each cursor position evaluate the Gini index of the candidate split:

Cursor position   Candidate split   Left child (High, Low)   Right child (High, Low)   Gini of split
0                 Age < 17          (0, 0)                   (4, 2)                    undefined
1                 Age < 20          (1, 0)                   (3, 2)                    0.4
3                 Age < 32          (3, 0)                   (1, 2)                    0.222
6 (end of scan)   -                 (4, 2)                   (0, 0)                    undefined
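A sketch of that scan in Python (in-memory lists rather than SPRINT's disk-resident attribute lists); it reproduces the Gini values in the table above:

    # Sorted Age attribute list from the example: (age, risk) pairs.
    age_list = [(17, "High"), (20, "High"), (23, "High"),
                (32, "Low"), (43, "High"), (68, "Low")]

    def split_gini(left, right):
        """Weighted Gini of a candidate split, given the two class histograms."""
        def g(h):
            n = sum(h.values())
            return None if n == 0 else 1.0 - sum((c / n) ** 2 for c in h.values())
        gl, gr = g(left), g(right)
        if gl is None or gr is None:
            return None                                   # an empty child: undefined
        n = sum(left.values()) + sum(right.values())
        return (sum(left.values()) * gl + sum(right.values()) * gr) / n

    left = {"High": 0, "Low": 0}                          # nothing scanned yet
    right = {"High": 4, "Low": 2}                         # all records start on the right
    for age, risk in age_list:
        print(f"Age < {age}:", split_gini(left, right))   # cursor just before this record
        left[risk] += 1                                   # move the record to the left child
        right[risk] -= 1
    print("end of scan:", split_gini(left, right))        # everything on the left: undefined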
Split Points: Categorical Attrib.
• Consider splits of the form: value(A) ∈ {x1, x2, ..., xn}
  • Example: CarType ∈ {family, sports}
• Evaluate this split form for subsets of domain(A)
• To evaluate splits on attribute A for a given tree node:
    initialize the class/value matrix of the node to zeroes;
    for each record in the attribute list do
      increment the appropriate count in the matrix;
    evaluate the splitting index for various subsets using the constructed matrix;
Finding Split Points: Categorical Attributes

Attribute list:            class/value matrix:
Car Type  Risk  RID        Car Type  High  Low
family    High  0          family    2     1
family    High  5          sports    2     0
family    Low   3          truck     0     1
sports    High  2
sports    High  1
truck     Low   4

Candidate subset splits (left child vs. right child) are then scored with the Gini index using only the matrix counts.
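For instance, the split CarType ∈ {family, sports} can be scored from the matrix alone; a small sketch (helper names invented for the example):

    # class/value matrix: attribute value -> (High, Low) counts
    matrix = {"family": (2, 1), "sports": (2, 0), "truck": (0, 1)}

    def gini(high, low):
        n = high + low
        return 1.0 - (high / n) ** 2 - (low / n) ** 2

    def subset_split_gini(subset):
        """Weighted Gini of the split CarType in `subset` vs. the rest."""
        lh = sum(matrix[v][0] for v in subset)
        ll = sum(matrix[v][1] for v in subset)
        rh = sum(h for v, (h, l) in matrix.items() if v not in subset)
        rl = sum(l for v, (h, l) in matrix.items() if v not in subset)
        n = lh + ll + rh + rl
        return ((lh + ll) * gini(lh, ll) + (rh + rl) * gini(rh, rl)) / n

    print(subset_split_gini({"family", "sports"}))   # left (4 High, 1 Low) vs. truck -> ~0.267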
Deciding the right tree size
• Three criteria:
  • Cross-validation with separate test data
  • Statistical bounds: use all data for training but apply a statistical test to decide the right size (a cross-validation dataset may be used to set the threshold)
  • Use some criterion function to choose the best size
    • Example: the Minimum Description Length (MDL) criterion
Cross validation
• Partition the dataset into two disjoint parts:
  • 1. Training set, used for building the tree.
  • 2. Validation set, used for pruning the tree.
  • Rule of thumb: 2/3 training, 1/3 validation
• Evaluate the tree on the validation set; at each leaf and internal node keep a count of correctly labeled data.
• Starting bottom-up, prune nodes whose own error is no higher than the combined error of their children.
• What if the training data set size is limited?
  • n-fold cross-validation: partition the training data into n parts D1, D2, ..., Dn.
  • Train n classifiers, using D - Di for training and Di for testing.
  • Average the n error estimates. (how exactly to combine them?)
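A minimal sketch of the n-fold procedure; train_tree and error_rate are placeholders for whatever tree builder and evaluation function are plugged in:

    import random

    def n_fold_cv(data, n, train_tree, error_rate):
        """Split data into n parts D1..Dn; train on D - Di, test on Di, average the error."""
        data = list(data)
        random.shuffle(data)
        folds = [data[i::n] for i in range(n)]                    # D1 .. Dn
        errors = []
        for i in range(n):
            train = [r for j, fold in enumerate(folds) if j != i for r in fold]
            tree = train_tree(train)                              # build on D - Di
            errors.append(error_rate(tree, folds[i]))             # test on Di
        return sum(errors) / n                                    # the averaged estimate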
That was a simplistic view..
• A tree with minimum error on a single test set may
not be stable.
• In what order do you prune?
Minimum Cost-Complexity Pruning in CART
• For each cross-validation run:
  • Construct the full tree Tmax
  • Use error estimates to prune Tmax:
    • Delete subtrees in increasing order of link strength (weakest links first)
    • All subtrees of the same strength go together
      • This gives a sequence of trees of various sizes
  • Use the validation partition to record the error for each tree size
• Choose the tree size with the smallest error over all CV partitions
• Run a complicated search involving growing and shrinking phases to find the best tree of the chosen size using all the data
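For comparison, scikit-learn's DecisionTreeClassifier ships CART-style minimal cost-complexity pruning; the sketch below (with placeholder data X, y) enumerates the pruning sequence and picks the complexity parameter, i.e. the tree size, by cross-validation. It illustrates the idea above rather than CART's exact growing-and-shrinking search.

    import numpy as np
    from sklearn.tree import DecisionTreeClassifier
    from sklearn.model_selection import cross_val_score

    X = np.random.rand(200, 5)                  # placeholder training data
    y = (X[:, 0] + X[:, 1] > 1.0).astype(int)   # placeholder labels

    # One ccp_alpha per pruning step: the sequence of subtrees from Tmax down to the root.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)

    # Cross-validated accuracy for each tree size; keep the best alpha.
    scores = [cross_val_score(DecisionTreeClassifier(ccp_alpha=a, random_state=0),
                              X, y, cv=5).mean() for a in path.ccp_alphas]
    best_alpha = path.ccp_alphas[int(np.argmax(scores))]
    final_tree = DecisionTreeClassifier(ccp_alpha=best_alpha, random_state=0).fit(X, y)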
Pruning: Which nodes come off next?
Order of Pruning: Weakest Link Goes First
• Prune away the "weakest link": the nodes that add least to the overall accuracy of the tree
  • A node's contribution to the overall tree is a function of both its increase in accuracy and its size
  • The accuracy gain is weighted by the node's share of the sample
  • Small nodes tend to get removed before large ones
• If several nodes have the same contribution they are all pruned away simultaneously
  • Hence more than two terminal nodes could be cut off in one pruning step
• The pruning sequence is determined all the way back to the root node
  • We need to allow for the possibility that the entire tree is bad
  • If the target variable is unpredictable we will want to prune back to the root: the "no model" solution
Pruning Sequence Example
Pros:
+ Reasonable training time
+ Fast application
+ Easy to interpret
+ Easy to implement

Cons:
- Not effective for very high-dimensional data where information about the class is spread in small amounts over many correlated features
  - Example: words in text classification
- Not robust to dropping of important features, even when correlated substitutes exist in the data