Classification And
Regression Tree (CART) &
C4.5
Dinesh Kumar
Indra Panwar
Arjan Singh
Decision tree
[Figure: example data plotted on x and y axes with class labels; x denotes the input variables/predictors and y the outcome/output variable]
Decision tree
A tree-like model generated by learning decision rules inferred from the training data.
It builds up a complex decision boundary by applying multiple simple (linear) splits one after another.
The key idea:
Select splits that decrease the impurity of the class distribution in the resulting subtrees.
Structure
Consists of
Node (attribute) – represents a feature to test
Link (branch) – represents a decision (outcome of the test)
Leaf (terminal node) – represents the outcome/class label (a minimal node structure is sketched below)
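As a rough illustration of this structure (not taken from the slides), a minimal node type could look like the following; the field names are assumptions made for the sketch.

from dataclasses import dataclass
from typing import Optional


@dataclass
class Node:
    """A single node of a decision tree (illustrative sketch)."""
    feature: Optional[int] = None      # index of the attribute tested at this node
    threshold: Optional[float] = None  # split value for numeric attributes
    left: Optional["Node"] = None      # branch taken when the test is satisfied
    right: Optional["Node"] = None     # branch taken otherwise
    prediction: Optional[str] = None   # class label if this is a leaf (terminal node)

    def is_leaf(self) -> bool:
        return self.prediction is not None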
Decision tree
Considerations when growing a tree
Which features to choose, and which condition to use for splitting
Values of some attributes give more information than others (Information(attribute_1) < Information(attribute_2))
Quantifying the information provided by an attribute
Information content = −log P(X = x)
The lower the probability, the more information it provides
Entropy: the amount of uncertainty/unpredictability in the information
Entropy = −Σ P(X = x) · log P(X = x)
All data of the same class → entropy = 0; data evenly distributed among classes → entropy = 1 (highest, for two classes with a base-2 logarithm)
(a numerical sketch follows)
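A quick numerical sketch of the two quantities above (not part of the original slides; the function names are illustrative):

import math
from collections import Counter


def information_content(p: float) -> float:
    """Self-information of an event with probability p, in bits."""
    return -math.log2(p)


def entropy(labels) -> float:
    """Shannon entropy (bits) of the class distribution in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


# Rarer events carry more information than common ones.
print(information_content(0.5), information_content(0.1))   # 1.0 vs ~3.32

# All data of the same class -> 0; two classes evenly split -> 1 (highest).
print(entropy(["P"] * 9))                 # 0.0
print(entropy(["P"] * 4 + ["N"] * 4))     # 1.0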
C4.5
To arrive at the best attribute:
Gain ratio = Information gain / Split info
Information gain = Entropy(target) − (weighted average) Entropy(children)
Split info = −Σ (Sᵢ/S) · log(Sᵢ/S)
(a worked sketch follows)
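A minimal sketch of how the gain ratio could be computed for one candidate split; the helper names are assumptions, and the example counts reuse the 4 P / 5 N data from the CART example later in the slides:

import math
from collections import Counter


def entropy(labels) -> float:
    counts = Counter(labels)
    total = len(labels)
    return sum(-(c / total) * math.log2(c / total) for c in counts.values())


def gain_ratio(parent_labels, children_labels) -> float:
    """Information gain divided by split info for one candidate split."""
    total = len(parent_labels)
    # Information gain = Entropy(target) - weighted average Entropy(children)
    weighted_child_entropy = sum(
        len(child) / total * entropy(child) for child in children_labels
    )
    info_gain = entropy(parent_labels) - weighted_child_entropy
    # Split info = -sum (Si/S) * log2(Si/S); penalises many-valued attributes
    split_info = sum(
        -(len(child) / total) * math.log2(len(child) / total)
        for child in children_labels
    )
    return info_gain / split_info if split_info > 0 else 0.0


# Parent: 4 P, 5 N; split into (3 P, 1 N) and (1 P, 4 N)
parent = ["P"] * 4 + ["N"] * 5
children = [["P"] * 3 + ["N"], ["P"] + ["N"] * 4]
print(gain_ratio(parent, children))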
CART
Gini gain = Gini index(parent) − (weighted average) Gini index(children)
Gini index = 1 − Σ (Sᵢ/S)²
(a small sketch of both quantities follows)
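A minimal sketch of the two CART quantities (illustrative helper names, not from the slides); these helpers are reused in the worked example below:

from collections import Counter


def gini_index(labels) -> float:
    """Gini index = 1 - sum((Si/S)^2) over the classes in `labels`."""
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def gini_gain(parent_labels, children_labels) -> float:
    """Gini gain = Gini(parent) - weighted average Gini(children)."""
    total = len(parent_labels)
    weighted_children = sum(
        len(child) / total * gini_index(child) for child in children_labels
    )
    return gini_index(parent_labels) - weighted_children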
CART Example:
Gini gain = Gini index(parent) − (weighted average) Gini index(children)
Gini index = 1 − Σ (Sᵢ/S)²

Training data (9 instances; only some rows are shown here):
Index  a1  a2  a3    Class (t)
1      T   T   1.00  P
2      T   T   6.00  P
3      T   F   5.00  N
4      F   F   4.00  P
…
9      F   T   5.00  N

[Figure: split on a1 — Target (9): 4 P, 5 N; a1 = T (4): 3 P, 1 N; a1 = F (5): 1 P, 4 N]

Gini index of the target (t):
Gini(t) = 1 − [(4/9)² + (5/9)²] = 40/81
(checked numerically in the sketch below)
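A short numerical check of the example above, using the class counts shown in the figure (the gini helper is the illustrative one defined earlier):

from collections import Counter


def gini_index(labels) -> float:
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


# Target: 4 P and 5 N out of 9 instances
target = ["P"] * 4 + ["N"] * 5
print(gini_index(target))          # 40/81 ≈ 0.4938

# Split on a1: T -> (3 P, 1 N), F -> (1 P, 4 N)
left, right = ["P"] * 3 + ["N"], ["P"] + ["N"] * 4
weighted = len(left) / 9 * gini_index(left) + len(right) / 9 * gini_index(right)
print(gini_index(target) - weighted)   # Gini gain for the a1 split ≈ 0.149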
CART Example contd…
[Table: candidate split points for the continuous attribute a3 (9 instances), e.g. thresholds 3.50 and 7.50 taken between sorted a3 values, with the resulting P/N class distributions; a threshold-selection sketch follows]
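The slide's full table is not recoverable here, but the standard CART procedure it illustrates — testing midpoints between consecutive sorted values of a continuous attribute — can be sketched as follows. The data below uses only the a3 values and classes from the rows shown earlier, so it is illustrative rather than the complete example:

from collections import Counter


def gini_index(labels):
    counts = Counter(labels)
    total = len(labels)
    return 1.0 - sum((c / total) ** 2 for c in counts.values())


def best_threshold(values, labels):
    """Try the midpoint between each pair of consecutive sorted values
    and return the threshold with the highest Gini gain."""
    pairs = sorted(zip(values, labels))
    parent_gini = gini_index(labels)
    best = (None, -1.0)
    for (v1, _), (v2, _) in zip(pairs, pairs[1:]):
        if v1 == v2:
            continue
        threshold = (v1 + v2) / 2
        left = [lab for v, lab in pairs if v <= threshold]
        right = [lab for v, lab in pairs if v > threshold]
        weighted = (len(left) * gini_index(left) + len(right) * gini_index(right)) / len(labels)
        gain = parent_gini - weighted
        if gain > best[1]:
            best = (threshold, gain)
    return best


# a3 values and classes from the rows shown above (not the full 9-row table)
a3 = [1.00, 6.00, 5.00, 4.00, 5.00]
t = ["P", "P", "N", "P", "N"]
print(best_threshold(a3, t))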
CART Example contd… [Graph generated using scikit-learn]
CART & C4.5
Key idea: recursive binary partitioning (greedy approach)
Avoid overfitting
Setting constraints on tree size
By setting constraints on the tree-defining parameters (a scikit-learn sketch follows this list):
Minimum samples for a node split
Too high a value may result in under-fitting; use cross-validation to tune
Minimum samples for a terminal node
Maximum depth of the tree
Maximum number of terminal nodes
Maximum number of features to consider for a split
Randomly selected; as a rule of thumb, the square root of the total number of features
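A minimal scikit-learn sketch of these constraints; the dataset and the specific parameter values are arbitrary illustrations, not recommendations from the slides:

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)

clf = DecisionTreeClassifier(
    min_samples_split=10,   # minimum samples required to split an internal node
    min_samples_leaf=5,     # minimum samples required at a terminal node (leaf)
    max_depth=4,            # maximum depth of the tree
    max_leaf_nodes=20,      # maximum number of terminal nodes
    max_features="sqrt",    # features considered per split: sqrt(total features)
    random_state=0,
)
clf.fit(X, y)
print(clf.get_depth(), clf.get_n_leaves())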
Pruning
CART Pruning
Cost-complexity pruning (post-pruning algorithm)
R_α(T) = R(T) + α · |leaves(T)|
where
R(T) – training/learning misclassification error rate (sum of the misclassification errors)
α – regularization parameter (set by cross-validation)
leaves(T) – function returning the set of leaves of tree T; |leaves(T)| is the number of terminal nodes
Algorithm: choosing α
Divide S into k subsets S1, …, Sk
In fold i:
Train a tree on the data excluding Si
For each candidate αk, prune the tree to that level and measure the error rate on Si
Compute the average error rate over the k folds
Choose the αk that minimizes the error rate; call it α*
Prune the original tree according to α*
(a scikit-learn sketch follows)
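A hedged sketch of this procedure using scikit-learn's built-in cost-complexity pruning (cost_complexity_pruning_path and the ccp_alpha parameter). The dataset and fold count are illustrative, and refitting with α* stands in for pruning the original tree directly:

import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import cross_val_score
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)

# Candidate alphas come from the cost-complexity pruning path of a full tree.
path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X, y)
alphas = path.ccp_alphas

# For each alpha, estimate the cross-validated error rate over k folds.
mean_errors = [
    1.0 - cross_val_score(
        DecisionTreeClassifier(random_state=0, ccp_alpha=a), X, y, cv=5
    ).mean()
    for a in alphas
]

# Choose alpha* that minimizes the average error, then prune with it.
alpha_star = alphas[int(np.argmin(mean_errors))]
pruned = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha_star).fit(X, y)
print(alpha_star, pruned.get_n_leaves())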
CART Pruning example
[Figure: worked example over candidate nodes t1, t2, t3 — for the entire tree R(T_t) = 0 (all leaves are pure), compared with a misclassification rate of 4/16 in the pruned alternative]