Machine Learning
Decision Trees
Problem Setting:
Set of possible instances X
Each instance x in X is a feature vector x = ⟨x₁, x₂, …, xₙ⟩
Unknown target function f: X → Y, where Y is discrete-valued
Set of function hypotheses H = {h | h: X → Y}
Each hypothesis h is a decision tree
A tree sorts x to a leaf, which assigns a label y
Decision Tree Learning
Input:
Training examples {⟨x⁽ⁱ⁾, y⁽ⁱ⁾⟩} of the unknown target function f
Output:
Hypothesis ℎ ∈ 𝐻 that best approximates target function 𝑓
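As a concrete instance of this setup, here is a minimal sketch assuming scikit-learn is available; the dataset, feature choice, and hyperparameters below are illustrative, not prescribed by the slides.

# Learn a hypothesis h: X -> Y from training examples {<x^(i), y^(i)>}.
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]   # feature vectors: petal length, petal width
y = iris.target        # discrete-valued target Y

h = DecisionTreeClassifier(max_depth=2, random_state=42)  # one hypothesis h in H
h.fit(X, y)            # choose the tree that best approximates f on the training examples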
What we need
Best-split
Measure of homogeneity (Impurity)
Impurity Measures:
Gini Impurity
Entropy
Best Split
Iris Dataset
Example
Making Predictions
Suppose you found a new Iris with petal length 5 and petal width 1.5.
Then the new feature vector is [5, 1.5].
Which class does it belong to (what is y)?
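A minimal sketch of this prediction, again assuming scikit-learn and the Iris petal features (the specific classifier settings are an assumption):

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data[:, 2:], iris.target)   # petal length, petal width

x_new = [[5.0, 1.5]]                      # the new Iris
y_pred = tree.predict(x_new)[0]           # the tree sorts x_new to a leaf and returns that leaf's class
print(iris.target_names[y_pred])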
Meaning of the values
G_i = 1 − Σ_k p_{i,k}²
p_{i,k} is the ratio of class-k instances among the training instances in the i-th node.
Ex (the depth-2 Iris petal tree):
Depth-1 left node: 1 − (50/50)² − (0/50)² − (0/50)² = 0
Depth-2 left node: 1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168
Depth-2 right node: 1 − (0/46)² − (1/46)² − (45/46)² ≈ 0.0425
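A small sketch of both impurity measures listed earlier (Gini and entropy); the per-node class counts below are the ones assumed in the worked example above:

from math import log2

def gini(counts):
    m = sum(counts)
    return 1.0 - sum((c / m) ** 2 for c in counts)

def entropy(counts):
    m = sum(counts)
    return -sum((c / m) * log2(c / m) for c in counts if c > 0)

print(gini([50, 0, 0]))     # depth-1 left node: 0.0
print(gini([0, 49, 5]))     # ~0.168
print(gini([0, 1, 45]))     # ~0.0425
print(entropy([0, 49, 5]))  # the same node measured with entropy instead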
Iris dataset, depth 3
Estimating Class Probabilities
Decision trees can also estimate the probability that an instance belongs to a particular class.
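A minimal sketch, assuming scikit-learn: the estimated probabilities are simply the class ratios in the leaf that the instance falls into.

from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
tree = DecisionTreeClassifier(max_depth=2, random_state=42)
tree.fit(iris.data[:, 2:], iris.target)

print(tree.predict_proba([[5.0, 1.5]]))  # one probability per class (class ratios in the reached leaf)
print(tree.predict([[5.0, 1.5]]))        # the class with the highest estimated probability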
Algorithm to build a DT
node = Root
Main loop:
1. A ← the best decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort the training examples to the leaf nodes
5. If the training examples are perfectly classified, then STOP; else, iterate over the new leaf nodes
Stop recursing when the maximum depth is reached, or when no split that reduces the impurity can be found.
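A compact sketch of this greedy loop for numeric features and binary splits, choosing the split with the lowest weighted Gini; all names (Node, build_tree, ...) are illustrative, not from a particular library.

from collections import Counter
from dataclasses import dataclass
from typing import Optional

@dataclass
class Node:
    feature: Optional[int] = None      # index of the splitting attribute
    threshold: Optional[float] = None  # go left when x[feature] <= threshold
    left: Optional["Node"] = None
    right: Optional["Node"] = None
    label: object = None               # majority class, used when the node is a leaf

def gini(labels):
    m = len(labels)
    return 1.0 - sum((c / m) ** 2 for c in Counter(labels).values())

def best_split(X, y):
    best = (None, None, float("inf"))               # (feature, threshold, cost J)
    for j in range(len(X[0])):
        values = sorted(set(x[j] for x in X))
        for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:
            left = [yi for xi, yi in zip(X, y) if xi[j] <= t]
            right = [yi for xi, yi in zip(X, y) if xi[j] > t]
            cost = (len(left) * gini(left) + len(right) * gini(right)) / len(y)
            if cost < best[2]:
                best = (j, t, cost)
    return best

def build_tree(X, y, depth=0, max_depth=3):
    node = Node(label=Counter(y).most_common(1)[0][0])
    if gini(y) == 0.0 or depth == max_depth:        # perfectly classified or depth limit reached
        return node
    j, t, cost = best_split(X, y)
    if j is None or cost >= gini(y):                # no split reduces the impurity
        return node
    node.feature, node.threshold = j, t
    left = [i for i, xi in enumerate(X) if xi[j] <= t]
    right = [i for i, xi in enumerate(X) if xi[j] > t]
    node.left = build_tree([X[i] for i in left], [y[i] for i in left], depth + 1, max_depth)
    node.right = build_tree([X[i] for i in right], [y[i] for i in right], depth + 1, max_depth)
    return node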
Example
Example [9+, 5-]
Split on Outlook (branches: Rainy, Overcast, Sunny):
Gini(Rainy) = 1 − (2/5)² − (3/5)² = 0.48
Gini(Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Sunny) = 1 − (3/5)² − (2/5)² = 0.48
J(Outlook) = (5/14)·0.48 + (4/14)·0 + (5/14)·0.48 ≈ 0.34
Example [9+, 5-]
Split on Temperature (branches: Hot, Mild, Cool):
Hot: Gini = 0.5, samples = 4, value = [2+, 2-]
Mild: Gini = 0.44, samples = 6, value = [4+, 2-]
Cool: Gini = 0.5, samples = 4, value = [2+, 2-]
J(Temperature) = 0.47
Example [9+, 5-]
Split on Humidity (branches: High, Normal):
High: Gini = 0.49, samples = 7, value = [3+, 4-]
Normal: Gini = 0.25, samples = 7, value = [6+, 1-]
J(Humidity) = 0.37
[9+, 5-]
Split on Wind (branches: True, False):
True: Gini = 0.5, samples = 6, value = [3+, 3-]
False: Gini = 1 − (6/8)² − (2/8)² = 0.375, samples = 8, value = [6+, 2-]
J(Wind) = (6/14)·0.5 + (8/14)·0.375 ≈ 0.43
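The four attribute costs above can be recomputed from the per-branch class counts with a short helper; a minimal sketch (counts taken from the slides, function names illustrative):

def gini(pos, neg):
    m = pos + neg
    return 1.0 - (pos / m) ** 2 - (neg / m) ** 2

def cost(branches):                        # branches: one (pos, neg) pair per attribute value
    m = sum(p + n for p, n in branches)
    return sum((p + n) / m * gini(p, n) for p, n in branches)

print(cost([(2, 3), (4, 0), (3, 2)]))      # Outlook      ~0.34
print(cost([(2, 2), (4, 2), (2, 2)]))      # Temperature  ~0.47
print(cost([(3, 4), (6, 1)]))              # Humidity     ~0.37
print(cost([(3, 3), (6, 2)]))              # Wind         ~0.43
# Outlook has the lowest weighted Gini, so it is chosen as the root split.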
Example
[9+, 5-]: Outlook has the lowest cost J, so it becomes the root split.
Rainy: Gini = 0.48, samples = 5, value = [2+, 3-]
Overcast: pure node, predict Yes
Sunny: Gini = 0.48, samples = 5, value = [3+, 2-]
Example: Continuous Features
Training data: four patients, [2+, 2-] overall, with features Chest Pain, Good Blood Circulation, Blocked Arteries, Weight (continuous) and label Heart Disease.
One example row: Chest Pain = No, Good Blood Circulation = No, Blocked Arteries = No, Weight = 125, Heart Disease = No.
Candidate splits on the categorical features:
Chest Pain: J(Chest Pain) = 0.333
Good Blood Circulation: Yes: Gini = 0.5, samples = 2, value = [1+, 1-]; No: Gini = 0.5, samples = 2, value = [1+, 1-]; J(Good Blood Circulation) = 0.5
Blocked Arteries: Yes: Gini = 0, samples = 2, value = [2+, 0-]; No: Gini = 0, samples = 2, value = [0+, 2-]; J(Blocked Arteries) = 0
Example: Continuous Features
For the continuous Weight feature, sort the samples by weight (125: No, 167: Yes, 180: Yes, 210: No) and take the midpoints between adjacent values as candidate thresholds: 146, 173.5, 195.
Split on Weight ≤ 146, [2+, 2-]:
Yes: Gini = 0, samples = 1, value = [0+, 1-]
No: Gini = 0.444, samples = 3, value = [2+, 1-]
J(Weight ≤ 146) = 0.333
[2+, 2-]
Split on Weight ≤ 173.5:
Yes: Gini = 0.5, samples = 2, value = [1+, 1-]
No: Gini = 0.5, samples = 2, value = [1+, 1-]
J(Weight ≤ 173.5) = 0.5
Split on Weight ≤ 195:
Yes: Gini = 0.444, samples = 3, value = [2+, 1-]
No: Gini = 0, samples = 1, value = [0+, 1-]
J(Weight ≤ 195) = 0.333
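A short sketch of this threshold search (the data are the four weights from the slides; the function names are illustrative):

def gini(labels):
    m = len(labels)
    return 1.0 - sum((labels.count(c) / m) ** 2 for c in set(labels))

weights = [125, 167, 180, 210]
disease = ["No", "Yes", "Yes", "No"]

values = sorted(weights)
for t in [(a + b) / 2 for a, b in zip(values, values[1:])]:   # candidate thresholds 146, 173.5, 195
    left = [y for w, y in zip(weights, disease) if w <= t]
    right = [y for w, y in zip(weights, disease) if w > t]
    cost = (len(left) * gini(left) + len(right) * gini(right)) / len(weights)
    print(t, round(cost, 3))   # 146 -> 0.333, 173.5 -> 0.5, 195 -> 0.333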
Regression
Decision trees can also be used for regression. Instead of trying to split the training set in a way that minimizes impurity, a regression tree splits the training set in a way that minimizes the Mean Squared Error (MSE):
J(i, t) = (m_left / m) · MSE_left + (m_right / m) · MSE_right, where
MSE_node = (1 / m_node) · Σ_{i ∈ node} (ŷ_node − y⁽ⁱ⁾)²
ŷ_node = (1 / m_node) · Σ_{i ∈ node} y⁽ⁱ⁾
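A minimal sketch of this criterion for a single feature and one candidate threshold; the tiny dataset at the end is made up purely for illustration:

def mse(ys):
    y_hat = sum(ys) / len(ys)                   # the node predicts the mean target
    return sum((y_hat - y) ** 2 for y in ys) / len(ys)

def cost(xs, ys, t):
    left = [y for x, y in zip(xs, ys) if x <= t]
    right = [y for x, y in zip(xs, ys) if x > t]
    m = len(ys)
    return len(left) / m * mse(left) + len(right) / m * mse(right)

xs = [1.0, 2.0, 3.0, 4.0]   # made-up feature values
ys = [1.1, 0.9, 3.0, 3.2]   # made-up targets
print(cost(xs, ys, 2.5))    # low: this split separates the low and high targets
print(cost(xs, ys, 1.5))    # noticeably higher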
Tree Pruning
[Figure: a full tree with internal nodes t1–t6, branch tests labeled 1/0 and leaves labeled +/−, next to the smaller tree obtained after pruning]
Tree Pruning: Algorithm
Tree Pruning: Error Estimate
m: number of overall training samples
E(t) = (m_left / m) · E(t_left) + (m_right / m) · E(t_right)   (estimated error of the subtree rooted at t, weighting each child by its share of the training samples)
D_t = E(t as a leaf) − E(t)   (how much the subtree below t reduces the estimated error; if D_t is small, prune the subtree and turn t into a leaf)
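A hedged sketch of bottom-up pruning based on this estimate; the dict-based tree encoding and the min_gain threshold are assumptions made for illustration, not taken from the slides:

def leaf_error(node):
    # error if the node were turned into a leaf: share of samples outside the majority class
    return 1.0 - max(node["counts"]) / sum(node["counts"])

def subtree_error(node):
    if "left" not in node:                            # already a leaf
        return leaf_error(node)
    m_t = sum(node["counts"])
    return (sum(node["left"]["counts"]) / m_t) * subtree_error(node["left"]) + \
           (sum(node["right"]["counts"]) / m_t) * subtree_error(node["right"])

def prune(node, min_gain=0.01):
    if "left" not in node:
        return node
    node["left"] = prune(node["left"], min_gain)      # prune bottom-up
    node["right"] = prune(node["right"], min_gain)
    d = leaf_error(node) - subtree_error(node)        # D_t from the error estimate above
    if d <= min_gain:                                 # the subtree barely reduces the error
        return {"counts": node["counts"]}             # replace it by a leaf
    return node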