
CMPE 442 Introduction to Machine Learning
• Decision Trees
Decision Trees

 Decision trees can perform:
 Classification
 Regression
 The function is approximated as a tree
 The main advantage is interpretability
Example
A Decision tree

 F: <Outlook, Humidity, Wind, Temp> → Play Golf?


 Each internal node: tests one attribute X_i
 Each branch from a node: selects one value for X_i
 Each leaf node: predicts Y (or P(Y | X ∈ leaf))
Decision Tree Learning

Problem Setting:
 Set of possible instances 𝑋
 Each instance x in X is a feature vector
 x = <x_1, x_2, …, x_n>
 Unknown target function 𝑓: 𝑋 → 𝑌
 Y is discrete valued
 Set of function hypotheses 𝐻 = {ℎ|ℎ: 𝑋 → 𝑌}
 Each hypothesis h is a decision tree
 Trees sort x to a leaf, which assigns y
Input:
 Training examples {<x^(i), y^(i)>} of unknown target function f
Output:
 Hypothesis ℎ ∈ 𝐻 that best approximates target function 𝑓
What we need

 Best split
 A measure of node homogeneity (impurity)

 Impurity Measures:
 Gini Impurity
 Entropy
Best Split
Iris Dataset
Example
Making Predictions

 Suppose you found a new Iris with petal length of 5 and petal width 1.5
 Then our new feature vector is [5, 1.5]
 Which class does it belong to (what is y)?
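A minimal scikit-learn sketch of this prediction, assuming the standard Iris dataset restricted to the two petal features used in the slides; max_depth=2 and random_state=42 are assumptions, not taken from the slides:

```python
from sklearn.datasets import load_iris
from sklearn.tree import DecisionTreeClassifier

iris = load_iris()
X = iris.data[:, 2:]    # petal length and petal width, as in the slides
y = iris.target

# Depth limit and random seed are assumptions for illustration
tree_clf = DecisionTreeClassifier(max_depth=2, random_state=42)
tree_clf.fit(X, y)

# New flower with petal length 5 cm and petal width 1.5 cm
print(iris.target_names[tree_clf.predict([[5, 1.5]])])   # -> ['versicolor']
```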
Meaning of the values

 Remember that the Iris dataset has 150 flowers
 3 classes: Virginica, Setosa and Versicolor
 50 samples have petal length <= 2.45
 100 samples have petal length > 2.45
 54 samples out of 100 have petal width <= 1.75
 The gini attribute measures a node's impurity
 A node is pure (gini=0) if all training instances that it applies to belong to the same class
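These readouts correspond to what scikit-learn displays when the fitted tree is drawn. A short sketch, reusing the tree_clf and iris objects from the snippet above:

```python
import matplotlib.pyplot as plt
from sklearn.tree import plot_tree

# Each node of the plot shows the split test, gini, samples, and value
# (the per-class counts) discussed on this slide.
plot_tree(tree_clf,
          feature_names=["petal length (cm)", "petal width (cm)"],
          class_names=iris.target_names, filled=True)
plt.show()
```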
Gini impurity

 G_i = 1 − Σ_{k=1..n} p_{i,k}²
 p_{i,k} is the ratio of class k instances among the training instances in the i-th node
 Ex:
 The gini score for the depth-1 left node is:
1 − (50/50)² − (0/50)² − (0/50)² = 0
 The gini score for the depth-2 left node is:
1 − (0/54)² − (49/54)² − (5/54)² ≈ 0.168
 The gini score for the depth-2 right node is:
1 − (0/46)² − (1/46)² − (45/46)² ≈ 0.0425
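A small helper that reproduces these scores from per-class counts; the counts [50, 0, 0], [0, 49, 5] and [0, 1, 45] are assumed from the standard depth-2 Iris tree and are consistent with the sample numbers above:

```python
def gini(counts):
    """Gini impurity of a node, given its per-class sample counts."""
    total = sum(counts)
    return 1 - sum((c / total) ** 2 for c in counts)

print(gini([50, 0, 0]))   # depth-1 left node  -> 0.0
print(gini([0, 49, 5]))   # depth-2 left node  -> ~0.168
print(gini([0, 1, 45]))   # depth-2 right node -> ~0.0425
```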
Iris dataset depth 3
Estimating Class Probabilities

 Decision Trees can also estimate the probability that an instance belongs to a particular class
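The estimate is simply the class ratio in the leaf the instance falls into. For example, with the tree_clf fitted earlier:

```python
# Probabilities for the [5, 1.5] flower come from its leaf's class counts
print(tree_clf.predict_proba([[5, 1.5]]))
# -> [[0.         0.90740741 0.09259259]]  i.e. 49/54 versicolor, 5/54 virginica
```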
Algorithm to build a DT

node = Root
main loop:
1. A ← the best decision attribute for the next node
2. Assign A as the decision attribute for node
3. For each value of A, create a new descendant of node
4. Sort training examples to the leaf nodes
5. If the training examples are perfectly classified then STOP. Else, iterate over new leaf nodes

Note: Decision trees use a greedy approach (a sketch of this loop is given below)!
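A rough Python sketch of this greedy loop, assuming categorical features in a 2-D array X, integer class labels y, and a list of attribute (column) indices; weighted_gini and build_tree are illustrative names, and the gini() helper from the earlier snippet is reused:

```python
import numpy as np

def weighted_gini(values, labels):
    """Weighted Gini impurity of a multi-way split on one categorical attribute."""
    m = len(labels)
    return sum((len(labels[values == v]) / m) * gini(np.bincount(labels[values == v]))
               for v in np.unique(values))

def build_tree(X, y, attributes):
    """Greedy tree construction with one branch per attribute value."""
    if len(np.unique(y)) == 1 or not attributes:       # perfectly classified, or no attribute left
        return {"leaf": int(np.bincount(y).argmax())}  # predict the majority class
    best = min(attributes, key=lambda a: weighted_gini(X[:, a], y))  # step 1
    node = {"attribute": best, "children": {}}                       # step 2
    for v in np.unique(X[:, best]):                                  # step 3
        mask = X[:, best] == v                                       # step 4
        node["children"][v] = build_tree(X[mask], y[mask],
                                         [a for a in attributes if a != best])
    return node
```

Each greedy choice is made locally per node and never revisited, which is why the result is not guaranteed to be the globally best tree.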


The CART training algorithm

 Scikit-Learn uses the Classification and Regression Tree (CART) algorithm to train Decision Trees
 It outputs only binary trees
 Idea:
 Split the training set into two subsets using a single feature i and a threshold t_i
 Search for the pair (i, t_i) that produces the purest subsets
 The cost function the algorithm tries to minimize:
 J(i, t_i) = (m_left / m) G_left + (m_right / m) G_right
 Where
 G_left/right measures the impurity of the left/right subset
 m_left/right is the number of instances in the left/right subset
 Stop recursing when the maximum depth is reached, or when no split that reduces impurity can be found.
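A hedged sketch of this cost function and the threshold search for one feature (illustrative helper names, not scikit-learn internals), again reusing the gini() helper:

```python
import numpy as np

def cart_cost(x_col, y, threshold):
    """J(i, t_i): weighted Gini impurity of the binary split x_col <= threshold."""
    left, right = y[x_col <= threshold], y[x_col > threshold]
    if len(left) == 0 or len(right) == 0:        # degenerate split: not a real candidate
        return float("inf")
    m = len(y)
    return (len(left) / m) * gini(np.bincount(left)) + \
           (len(right) / m) * gini(np.bincount(right))

def best_threshold(x_col, y):
    """Brute-force search over the observed values of one feature.
    (scikit-learn actually considers midpoints between sorted values.)"""
    return min(np.unique(x_col), key=lambda t: cart_cost(x_col, y, t))
```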
Example
Example [9+, 5-]
J(Outlook) ≈ 0.34

Outlook
  Rainy:    Gini=0.48, Samples=5, Value=[2+,3-]
  Overcast: Gini=0,    Samples=4, Value=[4+,0-]
  Sunny:    Gini=0.48, Samples=5, Value=[3+,2-]

Gini(Rainy) = 1 − (2/5)² − (3/5)² = 0.48
Gini(Overcast) = 1 − (4/4)² − (0/4)² = 0
Gini(Sunny) = 1 − (3/5)² − (2/5)² = 0.48
J(Outlook) = (5/14)·0.48 + (4/14)·0 + (5/14)·0.48 ≈ 0.34

[9+, 5-]
J(Temperature) ≈ 0.476

Temperature
  Hot:  Gini=0.5,  Samples=4, Value=[2+,2-]
  Mild: Gini=0.44, Samples=6, Value=[4+,2-]
  Cool: Gini=0.5,  Samples=4, Value=[2+,2-]
Example [9+, 5-]
J(Humidity) ≈ 0.37

Humidity
  High:   Gini=0.49, Samples=7, Value=[3+,4-]
  Normal: Gini=0.25, Samples=7, Value=[6+,1-]

[9+, 5-]
J(Wind) ≈ 0.43

Wind
  True:  Gini=0.5,   Samples=6, Value=[3+,3-]
  False: Gini=0.375, Samples=8, Value=[6+,2-]
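A quick check of the four split costs using the gini() helper, feeding in the [positive, negative] counts shown above; Outlook has the lowest cost and is therefore chosen as the root split:

```python
def split_cost(branches):
    """Weighted Gini impurity of a multi-way split, given per-branch
    [positive, negative] counts as read off the slides."""
    m = sum(sum(b) for b in branches)
    return sum(sum(b) / m * gini(b) for b in branches)

print(split_cost([[2, 3], [4, 0], [3, 2]]))   # Outlook     -> ~0.343
print(split_cost([[2, 2], [4, 2], [2, 2]]))   # Temperature -> ~0.476
print(split_cost([[3, 4], [6, 1]]))           # Humidity    -> ~0.367
print(split_cost([[3, 3], [6, 2]]))           # Wind        -> ~0.429
```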
Example

[9+, 5-]

Outlook
  Rainy:    Gini=0.48, Samples=5, Value=[2+,3-]
  Overcast: Yes (pure leaf)
  Sunny:    Gini=0.48, Samples=5, Value=[3+,2-]
Example: Continuous Features
Data:

Chest Pain | Good Blood Circulation | Blocked Arteries | Weight | Heart Disease
No         | No                     | No               | 125    | No
Yes        | Yes                    | Yes              | 180    | Yes
Yes        | Yes                    | No               | 210    | No
Yes        | No                     | Yes              | 167    | Yes

Root: [2+, 2-]

Chest Pain: J(Chest Pain) ≈ 0.333
  Yes: Gini=0.444, Samples=3, Value=[2+,1-]
  No:  Gini=0,     Samples=1, Value=[0+,1-]

Good Blood Circulation: J(Good Blood Circulation) = 0.5
  Yes: Gini=0.5, Samples=2, Value=[1+,1-]
  No:  Gini=0.5, Samples=2, Value=[1+,1-]

Blocked Arteries: J(Blocked Arteries) = 0
  Yes: Gini=0, Samples=2, Value=[2+,0-]
  No:  Gini=0, Samples=2, Value=[0+,2-]
Example: Continuous Features
Sorted weights: 125, 167, 180, 210 → candidate thresholds (midpoints): 146, 173.5, 195

Root: [2+, 2-]

Weight <= 146: J(Weight <= 146) ≈ 0.333
  Yes: Gini=0,     Samples=1, Value=[0+,1-]
  No:  Gini=0.444, Samples=3, Value=[2+,1-]

Weight <= 173.5: J(Weight <= 173.5) = 0.5
  Yes: Gini=0.5, Samples=2, Value=[1+,1-]
  No:  Gini=0.5, Samples=2, Value=[1+,1-]

Weight <= 195: J(Weight <= 195) ≈ 0.333
  Yes: Gini=0.444, Samples=3, Value=[2+,1-]
  No:  Gini=0,     Samples=1, Value=[0+,1-]
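These candidate thresholds and their costs can be reproduced with the cart_cost() helper from the CART sketch above (the Yes/No labels are encoded as 1/0):

```python
import numpy as np

weight = np.array([125, 180, 210, 167])
disease = np.array([0, 1, 0, 1])               # Heart Disease: 1 = Yes, 0 = No

# Candidate thresholds are the midpoints between adjacent sorted values
sorted_w = np.sort(weight)
midpoints = (sorted_w[:-1] + sorted_w[1:]) / 2  # -> [146. , 173.5, 195. ]

for t in midpoints:
    print(t, cart_cost(weight, disease, t))     # ~0.333, 0.5, ~0.333
```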
Regression

 Instead of trying to split the training set in a way that minimizes impurity, it
splits the training set in a way that minimizes the Mean Squared Error (MSE)

MSE_node = (1/m_node) Σ_{i ∈ node} (ŷ_node − y^(i))²

J(i, t_i) = (m_left / m) MSE_left + (m_right / m) MSE_right, where

ŷ_node = (1/m_node) Σ_{i ∈ node} y^(i)
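A minimal scikit-learn sketch of a regression tree on assumed toy data (the noisy quadratic is only an illustration, not from the slides):

```python
import numpy as np
from sklearn.tree import DecisionTreeRegressor

# Synthetic 1-D regression data (assumed example)
rng = np.random.RandomState(42)
X = rng.rand(200, 1)
y = 4 * (X[:, 0] - 0.5) ** 2 + 0.1 * rng.randn(200)

# Splits minimize the weighted MSE cost J(i, t_i); each leaf predicts the
# mean target value of the training samples that reach it.
tree_reg = DecisionTreeRegressor(max_depth=2, random_state=42)
tree_reg.fit(X, y)
print(tree_reg.predict([[0.6]]))
```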

Regression
Tree Pruning

 We want small decision trees:
 Interpretability
 Removal of irrelevant and redundant attributes
 Reduced danger of overfitting
Tree Pruning

 Replace one or more subtrees with leaves
 Label each new leaf with the most common class among its samples
 We aim for accuracy on future examples, not on the training set
 Reasonable pruning often gives better performance
Tree Pruning

[Figure: an unpruned decision tree with internal test nodes t1–t6 and +/− leaves, shown next to a pruned version in which a subtree has been replaced by a single leaf]
Tree Pruning: Algorithm
Tree Pruning: Error Estimate

m − number of overall training samples
m_t − number of samples reaching node t
e_t − number of samples misclassified when node t is replaced by a leaf node

D_t = E_leaf(t) − E_subtree(t)
E_subtree(t) = (m_left / m_t) · E_left + (m_right / m_t) · E_right
E_leaf(t) = (e_t + 1) / (m_t + m)
Tree Pruning
Summary

 The attribute values are tested one at a time
 Many alternative trees can be created; smaller trees are preferred
 Always choose the attribute that conveys maximum purity for the class labels
 To deal with overfitting, introduce pruning
Decision Trees

 Use a greedy algorithm
 It is a heuristic, since it does not guarantee the optimal solution
 Finding an optimal tree is known to be an NP-complete problem
 NP-complete problems are problems whose status is unknown: no polynomial-time algorithm has yet been discovered for any NP-complete problem, nor has anybody been able to prove that no polynomial-time algorithm exists for any of them.
