Chap5 - Machine Learning Part II - Decision Tree
Machine Learning
Part II: Decision Tree and Random Forest
Inspired by the book “Artificial Intelligence: A Modern Approach”
Decision Tree
• The purpose of a decision tree is to allow prediction: to determine the class of a new example from the values of its attributes.
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Trees
A decision tree represents a learned target function:
◦ Each internal node tests an attribute
◦ Each branch corresponds to an attribute value
◦ Each leaf node assigns a classification
◦ Equivalently, a tree can be written as a logical formula (one conjunction of attribute tests per path to a Yes leaf)
(Figure: example decision tree with root Outlook; the branches sunny / overcast / rain lead to Yes/No leaves.)
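To make the prediction step concrete, here is a minimal sketch (assuming scikit-learn and pandas are available; neither is named in the slides) that learns a tree from the tennis table above and classifies a new day. The new_day values are illustrative.

```python
import pandas as pd
from sklearn.tree import DecisionTreeClassifier

# Tennis training examples D1-D14, re-typed from the table above.
df = pd.DataFrame(
    [("Sunny","Hot","High","Weak","No"), ("Sunny","Hot","High","Strong","No"),
     ("Overcast","Hot","High","Weak","Yes"), ("Rain","Mild","High","Weak","Yes"),
     ("Rain","Cool","Normal","Weak","Yes"), ("Rain","Cool","Normal","Strong","No"),
     ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
     ("Sunny","Cool","Normal","Weak","Yes"), ("Rain","Mild","Normal","Weak","Yes"),
     ("Sunny","Mild","Normal","Strong","Yes"), ("Overcast","Mild","High","Strong","Yes"),
     ("Overcast","Hot","Normal","Weak","Yes"), ("Rain","Mild","High","Strong","No")],
    columns=["Outlook", "Temp", "Humidity", "Wind", "Tennis"])

X = pd.get_dummies(df[["Outlook", "Temp", "Humidity", "Wind"]])  # one-hot encode the nominal attributes
y = df["Tennis"]

tree = DecisionTreeClassifier(criterion="entropy").fit(X, y)

# Determine the class of a new example from the values of its attributes.
new_day = pd.DataFrame([{"Outlook": "Sunny", "Temp": "Cool",
                         "Humidity": "High", "Wind": "Strong"}])
new_X = pd.get_dummies(new_day).reindex(columns=X.columns, fill_value=0)
print(tree.predict(new_X))   # e.g. ['No']
```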
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Notes:
◦ The goal attribute “Tennis?” has 2 classes: yes and no.
◦ Temperature is nominal.
◦ We want to be able to decide / predict if a tennis match will take place or not, depending on the weather.
Training Examples
Prior probabilities: P(yes) = 9/14, P(no) = 5/14
Outlook
P(sunny|yes) = 2/9      P(sunny|no) = 3/5
P(overcast|yes) = 4/9   P(overcast|no) = 0
P(rain|yes) = 3/9       P(rain|no) = 2/5
Temp
P(hot|yes) = 2/9        P(hot|no) = 2/5
P(mild|yes) = 4/9       P(mild|no) = 2/5
P(cool|yes) = 3/9       P(cool|no) = 1/5
Humidity
P(high|yes) = 3/9       P(high|no) = 4/5
P(normal|yes) = 6/9     P(normal|no) = 2/5
Wind
P(strong|yes) = 3/9     P(strong|no) = 3/5
P(weak|yes) = 6/9       P(weak|no) = 2/5
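These class-conditional probabilities can be reproduced with a short pandas sketch (pandas is an assumption, not something the slides use); the DataFrame simply re-types the table above.

```python
import pandas as pd

# Tennis training examples D1-D14.
data = pd.DataFrame({
    "Outlook":  ["Sunny","Sunny","Overcast","Rain","Rain","Rain","Overcast",
                 "Sunny","Sunny","Rain","Sunny","Overcast","Overcast","Rain"],
    "Temp":     ["Hot","Hot","Hot","Mild","Cool","Cool","Cool",
                 "Mild","Cool","Mild","Mild","Mild","Hot","Mild"],
    "Humidity": ["High","High","High","High","Normal","Normal","Normal",
                 "High","Normal","Normal","Normal","High","Normal","High"],
    "Wind":     ["Weak","Strong","Weak","Weak","Weak","Strong","Strong",
                 "Weak","Weak","Weak","Strong","Strong","Weak","Strong"],
    "Tennis":   ["No","No","Yes","Yes","Yes","No","Yes",
                 "No","Yes","Yes","Yes","Yes","Yes","No"],
})

# Class priors: P(yes) = 9/14, P(no) = 5/14.
print(data["Tennis"].value_counts(normalize=True))

# Class-conditional probabilities, e.g. P(sunny|yes) = 2/9.
for attr in ["Outlook", "Temp", "Humidity", "Wind"]:
    print(pd.crosstab(data[attr], data["Tennis"], normalize="columns"))
```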
Decision Trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:
Attribute-based representations
Examples described by attribute values (Boolean, discrete, continuous)
E.g., situations where I will/won't wait for a table:
Decision tree learning
There are many possible trees.
How can we actually search this space?
Should we split on Patrons or on Type?
At the root, wait vs. not wait is still a 50/50 split.
Choosing a good attribute
Which attribute is better to split on, X1 or X2?
The quantity of information:
For a node with a proportion Pp of positive and Pn of negative examples:
◦ If Pp = Pn, then I = 1 bit.
◦ If Pp = 1 (or Pn = 0), then I = 0.
An attribute A splits the p positive and n negative examples into v branches, with p = p1 + p2 + … + pv and n = n1 + n2 + … + nv. We choose the attribute for which the expected remaining information E(A) is minimum.
H(p1, p2, …, ps) = − Σ_{i=1}^{s} p_i log2 p_i
                 = Σ_{i=1}^{s} p_i log2 (1 / p_i)
                 = (1 / log 2) Σ_{i=1}^{s} p_i log (1 / p_i)

Only non-zero probabilities are taken into account.
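As a quick check of the formula, a minimal Python sketch of the entropy computation; as noted above, zero-probability terms are simply skipped.

```python
import math

def entropy(probs):
    """H(p1, ..., ps) = -sum(p_i * log2(p_i)) in bits; zero probabilities are skipped."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a 50/50 split is maximally uncertain
print(entropy([1.0, 0.0]))   # 0.0 bits: a pure distribution carries no uncertainty
```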
Information/Entropy
We flip two different coins (18 times each):
Sequence 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Sequence 2 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1
Information/Entropy
Quantifying uncertainty
H(X) = − Σ_{x∈X} p(x) log2 p(x),  X = {0, 1}

Biased coin (Sequence 1):  H(X) = −(8/9) log2(8/9) − (1/9) log2(1/9) ≈ 1/2 bit
Normal coin (Sequence 2):  H(X) = −(4/9) log2(4/9) − (5/9) log2(5/9) ≈ 0.99 bits
Entropy of a Joint Distribution
H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(x, y)

Example:
◦ X = {Raining, Not raining}
◦ Y = {Cloudy, Not cloudy}
◦ Joint distribution: p(Raining, Cloudy) = 24/100, p(Raining, Not cloudy) = 1/100, p(Not raining, Cloudy) = 25/100, p(Not raining, Not cloudy) = 50/100

H(X, Y) = −(24/100) log2(24/100) − (1/100) log2(1/100) − (25/100) log2(25/100) − (50/100) log2(50/100)
H(X, Y) ≈ 1.56 bits
Specific Conditional Entropy
H(Y | X = x) = − Σ_{y∈Y} p(y|x) log2 p(y|x)

We use: p(y|x) = p(x, y) / p(x), with p(x) = Σ_y p(x, y) (the sum over a row of the joint table).

Example:
◦ X = {Raining, Not raining}
◦ Y = {Cloudy, Not cloudy}
For instance, H(Y | X = Raining) = −(24/25) log2(24/25) − (1/25) log2(1/25) ≈ 0.24 bits.
Conditional Entropy
Method 1:
H(Y|X) = Σ_{x∈X} p(x) H(Y | X = x)

H(Y|X) = p(R) H(Y|R) + p(¬R) H(Y|¬R)
       = (25/100) × 0.24 + (75/100) × H(Y|¬R)

We have H(Y|¬R) = − Σ_{y∈Y} p(y|¬R) log2 p(y|¬R) = −(1/3) log2(1/3) − (2/3) log2(2/3) ≈ 0.9183

H(Y|X) = (25/100) × 0.24 + (75/100) × 0.9183 ≈ 0.75 bits
Conditional Entropy
Method 2:
H(Y|X) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log2 p(y|x)

H(Y|X) = −(24/100) log2(24/25) − (1/100) log2(1/25) − (25/100) log2(1/3) − (50/100) log2(2/3)
H(Y|X) ≈ 0.75 bits

where, for example:
p(C|¬R) = p(C, ¬R) / p(¬R) = (25/100) / (75/100) = 1/3
p(¬C|¬R) = p(¬C, ¬R) / p(¬R) = (50/100) / (75/100) = 2/3
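A small standard-library Python sketch that reproduces both results from the joint table above: H(X, Y) ≈ 1.56 bits and H(Y|X) ≈ 0.75 bits.

```python
import math

def H(probs):
    """Entropy in bits of a probability vector (zero entries ignored)."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

# Joint distribution p(x, y) for X = rain status, Y = cloud status (from the slides).
joint = {
    ("Raining", "Cloudy"): 24/100, ("Raining", "Not cloudy"): 1/100,
    ("Not raining", "Cloudy"): 25/100, ("Not raining", "Not cloudy"): 50/100,
}

# Joint entropy H(X, Y) ~= 1.56 bits.
print(H(list(joint.values())))

# Conditional entropy H(Y|X) = sum_x p(x) * H(Y | X = x) ~= 0.75 bits.
h_y_given_x = 0.0
for x in {x for x, _ in joint}:
    p_x = sum(p for (xi, _), p in joint.items() if xi == x)       # marginal p(x)
    cond = [p / p_x for (xi, _), p in joint.items() if xi == x]   # p(y|x)
    h_y_given_x += p_x * H(cond)
print(h_y_given_x)
```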
Conditional Entropy
Some useful properties:
◦ H is always non-negative
Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) × Entropy(S_v)
Decision tree learning
How much information about cloudiness do we get by discovering whether it is raining?
IG(Y|X) = H(Y) − H(Y|X) ≈ 1 − 0.75 = 0.25 bits.
Decision tree construction
Algorithm
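The algorithm figure is not reproduced here; the following is a minimal ID3-style sketch of the idea (greedy splitting on the attribute with the highest information gain). Function names such as info_gain and id3 are illustrative.

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(examples, labels, attr):
    """Entropy(S) minus the weighted entropy of the subsets S_v induced by attr."""
    remainder = 0.0
    for v in set(e[attr] for e in examples):
        sub = [lab for e, lab in zip(examples, labels) if e[attr] == v]
        remainder += len(sub) / len(labels) * entropy(sub)
    return entropy(labels) - remainder

def id3(examples, labels, attrs):
    """Examples are dicts attribute -> value; returns a nested dict or a class label."""
    if len(set(labels)) == 1:
        return labels[0]                                  # pure node: a leaf
    if not attrs:
        return Counter(labels).most_common(1)[0][0]       # no attribute left: majority leaf
    best = max(attrs, key=lambda a: info_gain(examples, labels, a))
    tree = {best: {}}
    for v in set(e[best] for e in examples):
        sub_ex = [e for e in examples if e[best] == v]
        sub_lab = [lab for e, lab in zip(examples, labels) if e[best] == v]
        tree[best][v] = id3(sub_ex, sub_lab, [a for a in attrs if a != best])
    return tree
```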
Back to our example
Patrons or type?
Attribute selection
IG(Type) = 1 − ( (2/12) H(Y|French) + (4/12) H(Y|Thai) + (4/12) H(Y|Burger) + (2/12) H(Y|Italian) )
Each Type value splits its examples 50/50, so every H(Y|·) = 1 and IG(Type) = 0 bits.
Attribute selection
IG(Patrons) = 1 − ( (4/12) H(Y|Some) + (6/12) H(Y|Full) + (2/12) H(Y|None) )
H(Y|Some) = 0
H(Y|Full) ≈ 0.9183
H(Y|None) = 0
IG(Patrons) = 1 − (6/12) × 0.9183 ≈ 0.541 bits
Attribute selection
IG(Patrons) > IG(Type), so Patrons is chosen as the root attribute.
Example 2
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Example 2
Step 1: Calculate IG for the attributes
𝐼𝐺 𝑜𝑢𝑡𝑙𝑜𝑜𝑘 =?
Example 2
H(Y | Outlook = Overcast) = −p(yes|overcast) log2 p(yes|overcast) − p(no|overcast) log2 p(no|overcast)
                          = −1 × log2(1) − 0 = 0 bits

IG(Outlook) = 0.94 − ( (5/14) × 0.971 + (4/14) × 0 + (5/14) × 0.971 )
IG(Outlook) = 0.2466
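The gains of all four attributes can be checked with a short standard-library script; up to rounding, it reproduces the values quoted below (0.2466, 0.029, 0.152, 0.048).

```python
import math
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def info_gain(rows, attr):
    target = [r[-1] for r in rows]                        # "Tennis?" column
    remainder = 0.0
    for v in set(r[attr] for r in rows):
        sub = [r[-1] for r in rows if r[attr] == v]
        remainder += len(sub) / len(rows) * entropy(sub)
    return entropy(target) - remainder

# Columns: Outlook, Temp, Humidity, Wind, Tennis? (rows D1-D14 from the table).
rows = [
    ("Sunny","Hot","High","Weak","No"),      ("Sunny","Hot","High","Strong","No"),
    ("Overcast","Hot","High","Weak","Yes"),  ("Rain","Mild","High","Weak","Yes"),
    ("Rain","Cool","Normal","Weak","Yes"),   ("Rain","Cool","Normal","Strong","No"),
    ("Overcast","Cool","Normal","Strong","Yes"), ("Sunny","Mild","High","Weak","No"),
    ("Sunny","Cool","Normal","Weak","Yes"),  ("Rain","Mild","Normal","Weak","Yes"),
    ("Sunny","Mild","Normal","Strong","Yes"),("Overcast","Mild","High","Strong","Yes"),
    ("Overcast","Hot","Normal","Weak","Yes"),("Rain","Mild","High","Strong","No"),
]

for name, idx in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(info_gain(rows, idx), 4))
# Outlook 0.2467, Temp 0.0292, Humidity 0.1518, Wind 0.0481 -> Outlook is chosen.
```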
Example 2
Similarly:
IG(Temperature) = 0.029
IG(Humidity) = 0.152
IG(Wind) = 0.048
Outlook is chosen
Example 2
Next step: Selection of a second attribute
We can examine Temperature, Humidity or Wind for Outlook = “sunny”:
Gain (“Temp”) = 0.571 bits
Gain (“Humidity”) = 0.971 bits
Gain (“Wind”) = 0.020 bits
Humidity has the highest gain, so it is chosen under Outlook = “sunny”, and we continue…
Example 2
Choice of second attribute
Example 2
Final Decision tree
(Final tree: Outlook at the root; Sunny → Humidity (High: No, Normal: Yes); Overcast → Yes; Rain → Wind (Strong: No, Weak: Yes).)
Exercise
• We consider the following data (the table is not reproduced here): 8 examples described by the attributes Hair, Size, Weight and Solar Cream, with two classes, Sunburned and Tanned.
• Suggest a decision tree that correctly predicts the class.
Solution
IG(A) = H(Y) − Σ_{v ∈ Values(A)} (|S_v| / |S|) × H(Y|v)

IG(Hair) = H(Y) − ( (4/8) H(Y|blond) + (3/8) H(Y|brown) + (1/8) H(Y|red) ),  v ∈ {blond, brown, red}

IG(Size) = 0.954 − ( (3/8) H(Y|AVG) + (2/8) H(Y|big) + (3/8) H(Y|small) )
Solution
IG(Weight) = 0.954 − ( (2/8) H(Y|LW) + (3/8) H(Y|AVG) + (3/8) H(Y|heavy) )
Solution
IG(Cream) = 0.954 − ( (3/8) H(Y|yes) + (5/8) H(Y|no) )
IG(Cream) = 0.954 − ( (3/8) × 0 + (5/8) × 0.97 ) = 0.34 bits
Solution
Attribute IG
Hair 0.454
Size 0.2655
Weight 0.015
Solar Cream 0.34
Hair is chosen
Solution
• So far, the tree is:
(Figure: root node Hair; the brown branch is a pure Tanned leaf, the red branch a pure Sunburned leaf, and the blond branch still needs to be split.)
Solution
• Now we examine the other attributes, taking into consideration blond hair only.
Solution
(Figure: candidate splits of the 4 blond examples on Size, Weight and Solar Cream. Weight: LightWeight → {+, −}, AVG → {+, −}. Solar Cream: Yes → {−, −}, No → {+, +}.)
Solution
IG(A) = H(Y) − Σ_{v ∈ Values(A)} (|S_v| / |S|) × H(Y|v)

IG(Size) = H(Y) − ( (1/4) H(Y|AVG) + (1/4) H(Y|big) + (2/4) H(Y|small) )

IG(Weight) = H(Y) − ( (2/4) H(Y|LW) + (2/4) H(Y|AVG) )
IG(Weight) = 1 − ( (2/4) × 1 + (2/4) × 1 ) = 0
Solution
IG(Cream) = H(Y) − ( (2/4) H(Y|yes) + (2/4) H(Y|no) )
IG(Cream) = 1 − ( (2/4) × 0 + (2/4) × 0 ) = 1
(Final tree: Hair at the root; blond → Solar Cream (Yes: Tanned, No: Sunburned); brown → Tanned; red → Sunburned.)
Random Forest
We must first look at the ensemble learning technique.
Ensemble simply means combining multiple models: a collection of models is used to make predictions rather than a single model.
Bagging: creates different training subsets by sampling the training data with replacement; the final output is based on majority voting. Random Forest is an example.
Random Forest
Boosting: combines weak learners into a strong learner by building models sequentially, so that the final combined model has the highest accuracy.
Bagging
Bagging, also known as Bootstrap Aggregation, serves as the ensemble
technique in the Random Forest algorithm. Here are the steps involved
in Bagging:
Selection of Subset: Bagging starts by choosing a random sample, or
subset, from the entire dataset.
Bootstrap Sampling: each model is then built from one of these samples, called Bootstrap Samples, which are drawn from the original data with replacement. This process is known as row sampling.
Bagging
Majority Voting: The final output is determined by combining the results of all models
through majority voting. The most commonly predicted outcome among the models is
selected.
Aggregation: This step, which involves combining all the results and generating the final
output based on majority voting, is known as aggregation.
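A minimal sketch of these bagging steps (scikit-learn and NumPy are assumptions; the synthetic dataset and the 25 bootstrap samples are illustrative): draw bootstrap samples with replacement, train one tree per sample, and combine predictions by majority voting.

```python
import numpy as np
from collections import Counter
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification

X, y = make_classification(n_samples=200, n_features=8, random_state=0)
rng = np.random.default_rng(0)

models = []
for _ in range(25):                                   # 25 bootstrap samples
    idx = rng.integers(0, len(X), size=len(X))        # row sampling with replacement
    models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))

def bagged_predict(x):
    votes = [int(m.predict(x.reshape(1, -1))[0]) for m in models]
    return Counter(votes).most_common(1)[0][0]        # aggregation by majority voting

print(bagged_predict(X[0]), y[0])
```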
Bagging
The bootstrap samples (Bootstrap sample 01, Bootstrap sample 02, and Bootstrap sample 03) are drawn from the actual data with replacement, which means each sample is unlikely to contain only unique rows: some rows repeat and others are left out.
The models (Model 01, Model 02, and Model 03) built from these bootstrap samples are trained independently, and each produces its own prediction. In the illustration, the Happy emoji has the majority over the Sad emoji, so by majority voting the final output is the Happy emoji.
Algorithm steps
Step 1: In the Random Forest model, a subset of data points and a subset of features are selected for constructing each decision tree. Simply put, n random records and m features are taken from a data set of k records.
Step 2: Individual decision trees are constructed for each
sample.
Step 3: Each decision tree will generate an output.
Step 4: The final output is obtained by majority voting (for classification) or averaging (for regression).
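In practice these steps are packaged in libraries; a minimal scikit-learn sketch (the hyperparameter values are illustrative, not prescribed by the slides):

```python
from sklearn.ensemble import RandomForestClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=500, n_features=10, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

forest = RandomForestClassifier(
    n_estimators=100,      # number of trees
    max_features="sqrt",   # random subset of features considered at each split
    oob_score=True,        # evaluate each tree on the rows it never saw (out-of-bag)
    random_state=0,
).fit(X_train, y_train)

print(forest.oob_score_)              # out-of-bag accuracy estimate
print(forest.score(X_test, y_test))   # held-out accuracy
```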
EXAMPLE
Robust to the curse of dimensionality: since each tree considers only a subset of the features, the effective feature space is reduced.
Parallelization: each tree is built independently from different data and attributes, so the CPU can be fully used to build random forests.
Train-test split: in a random forest we do not have to set aside a separate test set, because each tree never sees roughly one third of the data (its out-of-bag samples), which can be used for evaluation.
Stability: stability arises because the result is based on majority voting / averaging.