
Artificial Intelligence

Machine Learning
Part II: Decision Tree and Random Forest
Inspired by the book “Artificial Intelligence: A Modern Approach”
Decision Tree
• The purpose of a decision tree is to allow
prediction: to determine the class of a new
example from the values of its attributes.

Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
Decision Trees
 Decision tree to represent learned target functions
◦ Each internal node tests an attribute
◦ Each branch corresponds to attribute value
◦ Each leaf node assigns a classification

 Can be represented by logical formulas

Example tree (for the tennis data):
Outlook
├─ sunny    → test Humidity: high → No, normal → Yes
├─ overcast → Yes
└─ rain     → test Wind: strong → No, weak → Yes
Training Examples
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No

Notes:
◦ The goal attribute has 2 classes: yes and no
◦ Temperature is nominal
◦ We want to be able to decide / predict whether a tennis match will take place or not, depending on the weather
Training Examples
P(yes) = 9/14     P(no) = 5/14

Outlook:   P(sunny|yes) = 2/9      P(sunny|no) = 3/5
           P(overcast|yes) = 4/9   P(overcast|no) = 0
           P(rain|yes) = 3/9       P(rain|no) = 2/5
Temp:      P(hot|yes) = 2/9        P(hot|no) = 2/5
           P(mild|yes) = 4/9       P(mild|no) = 2/5
           P(cool|yes) = 3/9       P(cool|no) = 1/5
Humidity:  P(high|yes) = 3/9       P(high|no) = 4/5
           P(normal|yes) = 6/9     P(normal|no) = 1/5
Wind:      P(strong|yes) = 3/9     P(strong|no) = 3/5
           P(weak|yes) = 6/9       P(weak|no) = 2/5
Decision Trees
Problem: decide whether to wait for a table at a restaurant,
based on the following attributes:

1. Alternate: is there an alternative restaurant nearby?


2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

7
Attribute-based representations
 Examples described by attribute values (Boolean, discrete, continuous)
 E.g., situations where I will/won't wait for a table (the original slides show a
table of 12 example situations, 6 positive and 6 negative):
 Classification of examples is positive (T) or negative (F)


 General form for data: a number N of instances, each with attributes
(x1, x2, ..., xd) and a target value y.
Decision trees
 One possible representation for hypotheses
 We imagine someone taking a sequence of decisions.
 E.g., here is the “true” tree for deciding whether to wait:

(tree figure shown in the original slides)

Note: you can use the same attribute more than once.
Decision tree learning
 There are many possible trees.
How can we actually search this space?

 Aim: find a small tree consistent with the training examples

 Idea: (recursively) choose the "most significant" attribute as the root
of the (sub)tree: greedy search.
◦ Start from an empty decision tree
◦ Split on next best attribute
◦ Recurse

 What is the best attribute?
 We use information theory to guide us
10
Choosing a good attribute
 Idea: a good attribute splits the examples into
subsets that are (ideally) "all positive" or "all
negative"

Patrons or Type?
Before splitting, "wait" vs. "not wait" is still a 50/50 split.
Choosing a good attribute
 Which attribute is better to split on, X1 or X2?

 Idea: use the counts at the leaves to define probability distributions,
so we can measure uncertainty
Decision tree Learning algorithm
 The input is Δ, a set of examples labeled positive or negative, and
attributes, the set of attributes that describe the examples.

 The output is a decision tree which classifies the learning examples.

 DTL(Δ, attributes): returns a decision tree.
1. If all examples of Δ are positive, then return true.
2. If all examples of Δ are negative, then return false.
3. If attributes is empty, then return fail.
4. Tree ← a new tree whose root tests A, the best attribute.
5. For each value Ai of A, do
 Δi ← the examples of Δ where A = Ai
 Subtree ← DTL(Δi, attributes − {A})
 Add to Tree a branch labeled Ai leading to Subtree.
6. Return Tree.

13
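A minimal, self-contained Python sketch of this procedure. The assumptions are mine, not from the slides: each example is a dict whose "class" entry is True for positive and False for negative, and the "best" attribute of step 4 is the one with the highest information gain, as defined on the following slides; all names are illustrative.

import math
from collections import Counter

def entropy(labels):
    """Entropy (in bits) of a list of class labels."""
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_attribute(examples, attributes, target="class"):
    """Attribute with the highest information gain on these examples."""
    base = entropy([e[target] for e in examples])
    def gain(a):
        remainder = 0.0
        for v in {e[a] for e in examples}:
            subset = [e[target] for e in examples if e[a] == v]
            remainder += len(subset) / len(examples) * entropy(subset)
        return base - remainder
    return max(attributes, key=gain)

def dtl(examples, attributes, target="class"):
    """Recursive decision-tree learning, following the DTL pseudocode above."""
    labels = [e[target] for e in examples]
    if all(labels):                      # 1. all examples positive
        return True
    if not any(labels):                  # 2. all examples negative
        return False
    if not attributes:                   # 3. no attribute left to test
        return "fail"
    a = best_attribute(examples, attributes, target)     # 4. best attribute A
    tree = {a: {}}
    for v in {e[a] for e in examples}:                   # 5. one branch per value Ai
        subset = [e for e in examples if e[a] == v]
        tree[a][v] = dtl(subset, [x for x in attributes if x != a], target)
    return tree                                          # 6. return the tree

On the tennis data, dtl(examples, ["Outlook", "Temp", "Humidity", "Wind"]) should return a nested dict equivalent to the tree shown earlier (Outlook at the root).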
 The quantity of information:

 We define Pp = the proportion of positive examples in a branch
(= p / (p + n)).
 Pn = the proportion of negative examples in a branch
(= n / (p + n)).

 I(Pp, Pn) = −Pp · log2(Pp) − Pn · log2(Pn)
14
15
If Pp = Pn = 1/2  ⇒  I = 1 (maximum uncertainty).
If Pp = 1 or Pn = 1 (pure branch)  ⇒  I = 0.

For a good split, I should be as small as possible in each branch.

We define the average (weighted) entropy of an attribute A with v values:

   E(A) = Σ_{i=1..v} [ (p_i + n_i) / (p + n) ] · I( p_i / (p_i + n_i), n_i / (p_i + n_i) )

with p = p1 + p2 + … + pv and n = n1 + n2 + … + nv.

The best attribute A has minimal E(A).

We define the information gain:

   Gain(A) = I( p / (p + n), n / (p + n) ) − E(A),  which has to be maximal.
Information/Entropy
 Entropy H(x) measures the amount of uncertainty
in a probability distribution:

 Given probabilities p1, p2, …, ps whose sum is 1, entropy is defined as:

   H(p1, p2, …, ps) = − Σ_{i=1..s} p_i · log2(p_i)
                    =   Σ_{i=1..s} p_i · log2(1 / p_i)
                    =   Σ_{i=1..s} p_i · log(1 / p_i) / log(2)

 Only non-zero probabilities are taken into account (by convention 0 · log 0 = 0).
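A small Python sketch of this definition (helper name of my choosing; probabilities are assumed to be given as a list summing to 1, and zero probabilities are skipped, as noted above):

import math

def entropy(probs):
    """H(p1, ..., ps) = -sum_i p_i * log2(p_i), ignoring zero probabilities."""
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # 1.0 bit: a fair binary choice is maximally uncertain
print(entropy([8/9, 1/9]))   # ~0.5 bit: a heavily biased choice carries little information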
Information/Entropy
 We flip two different coins (18 times each):

Sequence 1 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 1 0 0
Sequence 2 0 1 0 1 0 1 1 1 0 1 0 0 1 1 0 1 0 1

Versus

18
Information/Entropy
 Quantifying uncertainty
   H(X) = − Σ_{x∈X} p(x) · log2 p(x),   with X = {0, 1}

 Biased coin (sequence 1):  H(X) = −(8/9) log2(8/9) − (1/9) log2(1/9) ≈ 1/2
 Normal coin (sequence 2):  H(X) = −(4/9) log2(4/9) − (5/9) log2(5/9) ≈ 0.99
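Plugging the empirical frequencies of the two sequences into the entropy function sketched above reproduces these numbers (verification only):

from math import log2

def H(probs):
    return -sum(p * log2(p) for p in probs if p > 0)

print(round(H([16/18, 2/18]), 3))    # biased coin (sequence 1): ~0.503 bits
print(round(H([8/18, 10/18]), 3))    # normal coin (sequence 2): ~0.991 bits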
Entropy of a Joint Distribution
   H(X, Y) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2 p(x, y)

 Example:
◦ X = {Raining, Not raining}
◦ Y = {Cloudy, Not cloudy}

Joint distribution (over 100 days):
              Cloudy    Not cloudy
Raining       24/100     1/100
Not raining   25/100    50/100

   H(X, Y) = − (24/100) log2(24/100) − (1/100) log2(1/100)
             − (25/100) log2(25/100) − (50/100) log2(50/100)

   H(X, Y) ≈ 1.56 bits
Specific Conditional Entropy
   H(Y | X = x) = − Σ_{y∈Y} p(y | x) · log2 p(y | x)

We consider that:  p(y | x) = p(x, y) / p(x)   and   p(x) = Σ_y p(x, y)   (sum over a row)

 Example:
◦ X = {Raining, Not raining}
◦ Y = {Cloudy, Not cloudy}

 What is the entropy of cloudiness Y, given that it is raining?

   p(C | R)  = p(R, C) / p(R)  = (24/100) / (25/100) = 24/25
   p(¬C | R) = p(R, ¬C) / p(R) = (1/100) / (25/100)  = 1/25

   H(Y | X = R) = − (24/25) log2(24/25) − (1/25) log2(1/25) ≈ 0.24 bits
Conditional Entropy
   H(Y | X) = Σ_{x∈X} p(x) · H(Y | X = x) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2 p(y | x)

We consider that:  p(y | x) = p(x, y) / p(x)   and   p(x) = Σ_y p(x, y)   (sum over a row)

 Example:
◦ X = {Raining, Not raining}
◦ Y = {Cloudy, Not cloudy}

 What is the entropy of cloudiness, given the knowledge of whether or not it is raining?
 We need the entropy of cloudiness in both cases, R and ¬R.
Conditional Entropy
Method 1:

   H(Y | X) = Σ_{x∈X} p(x) · H(Y | X = x)

   H(Y | X) = p(R) · H(Y | R) + p(¬R) · H(Y | ¬R)
   H(Y | X) = (25/100) · 0.24 + (75/100) · H(Y | ¬R)

We have  H(Y | ¬R) = − Σ_{y∈Y} p(y | ¬R) · log2 p(y | ¬R)

   H(Y | ¬R) = − p(C | ¬R) log2 p(C | ¬R) − p(¬C | ¬R) log2 p(¬C | ¬R)
   H(Y | ¬R) = − (1/3) log2(1/3) − (2/3) log2(2/3) ≈ 0.9182

 H(Y | X) = (25/100) · 0.24 + (75/100) · 0.9182 ≈ 0.75 bits
Conditional Entropy
Method 2:

   H(Y | X) = − Σ_{x∈X} Σ_{y∈Y} p(x, y) · log2 p(y | x)

   H(Y | X) = − p(R, C) log2 p(C | R) − p(R, ¬C) log2 p(¬C | R)
              − p(¬R, C) log2 p(C | ¬R) − p(¬R, ¬C) log2 p(¬C | ¬R)

   H(Y | X) = − (24/100) log2(24/25) − (1/100) log2(1/25)
              − (25/100) log2(1/3) − (50/100) log2(2/3)

   H(Y | X) ≈ 0.75 bits

   where  p(C | ¬R)  = p(C, ¬R) / p(¬R)  = (25/100) / (75/100) = 1/3
          p(¬C | ¬R) = p(¬C, ¬R) / p(¬R) = (50/100) / (75/100) = 2/3
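The ≈ 0.75 bits figure can be checked numerically with a short script over the joint table of the example (a verification sketch; the dictionary keys are just labels):

from math import log2

# Joint distribution p(x, y) for X in {R, notR} (raining) and Y in {C, notC} (cloudy).
joint = {("R", "C"): 24/100, ("R", "notC"): 1/100,
         ("notR", "C"): 25/100, ("notR", "notC"): 50/100}

# Marginal p(x): sum over each row of the joint table.
p_x = {x: sum(p for (xx, _), p in joint.items() if xx == x) for x in ("R", "notR")}

# Method 2: H(Y|X) = - sum_{x,y} p(x, y) * log2 p(y|x), with p(y|x) = p(x, y) / p(x).
h_y_given_x = -sum(p * log2(p / p_x[x]) for (x, _), p in joint.items() if p > 0)
print(round(h_y_given_x, 2))   # ~0.75 bits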
Conditional Entropy
 Some useful properties:
◦ H is always non-negative
◦ Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
◦ If X and Y are independent, then X doesn't tell us anything about Y: H(Y|X) = H(Y)
◦ Y tells us everything about itself: H(Y|Y) = 0
◦ By knowing X, we can only decrease uncertainty about Y: H(Y|X) ≤ H(Y)
26
Information Gain (IG)
 The information gain of an attribute A is the reduction in entropy
that one can expect if one partitions the examples on the basis of
this attribute.

   Gain(S, A) = Entropy(S) − Σ_{v∈Values(A)} ( |Sv| / |S| ) · Entropy(Sv)

 Values(A) = set of possible values of attribute A
 Sv = subset of S for which A has value v
 |S| = size of S
 |Sv| = size of Sv
27
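A compact Python sketch of this formula (illustrative names; S is given as a list of class labels and A as the parallel list of that attribute's values). Applied to the rain/cloud example of the next slide, it should give roughly 0.25 bits:

from math import log2
from collections import Counter

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def information_gain(labels, attr_values):
    """Gain(S, A) = Entropy(S) - sum_v |Sv|/|S| * Entropy(Sv)."""
    n = len(labels)
    remainder = 0.0
    for v in set(attr_values):
        subset = [y for y, a in zip(labels, attr_values) if a == v]
        remainder += len(subset) / n * entropy(subset)
    return entropy(labels) - remainder

# Cloudiness (labels) vs. raining (attribute) over 100 days, as in the earlier example.
cloud = ["C"] * 24 + ["nC"] * 1 + ["C"] * 25 + ["nC"] * 50
rain  = ["R"] * 25 + ["nR"] * 75
print(round(information_gain(cloud, rain), 2))   # ~0.25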
Decision tree learning
 How much information about
cloudiness do we get by
discovering whether it is raining?

IG (Y|X) = H(Y) - H(Y|X)


IG (Y|X) = 1 – 0.75
IG (Y|X) ≈ 𝟎. 𝟐𝟓 𝒃𝒊𝒕𝒔

 Also called information gain in Y due to X


 If X is completely uninformative about Y: IG (Y|X ) = 0
 If X is completely informative about Y: IG (Y|X ) = H (Y)
 How can we use this to construct our decision tree?
28
Decision tree construction
 At each level, one must choose:
1. Which variable to split.
2. Possibly where to split it.

 Choose them based on how much information we would gain from the
decision! (choose the attribute that gives the highest gain)

29
Decision tree construction
Algorithm

1. pick an attribute to split at a non-terminal node


2. split examples into groups based on attribute value
3. for each group:
◦ if no examples, then return majority from parent
◦ else if all examples are in same class, then return class
◦ else loop to step 1

30
Back to our example

31
Patrons or type?
Attribute selection

IG(Y|X) = H(Y) − H(Y|X)

 H(Y) = −p(T) · log2 p(T) − p(F) · log2 p(F)
   H(Y) = −(6/12) log2(6/12) − (6/12) log2(6/12) = −(1/2) log2(1/2) − (1/2) log2(1/2)
   H(Y) = 1

 IG(Type) = 1 − ( (2/12) H(Y|fr) + (4/12) H(Y|th) + (4/12) H(Y|Bu) + (2/12) H(Y|it) )
Attribute selection

◦ H(Y|fr) = −p(yes|fr) · log2 p(yes|fr) − p(no|fr) · log2 p(no|fr)

  H(Y|fr) = −(1/2) log2(1/2) − (1/2) log2(1/2) = −log2(1/2) = 1

◦ H(Y|th) = −p(yes|th) · log2 p(yes|th) − p(no|th) · log2 p(no|th)

  H(Y|th) = −(2/4) log2(2/4) − (2/4) log2(2/4) = −log2(1/2) = 1
Attribute selection

◦ H(Y|Bu) = −p(yes|Bu) · log2 p(yes|Bu) − p(no|Bu) · log2 p(no|Bu)

  H(Y|Bu) = −(2/4) log2(2/4) − (2/4) log2(2/4) = −log2(1/2) = 1

◦ H(Y|it) = −p(yes|it) · log2 p(yes|it) − p(no|it) · log2 p(no|it)

  H(Y|it) = −(1/2) log2(1/2) − (1/2) log2(1/2) = −log2(1/2) = 1
Attribute selection
   IG(Type) = 1 − ( (2/12) H(Y|fr) + (4/12) H(Y|th) + (4/12) H(Y|Bu) + (2/12) H(Y|it) )
   IG(Type) = 1 − ( (2/12) · 1 + (4/12) · 1 + (4/12) · 1 + (2/12) · 1 ) = 0

   IG(Patrons) = 1 − ( (4/12) H(Y|some) + (6/12) H(Y|full) + (2/12) H(Y|none) )
   H(Y|some) = 0
   H(Y|full) ≈ 0.9183
   H(Y|none) = 0

   IG(Patrons) = 1 − (6/12) · 0.9183 = 1 − 0.459 ≈ 0.541
35
Attribute selection
IG(Patrons) > IG(Type)

We select Patrons as the root of the tree.

36
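These two gains can be re-derived directly from the class counts per attribute value read off the slide's figures (a verification sketch; the 12 restaurant examples themselves are not reproduced in this text):

from math import log2

def I(pos, neg):
    """Entropy I(p/(p+n), n/(p+n)) of a node containing pos/neg examples."""
    total = pos + neg
    return -sum(c / total * log2(c / total) for c in (pos, neg) if c > 0)

def gain(branches, total=12):
    """branches: (positive, negative) counts for each value of the attribute."""
    return I(6, 6) - sum((p + n) / total * I(p, n) for p, n in branches)

# Patrons: None (0 yes, 2 no), Some (4, 0), Full (2, 4)  ->  IG ~ 0.541
print(round(gain([(0, 2), (4, 0), (2, 4)]), 3))
# Type: French (1, 1), Italian (1, 1), Thai (2, 2), Burger (2, 2)  ->  IG = 0.0
print(round(gain([(1, 1), (1, 1), (2, 2), (2, 2)]), 3))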
Example 2
Day Outlook Temp Humidity Wind Tennis?
D1 Sunny Hot High Weak No
D2 Sunny Hot High Strong No
D3 Overcast Hot High Weak Yes
D4 Rain Mild High Weak Yes
D5 Rain Cool Normal Weak Yes
D6 Rain Cool Normal Strong No
D7 Overcast Cool Normal Strong Yes
D8 Sunny Mild High Weak No
D9 Sunny Cool Normal Weak Yes
D10 Rain Mild Normal Weak Yes
D11 Sunny Mild Normal Strong Yes
D12 Overcast Mild High Strong Yes
D13 Overcast Hot Normal Weak Yes
D14 Rain Mild High Strong No
37
Example 2
 Step 1: Calculate IG for the attributes
 IG(outlook) = ?

 H(Y) = −p(yes) · log2 p(yes) − p(no) · log2 p(no)
      = −(9/14) log2(9/14) − (5/14) log2(5/14) = 0.94 bits

 H(Y | outlook = sunny) = −p(yes|sunny) · log2 p(yes|sunny) − p(no|sunny) · log2 p(no|sunny)
      = −(2/5) log2(2/5) − (3/5) log2(3/5) = 0.971 bits
38
Example 2
 H(Y | outlook = overcast) = −p(yes|overcast) · log2 p(yes|overcast) − p(no|overcast) · log2 p(no|overcast)
      = −1 · log2(1) − 0 = 0 bits

 H(Y | outlook = rain) = −p(yes|rain) · log2 p(yes|rain) − p(no|rain) · log2 p(no|rain)
      = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.971 bits

 IG(outlook) = 0.94 − ( (5/14) · 0.971 + (4/14) · 0 + (5/14) · 0.971 )
 IG(outlook) = 0.2466
39
Example 2
 Similarly:

 IG(temperature) = 0.029
 IG(humidity) = 0.152
 IG(wind) = 0.048

Outlook is chosen

40
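The four gains can be reproduced by a short script over the 14 rows of the table above (verification sketch only):

from math import log2
from collections import Counter

# (Outlook, Temp, Humidity, Wind, Tennis?) for D1..D14, copied from the table above.
data = [
    ("Sunny", "Hot", "High", "Weak", "No"),      ("Sunny", "Hot", "High", "Strong", "No"),
    ("Overcast", "Hot", "High", "Weak", "Yes"),  ("Rain", "Mild", "High", "Weak", "Yes"),
    ("Rain", "Cool", "Normal", "Weak", "Yes"),   ("Rain", "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"), ("Sunny", "Mild", "High", "Weak", "No"),
    ("Sunny", "Cool", "Normal", "Weak", "Yes"),  ("Rain", "Mild", "Normal", "Weak", "Yes"),
    ("Sunny", "Mild", "Normal", "Strong", "Yes"), ("Overcast", "Mild", "High", "Strong", "Yes"),
    ("Overcast", "Hot", "Normal", "Weak", "Yes"), ("Rain", "Mild", "High", "Strong", "No"),
]

def entropy(labels):
    n = len(labels)
    return -sum(c / n * log2(c / n) for c in Counter(labels).values())

def gain(col):
    labels = [row[-1] for row in data]
    remainder = 0.0
    for v in {row[col] for row in data}:
        sub = [row[-1] for row in data if row[col] == v]
        remainder += len(sub) / len(data) * entropy(sub)
    return entropy(labels) - remainder

for name, col in [("Outlook", 0), ("Temp", 1), ("Humidity", 2), ("Wind", 3)]:
    print(name, round(gain(col), 3))
# Expected: Outlook ~0.247, Temp ~0.029, Humidity ~0.152, Wind ~0.048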
Example 2
 Next step: Selection of a second attribute

 We can examine:
 Temperature, Humidity or Windy for Outlook = “sunny”
 Gain (“Temp”) = 0.571 bits
 Gain (“Humidity”) = 0.971 bits
 Gain (“Wind”) = 0.020 bits

 And we continue…

41
Example 2
 Choice of the second attribute: Humidity (the highest gain under Outlook = sunny)

42
Example 2
 Final decision tree: the same tree as shown earlier (Outlook at the root, Humidity tested under sunny, Wind tested under rain)

43
Exercise
• We consider the following data:
• Suggest a decision tree that correctly predicts the class.

Hair   Size     Weight       Solar cream   Class
blond  Average  Lightweight  No            1 = sunburn
blond  Big      Average      Yes           0 = tanned
brown  Small    Average      Yes           0 = tanned
blond  Small    Average      No            1 = sunburn
red    Average  Heavy        No            1 = sunburn
brown  Big      Heavy        No            0 = tanned
brown  Average  Heavy        No            0 = tanned
blond  Small    Lightweight  Yes           0 = tanned
Solution
• Step 1: Choose the attribute to split on
• Sunburn = +
• Tanned = -
Class counts per attribute value:

Hair:        Blond: + + - -     Brown: - - -     Red: +
Size:        Average: + + -     Big: - -         Small: + - -
Weight:      Lightweight: + -   Average: + - -   Heavy: + - -
Solar cream: Yes: - - -         No: + + + - -

IG = ? for each attribute
Solution
   IG(A) = H(Y) − Σ_{v∈Values(A)} ( |Sv| / |S| ) · H(Y | v)

 H(Y) = −p(Sunburn) · log2 p(Sunburn) − p(Tanned) · log2 p(Tanned)
      = −(3/8) log2(3/8) − (5/8) log2(5/8) = 0.954

 IG(Hair) = H(Y) − ( (4/8) · H(Y|blond) + (3/8) · H(Y|brown) + (1/8) · H(Y|red) ),
   with v ∈ {blond, brown, red}

 H(Y|blond) = −p(SB|blond) · log2 p(SB|blond) − p(Tan|blond) · log2 p(Tan|blond)
            = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1

 H(Y|brown) = −p(SB|brown) · log2 p(SB|brown) − p(Tan|brown) · log2 p(Tan|brown)
            = −0 · log2(0) − (3/3) log2(3/3) = 0
Solution
 H(Y|red) = −p(SB|red) · log2 p(SB|red) − p(Tan|red) · log2 p(Tan|red)
          = −1 · log2(1) − 0 · log2(0) = 0

 IG(Hair) = 0.954 − ( (4/8) · 1 + (3/8) · 0 + (1/8) · 0 ) = 0.954 − 0.5 = 0.454 bits

 IG(Size) = 0.954 − ( (3/8) · H(Y|AVG) + (2/8) · H(Y|big) + (3/8) · H(Y|small) )

 H(Y|AVG) = −p(SB|AVG) · log2 p(SB|AVG) − p(Tan|AVG) · log2 p(Tan|AVG)
          = −(2/3) log2(2/3) − (1/3) log2(1/3) = 0.918

 H(Y|big) = −p(SB|big) · log2 p(SB|big) − p(Tan|big) · log2 p(Tan|big)
          = −0 · log2(0) − 1 · log2(1) = 0

 H(Y|small) = −p(SB|small) · log2 p(SB|small) − p(Tan|small) · log2 p(Tan|small)
            = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918

 IG(Size) = 0.954 − ( (3/8) · 0.918 + (2/8) · 0 + (3/8) · 0.918 ) = 0.2655 bits

47
Solution
 IG(Weight) = 0.954 − ( (2/8) · H(Y|LW) + (3/8) · H(Y|AVG) + (3/8) · H(Y|Heavy) )

 H(Y|LW) = −p(SB|LW) · log2 p(SB|LW) − p(Tan|LW) · log2 p(Tan|LW)
         = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1

 H(Y|AVG) = −p(SB|AVG) · log2 p(SB|AVG) − p(Tan|AVG) · log2 p(Tan|AVG)
          = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918

 H(Y|Heavy) = −p(SB|Heavy) · log2 p(SB|Heavy) − p(Tan|Heavy) · log2 p(Tan|Heavy)
            = −(1/3) log2(1/3) − (2/3) log2(2/3) = 0.918

 IG(Weight) = 0.954 − ( (2/8) · 1 + (3/8) · 0.918 + (3/8) · 0.918 ) = 0.015 bits
Solution
 IG(Cream) = 0.954 − ( (3/8) · H(Y|yes) + (5/8) · H(Y|no) )

 H(Y|yes) = −p(SB|yes) · log2 p(SB|yes) − p(Tan|yes) · log2 p(Tan|yes)
          = −0 · log2(0) − (3/3) log2(3/3) = 0

 H(Y|no) = −p(SB|no) · log2 p(SB|no) − p(Tan|no) · log2 p(Tan|no)
         = −(3/5) log2(3/5) − (2/5) log2(2/5) = 0.97

 IG(Cream) = 0.954 − ( (3/8) · 0 + (5/8) · 0.97 ) = 0.34 bits
49
Solution
Attribute     IG
Hair          0.454
Size          0.2655
Weight        0.015
Solar Cream   0.34

Hair has the highest IG ⇒ Hair is chosen.

50
Solution
• So far, the tree is:
Hair
├─ Blond → ? (to be refined)
├─ Brown → Tanned
└─ Red   → Sunburned
51
Solution
• Now we examine the other attributes, taking into
consideration blond hair only

Size     Weight       Solar cream   Class
Average  Lightweight  No            1 = sunburn
Big      Average      Yes           0 = tanned
Small    Average      No            1 = sunburn
Small    Lightweight  Yes           0 = tanned
52
Solution

Class counts per attribute value within the blond subset (+ = sunburn, - = tanned):

Size:        Average: +         Big: -     Small: + -
Weight:      Lightweight: + -   Average: + -
Solar cream: Yes: - -           No: + +
53
Solution
   IG(A) = H(Y) − Σ_{v∈Values(A)} ( |Sv| / |S| ) · H(Y | v)

 H(Y) = −p(Sunburn) · log2 p(Sunburn) − p(Tanned) · log2 p(Tanned)
      = −(2/4) log2(2/4) − (2/4) log2(2/4) = 1

 IG(Size) = H(Y) − ( (1/4) · H(Y|AVG) + (1/4) · H(Y|big) + (2/4) · H(Y|small) )

 H(Y|AVG) = −p(SB|AVG) · log2 p(SB|AVG) − p(Tan|AVG) · log2 p(Tan|AVG)
          = −1 · log2(1) − 0 · log2(0) = 0

 H(Y|big) = −p(SB|big) · log2 p(SB|big) − p(Tan|big) · log2 p(Tan|big)
          = −0 · log2(0) − 1 · log2(1) = 0

 H(Y|small) = −p(SB|small) · log2 p(SB|small) − p(Tan|small) · log2 p(Tan|small)
            = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1
Solution
 IG(Size) = 1 − ( (1/4) · 0 + (1/4) · 0 + (2/4) · 1 ) = 0.5

 IG(Weight) = H(Y) − ( (2/4) · H(Y|LW) + (2/4) · H(Y|AVG) )

 H(Y|LW) = −p(SB|LW) · log2 p(SB|LW) − p(Tan|LW) · log2 p(Tan|LW)
         = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1

 H(Y|AVG) = −p(SB|AVG) · log2 p(SB|AVG) − p(Tan|AVG) · log2 p(Tan|AVG)
          = −(1/2) log2(1/2) − (1/2) log2(1/2) = 1

 IG(Weight) = 1 − ( (2/4) · 1 + (2/4) · 1 ) = 0
55
Solution
 IG(Cream) = H(Y) − ( (2/4) · H(Y|yes) + (2/4) · H(Y|no) )

 H(Y|yes) = −p(SB|yes) · log2 p(SB|yes) − p(Tan|yes) · log2 p(Tan|yes)
          = −0 · log2(0) − (2/2) log2(2/2) = 0

 H(Y|no) = −p(SB|no) · log2 p(SB|no) − p(Tan|no) · log2 p(Tan|no)
         = −(2/2) log2(2/2) − 0 · log2(0) = 0

 IG(Cream) = 1 − ( (2/4) · 0 + (2/4) · 0 ) = 1

Attribute     IG
Size          0.5
Weight        0
Solar Cream   1

Solar Cream has the highest IG ⇒ Solar Cream is chosen.
Solution
• The final tree is:
Hair
├─ Blond → Solar Cream?
│           ├─ Yes → Tanned
│           └─ No  → Sunburned
├─ Brown → Tanned
└─ Red   → Sunburned
57
Random Forest
 We must first look into the ensemble learning
technique.
 Ensemble simply means combining multiple models:
a collection of models is used to make predictions
rather than an individual model.
 Bagging: it creates different training subsets from
the sample training data with replacement, and the final
output is based on majority voting. Random Forest is
an example of bagging.
Random Forest
 Boosting: It combines weak learners into
strong learners by creating sequential
models such that the final model has the
highest accuracy.
Bagging
 Bagging, also known as Bootstrap Aggregation, serves as the ensemble
technique in the Random Forest algorithm. Here are the steps involved
in Bagging:
 Selection of Subset: Bagging starts by choosing a random sample, or
subset, from the entire dataset.
 Bootstrap Sampling: Each model is then created from these samples,
called Bootstrap Samples, which are taken from the original data with
replacement. This process is known as row sampling.
Bagging

 Bootstrapping: The step of row sampling with replacement is referred to as bootstrapping.

 Independent Model Training: Each model is trained independently on its corresponding
Bootstrap Sample. This training process generates results for each model.

 Majority Voting: The final output is determined by combining the results of all models
through majority voting. The most commonly predicted outcome among the models is
selected.

 Aggregation: This step, which involves combining all the results and generating the final
output based on majority voting, is known as aggregation.
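A minimal sketch of these four steps (bootstrap sampling, independent training, majority voting, aggregation), assuming NumPy arrays X, y with integer class labels and using scikit-learn decision trees as the base models; names and the number of models are illustrative, not from the slides:

import numpy as np
from sklearn.tree import DecisionTreeClassifier

def bagging_fit(X, y, n_models=3, seed=0):
    """Train n_models trees, each on a bootstrap sample (row sampling with replacement)."""
    rng = np.random.default_rng(seed)
    models = []
    for _ in range(n_models):
        idx = rng.integers(0, len(X), size=len(X))   # bootstrap sample: rows may repeat
        models.append(DecisionTreeClassifier().fit(X[idx], y[idx]))
    return models

def bagging_predict(models, X):
    """Aggregation: majority vote over the predictions of the individual models."""
    preds = np.array([m.predict(X) for m in models])        # shape: (n_models, n_samples)
    # np.bincount assumes non-negative integer class labels.
    return np.array([np.bincount(col).argmax() for col in preds.T])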
Bagging
 The bootstrap samples are taken from the actual data (Bootstrap
sample 01, Bootstrap sample 02, and Bootstrap sample 03)
with replacement, which means each sample is likely to contain
duplicate records rather than only unique data.
 The model (Model 01, Model 02, and Model 03) obtained
from this bootstrap sample is trained independently. Each
model generates results as shown. Now the Happy emoji
has a majority when compared to the Sad emoji. Thus based
on majority voting final output is obtained as Happy emoji.
Algorithm steps
 Step 1: In the Random forest model, a subset of data points
and a subset of features is selected for constructing each
decision tree. Simply put, n random records and m features
are taken from the data set having k number of records.
 Step 2: Individual decision trees are constructed for each
sample.
 Step 3: Each decision tree will generate an output.
 Step 4: Final output is considered based on Majority Voting
or Averaging for Classification and regression, respectively.
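In practice these steps are already packaged, for instance in scikit-learn's RandomForestClassifier; a minimal usage sketch on a toy dataset (parameter values are illustrative, not prescribed by the slides):

from sklearn.datasets import load_iris
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# n_estimators = number of trees; max_features = size of the random feature subset per split.
forest = RandomForestClassifier(n_estimators=100, max_features="sqrt",
                                oob_score=True, random_state=0)
forest.fit(X_train, y_train)

print(forest.score(X_test, y_test))   # accuracy of the majority-vote prediction
print(forest.oob_score_)              # out-of-bag estimate (rows unseen by each tree)

The out-of-bag score relates to the "Train-Test split" feature listed later: each tree can be evaluated on the rows left out of its own bootstrap sample.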
EXAMPLE

Consider the fruit basket as the data as shown in the figure


below. Now n number of samples are taken from the fruit
basket, and an individual decision tree is constructed for
each sample. Each decision tree will generate an output, as
shown in the figure.
The final output is considered based on majority voting. You
can see that the majority decision tree gives output as an
apple when compared to a banana, so the final output is
taken as an apple.
Features of Random Forest.
 Diversity: Not all attributes/variables/features are considered while making an
individual tree; each tree is different.

 Reduced sensitivity to the curse of dimensionality: since each tree does not consider all the
features, the effective feature space per tree is smaller.

 Parallelization: Each tree is created independently out of different data and attributes.
This means we can fully use the CPU to build random forests.

 Train-Test split: In a random forest, we don’t have to segregate the data for train and
test, as roughly a third of the rows (the out-of-bag samples) are never seen by a given tree
and can serve as its test set.

 Stability: Stability arises because the result is based on majority voting/ averaging.
