Module 8: Decision Tree and Random Forest
▪ The decision tree training algorithm
▪ Overfitting in the decision tree model
▪ Ensemble learning and random forest model
Decision Tree and Random Forest Model
[Figure: the example training set of red (R) and blue (B) points plotted in the (x1, x2) plane, with a candidate threshold a on x1 and a candidate threshold b on x2.]
• A learnt decision tree consists of a root node, multiple decision nodes, and leaf nodes.
• Each decision node specifies a test of some feature of the instance.
• Each branch descending from a decision node corresponds to one of the possible values for the feature.
• Each leaf node represents a classification of the instance.
• An instance is classified by starting at the root node of the tree, testing the feature specified by this node, then moving down the tree branch corresponding to the value of the feature in the instance.
• This process is then repeated for the subtree rooted at the new node until a leaf node is reached (see the sketch after the figure).

[Figure: an example decision tree. The root tests x2 < b? (yes: leaf R; no: test x1 < a?, where yes gives leaf R and no gives leaf B), with a = 4.5 and b = 6.5. Exercise: classify the test instance x = (4, 7) using this decision tree.]
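The root-to-leaf walk just described can be written as a few lines of Python. This is a minimal sketch of the specific tree in the figure, with the thresholds a = 4.5 and b = 6.5 hard-coded; the function name classify is illustrative, not part of the slides.

def classify(x1, x2, a=4.5, b=6.5):
    """Classify a point with the example tree: the root tests x2 < b, the 'no' branch tests x1 < a."""
    if x2 < b:        # root decision node
        return "R"    # leaf on the "yes" branch
    if x1 < a:        # decision node on the "no" branch
        return "R"
    return "B"

print(classify(4, 7))  # the test instance x = (4, 7) ends in the leaf "R"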
• Given a training data set D = {(x_i, y_i)}, i = 1, 2, …, N, how can we learn a decision tree from the data? Consider Table DT1, the PlayTennis dataset:

Days  Outlook   Temperature  Humidity  Wind    PlayTennis?
D1    Sunny     Hot          High      Weak    No
D2    Sunny     Hot          High      Strong  No
D3    Overcast  Hot          High      Weak    Yes
D4    Rain      Mild         High      Weak    Yes
D5    Rain      Cool         Normal    Weak    Yes
D6    Rain      Cool         Normal    Strong  No
D7    Overcast  Cool         Normal    Strong  Yes
D8    Sunny     Mild         High      Weak    No
D9    Sunny     Cool         Normal    Weak    Yes
D10   Rain      Mild         Normal    Weak    Yes
D11   Sunny     Mild         Normal    Strong  Yes
D12   Overcast  Mild         High      Strong  Yes
D13   Overcast  Hot          Normal    Weak    Yes
D14   Rain      Mild         High      Strong  No
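For the sketches in the rest of this module, Table DT1 can be typed in directly as Python data. The names play_tennis and ATTRS below are assumptions of this sketch, not part of the original slides.

# Each row is (Outlook, Temperature, Humidity, Wind, PlayTennis), in the order D1..D14.
play_tennis = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
ATTRS = ("Outlook", "Temperature", "Humidity", "Wind")  # column indices 0..3; the label is the last element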
Decision Tree Learning Algorithm
▪ Given the training data set, how can we construct the decision tree?
• Which feature should be selected for the root node test? Why?

[Figure: the example data set partitioned in two ways, once by a threshold on x1 and once by a threshold on x2; the two candidate splits produce child sets of different purity.]
▪ What is a good quantitative measure of the classification capability of a feature? How about the reduction of the impurity of the sets?
▪ What is a good quantitative measure of the impurity of the sets?
▪ Entropy as a measure of impurity of a collection of samples (binary sets):
Given a collection S containing positive (⊕) and negative (⊖) examples of some target concept, the entropy of S relative to this Boolean classification is

  Entropy(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S. In the entropy calculation, we define 0 log₂ 0 to be 0.
▪ Let's verify these two facts:
  • The entropy of the set is zero when p⊕ = 0 or 1.0 (no impurity).
  • The entropy of the set is 1 when p⊕ = 0.5 (maximum impurity).
Entropy is a measure of uncertainty or impurity of a set.
▪ Entropy as a measure of impurity of a collection of samples (multi-class sets):
Given a collection S containing examples of K classes, the entropy of S relative to this multi-class classification is

  Entropy(S) = −Σ_{i=1}^{K} p_i log₂ p_i

where p_i is the proportion of examples in S belonging to class i. As before, we define 0 log₂ 0 to be 0.
▪ Let's evaluate the entropy of the set in the figure (an evenly split set, p_R = p_B = 0.5):

  Entropy = −0.5 × log₂ 0.5 − 0.5 × log₂ 0.5 = 1
▪ Let's evaluate the entropy of the set in the figure:

  p_R = 7/8;  p_B = 1/8
  Entropy = −(7/8) × log₂(7/8) − (1/8) × log₂(1/8) = 0.5436
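A small helper makes these entropy values easy to check numerically. This is a sketch that assumes the labels of a set are given as a plain Python list of class names.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels. Classes with zero count never appear in the
    Counter, so the 0 * log2(0) = 0 convention holds automatically."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["R"] * 8 + ["B"] * 8))  # 1.0     (p_R = p_B = 0.5)
print(entropy(["R"] * 7 + ["B"] * 1))  # 0.5436  (p_R = 7/8, p_B = 1/8)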
• Calculate the entropy of the given training data set in Table DT1 (9 Yes, 5 No):

  p_yes = 9/14;  p_no = 5/14
  Entropy(D) = −(9/14) × log₂(9/14) − (5/14) × log₂(5/14) = 0.940
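Reusing the hypothetical entropy() helper and the play_tennis rows sketched earlier, the same value can be checked in one line.

print(entropy([row[-1] for row in play_tennis]))  # 9 Yes, 5 No  ->  ≈ 0.940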
Entropy and Information Gain
• The information gain of a feature is simply the expected reduction in entropy caused by partitioning the examples according to this feature (attribute).
• The information gain Gain(S, A) of an attribute (feature) A, relative to a collection of examples S, is defined as

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v, i.e., S_v = {s ∈ S | A(s) = v}.
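The definition translates directly into code. The sketch below assumes the row format and the entropy() helper introduced earlier, with attributes addressed by column index.

def information_gain(rows, attr_index):
    """Gain(S, A): Entropy(S) minus the weighted entropy of the partition induced by attribute A."""
    n = len(rows)
    total = entropy([r[-1] for r in rows])
    weighted = 0.0
    for v in set(r[attr_index] for r in rows):                 # Values(A)
        subset = [r[-1] for r in rows if r[attr_index] == v]   # labels of S_v
        weighted += (len(subset) / n) * entropy(subset)        # (|S_v| / |S|) * Entropy(S_v)
    return total - weighted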
▪ Consider the data set in the figure. The two attributes are x1 and x2.
▪ Let's assume both attributes take categorical values, i.e., x1 > a or x1 < a, and x2 > b or x2 < b (with a = 4.5 and b = 6.5 as in the figure).

  Entropy(S_{x1>a}) = −(3/4) × log₂(3/4) − (1/4) × log₂(1/4) = 0.811;
  Entropy(S_{x1<a}) = 0.0

  Gain(S, x1) = Entropy(S) − (|S_{x1>a}|/|S|) Entropy(S_{x1>a}) − (|S_{x1<a}|/|S|) Entropy(S_{x1<a})
              = 0.5436 − (1/2) × 0.811 − (1/2) × 0.0 = 0.1381

[Figure: the data set partitioned by the test on x1 at a = 4.5.]
▪ Let's calculate Gain(S, x2):

  Values(x2) = {"> b", "< b"};
  S = {56R, 8B},  S_{x2>b} = {8R, 8B},  S_{x2<b} = {48R, 0B}
  Entropy(S_{x2>b}) = 1.0;  Entropy(S_{x2<b}) = 0.0

  Gain(S, x2) = Entropy(S) − (|S_{x2>b}|/|S|) Entropy(S_{x2>b}) − (|S_{x2<b}|/|S|) Entropy(S_{x2<b})
              = 0.5436 − (1/4) × 1.0 − (3/4) × 0.0 = 0.2936

▪ Which feature do you choose for the root node?

[Figure: the data set partitioned by the test on x2 at b = 6.5.]
• Consider the PlayTennis training data set D (Table DT1). One of the attributes is Wind, which has the values Weak or Strong. Let's calculate Gain(S, Wind):

  Values(Wind) = {Weak, Strong};
  S = {9+, 5−},  S_weak = {6+, 2−},  S_strong = {3+, 3−}
  Entropy(S_weak) = −(6/8) × log₂(6/8) − (2/8) × log₂(2/8) = 0.811;
  Entropy(S_strong) = 1.0

  Gain(S, Wind) = Entropy(S) − (|S_weak|/|S|) Entropy(S_weak) − (|S_strong|/|S|) Entropy(S_strong)
                = 0.940 − (8/14) × 0.811 − (6/14) × 1.0 = 0.048
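With the helpers sketched earlier, the Wind calculation reduces to a single call (Wind is column index 3 in the assumed row layout).

print(information_gain(play_tennis, 3))  # Gain(S, Wind) ≈ 0.048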
• Consider the PlayTennis training data set D (Table DT1). Let's calculate the information gain of the attribute Outlook, which has the values Sunny, Overcast and Rain:

  Values(Outlook) = {Sunny, Overcast, Rain};
  S = {9+, 5−},  S_sunny = {2+, 3−},  S_oc = {4+, 0−},  S_rain = {3+, 2−}
  Entropy(S_sunny) = 0.971;  Entropy(S_oc) = 0.0;  Entropy(S_rain) = 0.971

  Gain(S, Outlook) = Entropy(S) − (|S_sunny|/|S|) Entropy(S_sunny) − (|S_oc|/|S|) Entropy(S_oc) − (|S_rain|/|S|) Entropy(S_rain)
                   = 0.940 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971 = 0.246
▪ Similarly, the information gain of all four attributes can be calculated: Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, and Gain(S, Temperature) = 0.029.
▪ Since the attribute Outlook has the highest information gain, it is chosen as the decision attribute for the root node.
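All four gains can be reproduced with a short loop over the assumed ATTRS tuple; Outlook comes out on top, which is why it becomes the root test.

for i, name in enumerate(ATTRS):
    print(name, round(information_gain(play_tennis, i), 3))
# Outlook 0.246, Temperature 0.029, Humidity 0.151, Wind 0.048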
• Since the attribute Outlook had the highest information gain, it is chosen as the decision attribute for the root node.
• Branches are created below the root node for each of its possible values (i.e., Sunny, Overcast and Rain).
• Training examples are sorted to each new descendant node.
• Note that every example for which Outlook = Overcast is a positive example of PlayTennis. Therefore, this node becomes a leaf node with the classification PlayTennis = Yes.
• In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.
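Partitioning the training set by the chosen root attribute makes the observation about the Overcast node easy to verify. A quick sketch using the play_tennis rows (Outlook is column index 0):

from collections import Counter

for v in ("Sunny", "Overcast", "Rain"):
    child = [r[-1] for r in play_tennis if r[0] == v]
    print(v, dict(Counter(child)))
# Sunny    {'No': 3, 'Yes': 2}  -> impure, grow a subtree below this node
# Overcast {'Yes': 4}           -> pure, becomes the leaf "Yes"
# Rain     {'Yes': 3, 'No': 2}  -> impure, grow a subtree below this node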
ID3 (Iterative Dichotomizer 3) Learning Algorithm (A Recursive Algorithm)

ID3(Examples, Attributes)
  Create a Root node for the tree
  If all Examples are positive, Return the single-node tree Root, with label = +
  If all Examples are negative, Return the single-node tree Root, with label = −
  Otherwise Begin
    A ← the attribute from Attributes that best classifies Examples
    The decision attribute for Root ← A
    For each possible value v_i of A:
      Add a new tree branch below Root, corresponding to the test A = v_i
      Let Examples_{v_i} be the subset of Examples that has value v_i for A
      If Examples_{v_i} is pure, then below this new branch add a leaf node with label = mode(Examples_{v_i})
      Else below this new branch add the subtree ID3(Examples_{v_i}, Attributes − {A})
    End
  End
  Return Root
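The pseudocode above maps onto a short recursive Python sketch. It reuses the hypothetical information_gain() helper, represents a tree as nested dicts, and falls back to the majority label when no attributes remain; these representation choices are assumptions of the sketch, not prescribed by the slides.

from collections import Counter

def id3(rows, attr_indices):
    """Recursive ID3 sketch; rows are (feature, ..., label) tuples."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attr_indices:
        # Pure node, or no attributes left: return a leaf labelled with the (majority) class.
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute that best classifies the examples (highest information gain).
    best = max(attr_indices, key=lambda i: information_gain(rows, i))
    tree = {"attr": best, "branches": {}}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree["branches"][v] = id3(subset, [i for i in attr_indices if i != best])
    return tree

tree = id3(play_tennis, list(range(len(ATTRS))))  # the root attribute chosen is Outlook (index 0)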
This is the learnt decision tree based on the given training data set (Table DT1):

[Figure: the learnt PlayTennis tree. Outlook is the root test; Outlook = Overcast leads to the leaf Yes; Outlook = Sunny leads to a test on Humidity (High → No, Normal → Yes); Outlook = Rain leads to a test on Wind (Strong → No, Weak → Yes).]
Continuous-valued Attributes
• The initial version of ID3 is restricted to attributes that take on a discrete set of values.
• This restriction can be removed so that continuous-valued attributes can be incorporated.
• This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.
• For an attribute A that is continuous-valued, we can create a new Boolean attribute A_c that is true if A < c and false otherwise.
• How do we choose the best value of the threshold c?
• For example, suppose we wish to include the continuous-valued attribute Temperature in the PlayTennis training data set. Suppose that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target:

  Temperature   40   48   60   72   80   90
  PlayTennis    No   No   Yes  Yes  Yes  No

• We would like to pick a threshold value c that produces the greatest information gain. How?
  Temperature   40   48   60   72   80   90
  PlayTennis    No   No   Yes  Yes  Yes  No

• First, we sort the examples according to the attribute Temperature.
• Then, we identify adjacent examples that differ in their target values and generate a set of candidate thresholds midway between the corresponding values of Temperature, i.e.,

  𝑐1 = (48 + 60)/2 = 54,   𝑐2 = (80 + 90)/2 = 85

• It can be shown that the threshold value 𝑐 that maximizes the information gain must lie in this set.
• Compute the information gain for each of the candidate values and pick the one with the highest information gain as the threshold value 𝑐 (see the sketch below).
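As an illustration (not part of the slides), here is a short Python sketch of this search; the helper names are my own, and the data are the Temperature example above.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    """Return the candidate threshold c with the highest information gain."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    # Candidate thresholds: midpoints between adjacent examples whose labels differ.
    candidates = [(v[i] + v[i + 1]) / 2 for i in range(len(v) - 1) if y[i] != y[i + 1]]
    base = entropy(y)
    best_c, best_gain = None, -1.0
    for c in candidates:
        left, right = y[v < c], y[v >= c]
        gain = base - (len(left) / len(y)) * entropy(left) \
                    - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # candidates are 54 and 85; 54 wins (gain ≈ 0.46)
```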
Decision Tree Algorithms
• ID3 (Iterative Dichotomizer 3): for categorical features
• C4.5: extends ID3 to handle continuous-valued features
• C5.0: a more efficient version of C4.5
• CART (Classification and Regression Trees): also supports numerical target values, i.e., regression (see the example below)
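For reference (not on the slides): scikit-learn's DecisionTreeClassifier is an optimized CART-style implementation; choosing criterion="entropy" makes it split on information gain, in the spirit of ID3/C4.5. The data set below is just an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; the default is "gini".
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```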
Overfitting in Decision Tree

▪ If a decision tree is fully grown, it may lose some generalization capability.
▪ This is especially severe when the data set lives in a high-dimensional feature space and has a small number of samples.
▪ Overfitting caused by target value noise
▪ Overfitting caused by lack of samples
▪ Approaches to avoid overfitting in decision tree learning (see the sketch below):
  • Limit the depth and number of nodes in the decision tree
  • Post-prune the tree
  • Use multiple trees to form a forest!
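A minimal scikit-learn sketch of the first two approaches (limiting the tree while growing it, and post-pruning); the data set and hyperparameter values are illustrative assumptions, not prescriptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: memorizes the training data, may generalize poorly.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: limit depth / number of leaves while the tree is grown.
shallow = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8,
                                 random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then prune back via cost-complexity pruning
# (a larger ccp_alpha removes more nodes).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", shallow), ("post-pruned", pruned)]:
    print(name, "train:", round(model.score(X_train, y_train), 3),
                "test:",  round(model.score(X_test, y_test), 3))
```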
Random Forest and Ensemble Learning Methods

▪ Overfitting happens when the model is complex (with many parameters, such as a polynomial model) but the training data set is small. The model can be over-trained to fit all instances, including noise.
▪ Overfit models tend to have small bias but large variance: the model's error changes substantially when the training data set changes, giving low training error but large test error.
▪ Ensemble learning is a general technique that combines the predictions of many varied models into a single prediction (e.g., by averaging).
▪ Ensemble learning is a popular way to combat overfitting.
The Ensemble Learning Method
[Figure: the predictions of several individual models are combined by an Average block]
▪ Motivation for averaging (a property of the sample mean):
  • Consider a random sample of size 𝑁, 𝑌𝑖, 𝑖 = 1, …, 𝑁, from a random variable 𝑌 (the population) with mean 𝜇 and variance 𝜎².
  • Then the sample variables 𝑌𝑖, 𝑖 = 1, …, 𝑁, are independent and have the same distribution as the population 𝑌.
  • The sample mean is

    Ȳ = (1/𝑁) (𝑌1 + 𝑌2 + ⋯ + 𝑌𝑁)
• We have the following results about the sample mean Ȳ (sums run over 𝑖 = 1, …, 𝑁):

  E[Ȳ] = E[(1/𝑁) ∑ 𝑌𝑖] = (1/𝑁) ∑ E[𝑌𝑖] = (1/𝑁) · 𝑁𝜇 = 𝜇

  Var(Ȳ) = Var((1/𝑁) ∑ 𝑌𝑖) = (1/𝑁²) ∑ Var(𝑌𝑖) = (1/𝑁²) · 𝑁𝜎² = 𝜎²/𝑁

That is, the average (sample mean) has the same expectation as each individual sample variable 𝑌𝑖 but a reduced variance 𝜎²/𝑁, which shrinks as the sample size 𝑁 grows (a simulation sketch follows).
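A quick numerical check of this variance-reduction property (not from the slides); the population parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 25, 100_000

# Draw `trials` independent samples, each of size N, from the same population.
samples = rng.normal(mu, sigma, size=(trials, N))
sample_means = samples.mean(axis=1)

print("variance of one Y_i      ≈", samples[:, 0].var())   # ≈ sigma^2 = 4
print("variance of the mean     ≈", sample_means.var())    # ≈ sigma^2 / N = 0.16
print("expectation of the mean  ≈", sample_means.mean())   # ≈ mu = 5
```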
▪ In ensemble methods, the 𝑌𝑖 are analogous to the predictions made by the individual classifiers 𝑖 = 1, …, 𝑁.
▪ In real-world applications the predictions will not be completely independent, but reducing their correlation (increasing randomness) will generally reduce the variance of the final, averaged prediction (see the note below).
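The slides leave this claim qualitative; the standard result behind it is worth recording. For identically distributed predictions 𝑌𝑖, each with variance 𝜎² and pairwise correlation 𝜌, the variance of their average is

```latex
% Variance of the average of N identically distributed, pairwise-correlated predictions
\operatorname{Var}(\bar{Y}) \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{N}\,\sigma^{2}
```

As 𝑁 grows the second term vanishes, but the first term 𝜌𝜎² remains; this is why it pays to decorrelate the learners rather than simply adding more of them.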
The Ensemble Learning Method

• Bagging (Bootstrap aggregation):
  • Each bootstrap generates a new training data set by sampling from the original data set uniformly and with replacement.
  • Each sampled data set is used to train a learner (e.g., a decision tree).
  • The predictions of the learners are averaged to form the prediction of the ensemble model.
  • Any classification model can be used as the learner in this scheme (see the sketch below).

[Figure: Training Data → Bootstrap samples → Learner 1, Learner 2, ⋯, Learner N → predictions 𝑌1, 𝑌2, ⋯, 𝑌𝑁 → Average Ȳ]

What is the price we are paying here?
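A minimal sketch of bagging, assuming scikit-learn decision trees as the base learners and a standard toy data set; the learner count is arbitrary. (scikit-learn's BaggingClassifier packages the same idea.)

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_learners = 50
probas = []

for seed in range(n_learners):
    # Bootstrap: sample the training set uniformly, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_test))

# Average the individual predictions to form the ensemble prediction.
avg_proba = np.mean(probas, axis=0)
y_pred = avg_proba.argmax(axis=1)
print("bagged test accuracy:", (y_pred == y_test).mean())
```

Averaging the predicted class probabilities (soft voting) is one common choice; majority voting over the hard predictions is another.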
The Random Forest Model

Random Forest is an ensemble model consisting of many decision trees. It uses two key ideas to enhance the randomness of the member trees:
• Bootstrap sampling of the training data when building each decision tree (bagging)
• A random subset of the features (attributes) is considered when splitting each node
• The random forest combines hundreds or thousands of decision trees, training each one on a slightly different set of the observations.
• The model splits nodes in each tree considering only a limited, randomly chosen subset of the features.
• The final prediction of the random forest is made by averaging the predictions of the individual trees (see the sketch below).
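A minimal scikit-learn sketch (the data set and hyperparameter values are illustrative assumptions): n_estimators controls the number of trees, and max_features controls the size of the random feature subset examined at each split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging over the rows plus a random feature subset ("sqrt" of the feature
# count) at every split: the two sources of randomness listed above.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```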
Random Forest and Ensemble Learning Methods

▪ The Random Forest model error depends on two things:
  • The correlation between any two trees in the forest. Increasing the correlation increases the random forest model's error rate.
  • The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest's error rate.
• Reducing 𝑚 (the number of features considered at each split) reduces both the correlation and the strength; increasing 𝑚 increases both. A trade-off is needed (see the sketch below).
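One way to explore this trade-off, under the assumption that 𝑚 corresponds to scikit-learn's max_features; the candidate values and data set are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # 30 features

# Compare cross-validated accuracy for several values of m (max_features).
for m in [1, 3, 5, 10, X.shape[1]]:
    forest = RandomForestClassifier(n_estimators=200, max_features=m,
                                    random_state=0, n_jobs=-1)
    score = cross_val_score(forest, X, y, cv=5).mean()
    print(f"m = {m:2d}   5-fold CV accuracy = {score:.3f}")
```

Very small 𝑚 gives highly decorrelated but weak trees; setting 𝑚 equal to the full feature count removes the extra randomness and reduces the forest to plain bagging of trees.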
Random Forest and Ensemble Learning Methods

• Stacking: combine the base learners by training a higher-level (meta) model on their predictions (see the sketch below).
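The slide names stacking without further detail; as a rough illustration of the general idea, here is a sketch assuming scikit-learn's StackingClassifier with an arbitrary choice of base learners and meta-learner. The out-of-fold predictions of the base learners become the training inputs of the final estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base learners; a logistic regression meta-learner combines
# their cross-validated predictions into the final output.
stack = StackingClassifier(
    estimators=[("tree",   DecisionTreeClassifier(random_state=0)),
                ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm",    SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```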