Module 8: Decision Tree and Random Forest
▪ The decision tree training algorithm
▪ Overfitting in the decision tree model
▪ Ensemble learning and random forest model
Decision Tree and Random Forest Model
[Figure: the example training set of red (R) and blue (B) points plotted in the (x1, x2) plane, with a candidate threshold a on x1 and a candidate threshold b on x2.]
• A learnt decision tree consists of a root node, multiple decision nodes, and leaf nodes.
• Each decision node specifies a test of some feature of the instance.
• Each branch descending from a decision node corresponds to one of the possible values for the feature.
• Each leaf node represents a classification of the instance.
• An instance is classified by starting at the root node of the tree, testing the feature specified by this node, then moving down the tree branch corresponding to the value of the feature in the instance.
• This process is then repeated for the subtree rooted at the new node until a leaf node is reached (see the sketch after the figure).

[Figure: an example decision tree. The root tests x2 < b? (yes: leaf R; no: test x1 < a?, where yes gives leaf R and no gives leaf B), with a = 4.5 and b = 6.5. Exercise: classify the test instance x = (4, 7) using this decision tree.]
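The root-to-leaf walk just described can be written as a few lines of Python. This is a minimal sketch of the specific tree in the figure, with the thresholds a = 4.5 and b = 6.5 hard-coded; the function name classify is illustrative, not part of the slides.

def classify(x1, x2, a=4.5, b=6.5):
    """Classify a point with the example tree: the root tests x2 < b, the 'no' branch tests x1 < a."""
    if x2 < b:        # root decision node
        return "R"    # leaf on the "yes" branch
    if x1 < a:        # decision node on the "no" branch
        return "R"
    return "B"

print(classify(4, 7))  # the test instance x = (4, 7) ends in the leaf "R"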
• Given a training data set D = {(x_i, y_i)}, i = 1, 2, …, N, how can we learn a decision tree from the data? Consider Table DT1, the PlayTennis dataset:

Days  Outlook   Temperature  Humidity  Wind    PlayTennis?
D1    Sunny     Hot          High      Weak    No
D2    Sunny     Hot          High      Strong  No
D3    Overcast  Hot          High      Weak    Yes
D4    Rain      Mild         High      Weak    Yes
D5    Rain      Cool         Normal    Weak    Yes
D6    Rain      Cool         Normal    Strong  No
D7    Overcast  Cool         Normal    Strong  Yes
D8    Sunny     Mild         High      Weak    No
D9    Sunny     Cool         Normal    Weak    Yes
D10   Rain      Mild         Normal    Weak    Yes
D11   Sunny     Mild         Normal    Strong  Yes
D12   Overcast  Mild         High      Strong  Yes
D13   Overcast  Hot          Normal    Weak    Yes
D14   Rain      Mild         High      Strong  No
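For the sketches in the rest of this module, Table DT1 can be typed in directly as Python data. The names play_tennis and ATTRS below are assumptions of this sketch, not part of the original slides.

# Each row is (Outlook, Temperature, Humidity, Wind, PlayTennis), in the order D1..D14.
play_tennis = [
    ("Sunny",    "Hot",  "High",   "Weak",   "No"),
    ("Sunny",    "Hot",  "High",   "Strong", "No"),
    ("Overcast", "Hot",  "High",   "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Cool", "Normal", "Strong", "No"),
    ("Overcast", "Cool", "Normal", "Strong", "Yes"),
    ("Sunny",    "Mild", "High",   "Weak",   "No"),
    ("Sunny",    "Cool", "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "Normal", "Weak",   "Yes"),
    ("Sunny",    "Mild", "Normal", "Strong", "Yes"),
    ("Overcast", "Mild", "High",   "Strong", "Yes"),
    ("Overcast", "Hot",  "Normal", "Weak",   "Yes"),
    ("Rain",     "Mild", "High",   "Strong", "No"),
]
ATTRS = ("Outlook", "Temperature", "Humidity", "Wind")  # column indices 0..3; the label is the last element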
Decision Tree Learning Algorithm
▪ Given the training data set, how can we construct the decision tree?
• Which feature should be selected for the root node test? Why?

[Figure: the example data set partitioned in two ways, once by a threshold on x1 and once by a threshold on x2; the two candidate splits produce child sets of different purity.]
▪ What is a good quantitative measure of the classification capability of a feature? How about the reduction of the impurity of the sets?
▪ What is a good quantitative measure of the impurity of the sets?
▪ Entropy as a measure of impurity of a collection of samples (binary sets):
Given a collection S containing positive (⊕) and negative (⊖) examples of some target concept, the entropy of S relative to this Boolean classification is

  Entropy(S) = −p⊕ log₂ p⊕ − p⊖ log₂ p⊖

where p⊕ is the proportion of positive examples in S and p⊖ is the proportion of negative examples in S. In the entropy calculation, we define 0 log₂ 0 to be 0.
▪ Let's verify these two facts:
  • The entropy of the set is zero when p⊕ = 0 or 1.0 (no impurity).
  • The entropy of the set is 1 when p⊕ = 0.5 (maximum impurity).
Entropy is a measure of uncertainty or impurity of a set.
▪ Entropy as a measure of impurity of a collection of samples (multi-class sets):
Given a collection S containing examples of K classes, the entropy of S relative to this multi-class classification is

  Entropy(S) = −Σ_{i=1}^{K} p_i log₂ p_i

where p_i is the proportion of examples in S belonging to class i. As before, we define 0 log₂ 0 to be 0.
▪ Let's evaluate the entropy of the set in the figure (an evenly split set, p_R = p_B = 0.5):

  Entropy = −0.5 × log₂ 0.5 − 0.5 × log₂ 0.5 = 1
▪ Let's evaluate the entropy of the set in the figure:

  p_R = 7/8;  p_B = 1/8
  Entropy = −(7/8) × log₂(7/8) − (1/8) × log₂(1/8) = 0.5436
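A small helper makes these entropy values easy to check numerically. This is a sketch that assumes the labels of a set are given as a plain Python list of class names.

from collections import Counter
from math import log2

def entropy(labels):
    """Entropy of a collection of class labels. Classes with zero count never appear in the
    Counter, so the 0 * log2(0) = 0 convention holds automatically."""
    n = len(labels)
    return -sum((c / n) * log2(c / n) for c in Counter(labels).values())

print(entropy(["R"] * 8 + ["B"] * 8))  # 1.0     (p_R = p_B = 0.5)
print(entropy(["R"] * 7 + ["B"] * 1))  # 0.5436  (p_R = 7/8, p_B = 1/8)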
• Calculate the entropy of the given training data set in Table DT1 (9 Yes, 5 No):

  p_yes = 9/14;  p_no = 5/14
  Entropy(D) = −(9/14) × log₂(9/14) − (5/14) × log₂(5/14) = 0.940
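Reusing the hypothetical entropy() helper and the play_tennis rows sketched earlier, the same value can be checked in one line.

print(entropy([row[-1] for row in play_tennis]))  # 9 Yes, 5 No  ->  ≈ 0.940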
Entropy and Information Gain
• The information gain of a feature is simply the expected reduction in entropy caused by partitioning the examples according to this feature (attribute).
• The information gain Gain(S, A) of an attribute (feature) A, relative to a collection of examples S, is defined as

  Gain(S, A) = Entropy(S) − Σ_{v ∈ Values(A)} (|S_v| / |S|) Entropy(S_v)

where Values(A) is the set of all possible values for attribute A, and S_v is the subset of S for which attribute A has value v, i.e., S_v = {s ∈ S | A(s) = v}.
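The definition translates directly into code. The sketch below assumes the row format and the entropy() helper introduced earlier, with attributes addressed by column index.

def information_gain(rows, attr_index):
    """Gain(S, A): Entropy(S) minus the weighted entropy of the partition induced by attribute A."""
    n = len(rows)
    total = entropy([r[-1] for r in rows])
    weighted = 0.0
    for v in set(r[attr_index] for r in rows):                 # Values(A)
        subset = [r[-1] for r in rows if r[attr_index] == v]   # labels of S_v
        weighted += (len(subset) / n) * entropy(subset)        # (|S_v| / |S|) * Entropy(S_v)
    return total - weighted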
▪ Consider the data set in the figure. The two attributes are x1 and x2.
▪ Let's assume both attributes take categorical values, i.e., x1 > a or x1 < a, and x2 > b or x2 < b (with a = 4.5 and b = 6.5 as in the figure).

  Entropy(S_{x1>a}) = −(3/4) × log₂(3/4) − (1/4) × log₂(1/4) = 0.811;
  Entropy(S_{x1<a}) = 0.0

  Gain(S, x1) = Entropy(S) − (|S_{x1>a}|/|S|) Entropy(S_{x1>a}) − (|S_{x1<a}|/|S|) Entropy(S_{x1<a})
              = 0.5436 − (1/2) × 0.811 − (1/2) × 0.0 = 0.1381

[Figure: the data set partitioned by the test on x1 at a = 4.5.]
▪ Let's calculate Gain(S, x2):

  Values(x2) = {"> b", "< b"};
  S = {56R, 8B},  S_{x2>b} = {8R, 8B},  S_{x2<b} = {48R, 0B}
  Entropy(S_{x2>b}) = 1.0;  Entropy(S_{x2<b}) = 0.0

  Gain(S, x2) = Entropy(S) − (|S_{x2>b}|/|S|) Entropy(S_{x2>b}) − (|S_{x2<b}|/|S|) Entropy(S_{x2<b})
              = 0.5436 − (1/4) × 1.0 − (3/4) × 0.0 = 0.2936

▪ Which feature do you choose for the root node?

[Figure: the data set partitioned by the test on x2 at b = 6.5.]
• Consider the PlayTennis training data set D (Table DT1). One of the attributes is Wind, which has the values Weak or Strong. Let's calculate Gain(S, Wind):

  Values(Wind) = {Weak, Strong};
  S = {9+, 5−},  S_weak = {6+, 2−},  S_strong = {3+, 3−}
  Entropy(S_weak) = −(6/8) × log₂(6/8) − (2/8) × log₂(2/8) = 0.811;
  Entropy(S_strong) = 1.0

  Gain(S, Wind) = Entropy(S) − (|S_weak|/|S|) Entropy(S_weak) − (|S_strong|/|S|) Entropy(S_strong)
                = 0.940 − (8/14) × 0.811 − (6/14) × 1.0 = 0.048
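With the helpers sketched earlier, the Wind calculation reduces to a single call (Wind is column index 3 in the assumed row layout).

print(information_gain(play_tennis, 3))  # Gain(S, Wind) ≈ 0.048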
• Consider the PlayTennis training data set D (Table DT1). Let's calculate the information gain of the attribute Outlook, which has the values Sunny, Overcast and Rain:

  Values(Outlook) = {Sunny, Overcast, Rain};
  S = {9+, 5−},  S_sunny = {2+, 3−},  S_oc = {4+, 0−},  S_rain = {3+, 2−}
  Entropy(S_sunny) = 0.971;  Entropy(S_oc) = 0.0;  Entropy(S_rain) = 0.971

  Gain(S, Outlook) = Entropy(S) − (|S_sunny|/|S|) Entropy(S_sunny) − (|S_oc|/|S|) Entropy(S_oc) − (|S_rain|/|S|) Entropy(S_rain)
                   = 0.940 − (5/14) × 0.971 − (4/14) × 0.0 − (5/14) × 0.971 = 0.246
▪ Similarly, the information gain of all four attributes can be calculated: Gain(S, Outlook) = 0.246, Gain(S, Humidity) = 0.151, Gain(S, Wind) = 0.048, and Gain(S, Temperature) = 0.029.
▪ Since the attribute Outlook has the highest information gain, it is chosen as the decision attribute for the root node.
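All four gains can be reproduced with a short loop over the assumed ATTRS tuple; Outlook comes out on top, which is why it becomes the root test.

for i, name in enumerate(ATTRS):
    print(name, round(information_gain(play_tennis, i), 3))
# Outlook 0.246, Temperature 0.029, Humidity 0.151, Wind 0.048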
• Since the attribute Outlook had the highest information gain, it is chosen as the decision attribute for the root node.
• Branches are created below the root node for each of its possible values (i.e., Sunny, Overcast and Rain).
• Training examples are sorted to each new descendant node.
• Note that every example for which Outlook = Overcast is a positive example of PlayTennis. Therefore, this node becomes a leaf node with the classification PlayTennis = Yes.
• In contrast, the descendants corresponding to Outlook = Sunny and Outlook = Rain still have nonzero entropy, and the decision tree will be further elaborated below these nodes.
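Partitioning the training set by the chosen root attribute makes the observation about the Overcast node easy to verify. A quick sketch using the play_tennis rows (Outlook is column index 0):

from collections import Counter

for v in ("Sunny", "Overcast", "Rain"):
    child = [r[-1] for r in play_tennis if r[0] == v]
    print(v, dict(Counter(child)))
# Sunny    {'No': 3, 'Yes': 2}  -> impure, grow a subtree below this node
# Overcast {'Yes': 4}           -> pure, becomes the leaf "Yes"
# Rain     {'Yes': 3, 'No': 2}  -> impure, grow a subtree below this node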
ID3 (Iterative Dichotomizer 3) Learning Algorithm (A Recursive Algorithm)

ID3(Examples, Attributes)
  Create a Root node for the tree
  If all Examples are positive, Return the single-node tree Root, with label = +
  If all Examples are negative, Return the single-node tree Root, with label = −
  Otherwise Begin
    A ← the attribute from Attributes that best classifies Examples
    The decision attribute for Root ← A
    For each possible value v_i of A:
      Add a new tree branch below Root, corresponding to the test A = v_i
      Let Examples_{v_i} be the subset of Examples that has value v_i for A
      If Examples_{v_i} is pure, then below this new branch add a leaf node with label = mode(Examples_{v_i})
      Else below this new branch add the subtree ID3(Examples_{v_i}, Attributes − {A})
    End
  End
  Return Root
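The pseudocode above maps onto a short recursive Python sketch. It reuses the hypothetical information_gain() helper, represents a tree as nested dicts, and falls back to the majority label when no attributes remain; these representation choices are assumptions of the sketch, not prescribed by the slides.

from collections import Counter

def id3(rows, attr_indices):
    """Recursive ID3 sketch; rows are (feature, ..., label) tuples."""
    labels = [r[-1] for r in rows]
    if len(set(labels)) == 1 or not attr_indices:
        # Pure node, or no attributes left: return a leaf labelled with the (majority) class.
        return Counter(labels).most_common(1)[0][0]
    # Pick the attribute that best classifies the examples (highest information gain).
    best = max(attr_indices, key=lambda i: information_gain(rows, i))
    tree = {"attr": best, "branches": {}}
    for v in set(r[best] for r in rows):
        subset = [r for r in rows if r[best] == v]
        tree["branches"][v] = id3(subset, [i for i in attr_indices if i != best])
    return tree

tree = id3(play_tennis, list(range(len(ATTRS))))  # the root attribute chosen is Outlook (index 0)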
This is the learnt decision tree based on the given training data set (Table DT1):

[Figure: the learnt PlayTennis tree. Outlook is the root test; Outlook = Overcast leads to the leaf Yes; Outlook = Sunny leads to a test on Humidity (High → No, Normal → Yes); Outlook = Rain leads to a test on Wind (Strong → No, Weak → Yes).]
Continuous-valued Attributes
• The initial version of ID3 is restricted to attributes that take on a discrete set of values.
• This restriction can be removed so that continuous-valued attributes can be incorporated.
• This can be accomplished by dynamically defining new discrete-valued attributes that partition the continuous attribute value into a discrete set of intervals.
• For an attribute A that is continuous-valued, we can create a new Boolean attribute A_c that is true if A < c and false otherwise.
• How do we choose the best value of the threshold c?
• For example, suppose we wish to include the continuous-valued attribute Temperature in the PlayTennis training data set. Suppose that the training examples associated with a particular node in the decision tree have the following values for Temperature and the target:

  Temperature   40   48   60   72   80   90
  PlayTennis    No   No   Yes  Yes  Yes  No

• We would like to pick a threshold value c that produces the greatest information gain. How?
  Temperature   40   48   60   72   80   90
  PlayTennis    No   No   Yes  Yes  Yes  No

• First, we sort the examples according to the attribute Temperature.
• Then, we identify adjacent examples that differ in their target values and generate a set of candidate thresholds midway between the corresponding values of Temperature, i.e.,

  𝑐1 = (48 + 60)/2 = 54,   𝑐2 = (80 + 90)/2 = 85

• It can be shown that the threshold value 𝑐 that maximizes the information gain must lie in this set.
• Compute the information gain for each of the candidate values and pick the one with the highest information gain as the threshold value 𝑐 (see the sketch below).
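As an illustration (not part of the slides), here is a short Python sketch of this search; the helper names are my own, and the data are the Temperature example above.

```python
import numpy as np

def entropy(labels):
    """Shannon entropy of a sequence of class labels."""
    _, counts = np.unique(labels, return_counts=True)
    p = counts / counts.sum()
    return -np.sum(p * np.log2(p))

def best_threshold(values, labels):
    """Return the candidate threshold c with the highest information gain."""
    order = np.argsort(values)
    v, y = np.asarray(values)[order], np.asarray(labels)[order]
    # Candidate thresholds: midpoints between adjacent examples whose labels differ.
    candidates = [(v[i] + v[i + 1]) / 2 for i in range(len(v) - 1) if y[i] != y[i + 1]]
    base = entropy(y)
    best_c, best_gain = None, -1.0
    for c in candidates:
        left, right = y[v < c], y[v >= c]
        gain = base - (len(left) / len(y)) * entropy(left) \
                    - (len(right) / len(y)) * entropy(right)
        if gain > best_gain:
            best_c, best_gain = c, gain
    return best_c, best_gain

temps  = [40, 48, 60, 72, 80, 90]
labels = ["No", "No", "Yes", "Yes", "Yes", "No"]
print(best_threshold(temps, labels))   # candidates are 54 and 85; 54 wins (gain ≈ 0.46)
```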
Decision Tree Algorithms
• ID3 (Iterative Dichotomizer 3): for categorical features
• C4.5: extends ID3 to handle continuous-valued features
• C5.0: a more efficient version of C4.5
• CART (Classification and Regression Trees): also supports numerical target values, i.e., regression (see the example below)
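For reference (not on the slides): scikit-learn's DecisionTreeClassifier is an optimized CART-style implementation; choosing criterion="entropy" makes it split on information gain, in the spirit of ID3/C4.5. The data set below is just an illustrative choice.

```python
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# criterion="entropy" selects splits by information gain; the default is "gini".
tree = DecisionTreeClassifier(criterion="entropy", random_state=0)
tree.fit(X_train, y_train)
print("test accuracy:", tree.score(X_test, y_test))
```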
Overfitting in Decision Tree

▪ If a decision tree is fully grown, it may lose some generalization capability.
▪ This is especially severe when the data set lives in a high-dimensional feature space and has a small number of samples.
▪ Overfitting caused by target value noise
▪ Overfitting caused by lack of samples
▪ Approaches to avoid overfitting in decision tree learning (see the sketch below):
  • Limit the depth and number of nodes in the decision tree
  • Post-prune the tree
  • Use multiple trees to form a forest!
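A minimal scikit-learn sketch of the first two approaches (limiting the tree while growing it, and post-pruning); the data set and hyperparameter values are illustrative assumptions, not prescriptions.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Fully grown tree: memorizes the training data, may generalize poorly.
full = DecisionTreeClassifier(random_state=0).fit(X_train, y_train)

# Pre-pruning: limit depth / number of leaves while the tree is grown.
shallow = DecisionTreeClassifier(max_depth=3, max_leaf_nodes=8,
                                 random_state=0).fit(X_train, y_train)

# Post-pruning: grow fully, then prune back via cost-complexity pruning
# (a larger ccp_alpha removes more nodes).
pruned = DecisionTreeClassifier(ccp_alpha=0.01, random_state=0).fit(X_train, y_train)

for name, model in [("full", full), ("pre-pruned", shallow), ("post-pruned", pruned)]:
    print(name, "train:", round(model.score(X_train, y_train), 3),
                "test:",  round(model.score(X_test, y_test), 3))
```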
Random Forest and Ensemble Learning Methods

▪ Overfitting happens when the model is complex (with many parameters, such as a polynomial model) but the training data set is small. The model can be over-trained to fit all instances, including noise.
▪ Overfit models tend to have small bias but large variance: the model's error changes substantially when the training data set changes, giving low training error but large test error.
▪ Ensemble learning is a general technique that combines the predictions of many varied models into a single prediction (e.g., by averaging).
▪ Ensemble learning is a popular way to combat overfitting.
The Ensemble Learning Method
[Figure: the predictions of several individual models are combined by an Average block]
▪ Motivation for averaging (a property of the sample mean):
  • Consider a random sample of size 𝑁, 𝑌𝑖, 𝑖 = 1, …, 𝑁, from a random variable 𝑌 (the population) with mean 𝜇 and variance 𝜎².
  • Then the sample variables 𝑌𝑖, 𝑖 = 1, …, 𝑁, are independent and have the same distribution as the population 𝑌.
  • The sample mean is

    Ȳ = (1/𝑁) (𝑌1 + 𝑌2 + ⋯ + 𝑌𝑁)
• We have the following results about the sample mean Ȳ (sums run over 𝑖 = 1, …, 𝑁):

  E[Ȳ] = E[(1/𝑁) ∑ 𝑌𝑖] = (1/𝑁) ∑ E[𝑌𝑖] = (1/𝑁) · 𝑁𝜇 = 𝜇

  Var(Ȳ) = Var((1/𝑁) ∑ 𝑌𝑖) = (1/𝑁²) ∑ Var(𝑌𝑖) = (1/𝑁²) · 𝑁𝜎² = 𝜎²/𝑁

That is, the average (sample mean) has the same expectation as each individual sample variable 𝑌𝑖 but a reduced variance 𝜎²/𝑁, which shrinks as the sample size 𝑁 grows (a simulation sketch follows).
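A quick numerical check of this variance-reduction property (not from the slides); the population parameters below are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
mu, sigma, N, trials = 5.0, 2.0, 25, 100_000

# Draw `trials` independent samples, each of size N, from the same population.
samples = rng.normal(mu, sigma, size=(trials, N))
sample_means = samples.mean(axis=1)

print("variance of one Y_i      ≈", samples[:, 0].var())   # ≈ sigma^2 = 4
print("variance of the mean     ≈", sample_means.var())    # ≈ sigma^2 / N = 0.16
print("expectation of the mean  ≈", sample_means.mean())   # ≈ mu = 5
```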
▪ In ensemble methods, the 𝑌𝑖 are analogous to the predictions made by the individual classifiers 𝑖 = 1, …, 𝑁.
▪ In real-world applications the predictions will not be completely independent, but reducing their correlation (increasing randomness) will generally reduce the variance of the final, averaged prediction (see the note below).
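The slides leave this claim qualitative; the standard result behind it is worth recording. For identically distributed predictions 𝑌𝑖, each with variance 𝜎² and pairwise correlation 𝜌, the variance of their average is

```latex
% Variance of the average of N identically distributed, pairwise-correlated predictions
\operatorname{Var}(\bar{Y}) \;=\; \rho\,\sigma^{2} \;+\; \frac{1-\rho}{N}\,\sigma^{2}
```

As 𝑁 grows the second term vanishes, but the first term 𝜌𝜎² remains; this is why it pays to decorrelate the learners rather than simply adding more of them.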
The Ensemble Learning Method

• Bagging (Bootstrap aggregation):
  • Each bootstrap generates a new training data set by sampling from the original data set uniformly and with replacement.
  • Each sampled data set is used to train a learner (e.g., a decision tree).
  • The predictions of the learners are averaged to form the prediction of the ensemble model.
  • Any classification model can be used as the learner in this scheme (see the sketch below).

[Figure: Training Data → Bootstrap samples → Learner 1, Learner 2, ⋯, Learner N → predictions 𝑌1, 𝑌2, ⋯, 𝑌𝑁 → Average Ȳ]

What is the price we are paying here?
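A minimal sketch of bagging, assuming scikit-learn decision trees as the base learners and a standard toy data set; the learner count is arbitrary. (scikit-learn's BaggingClassifier packages the same idea.)

```python
import numpy as np
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

rng = np.random.default_rng(0)
n_learners = 50
probas = []

for seed in range(n_learners):
    # Bootstrap: sample the training set uniformly, with replacement.
    idx = rng.integers(0, len(X_train), size=len(X_train))
    tree = DecisionTreeClassifier(random_state=seed).fit(X_train[idx], y_train[idx])
    probas.append(tree.predict_proba(X_test))

# Average the individual predictions to form the ensemble prediction.
avg_proba = np.mean(probas, axis=0)
y_pred = avg_proba.argmax(axis=1)
print("bagged test accuracy:", (y_pred == y_test).mean())
```

Averaging the predicted class probabilities (soft voting) is one common choice; majority voting over the hard predictions is another.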
The Random Forest Model

Random Forest is an ensemble model consisting of many decision trees. It uses two key ideas to enhance the randomness of the member trees:
• Bootstrap sampling of the training data when building each decision tree (bagging)
• A random subset of the features (attributes) is considered when splitting each node
• The random forest combines hundreds or thousands of decision trees, training each one on a slightly different set of the observations.
• The model splits nodes in each tree considering only a limited, randomly chosen subset of the features.
• The final prediction of the random forest is made by averaging the predictions of the individual trees (see the sketch below).
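A minimal scikit-learn sketch (the data set and hyperparameter values are illustrative assumptions): n_estimators controls the number of trees, and max_features controls the size of the random feature subset examined at each split.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Bagging over the rows plus a random feature subset ("sqrt" of the feature
# count) at every split: the two sources of randomness listed above.
forest = RandomForestClassifier(n_estimators=500, max_features="sqrt",
                                random_state=0, n_jobs=-1)
forest.fit(X_train, y_train)
print("test accuracy:", forest.score(X_test, y_test))
```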
Random Forest and Ensemble Learning Methods

▪ The Random Forest model error depends on two things:
  • The correlation between any two trees in the forest. Increasing the correlation increases the random forest model's error rate.
  • The strength of each individual tree in the forest. A tree with a low error rate is a strong classifier. Increasing the strength of the individual trees decreases the forest's error rate.
• Reducing 𝑚 (the number of features considered at each split) reduces both the correlation and the strength; increasing 𝑚 increases both. A trade-off is needed (see the sketch below).
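One way to explore this trade-off, under the assumption that 𝑚 corresponds to scikit-learn's max_features; the candidate values and data set are arbitrary.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import cross_val_score

X, y = load_breast_cancer(return_X_y=True)   # 30 features

# Compare cross-validated accuracy for several values of m (max_features).
for m in [1, 3, 5, 10, X.shape[1]]:
    forest = RandomForestClassifier(n_estimators=200, max_features=m,
                                    random_state=0, n_jobs=-1)
    score = cross_val_score(forest, X, y, cv=5).mean()
    print(f"m = {m:2d}   5-fold CV accuracy = {score:.3f}")
```

Very small 𝑚 gives highly decorrelated but weak trees; setting 𝑚 equal to the full feature count removes the extra randomness and reduces the forest to plain bagging of trees.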
Random Forest and Ensemble Learning Methods

• Stacking: combine the base learners by training a higher-level (meta) model on their predictions (see the sketch below).
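The slide names stacking without further detail; as a rough illustration of the general idea, here is a sketch assuming scikit-learn's StackingClassifier with an arbitrary choice of base learners and meta-learner. The out-of-fold predictions of the base learners become the training inputs of the final estimator.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier, StackingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.tree import DecisionTreeClassifier

X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

# Heterogeneous base learners; a logistic regression meta-learner combines
# their cross-validated predictions into the final output.
stack = StackingClassifier(
    estimators=[("tree",   DecisionTreeClassifier(random_state=0)),
                ("forest", RandomForestClassifier(n_estimators=100, random_state=0)),
                ("svm",    SVC(probability=True, random_state=0))],
    final_estimator=LogisticRegression(max_iter=1000),
)
stack.fit(X_train, y_train)
print("test accuracy:", stack.score(X_test, y_test))
```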