
Topic: Decision Tree

DAT706: Data Science for Business


Date: 2 March 2023

Vinh Vo
Ho Chi Minh City University of Banking
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Introductory Problems
• John works as a salesman at a computer store. He has collected data about his previous customers, as shown in the table on the right.
• John wants to use these data to predict whether a new customer will buy a computer or not, applying rules based on information such as age, income, student status, and credit rating.
• This lecture introduces an algorithm for this problem: the ID3 Decision Tree.
Review: The Classification Problem
• General pattern: Input x^(i) → Hypothesis h_θ(x) (classifier) → Output y^(i) ∈ {0, 1}
• Previous lecture: Logistic Model. Now: Decision Tree.
• These problems are binary classification:
– The output is a discrete value, and
– It takes only one out of two possible values
• 0: “Negative Class” (e.g., spam email)
• 1: “Positive Class” (e.g., not spam)
• In the following we build a Decision Tree on another dataset, called “play-tennis”. We leave the customer dataset on the previous slide as an exercise at the end.
Data Set for “Play-Tennis” Example

ID   Outlook   Temp.  Humidity  Wind    Play Tennis
D1   Sunny     Hot    High      Weak    No
D2   Sunny     Hot    High      Strong  No
D3   Overcast  Hot    High      Weak    Yes
D4   Rain      Mild   High      Weak    Yes
D5   Rain      Cool   Normal    Weak    Yes
D6   Rain      Cool   Normal    Strong  No
D7   Overcast  Cool   Normal    Strong  Yes
D8   Sunny     Mild   High      Weak    No
D9   Sunny     Cool   Normal    Weak    Yes
D10  Rain      Mild   Normal    Weak    Yes
D11  Sunny     Mild   Normal    Strong  Yes
D12  Overcast  Mild   High      Strong  Yes
D13  Overcast  Hot    Normal    Weak    Yes
D14  Rain      Mild   High      Strong  No

• This is a typical dataset for Decision Tree illustration.
• 14 objects in two classes {Y, N}; each row has 4 attributes.
• Dom{Outlook} = {Sunny, Overcast, Rain}
• Dom{Temperature} = {Hot, Mild, Cool}
• Dom{Humidity} = {High, Normal}
• Dom{Wind} = {Weak, Strong}
• We will build an ID3 Decision Tree for “Play-Tennis” step by step.
Two Possible Decision Trees for “Play-Tennis”
• Occam’s Principle: “If two theories explain the facts equally well, then the simpler theory is preferred.”
⇒ Prefer the smallest tree that correctly classifies all training examples.
• The tree that selects “Outlook” at the root is much simpler.
• Question: How do we select a good attribute to split a decision node?
Which attribute is better?
• The “play-tennis” set S contains 9 positive objects (+) and 5 negative objects (−), denoted by [9+, 5−].
• If the attributes “Humidity” and “Wind” split S into sub-nodes with the proportions of positive and negative objects shown below, which attribute is better?
• Idea: select the attribute whose sub-nodes contain purer data.
• Question: How can we measure the purity of a set of objects? Solution: Entropy and Information Gain.
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Entropy: Definition
• Entropy characterizes the impurity (purity) of an arbitrary collection of objects.
– S: a collection of positive and negative objects
– p₊: proportion of positive objects in S
– p₋: proportion of negative objects in S
– Our dataset: |S| = 14, p₊ = 9/14, p₋ = 5/14
• Entropy is defined by:
  Entropy(S) = −p₊·log₂(p₊) − p₋·log₂(p₋)
Entropy: Definition
• The figure “Entropy for Binary Class” (omitted here) plots the entropy of a binary classification as the proportion p of positive objects varies between 0 and 1.
• If the collection has k distinct classes of objects, then the entropy is defined by:
  Entropy(S) = Σ_{i=1..k} −pᵢ·log₂(pᵢ) = −p₁·log₂(p₁) − p₂·log₂(p₂) − ⋯ − pₖ·log₂(pₖ)
Entropy: Example
• From the Play-Tennis dataset:
  Entropy([9+, 5−]) = −(9/14)·log₂(9/14) − (5/14)·log₂(5/14) = 0.940
• If all members of S belong to the same class (the purest set), then Entropy(S) = 0. For example, if all members are positive (p₊ = 1), then p₋ = 0 and Entropy(S) = −1·log₂(1) − 0·log₂(0) = 0.
• If the collection contains an equal number of positive and negative examples (p₊ = p₋ = 0.5), then Entropy(S) = 1 (maximal impurity).
• If the numbers of positive and negative examples are unequal, the entropy lies between 0 and 1.
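To make the definition concrete, here is a minimal Python sketch (the helper name entropy is ours) that reproduces the values above:

    import math

    def entropy(counts):
        """Entropy of a node given its class counts, e.g. [9, 5] for [9+, 5-]."""
        total = sum(counts)
        return sum(-(c / total) * math.log2(c / total) for c in counts if c > 0)

    print(round(entropy([9, 5]), 3))   # 0.94  -- the "play-tennis" set
    print(entropy([14, 0]))            # 0.0   -- a pure node
    print(entropy([7, 7]))             # 1.0   -- a maximally impure node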
Information Gain: Definition
• Information Gain measures the effectiveness of an attribute in classifying data.
• It is the expected reduction in entropy caused by partitioning the objects according to that attribute:
  Gain(S, A) = Entropy(S) − Σ_{v ∈ Value(A)} (|S_v| / |S|) · Entropy(S_v)
– Value(A): the set of all possible values of the attribute A.
– S_v: the subset of S for which A has value v.
Information Gain: Example
• S = [9+, 5−]
• Value(Wind) = {Weak, Strong}
• S_Weak = [6+, 2−];  S_Strong = [3+, 3−]
• Gain(S, Wind) = Entropy(S) − (8/14)·Entropy(S_Weak) − (6/14)·Entropy(S_Strong)
  = 0.940 − (8/14)·0.811 − (6/14)·1.0 ≈ 0.048

ID   Wind    Play Tennis
D1   Weak    No
D8   Weak    No
D3   Weak    Yes
D4   Weak    Yes
D5   Weak    Yes
D9   Weak    Yes
D10  Weak    Yes
D13  Weak    Yes
D2   Strong  No
D6   Strong  No
D14  Strong  No
D7   Strong  Yes
D11  Strong  Yes
D12  Strong  Yes
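The same computation in Python, reusing the entropy helper sketched earlier (gain_from_counts is our own name):

    def gain_from_counts(parent, splits):
        """Information gain given the parent's class counts and the class
        counts of each sub-node, e.g. parent [9, 5], splits [[6, 2], [3, 3]]."""
        n = sum(parent)
        return entropy(parent) - sum(sum(s) / n * entropy(s) for s in splits)

    # Gain(S, Wind): S = [9+, 5-], S_Weak = [6+, 2-], S_Strong = [3+, 3-]
    print(round(gain_from_counts([9, 5], [[6, 2], [3, 3]]), 3))   # 0.048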
Which attribute is a better classifier?
Information gain of each attribute:
– Gain(S, Outlook) = 0.246
– Gain(S, Humidity) = 0.151
– Gain(S, Wind) = 0.048
– Gain(S, Temperature) = 0.029
⇒ Root level: split on Outlook.
Next level: consider the remaining attributes after splitting by Outlook.
Next step in growing the decision tree

ID   Outl.     Temp.  Humi.   Wind    Play?
D1   Sunny     Hot    High    Weak    No
D2   Sunny     Hot    High    Strong  No
D8   Sunny     Mild   High    Weak    No
D9   Sunny     Cool   Normal  Weak    Yes
D11  Sunny     Mild   Normal  Strong  Yes
D3   Overcast  Hot    High    Weak    Yes
D7   Overcast  Cool   Normal  Strong  Yes
D12  Overcast  Mild   High    Strong  Yes
D13  Overcast  Hot    Normal  Weak    Yes
D6   Rain      Cool   Normal  Strong  No
D14  Rain      Mild   High    Strong  No
D4   Rain      Mild   High    Weak    Yes
D5   Rain      Cool   Normal  Weak    Yes
D10  Rain      Mild   Normal  Weak    Yes

• S_Sunny = {D1, D2, D8, D9, D11}
– Gain(S_Sunny, Humidity) = 0.970
– Gain(S_Sunny, Temperature) = 0.570
– Gain(S_Sunny, Wind) = 0.019
– We select the attribute Humidity.
• Similarly, on S_Rain we select the attribute Wind.
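The gains inside the Sunny branch can be checked the same way with the gain_from_counts helper from earlier; small differences from the slide’s figures are only rounding:

    # Outlook = Sunny subset (D1, D2, D8, D9, D11): 2 Yes, 3 No -> [2, 3]
    print(round(gain_from_counts([2, 3], [[0, 3], [2, 0]]), 3))          # Humidity (High/Normal)      ~0.971
    print(round(gain_from_counts([2, 3], [[0, 2], [1, 1], [1, 0]]), 3))  # Temperature (Hot/Mild/Cool) ~0.571
    print(round(gain_from_counts([2, 3], [[1, 2], [1, 1]]), 3))          # Wind (Weak/Strong)          ~0.020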
The Resulting Decision Tree & Its Rules
The final tree (figure omitted) has Outlook at the root: the Sunny branch splits on Humidity (High → No, Normal → Yes), the Overcast branch is a pure leaf (Yes), and the Rain branch splits on Wind (Strong → No, Weak → Yes).
Decision Tree: Stopping Condition
(Figure omitted: a tree with its root node and leaf nodes.)
The growing process can be stopped if:
• The entropy in a node is 0 (purest), or
• The number of data points in a node is less than a threshold α, or
• The path from a sub-node to the root node reaches a threshold (depth) β, or
• The reduction in entropy is less than a threshold δ, or
• The total number of leaf nodes reaches a threshold γ.
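These stopping conditions map naturally onto the parameters of a recursive tree builder. A minimal ID3 sketch (reusing the entropy and gain_from_counts helpers from earlier; the parameter names and defaults are ours, not part of the lecture):

    from collections import Counter

    def id3(rows, attributes, target, depth=0,
            min_samples=2, max_depth=10, min_gain=1e-3):
        """rows: list of dicts, attributes: list of column names, target: label column.
        min_samples, max_depth and min_gain play the role of alpha, beta and delta."""
        labels = [r[target] for r in rows]
        majority = Counter(labels).most_common(1)[0][0]
        # Stopping conditions: pure node, too few points, depth limit, no attributes left.
        if len(set(labels)) == 1 or len(rows) < min_samples \
                or depth >= max_depth or not attributes:
            return majority

        def gain_of(a):
            groups = {}
            for r in rows:
                groups.setdefault(r[a], []).append(r[target])
            return gain_from_counts(list(Counter(labels).values()),
                                    [list(Counter(g).values()) for g in groups.values()])

        best = max(attributes, key=gain_of)
        if gain_of(best) < min_gain:              # reduction in entropy too small
            return majority
        tree = {best: {}}
        for value in set(r[best] for r in rows):  # one branch per attribute value
            subset = [r for r in rows if r[best] == value]
            rest = [a for a in attributes if a != best]
            tree[best][value] = id3(subset, rest, target, depth + 1,
                                    min_samples, max_depth, min_gain)
        return tree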
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Decision Tree: Extension Versions
• For non-categorical attributes, we need to convert the continuous values into discrete values by setting a threshold. For example:
– If threshold_Temp = 83, then (Temp. ≥ 83) ∈ {True, False}
– If threshold_Humi = 80, then (Humi. ≥ 80) ∈ {True, False}
• The threshold should be a value that maximizes the gain for that attribute.
• The C4.5 Decision Tree performs a binary split based on such a threshold value.

ID   Outlook   Temp. (≥83?)  Humi. (≥80?)  Wind    Play?
D1   Sunny     85 (T)        85 (T)        Weak    No
D2   Sunny     80 (F)        90 (T)        Strong  No
D3   Overcast  83 (T)        78 (F)        Weak    Yes
D4   Rain      70 (F)        96 (T)        Weak    Yes
D5   Rain      68 (F)        80 (T)        Weak    Yes
D6   Rain      65 (F)        70 (F)        Strong  No
D7   Overcast  64 (F)        65 (F)        Strong  Yes
D8   Sunny     72 (F)        95 (T)        Weak    No
D9   Sunny     69 (F)        70 (F)        Weak    Yes
D10  Rain      75 (F)        80 (T)        Weak    Yes
D11  Sunny     75 (F)        70 (F)        Strong  Yes
D12  Overcast  72 (F)        90 (T)        Strong  Yes
D13  Overcast  81 (F)        75 (F)        Weak    Yes
D14  Rain      71 (F)        80 (T)        Strong  No
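A small sketch of the threshold search in Python (reusing gain_from_counts from earlier; the midpoint strategy is a common choice, and the gain-maximizing threshold need not equal the illustrative value 80 used above):

    from collections import Counter

    def best_threshold(values, labels):
        """Pick the numeric split point that maximizes information gain,
        trying midpoints between consecutive distinct sorted values."""
        parent = list(Counter(labels).values())
        candidates = sorted(set(values))
        best_t, best_g = None, -1.0
        for lo, hi in zip(candidates, candidates[1:]):
            t = (lo + hi) / 2
            left = [y for x, y in zip(values, labels) if x <= t]
            right = [y for x, y in zip(values, labels) if x > t]
            g = gain_from_counts(parent, [list(Counter(left).values()),
                                          list(Counter(right).values())])
            if g > best_g:
                best_t, best_g = t, g
        return best_t, best_g

    # Humidity values and Play labels for D1..D14 from the table above.
    humidity = [85, 90, 78, 96, 80, 70, 65, 95, 70, 80, 70, 90, 75, 80]
    play = ['No','No','Yes','Yes','Yes','No','Yes','No','Yes','Yes','Yes','Yes','Yes','No']
    print(best_threshold(humidity, play))   # roughly (82.5, 0.10) on this data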
Extension Version: An Example - C4.5
• In the ID3 algorithm, we calculated a gain for each attribute. Here (C4.5), we calculate gain ratios instead of gains:
  GainRatio(A) = Gain(A) / SplitInfo(A)
  SplitInfo(A) = Σᵢ −(|Sᵢ| / |S|)·log₂(|Sᵢ| / |S|), where S₁, …, Sₙ are the subsets produced by splitting S on A.
• The dataset is the thresholded “play-tennis” table from the previous slide.
• Details of C4.5 can be found here.
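A short Python sketch of the gain ratio, reusing the math import and the gain_from_counts helper from the earlier sketches (split_info and gain_ratio are our names):

    def split_info(sizes):
        """SplitInfo(A): the entropy of the split sizes themselves."""
        n = sum(sizes)
        return sum(-(s / n) * math.log2(s / n) for s in sizes if s > 0)

    def gain_ratio(parent, splits):
        return gain_from_counts(parent, splits) / split_info([sum(s) for s in splits])

    # Wind on the play-tennis data: 8 Weak objects vs 6 Strong objects.
    print(round(gain_ratio([9, 5], [[6, 2], [3, 3]]), 3))   # ~0.049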
Decision Tree: Extension Versions
• Besides Information Gain, other measures can be used for attribute selection, such as the Gain Ratio (C4.5) and the Gini index (CART).
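For example, the Gini index rewards the same kind of purity as entropy; a minimal sketch (Gini is the classification criterion used by CART, and it is also scikit-learn’s default):

    def gini(counts):
        """Gini impurity of a node given its class counts."""
        total = sum(counts)
        return 1.0 - sum((c / total) ** 2 for c in counts)

    print(round(gini([9, 5]), 3))   # ~0.459 for the play-tennis set
    print(gini([7, 7]))             # 0.5  -- maximally impure binary node
    print(gini([14, 0]))            # 0.0  -- pure node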
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Decision Tree: Overfitting
• Classification task: fit a “model” to a set of training data, so as to be able to make reliable predictions on general, unseen data.
• Overfitting: the statistical model describes random error or noise instead of the underlying relationship.
• Overfitting occurs when a model fits the training set very well but performs poorly on general data.
Decision Tree: Overfitting
• The generated tree may overfit the training data:
– Too many branches, some of which may reflect anomalies due to noise or outliers.
– The result is poor performance on unseen objects.
• Two approaches to avoid overfitting:
– Prepruning: halt tree construction early; do not split a node if this would make the goodness measure fall below a threshold.
  It is difficult to choose an appropriate threshold.
– Postpruning: remove branches from a “fully grown” tree, obtaining a sequence of progressively pruned trees.
  Use a set of data different from the training data to decide which is the “best pruned tree”.
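One common way to do postpruning in practice is cost-complexity pruning in scikit-learn, selecting the pruning strength on held-out data as described above. A sketch (using a built-in dataset as a stand-in, not the lecture’s data):

    from sklearn.datasets import load_breast_cancer
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    X, y = load_breast_cancer(return_X_y=True)
    X_train, X_val, y_train, y_val = train_test_split(X, y, random_state=0)

    # Candidate pruning strengths derived from the fully grown tree.
    path = DecisionTreeClassifier(random_state=0).cost_complexity_pruning_path(X_train, y_train)

    best_alpha, best_score = 0.0, 0.0
    for alpha in path.ccp_alphas:
        tree = DecisionTreeClassifier(random_state=0, ccp_alpha=alpha).fit(X_train, y_train)
        score = tree.score(X_val, y_val)   # held-out data picks the "best pruned tree"
        if score > best_score:
            best_alpha, best_score = alpha, score
    print(best_alpha, best_score)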
Decision Tree: Pros & Cons
Pros:
• Understandable prediction rules are created from the training data.
• Builds a short tree quickly.
• Only needs to test enough attributes until all data are classified.
• Finding leaf nodes enables test data to be pruned, reducing the number of tests.
• The whole dataset is considered when creating the tree.
Cons:
• Only one attribute is considered at a time.
• Computationally expensive for continuous data (C4.5).
• If new data are incorrectly classified near the root level, the result can vary greatly.
• If an attribute can take many categorical values, the decision tree may contain many branches and nodes for that attribute, which is not useful for prediction.
ID3 Decision Tree: An Overview
(Pipeline figure summarized as text.)
Training phase: raw data → feature extraction (feature engineering; supported by data exploration, data visualization, correlation matrix, …) → extracted features → training set {x^(i), y^(i)} → learning algorithm: maximize Information Gain → decision tree.
Testing phase: testing set {x^(t), y^(t)} → decision tree → predicted ŷ ∈ {0, 1}; evaluation metrics: Accuracy, F-Score, TPR, FNR, etc.
Extensions: C4.5, CART.
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Case Study: Banking Dataset
• The dataset comes from the UCI Machine Learning
repository, and it is related to direct marketing campaigns
(phone calls) of a Portuguese banking institution.
• Goal: predict whether the client will buy a product or not
(binary classification problem)
• Dataset:
Ø 41,188 instances.
Ø Each instance consists of 21 attributes.
Case Study: Banking Dataset

#    Attribute  Description
1    age        numeric value
2    job        job type (categorical: admin, retired, student, unknown, etc.)
3    marital    marital status (categorical: divorced, married, single, unknown)
5    default    has credit in default? (categorical: no, yes, unknown)
6    housing    has housing loan? (categorical: no, yes, unknown)
7    loan       has personal loan? (categorical: no, yes, unknown)
15   poutcome   outcome of the previous marketing campaign (categorical: failure, nonexistent, success)
…    …          …
21   y          target variable: has the client bought the product? (1: Yes, 0: No)
Banking Dataset: Snapshot

Source: https://www.kaggle.com
Case Study: Explore The Data
Customer Job Distribution Marital status distribution
Case Study: Explore The Data
Barplot for credit in default Barplot for housing loan
Case Study: Explore The Data
Barplot for personal loan; barplot for previous marketing campaign outcome
Case Study: Explore The Data
• Barplot for the y variable: 36,548 “no” vs 4,640 “yes” (the classes are imbalanced).
• Correlation matrix — observation: the attributes are nearly independent of each other.
Case Study: Feature Selection
• The features used in this lecture are:
– job type
– contact
– previous
– euribor3m
– the outcome of the previous marketing campaigns
• Decision Tree version: CART
• Drop the variables that we do not need.
• Split the data into training and test sets (41,188 instances in total):

                      Training Set  Testing Set
#positive instances   25,638        10,910
#negative instances   3,193         1,447
Total instances       28,831        12,357
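A sketch of this setup with scikit-learn’s CART implementation; the CSV file name, separator, and exact column names are assumptions about the UCI/Kaggle export, and max_depth is an arbitrary choice:

    import pandas as pd
    from sklearn.model_selection import train_test_split
    from sklearn.tree import DecisionTreeClassifier

    df = pd.read_csv("bank-additional-full.csv", sep=";")
    features = ["job", "contact", "previous", "euribor3m", "poutcome"]
    X = pd.get_dummies(df[features])            # one-hot encode the categorical features
    y = (df["y"] == "yes").astype(int)          # 1 if the client bought the product

    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = DecisionTreeClassifier(criterion="gini", max_depth=5, random_state=0)  # CART-style
    clf.fit(X_train, y_train)
    print(clf.score(X_test, y_test))            # test-set accuracy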
Case Study: Experiment Results
Confusion matrix on the test set (12,357 instances):

                  Predicted: No   Predicted: Yes
True label: No    10,749          161
True label: Yes   1,144           303

Accuracy = (10,749 + 303) / 12,357 ≈ 89.44%
Incorrect predictions = (161 + 1,144) / 12,357 ≈ 10.56%

           Precision  Recall  F1-Score
Class 0    0.90       0.99    0.94
Class 1    0.65       0.21    0.32
Avg/total  0.87       0.89    0.87

• F1-Score: the higher, the better. We will return to these metrics later in the course.
• We may try different features to increase the classifier’s performance.
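Continuing the sketch above, the confusion matrix and per-class precision/recall/F1 can be produced with scikit-learn:

    from sklearn.metrics import classification_report, confusion_matrix

    y_pred = clf.predict(X_test)
    print(confusion_matrix(y_test, y_pred))       # rows: true labels, columns: predicted labels
    print(classification_report(y_test, y_pred))  # precision, recall, F1 per class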
Bank Dataset: Tree Visualization (1/2)
Bank Dataset: Tree Visualization (2/2)
Outline
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
Worked Exercise: The Simpsons

Person  Hair Length  Weight  Age  Class
Homer   0″           250     36   M
Marge   10″          150     34   F
Bart    2″           90      10   M
Lisa    6″           78      8    F
Maggie  4″           20      1    F
Abe     1″           170     70   M
Selma   8″           160     41   F
Otto    10″          180     38   M
Krusty  6″           200     45   M
Comic   8″           290     38   ?

Convert the attributes into categorical values:
• threshold(Hair) = 5
• threshold(Weight) = 160
• threshold(Age) = 40
Goal: determine the information gain for each attribute:
• Gain(Hair)?
• Gain(Weight)?
• Gain(Age)?
Note: the threshold at each level of the tree may be different.
This exercise is adapted from the lecture on ID3 by Prof. Allan Neymark.
Entropy(S) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
Entropy(4F, 5M) = −(4/9)·log₂(4/9) − (5/9)·log₂(5/9) = 0.9911

Let us try splitting on Hair Length (Hair Length ≤ 5?):
• yes branch (1F, 3M): Entropy(1F, 3M) = −(1/4)·log₂(1/4) − (3/4)·log₂(3/4) = 0.8113
• no branch (3F, 2M): Entropy(3F, 2M) = −(3/5)·log₂(3/5) − (2/5)·log₂(2/5) = 0.9710

Gain(A) = E(current set) − Σᵥ (|Sᵥ|/|S|)·E(child set v)
Gain(Hair Length ≤ 5) = 0.9911 − (4/9 · 0.8113 + 5/9 · 0.9710) = 0.0911


Entropy(S) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
Entropy(4F, 5M) = −(4/9)·log₂(4/9) − (5/9)·log₂(5/9) = 0.9911

Let us try splitting on Weight (Weight ≤ 160?):
• yes branch (4F, 1M): Entropy(4F, 1M) = −(4/5)·log₂(4/5) − (1/5)·log₂(1/5) = 0.7219
• no branch (0F, 4M): Entropy(0F, 4M) = −(0/4)·log₂(0/4) − (4/4)·log₂(4/4) = 0

Gain(Weight ≤ 160) = 0.9911 − (5/9 · 0.7219 + 4/9 · 0) = 0.5900


Entropy(S) = −(p/(p+n))·log₂(p/(p+n)) − (n/(p+n))·log₂(n/(p+n))
Entropy(4F, 5M) = −(4/9)·log₂(4/9) − (5/9)·log₂(5/9) = 0.9911

Let us try splitting on Age (Age ≤ 40?):
• yes branch (3F, 3M): Entropy(3F, 3M) = −(3/6)·log₂(3/6) − (3/6)·log₂(3/6) = 1
• no branch (1F, 2M): Entropy(1F, 2M) = −(1/3)·log₂(1/3) − (2/3)·log₂(2/3) = 0.9183

Gain(Age ≤ 40) = 0.9911 − (6/9 · 1 + 3/9 · 0.9183) = 0.0183
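The three gains can be double-checked with the gain_from_counts helper sketched earlier (class counts written as [F, M]):

    print(round(gain_from_counts([4, 5], [[1, 3], [3, 2]]), 4))   # Hair Length <= 5 : 0.0911
    print(round(gain_from_counts([4, 5], [[4, 1], [0, 4]]), 4))   # Weight <= 160    : 0.59
    print(round(gain_from_counts([4, 5], [[3, 3], [1, 2]]), 4))   # Age <= 40        : 0.0183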


At the root level:
• Gain(Hair Length ≤ 5) = 0.0911
• Gain(Weight ≤ 160) = 0.5900
• Gain(Age ≤ 40) = 0.0183
⇒ We split the root node on Weight ≤ 160.
At the next level (on the Weight ≤ 160 branch), we find that Hair Length ≤ 2 is the best split. The final decision tree is given on the next slide.
Convert Decision Trees into Rules

Final decision tree:
• Weight ≤ 160?
– no → Male
– yes → Hair Length ≤ 2?
  • yes → Male
  • no → Female

How would these people be classified?
Rules to Classify Males/Females
• If Weight is greater than 160, classify as Male.
• Else if Hair Length is less than or equal to 2, classify as Male.
• Else classify as Female.
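These rules translate directly into a tiny Python function; applying it to the unlabeled row (Comic: 8″, 290, 38) shows how a new person would be classified:

    def classify(hair_length, weight, age):
        """The extracted rules as code (age is not used by the final tree)."""
        if weight > 160:
            return "Male"
        elif hair_length <= 2:
            return "Male"
        return "Female"

    print(classify(hair_length=8, weight=290, age=38))   # Comic -> Male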
Exercise 1
The dataset that John collected about his previous customers is shown on the right.
Question: Apply the ID3 algorithm to build a decision tree for the concept “buy-computer”.
Solution
• Age ≤ 30 → Student?
– No → No
– Yes → Yes
• Age 31–40 → Yes
• Age > 40 → Credit Rating?
– Fair → Yes
– Excellent → No
Exercise 2
• The entropy of a binary classification is shown in the figure “Entropy for Binary Class” on the right.
• Explain why the entropy is maximal when p = 0.5.

Entropy(S) = Σ_{i=1..2} −pᵢ·log₂(pᵢ) = −p₁·log₂(p₁) − p₂·log₂(p₂)
Note that p₂ = 1 − p₁.
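A sketch of the calculus argument (writing p for p₁, so Entropy(S) = H(p)):

    H(p)  = −p·log₂(p) − (1 − p)·log₂(1 − p),  for 0 < p < 1
    H′(p) = −log₂(p) + log₂(1 − p) = log₂((1 − p) / p)
    H′(p) = 0  ⇔  (1 − p)/p = 1  ⇔  p = 1/2
    H″(p) = −1 / (p·(1 − p)·ln 2) < 0, so p = 1/2 is a maximum, with H(1/2) = 1.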
What we have learned so far?
• Introduction
• Decision Tree: theoretical review
– Entropy and Information Gain
– Extension versions
– Overfitting and Tree Pruning
• Case study: Banking dataset
• Worked exercises
THE END
