2025 Lecture07 P1 ID3

The document presents an overview of the ID3 decision tree algorithm, a supervised learning method that predicts outcomes from decision rules derived from data features. It discusses the process of training and testing hypotheses, the structure of decision trees, and how to choose the best attribute to split on using entropy and information gain. It also works through a practical example: predicting whether a customer will wait for a table at a restaurant.


ID3 DECISION TREE

Nguyễn Ngọc Thảo – Nguyễn Hải Minh


{nnthao, nhminh}@fit.hcmus.edu.vn
Outline
• Supervised learning: A brief revision
• ID3 decision tree algorithm

2
Supervised learning: Training
• Consider a labeled training set of 𝑁 examples.
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥).
• The output 𝑦𝑗 is called the ground truth, i.e., the true answer that the model must predict.

• The training process finds a hypothesis ℎ such that 𝒉 ≈ 𝒇.

3
Supervised learning: Hypothesis space
• ℎ is drawn from a hypothesis space 𝐻 of possible functions.
• E.g., 𝐻 might be the set of polynomials of degree 3, or the set of 3-SAT Boolean logic formulas.
• Choose 𝐻 by some prior knowledge about the process that
generated the data or exploratory data analysis (EDA).
• EDA examines the data with statistical tests and visualizations to get
some insight into what hypothesis space might be appropriate.
• Or just try multiple hypothesis spaces and evaluate which
one works best.

4
Supervised learning: Hypothesis
• The hypothesis ℎ is consistent if it agrees with the true function 𝑓 on all training observations, i.e., ℎ(𝑥𝑖) = 𝑦𝑖 for all 𝑖.
• For continuous data, we instead look for a best-fit function for which each ℎ(𝑥𝑖) is close to 𝑦𝑖.
• Ockham’s razor: Select the simplest consistent hypothesis.

5
Supervised learning: Hypothesis

Finding hypotheses to fit data. Top row: four plots of best-fit functions from
four different hypothesis spaces trained on data set 1. Bottom row: the same
four functions, but trained on a slightly different data set (sampled from the
same 𝑓(𝑥) function).
6
Supervised learning: Testing
• The quality of the hypothesis ℎ depends on how accurately it
predicts the observations in the test set → generalization.
• The test set must use the same distribution over the example space as the training set.

A learning curve for the decision tree learning algorithm on 100 randomly generated examples in the restaurant domain. Each data point is the average of 20 trials.

7
ID3
Decision Tree
What is a decision tree?
• A decision tree is a supervised learning algorithm that predicts the output by learning decision rules inferred from the features in the data.

Data → Learning algorithm → Decision tree

• It is often the building block for more complex algorithms, such as random forests and gradient boosting machines.

9
Example problem: Restaurant waiting

Predicting whether a certain person will wait to have a seat in a restaurant, based on the following attributes:

1. Alternate: is there an alternative restaurant nearby?


2. Bar: is there a comfortable bar area to wait in?
3. Fri/Sat: is today Friday or Saturday?
4. Hungry: are we hungry?
5. Patrons: number of people in the restaurant (None, Some, Full)
6. Price: price range ($, $$, $$$)
7. Raining: is it raining outside?
8. Reservation: have we made a reservation?
9. Type: kind of restaurant (French, Italian, Thai, Burger)
10. WaitEstimate: estimated waiting time (0-10, 10-30, 30-60, >60)

10
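For concreteness, a single labeled example from this domain might be represented as a mapping from attribute names to values plus the ground-truth label. This is only an illustrative Python sketch; the attribute names mirror the list above but are not prescribed by the slides.

# One labeled training example: attribute values plus the ground-truth label.
example = {
    "Alternate": "Yes", "Bar": "No", "FriSat": "No", "Hungry": "Yes",
    "Patrons": "Some", "Price": "$$$", "Raining": "No",
    "Reservation": "Yes", "Type": "French", "WaitEstimate": "0-10",
    "label": "Yes",   # the person waited
}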
Example problem: Restaurant waiting

A (true) decision tree for deciding whether to wait for a table.

11
Example problem: Restaurant waiting

12
Learning decision trees
• Divide and conquer: Split data into smaller and smaller subsets.
• Splits are usually on a single variable.

[Diagram: a binary tree of threshold tests, e.g., x1 > θ? at the root and x2 > θ? at each child, each with yes/no branches.]

• After splitting up, each outcome is a new decision tree learning problem with fewer examples and one less attribute.

13
Learning decision trees

Splitting the examples by testing on attributes. At each node we show the positive
(light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b)
Splitting on Patrons does a good job of separating positive and negative examples.
After splitting on Patrons, Hungry is a fairly good second test.

14
ID3 Decision tree: Pseudo-code

function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree

    if examples is empty                                   [case 3: no examples left]
        then return PLURALITY-VALUE(parent_examples)
    else if all examples have the same classification      [case 2: remaining examples are all pos/neg]
        then return the classification
    else if attributes is empty                            [case 4: no attributes left but examples are still pos & neg]
        then return PLURALITY-VALUE(examples)
    else
        …

The decision tree learning algorithm. The function PLURALITY-VALUE selects the most common output value among a set of examples, breaking ties randomly.
15
ID3 Decision tree: Pseudo-code
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
    …
    else                                                   [case 1: there are still attributes to split the examples]
        A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
        tree ← a new decision tree with root test A
        for each value v of A do
            exs ← {e : e ∈ examples and e.A = v}
            subtree ← LEARN-DECISION-TREE(exs, attributes − A, examples)
            add a branch to tree with label (A = v) and subtree subtree
        return tree

The decision tree learning algorithm (continued). The function IMPORTANCE evaluates the usefulness of an attribute for splitting the examples.
16
ID3 Decision tree algorithm
1. There are some positive and some negative examples → choose the best attribute to split them.
2. The remaining examples are all positive (or all negative) → DONE, it is possible to answer Yes or No.
3. No examples left at a branch → return a default value.
   • No example has been observed for this combination of attribute values.
   • The default value is the plurality classification of all the examples that were used in constructing the node's parent.
4. No attributes left, but both positive and negative examples remain → return the plurality classification of the remaining examples.
   • These are examples with the same description but different classifications.
   • This is due to an error or noise in the data, a nondeterministic domain, or no observation of an attribute that would distinguish the examples.
17
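The pseudo-code and the four cases above translate almost line for line into Python. The sketch below is ours, not part of the lecture: it assumes each example is a dict of attribute values plus a "label" key (as in the earlier illustration), and it takes the IMPORTANCE measure as a parameter, since the information-gain version is only defined on the following slides.

from collections import Counter

def plurality_value(examples):
    """Most common label among the examples (ties broken arbitrarily)."""
    return Counter(e["label"] for e in examples).most_common(1)[0][0]

def learn_decision_tree(examples, attributes, parent_examples, importance):
    # Case 3: no examples left -> default to the parent's plurality label.
    if not examples:
        return plurality_value(parent_examples)
    # Case 2: all remaining examples share one classification -> leaf.
    labels = {e["label"] for e in examples}
    if len(labels) == 1:
        return labels.pop()
    # Case 4: no attributes left but labels still disagree (noise, missing
    # attributes, nondeterministic domain) -> plurality of what remains.
    if not attributes:
        return plurality_value(examples)
    # Case 1: choose the most important attribute and recurse on each value.
    A = max(attributes, key=lambda a: importance(a, examples))
    tree = {A: {}}
    for v in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == v]
        tree[A][v] = learn_decision_tree(
            exs, [a for a in attributes if a != A], examples, importance)
    return tree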
Example problem: Restaurant waiting

The decision tree induced from the 12-example training set.


18
Example problem: Restaurant waiting
• The induced decision tree can classify all the examples
without tests for Raining and Reservation.
• It can detect interesting and previously unsuspected patterns.
• E.g., customers will wait for Thai food on weekends.
• It is also bound to make some mistakes for cases where it
has seen no examples.
• E.g., how about a situation in which the wait is 0–10 minutes, the
restaurant is full, yet the customer is not hungry?

19
Decision tree: Inductive learning
• Simplest: Construct a decision tree with one leaf for every example
  → memory-based learning
  → worse generalization

• Advanced: Split on each variable so that the purity of each split increases (i.e., each subset gets closer to containing only yes or only no examples).

20
A purity measure with entropy
• Entropy measures the uncertainty of a random variable 𝑉 whose values 𝑣𝑘 occur with probability 𝑃(𝑣𝑘):

  H(V) = Σ𝑘 P(𝑣𝑘) log₂ (1 / P(𝑣𝑘)) = − Σ𝑘 P(𝑣𝑘) log₂ P(𝑣𝑘)

• It is a fundamental quantity in information theory.

• The information gain (IG) for an attribute 𝐴 is the expected reduction in entropy from before to after splitting the data on 𝐴.

21
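A minimal Python sketch of these two quantities (the helper names are ours); terms with P(v_k) = 0 contribute nothing, by the usual 0 · log₂ 0 = 0 convention.

from math import log2

def entropy(probs):
    """H(V) = -sum_k P(v_k) * log2 P(v_k); zero-probability terms are skipped."""
    return 0.0 - sum(p * log2(p) for p in probs if p > 0)

def entropy_from_counts(counts):
    """Entropy of the empirical distribution given per-class counts."""
    total = sum(counts)
    return entropy(c / total for c in counts)

# A 50/50 node is maximally uncertain; a pure node has zero entropy.
print(entropy_from_counts([6, 6]))   # 1.0
print(entropy_from_counts([4, 0]))   # 0.0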
A purity measure with entropy
• Entropy is maximal when all possibilities are equally likely.

• Entropy is zero in a pure "yes" (or pure "no") node.

• The decision tree aims to decrease the entropy, i.e., increase the information gain, at each node.
22
Example problem: Restaurant waiting
Alternate?
   Yes: 3 Y, 3 N    No: 3 Y, 3 N

• Calculate the entropy of the whole data set:
  H(S) = −6/12 log₂(6/12) − 6/12 log₂(6/12) = 1

• Calculate the average entropy of attribute Alternate?:
  AE(Alternate?) = P(Alt = Y) × H(Alt = Y) + P(Alt = N) × H(Alt = N)
                 = 6/12 (−3/6 log₂(3/6) − 3/6 log₂(3/6)) + 6/12 (−3/6 log₂(3/6) − 3/6 log₂(3/6)) = 1

• Calculate the information gain of attribute Alternate?:
  IG(Alternate?) = H(S) − AE(Alternate?) = 1 − 1 = 0
23
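The same calculation written as a small helper (again our naming, reusing entropy_from_counts from the sketch above). It takes the (yes, no) counts of each branch, reproduces the numbers on this slide, and could serve as the IMPORTANCE function in the earlier pseudo-code sketch.

def average_entropy(branch_counts):
    """Weighted entropy after a split; branch_counts = [(yes, no), ...]."""
    total = sum(y + n for y, n in branch_counts)
    return sum((y + n) / total * entropy_from_counts([y, n])
               for y, n in branch_counts)

def information_gain(branch_counts):
    """Entropy of the whole set minus the average entropy after the split."""
    yes = sum(y for y, _ in branch_counts)
    no = sum(n for _, n in branch_counts)
    return entropy_from_counts([yes, no]) - average_entropy(branch_counts)

# Alternate?: the Yes branch holds 3 Y / 3 N, the No branch holds 3 Y / 3 N.
print(average_entropy([(3, 3), (3, 3)]))    # 1.0
print(information_gain([(3, 3), (3, 3)]))   # 0.0 -> Alternate is uninformative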
Example problem: Restaurant waiting
Bar?
   Yes: 3 Y, 3 N    No: 3 Y, 3 N

• Average entropy of attribute Bar?:
  AE(Bar?) = 6/12 (−3/6 log₂(3/6) − 3/6 log₂(3/6)) + 6/12 (−3/6 log₂(3/6) − 3/6 log₂(3/6)) = 1

• Information gain of attribute Bar?:
  IG(Bar?) = H(S) − AE(Bar?) = 1 − 1 = 0

24
Example problem: Restaurant waiting
Sat/Fri?
   Yes: 2 Y, 3 N    No: 4 Y, 3 N

• Average entropy of attribute Sat/Fri?:
  AE(Sat/Fri?) = 5/12 (−2/5 log₂(2/5) − 3/5 log₂(3/5)) + 7/12 (−4/7 log₂(4/7) − 3/7 log₂(3/7)) = 0.979

• Information gain of attribute Sat/Fri?:
  IG(Sat/Fri?) = H(S) − AE(Sat/Fri?) = 1 − 0.979 = 0.021

25
Example problem: Restaurant waiting
Hungry?
   Yes: 5 Y, 2 N    No: 1 Y, 4 N

• Average entropy of attribute Hungry?:
  AE(Hungry?) = 7/12 (−5/7 log₂(5/7) − 2/7 log₂(2/7)) + 5/12 (−1/5 log₂(1/5) − 4/5 log₂(4/5)) = 0.804

• Information gain of attribute Hungry?:
  IG(Hungry?) = H(S) − AE(Hungry?) = 1 − 0.804 = 0.196

26
Example problem: Restaurant waiting
Raining?
   Yes: 3 Y, 2 N    No: 3 Y, 4 N

• Average entropy of attribute Raining?:
  AE(Raining?) = 5/12 (−3/5 log₂(3/5) − 2/5 log₂(2/5)) + 7/12 (−3/7 log₂(3/7) − 4/7 log₂(4/7)) = 0.979

• Information gain of attribute Raining?:
  IG(Raining?) = H(S) − AE(Raining?) = 1 − 0.979 = 0.021

27
Example problem: Restaurant waiting
Reservation?
   Yes: 3 Y, 2 N    No: 3 Y, 4 N

• Average entropy of attribute Reservation?:
  AE(Reservation?) = 5/12 (−3/5 log₂(3/5) − 2/5 log₂(2/5)) + 7/12 (−3/7 log₂(3/7) − 4/7 log₂(4/7)) = 0.979

• Information gain of attribute Reservation?:
  IG(Reservation?) = H(S) − AE(Reservation?) = 1 − 0.979 = 0.021

28
Example problem: Restaurant waiting
Type?
   French: 1 Y, 1 N    Italian: 1 Y, 1 N    Thai: 2 Y, 2 N    Burger: 2 Y, 2 N

• Average entropy of attribute Type?:
  AE(Type?) = 2/12 (−1/2 log₂(1/2) − 1/2 log₂(1/2)) + 2/12 (−1/2 log₂(1/2) − 1/2 log₂(1/2))
            + 4/12 (−2/4 log₂(2/4) − 2/4 log₂(2/4)) + 4/12 (−2/4 log₂(2/4) − 2/4 log₂(2/4)) = 1

• Information gain of attribute Type?:
  IG(Type?) = H(S) − AE(Type?) = 1 − 1 = 0

31
Example problem: Restaurant waiting
Est. waiting time?
   0-10: 4 Y, 2 N    10-30: 1 Y, 1 N    30-60: 1 Y, 1 N    >60: 0 Y, 2 N

• Average entropy of attribute Est. waiting time? (with 0 log₂ 0 = 0):
  AE(Est. waiting time?) = 6/12 (−4/6 log₂(4/6) − 2/6 log₂(2/6)) + 2/12 (−1/2 log₂(1/2) − 1/2 log₂(1/2))
                         + 2/12 (−1/2 log₂(1/2) − 1/2 log₂(1/2)) + 2/12 (−0/2 log₂(0/2) − 2/2 log₂(2/2)) = 0.792

• Information gain of attribute Est. waiting time?:
  IG(Est. waiting time?) = H(S) − AE(Est. waiting time?) = 1 − 0.792 = 0.208
32
Example problem: Restaurant waiting
• The largest information gain (0.541), i.e., the smallest average entropy (0.459), is achieved by splitting on Patrons.

Patrons?
   None: 0 Y, 2 N    Some: 4 Y, 0 N    Full: 2 Y, 4 N (to be split further)

• Continue making new splits, always purifying the nodes.
33
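Checking these figures with the helpers above, where Patrons splits the 12 examples into None = 0 Y / 2 N, Some = 4 Y / 0 N and Full = 2 Y / 4 N:

print(round(average_entropy([(0, 2), (4, 0), (2, 4)]), 3))    # 0.459
print(round(information_gain([(0, 2), (4, 0), (2, 4)]), 3))   # 0.541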
Another numerical example
Example data set: Weather data
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
35
Numerical example: Choose the root
outlook: sunny 2+/3−, overcast 4+/0−, rainy 3+/2−
  H(sunny) = −2/5 log₂(2/5) − 3/5 log₂(3/5) = 0.971
  H(overcast) = −4/4 log₂(4/4) − 0/4 log₂(0/4) = 0
  H(rainy) = −3/5 log₂(3/5) − 2/5 log₂(2/5) = 0.971
  AE(outlook) = 5/14 × 0.971 + 4/14 × 0 + 5/14 × 0.971 = 0.694

temperature: hot 2+/2−, mild 4+/2−, cool 3+/1−
  H(hot) = −2/4 log₂(2/4) − 2/4 log₂(2/4) = 1
  H(mild) = −4/6 log₂(4/6) − 2/6 log₂(2/6) = 0.918
  H(cool) = −3/4 log₂(3/4) − 1/4 log₂(1/4) = 0.811
  AE(temperature) = 4/14 × 1 + 6/14 × 0.918 + 4/14 × 0.811 = 0.911

36
Numerical example: Choose the root

humidity: high 3+/4−, normal 6+/1−
  H(high) = −3/7 log₂(3/7) − 4/7 log₂(4/7) = 0.985
  H(normal) = −6/7 log₂(6/7) − 1/7 log₂(1/7) = 0.592
  AE(humidity) = 7/14 × 0.985 + 7/14 × 0.592 = 0.789

windy: true 3+/3−, false 6+/2−
  H(true) = −3/6 log₂(3/6) − 3/6 log₂(3/6) = 1
  H(false) = −6/8 log₂(6/8) − 2/8 log₂(2/8) = 0.811
  AE(windy) = 6/14 × 1 + 8/14 × 0.811 = 0.892

37
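A short sketch (reusing the helpers above) that derives these numbers directly from the 14-row weather table; the 0.788 printed for humidity versus 0.789 on the slide is only a matter of rounding intermediate values.

# The 14 weather examples as (outlook, temperature, humidity, windy, play).
rows = [
    ("sunny", "hot", "high", False, "no"),       ("sunny", "hot", "high", True, "no"),
    ("overcast", "hot", "high", False, "yes"),   ("rainy", "mild", "high", False, "yes"),
    ("rainy", "cool", "normal", False, "yes"),   ("rainy", "cool", "normal", True, "no"),
    ("overcast", "cool", "normal", True, "yes"), ("sunny", "mild", "high", False, "no"),
    ("sunny", "cool", "normal", False, "yes"),   ("rainy", "mild", "normal", False, "yes"),
    ("sunny", "mild", "normal", True, "yes"),    ("overcast", "mild", "high", True, "yes"),
    ("overcast", "hot", "normal", False, "yes"), ("rainy", "mild", "high", True, "no"),
]
attrs = ["outlook", "temperature", "humidity", "windy"]

def branch_counts(col):
    """(yes, no) counts for each value found in column col."""
    counts = {}
    for r in rows:
        y, n = counts.setdefault(r[col], (0, 0))
        counts[r[col]] = (y + (r[4] == "yes"), n + (r[4] == "no"))
    return list(counts.values())

for i, a in enumerate(attrs):
    print(a, round(average_entropy(branch_counts(i)), 3))
# outlook 0.694, temperature 0.911, humidity 0.788, windy 0.892
# -> outlook has the smallest average entropy, so it becomes the root.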
Numerical example: The partial tree

[Partial tree: outlook at the root; the sunny branch holds 2+/3− examples (to be split further), the overcast branch is a pure yes leaf, and the rainy branch holds 3+/2− examples (to be split further).]

• Which attributes are chosen for the next splits?
• Continue splitting…

38
Numerical example: The second level
• Choose an attribute for the branch outlook = sunny.
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes

temperature: hot 0+/2−, mild 1+/1−, cool 1+/0−
  H(hot) = 0, H(mild) = 1, H(cool) = 0 → AE = 2/5 × 0 + 2/5 × 1 + 1/5 × 0 = 0.4

humidity: high 0+/3−, normal 2+/0−
  H(high) = 0, H(normal) = 0 → AE = 0

windy: true 1+/1−, false 1+/2−
  H(true) = 1, H(false) = 0.918 → AE = 2/5 × 1 + 3/5 × 0.918 = 0.951

→ Humidity (AE = 0) is chosen for the sunny branch.
39
Numerical example: The second level
• Choose an attribute for the branch outlook = rainy

outlook temperature humidity windy play


rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
rainy mild normal false yes
rainy mild high true no

temperature: mild 2+/1−, cool 1+/1−
  H(mild) = 0.918, H(cool) = 1 → AE = 3/5 × 0.918 + 2/5 × 1 = 0.951

humidity: high 1+/1−, normal 2+/1−
  H(high) = 1, H(normal) = 0.918 → AE = 2/5 × 1 + 3/5 × 0.918 = 0.951

windy: true 0+/2−, false 3+/0−
  H(true) = 0, H(false) = 0 → AE = 0

→ Windy (AE = 0) is chosen for the rainy branch.

40
Numerical example: The final tree

[Final tree: outlook at the root; sunny → humidity (high → no, normal → yes); overcast → yes; rainy → windy (true → no, false → yes).]

41
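For comparison, a library version of the same exercise on the weather data (reusing the rows list from the sketch above). Note that scikit-learn implements CART rather than ID3, so it learns binary splits on one-hot encoded attributes, but with criterion="entropy" it selects splits by the same information-gain idea; export_text prints the learned tree for inspection.

import pandas as pd
from sklearn.tree import DecisionTreeClassifier, export_text

df = pd.DataFrame(rows, columns=["outlook", "temperature", "humidity", "windy", "play"])
X = pd.get_dummies(df[["outlook", "temperature", "humidity", "windy"]])  # one-hot encode
y = df["play"]

clf = DecisionTreeClassifier(criterion="entropy", random_state=0).fit(X, y)
print(export_text(clf, feature_names=list(X.columns)))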
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible
values of the class variable are “infected”, which implies the
file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.

No. Writable Updated Size Class


1 Yes No Small Infected
2 Yes Yes Large Infected
3 No Yes Med Infected
4 No No Med Clean
5 Yes No Large Clean
6 No No Large Clean

42
