2025 Lecture07 P1 ID3
Supervised learning: Training
• Consider a labeled training set of 𝑁 examples.
(𝑥1, 𝑦1), (𝑥2, 𝑦2), … , (𝑥𝑁 , 𝑦𝑁 )
• where each 𝑦𝑗 was generated by an unknown function 𝑦 = 𝑓(𝑥).
• The output 𝑦𝑗 is called the ground truth, i.e., the true answer that the model must predict.
Supervised learning: Hypothesis space
• ℎ is drawn from a hypothesis space 𝐻 of possible functions.
• E.g., 𝐻 might be the set of polynomials of degree 3, or the set of 3-SAT Boolean logic formulas.
• Choose 𝐻 using prior knowledge about the process that generated the data, or by exploratory data analysis (EDA).
• EDA examines the data with statistical tests and visualizations to get some insight into what hypothesis space might be appropriate.
• Alternatively, try multiple hypothesis spaces and evaluate which one works best.
Supervised learning: Hypothesis
• The hypothesis ℎ is consistent if it agrees with the true function 𝑓 on all training observations, i.e., ∀𝑥𝑖: ℎ(𝑥𝑖) = 𝑦𝑖.
• For continuous data, we instead look for a best-fit function for which each ℎ(𝑥𝑖) is close to 𝑦𝑖.
• Ockham’s razor: Select the simplest consistent hypothesis.
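A minimal sketch of the consistency check (illustrative Python, not from the slides; the hypothesis h and the training pairs are made up for the example):

# A hypothesis is consistent if it reproduces every training label exactly.
train = [(1, 1), (2, 4), (3, 9)]           # hypothetical (x, y) pairs with y = x**2
h = lambda x: x * x                        # a candidate hypothesis drawn from H
consistent = all(h(x) == y for x, y in train)
print(consistent)                          # True: h agrees with f on all observations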
Supervised learning: Hypothesis
Finding hypotheses to fit data. Top row: four plots of best-fit functions from
four different hypothesis spaces trained on data set 1. Bottom row: the same
four functions, but trained on a slightly different data set (sampled from the
same 𝑓(𝑥) function).
Supervised learning: Testing
• The quality of the hypothesis ℎ depends on how accurately it
predicts the observations in the test set → generalization.
• The test set must come from the same distribution over the example space as the training set.
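A small sketch of how generalization is measured (illustrative Python, not from the slides; the split and the data are assumptions):

# Hold out part of the labeled data and measure accuracy on the unseen part.
data = [(x, x * x) for x in range(20)]     # hypothetical labeled examples
train, test = data[:15], data[15:]         # simple train/test split
h = lambda x: x * x                        # hypothesis fit on the training set
accuracy = sum(h(x) == y for x, y in test) / len(test)
print(accuracy)                            # 1.0 here; in general, lower than training accuracy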
ID3
Decision Tree
What is a decision tree?
• A decision tree is a supervised learning (SL) algorithm that predicts the output by learning decision rules inferred from the features in the data.
[Diagram: Data → Learning algorithm → Decision tree]
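For intuition, a learned tree is just a nested sequence of attribute tests. A minimal Python sketch for the restaurant example used below (illustrative only; the exact tests depend on the data, and Patrons and Hungry are taken from the splits discussed on later slides):

def will_wait(example):
    # Each internal node tests one attribute; each leaf returns a prediction.
    if example["Patrons"] == "None":
        return "No"
    if example["Patrons"] == "Some":
        return "Yes"
    # Patrons == "Full": fall back on a second test
    return "Yes" if example["Hungry"] == "Yes" else "No"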
Example problem: Restaurant waiting
Learning decision trees
• Divide and conquer: split the data into smaller and smaller subsets.
• Splits are usually on a single variable.
[Diagram: a root test x1 > ? with yes/no branches, each leading to a further test x2 > ?]
Learning decision trees
Splitting the examples by testing on attributes. At each node we show the positive
(light boxes) and negative (dark boxes) examples remaining. (a) Splitting on Type
brings us no nearer to distinguishing between positive and negative examples. (b)
Splitting on Patrons does a good job of separating positive and negative examples.
After splitting on Patrons, Hungry is a fairly good second test.
ID3 Decision tree: Pseudo-code
The decision tree learning algorithm. The function PLURALITY-VALUE selects the
most common output value among a set of examples, breaking ties randomly.
ID3 Decision tree: Pseudo-code
function LEARN-DECISION-TREE(examples, attributes, parent_examples) returns a tree
  …
  else   (case 1: there are still attributes left to split the examples)
    A ← argmax_{a ∈ attributes} IMPORTANCE(a, examples)
    tree ← a new decision tree with root test A
    for each value v of A do
      exs ← {e : e ∈ examples and e.A = v}
      subtree ← LEARN-DECISION-TREE(exs, attributes − A, examples)
      add a branch to tree with label (A = v) and subtree subtree
    return tree
The decision tree learning algorithm. The function IMPORTANCE evaluates the usefulness of attributes.
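A runnable Python sketch of this pseudocode (not from the slides; names such as learn_decision_tree and the "class" key are illustrative assumptions, and information gain is used as the IMPORTANCE measure):

from collections import Counter
from math import log2

def plurality_value(examples):
    # Most common output value among a set of examples (ties broken arbitrarily here).
    return Counter(e["class"] for e in examples).most_common(1)[0][0]

def entropy(examples):
    counts = Counter(e["class"] for e in examples)
    return -sum(c / len(examples) * log2(c / len(examples)) for c in counts.values())

def importance(attr, examples):
    # Information gain: parent entropy minus expected entropy after splitting on attr.
    remainder = 0.0
    for v in {e[attr] for e in examples}:
        subset = [e for e in examples if e[attr] == v]
        remainder += len(subset) / len(examples) * entropy(subset)
    return entropy(examples) - remainder

def learn_decision_tree(examples, attributes, parent_examples):
    if not examples:                                  # case 3: no examples left at this branch
        return plurality_value(parent_examples)
    if len({e["class"] for e in examples}) == 1:      # case 2: all examples have the same class
        return examples[0]["class"]
    if not attributes:                                # no attributes left: fall back to plurality
        return plurality_value(examples)
    A = max(attributes, key=lambda a: importance(a, examples))   # case 1: best attribute
    tree = {A: {}}
    for v in {e[A] for e in examples}:
        exs = [e for e in examples if e[A] == v]
        subtree = learn_decision_tree(exs, [a for a in attributes if a != A], examples)
        tree[A][v] = subtree                          # branch labelled (A = v)
    return tree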
ID3 Decision tree algorithm
1. There are some positive and some negative examples → choose the best attribute to split them.
2. The remaining examples are all positive (or all negative) → DONE, it is possible to answer Yes or No.
3. No examples are left at a branch → return a default value.
   • No example has been observed for this combination of attribute values.
   • The default value is calculated from the plurality classification of all the examples that were used in constructing the node’s parent.
Decision tree: Inductive learning
• Simplest: Construct a decision tree with one leaf for every example
  → memory-based learning
  → worse generalization
A purity measure with entropy
• The entropy of a random variable 𝑉 with values 𝑣𝑘, each having probability 𝑃(𝑣𝑘), measures the uncertainty of 𝑉 and is defined as
  H(V) = Σ_k P(v_k) log2(1 / P(v_k)) = - Σ_k P(v_k) log2 P(v_k)
• It is a fundamental quantity in information theory.
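As a quick numerical check (an illustrative Python sketch, not from the slides):

from math import log2

def entropy(probs):
    # H(V) = -sum_k P(v_k) log2 P(v_k); terms with zero probability contribute 0.
    return -sum(p * log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))    # 1.0  bit: maximal uncertainty for a Boolean variable
print(entropy([0.99, 0.01]))  # ~0.08 bit: almost certain, little uncertainty
print(entropy([1.0, 0.0]))    # 0.0  bit: no uncertainty at all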
A purity measure with entropy
• Entropy is maximal when all possibilities are equally likely.
[Diagram: a split whose Yes and No branches each still contain 3 Y, 3 N, so the uncertainty is not reduced]
Example problem: Restaurant waiting
Sat/Fri? split: Yes → 2 Y, 3 N; No → 4 Y, 3 N
Example problem: Restaurant waiting
Hungry? split: Yes → 5 Y, 2 N; No → 1 Y, 4 N
Example problem: Restaurant waiting
Raining? split: Yes → 3 Y, 2 N; No → 3 Y, 4 N
Example problem: Restaurant waiting
Reservation? split: Yes → 3 Y, 2 N; No → 3 Y, 4 N
Example problem: Restaurant waiting
Type? split: French → 1 Y, 1 N; Italian → 1 Y, 1 N; Thai → 2 Y, 2 N; Burger → 2 Y, 2 N
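As a worked check (computed from the counts above; these numbers are not on the slides): splitting on Type leaves every branch with equal numbers of Y and N, so the average entropy stays at 1 bit and the information gain is 0. Splitting on Hungry gives
H_yes = - 5/7 log2(5/7) - 2/7 log2(2/7) ≈ 0.863
H_no = - 1/5 log2(1/5) - 4/5 log2(4/5) ≈ 0.722
AE = 7/12 · 0.863 + 5/12 · 0.722 ≈ 0.804
i.e., a gain of about 1 - 0.804 ≈ 0.196 bits over the parent entropy of 1 bit, so Hungry is a much better test than Type.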
Another numerical example
Example data set: Weather data
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
overcast hot high false yes
rainy mild high false yes
rainy cool normal false yes
rainy cool normal true no
overcast cool normal true yes
sunny mild high false no
sunny cool normal false yes
rainy mild normal false yes
sunny mild normal true yes
overcast mild high true yes
overcast hot normal false yes
rainy mild high true no
Numerical example: Choose the root
Splitting on outlook: sunny → 2+/3-, overcast → 4+/0-, rainy → 3+/2-
H_sunny = - 2/5 log2(2/5) - 3/5 log2(3/5) = 0.971
H_overcast = - 4/4 log2(4/4) - 0/4 log2(0/4) = 0
H_rainy = - 3/5 log2(3/5) - 2/5 log2(2/5) = 0.971
AE = 5/14 · 0.971 + 4/14 · 0 + 5/14 · 0.971 = 0.694
Numerical example: Choose the root
Splitting on humidity: high → 3+/4-, normal → 6+/1-
Splitting on windy: true → 3+/3-, false → 6+/2-
H_true = - 3/6 log2(3/6) - 3/6 log2(3/6) = 1
H_false = - 6/8 log2(6/8) - 2/8 log2(2/8) = 0.811
AE = 6/14 · 1 + 8/14 · 0.811 = 0.892
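Outlook has the lowest average entropy (0.694 vs. 0.892 for windy), so it is chosen as the root. A small Python sketch (illustrative names, not from the slides) that reproduces these numbers from the weather table:

from collections import Counter
from math import log2

# Weather data set from the slides: (outlook, temperature, humidity, windy, play)
rows = [
    ("sunny", "hot", "high", "false", "no"), ("sunny", "hot", "high", "true", "no"),
    ("overcast", "hot", "high", "false", "yes"), ("rainy", "mild", "high", "false", "yes"),
    ("rainy", "cool", "normal", "false", "yes"), ("rainy", "cool", "normal", "true", "no"),
    ("overcast", "cool", "normal", "true", "yes"), ("sunny", "mild", "high", "false", "no"),
    ("sunny", "cool", "normal", "false", "yes"), ("rainy", "mild", "normal", "false", "yes"),
    ("sunny", "mild", "normal", "true", "yes"), ("overcast", "mild", "high", "true", "yes"),
    ("overcast", "hot", "normal", "false", "yes"), ("rainy", "mild", "high", "true", "no"),
]
attrs = {"outlook": 0, "temperature": 1, "humidity": 2, "windy": 3}

def entropy(labels):
    counts = Counter(labels)
    return -sum(c / len(labels) * log2(c / len(labels)) for c in counts.values())

def average_entropy(attr):
    # Weighted entropy of the play label after splitting on attr (the AE on the slides).
    i = attrs[attr]
    total = 0.0
    for v in {r[i] for r in rows}:
        subset = [r[-1] for r in rows if r[i] == v]
        total += len(subset) / len(rows) * entropy(subset)
    return total

for a in attrs:
    print(a, round(average_entropy(a), 3))   # outlook 0.694, windy 0.892, ...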
Numerical example: The partial tree
[Partial tree: outlook at the root; sunny branch → 2+/3- (still mixed), overcast branch → yes, rainy branch → 3+/2- (still mixed)]
Numerical example: The second level
• Choose an attribute for the branch outlook = sunny.
outlook temperature humidity windy play
sunny hot high false no
sunny hot high true no
sunny mild high false no
sunny cool normal false yes
sunny mild normal true yes
[Resulting tree: outlook at the root; sunny → humidity (high: no, normal: yes); overcast → yes; rainy → windy (true: no, false: yes)]
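A quick check on the sunny subset above (not shown on the slide): splitting on humidity gives high → 0 Y, 3 N and normal → 2 Y, 0 N, so both branches are pure and the average entropy is 0; splitting on windy gives true → 1 Y, 1 N and false → 1 Y, 2 N with AE = 2/5 · 1 + 3/5 · 0.918 ≈ 0.951. Humidity is therefore chosen for the sunny branch; the rainy branch is resolved the same way, where windy gives pure subsets.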
Quiz 01: ID3 decision tree
• The data represent files on a computer system. Possible values of the class variable are “infected”, meaning the file has a virus infection, or “clean” if it doesn't.
• Derive a decision tree for virus identification.