Lecture 2
Decision Trees
[Figure: example decision tree, with a Yes/No branch at each internal node]
Exercise: Draw the decision boundary for the decision tree on the left.
Place the width measurement on the x-axis and the height measurement on
the y-axis.
[Figure: the decision tree (left) and its axis-aligned decision boundary (right), with width on the x-axis and height on the y-axis]
Each path from root to a leaf defines a region $R_m$ of input space, and points that fall into $R_m$ will have prediction $y^{(m)}$. But what should this $y^{(m)}$ be?
Let $\{(x^{(m_1)}, t^{(m_1)}), \ldots, (x^{(m_k)}, t^{(m_k)})\}$ be the training examples that fall into $R_m$.
Classification Tree (Discrete Output)
Leaf value $y^{(m)}$ typically set to the most common value of $t$ amongst training data points in $R_m$
Regression Tree (Continuous Output)
Leaf value $y^{(m)}$ typically set to the mean value of $t$ amongst training data points in $R_m$
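For concreteness, here is a minimal Python sketch (the helper name `leaf_value` is illustrative, not from the lecture) of how the leaf prediction could be computed from the targets that fall into a region $R_m$:

```python
import numpy as np

def leaf_value(targets, task="classification"):
    """Compute the leaf prediction y^(m) from the targets of the
    training examples that fall into region R_m."""
    targets = np.asarray(targets)
    if task == "classification":
        # most common target value (majority vote)
        values, counts = np.unique(targets, return_counts=True)
        return values[np.argmax(counts)]
    else:
        # regression: mean of the targets
        return targets.mean()

print(leaf_value([0, 1, 1, 1, 0]))                 # -> 1 (majority class)
print(leaf_value([2.0, 4.0, 6.0], "regression"))   # -> 4.0 (mean)
```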
In labs 2, 5, and 7, we will use data from The National Health and
Nutrition Examination Survey in the United States.
The survey assesses people’s health and nutritional status
Combines data from interviews and physical examinations
- Race/Ethnicity
- Ever had chest pain
- Age
- BMI
- Blood pressure
- ...
Our Target: Presence of heart disease (self-reported)
In lab 2, we will build a classification tree to assess the presence of heart
disease.
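As a rough preview of what such a model could look like in code, here is a hedged sketch using scikit-learn; the file path and column names are placeholders, not the lab's actual NHANES variable names:

```python
# Hypothetical sketch: fitting a classification tree on NHANES-style data.
# The CSV path and feature columns below are placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("nhanes.csv")                      # placeholder path
X = df[["age", "bmi", "blood_pressure"]]            # placeholder feature columns
t = df["heart_disease"]                             # self-reported target (0/1)

X_train, X_valid, t_train, t_valid = train_test_split(X, t, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, t_train)
print(tree.score(X_valid, t_valid))                 # validation accuracy
```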
[Figure: two candidate splits of the same data, labelled A and B]
Q: A and B have the same misclassification rate, so which is the best split?
Information Theory
You may have encountered the term entropy as a quantity describing the degree of disorder in chemical and physical systems.
In statistics, the entropy of a discrete random variable is a number that
quantifies the uncertainty inherent in its possible outcomes:
$$H(X) = -\mathbb{E}_{X \sim p}[\log_2 p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$
Intuitively, you can think of entropy as a number that represents how much
certainty is gained, on average, by observing a random draw from a
probability distribution.
Let’s look at some examples. . .
First coin, with probability of heads $p = \frac{8}{9}$:
$$-\frac{8}{9}\log_2\frac{8}{9} - \frac{1}{9}\log_2\frac{1}{9} \approx \frac{1}{2}$$
Second coin, with probability of heads $p = \frac{10}{18} = \frac{5}{9}$:
$$-\frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9} \approx 0.99$$
The coin whose outcomes are more certain has a lower entropy!
In the extreme case p = 0 or p = 1, we were certain of the outcome
before observing. So, we gained no certainty by observing it, i.e.,
entropy is 0.
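These numbers are easy to check numerically; a minimal sketch of the entropy computation:

```python
import numpy as np

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute 0."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1.0 / probs)))  # same as -sum p log2 p

print(entropy([8/9, 1/9]))   # ~0.50: this coin's outcome is fairly predictable
print(entropy([5/9, 4/9]))   # ~0.99: close to a fair coin
print(entropy([0.5, 0.5]))   # 1.00: a fair coin has the highest entropy
```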
Entropy of a Coin
The entropy of a coin is highest if the probability of obtaining heads is $p = \frac{1}{2}$.
[Figure: entropy of a coin (in bits) as a function of the probability $p$ of heads]
High Entropy
- Variable has a uniform-like distribution over many outcomes
- Flat histogram
- Values sampled from it are less predictable
Low Entropy
- Distribution is concentrated on only a few outcomes
- Histogram is concentrated in a few areas
- Values sampled from it are more predictable
[Table: joint distribution of weather outcomes, with columns Cloudy / Not Cloudy and rows Raining / Not Raining]
$$H(Y \mid X = \text{raining}) = -\sum_{y \in \mathcal{Y}} p(y \mid x = \text{raining}) \log_2 p(y \mid x = \text{raining}) = -\frac{24}{25}\log_2\frac{24}{25} - \frac{1}{25}\log_2\frac{1}{25} \approx 0.24 \text{ bits}$$
We used: $p(y \mid x) = \frac{p(x, y)}{p(x)}$, and $p(x) = \sum_y p(x, y)$ (sum across a row of the table).
Conditional Entropy and Expected Conditional Entropy
$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) = \frac{1}{4} H(Y \mid X = \text{raining}) + \frac{3}{4} H(Y \mid X = \text{not raining}) \approx 0.75 \text{ bits}$$
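Both quantities can be computed directly from the joint distribution. The sketch below assumes the joint probabilities behind the table above (24/100, 1/100, 25/100, 50/100), which are consistent with the calculations shown:

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Joint distribution p(x, y): rows = raining / not raining,
# columns = cloudy / not cloudy (values assumed to match the table above).
joint = np.array([[24, 1],
                  [25, 50]]) / 100.0

p_x = joint.sum(axis=1)   # marginal p(x): sum across each row

# H(Y | X = x) for each x, then the expected conditional entropy H(Y | X)
H_Y_given_x = np.array([entropy(row / row.sum()) for row in joint])
H_Y_given_X = np.sum(p_x * H_Y_given_x)

print(H_Y_given_x[0])   # ~0.24 bits: H(Y | X = raining)
print(H_Y_given_X)      # ~0.75 bits: expected conditional entropy
```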
$H$ is always non-negative.
Chain rule: $H(X, Y) = H(X \mid Y) + H(Y) = H(Y \mid X) + H(X)$
If $X$ and $Y$ are independent, then $X$ does not affect our uncertainty about $Y$: $H(Y \mid X) = H(Y)$
Knowing $Y$ makes our knowledge of $Y$ certain: $H(Y \mid Y) = 0$
By knowing $X$, we can only decrease uncertainty about $Y$: $H(Y \mid X) \leq H(Y)$
How much more certain am I about whether it’s cloudy if I’m told whether it is raining? My uncertainty in $Y$ minus my expected uncertainty that would remain in $Y$ after seeing $X$:
$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
This is called the information gain $IG(Y \mid X)$ in $Y$ due to $X$, or the mutual information of $Y$ and $X$.
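As a sketch of how information gain can be computed to score a split from a joint count table (the helper names below are illustrative):

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(joint):
    """IG(Y|X) = H(Y) - H(Y|X), computed from a joint count table
    with rows indexed by x and columns indexed by y."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    H_Y = entropy(p_y)
    H_Y_given_X = sum(p * entropy(row / row.sum()) for p, row in zip(p_x, joint))
    return H_Y - H_Y_given_X

# Raining/cloudy example: uncertainty about cloudiness drops by ~0.25 bits
# once we know whether it is raining.
print(information_gain([[24, 1], [25, 50]]))
```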
Problems:
You have exponentially less data at lower levels
Too big of a tree can overfit the data
Greedy algorithms don’t necessarily yield the global optimum
There are other criteria used to measure the quality of a split, e.g., Gini
index
Trees can be pruned in order to make them less complex
Decision trees can also be used for regression on real-valued outputs.
Choose splits to minimize squared error, rather than maximize
information gain.
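A minimal sketch of scoring a candidate regression split by its total squared error (the function name and data are illustrative):

```python
import numpy as np

def split_squared_error(x, t, threshold):
    """Total squared error if we split on x <= threshold and predict
    the mean target in each resulting region."""
    x, t = np.asarray(x), np.asarray(t, dtype=float)
    left, right = t[x <= threshold], t[x > threshold]
    sse = 0.0
    for region in (left, right):
        if len(region) > 0:
            sse += np.sum((region - region.mean()) ** 2)
    return sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
t = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
# The split at 3 separates the two clusters of targets, so its error is much lower.
print(split_squared_error(x, t, threshold=3.0))   # ~0.10
print(split_squared_error(x, t, threshold=10.0))  # ~12.10
```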
Recall the setup above: in lab 2 we use NHANES data (race/ethnicity, chest pain history, age, BMI, blood pressure, ...) with self-reported presence of heart disease as the target.
Q: Why is this learning setup potentially problematic?
Input: Represented using the vector x containing features for a single data
point
Target Output: Represented using the scalar t ∈ {0, 1}
Goal: We wish to learn a decision tree $f: \mathbb{R}^n \to \mathbb{R}$ so that $t \approx y = f(x)$.
Data: $(x^{(1)}, t^{(1)}), (x^{(2)}, t^{(2)}), \ldots, (x^{(N)}, t^{(N)})$
The $x^{(i)}$ are the inputs
The $t^{(i)}$ are the targets (or the ground truth)
$$x^{(1)} = \begin{bmatrix} 62 \\ 29.10 \\ 104 \\ 56 \end{bmatrix}, \qquad t^{(1)} = 0$$
Likewise, we can fold the entire set of target values into a single vector, and the entire set of predictions into a single vector:
$$\mathbf{t} = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
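In code, this simply amounts to stacking the per-example values into arrays; a small illustrative example:

```python
import numpy as np

# Stack per-example targets t^(i) and predictions y^(i) into vectors.
t = np.array([0, 1, 1, 0, 1])   # ground-truth targets t^(1), ..., t^(N)
y = np.array([0, 1, 0, 0, 1])   # model predictions y^(1), ..., y^(N)

accuracy = np.mean(t == y)      # fraction of examples classified correctly
print(accuracy)                 # 0.8
```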