Lec 02

Lecture 2 of CSC311H5 at the University of Toronto Mississauga covers decision trees, a simple but powerful supervised learning model used for classification tasks. The lecture discusses how decision trees make predictions through a series of feature-based splits, introduces concepts from information theory, such as entropy, to quantify uncertainty in predictions, and outlines how decision trees are constructed with a greedy heuristic that chooses splits by information gain.


CSC311H5 Introduction to Machine Learning

Lecture 2

University of Toronto Mississauga

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 1 / 60


Review

Last class, we discussed


Supervised learning: Learn a predictor given a training set consisting
of inputs (x ∈ R^D) and the corresponding target labels t
Having a separate validation set (for tuning hyperparameters) and test
set (to assess model generalization)
Our first supervised learning model: the k-Nearest Neighbour model
Today: Decision Trees!

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 2 / 60


A bit about machine learning discourse

When learning about machine learning models, we always talk about how to
make predictions before talking about how to train the model.
This may seem backwards: in practice, we train models before making predictions.
The order in which we learn things is not the order in which we do things.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 3 / 60


Section 1

Decision Trees

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 4 / 60


What is a decision tree model

Like a series of nested if-else statements


Simple but powerful learning algorithm
Variations are widely used in ML competitions (e.g., Kaggle)
Lets us motivate concepts from information theory (entropy, mutual
information, etc.)
As is typical when discussing ML models, we will discuss how to make
predictions first (i.e., inference), and then how to build models (i.e.,
learning).

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 5 / 60


Lemons or Oranges

Remember the citrus classification problem from lecture 1?

Scenario: You run a sorting facility for citrus fruits


Binary classification: lemons or oranges
Features measured by sensor on conveyor belt: height and width

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 6 / 60


Decision Trees
Make predictions by splitting on features according to a tree structure.

[Figure: a decision tree with Yes/No branches at each internal node and a class prediction at each leaf]

Internal nodes test a feature
Branching is determined by the feature value
Leaf nodes are output predictions

Q: How would a citrus with width 6.4 cm and height 10 cm be classified?
Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 7 / 60
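
To make the inference procedure concrete, here is a minimal sketch of a decision tree written as nested if-else statements for the citrus example. The thresholds (and therefore the printed prediction) are made up for illustration; the actual tree used in the lecture appears only in the slide figure.

```python
# Minimal sketch: a decision tree as nested if-else statements for the
# citrus example. The thresholds below are made up for illustration.
def classify_citrus(width_cm: float, height_cm: float) -> str:
    if height_cm >= 9.0:            # root node tests one feature
        if width_cm >= 6.5:         # internal node tests another feature
            return "orange"
        return "lemon"
    else:
        if width_cm >= 7.0:
            return "orange"
        return "lemon"

# The citrus from the question: width 6.4 cm, height 10 cm.
print(classify_citrus(6.4, 10.0))
```
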
Visualizing Decision Tree Decision Boundary

Exercise: Draw the decision boundary for the decision tree on the left.
Place the width measurement on the x-axis and the height measurement on
the y-axis.

[Figure: the decision tree (left) and empty axes (right), with width on the x-axis and height on the y-axis, both ranging from about 4 to 10]

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 8 / 60


Decision Tree Decision Boundary

The decision boundary of a decision tree is made up of axis-aligned planes:

[Figure: the decision tree and the corresponding axis-aligned decision boundary in the width–height plane]

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 9 / 60


Decision Tree Output

Each path from root to a leaf defines a region R_m of input space, and points
that fall into R_m will have prediction y^(m). But what should this y^(m) be?
Let {(x^(m_1), t^(m_1)), . . . , (x^(m_k), t^(m_k))} be the training examples that fall
into R_m.
Classification Tree (Discrete Output)
Leaf value y^(m) is typically set to the most common value of t amongst the
training data points in R_m.
Regression Tree (Continuous Output)
Leaf value y^(m) is typically set to the mean value of t amongst the training data
points in R_m.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 10 / 60
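
As a small illustration of how a leaf value y^(m) could be computed from the training targets that land in R_m, here is a sketch; the labels and values passed in at the end are made up.

```python
from collections import Counter

# Sketch: leaf predictions from the training targets that fall into region R_m.
def leaf_value_classification(targets):
    # classification tree: most common target value in R_m
    return Counter(targets).most_common(1)[0][0]

def leaf_value_regression(targets):
    # regression tree: mean target value in R_m
    return sum(targets) / len(targets)

print(leaf_value_classification(["lemon", "orange", "lemon"]))  # lemon
print(leaf_value_regression([2.0, 4.0, 9.0]))                   # 5.0
```
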


Example: NHANES for Heart Disease Prediction

In labs 2, 5, and 7, we will use data from the National Health and
Nutrition Examination Survey (NHANES) in the United States.
The survey assesses people's health and nutritional status
Combines data from interviews and physical examinations
- Race/Ethnicity
- Ever had chest pain
- Age
- BMI
- Blood pressure
- ...
Our Target: Presence of heart disease (self-reported)
In lab 2, we will build a classification tree to assess the presence of heart
disease.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 11 / 60


Decision Tree Features

Continuous features can be partitioned by thresholding: a split divides the feature's range into two half-lines.


Example: the AGE feature is partitioned at a threshold.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 12 / 60


Decision Tree Features

Discrete features can be partitioned into their possible values.


Example: CHEST_PAIN is partitioned into its two possible values 0 and 1.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 13 / 60


Exercise: Making a prediction (inference)

Suppose we have the following tree:

What prediction would we make for a person with:


Age: 64
BMI: 20.9
No chest pain

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 14 / 60


Section 2

Constructing Decision Trees

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 15 / 60


Which decision tree is better?

A:

B:

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 16 / 60


Why constructing trees is hard

Decision trees are universal function approximators:


For any training set we can construct a decision tree that has exactly
one leaf for every training point, but it probably won't generalize.
- Example: if all D features were binary and we had N = 2^D unique
  training examples, a full binary tree would have one leaf per example.
Finding the smallest decision tree that correctly classifies a training set
is NP-complete.
- If you are interested, see Hyafil & Rivest (1976).

So, how do we construct a useful decision tree?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 17 / 60


Learning decision trees

We will use a greedy heuristic:


Start with the whole training set and an empty decision tree.
Pick a feature and candidate split that would most reduce a loss.
Split on that feature and recurse on subpartitions.
Q: What would be a good loss to use? Ideas?
How about classification error or misclassification rate?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 18 / 60


Example

Consider the following data. Let’s split on width.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 19 / 60


Split by misclassification rate?

Let’s try using misclassification rate as a loss.

Q: A and B have the same misclassification rate, so which is the better split?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 20 / 60


The problem with misclassification rate
Consider this split, which does not improve accuracy at all (compared to
not having the split). Is this split good?

This split reduces our uncertainty about whether a fruit is a lemon!

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 21 / 60


Choosing a good split

How can we quantify uncertainty in prediction for a given leaf node?


If all examples in a leaf have the same class: good, low uncertainty
If each class has the same number of examples in the leaf: bad, high
uncertainty
Idea: Use counts at the leaves to define probability distributions; use a
probabilistic notion of uncertainty to decide splits.
To quantify uncertainty, we take a brief detour through information
theory. . .

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 22 / 60


Section 3

Information Theory

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 23 / 60


Entropy - Quantifying Uncertainty

You may have encountered the term entropy as a measure of disorder
in chemical and physical systems.
In statistics, the entropy of a discrete random variable is a number that
quantifies the uncertainty inherent in its possible outcomes:

H(X) = −E_{X∼p}[log₂ p(X)] = − Σ_{x∈X} p(x) log₂ p(x)

Intuitively, you can think of entropy as a number that represents how much
certainty is gained, on average, by observing a random draw from a
probability distribution.
Let’s look at some examples. . .

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 24 / 60
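
A direct translation of this definition into code (a sketch; it assumes the probabilities are given and sum to 1, and treats 0 log 0 as 0):

```python
import math

# Entropy H(X) = -sum_x p(x) log2 p(x), with 0 log 0 treated as 0.
def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([0.5, 0.5]))   # fair coin: 1.0 bit
print(entropy([1.0, 0.0]))   # certain outcome: 0.0 bits
```
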


Flipping Two Different Coins

Q: Which coin is more uncertain?


First: 0, 0, 0, 1, 0, 0, 0, 0, 0, 0, 0, 0, 1, 0, 0, 0, 0, 0, . . .
Second: 0, 1, 0, 1, 0, 0, 1, 1, 1, 0, 0, 0, 1, 0, 0, 1, 0, 1, . . .

[Figure: outcome histograms for the two coins over 18 flips. The first coin: 16 zeros and 2 ones; the second coin: 10 zeros and 8 ones]

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 25 / 60


Computing Entropy

Exercise: What is the entropy of a loaded coin with probability p = 2/18?
What about p = 10/18?

First coin, with p = 2/18 = 1/9:

−(8/9) log₂(8/9) − (1/9) log₂(1/9) ≈ 1/2

Second coin, with p = 10/18 = 5/9:

−(4/9) log₂(4/9) − (5/9) log₂(5/9) ≈ 0.99

The coin whose outcomes are more certain has a lower entropy!
In the extreme case p = 0 or p = 1, we were certain of the outcome
before observing. So, we gained no certainty by observing it, i.e., the
entropy is 0.
Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 26 / 60
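
Reusing the entropy helper sketched earlier, the two loaded coins from the exercise can be checked numerically:

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

print(entropy([1/9, 8/9]))   # first coin:  ~0.50 bits
print(entropy([5/9, 4/9]))   # second coin: ~0.99 bits
```
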
Entropy of a Coin

The entropy of a coin is highest if the probability of obtaining heads is p = 1/2.

[Plot: entropy (bits) on the y-axis against the probability p of heads on the x-axis, rising from 0 at p = 0 to 1 bit at p = 1/2 and falling back to 0 at p = 1]

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 27 / 60


High vs Low Entropy Random Variables

High Entropy
The variable has a uniform-like distribution over many outcomes
Flat histogram
Values sampled from it are less predictable
Low Entropy
The distribution is concentrated on only a few outcomes
Histogram is concentrated in a few areas
Values sampled from it are more predictable

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 28 / 60


Interpreting Entropy

We can also think of entropy as the expected information content of a


random draw from a probability distribution.
Claude Shannon showed: you cannot store the outcome of a random draw
using fewer expected bits than the entropy without losing information.
So the units of entropy are bits; a fair coin flip has 1 bit of entropy.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 29 / 60


Multiple Random Variables

Suppose we observe partial information X about a random variable Y


Example: we want to know whether it is cloudy outside (Y); we can't
see the sky, but we can see whether it is raining (X).
We want to work towards a definition of the expected amount of
information that will be conveyed about Y by observing X .
Or equivalently, the expected reduction in our uncertainty about Y after
observing X .

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 30 / 60


Entropy of a Joint Distribution
Example: X = {Raining, Not raining}, Y = {Cloudy, Not cloudy}
What is the entropy of the system (X, Y)?

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

H(X, Y) = −E_{(X,Y)∼p(x,y)}[log₂ p(X, Y)]
        = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(x, y)
        = − (24/100) log₂(24/100) − (1/100) log₂(1/100)
          − (25/100) log₂(25/100) − (50/100) log₂(50/100)
        ≈ 1.56 bits

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 31 / 60
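
A quick numerical check of the joint entropy, with the table stored as a dictionary (a sketch; the variable names are my own):

```python
import math

# Joint distribution p(x, y) from the rain/cloud table above.
p_xy = {
    ("raining", "cloudy"): 24/100, ("raining", "not cloudy"): 1/100,
    ("not raining", "cloudy"): 25/100, ("not raining", "not cloudy"): 50/100,
}

H_XY = -sum(p * math.log2(p) for p in p_xy.values())
print(round(H_XY, 2))   # ~1.56 bits
```
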


Conditional Entropy
What is the entropy of cloudiness Y , given that it is raining?

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

H(Y | X = raining) = − Σ_{y∈Y} p(y | x = raining) log₂ p(y | x = raining)
                   = − (24/25) log₂(24/25) − (1/25) log₂(1/25)
                   ≈ 0.24 bits

We used: p(y | x) = p(x, y) / p(x), and p(x) = Σ_y p(x, y) (sum across a row)
Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 32 / 60
Conditional Entropy and Expected Conditional Entropy

The conditional entropy is the entropy of the conditional distribution:

H(Y | X = x) = −E_{Y∼p(y|X=x)}[log₂ p(Y | X = x)]

This is what we computed on the previous slide.


The expected conditional entropy is defined as follows:

H(Y | X) = E_{x∼p(x)}[H(Y | X = x)]
         = Σ_{x∈X} p(x) H(Y | X = x)
         = − Σ_{x∈X} Σ_{y∈Y} p(x, y) log₂ p(y | x)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 33 / 60


Expected Conditional Entropy
What is the expected entropy of cloudiness, given the knowledge of whether
or not it is raining?

              Cloudy    Not Cloudy
Raining       24/100    1/100
Not Raining   25/100    50/100

H(Y | X) = Σ_{x∈X} p(x) H(Y | X = x)
         = (1/4) H(Y | X = raining) + (3/4) H(Y | X = not raining)
         ≈ 0.75 bits

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 34 / 60
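
The conditional entropy H(Y | X = x) and the expected conditional entropy H(Y | X) for the same table can be checked the same way (a sketch, using the same made-up dictionary layout as before):

```python
import math

p_xy = {
    ("raining", "cloudy"): 24/100, ("raining", "not cloudy"): 1/100,
    ("not raining", "cloudy"): 25/100, ("not raining", "not cloudy"): 50/100,
}

def H_Y_given(x):
    # p(x): sum of the row, then p(y|x) = p(x, y) / p(x)
    p_x = sum(p for (xi, _), p in p_xy.items() if xi == x)
    cond = [p / p_x for (xi, _), p in p_xy.items() if xi == x]
    return -sum(p * math.log2(p) for p in cond if p > 0)

p_x = {x: sum(p for (xi, _), p in p_xy.items() if xi == x)
       for x in ("raining", "not raining")}
H_Y_given_X = sum(p_x[x] * H_Y_given(x) for x in p_x)

print(round(H_Y_given("raining"), 2))   # ~0.24 bits
print(round(H_Y_given_X, 2))            # ~0.75 bits
```
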


Properties of Entropy and Conditional Entropy

H is always non-negative
Chain rule: H(X, Y) = H(X|Y) + H(Y) = H(Y|X) + H(X)
If X and Y are independent, then X does not affect our uncertainty about
Y: H(Y|X) = H(Y)
Knowing Y makes our knowledge of Y certain: H(Y|Y) = 0
Knowing X can only decrease our uncertainty about Y:
H(Y|X) ≤ H(Y)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 35 / 60


Information Gain

How much more certain am I about whether it's cloudy if I'm told whether
it is raining? It is my uncertainty in Y minus my expected uncertainty that
would remain in Y after seeing X.
This is called the information gain IG(Y|X) in Y due to X, or the mutual
information of Y and X:

IG(Y|X) = H(Y) − H(Y|X)

If X is completely uninformative about Y: IG(Y|X) = 0


If X is completely informative about Y: IG(Y|X) = H(Y)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 36 / 60
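
As a sketch, the information gain of a binary split can be computed directly from the class counts on each side; the counts in the example call below are made up.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def information_gain(left_counts, right_counts):
    # IG = H(Y) - H(Y | side of split), with sides weighted by their size.
    def H(counts):
        n = sum(counts)
        return entropy([c / n for c in counts])
    n_left, n_right = sum(left_counts), sum(right_counts)
    n = n_left + n_right
    total = [l + r for l, r in zip(left_counts, right_counts)]
    return H(total) - (n_left / n) * H(left_counts) - (n_right / n) * H(right_counts)

print(information_gain([0, 3], [3, 1]))   # made-up counts: a fairly informative split
```
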


How does this help with decision trees?
In the case of decision trees, Y is the output class (e.g., lemon or orange), and
X is which side of the split a data point falls on (e.g., left or right).
Q: Which is the better split?

Q: Is this a good split?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 37 / 60


An alternative to misclassification rates

We will use information gain to quantify how “good” a split is.


Entropy H(Y ) [bits]: characterizes the uncertainty in a draw of a
random variable
Conditional Entropy H(Y |X ) [bits] : characterizes the uncertainty in
a draw of Y after observing X
Information gain measures the informativeness of a variable, which is
exactly what we desire in a decision tree split!
The information gain of a split: how much information (over the
training set) about the class label Y is gained by knowing which side
of a split you’re on.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 38 / 60


Exercise: What is the information gain of split B?

Root entropy of the class outcome:
H(Y) = −(2/7) log₂(2/7) − (5/7) log₂(5/7) ≈ 0.86
Leaf conditional entropies of the class outcome: H(Y|left) ≈ 0.81,
H(Y|right) ≈ 0.92
IG(split) ≈ 0.86 − (4/7 · 0.81 + 3/7 · 0.92) ≈ 0.006

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 39 / 60


Exercise: What is the information gain of split A?

Root entropy of the class outcome:
H(Y) = −(2/7) log₂(2/7) − (5/7) log₂(5/7) ≈ 0.86
Leaf conditional entropies of the class outcome: H(Y|left) = 0,
H(Y|right) ≈ 0.97
IG(split) ≈ 0.86 − (2/7 · 0 + 5/7 · 0.97) ≈ 0.17!!

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 40 / 60
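
These two computations can be reproduced from class counts. The counts below are inferred from the entropies quoted on the slides (2 examples of one class and 5 of the other overall); they are assumptions, since the figure with the actual splits is not reproduced here.

```python
import math

def entropy(probs):
    return -sum(p * math.log2(p) for p in probs if p > 0)

def split_information_gain(left, right):
    # left/right are per-class counts on each side of the split
    def H(counts):
        n = sum(counts)
        return entropy([c / n for c in counts])
    n_l, n_r = sum(left), sum(right)
    total = [a + b for a, b in zip(left, right)]
    return H(total) - (n_l / (n_l + n_r)) * H(left) - (n_r / (n_l + n_r)) * H(right)

print(round(split_information_gain([1, 3], [1, 2]), 3))   # split B: ~0.006
print(round(split_information_gain([0, 2], [2, 3]), 3))   # split A: ~0.17
```
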


Section 4

Learning Decision Trees (Continued)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 41 / 60


Constructing Decision Trees

At each level, we choose:


Which feature to split on
Possibly where to split it (e.g., for a continuous feature)
Choose the feature and split that provide the highest information gain.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 42 / 60


Visualizations to keep in mind

[Figure: the decision tree with Yes/No branches at its internal nodes]

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 43 / 60


Decision Tree Construction Algorithm

Recall the greedy heuristic we proposed earlier:


Start with the whole training set and an empty decision tree.
Pick a feature and candidate split that would most reduce a loss.
- e.g., information gain!
- potentially square loss for a regression tree.
Split on that feature and recurse on subpartitions.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 44 / 60
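
Below is a self-contained sketch of this greedy procedure for a classification tree with numerical features, using information gain to pick each split. The data format, threshold choice, and stopping rules are illustrative assumptions, not the course's reference implementation.

```python
import math
from collections import Counter

def entropy_of_labels(labels):
    n = len(labels)
    return -sum((c / n) * math.log2(c / n) for c in Counter(labels).values())

def best_split(X, t):
    """Greedily pick the (feature, threshold) pair with the highest information gain."""
    base = entropy_of_labels(t)
    best_j, best_thr, best_gain = None, None, 0.0
    for j in range(len(X[0])):
        for thr in sorted({x[j] for x in X}):
            left  = [ti for x, ti in zip(X, t) if x[j] <= thr]
            right = [ti for x, ti in zip(X, t) if x[j] >  thr]
            if not left or not right:
                continue
            cond = (len(left) * entropy_of_labels(left)
                    + len(right) * entropy_of_labels(right)) / len(t)
            if base - cond > best_gain:
                best_j, best_thr, best_gain = j, thr, base - cond
    return best_j, best_thr, best_gain

def build_tree(X, t, depth=0, max_depth=3):
    j, thr, gain = best_split(X, t)
    if j is None or gain == 0.0 or depth == max_depth:
        return Counter(t).most_common(1)[0][0]            # leaf: majority class
    left  = [(x, ti) for x, ti in zip(X, t) if x[j] <= thr]
    right = [(x, ti) for x, ti in zip(X, t) if x[j] >  thr]
    return {"feature": j, "threshold": thr,
            "left":  build_tree(*zip(*left),  depth=depth + 1, max_depth=max_depth),
            "right": build_tree(*zip(*right), depth=depth + 1, max_depth=max_depth)}

# Made-up (width, height) measurements and labels for the citrus example.
X = [(6.1, 9.8), (7.5, 7.4), (6.0, 10.2), (7.8, 7.9), (6.3, 9.5)]
t = ["lemon", "orange", "lemon", "orange", "lemon"]
print(build_tree(X, t))
```
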


Discussion

Q: When might we wish to stop splitting nodes?


Possible answer: Terminate when all leaves contain only examples of the
same class, or are empty.
Possible answer: When a depth limit is reached, or when the loss stops
improving much.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 45 / 60


Handling Continuous Attributes

Split based on a threshold, chosen to maximize information gain


- Since the training data is finite, there are only a finite number of
  thresholds to consider!
Decision trees can also be used for regression on real-valued outputs.
- Choose splits to minimize squared error, rather than maximize
  information gain.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 46 / 60
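
A small sketch of the first point above: with finitely many training values of a feature, only the midpoints between consecutive distinct values (or the values themselves) need to be considered as candidate thresholds. The ages below are made up.

```python
# Candidate thresholds for a continuous feature: midpoints between
# consecutive distinct training values (made-up ages).
ages = [34, 52, 52, 61, 70]
distinct = sorted(set(ages))
thresholds = [(a + b) / 2 for a, b in zip(distinct, distinct[1:])]
print(thresholds)   # [43.0, 56.5, 65.5]
```
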


What Makes a Good Tree?

Not too small:


need to handle important but possibly subtle distinctions in data
Not too big:
Computational efficiency (avoid redundant, spurious attributes)
Avoid over-fitting training examples
Human interpretability
Occam's Razor: find the simplest hypothesis that fits the observations
Useful principle, but hard to formalize (how do we define simplicity?)
We desire small trees with informative nodes near the root

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 47 / 60


Decision Tree Miscellany

Problems:
You have exponentially less data at lower levels
Too big of a tree can overfit the data
Greedy algorithms don’t necessarily yield the global optimum
There are other criteria used to measure the quality of a split, e.g., Gini
index
Trees can be pruned in order to make them less complex
Decision trees can also be used for regression on real-valued outputs.
Choose splits to minimize squared error, rather than maximize
information gain.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 48 / 60
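
For reference, a sketch of the Gini index mentioned above, G = 1 − Σ_k p_k², which, like entropy, is 0 for a pure node and largest when the classes are equally likely:

```python
# Gini index of a node's class proportions: G = 1 - sum_k p_k^2.
def gini(probs):
    return 1.0 - sum(p * p for p in probs)

print(gini([1.0, 0.0]))   # pure node: 0.0
print(gini([0.5, 0.5]))   # two equally likely classes: 0.5
```
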


Section 5

Notations and Indicator Variables

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 49 / 60


Example: NHANES for Heart Disease Prediction

In labs 2, 5, and 7, we will use data from the National Health and
Nutrition Examination Survey (NHANES) in the United States.
The survey assesses people's health and nutritional status
Combines data from interviews and physical examinations
- Race/Ethnicity
- Ever had chest pain
- Age
- BMI
- Blood pressure
- ...
Our Target: Presence of heart disease (self-reported)
Q: Why is this learning setup potentially problematic?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 50 / 60


Machine Learning Pipeline

Collecting and understanding the data


Cleaning and preparing the data
- Determining the features to use
- Splitting into training/validation/test sets
Model training & hyperparameter tuning
Evaluating generalization accuracy
Additional testing (e.g., subgroup analysis, adversarial testing)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 51 / 60


Supervised Learning Notation (NHANES)

Input: Represented using the vector x containing the features for a single data
point
Target Output: Represented using the scalar t ∈ {0, 1}
Goal: We wish to learn a decision tree t ≈ y = f(x), f : R^n → R.
Data: (x^(1), t^(1)), (x^(2), t^(2)), . . . , (x^(N), t^(N))
The x^(i) are the inputs
The t^(i) are the targets (or the ground truth)

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 52 / 60


NHANES Heart Disease Prediction
Suppose that for the NHANES heart disease prediction problem, we use
d = 4 features (simplified! in the lab we will use more!):
The numerical age feature
The numerical BMI feature
The numerical blood_pressure_sys feature
The numerical diastolic_bp feature
Then one data point from the training set (x^(1), t^(1)) will look like this:

x^(1) = [62, 29.10, 104, 56]^T,   t^(1) = 0

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 53 / 60


Data Matrix

We can express computations more succinctly by using linear algebra
notation.
Start by putting the entire data set in a single data matrix:

        [ x_1^(1)   x_2^(1)   ...   x_D^(1) ]
    X = [ x_1^(2)   x_2^(2)   ...   x_D^(2) ]
        [   ...       ...     ...     ...   ]
        [ x_1^(N)   x_2^(N)   ...   x_D^(N) ]

Q: What is the shape of the data matrix?


N × D, where N is the number of data points in the training data, and D is
the number of features.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 54 / 60
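
In code, the data matrix is just a 2-D array with one row per data point and one column per feature. A sketch with the four simplified NHANES features (only the first row matches the data point shown earlier; the other rows and targets are made up):

```python
import numpy as np

X = np.array([
    [62, 29.10, 104, 56],   # x^(1): age, BMI, blood_pressure_sys, diastolic_bp
    [45, 23.40, 121, 78],   # made-up rows
    [71, 31.80, 135, 82],
])
t = np.array([0, 1, 1])     # one target per row

print(X.shape)              # (N, D) = (3, 4)
print(t.shape)              # (N,)   = (3,)
```
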


Vectorizing Targets and Predictions

Likewise, we can fold the entire set of target values into a single vector, and
an entire set of predictions into a single vector:

    t = [t^(1), t^(2), . . . , t^(N)]^T,    y = [y^(1), y^(2), . . . , y^(N)]^T

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 55 / 60


Representing features

Each numerical feature is represented as a column in the data matrix.


But what about categorical features?
Gender (recorded as binary in NHANES): 1=female, 0=male
Race Ethnicity: 1=hispanic, 2=white, 3=black, 4=asian, 5=other
...

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 56 / 60


Representing Binary Features

For binary features, we can use indicator features.


For example, for the gender feature, x_gender^(i) = 1 if and only if the ith data
point represents a person recorded as being “female”.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 57 / 60


Representing Categorical Features

What about categorical features with many categories?


Race Ethnicity: 1=hispanic, 2=white, 3=black, 4=asian, 5=other
Q: Can we treat categorical features as numerical?
This approach can be problematic! What would a decision tree split like
ethnicity >= 2.5 mean?

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 58 / 60


Indicator Features (Indicator Variables, One-Hot Encoding)

Each category in the categorical variable becomes its own feature.


Instead of a single feature “race_ethnicity”, we will have features:
ethnicity_hispanic ∈ {0, 1}
ethnicity_white ∈ {0, 1}
ethnicity_asian ∈ {0, 1}
ethnicity_black ∈ {0, 1}
This approach makes it easier for decision trees to create splits based on
specific ethnicities.

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 59 / 60
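
A sketch of one-hot encoding such a column with pandas; the column name and data values below are made up, and `pd.get_dummies` creates one indicator column per category.

```python
import pandas as pd

df = pd.DataFrame({"race_ethnicity": ["hispanic", "white", "black", "asian", "white"]})

# One indicator column per category, e.g. ethnicity_hispanic, ethnicity_white, ...
one_hot = pd.get_dummies(df, columns=["race_ethnicity"], prefix="ethnicity")
print(sorted(one_hot.columns))
# ['ethnicity_asian', 'ethnicity_black', 'ethnicity_hispanic', 'ethnicity_white']
```
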


What you will do in Lab 2

Explore the NHANES data set


Create the data matrix; split the train/validation/test sets
Train and visualize several decision tree classifiers
Explore the generalization performance

Lecture 2 CSC311H5 Introduction to Machine LearningUniversity of Toronto Mississauga 60 / 60
