Lecture 2
Decision Trees
[Figure: example decision tree, with a Yes/No branch at each internal node]
Exercise: Draw the decision boundary for the decision tree on the left.
Place the width measurement on the x-axis and the height measurement on
the y-axis.
[Figure: the decision tree (left) and its axis-aligned decision boundary (right), with width on the x-axis and height on the y-axis]
Each path from root to a leaf defines a region $R_m$ of input space, and points that fall into $R_m$ will have prediction $y^{(m)}$. But what should this $y^{(m)}$ be?
Let $\{(x^{(m_1)}, t^{(m_1)}), \ldots, (x^{(m_k)}, t^{(m_k)})\}$ be the training examples that fall into $R_m$.
Classification Tree (Discrete Output)
Leaf value $y^{(m)}$ typically set to the most common value of $t$ amongst training data points in $R_m$
Regression Tree (Continuous Output)
Leaf value $y^{(m)}$ typically set to the mean value of $t$ amongst training data points in $R_m$
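For concreteness, here is a minimal Python sketch (the helper name `leaf_value` is illustrative, not from the lecture) of how the leaf prediction could be computed from the targets that fall into a region $R_m$:

```python
import numpy as np

def leaf_value(targets, task="classification"):
    """Compute the leaf prediction y^(m) from the targets of the
    training examples that fall into region R_m."""
    targets = np.asarray(targets)
    if task == "classification":
        # most common target value (majority vote)
        values, counts = np.unique(targets, return_counts=True)
        return values[np.argmax(counts)]
    else:
        # regression: mean of the targets
        return targets.mean()

print(leaf_value([0, 1, 1, 1, 0]))                 # -> 1 (majority class)
print(leaf_value([2.0, 4.0, 6.0], "regression"))   # -> 4.0 (mean)
```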
In labs 2, 5, and 7, we will use data from The National Health and
Nutrition Examination Survey in the United States.
The survey assesses people’s health and nutritional status
Combines data from interviews and physical examinations
- Race/Ethnicity
- Ever had chest pain
- Age
- BMI
- Blood pressure
- ...
Our Target: Presence of heart disease (self-reported)
In lab 2, we will build a classification tree to assess the presence of heart
disease.
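As a rough preview of what such a model could look like in code, here is a hedged sketch using scikit-learn; the file path and column names are placeholders, not the lab's actual NHANES variable names:

```python
# Hypothetical sketch: fitting a classification tree on NHANES-style data.
# The CSV path and feature columns below are placeholders.
import pandas as pd
from sklearn.tree import DecisionTreeClassifier
from sklearn.model_selection import train_test_split

df = pd.read_csv("nhanes.csv")                      # placeholder path
X = df[["age", "bmi", "blood_pressure"]]            # placeholder feature columns
t = df["heart_disease"]                             # self-reported target (0/1)

X_train, X_valid, t_train, t_valid = train_test_split(X, t, random_state=0)

tree = DecisionTreeClassifier(criterion="entropy", max_depth=3)
tree.fit(X_train, t_train)
print(tree.score(X_valid, t_valid))                 # validation accuracy
```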
[Figure: two candidate splits of the same data, labelled A and B]
Q: A and B have the same misclassification rate, so which is the best split?
Information Theory
You may have encountered the term entropy as a quantity describing the degree of disorder in chemical and physical systems.
In statistics, the entropy of a discrete random variable is a number that
quantifies the uncertainty inherent in its possible outcomes:
$$H(X) = -\mathbb{E}_{X \sim p}[\log_2 p(X)] = -\sum_{x \in \mathcal{X}} p(x) \log_2 p(x)$$
Intuitively, you can think of entropy as a number that represents how much
certainty is gained, on average, by observing a random draw from a
probability distribution.
Let’s look at some examples. . .
First coin, with probability of heads $p = \frac{8}{9}$:
$$-\frac{8}{9}\log_2\frac{8}{9} - \frac{1}{9}\log_2\frac{1}{9} \approx \frac{1}{2}$$
Second coin, with probability of heads $p = \frac{10}{18} = \frac{5}{9}$:
$$-\frac{4}{9}\log_2\frac{4}{9} - \frac{5}{9}\log_2\frac{5}{9} \approx 0.99$$
The coin whose outcomes are more certain has a lower entropy!
In the extreme case p = 0 or p = 1, we were certain of the outcome
before observing. So, we gained no certainty by observing it, i.e.,
entropy is 0.
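These numbers are easy to check numerically; a minimal sketch of the entropy computation:

```python
import numpy as np

def entropy(probs):
    """H(X) = -sum_x p(x) log2 p(x); outcomes with p(x) = 0 contribute 0."""
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return float(np.sum(probs * np.log2(1.0 / probs)))  # same as -sum p log2 p

print(entropy([8/9, 1/9]))   # ~0.50: this coin's outcome is fairly predictable
print(entropy([5/9, 4/9]))   # ~0.99: close to a fair coin
print(entropy([0.5, 0.5]))   # 1.00: a fair coin has the highest entropy
```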
Entropy of a Coin
The entropy of a coin is highest if the probability of obtaining heads is $p = \frac{1}{2}$.
[Figure: entropy of a coin (in bits) as a function of the probability $p$ of heads]
High Entropy
- Variable has a uniform-like distribution over many outcomes
- Flat histogram
- Values sampled from it are less predictable
Low Entropy
- Distribution is concentrated on only a few outcomes
- Histogram is concentrated in a few areas
- Values sampled from it are more predictable
[Table: joint distribution of weather outcomes, with columns Cloudy / Not Cloudy and rows Raining / Not Raining]
$$H(Y \mid X = \text{raining}) = -\sum_{y \in \mathcal{Y}} p(y \mid x = \text{raining}) \log_2 p(y \mid x = \text{raining}) = -\frac{24}{25}\log_2\frac{24}{25} - \frac{1}{25}\log_2\frac{1}{25} \approx 0.24 \text{ bits}$$
We used: $p(y \mid x) = \frac{p(x, y)}{p(x)}$, and $p(x) = \sum_y p(x, y)$ (sum across a row of the table).
Conditional Entropy and Expected Conditional Entropy
$$H(Y \mid X) = \sum_{x \in \mathcal{X}} p(x)\, H(Y \mid X = x) = \frac{1}{4} H(Y \mid X = \text{raining}) + \frac{3}{4} H(Y \mid X = \text{not raining}) \approx 0.75 \text{ bits}$$
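Both quantities can be computed directly from the joint distribution. The sketch below assumes the joint probabilities behind the table above (24/100, 1/100, 25/100, 50/100), which are consistent with the calculations shown:

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

# Joint distribution p(x, y): rows = raining / not raining,
# columns = cloudy / not cloudy (values assumed to match the table above).
joint = np.array([[24, 1],
                  [25, 50]]) / 100.0

p_x = joint.sum(axis=1)   # marginal p(x): sum across each row

# H(Y | X = x) for each x, then the expected conditional entropy H(Y | X)
H_Y_given_x = np.array([entropy(row / row.sum()) for row in joint])
H_Y_given_X = np.sum(p_x * H_Y_given_x)

print(H_Y_given_x[0])   # ~0.24 bits: H(Y | X = raining)
print(H_Y_given_X)      # ~0.75 bits: expected conditional entropy
```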
$H$ is always non-negative.
Chain rule: $H(X, Y) = H(X \mid Y) + H(Y) = H(Y \mid X) + H(X)$
If $X$ and $Y$ are independent, then $X$ does not affect our uncertainty about $Y$: $H(Y \mid X) = H(Y)$
Knowing $Y$ makes our knowledge of $Y$ certain: $H(Y \mid Y) = 0$
By knowing $X$, we can only decrease uncertainty about $Y$: $H(Y \mid X) \leq H(Y)$
How much more certain am I about whether it’s cloudy if I’m told whether it is raining? My uncertainty in $Y$ minus my expected uncertainty that would remain in $Y$ after seeing $X$:
$$IG(Y \mid X) = H(Y) - H(Y \mid X)$$
This is called the information gain $IG(Y \mid X)$ in $Y$ due to $X$, or the mutual information of $Y$ and $X$.
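As a sketch of how information gain can be computed to score a split from a joint count table (the helper names below are illustrative):

```python
import numpy as np

def entropy(probs):
    probs = np.asarray(probs, dtype=float)
    probs = probs[probs > 0]
    return -np.sum(probs * np.log2(probs))

def information_gain(joint):
    """IG(Y|X) = H(Y) - H(Y|X), computed from a joint count table
    with rows indexed by x and columns indexed by y."""
    joint = np.asarray(joint, dtype=float)
    joint = joint / joint.sum()
    p_x = joint.sum(axis=1)
    p_y = joint.sum(axis=0)
    H_Y = entropy(p_y)
    H_Y_given_X = sum(p * entropy(row / row.sum()) for p, row in zip(p_x, joint))
    return H_Y - H_Y_given_X

# Raining/cloudy example: uncertainty about cloudiness drops by ~0.25 bits
# once we know whether it is raining.
print(information_gain([[24, 1], [25, 50]]))
```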
Problems:
You have exponentially less data at lower levels
Too big of a tree can overfit the data
Greedy algorithms don’t necessarily yield the global optimum
There are other criteria used to measure the quality of a split, e.g., Gini
index
Trees can be pruned in order to make them less complex
Decision trees can also be used for regression on real-valued outputs.
Choose splits to minimize squared error, rather than maximize
information gain.
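A minimal sketch of scoring a candidate regression split by its total squared error (the function name and data are illustrative):

```python
import numpy as np

def split_squared_error(x, t, threshold):
    """Total squared error if we split on x <= threshold and predict
    the mean target in each resulting region."""
    x, t = np.asarray(x), np.asarray(t, dtype=float)
    left, right = t[x <= threshold], t[x > threshold]
    sse = 0.0
    for region in (left, right):
        if len(region) > 0:
            sse += np.sum((region - region.mean()) ** 2)
    return sse

x = np.array([1.0, 2.0, 3.0, 10.0, 11.0, 12.0])
t = np.array([1.1, 0.9, 1.0, 5.0, 5.2, 4.8])
# The split at 3 separates the two clusters of targets, so its error is much lower.
print(split_squared_error(x, t, threshold=3.0))   # ~0.10
print(split_squared_error(x, t, threshold=10.0))  # ~12.10
```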
Recall the setup above: in lab 2 we use NHANES data (race/ethnicity, chest pain history, age, BMI, blood pressure, ...) with self-reported presence of heart disease as the target.
Q: Why is this learning setup potentially problematic?
Input: Represented using the vector x containing features for a single data
point
Target Output: Represented using the scalar t ∈ {0, 1}
Goal: We wish to learn a decision tree $f: \mathbb{R}^n \to \mathbb{R}$ so that $t \approx y = f(x)$.
Data: $(x^{(1)}, t^{(1)}), (x^{(2)}, t^{(2)}), \ldots, (x^{(N)}, t^{(N)})$
The $x^{(i)}$ are the inputs
The $t^{(i)}$ are the targets (or the ground truth)
$$x^{(1)} = \begin{bmatrix} 62 \\ 29.10 \\ 104 \\ 56 \end{bmatrix}, \qquad t^{(1)} = 0$$
Likewise, we can fold the entire set of target values into a single vector, and the entire set of predictions into a single vector:
$$\mathbf{t} = \begin{bmatrix} t^{(1)} \\ t^{(2)} \\ \vdots \\ t^{(N)} \end{bmatrix}, \qquad \mathbf{y} = \begin{bmatrix} y^{(1)} \\ y^{(2)} \\ \vdots \\ y^{(N)} \end{bmatrix}$$
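In code, this simply amounts to stacking the per-example values into arrays; a small illustrative example:

```python
import numpy as np

# Stack per-example targets t^(i) and predictions y^(i) into vectors.
t = np.array([0, 1, 1, 0, 1])   # ground-truth targets t^(1), ..., t^(N)
y = np.array([0, 1, 0, 0, 1])   # model predictions y^(1), ..., y^(N)

accuracy = np.mean(t == y)      # fraction of examples classified correctly
print(accuracy)                 # 0.8
```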