Statistical Learning - Classification (Stat441)
Matthias Schonlau, Ph.D.
Overview
What is statistical learning?
Overfitting/ Train-test split
Example Dutch income vs age
Bias-variance tradeoff
Some concepts
Prediction vs. Interpretation
Regression vs. Classification
Supervised vs. unsupervised
Reading: Chapter 2, James et al.
Linear regression
Y = f(X) + ε, where f(X) is a linear function.
The parameters β are unknown, but the functional form (linearity) is known.
The functional form is known except for possible variable selection, quadratic terms, etc.
[Figure: income vs. age of the household member]
When y spans several orders of magnitude, it often makes sense to take log(y).
Caution: log(0) is undefined, i.e. add a small amount (here, 1 euro) before taking logs.
[Figure: log income (+1 euro) vs. age of the household member, for a random subset of about n=60 observations]
Linear regression with 95% confidence interval: because the relationship is not linear, the fit is poor.
[Figure: linear regression of log income on age, with fitted values and 95% CI]
A very flexible learner predicts every observation perfectly.
[Figure: highly flexible fit of log income vs. age; predictions pass through every observation]
By splitting the data into a training set and a test set, overfitting can be avoided.
[Figures: predictions of log income vs. age for Train=100% and Train=50%]
Overfitting
Overfitting means the model fits the random
noise of a sample rather than the
generalizable relationship.
Overfitting tends to occur when the model has
too many parameters relative to the number
of observations.
Learning algorithms are designed to be
flexible and tend to have a lot of parameters.
Overfitting
How does one defend against overfitting?
Separate the data into training and test data.
Fit the model on the training data.
Evaluate the model fit on the test data.
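The three-step defense above can be sketched in a few lines; this is a minimal illustration assuming scikit-learn is available, with synthetic data whose variable names (age, logincome) only mirror the Dutch income example:

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error

# Hypothetical data standing in for the Dutch income example
rng = np.random.default_rng(0)
age = rng.uniform(20, 80, size=200).reshape(-1, 1)
logincome = np.log1p(1000 + 50 * age.ravel() + rng.normal(0, 200, 200))

# 1) Separate the data into training and test data (50/50 as on the slides)
X_train, X_test, y_train, y_test = train_test_split(
    age, logincome, test_size=0.5, random_state=0)

# 2) Fit the model on the training data
model = LinearRegression().fit(X_train, y_train)

# 3) Evaluate the model fit on the test data
mse_test = mean_squared_error(y_test, model.predict(X_test))
print(f"test MSE: {mse_test:.4f}")
```

Because the model never sees the test observations during fitting, a low test MSE cannot come from memorizing noise.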
Evaluation of fit
For continuous outcomes, the fit is often
evaluated with the mean squared error:
MSE = (1/n) Σ_{i=1}^{n} (y_i − f̂(x_i))^2
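The formula above translates directly into code; a minimal NumPy sketch (the function and array names are illustrative):

```python
import numpy as np

def mse(y, f_hat):
    """Mean squared error: (1/n) * sum over i of (y_i - f_hat(x_i))^2."""
    y = np.asarray(y, dtype=float)
    f_hat = np.asarray(f_hat, dtype=float)
    return np.mean((y - f_hat) ** 2)

# One residual of 1 among three observations -> MSE = 1/3
print(mse([1.0, 2.0, 3.0], [1.0, 2.0, 4.0]))
```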
Evaluation of fit
All machine learning algorithms have at least
one tuning parameter
A tuning parameter governs how flexible the
fit is.
One can plot the MSE as a function of the
flexibility parameter.
Both for the training data and the test data
In boosting, a flexibility parameter is the number of iterations.
Train=100%: fit as many iterations until the best fit is achieved.
When there are duplicates in x with differing y's, a perfect fit (MSE=0) is not possible.
Train=50%: fit as many iterations as it takes to minimize the MSE (or an equivalent criterion).
Here, this is done automatically; the output contains information about the best number of iterations: bestiter = 152.
[Figures: boosting predictions of log income vs. age for Train=100% and Train=50%]
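Choosing the best number of boosting iterations on held-out data can be sketched as follows; this assumes scikit-learn's GradientBoostingRegressor (not necessarily the implementation used for the slides) and uses synthetic data:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.metrics import mean_squared_error
from sklearn.model_selection import train_test_split

# Synthetic nonlinear data (settings are illustrative)
rng = np.random.default_rng(1)
X = rng.uniform(20, 80, size=(400, 1))
y = np.sin(X.ravel() / 10) + rng.normal(0, 0.3, 400)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.5, random_state=1)

gbm = GradientBoostingRegressor(n_estimators=500, learning_rate=0.05,
                                random_state=1).fit(X_train, y_train)

# staged_predict yields test predictions after each boosting iteration;
# the best number of iterations minimizes the test MSE
test_mse = [mean_squared_error(y_test, pred)
            for pred in gbm.staged_predict(X_test)]
best_iter = int(np.argmin(test_mse)) + 1
print("best number of iterations:", best_iter)
```

Fitting all 500 iterations but predicting with the first `best_iter` of them is a form of early stopping: it trades a little training fit for lower test error.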
Evaluation of fit
For the Dutch income example, the predictions on the training data were perfect, so MSE_train = 0.
Bias-Variance tradeoff
There is a reason why the U-shape in the test
MSE occurs.
The expected MSE can be decomposed as:
E[(y_0 − f̂(x_0))^2] = Var(f̂(x_0)) + [Bias(f̂(x_0))]^2 + Var(ε)
Bias-Variance tradeoff
Variance refers to the variation we would
get by using a different training data set
Bias refers to the error between the learning
model and the true function.
The equation shows that to minimize the expected test error, we need to keep both bias and variance low simultaneously.
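The decomposition can be checked by simulation at a single point x0: draw many training sets, fit a model to each, and estimate the variance and squared bias of f̂(x0). This sketch uses a deliberately rigid (biased) linear fit to a sine function; all names and settings are illustrative:

```python
import numpy as np

rng = np.random.default_rng(2)
f = lambda x: np.sin(x)      # "true" function (known only in a simulation)
x0, sigma = 1.0, 0.5         # evaluation point and noise standard deviation

preds = []
for _ in range(2000):
    # A fresh training set of n=30 each round
    x = rng.uniform(0, 3, 30)
    y = f(x) + rng.normal(0, sigma, 30)
    coef = np.polyfit(x, y, deg=1)     # rigid linear fit -> nonzero bias
    preds.append(np.polyval(coef, x0))
preds = np.array(preds)

variance = preds.var()                   # Var(f_hat(x0)) across training sets
bias_sq = (preds.mean() - f(x0)) ** 2    # [Bias(f_hat(x0))]^2
expected_mse = variance + bias_sq + sigma**2   # plus irreducible Var(eps)
print(variance, bias_sq, expected_mse)
```

Replacing `deg=1` with a higher degree lowers the bias term but raises the variance term, which is the tradeoff in miniature.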
Bias-variance tradeoff
All curves refer to the test data: squared bias (blue curve), variance (orange curve), Var(ε) (dashed line), and test error (red curve).
[Figure: test error decomposition as a function of flexibility]
Bias-variance tradeoff
When the training data are large, a flexible
learning algorithm may be able to eliminate
much of the bias.
In real life the true function is unknown, and it is not possible to compute this bias/variance tradeoff explicitly.
But it is useful to keep this tradeoff in mind.
Some concepts
We will now talk about some other concepts:
Prediction vs. Interpretation
Regression vs. Classification
Supervised vs. unsupervised
Statistical learning:
Excels at prediction
Making learning algorithms interpretable is challenging.
Course outline
Models for Supervised learning
(emphasizing classification)
Logistic regression /
Multinomial regression
Discriminant analysis
k nearest neighbours
Naïve Bayes
Trees
Random forests
Boosting
Support vector machines
Neural networks
Multi-label learning
Case studies