
Statistical Learning: Classification
Stat441
Matthias Schonlau, Ph.D.

Overview
What is statistical learning?
Overfitting / train-test split
Example: Dutch income vs. age

Bias-variance tradeoff
Some concepts
Prediction vs. Interpretation
Regression vs. Classification
Supervised vs. unsupervised
Reading: Chapter 2, James et al.

What is statistical learning?

Linear regression
Y = f(x) + ε, where f(x) is a linear function.
The parameters β are unknown, but the functional form (linearity) is known.
The functional form is known, except for possible variable selection, quadratic terms, etc.

What is statistical learning?


Y = f(x) + ε
In statistical learning, f(x) is unknown and must be estimated from the data.
It is usually not possible to write down f(x) with simple equations.
It may contain hundreds or thousands of parameters.
It is not easily interpretable.
Such methods are sometimes called black-box methods.

LISS survey panel in the Netherlands, August 2015.
A random subset of about n=60 observations: income vs. age.

[Figure: Dutch income vs. age. Scatter plot of personal gross monthly income in Euros (imputed) against age of the household member.]
When y spans several orders of magnitude (i.e., y_max >> 10 y_min), it often makes sense to take log(y).
Caution: log(0) is undefined.
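A minimal Stata sketch of this transformation, using the hypothetical variable names income and leeftijd (age); the slides add 1 euro before taking the log to avoid log(0):

* hypothetical variable names; add 1 euro to avoid log(0)
gen logincome = log(income + 1)
graph twoway (scatter logincome leeftijd)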



Log income (+1 euro) vs. age: a random subset of about n=60 observations.

[Figure: Dutch log income vs. age of the household member.]


Linear regression with 95% confidence interval.
Because the relationship is not linear, the fit is poor.

[Figure: Dutch income vs. age with fitted linear regression line and 95% CI (logincome vs. age of the household member).]
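A minimal sketch of how such a plot could be produced in Stata, assuming the logincome and leeftijd variables used elsewhere in these slides:

* fitted regression line with 95% CI band, overlaid on the scatter plot
graph twoway (lfitci logincome leeftijd) (scatter logincome leeftijd)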

Dutch income vs. age

Now fit a learning algorithm on the (full) data.
Use boosting with interaction = 10.
Think of boosting as a black-box method for now; we will learn later what this means.


It predicts every observation perfectly. The learning algorithm is very flexible.

[Figure: Dutch income vs. age with boosted predictions (logincome and prediction vs. age of the household member).]

graph twoway (scatter logincome leeftijd) (line predb leeftijd)

[Figure: Boosted predictions of log income vs. age, comparing train=50% and train=100%.]

The fit with train=50% can be further improved, but this is not the point here.
By splitting the data into a training and a test set, overfitting can be avoided.


Train=100%

boost logincome leeftijd, inter(10) shrink(.01) bag(1) train(1.0) pred(predb) distribution(normal)

The only difference is the specification of the training data.


Train=50%

boost logincome leeftijd, inter(10) shrink(.01) bag(1) train(0.5) pred(predb) distribution(normal)



Scatter plot of the full data set (n ≈ 6000)
Regression line
Predicted values (using SVMs)

[Figure: Dutch income vs. age, full data set, showing logincome and SVM predictions (pred) against age of the household member.]

graph twoway (scatter logincome leeftijd, msize(.1) jitter(2)) (line pred leeftijd, lcolor(orange))

Overfitting
Overfitting means the model fits the random noise of a sample rather than the generalizable relationship.
Overfitting tends to occur when the model has too many parameters relative to the number of observations.
Learning algorithms are designed to be flexible and tend to have a lot of parameters.

Overfitting
How does one defend against overfitting?
Separate the data into training and test data.
Fit the model on the training data.
Evaluate the model fit on the test data.

If you accidentally fit the noise in the training data, the fit on the test data will deteriorate.
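A minimal Stata sketch of this workflow (hypothetical; a plain linear regression stands in for the learning algorithm, and the variable names follow the income example):

* randomly assign roughly half of the observations to the training set
set seed 441
gen train = runiform() < 0.5

* fit the model on the training data only
regress logincome leeftijd if train

* obtain predictions for all observations (training and test)
predict yhat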

Evaluation of fit
For continuous outcomes, the fit is often evaluated with the mean squared error:

$\mathrm{MSE} = \frac{1}{n}\sum_{i=1}^{n}\left(y_i - \hat{f}(x_i)\right)^2$

The MSE can be computed both for the training and the test data.
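Continuing the hypothetical sketch above, the training and test MSE could be computed as:

* squared prediction errors
gen sqerr = (logincome - yhat)^2

* training MSE
quietly summarize sqerr if train
display "training MSE = " r(mean)

* test MSE
quietly summarize sqerr if !train
display "test MSE = " r(mean)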

Evaluation of fit
All machine learning algorithms have at least one tuning parameter.
A tuning parameter governs how flexible the fit is.
One can plot the MSE as a function of the flexibility parameter, both for the training data and for the test data.
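As an illustration only (not from the slides), the following hypothetical Stata sketch uses the polynomial degree of age as the flexibility parameter and prints the training and test MSE for each degree, reusing the train indicator and variables from the earlier sketch:

forvalues d = 1/4 {
    capture drop yhat_poly sqerr_poly
    * build polynomial terms in age up to degree `d'
    local terms ""
    forvalues k = 1/`d' {
        capture gen age`k' = leeftijd^`k'
        local terms "`terms' age`k'"
    }
    quietly regress logincome `terms' if train
    quietly predict yhat_poly
    quietly gen sqerr_poly = (logincome - yhat_poly)^2
    quietly summarize sqerr_poly if train
    local mse_train = r(mean)
    quietly summarize sqerr_poly if !train
    display "degree `d': train MSE = " %6.3f `mse_train' "  test MSE = " %6.3f r(mean)
}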

Test error vs train error


With increased flexibility, the error on the training data will continue to decrease, while the error on the test data will eventually increase (U-shape).
The best fit occurs where the test error is minimized.
Figure 2.9 from James et al.

Dutch income data

In boosting, a flexibility parameter is the number of iterations.

Train=100%:
Fit as many iterations as it takes until the best fit is achieved.
When there are duplicates in x with differing ys, a perfect fit (MSE=0) is not possible.

Train=50%:
Fit as many iterations as it takes to minimize the MSE (or an equivalent criterion).
Here, this is done automatically. The output contains information about the best number of iterations: bestiter= 152

[Figure: Boosted predictions of log income vs. age of the household member, Train=100% and Train=50% panels.]

Evaluation of fit
For the Dutch income example, since we had perfect predictions on the training data:
MSE_train = 0

Bias-Variance tradeoff
There is a reason why the U-shape in the test MSE occurs.
The expected MSE can be decomposed as:
$E\left(y_0 - \hat{f}(x_0)\right)^2 = \mathrm{Var}\left(\hat{f}(x_0)\right) + \left[\mathrm{Bias}\left(\hat{f}(x_0)\right)\right]^2 + \mathrm{Var}(\varepsilon)$

where $y_0$ and $x_0$ refer to test data.
All three components are non-negative.
$\mathrm{Var}(\varepsilon)$ is a lower bound on the expected test MSE.
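As a brief sketch of why this decomposition holds (not from the slides), write $y_0 = f(x_0) + \varepsilon$ with $E[\varepsilon] = 0$ and $\varepsilon$ independent of $\hat{f}$:

$E\bigl(y_0 - \hat{f}(x_0)\bigr)^2 = E\bigl(f(x_0) - \hat{f}(x_0)\bigr)^2 + \mathrm{Var}(\varepsilon) = \mathrm{Var}\bigl(\hat{f}(x_0)\bigr) + \bigl[E\hat{f}(x_0) - f(x_0)\bigr]^2 + \mathrm{Var}(\varepsilon)$

The cross term vanishes because $\varepsilon$ has mean zero and is independent of $\hat{f}(x_0)$; the last step adds and subtracts $E\hat{f}(x_0)$.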

Bias-Variance tradeoff
Variance refers to the variation we would get by using a different training data set.
Bias refers to the error between the learning model and the true function.
The equation states that we need to simultaneously minimize both bias and variance.

Bias-Variance tradeoff
All curves refer to the test data:
Squared bias (blue curve)
Variance (orange curve)
Var(ε) (dashed line)
Test error (red curve)

Figure 2.12 from James et al.

Bias-Variance tradeoff
When the training data are large, a flexible learning algorithm may be able to eliminate much of the bias.
In real life the true function is unknown, and it is not possible to compute this bias-variance tradeoff explicitly.
But it is useful to keep this tradeoff in mind.

Some concepts
We will now talk about some other concepts:
Prediction vs. Interpretation
Regression vs. Classification
Supervised vs. unsupervised

Interpretation vs. Prediction


Linear regression:
Interpretable.
Can specify how any one variable affects y

Good at prediction only when the model is correct

Statistical learning:
Excels at prediction
Making a learning algorithm interpretable is challenging

Regression vs. classification


Regression:
y is continuous
Classification:
y indicates class membership in one of L classes
An important special case is L = 2:
has cancer vs. no cancer
default on loan vs. no default on loan
Logistic regression is the most common method when L = 2 (see the sketch below).
The designation "classification" is not used consistently.
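For the L = 2 case, a minimal hypothetical Stata sketch (variable names invented for illustration):

* y is a 0/1 indicator, e.g., default on a loan
logit default income age
* predicted probabilities of class 1
predict phat, pr
* classify as default if the predicted probability exceeds 0.5
gen default_hat = phat > 0.5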

Supervised vs. Unsupervised learning


Supervised learning:
y and x-vars are known
Unsupervised learning:
There are only x-vars; y is unknown
Often the goal is to cluster observations into groups.

Supervised vs unsupervised analysis


Supervised analysis if the colors (i.e., the ys) are known.
Unsupervised analysis if the colors are not known.

Course outline
Models for supervised learning (emphasizing classification):
Logistic regression / Multinomial regression
Discriminant analysis
k nearest neighbours
Naive Bayes
Trees
Random forests
Boosting
Support vector machines
Neural networks

Issues in supervised learning


Overfitting
Feature (= variable) selection
Text mining: turning text into numerical variables

Multi-label learning
Case studies
