
Lecture 1:

Course logistics,
Supervised vs. Unsupervised learning,
Bias-Variance tradeoff

STATS 202: Data mining and analysis

Rajan Patel

Syllabus

- Videos: Every lecture will be recorded by SCPD.
- Email policy: Please use the stats202 Google group for most
  questions. Homeworks and all SCPD exams should be e-mailed to
  [email protected].
- Class website: www.stats202.com.
Prediction challenges

The MNIST dataset is a collection of images of handwritten digits.

In a prediction challenge, you are given a training set of images of
handwritten digits, labeled from 0 to 9. You are also given a test set
of handwritten digits, which are unlabeled. Your job is to assign a
digit to each image in the test set.
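To make the task concrete, here is a minimal sketch in Python, assuming
scikit-learn is available; it uses the library's small built-in digits
dataset as a stand-in for MNIST, and the choice of classifier is only
illustrative.

# Fit a classifier on labeled training digits and assign a digit to
# each image in an unlabeled test set.
from sklearn.datasets import load_digits
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression

X, y = load_digits(return_X_y=True)          # 8x8 images, labels 0-9
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, random_state=0)

clf = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = clf.predict(X_test)                 # a digit for each test image
print("test accuracy:", clf.score(X_test, y_test))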
The Netflix prize

Netflix popularized prediction challenges by organizing an open, blind
contest to improve its recommendation system. The prize was $1 million.

[Figure: a matrix of rankings (1 to 5 stars), indexed by users and
movies.]

Some rankings were hidden in the training data, and the challenge was
to predict those rankings.
Supervised vs. unsupervised learning

In unsupervised learning we start with a data matrix: the rows are
samples or units, and the columns are variables or factors.

Variables may be quantitative (e.g. weight, height, number of
children, ...) or qualitative (e.g. college major, profession,
gender, ...).
Our goal is to:

- Find meaningful relationships between the variables or units
  (correlation analysis).
- Find meaningful groupings of the data (clustering).

Unsupervised learning is also known in statistics as exploratory data
analysis.
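As a brief illustration of the clustering goal, here is a minimal
Python sketch, assuming scikit-learn is available; the data matrix is
synthetic and k-means is only one of many possible clustering methods.

# Find groupings of the rows (samples) of a data matrix with k-means.
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
# Synthetic data matrix: 100 samples (rows), 2 variables (columns),
# drawn from two well-separated groups.
X = np.vstack([rng.normal(0, 1, size=(50, 2)),
               rng.normal(5, 1, size=(50, 2))])

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(X)
print(labels[:10])   # cluster assignment for the first 10 samples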
Supervised vs. unsupervised learning

In supervised learning, there are input variables and an output
variable. The rows of the data matrix are still samples or units;
among the variables or factors in its columns, some are input
variables and one is the output variable.

- If the output variable is quantitative, we say this is a regression
  problem.
- If the output variable is qualitative, we say this is a
  classification problem.
If X is the vector of inputs for a particular sample, the output
variable is modeled by:

Y = f(X) + ε,

where ε is a random error term.

Our goal is to learn the function f, using a set of training samples.
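For concreteness, data from such a model can be simulated as in the
following Python sketch; the true f and the noise level are arbitrary
assumptions made for illustration.

# Simulate training data from Y = f(X) + epsilon.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)      # an assumed "true" f
x_train = rng.uniform(0, 1, size=100)
eps = rng.normal(0, 0.3, size=100)       # random error with an assumed sd of 0.3
y_train = f(x_train) + eps               # Y = f(X) + epsilon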
Y = f(X) + ε

Motivations:

- Prediction: Useful when the input variables are readily available,
  but the output variable is not.
  Example: predict stock prices next month using data from last year.
- Inference: A model for f can help us understand the structure of the
  data: which variables influence the output, and which don't? What is
  the relationship between each variable and the output, e.g. linear
  or non-linear?
  Example: what is the influence of genetic variations on the
  incidence of heart disease?
Parametric and nonparametric methods

There are two kinds of supervised learning methods:

- Parametric methods: We assume that f takes a specific form, for
  example a linear form:

  f(X) = X1 β1 + · · · + Xp βp,

  with parameters β1, . . . , βp. Using the training data, we try to
  fit the parameters.
- Non-parametric methods: We don't make any assumptions on the form of
  f, but we restrict how "wiggly" or "rough" the function can be.
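To illustrate the distinction, the following Python sketch fits a
parametric linear model and a non-parametric k-nearest-neighbors model
to the same simulated training set; it assumes scikit-learn, and the
data-generating function is an arbitrary choice.

# Parametric vs. non-parametric fits on the same training data.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.neighbors import KNeighborsRegressor

rng = np.random.default_rng(0)
X = rng.uniform(0, 1, size=(200, 1))
y = np.sin(2 * np.pi * X[:, 0]) + rng.normal(0, 0.3, size=200)

parametric = LinearRegression().fit(X, y)                      # assumes f is linear
nonparametric = KNeighborsRegressor(n_neighbors=10).fit(X, y)  # no form assumed

x_new = np.array([[0.25]])
print(parametric.predict(x_new), nonparametric.predict(x_new))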
Parametric vs. nonparametric prediction

[Figures 2.4 and 2.5: Income as a function of Years of Education and
Seniority, with a parametric fit and a non-parametric fit.]

Parametric methods have a limit of fit quality. Non-parametric methods
keep improving as we add more data to fit.

Parametric methods are often simpler to interpret.
Prediction error

Training data: (x1, y1), (x2, y2), . . . , (xn, yn).
Predicted function: f̂.

Our goal in supervised learning is to minimize the prediction error.
For regression models, this is typically the Mean Squared Error:

MSE(f̂) = E(y0 - f̂(x0))^2.

Unfortunately, this quantity cannot be computed, because we don't know
the true joint distribution of (X, Y). We can compute a sample average
using the training data; this is known as the training MSE:

MSE_training(f̂) = (1/n) Σ_{i=1}^{n} (yi - f̂(xi))^2.
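As a small illustration (with an assumed dataset and an assumed model
for f̂), the training MSE is just the average squared residual on the
training set:

# Compute the training MSE of a fitted f-hat.
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
x_train = rng.uniform(0, 1, size=(100, 1))
y_train = np.sin(2 * np.pi * x_train[:, 0]) + rng.normal(0, 0.3, size=100)

f_hat = LinearRegression().fit(x_train, y_train)
train_mse = np.mean((y_train - f_hat.predict(x_train)) ** 2)
print("training MSE:", train_mse)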
The main challenge of statistical learning is that a low training MSE
does not imply a low MSE.

If we have test data {(x′i, y′i); i = 1, . . . , m} which were not used
to fit the model, a better measure of quality for f̂ is the test MSE:

MSE_test(f̂) = (1/m) Σ_{i=1}^{m} (y′i - f̂(x′i))^2.
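The gap between the two can be seen in a short simulation; the data,
the noise level, and the use of a high-degree polynomial as a flexible
fit are all assumptions made for illustration.

# A very flexible fit: low training MSE, typically much higher test MSE.
import numpy as np

rng = np.random.default_rng(1)
f = lambda x: np.sin(2 * np.pi * x)
x_train = rng.uniform(0, 1, 30)
y_train = f(x_train) + rng.normal(0, 0.3, 30)
x_test = rng.uniform(0, 1, 200)
y_test = f(x_test) + rng.normal(0, 0.3, 200)

coef = np.polyfit(x_train, y_train, deg=10)    # a deliberately rough fit
train_mse = np.mean((y_train - np.polyval(coef, x_train)) ** 2)
test_mse = np.mean((y_test - np.polyval(coef, x_test)) ** 2)
print(train_mse, test_mse)                     # training MSE is usually much smaller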
[Figure 2.9: left panel, simulated data (Y vs. X) with three fitted
curves; right panel, Mean Squared Error vs. Flexibility.]

The circles are simulated data from the black curve. In this
artificial example, we know what f is.

Three estimates f̂ are shown:

1. Linear regression.
2. Splines (very smooth).
3. Splines (quite rough).

Red line: Test MSE.
Gray line: Training MSE.
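A rough numerical analogue of this experiment is sketched below; it
uses polynomial degree as a stand-in for spline flexibility, on
simulated data, so the numbers only mimic the qualitative shape of the
curves.

# Sweep flexibility and record training vs. test MSE.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
x_tr = rng.uniform(0, 1, 50)
y_tr = f(x_tr) + rng.normal(0, 0.3, 50)
x_te = rng.uniform(0, 1, 500)
y_te = f(x_te) + rng.normal(0, 0.3, 500)

for deg in [1, 2, 4, 8, 12]:                   # increasing flexibility
    c = np.polyfit(x_tr, y_tr, deg)
    tr = np.mean((y_tr - np.polyval(c, x_tr)) ** 2)
    te = np.mean((y_te - np.polyval(c, x_te)) ** 2)
    # Training MSE keeps falling; test MSE is typically U-shaped.
    print(deg, round(tr, 3), round(te, 3))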
[Figure 2.10: left panel, simulated data (Y vs. X) with three fitted
curves; right panel, Mean Squared Error vs. Flexibility.]

The function f is now almost linear.
[Figure 2.11: left panel, simulated data (Y vs. X) with three fitted
curves; right panel, Mean Squared Error vs. Flexibility.]

When the noise ε has small variance, the third method (the rough
splines) does well.
The bias-variance decomposition

Let x0 be a fixed test point, y0 = f(x0) + ε0, and let f̂ be estimated
from n training samples (x1, y1), . . . , (xn, yn).

Let E denote the expectation over y0 and the training outputs
(y1, . . . , yn). Then, the Mean Squared Error at x0 can be decomposed:

MSE(x0) = E(y0 - f̂(x0))^2 = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε0).

- Var(ε0) is the irreducible error.
- Var(f̂(x0)) = E[f̂(x0) - E(f̂(x0))]^2 is the variance of the estimate
  of Y: it measures how much the estimate of f at x0 changes when we
  sample new training data.
- [Bias(f̂(x0))]^2 = [E(f̂(x0)) - f(x0)]^2 is the squared bias of the
  estimate of Y: it measures the deviation of the average prediction
  E(f̂(x0)) from the truth f(x0).
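The decomposition can be checked numerically with a Monte Carlo sketch
like the one below; the true f, the noise level, and the estimator (a
degree-3 polynomial fit) are all assumptions made for illustration.

# Monte Carlo check of the bias-variance decomposition at a fixed x0.
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(2 * np.pi * x)
sigma, n, x0 = 0.3, 50, 0.4

preds = []
for _ in range(2000):                      # many independent training sets
    x = rng.uniform(0, 1, n)
    y = f(x) + rng.normal(0, sigma, n)
    c = np.polyfit(x, y, 3)                # fit f-hat on this training set
    preds.append(np.polyval(c, x0))        # f-hat(x0)

preds = np.array(preds)
variance = preds.var()                     # Var(f-hat(x0))
bias_sq = (preds.mean() - f(x0)) ** 2      # [Bias(f-hat(x0))]^2
print(variance + bias_sq + sigma ** 2)     # approximately E(y0 - f-hat(x0))^2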
Implications of the bias-variance decomposition

MSE(x0) = E(y0 - f̂(x0))^2 = Var(f̂(x0)) + [Bias(f̂(x0))]^2 + Var(ε).

- The MSE is always non-negative.
- Each element on the right-hand side is always non-negative.
- Therefore, typically when we decrease the bias beyond some point, we
  increase the variance, and vice versa.

More flexibility ⟺ higher variance ⟺ lower bias.
[Figure 2.12: MSE, Bias, and Var curves as a function of Flexibility
in three settings: squiggly f with high noise, linear f with high
noise, and squiggly f with low noise.]
Classification problems

In a classification setting, the output takes values in a discrete
set. For example, if we are predicting the brand of a car based on a
number of variables, the function f takes values in the set
{Ford, Toyota, Mercedes-Benz, . . . }.

The model
Y = f(X) + ε
becomes insufficient, as Y is not necessarily real-valued.

We will use slightly different notation:

P(X, Y): joint distribution of (X, Y),
P(Y | X): conditional distribution of Y given X,
ŷi: prediction for xi.
Loss function for classification

There are many ways to measure the error of a classification
prediction. One of the most common is the 0-1 loss:

E(1(y0 ≠ ŷ0)).

Like the MSE, this quantity can be estimated from training and test
data by taking a sample average:

(1/n) Σ_{i=1}^{n} 1(yi ≠ ŷi).
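As a small illustration, the sample average of the 0-1 loss is the
misclassification rate; the dataset and classifier below are assumed
for illustration, using scikit-learn.

# Estimate the 0-1 loss as the fraction of misclassified test points.
import numpy as np
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = make_classification(n_samples=500, n_features=5, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

clf = KNeighborsClassifier(n_neighbors=5).fit(X_tr, y_tr)
y_hat = clf.predict(X_te)
zero_one = np.mean(y_te != y_hat)          # sample average of 1(y_i != y-hat_i)
print("estimated 0-1 loss:", zero_one)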
