Lec 1
Course logistics,
Supervised vs. Unsupervised learning,
Bias-Variance tradeoff
Rajan Patel
Syllabus
Prediction challenges
The Netflix prize
Netflix popularized prediction challenges by organizing an open,
blind contest to improve its recommendation system.
The prize was $1 million.
Figure: a Users × Movies matrix of rankings (1 to 5 stars).
Supervised vs. unsupervised learning
Variables or factors
Quantitative, e.g. weight, height, number of children, ...
Qualitative, e.g. college major, profession, gender, ...
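As a sketch (on made-up data), the two kinds of variables can be told apart programmatically by checking whether a column's values are numeric:

```python
# Toy records with both variable types (all values are hypothetical).
people = [
    {"weight_kg": 70.5, "height_cm": 175, "children": 2, "major": "Biology"},
    {"weight_kg": 62.0, "height_cm": 168, "children": 0, "major": "History"},
]

def variable_kind(values):
    """Classify a column as quantitative or qualitative from its values."""
    return "quantitative" if all(isinstance(v, (int, float)) for v in values) else "qualitative"

kinds = {col: variable_kind([row[col] for row in people]) for col in people[0]}
print(kinds)
```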
Supervised vs. unsupervised learning
Y = f(X) + ε, where ε is a random error term.
Motivations:
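A minimal simulation of this model, with an arbitrarily chosen f and Gaussian ε (both are illustration-only assumptions):

```python
import random

random.seed(0)

def f(x):
    return 2.0 * x + 1.0  # hypothetical true regression function

# Draw (X, Y) pairs from Y = f(X) + eps, with eps ~ N(0, 0.1^2).
n = 1000
xs = [random.uniform(0, 1) for _ in range(n)]
ys = [f(x) + random.gauss(0, 0.1) for x in xs]

# Averaging Y for X near a fixed point recovers f there, since E[eps] = 0.
near_half = [y for x, y in zip(xs, ys) if abs(x - 0.5) < 0.05]
est = sum(near_half) / len(near_half)
print(est)  # close to f(0.5) = 2.0
```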
Parametric and nonparametric methods:
f(X) = X1 β1 + · · · + Xp βp
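A sketch of fitting this parametric form by least squares on simulated data (the coefficients, sample size, and noise level are made up for illustration):

```python
import numpy as np

rng = np.random.default_rng(0)

# Simulated data from a linear model with p = 2 predictors.
n, p = 200, 2
X = rng.normal(size=(n, p))
beta_true = np.array([3.0, -1.5])
y = X @ beta_true + rng.normal(scale=0.5, size=n)

# Least-squares estimate of the parameters beta_1, ..., beta_p.
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
print(beta_hat)  # close to [3.0, -1.5]
```

Estimating f reduces to estimating the p numbers β1, ..., βp, which is what makes the method parametric.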
Parametric vs. nonparametric prediction
Figure: two surface plots of Income against Years of Education and Seniority.
Prediction error
Training data: (x1, y1), (x2, y2), . . . , (xn, yn)
Predicted function: f̂.
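A sketch comparing training and test error for two f̂'s of different flexibility (the true f, noise level, and polynomial degrees are all arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(1)

def true_f(x):
    return np.sin(x)

def mse(y, y_hat):
    """Mean squared prediction error."""
    return float(np.mean((y - y_hat) ** 2))

# Training and test samples from Y = f(X) + eps.
x_train = rng.uniform(0, 6, 100)
y_train = true_f(x_train) + rng.normal(0, 0.3, 100)
x_test = rng.uniform(0, 6, 100)
y_test = true_f(x_test) + rng.normal(0, 0.3, 100)

# Two candidate f-hats: a rigid (degree-1) and a flexible (degree-9) polynomial.
results = {}
for deg in (1, 9):
    coef = np.polyfit(x_train, y_train, deg)
    results[deg] = (mse(y_train, np.polyval(coef, x_train)),   # training MSE
                    mse(y_test, np.polyval(coef, x_test)))     # test MSE
print(results)
```

Training error always drops as flexibility grows; test error is the quantity we actually care about.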
Figure 2.9: Y against X (left); Mean Squared Error against Flexibility (right).
Figure 2.10: Y against X (left); Mean Squared Error against Flexibility (right).
Figure 2.11: Y against X (left); Mean Squared Error against Flexibility (right).
When the noise ε has small variance, the third method does well.
The bias variance decomposition
E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ε),
where Var(ε) is the irreducible error.
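The decomposition can be seen in a small simulation: repeatedly draw training sets, form a simple local-average f̂ at one point, and measure the offset and spread of its predictions. The smoother, noise level, and window widths here are all made-up choices:

```python
import numpy as np

rng = np.random.default_rng(2)

def true_f(x):
    return np.sin(3 * x)

sigma = 0.3         # noise sd; sigma**2 is the irreducible error
x0 = 0.5            # point at which we predict
n, reps = 200, 500  # training-set size and number of simulated training sets

def fit_and_predict(h):
    """Local-average f-hat at x0 with window half-width h (a toy smoother)."""
    x = rng.uniform(0, 1, n)
    y = true_f(x) + rng.normal(0, sigma, n)
    return y[np.abs(x - x0) < h].mean()

results = {}
for h in (0.05, 0.4):
    preds = np.array([fit_and_predict(h) for _ in range(reps)])
    results[h] = ((preds.mean() - true_f(x0)) ** 2, preds.var())  # (bias^2, variance)
print(results)
```

Shrinking h makes f̂ more flexible: its squared bias falls while its variance rises, and Var(ε) = σ² is untouched either way.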
Implications of bias variance decomposition
Figure 2.12: MSE, squared bias, and variance against flexibility in three settings: squiggly f with high noise; linear f with high noise; squiggly f with low noise.
Classification problems
The model:
Y = f(X) + ε
becomes insufficient, as Y is not necessarily real-valued.
Loss function for classification
E[1(y0 ≠ ŷ0)]
Like the MSE, this quantity can be estimated from training and
test data by taking a sample average:
(1/n) Σ_{i=1}^{n} 1(yi ≠ ŷi)
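A sketch of this sample average on a handful of made-up labels:

```python
# Hypothetical test labels and predictions.
y = ["cat", "dog", "dog", "cat", "bird", "dog"]
y_hat = ["cat", "dog", "cat", "cat", "dog", "dog"]

# Sample average of the 0-1 loss: (1/n) * sum of 1(y_i != yhat_i).
err_rate = sum(yi != yhi for yi, yhi in zip(y, y_hat)) / len(y)
print(err_rate)  # 2 mistakes out of 6
```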