Statistical Learning
[Figure: scatterplots of Sales against TV, Radio, and Newspaper advertising budgets.]

Sales ≈ f(TV, Radio, Newspaper)
Notation
Here Sales is a response or target that we wish to predict. We generically refer to the response as Y.

TV is a feature, or input, or predictor; we name it X1. Likewise, name Radio X2, and so on.

We can refer to the input vector collectively as

X = (X1, X2, X3)
Now we write our model as

Y = f(X) + ϵ

where ϵ captures measurement errors and other discrepancies.
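As a concrete sketch of this notation in code (assuming a local file Advertising.csv with columns TV, Radio, Newspaper, and Sales; the file and column names are illustrative assumptions, not part of the slides):

    import pandas as pd

    # Assumed local copy of the advertising data; file and column names are
    # illustrative assumptions.
    ads = pd.read_csv("Advertising.csv")

    # Input vector X = (X1, X2, X3) = (TV, Radio, Newspaper); response Y = Sales.
    X = ads[["TV", "Radio", "Newspaper"]].to_numpy()   # shape (n, 3)
    Y = ads["Sales"].to_numpy()                        # shape (n,)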
What is f (X ) good for?
[Figure: scatterplot of simulated data, y against x; many different y values are observed at each value of x, for example at x = 4.]
f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (average) of Y given X = 4.

This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)

fˆ(x) = Ave(Y | X ∈ N(x)), where N(x) is some neighborhood of x.
[Figure: scatterplot of y against x, with fˆ(x) computed by averaging the y values in a neighborhood N(x) of a target point x.]
• Nearest neighbor averaging can be pretty good for small p — i.e. p ≤ 4 — and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
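A minimal sketch of nearest neighbor averaging in one dimension, on simulated data (the generating function and neighborhood width below are assumptions for illustration):

    import numpy as np

    rng = np.random.default_rng(0)

    # Simulated training data Y = f(X) + eps, with an assumed f.
    n = 200
    x = rng.uniform(1, 7, n)
    y = np.sin(x) + x / 3 + rng.normal(scale=0.3, size=n)

    def f_hat(x0, x, y, width=0.5):
        """Estimate f(x0) = E(Y | X = x0) by averaging y over the
        neighborhood N(x0) = [x0 - width, x0 + width]."""
        in_nbhd = np.abs(x - x0) <= width
        return y[in_nbhd].mean() if in_nbhd.any() else float("nan")

    print(f_hat(4.0, x, y))   # estimate of E(Y | X = 4)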
The curse of dimensionality
[Figure: left — points in two dimensions (x1, x2) with a 10% neighborhood of a target point; right — the radius needed to capture a given fraction of the volume, plotted against that fraction, for p = 1, 2, 3, 5, and 10.]
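A rough numerical illustration of the problem (assuming, just for this sketch, observations spread uniformly over a unit hypercube): a cubical neighborhood capturing a fraction r of the observations needs edge length about r^(1/p).

    # Edge length of a cubical neighborhood capturing a fraction r of points
    # uniform on the unit hypercube in p dimensions (illustrative assumption).
    r = 0.10
    for p in [1, 2, 3, 5, 10]:
        edge = r ** (1 / p)
        print(f"p = {p:2d}: edge length needed ~ {edge:.2f}")
    # For p = 10 the "neighborhood" spans about 80% of each coordinate's range.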
Parametric and structured models
A linear model fˆL(X) = βˆ0 + βˆ1X gives a reasonable fit here
[Figure: the simulated data with the fitted linear model (top panel) and a more flexible fit to the same data (bottom panel).]
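A minimal least-squares sketch of fits like these, on simulated data (the data-generating function below is an assumption, not the data in the figure):

    import numpy as np

    rng = np.random.default_rng(1)
    x = rng.uniform(1, 6, 60)
    y = 0.5 * x + np.sin(x) + rng.normal(scale=0.3, size=60)

    # Linear model:    f_L(x) = b0 + b1 * x
    b1, b0 = np.polyfit(x, y, deg=1)

    # More flexible quadratic model: f_Q(x) = c0 + c1 * x + c2 * x**2
    c2, c1, c0 = np.polyfit(x, y, deg=2)

    print("linear fit:   ", b0, b1)
    print("quadratic fit:", c0, c1, c2)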
[Figures: three-dimensional plots of Income as a function of Years of Education and Seniority, comparing fitted surfaces of increasing flexibility.]
Some trade-offs
[Figure: interpretability (vertical axis, high to low) against flexibility (horizontal axis, low to high) for several methods: Subset Selection and the Lasso (high interpretability, low flexibility), Least Squares, Trees and related models, and Bagging and Boosting (high flexibility, low interpretability).]
Assessing Model Accuracy
[Figure: left — simulated data, Y against X, with fitted curves of increasing flexibility; right — mean squared error against flexibility.]
[Figure: a second simulated example — Y against X with fitted curves (left), and mean squared error against flexibility (right).]
Here the truth is smoother, so the smoother fit and linear model do really well.
[Figure: a third simulated example — Y against X with fitted curves (left), and mean squared error against flexibility (right).]
Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
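A minimal sketch of how curves like these arise, using polynomial degree as the notion of flexibility (the simulated truth and noise level are assumptions):

    import numpy as np

    rng = np.random.default_rng(2)

    def f_true(x):
        return np.sin(1.5 * x)   # assumed smooth truth

    def simulate(n=100):
        x = rng.uniform(0, 6, n)
        return x, f_true(x) + rng.normal(scale=0.5, size=n)

    x_tr, y_tr = simulate()   # training set
    x_te, y_te = simulate()   # test set

    for deg in [1, 2, 5, 10, 15]:        # increasing flexibility
        coefs = np.polyfit(x_tr, y_tr, deg)
        mse_tr = np.mean((y_tr - np.polyval(coefs, x_tr)) ** 2)
        mse_te = np.mean((y_te - np.polyval(coefs, x_te)) ** 2)
        print(f"degree {deg:2d}: train MSE {mse_tr:.2f}, test MSE {mse_te:.2f}")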
Bias-Variance Trade-off
Suppose we have fit a model fˆ(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ϵ (with f(x) = E(Y | X = x)), then

E[(y0 − fˆ(x0))²] = Var(fˆ(x0)) + [Bias(fˆ(x0))]² + Var(ϵ).

The expectation averages over the variability of y0 as well as the variability in Tr.
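A simulation sketch of this decomposition at a single test point x0, repeatedly redrawing the training set Tr and refitting (the true f, noise level, and fitting method below are assumptions):

    import numpy as np

    rng = np.random.default_rng(3)

    def f(x):
        return np.sin(x)        # assumed true regression function

    sigma = 0.5                 # sd of eps, so Var(eps) = 0.25
    x0, n, deg, reps = 2.0, 50, 3, 2000

    preds = np.empty(reps)
    for r in range(reps):
        x = rng.uniform(0, 6, n)                          # fresh training set Tr
        y = f(x) + rng.normal(scale=sigma, size=n)
        preds[r] = np.polyval(np.polyfit(x, y, deg), x0)  # f_hat(x0)

    var_fhat = preds.var()
    bias_sq = (preds.mean() - f(x0)) ** 2
    y0 = f(x0) + rng.normal(scale=sigma, size=reps)       # fresh test responses at x0
    mse = np.mean((y0 - preds) ** 2)

    # The two sides should agree up to simulation error.
    print("E[(y0 - f_hat(x0))^2] ~", mse)
    print("Var + Bias^2 + Var(eps) ~", var_fhat + bias_sq + sigma ** 2)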
[Figure: MSE, bias, and variance curves plotted against flexibility, in three panels (one for each simulated example above).]
Classification Problems
[Figure: simulated binary response data — y (0 or 1) against x, shown as rug plots at y = 0 and y = 1.]
Classification: some details
K Nearest Neighbor
[Figure: simulated two-class data in two dimensions, X1 and X2, with points from each class shown as circles.]
KNN: K = 10

[Figure: the two-class data with the KNN decision boundary for K = 10.]
KNN: K = 1 and KNN: K = 100

[Figure: KNN decision boundaries for K = 1 (left) and K = 100 (right).]
[Figure: training and test error rates against 1/K for the KNN classifier.]
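A minimal sketch of producing such a curve with scikit-learn's KNeighborsClassifier on simulated two-class data (the simulated data and class structure below are assumptions; the slides use a different simulated example):

    import numpy as np
    from sklearn.neighbors import KNeighborsClassifier

    rng = np.random.default_rng(4)

    def make_data(n=200):
        # Two classes in two dimensions, class 1 shifted by +1 in each coordinate.
        X = rng.normal(size=(n, 2))
        y = rng.integers(0, 2, n)
        X[y == 1] += 1.0
        return X, y

    X_tr, y_tr = make_data()
    X_te, y_te = make_data()

    for k in [100, 10, 1]:    # decreasing K, i.e. increasing flexibility (1/K grows)
        knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
        err_tr = np.mean(knn.predict(X_tr) != y_tr)
        err_te = np.mean(knn.predict(X_te) != y_te)
        print(f"K = {k:3d} (1/K = {1 / k:.2f}): "
              f"train error {err_tr:.2f}, test error {err_te:.2f}")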