Statistical Learning


What is Statistical Learning?

[Figure: scatterplots of Sales versus TV, Radio, and Newspaper advertising budgets, each panel with a fitted regression line.]

Shown are Sales vs TV, Radio and Newspaper, with a blue linear-regression line fit separately to each.
Can we predict Sales using these three?
Perhaps we can do better using a model

Sales ≈ f(TV, Radio, Newspaper)
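A minimal sketch of these three separate simple regressions, assuming a hypothetical Advertising.csv file with columns TV, Radio, Newspaper and Sales (adjust names and path to your own copy of the data):

```python
import numpy as np
import pandas as pd

ads = pd.read_csv("Advertising.csv")   # hypothetical file name and columns

for col in ["TV", "Radio", "Newspaper"]:
    # Simple linear regression of Sales on a single predictor.
    slope, intercept = np.polyfit(ads[col], ads["Sales"], deg=1)
    print(f"Sales ≈ {intercept:.2f} + {slope:.3f} × {col}")
```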
Notation
Here Sales is a response or target that we wish to predict. We generically refer to the response as Y.
TV is a feature, or input, or predictor; we name it X1. Likewise we name Radio X2, and so on.
We can refer to the input vector collectively as

X = (X1, X2, X3)^T

Now we write our model as

Y = f(X) + ϵ

where ϵ captures measurement errors and other discrepancies.
What is f(X) good for?

• With a good f we can make predictions of Y at new points X = x.
• We can understand which components of X = (X1, X2, . . . , Xp) are important in explaining Y, and which are irrelevant. E.g. Seniority and Years of Education have a big impact on Income, but Marital Status typically does not.
• Depending on the complexity of f, we may be able to understand how each component Xj of X affects Y.
[Figure: simulated scatterplot of y versus x, for x between 1 and 7; at any given x there are many possible values of y.]
Is there an ideal f(X)? In particular, what is a good value for f(X) at any selected value of X, say X = 4? There can be many Y values at X = 4. A good value is

f(4) = E(Y | X = 4)

E(Y | X = 4) means the expected value (average) of Y given X = 4.
This ideal f(x) = E(Y | X = x) is called the regression function.
The regression function f(x)

• Is also defined for vector X; e.g.
  f(x) = f(x1, x2, x3) = E(Y | X1 = x1, X2 = x2, X3 = x3)
• Is the ideal or optimal predictor of Y with regard to mean-squared prediction error: f(x) = E(Y | X = x) is the function that minimizes E[(Y − g(X))² | X = x] over all functions g at all points X = x.
• ϵ = Y − f(x) is the irreducible error — i.e. even if we knew f(x), we would still make errors in prediction, since at each X = x there is typically a distribution of possible Y values.
• For any estimate f̂(x) of f(x), we have

  E[(Y − f̂(X))² | X = x] = [f(x) − f̂(x)]² + Var(ϵ)

  where [f(x) − f̂(x)]² is the reducible error and Var(ϵ) is the irreducible error.
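A small simulation sketch of this decomposition, using a synthetic true function and an arbitrary (imperfect) estimate at one point, both chosen purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)
f = lambda x: np.sin(x)        # assumed "true" regression function
sigma = 0.5                    # standard deviation of the noise eps
x0, fhat_x0 = 4.0, 0.2         # a fixed point and some imperfect estimate there

# Many draws of Y at X = x0; the mean squared error should split into the
# reducible part [f(x0) - fhat(x0)]^2 plus the irreducible Var(eps).
y0 = f(x0) + sigma * rng.standard_normal(100_000)
print(np.mean((y0 - fhat_x0) ** 2))
print((f(x0) - fhat_x0) ** 2 + sigma ** 2)   # the two numbers should agree
```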
How to estimate f
• Typically we have few if any data points with X = 4 exactly.
• So we cannot compute E(Y | X = x)!
• Relax the definition and let

  f̂(x) = Ave(Y | X ∈ N(x))

  where N(x) is some neighborhood of x.

[Figure: simulated y versus x data illustrating the idea: f̂(x) is the average of the responses yi whose xi fall in a neighborhood N(x) of x.]
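A minimal sketch of such neighborhood averaging on synthetic data, taking N(x) to be the k nearest observed x values (one simple choice of neighborhood):

```python
import numpy as np

rng = np.random.default_rng(1)
x = rng.uniform(1, 6, size=200)
y = np.sin(x) + 0.3 * rng.standard_normal(200)

def neighborhood_average(x0, x, y, k=20):
    """Estimate f(x0) = E(Y | X = x0) by averaging y over the k nearest x's."""
    nearest = np.argsort(np.abs(x - x0))[:k]
    return y[nearest].mean()

print(neighborhood_average(4.0, x, y))   # compare with the truth sin(4) ≈ -0.76
```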
• Nearest neighbor averaging can be pretty good for small p — i.e. p ≤ 4 and large-ish N.
• We will discuss smoother versions, such as kernel and spline smoothing, later in the course.
• Nearest neighbor methods can be lousy when p is large. Reason: the curse of dimensionality. Nearest neighbors tend to be far away in high dimensions.
• We need to average over a reasonable fraction of the N values of yi to bring the variance down — e.g. 10%.
• A 10% neighborhood in high dimensions need no longer be local, so we lose the spirit of estimating E(Y | X = x) by local averaging.
The curse of dimensionality

[Figure: left, a 10% neighborhood of a target point in two dimensions (x1, x2); right, the radius needed to capture a given fraction of the volume, plotted for p = 1, 2, 3, 5, 10. The required radius grows quickly with the dimension p.]
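A back-of-the-envelope version of the same effect, using cube-shaped neighborhoods in a unit hypercube instead of the spherical neighborhoods in the figure, since that keeps the arithmetic simple:

```python
# Edge length a sub-cube needs in order to contain 10% of points spread
# uniformly over the unit hypercube [0, 1]^p: 0.10 ** (1 / p).
for p in (1, 2, 3, 5, 10):
    edge = 0.10 ** (1 / p)
    print(f"p = {p:2d}: edge of a 10% neighborhood ≈ {edge:.2f}")
# At p = 10 the edge is about 0.79 -- most of the range, so not "local" at all.
```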
Parametric and structured models

The linear model is an important example of a parametric model:

f_L(X) = β0 + β1 X1 + β2 X2 + · · · + βp Xp.

• A linear model is specified in terms of p + 1 parameters β0, β1, . . . , βp.
• We estimate the parameters by fitting the model to training data.
• Although it is almost never correct, a linear model often serves as a good and interpretable approximation to the unknown true function f(X).
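A minimal sketch of estimating the p + 1 parameters by least squares on synthetic training data (plain numpy here; statsmodels or scikit-learn would give the same fit):

```python
import numpy as np

rng = np.random.default_rng(2)
n, p = 100, 3
X = rng.standard_normal((n, p))
beta_true = np.array([1.0, 2.0, -1.0, 0.5])        # intercept plus p slopes
y = beta_true[0] + X @ beta_true[1:] + 0.5 * rng.standard_normal(n)

X1 = np.column_stack([np.ones(n), X])              # prepend an intercept column
beta_hat, *_ = np.linalg.lstsq(X1, y, rcond=None)  # least-squares fit
print(beta_hat)                                    # close to beta_true
```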
A linear model f̂_L(X) = β̂0 + β̂1 X gives a reasonable fit here.

[Figure: the simulated y versus x data with the fitted straight line.]
A quadratic model f̂_Q(X) = β̂0 + β̂1 X + β̂2 X² fits slightly better.

[Figure: the same data with the fitted quadratic curve.]
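A small sketch comparing the two fits on synthetic data with a mildly curved truth (training error only; test-set assessment comes later in these slides):

```python
import numpy as np

rng = np.random.default_rng(3)
x = np.sort(rng.uniform(1, 6, size=60))
y = 0.5 * (x - 3.5) ** 2 + rng.standard_normal(60)   # mildly curved truth

for deg, name in [(1, "linear"), (2, "quadratic")]:
    fit = np.polynomial.Polynomial.fit(x, y, deg)    # least-squares polynomial
    mse = np.mean((y - fit(x)) ** 2)
    print(f"{name}: training MSE = {mse:.3f}")
```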
[Figure: 3D plot of Income against Years of Education and Seniority.]

Simulated example. Red points are simulated values for income from the model

income = f(education, seniority) + ϵ

f is the blue surface.
[Figure: a fitted regression plane for Income against Years of Education and Seniority.]

Linear regression model fit to the simulated data:

f̂_L(education, seniority) = β̂0 + β̂1 × education + β̂2 × seniority
[Figure: a smooth fitted surface for Income against Years of Education and Seniority.]

More flexible regression model f̂_S(education, seniority) fit to the simulated data. Here we use a technique called a thin-plate spline to fit a flexible surface. We control the roughness of the fit.
[Figure: a very rough fitted surface that passes through every training point.]

Even more flexible spline regression model f̂_S(education, seniority) fit to the simulated data. Here the fitted model makes no errors on the training data! Also known as overfitting.
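A minimal sketch of both behaviours on synthetic two-predictor data, using scipy's thin-plate-spline radial basis interpolator as a stand-in for the surfaces above; its smoothing parameter plays the role of the roughness control, and smoothing = 0 reproduces the overfitting case by interpolating the training data exactly:

```python
import numpy as np
from scipy.interpolate import RBFInterpolator

rng = np.random.default_rng(4)
X = rng.uniform(0, 1, size=(150, 2))                  # e.g. education, seniority
y = np.sin(3 * X[:, 0]) + X[:, 1] ** 2 + 0.1 * rng.standard_normal(150)

for smoothing in (1.0, 0.0):                          # roughness control; 0 = interpolate
    fit = RBFInterpolator(X, y, kernel="thin_plate_spline", smoothing=smoothing)
    mse_tr = np.mean((fit(X) - y) ** 2)
    print(f"smoothing = {smoothing}: training MSE = {mse_tr:.4f}")
# smoothing = 0 gives training MSE ≈ 0: no errors on the training data.
```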
Some trade-offs

• Prediction accuracy versus interpretability.
  — Linear models are easy to interpret; thin-plate splines are not.
• Good fit versus over-fit or under-fit.
  — How do we know when the fit is just right?
• Parsimony versus black-box.
  — We often prefer a simpler model involving fewer variables over a black-box predictor involving them all.
[Figure: interpretability (high to low) plotted against flexibility (low to high). Subset selection and the lasso sit at the interpretable, inflexible end; least squares, generalized additive models and trees are intermediate; bagging, boosting and support vector machines are the most flexible and least interpretable.]
Assessing Model Accuracy

Suppose we fit a model f̂(x) to some training data Tr = {xi, yi}_1^N, and we wish to see how well it performs.
• We could compute the average squared prediction error over Tr:

  MSE_Tr = Ave_{i∈Tr}[yi − f̂(xi)]²

  This may be biased toward more overfit models.
• Instead we should, if possible, compute it using fresh test data Te = {xi, yi}_1^M:

  MSE_Te = Ave_{i∈Te}[yi − f̂(xi)]²
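A minimal sketch of the two quantities on synthetic data, with polynomial fits of increasing degree standing in for models of increasing flexibility:

```python
import numpy as np

rng = np.random.default_rng(5)
def make_data(n):
    x = rng.uniform(0, 10, n)
    return x, np.sin(x) + 0.4 * rng.standard_normal(n)

x_tr, y_tr = make_data(50)     # training set Tr
x_te, y_te = make_data(500)    # fresh test set Te

for deg in (1, 4, 12):         # increasing flexibility
    fit = np.polynomial.Polynomial.fit(x_tr, y_tr, deg)
    mse_tr = np.mean((y_tr - fit(x_tr)) ** 2)
    mse_te = np.mean((y_te - fit(x_te)) ** 2)
    print(f"degree {deg:2d}: MSE_Tr = {mse_tr:.3f}, MSE_Te = {mse_te:.3f}")
```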
[Figure: left, simulated data with the true curve and three fits of increasing flexibility; right, mean squared error against flexibility.]

Black curve is truth. Red curve on right is MSE_Te, grey curve is MSE_Tr. Orange, blue and green curves/squares correspond to fits of different flexibility.
[Figure: a second example, same layout as before.]

Here the truth is smoother, so the smoother fit and linear model do really well.
[Figure: a third example, same layout as before.]

Here the truth is wiggly and the noise is low, so the more flexible fits do the best.
Bias-Variance Trade-off

Suppose we have fit a model f̂(x) to some training data Tr, and let (x0, y0) be a test observation drawn from the population. If the true model is Y = f(X) + ϵ (with f(x) = E(Y | X = x)), then

E[(y0 − f̂(x0))²] = Var(f̂(x0)) + [Bias(f̂(x0))]² + Var(ϵ).

The expectation averages over the variability of y0 as well as the variability in Tr. Note that Bias(f̂(x0)) = E[f̂(x0)] − f(x0).

Typically as the flexibility of f̂ increases, its variance increases and its bias decreases. So choosing the flexibility based on average test error amounts to a bias-variance trade-off.
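A simulation sketch of the decomposition at a single point x0, refitting the same model on many synthetic training sets (the true function, noise level, sample size and cubic fit are arbitrary choices for illustration):

```python
import numpy as np

rng = np.random.default_rng(6)
f = lambda x: np.sin(x)
sigma, x0, n, deg = 0.3, 2.0, 30, 3      # noise sd, test point, train size, flexibility

preds = []
for _ in range(2000):                     # many independent training sets Tr
    x = rng.uniform(0, 6, n)
    y = f(x) + sigma * rng.standard_normal(n)
    preds.append(np.polynomial.Polynomial.fit(x, y, deg)(x0))
preds = np.array(preds)

y0 = f(x0) + sigma * rng.standard_normal(preds.size)    # test responses at x0
lhs = np.mean((y0 - preds) ** 2)                         # E[(y0 - fhat(x0))^2]
rhs = preds.var() + (preds.mean() - f(x0)) ** 2 + sigma ** 2
print(lhs, rhs)                           # the two estimates should be close
```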
Bias-variance trade-off for the three examples

[Figure: for each of the three examples, test MSE, squared bias, and variance plotted against flexibility.]
Classification Problems

Here the response variable Y is qualitative — e.g. email is one of C = (spam, ham) (ham = good email), digit class is one of C = {0, 1, . . . , 9}. Our goals are to:
• Build a classifier C(X) that assigns a class label from C to a future unlabeled observation X.
• Assess the uncertainty in each classification.
• Understand the roles of the different predictors among X = (X1, X2, . . . , Xp).
[Figure: simulated two-class data shown as tick marks along x; a small barplot at x = 5 shows the conditional class probabilities there.]

Is there an ideal C(X)? Suppose the K elements in C are numbered 1, 2, . . . , K. Let

pk(x) = Pr(Y = k | X = x),  k = 1, 2, . . . , K.

These are the conditional class probabilities at x; e.g. see the little barplot at x = 5. Then the Bayes optimal classifier at x is

C(x) = j  if  pj(x) = max{p1(x), p2(x), . . . , pK(x)}.
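A minimal sketch of the Bayes classifier for a synthetic two-class problem in which the class-conditional densities are known Gaussians, so that pk(x) can be computed exactly via Bayes' rule:

```python
import numpy as np
from scipy.stats import norm

prior = {1: 0.5, 2: 0.5}                       # class priors Pr(Y = k)
dens = {1: norm(loc=2.0, scale=1.0),           # class-conditional densities of X
        2: norm(loc=4.0, scale=1.0)}

def p_k(x, k):
    """Conditional class probability pk(x) = Pr(Y = k | X = x)."""
    joint = {j: prior[j] * dens[j].pdf(x) for j in prior}
    return joint[k] / sum(joint.values())

def bayes_classifier(x):
    return max(prior, key=lambda k: p_k(x, k))

print(p_k(5.0, 2), bayes_classifier(5.0))      # class 2 dominates at x = 5
```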
[Figure: the same style of two-class data with nearest-neighbor estimates of the conditional class probabilities.]

Nearest-neighbor averaging can be used as before. It also breaks down as the dimension grows. However, the impact on Ĉ(x) is less than on p̂k(x), k = 1, . . . , K.
Classification: some details

Bayes classifier: in a 2-class problem, assign Class 1 if Pr(Y = 1 | X = x) > 0.5, else Class 2.

Bayes decision boundary

Bayes error rate

K-nearest neighbors
Classification: some details

• Typically we measure the performance of Ĉ(x) using the misclassification error rate:

  Err_Te = Ave_{i∈Te} I[yi ≠ Ĉ(xi)]

• The Bayes classifier (using the true pk(x)) has smallest error (in the population).
• Support-vector machines build structured models for C(x).
• We will also build structured models for representing the pk(x), e.g. logistic regression and generalized additive models.
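The error rate is just an average of 0/1 indicators; a tiny sketch on toy labels:

```python
import numpy as np

y_te  = np.array([1, 2, 1, 1, 2, 2, 1])    # true labels on a toy test set
y_hat = np.array([1, 2, 2, 1, 2, 1, 1])    # predictions from some classifier
err_te = np.mean(y_te != y_hat)            # Ave over Te of I[yi != C_hat(xi)]
print(err_te)                              # 2/7 ≈ 0.286
```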
Example: K-nearest neighbors in two dimensions
[Figure: simulated two-class training data plotted in two dimensions, X1 and X2.]
KNN: K = 10

[Figure: the K = 10 nearest-neighbor decision boundary overlaid on the training data.]
KNN: K = 1 and K = 100

[Figure: the K = 1 (left) and K = 100 (right) nearest-neighbor decision boundaries on the same data.]
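A minimal sketch of the same comparison on synthetic two-dimensional two-class data, using scikit-learn's KNN classifier and reporting training and test error rates:

```python
import numpy as np
from sklearn.neighbors import KNeighborsClassifier

rng = np.random.default_rng(7)
def make_data(n):
    X = rng.standard_normal((n, 2))
    y = (X[:, 0] + X[:, 1] + 0.7 * rng.standard_normal(n) > 0).astype(int)
    return X, y

X_tr, y_tr = make_data(200)
X_te, y_te = make_data(1000)

for k in (1, 10, 100):
    knn = KNeighborsClassifier(n_neighbors=k).fit(X_tr, y_tr)
    err_tr = np.mean(knn.predict(X_tr) != y_tr)
    err_te = np.mean(knn.predict(X_te) != y_te)
    print(f"K = {k:3d}: training error = {err_tr:.3f}, test error = {err_te:.3f}")
```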
[Figure: training and test error rates for KNN plotted against 1/K.]
