
Overview of Supervised Learning
Outline
• Linear Regression and the Nearest-Neighbor Method
• Statistical Decision Theory
• Local Methods in High Dimensions
• Statistical Models, Supervised Learning and Function Approximation
• Structured Regression Models
• Classes of Restricted Estimators
• Model Selection and the Bias-Variance Tradeoff

Notation
• X: inputs, feature vector, predictors, independent variables. Generally X is a vector of p values. Qualitative features are coded in X.
  – Sample values of X are generally written in lower case; $x_i$ is the i-th of N sample values.

• Y: output, response, dependent variable. Typically a scalar (it can be a vector) of real values. Again, $y_i$ is a realized value.

• G: a qualitative response, taking values in a discrete set $\mathcal{G}$; e.g. $\mathcal{G} = \{\text{survived}, \text{died}\}$. We often code G via a binary indicator response vector Y.

Problem
• 200 points generated in $\mathbb{R}^2$ from an unknown distribution; 100 in each of two classes $\mathcal{G} = \{\text{GREEN}, \text{RED}\}$.

• Can we build a rule to predict the color of future points?
Linear regression
• Code Y=1 if G=RED, else Y=0.
• We model Y as a linear function of X:

  $Y \approx \beta_0 + \sum_{j=1}^{p} X_j \beta_j = X^T\beta$

• Obtain $\beta$ by least squares, minimizing the quadratic criterion

  $RSS(\beta) = \sum_{i=1}^{N} (y_i - x_i^T\beta)^2$

• Given an $N \times p$ model matrix $\mathbf{X}$ and a response vector $\mathbf{y}$,

  $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

Linear regression
• Prediction at a future point $x_0$ is $\hat{Y}(x_0) = x_0^T\hat\beta$. Also

  $\hat{G}(x_0) = \text{RED}$ if $\hat{Y}(x_0) > 0.5$, and $\text{GREEN}$ if $\hat{Y}(x_0) \le 0.5$.

• The decision boundary $\{x : x^T\hat\beta = 0.5\}$ is linear (Figure 2.1), and it seems to make many errors on the training data.

Linear regression
• Figure 2.1: A classification example in two dimensions. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by linear regression. The line is the decision boundary defined by $x^T\hat\beta = 0.5$.

• The red shaded region denotes the part of input space classified as RED, while the green region is classified as GREEN.
Possible scenarios
Scenario 1: The data in each class are generated from a Gaussian distribution with uncorrelated components, equal variances, and different means.

Scenario 2: The data in each class are generated from a mixture of 10 Gaussians.

For Scenario 1, the linear regression rule is almost optimal (Chapter 4).

For Scenario 2, it is far too rigid.

K-Nearest Neighbors
A natural way to classify a new point is to look at its neighbors and take a vote:

  $\hat{Y}_k(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i$,

where $N_k(x)$ is a neighborhood of $x$ that contains exactly $k$ training points (the k-nearest neighborhood).

If one class clearly dominates in the neighborhood of an observation $x$, then the observation itself is likely to belong to that class too. Thus the classification rule is majority voting among the members of $N_k(x)$. As before,

  $\hat{G}_k(x_0) = \text{RED}$ if $\hat{Y}_k(x_0) > 0.5$, and $\text{GREEN}$ if $\hat{Y}_k(x_0) \le 0.5$.
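A minimal Python sketch of this k-NN vote (NumPy assumed; array names and the default k = 15 are illustrative):

```python
import numpy as np

def knn_predict(X_train, y_train, x0, k=15):
    """k-NN estimate at x0: average of the y-values of the k nearest training points."""
    dists = np.linalg.norm(X_train - x0, axis=1)   # Euclidean distances to x0
    nearest = np.argsort(dists)[:k]                # indices of the k closest points
    y_hat = y_train[nearest].mean()                # \hat{Y}_k(x0)
    return 1 if y_hat > 0.5 else 0                 # RED = 1 if the majority vote exceeds 1/2
```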

K-Nearest Neighbors
• Figure 2.2: The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then fit by 15-nearest-neighbor averaging.

• The predicted class is chosen by majority vote amongst the 15 nearest neighbors.

K-Nearest Neighbors
• Figure 2.3: The same classification example in two dimensions as in Figure 2.1. The classes are coded as a binary variable (GREEN = 0, RED = 1) and then predicted by 1-nearest-neighbor classification.

Linear regression vs. k-NN
First we expose the oracle. The density for each class was an equal mixture of 10 Gaussians. For the GREEN class, the 10 means were generated from a N((1,0)^T, I) distribution (and then considered fixed). For the RED class, the 10 means were generated from a N((0,1)^T, I) distribution. The within-cluster variances were 1/5.
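A hedged sketch of this generating mechanism in Python (NumPy assumed; the helper `make_class` and the seed are illustrative, not the authors' original code):

```python
import numpy as np

rng = np.random.default_rng(0)

def make_class(center, n=100, n_means=10, within_var=1/5):
    """Draw n_means cluster means from N(center, I), then n points from N(m_k, I/5)."""
    means = rng.multivariate_normal(center, np.eye(2), size=n_means)
    picks = rng.integers(0, n_means, size=n)       # choose one of the 10 means per point
    return means[picks] + rng.normal(scale=np.sqrt(within_var), size=(n, 2))

X_green = make_class([1.0, 0.0])   # GREEN means drawn from N((1,0)^T, I)
X_red   = make_class([0.0, 1.0])   # RED means drawn from N((0,1)^T, I)
X = np.vstack([X_green, X_red])
y = np.concatenate([np.zeros(100), np.ones(100)])   # GREEN = 0, RED = 1
```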

Linear regression vs. k-NN
• Figure 2.4: Misclassification curves for the simulation example above. A test sample of size 10,000 was used. The red curves are test error and the green curves are training error for k-NN classification. The results for linear regression are the larger green and red dots at three degrees of freedom. The purple line is the optimal Bayes error rate.

Other Methods
Many modern procedures are variants of linear regression and k-nearest neighbors:
• Kernel smoothers
• Local linear regression
• Linear basis expansions
• Projection pursuit and neural networks

Statistical decision theory
Case 1: Quantitative output Y

• Let $X \in \mathbb{R}^p$ denote a real-valued random input vector.
• We have a loss function $L(Y, f(X))$ for penalizing errors in prediction.
• Most common and convenient is squared error loss: $L(Y, f(X)) = (Y - f(X))^2$.
• This leads us to a criterion for choosing f:

  $EPE(f) = E(Y - f(X))^2$,

  the expected (squared) prediction error.
• Minimizing $EPE(f)$ leads to the solution $f(x) = E(Y \mid X = x)$, the conditional expectation, also known as the regression function.

The regression function

  $EPE(f) = E[Y - f(X)]^2$
  $\quad\; = \int (y - f(x))^2 \Pr(dx, dy)$
  $\quad\; = E_X E_{Y|X}\big([Y - f(X)]^2 \mid X\big)$

Minimizing EPE pointwise gives

  $f(x) = \arg\min_c E_{Y|X}\big([Y - c]^2 \mid X = x\big)$,

with solution

  $f(x) = E(Y \mid X = x)$.
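A quick Monte Carlo check of this result, sketched under an assumed toy model Y = X² + ε (so that the regression function is x²): the conditional mean attains a smaller estimated EPE than any other candidate, e.g. the identity.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 200_000
X = rng.uniform(-1, 1, size=n)
Y = X**2 + rng.normal(scale=0.3, size=n)          # here E(Y | X = x) = x^2

def epe(f):
    """Monte Carlo estimate of E(Y - f(X))^2."""
    return np.mean((Y - f(X))**2)

print(epe(lambda x: x**2))    # roughly sigma^2 = 0.09, the minimum
print(epe(lambda x: x))       # larger: the identity is not the regression function
```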
Case 2: Qualitative output G

• Suppose our prediction rule is $\hat{G}(X)$, and that $G$ and $\hat{G}(X)$ take values in $\mathcal{G}$, with $\mathrm{card}(\mathcal{G}) = K$.
• We have a different loss function for penalizing prediction errors: $L(k, l)$ is the price paid for classifying an observation belonging to class $G_k$ as $G_l$.
• Most often we use the 0-1 loss function, where all misclassifications are charged a single unit.
• The expected prediction error is

  $EPE = E\big[L(G, \hat{G}(X))\big]$

• The solution is

  $\hat{G}(x) = \arg\min_{g \in \mathcal{G}} \sum_{k=1}^{K} L(G_k, g)\, P(G_k \mid X = x)$

• k-NN tries to implement the conditional expectation directly, by
  – approximating expectations by sample averages, and
  – relaxing the notion of conditioning at a point to conditioning in a region about a point.
• As $N, k \to \infty$ such that $k/N \to 0$, the k-nearest-neighbor estimate $\hat{f}(x) \to E(Y \mid X = x)$; it is consistent.
• Linear regression assumes a (linear) structural form $f(x) = x^T\beta$ and minimizes the sample version of EPE directly.
• As the sample size grows, our estimate of the linear coefficients $\hat\beta$ converges to the optimal $\beta_{opt} = E(XX^T)^{-1} E(XY)$.
• The model is limited by the linearity assumption.

Question: Why not always use k-nearest neighbors?

Bayes Classifier
With the 0-1 loss function this simplifies to

  $\hat{G}(x) = G_k \ \text{ if } \ P(G_k \mid X = x) = \max_{g \in \mathcal{G}} P(g \mid X = x)$

This is known as the Bayes classifier. It simply says that we should pick the class having maximum probability at the input x.

Question: how do we construct the Bayes classifier for our simulation example?
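One possible answer, sketched in Python: since the oracle densities are known (equal mixtures of 10 Gaussians with within-cluster variance 1/5), evaluate each class density at x and pick the larger one, assuming equal class priors. The arrays `means_green` and `means_red` are assumed to hold the fixed mixture centers from the generating step; SciPy is assumed available.

```python
import numpy as np
from scipy.stats import multivariate_normal

def class_density(x, means, within_var=1/5):
    """Equal mixture of Gaussians N(m_k, (1/5) I) evaluated at x."""
    cov = within_var * np.eye(2)
    return np.mean([multivariate_normal.pdf(x, mean=m, cov=cov) for m in means])

def bayes_classify(x, means_green, means_red):
    """With equal class priors, pick the class whose density at x is larger."""
    return 1 if class_density(x, means_red) > class_density(x, means_green) else 0
```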

Bayes Classifier
• Figure 2.5: The optimal
Bayes decision
boundary for the
simulation example
above.
• Since the generating
density is known for
each class, this
boundary can be
calculated exactly.

Curse of dimensionality
k-nearest neighbors can fail in high dimensions, because it becomes difficult to gather k observations close to a target point $x_0$:

• near neighborhoods tend to be spatially large, and estimates are biased;
• reducing the spatial size of the neighborhood means reducing k, and the variance of the estimate increases;
• most points lie near the boundary of the sample;
• the sampling density is proportional to $N^{1/p}$; if 100 points suffice to estimate a function in $\mathbb{R}^1$, then $100^{10}$ are needed to achieve similar accuracy in $\mathbb{R}^{10}$.
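A small numerical illustration of the first point (a sketch, assuming NumPy): the distance from a target point at the origin to its nearest neighbor among N = 1000 uniform points grows quickly with the dimension p.

```python
import numpy as np

rng = np.random.default_rng(2)
N = 1000
for p in (1, 2, 5, 10, 20):
    X = rng.uniform(-1, 1, size=(N, p))         # N points uniform on [-1, 1]^p
    d_min = np.linalg.norm(X, axis=1).min()     # distance from the origin to its nearest neighbor
    print(p, round(d_min, 3))                   # grows quickly with p
```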

Example 1
• 1000 training examples $x_i$ generated uniformly on $[-1, 1]^p$.
• $Y = f(X) = e^{-8\|X\|^2}$ (no measurement error).
• Use the 1-nearest-neighbor rule to predict $\hat{y}_0$ at the test point $x_0 = 0$.

  $EPE(x_0) = E_T[f(x_0) - \hat{y}_0]^2$
  $\quad\; = E_T[\hat{y}_0 - E_T(\hat{y}_0)]^2 + [E_T(\hat{y}_0) - f(x_0)]^2$
  $\quad\; = \mathrm{Var}_T(\hat{y}_0) + \mathrm{Bias}^2(\hat{y}_0)$,

  where the cross term $2E_T\{[\hat{y}_0 - E_T(\hat{y}_0)][E_T(\hat{y}_0) - f(x_0)]\}$ vanishes.
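A Monte Carlo sketch of this setup (NumPy assumed; the sample sizes are illustrative): as p grows, the nearest neighbor drifts away from $x_0 = 0$, f at that neighbor collapses toward 0, and EPE($x_0$) approaches 1, with the bias term coming to dominate.

```python
import numpy as np

rng = np.random.default_rng(3)

def f(X):
    return np.exp(-8 * np.sum(X**2, axis=1))         # f(X) = exp(-8 ||X||^2)

def epe_at_origin(p, n_train=1000, n_reps=200):
    """Monte Carlo EPE(x0 = 0) for the 1-NN prediction over repeated training sets."""
    errs = []
    for _ in range(n_reps):
        X = rng.uniform(-1, 1, size=(n_train, p))
        y = f(X)                                      # no measurement error
        nearest = np.argmin(np.linalg.norm(X, axis=1))
        errs.append((1.0 - y[nearest])**2)            # f(0) = 1
    return np.mean(errs)

for p in (1, 2, 5, 10):
    print(p, round(epe_at_origin(p), 4))              # EPE approaches 1 as p grows
```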

Figure 2.8: A simulation example with the same setup as in Figure 2.7. Here the function is constant in all but one dimension: $f(X) = \tfrac{1}{2}(X_1 + 1)^3$. The variance dominates.

Linear Model
• Linear model: $Y = X^T\beta + \varepsilon$

• Linear regression: $\hat\beta = (\mathbf{X}^T\mathbf{X})^{-1}\mathbf{X}^T\mathbf{y}$

• At a test point $x_0$, the prediction is

  $\hat{y}_0 = x_0^T\hat\beta = x_0^T\beta + \sum_{i=1}^{N} \ell_i(x_0)\,\varepsilon_i$,

  where $\ell_i(x_0)$ is the i-th component of $\mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$.
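A small numerical check of this identity (a sketch with simulated Gaussian errors; the dimensions and coefficients are arbitrary): the fitted value at $x_0$ equals $x_0^T\beta$ plus a weighted sum of the training errors, with weights $\ell(x_0) = \mathbf{X}(\mathbf{X}^T\mathbf{X})^{-1}x_0$.

```python
import numpy as np

rng = np.random.default_rng(4)
N, p = 50, 3
X = rng.normal(size=(N, p))
beta = np.array([1.0, -2.0, 0.5])
eps = rng.normal(scale=0.3, size=N)
y = X @ beta + eps

beta_hat = np.linalg.solve(X.T @ X, X.T @ y)    # (X'X)^{-1} X'y
x0 = rng.normal(size=p)

ell = X @ np.linalg.solve(X.T @ X, x0)          # l_i(x0): i-th component of X (X'X)^{-1} x0
lhs = x0 @ beta_hat
rhs = x0 @ beta + ell @ eps
print(np.isclose(lhs, rhs))                     # True: y0_hat = x0'beta + sum_i l_i(x0) eps_i
```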

Curse of dimensionality
Example 2

If the linear model is correct, or almost correct, k-nearest neighbors will do much worse than linear regression.

In cases like this (and of course, assuming we know this is the case),
simple linear regression methods are not affected by the dimension.

Figure 2.9 illustrates two simple cases.

Here $f(X) = \tfrac{1}{2}(X_1 + 1)^3$, and we have
• $E\hat{f}(x_0) = E f(X_{(1)}) \approx f(E X_{(1)}) = f(x_0)$ for all dimensions;
• $\mathrm{Var}\,\hat{f}(x_0) = \mathrm{Var}\, f(X_{(1)})$ grows with dimension.

Figure 2.9: The curves show the expected prediction error (at $x_0 = 0$) for 1-nearest neighbor relative to least squares for the model $Y = f(X) + \varepsilon$. For the red curve, $f(x) = x_1$, while for the green curve $f(x) = \tfrac{1}{2}(x_1 + 1)^3$.
Statistical Models
$Y = f(X) + \varepsilon$

with $E(\varepsilon) = 0$ and $X$ and $\varepsilon$ independent.
• $E(Y \mid X) = f(X)$
• $\Pr(Y \mid X)$ depends on $X$ only through $f(X)$.
• A useful approximation to the truth: all unmeasured variables are captured by $\varepsilon$.
• $N$ realizations $y_i = f(x_i) + \varepsilon_i$, $i = 1, \ldots, N$.
• Assume $\varepsilon_i$ and $\varepsilon_j$ are independent.

More generally we can have, for example, $\mathrm{Var}(Y \mid X) = \sigma^2(X)$.

For qualitative outcomes, the conditional probabilities $\Pr(G = G_k \mid X) = p_k(X)$, $k = 1, \ldots, K$, are modeled directly.
Supervised Learning
• Given: training examples

  $\{(x_1, f(x_1)), (x_2, f(x_2)), \ldots, (x_P, f(x_P))\}$

  of some unknown function (system) $y = f(x)$.

• Find $\hat{f}(x)$ (i.e. an approximation), and predict $y' = \hat{f}(x')$, where $x'$ is not in the training set.

Two Types of Supervised Learning
• Classification: $y \in \{c_1, c_2, \ldots, c_N\}$
  – The model output is a prediction that the input belongs to some class.
  – If the input is an image, the output might be chair, face, dog, boat, etc.
• Regression: $y \in \mathbb{R}$
  – The output has infinitely many possible values.
  – If the input is stock features, the output could be a prediction of tomorrow's stock price.

Learning Classification Models

Learning Regression Models

Function Approximation
  $RSS(\theta) = \sum_{i=1}^{N} (y_i - f_\theta(x_i))^2$

Assumes:
• $(x_i, y_i)$ are points in, say, $\mathbb{R}^{p+1}$;
• a (parametric) form for $f_\theta(x)$;
• a loss function for measuring the quality of the approximation.

Figure 2.10 illustrates the situation.

Function Approximation
• Figure 2.10: Least
squares fitting of a
function of two inputs.
The parameters of
fθ(x) are chosen so as
to minimize the sum-
of-squared vertical
errors.

Function Approximation
• More generally, Maximum Likelihood Estimation
provides a natural basis for estimation.
• E.g. multinomial:

  $\Pr(G = k \mid X) = p_{k,\theta}(X)$,

  $L(\theta) = \sum_{i=1}^{N} \log p_{g_i,\theta}(x_i)$

Structured Regression Models
  $RSS(f) = \sum_{i=1}^{N} (y_i - f(x_i))^2$

• Any function passing through the points $(x_i, y_i)$ has $RSS = 0$.
• We need to restrict the class of functions.
• Usually the restrictions impose local behavior; see the equivalent kernels in Chapters 5 and 6.
• Any method that attempts to approximate locally varying functions is "cursed".
• Conversely, any method that "overcomes" the curse assumes an implicit metric that does not allow neighborhoods to be simultaneously small in all directions.

Classes of Restricted Estimators
Some of the classes of restricted methods that we cover are:

• Roughness penalty and Bayesian methods

  $PRSS(f, \lambda) = RSS(f) + \lambda J(f)$

• Kernel methods and local regression

  $RSS(f_\theta, x_0) = \sum_{i=1}^{N} K_\lambda(x_0, x_i)\,(y_i - f_\theta(x_i))^2$

• Basis functions and dictionary methods

  $f_\theta(x) = \sum_{m=1}^{M} \theta_m h_m(x)$
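A sketch of the kernel-weighted criterion above for one-dimensional inputs, taking $f_\theta$ to be locally linear and $K_\lambda$ to be a Gaussian kernel (both choices are assumptions for illustration):

```python
import numpy as np

def local_linear_fit(x0, x, y, lam=0.2):
    """Minimize sum_i K_lam(x0, x_i) (y_i - a - b x_i)^2 and return the fit a + b x0."""
    w = np.exp(-0.5 * ((x - x0) / lam) ** 2)             # Gaussian kernel weights K_lam(x0, x_i)
    B = np.column_stack([np.ones_like(x), x])            # local linear basis (1, x)
    W = np.diag(w)
    theta = np.linalg.solve(B.T @ W @ B, B.T @ W @ y)    # weighted least squares at x0
    return theta[0] + theta[1] * x0
```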

Model Selection & the Bias-Variance Tradeoff

Many of the flexible methods have a complexity parameter:

• the multiplier of the penalty term
• the width of the kernel
• the number of basis functions

We cannot use RSS to determine this parameter (why?). We can use prediction error on unseen test cases to guide us.

E.g. $Y = f(X) + \varepsilon$, k-NN (and assume the sample $x_i$ are fixed):

  $E\big[(Y - \hat{f}_k(x_0))^2 \mid X = x_0\big] = \sigma^2 + \mathrm{Bias}^2(\hat{f}_k(x_0)) + \mathrm{Var}_T(\hat{f}_k(x_0))$
  $\quad = \sigma^2 + \Big[f(x_0) - \frac{1}{k}\sum_{l=1}^{k} f(x_{(l)})\Big]^2 + \frac{\sigma^2}{k}$

Selecting k is a bias-variance tradeoff; see Figure 2.11.
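A sketch of this tradeoff for k-NN regression under an assumed one-dimensional model (NumPy assumed; sample sizes and noise level are illustrative): the test error is large for very small k (variance) and for very large k (bias), and smallest in between.

```python
import numpy as np

rng = np.random.default_rng(5)

def knn_reg(x_train, y_train, x_test, k):
    """k-NN regression: average the k nearest training responses for each test point."""
    d = np.abs(x_test[:, None] - x_train[None, :])    # pairwise |x_test - x_train|
    idx = np.argsort(d, axis=1)[:, :k]                # indices of the k nearest neighbors
    return y_train[idx].mean(axis=1)

def f(x):
    return np.sin(2 * np.pi * x)

x_tr = rng.uniform(0, 1, 200)
y_tr = f(x_tr) + rng.normal(scale=0.3, size=200)
x_te = rng.uniform(0, 1, 2000)
y_te = f(x_te) + rng.normal(scale=0.3, size=2000)

for k in (1, 5, 15, 50, 150):
    mse = np.mean((y_te - knn_reg(x_tr, y_tr, x_te, k)) ** 2)
    print(k, round(mse, 3))    # large at k=1 (variance) and k=150 (bias), smaller in between
```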
Model Selection & the Bias-Variance Tradeoff

• Test and training error as a function of model complexity.

• Page 27
• Ex. 2.1, 2.2, 2.6
