Lecture 1 - Overview of Supervised Learning
DD3364
March 9, 2012
Problem 1: Regression

[Figure: training points (x, y); given a new x, predict its y value.]
Problem 2: Classification
Is it a bike or a face?
Some Terminology

In machine learning we have outputs which are predicted from measured inputs.

Variable types
Outputs can be
- discrete (categorical, qualitative),
- continuous (quantitative), or
- ordered categorical (the order is important).
More Notation

The prediction of the output for a given value of the input vector X is denoted by $\hat{Y}$.

It is presumed that we have labelled training data. For regression problems

    T = \{(x_1, y_1), \ldots, (x_n, y_n)\}

with each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$, and for classification problems

    T = \{(x_1, g_1), \ldots, (x_n, g_n)\}

with each $x_i \in \mathbb{R}^p$ and $g_i \in \{1, \ldots, G\}$.
Linear Model

Have an input vector $X = (X_1, \ldots, X_p)^t$.

A linear model predicts the output Y as

    \hat{Y} = \hat\beta_0 + \sum_{j=1}^{p} X_j \hat\beta_j

Including the constant 1 in X, this can be written compactly as $\hat{Y} = X^t \hat\beta$.
Least squares

Estimate $\beta$ from the training data by minimizing the residual sum of squares

    RSS(\beta) = (y - X\beta)^t (y - X\beta)

where $X \in \mathbb{R}^{n \times p}$ is a matrix with each row being an input vector and $y = (y_1, \ldots, y_n)^t$.

The minimizing $\hat\beta$ is given by

    \hat\beta = (X^t X)^{-1} X^t y

if $X^t X$ is non-singular (otherwise the minimizer is not unique). This is easy to show by differentiation of $RSS(\beta)$.

This model has p + 1 parameters.
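As a minimal illustration (not from the lecture), here is a NumPy sketch of this fit on made-up synthetic data; the sizes, coefficients and noise level are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up synthetic data: n points in p dimensions (purely illustrative).
n, p = 50, 3
X = rng.normal(size=(n, p))
beta_true = np.array([1.0, -2.0, 0.5])
y = X @ beta_true + 0.1 * rng.normal(size=n)

# Prepend a column of ones so the first coefficient plays the role of beta_0;
# the model then has p + 1 parameters.
X1 = np.hstack([np.ones((n, 1)), X])

# Normal-equation solution beta_hat = (X^t X)^{-1} X^t y,
# valid when X^t X is non-singular.
beta_hat = np.linalg.solve(X1.T @ X1, X1.T @ y)

# np.linalg.lstsq gives the same answer and is numerically more robust.
beta_lstsq, *_ = np.linalg.lstsq(X1, y, rcond=None)
print(beta_hat, beta_lstsq)
```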
The linear model can also be used for binary classification by thresholding:

    \hat{G}(x) =
        0   if   x^t \hat\beta \le .5
        1   if   x^t \hat\beta > .5
Nearest-Neighbour Methods

[Figure: the linear fit can be too rigid; k = 1 nearest-neighbour fits shown for comparison.]

The k-nearest neighbour estimate for Y is

    \hat{Y}(x) = \frac{1}{k} \sum_{x_i \in N_k(x)} y_i

where $N_k(x)$ is the neighbourhood of x defined by its k closest training points.
Similarly, the k-nearest neighbour estimate for G is

    \hat{G}(x) =
        0   if   \frac{1}{k} \sum_{x_i \in N_k(x)} g_i \le .5
        1   otherwise
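A short sketch of both nearest-neighbour estimates; the function and variable names are my own, and distances are computed by brute force for clarity:

```python
import numpy as np

def knn_regress(x0, X, y, k):
    # Y_hat(x0): average of y_i over the k nearest neighbours N_k(x0).
    # X is an (n, p) array of inputs, y an (n,) array of outputs.
    dists = np.linalg.norm(X - x0, axis=1)
    nearest = np.argsort(dists)[:k]
    return y[nearest].mean()

def knn_classify(x0, X, g, k):
    # G_hat(x0): 0 if the average 0/1 label over N_k(x0) is <= .5, else 1.
    return int(knn_regress(x0, X, g.astype(float), k) > 0.5)
```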
[Figures: k-nearest-neighbour fits for k = 15 and k = 1, and error curves plotted against the model complexity log(n/k).]
More formally, choose f to minimize the expected squared prediction error

    EPE(f) = E[(Y - f(X))^2] = \int\!\!\int (y - f(x))^2\, p(x, y)\, dx\, dy

The solution is

    f(x) = E[Y \mid X = x]

This is known as the regression function.

Only one problem with this: one rarely knows the pdf p(Y | X).
Nearest-neighbour methods approximate this conditional expectation directly by local averaging.

[Figure: the neighbourhood N_k(x) of a query point x in the (x_1, x_2) input space.]

As n, k \to \infty with k/n \to 0,

    \frac{1}{k} \sum_{x_i \in N_k(x)} y_i \;\to\; E[Y \mid X = x]

that is, the accuracy of $\hat{y}$ increases.
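To see this convergence numerically, here is a small simulation under an assumed toy model Y = X + noise, for which E[Y | X = 0] = 0; both the model and the choice k = sqrt(n) are mine, purely for illustration:

```python
import numpy as np

rng = np.random.default_rng(0)

def knn_estimate_at_zero(n, k):
    # Toy model (assumed): Y = X + 0.5 * noise, so E[Y | X = 0] = 0.
    X = rng.uniform(-1.0, 1.0, size=n)
    y = X + 0.5 * rng.normal(size=n)
    nearest = np.argsort(np.abs(X))[:k]   # k nearest neighbours of x = 0
    return y[nearest].mean()

for n in [100, 10_000, 1_000_000]:
    k = int(n ** 0.5)                     # k grows, but k/n -> 0
    print(n, knn_estimate_at_zero(n, k))  # approaches E[Y | X = 0] = 0
```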
The Curse of Dimensionality

Set-up: uniformly sample points $X \in \mathbb{R}^p$ within the unit hypercube.
Question:
Let $k = r\,n$ where $r \in [0, 1]$ and $x = 0$. How long is the edge of the smallest hyper-cube around x that is expected to contain these k nearest neighbours?

Solution:
The expected edge length is $e_p(r) = r^{1/p}$. For p = 10,

    e_p(.01) = .63,    e_p(.1) = .80

so capturing even 1% of the data requires covering 63% of the range of each input.

[Figure: $e_p(r)$ plotted against r for p = 1, 2, 3 and 10.]
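A two-line numerical check of these figures, assuming the $r^{1/p}$ expression above:

```python
def edge_length(r, p):
    # Expected edge length e_p(r) = r**(1/p) of the sub-cube that captures
    # a fraction r of uniformly distributed points.
    return r ** (1.0 / p)

print(edge_length(0.01, 10))  # ~0.631: 1% of the data needs 63% of each axis
print(edge_length(0.10, 10))  # ~0.794: 10% of the data needs ~80% of each axis
```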
Question:
Let k = 1 and x = 0. What is the median distance of the nearest neighbour to x?

Solution:
This median distance is given by the expression

    d(p, n) = \left(1 - .5^{1/n}\right)^{1/p}

[Figure: d(p, n) plotted against the dimension p.]
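A quick evaluation of d(p, n). The choice n = 500 is an assumption (it matches the textbook example this slide appears to follow), so treat the printed values as illustrative:

```python
def median_nn_distance(p, n):
    # Median distance from the query point to its nearest neighbour,
    # d(p, n) = (1 - 0.5**(1/n))**(1/p).
    return (1.0 - 0.5 ** (1.0 / n)) ** (1.0 / p)

print(median_nn_distance(2, 500))   # ~0.04: the nearest neighbour is very close
print(median_nn_distance(10, 500))  # ~0.52: the "nearest" neighbour is far away
```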
This is the curse of dimensionality: in a high-dimensional input problem the nearest neighbours are no longer nearby, so local methods would need an impractically large number of training inputs.
Simulated Example

The Set-up
- The output is a deterministic function of the input:

      Y = f(X) = e^{-8\|X\|^2}

- Training inputs $x_1, \ldots, x_n$ are sampled uniformly from $[-1, 1]^p$.
- Use the 1-nearest-neighbour rule to estimate $y_0 = f(x_0)$ at the point $x_0 = 0$.

[Figure: the curve $e^{-8x^2}$ for p = 1, n = 20, with the nearest training point $x_{(1)}$ to $x_0$ and the resulting estimate $\hat{y}_0$.]

[Figure: histogram (frequency) of the 1-nn estimate of $y_0$ over repeated draws of the training set, p = 1.]
The same experiment for p = 2:

    Y = f(X) = e^{-8\|X\|^2}

[Figure: the function over $(x_1, x_2)$, with the nearest training point $x_{(1)}$ to $x_0$.]

[Figures: histograms (frequency) of the 1-nn estimate of $y_0$ over repeated draws of the training set as p grows.]
As p increases

[Figure: the average distance to the nearest neighbour increases with p, while the average value of the 1-nn estimate $\hat{y}_0$ decreases (plotted for p = 2, ..., 10).]
Bias-Variance Decomposition

For the simulation experiment we have a completely deterministic relationship:

    Y = f(X) = e^{-8\|X\|^2}

The mean squared error of $\hat{y}_0$ over training sets T decomposes as

    MSE(x_0) = E_T[(f(x_0) - \hat{y}_0)^2]
             = E_T[(\hat{y}_0 - E_T[\hat{y}_0])^2] + (E_T[\hat{y}_0] - f(x_0))^2
             = Var_T(\hat{y}_0) + Bias^2(\hat{y}_0)

[Figure: MSE, squared bias and variance of the 1-nn estimate of $y_0$ plotted against p.]
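The decomposition can be checked by simulation. The sketch below assumes training inputs uniform on [-1, 1]^p with n = 20 points; the number of repetitions (1000) is my own choice:

```python
import numpy as np

rng = np.random.default_rng(0)

def f(x):
    # Deterministic target f(x) = exp(-8 * ||x||^2), so f(0) = 1.
    return np.exp(-8.0 * np.sum(x ** 2, axis=-1))

def one_nn_estimate(p, n=20):
    # 1-nearest-neighbour estimate of y0 = f(x0) at x0 = 0.
    X = rng.uniform(-1.0, 1.0, size=(n, p))
    nearest = np.argmin(np.sum(X ** 2, axis=1))
    return f(X[nearest])

y0 = 1.0
for p in (1, 2, 5, 10):
    est = np.array([one_nn_estimate(p) for _ in range(1000)])
    bias2 = (est.mean() - y0) ** 2
    var = est.var()
    print(f"p={p:2d}  MSE={bias2 + var:.3f}  Bias^2={bias2:.3f}  Var={var:.3f}")
```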
The Set-up

Now suppose instead that the relationship is

    Y = f(X) = \tfrac{1}{2}(X_1 + 1)^3

so the function depends only on the first coordinate of X.

[Figure: the curve $\tfrac{1}{2}(x + 1)^3$ with the target $y_0$ at $x_0$ and the nearest training point $x_{(1)}$.]

[Figure: squared bias, variance and MSE of the 1-nn estimate of $y_0$ plotted against p.]
Case 1

    Y = .5(X_1 + 1)^3 + \epsilon,    \epsilon \sim N(0, 1)

[Figure: samples of Y plotted against $x_1$ for Case 1.]

Case 2

    Y = X_1 + \epsilon,    \epsilon \sim N(0, 1)

[Figure: samples of Y plotted against $x_1$ for Case 2.]

[Figures: prediction error plotted against the dimension p (2 to 10) for each case.]
Words of Caution
Statistical Models, Supervised Learning and Function Approximation
Goal

Know there is a function f(x) relating inputs to outputs:

    Y \approx f(X)

Want to find an estimate $\hat{f}(x)$ of f(x) from labelled training data.

[Figure: an example of such a function f(x).]
Assume an additive error model

    Y = f(X) + \epsilon

where Y is the output, f(X) is the deterministic relationship, and $\epsilon$ is a random variable independent of the input X with $E[\epsilon] = 0$.

Then $f(x) = E[Y \mid X = x]$, and any departures from the deterministic relationship are mopped up by $\epsilon$.
[Figure: data generated from Y = f(X) + \epsilon scattered around the curve f(x).]

It is presumed we have labelled training data

    T = \{(x_1, y_1), \ldots, (x_n, y_n)\}

where each $x_i \in \mathbb{R}^p$ and $y_i \in \mathbb{R}$.

[Figure: the training points $(x_i, y_i)$.]

Estimating f from T can then be treated as a problem of function approximation.

Common approach
[ESL Figure 2.10: least squares fitting of a function of two inputs.]
Assume f is parameterized by $\theta$, for instance as a linear basis expansion

    f_\theta(x) = \sum_{m=1}^{M} h_m(x)\, \theta_m

Estimate $\theta$ either by minimizing the residual sum of squares

    RSS(\theta) = \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2

or, more generally, by maximizing the log-likelihood

    L(\theta) = \sum_{i=1}^{n} \log \Pr_\theta(y_i)
Assume the conditional distribution is Gaussian:

    P(Y \mid X, \theta) = N(f_\theta(X), \sigma^2)

The log-likelihood of the training data is then

    L(\theta) = \sum_{i=1}^{n} \log P(Y = y_i \mid X = x_i, \theta)
              = \sum_{i=1}^{n} \log N(y_i; f_\theta(x_i), \sigma^2)
              = -\frac{1}{2\sigma^2} \sum_{i=1}^{n} (y_i - f_\theta(x_i))^2 + \text{const}

so maximizing $L(\theta)$ is equivalent to minimizing $RSS(\theta)$.
However, if we minimize

    RSS(f) = \sum_{i=1}^{n} (y_i - f(x_i))^2

over all functions f, then any function with $RSS(\hat{f}) = 0$, i.e. any function passing through the training points, is a solution.
We must therefore restrict the class of functions considered; the nature of the estimate is then determined by the choice of constraint. For example, we could restrict f to be quadratic:

    f(x) = x^t \Theta\, x + \theta_1^t x + \theta_0
Note:
It is assumed we have training examples $\{(x_i, y_i)\}_{i=1}^{n}$, and we present the energy functions or functionals which are minimized.

Roughness penalty approaches

Minimize the penalized residual sum of squares

    PRSS(f, \lambda) = \sum_{i=1}^{n} (y_i - f(x_i))^2 + \lambda\, J(f)

where, for example,

    J(f) = \int [f''(x)]^2\, dx
For wiggly f's this functional will have a large value, while for linear f's it is zero.
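As an illustration of a penalized fit, one can minimize a discrete version of PRSS in closed form by approximating J(f) with squared second differences of the fitted values; the data and the λ values below are made up:

```python
import numpy as np

rng = np.random.default_rng(0)

# Made-up 1-D data: noisy samples of a smooth curve.
x = np.linspace(0.0, 1.0, 50)
y = np.sin(2.0 * np.pi * x) + 0.2 * rng.normal(size=x.size)

# Represent f by its values f_i = f(x_i) and approximate J(f) by ||D f||^2,
# where D is the (n-2) x n matrix of second differences.
n = x.size
D = np.diff(np.eye(n), n=2, axis=0)

def penalized_fit(lam):
    # Minimize sum_i (y_i - f_i)^2 + lam * ||D f||^2, i.e. solve
    # (I + lam * D^t D) f = y.
    return np.linalg.solve(np.eye(n) + lam * D.T @ D, y)

f_wiggly = penalized_fit(0.1)   # small lambda: follows the data closely
f_smooth = penalized_fit(1e4)   # large lambda: close to a straight line
```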
Regularization methods express our prior belief that the f we seek exhibits a certain type of smooth behaviour.

Kernel methods and local regression

Estimate f by fitting a simple model separately within each local neighbourhood.
Need to specify
- the nature of the local neighbourhood
- the class of functions used in the local fit
For example, fit $f_\theta$ locally around a point $x_0$ by minimizing the kernel-weighted residual sum of squares

    RSS(f_\theta, x_0) = \sum_{i=1}^{n} K_\lambda(x_0, x_i)\, (y_i - f_\theta(x_i))^2

where the kernel function $K_\lambda(x_0, x_i)$ assigns a weight to $x_i$ depending on its closeness to $x_0$, for instance the Gaussian kernel

    K_\lambda(x_0, x) = \frac{1}{\lambda} \exp\!\left(-\frac{\|x_0 - x\|^2}{2\lambda}\right)
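The simplest instance is a locally constant fit, which with these weights reduces to a kernel-weighted average of the y_i; a minimal sketch with names of my own choosing:

```python
import numpy as np

def gaussian_kernel(x0, X, lam):
    # K_lambda(x0, x) = (1/lambda) * exp(-||x0 - x||^2 / (2 * lambda)),
    # evaluated for every row of the (n, p) input array X.
    return np.exp(-np.sum((X - x0) ** 2, axis=-1) / (2.0 * lam)) / lam

def local_constant_fit(x0, X, y, lam):
    # Fitting a constant around x0 with kernel weights gives the
    # kernel-weighted average of the y_i.
    w = gaussian_kernel(x0, X, lam)
    return np.sum(w * y) / np.sum(w)
```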
Basis functions

Represent f as a linear expansion of basis functions:

    f_\theta(x) = \sum_{m=1}^{M} \theta_m\, h_m(x)

Radial basis functions use kernels as the basis:

    f_\theta(x) = \sum_{m=1}^{M} K_{\lambda_m}(\mu_m, x)\, \theta_m

where
- $K_{\lambda_m}(\mu_m, x)$ is a symmetric kernel centred at location $\mu_m$
- the Gaussian kernel is a popular kernel to use.
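With the centres μ_m and a common scale λ held fixed, the expansion is linear in θ and can be fitted by ordinary least squares. In the sketch below the 1-D data, the number of centres and the scale are arbitrary illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)

# Arbitrary 1-D data for illustration.
x = rng.uniform(-1.0, 1.0, size=(100, 1))
y = np.sin(3.0 * x[:, 0]) + 0.1 * rng.normal(size=100)

mu = np.linspace(-1.0, 1.0, 10).reshape(1, -1)   # M = 10 fixed centres
lam = 0.05                                       # fixed kernel scale

# Design matrix H[i, m] = Gaussian kernel centred at mu_m evaluated at x_i
# (the 1/lambda normalisation is dropped since it is absorbed by theta).
H = np.exp(-((x - mu) ** 2) / (2.0 * lam))

theta, *_ = np.linalg.lstsq(H, y, rcond=None)    # least-squares estimate of theta
f_hat = H @ theta                                # fitted values at the x_i
```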
Estimating the centres $\mu_m$ and scales $\lambda_m$ along with $\theta$ is in general a hard non-linear problem.
A single-layer feed-forward neural network model is

    f_\theta(x) = \sum_{m=1}^{M} \beta_m\, \sigma(\alpha_m^t x + b_m)

where
- $\theta = (\alpha_1, \ldots, \alpha_M, \beta_1, \ldots, \beta_M, b_1, \ldots, b_M)^t$
- $\sigma(z) = 1/(1 + e^{-z})$ is the activation function
- the directions $\alpha_m$ and bias terms $b_m$ have to be determined from the training data.
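A sketch of the forward pass of this model; the parameters below are random placeholders, since in practice the α_m, b_m and β_m would be fitted to the training data (typically by gradient-based optimisation):

```python
import numpy as np

def sigma(z):
    # Logistic activation sigma(z) = 1 / (1 + exp(-z)).
    return 1.0 / (1.0 + np.exp(-z))

def neural_net(x, alpha, b, beta):
    # f_theta(x) = sum_m beta_m * sigma(alpha_m^t x + b_m)
    # alpha: (M, p) directions, b: (M,) biases, beta: (M,) output weights.
    return beta @ sigma(alpha @ x + b)

rng = np.random.default_rng(0)
M, p = 5, 3
alpha = rng.normal(size=(M, p))
b = rng.normal(size=M)
beta = rng.normal(size=M)
print(neural_net(np.ones(p), alpha, b, beta))
```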
Dictionary methods

Adaptively chosen basis function methods, aka dictionary methods: one has a (possibly infinite) dictionary of candidate basis functions, and the model is built up by employing some kind of search mechanism.
Model Selection and the Bias-Variance Tradeoff

[Figure: a single fit $\hat{f}(x)$ compared with the true f(x), and the average fit $E[\hat{f}(x)]$, for a smoother with $\lambda = .1$.]

[Figure: the corresponding plots for a nearest-neighbour fit with k = 1.]

[Figure: the average fits $E[\hat{f}(x)]$ for a high-complexity model (k = 1) and a lower-complexity model (k = 15).]
Why??

Training error decreases when model complexity increases.

Overfitting

[ESL Figure 2.11: test and training error as a function of model complexity. From its caption: the variance term is simply the variance of an average and decreases as the inverse of k, so as k varies there is a bias-variance tradeoff; more generally, as the model complexity of our procedure is increased the variance tends to increase and the squared bias tends to decrease, and the opposite behaviour occurs as the model complexity is decreased. For k-nearest neighbours, the model complexity is controlled by k.]

Have a high variance predictor ⟹ anything can happen.
This scenario is termed overfitting. In such cases the predictor loses the ability to generalize.
Underfitting

[ESL Figure 2.11 again: test and training error as a function of model complexity.]

Later on in the course we will discuss how to overcome these problems.