414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
[email protected]
http://www.cs.toronto.edu/~rsalakhu/

Lecture 3 (2015)
Parametric Distributions

• We want to model the probability distribution of a random variable x given a finite set of observations:

• We will also assume that the data points are i.i.d.
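As a reminder of what the i.i.d. assumption buys us, the likelihood factorizes over the observations (writing the parameters generically as θ, which is an assumed notation here):

    p(x_1, \ldots, x_N \mid \theta) = \prod_{n=1}^{N} p(x_n \mid \theta)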
• Basis functions are global: small changes in x affect all basis functions.
• Basis functions are local: small changes in x only affect nearby basis functions.
• µj and s control location and scale (width).
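A minimal sketch of local (Gaussian) and sigmoidal basis functions with the centre µj and scale s above; the function names and test values are illustrative, not from the slides:

    import numpy as np

    def gaussian_basis(x, mu_j, s):
        # Local basis function: small changes in x only affect nearby basis functions.
        return np.exp(-(x - mu_j) ** 2 / (2.0 * s ** 2))

    def sigmoidal_basis(x, mu_j, s):
        # Sigmoidal basis function built from the logistic sigmoid of (x - mu_j) / s.
        return 1.0 / (1.0 + np.exp(-(x - mu_j) / s))

    x = np.linspace(-1.0, 1.0, 5)
    print(gaussian_basis(x, mu_j=0.0, s=0.2))
    print(sigmoidal_basis(x, mu_j=0.0, s=0.2))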
Linear Basis Function Models

Sigmoidal basis functions:
• Decision boundaries will be linear in the feature space, but would correspond to nonlinear boundaries in the original input space x.

• Classes that are linearly separable in the feature space need not be linearly separable in the original input space.
Linear Basis Function Models

[Figure: original input space (left) and the corresponding feature space using two Gaussian basis functions (right).]

• We define two Gaussian basis functions with centers shown by the green crosses, and with contours shown by the green circles.
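A minimal sketch of the feature-space mapping this figure illustrates, assuming two Gaussian basis functions; the centres, width, and test points are illustrative:

    import numpy as np

    def gaussian_basis_2d(x, centre, s):
        # Isotropic 2-D Gaussian basis function with the given centre and width s.
        return np.exp(-np.sum((x - centre) ** 2, axis=-1) / (2.0 * s ** 2))

    # Illustrative centres (the "green crosses") and width; not taken from the slide.
    centres = np.array([[-0.5, 0.0], [0.5, 0.0]])
    s = 0.3

    x = np.array([[0.1, -0.2], [0.8, 0.4]])   # two points in the original 2-D input space
    phi = np.stack([gaussian_basis_2d(x, c, s) for c in centres], axis=-1)
    print(phi)                                # each row is (phi_1(x), phi_2(x)) in feature space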
Maximum Likelihood

• Taking the logarithm, we obtain:

where
• Define:

• The sum-of-squares error is equal to the squared Euclidean distance between y and t (up to a factor of 1/2). The solution corresponds to the orthogonal projection of t onto the subspace S.
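For reference, the usual sum-of-squares error and its maximum-likelihood minimizer; the slide's own equations are not reproduced in this extract, so this is a sketch of the standard result, with Φ the design matrix whose rows are φ(x_n)^T:

    E_D(\mathbf{w}) = \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2,
    \qquad
    \mathbf{w}_{\mathrm{ML}} = \bigl( \boldsymbol{\Phi}^{\top} \boldsymbol{\Phi} \bigr)^{-1} \boldsymbol{\Phi}^{\top} \mathbf{t}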
Sequential Learning

• The training data examples are presented one at a time, and the model parameters are updated after each such presentation (online learning):

• Stochastic gradient descent: if the training examples are picked at random (dominant technique when learning with very large datasets).

• Care must be taken when choosing the learning rate to ensure convergence.
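A minimal sketch of this sequential (stochastic gradient descent) update for a linear basis function model with squared error; the feature map, learning rate, and toy data are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)

    def phi(x):
        # Illustrative feature map: (1, x, x^2).
        return np.array([1.0, x, x * x])

    # Toy data: t = sin(2*pi*x) + noise.
    X = rng.uniform(0.0, 1.0, size=100)
    T = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(100)

    w = np.zeros(3)
    eta = 0.1                             # learning rate; must be small enough for convergence
    for epoch in range(50):
        for n in rng.permutation(len(X)): # present examples one at a time, in random order
            f = phi(X[n])
            w += eta * (T[n] - w @ f) * f # w <- w + eta * (t_n - w^T phi(x_n)) * phi(x_n)

    print(w)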
Regularized Least Squares

• Let us consider the following error function:

    Data term + Regularization term

λ is called the regularization coefficient.

• Which is minimized by setting (ridge regression):

• The solution adds a positive constant to the diagonal of Φ^T Φ. This makes the problem nonsingular, even if Φ^T Φ is not of full rank (e.g. when the number of training examples is less than the number of basis functions).
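A minimal sketch of the ridge solution, using the closed form w = (λI + Φ^T Φ)^{-1} Φ^T t with an illustrative Gaussian design matrix and toy data:

    import numpy as np

    rng = np.random.default_rng(1)

    # Toy data and an illustrative design matrix of Gaussian basis functions on [0, 1].
    X = rng.uniform(0.0, 1.0, size=30)
    T = np.sin(2 * np.pi * X) + 0.1 * rng.standard_normal(30)
    centres = np.linspace(0.0, 1.0, 9)
    Phi = np.exp(-(X[:, None] - centres[None, :]) ** 2 / (2.0 * 0.1 ** 2))

    lam = 1e-3                                    # regularization coefficient lambda
    A = lam * np.eye(Phi.shape[1]) + Phi.T @ Phi  # lambda added to the diagonal of Phi^T Phi
    w_ridge = np.linalg.solve(A, Phi.T @ T)
    print(w_ridge)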
Effect of Regularization

• The overall error function is the sum of two parabolic bowls.
The Lasso

[Figure: contours of the lasso and quadratic regularizers.]
• Penalize the absolute value of the weights:

• For sufficiently large λ, some of the coefficients will be driven to exactly zero, leading to a sparse model.

• The above formulation is equivalent to:
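For reference, a sketch of the penalized form and the equivalent constrained form being referred to (the 1/2 factors follow Bishop's convention and are an assumption here):

    \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2
      + \frac{\lambda}{2} \sum_{j=1}^{M} |w_j|
    \qquad \Longleftrightarrow \qquad
    \min_{\mathbf{w}} \; \frac{1}{2} \sum_{n=1}^{N} \bigl( t_n - \mathbf{w}^{\top} \boldsymbol{\phi}(\mathbf{x}_n) \bigr)^2
      \;\; \text{subject to} \;\; \sum_{j=1}^{M} |w_j| \le \eta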
• Our goal is to predict the target t given a new value for x:
  - For regression: t is a real-valued continuous target.
  - For classification: t is a categorical variable representing class labels.
• We could compute the conditional probabilities of the two classes, given the input image:

Bayes' Rule
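A sketch of the rule being invoked, written for classes C_k and evidence p(x) (the notation is assumed):

    p(C_k \mid \mathbf{x}) = \frac{p(\mathbf{x} \mid C_k)\, p(C_k)}{p(\mathbf{x})},
    \qquad
    p(\mathbf{x}) = \sum_{k} p(\mathbf{x} \mid C_k)\, p(C_k)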
• If our goal is to minimize the probability of assigning x to the wrong class, then we should choose the class having the highest posterior probability.
Minimizing Misclassification Rate

Goal: Make as few misclassifications as possible. We need a rule that assigns each value of x to one of the available classes.
[Loss matrix: rows and columns indexed by the truth and the decision.]

Expected Loss:

Goal is to choose the decision regions so as to minimize the expected loss.
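A sketch of the expected loss being minimized, assuming decision regions R_j and a loss matrix with entries L_{kj} (the loss incurred when the truth is class k and the decision is class j):

    \mathbb{E}[L] = \sum_{k} \sum_{j} \int_{\mathcal{R}_j} L_{kj}\, p(\mathbf{x}, C_k)\, d\mathbf{x}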
Reject Option
Regression

• Let x ∈ R^d denote a real-valued input vector, and t ∈ R denote a real-valued random target (output) variable with joint distribution

• The decision step consists of finding an estimate y(x) of t for each input x.
• Similar to the classification case, to quantify what it means to do well or poorly on a task, we need to define a loss (error) function:

• Our goal is to choose y(x) so as to minimize the expected squared loss.
• The optimal solution (if we assume a completely flexible function) is the conditional average:

• Taking the expectation, the last term vanishes, so we get:
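For reference, a sketch of the standard results in play here: the expected squared loss, its minimizer (the conditional average), and the decomposition obtained once the cross term drops out:

    \mathbb{E}[L] = \iint \bigl( y(\mathbf{x}) - t \bigr)^2 p(\mathbf{x}, t)\, d\mathbf{x}\, dt,
    \qquad
    y^{*}(\mathbf{x}) = \mathbb{E}[t \mid \mathbf{x}]

    \mathbb{E}[L] = \int \bigl( y(\mathbf{x}) - \mathbb{E}[t \mid \mathbf{x}] \bigr)^2 p(\mathbf{x})\, d\mathbf{x}
      + \int \operatorname{var}[t \mid \mathbf{x}]\, p(\mathbf{x})\, d\mathbf{x}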
Bias-Variance Trade-off

• Bias: average predictions over all datasets differ from the optimal regression function.
• Variance: solutions for individual datasets vary around their averages -- how sensitive is the function to the particular choice of the dataset.
• Noise: intrinsic variability of the target values.
• Trade-off between bias and variance: with very flexible models (high complexity) we have low bias and high variance; with relatively rigid models (low complexity) we have high bias and low variance.

• The model with the optimal predictive capabilities has to balance between bias and variance.
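A sketch of the decomposition these three terms annotate, writing h(x) for the optimal regression function and E_D for the average over datasets:

    \text{expected loss} = (\text{bias})^2 + \text{variance} + \text{noise}

    (\text{bias})^2 = \int \bigl( \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] - h(\mathbf{x}) \bigr)^2 p(\mathbf{x})\, d\mathbf{x},
    \qquad
    \text{variance} = \int \mathbb{E}_{\mathcal{D}}\!\bigl[ \bigl( y(\mathbf{x}; \mathcal{D}) - \mathbb{E}_{\mathcal{D}}[y(\mathbf{x}; \mathcal{D})] \bigr)^2 \bigr] p(\mathbf{x})\, d\mathbf{x}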
Bias-Variance Trade-off

• Consider the sinusoidal dataset. We generate 100 datasets, each containing N=25 points, drawn independently from

[Figure: fits across the 100 datasets for different regularization settings; panels are annotated as high variance, low variance, and low bias.]
Bias-Variance Trade-off

• Consider the sinusoidal dataset. We generate 100 datasets, each containing N=25 points, drawn independently from

• And the integrated squared bias and variance are given by:

where the integral over x weighted by the distribution p(x) is approximated by a finite sum over data points drawn from that distribution.
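A minimal sketch of this experiment and of the finite-sum estimates of the integrated squared bias and variance; the basis functions, noise level, and regularization setting are illustrative assumptions:

    import numpy as np

    rng = np.random.default_rng(0)
    L, N = 100, 25                            # number of datasets and points per dataset
    centres = np.linspace(0.0, 1.0, 24)       # centres of the Gaussian basis functions
    s, lam = 0.1, np.exp(-0.31)               # basis width and regularization (illustrative)

    def design(x):
        # Design matrix: a bias column plus Gaussian basis functions.
        g = np.exp(-(x[:, None] - centres[None, :]) ** 2 / (2.0 * s ** 2))
        return np.hstack([np.ones((len(x), 1)), g])

    x_test = np.linspace(0.0, 1.0, 200)
    h = np.sin(2 * np.pi * x_test)            # the optimal regression function h(x)
    preds = np.empty((L, len(x_test)))

    for l in range(L):
        x = rng.uniform(0.0, 1.0, N)          # one dataset drawn from the noisy sinusoid
        t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)
        Phi = design(x)
        w = np.linalg.solve(lam * np.eye(Phi.shape[1]) + Phi.T @ Phi, Phi.T @ t)
        preds[l] = design(x_test) @ w

    y_bar = preds.mean(axis=0)                # average prediction over the 100 datasets
    bias_sq = np.mean((y_bar - h) ** 2)       # integrated (bias)^2, finite-sum approximation
    variance = np.mean((preds - y_bar) ** 2)  # integrated variance, finite-sum approximation
    print(bias_sq, variance)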
Bias-Variance Trade-off

From these plots, note that an over-regularized model (large λ) has high bias, and an under-regularized model (low λ) has high variance.
Beating the Bias-Variance Trade-off

• We can reduce the variance by averaging over many models trained on different datasets:
  - In practice, we only have a single observed dataset. If we had many independent training sets, we would be better off combining them into one large training dataset. With more data, we have less variance.
• Given a standard training set D of size N, we could generate new training sets of size N by sampling examples from D uniformly and with replacement.
  - This is called bagging and it works quite well in practice (see the sketch after this list).
• Given enough computation, we would be better off resorting to the Bayesian framework (which we will discuss next):
  - Combine the predictions of many models using the posterior probability of each parameter vector as the combination weight.
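A minimal sketch of bagging, as referenced above: bootstrap resamples of a single dataset D, one fit per resample, predictions averaged; the polynomial model and all settings are illustrative:

    import numpy as np

    rng = np.random.default_rng(2)

    # A single observed dataset D of size N (a noisy sinusoid, as in the earlier example).
    N = 25
    x = rng.uniform(0.0, 1.0, N)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(N)

    def fit_ridge(x, t, lam=1e-3):
        # Polynomial features (1, x, ..., x^5) with a small ridge penalty; illustrative model.
        Phi = np.vander(x, 6, increasing=True)
        return np.linalg.solve(lam * np.eye(6) + Phi.T @ Phi, Phi.T @ t)

    def predict(w, x):
        return np.vander(x, 6, increasing=True) @ w

    # Bagging: M bootstrap resamples of D (uniform, with replacement); average the predictions.
    M = 50
    x_test = np.linspace(0.0, 1.0, 100)
    preds = []
    for _ in range(M):
        idx = rng.integers(0, N, size=N)      # sample N examples from D with replacement
        preds.append(predict(fit_ridge(x[idx], t[idx]), x_test))
    y_bagged = np.mean(preds, axis=0)         # averaging reduces the variance of the prediction
    print(y_bagged[:5])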