Lecture 1, 2015
414/2104: Machine Learning

Russ Salakhutdinov
Department of Computer Science
Department of Statistics
[email protected]
http://www.cs.toronto.edu/~rsalakhu/
Evaluation
• 3 Assignments worth 40%.
• Midterm worth 20%.
• Undergrads: Final worth 40%.
• Graduate: 10% oral presentation, 30% final.
Text Books
• Christopher M. Bishop (2006) Pattern Recognition and Machine Learning, Springer.

Additional Books
• Kevin Murphy, Machine Learning: A Probabilistic Perspective.
• Most of the figures and material will come from these books.
Statistical Machine Learning
Statistical machine learning is a very dynamic field that lies at the intersection of Statistics and the computational sciences.

Develop statistical models that can discover underlying structure, cause, or statistical correlations from data.

[Figure: example application areas — Gene Expression, Relational Data / Product Recommendation, Climate Change, Social Network, Geological Data]
Example: Boltzmann Machine
A model with latent (hidden) variables and model parameters.

[Figure: topics learned from text data — European Community Monetary/Economic, Interbank Markets, Energy Markets, Disasters and Accidents, Leading Economic Indicators, Legal/Judicial]
Learned "genres" from the Netflix dataset:
• 480,189 users
• 17,770 movies
• Over 100 million ratings.

[Figure: learned movie clusters — e.g. Fahrenheit 9/11, Bowling for Columbine, The People vs. Larry Flynt, Canadian Bacon, La Dolce Vita | Independence Day, The Day After Tomorrow, Con Air, Men in Black II, Men in Black | Friday the 13th, The Texas Chainsaw Massacre, Children of the Corn, Child's Play, The Return of Michael Myers]

• Part of the winning solution in the Netflix contest (1 million dollar prize).
Tentative List of Topics
• Linear methods for regression, Bayesian linear regression
• Linear models for classification
• Probabilistic generative and discriminative models
• Regularization methods
• Model comparison and BIC
• Neural networks
• Radial basis function networks
• Kernel methods, Gaussian processes, Support Vector Machines
• Mixture models and the EM algorithm
• Graphical models and Bayesian networks
Types of Learning
Consider observing a series of input vectors: x_1, x_2, x_3, x_4, ...
• Unsupervised Learning: The goal is to build a statistical model of x, which can be used for making predictions and decisions.
• The term w_0 is the intercept, often called the bias term. It will be convenient to include the constant variable 1 in x and write:

  y(x, w) = wᵀx

Source: Wikipedia
Linear Least Squares
If XᵀX is nonsingular, then the unique solution is given by:

  w = (XᵀX)⁻¹ Xᵀ t

Source: Wikipedia
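As a sketch of this closed-form solution (not part of the slides; the data below is made up), NumPy can solve the normal equations directly:

```python
import numpy as np

# Made-up example data: t = 2*x + 1 plus a little noise.
rng = np.random.default_rng(0)
x = np.linspace(0, 1, 20)
t = 2 * x + 1 + 0.05 * rng.standard_normal(x.size)

# Include the constant variable 1 in each input, so w[0] is the bias term.
X = np.column_stack([np.ones_like(x), x])

# Normal equations: w = (X^T X)^{-1} X^T t.
# (solve() assumes X^T X is nonsingular; np.linalg.lstsq is more robust in practice.)
w = np.linalg.solve(X.T @ X, X.T @ t)
print(w)  # roughly [1, 2]
```

In practice one would call `np.linalg.lstsq` rather than forming XᵀX explicitly, which squares the condition number of the problem.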
Example: Polynomial Curve Fitting
Goal: Fit the data using a polynomial function of the form:

  y(x, w) = w_0 + w_1 x + w_2 x² + ... + w_M x^M

Note: the polynomial function is a nonlinear function of x, but it is a linear function of the coefficients w! Hence these are linear models.
• As in the least squares example, we can minimize the sum of the squares of the errors between the predictions y(x_n, w) for each data point x_n and the corresponding target values t_n:

  E(w) = (1/2) Σ_{n=1..N} { y(x_n, w) − t_n }²
• For M = 9, the training error is zero! The polynomial contains 10 degrees of freedom corresponding to the 10 parameters w, and so can be fitted exactly to the 10 data points.
• However, the test error has become very large. Why?
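This behaviour is easy to reproduce. The sketch below assumes the usual sin(2πx) toy data from Bishop's example (the noise level and seeds here are made up): an M = 9 polynomial interpolates 10 training points exactly, yet predicts poorly on fresh data.

```python
import numpy as np

rng = np.random.default_rng(1)

def make_data(n):
    # Noisy samples of sin(2*pi*x) on [0, 1] (assumed toy data, as in Bishop).
    x = np.linspace(0, 1, n)
    t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(n)
    return x, t

def design(x, M):
    # Polynomial features 1, x, x^2, ..., x^M.
    return np.vander(x, M + 1, increasing=True)

x_train, t_train = make_data(10)
x_test, t_test = make_data(100)

M = 9
w = np.linalg.lstsq(design(x_train, M), t_train, rcond=None)[0]

def rms(x, t):
    err = design(x, M) @ w - t
    return np.sqrt(np.mean(err ** 2))

print(rms(x_train, t_train))  # essentially zero: 10 parameters fit 10 points exactly
print(rms(x_test, t_test))    # much larger on held-out data
```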
Overfitting
• However, the number of parameters is not necessarily the most appropriate measure of model complexity.
Generalization
• The goal is to achieve good generalization by making accurate predictions for new test data that is not known during learning.
• Choosing the values of parameters that minimize the loss function on the training data may not be the best option.
• We would like to model the true regularities in the data and ignore the noise in the data:
  - It is hard to know which regularities are real and which are accidental due to the particular training examples we happen to pick.
• Intuition: We expect the model to generalize if it explains the data well given the complexity of the model.
• If the model has as many degrees of freedom as the data, it can fit the data perfectly. But this is not very informative.
• There is some theory on how to control model complexity to optimize generalization.
A Simple Way to Penalize Complexity
One technique for controlling the over-fitting phenomenon is regularization, which amounts to adding a penalty term to the error function:

  Ẽ(w) = (1/2) Σ_{n=1..N} { y(x_n, w) − t_n }² + (λ/2) ‖w‖²

where Ẽ(w) is the penalized error, t_n is the target value, ‖w‖² is the regularization function, and λ is the regularization parameter.
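A minimal sketch of this penalized least-squares fit (ridge regression) with made-up data; the closed form w = (XᵀX + λI)⁻¹Xᵀt used here follows from setting the gradient of the penalized error to zero:

```python
import numpy as np

def ridge_fit(X, t, lam):
    # Minimize (1/2)||X w - t||^2 + (lam/2)||w||^2.
    # Closed form: w = (X^T X + lam*I)^{-1} X^T t.
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(d), X.T @ t)

# Made-up example: degree-9 polynomial on 10 noisy points.
rng = np.random.default_rng(2)
x = np.linspace(0, 1, 10)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(10)
X = np.vander(x, 10, increasing=True)

w_unreg = ridge_fit(X, t, 0.0)   # lambda = 0: interpolates, with huge coefficients
w_reg = ridge_fit(X, t, 1e-3)    # a small lambda shrinks the coefficients sharply
print(np.abs(w_unreg).max(), np.abs(w_reg).max())
```

The penalty leaves the problem linear in w, which is why a closed-form solution survives.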
Graph of the root-mean-squared training and test errors vs. ln λ for the M = 9 polynomial.

How to choose λ?
Cross Validation
If the data is plentiful, we can divide the dataset into three subsets:
• Training Data: used for fitting/learning the parameters of the model.
• Validation Data: not used for learning but for selecting the model, or choosing the amount of regularization that works best.
• Test Data: used to assess the performance of the final model.
For many applications, the supply of data for training and testing is limited. To build good models, we may want to use as much training data as possible. If the validation set is small, we get a noisy estimate of the predictive performance.
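The training/validation/test procedure above can be sketched as follows; the dataset, candidate λ grid, and split sizes are all made up for illustration:

```python
import numpy as np

rng = np.random.default_rng(3)

# Made-up dataset, split into training / validation / test subsets.
x = rng.uniform(0, 1, 60)
t = np.sin(2 * np.pi * x) + 0.3 * rng.standard_normal(60)
X = np.vander(x, 10, increasing=True)  # degree-9 polynomial features

X_tr, t_tr = X[:20], t[:20]
X_val, t_val = X[20:40], t[20:40]
X_te, t_te = X[40:], t[40:]

def fit(X, t, lam):
    # Regularized least squares: w = (X^T X + lam*I)^{-1} X^T t.
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ t)

def rms(X, t, w):
    return np.sqrt(np.mean((X @ w - t) ** 2))

# Pick the lambda with the lowest validation error, then report test error once.
lams = [10.0 ** k for k in range(-8, 1)]
best_lam = min(lams, key=lambda lam: rms(X_val, t_val, fit(X_tr, t_tr, lam)))
w = fit(X_tr, t_tr, best_lam)
print(best_lam, rms(X_te, t_te, w))
```

Note that the test set is touched only once, after λ has been chosen on the validation set; otherwise the test error would itself become an optimistic estimate.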
Basics of Probability Theory
• Consider two random variables X and Y:
  - X takes any of the values x_i, where i = 1,...,M.
  - Y takes any of the values y_j, where j = 1,...,L.
• Consider a total of N trials, and let n_ij be the number of trials in which X = x_i and Y = y_j.
• Marginal probability can be written as:

  p(X = x_i) = c_i / N,  where c_i = Σ_j n_ij.
• Conditional Probability:

  p(Y = y_j | X = x_i) = n_ij / c_i

Sum Rule:  p(X) = Σ_Y p(X, Y)
Product Rule:  p(X, Y) = p(Y | X) p(X)

Bayes' Rule
• From the product rule, together with the symmetry property p(X, Y) = p(Y, X), we obtain:

  p(Y | X) = p(X | Y) p(Y) / p(X)
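These rules can be checked numerically on a small made-up joint distribution:

```python
import numpy as np

# Made-up joint distribution p(X, Y) over M=2 values of X and L=3 values of Y.
joint = np.array([[0.10, 0.20, 0.10],
                  [0.25, 0.15, 0.20]])  # rows: X, columns: Y

# Sum rule: marginalize out one variable.
p_x = joint.sum(axis=1)
p_y = joint.sum(axis=0)

# Product rule: p(X, Y) = p(Y | X) p(X), so p(Y | X) = p(X, Y) / p(X).
p_y_given_x = joint / p_x[:, None]

# Bayes' rule: p(X | Y) = p(Y | X) p(X) / p(Y).
p_x_given_y = p_y_given_x * p_x[:, None] / p_y[None, :]

# Each column of p(X | Y) is a proper conditional distribution.
print(p_x_given_y.sum(axis=0))  # each column sums to 1
```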
• If we are given a finite number N of points drawn from the probability distribution (density), then the expectation can be approximated as:

  E[f] ≈ (1/N) Σ_{n=1..N} f(x_n)
• The variance of f(x) is defined as:

  var[f] = E[ (f(x) − E[f(x)])² ]

which measures how much variability there is in f(x) around its mean value E[f(x)].
• The covariance of x and y is defined as:

  cov[x, y] = E_{x,y}[ (x − E[x]) (y − E[y]) ]

which measures the extent to which x and y vary together. If x and y are independent, then their covariance vanishes.
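A quick sketch of these sample-based approximations, using made-up distributions where the true values are known:

```python
import numpy as np

rng = np.random.default_rng(4)

# Draw N points and approximate E[f] by the sample average.
N = 200_000
x = rng.normal(0.0, 1.0, N)      # x ~ N(0, 1)
f = x ** 2                       # f(x) = x^2, so E[f] = 1 exactly

E_f = f.mean()                   # E[f] ~ (1/N) sum_n f(x_n)
var_f = ((f - E_f) ** 2).mean()  # E[(f - E[f])^2]; true value is 2 here

# Covariance of two variables that vary together: y = x + noise,
# so cov[x, y] = var[x] = 1.
y = x + 0.5 * rng.normal(0.0, 1.0, N)
cov_xy = ((x - x.mean()) * (y - y.mean())).mean()

print(E_f, var_f, cov_xy)  # close to 1, 2, 1
```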
• For two vectors of random variables x and y, the covariance is a matrix:

  cov[x, y] = E_{x,y}[ (x − E[x]) (yᵀ − E[yᵀ]) ]
The Gaussian Distribution
• For the case of a single real-valued variable x, the Gaussian distribution is defined as:

  N(x | µ, σ²) = (1 / (2πσ²)^{1/2}) exp( −(x − µ)² / (2σ²) )
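A small sketch evaluating this density and checking numerically that it integrates to 1:

```python
import numpy as np

def gaussian_pdf(x, mu, sigma2):
    # N(x | mu, sigma^2) = 1/sqrt(2*pi*sigma^2) * exp(-(x - mu)^2 / (2*sigma^2))
    return np.exp(-(x - mu) ** 2 / (2 * sigma2)) / np.sqrt(2 * np.pi * sigma2)

# Riemann sum over a wide grid; the tails beyond +/-10 are negligible.
xs = np.linspace(-10, 10, 200_001)
dx = xs[1] - xs[0]
area = gaussian_pdf(xs, 0.0, 1.0).sum() * dx
print(area)  # approximately 1.0
```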
• Next class, we will look at various distributions, as well as at the multivariate extension of the Gaussian distribution.
The Gaussian Distribution
• Because the parameter µ represents the average value of x under the distribution, it is referred to as the mean:

  E[x] = µ

• It then follows that the variance of x is given by:

  var[x] = E[x²] − E[x]² = σ²
Sampling Assumptions
• Suppose we have a dataset of observations x = (x_1,...,x_N)ᵀ, representing N one-dimensional observations.
• Assume that the training examples are drawn independently from the set of all possible examples, or from the same underlying distribution.
• Assume that the test samples are drawn in exactly the same way -- i.i.d. from the same distribution as the training data.
It is often convenient to maximize the log of the likelihood function:

  ln p(x | µ, σ²) = −(1/(2σ²)) Σ_{n=1..N} (x_n − µ)² − (N/2) ln σ² − (N/2) ln(2π)

• Maximizing w.r.t. µ gives the sample mean:

  µ_ML = (1/N) Σ_{n=1..N} x_n
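A quick numerical check that the maximum likelihood estimate recovers the mean (the true µ and σ used below are made up):

```python
import numpy as np

rng = np.random.default_rng(5)

# Draw N i.i.d. samples from a Gaussian with known mean and variance.
mu, sigma = 1.5, 2.0
x = rng.normal(mu, sigma, 100_000)

# The ML estimate of the mean is the sample mean.
mu_ml = x.mean()

# The ML estimate of the variance is the (biased) sample variance.
var_ml = np.mean((x - mu_ml) ** 2)

print(mu_ml, var_ml)  # close to 1.5 and 4.0
```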
• Our goal is to predict the target t given a new value for x:
  - For regression: t is a real-valued continuous target.
  - For classification: t is a categorical variable representing class labels.