Lecture 10
David Sontag
New York University
    T     P              W       P
    warm  0.5            sun     0.6
    cold  0.5            rain    0.1
                         fog     0.3
                         meteor  0.0
• Must have:  P(X = x) ≥ 0 for every value x, and Σx P(X = x) = 1
Joint Distributions
• A joint distribution over a set of random variables X1, …, Xn specifies a real number for each assignment:  P(X1 = x1, …, Xn = xn)
  – How many assignments if n variables with domain sizes d?  d^n
  – Must obey:  P(x1, …, xn) ≥ 0 and Σ P(x1, …, xn) = 1

    T     W     P
    hot   sun   0.4
    hot   rain  0.1
    cold  sun   0.2
    cold  rain  0.3

• For all but the smallest distributions, impractical to write out or estimate
  – Instead, we make additional assumptions about the distribution
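As a quick illustration, a minimal Python sketch (the dictionary simply re-enters the joint table above) that checks the two requirements and the d^n count:

    # Joint distribution over (T, W), stored as assignment -> probability.
    joint = {
        ("hot", "sun"): 0.4,
        ("hot", "rain"): 0.1,
        ("cold", "sun"): 0.2,
        ("cold", "rain"): 0.3,
    }

    # n variables with domain size d give d**n assignments; here n = d = 2, so 4 rows.
    n, d = 2, 2
    assert len(joint) == d ** n

    # Must obey: non-negative entries that sum to 1.
    assert all(p >= 0 for p in joint.values())
    assert abs(sum(joint.values()) - 1.0) < 1e-9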
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding

    P(t) = Σw P(t, w)          P(w) = Σt P(t, w)

    T     W     P              T     P            W     P
    hot   sun   0.4            hot   0.5          sun   0.6
    hot   rain  0.1            cold  0.5          rain  0.4
    cold  sun   0.2
    cold  rain  0.3
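A minimal sketch of summing out in Python (reusing the joint table above; the helper name is just for illustration):

    from collections import defaultdict

    joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
             ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

    def marginal(joint, axis):
        """Sum out every variable except the one at position `axis`."""
        out = defaultdict(float)
        for assignment, p in joint.items():
            out[assignment[axis]] += p
        return dict(out)

    print(marginal(joint, 0))   # P(T): hot 0.5, cold 0.5
    print(marginal(joint, 1))   # P(W): sun ≈ 0.6, rain ≈ 0.4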
Conditional Probabilities
• A simple relation between joint and conditional probabilities
  – In fact, this is taken as the definition of a conditional probability:

    P(a | b) = P(a, b) / P(b)

    T     W     P
    hot   sun   0.4
    hot   rain  0.1
    cold  sun   0.2
    cold  rain  0.3
Conditional Distributions
• Conditional distributions are probability distributions over some variables given fixed values of others

  Joint Distribution            Conditional Distributions
    T     W     P                 P(W | T = hot)          P(W | T = cold)
    hot   sun   0.4                 W     P                 W     P
    hot   rain  0.1                 sun   0.8               sun   0.4
    cold  sun   0.2                 rain  0.2               rain  0.6
    cold  rain  0.3
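A minimal sketch of conditioning in Python (same joint table; the helper is hypothetical and hard-codes conditioning on T for brevity):

    joint = {("hot", "sun"): 0.4, ("hot", "rain"): 0.1,
             ("cold", "sun"): 0.2, ("cold", "rain"): 0.3}

    def condition_on_T(joint, t):
        """P(W | T = t) = P(t, w) / P(t), with P(t) obtained by summing out W."""
        p_t = sum(p for (ti, _), p in joint.items() if ti == t)
        return {w: p / p_t for (ti, w), p in joint.items() if ti == t}

    print(condition_on_T(joint, "hot"))    # {'sun': 0.8, 'rain': 0.2}
    print(condition_on_T(joint, "cold"))   # {'sun': 0.4, 'rain': 0.6}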
The Product Rule
• Sometimes have conditional distributions but want the joint:  P(x, y) = P(x | y) P(y)
• Example:

    P(W)               P(D | W)                  P(D, W)
    W     P              D     W     P             D     W     P
    sun   0.8            wet   sun   0.1           wet   sun   0.08
    rain  0.2            dry   sun   0.9           dry   sun   0.72
                         wet   rain  0.7           wet   rain  0.14
                         dry   rain  0.3           dry   rain  0.06
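A minimal sketch of the product rule on the example tables above (the results match the joint table up to floating-point rounding):

    p_w = {"sun": 0.8, "rain": 0.2}                              # P(W)
    p_d_given_w = {("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
                   ("wet", "rain"): 0.7, ("dry", "rain"): 0.3}   # P(D | W)

    # Product rule: P(d, w) = P(d | w) P(w)
    p_dw = {(d, w): p * p_w[w] for (d, w), p in p_d_given_w.items()}
    print(p_dw)   # wet/sun ≈ 0.08, dry/sun ≈ 0.72, wet/rain ≈ 0.14, dry/rain ≈ 0.06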
Bayes’ Rule
• Two ways to factor a joint distribution over two variables:  P(x, y) = P(x | y) P(y) = P(y | x) P(x)
• Dividing, we get:  P(x | y) = P(y | x) P(x) / P(y)
…
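A minimal sketch of Bayes' rule on the product-rule example above, inverting P(D | W) to get P(W | D = wet):

    p_w = {"sun": 0.8, "rain": 0.2}
    p_d_given_w = {("wet", "sun"): 0.1, ("dry", "sun"): 0.9,
                   ("wet", "rain"): 0.7, ("dry", "rain"): 0.3}

    def posterior_w(d):
        """P(w | d) = P(d | w) P(w) / P(d), with P(d) = sum_w P(d | w) P(w)."""
        unnorm = {w: p_d_given_w[(d, w)] * p_w[w] for w in p_w}
        z = sum(unnorm.values())
        return {w: p / z for w, p in unnorm.items()}

    print(posterior_w("wet"))   # P(W | D = wet): sun ≈ 0.36, rain ≈ 0.64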
• Flips are i.i.d.:  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)
  – Independent events
  – Identically distributed according to Bernoulli distribution
• Sequence D of αH Heads and αT Tails:  P(D | θ) = θ^αH (1 − θ)^αT
    d/dθ ln P(D | θ) = d/dθ ln [ θ^αH (1 − θ)^αT ]
                     = d/dθ [ αH ln θ + αT ln(1 − θ) ]
                     = αH/θ − αT/(1 − θ) = 0

    ⟹  θ_MLE = αH / (αH + αT)
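A minimal sketch of the resulting estimator on a hypothetical sequence of flips:

    # Bernoulli MLE for the thumbtack: theta_MLE = alpha_H / (alpha_H + alpha_T).
    flips = ["H", "T", "T", "H", "H"]          # hypothetical data, for illustration only
    alpha_H = flips.count("H")
    alpha_T = flips.count("T")
    theta_mle = alpha_H / (alpha_H + alpha_T)
    print(theta_mle)                            # 0.6 for this sample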
[Figure: the log-likelihood L(θ; D) = ln P(D | θ) of the observed data, plotted as a function of θ]
What if I have prior beliefs?
• Billionaire says: Wait, I know that the thumbtack is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a distribution over possible values of θ
[Figure: in the beginning, the prior Pr(θ); observe flips, e.g. {tails, tails}; after observations, the posterior Pr(θ | D)]
    P(mistake) ≤ 2 e^(−2Nε²) ≤ δ
    ln δ ≥ ln 2 − 2Nε²   ⟹   N ≥ ln(2/δ) / (2ε²)
    e.g.  N ≥ ln(2/0.05) / (2 · 0.1²) ≈ 3.8 / 0.02 = 190

Bayesian Learning
• Use Bayes’ rule!

    P(θ | D) = P(D | θ) P(θ) / P(D)
    posterior = data likelihood × prior / normalization

• Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)
• For uniform priors, this reduces to maximum likelihood estimation!
    P(θ) ∝ 1   ⟹   P(θ | D) ∝ P(D | θ)
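A minimal sketch of this update on a grid of θ values (the counts are hypothetical; with the uniform prior shown here the posterior mode coincides with the MLE, as the slide notes):

    import numpy as np

    thetas = np.linspace(0.001, 0.999, 999)          # grid over possible values of θ
    alpha_H, alpha_T = 3, 2                          # hypothetical counts of heads and tails

    likelihood = thetas**alpha_H * (1 - thetas)**alpha_T   # P(D | θ)
    prior = np.ones_like(thetas)                           # uniform prior, P(θ) ∝ 1
    posterior = likelihood * prior
    posterior /= posterior.sum()                           # normalize over the grid

    print(thetas[np.argmax(posterior)])   # ≈ 0.6 = αH / (αH + αT), i.e. the MLE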
Bayesian Learning for Thumbtacks
• Likelihood:  P(D | θ) = θ^αH (1 − θ)^αT

[Figure: observed real-valued data points, e.g. 3, 12, …, 99, 89]
• Learn parameters
  – µ (“mean”)
  – σ² (“variance”)
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x1, …, xN}:

    P(D | µ, σ) = (1/(σ√(2π)))^N  Πi e^(−(xi − µ)² / (2σ²))

    µ_MLE, σ_MLE = arg max(µ,σ) P(D | µ, σ)

• Log-likelihood of data:

    ln P(D | µ, σ) = −N ln(σ√(2π)) − Σ(i=1..N) (xi − µ)² / (2σ²)
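A minimal sketch of the log-likelihood as a function of (µ, σ), evaluated on a hypothetical sample:

    import numpy as np

    def gaussian_log_likelihood(data, mu, sigma):
        """ln P(D | mu, sigma) = -N ln(sigma sqrt(2 pi)) - sum_i (x_i - mu)^2 / (2 sigma^2)."""
        data = np.asarray(data, dtype=float)
        n = data.size
        return -n * np.log(sigma * np.sqrt(2 * np.pi)) - np.sum((data - mu) ** 2) / (2 * sigma**2)

    data = [3.0, 12.0, 99.0, 89.0]     # hypothetical sample
    print(gaussian_log_likelihood(data, mu=50.0, sigma=40.0))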
Your second learning algorithm: MLE for mean of a Gaussian
• What’s MLE for mean?
    µ_MLE, σ_MLE = arg max(µ,σ) P(D | µ, σ)

    d/dµ ln P(D | µ, σ) = Σ(i=1..N) (xi − µ) / σ² = 0
    ⟹  Σ(i=1..N) xi − Nµ = 0
    ⟹  µ_MLE = (1/N) Σ(i=1..N) xi
MLE for variance
• Again, set derivative to zero:

    µ_MLE, σ_MLE = arg max(µ,σ) P(D | µ, σ)

    d/dσ ln P(D | µ, σ) = −N/σ + Σ(i=1..N) (xi − µ)² / σ³ = 0
    ⟹  σ²_MLE = (1/N) Σ(i=1..N) (xi − µ_MLE)²
Learning Gaussian parameters
• MLE:

    µ_MLE = (1/N) Σ(i=1..N) xi
    σ²_MLE = (1/N) Σ(i=1..N) (xi − µ_MLE)²
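A minimal sketch of both estimates on a hypothetical sample:

    import numpy as np

    data = np.array([3.0, 12.0, 99.0, 89.0])      # hypothetical sample

    mu_mle = data.mean()                           # (1/N) sum_i x_i
    var_mle = np.mean((data - mu_mle) ** 2)        # (1/N) sum_i (x_i - mu_MLE)^2, same as np.var(data)

    print(mu_mle, var_mle)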
Classifier

    P(Y | X) = P(X | Y) P(Y) / P(X)
               likelihood × prior / normalization constant
• Why did this help? Well, we think that we might be able to specify how features are “generated” by the class label
The Bayes Classifier
• Let’s expand this for our digit recognition task:

    P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)

  – More generally:  P(Y | X) ∝ P(X | Y) P(Y)
• Decision rule:  predict y* = arg max_y P(y) P(x1, …, xn | y)
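A minimal sketch of the decision rule y* = arg max_y P(y) P(x | y) on a hypothetical two-class problem with made-up probabilities:

    p_y = {0: 0.5, 1: 0.5}                                    # prior P(Y)
    p_x_given_y = {0: {"dark": 0.2, "light": 0.8},            # P(X | Y), made-up numbers
                   1: {"dark": 0.7, "light": 0.3}}

    def bayes_classify(x):
        """Return the class y maximizing P(y) P(x | y)."""
        return max(p_y, key=lambda y: p_y[y] * p_x_given_y[y][x])

    print(bayes_classify("dark"))    # 1, since 0.5 * 0.7 > 0.5 * 0.2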
…estimator (MVUE) is sometimes used instead. It is

    σ̂²_ik = ( 1 / (Σj δ(Y^j = yk) − 1) ) Σj (X_i^j − µ̂_ik)² δ(Y^j = yk)        (15)
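A minimal sketch of the per-class mean and this unbiased variance estimate, on hypothetical arrays X (one feature column) and Y (labels); the helper name is made up:

    import numpy as np

    X = np.array([[1.0], [2.0], [3.0], [10.0], [12.0]])   # hypothetical feature column X_1
    Y = np.array([0, 0, 0, 1, 1])                          # hypothetical class labels

    def gnb_estimates(X, Y, i, k):
        """mu_hat_ik and the unbiased sigma_hat^2_ik of equation (15) for feature i, class k."""
        x = X[Y == k, i]
        mu = x.mean()
        var = ((x - mu) ** 2).sum() / (len(x) - 1)         # divide by (count - 1), as in eq. (15)
        return mu, var

    print(gnb_estimates(X, Y, i=0, k=0))   # (2.0, 1.0)
    print(gnb_estimates(X, Y, i=0, k=1))   # (11.0, 2.0)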
Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
• Sigmoid applied to a linear function of the data:

    Logistic function (Sigmoid):   1 / (1 + e^(−z))

Logistic Regression is an approach to learning functions of the form f : X → Y, or P(Y|X) in the case where Y is discrete-valued, and X = ⟨X1 … Xn⟩ is any vector containing discrete or continuous variables. In this section we will primarily consider the case where Y is a boolean variable, in order to simplify notation. In the final section we extend our treatment to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly estimates its parameters from the training data. The parametric model assumed by Logistic Regression in the case where Y is boolean is:

    P(Y = 1|X) = 1 / (1 + exp(w0 + Σ(i=1..n) wi Xi))                                      (16)

    P(Y = 0|X) = exp(w0 + Σ(i=1..n) wi Xi) / (1 + exp(w0 + Σ(i=1..n) wi Xi))              (17)

    (Features can be discrete or continuous!)

Note that equation (17) follows directly from equation (16), because the sum of these two probabilities must equal 1.
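A minimal sketch of equations (16) and (17) (the weights and features are made up just to exercise the formulas; note the sign convention of (16), where a large positive w0 + Σ wi Xi pushes P(Y = 1|X) toward 0):

    import math

    def p_y_given_x(x, w0, w):
        """P(Y=1|X) and P(Y=0|X) from equations (16) and (17); x and w are equal-length lists."""
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(z))            # equation (16)
        p0 = math.exp(z) / (1.0 + math.exp(z))    # equation (17), equal to 1 - p1
        return p1, p0

    print(p_y_given_x(x=[1.0, 2.0], w0=-0.5, w=[0.3, -0.2]))   # ≈ (0.646, 0.354)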
Logistic Function in n Dimensions

[Figure: form of the logistic function. In Logistic Regression, P(Y|X) is assumed to follow this form.]
  – For binary Y, output Y = 0 or Y = 1 according to the sign of w·X + w0 (equation (18) below)

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear expression for classification. To classify any given X we generally want to assign the value yk that maximizes P(Y = yk|X). Put another way, we assign the label Y = 0 if the following condition holds:

    1 < P(Y = 0|X) / P(Y = 1|X)

Substituting from equations (16) and (17), this becomes

    1 < exp(w0 + Σ(i=1..n) wi Xi)

and taking the natural log of both sides we have a linear classification rule that assigns label Y = 0 if X satisfies

    0 < w0 + Σ(i=1..n) wi Xi                                                              (18)

and assigns Y = 1 otherwise.   A Linear Classifier!

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can view Logistic Regression as a closely related alternative to GNB…
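A small sketch checking the equivalence derived above: comparing P(Y = 0|X) with P(Y = 1|X) yields the same label as testing the sign of w0 + Σ wi Xi in equation (18); the weights and inputs are hypothetical:

    import math

    def label_from_probabilities(x, w0, w):
        """Assign Y = 0 when P(Y=0|X) > P(Y=1|X), using equations (16)-(17)."""
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        p1 = 1.0 / (1.0 + math.exp(z))
        return 0 if (1.0 - p1) > p1 else 1

    def label_from_linear_rule(x, w0, w):
        """Equation (18): assign Y = 0 exactly when 0 < w0 + sum_i w_i x_i."""
        z = w0 + sum(wi * xi for wi, xi in zip(w, x))
        return 0 if z > 0 else 1

    w0, w = -0.5, [0.3, -0.2]                     # hypothetical weights
    for x in ([1.0, 2.0], [4.0, 1.0], [-3.0, 2.0]):
        assert label_from_probabilities(x, w0, w) == label_from_linear_rule(x, w0, w)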
Likelihood vs. Conditional Likelihood
• Generative (Naïve Bayes) maximizes the data likelihood:  Πj P(x^j, y^j)
• Discriminative (Logistic Regression) maximizes the conditional data likelihood:  Πj P(y^j | x^j)
  – Focuses only on learning P(Y|X) - all that matters for classification
Maximizing Conditional Log Likelihood

    l(w) ≡ ln Πj P(y^j | x^j, w) = Σj [ y^j (w0 + Σi wi xi^j) − ln(1 + exp(w0 + Σi wi xi^j)) ]

    (each y^j is 0 or 1!)

Gradient:

    ∂l(w)/∂wi = ∂/∂wi Σj [ y^j (w0 + Σi wi xi^j) − ln(1 + exp(w0 + Σi wi xi^j)) ]
              = Σj xi^j [ y^j − P(Y^j = 1 | x^j, w) ]
Gradient Ascent for LR

Gradient ascent algorithm: (learning rate η > 0)
  do:
      w0 ← w0 + η Σj [ y^j − P(Y^j = 1 | x^j, w) ]
      wi ← wi + η Σj xi^j [ y^j − P(Y^j = 1 | x^j, w) ]    (for every weight wi)
  until convergence
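A minimal runnable sketch of this loop (using the convention of the gradient above, P(Y = 1 | x, w) = 1 / (1 + exp(−(w0 + Σi wi xi))); the data, learning rate, and iteration budget are made up for illustration):

    import numpy as np

    def sigmoid(z):
        return 1.0 / (1.0 + np.exp(-z))

    # Hypothetical training set: rows of X are the x^j, entries of y are labels y^j in {0, 1}.
    X = np.array([[0.5, 1.0], [1.5, -0.2], [-1.0, -1.5], [-0.3, 0.8]])
    y = np.array([1, 1, 0, 0])

    w0, w = 0.0, np.zeros(X.shape[1])
    eta = 0.1                                    # learning rate

    for _ in range(1000):                        # fixed budget instead of a convergence test
        p1 = sigmoid(w0 + X @ w)                 # P(Y^j = 1 | x^j, w) for every example
        error = y - p1
        w0 += eta * error.sum()                  # ascent step for the bias term
        w += eta * (X.T @ error)                 # ascent step for each w_i

    print(w0, w, sigmoid(w0 + X @ w))            # fitted weights and training probabilities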
Generative
• Assume functional form for
  – P(X|Y)  (assume conditional independence)
  – P(Y)
  – Est. params from train data
• Gaussian NB for cont. features
• Bayes rule to calc. P(Y | X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative
• Assume functional form for
  – P(Y|X)  (no assumptions)
  – Est. params from training data
• Handles discrete & cont. features
• Directly calculate P(Y | X = x)
  – Can’t generate data sample
Naïve Bayes vs. Logistic Regression  [Ng & Jordan, 2002]
• Some experiments from UCI data sets