
Introduction to Bayesian methods

Lecture 10

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein,
and Vibhav Gogate
Bayesian learning
• Bayesian learning uses probability to model
  data and quantify uncertainty of predictions
  – Facilitates incorporation of prior knowledge
  – Gives optimal predictions
• Allows for decision-theoretic reasoning
Your first consulting job
• A billionaire from the suburbs of Manhattan asks
  you a question:
  – He says: I have a thumbtack; if I flip it, what’s the
    probability it will fall with the nail up?
  – You say: Please flip it a few times:

  – You say: The probability is:

    P(heads) = 3/5

  – He says: Why???
  – You say: Because…
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Random Variables
• A random variable is some aspect of the world about
which we (may) have uncertainty
– R = Is it raining?
– D = How long will it take to drive to work?
– L = Where am I?

• We denote random variables with capital letters

• Random variables have domains


– R in {true, false} (sometimes write as {+r, ¬r})
– D in [0, ∞)
– L in possible locations, maybe {(0,0), (0,1), …}
Probability Distributions
• Discrete random variables have distributions

  P(T):                P(W):
  T      P             W       P
  warm   0.5           sun     0.6
  cold   0.5           rain    0.1
                       fog     0.3
                       meteor  0.0

• A discrete distribution is a TABLE of probabilities of values

• The probability of a state (lower case) is a single number

• Must have: P(x) ≥ 0 for every value x, and Σx P(x) = 1
Joint Distributions
• A joint distribution over a set of random variables X1, …, Xn
  specifies a real number for each assignment: P(X1 = x1, …, Xn = xn)

  T     W      P
  hot   sun    0.4
  hot   rain   0.1
  cold  sun    0.2
  cold  rain   0.3

  – How many assignments if n variables with domain sizes d?  (dⁿ)
  – Must obey: every entry is ≥ 0 and the entries sum to 1

• For all but the smallest distributions, impractical to write out or estimate
  – Instead, we make additional assumptions about the distribution
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding

  Joint P(T, W):        P(t) = Σw P(t, w):     P(w) = Σt P(t, w):
  T     W      P        T     P                W     P
  hot   sun    0.4      hot   0.5              sun   0.6
  hot   rain   0.1      cold  0.5              rain  0.4
  cold  sun    0.2
  cold  rain   0.3
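
The marginalization above is easy to sketch in code. A minimal Python example using the joint table from this slide (the dictionary layout is just an illustrative choice, not from the slides):

```python
from collections import defaultdict

# Joint distribution P(T, W) from the slide.
joint = {
    ("hot", "sun"): 0.4,
    ("hot", "rain"): 0.1,
    ("cold", "sun"): 0.2,
    ("cold", "rain"): 0.3,
}

# Marginalize (sum out) one variable: P(t) = sum_w P(t, w), P(w) = sum_t P(t, w).
p_T = defaultdict(float)
p_W = defaultdict(float)
for (t, w), p in joint.items():
    p_T[t] += p
    p_W[w] += p

print(dict(p_T))  # {'hot': 0.5, 'cold': 0.5}
print(dict(p_W))  # {'sun': ~0.6, 'rain': 0.4}
```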
Conditional Probabilities
• A simple relation between joint and conditional probabilities:

  P(a | b) = P(a, b) / P(b)

  – In fact, this is taken as the definition of a conditional probability

  T     W      P
  hot   sun    0.4       e.g., P(W = sun | T = cold)
  hot   rain   0.1            = P(T = cold, W = sun) / P(T = cold)
  cold  sun    0.2            = 0.2 / 0.5 = 0.4
  cold  rain   0.3
Conditional Distributions
• Conditional distributions are probability distributions over
  some variables given fixed values of others

  Conditional Distributions        Joint Distribution
  P(W | T = hot):                  T     W      P
  W     P                          hot   sun    0.4
  sun   0.8                        hot   rain   0.1
  rain  0.2                        cold  sun    0.2
                                   cold  rain   0.3
  P(W | T = cold):
  W     P
  sun   0.4
  rain  0.6
The Product Rule
• Sometimes have conditional distributions but want the joint:

  P(d, w) = P(d | w) P(w)

• Example:

  P(D | W):               P(W):            Joint P(D, W):
  D     W      P          W     P          D     W      P
  wet   sun    0.1        sun   0.8        wet   sun    0.08
  dry   sun    0.9        rain  0.2        dry   sun    0.72
  wet   rain   0.7                         wet   rain   0.14
  dry   rain   0.3                         dry   rain   0.06
Bayes’ Rule
• Two ways to factor a joint distribution over two variables:

  P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

  P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
  – Lets us build one conditional from its reverse
  – Often one conditional is tricky but the other one is simple
  – Foundation of many practical systems (e.g. ASR, MT)

• In the running for most important ML equation!
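
As a small numerical sketch of Bayes’ rule (not from the slides): take the prior P(T) from the marginals and the conditional tables P(W | T) from the Conditional Distributions slide, and recover P(T | W = sun); the answer matches what the joint table gives directly.

```python
# P(T) and P(W | T) from the earlier weather tables.
p_T = {"hot": 0.5, "cold": 0.5}
p_W_given_T = {
    "hot":  {"sun": 0.8, "rain": 0.2},
    "cold": {"sun": 0.4, "rain": 0.6},
}

w = "sun"
# Normalization constant: P(W = sun) = sum_t P(sun | t) P(t)
p_w = sum(p_W_given_T[t][w] * p_T[t] for t in p_T)
# Bayes' rule: P(T = t | W = sun) = P(sun | t) P(t) / P(sun)
p_T_given_w = {t: p_W_given_T[t][w] * p_T[t] / p_w for t in p_T}
print(p_T_given_w)  # {'hot': 0.666..., 'cold': 0.333...}
```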


Returning to thumbtack example…
• P(Heads) = θ,  P(Tails) = 1 − θ

• Flips are i.i.d.:  D = {xi | i = 1…n},  P(D | θ) = Πi P(xi | θ)
  – Independent events
  – Identically distributed according to a Bernoulli
    distribution
• Sequence D of αH Heads and αT Tails:

  P(D | θ) = θ^αH (1 − θ)^αT

  Called the “likelihood” of the data under the model


Maximum Likelihood Estimation
• Data: Observed set D of αH Heads and αT Tails
• Hypothesis: Bernoulli distribution
• Learning: finding θ is an optimization problem
  – What’s the objective function? The log-likelihood L(θ; D) = ln P(D | θ)

• MLE: Choose θ to maximize the probability of D


Your first parameter learning algorithm

  θ̂_MLE = arg maxθ ln P(D | θ)
        = arg maxθ ln [ θ^αH (1 − θ)^αT ]

• Set derivative to zero, and solve!

  d/dθ ln P(D | θ) = d/dθ ln [ θ^αH (1 − θ)^αT ]
                   = d/dθ [ αH ln θ + αT ln(1 − θ) ]
                   = αH/θ − αT/(1 − θ) = 0

  ⇒  θ̂_MLE = αH / (αH + αT)

  (The objective being maximized is the log-likelihood L(θ; D) = ln P(D | θ).)
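
A minimal sketch of this closed-form MLE, θ̂ = αH / (αH + αT), in Python; the five flips are made up to match the 3/5 thumbtack example:

```python
def bernoulli_mle(flips):
    """MLE for P(heads) of a Bernoulli from a list of 'H'/'T' outcomes."""
    n_heads = sum(1 for f in flips if f == "H")
    return n_heads / len(flips)

flips = ["H", "T", "H", "H", "T"]  # alpha_H = 3, alpha_T = 2
print(bernoulli_mle(flips))        # 0.6
```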


What if I have prior beliefs?
• Billionaire says: Wait, I know that the thumbtack
  is “close” to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a
  distribution over possible values of θ

  In the beginning: Pr(θ)   →   observe flips, e.g. {tails, tails}   →   after observations: Pr(θ | D)



Bayesian Learning
• Use Bayes’ rule!

  P(θ | D) = P(D | θ) P(θ) / P(D)

  Posterior = Data Likelihood × Prior / Normalization

• Or equivalently:  P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, this reduces to
  maximum likelihood estimation!

  P(θ) ∝ 1   ⇒   P(θ | D) ∝ P(D | θ)
Bayesian Learning for Thumbtacks

  Likelihood:  P(D | θ) = θ^αH (1 − θ)^αT

• What should the prior be?
  – Represent expert knowledge
  – Simple posterior form

• For binary variables, a commonly used prior is the
  Beta distribution:

Beta prior distribution – P(θ)

  P(θ) = θ^(βH − 1) (1 − θ)^(βT − 1) / B(βH, βT)   ~   Beta(βH, βT)

• Since the Beta distribution is conjugate to the Bernoulli distribution, the
  posterior distribution has a particularly simple form:

  P(θ | D) ∝ P(D | θ) P(θ)
           ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
           = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
           ~   Beta(αH + βH, αT + βT)
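
A short sketch of this conjugate update in Python (the Beta(10, 10) prior is an illustrative choice for the billionaire’s “close to 50-50” belief, not from the slides):

```python
def beta_bernoulli_update(prior_heads, prior_tails, n_heads, n_tails):
    """Posterior Beta parameters after observing thumbtack flips."""
    post_heads = n_heads + prior_heads
    post_tails = n_tails + prior_tails
    # Posterior mean and mode (MAP) of theta = P(heads); the MAP formula
    # assumes both posterior parameters are > 1.
    mean = post_heads / (post_heads + post_tails)
    map_est = (post_heads - 1) / (post_heads + post_tails - 2)
    return post_heads, post_tails, mean, map_est

# Beta(10, 10) prior plus the 3 heads / 2 tails observed earlier.
print(beta_bernoulli_update(10, 10, 3, 2))
# (13, 12, 0.52, ~0.522): the posterior pulls the 3/5 MLE back toward 1/2.
```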
Using Bayesian inference for prediction
• We now have a distribution over parameters
• For any specific f, a function of interest, compute the
  expected value of f:

  E[f] = ∫ f(θ) P(θ | D) dθ

• Integral is often hard to compute

• As more data is observed, the posterior becomes more concentrated
• MAP (Maximum a posteriori approximation): use the most
  likely parameter, θ̂_MAP = arg maxθ P(θ | D), to approximate the expectation
What about continuous variables?
• Billionaire says: If I am measuring a continuous
  variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a
  scalar and adding a constant) of a Gaussian
  are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b   ⇒   Y ~ N(aµ + b, a²σ²)

• Sum of (independent) Gaussians is Gaussian
  – X ~ N(µX, σ²X)
  – Y ~ N(µY, σ²Y)
  – Z = X + Y   ⇒   Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!


Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

• Learn parameters
  – µ (“mean”)
  – σ² (“variance”)

  i     xi = Exam Score
  0     85
  1     95
  2     100
  3     12
  …     …
  99    89
MLE for Gaussian
• Prob. of i.i.d. samples D = {x1, …, xN}:

  P(D | µ, σ) = Πi=1..N (1 / (σ√(2π))) exp(−(xi − µ)² / (2σ²))

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

• Log-likelihood of data:

  ln P(D | µ, σ) = −N ln(σ√(2π)) − Σi=1..N (xi − µ)² / (2σ²)
Your second learning algorithm:
MLE for mean of a Gaussian
• What’s the MLE for the mean?

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

  ∂/∂µ ln P(D | µ, σ) = Σi=1..N (xi − µ) / σ² = 0

  ⇒  Σi xi − Nµ = 0   ⇒   µ_MLE = (1/N) Σi xi
MLE for variance
• Again, set derivative to zero:

  µ_MLE, σ_MLE = arg maxµ,σ P(D | µ, σ)

  ∂/∂σ ln P(D | µ, σ) = −N/σ + Σi=1..N (xi − µ)² / σ³ = 0

  ⇒  σ²_MLE = (1/N) Σi (xi − µ_MLE)²
Learning Gaussian parameters
• MLE:

  µ_MLE = (1/N) Σi xi
  σ²_MLE = (1/N) Σi (xi − µ_MLE)²

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

    σ²_unbiased = (1/(N − 1)) Σi (xi − µ_MLE)²
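
A small Python sketch of these estimators, using exam-score style data as in the earlier slide (the particular numbers are illustrative):

```python
import math

def gaussian_mle(xs):
    """MLE for the mean and variance of a Gaussian, plus the unbiased variance."""
    n = len(xs)
    mu = sum(xs) / n
    var_mle = sum((x - mu) ** 2 for x in xs) / n              # biased: divides by N
    var_unbiased = sum((x - mu) ** 2 for x in xs) / (n - 1)   # divides by N - 1
    return mu, var_mle, var_unbiased

scores = [85, 95, 100, 12, 89]
mu, var_mle, var_unb = gaussian_mle(scores)
print(mu, var_mle, var_unb, math.sqrt(var_mle))
```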
Bayesian learning of Gaussian
parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for mean: a Gaussian with its own hyperparameters, µ ~ N(η, λ²)


Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Bayesian Classification
• Problem statement:
  – Given features X1, X2, …, Xn
  – Predict a label Y
[Next several slides adapted from:


Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos
Guestrin, and Dan Weld]
Example Application
• Digit Recognition

  (image of a handwritten digit) → Classifier → “5”

• X1, …, Xn ∈ {0,1} (Black vs. White pixels)

• Y ∈ {0,1,2,3,4,5,6,7,8,9}
The Bayes Classifier
• If we had the joint distribution on X1, …, Xn and Y, we could predict
  using:

  arg maxy P(Y = y | X1, …, Xn)

  – (for example: what is the probability that the image
    represents a 5 given its pixels?)

• So … How do we compute that?


The Bayes Classifier
• Use Bayes’ Rule!

  P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
                   = Likelihood × Prior / Normalization Constant

• Why did this help? Well, we think that we might be able to
  specify how features are “generated” by the class label
The Bayes Classifier
• Let’s expand this for our digit recognition task:

  P(Y = y | X1, …, Xn) ∝ P(X1, …, Xn | Y = y) P(Y = y),   y ∈ {0, 1, …, 9}

• To classify, we’ll simply compute these probabilities, one per
  class, and predict based on which one is largest
Model Parameters
• How many parameters are required to specify the likelihood,
  P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 pixels: 2^900 − 1
    parameters for each class)

• The problem with explicitly modeling P(X1, …, Xn | Y) is that
  there are usually way too many parameters:
  – We’ll run out of space
  – We’ll run out of time
  – And we’ll need tons of training data (which is usually not
    available)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:

    P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

  – More generally:

    P(X1, …, Xn | Y) = Πi P(Xi | Y)

• How many parameters now?
  – Suppose X is composed of n binary features: only n parameters
    per class (one P(Xi = 1 | Y = y) per feature), instead of 2^n − 1
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent
    features X given the class Y
  – For each Xi, we have the
    likelihood P(Xi | Y)

  (Graphical model: Y → X1, X2, …, Xn)

• Decision rule:

  y* = arg maxy P(y) Πi P(xi | y)

If the conditional independence assumption holds, NB is the optimal classifier!
(but the assumptions typically don’t hold)
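
A minimal sketch of this decision rule in Python. The spam/ham tables below are made-up toy values, not the digit model from the slides:

```python
import math

def naive_bayes_predict(x, prior, likelihood):
    """Return arg max_y [log P(y) + sum_i log P(x_i | y)].
    x: dict feature -> value; prior: dict class -> P(Y=y);
    likelihood[y][feature][value] = P(X_feature = value | Y = y)."""
    best_class, best_score = None, -math.inf
    for y, p_y in prior.items():
        score = math.log(p_y)
        for feat, val in x.items():
            score += math.log(likelihood[y][feat][val])
        if score > best_score:
            best_class, best_score = y, score
    return best_class

# Toy binary-feature example.
prior = {"spam": 0.3, "ham": 0.7}
likelihood = {
    "spam": {"has_link": {1: 0.8, 0: 0.2}, "all_caps": {1: 0.6, 0: 0.4}},
    "ham":  {"has_link": {1: 0.2, 0: 0.8}, "all_caps": {1: 0.1, 0: 0.9}},
}
print(naive_bayes_predict({"has_link": 1, "all_caps": 1}, prior, likelihood))  # spam
```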
A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9


Are the naïve Bayes assumptions realistic here?
What has to be learned?

  The prior P(Y) over digits and, for each pixel feature, P(F = on | Y).
  Shown below for two example pixel features F1 and F2:

  Y    P(Y)    P(F1 = on | Y)    P(F2 = on | Y)
  1    0.1     0.01              0.05
  2    0.1     0.05              0.01
  3    0.1     0.05              0.90
  4    0.1     0.30              0.80
  5    0.1     0.80              0.90
  6    0.1     0.90              0.90
  7    0.1     0.05              0.25
  8    0.1     0.60              0.85
  9    0.1     0.50              0.60
  0    0.1     0.80              0.80
MLE for the parameters of NB
• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b

• MLE for discrete NB, simply:
  – Prior:

    P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)

  – Observation distribution:

    P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σx′ Count(Xi = x′, Y = y)
MLE for the parameters of NB
• Training amounts to, for each of the classes, averaging all of
  the examples together: for binary features, P(Xi = 1 | Y = y) is just the
  fraction of class-y examples with Xi = 1

MAP estimation for NB
• Given dataset
  – Count(A=a, B=b) = number of examples where A=a and B=b

• MAP estimation for discrete NB, simply (see the code sketch after this slide):
  – Prior:

    P(Y = y) = Count(Y = y) / Σy′ Count(Y = y′)

  – Observation distribution:

    P(Xi = x | Y = y) = ( Count(Xi = x, Y = y) + a ) / ( Σx′ Count(Xi = x′, Y = y) + |Xi|·a )

• Called “smoothing”. Corresponds to a Dirichlet prior!
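
A sketch of these counting formulas in Python, with the smoothing count a from the MAP version (the tiny dataset is hypothetical; a = 1 gives the common “add-one” smoothing, and a = 0 recovers the plain MLE counts):

```python
from collections import Counter

def nb_fit(X, y, a=1.0):
    """MAP estimates for a discrete naive Bayes model.
    X: list of dicts feature -> value, y: list of class labels, a: smoothing count."""
    class_counts = Counter(y)
    n = len(y)
    prior = {c: class_counts[c] / n for c in class_counts}

    # Count(X_i = x, Y = y) for every feature/value/class, and the value set of each feature.
    joint_counts = {}
    feature_values = {}
    for xi, yi in zip(X, y):
        for feat, val in xi.items():
            joint_counts[(feat, val, yi)] = joint_counts.get((feat, val, yi), 0) + 1
            feature_values.setdefault(feat, set()).add(val)

    def cond_prob(feat, val, c):
        num = joint_counts.get((feat, val, c), 0) + a
        den = class_counts[c] + a * len(feature_values[feat])  # |X_i| possible values
        return num / den

    return prior, cond_prob

X = [{"f1": 1, "f2": 0}, {"f1": 1, "f2": 1}, {"f1": 0, "f2": 0}]
y = ["+", "+", "-"]
prior, cond_prob = nb_fit(X, y)
print(prior, cond_prob("f1", 1, "+"))  # {'+': 0.667, '-': 0.333}, (2+1)/(2+2) = 0.75
```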


What about if there is missing data?
• One of the key strengths of Bayesian approaches is that
  they can naturally handle missing data

Missing data
• Suppose we don’t have a value for some attribute Xi
  – applicant’s credit history unknown
  – some medical test not performed on the patient
  – how to compute P(X1=x1 … Xj=? … Xd=xd | y)?

• Easy with Naïve Bayes
  – ignore the attribute in the instance
    where its value is missing
  – compute the likelihood based on observed attributes
  – no need to “fill in” or explicitly model missing values
  – based on conditional independence between attributes

[Slide from Victor Lavrenko and Nigel Goddard]
Naive Bayes = Linear Classifier
Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N].
Then, the Naive Bayes classifier is defined by

  x ↦ sgn(w · x + b),

where

  wi = log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] )

and

  b = log ( Pr[+1] / Pr[−1] ) + Σi=1..N log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

Proof: observe that for any i ∈ [1, N],

  log ( Pr[xi | +1] / Pr[xi | −1] )
    = [ log ( Pr[xi = 1 | +1] / Pr[xi = 1 | −1] ) − log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ) ] · xi
      + log ( Pr[xi = 0 | +1] / Pr[xi = 0 | −1] ).

[Slide from Mehryar Mohri]
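
A numerical sketch of this construction (the probabilities below are made up): build w and b from the class-conditional probabilities and check that w · x + b equals the naive Bayes log-odds for every binary input.

```python
import math

# Toy binary NB model with classes +1 / -1 and two binary features.
p_class = {+1: 0.5, -1: 0.5}
p_x1 = {  # p_x1[y][i] = Pr[x_i = 1 | y]
    +1: [0.9, 0.2],
    -1: [0.3, 0.7],
}

N = 2
w = [math.log(p_x1[+1][i] / p_x1[-1][i])
     - math.log((1 - p_x1[+1][i]) / (1 - p_x1[-1][i])) for i in range(N)]
b = math.log(p_class[+1] / p_class[-1]) + sum(
    math.log((1 - p_x1[+1][i]) / (1 - p_x1[-1][i])) for i in range(N))

def nb_log_odds(x):
    """log Pr[+1 | x] - log Pr[-1 | x] computed directly from the NB model."""
    s = math.log(p_class[+1] / p_class[-1])
    for i in range(N):
        p_pos = p_x1[+1][i] if x[i] == 1 else 1 - p_x1[+1][i]
        p_neg = p_x1[-1][i] if x[i] == 1 else 1 - p_x1[-1][i]
        s += math.log(p_pos / p_neg)
    return s

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    linear = b + sum(w[i] * x[i] for i in range(N))
    assert abs(linear - nb_log_odds(x)) < 1e-12   # the two agree exactly
    print(x, math.copysign(1, linear))            # sgn(w.x + b)
```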
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression

[Next several slides adapted from:


Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]
Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
  – Linear classifier? On one side we say P(Y=1|X)=1, and on
    the other P(Y=1|X)=0
  – But, this is not differentiable (hard to learn)… doesn’t
    allow for label noise...

  (Figure: a hard threshold decision boundary, with P(Y=1)=1 on one side and
  P(Y=1)=0 on the other)
(For the class-conditional variances, the minimum variance unbiased estimator
(MVUE) is sometimes used instead:

  σ̂²_ik = ( 1 / (Σj δ(Y^j = yk) − 1) ) Σj (Xi^j − µ̂_ik)² δ(Y^j = yk)        (15) )

Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
• Sigmoid applied to a linear function of the data

  Logistic function (Sigmoid):  1 / (1 + e^(−z))

Logistic Regression is an approach to learning functions of the form f : X → Y, or
P(Y|X), in the case where Y is discrete-valued, and X = ⟨X1 … Xn⟩ is any vector
containing discrete or continuous variables. In this section we will primarily consider
the case where Y is a boolean variable, in order to simplify notation. In the final
section we extend our treatment to the case where Y takes on any finite number
of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X),
then directly estimates its parameters from the training data. The parametric
model assumed by Logistic Regression in the case where Y is boolean is:

  P(Y = 1|X) = 1 / (1 + exp(w0 + Σi=1..n wi Xi))                            (16)

  P(Y = 0|X) = exp(w0 + Σi=1..n wi Xi) / (1 + exp(w0 + Σi=1..n wi Xi))      (17)

  Features can be discrete or continuous!

Notice that equation (17) follows directly from equation (16), because the sum of
these two probabilities must equal 1.
Logistic Function in n Dimensions

  Sigmoid applied to a linear function of the data:

  P(Y = 1 | X) = 1 / (1 + exp(w0 + Σi wi Xi))

  Features can be discrete or continuous!

  (Figure: 3D surface plot of the sigmoid of a linear function of two inputs x1, x2)
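
A minimal sketch of equations (16)–(17) in Python. Note the sign convention from the text above, which places +(w0 + Σ wi Xi) inside the exponential, so larger w·x makes Y = 1 less probable; the weights below are arbitrary illustrative values:

```python
import math

def p_y1_given_x(w0, w, x):
    """Equation (16): P(Y=1 | X) = 1 / (1 + exp(w0 + sum_i w_i X_i))."""
    z = w0 + sum(wi * xi for wi, xi in zip(w, x))
    return 1.0 / (1.0 + math.exp(z))

def p_y0_given_x(w0, w, x):
    """Equation (17): follows from (16) since the two probabilities sum to 1."""
    return 1.0 - p_y1_given_x(w0, w, x)

w0, w = -1.0, [2.0, -0.5]   # arbitrary illustrative weights
x = [0.3, 1.0]              # features can be discrete or continuous
print(p_y1_given_x(w0, w, x), p_y0_given_x(w0, w, x))
```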
Logistic Regression: decision boundary

  P(Y = 1|X) = 1 / (1 + exp(w0 + Σi=1..n wi Xi))
  P(Y = 0|X) = exp(w0 + Σi=1..n wi Xi) / (1 + exp(w0 + Σi=1..n wi Xi))

  (Figure: form of the logistic function. In Logistic Regression, P(Y|X) is
  assumed to follow this form.)

• Prediction: Output the Y with highest P(Y|X)
  – For binary Y, output Y = 0 if P(Y = 0|X) / P(Y = 1|X) > 1

One highly convenient property of this form for P(Y|X) is that it leads to a
simple linear expression for classification. To classify any given X we generally
want to assign the value yk that maximizes P(Y = yk|X). Put another way, we
assign the label Y = 0 if the following condition holds:

  1 < P(Y = 0|X) / P(Y = 1|X)

Substituting from equations (16) and (17), this becomes

  1 < exp(w0 + Σi=1..n wi Xi)

Taking the natural log of both sides, we have a linear classification rule that
assigns label Y = 0 if X satisfies

  0 < w0 + Σi=1..n wi Xi                                                    (18)

and assigns Y = 1 otherwise.

  A Linear Classifier!  (decision boundary: w·X + w0 = 0)

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is
the form implied by the assumptions of a Gaussian Naive Bayes classifier.
Therefore, we can view Logistic Regression as a closely related alternative to GNB.
Likelihood vs. Conditional Likelihood
Generative (Naïve Bayes) maximizes Data likelihood:

  Σj ln P(x^j, y^j | w)

Discriminative (Logistic Regr.) maximizes Conditional Data Likelihood:

  Σj ln P(y^j | x^j, w)

Focuses only on learning P(Y|X) – all that matters for classification
Maximizing Conditional Log Likelihood

  l(w) ≡ ln Πj P(y^j | x^j, w)
       = Σj [ y^j (w0 + Σi wi xi^j) − ln (1 + exp(w0 + Σi wi xi^j)) ]

  (each label y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!
  → No local maxima
  → Concave functions are easy to optimize
Optimizing concave function –
Gradient ascent
• Conditional likelihood for Logistic Regression is concave!

  Gradient:  ∇w l(w) = [ ∂l(w)/∂w0 , …, ∂l(w)/∂wn ]

  Update rule (learning rate η > 0):  w ← w + η ∇w l(w)
Maximize Conditional Log Likelihood: Gradient ascent

  l(w) = Σj ln P(y^j | x^j, w)
       = Σj [ y^j ln ( exp(w0 + Σi wi xi^j) / (1 + exp(w0 + Σi wi xi^j)) )
              + (1 − y^j) ln ( 1 / (1 + exp(w0 + Σi wi xi^j)) ) ]
       = Σj [ y^j (w0 + Σi wi xi^j) − ln (1 + exp(w0 + Σi wi xi^j)) ]

  ∂l(w)/∂wi = Σj [ ∂/∂wi y^j (w0 + Σi wi xi^j) − ∂/∂wi ln (1 + exp(w0 + Σi wi xi^j)) ]
            = Σj [ y^j xi^j − xi^j exp(w0 + Σi wi xi^j) / (1 + exp(w0 + Σi wi xi^j)) ]
            = Σj xi^j [ y^j − P(Y = 1 | x^j, w) ]

  (here P(Y = 1 | x, w) = exp(w0 + Σi wi xi) / (1 + exp(w0 + Σi wi xi)))

For more than two classes (classes 1, …, r):

  P(Y = k | X) ∝ exp(wk0 + Σi wki Xi)   for k = 1, …, r − 1
  P(Y = r | X) = 1 − Σj=1..r−1 P(Y = j | X)

MAP estimation for Logistic Regression: with a zero-mean Gaussian prior on w,
ln p(w) ∝ −λ Σi wi², which adds an L2 penalty (regularization) to the objective.
Gradient Ascent for LR
Gradient ascent algorithm: (learning rate η > 0)
do:
    w0 ← w0 + η Σj [ y^j − P(Y = 1 | x^j, w) ]
    For i=1 to n: (iterate over features)
        wi ← wi + η Σj xi^j [ y^j − P(Y = 1 | x^j, w) ]
until “change” < ε

Loop over training examples!
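
A sketch of this gradient ascent loop in Python, under the convention P(Y = 1 | x, w) = exp(w0 + w·x) / (1 + exp(w0 + w·x)) used in the derivation above (opposite sign to equation (16)). The tiny dataset, learning rate, and stopping constants are illustrative:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def lr_gradient_ascent(X, y, eta=0.1, eps=1e-6, max_iters=10000):
    """Batch gradient ascent on the conditional log likelihood of logistic regression.
    X: list of feature vectors, y: list of 0/1 labels."""
    n_features = len(X[0])
    w0, w = 0.0, [0.0] * n_features
    for _ in range(max_iters):
        # Prediction error y^j - P(Y=1 | x^j, w) for each training example.
        errors = [yj - sigmoid(w0 + sum(wi * xi for wi, xi in zip(w, xj)))
                  for xj, yj in zip(X, y)]
        grad0 = sum(errors)
        grad = [sum(e * xj[i] for e, xj in zip(errors, X)) for i in range(n_features)]
        w0 += eta * grad0
        for i in range(n_features):                      # iterate over features
            w[i] += eta * grad[i]
        if max(abs(g) for g in [grad0] + grad) < eps:    # "change" < epsilon
            break
    return w0, w

# Tiny toy set; note that on linearly separable data the unregularized weights
# keep growing, so the loop may stop at max_iters rather than at the eps test.
X = [[0.0], [1.0], [2.0], [3.0]]
y = [0, 0, 1, 1]
print(lr_gradient_ascent(X, y))
```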
Naïve Bayes  vs.  Logistic Regression
Learning: h: X → Y    (X – features, Y – target classes)

Generative (Naïve Bayes)
• Assume functional form for
  – P(X|Y)  (assume cond. indep.)
  – P(Y)
  – Est. params from train data
• Gaussian NB for cont. features
• Bayes rule to calc. P(Y|X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative (Logistic Regression)
• Assume functional form for
  – P(Y|X)  (no assumptions)
  – Est. params from training data
• Handles discrete & cont. features
• Directly calculate P(Y|X = x)
  – Can’t generate data sample
Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Asymptotic comparison
  (# training examples → infinity)
  – when model correct
    • NB, Linear Discriminant Analysis (with class-independent
      variances), and Logistic Regression produce identical
      classifiers
  – when model incorrect
    • LR is less biased – does not assume conditional
      independence
      – therefore LR expected to outperform NB
Naïve Bayes vs. Logistic Regression
[Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Non-asymptotic analysis
  – convergence rate of parameter estimates
    (n = # of attributes in X)
    • Size of training data to get close to the infinite-data solution
    • Naïve Bayes needs O(log n) samples
    • Logistic Regression needs O(n) samples
  – Naïve Bayes converges more quickly to its (perhaps
    less helpful) asymptotic estimates
Some experiments from UCI data sets
(Figure: plots of test error vs. training set size for naïve Bayes and logistic
regression on several UCI data sets)
