
Introduction to Bayesian methods

Lecture 10

David Sontag
New York University

Slides adapted from Luke Zettlemoyer, Carlos Guestrin, Dan Klein,
and Vibhav Gogate
Bayesian learning
• Bayesian learning uses probability to model data and quantify uncertainty of predictions
  – Facilitates incorporation of prior knowledge
  – Gives optimal predictions
• Allows for decision-theoretic reasoning
Your first consulting job
• A billionaire from the suburbs of Manhattan asks you a question:
  – He says: I have a thumbtack; if I flip it, what's the probability it will fall with the nail up?
  – You say: Please flip it a few times:
  – You say: The probability is:
• P(heads) = 3/5
  – He says: Why???
  – You say: Because…
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Random Variables
• A random variable is some aspect of the world about
which we (may) have uncertainty
– R = Is it raining?
– D = How long will it take to drive to work?
– L = Where am I?

• We denote random variables with capital letters

• Random variables have domains


– R in {true, false} (sometimes write as {+r, ¬r})
– D in [0, ∞)
– L in possible locations, maybe {(0,0), (0,1), …}
Probability Distributions
• Discrete random variables have distributions

    T     P        W       P
    warm  0.5      sun     0.6
    cold  0.5      rain    0.1
                   fog     0.3
                   meteor  0.0

• A discrete distribution is a TABLE of probabilities of values
• The probability of a state (lower case) is a single number
• Must have: P(x) ≥ 0 for every value x, and Σ_x P(x) = 1
Joint Distributions
• A joint distribution over a set of random variables X_1, …, X_n specifies a real
  number for each assignment: P(x_1, x_2, …, x_n)

    T     W     P
    hot   sun   0.4
    hot   rain  0.1
    cold  sun   0.2
    cold  rain  0.3

  – How many assignments if n variables with domain sizes d?  (d^n)
  – Must obey: P(x_1, …, x_n) ≥ 0 and Σ P(x_1, …, x_n) = 1

• For all but the smallest distributions, impractical to write out or estimate
  – Instead, we make additional assumptions about the distribution
Marginal Distributions
• Marginal distributions are sub-tables which eliminate variables
• Marginalization (summing out): combine collapsed rows by adding

    Joint P(T, W)          Marginals
    T     W     P          T     P                    W     P
    hot   sun   0.4        hot   0.5                  sun   0.6
    hot   rain  0.1        cold  0.5                  rain  0.4
    cold  sun   0.2
    cold  rain  0.3        P(t) = Σ_w P(t, w)         P(w) = Σ_t P(t, w)
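A quick sketch of summing out a variable from this joint table, using plain Python dictionaries (the variable names are just for illustration):

```python
# Joint distribution P(T, W) from the table above
joint = {('hot', 'sun'): 0.4, ('hot', 'rain'): 0.1,
         ('cold', 'sun'): 0.2, ('cold', 'rain'): 0.3}

# Marginalize (sum out) one variable by adding the matching rows
p_t, p_w = {}, {}
for (t, w), p in joint.items():
    p_t[t] = p_t.get(t, 0.0) + p    # P(t) = sum_w P(t, w)
    p_w[w] = p_w.get(w, 0.0) + p    # P(w) = sum_t P(t, w)

print(p_t)   # hot: 0.5, cold: 0.5
print(p_w)   # sun: 0.6, rain: 0.4 (up to floating-point rounding)
```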
Conditional Probabilities
• A simple relation between joint and conditional probabilities
  – In fact, this is taken as the definition of a conditional probability:

        P(a | b) = P(a, b) / P(b)

    T     W     P
    hot   sun   0.4        e.g.  P(W = sun | T = cold) = P(cold, sun) / P(cold)
    hot   rain  0.1                                    = 0.2 / (0.2 + 0.3) = 0.4
    cold  sun   0.2
    cold  rain  0.3
Conditional Distributions
• Conditional distributions are probability distributions over some variables given fixed values of others

    Conditional Distributions        Joint Distribution
    P(W | T = hot)                   T     W     P
    W     P                          hot   sun   0.4
    sun   0.8                        hot   rain  0.1
    rain  0.2                        cold  sun   0.2
                                     cold  rain  0.3
    P(W | T = cold)
    W     P
    sun   0.4
    rain  0.6
The Product Rule
• Sometimes have conditional distributions but want the joint:

        P(d, w) = P(d | w) P(w)

• Example:

    P(D | W)                 P(W)            P(D, W)
    D     W     P            W     P         D     W     P
    wet   sun   0.1          sun   0.8       wet   sun   0.08
    dry   sun   0.9          rain  0.2       dry   sun   0.72
    wet   rain  0.7                          wet   rain  0.14
    dry   rain  0.3                          dry   rain  0.06
Bayes' Rule
• Two ways to factor a joint distribution over two variables:

        P(x, y) = P(x | y) P(y) = P(y | x) P(x)

• Dividing, we get:

        P(x | y) = P(y | x) P(x) / P(y)

• Why is this at all helpful?
  – Lets us build one conditional from its reverse
  – Often one conditional is tricky but the other one is simple
  – Foundation of many practical systems (e.g. ASR, MT)

• In the running for most important ML equation!


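For instance, a minimal sketch of building one conditional from its reverse; the numbers here are made up purely for illustration:

```python
# Assumed (hypothetical) quantities: a prior and the "simple" conditionals
p_rain = 0.2
p_wet_given_rain = 0.9
p_wet_given_dry = 0.1

# Bayes' rule: P(rain | wet) = P(wet | rain) P(rain) / P(wet)
p_wet = p_wet_given_rain * p_rain + p_wet_given_dry * (1 - p_rain)  # normalization constant
p_rain_given_wet = p_wet_given_rain * p_rain / p_wet
print(p_rain_given_wet)   # ~0.69: the "tricky" conditional built from the simple one
```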
Returning to thumbtack example…
• P(Heads) = θ,  P(Tails) = 1 − θ

• Flips are i.i.d.:  D = {x_i | i = 1…n},  P(D | θ) = Π_i P(x_i | θ)
  – Independent events
  – Identically distributed according to Bernoulli distribution

• Sequence D of αH Heads and αT Tails:

        P(D | θ) = θ^αH (1 − θ)^αT

  Called the "likelihood" of the data under the model
Maximum Likelihood Estimation
• Data: Observed set D of αH Heads and αT Tails
• Hypothesis: Bernoulli distribution
• Learning: finding θ is an optimization problem
  – What's the objective function?

• MLE: Choose θ to maximize probability of D
Your first parameter learning algorithm
• Maximize the log-likelihood of the data, L(θ; D) = ln P(D | θ):

        θ̂ = arg max_θ ln P(D | θ) = arg max_θ ln [θ^αH (1 − θ)^αT]

• Set derivative to zero, and solve!

        d/dθ ln P(D | θ) = d/dθ [αH ln θ + αT ln(1 − θ)]
                         = αH / θ − αT / (1 − θ) = 0

  Solving gives

        θ̂_MLE = αH / (αH + αT)

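A minimal sketch of this estimator, assuming the flips are encoded as a 0/1 NumPy array (1 = heads, i.e. nail up):

```python
import numpy as np

flips = np.array([1, 0, 1, 1, 0])          # hypothetical data: 3 heads, 2 tails

alpha_H = flips.sum()
alpha_T = len(flips) - alpha_H
theta_mle = alpha_H / (alpha_H + alpha_T)  # MLE for the Bernoulli parameter

print(theta_mle)                           # 0.6, matching P(heads) = 3/5 above
```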

What if I have prior beliefs?
• Billionaire says: Wait, I know that the thumbtack is "close" to 50-50. What can you do for me now?
• You say: I can learn it the Bayesian way…
• Rather than estimating a single θ, we obtain a distribution over possible values of θ

  [Figure: prior Pr(θ) before any data vs. posterior Pr(θ | D) after observing flips, e.g. D = {tails, tails}]



Bayesian Learning
• Use Bayes' rule!

        P(θ | D) = P(D | θ) P(θ) / P(D)

        Posterior = Data Likelihood × Prior / Normalization

• Or equivalently:

        P(θ | D) ∝ P(D | θ) P(θ)

• For uniform priors, P(θ) ∝ 1, this reduces to maximum likelihood estimation!

        P(θ | D) ∝ P(D | θ)
Bayesian Learning for Thumbtacks

  Likelihood:  P(D | θ) = θ^αH (1 − θ)^αT

• What should the prior be?
  – Represent expert knowledge
  – Simple posterior form

• For binary variables, a commonly used prior is the Beta distribution:

        P(θ) = θ^(βH − 1) (1 − θ)^(βT − 1) / B(βH, βT)    i.e.  θ ~ Beta(βH, βT)

Beta prior distribution – P(θ)

• Since the Beta distribution is conjugate to the Bernoulli distribution, the posterior
  distribution has a particularly simple form:

        P(θ | D) ∝ θ^αH (1 − θ)^αT · θ^(βH − 1) (1 − θ)^(βT − 1)
                 = θ^(αH + βH − 1) (1 − θ)^(αT + βT − 1)
                 = Beta(αH + βH, αT + βT)
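A minimal sketch of this conjugate update using SciPy; the hyperparameters βH = βT = 10 are just an illustrative way to encode "close to 50-50":

```python
from scipy import stats

beta_H, beta_T = 10, 10        # hypothetical prior pseudo-counts
alpha_H, alpha_T = 3, 2        # observed heads / tails from the thumbtack flips

# Conjugacy: the posterior is again a Beta distribution
posterior = stats.beta(alpha_H + beta_H, alpha_T + beta_T)

print(posterior.mean())          # posterior mean of theta, pulled toward 0.5
print(posterior.interval(0.95))  # 95% credible interval for theta
```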
Using Bayesian inference for prediction
• We now have a distribution over parameters
• For any specific f, a function of interest, compute the expected value of f:

        E[f(θ)] = ∫ f(θ) P(θ | D) dθ

• Integral is often hard to compute

• As more data is observed, the posterior becomes more concentrated
• MAP (Maximum a posteriori approximation): use the most likely parameter to approximate the expectation:

        θ̂_MAP = arg max_θ P(θ | D),    E[f(θ)] ≈ f(θ̂_MAP)
What about continuous variables?
• Billionaire says: If I am measuring a continuous variable, what can you do for me?
• You say: Let me tell you about Gaussians…
Some properties of Gaussians
• Affine transformations (multiplying by a scalar and adding a constant) of a Gaussian are Gaussian
  – X ~ N(µ, σ²)
  – Y = aX + b  ⇒  Y ~ N(aµ + b, a²σ²)

• Sum of (independent) Gaussians is Gaussian
  – X ~ N(µX, σ²X)
  – Y ~ N(µY, σ²Y)
  – Z = X + Y  ⇒  Z ~ N(µX + µY, σ²X + σ²Y)

• Easy to differentiate, as we will see soon!
Learning a Gaussian
• Collect a bunch of data
  – Hopefully, i.i.d. samples
  – e.g., exam scores

• Learn parameters
  – µ ("mean")
  – σ² ("variance")

    i     Exam score x_i
    0     85
    1     95
    2     100
    3     12
    …     …
    99    89
MLE for Gaussian:
• Prob. of i.i.d. samples D = {x_1, …, x_N}:

        P(D | µ, σ) = (1 / (σ√(2π)))^N Π_{i=1}^N exp(−(x_i − µ)² / (2σ²))

        µ_MLE, σ_MLE = arg max_{µ,σ} P(D | µ, σ)

• Log-likelihood of data:

        ln P(D | µ, σ) = −N ln(σ√(2π)) − Σ_{i=1}^N (x_i − µ)² / (2σ²)
Your second learning algorithm:
MLE for mean of a Gaussian
• What's the MLE for the mean?

        µ_MLE, σ_MLE = arg max_{µ,σ} P(D | µ, σ)

• Set the derivative with respect to µ to zero, and solve:

        d/dµ ln P(D | µ, σ) = Σ_{i=1}^N (x_i − µ) / σ² = 0

        ⇒  Σ_{i=1}^N x_i − Nµ = 0    ⇒    µ_MLE = (1/N) Σ_{i=1}^N x_i
MLE for variance
• Again, set the derivative to zero, this time with respect to σ:

        d/dσ ln P(D | µ, σ) = −N/σ + Σ_{i=1}^N (x_i − µ)² / σ³ = 0

        ⇒  σ²_MLE = (1/N) Σ_{i=1}^N (x_i − µ_MLE)²
Learning Gaussian parameters
• MLE:

        µ_MLE = (1/N) Σ_{i=1}^N x_i        σ²_MLE = (1/N) Σ_{i=1}^N (x_i − µ_MLE)²

• MLE for the variance of a Gaussian is biased
  – Expected result of estimation is not the true parameter!
  – Unbiased variance estimator:

        σ²_unbiased = (1/(N − 1)) Σ_{i=1}^N (x_i − µ_MLE)²
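A small sketch of these estimators with NumPy, using made-up exam scores:

```python
import numpy as np

scores = np.array([85, 95, 100, 12, 89], dtype=float)   # hypothetical i.i.d. samples

mu_mle = scores.mean()               # (1/N) * sum(x_i)
var_mle = scores.var(ddof=0)         # biased MLE: divides by N
var_unbiased = scores.var(ddof=1)    # unbiased estimator: divides by N - 1

print(mu_mle, var_mle, var_unbiased)
```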
Bayesian learning of Gaussian parameters
• Conjugate priors
  – Mean: Gaussian prior
  – Variance: Wishart distribution

• Prior for mean: a Gaussian over µ
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression
Bayesian Classification
• Problem statement:
  – Given features X1, X2, …, Xn
  – Predict a label Y

[Next several slides adapted from:


Vibhav Gogate, Jonathan Huang, Luke Zettlemoyer, Carlos
Guestrin, and Dan Weld]
Example Application
• Digit Recognition

    [image of a handwritten digit]  →  Classifier  →  5

• X1, …, Xn ∈ {0,1} (Black vs. White pixels)
• Y ∈ {0,1,2,3,4,5,6,7,8,9}
The Bayes Classifier
• If we had the joint distribution on X1, …, Xn and Y, we could predict using:

        arg max_y P(Y = y | X1 = x1, …, Xn = xn)

  – (for example: what is the probability that the image represents a 5 given its pixels?)

• So … how do we compute that?
The Bayes Classifier
• Use Bayes' Rule!

        P(Y | X1, …, Xn) = P(X1, …, Xn | Y) P(Y) / P(X1, …, Xn)
                           (Likelihood × Prior / Normalization Constant)

• Why did this help? Well, we think that we might be able to specify how features are "generated" by the class label
The Bayes Classifier
• Let's expand this for our digit recognition task:

        P(Y = y | X1, …, Xn) = P(X1, …, Xn | Y = y) P(Y = y) / Σ_{y'} P(X1, …, Xn | Y = y') P(Y = y')

• To classify, we'll simply compute these probabilities, one per class, and predict based on which one is largest
Model Parameters
• How many parameters are required to specify the likelihood, P(X1, …, Xn | Y)?
  – (Supposing that each image is 30x30 binary pixels: on the order of 2^900 parameters per class)

• The problem with explicitly modeling P(X1, …, Xn | Y) is that there are usually way too many parameters:
  – We'll run out of space
  – We'll run out of time
  – And we'll need tons of training data (which is usually not available)
Naïve Bayes
• Naïve Bayes assumption:
  – Features are independent given class:

        P(X1, X2 | Y) = P(X1 | Y) P(X2 | Y)

  – More generally:

        P(X1, …, Xn | Y) = Π_{i=1}^n P(Xi | Y)

• How many parameters now?
• Suppose X is composed of n binary features: just n parameters per class (one per feature), instead of roughly 2^n
The Naïve Bayes Classifier
• Given:
  – Prior P(Y)
  – n conditionally independent features X given the class Y
  – For each Xi, we have likelihood P(Xi | Y)

    [Graphical model: class node Y with an arrow to each feature node X1, X2, …, Xn]

• Decision rule:

        y* = arg max_y P(y) Π_{i=1}^n P(xi | y)

  If a certain assumption (conditional independence) holds, NB is the optimal classifier!
  (the assumptions typically don't hold)
A Digit Recognizer

• Input: pixel grids

• Output: a digit 0-9


Are the naïve Bayes assumptions realistic here?
What has to be learned?
• The class prior P(Y) and, for each pixel feature, the conditional P(Xi = on | Y); for example:

    P(Y)           P(Xi = on | Y)      P(Xj = on | Y)
    1   0.1        1   0.01            1   0.05
    2   0.1        2   0.05            2   0.01
    3   0.1        3   0.05            3   0.90
    4   0.1        4   0.30            4   0.80
    5   0.1        5   0.80            5   0.90
    6   0.1        6   0.90            6   0.90
    7   0.1        7   0.05            7   0.25
    8   0.1        8   0.60            8   0.85
    9   0.1        9   0.50            9   0.60
    0   0.1        0   0.80            0   0.80
MLE for the parameters of NB
• Given dataset
  – Count(A=a, B=b) ← number of examples where A=a and B=b

• MLE for discrete NB, simply:
  – Prior:

        P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')

  – Observation distribution:

        P(Xi = x | Y = y) = Count(Xi = x, Y = y) / Σ_{x'} Count(Xi = x', Y = y)
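A minimal counting-based sketch of these estimates, assuming binary features stored in a NumPy array (the function name and data layout are just for illustration):

```python
import numpy as np

def train_nb(X, y):
    """MLE for a discrete (binary-feature) naive Bayes model, purely by counting.
    X: (N, n) array of 0/1 features, y: length-N array of class labels.
    No smoothing yet -- see the MAP estimation slide below."""
    X, y = np.asarray(X), np.asarray(y)
    classes = np.unique(y)
    prior = {c: np.mean(y == c) for c in classes}         # P(Y=c) = Count(Y=c) / N
    cond = {c: X[y == c].mean(axis=0) for c in classes}   # cond[c][i] = P(Xi=1 | Y=c)
    return prior, cond
```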
MLE for the parameters of NB
• Training amounts to, for each of the classes, averaging all of the examples together:
MAP estimation for NB
• Given dataset
  – Count(A=a, B=b) ← number of examples where A=a and B=b

• MAP estimation for discrete NB, simply:
  – Prior:

        P(Y = y) = Count(Y = y) / Σ_{y'} Count(Y = y')

  – Observation distribution:

        P(Xi = x | Y = y) = (Count(Xi = x, Y = y) + a) / (Σ_{x'} Count(Xi = x', Y = y) + |Xi|·a)

• Called "smoothing". Corresponds to a Dirichlet prior!
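A tiny sketch of the smoothed estimate, where a is the pseudo-count and |Xi| the number of values feature Xi can take:

```python
def smoothed_conditional(count_xy, count_y, n_values, a=1.0):
    """MAP estimate of P(Xi = x | Y = y) with pseudo-count a (Laplace smoothing for a = 1)."""
    return (count_xy + a) / (count_y + n_values * a)

# A value never seen with class y still gets nonzero probability:
print(smoothed_conditional(count_xy=0, count_y=50, n_values=2))   # 1/52, not 0
```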


What about if there is missing data?
• One of the key strengths of Bayesian approaches is that they can naturally handle missing data

• Suppose we don't have a value for some attribute Xi
  – applicant's credit history unknown
  – some medical test not performed on the patient
  – how to compute P(X1=x1 … Xj=? … Xd=xd | y)?

• Easy with Naïve Bayes
  – ignore the attribute in the instance where its value is missing
  – compute the likelihood based on observed attributes
  – no need to "fill in" or explicitly model missing values
  – based on conditional independence between attributes

[Slide from Victor Lavrenko and Nigel Goddard]
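A small sketch of that idea for binary features, reusing the prior/cond structure from the earlier training sketch (the helper name is illustrative):

```python
import math

def nb_log_score(x, prior_c, cond_c):
    """log[ P(Y=c) * prod_i P(xi | Y=c) ], skipping features whose value is missing (None).
    x: dict mapping feature index -> 0, 1, or None; prior_c: P(Y=c); cond_c[i]: P(Xi=1 | Y=c)."""
    score = math.log(prior_c)
    for i, v in x.items():
        if v is None:
            continue                 # missing attribute: simply drop its factor
        p1 = cond_c[i]
        score += math.log(p1 if v == 1 else 1.0 - p1)
    return score
```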
Naive Bayes = Linear Classifier
Theorem: assume that xi ∈ {0, 1} for all i ∈ [1, N]. Then, the Naive Bayes classifier is
defined by

    x ↦ sgn(w · x + b),

where

    wi = log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1])

and

    b = log (Pr[+1] / Pr[−1]) + Σ_{i=1}^N log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]).

Proof: observe that for any i ∈ [1, N],

    log (Pr[xi | +1] / Pr[xi | −1])
      = [ log (Pr[xi = 1 | +1] / Pr[xi = 1 | −1]) − log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]) ] xi
        + log (Pr[xi = 0 | +1] / Pr[xi = 0 | −1]).

[Slide from Mehryar Mohri]
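A small sketch of this conversion, assuming the per-feature conditionals are stored as NumPy arrays (all names here are illustrative):

```python
import numpy as np

def nb_to_linear(p_x1_pos, p_x1_neg, prior_pos, prior_neg):
    """Convert a binary naive Bayes model into linear weights (w, b) as in the theorem above.
    p_x1_pos[i] = Pr[xi = 1 | +1], p_x1_neg[i] = Pr[xi = 1 | -1]."""
    p_x0_pos, p_x0_neg = 1.0 - p_x1_pos, 1.0 - p_x1_neg
    w = np.log(p_x1_pos / p_x1_neg) - np.log(p_x0_pos / p_x0_neg)
    b = np.log(prior_pos / prior_neg) + np.log(p_x0_pos / p_x0_neg).sum()
    return w, b   # predict sign(w @ x + b)
```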
Outline of lecture
• Review of probability
• Maximum likelihood estimation
• 2 examples of Bayesian classifiers:
  – Naïve Bayes
  – Logistic regression

[Next several slides adapted from:


Vibhav Gogate, Luke Zettlemoyer, Carlos Guestrin, and Dan Weld]
Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form
  – Linear classifier? On one side we say P(Y=1|X)=1, and on the other P(Y=1|X)=0
  – But, this is not differentiable (hard to learn)… doesn't allow for label noise…

  [Figure: a linear boundary with P(Y=1)=1 on one side and P(Y=1)=0 on the other]
…estimator (MVUE) is sometimes used instead. It is

    σ̂²_ik = (1 / (Σ_j δ(Y^j = y_k) − 1)) Σ_j (X_i^j − µ̂_ik)² δ(Y^j = y_k)        (15)
Logistic Regression
• Learn P(Y|X) directly!
• Assume a particular functional form: a sigmoid applied to a linear function of the data

  Logistic function (Sigmoid):  σ(z) = 1 / (1 + e^(−z))

  Features can be discrete or continuous!

Logistic Regression is an approach to learning functions of the form f : X → Y, in the case
where Y is discrete-valued, and X = ⟨X1 . . . Xn⟩ is any vector containing discrete or
continuous variables. In this section we will primarily consider the case where Y is a
boolean variable, in order to simplify notation. In a later section we extend our treatment
to the case where Y takes on any finite number of discrete values.

Logistic Regression assumes a parametric form for the distribution P(Y|X), then directly
estimates its parameters from the training data. The parametric model assumed by Logistic
Regression in the case where Y is boolean is:

        P(Y = 1|X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))                                  (16)

        P(Y = 0|X) = exp(w0 + Σ_{i=1}^n wi Xi) / (1 + exp(w0 + Σ_{i=1}^n wi Xi))          (17)

Note that equation (17) follows directly from equation (16), because the sum of these two
probabilities must equal 1.

One highly convenient property of this form for P(Y|X) is that it leads to a simple linear
expression for classification.
Logistic Function in n Dimensions

  Sigmoid applied to a linear function of the data:

        P(Y = 1 | X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

  Features can be discrete or continuous!

  [Figure: 3-D surface of the sigmoid of a linear function of two inputs x1, x2]
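A tiny sketch of evaluating this model on one feature vector; the weights and features are made up. Note that equation (16) above puts the linear term inside exp with a plus sign, whereas many references write the equivalent P(Y=1|X) = σ(w0 + w·x), which just flips the signs of the weights:

```python
import numpy as np

w0, w = -1.0, np.array([2.0, -0.5])    # hypothetical weights
x = np.array([1.0, 3.0])               # features may be discrete or continuous

z = w0 + w @ x                         # linear function of the data
p_y1 = 1.0 / (1.0 + np.exp(z))         # P(Y = 1 | X = x), as in equation (16)
p_y0 = np.exp(z) / (1.0 + np.exp(z))   # equation (17); equals 1 - p_y1
print(p_y1, p_y0)
```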
Logistic Regression: decision boundary

        P(Y = 1|X) = 1 / (1 + exp(w0 + Σ_{i=1}^n wi Xi))
        P(Y = 0|X) = exp(w0 + Σ_{i=1}^n wi Xi) / (1 + exp(w0 + Σ_{i=1}^n wi Xi))

• Prediction: Output the Y with highest P(Y|X)
  – For binary Y, output Y = 0 if

        1 < P(Y = 0|X) / P(Y = 1|X)

    Substituting from equations (16) and (17), this becomes

        1 < exp(w0 + Σ_{i=1}^n wi Xi)

    Taking the natural log of both sides, we have a linear classification rule that assigns
    label Y = 0 if X satisfies

        0 < w0 + Σ_{i=1}^n wi Xi                                                          (18)

    and assigns Y = 1 otherwise.

  A Linear Classifier!

  [Figure: the linear decision boundary w·X + w0 = 0 separating the two classes]

Interestingly, the parametric form of P(Y|X) used by Logistic Regression is precisely the
form implied by the assumptions of a Gaussian Naive Bayes classifier. Therefore, we can
view Logistic Regression as a closely related alternative to Gaussian Naive Bayes.
Likelihood vs. Conditional Likelihood

Generative (Naïve Bayes) maximizes the Data likelihood:

        ln P(D | w) = Σ_j ln P(x^j, y^j | w)

Discriminative (Logistic Regr.) maximizes the Conditional Data Likelihood:

        ln P(D_Y | D_X, w) = Σ_j ln P(y^j | x^j, w)

Focuses only on learning P(Y|X) – all that matters for classification
Maximizing Conditional Log Likelihood

        l(w) ≡ ln Π_j P(y^j | x^j, w)
             = Σ_j [ y^j (w0 + Σ_i wi x_i^j) − ln(1 + exp(w0 + Σ_i wi x_i^j)) ]

  (each label y^j is 0 or 1!)

Bad news: no closed-form solution to maximize l(w)

Good news: l(w) is a concave function of w!
  – No local maxima
  – Concave functions are easy to optimize
Optimizing a concave function – Gradient ascent
• Conditional likelihood for Logistic Regression is concave!

  Gradient:      ∇_w l(w) = [∂l(w)/∂w0, …, ∂l(w)/∂wn]

  Update rule (learning rate η > 0):      w_i ← w_i + η ∂l(w)/∂w_i
Maximize Conditional Log Likelihood: Gradient ascent

        l(w) = Σ_j ln P(y^j | x^j, w)
             = Σ_j [ y^j ln P(Y = 1 | x^j, w) + (1 − y^j) ln P(Y = 0 | x^j, w) ]
             = Σ_j [ y^j (w0 + Σ_i wi x_i^j) − ln(1 + exp(w0 + Σ_i wi x_i^j)) ]

        ∂l(w)/∂wi = Σ_j [ y^j x_i^j − x_i^j exp(w0 + Σ_i wi x_i^j) / (1 + exp(w0 + Σ_i wi x_i^j)) ]
                  = Σ_j x_i^j ( y^j − P(Y^j = 1 | x^j, w) )

• For more than 2 classes: P(Y = k|X) ∝ exp(w_k0 + Σ_i w_ki Xi) for k = 1, …, r − 1, and
  P(Y = r|X) = 1 − Σ_{j=1}^{r−1} P(Y = j|X)
• For MAP (regularized) estimation, add a prior on the weights, e.g. ln p(w) ∝ −λ Σ_i wi²
Gradient Ascent for LR
Gradient ascent algorithm: (learning rate η > 0)

do:
        w0 ← w0 + η Σ_j [ y^j − P(Y^j = 1 | x^j, w) ]
    For i = 1 to n: (iterate over features)
        wi ← wi + η Σ_j x_i^j [ y^j − P(Y^j = 1 | x^j, w) ]
until "change" < ε

Loop over training examples!
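A minimal batch-gradient-ascent sketch of this loop, assuming 0/1 labels and the convention P(Y=1|x, w) = exp(w0 + w·x) / (1 + exp(w0 + w·x)) used in the gradient above (names and defaults are illustrative):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def logistic_gradient_ascent(X, y, eta=0.01, eps=1e-5, max_iter=10000):
    """Batch gradient ascent on the conditional log-likelihood l(w).
    X: (N, n) feature matrix, y: length-N array of 0/1 labels."""
    N, n = X.shape
    Xb = np.hstack([np.ones((N, 1)), X])        # prepend 1s so w[0] plays the role of w0
    w = np.zeros(n + 1)
    for _ in range(max_iter):
        p = sigmoid(Xb @ w)                     # P(Y^j = 1 | x^j, w) for every example
        w_new = w + eta * (Xb.T @ (y - p))      # w_i <- w_i + eta * sum_j x_i^j (y^j - p^j)
        if np.max(np.abs(w_new - w)) < eps:     # stop when "change" < epsilon
            return w_new
        w = w_new
    return w
```

A small η (and standardized features) may be needed for this un-normalized gradient step to behave well.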
Naïve Bayes  vs.  Logistic Regression
Learning: h: X → Y      (X – features, Y – target classes)

Generative (Naïve Bayes):
• Assume functional form for
  – P(X|Y)  (assume cond. indep.)
  – P(Y)
  – Est. params from train data
• Gaussian NB for cont. features
• Bayes rule to calc. P(Y|X = x):
  – P(Y | X) ∝ P(X | Y) P(Y)
• Indirect computation
  – Can generate a sample of the data
  – Can easily handle missing data

Discriminative (Logistic Regression):
• Assume functional form for
  – P(Y|X)  (no assumptions)
  – Est. params from training data
• Handles discrete & cont. features
• Directly calculate P(Y|X = x)
  – Can't generate a data sample
Naïve Bayes vs. Logistic Regression                          [Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Asymptotic comparison (# training examples → infinity)
  – when model correct
    • NB, Linear Discriminant Analysis (with class-independent variances), and Logistic
      Regression produce identical classifiers
  – when model incorrect
    • LR is less biased – does not assume conditional independence
      – therefore LR is expected to outperform NB
Naïve Bayes vs. Logistic Regression                          [Ng & Jordan, 2002]

• Generative vs. Discriminative classifiers

• Non-asymptotic analysis
  – convergence rate of parameter estimates (n = # of attributes in X)
    • Size of training data needed to get close to the infinite-data solution
    • Naïve Bayes needs O(log n) samples
    • Logistic Regression needs O(n) samples
  – Naïve Bayes converges more quickly to its (perhaps less helpful) asymptotic estimates
Some experiments from UCI data sets

  [Figure: learning curves comparing Naïve Bayes and Logistic Regression on several UCI data sets]

©Carlos Guestrin 2005-2009
