Unit 3: Bayesian Logistic Regression

Bayesian logistic regression models the probability of class membership p(C|φ) as a sigmoid function of a linear predictor. Exact Bayesian inference is intractable, so Laplace approximation is used to fit a Gaussian distribution q(w) approximating the posterior p(w|t). The predictive distribution is obtained by convolving this Gaussian with the sigmoid, which is approximated using the probit function for computational simplicity.

Bayesian Logistic Regression

• Logistic regression is a discriminative probabilistic linear classifier: $p(C_1 | \phi) = \sigma(w^T \phi)$
• Exact Bayesian inference for logistic regression, $p(C_1 | \phi) = \int \sigma(w^T \phi)\, p(w|t)\, dw$, is intractable, because:
1. Evaluation of the posterior distribution p(w|t)
   – Needs normalization of the prior $p(w) = N(w \mid m_0, S_0)$ times the likelihood (a product of sigmoids) $p(t|w) = \prod_{n=1}^{N} y_n^{t_n} (1 - y_n)^{1 - t_n}$
   • Solution: use the Laplace approximation to get a Gaussian $q(w)$
2. Evaluation of the predictive distribution $p(C_1 | \phi) \simeq \int \sigma(w^T \phi)\, q(w)\, dw$
   – Convolution of a sigmoid and a Gaussian
   • Solution: approximate the sigmoid by a probit
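For reference, the predictive integral can always be estimated by brute-force Monte Carlo over samples of w; this self-contained sketch with made-up numbers is only meant to make the target of the closed-form approximations below concrete, and is not part of the slides.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

rng = np.random.default_rng(0)
phi = np.array([1.0, 2.0])                  # feature vector phi(x), illustrative
w_mean, w_cov = np.zeros(2), np.eye(2)      # stand-in Gaussian q(w), illustrative

# p(C1 | phi) = integral of sigma(w^T phi) q(w) dw, estimated by sampling w ~ q(w)
w_samples = rng.multivariate_normal(w_mean, w_cov, size=10_000)
p_c1 = sigmoid(w_samples @ phi).mean()
print(p_c1)
```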
Laplace Approximation (summary)
• Need mode w0 of posterior distribution p(w|t)
– Done by a numerical optimization algorithm
• Fit a Gaussian centered at the mode
  $q(w) = \dfrac{|A|^{1/2}}{(2\pi)^{M/2}} \exp\left\{ -\tfrac{1}{2}(w - w_0)^T A (w - w_0) \right\} = N(w \mid w_0, A^{-1})$
  – Needs second derivatives of the log posterior: $A = -\nabla\nabla \ln f(w)\,\big|_{w = w_0}$
• Equivalent to finding the Hessian matrix
  $S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$
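A minimal NumPy sketch of this Hessian computation; the names Phi, y, and S0_inv are illustrative assumptions, not from the slides.

```python
import numpy as np

def posterior_precision(Phi, y, S0_inv):
    """Hessian of the negative log posterior.

    Phi    : (N, M) design matrix of basis-function vectors phi_n
    y      : (N,)   sigmoid outputs y_n = sigma(w^T phi_n)
    S0_inv : (M, M) prior precision S_0^{-1}
    Returns S_N^{-1} = S_0^{-1} + sum_n y_n (1 - y_n) phi_n phi_n^T.
    """
    R = y * (1.0 - y)                          # weights y_n (1 - y_n)
    return S0_inv + Phi.T @ (R[:, None] * Phi)
```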
Evaluation of Posterior Distribution
• Gaussian prior
  $p(w) = N(w \mid m_0, S_0)$
  – where $m_0$ and $S_0$ are hyperparameters
• Posterior distribution
  $p(w|t) \propto p(w)\, p(t|w)$, where $t = (t_1, \ldots, t_N)^T$
  – Substituting the likelihood $p(t|w) = \prod_{n=1}^{N} y_n^{t_n} \{1 - y_n\}^{1 - t_n}$ gives
    $\ln p(w|t) = -\tfrac{1}{2}(w - m_0)^T S_0^{-1}(w - m_0) + \sum_{n=1}^{N} \left\{ t_n \ln y_n + (1 - t_n)\ln(1 - y_n) \right\} + \text{const}$
    where $y_n = \sigma(w^T \phi_n)$
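The log posterior above translates directly into code; the following sketch evaluates it up to the additive constant (the names Phi, t, m0, S0_inv are illustrative assumptions).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def log_posterior(w, Phi, t, m0, S0_inv):
    """ln p(w|t) up to an additive constant.

    Gaussian prior term  : -1/2 (w - m0)^T S0^{-1} (w - m0)
    Bernoulli likelihood : sum_n t_n ln y_n + (1 - t_n) ln(1 - y_n)
    """
    y = sigmoid(Phi @ w)
    eps = 1e-12                                   # guard against log(0)
    prior = -0.5 * (w - m0) @ S0_inv @ (w - m0)
    lik = np.sum(t * np.log(y + eps) + (1 - t) * np.log(1 - y + eps))
    return prior + lik
```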
Gaussian Approximation of Posterior
• Maximize the posterior p(w|t) to give the MAP solution $w_{MAP}$
  – Done by numerical optimization
  – Defines the mean of the Gaussian
• Covariance given by the inverse of the matrix of second derivatives of the negative log posterior:
  $S_N^{-1} = -\nabla\nabla \ln p(w|t) = S_0^{-1} + \sum_{n=1}^{N} y_n (1 - y_n)\, \phi_n \phi_n^T$
• Gaussian approximation to the posterior:
  $q(w) = N(w \mid w_{MAP}, S_N)$
• Need to marginalize with respect to this distribution to make predictions
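Putting the pieces together, a sketch of the Laplace fit under the assumptions above; it reuses the log_posterior and posterior_precision helpers from the earlier sketches and scipy.optimize.minimize for the numerical optimization.

```python
import numpy as np
from scipy.optimize import minimize

def laplace_fit(Phi, t, m0, S0):
    """Fit q(w) = N(w | w_MAP, S_N) by Laplace approximation."""
    S0_inv = np.linalg.inv(S0)

    # 1. Numerical optimization: maximize ln p(w|t), i.e. minimize its negative.
    neg_log_post = lambda w: -log_posterior(w, Phi, t, m0, S0_inv)
    w_map = minimize(neg_log_post, x0=m0, method="BFGS").x

    # 2. Curvature at the mode: S_N^{-1} = S_0^{-1} + sum_n y_n(1-y_n) phi_n phi_n^T.
    y = sigmoid(Phi @ w_map)
    SN = np.linalg.inv(posterior_precision(Phi, y, S0_inv))
    return w_map, SN
```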
Predictive Distribution
• Predictive distribution for class $C_1$, given a new feature vector $\phi(x)$
  – Obtained by marginalizing with respect to the posterior p(w|t):
    $p(C_1 | \phi, t) = \int p(C_1, w | \phi, t)\, dw$   (sum rule)
    $\quad = \int p(C_1 | \phi, t, w)\, p(w|t)\, dw$   (product rule)
    $\quad = \int p(C_1 | \phi, w)\, p(w|t)\, dw$   (given $\phi$ and w, $C_1$ is independent of t)
    $\quad \simeq \int \sigma(w^T \phi)\, q(w)\, dw$   (approximate p(w|t) by the Gaussian q(w))
  – The corresponding probability for class $C_2$ is $p(C_2 | \phi, t) = 1 - p(C_1 | \phi, t)$
Predictive Distribution is a Convolution
• $p(C_1 | \phi, t) \simeq \int \sigma(w^T \phi)\, q(w)\, dw$
  – The function $\sigma(w^T \phi)$ depends on w only through its projection onto $\phi$
  – Denoting $a = w^T \phi$, we have $\sigma(w^T \phi) = \int \delta(a - w^T \phi)\, \sigma(a)\, da$, where $\delta$ is the Dirac delta function
  – Thus $\int \sigma(w^T \phi)\, q(w)\, dw = \int \sigma(a)\, p(a)\, da$, where $p(a) = \int \delta(a - w^T \phi)\, q(w)\, dw$
• Can evaluate p(a) because
  – the delta function imposes a linear constraint on w
  – since q(w) is Gaussian, its marginal is also Gaussian
• Evaluate its mean and variance:
  $\mu_a = \mathbb{E}[a] = \int p(a)\, a\, da = \int q(w)\, w^T \phi\, dw = w_{MAP}^T \phi$
  $\sigma_a^2 = \mathrm{var}[a] = \int p(a)\left\{a^2 - \mathbb{E}[a]^2\right\} da = \int q(w)\left\{(w^T \phi)^2 - (w_{MAP}^T \phi)^2\right\} dw = \phi^T S_N \phi$
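In code, these two moments are a pair of inner products; a minimal sketch, where phi, w_map, and SN are the quantities defined above represented as NumPy arrays.

```python
import numpy as np

def predictive_moments(phi, w_map, SN):
    """Mean and variance of a = w^T phi under q(w) = N(w | w_MAP, S_N)."""
    mu_a = float(w_map @ phi)        # mu_a      = w_MAP^T phi
    var_a = float(phi @ SN @ phi)    # sigma_a^2 = phi^T S_N phi
    return mu_a, var_a
```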
Variational Approximation to Predictive Distribution

• Predictive distribution is
  $p(C_1 | t) = \int \sigma(a)\, p(a)\, da = \int \sigma(a)\, N(a \mid \mu_a, \sigma_a^2)\, da$
• Convolution of a sigmoid with a Gaussian is intractable
• Use the probit instead of the logistic sigmoid
  [Figure: logistic sigmoid, values 0 to 1, plotted on the range -5 to 5]
Approximation using Probit

$p(C_1 | t) = \int \sigma(a)\, N(a \mid \mu_a, \sigma_a^2)\, da$
• Use the probit function, which is similar to the logistic sigmoid
  – Defined as $\Phi(a) = \int_{-\infty}^{a} N(\theta \mid 0, 1)\, d\theta$
• Approximate $\sigma(a)$ by $\Phi(\lambda a)$
• Find $\lambda$ such that the two functions have the same slope at the origin
  [Figure: logistic sigmoid and rescaled probit $\Phi(\lambda a)$ on the same axes, range -5 to 5]
• Requiring that the two functions have the same slope at the origin yields $\lambda^2 = \pi/8$

• Convolution of a probit with a Gaussian is another probit:
  $\int \Phi(\lambda a)\, N(a \mid \mu, \sigma^2)\, da = \Phi\!\left( \dfrac{\mu}{(\lambda^{-2} + \sigma^2)^{1/2}} \right)$
  – Thus $p(C_1 | \phi, t) = \int \sigma(a)\, N(a \mid \mu_a, \sigma_a^2)\, da \simeq \sigma\!\left(\kappa(\sigma_a^2)\, \mu_a\right)$,
    where $\kappa(\sigma^2) = (1 + \pi\sigma^2/8)^{-1/2}$
Probit Classification
Applying it to $p(C_1 | t) = \int \sigma(a)\, N(a \mid \mu_a, \sigma_a^2)\, da$, we have
  $p(C_1 | \phi, t) \simeq \sigma\!\left(\kappa(\sigma_a^2)\, \mu_a\right)$
  where $\mu_a = w_{MAP}^T \phi$ and $\sigma_a^2 = \phi^T S_N \phi$
• The decision boundary corresponding to $p(C_1 | \phi, t) = 0.5$ is given by $\mu_a = 0$
• This is the same solution as $w_{MAP}^T \phi = 0$
• Thus marginalization has no effect when minimizing the misclassification rate with equal prior probabilities
• For more complex decision criteria it plays an important role
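To see the effect, a small worked check with made-up numbers (reusing the predictive_moments and predictive_probability sketches above): the 0.5 contour is unchanged, but probabilities away from it are moderated toward 0.5 when the variance $\phi^T S_N \phi$ is large.

```python
import numpy as np
from scipy.special import expit

# Made-up MAP estimate and posterior covariance, for illustration only.
w_map = np.array([1.0, -0.5])
SN = np.array([[0.5, 0.1],
               [0.1, 0.8]])

for phi in [np.array([1.0, 2.0]),    # on the MAP decision boundary: w_MAP^T phi = 0
            np.array([2.0, 1.0]),    # off the boundary
            np.array([-2.0, 1.0])]:
    mu_a, var_a = predictive_moments(phi, w_map, SN)
    p_map = expit(mu_a)                             # plug-in MAP probability
    p_bayes = predictive_probability(mu_a, var_a)   # moderated Bayesian probability
    print(phi, round(float(p_map), 3), round(float(p_bayes), 3))
```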


Summary
• Logistic regression is a linear probabilistic discriminative model: $p(C_1 | x) = \sigma(w^T \phi)$
• Exact Bayesian logistic regression is intractable
• Using the Laplace approximation, the posterior parameter distribution p(w|t) can be approximated as a Gaussian q(w)
• The predictive distribution is a convolution of a sigmoid with a Gaussian: $p(C_1 | \phi) \simeq \int \sigma(w^T \phi)\, q(w)\, dw$
  – Approximating the sigmoid by a probit makes this convolution another probit, giving a closed-form result
