Machine Learning and Pattern Recognition - Laplace Approximation
We want the most probable weights under the posterior:

$$ w^* \;=\; \arg\max_w\, p(w \mid \mathcal{D}) \;=\; \arg\max_w\, \log p(w, \mathcal{D}). \qquad (1) $$

The conditional probability on the left is what we intuitively want to optimize. The maximization on the right gives the same answer, but contains the term we will actually compute. Reminder: why do we take the log?¹

[The website version of this note has a question here.]
We usually find the mode of the distribution by minimizing an ‘energy’, which is the negative log-probability of the distribution up to a constant. For a posterior distribution, we can define the energy as:

$$ E(w) = -\log p(w, \mathcal{D}), \qquad w^* = \arg\min_w E(w). \qquad (2) $$
In one dimension, the curvature of the energy at its minimum is

$$ H = \left.\frac{d^2 E(w)}{dw^2}\right|_{w=w^*}. \qquad (3) $$

The notation means that we evaluate the second derivative at the optimum, w = w*. If H is large, the slope (the first derivative) changes rapidly from a steep descent to a steep ascent: the distribution is sharply peaked, so we should approximate it with a narrow Gaussian. Generalizing to multiple variables w, we know ∇w E is zero at the optimum, and we evaluate the Hessian, a matrix with elements:

$$ H_{ij} = \left.\frac{\partial^2 E(w)}{\partial w_i\, \partial w_j}\right|_{w=w^*}. \qquad (4) $$
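If the second derivatives are awkward to derive by hand, the Hessian elements in (4) can also be estimated numerically. A minimal NumPy sketch (not part of the original notes; energy is any function returning E(w), and the step size eps is an arbitrary illustrative choice):

import numpy as np

def numerical_hessian(energy, w_star, eps=1e-5):
    # Estimate H[i, j] = d^2 E / (dw_i dw_j) at w_star by central differences.
    D = w_star.size
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            w = w_star.copy(); w[i] += eps; w[j] += eps; Epp = energy(w)
            w = w_star.copy(); w[i] += eps; w[j] -= eps; Epm = energy(w)
            w = w_star.copy(); w[i] -= eps; w[j] += eps; Emp = energy(w)
            w = w_star.copy(); w[i] -= eps; w[j] -= eps; Emm = energy(w)
            H[i, j] = (Epp - Epm - Emp + Emm) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize to clean up numerical noise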
1. Because log is a monotonic transformation, maximizing the log of a function is equivalent to maximizing the
original function. Often the log of a distribution is more convenient to work with, less prone to numerical problems,
and closer to an ideal quadratic function that optimizers like.
For comparison, the energy (negative log-density up to a constant) of a one-dimensional Gaussian N(w; µ, σ²) is:

$$ E_{\mathcal{N}}(w) = \frac{(w - \mu)^2}{2\sigma^2}. \qquad (5) $$
The minimum is w* = µ, and the second derivative is H = 1/σ², implying the variance is σ² = 1/H. Generalizing to higher dimensions, for a Gaussian N(µ, Σ) the energy is:

$$ E_{\mathcal{N}}(w) = \frac{1}{2} (w - \mu)^\top \Sigma^{-1} (w - \mu), \qquad (6) $$

with w* = µ and H = Σ^{-1}, implying the covariance is Σ = H^{-1}.
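As a concrete one-dimensional example (not from the original notes), consider Laplace-approximating a Gamma distribution with shape α > 1 and rate β, where p(w) ∝ w^{α−1} e^{−βw} for w > 0:

$$
\begin{aligned}
E(w) &= -\log p(w) + \text{const} = -(\alpha - 1)\log w + \beta w,\\
\frac{dE}{dw} &= -\frac{\alpha - 1}{w} + \beta = 0 \;\Rightarrow\; w^* = \frac{\alpha - 1}{\beta},\\
H &= \left.\frac{d^2 E}{dw^2}\right|_{w^*} = \frac{\alpha - 1}{(w^*)^2} = \frac{\beta^2}{\alpha - 1}.
\end{aligned}
$$

The Laplace approximation is therefore N(w; (α−1)/β, (α−1)/β²), whereas the true mean and variance are α/β and α/β²; the match improves as α grows and the Gamma becomes more Gaussian.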
Therefore matching the minimum and curvature of the ‘energy’ (negative log-probability) to
those of a Gaussian energy gives the Laplace approximation to the posterior distribution:
$$ p(w \mid \mathcal{D}) = \frac{p(w, \mathcal{D})}{P(\mathcal{D})} \;\approx\; \mathcal{N}(w;\, w^*, H^{-1}) = \frac{|H|^{1/2}}{(2\pi)^{D/2}} \exp\!\Big( -\tfrac{1}{2} (w - w^*)^\top H\, (w - w^*) \Big). \qquad (8) $$
Rearranging (8) and evaluating it at w = w* gives an approximation to the normalizer,

$$ P(\mathcal{D}) \;\approx\; p(w^*, \mathcal{D})\, (2\pi)^{D/2}\, |H|^{-1/2}. $$

An equivalent expression is

$$ \log P(\mathcal{D}) \;\approx\; \log p(w^*, \mathcal{D}) + \frac{D}{2}\log(2\pi) - \frac{1}{2}\log|H|. $$
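As a sketch of how these pieces fit together in code, here is a Laplace approximation for Bayesian logistic regression with a spherical Gaussian prior (the model, function names, and prior precision alpha are illustrative assumptions, not prescriptions from these notes):

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def laplace_fit(X, y, alpha=1.0):
    # X: (N, D) inputs; y: (N,) labels in {0, 1}; prior p(w) = N(0, (1/alpha) I).
    N, D = X.shape

    def energy(w):  # E(w) = -log p(w, D) = -log p(y | X, w) - log p(w)
        s = 2*y - 1                                     # labels as +/-1
        log_lik = -np.sum(np.logaddexp(0, -s*(X @ w)))
        log_prior = -0.5*alpha*(w @ w) + 0.5*D*np.log(alpha/(2*np.pi))
        return -(log_lik + log_prior)

    def grad(w):
        return X.T @ (expit(X @ w) - y) + alpha*w

    def hess(w):
        p = expit(X @ w)
        return (X.T * (p*(1 - p))) @ X + alpha*np.eye(D)

    w_star = minimize(energy, np.zeros(D), jac=grad, method='L-BFGS-B').x
    H = hess(w_star)                                    # Hessian of E at w*, as in (4)

    # Approximate posterior N(w; w*, H^{-1}) and log P(D), as above.
    _, logdetH = np.linalg.slogdet(H)
    log_evidence = -energy(w_star) + 0.5*D*np.log(2*np.pi) - 0.5*logdetH
    return w_star, H, log_evidence

Comparing log_evidence across models (for example, across settings of alpha) is one common use of this approximation.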
For Bayesian logistic regression, the predictive probability P(y = 1 | x, D) is an expectation of σ(w⊤x) under our (approximate) posterior beliefs about the weights. However, this expectation can be simplified. Only the inner product a = w⊤x matters, so we can take the average over this scalar quantity instead. The activation a is a linear combination of quantities we have Gaussian beliefs about, so our beliefs about it are also Gaussian. By now you should be able to show that

$$ p(a) = \mathcal{N}\!\big(a;\; {w^*}^{\!\top} x,\; x^\top H^{-1} x\big). $$
Therefore, the predictions given the approximate posterior are given by a one-dimensional integral,

$$ P(y = 1 \mid x, \mathcal{D}) \;\approx\; \int \sigma(a)\, \mathcal{N}\!\big(a;\; {w^*}^{\!\top} x,\; x^\top H^{-1} x\big)\, da, $$

which can in turn be approximated in closed form:

$$ P(y = 1 \mid x, \mathcal{D}) \;\approx\; \sigma\big(\kappa\, {w^*}^{\!\top} x\big), \qquad \kappa = \frac{1}{\sqrt{1 + \frac{\pi}{8}\, x^\top H^{-1} x}}. \qquad (18) $$
Under this approximation, the predictions use the most probable or MAP weights. However, the activation is scaled down (by κ) when it is uncertain, so that predictions will be less confident far from the data (as they should be).
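A corresponding sketch of the approximate predictive probability (18), reusing w_star and H from a fit such as the one above (the helper name is an assumption):

import numpy as np
from scipy.special import expit

def laplace_predict(x, w_star, H):
    # P(y=1 | x, D) ~= sigma(kappa * w*^T x), with kappa as in (18).
    var_a = x @ np.linalg.solve(H, x)      # x^T H^{-1} x, the variance of a = w^T x
    kappa = 1.0 / np.sqrt(1.0 + np.pi*var_a/8.0)
    return expit(kappa * (w_star @ x))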
The link between the curvature and the Gaussian fit comes from a Taylor expansion of the energy about its minimum. In one dimension:

$$
\begin{aligned}
E(w^* + \delta) &\;\approx\; E(w^*) + \left.\frac{\partial E}{\partial w}\right|_{w^*}\!\delta + \frac{1}{2}\left.\frac{\partial^2 E}{\partial w^2}\right|_{w^*}\!\delta^2 \qquad &(19)\\
&\;\approx\; E(w^*) + \frac{1}{2} H \delta^2, \qquad &(20)
\end{aligned}
$$

where the second term disappears because ∂E/∂w is zero at the optimum. In multiple dimensions this Taylor approximation generalizes to:

$$ E(w^* + \delta) \;\approx\; E(w^*) + \frac{1}{2}\, \delta^\top H\, \delta. $$
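One quick way to sanity-check a fitted approximation is to compare the true change in energy with this quadratic form for small random perturbations; a sketch (assuming energy, w_star, and H as in the earlier code):

import numpy as np

def check_quadratic(energy, w_star, H, scale=1e-3, n_trials=5, seed=0):
    # E(w* + delta) - E(w*) should be close to 0.5 * delta^T H delta for small delta.
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        delta = scale * rng.standard_normal(w_star.size)
        exact = energy(w_star + delta) - energy(w_star)
        quad = 0.5 * (delta @ H @ delta)
        print(f"exact: {exact:.3e}   quadratic: {quad:.3e}")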
In models with many parameters, the posterior will often be flat in some directions, where parameters trade off against each other to give similar predictions. When there is zero curvature in some direction, the Hessian isn't positive definite, and we can't get a meaningful approximation.
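In practice, one way to detect this failure mode is to inspect the eigenvalues of the Hessian before inverting it; a small sketch (the tolerance is an arbitrary illustrative choice):

import numpy as np

def hessian_is_usable(H, tol=1e-8):
    # The Laplace approximation needs H to be positive definite.
    # Eigenvalues at or near zero indicate flat directions of the posterior.
    eigvals = np.linalg.eigvalsh(H)        # H is symmetric, so eigvalsh applies
    return eigvals.min() > tol * max(eigvals.max(), 1.0)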
6 Further Reading
Bishop covers the Laplace approximation and application to Bayesian logistic regression in
Sections 4.4 and 4.5.
Or read Murphy Sections 8.4 to 8.4.4 inclusive. You can skip 8.4.2 on BIC.
Similar material is covered by MacKay, Ch. 41, pp492–503, and Ch. 27, pp341–342.
The Laplace approximation was used in some of the earliest Bayesian neural networks
although — as presented here — it’s now rarely used. However, the idea does occur in recent
work, such as on continual learning (Kirkpatrick et al., Google DeepMind, 2017), and a more
sophisticated variant is used by the popular statistical package, R-INLA.
2. The final two figures in this note come from previous MLPR course notes, by one of Amos Storkey, Chris
Williams, or Charles Sutton.