Machine Learning and Pattern Recognition Variational KL
So we fit the mean and covariance of the approximation to find the best match to the
posterior according to our variational cost function. Although we won’t consider other cases,
the variational family doesn’t have to be Gaussian. The variational distribution can be a
discrete distribution if we have a posterior distribution over discrete variables.
1 Kullback–Leibler Divergence
The Kullback–Leibler divergence, usually just called the KL-divergence, is a common measure
of the discrepancy between two distributions:
D_{KL}(p \| q) = \int p(z) \log \frac{p(z)}{q(z)} \, dz. \qquad (2)
The KL-divergence is non-negative, DKL ( p || q) ≥ 0, and is only zero when the two distribu-
tions are identical.
The divergence doesn’t satisfy the formal criteria to be a distance; for example, it isn’t
symmetric: DKL (p || q) ≠ DKL (q || p).
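As a concrete check, here is a small Python sketch (not from the notes; the function name kl_gauss and the example numbers are my own) that evaluates the closed-form KL-divergence between two univariate Gaussians, illustrating non-negativity and asymmetry:

    # Illustrative sketch: KL-divergence between two 1-D Gaussians, closed form.
    import numpy as np

    def kl_gauss(m1, s1, m2, s2):
        """D_KL( N(m1, s1^2) || N(m2, s2^2) ) for univariate Gaussians."""
        return np.log(s2 / s1) + (s1**2 + (m1 - m2)**2) / (2 * s2**2) - 0.5

    print(kl_gauss(0.0, 1.0, 1.0, 2.0))   # D_KL(p || q) >= 0
    print(kl_gauss(1.0, 2.0, 0.0, 1.0))   # D_KL(q || p): a different value
    print(kl_gauss(0.0, 1.0, 0.0, 1.0))   # zero only when the distributions match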
2 Minimizing DKL ( p || q)
To minimize DKL ( p || q) we set the variational parameters m and V to match the mean and
covariance of the target distribution p. The illustration below shows an example from the
notes on Bayesian logistic regression. The Laplace approximation is poor on this example:
the mode of the posterior is very close to the mode of the prior, and the curvature there is
almost the same as well. The Laplace approximation sets the approximate posterior so close
to the prior distribution (blue solid line below) that I haven’t plotted it. A different Gaussian
fit (magenta dotted line), with the same mean m and variance V as the posterior distribution,
is a better summary of where the plausible parameters are than the Laplace approximation:
1. The textbooks often avoid parameterizing q in their presentations of variational methods. Instead they describe
the optimization problem as being on the distribution q itself, using the calculus of variations. We don’t need such
a general treatment in this course.
[Figure: the prior (blue solid) and a Gaussian fit with the posterior’s mean and variance (magenta dotted), plotted against the weight w.]
Optimizing DKL ( p(w | D) || q(w; α)) tends to be difficult. The cost function is an expectation
under the complicated posterior distribution that we are trying to approximate, and we
usually can’t evaluate it.
Even if we could find the mean and covariance of the posterior (approximating them
would be possible), the answer may not be sensible. Matching the mean of a bimodal
distribution centres the approximation between the modes, on implausible parameters.
Our predictions are not likely to be sensible if we mainly use parameters that are not
plausible given the data.
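To see this failure mode numerically, here is a rough Python sketch (the bimodal mixture and all numbers are invented for illustration): a moment-matched Gaussian fit to a well-separated two-mode distribution centres itself where the target density is tiny.

    # Illustrative sketch: moment-matching a Gaussian to a bimodal "posterior".
    import numpy as np
    from scipy.stats import norm

    # Toy bimodal target: equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2)
    samples = np.concatenate([np.random.randn(50_000)*0.5 - 3,
                              np.random.randn(50_000)*0.5 + 3])

    m, v = samples.mean(), samples.var()   # moment-matched Gaussian fit
    print(m, v)                            # mean near 0, variance near 9.25

    # Density of the true target at the fitted mean is tiny:
    p_at_m = 0.5*norm.pdf(m, -3, 0.5) + 0.5*norm.pdf(m, 3, 0.5)
    print(p_at_m)                          # ~1e-8: centred on implausible parameters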
3 Minimizing DKL (q || p)
Most variational methods in Machine Learning minimize DKL (q(w; α) || p(w | D)), partly
because we are better at optimizing this cost function. (There are also other sensible varia-
tional principles, but we won’t cover them in this course.) Minimizing the KL-divergence
this way around encourages the fit to concentrate on plausible parameters:
D_{KL}(q(w; \alpha) \| p(w | D)) = \int q(w; \alpha) \log \frac{q(w; \alpha)}{p(w | D)} \, dw \qquad (3)

= -\int q(w; \alpha) \log p(w | D) \, dw + \underbrace{\int q(w; \alpha) \log q(w; \alpha) \, dw}_{\text{negative entropy, } -H(q)} \qquad (4)
To make the first term small, we avoid putting probability mass on implausible parameters.
As an extreme example, there is an infinite penalty for putting probability mass of q on a
region of parameters that are impossible given the data. The second term is the negative
entropy of the distribution q.2 To make the second term small we want a high entropy
distribution, one that is as spread out as possible.
Minimizing this KL-divergence usually results in a Gaussian approximation that finds one
mode of the distribution, and spreads out to cover as much mass in that mode as possible.
2. H is the standard symbol for entropy, and has nothing to do with a Hessian (also H; sorry!).
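A minimal numerical sketch of this mode-seeking behaviour (again a toy example of my own, using a crude grid search rather than any standard fitting routine): minimizing D_KL(q || p) over the mean and standard deviation of a Gaussian q, for the same sort of bimodal target as above, picks one mode and matches its width.

    # Illustrative sketch: grid search over Gaussian q to minimize D_KL(q || p).
    import numpy as np
    from scipy.stats import norm

    w = np.linspace(-10, 10, 4001)
    dw = w[1] - w[0]
    p = 0.5*norm.pdf(w, -3, 0.5) + 0.5*norm.pdf(w, 3, 0.5)   # bimodal target

    def kl_q_p(m, s):
        q = norm.pdf(w, m, s)
        # Riemann-sum approximation of the integral of q*log(q/p);
        # the tiny constant guards against log(0) in the far tails
        return np.sum(q * (np.log(q + 1e-300) - np.log(p + 1e-300))) * dw

    candidates = [(m, s) for m in np.linspace(-5, 5, 101)
                         for s in np.linspace(0.1, 3.0, 30)]
    m_best, s_best = min(candidates, key=lambda ms: kl_q_p(*ms))
    print(m_best, s_best)   # lands on one mode: mean near +/-3, sd near 0.5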
Substituting Bayes’ rule,

p(w | D) = \frac{p(D | w)\, p(w)}{p(D)}, \qquad (5)

into the KL-divergence gives

D_{KL}(q \| p) = \underbrace{E_q[\log q(w)] - E_q[\log p(D | w)] - E_q[\log p(w)]}_{J(q)} + \log p(D). \qquad (6)
The first three terms, equal to J(q) in Murphy’s book, depend on the variational distribution
(or its parameters), so we minimize these terms. The final term, log p(D), is the log marginal
likelihood (also known as the “model evidence”). Knowing that the KL-divergence is non-
negative gives us a bound on the marginal likelihood: rearranging (6),

\log p(D) = D_{KL}(q \| p) - J(q) \ge -J(q).

Thus, fitting the variational objective is optimizing a lower bound on the log marginal
likelihood. Recently “the ELBO” or “Evidence Lower Bound” has become a popular name
for -J(q).3
[The website version of this note has a question here.]
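As an illustration (the conjugate Gaussian model and the single observation x = 2 below are my own toy choices, not an example from the notes), the log evidence of a conjugate model is available in closed form, so we can check numerically that the ELBO never exceeds it, and that the bound is tight at the exact posterior.

    # Illustrative sketch: ELBO as a lower bound on the log evidence.
    import numpy as np
    from scipy.stats import norm

    x = 2.0                                            # single observation
    log_evidence = norm.logpdf(x, 0.0, np.sqrt(2.0))   # p(D) = N(x; 0, 1+1)

    def elbo(m, v):
        """ELBO for q(w) = N(m, v), prior N(0, 1), likelihood N(x | w, 1)."""
        e_loglik   = -0.5*np.log(2*np.pi) - 0.5*((x - m)**2 + v)
        e_logprior = -0.5*np.log(2*np.pi) - 0.5*(m**2 + v)
        entropy    = 0.5*np.log(2*np.pi*np.e*v)
        return e_loglik + e_logprior + entropy

    print(log_evidence)
    print(elbo(1.0, 0.5))    # exact posterior N(1, 0.5): bound is tight
    print(elbo(0.0, 1.0))    # a worse q: strictly below the log evidence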
The variational cost J(q) is a sum of (possibly simple) integrals: each term is an expectation
under the variational distribution q. In the last few years variational inference has
become dominated by stochastic gradient descent, which updates the variational parameters
using unbiased estimates of the variational cost function and its gradients.
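Here is a rough sketch of that idea on the same toy model as above (my own illustration, not a prescribed algorithm): unbiased Monte Carlo estimates of the ELBO gradient via the reparameterization w = m + s*eps, followed by plain stochastic gradient ascent.

    # Illustrative sketch: stochastic variational inference with reparameterization.
    import numpy as np

    x = 2.0
    rng = np.random.default_rng(0)
    m, log_s = 0.0, 0.0                       # variational parameters (s = std dev)

    def grads(m, log_s, n_samples=32):
        s = np.exp(log_s)
        eps = rng.standard_normal(n_samples)
        w = m + s*eps                         # reparameterized samples from q
        # d/dw of [log p(x|w) + log p(w)] = (x - w) - w for this Gaussian model
        dlogp_dw = (x - w) - w
        g_m = dlogp_dw.mean()
        # entropy of q is log s + const, so its gradient wrt log_s is 1
        g_log_s = (dlogp_dw * eps).mean()*s + 1.0
        return g_m, g_log_s

    for _ in range(2000):                     # plain stochastic gradient ascent
        g_m, g_log_s = grads(m, log_s)
        m += 0.01*g_m
        log_s += 0.01*g_log_s

    print(m, np.exp(log_s)**2)                # approaches the exact posterior N(1, 0.5)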
5 Reading
Reading for variational inference in Murphy’s book: Sections 21.1–21.2.
As a point of comparison, a quick recap of the Laplace approximation:
• Straightforward to apply
• 2nd derivatives ⇒ certainty of parameters
• Incremental improvement on MAP estimate
• Approximation of marginal/model likelihood
3. Variational inference was originally inspired by work in statistical physics, and with that analogy, + J (q) is also
called the variational free energy.
Here the approximating family is an exponential family, q(w; \theta) \propto \exp(\theta^\top \phi(w)), with
normalizer Z = \int \exp(\theta^\top \phi(w)) \, dw, where \phi(w) is a vector of statistics defining the
approximating family. If we choose \phi to contain each variable w_d, and the product of each
pair of variables, w_c w_d, then we are fitting a Gaussian distribution. Substituting our
approximating family into the variational objective:
into the variational objective:
D_{KL}(p \| q) = \int p(w) \log \frac{p(w)}{q(w)} \, dw \qquad (10)
Expanding with this form of q, D_{KL}(p \| q) = E_p[\log p(w)] - \theta^\top E_p[\phi(w)] + \log Z(\theta), and
since \nabla_\theta \log Z(\theta) = E_q[\phi(w)], the gradient with respect to \theta is zero when E_q[\phi(w)] = E_p[\phi(w)].
That is, the minimum of the objective function (ok, we’d need to do slightly more work
to prove this turning point is a minimum) is where the expected statistics defined by \phi all
match in the two distributions. For a Gaussian, that means the mean and covariance match.
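A quick numerical check of this claim, on a toy bimodal distribution of my own choosing: over a grid of Gaussian fits q = N(m, s^2), the smallest D_KL(p || q) occurs close to the moment-matched mean and variance.

    # Illustrative sketch: the Gaussian minimizing D_KL(p || q) matches p's moments.
    import numpy as np
    from scipy.stats import norm

    w = np.linspace(-15, 15, 6001)
    dw = w[1] - w[0]
    p = 0.5*norm.pdf(w, -3, 0.5) + 0.5*norm.pdf(w, 3, 0.5)   # mean 0, variance 9.25

    def kl_p_q(m, s):
        q = norm.pdf(w, m, s)
        # Riemann-sum approximation of the integral of p*log(p/q)
        return np.sum(p * (np.log(p + 1e-300) - np.log(q + 1e-300))) * dw

    candidates = [(m, s) for m in np.linspace(-2, 2, 41)
                         for s in np.linspace(1.0, 5.0, 81)]
    m_best, s_best = min(candidates, key=lambda ms: kl_p_q(*ms))
    print(m_best, s_best**2)   # close to the moment-matched fit: mean 0, variance 9.25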