
Variational objectives and KL Divergence

The Laplace approximation fitted a Gaussian distribution to a parameter posterior by matching a mode and the curvature of the log posterior at that mode. We saw that there are failure modes when the shape of the distribution at the mode is misleading about the overall distribution.
Variational methods fit a target distribution, such as a parameter posterior, by defining an
optimization problem. The ingredients are:

• A family of possible distributions q(w; α).


• A variational cost function, which describes the discrepancy between q(w; α) and the
target distribution (for us, the parameter posterior).

The computational task is to optimize the variational parameters (here α).1


For this course, the variational family will always be Gaussian:

q(w; α = {m, V}) = N(w; m, V).  (1)

So we fit the mean and covariance of the approximation to find the best match to the
posterior according to our variational cost function. Although we won’t consider other cases,
the variational family doesn’t have to be Gaussian. The variational distribution can be a
discrete distribution if we have a posterior distribution over discrete variables.
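As a minimal sketch (not from the notes) of what this Gaussian family looks like in code: the snippet below builds q(w; m, V) for D parameters, keeping the covariance positive definite through a factor L with V = L Lᵀ, which is one common parameterization rather than anything the notes prescribe:

import numpy as np
from scipy.stats import multivariate_normal

# A minimal sketch of the family in equation (1) for D parameters. Keeping the
# covariance positive definite via a factor L with V = L @ L.T is one common
# choice (an assumption of this sketch, not something fixed by the notes).
D = 3
m = np.zeros(D)        # variational mean
L = np.eye(D)          # e.g. a lower-triangular factor; V = L @ L.T
V = L @ L.T

q = multivariate_normal(mean=m, cov=V)
w_samples = q.rvs(size=5)        # draws w ~ q(w; m, V)
log_q = q.logpdf(w_samples)      # log q(w; m, V), needed by the KL objectives below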

1 Kullback–Leibler Divergence
The Kullback–Leibler divergence, usually just called the KL-divergence, is a common measure
of the discrepancy between two distributions:

DKL(p || q) = ∫ p(z) log [p(z)/q(z)] dz.  (2)

The KL-divergence is non-negative, DKL ( p || q) ≥ 0, and is only zero when the two distribu-
tions are identical.
The divergence doesn’t satisfy the formal criteria to be a distance; for example, it isn’t
symmetric: DKL(p || q) ≠ DKL(q || p).
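A quick numerical check of these two properties, using a pair of arbitrary 1D Gaussians on a grid (the particular means and variances are just an example, not from the notes):

import numpy as np
from scipy.stats import norm

# Two arbitrary 1D Gaussians evaluated on a dense grid.
z = np.linspace(-10, 10, 10_001)
p = norm.pdf(z, loc=0.0, scale=2.0)
q = norm.pdf(z, loc=1.0, scale=1.0)

def kl(a, b):
    # Grid approximation of D_KL(a || b) = ∫ a(z) log[a(z)/b(z)] dz
    return np.trapz(a * (np.log(a) - np.log(b)), z)

print(kl(p, q), kl(q, p))   # both positive, and clearly different: not symmetric

# Closed form for two Gaussians, for comparison with kl(p, q):
m0, s0, m1, s1 = 0.0, 2.0, 1.0, 1.0
print(np.log(s1/s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5)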

2 Minimizing DKL ( p || q)
To minimize DKL ( p || q) we set the variational parameters m and V to match the mean and
covariance of the target distribution p. The illustration below shows an example from the
notes on Bayesian logistic regression. The Laplace approximation is poor on this example:
the mode of the posterior is very close to the mode of the prior, and the curvature there is
almost the same as well. The Laplace approximation sets the approximate posterior so close
to the prior distribution (blue solid line below) that I haven’t plotted it. A different Gaussian
fit (magenta dotted line), with the same mean m and variance V as the posterior distribution,
is a better summary of where the plausible parameters are than the Laplace approximation:

1. The textbooks often avoid parameterizing q in their presentations of variational methods. Instead they describe
the optimization problem as being on the distribution q itself, using the calculus of variations. We don’t need such
a general treatment in this course.



[Figure: the prior p(w), the posterior p(w | D), and the moment-matched Gaussian N(w; m, V), plotted over w from −4 to 4.]

Optimizing DKL ( p(w | D) || q(w; α)) tends to be difficult. The cost function is an expectation
under the complicated posterior distribution that we are trying to approximate, and we
usually can’t evaluate it.
Even if we could find the mean and covariance of the target posterior (approximating them would be possible), the answer may not be sensible. Matching the mean of a bimodal distribution will find an approximation centred on implausible parameters:

Our predictions are not likely to be sensible if we mainly use parameters that are not
plausible given the data.
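As a concrete (made-up) illustration of this failure, the snippet below moment-matches a Gaussian to an equal mixture of two narrow Gaussians at ±3; the fitted mean lands near 0, where the target density is essentially zero:

import numpy as np
from scipy.stats import norm

# A made-up bimodal 'posterior': an equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2).
z = np.linspace(-8, 8, 8_001)
p = 0.5 * norm.pdf(z, -3, 0.5) + 0.5 * norm.pdf(z, 3, 0.5)

# Minimizing DKL(p || q) over Gaussian q matches the mean and variance of p:
m = np.trapz(z * p, z)               # ~ 0
V = np.trapz((z - m)**2 * p, z)      # ~ 3^2 + 0.5^2 = 9.25

# The fitted q puts most of its mass near w = 0, where p is essentially zero:
print(norm.pdf(0.0, m, np.sqrt(V)))                                 # q's density at its mean, ~0.13
print(0.5 * norm.pdf(0.0, -3, 0.5) + 0.5 * norm.pdf(0.0, 3, 0.5))   # p's density there, ~1e-8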

3 Minimizing DKL (q || p)
Most variational methods in Machine Learning minimize DKL (q(w; α) || p(w | D)), partly
because we are better at optimizing this cost function. (There are also other sensible varia-
tional principles, but we won’t cover them in this course.) Minimizing the KL-divergence
this way around encourages the fit to concentrate on plausible parameters:
DKL(q(w; α) || p(w | D)) = ∫ q(w; α) log [q(w; α) / p(w | D)] dw  (3)

= − ∫ q(w; α) log p(w | D) dw + ∫ q(w; α) log q(w; α) dw.  (4)

To make the first term small, we avoid putting probability mass on implausible parameters.
As an extreme example, there is an infinite penalty for putting probability mass of q on a
region of parameters that are impossible given the data. The second term is the negative
entropy of the distribution q.2 To make the second term small we want a high entropy
distribution, one that is as spread out as possible.
Minimizing this KL-divergence usually results in a Gaussian approximation that finds one
mode of the distribution, and spreads out to cover as much mass in that mode as possible.

2. H is the standard symbol for entropy, and has nothing to do with a Hessian (also H; sorry!).

MLPR:w11b Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 2


However, the distribution can’t spread out to cover low probability regions, or the first term
would grow large. See Murphy Figure 21.1 for an illustration.
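For contrast, here is a rough numerical sketch of this mode-seeking behaviour, reusing the made-up bimodal target from the earlier sketch and grid-based integrals; the Gaussian locks onto whichever mode it starts near and matches its local width, rather than spreading across both modes:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# The same made-up bimodal target as in the earlier sketch.
z = np.linspace(-8, 8, 8_001)
p = 0.5 * norm.pdf(z, -3, 0.5) + 0.5 * norm.pdf(z, 3, 0.5)

def kl_qp(params):
    # D_KL(q || p) for q = N(m, v), with grid-based integration.
    m, log_v = params
    q = norm.pdf(z, m, np.sqrt(np.exp(log_v)))
    return np.trapz(q * (np.log(q + 1e-300) - np.log(p + 1e-300)), z)

# Start near the right-hand mode; the fit stays there rather than covering both modes.
m_opt, log_v_opt = minimize(kl_qp, x0=[2.0, 0.0]).x
print(m_opt, np.sqrt(np.exp(log_v_opt)))   # close to 3 and 0.5: one mode, local width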
If we substitute the expression for the posterior from Bayes’ rule,

p(w | D) = p(D | w) p(w) / p(D),  (5)

into the KL-divergence, we get a spray of terms:

DKL(q || p) = Eq[log q(w)] − Eq[log p(D | w)] − Eq[log p(w)] + log p(D).  (6)

The first three terms, equal to J(q) in Murphy’s book, depend on the variational distribution
(or its parameters), so we minimize these terms. The final term, log p(D), is the log marginal
likelihood (also known as the “model evidence”). Knowing that the KL-divergence is non-
negative gives us a bound on the marginal likelihood:

DKL(q || p) ≥ 0  ⇒  log p(D) ≥ −J(q).  (7)

Thus, fitting the variational objective is optimizing a lower bound on the log marginal
likelihood. Recently “the ELBO” or “Evidence Lower Bound” has become a popular name
for − J (q).3
[The website version of this note has a question here.]
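To see the bound concretely, here is a small Monte Carlo check on a made-up conjugate model (a Gaussian prior on a single weight and a Gaussian likelihood with known noise), chosen only because its posterior and log p(D) are available exactly; none of these numbers come from the notes:

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Made-up conjugate model, chosen only because log p(D) is available exactly:
#   prior  w ~ N(0, 1),   likelihood  y_n ~ N(w, s^2)  for n = 1..N.
s, N = 2.0, 10
y = rng.normal() + s * rng.normal(size=N)

# Exact posterior and exact log marginal likelihood ("evidence"):
v_post = 1.0 / (1.0 + N / s**2)
m_post = v_post * y.sum() / s**2
log_evidence = multivariate_normal.logpdf(
    y, mean=np.zeros(N), cov=s**2 * np.eye(N) + np.ones((N, N)))

def elbo(m, v, S=100_000):
    # Monte Carlo estimate of -J(q) for q(w) = N(w; m, v).
    w = m + np.sqrt(v) * rng.normal(size=S)
    log_lik = norm.logpdf(y[:, None], loc=w, scale=s).sum(axis=0)
    log_prior = norm.logpdf(w, 0.0, 1.0)
    log_q = norm.logpdf(w, m, np.sqrt(v))
    return np.mean(log_lik + log_prior - log_q)

print(log_evidence)            # exact log p(D)
print(elbo(m_post, v_post))    # with q equal to the posterior the bound is tight
print(elbo(0.0, 1.0))          # any other q gives a smaller value: a lower bound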

4 Optimization methods for DKL (q || p)


The literature is full of clever iterative ways to optimize DKL (q || p) for different models.
Could we use standard optimizers? The hardest term to evaluate is:
Eq[log P(D | w)] = ∑_{n=1}^{N} Eq[log P(y^(n) | x^(n), w)],  (8)

which is a sum of (possibly simple) integrals. In the last few years variational inference has
become dominated by stochastic gradient descent, which updates the variational parameters
using unbiased approximations of the variational cost function and its gradients.
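A sketch of that idea, using the reparameterization w = m + s·ε (one common way to get unbiased gradient estimates; others exist) on a made-up 1D Bayesian logistic regression. The model, step size, and iteration count are all assumptions for illustration, not the course’s code:

import numpy as np
from scipy.special import expit   # numerically stable logistic sigmoid

rng = np.random.default_rng(1)

# A made-up 1D Bayesian logistic regression with labels y_n in {-1, +1}:
#   prior  w ~ N(0, sigma0^2),   likelihood  P(y_n | x_n, w) = sigmoid(y_n * w * x_n).
sigma0 = 10.0
x = rng.normal(size=20)
y = np.where(rng.random(20) < expit(1.5 * x), 1, -1)   # data generated with w near 1.5

def grad_log_joint(w):
    # d/dw [ log p(D | w) + log p(w) ] for this model
    return np.sum(y * x * (1.0 - expit(y * x * w))) - w / sigma0**2

# Fit q(w) = N(w; m, s^2) by stochastic gradient descent on J(q) = -ELBO,
# using the reparameterization w = m + s*eps with eps ~ N(0, 1). The fixed step
# size and iteration count are arbitrary choices for this sketch.
m, log_s = 0.0, 0.0
lr = 0.02
for it in range(5000):
    eps = rng.normal()
    s = np.exp(log_s)
    g = grad_log_joint(m + s * eps)
    grad_m = -g                        # one-sample unbiased estimate of dJ/dm
    grad_log_s = -g * s * eps - 1.0    # the -1 comes from the entropy term, -log s
    m -= lr * grad_m
    log_s -= lr * grad_log_s

print(m, np.exp(log_s))   # variational mean and standard deviation of the fitted q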

5 Reading
Reading for variational inference in Murphy’s book: Sections 21.1–21.2.

6 Overview of Gaussian approximations


Laplace approximation:

• Straightforward to apply
• 2nd derivatives ⇒ certainty of parameters
• Incremental improvement on MAP estimate
• Approximation of marginal/model likelihood

3. Variational inference was originally inspired by work in statistical physics, and with that analogy, + J (q) is also
called the variational free energy.



Variational methods:

• Optimization: fit variational parameters of q (not w!)


• Usually DKL (q || p), not DKL ( p || q)
• Bound on the marginal/model likelihood
• Optimization: traditionally harder to apply. Now becoming automatic as well.

7 The KL divergence appears in other contexts


A student in a previous year asked: “In MLP, we use the KL divergence cost function to train
neural nets. More specifically, we use the divergence between the 1-0 labels and the model output. . .
Does it count as using a variational method?”
The answer is no. One principle for fitting a single setting of the weights is to estimate the KL
between the true label distribution and the model’s distribution. A Monte Carlo estimate of
this KL uses samples from the true distribution found in the training data. Optimizing this
objective is the same as maximum likelihood training. Variational methods (in the context of
Bayesian methods) instead approximate the posterior distribution over the weights with another distribution.
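A tiny numerical illustration of that equivalence (the class probabilities are made up): with hard 0-1 labels, each training example’s “true” distribution is a point mass on the observed class, so the Monte Carlo KL estimate reduces to the usual cross-entropy, i.e. the negative log-likelihood:

import numpy as np

# Made-up class probabilities for four training examples and three classes.
labels = np.array([0, 2, 1, 2])                  # observed classes
model_probs = np.array([[0.70, 0.20, 0.10],
                        [0.10, 0.30, 0.60],
                        [0.20, 0.50, 0.30],
                        [0.25, 0.25, 0.50]])

# Standard cross-entropy / negative log-likelihood of the observed labels:
cross_entropy = -np.mean(np.log(model_probs[np.arange(4), labels]))

# The same number via the KL formula with one-hot "true" distributions
# (zero-probability entries contribute 0, and a point mass has zero entropy):
one_hot = np.eye(3)[labels]
safe_p = np.where(one_hot > 0, one_hot, 1.0)     # avoid log(0); those terms are 0 anyway
kl_per_example = np.sum(one_hot * np.log(safe_p / model_probs), axis=1)

print(cross_entropy, np.mean(kl_per_example))    # identical values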

7.1 For keen students: Information theory


The KL-divergence gives the average storage wasted by a compression system that encodes
a file based on model q instead of the optimal distribution p. MacKay’s book is the place to
read about the links between machine learning and compression.
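A small discrete example of that coding interpretation (the distributions are arbitrary): with ideal codes, a symbol of probability r costs −log2 r bits, and the per-symbol overhead from using q’s code on data from p is exactly DKL(p || q) measured in bits:

import numpy as np

# Arbitrary three-symbol distributions.
p = np.array([0.5, 0.25, 0.25])     # true source distribution
q = np.array([1/3, 1/3, 1/3])       # model used to build the code

bits_with_q = np.sum(p * -np.log2(q))   # average bits/symbol when coding with q's code
bits_with_p = np.sum(p * -np.log2(p))   # entropy of p: the optimal average
kl_bits = np.sum(p * np.log2(p / q))    # KL divergence in bits

print(bits_with_q - bits_with_p, kl_bits)   # same number: the wasted storage per symbol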

8 For keen students: DKL ( p||q) and moment matching


One question that you may have is: “Why does minimizing DKL(p || q) with Gaussian q
lead to matching the mean and covariance of p?” If you substitute a Gaussian q into the
formula, differentiate and set to zero, this result will drop out (eventually). However, the
working for a multivariate Gaussian will be messy. It’s easier to show a more general result.
We give the details here.
We’re going to match two distributions over w. My variational parameters, defining q, will
be θ. We choose q to be a member of an exponential family:
q(w) = (1/Z(θ)) exp(θ⊤φ(w)),  (9)

where Z(θ) = ∫ exp(θ⊤φ(w)) dw, and φ(w) is a vector of statistics, defining the approximating
family. If we choose φ to contain each variable w_d, and the product of each pair of variables,
w_c w_d, then we are fitting a Gaussian distribution. Substituting our approximating family
into the variational objective:
DKL(p || q) = ∫ p(w) log [p(w)/q(w)] dw,  (10)

which up to a constant with respect to θ is4:

− ∫ p(w) θ⊤φ(w) dw + log Z(θ).  (11)

We differentiate the KL wrt θ, to get

− ∫ p(w) φ(w) dw + (1/Z(θ)) ∫ φ(w) exp(θ⊤φ(w)) dw.  (12)

4. A common mistake is to omit Z (θ ), which is not a constant wrt θ.



The gradient is a difference of expectations, one under p and the other under q. The gradient
is zero when the expectations are equal:

Ep[φ(w)] = Eq[φ(w)].  (13)

That is, the minimum of the objective function (ok, we’d need to do slightly more work
to prove this turning point is a minimum) is when the expected statistics defined by φ all
match in the two distributions. For a Gaussian, that means the mean and covariance match.
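A numerical check of this result on a made-up 1D target (a mixture of two Gaussians): minimizing DKL(p || q) over a Gaussian q with a generic optimizer recovers the mean and variance of p:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# A made-up non-Gaussian target p: a mixture of two Gaussians.
z = np.linspace(-12, 12, 24_001)
p = 0.7 * norm.pdf(z, -2, 0.8) + 0.3 * norm.pdf(z, 3, 1.5)

mean_p = np.trapz(z * p, z)
var_p = np.trapz((z - mean_p)**2 * p, z)

def kl_pq(params):
    # D_KL(p || q) for q = N(m, v), with grid-based integration.
    m, log_v = params
    q = norm.pdf(z, m, np.sqrt(np.exp(log_v)))
    return np.trapz(p * (np.log(p + 1e-300) - np.log(q + 1e-300)), z)

m_opt, log_v_opt = minimize(kl_pq, x0=[0.0, 0.0]).x
print(mean_p, var_p)               # mean and variance of p
print(m_opt, np.exp(log_v_opt))    # the optimum recovers them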

MLPR:w11b Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
