
Variational objectives and KL Divergence

The Laplace approximation fitted a Gaussian distribution to a parameter posterior by matching a mode and the curvature of the log posterior at that mode. We saw that there are failure modes when the shape of the distribution at the mode is misleading about the overall distribution.
Variational methods fit a target distribution, such as a parameter posterior, by defining an
optimization problem. The ingredients are:

• A family of possible distributions q(w; α).


• A variational cost function, which describes the discrepancy between q(w; α) and the
target distribution (for us, the parameter posterior).

The computational task is to optimize the variational parameters (here α).1


For this course, the variational family will always be Gaussian:

q(w; α = {m, V}) = N(w; m, V).  (1)

So we fit the mean and covariance of the approximation to find the best match to the
posterior according to our variational cost function. Although we won’t consider other cases,
the variational family doesn’t have to be Gaussian. The variational distribution can be a
discrete distribution if we have a posterior distribution over discrete variables.
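As a minimal sketch (not from the notes) of what this Gaussian family looks like in code: the snippet below builds q(w; m, V) for D parameters, keeping the covariance positive definite through a factor L with V = L Lᵀ, which is one common parameterization rather than anything the notes prescribe:

import numpy as np
from scipy.stats import multivariate_normal

# A minimal sketch of the family in equation (1) for D parameters. Keeping the
# covariance positive definite via a factor L with V = L @ L.T is one common
# choice (an assumption of this sketch, not something fixed by the notes).
D = 3
m = np.zeros(D)        # variational mean
L = np.eye(D)          # e.g. a lower-triangular factor; V = L @ L.T
V = L @ L.T

q = multivariate_normal(mean=m, cov=V)
w_samples = q.rvs(size=5)        # draws w ~ q(w; m, V)
log_q = q.logpdf(w_samples)      # log q(w; m, V), needed by the KL objectives below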

1 Kullback–Leibler Divergence
The Kullback–Leibler divergence, usually just called the KL-divergence, is a common measure
of the discrepancy between two distributions:

DKL(p || q) = ∫ p(z) log [p(z)/q(z)] dz.  (2)

The KL-divergence is non-negative, DKL ( p || q) ≥ 0, and is only zero when the two distribu-
tions are identical.
The divergence doesn’t satisfy the formal criteria to be a distance; for example, it isn’t
symmetric: DKL(p || q) ≠ DKL(q || p).
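A quick numerical check of these two properties, using a pair of arbitrary 1D Gaussians on a grid (the particular means and variances are just an example, not from the notes):

import numpy as np
from scipy.stats import norm

# Two arbitrary 1D Gaussians evaluated on a dense grid.
z = np.linspace(-10, 10, 10_001)
p = norm.pdf(z, loc=0.0, scale=2.0)
q = norm.pdf(z, loc=1.0, scale=1.0)

def kl(a, b):
    # Grid approximation of D_KL(a || b) = ∫ a(z) log[a(z)/b(z)] dz
    return np.trapz(a * (np.log(a) - np.log(b)), z)

print(kl(p, q), kl(q, p))   # both positive, and clearly different: not symmetric

# Closed form for two Gaussians, for comparison with kl(p, q):
m0, s0, m1, s1 = 0.0, 2.0, 1.0, 1.0
print(np.log(s1/s0) + (s0**2 + (m0 - m1)**2) / (2 * s1**2) - 0.5)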

2 Minimizing DKL ( p || q)
To minimize DKL ( p || q) we set the variational parameters m and V to match the mean and
covariance of the target distribution p. The illustration below shows an example from the
notes on Bayesian logistic regression. The Laplace approximation is poor on this example:
the mode of the posterior is very close to the mode of the prior, and the curvature there is
almost the same as well. The Laplace approximation sets the approximate posterior so close
to the prior distribution (blue solid line below) that I haven’t plotted it. A different Gaussian
fit (magenta dotted line), with the same mean m and variance V as the posterior distribution,
is a better summary of where the plausible parameters are than the Laplace approximation:

1. The textbooks often avoid parameterizing q in their presentations of variational methods. Instead they describe
the optimization problem as being on the distribution q itself, using the calculus of variations. We don’t need such
a general treatment in this course.



[Figure: the prior p(w), the posterior p(w | D), and the moment-matched Gaussian N(w; m, V), plotted over w from −4 to 4.]

Optimizing DKL ( p(w | D) || q(w; α)) tends to be difficult. The cost function is an expectation
under the complicated posterior distribution that we are trying to approximate, and we
usually can’t evaluate it.
Even if we could find the mean and covariance of the target posterior (approximating them would be possible), the answer may not be sensible. Matching the mean of a bimodal distribution will find an approximation centred on implausible parameters:

Our predictions are not likely to be sensible if we mainly use parameters that are not
plausible given the data.
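As a concrete (made-up) illustration of this failure, the snippet below moment-matches a Gaussian to an equal mixture of two narrow Gaussians at ±3; the fitted mean lands near 0, where the target density is essentially zero:

import numpy as np
from scipy.stats import norm

# A made-up bimodal 'posterior': an equal mixture of N(-3, 0.5^2) and N(+3, 0.5^2).
z = np.linspace(-8, 8, 8_001)
p = 0.5 * norm.pdf(z, -3, 0.5) + 0.5 * norm.pdf(z, 3, 0.5)

# Minimizing DKL(p || q) over Gaussian q matches the mean and variance of p:
m = np.trapz(z * p, z)               # ~ 0
V = np.trapz((z - m)**2 * p, z)      # ~ 3^2 + 0.5^2 = 9.25

# The fitted q puts most of its mass near w = 0, where p is essentially zero:
print(norm.pdf(0.0, m, np.sqrt(V)))                                 # q's density at its mean, ~0.13
print(0.5 * norm.pdf(0.0, -3, 0.5) + 0.5 * norm.pdf(0.0, 3, 0.5))   # p's density there, ~1e-8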

3 Minimizing DKL (q || p)
Most variational methods in Machine Learning minimize DKL (q(w; α) || p(w | D)), partly
because we are better at optimizing this cost function. (There are also other sensible varia-
tional principles, but we won’t cover them in this course.) Minimizing the KL-divergence
this way around encourages the fit to concentrate on plausible parameters:
DKL(q(w; α) || p(w | D)) = ∫ q(w; α) log [q(w; α) / p(w | D)] dw  (3)

= − ∫ q(w; α) log p(w | D) dw + ∫ q(w; α) log q(w; α) dw.  (4)

To make the first term small, we avoid putting probability mass on implausible parameters.
As an extreme example, there is an infinite penalty for putting probability mass of q on a
region of parameters that are impossible given the data. The second term is the negative
entropy of the distribution q.2 To make the second term small we want a high entropy
distribution, one that is as spread out as possible.
Minimizing this KL-divergence usually results in a Gaussian approximation that finds one
mode of the distribution, and spreads out to cover as much mass in that mode as possible.

2. H is the standard symbol for entropy, and has nothing to do with a Hessian (also H; sorry!).

MLPR:w11b Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 2


However, the distribution can’t spread out to cover low probability regions, or the first term
would grow large. See Murphy Figure 21.1 for an illustration.
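For contrast, here is a rough numerical sketch of this mode-seeking behaviour, reusing the made-up bimodal target from the earlier sketch and grid-based integrals; the Gaussian locks onto whichever mode it starts near and matches its local width, rather than spreading across both modes:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# The same made-up bimodal target as in the earlier sketch.
z = np.linspace(-8, 8, 8_001)
p = 0.5 * norm.pdf(z, -3, 0.5) + 0.5 * norm.pdf(z, 3, 0.5)

def kl_qp(params):
    # D_KL(q || p) for q = N(m, v), with grid-based integration.
    m, log_v = params
    q = norm.pdf(z, m, np.sqrt(np.exp(log_v)))
    return np.trapz(q * (np.log(q + 1e-300) - np.log(p + 1e-300)), z)

# Start near the right-hand mode; the fit stays there rather than covering both modes.
m_opt, log_v_opt = minimize(kl_qp, x0=[2.0, 0.0]).x
print(m_opt, np.sqrt(np.exp(log_v_opt)))   # close to 3 and 0.5: one mode, local width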
If we substitute the expression for the posterior from Bayes’ rule,

p(w | D) = p(D | w) p(w) / p(D),  (5)

into the KL-divergence, we get a spray of terms:

DKL(q || p) = Eq[log q(w)] − Eq[log p(D | w)] − Eq[log p(w)] + log p(D).  (6)

The first three terms, equal to J(q) in Murphy’s book, depend on the variational distribution
(or its parameters), so we minimize these terms. The final term, log p(D), is the log marginal
likelihood (also known as the “model evidence”). Knowing that the KL-divergence is non-
negative gives us a bound on the marginal likelihood:

DKL(q || p) ≥ 0  ⇒  log p(D) ≥ −J(q).  (7)

Thus, fitting the variational objective is optimizing a lower bound on the log marginal
likelihood. Recently “the ELBO” or “Evidence Lower Bound” has become a popular name
for − J (q).3
[The website version of this note has a question here.]
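To see the bound concretely, here is a small Monte Carlo check on a made-up conjugate model (a Gaussian prior on a single weight and a Gaussian likelihood with known noise), chosen only because its posterior and log p(D) are available exactly; none of these numbers come from the notes:

import numpy as np
from scipy.stats import norm, multivariate_normal

rng = np.random.default_rng(0)

# Made-up conjugate model, chosen only because log p(D) is available exactly:
#   prior  w ~ N(0, 1),   likelihood  y_n ~ N(w, s^2)  for n = 1..N.
s, N = 2.0, 10
y = rng.normal() + s * rng.normal(size=N)

# Exact posterior and exact log marginal likelihood ("evidence"):
v_post = 1.0 / (1.0 + N / s**2)
m_post = v_post * y.sum() / s**2
log_evidence = multivariate_normal.logpdf(
    y, mean=np.zeros(N), cov=s**2 * np.eye(N) + np.ones((N, N)))

def elbo(m, v, S=100_000):
    # Monte Carlo estimate of -J(q) for q(w) = N(w; m, v).
    w = m + np.sqrt(v) * rng.normal(size=S)
    log_lik = norm.logpdf(y[:, None], loc=w, scale=s).sum(axis=0)
    log_prior = norm.logpdf(w, 0.0, 1.0)
    log_q = norm.logpdf(w, m, np.sqrt(v))
    return np.mean(log_lik + log_prior - log_q)

print(log_evidence)            # exact log p(D)
print(elbo(m_post, v_post))    # with q equal to the posterior the bound is tight
print(elbo(0.0, 1.0))          # any other q gives a smaller value: a lower bound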

4 Optimization methods for DKL (q || p)


The literature is full of clever iterative ways to optimize DKL (q || p) for different models.
Could we use standard optimizers? The hardest term to evaluate is:
Eq[log P(D | w)] = ∑_{n=1}^{N} Eq[log P(y^(n) | x^(n), w)],  (8)

which is a sum of (possibly simple) integrals. In the last few years variational inference has
become dominated by stochastic gradient descent, which updates the variational parameters
using unbiased approximations of the variational cost function and its gradients.
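A sketch of that idea, using the reparameterization w = m + s·ε (one common way to get unbiased gradient estimates; others exist) on a made-up 1D Bayesian logistic regression. The model, step size, and iteration count are all assumptions for illustration, not the course’s code:

import numpy as np
from scipy.special import expit   # numerically stable logistic sigmoid

rng = np.random.default_rng(1)

# A made-up 1D Bayesian logistic regression with labels y_n in {-1, +1}:
#   prior  w ~ N(0, sigma0^2),   likelihood  P(y_n | x_n, w) = sigmoid(y_n * w * x_n).
sigma0 = 10.0
x = rng.normal(size=20)
y = np.where(rng.random(20) < expit(1.5 * x), 1, -1)   # data generated with w near 1.5

def grad_log_joint(w):
    # d/dw [ log p(D | w) + log p(w) ] for this model
    return np.sum(y * x * (1.0 - expit(y * x * w))) - w / sigma0**2

# Fit q(w) = N(w; m, s^2) by stochastic gradient descent on J(q) = -ELBO,
# using the reparameterization w = m + s*eps with eps ~ N(0, 1). The fixed step
# size and iteration count are arbitrary choices for this sketch.
m, log_s = 0.0, 0.0
lr = 0.02
for it in range(5000):
    eps = rng.normal()
    s = np.exp(log_s)
    g = grad_log_joint(m + s * eps)
    grad_m = -g                        # one-sample unbiased estimate of dJ/dm
    grad_log_s = -g * s * eps - 1.0    # the -1 comes from the entropy term, -log s
    m -= lr * grad_m
    log_s -= lr * grad_log_s

print(m, np.exp(log_s))   # variational mean and standard deviation of the fitted q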

5 Reading
Reading for variational inference in Murphy’s book: Sections 21.1–21.2.

6 Overview of Gaussian approximations


Laplace approximation:

• Straightforward to apply
• 2nd derivatives ⇒ certainty of parameters
• Incremental improvement on MAP estimate
• Approximation of marginal/model likelihood

3. Variational inference was originally inspired by work in statistical physics, and with that analogy, + J (q) is also
called the variational free energy.



Variational methods:

• Optimization: fit variational parameters of q (not w!)


• Usually DKL (q || p), not DKL ( p || q)
• Bound on the marginal/model likelihood
• Optimization: traditionally harder to apply. Now becoming automatic as well.

7 The KL divergence appears in other contexts


A student in a previous year asked: “In MLP, we use the KL divergence cost function to train
neural nets. More specifically, we use the divergence between the 1-0 labels and the model output. . .
Does it count as using a variational method?”
The answer is no. One principle for fitting a single setting of the weights is to estimate the KL
between the true label distribution and the model’s distribution. A Monte Carlo estimate of
this KL uses samples from the true distribution found in the training data. Optimizing this
objective is the same as maximum likelihood training. Variational methods (in the context of
Bayesian methods) instead approximate the posterior distribution over the weights with another distribution.
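A tiny numerical illustration of that equivalence (the class probabilities are made up): with hard 0-1 labels, each training example’s “true” distribution is a point mass on the observed class, so the Monte Carlo KL estimate reduces to the usual cross-entropy, i.e. the negative log-likelihood:

import numpy as np

# Made-up class probabilities for four training examples and three classes.
labels = np.array([0, 2, 1, 2])                  # observed classes
model_probs = np.array([[0.70, 0.20, 0.10],
                        [0.10, 0.30, 0.60],
                        [0.20, 0.50, 0.30],
                        [0.25, 0.25, 0.50]])

# Standard cross-entropy / negative log-likelihood of the observed labels:
cross_entropy = -np.mean(np.log(model_probs[np.arange(4), labels]))

# The same number via the KL formula with one-hot "true" distributions
# (zero-probability entries contribute 0, and a point mass has zero entropy):
one_hot = np.eye(3)[labels]
safe_p = np.where(one_hot > 0, one_hot, 1.0)     # avoid log(0); those terms are 0 anyway
kl_per_example = np.sum(one_hot * np.log(safe_p / model_probs), axis=1)

print(cross_entropy, np.mean(kl_per_example))    # identical values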

7.1 For keen students: Information theory


The KL-divergence gives the average storage wasted by a compression system that encodes
a file based on model q instead of the optimal distribution p. MacKay’s book is the place to
read about the links between machine learning and compression.
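A small discrete example of that coding interpretation (the distributions are arbitrary): with ideal codes, a symbol of probability r costs −log2 r bits, and the per-symbol overhead from using q’s code on data from p is exactly DKL(p || q) measured in bits:

import numpy as np

# Arbitrary three-symbol distributions.
p = np.array([0.5, 0.25, 0.25])     # true source distribution
q = np.array([1/3, 1/3, 1/3])       # model used to build the code

bits_with_q = np.sum(p * -np.log2(q))   # average bits/symbol when coding with q's code
bits_with_p = np.sum(p * -np.log2(p))   # entropy of p: the optimal average
kl_bits = np.sum(p * np.log2(p / q))    # KL divergence in bits

print(bits_with_q - bits_with_p, kl_bits)   # same number: the wasted storage per symbol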

8 For keen students: DKL ( p||q) and moment matching


One question that you may have is: “Why does minimizing DKL(p || q) with Gaussian q
lead to matching the mean and covariance of p?” If you substitute a Gaussian q into the
formula, differentiate and set to zero, this result will drop out (eventually). However, the
working for a multivariate Gaussian will be messy. It’s easier to show a more general result.
We give the details here.
We’re going to match two distributions over w. My variational parameters, defining q, will
be θ. We choose q to be a member of an exponential family:
q(w) = (1/Z(θ)) exp(θ⊤φ(w)),  (9)

where Z(θ) = ∫ exp(θ⊤φ(w)) dw, and φ(w) is a vector of statistics, defining the approximating
family. If we choose φ to contain each variable w_d, and the product of each pair of variables,
w_c w_d, then we are fitting a Gaussian distribution. Substituting our approximating family
into the variational objective:
DKL(p || q) = ∫ p(w) log [p(w)/q(w)] dw,  (10)

which up to a constant with respect to θ is4:

− ∫ p(w) θ⊤φ(w) dw + log Z(θ).  (11)

We differentiate the KL wrt θ, to get

− ∫ p(w) φ(w) dw + (1/Z(θ)) ∫ φ(w) exp(θ⊤φ(w)) dw.  (12)

4. A common mistake is to omit Z (θ ), which is not a constant wrt θ.



The gradient is a difference of expectations, one under p and the other under q. The gradient
is zero when the expectations are equal:

Ep[φ(w)] = Eq[φ(w)].  (13)

That is, the minimum of the objective function (ok, we’d need to do slightly more work
to prove this turning point is a minimum) is when the expected statistics defined by φ all
match in the two distributions. For a Gaussian, that means the mean and covariance match.
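A numerical check of this result on a made-up 1D target (a mixture of two Gaussians): minimizing DKL(p || q) over a Gaussian q with a generic optimizer recovers the mean and variance of p:

import numpy as np
from scipy.stats import norm
from scipy.optimize import minimize

# A made-up non-Gaussian target p: a mixture of two Gaussians.
z = np.linspace(-12, 12, 24_001)
p = 0.7 * norm.pdf(z, -2, 0.8) + 0.3 * norm.pdf(z, 3, 1.5)

mean_p = np.trapz(z * p, z)
var_p = np.trapz((z - mean_p)**2 * p, z)

def kl_pq(params):
    # D_KL(p || q) for q = N(m, v), with grid-based integration.
    m, log_v = params
    q = norm.pdf(z, m, np.sqrt(np.exp(log_v)))
    return np.trapz(p * (np.log(p + 1e-300) - np.log(q + 1e-300)), z)

m_opt, log_v_opt = minimize(kl_pq, x0=[0.0, 0.0]).x
print(mean_p, var_p)               # mean and variance of p
print(m_opt, np.exp(log_v_opt))    # the optimum recovers them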

MLPR:w11b Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/
