Machine Learning and Pattern Recognition - Laplace Approximation
We want the most probable weights under the posterior:

$$ w^* \;=\; \arg\max_w\, p(w \mid \mathcal{D}) \;=\; \arg\max_w\, \log p(w, \mathcal{D}). \qquad (1) $$

The conditional probability on the left is what we intuitively want to optimize. The maximization on the right gives the same answer, but contains the term we will actually compute. Reminder: why do we take the log?¹

[The website version of this note has a question here.]
We usually find the mode of the distribution by minimizing an ‘energy’, which is the negative log-probability of the distribution up to a constant. For a posterior distribution, we can define the energy as:

$$ E(w) = -\log p(w, \mathcal{D}), \qquad w^* = \arg\min_w E(w). \qquad (2) $$
In one dimension, the curvature of the energy at its minimum is

$$ H = \left.\frac{d^2 E(w)}{dw^2}\right|_{w=w^*}. \qquad (3) $$

The notation means that we evaluate the second derivative at the optimum, w = w*. If H is large, the slope (the first derivative) changes rapidly from a steep descent to a steep ascent: the distribution is sharply peaked, so we should approximate it with a narrow Gaussian. Generalizing to multiple variables w, we know ∇w E is zero at the optimum, and we evaluate the Hessian, a matrix with elements:

$$ H_{ij} = \left.\frac{\partial^2 E(w)}{\partial w_i\, \partial w_j}\right|_{w=w^*}. \qquad (4) $$
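If the second derivatives are awkward to derive by hand, the Hessian elements in (4) can also be estimated numerically. A minimal NumPy sketch (not part of the original notes; energy is any function returning E(w), and the step size eps is an arbitrary illustrative choice):

import numpy as np

def numerical_hessian(energy, w_star, eps=1e-5):
    # Estimate H[i, j] = d^2 E / (dw_i dw_j) at w_star by central differences.
    D = w_star.size
    H = np.zeros((D, D))
    for i in range(D):
        for j in range(D):
            w = w_star.copy(); w[i] += eps; w[j] += eps; Epp = energy(w)
            w = w_star.copy(); w[i] += eps; w[j] -= eps; Epm = energy(w)
            w = w_star.copy(); w[i] -= eps; w[j] += eps; Emp = energy(w)
            w = w_star.copy(); w[i] -= eps; w[j] -= eps; Emm = energy(w)
            H[i, j] = (Epp - Epm - Emp + Emm) / (4 * eps**2)
    return 0.5 * (H + H.T)  # symmetrize to clean up numerical noise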
1. Because log is a monotonic transformation, maximizing the log of a function is equivalent to maximizing the
original function. Often the log of a distribution is more convenient to work with, less prone to numerical problems,
and closer to an ideal quadratic function that optimizers like.
For comparison, the energy (negative log-density up to a constant) of a one-dimensional Gaussian N(w; µ, σ²) is:

$$ E_{\mathcal{N}}(w) = \frac{(w - \mu)^2}{2\sigma^2}. \qquad (5) $$
The minimum is w* = µ, and the second derivative is H = 1/σ², implying the variance is σ² = 1/H. Generalizing to higher dimensions, for a Gaussian N(µ, Σ) the energy is:

$$ E_{\mathcal{N}}(w) = \frac{1}{2} (w - \mu)^\top \Sigma^{-1} (w - \mu), \qquad (6) $$

with w* = µ and H = Σ^{-1}, implying the covariance is Σ = H^{-1}.
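As a concrete one-dimensional example (not from the original notes), consider Laplace-approximating a Gamma distribution with shape α > 1 and rate β, where p(w) ∝ w^{α−1} e^{−βw} for w > 0:

$$
\begin{aligned}
E(w) &= -\log p(w) + \text{const} = -(\alpha - 1)\log w + \beta w,\\
\frac{dE}{dw} &= -\frac{\alpha - 1}{w} + \beta = 0 \;\Rightarrow\; w^* = \frac{\alpha - 1}{\beta},\\
H &= \left.\frac{d^2 E}{dw^2}\right|_{w^*} = \frac{\alpha - 1}{(w^*)^2} = \frac{\beta^2}{\alpha - 1}.
\end{aligned}
$$

The Laplace approximation is therefore N(w; (α−1)/β, (α−1)/β²), whereas the true mean and variance are α/β and α/β²; the match improves as α grows and the Gamma becomes more Gaussian.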
Therefore matching the minimum and curvature of the ‘energy’ (negative log-probability) to
those of a Gaussian energy gives the Laplace approximation to the posterior distribution:
$$ p(w \mid \mathcal{D}) = \frac{p(w, \mathcal{D})}{P(\mathcal{D})} \;\approx\; \mathcal{N}(w;\, w^*, H^{-1}) = \frac{|H|^{1/2}}{(2\pi)^{D/2}} \exp\!\Big( -\tfrac{1}{2} (w - w^*)^\top H\, (w - w^*) \Big). \qquad (8) $$
Rearranging (8) and evaluating it at w = w* gives an approximation to the normalizer,

$$ P(\mathcal{D}) \;\approx\; p(w^*, \mathcal{D})\, (2\pi)^{D/2}\, |H|^{-1/2}. $$

An equivalent expression is

$$ \log P(\mathcal{D}) \;\approx\; \log p(w^*, \mathcal{D}) + \frac{D}{2}\log(2\pi) - \frac{1}{2}\log|H|. $$
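As a sketch of how these pieces fit together in code, here is a Laplace approximation for Bayesian logistic regression with a spherical Gaussian prior (the model, function names, and prior precision alpha are illustrative assumptions, not prescriptions from these notes):

import numpy as np
from scipy.optimize import minimize
from scipy.special import expit

def laplace_fit(X, y, alpha=1.0):
    # X: (N, D) inputs; y: (N,) labels in {0, 1}; prior p(w) = N(0, (1/alpha) I).
    N, D = X.shape

    def energy(w):  # E(w) = -log p(w, D) = -log p(y | X, w) - log p(w)
        s = 2*y - 1                                     # labels as +/-1
        log_lik = -np.sum(np.logaddexp(0, -s*(X @ w)))
        log_prior = -0.5*alpha*(w @ w) + 0.5*D*np.log(alpha/(2*np.pi))
        return -(log_lik + log_prior)

    def grad(w):
        return X.T @ (expit(X @ w) - y) + alpha*w

    def hess(w):
        p = expit(X @ w)
        return (X.T * (p*(1 - p))) @ X + alpha*np.eye(D)

    w_star = minimize(energy, np.zeros(D), jac=grad, method='L-BFGS-B').x
    H = hess(w_star)                                    # Hessian of E at w*, as in (4)

    # Approximate posterior N(w; w*, H^{-1}) and log P(D), as above.
    _, logdetH = np.linalg.slogdet(H)
    log_evidence = -energy(w_star) + 0.5*D*np.log(2*np.pi) - 0.5*logdetH
    return w_star, H, log_evidence

Comparing log_evidence across models (for example, across settings of alpha) is one common use of this approximation.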
For Bayesian logistic regression, the predictive probability P(y = 1 | x, D) is an expectation of σ(w⊤x) under our (approximate) posterior beliefs about the weights. However, this expectation can be simplified. Only the inner product a = w⊤x matters, so we can take the average over this scalar quantity instead. The activation a is a linear combination of quantities we have Gaussian beliefs about, so our beliefs about it are also Gaussian. By now you should be able to show that

$$ p(a) = \mathcal{N}\!\big(a;\; {w^*}^{\!\top} x,\; x^\top H^{-1} x\big). $$
Therefore, the predictions given the approximate posterior are given by a one-dimensional integral,

$$ P(y = 1 \mid x, \mathcal{D}) \;\approx\; \int \sigma(a)\, \mathcal{N}\!\big(a;\; {w^*}^{\!\top} x,\; x^\top H^{-1} x\big)\, da, $$

which can in turn be approximated in closed form:

$$ P(y = 1 \mid x, \mathcal{D}) \;\approx\; \sigma\big(\kappa\, {w^*}^{\!\top} x\big), \qquad \kappa = \frac{1}{\sqrt{1 + \frac{\pi}{8}\, x^\top H^{-1} x}}. \qquad (18) $$
Under this approximation, the predictions use the most probable or MAP weights. However, the activation is scaled down (by κ) when it is uncertain, so that predictions will be less confident far from the data (as they should be).
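A corresponding sketch of the approximate predictive probability (18), reusing w_star and H from a fit such as the one above (the helper name is an assumption):

import numpy as np
from scipy.special import expit

def laplace_predict(x, w_star, H):
    # P(y=1 | x, D) ~= sigma(kappa * w*^T x), with kappa as in (18).
    var_a = x @ np.linalg.solve(H, x)      # x^T H^{-1} x, the variance of a = w^T x
    kappa = 1.0 / np.sqrt(1.0 + np.pi*var_a/8.0)
    return expit(kappa * (w_star @ x))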
The link between the curvature and the Gaussian fit comes from a Taylor expansion of the energy about its minimum. In one dimension:

$$
\begin{aligned}
E(w^* + \delta) &\;\approx\; E(w^*) + \left.\frac{\partial E}{\partial w}\right|_{w^*}\!\delta + \frac{1}{2}\left.\frac{\partial^2 E}{\partial w^2}\right|_{w^*}\!\delta^2 \qquad &(19)\\
&\;\approx\; E(w^*) + \frac{1}{2} H \delta^2, \qquad &(20)
\end{aligned}
$$

where the second term disappears because ∂E/∂w is zero at the optimum. In multiple dimensions this Taylor approximation generalizes to:

$$ E(w^* + \delta) \;\approx\; E(w^*) + \frac{1}{2}\, \delta^\top H\, \delta. $$
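One quick way to sanity-check a fitted approximation is to compare the true change in energy with this quadratic form for small random perturbations; a sketch (assuming energy, w_star, and H as in the earlier code):

import numpy as np

def check_quadratic(energy, w_star, H, scale=1e-3, n_trials=5, seed=0):
    # E(w* + delta) - E(w*) should be close to 0.5 * delta^T H delta for small delta.
    rng = np.random.default_rng(seed)
    for _ in range(n_trials):
        delta = scale * rng.standard_normal(w_star.size)
        exact = energy(w_star + delta) - energy(w_star)
        quad = 0.5 * (delta @ H @ delta)
        print(f"exact: {exact:.3e}   quadratic: {quad:.3e}")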
In models with many parameters, the posterior will often be flat in some directions, where parameters trade off against each other to give similar predictions. When there is zero curvature in some direction, the Hessian isn't positive definite, and we can't get a meaningful approximation.
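In practice, one way to detect this failure mode is to inspect the eigenvalues of the Hessian before inverting it; a small sketch (the tolerance is an arbitrary illustrative choice):

import numpy as np

def hessian_is_usable(H, tol=1e-8):
    # The Laplace approximation needs H to be positive definite.
    # Eigenvalues at or near zero indicate flat directions of the posterior.
    eigvals = np.linalg.eigvalsh(H)        # H is symmetric, so eigvalsh applies
    return eigvals.min() > tol * max(eigvals.max(), 1.0)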
6 Further Reading
Bishop covers the Laplace approximation and application to Bayesian logistic regression in
Sections 4.4 and 4.5.
Or read Murphy Sections 8.4 to 8.4.4 inclusive. You can skip 8.4.2 on BIC.
Similar material is covered by MacKay, Ch. 41, pp492–503, and Ch. 27, pp341–342.
The Laplace approximation was used in some of the earliest Bayesian neural networks
although — as presented here — it’s now rarely used. However, the idea does occur in recent
work, such as on continual learning (Kirkpatrick et al., Google DeepMind, 2017), and a more
sophisticated variant is used by the popular statistical package, R-INLA.
2. The final two figures in this note come from previous MLPR course notes, by one of Amos Storkey, Chris
Williams, or Charles Sutton.