0% found this document useful (0 votes)

28 views4 pages

Machine Learning and Pattern Recognition Week 10 - Bayes - Logistic - Regression

This document discusses Bayesian logistic regression. It provides an overview of logistic regression and maximum likelihood estimation. It then introduces Bayesian logistic regression, which averages predictions over different parameter values weighted by their posterior probability. The document discusses how Bayesian predictions differ from maximum likelihood estimates and are more uncertain far from training data. It also notes that exactly computing Bayesian predictions is challenging due to intractable integrals.

Uploaded by

zeliawillscumberg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

28 views4 pages

Machine Learning and Pattern Recognition Week 10 - Bayes - Logistic - Regression

Uploaded by

zeliawillscumberg

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 4

Bayesian logistic regression

So far we have only performed probabilistic inference in two particularly tractable situations:
1) small discrete models: inferring the class in a Bayes classifier, the card game, the robust
logistic regression model. 2) “linear-Gaussian models”, where the observations are linear
combinations of variables with Gaussian beliefs, to which we add Gaussian noise.
For most models, we cannot compute the equations for making Bayesian predictions exactly.
Logistic regression will be our working example. We’ll look at how Bayesian predictions differ
from regularized maximum likelihood. Then we’ll look at different ways to approximately
compute the integrals.

1 Logistic regression
As a quick review, the logistic regression model gives the probability of a binary label given
a feature vector:
>
P(y = 1 | x, w) = σ (w> x) = 1/(1 + e−w x ). (1)
We usually add a bias parameter b to the model, making the probability σ (w> x+b). Although
the bias is often dropped from the presentation, to reduce clutter. We can always work out
how to add a bias back in, by including a constant element in the input features x.
You’ll see various notations used for the training data D . The model gives the probability of
a vector of outcomes y associated with a matrix of inputs X (where the nth row is x(n)> ).
Maximum likelihood fitting maximizes the probability:

P(y | X, w) = ∏ σ ( z ( n ) w > x ( n ) ), where z(n) = 2y(n) − 1, if y(n) ∈ {0, 1}. (2)

For compactness, we’ll write this likelihood as P(D | w), even though really only the outputs
y in the data are modelled. The inputs X are assumed fixed and known.
Logistic regression is most frequently fitted by a regularized form of maximum likelihood.
For example L2 regularization fits an estimate
h i
w∗ = arg max log P(y | X, w) − λw> w . (3)
w

We find a setting of the weights that make the training data appear probable, but discourage
fitting extreme settings of the weights, that don’t seem reasonable. Usually the bias weight
will be omitted from the regularization term.
Just as with simple linear regression, we can instead follow a Bayesian approach. The weights
are unknown, so predictions are made considering all possible settings, weighted by how
plausible they are given the training data.

2 Bayesian logistic regression

The posterior distribution over the weights is given by Bayes’ rule:

P(D | w) p(w)
p(w | D) = ∝ P(D | w) p(w). (4)
P(D)

The normalizing constant is the integral required to make the posterior distribution integrate
to one: Z
P(D) = P(D | w) p(w) dw. (5)

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 1

The figures below1 are for five different plausible sets of parameters, sampled from the pos-
terior p(w | D).2 Each figure shows the decision boundary σ (w> x) = 0.5 for one parameter
vector as a solid line, and two other contours given by w> x = ±1.

The axes in the figures above are the two input features x1 and x2 . The model included a bias
parameter, and the model parameters were sampled from the posterior distribution given
data from the two classes as illustrated. The arrow, perpendicular to the decision boundary,
illustrates the direction and magnitude of the weight vector.
Assuming that the data are well-modelled by logistic regression, it’s clear that we don’t
know what the correct parameters are. That is, we don’t know what parameters we would
fit after seeing substantially more data. The predictions given the different plausible weight
vectors differ substantially.
The Bayesian way to proceed is to use probability theory to derive an expression for the
prediction we want to make:
Z
P(y | x, D) = p(y, w | x, D) dw (6)
Z
= P(y | x, w) p(w | D) dw. (7)

That is, we should average the predictive distributions P(y | x, w) for different parameters,
weighted by how plausible those parameters are, p(w | D). Contours of this predictive
distribution, P(y = 1 | x, D) ∈ {0.5, 0.27, 0.73, 0.12, 0.88}, are illustrated in the left panel below.
Predictions at some constant distance away from the decision boundary are less certain
when further away from the training inputs. That’s because the different predictors above
disagreed in regions far from the data.

10 10

5 5
(a) (b)

0 0
0 5 10 0 5 10

Again, the axes are the input features x1 and x2 . The right hand figure shows P(y = 1 | x, w∗ )
for some fitted weights w∗ . No matter how these fitted weights are chosen, the contours
have to be linear. The parallel contours mean that the uncertainty of predictions falls at the
same rate when moving away from the decision boundary, no matter how far we are from
the training inputs.

1. The two figures in this section are extracts from Figure 41.7 of MacKay’s textbook (p499). Murphy’s Figures 8.5
and 8.6 contain a similar illustration.
2. It’s not obvious how to generate samples from p(w | D), and in fact it’s hard to do exactly. These samples were
drawn approximately with a “Markov chain Monte Carlo” method.

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 2

It’s common to describe L2 regularized logistic regression as MAP (Maximum a posteriori)
estimation with a Gaussian N (0, σw2 I) prior on the weights. The “most probable”3 weights,
coincide with an L2 regularized estimate:

∗ 1 >
w = arg max [log p(w | D)] = arg max log P(D | w) − 2 w w . (8)
w w 2σw
MAP estimation is not a “Bayesian” procedure. The rules of probability theory don’t tell us to
fix an unknown parameter vector to an estimate. We could view MAP as an approximation
to the Bayesian procedure, but the figure above illustrates that it is a crude one: the Bayesian
predictions (left) are qualitatively different to the MAP ones (right).
Unfortunately, we can’t evaluate the integral for predictions P(y | x, D) in closed form.
Making model choices for Bayesian logistic regression is also computationally challenging.
The marginal probability of the data, P(D), is the marginal likelihood of the model, which
we might write as P(D | M) when we are evaluating some model choices M (such as basis
functions and hyperparameters). We also can’t evaluate the integral for P(D) in closed form.

3 The logistic regression posterior is sometimes approximately

Gaussian
We’re able to do some integrals involving Gaussian distributions. The posterior distribution
over the weights p(w | D) is not Gaussian, but we can make progress if we can approximate
it with a Gaussian.
Below is an example to illustrate how the posterior over the weights can look non-Gaussian.
We have a Gaussian prior with one sigmoidal likelihood term. Here we assume we know the
bias4 is 10, and we have one datapoint with y = 1 at x = −20:

p(w) ∝ N (w; 0, 1) (9)

p(w | D) ∝ N (w; 0, 1) σ (10 − 20w). (10)

We are now fairly sure that the weight isn’t a large positive value, because otherwise we’d
have probably seen y = 0. We (softly) slice off the positive region5 and renormalize to get the
posterior distribution illustrated below:

p(w)
p(w | D)

−4 −2 0 2 4
w

The distribution is asymmetric and so clearly not Gaussian. Every time we multiply the
posterior by a sigmoidal likelihood, we softly carve away half of the weight space in some
direction. While the posterior distribution has no neat analytical form, the distribution over
plausible weights often does look Gaussian after many observations.
[The website version of this note has a question here.]

3. “Most probable” is problematic for real-valued parameters. Really we are picking the weights with the highest
probability density. But those weights aren’t well-defined, because if we consider a non-linear reparameterization of
the weights, the maximum of the pdf will be in a different place. That’s why we prefer to describe estimating the
weights as “regularized maximum likelihood” or “penalized maximum likelihood” rather than MAP.
4. Perhaps we have many datapoints and have fitted the bias precisely, but we have one datapoint that has a novel
feature turned on, and the example is showing the posterior over the weight that interacts with that one feature.
5. If it’s not obvious what’s going on, plot σ(10 − 20w) against w. We are multiplying our prior by this soft step
function, which multiplies the prior by nearly one on the left, and nearly zero on the right.

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 3

As another example, let’s consider N = 500 labels, {z(n) }, generated from a logistic regression
model with no bias and with w = 1 at x (n) ∼ N (0, 102 ). Then,

p(w) ∝ N (w; 0, 1) (11)

500
p(w | D) ∝ N (w; 0, 1) ∏ σ(wx(n) z(n) ), z(n) ∈ {±1}. (12)
n =1

The posterior now appears to be a beautiful bell-shape:

p(w)
p(w | D)

−4 −2 0 2 4
w

Fitting a Gaussian distribution (using the Laplace approximation, next note) shows that the
distribution isn’t quite Gaussian. . . but it’s close:

p(w)
p(w | D)
N (w; w ∗ , 1/H)

−4 −2 0 2 4
w

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 4

R.A. Thisted - Elements of Statistical Computing - Numerical Computation-Routledge (1988)
100% (4)
R.A. Thisted - Elements of Statistical Computing - Numerical Computation-Routledge (1988)
456 pages
Machine Learning and Pattern Recognition - Laplace - Approximation
No ratings yet
Machine Learning and Pattern Recognition - Laplace - Approximation
4 pages
BML Lecture Notes
No ratings yet
BML Lecture Notes
126 pages
04 - Linear-Classification-2024
No ratings yet
04 - Linear-Classification-2024
65 pages
CLASS 2025 Bayesian Framework
No ratings yet
CLASS 2025 Bayesian Framework
46 pages
Logistic Regression: Some Slides Adapted From Dan Jurfasky and Brendan O'Connor
No ratings yet
Logistic Regression: Some Slides Adapted From Dan Jurfasky and Brendan O'Connor
53 pages
5 LR Apr 7 2021
No ratings yet
5 LR Apr 7 2021
94 pages
ML 3
No ratings yet
ML 3
66 pages
Machine Learning - Unit 2
No ratings yet
Machine Learning - Unit 2
104 pages
Scribe Notes BML
No ratings yet
Scribe Notes BML
25 pages
Chopin Ridgway
No ratings yet
Chopin Ridgway
24 pages
Lecture 5
No ratings yet
Lecture 5
23 pages
7 Logistic-Regression
No ratings yet
7 Logistic-Regression
63 pages
Probabilistic Models For Supervised Learning: Piyush Rai Introduction To Machine Learning (CS771A)
No ratings yet
Probabilistic Models For Supervised Learning: Piyush Rai Introduction To Machine Learning (CS771A)
32 pages
Lec22 Introduction2BayesianRegression
No ratings yet
Lec22 Introduction2BayesianRegression
42 pages
Machine Learning: Probabilistic View of Linear Regression Logistic Regression Hyperplane Based Classifiers and Perceptron
No ratings yet
Machine Learning: Probabilistic View of Linear Regression Logistic Regression Hyperplane Based Classifiers and Perceptron
67 pages
Ds 7
No ratings yet
Ds 7
20 pages
Logistic Regression
No ratings yet
Logistic Regression
26 pages
Normal Distribution (Convert Normal Random Variable To Standard Normal Variable)
67% (3)
Normal Distribution (Convert Normal Random Variable To Standard Normal Variable)
13 pages
Generalized Linear Model
No ratings yet
Generalized Linear Model
67 pages
Andrew Rosenberg - Lecture 5: Linear Regression With Regularization CSC 84020 - Machine Learning
No ratings yet
Andrew Rosenberg - Lecture 5: Linear Regression With Regularization CSC 84020 - Machine Learning
38 pages
Logistic Regression: Adapted From: Tom Mitchell's Machine Learning Book Evan Wei Xiang and Qiang Yang
No ratings yet
Logistic Regression: Adapted From: Tom Mitchell's Machine Learning Book Evan Wei Xiang and Qiang Yang
15 pages
Bayesian Modeling - Student
No ratings yet
Bayesian Modeling - Student
26 pages
Lec24 BayesianLinearRegression
No ratings yet
Lec24 BayesianLinearRegression
29 pages
CS772 Lec5
No ratings yet
CS772 Lec5
22 pages
Unit 3-Bayesian Logistic
No ratings yet
Unit 3-Bayesian Logistic
11 pages
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
No ratings yet
Introduction To Bayesian Methods: Jessi Cisewski Department of Statistics Yale University
53 pages
جلسه پنجم-1
No ratings yet
جلسه پنجم-1
15 pages
Chapter 1: Random Variable and Probability Distribution
No ratings yet
Chapter 1: Random Variable and Probability Distribution
15 pages
07 - Bayesian Learning
No ratings yet
07 - Bayesian Learning
55 pages
Logistic Regression (Probability Concepts) and Perceptron
No ratings yet
Logistic Regression (Probability Concepts) and Perceptron
20 pages
Regression Analysis
No ratings yet
Regression Analysis
18 pages
Lecture2 2013
No ratings yet
Lecture2 2013
60 pages
BaYesian Models Machine Learning 2016
No ratings yet
BaYesian Models Machine Learning 2016
126 pages
Key Concepts in Probabilistic Learning and SVMs
No ratings yet
Key Concepts in Probabilistic Learning and SVMs
15 pages
Lec 20
No ratings yet
Lec 20
16 pages
Week 4 Logistic
No ratings yet
Week 4 Logistic
21 pages
ML Assignment 1
No ratings yet
ML Assignment 1
7 pages
Bayesian Uncertainty Quantification
No ratings yet
Bayesian Uncertainty Quantification
23 pages
Machine Learning - Logistic Regression
No ratings yet
Machine Learning - Logistic Regression
16 pages
Revised Lecture Notes 2
No ratings yet
Revised Lecture Notes 2
16 pages
Bishop2008 Chapter ANewFrameworkForMachineLearnin
No ratings yet
Bishop2008 Chapter ANewFrameworkForMachineLearnin
24 pages
Lecture 6
No ratings yet
Lecture 6
13 pages
1.2.6 Advanced
No ratings yet
1.2.6 Advanced
5 pages
Note 4: EECS 189 Introduction To Machine Learning Fall 2020 1 MLE and MAP For Regression (Part I)
No ratings yet
Note 4: EECS 189 Introduction To Machine Learning Fall 2020 1 MLE and MAP For Regression (Part I)
6 pages
Bishop CH 3 Notes
No ratings yet
Bishop CH 3 Notes
6 pages
W8 - Logistic Regression
No ratings yet
W8 - Logistic Regression
18 pages
Unit 5 - Machine Learning - WWW - Rgpvnotes.in
No ratings yet
Unit 5 - Machine Learning - WWW - Rgpvnotes.in
17 pages
Example 7 Maths IA
No ratings yet
Example 7 Maths IA
19 pages
1 Lecture 5b: Probabilistic Perspectives On ML Algorithms
No ratings yet
1 Lecture 5b: Probabilistic Perspectives On ML Algorithms
6 pages
Introduction To Machine Learning: 2 Linear Classifiers
No ratings yet
Introduction To Machine Learning: 2 Linear Classifiers
4 pages
Lecture Notes For Probability and Statistics
No ratings yet
Lecture Notes For Probability and Statistics
7 pages
RVM Tutorial
No ratings yet
RVM Tutorial
25 pages
DHW - Lecture 2 - Data Handling, Errors and Statistics in Analytical Chemistry - 210916
No ratings yet
DHW - Lecture 2 - Data Handling, Errors and Statistics in Analytical Chemistry - 210916
72 pages
Machine Learning and Pattern Recognition Bayesian Complexity Control
No ratings yet
Machine Learning and Pattern Recognition Bayesian Complexity Control
4 pages
Cheatsheet Supervised Learning
100% (1)
Cheatsheet Supervised Learning
4 pages
Machine Learning and Pattern Recognition Sampling Based Approximations
No ratings yet
Machine Learning and Pattern Recognition Sampling Based Approximations
3 pages
Bayesian Modelling For Data Analysis and Learning From Data
No ratings yet
Bayesian Modelling For Data Analysis and Learning From Data
19 pages
Bayesian Basics: Ryan P. Adams
No ratings yet
Bayesian Basics: Ryan P. Adams
7 pages
Introduction To Bayesian Inference: M. Botje NIKHEF, PO Box 41882, 1009DB Amsterdam, The Netherlands June 21, 2006
No ratings yet
Introduction To Bayesian Inference: M. Botje NIKHEF, PO Box 41882, 1009DB Amsterdam, The Netherlands June 21, 2006
68 pages
Lecture 6
No ratings yet
Lecture 6
19 pages
The Expendables Value Consultants - BSAP Case Study
100% (2)
The Expendables Value Consultants - BSAP Case Study
67 pages
Modern Bayesian Econometrics
No ratings yet
Modern Bayesian Econometrics
100 pages
Cheatsheet Supervised Learning
No ratings yet
Cheatsheet Supervised Learning
4 pages
Bayes ML Tutorial
No ratings yet
Bayes ML Tutorial
69 pages
Tut6 Questions
No ratings yet
Tut6 Questions
2 pages
Estimation and Detection Theory by Don H. Johnson
No ratings yet
Estimation and Detection Theory by Don H. Johnson
214 pages
An Introductory Guide To Shazam
No ratings yet
An Introductory Guide To Shazam
138 pages
Central Limit Theorem
No ratings yet
Central Limit Theorem
36 pages
Topic:: Normal Probability Curve
No ratings yet
Topic:: Normal Probability Curve
20 pages
ODTK Objects and GUI Overview PDF
No ratings yet
ODTK Objects and GUI Overview PDF
45 pages
Doing Business in Hungary
No ratings yet
Doing Business in Hungary
22 pages
Bio Statslectures
No ratings yet
Bio Statslectures
60 pages
Mohamed 2020
No ratings yet
Mohamed 2020
22 pages
Gaussian Sequence Model
No ratings yet
Gaussian Sequence Model
470 pages
TS Part2
No ratings yet
TS Part2
62 pages
W6a Gaussian Process Kernels
No ratings yet
W6a Gaussian Process Kernels
6 pages
PMRslides 03 B
No ratings yet
PMRslides 03 B
45 pages
Stat Module 3
No ratings yet
Stat Module 3
15 pages
Modern Business Statistics With Microsoft Excel 5th Edition Anderson Solutions Manual PDF Download
100% (2)
Modern Business Statistics With Microsoft Excel 5th Edition Anderson Solutions Manual PDF Download
54 pages
Project and Operation Management: Project Scheduling With Uncertain Activity Times
No ratings yet
Project and Operation Management: Project Scheduling With Uncertain Activity Times
41 pages
Part 3
No ratings yet
Part 3
29 pages
Probability Sam
No ratings yet
Probability Sam
98 pages
Part 4
No ratings yet
Part 4
24 pages
Biological Data Science Lecture4
No ratings yet
Biological Data Science Lecture4
21 pages
MDA3S
No ratings yet
MDA3S
22 pages
Part 5
No ratings yet
Part 5
31 pages
Biological Data Science Lecture6
No ratings yet
Biological Data Science Lecture6
29 pages
Slides 03 A
No ratings yet
Slides 03 A
21 pages
MATH11183 Week 1-Part 2
No ratings yet
MATH11183 Week 1-Part 2
18 pages
Week 8 Pca
No ratings yet
Week 8 Pca
26 pages
Award in Education and Training Sample
No ratings yet
Award in Education and Training Sample
9 pages
Standard Score Villanda Pamela 1
No ratings yet
Standard Score Villanda Pamela 1
39 pages
Impact of Permeability Heterogeneity On Geothermal Battery Energy Storage
No ratings yet
Impact of Permeability Heterogeneity On Geothermal Battery Energy Storage
13 pages
W2e Multivariate Gaussian
No ratings yet
W2e Multivariate Gaussian
6 pages
Bayesian Week4 LectureNotes
No ratings yet
Bayesian Week4 LectureNotes
15 pages
MI2026 Problems
No ratings yet
MI2026 Problems
35 pages
PMRslides 02
No ratings yet
PMRslides 02
13 pages
Heat Advection
No ratings yet
Heat Advection
12 pages
Chaper 3: Probability and Probability Models: (Introductory Biostatistics - Chap T. Le)
No ratings yet
Chaper 3: Probability and Probability Models: (Introductory Biostatistics - Chap T. Le)
7 pages
Week 2 Naive Bayes
No ratings yet
Week 2 Naive Bayes
15 pages
2017 AMAM Exam Paper
No ratings yet
2017 AMAM Exam Paper
6 pages
BDS 2016-17
No ratings yet
BDS 2016-17
4 pages
BDS 2018-19
No ratings yet
BDS 2018-19
6 pages
An Analysis of The Accuracy of Bluetooth Low Energy For Indoor Positioning Applications
No ratings yet
An Analysis of The Accuracy of Bluetooth Low Energy For Indoor Positioning Applications
10 pages
A Generalized Model For Evaluating Supply-Chain Delivery Performance
No ratings yet
A Generalized Model For Evaluating Supply-Chain Delivery Performance
12 pages
4735 Random Utility Theory For Social Choice PDF
No ratings yet
4735 Random Utility Theory For Social Choice PDF
9 pages
M4 L01 NormalDistribution
No ratings yet
M4 L01 NormalDistribution
6 pages
2019 AMAM Exam Paper
No ratings yet
2019 AMAM Exam Paper
3 pages
MLPR w0f - Machine Learning and Pattern Recognition
No ratings yet
MLPR w0f - Machine Learning and Pattern Recognition
3 pages
Bayesian Workshop1 Solution
No ratings yet
Bayesian Workshop1 Solution
3 pages
Machine Learning and Pattern Recognition Minimal Stochastic Variational Inference Demo
No ratings yet
Machine Learning and Pattern Recognition Minimal Stochastic Variational Inference Demo
3 pages
Task Card 4 - The Normal Distribution
No ratings yet
Task Card 4 - The Normal Distribution
3 pages
w9b Netflix Prize
No ratings yet
w9b Netflix Prize
3 pages
DAY 2 - How Much Do You Get Paid?
No ratings yet
DAY 2 - How Much Do You Get Paid?
2 pages
w2c Central Limit
No ratings yet
w2c Central Limit
1 page
BUS 341 Exercise#4
No ratings yet
BUS 341 Exercise#4
2 pages
Gauss Nodes Revolution: Numerical Integration Theory Radically Simplified And Generalised
From Everand
Gauss Nodes Revolution: Numerical Integration Theory Radically Simplified And Generalised
Rob Porter
No ratings yet

Machine Learning and Pattern Recognition Week 10 - Bayes - Logistic - Regression

Uploaded by

Machine Learning and Pattern Recognition Week 10 - Bayes - Logistic - Regression

Uploaded by

Bayesian logistic regression

P(y | X, w) = ∏ σ ( z ( n ) w > x ( n ) ), where z(n) = 2y(n) − 1, if y(n) ∈ {0, 1}. (2)

2 Bayesian logistic regression

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 1

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 2

3 The logistic regression posterior is sometimes approximately

p(w) ∝ N (w; 0, 1) (9)

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 3

p(w) ∝ N (w; 0, 1) (11)

The posterior now appears to be a beautiful bell-shape:

MLPR:w10a Iain Murray and Arno Onken, http://www.inf.ed.ac.uk/teaching/courses/mlpr/2020/ 4

You might also like