
Lecture 4: A Bayesian View of Regression

Iain Styles
22 October 2019

Bayesian View of Regression

So far, we have adopted quite an informal approach to regression: we wrote down an error function (least squares) that made sense from an intuitive viewpoint, but we had no formal basis for claiming that the "least-squares fitting" method was a correct and valid way to approach the regression problem. Studying the problem from a Bayesian perspective will give us the formal rigour that we need in order to justify the choices we have made.
Our starting point will be to construct a model of the underlying
data-generating process. We assume that each data point is the
result of some process that has a deterministic component, and
some associated sampling uncertainty.

y = f(x, w) + e    (1)

where e ∼ N(0, σ²) is a normally distributed error term with variance σ², so that σ is a measure of the sampling uncertainty. That is, when the value of the dependent variable y is sampled for some value of the independent variable x, it is drawn from a normal distribution with mean f(x, w) and variance σ². Under this model, we can write the distribution of y as

p(y | x, w, σ²) = N(y | f(x, w), σ²)    (2)

that is, y is normally distributed with mean f(x, w) and variance σ².
Now consider that we have a dataset D = {(x_i, y_i)}_{i=1}^{N}, which we will write as (x, y). We assume that the dependent variables y_i are sampled independently from normal distributions with the same variance σ². The independence of the sampling means that the joint probability distribution over all data points can be written as the product of the distributions for each point:

p(y | x, w, σ²) = ∏_{i=1}^{N} N(y_i | f(x_i, w), σ²)    (3)

This is known as the likelihood of y: it is the probability density function of the dependent variables y conditioned on the set of parameters that describe the data-generating function (i.e. given some set of parameters, what is the probability of the measurements?).
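To make the data-generating model of Equation (1) and the likelihood of Equation (3) concrete, here is a minimal sketch in Python/NumPy (not part of the original notes). It assumes, purely for illustration, that f(x, w) is a polynomial with coefficient vector w, as in earlier lectures; all names and numbers are illustrative.

import numpy as np

rng = np.random.default_rng(0)

def f(x, w):
    # Deterministic component: a polynomial in x with coefficients w
    # (an illustrative choice; the derivation holds for any f(x, w)).
    return np.polyval(w[::-1], x)  # w[0] + w[1]*x + w[2]*x**2 + ...

# Data-generating process of Eq. (1): y = f(x, w) + e, with e ~ N(0, sigma^2)
w_true = np.array([0.0, 2.0, -1.0])
sigma = 0.3
x = np.linspace(-1.0, 1.0, 20)
y = f(x, w_true) + rng.normal(0.0, sigma, size=x.shape)

def likelihood(y, x, w, sigma):
    # Eq. (3): product of independent Gaussian densities N(y_i | f(x_i, w), sigma^2)
    resid = y - f(x, w)
    dens = np.exp(-resid**2 / (2 * sigma**2)) / np.sqrt(2 * np.pi * sigma**2)
    return np.prod(dens)

print(likelihood(y, x, w_true, sigma))        # likelihood at the true weights
print(likelihood(y, x, w_true + 0.5, sigma))  # typically much smaller for perturbed weights

In practice one works with the log-likelihood, as derived next, since a product of many small densities quickly underflows.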
With the likelihood, we can now approach regression in a different way. Since the likelihood is a proper probability density function, we can ask "what parameters w maximise it?" In other words, what is the most likely set of measurements, and what are the parameters that give rise to the most likely measurements? This is known as maximum likelihood inference.
First, we substitute in the full form of the normal distribution

N(x | µ, σ²) = (2πσ²)^(−1/2) exp(−(x − µ)²/(2σ²))

to obtain

p(y | x, w, σ²) = (2πσ²)^(−N/2) ∏_{i=1}^{N} exp(−(y_i − f(x_i, w))²/(2σ²))    (4)
We now take the logarithm of this to get rid of the exponential terms. Since the logarithm is a monotonically increasing function (it introduces no maxima or minima of its own), the maximum of the log-likelihood will be at the same value of w as the maximum of the likelihood.

ln p(y | x, w, σ²) = ln (2πσ²)^(−N/2) + ln ( ∏_{i=1}^{N} exp(−(y_i − f(x_i, w))²/(2σ²)) )    (5)

where we have used ln(ab) = ln a + ln b. We now use the generalisation of this, ln ∏_i a_i = ∑_i ln a_i, and the identity ln a^b = b ln a, to obtain the following expression for the log-likelihood:

ln p(y | x, w, σ²) = −(N/2) ln(2πσ²) − (1/(2σ²)) ∑_{i=1}^{N} (y_i − f(x_i, w))²    (6)
This has two terms. The first term (which is negative) is maximised by minimising the number of data points or the variance of the measurement noise. This is intuitively obvious: more data and/or more noise means less certainty. The second term is exactly the familiar least-squares error term (negated). Maximising the log-likelihood is therefore equivalent to minimising the least-squares error: the most likely set of data is the one with the lowest error.
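As a quick numerical check of this equivalence (a sketch under assumptions, not part of the notes): for a model that is linear in w, minimising the negative log-likelihood of Equation (6) and solving the least-squares problem directly give the same weights. The polynomial basis, the use of scipy.optimize, and all values below are illustrative.

import numpy as np
from scipy.optimize import minimize

rng = np.random.default_rng(1)

# Synthetic data from Eq. (1) with f(x, w) = w0 + w1*x + w2*x^2
x = np.linspace(-1.0, 1.0, 30)
w_true = np.array([0.5, -1.0, 2.0])
sigma = 0.2
Phi = np.vander(x, N=3, increasing=True)           # design matrix with columns 1, x, x^2
y = Phi @ w_true + rng.normal(0.0, sigma, size=x.shape)

def neg_log_likelihood(w):
    # Negative of Eq. (6), dropping the first term, which does not depend on w
    return np.sum((y - Phi @ w) ** 2) / (2 * sigma**2)

w_ml = minimize(neg_log_likelihood, x0=np.zeros(3)).x   # maximum likelihood estimate
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]           # direct least-squares solution

print(np.allclose(w_ml, w_ls, atol=1e-4))               # True: the two coincide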
We have written down an expression for the likelihood assum-
ing Gaussian noise on the data. We can now use this to perform
some rather more sophisticated types of regression. In particular, it
allows us to incorporate prior information about the problem using
Bayes' rule:

p(a | b) = p(b | a) p(a) / p(b)    (7)


where p(a | b) is the posterior distribution of a given b, p(b | a) is the likelihood of b given a, and p(a) is the prior distribution of a.
Given the likelihood p(y | x, w, σ²), we can use Bayes' rule to compute the probability density function of the model weights:

p(w | x, y, σ²) = p(y | x, w, σ²) × p(w) / p(y)    (8)
That is, the probability density function of the model weights depends on the likelihood of the measurements conditioned on the weights, multiplied by the prior distribution of the weights, and then normalised by the distribution of the measurements. We will ignore the normalising factor p(y) for simplicity and consider

p(w | x, y, σ²) ∝ p(y | x, w, σ²) × p(w)    (9)
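In code, Equation (9) simply says that an unnormalised log-posterior is the sum of the log-likelihood and the log-prior. A brief sketch (the function names are illustrative, and the prior is left as a pluggable callable):

import numpy as np

def log_likelihood(w, x, y, sigma, f):
    # Gaussian log-likelihood of Eq. (6), up to its w-independent constant
    return -np.sum((y - f(x, w)) ** 2) / (2 * sigma**2)

def log_posterior(w, x, y, sigma, f, log_prior):
    # Eq. (9): log p(w | x, y, sigma^2) = log-likelihood(w) + log-prior(w) + const,
    # where the constant absorbs the ignored normalising factor p(y)
    return log_likelihood(w, x, y, sigma, f) + log_prior(w)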



The simplest case to consider is p(w) = c, a constant: a uniform distribution over the weights. In this case we have

p(w | x, y, σ²) ∝ p(y | x, w, σ²) × c    (10)
              ∝ p(y | x, w, σ²)    (11)

and the maximum of this posterior is the same as the maximum likelihood solution from before: it is the least-squares solution. This solution assumes that all model weights, large or small, are equally likely.
Is this desirable? Sometimes, but not necessarily so. One characteristic of overfitting is that the model weights of the high-order terms can be very large. We have seen this previously in our earlier examples, reproduced in Figure 1 and Table 1. Our previous studies have focussed on removing these high-order terms from the basis set, but could we control their contribution to the model fitting in a different way?

[Figure 1: Fitting y = sin(2πx) with a polynomial fit of degree M = 9 to data with added noise.]

M      w0     w1     w2       w3        w4      w5      w6       w7        w8      w9
9    -0.66  10.98  25.62  -117.80  -143.29  405.10  246.74  -561.32  -127.91  263.129

Table 1: Coefficients of a high-order polynomial fit to noisy data show the characteristic large values of the high-order coefficients.

Let us consider another form of prior distribution for the model weights. We assume that they are drawn from a normal distribution with zero mean and, for convenience, variance σ² = 1/(2λ). We ignore normalisation constants for simplicity as they will all be absorbed into a single constant of proportionality. The distribution is conditioned on λ and, assuming each of the components is independent, the joint distribution can be written

p(w | λ) ∝ ∏_{i=1}^{M} exp(−λ w_i²)    (12)
         ∝ exp(−λ ∑_i w_i²)    (13)
         ∝ exp(−λ wᵀw)    (14)
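As a small illustration (not from the notes) of what this prior expresses: weights drawn from a zero-mean Gaussian with variance σ² = 1/(2λ) shrink towards zero as λ grows. The values of λ and M below are arbitrary.

import numpy as np

rng = np.random.default_rng(2)
M = 10  # number of weights, e.g. for a degree-9 polynomial

for lam in (0.01, 1.0, 100.0):
    sigma_w = np.sqrt(1.0 / (2.0 * lam))   # prior standard deviation, sigma^2 = 1/(2*lambda)
    w = rng.normal(0.0, sigma_w, size=M)   # one draw from p(w | lambda)
    print(lam, np.round(w, 3))             # larger lambda -> smaller typical weights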

Using Bayes' theorem we have

p(w | x, y, σ², λ) ∝ p(y | x, w, σ²) × p(w | λ)    (15)

and noting that ln(ab) = ln a + ln b, we follow the same process as before and find that this is maximised by the minimum of

L = ∑_{i=1}^{N} (y_i − f(x_i, w))² + λ wᵀw.    (16)

That is, a Gaussian prior with zero mean and variance σ² = 1/(2λ) is equivalent to adding a "penalty" term to the least-squares error function. This penalty is proportional to the squared length of the weight vector, and so when we try to minimise L it will preferentially select solutions with small values for its components. This is consistent with the Bayesian prior, which is normally distributed around zero. The most likely values of the weights are those near

to zero, and the least likely are those that are large. The parameter λ controls the width of the Gaussian prior: large λ means low variance and therefore a narrow distribution, so the larger λ is, the less likely the weights are to take large values. Because this prior distribution results in the model coefficients being kept small, it is known as a shrinkage method, and since the penalty term is the L2 norm (i.e. the squared length of the weight vector), this is often referred to as L2 regularisation, or sometimes as Tikhonov regularisation (although this is a more general class of methods).
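For a model that is linear in the weights with design matrix Φ (as in the polynomial fits used above), the minimiser of L in Equation (16) has the standard closed form w = (ΦᵀΦ + λI)⁻¹ Φᵀy. The sketch below is an illustration, not the notes' own code; the data, λ and random seed are arbitrary. It contrasts the unregularised and L2-regularised coefficients of a degree-9 fit.

import numpy as np

rng = np.random.default_rng(3)

# Noisy samples of y = sin(2*pi*x), in the spirit of the earlier overfitting example
x = np.linspace(0.0, 1.0, 10)
y = np.sin(2 * np.pi * x) + rng.normal(0.0, 0.2, size=x.shape)

M = 9
Phi = np.vander(x, N=M + 1, increasing=True)    # design matrix for a degree-9 polynomial

# Unregularised least squares: high-order coefficients can become very large
w_ls = np.linalg.lstsq(Phi, y, rcond=None)[0]

# L2-regularised (MAP) solution minimising Eq. (16): w = (Phi^T Phi + lambda*I)^(-1) Phi^T y
lam = 1e-3
w_map = np.linalg.solve(Phi.T @ Phi + lam * np.eye(M + 1), Phi.T @ y)

print(np.round(w_ls, 2))    # typically very large coefficients
print(np.round(w_map, 2))   # shrunk towards zero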
L2 regularisation is very widely used in regression tasks. In the next section of the module, we will study how to use it effectively.

Reading
Section 1.2.5 of Bishop, Pattern Recognition and Machine Learning.
