
COMS 4721: Machine Learning for Data Science

Lecture 5

Prof. John Paisley

Department of Electrical Engineering


Columbia University
BAYESIAN LINEAR REGRESSION

Model
Have a vector y ∈ Rⁿ and a covariate matrix X ∈ Rⁿ×ᵈ. The ith row of y and X
corresponds to the ith observation (yᵢ, xᵢ).

In a Bayesian setting, we model this data as:

Likelihood : y ∼ N(Xw, σ²I)
Prior : w ∼ N(0, λ⁻¹I)

The unknown model variable is w ∈ Rd .


▶ The “likelihood model” says how well the observed data agrees with w.
▶ The “model prior” is our prior belief (or constraints) on w.

This is called Bayesian linear regression because we have defined a prior on


the unknown parameter and will try to learn its posterior.
REVIEW: MAXIMUM A POSTERIORI INFERENCE

MAP solution
MAP inference returns the maximum of the log joint likelihood.

Joint Likelihood : p(y, w|X) = p(y|w, X)p(w)

Using Bayes rule, we see that this point also maximizes the posterior of w.

wMAP = arg max_w ln p(w|y, X)
     = arg max_w  ln p(y|w, X) + ln p(w) − ln p(y|X)
     = arg max_w  −(1/(2σ²))(y − Xw)ᵀ(y − Xw) − (λ/2)wᵀw + const.

We saw that this solution for wMAP is the same as for ridge regression:

wMAP = (λσ²I + XᵀX)⁻¹Xᵀy  ⇔  wRR


POINT ESTIMATES VS BAYESIAN INFERENCE

Point estimates
wMAP and wML are referred to as point estimates of the model parameters.

They find a specific value (point) of the vector w that maximizes an objective
function — the posterior (MAP) or likelihood (ML).

▶ ML: Only considers the data model: p(y|w, X).


▶ MAP: Takes into account model prior: p(y, w|X) = p(y|w, X)p(w).

Bayesian inference
Bayesian inference goes one step further by characterizing uncertainty about
the values in w using Bayes rule.
BAYES RULE AND LINEAR REGRESSION

Posterior calculation
Since w is a continuous-valued random variable in Rd , Bayes rule says that
the posterior distribution of w given y and X is

p(w|y, X) = p(y|w, X)p(w) / ∫_Rd p(y|w, X)p(w) dw

That is, we get an updated distribution on w through the transition

prior → likelihood → posterior

Quote: “The posterior is proportional to the likelihood times the prior.”


FULLY BAYESIAN INFERENCE

Bayesian linear regression


In this case, we can update the posterior distribution p(w|y, X) analytically.

We work with the proportionality first:

p(w|y, X) ∝ p(y|w, X)p(w)
          ∝ e^{−(1/(2σ²))(y−Xw)ᵀ(y−Xw)} e^{−(λ/2)wᵀw}
          ∝ e^{−(1/2){wᵀ(λI + σ⁻²XᵀX)w − 2σ⁻²wᵀXᵀy}}

The ∝ sign lets us multiply and divide this by anything as long as it doesn’t
contain w. We’ve done this twice above. Therefore the 2nd line ≠ 3rd line.
BAYESIAN INFERENCE FOR LINEAR REGRESSION

We need to normalize:

p(w|y, X) ∝ e^{−(1/2){wᵀ(λI + σ⁻²XᵀX)w − 2σ⁻²wᵀXᵀy}}

There are two key terms in the exponent:

wᵀ(λI + σ⁻²XᵀX)w  (quadratic in w)   and   −2wᵀXᵀy/σ²  (linear in w)

We can conclude that p(w|y, X) is Gaussian. Why?


1. We can multiply and divide by anything not involving w.
2. A Gaussian has (w − µ)ᵀΣ⁻¹(w − µ) in the exponent.
3. We can “complete the square” by adding terms not involving w.
BAYESIAN INFERENCE FOR LINEAR REGRESSION

Compare: In other words, a Gaussian looks like this:

p(w|µ, Σ) = (1/((2π)^{d/2}|Σ|^{1/2})) e^{−(1/2)(wᵀΣ⁻¹w − 2wᵀΣ⁻¹µ + µᵀΣ⁻¹µ)}

and we’ve shown that, for some setting of Z,


p(w|y, X) = (1/Z) e^{−(1/2)(wᵀ(λI + σ⁻²XᵀX)w − 2wᵀXᵀy/σ²)}
Conclude: What happens if in the above Gaussian we define:

Σ⁻¹ = λI + σ⁻²XᵀX,   µ = (λσ²I + XᵀX)⁻¹Xᵀy ?

Using these specific values of µ and Σ we only need to set


Z = (2π)^{d/2}|Σ|^{1/2} e^{(1/2)µᵀΣ⁻¹µ}
BAYESIAN INFERENCE FOR LINEAR REGRESSION

The posterior distribution


Therefore, the posterior distribution of w is:

p(w|y, X) = N(w|µ, Σ),

Σ = (λI + σ⁻²XᵀX)⁻¹,
µ = (λσ²I + XᵀX)⁻¹Xᵀy

Things to notice:
▶ µ = wMAP
▶ Σ captures uncertainty about w, like Var[wLS ] and Var[wRR ] did before.
▶ However, now we have a full probability distribution on w.
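The posterior formulas above can be checked numerically. A minimal sketch with numpy, using hypothetical synthetic data; the values of n, d, sig2 (σ²), and lam (λ) are assumptions, not from the lecture:

```python
import numpy as np

# Hypothetical synthetic regression data (assumed sizes and noise level)
rng = np.random.default_rng(0)
n, d = 50, 3
X = rng.normal(size=(n, d))
w_true = np.array([1.0, -2.0, 0.5])
sig2, lam = 0.25, 1.0
y = X @ w_true + rng.normal(scale=np.sqrt(sig2), size=n)

# Posterior covariance: Sigma = (lam*I + X^T X / sig2)^{-1}
Sigma = np.linalg.inv(lam * np.eye(d) + X.T @ X / sig2)
# Posterior mean written via Sigma: mu = Sigma X^T y / sig2
mu = Sigma @ (X.T @ y) / sig2

# mu should coincide with the MAP / ridge solution (lam*sig2*I + X^T X)^{-1} X^T y
w_map = np.linalg.solve(lam * sig2 * np.eye(d) + X.T @ X, X.T @ y)
```

Writing µ as ΣXᵀy/σ² and as (λσ²I + XᵀX)⁻¹Xᵀy gives the same vector, which is the equivalence stated on the slide.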
USES OF THE POSTERIOR DISTRIBUTION

Understanding w
We saw how we could calculate the variance of wLS and wRR . Now we have
an entire distribution. Some questions we can ask are:

Q: Is wi > 0 or wi < 0? Can we confidently say wi ≠ 0?


A: Use the marginal posterior distribution: wi ∼ N(µi , Σii ).

Q: How do wi and wj relate?


A: Use their joint marginal posterior distribution:

(wi, wj)ᵀ ∼ N( (µi, µj)ᵀ , [[Σii, Σij], [Σji, Σjj]] )

Predicting new data


The posterior p(w|y, X) is perhaps most useful for predicting new data.
PREDICTING NEW DATA

Recall: For a new pair (x0 , y0 ) with x0 measured and y0 unknown, we can
predict y0 using x0 and the LS or RR (i.e., ML or MAP) solutions:

y0 ≈ x0ᵀwLS   or   y0 ≈ x0ᵀwRR

With Bayes rule, we can make a probabilistic statement about y0 :


p(y0|x0, y, X) = ∫_Rd p(y0, w|x0, y, X) dw
               = ∫_Rd p(y0|w, x0, y, X) p(w|x0, y, X) dw

Notice that conditional independence lets us write

p(y0|w, x0, y, X) = p(y0|w, x0)   (likelihood)    and    p(w|x0, y, X) = p(w|y, X)   (posterior)
PREDICTING NEW DATA

Predictive distribution (intuition)


This is called the predictive distribution:
p(y0|x0, y, X) = ∫_Rd p(y0|x0, w) p(w|y, X) dw

where p(y0|x0, w) is the likelihood and p(w|y, X) is the posterior.

Intuitively:
1. Evaluate the likelihood of a value y0 given x0 for a particular w.
2. Weight that likelihood by our current belief about w given data (y, X).
3. Then sum (integrate) over all possible values of w.
PREDICTING NEW DATA

We know from the model and Bayes rule that

Model: p(y0|x0, w) = N(y0|x0ᵀw, σ²),
Bayes rule: p(w|y, X) = N(w|µ, Σ),

with µ and Σ calculated on a previous slide.

The predictive distribution can be calculated exactly with these distributions.


Again we get a Gaussian distribution:

p(y0|x0, y, X) = N(y0|µ0, σ0²),

µ0 = x0ᵀµ,
σ0² = σ² + x0ᵀΣx0.

Notice that the expected value is the MAP prediction, since µ0 = x0ᵀwMAP, but
we now quantify our confidence in this prediction with the variance σ0².
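A short sketch of the predictive computation. The data and parameter settings are assumed for illustration; the two lines for mu0 and var0 follow the slide's formulas directly:

```python
import numpy as np

# Hypothetical data; sig2 and lam are assumed hyperparameter values
rng = np.random.default_rng(1)
n, d = 40, 2
X = rng.normal(size=(n, d))
y = X @ np.array([0.7, -1.2]) + rng.normal(scale=0.3, size=n)
sig2, lam = 0.09, 2.0

# Posterior of w (from the earlier slide)
Sigma = np.linalg.inv(lam * np.eye(d) + X.T @ X / sig2)
mu = Sigma @ (X.T @ y) / sig2

# Predictive distribution for a new input x0
x0 = np.array([0.5, -0.5])
mu0 = x0 @ mu                  # predictive mean = MAP prediction x0^T w_MAP
var0 = sig2 + x0 @ Sigma @ x0  # predictive variance: noise + parameter uncertainty
```

Note that var0 is always at least σ²: the noise floor remains even when the posterior on w is very concentrated.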
ACTIVE LEARNING
PRIOR → POSTERIOR → PRIOR

Bayesian learning is naturally thought of as a sequential process. That is, the


posterior after seeing some data becomes the prior for the next data.

Let y and X be “old data” and y0 and x0 be some “new data”. By Bayes rule

p(w|y0 , x0 , y, X) ∝ p(y0 |w, x0 )p(w|y, X).

The posterior after (y, X) has become the prior for (y0 , x0 ).

Simple modifications can be made sequentially in this case:

p(w|y0, x0, y, X) = N(w|µ, Σ),

Σ = (λI + σ⁻²(x0x0ᵀ + ∑ᵢ₌₁ⁿ xᵢxᵢᵀ))⁻¹,
µ = (λσ²I + (x0x0ᵀ + ∑ᵢ₌₁ⁿ xᵢxᵢᵀ))⁻¹(x0y0 + ∑ᵢ₌₁ⁿ xᵢyᵢ).
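The sequential update can be verified against the batch computation: adding the rank-one terms x0x0ᵀ and x0y0 to the old sums gives exactly the posterior computed from all data at once. A sketch with assumed random data:

```python
import numpy as np

rng = np.random.default_rng(2)
d, sig2, lam = 3, 0.5, 1.0            # assumed hyperparameter values
X = rng.normal(size=(20, d)); y = rng.normal(size=20)
x0 = rng.normal(size=d); y0 = rng.normal()

# Batch posterior computed from all 21 observations at once
Xa, ya = np.vstack([X, x0]), np.append(y, y0)
Sig_batch = np.linalg.inv(lam * np.eye(d) + Xa.T @ Xa / sig2)
mu_batch = np.linalg.solve(lam * sig2 * np.eye(d) + Xa.T @ Xa, Xa.T @ ya)

# Sequential form from the slide: old sums plus the new rank-one terms
Sig_seq = np.linalg.inv(lam * np.eye(d) + (np.outer(x0, x0) + X.T @ X) / sig2)
mu_seq = np.linalg.solve(lam * sig2 * np.eye(d) + np.outer(x0, x0) + X.T @ X,
                         x0 * y0 + X.T @ y)
```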
INTELLIGENT LEARNING

Notice we could also have written

p(w|y0 , x0 , y, X) ∝ p(y0 , y|w, X, x0 )p(w)

but often we want to use the sequential aspect of inference to help us learn.

Learning w and making predictions for new y0 is a two-step procedure:


▶ Form the predictive distribution p(y0 |x0 , y, X).
▶ Update the posterior distribution p(w|y, X, y0 , x0 ).

Question: Can we learn p(w|y, X) intelligently?

That is, if we’re in the situation where we can pick which yi to measure with
knowledge of D = {x1 , . . . , xn }, can we come up with a good strategy?
ACTIVE LEARNING

An “active learning” strategy


Imagine we already have data (y, X) for X ⊂ D, and the posterior p(w|y, X).
We can construct the predictive distribution for every remaining x0 ∈ D.

p(y0|x0, y, X) = N(y0|µ0, σ0²),

µ0 = x0ᵀµ,
σ0² = σ² + x0ᵀΣx0.

For each x0, σ0² tells how confident we are. This suggests the following:
1. Form predictive distribution p(y0|x0, y, X) for all unmeasured x0 ∈ D
2. Pick the x0 for which σ0² is largest and measure y0
3. Update the posterior p(w|y, X) where y ← (y, y0 ) and X ← (X, x0 )
4. Return to #1 using the updated posterior
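The four steps above can be sketched as a loop. Everything here is an assumed toy setup (the pool D, the hidden w_true used to simulate measurements, and the hyperparameters are all illustrative):

```python
import numpy as np

rng = np.random.default_rng(3)
d, sig2, lam = 2, 0.1, 1.0
D = rng.normal(size=(100, d))      # hypothetical pool of candidate inputs
w_true = np.array([1.0, -1.0])     # hidden parameter used to simulate measurements

measured, y_meas = [], []
Sigma = np.eye(d) / lam            # prior covariance, before any data
for _ in range(10):
    # Step 1: predictive variance sig2 + x0^T Sigma x0 for every candidate
    var = sig2 + np.einsum('ij,jk,ik->i', D, Sigma, D)
    var[measured] = -np.inf        # exclude already-measured points
    # Step 2: measure y0 at the most uncertain x0
    i = int(np.argmax(var))
    measured.append(i)
    y_meas.append(D[i] @ w_true + rng.normal(scale=np.sqrt(sig2)))
    # Step 3: update the posterior with the enlarged data set
    Xm = D[measured]
    Sigma = np.linalg.inv(lam * np.eye(d) + Xm.T @ Xm / sig2)
    # Step 4: loop back with the updated posterior
```

Each pass shrinks Sigma in the direction of the chosen x0, so subsequent picks favor directions the data have not yet constrained.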
ACTIVE LEARNING

Entropy (i.e., uncertainty) minimization


When devising a procedure such as this one, it’s useful to know what
objective function is being optimized in the process.

We introduce the concept of the entropy of a distribution. Let p(z) be a
continuous distribution; then its (differential) entropy is:

H(p) = −∫ p(z) ln p(z) dz.

This is a measure of the spread of the distribution. More positive values


correspond to a more “uncertain” distribution (larger variance).

The entropy of a multivariate Gaussian is

H(N(w|µ, Σ)) = (1/2) ln((2πe)^d |Σ|).
ACTIVE LEARNING

The entropy of a Gaussian changes with its covariance matrix. With


sequential Bayesian learning, the covariance transitions from

Prior : (λI + σ⁻²XᵀX)⁻¹ ≡ Σ

Posterior : (λI + σ⁻²(x0x0ᵀ + XᵀX))⁻¹ ≡ (Σ⁻¹ + σ⁻²x0x0ᵀ)⁻¹

Using the “rank-one update” property of the determinant, we can show that
the entropy of the prior Hprior relates to the entropy of the posterior Hpost as:

Hpost = Hprior − (1/2) ln(1 + σ⁻²x0ᵀΣx0)

Therefore, the x0 that minimizes Hpost also maximizes σ² + x0ᵀΣx0. We are
minimizing H myopically, so this is called a “greedy algorithm”.
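The rank-one entropy identity can be checked numerically. A sketch under assumed data and hyperparameters, with the Gaussian entropy formula from the previous slide:

```python
import numpy as np

rng = np.random.default_rng(4)
d, sig2, lam = 3, 0.5, 2.0          # assumed values
X = rng.normal(size=(15, d))
x0 = rng.normal(size=d)

Sig_prior = np.linalg.inv(lam * np.eye(d) + X.T @ X / sig2)
Sig_post = np.linalg.inv(np.linalg.inv(Sig_prior) + np.outer(x0, x0) / sig2)

def H(Sig):
    # differential entropy of a Gaussian: 0.5 * ln((2*pi*e)^d |Sigma|)
    return 0.5 * np.log((2 * np.pi * np.e) ** Sig.shape[0] * np.linalg.det(Sig))

H_post = H(Sig_post)
H_pred = H(Sig_prior) - 0.5 * np.log(1 + x0 @ Sig_prior @ x0 / sig2)
```

The agreement of H_post and H_pred is exactly the matrix determinant lemma: |Σ⁻¹ + σ⁻²x0x0ᵀ| = |Σ⁻¹|(1 + σ⁻²x0ᵀΣx0).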
MODEL SELECTION
SELECTING λ

We’ve discussed λ as a “nuisance” parameter that can impact performance.

Bayes rule gives a principled way to set λ via evidence maximization:

p(w|y, X, λ) = p(y|w, X) p(w|λ) / p(y|X, λ),

where p(y|w, X) is the likelihood, p(w|λ) the prior, and p(y|X, λ) the evidence.

The “evidence” gives the likelihood of the data with w integrated out. It’s a
measure of how good our model and parameter assumptions are.
SELECTING λ

If we want to set λ, we can also do it by maximizing the evidence.¹

λ̂ = arg max_λ ln p(y|X, λ).

We notice that this looks exactly like maximum likelihood, and it is:
Type-I ML: Maximize the likelihood over the “main parameter” (w).
Type-II ML: Integrate out “main parameter” (w) and maximize over
the “hyperparameter” (λ). Also called empirical Bayes.
The difference is only in their perspective.

This approach requires us to solve this integral, but we often can’t for more
complex models. Cross-validation is an alternative that’s always available.

¹ We can show that the distribution of y is p(y|X, λ) = N(y|0, σ²I + λ⁻¹XXᵀ). This would
require an algorithm to maximize over λ. The key point here is the general technique.
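A simple algorithm for the footnote's maximization is a grid search over λ. This sketch evaluates the marginal likelihood N(y|0, σ²I + λ⁻¹XXᵀ) on assumed synthetic data; the grid range and all parameter values are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(5)
n, d = 30, 4
X = rng.normal(size=(n, d))
y = X @ rng.normal(size=d) + rng.normal(scale=0.5, size=n)
sig2 = 0.25                           # assumed known noise variance

def log_evidence(lam):
    # ln N(y | 0, sig2*I + (1/lam) X X^T), the evidence from the footnote
    C = sig2 * np.eye(n) + (1.0 / lam) * X @ X.T
    _, logdet = np.linalg.slogdet(C)
    return -0.5 * (n * np.log(2 * np.pi) + logdet + y @ np.linalg.solve(C, y))

# Grid search stands in for a proper optimizer over lambda
grid = np.logspace(-3, 3, 61)
lam_hat = grid[np.argmax([log_evidence(l) for l in grid])]
```

This is Type-II ML in miniature: w never appears because it has been integrated out of the objective.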
