Lecture 5
Model
We have a response vector y ∈ R^n and a covariate matrix X ∈ R^{n×d}. The ith entry of y and the ith row of X together form the ith observation (yi, xi).
Likelihood: y ∼ N(Xw, σ²I)
Prior: w ∼ N(0, λ⁻¹I)
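To make the setup concrete, here is a small NumPy sketch that simulates data from this model. The particular values of n, d, σ², and λ are arbitrary choices for illustration, not part of the lecture.

    import numpy as np

    rng = np.random.default_rng(0)
    n, d = 100, 5                 # number of observations and covariate dimension
    sigma2, lam = 0.25, 1.0       # noise variance sigma^2 and prior precision lambda

    X = rng.normal(size=(n, d))                                  # covariate matrix X in R^{n x d}
    w_true = rng.normal(scale=np.sqrt(1.0 / lam), size=d)        # w ~ N(0, lambda^{-1} I)
    y = X @ w_true + rng.normal(scale=np.sqrt(sigma2), size=n)   # y ~ N(Xw, sigma^2 I)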
MAP solution
MAP inference returns the maximum of the log joint likelihood.
Using Bayes rule, we see that this point also maximizes the posterior of w.
We saw that this solution for w_MAP is the same as the ridge regression solution:

w_MAP = (λσ²I + X^T X)⁻¹ X^T y
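Under the simulated data above, this closed form is a single linear solve (the variable names carry over from that sketch and are mine, not the lecture's):

    # MAP / ridge solution: w_MAP = (lambda sigma^2 I + X^T X)^{-1} X^T y
    w_map = np.linalg.solve(lam * sigma2 * np.eye(d) + X.T @ X, X.T @ y)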
Point estimates
wMAP and wML are referred to as point estimates of the model parameters.
They find a specific value (point) of the vector w that maximizes an objective
function — the posterior (MAP) or likelihood (ML).
Bayesian inference
Bayesian inference goes one step further by characterizing uncertainty about
the values in w using Bayes rule.
BAYES RULE AND LINEAR REGRESSION
Posterior calculation
Since w is a continuous-valued random variable in R^d, Bayes rule says that the posterior distribution of w given y and X is

p(w|y, X) = p(y|w, X) p(w) / ∫_{R^d} p(y|w, X) p(w) dw
The ∝ sign lets us multiply and divide an expression by anything, as long as it doesn’t contain w. Doing so twice in the derivation means successive lines are proportional to, but not equal to, one another.
BAYESIAN INFERENCE FOR LINEAR REGRESSION
We need to normalize. Completing the square in the exponent,

p(w|y, X) ∝ exp( −½ { w^T (λI + σ⁻²X^T X) w − 2σ⁻² w^T X^T y } )

which is the form of a Gaussian, p(w|y, X) = N(w|µ, Σ), with

Σ = (λI + σ⁻²X^T X)⁻¹,   µ = (λσ²I + X^T X)⁻¹ X^T y
Things to notice:
▶ µ = w_MAP
▶ Σ captures uncertainty about w, like Var[w_LS] and Var[w_RR] did before.
▶ However, now we have a full probability distribution on w.
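Continuing the simulated example, a short sketch of the posterior computation (with a check that µ is the same vector in both of its equivalent forms):

    # Posterior covariance and mean of w given (y, X)
    Sigma = np.linalg.inv(lam * np.eye(d) + X.T @ X / sigma2)          # (lambda I + sigma^{-2} X^T X)^{-1}
    mu = np.linalg.solve(lam * sigma2 * np.eye(d) + X.T @ X, X.T @ y)  # equals w_MAP

    # mu can equivalently be written as sigma^{-2} Sigma X^T y
    assert np.allclose(mu, Sigma @ X.T @ y / sigma2)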
USES OF THE POSTERIOR DISTRIBUTION
Understanding w
We saw how we could calculate the variance of w_LS and w_RR. Now we have an entire distribution over w, which lets us ask much more detailed questions about it.
Recall: For a new pair (x0, y0) with x0 measured and y0 unknown, we can predict y0 using x0 and the LS or RR (i.e., ML or MAP) solutions, e.g., y0 ≈ x0^T w_MAP. Bayesian inference instead uses the full posterior through the predictive distribution

p(y0 | x0, y, X) = ∫ p(y0 | x0, w) p(w | y, X) dw
Intuitively:
1. Evaluate the likelihood of a value y0 given x0 for a particular w.
2. Weight that likelihood by our current belief about w given data (y, X).
3. Then sum (integrate) over all possible values of w.
PREDICTING NEW DATA
For this model the integral can be computed in closed form: p(y0 | x0, y, X) = N(y0 | µ0, σ0²), with µ0 = x0^T µ and σ0² = σ² + x0^T Σ x0.
Notice that the expected value is the MAP prediction, since µ0 = x0^T w_MAP, but we now quantify our confidence in this prediction with the variance σ0².
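In code, continuing the running example, the predictive mean and variance for a hypothetical new point x0 are:

    x0 = rng.normal(size=d)                 # a new, unmeasured covariate vector
    mu0 = x0 @ mu                           # predictive mean:     mu_0      = x0^T mu
    sigma0_sq = sigma2 + x0 @ Sigma @ x0    # predictive variance: sigma_0^2 = sigma^2 + x0^T Sigma x0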
ACTIVE LEARNING
Prior → posterior → prior
Let y and X be “old data” and y0 and x0 be some “new data”. By Bayes rule,

p(w | y0, x0, y, X) ∝ p(y0 | x0, w) p(w | y, X)

The posterior after seeing (y, X) has become the prior for (y0, x0).
We could process the data all at once or sequentially and end up with the same posterior, but often we want to use the sequential aspect of inference to help us learn. That is, if we’re in a situation where we can pick which yi to measure, with knowledge of D = {x1, . . . , xn}, can we come up with a good strategy?
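As a numerical sanity check on the prior → posterior → prior view (continuing the running example), updating the posterior N(µ, Σ) with one new pair (y0, x0) matches recomputing the posterior from all n + 1 observations at once:

    y0 = x0 @ w_true + rng.normal(scale=np.sqrt(sigma2))    # measure the new response

    # Sequential update: treat N(mu, Sigma) as the prior for the new pair (y0, x0)
    Sigma_new = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(x0, x0) / sigma2)
    mu_new = Sigma_new @ (np.linalg.inv(Sigma) @ mu + x0 * y0 / sigma2)

    # Batch posterior using all n + 1 observations at once
    X_all, y_all = np.vstack([X, x0]), np.append(y, y0)
    Sigma_all = np.linalg.inv(lam * np.eye(d) + X_all.T @ X_all / sigma2)
    mu_all = Sigma_all @ X_all.T @ y_all / sigma2

    assert np.allclose(Sigma_new, Sigma_all) and np.allclose(mu_new, mu_all)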
For each x0, σ0² tells us how confident we are. This suggests the following procedure (a code sketch follows the list):
1. Form the predictive distribution p(y0 | x0, y, X) for all unmeasured x0 ∈ D
2. Pick the x0 for which σ0² is largest and measure y0
3. Update the posterior p(w | y, X), where y ← (y, y0) and X ← (X, x0)
4. Return to step 1 using the updated posterior
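A minimal sketch of this greedy loop, assuming a pool of candidate covariates X_pool and an oracle function that returns the measured response for a chosen index; both names are hypothetical, not from the lecture.

    import numpy as np

    def active_learning(X_pool, oracle, sigma2, lam, n_queries):
        # Greedy active learning for Bayesian linear regression.
        d = X_pool.shape[1]
        Prec = lam * np.eye(d)          # posterior precision, starts at the prior lambda * I
        b = np.zeros(d)                 # running value of sigma^{-2} X^T y
        unmeasured = list(range(len(X_pool)))
        for _ in range(n_queries):
            Sigma = np.linalg.inv(Prec)
            # 1. predictive variance sigma_0^2 for every unmeasured candidate x0
            var0 = [sigma2 + X_pool[i] @ Sigma @ X_pool[i] for i in unmeasured]
            # 2. pick the x0 with the largest sigma_0^2 and measure y0
            i_star = unmeasured[int(np.argmax(var0))]
            x0, y0 = X_pool[i_star], oracle(i_star)
            unmeasured.remove(i_star)
            # 3. update the posterior with (y0, x0)
            Prec += np.outer(x0, x0) / sigma2
            b += x0 * y0 / sigma2
        Sigma = np.linalg.inv(Prec)
        return Sigma @ b, Sigma          # posterior mean and covariance of w

Here oracle(i) stands in for whatever measurement or labeling process produces y0 for candidate i.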
Using the “rank-one update” property of the determinant, we can show that the entropy of the prior, H_prior, relates to the entropy of the posterior, H_post, as

H_post = H_prior − ½ ln(1 + σ⁻² x0^T Σ x0)

Therefore the x0 that minimizes H_post also maximizes σ² + x0^T Σ x0 = σ0². We are minimizing H myopically, so this is called a “greedy algorithm”.
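This identity is easy to verify numerically in the running example, using the fact that the entropy of N(µ, Σ) is ½ ln((2πe)^d |Σ|):

    def gaussian_entropy(S):
        # entropy of a Gaussian with covariance S: 0.5 * ln((2 pi e)^d |S|)
        return 0.5 * np.linalg.slogdet(2 * np.pi * np.e * S)[1]

    H_prior = gaussian_entropy(Sigma)
    Sigma_post = np.linalg.inv(np.linalg.inv(Sigma) + np.outer(x0, x0) / sigma2)
    H_post = gaussian_entropy(Sigma_post)

    assert np.isclose(H_post, H_prior - 0.5 * np.log(1 + x0 @ Sigma @ x0 / sigma2))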
MODEL SELECTION
SELECTING λ
The “evidence” p(y | X, λ) = ∫ p(y | w, X) p(w | λ) dw gives the likelihood of the data with w integrated out. It’s a measure of how good our model and parameter assumptions are, and maximizing it over λ gives a way to select that hyperparameter.
We notice that this looks exactly like maximum likelihood, and it is:
Type-I ML: Maximize the likelihood over the “main parameter” (w).
Type-II ML: Integrate out “main parameter” (w) and maximize over
the “hyperparameter” (λ). Also called empirical Bayes.
The difference between them is only one of perspective.
This approach requires us to solve this integral, but we often can’t for more
complex models. Cross-validation is an alternative that’s always available.
¹ We can show that the marginal distribution of y is p(y | X, λ) = N(y | 0, σ²I + λ⁻¹XX^T). Maximizing this over λ would still require a numerical algorithm; the key point here is the general technique.
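For this model the footnote's formula makes that maximization straightforward to sketch: below, the log evidence is evaluated on a grid of λ values and the maximizer is kept (the grid and the use of scipy are my choices, continuing the running example).

    from scipy.stats import multivariate_normal

    def log_evidence(lam_val):
        # log p(y | X, lambda) = log N(y | 0, sigma^2 I + lambda^{-1} X X^T)
        cov = sigma2 * np.eye(n) + X @ X.T / lam_val
        return multivariate_normal.logpdf(y, mean=np.zeros(n), cov=cov)

    grid = np.logspace(-3, 3, 61)                              # candidate lambda values
    lam_hat = grid[np.argmax([log_evidence(l) for l in grid])]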