
Notes on Bayesian Linear Regression

Adam N. Smith
July 31, 2021

This note derives the posterior distribution of a Bayesian linear regression model with conjugate priors and may be used as a companion to chapter 2.8 in Rossi et al. (2005). We first define the model and derive the posterior. We conclude with a discussion of efficient posterior sampling based on the Cholesky decomposition.

1 Model

The standard multiple linear regression model relates a response variable $y_i$ to a $k$-dimensional vector of predictor variables $x_i = (x_{i1}, \ldots, x_{ik})$ for $i = 1, \ldots, n$:

$$y_i = x_i'\beta + \varepsilon_i, \qquad \varepsilon_i \sim N(0, \sigma^2) \tag{1}$$

The parametric assumption on the error terms induces a distribution on $y_i$ given $x_i$. Collecting the predictor variables into a matrix $X$ allows us to rewrite the model in matrix notation:

$$y = X\beta + \varepsilon, \qquad \varepsilon \sim N(0, \sigma^2 I_n) \tag{2}$$

where $y$ is an $n$-dimensional vector of response variables, $X$ is an $n \times k$ design matrix, $\beta$ is a $k$-dimensional vector of regression coefficients, and $\varepsilon$ is an $n$-dimensional vector of errors assumed to have a $N(0, \sigma^2 I_n)$ distribution.
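
To make the notation concrete, the following R sketch simulates a small data set from model (2). The sample size, dimension, and parameter values are purely illustrative.

# simulate data from y = X beta + eps, eps ~ N(0, sigma^2 I_n)
set.seed(1)
n = 100                                              # illustrative sample size
k = 3                                                # illustrative number of predictors
X = cbind(1, matrix(rnorm(n * (k - 1)), n, k - 1))   # design matrix with an intercept
beta_true = c(1, -0.5, 2)                            # illustrative coefficients
sigma = 0.5                                          # illustrative error standard deviation
y = as.vector(X %*% beta_true + rnorm(n, sd = sigma))
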
The two key ingredients for a Bayesian analysis of the regression model above are the likelihood (i.e., distribution of the data) and prior (i.e., distribution of parameters).

Likelihood. Assuming the error vector is distributed $N(0, \sigma^2 I_n)$ induces a multivariate normal likelihood:

$$p(y \mid X, \beta, \sigma^2) = (2\pi\sigma^2)^{-n/2} \exp\left\{ -\frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right\}. \tag{3}$$
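
For reference, the log of the likelihood in (3) can be evaluated directly in R. This is only a sketch and assumes objects y, X, beta, and sigmasq already exist (e.g., from the simulation above).

# log-likelihood of the normal linear model, log p(y | X, beta, sigma^2)
n = length(y)
resid = y - X %*% beta
loglik = -0.5 * n * log(2 * pi * sigmasq) - 0.5 * crossprod(resid) / sigmasq
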
Prior. We choose conjugate priors for $(\beta, \sigma^2)$ to ensure an analytic expression for the posterior distribution. With both $\beta$ and $\sigma^2$ unknown, the conjugate prior is specified as $p(\beta, \sigma^2) = p(\beta \mid \sigma^2)\,p(\sigma^2)$, where

$$\beta \mid \sigma^2 \sim N(\bar\beta, \sigma^2 A^{-1}) \tag{4}$$

$$\sigma^2 \sim \frac{\nu_0 s_0^2}{\chi^2_{\nu_0}}. \tag{5}$$

Hence, $\beta \mid \sigma^2$ has a normal prior and $\sigma^2$ has a scaled inverse chi-squared prior. (Note the equivalence between the scaled inverse chi-squared and the inverse gamma distributions: if $\sigma^2 \sim \nu_0 s_0^2/\chi^2_{\nu_0}$, then $\sigma^2 \sim IG(\nu_0/2, \nu_0 s_0^2/2)$.)
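
As an aside, a draw from the scaled inverse chi-squared prior in (5) can be generated by scaling a chi-squared draw; the hyperparameter values below (nu0, s0sq) are illustrative.

nu0 = 5; s0sq = 1                           # illustrative prior hyperparameters
sigmasq = nu0 * s0sq / rchisq(1, df = nu0)  # sigma^2 ~ nu0 * s0^2 / chi^2_{nu0}
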

Posterior. The joint posterior then takes the form

$$\begin{aligned}
p(\beta, \sigma^2 \mid y, X) &\propto p(y \mid X, \beta, \sigma^2)\, p(\beta \mid \sigma^2)\, p(\sigma^2) \\
&\propto (\sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right\} \\
&\quad \times (\sigma^2)^{-\frac{k}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(\beta - \bar\beta)'A(\beta - \bar\beta) \right\} \\
&\quad \times (\sigma^2)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left\{ -\frac{\nu_0 s_0^2}{2\sigma^2} \right\}.
\end{aligned} \tag{6}$$

2 Deriving the Posterior

The goal is now to derive expressions for the conditional posterior distribution of $\beta \mid \sigma^2$ and the marginal posterior distribution of $\sigma^2$.

Step 1. Quadratic Forms. We can simplify (6) by noticing that $p(y \mid X, \beta, \sigma^2)$ and $p(\beta \mid \sigma^2)$ both contain quadratic forms in $\beta$. The first step is then to expand out the sum of the two quadratic forms:

$$\begin{aligned}
(y - X\beta)'(y - X\beta) &+ (\beta - \bar\beta)'A(\beta - \bar\beta) \\
&= (y' - \beta'X')(y - X\beta) + (\beta' - \bar\beta')A(\beta - \bar\beta) \\
&= y'y - y'X\beta - \beta'X'y + \beta'X'X\beta + \beta'A\beta - \beta'A\bar\beta - \bar\beta'A\beta + \bar\beta'A\bar\beta \\
&= \beta'X'X\beta + \beta'A\beta - 2\beta'X'y - 2\beta'A\bar\beta + y'y + \bar\beta'A\bar\beta
\end{aligned} \tag{7}$$

The last line uses the fact that $y'X\beta$ and $\beta'A\bar\beta$ are both scalars, so $y'X\beta = (y'X\beta)'$ and $\beta'A\bar\beta = (\beta'A\bar\beta)'$. Now write (7) as

$$\left[ \beta'(X'X + A)\beta - \beta'(2X'y + 2A\bar\beta) \right] + y'y + \bar\beta'A\bar\beta. \tag{8}$$

We can further simplify the terms in $[\,\cdot\,]$ by completing the square in $\beta$.

Step 2. Completing the Square. The matrix version of completing the square is given by

$$X'MX + X'n + p = (X - h)'M(X - h) + k \tag{9}$$

where $h = -\tfrac{1}{2}M^{-1}n$ and $k = p - \tfrac{1}{4}n'M^{-1}n$. Next, we plug the matrices from (8) into the general form given above; a quick numerical check of (9) follows the list.

$$\begin{aligned}
M &= X'X + A \\
n &= -2(X'y + A\bar\beta) \\
h &= (X'X + A)^{-1}(X'y + A\bar\beta) \\
k &= -(X'y + A\bar\beta)'(X'X + A)^{-1}(X'y + A\bar\beta) \\
p &= 0
\end{aligned}$$
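
As promised, the short R snippet below verifies the completing-the-square identity (9) numerically with small, arbitrary inputs. All values are illustrative, and the names n_vec and k_const are chosen to avoid clashing with the sample size n and dimension k used elsewhere.

# numerical check of x'Mx + x'n + p = (x - h)'M(x - h) + k for a generic vector x
set.seed(2)
M = crossprod(matrix(rnorm(9), 3, 3)) + diag(3)    # a symmetric positive-definite M
n_vec = rnorm(3); p = 0; x = rnorm(3)
h = -0.5 * solve(M, n_vec)
k_const = p - 0.25 * t(n_vec) %*% solve(M, n_vec)
lhs = t(x) %*% M %*% x + t(x) %*% n_vec + p
rhs = t(x - h) %*% M %*% (x - h) + c(k_const)
all.equal(c(lhs), c(rhs))                          # TRUE
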

If we let $\tilde\beta = h = (X'X + A)^{-1}(X'y + A\bar\beta)$, then the bracketed terms in (8) become

$$\begin{aligned}
\beta'(X'X + A)\beta &- \beta'(2X'y + 2A\bar\beta) \\
&= (\beta - \tilde\beta)'(X'X + A)(\beta - \tilde\beta) - (X'y + A\bar\beta)'(X'X + A)^{-1}(X'y + A\bar\beta) \\
&= (\beta - \tilde\beta)'(X'X + A)(\beta - \tilde\beta) - (X'y + A\bar\beta)'\tilde\beta.
\end{aligned} \tag{10}$$

Now since $(X'X + A)^{-1}$ is symmetric and $I = \left[(X'X + A)^{-1}(X'X + A)\right]$, we can write the rightmost term above as

$$\begin{aligned}
(X'y + A\bar\beta)'\tilde\beta &= (X'y + A\bar\beta)'\left[(X'X + A)^{-1}(X'X + A)\right]\tilde\beta \\
&= (X'y + A\bar\beta)'\left[(X'X + A)^{-1}\right]'(X'X + A)\tilde\beta \\
&= \left[(X'X + A)^{-1}(X'y + A\bar\beta)\right]'(X'X + A)\tilde\beta \\
&= \tilde\beta'(X'X + A)\tilde\beta.
\end{aligned} \tag{11}$$

Therefore, using the results of equations (8), (10), and (11), (7) simplifies to

$$\begin{aligned}
(y - X\beta)'(y - X\beta) &+ (\beta - \bar\beta)'A(\beta - \bar\beta) \\
&= (\beta - \tilde\beta)'(X'X + A)(\beta - \tilde\beta) + y'y + \bar\beta'A\bar\beta - \tilde\beta'(X'X + A)\tilde\beta.
\end{aligned} \tag{12}$$

Step 3. Main Result. The joint posterior distribution is then

$$\begin{aligned}
p(\beta, \sigma^2 \mid y, X) &\propto (\sigma^2)^{-\frac{n}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(y - X\beta)'(y - X\beta) \right\} \\
&\quad \times (\sigma^2)^{-\frac{k}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(\beta - \bar\beta)'A(\beta - \bar\beta) \right\} \\
&\quad \times (\sigma^2)^{-\left(\frac{\nu_0}{2}+1\right)} \exp\left\{ -\frac{\nu_0 s_0^2}{2\sigma^2} \right\} \\
&= (\sigma^2)^{-\frac{k}{2}} \exp\left\{ -\frac{1}{2\sigma^2}(\beta - \tilde\beta)'(X'X + A)(\beta - \tilde\beta) \right\} \\
&\quad \times (\sigma^2)^{-\left(\frac{n+\nu_0}{2}+1\right)} \exp\left\{ -\frac{\nu_0 s_0^2 + y'y + \bar\beta'A\bar\beta - \tilde\beta'(X'X + A)\tilde\beta}{2\sigma^2} \right\}.
\end{aligned} \tag{13}$$
But now we see that the joint posterior distribution factors into two parts: the conditional posterior of $\beta \mid \sigma^2$ and the marginal posterior of $\sigma^2$. Formally, we have

$$\beta \mid \sigma^2, y, X \sim N\left(\tilde\beta,\ \sigma^2(X'X + A)^{-1}\right) \tag{14}$$

$$\sigma^2 \mid y, X \sim \frac{\nu_n s_n^2}{\chi^2_{\nu_n}} \tag{15}$$

where

$$\tilde\beta = (X'X + A)^{-1}(X'y + A\bar\beta) \tag{16}$$

$$\nu_n = \nu_0 + n \tag{17}$$

$$s_n^2 = \frac{\nu_0 s_0^2 + y'y + \bar\beta'A\bar\beta - \tilde\beta'(X'X + A)\tilde\beta}{\nu_0 + n}. \tag{18}$$
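
Putting (14)–(18) together, one exact draw from the joint posterior can be generated with a few lines of base R. This is a direct but deliberately unoptimized sketch: it calls solve() on the $k \times k$ precision matrix, and the names betabar, nu0, and s0sq mirror $\bar\beta$, $\nu_0$, and $s_0^2$ and are assumed to be defined. Section 3 shows how the same draw can be produced more efficiently.

n = length(y)
XtX = crossprod(X)                                   # X'X
btilde = solve(XtX + A, crossprod(X, y) + A %*% betabar)   # posterior mean (16)
nun = nu0 + n
sn2 = c(nu0 * s0sq + sum(y^2) + t(betabar) %*% A %*% betabar -
        t(btilde) %*% (XtX + A) %*% btilde) / nun
sigmasq = nun * sn2 / rchisq(1, df = nun)            # draw sigma^2 from (15)
V = solve(XtX + A)                                   # (X'X + A)^{-1}
beta = btilde + sqrt(sigmasq) * t(chol(V)) %*% rnorm(length(betabar))  # draw beta from (14)
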

3 Sampling from the Posterior

Sampling from the posterior above is exact by virtue of conjugacy. That is, we can generate iid draws from the posteriors of $\beta \mid \sigma^2$ and $\sigma^2$ using standard software. However, drawing from $\beta \mid \sigma^2$ requires computing $(X'X + A)^{-1}$ (i.e., the inverse of the posterior precision), which is the inverse of a $k \times k$ matrix. This type of matrix inverse regularly appears in the computation of posterior moments, especially in Bayesian regression models. When $k$ is large, this matrix inverse becomes more computationally demanding and can be a bottleneck in a posterior sampling routine.
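
How much the factorization-based route introduced below actually helps depends on $k$, the BLAS/LAPACK build, and the hardware, but the cost of a dense $k \times k$ inverse is easy to check directly. The snippet below is an illustrative benchmark, not a result; the dimension is arbitrary.

k = 2000                                             # illustrative dimension
S = crossprod(matrix(rnorm(k * k), k, k)) + diag(k)  # a k x k symmetric positive-definite matrix
system.time(solve(S))                                # generic dense inverse
system.time(chol2inv(chol(S)))                       # inverse via the Cholesky root
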
Rossi et al. (2005) describe the Bayesian regression model with an eye towards efficient computation. The goal of this section is to provide the necessary background information to understand their approach. We start by defining the Cholesky decomposition, which is a common method for matrix factorization.

Definition 1. The Cholesky decomposition of a symmetric positive-definite matrix $\Sigma$ is defined as $\Sigma = U'U$ where $U$ is the upper triangular "Cholesky root" matrix.

Consider the following simple example (in R).


> Sigma
     [,1] [,2]
[1,]    2    1
[2,]    1    3
> U = chol(Sigma)
> U
         [,1]      [,2]
[1,] 1.414214 0.7071068
[2,] 0.000000 1.5811388
> t(U) %*% U
     [,1] [,2]
[1,]    2    1
[2,]    1    3
Now consider the problem of inverting $\Sigma$. The most straightforward approach is to use the solve() function in R.

> solve(Sigma)
     [,1] [,2]
[1,]  0.6 -0.2
[2,] -0.2  0.4
However, a more efficient approach is to use the Cholesky decomposition of $\Sigma$.

Definition 2. If $\Sigma$ is a symmetric positive-definite matrix with Cholesky decomposition $\Sigma = U'U$, then $\Sigma^{-1} = (U^{-1})(U^{-1})'$.
This result shows that the inverse of $\Sigma$ can be computed using only the inverse of the Cholesky root $U$. That is, we have replaced the problem of inverting $\Sigma$ with the problem of inverting $U$. The fact that $U$ is upper triangular leads to faster and more numerically stable inversion methods relative to a dense matrix like $\Sigma$. The following R code uses the previous result to compute $\Sigma^{-1}$.

> U = chol(Sigma)
> U
         [,1]      [,2]
[1,] 1.414214 0.7071068
[2,] 0.000000 1.5811388
> IR = backsolve(U, diag(ncol(U)))
> IR
          [,1]       [,2]
[1,] 0.7071068 -0.3162278
[2,] 0.0000000  0.6324555
> IR %*% t(IR)
     [,1] [,2]
[1,]  0.6 -0.2
[2,] -0.2  0.4

Here IR refers to the "inverse (Cholesky) root" of $\Sigma$. Also notice that backsolve() is used in place of solve() for computing IR. While solve(U) is equivalent to backsolve(U, diag(nrow(U))), backsolve() is preferred because it recognizes the special structure of $U$ and solves the triangular systems of equations.
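
As a further aside, base R also provides chol2inv(), which applies the identity in Definition 2 directly. Continuing the running example, the following check confirms that it agrees with the solve() result; this is only a sanity check, not part of the original derivation.

> all.equal(chol2inv(chol(Sigma)), solve(Sigma))
[1] TRUE
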
We can now return to the problem of sampling from the posterior defined in (14). Using the results of the previous section, we first write

$$\Sigma = (X'X + A) = U'U \tag{19}$$

where $U$ is the upper triangular Cholesky root of $(X'X + A)$. It follows that

$$\Sigma^{-1} = (X'X + A)^{-1} = (U^{-1})(U^{-1})' = (IR)(IR)' \tag{20}$$

and so

$$\tilde\beta = (X'X + A)^{-1}(X'y + A\bar\beta) = (IR)(IR)'(X'y + A\bar\beta). \tag{21}$$

The following R code generates one draw from the posterior of $\beta \mid \sigma^2$.

k = length(betabar)
U = chol(crossprod(X) + A)      # upper Cholesky root of X'X + A
IR = backsolve(U, diag(k))      # inverse root: IR %*% t(IR) = (X'X + A)^{-1}
btilde = crossprod(t(IR)) %*% (crossprod(X, y) + A %*% betabar)   # posterior mean (16)
beta = btilde + sqrt(sigmasq) * IR %*% rnorm(k)   # btilde + sigma * IR * z, z ~ N(0, I_k)

Rossi et al. (2005) take this a step further. Let $A = U'U$ and define

$$z = \begin{pmatrix} y \\ U\bar\beta \end{pmatrix}, \qquad W = \begin{pmatrix} X \\ U \end{pmatrix} \tag{22}$$

so that $W'W = (X'X + A)$ and $W'z = (X'y + A\bar\beta)$. The breg() function in bayesm (Rossi, 2019) uses this modified model structure.

k = length(betabar)
RA = chol(A)                     # Cholesky root of the prior precision A
W = rbind(X, RA)                 # augmented design matrix
z = c(y, RA %*% betabar)         # augmented response vector
IR = backsolve(chol(crossprod(W)), diag(k))   # inverse root of W'W = X'X + A
btilde = crossprod(t(IR)) %*% crossprod(W, z)
beta = btilde + sqrt(sigmasq) * IR %*% rnorm(k)
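
Since $W'W = X'X + A$ and $W'z = X'y + A\bar\beta$, the two code blocks above compute the same $\tilde\beta$ and draw from the same posterior. A quick check of the deterministic part (assuming the same X, y, A, and betabar, and the objects from the block above):

all.equal(c(crossprod(t(IR)) %*% crossprod(W, z)),
          c(solve(crossprod(X) + A, crossprod(X, y) + A %*% betabar)))   # TRUE
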

References

Rossi, P. E. (2019). bayesm: Bayesian Inference for Marketing/Micro-Econometrics. R package version 3.1-4.

Rossi, P. E., Allenby, G. M., and McCulloch, R. (2005). Bayesian Statistics and Marketing. John Wiley & Sons.
