Normal Linear Regression Model via Gibbs Sampling; Gibbs Sampler Diagnostics
As mentioned previously, conjugate priors may be overly restrictive in many Bayesian applications. Here
we use a popular combination of independent priors for the regression model: a normal prior for β and an
inverse-gamma prior for σ².
The enhanced flexibility in prior modeling comes at the price of abandoning analytical results for the
posterior distribution. Instead, we will use posterior simulation via Gibbs Sampling to obtain draws from
the joint and marginal posteriors.
The structural model is the same as for the CLRM with conjugate priors.
p(y \mid \theta, X) = (2\pi)^{-n/2} (\sigma^2)^{-n/2} \exp\left( -\frac{1}{2\sigma^2} (y - X\beta)'(y - X\beta) \right) \qquad (1)
p(\beta, \sigma^2) = p(\beta)\, p(\sigma^2), \quad \text{where} \quad \beta \sim n(\mu_0, V_0), \ \ \sigma^2 \sim ig(v_0, \tau_0)

p(\beta) = (2\pi)^{-k/2} |V_0|^{-1/2} \exp\left( -\tfrac{1}{2} (\beta - \mu_0)' V_0^{-1} (\beta - \mu_0) \right) \qquad (2)

p(\sigma^2) = \frac{\tau_0^{v_0}}{\Gamma(v_0)} (\sigma^2)^{-(v_0 + 1)} \exp\left( -\frac{\tau_0}{\sigma^2} \right), \quad \text{with} \quad E(\sigma^2) = \frac{\tau_0}{v_0 - 1}, \quad V(\sigma^2) = \frac{\tau_0^2}{(v_0 - 1)^2 (v_0 - 2)}
Note that σ² does not enter the prior density of β. As mentioned before, amongst the many possible
parameterizations of the inverse-gamma (ig) density, we choose the form given in Gelman et al. (2004),
where v₀ is the shape parameter and τ₀ is the scale parameter. For the density to have a well-defined mean
we need v₀ > 1, and for a well-defined variance we need v₀ > 2.
Combining the priors with the likelihood, and dropping all terms that are multiplicatively unrelated to our
parameters of interest yields the posterior kernel
p(\beta, \sigma^2 \mid y, X) \propto (\sigma^2)^{\frac{-n - 2v_0 - 2}{2}} \exp\left( -\frac{1}{2\sigma^2} (2\tau_0) \right) \exp\left( -\frac{1}{2} \left[ \frac{1}{\sigma^2} (y - X\beta)'(y - X\beta) + (\beta - \mu_0)' V_0^{-1} (\beta - \mu_0) \right] \right) \qquad (3)
We first aim to find the posterior density for β, conditional on σ² (i.e. treating σ² as a constant). Thus,
we first focus on the components of the posterior kernel that cannot be multiplicatively separated from
β. This leaves

p(\beta \mid \sigma^2, y, X) \propto \exp\left( -\frac{1}{2} \left[ \frac{1}{\sigma^2} (y - X\beta)'(y - X\beta) + (\beta - \mu_0)' V_0^{-1} (\beta - \mu_0) \right] \right) \qquad (4)
Note the conditioning on σ² on both sides of (4). Using the same algebraic manipulations and
reasoning as for the previous model, we obtain:

\beta \mid \sigma^2, y, X \sim n(\mu_1, V_1) \quad \text{with} \quad V_1 = \left( V_0^{-1} + \tfrac{1}{\sigma^2} X'X \right)^{-1} \quad \text{and} \quad \mu_1 = V_1 \left( V_0^{-1} \mu_0 + \tfrac{1}{\sigma^2} X'y \right) \qquad (5)
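For concreteness, a draw from (5) can be generated in a few lines of Matlab. This is only an illustrative sketch, not code from the course scripts; the variable names (y, X, mu0, V0, sig2) are assumptions, with sig2 holding the current value of σ².

    % Draw beta | sigma^2, y, X from n(mu1, V1) as in (5)
    k    = size(X, 2);
    V1   = inv(inv(V0) + (1/sig2) * (X' * X));          % conditional posterior variance
    mu1  = V1 * (inv(V0) * mu0 + (1/sig2) * (X' * y));  % conditional posterior mean
    beta = mu1 + chol(V1, 'lower') * randn(k, 1);        % one draw from n(mu1, V1)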
To derive the conditional posterior density for σ², we return to our original form for the joint posterior
given in (3). Ignoring terms that are not related to σ², we have

p(\sigma^2 \mid \beta, y, X) \propto (\sigma^2)^{\frac{-n - 2v_0 - 2}{2}} \exp\left( -\frac{1}{2\sigma^2} \left[ 2\tau_0 + (y - X\beta)'(y - X\beta) \right] \right) \qquad (6)
Comparing this expression to the kernel of the ig prior in (2), we recognize this as the kernel of another ig
density. Specifically:
\sigma^2 \mid \beta, y, X \sim ig(v_1, \tau_1) \quad \text{with} \quad v_1 = \frac{2v_0 + n}{2} \quad \text{and} \quad \tau_1 = \tau_0 + \frac{(y - X\beta)'(y - X\beta)}{2} \qquad (7)
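A draw from (7) can then be obtained by exploiting the fact that the reciprocal of a gamma variate with shape v₁ and scale 1/τ₁ follows an ig(v₁, τ₁) distribution in our parameterization. Again, a minimal sketch with assumed variable names (y, X, beta, v0, tau0, n); gamrnd requires Matlab's Statistics Toolbox.

    % Draw sigma^2 | beta, y, X from ig(v1, tau1) as in (7)
    e    = y - X * beta;               % residuals at the current beta draw
    v1   = (2*v0 + n) / 2;             % posterior shape
    tau1 = tau0 + 0.5 * (e' * e);      % posterior scale
    sig2 = 1 / gamrnd(v1, 1/tau1);     % inverse-gamma draw via reciprocal gamma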
Gibbs Sampler
The Gibbs Sampler (GS) has become the "workhorse" of Bayesian posterior simulation in recent years.
The general idea is simple: break the joint posterior into conditional posteriors for which the analytical
form of the density is known. Then sample sequentially and repeatedly from these conditionals. After a
number of draws, the joint sequence of conditional draws will converge to the desired joint posterior
density for all parameters. In addition, each individual sequence can be interpreted as draws from the
marginal posterior for a given parameter. The GS is an example of a "Markov Chain Monte Carlo", or MC²,
procedure.
In the simplest case, we split the full parameter vector θ into two components, θ₁ and θ₂. In more complex
applications, we may want to split θ into 3, 4, or even more components. The key notion is that we know
the analytical form of the resulting full conditional posterior distributions, i.e.

p(\theta_1 \mid y, \theta_2) \quad \text{and} \quad p(\theta_2 \mid y, \theta_1) \qquad (3.8)
All we need to get the GS started is an initial value for θ₂, call it θ₂⁰. This can be chosen arbitrarily, or
one can use OLS results or results from previous analyses. We assume this starting value comes directly
from the marginal posterior p(θ₂ | y). Next, we draw θ₁ conditional on θ₂⁰ from p(θ₁ | y, θ₂⁰); call this
draw θ₁¹. Next, we draw another value of θ₂ conditional on θ₁¹ from p(θ₂ | y, θ₁¹); call this draw θ₂¹.
We repeat this process R times. In essence, we use the basic rule of conditional probabilities, i.e.

p(\theta \mid y) = p(\theta_1 \mid y, \theta_2)\, p(\theta_2 \mid y) = p(\theta_2 \mid y, \theta_1)\, p(\theta_1 \mid y) \qquad (3.9)

over and over again. As a caveat we should note that, naturally, there is no guarantee that our starting
value θ₂⁰ really came from the marginal posterior p(θ₂ | y). However, under relatively weak conditions
(see Koop Ch. 4 for details) the starting value(s) will not matter and the GS will indeed converge to draws
from p(θ | y). To ensure that the effect of the starting value has truly "faded away", we usually discard
the first r₁ draws of the sequence, and keep only the remaining r₂ = R − r₁ draws. The discarded draws
are often referred to as "burn-ins".
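Putting the two conditional draws together yields the full sampler. The sketch below is a stripped-down two-block Gibbs Sampler for the normal regression model with independent priors; it is not taken from the course scripts, the data and prior settings (y, X, mu0, V0, v0, tau0) are assumed to be in memory, and the number of draws, burn-ins, and starting values are arbitrary choices for illustration.

    % Two-block Gibbs Sampler: beta | sigma^2 and sigma^2 | beta
    [n, k] = size(X);
    R  = 10000;  r1 = 1000;                      % total draws and burn-ins
    beta = zeros(k, 1);  sig2 = 1;               % arbitrary starting values
    betadraws = zeros(R, k);  sig2draws = zeros(R, 1);
    V0inv = inv(V0);                             % prior precision, computed once
    for r = 1:R
        % block 1: beta | sigma^2, y, X as in (5)
        V1   = inv(V0inv + (1/sig2) * (X' * X));
        mu1  = V1 * (V0inv * mu0 + (1/sig2) * (X' * y));
        beta = mu1 + chol(V1, 'lower') * randn(k, 1);
        % block 2: sigma^2 | beta, y, X as in (7)
        e    = y - X * beta;
        sig2 = 1 / gamrnd(v0 + n/2, 1 / (tau0 + 0.5 * (e' * e)));
        betadraws(r, :) = beta';  sig2draws(r) = sig2;
    end
    betadraws = betadraws(r1+1:end, :);          % discard burn-ins
    sig2draws = sig2draws(r1+1:end);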
The derivation of moments from this simulated posterior is accomplished via Monte Carlo Integration (MCI).
For example, the analytical expressions for the mean (or expectation) and variance of a given element of
β, say β_j, are given by

E(\beta_j) = \int \beta_j \, p(\beta_j \mid y, X)\, d\beta_j, \qquad V(\beta_j) = \int \left( \beta_j - E(\beta_j) \right)^2 p(\beta_j \mid y, X)\, d\beta_j

V(\beta_j) = E(\beta_j^2) - \left( E(\beta_j) \right)^2, \quad \text{where} \quad E(\beta_j^2) = \int \beta_j^2 \, p(\beta_j \mid y, X)\, d\beta_j
Also, the expectation and variance of any other function of β_j, say g(β_j), take the form

E\left( g(\beta_j) \right) = \int g(\beta_j)\, p(\beta_j \mid y, X)\, d\beta_j, \qquad V\left( g(\beta_j) \right) = \int \left( g(\beta_j) - E(g(\beta_j)) \right)^2 p(\beta_j \mid y, X)\, d\beta_j
For any of these expressions, MCI approximates the integral with averaging over draws. Thus

E(\beta_j) \approx \frac{1}{R} \sum_r \beta_{j,r}, \qquad E(\beta_j^2) \approx \frac{1}{R} \sum_r \beta_{j,r}^2, \qquad E\left( g(\beta_j) \right) \approx \frac{1}{R} \sum_r g(\beta_{j,r})
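In code, MCI amounts to simple averages over the retained draws. A minimal sketch, assuming the matrix betadraws from the sampler sketch above (one row per retained draw, one column per coefficient):

    % Posterior moments via Monte Carlo Integration
    postmean = mean(betadraws)';                   % E(beta_j | y, X) for all j
    postvar  = mean(betadraws.^2)' - postmean.^2;  % V(beta_j) = E(beta_j^2) - E(beta_j)^2
    poststd  = sqrt(postvar);
    % expectation of an arbitrary function g(.), e.g. g(beta_j) = exp(beta_j)
    Eg = mean(exp(betadraws(:, 2)));               % hypothetical example for the 2nd coefficient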
Convergence Plots
Convergence plots are a simple visual tool to examine if the simulator has converged to the posterior
distribution for a given parameter. The plot simply shows all draws of a given parameter in chronological
order, as generated by the sampler. "Convergence" usually implies that the draws are tightly clustered
around a flat line (the posterior mean), and do not wander widely. This is similar to assessing stationarity
for time series data.
Convergence plots can be used to assess the sensitivity of the algorithm to starting draws (the chain
should converge to the same distribution under different starting draws), the speed of convergence (and
thus the efficiency of the sampler), and the sufficiency of the chosen number of discarded draws (burn-
ins).
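A convergence plot is simply the sequence of draws plotted in iteration order. A minimal sketch, again assuming the betadraws matrix from the sampler sketch above:

    % Convergence (trace) plot for one parameter
    j = 2;                                   % pick a coefficient to inspect (assumption)
    plot(betadraws(:, j));
    xlabel('iteration');  ylabel('draw');
    title('Convergence plot for \beta_j');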
Script mod2_convergence_plots provides a few examples using the simulated data from mod2s1a.
There are three parts: Full data, data truncated to 100 observations, and data truncated to 10 observations.
For each case, we first use two sets of starting draws for β and σ². The first set comes from the OLS
output and is thus "right on target", i.e. close to the posterior mean (and the true parameter values
underlying our simulated data). The second set is deliberately located quite far from the true values.
You can see that with the full data set of 10,000 observations, the choice of starting draws virtually
doesn't matter – the chain essentially converges after one or two iterations. Furthermore, parameter draws
fluctuate very tightly around the posterior mean. A reduction in sample size implicitly assigns more
weight to our diffuse priors. As a result, the chain exhibits more "noise", i.e. wider fluctuations around
the mean. This will translate into a larger posterior standard deviation. However, even with just a
handful of observations, convergence is virtually immediate even with off-target starting draws. This is
an indication that the sampler itself is efficient, i.e. "mixes rapidly".
Posterior noise (i.e. the posterior standard deviation) is also driven by the information content of the data,
regardless of sample size. Highly collinear data implies poor information content and will generate wider
fluctuations around the posterior mean. This is illustrated in script mod2_convergence_plots2. If
you compare the plots from this script to the ones from before for any parameter and sample size, you will
notice the increase in variability around the posterior mean for the collinear data.
As in the conjugate prior case, it is instructive to plot and compare the prior and posterior distributions
for our parameters. This is accomplished in script mod2_plots. As before, the posterior densities are
much tighter than the priors in all cases.
Matlab script mod2_application implements the normal linear regression model with independent
priors using Mroz's (1987) labor data and provides posterior plots for selected parameters.
Numerical Standard Error (nse)

The nse captures simulation noise for a value of interest generated via posterior simulation. Usually this
value of interest is a measure of central tendency, such as the posterior mean. Consider the mean

\bar{\theta} = \frac{1}{M} \sum_{m=1}^{M} \theta_m

of a sequence of M draws of (some generic) parameter θ. Assume these draws were generated by a Gibbs
Sampler (GS) or some other Markov Chain Monte Carlo (MC²) algorithm. If these draws were perfectly
independently and identically distributed (i.i.d.) with sample variance s², we could quickly derive the nse
for this mean using the basic formula for the standard error of a sample mean, i.e.

nse(\bar{\theta}) = \sqrt{V(\bar{\theta})} = \sqrt{\frac{1}{M^2}\, M s^2} = \frac{s}{\sqrt{M}} \qquad (1)
However, as with all MC² procedures, we have to assume ex ante that these sequential draws will have a
considerable degree of correlation. This means we have to consider all covariance terms between all
draws of θ. After some straightforward analytical simplifications (shown in detail in KPT, p. 145), we
end up with a more general expression for the nse that allows for correlation across all draws:

nse(\bar{\theta}) = \sqrt{V(\bar{\theta})} = \sqrt{ \frac{s^2}{M} \left( 1 + 2 \sum_{j=1}^{M-1} \left( 1 - \frac{j}{M} \right) \rho_j \right) } \qquad (2)
where ρ_j = s_j / s² is the lag-correlation between some draw θ_i and the draw that was obtained j iterations
prior to θ_i, i.e. θ_{i−j}, with s_j denoting the associated sample covariance. It can easily be seen from (2)
that under perfect independence (ρ_j = 0 for all j) we arrive again at the basic formula given in (1). In most
MC² applications, lag-correlations will be positive but declining in magnitude. Thus, the second term
under the square root in (2) will exceed 1, and the nse under correlation will be larger than the nse under
independence. Note that in either case the nse can be made arbitrarily small by increasing M, the number
of draws. However, this can be very costly in terms of computation time. Thus, we are always interested
in devising a posterior sampler that is "efficient" in the sense of generating draws with low lag-
correlation. There are many "tricks" for accomplishing this – we'll touch upon a few in this course.
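The following sketch computes the correlation-adjusted nse in (2) for a single parameter. In practice the sum over lags is truncated once the ρ_j have died out; the cut-off L below is an arbitrary choice, and the draws are assumed to sit in the betadraws matrix from the sampler sketch above.

    % Numerical standard error allowing for lag-correlation, as in (2)
    theta = betadraws(:, j);                   % retained draws of one parameter (assumption)
    M   = length(theta);
    s2  = var(theta);
    L   = 100;                                 % lag truncation (assumption)
    rho = zeros(L, 1);
    for lag = 1:L
        c = cov(theta(1+lag:end), theta(1:end-lag));
        rho(lag) = c(1, 2) / s2;               % lag-correlation rho_j = s_j / s^2
    end
    w   = 1 - (1:L)' / M;                      % weights (1 - j/M)
    nse = sqrt( (s2/M) * (1 + 2 * sum(w .* rho)) );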
In summary, the first purpose of the nse is to provide a measure of “simulation error” or “simulation
noise” surrounding a posterior construct of interest, usually the posterior mean of a given parameter.
Thus, the nse has a similar function to the standard error (s.e.) in Classical Analysis. However, its
intuition is very different – it simply captures simulation noise, i.e. the penalty for having to approximate
the joint posterior via simulations (since its analytical form is unknown). The Classical s.e. conveys the
notion of sampling error – i.e. the variability of the statistical construct of interest under (hypothetical) re-
sampling. You can also think of this as the penalty from working with a small sample relative to a large
population.
The second purpose of the nse is to provide a summary measure of the "efficiency" of the
posterior simulator. An efficient simulator will have low correlation across draws, and thus will be able
to "tell the same story with fewer draws" – saving valuable computing time. As discussed in Chib (2001,
section 3.2), the ratio of the squared nse under correlation over the squared nse under independence can be
interpreted as an "inefficiency factor" (IEF), also known as "autocorrelation time", for a given parameter, i.e.

IEF(\bar{\theta}) = \frac{nse^2(\bar{\theta})}{nse^2(\bar{\theta};\ \rho_j = 0\ \forall j)} = 1 + 2 \sum_{j=1}^{M-1} \left( 1 - \frac{j}{M} \right) \rho_j \qquad (3)
A well-designed posterior simulator will generate sequences of parameter draws with low IEFs, the ideal
being an IEF close to 1. Sometimes the inverse of the IEF is used to measure posterior efficiency. This
quantity is called "numerical efficiency" (see Geweke, 1992). A third quantity to assess efficiency is the
"i.i.d.-equivalent number of iterations", labeled M* in the following, i.e. the number of i.i.d. draws that
contain the same amount of information about θ as the observed number of draws under correlation. It is
easily derived via

\frac{s^2}{M^*} = \frac{s^2}{M} \left( 1 + 2 \sum_{j=1}^{M-1} \left( 1 - \frac{j}{M} \right) \rho_j \right) \ \rightarrow\ M^* = \frac{M}{1 + 2 \sum_{j=1}^{M-1} \left( 1 - \frac{j}{M} \right) \rho_j} = \frac{M}{IEF} \qquad (4)
It follows that under perfect efficiency, we have M * = M , but usually we observe M * < M .
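Building on the nse sketch above (which left s2, M, and nse in memory), the IEF in (3) and the i.i.d.-equivalent number of draws in (4) follow in a few lines:

    % Inefficiency factor and i.i.d.-equivalent number of draws
    nse_iid = sqrt(s2 / M);          % nse under independence, as in (1)
    IEF     = nse^2 / nse_iid^2;     % inefficiency factor, as in (3)
    Mstar   = M / IEF;               % i.i.d.-equivalent draws, as in (4)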
It should be noted that a very high IEF (very low M*) can be indicative of identification problems in your
model. How high is "very high"? From personal experience, I would say that IEFs in the 1-5 range
indicate good efficiency, in the 6-20 range they're still "tolerable", but anything above 20 deserves closer
inspection. Certainly, IEFs of 100 and higher are almost a sure-bet indication of an identification or
specification problem in the underlying structural model.
Another important sampler diagnostic is Geweke's (1992) CD score. It is based on the simple intuition
that if the entire sequence of retained θ's (focusing on a single parameter for simplicity) can truly be
interpreted as random draws from the same posterior density p(θ | y), and we divide the sequence of R
draws into three segments, then the mean of the first segment of r = 1…R₁ draws should be "not too different"
from the mean of the last segment of r = R₂+1…R draws. As stated in Koop Ch. 4, setting R₁ = 0.1R and
R₂ = 0.6R produces adequate results for most applications. Define the two segment means as θ̄₁ and θ̄₂,
respectively. Then, asymptotically, the difference between these means, weighted by their respective
numerical standard errors, converges to a standard normal ("z") variate, i.e.
CD = \frac{\bar{\theta}_1 - \bar{\theta}_2}{\sqrt{nse_1^2 + nse_2^2}} \ \overset{a}{\sim}\ n(0, 1) \qquad (4.5)
Thus, a CD value that clearly exceeds 1.96 for a specific parameter θ would raise a flag – it indicates that
the sequence of posterior draws may not have converged to p(θ | y). In practice, a few CD values in
the 2-2.5 range for a model with many parameters would hardly raise concerns. However, if your
posterior simulator generates CD values of 3 or higher, an increase in the number of burn-in draws may
be warranted. As with IEFs, grossly inflated CD values may also indicate identification and/or mis-
specification problems in the underlying model.
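A sketch of the CD computation for one parameter is given below, reusing the theta vector and M from the nse sketch above. For brevity it uses the simple i.i.d. nse for the two segments; in practice the correlation-adjusted nse from (2) should be used.

    % Geweke CD score: compare the first 10% and the last 40% of retained draws
    R1   = floor(0.1 * M);   R2 = floor(0.6 * M);
    seg1 = theta(1:R1);      seg2 = theta(R2+1:end);
    nse1 = sqrt(var(seg1) / length(seg1));   % i.i.d. nse of first segment (simplification)
    nse2 = sqrt(var(seg2) / length(seg2));   % i.i.d. nse of last segment (simplification)
    CD   = (mean(seg1) - mean(seg2)) / sqrt(nse1^2 + nse2^2);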
The Matlab function "klausdiagnostics" automatically generates the following posterior statistics for
all model parameters: posterior mean, posterior standard deviation, nse, IEF, M*, and CD. This is
illustrated in our application of the normal regression model with independent priors. The Matlab
function "klausdiagnostics_greater0" produces, in addition, the posterior probability for a
parameter to exceed zero. This conveys a quick picture as to where the bulk of the posterior is located vis-
à-vis zero. The classically trained reader can relate to this quite well, as it provides "similarly flavored"
information to the standard t-statistic or p-value in a typical regression output.
There are two additional noteworthy diagnostics tools: autocorrelation plots (“AC plots”) and re-running
the posterior sampler with different starting values, as implemented in Session 6 of this course. An AC
plot provides a simple visual inspection of the lag-correlation terms (ρ_j in (2)) for a given parameter.
An example for AC plots is provided in the Matlab script mod2_ac_plots. A "well-behaved" AC plot
will exhibit small correlation effects that randomly fluctuate around zero. A plot with high (usually
positive) correlations that taper off only very slowly with increasing lag would be indicative of
inefficiencies in the posterior simulator.
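An AC plot can be generated directly from the lag-correlations ρ_j computed in the nse sketch above (or, alternatively, with the autocorr function from Matlab's Econometrics Toolbox):

    % Autocorrelation ("AC") plot for one parameter
    stem(1:L, rho, 'filled');
    xlabel('lag j');  ylabel('\rho_j');
    title('AC plot');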
Blocking
As discussed in previous Sessions of this course, a Gibbs Sampler operates by splitting the full set of
parameters into different groups or "blocks", which are then drawn sequentially and repeatedly,
conditional on all other blocks.
The main consideration in designing these blocks for a standard Gibbs Sampler is that the conditional
posterior density for each block is known, else we wouldn't be able to take any draws from it. However,
this requirement can be relaxed by employing other posterior simulation techniques, such as the
Metropolis Hastings (MH) algorithm, which does not require full knowledge of the conditional posterior
density for a given parameter or block of parameters. Thus, the "optimal blocking" of the full set of
parameters becomes a more general question.
As discussed inter alia in Chib (2001, section 7.1) it is generally recommended that parameters be drawn
in as few blocks as possible, and that parameters that tend to be highly correlated be collected in the same
block.
Identifying what constitutes a full-fledged block in a given posterior algorithm can be tricky. This is
because blocks can be combined by the method of composition, i.e. by exploiting the fact that partially
conditional posterior distributions may be known (or can be approximated at low computational cost) for
some parameters. For example, consider an initial blocking of the full parameter vector into three groups,
θ₁, θ₂, and θ₃. A "naïve" posterior sampler will then operate as follows:

1. Draw θ₁ from p(θ₁ | θ₂, θ₃, y)
2. Draw θ₂ from p(θ₂ | θ₁, θ₃, y)
3. Draw θ₃ from p(θ₃ | θ₁, θ₂, y)
Now suppose the partially conditional density p(θ₁ | θ₃, y) is known or can be approximated at low
computational cost. A (likely) more efficient posterior sampler would then collect θ₁ and θ₂ in a single
block and operate as follows:

1. Draw θ₁ from p(θ₁ | θ₃, y), then draw θ₂ from p(θ₂ | θ₁, θ₃, y)
2. Draw θ₃ from p(θ₃ | θ₁, θ₂, y)
Thus, even though step 1 involves 2 sub-steps, it is considered a single block. The key notion in
identifying the number of blocks is that for a set of parameters to constitute a self-standing block, it needs
to be drawn conditional on all other blocks, and vice versa, i.e. all other blocks also need to be
conditioned on the first block.
For the normal regression model we have considered so far, the parameter blocking was quite obvious
and, as it turns out, efficient: We grouped the constant term and all slope parameters into a single block
(β), which left the regression variance σ² as the only other (single-parameter) block. Our GS proceeded
as follows:

1. Draw β from p(β | σ², y, X)
2. Draw σ² from p(σ² | β, y, X)
Matlab script mod2_blocking proposes a different approach based on 4 blocks. Specifically, we split β
into three parts of equal length, labeled β₁, β₂, and β₃. We then proceed as follows, using function
gs_normal_blocked:

1. Draw β₁ from p(β₁ | β₂, β₃, σ², y, X)
2. Draw β₂ from p(β₂ | β₁, β₃, σ², y, X)
3. Draw β₃ from p(β₃ | β₁, β₂, σ², y, X)
4. Draw σ² from p(σ² | β, y, X)
In practice, draws of β_j, j = 1…3, can be obtained as follows. We know that for the basic linear regression model

y = X\beta + \varepsilon \qquad (6)

the full conditional posterior of β is

\beta \mid \sigma^2, y, X \sim n(\mu_1, V_1) \quad \text{with} \quad V_1 = \left( V_0^{-1} + \tfrac{1}{\sigma^2} X'X \right)^{-1} \quad \text{and} \quad \mu_1 = V_1 \left( V_0^{-1} \mu_0 + \tfrac{1}{\sigma^2} X'y \right) \qquad (7)
Now partition X and β into three parts corresponding to our new blocking, i.e.

y = X_1\beta_1 + X_2\beta_2 + X_3\beta_3 + \varepsilon \qquad (8)

Conditional on β₂ and β₃, define the partial residual ỹ = y − X₂β₂ − X₃β₃, so that we can write

\tilde{y} = X_1\beta_1 + \varepsilon \qquad (9)

Applying (7) to this conditional model yields

\beta_1 \mid \beta_2, \beta_3, \sigma^2, y, X \sim n(\mu_1, V_1) \quad \text{with} \quad V_1 = \left( V_{01}^{-1} + \tfrac{1}{\sigma^2} X_1'X_1 \right)^{-1} \quad \text{and}

\mu_1 = V_1 \left( V_{01}^{-1} \mu_{01} + \tfrac{1}{\sigma^2} X_1'\tilde{y} \right) = V_1 \left( V_{01}^{-1} \mu_{01} + \tfrac{1}{\sigma^2} X_1'(y - X_2\beta_2 - X_3\beta_3) \right) \qquad (10)

where μ₀₁ and V₀₁ are the prior mean and variance for β₁. Draws of β₂ and β₃ can be obtained in
analogous fashion.
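As a sketch, a single draw of β₁ according to (10) could look as follows, where X1, X2, X3, mu01, and V01 are assumed to hold the partitioned regressors and the corresponding prior mean and variance (these names are illustrative, not necessarily those used in gs_normal_blocked):

    % Draw beta1 | beta2, beta3, sigma^2, y, X as in (10)
    ytilde = y - X2*beta2 - X3*beta3;                          % partial residual
    V1     = inv(inv(V01) + (1/sig2) * (X1' * X1));
    mu1    = V1 * (inv(V01) * mu01 + (1/sig2) * (X1' * ytilde));
    beta1  = mu1 + chol(V1, 'lower') * randn(size(X1, 2), 1);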
The posterior output clearly shows efficiency losses compared to the original version, as judged by IEF
and M* scores. This is depicted graphically in script mod2_ac_plots, which compares autocorrelation
plots for the original and the "excessively-blocked" version for selected parameters.
Also note the loss in speed – the inefficient sampler takes about twice as long to take the same number of
draws.
We can also compare the relative performance of the two samplers based on convergence plots, as
implemented in script mod2_convergence_plots3. The plotted chain of draws for "age" is especially
illustrative: the chain wanders widely, with a clear auto-correlation pattern. The main drawback of such a
highly correlated chain is that it takes much longer to "visit" the entire posterior distribution with
appropriate frequencies. This may result in misleading posterior inference, based on overly tight or
otherwise "incomplete" distributions. (In fact, the posterior standard deviations flowing from the
inefficient sampler are actually slightly smaller than those generated by the efficient sampler.)
As you may expect, the choice of priors becomes especially important under small sample sizes, where
the data will be much less dominant in shaping the posterior distribution.
As mentioned previously, this potentially pronounced effect of prior distributions and parameters on
"final results" is both a curse and a blessing. It requires great care in prior selection, but it also allows for
the introduction of pertinent information that is exogenous to the (small) data at hand.
With our diagnostic tools in hand, we are now well-equipped to take a closer look at the role of priors in
small-sample applications.
Matlab script mod2_wetlands applies the normal linear regression model with independent priors to a
small data set of 12 observations as used in Moeltner and Woodward (2009).
This is an example of a meta-dataset, compiled from results and summary statistics reported in existing
studies. This secondary data set is then processed via meta-regression to yield insights into potential
outcomes associated with a new policy site or setting. This strategy is commonly referred to as "Benefit
Transfer", and constitutes a low-cost alternative to primary data collection.
For our purposes, think of this application simply as a regression model with a very small sample. The
outcome variable of interest is the average willingness to pay (per year) across a group of residents (“sub-
population”) to preserve a specific wetland area. Explanatory variables are the percentage of active
wetland users in the sub-population, average annual household income (in log form), and wetland size (in
1000 acres).
Script mod2_wetlands estimates the model using our efficient 2-block Gibbs Sampler and vague priors
for all parameters. The posterior output suggests convergence (based on CD scores) and good efficiency
(based on IEF scores). As expected, the posterior standard deviations are relatively large (say, compared
to the posterior mean), a direct effect of the vague priors.
Script mod2_wetlands2 implements the same model with informed priors based on estimated
parameters reported in the literature, as described in Moeltner and Woodward (2009). Again, the
diagnostics point at convergence and good efficiency. The posterior standard deviations are smaller than
in the original model, which is unambiguously desirable.
However, the posterior means have changed as well, which leads to changed inference on expected
marginal effects. For example, the original model produces a posterior mean for income elasticity of
0.288. This increases to 0.388 in the refined model.
Script mod2_wetlands_plots compares the posterior and prior densities across the two models for a
selected set of parameters. Note that there are efficiency spillovers even for parameters that themselves
did not receive informed priors (such as the marginal effect of "users").
References
Chib, Siddhartha. 2001. "Markov Chain Monte Carlo Methods: Computation and Inference," in J. J.
Heckman and E. Leamer (eds.), Handbook of Econometrics. Elsevier.
Moeltner, K. and R. Woodward. 2009. "Meta-Functional Benefit Transfer for Wetland Valuation: Making
the Most of Small Samples." Environmental and Resource Economics 42, 89-109.