Bayesian Model Averaging For Linear Regression Models
To cite this article: Adrian E. Raftery, David Madigan & Jennifer A. Hoeting (1997), Bayesian Model Averaging for Linear Regression Models, Journal of the American Statistical Association, 92:437, 179-191, DOI: 10.1080/01621459.1997.10473615
We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model
ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest.
A Bayesian solution to this problem involves averaging over all possible models (i.e., combinations of predictors) when making
inferences about quantities of interest. This approach is often not practical. In this article we offer two alternative approaches.
First, we describe an ad hoc procedure, “Occam’s window,” which indicates a small set of models over which a model average
can be computed. Second, we describe a Markov chain Monte Carlo approach that directly approximates the exact solution. In the
presence of model uncertainty, both of these model averaging procedures provide better predictive performance than any single
model that might reasonably have been selected. In the extreme case where there are many candidate predictors but no relationship
between any of them and the response, standard variable selection procedures often choose some subset of variables that yields a
high R² and a highly significant overall F value. In this situation, Occam's window usually indicates the null model (or a small number of models including the null model) as the only one (or ones) to be considered, thus largely resolving the problem of selecting apparently significant models when there is no signal in the data. Software to implement our methods is available from StatLib.
KEY WORDS: Bayes factor; Markov chain Monte Carlo model composition; Model uncertainty; Occam’s window; Posterior
model probability.
180 Journal of the American Statistical Association, March 1997
(Freedman, Navidi, and Peters 1986; Leamer 1978; Madigan and Raftery 1994; Stewart 1987; Stewart and Davis 1986).

In the next section we outline the philosophy underlying our approach. In Section 3 we describe how we selected prior distributions, and we outline the two model averaging approaches in Section 4. In Section 5 we provide an example and describe our assessment of predictive performance. In Section 6 we compare the performance of Occam's window to that of standard variable selection methods when there is no relationship between the predictors and the response. Finally, in Section 7 we discuss related work and suggest future directions.

2. ACCOUNTING FOR MODEL UNCERTAINTY USING BMA

As described previously, basing inferences on a single "best" model as if the single selected model were true ignores model uncertainty, which can result in underestimating uncertainty about quantities of interest. There is a standard Bayesian solution to this problem, proposed by Leamer (1978). If M = {M_1, ..., M_K} denotes the set of all models being considered, and if Δ is the quantity of interest, such as a future observation or the utility of a course of action, then the posterior distribution of Δ given the data D is

    Pr(Δ | D) = Σ_{k=1}^K Pr(Δ | M_k, D) Pr(M_k | D).   (1)

This is an average of the posterior distributions under each model, weighted by the corresponding posterior model probabilities. We call this Bayesian model averaging (BMA). In Equation (1), the posterior probability of model M_k is given by

    Pr(M_k | D) = Pr(D | M_k) Pr(M_k) / Σ_{l=1}^K Pr(D | M_l) Pr(M_l),   (2)

where

    Pr(D | M_k) = ∫ Pr(D | θ_k, M_k) Pr(θ_k | M_k) dθ_k   (3)

is the marginal likelihood of model M_k, θ_k is the vector of parameters of M_k, and Pr(M_k) is the prior probability of M_k. Averaging over all of the models in this fashion provides better average predictive ability, as measured by a logarithmic scoring rule, than using any single model M_j:

    −E[ log { Σ_{k=1}^K Pr(Δ | M_k, D) Pr(M_k | D) } ] ≤ −E[ log Pr(Δ | M_j, D) ]   (j = 1, ..., K),

where Δ is the observable to be predicted and the expectation is with respect to Σ_{k=1}^K Pr(Δ | M_k, D) Pr(M_k | D). This follows from the nonnegativity of the Kullback-Leibler information divergence.

Implementation of Bayesian model averaging is difficult for two reasons. First, the integrals in (3) can be hard to compute. Second, the number of terms in (1) can be enormous. In this article we present solutions to both of these problems.

3. BAYESIAN FRAMEWORK

3.1 Modeling Framework

Each model that we consider is of the form

    Y = β_0 + Σ_{j=1}^p β_j X_j + ε = Xβ + ε,   (4)

where the observed data on the p predictors are contained in the n × (p + 1) matrix X. The observed data on the dependent variable are contained in the n-vector Y. We assign to ε a normal distribution with mean zero and variance σ², and assume that the ε's in distinct cases are independent. We consider the (p + 1) individual parameters β and σ² to be unknown.

Where possible, informative prior distributions for β and σ² should be elicited and incorporated into the analysis (see Garthwaite and Dickey 1992 and Kadane, Dickey, Winkler, Smith, and Peters 1980). In the absence of expert opinion, we seek to choose prior distributions that reflect uncertainty about the parameters and also embody reasonable a priori constraints. We use prior distributions that are proper but reasonably flat over the range of parameter values that could plausibly arise. These represent the common situation where there is some prior information, but rather little of it, and put us in the "stable estimation" case where results are relatively insensitive to changes in the prior distribution (Edwards, Lindman, and Savage 1963). We use the standard normal-gamma conjugate class of priors,

    β | σ², M_i ~ N(μ_i, σ² V_i),   νλ/σ² ~ χ²_ν,

under which the marginal likelihood of the data under M_i is, up to a constant common to all models,

    Pr(D | M_i) ∝ |I + X_i V_i X_i^T|^{−1/2} {νλ + (Y − X_i μ_i)^T (I + X_i V_i X_i^T)^{−1} (Y − X_i μ_i)}^{−(n+ν)/2},   (5)

where X_i is the design matrix and V_i is the covariance matrix for β corresponding to model M_i (Raiffa and Schlaifer 1961). The Bayes factor for M_0 versus M_1, the ratio of Equation (5) for i = 0 and i = 1, is then given by

    B_01 = { |I + X_1 V_1 X_1^T| / |I + X_0 V_0 X_0^T| }^{1/2} (a_1 / a_0)^{(n+ν)/2},   (6)

where

    a_i = νλ + (Y − X_i μ_i)^T (I + X_i V_i X_i^T)^{−1} (Y − X_i μ_i),   i = 0, 1.

3.2 Selection of Prior Distributions

The Bayesian framework described earlier gives the BMA user the flexibility to modify the prior setup as desired. In this section we describe the prior distribution setup that we adopt in our examples below.

For noncategorical predictor variables, we assume the individual β's to be independent a priori. We center the distribution of β on zero (apart from β_0) and choose μ = (β̂_0, 0, 0, ..., 0), where β̂_0 is the ordinary least squares estimate of β_0. The covariance matrix V is equal to σ² multiplied by a diagonal matrix with entries (s_y², φ²s_1^{−2}, φ²s_2^{−2}, ..., φ²s_p^{−2}), where s_y² denotes the sample variance of Y, s_i² denotes the sample variance of X_i for i = 1, ..., p, and φ is a hyperparameter to be chosen. The prior variance of β_0 is chosen conservatively and represents an upper bound on the reasonable variance for this parameter. The variances of the remaining β parameters are chosen to reflect increasing precision about each β_i as the variance of the corresponding X_i increases, and to be invariant to scale changes in both the predictor variables and the response variable.

For a categorical predictor variable X_i with (c + 1) possible outcomes (c ≥ 2), the Bayes factor should be invariant to the selection of the corresponding dummy variables (X_{i1}, ..., X_{ic}). To this end, we set the prior variance of (β_{i1}, ..., β_{ic}) equal to σ²φ²[(1/n) X_i^T X_i]^{−1}, where X_i is the n × c design matrix for the dummy variables, where each dummy variable has been centered by subtracting its sample mean. This is related to the g prior of Zellner (1986). The complete prior covariance matrix for β is then the block-diagonal matrix

    V(β) = σ² · diag( s_y², φ²s_1^{−2}, ..., φ²[(1/n) X_i^T X_i]^{−1}, ... ),

with a scalar entry φ²s_i^{−2} for each noncategorical predictor and a c × c block φ²[(1/n) X_i^T X_i]^{−1} for each categorical predictor.
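As a concrete illustration of Equations (5) and (6), the following sketch evaluates the marginal likelihood and the resulting Bayes factor under the Section 3.2 prior. The synthetic data, the pair of nested models compared, and the hyperparameter values phi, nu, and lam are hypothetical choices for illustration only, not values used in the article.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical data: n = 40 cases, two candidate predictors, and a
# response that depends on predictor 0 only.
n = 40
x = rng.normal(size=(n, 2))
Y = 1.5 * x[:, 0] + rng.normal(size=n)

phi, nu, lam = 1.0, 2.0, 1.0          # illustrative hyperparameters
s2_y = Y.var(ddof=1)
s2_x = x.var(axis=0, ddof=1)

def log_marg(cols):
    """Log marginal likelihood, Equation (5), up to a constant common to
    all models, with the Section 3.2 prior: mu = (beta_0 estimate, 0, ..., 0)
    and V = diag(s_y^2, phi^2 / s_i^2 for each included predictor)."""
    Xi = np.column_stack([np.ones(n)] + [x[:, j] for j in cols])
    Vi = np.diag(np.concatenate(([s2_y], phi**2 / s2_x[cols])))
    mu = np.zeros(Xi.shape[1])
    mu[0] = Y.mean()                  # assumption: use the sample mean for beta_0
    M = np.eye(n) + Xi @ Vi @ Xi.T    # I + X_i V_i X_i^T
    r = Y - Xi @ mu
    a = nu * lam + r @ np.linalg.solve(M, r)   # a_i from Equation (6)
    _, logdet = np.linalg.slogdet(M)
    return -0.5 * logdet - 0.5 * (n + nu) * np.log(a)

# Bayes factor (6) as a ratio of Equation (5) for two models.
log_B01 = log_marg([0, 1]) - log_marg([0])   # both predictors vs. predictor 0
log_B_incl = log_marg([0]) - log_marg([])    # predictor 0 vs. the null model
```

With these data, log_B_incl comes out large and positive: the marginal likelihood strongly favors including the predictor that actually generated the response.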
Figure 2. Occam's Window: Interpreting the Posterior Odds for Nested Models.

[...] Otherwise, the state stays in state M. Madigan and York (1995) described MC3 for discrete graphical models. Software for implementing the MC3 algorithm is described in the Appendix.
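MC3 runs a Markov chain on the model space: from the current model M, a candidate M' differing by exactly one predictor is drawn and the chain moves with probability min{1, Pr(M' | D) / Pr(M | D)}; otherwise the state stays in state M. A minimal sketch follows; to keep it short, it scores models with a BIC-style approximation to the posterior model probability rather than the paper's exact normal-gamma marginal likelihood, and the data are synthetic.

```python
import numpy as np

rng = np.random.default_rng(2)

# Synthetic data: the response depends on predictors 0 and 1 only.
n, p = 100, 6
X = rng.normal(size=(n, p))
Y = X[:, 0] + X[:, 1] + rng.normal(size=n)

def log_score(gamma):
    # BIC-style approximation to log Pr(D | M); the paper instead uses
    # the exact normal-gamma marginal likelihood of Section 3.
    cols = np.flatnonzero(gamma)
    Xg = np.column_stack([np.ones(n)] + [X[:, j] for j in cols])
    beta, *_ = np.linalg.lstsq(Xg, Y, rcond=None)
    rss = np.sum((Y - Xg @ beta) ** 2)
    return -0.5 * n * np.log(rss / n) - 0.5 * Xg.shape[1] * np.log(n)

gamma = np.zeros(p, dtype=int)        # start from the null model
cur = log_score(gamma)
visits = {}
for _ in range(4000):
    prop = gamma.copy()
    prop[rng.integers(p)] ^= 1        # flip one predictor in or out
    new = log_score(prop)
    if np.log(rng.random()) < new - cur:   # Metropolis acceptance
        gamma, cur = prop, new
    key = tuple(gamma)
    visits[key] = visits.get(key, 0) + 1

# Estimated inclusion probabilities Pr(beta_j != 0 | D) from visit counts.
total = sum(visits.values())
p_incl = np.array([sum(c for k, c in visits.items() if k[j])
                   for j in range(p)]) / total
```

The chain quickly moves to models containing the two true predictors, whose estimated inclusion probabilities end up near 1, while the noise predictors cycle in and out with much lower frequency.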
5. MODEL UNCERTAINTY AND PREDICTION

5.1 Example: Crime and Punishment

5.1.1 Crime and Punishment: Overview. Up to the 1960s, criminal behavior was traditionally viewed as deviant and linked to the offender's presumed exceptional psychological, social, or family circumstances (Taft and England 1964). Becker (1968) and Stigler (1970) argued that on the contrary, the decision to engage in criminal activity is a rational choice determined by its costs and benefits relative to other (legitimate) opportunities.

In an influential article, Ehrlich (1973) developed this argument theoretically, specified it mathematically, and tested it empirically using aggregate data from 47 U.S. states in 1960. Errors in Ehrlich's empirical analysis were corrected by Vandaele (1978), who gave the corrected data, which we use here (see also Cox and Snell 1982). (Ehrlich's study has been much criticized (see, e.g., Brier and Fienberg 1980), and we cite it here for purely illustrative purposes. For economy of expression, we use causal language and speak of "effects," even though the validity of this language for these data is dubious. Because people, not states, commit crimes, these data may reflect aggregation bias.)

Ehrlich's theory goes as follows. The costs of crime are related to the probability of imprisonment and the average time served in prison, which in turn are influenced by police expenditures, which may themselves have an independent deterrent effect. The benefits of crime are related to both the aggregate wealth and income inequality in the surrounding community. The expected net payoff from alternative legitimate activities is related to educational level and the availability of employment, the latter being measured by the unemployment and labor force participation rates. The payoff from legitimate activities was expected to be lower (in 1960) for nonwhites and for young males than for others, so that states with high proportions of these were expected also to have higher crime rates. Vandaele (1978) also included an indicator variable for southern states, the sex ratio, and the state population as control variables, but the theoretical rationale for inclusion of these predictors is unclear.

We thus have 15 candidate predictors of crime rate (Table 4), and so potentially 2^15 = 32,768 different models. As in the original analyses, all data were transformed logarithmically. Standard diagnostic checking (see, e.g., Draper and Smith 1981) did not reveal any gross violations of the assumptions underlying normal linear regression.

Ehrlich's analysis concentrated on the relationship between crime rate and predictors 14 and 15 (probability of imprisonment and average time served in state prisons). In his original analysis, Ehrlich (1973) focused on two regression models, consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), which were chosen in advance based on theoretical grounds.

To compare Ehrlich's results with models that might be selected using standard techniques, we chose three popular variable selection techniques: Efroymson's stepwise method (Miller 1990), minimum Mallows' Cp, and maximum adjusted R² (Weisberg 1985). Efroymson's stepwise method is like forward selection, except that when a new variable is added to the subset, partial correlations are considered to see whether any of the variables currently in the subset should be dropped. Similar hybrid methods are found in most standard statistical computer packages. Problems with stepwise regression, Mallows' Cp, and adjusted R² are well known (see, e.g., Weisberg 1985).

Table 1 displays the results from the full model with all 15 predictors, three models selected using standard variable selection techniques, and the two models chosen by Ehrlich on theoretical grounds. The three models chosen using variable selection techniques (models 2, 3, 4) share many of the same variables and have high values of R². Ehrlich's theoretically chosen models fit the data less well. There are striking differences, indeed conflicts, between the results from the different models. Even the models chosen using statistical techniques lead to conflicting conclusions about the main questions of interest, despite the models' superficial similarity.

Consider first the predictor for probability of imprisonment, X_14. This is a significant predictor in all six models, so interest focuses on estimating the size of its effect. To aid interpretation, recall that all variables have been transformed logarithmically, so that when all other predictors are held fixed, β_14 = −.30 means roughly that a 10% increase in the probability of imprisonment produces a 3% reduction in the crime rate. The estimates of β_14 fluctuate wildly between models. The stepwise regression model gives an estimate about one-third lower in absolute value than the full model, enough to be of policy importance; this difference is equal to about 1.7 standard errors. The Ehrlich models give estimates that are about one-half higher than the full model, and more than twice as big as those from stepwise regression (in absolute value). There is clearly considerable model uncertainty about this parameter.
Table 1. Models Selected for Crime Data (columns: #, Method, Variables, R² (%), Number of variables, β̂_14, β̂_15, p_15; table body not recovered)

NOTE: p_15 is the p value from a two-sided t test for testing β_15 = 0. For the stepwise procedure, F = 3.84 was used for the F-to-enter and F-to-delete value. This corresponds approximately to the 5% level.
Table 2. Crime Data: Occam's Window Posterior Model Probabilities

    Model                  Posterior model probability (%)
    1 3 4 9 11 13 14       12.6
    1 3 4 11 13 14          9.0
    1 3 4 9 13 14           8.4
    1 3 5 9 11 13 14        8.0
    3 4 8 9 13 14           7.6
    1 3 4 13 14             6.3
    1 3 4 11 13             5.8
    1 3 5 11 13 14          5.7
    1 3 4 13                4.9
    1 3 5 9 13 14           4.8
    3 5 8 9 13 14           4.4
    3 4 9 13 14             4.1
    3 5 9 13 14             3.6
    1 3 5 13 14             3.5
    2 3 4 13 14             2.0
    1 3 5 11 13             1.9
    3 4 13 14               1.6
    3 5 13 14               1.6
    3 4 13                  1.4
    1 3 5 13                1.4
    3 5 13                   .7
    1 4 12 13                .7

Table 3. Crime Data: MC3 Models With Posterior Model Probabilities of 1.2% or Larger (table body not recovered)

Now consider β_15, the effect of the average time served in state prisons. Whether this is significant at all is not clear, and t tests based on different models lead to conflicting conclusions. In the full model, β_15 has a nonsignificant p value of .133, while stepwise regression leads to a model that does not include this variable. On the other hand, Mallows' Cp leads to a model in which the p value for β_15 is significant at the .05 level, whereas with adjusted R² it is again not significant. In contrast, in Ehrlich's models it is highly significant.

Together these results paint a confused picture about β_14 and β_15. Later we argue that the confusion can be resolved by taking explicit account of model uncertainty.

5.1.2 Crime and Punishment: Model Averaging. For the model averaging strategies, we assumed that all possible combinations of predictors were equally likely a priori. To implement Occam's window, we started from the null model and used the "up" algorithm only (see Madigan and Raftery 1994). The selected models and their posterior model probabilities are shown in Table 2. The models with posterior model probabilities of 1.2% or larger as indicated by MC3 are shown in Table 3. In total, 1,772 different models were visited during 30,000 iterations of MC3. Occam's window chose 22 models in this example, clearly indicating model uncertainty. Choosing any one model and making inferences as if it were the "true" model ignores [...] effect on the relationship between the models as measured by the Bayes factor.

Table 4 shows the posterior probability that the coefficient for each predictor does not equal 0, that is, Pr(β_i ≠ 0 | D), obtained by summing the posterior model probabilities across models for each predictor. The results from Occam's window and MC3 are fairly close for most of the predictors. Predictors with high Pr(β_i ≠ 0 | D) include proportion of young males, mean years of schooling, police expenditure, income inequality, and probability of imprisonment.

Comparing the two models analyzed by Ehrlich (1973), consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), with the results in Table 4, we see that several predictors included in Ehrlich's analysis receive little support from the data. The estimated Pr(β_i ≠ 0 | D) is quite small for predictors 6, 10, 12, and 15. Two predictors (3 and 4) have empirical support but were not included by Ehrlich. Indeed, Ehrlich's two selected models have very low posterior probabilities.

Ehrlich's work attracted attention primarily because of his conclusion that both the probability of imprisonment (predictor 14) and the average prison term (predictor 15) influenced the crime rate. The posterior distributions for the coefficients of these predictors, based on the model averaging results of MC3, are shown in Figures 3 and 4. The MC3 posterior distribution for β_14 is indeed centered away from 0, with a small spike at 0 corresponding to Pr(β_14 = 0 | D). The posterior distribution for β_14 based on Occam's window is quite similar. The spike at 0 is an artifact of our approach, in which it is possible to consider models with a predictor fully removed from the model. This is in contrast to the practice of setting the predictor close to 0 with high probability (as in George and McCulloch 1993). In contrast to Figure 3, the MC3 posterior distribution for the coefficient corresponding to average prison term is centered close to 0 and has a large spike at 0 (Fig. 4). Occam's window indicates a spike at 0 only, that is, no support for inclusion of this predictor. By averaging over all models, our results indicate support for a relationship between crime rate and predictor 14, but not predictor 15. Our model averaging results are consistent with those of Ehrlich for the probability of imprisonment, but not for the average prison term.
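Because Table 2 lists every model in Occam's window together with its posterior probability, inclusion probabilities of the kind reported in Table 4 can be recomputed from it directly by summation; a minimal sketch:

```python
# Table 2: the 22 models in Occam's window (sets of included
# predictors) with their posterior model probabilities in percent.
table2 = [
    ({1, 3, 4, 9, 11, 13, 14}, 12.6), ({1, 3, 4, 11, 13, 14}, 9.0),
    ({1, 3, 4, 9, 13, 14}, 8.4),      ({1, 3, 5, 9, 11, 13, 14}, 8.0),
    ({3, 4, 8, 9, 13, 14}, 7.6),      ({1, 3, 4, 13, 14}, 6.3),
    ({1, 3, 4, 11, 13}, 5.8),         ({1, 3, 5, 11, 13, 14}, 5.7),
    ({1, 3, 4, 13}, 4.9),             ({1, 3, 5, 9, 13, 14}, 4.8),
    ({3, 5, 8, 9, 13, 14}, 4.4),      ({3, 4, 9, 13, 14}, 4.1),
    ({3, 5, 9, 13, 14}, 3.6),         ({1, 3, 5, 13, 14}, 3.5),
    ({2, 3, 4, 13, 14}, 2.0),         ({1, 3, 5, 11, 13}, 1.9),
    ({3, 4, 13, 14}, 1.6),            ({3, 5, 13, 14}, 1.6),
    ({3, 4, 13}, 1.4),                ({1, 3, 5, 13}, 1.4),
    ({3, 5, 13}, 0.7),                ({1, 4, 12, 13}, 0.7),
]

def inclusion_prob(i):
    # Pr(beta_i != 0 | D): total probability of the models including i.
    return sum(prob for model, prob in table2 if i in model)

# Every listed model contains predictor 4 or predictor 5 (the two police
# expenditure measures), so Pr[(beta_4 != 0) or (beta_5 != 0) | D] = 1.
either_4_or_5 = sum(prob for model, prob in table2 if 4 in model or 5 in model)
```

Here inclusion_prob(13) gives the full 100% (predictor 13 appears in every listed model), inclusion_prob(14) gives 83.2%, and either_4_or_5 gives 100%, up to floating-point rounding.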
Among the variables that measure the expected benefits from crime, Ehrlich concluded that both wealth and income inequality had an effect; we found this to be true for income inequality but not for wealth. For the predictors that represent the payoff from legitimate activities, Ehrlich found the effects of variables 1, 6, 10, and 11 to be unclear; he did not include mean schooling in his model. We found strong evidence for the effect of some of these variables, notably the percent of young males and mean schooling, but the effects of unemployment and labor force participation are either unproven or unlikely. Finally, the "control" variables that have no theoretical basis (2, 7, 8) turned out, satisfyingly, to have no empirical support either.

The model averaging results for the predictors for police expenditures lead to an interesting interpretation. Police expenditure was measured in two successive years, and the measures are highly correlated (r = .993). The data show clearly that the 1960 crime rate is associated with police expenditures, and that only one of the two measures (X_4 and X_5) is needed, but they do not say for sure which measure should be used. Each model in Occam's window contains one predictor or the other, but not both. For both Occam's window and MC3, Pr[(β_4 ≠ 0) ∪ (β_5 ≠ 0) | D] = 1, so the data provide very strong evidence for an association with police expenditures.

In summary, we found strong support for some of Ehrlich's conclusions but not for others. In particular, by averaging over all models, our results indicate support for a relationship between crime rate and probability of imprisonment, but not for average time served in state prisons.
5.1.3 Crime and Punishment: Assessment of Predictive Performance. We use the predictive ability of the selected models for future observations to measure the effectiveness of a model selection strategy. Our specific objective is to compare the quality of the predictions based on model averaging with the quality of predictions based on any single model that an analyst might reasonably have selected.

To measure performance, we randomly split the complete dataset into two subsets. Other percentage splits can be adopted. A 50-50 split was chosen here, so that each portion would contain enough data to be a representative sample. We ran Occam's window and MC3 using half of the data. This set is called the training set, D_T. We evaluated the models on the other half of the data, the performance set, using numerical and graphical measures of performance.

Predictive coverage was measured using the proportion of observations in the performance set that fall in the corresponding 90% prediction interval. For both Occam's window and MC3, 80% of the observations in the performance set fell in the 90% prediction intervals over the averaged models (Table 5).

Figure 3. Posterior Distribution for β_14, the Coefficient for the Predictor "Probability of Imprisonment," Based on the MC3 Model Average. The spike corresponds to Pr(β_14 = 0 | D). The vertical axis on the left corresponds to the posterior distribution for β_14, and the vertical axis on the right corresponds to the posterior probability that β_14 equals zero. The density is scaled so that the maximum of the density is equal to Pr(β_14 ≠ 0 | D) on the right axis.
Figure 4. Posterior Distribution for β_15, the Coefficient for the Predictor "Average Time Served in State Prisons," Based on the Model Average Over a Large Set of Models From MC3. See Figure 3.

David Draper (personal communication) suggested that BMA falls somewhat short of nominal coverage here because aspects of model uncertainty other than model selection have not been assessed. In Hoeting, Raftery, and Madigan (1995, 1996), we extended BMA to account for uncertainty in the selection of transformations and in the identification of outliers.

The results for Occam's window and MC3 using three different sets of priors were quite similar: neither procedure is highly sensitive to the choice of prior.

In an attempt to provide a graphical measure of predictive performance, we used a "calibration plot" to determine whether the predictions were well calibrated. A model is well calibrated if, for example, 70% of the observations in the test dataset are less than or equal to the 70th percentile of the posterior predictive distribution. The calibration plot shows the degree of calibration for different models, with the posterior predictive probability on the x-axis and the percentage of observed data less than or equal to the posterior predictive probability on the y-axis. In a calibration plot, perfect calibration is the 45-degree line; the closer a model's calibration line is to the 45-degree line, the better calibrated the model. The calibration plot is similar to reliability diagrams used to assess probability forecasts (see, e.g., Murphy and Winkler 1977). The calibration plot for the model chosen by stepwise selection and for model averaging using Occam's window is shown in Figure 5. The shaded area in Figure 5 shows where the model averaging strategy produces predictions that are better calibrated than predictions from the model chosen by the stepwise model selection procedure. The calibration plot for MC3 is similar.

These performance measures support our claim that conditioning on a single selected model ignores model uncertainty, which in turn leads to the underestimation of uncertainty when making inferences about quantities of interest. Model averaging leads to better-calibrated predictive distributions.
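The computation behind such a calibration plot can be sketched as follows, comparing a correctly specified predictive distribution with an overconfident one on hypothetical data; the mean absolute deviation from the 45-degree line summarizes each curve.

```python
import numpy as np
from scipy import stats

rng = np.random.default_rng(4)

# Hypothetical test data, actually distributed N(0, 1).
y_test = rng.normal(size=2000)

def calibration_curve(pred_scale, levels):
    # Fraction of observed data at or below each percentile of the
    # posterior predictive distribution (here N(0, pred_scale^2)).
    return np.array([np.mean(y_test <= stats.norm.ppf(q, scale=pred_scale))
                     for q in levels])

levels = np.linspace(0.05, 0.95, 19)
well_calibrated = calibration_curve(1.0, levels)   # near the 45-degree line
overconfident = calibration_curve(0.5, levels)     # intervals too narrow

# Mean absolute deviation from perfect calibration.
err_good = np.mean(np.abs(well_calibrated - levels))
err_over = np.mean(np.abs(overconfident - levels))
```

Plotting each curve against levels reproduces the calibration plot itself; err_good stays near zero while err_over does not, which is the kind of separation the shaded region of Figure 5 displays.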
For comparison with other standard variable selection techniques, we used the three popular variable selection procedures discussed earlier to select two or three "best" models. The models that we chose using these methods are given in Table 5. All of the individual models chosen using standard techniques performed considerably worse than the model averaging approaches, with prediction coverage ranging from 58% to 67%. Thus the model averaging strategies improved predictive coverage substantially as compared to any single model that might reasonably have been chosen.

A sensitivity analysis for priors chosen within the framework described in Section 3.2 indicates that the results for Occam's window and MC3 are not highly sensitive to the choice of prior.

5.2 Simulated Examples: Predictive Performance

In the foregoing example, the true answer is unknown. To further demonstrate the usefulness of BMA, we use several simulated examples. In our examples, we follow the format of George and McCulloch (1993).

Example 5.2.1. In this example we investigate the impact of model averaging on predictive performance when there is little model uncertainty. For the training set, we simulated p = 15 predictors and n = 50 observations as independent standard normal vectors.
Table 5. Crime Data: Predictive Coverage

    Method            Model                        Predictive coverage (%)
    MC3               Model averaging              80
    Occam's window    Model averaging              80
    Stepwise (5%)     3 4 9 13                     67
    Adjusted R² (2)   1 2 3 4 5 8 11 12 13 15      67
    Adjusted R² (3)   1 2 3 4 5 6 8 11 12 13 15    67
    Stepwise (15%)    3 4 8 9 13 15                63
    Cp (2)            1 2 3 4 11 13                63
    Adjusted R² (1)   1 2 3 4 5 11 12 13 15        58
    Cp (1)            1 2 3 4 11 13 15             58
    Cp (3)            1 2 3 4 11 12 13 15          58

NOTE: Predictive coverage is the percentage of observations in the performance set that fall in the 90% prediction interval. Method numbers correspond to the ith model chosen using the given model selection method. For example, Cp (1) is the first model chosen using the Cp method. The percentage values shown for the stepwise procedures correspond to the significance levels for the F-to-enter and F-to-delete values. For example, F = 3.84 corresponds approximately to the 5% level.
Figure 5. Crime Data: Calibration Plot. The solid line denotes model averaging (Occam's window); the dashed line, predictors 3, 4, 8, 9, 13, 15 (stepwise).

We generated the response using the model

    Y = X_4 + X_5 + ε,

where ε ~ N_50(0, σ²I) with σ = 2.5. Least squares estimates for these data are given in Table 6. There is little model uncertainty in this example; only the p values for β_4 and β_5 were smaller than .1. We generated 50 additional observations in the same manner to create the prediction set.

In this example the true model, the model averaging techniques, and models selected using standard techniques all have poor predictive coverage (Table 7). It is slightly encouraging that BMA performs better than the true model, but the improvement is too small to be significant. This and other similar examples that we simulated show that when there is very little model uncertainty, predictive performance is not significantly improved by model averaging.

Example 5.2.2. This example demonstrates the performance of BMA when a subset of the predictors is correlated. For the training set, we simulated p = 15 predictors and n = 50 observations. We obtained predictors 1-10 as independent standard normal vectors, X_1, ..., X_10 iid ~ N(0, 1), and generated predictors 11-15 using the framework

    [X_11, ..., X_15] = [X_1, ..., X_5] ([.3, .5, .7, .9, 1.1]^T [1 1 1 1 1]) + E,

where E ~ N(0, 1). We generated the response using the model

    Y = X_1 + X_2 + X_3 + X_4 + X_5 + ε,   (13)

where ε ~ N_50(0, σ²I) with σ = 2.5. Least squares estimates for these data are given in Table 8. The correlation structure resulted in moderate pairwise correlation between predictors 1-5 and 11-15 (corr(X_1, X_11) = .39, corr(X_2, X_12) = .41, corr(X_3, X_13) = .56, corr(X_4, X_14) = .71, corr(X_5, X_15) = .69) and small pairwise correlations elsewhere (median correlation equal to −.02). We generated 50 additional observations in the same manner to create the prediction set.

Table 9 shows that in this example, model averaging has better predictive performance than any single model that might have been selected. The poor performance of the true model and of the other single models selected using standard techniques demonstrates that model uncertainty can strongly influence predictive performance.

6. SUCCESSFUL IDENTIFICATION OF THE NULL MODEL

Linear regression models are frequently used even when little is known about the relationship between the predictors and the response. When there is a weak relationship between the predictors and the response, the overall F statistic will be small, and thus the null hypothesis that the null model is true fails to be rejected. However, many data analysts perform model selection regardless of the F statistic value for the overall model. Problems can then occur, as subsequent model selection techniques often choose a model that includes a subset of the predictors. Freedman (1983) has shown that in the extreme case where there is no relationship between the predictors and the response variable, omitting the predictors with the smallest t values (e.g., p > .25) can result in a model with a highly significant F statistic and a high R². In contrast, if the response and predictors are independent, Occam's window typically indicates the null model only, or the null model as one of a small number of "best" models.

Following Freedman (1983), we generated 5,100 independent observations from a standard normal distribution to create a matrix with 100 rows and 51 columns. The first column was taken to be the dependent variable in a regression equation, and the other 50 columns were taken to be the predictors. Thus the predictors are independent of the response by construction. For the entire dataset, the multiple regression results were as follows:

• R² = .55 and p = .29.
• 18 coefficients out of 50 were significant at the .25 level.
• 4 coefficients out of 50 were significant at the .05 level.

We used three different variable selection procedures on the simulated data. The first of these was the method used
Table 6. Least Squares Estimates for Example 5.2.1 (s = 2.9)

             β_0   β_1   β_2   β_3   β_4   β_5   β_6   β_7   β_8   β_9   β_10  β_11  β_12  β_13  β_14  β_15
    True β    0     0     0     0    1.00  1.00   0     0     0     0     0     0     0     0     0     0
    β̂       .42   .21   .40   .07   .95  1.72   .20   .34  −.32   .24  −.15   .60  −.45  −.08   .20   .18
    SE(β̂)   .46   .55   .56   .36   .52   .47   .39   .58   .49   .45   .44   .55   .48   .52   .45   .47
Table 7. Predictive Coverage for Example 5.2.1

    Method                         Model              Predictive coverage (%)
    BMA (estimated coverage)       Model averaging    72
    Occam's window                 Model averaging    70
    Adjusted R² (3)                2 4 5 8 11         70
    Cp (3)                         4 5 11             70
    True model and stepwise (5%)   4 5                68
    Stepwise (15%) and Cp (2)      2 4 5              68
    Cp (1)                         4 5                68
    Adjusted R² (2)                2 4 5 10 11        68
    Adjusted R² (1)                2 4 5 11           66

NOTE: Predictive coverage for BMA (all models) is estimated using the 371 models with posterior model probabilities greater than .0001; see Table 5.
by Freedman (1983), in which all predictors with p values of .25 or lower were included in a second pass over the data. The results from this method were as follows:

- R2 = .40 and p = .0003.
- 17 coefficients out of 18 were significant at the .25 level.
- 10 coefficients out of 18 were significant at the .05 level.

These results are highly misleading, as they indicate a definite relationship between the response and the predictors, whereas in fact the data are all noise.

The second model selection method used on the full dataset was Efroymson's stepwise method. This indicated a model with 15 predictors, with the following results:

- R2 = .40 and p = .0001.
- All 15 predictors were significant at the .25 level.
- 10 coefficients out of 15 were significant at the .05 level.

Again a model is chosen that misleadingly appears to have a great deal of explanatory power.

The third variable selection method that we used was Occam's window. The only model chosen by this method was the null model.

We repeated the foregoing procedure 10 times with similar results. In five simulations, Occam's window chose only the null model. For the remaining simulations, three models or fewer were chosen along with the null model. All the nonnull models chosen had R2 values less than .15. For all of the simulations, the selection procedure used by Freedman (1983) and the stepwise method chose models with many predictors and highly significant R2 values.

At best, Occam's window correctly indicates that the null model is the only model that should be chosen when there is no signal in the data. At worst, Occam's window chooses the null model along with several other models. The presence of the null model among those chosen by Occam's window should indicate to a researcher the possibility of a lack of signal in the data that he or she is analyzing.

To examine the possibility that our Bayesian approach favors parsimony to the extent that Occam's window finds no signal even when one exists, we did an additional simulation study. We generated 3,000 observations from a standard normal distribution to create a dataset with 100 observations and 30 candidate predictors. We allowed the response Y to depend only on X1, where Y = .5X1 + ε with ε ~ N(0, .75). Thus Y still has unit variance, and the "true" R2 for the model equals .25.

For this simulated data, Occam's window contained one model only: the correct model with X1. In contrast, the screening method used by Freedman produced a model with six predictors, including X1, with four of these significant at the .1 level. Stepwise regression indicated a model with two predictors, including X1, both of them significant at the .025 level. So the two standard variable selection methods indicated evidence for variables that in fact were not at all associated with the dependent variable, whereas Occam's window chose the correct model.

These examples provide evidence that Occam's window overcomes the problem of selecting apparently significant models when there is no signal in the data.
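The two simulation designs described above can be sketched as follows. This is a rough reconstruction for illustration only, not the authors' original code: the dimensions (100 observations, 30 candidate predictors), the p < .25 screening rule, and the Y = .5X1 + ε design follow the text, while the random seed and all names are ours.

```python
import numpy as np
from scipy import stats

def ols_fit(y, X):
    """OLS of y on X; returns (coefficients, R^2, two-sided t-test p-values)."""
    n, k = X.shape
    beta, *_ = np.linalg.lstsq(X, y, rcond=None)
    resid = y - X @ beta
    rss = resid @ resid
    tss = np.sum((y - y.mean()) ** 2)
    se = np.sqrt(np.diag(rss / (n - k) * np.linalg.inv(X.T @ X)))
    pvals = 2 * stats.t.sf(np.abs(beta / se), df=n - k)
    return beta, 1.0 - rss / tss, pvals

rng = np.random.default_rng(42)
n, p = 100, 30

# Design 1: pure noise -- the response is unrelated to all 30 predictors.
X = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y = rng.standard_normal(n)

# Freedman-style two-pass screen: keep predictors with p < .25, then refit.
_, _, pv = ols_fit(y, X)
keep = np.r_[0, np.flatnonzero(pv[1:] < 0.25) + 1]  # always keep the intercept
_, r2_screened, _ = ols_fit(y, X[:, keep])
# The refit typically shows a nontrivial R^2 and several "significant"
# coefficients even though the data are all noise.

# Design 2: one real predictor, Y = .5*X1 + eps with eps ~ N(0, .75),
# so Var(Y) = 1 and the population R^2 is .25.
X2 = np.column_stack([np.ones(n), rng.standard_normal((n, p))])
y2 = 0.5 * X2[:, 1] + rng.normal(0.0, np.sqrt(0.75), size=n)
_, _, pv2 = ols_fit(y2, X2)
# pv2[1] is the p-value for the one predictor actually related to the response.
```

Running this repeatedly reproduces the qualitative pattern reported above: under Design 1 the screened refit looks spuriously significant, while under Design 2 the real predictor X1 is clearly detectable.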
[Table (columns: Method, Model, Predictive coverage (%)). NOTE: Predictive coverage for BMA (all models) is estimated using the 1,014 models with posterior model probabilities greater than .00005; see Table 5.]
and Estimation With Application to a Price Index for Radio Services," Journal of Econometrics, 49, 169-193.
Murphy, A. H., and Winkler, R. L. (1977), "Reliability of Subjective Probability Forecasts of Precipitation and Temperature," Applied Statistics, 26, 41-47.
Neter, J., Wasserman, W., and Kutner, M. (1990), Applied Linear Statistical Models, Homewood, IL: Irwin.
Raftery, A. E. (1988), "Approximate Bayes Factors for Generalized Linear Models," Technical Report 121, University of Washington, Dept. of Statistics.
Raftery, A. E. (1996), "Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models," Biometrika, 83, 251-266.
Raiffa, H., and Schlaifer, R. (1961), Applied Statistical Decision Theory, Cambridge, MA: MIT Press.
Regal, R., and Hook, E. B. (1991), "The Effects of Model Selection on Confidence Intervals for the Size of a Closed Population," Statistics in Medicine, 10, 717-721.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45-54.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Stewart, L. (1987), "Hierarchical Bayesian Analysis Using Monte Carlo Integration: Computing Posterior Distributions When There Are Many Possible Models," The Statistician, 36, 211-219.
Stewart, L., and Davis, W. W. (1986), "Bayesian Posterior Distributions Over Sets of Possible Models With Inferences Computed by Monte Carlo Integration," The Statistician, 35, 175-182.
Stigler, G. J. (1970), "The Optimum Enforcement of Laws," Journal of Political Economy, 78, 526-536.
Taft, D. R., and England, R. W. (1964), Criminology (4th ed.), New York: Macmillan.
Vandaele, W. (1978), "Participation in Illegitimate Activities: Ehrlich Revisited," in Deterrence and Incapacitation, eds. A. Blumstein, J. Cohen, and D. Nagin, Washington, DC: National Academy of Sciences Press, pp. 270-335.
Weisberg, S. (1985), Applied Linear Regression (2nd ed.), New York: Wiley.
Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, Amsterdam: North-Holland, pp. 233-243.