
Journal of the American Statistical Association

ISSN: 0162-1459 (Print) 1537-274X (Online). Journal homepage: https://www.tandfonline.com/loi/uasa20

To cite this article: Adrian E. Raftery, David Madigan & Jennifer A. Hoeting (1997) Bayesian Model Averaging for Linear Regression Models, Journal of the American Statistical Association, 92:437, 179-191, DOI: 10.1080/01621459.1997.10473615

To link to this article: https://doi.org/10.1080/01621459.1997.10473615

Published online: 17 Feb 2012.
Bayesian Model Averaging for Linear Regression Models

Adrian E. RAFTERY, David MADIGAN, and Jennifer A. HOETING

We consider the problem of accounting for model uncertainty in linear regression models. Conditioning on a single selected model
ignores model uncertainty, and thus leads to the underestimation of uncertainty when making inferences about quantities of interest.
A Bayesian solution to this problem involves averaging over all possible models (i.e., combinations of predictors) when making
inferences about quantities of interest. This approach is often not practical. In this article we offer two alternative approaches.
First, we describe an ad hoc procedure, “Occam’s window,” which indicates a small set of models over which a model average
can be computed. Second, we describe a Markov chain Monte Carlo approach that directly approximates the exact solution. In the
presence of model uncertainty, both of these model averaging procedures provide better predictive performance than any single
model that might reasonably have been selected. In the extreme case where there are many candidate predictors but no relationship
between any of them and the response, standard variable selection procedures often choose some subset of variables that yields a
high R2 and a highly significant overall F value. In this situation, Occam's window usually indicates the null model (or a small
number of models including the null model) as the only one (or ones) to be considered, thus largely resolving the problem of
selecting significant models when there is no signal in the data. Software to implement our methods is available from StatLib.
KEY WORDS: Bayes factor; Markov chain Monte Carlo model composition; Model uncertainty; Occam’s window; Posterior
model probability.

1. INTRODUCTION

Selecting subsets of predictor variables is a basic part of building a linear regression model. The objective of variable selection is typically stated as follows: Given a dependent variable Y and a set of k candidate predictors X1, X2, ..., Xk, find the "best" model of the form

  Y = β0 + Σ_{j=1}^p βj Xj + ε,

where X1, X2, ..., Xp is a subset of X1, X2, ..., Xk. Here "best" may have any of several meanings; for example, the model providing the most accurate predictions for new cases exchangeable with those used to fit the model.

A typical approach to data analysis is to carry out a model selection exercise leading to a single "best" model and then to make inferences as if the selected model were the true model. However, this ignores a major component of uncertainty, namely uncertainty about the model itself (Draper 1995; Hodges 1987; Leamer 1978; Moulton 1991; Raftery 1988, 1996). As a consequence, uncertainty about quantities of interest can be underestimated. (For striking examples of this see Draper 1995, Kass and Raftery 1995, Madigan and York 1995, Miller 1984, Raftery 1996, and Regal and Hook 1991.) A complete Bayesian solution to this problem involves averaging over all possible combinations of predictors when making inferences about quantities of interest. Indeed, this approach provides optimal predictive ability (Madigan and Raftery 1994). However, in many applications this averaging will not be a practical proposition. Here we present two alternative approaches.

First, we extend the Bayesian graphical model selection algorithm of Madigan and Raftery (1994) to linear regression models. We refer to this algorithm as "Occam's window." This approach involves averaging over a reduced set of models. Second, we directly approximate the complete solution by applying the Markov chain Monte Carlo model composition (MC3) approach of Madigan and York (1995) to linear regression models. In this approach the posterior distribution of a quantity of interest is approximated by a Markov chain Monte Carlo method that generates a process that moves through model space. We show in an example that both of these model averaging approaches provide better predictive performance than any single model that might reasonably have been selected.

Freedman (1983) pointed out that when there are many predictors and there is no relationship between the predictors and the response, variable selection techniques can lead to a model with a high R2 and a highly significant overall F value. By contrast, when a dataset is generated with no relationship between the predictors and the response, Occam's window typically indicates the null model as the "best" model or as one of a small set of "best" models, thus largely resolving the problem of selecting a significant model for a null relationship.

The background literature for our approach includes several areas of research: the selection of subsets of predictor variables in linear regression models (Breiman 1992, 1995; Breiman and Spector 1992; Draper and Smith 1981; Hocking 1976; Linhart and Zucchini 1986; Miller 1990; Shibata 1981), Bayesian approaches to the selection of subsets of predictor variables in linear regression models (George and McCulloch 1993; Laud and Ibrahim 1995; Mitchell and Beauchamp 1988; Schwarz 1978), and model uncertainty

Adrian E. Raftery is Professor of Statistics and Sociology, and David Madigan is Assistant Professor of Statistics, Department of Statistics, University of Washington, Seattle, WA 98195. Jennifer Hoeting is Assistant Professor of Statistics, Department of Statistics, Colorado State University, Fort Collins, CO 80523. The research of Raftery and Hoeting was partially supported by Office of Naval Research contract N-00014-91-J-1074. Madigan's research was partially supported by National Science Foundation grant DMS 9211627. The authors are grateful to Danika Lew for research assistance and the editor, the associate editor, two anonymous referees, and David Draper for very helpful comments that greatly improved the article.

© 1997 American Statistical Association
Journal of the American Statistical Association
March 1997, Vol. 92, No. 437, Theory and Methods

(Freedman, Navidi, and Peters 1986; Leamer 1978; Madigan and Raftery 1994; Stewart 1987; Stewart and Davis 1986).

In the next section we outline the philosophy underlying our approach. In Section 3 we describe how we selected prior distributions, and we outline the two model averaging approaches in Section 4. In Section 5 we provide an example and describe our assessment of predictive performance. In Section 6 we compare the performance of Occam's window to that of standard variable selection methods when there is no relationship between the predictors and the response. Finally, in Section 7 we discuss related work and suggest future directions.

2. ACCOUNTING FOR MODEL UNCERTAINTY USING BMA

As described previously, basing inferences on a single "best" model as if the single selected model were true ignores model uncertainty, which can result in underestimating uncertainty about quantities of interest. There is a standard Bayesian solution to this problem, proposed by Leamer (1978). If M = {M1, ..., MK} denotes the set of all models being considered and if Δ is the quantity of interest, such as a future observation or the utility of a course of action, then the posterior distribution of Δ given the data D is

  Pr(Δ | D) = Σ_{k=1}^K Pr(Δ | Mk, D) Pr(Mk | D).  (1)

This is an average of the posterior distributions under each model weighted by the corresponding posterior model probabilities. We call this Bayesian model averaging (BMA). In Equation (1) the posterior probability of model Mk is given by

  Pr(Mk | D) = Pr(D | Mk) Pr(Mk) / Σ_{l=1}^K Pr(D | Ml) Pr(Ml),  (2)

where

  Pr(D | Mk) = ∫ Pr(D | θk, Mk) Pr(θk | Mk) dθk  (3)

is the marginal likelihood of model Mk, θk is the vector of parameters of model Mk, Pr(θk | Mk) is the prior density of θk under model Mk, Pr(D | θk, Mk) is the likelihood, and Pr(Mk) is the prior probability that Mk is the true model. All probabilities are implicitly conditional on M, the set of all models being considered. In this article we consider M to be equal to the set of all possible combinations of predictors.

Averaging over all of the models in this fashion provides better predictive ability, as measured by a logarithmic scoring rule, than using any single model Mj:

  -E[ log { Σ_{k=1}^K Pr(Δ | Mk, D) Pr(Mk | D) } ] ≤ -E[ log { Pr(Δ | Mj, D) } ]   (j = 1, ..., K),

where Δ is the observable to be predicted and the expectation is with respect to Σ_{k=1}^K Pr(Δ | Mk, D) Pr(Mk | D). This follows from the nonnegativity of the Kullback-Leibler information divergence.

Implementation of Bayesian model averaging is difficult for two reasons. First, the integrals in (3) can be hard to compute. Second, the number of terms in (1) can be enormous. In this article we present solutions to both of these problems.

3. BAYESIAN FRAMEWORK

3.1 Modeling Framework

Each model that we consider is of the form

  Y = β0 + Σ_{j=1}^p βj Xj + ε = Xβ + ε,  (4)

where the observed data on p predictors are contained in the n × (p + 1) matrix X. The observed data on the dependent variable are contained in the n-vector Y. We assign to ε a normal distribution with mean zero and variance σ2 and assume that the ε's in distinct cases are independent. We consider the (p + 1) individual parameters β and σ2 to be unknown.

Where possible, informative prior distributions for β and σ2 should be elicited and incorporated into the analysis (see Garthwaite and Dickey 1992 and Kadane, Dickey, Winkler, Smith, and Peters 1980). In the absence of expert opinion, we seek to choose prior distributions that reflect uncertainty about the parameters and also embody reasonable a priori constraints. We use prior distributions that are proper but reasonably flat over the range of parameter values that could plausibly arise. These represent the common situation where there is some prior information, but rather little of it, and put us in the "stable estimation" case where results are relatively insensitive to changes in the prior distribution (Edwards, Lindman, and Savage 1963). We use the standard normal-gamma conjugate class of priors,

  β | σ2 ~ N(μ, σ2 V),   νλ/σ2 ~ χ2_ν.

Here ν, λ, the (p + 1) × (p + 1) matrix V, and the (p + 1) vector μ are hyperparameters to be chosen.

The marginal likelihood for Y under a model Mi based on the proper priors described earlier is given by

  P(Y | μi, Vi, Xi, Mi) = { Γ((ν + n)/2) (νλ)^{ν/2} / [ Γ(ν/2) π^{n/2} ] } |I + Xi Vi Xi^T|^{-1/2}
      × [ νλ + (Y - Xi μi)^T (I + Xi Vi Xi^T)^{-1} (Y - Xi μi) ]^{-(ν+n)/2},  (5)

where Xi is the design matrix and Vi is the covariance matrix for β corresponding to model Mi (Raiffa and Schlaifer 1961). The Bayes factor for M0 versus M1, the ratio of Equation (5) for i = 0 and i = 1, is then given by

  B01 = ( |I + X1 V1 X1^T| / |I + X0 V0 X0^T| )^{1/2} (a0 / a1)^{-(ν+n)/2},

where ai = λν + (Y - Xi μi)^T (I + Xi Vi Xi^T)^{-1} (Y - Xi μi), i = 0, 1.

3.2 Selection of Prior Distributions

The Bayesian framework described earlier gives the BMA user the flexibility to modify the prior setup as desired. In this section we describe the prior distribution setup that we adopt in our examples below.

For noncategorical predictor variables, we assume the individual β's to be independent a priori. We center the distribution of β on zero (apart from β0) and choose μ = (β̂0, 0, 0, ..., 0), where β̂0 is the ordinary least squares estimate of β0. The covariance matrix V is equal to σ2 multiplied by a diagonal matrix with entries (s_Y^2, φ^2 s_1^{-2}, φ^2 s_2^{-2}, ..., φ^2 s_p^{-2}), where s_Y^2 denotes the sample variance of Y, s_i^2 denotes the sample variance of Xi for i = 1, ..., p, and φ is a hyperparameter to be chosen. The prior variance of β0 is chosen conservatively and represents an upper bound on the reasonable variance for this parameter. The variances of the remaining β parameters are chosen to reflect increasing precision about each βi as the variance of the corresponding Xi increases and to be invariant to scale changes in both the predictor variables and the response variable.

For a categorical predictor variable Xi with (c + 1) possible outcomes (c ≥ 2), the Bayes factor should be invariant to the selection of the corresponding dummy variables (Xi1, ..., Xic). To this end, we set the prior variance of (βi1, ..., βic) equal to σ2 φ2 [(1/n) Xi^T Xi]^{-1}, where Xi is the n × c design matrix for the dummy variables, where each dummy variable has been centered by subtracting its sample mean. This is related to the g prior of Zellner (1986). The complete prior covariance matrix for β is now given by V(β) = σ2 times the block-diagonal matrix with first diagonal entry s_Y^2, diagonal entries φ^2 s_i^{-2} for the noncategorical predictors, and blocks φ2 [(1/n) Xi^T Xi]^{-1} for the categorical predictors.

To choose the remaining hyperparameters ν, λ, and φ, we define a number of reasonable desiderata and attempt to satisfy them. In what follows we assume that all the variables have been standardized to have mean zero and sample variance 1. We would like the following desiderata to hold:

1. The prior density Pr(β1, ..., βp) is reasonably flat over the unit hypercube [-1, 1]^p.
2. Pr(σ2) is reasonably flat over (a, 1) for some small a.
3. Pr(σ2 ≤ 1) is large.

The order of importance of these desiderata is roughly the order in which they are listed. More formally, we maximize Pr(σ2 ≤ 1) subject to the following:

a. Pr(β1 = 0, ..., βp = 0)/Pr(β1 = 1, ..., βp = 1) ≤ K1. (Following Jeffreys (1961), we choose K1 = √10.)
b. {max_{a<σ2<1} Pr(σ2)} / Pr(σ2 = a) ≤ K2.
c. {max_{a<σ2<1} Pr(σ2)} / Pr(σ2 = 1) ≤ K2.

Because desideratum 2 is less important than desideratum 1, we have chosen K2 = 10. For a = .05, this yields ν = 2.58, λ = .28, and φ = 2.85. For this set of hyperparameters, Pr(σ2 ≤ 1) = .81. We use these settings of the hyperparameters in the examples that follow.

To compare our prior for βi, i = 1, ..., p, for a noncategorical predictor with the actual distribution of coefficients from real data, we collected 13 datasets from several regression textbooks (see App. A). Figure 1 shows a histogram of the 100 coefficients from the standardized data plotted with the prior distribution resulting from the hyperparameters that we use. As desired, the prior density is relatively flat over the range of observed values.

4. TWO APPROACHES TO BAYESIAN MODEL AVERAGING

4.1 Occam's Window

Our first method for accounting for model uncertainty starting from Equation (1) involves applying the Occam's window algorithm of Madigan and Raftery (1994) to linear regression models. Two basic principles underlie this ad hoc approach.
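Both procedures below ultimately evaluate the sum in Equation (1) over a manageable set of models, weighting each model's inference by a normalized posterior model probability as in Equation (2). A minimal sketch of that weighting step (our own illustration; the log marginal likelihoods and model-specific means are made-up numbers, not values computed from Equation (5)):

```python
import math

def posterior_model_probs(log_marglik, log_prior):
    """Pr(M_k | D) proportional to Pr(D | M_k) Pr(M_k), normalized in log space (Eq. 2)."""
    logs = [lg + lp for lg, lp in zip(log_marglik, log_prior)]
    top = max(logs)                                 # subtract the max for stability
    weights = [math.exp(lg - top) for lg in logs]
    total = sum(weights)
    return [w / total for w in weights]

def bma_mean(post_probs, model_means):
    """Posterior-mean version of Eq. (1): E(D) = sum_k E(D | M_k) Pr(M_k | D)."""
    return sum(p * m for p, m in zip(post_probs, model_means))

# Three hypothetical models with equal prior probabilities:
probs = posterior_model_probs([-10.0, -11.0, -14.0], [0.0, 0.0, 0.0])
print(round(bma_mean(probs, [-0.30, -0.19, -0.45]), 3))   # -0.273
```

The log-space normalization avoids underflow when marginal likelihoods are tiny, which is the usual case for real datasets.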

Figure 1. Histogram of 100 Coefficients From Standardized Data, From 13 Textbook Datasets. The solid line is the prior density for βi, i = 1, ..., p.

First, if a model predicts the data far less well than the model that provides the best predictions, then it has effectively been discredited and should no longer be considered. Thus models not belonging to

  A' = { Mk : max_l { Pr(Ml | D) } / Pr(Mk | D) ≤ C }

should be excluded from Equation (1), where C is chosen by the data analyst and max_l{Pr(Ml | D)} denotes the model with the highest posterior model probability. In the examples that follow we use C = 20. The number of models in Occam's window increases as the value of C decreases.

Second, appealing to Occam's razor, we exclude models that receive less support from the data than any of their simpler submodels. More formally, we also exclude from (1) models belonging to

  B = { Mk : there exists Ml ∈ A', Ml ⊂ Mk, Pr(Ml | D) > Pr(Mk | D) }.

Equation (1) is then replaced by

  Pr(Δ | D) = Σ_{Mk ∈ A} Pr(Δ | Mk, D) Pr(Mk | D),

where

  A = A' \ B.  (10)

This greatly reduces the number of models in the sum in Equation (1), and now all that is required is a search strategy to identify the models in A. Two further principles underlie the search strategy. The first principle, Occam's window, concerns interpreting the ratio of posterior model probabilities Pr(M1 | D)/Pr(M0 | D). Here M0 is a model with one less predictor than M1. The essential idea is shown in Figure 2. If there is evidence for M0 then M1 is rejected, but to reject M0 we require strong evidence for the larger model, M1. If the evidence is inconclusive (falling in Occam's window), then neither model is rejected. The second principle is that if M0 is rejected, then so are all of the models nested within it.

Figure 2. Occam's Window: Interpreting the Posterior Odds for Nested Models. (The figure marks three regions of the posterior odds: evidence for M0, inconclusive evidence, and strong evidence for M1.)

These principles fully define the strategy. Typically, in our experience, the number of terms in (1) is reduced to fewer than 25, often to as few as 1 or 2. Madigan and Raftery (1994) provided a detailed description of the algorithm and showed how averaging over the selected models provides better predictive performance than basing inference on a single model in each of the examples that they considered.

4.2 Markov Chain Monte Carlo Model Composition

Our second approach is to approximate (1) using a Markov chain Monte Carlo (MCMC) approach (see, e.g., Smith and Roberts 1993). For our application, we adopt the MCMC model composition (MC3) methodology of Madigan and York (1995), which generates a stochastic process that moves through model space. We can construct a Markov chain {M(t), t = 1, 2, ...} with state space M and equilibrium distribution Pr(Mi | D). If we simulate this Markov chain for t = 1, ..., N, then under certain regularity conditions, for any function g(Mi) defined on M, the average

  Ĝ = (1/N) Σ_{t=1}^N g(M(t))  (11)

converges almost surely to E(g(M)) as N → ∞ (Smith and Roberts 1993). To compute (1) in this fashion, set g(M) = Pr(Δ | M, D).

To construct the Markov chain, we define a neighborhood nbd(M) for each M ∈ M that consists of the model M itself and the set of models with either one variable more or one variable fewer than M. Define a transition matrix q by setting q(M → M') = 0 for all M' ∉ nbd(M) and q(M → M') constant for all M' ∈ nbd(M). If the chain is currently in state M, then we proceed by drawing M' from q(M → M'). It is then accepted with probability

  min{ 1, Pr(M' | D) / Pr(M | D) }.

Otherwise, the chain stays in state M. Madigan and York (1995) described MC3 for discrete graphical models. Software for implementing the MC3 algorithm is described in the Appendix.
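The sampler just described amounts to a Metropolis step on variable-inclusion sets: draw a neighboring model that differs by one variable, then accept with probability min{1, Pr(M'|D)/Pr(M|D)}. A self-contained sketch with a simplified neighborhood (one-variable flips only, no self-transition) and a stand-in toy log posterior; a real implementation would plug in the marginal likelihood of Equation (5):

```python
import math
import random

def mc3(log_post, k, n_iter, seed=1):
    """Run MC3 over models coded as frozensets of predictor indices 1..k.

    log_post(model) must return log Pr(M | D) up to an additive constant.
    Returns visit counts per model; averages as in Equation (11) follow from them.
    """
    rng = random.Random(seed)
    model = frozenset()                        # start at the null model
    counts = {}
    for _ in range(n_iter):
        j = rng.randrange(1, k + 1)
        proposal = model ^ {j}                 # neighbor: flip one variable in or out
        delta = log_post(proposal) - log_post(model)
        # Metropolis acceptance: min{1, Pr(M'|D)/Pr(M|D)}
        if rng.random() < math.exp(min(0.0, delta)):
            model = proposal
        counts[model] = counts.get(model, 0) + 1
    return counts

# Toy target (not from the paper): models containing predictor 1 get more mass.
counts = mc3(lambda m: 2.0 if 1 in m else 0.0, k=3, n_iter=5000)
share_with_1 = sum(c for m, c in counts.items() if 1 in m) / 5000
print(share_with_1 > 0.5)   # True: predictor 1 is included most of the time
```

Because the single-flip proposal is symmetric, the acceptance ratio needs no proposal correction, matching the constant-q construction in the text.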

5. MODEL UNCERTAINTY AND PREDICTION

5.1 Example: Crime and Punishment

5.1.1 Crime and Punishment: Overview. Up to the 1960s, criminal behavior was traditionally viewed as deviant and linked to the offender's presumed exceptional psychological, social, or family circumstances (Taft and England 1964). Becker (1968) and Stigler (1970) argued that on the contrary, the decision to engage in criminal activity is a rational choice determined by its costs and benefits relative to other (legitimate) opportunities.

In an influential article, Ehrlich (1973) developed this argument theoretically, specified it mathematically, and tested it empirically using aggregate data from 47 U.S. states in 1960. Errors in Ehrlich's empirical analysis were corrected by Vandaele (1978), who gave the corrected data, which we use here (see also Cox and Snell 1982). (Ehrlich's study has been much criticized (see, e.g., Brier and Fienberg 1980), and we cite it here for purely illustrative purposes. For economy of expression, we use causal language and speak of "effects," even though the validity of this language for these data is dubious. Because people, not states, commit crimes, these data may reflect aggregation bias.)

Ehrlich's theory goes as follows. The costs of crime are related to the probability of imprisonment and the average time served in prison, which in turn are influenced by police expenditures, which may themselves have an independent deterrent effect. The benefits of crime are related to both the aggregate wealth and income inequality in the surrounding community. The expected net payoff from alternative legitimate activities is related to educational level and the availability of employment, the latter being measured by the unemployment and labor force participation rates. The payoff from legitimate activities was expected to be lower (in 1960) for nonwhites and for young males than for others, so that states with high proportions of these were expected also to have higher crime rates. Vandaele (1978) also included an indicator variable for southern states, the sex ratio, and the state population as control variables, but the theoretical rationale for inclusion of these predictors is unclear.

We thus have 15 candidate predictors of crime rate (Table 4), and so potentially 2^15 = 32,768 different models. As in the original analyses, all data were transformed logarithmically. Standard diagnostic checking (see, e.g., Draper and Smith 1981) did not reveal any gross violations of the assumptions underlying normal linear regression.

Ehrlich's analysis concentrated on the relationship between crime rate and predictors 14 and 15 (probability of imprisonment and average time served in state prisons). In his original analysis, Ehrlich (1973) focused on two regression models, consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), which were chosen in advance based on theoretical grounds.

To compare Ehrlich's results with models that might be selected using standard techniques, we chose three popular variable selection techniques: Efroymson's stepwise method (Miller 1990), minimum Mallows' Cp, and maximum adjusted R2 (Weisberg 1985). Efroymson's stepwise method is like forward selection except that when a new variable is added to the subset, partial correlations are considered to see whether any of the variables currently in the subset should be dropped. Similar hybrid methods are found in most standard statistical computer packages. Problems with stepwise regression, Mallows' Cp, and adjusted R2 are well known (see, e.g., Weisberg 1985).

Table 1 displays the results from the full model with all 15 predictors, three models selected using standard variable selection techniques, and the two models chosen by Ehrlich on theoretical grounds. The three models chosen using variable selection techniques (models 2, 3, 4) share many of the same variables and have high values of R2. Ehrlich's theoretically chosen models fit the data less well. There are striking differences, indeed conflicts, between the results from the different models. Even the models chosen using statistical techniques lead to conflicting conclusions about the main questions of interest, despite the models' superficial similarity.

Consider first the predictor for probability of imprisonment, X14. This is a significant predictor in all six models, so interest focuses on estimating the size of its effect. To aid interpretation, recall that all variables have been transformed logarithmically, so that when all other predictors are held fixed, β14 = -.30 means roughly that a 10% increase in the probability of imprisonment produces a 3% reduction in the crime rate. The estimates of β14 fluctuate wildly between models. The stepwise regression model gives an estimate about one-third lower in absolute value than the full model, enough to be of policy importance; this difference is equal to about 1.7 standard errors. The Ehrlich models give estimates that are about one-half higher than the full model, and more than twice as big as those from stepwise regression (in absolute value). There is clearly considerable model uncertainty about this parameter.

Table 1. Models Selected for Crime Data

  #  Method               Variables                    R2 (%)  No. of variables  β14    β15    p15
  1  Full model           All                          87      15                -.30   -.27   .133
  2  Stepwise regression  1 3 4 9 11 13 14             83       7                -.19
  3  Mallows' Cp          1 3 4 9 11 12 13 14 15       85       9                -.30   -.30   .050
  4  Adjusted R2          1 3 4 7 8 9 11 12 13 14 15   86      11                -.30   -.25   .129
  5  Ehrlich model 1      9 12 13 14 15                66       5                -.45   -.55   .009
  6  Ehrlich model 2      1 6 9 10 12 13 14 15         70       8                -.43   -.53   .011

NOTE: p15 is the p value from a two-sided t test for testing β15 = 0. For the stepwise procedure, F = 3.84 was used for the F-to-enter and F-to-delete value. This corresponds approximately to the 5% level.
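The selection criteria in Table 1 are simple to sketch. For instance, maximum adjusted R2 (model 4) scores every candidate subset, which is feasible by exhaustive search for moderate k (2^15 = 32,768 subsets here). The code below is our own illustration on simulated data, not the procedure or data used in the paper:

```python
import numpy as np
from itertools import combinations

def adjusted_r2(X, Y, subset):
    """Fit OLS with an intercept on the given predictor subset; return adjusted R^2."""
    n = len(Y)
    A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
    beta, *_ = np.linalg.lstsq(A, Y, rcond=None)
    rss = float(((Y - A @ beta) ** 2).sum())
    tss = float(((Y - Y.mean()) ** 2).sum())
    # 1 - (1 - R^2)(n - 1)/(n - p - 1), with p = number of predictors in the subset
    return 1 - (rss / tss) * (n - 1) / (n - len(subset) - 1)

def best_by_adjusted_r2(X, Y):
    """Exhaustively search all non-empty subsets for the maximum adjusted R^2."""
    k = X.shape[1]
    subsets = (s for r in range(1, k + 1) for s in combinations(range(k), r))
    return max(subsets, key=lambda s: adjusted_r2(X, Y, s))

rng = np.random.default_rng(42)
X = rng.normal(size=(47, 4))
Y = 2.0 * X[:, 0] + rng.normal(size=47)   # only the first predictor matters
print(best_by_adjusted_r2(X, Y))          # the chosen subset includes predictor 0
```

Note that the winning subset may also pick up pure-noise predictors, which is exactly the overfitting behavior the text criticizes.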

Table 2. Crime Data: Occam's Window Posterior Model Probabilities

  Model                 Posterior model probability (%)
  1 3 4 9 11 13 14      12.6
  1 3 4 11 13 14         9.0
  1 3 4 9 13 14          8.4
  1 3 5 9 11 13 14       8.0
  3 4 8 9 13 14          7.6
  1 3 4 13 14            6.3
  1 3 4 11 13            5.8
  1 3 5 11 13 14         5.7
  1 3 4 13               4.9
  1 3 5 9 13 14          4.8
  3 5 8 9 13 14          4.4
  3 4 9 13 14            4.1
  3 5 9 13 14            3.6
  1 3 5 13 14            3.5
  2 3 4 13 14            2.0
  1 3 5 11 13            1.9
  3 4 13 14              1.6
  3 5 13 14              1.6
  3 4 13                 1.4
  1 3 5 13               1.4
  3 5 13                  .7
  1 4 12 13               .7

Now consider β15, the effect of the average time served in state prisons. Whether this is significant at all is not clear, and t tests based on different models lead to conflicting conclusions. In the full model, β15 has a nonsignificant p value of .133, while stepwise regression leads to a model that does not include this variable. On the other hand, Mallows' Cp leads to a model in which the p value for β15 is significant at the .05 level, whereas with adjusted R2 it is again not significant. In contrast, in Ehrlich's models it is highly significant.

Together these results paint a confused picture about β14 and β15. Later we argue that the confusion can be resolved by taking explicit account of model uncertainty.

5.1.2 Crime and Punishment: Model Averaging. For the model averaging strategies, we assumed that all possible combinations of predictors were equally likely a priori. To implement Occam's window, we started from the null model and used the "up" algorithm only (see Madigan and Raftery 1994). The selected models and their posterior model probabilities are shown in Table 2. The models with posterior model probabilities of 1.2% or larger as indicated by MC3 are shown in Table 3. In total, 1,772 different models were visited during 30,000 iterations of MC3.

Table 3. Crime Data: MC3 Models With Posterior Model Probabilities of 1.2% or Larger

  Model                 Posterior model probability (%)
  1 3 4 9 11 13 14      2.6
  1 3 4 11 13 14        1.8
  1 3 4 9 13 14         1.7
  1 3 4 5 9 13 14       1.6
  1 3 4 9 11 13 14 15   1.6
  1 3 4 9 13 14 15      1.6
  3 4 8 9 13 14         1.5
  1 3 4 13 14           1.3
  1 3 4 11 13           1.2
  1 3 5 11 13 14        1.2

Occam's window chose 22 models in this example, clearly indicating model uncertainty. Choosing any one model and making inferences as if it were the "true" model ignores model uncertainty. In the next section we further explore the consequences of basing inferences on a single model.

The top models indicated by the two methods (Tables 2 and 3) are quite similar. The posterior probabilities are normalized over all selected models for Occam's window and over all possible combinations of the 15 predictors for MC3. So the posterior probabilities for the same models differ across the model averaging method, but this has little effect on the relationship between the models as measured by the Bayes factor.

Table 4 shows the posterior probability that the coefficient for each predictor does not equal 0, that is, Pr(βi ≠ 0 | D), obtained by summing the posterior model probabilities across models for each predictor. The results from Occam's window and MC3 are fairly close for most of the predictors. Predictors with high Pr(βi ≠ 0 | D) include proportion of young males, mean years of schooling, police expenditure, income inequality, and probability of imprisonment.

Comparing the two models analyzed by Ehrlich (1973), consisting of the predictors (9, 12, 13, 14, 15) and (1, 6, 9, 10, 12, 13, 14, 15), with the results in Table 4, we see that several predictors included in Ehrlich's analysis receive little support from the data. The estimated Pr(βi ≠ 0 | D) is quite small for predictors 6, 10, 12, and 15. Two predictors (3 and 4) have empirical support but were not included by Ehrlich. Indeed, Ehrlich's two selected models have very low posterior probabilities.

Ehrlich's work attracted attention primarily because of his conclusion that both the probability of imprisonment (predictor 14) and the average prison term (predictor 15) influenced the crime rate. The posterior distributions for the coefficients of these predictors, based on the model averaging results of MC3, are shown in Figures 3 and 4. The MC3 posterior distribution for β14 is indeed centered away from 0, with a small spike at 0 corresponding to P(β14 = 0 | D). The posterior distribution for β14 based on Occam's window is quite similar. The spike at 0 is an artifact of our approach, in which it is possible to consider models with a predictor fully removed from the model. This is in contrast to the practice of setting the predictor close to 0 with high probability (as in George and McCulloch 1993). In contrast to Figure 3, the MC3 posterior distribution for the coefficient corresponding to average prison term is centered close to 0 and has a large spike at 0 (Fig. 4). Occam's window indicates a spike at 0 only, or no support for inclusion of this predictor. By averaging over all models, our results indicate support for a relationship between crime rate and predictor 14, but not predictor 15. Our model averaging results are consistent with those of Ehrlich for the probability of imprisonment, but not for the average prison term.
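The quantity Pr(βi ≠ 0 | D) is just the total posterior probability of the models containing predictor i, and the Occam's window cutoff with C = 20 is a one-line filter on the model list. A sketch using the first three rows of Table 2 (our own code; the probabilities are those printed in the table):

```python
# (predictor set, posterior model probability in %): first rows of Table 2
models = [
    ({1, 3, 4, 9, 11, 13, 14}, 12.6),
    ({1, 3, 4, 11, 13, 14}, 9.0),
    ({1, 3, 4, 9, 13, 14}, 8.4),
]

def inclusion_prob(models, i):
    """Pr(beta_i != 0 | D): sum posterior model probabilities over models with i."""
    return sum(p for preds, p in models if i in preds)

def occams_window(models, C=20):
    """Keep models whose posterior probability is within a factor C of the best."""
    best = max(p for _, p in models)
    return [(preds, p) for preds, p in models if best / p <= C]

print(round(inclusion_prob(models, 9), 1))   # 12.6 + 8.4 = 21.0
print(len(occams_window(models)))            # 3: all are within a factor of 20
```

On the full 22-model list of Table 2, the same sum reproduces the percentages reported in Table 4.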

Table 4. Crime Data: Pr(pi # 01D), Expressed as a Percentage


Predictor Occam‘s Ehrlich’s
number Predictor window M e models
1 Percentage of males age 14-24 73 79 *
2 Indicator variable for southern state 2 17
3 Mean years of schooling 99 98
4 Police expenditure in 1960 64 72
5 Police expenditure in 1959 36 50
6 Labor force participation rate 0 6 *
7 Number of males per 1,000females 0 7
8 State population 12 23
9 Number of nonwhites per 1,000people 53 62 *
10 Unemployment rate of urban males age 14-24 0 11 *
11 Unemployment rate of urban males, age 35-39 43 45
12 Wealth 1 30 * *
13 Income inequality 100 100 * *
14 Probability of imprisonment 83 83 * *
15 Average time served in state prisons 0 22 *
NOTE: The last wlurnn indicates the predictors included in the two models considered by Ehrlich.
*
* Corresponds to Ehrlich model 1 and corresponds to Ehrlich model 2.

Among the variables that measure the expected benefits from crime, Ehrlich concluded that both wealth and income inequality had an effect; we found this to be true for income inequality but not for wealth. For the predictors that represent the payoff from legitimate activities, Ehrlich found the effects of variables 1, 6, 10, and 11 to be unclear; he did not include mean schooling in his model. We found strong evidence for the effect of some of these variables, notably the percent of young males and mean schooling, but the effects of unemployment and labor force participation are either unproven or unlikely. Finally, the "control" variables that have no theoretical basis (2, 7, 8) turned out, satisfyingly, to have no empirical support either.

The model averaging results for the predictors for police expenditures lead to an interesting interpretation. Police expenditure was measured in two successive years, and the measures are highly correlated (r = .993). The data show clearly that the 1960 crime rate is associated with police expenditures, and that only one of the two measures (X4 and X5) is needed, but they do not say for sure which measure should be used. Each model in Occam's window contains one predictor or the other, but not both. For both Occam's window and MC3, Pr[(β4 ≠ 0) ∪ (β5 ≠ 0) | D] = 1, so the data provide very strong evidence for an association with police expenditures.

In summary, we found strong support for some of Ehrlich's conclusions but not for others. In particular, by averaging over all models, our results indicate support for a relationship between crime rate and probability of imprisonment, but not for average time served in state prisons.
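The union probability Pr[(β4 ≠ 0) ∪ (β5 ≠ 0) | D] quoted above is the total posterior mass on models containing at least one of the two expenditure measures. A sketch with made-up model weights (ours, not the crime-data posterior):

```python
# Pr[(beta_a != 0) or (beta_b != 0) | D]: posterior mass on models that
# include predictor a or predictor b (or both). Illustrative weights only.

def union_inclusion(models, post_probs, idx_a, idx_b):
    total = sum(post_probs)
    mass = sum(w for m, w in zip(models, post_probs)
               if idx_a in m or idx_b in m)
    return mass / total

# Every model holds exactly one of the two collinear measures, as in
# Occam's window for the crime data, so the union probability is 1.
models = [{3, 4, 13}, {3, 5, 13}, {1, 4, 14}]
weights = [0.6, 0.3, 0.1]
print(union_inclusion(models, weights, 4, 5))  # -> 1.0
```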
Figure 3. Posterior Distribution for β14, the Coefficient for the Predictor "Probability of Imprisonment," Based on the MC3 Model Average. The spike corresponds to Pr(β14 = 0 | D). The vertical axis on the left corresponds to the posterior density of β14, and the vertical axis on the right corresponds to the posterior probability that β14 equals zero. The density is scaled so that the maximum of the density is equal to Pr(β14 ≠ 0 | D) on the right axis.

5.1.3 Crime and Punishment: Assessment of Predictive Performance. We use the predictive ability of the selected models for future observations to measure the effectiveness of a model selection strategy. Our specific objective is to compare the quality of the predictions based on model averaging with the quality of predictions based on any single model that an analyst might reasonably have selected.

To measure performance, we randomly split the complete dataset into two subsets. Other percentage splits can be adopted. A 50-50 split was chosen here, so that each portion would contain enough data to be a representative sample. We ran Occam's window and MC3 using half of the data. This set is called the training set, D_T. We evaluated the predictions on the other half of the data, the performance set, using numerical and graphical measures of performance.

Predictive coverage was measured using the proportion of observations in the performance set that fall in the corresponding 90% prediction interval. For both Occam's window and MC3, 80% of the observations in the performance set fell in the 90% prediction intervals over the averaged
186 Journal of the American Statistical Association, March 1997

models (Table 5). David Draper (personal communication) suggested that BMA falls somewhat short of nominal coverage here because aspects of model uncertainty other than model selection have not been assessed. In Hoeting, Raftery, and Madigan (1995, 1996), we extended BMA to account for uncertainty in the selection of transformations and in the identification of outliers.

For comparison with other standard variable selection techniques, we used the three popular variable selection procedures discussed earlier to select two or three "best" models. The models that we chose using these methods are given in Table 5. All of the individual models chosen using standard techniques performed considerably worse than the model averaging approaches, with prediction coverage ranging from 58% to 67%. Thus the model averaging strategies improved predictive coverage substantially as compared to any single model that might reasonably have been chosen.

A sensitivity analysis for priors chosen within the framework described in Section 3.2 indicates that the results for Occam's window and MC3 are not highly sensitive to the choice of prior. The results for Occam's window and MC3 using three different sets of priors were quite similar.

In an attempt to provide a graphical measure of predictive performance, we used a "calibration plot" to determine whether the predictions were well calibrated. A model is well calibrated if, for example, 70% of the observations in the test dataset are less than or equal to the 70th percentile of the posterior predictive distribution. The calibration plot shows the degree of calibration for different models, with the posterior predictive probability on the x-axis and the percentage of observed data less than or equal to the posterior predictive probability on the y-axis. In a calibration plot, perfect calibration is the 45-degree line; the closer a model's calibration line to the 45-degree line, the better calibrated the model. The calibration plot is similar to reliability diagrams used to assess probability forecasts (see, e.g., Murphy and Winkler 1977). The calibration plot for the model chosen by stepwise selection and for model averaging using Occam's window is shown in Figure 5. The shaded area in Figure 5 shows where the model averaging strategy produces predictions that are better calibrated than predictions from the model chosen by the stepwise model selection procedure. The calibration plot for MC3 is similar.

These performance measures support our claim that conditioning on a single selected model ignores model uncertainty, which in turn leads to the underestimation of uncertainty when making inferences about quantities of interest. Model averaging leads to better-calibrated predictive distributions.

Figure 4. Posterior Distribution for β15, the Coefficient for the Predictor "Average Time Served in State Prisons," Based on the Model Average Over a Large Set of Models From MC3. See Figure 3.

5.2 Simulated Examples: Predictive Performance

In the foregoing example, the true answer is unknown. To further demonstrate the usefulness of BMA, we use several simulated examples. In our examples, we follow the format of George and McCulloch (1993).

Example 5.2.1. In this example we investigate the impact of model averaging on predictive performance when there is little model uncertainty. For the training set, we simulated p = 15 predictors and n = 50 observations as independent standard normal vectors. We generated the re-

Table 5. Crime Data: Performance Comparison

                                                              Predictive
Method                      Model                             coverage (%)
MC3                         Model averaging                        80
Occam's window              Model averaging                        80
Stepwise (5%)               3 4 9 13                               67
Adjusted R2 (2)             1 2 3 4 5 8 11 12 13 15                67
Adjusted R2 (3)             1 2 3 4 5 6 8 11 12 13 15              67
Stepwise (15%)              3 4 8 9 13 15                          63
Cp (2)                      1 2 3 4 11 13                          63
Adjusted R2 (1)             1 2 3 4 5 11 12 13 15                  58
Cp (1)                      1 2 3 4 11 13 15                       58
Cp (3)                      1 2 3 4 11 12 13 15                    58

NOTE: Predictive coverage is the percentage of observations in the performance set that fall in the 90% prediction interval. Method numbers correspond to the ith model chosen using the given model selection method. For example, Cp (1) is the first model chosen using the Cp method. The percentage values shown for the stepwise procedures correspond to the significance levels for the F-to-enter and F-to-delete values. For example, F = 3.84 corresponds approximately to the 5% level.
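For concreteness, the two performance measures used above, predictive coverage (Table 5) and the calibration curve (Figure 5), can be sketched as follows. This is a generic illustration, not the authors' S-PLUS BMA software; the normal posterior predictive distributions are an assumed form chosen for the sketch.

```python
# Two performance measures on a held-out set: (1) predictive coverage, the
# share of observations inside their 90% prediction intervals, and (2) a
# calibration curve, the share of observations at or below each posterior
# predictive percentile. Normal predictive distributions are assumed here
# purely for illustration.
import math

def predictive_coverage(y_obs, lower, upper):
    hits = sum(lo <= y <= hi for y, lo, hi in zip(y_obs, lower, upper))
    return hits / len(y_obs)

def normal_cdf(x, mean, sd):
    return 0.5 * (1.0 + math.erf((x - mean) / (sd * math.sqrt(2.0))))

def calibration_curve(y_obs, pred_means, pred_sds, levels):
    # probability integral transform of each observation, then the
    # empirical fraction at or below each nominal level
    pit = [normal_cdf(y, m, s) for y, m, s in zip(y_obs, pred_means, pred_sds)]
    return [sum(p <= q for p in pit) / len(pit) for q in levels]

y = [1.2, 0.4, 2.5, -0.3, 1.0]
lo = [0.0, 0.0, 0.0, -1.0, 1.5]
hi = [2.0, 1.0, 2.0, 0.0, 2.0]
print(predictive_coverage(y, lo, hi))          # 3 of 5 intervals cover: 0.6
print(calibration_curve(y, [0.5] * 5, [1.0] * 5, [0.25, 0.5, 0.75]))
```

Plotting the nominal levels against the output of `calibration_curve` and comparing with the 45-degree line gives a plot of the kind shown in Figure 5.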

sponse using the model y = x4 + x5 + ε, where ε ~ N50(0, σ²I) with σ = 2.5. Least squares estimates for these data are given in Table 6. There is little model uncertainty in this example; only the p values for β4 and β5 were smaller than .1. We generated 50 additional observations in the same manner to create the prediction set.

In this example the true model, the model averaging techniques, and models selected using standard techniques all have poor predictive coverage (Table 7). It is slightly encouraging that BMA performs better than the true model, but the improvement is too small to be significant. This and other similar examples that we simulated show that when there is very little model uncertainty, predictive performance is not significantly improved by model averaging.

Example 5.2.2. This example demonstrates the performance of BMA when a subset of the predictors is correlated. For the training set, we simulated p = 15 predictors and n = 50 observations. We obtained predictors 1-10 as independent standard normal vectors, X1, ..., X10 ~ iid N(0, 1), and generated predictors 11-15 using the framework

[X11, ..., X15] = [X1, ..., X5]([.3, .5, .7, .9, 1.1]^T [1 1 1 1 1]) + ε,

where ε ~ N(0, 1). We generated the response using the model

y = x1 + x2 + x3 + x4 + x5 + ε,    (13)

where ε ~ N50(0, σ²I) with σ = 2.5. Least squares estimates for these data are given in Table 8. The correlation structure resulted in moderate pairwise correlation between predictors 1-5 and 11-15 (corr(X1, X11) = .39, corr(X2, X12) = .41, corr(X3, X13) = .56, corr(X4, X14) = .71, corr(X5, X15) = .69) and small pairwise correlations elsewhere (median correlation equal to -.02). We generated 50 additional observations in the same manner to create the prediction set.

Table 9 shows that in this example, model averaging has better predictive performance than any single model that might have been selected. In this example, the poor performance of the true model and the other single models selected using standard techniques demonstrates that model uncertainty can strongly influence predictive performance.

Figure 5. Crime Data: Calibration Plot. The solid line denotes model averaging (Occam's window); the dashed line, predictors 3, 4, 8, 9, 13, 15 (stepwise).

6. SUCCESSFUL IDENTIFICATION OF THE NULL MODEL

Linear regression models are frequently used even when little is known about the relationship between the predictors and the response. When there is a weak relationship between the predictors and the response, the overall F statistic will be small and thus the null hypothesis that the null model is true fails to be rejected. However, many data analysts perform model selection regardless of the F statistic value for the overall model. Problems can then occur, as subsequent model selection techniques often choose a model that includes a subset of the predictors. Freedman (1983) has shown that in the extreme case where there is no relationship between the predictors and the response variable, omitting the predictors with the smallest t values (e.g., p > .25) can result in a model with a highly significant F statistic and high R². In contrast, if the response and predictors are independent, Occam's window typically indicates the null model only, or the null model as one of a small number of "best" models.

Following Freedman (1983), we generated 5,100 independent observations from a standard normal distribution to create a matrix with 100 rows and 51 columns. The first column was taken to be the dependent variable in a regression equation, and the other 50 columns were taken to be the predictors. Thus the predictors are independent of the response by construction. For the entire dataset, the multiple regression results were as follows:

R² = .55 and p = .29.
18 coefficients out of 50 were significant at the .25 level.
4 coefficients out of 50 were significant at the .05 level.

We used three different variable selection procedures on the simulated data. The first of these was the method used
Table 6. Least Squares Estimates for Example 5.2.1 (σ̂ = 2.9)

        β0   β1   β2   β3   β4    β5    β6   β7   β8    β9   β10   β11  β12   β13   β14  β15
β       0    0    0    0    1.00  1.00  0    0    0     0    0     0    0     0     0    0
β̂      .42  .21  .40  .07  .95   1.72  .20  .34  -.32  .24  -.15  .60  -.45  -.08  .20  .18
SE(β̂)  .46  .55  .56  .36  .52   .47   .39  .58  .49   .45  .44   .55  .48   .52   .45  .47
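The data-generating process of Example 5.2.1 (Table 6) can be reproduced in a few lines; the random seed and library choices here are ours:

```python
# Simulate Example 5.2.1: p = 15 independent N(0,1) predictors, n = 50,
# response y = x4 + x5 + eps with eps ~ N(0, 2.5^2). Seed is arbitrary.
import random

random.seed(1)
n, p, sigma = 50, 15, 2.5
X = [[random.gauss(0.0, 1.0) for _ in range(p)] for _ in range(n)]
# Predictors are 1-indexed in the paper; 0-based columns 3 and 4 here
# correspond to the paper's X4 and X5.
y = [row[3] + row[4] + random.gauss(0.0, sigma) for row in X]
print(len(y), len(X[0]))  # 50 observations, 15 predictors
```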

Table 7. Performance Comparison for Example 5.2.1: Predictive Coverage for a 90% Prediction Interval

                                                      Predictive
Method                          Model                 coverage (%)
BMA (estimated coverage)        Model averaging            72
Occam's window                  Model averaging            70
Adjusted R2 (3)                 2 4 5 8 11                 70
Cp (3)                          4 5 11                     70
True model and stepwise (5%)    4 5                        68
Stepwise (15%) and Cp (2)       2 4 5                      68
Cp (1)                          4 5                        68
Adjusted R2 (2)                 2 4 5 10 11                68
Adjusted R2 (1)                 2 4 5 11                   66

NOTE: Predictive coverage for BMA (all models) is estimated using the 371 models with posterior model probabilities greater than .0001; see Table 5.

by Freedman (1983), in which all predictors with p values of .25 or lower were included in a second pass over the data. The results from this method were as follows:

R² = .40 and p = .0003.
17 coefficients out of 18 were significant at the .25 level.
10 coefficients out of 18 were significant at the .05 level.

These results are highly misleading, as they indicate a definite relationship between the response and the predictors, whereas in fact the data are all noise.

The second model selection method used on the full dataset was Efroymson's stepwise method. This indicated a model with 15 predictors, with the following results:

R² = .40 and p = .0001.
All 15 predictors were significant at the .25 level.
10 coefficients out of 15 were significant at the .05 level.

Again a model is chosen that misleadingly appears to have a great deal of explanatory power.

The third variable selection method that we used was Occam's window. The only model chosen by this method was the null model.

We repeated the foregoing procedure 10 times with similar results. In five simulations, Occam's window chose only the null model. For the remaining simulations, three models or fewer were chosen along with the null model. All the nonnull models chosen had R² values less than .15. For all of the simulations, the selection procedure used by Freedman (1983) and the stepwise method chose models with many predictors and highly significant R² values.

At best, Occam's window correctly indicates that the null model is the only model that should be chosen when there is no signal in the data. At worst, Occam's window chooses the null model along with several other models. The presence of the null model among those chosen by Occam's window should indicate to a researcher the possibility of evidence for a lack of signal in the data that he or she is analyzing.

To examine the possibility that our Bayesian approach favors parsimony to the extent that Occam's window finds no signal even when one exists, we did an additional simulation study. We generated 3,000 observations from a standard normal distribution to create a dataset with 100 observations and 30 candidate predictors. We allowed the response Y to depend only on X1, where Y = .5X1 + ε with ε ~ N(0, .75). Thus Y still has unit variance, and the "true" R² for the model equals .20.

For this simulated data, Occam's window contained one model only: the correct model with X1. In contrast, the screening method used by Freedman produced a model with six predictors, including X1, with four of these significant at the .1 level. Stepwise regression indicated a model with two predictors, including X1, both of them significant at the .025 level. So the two standard variable selection methods indicated evidence for variables that in fact were not at all associated with the dependent variable, whereas Occam's window chose the correct model.

These examples provide evidence that Occam's window overcomes the problem of selection of the null model when there is no signal in the data.
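Freedman's (1983) screening artifact is easy to reproduce: regress pure noise on 50 pure-noise predictors, keep the predictors with roughly p < .25, and refit. The sketch below is our implementation, not Freedman's code; the seed and the threshold |t| > 1.15 (approximately two-sided p < .25 at these degrees of freedom) are our choices.

```python
# Reproduce Freedman's (1983) screening artifact: with y and 50 predictors
# that are all independent noise, keeping the predictors with p < .25 and
# refitting tends to yield a model that looks far more significant than
# the data warrant. Seed and thresholds are arbitrary choices of ours.
import numpy as np

rng = np.random.default_rng(0)
n, p = 100, 50
X = rng.standard_normal((n, p))
y = rng.standard_normal(n)

def t_stats(X, y):
    """OLS t statistics for the slope coefficients (intercept added here)."""
    Xd = np.column_stack([np.ones(len(y)), X])
    beta, *_ = np.linalg.lstsq(Xd, y, rcond=None)
    resid = y - Xd @ beta
    df = len(y) - Xd.shape[1]
    s2 = resid @ resid / df                      # residual variance estimate
    cov = s2 * np.linalg.inv(Xd.T @ Xd)
    return beta[1:] / np.sqrt(np.diag(cov)[1:])  # skip the intercept

keep = np.abs(t_stats(X, y)) > 1.15              # roughly two-sided p < .25
t2 = t_stats(X[:, keep], y)                      # second pass on survivors
print(keep.sum(), "screened-in predictors;",
      (np.abs(t2) > 1.96).sum(), "look significant at ~5% after refitting")
```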

Table 8. Least Squares Estimates for Example 5.2.2 (σ̂ = 2.21)

        β0   β1    β2    β3    β4    β5   β6    β7   β8    β9   β10  β11  β12   β13   β14  β15
β       0    1.00  1.00  1.00  1.00  1.00 0     0    0     0    0    0    0     0     0    0
β̂      .12  .80   1.07  1.03  -.18  .55  -.67  .28  -.11  .31  .29  .11  -.09  -.39  .73  -.96
SE(β̂)  .60  .38   .49   .41   .45   .53  .58   .37  .41   .49  .33  .34  .40   .32   .35  .37
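Example 5.2.2's correlated design can be simulated directly. The exact construction of X11-X15 is partly garbled in this reprint, so the sketch below uses X_{10+i} = c_i·X_i + N(0, 1) noise with c = (.3, .5, .7, .9, 1.1); this is our reading, and it reproduces the stated correlation pattern approximately.

```python
# Simulate a design like Example 5.2.2: X1..X10 iid N(0,1), with X11..X15
# built from X1..X5 so that corr(Xi, X10+i) is moderate. The elementwise
# construction X_{10+i} = c_i * X_i + noise is our reading of the garbled
# formula; seed and library choices are ours.
import random

random.seed(2)
n, sigma = 50, 2.5
coefs = [0.3, 0.5, 0.7, 0.9, 1.1]
X = [[random.gauss(0.0, 1.0) for _ in range(10)] for _ in range(n)]
for row in X:
    row.extend(c * row[i] + random.gauss(0.0, 1.0)
               for i, c in enumerate(coefs))
# the response depends on the first five predictors only (equation (13))
y = [sum(row[:5]) + random.gauss(0.0, sigma) for row in X]
print(len(X[0]), len(y))  # 15 predictors, 50 observations
```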

Table 9. Performance Comparison for Example 5.2.2: Predictive Coverage for a 90% Prediction Interval

                                                          Predictive
Method                          Model                     coverage (%)
MC3                             Model averaging                92
Occam's window                  Model averaging                86
Stepwise (5% and 15%)           1 2 3 4                        80
True model                      1 2 3 4 5                      78
Cp (2) and adjusted R2 (1)      1 2 3 6 13 14 15               72
Cp (3)                          1 2 3 6 10 14 15               72
Adjusted R2 (3)                 1 2 3 6 7 13 14 15             72
Cp (1)                          1 2 3 5 14 15                  70
Adjusted R2 (2)                 1 2 3 6 10 13 14 15            70

NOTE: Predictive coverage for BMA (all models) is estimated using the 1,014 models with posterior model probabilities greater than .00005; see Table 5.

7. DISCUSSION

7.1 Related Work

Draper (1995) has also addressed the problem of assessing model uncertainty. Draper's approach is based on the idea of model expansion; that is, starting with a single reasonable model chosen by a data-analytic search, expanding model space to include those models suggested by context or other considerations, and then averaging over this model class. Draper did not directly address the problem of model uncertainty in variable selection. However, one could consider Occam's window to be a practical implementation of model expansion.

George and McCulloch (1993) have developed the stochastic search variable selection (SSVS) method, which is similar in spirit to MC3. They defined a Markov chain that moves through model space and parameter space at the same time. Their method never actually removes a predictor from the full model, but only sets it close to zero with high probability. Our approach avoids this by integrating analytically over parameter space.

We have focused here on Bayesian solutions to the model uncertainty problem. Very little has been written about frequentist solutions to the problem. Perhaps the most obvious frequentist solution is to bootstrap the entire data analysis, including model selection. However, Freedman et al. (1986) have shown that this does not necessarily give a satisfactory solution to the problem.

7.2 Conclusions

The prior distribution of the covariance matrix for β described in Section 3.2 depends on the actual data, including both the dependent and the independent variables. A similar data-dependent approach to the assessment of the priors was used by Raftery (1996). Although at first this may appear to be contrary to the idea of a prior, our objective was to develop priors that lead to posteriors similar to those of a person with little prior information. Examples analyzed to date suggest that we achieved this objective. The priors for β lead to a reasonable prior variance and result in conclusions that are not highly sensitive to the choice of hyperparameters. Thus the data dependence does not appear to be a drawback.

In a strict sense, our data-dependent priors do not correspond to a Bayesian subjective prior. Our priors might be considered to be an approximation to a true Bayesian subjective prior and might be appropriate when little prior information is available. We have followed other authors, including George and McCulloch (1993), Laud and Ibrahim (1995), and Zellner (1986), in referring to our approach as Bayesian.

The choice of which procedure to use, Occam's window or MC3, will depend on the particular application. Occam's window will be most useful when one is interested in making inferences about the relationships between the variables. Occam's window also tends to be much faster computationally. MC3 is the better procedure to choose if the goal is good predictions or if the posterior distribution of some quantity is of more interest than the nature of the "true" model and if computer time is not a critical consideration. However, each approach is flexible enough to be used successfully for both inference and prediction.

We have described two procedures that can be used to account for model uncertainty in variable selection for linear regression models. In addition to variable selection, uncertainty is also involved in the identification of outliers and in the choice of transformations in regression. To broaden the flexibility of our current procedures, and to improve our ability to account for model uncertainty, we have extended BMA to include transformation selection and outlier identification in work reported elsewhere (Hoeting et al. 1995, 1996).

APPENDIX A: DATA FOR FIGURE 1

The following data from selected textbooks were used to make Figure 1:

                                                                    Page      Number of       Number of
Dataset                       Source                                number    observations    predictors
Attitude survey               Chatterjee and Price (1991)             70          30              6
Equal education opportunity   Chatterjee and Price (1991)            176          70              3
Gasoline mileage              Chatterjee and Price (1991)            261          30             10
Nuclear power                 Cox and Snell (1982)                    81          32             10
Crime                         Cox and Snell (1982)                   170          47             13
Hald                          Draper and Smith (1981)                630          13              4
Grades                        Hamilton (1993)                         83         118              3
Swiss fertility               Mosteller and Tukey (1977)             550          47              5
Surgical unit                 Neter, Wasserman, and Kutner (1990)  439, 468      108              4
Berkeley study                Weisberg (1985)
  Girls                                                               56          32             10
  Boys                                                                57          26             10
Housing                       Weisberg (1985)                        241          27              9
Highway                       Weisberg (1985)                        206          39             13

APPENDIX B: SOFTWARE FOR IMPLEMENTING MC3

BMA is a set of S-PLUS functions that can be obtained free of charge via the World Wide Web address http://lib.stat.cmu.edu/S/bma or by sending an e-mail message containing the text "send BMA from S" to the Internet address [email protected]. The program MC3.REG performs MCMC model composition for linear regression. The set of programs fully implements the MC3 algorithm described in Section 4.2.

[Received November 1993. Revised June 1996.]

REFERENCES

Becker, G. S. (1968), "Crime and Punishment: An Economic Approach," Journal of Political Economy, 76, 169-217.
Brier, S. S., and Fienberg, S. E. (1980), "Recent Econometric Modeling of Crime and Punishment: Support for the Deterrence Hypothesis?," Evaluation Review, 4, 147-191.
Breiman, L. (1968), Probability, Reading, MA: Addison-Wesley.
Breiman, L. (1992), "The Little Bootstrap and Other Methods for Dimensionality Selection in Regression: X-Fixed Prediction Error," Journal of the American Statistical Association, 87, 738-754.
Breiman, L. (1995), "Better Subset Regression Using the Nonnegative Garrote," Technometrics, 37, 373-384.
Breiman, L., and Spector, P. (1992), "Submodel Selection and Evaluation in Regression," International Statistical Review, 60, 291-319.
Chatterjee, S., and Price, B. (1991), Regression Analysis by Example (2nd ed.), New York: Wiley.
Cox, D. R., and Snell, E. J. (1982), Applied Statistics: Principles and Examples, New York: Chapman and Hall.
Chung, K. L. (1967), Markov Chains with Stationary Transition Probabilities (2nd ed.), Berlin: Springer-Verlag.
Draper, D. (1995), "Assessment and Propagation of Model Uncertainty" (with discussion), Journal of the Royal Statistical Society, Ser. B, 57, 45-97.
Draper, N. R., and Smith, H. (1981), Applied Regression Analysis (2nd ed.), New York: Wiley.
Edwards, W., Lindman, H., and Savage, L. J. (1963), "Bayesian Statistical Inference for Psychological Research," Psychological Review, 70, 193-242.
Ehrlich, I. (1973), "Participation in Illegitimate Activities: A Theoretical and Empirical Investigation," Journal of Political Economy, 81, 521-565.
Freedman, D. A. (1983), "A Note on Screening Regression Equations," The American Statistician, 37, 152-155.
Freedman, D. A., Navidi, W. C., and Peters, S. C. (1986), "On the Impact of Variable Selection in Fitting Regression Equations," in On Model Uncertainty and Its Statistical Implications, ed. T. K. Dijkstra, Berlin: Springer-Verlag, pp. 1-16.
Garthwaite, P. H., and Dickey, J. M. (1992), "Elicitation of Prior Distributions for Variable Selection Problems in Regression," The Annals of Statistics, 20, 1697-1719.
Geisser, S. (1980), Discussion of "Sampling and Bayes' Inference in Scientific Modelling and Robustness" by G. E. P. Box, Journal of the Royal Statistical Society, Ser. A, 143, 416-417.
George, E. I., and McCulloch, R. E. (1993), "Variable Selection via Gibbs Sampling," Journal of the American Statistical Association, 88, 881-890.
Good, I. J. (1952), "Rational Decisions," Journal of the Royal Statistical Society, Ser. B, 14, 107-114.
Hamilton, L. C. (1993), Statistics With Stata 3, Belmont, CA: Duxbury Press.
Hocking, R. R. (1976), "The Analysis and Selection of Variables in Linear Regression," Biometrics, 32, 1-51.
Hodges, J. S. (1987), "Uncertainty, Policy Analysis, and Statistics," Statistical Science, 2, 259-291.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1995), "Simultaneous Variable and Transformation Selection in Linear Regression," Technical Report 9506, Colorado State University, Dept. of Statistics.
Hoeting, J. A., Raftery, A. E., and Madigan, D. (1996), "A Method for Simultaneous Variable Selection and Outlier Identification in Linear Regression," Journal of Computational Statistics and Data Analysis, 22, 251-270.
Jeffreys, H. (1961), Theory of Probability (3rd ed.), London: Oxford University Press.
Kadane, J. B., Dickey, J. M., Winkler, R. L., Smith, W. S., and Peters, S. C. (1980), "Interactive Elicitation of Opinion for a Normal Linear Model," Journal of the American Statistical Association, 75, 845-854.
Kass, R. E., and Raftery, A. E. (1995), "Bayes Factors," Journal of the American Statistical Association, 90, 773-795.
Laud, P. W., and Ibrahim, J. G. (1995), "Predictive Model Selection," Journal of the Royal Statistical Society, Ser. B, 57, 247-262.
Leamer, E. E. (1978), Specification Searches, New York: Wiley.
Linhart, H., and Zucchini, W. (1986), Model Selection, New York: Wiley.
Madigan, D., and Raftery, A. E. (1994), "Model Selection and Accounting for Model Uncertainty in Graphical Models Using Occam's Window," Journal of the American Statistical Association, 89, 1535-1546.
Madigan, D., and York, J. (1995), "Bayesian Graphical Models for Discrete Data," International Statistical Review, 63, 215-232.
Miller, A. J. (1984), "Selection of Subsets of Regression Variables" (with discussion), Journal of the Royal Statistical Society, Ser. A, 147, 389-425.
Miller, A. J. (1990), Subset Selection in Regression, New York: Chapman and Hall.
Mitchell, T. J., and Beauchamp, J. J. (1988), "Bayesian Variable Selection in Linear Regression" (with discussion), Journal of the American Statistical Association, 83, 1023-1036.
Mosteller, F., and Tukey, J. W. (1977), Data Analysis and Regression, Reading, MA: Addison-Wesley.
Moulton, B. R. (1991), "A Bayesian Approach to Regression Selection and Estimation With Application to a Price Index for Radio Services," Journal of Econometrics, 49, 169-193.
Murphy, A. H., and Winkler, R. L. (1977), "Reliability of Subjective Probability Forecasts of Precipitation and Temperature," Applied Statistics, 26, 41-47.
Neter, J., Wasserman, W., and Kutner, M. (1990), Applied Linear Statistical Models, Homewood, IL: Irwin.
Raftery, A. E. (1988), "Approximate Bayes Factors for Generalized Linear Models," Technical Report 121, University of Washington, Dept. of Statistics.
Raftery, A. E. (1996), "Approximate Bayes Factors and Accounting for Model Uncertainty in Generalized Linear Models," Biometrika, 83, 251-266.
Raiffa, H., and Schlaifer, R. (1961), Applied Statistical Decision Theory, Cambridge, MA: MIT Press.
Regal, R., and Hook, E. B. (1991), "The Effects of Model Selection on Confidence Intervals for the Size of a Closed Population," Statistics in Medicine, 10, 717-721.
Schwarz, G. (1978), "Estimating the Dimension of a Model," The Annals of Statistics, 6, 461-464.
Shibata, R. (1981), "An Optimal Selection of Regression Variables," Biometrika, 68, 45-54.
Smith, A. F. M., and Roberts, G. O. (1993), "Bayesian Computation via the Gibbs Sampler and Related Markov Chain Monte Carlo Methods," Journal of the Royal Statistical Society, Ser. B, 55, 3-24.
Stewart, L. (1987), "Hierarchical Bayesian Analysis Using Monte Carlo Integration: Computing Posterior Distributions When There Are Many Possible Models," The Statistician, 36, 211-219.
Stewart, L., and Davis, W. W. (1986), "Bayesian Posterior Distributions Over Sets of Possible Models With Inferences Computed by Monte Carlo Integration," The Statistician, 35, 175-182.
Stigler, G. J. (1970), "The Optimum Enforcement of Laws," Journal of Political Economy, 78, 526-536.
Taft, D. R., and England, R. W. (1964), Criminology (4th ed.), New York: Macmillan.
Vandaele, W. (1978), "Participation in Illegitimate Activities: Ehrlich Revisited," in Deterrence and Incapacitation, eds. A. Blumstein, J. Cohen, and D. Nagin, Washington, DC: National Academy of Sciences Press, pp. 270-335.
Weisberg, S. (1985), Applied Linear Regression (2nd ed.), New York: Wiley.
Zellner, A. (1986), "On Assessing Prior Distributions and Bayesian Regression Analysis With g Prior Distributions," in Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti, eds. P. K. Goel and A. Zellner, Amsterdam: North-Holland, pp. 233-243.
