A Novel Bayesian Approach For Variable Selection in Linear Regression Models

Article history: Received 7 March 2019; Received in revised form 28 October 2019; Accepted 29 October 2019; Available online 5 November 2019.

Keywords: Variable selection; Hierarchical Bayes; g-prior with ridge parameter; Model uncertainty; Metropolis–Hastings algorithm; Consistency

Abstract

A novel Bayesian approach to the problem of variable selection in multiple linear regression models is proposed. In particular, a hierarchical setting which allows for direct specification of a priori beliefs about the number of nonzero regression coefficients as well as a specification of beliefs that given coefficients are nonzero is presented. This is done by introducing a new prior for a random set which holds the indices of the predictors with nonzero regression coefficients. To guarantee numerical stability, a g-prior with an additional ridge parameter is adopted for the unknown regression coefficients. In order to simulate from the joint posterior distribution an intelligent random walk Metropolis–Hastings algorithm which is able to switch between different models is proposed. For the model transitions a novel proposal, which prefers to add a priori or empirically important predictors to the model and further tries to remove less important ones, is used. Testing the algorithm on real and simulated data illustrates that it performs at least on par and often even better than other well-established methods. Finally, it is proven that under some nominal assumptions, the presented approach is consistent in terms of model selection.

© 2019 The Author(s). Published by Elsevier B.V. This is an open access article under the CC BY-NC-ND license (http://creativecommons.org/licenses/by-nc-nd/4.0/).
1. Introduction
Nowadays, as computing becomes less expensive, huge amounts of data are routinely collected by companies and organizations of all kinds. The scale of the data being collected increases greatly, and with it the number of measured features (variables/predictors/covariates). Fitting statistical models which include a large number of features often results
in over-fitting since the resulting model complexity cannot be supported by the available sample size (Buehlmann et al.,
2016). Moreover, it becomes hard to detect the most predictive features and their importance with respect to the model.
Thus, variable selection (i.e. the extraction of the relevant features for a given task) plays an ever more important role in
many fields including genetics (Lee et al., 2003), astronomy (Zheng and Zhang, 2007), and economics (Foster and Stine,
2004).
In this paper we are interested in variable selection in the multiple linear regression model
y = Xβ + ε, (1.1)
✩ For this work there exists supplementary material. Based on the prostate cancer data provided by Stamey et al. (1989) the performance of the
proposed approach is compared to other recent variable selection methods.
where y is an n-dimensional response vector, X = (x_1, . . . , x_p) ∈ ℝ^{n×p} denotes the so-called design matrix which holds the
p potential predictors, β = (β1 , . . . , βp )T is the vector of unknown regression coefficients, and ε ∼ N (0, σ 2 In ) denotes
the noise vector with independent and identically distributed components. To avoid the need for an intercept β0 it is
assumed that the response y is zero centered. Moreover, to allow for an easy evaluation of the influence of single predictors x_1, . . . , x_p on the model based on the magnitude of the regression coefficients, the predictors are treated as zero centered, ∑_{i=1}^n x_{ji} = 0, and standardized with ∥x_j∥² = n − 1, for 1 ≤ j ≤ p.
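Centering and standardizing in this way amounts to a few lines of R; the following is a minimal sketch, not part of the proposed method, and all names are illustrative.

# Zero center the response and standardize the predictors so that every
# column x_j of X has mean zero and squared norm ||x_j||^2 = n - 1.
standardize_data <- function(X, y) {
  n   <- nrow(X)
  y_c <- y - mean(y)                              # zero centered response
  X_c <- scale(X, center = TRUE, scale = FALSE)   # remove column means
  s   <- sqrt(colSums(X_c^2) / (n - 1))           # sample standard deviations
  X_s <- sweep(X_c, 2, s, "/")                    # now colSums(X_s^2) == n - 1
  list(X = X_s, y = y_c)
}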
In the linear model (1.1) one assumes that only some regression coefficients are different from zero. Thus, the
problem of variable selection reduces to the identification of the nonzero regression coefficients (Alhamzawi and Taha
Mohammad Ali, 2018). Especially shrinkage approaches such as the lasso (Tibshirani, 1996), the adaptive lasso (Zou,
2006), the elastic net (Zou and Hastie, 2005) and their Bayesian analogues (Park and Casella, 2008; Alhamzawi et al.,
2012; Leng et al., 2014; Huang et al., 2015) which simultaneously perform variable selection and coefficient estimation
have been shown to be effective and are often the methods of choice in linear regression. These methods estimate β as minimizer of the objective function L(β) + P(β, λ), where L(β) = ∑_{i=1}^n (y_i − x_i^T β)² denotes the quadratic loss function (negative log-likelihood) and P denotes a method specific penalty function that encourages a sparse solution. The parameter
vector λ controls the penalization strength and thus the level of sparsity. Commonly, a good choice for λ is determined via
cross-validation in practical applications. Among the above mentioned approaches the most popular one is the least absolute shrinkage and selection operator (lasso), which was proposed by Tibshirani (1996). In the lasso regression the penalty function is defined as P(β, λ) := λ∥β∥₁ = λ ∑_{i=1}^p |β_i|. Thus, the regression coefficients are continuously shrunken to
zero and for sufficiently large λ some coefficients take exactly the value zero. The lasso can be viewed as a convex and
therefore more efficiently solvable reformulation of the best subset selection approach, in which the penalty function is
given by P (β, λ) := λ∥β∥0 = λ#(i|βi ̸ = 0). Moreover, the lasso estimate for β can be interpreted as a Bayesian posterior
mode estimate when independent Laplace priors all with zero mean and the same scale parameter λ > 0 are assigned to
the regression coefficients:
p(β|σ²) = ∏_{i=1}^p (λ / (2σ)) exp(−(λ/σ) |β_i|).
In contrast to the frequentist approach the Bayesian one (Park and Casella, 2008) provides credible intervals for the
model parameters which can be used for variable selection. A further generalization of the classical lasso is the adaptive
lasso (Zou, 2006), which allows for different penalization factors of the regression coefficients:
P(β, λ) := ∑_{i=1}^p λ_i |β_i|.
For instance, the penalization vector λ = (λ_1, . . . , λ_p)^T can be chosen based on the ordinary least squares estimator, λ_j^{−1} = |β̂_j(ols)|^γ = |((X^T X)^{−1} X^T y)_j|^γ with γ > 0, or in more complicated settings (collinear predictors, or n < p) similarly based on the minimizer of the ridge-regularized problem given by:

min_β ∑_{i=1}^n (y_i − x_i^T β)² + λ∥β∥₂².
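As an illustration of this weighting scheme, the following R sketch (assuming the glmnet package, hypothetical data X and y, and γ = 1) derives the weights from a cross-validated ridge fit and passes them to the lasso via the penalty.factor argument; it is only a sketch of one possible realization.

library(glmnet)

# Ridge regression (alpha = 0); its penalization factor is chosen by 10-fold CV.
ridge_cv   <- cv.glmnet(X, y, alpha = 0)
beta_ridge <- as.numeric(coef(ridge_cv, s = "lambda.min"))[-1]   # drop intercept

# Adaptive lasso weights: penalization factor lambda_j = |beta_j(ridge)|^(-gamma),
# here with gamma = 1, i.e. the inverse absolute ridge coefficients.
w <- 1 / abs(beta_ridge)

# Lasso (alpha = 1) with coefficient-specific penalization factors.
alasso_cv   <- cv.glmnet(X, y, alpha = 1, penalty.factor = w)
beta_alasso <- coef(alasso_cv, s = "lambda.min")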
The adaptive lasso was proposed to address potential weaknesses of the classical lasso. It has been shown (Zou, 2006) that
there are situations in which the classical lasso either tends to select inactive predictors, or over-shrinks the regression
coefficients of correct predictors. In contrast, the adaptive lasso satisfies the so-called oracle properties introduced by Fan
and Li (2001). A procedure satisfies these properties, if it asymptotically identifies the right subset model and has optimal
estimation rate. For more details please refer to Fan and Li (2001). Recently, a Bayesian version of the adaptive lasso was
proposed by Alhamzawi et al. (2012) and also by Leng et al. (2014). The Bayesian adaptive lasso generalizes the classical
(i.e. non adaptive) Bayesian lasso by allowing different scale parameters in the Laplace priors of the regression coefficients:
p(β|σ²) = ∏_{i=1}^p (λ_i / (2σ)) exp(−(λ_i/σ) |β_i|).
The scale parameters λ_i are then either chosen via marginal maximum likelihood in an empirical Bayesian setting or by
assigning appropriate hyperpriors in a hierarchical Bayes approach. Another generalization of the classical lasso is the
elastic net proposed by Zou and Hastie (2005). Here the penalty function is given by:

P(β, λ) := λ_1 ∥β∥₁ + λ_2 ∥β∥₂².

The elastic net encourages a grouping of strongly correlated predictors, such that they tend to be in or out of the model
together. Moreover, it works better than the classical lasso in the case where p ≫ n. There are also some Bayesian
versions of the elastic net proposed in literature including the one proposed by Huang et al. (2015). In particular, we used
their R-package EBglmnet to validate our novel method.
Also Bayesian penalized regression techniques which are not directly related to the lasso have recently received lots
of attention in the statistics literature. Carvalho et al. (2010) introduced the horseshoe estimator for sparse signals. In our
context this approach is given by (see Makalic and Schmidt, 2016)
β_j | λ_j, τ, σ² ∼ N(0, λ_j² τ² σ²),
λ_j ∼ C⁺(0, 1),
τ ∼ C⁺(0, 1),
where C+ denotes the half-Cauchy distribution. The hyperparameter τ controls the amount of overall shrinkage of β, while
the parameters λ1 , . . . , λp allow for individual adaptions on the degree of shrinkage. Thus, the horseshoe prior shows a
pole at zero and has polynomial tails, which according to Polson and Scott (2012) are important properties in the big data
domain. The pole at zero enables an efficient recovering of the truly nonzero regression coefficients, whereas polynomial
tails allow large signals to remain unchanged. Recently, Bhadra et al. (2017) published a generalization of the horseshoe
estimator, the so-called horseshoe+ estimator. In their approach an additional level of hyperparameters is introduced via
λ_j ∼ C⁺(0, φ_j),
φ_j ∼ C⁺(0, 1).
The horseshoe+ estimator has a lower posterior mean squared error and faster posterior concentration rates in terms
of the Kullback–Leibler divergence than the classical horseshoe estimator. Another recent development in the context of
variable selection and shrinkage priors is the class of Dirichlet–Laplace (DL) shrinkage priors proposed by Bhattacharya
et al. (2015). These priors possess optimal posterior concentration. In the linear regression model DL priors can be used
based on the hierarchical setting (see Zhang and Bondell, 2018)
β_j | φ_j, τ, σ² ∼ DE(φ_j τ σ),
(φ_1, . . . , φ_p)^T ∼ Dir(a, . . . , a),
τ ∼ Ga(pa, 1/2),
where DE(b) denotes a zero mean Laplace kernel with density f (y) = (2b)−1 exp(−|y|/b) for y ∈ R, Dir(a, . . . , a) is the
Dirichlet distribution with concentration vector (a, . . . , a), and Ga denotes the Gamma distribution. Small values of a
guarantee that only some of the components of φ = (φ1 , . . . , φp )T are nonzero and, further, that the nonzero ones take
small values. The parameter τ controls the global shrinkage towards zero and the φj allow for individual deviations from
this global behavior.
Besides the above described shrinkage approaches, Bayesian methods that are based on the introduction of a random indicator vector γ = (γ_1, . . . , γ_p)^T ∈ {0, 1}^p are gaining increasing popularity (Liang et al., 2008; Guan and Stephens, 2011;
Fisher and Mehta, 2014; Wang et al., 2015). If γi equals zero the regression coefficient βi of the ith predictor is assumed
to equal zero. This is equivalent to the assumption that the ith predictor does not explain the target y. A common choice
is to assign independent Bernoulli priors to the indicator variables:
p(γ) = ∏_{i=1}^p [γ_i π_i + (1 − γ_i)(1 − π_i)].    (1.2)
For instance, Guan and Stephens (2011) used prior (1.2) with the simplification π = π_1 = · · · = π_p and assigned a uniform prior U(a, b) to ln(π). In the work of Fisher and Mehta (2014) the flat prior p(γ) ∝ 1 was used, which is equivalent
to setting πi = 0.5 in (1.2) for i = 1, . . . , p. For given γ the vector of nonzero coefficients denoted by βγ is usually
assigned a conventional g-prior βγ |g , σ 2 , Xγ ∼ N (0, g σ 2 (XTγ Xγ )−1 ) (introduced by Zellner 1986) where Xγ denotes the
design matrix with all columns deleted that correspond to indicator variables being equal to zero. The reason for the
common choice of this prior is that it often leads to a computationally tractable Bayes factor.
It should be mentioned that the Bayesian methods corresponding to the above described idea of introducing an indicator vector γ belong to the so-called spike and slab approaches. These approaches use mixture priors with two components – a spike concentrated around zero and a comparably flat slab – to perform variable selection. More exactly, the above mentioned methods correspond to spike and slab approaches with spikes defined by a point mass at zero, so-called Dirac spikes. The
spike and slab idea was initially developed by Mitchell and Beauchamp (1988). More recent spike and slab approaches
for instance can be found in Carbonetto and Stephens (2012) and in Ročková and George (2014). The first approach
uses Dirac spikes and approximates the posterior distribution via variational inference, while the second one uses normal
spikes with a small variance and maximizes the posterior using an efficient expectation–maximization algorithm. Very
recently, Chen and Walker (2019) proposed a fast method for variable selection based on marginal solo spike and slab
priors. Their approach is sequential and deals with each covariate in succession while assigning a Gaussian spike and slab
prior with sample size dependent variance to the coefficient under investigation. It should be mentioned that spike and
slab priors are also applied apart from classical regression approaches. For instance, Polson and Scott (2011) used this
type of priors to regularize support vector machines.
In this work, we propose a setting which is not based on a random indicator vector γ , but on a random set A ⊆
{1, . . . , p} that holds the indices of the active predictors, i.e. the predictors with coefficients different from zero. We
assign a prior to A which depends on the cardinality of the set |A| as well as on the actual elements of A. This enables
an easy formulation of useful a priori knowledge. In Section 3 we show how the results of other methods can be used
in order to define such a useful prior in terms of an empirical Bayes approach. To simulate from the joint posterior of
our model we propose a novel random walk Metropolis–Hastings algorithm, in contrast to Guan and Stephens (2011),
Fisher and Mehta (2014) and Wang et al. (2015) where Gibbs sampling is applied for this task. This is especially useful
when it is difficult or even impossible to determine the conditional distributions needed for the Gibbs sampling algorithm
in an analytically closed form. In these cases the corresponding integrals have to be approximated and in particular for
practitioners it can become a difficult task to choose a proper approximation technique among the various possible ones.
For instance, in Liang et al. (2008) the Laplace approximation is used for this task. Wang et al. (2015) assigned on purpose
a beta-prime prior with specific parameters to g in order to achieve a closed-form expression of the marginal posterior
of γ . Since we do not need to care about closed-form expressions by using a random walk Metropolis algorithm, our
approach is based on the more popular Zellner–Siow prior for g (Zellner and Siow, 1980), which is an inverse gamma
distribution with shape parameter α = 1/2 and scale parameter β = n/2. However, there are many valid and useful
possibilities how to treat g. Liang et al. (2008) provide an excellent study on many different choices including fixed g
priors, empirical Bayes approaches, and mixture priors obtained from assigning different prior distributions to g.
However, the g-prior depends on the inverse of the empirical covariance matrix of the selected predictors. This matrix
is singular if the number of selected covariates is greater than the number of observations n and, further, may be almost
rank deficient given that the predictors are highly correlated. To overcome this problem Wang et al. (2015) replaced the
classical inverse with the Moore–Penrose generalized inverse and thus ended up with the so-called gsg-prior (see West,
2003). In contrast to them, we adopt a g-prior with an additional ridge parameter for the unknown regression coefficients
to guarantee nonsingularity of the empirical covariance matrix. This modification of the classical g-prior was first proposed
by Gupta and Ibrahim (2007) and further investigated by Baragatti and Pommeret (2012).
Moreover, in Section 2.2 we state that our approach is consistent in terms of model selection according to the
consistency definition given by Fernández et al. (2001). The proof of this result is deferred to Appendix A. Finally, in
Section 3, we evaluate our approach on the basis of real and simulated data and compare the results with diverse well-
established variable selection methods from literature. We show that our approach performs at least on par and often
better than the comparative methods.
2. Methodology
In this section we describe the hierarchical Bayesian approach we propose for the task of variable selection in multiple
linear regression models. Moreover, we state that the novel model presented is consistent in terms of model selection.
Finally, we propose an intelligent random walk Metropolis–Hastings algorithm to simulate from the joint posterior of the
model parameters.
2.1. Model specification

In order to perform variable selection, a random set A ⊆ {1, . . . , p} which holds the indices of the active predictors,
i.e. the predictors with coefficients different from zero, is used. Moreover, it is assumed that the cardinality k = |A| is
greater than zero. In this paper the focus lies on problems where at least one of the measured features is predictive.
Nevertheless, one can easily generalize the proposed approach in such a way that also the null model (all regression
coefficients equal zero) is valid. Thus, we propose the following prior for A:
p(A = {α_1, . . . , α_k}) ∝ (1/k) (p_{α_1} + · · · + p_{α_k}) p̃(k),    (2.1)

where

(i) {p_{α_1}, . . . , p_{α_k}} ⊆ {p_1, . . . , p_p} with ∑_{i=1}^p p_i = 1 and p_i ≥ 0 for i = 1, . . . , p, and
(ii) p̃ : {1, . . . , p} → ℝ₀⁺.
The mapping p̃ can be arbitrarily chosen and should represent the a priori belief in the model size, i.e. the number of
parameters different from zero. For instance one could define p̃ based on the probability mass function of a zero truncated
binomial distribution with size parameter equal to the number of predictors p and the second parameter chosen in such a
way that the mean of the distribution equals the a priori most probable model size (see Section 2.4). On the other hand the
parameters p1 , . . . , pp are used to represent the a priori belief in the importance of the predictors x1 , . . . , xp to the model.
In terms of empirical Bayes these parameters could be chosen based on the regression coefficients of a ridge regularized
model (compare to the adaptive lasso) or based on the correlations (i.e. the linear dependencies) between the target
variable and the predictors. Please note that in the case of equal a priori importance of the predictors, p_1 = · · · = p_p = 1/p, the prior (2.1) reduces to

p(A = {α_1, . . . , α_k}) ∝ p̃(k).

Thus, in this case the prior only reflects the a priori belief regarding k, which shows the necessity of the factor 1/k in (2.1).
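As a concrete illustration, the unnormalized prior (2.1) can be evaluated with a few lines of R; the function below is only a sketch, where p_vec holds p_1, . . . , p_p and p_tilde denotes the mapping p̃ (both names are illustrative).

# Unnormalized prior of the active set A = {alpha_1, ..., alpha_k}, cf. Eq. (2.1):
# p(A) is proportional to (1/k) * (p_alpha_1 + ... + p_alpha_k) * p_tilde(k).
prior_A <- function(A, p_vec, p_tilde) {
  k <- length(A)
  if (k == 0) return(0)        # the null model is excluded a priori
  (1 / k) * sum(p_vec[A]) * p_tilde(k)
}

# Example: equal a priori importance and p_tilde proportional to a zero
# truncated binomial pmf whose mean is close to 5.
p       <- 20
p_vec   <- rep(1 / p, p)
p_tilde <- function(k) dbinom(k, size = p, prob = 5 / p) / (1 - dbinom(0, p, 5 / p))
prior_A(c(2, 7, 11), p_vec, p_tilde)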
For the variance σ² of the error terms ε_1, . . . , ε_n an inverse gamma prior is chosen:

p(σ²) ∝ (σ²)^{−(a+1)} exp(−b/σ²).    (2.2)

For b → 0 and a → 0 the inverse gamma prior converges to the non-informative and improper Jeffreys prior p(σ²) ∝ 1/σ² (Jeffreys, 1961), which is invariant under a change of scale. In the experimental studies in Section 3, a and b are both set to 0.001 such that the chosen prior is a proper approximation of the Jeffreys prior.
As already stated in the introduction, for given A the vector of nonzero coefficients, denoted by βA , is commonly
assigned a conventional g-prior βA |g , σ 2 , XA ∼ N (0, g σ 2 (XTA XA )−1 ) (introduced by Zellner (1986)), where XA denotes
the submatrix of X consisting of all columns corresponding to predictors with index in A. Since the g-prior depends on
the inverse of the empirical covariance matrix of the selected predictors it is singular if the number of selected covariates
k is greater than the number of observations n and further may be almost rank deficient provided that the predictors
are highly correlated (see West, 2003; Gupta and Ibrahim, 2007; Baragatti and Pommeret, 2012). To overcome these
two problems, based on the ideas of Gupta and Ibrahim (2007) and Baragatti and Pommeret (2012), we consider a ridge
penalized version of the g-prior:
βA |g , σ 2 , XA ∼ N (0, (g −1 σ −2 XTA XA + λIk )−1 ), (2.3)
where
max( 1k , 1
if n ≤ ζ
{
)
λ= 300 (2.4)
0 else,
with ζ denoting an appropriately large constant. It should be explained what appropriately large means. Obviously, for
n > ζ = p − 1 the first possible problem (k > n) cannot appear. The second problem, i.e. an almost rank deficient matrix
X_A^T X_A, also should not appear for n ≫ k, i.e. for sufficiently large ζ. Even if the predictors are highly correlated, a large sample size will prevent X_A^T X_A from being nearly singular and thus possibly computationally singular (compare Carsey and Harden, 2013). It has to be assumed, however, that no predictor is an exact linear combination of the other predictors; this assumption can easily be satisfied by excluding such predictors from the beginning. It should be mentioned that setting λ to zero is not necessary at all in practical applications. This is just a
theoretical consideration which will ease the proof of model selection consistency in Appendix A. The definition of λ for
n ≤ ζ is taken from Baragatti and Pommeret (2012), who suggested to reduce the influence of the ridge parameter when
the number of predictors is large up to the threshold 1/300.
Finally, we assign the popular Zellner–Siow prior (Zellner and Siow, 1980), i.e. an inverse gamma distribution with
shape parameter α = 1/2 and scale parameter β = n/2 to g, resulting in the complete hierarchical representation of the
model
y | X, β, σ² ∼ N(Xβ, σ² I_n),    (2.5)

p(β | A, g, σ², X_A) ∝ exp(−(1/2) β_A^T (g^{−1} σ^{−2} X_A^T X_A + λ I_k) β_A) h(β_{A^C}),    (2.6)

p(A = {α_1, . . . , α_k}) ∝ (1/k) (p_{α_1} + · · · + p_{α_k}) p̃(k),    (2.7)

g ∼ IG(1/2, n/2),    (2.8)

σ² ∼ IG(a, b),    (2.9)

where β_{A^C} denotes the components of β with indices not in A,

h(β_{A^C}) = 1 if β_{A^C} = 0, and h(β_{A^C}) = 0 else,    (2.10)

and λ satisfies Eq. (2.4).
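To make the hierarchical structure (2.5)–(2.10) concrete, the following R sketch evaluates the logarithm of the unnormalized joint posterior for a given state (β_A, A, g, σ²). It is only a schematic restatement of the densities above (the indicator (2.10) is implicit since only β_A is passed), with a = b = 0.001 as defaults and the λ rule (2.4) hard-coded; all names are illustrative and not taken from our implementation.

log_joint <- function(beta_A, A, g, sigma2, X, y, p_vec, p_tilde,
                      a = 0.001, b = 0.001, zeta = 300) {
  n   <- nrow(X)
  k   <- length(A)
  X_A <- X[, A, drop = FALSE]
  lambda <- if (n <= zeta) max(1 / k, 1 / 300) else 0        # Eq. (2.4)

  # Likelihood (2.5): y | X, beta, sigma^2 ~ N(X_A beta_A, sigma^2 I_n)
  res <- y - X_A %*% beta_A
  ll  <- -n / 2 * log(sigma2) - sum(res^2) / (2 * sigma2)

  # Prior (2.6): beta_A ~ N(0, P^-1) with precision P = g^-1 sigma^-2 X_A'X_A + lambda I_k
  P     <- crossprod(X_A) / (g * sigma2) + lambda * diag(k)
  lbeta <- 0.5 * determinant(P, logarithm = TRUE)$modulus -
           k / 2 * log(2 * pi) - 0.5 * t(beta_A) %*% P %*% beta_A

  # Prior (2.7) for A, prior (2.8) g ~ IG(1/2, n/2), prior (2.9) sigma^2 ~ IG(a, b)
  lA      <- log(sum(p_vec[A]) / k) + log(p_tilde(k))
  lg      <- -(1 / 2 + 1) * log(g) - (n / 2) / g
  lsigma2 <- -(a + 1) * log(sigma2) - b / sigma2

  as.numeric(ll + lbeta + lA + lg + lsigma2)
}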
2.2. Model selection consistency

In this section we state that model (2.5)–(2.10) is consistent in terms of model selection. The proof of the result is
given in Appendix A. Therefore, throughout the section it is assumed that the sample y is generated by model MA with
parameters β_A and σ², i.e.,

y | X_A, β_A, σ² ∼ N(X_A β_A, σ² I_n).    (2.11)

In rough words, model selection consistency means that the true model will be selected provided that enough data is
available. A sound mathematical definition was given by Fernández et al. (2001). According to these authors the posterior
probability is said to be consistent if

plim_{n→∞} p(M_A | y, X) = 1  and  plim_{n→∞} p(M_{A′} | y, X) = 0  for all A′ ≠ A,    (2.12)
where the probability limit is taken with respect to the true sampling distribution Eq. (2.11). Moreover, some preliminary
results which are required in our proof are formulated as Lemma A.1 in their work. In our notation this Lemma reads as:
Under the sampling model (also true model) MA Eq. (2.11),
Theorem 1. Assume that Eq. (2.14) holds. Then the proposed model setup (2.5)–(2.10) is consistent in terms of model selection.
2.3. Simulation from the joint posterior

As usual in Bayesian regression models, the joint posterior p(β, A, g, σ²|y, X) is analytically intractable. The most
common approach to overcome this problem is to use Markov chain Monte Carlo (MCMC) methods to simulate from
this unknown distribution. Such a method generates realizations of a Markov chain which converges in distribution to
the (conditional) random vector of interest, i.e. in this paper to β, A, g , σ 2 |y, X. Thus, after some time of convergence, the
so-called burn-in phase, the generated random numbers can be considered as a (dependent) sample from the posterior
distribution. To finally obtain an approximately independent sample, only simulated numbers with a given distance d are retained, i.e. a so-called thinning is performed. In this paper a special Metropolis–Hastings (MH) algorithm, and thus a special MCMC
method, is proposed to simulate from p(β, A, g , σ 2 |y, X). MH algorithms are iterative and in each iteration a random
number is drawn from a proposal distribution q which is then accepted or not based on a given criterion. If the proposal
q depends on the respectively last accepted random number one also speaks of a random walk MH algorithm. For further
details on MCMC methods please refer to Fahrmeir et al. (2013). In order to define a MH algorithm for a given problem
basically the only thing one has to specify is the proposal distribution q used in the algorithm. Therefore, we define the
proposal q(β_{t+1}, A_{t+1}, g_{t+1}, σ²_{t+1} | A_t, g_t, σ²_t) as follows:
Inspired by the work of Hastings (1970) the proposal distribution for the variance parameter σ 2 as well as the proposal
distribution of the parameter g are defined to be uniform distributions
q(σ²_{t+1} | σ²_t) = Unif(σ²_{t+1}; max{10⁻⁸, σ²_t − ε_σ}, σ²_t + ε_σ),    (2.16)

q(g_{t+1} | g_t) = Unif(g_{t+1}; max{10⁻⁸, g_t − ε_g}, g_t + ε_g),    (2.17)
where ε_σ > 0 and ε_g > 0 denote tuning parameters. Tuning parameters are always chosen in such a way that the acceptance rate of the algorithm is neither very low nor very high, which should guarantee fast convergence of the algorithm. In order to specify the proposal for the random set A, at first a Bernoulli distributed random variable ch is
introduced
ch ∼ Bernoulli(ph ), (2.18)
where ph ∈ [0, 1] can be considered as a tuning parameter. The event ch = 1 means that the model size (number of
nonzero regression coefficients) changes, i.e. kt +1 ̸ = kt . Moreover, a random variable α with support on the index set
{1, . . . , p} of the regression coefficients is introduced in order to describe the model transition probabilities in case of
changing model size. A realization of the conditional random variable α|At corresponds to the index of a predictor which
is going to be added to or removed from the model. Transitions between models, which differ by two or more parameters,
are not allowed. The probability mass function of α|At is defined as
q(α | A_t) = I_{{1,...,p}}(α) ×
    p̃_α                                                 if k_t > 1 and α ∉ A_t,
    (∑_{i∈A_t} p̃_i) / (p̃_α ∑_{i∈A_t} 1/p̃_i)            if k_t > 1 and α ∈ A_t,      (2.19)
    0                                                    if k_t = 1 and α ∈ A_t,
    p̃_α / ∑_{i∈{1,...,p}\A_t} p̃_i                       else,
where the parameters p̃_1, . . . , p̃_p are greater than or equal to zero, sum up to one, and may but need not be identical to the parameters p_1, . . . , p_p already used in the prior p(A). Again, as with the parameters p_1, . . . , p_p, the parameters p̃_1, . . . , p̃_p can
be used to represent a priori beliefs about the importance of the predictors x1 , . . . , xp and a good choice of them should
lead to a fast convergence of the MH algorithm. Especially, in problems with a high number of predictors it is essential to
assign informative values to p̃_1, . . . , p̃_p, since otherwise it might take a very large number of iterations until an MCMC method (see Section 2.3) used to simulate from the joint posterior converges. According to Eq. (2.19), in the case
where k_t > 1, a predictor x_α is added with probability p̃_α, which might correspond to the a priori importance of this predictor. Furthermore, a predictor x_α is removed with a probability proportional to p̃_α^{−1}, which might be the inverse a priori importance of x_α. In the case where k_t = 1 a predictor x_α is added with a probability proportional to p̃_α and it is
impossible that the only predictor in the model is removed, since this would violate the prior assumption that k = |A| > 0 with probability 1. Using Eqs. (2.18) and (2.19) the proposal distribution q(A_{t+1} | A_t) is finally defined by:

q(A_{t+1} | A_t) = (1 − p_h) I(A_{t+1} = A_t) + p_h q(α | A_t) I(|A_{t+1} △ A_t| = 1),  with {α} = A_{t+1} △ A_t.    (2.20)

Note that the probability mass function q in Eq. (2.20) is given by Eq. (2.19), since |k_{t+1} − k_t| = 1. Besides the easy
evaluation of q(At +1 |At ) for a given input, one can easily simulate from this distribution. This is done by sampling
from the distribution of ch followed by sampling from the conditional distribution of At +1 |At , ch where the conditioning
takes places with respect to the sample of ch generated in the previous step. The only thing remaining to define the
overall proposal q(β_{t+1}, A_{t+1}, g_{t+1}, σ²_{t+1} | A_t, g_t, σ²_t, X) is the conditional proposal q(β_{t+1} | A_{t+1}, g_{t+1}, σ²_{t+1}, X_{A_{t+1}}). It is well-known (see Fahrmeir et al., 2013) that under the observation model y | β, σ², X ∼ N(Xβ, σ² I) with prior distributions β | σ² ∼ N(0, σ² M) and σ² ∼ IG(a, b) the conditional posterior of β given σ² is given by

β | σ², y, X ∼ N(m̃, σ² M̃),

with

m̃ = M̃ X^T y,
M̃ = (X^T X + M^{−1})^{−1}.

Motivated by this result, the conditional proposal is defined as

q(β_{t+1} | A_{t+1}, g_{t+1}, σ²_{t+1}, X_{A_{t+1}}) = q(β_{t+1,A_{t+1}} | A_{t+1}, g_{t+1}, σ²_{t+1}, X_{A_{t+1}}) q(β_{t+1,A^C_{t+1}} | A_{t+1})
    ∝ exp(−(1/2) (β_{t+1,A_{t+1}} − µ_{t+1})^T F_{t+1} (β_{t+1,A_{t+1}} − µ_{t+1})) h(β_{t+1,A^C_{t+1}}),

where

µ_{t+1} = σ_{t+1}^{−2} F_{t+1}^{−1} X_{A_{t+1}}^T y,

and

F_{t+1} = σ_{t+1}^{−2} X_{A_{t+1}}^T X_{A_{t+1}} + σ_{t+1}^{−2} g_{t+1}^{−1} X_{A_{t+1}}^T X_{A_{t+1}} + λ I_{k_{t+1}}.
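For concreteness, the following R sketch draws one proposal (A_{t+1}, g_{t+1}, σ²_{t+1}, β_{t+1,A_{t+1}}) according to Eqs. (2.16)–(2.19) and the conditional normal proposal just derived. The acceptance step of the MH algorithm is omitted, and all names are illustrative; this is a simplified sketch, not the implementation used in Section 3.

library(MASS)   # for mvrnorm

propose <- function(A_t, g_t, sigma2_t, X, y, p_tilde_vec,
                    p_h = 0.5, eps_sigma = 0.1, eps_g = 60, zeta = 300) {
  p <- ncol(X)

  # Uniform random walk proposals for sigma^2 and g, Eqs. (2.16)-(2.17)
  sigma2_new <- runif(1, max(1e-8, sigma2_t - eps_sigma), sigma2_t + eps_sigma)
  g_new      <- runif(1, max(1e-8, g_t - eps_g), g_t + eps_g)

  # Model transition, Eqs. (2.18)-(2.19)
  A_new <- A_t
  if (rbinom(1, 1, p_h) == 1) {                      # ch = 1: model size changes
    k_t      <- length(A_t)
    inactive <- setdiff(seq_len(p), A_t)
    w        <- numeric(p)
    w[inactive] <- p_tilde_vec[inactive]             # add x_alpha with prob. p~_alpha
    if (k_t > 1) {                                   # remove with prob. prop. to 1/p~_alpha
      w[A_t] <- sum(p_tilde_vec[A_t]) / (p_tilde_vec[A_t] * sum(1 / p_tilde_vec[A_t]))
    }
    alpha <- sample(seq_len(p), 1, prob = w)
    A_new <- if (alpha %in% A_t) setdiff(A_t, alpha) else sort(c(A_t, alpha))
  }

  # Conditional normal proposal for beta on the new active set
  X_A    <- X[, A_new, drop = FALSE]
  k      <- length(A_new)
  lambda <- if (nrow(X) <= zeta) max(1 / k, 1 / 300) else 0           # Eq. (2.4)
  F_mat  <- crossprod(X_A) / sigma2_new +
            crossprod(X_A) / (g_new * sigma2_new) + lambda * diag(k)  # precision F
  mu     <- solve(F_mat, crossprod(X_A, y)) / sigma2_new              # mean mu
  beta_A <- mvrnorm(1, mu = as.numeric(mu), Sigma = solve(F_mat))

  # The MH acceptance step (not shown) then accepts or rejects this proposal.
  list(A = A_new, g = g_new, sigma2 = sigma2_new, beta_A = beta_A)
}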
It should be mentioned that such proposals could be drawn even more efficiently with the iterative complex factorization (ICF) algorithm proposed by Zhou and Guan (2019); according to these authors, the ICF method can increase the speed of Bayes variable selection model-fitting up to tenfold. However, the actual implementation of our method does not use the ICF approach. We plan to optimize the code in future work.
Finally, it should be said that the proposed MH algorithm is a version of the MCMC algorithms listed in Yi (2004) with a
more flexible update scheme for the model transitions. This update scheme can be considered as the main novelty of the above described MCMC sampler. Most of the simple schemes described in Yi (2004) do not contribute to the acceptance ratio due to symmetry. Moreover, they do not allow for proposals without model changes. To get some intuition, a representative one of these schemes selects one of the covariates at random and either removes it from or adds it to A, depending on whether it is an element of A or not. For the selection each variable is assigned the same probability. The other schemes are quite similar
to the just described one. The superior performance of our proposal is illustrated in Section 3.4.
2.4. Choice of the mapping p̃

This section provides a guideline on how to arrive at a useful choice for the mapping p̃, which represents the a priori beliefs
regarding the model size. As already mentioned in Section 2.1, one could define p̃ as the probability mass function of a
zero truncated binomial distribution with size parameter equal to the number of predictors p and the second parameter
chosen in such a way that the mean of the distribution equals the a priori most probable model size. However, in general it
is somewhat cumbersome to determine the second parameter in such a way that the distribution has a given mean. Note
that the expected value of a zero truncated binomial distribution with size parameter p and second parameter q equals
pq/(1 − (1 − q)p ). Further, note that by the Abel–Ruffini theorem there is no algebraic solution to the general polynomial
equations of degree five or higher with arbitrary coefficients. A classical binomial distribution with size parameter p and
second parameter equal to µ/p has mean µ. Thus, a zero truncated binomial distribution with these parameters will result
in a distribution with expected value not too far away from µ.
Observing that the variance of a zero truncated binomial distribution grows with increasing number of predictors it
can be helpful to define p̃ proportional to a given power l of such a distribution. Especially, for large p choices of l > 1 are
often necessary to allow for fast convergence of the proposed MH algorithm (see Section 2.3). We have investigated this
property by considering the real-world datasets and the simulated datasets described in Section 3. Note that by specifying
the constant of proportionality in such a way, that p̃ defines a probability distribution, the resulting distribution has a
smaller variance than the generating binomial distribution, for l > 1.
While specifying p̃ proportional to a given power l of a zero truncated binomial distribution with size parameter p and
second parameter µ/p is quite intuitive, finding a good specification for µ and l is non-trivial. In the sequel a guideline on how to specify these parameters is given.
In case that the true model size is known everything is easy. One can simply assign the true number to µ and, further,
set l to a large number such as 50. As a result our method will return the a posteriori most probable model with µ
parameters. We have validated this statement via simulation studies.
In case that the true model size is unknown or that there is no true model at all things get more difficult. Clearly, a
good specification of l and µ can be found via cross-validation (CV) on a two-dimensional grid. However, due to the fact
that Bayesian methods based on MCMC sampling are not the fastest ones to train a two-dimensional CV is not optimal
in terms of computational effort. For this reason, we recommend to assign a fixed value to l and then searching for a
good choice of µ either via cross-validation or by empirically investigating the performance of some specifications on a
train/test split. In the experimental studies in Section 3 the empirical approach was used. A possible first choice for µ
can be found by considering the number of nonzero regression coefficients from a fast frequentist approach, such as the
lasso. We recommend the following specification of l. Set l equal to 1 for small values of p (e.g. p < 30) and choose l = 3
for large p. This way of proceeding leads to a less vague prior for k = |A| in case of increasing p and returned the best
results in the real-world studies considered in Section 3.1 and the simulation studies considered in Section 3.2.
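The recommendations of this section can be summarized in a short R sketch; the lasso-based choice of µ and the specification of l follow the guideline above, while the data X and y as well as all object names are illustrative.

library(glmnet)

# A possible first choice for mu: the number of nonzero lasso coefficients,
# with the lasso penalty chosen by cross-validation.
p     <- ncol(X)
lasso <- cv.glmnet(X, y)
mu    <- sum(as.numeric(coef(lasso, s = "lambda.min"))[-1] != 0)
l     <- if (p < 30) 1 else 3     # recommended specification of l

# p_tilde(k) proportional to the l-th power of a zero truncated binomial pmf
# with size p and success probability mu / p (mean close to mu).
p_tilde <- function(k) {
  pmf <- dbinom(k, size = p, prob = mu / p) / (1 - dbinom(0, p, mu / p))
  pmf^l
}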
We want to mention that the additional time required for searching a useful parameter µ should not be considered
as too critical since for given µ and l our MH algorithm converges comparatively fast, as can be seen in Section 3.3.
Nevertheless, in future research we want to investigate hyperpriors for both µ and l in order to make our approach more
comfortable to use.
Finally, in Table 1 we report for varying µ and l = 3 model size and prediction accuracy for a representative dataset
corresponding to the simulation study presented in Section 3.2.1 with n = 100 training observations. One can see how
increasing values of µ lead to increasing average model sizes and, since the true model size is already found for the first specification of µ, also to decreasing accuracies.
3. Experimental studies
In this section the (predictive) performance of the proposed hierarchical Bayesian approach is analyzed and compared
with some other Bayesian and non-Bayesian methods. For this task we have implemented the MH algorithm described
in Section 2.3 by using the interface between the programming languages R (R Core Team, 2015) and C. The comparison
includes the lasso (Tibshirani, 1996), the adaptive lasso (alasso) (Zou, 2006), the elastic net (elastic) (Zou and Hastie, 2005),
the Bayesian lasso (blasso) (Park and Casella, 2008), the Bayesian adaptive lasso (balasso) (Alhamzawi et al., 2012; Leng
et al., 2014) the Bayesian elastic net (belastic) (Huang et al., 2015), the horseshoe estimator (hs) (Carvalho et al., 2010),
Table 1
Model size and prediction accuracy for a representative dataset corre-
sponding to the simulation study presented in Section 3.2.1 with n = 100
training observations, l = 3 and varying µ.
µ Mean model size MAD
1 6.0885 0.3942685
2 6.6475 0.3988725
3 7.5405 0.4136312
4 8.64725 0.4185981
5 9.62 0.4290377
6 10.844 0.4339028
the horseshoe+ estimator (hs+) (Bhadra et al., 2017), the spike and slab approach via variational inference (ssv) proposed
by Carbonetto and Stephens (2012) and the spike and slab approach via expectation–maximization (ssem) proposed
by Ročková and George (2014). For our extensive comparisons two real-world datasets and also simulated datasets from
two different artificial models are used. The performance comparison on the real-world datasets is carried out on the
basis of a 5-fold cross-validation. Performing a 5-fold cross-validation on a given dataset means partitioning the dataset
in 5 equal sized subsets, iteratively selecting each of these subsets exactly once as testing data, whilst using the remaining
subsets for training. Aggregating some measure of accuracy, respectively computed from each of the 5 testing datasets,
results in the overall performance of the models to evaluate. The performance comparison for the artificial models is
accomplished by simulating multiple datasets from these models, then performing a train/test split for each simulated
dataset, and finally again aggregating the single accuracies achieved. The accuracy measures chosen in this paper are the
mean squared error (MSE) and the mean absolute deviation (MAD)
MSE = (1/n_te) ∑_{i=1}^{n_te} (y_i − ŷ_i)²,    (3.1)

MAD = (1/n_te) ∑_{i=1}^{n_te} |y_i − ŷ_i|,    (3.2)
where nte denotes the cardinality of the testing dataset and the ŷi denote the predicted target values, for i = 1, . . . , nte .
For the aggregation of single accuracies we are using the median since it is more robust to outliers than the mean of
the values. The median of MSEs is then denoted by MMSE and the median of MADs by MMAD. For our novel approach
it should be stressed that, for obtaining a prediction ŷ∗ corresponding to a test sample x∗ , an estimate of the expected
value of the posterior predictive distribution E(y∗ |x∗ , y, X) is used. Using Monte Carlo integration this expected value can
be estimated as follows:
E(y* | x*, y, X) = ∫ y* p(y* | x*, y, X) dy*
                 = ∫ y* {∫∫ p(y* | x*, β, σ²) p(β, σ² | y, X) dσ² dβ} dy*
                 = ∫∫ {∫ y* p(y* | x*, β, σ²) dy*} p(β, σ² | y, X) dσ² dβ
                 = ∫∫ x*^T β p(β, σ² | y, X) dσ² dβ
                 = ∫ x*^T β p(β | y, X) dβ
                 ≈ (1/N) ∑_{i=1}^N x*^T β_i,
where β_1, . . . , β_N is a sample of size N from the marginal posterior p(β|y, X). For all the non-Bayesian comparison models the predicted target values ŷ_1, . . . , ŷ_{n_te} are computed by calculating the inner product of the test samples with the estimate
β̂ obtained by minimizing the corresponding model-specific objective function. For the Bayesian comparison models the
way the predictions are computed depends on the output provided by the R-packages used in this paper. In this work
the R-function blasso included in the R-package monomvn (Gramacy, 2018) is applied in order to perform Bayesian lasso
regression. Since the function blasso returns a sample from the marginal posterior p(β|y, X), as for the model proposed
by us, the predictions are computed by estimating the posterior predictive mean. Moreover, we use the R-Function brq
included in the package Brq (Alhamzawi, 2018) for applying the Bayesian adaptive lasso regression. It should be mentioned
that this R-function is not developed for the classical linear regression setting where the conditional mean of the response
variable is modeled, rather it is developed for the more general setting where any conditional quantile is modeled. For
purposes of the comparison study in this paper the conditional median, i.e. the conditional 50%-quantile, is selected. The
R-function produces a Bayes estimate of β which is then used for prediction (i.e. inner product computation with the
test samples). In order to use the Bayesian elastic net the function EBglmnet included in the package EBglmnet (Huang
and Liu, 2016) is used in this work. This function returns the posterior mean of the nonzero regression coefficients,
which is then used for prediction (again via inner product computation with the test samples). Further, the R functions
horseshoe, bayesreg, EMVS, and varbvs included in the equally named packages are used to apply the horseshoe estimator,
the horseshoe+ estimator, spike and slab via expectation–maximization, and spike and slab via variational inference.
Again, dependent on the outcome provided by the R-functions either the posterior predictive mean, or the posterior
mode of the regression coefficients is used for prediction.
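For our approach the Monte Carlo estimate derived at the beginning of this section amounts to an inner product with the posterior mean of the sampled coefficients; a minimal sketch, assuming beta_samples is an N × p matrix of (thinned) posterior draws of β and X_test and y_test hold the test data (all names are illustrative):

# Estimate E(y* | x*, y, X) by averaging x*^T beta over the posterior sample,
# which amounts to an inner product with the posterior mean of beta.
beta_bar <- colMeans(beta_samples)
y_hat    <- as.numeric(X_test %*% beta_bar)

# Accuracy measures on the test set, cf. Eqs. (3.1)-(3.2)
mse <- mean((y_test - y_hat)^2)
mad <- mean(abs(y_test - y_hat))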
It is also important to mention how (inside the 5-fold cross-validation for real-world datasets, or for the multiple
simulated datasets) the hyperparameters of the trained models are determined. For the frequentist models, which are all
implemented in the R-package glmnet (Friedman et al., 2010), the R-function cv.glmnet is used, which is also included
in this package. The penalization factor λ of the lasso regression is chosen based on a 10-fold cross-validation on a one
dimensional grid with the MAD as accuracy measure. Analogously, the penalization factors λ1 , λ2 of the elastic net are
selected based on a 10-fold cross-validation on a now two-dimensional grid, again with MAD as accuracy measure. For the
adaptive lasso the chosen penalization factors are the absolute values of the inverse regression coefficients obtained from
the respective ridge-penalized problem, the penalization factor of which is also determined via 10-fold cross-validation.
For the Bayesian version of the lasso and the adaptive lasso, and, further, for the two horseshoe estimators and the two
spike and slab approaches, the default values of the corresponding functions blasso, brq, horseshoe, varbvs, EMVS, and
bayesreg are assigned to the hyperparameters of the a priori distributions. For further details please refer to the help
pages of these packages. However, it should be mentioned that for the Bayesian lasso the option of additional model
selection based on a Reversible Jump MCMC algorithm is set to true for the experimental studies in this paper. Moreover,
for the Bayesian elastic net the hyperparameters in the prior distribution are selected via 5-fold cross-validation. This
is done by using the function cv.EBglmnet, which is also available in the package EBglmnet. The hyperparameters of our
proposed Bayesian model are chosen differently in the following experimental studies to indicate useful possible choices.
These specific choices are described later on in this article.
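As an illustration of the tuning just described for the frequentist models, the following sketch uses cv.glmnet with the mean absolute error as accuracy measure; note that glmnet parametrizes the elastic net via a mixing parameter α and a single λ, so the two-dimensional grid over (λ_1, λ_2) is realized here as a grid over α with λ chosen by cross-validation. The data X and y and the grid values are illustrative.

library(glmnet)

# Lasso: penalization factor chosen via 10-fold CV with the MAD as criterion.
lasso_cv <- cv.glmnet(X, y, alpha = 1, nfolds = 10, type.measure = "mae")

# Elastic net: grid over the mixing parameter alpha, lambda again via 10-fold CV;
# the pair with the smallest cross-validated error is kept.
alphas   <- seq(0.1, 0.9, by = 0.1)
enet_cvs <- lapply(alphas, function(a)
  cv.glmnet(X, y, alpha = a, nfolds = 10, type.measure = "mae"))
best     <- which.min(sapply(enet_cvs, function(f) min(f$cvm)))
enet_cv  <- enet_cvs[[best]]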
In order to obtain i.i.d. samples from the model-specific posterior distributions the R-functions used in this paper
merely save a fraction of all samples generated by the MCMC methods implemented. For the Bayesian elastic net, the
decision which samples to store is already taken by the implementation itself and cannot be user-defined. All other
Bayesian models go along with implementations which allow for a user specific determination of this fraction. To do
so the following procedure is chosen: In a first step, respectively, the first 10,000 samples are deleted and, in a second
step, all samples except for every 10-th one are deleted. Note that the first step removes samples generated before convergence of the used MCMC algorithms, while the thinning done in the second step should guarantee that the remaining
samples can be considered as being independent. For each of the observed Bayesian models 50,000 (dependent) samples
are generated, except for the Bayesian adaptive lasso where 70,000 ones are produced. This results in 4000 i.i.d. samples
each, except for 6000 samples for the Bayesian adaptive lasso. The reason for the special treatment is that in our experiments the
Bayesian adaptive lasso turned out to require more samples to get stable estimates of the target variables.
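The burn-in removal and thinning described above reduce to a simple indexing operation on the stored chains; a minimal sketch, where chain is a matrix with one MCMC draw per row (the name is illustrative):

# Discard the first 10,000 draws (burn-in), then keep every 10-th draw (thinning).
n_burn <- 10000
n_thin <- 10
kept   <- chain[-(1:n_burn), , drop = FALSE]
kept   <- kept[seq(1, nrow(kept), by = n_thin), , drop = FALSE]
nrow(kept)   # 50,000 stored draws yield 4,000 approximately independent ones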
3.1. Real-world datasets

In this section the performance of the proposed method is compared with the performance of other well-established
methods based on two real-world datasets. The comparison is made along the lines of the above considerations (see
beginning of Section 3).
Table 2
Diabetes data performance comparison: Median of mean squared prediction errors
(MMSE) and median of mean absolute prediction deviations (MMAD) based on a 5-fold
cross-validation.
Method MMSE MMAD
New approach 0.4873534 0.5678801
Lasso 0.492067 0.571596
Adaptive lasso 0.4939229 0.5736721
Elastic net 0.4922686 0.5706994
Bayesian lasso 0.4924316 0.5736084
Bayesian adaptive lasso 0.4997307 0.5786672
Bayesian elastic net 0.4895844 0.5727555
Horseshoe 0.4903684 0.5711527
horseshoe+ 0.4919946 0.5727804
Spike and slab (VI) 0.5179594 0.5894747
Spike and slab (EM) 0.4893634 0.568348
Table 3
Ozone data performance comparison: Median of mean squared prediction errors (MMSE) and median of
mean absolute prediction deviations (MMAD) based on a 5-fold cross-validation.
Method MMSE MMAD
New approach 0.2471677 0.3837716
Lasso 0.2451734 0.3898133
Adaptive lasso 0.2539498 0.3893223
Elastic net 0.2498921 0.3898747
Bayesian lasso 0.256831 0.3926603
Bayesian adaptive lasso 0.2458003 0.3865291
Bayesian elastic net 0.2409481 0.3853373
Horseshoe 0.2505222 0.3895577
horseshoe+ 0.2486865 0.3905475
Spike and slab (VI) 0.2916582 0.4246455
Spike and slab (EM) 0.2526146 0.3923742
• Temperature (degrees F)
• Inversion base height (feet)
• Daggett pressure gradient (mmHg)
• Visibility (miles)
• Vandenburg 500 millibar height (m)
• Humidity (%)
• Inversion base temperature (degrees F)
• Wind speed (mph)
The dataset is available in the R-package gclus (Hurley, 2012). We decide to examine the regression model including all
linear, quadratic and two-way interactions, resulting in 44 possible predictors. Moreover, we zero center and standardize
(variance equal to one) all variables previous to the training of the diverse models. We want to mention that in contrast to
a statement at the beginning of Section 3 in this performance evaluation for all Bayesian models 100,000 MCMC samples
are drawn and not 50,000 or 70,000 as stated there. This is the only experimental study where this statement is violated.
The hyperparameters of the newly proposed Bayesian approach are specified as follows: We assign to each of the
parameters p1 , . . . , pp and p̃1 , . . . , p̃p the value 1/p, i.e. assume equal a priori importance of the predictors. The function
p̃ is defined in such a way, that it maps a value k ∈ {1, . . . , p} to a value proportional to the third power of the probability
P(x = k), where x denotes a random variable which is zero truncated binomial distributed with size parameter p and
second parameter 7/p. Finally, the tuning parameters (see Section 2.3) ph , εσ and εg are set to the values 0.5, 0.1 and 60,
respectively.
Performing a 5-fold cross-validation as described at the beginning of Section 3 results in Table 3. The new approach
achieves the lowest MMAD and an MMSE comparable to the other methods.
3.2. Simulation studies

In this section, on the basis of simulated data corresponding to two different artificial models, the performance of
our new method is compared with the performance of other well-established methods.
Table 4
Performance comparison for different specifications of n: Median of mean squared prediction errors (MMSE) and median of mean absolute prediction
deviations (MMAD) based on 100 simulated datasets.
Method MMSE MMAD MMSE MMAD MMSE MMAD
n = 50 n = 50 n = 100 n = 100 n = 200 n = 200
New approach 0.3619324 0.4771161 0.2496664 0.4029445 0.2405002 0.3930497
Lasso 0.4643155 0.5437892 0.3114765 0.4518035 0.2696669 0.410945
Adaptive lasso 0.4261128 0.5221177 0.2869056 0.4297021 0.2653867 0.4100492
Elastic net 0.4681363 0.5450483 0.3193174 0.4529994 0.2727069 0.4143414
Bayesian lasso 0.6445547 0.6325042 0.3078518 0.4442361 0.2665262 0.4126882
Bayesian adaptive lasso 0.7616539 0.7053986 0.8052745 0.7225279 0.3911191 0.5018005
Bayesian elastic net 0.777284 0.7043048 0.3367692 0.4702988 0.273061 0.4137397
Horseshoe 0.4305569 0.5250438 0.2710609 0.4169866 0.2512771 0.3999326
horseshoe+ 0.427505 0.5213638 0.2699478 0.4167009 0.2517378 0.398612
Spike and slab (VI) 0.6992991 0.6668771 0.2563254 0.4063442 0.2462369 0.3935947
Spike and slab (EM) 0.4947604 0.5624915 0.4078821 0.5085099 0.3596638 0.4822522
The artificial models on purpose include many correlated predictors, from which only a small subset is predictive, i.e. has regression coefficients different
from zero. This should reflect difficult variable selection problems that more and more organizations have to face in
their daily life nowadays and in the future. From both artificial models multiple datasets are simulated. In particular,
100 datasets are sampled, each according to three different settings. First the sampling takes place with n = 50 training
observations, then with n = 100 training observations and finally with n = 200 training observations. The number of test
observations is always the same and given by the value nte = 200. Moreover, all the observations are drawn independently
from each other. For each sampled dataset the entire target vector (including target values from both the training set and
the testing set) is standardized via division by the sample standard deviation of the training target values. This simplifies
finding a good choice of the tuning parameters for the MH algorithm proposed. The scaling allows for choosing them
equal to specifications used in Section 3.1. The predictors are not scaled at all since they are sampled from distributions
with mean zero and standard deviation one.
Fig. 1. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 50 training samples.
Fig. 2. Boxplots of the MADs obtained from the 100 simulated datasets with n = 50 training samples.
Fig. 3. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 100 training samples.
data generating model pretty well. We want to report that the acceptance rate of the proposed MH algorithm is given by the value 0.20672, which is a good acceptance rate indicating that the algorithm converges fast. Finally, we compute the Gelman–Rubin–Brooks plot (Gelman and Rubin, 1992) (see Figs. 9 and 10) for the truly nonzero regression coefficient β21 based on four runs of the proposed MH algorithm. The plot shows how Gelman and Rubin's shrink factor and its upper confidence limit evolve with an increasing number of MCMC iterations. This shrink factor is based on a comparison of within-chain and between-chain variances. When the upper confidence limit is close to one, approximate convergence is diagnosed. The fast decrease of this limit again indicates that our algorithm converges quickly.

Fig. 4. Boxplots of the MADs obtained from the 100 simulated datasets with n = 100 training samples.
Fig. 5. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 200 training samples.
Fig. 6. Boxplots of the MADs obtained from the 100 simulated datasets with n = 200 training samples.
Fig. 7. Markov chain of the model size for a representative dataset simulated from model (3.3)–(3.7) with 100 observations.
Fig. 8. Posterior model inclusion probabilities of the predictors. Truly nonzero regression coefficients are colored blue, while the remaining coefficients are colored red. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Fig. 9. Gelman–Rubin–Brooks plot for the truly nonzero regression coefficient β21 based on four Markov chains.
Fig. 10. Gelman–Rubin–Brooks plot for the truly nonzero regression coefficient β21 based on four Markov chains.
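Such Gelman–Rubin–Brooks plots (Figs. 9 and 10) can be produced with the coda package from several independent runs of the sampler; a minimal sketch, assuming chains_beta21 is a list of four numeric vectors holding the draws of β21 from four independent runs (the object name is illustrative):

library(coda)

# One mcmc object per independent run, combined into an mcmc.list
mcmc_chains <- mcmc.list(lapply(chains_beta21, mcmc))

gelman.diag(mcmc_chains)   # potential scale reduction factor and its upper CI
gelman.plot(mcmc_chains)   # evolution of the shrink factor over the iterations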
Table 5
Performance comparison for different specifications of n: Median of mean squared prediction errors (MMSE) and median of mean absolute prediction
deviations (MMAD) based on a 100 simulated datasets.
Method MMSE MMAD MMSE MMAD MMSE MMAD
n = 50 n = 50 n = 100 n = 100 n = 200 n = 200
New approach 0.8765757 0.7592543 0.3961816 0.5011716 0.3429514 0.4642461
Lasso 0.8419270 0.7365452 0.4921145 0.5657979 0.3924337 0.5011001
Adaptive lasso 0.8478488 0.7292339 0.5084577 0.5750116 0.3995920 0.4994661
Elastic net 0.8518180 0.7418488 0.5468828 0.5872800 0.4035266 0.5061762
Bayesian lasso 1.0223014 0.7989786 0.5464250 0.5891704 0.4266989 0.5210218
Bayesian elastic net 1.1351309 0.8540231 0.7570665 0.6939538 0.8889157 0.7535658
Horseshoe 1.0008839 0.7921216 0.4590196 0.5340483 0.3426184 0.4626980
horseshoe+ 0.7970941 0.7174288 0.4105573 0.5170266 0.3465024 0.4661583
Spike and slab (VI) 1.0449073 0.8074199 0.9303236 0.7708462 0.3486689 0.4738486
Spike and slab (EM) 0.9268140 0.7725199 0.5598754 0.6004000 0.6006768 0.6232895
3.3. Computation time

The computation time required for training a statistical model is an important factor in order to obtain widespread
acceptance. In Table 6 the model training time required for one train/test split is reported for all the datasets and all
the models considered in Sections 3.1 and 3.2, respectively. The time values are given in seconds. Clearly, the frequentist
and the approximation approaches go along with the by far shortest computation times. However, compared to the other
Bayesian methods which perform MCMC sampling, our new approach performs well. In some cases the time is comparable to those required by the other methods, while in many other cases the time is even significantly shorter. Finally, it should be mentioned that the computation time required by our approach could be further decreased by applying the iterative complex factorization algorithm proposed by Zhou and Guan (2019), see Section 2.3.

Fig. 11. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 50 training samples.
Fig. 12. Boxplots of the MADs obtained from the 100 simulated datasets with n = 50 training samples.
Fig. 13. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 100 training samples.
Fig. 14. Boxplots of the MADs obtained from the 100 simulated datasets with n = 100 training samples.
Fig. 15. Boxplots of the MSEs obtained from the 100 simulated datasets with n = 200 training samples.
Fig. 16. Boxplots of the MADs obtained from the 100 simulated datasets with n = 200 training samples.
Table 6
Model training computation time in seconds.
Method Diabetes Ozone Sim1 Sim1 Sim1 Sim2 Sim2 Sim2
n = 50 n = 100 n = 200 n = 50 n = 100 n = 200
New approach 10.64832 20.69821 3.754998 5.781095 7.435956 16.54165 19.60387 27.30835
Lasso 0.1014423 0.1088657 0.084394 0.1712339 0.1270933 0.1185813 0.2687073 0.5652223
Adaptive lasso 0.1454561 0.1865087 0.214384 0.2040057 0.2673378 0.3516195 0.5710163 0.8397572
Elastic net 1.210941 1.67854 1.482584 2.435285 2.082306 1.87458 4.370458 9.581333
Bayesian lasso 1.841743 20.30306 14.6497 15.72513 16.04441 16.58162 35.40158 115.61982
Bayesian adaptive lasso 87.24348 232.25304 125.5302 126.8031 236.2899
Bayesian elastic net 1.884113 2.721288 10.44362 18.17331 14.59613 23.96037 104.63952 593.98728
Horseshoe 9.955535 95.37426 36.46957 90.80268 93.76182 87.02514 143.68974 345.82272
horseshoe+ 9.128706 29.36822 74.15208 48.4012 41.89797 258.68088 397.42212 596.81928
Spike and slab (VI) 0.1403606 0.1115017 0.331382 0.5891297 0.4502439 1.03173 1.33496 1.665471
Spike and slab (EM) 0.0059228 0.0093267 0.010065 0.01341391 0.010576 0.02538824 0.02559233 0.04630232
3.4. Comparison of the model transition proposals

In this section, we compare our proposal for the random set A (i.e. the proposal for model transitions) with the simple
proposal (Yi, 2004) which selects one of the covariates at random and either removes it from or adds it to A, depending on whether it is an element of A or not. In particular, the model proposed by us is trained with these two proposals on a representative dataset
corresponding to our second simulation study, see Section 3.2.2. The prior is specified as in the experiments in the simulation
study. Further, in the case where our proposal is used it is also specified as in this study. This means that the model
transition probabilities are based on the coefficients belonging to a ridge regularized model. In Fig. 17 some Markov
chains (model size, parameter σ , parameter g, truly nonzero coefficient β21 ) resulting from the two model trainings are
visualized. Inside the figure all the sub figures on the left side correspond to the training with our proposal, while the
other sub figures correspond to the training with the simple proposal. One can see that our proposal leads to a pretty fast
convergence of the Metropolis–Hastings algorithm, while the simple proposal converges much slower. Especially, at the
plot for σ this can be seen very clearly. The reason for this is given by the fact that the less intelligent proposal selects
model transitions at random instead of according to some well-considered strategy. As a result our proposal goes along
with an acceptance rate of 24.15%, whilst the other one merely accepts 0.71%. In principle, one can also work with the
simple proposal. The price to pay is that the MCMC algorithm has to be run for a lot more iterations and, moreover,
that a stronger thinning has to be performed. Then the predictive performance is the same since a valid sample from the
same posterior distribution is generated by both approaches. Finally, it should be mentioned that the simple proposal
works well for less complex data structures, as experiments have shown. However, with increasing dimensionality more
intelligent proposals should be considered.
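To illustrate the mechanics of the simple proposal, the following R sketch toggles one randomly chosen covariate per iteration and records the acceptance rate. The function names (log_post, simple_mh) and the BIC-type model score are illustrative placeholders only; they do not correspond to the implementation or to the posterior p(A|y, X) used in this paper.

## Minimal sketch of the simple proposal of Yi (2004): pick one covariate at
## random and toggle its membership in the active set A.
set.seed(1)

p <- 10                                         # number of candidate predictors
X <- matrix(rnorm(100 * p), 100, p)
y <- drop(X[, 1:3] %*% c(2, -1, 1) + rnorm(100))

## Toy log model score (BIC-type); a stand-in for the log marginal posterior of A.
log_post <- function(A) {
  n <- length(y)
  if (length(A) == 0) {
    rss <- sum((y - mean(y))^2)
  } else {
    fit <- lm(y ~ X[, A, drop = FALSE])
    rss <- sum(residuals(fit)^2)
  }
  -0.5 * (n * log(rss) + length(A) * log(n))
}

simple_mh <- function(n_iter = 5000) {
  A <- integer(0)                               # start with the empty model
  log_p_A <- log_post(A)
  accepted <- 0
  sizes <- integer(n_iter)
  for (i in seq_len(n_iter)) {
    j <- sample.int(p, 1)                       # pick one covariate at random
    A_new <- if (j %in% A) setdiff(A, j) else sort(c(A, j))   # toggle it
    log_p_new <- log_post(A_new)
    ## The single-flip proposal is symmetric, so the Metropolis-Hastings
    ## acceptance ratio reduces to the ratio of the model scores.
    if (log(runif(1)) < log_p_new - log_p_A) {
      A <- A_new
      log_p_A <- log_p_new
      accepted <- accepted + 1
    }
    sizes[i] <- length(A)
  }
  list(acceptance_rate = accepted / n_iter, model_sizes = sizes, final_model = A)
}

res <- simple_mh()
res$acceptance_rate

An informed proposal in the spirit of our approach would replace sample.int(p, 1) by sampling weights derived from a priori or empirical variable importance, which is what drives the large difference in acceptance rates reported above.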
4. Conclusion
In this article we have presented a novel Bayesian approach to the problem of variable selection in the multiple linear regression model with dependent predictors. The proposed method is the first to use a random set to hold the indices of the predictors with nonzero regression coefficients. To make our approach robust to challenges such as multicollinearity, or the number of observations being smaller than the number of potential predictors, a g-prior with an additional ridge parameter was assigned to the nonzero regression coefficients. While other authors apply Gibbs sampling to simulate from the joint posterior distribution, we presented an intelligent Metropolis–Hastings (MH) algorithm for this task. In particular, the MH algorithm used a novel proposal for the model transitions which is more likely to extend models with a priori or empirically relevant variables than with less important ones and which, conversely, prefers to remove less important variables in case of model reduction. We have shown that this well-grounded strategy is essential for fast convergence of the MH algorithm on complex data structures. By using an MH algorithm, we do not need to compute the conditional distributions required by a Gibbs sampler, which often are not available in analytically closed form and have to be approximated. Further, we have shown that the proposed Bayesian approach is consistent in terms of model selection under some nominal assumptions. Experimental studies with two different real-world datasets as well as simulated datasets from two artificial models demonstrated the good performance of the presented method. In particular, on the simulated datasets our new approach achieved, on average, the best results in comparison with many other well-established methods.
Acknowledgments
The publication is one of the results of the project iDev40 (www.idev40.eu). The iDev40 project has received funding from the ECSEL Joint Undertaking (JU), Austria, under Grant Agreement No. 783163. The JU receives support from the European Union's Horizon 2020 research and innovation programme. It is co-funded by the consortium members and by grants from Austria, Germany, Belgium, Italy, Spain and Romania.
Fig. 17. Some MCMC chains resulting from training on data from the second simulation study with our proposal and with the simple one. The subfigures in the left column belong to our approach, while the others belong to the simple method.
H_A = X_A \left(X_A^T X_A\right)^{-1} X_A^T ,   (A.4)
and k denotes the cardinality of A. In principle, the required marginal posterior of A can be found by integrating out g in
Eq. (A.2). However, there exists no closed-form solution of the corresponding integral. To overcome this problem, a lower
and an upper bound of the marginal posterior are derived. The numerator of the quotient in (A.1) is then replaced by the
upper bound and the denominator by the lower bound such that the resulting quotient is an upper bound of the original
one. Showing that the upper bound of the original quotient converges to zero will immediately imply that (A.1) holds. To
determine these bounds, at first a property of the Gaussian hypergeometric function

{}_2F_1(\alpha, \beta, \tau, z) = \frac{\Gamma(\tau)}{\Gamma(\beta)\,\Gamma(\tau-\beta)} \int_0^1 t^{\beta-1} (1-t)^{\tau-\beta-1} (1-tz)^{-\alpha} \, dt ,   (A.5)

which is convergent for |z| < 1 with \tau > \beta > 0, and for |z| = 1 only if \tau > \alpha + \beta and \beta > 0, is derived. It can easily be validated that for z \in \mathbb{R} with z < 1 and \tau > \beta > 0 the following equation holds:
\int_0^\infty t^{\beta-1} (1+t)^{\alpha-\tau} \,[1 + t(1-z)]^{-\alpha} \, dt
  = \frac{\Gamma(\beta)\,\Gamma(\tau-\alpha-\beta)}{\Gamma(\tau-\alpha)} \, {}_2F_1(\alpha, \beta, 1+\alpha+\beta-\tau, 1-z)
    + \frac{\Gamma(\alpha+\beta-\tau)\,\Gamma(\tau-\beta)}{\Gamma(\alpha)} \, {}_2F_1(\tau-\alpha, \tau-\beta, 1-\alpha-\beta+\tau, 1-z)\,(1-z)^{\tau-\alpha-\beta}
  = \frac{\Gamma(\beta)\,\Gamma(\tau-\beta)}{\Gamma(\tau)} \, {}_2F_1(\alpha, \beta, \tau, z).
Note that the second equality holds by identity 15.3.6 in Abramowitz and Stegun (1964). The above equation directly translates to the following identity, which will later be used to evaluate an upper and a lower bound of p(A|y, X):
\int_0^\infty t^{\beta} (1+t)^{\tau} \,[1 + t(1-z)]^{\alpha} \, dt = \frac{\Gamma(\beta+1)\,\Gamma(-\alpha-\tau-\beta-1)}{\Gamma(-\alpha-\tau)} \, {}_2F_1(-\alpha, \beta+1, -\alpha-\tau, z).   (A.6)
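As a worked check (not part of the original derivation), the last equality of the identity above follows from (A.5) via the substitution t = u/(1 + u):

\int_0^1 t^{\beta-1}(1-t)^{\tau-\beta-1}(1-tz)^{-\alpha}\,dt
  = \int_0^\infty \left(\frac{u}{1+u}\right)^{\beta-1}(1+u)^{-(\tau-\beta-1)}\left(1-\frac{uz}{1+u}\right)^{-\alpha}\frac{du}{(1+u)^2}
  = \int_0^\infty u^{\beta-1}(1+u)^{\alpha-\tau}\,[1 + u(1-z)]^{-\alpha}\,du ,

so that multiplying by \Gamma(\tau)/[\Gamma(\beta)\,\Gamma(\tau-\beta)] recovers {}_2F_1(\alpha,\beta,\tau,z). Identity (A.6) then follows by replacing (\alpha, \beta, \tau) with (-\alpha, \beta+1, -\alpha-\tau).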
Determination of the upper bound:
At first we observe that:
(g+1)^{\frac{n}{2}-\frac{k}{2}+a} \;\le\; g^{\frac{n}{2}-\frac{k}{2}+a} \, \exp\!\left(-\frac{k-n-2a}{2g}\right) \qquad \forall g \in \mathbb{R}^{+}.   (A.7)
This inequality holds since it can be reduced to the well-known inequality 1 + x ≤ exp(x) ∀x ∈ R. Moreover, we observe
that:
\exp\!\left(-\frac{n}{2g}\right) \;\le\; 1 \qquad \forall g \in \mathbb{R}^{+}.   (A.8)
Using the inequalities (A.7) and (A.8) an upper bound of p(A|y, X) can be derived:
Applying Stirling's formula to (A.14), which was also used in the model consistency proof of Wang et al. (2015), allows another approximate simplification of the right-hand side expression of (A.14) for large n through:
\left(\frac{n}{2}\right)^{-\frac{k}{2}-\frac{1}{2}} \Gamma\!\left(\frac{k}{2}+\frac{1}{2}\right) \left(1-R_A^2\right)^{\frac{k}{2}-a-\frac{n}{2}+\frac{1}{2}} .   (A.15)
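For reference, Stirling's formula states

\Gamma(x) \approx \sqrt{2\pi}\, x^{x-\frac{1}{2}} e^{-x} \qquad (x \to \infty).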
Note that the approximation sign in Stirling's formula above means that the ratio of the two sides converges to one as x goes to infinity.
Determination of the lower bound:
Observing that (g + 1)^r \ge g^r \ge 0 for g \ge 0, r \ge 0, and further observing that \exp\!\left[g^{-1}(1-R_A^2)^{-1}\right] \ge 1 + g^{-1}(1-R_A^2)^{-1} \ \forall g \in \mathbb{R}^{+} since \exp(x) \ge 1 + x \ \forall x \in \mathbb{R}, a lower bound for p(A|y, X) is given by:
\frac{p(A'|y, X)}{p(A|y, X)} \;\le\; \frac{p(A')\,\Gamma\!\left(\frac{k'}{2}+\frac{1}{2}\right)\left(\frac{n}{2}\right)^{-\frac{k'}{2}-\frac{1}{2}}\left(1-R_{A'}^2\right)^{\frac{k'}{2}-a-\frac{n}{2}+\frac{1}{2}}}{p(A)\left(1-R_A^2\right)^{-a-\frac{n}{2}}\,\Gamma\!\left(\frac{k}{2}+\frac{1}{2}\right)\left(c_1\frac{n}{2}+c_2\right)^{-\frac{k}{2}-\frac{1}{2}}}   (A.23)
= c_3 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \left(\frac{1-R_A^2}{1-R_{A'}^2}\right)^{\frac{n}{2}}   (A.24)

= c_3 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \left(\frac{\frac{1}{n}\left[2b + y^T(I_n - H_A)\,y\right]}{\frac{1}{n}\left[2b + y^T(I_n - H_{A'})\,y\right]}\right)^{\frac{n}{2}} .   (A.25)
(i) If M_A \not\subseteq M_{A'}, the probability limit of the upper bound (A.25) for n → ∞ evaluates as follows:

\operatorname*{p\,lim}_{n\to\infty} \, c_3 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \left(\frac{\frac{1}{n}\left[2b + y^T(I_n - H_A)\,y\right]}{\frac{1}{n}\left[2b + y^T(I_n - H_{A'})\,y\right]}\right)^{\frac{n}{2}}   (A.26)

= \operatorname*{p\,lim}_{n\to\infty} \, c_4 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \left(\frac{\sigma^2}{\sigma^2 + b_{A'}}\right)^{\frac{n}{2}}   (A.27)

= 0 .   (A.28)
Note that Eq. (A.27) holds by Eqs. (2.13) and (2.15). Further, note that the term (\sigma^2/(\sigma^2 + b_{A'}))^{n/2} is an element of the interval (0, 1) and thus converges exponentially fast to zero. Therefore, the other factor of the product in Eq. (A.27) does not influence the limit at all, since it is polynomial in n.
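To make this explicit (a small added remark, not part of the original text): for any constants q \ge 0 and c \in (0, 1),

n^{q}\, c^{\frac{n}{2}} = \exp\!\left(q \ln n + \tfrac{n}{2}\ln c\right) \xrightarrow{\;n\to\infty\;} 0,

since the linear term \tfrac{n}{2}\ln c < 0 dominates the logarithmic term q\ln n; here c = \sigma^2/(\sigma^2 + b_{A'}).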
(ii) If M_A \subseteq M_{A'}, then according to the work of Fernández et al. (2001) the following holds:

\left(\frac{y^T(I_n - H_A)\,y}{y^T(I_n - H_{A'})\,y}\right)^{\frac{n}{2}} \;\xrightarrow{\;d\;}\; \exp\!\left(\frac{s}{2}\right),

where s has a \chi^2 distribution with k' - k degrees of freedom and \xrightarrow{\;d\;} denotes convergence in distribution.
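Heuristically (this sketch is not part of the original argument), since M_A \subseteq M_{A'} the numerator exceeds the denominator by the quadratic form y^T(H_{A'} - H_A)\,y, which is of the order of \sigma^2 s, while y^T(I_n - H_{A'})\,y grows like n\sigma^2; hence

\left(\frac{y^T(I_n - H_A)\,y}{y^T(I_n - H_{A'})\,y}\right)^{\frac{n}{2}} = \left(1 + \frac{y^T(H_{A'} - H_A)\,y}{y^T(I_n - H_{A'})\,y}\right)^{\frac{n}{2}} \approx \left(1 + \frac{s}{n}\right)^{\frac{n}{2}} \xrightarrow{\;n\to\infty\;} \exp\!\left(\frac{s}{2}\right).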
Thus, for n → ∞ the limit of the upper bound (A.25) evaluates as follows:

\operatorname*{p\,lim}_{n\to\infty} \, c_3 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \left(\frac{\frac{1}{n}\left[2b + y^T(I_n - H_A)\,y\right]}{\frac{1}{n}\left[2b + y^T(I_n - H_{A'})\,y\right]}\right)^{\frac{n}{2}}   (A.29)

= \operatorname*{p\,lim}_{n\to\infty} \, c_5 \, \frac{\left(c_1\frac{n}{2}+c_2\right)^{\frac{k}{2}+\frac{1}{2}}}{\left(\frac{n}{2}\right)^{\frac{k'}{2}+\frac{1}{2}}} \exp\!\left(\frac{s}{2}\right)   (A.30)

= 0 .   (A.31)

Note that k' > k since M_A \ne M_{A'}.
Finally, in both cases the upper bound of the quotient in (A.1) converges to 0, and thus so does the quotient itself, since it is greater than or equal to 0; this completes the proof. □
References
Abramowitz, M., Stegun, I.A., 1964. Handbook of Mathematical Functions with Formulas, Graphs, and Mathematical Tables. Dover, New York.
Alhamzawi, R., 2018. Brq: An R package for Bayesian quantile regression. Working Paper. URL https://cran.r-project.org/web/packages/Brq/Brq.pdf.
Alhamzawi, R., Taha Mohammad Ali, H., 2018. The Bayesian adaptive Lasso regression. Math. Biosci. 303, 75–82.
Alhamzawi, R., Yu, K., Benoit, D.F., 2012. Bayesian adaptive Lasso quantile regression. Stat. Model. 12 (3), 279–297.
Baragatti, M.C., Pommeret, D., 2012. A study of variable selection using g-prior distribution with ridge parameter. Comput. Statist. Data Anal. 56,
1920–1934.
Bhadra, A., Datta, J., Polson, N.G., Willard, B., 2017. The horseshoe+ estimator of ultra-sparse signals. Bayesian Anal. 12 (4), 1105–1131.
Bhattacharya, A., Pati, D., Pillai, N.S., Dunson, D.B., 2015. Dirichlet–Laplace priors for optimal shrinkage. J. Amer. Statist. Assoc. 110 (512), 1479–1490.
Breiman, L., Friedman, J.H., 1985. Estimating optimal transformations for multiple regression and correlation: Rejoinder. J. Amer. Statist. Assoc. 80
(391), 614–619.
Buehlmann, P., Drineas, P., Kane, M., van der Laan, M., 2016. Handbook of Big Data. Chapman & Hall/CRC.
Carbonetto, P., Stephens, M., 2012. Scalable variational inference for Bayesian variable selection in regression, and its accuracy in genetic association
studies. Bayesian Anal. 7 (1), 73–108.
Carsey, T., Harden, J., 2013. Monte Carlo Simulation and Resampling Methods for Social Science. Sage Publications, Inc.
Carvalho, C.M., Polson, N.G., Scott, J.G., 2010. The horseshoe estimator for sparse signals. Biometrika 97 (2), 465–480.
Chen, S., Walker, S.G., 2019. Fast Bayesian variable selection for high dimensional linear models: Marginal solo spike and slab priors. Electron. J. Stat.
13 (1), 284–309.
Efron, B., Hastie, T., Johnstone, I., Tibshirani, R., 2004. Least angle regression. Ann. Statist. 32 (2), 407–499.
Fahrmeir, L., Kneib, T., Lang, S., Marx, B., 2013. Regression. Springer, Berlin.
Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle properties. J. Amer. Statist. Assoc. 96 (456), 1348–1360.
Fernández, C., Ley, E., Steel, M.F., 2001. Benchmark priors for Bayesian model averaging. J. Econometrics 100 (2), 381–427, URL http://www.
sciencedirect.com/science/article/pii/S0304407600000762.
Fisher, C., Mehta, P., 2014. Fast Bayesian feature selection for high dimensional linear regression in genomics via the Ising approximation. Bioinformatics 31.
Foster, D.P., Stine, R.A., 2004. Variable selection in data mining: Building a predictive model for bankruptcy. J. Amer. Statist. Assoc. 99, 303–313.
Friedman, J., Hastie, T., Tibshirani, R., 2010. Regularization paths for generalized linear models via coordinate descent. J. Stat. Softw. 33 (1), 1–22,
URL http://www.jstatsoft.org/v33/i01/.
Gelman, A., Rubin, D.B., 1992. Inference from iterative simulation using multiple sequences. Statist. Sci. 7 (4), 457–472.
Gramacy, R.B., 2018. Monomvn: Estimation for multivariate normal and student-t data with monotone missingness. R package version 1.9-8. URL
https://CRAN.R-project.org/package=monomvn.
Guan, Y., Stephens, M., 2011. Bayesian variable selection regression for genome-wide association studies and other large-scale problems. Ann. Appl.
Stat. 5, 1780–1815.
Gupta, M., Ibrahim, J.G., 2007. Variable selection in regression mixture modeling for the discovery of gene regulatory networks. J. Amer. Statist. Assoc.
102 (479), 867–880.
Hastings, W.K., 1970. Monte Carlo sampling methods using Markov chains and their applications. Biometrika 57 (1), 97–109.
Huang, A., Liu, D., 2016. EBglmnet: Empirical Bayesian Lasso and elastic net methods for generalized linear models. R package version 4.1. URL
https://CRAN.R-project.org/package=EBglmnet.
Huang, A., Xu, S., Cai, X., 2015. Empirical Bayesian elastic net for multiple quantitative trait locus mapping. Heredity 114, 107–115.
Hurley, C., 2012. Gclus: Clustering graphics. R package version 1.3.1. URL https://CRAN.R-project.org/package=gclus.
Jeffreys, H., 1961. Theory of Probability, third ed. Oxford University Press.
Lee, K.E., Sha, N., Dougherty, E.R., Vannucci, M., Mallick, B.K., 2003. Gene selection: a Bayesian variable selection approach. Bioinformatics 19 (1), 90–97.
Leng, C., Tran, M.-N., Nott, D., 2014. Bayesian adaptive Lasso. Ann. Inst. Statist. Math. 66 (2), 221–244.
Liang, F., Paulo, R., Molina, G., Clyde, M.A., Berger, J.O., 2008. Mixtures of g priors for Bayesian variable selection. J. Amer. Statist. Assoc. 103 (481),
410–423.
Makalic, E., Schmidt, D., 2016. High-dimensional Bayesian regularised regression with the bayesreg package. arXiv:1611.06649v3.
Mitchell, T.J., Beauchamp, J.J., 1988. Bayesian variable selection in linear regression. J. Amer. Statist. Assoc. 83 (404), 1023–1032.
Park, T., Casella, G., 2008. The Bayesian lasso. J. Amer. Statist. Assoc. 103 (482), 681–686.
Polson, N.G., Scott, S.L., 2011. Data augmentation for support vector machines. Bayesian Anal. 6 (1), 1–23.
Polson, N.G., Scott, J.G., 2012. Local shrinkage rules, Levy processes and regularized regression. J. R. Stat. Soc. Ser. B Stat. Methodol. 74 (2), 287–311.
R Core Team, 2015. R: A language and environment for statistical computing. R Foundation for Statistical Computing, Vienna, Austria, URL
https://www.R-project.org/.
Ročková, V., George, E.I., 2014. Emvs: The EM approach to Bayesian variable selection. J. Amer. Statist. Assoc. 109 (506), 828–846.
Ročková, V., George, E.I., 2018. The spike-and-slab LASSO. J. Amer. Statist. Assoc. 113 (521), 431–444.
Stamey, T.A., Kabalin, J.N., McNeal, J.E., Johnstone, I.M., Freiha, F.S., Redwine, E.A., Yang, N., 1989. Prostate specific antigen in the diagnosis and treatment of adenocarcinoma of the prostate. II. Radical prostatectomy treated patients. J. Urol. 141 (5), 1076–1083.
Tibshirani, R., 1996. Regression shrinkage and selection via the lasso. J. R. Stat. Soc. Ser. B Stat. Methodol. 58 (1), 267–288.
Wang, M., Sun, X., Lu, T., 2015. Bayesian structured variable selection in linear regression models. Comput. Statist. 30 (1), 205–229.
West, M., 2003. Bayesian factor regression models in the ‘‘large p, small n’’ paradigm. In: Bayesian Statistics. Oxford University Press, pp. 723–732.
Yi, N., 2004. A unified Markov chain Monte Carlo framework for mapping multiple quantitative trait loci. Genetics 167 (2), 967–975.
Zellner, A., 1986. On assessing prior distributions and Bayesian regression analysis with g-prior distributions. In: Goel, P.K., Zellner, A. (Eds.), Basic
Bayesian Inference and Decision Techniques: Essays in Honor of Bruno de Finetti. pp. 233–243.
Zellner, A., Siow, A., 1980. Posterior odds ratios for selected regression hypotheses. Trab. Estad. Investig. Oper. 31 (1), 585–603.
Zhang, Y., Bondell, H.D., 2018. Variable selection via penalized credible regions with Dirichlet–Laplace global-local shrinkage priors. Bayesian Anal.
13 (3), 823–844.
Zheng, H., Zhang, Y., 2007. Feature selection for high dimensional data in astronomy. Adv. Space Res. 41 (12), 1960–1964.
Zhou, Q., Guan, Y., 2019. Fast model-fitting of Bayesian variable selection regression using the iterative complex factorization algorithm. Bayesian
Anal. 14 (2), 573–594.
Zou, H., 2006. The adaptive lasso and its oracle properties. J. Amer. Statist. Assoc. 101 (476), 1418–1429.
Zou, H., Hastie, T., 2005. Regularization and variable selection via the elastic net. J. R. Stat. Soc. Ser. B Stat. Methodol. 67 (2), 301–320.
Zuber, V., Strimmer, K., 2017. Care: High-dimensional regression and CAR score variable selection. R package version 1.1.10. URL https://CRAN.R-
project.org/package=care.