Cross-Validation for Selecting a Model Selection Procedure∗
Abstract
While there are various model selection methods, an unanswered but important question is how to select one
of them for the data at hand. The difficulty is that the targeted behaviors of the model selection procedures
depend heavily on uncheckable or difficult-to-check assumptions on the data generating process. Fortunately, cross-
validation (CV) provides a general tool to solve this problem. In this work, results are provided on how to apply CV
to consistently choose the best method, yielding new insights and guidance for a potentially vast range of applications.
Key words: Cross-validation, cross-validation paradox, data splitting ratio, adaptive procedure selection, information criterion
1 Introduction
Model selection is an indispensable step in the process of developing a functional prediction model
or a model for understanding the data generating mechanism. While thousands of papers have been
published on model selection, an important and largely unanswered question is: How do we select
a modeling procedure that typically involves model selection and parameter estimation? In a real
application, one usually does not know which procedure fits the data the best. Instead of staunchly
following one’s favorite procedure, a better idea is to adaptively choose a modeling procedure. In
this article we focus on selecting a modeling procedure in the regression context through cross-
validation when, for example, it is unknown whether the true model is finite or infinite dimensional
in the classical setting, or whether the true regression function is a sparse linear function or a sparse additive model in the high-dimensional setting.
Cross-validation (e.g., Allen, 1974; Stone, 1974; Geisser, 1975) is one of the most commonly
used methods for evaluating the predictive performance of a model, which is given a priori or developed
by a modeling procedure. Basically, based on data splitting, part of the data is used for fitting each
competing model and the rest of the data is used to measure the predictive performances of the
models by the validation errors, and the model with the best overall performance is selected. On
this ground, cross-validation (CV) has been extensively used in data mining for the sake of model selection.
A fundamental issue in applying CV to model selection is the choice of data splitting ratio or
the validation size nv , and a number of theoretical results have been obtained. In the parametric
framework, i.e., the true model lies within the candidate model set, delete-1 (or leave-one-out,
LOO) is asymptotically equivalent to AIC (Akaike Information Criterion, Akaike, 1973) and they
are inconsistent in the sense that the probability of selecting the true model does not converge
to 1 as the sample size n goes to ∞, while BIC (Bayesian Information Criterion, Schwarz, 1978)
and delete-nv CV with nv /n → 1 (and n − nv → ∞) are consistent (see, e.g., Stone, 1977; Nishii,
1984; Shao, 1993). In the context of nonparametric regression, delete-1 CV and AIC lead to
asymptotically optimal or rate optimal choice for regression function estimation, while BIC and
delete-nv CV with nv /n → 1 usually lose the asymptotic optimality (Li, 1987; Speed and Yu,
1993; Shao, 1997). Consequently, the optimal choice of the data splitting ratio or the choice of an
information criterion is contingent on whether the data are under a parametric or a nonparametric
framework.
In the absence of prior information on the true model, an indiscriminate use of model selection
criteria may lead to poor results (Shao, 1997; Yang, 2007a). Facing the dilemma in choosing
the most appropriate modeling or model selection procedure for the data at hand, CV provides a
general solution. A theoretical result is given on the consistency of CV for procedure selection in a
high-dimensional regression setting, where the dimension of the regression function is allowed to grow
to reflect the challenge of high dimension and small sample size. We aim to investigate the relationship
between the performance of CV and the data splitting ratio in terms of modeling procedure selection
instead of the usual model selection (which intends to choose a model among a list of parametric models).
Through theoretical and simulation studies, we provide guidance on the choice of the splitting ratio for
various situations. Simply speaking,
in terms of comparing the predictive performances of two modeling procedures, a large enough
evaluation set is preferred to account for the randomness in the prediction assessment, but at the
same time we must make sure that the relative performance of the two model selection procedures
at the reduced sample size resembles that at the full sample size. This typically forces the training
size to be not too small. Therefore, the choice of splitting ratio needs to balance the above two
conflicting directions.
The well-known conflict between AIC and BIC has attracted a lot of attention from both
theoretical and applied perspectives. While some researchers stick to their philosophy to strongly
favor one over the other, presumably most people are open to means to stop the “war”, if possible.
In this paper, we propose to use CV to share the strengths of AIC and BIC adaptively in terms
of asymptotic optimality. We show that an adaptive selection by CV between AIC and BIC on a
sequence of linear models leads to (pointwise) asymptotically optimal function estimation in both the parametric and nonparametric scenarios.
Two questions may immediately arise on the legitimacy of the approach we are taking. The
first is: If you use CV to choose between AIC and BIC that are applied on a list of parametric
models, you will end up with a model in that list. Since there is the GIC (Generalized Information
Criterion, e.g., Rao and Wu, 1989) that includes both AIC and BIC as special cases, why do you
take the more complicated approach? The second question is: Again, your approach ends up with
a model in the original list. Then why don’t you select one in the original list by CV directly? It
seems clear that your choosing between the AIC model and the BIC model by CV is much more
complicated. Our answers to these intriguing questions will be given in the conclusion section based on the results developed in this paper.
Although CV is perhaps the most widely used tool for model selection, there are seemingly
widespread misconceptions that may lead to improper data analysis. Some of these will be studied
as well.
The paper is organized as follows. In Section 2, we set up the problem and present the cross-
validation method for selecting a modeling procedure. The application of CV to share the strengths
of AIC and BIC is given in Section 3. In Section 4, a general result on consistency of CV in high-
dimensional regression is presented, with a few applications. In Sections 5 and 6, simulation results
and a real data example are given, respectively. In Section 7, we examine some common
misconceptions about the use of CV. Concluding remarks are in Section 8. The proofs of the main results are in
the Appendix.
2 Cross-validation for selecting a modeling procedure

Suppose the observations $(X_i, Y_i)$, $i = 1, \ldots, n$, are generated by the model
$$ Y = \mu(X) + \varepsilon, \qquad (1) $$
where $\mu$ is the true regression function and $\varepsilon$ is the random error with $E(\varepsilon \mid x) = 0$ and $E(\varepsilon^2 \mid x) < \infty$ almost surely.
Consider regression models of the form
$$ \mu_M(x) = \beta_0 + \sum_{j \in J_M} \beta_j \phi_j(x), \qquad (2) $$
where the $\phi_j$'s are given terms (e.g., the original covariates or basis functions constructed from them) and $J_M$
is an index set associated with the candidate model $M$. The statistical goal is to develop an accurate estimator of $\mu(x)$ based on the candidate models.
Cross-validation is realized by splitting the data randomly into two disjoint parts: the training
set $Z^t = (X_i, Y_i)_{i \in I_t}$ consisting of $n_t$ sample points and the validation set $Z^v = (X_i, Y_i)_{i \in I_v}$ consisting of the remaining $n_v = n - n_t$ points. For a given model $M$, the validation error is
$$ CV(M; I_v) = \frac{1}{n_v} \sum_{i \in I_v} \big( Y_i - \hat{\mu}_{I_t, M}(X_i) \big)^2, \qquad (3) $$
where $\hat{\mu}_{I_t, M}(x)$ is estimated based on the training set only. Let $\mathcal{S}$ be a collection of data splittings
at the same splitting ratio with $|\mathcal{S}| = S$, and let $s \in \mathcal{S}$ denote a specific splitting, producing $I_t(s)$ and
$I_v(s)$. Usually the average validation error over multiple versions of data splitting
$$ CV(M; \mathcal{S}) = \frac{1}{S} \sum_{s \in \mathcal{S}} CV(M; I_v(s)) \qquad (4) $$
is considered to obtain a more stable assessment of the model’s predictive performance. This will
be called delete-nv CV error with S splittings for a given model, M . Note that there are different
ways to do this. One is to average over all possible data splittings, called leave-nv -out (Shao, 1993;
Zhang, 1993), which is often computationally infeasible. Alternatively, delete-nv CV can be carried
out through $S$ splittings with $1 \le S < \binom{n}{n_v}$, and there are two slightly different approaches to averaging
over a randomly chosen subset of all possible data splittings, i.e., $\mathcal{S}$: with or without replacement,
the former being called Monte Carlo CV (e.g., Picard and Cook, 1984) and the latter repeated
learning-testing (e.g., Breiman et al., 1984; Burman, 1989; Zhang, 1993). An even simpler version
is k-fold CV, in which case the data are randomly partitioned into k equal-size subsets. In turn each
of the k subsets is retained as the validation set, while the remaining k −1 folds work as the training
set, and the average prediction error of each candidate model is obtained. Hence, k-fold CV is one
version of delete-nv CV with nv = n/k and S = k. These different types of delete-nv CVs will be
studied theoretically and/or numerically in this paper. Although they may sometimes exhibit quite
different behaviors in practical uses, they basically share the same theoretical properties in terms
of selection consistency, as will be seen. We will call any of them a delete-nv CV for convenience
except when their differences are of interest. We refer to Arlot and Celisse (2010) for an excellent review of CV methods and theory.
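As a concrete illustration of the splitting schemes just described, the following minimal Python sketch (our own illustration; the function names are not from the paper) computes the Monte Carlo delete-nv CV error of a single procedure, mirroring (3) and (4).

```python
import numpy as np

def cv_error(fit, X, Y, n_v, n_splits, rng):
    """Monte Carlo delete-n_v CV error of one procedure `fit`, mirroring (3)-(4):
    average validation MSE over random splits without replacement."""
    n = len(Y)
    errors = []
    for _ in range(n_splits):
        perm = rng.permutation(n)
        I_v, I_t = perm[:n_v], perm[n_v:]          # validation / training indices
        predict = fit(X[I_t], Y[I_t])              # fit on the training part only
        resid = Y[I_v] - predict(X[I_v])
        errors.append(np.mean(resid ** 2))         # CV(M; I_v) in (3)
    return np.mean(errors)                         # CV(M; S) in (4)

# Example: least squares on a given design (a stand-in for one candidate model M)
def ols_fit(Xt, Yt):
    coef, *_ = np.linalg.lstsq(Xt, Yt, rcond=None)
    return lambda Xnew: Xnew @ coef

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 5)); Y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=200)
print(cv_error(ols_fit, X, Y, n_v=100, n_splits=50, rng=rng))
```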
The new use of CV, which is the focus of this work, is at the second level, i.e., the use of CV
to select a model selection procedure from a finite set of modeling procedures, Λ. Now there are
many model selection procedures available and they have quite different properties that may or may
not be in play for the data at hand. See, e.g., Fan et al (2011) and Ng (2013) for recent reviews
and discussions of model selection methods in the traditional and high-dimensional settings for
model identification and prediction. Although CV has certainly been applied in practice to select
a regression or classification procedure, to our knowledge, little has been reported on the selection
of a model selection criterion and the theoretical guidance on the choice of the data splitting ratio for this purpose.
For each δ ∈ Λ, model selection and parameter estimation are performed by δ on the training
set, and the prediction performance of δ is then assessed on the validation set by
$$ CV(\delta; I_v) = \frac{1}{n_v} \sum_{i \in I_v} \big( Y_i - \hat{\mu}_{I_t, \hat{M}_{I_t,\delta}}(X_i) \big)^2, \qquad (5) $$
where $\hat{M}_{I_t,\delta}$ is the model selected and estimated by the modeling procedure δ making use of only
the training data, and $\hat{\mu}_{I_t, \hat{M}_{I_t,\delta}}$ is the resulting estimate of µ under the selected model $\hat{M}_{I_t,\delta}$.
The comparison of different procedures can be realized by (5), usually based on multiple versions of data splitting.
There are two different ways to utilize the multiple data splittings, one based on averaging and the other based on voting. First, define
$$ CV_a(\delta; \mathcal{S}) = \frac{1}{S} \sum_{s \in \mathcal{S}} CV(\delta; I_v(s)). \qquad (6) $$
Then $CV_a$ selects the procedure that minimizes $CV_a(\delta; \mathcal{S})$ over δ ∈ Λ. Secondly, let $CV_v(\delta; \mathcal{S})$
denote the frequency with which δ achieves the minimum $\min_{\delta' \in \Lambda} CV(\delta'; I_v(s))$ over $s \in \mathcal{S}$, i.e.,
$$ CV_v(\delta; \mathcal{S}) = \frac{1}{S} \sum_{s \in \mathcal{S}} I_{\{ CV(\delta; I_v(s)) = \min_{\delta' \in \Lambda} CV(\delta'; I_v(s)) \}}. \qquad (7) $$
Then $CV_v$ selects the procedure that maximizes $CV_v(\delta; \mathcal{S})$. Let $\hat{\delta}_a^S$ and $\hat{\delta}_v^S$ denote the procedures selected by $CV_a$ and $CV_v$, respectively.
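To make this second-level use concrete, here is a hypothetical Python sketch of CVa and CVv over a set of competing procedures; each procedure is a function that performs its own model selection and estimation on the training part and returns a predictor (the dictionary interface and names are ours, not the paper's).

```python
import numpy as np

def select_procedure(procedures, X, Y, n_v, n_splits, rng):
    """Second-level CV: compare modeling procedures by averaging (6) and voting (7).
    `procedures` maps a name to a function (X_t, Y_t) -> predictor."""
    names = list(procedures)
    errs = np.zeros((n_splits, len(names)))        # CV(delta; I_v(s)) for each split s
    n = len(Y)
    for s in range(n_splits):
        perm = rng.permutation(n)
        I_v, I_t = perm[:n_v], perm[n_v:]
        for j, name in enumerate(names):
            pred = procedures[name](X[I_t], Y[I_t])
            errs[s, j] = np.mean((Y[I_v] - pred(X[I_v])) ** 2)
    cv_a = errs.mean(axis=0)                                        # CV_a in (6)
    cv_v = (errs == errs.min(axis=1, keepdims=True)).mean(axis=0)   # CV_v in (7)
    return names[np.argmin(cv_a)], names[np.argmax(cv_v)]
```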
In the literature, there are conflicting recommendations on the data splitting ratio for CV (see
Arlot and Celisse, 2010) and 10-fold CV seems to be a favorite of many researchers, although
LOO is even used for comparing procedures. We aim to shed some light on this issue and provide
some guidance on how to split data for the sake of consistent procedure selection, especially in high
dimensional regression problems. Next we present some results in traditional regression, and then turn to high-dimensional regression.
3 Stop the war between AIC and BIC by CV
In the classical regression setting with fixed truth and a relatively small list of models, model
selection is commonly performed by an information criterion of the form
$$ \hat{M}_\lambda = \operatorname{argmin}_{M \in \mathcal{M}} \sum_{i=1}^{n} \big( Y_i - \hat{\mu}_{n,M}(X_i) \big)^2 + \lambda_n |M| \sigma^2, \qquad (8) $$
where $|M|$ denotes the number of terms in model $M$ and $\sigma^2$ is the error variance (or an estimate of it).
A critical issue is the choice of λn . For instance, the conflict between AIC (λn = 2) and BIC
(λn = log n) in terms of asymptotic optimality and pointwise versus minimax-rate optimality under
parametric or nonparametric assumption is well-known (e.g., Shao, 1997; Yang, 2005, 2007a). In
a finite sample case, the signal-to-noise ratio has an important effect on the relative performance of
AIC and BIC. As discussed in Liu and Yang (2011) (and will be seen in Table 1 later), in a true
parametric framework, BIC performs better than AIC when the signal-to-noise ratio is low or high,
but can be worse than AIC when the ratio is in the middle.
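For reference, a minimal Python sketch of selection by the criterion (8) over a list of candidate index sets, with λn = 2 corresponding to AIC and λn = log n to BIC; in practice σ² would be replaced by an estimate, and the function name is ours.

```python
import numpy as np

def gic_select(X, Y, index_sets, lam, sigma2):
    """Model selection by the criterion (8): residual sum of squares plus
    lam * |M| * sigma2; lam = 2 mimics AIC and lam = log(n) mimics BIC."""
    best_set, best_val = None, np.inf
    for J in index_sets:                            # J: list of included column indices
        XJ = X[:, J]
        coef, *_ = np.linalg.lstsq(XJ, Y, rcond=None)
        val = np.sum((Y - XJ @ coef) ** 2) + lam * len(J) * sigma2
        if val < best_val:
            best_set, best_val = J, val
    return best_set
```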
Without any prior knowledge, the problem of deciding on which information criterion to use
is very challenging. We consider the issue of seeking optimal behaviors of AIC and BIC in com-
peting scenarios by CV for estimating a univariate regression function based on the classical series
expansion approach. Both AIC and BIC can be applied to choose the order of the expansion. At
issue is the practically important question of which criterion should be used. We apply CV to
choose between AIC and BIC and show that, with a suitably chosen data splitting ratio, when the
true model is among the candidates, CV selects BIC with probability approaching one; and when
the true function is infinite-dimensional, CV selects AIC with probability approaching one. Thus
in terms of the selection probability, the composite criterion asymptotically behaves like the better
one of AIC and BIC for both the AIC and BIC territories.
For illustration, consider estimating a regression function on [0,1] based on series expansion. Let
$\{\phi_0(x) = 1, \phi_1(x) = \sqrt{2}\cos(2\pi x), \phi_2(x) = \sqrt{2}\sin(2\pi x), \phi_3(x) = \sqrt{2}\cos(4\pi x), \ldots\}$ be the orthonormal
trigonometric basis on $[0, 1]$ in $L_2(P_{X_1})$, where $P_{X_1}$ denotes the distribution of $X_1$.
The estimator considered here is $\hat{\mu}_{n,m}(x) = \sum_{j=0}^{m} \hat{\alpha}_j \phi_j(x)$, where $\hat{\alpha}_j = \frac{1}{n} \sum_{i=1}^{n} Y_i \phi_j(X_i)$ (so $\hat{\alpha}_0 = \bar{Y}_n$).
The order $m$ is to be chosen by a model selection criterion, aiming at
the best model in terms of the trade-off between the estimation error and the approximation error.
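A small Python sketch of the series estimator just described (our own illustration): the basis functions follow the ordering φ0 = 1, φ1 = √2 cos(2πx), φ2 = √2 sin(2πx), φ3 = √2 cos(4πx), ..., and the coefficients are the empirical inner products.

```python
import numpy as np

def trig_basis(x, m):
    """phi_0, ..., phi_m of the orthonormal trigonometric basis on [0, 1]."""
    cols = [np.ones_like(x)]
    k = 1
    while len(cols) <= m:
        cols.append(np.sqrt(2) * np.cos(2 * np.pi * k * x))
        if len(cols) <= m:
            cols.append(np.sqrt(2) * np.sin(2 * np.pi * k * x))
        k += 1
    return np.column_stack(cols)                   # n x (m+1) design of basis values

def series_estimate(x_train, y_train, m):
    """alpha_hat_j = (1/n) sum_i Y_i phi_j(X_i); returns the fitted mu_hat_{n,m}(.)"""
    Phi = trig_basis(x_train, m)
    alpha_hat = Phi.T @ y_train / len(y_train)
    return lambda x_new: trig_basis(x_new, m) @ alpha_hat
```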
Let $\|\cdot\|_p$ ($p \ge 1$) denote the $L_p$-norm with respect to the probability distribution of $X_1$ (or, later, of the covariate vector).
Assumption 0: The regression function µ has at least one derivative and satisfies
$$ \Big\| \sum_{j \ge m+1} \alpha_j \phi_j \Big\|_4 = O\Big( \Big\| \sum_{j \ge m+1} \alpha_j \phi_j \Big\|_2 \Big) \quad \text{and} \quad \limsup_{m \to \infty} \Big\| \sum_{j \ge m+1} \alpha_j \phi_j \Big\|_\infty < \infty, \qquad (9) $$
i.e., the $L_4$ and $L_2$ approximation errors are of the same order and the $L_\infty$ approximation error is bounded.
There is a technical nuisance that one needs to take care of. When the true regression function
is one of the candidate models, with probability going to 1, BIC selects the true model, but AIC
selects the true model with a probability that neither vanishes nor approaches one. Thus, there
is a non-vanishing probability that AIC and BIC actually agree, in which case we have a tie. We handle the tie as follows.
Let $\hat{m}_{n,AIC}$ and $\hat{m}_{n,BIC}$ be the models selected by AIC and BIC, respectively, at the sample size
$n$. We define the regression estimators in a slightly different way: $\hat{\mu}_{n,BIC}(x) = \sum_{j=0}^{\hat{m}_{n,BIC}} \hat{\alpha}_j \phi_j(x)$;
but for the estimator based on AIC, when AIC and BIC select the same model, $\hat{\mu}_{n,AIC}(x) = \sum_{j=0}^{\hat{m}_{n,AIC}+1} \hat{\alpha}_j \phi_j(x)$,
and otherwise $\hat{\mu}_{n,AIC}(x) = \sum_{j=0}^{\hat{m}_{n,AIC}} \hat{\alpha}_j \phi_j(x)$. This modification provides a
means to break the tie when AIC and BIC happen to agree with each other. Note that the
modification does not alter the asymptotic behavior of the estimators. In the nonparametric case, BIC is typically suboptimal in
the sense that there exists a constant $c > 1$ such that, with probability going to 1, we have
$\|\mu - \hat{\mu}_{n,BIC}\|_2 / \inf_{M \in \mathcal{M}} \|\mu - \hat{\mu}_{n,M}\|_2 \ge c$. In the parametric case, BIC is consistent in selection.
In the nonparametric case, asymptotic efficiency of AIC has been established in, e.g., Shibata
(1983), Li (1987), Polyak and Tsybakov (1990) and Shao (1997), while sub-optimality of BIC is
seen in Shao (1997) and Speed and Yu (1993). When the true regression function is contained in
at least one of the candidate models, BIC is consistent and asymptotically efficient but AIC is not.
In the following theorem and corollary, obtained on the estimation of the regression function on
the unit interval via trigonometric expansion under homoscedastic errors, delete-nv CV is performed
by $CV_a(\delta; \mathcal{S})$ with the size of $\mathcal{S}$ uniformly bounded, or by $CV_v(\delta; \mathcal{S})$ over an unrestricted number of data
splittings.
THEOREM 1 Consider the delete-nv CV with nt → ∞ and nt = o(nv ) to choose between AIC
and BIC. Suppose that $0 < E(\varepsilon_i^4 \mid X_i) \le \sigma^4$ holds almost surely for some constant $0 < \sigma < \infty$ for
all i ≥ 1 and that Assumptions 0-1 are satisfied. Then the CV method is consistent for selection
between AIC and BIC in the sense that when the true model is among the candidates, the probability
of BIC being selected goes to 1; and when the true regression function is infinite-dimensional, the probability of AIC being selected goes to 1.
Remarks:
1. We assumed above that µ(x) has at least one derivative. Without this condition, we may
need $n_v/n_t^2 \to \infty$ and $n_t \to \infty$ to guarantee consistent selection of the better model selection
method.
2. Regarding the modification of AIC, from our numerical work, with a large enough number of
data splittings, there are rarely ties between the CV errors of the AIC and BIC procedures.
So we do not think it is necessary in applications, and we actually used the regular AIC in the numerical work.
3. The restriction on the size of S to be uniformly bounded on the data splittings for CVa (δ; S)
is due to a technical difficulty in analyzing the sum of dependent CV errors over the data splittings.
COROLLARY 1 Let $\hat{\delta}$ be the procedure selected between AIC and BIC by the delete-$n_v$ CV. Under the same conditions as in Theorem 1, for both the parametric and nonparametric cases,
$$ \frac{\|\mu - \hat{\mu}_{n,\hat{\delta}}\|_2}{\inf_{M \in \mathcal{M}} \|\mu - \hat{\mu}_{n,M}\|_2} \to 1 \quad \text{in probability.} $$
From above, with the use of CV, the estimator becomes asymptotically optimal in an adaptive
fashion for both parametric and nonparametric cases. We can take nv /nt arbitrarily slowly increas-
ing to ∞ (e.g., log log n). As will be demonstrated, practically, nv /nt = 1 often works very well for
estimating the regression function for typical sample sizes, although there may be a small chance
of overfitting when the sample size is very large (which is not a major issue for estimation). Note
also that nv /nt = 1 yields the optimal-rate model averaging in general (e.g., Yang, 2001). Thus
we recommend delete-n/2 CV (both CVa and CVv ) for the purpose of estimating the regression
function. We emphasize that no member in the GIC family (including AIC and BIC) can have
the property in the above corollary. This shows the power of the approach of selecting a selection
method.
Therefore, for the purpose of estimating the regression function, the competition between AIC
and BIC in terms of who can achieve the (pointwise) asymptotic efficiency in the parametric
and nonparametric scenarios can be resolved by a proper use of CV. It should be emphasized
that this does not indicate that the conflict between AIC and BIC in terms of achieving model
selection consistency and minimax-rate optimal estimation of the regression function can be successfully addressed, which, in fact, is impossible by any means
(Yang, 2005).
It should be pointed out that we have focused on homoscedastic errors in this paper. With
heteroscedasticity, it is known that AIC is no longer generally asymptotically optimal in the non-
parametric case but leave-one-out CV is (Andrews, 1991). It remains to be seen if the delete-nv
CV can be used to choose between LOO and BIC to achieve asymptotic optimality adaptively over the parametric and nonparametric scenarios in this setting.
Finally, we mention that there have been other results on combining the strengths of AIC and
BIC together by adaptive model selection methods in Barron, Yang and Yu (1994) via an adaptive
use of the minimum description length (MDL) criterion, Hansen and Yu (1999) by a different use
of MDL based on a pre-test, George and Foster (2000) based on an empirical Bayes approach,
Yang (2007a) by examining the history of BIC at different sample sizes, Ing (2007) by choosing
between AIC and BIC through accumulated prediction errors in a time series setting, Liu and Yang
(2011) by choosing between AIC and BIC using a parametricness index, and van Erven, Grünwald
and de Rooij (2012) using a switching distribution to encourage early switch to a larger model
in a Bayesian approach. Shen and Ye (2002) and Zhang (2009) propose adaptive model selection methods as well.
4 Selecting a modeling procedure for high dimensional regression
In this section we investigate the relationship between the splitting ratio and the performance of CV
with respect to consistent procedure selection for high dimensional regression where the true model
and/or model space grow with the sample size. Our main interest is to highlight the requirement
of the data splitting ratio for different situations using relatively simple settings to avoid blurring
the main picture with complicated technical conditions necessary for more general results.
The definition of one procedure being asymptotically better than another in Yang (2007b) is
intended for the traditional regression setting and needs to be generalized for accommodating the
high-dimensional case. Consider two modeling procedures δ1 and δ2 for estimating the function µ.
DEFINITION 1 The procedure $\delta_1$ (or $\{\hat{\mu}_{n,\delta_1}\}_{n=1}^{\infty}$, or simply $\hat{\mu}_{n,\delta_1}$) is asymptotically $\xi_n$-better than $\delta_2$ (or $\{\hat{\mu}_{n,\delta_2}\}_{n=1}^{\infty}$, or $\hat{\mu}_{n,\delta_2}$) under the $L_2$ loss if
for every $0 < \epsilon < 1$, there exists a constant $c_\epsilon > 0$ such that when $n$ is large enough,
$$ P\big( \|\mu - \hat{\mu}_{n,\delta_2}\|_2^2 \ge (1 + c_\epsilon \xi_n^2) \|\mu - \hat{\mu}_{n,\delta_1}\|_2^2 \big) \ge 1 - \epsilon. \qquad (10) $$
When $\xi_n \to 0$, the performances of the two procedures may be very close and then hard to distinguish. Taking $\xi_n$ to be a positive constant in
Definition 1 above, we recover the definition used by Yang (2007b) for comparing procedures. For
high dimensional regression, however, we may need to choose ξn → 0 in some situations, as will be
seen later. Note also that in the definition, there is no need to consider ξn of a higher order than 1.
DEFINITION 2 A procedure $\delta$ (or $\{\hat{\mu}_{n,\delta}\}_{n=1}^{\infty}$) is said to converge exactly at rate $\{a_n\}$ in probability
under the $L_2$ loss if $\|\mu - \hat{\mu}_{n,\delta}\|_2 = O_p(a_n)$, and for every $0 < \epsilon < 1$, there exists $c'_\epsilon > 0$ such that
when $n$ is large enough, $P\big( \|\mu - \hat{\mu}_{n,\delta}\|_2 \ge c'_\epsilon a_n \big) \ge 1 - \epsilon$.
4.1 A general theorem
Suppose there are a finite number of procedures in Λ. Consider a procedure δ ∈ Λ that produces an estimator $\hat{\mu}_{n,\delta}$ converging exactly at rate $a_{n,\delta}$ in probability, and consider the following conditions.
• Condition 0. The error variances $E(\varepsilon_i^2 \mid x)$ are upper bounded by a constant $\sigma^2 > 0$ almost surely.

• Condition 1. There exists a sequence of positive numbers $A_n$ such that for each procedure
$\delta \in \Lambda$, $\|\mu - \hat{\mu}_{n,\delta}\|_\infty = O_p(A_n)$.

• Condition 2. Under the $L_2$ loss, for some $\xi_n > 0$, one of the procedures is asymptotically
$\xi_n$-better than the other procedures in $\Lambda$.

• Condition 3. There exists a sequence of positive numbers $\{D_n\}$ such that for $\delta \in \Lambda$,
$\|\mu - \hat{\mu}_{n,\delta}\|_4 / \|\mu - \hat{\mu}_{n,\delta}\|_2 = O_p(D_n)$.
Let an denote the minimum of an,δ over δ ∈ Λ, except that the best procedure is excluded.
Clearly, $a_n$ describes the closest performance of the competing procedures to the best. Let $\mathcal{S}$ be a collection of data splittings at the given ratio $n_t : n_v$.

THEOREM 2 Suppose Conditions 0-3 are satisfied. If the data splitting satisfies

i. $n_v \to \infty$ and $n_t \to \infty$;

ii. $n_v D_{n_t}^{-4} \to \infty$;

iii. $\sqrt{n_v}\, \xi_{n_t} a_{n_t} / \big( 1 + A_{n_t} \big) \to \infty$,
then the delete-$n_v$ $CV_v$ is consistent for any set $\mathcal{S}$, i.e., the best procedure is selected with probability going to 1, and consequently
$$ \frac{\|\mu - \hat{\mu}_{n,\hat{\delta}}\|_2}{\inf_{\delta \in \Lambda} \|\mu - \hat{\mu}_{n,\delta}\|_2} \to 1 \quad \text{in probability.} $$
If the size of S is uniformly bounded, then CVa has the same asymptotic properties as CVv above.
Remarks:
1. Requirement ii in Theorem 2 demands that the evaluation size $n_v$ be large enough to avoid
possible trouble in identifying the best candidate due to excessive variation of the prediction
errors. Requirement iii is of the following essence: it basically says that the data splitting ratio should make $n_v$ large and (consequently)
$n_t$ small enough so that the second best convergence rate at the reduced sample size $n_t$, i.e.,
$a_{n_t}$, is "magnified" enough so as to make the performance difference between the best and the second best procedures detectable.
2. Consider the case that $A_n$ and $D_n$ are bounded. For $CV_v$, as long as the data splitting ratio
satisfies $\sqrt{n_v}\, \xi_{n_t} a_{n_t} \to \infty$, it is selection consistent, regardless of how many data splittings
are done. For the usual k-fold CV with k fixed (a special case of $CV_a$), if the constant data
splitting ratio $(k-1):1$ satisfies the same condition, i.e., $\sqrt{n}\, \xi_n a_n \to \infty$, then it is consistent
in selection. However, when $\sqrt{n}\, \xi_n a_n$ stays bounded, the k-fold CV is not expected to be consistent in selection.
3. Note also that in the case of $CV_v$, the theorem generalizes Theorem 2 of Yang (2007b) in terms of allowing $\xi_n \to 0$, as needed in the high-dimensional case.
4. It is worthwhile to point out that although we have focused on the selection of model
selection methods by CV in the motivation of this work, Theorem 2 is equally applicable for comparing general modeling procedures.
5. The set of sufficient conditions on data splitting of CV in Theorem 2 for selection consistency
has not been shown to be necessary. We tend to think that when An and Dn are bounded and
ξn (taken as large as possible) properly reflects the relative performance of the best procedure
over the rest, the resulting requirement of $n_v \to \infty$, $n_t \to \infty$ and $\sqrt{n_v}\, \xi_{n_t} a_{n_t} \to \infty$ may well be necessary as well.
6. Conditions 1 and 3 are basically always satisfied. But what is important here is the orders
of magnitude of $A_n$ and $D_n$, which affect the sufficient requirement on the data splitting ratio to ensure the selection consistency.
In the high-dimensional regression case, the number of features pn is typically assumed to increase
with n and the true model size qn may also grow. We need to point out that Yang (2007b) deals with
the setting that the true regression function is fixed when there are more and more observations.
In the new high-dimensional regression setting, the true regression function may change with n.
The theorems in Yang (2007b) and in the present paper help us understand some key differences
1. In the traditional case, the estimator based on the true model is asymptotically better than
that based on a model with extra parameters according to the definition in Yang (2007b). But
the definition does not work for the high-dimensional case, hence the new definition (Definition
1). Indeed, when directly comparing the true model of size qn → ∞ and a larger model with
$\Delta q_n$ extra terms, the estimator of the true model is asymptotically $\sqrt{\Delta q_n / q_n}$-better than
the larger model. Clearly, if ∆qn is bounded, then the true model is not asymptotically
better under the definition in Yang (2007b). Based on the new sufficient result in this paper,
$n_v \left( \frac{\Delta q_n}{q_n} \right) \left( \frac{q_n + \Delta q_n}{n_t} \right) \to \infty$ is adequate for CV to work. There are different scenarios for the resulting requirement:
(a) ∆qn is bounded. Then nv /nt → ∞ is sufficient.
2. In the traditional parametric regression case, the true model is fixed. An estimator of µ(x)
based on a sensible model selection method (e.g., AIC or BIC) converges (in a point-wise
fashion) at the rate 1/n (under the squared error loss), which is also the minimax rate of convergence. The situation is different in the high-dimensional case.
Indeed, the minimax-rate of convergence is now well understood under both hard (strong)
sparsity (i.e., there are only a few non-zero coefficients) and soft sparsity (i.e., the coefficient
vector has a bounded $\ell_p$-norm for some $0 < p \le 1$); see Wang et al. (2014) for the most recent
results and earlier references. Even when the true model size is fixed, when $p_n$ increases,
the minimax-rate of convergence is at least $\sqrt{\log(p_n)/n}$ (assuming $\log p_n = O(n)$), which is
slower than $1/\sqrt{n}$. A consequence is that for the high-dimensional case, if we compare a given
linear model with a high-dimensional sparse regression model, it suffices to have $n_v$ and $n_t$ of the same order.
4.3 Applications
We consider several specific examples and provide an understanding on how CV should be applied
in each case.
One procedure, say, δ1 , targets the situation that the true regression function is a sparse linear
function in the features, i.e., $\mu(x_1, \cdots, x_{p_n}) = \sum_{j \in J_0} \beta_j x_j$, where $J_0$ is a subset of $\{1, 2, ..., p_n\}$ of
size $q_n$. We may take an adaptive estimator based on model selection, e.g., the one in Wang et al. (2014),
that automatically achieves the minimax optimal rate qn (1 + log(pn /qn ))/n ∧ 1 without knowing
qn .
The other procedure, say, δ2 , is based on a sparse nonparametric additive model assumption,
i.e., $\mu(x_1, \cdots, x_{p_n}) = \sum_{j \in J_1} \beta_j \psi_j(x_j)$, where $J_1$ is a subset of $\{1, 2, ..., p_n\}$ of size $d_n$ and $\psi_j(x_j)$ is
a univariate function in a class with $L_2$ metric entropy of order $\epsilon^{-1/\alpha}$ for some $\alpha > 0$. Raskutti
et al. (2012) construct an estimator based on model selection that achieves the rate
$$ \left( d_n (1 + \log(p_n/d_n))/n + d_n n^{-\frac{2\alpha}{2\alpha+1}} \right) \wedge 1. $$
Under the sparse linear model assumption, δ2 is conjectured to typically still converge at the
above displayed rate and is suboptimal. When the linear assumption fails but the additive model
assumption holds, δ1 does not converge at all. Since we do not know which assumption is true, we use CV to choose between the two procedures.
From Theorem 2, if pn → ∞, it suffices to have both nt and nv of order n. Thus any fixed data
splitting ratio, e.g., half-half, works fine theoretically. Note also that the story is similar when the additive model assumption holds instead of the linear one.
Suppose that an economic theory suggests a parametric regression model on the response that
depends on a few known covariates. With availability of big data and high computing power, many
possibly relevant covariates can be considered for prediction purpose. High-dimensional model
selection methods can be used to search for a sparse linear model as an alternative. The question
In this case, when the parametric model holds, the estimator converges at the parametric rate
with $L_2$ loss of order $1/\sqrt{n}$, but the high-dimensional estimator converges more slowly, typically at
least by a factor of $\sqrt{\log p_n}$. In contrast, if the parametric model fails to take advantage of useful
information in other covariates but the sparse linear model holds, the parametric estimator does
not converge to the true regression function while the high-dimensional alternative does.
In this case, from Theorem 2, it suffices to have $n_v$ at an order larger than $n/\log(p_n)$. In particular, any fixed splitting ratio (e.g., half-half) satisfies this requirement.
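Heuristically, this order can be read off from requirement iii of Theorem 2 (a rough sketch of ours, ignoring constants and assuming $A_{n_t}$ stays bounded): when the parametric model holds, the parametric estimator converges at rate $1/\sqrt{n}$ and the high-dimensional alternative at rate roughly $\sqrt{\log(p_n)/n}$, so one may take $\xi_{n_t}$ to be a constant and $a_{n_t} \asymp \sqrt{\log(p_n)/n_t}$. Requirement iii then reads
$$ \sqrt{n_v}\, \xi_{n_t} a_{n_t} \asymp \sqrt{\frac{n_v \log p_n}{n_t}} \to \infty \quad \Longleftrightarrow \quad \frac{n_v \log p_n}{n_t} \to \infty, $$
which holds whenever $n_v$ is of order larger than $n/\log(p_n)$.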
Consider a path generating method that asymptotically contains the true model of size qn on the
path of sequentially nested models. To select a model on the path obtained based on separate data,
we use CV. From Section 4.2, with a finite solution path, nv /nt → ∞ guarantees against overfitting.
As for under-fitting, assuming that the true features are nearly orthonormal, a missing coefficient β
causes squared bias of order $\beta^2$. To make the true model have a better estimator than that from a
sub-model, it suffices to require β to be at least a large enough multiple of $\sqrt{\log(p_n)/n}$. Then, with the splitting ratio above, CV guards against under-fitting as well.
5 Simulations
In the simulations below, we primarily study the selection, via cross-validation, among modeling
procedures that include both model selection and parameter estimation. Since CV with averaging
is much more widely used in practice than CV with voting and they exhibit similar performance
(sometimes slightly better for CVa) in our experiments, all results presented in Sections 5, 6 and
7 are of CV with averaging. In each replication, |S| = S = 400 random splittings are performed to compute the CV errors. The feature vectors are
generated from the multivariate normal distribution with mean 0 and an AR(1) covariance matrix
with marginal variance 1 and autocorrelation coefficient ρ, independently. Two values of ρ, −0.5
and 0.5, are examined. The responses are generated from the model
$$ Y_i = \sum_{j=1}^{p_n} \beta_j X_{i,j} + \varepsilon_i, \qquad (11) $$
where the $\varepsilon_i$'s ($i = 1, \cdots, n$) are iid $N(0, 1)$ and $\beta = (\beta_1, \ldots, \beta_{p_n})^T$ is a $p_n$-dimensional coefficient vector specified below.
In this subsection the performances of CV at different splitting ratios are investigated in both
parametric and (practically) nonparametric settings. Let n = 1000 and pn = 20. Three information
criteria AIC, BIC and BICc (λ = log n + log log n) are considered. Our goal here is not to be
comprehensive. Instead, we try to capture archetype behaviors of the CV’s (at different splitting
ratios) under parametric and nonparametric settings, which offer insight on this matter. In each setting, 1000 replications are carried out.
The cross-validation error is calculated in two steps. Firstly, the training set including nt
sample points is generated by random subsampling without replacement and the remaining nv
observations are put into the validation set Iv . We define τ = nv /n as the validating proportion.
Twenty validating proportions equally spaced between (pn + 5)/n and (n − 5)/n are tested. In
the second step, each modeling procedure δ among the three candidates AIC, BIC and BICc is
fitted on the training set, and the validation error is calculated.
The above two steps are repeated 400 times through random subsampling and their average
for each criterion is its final CV error (6). The criterion attaining the minimal final CV error is
selected.
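A compact, self-contained Python sketch of this two-step experiment under simplified assumptions of ours (nested candidate models consisting of the leading covariates, a crude variance proxy in the criterion, and 100 rather than 400 splits); it records, for several validating proportions τ, which of the three criteria the averaged CV error favors.

```python
import numpy as np

def ic_fit(lam):
    """Return a procedure: choose the number of leading covariates by
    RSS + lam(n)*k*sigma2 (a stand-in for AIC/BIC/BICc over nested models),
    then refit the chosen model by least squares."""
    def fit(Xt, Yt):
        n, p = Xt.shape
        sigma2 = np.var(Yt)                        # crude variance proxy, for illustration only
        best_k, best_val = 1, np.inf
        for k in range(1, p + 1):
            c, *_ = np.linalg.lstsq(Xt[:, :k], Yt, rcond=None)
            val = np.sum((Yt - Xt[:, :k] @ c) ** 2) + lam(n) * k * sigma2
            if val < best_val:
                best_k, best_val = k, val
        coef, *_ = np.linalg.lstsq(Xt[:, :best_k], Yt, rcond=None)
        return lambda Xn: Xn[:, :best_k] @ coef
    return fit

procedures = {"AIC": ic_fit(lambda n: 2.0),
              "BIC": ic_fit(lambda n: np.log(n)),
              "BICc": ic_fit(lambda n: np.log(n) + np.log(np.log(n)))}

rng = np.random.default_rng(1)
n, p = 1000, 20
X = rng.normal(size=(n, p))
Y = 2 * X[:, 0] + 2 * X[:, 1] + rng.normal(size=n)   # a parametric-type scenario

for tau in (0.25, 0.5, 0.75):                      # validating proportions to compare
    n_v = int(tau * n)
    errs = {name: [] for name in procedures}
    for _ in range(100):                           # random subsampling splits
        perm = rng.permutation(n)
        I_v, I_t = perm[:n_v], perm[n_v:]
        for name, proc in procedures.items():
            pred = proc(X[I_t], Y[I_t])
            errs[name].append(np.mean((Y[I_v] - pred(X[I_v])) ** 2))
    winner = min(errs, key=lambda k: np.mean(errs[k]))
    print(tau, winner)
```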
In the two contrasting scenarios, the effects of τ on i) the distribution of the difference of the CV
errors of any two competitors; ii) the probability of selecting the better criterion; iii) the resulting
estimation efficiency: for each pair of criteria, the smaller MSE of the two over that based on the CV
selection are presented in Figures 1 and 2, displayed on the first three rows (the individual values
over the 1000 replications, mean and standard deviation), the 4th row and 5th row, respectively.
Here we take (β1, β2) = (2, 2) and βj = 0 (3 ≤ j ≤ 20), and BICc beats the other two criteria in this parametric scenario.
From the plots of AIC vs. BIC and AIC vs. BICc in Figure 1, the performance of CV in terms
of the proportion of identifying the better procedure (i.e., the larger λn in this case) and the comparative
efficiency experiences a two-phase process: it improves and then stays flat as the validating proportion
τ goes up from 0 to 1. Once τ is above 50%, the proportion of selecting the better procedure by CV
is close to 1. In the plot of BIC vs. BICc, the proportion of selecting the better procedure and the
comparative efficiency increase slightly from 95% to 1 across different levels of splitting ratios due
to the smaller difference between the two penalty coefficients in contrast to the other two pairs.
Another observation is that the mean of the CV error difference experiences a two-phase process,
a slight increase as the validating proportion τ is less than 90% followed by a sharp increase as τ
goes above 90%. The standard deviation of CV error difference experiences a three-phase process,
sharp decrease, slight decrease and jump-up. The data splitting ratio plays a key role here: the
increase of validating size smoothes out the fluctuations of the CV errors, but when the training
size is below some threshold, the parameter estimation errors become quite wild and cause trouble in the assessment.
Now we take βj = 1/j (j = 1, · · · , 20), where, with pn fixed at 20 and n not very large (e.g., around
1000), AIC tends to outperform the other two criteria. This is a "practically nonparametric" scenario.
Figure 2 about here.
In this scenario, the proportion of selecting the better procedure (i.e., the smaller λn here) exhibits different patterns than in the parametric scenario.
Though the sample standard deviation of CV error difference exhibits similar patterns, the mean of
CV error difference between two procedures increases from a negative value (which is the good sign
to have here) to a positive value, whereas in the parametric scenario the sign does not change. In
nonparametric frameworks, when the validating proportion τ is above 80%, the best model at the full
sample size n suffers from the reduced training sample size more than the underfitting model does, due to large parameter
estimation error. As a result, the comparative efficiency and the proportion of selecting the better procedure decline.
In summary of the illustration, the half-half splitting CV with S = 400 splittings selected the
better procedures with almost 100 percent chance between any two competitors considered here in
both data generating scenarios. This is certainly not expected to be true always, but our experience suggests that the half-half splitting is often a reliable choice.
In this section we look into the performance of delete-n/2 CV with S = 400 splittings to combine
the power of various procedures in traditional and high dimensional regression settings. As a comparison, several other versions of CV are also examined.
The final accuracy of each regression procedure is measured in terms of the L2 loss, which is
calculated as follows. Apply a candidate procedure δ to the whole sample and use the selected
model $\hat{M}_\delta$ to estimate the mean function at 10,000 sets of independently generated features from
the same distribution. Denote the estimates and the corresponding true means by $\hat{Y}_i^P(\hat{M}_\delta)$ and $\mu'_i$
($i = 1, \cdots, 10000$), respectively. The squared loss then is
$$ \mathrm{Loss}(\delta) = \frac{1}{10000} \sum_{i=1}^{10000} \big( \mu'_i - \hat{Y}_i^P(\hat{M}_\delta) \big)^2, \qquad (12) $$
which simulates the squared L2 loss of the regression estimate by the procedure. The square loss
of any version of CV is the square loss of the final estimator when using CV for choosing among
the model selection methods. The risks of the competing methods are the respective average losses over the replications.
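A minimal sketch of the loss computation (12) for a fitted predictor; for simplicity the illustration draws iid standard normal features, whereas the study uses AR(1)-correlated normal features, and the function name is ours.

```python
import numpy as np

def squared_l2_loss(predict, mu, rng, n_test=10000, p=15):
    """Approximate the squared L2 loss (12) of a fitted predictor by Monte Carlo:
    average squared error against the true mean at freshly drawn features
    (iid standard normal here, purely for illustration)."""
    X_new = rng.normal(size=(n_test, p))
    return np.mean((mu(X_new) - predict(X_new)) ** 2)
```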
In this subsection we compare the predictive performances of AIC, BIC and BICc with different
versions of CV’s in terms of the average of squared L2 loss in (12). The data are generated by
$$ Y_i = \sum_{j=1}^{15} \beta_j X_{i,j} + \varepsilon_i, \quad \text{where } \beta_j = 0.25/j \ (1 \le j \le 10), \ \beta_j = 0 \ (11 \le j \le 15), \qquad (13) $$
where $X_{i,j}$ and $\varepsilon_i$ are simulated by the same method as before. Three different sample sizes, n = 100, 10,000 and 500,000, are considered.
As shown by Table 1, when the sample size is small or extremely large, BICc outperforms
the other two competitors (at n = 100) or performs the best (tied with BIC) (at n = 500,000)
and delete-n/2 and delete-0.8n CV’s have similar performance to BICc , whereas for moderately
large sample size (n=10,000) AIC dominates others and delete-n/2 CV shares similar performance
with AIC. In summary, over the three model selection criteria, CV equipped with half-half splitting
impressively achieves adaptivity in hiring the unknown “top gun” to various settings of sample size,
signal-to-noise ratio and design matrix structure (some results are not presented here). Other types
of CV perform generally worse than the delete-n/2 CV. Although the true model is parametric (but
practically nonparametric at n = 10, 000), based on the results in Liu and Yang (2011), it is not
surprising at all to see that the relative performance between AIC and BIC switches direction twice as the sample size increases.
In this subsection we compare the predictive performances, in terms of the average squared L2 loss
in (12), of SCAD (Fan and Li, 2001), MCP (Zhang, 2010), LASSO (Tibshirani, 1996), Stepwise
regression plus RIC (Foster and George, 1994), that is, λn = 2 log pn in (8) (STRIC), against the
delete-n/2 CV used to choose one of them with S = 400 splittings (some other versions of CV are also included for comparison). The data are generated by
$$ Y_i = \sum_{j=1}^{p_n} \beta_j X_{i,j} + \varepsilon_i, \quad \beta_j = 6/j \ (1 \le j \le q_n), \ \beta_j = 0 \ (q_n + 1 \le j \le p_n), \qquad (14) $$
where $X_{i,j}$ and $\varepsilon_i$ are simulated by the same method as before. We examine three different values of $q_n$ (1, 5 and 10).
As shown by Table 2, when qn = 1, ρ = ±0.5 and qn = 5, ρ = −0.5, SCAD and MCP outperform
the other two competitors, which tend to include some redundant variables, and delete-n/2 and
delete-0.8n CV’s have similar performance to SCAD and MCP while delete-0.2n and 10-fold CV’s
fail to select the best procedure. When qn = 10, ρ = ±0.5, STRIC dominates the others, and
SCAD and MCP tend to exclude some relevant variables. The delete-n/2, delete-0.2n and 10-fold CV's perform similarly to STRIC in this case.
It is worth noting that when qn = 5 and ρ = 0.5, delete-n/2 CV outperforms all four original
candidate procedures (SCAD, MCP, LASSO and STRIC) significantly by making a “smart selec-
tion”. Examining the output of the 500 replications we found that the data, which are generated
randomly and independently in each replication, exhibit a parametric pattern in some cases such
that MCP performs best, and a nonparametric pattern in other cases such that STRIC shows
great advantages. Overall, delete-n/2 CV picks up the best procedure adaptively and on average it
achieves lower loss than the best one of all the four candidates. This "smart selection" phenomenon
highlights the advantage of applying CV at the level of procedure selection, where a procedure typically includes two steps: model selection and parameter estimation. Moreover, the
simulations confirm the theoretical results in Sections 3 and 4 and provide guidance on the splitting
ratio choice, i.e., half-half splitting tends to work very well for the selection of optimal procedures
across various settings. In general high-dimensional cases, the different model selection criteria may
have drastically different performances, as seen in the present example, and a consistent choice of
the best among them, as done by the half-half splitting CV, is a practically important way to move forward.
6 A real data example

Physical constraints on the production and transmission of electricity make it the most volatile
commodity. For example, in the city of New York, the price at peak hours of a hot and humid
summer day can be a hundred times the lowest level. Therefore, financial risk management is often
a high priority for participants in deregulated electricity markets due to the substantial price risks.
The cost of supplying the next megawatt of electricity determines its price in the wholesale
market. Take the regional power market of New York as an example: it has roughly four hundred
locations (i.e., nodes) with different prices due to local supply and demand. When two close nodes
are connected by a high-voltage transmission line, they tend to share similar prices because of the
low transmission cost between them. Power market participants face unique risks from the price
volatility. So modeling prices across nodes is essential to prediction and risk hedging. The data
we have here cover 423 nodes (pn = 423) and 422 price observations per node (n = 422). In the
absence of additional information (such as distance between nodes and their connectivity), the goal
here is to estimate the price at one node from those at the other nodes via linear modeling (the unit of the response is dollars
per megawatt). This is a high dimensional linear regression problem that makes adaptive model
selection challenging. In fact, we will show next that different selection criteria picked very different
models.
We compare delete-n/2 CV with MCP, SCAD, LASSO, and STRIC with respect to predictive
performances in three steps. Firstly, the 422 observations are randomly divided into an estimation
set and a final evaluation set according to two pre-defined ratios, 75:25 and 25:75. Four models are
chosen by MCP, SCAD, LASSO and STRIC, respectively, from the estimation set and then used
to make predictions on the evaluation set. Secondly, a procedure is selected by delete-n/2 CV from
these four candidate procedures, where the delete-n/2 CV is implemented by evenly splitting the
estimation set into a training part and a validation part in 400 subsampling rounds (i.e., S = 400).
A model is thus developed by the delete-n/2 CV procedure from the estimation set and then used
to make predictions on the final evaluation set. The prediction error is the average of squared L2
loss at each node in the final evaluation set. Finally, repeat the above two steps 100 times for
the two ratios 75:25 and 25:75 and the average of square root of prediction errors based on 500
replications is displayed in the following table for each of the five procedures. The “permutation
standard error” (which is not really a valid standard error of the CV error due to the dependence across the random splittings) is also reported.
When the estimation set is small (25:75), SCAD and MCP exhibit much larger “permutation
standard errors” because the high correlations among the features (nodes) and small estimation
sample size caused a few large prediction errors in the 500 replications for the two methods, while
LASSO was more stable. Overall, the delete-n/2 CV procedure yields the best predictive accuracy.
7 Misconceptions on the use of CV
Much effort has been made on proper use of CV (see, e.g., Hastie et al, 2009, Chapter 7.10; Arlot
and Celisse, 2010 for a comprehensive review). Unfortunately, some influential work in the literature
that examines CV methods, while making important points, does not clearly distinguish different
goals and thus draws inappropriate conclusions. For instance, regarding which k-fold CV to use,
Kohavi (1995) focused only on accuracy estimation in all the numerical work, but the observations
there (which will be discussed later) are often directly passed onto model selection. The work
has been very well-known and the recommendation there that the best method to use for model
selection is 10-fold CV has been followed by many in computer science, statistics and other fields.
In another direction, we have seen publications that use LOO CV on real data to rank parametric models or modeling procedures.
Applying CV without factoring in the objective can be a very serious mistake. There are three
main goals in the use of CV. The first is for estimating the prediction performance of a model or a
modeling procedure. The second and third goals are often both under the same name of “CV for
model selection”. However, there are different objectives of model selection, one as an internal step
in the process of producing the final estimator, the other as for identifying the best candidate model
or modeling procedure. With this distinction spelled out, the second use of CV is to choose a tuning
parameter or a model as an intermediate step, with the end goal of producing the best estimator (see e.g., van der Vaart et al. (2006) for estimation bounds
in this direction). The third use of CV is to figure out which model/modeling procedure works the best for the data at hand.
The second and third goals are closely related. Indeed, the third use of CV may be applied for
the second goal, i.e., the declared best model/modeling procedure can be then used for estimation.
Note that this type of application, with a proper data splitting ratio, results in asymptotically
optimal performance in estimation, as shown in Sections 3 and 4 in the present paper. A caveat is
that this asymptotic optimality may not always be satisfactory. For instance, when selecting among
parametric models in a “practically non-parametric” situation with the fixed true model being one
of the candidates, a model selection method built for the third goal (such as BIC) may perform
very poorly for the second goal (see, e.g, Shao, 1997; Liu and Yang, 2011).
In the reverse direction, the best CV for the second goal does not necessarily imply the achieve-
ment of the third goal. For instance, in the nonparametric regression case, the LOO CV is asymp-
totically optimal for selecting the order of nested models (e.g., Li, 1987), but it is not true that
the selected model agrees with the best choice with probability going to 1. Indeed, to achieve the
asymptotic optimality in estimation, one does not have to be able to identify the best candidate.
As seen in Section 4, in high-dimensional linear regression with the true model dimension qn → ∞,
an overfitting model with a bounded number of extra terms performs asymptotically as well as
the true model. Furthermore, identifying the best choice with probability going to 1 may lead to a loss in estimation efficiency.
The following misconceptions are frequently seen in the literature, even up to now.
7.1 “Leave-one-out (LOO) CV has smaller bias but larger variance than leave-more-out CV”
This view is quite popular. For instance, Kohavi (1995, Section 1) states: “For example, leave-one-
out is almost unbiased, but it has high variance, leading to unreliable estimates”.
The statement, however, is not generally true. In fact, in least squares linear regression, Burman
(1989) shows that among the k-fold CVs, in estimating the prediction error, LOO (i.e., n-fold CV)
has the smallest asymptotic bias and variance. For k < n, if all possible removals of n/k observations
are considered (instead of a single k-fold CV), is the error estimation variance then smaller than
that from LOO? The answer is No. As an illustration, consider the simplest regression model:
Yi = θ + εi , where θ is the only mean parameter and εi are iid N (0, σ 2 ) with σ > 0. Then a
theoretical calculation (Lu, 2007) shows that LOO has the smallest bias and variance at the same
time among all delete-nv CVs with all possible nv deletions considered.
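The following Monte Carlo sketch (ours, not the exact calculation in Lu (2007)) illustrates the point in this simple intercept-only model: it approximates the bias (relative to the prediction error of the full-sample fit) and the variance of the delete-nv CV estimate, with all possible removals of nv observations averaged.

```python
import numpy as np
from itertools import combinations

def cv_estimate(y, n_v):
    """Delete-n_v CV estimate of the prediction error for Y_i = theta + eps_i,
    averaging over all possible removals of n_v observations."""
    n = len(y)
    errs = []
    for I_v in combinations(range(n), n_v):
        I_t = np.setdiff1d(np.arange(n), I_v)
        theta_hat = y[I_t].mean()                      # fit on the training part only
        errs.append(np.mean((y[list(I_v)] - theta_hat) ** 2))
    return np.mean(errs)

rng = np.random.default_rng(0)
n, sigma = 12, 1.0
true_pe = sigma ** 2 * (1 + 1 / n)                     # prediction error of the full n-point fit
for n_v in (1, 2, 5):
    ests = [cv_estimate(rng.normal(scale=sigma, size=n), n_v) for _ in range(500)]
    print(n_v, np.mean(ests) - true_pe, np.var(ests))  # bias (w.r.t. full-sample fit) and variance
```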
A simulation is done to gain a numerical understanding. The data are generated by (11) with n = 50, and the design
matrix is generated the same way as in Section 5.1 but with ρ = 0. The delete-nv (nv = 1, 2 and
5) CV’s are compared through 10,000 independent replications. Three cases are considered: CV
estimate of the prediction error for the true model (with the parameters estimated), for the model
selection of AIC over all subset models, and for the model selection by BIC. In each replication, the
CV error estimate is the average over all $\binom{n}{n_v}$ splittings for $n_v = 1$ and 2, and over 1000 random subsamplings for
nv = 5. The theoretical mean prediction error at the sample size (n = 50), as well as the bias and
variance of the CV estimator of the true mean prediction error in each case are simulated based
on the 10,000 runs. The standard errors (in the parentheses) for each procedure are also reported
(the delta method is used for the standard error of the variance estimate).
As a comparison, we report the average of permutation variance, which refers to the sample
variance of the CV errors over different data splittings in each data generation of 50 observations.
It is worth pointing out that the permutation variance is sometimes mistaken as a proper estimate
of the variance of the CV error estimate. We also present the average pairwise differences of bias
and variance, with standard errors, between delete-1 and delete-2 to make it clear that the observed differences are statistically meaningful.
As revealed by the above table, as expected, the bias of delete-nv CV errors is increasing in nv
in all cases. The variance exhibits more complex patterns: for the true model, LOO in fact has
the smallest variability; but for the AIC procedure, the variance decreases in nv (for nv ≤ 5); in
the case of BIC, the variance first decreases and then increases as nv goes up from 1 to 5. In the
example, LOO still has smaller MSE for the AIC procedure compared to the delete-5 CV.
Note that the permutation variance, which is not what one should care about regarding the
choice of a CV method, consistently has the pattern of decreasing in nv. This deceptive monotonicity may well be a contributor to the afore-stated misconception. See Arlot and Celisse (2010,
Section 5) for papers that link instability of modeling procedures to the CV variability.
7.2 “Better estimation (e.g., in bias and variance) of the prediction error by CV means better model selection”
This seemingly obviously correct statement is actually false! To put the issue in a slightly different
setup, suppose that a specific data splitting ratio (nt : nv , with nt + nv = n) works very well to tell
apart correctly two competing models. Now suppose we are given another n iid observations from
the same population. Suppose we put all the new n observations into estimation (i.e., with training size
now n + nt) and use the same number of observations (i.e., nv) as before for validation. With the
obviously improved estimation capability and unchanged validation capability, we should do better
in comparing the two models, right? Wrong! This is the cross validation paradox (Yang, 2006).
The reason is that prediction error estimation and comparing models/procedures are drastically
different targets. For the latter, when comparing two models/procedures that are close to each
other, the improved estimation capability by having more observations in the estimation part only
makes the models/procedures more difficult to distinguish. The phenomenon that nv needs to
be close to n for consistent model selection in linear regression was first discovered by Shao (1993).
In the context of comparing two procedures (e.g., a parametric estimator and a kernel estimator),
this high demand on nv may not be necessary, as shown in Yang (2006, 2007b). The present
work provides a more general result suitable for both traditional and high-dimensional regression
settings.
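A small simulation sketch of the paradox under assumptions of our own (a true one-variable linear model versus an overfitting two-variable model, compared by a single random split): enlarging the training part while holding the validation size fixed does not make the comparison more reliable, because the two fitted models become ever closer to each other.

```python
import numpy as np

def correct_selection_rate(n_t, n_v, n_rep=2000, seed=0):
    """Frequency with which a single n_t / n_v split correctly prefers the true
    one-variable model over an overfitting two-variable model (true beta_2 = 0)."""
    rng = np.random.default_rng(seed)
    correct = 0
    for _ in range(n_rep):
        n = n_t + n_v
        X = rng.normal(size=(n, 2))
        Y = X[:, 0] + rng.normal(size=n)           # the second covariate is irrelevant
        Xt, Yt, Xv, Yv = X[:n_t], Y[:n_t], X[n_t:], Y[n_t:]
        b1, *_ = np.linalg.lstsq(Xt[:, :1], Yt, rcond=None)
        b2, *_ = np.linalg.lstsq(Xt, Yt, rcond=None)
        e1 = np.mean((Yv - Xv[:, :1] @ b1) ** 2)
        e2 = np.mean((Yv - Xv @ b2) ** 2)
        correct += e1 <= e2
    return correct / n_rep

# Enlarging the training part with n_v held fixed need not improve the comparison.
for n_t in (25, 100, 400):
    print(n_t, correct_selection_rate(n_t=n_t, n_v=50))
```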
In the above, we focused on the third goal of using CV. It is also useful to note that better
estimation of the prediction error does not mean better model selection in terms of the second goal either.
7.3 “The best method to use for model selection is 10-fold CV”
As mentioned earlier, Kohavi (1995) endorsed the 10-fold CV as the best for model selection on the
ground that it may often attain the smallest variance and the mean squared error in the estimation
of prediction errors. Based on the previous subsection, the subsequent recommendation of 10-
fold CV for model selection in that paper does not seem to be justified. Indeed, from Tables 1
and 2, it is seen that the 10-fold CV performs worse than the delete-n/2 CV for estimating the
regression function (repeated 10-fold does not help much here). Based on our theoretical results
and our experience, it is expected that for selection consistency purpose, delete-n/2 CV usually
works better than 10-fold CV. The adaptively higher probability of selecting the best candidate by the delete-n/2 CV translates into better accuracy of the final estimator.
Now we examine the performance of the 10-fold CV relative to other versions of CV in terms of
prediction error estimation. Some simulations are run in the same setting as the above subsection
except n = 100 and pn = 10 or 1000 (for the LASSO, SCAD and MCP cases). Since the bias aspect
is clear in ranking, we focus on the variance and the MSE. The outputs are as follows.
The simulations demonstrate that LOO possesses the smallest variance for a fixed model, which
can be the true model, an underfitting or overfitting model. However, if model selection is involved,
the performance of LOO worsens in variability as the model selection uncertainty gets higher due
to large model space, small penalty coefficients and/or the use of data-driven penalty coefficients.
It can lose to the 10-fold CV, which indeed can sometimes (e.g., the AIC case) achieve the smallest variance.
The highly unstable cases of LASSO, SCAD and MCP are interesting. As k decreases, we
actually see that the variance drops monotonically for each of them (except the k-fold version,
which gives a misleading representation). However, the bias increases severely for small k, making
the MSE increase rapidly from k = 10 down to 2. The MSE is minimal for the LOO in all cases
except AIC (and BIC with repeated k-fold), which supports the point that the statement in the title of this
subsection is a misconception.
Since a single k-fold CV without repetition often exhibits large variability as seen above, we
examine the performance of repeated k-fold CVs, where the k-fold CV is repeated 200 times with random data partitions.
Figure 5 clearly shows that the variance of repeated k-fold CVs drops sharply for small k as the
number of repetitions increases up to 10-20 and decreases only slightly afterwards. In other words,
repeated k-fold CVs achieve much improvement on prediction error estimation over single k-fold
CVs at the cost of only a limited number of repetitions, especially for small k.
Furthermore, S repetitions of delete-n/k and S/k repetitions of k-fold CV (S = 100 for LASSO,
SCAD and MCP and S = 500 for other methods) are compared and presented in Figure 3. It
shows that the repeated k-fold CVs outperform the delete-n/k CV at roughly the same amount of
computation.
Based on the above, we generally suggest repeated k-fold CVs (obviously except k = n) as a
more efficient/reliable alternative to delete-nv (with n/nv = k) or single k-fold CVs regardless of the
modeling procedures if the primary goal is prediction error estimation. As for the choice of k in
repeated k-fold, it seems a large k (e.g., LOO) is preferred. LOO is a safe choice: even if it is not
the best, it does not lose by much to other CVs for the prediction error estimation.
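For completeness, a minimal sketch (ours) of the repeated k-fold CV just recommended for prediction error estimation: the data are re-partitioned into k folds several times and the fold-wise validation errors are averaged over all folds and repetitions.

```python
import numpy as np

def repeated_kfold_error(fit, X, Y, k, n_repeats, rng):
    """Repeated k-fold CV estimate of the prediction error of a procedure `fit`."""
    n = len(Y)
    errors = []
    for _ in range(n_repeats):
        perm = rng.permutation(n)
        folds = np.array_split(perm, k)            # one random partition into k folds
        for I_v in folds:
            I_t = np.setdiff1d(perm, I_v)
            pred = fit(X[I_t], Y[I_t])
            errors.append(np.mean((Y[I_v] - pred(X[I_v])) ** 2))
    return np.mean(errors)
```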
In the above simulations, uncorrelated features (ρ = 0) are assumed. We also examined the
outputs by setting ρ = 0.5 and −0.5, and the major findings are pretty much the same.
8 Concluding remarks
In the introduction, in the context of dealing with parametric regression models as candidates,
two questions were raised regarding the legitimacy of our use of CV for selecting a model selection
criterion. The first question is that for achieving asymptotic optimality of AIC and BIC adaptively,
why not consider GIC, which contains AIC and BIC as special cases. The fact of matter is that
one does not know which penalty constant λn to use and for any determinist sequence of λn , it is
easy to see that you can only have at most one of the properties of AIC and BIC, but not both.
Therefore for adaptive model selection based on GIC, one must choose λn in a data driven fashion.
Our approach is in fact one way to proceed: it chooses between the AIC penalty sequence λn = 2
and the BIC penalty λn = log n, and we have shown that this leads to an asymptotic optimality
for both the AIC and BIC worlds at the same time.
The other question asks why not use CV to select a model among all those in the list instead of
the two by AIC and BIC. Actually, it is well-known that CV on the original models behaves somewhere between AIC and BIC, depending on the data splitting ratio (e.g., Shao, 1997). Therefore it
in fact cannot offer adaptive asymptotic optimality as we were seeking. More specifically, if one uses
CV to find the best model in the parametric scenario, then one must have nv /n → 1. However, if
the true regression function is actually infinite-dimensional, such a choice of CV results in selecting
a model of size of a smaller order than the optimal, leading to sub-optimal rate of convergence.
Conversely, for the infinite-dimensional case, LOO CV typically performs optimally, but should
the true regression function be among the candidate models, it fails to be optimal. So any use of
CV on the candidate models in fact cannot enjoy optimal performance under both parametric and
nonparametric assumptions. The second level use of CV, i.e., CV on the AIC and BIC, comes to
the rescue, as we have shown. This demonstrates the importance of second level of model selection,
i.e., the selection of a model selection criterion. With the general applicability, CV has a unique
33
advantage to do the second level procedure selection. Clearly, CV is also applicable for comparing
Therefore, we can conclude that the use of CV on modeling procedures can be a powerful tool
for adaptive estimation that suits multiple scenarios simultaneously. In high-dimensional settings,
the advantage of this approach can be even more remarkable. Indeed, with exponentially many or
more models being considered, any practically feasible model selection method that can be constructed
is typically good only for one or a few specific scenarios, and CV can be used to choose among a number
of methods in the hope that the best one handles the true data generating process well.
Our results reveal that for selection consistency, the choice of the splitting ratio for CV needs to
balance two ends: the ability to order the candidates in terms of estimation accuracy based on
the validation part of the data (which favors a large validation size), and the need to have the same
performance ordering at the reduced sample size as at the full sample size (which can go wrong
when the estimation part of the data is too small). Overall, unless one is selecting among
parametric models at least one of which captures the statistical behavior of the data generating
process very well, we recommend half-half splitting or slightly more observations for evaluation
when applying CV for the goal of identifying the best modeling procedure.
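To make the recommended half-half splitting concrete, here is a minimal sketch (assumptions only, not the authors' code: the nested-OLS candidates, the simulated data and all function names are illustrative) of the second-level use of CV, in which delete-n/2 CV with multiple random splits chooses between an AIC-based and a BIC-based selection procedure by averaging validation errors over splits, in the spirit of CVa.

```python
# Minimal sketch: second-level CV (delete-n/2, averaging over random splits) to choose
# between two model selection procedures. All names and data here are illustrative.
import numpy as np

def fit_nested_ols(X, y, m):
    """OLS fit using only the first m columns of X (a nested candidate model)."""
    beta = np.zeros(X.shape[1])
    beta[:m], *_ = np.linalg.lstsq(X[:, :m], y, rcond=None)
    return beta

def select_size_by_criterion(X, y, penalty):
    """Pick the nested model size m minimizing n*log(RSS_m/n) + penalty*m."""
    n, p = X.shape
    scores = []
    for m in range(1, p + 1):
        beta = fit_nested_ols(X, y, m)
        rss = np.sum((y - X @ beta) ** 2)
        scores.append(n * np.log(rss / n) + penalty * m)
    return int(np.argmin(scores)) + 1

def cv_choose_procedure(X, y, n_splits, rng):
    """Average validation error of each procedure over half-half splits."""
    n = len(y)
    errs = {"AIC": [], "BIC": []}
    for _ in range(n_splits):
        idx = rng.permutation(n)
        train, valid = idx[: n // 2], idx[n // 2 :]
        for name, penalty in (("AIC", 2.0), ("BIC", np.log(len(train)))):
            m = select_size_by_criterion(X[train], y[train], penalty)
            beta = fit_nested_ols(X[train], y[train], m)
            errs[name].append(np.mean((y[valid] - X[valid] @ beta) ** 2))
    return min(errs, key=lambda name: np.mean(errs[name]))

rng = np.random.default_rng(1)
n, p = 300, 10
X = rng.normal(size=(n, p))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)  # a small (parametric) true model
print(cv_choose_procedure(X, y, n_splits=100, rng=rng))
```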
In the literature, even including recent publications, there are recommendations that have been adopted
too broadly. The general suggestion of Kohavi (1995) to use 10-fold CV has been widely accepted. For
instance, Krstajic et al (2014, page 11) state: “Kohavi [6] and Hastie et al [4] empirically show that
V-fold cross-validation . . . ” and take the recommendation of 10-fold CV (with repetition) for all their
numerical investigations. In our view, such a practice may be misleading. First, there should not be
any general recommendation that does not take into account the goal of the use of CV. In particular,
examination of the bias and variance of CV as an estimator of the prediction error is a different
matter from optimal model selection (with either of the two goals of model selection stated earlier).
Second, even limited to the accuracy estimation context, the statement is not generally correct. For
models/modeling procedures with low instability, LOO often has the smallest variability. We have
also demonstrated that for highly unstable procedures (e.g., LASSO with pn much larger than n),
the 10-fold or 5-fold CVs, while reducing variability, can have significantly larger MSE than LOO.
Overall, from Figures 3-4, LOO and repeated 50- and 20-fold CVs are the best here, 10-fold
is significantly worse, and k ≤ 5 is clearly poor. For predictive performance estimation, we tend
to believe that LOO is typically the best or among the best for a fixed model or a very stable
modeling procedure (such as BIC in our context) in both bias and variance, or quite close to the
best in MSE for a more unstable procedure (such as AIC or even LASSO with pn ≫ n). While
10-fold CV (with repetitions) can certainly be the best sometimes, more frequently it is in an
awkward position: it is riskier than LOO (due to the bias problem) for prediction error estimation
and it is usually worse than delete-n/2 CV for identifying the best candidate.
Not surprisingly, k-fold CV is an efficient way to use the data compared to randomly removing
n/k observations k times. However, the k-fold CV is known to often be unstable. We agree
with Krstajic et al (2014) that, given k, the repeated k-fold CV (even with just 10 or 20 repetitions)
is a considerably more reliable choice than a single k-fold CV.
In this work, we have considered both the averaging- and voting-based CVs, i.e., CVa and CVv.
Our numerical comparisons of the two tend to suggest that with the number of data splittings
suitably large, they perform very similarly, with CVa occasionally slightly better in estimation risk.
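As a small illustration of the two aggregation rules (a sketch under our own naming assumptions, not the authors' code), suppose the validation errors of each candidate procedure have been recorded for every data splitting; the averaging rule compares errors averaged over splits, while the voting rule lets each splitting vote for its winner.

```python
# Minimal sketch contrasting averaging-based (CV_a) and voting-based (CV_v) aggregation,
# given per-split validation errors err[name][s] for S data splittings (toy data below).
import numpy as np

def cv_a(err):
    """Pick the procedure with the smallest validation error averaged over splits."""
    means = {name: np.mean(v) for name, v in err.items()}
    return min(means, key=means.get)

def cv_v(err):
    """Each split votes for its winner; pick the procedure with the most votes."""
    names = list(err.keys())
    wins = {name: 0 for name in names}
    n_splits = len(next(iter(err.values())))
    for s in range(n_splits):
        winner = min(names, key=lambda name: err[name][s])
        wins[winner] += 1
    return max(wins, key=wins.get)

# toy per-split validation errors for two procedures over S = 5 splits
err = {"AIC": [1.10, 0.95, 1.02, 1.08, 0.99],
       "BIC": [1.05, 1.00, 0.97, 1.01, 1.03]}
print(cv_a(err), cv_v(err))
```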
It is clear that the best CV depends on the goal of its use, and even with the same objective it may
require different data splitting ratios in accordance with the nature of the target regression function,
the noise level, the sample size and the candidate estimators. Thus effort should be put into
understanding the best version of CV for different scenarios, as we have done in this work
for the specific problem of consistent selection of a candidate modeling procedure. We have focused
on the squared L2 loss under homoscedastic errors. It remains to be seen how other choices of loss
function affect the conclusions.
For consistently identifying the best candidate model and modeling procedure by CV, the
evaluation part has to be sufficiently large, the larger the better as long as the ranking of the
candidates in terms of risk at the reduced sample size of the training part stays the same as that
under the full sample size (which requires that the training sample size not be too small). The benefits
of having a large portion for evaluation are two-fold: 1) more observations for evaluation provide
a better ability to distinguish close competitors; 2) fewer observations in the training part
magnify the accuracy difference between close competitors, so the difference becomes easier to detect.
With more and more model selection methods being proposed, especially for high-dimensional
data, we advocate the use of cross-validation to choose the best one for understanding/interpretation
or for efficient prediction.
Acknowledgments
We thank two anonymous referees, the Associate Editor and the Editor, Dr. Yacine Ait-Sahalia,
for providing us with very insightful comments and valuable suggestions to improve the paper. The
research of Yuhong Yang was partially supported by the NSF Grant DMS-1106576.
Appendix: Proofs
Proof of Theorem 1 and Corollary 3.1:
We apply Theorem 2 to prove the results (note that the parts for CVv in Theorem 1 and Corollary
3.1 can be proved by applying Theorem 2 of Yang (2007b) directly with ξn = 1). The main task is
to verify the conditions required for that theorem, which is done below. Note that since the true …
We first show that Condition 2 holds. Note that by Assumption 1, when the true model is not
among the candidates, …

Suppose the true model is $m^*$ and it is among the candidates. Then $P(\hat{m}_{n,\mathrm{BIC}} = m^*) \to 1$ as $n \to \infty$ by Assumption 1. Note that $L_2(\mu, \hat{\mu}_{n,m^*}) = \sum_{j=0}^{m^*}(\hat{\alpha}_j - \alpha_j)^2$ and $L_2(\mu, \hat{\mu}_{n,\mathrm{AIC}}) = \sum_{j=0}^{\tilde{m}_n}(\hat{\alpha}_j - \alpha_j)^2 + E_{\tilde{m}_n}$, where $\tilde{m}_n$ equals $\hat{m}_{n,\mathrm{AIC}}$ when AIC and BIC select different models and equals $\hat{m}_{n,\mathrm{AIC}} + 1$ otherwise. From above, with probability going to 1, we have $L_2(\mu, \hat{\mu}_{n,\mathrm{AIC}}) = \sum_{j=0}^{\tilde{m}_n}(\hat{\alpha}_j - \alpha_j)^2 \ge \sum_{j=0}^{m^*+1}(\hat{\alpha}_j - \alpha_j)^2$, and $L_2(\mu, \hat{\mu}_{n,\mathrm{BIC}}) = \sum_{j=0}^{m^*}(\hat{\alpha}_j - \alpha_j)^2$ (again with probability going to 1). Recall that $\hat{\alpha}_j = \frac{1}{n}\sum_{i=1}^{n} Y_i\phi_j(X_i)$. By the central limit theorem, for a given $j \ge m^*+1$, $\sqrt{n}\,\hat{\alpha}_j$ asymptotically has a non-degenerate normal distribution with mean zero and thus is bounded away from zero in probability. Since
$$\frac{L_2(\mu, \hat{\mu}_{n,\mathrm{AIC}})}{L_2(\mu, \hat{\mu}_{n,\mathrm{BIC}})} \;\ge\; 1 + \frac{n(\hat{\alpha}_{m^*+1})^2}{n\,L_2(\mu, \hat{\mu}_{n,\mathrm{BIC}})},$$
it follows that $\hat{\mu}_{n,\mathrm{BIC}}$ is asymptotically better than $\hat{\mu}_{n,\mathrm{AIC}}$ according to Definition 1 with $\xi_n = 1$.
We next verify Condition 3. Note that
$$\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j(x) = \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\mu(X_i)\phi_j(X_i)\phi_j(x) - \alpha_j\phi_j(x)\Big) + \frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big).$$
Consequently,
$$E\int_0^1\Big(\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j(x)\Big)^4 dx = \int_0^1 E\Big(\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j(x)\Big)^4 dx$$
$$\le\; 8\Bigg(\int_0^1 E\Big(\frac{1}{n}\sum_{i=1}^{n}\sum_{j=0}^{m_n}\big(\mu(X_i)\phi_j(X_i)\phi_j(x) - \alpha_j\phi_j(x)\big)\Big)^4 dx + \int_0^1 E\Big(\frac{1}{n}\sum_{i=1}^{n}\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)^4 dx\Bigg).$$
Applying Rosenthal's inequality (Rosenthal, 1970; see also Härdle et al., 1998), we have that for a constant $c > 0$,
$$E\Bigg(\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\mu(X_i)\phi_j(X_i)\phi_j(x) - \alpha_j\phi_j(x)\Big)\Bigg)^4 \le c\Bigg(n\,E\Big(\sum_{j=0}^{m_n}\mu(X_1)\phi_j(X_1)\phi_j(x) - \alpha_j\phi_j(x)\Big)^4 + \Big(n\,E\Big(\sum_{j=0}^{m_n}\mu(X_1)\phi_j(X_1)\phi_j(x) - \alpha_j\phi_j(x)\Big)^2\Big)^2\Bigg),$$
$$E\Bigg(\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)\Bigg)^4 \le c\Bigg(\sum_{i=1}^{n}E\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)^4 + \Big(\sum_{i=1}^{n}E\Big[\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big]^2\Big)^2\Bigg).$$
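For reference, the moment inequality being applied here, in its standard form for independent mean-zero random variables $X_1, \dots, X_n$ and $p \ge 2$ (Rosenthal, 1970), reads
$$E\Big|\sum_{i=1}^{n} X_i\Big|^{p} \;\le\; c_p\Bigg(\sum_{i=1}^{n}E|X_i|^{p} + \Big(\sum_{i=1}^{n}E X_i^{2}\Big)^{p/2}\Bigg),$$
so that for $p = 4$ the bound involves the fourth moments and the squared sum of second moments, which is exactly the structure of the two displays above.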
Under the assumption that $E\varepsilon_i^4 \le \sigma^4$ and since $|\phi_j(x)| \le A = \sqrt{2}$, we have
$$\int_0^1 E\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)^4 dx = \int_0^1 E\Bigg(\sum_{i_1=0}^{m_n}\sum_{i_2=0}^{m_n}\sum_{i_3=0}^{m_n}\sum_{i_4=0}^{m_n}\varepsilon_i^4\,\phi_{i_1}(X_i)\phi_{i_2}(X_i)\phi_{i_3}(X_i)\phi_{i_4}(X_i)\,\phi_{i_1}(x)\phi_{i_2}(x)\phi_{i_3}(x)\phi_{i_4}(x)\Bigg)dx$$
$$= \sum_{i_1=0}^{m_n}\sum_{i_2=0}^{m_n}\sum_{i_3=0}^{m_n}\sum_{i_4=0}^{m_n} E\big(\varepsilon_i^4\,\phi_{i_1}(X_i)\phi_{i_2}(X_i)\phi_{i_3}(X_i)\phi_{i_4}(X_i)\big)\int_0^1 \phi_{i_1}(x)\phi_{i_2}(x)\phi_{i_3}(x)\phi_{i_4}(x)\,dx.$$
Note that $|\phi_{i_1}(X_i)\phi_{i_2}(X_i)\phi_{i_3}(X_i)\phi_{i_4}(X_i)| \le A^4$ and $\int_0^1 \phi_{i_1}(x)\phi_{i_2}(x)\phi_{i_3}(x)\phi_{i_4}(x)\,dx \le A^4$. For the trigonometric basis, out of the $(m_n+1)^4$ terms, $\int_0^1 \phi_{i_1}(x)\phi_{i_2}(x)\phi_{i_3}(x)\phi_{i_4}(x)\,dx$ is nonzero for only $O(m_n^3)$ choices. In fact, based on elementary calculations, $\int_0^1 \phi_{i_1}(x)\phi_{i_2}(x)\phi_{i_3}(x)\phi_{i_4}(x)\,dx$ is nonzero only when $q_{i_1} - q_{i_2} = q_{i_3} - q_{i_4}$, or $q_{i_1} - q_{i_2} = q_{i_4} - q_{i_3}$, or $q_{i_1} + q_{i_2} = q_{i_3} + q_{i_4}$, where $q_{i_1}, q_{i_2}, q_{i_3}$ and $q_{i_4}$ are the frequencies of the basis functions $\phi_{i_1}(x), \phi_{i_2}(x), \phi_{i_3}(x), \phi_{i_4}(x)$, respectively. Hence the claim holds. Consequently, $\int_0^1 E\big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\big)^4 dx = O(m_n^3)$. By orthogonality of the basis functions,
$$\int_0^1 E\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)^2 dx = E\int_0^1\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)^2 dx = E\Big(\sum_{j=0}^{m_n}\big(\varepsilon_i\phi_j(X_i)\big)^2\Big) \le A^2\,(m_n+1)\,\sigma^2.$$
Thus
$$\int_0^1 E\Bigg(\frac{1}{n}\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\varepsilon_i\phi_j(X_i)\phi_j(x)\Big)\Bigg)^4 dx \;\le\; \frac{\tilde{c}}{n^4}\big(n\,m_n^3 + n^2 m_n^2\big) \;=\; O\Big(\frac{m_n^2}{n^2}\Big),$$
and
$$E\Bigg(\sum_{i=1}^{n}\Big(\sum_{j=0}^{m_n}\mu(X_i)\phi_j(X_i)\phi_j(x) - \alpha_j\phi_j(x)\Big)\Bigg)^4 = O\big(n^2 m_n^2\big),$$
which together imply
$$\int_0^1 E\Big(\sum_{j=0}^{m_n}\big(\hat{\alpha}_j - \alpha_j\big)\phi_j(x)\Big)^4 dx = O\Big(\frac{m_n^2}{n^2}\Big).$$
It is easily seen that $\|\mu - \hat{\mu}_{n,m_n}\|_2^2$ is of order $\frac{(m_n+1)\sigma^2}{n} + \sum_{j=m_n+1}^{\infty}\alpha_j^2$. Together with the assumption on the $L_4$ and $L_2$ approximation errors of $\mu$, we have $\|\mu - \hat{\mu}_{n,m_n}\|_4 / \|\mu - \hat{\mu}_{n,m_n}\|_2 = O_p(1)$. Now when $m_n$ is the model selected by AIC or BIC, the above analysis also applies. This verifies Condition 3.
It remains to show that Condition 1 is satisfied. Under the assumption that $\mu$ has at least one derivative, the optimal rate of convergence under the squared $L_2$ loss is no slower than $n^{-2/3}$, and the size of the model selected by AIC or BIC is no greater than order $n^{1/2}$. We have $\|\mu - \hat{\mu}_{n,m_n}\|_\infty \le \|\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j\|_\infty + \|\sum_{j=m_n+1}^{\infty}\alpha_j\phi_j\|_\infty$. For the trigonometric series, relating the $L_\infty$ and $L_2$ distances (see, e.g., Barron and Sheu (1991), Equation (7.6)), we have $\|\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j\|_\infty \le 2\sqrt{m_n}\,\|\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)\phi_j\|_2 = 2\sqrt{m_n}\sqrt{\sum_{j=0}^{m_n}(\hat{\alpha}_j - \alpha_j)^2} = O_p\Big(\frac{\sqrt{m_n}\,\sqrt{m_n}}{\sqrt{n}}\Big)$. With $m_n$ of order no larger than $\sqrt{n}$, we conclude that Condition 1 is satisfied. This completes the proof.
Proof of Theorem 2:
Suppose δ ∗ is the best procedure. Let π denote a permutation of the order of the observations
and let Wπ be the indicator that δ ∗ is selected by CV with the data splitting associated with the
permutation π. Let Π denote a collection of random permutations. Note that δ∗ is finally selected by the voting-based CV when Wπ = 1 for every π ∈ Π.
Thus to prove the theorem, it suffices to show that for each permutation π, we have EWπ → 1 (as
n → ∞), i.e., with probability approaching 1, δ∗ has a smaller predictive mean squared error than each of the other
procedures. With the above argument, the core of the proof is put into the same context as that
in Yang (2007b). Although the new definition of one estimator being better is more general than that given
in Yang (2007b), the argument in the proof of Theorem 1 there can be directly extended to the present setting.
Now we prove the result for CVa (δ; S). From above, for each s ∈ S, for any δ ̸= δ ∗ , we have
$$P\Big(\mathrm{CV}\big(\delta^*; I_v(s)\big) \ge \mathrm{CV}\big(\delta; I_v(s)\big)\Big) \to 0.$$
Since
$$\big\{\mathrm{CV}_a(\delta^*; S) \ge \mathrm{CV}_a(\delta; S)\big\} \subset \bigcup_{s \in S}\big\{\mathrm{CV}\big(\delta^*; I_v(s)\big) \ge \mathrm{CV}\big(\delta; I_v(s)\big)\big\},$$
we have
$$P\big(\mathrm{CV}_a(\delta^*; S) \ge \mathrm{CV}_a(\delta; S)\big) \le \sum_{s \in S} P\Big(\mathrm{CV}\big(\delta^*; I_v(s)\big) \ge \mathrm{CV}\big(\delta; I_v(s)\big)\Big).$$
With |S| uniformly bounded, we conclude that δ ∗ is selected with probability going to 1. The proof
is complete.
References
[1] Akaike, H., 1973. Information theory and an extension of the maximum likelihood principle.
In Proceedings of the 2nd International Symposium on Information Theory 267-281, eds. B.N.
[2] Allen, D.M., 1974. The relationship between variable selection and data augmentation and a
[4] Arlot, S., Celisse, A., 2010. A survey of cross-validation procedures for model selection. Statis-
[5] Barron, A.R., Sheu, C., 1991. Approximation of density functions by sequences of exponential
[6] Barron, A.R., Yang, Y., Yu, B., 1994. Asymptotically optimal function estimation by mini-
[7] Breiman, L., Friedman, J.H., Olshen, R.A., Stone, C.J., 1984. Classification and Regression
Belmont, CA.
[8] Breiman, L., Spector, P., 1992. Submodel selection and evaluation in regression. The X-random
[9] Burman, P., 1989. A Comparative study of ordinary cross-validation, v-fold cross-validation
[10] Fan, J., Li, R., 2001. Variable selection via nonconcave penalized likelihood and its oracle
[11] Fan, J., Lv, J., Qi, L., 2011. Sparse high-dimensional models in economics. Annual Review of
Economics 3, 291-317.
[12] Foster, D.P., George, E.I., 1994. The risk inflation criterion for multiple regression. The Annals
[13] Geisser, S., 1975. The predictive sample reuse method with applications. Journal of the Amer-
[14] George, E.I., Foster, D.P., 2000. Calibration and empirical Bayes variable selection. Biometrika
87, 731-47.
[15] Hansen, M., Yu, B., 1999. Bridging AIC and BIC: an MDL model selection criterion. In
and Imaging, p. 63. Santa Fe, NM: IEEE Info. Theory Soc.
[16] Härdle, W., Kerkyacharian, G., Picard, D., Tsybakov, A., 1998. Wavelets, Approximation,
[17] Hastie, T., Tibshirani, R., Friedman, J., 2009. The Elements of Statistical Learning: Data Mining, Inference, and Prediction, 2nd ed. Springer, New York.
[18] Ing, C.-K., 2007. Accumulated prediction errors, information criteria and optimal forecasting
[19] Kohavi, R., 1995. A study of Cross-Validation and Bootstrap for Accuracy Estimation and
Model Selection. In Proceedings of the 14th International Joint Conference on Artificial Intelligence.
[20] Krstajic, D., Buturovic, L.J., Leahy, D.E., Thomas, S., 2014. Cross-validation pitfalls when
http://www.jcheminf.com/content/6/1/10.
[21] Li, K.-C., 1987. Asymptotic optimality for Cp, CL, cross-validation and generalized cross-validation: discrete index set. The Annals of Statistics 15, 958-975.
[22] Liu, W., Yang, Y., 2011. Parametric or nonparametric? A parametricness index for model
[23] Lu, F., 2007. Prediction error estimation by cross validation. Ph.D. Preliminary Exam Paper,
[24] Ng, S., 2013. Variable Selection in Predictive Regressions. Handbook of Economic Forecasting,
[25] Nishii, R., 1984. Asymptotic properties of criteria for selection of variables in multiple regres-
[26] Picard, R.R., Cook, R.D., 1984. Cross-validation of regression models. Journal of the American
[27] Polyak, B.T., Tsybakov, A.B., 1990. Asymptotic optimality of the Cp -test for the orthogonal
series estimation of regression. Theory of Probability and its Applications 35, 293-306.
[28] Rao, C.R., Wu, Y., 1989. A strongly consistent procedure for model selection in a regression
[29] Rosenthal, H.P., 1970. On the subspaces of Lp (p > 2) spanned by sequences of independent
[30] Raskutti, G., Wainwright, M., Yu, B., 2012. Minimax-optimal rates for sparse additive models
over kernel classes via convex programming. Journal of Machine Learning Research 13, 389-
427.
[31] Schwarz, G., 1978. Estimating the dimension of a model. The Annals of Statistics 6, 461-464.
[32] Shen, X., Ye, J., 2002. Adaptive model selection. Journal of the American Statistical Associ-
[33] Shibata, R., 1983. Asymptotic mean efficiency of a selection of regression variables. Annals of
[34] Shao, J., 1993. Linear model selection by cross-validation. Journal of the American Statistical
[35] Shao, J., 1997. An asymptotic theory for linear model selection (with discussion). Statistica
Sinica 7, 221-242.
[36] Speed, T.P., Yu, B., 1993. Model selection and prediction: Normal regression. Annals of the
[37] Stone, M., 1974. Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Series B 36, 111-147.
[38] Stone, M., 1977. Asymptotics for and against cross-validation. Biometrika 64, 29-35.
[39] Tibshirani, R., 1996. Regression shrinkage and selection via the LASSO, Journal of the Royal
[40] van der Vaart, A.W., Dudoit, S., van der Laan, M., 2006. Oracle inequalities for multi-fold
[41] van Erven, T., Grünwald, P., de Rooij, S., 2012. Catching up faster by switching sooner:
[42] Wang, Z., Paterlini, S., Gao, F., Yang, Y., 2014. Adaptive Minimax Regression Estimation
[43] Yang, Y., 2001. Adaptive regression by mixing. Journal of American Statistical Association
96, 574-588.
[44] Yang, Y., 2005. Can the strengths of AIC and BIC be shared? A conflict between model
[45] Yang, Y., 2006. Comparing learning methods for classification. Statistica Sinica 16, 635-657.
[46] Yang, Y., 2007a. Prediction/estimation with simple linear model: Is it really that simple?
[47] Yang, Y., 2007b. Consistency of cross validation for comparing regression procedures. The
[48] Zhang, C.H., 2010. Nearly unbiased variable selection under minimax concave penalty. The
[49] Zhang, P., 1993. Model selection via multifold cross validation. The Annals of Statistics 21,
299-313.
[50] Zhang, Y., 2009. Model selection: A Lagrange optimization approach. Journal of Statistical
Tables and Figures
Table 1: Comparison of AIC, BIC, BICc and CV (with 400 data splittings) in terms of MSE (in the
unit of 1/n) based on 500 replications with σ = 1, pn = 15, βj = 0.25/j (1 ≤ j ≤ 10) and βj = 0
(11 ≤ j ≤ 15). The standard errors (in the unit of 1/n) are shown in the parentheses.
ρ = 0.5
100 14.64 (0.32) 12.63 (0.26) 12.18 (0.23) 12.21 (0.24) 12.18 (0.23) 13.07 (0.27) 13.27 (0.28)
10000 16.33 (0.31) 28.28 (0.43) 32.81 (0.48) 16.33 (0.31) 19.33 (0.41) 17.10 (0.37) 18.01 (0.37)
500000 13.76 (0.25) 10.83 (0.21) 10.83 (0.21) 10.89 (0.21) 10.83 (0.21) 11.34 (0.23) 11.68 (0.23)
Table 2: Comparison of SCAD, MCP, LASSO, STRIC (Stepwise plus RIC) and CV (with 400 data
splittings) in terms of MSE (in the unit of 1/n) based on 500 replications with σ = 1, n = 500,
pn = 500, βj = 6/j for j ≤ qn and βj = 0, otherwise. The standard errors (in the unit of 1/n) are
shown in the parentheses.
ρ = 0.5
1 1.07 (0.07) 1.07 (0.06) 46.53 (0.58) 4.62 (0.29) 1.10 (0.10) 1.07 (0.07) 1.59 (0.16) 2.11 (0.21)
5 18.26 (1.14) 7.57 (0.29) 205.53 (1.39) 9.14 (0.36) 6.40 (0.20) 7.02 (0.25) 6.69 (0.23) 7.39 (0.27)
10 425.97 (4.51) 238.29 (4.71) 369.36 (2.34) 14.20 (0.38) 14.20 (0.38) 14.20 (0.38) 14.20 (0.38) 14.20 (0.38)
Table 3: Comparison of LASSO, MCP, SCAD, STRIC (Stepwise plus RIC) and delete-n/2 CV
with 400 splittings in terms of square root of Prediction Error. 500 replications are performed. The
permutation standard error is shown in brackets.
Table 4: Bias, Variance and Permutation Variance (Per-Var) of CV errors based on 10,000 repeti-
tions: CV error estimation for the true model, for AIC and for BIC with n = 50, pn = 10, qn = 4,
β1 = β2 = β3 = β4 = 2 and σ = 4. The standard errors are shown in the parentheses.
True Model
Bias 0.355 (0.037) 0.399 (0.038) 0.542 (0.038) -0.044 (0.0001)
Variance 14.004 (0.214) 14.075 (0.215) 14.407 (0.219) -0.071 (0.001)
Per-Var 640.198 (3.435) 309.253 (1.652) 118.112 (0.627) 331.046 (1.783)
AIC
Bias 0.792 (0.048) 0.880 (0.047) 1.181 (0.047) -0.088 (0.004)
Variance 22.908 (0.361) 22.350 (0.349) 22.278 (0.343) 0.557 (0.052)
Per-Var 759.043 (4.504) 368.060 (2.139) 143.418 (0.807) 390.983 (2.389)
BIC
Bias 0.559 (0.044) 0.629 (0.044) 0.864 (0.044) -0.070 (0.003)
Variance 19.766 (0.321) 19.418 (0.312) 19.514 (0.309) 0.347 (0.045)
Per-Var 695.520 (4.084) 336.451 (1.938) 130.080 (0.730) 359.070 (2.164)
[Figure panels (plot content not recoverable from the text extraction): CV(AIC)−CV(BIC), CV(BIC)−CV(BICc), CV(AIC)−CV(BICc); y-axis labels include Cross Validating Error and Optimal MSE/MSE by CV; x-axis ranges from 0.0 to 1.0.]
[Figure panels (plot content not recoverable): SD plotted against 100, 50, 20, 10, 5, 4, 2 for Methods a, b, c; panel titles include AIC, LASSO, MCP, SCAD.]
[Figure panels (plot content not recoverable): Root MSE plotted against 100, 50, 20, 10, 5, 4, 2 for Methods a, b, c; panel titles include True Model, Wrong Model, AIC, LASSO, MCP, SCAD.]
[Figure panels (plot content not recoverable): SD for k-fold CV with k = 50, 10, 4, 2; panel titles include True Model, Full Model, AIC, LASSO.]