Super Learner
Recommended Citation:
van der Laan, Mark J.; Polley, Eric C.; and Hubbard, Alan E. (2007) "Super Learner," Statistical
Applications in Genetics and Molecular Biology: Vol. 6: Iss. 1, Article 25.
DOI: 10.2202/1544-6115.1309
Super Learner
Mark J. van der Laan, Eric C. Polley, and Alan E. Hubbard
Abstract
When trying to learn a model for the prediction of an outcome given a set of covariates, a
statistician has many estimation procedures in their toolbox. A few examples of these candidate
learners are: least squares, least angle regression, random forests, and spline regression. Previous
articles (van der Laan and Dudoit (2003); van der Laan et al. (2006); Sinisi et al. (2007))
theoretically validated the use of cross validation to select an optimal learner among many
candidate learners. Motivated by this use of cross validation, we propose a new prediction method
for creating a weighted combination of many candidate learners to build the super learner. This
article proposes a fast algorithm for constructing a super learner in prediction which uses V-fold
cross-validation to select weights to combine an initial set of candidate learners. In addition, this
paper contains a practical demonstration of the adaptivity of this so-called super learner to various
true data-generating distributions. This approach to constructing a super learner generalizes to
any parameter that can be defined as a minimizer of a loss function.
1 Introduction
Numerous methods exist to learn from data the best predictor of a given
outcome based on a sample of n independent and identically distributed
observations Oi = (Yi , Xi ), Yi the outcome of interest, and Xi a vector of
input variables, i = 1, . . . , n. A few examples include decision trees, neu-
ral networks, support vector regression, least angle regression, logic regres-
sion, poly-class, Multivariate Adaptive Regression Splines (MARS), and the
Deletion/Substitution/Addition (D/S/A) algorithm. Such learners can be
characterized by the mechanism used to search the parameter space of pos-
sible regression functions. For example, the D/S/A algorithm (Sinisi and
van der Laan, 2004) uses polynomial basis functions, while logic regression
(Ruczinski et al., 2003) constructs Boolean expressions of binary covariates.
The performance of a particular learner depends on how effective its searching
strategy is in approximating the optimal predictor defined by the true data
generating distribution. Thus, the relative performance of various learners
will depend on the true data-generating distribution. In practice, it is gener-
ally impossible to know a priori which learner will perform best for a given
prediction problem and data set. To address this, some researchers have
proposed various methods of combining learners and have demonstrated better
performance than any single candidate learner (Freund et al., 1997; Hansen,
1998), but there is concern that these methods may over-fit the data and
may not combine the candidate learners optimally.
The framework for unified loss-based estimation (van der Laan and Du-
doit, 2003) suggests a solution to this problem in the form of a new learner,
termed the “super learner”. In the context of prediction, this learner is itself
a prediction algorithm, which applies a set of candidate learners to the ob-
served data, and chooses the optimal learner for a given prediction problem
based on cross-validated risk. Theoretical results show that such a super
learner will perform asymptotically as well as or better than any of the can-
didate learners (van der Laan and Dudoit, 2003; van der Laan et al., 2006).
To be specific, consider some candidate learners. Least Angle Regression
(LARS) (Efron et al., 2004) is a model selection algorithm related to the
lasso. Logic Regression (Ruczinski et al., 2003) is an adaptive regression
methodology that attempts to construct predictors as Boolean combinations
of binary covariates. The D/S/A algorithm (Sinisi and van der Laan, 2004)
for polynomial regression data-adaptively generates candidate predictors as
polynomial combinations of continuous and/or binary covariates, and is avail-
set, and its risk is estimated with the corresponding validation set. For each
learner the v risks over the v validation sets are averaged resulting in the
so-called cross-validated risk. The learner with the minimal cross-validated
risk is selected.
It is helpful to consider each learner as an algorithm applied to empirical
distributions. Thus, if we index a particular learner with an index k, then
this learner can be represented as a function Pn → Ψ̂k (Pn ) from empirical
probability distributions Pn to functions of the covariates. Consider a collec-
tion of K(n) learners Ψ̂k , k = 1, . . . , K(n), in parameter space Ψ. The super
learner is a new learner defined as

$$\hat{\Psi}(P_n) \equiv \hat{\Psi}_{\hat{K}(P_n)}(P_n),$$
where K̂(Pn ) denotes the cross-validation selector described above which
simply selects the learner which performed best in terms of cross-validated
risk. Specifically,

$$\hat{K}(P_n) \equiv \arg\min_k E_{B_n} \sum_{i:\, B_n(i)=1} \big(Y_i - \hat{\Psi}_k(P^0_{n,B_n})(X_i)\big)^2,$$

where $B_n \in \{0,1\}^n$ denotes a random binary vector whose realizations define
a split of the learning sample into a training sample $\{i : B_n(i) = 0\}$ and a
validation sample $\{i : B_n(i) = 1\}$. Here $P^1_{n,B_n}$ and $P^0_{n,B_n}$ are the empirical
probability distributions of the validation and training samples, respectively.
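The discrete cross-validation selector above can be sketched directly from the definition. The following is a minimal illustration, not the paper's software: the two candidate learners (least squares and an intercept-only fit) and the fold scheme are placeholder assumptions standing in for the richer candidate library discussed in the text.

```python
import numpy as np

def kfold_indices(n, V, rng):
    """Random V-fold split: returns a list of validation-index arrays."""
    perm = rng.permutation(n)
    return np.array_split(perm, V)

def ols_learner(Xtr, ytr):
    """Least-squares candidate: fit returns a prediction function."""
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return lambda X: X @ beta

def mean_learner(Xtr, ytr):
    """Intercept-only candidate (predicts the training mean)."""
    m = ytr.mean()
    return lambda X: np.full(len(X), m)

def cv_selector(candidates, X, y, V=10, seed=1):
    """Discrete cross-validation selector K-hat: for each candidate k,
    average the validation-set squared-error risk over the V folds and
    return the index of the candidate with minimal cross-validated risk."""
    rng = np.random.default_rng(seed)
    risks = np.zeros(len(candidates))
    for valid in kfold_indices(len(y), V, rng):
        train = np.setdiff1d(np.arange(len(y)), valid)
        for k, learn in enumerate(candidates):
            pred = learn(X[train], y[train])   # fit on the training split
            risks[k] += np.mean((y[valid] - pred(X[valid])) ** 2) / V
    return int(np.argmin(risks)), risks

# toy usage: with a linear truth, the linear candidate should be selected
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + rng.normal(scale=0.1, size=200)
k_best, risks = cv_selector([ols_learner, mean_learner], X, y)
```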
The aggressive use of cross-validation is inspired by Theorem 3.1 in
van der Laan et al. (2006), which is provided in the appendix.
The “oracle” selector is defined in Theorem 2 in the appendix as the
estimator, among the K(n) learners considered, which minimizes risk under
the true data-generating distribution. In other words, the oracle selector is
the best possible estimator given the set of candidate learners considered;
however, it depends on both the observed data and P0 , and thus is unknown.
This theorem shows us that the super learner performs as well (in terms
of expected risk difference) as the oracle selector, up to a typically second
order term. Thus, as long as the number of candidate learners considered
(K(n)) is polynomial in sample size, the super learner is the optimal learner
in the following sense:
• If, as is typical, none of the candidate learners (nor, as a result, the
oracle selector) converge at a parametric rate, the super learner per-
forms asymptotically as well (in the risk difference sense) as the oracle
selector, which chooses the best of the candidate learners.
candidate learners. Figure 1 contains a flow diagram for the steps involved
in the super learner.
[Figure 1: Flow diagram of the super learner. Step 0: train each candidate
learner (lm, D/S/A, ..., RF) on the entire data set. Steps 1-4: split the data
into folds 1, ..., V and compute cross-validated predictions from each
candidate learner. Step 5: evaluate the super learner by combining the
predictions from each candidate learner (step 0) with m(z; β) (steps 1-4).]
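The steps in the flow diagram can be sketched end to end. This is a hedged illustration under placeholder assumptions: the two candidate learners below stand in for the candidate library, and the meta-level fit uses the unconstrained linear model m(z | β) = βz estimated by least squares, one of the options the text describes.

```python
import numpy as np

def ols_learner(Xtr, ytr):
    """Least-squares candidate: fit returns a prediction function."""
    beta, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return lambda X: X @ beta

def mean_learner(Xtr, ytr):
    """Intercept-only candidate (predicts the training mean)."""
    m = ytr.mean()
    return lambda X: np.full(len(X), m)

def super_learner(candidates, X, y, V=10, seed=1):
    """Super learner sketch following the flow diagram:
    step 0: fit each candidate on the entire data set;
    steps 1-4: V-fold split; fit each candidate on every training split and
    collect the n x K matrix Z of cross-validated predictions;
    then estimate beta in m(z | beta) = beta z by least squares of y on Z;
    step 5: combine the full-data candidate predictions with beta."""
    n, K = len(y), len(candidates)
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(n), V)
    Z = np.zeros((n, K))
    for valid in folds:
        train = np.setdiff1d(np.arange(n), valid)
        for k, learn in enumerate(candidates):
            Z[valid, k] = learn(X[train], y[train])(X[valid])
    beta, *_ = np.linalg.lstsq(Z, y, rcond=None)        # meta-level weights
    full_fits = [learn(X, y) for learn in candidates]   # step 0
    def predict(Xnew):
        Znew = np.column_stack([f(Xnew) for f in full_fits])
        return Znew @ beta                              # step 5
    return predict, beta

# toy usage on simulated data
rng = np.random.default_rng(0)
X = rng.normal(size=(300, 2))
y = 2 * X[:, 0] - X[:, 1] + rng.normal(scale=0.5, size=300)
predict, beta = super_learner([ols_learner, mean_learner], X, y)
```

Note that the candidates are refit within every fold, so the weights β are estimated from honest, out-of-fold predictions.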
where one could use, for example, the linear regression model m(z | β) =
βz. If Y ∈ {0, 1}, then one could use the logistic linear regression model
One could also estimate β with a constrained least squares regression estima-
tor such as penalized L1 -regression (lasso) or penalized L2 -regression (shrink-
age), where the constraints are selected with cross-validation, or one could
restrict β to the set of positive weights summing to one.
Since the candidate learners are all trying to predict the same outcome Y,
there is a potential for collinearity or near collinearity in the predicted Z data
set. In practice, the user could simply remove one of the troublesome candi-
date learners, which is often the default in most statistical regression software
when collinearity is present. Near collinearity can make the interpretation
of ψn∗ (z) difficult, especially when evaluating the magnitude of the parameters
βn . We recommend caution in interpreting the parameter estimates, but
note that near collinearity should not affect the prediction accuracy of the
super learner.
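The restriction of β to positive weights summing to one, mentioned above, can be sketched as a simplex-constrained least squares fit. The projected-gradient routine below is an illustrative assumption, not the paper's estimator; Z stands for the matrix of cross-validated candidate predictions.

```python
import numpy as np

def project_simplex(v):
    """Euclidean projection of v onto the probability simplex
    (non-negative weights summing to one), via the sorting algorithm."""
    u = np.sort(v)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(v) + 1) > (css - 1))[0][-1]
    theta = (css[rho] - 1) / (rho + 1.0)
    return np.maximum(v - theta, 0.0)

def convex_weights(Z, y, steps=5000, lr=None):
    """Least squares of y on the cross-validated predictions Z, with beta
    constrained to the simplex, fit by projected gradient descent
    (a sketch of one way to impose the constraint)."""
    n, K = Z.shape
    beta = np.full(K, 1.0 / K)
    if lr is None:
        lr = 1.0 / (np.linalg.norm(Z, 2) ** 2 / n)  # 1 / Lipschitz constant
    for _ in range(steps):
        grad = Z.T @ (Z @ beta - y) / n
        beta = project_simplex(beta - lr * grad)
    return beta

# toy usage: column 0 tracks y closely, column 1 is noise, so the
# constrained weights should load almost entirely on column 0
rng = np.random.default_rng(0)
y = rng.normal(size=500)
Z = np.column_stack([y + 0.1 * rng.normal(size=500), rng.normal(size=500)])
beta = convex_weights(Z, y)
```

A practical side effect of the simplex constraint is that it stabilizes the fit when the columns of Z are nearly collinear, which speaks to the collinearity concern above.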
Data adaptive minimum cross-validated risk predictor: There is
no need to restrict ψn∗ to parametric regression fits. For example, one could
define ψn∗ in terms of the application of a particular data adaptive (machine
learning) regression algorithm to the data set (Yi , Zi ), i = 1, . . . , n, such as
CART, D/S/A, or MARS, among others. In fact, one could apply a super
learning algorithm itself to estimate E(Y | Z). In this manner one can let
the data speak in order to build a good predictor of Y from the covariate
vector Z, using (Yi , Zi ), i = 1, . . . , n.
Thus, this super learner is indexed, beyond the choice of initial candidate
estimators, by a choice of minimum cross-validated risk predictor. As a con-
sequence, the proposal provides a whole class of tools indexed by an arbitrary
choice of regression algorithm (i.e., ψn∗ ) to map a set of candidate learners
into a new cross-validated estimator (i.e. super learner). In particular, it
provides a new way of using the cross-validated risk function, which goes
beyond minimizing the cross-validated risk over a set of candidate learners.
For each α ∈ A, define the candidate estimator Ψ̂α (Pn ) ≡ m((Ψ̂j (Pn ) :
j = 1, . . . , J) | α).
For each δ > 0, there exists a C(δ) < ∞ such that

$$\frac{1}{V}\sum_{v=1}^{V} E\, d\big(\hat{\Psi}_{\alpha_n}(P_n^{T(v)}), \psi_0\big) \le (1+\delta)\, E \min_{\alpha \in A_n} \frac{1}{V}\sum_{v=1}^{V} d\big(\hat{\Psi}_{\alpha}(P_n^{T(v)}), \psi_0\big) + C(\delta)\, \frac{V \log n}{n}.$$

Thus, if

$$\frac{\log n / n}{E \min_{\alpha \in A_n} \frac{1}{V}\sum_{v=1}^{V} d\big(\hat{\Psi}_{\alpha}(P_n^{T(v)}), \psi_0\big)} \to 0 \text{ as } n \to \infty, \qquad (2)$$

then it follows that the estimator Ψ̂αn is asymptotically equivalent with the
oracle estimator Ψ̂α̃n when applied to samples of size (1 − 1/V )n:

$$\frac{\frac{1}{V}\sum_{v=1}^{V} E\, d\big(\hat{\Psi}_{\alpha_n}(P_n^{T(v)}), \psi_0\big)}{E \min_{\alpha \in A_n} \frac{1}{V}\sum_{v=1}^{V} d\big(\hat{\Psi}_{\alpha}(P_n^{T(v)}), \psi_0\big)} \to 1 \text{ as } n \to \infty.$$

If (2) does not hold, then it follows that Ψ̂αn achieves the (log n)/n rate:

$$\frac{1}{V}\sum_{v=1}^{V} E\, d\big(\hat{\Psi}_{\alpha_n}(P_n^{T(v)}), \psi_0\big) = O\!\left(\frac{\log n}{n}\right).$$
where $\alpha_n = \arg\min_{\alpha \in A} \sum_{i=1}^{n} (Y_i - m(Z_i \mid \alpha))^2$. The other conclusions of the
theorem now also apply.
This theorem implies that the selected prediction algorithm Ψ̂αn will ei-
ther perform asymptotically as well (up to the constant) as the best estima-
tor among the family of estimators {Ψ̂α : α ∈ A} when applied to samples of
size n(1 − 1/V ), or achieve the parametric model rate 1/n up to a log n fac-
tor. By a simple argument presented in van der Laan and Dudoit (2003),
Dudoit and van der Laan (2005), and van der Vaart et al. (2006), it follows
that by letting V = Vn in the V-fold cross-validation scheme converge
to infinity at a slow enough rate relative to n, either ψn = Ψ̂αn (Pn )
performs asymptotically as well (up to the constant) as the best estimator
among the estimators {Ψ̂α : α} applied to the full sample Pn , or it achieves
the parametric rate of convergence up to the log n factor.
The take-home message of this theorem is that our super learner will
perform asymptotically as well as the best learner among the family of can-
didate learners Ψ̂α indexed by α. By choosing the regression model m(· | α)
so that there exists an αj with m(Z | αj ) = Zj for each j = 1, . . . , J (e.g.,
m(Z | α) = αZ), it follows, in particular, that the resulting prediction
algorithm asymptotically outperforms each of the initial candidate estima-
tors Ψ̂j . More importantly and practically, the set of candidate estimators
Ψ̂α can include interesting combinations of these J estimators which exploit
the strengths of several of them for the particular data-generating
distribution P0 , instead of focusing on one. For example, if one uses
the linear regression model m(Z | α) = αZ, then the candidate estimators
{Ψ̂α : α} include all weighted averages of the J estimators, including convex
combinations. As becomes evident in our data analysis and simulation results, the
selected super learner ψn∗ based on a linear (or logistic) regression model is
often indeed a weighted average (or a logistic function of a weighted average)
of competing estimators in which several of the candidate learners contribute
significantly to the average.
4 Simulation results
In this section, we conduct three simulation studies to evaluate the operating
characteristics of the super learner. These simulations all involve a continu-
ous response variable. For the first simulation, the true model is:
method          RMSPE      βn
Least Squares    1.00    0.038
LARS             1.15   -0.171
D/S/A            0.22    0.535
Logic            0.32    0.274
Random Forest    0.42    0.398
Super Learner    0.20
Table 3 contains the results for the second simulation. As in the first
simulation, the relative mean squared prediction error is used to evaluate
the candidate learners and the super learner. For this model, simple linear
regression, LARS, and ridge regression all appear to have the same results.
Random forests and adaptive regression splines are better able to pick up
the non-linear relationship, but among the candidate learners, the D/S/A is
the best with a relative MSPE of 0.43. But the super learner improves on
the fit even more with a relative MSPE of 0.22 by combining the candidate
learners. Since the model for ψn∗ (z) can be near collinear, the estimates of β
are often unstable and should not be used to determine the best candidate
by comparing the magnitude of the parameter estimate.
The main advantage of the proposed super learner is its adaptivity to
different data-generating distributions across many studies. The third simu-
lation demonstrates this feature by creating 3 additional studies, applying
the super learner and the candidates to all 3 studies, and then combining the
results with the second simulation and evaluating the mean squared prediction
error across all 4 studies. Equation 5 shows the data-generating distributions
for the 3 new studies. The data-generating distribution for the covariates X is the
same as in the second simulation example above. To be consistent across the 4
studies, the same candidate learners from the second simulation were applied
to these 3 new studies.
method          RMSPE      βn
Least Squares    1.00   -0.73
LARS             0.91   -0.92
D/S/A            0.43    0.86
Ridge            0.98    0.61
Random Forest    0.71    1.06
MARS             0.61    0.05
Super Learner    0.22
$$Y_{ij} = \begin{cases} -5 + X_2 + 6(X_{10}+8)_+ - 6(X_{10})_+ - 7(X_{10}-5)_+ - 6(X_{15}+6)_+ + 8(X_{15})_+ + 7(X_{15}-6)_+ + \varepsilon & \text{if } j = 1 \\ 10 \cdot I(X_1 > -4 \text{ and } X_2 > 0 \text{ and } X_3 > -4) + \varepsilon & \text{if } j = 2 \\ -4 + X_2 + \sqrt{|X_3|} + \sin(X_4) - .3 X_6 X_{11} + 3 X_7 + .3 X_8^3 - 2X_9 - 2X_{10} - 2X_{11} + \varepsilon & \text{if } j = 3 \end{cases} \qquad (5)$$

where ε ∼ Normal(0, 16) and I(x) = 1 if x is true, and 0 otherwise. For the 4
studies (the 3 new studies combined with the second simulation), the learning
sample contained 200 observations and the evaluation sample contained 5,000
observations.
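As an illustration of the simulation design, the j = 2 study in equation (5) can be simulated as below. The covariate law is described only by reference to the second simulation, which is not shown in this excerpt, so the choice of 20 covariates drawn as N(0, 2²) here is a placeholder assumption; the outcome model and sample sizes follow the text.

```python
import numpy as np

def simulate_study2(n, seed=0):
    """Draw n observations from the j = 2 data-generating distribution in
    equation (5): Y = 10 * I(X1 > -4 and X2 > 0 and X3 > -4) + eps, with
    eps ~ Normal(0, 16), i.e. standard deviation 4. The covariate law
    (20 columns, N(0, 2^2)) is an assumption for illustration only."""
    rng = np.random.default_rng(seed)
    X = rng.normal(scale=2.0, size=(n, 20))
    ind = (X[:, 0] > -4) & (X[:, 1] > 0) & (X[:, 2] > -4)  # X[:, 0] is X1
    Y = 10.0 * ind + rng.normal(scale=4.0, size=n)
    return X, Y

# learning and evaluation sample sizes as stated in the text
X_learn, Y_learn = simulate_study2(200)
X_eval, Y_eval = simulate_study2(5000, seed=1)
```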
Table 4 contains the results from the third simulation. For the first
study (j = 1), the adaptive regression spline function is able to estimate
the true distribution well. The super learner is not able to improve on the
fit, but it does not do worse than the best candidate algorithm. In the
second study (j = 2), the adaptive regression spline function is not the best
candidate learner. The random forests performs best in the second study,
but the super learner is able to improve on the fit. The third study (j = 3)
is similar to the first in that the adaptive regression splines function is able
to approximate the true distribution well, but the super learner does not do
worse. The squared prediction error from these three studies and the second
simulation was combined to give a mean squared prediction error for the four
studies. The last column in Table 4 gives the relative MSPE for each of the
candidate learners and the super learner. If the researcher had selected just
one of the candidate learners, they might have done well within one or two
of the studies, but overall the super learner will outperform the candidate
learners. For example, the MARS learner performs well on the first and
third study, and does well overall with a relative MSPE of 0.38, but the
super learner outperforms the MARS learner with an overall relative MSPE
of 0.19. The super learner is able to adapt to the different data generating
distributions and will outperform any candidate learner across many studies.
5 Data Analysis
We applied the super learner to the diabetes data set from the LARS package
in R. Details on the data set can be found in Efron et al. (2004). The data set
consists of 442 observations of 10 covariates (9 quantitative and 1 qualitative)
and a continuous outcome. The covariates have been standardized to have
mean zero and unit L2 norm. We selected 6 candidate learners for the super
learner. The first candidate was least squares using all 10 covariates. Next
we considered the least squares model with all possible two-way interactions
and quadratic terms on the quantitative covariates. The third and fourth
candidates were applying LARS to the main effects and all possible two-
way interaction models above. Internal cross-validation was used to select
the “fraction” point for the prediction. The fifth candidate algorithm was
D/S/A allowing for two-way interactions and a maximum model size of 64.
The final candidate learner was the random forests algorithm. For the super
learner, we then used a linear model and estimated the parameters with least
squares.
We also applied the proposed super learner to the HIV-1 drug resistance
data set in Sinisi et al. (2007) and Rhee et al. (2006). The goal of the analysis
is to predict drug susceptibility based on mutations in the protease and reverse
transcriptase enzymes. The HIV-1 sequences were obtained from publicly
available isolates in the Stanford HIV Reverse Transcriptase and Protease
Sequence Database. Details on the data and previous analyses can be found
in Sinisi et al. (2007) and Rhee et al. (2006). The outcome of interest is the
standardized log fold change in drug susceptibility, defined as the ratio of the
IC50 of an isolate to that of a standard wild-type control isolate; the IC50
(inhibitory concentration) is the concentration of the drug needed to inhibit
viral replication by 50%. We focused our analysis on a single protease inhibitor,
nelfinavir, for which the learning sample contains 740 viral isolates with 61
binary predictor covariates and one quantitative outcome.
For the HIV data set, we considered six candidate learners. The first
candidate was least squares on all main terms. The second candidate was
the LARS algorithm. Internal cross-validation was used to determine the best
fraction parameter. The third candidate was logic regression. Similar to the
simulation example, we used 10-fold cross-validation on the entire learning set
to determine the parameters, #trees ∈ {1, . . . , 5} and #leaves ∈ {1, . . . , 20},
for logic regression. For the HIV data set, we selected #trees = 5 and
#leaves = 10. The fourth candidate was the CART algorithm. We also
applied the D/S/A algorithm, searching over main effects terms only with a
maximum model size of 35. The final candidate was random forests. For the
super learner, a linear model was used and its parameters were estimated
with least squares. All models were fit in R as in the simulation example above.
To evaluate the performance of the super learner in comparison to each
of the candidate learners, we split the learning data set into 10 validation
data sets and corresponding training data sets. The super learner and each
candidate learner were fit on each fold of the cross-validation, giving us an
honest cross-validated risk estimate with which to compare the super learner
to each of the candidate learners.
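The honest evaluation protocol above can be sketched as an outer cross-validation loop in which the entire fitting procedure is refit on each training split. This is an illustrative sketch; the two placeholder fitting functions stand in for the super learner and the candidate learners named in the text.

```python
import numpy as np

def honest_cv_risk(fit_fn, X, y, V=10, seed=2):
    """Honest V-fold risk estimate: the entire fitting procedure fit_fn
    (for the super learner, including its own internal cross-validation)
    is refit on each training split and evaluated on the held-out fold."""
    rng = np.random.default_rng(seed)
    folds = np.array_split(rng.permutation(len(y)), V)
    risk = 0.0
    for valid in folds:
        train = np.setdiff1d(np.arange(len(y)), valid)
        pred = fit_fn(X[train], y[train])   # whole procedure refit per fold
        risk += np.mean((y[valid] - pred(X[valid])) ** 2) / V
    return risk

def ols_fit(Xtr, ytr):
    """Placeholder procedure: least squares returning a predictor."""
    b, *_ = np.linalg.lstsq(Xtr, ytr, rcond=None)
    return lambda X: X @ b

def mean_fit(Xtr, ytr):
    """Placeholder procedure: intercept-only predictor."""
    m = ytr.mean()
    return lambda X: np.full(len(X), m)

# toy usage: compare two procedures on the same outer folds
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = X[:, 0] - X[:, 1] + rng.normal(scale=0.3, size=200)
risk_ols = honest_cv_risk(ols_fit, X, y)
risk_mean = honest_cv_risk(mean_fit, X, y)
```

Because every procedure, including the super learner itself, is refit inside each fold, the resulting risk estimates are not biased in favor of the more adaptive procedure.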
Table 5: Super learner results for the diabetes data set. Least Squares (1)
and LARS (1) refer to the main effects only models. Least Squares (2) and
LARS (2) refer to the all possible two-way interaction models. Relative 10-
fold honest cross-validated risk estimates (RCV risk), compared to main
terms least squares, are reported. βn in the super learner is reported in the
last column (αn = −6.228).
Table 6: Super learner results for the HIV data set. Relative 10-fold honest
cross-validated risk estimates (RCV risk), compared to least squares, are
reported. βn in the super learner is reported in the last column (αn = 0.027).
D/S/A does not perform well on this data set. This highlights the need
for a super learner since one candidate algorithm will not work on all data
sets. Among the candidate learners, least squares has the smallest cross-
validated risk estimate, but the super learner has a smaller risk estimate
(RCV = 0.87). We also present the estimates for α and β in table 6. Both
least squares and random forests appear to be receiving the most weight in
the super learner with coefficients 0.552 and 0.510 respectively. Again, the
super learner can use the cross validated predictions to data adaptively build
the best predictor.
These are both situations where one of the candidate learners does a good
job of prediction and gives little room for improvement for the super learner.
But these examples also demonstrate that one candidate algorithm may not
be flexible enough to perform best on all data generating distributions and
since a researcher is unlikely to know a priori which candidate learner will
work best, the super learner is a natural choice for prediction.
6 Discussion
The new super learning approach provides both a fundamental theoretical
and a practical improvement to the construction of a predictor. The
super learner is a flexible prediction algorithm which can perform well on
many different data-generating distributions, and utilizes cross-validation
to protect against over-fitting. We wish to stress that the theory suggests
that to achieve the best performance one should not apply this algorithm
to a restricted set of candidate learners, but should aim to include any
available sensible learners. In addition, the computational cost does
not exceed that of fitting each of the candidate learners on the training
and full data sets. In our simulations we used a particular set of learners
only because they were easily available as R functions. Thus, the potential
for improving learners applies to a very wide array of practical problems.
Our results generalize to parameters which can be defined as minimizers
of a loss function, including (unknown) loss functions indexed by parameters
of the true data generating distribution (van der Laan and Dudoit (2003)).
In particular, the super learner approach applies to maximum likelihood es-
timation in semiparametric or nonparametric models for the data generating
distribution, and to targeted maximum likelihood estimation with respect to
a particular smooth functional of the density of the data, as presented in
van der Laan and Rubin (2007).
7 Appendix
Under Assumption A1 that the loss function L(O, ψ) = (Y − ψ(X))2
is uniformly bounded, and Assumption A2 that the variance of the ψ0 -
centered loss function L(O, ψ) − L(O, ψ0 ) can be bounded by its expectation
uniformly in ψ, van der Laan et al. (2006) (Theorem 3.1) establish the fol-
lowing finite sample inequality.
Theorem 2. Let {ψ̂k = Ψ̂k (Pn ), k = 1, . . . , K(n)} be a given set of K(n)
estimators of the parameter value $\psi_0 = \arg\min_{\psi \in \Psi} \int L(o, \psi)\, dP_0(o)$. Let
d0 (ψ, ψ0 ) ≡ EP0 {L(O, ψ) − L(O, ψ0 )} denote the risk difference between a
candidate estimator ψ and the parameter ψ0 . Suppose that Ψ is a param-
eter space so that Ψ̂k (Pn ) ∈ Ψ for all k, with probability 1. Let

$$\hat{K}(P_n) \equiv \arg\min_k E_{B_n} \int L\big(o, \hat{\Psi}_k(P^0_{n,B_n})\big)\, dP^1_{n,B_n}(o)$$

be the cross-validation selector, and let

$$\tilde{K}(P_n) \equiv \arg\min_k E_{B_n} \int L\big(o, \hat{\Psi}_k(P^0_{n,B_n})\big)\, dP_0(o)$$

be the comparable oracle selector. Let p be the proportion of observations in
the validation sample. Then, under Assumptions A1 and A2, one has the
following finite sample inequality for any λ > 0 (where C(λ) is a constant,
defined in van der Laan et al. (2006)):

$$E\, d_0\big(\hat{\Psi}_{\hat{K}(P_n)}(P^0_{n,B_n}), \psi_0\big) \le (1 + 2\lambda)\, E\, d_0\big(\hat{\Psi}_{\tilde{K}(P_n)}(P^0_{n,B_n}), \psi_0\big) + 2C(\lambda)\, \frac{1 + \log(K(n))}{np}.$$
References
L. Breiman. Random Forests. Machine Learning, 45:5–32, 2001.
A.E. Hoerl and R.W. Kennard. Ridge regression: Biased estimation for
nonorthogonal problems. Technometrics, 12(3):55–67, 1970.
S. E. Sinisi, E. C. Polley, S.Y. Rhee, and M. J. van der Laan. Super learn-
ing: An application to the prediction of HIV-1 drug resistance. Statistical
Applications in Genetics and Molecular Biology, 6(1), 2007.
M. J. van der Laan, S. Dudoit, and A. W. van der Vaart. The cross-validated
adaptive epsilon-net estimator. Statistics and Decisions, 24(3):373–395,
2006.
A.W. van der Vaart, S. Dudoit, and M.J. van der Laan. Oracle inequalities
for multi-fold cross-validation. Statistics and Decisions, 24(3), 2006.