Deep Learning Ensemble Methods
Abstract
Artificial neural networks have been successfully applied to a variety of machine learning tasks, including image recognition, semantic segmentation, and machine translation. However, few studies have fully investigated ensembles of artificial neural networks. In this work, we investigated multiple widely used ensemble methods, including unweighted averaging, majority voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition tasks, with deep neural networks as candidate algorithms. We designed several experiments, with the candidate algorithms being the same network structure with different model checkpoints within a single training process, networks with the same structure but trained multiple times stochastically, and networks with different structures. In addition, we further studied the over-confidence phenomenon of neural networks, as well as its impact on the ensemble methods. Across all of our experiments, the Super Learner achieved the best performance among all the ensemble methods in this study.
1 Introduction
Ensemble learning methods train several baseline models and use some rule to combine them to make predictions. Ensemble learning methods have gained popularity because of their superior prediction performance in practice. Consider a prediction task with some fixed data-generating mechanism. The performance of a particular learner depends on how effective its searching strategy is in approximating the optimal predictor defined by the true data-generating distribution [van der Laan et al., 2007]. In theory, the relative performance of various learners will depend on the model assumptions and the true data-generating distribution. In practice, the performance of the learners will depend on the sample size, the dimensionality, and the bias-variance trade-off of the model. Thus it is generally impossible to know a priori which learner will perform best for a given finite-sample data set and prediction problem [van der Laan et al., 2007]. One widely used method is to use cross-validation to give an "objective" and "honest" assessment of each learner, and then select the single algorithm that achieves the best validation performance. This is known as the discrete Super Learner selector [Van Der Laan and Dudoit, 2003, van der Laan et al., 2007, Polley and Van Der Laan, 2010], which asymptotically performs as well as the best base learner in the library, even as the number of candidates grows polynomially in the sample size.
Instead of selecting one algorithm, another approach to guarantee predictive performance is to compute the optimal convex combination of the base learners. The idea of ensemble learning, which combines predictors instead of selecting a single predictor, is well studied in the literature: [Breiman, 1996b] summarized and referenced several related studies [Rao and Subrahmaniam, 1971, Efron and Morris, 1973, Rubin and Weisberg, 1975, Berger and Bock, 1976, Green and Strawderman, 1991] about the theoretical properties of ensemble learning. Two widely used ensemble techniques are bagging [Breiman, 1996a] and boosting [Freund et al., 1996, Freund and Schapire, 1997, Friedman, 2001]. Bagging uses bootstrap aggregation to reduce the variance of strong learners, while boosting algorithms "boost" the capacity of weak learners. [Wolpert, 1992, Breiman, 1996b] proposed a linear combination strategy called stacking to ensemble the models. [van der Laan et al., 2007] further extended stacked generalization with a cross-validation based optimization framework called the Super Learner, which finds the optimal combination of a collection of prediction algorithms by minimizing the cross-validated risk. Recently, the Super Learner has shown great success in a variety of areas, including precision medicine [Luedtke and van der Laan, 2016], mortality prediction [Pirracchio et al., 2015, Chambaz et al., 2016], online learning [Benkeser et al., 2016], and spatial prediction [Davies and van der Laan, 2016].
In recent years, deep artificial neural networks (ANNs) have led to a series of breakthroughs in a variety of tasks. ANNs have shown great success in almost all machine learning related challenges across different areas, such as computer vision [Krizhevsky et al., 2012, Szegedy et al., 2015, He et al., 2015a], machine translation [Luong et al., 2015, Cho et al., 2014], and social network analysis [Perozzi et al., 2014, Grover and Leskovec, 2016]. Due to their high capacity/flexibility, deep neural networks usually have high variance and low bias. In practice, model averaging with multiple stochastically trained networks is commonly used to improve predictive performance. [Krizhevsky et al., 2012] won first place in the image classification challenge of ILSVRC 2012 by averaging 7 CNNs with the same structure. [Simonyan and Zisserman, 2014] won first place in the classification and localization challenge of ILSVRC 2014 by averaging multiple deep CNNs. [He et al., 2015a] won first place in ILSVRC 2015 using an ensemble of six residual networks with different depths. In addition, [He et al., 2015a] also won the ImageNet detection task in ILSVRC 2015 with an ensemble of 3 residual network models.
However, the behavior of ensemble learning with deep networks is still not well studied and understood. First, most of the neural network literature focuses mainly on the design of the network structure, and only applies naive averaging ensembles to enhance performance. To the best of our knowledge, no detailed work investigates, compares, and discusses ensemble methods for deep neural networks. Naive unweighted averaging, which is widely used, is not data-adaptive and is thus vulnerable to a "bad" library of base learners: it works well for networks with similar structure and comparable performance, but it is sensitive to the presence of excessively biased base learners. This issue can be easily addressed by a cross-validation based, data-adaptive ensemble like the Bayes Optimal Classifier or the Super Learner. In later sections, we investigate and compare the performance of four commonly used ensemble methods on an image classification task, with deep convolutional neural networks (CNNs) as base learners.
This study mainly focuses on the comparison of ensemble methods of CNNs for image recognition. For readers who are not familiar with deep learning, each CNN can simply be treated as a black-box estimator that takes an image as input and outputs a probability vector over the possible classes. We refer the interested reader to [LeCun et al., 2015, Goodfellow et al., 2016] for more details about deep learning.
2 Background
In this paper, "algorithm candidate", "hypothesis", and "base learner" refer to an individual learner (here a deep CNN) used in an ensemble. The term "library" refers to the set of base learners used by the ensemble methods.
Compared to naive averaging, majority voting is less sensitive to the output from a single net-
work. However, it would still be dominated if the library contains multiple similar and dependent
base learners. Another weakness of majority voting is the loss of information, as it only uses the
predicted label.
[Kuncheva et al., 2003] showed that pairwise dependence plays an important role in majority voting. For image classification, shallow networks usually give more diverse predictions than deeper networks [Choromanska et al., 2015]. We therefore hypothesize that majority voting would yield a greater improvement over its base learners with a library of shallow networks than with a library of deep networks.
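As a concrete illustration, the sketch below contrasts unweighted averaging with majority voting, assuming each base learner outputs a length-K probability vector per image; the function and variable names are illustrative and not taken from the paper's code.

```python
import numpy as np

def unweighted_average(probs):
    """probs: array of shape (m, n, K) -- m base learners, n images, K classes.
    Returns predicted labels from the averaged class probabilities."""
    mean_probs = probs.mean(axis=0)        # (n, K): average over base learners
    return mean_probs.argmax(axis=1)       # predicted label per image

def majority_vote(probs):
    """Each base learner votes with its own argmax label; ties broken by lowest label."""
    votes = probs.argmax(axis=2)           # (m, n): hard labels from each learner
    n, K = probs.shape[1], probs.shape[2]
    counts = np.zeros((n, K), dtype=int)
    for learner_votes in votes:            # accumulate one vote per learner per image
        counts[np.arange(n), learner_votes] += 1
    return counts.argmax(axis=1)
```

Note that majority voting discards the predicted probabilities and keeps only the hard labels, which is the loss of information mentioned above.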
Note that $P[S_{\mathrm{train}} \mid h_j] = \prod_{(y,x)\in S_{\mathrm{train}}} h_j(y \mid x)$ is the likelihood of the data under the hypothesis $h_j$. However, this quantity might not reflect the quality of the hypothesis well, since the likelihood of the training sample is subject to overfitting. To give an "honest" estimate, we could split the training data into two sets, one for model training and the other for computing the likelihood. For neural networks, a validation set (distinct from the testing set) is usually set aside only to tune a few hyper-parameters, so the information in it is not fully exploited. We expect that using such a validation set would provide a good estimate of the likelihood. Finally, we would assess the model using the untouched testing set.
The second difficulty with the BOC is choosing the prior probability $p(h_i)$ for each hypothesis. For simplicity, the prior is usually set to be the uniform distribution [Mitchell, 1997].
[Dietterich, 2000] observed that, when the sample size is large, one hypothesis typically tends to have a much larger posterior probability than the others. We will see in a later section that when the validation set is large, the posterior weight is usually dominated by only one hypothesis (base learner). As the weights are proportional to the likelihood on the validation set, if the weight vector is dominated by a single algorithm, the BOC becomes the same selector as the discrete Super Learner with the negative log-likelihood loss function [van der Laan et al., 2007].
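A small sketch of this BOC weighting, assuming a uniform prior and per-model likelihoods computed on the held-out validation set (names are illustrative). Working on the log scale makes the domination effect easy to see: with thousands of validation images, small per-image differences accumulate, so one model's log-likelihood typically exceeds the others by a wide margin and its normalized weight is essentially 1.

```python
import numpy as np

def boc_weights(val_probs, val_labels):
    """val_probs: (m, n, K) predicted class probabilities on the validation set.
    val_labels: (n,) true labels.
    Returns posterior weights proportional to prior * likelihood, with a uniform prior."""
    m, n, _ = val_probs.shape
    # log P[S_val | h_j] = sum_i log h_j(y_i | x_i)
    log_lik = np.log(val_probs[:, np.arange(n), val_labels] + 1e-12).sum(axis=1)  # (m,)
    # normalize on the log scale for numerical stability (softmax over log-likelihoods)
    log_w = log_lik - log_lik.max()
    w = np.exp(log_w)
    return w / w.sum()
```

In practice the returned weight vector has one entry close to 1 and the rest close to 0, which is exactly why the BOC behaves like the discrete Super Learner with the negative log-likelihood loss.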
2.4 Stacked Generalization
The idea of stacking was originally proposed in [Wolpert, 1992], which concluded that stacking works by deducing the biases of the generalizer(s) with respect to a provided learning set. [Breiman, 1996b] also studied stacked regression, using cross-validation to construct a "good" combination.
Consider a linear stacking for the prediction task. The basic idea of stacking is to "stack" the predictions $f_1, \cdots, f_m$ by a linear combination with weights $a_i$, $i \in \{1, \cdots, m\}$:
$$f_{\mathrm{stacking}}(x) = \sum_{i=1}^{m} a_i f_i(x)$$
where the weight vector $a$ is learned by a meta-learner.
Let $\mathrm{val}(v)$ denote the set of indices of the observations in the $v$-th fold, and let $p^{-v}_{ji}$ be the prediction for the $i$-th observation from the $j$-th base learner trained on the whole data except the $v$-th fold. The cross-validated risk for a weight vector $\vec a = [a_1, \cdots, a_m]$ is
$$R_{CV}(\vec a) = \sum_{v=1}^{V} \sum_{i \in \mathrm{val}(v)} l\Big(y_i,\ \sum_{j=1}^{m} a_j p^{-v}_{ji}\Big)$$
The optimal weight vector given by the Super Learner is the minimizer of this cross-validated risk, $\hat{\vec a} = \arg\min_{\vec a} R_{CV}(\vec a)$.
For binary outcomes with the negative log-likelihood loss, the cross-validated loss is
$$R_{CV}(\vec a) = -\sum_{v=1}^{V} \sum_{i \in \mathrm{val}(v)} \left[ y_i \log\Big(\sum_{j=1}^{m} a_j p^{-v}_{ji}\Big) + (1-y_i) \log\Big(1 - \sum_{j=1}^{m} a_j p^{-v}_{ji}\Big) \right]$$
where $p^{-v}_{ji}$ is the predicted probability for the $i$-th unit from the $j$-th base learner trained on the whole data except the $v$-th fold.
In addition, stacking on the logit scale usually gives much better performance in practice. In other words, we use the optimal linear combination before the softmax transformation:
$$R_{CV}(\vec a) = \sum_{v=1}^{V} \sum_{i \in \mathrm{val}(v)} l\left(y_i,\ \mathrm{expit}\Big(\sum_{j=1}^{m} a_j\,\mathrm{logit}\big(p^{-v}_{ji}\big)\Big)\right)$$
For $K$-class classification with softmax output, as in neural networks, we can also ensemble at the score level:
$$p_{zi}(\vec a) = -\log\left(\frac{\exp\big(\sum_{j=1}^{m} a_j \, s_i[j,z]\big)}{\sum_{k=1}^{K} \exp\big(\sum_{j=1}^{m} a_j \, s_i[j,k]\big)}\right)$$
where $p_{zi}(\vec a)$ is the ensemble prediction for the $i$-th unit and $z$-th class with weight vector $\vec a$, and $s_i$ is an $m$ by $K$ matrix whose entry $s_i[j,k]$ is the score of the $j$-th model for the $k$-th class.
We can impose restrictions on $a$, such as constraining it to lie in the probability simplex: $\|a\|_1 = 1$, $a_i \geq 0$ for $i = 1, \cdots, m$. This would drive the weights of some base learners to zero, which would reduce the variance of the ensemble and make it more interpretable. This constraint is not a necessary condition to achieve the oracle property for the SL. In theory, the oracle inequality requires a bounded loss function, so a LASSO-type constraint is highly advisable (e.g. $\sum_j |a_j| < M$ for some fixed $M$). In practice, we found that imposing a large $M$ leads to better performance.
For small data sets, it is recommended to use cross-validation to compute the optimal ensemble weight vector. However, this takes a long time when the data set and the library are large. In deep learning, a validation set is usually set aside, instead of cross-validation, to assess and tune the models. Similarly, instead of optimizing the V-fold cross-validated loss, we can optimize the single-split cross-validation loss to obtain the ensemble weights, which we call the "single-split (or sample-split) Super Learner". Figure 1 shows the details of this variation of the Super Learner. [Ju et al., 2016] shows the success of such a single-split Super Learner in three large healthcare databases. In this study, we compute the weights of the Super Learner by minimizing the single-split cross-validated loss. This procedure requires almost no additional computation: only one forward pass over all validation images, followed by solving a low-dimensional convex optimization problem.
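A minimal sketch of this single-split procedure, assuming the base learners' validation-set probabilities have already been collected with one forward pass (shapes and names are illustrative, and scipy is an assumed dependency). It combines the predicted probabilities linearly (the "after softmax" variant) and minimizes the negative log-likelihood over the probability simplex; the paper's own implementation may differ.

```python
import numpy as np
from scipy.optimize import minimize

def super_learner_weights(val_probs, val_labels):
    """val_probs: (m, n, K) validation probabilities from m base learners.
    val_labels: (n,) true labels.
    Returns simplex-constrained weights minimizing the negative log-likelihood."""
    m, n, _ = val_probs.shape
    true_class_probs = val_probs[:, np.arange(n), val_labels]   # (m, n)

    def neg_log_lik(a):
        p = a @ true_class_probs          # (n,): ensemble probability of the true label
        return -np.log(p + 1e-12).mean()

    constraints = [{"type": "eq", "fun": lambda a: a.sum() - 1.0}]
    bounds = [(0.0, 1.0)] * m
    res = minimize(neg_log_lik, x0=np.full(m, 1.0 / m),
                   bounds=bounds, constraints=constraints, method="SLSQP")
    return res.x

def ensemble_predict(test_probs, a):
    """Weighted average of the base learners' test-set probabilities -> labels."""
    return np.tensordot(a, test_probs, axes=1).argmax(axis=1)
```

The discrete Super Learner corresponds to restricting the weight vector to the vertices of the simplex, i.e. simply picking the single base learner with the smallest validation loss.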
Figure 1: Single-split (sample-split) Super Learner, which computes the weights on the validation set.
Dropout [Srivastava et al., 2014] removes a certain proportion of the activations (the output of the preceding layer) during training and uses all the activations at test time. It can be seen as training multiple base learners and ensembling them at prediction time. [Veit et al., 2016] discusses how ResNet, a state-of-the-art network structure, can be understood as an exponential ensemble of shallow networks. However, such ensembles might be highly biased, as the meta-learner computes the weights based on the predictions of the base learners (e.g. shallow networks) on the training set. These weights might be biased because the base learners might not make objective predictions on the training set.
In contrast, the Super Learner computes an honest ensemble weight based on the validation set.
A validation set is commonly used to train/tune a neural network. However, it is usually only used
to select a few tuning parameters (e.g. learning rate, weight decay). For most image classification
data sets, the validation set is very large in order to make the validation stable. We thus conjecture
that the potential of the validation information has not been fully exploited.
The Super Learner can be considered as a neural network with a 1 by 1 convolution trained over the validation set, with the scores of the base learners as input. It learns the 1 × 1 × m kernel either by back-propagation or by directly solving the convex optimization problem.
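The 1 × 1 convolution view can be made concrete with a tiny gradient-based meta-learner. The sketch below (PyTorch, with illustrative names) learns one scalar weight per base learner on the validation scores by back-propagation; it optimizes the same kind of negative log-likelihood objective as above, here at the score level and without the simplex constraint.

```python
import torch
import torch.nn as nn

class ScoreEnsemble(nn.Module):
    """Learns one weight per base learner; equivalent to a 1 x 1 x m convolution
    applied to the stacked score 'channels'."""
    def __init__(self, m):
        super().__init__()
        self.a = nn.Parameter(torch.full((m,), 1.0 / m))   # start from uniform weights

    def forward(self, scores):
        # scores: (n, m, K) pre-softmax scores from the m base learners
        return torch.einsum("j,njk->nk", self.a, scores)   # weighted sum of score vectors

def fit_meta_learner(val_scores, val_labels, epochs=200, lr=0.1):
    """val_scores: (n, m, K) tensor, val_labels: (n,) tensor of class indices."""
    model = ScoreEnsemble(val_scores.shape[1])
    opt = torch.optim.SGD(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()     # softmax + negative log-likelihood on combined scores
    for _ in range(epochs):
        opt.zero_grad()
        loss = loss_fn(model(val_scores), val_labels)
        loss.backward()
        opt.step()
    return model
```

Because the meta-learner has only m parameters, overfitting the validation set is unlikely, which is the point made in Figure 2.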
4 Experiment
4.1 Data
The CIFAR-10 data set [Krizhevsky and Hinton, 2009] is a widely used benchmark data set for image recognition. It contains 10 classes of natural images, with 50,000 training images and 10,000 testing images. Each image is an RGB image of size 32 × 32. The 10 classes are: airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class has 5,000 images in the training data and 1,000 images in the testing data.
Figure 2: The Super Learner from a convolutional neural network perspective. The base learners are trained on the training set, and the 1 by 1 convolutional layer is trained on the validation set. The simple structure of the SL avoids overfitting on the validation set.
4.2.2 GoogLeNet
GoogLeNet [Szegedy et al., 2015] is a deep convolutional neural network architecture built from inception modules, which improve computational efficiency. In each inception module, a 1 × 1 convolution is applied as dimension reduction before the expensive larger convolutions. Within each inception module, the computation splits into 4 parallel branches, each with a different convolution size, whose outputs are then concatenated.
Figure 3: An example of an MLP layer in the NIN structure. Note that each convolution is followed by a ReLU layer.
4.3 Training
For all the models, we split the training data into a training set (first 45,000 images) and a validation set (last 5,000 images). There are 10,000 testing images.
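A hedged sketch of this split using torchvision (an assumed toolchain, not necessarily the one used in the paper): indices 0-44,999 of the official CIFAR-10 training set form the training portion and indices 45,000-49,999 the validation portion.

```python
import torchvision
import torchvision.transforms as T
from torch.utils.data import Subset, DataLoader

transform = T.ToTensor()  # minimal preprocessing; the paper's augmentation may differ
full_train = torchvision.datasets.CIFAR10(root="./data", train=True,
                                           download=True, transform=transform)
test_set   = torchvision.datasets.CIFAR10(root="./data", train=False,
                                           download=True, transform=transform)

train_set = Subset(full_train, range(0, 45_000))        # first 45,000 images for training
val_set   = Subset(full_train, range(45_000, 50_000))   # last 5,000 images for validation

train_loader = DataLoader(train_set, batch_size=128, shuffle=True)
val_loader   = DataLoader(val_set,   batch_size=128, shuffle=False)
test_loader  = DataLoader(test_set,  batch_size=128, shuffle=False)
```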
For the Network-in-Network model, we used Adam with learning rate 0.001. We followed the
original paper [Lin et al., 2013], tuning the learning rate and initialization manually. The training
was regularized by an L2 penalty with predefined weight 0.001 and two dropout layers in the middle of the network, with rate 0.5.

Figure 4: An example of an Inception module in GoogLeNet. Note that each convolution is followed by a ReLU layer.
For the VGG net, we slightly modified the training procedure from the original paper [Simonyan and Zisserman, 2014] for the ILSVRC-2013 competitions [Zeiler and Fergus, 2014, Russakovsky et al., 2015]. We used SGD with momentum 0.9. We started with a learning rate of 0.01 and divided it by 10 every 32k iterations. The training was regularized by an L2 penalty with weight $10^{-3}$ and two dropout layers for the first two fully connected layers, with rate 0.5.
For GoogLeNet, we set the base learning rate to 0.05, the weight decay to $10^{-3}$, and the momentum to 0.9. We decreased the learning rate by 4% every 8 epochs. We set the rate to 0.4 for the dropout layer before the last fully connected layer.
For the Residual Network, we followed the training procedure in the original paper [He et al., 2015a]: we applied SGD with a weight decay of 0.0001 and momentum of 0.9. The weights were initialized following the method in [He et al., 2015b], and we applied batch normalization [Ioffe and Szegedy, 2015] without dropout. The learning rate started at 0.1 and was divided by 10 every 32k iterations. We trained the model for 200 epochs.

All the networks were trained with a mini-batch size of 128 for 200 epochs.
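For reference, a sketch of the ResNet optimization setup described above in PyTorch (an assumed framework), reusing the hypothetical `train_loader` from the earlier sketch: SGD with momentum 0.9 and weight decay 1e-4, learning rate 0.1 divided by 10 at the 32k and 64k iteration marks (inferred from "divided by 10 at every 32k iterations"), mini-batch size 128.

```python
import torch

def train_resnet(model, train_loader, epochs=200, device="cuda"):
    model.to(device)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.1,
                                momentum=0.9, weight_decay=1e-4)
    # stepped per iteration, so the milestones are iteration counts, not epochs
    scheduler = torch.optim.lr_scheduler.MultiStepLR(
        optimizer, milestones=[32_000, 64_000], gamma=0.1)
    loss_fn = torch.nn.CrossEntropyLoss()
    for _ in range(epochs):
        for images, labels in train_loader:
            images, labels = images.to(device), labels.to(device)
            optimizer.zero_grad()
            loss = loss_fn(model(images), labels)
            loss.backward()
            optimizer.step()
            scheduler.step()
    return model
```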
4.4 Results
In this section, we compare the empirical performance of all the ensemble methods mentioned above: unweighted averaging (before/after the softmax layer), majority voting, the Bayes Optimal Classifier, and the Super Learner (with negative log-likelihood loss). We also include the discrete Super Learner, with both the negative log-likelihood loss and the 0-1 error loss. For comparison, we list the base learner that achieved the best performance on the testing set, as an empirical oracle.

Figure 5: An example of a residual building block in ResNet.
Table 1: Left: Prediction accuracy on the testing set for ResNet 8 trained for 80, 90, 100, and 110 epochs. Right: Prediction accuracy on the testing set for ResNet 110 trained for 70, 85, 100, and 115 epochs.
Table 1 shows the prediction accuracy for ResNet 8 and ResNet 110 after different numbers of epochs. As ResNet 8 is much shallower, and thus more adaptive during training, we used a smaller interval of 10 epochs between checkpoints. Notice the large accuracy improvement around epoch 100, due to the learning rate decay.
For ResNet 8, the SL is substantially better than naive averaging and majority voting. Checkpoints from earlier stages have worse performance, which deteriorates the performance of naive averaging. The performance of majority voting is even worse than the best base learner, as the majority of the base learners are under-optimized.

For ResNet 110, the performance of all the meta-learners is similar. One possible explanation is that deeper networks are more stable during training.
Table 2: Prediction accuracy on the testing set for ResNet 8 and 110
In this experiment, the weights of the BOCs are dominated by one model, the one that gives the best performance on the validation set. Thus the BOC is equivalent to the discrete Super Learner with the negative log-likelihood loss function. In the experiments, the BOC performed only as well as the best base learner. In the subsequent experiments, all the BOCs showed a similar dominated-weight pattern. Given the practical equivalence with the discrete Super Learner, we do not elaborate further on the BOCs, and we report only the discrete Super Learner's performance.
Table 3: Prediction accuracy on the testing set for ResNet with 8 and 110 layers
We trained 4 networks each for ResNet 8 and ResNet 110. Table 3 shows the performance of the networks. We further studied the performance of all the meta-learners. Shallow networks enjoyed a larger improvement (2.54%) than deeper networks (1.43%) after being ensembled by the Super Learner. Due to the similarity of the models, the SL did not show a great improvement compared to naive averaging. Similarly, majority voting did not work well, which might also be due to the lack of diversity among the base learners. The discrete SL with negative log-likelihood loss successfully selected the best single learner in the library, while the discrete SL with the 0-1 error loss selected a slightly weaker one. This suggests that for finite samples, the Super Learner using the negative log-likelihood loss achieves better prediction accuracy than the Super Learner that uses prediction accuracy as its criterion.

Table 4: Prediction accuracy on the testing set for ensemble methods. The algorithm candidates are ResNets with the same structure but trained several times, where the differences come from randomized initialization and SGD.
Table 5: Prediction accuracy on the testing set for networks with different structures

Table 6: Cross-entropy on the testing set for networks with different structures
Model Cross-entropy
NIN 0.5779
VGG 1.5649
ResNet 32 1.5442
ResNet 44 1.5341
ResNet 56 1.5327
ResNet 110 1.5242
Although NIN achieves the lowest cross-entropy among the networks, it gives worse prediction accuracy. This is due to its prediction behavior: we look at the predicted probability of the true labels for the images in the testing set:
Table 7: Cross-entropy on the testing set for networks with different structure
Table 8: Prediction accuracy on the testing set for ensemble methods. The algorithm candidates include NIN, VGG, ResNet 32, ResNet 44, and ResNet 56. We compare the performance with/without the over-confident NIN network.
Table 9: Prediction accuracy on the testing set for ensemble methods. The algorithm candidates include VGG, ResNet 32, ResNet 44, and ResNet 56. We compare the performance without GoogLeNets and with three or five under-optimized GoogLeNets.
Ensemble Without GoogLeNet With 3 GoogLeNets With 5 GoogLeNets
Best Base Learner 0.9399 0.9399 0.9399
SuperLearner 0.9475 0.9477 0.9477
Discrete SuperLearner (nll) 0.9399 0.9399 0.9399
Discrete SuperLearner (error) 0.9399 0.9399 0.9399
BOC (before softmax) 0.9399 0.9399 0.9399
BOC (after softmax) 0.9399 0.9399 0.9399
Unweighted Average (before softmax) 0.9456 0.9326 0.9001
Unweighted Average (after softmax) 0.9455 0.9329 0.9007
Majority Vote 0.9433 0.9263 0.8720
In this experiment, adding many weaker candidates deteriorated the performance of the unweighted average. Majority voting was only slightly affected when there were just a few weak learners, but was dominated when the number of weak learners was large. Unweighted averaging also failed in this case. The BOCs remained unchanged, as the likelihood on the validation set was still dominated by the same base learner. The Super Learner showed exciting success in this setting: its prediction accuracy remained stable with the extra weak learners.
Table 10: Prediction accuracy on the testing set for all the ensemble methods using all the networks
mentioned in this study as base learners.
Ensemble Accuracy
Best base learner 0.9399
SuperLearner 0.9502
Discrete SuperLearner (nll) 0.9395
Discrete SuperLearner (error) 0.9395
BOC (before softmax) 0.9395
BOC (after softmax) 0.9395
Unweighted Average (before softmax) 0.9444
Unweighted Average (after softmax) 0.9448
Majority Vote 0.9410
Table 10 shows the performance of all the ensemble methods as well as the base learner with the best performance. Due to the large proportion of weak learners (e.g. the under-fitted GoogLeNets and the networks trained with fewer iterations in the first experiment) and the over-confident learner (NIN), all the other ensemble methods performed much worse than the Super Learner. This is another strength of the Super Learner: by simply putting all the potential base learners into the library, the Super Learner computes the weights data-adaptively, without requiring any tedious pre-selection procedure based on human experience.
4.5 Discussion
We studied the relative performance of several widely used ensemble methods with deep convolutional neural networks as base learners on the CIFAR-10 data set, a commonly used benchmark for image classification. Unweighted averaging proved surprisingly successful when the base learners have comparable performance. It outperformed majority voting in almost all the experiments. However, unweighted averaging proved to be sensitive to over-confident candidates. The Super Learner addressed this issue by optimizing the ensemble weights on the validation set in a data-adaptive manner. This ensemble structure can be considered as a 1 × 1 convolution layer stacked on the output of the base learners. It adaptively assigns weights to the base learners, which enables the ensemble to improve the prediction even in the presence of weak learners.
The Super Learner was proposed as a cross-validation based ensemble method. However, since CNNs are computationally intensive and validation sets are typically large in image recognition tasks, we used the networks' validation set to compute the weights of the Super Learner (single-split cross-validation) instead of conventional multi-fold cross-validation. The structure is simple and can be easily extended. One potential extension of the linearly weighted Super Learner would be stacking several 1 × 1 convolutions with non-linear activation layers in between. This structure could mimic cascading/hierarchical ensembles [Wang et al., 2014, Su et al., 2009]. Due to the small number of parameters, we hope such a meta-learner would not overfit the validation set and would thus help improve the prediction. However, this involves non-convex optimization and the results might not be stable. We leave this as future work.
References
D. Benkeser, S. D. Lendle, C. Ju, and M. J. van der Laan. Online cross-validation-based ensemble learning. U.C. Berkeley Division of Biostatistics Working Paper Series, Working Paper 355, 2016.
J. O. Berger and M. Bock. Combining independent normal mean estimation problems with un-
known variances. The Annals of Statistics, pages 642–648, 1976.
A. Chambaz, W. Zheng, and M. van der Laan. Data-adaptive inference of the optimal treatment rule and its mean reward: the masked bandit. U.C. Berkeley Division of Biostatistics Working Paper Series, 2016.
M. M. Davies and M. J. van der Laan. Optimal spatial prediction using ensemble machine learning.
The international journal of biostatistics, 12(1):179–201, 2016.
B. Efron and C. Morris. Combining possibly related estimation problems. Journal of the Royal
Statistical Society. Series B (Methodological), pages 379–421, 1973.
Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96,
pages 148–156, 1996.
E. J. Green and W. E. Strawderman. A james-stein type estimator for combining unbiased and
possibly biased estimators. Journal of the American Statistical Association, 86(416):1001–1006,
1991.
A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 855–864. ACM, 2016.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint
arXiv:1512.03385, 2015a.
K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1026–1034, 2015b.
T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. van der Laan. Survival ensembles.
Biostatistics, 7(3):355–373, 2006.
S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.
A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical
report, University of Toronto., 2009.
L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, and R. P. Duin. Limits on the majority vote accuracy
in classifier fusion. Pattern Analysis & Applications, 6(1):22–31, 2003.
M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.
A. R. Luedtke and M. J. van der Laan. Super-learning of an optimal dynamic treatment rule. The
international journal of biostatistics, 12(1):305–332, 2016.
M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural ma-
chine translation. arXiv preprint arXiv:1508.04025, 2015.
T. M. Mitchell. Machine Learning. McGraw Hill, Burr Ridge, IL, 1997.
B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 701–710. ACM, 2014.
R. Pirracchio, M. L. Petersen, M. Carone, M. R. Rigon, S. Chevret, and M. J. van der Laan. Mortal-
ity prediction in intensive care units with the super icu learner algorithm (sicula): a population-
based study. The Lancet Respiratory Medicine, 3(1):42–52, 2015.
E. C. Polley and M. J. Van Der Laan. Super learner in prediction. U.C. Berkeley Division of
Biostatistics Working Paper Series., 2010.
J. Rao and K. Subrahmaniam. Combining independent estimators and estimation in linear regres-
sion with unequal variances. Biometrics, pages 971–990, 1971.
D. B. Rubin and S. Weisberg. The variance of a linear combination of independent estimators
using estimated weights. Biometrika, 62(3):708–709, 1975.
O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of
Computer Vision, 115(3):211–252, 2015.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.
S. E. Sinisi, E. C. Polley, M. L. Petersen, S.-Y. Rhee, and M. J. van der Laan. Super learning: an
application to the prediction of hiv-1 drug resistance. Statistical applications in genetics and
molecular biology, 6(1), 2007.
N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):
1929–1958, 2014.
Y. Su, S. Shan, X. Chen, and W. Gao. Hierarchical ensemble of global and local classifiers for face
recognition. IEEE Transactions on Image Processing, 18(8):1885–1896, 2009.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–9, 2015.
M. J. Van Der Laan and S. Dudoit. Unified cross-validation methodology for selection among
estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle
inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series., 2003.
M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical applications in
genetics and molecular biology, 6(1), 2007.
A. Veit, M. Wilber, and S. Belongie. Residual networks are exponential ensembles of relatively
shallow networks. arXiv preprint arXiv:1605.06431, 2016.