The Relative Performance of Ensemble Methods with Deep Convolutional Neural Networks for Image Classification
Cheng Ju and Aurélien Bibaut and Mark J. van der Laan
arXiv:1704.01664v1 [stat.ML] 5 Apr 2017

Abstract
Artificial neural networks have been successfully applied to a variety of machine learning
tasks, including image recognition, semantic segmentation, and machine translation. However,
few studies fully investigated ensembles of artificial neural networks. In this work, we inves-
tigated multiple widely used ensemble methods, including unweighted averaging, majority
voting, the Bayes Optimal Classifier, and the (discrete) Super Learner, for image recognition
tasks, with deep neural networks as candidate algorithms. We designed several experiments,
with the candidate algorithms being the same network structure with different model check-
points within a single training process, networks with same structure but trained multiple times
stochastically, and networks with different structure. In addition, we further studied the over-
confidence phenomenon of the neural networks, as well as its impact on the ensemble methods.
Across all of our experiments, the Super Learner achieved best performance among all the en-
semble methods in this study.

1 Introduction
Ensemble learning methods train several baseline models, and use some rules to combine them
together to make predictions. The ensemble learning methods have gained popularity because
of their superior prediction performance in practice. Consider a prediction task with some fixed
data generating mechanism. The performance of a particular learner depends on how effective its
searching strategy is in approximating the optimal predictor defined by the true data generating
distribution [van der Laan et al., 2007]. In theory, the relative performance of various learners
will depend on the model assumptions and the true data-generating distribution. In practice, the
performance of the learners will depend on the sample size, dimensionality, and the bias-variance
trade-off of the model. Thus it is generally impossible to know a priori which learner would
perform best given the finite sample data set and prediction problem [van der Laan et al., 2007].
One widely used method is to use cross-validation to give an “objective” and “honest” assessment
of each learner, and then select the single algorithm that achieves the best validation performance.
This is known as the discrete Super Learner selector [Van Der Laan and Dudoit, 2003, van der
Laan et al., 2007, Polley and Van Der Laan, 2010], which asymptotically performs as well as the
best base learner in the library, even as the number of candidates grows polynomially in the sample size.
Instead of selecting one algorithm, another approach to guarantee the predictive performance
is to compute the optimal convex combination of the base learners. The idea of ensemble learning,
which combines predictors instead of selecting a single predictor, is well studied in the literature:
[Breiman, 1996b] summarized and referred to several related studies [Rao and Subrahmaniam, 1971,
Efron and Morris, 1973, Rubin and Weisberg, 1975, Berger and Bock, 1976, Green and Strawderman,
1991] about the theoretical properties of ensemble learning. Two widely used ensemble
techniques are bagging [Breiman, 1996a] and boosting [Freund et al., 1996, Freund and Schapire,
1997, Friedman, 2001]. Bagging uses bootstrap aggregation to reduce the variance for the strong
learners, while boosting algorithms “boost” the capacity of the weak learners. [Wolpert, 1992,
Breiman, 1996b] proposed a linear combination strategy called stacking to ensemble the models.
[van der Laan et al., 2007] further extended stacked generalization with a cross-validation based
optimization framework called Super Learner, which finds the optimal combination of a collection
of prediction algorithms by minimizing the cross-validated risk. Recently, the Super Learner
has shown great success in a variety of areas, including precision medicine [Luedtke and van der
Laan, 2016], mortality prediction [Pirracchio et al., 2015, Chambaz et al., 2016], online learning
[Benkeser et al., 2016], and spatial prediction [Davies and van der Laan, 2016].
In recent years, deep artificial neural networks (ANNs) have led to a series of breakthroughs
in a variety of tasks. ANNs have shown great success in almost all machine learning related chal-
lenges across different areas, like computer vision [Krizhevsky et al., 2012, Szegedy et al., 2015,
He et al., 2015a], machine translation [Luong et al., 2015, Cho et al., 2014], and social network
analysis [Perozzi et al., 2014, Grover and Leskovec, 2016]. Due to their high capacity/flexibility,
deep neural networks usually have high variance and low bias. In practice, model averaging with
multiple stochastically trained networks is commonly used to improve the predictive performance.
[Krizhevsky et al., 2012] won the first place in the image classification challenge of ILSVRC 2012,
by averaging 7 CNNs with same structure. [Simonyan and Zisserman, 2014] won the first place in
classification and localization challenge in ILSVRC 2014 with averaging of multiple deep CNNs.
[He et al., 2015a] won the first place using six models of Residual Network with different depth
to form an ensemble in ILSVRC 2015. In addition, [He et al., 2015a] also won the ImageNet
detection task in ILSVRC 2015 with the ensemble of 3 residual network models.
However, the behavior of ensemble learning with deep networks is still not well studied and
understood. First, most of the neural networks literature focuses mainly on the design of the
network structure, and only applies naive averaging ensemble to enhance the performance. To the
best of our knowledge, no detailed work investigates, compares and discusses ensemble methods
for deep neural networks. Naive unweighted averaging, which is largely used, is not data-adaptive
and thus vulnerable to a “bad” library of base learners: it works well for networks with similar
structure and comparable performance, but it is sensitive to the presence of excessively biased base
learners. This issue could be easily addressed by a cross-validation based data-adaptive ensemble
like Bayes Optimal Classifier and Super Learner. In later sections, we investigate and compare the
performance of four commonly used ensemble methods on an image classification task, with deep
convolutional neural networks (CNNs) as base learners.
This study mainly focuses on the comparison of ensemble methods of CNNs for image recog-
nition. For readers who are not familiar with deep learning, each CNN could be just treated as a
black-box estimator, with an image as input, and outputs the probability vector for each possible
class. We refer the interested reader to [LeCun et al., 2015, Goodfellow et al., 2016] for more
details about deep learning.

2 Background
In this paper, “algorithm candidate”, “hypothesis”, and “base learner” refer to an individual learner
(here a deep CNN) used in an ensemble. The term ’library’ refers to the set of the base learners for
the ensemble methods.

2.1 Unweighted Average


Unweighted averaging is the most common ensemble approach for neural networks. It takes un-
weighted average of the output score/probability for all the base learners, and reports it as the
predicted score/probability.
Due to the high capacity of deep neural networks, simple unweighted averaging improves the
performance substantively. Taking the average of multiple networks reduces the variance, as deep
ANNs have high variance and low bias. If the models are uncorrelated enough, the variance of
models could be dramatically reduced by averaging. This idea inspires Random Forest [Breiman,
2001], which builds less correlated trees by bootstrapping observations and sampling features.
We could average either directly the score output, or the predicted probability after softmax
transformation:
$$ p_{ij} = \mathrm{softmax}(\vec{s}_i)[j] = \frac{\exp(\vec{s}_i[j])}{\sum_{k=1}^{K} \exp(\vec{s}_i[k])}, $$
where the score vector $\vec{s}_i$ is the output from the last layer of the neural network for the i-th unit, $\vec{s}_i[k]$
is the score corresponding to the k-th class/label, and $p_{ij}$ is the predicted probability for unit i in class
j. It is more reasonable to average after the softmax transformation, as the score outputs from
different networks might differ in magnitude. Indeed, adding a constant to the scores for all the
classes leaves the predicted probability unchanged. In this study, we compared both naive averaging
of the scores and averaging of their softmax-transformed counterparts (i.e. the probabilities).
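The two averaging variants can be sketched in a few lines of plain Python. This is only an illustration: the function names and the example scores are hypothetical and not taken from the study.

```python
import math

def softmax(scores):
    """Numerically stable softmax of one score vector."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def average_scores(score_vectors):
    """Unweighted average taken before the softmax (on raw scores)."""
    n = len(score_vectors)
    return [sum(col) / n for col in zip(*score_vectors)]

def average_probabilities(score_vectors):
    """Unweighted average taken after the softmax (on probabilities)."""
    probs = [softmax(s) for s in score_vectors]
    n = len(probs)
    return [sum(col) / n for col in zip(*probs)]

# Hypothetical scores from two base learners for one image, K = 3 classes.
# The second learner outputs scores on a much larger scale.
scores = [[2.0, 1.0, 0.0], [0.0, 30.0, 10.0]]
before = average_scores(scores)         # [1.0, 15.5, 5.0]
after = average_probabilities(scores)   # a probability vector summing to 1
```

Averaging the raw scores lets the learner with the larger score scale dominate the result, while averaging after the softmax puts every learner on the same probability scale first.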
Unweighted averaging might be a reasonable ensemble for similar base learners of comparable
performance, as the deep learning literature suggests [Simonyan and Zisserman, 2014, Szegedy
et al., 2015, He et al., 2015a]. However, when the library contains heterogeneous networks, the
naive unweighted averaging may not be a smart choice. It is vulnerable to the weaker learners in the
library, and sensitive to the over-confident candidate (We will explain further the over-confidence
phenomenon in later sections.). A good meta-learner should be intelligent enough to combine the
strength of base learners data-adaptively. Heuristically, some networks might have weak overall
prediction strength, but can be good at discriminating certain subclasses (e.g. fine-grained classi-
fier). We hope the meta-learner could combine the strengths of all the base learners, thus yielding
a better strategy.

2.2 Majority Voting


Majority voting is similar to unweighted averaging. But instead of averaging over the output
probability, it counts the votes of all the predicted labels from the base learners, and makes a final
prediction using label with most votes. Or equivalently, it takes an unweighted average using the
label from base learners and chooses the label with the largest value.

Compared to naive averaging, majority voting is less sensitive to the output from a single net-
work. However, it would still be dominated if the library contains multiple similar and dependent
base learners. Another weakness of majority voting is the loss of information, as it only uses the
predicted label.
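As a minimal sketch of the difference between the two rules, the snippet below compares majority voting with unweighted averaging on made-up probabilities. The helper names and the numbers are assumptions for illustration only.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])

def majority_vote(prob_vectors):
    """Label predicted by the largest number of base learners.

    Ties are resolved by Counter ordering, an arbitrary choice for this sketch.
    """
    labels = [argmax(p) for p in prob_vectors]
    return Counter(labels).most_common(1)[0][0]

def unweighted_average_label(prob_vectors):
    """Label with the largest averaged probability."""
    n = len(prob_vectors)
    mean = [sum(col) / n for col in zip(*prob_vectors)]
    return argmax(mean)

# Hypothetical predictions from three base learners for one image.
probs = [
    [0.90, 0.10, 0.00],   # one very confident vote for class 0
    [0.40, 0.60, 0.00],   # mild vote for class 1
    [0.45, 0.55, 0.00],   # mild vote for class 1
]
vote_label = majority_vote(probs)            # 1: two of three learners pick class 1
mean_label = unweighted_average_label(probs) # 0: the confident learner dominates
```

The vote ignores how confident each learner is, which protects it from a single extreme output but also discards the information carried by the probabilities.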
[Kuncheva et al., 2003] showed that pairwise dependence plays an important role in majority
voting. For image classification, shallow networks usually give more diverse predictions compared
to deeper networks [Choromanska et al., 2015]. Thus we hypothesize majority voting would yield
a greater improvement over base learners with a library of shallow networks than with a library of
deep networks.

2.3 Bayes Optimal Classifier


In a classification problem, it can be shown that the function f of the predictors x that minimizes
the misclassification rate $E[I(f(x) \neq y)]$ is the so-called Bayes classifier. It is given by $f(x) =
\arg\max_y P[y|x]$. It is fully characterized by the data-generating distribution P.
In the Bayesian voting approach, each base learner $h_j$ is viewed as a hypothesis made on
the functional form of the conditional distribution of y given x. More formally, denoting by $S_{train}$
our training sample and by (x, y) a new data point, we write $h_j(y|x) = P[y|x, h_j, S_{train}]$. This is
the value of the hypothesis $h_j$, which is trained on $S_{train}$, evaluated at (y, x). The Bayesian voting
approach requires a prior distribution that, for each j, models the probability $P(h_j)$ that the
hypothesis $h_j$ is correct. Using the Bayes rule, one readily obtains that

$$ P(y|x, S_{train}) \propto \sum_{h_j} P[y|h_j, x, S_{train}] \, P[S_{train}|h_j] \, P[h_j]. \tag{1} $$

This motivates the definition of the Bayesian Optimal classifier as

$$ \arg\max_y \sum_{h_j} h_j(y|x) \, P[S_{train}|h_j] \, P[h_j]. \tag{2} $$

Note that $P[S_{train}|h_j] = \prod_{(y,x) \in S_{train}} h_j(y|x)$ is the likelihood of the data under the hypothesis $h_j$.
However, this quantity might not reflect well the quality of the hypothesis, since the likelihood of
the training sample is subject to overfitting. To give an “honest” estimate, we could split the
training data into two sets, one for model training and the other for computing $P[S_{train}|h]$. For
neural networks, a validation set (distinct from the testing set) is usually set aside only to tune a
few hyper-parameters, thus the information in it is not fully exploited. We expect that using such
a validation set would provide a good estimate of the likelihood $P[S_{train}|h]$. Finally, we would
assess the model using the untouched testing set.
The second difficulty in BOC is choosing the prior probability $P(h_j)$ for each hypothesis. For
simplicity, the prior is usually set to be the uniform distribution [Mitchell, 1997].
[Dietterich, 2000] observed that, when the sample size is large, one hypothesis typically tends
to have a much larger posterior probability than others. We will see in the later section that when
the validation set is large, the posterior weight is usually dominated by only one hypothesis (base
learner). As the weights are proportional to the likelihood on the validation set, if the weight
vector is dominated by a single algorithm, BOC would be the same selector as the
discrete Super Learner selector with negative log-likelihood loss function [van der Laan et al., 2007].
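The domination effect can be illustrated with a small sketch that computes the posterior weights in equation (2) from held-out likelihoods under a uniform prior. The function name `boc_weights` and the per-observation probabilities 0.80 and 0.78 are hypothetical choices, not values from the experiments.

```python
import math

def boc_weights(true_class_probs, prior=None):
    """Posterior weight of each hypothesis, proportional to prior times likelihood.

    true_class_probs[j][i] is the probability that learner j assigns to the
    true label of held-out observation i. Sums of logs are used so that the
    product over a large validation set does not underflow.
    """
    m = len(true_class_probs)
    prior = prior or [1.0 / m] * m
    log_post = [
        math.log(prior[j]) + sum(math.log(p) for p in true_class_probs[j])
        for j in range(m)
    ]
    top = max(log_post)
    unnorm = [math.exp(lp - top) for lp in log_post]
    total = sum(unnorm)
    return [u / total for u in unnorm]

# Two hypothetical learners that differ only slightly on every observation.
w_small = boc_weights([[0.80] * 10, [0.78] * 10])      # small validation set
w_large = boc_weights([[0.80] * 1000, [0.78] * 1000])  # large validation set
```

With 10 observations the two weights stay fairly close to each other, but with 1000 observations the slightly better learner receives essentially all of the weight, which is why BOC then behaves like a selector.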

2.4 Stacked Generalization
The idea of stacking was originally proposed in [Wolpert, 1992], which concludes stacking works
by deducing the biases of the generalizer(s) with respect to a provided learning set. [Breiman,
1996b] also studied stacked regression by using cross-validation to construct the ’good’ combina-
tion.
Consider a linear stacking for the prediction task. The basic idea of stacking is to ’stack’ the
predictions $f_1, \dots, f_m$ by linear combination with weights $a_i$, $i \in \{1, \dots, m\}$:

$$ f_{stacking}(x) = \sum_{i=1}^{m} a_i f_i(x) $$

where the weight vector a is learned by a meta-learner.

3 Super Learner: a Cross-validation based Stacking


Super Learner [van der Laan et al., 2007] is an extension of stacking. It is a cross-validation based
ensemble framework, which minimizes cross-validated risk for the combination. The original
paper [van der Laan et al., 2007] demonstrated the finite sample and asymptotic properties of the
Super Learner. The literature shows its application to a wide range of topics, e.g. survival analysis
[Hothorn et al., 2006], clinical trial [Sinisi et al., 2007], and mortality prediction [Pirracchio et al.,
2015]. It combines the base learners by cross-validation. Here is an example of SL with V-fold
cross-validation with m base learners for binary prediction. We first define the cross-validated loss
for the j-th base learner:

$$ R_{CV}^{(j)} = \sum_{v=1}^{V} \sum_{i \in val(v)} l\left(y_i, p_{ji}^{-v}\right) $$

where val(v) is the set of indices of the observations in the v-th fold, and $p_{ji}^{-v}$ is defined as the
prediction for the i-th observation from the j-th base learner trained on the whole data except
the v-th fold. Then we have

$$ R_{CV}(\vec{a}) = \sum_{v=1}^{V} \sum_{i \in val(v)} l\left(y_i, \sum_{j=1}^{m} a_j p_{ji}^{-v}\right) $$

where $\vec{a} = [a_1, \dots, a_m]$ is the weight vector. The optimal weight vector given by the Super
Learner is then

$$ \vec{a} = \arg\min_{\vec{a}} R_{CV}(\vec{a}) $$

For simplicity, we consider the binary classification task, which could be easily generalized to
multi-class classification and regression. We first study a simple version of the Super Learner with
m single algorithms, using negative (Bernoulli) log-likelihood as loss function:

l(y, p) = −[y log(p) + (1 − y) log(1 − p)].

Thus the cross-validated loss is:

$$ R_{CV}(\vec{a}) = -\sum_{v=1}^{V} \sum_{i \in val(v)} \left[ y_i \log\left(\sum_{j=1}^{m} a_j p_{ji}^{-v}\right) + (1 - y_i) \log\left(1 - \sum_{j=1}^{m} a_j p_{ji}^{-v}\right) \right] $$

where $p_{ji}^{-v}$ is the predicted probability for the i-th unit from the j-th base learner, which is trained on the
whole data except the v-th fold.
In addition, stacking on the logit scale usually gives much better performance in practice. In
other words, we use the optimal linear combination before softmax transformation:

$$ R_{CV}(\vec{a}) = \sum_{v=1}^{V} \sum_{i \in val(v)} l\left(y_i, \mathrm{expit}\left(\sum_{j=1}^{m} a_j \, \mathrm{logit}(p_{ji}^{-v})\right)\right) $$

For K-class classification with softmax output, as in neural networks, we could also ensemble at
the score level:

$$ p_{zi}(\vec{a}) = -\log\left( \frac{\exp\left(\sum_{j=1}^{m} a_j \cdot s_i[j, z]\right)}{\sum_{k=1}^{K} \exp\left(\sum_{j=1}^{m} a_j \cdot s_i[j, k]\right)} \right) $$

where $p_{zi}(\vec{a})$ is the ensemble prediction for the i-th unit and z-th class with weight vector $\vec{a}$, expressed
as the negative log of the softmax probability. $s_i$ is an m by K matrix, and $s_i[j, k]$ stands for the
score of the j-th model and k-th class.
We can impose restrictions on a, such as constraining it to lie in a probability simplex:

$$ \|a\|_1 = 1, \quad a_i \geq 0, \quad \text{for } i = 1, \dots, m. $$

This would drive the weights of some base learners to zero, which would reduce the variance
of the ensemble and make it more interpretable. This constraint is not a necessary condition to
achieve the oracle property for SL. In theory, the oracle inequality requires a bounded loss function,
so the LASSO constraint is highly advisable (e.g. $\sum_j |a_j| < M$ for some fixed M). In practice, we
found that imposing a large M leads to better performance.
For small data sets, it is recommended to use cross-validation to compute the optimal ensemble
weight vector. However this takes a long time when the data set and the library are large. Usually
people just set aside a validation set, instead of cross-validation, to assess and tune the models
for deep learning. Similarly, instead of optimizing the V-fold cross-validated loss, we could
optimize the single-split cross-validated loss to get the ensemble weights, which is the
so-called “single split (or sample split) Super Learner”. Figure 1 shows the details of this variation
of the Super Learner. [Ju et al., 2016] shows the success of such a single split Super Learner in three
large healthcare databases. In this study, we compute the weights of Super Learner by minimizing
the single-split cross-validated loss. This procedure necessitates almost no additional computa-
tion: only one forward pass for all validation images and then solving a low-dimensional convex
optimization.
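The single-split procedure can be sketched as follows, under stated assumptions. The text does not name the solver used for the convex problem; this sketch swaps in exponentiated-gradient updates, which keep the weights on the probability simplex, and it combines predicted probabilities rather than logits. All function names and the validation probabilities are hypothetical.

```python
import math

def nll(weights, q):
    """Mean negative log-likelihood of the weighted ensemble on the validation set.

    q[j][i] is the probability learner j assigns to the true label of
    validation observation i.
    """
    n = len(q[0])
    total = 0.0
    for i in range(n):
        mix = sum(w * q[j][i] for j, w in enumerate(weights))
        total += math.log(mix)
    return -total / n

def super_learner_weights(q, steps=500, lr=0.5):
    """Minimize the validation NLL over the simplex by exponentiated gradient."""
    m, n = len(q), len(q[0])
    a = [1.0 / m] * m
    for _ in range(steps):
        mix = [sum(a[j] * q[j][i] for j in range(m)) for i in range(n)]
        grad = [-sum(q[j][i] / mix[i] for i in range(n)) / n for j in range(m)]
        a = [a[j] * math.exp(-lr * grad[j]) for j in range(m)]
        norm = sum(a)
        a = [w / norm for w in a]   # renormalize so the weights sum to 1
    return a

def discrete_super_learner(q):
    """Index of the single learner with the smallest validation NLL."""
    m = len(q)
    losses = [nll([1.0 if k == j else 0.0 for k in range(m)], q) for j in range(m)]
    return min(range(m), key=lambda j: losses[j])

# Hypothetical validation probabilities for 3 learners and 4 observations.
q = [
    [0.9, 0.8, 0.2, 0.4],   # good on the first two observations
    [0.3, 0.2, 0.9, 0.8],   # good on the last two observations
    [0.4, 0.4, 0.4, 0.4],   # uniformly mediocre
]
weights = super_learner_weights(q)
best_single = discrete_super_learner(q)
```

In this toy example the first two learners complement each other, so the weighted ensemble reaches a lower validation loss than the learner picked by the discrete selector, and the mediocre learner receives a weight close to zero.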

3.1 Super Learner From a Neural Network Perspective


Many neural network structures can be viewed as ensemble learning. One of the commonly
used regularization methods for deep neural networks, dropout [Srivastava et al., 2014], randomly
removes a certain proportion of the activations (the output from the last layer) during training
and uses all the activations at test time. It could be seen as training multiple base learners and
ensembling them during prediction. [Veit et al., 2016] argues that ResNet, a state-of-the-art network
structure, could be understood as an exponential ensemble of shallow networks. However, such
ensembles might be highly biased, as the meta-learner computes the weights based on the predictions
of the base learners (e.g. shallow networks) on the training set. These weights might be biased
as the base learners might not make objective predictions on the training set.

[Figure 1: Single Split (Sample Split) Super Learner, which computes the weights on the validation
set. The whole data set is split into a training set (for training all candidate estimators/algorithms),
a validation set (for tuning and SL), and a testing set (for final evaluation).]
In contrast, the Super Learner computes an honest ensemble weight based on the validation set.
A validation set is commonly used to train/tune a neural network. However, it is usually only used
to select a few tuning parameters (e.g. learning rate, weight decay). For most image classification
data sets, the validation set is very large in order to make the validation stable. We thus conjecture
that the potential of the validation information has not been fully exploited.
The Super Learner could be considered as a neural network with 1 by 1 convolution over the
validation set, with the scores of the base learners as input. It learns the 1 × 1 × m kernel either by
back-propagation, or through directly solving the convex optimization problem.
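To make the 1 by 1 convolution view concrete, the sketch below applies one weight per base learner to an m by K score matrix and returns a K-vector of combined scores. The helper `combine_scores` and the numbers are hypothetical.

```python
import math

def combine_scores(score_matrix, weights):
    """Apply a 1 x 1 x m kernel: one weight per learner, shared across the K classes.

    score_matrix has one row per base learner and one column per class.
    """
    num_classes = len(score_matrix[0])
    return [
        sum(w * row[k] for w, row in zip(weights, score_matrix))
        for k in range(num_classes)
    ]

def softmax(scores):
    """Numerically stable softmax of one score vector."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical scores from m = 2 learners over K = 3 classes for one image.
s_i = [[2.0, 0.0, -1.0], [0.5, 1.5, 0.0]]
kernel = [0.25, 0.75]
combined = combine_scores(s_i, kernel)   # [0.875, 1.125, -0.25]
probabilities = softmax(combined)
```

Because the kernel has only m parameters, the meta-learner has very little capacity, which is consistent with the remark that its simple structure limits overfitting on the validation set.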

4 Experiment
4.1 Data
The CIFAR-10 data set [Krizhevsky and Hinton, 2009] is a widely used benchmark data set for
image recognition. It contains 10 classes of natural images, with 50,000 training images and
10,000 testing images. Each image is an RGB image of size 32 × 32. The 10 classes are
airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck. Each class has
5,000 images in the training data and 1,000 images in the testing data.

[Figure 2: Super Learner from a convolutional neural network perspective. The scores from
Network 1, ..., Network m form an m by K by 1 score tensor, which a 1 by 1 convolution maps to a
K by 1 score vector. The base learners are trained on the training set, and the 1 by 1 convolutional
layer is trained on the validation set. The simple structure of SL avoids overfitting on the
validation set.]

4.2 Network description


4.2.1 Network in Network
The network in network (NIN) structure [Lin et al., 2013] consists of mlpconv (MLP) layers, which
use multilayer perceptrons to convolve the input. Each MLP layer consists of one convolution
layer with a larger kernel size, followed by two 1 × 1 convolution layers and a max pooling layer. In
addition, it uses a global average pooling layer as a replacement for the fully connected layers in
conventional neural networks.

4.2.2 GoogLeNet
GoogLeNet [Szegedy et al., 2015] is a deep convolutional neural network architecture based on
the inception module, which improved the computational efficiency. In each inception module, a
1 × 1 convolution is applied as dimension reduction before expensive large convolutions. Within
each inception module, the propagation splits into 4 flows, each with different convolution size,
and is then concatenated.

4.2.3 VGG Network


VGG net [Simonyan and Zisserman, 2014] is a neural network structure using an architecture with
very small (3 × 3) convolution filters, which won the first and the second places in the localization
and classification tracks for ImageNet Challenge 2014 respectively. Each block is made by several
consecutive 3 × 3 convolutions and followed by a max pooling layer. The number of filters for each
convolution increases as the network goes deeper. Finally there are three fully connected layers
before the softmax transformation.
In this study, we only used VGG net D with 16 layers [Simonyan and Zisserman, 2014]. We
denote it as VGG net for simplicity in the later sections.

[Figure 3: An example of an MLP layer in the NIN structure. Notice that each convolution is followed
by a ReLU layer. Blocks shown, from Previous Layer to Next Layer: 5 x 5 conv, 1 x 1 conv,
1 x 1 conv, 3 x 3 conv, 3 x 3 max pooling.]

4.2.4 Residual Network


Residual Network [He et al., 2015a] is a network structure built by stacking multiple “bottleneck”
building blocks. Figure 5 shows an example of a so-called bottleneck building block, made of two
regular layers (e.g. convolution layers). In the original study [He et al., 2015a], each bottleneck
building block is made of three convolutional layers, with kernel sizes 1, 3, and 1. Similar to NIN
and GoogLeNet, it uses 1 × 1 convolution as dimension reduction to reduce the computation. There
is a parameter-free identity shortcut from the starting layer to the final output of each bottleneck
block. It solves the degradation problem for deep networks and makes training a very deep neural
network possible.
In later sections, we follow the same structure from the original paper for CIFAR-10 data: we
use a stack of 6n layers with 3 × 3 convolutions. The sizes of the feature maps are {32, 16, 8}
respectively, with 2n layers for each feature map size [He et al., 2015a]. There would be 6n + 2
layers including the softmax layer. For example, ResNet with n = 5 has 32 layers in total.
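The layer count can be checked with a short helper. The values of n listed below are derived from the 6n + 2 rule for the depths that appear in this study; they are not stated explicitly in the text.

```python
def resnet_cifar_depth(n):
    """Total number of layers for the CIFAR-10 ResNet: 6n convolutions plus 2."""
    return 6 * n + 2

# Values of n implied by the depths used in the experiments.
depths = {n: resnet_cifar_depth(n) for n in (1, 5, 7, 9, 18)}
# {1: 8, 5: 32, 7: 44, 9: 56, 18: 110}
```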

4.3 Training
For all the models, we split the training data into a training set (first 45,000 images) and a validation set
(last 5,000 images). There are 10,000 testing images.
For the Network-in-Network model, we used Adam with learning rate 0.001. We followed the
original paper [Lin et al., 2013], tuning the learning rate and initialization manually. The training
was regularized by L-2 penalty with predefined weight 0.001 and two dropout layers in the middle
of the network, with rate 0.5.

[Figure 4: An example of an Inception module for GoogLeNet. Notice that each convolution is followed
by a ReLU layer. Four branches run from the previous layer to the filter concatenation: 1 x 1 conv;
1 x 1 conv then 3 x 3 conv; 1 x 1 conv then 5 x 5 conv; 3 x 3 max pooling then 1 x 1 conv.]

For VGG net, we slightly modified the training procedure in the original paper [Simonyan and
Zisserman, 2014] for ILSVRC-2013 competitions [Zeiler and Fergus, 2014, Russakovsky et al.,
2015]. We used SGD with momentum 0.9. We started with learning rate 0.01 and divided it
by 10 at every 32k iterations. The training was regularized by L-2 penalty with weight $10^{-3}$ and two
dropout layers for the first two fully connected layers, with rate 0.5.
For GoogLeNet, we set the base learning rate to 0.05, the weight decay to $10^{-3}$, and the momentum to 0.9.
We decreased the learning rate by 4% every 8 epochs. We set the rate to 0.4 for the dropout layer
before the last fully connected layer.
For the Residual Network, we follow the training procedures in the original paper [He et al.,
2015a]: we applied SGD with weight decay of 0.0001 and momentum of 0.9. The weight was
initialized following the method in [He et al., 2015b], and we applied batch normalization [Ioffe
and Szegedy, 2015] without dropout. Learning rate started with 0.1, and was divided by 10 at
every 32k iterations. We trained the model with 200 epochs.
All the networks were trained with mini-batch size 128 for 200 epochs.
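The two learning rate schedules described above can be written as small functions. This is a sketch of the schedules as stated in the text (divide by 10 at every 32k iterations; decrease by 4% every 8 epochs); the function and parameter names are hypothetical.

```python
def step_decay(base_lr, iteration, every=32000, factor=0.1):
    """Multiply the learning rate by `factor` each time `every` iterations pass."""
    return base_lr * factor ** (iteration // every)

def percent_decay(base_lr, epoch, every=8, drop=0.04):
    """Decrease the learning rate by a fraction `drop` every `every` epochs."""
    return base_lr * (1.0 - drop) ** (epoch // every)

# Residual Network: starts at 0.1, divided by 10 at every 32k iterations.
resnet_lr_start = step_decay(0.1, 0)
resnet_lr_after_32k = step_decay(0.1, 32000)

# GoogLeNet: starts at 0.05, decreased by 4% every 8 epochs.
googlenet_lr_start = percent_decay(0.05, 0)
googlenet_lr_epoch_8 = percent_decay(0.05, 8)
```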

4.4 Results
In this section, we compare the empirical performance of all the ensemble methods we mentioned
before, including: Unweighted Averaging (before/after softmax layer), Majority Voting, Bayes
Optimal Classifier, and Super Learner (with negative log-likelihood loss). We also include the discrete
SL, with negative log-likelihood loss and 0-1 error loss. For comparison, we list the base learner
which achieved the best performance on the testing set, as an empirical oracle.

[Figure 5: An example of a building block for the Residual Network. The input X from the previous
layer passes through two weight layers with a ReLU in between to give F(X), and the identity
shortcut adds X to form F(X) + X.]

4.4.1 Ensemble of Same Network with Different Training Checkpoints

Table 1: Left: Prediction accuracy on the testing set for ResNet 8 trained by 80, 90, 100, 110
epochs. Right: Prediction Accuracy on the testing set for ResNet 110 trained by 70, 85, 100, 115
epochs.

ResNet 8                                 ResNet 110
Training Epoch   Prediction Accuracy     Training Epoch   Prediction Accuracy
70               0.7790                  70               0.8896
80               0.8245                  85               0.8999
90               0.8197                  100              0.9318
100              0.8659                  115              0.9354

Table 1 shows the prediction accuracy for ResNet 8 and ResNet 110 after different numbers of epochs. As
ResNet 8 is much shallower, and thus more adaptive during training, we used a smaller interval of
10 epochs. Notice there is a great accuracy improvement around epoch 100, due to the learning rate
decay.
For ResNet 8, the SL is substantively better than naive averaging and majority voting. Earlier
stage learners would have worse performance, which causes the deterioration of the performance
for naive averaging. The performance of majority voting is even worse than the best base learner,
as the majority of the base learners are under-optimized.
For ResNet 110, the performance for all the meta-learners is similar. One possible explanation
is that deeper network is more stable during training.

Table 2: Prediction accuracy on the testing set for ResNet 8 and 110

Ensemble                               ResNet 8   ResNet 110
Best Base Learner                      0.8659     0.9354
SuperLearner                           0.8679     0.9358
Discrete SuperLearner (nll)            0.8659     0.9354
Discrete SuperLearner (error)          0.8659     0.9354
Unweighted Average (before softmax)    0.8611     0.9354
Unweighted Average (after softmax)     0.8614     0.9354
BOC (before softmax)                   0.8659     0.9318
BOC (after softmax)                    0.8659     0.9318
Majority Voting                        0.8485     0.9319

In this experiment, the weights of BOCs are dominated by one model, which gives the best
performance on the validation set. Thus the BOC is equivalent to the discrete Super Learner with
negative likelihood as loss function. In the experiments, BOC performed only as well as the best
base learner. In the subsequent experiments, all the BOCs showed the similar dominated weight
pattern. Given the practical equivalence with the discrete Super Learner, we don’t elaborate further
on BOCs, and we will report only the discrete Super Learner’s performance.

4.4.2 Ensemble of Same Network Trained Multiple Times


Unlike other conventional machine learning algorithms, deep neural networks solve a high-dimensional
non-convex optimization problem. Mini-batch stochastic gradient descent with momentum is commonly
used for training. Due to non-convexity, networks with the same structure but different
initialization and training vary a lot. [Choromanska et al., 2015] studied the distribution of loss on the
testing set for a certain network structure trained multiple times with SGD. It shows the distribution
of loss is more concentrated for deeper neural networks. This suggests deep neural networks are
less sensitive to randomness in the initialization and training. If so, ensemble learning would be
less helpful for the deeper nets.
To help understand this property, we trained 4 ResNets with 8 layers and 4 ResNets with 110
layers.

Table 3: Prediction Accuracy on the testing set for ResNet with 8 and 110 layers

Model        Prediction Accuracy     Model          Prediction Accuracy
ResNet 8 0   0.8785                  ResNet 110 0   0.9399
ResNet 8 1   0.8819                  ResNet 110 1   0.9364
ResNet 8 2   0.8758                  ResNet 110 2   0.9349
ResNet 8 3   0.8761                  ResNet 110 3   0.9395

Table 3 shows the performance of the networks trained for ResNet 8 and ResNet 110. We further
studied the performance of all the meta-learners. Shallow networks enjoyed
more improvement (2.54%) compared to deeper networks (1.43%) after being ensembled by the Super
Learner. Due to the similarity of the models, the SL did not show great improvement compared
to naive averaging. Similarly, majority voting did not work well, which might also be due to
the diversity of the base learners. The discrete SL with negative log-likelihood loss successfully
selected the best single learner in the library, while the discrete SL with error loss selected a
slightly weaker one. This suggests that for finite samples, the Super Learner using the negative
log-likelihood loss performs better w.r.t. prediction accuracy than the Super Learner that uses
prediction accuracy as criterion.

Table 4: Prediction accuracy on the testing set for ensemble methods. The algorithm candidates
are the ResNets with same structure but trained several times, where the differences come from
randomized initialization and SGD.

Ensemble                               ResNet 8   ResNet 110
Best Base Learner                      0.8820     0.9399
SuperLearner                           0.9073     0.9542
Discrete SuperLearner (nll)            0.8820     0.9395
Discrete SuperLearner (error)          0.8761     0.9395
BOC (before softmax)                   0.8820     0.9395
BOC (after softmax)                    0.8820     0.9395
Unweighted Average (before softmax)    0.9068     0.9542
Unweighted Average (after softmax)     0.9068     0.9541
Majority Vote                          0.9000     0.9510

4.4.3 Ensemble of Networks with Different Structure


In this section, we studied ensembles of networks with different structures. We trained NIN, VGG, and
ResNet with 32, 44, 56, and 110 layers. Table 5 shows the performance of each net on the testing set.

Table 5: Prediction Accuracy on the testing set for networks with different structure

Model        Prediction Accuracy
NIN          0.8677
VGG          0.8914
ResNet 32    0.9181
ResNet 44    0.9243
ResNet 56    0.9272
ResNet 110   0.9399

4.4.4 Over-confident Model


As the 0 − 1 loss for classification is not differentiable, cross-entropy loss is commonly used as
a surrogate loss in neural network training. We can see from Table 6 that the cross-entropy is
usually negatively correlated with the prediction accuracy. However, the Network-in-Network
model has a much lower cross-entropy loss than all the other models, while it

Table 6: Cross-entropy on the testing set for Networks with different structure

Model Cross-entropy
NIN 0.5779
VGG 1.5649
ResNet 32 1.5442
ResNet 44 1.5341
ResNet 56 1.5327
ResNet 110 1.5242

gives worse prediction accuracy. This is due to its prediction behavior, which we can see from the
predicted probability of the true labels for several images in the testing set:

Table 7: Predicted probability of the true label for five images in the testing set

Model Image 1 Image 2 Image 3 Image 4 Image 5


NIN 0.9999 0.9999 0.09985 0.5306 1.000
VGG 0.2319 0.2319 0.2319 0.2302 0.2314
ResNet 32 0.2319 0.2318 0.2317 0.2316 0.2317

It is interesting to observe the high-confidence phenomenon for the Network-in-Network model,
where most of the predictions are made with high confidence (predicted probability). Such highly
confident networks usually achieve a much smaller surrogate loss (negative log-likelihood loss in
our example) on the testing set, but not necessarily a smaller 0-1 error loss. Though all the
networks suffered from over-fitting, only the NIN net showed over-confidence. In addition, NIN has
a higher training cross-entropy loss (0.13104) than VGG (0.02233). Thus it is not reasonable to
blindly attribute the over-confidence to over-fitting.
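
As a small worked example, the mean cross-entropy over the five images in Table 7 can be computed
directly from the listed true-label probabilities. This is only a sketch: five images are not the
whole testing set, and the function name is our own. It gives roughly 0.59 for NIN against roughly
1.46 for VGG and ResNet 32, showing how a few very confident predictions pull the surrogate loss
down.

```python
import math

# Predicted probabilities of the true label, copied from Table 7.
true_label_probs = {
    "NIN": [0.9999, 0.9999, 0.09985, 0.5306, 1.000],
    "VGG": [0.2319, 0.2319, 0.2319, 0.2302, 0.2314],
    "ResNet 32": [0.2319, 0.2318, 0.2317, 0.2316, 0.2317],
}

def mean_cross_entropy(probs):
    """Mean of -log(p) over the true-label probabilities."""
    return -sum(math.log(p) for p in probs) / len(probs)

for model, probs in true_label_probs.items():
    print(f"{model}: {mean_cross_entropy(probs):.4f}")
```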
When several base learners suffer from over-confidence, the performance of model averaging can
deteriorate seriously: the unweighted average score/probability is dominated by the
over-confident models. When all the models are over-confident, the unweighted average is
essentially identical to the majority vote.
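
Both effects can be reproduced with a minimal sketch on hypothetical three-class probability
vectors (the numbers and helper names are ours, not from the experiments). In the first library
one over-confident model overrides two moderately confident ones under averaging but not under
voting; in the second, where every model is over-confident, the two rules agree.

```python
from collections import Counter

def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda k: values[k])

def unweighted_average(prob_vectors):
    """Class-wise mean of the base learners' probability vectors."""
    n = len(prob_vectors)
    return [sum(p[k] for p in prob_vectors) / n for k in range(len(prob_vectors[0]))]

def majority_vote(prob_vectors):
    """Most frequent arg-max class among the base learners."""
    votes = Counter(argmax(p) for p in prob_vectors)
    return votes.most_common(1)[0][0]

# Two moderately confident models favor class 0; one over-confident model favors class 1.
mixed = [[0.40, 0.35, 0.25], [0.45, 0.30, 0.25], [0.01, 0.98, 0.01]]
average_prediction = argmax(unweighted_average(mixed))
vote_prediction = majority_vote(mixed)
print(average_prediction, vote_prediction)

# All models over-confident: averaging nearly one-hot vectors amounts to counting votes.
all_confident = [[0.99, 0.005, 0.005], [0.99, 0.005, 0.005], [0.005, 0.99, 0.005]]
print(argmax(unweighted_average(all_confident)), majority_vote(all_confident))
```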
In addition, the VGG net and the ResNet with 32 layers had very similar predicted probabilities
(agreeing on the first 3 digits for most observations), even though their structures are totally
different. However, this pattern is beyond the scope of this study.
We empirically studied the impact of over-confident network candidates on the ensemble methods.
We have five candidates in the ensemble library: NIN, VGG, ResNet 32, ResNet 44, and ResNet
56. We compared the performance with and without NIN, which is the only over-confident net.
Table 8 shows the performance of the ensemble algorithms on the testing set. The unweighted
average was weakened by the NIN net: over-confidence made NIN dominate the other candidates,
and led to decreases of 2.33% (before softmax) and 4.81% (after softmax) in prediction accuracy.
The naive average before softmax was less affected, as the scales of the networks differ. The
majority vote was not affected much by the extra candidate, which is not surprising. Among the
discrete Super Learners, the over-confident network weakened only the one with negative
log-likelihood loss, and did not influence the one with error loss. The Super Learner successfully
harnessed the over-confident model: adding NIN helped increase the prediction accuracy from
0.9405 to 0.9414.

Table 8: Prediction accuracy on the testing set for ensemble methods. The algorithm candi-
dates include NIN, VGG, ResNet 32, ResNet 44, and ResNet 56. We compare the performance
with/without the over-confident NIN network.

Ensemble Without NIN With NIN


Best Base Learner 0.9399 0.9399
SuperLearner 0.9469 0.9475
Discrete SuperLearner (nll) 0.9399 0.8677
Discrete SuperLearner (error) 0.9399 0.9399
BOC (before softmax) 0.9399 0.8677
BOC (after softmax) 0.9399 0.8677
Unweighted Average (before softmax) 0.9456 0.9223
Unweighted Average (after softmax) 0.9455 0.8974
Majority Vote 0.9433 0.9413

4.4.5 Learning from Weak Learner


We hope our ensemble method can learn from all the models, even though some base learners may
have weaker overall performance than the other learners in the library. In this experiment, we
used under-trained GoogLeNets [Szegedy et al., 2015] as the weak candidates. The original paper
[Szegedy et al., 2015] did not describe explicitly how to automatically train/tune the network on
the CIFAR 10 data set. We set the initial learning rate to 0.05, with momentum 0.96, and
decreased the learning rate by 4% every 8 epochs. This did not give satisfactory performance: the
prediction accuracy on the testing set was around 0.83. To avoid the impact of over-confidence, we
removed the NIN net. Thus the weakest remaining base learner in the library is the VGG net, which
achieved 0.8914 accuracy on the testing set. The difference in prediction accuracy between the
VGG net and the GoogLeNet is around 6%, which means our GoogLeNet model is substantially
weaker than the other candidates.
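
The learning-rate schedule described above can be written as a short sketch. It assumes the 4%
decrease is applied multiplicatively once every 8 completed epochs, which the text does not state
explicitly; the function and parameter names are ours.

```python
def step_decay_learning_rate(epoch, initial_lr=0.05, decay=0.04, step=8):
    """Learning rate at a given epoch under a multiplicative step schedule.

    Assumption: the rate is multiplied by (1 - decay) after every `step` epochs.
    """
    return initial_lr * (1.0 - decay) ** (epoch // step)

for epoch in (0, 7, 8, 16, 80):
    print(epoch, step_decay_learning_rate(epoch))
```

Under this assumption the rate stays at 0.05 for epochs 0 to 7 and drops to 0.048 at epoch 8.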
We trained the GoogLeNet 5 times and then compared the performance of the different ensemble
methods with and without these 5 GoogLeNets in the library.

Table 9: Prediction accuracy on the testing set for ensemble methods. The algorithm candidates
include VGG, ResNet 32, ResNet 44, and ResNet 56. We compared the performance with/without
five under-optimized GoogLeNets.
Ensemble Without GoogLeNet With 3 GoogLeNets With 5 GoogLeNets
Best Base Learner 0.9399 0.9399 0.9399
SuperLearner 0.9475 0.9477 0.9477
Discrete SuperLearner (nll) 0.9399 0.9399 0.9399
Discrete SuperLearner (error) 0.9399 0.9399 0.9399
BOC (before softmax) 0.9399 0.9399 0.9399
BOC (after softmax) 0.9399 0.9399 0.9399
Unweighted Average (before softmax) 0.9456 0.9326 0.9001
Unweighted Average (after softmax) 0.9455 0.9329 0.9007
Majority Vote 0.9433 0.9263 0.8720

In the experiment, adding many weaker candidates deteriorated the performance of the
unweighted average. The majority vote was only slightly influenced when there were few weak
learners, but was dominated by them when the number of weak learners was large. Unweighted
averaging also failed in this case. The BOCs remained unchanged, as the likelihood on the
validation set is still dominated by the same base learner. The Super Learner was successful in
this setting: its prediction accuracy remained stable with the extra weak learners.

4.4.6 Prediction with All Candidates


As the number of base learners is usually much smaller than the sample size, and it is usually not
known a priori which learner will achieve the best performance, it is encouraged to use as rich a
library as possible to improve the performance of the Super Learner. In this experiment, we simply
put all the networks mentioned before into the library of all the ensemble methods.

Table 10: Prediction accuracy on the testing set for all the ensemble methods using all the networks
mentioned in this study as base learners.

Ensemble Accuracy
Best base learner 0.9399
SuperLearner 0.9502
Discrete SuperLearner (nll) 0.9395
Discrete SuperLearner (error) 0.9395
BOC (before softmax) 0.9395
BOC (after softmax) 0.9395
Unweighted Average (before softmax) 0.9444
Unweighted Average (after softmax) 0.9448
Majority Vote 0.9410

Table 10 shows the performance of all the ensemble methods as well as the base learner with
the best performance. Due to the large proportion of weak learners (e.g. the under-fitted
GoogLeNets and the networks trained with fewer iterations in the first experiment) and the
over-confident learner (NIN), all the other ensemble methods perform noticeably worse than the
Super Learner. This is another strength of the Super Learner: by simply putting all the potential
base learners into the library, the Super Learner computes the weights data-adaptively, which does
not require any tedious pre-selection procedure based on human experience.
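
The data-adaptive weighting can be sketched as follows, assuming the meta-learner is a convex
combination of the base learners' predicted probabilities fitted by minimizing the negative
log-likelihood on the validation set. The exponentiated-gradient optimizer, the function names,
and the toy probabilities are our own illustrative choices, not the implementation used in the
experiments. Under this loss only the probability each learner assigns to the true label matters.

```python
import math

def mixture_nll(weights, true_probs):
    """Mean NLL of the weighted mixture.

    true_probs[k][i] is the probability base learner k gives the true label of example i.
    """
    n = len(true_probs[0])
    total = 0.0
    for i in range(n):
        mix = sum(w * q[i] for w, q in zip(weights, true_probs))
        total -= math.log(max(mix, 1e-12))
    return total / n

def fit_super_learner_weights(true_probs, steps=500, lr=0.5):
    """Exponentiated-gradient descent over the probability simplex (illustrative)."""
    k, n = len(true_probs), len(true_probs[0])
    weights = [1.0 / k] * k  # start from the unweighted average
    for _ in range(steps):
        mix = [max(sum(w * q[i] for w, q in zip(weights, true_probs)), 1e-12)
               for i in range(n)]
        grad = [-sum(q[i] / mix[i] for i in range(n)) / n for q in true_probs]
        shift = min(grad)  # keeps every exponent non-positive, avoiding overflow
        weights = [w * math.exp(-lr * (g - shift)) for w, g in zip(weights, grad)]
        norm = sum(weights)
        weights = [w / norm for w in weights]
    return weights

# Hypothetical true-label probabilities on six validation examples.
true_probs = [
    [0.90, 0.80, 0.90, 0.20, 0.85, 0.90],  # strong learner
    [0.60, 0.60, 0.50, 0.90, 0.60, 0.60],  # complementary learner
    [0.20, 0.30, 0.10, 0.30, 0.20, 0.25],  # weak learner
]
weights = fit_super_learner_weights(true_probs)
print([round(w, 4) for w in weights])
print(mixture_nll(weights, true_probs), mixture_nll([1 / 3] * 3, true_probs))
```

In this toy library the weak learner's weight shrinks towards zero, while the two complementary
learners share the remaining weight and the fitted mixture attains a lower validation loss than
the unweighted average.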

4.5 Discussion
We studied the relative performance of several widely used ensemble methods with deep
convolutional neural networks as base learners on the CIFAR 10 data set, which is a commonly used
benchmark for image classification. Unweighted averaging proved surprisingly successful
when the performance of the base learners was comparable. It outperformed majority voting in
almost all the experiments. However, unweighted averaging proved to be sensitive to
over-confident candidates. The Super Learner addressed this issue by simply optimizing a weight on
the validation set in a data-adaptive manner. This ensemble structure can be considered as a 1 × 1
convolution layer stacked on the output of the base learners. It can adaptively assign weights to
the base learners, which enables even weak learners to improve the prediction.

The Super Learner was proposed as a cross-validation based ensemble method. However, since
CNNs are computationally intensive and validation sets are typically large in image recognition
tasks, we used the validation set of the neural networks to compute the weights of the Super
Learner (single-split cross-validation), instead of using conventional cross-validation
(multiple-fold cross-validation). The structure is simple and could be easily extended. One
potential extension of the linear-weighted Super Learner would be stacking several 1 × 1
convolutions with non-linear activation layers in between. This structure could mimic the
cascading/hierarchical ensemble [Wang et al., 2014, Su et al., 2009]. Due to the small number of
parameters, we hope this meta-learner would not overfit the validation set and thus would help
improve the prediction. However, this involves non-convex optimization and the results might not
be stable. We leave this as future work.

References
D. Benkeser, S. D. Lendle, C. Ju, and M. J. van der Laan. Online cross-validation-based ensemble
learning. U.C. Berkeley Division of Biostatistics Working Paper Series, page Working Paper
355., 2016.

J. O. Berger and M. Bock. Combining independent normal mean estimation problems with un-
known variances. The Annals of Statistics, pages 642–648, 1976.

L. Breiman. Bagging predictors. Machine learning, 24(2):123–140, 1996a.

L. Breiman. Stacked regressions. Machine learning, 24(1):49–64, 1996b.

L. Breiman. Random forests. Machine learning, 45(1):5–32, 2001.

A. Chambaz, W. Zheng, and M. van der Laan. Data-adaptive inference of the optimal treatment
rule and its mean reward. the masked bandit. U.C. Berkeley Division of Biostatistics Working
Paper Series., 2016.

K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio.
Learning phrase representations using rnn encoder-decoder for statistical machine translation.
arXiv preprint arXiv:1406.1078, 2014.

A. Choromanska, M. Henaff, M. Mathieu, G. B. Arous, and Y. LeCun. The loss surfaces of
multilayer networks. In AISTATS, 2015.

M. M. Davies and M. J. van der Laan. Optimal spatial prediction using ensemble machine learning.
The international journal of biostatistics, 12(1):179–201, 2016.

T. G. Dietterich. Ensemble methods in machine learning. In International workshop on multiple
classifier systems, pages 1–15. Springer, 2000.

B. Efron and C. Morris. Combining possibly related estimation problems. Journal of the Royal
Statistical Society. Series B (Methodological), pages 379–421, 1973.

Y. Freund and R. E. Schapire. A decision-theoretic generalization of on-line learning and an
application to boosting. Journal of computer and system sciences, 55(1):119–139, 1997.

Y. Freund, R. E. Schapire, et al. Experiments with a new boosting algorithm. In ICML, volume 96,
pages 148–156, 1996.

J. H. Friedman. Greedy function approximation: a gradient boosting machine. Annals of statistics,
pages 1189–1232, 2001.

I. Goodfellow, Y. Bengio, and A. Courville. Deep learning, 2016.

E. J. Green and W. E. Strawderman. A james-stein type estimator for combining unbiased and
possibly biased estimators. Journal of the American Statistical Association, 86(416):1001–1006,
1991.

A. Grover and J. Leskovec. node2vec: Scalable feature learning for networks. In Proceedings of
the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining,
pages 855–864. ACM, 2016.

K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. arXiv preprint
arXiv:1512.03385, 2015a.

K. He, X. Zhang, S. Ren, and J. Sun. Delving deep into rectifiers: Surpassing human-level per-
formance on imagenet classification. In Proceedings of the IEEE International Conference on
Computer Vision, pages 1026–1034, 2015b.

T. Hothorn, P. Bühlmann, S. Dudoit, A. Molinaro, and M. J. van der Laan. Survival ensembles.
Biostatistics, 7(3):355–373, 2006.

S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing
internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

C. Ju, M. Combs, S. D. Lendle, J. M. Franklin, R. Wyss, S. Schneeweiss, and M. J. van der
Laan. Propensity score prediction for electronic healthcare dataset using super learner and high-
dimensional propensity score method. U.C. Berkeley Division of Biostatistics Working Paper
Series, page Working Paper 351., 2016.

A. Krizhevsky and G. Hinton. Learning multiple layers of features from tiny images. Technical
report, University of Toronto., 2009.

A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional
neural networks. In Advances in neural information processing systems, pages 1097–1105,
2012.

L. I. Kuncheva, C. J. Whitaker, C. A. Shipp, and R. P. Duin. Limits on the majority vote accuracy
in classifier fusion. Pattern Analysis & Applications, 6(1):22–31, 2003.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

M. Lin, Q. Chen, and S. Yan. Network in network. arXiv preprint arXiv:1312.4400, 2013.

A. R. Luedtke and M. J. van der Laan. Super-learning of an optimal dynamic treatment rule. The
international journal of biostatistics, 12(1):305–332, 2016.

M.-T. Luong, H. Pham, and C. D. Manning. Effective approaches to attention-based neural ma-
chine translation. arXiv preprint arXiv:1508.04025, 2015.

T. M. Mitchell. Machine learning. 1997. Burr Ridge, IL: McGraw Hill, 45(37):870–877, 1997.

B. Perozzi, R. Al-Rfou, and S. Skiena. Deepwalk: Online learning of social representations. In
Proceedings of the 20th ACM SIGKDD international conference on Knowledge discovery and
data mining, pages 701–710. ACM, 2014.

R. Pirracchio, M. L. Petersen, M. Carone, M. R. Rigon, S. Chevret, and M. J. van der Laan. Mortal-
ity prediction in intensive care units with the super icu learner algorithm (sicula): a population-
based study. The Lancet Respiratory Medicine, 3(1):42–52, 2015.

E. C. Polley and M. J. Van Der Laan. Super learner in prediction. U.C. Berkeley Division of
Biostatistics Working Paper Series., 2010.

J. Rao and K. Subrahmaniam. Combining independent estimators and estimation in linear regres-
sion with unequal variances. Biometrics, pages 971–990, 1971.

D. B. Rubin and S. Weisberg. The variance of a linear combination of independent estimators
using estimated weights. Biometrika, 62(3):708–709, 1975.

O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla,
M. Bernstein, et al. Imagenet large scale visual recognition challenge. International Journal of
Computer Vision, 115(3):211–252, 2015.

K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recogni-
tion. arXiv preprint arXiv:1409.1556, 2014.

S. E. Sinisi, E. C. Polley, M. L. Petersen, S.-Y. Rhee, and M. J. van der Laan. Super learning: an
application to the prediction of hiv-1 drug resistance. Statistical applications in genetics and
molecular biology, 6(1), 2007.

N. Srivastava, G. E. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: a simple
way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):
1929–1958, 2014.

Y. Su, S. Shan, X. Chen, and W. Gao. Hierarchical ensemble of global and local classifiers for face
recognition. IEEE Transactions on Image Processing, 18(8):1885–1896, 2009.

C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition, pages 1–9, 2015.

M. J. Van Der Laan and S. Dudoit. Unified cross-validation methodology for selection among
estimators and a general cross-validated adaptive epsilon-net estimator: Finite sample oracle
inequalities and examples. U.C. Berkeley Division of Biostatistics Working Paper Series., 2003.

M. J. van der Laan, E. C. Polley, and A. E. Hubbard. Super learner. Statistical applications in
genetics and molecular biology, 6(1), 2007.

A. Veit, M. Wilber, and S. Belongie. Residual networks are exponential ensembles of relatively
shallow networks. arXiv preprint arXiv:1605.06431, 2016.

H. Wang, A. Cruz-Roa, A. Basavanhally, H. Gilmore, N. Shih, M. Feldman, J. Tomaszewski,
F. Gonzalez, and A. Madabhushi. Cascaded ensemble of convolutional neural networks and
handcrafted features for mitosis detection. In SPIE Medical Imaging, pages 90410B–90410B.
International Society for Optics and Photonics, 2014.

D. H. Wolpert. Stacked generalization. Neural networks, 5(2):241–259, 1992.

M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European
Conference on Computer Vision, pages 818–833. Springer, 2014.
