Deep Super Learner: A Deep Ensemble For Classification Problems
Abstract. Deep learning has become very popular for tasks such as predictive modeling and pattern recognition in handling big data. Deep learning is a powerful machine learning method that extracts lower level features and feeds them forward to the next layer to identify higher level features that improve performance. However, deep neural networks have drawbacks, which include many hyper-parameters and an effectively infinite choice of architectures, opaqueness of results, and relatively slow convergence on smaller datasets. While traditional machine learning algorithms can address these drawbacks, they are typically not capable of the performance levels achieved by deep neural networks. To improve performance, ensemble methods are used to combine multiple base learners. Super learning is an ensemble that finds the optimal combination of diverse learning algorithms. This paper proposes deep super learning as an approach that achieves log loss and accuracy results competitive with deep neural networks while employing traditional machine learning algorithms in a hierarchical structure. The deep super learner is flexible, adaptable, and easy to train, with good performance across different tasks using identical hyper-parameter values. Using traditional machine learning requires fewer hyper-parameters, allows transparency into results, and has relatively fast convergence on smaller datasets. Experimental results show that the deep super learner has superior performance compared to the individual base learners, single-layer ensembles, and, in some cases, deep neural networks. Performance of the deep super learner may be further improved with task-specific tuning.
1 Introduction
Deep learning is a machine learning method that uses layers of processing units
where the output of a layer cascades to be the input of the next layer and
can be applied to either supervised or unsupervised learning problems [1] [2].
Deep neural networks (DNN) are an architecture of deep learning that typically
has many connected units arranged in layers of varying sizes with information
being fed forward through the network. DNN have been successfully applied to
fields such as computer vision and natural language processing, having achieved
accuracy rates similar or superior to those of humans in classification [3]. For example,
Ciresan et al. used DNN to achieve an error rate half that of humans in
recognizing traffic signs. The multiple layers of a DNN allow for varying levels of
abstraction and the cascade between the layers enables the extraction of features
from lower to higher level layers to improve performance [4]. However, DNN also
have drawbacks, listed below:
– DNN have many hyper-parameters, that is, parameters whose values are set prior to training rather than learned during training, and these hyper-parameters interact with each other in their effect on performance. The large number of hyper-parameters, together with an effectively infinite space of architectures, makes tuning hyper-parameters and architecture difficult [5].
– With a large number of processing units, tracing through a DNN to understand the reasoning for classifications is difficult, leading to DNN being treated as black boxes [6].
– DNN typically require very large amounts of data to train and do not converge as fast, with respect to sample size, as traditional machine learning algorithms [7].
Traditional machine learning algorithms, on the other hand, are relatively simple
to tune and their output may provide interpretable results leading to a deeper
understanding of the problem, though they tend to underperform DNN in terms
of accuracy.
The remainder of this paper is organized as follows: section 1 introduces
the motivation and background for this paper, section 2 presents the overall
procedure of the Deep Super Learner (DSL) approach, section 3 describes the methodology of the
experiment, section 4 presents the results of a comparison of the performance of
the DSL to the individual base learners and a selection of ensembles and DNN
on various problems, and section 5 concludes and describes future work.
1.1 Motivation
Given the drawbacks of DNN and the poor performance of traditional machine learning algorithms in some domains and/or prediction tasks, this paper investigates whether traditional machine learning algorithms can be used to address the drawbacks of DNN while achieving levels of performance comparable to DNN. A new ensemble method, named here the Deep Super Learner (DSL), seeks to combine simplicity of setup, interpretability of results, and fast convergence on both small and large datasets with the power of deep learning.
Fig. 1. Overall procedure of DSL with j classes, k folds, l features, m learners, n records
To make predictions on unseen test data, pass the data in its entirety through
a similar process using each of the models trained and weights optimized at each
iteration. If the models are trained on the entire training set, use these m models
for each iteration. If the models are trained on the k folds, use each model trained
on each fold to make predictions on all the unseen data and average across the
k models to get predictions for each of the m learners. Using the optimum
weights for the m learners found during training for the iteration, calculate the
overall weighted average predictions for the iteration. Append the predictions
to the original test data as additional features. Repeat the process for the same
number of iterations used in training.
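A minimal Python sketch of this prediction pass for the case of models trained on the k folds is given below. It assumes scikit-learn-style models exposing predict_proba; the names dsl_predict, fold_models, and weights are illustrative, not the authors' implementation.

import numpy as np

def dsl_predict(X_test, iterations):
    # `iterations` is a list with one entry per training iteration, each holding
    # `fold_models` (k lists of the m fitted learners) and `weights` (the m
    # weights optimized during training for that iteration).
    features = X_test
    for it in iterations:
        k = len(it["fold_models"])
        m = len(it["fold_models"][0])
        # Average each learner's class probabilities across its k fold models.
        per_learner = []
        for learner_idx in range(m):
            fold_probs = [it["fold_models"][fold][learner_idx].predict_proba(features)
                          for fold in range(k)]
            per_learner.append(np.mean(fold_probs, axis=0))  # shape: (n records, j classes)
        # Weighted average across the m learners using the trained weights.
        avg_probs = np.tensordot(it["weights"], np.stack(per_learner), axes=1)
        # Append the averaged predictions to the original test data as extra features.
        features = np.hstack([X_test, avg_probs])
    return avg_probs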
3 Methodology
The hyper-parameters and architectures for the DSL, base learners, benchmark
ensembles, and benchmark DNN described below are kept constant between
datasets. When necessary, adjustments are made for the different dimensionality
of the datasets.
for iteration in 1 to max iterations do
    Split data into k folds each with train and validate sets;
    for each fold in k folds do
        for each learner in ensemble do
            Train learner on train set in fold;
            Get class probabilities from learner on validate set in fold;
            Build predictions matrix of class probabilities;
        end
    end
    Get weights to minimize loss function with predictions and true labels;
    Get average probabilities across learners by multiplying predictions with weights;
    Get loss value of loss function with average probabilities and true labels;
    if loss value is less than loss value from previous iteration then
        Append average probabilities to data;
    else
        Save iteration;
        Break for;
    end
end
Algorithm 1: A Pseudo-code of the Proposed Approach, DSL
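The training loop can be sketched in Python as follows, assuming scikit-learn-compatible base learners and using scipy.optimize.minimize to find the convex combination of learner weights that minimizes log loss; function and variable names are illustrative and do not reproduce the authors' code.

import numpy as np
from scipy.optimize import minimize
from sklearn.base import clone
from sklearn.metrics import log_loss
from sklearn.model_selection import KFold

def optimize_weights(probs, y_true):
    # Non-negative weights summing to one that minimize log loss;
    # `probs` has shape (m learners, n records, j classes).
    m = probs.shape[0]
    objective = lambda w: log_loss(y_true, np.tensordot(w, probs, axes=1))
    result = minimize(objective, np.full(m, 1.0 / m), method="SLSQP",
                      bounds=[(0.0, 1.0)] * m,
                      constraints={"type": "eq", "fun": lambda w: np.sum(w) - 1.0})
    return result.x

def dsl_train(X, y, learners, max_iterations=10, k=3):
    features, prev_loss, iterations = X, np.inf, []
    for _ in range(max_iterations):
        probs = np.zeros((len(learners), len(y), len(np.unique(y))))
        fold_models = []
        # Build the out-of-fold class probability matrix for every learner.
        for train_idx, val_idx in KFold(n_splits=k, shuffle=True).split(features):
            models = [clone(l).fit(features[train_idx], y[train_idx]) for l in learners]
            fold_models.append(models)
            for i, model in enumerate(models):
                probs[i, val_idx] = model.predict_proba(features[val_idx])
        weights = optimize_weights(probs, y)
        avg_probs = np.tensordot(weights, probs, axes=1)
        loss = log_loss(y, avg_probs)
        if loss >= prev_loss:
            break  # stop once the loss no longer improves
        prev_loss = loss
        iterations.append({"fold_models": fold_models, "weights": weights})
        # Append the averaged predictions to the original data as extra features.
        features = np.hstack([X, avg_probs])
    return iterations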
The same five base learners used in DSL are also tested individually and in the benchmark ensembles using identical hyper-parameter values. If a hyper-parameter of a learner is not listed in Table 1 below, the default values of the algorithm's implementation are used.
Since random forest, extremely randomized trees, and XGBoost are themselves ensembles, three additional ensembles are tested for comparison: a simple equal-weighted average of the base learners, a stacked ensemble where the output of the base learners is fed into XGBoost, and a single-layer super learner.
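For illustration, the five base learners and these simple benchmark ensembles can be sketched as follows; the hyper-parameter values shown are library defaults rather than those of Table 1, the use of out-of-fold probabilities for stacking is an assumption, and the helper names are hypothetical.

import numpy as np
from sklearn.ensemble import ExtraTreesClassifier, RandomForestClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import cross_val_predict
from sklearn.neighbors import KNeighborsClassifier
from xgboost import XGBClassifier

base_learners = [LogisticRegression(), KNeighborsClassifier(),
                 RandomForestClassifier(), ExtraTreesClassifier(), XGBClassifier()]

def average_ensemble_proba(learners, X_train, y_train, X_test):
    # Equal-weighted average of the base learners' class probabilities.
    probs = [l.fit(X_train, y_train).predict_proba(X_test) for l in learners]
    return np.mean(probs, axis=0)

def stacked_ensemble_proba(learners, X_train, y_train, X_test):
    # Stacked ensemble: base learners' out-of-fold probabilities are fed
    # into XGBoost as a meta-learner.
    train_meta = np.hstack([cross_val_predict(l, X_train, y_train, method="predict_proba")
                            for l in learners])
    test_meta = np.hstack([l.fit(X_train, y_train).predict_proba(X_test)
                           for l in learners])
    meta = XGBClassifier().fit(train_meta, y_train)
    return meta.predict_proba(test_meta)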
3.2 Benchmark Deep Neural Networks
Architecture | Hyper-parameters (Multi-layer perceptron) | Hyper-parameters (Convolutional neural network)
Convolutional layer | N/A | Filters: 32; Kernel size: 5 or (5, 5); Activation: RELU; Weight constraint: 4
Max pooling layer | N/A | Pool size: 2 or (2, 2)
Convolutional layer | N/A | Filters: 16; Kernel size: 3 or (3, 3); Activation: RELU; Weight constraint: 4
Max pooling layer | N/A | Pool size: 2 or (2, 2)
Dropout regularization | N/A | Drop rate: 0.2
Dense layer | Nodes: 128; Activation: RELU; Weight constraint: 4 | Nodes: 128; Activation: RELU; Weight constraint: 4
Dense layer | Nodes: 64; Activation: RELU; Weight constraint: 4 | Nodes: 64; Activation: RELU; Weight constraint: 4
Output layer | Nodes: number of classes; Activation: Softmax | Nodes: number of classes; Activation: Softmax
Optimizer: Adam | Learning rate: 0.001; Learning rate decay: √(Learning rate / Max epochs) | Learning rate: 0.001; Learning rate decay: √(Learning rate / Max epochs)
Batch size | 200 | 200
Max epochs | 50 | 50
Validation split | 0.2 | 0.2
Early stopping patience | 3 | 3
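The benchmark CNN in the table above could be written in Keras roughly as follows. This is a sketch based on the listed hyper-parameters rather than the authors' exact code: num_classes and the 28 x 28 x 1 input shape (MNIST) are assumptions, a Flatten layer is added between the convolutional and dense blocks for shape compatibility, and the decay term follows the table's √(Learning rate / Max epochs).

from keras.callbacks import EarlyStopping
from keras.constraints import max_norm
from keras.layers import Conv2D, Dense, Dropout, Flatten, MaxPooling2D
from keras.models import Sequential
from keras.optimizers import Adam

num_classes, max_epochs, lr = 10, 50, 0.001  # assumed values for MNIST

model = Sequential([
    Conv2D(32, (5, 5), activation="relu", kernel_constraint=max_norm(4),
           input_shape=(28, 28, 1)),
    MaxPooling2D(pool_size=(2, 2)),
    Conv2D(16, (3, 3), activation="relu", kernel_constraint=max_norm(4)),
    MaxPooling2D(pool_size=(2, 2)),
    Dropout(0.2),
    Flatten(),  # added so the convolutional output can feed the dense layers
    Dense(128, activation="relu", kernel_constraint=max_norm(4)),
    Dense(64, activation="relu", kernel_constraint=max_norm(4)),
    Dense(num_classes, activation="softmax"),
])
model.compile(optimizer=Adam(lr=lr, decay=(lr / max_epochs) ** 0.5),
              loss="categorical_crossentropy", metrics=["accuracy"])
# model.fit(x_train, y_train, batch_size=200, epochs=max_epochs,
#           validation_split=0.2, callbacks=[EarlyStopping(patience=3)])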
3.3 Datasets
Sentiment Classification
The IMDB Movie reviews sentiment classification dataset contains 25,000 reviews for training and 25,000 for testing. The reviews have been labelled as positive or negative [14]. The 2,000 most frequent words in the set are used to calculate the term frequency-inverse document frequency (TF-IDF) matrix.
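A minimal sketch of this preprocessing, assuming the raw review texts are available as lists of strings (train_texts and test_texts are placeholders):

from sklearn.feature_extraction.text import TfidfVectorizer

# Restrict the vocabulary to the 2,000 most frequent words and build the TF-IDF matrix.
vectorizer = TfidfVectorizer(max_features=2000)
X_train = vectorizer.fit_transform(train_texts)  # 25,000 training reviews
X_test = vectorizer.transform(test_texts)        # 25,000 test reviews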
Image Categorization
Image categorization is evaluated on the MNIST database of handwritten digits, which contains 60,000 training images and 10,000 test images of the digits 0 through 9 [15].
3.4 Performance Metrics
Two metrics are used to evaluate the performance of the learning algorithms. One is Accuracy, the proportion of correctly classified records, and the other is LogLoss. The Accuracy and LogLoss formulas are shown in Equations 1 and 2, respectively.
\[
Accuracy = \frac{\sum_{x=1}^{n}\sum_{y=1}^{j} f(x,y)\,C(x,y)}{n} = \frac{TP + TN}{TP + FP + TN + FN} \tag{1}
\]
where n denotes the number of instances, j the number of classes, and f(x, y) the actual probability that instance x is of class y. C(x, y) is one if and only if y is the predicted class of x; otherwise C(x, y) is zero. Accuracy is equivalently defined in terms of the confusion matrix, where TP is true positives, TN is true negatives, FP is false positives, and FN is false negatives.
\[
LogLoss = \frac{-\sum_{y=1}^{j}\sum_{x=1}^{n} f(x,y)\,\log(p(x,y))}{n} \tag{2}
\]
where f(x, y) is defined as above and p(x, y) is the estimated probability that instance x is of class y. Minimizing LogLoss, also known as cross entropy, is equivalent to maximizing the log likelihood of observing the data under the model. Both Accuracy and LogLoss are commonly used performance measures in machine learning [16].
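Both metrics are available in scikit-learn; a brief sketch, with y_true the true labels and probs the predicted class probabilities:

import numpy as np
from sklearn.metrics import accuracy_score, log_loss

# Accuracy compares the arg-max predicted class against the true labels (Equation 1);
# log loss penalizes the probability assigned to the true class (Equation 2).
accuracy = accuracy_score(y_true, np.argmax(probs, axis=1))
loss = log_loss(y_true, probs)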
4 Results
Log loss and accuracy results of DSL, base learners, benchmark ensembles, and
benchmark DNN on the IMDB sentiment classification dataset are shown in
Table 3.
The DSL achieved statistically significantly lower loss and higher accuracy than all other algorithms. Since the TF-IDF matrix does not convey spatial or sequential relationships, DNN architectures such as CNN may not be expected to perform as well on this task. The MLP, like the DSL here, is set up to be general purpose, yet it is outperformed by the DSL. That the DSL outperforms a single-layer super learner indicates that adding depth to the algorithm improves performance. Figure 2 shows the performance of DSL on the IMDB test data by iteration.
Table 3. Comparison of log loss and accuracy on IMDB test data
Fig. 2. Log loss and accuracy by iteration of the DSL on IMDB test data.
Log loss and accuracy results of DSL, base learners, benchmark ensembles, and
benchmark DNN on the MNIST handwritten digits dataset are shown in Table 4.
The DSL achieved statistically significantly lower loss and higher accuracy than all algorithms except the CNN. The design of CNN makes them well suited to image processing. Again, the DSL outperformed the MLP and the single-layer super learner, showing the advantages of diversity in learners and of depth. The order of the base learners by performance differs between the two datasets, showing the importance of including a diverse set of learners when addressing various problems and the value of optimizing the component weights. Figure 3 shows the performance of DSL on the MNIST test data by iteration.
Table 4. Comparison of log loss and accuracy on MNIST test data.
Fig. 3. Log loss and accuracy by iteration of the DSL on MNIST test data.
4.3 Runtime
All algorithms are implemented in Python using the scikit-learn library for logistic regression, k-nearest neighbors (KNN), random forest, and extremely randomized trees; the XGBoost library for XGBoost; SciPy for the convex optimizer; and Keras with a TensorFlow backend for the MLP and CNN. Experiments are run on a desktop with an Intel Core i7-7700 with 16 GB of RAM. DSL on IMDB converged after three iterations, running for a total of 50 minutes, 46 of which are spent in the prediction phase of KNN. MLP on IMDB converged after two epochs, running for one minute. CNN on IMDB converged after six epochs, running for a total of seven minutes. DSL on MNIST converged after five iterations, running for a total of 86 minutes, 70 of which are spent in the prediction phase of KNN. MLP on MNIST converged in 12 epochs, running for two minutes. CNN on MNIST converged in 12 epochs, running for a total of 12 minutes. DSL is inherently parallel across component learners. With optimized parallel processing and selection of base learners, the runtime of DSL can be dramatically reduced.
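As one example of such parallelism, the per-learner fits within a fold could be dispatched with joblib; this is a sketch under the same naming assumptions as the earlier snippets, with X_train_fold and y_train_fold as placeholders for one fold's training split.

from joblib import Parallel, delayed
from sklearn.base import clone

def fit_one(learner, X, y):
    return clone(learner).fit(X, y)

# Fit all base learners for one fold concurrently.
models = Parallel(n_jobs=-1)(
    delayed(fit_one)(learner, X_train_fold, y_train_fold)
    for learner in base_learners)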
5 Conclusion
Results for the deep super learner are encouraging. Using a weighted average of the base learners, optimized to minimize log loss, yields results superior to any individual base learner. Using a cascade of multiple layers of base learners, where each successive layer uses the output of the previous layer as augmented input features to add depth to the learning, improves performance further. While still short of the performance levels obtained by CNN on image data, the deep super learner using traditional machine learning algorithms outperformed the MLP on image data and outperformed both the MLP and CNN on classification from a TF-IDF matrix, while also having fewer hyper-parameters and providing interpretable and transparent results. Although the deep super learning ensemble is still in the early stages of development, particularly compared to DNN, further development of the architecture, for example to better capture spatial or sequential relationships, should be conducted.
References
1. Bengio, Y., Courville, A., Vincent, P.: Representation Learning: A Review and New
Perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence
35(8) (8 2013) 1798–1828
2. Längkvist, M., Karlsson, L., Loutfi, A.: A Review of Unsupervised Feature Learn-
ing and Deep Learning for Time-series Modeling. Pattern Recognition Letters 42
(2014) 11–24
3. Schmidhuber, J.: Multi-column Deep Neural Networks for Image Classification.
In: Proceedings of the 2012 IEEE Conference on Computer Vision and Pattern
Recognition (CVPR’12), Washington, DC, USA, IEEE Computer Society (2012)
3642–3649
4. Bengio, Y.: Learning Deep Architectures for AI. Foundations and Trends in Machine Learning 2(1) (1 2009) 1–127
5. Zhou, Z.H., Feng, J.: Deep forest: Towards an Alternative to Deep Neural Net-
works. In: Proceedings of the 26th International Joint Conference on Artificial
Intelligence (IJCAI ’17), Melbourne, Australia (2017) 3553–3559
6. Sussillo, D., Barak, O.: Opening the Black Box: Low-Dimensional Dynamics in
High-Dimensional Recurrent Neural Networks. Neural Computation 25(3) (2013)
626–649
7. Farrelly, C.M.: Deep vs. Diverse Architectures for Classification Problems. (2017)
8. Seni, G., Elder, J.F.: Ensemble Methods in Data Mining: Improving Accuracy
Through Combining Predictions. In Grossman, R., ed.: Synthesis Lectures on
Data Mining and Knowledge Discovery. Morgan & Claypool (2010)
9. Chen, T., Guestrin, C.: XGBoost: A Scalable Tree Boosting System. In: Proceed-
ings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery
and Data Mining, San Francisco, California, USA, ACM (2016) 785–794
10. Xie, J., Rojkova, V., Pal, S., Coggeshall, S.: A Combination of Boosting and
Bagging for KDD Cup 2009 - Fast Scoring on a Large Database. The Journal of
Machine Learning Research (JMLR) 7 (2009) 35–43
11. van der Laan, M.J., Polley, E.C., Hubbard, A.E.: Super Learner. Statistical Ap-
plications in Genetics and Molecular Biology 6(1) (1 2007)
12. Zhou, Z.H.: Ensemble Methods: Foundations and Algorithms. Chapman & Hall/CRC Machine Learning & Pattern Recognition Series, Boca Raton, FL (2012)
13. Lessmann, S., Baesens, B., Mues, C., Pietsch, S.: Benchmarking Classification
Models for Software Defect Prediction: A Proposed Framework and Novel Findings.
IEEE Transactions on Software Engineering 34(4) (7 2008) 485–496
14. Maas, A.L., Daly, R.E., Pham, P.T., Huang, D., Ng, A.Y., Potts, C.: Learning
Word Vectors for Sentiment Analysis. In: Proceedings of the 49th Annual Meeting
of the Association for Computational Linguistics: Human Language Technologies.
HLT ’11, Portland, Oregon (2011) 142–150
15. Lecun, Y., Bottou, L., Bengio, Y., Haffner, P.: Gradient-based Learning Applied
to Document Recognition. Proceedings of the IEEE 86(11) (1998) 2278–2324
16. Ferri, C., Hernández-Orallo, J., Modroiu, R.: An Experimental Comparison of
Performance Measures for Classification. Pattern Recognition Letters 30(1) (2009)
27–38