S-NN: Stacked Neural Networks

Milad Mohammadi, Stanford University
Subhasis Das, Stanford University
Abstract
It has been shown that one can achieve high accuracy in many vision tasks by training a simple classifier on top of features obtained from pre-trained neural networks.
The goal of this project is to generate better features for transfer learning from multiple publicly available pre-trained neural networks. To this end, we propose a novel architecture called Stacked Neural Networks which leverages the fast training time of transfer learning while simultaneously being much more accurate. We show that using a stacked NN architecture can result in up to 8% improvement in accuracy over state-of-the-art techniques that use only one pre-trained network for transfer learning.
A second aim of this project is to make network fine-tuning retain the generalizability of the base network to unseen tasks. To this end, we propose a new technique called joint finetuning that is able to give accuracies comparable to finetuning the same network individually over two datasets. We also show that a jointly finetuned network generalizes better to unseen tasks when compared to a network finetuned over a single task.

Figure 1: The conceptual state of the deep learning space, presented as a trade-off between time-consuming, task-specific neural network training targeted to a particular dataset, and fast neural network training through transfer learning, which obtains reasonable performance from relatively generalizable neural networks. The goal of this paper is to evaluate the performance of a network architecture named Stacked Neural Networks (S-NN) that leverages the fast training speed of transfer learning while considerably increasing its accuracy by generating better features.
1. Introduction
Transfer learning is a general framework in which one trains a pre-trained neural network on a new task for which the network was not originally trained. Remarkably, Razavian et al. [14] have shown that transfer learning on a pre-trained neural network can outperform traditional hand-tuned approaches in several tasks including coarse-grained detection, fine-grained detection and attribute detection. In their work, Razavian et al. point out that "It's all about the features."
Figure 1 depicts the trade-off between training a deep network extensively on a particular dataset to obtain high classification accuracy, and transfer learning, which produces decent classification accuracy over a short training time. To reach this target, this work presents a novel method for leveraging higher generalization accuracy from transfer learning. Our aim is to find features for vision datasets that are highly generalizable and fast to train.
Since Razavian et al.'s work, several state-of-the-art pre-trained neural networks have been made publicly available in the Caffe Model Zoo [1]. These include networks such as VGG [15], GoogLeNet [5], Places [19], and NIN [8]. Our initial studies suggested that these networks have non-overlapping mis-classification behavior. This observation leads us to believe that combining these networks can improve classification by compensating for the shortcomings of the individual networks. What is valuable to understand is whether there is a single combination of neural networks that is generalizable across all datasets, or whether different combinations of networks provide optimal results for different datasets.
Before going any further, a short note about terminology. In this paper, we use the term transfer learning to mean training an SVM classifier on top of features extracted from pre-trained networks without changing the networks themselves. On the other hand, by fine-tuning we mean changing all the layers of a network to better fit the given task.
Ensembling has been recognized as a simple way to boost the performance in a vision task by averaging the scores of multiple networks together. However, in some tasks such as Image-to-Sentence retrieval [9], a set of features is desirable instead of a score over some set of pre-defined classes. Thus, we try to tackle the problem of generating better features rather than simply improving the transfer learning accuracy by ensembling.
In this work, we show that combining the features of several networks through a novel technique which we call stacking offers better accuracy in many vision tasks. We also evaluate various combinations of networks to find which combinations offer the best accuracy across the different datasets, and whether there is a single combination of networks that offers substantially higher accuracy across the board.
We also evaluate the effect of using an ensemble of stacked networks rather than a single stacked network in order to boost the performance of transfer learning even further. We observe that ensembling can provide a substantial boost in performance over and above that offered by stacking.
We also examine the effects of fine-tuning on the generalizability of the features output by a network. We observe a significant drop in the generalization performance of a network once it has been fine-tuned to one specific task. However, we show that joint finetuning of a single network on two different tasks can actually create a network whose accuracies are close to those of the individually fine-tuned networks. We also show that the features of a jointly fine-tuned network have significantly higher generalizability on unseen tasks than those of a network fine-tuned for a single task. However, we observe that this jointly fine-tuned network still significantly underperforms the baseline of using the pre-trained network features only. Section 6 details our experiments in this area.

2. Neural Networks Set

The network architectures and features used for this study are outlined below.
VGG 16-layers and 19-layers (VGG16, VGG19): This is an architecture proposed by Simonyan et al. [15] which uses a very deep network (16 and 19 layers respectively) with small 3x3 convolution filters to obtain state-of-the-art accuracies on the ImageNet 2014 Challenge. We use the fc7 layer of the VGG networks as the features layer.
GoogLeNet: This is an architecture used by Szegedy et al. [16], which uses several Inception modules to create a deeper network with 22 layers while having much fewer parameters than other networks such as VGG and AlexNet. We use the pool5/5x5_s1 layer of GoogLeNet as the features layer.
Places: This is a network created by Zhou et al. [19]. It has the same architecture as AlexNet [7] but is trained on the Places dataset instead of ImageNet to enable better performance in scene-centric tasks. We use the fc7 layer of Places as the features layer.
Network In Network (NIN): This is a network architecture used by Lin et al. [8] which uses neural networks as the layer transfer function instead of a convolution followed by a non-linearity. We use the pool4 layer as the features.

3. Datasets

Below, we describe the attributes of each of the datasets used in our evaluation.
Caltech-UCSD Birds 200-2011 [17]: This is a dataset of 200 different species of birds. The dataset consists of 11,788 images.
Caltech256 [4]: This is a dataset of 256 object categories containing 30,607 images. The dataset is collected from Google Images.
Food-101 [2]: This is a dataset of 101 distinct food categories with 1,000 foods per category.
LISA Traffic Sign Dataset [10]: This dataset contains 7,855 annotations on 6,610 video frames captured on US roads. Each image is labeled with the traffic signs visible in it as well as the location of each sign. It covers 47 of the US traffic signs.
MIT scene [13]: This is an indoor scene dataset with 15,620 images in 67 categories, each of which contains at least 100 images.
Oxford flowers [12]: This is a collection of 102 groups of flowers, each with 40 to 256 flowers.

4. Methodology

In this section, we formally define Stacked Neural Networks and discuss the studies we conducted to construct a novel deep learning framework for improving the state-of-the-art prediction accuracy of deep neural networks (NN).

4.1. Feature Stacking

A Stacked Neural Network (S-NN) is defined as a combination of publicly available neural network architectures whose features are extracted at an intermediate layer of each network and then concatenated together to form a larger feature set. Figure 2 illustrates this idea in detail. The concatenated feature vector is used to train a classifier layer which consists of an optional dropout layer, an affine layer and an SVM loss function.
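As a concrete illustration, the sketch below extracts the per-network features named in Section 2 and concatenates them into a single S-NN feature vector. It is a minimal sketch assuming Caffe's Python interface; the model file paths are placeholders, and per-network input preprocessing (resizing, mean subtraction) is omitted for brevity.

```python
import numpy as np
import caffe  # pre-trained models obtained from the Caffe Model Zoo [1]

# (deploy prototxt, weights, feature layer) per network; the file paths
# are placeholders, the layer names follow Section 2.
NET_SPECS = [
    ("vgg16_deploy.prototxt",     "vgg16.caffemodel",     "fc7"),
    ("vgg19_deploy.prototxt",     "vgg19.caffemodel",     "fc7"),
    ("googlenet_deploy.prototxt", "googlenet.caffemodel", "pool5/5x5_s1"),
    ("places_deploy.prototxt",    "places.caffemodel",    "fc7"),
    ("nin_deploy.prototxt",       "nin.caffemodel",       "pool4"),
]
NETS = [(caffe.Net(proto, weights, caffe.TEST), layer)
        for proto, weights, layer in NET_SPECS]

def stacked_features(image):
    """Concatenate intermediate-layer features from every network."""
    feats = []
    for net, layer in NETS:
        # `image` is assumed already preprocessed to each net's input shape;
        # real code would apply per-network preprocessing here.
        net.blobs["data"].data[...] = image
        net.forward()
        feats.append(net.blobs[layer].data[0].ravel())
    return np.concatenate(feats)  # the S-NN feature vector
```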
Figure 2: A stack of five publicly available neural network architectures (NIN, Places, GoogLeNet, VGG16, VGG19). The features generated from each network are combined into a unified feature vector, which is then classified by the dropout and affine layers followed by an SVM loss function. We call the combination of multiple networks a Stacked Neural Network (S-NN).

Figure 3: Scores generated from a group of S-NNs, each containing a subset of the networks illustrated in Figure 2, are combined to generate the mean score of the ensemble.
The impact of the dropout layer will be discussed in detail in Section 5. While Figure 2 shows all five convolutional neural networks (CNN) as members of the S-NN, any combination of these CNNs is also considered an S-NN. For instance, {GoogLeNet, VGG16} and {NIN, Places, VGG19} are examples of 2-network and 3-network S-NNs. We evaluate the effect of different network combinations in Section 5, showing that S-NNs deliver higher classification accuracy than single-network structures.
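To make the classifier layer concrete, here is a minimal numpy sketch of the head described above: inverted dropout on the stacked features, an affine layer (bias omitted), and a multiclass hinge (SVM) loss trained with plain SGD. The hyperparameter defaults are illustrative, not the values used in our experiments.

```python
import numpy as np

def train_svm_head(X, y, num_classes, lr=1e-3, reg=1e-4,
                   epochs=10, dropout_p=0.5, rng=np.random):
    """Sketch of the S-NN classifier head: optional (inverted) dropout,
    an affine layer (bias omitted), and a multiclass hinge (SVM) loss.
    X: (N, D) stacked features; y: (N,) integer labels."""
    N, D = X.shape
    W = 0.001 * rng.randn(D, num_classes)
    for _ in range(epochs):
        for i in rng.permutation(N):
            x = X[i].astype(float)
            if dropout_p > 0:  # inverted dropout on the stacked features
                mask = (rng.rand(D) >= dropout_p) / (1.0 - dropout_p)
                x = x * mask
            scores = x @ W                        # affine layer
            margins = np.maximum(0.0, scores - scores[y[i]] + 1.0)
            margins[y[i]] = 0.0                   # hinge-loss margins
            dscores = (margins > 0).astype(float)
            dscores[y[i]] = -dscores.sum()        # gradient w.r.t. scores
            W -= lr * (np.outer(x, dscores) + reg * W)  # SGD step
    return W
```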
4.2. Ensemble of S-NNs

It has been shown in the literature [3] that an ensemble of multiple independently trained networks can improve the prediction accuracy by reducing the classification error rate. Each combination of S-NNs produces a new feature vector and a new set of scores. To further improve the classification accuracy, we studied the effect of taking the ensemble mean of scores on the final classification. Figure 3 shows a number of S-NNs whose scores are combined into an ensemble score. While any arbitrary group of S-NNs may be used to generate an ensemble score, we choose to compute this score by ensembling an S-NN with all of its network combination subsets. For example, given an S-NN containing the three networks {NIN, VGG19, GoogLeNet}, we take the ensemble score of all its non-empty subsets: {NIN}, {VGG19}, {GoogLeNet}, {NIN, VGG19}, {NIN, GoogLeNet}, {VGG19, GoogLeNet}, and {NIN, VGG19, GoogLeNet}. This method of forming ensembles allows us to compare the performance of each S-NN combination against other S-NNs, as discussed in detail in Section 5 and Figure 5.
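A small sketch of this subset-ensembling rule follows. It assumes a hypothetical helper snn_scores(subset) that returns the class scores of the classifier trained on the stacked features of the given subset.

```python
from itertools import combinations
import numpy as np

def subset_ensemble_scores(networks, snn_scores):
    """Average the class scores over every non-empty subset of `networks`.

    `snn_scores(subset)` is a hypothetical helper returning the scores of
    the S-NN built from that subset (Section 4.2)."""
    subsets = [s for r in range(1, len(networks) + 1)
               for s in combinations(networks, r)]
    return np.mean([snn_scores(s) for s in subsets], axis=0)

# The seven subsets of {NIN, VGG19, GoogLeNet} listed above are
# generated and averaged by a single call:
# scores = subset_ensemble_scores(("NIN", "VGG19", "GoogLeNet"), snn_scores)
```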
5. Results Discussion

In this section we analyze the impact of S-NNs and of ensembles of S-NNs. In doing so, we answer two key questions:

1. What is the impact of multiple networks on performance? In other words, do more networks necessarily mean higher performance?

2. What is the most generalizable S-NN architecture if we were to pick one combination of these five NNs?

Figure 4 aims to answer question (1). It shows our best experimental results on all combinations of 1, 2, 3, and 5 networks for all datasets. Each S-NN combination was evaluated using tens of different hyperparameter sweeps (i.e. learning rate, regularization factor, number of epochs); the list of hyperparameters used in this work is included in Table 1. We evaluated the validation accuracy for single-network, 2-network, 3-network, and 5-network stack combinations. All 2-network experiments do not use the dropout layer while all 3-network experiments do (see Figure 2); we discovered that dropout becomes an important training element once the number of networks is larger than two. For the case of five networks, however, we evaluated the network performance both with and without the dropout layer. Figure 4 shows that for most datasets the 5-network case is favorable. It also indicates that the 2-network case is strongly competitive with the 5-network case.
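The hyperparameter sweep described above can be expressed as a simple grid search. The grid values below are hypothetical, since only the swept quantities (not their ranges) are stated in the text.

```python
from itertools import product

# Hypothetical grids over the three swept hyperparameters (Section 5).
learning_rates = [1e-2, 1e-3, 1e-4]
reg_factors = [1e-3, 1e-4, 1e-5]
epoch_counts = [5, 10, 20]

def sweep(validation_accuracy):
    """Return the (lr, reg, epochs) setting with the best validation
    accuracy; `validation_accuracy` is expected to train the SVM head
    (e.g. with train_svm_head above) and score it on held-out data."""
    return max(product(learning_rates, reg_factors, epoch_counts),
               key=lambda params: validation_accuracy(*params))
```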
Figure 4: Best performing S-NNs over six different datasets (mitscene, food-101, lisa, caltech256, birds200, flowers) and several different hyperparameter settings. S-NNs with 1, 2, 3, and 5 networks are included. For each number of networks, all combinations are evaluated.

Figure 5: The horizontal axis shows all possible S-NN architectures (combinations of NIN, GoogLeNet, Places, VGG16 and VGG19); the vertical axis shows the ensemble accuracy of each architecture with all combinations of its subsets. For example, the subsets of {NIN, VGG16} used to compute the ensemble accuracy are {NIN, VGG16}, {VGG16}, and {NIN}. The network ensembles that deliver the highest generalization accuracy in the 1, 2, 3, and 4-network cases are highlighted in gray.
Figure 6: Best performance results generated across all tests, per dataset (mitscene, food-101, lisa, caltech256, birds200, flowers). The dashed lines represent the state of the art without data augmentation and the solid lines represent the state of the art with data augmentation.

[Figure panels: (c) VGG16 CNN; (d) VGG16+GoogLeNet+Places CNNs.]
[Figure: a jointly finetuned network with a shared trunk and two heads, each an affine layer followed by a softmax loss — one for task A (Food-101) and one for task B.]

To answer this question, we took a third task C (Caltech256 in our experiments), and evaluated the transfer learning accuracy of task C over VGG16, VGG16A and VGG16AB. The results are summarized in Table 3.

Table 3: Performance of task C on VGG16, VGG16A and VGG16AB.
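As a sketch of the joint-finetuning setup depicted in the figure above: a shared trunk feeds two task-specific affine + softmax heads. The alternating-batch schedule below is an assumption for illustration; the text does not specify how batches from the two datasets were interleaved, and only the head gradients are shown.

```python
import numpy as np

def softmax_head(features, W, y):
    """One affine + softmax-loss head; returns the loss and dL/dW."""
    scores = features @ W
    scores -= scores.max(axis=1, keepdims=True)   # numerical stability
    probs = np.exp(scores)
    probs /= probs.sum(axis=1, keepdims=True)
    n = len(y)
    loss = -np.log(probs[np.arange(n), y]).mean()
    dscores = probs
    dscores[np.arange(n), y] -= 1                 # gradient w.r.t. scores
    return loss, features.T @ dscores / n

def joint_finetune_step(step, trunk_forward, heads, loaders):
    """One step of joint finetuning: a shared trunk with a separate
    head per task, fed by alternating batches (assumed schedule)."""
    task = step % 2                 # even steps: task A, odd steps: task B
    x, y = next(loaders[task])
    feats = trunk_forward(x)        # shared-trunk features
    return softmax_head(feats, heads[task], y)
```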
The stacked features can also be weighted based on the prediction accuracy of each network in classifying a given dataset. For instance, if the GoogLeNet CNN shows weaker classification performance than the Places CNN, the relative contribution of the GoogLeNet features is reduced. To do so, the feature vector of each network is multiplied by a scalar value in [0, 1], computed by dividing the classification accuracy of each NN by the accuracy of the best-performing network. For instance, assume an S-NN consisting of {GoogLeNet, VGG16}. If GoogLeNet and VGG16 have individual classification accuracies of 0.3 and 0.6 respectively, the GoogLeNet and VGG16 features are multiplied by 0.5 and 1 respectively before stacking.
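A minimal sketch of this accuracy-based weighting; the feature arrays and accuracy values below are placeholders.

```python
import numpy as np

def accuracy_weighted_stack(feature_vectors, accuracies):
    """Scale each network's features by its accuracy relative to the
    best network, then concatenate them for stacking (Section 7)."""
    best = max(accuracies)
    return np.concatenate([f * (a / best)
                           for f, a in zip(feature_vectors, accuracies)])

# Example from the text: {GoogLeNet, VGG16} with accuracies 0.3 and 0.6
# yields weights 0.5 and 1.0 respectively.
g = np.ones(1024)   # placeholder GoogLeNet features
v = np.ones(4096)   # placeholder VGG16 features
stacked = accuracy_weighted_stack([g, v], [0.3, 0.6])
```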
7.3. Data Augmentation

Data augmentation has been shown to improve classification accuracy by diversifying NN features. Prior works [11, 2, 17] show substantial improvements on the MIT Scene, CUB 200, and Oxford Flowers datasets when their networks are trained using data augmentation. While we did not have time to try this technique, we believe it would substantially boost our prediction accuracy on most datasets, if not all.
7.4. Parallel Wimpy Networks

Inspired by the notion of ensembles presented by Hinton et al. [6], this study shows that combining multiple powerful networks leads to substantial performance gains. It also shows that training a single network on multiple datasets can deliver better generalization accuracy. The next milestone we would like to tackle is to evaluate the possibility of building numerous small, fast-to-train networks trained on multiple datasets and stacked as S-NNs. We call them a stack of wimpy neural networks. Such a technique is interesting to us on two fronts. The first is to find out whether multiple wimpy S-NNs can do as well as (or better than) a powerful network like VGG19. The second is to find out whether this architecture can help reduce the computation demands of recent deep neural networks such as VGG19 by enabling a much more parallelizable network architecture that can be conveniently offloaded onto multiple computation units (i.e. CPUs or GPUs).
8. Conclusion

In this work, we presented Stacked Neural Networks, a novel technique for extracting higher generalization accuracy from the state-of-the-art neural networks in the public domain. We evaluated various NN stack combinations and discovered that while a five-CNN stack delivers the best accuracy, a stack of two CNNs can deliver similar accuracy gains while consuming much less computation power. We also presented the classification accuracy improvements obtained by ensembling S-NNs. The combination of these techniques enabled us to boost the classification accuracy beyond the state-of-the-art results presented in previous literature.
Furthermore, we evaluated the effect of training multiple datasets on one network. Interestingly, we concluded that it is possible to jointly finetune a single network over multiple datasets and still obtain accuracies that are almost similar to individually finetuning the networks. We also showed that these jointly finetuned networks have better generalization capabilities than individually finetuned variants.
S-NN demonstrates that collaborative neural network classification is a fruitful way to improve generalization accuracy in transfer learning.
References

[1] Caffe Model Zoo.
[2] L. Bossard, M. Guillaumin, and L. Van Gool. Food-101 – Mining discriminative components with random forests. In European Conference on Computer Vision, 2014.
[3] L. Chen. Learning ensembles of convolutional neural networks.
[4] G. Griffin, A. Holub, and P. Perona. Caltech-256 object category dataset. 2007.
[5] P. V. Group. A GPU implementation of GoogLeNet.
[6] G. Hinton, O. Vinyals, and J. Dean. Dark knowledge. Lecture, 2014.
[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems 25, pages 1097-1105. Curran Associates, Inc., 2012.
[8] M. Lin, Q. Chen, and S. Yan. Network in network. arXiv:1312.4400 [cs], Dec. 2013.
[9] J. Mao, W. Xu, Y. Yang, J. Wang, and A. Yuille. Deep captioning with multimodal recurrent neural networks (m-RNN). arXiv:1412.6632 [cs], Dec. 2014.
[10] A. Mogelmose, M. M. Trivedi, and T. B. Moeslund. Vision-based traffic sign detection and analysis for intelligent driver assistance systems: Perspectives and survey. IEEE Transactions on Intelligent Transportation Systems, 13(4):1484-1497, 2012.
[11] M.-E. Nilsback and A. Zisserman. A visual vocabulary for flower classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, volume 2, pages 1447-1454, 2006.
[12] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Computer Vision, Graphics & Image Processing (ICVGIP '08), pages 722-729. IEEE, 2008.
[13] A. Quattoni and A. Torralba. Recognizing indoor scenes. 2009.
[14] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson. CNN features off-the-shelf: An astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), pages 512-519. IEEE, 2014.
[15] K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image recognition. arXiv:1409.1556 [cs], Sept. 2014.
[16] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich. Going deeper with convolutions. arXiv:1409.4842 [cs], Sept. 2014.
[17] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 dataset. 2011.
[18] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson. How transferable are features in deep neural networks? In Advances in Neural Information Processing Systems, pages 3320-3328, 2014.
[19] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva. Learning deep features for scene recognition using Places database. NIPS, 2014.