Analyzing the Performance of Multilayer Neural Networks for Object Recognition
1 Introduction
Over the last two years, a sequence of results on benchmark visual recognition
tasks has demonstrated that convolutional neural networks (CNNs) [6,14,18] will
likely replace engineered features, such as SIFT [15] and HOG [2], for a wide
variety of problems. This sequence started with the breakthrough ImageNet [3]
classification results reported by Krizhevsky et al. [11]. Soon after, Donahue et
al. [4] showed that the same network, trained for ImageNet classification, was an
effective blackbox feature extractor. Using CNN features, they reported state-
of-the-art results on several standard image classification datasets. At the same
time, Girshick et al. [7] showed how the network could be applied to object
detection. Their system, called R-CNN, classifies object proposals generated by
a bottom-up grouping mechanism (e.g., selective search [23]). Since detection
training data is limited, they proposed a transfer learning strategy in which the
CNN is first pre-trained, with supervision, for ImageNet classification and then
fine-tuned on the small PASCAL detection dataset [5]. Since this initial set of
results, several other papers have reported similar findings on a wider range of
tasks (see, for example, the outcomes reported by Razavian et al. in [17]).
Feature transforms such as SIFT and HOG afford an intuitive interpretation
as histograms of oriented edge filter responses arranged in spatial blocks. How-
ever, we have little understanding of what visual features the different layers of
a CNN encode. Given that the rich feature hierarchies provided by CNNs are likely
to emerge as the prominent feature extractors for computer vision models over
the next few years, we believe that developing such an understanding is an inter-
esting scientific pursuit and an essential exercise that will help guide the design
of computer vision methods that use CNNs. Therefore, in this paper we study
several aspects of CNNs through an empirical lens.
ImageNet Pre-training Does not Overfit. One concern when using super-
vised pre-training is that achieving a better model fit to ImageNet, for example,
might lead to higher generalization error when applying the learned features
to another dataset and task. If this is the case, then some form of regulariza-
tion during pre-training, such as early stopping, would be beneficial. We show
the surprising result that pre-training for longer yields better results, with di-
minishing returns, but does not increase generalization error. This implies that
fitting the CNN to ImageNet induces a general and portable feature represen-
tation. Moreover, the learning process is well behaved and does not require ad
hoc regularization in the form of early stopping.
2 Experimental Setup
Object Detection. For the task of object detection we use PASCAL VOC
2007. We train using the trainval set and test on the test set. We refer to this
dataset and task by “PASCAL-DET”. PASCAL-DET uses the same set of im-
ages as PASCAL-CLS. We note that it is standard practice to use the 2007
version of PASCAL VOC for reporting results of ablation studies and hyperpa-
rameter sweeps. We report performance on PASCAL-DET using the standard
AP and mAP metrics. In some of our experiments we use only the ground-
truth PASCAL-DET bounding boxes, in which case we refer to the setup by
“PASCAL-DET-GT”.
In order to provide a larger detection training set for certain experiments,
we also make use of the “PASCAL-DET+DATA” dataset, which we define as the
union of VOC 2007 trainval and VOC 2012 trainval. The VOC 2007
test set is still used for evaluation. This dataset contains approximately 37k
labeled bounding boxes, which is roughly three times the number contained in
PASCAL-DET.
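As an illustration of how such a combined training set can be assembled (the data pipeline actually used in this work is not shown here), the following sketch uses torchvision's VOCDetection datasets; the root directory is a placeholder and the torchvision usage is an assumption, not part of the original setup.

```python
from torch.utils.data import ConcatDataset
from torchvision.datasets import VOCDetection

# "PASCAL-DET+DATA": VOC 2007 trainval plus VOC 2012 trainval for training;
# the VOC 2007 test set is kept aside for evaluation.
voc07_trainval = VOCDetection(root="data", year="2007", image_set="trainval")
voc12_trainval = VOCDetection(root="data", year="2012", image_set="trainval")
train_set = ConcatDataset([voc07_trainval, voc12_trainval])

voc07_test = VOCDetection(root="data", year="2007", image_set="test")
```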
Fig. 1. PASCAL object class selectivity plotted against the fraction of filters, for each
layer, before fine-tuning (dash-dot line) and after fine-tuning (solid line). A lower value
indicates greater class selectivity. Although layers become more discriminative as we go
higher up in the network, fine-tuning on limited data (PASCAL-DET) only significantly
affects the last two layers (fc-6 and fc-7).
Table 2. Comparison in performance when fine-tuning the entire network (ft) versus
only fine-tuning the fully-connected layers (fc-ft)

A small value of this entropy-based selectivity measure indicates that a filter is
highly class selective, while a large value indicates that a filter fires regardless of
class. The precise definition of this measure is given in the Appendix.
In order to summarize the class selectivity for a set of filters, we sort them
from the most selective to least selective and plot the average selectivity of the
first k filters while sweeping k down the sorted list. Figure 1 shows the class
selectivity for the sets of filters in layers 1 to 7 before and after fine-tuning (on
VOC 2007 trainval). Selectivity is measured using the ground truth boxes from
PASCAL-DET-GT instead of a whole-image classification task to ensure that
filter responses are a direct result of the presence of object categories of interest
and not correlations with image background.
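Since the precise selectivity measure is defined only in the Appendix (not reproduced here), the sketch below assumes a label-entropy style measure computed from non-negative filter responses on PASCAL-DET-GT boxes; the function names, the normalization, and the entropy formulation are ours, not the paper's. The second function produces the curve plotted in Figure 1: the average selectivity of the k most selective filters, swept over k.

```python
import numpy as np

def label_entropy_selectivity(responses, labels, n_classes):
    """Per-filter selectivity as the entropy of the class-label distribution
    weighted by (non-negative) filter responses; lower = more class selective.
    responses: (n_boxes, n_filters) activations, labels: (n_boxes,) class ids.
    NOTE: an illustrative assumption; the paper's exact measure is in its Appendix."""
    responses = np.maximum(responses, 0)
    entropies = np.zeros(responses.shape[1])
    for f in range(responses.shape[1]):
        mass = np.array([responses[labels == c, f].sum() for c in range(n_classes)])
        p = mass / max(mass.sum(), 1e-12)
        entropies[f] = -(p * np.log(p + 1e-12)).sum()
    return entropies

def selectivity_curve(entropies):
    """Average selectivity of the k most selective filters, for every k.
    Plot against np.arange(1, n+1) / n to get the 'fraction of filters' axis."""
    s = np.sort(entropies)                      # most selective (lowest) first
    return np.cumsum(s) / np.arange(1, len(s) + 1)
```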
Per-layer performance as a function of the number of ImageNet pre-training iterations
(performance improves with longer pre-training, with diminishing returns):

layer    5k    15k   25k   35k   50k   95k   105k  195k  205k  305k
conv-1   23.0  24.3  24.4  24.5  24.3  24.8  24.7  24.4  24.4  24.4 ± 0.5
conv-2   33.7  40.4  40.9  41.8  42.7  43.2  44.0  45.0  45.1  45.1 ± 0.7
conv-3   34.2  46.8  47.0  48.2  48.6  49.4  51.6  50.7  50.9  50.5 ± 0.6
conv-4   33.5  49.0  48.7  50.2  50.7  51.6  54.1  54.3  54.4  54.2 ± 0.7
conv-5   33.0  53.4  55.0  56.8  57.3  59.2  63.5  64.9  65.5  65.6 ± 0.3
fc-6     34.2  59.7  62.6  62.7  63.5  65.6  69.3  71.3  71.8  72.1 ± 0.3
fc-7     30.9  61.3  64.1  65.1  65.9  67.8  71.8  73.4  74.0  74.3 ± 0.3

Figure 1 shows that class selectivity increases from layer 1 to 7 both with
and without fine-tuning. It is interesting to note that entropy changes due to
fine-tuning are only significant for layers 6 and 7. This observation indicates
that fine-tuning only layers 6 and 7 may suffice for achieving good performance
when fine-tuning data is limited. We tested this hypothesis on SUN-CLS and
PASCAL-DET by comparing the performance of a fine-tuned network (ft) with
a network which was fine-tuned by only updating the weights of fc-6 and fc-7
(fc-ft). These results, in Table 2, show that with small amounts of data, fine-
tuning amounts to “rewiring” the fully connected layers. However, when more
fine-tuning data is available (PASCAL-DET+DATA), there is still substantial
benefit from fine-tuning all network parameters.
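In practice the ft/fc-ft distinction is simply a matter of which parameters receive gradient updates during fine-tuning. The sketch below is an illustrative PyTorch example, not the training code used in this work: torchvision's AlexNet is a stand-in for the network, with `features` playing the role of conv-1 through conv-5 and `classifier` the fully connected layers.

```python
import torch
from torchvision import models

# Illustrative "fc-ft" setup (an assumption, not the paper's actual code):
# freeze the convolutional layers and update only the fully connected ones.
net = models.alexnet(weights=None)        # stand-in for an ImageNet pre-trained CNN
for param in net.features.parameters():   # conv layers: frozen
    param.requires_grad = False

optimizer = torch.optim.SGD(
    [p for p in net.parameters() if p.requires_grad],  # fully connected layers only
    lr=1e-3, momentum=0.9)
```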
Fig. 2. Evolution of conv-1 filters with time. After just 15k iterations, these filters
closely resemble their converged state.
Neuroscientists have conjectured that cells in the human brain which only re-
spond to very specific and complex visual stimuli (such as the face of one’s grand-
mother) are involved in object recognition. These neurons are often referred to as
grandmother cells (GMC) [1,16]. Proponents of artificial neural networks have
shown great interest in reporting the presence of GMC-like filters for specific
object classes in their networks (see, for example, the cat filter reported in [13]).
The notion of GMC-like features is also related to standard feature encodings
for image classification. Prior to the work of [11], the dominant approaches for
image and scene classification were based on either representing images as a bag
of local descriptors (BoW), such as SIFT (e.g., [12]), or by first finding a set
of mid-level patches [10,20] and then encoding images in terms of them. The
problem of finding good mid-level patches is often posed as a search for a set
of high-recall discriminative templates. In this sense, mid-level patch discovery
is the search for a set of GMC templates. The low-level BoW representation, in
contrast, is a distributed code in the sense that a single feature by itself is not dis-
criminative, but a group of features taken together is. This makes it interesting
to investigate the nature of mid-level CNN features such as conv-5.
For understanding these feature representations in CNNs, [19,26] recently pre-
sented methods for finding locally optimal visual inputs for individual filters.
However, these methods only find the best, or in some cases top-k, visual inputs
that activate a filter, but they do not characterize the distribution of images that
cause an individual filter to fire above a certain threshold. For example, if it is
found that the top-10 visual inputs for a particular filter are cats, it remains
unclear how that filter responds to other images of cats. Thus, it is not possible
to make claims about the presence of GMC-like filters for the cat class based on
such an analysis. A GMC filter for the cat class is one that fires strongly on all
cats and on nothing else. This criterion can be expressed as a filter that has both
high precision and high recall. That is, a GMC filter for class C is a filter that has
a high average precision (AP) when tasked with classifying inputs from class C
versus inputs from all other classes.

² A network pre-trained from scratch, which was different from the one used in
Section 3.1, was used to obtain these results. The difference in performance is not
significant.

Fig. 3. The precision-recall curves for the top five (based on AP) conv-5 filter responses
on PASCAL-DET-GT. Curves in red and blue correspond to the fine-tuned and pre-
trained networks, respectively. The dashed black line is the performance of a random
filter. For most classes, precision drops significantly even at modest recall values. There
are GMC filters for classes such as bicycle, person, car, and cat.
First, we address the question of finding GMC filters by computing the AP of
individual filters (Section 4.1). Next, we measure how distributed the feature
representations are (Section 4.2). For both experiments we use features from layer
conv-5, which consists of responses of 256 filters in a 6 × 6 spatial grid. Using
max pooling, we collapse the spatial grid into a 256-D vector, so that for each
filter we have a single response per image (in Section 5.1 we show that this
transformation causes only a small drop in task performance).
For each filter, we compute its AP for classifying the ground-truth object bounding
boxes from PASCAL-DET-GT, using the filter's response to each box as the
classification score and the box's class label as ground truth. Then,
for each class we sort the filters in decreasing order of their APs. If GMC filters
for this class exist, they should be the top-ranked filters in this sorted list. The
precision-recall curves for the top-five conv-5 filters are shown in Figure 3. We
find that GMC-like filters exist only for a few classes, such as bicycle, person,
car, and cat.
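This per-filter AP test is straightforward to express with scikit-learn's average_precision_score. The sketch below is a minimal illustration, not the paper's code: it assumes the max-pooled conv-5 responses described above (one scalar per filter per ground-truth box) and integer class labels, and the function and variable names are ours.

```python
import numpy as np
from sklearn.metrics import average_precision_score

# pooled: (n_boxes, 256) max-pooled conv-5 responses, one value per filter per
# ground-truth box; labels: (n_boxes,) integer class ids.
def per_filter_ap(pooled, labels, target_class):
    """AP of each conv-5 filter at separating boxes of `target_class` from all
    other boxes, using the filter's pooled response as the ranking score."""
    y_true = (labels == target_class).astype(int)
    return np.array([average_precision_score(y_true, pooled[:, f])
                     for f in range(pooled.shape[1])])

# Example: rank filters for one class; GMC-like filters, if any, rank first.
# aps = per_filter_ap(pooled, labels, target_class=7)
# ranked_filters = np.argsort(-aps)
```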
Table 5. Number of filters required to achieve 50% or 90% of the complete perfor-
mance on PASCAL-DET-GT using a CNN pre-trained on ImageNet and fine-tuned for
PASCAL-DET using conv-5 features
perf. aero bike bird boat bottle bus car cat chair cow table dog horse mbike person plant sheep sofa train tv
pre-train 50% 15 3 15 15 10 10 3 2 5 15 15 2 10 3 1 10 20 25 10 2
fine-tune 50% 10 1 20 15 5 5 2 2 3 10 15 3 15 10 1 5 15 15 5 2
pre-train 90% 40 35 80 80 35 40 30 20 35 100 80 30 45 40 15 45 50 100 45 25
fine-tune 90% 35 30 80 80 30 35 25 20 35 50 80 35 30 40 10 35 40 80 40 20
Fig. 5. The set overlap between the 50 most discriminative conv-5 filters for each class
determined using PASCAL-DET-GT. Entry (i, j) of the matrix is the fraction of top-50
filters class i has in common with class j (Section 4.2). Chance is 0.195. There is little
overlap, but related classes are more likely to share filters.
From Figure 5 it can be seen that different classes use different subsets of conv-5
filters and that there is little overlap between classes. This further indicates that
intermediate representations in the CNN are distributed.
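The Figure 5 matrix can be computed directly from the per-filter APs of Section 4.1. Below is a minimal sketch under the assumption that the APs are available as an (n_classes, n_filters) array; the helper name is ours. As a sanity check on the quoted chance level, two random 50-filter subsets of 256 filters share 50·50/256 ≈ 9.8 filters on average, i.e. an overlap fraction of about 0.195.

```python
import numpy as np

def overlap_matrix(ap_per_class, k=50):
    """Entry (i, j): fraction of the top-k filters (by AP) for class i that are
    also among the top-k filters for class j.
    ap_per_class: (n_classes, n_filters) per-filter APs, as in Section 4.1."""
    top = [set(np.argsort(-ap)[:k]) for ap in ap_per_class]
    n = len(top)
    M = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            M[i, j] = len(top[i] & top[j]) / k
    return M

# Chance level for k=50 out of 256 filters: (50 * 50 / 256) / 50 ≈ 0.195.
```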
Now we remove spatial information from filter responses while retaining informa-
tion about their magnitudes. We consider two methods for ablating spatial infor-
mation from features computed by the convolutional layers (the fully-connected
layers do not contain explicit spatial information).
The first method (“sp-max”) simply collapses the p × p spatial map into a sin-
gle value per feature channel by max pooling. The second method (“sp-shuffle”)
retains the original distribution of feature activation values, but scrambles spa-
tial correlations between columns of feature channels. To perform sp-shuffle, we
permute the spatial locations in the p × p spatial map. This permutation is
performed independently for each network input (i.e., different inputs undergo
different permutations). Columns of filter responses in the same location move
together, which preserves correlations between features within each (shuffled)
spatial location. These transformations are illustrated in Figure 6.
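As a concrete reference for the two ablations, here is a minimal numpy sketch, assuming a convolutional feature map stored as a (p, p, C) array per image; the function names are ours, not the paper's.

```python
import numpy as np

def sp_max(fmap):
    """'sp-max': collapse the p x p spatial map to one value per feature channel."""
    return fmap.max(axis=(0, 1))                      # (p, p, C) -> (C,)

def sp_shuffle(fmap, rng):
    """'sp-shuffle': permute spatial locations, independently per input.
    Whole columns of filter responses move together, so correlations between
    channels at a location are preserved while spatial structure is destroyed."""
    p, _, c = fmap.shape
    cols = fmap.reshape(p * p, c)                     # one column per location
    perm = rng.permutation(p * p)
    return cols[perm].reshape(p, p, c)

# rng = np.random.default_rng(0)
# shuffled = sp_shuffle(conv5_map, rng)   # a fresh permutation for each input
```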
6 Conclusion
To help researchers better understand CNNs, we investigated pre-training and
fine-tuning behavior on three classification and detection datasets. We found that
the large CNN used in this work can be trained from scratch using a surprisingly
modest amount of data. But, importantly, pre-training significantly improves
performance and pre-training for longer is better. We also found that some of
the learnt CNN features are grandmother-cell-like, but for the most part they
form a distributed code. This supports the recent set of empirical results showing
that these features generalize well to other datasets and tasks.
References
1. Barlow, H.: Single units and sensation: A neuron doctrine for perceptual psychology? Perception (1972)
2. Dalal, N., Triggs, B.: Histograms of oriented gradients for human detection. In: CVPR, pp. 886–893 (2005)
3. Deng, J., Dong, W., Socher, R., Li, L.J., Li, K., Fei-Fei, L.: ImageNet: A large-scale hierarchical image database. In: CVPR (2009)
4. Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., Darrell, T.: DeCAF: A deep convolutional activation feature for generic visual recognition. arXiv preprint arXiv:1310.1531 (2013)
5. Everingham, M., Van Gool, L., Williams, C.K.I., Winn, J., Zisserman, A.: The PASCAL Visual Object Classes (VOC) Challenge. IJCV 88(2) (2010)
6. Fukushima, K.: Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position. Biological Cybernetics 36(4), 193–202 (1980)
7. Girshick, R., Donahue, J., Darrell, T., Malik, J.: Rich feature hierarchies for accurate object detection and semantic segmentation. In: CVPR (2014)