Places: A 10 Million Image Database for Scene Recognition
Abstract—The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database, along with the Places-CNNs, offers a novel resource to guide future progress on scene recognition problems.
Index Terms—Scene classification, visual recognition, deep learning, deep feature, image dataset.
Fig. 2. Sorted distribution of the number of images per category in the Places Database. Places contains 10,624,928 images from 434 categories. Category names are shown at every sixth interval.
The successes of Deep Blue in chess, Watson in "Jeopardy!", and AlphaGo in Go against their expert human opponents may thus be seen not just as advances in algorithms, but also as a consequence of the increasing availability of very large datasets: 700,000, 8.6 million, and 30 million items, respectively [6]–[8]. Convolutional Neural Networks [3], [9] have likewise achieved near human-level visual recognition, trained on 1.2 million object images [10]–[12] and 2.5 million scene images [1]. Expansive coverage of the space of classes and samples brings us closer to the ecosystem of data that a natural system, like a human, would experience. The history of image datasets for scene recognition shows a similarly rapid growth in the number of image samples, as outlined below.

1.2 Scene-centric Datasets

The first benchmark for scene recognition was the Scene15 database [13], extended from the initial 8-scene dataset in [14]. This dataset contains only 15 scene categories with a few hundred images per class, and current classifiers are saturated on it, reaching near-human performance at 95%. The MIT Indoor67 database [15], with 67 indoor categories, and the SUN database [16] (Scene UNderstanding, with 397 categories and 130,519 images) provided larger coverage of place categories, but fell short of the quantity of data needed to feed deep learning algorithms. To complement large object-centric datasets such as ImageNet [11], we build the Places dataset described here.

Meanwhile, the Pascal VOC dataset [17] is one of the earliest image datasets with diverse object annotations in scene context. The Pascal VOC challenge has greatly advanced the development of models for object detection and segmentation tasks. Nowadays, the COCO dataset [18] focuses on collecting object instances, with both polygon and bounding-box annotations, for images depicting everyday scenes of common objects. The recent Visual Genome dataset [19] aims at collecting dense annotations of objects, attributes, and their relationships. ADE20K [20] collects precise, dense annotations of scenes, objects, and parts of objects with a large and open vocabulary. Altogether, these annotated datasets further enable artificial systems to learn visual knowledge linking parts, objects and scene context.
2 PLACES DATABASE

2.1 Coverage of the categorical space

The first asset of a high-quality dataset is an expansive coverage of the categorical space to be learned. The strategy of Places is to provide an exhaustive list of the categories of environments encountered in the world, bounded by spaces where a human body would fit (e.g. closet, shower). The SUN (Scene UNderstanding) dataset [16] provided that initial list of semantic categories. The SUN dataset was built around a quasi-exhaustive list of scene categories with different functionalities, namely categories with unique identities in discourse. Through the use of WordNet [21], the SUN database team selected 70,000 words and concrete terms describing scenes, places and environments that can be used to complete the phrase "I am in a place", or "let's go to the/a place". Most of the words referred to basic and entry-level names [22], resulting in a corpus of 900 different scene categories after bundling together synonyms and separating classes described by the same word but referring to different environments (e.g. inside and outside views of churches). Details about the building of that initial corpus can be found in [16]. The Places Database has inherited the same list of scene categories from the SUN dataset, with a few changes that are described in section 2.2.4.
2.2 Construction of the database

The construction of the Places Database is composed of four steps: querying and downloading images, labeling images with their ground-truth category, scaling up the dataset using a classifier, and further improving the separation of similar classes. The details of each step are introduced in the following sections.

The data collection process of the Places Database is similar to the image collection in other common datasets, like ImageNet and COCO. The definition of categories for the ImageNet dataset [11] is based on the synsets of WordNet [21]. Candidate images are queried from several image search engines using the set of WordNet synonyms, and are cleaned up through AMT using a binary task similar to ours. Quality control is done by having multiple users annotate the same image. There are about 500-1200 ground-truth images per synset. On the other hand, the COCO dataset [18] focuses on annotating the object instances inside images with more scene context. Its candidate images are mainly collected from Flickr, in order to include fewer of the iconic images commonly returned by image search engines. The image annotation process of COCO is split into category labeling, instance spotting, and instance segmentation, with all the tasks done by AMT workers. COCO has 80 object categories with more than 2 million object instances.
2.2.1 Step 1: Downloading images using scene category and attributes

From online image search engines (Google Images, Bing Images, and Flickr), candidate images were downloaded using a query word from the list of scene classes provided by the SUN database [16]. In order to increase the diversity of visual appearances in the Places dataset, each scene class query was combined with 696 common English adjectives¹ (e.g., messy, spare, sunny, desolate). In Fig. 3 we show some examples of images in Places grouped by queries. About 60 million images (color images of at least 200×200 pixels in size) with unique URLs were identified. Importantly, the Places and SUN datasets are complementary: PCA-based duplicate removal was conducted within each scene category in both databases, so that they do not contain the same images.
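The query-expansion scheme above (scene category combined with an adjective) is simple to reproduce. The sketch below is a minimal illustration of that idea; the local files scene_classes.txt and adjectives.txt are hypothetical stand-ins for the SUN class list and the adjective list, and the sketch only builds the query strings, without performing the downloading or the PCA-based de-duplication.

```python
# Minimal sketch of the Step-1 query expansion: combine each scene class
# with common English adjectives to diversify image-search queries.
# scene_classes.txt and adjectives.txt are hypothetical local files.
from itertools import product

def load_lines(path):
    with open(path, encoding="utf-8") as f:
        return [line.strip() for line in f if line.strip()]

scene_classes = load_lines("scene_classes.txt")   # e.g. "bedroom", "forest path"
adjectives = load_lines("adjectives.txt")         # e.g. "messy", "sunny", ...

# Plain class-name queries plus every (adjective, class) combination.
queries = list(scene_classes)
queries += [f"{adj} {cls}" for adj, cls in product(adjectives, scene_classes)]

print(len(queries), "search queries, e.g.", queries[:3])
```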
2.2.2 Step 2: Labeling images with ground-truth category

Image ground-truth label verification was done by crowdsourcing the task on Amazon Mechanical Turk (AMT). Fig. 4 illustrates the experimental paradigm used. First, AMT workers were given instructions for one particular category at a time (e.g. cliff), with a definition, sample images belonging to the category (true images), and sample images not belonging to the category (false images). As an example, Fig. 4.a shows the instructions for the category cliff. Workers then performed a verification task for the corresponding category. Fig. 4.b shows the AMT interface for the verification task. The interface displayed a central image, flanked by smaller versions of the images the worker had just responded to (on the left) and the images the worker would respond to next (on the right). Information gleaned from the construction of the SUN dataset suggests that in the first iteration of labeling more than 50% of the downloaded images are not true exemplars of the category. For this reason the default answer in the interface was set to NO (notice that all the smaller versions of the images on the left are marked with a bold red contour, which denotes that the image does not belong to the category). Thus, if the worker just presses the space bar to move on, images keep the default NO label. Whenever a true exemplar appears in the center, the worker can press a specific key to mark it as a positive exemplar (responding YES); as the response is set to YES, the bold contour of the image turns green. The interface also allows moving backwards to revise previous annotations. Each AMT HIT (Human Intelligence Task, one assignment for one worker) consisted of 750 images for manual annotation. A control set of 30 positive samples and 30 negative samples with ground-truth category labels from the SUN database was intermixed in the HIT as well. As a quality-control measure, only worker HITs with an accuracy of 90% or higher on these control images were kept.
1. The list of adjectives used in querying can be found at https://github.com/CSAILVision/places365/blob/master/adjectives_download.csv
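The 90% control-accuracy rule in Step 2 amounts to a simple filter over submitted HITs. The sketch below, which assumes a hypothetical list of HIT records holding the worker's YES/NO answers and the hidden control labels, shows the kind of check involved; it is an illustration, not the authors' actual AMT pipeline.

```python
# Sketch of the HIT quality-control rule: a HIT is kept only if the worker
# answered at least 90% of the hidden control images correctly.
# `hits` is a hypothetical list of dicts with the worker's YES/NO answers.

def control_accuracy(hit):
    """Fraction of the intermixed control images answered correctly."""
    controls = list(hit["control_labels"].items())
    correct = sum(hit["answers"][img_id] == truth for img_id, truth in controls)
    return correct / len(controls)

def filter_hits(hits, threshold=0.9):
    """Keep only HITs whose accuracy on control images meets the threshold."""
    return [hit for hit in hits if control_accuracy(hit) >= threshold]

# Example with one toy HIT containing two control images.
hits = [{"answers": {"img1": "YES", "img2": "NO"},
         "control_labels": {"img1": "YES", "img2": "NO"}}]
print(len(filter_hits(hits)), "HIT(s) kept")
```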
Fig. 3. Image samples from four scene categories grouped by queries to illustrate the diversity of the dataset. For each query we show 9 annotated images. The queries shown are: spare bedroom, teenage bedroom, romantic bedroom; darkest forest path, wintering forest path, greener forest path; wooded kitchen, messy kitchen, stylish kitchen; rocky coast, misty coast, sunny coast.
… the largest scene-centric image dataset so far. The next subsection presents a comparison of these three datasets in terms of image diversity.

… the diversity of A with respect to B can be defined as

Div_B(A) = 1 − p(d(a1, a2) < d(b1, b2))    (1)

where (a1, a2) and (b1, b2) are pairs of images randomly sampled from datasets A and B, respectively, and d(·, ·) is a distance between images.
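Eq. (1) can be estimated by Monte Carlo sampling: draw random pairs from each dataset, measure their distances in some feature space, and count how often the pair from A is closer than the pair from B. The sketch below is a minimal illustration that assumes each dataset is already represented as an array of feature vectors (e.g., Gist or CNN features); the feature choice and sample size are assumptions, not the paper's exact protocol.

```python
# Monte Carlo estimate of relative diversity Div_B(A) = 1 - p(d(a1,a2) < d(b1,b2)).
# feats_a and feats_b are (n_images, feat_dim) arrays of image features.
import numpy as np

def relative_diversity(feats_a, feats_b, num_pairs=10000, rng=None):
    rng = rng or np.random.default_rng(0)

    def random_pair_distances(feats):
        i = rng.integers(0, len(feats), size=num_pairs)
        j = rng.integers(0, len(feats), size=num_pairs)
        return np.linalg.norm(feats[i] - feats[j], axis=1)

    d_a = random_pair_distances(feats_a)   # distances of random pairs within A
    d_b = random_pair_distances(feats_b)   # distances of random pairs within B
    return 1.0 - np.mean(d_a < d_b)        # 1 - p(d(a1,a2) < d(b1,b2))

# Toy example with random features standing in for two datasets.
feats_a = np.random.randn(500, 128) * 2.0   # more spread, hence more "diverse"
feats_b = np.random.randn(500, 128)
print(relative_diversity(feats_a, feats_b))
```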
Fig. 7. Comparison of the number of images per scene category (vertical axis, log scale) for the common 88 scene categories in the Places, ImageNet, and SUN datasets.
(Figure panels a–c: classification accuracy as a function of the number of training samples per category, log-scale horizontal axis, for three test benchmarks.)

Fig. 8. Examples of pairs for the diversity experiment for a) playground and b) bedroom. Which pair shows the most similar images? The bottom pairs were chosen in these examples. c) Histogram of relative diversity for each category (88 categories) and dataset. Places88 (blue line) contains the most diverse set of images, followed by ImageNet88 (red line); the lowest diversity is in the SUN88 database (yellow line), as most of its images are prototypical of their class.

… for a fixed number of training examples. As the Places database is very large, it achieves the best performance on two of the test sets when all the training data is used.

5 CONVOLUTIONAL NEURAL NETWORKS FOR SCENE CLASSIFICATION

Given the impressive performance of deep Convolutional Neural Networks (CNNs), particularly on the ImageNet benchmark [3], [12], we choose three popular CNN architectures, AlexNet [3], GoogLeNet [26], and the 16-convolutional-layer VGG CNN [27], and train each of them on Places205 and Places365-Standard to create baseline CNN models. The trained CNNs are named after the subset they are trained on, e.g., Places205-AlexNet or Places365-VGG.
All the Places-CNNs presented here were trained using the Caffe package [28] on Nvidia Tesla K40 and Titan X GPUs³.
Additionally, given the recent breakthrough performance of the Residual Network (ResNet) on ImageNet classification [29], we further fine-tuned a ResNet-152 on Places365-Standard (termed Places365-ResNet) and compared it with the other, trained-from-scratch, Places-CNNs for scene classification.
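As a rough illustration of what fine-tuning a pretrained ResNet-152 for 365 scene classes involves, here is a minimal PyTorch-style sketch. It is an assumption-laden stand-in, not the authors' Caffe training setup: the data pipeline, the hyperparameters, and the places365_train dataset object are hypothetical.

```python
# Minimal sketch: fine-tune an ImageNet-pretrained ResNet-152 for 365 scene classes.
# Hypothetical stand-in for the paper's Caffe setup; `places365_train` is assumed
# to be a torch.utils.data.Dataset yielding (image_tensor, label) pairs.
import torch
import torch.nn as nn
import torchvision.models as models

model = models.resnet152(pretrained=True)        # ImageNet weights as initialization
model.fc = nn.Linear(model.fc.in_features, 365)  # replace the 1000-way head with 365 outputs

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

def train_one_epoch(loader, device="cuda"):
    model.to(device).train()
    for images, labels in loader:
        optimizer.zero_grad()
        loss = criterion(model(images.to(device)), labels.to(device))
        loss.backward()
        optimizer.step()

# loader = torch.utils.data.DataLoader(places365_train, batch_size=256, shuffle=True)
# for epoch in range(90):
#     train_one_epoch(loader)
```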
5.1 Results on Places205 and Places365

After training the various Places-CNNs, we used the final output layer of each network to classify the test-set images of Places205 and SUN205 (see [1]). The classification results for Top-1 accuracy and Top-5 accuracy are listed in Table 1. The Top-1 accuracy is the percentage of testing images for which the top predicted label exactly matches the ground-truth label. The Top-5 accuracy is the percentage of testing images for which the ground-truth label is among the top 5 ranked labels predicted by the algorithm. Since there is some ambiguity between scene categories, the Top-5 accuracy is a more suitable criterion for measuring scene classification performance.
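The two metrics have a direct implementation. The following sketch, which assumes a matrix of class scores (one row per test image) and a vector of ground-truth labels, computes Top-1 and Top-5 accuracy as defined above.

```python
# Top-1 / Top-5 accuracy from a score matrix.
# `scores` has shape (num_images, num_classes); `labels` holds ground-truth indices.
import numpy as np

def top_k_accuracy(scores, labels, k=5):
    # Indices of the k highest-scoring classes for each image.
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = (top_k == labels[:, None]).any(axis=1)
    return hits.mean()

scores = np.random.rand(1000, 365)            # toy scores for 1000 images, 365 classes
labels = np.random.randint(0, 365, size=1000)
print("Top-1:", top_k_accuracy(scores, labels, k=1))
print("Top-5:", top_k_accuracy(scores, labels, k=5))
```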
As a baseline comparison, we show the results of a linear SVM trained on ImageNet-CNN features, using 5,000 images per category in Places205 and 50 images per category in SUN205, respectively. We observe that the Places-CNNs perform much better than the ImageNet feature+SVM baseline while, as expected, Places205-GoogLeNet and Places205-VGG outperformed Places205-AlexNet by a large margin, due to their deeper structures. To date (Oct 2, 2016), the top-ranked result on the Places205 test-set leaderboard⁴ is 64.10% Top-1 accuracy and 90.65% Top-5 accuracy. Note that for the test set of SUN205, we did not fine-tune the Places-CNNs on the training set of SUN205; we directly evaluated them on the test set of SUN.

We further evaluated the baseline Places365-CNNs on the validation set and test set of Places365. The results are shown in Table 2. We can see that Places365-VGG and Places365-ResNet have similar top performance compared with the other two CNNs⁵. Even though Places365 has 160 more categories than Places205, the Top-5 accuracy of the Places205-CNNs (trained on the previous version of Places [1]) on the test set drops by only 2.5%.
To evaluate the improvement brought by the extra categories, we compute the accuracy on the 182 categories common to Places205 and Places365 (we merged some Places205 categories when building Places365, so there are fewer common categories) for Places205-CNN and Places365-CNN. On the validation set of Places365, we select the images of the 182 common categories, then use the aligned 182 outputs of Places205-AlexNet and Places365-AlexNet to predict the labels, respectively. The Top-1 accuracy is 0.572 for Places205-AlexNet and 0.577 for Places365-AlexNet. Thus Places365-AlexNet not only predicts more categories, but also has better accuracy on the previously existing categories.
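Restricting two networks to their shared label space, as done in this comparison, amounts to selecting the aligned output columns and taking the argmax over those columns only. A small sketch of that evaluation is given below, with hypothetical index arrays mapping the 182 shared categories into each model's output layer.

```python
# Compare two models on their shared categories by masking the score matrices.
# scores_205 is (n, 205), scores_365 is (n, 365); idx_205 and idx_365 are
# hypothetical arrays of length 182 giving each shared category's output index.
import numpy as np

def accuracy_on_shared(scores, shared_idx, labels_shared):
    # Keep only the columns of the shared categories, then take the argmax.
    shared_scores = scores[:, shared_idx]            # (n, 182)
    pred = shared_scores.argmax(axis=1)              # prediction in 0..181
    return (pred == labels_shared).mean()

# Toy data: 100 validation images with labels in the shared 0..181 space.
rng = np.random.default_rng(0)
scores_205, scores_365 = rng.random((100, 205)), rng.random((100, 365))
idx_205, idx_365 = np.arange(182), np.arange(182)    # hypothetical alignments
labels_shared = rng.integers(0, 182, size=100)

print(accuracy_on_shared(scores_205, idx_205, labels_shared))
print(accuracy_on_shared(scores_365, idx_365, labels_shared))
```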
Fig. 10 shows the responses to examples correctly predicted by Places365-VGG. Notice that most of the Top-5 responses are very relevant to the scene description. Some failure or ambiguous cases are shown in Fig. 11. Broadly, we can identify two kinds of misclassification given the current label attribution of Places: 1) less typical activities happening in a scene, such as taking a group photo in a construction site or camping in a junkyard; 2) images composed of multiple scene parts, for which one ground-truth scene label is not sufficient to describe the whole environment. These results suggest the need for multiple ground-truth labels to describe environments.

It is important to emphasize that for many scene categories the Top-1 accuracy might be an ill-defined measure: environments are inherently multi-label in terms of their semantic description. Different observers will use different terms to refer to the same place, or to different parts of the same environment, and all of those labels might fit the description of the scene well, as we observe in the examples of Fig. 11.

5.2 Web-demo for Scene Recognition

Based on our trained Places-CNN, we created a web demo for scene recognition⁶, accessible through a computer browser or mobile phone. People can upload photos to the web demo to predict the type of environment, returning the 5 most likely semantic categories and relevant scene attributes. Two screenshots of the prediction result on a mobile phone are shown in Fig. 12. Note that people can submit feedback about the result. The top-5 recognition accuracy of our recognition web demo in the wild is about 72% (from the 9,925 anonymous feedback responses collected from Oct. 19, 2014 to May 5, 2016), which is impressive given that people uploaded all kinds of real-life photos, not necessarily places-like photos. Places205-AlexNet is the back-end prediction model in the demo.

5.3 Places365 Challenge Result

The subset Places365-Challenge, which contains more than 8 million images from 365 scene categories, was used in the Places Challenge 2016, held as part of the ILSVRC Challenge at the European Conference on Computer Vision (ECCV) 2016.

The rule of the challenge is that each team can only use the data provided in Places365-Challenge to train their networks; standard CNN models trained on ImageNet-1.2million and the previous Places are allowed. Each team can submit at most 5 prediction results. Ranking is based on the top-5 classification error of each submission. Winning teams are then invited to give talks at the ILSVRC-COCO Joint Workshop at ECCV'16.

3. All the Places-CNNs are available at https://github.com/CSAILvision/places365
4. http://places.csail.mit.edu/user/leaderboard.php
5. The performance of the ResNet might result from fine-tuning or under-training, as the ResNet is not trained from scratch.
6. http://places.csail.mit.edu/demo.html
TABLE 1
Classification accuracy on the test set of Places205 and the test set of SUN205. We use the class score
averaged over 10-crops of each test image to classify the image. ∗ shows the top 2 ranked results from the
Places205 leaderboard.
TABLE 2
Classification accuracy on the validation set and test set of Places365. We use the class score averaged over
10-crops of each testing image to classify the image.
Fig. 10. The predictions given by the Places365-VGG for the images from the validation set. The ground-
truth label (GT) and the top 5 predictions are shown. The number beside each label indicates the prediction
confidence.
Fig. 14. Classification accuracy on the SUN397 dataset. We compare the deep features of Places365-VGG, Places205-AlexNet (result reported in [1]), and ImageNet-AlexNet to hand-designed features (HOG, gist, etc.). The deep features of Places365-VGG outperform the other deep features and the hand-designed features by a large margin. Results for the hand-designed features/kernels are taken from [16]. (Accuracies with the full training data: Places365-VGG 63.24, Places205-AlexNet 54.3, ImageNet-AlexNet 42.6, Combined kernel 37.5, HoG2x2 26.3, DenseSIFT 23.5, Ssim 22.7, Geo texton 22.1, Texton 21.6, Gist 16.3, LBP 14.7.)

… layer before the softmax producing the class predictions. The classifier in all of the experiments is a linear SVM with the default parameters for all of the features. Table 3 summarizes the classification accuracy on various datasets for the deep features of the Places-CNNs and the deep features of the ImageNet-CNNs. Fig. 14 plots the classification accuracy for different visual features on the SUN397 database over different numbers of training samples per category; the classifier is a linear SVM with the same default parameters for the two deep feature layers (C=1) [40]. The Places-CNN features show impressive performance on scene-related datasets, outperforming the ImageNet-CNN features. On the other hand, the ImageNet-CNN features show better performance on object-related image datasets. Importantly, our comparison shows that Places-CNNs and ImageNet-CNNs have complementary strengths on scene-centric and object-centric tasks, as expected from the type of data used to train these networks. Moreover, the deep features from Places365-VGG achieve the best performance (63.24%) on the most challenging scene classification dataset, SUN397, while the deep features of Places205-VGG perform best on the MIT Indoor67 dataset. As far as we know, these are the state-of-the-art scores achieved by a single feature + linear SVM on those two datasets. Furthermore, we merge the 1,000 classes from ImageNet and the 365 classes from Places365-Standard to train a VGG (Hybrid1365-VGG). The deep feature from Hybrid1365-VGG achieves the best score averaged over all eight image datasets.
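As a concrete illustration of the "deep feature + linear SVM" protocol, the sketch below extracts features from the penultimate layer of a pretrained network and trains a linear SVM with C=1 on them. It uses scikit-learn's LinearSVC as a stand-in for LIBLINEAR [40]; the feature extractor, the image tensors, and the labels are placeholders, not the paper's exact pipeline.

```python
# Generic "deep feature + linear SVM" evaluation protocol (sketch).
# The image tensors below are toy stand-ins for real, preprocessed datasets.
import numpy as np
import torch
import torchvision.models as models
from sklearn.svm import LinearSVC

# Hypothetical feature extractor: AlexNet up to the penultimate (fc7) layer.
alexnet = models.alexnet(pretrained=True).eval()
feature_extractor = torch.nn.Sequential(alexnet.features, alexnet.avgpool,
                                        torch.nn.Flatten(),
                                        *list(alexnet.classifier.children())[:-1])

def extract_features(images):
    with torch.no_grad():
        return feature_extractor(images).numpy()

# Toy stand-ins for real data loaders (224x224 RGB tensors, random labels).
train_images, train_labels = torch.randn(32, 3, 224, 224), np.random.randint(0, 10, 32)
test_images, test_labels = torch.randn(8, 3, 224, 224), np.random.randint(0, 10, 8)

clf = LinearSVC(C=1.0)                       # default regularization, as in the paper
clf.fit(extract_features(train_images), train_labels)
print("accuracy:", clf.score(extract_features(test_images), test_labels))
```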
5.5 Visualization of the Internal Units

Through the visualization of the unit responses at various levels of the network layers, we can gain a better understanding of what has been learned inside the CNNs and of the differences between the object-centric CNN trained on ImageNet and the scene-centric CNN trained on Places, given that they share the same AlexNet architecture. Following the methodology in [2], we feed 100,000 held-out testing images into the two networks and, for each image, record the max activation of each unit pooled over the whole spatial feature map. For each unit, we take the top three images ranked by their max activations, then segment these images by bilinearly upsampling the binarized spatial feature map mask of the unit.
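The per-unit ranking and masking procedure above can be sketched in PyTorch using a forward hook on the conv5 feature map. The model handle, the toy image batch, and the binarization threshold (here, a fraction of the unit's top activation) are assumptions for illustration; the exact thresholding and upsampling details of [2] may differ.

```python
# Sketch: rank images by each conv5 unit's max activation and build upsampled masks.
import torch
import torch.nn.functional as F
import torchvision.models as models

model = models.alexnet(pretrained=True).eval()   # stand-in for a Places-AlexNet
feature_maps = {}
# conv5 in torchvision's AlexNet is features[10]; the hook stores its output.
model.features[10].register_forward_hook(
    lambda module, inp, out: feature_maps.update(conv5=out))

def unit_max_activations(images):
    """Max activation of every conv5 unit for every image: (batch, units)."""
    with torch.no_grad():
        model(images)
    return feature_maps["conv5"].amax(dim=(2, 3))

def unit_mask(image_batch_index, unit, threshold_ratio=0.5):
    """Binarized conv5 map of one unit, bilinearly upsampled to image size."""
    fmap = feature_maps["conv5"][image_batch_index, unit]          # (H, W)
    mask = (fmap > threshold_ratio * fmap.max()).float()[None, None]
    return F.interpolate(mask, size=(224, 224), mode="bilinear")[0, 0]

images = torch.randn(16, 3, 224, 224)            # toy batch standing in for test images
acts = unit_max_activations(images)              # (16, 256): AlexNet conv5 has 256 units
top3 = acts.topk(3, dim=0).indices               # top-3 image indices per unit
print(top3[:, 0], unit_mask(int(top3[0, 0]), unit=0).shape)
```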
The image segmentation results for units from different layers are shown in Fig. 15. We can see that from conv1 to conv5 the units detect visual concepts ranging from low-level edges and textures to high-level object and scene parts. Furthermore, in the object-centric ImageNet-CNN there are more conv5 units detecting object parts, such as dog heads and people's heads, while in the scene-centric Places-CNN there are more conv5 units detecting scene parts, such as beds, chairs, or buildings.

Thus the different specializations of the units in the object-centric CNN and the scene-centric CNN lead to very different performance of their generic visual features on the various recognition benchmarks (object-centric versus scene-centric datasets) in Table 3.

We further synthesized the preferred input images of the Places-CNN using the image synthesis technique proposed in [41]. This method uses a learned deep generator network as a prior to generate images that maximize the final class activation, or an intermediate unit activation, of the Places-CNN. The synthetic images for 50 scene categories are shown in Fig. 16. These abstract image contents reveal the knowledge of the specific scene learned and memorized by the Places-CNN: examples include the buses within a road environment for the bus station, and the tents surrounded by forest-like features for the campsite. Here we used Places365-AlexNet (other Places365-CNNs generated similar results). We further used the synthesis technique to generate the images preferred by the units in the conv5 layer of Places365-AlexNet. As shown in Fig. 17, the synthesized images are very similar to the image regions segmented using the estimated receptive fields of the units.
the Places365-Standard to train a VGG (Hybrid1365-VGG).
objects and scenes. With its category coverage and high-
The deep feature from the Hybrid1365-VGG achieves the
diversity of exemplars, Places offers an ecosystem of visual
best score averaged over all the eight image datasets.
context to guide progress on scene understanding problems.
Such problems could include determining the actions hap-
5.5 Visualization of the Internal Units pening in a given environment, spotting inconsistent objects
Through the visualization of the unit responses for various or human behaviors for a particular place, and predicting
levels of network layers, we can have a better understanding future events or the cause of events given a scene.
TABLE 3
Classification accuracy/precision on scene-centric databases (the first four datasets) and object-centric
databases (the last four datasets) for the deep features of various Places-CNNs and ImageNet-CNNs. All the
accuracy/precision is the top-1 accuracy/precision.
Deep Feature SUN397 MIT Indoor67 Scene15 SUN Attribute Caltech101 Caltech256 Action40 Event8 Average
Places365-AlexNet 56.12 70.72 89.25 92.98 66.40 46.45 46.82 90.63 69.92
Places205-AlexNet 54.32 68.24 89.87 92.71 65.34 45.30 43.26 94.17 69.15
ImageNet-AlexNet 42.61 56.79 84.05 91.27 87.73 66.95 55.00 93.71 72.26
Places365-GoogLeNet 58.37 73.30 91.25 92.64 61.85 44.52 47.52 91.00 70.06
Places205-GoogLeNet 57.00 75.14 90.92 92.09 54.41 39.27 45.17 92.75 68.34
ImageNet-GoogLeNet 43.88 59.48 84.95 90.70 89.96 75.20 65.39 96.13 75.71
Places365-VGG 63.24 76.53 91.97 92.99 67.63 49.20 52.90 90.96 73.18
Places205-VGG 61.99 79.76 91.61 92.07 67.58 49.28 53.33 93.33 73.62
ImageNet-VGG 48.29 64.87 86.28 91.78 88.42 74.96 66.63 95.17 77.05
Hybrid1365-VGG 61.77 79.49 92.15 92.93 88.22 76.04 68.11 93.13 81.48
Fig. 15. a) Visualization of the units’ receptive fields at different layers for the ImageNet-CNN and Places-CNN.
Subsets of units at each layer are shown. In each row we show the top 3 most activated images. Images are
segmented based on the binarized spatial feature map of the units at different layers of ImageNet-CNN and
Places-CNN. Here we take ImageNet-AlexNet and Places205-AlexNet as the comparison examples. See the
detailed visualization methodology in [2].
Fig. 16. The synthesized images preferred by the final output of Places365-AlexNet for 20 scene categories.
Fig. 17. The synthesized images preferred by the conv5 units of Places365-AlexNet, alongside the images segmented by the receptive fields of those units. The synthetic images are very similar to the segmented image regions of the units. Each row of segmented images corresponds to one unit.
[13] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006.
[14] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int'l Journal of Computer Vision, 2001.
[15] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. CVPR, 2009.
[16] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "SUN database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010.
[17] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorkó et al., "The PASCAL visual object classes challenge 2007 (VOC2007) results," 2007.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft COCO: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., "Visual Genome: Connecting language and vision using crowdsourced dense image annotations," Int'l Journal of Computer Vision, 2016.
[20] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ADE20K dataset," Proc. CVPR, 2017.
[21] G. A. Miller, "WordNet: a lexical database for English," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[22] P. Jolicoeur, M. A. Gluck, and S. M. Kosslyn, "Pictures and names: Making the connection," Cognitive Psychology, 1984.
[23] A. Torralba and A. A. Efros, "Unbiased look at dataset bias," in Proc. CVPR, 2011.
[24] C. Heip, P. Herman, and K. Soetaert, "Indices of diversity and evenness," Oceanis, 1998.
[25] E. H. Simpson, "Measurement of diversity," Nature, 1949.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," Proc. CVPR, 2015.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations, 2014.
[28] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," http://caffe.berkeleyvision.org/, 2013.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," Proc. CVPR, 2016.
[30] Q. Zhong, C. Li, Y. Zhang, H. Sun, S. Yang, D. Xie, and S. Pu, "Towards good practices for recognition and detection," http://image-net.org/challenges/talks/2016/Hikvision_at_ImageNet_2016.pdf, 2016.
[31] L. Shen, Z. Lin, G. Sun, and J. Hu, "Places401 and Places365 models," https://github.com/lishen-shirley/Places2-CNNs, 2016.
[32] J. Shao, X. Zhang, Z. Ding, Y. Zhao, Y. Chen, J. Zhou, W. Wang, L. Mei, and C. Hu, "Good practices for deep feature fusion," http://image-net.org/challenges/talks/2016/Trimps-Soushen@ILSVRC2016.pdf, 2016.
[33] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "DeCAF: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[34] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "CNN features off-the-shelf: an astounding baseline for recognition," CVPR Workshops, 2014.
[35] G. Patterson and J. Hays, "SUN attribute database: Discovering, annotating, and recognizing scene attributes," in Proc. CVPR, 2012.
[36] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, 2007.
[37] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[38] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proc. ICCV, 2011.
[39] L.-J. Li and L. Fei-Fei, "What, where and who? Classifying events by scene and object recognition," in Proc. ICCV, 2007.
[40] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," 2008.
[41] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Advances in Neural Information Processing Systems, 2016.
[42] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2008.
[43] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The PASCAL visual object classes (VOC) challenge," Int'l Journal of Computer Vision, 2010.
[44] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The Cityscapes dataset for semantic urban scene understanding," Proc. CVPR, 2016.

Agata Lapedriza is an Associate Professor at the Universitat Oberta de Catalunya. She received her MS degree in Mathematics at the Universitat de Barcelona in 2003, and her Ph.D. degree in Computer Science at the Computer Vision Center in 2009, at the Universitat Autonoma de Barcelona. She worked as a visiting researcher in the Computer Science and Artificial Intelligence Lab at the Massachusetts Institute of Technology from 2012 until 2015. Her research interests are related to high-level image understanding, scene and object recognition, and affective computing.

Aditya Khosla received the BS degree in computer science, electrical engineering and economics from the California Institute of Technology, and the MS degree in computer science from Stanford University, in 2009 and 2011 respectively. He completed his PhD in computer science from the Massachusetts Institute of Technology in 2016 with a focus on computer vision and machine learning. In his thesis, he developed machine learning techniques that predict human behavior and the impact of visual media on people.

Aude Oliva is a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). After a French baccalaureate in Physics and Mathematics, she received two M.Sc. degrees and a Ph.D. in Cognitive Science from the Institut National Polytechnique of Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004 and CSAIL in 2012. Her research on vision and memory is cross-disciplinary, spanning human perception and cognition, computer vision, and human neuroscience. She received the 2006 National Science Foundation (NSF) Career award, the 2014 Guggenheim and the 2016 Vannevar Bush fellowships.