
This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TPAMI.2017.2723009, IEEE Transactions on Pattern Analysis and Machine Intelligence.

Places: A 10 Million Image Database for Scene Recognition

Bolei Zhou, Agata Lapedriza, Aditya Khosla, Aude Oliva, and Antonio Torralba

• B. Zhou, A. Khosla, A. Oliva, and A. Torralba are with the Computer Science and Artificial Intelligence Laboratory, Massachusetts Institute of Technology, USA.
• A. Lapedriza is with Universitat Oberta de Catalunya, Spain.

Abstract—The rise of multi-million-item dataset initiatives has enabled data-hungry machine learning algorithms to reach near-human semantic classification performance at tasks such as visual object and scene recognition. Here we describe the Places Database, a repository of 10 million scene photographs, labeled with scene semantic categories, comprising a large and diverse list of the types of environments encountered in the world. Using state-of-the-art Convolutional Neural Networks (CNNs), we provide scene classification CNNs (Places-CNNs) as baselines that significantly outperform previous approaches. Visualization of the CNNs trained on Places shows that object detectors emerge as an intermediate representation of scene classification. With its high coverage and high diversity of exemplars, the Places Database, along with the Places-CNNs, offers a novel resource to guide future progress on scene recognition problems.

Index Terms—Scene classification, visual recognition, deep learning, deep feature, image dataset.

1 INTRODUCTION

If a current state-of-the-art visual recognition system were to send you a text describing what it sees, the text might read something like: "There is a sofa facing a TV set. A person is sitting on the sofa holding a remote control. The TV is on and a talk show is playing." Reading this, you would likely imagine a living room. However, that scenery could just as well occur in a resort by the beach.

For an agent acting in the world, there is no doubt that object and event recognition should be a primary goal of its visual system. But knowing the place or context in which the objects appear is equally important for an intelligent system to understand what might have happened in the past and what may happen in the future. For instance, a table inside a kitchen can be used to eat or prepare a meal, while a table inside a classroom is intended to support a notebook or a laptop for taking notes.

A key aspect of scene recognition is to identify the place in which the objects sit (e.g., beach, forest, corridor, office, street, ...). Although one can avoid using the place category by providing a more exhaustive list of the objects in the picture and a description of their spatial relationships, a place category provides the appropriate level of abstraction to avoid such a long and complex description. Note that one could likewise avoid using object categories in a description by only listing parts (e.g., two eyes on top of a mouth for a face). Like objects, places have functions and attributes. They are composed of parts, and some of those parts can be named and correspond to objects, just as objects are composed of parts, some of which are nameable as well (e.g., legs, eyes).

Whereas most datasets have focused on object categories (providing labels, bounding boxes or segmentations), here we describe the Places database, a quasi-exhaustive repository of 10 million scene photographs, labeled with 434 scene semantic categories, comprising about 98 percent of the types of places a human can encounter in the world. Image samples are shown in Fig. 1, while Fig. 2 shows the number of images per category, sorted in decreasing order.

Departing from Zhou et al. [1], we describe in depth the construction of the Places Database and evaluate the performance of several state-of-the-art Convolutional Neural Networks (CNNs) for place recognition. We compare how the features learned in a CNN for scene classification behave when used as generic features in other visual recognition tasks. Finally, we visualize the internal representations of the CNNs and discuss one major consequence of training a deep learning model to perform scene recognition: object detectors emerge as an intermediate representation of the network [2]. Therefore, while the Places database does not contain any object labels or segmentations, it can be used to train new object classifiers.

1.1 The Rise of Multi-million Datasets

What does it take to reach human-level performance with a machine-learning algorithm? In the case of supervised learning, the problem is two-fold. First, the algorithm must be suitable for the task, such as Convolutional Neural Networks for large-scale visual recognition [1], [3] and Recursive Neural Networks for natural language processing [4], [5]. Second, it must have access to a training dataset of appropriate coverage (quasi-exhaustive representation of classes and variety of exemplars) and density (enough samples to cover the diversity of each class). The optimal space for these datasets is often task-dependent, but the rise of multi-million-item sets has enabled unprecedented performance in many domains of artificial intelligence.


Fig. 1. Image samples from various categories of the Places Database (two samples per category). The dataset
contains three macro-classes: Indoor, Nature, and Urban.

Fig. 2. Sorted distribution of image number per category in the Places Database. Places contains 10,624,928
images from 434 categories. Category names are shown for every 6 intervals.

The successes of Deep Blue in chess, Watson in "Jeopardy!", and AlphaGo in Go against their expert human opponents may thus be seen not just as advances in algorithms, but as reflecting the increasing availability of very large datasets: 700,000, 8.6 million, and 30 million items, respectively [6]–[8]. Convolutional Neural Networks [3], [9] have likewise achieved near human-level visual recognition, trained on 1.2 million object images [10]–[12] and 2.5 million scene images [1]. Expansive coverage of the space of classes and samples allows getting closer to the right ecosystem of data that a natural system, like a human, would experience. The history of image datasets for scene recognition also shows rapid growth in the number of image samples, as follows.

1.2 Scene-centric Datasets

The first benchmark for scene recognition was the Scene15 database [13], extended from the initial 8-scene dataset in [14]. This dataset contains only 15 scene categories with a few hundred images per class, and current classifiers are saturated, reaching near-human performance at 95%. The MIT Indoor67 database [15], with 67 indoor categories, and the SUN database [16] (Scene UNderstanding, with 397 categories and 130,519 images) provided larger coverage of place categories, but fell short of the quantity of data needed to feed deep learning algorithms. To complement large object-centric datasets such as ImageNet [11], we build the Places dataset described here.

Meanwhile, the Pascal VOC dataset [17] is one of the earliest image datasets with diverse object annotations in scene context. The Pascal VOC challenge has greatly advanced the development of models for object detection and segmentation tasks. Nowadays, the COCO dataset [18] focuses on collecting object instances, with both polygon and bounding-box annotations, for images depicting everyday scenes of common objects. The recent Visual Genome dataset [19] aims at collecting dense annotations of objects, attributes, and their relationships. ADE20K [20] collects precise dense annotations of scenes, objects, and parts of objects with a large and open vocabulary. Altogether, annotated datasets further enable artificial systems to learn visual knowledge linking parts, objects and scene context.

2 PLACES DATABASE

2.1 Coverage of the categorical space

The first asset of a high-quality dataset is an expansive coverage of the categorical space to be learned. The strategy of Places is to provide an exhaustive list of the categories of environments encountered in the world, bounded by spaces where a human body would fit (e.g. closet, shower). The SUN (Scene UNderstanding) dataset [16] provided that initial list of semantic categories. The SUN dataset was built around a quasi-exhaustive list of scene categories with different functionalities, namely categories with unique identities in discourse. Through the use of WordNet [21], the SUN database team selected 70,000 words and concrete terms that described scenes, places and environments that can be used to complete the phrase "I am in a place" or "let's go to the/a place". Most of the words referred to basic and entry-level names [22], resulting in a corpus of 900 different scene categories after bundling together synonyms and separating classes described by the same word but referring to different environments (e.g. inside and outside views of churches). Details about the building of that initial corpus can be found in [16]. The Places Database has inherited the same list of scene categories from the SUN dataset, with a few changes that are described in Section 2.2.4.

2.2 Construction of the database

The construction of the Places Database is composed of four steps: querying and downloading images, labeling images with a ground-truth category, scaling up the dataset using a classifier, and further improving the separation of similar classes. The details of each step are introduced in the following sections.

The data collection process of the Places Database is similar to the image collection of other common datasets, like ImageNet and COCO. The definition of categories for the ImageNet dataset [11] is based on the synsets of WordNet [21]. Candidate images are queried from several image search engines using the set of WordNet synonyms. Images are cleaned up through AMT in the format of a binary task similar to ours. Quality control is done by multiple users annotating the same image. There are about 500-1200 ground-truth images per synset. On the other hand, the COCO dataset [18] focuses on annotating the object instances inside the images with more scene context. The candidate images are mainly collected from Flickr, in order to include fewer of the iconic images commonly returned by image search engines. The image annotation process of COCO is split into category labeling, instance spotting, and instance segmentation, with all the tasks done by AMT workers. COCO has 80 object categories with more than 2 million object instances.

2.2.1 Step 1: Downloading images using scene category and attributes

From online image search engines (Google Images, Bing Images, and Flickr), candidate images were downloaded using a query word from the list of scene classes provided by the SUN database [16]. In order to increase the diversity of visual appearances in the Places dataset, each scene class query was combined with 696 common English adjectives (e.g., messy, spare, sunny, desolate, etc.); the list of adjectives used in querying can be found at https://github.com/CSAILVision/places365/blob/master/adjectives_download.csv. In Fig. 3 we show some examples of images in Places grouped by queries. About 60 million images (color images of at least 200×200 pixels in size) with unique URLs were identified. Importantly, the Places and SUN datasets are complementary: PCA-based duplicate removal was conducted within each scene category in both databases, so that they do not contain the same images.
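To make the querying step concrete, the sketch below shows one way the adjective-augmented queries could be assembled. It is only an illustration under assumptions: the class names and adjectives listed are a tiny placeholder subset, not the actual SUN class list or the 696-adjective list referenced above.

```python
from itertools import product

# A minimal sketch of the adjective-augmented query generation in Step 1.
# The scene classes and adjectives below are placeholders for the full lists.
scene_classes = ["bedroom", "forest path", "kitchen", "coast"]
adjectives = ["messy", "spare", "sunny", "desolate"]

def build_queries(classes, adjs):
    """Return plain and adjective-qualified search queries for each class."""
    queries = list(classes)  # the bare class name is also used as a query
    queries += [f"{adj} {cls}" for adj, cls in product(adjs, classes)]
    return queries

if __name__ == "__main__":
    for q in build_queries(scene_classes, adjectives)[:8]:
        print(q)  # e.g. "messy bedroom", "sunny coast", ...
```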
hand, COCO dataset [18] focuses on annotating the object
manual annotation. A control set of 30 positive samples
instances inside the images with more scene context. The
and 30 negative samples with ground-truth category labels
candidate images are mainly collected from Flickr, in order
from the SUN database were intermixed in the HIT as well.
to include less iconic images commonly returned by image
As a quality control measure, only worker HITs with an
search engines. The image annotation process of COCO is
accuracy of 90% or higher on these control images were
split into category labeling, instance spotting, and instance
kept.
segmentation, with all the tasks done by AMT workers.
COCO has 80 object categories with more than 2 million 1. The list of adjectives used in querying can be found in https://github.
object instances. com/CSAILVision/places365/blob/master/adjectives download.csv
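The HIT-level quality control described above amounts to a simple accuracy filter over the 60 embedded control images. The sketch below illustrates it; the data layout (a list of answer/ground-truth pairs) is an assumption of this sketch, not the actual AMT pipeline.

```python
# A sketch of the Step 2 quality control: a HIT is kept only if the worker
# labels at least 90% of the 60 embedded control images (30 positive,
# 30 negative) correctly. The input format here is hypothetical.
def keep_hit(control_responses, threshold=0.9):
    """control_responses: list of (worker_answer, ground_truth) booleans."""
    correct = sum(1 for answer, truth in control_responses if answer == truth)
    return correct / len(control_responses) >= threshold

# Example: a worker who gets 55 of the 60 control images right is kept (55/60 ≈ 0.92).
example = [(True, True)] * 28 + [(False, True)] * 2 + [(False, False)] * 27 + [(True, False)] * 3
print(keep_hit(example))  # True
```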

Fig. 3. Image samples from four scene categories, grouped by the adjective-qualified queries used to download them (spare/teenage/romantic bedroom, darkest/wintering/greener forest path, wooded/messy/stylish kitchen, rocky/misty/sunny coast), to illustrate the diversity of the dataset. For each query we show 9 annotated images.

Fig. 4. Annotation interface on Amazon Mechanical Turk for selecting the correct exemplars of a scene from the downloaded images. a) Instructions given to the workers, in which we define positive and negative examples. b) The binary selection interface.

The positive images resulting from the first cleaning iteration were sent for a second iteration of cleaning. We used the same task interface, but with the default answer set to YES. In this second iteration, 25.4% of the images were relabeled as NO. We tested a third cleaning iteration on a few exemplars but did not pursue it further, as the percentage of images relabeled as NO was not significant.

After the two iterations of annotation, we had collected one scene label for 7,076,580 images pertaining to 476 scene categories. As expected, the number of images per scene category varies greatly (i.e. there are many more images of bedrooms than of caves on the web). There were 413 scene categories that ended up with at least 1,000 exemplars, and 98 scene categories with more than 20,000 exemplars.

2.2.3 Step 3: Scaling up the dataset using a classifier

As a result of the previous round of image annotation, there were 53 million remaining downloaded images not assigned to any of the 476 scene categories (e.g. a bedroom picture could have been downloaded when querying images for the living-room category, but marked as negative by the AMT worker). Therefore, a third annotation task was designed to re-classify and then re-annotate those images, using a semi-automatic bootstrapping approach.

A deep learning-based scene classifier, AlexNet [3], was trained and used to classify the remaining 53 million images: we first randomly selected 1,000 images per scene category as a training set and 50 images as a validation set (for the 413 categories which had more than 1,000 samples). AlexNet achieved 32% scene classification accuracy on the validation set after training. The trained AlexNet was then used to classify the 53 million images. We used the class score predicted by the AlexNet to rank the images within one scene category as follows: for a given category with too few exemplars, the top-ranked images with a predicted class confidence higher than 0.8 were sent to AMT for a third round of manual annotation using the same interface shown in Fig. 4. The default answer was set to NO.

After completing this third round of AMT annotation, the distribution of the number of images per category flattened out: 401 scene categories had more than 5,000 images per category and 240 scene categories had more than 20,000 images. In total, about 3 million images were added to the dataset.
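The candidate-selection rule of this bootstrapping step (keep images whose predicted confidence for an under-filled category exceeds 0.8, ranked by confidence) can be sketched as follows. The array layout and function name are assumptions of this sketch; in the paper the scores come from the softmax output of the trained AlexNet.

```python
import numpy as np

# Sketch of Step 3: rank leftover images by the classifier's confidence for a
# given category and keep those above 0.8 for a further round of AMT review.
CONFIDENCE_THRESHOLD = 0.8

def candidates_for_reannotation(image_ids, probs, category_idx):
    """probs: (num_images, num_categories) softmax scores from the classifier."""
    scores = probs[:, category_idx]
    keep = np.where(scores > CONFIDENCE_THRESHOLD)[0]
    order = keep[np.argsort(-scores[keep])]  # highest confidence first
    return [image_ids[i] for i in order]

# Toy example with 4 images and 3 categories.
probs = np.array([[0.90, 0.05, 0.05],
                  [0.30, 0.60, 0.10],
                  [0.85, 0.10, 0.05],
                  [0.50, 0.25, 0.25]])
print(candidates_for_reannotation(["a", "b", "c", "d"], probs, category_idx=0))  # ['a', 'c']
```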
2.2.4 Step 4: Improving the separation of similar classes

Despite the initial effort to bundle synonyms from WordNet, the scene list from the SUN database still contained a few categories with very close synonyms (e.g. 'ski lodge' and 'ski resort', or 'garbage dump' and 'landfill'). We manually identified 46 synonym pairs like these and merged their images into a single category.

Additionally, we observed that some scene categories could be easily confused, with blurry categorical boundaries, as illustrated in Fig. 5. This means that, for images on these blurry boundaries, answering the question "Does image I belong to class A?" might be difficult. However, it can be easier to answer the question "Does image I belong to class A or B?". With this question, the decision boundary becomes clearer for a human observer, and it also gets closer to the final task that a computer system will be trained to solve, which is actually separating classes even when the boundaries are blurry.
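The synonym merging at the start of this step is essentially a label-remapping pass over the annotations. A minimal sketch is shown below; the two mappings listed are examples named in the text, and the direction of each merge (which name is kept) is an assumption of this sketch.

```python
# Sketch of the Step 4 synonym merge: relabel images from each of the 46
# manually identified synonym pairs to a single category. Only two example
# pairs from the text are shown; the full mapping is hypothetical here.
SYNONYM_MAP = {
    "ski lodge": "ski resort",
    "garbage dump": "landfill",
}

def canonical_label(label):
    return SYNONYM_MAP.get(label, label)

labels = ["ski lodge", "landfill", "garbage dump", "bedroom"]
print([canonical_label(l) for l in labels])  # ['ski resort', 'landfill', 'landfill', 'bedroom']
```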

Fig. 5. Boundaries between place categories can be blurry, as some images can be made of a mixture of different components. The images shown in this figure show a soft transition between a field and a forest. Although the extreme images can easily be classified as field and forest scenes, the middle images can be ambiguous.

Fig. 6. Annotation interface on Amazon Mechanical Turk for differentiating images from two similar categories. a) Instructions in which we give several typical examples of each category. b) The binary selection interface, in which the worker has to classify the shown image into one of the classes or neither of them.

After checking the annotations, we confirmed that in the previous three steps of AMT annotation, workers were confused by some pairs (or groups) of scene categories. For instance, there was an overlap between 'canyon' and 'mountain', or 'butte' and 'mountain'. There were also images mixed in the following category pairs: 'jacuzzi' and 'swimming pool indoor'; 'pond' and 'lake'; 'volcano' and 'mountain'; 'runway' and 'highway and road'; 'operating room' and 'hospital room'; among others. In the whole set of categories, we identified 53 such ambiguous pairs.

To further differentiate the images from categories with shared content, we designed a new interface for a fourth annotation step. The instructions for the task are shown in Fig. 6.a, while Fig. 6.b shows the annotation interface. The interface combines exemplar images from the two categories with shared content (such as 'art school' and 'art studio'), and AMT workers were asked to classify images into one of the categories or neither of them.

After this fourth annotation step, the Places database was finalized with over 10 million labeled exemplars (10,624,928 images) from 434 place categories.

3 PLACES BENCHMARKS

Here we describe four subsets of the Places database as benchmarks. Places205 and Places88 are from [1]. Two new benchmarks have been added: from the 434 categories, we selected the 365 categories with more than 4,000 images each to create Places365-Standard and Places365-Challenge. The details of each benchmark are the following:

• Places365-Standard has 1,803,460 training images, with the number of images per class varying from 3,068 to 5,000. The validation set has 50 images per class and the test set has 900 images per class. Note that the experiments in this paper are reported on Places365-Standard.
• Places365-Challenge contains the same categories as Places365-Standard, but the training set is significantly larger, with a total of 8 million training images. The validation set and test set are the same as for Places365-Standard. This subset was released for the Places Challenge 2016 (http://places2.csail.mit.edu/challenge.html), held in conjunction with the European Conference on Computer Vision (ECCV) 2016 as part of the ILSVRC Challenge.
• Places205. Places205, described in [1], has 2.5 million images from 205 scene categories. The number of images per class varies from 5,000 to 15,000. The training set has 2,448,873 images in total, with 100 images per category in the validation set and 200 images per category in the test set.
• Places88. Places88 contains the 88 scene categories common to the ImageNet [12], SUN [16] and Places205 databases. Places88 contains only the images obtained in round 2 of annotations, from the first version of Places used in [1]. We call the other two corresponding subsets ImageNet88 and SUN88 respectively. These subsets are used to compare performance across different scene-centric databases; the three datasets contain different exemplars per category (i.e. none of the three datasets contain common images). Note that finding correspondences between the classes defined in ImageNet and Places brings some challenges. ImageNet follows the WordNet definitions, but some WordNet definitions are not always appropriate for describing places. For instance, the class 'elevator' in ImageNet refers to an object. In Places, 'elevator' takes different meanings depending on the location of the observer: elevator door, elevator interior, or elevator lobby. Many categories in ImageNet do not differentiate between indoor and outdoor (e.g., ice-skating rink), while in Places, indoor and outdoor versions are separated as they do not necessarily afford the same function.

4 COMPARING SCENE-CENTRIC DATASETS

Scene-centric datasets correspond to images labeled with a scene or place name, as opposed to object-centric datasets, where images are labeled with object names. In this section we use the Places88 benchmark to compare the Places dataset with the two other largest scene datasets: ImageNet88 and SUN88. Fig. 7 illustrates the differences among the number of images found in the different categories for Places88, ImageNet88 and SUN88. Notice that the Places Database is the largest scene-centric image dataset so far.

The next subsection presents a comparison of these three datasets in terms of image diversity.

4.1 Dataset Diversity

Given the types of images found on the internet, some categories will be more biased than others in terms of viewpoints, types of objects, or even image style [23]. However, bias can be compensated for by a high diversity of images, with many appearances represented in the dataset. In this section we describe a measure of dataset diversity and use it to compare how diverse the images from the three scene-centric datasets (Places88, SUN88 and ImageNet88) are.

Comparing datasets is an open problem. Even datasets covering the same visual classes have notable differences, providing different generalization performance when used to train a classifier [23]. Beyond the number of images and categories, there are aspects that are important but difficult to quantify, like the variability in camera poses, in decoration styles, or in the types of objects that appear in the scene.

Although the quality of a database is often task-dependent, it is reasonable to assume that a good database should be dense (with a high degree of data concentration) and diverse (it should include a high variability of appearances and viewpoints). Imagine, for instance, a dataset composed of 100,000 images all taken within the same bedroom. This dataset would have a very high density but a very low diversity, as all the images would look very similar. An ideal dataset, expected to generalize well, should have high diversity as well. While one can achieve high density by collecting a large number of images, diversity is not an obvious quantity to estimate in image sets, as it assumes some notion of similarity between images. One way to estimate similarity is to ask the question: are these two images similar? However, similarity in the wild is a subjective and loose concept, as two images can be viewed as similar if they contain similar objects, and/or have similar spatial configurations, and/or have similar decoration styles, and so on. A way to circumvent this problem is to define relative measures of similarity for comparing datasets.

Several measures of diversity have been proposed, particularly in biology for characterizing the richness of an ecosystem (see [24] for a review). Here, we propose to use a measure inspired by the Simpson index of diversity [25]. The Simpson index measures the probability that two random individuals from an ecosystem belong to the same species. It is a measure of how well distributed the individuals are across the different species of an ecosystem, and it is related to the entropy of the distribution. Extending this measure to evaluate the diversity of images within a category is non-trivial if there are no annotations of sub-categories. For this reason, we propose to measure the relative diversity of image datasets A and B based on the following idea: if set A is more diverse than set B, then two random images from set B are more likely to be visually similar than two random samples from A. Then, the diversity of A with respect to B can be defined as

\mathrm{Div}_B(A) = 1 - p\big(d(a_1, a_2) < d(b_1, b_2)\big)    (1)

where a_1, a_2 \in A and b_1, b_2 \in B are randomly selected. With this definition of relative diversity, A is more diverse than B if, and only if, \mathrm{Div}_B(A) > \mathrm{Div}_A(B). For an arbitrary number of datasets A_1, \ldots, A_N, the diversity of A_1 with respect to A_2, \ldots, A_N can be defined as

\mathrm{Div}_{A_2, \ldots, A_N}(A_1) = 1 - p\big(d(a_{11}, a_{12}) < \min_{i=2:N} d(a_{i1}, a_{i2})\big)    (2)

where a_{i1}, a_{i2} \in A_i are randomly selected, for i = 2 : N.

We measured the relative diversities between SUN88, ImageNet88 and Places88 using AMT. Workers were presented with different pairs of images and had to select the pair that contained the most similar images. The pairs were randomly sampled from each database. Each trial was composed of 4 pairs from each database, giving a total of 12 pairs to choose from. We used 4 pairs per database to increase the chances of finding a similar pair and to avoid users having to skip trials. AMT workers had to select the most similar pair on each trial. We ran 40 trials per category and two observers per trial, for the 88 categories in common between ImageNet88, SUN88 and Places88. Fig. 8.a-b shows some examples of pairs from the diversity experiments for the scene categories playground (a) and bedroom (b). In the figure only one pair from each database is shown. We observed that different annotators were consistent in deciding whether one pair of images was more similar than another pair of images.

Fig. 8.c shows the histograms of relative diversity for all the 88 scene categories common to the three databases. If the three datasets were identical in terms of diversity, the average diversity would be 2/3 for all three. Note that this measure of diversity is a relative measure between the three datasets. In the experiment, users selected pairs from the SUN database as the closest to each other 50% of the time, while the pairs from the Places database were judged to be the most similar on only 17% of the trials. ImageNet88 pairs were selected 33% of the time.

The results show that there is a large variation in terms of diversity among the three datasets, showing Places to be the most diverse of the three. The average relative diversity on each dataset is 0.83 for Places, 0.67 for ImageNet88 and 0.50 for SUN. The categories with the largest variation in diversity across the three datasets were playground, veranda and waiting room.
4.2 Cross Dataset Generalization

As discussed in [23], training and testing across different datasets generally results in a drop of performance due to the dataset bias problem. In this case, the bias between datasets is due, among other factors, to the differences in diversity between the three datasets.

Fig. 7. Comparison of the number of images per scene category (log scale) for the 88 scene categories common to the Places, ImageNet, and SUN datasets.

Fig. 9. Cross-dataset generalization of training on the 88 common scenes between Places, SUN and ImageNet, then testing on the 88 common scenes from: a) SUN, b) ImageNet and c) Places. (Accuracies with the largest training set: test on SUN88 — train on Places88 69.5, SUN88 63.3, ImageNet88 62.8; test on ImageNet88 — train on ImageNet88 65.6, Places88 60.3, SUN88 49.2; test on Places88 — train on Places88 54.2, ImageNet88 44.6, SUN88 37.0.)

Fig. 9 shows the classification results obtained from training and testing on different permutations of the three datasets. For these results we use the features extracted from a pre-trained ImageNet-CNN and a linear SVM. In all three cases, training and testing on the same dataset provides the best performance for a fixed number of training examples. As the Places database is very large, it achieves the best performance on two of the test sets when all the training data is used.

Fig. 8. Examples of pairs for the diversity experiment for a) playground and b) bedroom. Which pair shows the most similar images? The bottom pairs were chosen in these examples. c) Histogram of relative diversity per category (88 categories) and dataset. Places88 (blue) contains the most diverse set of images, followed by ImageNet88 (red); the lowest diversity is in the SUN88 database (yellow), as most of its images are prototypical of their class.

5 CONVOLUTIONAL NEURAL NETWORKS FOR SCENE CLASSIFICATION

Given the impressive performance of deep Convolutional Neural Networks (CNNs), particularly on the ImageNet benchmark [3], [12], we choose three popular CNN architectures, AlexNet [3], GoogLeNet [26], and the 16-convolutional-layer VGG [27], and train them on Places205 and Places365-Standard respectively to create baseline CNN models. The trained CNNs are named PlacesSubset-CNN, e.g., Places205-AlexNet or Places365-VGG.

All the Places-CNNs presented here were trained using the Caffe package [28] on Nvidia Tesla K40 and Titan X GPUs; all the trained Places-CNNs are available at https://github.com/CSAILvision/places365.
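For readers unfamiliar with Caffe, the sketch below shows a minimal way to launch such a training run through the pycaffe interface. The solver file name is hypothetical; the actual network and solver definitions used for the Places-CNNs are the ones released in the repository above.

```python
# Minimal sketch of launching CNN training with Caffe's Python interface.
# "places365_alexnet_solver.prototxt" is a placeholder for a real solver file.
import caffe

caffe.set_device(0)      # use GPU 0 (e.g., a Tesla K40 or Titan X)
caffe.set_mode_gpu()

solver = caffe.SGDSolver("places365_alexnet_solver.prototxt")
# solver.net.copy_from("pretrained.caffemodel")  # only when fine-tuning, e.g., ResNet152
solver.solve()           # runs until max_iter as defined in the solver file
```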

Additionally, given the recent breakthrough performance of the Residual Network (ResNet) on ImageNet classification [29], we further fine-tuned ResNet152 on Places365-Standard (termed Places365-ResNet) and compared it with the other trained-from-scratch Places-CNNs for scene classification.

5.1 Results on Places205 and Places365

After training the various Places-CNNs, we used the final output layer of each network to classify the test set images of Places205 and SUN205 (see [1]). The classification results for Top-1 accuracy and Top-5 accuracy are listed in Table 1. The Top-1 accuracy is the percentage of testing images for which the top predicted label exactly matches the ground-truth label. The Top-5 accuracy is the percentage of testing images for which the ground-truth label is among the 5 top-ranked labels predicted by the algorithm. Since there is some ambiguity between scene categories, the Top-5 accuracy is a more suitable criterion for measuring scene classification performance.
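The Top-1/Top-5 metrics just defined can be computed directly from the class scores. A minimal sketch, assuming the scores are already averaged over the 10 crops of each test image:

```python
import numpy as np

# Top-k accuracy: a prediction counts as correct if the ground-truth label is
# among the k highest-scoring classes.
def top_k_accuracy(scores, labels, k=5):
    """scores: (num_images, num_classes); labels: (num_images,) ground-truth indices."""
    top_k = np.argsort(-scores, axis=1)[:, :k]
    hits = np.any(top_k == labels[:, None], axis=1)
    return hits.mean()

# Toy example with 2 images and 3 classes.
scores = np.array([[0.1, 0.7, 0.2],
                   [0.5, 0.3, 0.2]])
labels = np.array([1, 2])
print(top_k_accuracy(scores, labels, k=1))  # 0.5 (only the first image is correct)
print(top_k_accuracy(scores, labels, k=2))  # 0.5 (class 2 ranks third for the second image)
```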
As a baseline comparison, we show the results of a linear SVM trained on ImageNet-CNN features, with 5,000 images per category in Places205 and 50 images per category in SUN205 respectively. We observe that the Places-CNNs perform much better than the ImageNet feature+SVM baseline and that, as expected, Places205-GoogLeNet and Places205-VGG outperform Places205-AlexNet by a large margin, due to their deeper structures. To date (Oct 2, 2016) the top-ranked result on the Places205 test-set leaderboard (http://places.csail.mit.edu/user/leaderboard.php) is 64.10% Top-1 accuracy and 90.65% Top-5 accuracy. Note that for the test set of SUN205 we did not fine-tune the Places-CNNs on the training set of SUN205; we directly evaluated them on the test set of SUN.

We further evaluated the baseline Places365-CNNs on the validation set and test set of Places365. The results are shown in Table 2. We can see that Places365-VGG and Places365-ResNet have similar top performance compared with the other two CNNs (the performance of the ResNet might result from fine-tuning or under-training, as the ResNet is not trained from scratch). Even though Places365 has 160 more categories than Places205, the Top-5 accuracy of the Places205-CNNs (trained on the previous version of Places [1]) on the test set drops by only 2.5%.

To evaluate the improvement brought by the extra categories, we computed the accuracy on the 182 categories common to Places205 and Places365 (we merged some categories of Places205 when building Places365, so there are fewer common categories) for Places205-CNN and Places365-CNN. On the validation set of Places365, we selected the images of the 182 common categories, then used the aligned 182 outputs of Places205-AlexNet and Places365-AlexNet to predict the labels respectively. The Top-1 accuracy for Places205-AlexNet is 0.572, while that of Places365-AlexNet is 0.577. Thus Places365-AlexNet not only predicts more categories, but also has better accuracy on the previously existing categories.

Fig. 10 shows the responses to examples correctly predicted by the Places365-VGG. Notice that most of the Top-5 responses are very relevant to the scene description. Some failure or ambiguous cases are shown in Fig. 11. Broadly, we can identify two kinds of misclassification given the current label attribution of Places: 1) less typical activities happening in a scene, such as taking a group photo in a construction site or camping in a junkyard; 2) images composed of multiple scene parts, which make one ground-truth scene label insufficient to describe the whole environment. These results suggest the need for multiple ground-truth labels to describe environments.

It is important to emphasize that for many scene categories the Top-1 accuracy might be an ill-defined measure: environments are inherently multi-label in terms of their semantic description. Different observers will use different terms to refer to the same place, or to different parts of the same environment, and all the labels might fit the description of the scene well, as we observe in the examples of Fig. 11.

5.2 Web-demo for Scene Recognition

Based on our trained Places-CNN, we created a web demo for scene recognition (http://places.csail.mit.edu/demo.html), accessible through a computer browser or a mobile phone. People can upload photos to the web demo to predict the type of environment, with the 5 most likely semantic categories and relevant scene attributes. Two screenshots of the prediction result on a mobile phone are shown in Fig. 12. Note that people can submit feedback about the result. The top-5 recognition accuracy of our recognition web demo in the wild is about 72% (from the 9,925 anonymous feedback responses dated from Oct. 19, 2014 to May 5, 2016), which is impressive given that people uploaded all kinds of photos from real life and not necessarily places-like photos. Places205-AlexNet is the back-end prediction model in the demo.

5.3 Places365 Challenge Result

The subset Places365-Challenge, which contains more than 8 million images from 365 scene categories, was used in the Places Challenge 2016, held as part of the ILSVRC Challenge at the European Conference on Computer Vision (ECCV) 2016.

The rule of the challenge is that each team can only use the data provided in Places365-Challenge to train their networks. Standard CNN models trained on the 1.2-million-image ImageNet set and the previous Places are allowed. Each team can submit at most 5 prediction results. Ranks are based on the top-5 classification error of each submission. Winning teams are then invited to give talks at the ILSVRC-COCO Joint Workshop at ECCV'16.

TABLE 1
Classification accuracy on the test set of Places205 and the test set of SUN205. We use the class score averaged over 10 crops of each test image to classify the image. * marks the top 2 ranked results from the Places205 leaderboard.

                                 Test set of Places205        Test set of SUN205
                                 Top-1 acc.   Top-5 acc.      Top-1 acc.   Top-5 acc.
ImageNet-AlexNet feature+SVM     40.80%       70.20%          49.60%       80.10%
Places205-AlexNet                50.04%       81.10%          67.52%       92.61%
Places205-GoogLeNet              55.50%       85.66%          71.6%        95.01%
Places205-VGG                    58.90%       87.70%          74.6%        95.92%
SamExynos*                       64.10%       90.65%          -            -
SIAT MMLAB*                      62.34%       89.66%          -            -

TABLE 2
Classification accuracy on the validation set and test set of Places365. We use the class score averaged over 10 crops of each test image to classify the image.

                        Validation set of Places365      Test set of Places365
                        Top-1 acc.   Top-5 acc.          Top-1 acc.   Top-5 acc.
Places365-AlexNet       53.17%       82.89%              53.31%       82.75%
Places365-GoogLeNet     53.63%       83.88%              53.59%       84.01%
Places365-VGG           55.24%       84.91%              55.19%       85.01%
Places365-ResNet        54.74%       85.08%              54.65%       85.07%

Fig. 10. Predictions given by the Places365-VGG for images from the validation set. The ground-truth label (GT) and the top 5 predictions are shown; the number beside each label indicates the prediction confidence. The examples cover the GT categories cafeteria, classroom, drugstore, natural canal, creek, greenhouse indoor, chalet, crosswalk, and market outdoor.

There were in total 92 valid submissions from 27 teams. Team Hikvision [30] won 1st place with 9.01% top-5 error, team MW [31] won 2nd place with 10.19% top-5 error, and team Trimps-Soushen [32] won 3rd place with 10.30% top-5 error. The leaderboard and the team information are available at the challenge result page (http://places2.csail.mit.edu/results2016.html). The ranked results of all the submissions are plotted in Fig. 13. The entry from the winning team outperforms our best baseline by a large margin (about 6% in top-5 accuracy). Note that our baselines are trained with Places365-Standard, while the challenge entries are trained on Places365-Challenge, which has 5.5 million more training images.

5.4 Generic Visual Features from Places-CNNs and ImageNet-CNNs

We further used the activations from the trained Places-CNNs as generic features for visual recognition tasks on different image classification benchmarks. Activations from the higher-level layers of a CNN, also termed deep features, have proven to be effective generic features, with state-of-the-art performance on various image datasets [33], [34]. But most of the deep features come from CNNs trained on ImageNet, which is mostly an object-centric dataset. Here we evaluate the classification performance of the deep features from scene-centric CNNs and object-centric CNNs in a systematic way.

Fig. 11. Examples of predictions rated as incorrect on the validation set by the Places365-VGG. GT stands for the ground-truth label; the examples shown have GT labels construction site, junkyard, aquarium, and lagoon. Note that some of the top-5 responses are often not wrong per se, predicting semantic categories near the GT category. See the text for details.

Fig. 12. Two screenshots of the scene recognition demo based on the Places-CNN. The web demo predicts the type of environment, the semantic categories, and associated scene attributes for uploaded photos.

Fig. 13. The sorted top-5 errors of all 92 valid submissions. The best baseline trained on Places365-Standard is ResNet152, with a top-5 error of 14.9%, while the winning network from Hikvision achieves a 9.01% top-5 error, outperforming the baseline by a large margin.

The deep features from several Places-CNNs and ImageNet-CNNs are tested on the following scene and object benchmarks: SUN397 [16], MIT Indoor67 [15], Scene15 [13], SUN Attribute [35], Caltech101 [36], Caltech256 [37], Stanford Action40 [38], and UIUC Event8 [39].

All of the presented experiments follow the standards of the cited papers. In the SUN397 experiment [16], the training set size is 50 images per category. Experiments were run on the 5 splits of the training set and test set given with the dataset. In the MIT Indoor67 experiment [15], the training set size is 100 images per category. The experiment is run on the split of the training set and test set given with the dataset. In the Scene15 experiment [13], the training set size is 50 images per category. Experiments are run on 10 random splits of the training set and test set. In the SUN Attribute experiment [35], the training set size is 150 images per attribute. The reported result is the average precision. The splits of the training set and test set are given in the paper. In the Caltech101 and Caltech256 experiments [36], [37], the training set size is 30 images per category. The experiments are run on 10 random splits of the training set and test set. In the Stanford Action40 experiment [38], the training set size is 100 images per category. Experiments are run on 10 random splits of the training set and test set. The reported result is the classification accuracy. In the UIUC Event8 experiment [39], the training set size is 70 images per category and the test set size is 60 images per category. The experiments are run on 10 random splits of the training set and test set.

Places-CNNs and ImageNet-CNNs have the same network architectures for AlexNet, GoogLeNet, and VGG, but they are trained on scene-centric data (Places) and object-centric data (ImageNet) respectively. For AlexNet and VGG, we used the 4096-dimensional feature vector from the activation of the fully connected layer (fc7) of the CNN. For GoogLeNet, we used the 1024-dimensional feature vector from the response of the global average pooling layer before the softmax that produces the class predictions. The classifier in all of the experiments is a linear SVM with the same default parameters for all of the features.
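The evaluation protocol above (a fixed deep feature plus a linear SVM with default parameters) can be sketched as follows, with scikit-learn's LinearSVC standing in for the linear SVM. The .npy file names are hypothetical placeholders for features already extracted with a Places-CNN or ImageNet-CNN on a benchmark's standard split.

```python
import numpy as np
from sklearn.svm import LinearSVC

# Sketch of the deep-feature evaluation: a linear SVM with default C=1 trained
# on fc7 (or global-pooling) features. The feature/label files are hypothetical.
X_train = np.load("sun397_train_fc7.npy")     # (n_train, 4096) for AlexNet/VGG features
y_train = np.load("sun397_train_labels.npy")
X_test = np.load("sun397_test_fc7.npy")
y_test = np.load("sun397_test_labels.npy")

clf = LinearSVC(C=1.0)                        # default C=1, matching the setting in the text
clf.fit(X_train, y_train)
print(f"Top-1 accuracy: {clf.score(X_test, y_test):.4f}")
```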

Fig. 14. Classification accuracy on the SUN397 dataset as a function of the number of training samples per category. We compare the deep features of Places365-VGG, Places205-AlexNet (result reported in [1]), and ImageNet-AlexNet to hand-designed features (HOG, gist, etc.). The deep features of Places365-VGG outperform the other deep features and the hand-designed features by a large margin (accuracies with the largest training set: Places365-VGG 63.24, Places205-AlexNet 54.3, ImageNet-AlexNet 42.6, combined kernel 37.5, HoG2x2 26.3, DenseSIFT 23.5, Ssim 22.7, geo texton 22.1, texton 21.6, gist 16.3, LBP 14.7). Results for the hand-designed features/kernels are taken from [16].

Table 3 summarizes the classification accuracy on the various datasets for the deep features of the Places-CNNs and the deep features of the ImageNet-CNNs. Fig. 14 plots the classification accuracy for different visual features on the SUN397 database over different numbers of training samples per category. The classifier is a linear SVM with the same default parameters for the two deep feature layers (C=1) [40]. The Places-CNN features show impressive performance on scene-related datasets, outperforming the ImageNet-CNN features. On the other hand, the ImageNet-CNN features show better performance on object-related image datasets. Importantly, our comparison shows that Places-CNNs and ImageNet-CNNs have complementary strengths on scene-centric tasks and object-centric tasks, as expected from the type of data used to train these networks. Furthermore, the deep features from the Places365-VGG achieve the best performance (63.24%) on the most challenging scene classification dataset, SUN397, while the deep features of Places205-VGG perform best on the MIT Indoor67 dataset. As far as we know, these are the state-of-the-art scores achieved by a single feature + linear SVM on those two datasets. Finally, we merge the 1,000 classes from ImageNet and the 365 classes from Places365-Standard to train a VGG (Hybrid1365-VGG). The deep feature from the Hybrid1365-VGG achieves the best score averaged over all eight image datasets.

5.5 Visualization of the Internal Units

Through the visualization of the unit responses at various levels of network layers, we can gain a better understanding of what has been learned inside the CNNs and what the differences are between the object-centric CNN trained on ImageNet and the scene-centric CNN trained on Places, given that they share the same AlexNet architecture. Following the methodology in [2], we feed 100,000 held-out test images into the two networks and record, for each image, the max activation of each unit pooled over the whole spatial feature map. For each unit, we take the top three images ranked by their max activations, then segment the images by bilinearly upsampling the binarized spatial feature map mask.

The image segmentation results of the units from different layers are shown in Fig. 15. We can see that from conv1 to conv5, the units detect visual concepts ranging from low-level edges/textures to high-level object/scene parts. Furthermore, in the object-centric ImageNet-CNN there are more units detecting object parts, such as dogs' and people's heads, in the conv5 layer, while in the scene-centric Places-CNN there are more units detecting scene parts, such as beds, chairs, or buildings, in the conv5 layer. Thus the different specializations of the units in the object-centric CNN and the scene-centric CNN yield very different performance of the generic visual features on the variety of recognition benchmarks (object-centric datasets vs. scene-centric datasets) in Table 3.

We further synthesized preferred input images for the Places-CNN by using the image synthesis technique proposed in [41]. This method uses a learned deep generator network as a prior to generate images which maximize the final class activation or an intermediate unit activation of the Places-CNN. The synthetic images for 50 scene categories are shown in Fig. 16. These abstract image contents reveal the knowledge of the specific scene learned and memorized by the Places-CNN: examples include the buses within a road environment for the bus station, and the tents surrounded by forest-like features for the campsite. Here we used Places365-AlexNet (other Places365-CNNs generated similar results). We further used the synthesis technique to generate the images preferred by the units in the conv5 layer of Places365-AlexNet. As shown in Fig. 17, the synthesized images are very similar to the image regions segmented using the estimated receptive fields of the units.
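The unit visualization described above (rank images by a unit's max activation, then binarize and bilinearly upsample its spatial feature map to obtain a segmentation mask) can be sketched as below. The array shapes and the thresholding rule (a fraction of the unit's peak response) are assumptions of this sketch, not the exact settings of [2].

```python
import numpy as np
from scipy.ndimage import zoom

# Sketch of the unit visualization: find the images that most activate a unit,
# then build a mask by thresholding and upsampling the unit's feature map.
def top_images_for_unit(feature_maps, unit, k=3):
    """feature_maps: (num_images, num_units, h, w) conv activations."""
    max_act = feature_maps[:, unit].reshape(len(feature_maps), -1).max(axis=1)
    return np.argsort(-max_act)[:k]

def unit_mask(feature_map, image_size, keep_ratio=0.5):
    """Binarize one unit's (h, w) map at a fraction of its peak and upsample to image size."""
    binary = feature_map >= keep_ratio * feature_map.max()
    scale = (image_size[0] / feature_map.shape[0], image_size[1] / feature_map.shape[1])
    return zoom(binary.astype(float), scale, order=1) > 0.5   # order=1: (bi)linear interpolation

# Toy example: 10 images, 256 units, 13x13 conv5 maps, 227x227 inputs.
feats = np.random.rand(10, 256, 13, 13)
best = top_images_for_unit(feats, unit=7)
mask = unit_mask(feats[best[0], 7], image_size=(227, 227))
print(best, mask.shape)
```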
the other hand, the deep features from the Places365- of the units.
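The sketch below illustrates this unit-visualization procedure (recording max activations pooled over the spatial feature map, ranking the images, and bilinearly upsampling the binarized feature map into a segmentation mask). The model and layer handles and the binarization threshold are assumptions; see [2] for the exact protocol.

# Sketch of the unit-visualization procedure described above (assumed model
# loading and threshold).
import torch
import torch.nn.functional as F

def unit_max_activations(model, layer, images):
    """Return (num_images, num_units) max activations and the raw feature maps."""
    feats = {}
    handle = layer.register_forward_hook(lambda m, i, o: feats.update(out=o))
    with torch.no_grad():
        model(images)                          # images: (n, 3, 224, 224), preprocessed
    handle.remove()
    fmap = feats["out"]                        # (n, units, h, w)
    return fmap.amax(dim=(2, 3)), fmap

def segment_top_images(images, fmap, unit, k=3, thresh_ratio=0.5):
    """Segment the top-k images of one unit by upsampling its binarized feature map."""
    scores = fmap[:, unit].amax(dim=(1, 2))    # max activation per image
    top = scores.topk(k).indices
    masks = F.interpolate(fmap[top, unit:unit + 1],          # bilinear upsampling
                          size=images.shape[-2:], mode="bilinear",
                          align_corners=False)
    masks = (masks > thresh_ratio * masks.amax(dim=(2, 3), keepdim=True)).float()
    return images[top] * masks                 # masked (segmented) image regions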
The image segmentation results for units from different layers are shown in Fig. 15. From conv1 to conv5, the units detect visual concepts ranging from low-level edges and textures to high-level object and scene parts. Furthermore, in the object-centric ImageNet-CNN more conv5 units detect object parts, such as dog or human heads, while in the scene-centric Places-CNN more conv5 units detect scene parts, such as beds, chairs, or buildings.
These different specialties of the units in the object-centric CNN and the scene-centric CNN explain the very different performance of their generic deep features on object-centric versus scene-centric recognition benchmarks in Table 3.
We further synthesized the preferred input images of the Places-CNN using the image synthesis technique proposed in [41]. This method uses a learned deep generator network as a prior to generate images that maximize the final class activation, or an intermediate unit activation, of the Places-CNN. The synthesized images for 50 scene categories are shown in Fig. 16. These abstract image contents reveal the knowledge of each specific scene learned and memorized by the Places-CNN: examples include the buses within a road environment for the bus station, and the tents surrounded by forest-like features for the campsite. Here we used Places365-AlexNet (the other Places365-CNNs generated similar results). We also used the synthesis technique to generate the images preferred by individual units in the conv5 layer of Places365-AlexNet. As shown in Fig. 17, these synthesized images are very similar to the image regions segmented by the estimated receptive fields of those units.
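The sketch below conveys the spirit of this generator-based synthesis: a latent code of a pretrained deep generator network is optimized so that the generated image maximizes a chosen class logit of the Places-CNN. The generator interface, optimizer, and regularization weight are placeholders rather than the exact method of [41].

# Sketch of generator-based activation maximization in the spirit of [41]
# (the `generator` network and all hyperparameters are placeholders).
import torch

def synthesize_preferred_input(generator, places_cnn, class_idx,
                               latent_dim=4096, steps=200, lr=0.05):
    places_cnn.eval()
    generator.eval()
    z = torch.zeros(1, latent_dim, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):
        opt.zero_grad()
        image = generator(z)                    # deep generator network acts as an image prior
        logits = places_cnn(image)              # e.g., 365-way scene logits
        # Maximize the target class activation; a small L2 term keeps z in a plausible range.
        loss = -logits[0, class_idx] + 1e-3 * z.pow(2).sum()
        loss.backward()
        opt.step()
    return generator(z).detach()

Unit-level synthesis differs only in which activation is maximized: the objective targets a chosen conv5 unit's pooled response instead of a class logit.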
6 CONCLUSION
From the Tiny Image dataset [42], to ImageNet [11] and Places [1], the rise of multi-million-item dataset initiatives and other densely labeled datasets [18], [20], [43], [44] has enabled data-hungry machine learning algorithms to reach near-human semantic classification of visual patterns, such as objects and scenes. With its category coverage and high diversity of exemplars, Places offers an ecosystem of visual context to guide progress on scene understanding problems. Such problems could include determining the actions happening in a given environment, spotting inconsistent objects or human behaviors for a particular place, and predicting future events or the cause of events given a scene.
TABLE 3
Classification accuracy/precision on scene-centric databases (the first four datasets) and object-centric databases (the last four datasets) for the deep features of various Places-CNNs and ImageNet-CNNs. All numbers are top-1 accuracy/precision.
Deep Feature SUN397 MIT Indoor67 Scene15 SUN Attribute Caltech101 Caltech256 Action40 Event8 Average
Places365-AlexNet 56.12 70.72 89.25 92.98 66.40 46.45 46.82 90.63 69.92
Places205-AlexNet 54.32 68.24 89.87 92.71 65.34 45.30 43.26 94.17 69.15
ImageNet-AlexNet 42.61 56.79 84.05 91.27 87.73 66.95 55.00 93.71 72.26
Places365-GoogLeNet 58.37 73.30 91.25 92.64 61.85 44.52 47.52 91.00 70.06
Places205-GoogLeNet 57.00 75.14 90.92 92.09 54.41 39.27 45.17 92.75 68.34
ImageNet-GoogLeNet 43.88 59.48 84.95 90.70 89.96 75.20 65.39 96.13 75.71
Places365-VGG 63.24 76.53 91.97 92.99 67.63 49.20 52.90 90.96 73.18
Places205-VGG 61.99 79.76 91.61 92.07 67.58 49.28 53.33 93.33 73.62
ImageNet-VGG 48.29 64.87 86.28 91.78 88.42 74.96 66.63 95.17 77.05
Hybrid1365-VGG 61.77 79.49 92.15 92.93 88.22 76.04 68.11 93.13 81.48
Fig. 15. Visualization of the receptive fields of units at different layers (conv1 to conv5) of the ImageNet-CNN and the Places-CNN. A subset of units is shown for each layer. For each unit, we show its top three most activated images, segmented by the binarized spatial feature map of that unit. Here we take ImageNet-AlexNet and Places205-AlexNet as the comparison examples; see [2] for the detailed visualization methodology.
ACKNOWLEDGMENTS
The authors would like to thank Santani Teng, Zoya Bylinskii, Mathew Monfort, and Caitlin Mullin for comments on the paper. Over the years, the Places project was supported by the National Science Foundation under Grants No. 1016862 to A.O. and No. 1524817 to A.T.; the Vannevar Bush Faculty Fellowship program sponsored by the Basic Research Office of the Assistant Secretary of Defense for Research and Engineering and funded by the Office of Naval Research through grant N00014-16-1-3116 to A.O.; the MIT Big Data Initiative at CSAIL; the Toyota Research Institute / MIT CSAIL Joint Research Center; Google, Xerox, and Amazon Awards; and a hardware donation from NVIDIA Corporation, to A.O. and A.T. B.Z. is supported by a Facebook Fellowship.
Fig. 16. The synthesized images preferred by the final output of Places365-AlexNet for 20 scene categories.
Fig. 17. The synthesized images preferred by conv5 units of Places365-AlexNet, paired with the images segmented by the estimated receptive fields of those units. The synthetic images are very similar to the segmented image regions; each row of segmented images corresponds to one unit.
REFERENCES
[1] B. Zhou, A. Lapedriza, J. Xiao, A. Torralba, and A. Oliva, "Learning deep features for scene recognition using places database," in Advances in Neural Information Processing Systems, 2014.
[2] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba, "Object detectors emerge in deep scene cnns," International Conference on Learning Representations, 2015.
[3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, "Imagenet classification with deep convolutional neural networks," in Advances in Neural Information Processing Systems, 2012.
[4] S. Hochreiter and J. Schmidhuber, "Long short-term memory," Neural Computation, 1997.
[5] C. D. Manning and H. Schütze, Foundations of Statistical Natural Language Processing. MIT Press, 1999.
[6] M. Campbell, A. J. Hoane, and F.-h. Hsu, "Deep blue," Artificial Intelligence, 2002.
[7] D. Ferrucci, A. Levas, S. Bagchi, D. Gondek, and E. T. Mueller, "Watson: beyond jeopardy!" Artificial Intelligence, 2013.
[8] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Van Den Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot et al., "Mastering the game of go with deep neural networks and tree search," Nature, 2016.
[9] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner, "Gradient-based learning applied to document recognition," Proceedings of the IEEE, 1998.
[10] K. He, X. Zhang, S. Ren, and J. Sun, "Delving deep into rectifiers: Surpassing human-level performance on imagenet classification," in Proc. CVPR, 2015.
[11] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, "Imagenet: A large-scale hierarchical image database," in Proc. CVPR, 2009.
[12] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein et al., "Imagenet large scale visual recognition challenge," Int'l Journal of Computer Vision, 2015.
[13] S. Lazebnik, C. Schmid, and J. Ponce, "Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories," in Proc. CVPR, 2006.
[14] A. Oliva and A. Torralba, "Modeling the shape of the scene: A holistic representation of the spatial envelope," Int'l Journal of Computer Vision, 2001.
[15] A. Quattoni and A. Torralba, "Recognizing indoor scenes," in Proc. CVPR, 2009.
[16] J. Xiao, J. Hays, K. A. Ehinger, A. Oliva, and A. Torralba, "Sun database: Large-scale scene recognition from abbey to zoo," in Proc. CVPR, 2010.
[17] M. Everingham, A. Zisserman, C. K. Williams, L. Van Gool, M. Allan, C. M. Bishop, O. Chapelle, N. Dalal, T. Deselaers, G. Dorkó et al., "The pascal visual object classes challenge 2007 (voc2007) results," 2007.
[18] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick, "Microsoft coco: Common objects in context," in European Conference on Computer Vision. Springer, 2014, pp. 740–755.
[19] R. Krishna, Y. Zhu, O. Groth, J. Johnson, K. Hata, J. Kravitz, S. Chen, Y. Kalantidis, L.-J. Li, D. A. Shamma et al., "Visual genome: Connecting language and vision using crowdsourced dense image annotations," Int'l Journal of Computer Vision, 2016.
[20] B. Zhou, H. Zhao, X. Puig, S. Fidler, A. Barriuso, and A. Torralba, "Scene parsing through ade20k dataset," in Proc. CVPR, 2017.
[21] G. A. Miller, "Wordnet: a lexical database for english," Communications of the ACM, vol. 38, no. 11, pp. 39–41, 1995.
[22] P. Jolicoeur, M. A. Gluck, and S. M. Kosslyn, "Pictures and names: Making the connection," Cognitive Psychology, 1984.
[23] A. Torralba and A. A. Efros, "Unbiased look at dataset bias," in Proc. CVPR, 2011.
[24] C. Heip, P. Herman, and K. Soetaert, "Indices of diversity and evenness," Oceanis, 1998.
[25] E. H. Simpson, "Measurement of diversity," Nature, 1949.
[26] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and A. Rabinovich, "Going deeper with convolutions," in Proc. CVPR, 2015.
[27] K. Simonyan and A. Zisserman, "Very deep convolutional networks for large-scale image recognition," International Conference on Learning Representations, 2014.
[28] Y. Jia, "Caffe: An open source convolutional architecture for fast feature embedding," http://caffe.berkeleyvision.org/, 2013.
[29] K. He, X. Zhang, S. Ren, and J. Sun, "Deep residual learning for image recognition," in Proc. CVPR, 2016.
[30] Q. Zhong, C. Li, Y. Zhang, H. Sun, S. Yang, D. Xie, and S. Pu, "Towards good practices for recognition and detection," http://image-net.org/challenges/talks/2016/Hikvision at ImageNet 2016.pdf, 2016.
[31] L. Shen, Z. Lin, G. Sun, and J. Hu, "Places401 and places365 models," https://github.com/lishen-shirley/Places2-CNNs, 2016.
[32] J. Shao, X. Zhang, Z. Ding, Y. Zhao, Y. Chen, J. Zhou, W. Wang, L. Mei, and C. Hu, "Good practices for deep feature fusion," http://image-net.org/challenges/talks/2016/Trimps-Soushen@ILSVRC2016.pdf, 2016.
[33] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell, "Decaf: A deep convolutional activation feature for generic visual recognition," in Proc. ICML, 2014.
[34] A. S. Razavian, H. Azizpour, J. Sullivan, and S. Carlsson, "Cnn features off-the-shelf: an astounding baseline for recognition," in CVPR Workshops, 2014.
[35] G. Patterson and J. Hays, "Sun attribute database: Discovering, annotating, and recognizing scene attributes," in Proc. CVPR, 2012.
[36] L. Fei-Fei, R. Fergus, and P. Perona, "Learning generative visual models from few training examples: An incremental bayesian approach tested on 101 object categories," Computer Vision and Image Understanding, 2007.
[37] G. Griffin, A. Holub, and P. Perona, "Caltech-256 object category dataset," 2007.
[38] B. Yao, X. Jiang, A. Khosla, A. L. Lin, L. Guibas, and L. Fei-Fei, "Human action recognition by learning bases of action attributes and parts," in Proc. ICCV, 2011.
[39] L.-J. Li and L. Fei-Fei, "What, where and who? classifying events by scene and object recognition," in Proc. ICCV, 2007.
[40] R.-E. Fan, K.-W. Chang, C.-J. Hsieh, X.-R. Wang, and C.-J. Lin, "LIBLINEAR: A library for large linear classification," 2008.
[41] A. Nguyen, A. Dosovitskiy, J. Yosinski, T. Brox, and J. Clune, "Synthesizing the preferred inputs for neurons in neural networks via deep generator networks," in Advances in Neural Information Processing Systems, 2016.
[42] A. Torralba, R. Fergus, and W. T. Freeman, "80 million tiny images: A large data set for nonparametric object and scene recognition," IEEE Trans. on Pattern Analysis and Machine Intelligence, 2008.
[43] M. Everingham, L. Van Gool, C. K. Williams, J. Winn, and A. Zisserman, "The pascal visual object classes (voc) challenge," Int'l Journal of Computer Vision, 2010.
[44] M. Cordts, M. Omran, S. Ramos, T. Rehfeld, M. Enzweiler, R. Benenson, U. Franke, S. Roth, and B. Schiele, "The cityscapes dataset for semantic urban scene understanding," in Proc. CVPR, 2016.
Bolei Zhou is a Ph.D. candidate in the Computer Science and Artificial Intelligence Laboratory (CSAIL) at the Massachusetts Institute of Technology. He received an M.Phil. degree in Information Engineering from the Chinese University of Hong Kong and a B.Eng. degree in Biomedical Engineering from Shanghai Jiao Tong University in 2010. His research interests are computer vision and machine learning. He is an award recipient of the Facebook Fellowship, the Microsoft Research Asia Fellowship, and the MIT Greater China Fellowship.

Agata Lapedriza is an Associate Professor at the Universitat Oberta de Catalunya. She received her MS degree in Mathematics from the Universitat de Barcelona in 2003 and her Ph.D. degree in Computer Science in 2009 from the Computer Vision Center at the Universitat Autonoma de Barcelona. She was a visiting researcher in the Computer Science and Artificial Intelligence Laboratory at the Massachusetts Institute of Technology from 2012 until 2015. Her research interests are related to high-level image understanding, scene and object recognition, and affective computing.

Aditya Khosla received the BS degree in computer science, electrical engineering and economics from the California Institute of Technology, and the MS degree in computer science from Stanford University, in 2009 and 2011 respectively. He completed his PhD in computer science at the Massachusetts Institute of Technology in 2016 with a focus on computer vision and machine learning. In his thesis, he developed machine learning techniques that predict human behavior and the impact of visual media on people.

Aude Oliva is a Principal Research Scientist at the MIT Computer Science and Artificial Intelligence Laboratory (CSAIL). After a French baccalaureate in Physics and Mathematics, she received two M.Sc. degrees and a Ph.D. in Cognitive Science from the Institut National Polytechnique of Grenoble, France. She joined the MIT faculty in the Department of Brain and Cognitive Sciences in 2004 and CSAIL in 2012. Her research on vision and memory is cross-disciplinary, spanning human perception and cognition, computer vision, and human neuroscience. She received the 2006 National Science Foundation (NSF) Career award, the 2014 Guggenheim fellowship, and the 2016 Vannevar Bush fellowship.

Antonio Torralba received the degree in telecommunications engineering from Telecom BCN, Spain, in 1994 and the Ph.D. degree in signal, image, and speech processing from the Institut National Polytechnique de Grenoble, France, in 2000. From 2000 to 2005, he received postdoctoral training at the Department of Brain and Cognitive Sciences and the Computer Science and Artificial Intelligence Laboratory, MIT. He is now a Professor of Electrical Engineering and Computer Science at the Massachusetts Institute of Technology (MIT). Prof. Torralba is an Associate Editor of the International Journal of Computer Vision and has served as program chair for the Computer Vision and Pattern Recognition conference in 2015. He received the 2008 National Science Foundation (NSF) Career award, the best student paper award at the IEEE Conference on Computer Vision and Pattern Recognition (CVPR) in 2009, and the 2010 J. K. Aggarwal Prize from the International Association for Pattern Recognition (IAPR).