



ImageNet Large Scale Visual Recognition Challenge


Olga Russakovsky* · Jia Deng* · Hao Su · Jonathan Krause ·
Sanjeev Satheesh · Sean Ma · Zhiheng Huang · Andrej Karpathy ·
Aditya Khosla · Michael Bernstein · Alexander C. Berg · Li Fei-Fei
arXiv:1409.0575v2 [cs.CV] 1 Dec 2014


Abstract The ImageNet Large Scale Visual Recognition Challenge is a benchmark in object category classification and detection on hundreds of object categories and millions of images. The challenge has been run annually from 2010 to present, attracting participation from more than fifty institutions.
This paper describes the creation of this benchmark dataset and the advances in object recognition that have been possible as a result. We discuss the challenges of collecting large-scale ground truth annotation, highlight key breakthroughs in categorical object recognition, provide a detailed analysis of the current state of the field of large-scale image classification and object detection, and compare the state-of-the-art computer vision accuracy with human accuracy. We conclude with lessons learned in the five years of the challenge, and propose future directions and improvements.

Keywords Dataset · Large-scale · Benchmark · Object recognition · Object detection

O. Russakovsky*, Stanford University, Stanford, CA, USA (E-mail: [email protected])
J. Deng*, University of Michigan, Ann Arbor, MI, USA
(* = authors contributed equally)
H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, M. Bernstein, and L. Fei-Fei, Stanford University, Stanford, CA, USA
A. Khosla, Massachusetts Institute of Technology, Cambridge, MA, USA
A. C. Berg, UNC Chapel Hill, Chapel Hill, NC, USA

1 Introduction

Overview. The ImageNet Large Scale Visual Recognition Challenge (ILSVRC) has been running annually for five years (since 2010) and has become the standard benchmark for large-scale object recognition.¹ ILSVRC follows in the footsteps of the PASCAL VOC challenge (Everingham et al., 2012), established in 2005, which set the precedent for standardized evaluation of recognition algorithms in the form of yearly competitions. As in PASCAL VOC, ILSVRC consists of two components: (1) a publicly available dataset, and (2) an annual competition and corresponding workshop. The dataset allows for the development and comparison of categorical object recognition algorithms, and the competition and workshop provide a way to track the progress and discuss the lessons learned from the most successful and innovative entries each year.

¹ In this paper, we will be using the term object recognition broadly to encompass both image classification (a task requiring an algorithm to determine what object classes are present in the image) as well as object detection (a task requiring an algorithm to localize all objects present in the image).

The publicly released dataset contains a set of manually annotated training images. A set of test images is also released, with the manual annotations withheld.² Participants train their algorithms using the training images and then automatically annotate the test images. These predicted annotations are submitted to the evaluation server. Results of the evaluation are revealed at the end of the competition period and authors are invited to share insights at the workshop held at the International Conference on Computer Vision (ICCV) or European Conference on Computer Vision (ECCV) in alternate years.

² In 2010, the test annotations were later released publicly; since then the test annotations have been kept hidden.

ILSVRC annotations fall into one of two categories: (1) image-level annotation of a binary label for the presence or absence of an object class in the image, e.g., “there are cars in this image” but “there are no tigers,” and (2) object-level annotation of a tight bounding box and class label around an object instance in the image, e.g., “there is a screwdriver centered at position (20,25) with width of 50 pixels and height of 30 pixels”.

Large-scale challenges and innovations. In creating the dataset, several challenges had to be addressed. Scaling up from 19,737 images in PASCAL VOC 2010 to 1,461,406 in ILSVRC 2010 and from 20 object classes to 1000 object classes brings with it several challenges. It is no longer feasible for a small group of annotators to annotate the data as is done for other datasets (Fei-Fei et al., 2004; Criminisi, 2004; Everingham et al., 2012; Xiao et al., 2010). Instead we turn to designing novel crowdsourcing approaches for collecting large-scale annotations (Su et al., 2012; Deng et al., 2009, 2014).

Some of the 1000 object classes may not be as easy to annotate as the 20 categories of PASCAL VOC: e.g., bananas which appear in bunches may not be as easy to delineate as the basic-level categories of aeroplanes or cars. Having more than a million images makes it infeasible to annotate the locations of all objects (much less with object segmentations, human body parts, and other detailed annotations that subsets of PASCAL VOC contain). New evaluation criteria have to be defined to take into account the fact that obtaining perfect manual annotations in this setting may be infeasible.

Once the challenge dataset was collected, its scale allowed for unprecedented opportunities both in evaluation of object recognition algorithms and in developing new techniques. Novel algorithmic innovations emerge with the availability of large-scale training data. The broad spectrum of object categories motivated the need for algorithms that are even able to distinguish classes which are visually very similar. We highlight the most successful of these algorithms in this paper, and compare their performance with human-level accuracy.

Finally, the large variety of object classes in ILSVRC allows us to perform an analysis of statistical properties of objects and their impact on recognition algorithms. This type of analysis allows for a deeper understanding of object recognition, and for designing the next generation of general object recognition algorithms.

Goals. This paper has three key goals:

1. To discuss the challenges of creating this large-scale object recognition benchmark dataset,
2. To highlight the developments in object classification and detection that have resulted from this effort, and
3. To take a closer look at the current state of the field of categorical object recognition.

The paper may be of interest to researchers working on creating large-scale datasets, as well as to anybody interested in better understanding the history and the current state of large-scale object recognition.

The collected dataset and additional information about ILSVRC can be found at:

http://image-net.org/challenges/LSVRC/

1.1 Related work

We briefly discuss some prior work in constructing benchmark image datasets.

Image classification datasets. Caltech 101 (Fei-Fei et al., 2004) was among the first standardized datasets for multi-category image classification, with 101 object classes and commonly 15-30 training images per class. Caltech 256 (Griffin et al., 2007) increased the number of object classes to 256 and added images with greater scale and background variability. Another dataset, TinyImages (Torralba et al., 2008), contains 80 million 32x32 low resolution images collected from the internet using synsets in WordNet (Miller, 1995) as queries. However, since this data has not been manually verified, there are many errors, making it less suitable for algorithm evaluation.

The ImageNet dataset (Deng et al., 2009) is the backbone of ILSVRC. ImageNet is an image dataset organized according to the WordNet hierarchy (Miller, 1995). Each concept in WordNet, possibly described by multiple words or word phrases, is called a “synonym set” or “synset”. ImageNet populates 21,841 synsets of WordNet with an average of 650 manually verified and full resolution images. As a result, ImageNet contains

14,197,122 annotated images organized by the semantic hierarchy of WordNet (as of August 2014). ImageNet is larger in scale and diversity than the other image classification datasets. ILSVRC uses a subset of ImageNet images for training the algorithms and some of ImageNet’s image collection protocols for annotating additional images for testing the algorithms.

Image parsing datasets. Several datasets aim to provide richer image annotations beyond image-category labels. LabelMe (Russell et al., 2007) contains general photographs with multiple objects per image. It has bounding polygon annotations around objects, but for the most part is not completely labeled and the object names are not standardized: annotators are free to choose which objects to label and what to name each object. This makes it difficult to use LabelMe for training and evaluating algorithms. The SUN2012 (Xiao et al., 2010) dataset contains 16,873 manually cleaned up and fully annotated images suitable for object detection. The LotusHill dataset (Yao et al., 2007) contains very detailed annotations of objects in 636,748 images and video frames, but it is not available for free. Several datasets provide pixel-level segmentations: for example, the MSRC dataset (Criminisi, 2004) with 591 images and 23 object classes, the Stanford Background Dataset (Gould et al., 2009) with 715 images and 8 classes, and the Berkeley Segmentation dataset (Arbelaez et al., 2011) with 500 images annotated with object boundaries.

The closest to ILSVRC is the PASCAL VOC dataset (Everingham et al., 2010, 2014), which provides a standardized test bed for object detection, image classification, object segmentation, person layout, and action classification. Many of the design choices in ILSVRC have been inspired by PASCAL VOC, and the similarities and differences between the datasets are discussed at length throughout the paper. ILSVRC scales up PASCAL VOC’s goal of standardized training and evaluation of recognition algorithms by more than an order of magnitude in number of object classes and images: PASCAL VOC 2012 has 20 object classes and 21,738 images compared to ILSVRC2012 with 1000 object classes and 1,431,167 annotated images.

The recently released COCO dataset (Lin et al., 2014b) contains more than 328,000 images with 2.5 million object instances manually segmented. It has fewer object categories than ILSVRC (91 in COCO versus 200 in ILSVRC object detection) but more instances per category (27K on average compared to about 1K in ILSVRC object detection). Further, it contains object segmentation annotations which are not currently available in ILSVRC. COCO is likely to become another important large-scale benchmark.

Large-scale annotation. ILSVRC makes extensive use of Amazon Mechanical Turk to obtain accurate annotations (Sorokin and Forsyth, 2008). Works such as (Welinder et al., 2010; Sheng et al., 2008; Vittayakorn and Hays, 2011) describe quality control mechanisms for this marketplace. (Vondrick et al., 2012) provides a detailed overview of crowdsourcing video annotation. A related line of work is to obtain annotations through well-designed games, e.g. (von Ahn and Dabbish, 2005). Our novel approaches to crowdsourcing accurate image annotations are described in Sections 3.1.3, 3.2.1 and 3.3.3.

Standardized challenges. There are several datasets with standardized online evaluation similar to ILSVRC: the aforementioned PASCAL VOC (Everingham et al., 2012), Labeled Faces in the Wild (Huang et al., 2007) for unconstrained face recognition, Reconstruction meets Recognition (Urtasun et al., 2014) for 3D reconstruction, and KITTI (Geiger et al., 2013) for computer vision in autonomous driving. These datasets along with ILSVRC help benchmark progress in different areas of computer vision.

1.2 Paper layout

We begin with a brief overview of ILSVRC challenge tasks in Section 2. Dataset collection and annotation are described at length in Section 3. Section 4 discusses the evaluation criteria of algorithms in the large-scale recognition setting. Section 5 provides an overview of the methods developed by ILSVRC participants.

Section 6 contains an in-depth analysis of ILSVRC results: Section 6.1 documents the progress of large-scale recognition over the years, Section 6.2 concludes that ILSVRC results are statistically significant, Section 6.3 thoroughly analyzes the current state of the field of object recognition, and Section 6.4 compares state-of-the-art computer vision accuracy with human accuracy. We conclude and discuss lessons learned from ILSVRC in Section 7.

2 Challenge tasks

The goal of ILSVRC is to estimate the content of photographs for the purpose of retrieval and automatic annotation. Test images are presented with no initial annotation, and algorithms have to produce labelings specifying what objects are present in the images. New test images are collected and labeled especially for this competition and are not part of the previously published ImageNet dataset (Deng et al., 2009).

ILSVRC over the years has consisted of one or more of the following tasks (years in parentheses):³

1. Image classification (2010-2014): Algorithms produce a list of object categories present in the image.
2. Single-object localization (2011-2014): Algorithms produce a list of object categories present in the image, along with an axis-aligned bounding box indicating the position and scale of one instance of each object category.
3. Object detection (2013-2014): Algorithms produce a list of object categories present in the image along with an axis-aligned bounding box indicating the position and scale of every instance of each object category.

This section provides a brief overview and history of each of the three key tasks. Table 1 shows summary statistics.

Table 1 Overview of the provided annotations for each of the tasks in ILSVRC.

Task                                          Image classification   Single-object localization     Object detection
Manual labeling on training set:
  object classes annotated per image          1                      1                              1 or more
  locations of annotated classes              —                      all instances on some images   all instances on all images
Manual labeling on validation and test sets:
  object classes annotated per image          1                      1                              all target classes
  locations of annotated classes              —                      all instances on all images    all instances on all images

³ In addition, ILSVRC in 2012 also included a taster fine-grained classification task, where algorithms would classify dog photographs into one of 120 dog breeds (Khosla et al., 2011). Fine-grained classification has evolved into its own Fine-Grained classification challenge in 2013 (Berg et al., 2013), which is outside the scope of this paper.

2.1 Image classification task

Data for the image classification task consists of photographs collected from Flickr⁴ and other search engines, manually labeled with the presence of one of 1000 object categories. Each image contains one ground truth label.

⁴ www.flickr.com

For each image, algorithms produce a list of object categories present in the image. The quality of a labeling is evaluated based on the label that best matches the ground truth label for the image (see Section 4.1).

Constructing ImageNet was an effort to scale up an image classification dataset to cover most nouns in English using tens of millions of manually verified photographs (Deng et al., 2009). The image classification task of ILSVRC came as a direct extension of this effort. A subset of categories and images was chosen and fixed to provide a standardized benchmark while the rest of ImageNet continued to grow.

2.2 Single-object localization task

The single-object localization task, introduced in 2011, built off of the image classification task to evaluate the ability of algorithms to learn the appearance of the target object itself rather than its image context.

Data for the single-object localization task consists of the same photographs collected for the image classification task, hand labeled with the presence of one of 1000 object categories. Each image contains one ground truth label. Additionally, every instance of this category is annotated with an axis-aligned bounding box.

For each image, algorithms produce a list of object categories present in the image, along with a bounding box indicating the position and scale of one instance of each object category. The quality of a labeling is evaluated based on the object category label that best matches the ground truth label, with the additional requirement that the location of the predicted instance is also accurate (see Section 4.2).

2.3 Object detection task

The object detection task went a step beyond single-object localization and tackled the problem of localizing multiple object categories in the image. This task has been a part of the PASCAL VOC for many years on the scale of 20 object categories and tens of thousands of images, but scaling it up by an order of magnitude in object categories and in images proved to be very challenging from a dataset collection and annotation point of view (see Section 3.3).

Data for the detection task consists of new photographs collected from Flickr using scene-level queries. The images are annotated with axis-aligned bounding boxes indicating the position and scale of every instance of each target object category. The training set is additionally supplemented with (a) data from the single-

object localization task, which contains annotations for all instances of just one object category, and (b) negative images known not to contain any instance of some object categories.

For each image, algorithms produce bounding boxes indicating the position and scale of all instances of all target object categories. The quality of labeling is evaluated by recall, or number of target object instances detected, and precision, or the number of spurious detections produced by the algorithm (see Section 4.3).

3 Dataset construction at large scale

Our process of constructing large-scale object recognition image datasets consists of three key steps.

The first step is defining the set of target object categories. To do this, we select from among the existing ImageNet (Deng et al., 2009) categories. By using WordNet as a backbone (Miller, 1995), ImageNet already takes care of disambiguating word meanings and of combining together synonyms into the same object category. Since the selection of object categories needs to be done only once per challenge task, we use a combination of automatic heuristics and manual post-processing to create the list of target categories appropriate for each task. For example, for image classification we may include broader scene categories such as a type of beach, but for single-object localization and object detection we want to focus only on object categories which can be unambiguously localized in images (Sections 3.1.1 and 3.3.1).

The second step is collecting a diverse set of candidate images to represent the selected categories. We use both automatic and manual strategies on multiple search engines to do the image collection. The process is modified for the different ILSVRC tasks. For example, for object detection we focus our efforts on collecting scene-like images using generic queries such as “African safari” to find pictures likely to contain multiple animals in one scene (Section 3.3.2).

The third (and most challenging) step is annotating the millions of collected images to obtain a clean dataset. We carefully design crowdsourcing strategies targeted to each individual ILSVRC task. For example, the bounding box annotation system used for localization and detection tasks consists of three distinct parts in order to include automatic crowdsourced quality control (Section 3.2.1). Annotating images fully with all target object categories (on a reasonable budget) for object detection requires an additional hierarchical image labeling system (Section 3.3.3).

We describe the data collection and annotation procedure for each of the ILSVRC tasks in order: image classification (Section 3.1), single-object localization (Section 3.2), and object detection (Section 3.3), focusing on the three key steps for each dataset.

3.1 Image classification dataset construction

The image classification task tests the ability of an algorithm to name the objects present in the image, without necessarily localizing them.

We describe the choices we made in constructing the ILSVRC image classification dataset: selecting the target object categories from ImageNet (Section 3.1.1), collecting a diverse set of candidate images by using multiple search engines and an expanded set of queries in multiple languages (Section 3.1.2), and finally filtering the millions of collected images using the carefully designed crowdsourcing strategy of ImageNet (Deng et al., 2009) (Section 3.1.3).

3.1.1 Defining object categories for the image classification dataset

The 1000 categories used for the image classification task were selected from the ImageNet (Deng et al., 2009) categories. The 1000 synsets are selected such that there is no overlap between synsets: for any synsets i and j, i is not an ancestor of j in the WordNet hierarchy. These synsets are part of the larger ImageNet hierarchy and may have children in ImageNet; however, for ILSVRC we do not consider their child subcategories. The synset hierarchy of ILSVRC can be thought of as a “trimmed” version of the complete ImageNet hierarchy. Figure 1 visualizes the diversity of the ILSVRC2012 object categories.

The exact 1000 synsets used for the image classification and single-object localization tasks have changed over the years. There are 639 synsets which have been used in all five ILSVRC challenges so far. In the first year of the challenge synsets were selected randomly from the available ImageNet synsets at the time, followed by manual filtering to make sure the object categories were not too obscure. With the introduction of the object localization challenge in 2011 there were 321 synsets that changed: categories such as “New Zealand beach” which were inherently difficult to localize were removed, and some new categories from ImageNet containing object localization annotations were added. In ILSVRC2012, 90 synsets were replaced with categories corresponding to dog breeds to allow for evaluation of more fine-grained object classification, as shown in Figure 2. The synsets have remained consistent since year 2012. Appendix A provides the complete list of object categories used in ILSVRC2012-2014.

Fig. 1 The diversity of data in the ILSVRC image classification and single-object localization tasks. For each of the eight dimensions, we show example object categories along the range of that property. Object scale, number of instances and image clutter are computed using the metrics defined in Section 3.2.2. The other properties were computed by asking human subjects to annotate each of the 1000 object categories (Russakovsky et al., 2013).

Fig. 2 The ILSVRC dataset contains many more fine-grained classes compared to the standard PASCAL VOC benchmark; for example, instead of the PASCAL “dog” category there are 120 different breeds of dogs in ILSVRC2012-2014 classification and single-object localization tasks.

3.1.2 Collecting candidate images for the image classification dataset

Image collection for the ILSVRC classification task is the same as the strategy employed for constructing ImageNet (Deng et al., 2009). Training images are taken directly from ImageNet. Additional images are collected for the ILSVRC using this strategy and randomly partitioned into the validation and test sets.

We briefly summarize the process; (Deng et al., 2009) contains further details. Candidate images are collected from the Internet by querying several image search engines. For each synset, the queries are the set of WordNet synonyms. Search engines typically limit the number of retrievable images (on the order of a few hundred to a thousand). To obtain as many images as possible, we expand the query set by appending the queries with the word from parent synsets, if the same word appears in the gloss of the target synset. For example, when querying “whippet”, according to WordNet’s glossary a “small slender dog of greyhound type developed in England”, we also use “whippet dog” and “whippet greyhound.” To further enlarge and diversify the candidate pool, we translate the queries into other languages, including Chinese, Spanish, Dutch and Italian. We obtain accurate translations using WordNets in those languages.
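The query-expansion step just described can be sketched in code. The snippet below is an illustrative sketch only: it assumes NLTK's WordNet interface (a tooling assumption, since the paper does not prescribe an implementation), walks the hypernym ancestors of a synset, and appends an ancestor word to each base query whenever that word also appears in the target synset's gloss; the multilingual expansion is omitted.

# Illustrative sketch of the query expansion described above (not the authors' code).
# Assumes NLTK with the WordNet corpus downloaded: nltk.download('wordnet').
from nltk.corpus import wordnet as wn

def expand_queries(synset):
    base = {lemma.replace("_", " ") for lemma in synset.lemma_names()}
    gloss = synset.definition().lower()
    expanded = set(base)
    for ancestor in synset.closure(lambda s: s.hypernyms()):
        for ancestor_lemma in ancestor.lemma_names():
            word = ancestor_lemma.replace("_", " ").lower()
            if word in gloss:                              # ancestor word appears in the gloss
                expanded |= {f"{q} {word}" for q in base}  # e.g. "whippet" -> "whippet dog"
    return sorted(expanded)

print(expand_queries(wn.synset("whippet.n.01")))

On “whippet”, whose gloss mentions both “dog” and “greyhound”, this mirrors the paper's example by yielding queries such as “whippet dog” alongside the original synonyms.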
3.1.3 Image classification dataset annotation

Annotating images with corresponding object classes follows the strategy employed by ImageNet (Deng et al., 2009). We summarize it briefly here.

To collect a highly accurate dataset, we rely on humans to verify each candidate image collected in the previous step for a given synset. This is achieved by using Amazon Mechanical Turk (AMT), an online platform on which one can put up tasks for users for a monetary reward. With a global user base, AMT is particularly suitable for large scale labeling. In each of our labeling tasks, we present the users with a set of candidate images and the definition of the target synset (including a link to Wikipedia). We then ask the users to verify whether each image contains objects of the synset. We encourage users to select images regardless of occlusions, number of objects and clutter in the scene to ensure diversity.

While users are instructed to make accurate judgments, we need to set up a quality control system to ensure this accuracy. There are two issues to consider.

First, human users make mistakes and not all users follow the instructions. Second, users do not always agree with each other, especially for more subtle or confusing synsets, typically at the deeper levels of the tree. The solution to these issues is to have multiple users independently label the same image. An image is considered positive only if it gets a convincing majority of the votes. We observe, however, that different categories require different levels of consensus among users. For example, while five users might be necessary for obtaining a good consensus on Burmese cat images, a much smaller number is needed for cat images. We develop a simple algorithm to dynamically determine the number of agreements needed for different categories of images. For each synset, we first randomly sample an initial subset of images. At least 10 users are asked to vote on each of these images. We then obtain a confidence score table, indicating the probability of an image being a good image given the consensus among user votes. For each of the remaining candidate images in this synset, we proceed with the AMT user labeling until a pre-determined confidence score threshold is reached.
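The voting loop implied by this procedure can be sketched as follows. This is an illustrative sketch rather than the deployed system: get_vote stands in for collecting one more AMT judgment, and the confidence table (probability that the image is a true positive given the observed votes) is assumed to have been estimated from the initial 10-vote sample described above.

# Illustrative sketch of the dynamic-consensus labeling loop (not the deployed code).
# confidence[(yes, total)] is the estimated probability that the image truly belongs
# to the synset after `yes` positive votes out of `total`.
def label_with_dynamic_consensus(get_vote, confidence, threshold=0.95, max_votes=10):
    yes = total = 0
    while total < max_votes:
        yes += int(get_vote())                       # one more worker: True = "contains object"
        total += 1
        p = confidence.get((yes, total), 0.5)
        if p >= threshold:
            return True                              # convincing positive consensus: stop early
        if (1.0 - p) >= threshold:
            return False                             # convincing negative consensus: stop early
    return confidence.get((yes, total), 0.5) >= 0.5  # vote budget exhausted: best call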
Empirical evaluation. Evaluation of the accuracy of the large-scale crowdsourced image annotation system was done on the entire ImageNet (Deng et al., 2009). A total of 80 synsets were randomly sampled at every tree depth of the mammal and vehicle subtrees. An independent group of subjects verified the correctness of each of the images. An average of 99.7% precision is achieved across the synsets. We expect similar accuracy on the ILSVRC image classification dataset since the image annotation pipeline has remained the same. To verify, we manually checked 1500 ILSVRC2012-2014 image classification test set images (the test set has remained unchanged in these three years). We found 5 annotation errors, corresponding as expected to 99.7% precision.

3.1.4 Image classification dataset statistics

Using the image collection and annotation procedure described in previous sections, we collected a large-scale dataset used for the ILSVRC classification task. There are 1000 object classes and approximately 1.2 million training images, 50 thousand validation images and 100 thousand test images. Table 2 (top) documents the size of the dataset over the years of the challenge.

3.2 Single-object localization dataset construction

The single-object localization task evaluates the ability of an algorithm to localize at least one instance of an object category. It was introduced as a taster task in ILSVRC 2011, and became an official part of ILSVRC in 2012.

The key challenge was developing a scalable crowdsourcing method for object bounding box annotation. Our three-step self-verifying pipeline is described in Section 3.2.1. Having the dataset collected, we perform detailed analysis in Section 3.2.2 to ensure that the dataset is sufficiently varied to be suitable for evaluation of object localization algorithms.

Object classes and candidate images. The object classes for the single-object localization task are the same as the object classes for the image classification task described above in Section 3.1. The training images for the localization task are a subset of the training images used for the image classification task, and the validation and test images are the same between both tasks.

Recall that for the image classification task every image was annotated with one object class label, corresponding to one object that is present in an image. For the single-object localization task, every validation and test image and a subset of the training images were annotated with axis-aligned bounding boxes around every instance of this object.

3.2.1 Bounding box object annotation system

We summarize the crowdsourced bounding box annotation system described in detail in (Su et al., 2012). The goal is to build a system that is fully automated, highly accurate, and cost-effective. Given a collection of images where the object of interest has been verified to exist, for each image the system collects a tight bounding box for every instance of the object.

There are two requirements:

– Quality Each bounding box needs to be tight, i.e. the smallest among all bounding boxes that contain the object. This would greatly facilitate the learning algorithms for the object detector by giving better alignment of the object instances;
– Coverage Every object instance needs to have a bounding box. This is important for training localization algorithms because it tells the learning algorithms with certainty what is not the object.

The core challenge of building such a system is effectively controlling the data quality with minimal cost. Our key observation is that drawing a bounding box is significantly more difficult and time consuming than giving answers to multiple choice questions. Thus quality control through additional verification tasks is more cost-effective than consensus-based algorithms. This leads to the following workflow with simple basic subtasks:

1. Drawing A worker draws one bounding box around one instance of an object on the given image.
2. Quality verification A second worker checks if the bounding box is correctly drawn.
3. Coverage verification A third worker checks if all object instances have bounding boxes.

The sub-tasks are designed following two principles. First, the tasks are made as simple as possible. For example, instead of asking the worker to draw all bounding boxes on the same image, we ask the worker to draw only one. This reduces the complexity of the task. Second, each task has a fixed and predictable amount of work. For example, assuming that the input images are clean (object presence is correctly verified) and the coverage verification tasks give correct results, the amount of work of the drawing task is always that of providing exactly one bounding box.
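The three subtasks compose into a simple per-image loop. The sketch below illustrates only this control flow; drawing_task, quality_task and coverage_task are hypothetical stand-ins for the crowdsourced microtasks, not functions from the system of (Su et al., 2012).

# Control-flow sketch of the drawing / quality / coverage workflow described above.
def annotate_bounding_boxes(image, object_class, drawing_task, quality_task,
                            coverage_task, max_rounds=20):
    boxes = []
    for _ in range(max_rounds):
        box = drawing_task(image, object_class, boxes)   # Task 1: draw ONE new box
        if quality_task(image, object_class, box):       # Task 2: is the box tight?
            boxes.append(box)
        if coverage_task(image, object_class, boxes):    # Task 3: every instance covered?
            break
    return boxes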
Quality control on Tasks 2 and 3 is implemented by embedding “gold standard” images where the correct answer is known. Worker training for each of these subtasks is described in detail in (Su et al., 2012).

Empirical evaluation. The system is evaluated on 10 categories from ImageNet (Deng et al., 2009): balloon, bear, bed, bench, beach, bird, bookshelf, basketball hoop, bottle, and people. A subset of 200 images are randomly sampled from each category. On the image level, our evaluation shows that 97.9% of images are completely covered with bounding boxes. For the remaining 2.1%, some bounding boxes are missing. However, these are all difficult cases: the size is too small, the boundary is blurry, or there is strong shadow.

On the bounding box level, 99.2% of all bounding boxes are accurate (the bounding boxes are visibly tight). The remaining 0.8% are somewhat off. No bounding boxes are found to have less than 50% intersection over union overlap with ground truth.

Additional evaluation of the overall cost and an analysis of quality control can be found in (Su et al., 2012).

3.2.2 Single-object localization dataset statistics

Using the annotation procedure described above, we collect a large set of bounding box annotations for the ILSVRC single-object localization task. All 50 thousand images in the validation set and 100 thousand images in the test set are annotated with bounding boxes around all instances of the ground truth object class (one object class per image). In addition, in ILSVRC2011 25% of training images are annotated with bounding boxes the same way, yielding more than 310 thousand annotated images with more than 340 thousand annotated object instances. In ILSVRC2012 40% of training images are annotated, yielding more than 520 thousand annotated images with more than 590 thousand annotated object instances. Table 2 (bottom) documents the size of this dataset.

Table 2 Scale of the ILSVRC image classification task (top) and single-object localization task (bottom). The numbers in parentheses correspond to (minimum per class - maximum per class). The 1000 classes change from year to year but are consistent between image classification and single-object localization tasks in the same year. All images from the image classification task may be used for single-object localization.

Image classification annotations (1000 object classes):
  ILSVRC2010: 1,261,406 train images (668-3047 per class); 50,000 val images (50 per class); 150,000 test images (150 per class)
  ILSVRC2011: 1,229,413 train images (384-1300 per class); 50,000 val images (50 per class); 100,000 test images (100 per class)
  ILSVRC2012-14: 1,281,167 train images (732-1300 per class); 50,000 val images (50 per class); 100,000 test images (100 per class)

Additional annotations for single-object localization (1000 object classes):
  ILSVRC2011: 315,525 train images with bbox annotations (104-1256 per class); 344,233 train bboxes annotated (114-1502 per class); 50,000 val images with bbox annotations (50 per class); 55,388 val bboxes annotated (50-118 per class); 100,000 test images with bbox annotations
  ILSVRC2012-14: 523,966 train images with bbox annotations (91-1268 per class); 593,173 train bboxes annotated (92-1418 per class); 50,000 val images with bbox annotations (50 per class); 64,058 val bboxes annotated (50-189 per class); 100,000 test images with bbox annotations

In addition to the size of the dataset, we also analyze the level of difficulty of object localization in these images compared to the PASCAL VOC benchmark. We compute statistics on the ILSVRC2012 single-object localization validation set images compared to PASCAL VOC 2012 validation images.

Real-world scenes are likely to contain multiple instances of some objects, and nearby object instances are particularly difficult to delineate. The average object category in ILSVRC has 1.61 target object instances on average per positive image, with each instance having on average 0.47 neighbors (adjacent instances of the same object category). This is comparable to 1.69 instances per positive image and 0.52 neighbors per instance for an average object class in PASCAL.
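The instance-count and object-scale statistics quoted here are straightforward to compute from the bounding box annotations. The sketch below is illustrative only: boxes are (x1, y1, x2, y2) tuples, and two instances are counted as neighbors when their boxes overlap, which is an assumed stand-in for the adjacency criterion behind the published numbers.

# Illustrative per-image statistics from axis-aligned boxes (x1, y1, x2, y2).
def box_area(b):
    return max(0.0, b[2] - b[0]) * max(0.0, b[3] - b[1])

def boxes_overlap(a, b):
    return a[0] < b[2] and b[0] < a[2] and a[1] < b[3] and b[1] < a[3]

def image_stats(boxes, image_w, image_h):
    n = len(boxes)
    scale = sum(box_area(b) for b in boxes) / (n * image_w * image_h)   # mean fraction of image area
    neighbors = sum(sum(1 for o in boxes if o is not b and boxes_overlap(b, o))
                    for b in boxes) / n                                 # mean adjacent instances
    return n, scale, neighbors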
As described in (Hoiem et al., 2012), smaller objects tend to be significantly more difficult to local-

ize. In the average object category in PASCAL the object occupies 24.1% of the image area, and in ILSVRC 35.8%. However, PASCAL has only 20 object categories while ILSVRC has 1000. The 537 object categories of ILSVRC with the smallest objects on average occupy the same fraction of the image as PASCAL objects: 24.1%. Thus even though on average the object instances tend to be bigger in ILSVRC images, there are more than 25 times more object categories than in PASCAL VOC with the same average object scale.

Appendix B and (Russakovsky et al., 2013) have additional comparisons.

3.3 Object detection dataset construction

The ILSVRC task of object detection evaluates the ability of an algorithm to name and localize all instances of all target objects present in an image. It is much more challenging than object localization because some object instances may be small/occluded/difficult to accurately localize, and the algorithm is expected to locate them all, not just the one it finds easiest.

There are three key challenges in collecting the object detection dataset. The first challenge is selecting the set of common objects which tend to appear in cluttered photographs and are well-suited for benchmarking object detection performance. Our approach relies on statistics of the object localization dataset and the tradition of the PASCAL VOC challenge (Section 3.3.1).

The second challenge is obtaining a much more varied set of scene images than those used for the image classification and single-object localization datasets. Section 3.3.2 describes the procedure for utilizing as much data from the single-object localization dataset as possible and supplementing it with Flickr images queried using hundreds of manually designed high-level queries.

The third, and biggest, challenge is completely annotating this dataset with all the objects. This is done in two parts. Section 3.3.3 describes the first part: our hierarchical strategy for obtaining the list of all target objects which occur within every image. This is necessary since annotating in a straight-forward way by creating a task for every (image, object class) pair is no longer feasible at this scale. Appendix D describes the second part: annotating the bounding boxes around these objects, using the single-object localization bounding box annotation pipeline of Section 3.2.1 along with extra verification to ensure that every instance of the object is annotated with exactly one bounding box.

Table 3 Correspondences between the object classes in PASCAL VOC and the ILSVRC detection task.

PASCAL VOC (20 classes)    Closest ILSVRC-DET class (200 classes total)
aeroplane                  airplane
bicycle                    bicycle
bird                       bird
boat                       watercraft
bottle                     wine bottle (or water bottle)
bus                        bus
car                        car
cat                        domestic cat
chair                      chair
cow                        cattle
dining table               table
dog                        dog
horse                      horse
motorbike                  motorcycle
person                     person
potted plant               flower pot
sheep                      sheep
sofa                       sofa
train                      train
tv/monitor                 tv or monitor

3.3.1 Defining object categories for the object detection dataset

There are 200 object classes hand-selected for the detection task, each corresponding to a synset within ImageNet. These were chosen to be mostly basic-level object categories that would be easy for people to identify and label. The rationale is that the object detection system developed for this task can later be combined with a fine-grained classification model to further classify the objects if a finer subdivision is desired.⁵ As with the 1000 classification classes, the synsets are selected such that there is no overlap between synsets: for any synsets i and j, i is not an ancestor of j in the WordNet hierarchy.

⁵ Some of the training objects are actually annotated with more detailed classes: for example, one of the 200 object classes is the category “dog,” and some training instances are annotated with the specific dog breed.

The selection of the 200 object detection classes in 2013 was guided by the ILSVRC 2012 classification and localization dataset. Starting with 1000 object classes and their bounding box annotations we first eliminated all object classes which tended to be too “big” in the image (on average the object area was greater than 50% of the image area). These were classes such as T-shirt, spiderweb, or manhole cover. We then manually eliminated all classes which we did not feel were well-suited for detection, such as hay, barbershop, or poncho. This left 494 object classes which were merged into basic-level categories: for example, different species

Fig. 3 Summary of images collected for the detection task.


Images in green (bold) boxes have all instances of all 200 de-
tection object classes fully annotated. Table 4 lists the com-
plete statistics.

of birds were merged into just the “bird” class. The


classes remained the same in ILSVRC2014. Appendix C
contains the complete list of object categories used in
ILSVRC2013-2014 (in the context of the hierarchy de-
scribed in Section 3.3.3). Fig. 4 Random selection of images in ILSVRC detection val-
idation set. The images in the top 4 rows were taken from
Staying mindful of the tradition of the PASCAL
ILSVRC2012 single-object localization validation set, and the
VOC dataset we also tried to ensure that the set of images in the bottom 4 rows were collected from Flickr using
200 classes contains as many of the 20 PASCAL VOC scene-level queries.
classes as possible. Table 3 shows the correspondences.
The changes that were done were to ensure more accu-
rate and consistent crowdsourced annotations. The ob- tain other objects of interest. The second source (23%)
ject class with the weakest correspondence is “potted is images from Flickr collected specifically for detection
plant” in PASCAL VOC, corresponding to “flower pot” task. We queried Flickr using a large set of manually
in ILSVRC. “Potted plant” was one of the most chal- defined queries, such as “kitchenette” or “Australian
lenging object classes to annotate consistently among zoo” to retrieve images of scenes likely to contain sev-
the PASCAL VOC classes, and in order to obtain accu- eral objects of interest. We also added pairwise queries,
rate annotations using crowdsourcing we had to restrict or queries with two target object names such as “tiger
the definition to a more concrete object. lion,” which also often returned cluttered scenes.
Figure 4 shows a random set of both types of val-
idation images. Images were randomly split, with 33%
3.3.2 Collecting images for the object detection dataset
going into the validation set and 67% into the test set.6
Many images for the detection task were collected dif- The training set for the detection task comes from
ferently than the images in ImageNet and the classifica- three sources of images (percent of images from each
tion and single-object localization tasks. Figure 3 sum- source in parentheses). The first source (63%) is all
marizes the types of images that were collected. Ideally training images from ILSVRC2012 single-object local-
all of these images would be scene images fully anno- ization task corresponding to the 200 detection classes
tated with all target categories. However, given budget (or their children in the ImageNet hierarchy). We did
constraints our goal was to provide as much suitable de- not filter by object size, allowing teams to take advan-
tection data as possible, even if the images were drawn tage of all the positive examples available. The second
from a few different sources and distributions. source (24%) is negative images which were part of the
The validation and test detection set images come original ImageNet collection process but voted as neg-
from two sources (percent of images from each source ative: for example, some of the images were collected
in parentheses). The first source (77%) is images from from Flickr and search engines for the ImageNet synset
ILSVRC2012 single-object localization validation and “animals” but during the manual verification step did
test sets corresponding to the 200 detection classes (or 6
The validation/test split is consistent with ILSVRC2012:
their children in the ImageNet hierarchy). Images where
validation images of ILSVRC2012 remained in the validation
the target object occupied more than 50% of the image set of ILSVRC2013, and ILSVRC2012 test images remained
area were discarded, since they were unlikely to con- in ILSVRC2013 test set.

not collect enough votes to be considered as containing an “animal.” These images were manually re-verified for the detection task to ensure that they did not in fact contain the target objects. The third source (13%) is images collected from Flickr specifically for the detection task. These images were added for ILSVRC2014 following the same protocol as the second type of images in the validation and test set. This was done to bring the training and testing distributions closer together.

3.3.3 Complete image-object annotation for the object detection dataset

The key challenge in annotating images for the object detection task is that all objects in all images need to be labeled. Suppose there are N inputs (images) which need to be annotated with the presence or absence of K labels (objects). A naïve approach would query humans for each combination of input and label, requiring NK queries. However, N and K can be very large and the cost of this exhaustive approach quickly becomes prohibitive. For example, annotating 60,000 validation and test images with the presence or absence of 200 object classes for the detection task naïvely would take 80 times more effort than annotating 150,000 validation and test images with 1 object each for the classification task (60,000 × 200 = 12,000,000 queries versus 150,000 × 1 = 150,000) – and this is not even counting the additional cost of collecting bounding box annotations around each object instance. This quickly becomes infeasible.

In (Deng et al., 2014) we study strategies for scalable multilabel annotation, or for efficiently acquiring multiple labels from humans for a collection of items. We exploit three key observations for labels in real world applications (illustrated in Figure 5):

1. Correlation. Subsets of labels are often highly correlated. Objects such as a computer keyboard, mouse and monitor frequently co-occur in images. Similarly, some labels tend to all be absent at the same time. For example, all objects that require electricity are usually absent in pictures taken outdoors. This suggests that we could potentially fill in the values of multiple labels by grouping them into only one query for humans. Instead of checking if dog, cat, rabbit etc. are present in the photo, we just check about the “animal” group. If the answer is no, then this implies a no for all categories in the group.
2. Hierarchy. The above example of grouping dog, cat, rabbit etc. into animal has implicitly assumed that labels can be grouped together and humans can efficiently answer queries about the group as a whole. This brings up our second key observation: humans organize semantic concepts into hierarchies and are able to efficiently categorize at higher semantic levels (Thorpe et al., 1996), e.g. humans can determine the presence of an animal in an image as fast as every type of animal individually. This leads to substantial cost savings.
3. Sparsity. The values of labels for each image tend to be sparse, i.e. an image is unlikely to contain more than a dozen types of objects, a small fraction of the hundreds of object categories. This enables rapid elimination of many objects by quickly filling in no. With a high degree of sparsity, an efficient algorithm can have a cost which grows logarithmically with the number of objects instead of linearly.

Fig. 5 Multi-label annotation becomes much more efficient when considering real-world structure of data: correlation between labels, hierarchical organization of concepts, and sparsity of labels.

We propose algorithmic strategies that exploit the above intuitions. The key is to select a sequence of queries for humans such that we achieve the same labeling results with only a fraction of the cost of the naïve approach. The main challenges include how to measure cost and utility of queries, how to construct good queries, and how to dynamically order them. A detailed description of the generic algorithm, along with theoretical analysis and empirical evaluation, is presented in (Deng et al., 2014).

Application of the generic multi-class labeling algorithm to our setting. The generic algorithm automatically selects the most informative queries to ask based on object label statistics learned from the training set. In our case of 200 object classes, since obtaining the training set was by itself challenging we chose to design the queries by hand. We created a hierarchy of queries of the type “is there a ... in the image?” For example, one of the high-level questions was “is there an animal in the image?” We ask the crowd workers this question about every image we want to label. The children of

the “animal” question would correspond to specific examples of animals: for example, “is there a mammal in the image?” or “is there an animal with no legs?” To annotate images efficiently, these questions are asked only on images determined to contain an animal. The 200 leaf node questions correspond to the 200 target objects, e.g., “is there a cat in the image?”. A few sample iterations of the algorithm are shown in Figure 6.

Fig. 6 Our algorithm dynamically selects the next query to efficiently determine the presence or absence of every object in every image. Green denotes a positive annotation and red denotes a negative annotation. This toy example illustrates a sample progression of the algorithm for one label (cat) on a set of images.

Algorithm 1 is the formal algorithm for labeling an image with the presence or absence of each target object category. With this algorithm in mind, the hierarchy of questions was constructed following the principle that false positives only add extra cost whereas false negatives can significantly affect the quality of the labeling. Thus, it is always better to stick with more general but less ambiguous questions, such as “is there a mammal in the image?” as opposed to asking overly specific but potentially ambiguous questions, such as “is there an animal that can climb trees?” Constructing this hierarchy was a surprisingly time-consuming process, involving multiple iterations to ensure high accuracy of labeling and avoid question ambiguity. Appendix C shows the constructed hierarchy.

    Input: Image i, queries Q, directed graph G over Q
    Output: Labels L : Q → {“yes”, “no”}
    Initialize labels L(q) = ∅ ∀ q ∈ Q;
    Initialize candidates C = {q : q ∈ Root(G)};
    while C not empty do
        Obtain answer A to query q* ∈ C;
        L(q*) = A; C = C \ {q*};
        if A is “yes” then
            Chldr = {q ∈ Children(q*, G) : L(q) = ∅};
            C = C ∪ Chldr;
        else
            Des = {q ∈ Descendants(q*, G) : L(q) = ∅};
            L(q) = “no” ∀ q ∈ Des;
            C = C \ Des;
        end
    end

Algorithm 1: The algorithm for complete multi-class annotation. This is a special case of the algorithm described in (Deng et al., 2014). A hierarchy of questions G is manually constructed. All root questions are asked on every image. If the answer to query q* on image i is “no” then the answer is assumed to be “no” for all queries q such that q is a descendant of q* in the hierarchy. We continue asking the queries until all queries are answered. For images taken from the single-object localization task we used the known object label to initialize L.
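For concreteness, the listing below is a direct Python transcription of Algorithm 1 as a sketch: ask stands in for posting one query about the image to crowd workers, children encodes the manually constructed hierarchy G, and known_yes allows initializing L from a known object label as described in the caption.

# Python sketch of Algorithm 1; `ask` and `children` are placeholders for the
# crowdsourcing call and the manually built question hierarchy G.
from collections import deque

def complete_multiclass_annotation(roots, children, ask, known_yes=()):
    labels = {q: "yes" for q in known_yes}          # initialize L from known labels
    candidates = deque(roots)                       # all root questions are always asked
    while candidates:
        q = candidates.popleft()
        if q in labels:
            continue
        if ask(q):                                  # "yes": refine with child questions
            labels[q] = "yes"
            candidates.extend(c for c in children.get(q, []) if c not in labels)
        else:                                       # "no": all unanswered descendants are "no"
            labels[q] = "no"
            stack = [c for c in children.get(q, []) if c not in labels]
            while stack:
                d = stack.pop()
                labels[d] = "no"
                stack.extend(c for c in children.get(d, []) if c not in labels)
    return labels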
Bounding box annotation. Once all images are labeled with the presence or absence of all object categories we use the bounding box system described in Section 3.2.1 along with some additional modifications of Appendix D to annotate the location of every instance of every present object category.

3.3.4 Object detection dataset statistics

Using the procedure described above, we collect a large-scale dataset for the ILSVRC object detection task. There are 200 object classes and approximately 450K training images, 20K validation images and 40K test images. Table 4 documents the size of the dataset over the years of the challenge. The major change between ILSVRC2013 and ILSVRC2014 was the addition of 60,658 fully annotated training images.

Prior to ILSVRC, the object detection benchmark was the PASCAL VOC challenge (Everingham et al., 2010). ILSVRC has 10 times more object classes than PASCAL VOC (200 vs 20), 10.6 times more fully annotated training images (60,658 vs 5,717), 35.2 times more training objects (478,807 vs 13,609), 3.5 times more validation images (20,121 vs 5823) and 3.5 times more validation objects (55,501 vs 15,787). ILSVRC has 2.8 annotated objects per image on the validation set, compared to 2.7 in PASCAL VOC. The average object in ILSVRC takes up 17.0% of the image area and in PASCAL VOC takes up 20.7%. This is because ILSVRC has a wider variety of classes, including tiny objects such as sunglasses (1.3% of image area on average), ping-pong balls (1.5% of image area on average) and basketballs (2.0% of image area on average).

4 Evaluation at large scale

Once the dataset has been collected, we need to define a standardized evaluation procedure for algorithms. Some measures have already been established by datasets such as the Caltech 101 (Fei-Fei et al., 2004) for image classification and PASCAL VOC (Everingham et al., 2012) for both image classification and object detection. To adapt these procedures to the large-scale setting we had to address three key challenges. First, for the image classification and single-object localization tasks only one object category could be labeled in each image due to the scale of the dataset. This created potential ambiguity during evaluation (addressed in Section 4.1). Second, evaluating localization of object instances is inher-

ently difficult in some images which contain a cluster of objects (addressed in Section 4.2). Third, evaluating localization of object instances which occupy few pixels in the image is challenging (addressed in Section 4.3). In this section we describe the standardized evaluation criteria for each of the three ILSVRC tasks. We elaborate further on these and other more minor challenges with large-scale evaluation. Appendix E describes the submission protocol and other details of running the competition itself.

Table 4 Scale of the ILSVRC object detection task. Numbers in parentheses correspond to (minimum per class - median per class - maximum per class).

Object detection annotations (200 object classes):
  ILSVRC2013: 395,909 train images (417-561-66,911 positive, 185-4,130-10,073 negative per class); 345,854 train bboxes annotated (438-660-73,799 per class); 21,121 val images (23-58-5,791 positive per class, rest negative); 55,501 val bboxes annotated (31-111-12,824 per class); 40,152 test images
  ILSVRC2014: 456,567 train images (461-823-67,513 positive, 42,945-64,614-70,626 negative per class); 478,807 train bboxes annotated (502-1,008-74,517 per class); 21,121 val images (23-58-5,791 positive per class, rest negative); 55,501 val bboxes annotated (31-111-12,824 per class); 40,152 test images

4.1 Image classification

The scale of the ILSVRC classification task (1000 categories and more than a million images) makes it very expensive to label every instance of every object in every image. Therefore, on this dataset only one object category is labeled in each image. This creates ambiguity in evaluation. For example, an image might be labeled as a “strawberry” but contain both a strawberry and an apple. Then an algorithm would not know which one of the two objects to name. For the image classification task we allowed an algorithm to identify multiple (up to 5) objects in an image and not be penalized as long as one of the objects indeed corresponded to the ground truth label. Figure 7(top row) shows some examples.

Fig. 7 Tasks in ILSVRC. The first column shows the ground truth labeling on an example image, and the next three show three sample outputs with the corresponding evaluation score.

Concretely, each image i has a single class label C_i. An algorithm is allowed to return 5 labels c_i1, ..., c_i5, and is considered correct if c_ij = C_i for some j.

Let the error of a prediction d_ij = d(c_ij, C_i) be 1 if c_ij ≠ C_i and 0 otherwise. The error of an algorithm is the fraction of test images on which the algorithm makes a mistake:

    error = (1/N) Σ_{i=1}^{N} min_j d_ij    (1)

We used two additional measures of error. First, we evaluated top-1 error. In this case algorithms were penalized if their highest-confidence output label c_i1 did not match the ground truth class C_i. Second, we evaluated hierarchical error. The intuition is that confusing two nearby classes (such as two different breeds of dogs) is not as harmful as confusing a dog for a container ship. For the hierarchical criterion, the cost of one misclassification, d(c_ij, C_i), is defined as the height of the lowest common ancestor of c_ij and C_i in the ImageNet hierarchy. The height of a node is the length of the longest path to a leaf node (leaf nodes have height zero).
scribes the submission protocol and other details of run- However, in practice we found that all three mea-
ning the competition itself. sures of error (top-5, top-1, and hierarchical) produced
the same ordering of results. Thus, since ILSVRC2012
we have been exclusively using the top-5 metric which
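To make the flat top-5 metric concrete, the sketch below computes the error of Eq. 1 from precomputed predictions; the array layout and function name are illustrative assumptions rather than part of the official evaluation toolkit.

```python
import numpy as np

def top5_error(predictions, ground_truth):
    """Flat top-5 classification error of Eq. 1.

    predictions:  (N, 5) array; row i holds the five labels c_i1..c_i5
                  returned for image i, ordered by decreasing confidence.
    ground_truth: (N,) array; entry i is the single annotated label C_i.
    """
    predictions = np.asarray(predictions)
    ground_truth = np.asarray(ground_truth)
    # d_ij = 0 if c_ij == C_i and 1 otherwise; the per-image error is
    # min_j d_ij, i.e. 0 as soon as any of the five guesses is correct.
    hits = (predictions == ground_truth[:, None]).any(axis=1)
    return 1.0 - hits.mean()

# Toy usage: the first image is a top-5 hit, the second is a miss.
print(top5_error([[3, 7, 1, 9, 2], [4, 4, 4, 4, 4]], [9, 0]))  # 0.5
```

The top-1 error of the same submission would compare only the first column of predictions against the ground truth.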
4.2 Single-object localization

The evaluation for single-object localization is similar to object classification, again using a top-5 criterion to allow the algorithm to return unannotated object classes without penalty. However, now the algorithm is considered correct only if it both correctly identifies the target class C_i and accurately localizes one of its instances. Figure 7 (middle row) shows some examples.
Concretely, an image is associated with object class C_i, with all instances of this object class annotated with bounding boxes B_{ik}. An algorithm returns {(c_{ij}, b_{ij})}_{j=1}^{5} of class labels c_{ij} and associated locations b_{ij}. The error of a prediction j is

d_{ij} = \max\left( d(c_{ij}, C_i),\ \min_k d(b_{ij}, B_{ik}) \right) \qquad (2)

Here d(b_{ij}, B_{ik}) is the error of localization, defined as 0 if the area of intersection of boxes b_{ij} and B_{ik} divided by the area of their union is greater than 0.5, and 1 otherwise (Everingham et al., 2010). The error of an algorithm is computed as in Eq. 1.
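The localization criterion of Eq. 2 combines the class check with an intersection-over-union test against every annotated instance. The sketch below is a minimal illustration, assuming boxes are given as (x1, y1, x2, y2) corner coordinates; it is not the official evaluation code.

```python
def iou(box_a, box_b):
    """Intersection over union of two boxes given as (x1, y1, x2, y2)."""
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / float(area_a + area_b - inter)

def localization_error(pred_label, pred_box, true_label, true_boxes):
    """Error d_ij of a single prediction, in the spirit of Eq. 2: it is 0
    only if the class matches and the box overlaps some annotated
    instance with IOU above 0.5; true_boxes is assumed non-empty."""
    class_err = 0 if pred_label == true_label else 1
    box_err = min(0 if iou(pred_box, b) > 0.5 else 1 for b in true_boxes)
    return max(class_err, box_err)
```

The per-image error is then the minimum of this quantity over the five returned predictions, and the overall error is averaged over images as in Eq. 1.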
Evaluating localization is inherently difficult in some images. Consider a picture of a bunch of bananas or a carton of apples. It is easy to classify these images as containing bananas or apples, and even possible to localize a few instances of each fruit. However, in order for evaluation to be accurate every instance of banana or apple needs to be annotated, and that may be impossible. To handle the images where localizing individual object instances is inherently ambiguous we manually discarded 3.5% of images since ILSVRC2012. Some examples of discarded images are shown in Figure 8.

Fig. 8  Images marked as “difficult” in the ILSVRC2012 single-object localization validation set. Please refer to Section 4.2 for details.

4.3 Object detection

The criterion for object detection was adopted from PASCAL VOC (Everingham et al., 2010). It is designed to penalize the algorithm for missing object instances, for duplicate detections of one instance, and for false positive detections. Figure 7 (bottom row) shows examples.

For each object class and each image I_i, an algorithm returns predicted detections (b_{ij}, s_{ij}) of predicted locations b_{ij} with confidence scores s_{ij}. These detections are greedily matched to the ground truth boxes {B_{ik}} using Algorithm 2. For every detection j on image i the algorithm returns z_{ij} = 1 if the detection is matched to a ground truth box according to the threshold criteria, and 0 otherwise. For a given object class, let N be the total number of ground truth instances across all images. Given a threshold t, define recall as the fraction of the N objects detected by the algorithm, and precision as the fraction of correct detections out of the total detections returned by the algorithm. Concretely,

\text{Recall}(t) = \frac{\sum_{ij} \mathbf{1}[s_{ij} \geq t] \, z_{ij}}{N} \qquad (3)

\text{Precision}(t) = \frac{\sum_{ij} \mathbf{1}[s_{ij} \geq t] \, z_{ij}}{\sum_{ij} \mathbf{1}[s_{ij} \geq t]} \qquad (4)

Algorithm 2: Greedy matching of object detection outputs to ground truth labels. In (Everingham et al., 2010) this algorithm uses thr(B_k) = 0.5; ILSVRC computes thr(B_k) using Eq. 5.
    Input: Bounding box predictions with confidence scores {(b_j, s_j)}_{j=1}^{M} and ground truth boxes B on image I
    Output: Binary results {z_j}_{j=1}^{M} of whether or not prediction j is a true positive detection
    Let U = B be the set of unmatched objects;
    Order {(b_j, s_j)}_{j=1}^{M} in descending order of s_j;
    for j = 1 ... M do
        Let C = {B_k ∈ U : IOU(B_k, b_j) ≥ thr(B_k)};
        if C ≠ ∅ then
            Let k* = arg max_{k : B_k ∈ C} IOU(B_k, b_j);
            Set U = U \ B_{k*};
            Set z_j = 1 since true positive detection;
        else
            Set z_j = 0 since false positive detection;
        end
    end

The final metric for evaluating an algorithm on a given object class is average precision over the different levels of recall achieved by varying the threshold t. The winner of each object class is then the team with the highest average precision, and the winner of the challenge is the team that wins on the most object classes.7

7 In this paper we focus on the mean average precision across all categories as the measure of a team’s performance. This is done for simplicity and is justified since the ordering of teams by mean average precision was always the same as the ordering by object categories won.
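The sketch below mirrors the greedy matching of Algorithm 2 and the precision-recall definitions of Eqs. 3 and 4 for a single object class. It reuses the iou helper from the localization sketch above, defaults to the fixed PASCAL VOC threshold rather than Eq. 5, and computes a simple un-interpolated average precision, so it should be read as an illustration of the procedure rather than the exact challenge implementation.

```python
def match_detections(detections, gt_boxes, thr=lambda box: 0.5):
    """Greedy matching in the spirit of Algorithm 2 for one image and one
    object class.  detections is a list of (box, score); gt_boxes a list
    of ground truth boxes.  Returns (score, is_true_positive) pairs.
    Uses iou() from the sketch above; thr(box) defaults to 0.5 but can be
    the size-adaptive threshold of Eq. 5."""
    unmatched = list(gt_boxes)
    results = []
    for box, score in sorted(detections, key=lambda d: -d[1]):
        candidates = [g for g in unmatched if iou(g, box) >= thr(g)]
        if candidates:
            best = max(candidates, key=lambda g: iou(g, box))
            unmatched.remove(best)           # each instance matched at most once
            results.append((score, 1))       # true positive
        else:
            results.append((score, 0))       # duplicate or false positive
    return results

def average_precision(results, num_gt):
    """Un-interpolated average precision for one class.  results pools the
    (score, is_true_positive) pairs over all images; num_gt is N, the
    total number of ground truth instances (Eqs. 3-4)."""
    results = sorted(results, key=lambda r: -r[0])
    tp, ap, prev_recall = 0, 0.0, 0.0
    for rank, (_, hit) in enumerate(results, start=1):
        tp += hit
        recall, precision = tp / float(num_gt), tp / float(rank)
        ap += precision * (recall - prev_recall)  # area under the P-R curve
        prev_recall = recall
    return ap
```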
Difference with PASCAL VOC. Evaluating localization of object instances which occupy very few pixels in the image is challenging. The PASCAL VOC approach was to label such instances as “difficult” and ignore them during evaluation. However, since ILSVRC contains a more diverse set of object classes, including, for example, “nail” and “ping pong ball” which have many very small instances, it is important to include even very small object instances in evaluation.

In Algorithm 2, a predicted bounding box b is considered to have properly localized a ground truth bounding box B if IOU(b, B) ≥ thr(B). The PASCAL VOC metric uses the threshold thr(B) = 0.5. However, for small objects even deviations of a few pixels would be unacceptable according to this threshold. For example, consider an object B of size 10 × 10 pixels, with a detection window of 20 × 20 pixels which fully contains that object. This would be an error of approximately 5 pixels on each dimension, which is average human annotation error. However, the IOU in this case would be 100/400 = 0.25, far below the threshold of 0.5. Thus for smaller objects we loosen the threshold in ILSVRC to allow for the annotation to extend up to 5 pixels on average in each direction around the object. Concretely, if the ground truth box B is of dimensions w × h then

\text{thr}(B) = \min\left( 0.5,\ \frac{wh}{(w+10)(h+10)} \right) \qquad (5)

In practice, this changes the threshold only on objects which are smaller than approximately 25 × 25 pixels, and affects 5.5% of objects in the detection validation set.
and Perronnin, 2011) and one-vs-all linear SVMs. The
Practical consideration. One additional practical con- single-object localization competition was held for the
sideration for ILSVRC detection evaluation is subtle first time that year, with two brave entries. The win-
and comes directly as a results of the scale of ILSVRC. ner was the UvA team using a selective search ap-
In PASCAL, algorithms would often return many de- proach to generate class-independent object hypothesis
tections per class on the test set, including ones with regions (van de Sande et al., 2011b), followed by dense
low confidence scores. This allowed the algorithms to sampling and vector quantization of several color SIFT
reach the level of high recall at least in the realm of features (van de Sande et al., 2010), pooling with spatial
very low precision. On ILSVRC detection test set if pyramid matching (Lazebnik et al., 2006), and classi-
an algorithm returns 10 bounding boxes per object per fying with a histogram intersection kernel SVM (Maji
image this would result in 10×200×40K = 80M detec- and Malik, 2009) trained on a GPU (van de Sande et al.,
tions. Each detection contains an image index, a class 2011a).
index, 4 bounding box coordinates, and the confidence
score, so it takes on the order of 28 bytes. The full set of ILSVRC2012. This was a turning point for large-scale
detections would then require 2.24Gb to store and sub- object recognition, when large-scale deep neural net-
mit to the evaluation server, which is impractical. This works entered the scene. The undisputed winner of both
means that algorithms are implicitly required to limit the classification and localization tasks in 2012 was the
their predictions to only the most confident locations. SuperVision team. They trained a large, deep convolu-
tional neural network on RGB values, with 60 million
parameters using an efficient GPU implementation and
5 Methods
a novel hidden-unit dropout trick (Krizhevsky et al.,
The ILSVRC dataset and the competition has allowed 2012; Hinton et al., 2012). The second place in image
significant algorithmic advances in large-scale image recog- classification went to the ISI team, which used Fisher
nition and retrieval. vectors (Sanchez and Perronnin, 2011) and a stream-
lined version of Graphical Gaussian Vectors (Harada
and Kuniyoshi, 2012), along with linear classifiers us-
5.1 Challenge entries ing Passive-Aggressive (PA) algorithm (Crammer et al.,
2006). The second place in single-object localization
This section is organized chronologically, highlighting went to the VGG, with an image classification sys-
the particularly innovative and successful methods which tem including dense SIFT features and color statis-
participated in the ILSVRC each year. Tables 5, 6 and 7 tics (Lowe, 2004), a Fisher vector representation (Sanchez
list all the participating teams. We see a turning point and Perronnin, 2011), and a linear SVM classifier, plus
in 2012 with the development of large-scale convolu- additional insights from (Arandjelovic and Zisserman,
tional neural networks. 2012; Sanchez et al., 2012). Both ISI and VGG used
(Felzenszwalb et al., 2010) for object localization; SuperVision used a regression model trained to predict bounding box locations. Despite the weaker detection model, SuperVision handily won the object localization task. A detailed analysis and comparison of the SuperVision and VGG submissions on the single-object localization task can be found in (Russakovsky et al., 2013). The influence of the success of the SuperVision model can be clearly seen in ILSVRC2013 and ILSVRC2014.

ILSVRC2013. There were 24 teams participating in the ILSVRC2013 competition, compared to 21 in the previous three years combined. Following the success of the deep learning-based method in 2012, the vast majority of entries in 2013 used deep convolutional neural networks in their submission. The winner of the classification task was Clarifai, with several large deep convolutional networks averaged together. The network architectures were chosen using the visualization technique of (Zeiler and Fergus, 2013), and they were trained on the GPU following (Zeiler et al., 2011) using the dropout technique (Krizhevsky et al., 2012).

The winning single-object localization OverFeat submission was based on an integrated framework for using convolutional networks for classification, localization and detection with a multiscale sliding window approach (Sermanet et al., 2013). They were the only team tackling all three tasks.

The winner of the object detection task was the UvA team, which utilized a new way of efficient encoding (van de Sande et al., 2014) of densely sampled color descriptors (van de Sande et al., 2010) pooled using a multi-level spatial pyramid in a selective search framework (Uijlings et al., 2013). The detection results were rescored using a full-image convolutional network classifier.

ILSVRC2014. 2014 attracted the most submissions, with 36 teams submitting 123 entries compared to just 24 teams in 2013 – a 1.5x increase in participation.8 As in 2013, almost all teams used convolutional neural networks as the basis for their submission. Significant progress has been made in just one year: image classification error was almost halved since ILSVRC2013 and object detection mean average precision almost doubled compared to ILSVRC2013. Please refer to Section 6.1 for details.

In 2014 teams were allowed to use outside data for training their models in the competition, so there were six tracks: provided-data and outside-data tracks in each of the image classification, single-object localization, and object detection tasks.

8 Table 7 omits 4 teams which submitted results but chose not to officially participate in the challenge.

The winning image classification with provided data team was GoogLeNet, which explored an improved convolutional neural network architecture combining the multi-scale idea with intuitions gained from the Hebbian principle. Additional dimension reduction layers allowed them to increase both the depth and the width of the network significantly without incurring significant computational overhead. In the image classification with external data track, CASIAWS won by using weakly supervised object localization from only classification labels to improve image classification. MCG region proposals (Arbeláez et al., 2014) pretrained on the PASCAL VOC 2012 data are used to extract region proposals, regions are represented using convolutional networks, and a multiple instance learning strategy is used to learn weakly supervised object detectors to represent the image.

In the single-object localization with provided data track, the winning team was VGG, which explored the effect of convolutional neural network depth on its accuracy by using three different architectures with up to 19 weight layers with rectified linear unit non-linearity, building off of the implementation of Caffe (Jia, 2013). For localization they used per-class bounding box regression similar to OverFeat (Sermanet et al., 2013). In the single-object localization with external data track, Adobe used 2000 additional ImageNet classes to train the classifiers in an integrated convolutional neural network framework for both classification and localization, with bounding box regression. At test time they used k-means to find bounding box clusters and rank the clusters according to the classification scores.

In the object detection with provided data track, the winning team NUS used the RCNN framework (Girshick et al., 2013) with the network-in-network method (Lin et al., 2014a), incorporating improvements of (Howard, 2014). Global context information was incorporated following (Chen et al., 2014). In the object detection with external data track, the winning team was GoogLeNet (which also won image classification with provided data). It is truly remarkable that the same team was able to win at both image classification and object detection, indicating that their methods are able to not only classify the image based on scene information but also accurately localize multiple object instances. Just like almost all teams participating in this track, GoogLeNet used the image classification dataset as extra training data.

5.2 Large scale paradigm shift

ILSVRC over the past five years has paved the way for several major paradigm shifts in computer vision.
ILSVRC 2010



Codename CLS LOC Institutions Contributors and references

Hminmax 54.4 Massachusetts Institute of Technology Jim Mutch, Sharat Chikkerur, Hristo Paskov, Ruslan Salakhutdinov, Stan Bileschi, Hueihan Jhuang

IBM 70.1 IBM research† , Georgia Tech‡ Lexing Xie† , Hua Ouyang‡ , Apostol Natsev†

ISIL 44.6 Intelligent Systems and Informatics Lab., The University of Tatsuya Harada, Hideki Nakayama, Yoshitaka Ushiku, Yuya Yamashita, Jun Imura, Yasuo Kuniyoshi
Tokyo

ITNLP 78.7 Harbin Institute of Technology Deyuan Zhang, Wenfeng Xuan, Xiaolong Wang, Bingquan Liu, Chengjie Sun

LIG 60.7 Laboratoire d’Informatique de Grenoble Georges Quénot

NEC 28.2 NEC Labs America† , University of Illinois at Urbana- Yuanqing Lin† , Fengjun Lv† , Shenghuo Zhu† , Ming Yang† , Timothee Cour† , Kai Yu† , LiangLiang Cao‡ ,
Champaign‡ , Rutgers∓ Zhen Li‡ , Min-Hsuan Tsai‡ , Xi Zhou‡ , Thomas Huang‡ , Tong Zhang∓
(Lin et al., 2011)
NII 74.2 National Institute of Informatics, Tokyo, Japan† , Hefei Nor- Cai-Zhi Zhu† , Xiao Zhou‡ , Shin'ichi Satoh†
mal Univ. Heifei, China‡

NTU 58.3 CeMNet, SCE, NTU, Singapore Zhengxiang Wang, Liang-Tien Chia

Regularities 75.1 SRI International Omid Madani, Brian Burns

UCI 46.6 University of California Irvine Hamed Pirsiavash, Deva Ramanan, Charless Fowlkes

XRCE 33.6 Xerox Research Centre Europe Jorge Sanchez, Florent Perronnin, Thomas Mensink
(Perronnin et al., 2010)

ILSVRC 2011
Codename CLS LOC Institutions Contributors and references

ISI 36.0 - Intelligent Systems and Informatics lab, University of Tokyo Tatsuya Harada, Asako Kanezaki, Yoshitaka Ushiku, Yuya Yamashita, Sho Inaba, Hiroshi Muraoka, Yasuo
Kuniyoshi

NII 50.5 - National Institute of Informatics, Japan Duy-Dinh Le, Shin'ichi Satoh

UvA 31.0 42.5 University of Amsterdam† , University of Trento‡ Koen E. A. van de Sande† , Jasper R. R. Uijlings‡ , Arnold W. M. Smeulders† , Theo Gevers† , Nicu Sebe‡ ,
Cees Snoek†
(van de Sande et al., 2011b)
XRCE 25.8 56.5 Xerox Research Centre Europe† , CIII‡ Florent Perronnin† , Jorge Sanchez†‡
(Sanchez and Perronnin, 2011)

ILSVRC 2012
Codename CLS LOC Institutions Contributors and references

ISI 26.2 53.6 University of Tokyo† , JST PRESTO‡ Naoyuki Gunji† , Takayuki Higuchi† , Koki Yasumoto† , Hiroshi Muraoka† , Yoshitaka Ushiku† , Tatsuya
Harada† ‡ , Yasuo Kuniyoshi†
(Harada and Kuniyoshi, 2012)
LEAR 34.5 - LEAR INRIA Grenoble† , TVPA Xerox Research Centre Thomas Mensink† ‡ , Jakob Verbeek† , Florent Perronnin‡ , Gabriela Csurka‡
Europe‡ (Mensink et al., 2012)

VGG 27.0 50.0 University of Oxford Karen Simonyan, Yusuf Aytar, Andrea Vedaldi, Andrew Zisserman
(Arandjelovic and Zisserman, 2012; Sanchez et al., 2012)
SuperVision 16.4 34.2 University of Toronto Alex Krizhevsky, Ilya Sutskever, Geoffrey Hinton
(Krizhevsky et al., 2012)
UvA 29.6 - University of Amsterdam Koen E. A. van de Sande, Amir Habibian, Cees G. M. Snoek
(Sanchez and Perronnin, 2011; Scheirer et al., 2012)
XRCE 27.1 - Xerox Research Centre Europe† , LEAR INRIA ‡ Florent Perronnin† , Zeynep Akata†‡ , Zaid Harchaoui‡ , Cordelia Schmid‡
(Perronnin et al., 2012)

Table 5 Teams participating in ILSVRC2010-2012, ordered alphabetically. Each method is identified with a codename used in the text. We report flat top-5 classification
and single-object localization error, in percents (lower is better). For teams which submitted multiple entries we report the best score. In 2012, SuperVision also submitted
entries trained with the extra data from the ImageNet Fall 2011 release, and obtained 15.3% classification error and 33.5% localization error. Key references are provided
where available. More details about the winning entries can be found in Section 5.1.

ILSVRC 2013

Codename CLS LOC DET Institutions Contributors and references

Adobe 15.2 - - Adobe† , University of Illinois at Urbana-Champaign‡ Hailin Jin† , Zhe Lin† , Jianchao Yang† , Tom Paine‡
(Krizhevsky et al., 2012)
AHoward 13.6 - - Andrew Howard Consulting Andrew Howard

BUPT 25.2 - - Beijing University of Posts and Telecommunications† , Orange Labs Chong Huang† , Yunlong Bian† , Hongliang Bai‡ , Bo Liu† , Yanchao Feng† , Yuan Dong†
International Center Beijing‡

Clarifai 11.7 - - Clarifai Matthew Zeiler


(Zeiler and Fergus, 2013; Zeiler et al., 2011)
CogVision 16.1 - - Microsoft Research† , Harbin Institute of Technology‡ Kuiyuan Yang† , Yalong Bai† , Yong Rui‡

decaf 19.2 - - University of California Berkeley Yangqing Jia, Jeff Donahue, Trevor Darrell
(Donahue et al., 2013)
Deep Punx 20.9 - - Saint Petersburg State University Evgeny Smirnov, Denis Timoshenko, Alexey Korolev
(Krizhevsky et al., 2012; Wan et al., 2013; Tang, 2013)
Delta - - 6.1 National Tsing Hua University Che-Rung Lee, Hwann-Tzong Chen, Hao-Ping Kang, Tzu-Wei Huang, Ci-Hong Deng, Hao-
Che Kao

IBM 20.7 - - University of Illinois at Urbana-Champaign† , IBM Watson Re- Zhicheng Yan† , Liangliang Cao‡ , John R Smith‡ , Noel Codella‡ ,Michele Merler‡ , Sharath
search Center‡ , IBM Haifa Research Center∓ Pankanti‡ , Sharon Alpert∓ , Yochay Tzur∓ ,

MIL 24.4 - - University of Tokyo Masatoshi Hidaka, Chie Kamada, Yusuke Mukuta, Naoyuki Gunji, Yoshitaka Ushiku, Tat-
suya Harada

Minerva 21.7 Peking University† , Microsoft Research‡ , Shanghai Jiao Tong Tianjun Xiao†‡ , Minjie Wang∓‡ , Jianpeng Li§‡ , Yalong Baiς ‡ , Jiaxing Zhang‡ , Kuiyuan
University∓ , XiDian University§ , Harbin Institute of Technologyς Yang‡ , Chuntao Hong‡ , Zheng Zhang‡
(Wang et al., 2014)
NEC - - 19.6 NEC Labs America† , University of Missouri ‡ Xiaoyu Wang† , Miao Sun‡ , Tianbao Yang† , Yuanqing Lin† , Tony X. Han‡ , Shenghuo Zhu†
(Wang et al., 2013)
NUS 13.0 National University of Singapore Min Lin*, Qiang Chen*, Jian Dong, Junshi Huang, Wei Xia, Shuicheng Yan (* = equal
contribution)
(Krizhevsky et al., 2012)
Orange 25.2 Orange Labs International Center Beijing† , Beijing University of Hongliang BAI† , Lezi Wang‡ , Shusheng Cen‡ , YiNan Liu‡ , Kun Tao† , Wei Liu† , Peng Li† ,
Posts and Telecommunications‡ Yuan Dong†

OverFeat 14.2 30.0 (19.4) New York University Pierre Sermanet, David Eigen, Michael Mathieu, Xiang Zhang, Rob Fergus, Yann LeCun
(Sermanet et al., 2013)
Quantum 82.0 - - Self-employed† , Student in Troy High School, Fullerton, CA‡ Henry Shu† , Jerry Shu‡
(Batra et al., 2013)
SYSU - - 10.5 Sun Yat-Sen University, China. Xiaolong Wang
(Felzenszwalb et al., 2010)
Toronto - - 11.5 University of Toronto Yichuan Tang*, Nitish Srivastava*, Ruslan Salakhutdinov (* = equal contribution)

Trimps 26.2 - - The Third Research Institute of the Ministry of Public Security, Jie Shao, Xiaoteng Zhang, Yanfeng Shang, Wenfei Wang, Lin Mei, Chuanping Hu
P.R. China

UCLA - - 9.8 University of California Los Angeles Yukun Zhu, Jun Zhu, Alan Yuille

UIUC - - 1.0 University of Illinois at Urbana-Champaign Thomas Paine, Kevin Shih, Thomas Huang
(Krizhevsky et al., 2012)
UvA 14.3 - 22.6 University of Amsterdam, Euvision Technologies Koen E. A. van de Sande, Daniel H. F. Fontijne, Cees G. M. Snoek, Harro M. G. Stokman,
Arnold W. M. Smeulders
(van de Sande et al., 2014)
VGG 15.2 46.4 - Visual Geometry Group, University of Oxford Karen Simonyan, Andrea Vedaldi, Andrew Zisserman
(Simonyan et al., 2013)
ZF 13.5 - - New York University Matthew D Zeiler, Rob Fergus
(Zeiler and Fergus, 2013; Zeiler et al., 2011)



Table 6 Teams participating in ILSVRC2013, ordered alphabetically. Each method is identified with a codename used in the text. For classification and single-object
localization we report flat top-5 error, in percents (lower is better). For detection we report mean average precision, in percents (higher is better). Even though the winner
of the challenge was determined by the number of object categories won, this correlated strongly with mAP. Parentheses indicate the team used outside training data
and was not part of the official competition. Some competing teams also submitted entries trained with outside data: Clarifai with 11.2% classification error, NEC with
20.9% detection mAP. Key references are provided where available. More details about the winning entries can be found in Section 5.1.
ILSVRC 2014



Codename CLS CLSo LOC LOCo DET DETo Institutions Contributors and references

Adobe - 11.6 - 30.1 - - Adobe† , UIUC‡ Hailin Jin† , Zhaowen Wang‡ , Jianchao Yang† , Zhe Lin†

AHoward 8.1 - ◦ - - - Howard Vision Technologies Andrew Howard (Howard, 2014)

BDC 11.3 - ◦ - - - Institute for Infocomm Research† , Uni- Olivier Morre† ‡ , Hanlin Goh† , Antoine Veillard‡ , Vijay Chandrasekhar† (Krizhevsky et al., 2012)
versité Pierre et Marie Curie‡

Berkeley - - - - - 34.5 UC Berkeley Ross Girshick, Jeff Donahue, Sergio Guadarrama, Trevor Darrell, Jitendra Malik (Girshick et al., 2013,
2014)
BREIL 16.0 - ◦ - - - KAIST department of EE Jun-Cheol Park, Yunhun Jang, Hyungwon Choi, JaeYoung Jun (Chatfield et al., 2014; Jia, 2013)

Brno 17.6 - 52.0 - - - Brno University of Technology Martin Kolář, Michal Hradiš, Pavel Svoboda (Krizhevsky et al., 2012; Mikolov et al., 2013; Jia, 2013)

CASIA-2 - - - - 28.6 - Chinese Academy of Science† , South- Peihao Huang† , Yongzhen Huang† , Feng Liu‡ , Zifeng Wu† , Fang Zhao† , Liang Wang† , Tieniu
east University‡ Tan† (Girshick et al., 2014)

CASIAWS - 11.4 - ◦ - - CRIPAC, CASIA Weiqiang Ren, Chong Wang, Yanhua Chen, Kaiqi Huang, Tieniu Tan (Arbeláez et al., 2014)

Cldi 13.9 - 46.9 - - - KAIST† , Cldi Inc.‡ Kyunghyun Paeng† , Donggeun Yoo† , Sunggyun Park† , Jungin Lee‡ , Anthony S. Paek‡ , In So Kweon† ,
Seong Dae Kim† (Krizhevsky et al., 2012; Perronnin et al., 2010)
CUHK - - - - - 40.7 The Chinese University of Hong Kong Wanli Ouyang, Ping Luo, Xingyu Zeng, Shi Qiu, Yonglong Tian, Hongsheng Li, Shuo Yang, Zhe Wang,
Yuanjun Xiong, Chen Qian, Zhenyao Zhu, Ruohui Wang, Chen-Change Loy, Xiaogang Wang, Xiaoou
Tang (Ouyang et al., 2014; Ouyang and Wang, 2013)
DeepCNet 17.5 - ◦ - - - University of Warwick Ben Graham (Graham, 2013; Schmidhuber, 2012)

DeepInsight - - - - - 40.5 NLPR† , HKUST‡ Junjie Yan† , Naiyan Wang‡ , Stan Z. Li† , Dit-Yan Yeung‡ (Girshick et al., 2014)

FengjunLv 17.4 - ◦ - - - Fengjun Lv Consulting Fengjun Lv (Krizhevsky et al., 2012; Harel et al., 2007)

GoogLeNet 6.7 - 26.4 - - 43.9 Google Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Drago Anguelov, Dumitru Erhan,
Andrew Rabinovich (Szegedy et al., 2014)
HKUST - - - - 28.9 - Hong Kong U. of Science and Tech.† , Cewu Lu† , Hei Law*† , Hao Chen*‡ , Qifeng Chen*∓ , Yao Xiao*† Chi Keung Tang† (Uijlings et al., 2013;
Chinese U. of H. K.‡ , Stanford U.∓ Girshick et al., 2013; Perronnin et al., 2010; Felzenszwalb et al., 2010)

libccv 16.0 - ◦ - - - libccv.org Liu Liu (Zeiler and Fergus, 2013)

MIL 18.3 - 33.7 - - 30.4 The University of Tokyo† , IIT Senthil Purushwalkam† ‡ , Yuichiro Tsuchiya† , Atsushi Kanehira† , Asako Kanezaki† , Tatsuya
Guwahati‡ Harada† (Kanezaki et al., 2014; Girshick et al., 2013)

MPG UT - - - - - 26.4 The University of Tokyo Riku Togashi, Keita Iwamoto, Tomoaki Iwase, Hideki Nakayama (Girshick et al., 2014)

MSRA 8.1 - 35.5 - 35.1 - Microsoft Research† , Xi’an Jiaotong Kaiming He† , Xiangyu Zhang‡ , Shaoqing Ren∓ , Jian Sun† (He et al., 2014)
U.‡ , U. of Science and Tech. of China∓

NUS - - - - 37.2 - National University of Singapore† , Jian Dong† , Yunchao Wei† , Min Lin† , Qiang Chen‡ , Wei Xia† , Shuicheng Yan† (Lin et al., 2014a; Chen
IBM Research Australia‡ et al., 2014)

NUS-BST 9.8 - ◦ - - - National Univ. of Singapore† , Beijing Min Lin† , Jian Dong† , Hanjiang Lai† , Junjun Xiong‡ , Shuicheng Yan† (Lin et al., 2014a; Howard, 2014;
Samsung Telecom R&D Center† Krizhevsky et al., 2012)

Orange 15.2 14.8 42.8 42.7 - 27.7 Orange Labs Beijing† , BUPT China‡ Hongliang Bai† , Yinan Liu† , Bo Liu‡ , Yanchao Feng‡ , Kun Tao† , Yuan Dong† (Girshick et al., 2014)

PassBy 16.7 - ◦ - - - LENOVO† , HKUST‡ , U. of Macao∓ Lin Sun† ‡ , Zhanghui Kuang† , Cong Zhao† , Kui Jia∓ , Oscar C.Au‡ (Jia, 2013; Krizhevsky et al., 2012)

SCUT 18.8 - ◦ - - - South China Univ. of Technology Guo Lihua, Liao Qijun, Ma Qianli, Lin Junbin

Southeast - - - - 30.5 - Southeast U.† , Chinese A. of Sciences‡ Feng Liu† , Zifeng Wu‡ , Yongzhen Huang‡

SYSU 14.4 - 31.9 - - - Sun Yat-Sen University Liliang Zhang, Tianshui Chen, Shuye Zhang, Wanglan He, Liang Lin, Dengguang Pang, Lingbo Liu

Trimps - 11.5 - 42.2 - 33.7 The Third Research Institute of the Jie Shao, Xiaoteng Zhang, JianYing Zhou, Jian Wang, Jian Chen, Yanfeng Shang, Wenfei Wang, Lin
Ministry of Public Security Mei, Chuanping Hu (Girshick et al., 2014; Manen et al., 2013; Howard, 2014)

TTIC 10.2 - 48.3 - - - Toyota Technological Institute at George Papandreou† , Iasonas Kokkinos‡ (Papandreou, 2014; Papandreou et al., 2014; Jojic et al., 2003;
Chicago† , Ecole Centrale Paris‡ Krizhevsky et al., 2012; Sermanet et al., 2013; Dubout and Fleuret, 2012; Iandola et al., 2014)

UI 99.5 - ◦ - - - University of Isfahan Fatemeh Shafizadegan, Elham Shabaninia (Yang et al., 2009)

UvA 12.1 - ◦ - 32.0 35.4 U. of Amsterdam and Euvision Tech. Koen van de Sande, Daniel Fontijne, Cees Snoek, Harro Stokman, Arnold Smeulders (van de Sande et al.,
2014)
VGG 7.3 - 25.3 - - - University of Oxford Karen Simonyan, Andrew Zisserman (Simonyan and Zisserman, 2014)

XYZ 11.2 - ◦ - - - The University of Queensland Zhongwen Xu and Yi Yang (Krizhevsky et al., 2012; Jia, 2013; Zeiler and Fergus, 2013; Lin et al., 2014a)

Table 7 Teams participating in ILSVRC2014, ordered alphabetically. Each method is identified with a codename used in the text. For classification and single-object
localization we report flat top-5 error, in percents (lower is better). For detection we report mean average precision, in percents (higher is better). CLSo,LOCo,DETo
corresponds to entries using outside training data (officially allowed in ILSVRC2014). ◦ means localization error greater than 60% (localization submission was required
with every classification submission). Key references are provided where available. More details about the winning entries can be found in Section 5.1.

The field of categorical object recognition has dramatically evolved in the large-scale setting. Section 5.1 documents the progress, starting from coded SIFT features and evolving to large-scale convolutional neural networks dominating at all three tasks of image classification, single-object localization, and object detection. With the availability of so much training data it became possible to learn neural networks directly from the image data, without needing to create multi-stage hand-tuned pipelines of extracted features and discriminative classifiers. The major breakthrough came in 2012 with the win of the SuperVision team on the image classification and single-object localization tasks (Krizhevsky et al., 2012), and by 2014 all of the top contestants were relying heavily on convolutional neural networks.

Further, the field of computer vision as a whole has focused on large-scale recognition over the past few years. Best paper awards at top vision conferences in 2013 were awarded to large-scale recognition methods: at CVPR 2013 to “Fast, Accurate Detection of 100,000 Object Classes on a Single Machine” (Dean et al., 2013) and at ICCV 2013 to “From Large Scale Image Categorization to Entry-Level Categories” (Ordonez et al., 2013). Additionally, several influential lines of research have emerged, such as the large-scale weakly supervised localization work of (Kuettel et al., 2012), which was awarded the best paper award at ECCV 2012, and large-scale zero-shot learning, e.g., (Frome et al., 2013).

6 Results and analysis

6.1 Improvements over the years

State-of-the-art accuracy has improved significantly from ILSVRC2010 to ILSVRC2014, showcasing the massive progress that has been made in large-scale object recognition over the past five years. The performance of the winning ILSVRC entries for each task and each year is shown in Figure 9. The improvement over the years is clearly visible. In this section we quantify and analyze this improvement.

Fig. 9  Performance of winning entries in the ILSVRC2010-2014 competitions in each of the three tasks (details about the entries and numerical results are in Section 5.1). There is a steady reduction of error every year in the object classification and single-object localization tasks, and a 1.9x improvement in mean average precision in object detection. There are two considerations in making these comparisons. (1) The object categories used in ILSVRC changed between years 2010 and 2011, and between 2011 and 2012. However, the large scale of the data (1000 object categories, 1.2 million training images) has remained the same, making it possible to compare results. Image classification and single-object localization entries shown here use only provided training data. (2) The size of the object detection training data has increased significantly between years 2013 and 2014 (Section 3.3). Section 6.1 discusses the relative effects of the training data increase versus algorithmic improvements.

6.1.1 Image classification and single-object localization improvement over the years

There has been a 4.2x reduction in image classification error (from 28.2% to 6.7%) and a 1.7x reduction in single-object localization error (from 42.5% to 25.3%) since the beginning of the challenge. For consistency, here we consider only teams that use the provided training data. Even though the exact object categories have changed (Section 3.1.1), the large scale of the dataset has remained the same (Table 2), making the results comparable across the years. The dataset has not changed since 2012, and there has been a 2.4x reduction in image classification error (from 16.4% to 6.7%) and a 1.3x reduction in single-object localization error (from 33.5% to 25.3%) in the past three years.

6.1.2 Object detection improvement over the years

Object detection accuracy as measured by the mean average precision (mAP) has increased 1.9x since the introduction of this task, from 22.6% mAP in ILSVRC2013 to 43.9% mAP in ILSVRC2014. However, these results are not directly comparable for two reasons. First, the size of the object detection training data has increased significantly from 2013 to 2014 (Section 3.3). Second, the 43.9% mAP result was obtained with the addition of the image classification and single-object localization training data. Here we attempt to understand the relative effects of the training set size increase versus algorithmic improvements. All models are evaluated on the same ILSVRC2013-2014 object detection test set.

First, we quantify the effects of increasing detection training data between the two challenges by comparing the same model trained on ILSVRC2013 detection data versus ILSVRC2014 detection data. The UvA team’s framework from 2013 achieved 22.6% mAP with ILSVRC2013 data (Table 6) and 26.3% mAP with ILSVRC2014 data and no other modifications (personal communication with members of the UvA team). The absolute increase in mAP was 3.7%. The RCNN model achieved 31.4% mAP with ILSVRC2013 detection plus image classification data (Girshick et al., 2013) and 34.5% mAP
with ILSVRC2014 detection plus image classification data (Berkeley team in Table 7). The absolute increase in mAP from expanding the ILSVRC2013 detection data to ILSVRC2014 was 3.1%.

Second, we quantify the effects of adding in the external data for training object detection models. The NEC model in 2013 achieved 19.6% mAP trained on ILSVRC2013 detection data alone and 20.9% mAP trained on ILSVRC2013 detection plus classification data (Table 6). The absolute increase in mAP was 1.3%. The UvA team’s best entry in 2014 achieved 32.0% mAP trained on ILSVRC2014 detection data and 35.4% mAP trained on ILSVRC2014 detection plus classification data. The absolute increase in mAP was 3.4%.

Thus, we conclude based on the evidence so far that expanding the ILSVRC2013 detection set to the ILSVRC2014 set, as well as adding in additional training data from the classification task, each account for approximately 1−4% in absolute mAP improvement for the models. For comparison, we can also attempt to quantify the effect of algorithmic innovation. The UvA team’s 2013 framework achieved 26.3% mAP on ILSVRC2014 data as mentioned above, and their improved method in 2014 obtained 32.0% mAP (Table 7). This is a 5.8% absolute increase in mAP over just one year from algorithmic innovation alone.

In summary, we conclude that the absolute 21.3% increase in mAP between the winning entries of ILSVRC2013 (22.6% mAP) and of ILSVRC2014 (43.9% mAP) is the result of impressive algorithmic innovation and not just a consequence of increased training data. However, increasing the ILSVRC2014 object detection training dataset further is likely to produce additional improvements in detection accuracy for current algorithms.

Table 8  We use bootstrapping to construct 99.9% confidence intervals around the results of up to the top 5 submissions to each ILSVRC task in 2012-2014. † means the entry used external training data. The winners using the provided data for each track and each year are bolded. The difference between the winning method and the runner-up each year is significant even at the 99.9% level.

Image classification (Year, Codename, Error (percent), 99.9% Conf Int):
2014 GoogLeNet 6.66 (6.40 - 6.92)
2014 VGG 7.32 (7.05 - 7.60)
2014 MSRA 8.06 (7.78 - 8.34)
2014 AHoward 8.11 (7.83 - 8.39)
2014 DeeperVision 9.51 (9.21 - 9.82)
2013 Clarifai† 11.20 (10.87 - 11.53)
2014 CASIAWS† 11.36 (11.03 - 11.69)
2014 Trimps† 11.46 (11.13 - 11.80)
2014 Adobe† 11.58 (11.25 - 11.91)
2013 Clarifai 11.74 (11.41 - 12.08)
2013 NUS 12.95 (12.60 - 13.30)
2013 ZF 13.51 (13.14 - 13.87)
2013 AHoward 13.55 (13.20 - 13.91)
2013 OverFeat 14.18 (13.83 - 14.54)
2014 Orange† 14.80 (14.43 - 15.17)
2012 SuperVision† 15.32 (14.94 - 15.69)
2012 SuperVision 16.42 (16.04 - 16.80)
2012 ISI 26.17 (25.71 - 26.65)
2012 VGG 26.98 (26.53 - 27.43)
2012 XRCE 27.06 (26.60 - 27.52)
2012 UvA 29.58 (29.09 - 30.04)

Single-object localization (Year, Codename, Error (percent), 99.9% Conf Int):
2014 VGG 25.32 (24.87 - 25.78)
2014 GoogLeNet 26.44 (25.98 - 26.92)
2013 OverFeat 29.88 (29.38 - 30.35)
2014 Adobe† 30.10 (29.61 - 30.58)
2014 SYSU 31.90 (31.40 - 32.40)
2012 SuperVision† 33.55 (33.05 - 34.04)
2014 MIL 33.74 (33.24 - 34.25)
2012 SuperVision 34.19 (33.67 - 34.69)
2014 MSRA 35.48 (34.97 - 35.99)
2014 Trimps† 42.22 (41.69 - 42.75)
2014 Orange† 42.70 (42.18 - 43.24)
2013 VGG 46.42 (45.90 - 46.95)
2012 VGG 50.03 (49.50 - 50.57)
2012 ISI 53.65 (53.10 - 54.17)
2014 CASIAWS† 61.96 (61.44 - 62.48)

Object detection (Year, Codename, AP (percent), 99.9% Conf Int):
2014 GoogLeNet† 43.93 (42.92 - 45.65)
2014 CUHK† 40.67 (39.68 - 42.30)
2014 DeepInsight† 40.45 (39.49 - 42.06)
2014 NUS 37.21 (36.29 - 38.80)
2014 UvA† 35.42 (34.63 - 36.92)
2014 MSRA 35.11 (34.36 - 36.70)
2014 Berkeley† 34.52 (33.67 - 36.12)
2014 UvA 32.03 (31.28 - 33.49)
2014 Southeast 30.48 (29.70 - 31.93)
2014 HKUST 28.87 (28.03 - 30.20)
2013 UvA 22.58 (22.00 - 23.82)
2013 NEC† 20.90 (20.40 - 22.15)
2013 NEC 19.62 (19.14 - 20.85)
2013 OverFeat† 19.40 (18.82 - 20.61)
2013 Toronto 11.46 (10.98 - 12.34)
2013 SYSU 10.45 (10.04 - 11.32)
2013 UCLA 9.83 (9.48 - 10.77)

6.2 Statistical significance

One important question to ask is whether the results of different submissions to ILSVRC are statistically significantly different from each other. Given the large scale, it is no surprise that even minor differences in accuracy are statistically significant; we seek to quantify exactly how much of a difference is enough.

Following the strategy employed by PASCAL VOC (Everingham et al., 2014), for each method we obtain a confidence interval of its score using bootstrap sampling. During each bootstrap round, we sample N images with replacement from the available N test images and evaluate the performance of the algorithm on those sampled images. This can be done very efficiently by precomputing the accuracy on each image. Given the results of all the bootstrapping rounds we
discard the lower and the upper α fraction. The range of the remaining results represents the 1 − 2α confidence interval. We run a large number of bootstrapping rounds (from 20,000 until convergence). Table 8 shows the results of the top entries to each task of ILSVRC2012-2014. The winning methods are statistically significantly different from the other methods, even at the 99.9% level.
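A minimal sketch of this bootstrapping procedure is given below, assuming the per-image scores of a method have already been precomputed; the function name, the default number of rounds and the fixed seed are illustrative choices.

```python
import numpy as np

def bootstrap_interval(per_image_scores, rounds=20000, alpha=0.0005, seed=0):
    """Bootstrap a (1 - 2*alpha) confidence interval for the mean score;
    alpha = 0.0005 gives the 99.9% intervals reported in Table 8."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(per_image_scores, dtype=float)
    n = len(scores)
    means = np.empty(rounds)
    for r in range(rounds):
        sample = rng.integers(0, n, size=n)   # resample N images with replacement
        means[r] = scores[sample].mean()      # cheap: per-image scores precomputed
    # Discarding the lower and upper alpha fraction leaves the 1 - 2*alpha interval.
    lo, hi = np.quantile(means, [alpha, 1.0 - alpha])
    return lo, hi
```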
6.3 Current state of categorical object recognition

Besides looking at just the average accuracy across hundreds of object categories and tens of thousands of images, we can also delve deeper to understand where mistakes are being made and where researchers’ efforts should be focused to expedite progress.

To do so, in this section we analyze an “optimistic” measurement of state-of-the-art recognition performance instead of focusing on the differences between individual algorithms. For each task and each object class, we compute the best performance of any entry submitted to any ILSVRC2012-2014, including methods using additional training data. Since the test sets have remained the same, we can directly compare all the entries of the past three years to obtain the most “optimistic” measurement of state-of-the-art accuracy on each category.

For consistency with the object detection metric (higher is better), in this section we use image classification and single-object localization accuracy instead of error, where accuracy = 1 − error.

6.3.1 Range of accuracy across object classes

Figure 10 shows the distribution of accuracy achieved by the “optimistic” models across the object categories. The image classification model achieves 94.6% accuracy on average (or 5.4% error), but there remains a 41.0% absolute difference in accuracy between the most and least accurate object class. The single-object localization model achieves 81.5% accuracy on average (or 18.5% error), with a 77.0% range in accuracy across the object classes. The object detection model achieves 44.7% average precision, with an 84.7% range across the object classes. It is clear that the ILSVRC dataset is far from saturated: performance on many categories has remained poor despite the strong overall performance of the models.

Fig. 10  For each object class, we consider the best performance of any entry submitted to ILSVRC2012-2014, including entries using additional training data. The plots show the distribution of these “optimistic” per-class results. Performance is measured as accuracy for image classification (left) and for single-object localization (middle), and as average precision for object detection (right). While the results are very promising in image classification, the ILSVRC datasets are far from saturated: many object classes continue to be challenging for current algorithms.

6.3.2 Qualitative examples of easy and hard classes

Figure 11 shows the easiest and hardest classes for each task, i.e., classes with the best and worst results obtained with the “optimistic” models.

For image classification, 121 out of 1000 object classes have 100% image classification accuracy according to the optimistic estimate. Figure 11 (top) shows a random set of 10 of them. They contain a variety of classes, such as mammals like “red fox” and animals with distinctive structures like “stingray”. The hardest classes in the image classification task, with accuracy as low as 59.0%, include metallic and see-through man-made objects, such as “hook” and “water bottle,” the material “velvet” and the highly varied scene class “restaurant.”

For single-object localization, the 10 easiest classes, with 99.0−100% accuracy, are all mammals and birds. The hardest classes include metallic man-made objects such as “letter opener” and “ladle”, plus thin structures such as “pole” and “spacebar” and highly varied classes such as “wing”. The most challenging class, “spacebar,” has only 23.0% localization accuracy.

For object detection, the easiest classes are living organisms such as “dog” and “tiger”, plus “basketball” and “volleyball” with distinctive shape and color, and a somewhat surprising “snowplow.” The easiest class “butterfly” is not yet perfectly detected but is very close with 92.7% AP. The hardest classes are, as expected, small thin objects such as “flute” and “nail”, and the highly varied “lamp” and “backpack” classes, with as low as 8.0% AP.

6.3.3 Per-class accuracy as a function of image properties

We now take a closer look at the image properties to try to understand why current algorithms perform well on some object classes but not others. One hypothesis
is that variation in accuracy comes from the fact that instances of some classes tend to be much smaller in images than instances of other classes, and smaller objects may be harder for computers to recognize. In this section we argue that while accuracy is correlated with object scale in the image, not all variation in accuracy can be accounted for by scale alone.

For every object class, we compute its average scale, or the average fraction of image area occupied by an instance of the object class on the ILSVRC2012-2014 validation set. Since the images and object classes in the image classification and single-object localization tasks are the same, we use the bounding box annotations of the single-object localization dataset for both tasks. In that dataset the object classes range from “swimming trunks” with a scale of 1.5% to “spider web” with a scale of 85.6%. In the object detection validation dataset the object classes range from “sunglasses” with a scale of 1.3% to “sofa” with a scale of 44.4%.

Figure 12 shows the performance of the “optimistic” method as a function of the average scale of the object in the image. Each dot corresponds to one object class. We observe a very weak positive correlation between object scale and image classification accuracy: ρ = 0.14. For single-object localization and object detection the correlation is stronger, at ρ = 0.40 and ρ = 0.41 respectively. It is clear that not all variation in accuracy can be accounted for by scale alone. Nevertheless, in the next section we will normalize for object scale to ensure that this factor is not affecting our conclusions.
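The per-class scale statistic and its correlation with accuracy can be computed along the following lines; the data layout (parallel lists of instance and image areas per class) is an assumption of this sketch rather than the released annotation format.

```python
import numpy as np

def average_scale(instance_areas, image_areas):
    """Average fraction of the image area occupied by one object class,
    given parallel lists over all its annotated instances: the bounding
    box area of each instance and the area of the image it appears in."""
    fractions = np.asarray(instance_areas, dtype=float) / np.asarray(image_areas, dtype=float)
    return fractions.mean()

def scale_accuracy_correlation(per_class_scale, per_class_accuracy):
    """Pearson correlation between average object scale and the per-class
    accuracy of the "optimistic" model (both indexed by object class)."""
    return np.corrcoef(per_class_scale, per_class_accuracy)[0, 1]
```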
Fig. 11  (Figure panels: image classification – easiest and hardest classes; single-object localization – easiest and hardest classes; object detection – easiest and hardest classes.) For each object category, we take the best performance of any entry submitted to ILSVRC2012-2014 (including entries using additional training data). Given these “optimistic” results we show the easiest and hardest classes for each task, i.e., classes with the best and worst results. The numbers in parentheses indicate classification accuracy, localization accuracy, and detection average precision for each task respectively. For image classification the 10 easiest classes are randomly selected from among the 121 object classes with 100% accuracy.

Fig. 12  Performance of the “optimistic” method as a function of object scale in the image, on each task. Each dot corresponds to one object class. Average scale (x-axis) is computed as the average fraction of the image area occupied by an instance of that object class on the ILSVRC2014 validation set. “Optimistic” performance (y-axis) corresponds to the best performance on the test set of any entry submitted to ILSVRC2012-2014 (including entries with additional training data). The test set has remained the same over these three years. We see that accuracy tends to increase as the objects get bigger in the image. However, it is clear that far from all the variation in accuracy on these classes can be accounted for by scale alone.

6.3.4 Per-class accuracy as a function of object properties

Besides considering image-level properties we can also observe how accuracy changes as a function of
intrinsic object properties. We define three properties inspired by human vision: the real-world size of the object, whether it is deformable within instance, and how textured it is. For each property, the object classes are assigned to one of a few bins (listed below). These properties are illustrated in Figure 1.

Human subjects annotated each of the 1000 image classification and single-object localization object classes from ILSVRC2012-2014 with these properties (Russakovsky et al., 2013). By construction (see Section 3.3.1), each of the 200 object detection classes is either also one of the 1000 object classes or is an ancestor of one or more of the 1000 classes in the ImageNet hierarchy. To compute the values of the properties for each object detection class, we simply average the annotated values of the descendant classes.

In this section we draw the following conclusions about state-of-the-art recognition accuracy as a function of these object properties:

– Real-world size: XS for extra small (e.g. nail), small (e.g. fox), medium (e.g. bookcase), large (e.g. car) or XL for extra large (e.g. church). The image classification and single-object localization “optimistic” models perform better on large and extra large real-world objects than on smaller ones. The “optimistic” object detection model surprisingly performs better on extra small objects than on small or medium ones.

– Deformability within instance: rigid (e.g., mug) or deformable (e.g., water snake). The “optimistic” model on each of the three tasks performs statistically significantly better on deformable objects compared to rigid ones. However, this effect disappears when analyzing natural objects separately from man-made objects.

– Amount of texture: none (e.g. punching bag), low (e.g. horse), medium (e.g. sheep) or high (e.g. honeycomb). The “optimistic” model on each of the three tasks is significantly better on objects with at least a low level of texture compared to untextured objects.

These and other findings are justified and discussed in detail below.

Experimental setup. We observed in Section 6.3.3 that objects that occupy a larger area in the image tend to be somewhat easier to recognize. To make sure that differences in object scale are not influencing the results in this section, we normalize each bin by object scale. We discard object classes with the largest scales from each bin as needed until the average object scale of the object classes in each bin across one property is the same (or as close as possible). For the real-world size property, for example, the resulting average object scale in each of the five bins is 31.6%−31.7% in the image classification and single-object localization tasks, and 12.9%−13.4% in the object detection task.10

10 For rigid versus deformable objects, the average scale in each bin is 34.1%−34.2% for classification and localization, and 13.5%−13.7% for detection. For texture, the average scale in each of the four bins is 31.1%−31.3% for classification and localization, and 12.7%−12.8% for detection.

Figure 13 shows the average performance of the “optimistic” model on the object classes that fall into each bin for each property. We analyze the results in detail below. Unless otherwise specified, the reported accuracies below are after the scale normalization step.

To evaluate statistical significance, we compute the 95% confidence interval for accuracy using bootstrapping: we repeatedly sample the object classes within the bin with replacement, discard some as needed to normalize by scale, and compute the average accuracy of the “optimistic” model on the remaining classes. We report the 95% confidence intervals (CI) in parentheses.

Real-world size. In Figure 13 (top, left) we observe that in the image classification task the “optimistic” model tends to perform significantly better on objects which are larger in the real world. The classification accuracy is 93.6%−93.9% on XS, S and M objects compared to 97.0% on L and 96.4% on XL objects. Since this is after normalizing for scale and thus cannot be explained by the objects’ size in the image, we conclude that either (1) larger real-world objects are easier for the model to recognize, or (2) larger real-world objects usually occur in images with very distinctive backgrounds.

To distinguish between the two cases we look at Figure 13 (top, middle). We see that in the single-object localization task, the L objects are easy to localize, at 82.4% localization accuracy. XL objects, however, tend to be the hardest to localize, with only 73.4% localization accuracy. We conclude that the appearance of L objects must be easier for the model to learn, while XL objects tend to appear in distinctive backgrounds. The image background makes these XL classes easier for the image-level classifier, but the individual instances are difficult to accurately localize. Some examples of L objects are “killer whale,” “schooner,” and “lion,” and some examples of XL objects are “boathouse,” “mosque,” “toyshop” and “steel arch bridge.”

In Figure 13 (top, right), corresponding to the object detection task, the influence of real-world object size is not as apparent. One of the key reasons is that many of the XL and L object classes of the image classification and single-object localization datasets were removed in
In Figure 13(top, right), corresponding to the object detection task, the influence of real-world object size is not as apparent. One of the key reasons is that many of the XL and L object classes of the image classification and single-object localization datasets were removed in constructing the detection dataset (Section 3.3.1) since they were not basic categories well-suited for detection. There were only 3 XL object classes remaining in the dataset (“train,” “airplane” and “bus”), and none after scale normalization. We omit them from the analysis. The average precision of XS, S and M objects (44.5%, 39.0%, and 38.5% mAP respectively) is not statistically significantly different from the average precision on L objects: the 95% confidence interval for L objects is 37.5%−59.5%. This may be due to the fact that there are only 6 L object classes remaining after scale normalization; all other real-world size bins have at least 18 object classes.

Finally, it is interesting that performance on XS objects, at 44.5% mAP (CI 40.5%−47.6%), is statistically significantly better than performance on S or M objects, with 39.0% mAP and 38.5% mAP respectively. Some examples of XS objects are “strawberry,” “bow tie” and “rugby ball.”

Fig. 13 Performance of the “optimistic” computer vision model as a function of object properties. The x-axis corresponds to object properties annotated by human labelers for each object class (Russakovsky et al., 2013) and illustrated in Figure 1; the panel groups are, from top to bottom, real-world size, deformability within instance, and amount of texture. The y-axis is the average accuracy of the “optimistic” model. Note that the range of the y-axis is different for each task to make the trends more visible. The black circle is the average accuracy of the model on all object classes that fall into each bin. We control for the effects of object scale by normalizing the object scale within each bin (details in Section 6.3.4). The color bars show the average performance of the remaining classes, and the error bars show the 95% confidence interval obtained with bootstrapping. Some bins are missing color bars because fewer than 5 object classes remained in the bin after scale normalization. For example, the bar for XL real-world object detection classes is missing because that bin has only 3 object classes (airplane, bus, train) and after normalizing by scale no classes remain.
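The error bars in Figure 13, and the confidence intervals quoted throughout this section, are 95% intervals obtained by bootstrapping over the object classes within each bin. A minimal sketch of such a percentile bootstrap is given below; the number of resamples, the fixed seed and the function names are illustrative assumptions rather than details taken from the paper.

# Minimal sketch of a percentile bootstrap over per-class accuracies, as used
# for the 95% error bars in Figure 13. The number of resamples is an assumption.
import random
from statistics import mean

def bootstrap_ci(per_class_accuracy, num_resamples=10000, alpha=0.05, seed=0):
    """per_class_accuracy: list of accuracies, one per object class in the bin."""
    rng = random.Random(seed)
    n = len(per_class_accuracy)
    resampled_means = sorted(
        mean(rng.choices(per_class_accuracy, k=n)) for _ in range(num_resamples)
    )
    lo = resampled_means[int(alpha / 2 * num_resamples)]
    hi = resampled_means[int((1 - alpha / 2) * num_resamples) - 1]
    return lo, hi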
Deformability within instance. In Figure 13(second row) it is clear that the “optimistic” model performs statistically significantly worse on rigid objects than on deformable objects. Image classification accuracy is 93.2% on rigid objects (CI 92.6%−93.8%), much lower than 95.7% on deformable ones. Single-object localization accuracy is 76.2% on rigid objects (CI 74.9%−77.4%), much lower than 84.7% on deformable ones. Object detection mAP is 40.1% on rigid objects (CI 37.2%−42.9%), much lower than 44.8% on deformable ones.

We can further analyze the effects of deformability after separating object classes into “natural” and “man-made” bins based on the ImageNet hierarchy. Deformability is highly correlated with whether the object is natural or man-made: 0.72 correlation for image classification and single-object localization classes, and 0.61 for object detection classes. Figure 13(third row) shows the effect of deformability on the performance of the model for man-made and natural objects separately.

Man-made classes are significantly harder than natural classes: classification accuracy is 92.8% (CI 92.3%−93.3%) for man-made versus 97.0% for natural, localization accuracy is 75.5% (CI 74.3%−76.5%) for man-made versus 88.5% for natural, and detection mAP is 38.7% (CI 35.6%−41.3%) for man-made versus 50.9% for natural. However, whether the classes are rigid or deformable within this subdivision is no longer significant in most cases. For example, the image classification accuracy is 92.3% (CI 91.4%−93.1%) on man-made rigid objects and 91.8% on man-made deformable objects – not a statistically significant difference.

There are two cases where the differences in performance are statistically significant. First, for single-object localization, natural deformable objects are easier than natural rigid objects: localization accuracy of 87.9% (CI 85.9%−90.1%) on natural deformable objects is higher than 85.8% on natural rigid objects – falling slightly outside the 95% confidence interval. This difference in performance is likely because deformable natural animals tend to be easier to localize than rigid natural fruit.

Second, for object detection, man-made rigid objects are easier than man-made deformable objects: 38.5% mAP (CI 35.2%−41.7%) on man-made rigid objects is higher than 33.0% mAP on man-made deformable objects. This is because man-made rigid objects include classes like “traffic light” or “car” whereas the man-made deformable objects contain challenging classes like “plastic bag,” “swimming trunks” or “stethoscope.”

Amount of texture. Finally, we analyze the effect that object texture has on the accuracy of the “optimistic” model. Figure 13(fourth row) demonstrates that the model performs better as the amount of texture on the object increases. The most significant difference is between the performance on untextured objects and the performance on objects with low texture. Image classification accuracy is 90.5% on untextured objects (CI 89.3%−91.6%), lower than 94.6% on low-textured objects. Single-object localization accuracy is 71.4% on untextured objects (CI 69.1%−73.3%), lower than 80.2% on low-textured objects. Object detection mAP is 33.2% on untextured objects (CI 29.5%−35.9%), lower than 42.9% on low-textured objects.

Texture is correlated with whether the object is natural or man-made, at 0.35 correlation for image classification and single-object localization, and 0.46 correlation for object detection. To determine if this is a contributing factor, in Figure 13(bottom row) we break up the object classes into natural and man-made and show the accuracy on objects with no texture versus objects with low texture. We observe that the model is still statistically significantly better on low-textured object classes than on untextured ones, both on man-made and natural object classes independently.11

11 Natural object detection classes are removed from this analysis because there are only 3 and 13 natural untextured and low-textured classes respectively, and none remain after scale normalization. All other bins contain at least 9 object classes after scale normalization.

6.4 Human accuracy on large-scale image classification

Recent improvements in state-of-the-art accuracy on the ILSVRC dataset are easier to put in perspective when compared to human-level accuracy. In this section we compare the performance of the leading large-scale image classification method with the performance of humans on this task.

To support this comparison, we developed an interface that allowed a human labeler to annotate images with up to five ILSVRC target classes. We compare human errors to those of the winning ILSVRC2014 image classification model, GoogLeNet (Section 5.1). For this analysis we use a random sample of 1500 ILSVRC2012-2014 image classification test set images.

Annotation interface. Our web-based annotation interface consists of one test set image and a list of 1000 ILSVRC categories on the side. Each category is described by its title, such as “cowboy boot.” The categories are sorted in the topological order of the ImageNet hierarchy, which places semantically similar concepts nearby in the list. For example, all motor vehicle-related classes are arranged contiguously in the list. Every class category is additionally accompanied by a row of 13 example images from the training set to allow for faster visual scanning. The user of the interface selects 5 categories from the list by clicking on the desired items. Since our interface is web-based, it allows for natural scrolling through the list, and also search by text.

Annotation protocol. We found the task of annotating images with one of 1000 categories to be an extremely challenging task for an untrained annotator. The most common error that an untrained annotator is susceptible to is a failure to consider a relevant class as a possible label because they are unaware of its existence. Therefore, in evaluating the human accuracy we relied primarily on expert annotators who learned to recognize a large portion of the 1000 ILSVRC classes. During training, the annotators labeled a few hundred validation images for practice and later switched to the test set images.

6.4.1 Quantitative comparison of human and computer accuracy on large-scale image classification

We report results based on experiments with two expert annotators. The first annotator (A1) trained on 500 images and annotated 1500 test images. The second annotator (A2) trained on 100 images and then annotated 258 test images. The average pace of labeling was approximately 1 image per minute, but the distribution is strongly bimodal: some images are quickly recognized, while some images (such as those of fine-grained breeds of dogs, birds, or monkeys) may require multiple minutes of concentrated effort.

The results are reported in Table 9.

Relative Confusion                          A1      A2
Human succeeds, GoogLeNet succeeds        1352     219
Human succeeds, GoogLeNet fails             72       8
Human fails, GoogLeNet succeeds             46      24
Human fails, GoogLeNet fails                30       7
Total number of images                    1500     258
Estimated GoogLeNet classification error  6.8%    5.8%
Estimated human classification error      5.1%   12.0%

Table 9 Human classification results on the ILSVRC2012-2014 classification test set, for two expert annotators A1 and A2. We report top-5 classification error.
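Table 9 reports top-5 classification error for both the model and the annotators. Since each image in the classification task has a single ground-truth label, top-5 error simply checks whether that label appears among the five allowed predictions. A minimal illustrative sketch, with all function and variable names our own:

# Minimal sketch of top-5 classification error: an image counts as correct if any
# of its (at most) five predicted labels matches the single ground-truth label.
def top5_error(predictions, ground_truth):
    """predictions: dict image_id -> list of up to 5 predicted class labels.
    ground_truth: dict image_id -> true class label."""
    wrong = sum(1 for img, true_label in ground_truth.items()
                if true_label not in predictions.get(img, [])[:5])
    return wrong / len(ground_truth)

# e.g. top5_error({"im1": ["dog", "cat"]}, {"im1": "cat"}) == 0.0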
Annotator 1. Annotator A1 evaluated a total of 1500 test set images. The GoogLeNet classification error on this sample was estimated to be 6.8% (recall that the error on the full test set of 100,000 images is 6.7%, as shown in Table 7). The human error was estimated to be 5.1%. Thus, annotator A1 achieves a performance superior to GoogLeNet, by approximately 1.7%. We can analyze the statistical significance of this result under the null hypothesis that they are from the same distribution. In particular, comparing the two proportions with a z-test yields a one-sided p-value of p = 0.022. Thus, we can conclude that this result is statistically significant at the 95% confidence level.

Annotator 2. Our second annotator (A2) trained on a smaller sample of only 100 images and then labeled 258 test set images. As seen in Table 9, the final classification error is significantly worse, at approximately 12.0% top-5 error. The majority of these errors (48.8%) can be attributed to the annotator failing to spot and consider the ground truth label as an option. Thus, we conclude that a significant amount of training time is necessary for a human to achieve competitive performance on ILSVRC. However, with a sufficient amount of training, a human annotator is still able to outperform the GoogLeNet result (p = 0.022) by approximately 1.7%.

Annotator comparison. We also compare the prediction accuracy of the two annotators. Of a total of 204 images that both A1 and A2 labeled, 174 (85%) were correctly labeled by both A1 and A2, 19 (9%) were correctly labeled by A1 but not A2, 6 (3%) were correctly labeled by A2 but not A1, and 5 (2%) were incorrectly labeled by both. These include 2 images that we consider to be incorrectly labeled in the ground truth.

In particular, our results suggest that the human annotators do not exhibit strong overlap in their predictions. We can approximate the performance of an “optimistic” human classifier by assuming an image to be correct if at least one of A1 or A2 correctly labeled the image. On this sample of 204 images, we approximate the error rate of an “optimistic” human annotator at 2.4%, compared to the GoogLeNet error rate of 4.9%.

6.4.2 Analysis of human and computer errors on large-scale image classification

We manually inspected both human and GoogLeNet errors to gain an understanding of common error types and how they compare. For purposes of this section, we only discuss results based on the larger sample of 1500 images that were labeled by annotator A1. Examples of representative mistakes can be found in Figure 14. The analysis and insights below were derived specifically from GoogLeNet predictions, but we suspect that many of the same errors may be present in other methods.

Fig. 14 Representative validation images that highlight common sources of error. For each image, we display the ground truth in blue, and the top 5 predictions from GoogLeNet follow (red = wrong, green = right). GoogLeNet predictions on the validation set images were graciously provided by members of the GoogLeNet team. From left to right: images that contain multiple objects, images of extreme closeups and uncharacteristic views, images with filters, images that significantly benefit from the ability to read text, images that contain very small and thin objects, images with abstract representations, and an example of a fine-grained image that GoogLeNet correctly identifies but a human would have significant difficulty with.

Types of errors in both computer and human annotations:

1. Multiple objects. Both GoogLeNet and humans struggle with images that contain multiple ILSVRC classes (usually many more than five), with little indication of which object is the focus of the image. This error is only present in the classification setting, since every image is constrained to have exactly one correct label. In total, we attribute 24 (24%) of GoogLeNet errors and 12 (16%) of human errors to this category. It is worth noting that humans can have a slight advantage in this error type, since it can sometimes be easy to identify the most salient object in the image.
2. Incorrect annotations. We found that approximately 5 out of 1500 images (0.3%) were incorrectly annotated in the ground truth. This introduces an approximately equal number of errors for both humans and GoogLeNet.

Types of errors that the computer is more susceptible to than the human:

1. Object small or thin. GoogLeNet struggles with recognizing objects that are very small or thin in the image, even if that object is the only object present. Examples of this include an image of a standing person wearing sunglasses, a person holding a quill in their hand, or a small ant on a stem of a flower. We estimate that approximately 22 (21%) of GoogLeNet errors fall into this category, while none of the human errors do. In other words, in our sample of images, no image was mislabeled by a human because they were unable to identify a very small or thin object. This discrepancy can be attributed to the fact that a human can very effectively leverage context and affordances to accurately infer the
identity of small objects (for example, a few barely visible feathers near a person’s hand as very likely belonging to a mostly occluded quill).
2. Image filters. Many people enhance their photos with filters that distort the contrast and color distributions of the image. We found that 13 (13%) of the images that GoogLeNet incorrectly classified contained a filter. Thus, we posit that GoogLeNet is not very robust to these distortions. In comparison, only one image among the human errors contained a filter, but we do not attribute the source of the error to the filter.
3. Abstract representations. GoogLeNet struggles with images that depict objects of interest in an abstract form, such as 3D-rendered images, paintings, sketches, plush toys, or statues. An example is the abstract shape of a bow drawn with a light source in night photography, a 3D-rendered robotic scorpion, or a shadow on the ground of a child on a swing. We attribute approximately 6 (6%) of GoogLeNet errors to this type of error and believe that humans are significantly more robust, with no such errors seen in our sample.
4. Miscellaneous sources. Additional sources of error that occur relatively infrequently include extreme closeups of parts of an object, unconventional viewpoints such as a rotated image, images that can significantly benefit from the ability to read text (e.g. a featureless container identifying itself as “face powder”), objects with heavy occlusions, and images that depict a collage of multiple images. In general, we found that humans are more robust to all of these types of error.

Types of errors that the human is more susceptible to than the computer:

1. Fine-grained recognition. We found that humans are noticeably worse at fine-grained recognition (e.g. dogs, monkeys, snakes, birds), even when the objects are in clear view. To understand the difficulty, consider that there are more than 120 species of dogs in the dataset. We estimate that 28 (37%) of the human errors fall into this category, while only 7 (7%) of GoogLeNet errors do.
2. Class unawareness. The annotator may sometimes be unaware of the ground truth class present as a label option. When pointed out as an ILSVRC class, it is usually clear that the label applies to the image. These errors get progressively less frequent as the annotator becomes more familiar with ILSVRC classes. Approximately 18 (24%) of the human errors fall into this category.
3. Insufficient training data. Recall that the annotator is only presented with 13 examples of a class under every category name. However, 13 images are not always enough to adequately convey the allowed class variations. For example, a brown dog can be incorrectly dismissed as a “Kelpie” if all examples of a “Kelpie” feature a dog with a black coat. However, if more than 13 images were listed it would have become clear that a “Kelpie” may have a brown coat. Approximately 4 (5%) of human errors fall into this category.

6.4.3 Conclusions from human image classification experiments

We investigated the performance of trained human annotators on a sample of 1500 ILSVRC test set images. Our results indicate that a trained human annotator is capable of outperforming the best model (GoogLeNet) by approximately 1.7% (p = 0.022).
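The p = 0.022 figure quoted here and in Section 6.4.1 comes from a one-sided z-test comparing two proportions on the same sample of images. The paper does not specify the exact variant used, so the following pooled-variance formulation is illustrative rather than a reproduction of the authors' computation.

# Illustrative one-sided two-proportion z-test (pooled variance) for comparing
# the human and GoogLeNet error rates on the same n images; the exact test
# variant used in the paper is not specified, so this is one standard choice.
from math import sqrt, erf

def one_sided_z_test(err_a, err_b, n):
    """err_a, err_b: error rates of the two classifiers on the same n images."""
    p_pool = (err_a + err_b) / 2            # pooled proportion (equal sample sizes)
    se = sqrt(2 * p_pool * (1 - p_pool) / n)
    z = (err_b - err_a) / se
    # One-sided p-value: probability of a z-score this large under the null.
    return 0.5 * (1 - erf(z / sqrt(2)))

# With the rounded numbers from Section 6.4.1 (5.1% human vs. 6.8% GoogLeNet,
# n = 1500) this gives a value close to the reported p = 0.022; the published
# number was presumably computed from unrounded error counts.
print(one_sided_z_test(0.051, 0.068, 1500))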
We expect that some sources of error may be relatively easily eliminated (e.g. robustness to filters, rotations, collages, effectively reasoning over multiple scales), while others may prove more elusive (e.g. identifying abstract representations of objects). On the other hand, a large majority of human errors come from fine-grained categories and class unawareness. We expect that the former can be significantly reduced with fine-grained expert annotators, while the latter could be reduced with more practice and greater familiarity with ILSVRC classes. Our results also hint that human errors are not strongly correlated and that human ensembles may further reduce the human error rate.

It is clear that humans will soon be able to outperform state-of-the-art ILSVRC image classification models only by use of significant effort, expertise, and time. One interesting follow-up question for future investigation is how computer-level accuracy compares with human-level accuracy on more complex image understanding tasks.

7 Conclusions

In this paper we described the large-scale data collection process of ILSVRC, provided a summary of the most successful algorithms on this data, and analyzed the success and failure modes of these algorithms. In this section we discuss some of the key lessons we learned over the years of ILSVRC, strive to address the key criticisms of the dataset and the challenge we encountered over the years, and conclude by looking forward into the future.

7.1 Lessons learned

The key lesson of collecting the dataset and running the challenge for five years is this: All human intelligence tasks need to be exceptionally well-designed. We learned this lesson both when annotating the dataset using Amazon Mechanical Turk workers (Section 3) and even when trying to evaluate human-level image classification accuracy using expert labelers (Section 6.4). The first iteration of the labeling interface was always bad – generally meaning completely unusable. If there was any inherent ambiguity in the questions posed (and there almost always was), workers found it and accuracy suffered. If there is one piece of advice we can offer to future research, it is to very carefully design, continuously monitor, and extensively sanity-check all crowdsourcing tasks.

The other lesson, already well-known to large-scale researchers, is this: Scaling up the dataset always reveals unexpected challenges. From designing complicated multi-step annotation strategies (Section 3.2.1) to having to modify the evaluation procedure (Section 4), we had to continuously adjust to the large-scale setting. On the plus side, of course, the major breakthroughs in object recognition accuracy (Section 5) and the analysis of the strengths and weaknesses of current algorithms as a function of object class properties (Section 6.3) would never have been possible on a smaller scale.

7.2 Criticism

In the past five years, we encountered three major criticisms of the ILSVRC dataset and the corresponding challenge: (1) the ILSVRC dataset is insufficiently challenging, (2) the ILSVRC dataset contains annotation errors, and (3) the rules of the ILSVRC competition are too restrictive. We discuss these in order.

The first criticism is that the objects in the dataset tend to be large and centered in the images, making the dataset insufficiently challenging. In Sections 3.2.2 and 3.3.4 we tried to put those concerns to rest by analyzing the statistics of the ILSVRC dataset and concluding that it is comparable with, and in many cases much more challenging than, the long-standing PASCAL VOC benchmark (Everingham et al., 2010).

The second criticism regards errors in the ground truth labeling. We went through several rounds of in-house post-processing of the annotations obtained using crowdsourcing, and corrected many common sources of errors (e.g., Appendix D). The major remaining source of annotation errors stems from fine-grained object classes, e.g., labelers failing to distinguish different species of birds. This is a tradeoff that had to be made: in order to annotate data at this scale on a reasonable budget, we had to rely on non-expert crowd labelers. However, overall the dataset is encouragingly clean. By our estimates, 99.7% precision is achieved in the image classification dataset (Sections 3.1.3 and 6.4), and 97.9% of images that went through the bounding box annotation system have all instances of the target object class labeled with bounding boxes (Section 3.2.1).

The third criticism we encountered concerns the rules of the competition regarding the use of external training data. In ILSVRC2010-2013, algorithms had to use only the provided training and validation set images and annotations for training their models. With the growth of the field of large-scale unsupervised feature learning, however, questions began to arise about what exactly constitutes “outside” data: for example, are image features trained on a large pool of “outside” images in an unsupervised fashion allowed in the competition? After much discussion, in ILSVRC2014 we took the first

step towards addressing this problem. We followed the We are eagerly awaiting the future development of
PASCAL VOC strategy and created two tracks in the object recognition datasets and algorithms, and are grate-
competition: entries using only “provided” data and en- ful that ILSVRC served as a stepping stone on this
tries using “outside” data, meaning any images or an- path.
notations not provided as part of ILSVRC training or
validation sets. However, in the future this strategy will Acknowledgements We thank Stanford University, UNC
likely need to be further revised as the computer vision Chapel Hill, Google and Facebook for sponsoring the chal-
field evolves. For example, competitions can consider lenges, and NVIDIA for providing computational resources
to participants of ILSVRC2014. We thank our advisors over
allowing the use of any image features which are publi- the years: Lubomir Bourdev, Alexei Efros, Derek Hoiem, Ji-
cally available, even these features were learned on an tendra Malik, Chuck Rosenberg and Andrew Zisserman. We
external source of data. thank the PASCAL VOC organizers for partnering with us
in running ILSVRC2010-2012. We thank all members of the
Stanford vision lab for supporting the challenge and putting
up with us along the way. Finally, and most importantly, we
thank all researchers that have made the ILSVRC effort a suc-
7.3 The future
cess by competing in the challenges and by using the datasets
to advance computer vision.
Given the massive algorithmic breakthroughs over the
past five years, we are very eager to see what will hap-
pen in the next five years. There are many potential
directions of improvement and growth for ILSVRC and Appendix A ILSVRC2012-2014 image
other large-scale image datasets. classification and single-object localization
First, continuing the trend of moving towards richer object categories
image understanding (from image classification to single- abacus, abaya, academic gown, accordion, acorn, acorn squash, acoustic gui-
tar, admiral, affenpinscher, Afghan hound, African chameleon, African crocodile,
object localization to object detection), the next chal- African elephant, African grey, African hunting dog, agama, agaric, aircraft car-
rier, Airedale, airliner, airship, albatross, alligator lizard, alp, altar, ambulance,
lenge would be to tackle pixel-level object segmenta- American alligator, American black bear, American chameleon, American coot,
American egret, American lobster, American Staffordshire terrier, amphibian,
tion. The recently released large-scale COCO dataset (Lin analog clock, anemone fish, Angora, ant, apiary, Appenzeller, apron, Arabian
camel, Arctic fox, armadillo, artichoke, ashcan, assault rifle, Australian terrier,
et al., 2014b) is already taking a step in that direction. axolotl, baboon, backpack, badger, bagel, bakery, balance beam, bald eagle, bal-
loon, ballplayer, ballpoint, banana, Band Aid, banded gecko, banjo, bannister,
Second, as datasets grow even larger in scale, it may barbell, barber chair, barbershop, barn, barn spider, barometer, barracouta, bar-
rel, barrow, baseball, basenji, basketball, basset, bassinet, bassoon, bath towel,
become impossible to fully annotate them manually. bathing cap, bathtub, beach wagon, beacon, beagle, beaker, bearskin, beaver,
Bedlington terrier, bee, bee eater, beer bottle, beer glass, bell cote, bell pepper,
The scale of ILSVRC is already imposing limits on the Bernese mountain dog, bib, bicycle-built-for-two, bighorn, bikini, binder, binoc-
ulars, birdhouse, bison, bittern, black and gold garden spider, black grouse, black
stork, black swan, black widow, black-and-tan coonhound, black-footed ferret,
manual annotations that we feasible to obtain: for ex- Blenheim spaniel, bloodhound, bluetick, boa constrictor, boathouse, bobsled,
bolete, bolo tie, bonnet, book jacket, bookcase, bookshop, Border collie, Border
ample, we had to restrict the number of objects labeled terrier, borzoi, Boston bull, bottlecap, Bouvier des Flandres, bow, bow tie, box
turtle, boxer, Brabancon griffon, brain coral, brambling, brass, brassiere, break-
per image in the image classification and single-object water, breastplate, briard, Brittany spaniel, broccoli, broom, brown bear, bub-
ble, bucket, buckeye, buckle, bulbul, bull mastiff, bullet train, bulletproof vest,
localization datasets. In the future, with billions of im- bullfrog, burrito, bustard, butcher shop, butternut squash, cab, cabbage butter-
fly, cairn, caldron, can opener, candle, cannon, canoe, capuchin, car mirror, car
ages, it will become impossible to obtain even one clean wheel, carbonara, Cardigan, cardigan, cardoon, carousel, carpenter’s kit, car-
ton, cash machine, cassette, cassette player, castle, catamaran, cauliflower, CD
label for every image. Datasets such as Yahoo’s Flickr player, cello, cellular telephone, centipede, chain, chain mail, chain saw, chain-
link fence, chambered nautilus, cheeseburger, cheetah, Chesapeake Bay retriever,
12
Creative Commons 100M, released with weak human chest, chickadee, chiffonier, Chihuahua, chime, chimpanzee, china cabinet, chi-
ton, chocolate sauce, chow, Christmas stocking, church, cicada, cinema, cleaver,
tags but no centralized annotation, will become more cliff, cliff dwelling, cloak, clog, clumber, cock, cocker spaniel, cockroach, cocktail
shaker, coffee mug, coffeepot, coho, coil, collie, colobus, combination lock, comic
common. book, common iguana, common newt, computer keyboard, conch, confectionery,
consomme, container ship, convertible, coral fungus, coral reef, corkscrew, corn,
The growth of unlabeled or only partially labeled cornet, coucal, cougar, cowboy boot, cowboy hat, coyote, cradle, crane, crane,
crash helmet, crate, crayfish, crib, cricket, Crock Pot, croquet ball, crossword
puzzle, crutch, cucumber, cuirass, cup, curly-coated retriever, custard apple,
large-scale datasets implies two things. First, algorithms daisy, dalmatian, dam, damselfly, Dandie Dinmont, desk, desktop computer,
dhole, dial telephone, diamondback, diaper, digital clock, digital watch, dingo,
will have to rely more on weakly supervised training dining table, dishrag, dishwasher, disk brake, Doberman, dock, dogsled, dome,
doormat, dough, dowitcher, dragonfly, drake, drilling platform, drum, drumstick,
data. Second, even evaluation might have to be done dugong, dumbbell, dung beetle, Dungeness crab, Dutch oven, ear, earthstar,
echidna, eel, eft, eggnog, Egyptian cat, electric fan, electric guitar, electric lo-
after the algorithms make predictions, not before. This comotive, electric ray, English foxhound, English setter, English springer, enter-
tainment center, EntleBucher, envelope, Eskimo dog, espresso, espresso maker,
means that rather than evaluating accuracy (how many European fire salamander, European gallinule, face powder, feather boa, fid-
dler crab, fig, file, fire engine, fire screen, fireboat, flagpole, flamingo, flat-
of the test images or objects did the algorithm get right) coated retriever, flatworm, flute, fly, folding chair, football helmet, forklift, foun-
tain, fountain pen, four-poster, fox squirrel, freight car, French bulldog, French
or recall (how many of the desired images or objects did horn, French loaf, frilled lizard, frying pan, fur coat, gar, garbage truck, gar-
den spider, garter snake, gas pump, gasmask, gazelle, German shepherd, Ger-
the algorithm manage to find), both of which require man short-haired pointer, geyser, giant panda, giant schnauzer, gibbon, Gila
monster, go-kart, goblet, golden retriever, goldfinch, goldfish, golf ball, golfcart,
a fully annotated test set, we will be focusing more on gondola, gong, goose, Gordon setter, gorilla, gown, grand piano, Granny Smith,
grasshopper, Great Dane, great grey owl, Great Pyrenees, great white shark,
precision: of the predictions that the algorithm made, Greater Swiss Mountain dog, green lizard, green mamba, green snake, green-
house, grey fox, grey whale, grille, grocery store, groenendael, groom, ground
how many were deemed correct by humans. beetle, guacamole, guenon, guillotine, guinea pig, gyromitra, hair slide, hair
spray, half track, hammer, hammerhead, hamper, hamster, hand blower, hand-
held computer, handkerchief, hard disc, hare, harmonica, harp, hartebeest, har-
vester, harvestman, hatchet, hay, head cabbage, hen, hen-of-the-woods, hermit
12 crab, hip, hippopotamus, hog, hognose snake, holster, home theater, honeycomb,
http://webscope.sandbox.yahoo.com/catalog.php? hook, hoopskirt, horizontal bar, hornbill, horned viper, horse cart, hot pot, hot-
datatype=i&did=67 dog, hourglass, house finch, howler monkey, hummingbird, hyena, ibex, Ibizan

hound, ice bear, ice cream, ice lolly, impala, Indian cobra, Indian elephant, in- Chance performance of localization (CPL). Chance per-
digo bunting, indri, iPod, Irish setter, Irish terrier, Irish water spaniel, Irish
wolfhound, iron, isopod, Italian greyhound, jacamar, jack-o’-lantern, jackfruit, formance on a dataset is a common metric to con-
jaguar, Japanese spaniel, jay, jean, jeep, jellyfish, jersey, jigsaw puzzle, jinrik-
isha, joystick, junco, keeshond, kelpie, Kerry blue terrier, killer whale, kimono,
king crab, king penguin, king snake, kit fox, kite, knee pad, knot, koala, Ko-
sider. We define the CPL measure as the expected ac-
modo dragon, komondor, kuvasz, lab coat, Labrador retriever, lacewing, ladle,
ladybug, Lakeland terrier, lakeside, lampshade, langur, laptop, lawn mower, leaf
curacy of a detector which first randomly samples an
beetle, leafhopper, leatherback turtle, lemon, lens cap, Leonberg, leopard, lesser
panda, letter opener, Lhasa, library, lifeboat, lighter, limousine, limpkin, liner,
object instance of that class and then uses its bounding
lion, lionfish, lipstick, little blue heron, llama, Loafer, loggerhead, long-horned
beetle, lorikeet, lotion, loudspeaker, loupe, lumbermill, lycaenid, lynx, macaque,
box directly as the proposed localization window on all
macaw, Madagascar cat, magnetic compass, magpie, mailbag, mailbox, mail-
lot, maillot, malamute, malinois, Maltese dog, manhole cover, mantis, maraca,
other images (after rescaling the images to the same
marimba, marmoset, marmot, mashed potato, mask, matchstick, maypole, maze,
measuring cup, meat loaf, medicine chest, meerkat, megalith, menu, Mexican size). Concretely, let B1 , B2 , . . . , BN be all the bound-
hairless, microphone, microwave, military uniform, milk can, miniature pinscher,
miniature poodle, miniature schnauzer, minibus, miniskirt, minivan, mink, mis- ing boxes of the object instances within a class, then
sile, mitten, mixing bowl, mobile home, Model T, modem, monarch, monastery,
mongoose, monitor, moped, mortar, mortarboard, mosque, mosquito net, mo-
tor scooter, mountain bike, mountain tent, mouse, mousetrap, moving van, mud turtle, mushroom, muzzle, nail, neck brace, necklace, nematode, Newfoundland, night snake, nipple, Norfolk terrier, Norwegian elkhound, Norwich terrier, notebook, obelisk, oboe, ocarina, odometer, oil filter, Old English sheepdog, orange, orangutan, organ, oscilloscope, ostrich, otter, otterhound, overskirt, ox,

CPL = \frac{\sum_i \sum_{j \neq i} \mathbf{1}\left[\mathrm{IOU}(B_i, B_j) \geq 0.5\right]}{N(N-1)} \qquad (6)
oxcart, oxygen mask, oystercatcher, packet, paddle, paddlewheel, padlock, paint-
brush, pajama, palace, panpipe, paper towel, papillon, parachute, parallel bars,
park bench, parking meter, partridge, passenger car, patas, patio, pay-phone,
peacock, pedestal, Pekinese, pelican, Pembroke, pencil box, pencil sharpener,
Some of the most difficult ILSVRC categories to lo-
perfume, Persian cat, Petri dish, photocopier, pick, pickelhaube, picket fence,
pickup, pier, piggy bank, pill bottle, pillow, pineapple, ping-pong ball, pinwheel,
calize according to this metric are basketball, swim-
pirate, pitcher, pizza, plane, planetarium, plastic bag, plate, plate rack, platy-
pus, plow, plunger, Polaroid camera, pole, polecat, police van, pomegranate,
ming trunks, ping pong ball and rubber eraser, all with
Pomeranian, poncho, pool table, pop bottle, porcupine, pot, potpie, potter’s
wheel, power drill, prairie chicken, prayer rug, pretzel, printer, prison, proboscis
less than 0.2% CPL. This measure correlates strongly
monkey, projectile, projector, promontory, ptarmigan, puck, puffer, pug, punch-
ing bag, purse, quail, quill, quilt, racer, racket, radiator, radio, radio telescope,
(ρ = 0.9) with the average scale of the object (fraction
rain barrel, ram, rapeseed, recreational vehicle, red fox, red wine, red wolf, red-
backed sandpiper, red-breasted merganser, redbone, redshank, reel, reflex cam- of image occupied by object). The average CPL across
era, refrigerator, remote control, restaurant, revolver, rhinoceros beetle, Rhode-
sian ridgeback, rifle, ringlet, ringneck snake, robin, rock beauty, rock crab, rock the 1000 ILSVRC categories is 20.8%. The 20 PASCAL
python, rocking chair, rotisserie, Rottweiler, rubber eraser, ruddy turnstone,
ruffed grouse, rugby ball, rule, running shoe, safe, safety pin, Saint Bernard, categories have an average CPL of 8.7%, which is the
saltshaker, Saluki, Samoyed, sandal, sandbar, sarong, sax, scabbard, scale, schip-
perke, school bus, schooner, scoreboard, scorpion, Scotch terrier, Scottish deer- same as the CPL of the 562 most difficult categories of
hound, screen, screw, screwdriver, scuba diver, sea anemone, sea cucumber, sea
lion, sea slug, sea snake, sea urchin, Sealyham terrier, seashore, seat belt, sewing ILSVRC.
machine, Shetland sheepdog, shield, Shih-Tzu, shoe shop, shoji, shopping bas-
ket, shopping cart, shovel, shower cap, shower curtain, siamang, Siamese cat,
Siberian husky, sidewinder, silky terrier, ski, ski mask, skunk, sleeping bag,
slide rule, sliding door, slot, sloth bear, slug, snail, snorkel, snow leopard, snow-
mobile, snowplow, soap dispenser, soccer ball, sock, soft-coated wheaten ter-
Clutter. Intuitively, even small objects are easy to lo-
rier, solar dish, sombrero, sorrel, soup bowl, space bar, space heater, space
shuttle, spaghetti squash, spatula, speedboat, spider monkey, spider web, spin-
calize on a plain background. To quantify clutter we
dle, spiny lobster, spoonbill, sports car, spotlight, spotted salamander, squirrel
monkey, Staffordshire bullterrier, stage, standard poodle, standard schnauzer,
employ the objectness measure of (Alexe et al., 2012),
starfish, steam locomotive, steel arch bridge, steel drum, stethoscope, stingray,
stinkhorn, stole, stone wall, stopwatch, stove, strainer, strawberry, street sign,
which is a class-generic object detector evaluating how
streetcar, stretcher, studio couch, stupa, sturgeon, submarine, suit, sulphur but-
terfly, sulphur-crested cockatoo, sundial, sunglass, sunglasses, sunscreen, suspen-
likely a window in the image contains a coherent ob-
sion bridge, Sussex spaniel, swab, sweatshirt, swimming trunks, swing, switch,
syringe, tabby, table lamp, tailed frog, tank, tape player, tarantula, teapot, ject (of any class) as opposed to background (sky, wa-
teddy, television, tench, tennis ball, terrapin, thatch, theater curtain, thimble,
three-toed sloth, thresher, throne, thunder snake, Tibetan mastiff, Tibetan ter- ter, grass). For every image m containing target ob-
rier, tick, tiger, tiger beetle, tiger cat, tiger shark, tile roof, timber wolf, titi,
toaster, tobacco shop, toilet seat, toilet tissue, torch, totem pole, toucan, tow ject instances at positions B1m , B2m , . . . , we use the pub-
truck, toy poodle, toy terrier, toyshop, tractor, traffic light, trailer truck, tray,
tree frog, trench coat, triceratops, tricycle, trifle, trilobite, trimaran, tripod, tri- licly available objectness software to sample 1000 win-
umphal arch, trolleybus, trombone, tub, turnstile, tusker, typewriter keyboard,
umbrella, unicycle, upright, vacuum, valley, vase, vault, velvet, vending machine, dows W1m , W2m , . . . W1000
m
, in order of decreasing proba-
vestment, viaduct, vine snake, violin, vizsla, volcano, volleyball, vulture, waffle
iron, Walker hound, walking stick, wall clock, wallaby, wallet, wardrobe, war- bility of the window containing any generic object. Let
plane, warthog, washbasin, washer, water bottle, water buffalo, water jug, water
ouzel, water snake, water tower, weasel, web site, weevil, Weimaraner, Welsh obj(m) be the number of generic object-looking win-
springer spaniel, West Highland white terrier, whippet, whiptail, whiskey jug,
whistle, white stork, white wolf, wig, wild boar, window screen, window shade,
Windsor tie, wine bottle, wing, wire-haired fox terrier, wok, wolf spider, wom-
dows sampled before localizing an instance of the target
bat, wood rabbit, wooden spoon, wool, worm fence, wreck, yawl, yellow lady’s
slipper, Yorkshire terrier, yurt, zebra, zucchini
category, i.e., obj(m) = min{k : maxi iou(Wkm , Bim ) ≥
0.5}. For a category containing M images, we compute
the average number of such windows per image and de-
fine
Appendix B Additional single-object
localization dataset statistics

\mathrm{Clutter} = \log_2\left(\frac{1}{M} \sum_m \mathrm{obj}(m)\right) \qquad (7)
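Both Appendix B measures can be computed directly from bounding boxes and ranked objectness windows. The sketch below is illustrative only: the (x1, y1, x2, y2) box format, the helper names and the input structures are assumptions, and the candidate windows would come from the publicly available objectness software of Alexe et al. (2012) referenced in this appendix.

# Minimal sketch of the two Appendix B difficulty measures. The box format
# (x1, y1, x2, y2) in a common rescaled frame is an assumption.
from math import log2

def iou(a, b):
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0, ix2 - ix1) * max(0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def cpl(boxes):
    """Chance performance of localization, Eq. (6): fraction of ordered pairs of
    instances of a class whose boxes overlap with IOU >= 0.5."""
    n = len(boxes)
    if n < 2:
        return 0.0
    hits = sum(iou(boxes[i], boxes[j]) >= 0.5
               for i in range(n) for j in range(n) if i != j)
    return hits / (n * (n - 1))

def clutter(images):
    """Eq. (7). images: list of (target_boxes, objectness_windows) per image,
    with windows sorted by decreasing objectness; obj(m) = 1001 if no window
    localizes a target instance within the first 1000."""
    def obj(targets, windows):
        for k, w in enumerate(windows[:1000], start=1):
            if any(iou(w, b) >= 0.5 for b in targets):
                return k
        return 1001
    return log2(sum(obj(t, w) for t, w in images) / len(images))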

We consider two additional metrics of object localiza- The higher the clutter of a category, the harder the
tion difficulty: chance performance of localization and objects are to localize according to generic cues. If an
the level of clutter. We use these metrics to compare object can’t be localized with the first 1000 windows (as
ILSVRC2012-2014 single-object localization dataset to is the case for 1% of images on average per category in
the PASCAL VOC 2012 object detection benchmark. ILSVRC and 5% in PASCAL), we set obj(m) = 1001.
The measures of localization difficulty are computed on The fact that more than 95% of objects can be local-
the validation set of both datasets. According to both of ized with these windows imply that the objectness cue is
these measures of difficulty there is a subset of ILSVRC already quite strong, so objects that require many win-
which is as challenging as PASCAL but more than an dows on average will be extremely difficult to detect:
order of magnitude greater in size. e.g., ping pong ball (clutter of 9.57, or 758 windows

on average), basketball (clutter of 9.21), puck (clutter ◦ (29) guacamole


◦ (30) burrito
of 9.17) in ILSVRC. The most difficult object in PAS- ◦ (31) popsicle (ice cream or water ice on a small wooden stick)
◦ fruit
CAL is bottle with clutter score of 8.47. On average, ◦ (32) fig
◦ (33) pineapple, ananas
ILSVRC has clutter score of 3.59. The most difficult ◦ (34) banana
◦ (35) pomegranate
subset of ILSVRC with 250 object categories has an ◦ (36) apple
◦ (37) strawberry
order of magnitude more categories and the same aver- ◦ (38) orange
◦ (39) lemon
age amount of clutter (of 5.90) as the PASCAL dataset. ◦ vegetables
◦ (40) cucumber, cuke
◦ (41) artichoke, globe artichoke
◦ (42) bell pepper
◦ (43) head cabbage
◦ (44) mushroom
Appendix C Hierarchy of questions for full • items that run on electricity (plugged in or using batteries); including clocks,
microphones, traffic lights, computers, etc
image annotation ◦ (45) remote control, remote
◦ electronics that blow air
◦ (46) hair dryer, blow dryer
◦ (47) electric fan: a device for creating a current of air by movement of a
The following is a hierarchy of questions manually con- surface or surfaces (please do not consider hair dryers)
◦ electronics that can play music or amplify sound
structed for crowdsourcing full annotation of images ◦ (48) tape player
◦ (49) iPod
◦ (50) microphone, mike
with the presence or absence of 200 object detection ◦ computer and computer peripherals: mouse, laptop, printer, keyboard, etc
◦ (51) computer mouse
categories in ILSVRC2013 and ILSVRC2014. All ques- ◦ (52) laptop, laptop computer
◦ (53) printer (please do not consider typewriters to be printers)
tions are of the form “is there a ... in the image?” Ques- ◦ (54) computer keyboard
◦ (55) lamp
tions marked with • are asked on every image. If the ◦ electric cooking appliance (an appliance which generates heat to cook food
or boil water)
answer to a question is determined to be “no” then the ◦ (56) microwave, microwave oven
◦ (57) toaster
answer to all descendant questions is assumed to be ◦ (58) waffle iron
◦ (59) coffee maker: a kitchen appliance used for brewing coffee automati-
“no”. The 200 numbered leaf nodes correspond to the cally
◦ (60) vacuum, vacuum cleaner
200 object detection categories. ◦ (61) dishwasher, dish washer, dishwashing machine
◦ (62) washer, washing machine: an electric appliance for washing clothes
The goal in the hierarchy construction is to save ◦ (63) traffic light, traffic signal, stoplight
◦ (64) tv or monitor: an electronic device that represents information in visual
cost (by asking as few questions as possible on every form
◦ (65) digital clock: a clock that displays the time of day digitally
image) while avoiding any ambiguity in questions which • kitchen items: tools,utensils and appliances usually found in the kitchen
◦ electric cooking appliance (an appliance which generates heat to cook food
would lead to false negatives during annotation. This or boil water)
◦ (56) microwave, microwave oven
hierarchy is not tree-structured; some questions have ◦ (57) toaster
◦ (58) waffle iron
◦ (59) coffee maker: a kitchen appliance used for brewing coffee automati-
multiple parents. cally
◦ (61) dishwasher, dish washer, dishwashing machine
Hierarchy of questions: ◦ (66) stove
• first aid/ medical items ◦ things used to open cans/bottles: can opener or corkscrew
◦ (1) stethoscope ◦ (67) can opener (tin opener)
◦ (2) syringe ◦ (68) corkscrew
◦ (3) neck brace ◦ (69) cocktail shaker
◦ (4) crutch ◦ non-electric item commonly found in the kitchen: pot, pan, utensil, bowl,
◦ (5) stretcher etc
◦ (6) band aid: an adhesive bandage to cover small cuts or blisters ◦ (70) strainer
• musical instruments ◦ (71) frying pan (skillet)
◦ (7) accordion (a portable box-shaped free-reed instrument; the reeds are ◦ (72) bowl: a dish for serving food that is round, open at the top, and has
made to vibrate by air from the bellows controlled by the player) no handles (please do not confuse with a cup, which usually has a handle
◦ (8) piano, pianoforte, forte-piano and is used for serving drinks)
◦ percussion instruments: chimes, maraccas, drums, etc ◦ (73) salt or pepper shaker: a shaker with a perforated top for sprinkling
◦ (9) chime: a percussion instrument consisting of a set of tuned bells that salt or pepper
are struck with a hammer; used as an orchestral instrument ◦ (74) plate rack
◦ (10) maraca ◦ (75) spatula: a turner with a narrow flexible blade
◦ (11) drum ◦ (76) ladle: a spoon-shaped vessel with a long handle; frequently used to
◦ stringed instrument transfer liquids from one container to another
◦ (12) banjo, the body of a banjo is round. please do not confuse with guitar ◦ (77) refrigerator, icebox
◦ (13) cello: a large stringed instrument; seated player holds it upright while • furniture (including benches)
playing ◦ (78) bookshelf: a shelf on which to keep books
◦ (14) violin: bowed stringed instrument that has four strings, a hollow ◦ (79) baby bed: small bed for babies, enclosed by sides to prevent baby from
body, an unfretted fingerboard and is played with a bow. please do not falling
confuse with cello, which is held upright while playing ◦ (80) filing cabinet: office furniture consisting of a container for keeping
◦ (15) harp papers in order
◦ (16) guitar, please do not confuse with banjo. the body of a banjo is round ◦ (81) bench (a long seat for several people, typically made of wood or stone)
◦ wind instrument: a musical instrument in which the sound is produced by an ◦ (82) chair: a raised piece of furniture for one person to sit on; please do not
enclosed column of air that is moved by the breath (such as trumpet, french confuse with benches or sofas, which are made for more people
horn, harmonica, flute, etc) ◦ (83) sofa, couch: upholstered seat for more than one person; please do not
◦ (17) trumpet: a brass musical instrument with a narrow tube and a flared confuse with benches (which are made of wood or stone) or with chairs (which
bell, which is played by means of valves. often has 3 keys on top are for just one person)
◦ (18) french horn: a brass musical instrument consisting of a conical tube ◦ (84) table
that is coiled into a spiral, with a flared bell at the end • clothing, article of clothing: a covering designed to be worn on a person’s body
◦ (19) trombone: a brass instrument consisting of a long tube whose length ◦ (85) diaper: Garment consisting of a folded cloth drawn up between the legs
can be varied by a u-shaped slide and fastened at the waist; worn by infants to catch excrement
◦ (20) harmonica ◦ swimming attire: clothes used for swimming or bathing (swim suits, swim
◦ (21) flute: a high-pitched musical instrument that looks like a straight trunks, bathing caps)
tube and is usually played sideways (please do not confuse with oboes, which ◦ (86) swimming trunks: swimsuit worn by men while swimming
have a distinctive straw-like mouth piece and a slightly flared end) ◦ (87) bathing cap, swimming cap: a cap worn to keep hair dry while swim-
◦ (22) oboe: a slender musical instrument roughly 65cm long with metal ming or showering
keys, a distinctive straw-like mouthpiece and often a slightly flared end ◦ (88) maillot: a woman’s one-piece bathing suit
(please do not confuse with flutes) ◦ necktie: a man’s formal article of clothing worn around the neck (including
◦ (23) saxophone: a musical instrument consisting of a brass conical tube, bow ties)
often with a u-bend at the end ◦ (89) bow tie: a man’s tie that ties in a bow
• food: something you can eat or drink (includes growing fruit, vegetables and ◦ (90) tie: a long piece of cloth worn for decorative purposes around the
mushrooms, but does not include living animals) neck or shoulders, resting under the shirt collar and knotted at the throat
◦ food with bread or crust: pretzel, bagel, pizza, hotdog, hamburgers, etc (NOT a bow tie)
◦ (24) pretzel ◦ headdress, headgear: clothing for the head (hats, helmets, bathing caps, etc)
◦ (25) bagel, beigel ◦ (87) bathing cap, swimming cap: a cap worn to keep hair dry while swim-
◦ (26) pizza, pizza pie ming or showering
◦ (27) hotdog, hot dog, red hot ◦ (91) hat with a wide brim
◦ (28) hamburger, beefburger, burger ◦ (92) helmet: protective headgear made of hard material to resist blows

◦ (93) miniskirt, mini: a very short skirt • school supplies: rulers, erasers, pencil sharpeners, pencil boxes, binders
◦ (94) brassiere, bra: an undergarment worn by women to support their breasts ◦ (167) ruler,rule: measuring stick consisting of a strip of wood or metal or
◦ (95) sunglasses plastic with a straight edge that is used for drawing straight lines and mea-
• living organism (other than people): dogs, snakes, fish, insects, sea urchins, suring lengths
starfish, etc. ◦ (168) rubber eraser, rubber, pencil eraser
◦ living organism which can fly ◦ (169) pencil sharpener
◦ (96) bee ◦ (170) pencil box, pencil case
◦ (97) dragonfly ◦ (171) binder, ring-binder
◦ (98) ladybug • sports items: items used to play sports or in the gym (such as skis, raquets,
◦ (99) butterfly gymnastics bars, bows, punching bags, balls)
◦ (100) bird ◦ (172) bow: weapon for shooting arrows, composed of a curved piece of re-
◦ living organism which cannot fly (please don’t include humans) silient wood with a taut cord to propel the arrow
◦ living organism with 2 or 4 legs (please don’t include humans): ◦ (173) puck, hockey puck: vulcanized rubber disk 3 inches in diameter that
◦ mammals (but please do not include humans) is used instead of a ball in ice hockey
◦ feline (cat-like) animal: cat, tiger or lion ◦ (174) ski
◦ (101) domestic cat ◦ (175) racket, racquet
◦ (102) tiger ◦ gymnastic equipment: parallel bars, high beam, etc
◦ (103) lion ◦ (176) balance beam: a horizontal bar used for gymnastics which is raised
◦ canine (dog-like animal): dog, hyena, fox or wolf from the floor and wide enough to walk on
◦ (104) dog, domestic dog, canis familiaris ◦ (177) horizontal bar, high bar: used for gymnastics; gymnasts grip it with
◦ (105) fox: wild carnivorous mammal with pointed muzzle and ears their hands (please do not confuse with balance beam, which is wide enough
and a bushy tail (please do not confuse with dogs) to walk on)
◦ animals with hooves: camels, elephants, hippos, pigs, sheep, etc ◦ ball
◦ (106) elephant ◦ (178) golf ball
◦ (107) hippopotamus, hippo ◦ (179) baseball
◦ (108) camel ◦ (180) basketball
◦ (109) swine: pig or boar ◦ (181) croquet ball
◦ (110) sheep: woolly animal, males have large spiraling horns (please ◦ (182) soccer ball
do not confuse with antelope which have long legs) ◦ (183) ping-pong ball
◦ (111) cattle: cows or oxen (domestic bovine animals) ◦ (184) rugby ball
◦ (112) zebra ◦ (185) volleyball
◦ (113) horse ◦ (186) tennis ball
◦ (114) antelope: a graceful animal with long legs and horns directed ◦ (187) punching bag, punch bag, punching ball, punchball
upward and backward ◦ (188) dumbbell: An exercising weight; two spheres connected by a short bar
◦ (115) squirrel that serves as a handle
◦ (116) hamster: short-tailed burrowing rodent with large cheek pouches • liquid container: vessels which commonly contain liquids such as bottles, cans,
◦ (117) otter etc.
◦ (118) monkey ◦ (189) pitcher: a vessel with a handle and a spout for pouring
◦ (119) koala bear ◦ (190) beaker: a flatbottomed jar made of glass or plastic; used for chemistry
◦ (120) bear (other than pandas) ◦ (191) milk can
◦ (121) skunk (mammal known for its ability fo spray a liquid with a ◦ (192) soap dispenser
strong odor; they may have a single thick stripe across back and tail, ◦ (193) wine bottle
two thinner stripes, or a series of white spots and broken stripes ◦ (194) water bottle
◦ (122) rabbit ◦ (195) cup or mug (usually with a handle and usually cylindrical)
◦ (123) giant panda: an animal characterized by its distinct black and • bag
white markings ◦ (196) backpack: a bag carried by a strap on your back or shoulder
◦ (124) red panda: Reddish-brown Old World raccoon-like carnivore ◦ (197) purse: a small bag for carrying money
◦ (125) frog, toad ◦ (198) plastic bag
◦ (126) lizard: please do not confuse with snake (lizards have legs) • (199) person
◦ (127) turtle • (200) flower pot: a container in which plants are cultivated
◦ (128) armadillo
◦ (129) porcupine, hedgehog
◦ living organism with 6 or more legs: lobster, scorpion, insects, etc.
◦ (130) lobster: large marine crustaceans with long bodies and muscular
tails; three of their five pairs of legs have claws
◦ (131) scorpion Appendix D Modification to bounding box
◦ (132) centipede: an arthropod having a flattened body of 15 to 173
segments each with a pair of legs, the foremost pair being modified as system for object detection
prehensors
◦ (133) tick (a small creature with 4 pairs of legs which lives on the blood
of mammals and birds)
◦ (134) isopod: a small crustacean with seven pairs of legs adapted for The bounding box annotation system described in Sec-
crawling
◦ (135) ant tion 3.2.1 is used for annotating images for both the
◦ living organism without legs: fish, snake, seal, etc. (please don’t include
plants)
◦ living organism that lives in water: seal, whale, fish, sea cucumber, etc.
single-object localization dataset and the object de-
◦ (136) jellyfish
◦ (137) starfish, sea star
tection dataset. However, two additional manual post-
◦ (138) seal
◦ (139) whale
processing are needed to ensure accuracy in the object
◦ (140) ray: a marine animal with a horizontally flattened body and
enlarged winglike pectoral fins with gills on the underside
detection scenario:
◦ (141) goldfish: small golden or orange-red fishes
◦ living organism that slides on land: worm, snail, snake
◦ (142) snail
◦ (143) snake: please do not confuse with lizard (snakes do not have Ambiguous objects. The first common source of error
legs)
• vehicle: any object used to move people or objects from place to place was that workers were not able to accurately differenti-
◦ a vehicle with wheels
◦ (144) golfcart, golf cart ate some object classes during annotation. Some com-
◦ (145) snowplow: a vehicle used to push snow from roads
◦ (146) motorcycle (or moped) monly confused labels were seal and sea otter, backpack
◦ (147) car, automobile (not a golf cart or a bus)
◦ (148) bus: a vehicle carrying many passengers; used for public transport and purse, banjo and guitar, violin and cello, brass in-
◦ (149) train
◦ (150) cart: a heavy open wagon usually having two wheels and drawn by struments (trumpet, trombone, french horn and brass),
an animal
◦ (151) bicycle, bike: a two wheeled vehicle moved by foot pedals
◦ (152) unicycle, monocycle
flute and oboe, ladle and spatula. Despite our best ef-
◦ a vehicle without wheels (snowmobile, sleighs)
◦ (153) snowmobile: tracked vehicle for travel on snow
forts (providing positive and negative example images
◦ (154) watercraft (such as ship or boat): a craft designed for water trans-
portation
in the annotation task, adding text explanations to alert
◦ (155) airplane: an aircraft powered by propellers or jets
• cosmetics: toiletry designed to beautify the body
the user to the distinction between these categories)
◦ (156) face powder
◦ (157) perfume, essence (usually comes in a smaller bottle than hair spray
these errors persisted.
◦ (158) hair spray
◦ (159) cream, ointment, lotion In the single-object localization setting, this prob-
◦ (160) lipstick, lip rouge
• carpentry items: items used in carpentry, including nails, hammers, axes, lem was not as prominent for two reasons. First, the
screwdrivers, drills, chain saws, etc
◦ (161) chain saw, chainsaw way the data was collected imposed a strong prior on
◦ (162) nail: pin-shaped with a head on one end and a point on the other
◦ (163) axe: a sharp tool often used to cut trees/ logs the object class which was present. Second, since only
◦ (164) hammer: a blunt hand tool used to drive nails in or break things apart
(please do not confuse with axe, which is sharp) one object category needed to be annotated per image,
◦ (165) screwdriver
◦ (166) power drill: a power tool for drilling holes into hard materials ambiguous images could be discarded: for example, if

workers couldn't agree on whether or not a trumpet was in fact present, the image could simply be removed. In contrast, for the object detection setting consensus had to be reached for all target categories on all images.

To fix this problem, once bounding box annotations were collected we manually looked through all cases where the bounding boxes for two different object classes had significant overlap with each other (about 3% of the collected boxes). About a quarter of these boxes were found to correspond to incorrect objects and were removed. Crowdsourcing this post-processing step (with very stringent accuracy constraints) would be possible, but it occurred in few enough cases that it was faster (and more accurate) to do this in-house.

Duplicate annotations. The second common source of error was duplicate bounding boxes drawn on the same object instance. Despite instructions not to draw more than one bounding box around the same object instance, and despite constraints in the annotation UI enforcing at least a 5 pixel difference between bounding boxes, these errors persisted. One reason was that sometimes the initial bounding box was not perfect and subsequent labelers drew a slightly improved alternative.

This type of error was also present in the single-object localization scenario but was not a major cause for concern there. A duplicate bounding box is a slightly perturbed but still correct positive example, and single-object localization is only concerned with correctly localizing one object instance. For the detection task, however, algorithms are evaluated on their ability to localize every object instance and are penalized for duplicate detections, so it is imperative that these labeling errors be corrected (even though they appear in only about 0.6% of cases).

Approximately 1% of bounding boxes were found to have significant overlap of more than 50% with another bounding box of the same object class. We again manually verified all of these cases in-house. In approximately 40% of the cases the two bounding boxes correctly corresponded to two different objects: different people in a crowd, stacked plates, or musical instruments sitting near each other in an orchestra. In the other 60% of cases one of the boxes was randomly removed.

These verification steps complete the procedure for annotating bounding boxes around every instance of every object class in the validation set, the test set, and a subset of training images for the detection task.
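For illustration, the overlap test at the heart of these verification passes can be sketched in a few lines of code. The snippet below is only a schematic reconstruction, not the actual in-house tooling: it computes intersection-over-union for axis-aligned boxes and flags pairs of same-class boxes whose overlap exceeds 50% as candidates for manual review. The (x1, y1, x2, y2) box format and the function names are our assumptions.

def iou(box_a, box_b):
    # Intersection-over-union of two boxes given as (x1, y1, x2, y2).
    ix1, iy1 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix2, iy2 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0

def flag_duplicate_candidates(boxes, threshold=0.5):
    # boxes: list of (class_label, (x1, y1, x2, y2)) tuples for one image.
    # Returns index pairs of same-class boxes overlapping more than the
    # threshold; flagged pairs are reviewed manually, not removed blindly.
    flagged = []
    for i in range(len(boxes)):
        for j in range(i + 1, len(boxes)):
            label_i, rect_i = boxes[i]
            label_j, rect_j = boxes[j]
            if label_i == label_j and iou(rect_i, rect_j) > threshold:
                flagged.append((i, j))
    return flagged

The same pairwise loop, with the class-equality test inverted and a looser overlap criterion, corresponds to the earlier check for boxes of two different object classes with significant overlap.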
Training set annotation. With the optimized algorithm of Section 3.3.3 we fully annotated the validation and test sets. However, annotating all training images with all target object classes was still a budget challenge. Positive training images taken from the single-object localization dataset already had bounding box annotations of all instances of one object class on each image. We extended the existing annotations to the detection dataset by making two modifications. First, we corrected any bounding box omissions resulting from merging fine-grained categories: i.e., if an image belonged to the "dalmatian" category and all instances of "dalmatian" were annotated with bounding boxes for single-object localization, we ensured that all remaining "dog" instances were also annotated for the object detection task. Second, we collected significantly more training data for the person class, because the existing annotation set was not diverse enough to be representative (the only people categories in the single-object localization task are scuba diver, groom, and ballplayer). To compensate, we additionally annotated people in a large fraction of the existing training set images.

Appendix E Competition protocol

Competition format. At the beginning of the competition period each year we release the new training/validation/test images, the training/validation annotations, and the competition specification for the year. We then specify a deadline for submission, usually approximately 4 months after the release of the data. Teams are asked to upload a text file of their predicted annotations on the test images to a provided server by this deadline. We then evaluate all submissions and release the results.

For every task we released code that takes a text file of automatically generated image annotations, compares it with the ground truth annotations, and returns a quantitative measure of algorithm accuracy. Teams can use this code to evaluate their performance on the validation data.

As described in (Everingham et al., 2014), there are three options for measuring performance on test data: (i) release test images and annotations, and allow participants to assess performance themselves; (ii) release test images but not test annotations – participants submit results and organizers assess performance; (iii) release neither test images nor annotations – participants submit software and organizers run it on new data and assess performance. In line with the PASCAL VOC choice, we opted for option (ii). Option (i) allows too much leeway in overfitting to the test data; option (iii) is infeasible, especially given the scale of our test set (40K-100K images).

We released the ILSVRC2010 test annotations for the image classification task, but all other test annotations have remained hidden to discourage fine-tuning results on the test data.
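To make the self-evaluation workflow concrete, the sketch below shows the kind of computation the released evaluation code performs for the classification task. It is a simplified illustration rather than the actual development kit: the submission format assumed here (one test image per line, up to five predicted class labels separated by spaces, in the fixed test-image order) and the function names are our own.

def load_predictions(path):
    # Parse a submission file: one line per test image,
    # containing up to five predicted class labels.
    predictions = []
    with open(path) as f:
        for line in f:
            predictions.append(line.split()[:5])
    return predictions

def top5_error(predictions, ground_truth):
    # Fraction of images whose true label is not among the predictions.
    assert len(predictions) == len(ground_truth)
    misses = sum(1 for preds, truth in zip(predictions, ground_truth)
                 if truth not in preds)
    return misses / len(ground_truth)

A team could run such a script against the released validation annotations to reproduce the self-evaluation step described above; submissions on the hidden test annotations are instead scored by the organizers.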
Evaluation protocol after the challenge. After the challenge period we set up an automatic evaluation server that researchers can use throughout the year to continue evaluating their algorithms against the ground truth test annotations. We limit teams to 2 submissions per week to discourage parameter tuning on the test data, and in practice we have never had a problem with researchers abusing the system.
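Such a submission cap is simple to enforce server-side. The following snippet is purely illustrative (the server implementation is not described here): it keeps a rolling one-week window of accepted submissions per team and rejects anything beyond the stated limit of two; all names and data structures are assumptions.

from collections import defaultdict
from datetime import datetime, timedelta

SUBMISSION_LIMIT = 2          # submissions allowed per team per window
WINDOW = timedelta(weeks=1)   # rolling window length

history = defaultdict(list)   # team id -> timestamps of accepted submissions

def accept_submission(team_id, now=None):
    # Accept and record the submission only if the team is under its quota.
    now = now or datetime.utcnow()
    history[team_id] = [t for t in history[team_id] if now - t < WINDOW]
    if len(history[team_id]) >= SUBMISSION_LIMIT:
        return False
    history[team_id].append(now)
    return True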
Bibliography

Ahonen, T., Hadid, A., and Pietikäinen, M. (2006). Face description with local binary patterns: Application to face recognition. PAMI, 28.
Alexe, B., Deselaers, T., and Ferrari, V. (2012). Measuring the objectness of image windows. In PAMI.
Arandjelovic, R. and Zisserman, A. (2012). Three things everyone should know to improve object retrieval. In CVPR.
Arbeláez, P., Maire, M., Fowlkes, C., and Malik, J. (2011). Contour detection and hierarchical image segmentation. IEEE TPAMI, 33.
Arbeláez, P., Pont-Tuset, J., Barron, J., Marques, F., and Malik, J. (2014). Multiscale combinatorial grouping. In Computer Vision and Pattern Recognition.
Batra, D., Agrawal, H., Banik, P., Chavali, N., Mathialagan, C. S., and Alfadda, A. (2013). CloudCV: Large-scale distributed computer vision as a cloud service.
Berg, A., Farrell, R., Khosla, A., Krause, J., Fei-Fei, L., Li, J., and Maji, S. (2013). Fine-Grained Competition. https://sites.google.com/site/fgcomp2013/.
Chatfield, K., Simonyan, K., Vedaldi, A., and Zisserman, A. (2014). Return of the devil in the details: Delving deep into convolutional nets. CoRR, abs/1405.3531.
Chen, Q., Song, Z., Huang, Z., Hua, Y., and Yan, S. (2014). Contextualizing object detection and classification. volume PP.
Crammer, K., Dekel, O., Keshet, J., Shalev-Shwartz, S., and Singer, Y. (2006). Online passive-aggressive algorithms. Journal of Machine Learning Research, 7:551–585.
Criminisi, A. (2004). Microsoft Research Cambridge (MSRC) object recognition image database (version 2.0). http://research.microsoft.com/vision/cambridge/recognition.
Dean, T., Ruzon, M., Segal, M., Shlens, J., Vijayanarasimhan, S., and Yagnik, J. (2013). Fast, accurate detection of 100,000 object classes on a single machine. In CVPR.
Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., and Fei-Fei, L. (2009). ImageNet: a large-scale hierarchical image database. In CVPR.
Deng, J., Russakovsky, O., Krause, J., Bernstein, M., Berg, A. C., and Fei-Fei, L. (2014). Scalable multi-label annotation. In CHI.
Donahue, J., Jia, Y., Vinyals, O., Hoffman, J., Zhang, N., Tzeng, E., and Darrell, T. (2013). DeCAF: A deep convolutional activation feature for generic visual recognition. CoRR, abs/1310.1531.
Dubout, C. and Fleuret, F. (2012). Exact acceleration of linear object detectors. In Proceedings of the European Conference on Computer Vision (ECCV).
Everingham, M., Eslami, S. M. A., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2014). The Pascal Visual Object Classes (VOC) challenge: a retrospective. IJCV.
Everingham, M., Van Gool, L., Williams, C., Winn, J., and Zisserman, A. (2005-2012). PASCAL Visual Object Classes Challenge (VOC). http://www.pascal-network.org/challenges/VOC/voc2012/workshop/index.html.
Everingham, M., Van Gool, L., Williams, C. K. I., Winn, J., and Zisserman, A. (2010). The Pascal Visual Object Classes (VOC) challenge. IJCV, 88(2):303–338.
Fei-Fei, L., Fergus, R., and Perona, P. (2004). Learning generative visual models from few examples: an incremental Bayesian approach tested on 101 object categories. In CVPR.
Felzenszwalb, P., Girshick, R., McAllester, D., and Ramanan, D. (2010). Object detection with discriminatively trained part based models. PAMI, 32.
Frome, A., Corrado, G., Shlens, J., Bengio, S., Dean, J., Ranzato, M., and Mikolov, T. (2013). DeViSE: A deep visual-semantic embedding model. In Advances in Neural Information Processing Systems, NIPS.
Geiger, A., Lenz, P., Stiller, C., and Urtasun, R. (2013). Vision meets robotics: The KITTI dataset. International Journal of Robotics Research (IJRR).
Girshick, R., Donahue, J., Darrell, T., and Malik, J. (2014). Rich feature hierarchies for accurate object detection and semantic segmentation. In CVPR.
Girshick, R. B., Donahue, J., Darrell, T., and Malik, J. (2013). Rich feature hierarchies for accurate object detection and semantic segmentation (v4). CoRR.
Gould, S., Fulton, R., and Koller, D. (2009). Decomposing a scene into geometric and semantically consistent regions. In ICCV.
Graham, B. (2013). Sparse arrays of signatures for online character recognition. CoRR.
Griffin, G., Holub, A., and Perona, P. (2007). Caltech-256 object category dataset. Technical Report 7694, Caltech.
Harada, T. and Kuniyoshi, Y. (2012). Graphical Gaussian vector for image categorization. In NIPS.
Harel, J., Koch, C., and Perona, P. (2007). Graph-based visual saliency. In NIPS.
He, K., Zhang, X., Ren, S., and Sun, J. (2014). Spatial pyramid pooling in deep convolutional networks for visual recognition. In ECCV.
Hinton, G. E., Srivastava, N., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2012). Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580.
Hoiem, D., Chodpathumwan, Y., and Dai, Q. (2012). Diagnosing error in object detectors. In ECCV.
Howard, A. (2014). Some improvements on deep convolutional neural network based image classification. ICLR.
Huang, G. B., Ramesh, M., Berg, T., and Learned-Miller, E. (2007). Labeled Faces in the Wild: A database for studying face recognition in unconstrained environments. Technical Report 07-49, University of Massachusetts, Amherst.
Iandola, F. N., Moskewicz, M. W., Karayev, S., Girshick, R. B., Darrell, T., and Keutzer, K. (2014). DenseNet: Implementing efficient convnet descriptor pyramids. CoRR.
Jia, Y. (2013). Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/.
Jojic, N., Frey, B. J., and Kannan, A. (2003). Epitomic analysis of appearance and shape. In ICCV.
Kanezaki, A., Inaba, S., Ushiku, Y., Yamashita, Y., Muraoka, H., Kuniyoshi, Y., and Harada, T. (2014). Hard negative classes for multiple object detection. In ICRA.
Khosla, A., Jayadevaprakash, N., Yao, B., and Fei-Fei, L. (2011). Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, CVPR.
Krizhevsky, A., Sutskever, I., and Hinton, G. (2012). ImageNet classification with deep convolutional neural networks. In NIPS.
Kuettel, D., Guillaumin, M., and Ferrari, V. (2012). Segmentation propagation in ImageNet. In ECCV.
Lazebnik, S., Schmid, C., and Ponce, J. (2006). Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In CVPR.
Lin, M., Chen, Q., and Yan, S. (2014a). Network in network. ICLR.
Lin, T.-Y., Maire, M., Belongie, S., Hays, J., Perona, P., Ramanan, D., Dollár, P., and Zitnick, C. L. (2014b). Microsoft COCO: Common Objects in Context. In ECCV.
Lin, Y., Lv, F., Cao, L., Zhu, S., Yang, M., Cour, T., Yu, K., and Huang, T. (2011). Large-scale image classification: Fast feature extraction and SVM training. In CVPR.
Lowe, D. G. (2004). Distinctive image features from scale-invariant keypoints. IJCV, 60(2):91–110.
Maji, S. and Malik, J. (2009). Object detection using a max-margin Hough transform. In CVPR.
Manen, S., Guillaumin, M., and Van Gool, L. (2013). Prime object proposals with randomized Prim's algorithm. In ICCV.
Mensink, T., Verbeek, J., Perronnin, F., and Csurka, G. (2012). Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In ECCV.
Mikolov, T., Chen, K., Corrado, G., and Dean, J. (2013). Efficient estimation of word representations in vector space. ICLR.
Miller, G. A. (1995). WordNet: A lexical database for English. Commun. ACM, 38(11).
Ordonez, V., Deng, J., Choi, Y., Berg, A. C., and Berg, T. L. (2013). From large scale image categorization to entry-level categories. In IEEE International Conference on Computer Vision (ICCV).
Ouyang, W., Luo, P., Zeng, X., Qiu, S., Tian, Y., Li, H., Yang, S., Wang, Z., Xiong, Y., Qian, C., Zhu, Z., Wang, R., Loy, C. C., Wang, X., and Tang, X. (2014). DeepID-Net: multi-stage and deformable deep convolutional neural networks for object detection. CoRR, abs/1409.3505.
Ouyang, W. and Wang, X. (2013). Joint deep learning for pedestrian detection. In ICCV.
Papandreou, G. (2014). Deep epitomic convolutional neural networks. CoRR.
Papandreou, G., Chen, L.-C., and Yuille, A. L. (2014). Modeling image patches with a generic dictionary of mini-epitomes.
Perronnin, F., Akata, Z., Harchaoui, Z., and Schmid, C. (2012). Towards good practice in large-scale learning for image classification. In CVPR.
Perronnin, F. and Dance, C. R. (2007). Fisher kernels on visual vocabularies for image categorization. In CVPR.
Perronnin, F., Sánchez, J., and Mensink, T. (2010). Improving the Fisher kernel for large-scale image classification. In ECCV (4).
Russakovsky, O., Deng, J., Huang, Z., Berg, A., and Fei-Fei, L. (2013). Detecting avocados to zucchinis: what have we done, and where are we going? In ICCV.
Russell, B., Torralba, A., Murphy, K., and Freeman, W. T. (2007). LabelMe: a database and web-based tool for image annotation. IJCV.
Sanchez, J. and Perronnin, F. (2011). High-dim. signature compression for large-scale image classification. In CVPR.
Sanchez, J., Perronnin, F., and de Campos, T. (2012). Modeling spatial layout of images beyond spatial pyramids. In PRL.
Scheirer, W., Kumar, N., Belhumeur, P. N., and Boult, T. E. (2012). Multi-attribute spaces: Calibration for attribute fusion and similarity search. In CVPR.
Schmidhuber, J. (2012). Multi-column deep neural networks for image classification. In CVPR.
Sermanet, P., Eigen, D., Zhang, X., Mathieu, M., Fergus, R., and LeCun, Y. (2013). OverFeat: Integrated recognition, localization and detection using convolutional networks. CoRR, abs/1312.6229.
Sheng, V. S., Provost, F., and Ipeirotis, P. G. (2008). Get another label? Improving data quality and data mining using multiple, noisy labelers. In SIGKDD.
Simonyan, K., Vedaldi, A., and Zisserman, A. (2013). Deep Fisher networks for large-scale image classification. In NIPS.
Simonyan, K. and Zisserman, A. (2014). Very deep convolutional networks for large-scale image recognition. CoRR, abs/1409.1556.
Sorokin, A. and Forsyth, D. (2008). Utility data annotation with Amazon Mechanical Turk. In InterNet08.
Su, H., Deng, J., and Fei-Fei, L. (2012). Crowdsourcing annotations for visual object detection. In AAAI Human Computation Workshop.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., and Rabinovich, A. (2014). Going deeper with convolutions. Technical report.
Tang, Y. (2013). Deep learning using support vector machines. CoRR, abs/1306.0239.
Thorpe, S., Fize, D., Marlot, C., et al. (1996). Speed of processing in the human visual system. Nature, 381(6582):520–522.
Torralba, A., Fergus, R., and Freeman, W. (2008). 80 million tiny images: A large data set for nonparametric object and scene recognition. In PAMI.
Uijlings, J., van de Sande, K., Gevers, T., and Smeulders, A. (2013). Selective search for object recognition. International Journal of Computer Vision.
Urtasun, R., Fergus, R., Hoiem, D., Torralba, A., Geiger, A., Lenz, P., Silberman, N., Xiao, J., and Fidler, S. (2013-2014). Reconstruction meets recognition challenge. http://ttic.uchicago.edu/~rurtasun/rmrc/.
van de Sande, K. E. A., Gevers, T., and Snoek, C. G. M. (2010). Evaluating color descriptors for object and scene recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(9):1582–1596.
van de Sande, K. E. A., Gevers, T., and Snoek, C. G. M. (2011a). Empowering visual categorization with the GPU. IEEE Transactions on Multimedia, 13(1):60–70.
van de Sande, K. E. A., Snoek, C. G. M., and Smeulders, A. W. M. (2014). Fisher and VLAD with FLAIR. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition.
van de Sande, K. E. A., Uijlings, J. R. R., Gevers, T., and Smeulders, A. W. M. (2011b). Segmentation as selective search for object recognition. In ICCV.
Vittayakorn, S. and Hays, J. (2011). Quality assessment for crowdsourced object annotations. In BMVC.
von Ahn, L. and Dabbish, L. (2005). ESP: Labeling images with a computer game. In AAAI Spring Symposium: Knowledge Collection from Volunteer Contributors.
Vondrick, C., Patterson, D., and Ramanan, D. (2012). Efficiently scaling up crowdsourced video annotation. International Journal of Computer Vision.
Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using DropConnect. In Proc. International Conference on Machine Learning (ICML'13).
Wang, J., Yang, J., Yu, K., Lv, F., Huang, T., and Gong, Y. (2010). Locality-constrained linear coding for image classification. In CVPR.
Wang, M., Xiao, T., Li, J., Hong, C., Zhang, J., and Zhang, Z. (2014). Minerva: A scalable and highly efficient training platform for deep learning. In APSys.
Wang, X., Yang, M., Zhu, S., and Lin, Y. (2013). Regionlets for generic object detection. In ICCV.
Welinder, P., Branson, S., Belongie, S., and Perona, P. (2010). The multidimensional wisdom of crowds. In NIPS.
Xiao, J., Hays, J., Ehinger, K., Oliva, A., and Torralba, A. (2010). SUN database: Large-scale scene recognition from Abbey to Zoo. CVPR.
Yang, J., Yu, K., Gong, Y., and Huang, T. (2009). Linear spatial pyramid matching using sparse coding for image classification. In CVPR.
Yao, B., Yang, X., and Zhu, S.-C. (2007). Introduction to a large scale general purpose ground truth dataset: methodology, annotation tool, and benchmarks.
Zeiler, M. D. and Fergus, R. (2013). Visualizing and understanding convolutional networks. CoRR, abs/1311.2901.
Zeiler, M. D., Taylor, G. W., and Fergus, R. (2011). Adaptive deconvolutional networks for mid and high level feature learning. In ICCV.
Zhou, X., Yu, K., Zhang, T., and Huang, T. (2010). Image classification using super-vector coding of local image descriptors. In ECCV.