ImageNet: A Large-Scale Hierarchical Image Database
Conference paper in Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition (CVPR), June 2009. DOI: 10.1109/CVPR.2009.5206848. Source: DBLP.
Jia Deng, Wei Dong, Richard Socher, Li-Jia Li, Kai Li and Li Fei-Fei
Dept. of Computer Science, Princeton University, USA
{jiadeng, wdong, rsocher, jial, li, feifeili}@cs.princeton.edu
Figure 1: A snapshot of two root-to-leaf branches of ImageNet: the top row is from the mammal subtree (mammal → placental → carnivore → canine → dog → working dog → husky); the bottom row is from the vehicle subtree. For each synset, 9 randomly sampled images are presented.
Figure 2: Scale of ImageNet. Red curve: histogram of the number of images per synset. About 20% of the synsets have very few images. Over 50% of the synsets have more than 500 images. Table: summary of selected subtrees. For complete and up-to-date statistics visit http://www.image-net.org/about-stats.

Figure 3: Comparison of the “cat” and “cattle” subtrees between ESP [25] and ImageNet. Within each tree, the size of a node is proportional to the number of images it contains. The number of images for the largest node is shown for each tree. Shared nodes between an ESP tree and an ImageNet tree are colored in red.
images spread over 5247 categories (Fig. 2). On average over 600 images are collected for each synset. Fig. 2 shows the distribution of the number of images per synset for the current ImageNet 1 . To our knowledge this is already the largest clean image dataset available to the vision research community, in terms of the total number of images, the number of images per category, as well as the number of categories 2 .

Hierarchy. ImageNet organizes the different classes of images in a densely populated semantic hierarchy. The main asset of WordNet [9] lies in its semantic structure, i.e. its ontology of concepts. Similarly to WordNet, synsets of images in ImageNet are interlinked by several types of relations, the “IS-A” relation being the most comprehensive and useful. Although one can map any dataset with category labels into a semantic hierarchy by using WordNet, the density of ImageNet is unmatched by others. For example, to our knowledge no existing vision dataset offers images of 147 dog categories. Fig. 3 compares the “cat” and “cattle” subtrees of ImageNet and the ESP dataset [25]. We observe that ImageNet offers much denser and larger trees.

Accuracy. We would like to offer a clean dataset at all levels of the WordNet hierarchy. Fig. 4 demonstrates the labeling precision on a total of 80 synsets randomly sampled at different tree depths. An average precision of 99.7% is achieved. Achieving a high precision for all depths of the ImageNet tree is challenging because the lower in the hierarchy a synset is, the harder it is to classify, e.g. Siamese cat versus Burmese cat.

Diversity. ImageNet is constructed with the goal that objects in images should have variable appearances, positions,

1 About 20% of the synsets have very few images, because either there are very few web images available, e.g. “vespertilian bat”, or the synset by definition is difficult to illustrate with images, e.g. “two-year-old horse”.
2 It is claimed that the ESP game [25] has labeled a very large number of images, but only a subset of 60K images is publicly available.
pose datasets, such as FERET faces [19], Labeled Faces in the Wild [13] and the Mammal Benchmark by Fink and Ullman [11] are not included.
5 All statistics are from [21, 27]. In addition to the 50k images, the Lotus Hill dataset also includes 587k video frames.
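The IS-A backbone described in the Hierarchy paragraph can be sketched as a plain parent-to-children map; counting images over a whole subtree is then a one-function recursion. The synset names and counts below are made up for illustration, not ImageNet statistics:

```python
# Hypothetical IS-A hierarchy: each synset maps to its child synsets.
# This mirrors the mammal root-to-leaf branch shown in Figure 1.
children = {
    "mammal": ["placental"],
    "placental": ["carnivore"],
    "carnivore": ["canine"],
    "canine": ["dog"],
    "dog": ["working dog"],
    "working dog": ["husky"],
}

# Illustrative per-synset image counts (not real ImageNet numbers).
images_per_synset = {
    "mammal": 490, "placental": 510, "carnivore": 520,
    "canine": 480, "dog": 550, "working dog": 500, "husky": 600,
}

def subtree_image_count(synset):
    """Total images in a synset and all of its IS-A descendants."""
    total = images_per_synset.get(synset, 0)
    for child in children.get(synset, []):
        total += subtree_image_count(child)
    return total
```

Aggregating counts this way is what makes node sizes in tree visualizations like Fig. 3 proportional to subtree content rather than to a single synset.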
[Figure 5(a) plot: lossless JPG size in bytes of the average images for the synsets platypus, panda, okapi and elephant, ImageNet vs. Caltech101]
Figure 5: ImageNet provides diversified images. (a) Comparison of the lossless JPG file sizes of average images for four different synsets in ImageNet (the mammal subtree) and Caltech101. Average images are downsampled to 32 × 32 and sizes are measured in bytes. A more diverse set of images results in a smaller lossless JPG file size. (b) Example images from ImageNet and average images for each synset indicated in (a). (c) Example images from Caltech101 and average images. For each category shown, the average image is computed using all images from Caltech101 and an equal number of randomly sampled images from ImageNet.
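The diversity measure behind Figure 5 can be approximated in a few lines: average a set of downsampled images, losslessly compress the mean image, and compare byte sizes. The sketch below substitutes zlib for the paper's lossless JPG codec (an assumption; only the direction of the comparison matters) and uses synthetic arrays in place of real images:

```python
import zlib
import numpy as np

def avg_image_compressed_size(images):
    """Average a set of 32x32 grayscale images and return the byte size of
    the losslessly compressed mean image. A more diverse set yields a
    smoother, lower-entropy average that compresses to fewer bytes."""
    mean = np.mean(np.stack(images), axis=0)
    return len(zlib.compress(mean.astype(np.uint8).tobytes(), 9))

rng = np.random.default_rng(0)
# 100 unrelated "images": their average washes out toward mid-gray.
diverse = [rng.integers(0, 256, (32, 32)) for _ in range(100)]
# 100 near-duplicates: their average keeps all of one image's detail.
uniform = [diverse[0]] * 100
```

With real photographs the same trend holds: averaging many dissimilar images blurs structure away, so the compressed average is small, which is exactly the signal panel (a) plots.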
Figure 7: Left: Is there a Burmese cat in the images? Six randomly sampled users have different answers. Right: The confidence score table for “cat” and “Burmese cat”. More votes are needed to reach the same degree of confidence for “Burmese cat” images.

ensure diversity.

While users are instructed to make accurate judgments, we need to set up a quality control system to ensure this accuracy. There are two issues to consider. First, human users make mistakes and not all users follow the instructions. Second, users do not always agree with each other, especially for more subtle or confusing synsets, typically at the deeper levels of the tree. Fig. 7(left) shows an example of how users’ judgments differ for “Burmese cat”.

The solution to these issues is to have multiple users independently label the same image. An image is considered positive only if it gets a convincing majority of the votes. We observe, however, that different categories require different levels of consensus among users. For example, while five users might be necessary for obtaining a good consensus on “Burmese cat” images, a much smaller number is needed for “cat” images. We develop a simple algorithm to dynamically determine the number of agreements needed for different categories of images. For each synset, we first randomly sample an initial subset of images. At least 10 users are asked to vote on each of these images. We then obtain a confidence score table, indicating the probability of an image being a good image given the user votes (Fig. 7(right) shows examples for “Burmese cat” and “cat”). For each of the remaining candidate images in this synset, we proceed with the AMT user labeling until a pre-determined confidence score threshold is reached. It is worth noting that the confidence table gives a natural measure of the “semantic difficulty” of the synset. For some synsets, users fail to reach a majority vote for any image, indicating that the synset cannot be easily illustrated by images 6 . Fig. 4 shows that our algorithm successfully filters the candidate images, resulting in a high percentage of clean images per synset.

4.1. Non-parametric Object Recognition

Given an image containing an unknown object, we would like to recognize its object class by querying similar images in ImageNet. Torralba et al. [24] have demonstrated that, given a large number of images, simple nearest-neighbor methods can achieve reasonable performance despite a high level of noise. We show that with a clean set of full-resolution images, object recognition can be more accurate, especially by exploiting more feature-level information.

We run four different object recognition experiments. In all experiments, we test on images from the 16 common categories 7 between Caltech256 and the mammal subtree. We measure classification performance on each category in the form of an ROC curve. For each category, the negative set consists of all images from the other 15 categories. We now describe our experiments and results in detail (Fig. 8).

1. NN-voting + noisy ImageNet. First we replicate one of the experiments described in [24], which we refer to as “NN-voting” hereafter. To imitate the TinyImage dataset (i.e. images collected from search engines without human cleaning), we use the original candidate images for each synset (Section 3.1) and downsample them to 32 × 32. Given a query image, we retrieve the 100 nearest neighbor images by SSD pixel distance from the mammal subtree. Then we perform classification by aggregating votes (number of nearest neighbors) inside the tree of the target category.

2. NN-voting + clean ImageNet. Next we run the same NN-voting experiment described above on the clean ImageNet dataset. This result shows that having more accurate data improves classification performance.

3. NBNN. We also implement the Naive Bayesian Nearest Neighbor (NBNN) method proposed in [5] to underline the usefulness of full-resolution images. NBNN employs a bag-of-features representation of images; SIFT [15] descriptors are used in this experiment. Given a query image Q with descriptors {d_i}, i = 1, . . . , M, for each object class C we compute the query-class distance D_C =

6 An alternative explanation is that we did not obtain enough suitable candidate images. Given the extensiveness of our crawling scheme, this is a rare scenario.
7 The categories are bat, bear, camel, chimp, dog, elk, giraffe, goat, gorilla, greyhound, horse, killer-whale, porcupine, raccoon, skunk, zebra. Duplicates (∼20 per category) with ImageNet are removed from the test set.
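The NN-voting step in experiment 1 can be sketched as follows, assuming images are 32 × 32 grayscale NumPy arrays and that the target category's subtree has already been flattened into a set of synset names (both are simplifying assumptions, not part of the original setup):

```python
import numpy as np

def nn_vote_score(query, dataset_images, dataset_synsets, target_subtree, k=100):
    """NN-voting sketch: retrieve the k nearest images by SSD pixel
    distance, then count how many of those neighbors carry a synset
    inside the target category's subtree. That count is the vote score."""
    stack = np.stack(dataset_images).astype(np.float64)
    # Sum-of-squared-differences distance from the query to every image.
    ssd = ((stack - query.astype(np.float64)) ** 2).sum(axis=(1, 2))
    nearest = np.argsort(ssd)[:k]
    return sum(dataset_synsets[i] in target_subtree for i in nearest)

# Tiny synthetic demo: dark "dog" images vs. bright "car" images.
rng = np.random.default_rng(1)
dogs = [rng.integers(0, 20, (32, 32)) for _ in range(10)]
cars = [rng.integers(200, 256, (32, 32)) for _ in range(10)]
labels = ["husky"] * 5 + ["beagle"] * 5 + ["sedan"] * 10
query = rng.integers(0, 20, (32, 32))
score = nn_vote_score(query, dogs + cars, labels, {"husky", "beagle"}, k=10)
```

Because the query is drawn from the dark cluster, all 10 retrieved neighbors fall inside the dog subtree, so the vote score saturates at k.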
Figure 8: (a) Object recognition experiment results plotted as ROC curves. Each curve is the result of one of the four experiments described in Section 4.1. It is an average of all ROC results of the 16 object categories commonly shared between Caltech256 and the mammal subtree. Caltech256 images serve as testing images. (b)(c) The ROC curves for “elk” and “killer-whale”.

Figure 9: Average AUC at each tree height level. Performance comparison at different tree height levels between independently trained classifiers and tree-max classifiers. The tree height of a node is defined as the length of the longest path to its leaf nodes. All leaf nodes’ height is 1.

∑_{i=1}^{M} ‖d_i − d_i^C‖², where d_i^C is the nearest neighbor of d_i among all the image descriptors in class C. We order all classes by D_C and define the classification score as the minimum rank of the target class and its subclasses. The result shows that NBNN gives substantially better performance, demonstrating the advantage of using a more sophisticated feature representation available through full-resolution images.

4. NBNN-100. Finally, we run the same NBNN experiment, but limit the number of images per category to 100. The result confirms the findings of [24]: performance can be significantly improved by enlarging the dataset. It is worth noting that NBNN-100 outperforms NN-voting with access to the entire dataset, again demonstrating the benefit of having detailed feature-level information by using full-resolution images.

4.2. Tree Based Image Classification

Compared to other available datasets, ImageNet provides image data in a densely populated hierarchical structure. Many possible algorithms could be applied to exploit a hierarchical data structure (e.g. [16, 17, 28, 18]).

In this experiment, we choose to illustrate the usefulness of the ImageNet hierarchy by a simple object classification method which we call the “tree-max classifier”. Imagine you have a classifier at each synset node of the tree and you want to decide whether an image contains an object of that synset or not. The idea is to consider not only the classification score at a node such as “dog”, but also those of its child synsets, such as “German shepherd”, “English terrier”, etc. The maximum of all the classifier responses in this subtree becomes the classification score of the query image.

Fig. 9 illustrates the result of our experiment on the mammal subtree. Note that our algorithm is agnostic to the method used to learn image classifiers for each synset. In this case, we use an AdaBoost-based classifier proposed by [6]. For each synset, we randomly sample 90% of the images to form the positive training set, leaving the remaining 10% as testing images. We form a common negative image set by aggregating 10 images randomly sampled from each synset. When training an image classifier for a particular synset, we use the positive set from this synset as well as the common negative image set, excluding the images drawn from this synset and its child and parent synsets.

We evaluate the classification results by AUC (the area under the ROC curve). Fig. 9 shows the AUC for synsets at different levels of the hierarchy, compared with an independent classifier that does not exploit the tree structure of ImageNet. The plot indicates that images are easier to classify at the bottom of the tree (e.g. star-nosed mole, minivan, polar bear) than at the top of the tree (e.g. vehicle, mammal, artifact, etc.). This is most likely due to stronger visual coherence near the leaf nodes of the tree.

At nearly all levels, the performance of the tree-max classifier is consistently higher than that of the independent classifier. This result shows that a simple way of exploiting the ImageNet hierarchy can already provide substantial improvement for the image classification task without additional training or model learning.
[Figure: precision and recall bars for individual example categories]
r