
Building a bird recognition app and large scale dataset with citizen scientists:

The fine print in fine-grained dataset collection

Grant Van Horn1   Steve Branson1   Ryan Farrell2   Scott Haber3
Jessie Barry3   Panos Ipeirotis4   Pietro Perona1   Serge Belongie5

1Caltech   2BYU   3Cornell Lab of Ornithology   4NYU   5Cornell Tech

Abstract

We introduce tools and methodologies to collect high quality, large scale fine-grained computer vision datasets using citizen scientists – crowd annotators who are passionate and knowledgeable about specific domains such as birds or airplanes. We worked with citizen scientists and domain experts to collect NABirds, a new high quality dataset containing 48,562 images of North American birds with 555 categories, part annotations and bounding boxes. We find that citizen scientists are significantly more accurate than Mechanical Turkers at zero cost. We worked with bird experts to measure the quality of popular datasets like CUB-200-2011 and ImageNet and found class label error rates of at least 4%. Nevertheless, we found that learning algorithms are surprisingly robust to annotation errors, and this level of training data corruption can lead to an acceptably small increase in test error if the training set has sufficient size. At the same time, we found that an expert-curated high quality test set like NABirds is necessary to accurately measure the performance of fine-grained computer vision systems. We used NABirds to train a publicly available bird recognition service deployed on the web site of the Cornell Lab of Ornithology.1

1 merlin.allaboutbirds.org

Figure 1: Merlin Photo ID: a publicly available tool for bird species classification built with the help of citizen scientists. The user uploaded a picture of a bird, and server-side computer vision algorithms identified it as an immature Cooper's Hawk.

1. Introduction

Computer vision systems – catalyzed by the availability of new, larger scale datasets like ImageNet [6] – have recently obtained remarkable performance on object recognition [17, 32] and detection [10]. Computer vision has entered an era of big data, where the ability to collect larger datasets – larger in terms of the number of classes, the number of images per class, and the level of annotation per image – appears to be paramount for continuing performance improvement and expanding the set of solvable applications.

Unfortunately, expanding datasets in this fashion introduces new challenges beyond just increasing the amount of human labor required. As we increase the number of classes of interest, classes become more fine-grained and difficult to distinguish for the average person (and the average annotator), more ambiguous, and less likely to obey an assumption of mutual exclusion. The annotation process becomes more challenging, requiring an increasing amount of skill and knowledge. Dataset quality appears to be at direct odds with dataset size.

In this paper, we introduce tools and methodologies for constructing large, high quality computer vision datasets, based on tapping into an alternate pool of crowd annotators – citizen scientists. Citizen scientists are nonprofessional scientists or enthusiasts in a particular domain such as birds, insects, plants, airplanes, shoes, or architecture. Citizen scientists contribute annotations with the understanding that their expertise and passion in a domain of interest can help build tools that will be of service to a community of peers. Unlike workers on Mechanical Turk, citizen scientists are unpaid. Despite this, they produce higher quality annotations due to their greater expertise and the absence of spammers. Additionally, citizen scientists can help define and organically grow the set of classes and its taxonomic structure to match the interests of real users in a domain of interest.
Whereas datasets like ImageNet [6] and CUB-200-2011 [35] have been valuable in fostering the development of computer vision algorithms, the particular set of categories chosen is somewhat arbitrary and of limited use to real applications. The drawback of using citizen scientists instead of Mechanical Turkers is that the throughput of collecting annotations may be lower, and computer vision researchers must take the time to figure out how to partner with different communities for each domain.

We collected a large dataset of 48,562 images over 555 categories of birds with part annotations and bounding boxes for each image, using a combination of citizen scientists, experts, and Mechanical Turkers. We used this dataset to build a publicly available application for bird species classification. In this paper, we provide details and analysis of our experiences with the hope that they will be useful and informative for other researchers in computer vision working on collecting larger fine-grained image datasets. We address questions like: What is the relative skill level of different types of annotators (MTurkers, citizen scientists, and experts) for different types of annotations (fine-grained categories and parts)? What are the resulting implications in terms of annotation quality, annotation cost, human annotator time, and the time it takes a requester to finish a dataset? Which types of annotations are suitable for different pools of annotators? What types of annotation GUIs are best for each respective pool of annotators? How important is annotation quality for the accuracy of learned computer vision algorithms? How significant are the quality issues in existing datasets like CUB-200-2011 and ImageNet, and what impact has that had on computer vision performance?

We summarize our contributions below:

1. Methodologies to collect high quality, fine-grained computer vision datasets using a new type of crowd annotators: citizen scientists.
2. NABirds: a large, high quality dataset of 555 categories curated by experts.
3. Merlin Photo ID: a publicly available tool for bird species classification.
4. Detailed analysis of annotation quality, time, cost, and throughput of MTurkers, citizen scientists, and experts for fine-grained category and part annotations.
5. Analysis of the annotation quality of the popular datasets CUB-200 and ImageNet.
6. Empirical analysis of the effect that annotation quality has when training state-of-the-art computer vision algorithms for categorization.

A high-level summary of our findings is: a) Citizen scientists have 2-4 times lower error rates than MTurkers at fine-grained bird annotation, while annotating images faster and at zero cost. Over 500 citizen scientists annotated images in our dataset; if we can expand beyond the domain of birds, the pool of possible citizen scientist annotators is massive. b) A curation-based interface for visualizing and manipulating the full dataset can further improve the speed and accuracy of citizen scientists and experts. c) Even when averaging answers from 10 MTurkers together, MTurkers have a more than 30% error rate at 37-way bird classification. d) The general high quality of Flickr search results (84% accurate when searching for a particular species) greatly mitigates the errors of MTurkers when collecting fine-grained datasets. e) MTurkers are as accurate and fast as citizen scientists at collecting part location annotations. f) MTurkers have faster throughput in collecting annotations than citizen scientists; however, using citizen scientists it is still realistic to annotate a dataset of around 100k images in a domain like birds in around 1 week. g) At least 4% of images in CUB-200-2011 and ImageNet have incorrect class labels, along with numerous other issues including inconsistencies in the taxonomic structure, biases in terms of which images were selected, and the presence of duplicate images. h) Despite these problems, these datasets are still effective for computer vision research; when training CNN-based computer vision algorithms with corrupted labels, the resulting increase in test error is surprisingly low and significantly less than the level of corruption. i) A consequence of findings (a), (d), and (h) is that training computer vision algorithms on unfiltered Flickr search results (with no annotation) can often outperform algorithms trained when filtering by MTurker majority vote.

2. Related Work

Crowdsourcing with Mechanical Turk: Amazon's Mechanical Turk (AMT) has been an invaluable tool that has allowed researchers to collect datasets of significantly larger size and scope than previously possible [31, 6, 20]. AMT makes it easy to outsource simple annotation tasks to a large pool of workers. Although these workers will usually be non-experts, for many tasks it has been shown that repeated labeling of examples by multiple non-expert workers can produce high quality labels [30, 37, 14]. Annotation of fine-grained categories is a possible counter-example, where the average annotator may have little to no prior knowledge to make a reasonable guess at fine-grained labels. For example, the average worker has little to no prior knowledge as to what type of bird a "Semipalmated Plover" is, and her ability to provide a useful guess is largely dependent on the efforts of the dataset collector to provide useful instructions or illustrative examples. Since our objective is to collect datasets of thousands of classes, generating high quality instructions for each category is difficult or infeasible.
Crowdsourcing with expertise estimation: A possible solution is to try to automatically identify the subset of workers who have adequate expertise for fine-grained classification [36, 38, 28, 22]. Although such models are promising, it seems likely that the subset of Mechanical Turkers with expertise in a particular fine-grained domain is small enough to make such methods impractical or challenging.

Games with a purpose: Games with a purpose target alternate crowds of workers that are incentivized by construction of annotation tasks that also provide some entertainment value. Notable examples include the ESP Game [33], reCAPTCHA [34], and BubbleBank [7]. A partial inspiration to our work was Quizz [13], a system to tap into new, larger pools of unpaid annotators using Google AdWords to help find and recruit workers with the applicable expertise.2 A limitation of games with a purpose is that they require some artistry to design tools that can engage users.

Citizen science: The success of Wikipedia is another major inspiration to our work, where citizen scientists have collaborated to generate a large, high quality web-based encyclopedia. Studies have shown that citizen scientists are incentivized by altruism, sense of community, and reciprocity [18, 26, 39], and such incentives can lead to higher quality work than monetary incentives [11].

Datasets: Progress in object recognition has been accelerated by dataset construction. These advances are fueled both by the release and availability of each dataset but also by subsequent competitions on them. Key datasets/competitions in object recognition include Caltech-101 [9], Caltech-256 [12], Pascal VOC [8] and ImageNet/ILSVRC [6, 29]. Fine-grained object recognition is no exception to this trend. Various domains have already had datasets introduced including Birds (the CUB-200 [35] and recently announced Birdsnap [2] datasets), Flowers [25, 1], Dogs and Cats [15, 27, 21], Stoneflies [24], Butterflies [19] and Fish [4] along with man-made domains such as Airplanes [23], Cars [16], and Shoes [3].

2 The viability of this approach remains to be seen as our attempt to test it was foiled by a misunderstanding with the AdWords team.

3. Crowdsourcing with Citizen Scientists

The communities of enthusiasts for a taxon are an untapped work force and partner for vision researchers. The individuals comprising these communities tend to be very knowledgeable about the taxon. Even those that are novices make up for their lack of knowledge with passion and dedication. These characteristics make these communities a fundamentally different work force than the typical paid crowd workers. When building a large, fine-grained dataset they can be utilized to curate images with a level of accuracy that would be extremely costly with paid crowd workers, see Section 5. There is a mutual benefit as the taxon communities gain from having a direct influence on the construction of the dataset. They know their taxon, and their community, better than vision researchers, and so they can ensure that the resulting datasets are directed towards solving real world problems.

A connection must be established with these communities before they can be utilized. We worked with ornithologists at the Cornell Lab of Ornithology to build NABirds. The Lab of Ornithology provided a perfect conduit to tap into the large citizen scientist community surrounding birds. Our partners at the Lab of Ornithology described that the birding community, and perhaps many other taxon communities, can be segmented into several different groups, each with their own particular benefits. We built custom tools to take advantage of each of the segments.

3.1. Experts

Experts are the professionals of the community, and our partners at the Lab of Ornithology served this role. Figure 4 is an example of an expert management tool (Vibe3) and was designed to let expert users quickly and efficiently curate images and manipulate the taxonomy of a large dataset. Beyond simple image storage, tagging, and sharing, the benefit of this tool is that it lets the experts define the dataset taxonomy as they see fit, and allows for the dynamic changing of the taxonomy as the need arises. For NABirds, an interesting result of this flexibility is that bird species were further subdivided into "visual categories." A "visual category" marks a sex or age or plumage attribute of the species that results in a visually distinctive difference from other members within the same species, see Figure 2. This type of knowledge of visual variances at the species level would have been difficult to capture without the help of someone knowledgeable about the domain.

3 vibe.visipedia.org

Figure 2: Two species of hawks from the NABirds dataset are separated into 6 categories based on their visual attributes.
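To make the "visual category" structure concrete, the sketch below shows one minimal way such a taxonomy could be represented, with species nodes subdivided into leaf visual categories. The class, field names, and the hawk example are illustrative assumptions, not the actual NABirds data format.

# Minimal sketch (not the actual NABirds format): a taxonomy in which a
# species node can be subdivided into leaf "visual categories" that mark
# sex/age/plumage variants, as described in Section 3.1.
from dataclasses import dataclass, field
from typing import List, Optional

@dataclass
class TaxonNode:
    name: str
    parent: Optional["TaxonNode"] = field(default=None, repr=False)
    children: List["TaxonNode"] = field(default_factory=list)

    def add_child(self, name: str) -> "TaxonNode":
        child = TaxonNode(name=name, parent=self)
        self.children.append(child)
        return child

    def leaves(self) -> List["TaxonNode"]:
        # Leaf nodes are the categories images are actually labeled with.
        if not self.children:
            return [self]
        return [leaf for c in self.children for leaf in c.leaves()]

# Hypothetical fragment: one species expanded into two visual categories.
root = TaxonNode("Birds")
hawks = root.add_child("Hawks")
red_tailed = hawks.add_child("Red-tailed Hawk")
red_tailed.add_child("Red-tailed Hawk (adult)")
red_tailed.add_child("Red-tailed Hawk (immature)")
print([n.name for n in root.leaves()])
# ['Red-tailed Hawk (adult)', 'Red-tailed Hawk (immature)']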
(a) Quiz Annotation GUI (b) Part Annotation GUI
Figure 3: (a) This interface was used to collect category labels on images. Users could either use the autocomplete box or scroll through a gallery of possible birds. (b) This interface was used to collect part annotations on the images. Users were asked to mark the visibility and location of 11 parts. See Sections 3.2 and 3.3.

Figure 4: Expert interface for rapid and efficient curation of images, and easy modification of the taxonomy. The taxonomy is displayed on the left and is similar to a file system structure. See Section 3.1.

3.2. Citizen Scientist Experts

After the experts, these individuals of the community are the top tier, most skilled members. They have the confidence and experience to identify easily confused classes of the taxonomy. For the birding community these individuals were identified by their participation in eBird, a resource that allows birders to record and analyze their bird sightings.4 Figure 3a shows a tool that allows these members to take bird quizzes. The tool presents the user with a series of images and requests the species labels. The user can supply the label using the autocomplete box, or, if they are not sure, they can browse through a gallery of possible answers. At the end of the quiz, their answers can be compared with other expert answers.

4 ebird.org

3.3. Citizen Scientist Turkers

This is a large, passionate segment of the community motivated to help their cause. This segment is not necessarily as skilled in difficult identification tasks, but they are capable of assisting in other ways. Figure 3b shows a part annotation task that we deployed to this segment. The task was to simply click on all parts of the bird. The size of this population should not be underestimated. Depending on how these communities are reached, this population could be larger than the audience reached in typical crowdsourcing platforms.

4. NABirds

We used a combination of experts, citizen scientists, and MTurkers to build NABirds, a new bird dataset of 555 categories with a total of 48,562 images. Members from the birding community provided the images, the experts of the community curated the images, and a combination of citizen scientist turkers and MTurkers annotated 11 bird parts on every image along with bounding boxes. This dataset is free to use for the research community.

The taxonomy for this dataset contains 1011 nodes, and the categories cover the most common North American birds. These leaf categories were specifically chosen to allow for the creation of bird identification tools to help novice birders. Improvements on classification or detection accuracy by vision researchers will have a straightforward and meaningful impact on the birding community and their identification tools.

We used techniques from [5] to baseline performance on this dataset. Using Caffe and the fc6 layer features extracted from the entire image, we achieved an accuracy of 35.7%. Using the best performing technique from [5] with ground truth part locations, we achieved an accuracy of 75%.
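The image-level baseline is easy to approximate with current tooling. The sketch below stands in for the Caffe fc6 pipeline using a pretrained AlexNet (whose first fully connected layer plays the role of fc6) and a linear SVM; the backbone choice, preprocessing, and classifier are assumptions for illustration, not the authors' exact setup from [5].

# Illustrative baseline in the spirit of Section 4 (not the authors' Caffe
# pipeline): extract a 4096-d "fc6"-style feature from a pretrained AlexNet
# and train a linear classifier on top of it.
import torch
import torchvision.transforms as T
from torchvision.models import alexnet
from PIL import Image
from sklearn.svm import LinearSVC

model = alexnet(weights="DEFAULT").eval()
# AlexNet's classifier is [Dropout, Linear(9216->4096), ReLU, ...]; stopping
# after the first Linear+ReLU gives an fc6-like feature.
fc6 = torch.nn.Sequential(model.features, model.avgpool, torch.nn.Flatten(),
                          *list(model.classifier[:3]))

preprocess = T.Compose([T.Resize(256), T.CenterCrop(224), T.ToTensor(),
                        T.Normalize(mean=[0.485, 0.456, 0.406],
                                    std=[0.229, 0.224, 0.225])])

def extract(paths):
    feats = []
    with torch.no_grad():
        for p in paths:
            x = preprocess(Image.open(p).convert("RGB")).unsqueeze(0)
            feats.append(fc6(x).squeeze(0).numpy())
    return feats

def baseline_accuracy(train_paths, train_labels, test_paths, test_labels):
    # Labels would be the 555 NABirds leaf categories (or any category list).
    clf = LinearSVC(C=1.0).fit(extract(train_paths), train_labels)
    return clf.score(extract(test_paths), test_labels)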
5. Annotator Comparison

In this section we compare annotations performed by Amazon Mechanical Turk workers (MTurkers) with citizen scientists reached through the Lab of Ornithology's Facebook page. The goal of these experiments was to quantify the following aspects of annotation tasks: 1) Annotation Error: the fraction of incorrect annotations. 2) Annotation Time: the average amount of human time required per annotation. 3) Annotation Cost: the average cost in dollars required per annotation. 4) Annotation Throughput: the average number of annotations obtainable per second; this scales with the total size of the pool of annotators.
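These four quantities can be computed from a simple per-annotation log. The record fields and example numbers below are assumptions for illustration, not the authors' logging format.

# Minimal sketch of the four measurements in Section 5, computed from an
# assumed per-annotation log: (worker_id, is_correct, seconds, cost_usd).
def annotation_stats(records, wall_clock_seconds):
    n = len(records)
    error = sum(1 for r in records if not r["is_correct"]) / n
    avg_time = sum(r["seconds"] for r in records) / n    # human time per annotation
    avg_cost = sum(r["cost_usd"] for r in records) / n   # dollars per annotation
    throughput = n / wall_clock_seconds                  # annotations per second of wall-clock time
    return {"error": error, "time": avg_time, "cost": avg_cost, "throughput": throughput}

# Example: three hypothetical MTurk annotations collected over one minute.
log = [{"worker_id": "w1", "is_correct": True,  "seconds": 19, "cost_usd": 0.05},
       {"worker_id": "w2", "is_correct": False, "seconds": 22, "cost_usd": 0.05},
       {"worker_id": "w3", "is_correct": True,  "seconds": 17, "cost_usd": 0.05}]
print(annotation_stats(log, wall_clock_seconds=60.0))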
In order to compare the skill levels of different annotator groups directly, we chose a common user interface that we considered to be appropriate for both citizen scientists and MTurkers. For category labeling tasks, we used the quiz tool that was discussed in Section 3.2. Each question presented the user with an image of a bird and requested the species label. To make the task feasible for MTurkers, we allowed users to browse through galleries of each possible species and limited the space of possible answers to < 40 categories. Each quiz was focused on a particular group of birds, either sparrows or shorebirds. Random chance was 1/37 for the sparrows and 1/32 for the shorebirds. At the end of the quiz, users were given a score (the number of correct answers) and could view their results. Figure 3a shows our interface. We targeted the citizen scientist experts by posting the quizzes on the eBird Facebook page.

(a) Sparrow Quiz Scores (b) Shorebird Quiz Scores
Figure 5: Histogram of quiz scores. Each quiz has 10 images, a perfect score is 10. (a) Score distributions for the sparrow quizzes. Random chance per image is 2.7%. (b) Score distributions for the shorebird quizzes. Random chance per image is 3.1%. See Section 5.

Figure 5 shows the distribution of scores achieved by the two different worker groups on the two different bird groups. Not surprisingly, citizen scientists had better performance on the classification task than MTurkers; however, we were uncertain as to whether or not averaging a large number of MTurkers could yield comparable performance. Figure 6a plots the time taken to achieve a certain error rate by combining multiple annotators for the same image using majority voting. From this figure we can see that citizen scientists not only have a lower median time per image (about 8 seconds vs 19 seconds), but that one citizen scientist expert label is more accurate than the average of 10 MTurker labels. We note that we are using a simple-as-possible (but commonly used) crowdsourcing method, and the performance of MTurkers could likely be improved by more sophisticated techniques such as CUBAM [36]. However, the magnitude of difference between the two groups and the overall large error rate of MTurkers led us to believe that the problem could not be solved simply using better crowdsourcing models.
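The aggregation behind the MTurk curves in Figure 6a is the simplest possible scheme. A minimal sketch, assuming each image carries a list of candidate labels from different workers (the data layout and example labels are hypothetical):

# Sketch of the simple aggregation used in Figure 6a: for each image, take a
# majority vote over the species labels provided by k different annotators.
from collections import Counter

def majority_vote(labels):
    # Return the most common label; ties are broken arbitrarily by count order.
    return Counter(labels).most_common(1)[0][0]

def aggregate(votes_per_image):
    # votes_per_image maps image_id -> list of labels from different workers.
    return {img: majority_vote(labels) for img, labels in votes_per_image.items()}

# Hypothetical example: 5 MTurk guesses for one sparrow image.
votes = {"img_001": ["Song Sparrow", "Savannah Sparrow", "Song Sparrow",
                     "Lincoln's Sparrow", "Song Sparrow"]}
print(aggregate(votes))  # {'img_001': 'Song Sparrow'}

As the results above indicate, this naive scheme does not converge to the correct species label for hard fine-grained categories, which is why more sophisticated models such as CUBAM [36] are mentioned as a possible, though in the authors' view insufficient, improvement.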
Figure 6c measures the raw throughput of the workers, highlighting the size of the MTurk worker pool. With citizen scientists, we noticed a spike of participation when the annotation task was first posted on Facebook, and then a quick tapering off of participation. Finally, Figure 6b measures the cost associated with the different levels of error; citizen scientists were unpaid.

We performed a similar analysis with part annotations. For this task we used the tool shown in Figure 3b. Workers from the two different groups were given an image and asked to specify the visibility and position of 11 different bird parts. We targeted the citizen scientist turkers with this task by posting the tool on the Lab of Ornithology's Facebook page. The interface for the tool was kept the same between the workers. Figures 7a, 7b, and 7c detail the results of this test. From Figure 7a we can see there is not a difference between the obtainable quality from the two worker groups, and that MTurkers tended to be faster at the task. Figure 7c again reveals that the raw throughput of MTurkers is larger than that of the citizen scientists. The primary benefit of using citizen scientists for this particular case is made clear by their zero cost in Figure 7b.

Summary: From these results, we can see that there are clear distinctions between the two different worker pools. Citizen scientists are clearly more capable at labeling fine-grained categories than MTurkers. However, the raw throughput of MTurk means that you can finish annotating your dataset sooner than when using citizen scientists. If the annotation task does not require much domain knowledge (such as part annotation), then MTurkers can perform on par with citizen scientists. Gathering fine-grained category labels with MTurk should be done with care, as we have shown that naive averaging of labels does not converge to the correct label. Finally, the cost savings of using citizen scientists can be significant when the number of annotation tasks grows.
(a) Annotation Time (b) Annotation Cost (c) Throughput
[Plots of error rate versus annotation time (hours) and annotation cost, and of annotations completed versus time (hours), for MTurkers, Citizen Scientists, and Citizen Scientists + Vibe; points are marked 1x, 3x, 5x, and 10x according to the number of combined labels per image.]
Figure 6: Category Labeling Tasks: workers used the quiz interface (see Figure 3a) to label the species of birds in images. (a) Citizen scientists are more accurate and faster for each image than MTurkers. If the citizen scientists use an expert interface (Vibe), then they are even faster and more accurate. (b) Citizen scientists are not compensated monetarily, they donate their time to the task. (c) The total throughput of MTurk is still greater, meaning you can finish annotating your dataset sooner, however this comes at a monetary cost. See Section 5.

(a) Annotation Time (b) Annotation Cost (c) Throughput
[Plots of error (average number of incorrect parts) versus annotation time (hours) and annotation cost ($), and of annotations completed versus time (hours), for MTurkers and Citizen Scientists; points are marked 1x, 5x, and 10x according to the number of combined annotations per image.]
Figure 7: Parts annotation tasks: workers used the interface in Figure 3b to label the visibility and location of 11 parts. (a) For this task, as opposed to the category labeling task, citizen scientists and MTurkers perform comparably on individual images. (b) Citizen scientists donate their time, and are not compensated monetarily. (c) The raw throughput of MTurk is greater than that of the citizen scientists, meaning you can finish your total annotation tasks sooner, but this comes at a cost. See Section 5.

6. Measuring the Quality of Existing Datasets

CUB-200-2011 [35] and ImageNet [6] are two popular datasets with fine-grained categories. Both of these datasets were collected by downloading images from web searches and curating them with Amazon Mechanical Turk. Given the results in the previous section, we were interested in analyzing the errors present in these datasets. With the help of experts from the Cornell Lab of Ornithology, we examined these datasets, specifically the bird categories, for false positives.

CUB-200-2011: The CUB-200-2011 dataset has 200 classes, each with roughly 60 images. Experts went through the entire dataset and identified a total of 494 errors, about 4.4% of the entire dataset. There was a total of 252 images that did not belong in the dataset because their category was not represented, and a total of 242 images that needed to be moved to existing categories. Beyond this 4.4% error, an additional potential concern comes from dataset bias issues. CUB was collected by performing a Flickr image search for each species and using MTurkers to filter results. A consequence is that the most difficult images tended to be excluded from the dataset altogether. By having experts annotate the raw Flickr search results, we found that on average 11.6% of correct images of each species were incorrectly filtered out of the dataset. See Section 7.2 for additional analysis.

ImageNet: There are 59 bird categories in ImageNet, each with about 1300 images in the training set. Table 1 shows the false positive counts for a subset of these categories. In addition to these numbers, it was our general impression that the error rate of ImageNet is probably at least as high as that of CUB-200 within fine-grained categories; for example, the synset "ruffed grouse, partridge, Bonasa umbellus" had overlapping definition and image content with the synset "partridge", beyond what was quantified in our analysis.

Category                                        Training Images   False Positives
magpie                                          1300              11
kite                                            1294              260
dowitcher                                       1298              70
albatross, mollymawk                            1300              92
quail                                           1300              19
ptarmigan                                       1300              5
ruffed grouse, partridge, Bonasa umbellus       1300              69
prairie chicken, prairie grouse, prairie fowl   1300              52
partridge                                       1300              55

Table 1: False positives from the ImageNet LSVRC dataset.
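The counts in Table 1 translate directly into per-synset false positive rates; the short sketch below simply divides the two columns (the numbers are copied from Table 1).

# False-positive rates implied by Table 1 (counts copied from the table).
table1 = {
    "magpie": (1300, 11), "kite": (1294, 260), "dowitcher": (1298, 70),
    "albatross, mollymawk": (1300, 92), "quail": (1300, 19),
    "ptarmigan": (1300, 5),
    "ruffed grouse, partridge, Bonasa umbellus": (1300, 69),
    "prairie chicken, prairie grouse, prairie fowl": (1300, 52),
    "partridge": (1300, 55),
}
for synset, (n_train, n_fp) in table1.items():
    print(f"{synset}: {100.0 * n_fp / n_train:.1f}% false positives")
# e.g. the "kite" synset is roughly 20% false positives (260 / 1294).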
[Plots of log(Classification Error) versus log(Number of Categories) in (a) and (b), and versus log(Dataset Size) in (c), with curves for 5%, 15%, and 50% label corruption and for pure (uncorrupted) labels.]

Figure 8: Analysis of error degradation with corrupted training labels: (a) Both the training and testing sets are corrupted. There is a
significant difference when compared to the clean data. (b) Only the training set is corrupted. The induced classification error is much less
than the corruption level. (c) Only the training set is corrupted but more part localized features are utilized. The induced classification error
is still much less than the corruption level. See Section 7.1

7. Effect of Annotation Quality & Quantity

In this section we analyze the effect of data quality and quantity on learned vision systems. Does the 4%+ error in CUB and ImageNet actually matter? We begin with simulated label corruption experiments to quantify the reduction in classification accuracy for different levels of error in Section 7.1, then perform studies on real corrupted data using an expert-vetted version of CUB in Section 7.2.

7.1. Label Corruption Experiments

In this experiment, we attempted to measure the effect of dataset quality by corrupting the image labels of the NABirds dataset. We speculated that if an image of true class X is incorrectly labeled as class Y, the effect might be larger if class Y is included as a category in the dataset (i.e., CUB and ImageNet include only a small subset of real bird species). We thus simulated class subsets by randomly picking N ≤ 555 categories to comprise our sample dataset. Then, we randomly sampled M images from the N selected categories and corrupted these images by swapping their labels with another image randomly selected from all 555 categories of the original NABirds dataset. We used this corrupted dataset of N categories to build a classifier. Note that as the number of categories N within the dataset increases, the probability that a corrupted label is actually in the dataset increases. Figure 8 plots the results of this experiment for different configurations. We summarize our conclusions below:

5-10% training error was tolerable: Figures 8b and 8c analyze the situation where only the training set is corrupted, and the ground truth testing set remains pure. We see that the increases in classification error due to 5% and even 15% corruption are remarkably low, much smaller than 5% and 15%. This result held regardless of the number of classes or computer vision algorithm. This suggests that the level of annotation error in CUB and ImageNet (≈ 5%) might not be a big deal.

Obtaining a clean test set was important: On the other hand, one cannot accurately measure the performance of computer vision algorithms without a high quality test set, as demonstrated in Figure 8a, which measures performance when the test set is also corrupted. There is clearly a significant drop in performance with even 5% corruption. This highlights a potential problem with CUB and ImageNet, where train and test sets are equally corrupted.

Effect of computer vision algorithm: Figure 8b uses computer vision algorithms based on raw image-level CNN-fc6 features (obtaining an accuracy of 35% on 555 categories), while Figure 8c uses a more sophisticated method [5] based on pose normalization and features from multiple CNN layers (obtaining an accuracy of 74% on 555 categories). Label corruption caused similar additive increases in test error for both methods; however, this was a much higher percentage of the total test error for the higher performing method.
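For concreteness, here is a minimal sketch of the corruption procedure described at the start of this subsection, assuming the dataset is a list of (image, label) pairs; the function name and data layout are illustrative, not the authors' implementation.

# Sketch of the Section 7.1 corruption procedure: pick N of the 555 categories,
# then corrupt M of their images by swapping in the label of a random image
# drawn from the full 555-category dataset.
import random

def corrupt_subset(dataset, all_labels, n_categories, m_corrupt, seed=0):
    # dataset: list of (image_id, label); all_labels: labels of the full
    # 555-category NABirds set, used as the source of corrupted labels.
    rng = random.Random(seed)
    categories = sorted({lab for _, lab in dataset})
    chosen = set(rng.sample(categories, n_categories))
    subset = [(img, lab) for img, lab in dataset if lab in chosen]

    corrupt_idx = set(rng.sample(range(len(subset)), m_corrupt))
    corrupted = []
    for i, (img, lab) in enumerate(subset):
        if i in corrupt_idx:
            # The swapped-in label may or may not belong to one of the N chosen
            # categories, which is exactly the effect studied in Section 7.1.
            lab = rng.choice(all_labels)
        corrupted.append((img, lab))
    return corrupted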
7.2. Error Analysis on Real CUB-200-2011 Labels

The results from the previous section were obtained on simulated label corruptions. We performed additional analysis on real annotation errors on CUB-200-2011. CUB-200-2011 was collected by performing Flickr image search queries for each species and filtering the results using votes from multiple MTurkers. We had experts provide ground truth labels for all Flickr search results on 40 randomly selected categories. In Figure 9, we compare different possible strategies for constructing a training set based on thresholding the number of MTurk votes. Each method resulted in a different training set size and level of precision and recall. For each training set, we measured the accuracy of a computer vision classifier on a common, expert-vetted test set. The classifier was based on CNN-fc6 features from bounding box regions. Results are summarized below:
The level of training error in CUB was tolerable: The results were consistent with the results predicted by the simulated label corruption experiments, where a 5-15% error rate in the training labels yielded only a very small (roughly 1%) increase in test error. This provides comfort that CUB-200-2011 and ImageNet are still useful despite label errors. We emphasize, though, that an error-free test set is still necessary; this is still an advantage of NABirds over CUB and ImageNet.

Keeping all Flickr images without any MTurk curation does surprisingly well: This "free dataset" was as good as the expert dataset and slightly better than the MTurk curated datasets. The raw Flickr image search results had a reasonably high precision of 81%. Keeping all images resulted in more training images than the MTurk and expert filtered datasets. If we look at the voter agreement and the corresponding dataset training sizes, we see that having high MTurk agreement results in much smaller training set sizes and a correspondingly low recall.

Quantity can be more important than quality: This underlines the point that having a large training set is extremely important, and having strict requirements on annotation quality can come at the expense of reducing training set size. We randomly reduced the size of the training set within the 40 class dataset and measured the performance of each resulting computer vision classifier. The results are shown in Table 2; we see that classification accuracy is more sensitive to training set size than it was to label corruption (see Figures 8b and 9).

Similar results when scaling to more classes: One caveat is that the above results were obtained on a 40 class subset, which was the limit of what was reasonable to ask experts to annotate all Flickr image search results. It is possible that annotation quality becomes more important as the number of classes in the dataset grows. To test this, we had experts go through all 200 classes in CUB-200-2011, annotating all images that were included in the dataset (see Section 6). We obtained a similar result as on the 40-class subset, where the expert filtered dataset performed at about the same level as the original CUB-200-2011 trainset that contains 4-5% error. These results are consistent with the simulated label corruption experiments in Figure 8b.

Scale of training set   1      1/2    1/4    1/8    1/16   1/32   1/64
Accuracy (ACC)          .77    .73    .676   .612   .517   .43    .353

Table 2: Classification accuracy with reduced training set size. See Section 7.2.

Dataset   Images   ACC
vote 0    6475     0.78
vote 1    6467     0.78
vote 2    6080     0.77
vote 3    5002     0.77
vote 4    3410     0.75
vote 5    1277     0.68
expert    5257     0.78

Figure 9: Different datasets can be built up when modifying the MTurker agreement requirement (the plot shows precision versus recall for each dataset; the table above lists each dataset's size and classification accuracy). Increasing the agreement requirement results in a dataset with low numbers of false positives and lower amounts of training data due to a high number of false negatives. A classifier trained on all the images performs as well as or better than the datasets that attempt to clean up the data. See Section 7.2.
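The vote-thresholding comparison behind Figure 9 can be expressed compactly. The sketch below is a minimal illustration assuming each candidate Flickr image carries an MTurk yes-vote count and an expert ground-truth flag; the field names and sweep range are assumptions, not the authors' code.

# Sketch of the Figure 9 comparison: build a training set from Flickr search
# results by requiring at least `min_votes` MTurk votes, then measure its
# precision/recall against expert ground truth.
def build_training_set(candidates, min_votes):
    # candidates: list of dicts with keys 'image', 'votes' (MTurk yes-votes)
    # and 'expert_ok' (expert says the image truly shows the species).
    return [c for c in candidates if c["votes"] >= min_votes]

def precision_recall(selected, candidates):
    true_pos = sum(1 for c in selected if c["expert_ok"])
    all_pos = sum(1 for c in candidates if c["expert_ok"])
    precision = true_pos / len(selected) if selected else 0.0
    recall = true_pos / all_pos if all_pos else 0.0
    return precision, recall

def sweep(candidates, max_votes=5):
    # min_votes = 0 corresponds to keeping the raw Flickr results ("vote 0").
    for t in range(max_votes + 1):
        sel = build_training_set(candidates, t)
        p, r = precision_recall(sel, candidates)
        print(f"min_votes={t}: images={len(sel)} precision={p:.2f} recall={r:.2f}")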
8. Conclusion

We introduced tools for crowdsourcing computer vision annotations using citizen scientists. In collecting a new expert-curated dataset of 48,562 images over 555 categories, we found that citizen scientists provide significantly higher quality labels than Mechanical Turk workers, and found that Turkers have alarmingly poor performance annotating fine-grained classes. This has resulted in error rates of over 4% in fine-grained categories in popular datasets like CUB-200-2011 and ImageNet. Despite this, we found that learning algorithms based on CNN features and part localization were surprisingly robust to mislabeled training examples as long as the error rate is not too high, and we would like to emphasize that ImageNet and CUB-200-2011 are still very useful and relevant datasets for research in computer vision.

Our results so far have focused on experiences in a single domain (birds) and have resulted in a new publicly available tool for bird species identification. We are currently working on expanding to other types of categories such as shoes and Lepidoptera. Given that over 500 citizen scientists helped provide high quality annotations in just a single domain, working with citizen scientists has the potential to generate datasets of unprecedented size and quality while encouraging the landscape of computer vision research to shape itself around the interests of end users.

9. Acknowledgments

We would like to thank Nathan Goldberg, Ben Barkley, Brendan Fogarty, Graham Montgomery, and Nathaniel Hernandez for assisting with the user experiments. We appreciate the feedback and general guidance from Miyoko Chu, Steve Kelling, Chris Wood and Alex Chang. This work was supported in part by a Google Focused Research Award, the Jacobs Technion-Cornell Joint Research Fund, and Office of Naval Research MURI N000141010933.
References

[1] A. Angelova and S. Zhu. Efficient Object Detection and Segmentation for Fine-Grained Recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2013.
[2] T. Berg, J. Liu, S. W. Lee, M. L. Alexander, D. W. Jacobs, and P. N. Belhumeur. Birdsnap: Large-Scale Fine-Grained Visual Categorization of Birds. In 2014 IEEE Conference on Computer Vision and Pattern Recognition, pages 2019–2026. IEEE, June 2014.
[3] T. L. Berg, A. C. Berg, and J. Shih. Automatic attribute discovery and characterization from noisy web data. In Proceedings of the 11th European Conference on Computer Vision: Part I, ECCV'10, pages 663–676, Berlin, Heidelberg, 2010. Springer-Verlag.
[4] B. J. Boom, J. He, S. Palazzo, P. X. Huang, C. Beyan, H.-M. Chou, F.-P. Lin, C. Spampinato, and R. B. Fisher. A research tool for long-term and continuous analysis of fish assemblage in coral-reefs using underwater camera footage. Ecological Informatics, 23:83–97, Sept. 2014.
[5] S. Branson, G. Van Horn, S. Belongie, and P. Perona. Bird species categorization using pose normalized deep convolutional nets. arXiv preprint arXiv:1406.2952, 2014.
[6] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A Large-Scale Hierarchical Image Database. In CVPR09, 2009.
[7] J. Deng, J. Krause, and L. Fei-Fei. Fine-grained crowdsourcing for fine-grained recognition. In Computer Vision and Pattern Recognition (CVPR), 2013 IEEE Conference on, pages 580–587. IEEE, 2013.
[8] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The pascal visual object classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, June 2010.
[9] L. Fei-Fei, R. Fergus, and P. Perona. One-shot learning of object categories. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 28(4):594–611, 2006.
[10] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. arXiv preprint arXiv:1311.2524, 2013.
[11] U. Gneezy and A. Rustichini. Pay enough or don't pay at all. Quarterly Journal of Economics, pages 791–810, 2000.
[12] G. Griffin, A. Holub, and P. Perona. Caltech-256 Object Category Dataset. Technical Report CNS-TR-2007-001, California Institute of Technology, 2007.
[13] P. G. Ipeirotis and E. Gabrilovich. Quizz: targeted crowdsourcing with a billion (potential) users. pages 143–154, Apr. 2014.
[14] P. G. Ipeirotis, F. Provost, V. S. Sheng, and J. Wang. Repeated labeling using multiple noisy labelers. Data Mining and Knowledge Discovery, 28(2):402–441, Mar. 2013.
[15] A. Khosla, N. Jayadevaprakash, B. Yao, and L. Fei-Fei. Novel dataset for fine-grained image categorization. In First Workshop on Fine-Grained Visual Categorization, IEEE Conference on Computer Vision and Pattern Recognition, Colorado Springs, CO, June 2011.
[16] J. Krause, J. Deng, M. Stark, and L. Fei-Fei. Collecting a Large-Scale Dataset of Fine-Grained Cars. In Second Workshop on Fine-Grained Visual Categorization (FGVC2), 2013.
[17] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, 2012.
[18] S. Kuznetsov. Motivations of contributors to Wikipedia. ACM SIGCAS Computers and Society, 36(2):1, 2006.
[19] S. Lazebnik, C. Schmid, and J. Ponce. Semi-local affine parts for object recognition. In Proc. BMVC, pages 98.1–98.10, 2004. doi:10.5244/C.18.98.
[20] T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C. L. Zitnick. Microsoft COCO: Common objects in context. arXiv preprint arXiv:1405.0312, 2014.
[21] J. Liu, A. Kanazawa, D. W. Jacobs, and P. N. Belhumeur. Dog Breed Classification Using Part Localization. In ECCV, 2012.
[22] C. Long, G. Hua, and A. Kapoor. Active Visual Recognition with Expertise Estimation in Crowdsourcing. In 2013 IEEE International Conference on Computer Vision, pages 3000–3007. IEEE, Dec. 2013.
[23] S. Maji, J. Kannala, E. Rahtu, M. Blaschko, and A. Vedaldi. Fine-grained visual classification of aircraft. Technical report, 2013.
[24] G. Martinez-Munoz, N. Larios, E. Mortensen, A. Yamamuro, R. Paasch, N. Payet, D. Lytle, L. Shapiro, S. Todorovic, A. Moldenke, and T. Dietterich. Dictionary-free categorization of very similar objects via stacked evidence trees. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 549–556. IEEE, June 2009.
[25] M.-E. Nilsback and A. Zisserman. Automated flower classification over a large number of classes. In Proceedings of the Indian Conference on Computer Vision, Graphics and Image Processing, Dec. 2008.
[26] O. Nov. What motivates wikipedians? Communications of the ACM, 50(11):60–64, 2007.
[27] O. M. Parkhi, A. Vedaldi, C. V. Jawahar, and A. Zisserman. The truth about cats and dogs. In ICCV, 2011.
[28] V. C. Raykar, S. Yu, L. H. Zhao, A. Jerebko, C. Florin, G. H. Valadez, L. Bogoni, and L. Moy. Supervised learning from multiple experts: whom to trust when everyone lies a bit. In Proceedings of the 26th Annual International Conference on Machine Learning, pages 889–896. ACM, 2009.
[29] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, Z. Huang, A. Karpathy, A. Khosla, M. Bernstein, A. C. Berg, and L. Fei-Fei. ImageNet Large Scale Visual Recognition Challenge, 2014.
[30] V. S. Sheng, F. Provost, and P. G. Ipeirotis. Get another label? Improving data quality and data mining using multiple, noisy labelers. In Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD 08), page 614, New York, NY, USA, Aug. 2008. ACM Press.
[31] A. Sorokin and D. Forsyth. Utility data annotation with Amazon Mechanical Turk. In 2008 IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, pages 1–8. IEEE, June 2008.
[32] Y. Taigman, M. Yang, M. Ranzato, and L. Wolf. DeepFace: Closing the gap to human-level performance in face verification. In Computer Vision and Pattern Recognition (CVPR), 2014 IEEE Conference on, pages 1701–1708. IEEE, 2014.
[33] L. Von Ahn. Games with a purpose. Computer, 39(6):92–94, 2006.
[34] L. Von Ahn, B. Maurer, C. McMillen, D. Abraham, and M. Blum. reCAPTCHA: Human-based character recognition via web security measures. Science, 321(5895):1465–1468, 2008.
[35] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset. Technical Report CNS-TR-2011-001, California Institute of Technology, 2011.
[36] P. Welinder, S. Branson, P. Perona, and S. Belongie. The Multidimensional Wisdom of Crowds. In J. D. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 2424–2432. Curran Associates, Inc., 2010.
[37] P. Welinder and P. Perona. Online crowdsourcing: Rating annotators and obtaining cost-effective labels. In 2010 IEEE Computer Society Conference on Computer Vision and Pattern Recognition - Workshops, pages 25–32. IEEE, June 2010.
[38] J. Whitehill, T.-f. Wu, J. Bergsma, J. R. Movellan, and P. L. Ruvolo. Whose vote should count more: Optimal integration of labels from labelers of unknown expertise. In Advances in Neural Information Processing Systems, pages 2035–2043, 2009.
[39] H.-L. Yang and C.-Y. Lai. Motivations of Wikipedia content contributors. Computers in Human Behavior, 26(6):1377–1383, 2010.
