Building A Bird Recognition App and Large Scale Dataset With Citizen Scientists - The Fine Print in Fine-Grained Dataset Collection
1. Introduction
Computer vision systems – catalyzed by the availability of new, larger scale datasets like ImageNet [6] – have recently obtained remarkable performance on object recognition [17, 32] and detection [10]. Computer vision has entered an era of big data, where the ability to collect larger datasets – larger in terms of the number of classes, the number of images per class, and the level of annotation per image – appears to be paramount for continuing performance improvement and expanding the set of solvable applications.

Figure 1: Merlin Photo ID: a publicly available tool for bird species classification built with the help of citizen scientists. The user uploaded a picture of a bird, and server-side computer vision algorithms identified it as an immature Cooper's Hawk.

1 merlin.allaboutbirds.org
Unfortunately, expanding datasets in this fashion introduces new challenges beyond just increasing the amount of [...] structure to match the interests of real users in a domain of interest. Whereas datasets like ImageNet [6] and CUB-200-2011 [35] have been valuable in fostering the development of computer vision algorithms, the particular set of categories chosen is somewhat arbitrary and of limited use to real applications. The drawback of using citizen scientists instead of Mechanical Turkers is that the throughput of collecting annotations may be lower, and computer vision researchers must take the time to figure out how to partner with different communities for each domain.

We collected a large dataset of 48,562 images over 555 categories of birds, with part annotations and bounding boxes for each image, using a combination of citizen scientists, experts, and Mechanical Turkers. We used this dataset to build a publicly available application for bird species classification. In this paper, we provide details and analysis of our experiences with the hope that they will be useful and informative for other researchers in computer vision working on collecting larger fine-grained image datasets. We address questions like: What is the relative skill level of different types of annotators (MTurkers, citizen scientists, and experts) for different types of annotations (fine-grained categories and parts)? What are the resulting implications in terms of annotation quality, annotation cost, human annotator time, and the time it takes a requester to finish a dataset? Which types of annotations are suitable for different pools of annotators? What types of annotation GUIs are best for each respective pool of annotators? How important is annotation quality for the accuracy of learned computer vision algorithms? How significant are the quality issues in existing datasets like CUB-200-2011 and ImageNet, and what impact have they had on computer vision performance?

We summarize our contributions below:

1. Methodologies to collect high quality, fine-grained computer vision datasets using a new type of crowd annotators: citizen scientists.

2. NABirds: a large, high quality dataset of 555 categories curated by experts.

3. Merlin Photo ID: a publicly available tool for bird species classification.

4. Detailed analysis of annotation quality, time, cost, and throughput of MTurkers, citizen scientists, and experts for fine-grained category and part annotations.

5. Analysis of the annotation quality of the popular datasets CUB-200 and ImageNet.

6. Empirical analysis of the effect that annotation quality has when training state-of-the-art computer vision algorithms for categorization.

A high-level summary of our findings is: a) Citizen scientists have 2-4 times lower error rates than MTurkers at fine-grained bird annotation, while annotating images faster and at zero cost. Over 500 citizen scientists annotated images in our dataset – if we can expand beyond the domain of birds, the pool of possible citizen scientist annotators is massive. b) A curation-based interface for visualizing and manipulating the full dataset can further improve the speed and accuracy of citizen scientists and experts. c) Even when averaging answers from 10 MTurkers together, MTurkers have a more than 30% error rate at 37-way bird classification. d) The generally high quality of Flickr search results (84% accurate when searching for a particular species) greatly mitigates the errors of MTurkers when collecting fine-grained datasets. e) MTurkers are as accurate and fast as citizen scientists at collecting part location annotations. f) MTurkers have faster throughput in collecting annotations than citizen scientists; however, using citizen scientists it is still realistic to annotate a dataset of around 100k images in a domain like birds in around 1 week. g) At least 4% of images in CUB-200-2011 and ImageNet have incorrect class labels, along with numerous other issues including inconsistencies in the taxonomic structure, biases in terms of which images were selected, and the presence of duplicate images. h) Despite these problems, these datasets are still effective for computer vision research; when training CNN-based computer vision algorithms with corrupted labels, the resulting increase in test error is surprisingly low and significantly less than the level of corruption. i) A consequence of findings (a), (d), and (h) is that training computer vision algorithms on unfiltered Flickr search results (with no annotation) can often outperform algorithms trained when filtering by MTurker majority vote.

2. Related Work

Crowdsourcing with Mechanical Turk: Amazon's Mechanical Turk (AMT) has been an invaluable tool that has allowed researchers to collect datasets of significantly larger size and scope than previously possible [31, 6, 20]. AMT makes it easy to outsource simple annotation tasks to a large pool of workers. Although these workers will usually be non-experts, for many tasks it has been shown that repeated labeling of examples by multiple non-expert workers can produce high quality labels [30, 37, 14]. Annotation of fine-grained categories is a possible counter-example, where the average annotator may have little to no prior knowledge with which to make a reasonable guess at fine-grained labels. For example, the average worker has little to no prior knowledge as to what type of bird a "Semipalmated Plover" is, and her ability to provide a useful guess is largely dependent on the efforts of the dataset collector to provide useful instructions or illustrative examples. Since our objective is to collect datasets of thousands of classes, generating high quality instructions for each category is difficult or infeasible.

Crowdsourcing with expertise estimation: A possible solution is to try to automatically identify the subset of workers who have adequate expertise for fine-grained classification [36, 38, 28, 22]. Although such models are promising, it seems likely that the subset of Mechanical Turkers with expertise in a particular fine-grained domain is small enough to make such methods impractical or challenging.

Games with a purpose: Games with a purpose target alternate crowds of workers who are incentivized by the construction of annotation tasks that also provide some entertainment value. Notable examples include the ESP Game [33], reCAPTCHA [34], and BubbleBank [7]. A partial inspiration for our work was Quizz [13], a system to tap into new, larger pools of unpaid annotators using Google AdWords to help find and recruit workers with the applicable expertise. A limitation of games with a purpose is that they require some artistry to design tools that can engage users.

Citizen science: The success of Wikipedia, where citizen scientists have collaborated to generate a large, high quality web-based encyclopedia, is another major inspiration for our work. Studies have shown that citizen scientists are incentivized by altruism, sense of community, and reciprocity [18, 26, 39], and such incentives can lead to higher quality work than monetary incentives [11].

Datasets: Progress in object recognition has been accelerated by dataset construction. These advances are fueled both by the release and availability of each dataset and by the subsequent competitions held on them. Key datasets/competitions in object recognition include Caltech-101 [9], Caltech-256 [12], Pascal VOC [8] and ImageNet/ILSVRC [6, 29]. Fine-grained object recognition is no exception to this trend. Various domains have already had datasets introduced, including Birds (the CUB-200 [35] and recently announced Birdsnap [2] datasets), Flowers [25, 1], Dogs and Cats [15, 27, 21], Stoneflies [24], Butterflies [19] and Fish [4], along with man-made domains such as Airplanes [23], Cars [16], and Shoes [3].

3. Crowdsourcing with Citizen Scientists

[...] community, better than vision researchers, and so they can ensure that the resulting datasets are directed towards solving real world problems.

A connection must be established with these communities before they can be utilized. We worked with ornithologists at the Cornell Lab of Ornithology to build NABirds. The Lab of Ornithology provided a perfect conduit to tap into the large citizen scientist community surrounding birds. Our partners at the Lab of Ornithology described that the birding community, and perhaps many other taxon communities, can be segmented into several different groups, each with their own particular benefits. We built custom tools to take advantage of each of these segments.

3.1. Experts

Experts are the professionals of the community, and our partners at the Lab of Ornithology served this role. Figure 4 is an example of an expert management tool (Vibe3) that was designed to let expert users quickly and efficiently curate images and manipulate the taxonomy of a large dataset. Beyond simple image storage, tagging, and sharing, the benefit of this tool is that it lets the experts define the dataset taxonomy as they see fit, and allows for dynamic changing of the taxonomy as the need arises. For NABirds, an interesting result of this flexibility is that bird species were further subdivided into "visual categories." A "visual category" marks a sex, age, or plumage attribute of the species that results in a visually distinctive difference from other members of the same species; see Figure 2. This type of knowledge of visual variance at the species level would have been difficult to capture without the help of someone knowledgeable about the domain.

3 vibe.visipedia.org

Figure 4: Expert interface for rapid and efficient curation of images, and easy modification of the taxonomy. The taxonomy is displayed on the left and is similar to a file system structure. See Section 3.1.

3.2. Citizen Scientist Experts

After the experts, these individuals are the top tier, most skilled members of the community. They have the confidence and experience to identify easily confused classes of the taxonomy. For the birding community, these individuals were identified by their participation in eBird, a resource that allows birders to record and analyze their bird sightings.4 Figure 3a shows a tool that allows these members to take bird quizzes. The tool presents the user with a series of images and requests the species labels. The user can supply the label using the autocomplete box or, if they are not sure, they can browse through a gallery of possible answers. At the end of the quiz, their answers can be compared with other expert answers.

4 ebird.org

Figure 3: (a) This interface was used to collect category labels on images. Users could either use the autocomplete box or scroll through a gallery of possible birds. (b) This interface was used to collect part annotations on the images. Users were asked to mark the visibility and location of 11 parts. See Sections 3.2 and 3.3.

3.3. Citizen Scientist Turkers

This is a large, passionate segment of the community motivated to help their cause. This segment is not necessarily as skilled in difficult identification tasks, but they are capable of assisting in other ways. Figure 3b shows a part annotation task that we deployed to this segment. The task was to simply click on all parts of the bird. The size of this population should not be underestimated. Depending on how these communities are reached, this population could be larger than the audience reached on typical crowdsourcing platforms.

4. NABirds

We used a combination of experts, citizen scientists, and MTurkers to build NABirds, a new bird dataset of 555 categories with a total of 48,562 images. Members of the birding community provided the images, the experts of the community curated the images, and a combination of citizen scientist turkers and MTurkers annotated 11 bird parts on every image, along with bounding boxes. This dataset is free to use for the research community.

The taxonomy for this dataset contains 1011 nodes, and the categories cover the most common North American birds. These leaf categories were specifically chosen to allow for the creation of bird identification tools to help novice birders. Improvements in classification or detection accuracy by vision researchers will therefore have a straightforward and meaningful impact on the birding community and their identification tools.

We used techniques from [5] to baseline performance on this dataset. Using Caffe and the fc6-layer features extracted from the entire image, we achieved an accuracy of 35.7%. Using the best performing technique from [5] with ground truth part locations, we achieved an accuracy of 75%.
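For readers who want a concrete picture of this style of baseline, the sketch below trains a linear classifier on precomputed image-level CNN features. It is only an illustration, not the authors' pipeline: the use of scikit-learn, the feature file names, and the regularization constant are all assumptions; only the idea (fixed fc6-style features plus a linear classifier over 555 classes) comes from the text above.

# Minimal sketch: linear classifier on precomputed image-level CNN features.
# File names and hyperparameters are illustrative assumptions.
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score

# X_*.npy: one row of CNN features (e.g. fc6) per image; y_*.npy: integer species ids.
X_train = np.load("fc6_train.npy")
y_train = np.load("labels_train.npy")
X_test = np.load("fc6_test.npy")
y_test = np.load("labels_test.npy")

# L2-normalize the activations, a common preprocessing step for CNN features.
X_train = X_train / np.linalg.norm(X_train, axis=1, keepdims=True)
X_test = X_test / np.linalg.norm(X_test, axis=1, keepdims=True)

clf = LogisticRegression(C=1.0, max_iter=1000)  # multi-class linear classifier
clf.fit(X_train, y_train)
print("top-1 accuracy:", accuracy_score(y_test, clf.predict(X_test)))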
5. Annotator Comparison

In this section we compare annotations performed by Amazon Mechanical Turk workers (MTurkers) with annotations performed by citizen scientists reached through the Lab of Ornithology's Facebook page. The goal of these experiments was to quantify the following aspects of annotation tasks: 1) Annotation Error: the fraction of incorrect annotations. 2) Annotation Time: the average amount of human time required per annotation. 3) Annotation Cost: the average cost in dollars per annotation. 4) Annotation Throughput: the average number of annotations obtainable per second; this scales with the total size of the pool of annotators.
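These four quantities can be computed directly from per-annotation logs. The sketch below is our own illustration of one way to do it; the record layout and field names are assumptions rather than the format actually used in the study.

# Computing the four comparison metrics from per-annotation log records.
# The record format is an assumption for this illustration; it also assumes
# at least one record per worker pool.
from dataclasses import dataclass
from typing import List

@dataclass
class Annotation:
    pool: str          # "mturk" or "citizen"
    is_correct: bool
    seconds: float     # human time spent on this annotation
    cost: float        # 0.0 for unpaid citizen scientists
    timestamp: float   # seconds since the task was posted

def summarize(records: List[Annotation], pool: str):
    rs = [r for r in records if r.pool == pool]
    n = len(rs)
    error = sum(not r.is_correct for r in rs) / n                 # annotation error
    mean_time = sum(r.seconds for r in rs) / n                    # annotation time
    mean_cost = sum(r.cost for r in rs) / n                       # annotation cost
    span = max(r.timestamp for r in rs) - min(r.timestamp for r in rs)
    throughput = n / span if span > 0 else float("inf")           # annotations per second
    return {"error": error, "time": mean_time, "cost": mean_cost, "throughput": throughput}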
In order to compare the skill levels of the different annotator groups directly, we chose a common user interface that we considered to be appropriate for both citizen scientists and MTurkers. For category labeling tasks, we used the quiz tool discussed in Section 3.2. Each question presented the user with an image of a bird and requested the species label. To make the task feasible for MTurkers, we allowed users to browse through galleries of each possible species and limited the space of possible answers to < 40 categories. Each quiz was focused on a particular group of birds, either sparrows or shorebirds. Random chance was 1/37 for the sparrows and 1/32 for the shorebirds. At the end of the quiz, users were given a score (the number of correct answers) and could view their results. Figure 3a shows our interface. We targeted the citizen scientist experts by posting the quizzes on the eBird Facebook page.

Figure 5: Histogram of quiz scores. Each quiz has 10 images; a perfect score is 10. (a) Score distributions for the sparrow quizzes; random chance per image is 2.7%. (b) Score distributions for the shorebird quizzes; random chance per image is 3.1%. See Section 5.

Figure 5 shows the distribution of scores achieved by the two different worker groups on the two different bird groups. Not surprisingly, citizen scientists had better performance on the classification task than MTurkers; however, we were uncertain as to whether or not averaging a large number of MTurkers could yield comparable performance. Figure 6a plots the time taken to achieve a certain error rate by combining multiple annotators for the same image using majority voting. From this figure we can see that citizen scientists not only have a lower median time per image (about 8 seconds vs 19 seconds), but that one citizen scientist expert label is more accurate than the average of 10 MTurker labels. We note that we are using a simple-as-possible (but commonly used) crowdsourcing method, and the performance of MTurkers could likely be improved by more sophisticated techniques such as CUBAM [36]. However, the magnitude of the difference between the two groups and the overall large error rate of MTurkers led us to believe that the problem could not be solved simply by using better crowdsourcing models.
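To see why naive majority voting can fail to rescue low individual accuracy, the simulation below (our own illustration; the per-worker accuracies and confusion structure are assumed, not measured) contrasts two regimes: workers whose mistakes are spread uniformly over the other classes, and workers who tend to pick the same wrong look-alike species.

# Error rate of a k-worker majority vote on an n-way task, under assumed
# per-worker behavior. Class 0 is taken as the true label without loss of generality.
import numpy as np

def majority_vote_error(p_correct, n_classes, n_workers, n_confusable=None,
                        n_trials=20000, seed=0):
    # Wrong votes are uniform over the other classes, or, if n_confusable is
    # given, concentrated on that many similar-looking classes.
    rng = np.random.default_rng(seed)
    high = (n_confusable + 1) if n_confusable else n_classes
    errors = 0
    for _ in range(n_trials):
        correct = rng.random(n_workers) < p_correct
        votes = np.where(correct, 0, rng.integers(1, high, size=n_workers))
        counts = np.bincount(votes, minlength=n_classes)
        winners = np.flatnonzero(counts == counts.max())
        if rng.choice(winners) != 0:          # ties broken at random
            errors += 1
    return errors / n_trials

for k in (1, 5, 10):
    easy = majority_vote_error(0.55, 37, k)                  # mistakes spread uniformly
    hard = majority_vote_error(0.40, 37, k, n_confusable=1)  # mistakes pile onto one look-alike
    print(f"{k:2d} workers  independent mistakes: {easy:.3f}  shared confusion: {hard:.3f}")

With independent, uniformly spread mistakes the vote converges quickly to the correct label; when workers share the same confusion, the vote can converge to the wrong label no matter how many workers are added. The observation above that 10 averaged MTurkers still exceed 30% error suggests that real MTurker mistakes on fine-grained species are closer to the second regime.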
Figure 6c measures the raw throughput of the workers, highlighting the size of the MTurk worker pool. With citizen scientists, we noticed a spike of participation when the annotation task was first posted on Facebook, and then a quick tapering off of participation. Finally, Figure 6b measures the cost associated with the different levels of error; the citizen scientists were unpaid.

Figure 6: Category labeling tasks: workers used the quiz interface (see Figure 3a) to label the species of birds in images. (Panels plot error against annotation time in hours, error against annotation cost in dollars, and annotations completed against elapsed time in hours.) (a) Citizen scientists are more accurate and faster on each image than MTurkers; if the citizen scientists use an expert interface (Vibe), they are even faster and more accurate. (b) Citizen scientists are not compensated monetarily; they donate their time to the task. (c) The total throughput of MTurk is still greater, meaning you can finish annotating your dataset sooner, but this comes at a monetary cost. See Section 5.

We performed a similar analysis with part annotations. For this task we used the tool shown in Figure 3b. Workers from the two different groups were given an image and asked to specify the visibility and position of 11 different bird parts. We targeted the citizen scientist turkers with this task by posting the tool on the Lab of Ornithology's Facebook page. The interface for the tool was kept the same between the worker groups. Figures 7a, 7b, and 7c detail the results of this test. From Figure 7a we can see that there is not a difference in the obtainable quality between the two worker groups, and that MTurkers tended to be faster at the task. Figure 7c again reveals that the raw throughput of MTurkers is larger than that of the citizen scientists. The primary benefit of using citizen scientists for this particular case is made clear by their zero cost in Figure 7b.

Figure 7: Part annotation tasks: workers used the interface in Figure 3b to label the visibility and location of 11 parts. (Panel x-axes are annotation time in hours, annotation cost in dollars, and elapsed time in hours; panel (c) shows annotations completed.) (a) For this task, as opposed to the category labeling task, citizen scientists and MTurkers perform comparably on individual images. (b) Citizen scientists donate their time and are not compensated monetarily. (c) The raw throughput of MTurk is greater than that of the citizen scientists, meaning you can finish your total annotation tasks sooner, but this comes at a cost. See Section 5.

Summary: From these results, we can see that there are clear distinctions between the two different worker pools. Citizen scientists are clearly more capable at labeling fine-grained categories than MTurkers. However, the raw throughput of MTurk means that you can finish annotating your dataset sooner than when using citizen scientists. If the annotation task does not require much domain knowledge (such as part annotation), then MTurkers can perform on par with citizen scientists. Gathering fine-grained category labels with MTurk should be done with care, as we have shown that naive averaging of labels does not converge to the correct label. Finally, the cost savings of using citizen scientists can be significant when the number of annotation tasks grows.

6. Measuring the Quality of Existing Datasets

CUB-200-2011 [35] and ImageNet [6] are two popular datasets with fine-grained categories. Both of these datasets were collected by downloading images from web searches and curating them with Amazon Mechanical Turk. Given the results in the previous section, we were interested in analyzing the errors present in these datasets. With the help of experts from the Cornell Lab of Ornithology, we examined these datasets, specifically the bird categories, for false positives.

CUB-200-2011: The CUB-200-2011 dataset has 200 classes, each with roughly 60 images. Experts went through the entire dataset and identified a total of 494 errors, about 4.4% of the entire dataset. There was a total of 252 images that did not belong in the dataset because their category was not represented, and a total of 242 images that needed to be moved to existing categories. Beyond this 4.4% error, an additional potential concern comes from dataset bias issues. CUB was collected by performing a Flickr image search for each species and using MTurkers to filter the results. A consequence is that the most difficult images tended to be excluded from the dataset altogether. By having experts annotate the raw Flickr search results, we found that on average 11.6% of correct images of each species were incorrectly filtered out of the dataset. See Section 7.2 for additional analysis.

ImageNet: There are 59 bird categories in ImageNet, each with about 1300 images in the training set. Table 1 shows the false positive counts for a subset of these categories. In addition to these numbers, it was our general impression that the error rate of ImageNet is probably at least as high as that of CUB-200 within fine-grained categories; for example, the synset "ruffed grouse, partridge, Bonasa umbellus" had overlapping definition and image content with the synset "partridge" beyond what was quantified in our analysis.

Category                                        Training Images   False Positives
magpie                                          1300              11
kite                                            1294              260
dowitcher                                       1298              70
albatross, mollymawk                            1300              92
quail                                           1300              19
ptarmigan                                       1300              5
ruffed grouse, partridge, Bonasa umbellus       1300              69
prairie chicken, prairie grouse, prairie fowl   1300              52
partridge                                       1300              55

Table 1: False positives from the ImageNet LSVRC dataset.
Figure 8: Analysis of error degradation with corrupted training labels. Each panel plots log(classification error) against log(number of categories) or log(dataset size) for 5%, 15%, and 50% label corruption versus pure labels. (a) Image-level features, train+test corruption: both the training and testing sets are corrupted; there is a significant difference when compared to the clean data. (b) Image-level features, train corruption only: only the training set is corrupted; the induced classification error is much less than the corruption level. (c) Localized features, train corruption only: only the training set is corrupted but more part-localized features are utilized; the induced classification error is still much less than the corruption level. See Section 7.1.
7. Effect of Annotation Quality & Quantity

In this section we analyze the effect of data quality and quantity on learned vision systems. Does the 4%+ error in CUB and ImageNet actually matter? We begin with simulated label corruption experiments to quantify the reduction in classification accuracy for different levels of error in Section 7.1, then perform studies on real corrupted data using an expert-vetted version of CUB in Section 7.2.

7.1. Label Corruption Experiments

In this experiment, we attempted to measure the effect of dataset quality by corrupting the image labels of the NABirds dataset. We speculated that if an image of true class X is incorrectly labeled as class Y, the effect might be larger if class Y is included as a category in the dataset (i.e., CUB and ImageNet include only a small subset of real bird species). We thus simulated class subsets by randomly picking N ≤ 555 categories to comprise our sample dataset. Then, we randomly sampled M images from the N selected categories and corrupted these images by swapping their labels with those of images randomly selected from all 555 categories of the original NABirds dataset. We used this corrupted dataset of N categories to build a classifier. Note that as the number of categories N within the dataset increases, the probability that a corrupted label is actually in the dataset increases. Figure 8 plots the results of this experiment for different configurations. We summarize our conclusions below:

5-10% Training error was tolerable: Figures 8b and 8c analyze the situation where only the training set is corrupted, and the ground truth testing set remains pure. We see that the increases in classification error due to 5% and even 15% corruption are remarkably low – much smaller than 5% and 15%. This result held regardless of the number of classes or the computer vision algorithm. This suggests that the level of annotation error in CUB and ImageNet (≈ 5%) might not be a big deal.

Obtaining a clean test set was important: On the other hand, one cannot accurately measure the performance of computer vision algorithms without a high quality test set, as demonstrated in Figure 8a, which measures performance when the test set is also corrupted. There is clearly a significant drop in performance with even 5% corruption. This highlights a potential problem with CUB and ImageNet, where train and test sets are equally corrupted.

Effect of computer vision algorithm: Figure 8b uses a computer vision algorithm based on raw image-level CNN-fc6 features (obtaining an accuracy of 35% on 555 categories), while Figure 8c uses a more sophisticated method [5] based on pose normalization and features from multiple CNN layers (obtaining an accuracy of 74% on 555 categories). Label corruption caused similar additive increases in test error for both methods; however, this was a much higher percentage of the total test error for the higher performing method.
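The corruption protocol above is simple to reproduce. The sketch below is our own paraphrase of it in code, with the array name, corruption rate, and random seed as assumptions: pick N of the 555 categories, keep their images, and overwrite the labels of a fraction of those images with labels drawn from the full 555-class label pool (so a corrupted label may fall outside the N-class subset).

# Minimal sketch of the Section 7.1 corruption protocol.
import numpy as np

def corrupt_subset(labels, n_categories, corruption_rate, n_total_classes=555, seed=0):
    rng = np.random.default_rng(seed)
    keep_classes = rng.choice(n_total_classes, size=n_categories, replace=False)
    keep_idx = np.flatnonzero(np.isin(labels, keep_classes))  # images of the N chosen classes

    corrupted = labels.copy()
    n_corrupt = int(corruption_rate * len(keep_idx))
    victims = rng.choice(keep_idx, size=n_corrupt, replace=False)
    # Replace each victim's label with that of an image drawn from ALL 555 classes.
    donors = rng.integers(0, len(labels), size=n_corrupt)
    corrupted[victims] = labels[donors]
    return keep_idx, corrupted

labels = np.load("nabirds_labels.npy")   # assumed: one class id (0..554) per image
subset_idx, noisy_labels = corrupt_subset(labels, n_categories=100, corruption_rate=0.15)
# Train on (subset_idx, noisy_labels[subset_idx]); evaluate on an uncorrupted test split.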
7.2. Error Analysis on Real CUB-200-2011 Labels

The results from the previous section were obtained on simulated label corruptions. We performed additional analysis on real annotation errors in CUB-200-2011. CUB-200-2011 was collected by performing Flickr image search queries for each species and filtering the results using votes from multiple MTurkers. We had experts provide ground truth labels for all Flickr search results on 40 randomly selected categories. In Figure 9, we compare different possible strategies for constructing a training set based on thresholding the number of MTurk votes. Each method resulted in a different training set size and level of precision and recall. For each training set, we measured the accuracy of a computer vision classifier on a common, expert-vetted test set. The classifier was based on CNN-fc6 features from bounding box regions. Results are summarized below:

Scale Size   1     1/2    1/4    1/8    1/16   1/32   1/64
ACC          .77   .73    .676   .612   .517   .43    .353
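As a companion to the vote-thresholding comparison described in Section 7.2, the sketch below (not the authors' code; the input arrays and file names are assumed) shows how training-set size, precision, and recall can be tabulated for each vote threshold given per-image MTurk vote counts and expert ground truth.

# Illustration of vote-thresholding strategies for training set construction.
# votes[i] = number of MTurkers who accepted candidate image i for a species;
# is_correct[i] = expert judgment. Both arrays are assumed inputs.
import numpy as np

def threshold_training_sets(votes, is_correct, max_votes=5):
    rows = []
    for t in range(max_votes + 1):
        selected = votes >= t
        n_sel = int(selected.sum())
        true_pos = int((selected & is_correct).sum())
        precision = true_pos / n_sel if n_sel else float("nan")
        recall = true_pos / max(int(is_correct.sum()), 1)
        rows.append((t, n_sel, precision, recall))
    return rows

votes = np.load("mturk_votes.npy")                    # assumed per-image vote counts
is_correct = np.load("expert_labels.npy").astype(bool)
for t, size, p, r in threshold_training_sets(votes, is_correct):
    print(f"threshold >= {t}: size={size}, precision={p:.3f}, recall={r:.3f}")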