Today we’ve launched our second Observation Accuracy Experiment (v0.2). Thanks to everyone helping us conduct these experiments. We're learning a lot about iNaturalist observation accuracy and how to improve it.
Changes in this Experiment
We made two changes to the experimental design from v0.1 based on feedback:
- We changed the validator criteria to require at least 3 improving identifications of the taxon from the same continent, since many people reported not feeling comfortable identifying taxa outside of their regions of expertise.
- We messaged candidate validators rather than emailing them, since many reported not noticing emails. We also left only a 4-day interval (rather than 2 weeks) between contacting validators and the deadline, since last time most validating happened within the first couple of days after candidate validators were contacted.
Eventually, we’d like to increase the sample size from 1,000 to 10,000, but we’re sticking with 1,000 until we get a few more kinks out of the methods. The page for this experiment is already live and the stats will update once a day until the validator deadline at the end of the month, but you won’t be able to drill into the bars to see the sample observations or the validators until the deadline has passed.
New Data Quality Assessment condition for photos unrelated to a single subject
We also made one change to iNaturalist functionality in response to findings from the study. We added a new “Evidence related to a single subject” condition to the Data Quality Assessment table to make it easier to remove observations with multiple photographs of unrelated subjects from the verifiable pool.
Two of the incorrect Research Grade observations in Experiment v0.1 were of this type, which we estimate accounts for ~350k observations in the entire iNaturalist dataset. Until now, the norm for making these observations casual has been to set an identification to the nearest taxonomic node shared by the multiple subjects and then vote no to “Based on the evidence, can the Community Taxon still be confirmed or improved?”, but many found this process clunky and confusing. We hope this new Data Quality Assessment condition will make it easier for the community to remove these observations from the verifiable pool, where they negatively impact data quality and distort features on iNaturalist (such as computer vision model training and the browse photo tool) that assume an observation’s photos all relate to a single labeled subject.
Thank you!
Thank you to everyone contacted as a candidate validator for participating in this experiment. We expect that considering location may decrease the percentage of samples validated compared to the previous experiment by constraining the pool of candidates available to validate, so we very much appreciate your help getting as much of this sample validated by the end of the month as possible. As always, please share any feedback or thoughts you may have on the topic of data quality. We’re excited to continue learning from these experiments and from your feedback about data quality on iNaturalist and what changes we can make to improve it!
Results (added 2/29/2024)
Thanks, everyone, for participating in this 2nd experiment. The validator deadline has now passed, meaning that the stats on the experiment page will no longer update, the validators are now visible, and the “Accuracy results by subset” bar graphs are now clickable, allowing you to drill in to see the observations behind the graphs.
In this second experiment, we estimated the accuracy of the iNaturalist Research Grade observation dataset to be 97% correct and the accuracy of the Needs ID subset to be 79% correct. The graph below shows the first experiment (v0.1) in lighter bars and this experiment (v0.2) in darker bars. The results are very close which is reassuring.
These estimates of average accuracy of the entire Research Grade observation dataset are in line with our expectations, largely because the iNaturalist dataset is skewed towards a relatively small number of common, easy to identify species (e.g. mallards, monarchs, etc.) that have an outsized impact on the average. Nonetheless, we wanted to touch on three sources of uncertainty in these estimates: validator skill, sample size, and the uncertain category.
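As a toy illustration of that skew (the numbers below are invented for illustration, not measured from the dataset), a small group of hard-to-identify species can have fairly low accuracy without moving the observation-weighted average very much:

```python
# Invented numbers, purely to illustrate how an observation-weighted
# average is dominated by the common, easy-to-identify species.
groups = [
    # (group, share of observations, accuracy within group)
    ("common, easy species",    0.90, 0.99),
    ("less common species",     0.08, 0.85),
    ("rare, difficult species", 0.02, 0.70),
]

overall = sum(share * accuracy for _, share, accuracy in groups)
print(f"Observation-weighted accuracy: {overall:.1%}")  # ~97.3%
```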
Validator skill
We are assuming that candidate validators can perfectly validate observations as Correct, Incorrect, or Uncertain. We know this assumption is not exactly correct because there is a small fraction of situations where more than one validator looked at the same observation and they disagreed (e.g. validator 1 says Taricha torosa and validator 2 says Taricha granulosa, OR validator 2 says Taricha because you can’t rule out Taricha granulosa). This happened 1.6% of the time in v0.1 and 1.2% of the time in v0.2. This error might be higher if we are underestimating disagreements because validations aren’t done blind (i.e. validators can see each other’s validations). But the error might also be lower because each observation was validated an average of 4 times, so, assuming the validations are mostly independent, even if one validator made a mistake it was reviewed an average of 3 more times. In future experiments, we’ll do more work to estimate uncertainty in the labels (Correct, Incorrect, or Uncertain) stemming from imperfect validator skill. But while this uncertainty stemming from validator skill is non-zero, it’s likely close to 0. Furthermore, there’s no reason to assume that this uncertainty would bias towards inflating the accuracy by overestimating the proportion correct; it could just as well bias towards underestimating the proportion correct.
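As a rough sketch of why repeated review helps, here is a back-of-the-envelope calculation that assumes validations are independent and treats the observed disagreement rate as a stand-in per-validation error rate (both are simplifying assumptions, not measured properties of the experiment):

```python
# Rough sketch only: assumes every validation is independent and uses the
# observed disagreement rate (~1.2-1.6%) as a stand-in per-validation error rate.
per_validation_error = 0.015

for reviews in range(1, 5):
    # chance that every one of `reviews` independent validators misses the error
    uncaught = per_validation_error ** reviews
    print(f"{reviews} validation(s): ~{uncaught:.2e} chance an error goes uncaught")
```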
Sample size
Because each observation being correct or not is like a coin flip, we can put confidence intervals on our estimates of average accuracy based on the sample size. As the sample size increases, the confidence intervals become narrower. We can compare our estimates and 95% confidence intervals from v0.1 to those we get if we pool v0.1 and v0.2 together, effectively doubling our sample sizes.
We already have a large enough sample size (n) to make fairly confident estimates for large subsets of the iNaturalist database, such as the Research Grade (RG) Accuracy Estimate. After v0.1 (n=534) our estimate and 95% confidence interval were 0.95 (0.93 - 0.96), and pooling v0.1 and v0.2 (n=1109) they are now 0.96 (0.94 - 0.97).
However, for smaller subsets such as RG Fungi our confidence intervals are still quite wide. After v0.1 (n=6) our estimate was 0.83 (0.36 - 1.00), and pooling v0.1 and v0.2 (n=19) it is still 0.95 (0.74 - 1.00) - so somewhere between 74% and 100% accurate.
For very small subsets, our sample size is still much too small to provide useful estimates. For example, for RG Rare (taxa with fewer than 1,000 observations) African Insects, our estimate even after pooling v0.1 and v0.2 (n=5) is 0.6 (0.15 - 0.95). For other subsets (e.g. RG Very Rare (<100 obs) African Insects) we have a sample size of zero and can’t make any estimate.
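For anyone who wants to play with these intervals, here is a minimal sketch in Python. The Correct counts are back-calculated from the rounded estimates above, and the exact (Clopper-Pearson) binomial interval is our assumption about the method rather than a description of exactly how the experiment page computes its numbers; it happens to reproduce the small-subset intervals closely.

```python
from scipy.stats import binomtest

# Correct/total counts back-calculated from the rounded estimates above;
# the exact (Clopper-Pearson) method is an assumption on our part.
subsets = {
    "RG Fungi, v0.1 (n=6)":          (5, 6),
    "RG Fungi, v0.1 + v0.2 (n=19)":  (18, 19),
    "RG Rare African Insects (n=5)": (3, 5),
}

for label, (correct, n) in subsets.items():
    ci = binomtest(correct, n).proportion_ci(confidence_level=0.95, method="exact")
    print(f"{label}: {correct / n:.2f} ({ci.low:.2f} - {ci.high:.2f})")
```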
The size of our sample is adequate for getting relatively confident estimates of average accuracy for the entire iNaturalist database and for large subsets (e.g. the RG subset, the North American subset, the Insect subset etc.), but these sample sizes are too small to yield confident estimates for more niche subsets (e.g. RG South American Fungi from 2022, etc.). We are very interested in variation in average accuracies across these subsets and look forward to growing the sample size to the point where we can better understand this variation.
Uncertain Category
Ideally, we’d be able to label all observations as Correct or Incorrect. But because we don’t have the capacity to get validations on all observations in the sample, some remain Uncertain. This was 3% of the RG subset in each of v0.1 and v0.2. Since we are calculating Accuracy as the percent Correct (as opposed to 1 minus the percent Incorrect), this Uncertain category biases us towards underestimating the Accuracy. The true average accuracy of the iNaturalist Research Grade observation dataset is somewhere from 0% to 3% higher than our estimates because of this bias, which is on par with the uncertainty resulting from sample size.
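To make the direction of that bias concrete, here is a tiny illustrative calculation; the Correct/Incorrect split is made up for illustration, and only the ~3% Uncertain figure comes from the experiments:

```python
# Illustrative split: only the ~3% Uncertain share comes from the experiments.
correct, incorrect, uncertain = 0.94, 0.03, 0.03

lower_bound = correct          # accuracy reported as "percent Correct"
upper_bound = 1 - incorrect    # accuracy if every Uncertain were in fact Correct

print(f"True accuracy lies between {lower_bound:.0%} and {upper_bound:.0%}")
```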
Thank you and next steps
Thanks again for all your help with another successful experiment. We’re amazed by the capacity of this incredible community to validate these samples. We hope you’ll click on the graphs and explore the results here. We plan to run another experiment at the end of March. We may keep the methods the same and increase the sample size from 1,000 to 10,000. Or we may make another change to the methods, such as further changes to the validator candidate criteria. Thanks again for making these experiments possible!