Auto Arborist CVPR 2022 Data Card
Auto Arborist (CVPR 2022) is a dataset that contains over 2 million trees belonging to 344 genus-level categories in 23 cities across the US and Canada, built to foster the development of robust methods for large-scale urban forest monitoring.
Dataset: https://google.github.io/auto-arborist/
DATASET TEAM(S) DATASET CONTACT DATASET AUTHORS
Audio Data
Video Data
Time Series
Graph Data
Geospatial Data
*(street-level images blurred with help from human labels)
Number of tfrecords instances (eventually, 1 million instances with images): IN PROGRESS
Number of fields in tfrecords: 15
Other: N/A
The tfrecords dataset represents a subset of the tree_locations dataset, with additional fields available, including an encoded aerial image and an encoded street-level image for each instance.
DATASET USAGE
Safe for production use
Safe for research use
Conditional use - some unsafe applications
Only approved use
Others (please specify)
INTENDED AND/OR SUITABLE USE CASE(S)
● Developing a model to predict genera and reporting its architecture and results against the Auto Arborist benchmark
● Running a large-scale analysis of urban ecology in North America and sharing conclusions from the analysis
UNSUITABLE USE CASE(S)
● Republishing the Auto Arborist dataset or any data derived from it (such as processed images or examples of images) without authorization
SAFETY OF USE WITH OTHER DATA
Safe to use with other data
Conditionally safe to use with other data
Should not be used with other data
Unknown
Others* (Please specify)
ACCEPTABLE TRANSFORMATIONS
Joining with other datasets
Subsampling and splitting
Filtering
Joining input sources
Cleaning missing values
Anomaly detection
Grouping and summarizing
Scaling and reducing
Statistical transformations
Redaction or Anonymization
Others (please specify)
BEST PRACTICES FOR JOINING OR AGGREGATING WITH DATASET
The dataset comes with train/test splits. For benchmarks against the dataset, we recommend following these strictly so that results remain comparable to other benchmarks.
Current Version: 1.0
Last Updated: 06/2022 (IN PROGRESS)
Release Date: 06/2022 (IN PROGRESS)
Regularly Updated: New versions of the dataset have been or will continue to be made available.
Actively Maintained: No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Limited Maintenance: The data will not be updated, but any technical issues will be addressed.
Deprecated: This dataset is obsolete or is no longer being maintained.
The Auto Arborist team intends to update the dataset with new instances until 1 million tfrecords have been released. After this, it expects to shift to Limited Maintenance, and the number of instances may decrease over time for the error-related reasons mentioned below.
● Versioning: Versions will follow an M.m (Major.minor) scheme. Major updates will usually add significant new instances or perform significant error corrections, while minor updates may be error corrections or removal of ~1-10 instances. An example progression may be: 1.0, 2.0, 2.1, 3.0, ...
● Update: Major updates will normally occur whenever all data which we intend to release for a city is ready. Minor updates may happen at any time as needed.
● Errors: The dataset is expected to have errors and noise as described in the paper. These will generally not be corrected unless there is a significant reason to do so.
● Feedback: The Auto Arborist team welcomes feedback. Please see the website for feedback instructions.
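The M.m versioning scheme can be handled programmatically when several releases are kept side by side. The following is a minimal sketch, not part of the dataset tooling; the helper names `parse_version` and `is_major_update` are hypothetical. It compares versions as integer pairs rather than strings, so that a future "10.0" would sort after "2.0".

```python
from typing import Tuple


def parse_version(text: str) -> Tuple[int, int]:
    """Parse an 'M.m' string such as '2.1' into a (major, minor) pair."""
    major, minor = text.strip().split(".")
    return int(major), int(minor)


def is_major_update(old: str, new: str) -> bool:
    """Return True when the major component increased from old to new."""
    return parse_version(new)[0] > parse_version(old)[0]


# Example progression taken from the data card.
progression = ["1.0", "2.0", "2.1", "3.0"]

# Sorting on the parsed pair keeps numeric order.
ordered = sorted(progression, key=parse_version)
latest = max(progression, key=parse_version)
```

Under this reading of the scheme, a step from 2.0 to 2.1 is a minor update (error corrections or removal of a few instances), while a step from 2.1 to 3.0 is a major update.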
Here are the Terms and Conditions
● Access Prerequisites: Sign the dataset Terms and Conditions of use, seen at the link above.
● Data Usage Policy: Non-commercial, non-exclusive, worldwide, royalty-free, non-transferable and non-sublicenseable license to use (including reproducing and creating derivative works of)
● Access Control Lists: Users who have signed the ToC
● Exemptions & Exceptions: None
The retention policy is included in the ToC
There are no retention restrictions for the Auto Arborist dataset.
● Retention Duration: None
● Retention Steps: None
● Retention Policy: None
● Exemptions & Exceptions: None
The wipeout policy is included in the ToC
Google may receive third party requests to take down or blur a specific panorama on the Google Street View website. Google may forward this request to Organization and provide Organization with an updated version that complies with the takedown request. Organization must delete the Licensed Content originally delivered and replace it with the updated Licensed Content provided by Google.
● Wipeout Duration: <summarize here>
● Deletion Event Steps: If a third party requests street level imagery to be blurred or taken down, a new, compliant version of the dataset will be sent to all users. They will be asked to delete the prior version of the dataset and work with the new one going forward.
● Post-deletion Obligations: <summarize here>
● Exemptions & Exceptions: None
API
Google Aerial API: An internal API used to access aerial
imagery in cities.
Date of Collection: Approximately [Jan 2019-May 2022]
(we are unable to fully verify this range)
Instrumentation: Low-flying aircraft and Satellite Imagery
Data Modality: Image Data
API
Google Semantic Segmentation API: An internal API that
provides semantic segmentation for Street View data. We
used the results from this API to blur PII for our data.
Date of Collection: [Jan 2022 - May 2022]
Instrumentation: Computer Vision Model
Data Modality: Image Masks
Per-City Tree Instance records
Cities were selected based on availability of a tree inventory, the inventory’s usage restrictions, quality of the inventory, etc. Cities were restricted to North America. Records that were labeled with a genus that was not mappable into the Catalog of Life taxonomy were removed.
Per-City Tree Instance records
● Quality: Instances with invalid lat/lng or lat/lng that are outside of the expected city boundaries. Instances which cannot be mapped to a genus, e.g. because “palm” is not a genus; common typos (e.g. “ginkgo” vs. “ginko”) were corrected instead of excluded.
● Content: None?
Per-City instance records were processed into a common format. Each instance was supplemented with Aerial and Street Level imagery.
Street Level Imagery
Street Level imagery was processed to generate tree bounding box data and to blur pixels.
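The quality exclusions described above (invalid or out-of-bounds lat/lng, labels that cannot be mapped to a genus, and typo correction) might be sketched as follows. This is an illustration under stated assumptions, not the team's pipeline: the record keys, the `TYPO_FIXES` and `KNOWN_GENERA` tables, and the use of a rectangular bounding box in place of real city boundaries are all hypothetical.

```python
from typing import Dict, Optional, Tuple

# Hypothetical lookup tables; the actual typo corrections and the genus list
# derived from the Catalog of Life are not given in this data card.
TYPO_FIXES: Dict[str, str] = {"ginko": "ginkgo"}
KNOWN_GENERA = {"ginkgo", "acer", "quercus"}

# Assumed simplification: a city is approximated by a bounding box
# (min_lat, max_lat, min_lng, max_lng) rather than a boundary polygon.
BBox = Tuple[float, float, float, float]


def normalize_genus(raw: str) -> Optional[str]:
    """Fix known typos; return None if the label is not a known genus."""
    label = raw.strip().lower()
    label = TYPO_FIXES.get(label, label)
    return label if label in KNOWN_GENERA else None


def is_valid_location(lat: float, lng: float, bbox: BBox) -> bool:
    """Reject impossible coordinates and points outside the city box."""
    if not (-90.0 <= lat <= 90.0 and -180.0 <= lng <= 180.0):
        return False
    min_lat, max_lat, min_lng, max_lng = bbox
    return min_lat <= lat <= max_lat and min_lng <= lng <= max_lng


def clean_record(record: dict, bbox: BBox) -> Optional[dict]:
    """Return a corrected copy of the record, or None if it is excluded."""
    genus = normalize_genus(record["genus"])
    if genus is None:
        return None
    if not is_valid_location(record["lat"], record["lng"], bbox):
        return None
    return {**record, "genus": genus}


# Hypothetical city bounding box and records.
bbox = (40.0, 41.0, -75.0, -74.0)
kept = clean_record({"genus": "Ginko", "lat": 40.5, "lng": -74.5}, bbox)
dropped = clean_record({"genus": "palm", "lat": 40.5, "lng": -74.5}, bbox)
```

Here the misspelled "Ginko" record is corrected and kept, while the "palm" record is dropped because "palm" is not a genus.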
SENSITIVE DATA
User Content
User Metadata
User Activity Data
Identifiable Data
S/PII
Business Data
Employee Data
Pseudonymous Data
Anonymous Data
Health Data
Children’s Data
None
Others* (*please specify)
FIELDS WITH SENSITIVE DATA
Intentionally Collected Sensitive Data
none
Unintentionally Collected Sensitive Data
streetlevel/encoded: Identifiable houses, cars, or people in street level imagery
SECURITY AND PRIVACY HANDLING
Blurring S/PII in street level imagery
● Filtering images that contain people or license plates: We first used an internal privacy API to filter out any images that had visible human pixels
● Automated blurring with internal Semantic Segmentation API: Next we blur all pixels that are not “tree”, “sky”, “paved_road”, “dirt_road”, “sidewalk”, “crosswalk”, “water”, or “mountain” using an internal semantic segmentation API
● Human verified S/PII removal: Finally, we used Crowd Compute to detect and draw boxes around any S/PII that was still visible after our automated method and blurred the interior of those boxes.
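The automated and human-verified blurring steps both reduce to deciding which pixels to blur. The sketch below computes such a mask from a per-pixel label map plus a list of human-drawn boxes. It is an assumption-laden illustration, not the internal API: the label-map representation, the `Box` layout, and the example classes "house" and "car" are hypothetical, and the blur itself (as well as the first step, which filters out images with visible people) is not shown.

```python
from typing import List, Sequence, Tuple

# Classes left unblurred, as listed in the data card.
KEEP_CLASSES = frozenset({
    "tree", "sky", "paved_road", "dirt_road",
    "sidewalk", "crosswalk", "water", "mountain",
})

# Hypothetical box layout: (row0, col0, row1, col1), end-exclusive.
Box = Tuple[int, int, int, int]


def blur_mask(labels: Sequence[Sequence[str]],
              boxes: Sequence[Box] = ()) -> List[List[bool]]:
    """Return True for every pixel that should be blurred."""
    # Automated step: blur anything whose class is not in the keep list.
    mask = [[label not in KEEP_CLASSES for label in row] for row in labels]
    # Human-verified step: also blur the interior of each drawn box.
    for r0, c0, r1, c1 in boxes:
        for r in range(max(r0, 0), min(r1, len(mask))):
            for c in range(max(c0, 0), min(c1, len(mask[r]))):
                mask[r][c] = True
    return mask


# Tiny 3x3 label map with hypothetical class names for non-kept pixels.
labels = [
    ["sky", "sky", "tree"],
    ["house", "tree", "tree"],
    ["car", "paved_road", "sidewalk"],
]
mask = blur_mask(labels, boxes=[(0, 0, 1, 1)])
```

In this example the "house" and "car" pixels are marked by the class rule, and the top-left "sky" pixel is marked only because a box covers it.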
Anomaly Detection
Cleaning Mismatched Values
Cleaning Missing Values
Converting Data Types
Data Aggregation
Dimensionality Reduction
Joining Input Sources
Cleaning Mismatched Values
● Genus (fixed common typos)
Converting Data Types
● Genus: Label
● tree_locations.SHAPE_LNG : tfrecords.tree/longitude (downcast to 32 bit float)
● tree_locations.SHAPE_LAT : tfrecords.tree/latitude (downcast to 32 bit float)
● Blurring S/PII for anonymization: Internal Google Semantic Segmentation API, pixels potentially containing S/PII are blurred using a Gaussian kernel
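Downcasting coordinates to 32-bit floats loses a small amount of precision. The sketch below estimates how much, using only the standard library; the helper `to_float32` and the example longitude are hypothetical. For magnitudes between 64 and 128 degrees, adjacent float32 values are 2^-17 degrees apart, so the rounding error is at most 2^-18 degrees.

```python
import struct


def to_float32(value: float) -> float:
    """Round-trip a 64-bit Python float through IEEE 754 single precision."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Hypothetical longitude, not taken from the dataset.
lng = -122.4194155
lng32 = to_float32(lng)

# Absolute rounding error in degrees.
error_deg = abs(lng32 - lng)

# Rough upper bound in metres: one degree is at most about 111 km on the ground.
error_m = error_deg * 111_000
```

Under these assumptions the positional error stays well below one metre, which is small relative to a tree crown but worth keeping in mind when joining records on exact coordinates.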
Annotation Target in Data
Machine-generated Annotations
Human Annotations - Expert
Human Annotations - Non-expert
Human Annotations - Contractors
Human Annotations - Employees
Others* (*Please specify)
Human Annotations - Expert
Number of annotations (based on tree_locations): 4,615,907
Annotations were used to label the genera of trees and to blur S/PII.
Human Annotations - Expert
We assume per-city tree inventories were produced by experts and are of high quality.
Reflections on Data
Trees are non-offensive - Trees are not considered to be offensive or insulting, and images of trees should not cause anxiety. However, we do not have control over what humans may place on trees (e.g. offensive signs) and have not explicitly removed such objects. If any offensive material is found in the dataset, please email [email protected] to have it removed.