
Auto Arborist CVPR 2022 Data Card

The Auto Arborist dataset contains over 2 million trees from 344 genus categories across 23 cities in the US and Canada. It was created to foster the development of methods for large-scale urban forest monitoring and contains aerial and street-level images with location and genus labels for each tree instance.

Dataset: https://google.github.io/auto-arborist/

The Auto Arborist dataset is a multiview fine-grained visual categorization dataset that contains over 2 million trees belonging to 344 genus-level categories in 23 cities across the US and Canada, built to foster the development of robust methods for large-scale urban forest monitoring.
DATASET TEAM(S)
The Auto Arborist team

DATASET CONTACT
● Group Email: [email protected]
● Website: https://google.github.io/auto-arborist

DATASET AUTHORS
● Sara Beery, PhD Candidate, Caltech, Google
● Guanhang Wu, Software Engineer, Google
● Trevor Edwards, Software Engineer, Google
● Filip Pavetic, Software Engineer, Google
● Bo Majewski, Software Engineer, Google
● Shreyasee Mukherjee, Software Engineer, Google
● Stanley Chan, Software Engineer, Google
● John Morgan, (formerly at Google)
● Vivek Rathod, Software Engineer, Google
● Jonathan Huang, Research Scientist, Google

PRIMARY DATA MODALITY
Image Data; Text Data; Tabular Data; Audio Data; Video Data; Time Series; Graph Data; Geospatial Data; Multimodal (Image, Geospatial); Others (please specify); Unknown

DATASET SNAPSHOT
Size of dataset: 24 GB (v1.0. Significant size increase expected)
Number of tfrecords Instances: IN PROGRESS (eventually, 1 million instances with images)
Number of Fields in tfrecords: 15
Number of tree_locations instances: 4,615,907
Number of Fields in tree_locations: 5 (or 6 including implicit city per file)
Labeled Classes: 1 (Genus)
Number of Labels: 322 Genera
Average labels per instance: N/A
Algorithmic Labels: 2 (tree bounding boxes; tree horizontal position in streetlevel image)
Human Labels: ~1 (Genus) *(streetlevel images blurred with help from human labels)
Other: N/A

DESCRIPTION OF CONTENT
The content consists of two datasets.

The tree_locations dataset contains basic tree location and genus information derived from 24 cities.

The tfrecords dataset represents a subset of the tree_locations dataset, with additional fields available including an encoded aerial and street level image for each instance.

DATASET SUBJECT
Sensitive Data about people; Non-Sensitive Data about people; Data about natural phenomena; Data about places and objects; Synthetically generated data; Data about systems or products and their behaviors; Unknown; Others* (*please specify)

DATA FIELDS
Tree_locations: Contains the parsed and cleaned up tree inventories of each city in the dataset, along with train/test splits per city.
● IDX: An identifier for the row which is unique to the city.
● SHAPE_LNG: The longitude of the tree.
● SHAPE_LAT: The latitude of the tree.
● GENUS: The lowercase genus of the tree.
● TAXONOMY_ID: A unique integer ID corresponding to the genus. Note that this is indexed from 0.

Tfrecords: Contains train/test TFRecord files with one aerial and blurred street level image per available tree for all cities available in this release version. The trees are a subset of the trees in tree_locations.
● tree/
  ○ id: bytes. An ID for the tree that is unique across the release dataset.
  ○ tree_locations_idx: bytes. An ID which links to the tree_locations/ CSV IDX for the tree.
  ○ city: bytes. The city where the tree is located.
  ○ latitude: float. The ground truth latitude of the tree.
  ○ longitude: float. The ground truth longitude of the tree.
  ○ genus/
    ■ label: int64. Holds the ground truth label number representing the tree’s genus.
    ■ genus: bytes. The ground truth genus of the tree.
● image/
  ○ aerial/
    ■ encoded: bytes. An encoded aerial JPEG image of the tree approximately centered on its trunk.
  ○ streetlevel/
    ■ encoded: bytes. An encoded street level JPEG image of the tree. Non-vegetation pixels are blurred.
    ■ capturetime: bytes. The month and year that the street level image was captured.
    ■ bbox/: float_lists. Represent tree detection bounding boxes (based on Open Images) as regions scaled from [0, 1], with (0,0) representing the top-left corner of the image.
      ● xmin
      ● xmax
      ● ymin
      ● ymax
    ■ center/:
      ● x: int64. Represents an approximate (but noisy) location for the horizontal center pixel of the tree in the image.
      ● y: int64. This is always set to half of the image height. It is provided for convenience.

EXAMPLE: DATA POINT
Example tree_locations datapoint:
Datapoint below is slightly modified (e.g. fake location), but otherwise represents a typical example.

E.g. of Data Point:
IDX,SHAPE_LNG,SHAPE_LAT,GENUS,TAXONOMY_ID
057eeab4-1f14-11ec-93z5-eb8801c6f8d0,-85.82556863059999,37.5307416556,pyrus,246

Example tfrecords datapoint:
Datapoint below is slightly modified (e.g. fake location, redacted image bytes), but otherwise represents a typical example. It corresponds to the instance above (to show how tfrecords can be matched to tree_locations).

E.g. of Data Point:
features: {
  feature: {
    key: "streetlevel/object/bbox/xmax"
    value: { float_list: { value: 0.99882144 value: 0.23218618 } }
  }
  feature: {
    key: "streetlevel/center/x"
    value: { int64_list: { value: 410 } }
  }
  feature: {
    key: "streetlevel/object/bbox/xmin"
    value: { float_list: { value: 0.90714884 value: 0.0040578335 } }
  }
  feature: {
    key: "tree/genus/label"
    value: { int64_list: { value: 246 } }
  }
  feature: {
    key: "tree/city"
    value: { bytes_list: { value: "Bloomington" } }
  }
  feature: {
    key: "tree/genus/genus"
    value: { bytes_list: { value: "pyrus" } }
  }
  feature: {
    key: "streetlevel/object/bbox/ymax"
    value: { float_list: { value: 0.59795773 value: 0.54920584 } }
  }
  feature: {
    key: "tree/id"
    value: { bytes_list: { value: "6560451631306680540" } }
  }
  feature: {
    key: "streetlevel/object/bbox/ymin"
    value: { float_list: { value: 0.29351664 value: 0.05435237 } }
  }
  feature: {
    key: "tree/latitude"
    value: { float_list: { value: 37.53074 } }
  }
  feature: {
    key: "streetlevel/encoded"
    value: { bytes_list: { value: "JPEGGoesHere" } }
  }
  feature: {
    key: "streetlevel/capturetime"
    value: { bytes_list: { value: "July 2019" } }
  }
  feature: {
    key: "streetlevel/center/y"
    value: { int64_list: { value: 576 } }
  }
  feature: {
    key: "aerial/encoded"
    value: { bytes_list: { value: "JPEGGoesHere" } }
  }
  feature: {
    key: "tree/idx"
    value: { bytes_list: { value: "057eeab4-1f14-11ec-93z5-eb8801c6f8d0" } }
  }
  feature: {
    key: "tree/longitude"
    value: { float_list: { value: -85.8255 } }
  }
}
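
As a rough illustration of how the two example datapoints relate, the sketch below parses the tree_locations row with Python's standard csv module, looks it up by the IDX carried in the tfrecords example, and converts one normalized bounding box to pixel coordinates. The helper names (load_tree_locations, bbox_to_pixels) and the image width of 1024 are assumptions made for illustration. The height of 1152 is inferred from the example's streetlevel/center/y value of 576, which is described as half of the image height. Decoding real TFRecord files would additionally require TensorFlow and is not shown.

```python
import csv
import io

# The example tree_locations row from this data card (the location is fake).
CSV_TEXT = (
    "IDX,SHAPE_LNG,SHAPE_LAT,GENUS,TAXONOMY_ID\n"
    "057eeab4-1f14-11ec-93z5-eb8801c6f8d0,-85.82556863059999,37.5307416556,pyrus,246\n"
)


def load_tree_locations(text):
    """Index tree_locations rows by IDX, converting the numeric columns."""
    rows = {}
    for row in csv.DictReader(io.StringIO(text)):
        rows[row["IDX"]] = {
            "lng": float(row["SHAPE_LNG"]),
            "lat": float(row["SHAPE_LAT"]),
            "genus": row["GENUS"],
            "taxonomy_id": int(row["TAXONOMY_ID"]),
        }
    return rows


def bbox_to_pixels(xmin, ymin, xmax, ymax, width, height):
    """Scale a [0, 1] box (origin at the top-left corner) to pixel coordinates."""
    return (round(xmin * width), round(ymin * height),
            round(xmax * width), round(ymax * height))


trees = load_tree_locations(CSV_TEXT)

# Values copied by hand from the example tfrecords datapoint.
record = {"idx": "057eeab4-1f14-11ec-93z5-eb8801c6f8d0", "label": 246, "genus": "pyrus"}
match = trees[record["idx"]]

# In this example the genus and the integer label agree across both files.
print(match["genus"] == record["genus"], match["taxonomy_id"] == record["label"])

# Second box of the example; width=1024 is assumed, height=1152 is 2 * center/y.
print(bbox_to_pixels(0.0040578335, 0.05435237, 0.23218618, 0.54920584, 1024, 1152))
```

Keying the lookup on IDX alone works within one city file. Since IDX is only unique per city, a join across cities would need the city name as part of the key.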

DATASET PURPOSE(S)
Monitoring; Research; Production; Others (please specify)

KEY DOMAINS OR APPLICATION(S)
Domains: Machine Learning, Object Recognition, Computer Vision, Computing for the Environment, Environmental Monitoring, Biodiversity Monitoring, Urban Planning
Problem Space: Multiview Recognition, Fine-Grained Visual Categorization, Out-of-domain Recognition, Automated Urban Forest Monitoring

PRIMARY MOTIVATION(S)
● Enable the computer vision community to tackle impactful environmental challenges
● Provide a real-world benchmark for tree categorization in cities from multiview data with spatiotemporal structure
● Advocate for robust out-of-domain generalization analysis for SOTA computer vision architectures via cross-domain data splits
● Provide the largest ever fine-grained visual categorization benchmark to the computer vision community

DATASET USAGE
Safe for production use; Safe for research use; Conditional use - some unsafe applications; Only approved use; Others (please specify)

INTENDED AND/OR SUITABLE USE CASE(S)
● Developing a model to predict genera and reporting its architecture and results against the Auto Arborist benchmark
● Running a large scale analysis of urban ecology in North America and sharing conclusions from the analysis

UNSUITABLE USE CASE(S)
● Republishing the Auto Arborist dataset or any data derived from it (such as processed images or examples of images) without authorization

SAFETY OF USE WITH OTHER DATA
Safe to use with other data; Conditionally safe to use with other data; Should not be used with other data; Unknown; Others* (*Please specify)

ACCEPTABLE TRANSFORMATIONS
Joining with other datasets; Subsampling and splitting; Filtering; Joining input sources; Cleaning missing values; Anomaly detection; Grouping and summarizing; Scaling and reducing; Statistical transformations; Redaction or Anonymization; Others (please specify)

BEST PRACTICES FOR JOINING OR AGGREGATING WITH DATASET
The dataset comes with train/test splits. For benchmarks against the dataset, we recommend following these strictly in order for the benchmark to be comparable to others.

VERSION STATUS
Regularly Updated: New versions of the dataset have been or will continue to be made available.
Actively Maintained: No new versions will be made available, but this dataset will be actively maintained, including but not limited to updates to the data.
Limited Maintenance: The data will not be updated, but any technical issues will be addressed.
Deprecated: This dataset is obsolete or is no longer being maintained.

DATASET VERSION
Current Version: 1.0
Last Updated: 06/2022 (IN PROGRESS)
Release Date: 06/2022 (IN PROGRESS)

MAINTENANCE PLAN
The Auto Arborist team intends to update the dataset with new instances until 1 million tfrecords have been released. After this, it expects to shift to Limited Maintenance, and the number of instances may decrease over time for error reasons mentioned below.
● Versioning: Versions will be a M.m (Major.minor). Major updates will usually add significant new instances or perform significant error corrections, while minor updates may be error corrections or removal of ~1-10 instances. An example progression may be: 1.0,2.0,2.1,3.0,...
● Update: Major updates will normally occur whenever all data which we intend to release for a city is ready. Minor updates may happen at any time as needed.
● Errors: The dataset is expected to have errors and noise as described in the paper. These will generally not be corrected unless there is a significant reason to do so.
● Feedback: The Auto Arborist team welcomes feedback. Please see the website for feedback instructions.
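
The M.m scheme in the maintenance plan is easiest to handle as a pair of integers rather than a decimal number. The following is a minimal sketch of that idea. The function names are hypothetical and nothing here is part of the dataset tooling.

```python
def parse_version(text):
    """Split an 'M.m' version string into a (major, minor) tuple of ints."""
    major, minor = text.split(".")
    return int(major), int(minor)


def is_major_update(old, new):
    """True when the major component increased between two versions."""
    return parse_version(new)[0] > parse_version(old)[0]


# The example progression given in the maintenance plan.
progression = ["1.0", "2.0", "2.1", "3.0"]
print(sorted(progression, key=parse_version) == progression)
print(is_major_update("2.0", "2.1"), is_major_update("2.1", "3.0"))
```

Comparing tuples avoids the trap of treating versions as floats, where a hypothetical 2.10 would sort before 2.9.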

ACCESS POLICY
Here are the Terms and Conditions
● Access Prerequisites: Sign the Terms and Conditions of use, seen at the link above.
● Data Usage Policy: Non-commercial, non-exclusive, worldwide, royalty-free, non-transferable and non-sublicenseable license to use (including reproducing and creating derivative works of)
● Access Control Lists: Users who have signed the ToC
● Exemptions & Exceptions: None

RETENTION POLICY
The retention policy is included in the ToC

There are no retention restrictions for the Auto Arborist dataset.
● Retention Duration: None
● Retention Steps: None
● Retention Policy: None
● Exemptions & Exceptions: None

WIPEOUT POLICY
The wipeout policy is included in the ToC

Google may receive third party requests to take down or blur a specific panorama on the Google Street View website. Google may forward this request to Organization and provide Organization with an updated version that complies with the takedown request. Organization must delete the Licensed Content originally delivered and replace it with the updated Licensed Content provided by Google.
● Wipeout Duration: <summarize here>
● Deletion Event Steps: If a third party requests street level imagery to be blurred or taken down, a new, compliant version of the dataset will be sent to all users. They will be asked to delete the prior version of the dataset and work with the new one going forward.
● Post-deletion Obligations: <summarize here>
● Exemptions & Exceptions: None

DATA COLLECTION METHODS
API; Artificially Generated; Crowdsourced - Paid; Crowdsourced - Volunteer; Vendor Collection Efforts; Scraped or Crawled; Survey, forms or polls; Taken from other existing datasets; Unknown; To be determined; Others (Proprietary APIs)

DATA SOURCES
City-generated Arboreal Censuses
Arboreal Censuses: These censuses are used by cities to monitor their urban trees and are collected infrequently. We have used censuses that are published publicly; all licensing information is available in the supplementary material of our CVPR 2022 paper.
Date of Collection: Nov 2020 - Nov 2021
Instrumentation: Human-generated GPS locations and categories of trees
Data Modality: Geospatial Data / Text Data

API
Google Street View API: An internal API used to access Google Street View images.
Date of Collection: [April 2009 - June 2021]
Instrumentation: Street View Cameras
Data Modality: Image Data

API
Google Aerial API: An internal API used to access aerial imagery in cities.
Date of Collection: Approximately [Jan 2019-May 2022] (we are unable to fully verify this range)
Instrumentation: Low-flying aircraft and Satellite Imagery
Data Modality: Image Data

API
Google Semantic Segmentation API: An internal API that provides semantic segmentation for Street View data. We used the results from this API to blur PII for our data.
Date of Collection: [Jan 2022 - May 2022]
Instrumentation: Computer Vision Model
Data Modality: Image Masks

DATA COLLECTION
Crowdsourced
Collected and included
● none
Collected and excluded
● Boxes around PII that was not blurred by our automated blurring based on internal APIs, used to create the final human-verified, PII-obscured images but not released

INCLUSION CRITERIA
Per-City Tree Instance records
Cities were selected based on availability of tree inventory, the inventory’s usage restrictions, quality of the inventory, etc. Cities were restricted to North America. Records that were labeled with a genus that was not mappable into the Catalog of Life taxonomy were removed.

Aerial Imagery
Images were obtained by querying a proprietary API with the locations available for each instance.

Street Level Imagery
Street level images taken Jan 1, 2008 or later which are expected to show the instance tree within 10 meters based on a proprietary geolocation API.

EXCLUSION CRITERIA
Per-City Tree Instance records
● Quality: Instances with invalid lat/lng or lat/lng that are outside of the expected city boundaries. Instances which cannot be mapped to a genus, e.g. because “palm” is not a genus; common typos (e.g. “ginkgo” vs. “ginko”) were corrected instead of excluded.
● Content: None?

Aerial Imagery
● Quality: None.
● Content: Instances without available imagery (extremely rare).

Street Level Imagery
● Quality: Images which are too blurry.
● Content: Instances without available imagery. Images which contain pixels with people or other S/PII. Images which do not contain a minimum number of pixels associated with trees based on a proprietary pixel segmentation.

DATA PROCESSING
Per-City instance records were processed into a common format. Each instance was supplemented with Aerial and Street Level imagery.

Street Level Imagery
Street Level imagery was processed to generate tree bounding box data and to blur pixels.
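
The quality checks for per-city tree records can be sketched as a small filter. This is only an illustration under stated assumptions: the actual pipeline is not published here, the city boundary is approximated by a rectangular (min_lat, min_lng, max_lat, max_lng) box, the typo table holds only the one example given above, and known_genera stands in for a lookup against the Catalog of Life taxonomy.

```python
# Hypothetical typo table; the data card only gives "ginko" -> "ginkgo" as an example.
GENUS_TYPOS = {"ginko": "ginkgo"}


def clean_record(record, bounds, known_genera):
    """Return a cleaned copy of record, or None if the record should be excluded.

    bounds is an assumed rectangular city extent: (min_lat, min_lng, max_lat, max_lng).
    """
    lat, lng = record.get("lat"), record.get("lng")
    # Exclude missing or non-numeric coordinates.
    if not isinstance(lat, (int, float)) or not isinstance(lng, (int, float)):
        return None
    # Exclude coordinates that are not valid lat/lng at all (NaN also fails here).
    if not (-90 <= lat <= 90 and -180 <= lng <= 180):
        return None
    # Exclude coordinates outside the expected city extent.
    min_lat, min_lng, max_lat, max_lng = bounds
    if not (min_lat <= lat <= max_lat and min_lng <= lng <= max_lng):
        return None
    # Correct common typos instead of excluding, then require a mappable genus.
    genus = record.get("genus", "").strip().lower()
    genus = GENUS_TYPOS.get(genus, genus)
    if genus not in known_genera:
        return None
    return {**record, "genus": genus}


bounds = (37.0, -86.0, 38.0, -85.0)   # made-up extent for the example
known = {"ginkgo", "pyrus"}           # stand-in for the taxonomy lookup
print(clean_record({"lat": 37.5, "lng": -85.8, "genus": "Ginko"}, bounds, known))
print(clean_record({"lat": 37.5, "lng": -85.8, "genus": "palm"}, bounds, known))
```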

SENSITIVE DATA
User Content; User Metadata; User Activity Data; Identifiable Data; S/PII; Business Data; Employee Data; Pseudonymous Data; Anonymous Data; Health Data; Children’s Data; None; Others* (*please specify)

FIELDS WITH SENSITIVE DATA
Intentionally Collected Sensitive Data
none

Unintentionally Collected Sensitive Data
streetlevel/encoded: Identifiable houses, cars, or people in street level imagery

SECURITY AND PRIVACY HANDLING
Blurring S/PII in street level imagery
● Filtering images that contain people or license plates: We first used an internal privacy API to filter out any images that had visible human pixels
● Automated blurring with internal Semantic Segmentation API: Next we blur all pixels that are not “tree”, “sky”, “paved_road”, “dirt_road”, “sidewalk”, “crosswalk”, “water”, or “mountain” using an internal semantic segmentation API
● Human verified S/PII removal: Finally, we used Crowd Compute to detect and draw boxes around any S/PII that was still visible after our automated method and blurred the interior of those boxes.
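
The automated and human-verified blurring steps both reduce to deciding which pixels get blurred. A minimal sketch of that decision follows, assuming the segmentation result is available as a 2D grid of class-name strings and human boxes as (x0, y0, x1, y1) pixel rectangles with exclusive upper bounds. The internal APIs are not public, so the input format, the function names, and the "house" and "car" labels are assumptions; the blur itself is omitted.

```python
# Classes that stay sharp, as listed in the data card.
KEEP_CLASSES = {
    "tree", "sky", "paved_road", "dirt_road",
    "sidewalk", "crosswalk", "water", "mountain",
}


def blur_mask(segmentation):
    """Mark every pixel whose class is not in KEEP_CLASSES for blurring."""
    return [[label not in KEEP_CLASSES for label in row] for row in segmentation]


def add_boxes(mask, boxes):
    """Return a copy of mask with the interior of each box also marked."""
    out = [row[:] for row in mask]
    for x0, y0, x1, y1 in boxes:
        for y in range(max(y0, 0), min(y1, len(out))):
            for x in range(max(x0, 0), min(x1, len(out[y]))):
                out[y][x] = True
    return out


# Tiny 2 x 3 label grid; "house" and "car" are hypothetical class names.
seg = [["sky", "sky", "house"],
       ["tree", "car", "sidewalk"]]
mask = blur_mask(seg)
print(mask)
# A reviewer box covering the bottom-left pixel.
print(add_boxes(mask, [(0, 1, 1, 2)]))
```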

TRANSFORMATIONS APPLIED
Anomaly Detection; Cleaning Mismatched Values; Cleaning Missing Values; Converting Data Types; Data Aggregation; Dimensionality Reduction; Joining Input Sources; Redaction or Anonymization; Others* (*Please specify)

FIELDS TRANSFORMED
Cleaning Mismatched Values
● Genus (fixed common typos)

Converting Data Types
● Genus: Label
● tree_locations.SHAPE_LNG : tfrecords.tree/longitude (downcast to 32 bit float)
● tree_locations.SHAPE_LAT : tfrecords.tree/latitude (downcast to 32 bit float)

Redaction or Anonymization
● streetlevel/encoded (blurring)
● streetlevel/capturetime (reduced timestamp granularity)
● Various internal fields are removed for the release

LIBRARIES AND METHODS USED
● Blurring S/PII for anonymization: Internal Google Semantic Segmentation API, pixels potentially containing S/PII are blurred using a gaussian kernel
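
To get a feel for what the downcast to 32 bit float does to coordinates, the sketch below rounds the example longitude and latitude to float32 using only the standard library. The helper name to_float32 is an assumption for illustration. A 32-bit float has a 24-bit significand, so values between 64 and 128 are spaced 2^-17 (about 7.6e-6) degrees apart and the rounding error is at most half of that. At roughly 111 km per degree of latitude this is on the order of half a meter.

```python
import struct


def to_float32(value):
    """Round a Python float (64-bit) to the nearest 32-bit float value."""
    return struct.unpack("<f", struct.pack("<f", value))[0]


# Coordinates from the example tree_locations datapoint (the location is fake).
lng = -85.82556863059999
lat = 37.5307416556

for name, value in (("lng", lng), ("lat", lat)):
    rounded = to_float32(value)
    print(f"{name}: float32={rounded!r} abs_error={abs(rounded - value):.2e} deg")
```

When matching tfrecords back to tree_locations, this is a reason to join on the ID field rather than on exact coordinate equality.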

SAMPLING METHOD(S)
Cluster Sampling; Haphazard Sampling; Multi-stage Sampling; Random Sampling; Retrospective Sampling; Stratified Sampling; Systematic Sampling; Weighted Sampling; Unknown; Unsampled; Others* (*Please specify)

SAMPLING CHARACTERISTIC(S)
Stratified Sampling
Upstream Source: All tree locations
Total data sampled: 2.5M tree records
Sample size: 1M tree records
Threshold applied: Per city/per genera stratified count
Sampling Rate: specific to each city/genera pair

SAMPLING CRITERIA
We sampled the total set of trees to determine which imagery to release, with the goal of releasing 1M trees with imagery.
● Stratified Sampling: We stratified our samples across cities and genera. Our stratification method looped through each city and each genus in a round robin and incremented a count for that city/genus pair that would determine the number of trees to be sampled for that strata. Once the total available number of images for that city/genus pair was reached, the count for that strata was no longer incremented. This was to ensure that the images were stratified across the genera and cities, instead of biased towards large cities and common genera. We used this city/genera count to randomly sample the tree IDs to include with imagery.
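
The round-robin stratification described above can be sketched in a few lines. This is one reading of the description, not the team's actual code: the function names, the fixed iteration order over sorted pairs, and the seed are assumptions.

```python
import random


def allocate_round_robin(available, budget):
    """Give each (city, genus) pair a quota by cycling over the pairs.

    available maps (city, genus) -> number of trees with imagery.
    A pair stops receiving increments once its quota reaches its availability.
    """
    quota = {pair: 0 for pair in available}
    remaining = min(budget, sum(available.values()))
    while remaining > 0:
        for pair in sorted(available):
            if remaining == 0:
                break
            if quota[pair] < available[pair]:
                quota[pair] += 1
                remaining -= 1
    return quota


def sample_tree_ids(ids_by_pair, budget, seed=0):
    """Randomly pick tree IDs per pair according to the round-robin quota."""
    rng = random.Random(seed)
    available = {pair: len(ids) for pair, ids in ids_by_pair.items()}
    quota = allocate_round_robin(available, budget)
    return {pair: rng.sample(ids, quota[pair]) for pair, ids in ids_by_pair.items()}


# Toy strata: one rare pair and two common ones, with a budget of 7 trees.
toy = {
    ("city_a", "rare_genus"): ["t1"],
    ("city_a", "common_genus"): [f"a{i}" for i in range(10)],
    ("city_b", "common_genus"): [f"b{i}" for i in range(10)],
}
picked = sample_tree_ids(toy, budget=7)
print({pair: len(ids) for pair, ids in picked.items()})
```

In the toy run the rare pair contributes its single tree and the remaining budget is split evenly between the two common pairs, which is the balancing effect the description is after.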

ANNOTATION WORKFORCE TYPE
Annotation Target in Data; Machine-generated Annotations; Human Annotations - Expert; Human Annotations - Non-expert; Human Annotations - Employees; Human Annotations - Contractors; Human Annotations - Crowdsourcing; Human Annotations - Outsourced / Managed Teams; Unlabeled; Others* (*Please specify)

ANNOTATION CHARACTERISTICS
Human Annotations - Expert
Number of annotations (based on tree_locations): 4,615,907

Human Annotations - Contractors
Total number of street-level images annotated: 1,000,000

ANNOTATION DESCRIPTION
Annotations were used to label the genera of trees and to blur S/PII.

Human Annotations - Expert
We assume per-city tree inventories were produced by experts and are of high quality.

Human Annotations - Contractors
Contractors annotated street level imagery for S/PII blurring / removal. A confidential platform is used for collecting these annotations.

Reflections on Data
Trees are non-offensive: Trees are not considered to be offensive or insulting, and images of trees should not cause anxiety. However, we do not have control over what humans may place on trees (e.g. offensive signs) and have not explicitly removed such objects. If any offensive material is found in the dataset, please email [email protected] to have it removed.
