0% found this document useful (0 votes)
19 views22 pages

Very High Resolution Canopy Height Maps From RGB Imagery Using Self-Supervised Vision Transformer and Convolutional Decoder Trained On Aerial Lidar

Uploaded by

gil.allenglenc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
19 views22 pages

Very High Resolution Canopy Height Maps From RGB Imagery Using Self-Supervised Vision Transformer and Convolutional Decoder Trained On Aerial Lidar

Uploaded by

gil.allenglenc
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 22

Remote Sensing of Environment 300 (2024) 113888

Contents lists available at ScienceDirect

Remote Sensing of Environment


journal homepage: www.elsevier.com/locate/rse

Very high resolution canopy height maps from RGB imagery using
self-supervised vision transformer and convolutional decoder trained on
aerial lidar
Jamie Tolan a, Hung-I Yang a, Benjamin Nosarzewski a, Guillaume Couairon b, Huy V. Vo b,
John Brandt c, *, Justine Spore c, Sayantan Majumdar d, Daniel Haziza b, Janaki Vamaraju a,
Theo Moutakanni b, Piotr Bojanowski b, Tracy Johns a, Brian White a, Tobias Tiecke a,
Camille Couprie b
a
Meta, 1 Hacker Way, Menlo Park, CA 94025, USA
b
Fundamental AI Research (FAIR), Meta, 1 Hacker Way, Menlo Park, CA 94025, USA
c
World Resources Institute, 10 G St NE #800, Washington, DC 20002, USA
d
Desert Research Institute, 2215 Raggio Pkwy, Reno, NV 89512, USA

A R T I C L E I N F O A B S T R A C T

Edited by Jing M. Chen Vegetation structure mapping is critical for understanding the global carbon cycle and monitoring nature-based
approaches to climate adaptation and mitigation. Repeated measurements of these data allow for the observation
Keywords: of deforestation or degradation of existing forests, natural forest regeneration, and the implementation of sus­
LIDAR tainable agricultural practices like agroforestry. Assessments of tree canopy height and crown projected area at a
GEDI
high spatial resolution are also important for monitoring carbon fluxes and assessing tree-based land uses, since
Canopy height
forest structures can be highly spatially heterogeneous, especially in agroforestry systems. Very high resolution
Deep learning
Self-supervised learning satellite imagery (less than one meter (1 m) Ground Sample Distance) makes it possible to extract information at
Vision transformers the tree level while allowing monitoring at a very large scale. This paper presents the first high-resolution canopy
height map concurrently produced for multiple sub-national jurisdictions. Specifically, we produce very high
resolution canopy height maps for the states of California and São Paulo, a significant improvement in resolution
over the ten meter (10 m) resolution of previous Sentinel / GEDI based worldwide maps of canopy height. The
maps are generated by the extraction of features from a self-supervised model trained on Maxar imagery from
2017 to 2020, and the training of a dense prediction decoder against aerial lidar maps. We also introduce a post-
processing step using a convolutional network trained on GEDI observations. We evaluate the proposed maps
with set-aside validation lidar data as well as by comparing with other remotely sensed maps and field-collected
data, and find our model produces an average Mean Absolute Error (MAE) of 2.8 m and Mean Error (ME) of 0.6
m.

1. Introduction deforestation and regrowth (Friedlingstein et al., 2019). Such wall-to-


wall data on tree height and canopy structure are used to estimate
Spatially explicit maps of forest vegetation structure, such as tree aboveground woody biomass. However, land-use patterns operate on
canopy height and crown projected area, are powerful tools for assessing more granular spatio-temporal scales than those captured by global
forest degradation, forest and landscape restoration (FLR), and esti­ carbon models, which typically have coarse spatio-temporal resolution.
mating above-ground woody biomass for carbon emission and seques­ This contributes to the large uncertainty in existing nation-wide and
tration modeling. Existing assessments of the climate implications of global accounting of carbon stored in forests (Popkin, 2015; Duncanson
woody vegetation flux, including FLR, deforestation, and natural et al., 2020; Yanai et al., 2020). For instance, Cook-Patton et al. (2020)
regrowth, often rely on remotely sensed dynamic vegetation models of produce a global 1-km scale map of potential above-ground carbon

* Corresponding author.
E-mail address: [email protected] (J. Brandt).

https://doi.org/10.1016/j.rse.2023.113888
Received 19 April 2023; Received in revised form 24 October 2023; Accepted 25 October 2023
Available online 7 November 2023
0034-4257/© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

accumulation rates by developing machine learning models based on aerial optical imagery, Wagner et al. (2023) generated a submeter CHM
more than 13,000 locations derived from literature. Cook-Patton et al. of California, USA by training a regression U-Net CNN on 60-cm imagery
(2020) find significant variability in predicted carbon accumulation from the USDA-NAIP program and aerial lidar.
rates compared to defaults from the International Panel on Climate The estimation of canopy height from high resolution optical imag­
Change (IPCC) at the ecozone scale. In the African tropical montane ery shares similarities with the computer vision task of monocular depth
forests, Cuni-Sanchez et al. (2021) model forest carbon density based on estimation. Vision transformers, which are a deep learning approach to
72,336 measurements of height and tree diameter, identifying two- encoding low-dimensional input into a high dimensional feature space,
thirds higher carbon stocks than the respective IPCC default values. have established new frontiers in depth estimation compared to con­
The uncertainty of biomass modeling also affects the uncertainty of volutional neural networks (Ranftl et al., 2021). While depth estimation
the carbon implications of deforestation and regrowth. Tree-based FLR, models benefit significantly from large receptive fields (Li et al., 2018;
including agroforestry, reforestation, natural regeneration, and enrich­ Fu et al., 2018; Miangoleh et al., 2021), Luo et al. (2016) demonstrate
ment planting, is considered to be a cost-effective natural climate so­ that the effective receptive fields of CNN models have Gaussian distri­
lution for adaptation and mitigation. However, evaluating the butions, limiting the ability for CNNs to model long-range spatial de­
effectiveness of FLR interventions at a large scale is difficult due to its pendencies. In contrast to convolutional neural networks (CNNs), which
highly distributed nature, typically being practiced on individual land subsequently apply local convolutional operations to enable the
parcels by respective land owners (Reytar et al., 2020). While carbon modeling of increasingly long-range spatial dependencies, transformers
reporting frameworks exist for FLR, for example through verified carbon utilize self-attention modules to enable the modeling of global spatial
markets, such data are highly project-specific owing to their reliance on dependencies across the entire image input (Dosovitskiy et al., 2021a).
intensive manual field measurements. Utilizing remotely sensed data to For dense prediction tasks on high resolution imagery where the
assess vegetation structure on areas with FLR interventions such as context can be sparse, such as ground information in the case of near
intercropped agroforestry or natural regeneration is difficult due to the closed canopies, the ability of transformers to model global information
presence of multiple species, multiple canopy strata, and trees of is promising. Among the applications to aerial imagery, the work of Xu
different ages (Viani et al., 2018; Vallauri et al., 2005; Camarretta et al., et al. (2021) uses a Swin transformer to classify high-resolution land
2020). For instance, Tesfay et al. (2022) found that 70% of the shade cover. Finding that a baseline transformer model struggled with edge
trees in an agroforestry system in Ethiopia were below 3 m in height, detection, Xu et al. (2021) utilized a self-supervised edge extraction and
while 3% were above 12 m in height, with more than a two-order of enhancement method to improve definition of class edges. Wang et al.
magnitude range of per-tree carbon stocks depending on tree size. (2022) utilize the vision transformer architecture as a feature encoder,
Critical to reducing uncertainty in woody carbon models are mea­ and apply a feature pyramid decoder to the resulting multi-scale feature
surements of forest height and biomass to improve assessments of the maps. Gibril et al. (2023) segment individual date palm trees by
spatial variability of carbon removal rates across forest landscapes that applying vision transformers to 5- to 30-cm drone-based imagery,
have heterogeneous structure (Harris et al., 2021). Tree height is espe­ finding that the Segformer architecture improves generalizability to
cially critical to accurately assessing carbon removal rates, as growth different resolution imagery when compared to CNN-based models.
rate increases continuously with size (Stephenson et al., 2014). Recent More recently, also leveraging vision transformers, Reed et al. (2022)
earth observation missions from NASA, namely GEDI and ICESat-2, scale the Masked Auto-Encoder approach of He et al. (2022) and apply it
provide repeated vegetation canopy height maps for the first time. to building segmentation.
Global Ecosystem Dynamics Investigation (GEDI) collects canopy height A major challenge of applying high resolution, airborne lidar data to
and relative height at a 25 m resolution (Dubayah et al., 2021). ICESat-2 the generation of wall-to-wall canopy height maps is the relative scarcity
collects canopy height and relative height at a 13 × 100 meter native of airborne lidar data to the scientific community. Such scarcity can
footprint (Markus et al., 2017). Recently, multi-sensor fusion has negatively impact the generalizability of models to unseen geographies,
demonstrated potential to improve aboveground biomass mapping especially data-poor regions where little to no airborne lidar exists
(Silva et al., 2021). To generate wall-to-wall maps of canopy height, (Schacher et al., 2023). Given this context of low annotation, Self-
researchers commonly combine active optical LiDAR data from ICESat-2 Supervised Learning (SSL) is a promising tool to shape more robust
or GEDI with optical imagery from Sentinel-2 (Lang et al., 2022a; features than traditional deep approaches. In particular, the SSL DINOv2
Schwartz et al., 2022) or Landsat satellites (Schwartz et al., 2022; Li approach of Oquab et al. (2023) recently led to state-of-the-art perfor­
et al., 2020). mances in several computer vision tasks such as image classification,
A number of recent studies have utilized spaceborne lidar data from depth prediction, and segmentation. In the context of satellite image
GEDI and ICESat-2 to produce canopy height maps in combination with analysis, self-supervised learning has been shown to improve the
multispectral optical imagery. Among them, Potapov et al. (2021) generalizability of building segmentation models in Africa (Sirko et al.,
combined GEDI RH95 (95th percentile of Relative Height) data with 2021). To mitigate the reliance of vision transformers on self-supervised
Landsat data to establish a global map at 30 m resolution, using a bagged learning, Fayad et al. (2023) utilized knowledge distillation with a U-Net
regression tree ensemble algorithm. More recently, Lang et al. (2022a) CNN teacher model to generate 10-m CHM of Ghana using Sentinel-1,
produced a global canopy height map at a 10-m resolution, applying an Sentinel-2, and aerial lidar.
ensemble of convolutional neural network (CNN) models to Sentinel-2 Understanding the importance of highly spatially explicit vegetation
imagery to predict the GEDI RH98 footprint. Other works have pro­ structure mapping to both large-scale carbon modeling and project-
duced regional 10-m CHMs utilizing Sentinel-2 and aerial lidar (Astola specific avoided deforestation and restoration monitoring, the objec­
et al., 2021; Fayad et al., 2023). tive of this study is to produce high resolution canopy height maps that
Aerial lidar data has also demonstrated utility as training data for are able to scale and generalize to large geographic regions. Our method
high resolution (< 5 m) and very high resolution (< 1 m) canopy height consists of an image encoder-decoder model, where low spectral
maps. At a national scale, Csillik et al. (2019) generated biomass maps in dimensional input images are transformed to a high dimensional
Peru by applying gradient boosted regression trees between 3.7 m Planet encoding and subsequently decoded to predict per-pixel canopy height.
Dove imagery and airborne lidar, with low uncertainty in dense forests We employ DINOv2 self-supervised learning to generate universal and
but large amounts of uncertainty in transitional landscapes and areas generalizable encodings from the input imagery (Oquab et al., 2023),
that are hotspots of land use change. Recently, Liu et al. (2023) and train a dense vision transformer decoder (Ranftl et al., 2021) to
computed a canopy height map (CHM) map of Europe using 3 m Planet generate canopy height predictions based on aerial lidar data from sites
imagery, training two UNets to predict tree extent and CHM using lidar across the USA. To correct a potential bias coming from a geographically
observations and previous CHM predictions from the literature. Utilizing limited source of supervision, we finally refine the maps using a

2
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

convolutional network trained on spaceborne lidar data. We present pixels and that the pixel size was the same for all latitudes, preventing
canopy height maps for the states of São Paulo, Brazil, and California, potential biases with latitude which may be introduced by variation in
USA, and provide qualitative and quantitative error analyses of height pixel size.
estimation and the decomposition of height estimates into tree seg­
mentation maps. 2.3.2. Dataset for self-supervised learning
For training the self-supervised encoder, we randomly sampled 18
2. Data million 256 × 256 pixel satellite thumbnail images. No labels were used
for the SSL stage.
2.1. Experimental design
2.3.3. Validation segmentation dataset
This paper presents canopy height maps for São Paulo State, Brazil, We also manually annotated a random selection of 9000 Maxar
and California State, USA. These geographies were chosen due to their thumbnail images for segmentation testing. A binary tree / no tree label
prevalence of timber production, presence of old growth forests, was applied by human annotators. Pixels estimated to have a canopy
mountainous terrains, and high degree of tree biodiversity (Maioli et al., height above one meter (1 m) tall and with a canopy diameter of more
2020; Luyssaert et al., 2008; Ribeiro et al., 2011). The dataset was than three meters (3 m) were labeled as tree.
generated with a machine learning model utilizing a transformer
encoder and convolutional decoder trained with an input composite of 2.4. Supervised dataset
approximately 0.59 m GSD Maxar imagery spanning the years 2018 to
2020 and output labels from 1 m GSD aerial lidar. Our data and methods We gathered approximately 5800 canopy height maps (CHM),
sections are structured as follows. First, we describe the satellite and selected from the National Ecological Observatory Network (NEON)
aerial lidar data used for model training and map generation. Next, we (2022). Each CHM typically consisted of 1 km × 1 km geotiffs, with a
describe the model training specifics, including self supervised learning pixel size of one meter (1 m) GSD, in local UTM coordinates. We selected
and the methods for combining models trained on aerial lidar with the sites used by Weinstein et al. (2021) and additionally manually
models trained on GEDI observations, and the baseline models selected filtered for sites that have CHM imagery that was well registered and
and ablation studies performed. Finally, we present our approach for mostly free from mosaicing artifacts. Additionally, we selected sites with
qualitative and quantitative evaluation of height accuracy and tree imagery acquired less than two years from the observation date in the
segmentation, and discuss the generalization of our model. associated Maxar satellite imagery. A complete list of NEON sites used
for training and validation is contained in Appendix A.
2.2. Satellite image data description The CHM geotiffs were reprojected to a local tangent plane coordi­
nate system and resized to match the resolution of Maxar images. For
Maxar Vivid2 mosaic imagery1 served as input imagery for model each ALS CHM, a corresponding RGB satellite image was linked, and
training and inference. This dataset provides global coverage by these pairs of imagery served as the training data for our decoder model.
mosaicing together imagery from multiple instruments (WorldView-2 The 5800 images in the NEON ALS dataset were split in sets of 80%
(WV 2), WorldView-3 (WV 3), Quickbird II) and observation dates. By training images, 10% calibration and 10% set-aside validation images.
starting with this mosaiced imagery, we leveraged the extensive data During the training, validation and testing phases, we sampled 256 ×
selection pipeline from Maxar, resulting in imagery that had less than 256 random crops from the RGB - ALS image pairs. Model training was
2% percent cloud cover, a global revisit rate predominately (more than conducted over epochs sampled from the training dataset. At the
75%) below 36 months (imagery dates from 2017 to 2020 are utilized in completion of each epoch, metrics were computed from a 10% cali­
this dataset), view angles of less than 30 degrees off nadir, and sun angle bration dataset to calibrate the hyperparameters of the model training
of less than 60 degrees from zenith. This imagery consisted of three process. The calibration dataset was drawn from the same set of sites as
spectral bands: Red, Green, and Blue (RGB), with approximately a 0.5 m the training datasets, but from separate 1 km × 1 km geotiffs to ensure
GSD. The imagery was processed in the Web Mercator projection non overlapping pixels.
(EPSG:3857) and stored with the Bing tiling scheme.2 Given the high We constructed a set-aside validation dataset from a subset of sites in
resolution of the original geotiffs, Bing zoom 15 level tiles, with 2048 × our NEON dataset, which we call “NEON test”. None of the sites used in
2048 pixels per tile were used, giving a pixel size of 0.597 m GSD at the the validation dataset were contained in the training or calibration
equator. dataset. A list of NEON sites in the validation set appears in Appendix A.
We also prepared two validation datasets from other publicly available
2.3. Satellite image data preparation ALS Lidar datasets, outside of the NEON collection. These datasets
covered different geographic locations and ecosystems: “CA-Brande”
2.3.1. Image preparation (Brande, 2021) covered a coastal ecosystem in CA, and “São Paulo”
For easier training and validation of computer vision models, we (Dos-Santos et al., 2019) covered a region in the Brazilian São Paulo
extracted small regions from the input satellite imagery. Centered State. See Fig. A.18 for a visual breakdown of the Neon dataset splits.
around a given location, a box of fixed ground distance was selected, Where these datasets were available as CHMs, we directly used the
using a local tangent plane coordinate system. Due to the Web Mercator supplied CHMs. However, for the São Paulo datasets, which only con­
projection of the image tiles, the extracted images at each position had tained point cloud datasets, we processed CHMs following the pit-free
varying dimensions according to their latitude, which were re-sampled algorithm (Khosravipour et al., 2014). The pit-free algorithm was also
to a fixed number of pixels. We chose a box side length of 152.7 m, adopted by the NEON team for generating their CHM product, and we
which, when re-sampled to 256 × 256 pixel images, provided “thumb­ found that different input parameters to the pit-free algorithm had
nail” images that matches the lowest resolution (0.597 m) of the input negligible impact on the CHM output.
imagery described in Section 2.2. Using these thumbnail images both for
training and inference ensured that the dataset had constant number of 2.5. Data augmentation

The 256 × 256 pixel image thumbnail images of RGB and CHM im­
1
https://resources.maxar.com/data-sheets/imagery-basemaps-data-sheet. agery were augmented at training time, with random 90 degree rota­
2
https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-ti tions, brightness, and contrast jittering. We found that these
le-system. augmentations improved model prediction stability across the various

3
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 1. Overview of our approach for generating ALS-based CHMs. During the first stage, we employed the self-supervised learning approach Oquab et al. (2023) on
18 million 256 × 256 satellite images leading to a set of four spatial feature maps, and four feature vectors, extracted at different layers of the Vision Transformer
model (ViT). In the second phase, we trained a convolutional DPT decoder to predict CHMs.

Maxar observations in the input dataset. where the inputs are decomposed into 16 × 16 patches. The two net­
works were trained jointly to output similar feature representations. The
3. Model and data generation methods procedure is illustrated in the Phase 1 in Fig. 1. In a second phase
described in Section 3.2, we freeze the SSL encoder layers using the
Our goal was to create a model that produces high resolution canopy weights of the teacher model and train the decoder with ALS data to
height maps and generalizes across large geographic scales. To accom­ generate high-resolution canopy height maps.
plish that goal, we leveraged the relative strengths of two types of lidar
data. Aerial lidar provided high resolution canopy height estimation, but 3.2. High resolution canopy height estimation using ALS
lacks global spatial coverage. In comparison, GEDI has nearly global
coverage of transects, but its beam width of approximately 25 m did not We used the reference dataset described in Section 2.4, prepared
allow for the identification of individual trees. following the methods described in Section 2.3.1. The output of the ALS
After self-supervised pre-training on satellite images globally, our model was a raster of predicted canopy heights at the same resolution as
high-resolution ALS CHM prediction model was trained on images from the input imagery. For training the supervised decoder, we used the ALS
the NEON dataset, as detailed in Section 3.2 and Fig. 1. As the Neon CHM data described in Section 2.4 to create a connection between the
dataset only has a spatial coverage from sites only within the United SSL features and the full resolution canopy height image. In this second
States, we expect this ALS CHM model to perform well on ecosystems phase, we trained the decoder introduced in Dense Prediction Trans­
similar to the training set. To improve generalization of other ecosys­ former (DPT) (Ranftl et al., 2021) on top of the obtained features. This
tems and locations, a low resolution CHM model was independently approach is described in Fig. 1, phase 2. The DPT paper describes a full
trained on global GEDI data (Section 3.3). The GEDI model was used to model composed of a transformer encoder extracting features at
compute a rescaling factor map (Section 3.4), which adjusted the pre­ different layers. In the decoder, each output was reassembled and all
dictions made by the ALS CHM model. outputs were fused. In our second phase of ALS training, we replaced the
transformer of DPT by our own SSL encoder, and trained the DPT
3.1. Self supervised learning decoder part only, from scratch. Our best results were obtained by
freezing all layers from the SSL encoder. We employed a one cycle
Following the recent success of self-supervised learning on dense learning rate schedule with a linear warmup in the encoder training
prediction tasks from Oquab et al. (2023), we employed a self- stage and a “Sigloss” loss function. Further architecture and training
supervised learning step on 18 million globally distributed, randomly details are provided in Appendix D.
sampled 256 × 256 pixel Maxar satellite images to obtain an image Sigloss function. We take advantage from the similarity of canopy
encoder delivering features specialized to vegetation images. In the height mapping to the task of depth estimation and borrow the loss from
training phase, different views of the image were fed to two versions of Eigen et al. (2014). Given a true canopy height map c and our prediction
the encoder: a teacher model receiving global crops, and a student ̂c , the Sigloss is given by
model receiving local and global views where part of the crops were
masked (replaced by zero values). We employ a huge ViT architecture,

4
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 2. Overview of our methodology to generate predicted RH95 values using GEDI measurements across the globe. Terrain is used only during the training and set
to zero during inference.

Fig. 3. Post processing step using GEDI predictions during inference. We used the GEDI model to correct our CHM predictions, by computing a dense scaling factor,
and multiply it pointwise with the CHM prediction map.

√̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅̅
)2̅
√ ( ground truth data consisted of 13 million GEDI measurements, which
√1 ∑ λ ∑
L =α √ 2
δ − δi , (1) were randomly sampled from the full GEDI dataset described in Ap­
T i i T2 i pendix B.1. We trained the GEDI model to output a single scalar value for
a 128 × 128 pixel image patch, with a L1 loss on a regression task against
where δi = log(̂c i ) − log(ci ), and T is the number of pixels with valid the RH95 value from the GEDI instrument. The training details are
ground truth values. As in previous works, we fix λ = 0.85 and α = 10. specified in Appendix B.3.
Classification output. To avoid a bias towards small predicted values,
we implemented a classification step first, combined with the Sigloss
defined above. The strategy is described by Bhat et al. (2021) as the 3.4. Combining ALS and GEDI model outputs
uniform strategy. Specifically, we modified the output of our decoder to
return, instead of one scalar per pixel, a range of B bins. After a In this section, we describe the process of connecting our GEDI model
normalization on the predictions, we computed the scalar product be­ outputs (Section 3.3) with ALS model outputs (3.2). Conceptually, the
tween the obtained histogram of predicted bins and a linear vector ALS model output provides high resolution canopy estimates but lacks
ranging [0,B], with B set to 256. the global context to correctly estimate the absolute height of vegetation
in different ecosystems. Conversely, the GEDI model is trained on a
global dataset and contains position and metadata inputs (Fig. 2). A
3.3. Large scale canopy height estimation using GEDI prediction model schematic of the process is shown in Fig. 3.
Correlation between different lidar sources. The first step in making the
To mitigate the effect of the limited geographic distribution of GEDI/ALS connection is understanding the relationship between the two
available ALS data, we employed a second regression network trained on sets of lidar data: ALS CHM (Section 2.4) and GEDI lidar (Section Ap­
GEDI data to rescale the ALS network outputs. The GEDI prediction pendix B.1). These two datasets make measurements of fundamentally
model was a simple convolutional network, containing five convolu­ different properties of canopy structure. GEDI measures the relative
tional layers, followed by five fully connected layers. The inputs to the height of canopy based on the full waveform measurement of the return
model were 128 × 128 pixel Maxar images containing three RGB bands, energy from 25 m diameter beam footprints while aerial lidar constructs
in topocentric coordinates, processed as described in Section 2.3.1. The higher resolution point clouds. To connect these two, we ran simulations

5
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 4. Canopy Height Map (CHM) for the state of California, inset showing zoomed in region with input RGB imagery.

with the GEDI simulator from Hancock et al. (2019) on the NEON ALS were smoothed with a 20 pixel sigma Gaussian kernel sσ to prevent sharp
point clouds. We found that there was a strong correlation (R2 = 0.88) transitions, and the correction factor was clipped between 0.5 and 2 to
between the 95th percentile of ALS canopy height maps and the simu­ avoid drastic rescaling.
lated GEDI RH95 (see Appendix B.2).
GEDI based correction of ALS trained maps. We used this correlation to
3.5. Baselines
scale the ALS model canopy height maps by computing a scalar multi­
plier that match percentiles of the CHM map with the GEDI model
3.5.1. ResUNet-based approach
predicted value for GEDI RH95. This process works as follows:
We utilized a ResUNet-18 architecture for our baseline (Zhang et al.,
Given an input RGB image, x, we combined the outputs of the ALS
2017), which is an encoder-decoder architecture predicting a N × N
and GEDI models by computing a dense correction factor γ(x), so that
canopy height map from a 3 × N × N RGB image, with N = 256. The
the novel prediction, C′(x) was related to the ALS model CHM, C(x): baseline model was trained with the sigloss between the predicted and
ground truth CHMs. We also experimented with a classification output,
C′(x) = γ(x) ⊙ C(x) (2)
however we did not obtain improvements from this approach.
where
3.5.2. Supervised transformer-based approach
γ(x) =
1 + sσ (G(x) )
( (( ) ). (3) To assess the benefit of the self supervised training phase on Satellite
1 + sσ Q(x)95 data, we consider a baseline given the state-of-the-art vision SWAG
encoder (Singh et al., 2022). We used the large version of this Vision
Here G(x) is the output CHM of our GEDI model and Q(x)95 is a per
Transformer (ViT) that was trained to perform hashtag prediction from
block upsampled 95th percentile of the ALS model CHM in meters,
Instagram images. At the time of writing this manuscript, this model was
computed over the exact same 128 × 128 pixel input regions as the input
in the top ten models with highest accuracy on ImageNet, CUB, Places,
to the GEDI model in G(x). More specifically, each input image was
and iNaturalist datasets, providing a warranty of feature quality. This
divided in four crops, each one independently fed to the height predic­
model contains the same number of parameters as our SSL encoder,
tion model, leading to four scalars, that were concatenated and
allowing a fair comparison in terms of model size.
upsampled. From the predicted CHM map by our ALS model, we
computed four percentiles from the same crops, concatenated and
upsampled in the same way. 3.6. Data validation
We used the ratio in Eq. (3) rather than G(x)/Q(x)95 to down-weight
noisy model estimates near zero canopy height. Since G(x) and Q(x)95 We evaluated the model performance against a variety of metrics,
are lower resolution than C(x), the correction factor map was upsampled which we divided into two broad classes: (1.) Metrics which primarily
to match the resolution of the ALS CHM, C(x). The ALS and GEDI maps evaluated the accuracy of canopy height maps, which we call canopy
height metrics (Section 4.1), and (2.) Metrics which primarily evaluated

6
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 5. Canopy Height Map (CHM) for the state of São Paulo, inset showing zoomed in region with input RGB imagery.

Fig. 6. Selected sample regions from the canopy height predictions (log scale), overlayed on the input Maxar imagery (RGB). Canopy height prediction below 0.1 m
is transparent. The top row corresponds to regions in California and the bottom row, São Paulo.

the accuracy of image segmentation into tree or no tree pixels, which we labels independently labeled by photo-interpretation of Maxar imag­
call segmentation metrics (Section 4.2). The set-aside validation dataset ery (Section 4.2.1).
of ALS canopy height maps described in Section 2 served as the primary
dataset for all types of metrics. For the segmentation metrics, we also
evaluated the model predictions against a dataset of human-annotated

7
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 7. Comparison of our CHM (second column) with that of Lang et al. (2022a) (third column) and Potapov et al. (2021) (fourth column).

Table 1
Comparison of results with SSL pre-training on different datasets and with other supervised strategies (ResUNet, SWAG). IN: ImageNet. Sat: dataset described in
Section 2.3.2. IG: Instagram dataset. R: DPT decoder with a regression (scalar) output. C: DPT decoder with a classification (256 bins) output. ViT L: large, H: huge.
Note that the results are non GEDI corrected in this table, and all models were trained with a Sigloss. We later denote the model in the last line as the “SSL” model.
pre-training NEON test set São Paulo CA Brande

model size dataset MAE R -block


2 ME MAE R -block
2 ME MAE R2 -block ME

ResUNet RN18 IN1k 3.1 0.63 0.0 5.2 0.42 − 2.2 0.6 0.74 ¡0.1
SWAG C ViT L IG 3.0 0.63 − 1.6 5.8 0.16 − 4.3 0.7 0.56 − 0.6
DINOv2 R ViT L IN1k 3.4 0.52 − 1.4 6.8 − 0.20 − 5.2 0.6 0.67 − 0.4
DINOv2 R ViT H IN22k+ 3.0 0.62 − 1.4 5.7 0.27 − 2.9 0.6 0.62 − 0.4
DINOv2 R ViT L Sat 3.5 M 2.8 0.67 − 1.2 6.0 0.14 − 4.2 0.6 0.70 − 0.5
DINOv2 R ViT L Sat 18 M 2.9 0.66 − 1.4 4.9 0.46 − 1.9 0.6 0.68 − 0.5
DINOv2 C ViT L Sat 18 M 2.7 0.70 − 0.9 5.0 0.46 − 2.1 0.6 0.80 − 0.3
DINOv2 C ViT H Sat 18 M 2.6 0.70 − 0.9 5.2 0.39 0.4 0.6 0.81 ¡0.1

4. Results of 50 × 50 pixels (∼ 30 × 30 meters). We have chosen the exact size of


these blocks somewhat arbitrarily, but were motivated to compute on a
We generated CHMs for the State of California, USA (Fig. 4) and São scale of 10s of meters due to: a.) georegistration errors in both the Maxar
Paulo, Brazil (Fig. 5) by running inference on 0.59 m GSD Maxar images imagery and ALS data, b.) projection differences between the two
with the SSL + GEDI ViT huge model trained with 1 m aerial lidar data. datasets, with the ALS data being orthorectified and the Maxar imagery
In California, 39% of the area used Maxar imagery observed in 2020, have off nadir view angles of up to 30 degrees. As such, the R2 -block
and 90% within the years spanning 2018 to 2020. In São Paulo, 63% of score better reflects the local accuracy of CHMs and provides a more
the area was observed in 2019, and 94% within the years spanning direct performance comparison to lower resolution models. However,
2017–2019. Small regions of the canopy height predictions are visual­ averages across blocks of this resolution do not provide a good indicator
ized in Fig. 6. We compare our maps to the previously available highest of the edge accuracy of the produced maps, which can be a desirable
resolution, global canopy height maps of Lang et al. (2022a) and Pota­ property for downstream tasks such as segmentation. We separately
pov et al. (2021) in Fig. 7. We have added the full resolution dataset to report the Edge Error (EE) metric we developed to measure the sharp­
AWS Opendata programs, in the form of cloud optimized geotiffs ness of the maps, described in Appendix C.3. Finally, to estimate the bias
(COGS) with associated cutlines and image acquisition dates.3 Addi­ of different models, we report the Mean Error (ME). We provide for­
tionally, these datasets are visible on a Google Earth Engine public url.4 mulas for the above mentioned metrics in Appendix C.

4.1. Canopy height metrics 4.1.1. Canopy height metrics for ALS models
We present in Table 1 an ablation study of different pre-training data,
We compared the predicted canopy height maps with aerial lidar model size and output on the Neon and São Paulo test sets. From this
data in terms of mean absolute error (MAE), Root Mean Squared Error ablation study, we selected the SSL model trained on 18 million images
(RMSE), and R2 -block (R2 ). The R2 -block score is the coefficient of utilizing the classification output, which achieved the highest canopy
determination, which we computed on cropped images with a resolution height accuracy metrics. We also trained a huge model instead of a large
one, that significantly reduced the bias of the predictions on the São
Paulo dataset. We refer to this model as the SSL model throughout the
3
https://registry.opendata.aws/dataforgood-fb-forests/. paper. Table 1 suggests that pre-training on satellite images gives better
4
https://wri-datalab.earthengine.app/view/submeter-canopyheight. results compared to pre-training on ImageNet. Compared to the ViTs

8
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Table 2
Canopy Height Metrics to assess the gedi correction step. R2 corresponds to ∼ 30 × 30 meter block R2 . “Average” is the unweighted average across datasets.
NEON test CA-Brande São Paulo Average

MAE RMSE R2 MAE RMSE R2 MAE RMSE R2 MAE RMSE R2

ResUNet 3.1 4.9 0.63 0.6 1.6 0.75 5.2 7.4 0.42 3.0 4.6 0.60
ResUNet + GEDI 3.0 4.8 0.64 0.6 1.6 0.74 5.4 7.7 0.35 3.0 4.7 0.58
SSL 2.6 4.4 0.70 0.6 1.4 0.82 5.2 7.5 0.39 2.8 4.5 0.64
SSL + GEDI 2.7 4.5 0.69 0.6 1.5 0.80 5.1 7.3 0.41 2.8 4.4 0.63

Fig. 8. Block (∼ 30m × 30m) aggregated SSL + GEDI model predictions compared to ALS ground truth measurements for different set-aside validation datasets.
Colormap density is normalized to the 99.6th percentile of the heatmaps.

Fig. 9. Global model evaluation on held-out GEDI data. (a) p95 of block (76m × 76m) model CHM predictions compared to the measured GEDI RH95 metrics. (b)
Left: Difference between the p95 of block model CHM predictions and the measured GEDI RH95 metrics w.r.t model CHM predictions. Negative values indicate that
the model estimates are lower than the GEDI RH95 values. Residuals in function of RH95 appear in Appendix F. Right: CHM p95 in function of RH95.

9
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Table 3 experimented with different loss functions, and a smaller dataset for self-
R2 between predicted CHM p95 and GEDI RH95 by geographic subregion for supervised pre-training. We found that was training on more data was
20,000 test GEDI observations for models with and without the GEDI calibration leading to much better results in São Paulo. Comparing L1, L2 and
model. Sigloss, we found that Sigloss and L2 were leading to the best results.
Subregion SSL ResUNet SWAG Additional discussion of these trials can be found in Appendix E.
GEDI + – + – + –
4.1.2. Canopy height metrics for ALS + GEDI models
Central Asia 0.22 0.19 0.25 0.23 0.23 0.17
Table 2 presents a quantitative validation of the best performing
Eastern Asia 0.50 0.44 0.47 0.42 0.43 0.38
Eastern Europe 0.70 0.66 0.67 0.61 0.67 0.63 models, namely the ResUNet and the self-supervised model (SSL),
Latin America + Caribbean 0.68 0.64 0.65 0.56 0.64 0.56 combined with the GEDI correction step (ResUNet+GEDI, SSL + GEDI).
Melanesia 0.52 0.48 0.51 0.41 0.44 0.45 We note the improved performance of the SSL model compared to the
Northern Africa 0.12 0.11 0.10 0.06 0.06 0.05 ResUNet in the NEON test and CA-Brande datasets. Although the SSL
Northern America 0.73 0.69 0.70 0.65 0.69 0.64
Northern Europe 0.54 0.46 0.41 0.30 0.33 0.33
model performed the best across the datasets in the USA (NEON test and
Oceana 0.68 0.63 0.66 0.58 0.61 0.54 CA-Brande), it performed worse than the ResUNet and ResUNet + GEDI
South East Asia 0.46 0.36 0.45 0.34 0.44 0.32 for São Paulo, possibly due to the large domain shift in ecosystems. In
Southern Asia 0.52 0.49 0.52 0.48 0.47 0.42 the case of São Paulo, we found that the inclusion of GEDI (“SSL +
Southern Europe 0.46 0.47 0.42 0.37 0.46 0.40
GEDI”) produced the best results, possibly indicating better general­
Sub-Saharan Africa 0.68 0.66 0.58 0.50 0.64 0.59
Western Asia 0.53 0.49 0.53 0.47 0.47 0.42 ization by including the globally trained GEDI model, which also in­
Western Europe 0.64 0.59 0.64 0.55 0.58 0.50 cludes additional metadata such as geographic position (Fig. 2).
Overall 0.61 0.52 0.59 0.44 0.54 0.37 Fig. 8 shows 2D-histograms of the SSL + GEDI model predictions vs
the set-aside validation ALS-derived canopy height averaged over
∼ 30m blocks and the corresponding pixel MAE and block-R2 scores.

4.1.3. Quantitative comparison of CHM model with GEDI RH95 data


The ALS set-aside validation datasets used in the previous section are
limited in both total area and geographic coverage. In this section, we
leverage the global coverage of the GEDI dataset to validate our CHM
models. As described in Appendix B.2, CHM maps can be connected to
GEDI RH95 metrics by taking the 95th percentile. In this analysis, we
draw 2 × 104 GEDI samples globally in the set-aside validation split the
same way as in training the GEDI model, i.e., weighted proportional to
the square root of the inverse sample size of its RH95 bin. Similarily to
Potapov et al. (2021), we removed GEDI observations corresponding to
< 0.5 normalized difference vegetation index (NDVI) that had no tree
cover in the 2010 data of Hansen et al. (2013), corresponding to 337
samples out of 20,000. In Fig. 9a, we show the scatter plot and histogram
of the 128 × 128 pixels (76m × 76m) block 95th percentile vs. the
measured GEDI RH95. In Fig. 9b, we analyze the difference of the CHM
p95 and the GEDI RH95 with respect to the referenced GEDI RH95
heights.
We found that the p95 of the CHM model had a small negative bias
against the GEDI RH95 values and adding the GEDI correction to the
CHM model significantly reduces the bias. There is a slight positive bias
in the GEDI RH95 data due to the terrain slope (Lang et al., 2022a). We
used terrain slope (Mapzen, 2017) as one of the input metadata when
training the GEDI model (see Section 3.3), while setting the terrain slope
to zero during inference. With this setup, we were able to calibrate out
the positive bias caused by terrain slope in our GEDI model.
To assess the importance of the GEDI calibration model for
geographic generalization, and the generalizability of the different
baseline models, we computed R2 on globally distributed GEDI test data
(Table 3).
We found that the SSL + GEDI model had the highest agreement with
GEDI RH95 data in 13 of 15 geographic regions. In 42 out of 45 com­
binations of subregions and models, including the GEDI correction
model increased R2 .

Fig. 10. CHM error compared to reference tree height as indicated in the 4.1.4. Correlation with field data
Brazilian National Forest Inventory for Espirito Santo. To measure the agreement between our computed CHMs and field-
collected tree height data, we utilize the Brazilian National Forest In­
that are pre-trained on ImageNet, including the SWAG approach, the ventory (NFI) data, which consists of systematic field plot inventories of
ResUNet remains the strongest baseline. The SSL model clearly out­ tree count and height (da Luz et al., 2018). Because the NFI data for São
performs the ResUNet on Neon, reducing the MAE from 3.1 to 2.6 m, is Paulo was not yet available, we additionally generate a CHM of the
also improving results on CA Brande, and leads to similar results on São nearby Espirito Santo state and evaluate its agreement with the NFI data
Paulo, with a slightly worse R2 but a much lower ME. We also for Espirito Santo. The NFI data analyzed encompassed 1450 10 × 10 m

10
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Table 4
Segmentation metrics. U/P corresponds to pixel user's / producer's accuracy of the tree class. IOU to the average of tree & no tree IOU class scores. EE: Edge error.
NEON test CA-Brande São Paulo Average

U/P IOU U/P IOU U/P IOU U/P IOU EE

ResUNet 0.74/0.75 0.58 0.72/0.64 0.70 0.91/0.85 0.67 0.79/0.75 0.65 0.50
ResUNet + GEDI 0.77/0.68 0.53 0.73/0.52 0.68 0.91/0.84 0.65 0.80/0.68 0.62 0.52
SSL 0.81/0.76 0.65 0.71/0.75 0.76 0.90/0.88 0.67 0.82/0.81 0.68 0.50
SSL + GEDI 0.82/0.71 0.59 0.74/0.60 0.74 0.91/0.86 0.66 0.83/0.76 0.66 0.49

initial plots considered, we removed 291 that had tree cover loss since
Table 5 2014 in the dataset of Hansen et al. (2016). Fig. 10 visualizes box plots of
Segmentation metrics on global, human annotated dataset. U/P corresponds to
the 95th percentile CHM by reference NFI height bins. The overall ME
pixel user's / producer's accuracy. IOU to the average of tree & no tree IOU
was 0.72 m while the RMSE was 4.25 m, with a slight positive bias for
scores. Since the GEDI correction only adjusts large scale height percentiles, the
“+GEDI” rows show only small improvement over the base ALS models. trees ≤15 m (ME = 1.10 m, RMSE = 4.28 m), and negative bias for trees
>15 m (ME = − 1.00 m, RMSE = 3.79 m).
Global, Annotated

U/P IOU
4.2. Segmentation metrics
ResUNet 0.89/0.86 0.75
ResUNet + GEDI 0.90/0.86 0.74
SSL 0.83/0.87 0.77 In addition to the canopy height metrics discussed in Section 4.1, we
SSL + GEDI 0.82/0.88 0.77 compute a number of metrics that reflect the ability of the model to
correctly assign individual pixels as trees. CHMs were converted into
binary masks by thresholding height values of at least five meters (5 m)
subplots within 87 plots positioned within a 20 × 20 km grid in Espirito
as tree canopy extent. Table 4 shows the pixel user's and producer's
Santo. The field data was collected primarily in November and
accuracy values (also know as precision and recall, respectively) for
December 2014, and includes the height of each tree within each subplot
pixels labeled as trees. Table 4 also shows the Intersection Over Union
having a diameter at breast height (DBH) of at least 10 cm. Of the 1450
(IOU) for the binary masks, which was calculated as the average of IOU

Fig. 11. Tree segmentation predictions from the SSL + GEDI model vs human annotated ground truth. Binary prediction masks were created from the CHM by
thresholding at 1 m. U/P corresponds to pixel user's / producer's accuracy of the tree class. The IOU represents the Intersection-Over-Union score for the tree class.

11
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 12. Pixelwise user's accuracy (UA) and producer's accuracy (PA) for 8903 validation plots stratified by geographic sub-region. Error bars represent the 80, 90,
and 95% confidence intervals as derived from 10,000 bootstrap iterations. Numbers in the x-axis tick labels denote sample size.

Fig. 13. Qualitative comparison between different models for example imagery. Left: Input Maxar “thumbnail” image, 256 × 256 pixels, in local tangent plane
coordinate system. Second from left: ALS CHM image, in same projection and pixelization. Right columns: Model CHMs.

for pixels labeled as tree and the IOU for pixels labeled as ground. edges in both maps. Scores range between 0 and 1, where lower scores
Additionally, we introduce an Edge Error (EE) metric that computes indicate improved accuracy along patch edges. In Table 4, the edge error
the ratio of the sum of the absolute difference between edges from is computed over all set-aside validation datasets. We detail the formula
predicted and ground truth CHM, normalized by the sum of detected with a figure illustrating the behavior of this metric in Appendix C.3.

12
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

aside validation set where the GEDI measurement had RH95 greater
than 3 m, which we enforce to bias the dataset towards vegetated areas.
The data is independent of the aerial lidar measurements used to train
the model. Over the entire dataset, the user's and producer's accuracy
was 0.88 ± 0.006 and 0.82 ± 0.008, while the IOU was 0.77 ± 0.006
indicating good agreement with the human annotations, cf. Table 5.
Fig. 11 shows examples of model predictions and their corresponding
annotations.
We additionally calculated user's and producer's accuracy by
geographic subregion according to the United Nations geoscheme.
Boostrapping with 10,000 iterations was used to calculate uncertainty
for tree segmentation accuracy metrics rather than the methods of
Stehman (2014) because the cluster sampling approach was used to
generate validation data (Olofsson et al., 2014; Mugabowindekwe et al.,
2022; Maxwell et al., 2021). This validation analysis indicated strong
generalizability across different geographic regions, without signifi­
cantly different accuracy metrics in geographic regions where we had
training data and where we did not (Fig. 12). This suggests that the use
of self supervised learning on global images facilitated the creation of a
generalizable segmentation network.

4.3. Qualitative comparison of models

Although we have attempted to capture the performance of each


model qualitatively with the included metrics, we note that visual in­
spection often leads to additional insights. Therefore, we additionally
present a few examples of maps produced by our various models. Fig. 13
compares the results with a ResUNet and SSL based strategies.

4.4. Canopy height as a function of plantation age


Fig. 14. Canopy height estimates for areas with tree cover gain of various ages
in São Paulo relative to the imagery year analyzed. Densely planted monoculture stands, such as those commonly found
in the Atlantic forest, can be many hundreds of hectares large. Assessing
the age-height relationship of tree stands with CHMs derived from op­
Table 6
tical imagery may be challenging due to the relative homogeneity of the
CHM prediction accuracy on NEON test dataset using aerial input images as
canopy structures, the large area to perimeter ratio, and the lack of
inputs. Trained on satellite images only, the SSL approaches demonstrates
canopy gaps. To assess the CHMs ability to map the height of planted
generalization abilities.
trees, we utilized the annual 30-m tree cover gain and loss data from
Neon test - aerial
MapBiomas in São Paulo (Azevedo et al., 2018). We calculated the
Encoder Decoder MAE block ME EE number of years since the most recent tree cover gain with no subse­
training training R2 quent loss event for each image date analyzed. Fig. 14 shows a positive
dataset dataset
relationship (R2 = 0.59) between the number of years since the most
ResUNet INet Sat. images 3.7 0.34 − 2.0 0.77 recent tree cover gain, and our predicted 95th percentile CHM. For areas
SSL Sat. images Sat. images 3.0 0.55 1.7 0.71
with gain events older than seven years, there was no significant age-
SSL Sat. images aerial 1.8 0.86 ¡1.0 0.41
height relationship, as areas with trees with gain events more than
seven years prior to the analysis year had similar height distributions to
We observe an improvement of the SSL approach over the ResUNet areas with stable (no gain or loss since 2000) trees. For this analysis, it's
baseline in terms all segmentation metrics. Both approaches leads to important to note that the tree cover gain year identified in MapBiomas
maps with the same level of sharpness, and the GEDI correction slightly is a lagging indicator of the tree age, since tree cover gain is not
degrades results. immediately recognizable from Landsat imagery.

4.2.1. Tree detection metrics evaluated against human annotated validation 4.5. Generalization to aerial imagery
data
To assess the ability of the model to generalize to new geographies, Using a model fully trained on Satellite images. To assess the general­
we compiled human-annotated validation labels for tree detection (bi­ ization ability of our approach to other input imagery, we measure
nary classification of tree vs no-tree) across 8, 903 Maxar thumbnail model performance using airborne imagery at inference. For inference,
images. Human annotators were instructed to label any trees above one we resized the NEON aerial images to match the size of corresponding
meter (1 m) tall and with a canopy diameter of more than three meters satellite image, and apply a normalization of the aerial image to match
(3 m). Annotators were to include standing dead trees and tree-like the color histogram of the satellite imagery. Details about image
shrubs, but exclude any grasslands, row crops, lawns, or otherwise normalization are provided in Appendix G.
vegetative ground cover whose peak height was estimated to be less than The second line of Table 6 shows canopy height metrics computed on
1 m from the ground surface. To create the model binary masks for the predictions made from NEON input RGB imagery. The SSL model almost
annotated dataset, we thresholded the model CHM at 1 m. doubles the R2 of the ResUNet baseline. Compared to the performance of
The geographic locations for the images in the dataset correspond to the SSL model with satellite images as input as reported in Table 1, the
randomly sampled GEDI measurement footprints from our global set- MAE is only slightly higher (3.0 instead of 2.7), the R2 is a bit more

13
J. Tolan et al. Remote Sensing of Environment 300 (2024) 113888

Fig. 15. Performance of models given aerial images inputs. Top: Model fully trained on satellite images; Bottom: Performance of encoder trained on satellite images,
decoder trained on aerial images.

Fig. 16. Generalization of our SSL approach. Even if trained on Satellite images, inference on airborne images does not seem to suffer from a domain shift.

RGB aerial image Lidar Ground Truth Wagner et al. (2023) Our predicted CHM
Fig. 17. Comparison of our aerial model, where we trained the DPT decoder on Neon aerial RGB images, with the approach of Wagner et al. (2023). Note that despite
a slight change in the scale of the input image, which was zoomed to obtain a 256 × 256 input, and despite the fact we did not use the infra-red input, we obtain a
result similar to the one of Wagner et al. (2023). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of
this article.)

Training a new decoder on aerial images. We compared these results to another baseline, training a decoder on top of our pretrained SSL features on NEON RGB images (last line of Table 6). Given the better alignment with the ground truth CHMs, and view angles close to nadir across the NEON dataset, this aerial model performed reasonably well compared to the recent result of Wagner et al. (2023), while only using the RGB channels, as illustrated in Fig. 17.

5. Discussion

Our proposed method provides a novel approach to estimating canopy height from VHR satellite imagery. We demonstrate the effectiveness of our approach based on self-supervised learning and dense vision transformers, and we introduce an approach to rescale high-resolution canopy height maps from a model trained on Maxar and ALS data using low-resolution canopy height maps from a model trained on Maxar and GEDI data. In contrast to Lang et al. (2022a), which downscales the 25-m GEDI data to generate 10-m canopy height maps by only considering Sentinel-2 pixels at the centroid of each GEDI pixel, our approach uses a GEDI-based canopy height map to rescale an ALS-based canopy height map. While both Lang et al. (2022a) and Potapov et al. (2021) only utilize ALS data to validate their generated maps, we directly model the relationship between Maxar imagery and ALS data, enabling the mapping of sub-GEDI-scale canopy height variability, sometimes at a per-tree level outside of dense, closed-canopy forests.

Segmentation. Previous research applying deep learning image segmentation approaches to map trees in high-resolution imagery, such as Brandt et al. (2020) and Mugabowindekwe et al. (2022), has utilized a U-Net model (Ronneberger et al., 2015) and entirely hand-labeled reference data. Focusing on Rwanda, Mugabowindekwe et al. (2022) map carbon stock estimates for individual trees by developing empirical relationships between crown area and carbon, finding that half of the national tree carbon stock is located outside of natural forests. In comparison to these approaches, our results suggest that incorporating SSL can improve model generalizability for vegetation structure mapping, in line with various research demonstrating the effectiveness of SSL in other domains. Additionally, our per-pixel height predictions combine the predictive quality of height for assessing biomass, as demonstrated in Lang et al. (2022b) and Potapov et al. (2021), with the predictive quality of crown area for assessing biomass, as demonstrated in Mugabowindekwe et al. (2022) and Skole et al. (2021).

Limitations. The production of high-resolution canopy height maps from optical imagery has challenges and associated limitations. Foremost, the availability of recent ALS training data is limited in geographic scope. While the utilization of self-supervised learning and the GEDI-based corrective model improves generalization and reduces test error, increased geographic availability of ALS remains necessary to further validate the proposed maps in new geographies. While we were able to validate error as a function of canopy height for trees under 25 m based on field inventory data in Espirito Santo, Brazil (Fig. 10), we were unable to utilize field data to assess potential height saturation for very tall trees, which may affect derived above-ground carbon data. However, Fig. 9a does suggest significant saturation of predictions for GEDI RH95 observations above 30 m.

The generated maps are also limited by variation in the input imagery, particularly by variation in view angle, sun angle, acquisition times and dates, and optical aerosol. As shown in Fig. 17, qualitative data quality improves considerably when processing VHR aerial optical imagery, as opposed to VHR satellite optical imagery. Additionally, terrain slope appears to influence predicted height, since it affects the length of the shadow an individual tree casts. At present, the ability to conduct tree height change detection assessments is limited by the need for improved input image processing to better align these differences between image pairs.

6. Conclusion

This study presents high-resolution canopy height maps at a jurisdictional scale based on VHR (Maxar) optical imagery, trained on aerial lidar and calibrated with spaceborne lidar (GEDI) data, using the latest advances in self-supervised learning and vision transformers. We demonstrate quantitatively and qualitatively the advantages of large-scale self-supervised learning and the versatility of the obtained representations, which allow generalization to different geographic regions and input imagery. Compared to existing canopy height maps, the presented data better capture tree structure variability at small spatial scales. Such very high resolution maps of canopy height can improve the monitoring of forest degradation, restoration, and forest carbon dynamics. The next steps include (a) developing and validating allometrically derived high-resolution woody carbon data and (b) testing and validating the utility of the proposed approach for the operational monitoring of tree growth.

CRediT authorship contribution statement

Jamie Tolan: Conceptualization, Supervision, Writing – original draft. Hung-I Yang: Investigation, Data curation. Benjamin Nosarzewski: Validation, Data curation, Investigation. Guillaume Couairon: Methodology, Investigation, Writing – review & editing. Huy V. Vo: Methodology, Investigation. John Brandt: Writing – original draft, Visualization, Formal analysis. Justine Spore: Writing – original draft, Visualization. Sayantan Majumdar: Investigation. Daniel Haziza: Software, Data curation. Janaki Vamaraju: Software. Theo Moutakanni: Methodology. Piotr Bojanowski: Conceptualization. Tracy Johns: Supervision. Brian White: Methodology. Tobias Tiecke: Supervision, Visualization. Camille Couprie: Conceptualization, Methodology, Investigation, Writing – original draft.

Declaration of Competing Interest

None.

Data availability

Input imagery is licensed by Maxar and is not publicly available. We share the derived maps of canopy height under a Creative Commons 4.0 license; they are available for public download.


Acknowledgments

We would like to thank Ben Weinstein for helpful discussions regarding the NEON dataset. We thank Andi Gros and Saikat Basu for their technical advice. We would like to thank Shmulik Eisenmann, Patrick Louis, Lee Francisco, Leah Harwell, Eric Alamillo, Sylvia Lee, Patrick Nease, Alex Pompe and Shawn Mcguire for their project support.

Declaration of Competing Interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Appendix A. Data used in training / calibration / validation

Fig. A.18. Distribution of ALS datasets into train/calibration/set-aside validation (aka train/validation/test) splits. (a) All ALS datasets; train and calibration points overlap and are shown in blue, while the set-aside validation (test) datasets come from non-overlapping geographic regions. (b) Zoom in on one train/calibration site (NEON GRSM): we randomly split the data into spatially non-overlapping tiles, so that calibration data is drawn from the same sites and ecosystems as the training data but is separated spatially. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
The NEON sites used during training / calibration are: SJER, SOAP, TEAK, BART, DSNY, HARV, JERC, OSBS, DELA, GRSM, LENO, MLBS, BLAN, CLBJ,
KONZ, NOGP, SCBI, TALL, UKFS, WOOD, ABBY, BONA, DEJU, JORN, MOAB, OAES, ONAQ, SERC, SRER, UNDE, WREF, HEAL, LAJA, RMNP, PUUM.
The set-aside validation dataset, “NEON test”, contains the following NEON sites: CUPE, REDB, WLOU, HOPB, GUAN.
To ensure repeatability of our approach, we provide a complete list of CHM files used during training/calibration at: https://dataforgood-fb-data.s3.amazonaws.com/forests/v1/NEON_training_images.csv

Appendix B. GEDI Dataset and model training details

B.1. GEDI dataset

The GEDI instrument is a full-waveform lidar instrument aboard the International Space Station which has sampled global regions between 51.6° N & S latitude with a ~25 m beam footprint at the ground surface. The instrument details are described in Dubayah et al. (2020), and its measurements of canopy height are described in Dubayah et al. (2022). We used the GEDI L2A Version 2 product and filtered the dataset to reduce noise by only including data which had: degrade flag = 0, surface flag = 1, solar elevation < 0, and sensitivity > 0.95. After this filtering, we were left with a total sample size of 1.3 × 10⁹ measurements. We used the 95th percentile of relative height (RH95), which we paired with 128 × 128 pixel (76 × 76 m) satellite images from Maxar. These images were processed as described in Section 2.3.1, but were smaller in order to more closely approximate the GEDI footprint. Although these images are still significantly larger than the 25 m GEDI footprint, we have found improved results from our GEDI model using larger areas, potentially due to pointing errors in the GEDI data and to the larger spatial context improving the CNN model results.
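For illustration, this quality filtering can be expressed as follows. This is a minimal sketch assuming the L2A shots have been loaded into a pandas DataFrame; the column names are hypothetical stand-ins for the corresponding L2A variables.

```python
import pandas as pd

def filter_gedi_shots(shots: pd.DataFrame) -> pd.DataFrame:
    """Apply the quality filters described above to GEDI L2A shots.

    Column names are assumptions for illustration; adapt them to
    however the L2A granules were ingested.
    """
    mask = (
        (shots["degrade_flag"] == 0)       # no degraded pointing/positioning
        & (shots["surface_flag"] == 1)     # shot intersects the land surface
        & (shots["solar_elevation"] < 0)   # night-time acquisitions only
        & (shots["sensitivity"] > 0.95)    # high-sensitivity waveforms
    )
    return shots.loc[mask]
```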

B.2. Connection between ALS CHM 95th percentiles and GEDI RH95

To leverage the GEDI model output, we made the following assumption: the GEDI model, on a 128 × 128 pixel sample, approximates the 95th percentile (p95) of the sample's ground truth canopy height map. This is justified by running simulations with the GEDI simulator from Hancock et al. (2019) on the NEON ALS point clouds. We used simulated values rather than actual GEDI measurements because the GEDI measurements suffer from pointing errors, and because the simulator allows for denser sampling within the limited geographic footprint of our ALS dataset.

The GEDI RH95 measurement used for training the GEDI model corresponds to the 95th percentile of the lidar's energy response. We simulated the GEDI RH95 values and found that they have excellent correlation (R2 = 0.88) with the 95th percentile of the canopy height map around the corresponding GEDI footprints. This high correlation between GEDI RH95 and the p95 of the CHM was consistent across the diverse ecosystems covered by all 40 NEON sites in Appendix A.
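For reference, the Gaussian-weighted 95th percentile used in this comparison (see Fig. B.19) can be computed along the following lines. This is a sketch under our stated assumptions; the pixel size and the exact footprint weighting used in the simulator comparison may differ.

```python
import numpy as np

def gaussian_weighted_p95(chm: np.ndarray, pixel_size_m: float = 0.5,
                          sigma_m: float = 12.5) -> float:
    """Weighted 95th percentile of a CHM tile centered on a GEDI footprint.

    Pixels are weighted by a Gaussian of their distance to the tile center
    (sigma = 12.5 m), roughly approximating the GEDI beam response.
    """
    h, w = chm.shape
    yy, xx = np.mgrid[0:h, 0:w]
    r2 = ((yy - h / 2) ** 2 + (xx - w / 2) ** 2) * pixel_size_m ** 2
    weights = np.exp(-r2 / (2 * sigma_m ** 2)).ravel()

    heights = chm.ravel()
    order = np.argsort(heights)
    cdf = np.cumsum(weights[order]) / weights.sum()
    # Smallest height at which the weighted CDF reaches 0.95.
    return float(heights[order][np.searchsorted(cdf, 0.95)])
```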


B.3. Height prediction network training

The GEDI measurements were split into 80/10/10% train/calibration/set-aside validation subsets. During training, samples were drawn with a weight inversely proportional to the natural log of the total number of global samples in their RH95 bin, where each bin has a width of 1 m. We found that this sampling method provided a good number of training samples from higher canopy height locations while not overly biasing the model towards ecosystems with the highest canopy heights. Log-inverse sample weighting is a less aggressive re-weighting than the typical linear inverse weighting, as done in Lang et al. (2022a), which we chose so as not to overly bias the model towards the relatively few high canopy height samples.
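A minimal sketch of this log-inverse weighting follows; the clipping of heights and the guard for nearly empty bins are assumptions.

```python
import numpy as np

def log_inverse_weights(rh95: np.ndarray) -> np.ndarray:
    """Per-sample weights inversely proportional to log(bin count).

    Bins are 1 m wide in RH95; samples in rare (tall-canopy) bins are
    drawn more often, but less aggressively than with 1/count weighting.
    """
    bins = np.clip(np.floor(rh95), 0, None).astype(int)   # 1 m wide RH95 bins
    counts = np.bincount(bins)                            # global samples per bin
    weights = 1.0 / np.log(np.maximum(counts[bins], 2))   # guard against log(1) = 0
    return weights / weights.sum()                        # normalize for sampling
```

Such weights can then be fed to a sampler such as torch.utils.data.WeightedRandomSampler.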
After the convolutional layers, we also input a collection of scalar values, designated as “Satellite Metadata” in Fig. 2. This metadata includes the latitude and longitude of each image, the off-nadir view angle of the satellite, the angle between zenith and sun position at capture, and the terrain slope (Mapzen, 2017) of the image footprint. The measured terrain slope is used during training but is set to zero during inference, which allows the model to reduce the systematic error resulting from the bias of GEDI measurements towards higher canopy heights when the beam width straddles large surface gradients (see Section 4.1.3, Appendix B.4).
When training the GEDI model, we only used random 90-degree rotations and random horizontal and vertical flips, since the larger volume of data made additional augmentation less helpful.

Fig. B.19. Correlation between the 95th percentiles of ALS canopy height maps and GEDI RH95 values simulated from the same maps. The 95th percentile is computed within weighted Gaussians with σ = 12.5 m, in order to roughly approximate the GEDI beam width.

B.4. GEDI height and terrain slope correlation

As has been noted in Adam et al. (2020), the GEDI instrument's estimate of canopy height is influenced by the terrain slope. We found evidence of this correlation in our data, and for this reason we chose to set the terrain slope to zero during inference to mitigate this systematic error.


Fig. B.20. Correlation between terrain slope and GEDI RH95 for samples in CA. The dashed line indicates the height of the terrain change within the GEDI beam (GEDI beam radius times the terrain slope). The heatmap is predominantly above this line, indicating that there are no GEDI height estimates which fall below the terrain change within the beam.

Appendix C. Details on different metrics

C.1. Block R2

To compute the block R2 score, we split the ground truth CHM c and the prediction ĉ into 50 × 50 pixel blocks and average their values, leading to a 5 × 5 array, reshaped into 1 × 25 vectors c^(b) and ĉ^(b). Given the average ground truth CHM \bar{c}^{(b)}, i.e. the average of all c^(b) in the test set, the classical R2 score is then computed:

\[
R^2_{\mathrm{block}} = 1 - \frac{\sum_i \left( c_i^{(b)} - \hat{c}_i^{(b)} \right)^2}{\sum_i \left( c_i^{(b)} - \bar{c}^{(b)} \right)^2}. \tag{C.1}
\]
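Eq. (C.1) translates directly into code; the following is a minimal numpy sketch, assuming 250 × 250 pixel tiles so that each tile yields the 5 × 5 block array described above.

```python
import numpy as np

def block_means(x: np.ndarray, block: int = 50) -> np.ndarray:
    """Average non-overlapping block x block (50 x 50 pixel) windows."""
    h, w = x.shape
    return x.reshape(h // block, block, w // block, block).mean(axis=(1, 3)).ravel()

def block_r2(chms, preds, block: int = 50) -> float:
    """Block R2 of Eq. (C.1) over a test set of (CHM, prediction) tiles.

    Each 250 x 250 tile contributes a length-25 vector of block means;
    the reference mean is taken over all block means in the test set.
    """
    c = np.concatenate([block_means(t, block) for t in chms])
    c_hat = np.concatenate([block_means(t, block) for t in preds])
    return 1.0 - np.sum((c - c_hat) ** 2) / np.sum((c - c.mean()) ** 2)
```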

C.2. Mean error (ME)

We compute the mean error, also referred to as bias, as

\[
\mathrm{ME} = \frac{1}{|\mathcal{D}|} \sum_{i=1}^{|\mathcal{D}|} \left( \hat{c}_i - c_i \right), \tag{C.2}
\]

where |D| is the number of pixels in the test set.

C.3. Edge error metric (EE)

We are interested in measuring the sharpness of the CHM while remaining close to the ground truth. Because a blurry prediction can yield the same MAE, block R2, or PSNR as a sharp one, those metrics do not serve this purpose. We therefore established a metric comparing the image gradients of the maps, dubbed the “edge error score”, given by Algorithm 1. Fig. C.21 illustrates how this metric is computed on an example.
Algorithm 1. Edge error metric.
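Algorithm 1 appears as a figure in the published article; the sketch below only illustrates the spirit of the metric as described above, comparing gradient-magnitude edge maps of the prediction and the ground truth. The choice of a Sobel operator and of a normalized mean absolute difference are our assumptions, not necessarily the exact operations of Algorithm 1.

```python
import numpy as np
from scipy import ndimage

def edge_error(chm: np.ndarray, pred: np.ndarray) -> float:
    """Compare image gradients of predicted and ground truth CHMs.

    A blurry prediction has weak edges, so its edge map differs from
    the ground truth edge map even when the MAE is unchanged.
    """
    def edge_map(x):
        gx = ndimage.sobel(x.astype(float), axis=0)
        gy = ndimage.sobel(x.astype(float), axis=1)
        return np.hypot(gx, gy)

    e_true, e_pred = edge_map(chm), edge_map(pred)
    # Normalized mean absolute difference of the two edge maps.
    return float(np.abs(e_true - e_pred).mean() / (e_true.mean() + 1e-9))
```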


Fig. C.21. Illustration of the edge error metric for two results: the ResUNet (trained with an L1 loss) edge error score is 0.66 in this example, while the score of the SSL model is 0.55; both are computed from the difference of the prediction and ground truth edge maps shown in the two rightmost images.

Appendix D. Architecture and training details

Our code uses PyTorch 1.9.0 with CUDA 10.2.


SSL pretraining. We refer the reader to Oquab et al. (2023) for the details of the SSL pretraining phase. We only changed the image normalization parameters: instead of the standard ImageNet parameters, we used the mean and standard deviation of the color intensities computed on our satellite image dataset of 3.5 M images. The unsupervised pretraining of a large model took a little less than three days on two 8-GPU Volta nodes. The large encoder contains 303 million parameters, while the huge one has 606 million.
Decoder training. The training of CHM prediction from SSL features takes 8 h for a large model on 8 GPUs, and 9 h for a huge model. During this step, we kept the weights of the SSL encoder frozen and only trained the DPT model. Our DPT decoder for the SSL model was trained for 140 k steps using a cosine learning rate schedule (from 1 × 10⁻⁸ to 1 × 10⁻⁴) with a linear warmup for the first 12 k iterations. We used a batch size of 16. The decoder model contains 34.2 M parameters.
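The schedule described above can be written as follows; this is a sketch, and the exact shape of the warmup (here linear from the minimum learning rate) is an assumption.

```python
import math

def lr_at_step(step: int, total_steps: int = 140_000, warmup_steps: int = 12_000,
               base_lr: float = 1e-4, min_lr: float = 1e-8) -> float:
    """Cosine learning rate schedule with linear warmup.

    Linearly increases from min_lr to base_lr over the warmup,
    then decays back to min_lr along a half cosine.
    """
    if step < warmup_steps:
        return min_lr + (base_lr - min_lr) * step / warmup_steps
    progress = (step - warmup_steps) / (total_steps - warmup_steps)
    return min_lr + 0.5 * (base_lr - min_lr) * (1 + math.cos(math.pi * progress))
```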
Estimating the carbon footprint of model training. We estimate the carbon footprint of training the ViT huge model using the calculations from Oquab et al. (2023), a Thermal Design Power (TDP) of the V100-32G GPU equal to 250 W, a Power Usage Effectiveness (PUE) of 1.1, a carbon intensity factor of 0.385 kg CO2 per kWh, and a training time of 11 days × 24 h × 64 GPUs = 16,896 GPU hours. The 4647 kWh used to train the model is approximately equivalent to a CO2 footprint of 4647 × 0.385 ≈ 1789 kg, i.e. about 1.8 t of CO2. The training of the ResUNet baseline took 19 h on 8 V100-32G GPUs, or approximately 16.1 kg of CO2. The training of the decoder model took 75 GPU hours, generating about 8 kg of CO2. While the carbon footprint of the ViT huge model, 1.8 t of CO2, was two orders of magnitude larger than that of training a ResUNet, the model training is a one-shot expense, and the inference time (and thus energy use and emissions) of the ViT huge and the ResUNet were within the same order of magnitude.
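For transparency, the arithmetic above reduces to a few lines (all values are taken from the text):

```python
gpu_hours = 11 * 24 * 64          # 16,896 GPU hours
tdp_kw = 0.250                    # V100-32G TDP in kW
pue = 1.1                         # Power Usage Effectiveness
intensity = 0.385                 # kg CO2 per kWh

energy_kwh = gpu_hours * tdp_kw * pue   # ~4,646 kWh
co2_kg = energy_kwh * intensity         # ~1,789 kg, i.e. ~1.8 t of CO2
print(f"{energy_kwh:.0f} kWh, {co2_kg:.0f} kg CO2")
```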
Architecture details. Our encoder is a ViT architecture as introduced by Dosovitskiy et al. (2021b). It treats an image as a set of patches, called tokens, that are first embedded into a feature space and then processed by a cascade of transformer layers to produce updated representations of the tokens. The transformer layers use multi-head self-attention as their fundamental operation: an operation that relates each token to every other token in the image and consequently has a global receptive field. The ViT does not use downsampling operations in its intermediate stages and thus supports fine-grained feature maps even in the deeper layers of the network. For the huge model, the features consist of the outputs of layers 9, 16, 22, and 29. At each of these layers, an 8 × 8 × 1280 feature map and a 1 × 1 × 1280 class output are extracted. In the DPT decoder, the set of tokens from the various stages of the backbone is first reassembled into image-like representations. The decoder then iteratively fuses the feature maps from the different stages and produces the final dense prediction using an application-specific output head. This step involves several residual convolution layers. The code of our backbone, with pre-training on natural images, is available at https://github.com/facebookresearch/dinov2.

Appendix E. Alternate loss function ablation

We compare in Table E.7 the results of models trained with different loss functions (L1, L2, or sigloss), using two sizes of pretraining dataset: one with 3.5 × 10⁶ images (referred to as “3.5 M”) and one with 18 × 10⁶ images (“18 M”). More pretraining data improves the performance of the SSL models. In terms of loss, we did not notice a strong difference between the L2 loss and sigloss, while the L1 results were slightly worse.

Table E.7
CHM prediction accuracy metrics with different loss functions. sl: sigloss; cl: using classification output; linear: using a linear layer instead of DPT. CA Brande results are not displayed to improve readability, but they are included in the average.

                 | Neon test                 | São Paulo                 | Average
                 | MAE   R2     ME     EE    | MAE   R2     ME     EE    | MAE   R2    ME    EE
  3.5 M sl       | 2.8   0.67   −1.2   0.51  | 6.0   0.14   −4.2   0.60  | 3.1   0.56  1.9   0.54
  18 M sl        | 2.9   0.66   −1.3   0.52  | 4.9   0.46   −2.1   0.59  | 2.9   0.64  1.3   0.54
  18 M linear sl | 3.0   0.58   −1.8   0.68  | 7.1   −0.27  −6.7   0.71  | 3.6   0.41  2.8   0.67
  18 M cl sl     | 2.6   0.71   −0.9   0.48  | 4.9   0.47   −1.9   0.55  | 2.7   0.70  1.0   0.51
  18 M cl l1     | 2.5   0.80   0.0    0.51  | 5.2   0.39   −2.6   0.56  | 2.9   0.72  0.7   0.53
  18 M cl l2     | 2.6   0.86   −0.1   0.52  | 5.1   0.43   −1.4   0.55  | 2.8   0.75  0.5   0.51

Appendix F. Residuals with respect to the GEDI RH95

Fig. F.22 displays the difference between the p95 of the block-level model CHM predictions and the measured GEDI RH95 metric, plotted with respect to the GEDI RH95.


Fig. F.22. Residuals of the p95 of CHM predictions relative to GEDI RH95, plotted with respect to the GEDI RH95.

Appendix G. Normalization for inference on aerial imagery

An image normalization step is necessary to improve the SSL inference performance on aerial images when training only on satellite imagery. We perform a classical histogram normalization of the aerial images (i.e., we normalize the RGB channels of the aerial image to the p5–p95 distribution of the satellite image). This makes the color balance much more similar, leading to better performance for the SSL model. The satellite image is taken through much more atmosphere, and we expect it to be less blue on average because of the preferential scattering of shorter wavelengths. Denoting by I the satellite image and A the original aerial image, we first compute, for each color channel i and each image X, the 5th percentile p_5(X)_i and the 95th percentile p_{95}(X)_i. The normalized aerial image is then given by

\[
A_i \leftarrow \left( A_i - p_5(A)_i \right) \cdot \frac{p_{95}(I)_i - p_5(I)_i}{p_{95}(A)_i - p_5(A)_i} + p_5(I)_i .
\]

We only apply this normalization to the SSL model trained on satellite imagery. Applying this normalization to the ResUNet model deteriorated the results.
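A minimal numpy sketch of this per-channel p5–p95 matching, implementing the equation above:

```python
import numpy as np

def normalize_aerial(aerial: np.ndarray, satellite: np.ndarray) -> np.ndarray:
    """Match the p5-p95 range of each aerial RGB channel to the satellite image.

    Both inputs are (H, W, 3) arrays; percentiles are computed per channel.
    """
    out = aerial.astype(float).copy()
    for i in range(3):
        a5, a95 = np.percentile(aerial[..., i], [5, 95])
        i5, i95 = np.percentile(satellite[..., i], [5, 95])
        out[..., i] = (out[..., i] - a5) * (i95 - i5) / (a95 - a5) + i5
    return out
```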

References

Adam, M., Urbazaev, M., Dubois, C., Schmullius, C., 2020. Accuracy assessment of GEDI terrain elevation and canopy height estimates in European temperate forests: influence of environmental and acquisition parameters. Remote Sens. 12. https://doi.org/10.3390/rs12233948.

Astola, H., Seitsonen, L., Halme, E., Molinier, M., Lönnqvist, A., 2021. Deep neural networks with transfer learning for forest variable estimation using Sentinel-2 imagery in boreal forest. Remote Sens. 13. https://doi.org/10.3390/rs13122392.

Azevedo, T., Souza, C., Zanin Shimbo, J., Alencar, A., 2018. MapBiomas Initiative: Mapping Annual Land Cover and Land Use Changes in Brazil from 1985 to 2017.

Bhat, S.F., Alhashim, I., Wonka, P., 2021. AdaBins: depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018.

Brande, K., 2021. 3D fuel structure in relation to prescribed fire, CA 2020. National Center for Airborne Laser Mapping (NCALM). Distributed by OpenTopography. https://doi.org/10.5069/G9C53J18. Accessed: 2023-02-15.

Brandt, M., Tucker, C.J., Kariryaa, A., Rasmussen, K., Abel, C., Small, J., Chave, J., Rasmussen, L.V., Hiernaux, P., Diouf, A.A., Kergoat, L., Mertz, O., Igel, C., Gieseke, F., Schöning, J., Li, S., Melocik, K., Meyer, J., Sinno, S., Romero, E., Glennie, E., Montagu, A., Dendoncker, M., Fensholt, R., 2020. An unexpectedly large count of trees in the West African Sahara and Sahel. Nature 587, 78–82.

Camarretta, N., Harrison, P.A., Bailey, T., Potts, B., Lucieer, A., Davidson, N., Hunt, M., 2020. Monitoring forest structure to guide adaptive management of forest restoration: a review of remote sensing approaches. New For. 51, 573–596. https://doi.org/10.1007/s11056-019-09754-5.

Cook-Patton, S.C., Leavitt, S.M., Gibbs, D., Harris, N.L., Lister, K., Anderson-Teixeira, K.J., Briggs, R.D., Chazdon, R.L., Crowther, T.W., Ellis, P.W., Griscom, H.P., Herrmann, V., Holl, K.D., Houghton, R.A., Larrosa, C., Lomax, G., Lucas, R., Madsen, P., Malhi, Y., Paquette, A., Parker, J.D., Paul, K., Routh, D., Roxburgh, S., Saatchi, S., van den Hoogen, J., Walker, W.S., Wheeler, C.E., Wood, S.A., Xu, L., Griscom, B.W., 2020. Mapping carbon accumulation potential from global natural forest regrowth. Nature 585, 545–550. https://doi.org/10.1038/s41586-020-2686-x.

Csillik, O., Kumar, P., Mascaro, J., O'Shea, T., Asner, G.P., 2019. Monitoring tropical forest carbon stocks and emissions using Planet satellite data. Sci. Rep. 9, 17831. https://doi.org/10.1038/s41598-019-54386-6.

Cuni-Sanchez, A., Sullivan, M.J.P., Platts, P., et al., 2021. High aboveground carbon stock of African tropical montane forests. Nature 596, 536–542. https://doi.org/10.1038/s41586-021-03728-4.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021a. An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. https://doi.org/10.48550/ARXIV.2010.11929.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021b. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929.

Dos-Santos, M., Keller, M., Morton, D., 2019. Lidar Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008–2018. ORNL DAAC, Oak Ridge, Tennessee, USA. https://daac.ornl.gov/CMS/guides/LiDAR_Forest_Inventory_Brazil.html.

Dubayah, R., Blair, J.B., Goetz, S., Fatoyinbo, L., Hansen, M., Healey, S., Hofton, M., Hurtt, G., Kellner, J., Luthcke, S., Armston, J., Tang, H., Duncanson, L., Hancock, S., Jantz, P., Marselis, S., Patterson, P.L., Qi, W., Silva, C., 2020. The Global Ecosystem Dynamics Investigation: high-resolution laser ranging of the Earth's forests and topography. Sci. Remote Sens. 1, 100002. https://doi.org/10.1016/j.srs.2020.100002.

Dubayah, R., Luthcke, S., Sabaka, T., Nicholas, J., Preaux, S., Hofton, M. GEDI L3 Gridded Land Surface Metrics, Version 1. https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1865.

Dubayah, R., Armston, J., Kellner, J., Duncanson, L., Healey, S., Patterson, P., Hancock, S., Tang, H., Bruening, J., Hofton, M., Blair, J., Luthcke, S. GEDI L4A Footprint Level Aboveground Biomass Density, Version 2.1. https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2056. https://doi.org/10.3334/ORNLDAAC/2056.

Duncanson, L., Neuenschwander, A., Hancock, S., Thomas, N., Fatoyinbo, T., Simard, M., Silva, C.A., Armston, J., Luthcke, S.B., Hofton, M., Kellner, J.R., Dubayah, R., 2020. Biomass estimation from simulated GEDI, ICESat-2 and NISAR across environmental gradients in Sonoma County, California. Remote Sens. Environ. 242, 111779. https://doi.org/10.1016/j.rse.2020.111779.

Eigen, D., Puhrsch, C., Fergus, R., 2014. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Proces. Syst. 27.

Fayad, I., Ciais, P., Schwartz, M., Wigneron, J.P., Baghdadi, N., de Truchis, A., d'Aspremont, A., Frappart, F., Saatchi, S., Pellissier-Tanon, A., Bazzi, H., 2023. Vision transformers, a new approach for high-resolution and large-scale mapping of canopy heights. arXiv:2304.11487.

Friedlingstein, P., Jones, M.W., O'Sullivan, M., Andrew, R.M., Hauck, J., Peters, G.P., Peters, W., Pongratz, J., Sitch, S., Le Quéré, C., Bakker, D.C.E., Canadell, J.G., Ciais, P., Jackson, R.B., Anthoni, P., Barbero, L., Bastos, A., Bastrikov, V., Becker, M., Bopp, L., Buitenhuis, E., Chandra, N., Chevallier, F., Chini, L.P., Currie, K.I., Feely, R.A., Gehlen, M., Gilfillan, D., Gkritzalis, T., Goll, D.S., Gruber, N., Gutekunst, S., Harris, I., Haverd, V., Houghton, R.A., Hurtt, G., Ilyina, T., Jain, A.K., Joetzjer, E., Kaplan, J.O., Kato, E., Klein Goldewijk, K., Korsbakken, J.I., Landschützer, P., Lauvset, S.K., Lefèvre, N., Lenton, A., Lienert, S., Lombardozzi, D., Marland, G., McGuire, P.C., Melton, J.R., Metzl, N., Munro, D.R., Nabel, J.E.M.S., Nakaoka, S.I., Neill, C., Omar, A.M., Ono, T., Peregon, A., Pierrot, D., Poulter, B., Rehder, G., Resplandy, L., Robertson, E., Rödenbeck, C., Séférian, R., Schwinger, J., Smith, N., Tans, P.P., Tian, H., Tilbrook, B., Tubiello, F.N., van der Werf, G.R., Wiltshire, A.J., Zaehle, S., 2019. Global carbon budget 2019. Earth System Science Data 11, 1783–1838. https://doi.org/10.5194/essd-11-1783-2019.

Fu, H., Gong, M., Wang, C., Tao, D., 2018. A compromise principle in deep monocular depth estimation. arXiv:1708.08267.

Gibril, M.B.A., Shafri, H.Z.M., Al-Ruzouq, R., Shanableh, A., Nahas, F., Al Mansoori, S., 2023. Large-scale date palm tree segmentation from multiscale UAV-based and aerial images using deep vision transformers. Drones 7. https://doi.org/10.3390/drones7020093.

Hancock, S., Armston, J., Hofton, M., Sun, X., Tang, H., Duncanson, L.I., Kellner, J.R., Dubayah, R., 2019. The GEDI simulator: a large-footprint waveform lidar simulator for calibration and validation of spaceborne missions. Earth Space Sci. 6, 294–310. https://doi.org/10.1029/2018EA000506.

Hansen, M.C., Potapov, P.V., Moore, R., Hancher, M., Turubanova, S.A., Tyukavina, A., Thau, D., Stehman, S.V., Goetz, S.J., Loveland, T.R., Kommareddy, A., Egorov, A., Chini, L., Justice, C.O., Townshend, J.R.G., 2013. High-resolution global maps of 21st-century forest cover change. Science 342, 850–853. https://doi.org/10.1126/science.1244693.

Hansen, M.C., Krylov, A., Tyukavina, A., Potapov, P.V., Turubanova, S., Zutta, B., Ifo, S., Margono, B., Stolle, F., Moore, R., 2016. Humid tropical forest disturbance alerts using Landsat data. Environ. Res. Lett. 11, 034008. https://doi.org/10.1088/1748-9326/11/3/034008.

Harris, N.L., Gibbs, D.A., Baccini, A., Birdsey, R.A., de Bruin, S., Farina, M., Fatoyinbo, L., Hansen, M.C., Herold, M., Houghton, R.A., Potapov, P.V., Suarez, D.R., Roman-Cuesta, R.M., Saatchi, S.S., Slay, C.M., Turubanova, S.A., Tyukavina, A., 2021. Global maps of twenty-first century forest carbon fluxes. Nat. Clim. Chang. 11, 234–240. https://doi.org/10.1038/s41558-020-00976-6.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.

Khosravipour, A., Skidmore, A.K., Isenburg, M., Wang, T., Hussin, Y.A., 2014. Generating pit-free canopy height models from airborne lidar. Photogramm. Eng. Remote. Sens. 80, 863–872. https://doi.org/10.14358/PERS.80.9.863.

Lang, N., Jetz, W., Schindler, K., Wegner, J.D., 2022a. A high-resolution canopy height model of the earth. https://doi.org/10.48550/ARXIV.2204.08322.

Lang, N., Kalischek, N., Armston, J., Schindler, K., Dubayah, R., Wegner, J.D., 2022b. Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles. Remote Sens. Environ. 268, 112760. https://doi.org/10.1016/j.rse.2021.112760.

Li, B., Dai, Y., He, M., 2018. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recogn. 83, 328–339. https://doi.org/10.1016/j.patcog.2018.05.029.

Li, W., Niu, Z., Shang, R., Qin, Y., Wang, L., Chen, H., 2020. High-resolution mapping of forest canopy height using machine learning by coupling ICESat-2 lidar with Sentinel-1, Sentinel-2 and Landsat-8 data. Int. J. Appl. Earth Obs. Geoinf. 92, 102163. https://doi.org/10.1016/j.jag.2020.102163.

Liu, S., Brandt, M., Nord-Larsen, T., Chave, J., Reiner, F., Lang, N., Tong, X., Ciais, P., Igel, C., Li, S., Mugabowindekwe, M., Saatchi, S., Yue, Y., Chen, Z., Fensholt, R., 2023. The overlooked contribution of trees outside forests to tree cover and woody biomass across Europe. https://doi.org/10.21203/rs.3.rs-2573442/v1.

Luo, W., Li, Y., Urtasun, R., Zemel, R., 2016. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Proces. Syst. 29. https://doi.org/10.48550/ARXIV.1701.04128.

Luyssaert, S., Schulze, E.D., Börner, A., Knohl, A., Hessenmöller, D., Law, B.E., Ciais, P., Grace, J., 2008. Old-growth forests as global carbon sinks. Nature 455, 213–215. https://doi.org/10.1038/nature07276.

da Luz, N.B., Garrastazu, M.C., Rosot, M.A.D., Maran, J.C., de Oliveira, Y.M.M., Franciscon, L., Cardoso, D.J., de Freitas, J.V., 2018. Inventário florestal nacional do Brasil - uma abordagem em escala de paisagem para monitorar e avaliar paisagens florestais [Brazilian national forest inventory - a landscape-scale approach to monitoring and assessing forest landscapes]. Pesquisa Florestal Bras. 38. https://doi.org/10.4336/2018.pfb.38e201701493.

Maioli, V., Belharte, S., Stuker Kropf, M., Callado, C.H., 2020. Timber exploitation in colonial Brazil: a historical perspective of the Atlantic forest. Hist. Ambient. Latinoamericana Caribeña (HALAC) Rev. Solcha 10, 46–73. https://doi.org/10.32991/2237-2717.2020v10i2.p74-101.

Mapzen, 2017. Terrain Tiles on AWS. Amazon. https://registry.opendata.aws/terrain-tiles.

Markus, T., Neumann, T., Martino, A., Abdalati, W., Brunt, K., Csatho, B., Farrell, S., Fricker, H., Gardner, A., Harding, D., Jasinski, M., Kwok, R., Magruder, L., Lubin, D., Luthcke, S., Morison, J., Nelson, R., Neuenschwander, A., Palm, S., Popescu, S., Shum, C., Schutz, B.E., Smith, B., Yang, Y., Zwally, J., 2017. The Ice, Cloud, and land Elevation Satellite-2 (ICESat-2): science requirements, concept, and implementation. Remote Sens. Environ. 190, 260–273. https://doi.org/10.1016/j.rse.2016.12.029.

Maxwell, A.E., Warner, T.A., Guillén, L.A., 2021. Accuracy assessment in convolutional neural network-based deep learning remote sensing studies - part 2: recommendations and best practices. Remote Sens. 13. https://doi.org/10.3390/rs13132591.

Miangoleh, S.M.H., Dille, S., Mai, L., Paris, S., Aksoy, Y., 2021. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9685–9694.

Mugabowindekwe, M., Brandt, M., Chave, J., Reiner, F., Skole, D.L., Kariryaa, A., Igel, C., Hiernaux, P., Ciais, P., Mertz, O., et al., 2022. Nation-wide mapping of tree-level aboveground carbon stocks in Rwanda. Nat. Clim. Chang. 1–7.

National Ecological Observatory Network (NEON), 2022. Ecosystem Structure (DP3.30015.001). https://data.neonscience.org/data-products/DP3.30015.001.

Olofsson, P., Foody, G.M., Herold, M., Stehman, S.V., Woodcock, C.E., Wulder, M.A., 2014. Good practices for estimating area and assessing accuracy of land change. Remote Sens. Environ. 148, 42–57. https://doi.org/10.1016/j.rse.2014.02.015.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P., 2023. DINOv2: learning robust visual features without supervision. arXiv:2304.07193.

Popkin, G., 2015. The hunt for the world's missing carbon. Nature 523, 20–22. https://doi.org/10.1038/523020a.

Potapov, P., Li, X., Hernandez-Serna, A., Tyukavina, A., Hansen, M.C., Kommareddy, A., Pickens, A., Turubanova, S., Tang, H., Silva, C.E., Armston, J., Dubayah, R., Blair, J.B., Hofton, M., 2021. Mapping global forest canopy height through integration of GEDI and Landsat data. Remote Sens. Environ. 253, 112165. https://doi.org/10.1016/j.rse.2020.112165.

Ranftl, R., Bochkovskiy, A., Koltun, V., 2021. Vision transformers for dense prediction. In: International Conference on Computer Vision.

Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Candido, S., Uyttendaele, M., Darrell, T., 2022. Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. arXiv preprint arXiv:2212.14532.

Reytar, K., Buckingham, K., Stolle, F., Brandt, J., Zamora-Cristales, R., Landsberg, F., Singh, R., Streck, C., Saint-Laurent, C., Tucker, C., Henry, M., Walji, K., Finegold, Y., Aga, Y., Rezende, M., 2020. Measuring progress in forest and landscape restoration. Unasylva 71, 62.

Ribeiro, M.C., Martensen, A.C., Metzger, J.P., Tabarelli, M., Scarano, F., Fortin, M.J., 2011. The Brazilian Atlantic Forest: A Shrinking Biodiversity Hotspot. Springer, Berlin Heidelberg, Berlin, Heidelberg, pp. 405–434. https://doi.org/10.1007/978-3-642-20992-5_21.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention–MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer, pp. 234–241.

Schacher, A., Roger, E., Williams, K.J., Stenson, M.P., Sparrow, B., Lacey, J., 2023. Use-specific considerations for optimising data quality trade-offs in citizen science: recommendations from a targeted literature review to improve the usability and utility for the calibration and validation of remotely sensed products. Remote Sens. 15. https://doi.org/10.3390/rs15051407.

Schwartz, M., Ciais, P., Ottlé, C., De Truchis, A., Vega, C., Fayad, I., Brandt, M., Fensholt, R., Baghdadi, N., Morneau, F., Morin, D., Guyon, D., Dayau, S., Wigneron, J.P., 2022. High-resolution canopy height map in the Landes forest (France) based on GEDI, Sentinel-1, and Sentinel-2 data with a deep learning approach. arXiv:2212.10265.

Silva, C.A., Duncanson, L., Hancock, S., Neuenschwander, A., Thomas, N., Hofton, M., Fatoyinbo, L., Simard, M., Marshak, C.Z., Armston, J., Lutchke, S., Dubayah, R., 2021. Fusing simulated GEDI, ICESat-2 and NISAR data for regional aboveground biomass mapping. Remote Sens. Environ. 253, 112234. https://doi.org/10.1016/j.rse.2020.112234.

Singh, M., Gustafson, L., Adcock, A., Reis, V.D.F., Gedik, B., Kosaraju, R.P., Mahajan, D., Girshick, R., Dollár, P., van der Maaten, L., 2022. Revisiting weakly supervised pre-training of visual perception models. CVPR.

Sirko, W., Kashubin, S., Ritter, M., Annkah, A., Bouchareb, Y.S.E., Dauphin, Y.N., Keysers, D., Neumann, M., Cissé, M., Quinn, J., 2021. Continental-scale building detection from high resolution satellite imagery. CoRR abs/2107.12283. https://arxiv.org/abs/2107.12283.

Skole, D.L., Samek, J.H., Dieng, M., Mbow, C., 2021. The contribution of trees outside of forests to landscape carbon and climate change mitigation in West Africa. Forests 12. https://doi.org/10.3390/f12121652.

Stehman, S.V., 2014. Estimating area and map accuracy for stratified random sampling when the strata are different from the map classes. Int. J. Remote Sens. 35, 4923–4939. https://doi.org/10.1080/01431161.2014.930207.

Stephenson, N.L., Das, A.J., Condit, R., Russo, S.E., Baker, P.J., Beckman, N.G., Coomes, D.A., Lines, E.R., Morris, W.K., Rüger, N., Álvarez, E., Blundo, C., Bunyavejchewin, S., Chuyong, G., Davies, S.J., Duque, Á., Ewango, C.N., Flores, O., Franklin, J.F., Grau, H.R., Hao, Z., Harmon, M.E., Hubbell, S.P., Kenfack, D., Lin, Y., Makana, J.R., Malizia, A., Malizia, L.R., Pabst, R.J., Pongpattananurak, N., Su, S.H., Sun, I.F., Tan, S., Thomas, D., van Mantgem, P.J., Wang, X., Wiser, S.K., Zavala, M.A., 2014. Rate of tree carbon accumulation increases continuously with tree size. Nature 507, 90–93. https://doi.org/10.1038/nature12914.

Tesfay, F., Moges, Y., Asfaw, Z., 2022. Woody species composition, structure, and carbon stock of coffee-based agroforestry system along an elevation gradient in the moist mid-highlands of southern Ethiopia. Int. J. Forest. Res. 2022, 1–12. https://doi.org/10.1155/2022/4729336.

Vallauri, D., Aronson, J., Dudley, N., Vallejo, R., 2005. Monitoring and Evaluating Forest Restoration Success. Springer, New York, NY, pp. 150–158. https://doi.org/10.1007/0-387-29112-1_21.

Viani, R.A.G., Barreto, T.E., Farah, F.T., Rodrigues, R.R., Brancalion, P.H.S., 2018. Monitoring young tropical forest restoration sites: how much to measure? Trop. Conserv. Sci. 11, 1940082918780916. https://doi.org/10.1177/1940082918780916.

Wagner, F.H., Roberts, S., Ritz, A.L., Carter, G., Dalagnol, R., Favrichon, S., Hirye, M.C., Brandt, M., Ciais, P., Saatchi, S., 2023. Sub-meter tree height mapping of California using aerial images and lidar-informed U-Net model. arXiv:2306.01936.

Wang, W., Tang, C., Wang, X., Zheng, B., 2022. A ViT-based multiscale feature fusion approach for remote sensing image segmentation. IEEE Geosci. Remote Sens. Lett. 19, 1–5. https://doi.org/10.1109/LGRS.2022.3187135.

Weinstein, B.G., Graves, S.J., Marconi, S., Singh, A., Zare, A., Stewart, D., Bohlman, S.A., White, E.P., 2021. A benchmark dataset for canopy crown detection and delineation in co-registered airborne RGB, LiDAR and hyperspectral imagery from the National Ecological Observation Network. PLoS Comput. Biol. 17 (7), e1009180.

Xu, Z., Zhang, W., Zhang, T., Yang, Z., Li, J., 2021. Efficient transformer for remote sensing image segmentation. Remote Sens. 13. https://doi.org/10.3390/rs13183585.

Yanai, R.D., Wayson, C., Lee, D., Espejo, A.B., Campbell, J.L., Green, M.B., Zukswert, J.M., Yoffe, S.B., Aukema, J.E., Lister, A.J., Kirchner, J.W., Gamarra, J.G.P., 2020. Improving uncertainty in forest carbon accounting for REDD+ mitigation efforts. Environ. Res. Lett. 15, 124002. https://doi.org/10.1088/1748-9326/abb96f.

Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual U-Net. CoRR abs/1711.10684. http://arxiv.org/abs/1711.10684.
