Very high resolution canopy height maps from RGB imagery using self-supervised vision transformer and convolutional decoder trained on aerial lidar
Jamie Tolan a, Hung-I Yang a, Benjamin Nosarzewski a, Guillaume Couairon b, Huy V. Vo b, John Brandt c,*, Justine Spore c, Sayantan Majumdar d, Daniel Haziza b, Janaki Vamaraju a, Theo Moutakanni b, Piotr Bojanowski b, Tracy Johns a, Brian White a, Tobias Tiecke a, Camille Couprie b

a Meta, 1 Hacker Way, Menlo Park, CA 94025, USA
b Fundamental AI Research (FAIR), Meta, 1 Hacker Way, Menlo Park, CA 94025, USA
c World Resources Institute, 10 G St NE #800, Washington, DC 20002, USA
d Desert Research Institute, 2215 Raggio Pkwy, Reno, NV 89512, USA
ARTICLE INFO

Edited by Jing M. Chen

Keywords: LIDAR; GEDI; Canopy height; Deep learning; Self-supervised learning; Vision transformers

ABSTRACT

Vegetation structure mapping is critical for understanding the global carbon cycle and monitoring nature-based approaches to climate adaptation and mitigation. Repeated measurements of these data allow for the observation of deforestation or degradation of existing forests, natural forest regeneration, and the implementation of sustainable agricultural practices like agroforestry. Assessments of tree canopy height and crown projected area at a high spatial resolution are also important for monitoring carbon fluxes and assessing tree-based land uses, since forest structures can be highly spatially heterogeneous, especially in agroforestry systems. Very high resolution satellite imagery (less than one meter (1 m) Ground Sample Distance) makes it possible to extract information at the tree level while allowing monitoring at a very large scale. This paper presents the first high-resolution canopy height map concurrently produced for multiple sub-national jurisdictions. Specifically, we produce very high resolution canopy height maps for the states of California and São Paulo, a significant improvement in resolution over the ten meter (10 m) resolution of previous Sentinel / GEDI based worldwide maps of canopy height. The maps are generated by the extraction of features from a self-supervised model trained on Maxar imagery from 2017 to 2020, and the training of a dense prediction decoder against aerial lidar maps. We also introduce a post-processing step using a convolutional network trained on GEDI observations. We evaluate the proposed maps with set-aside validation lidar data as well as by comparing with other remotely sensed maps and field-collected data, and find our model produces an average Mean Absolute Error (MAE) of 2.8 m and Mean Error (ME) of 0.6 m.
* Corresponding author.
E-mail address: [email protected] (J. Brandt).
https://doi.org/10.1016/j.rse.2023.113888
Received 19 April 2023; Received in revised form 24 October 2023; Accepted 25 October 2023
Available online 7 November 2023
0034-4257/© 2023 The Authors. Published by Elsevier Inc. This is an open access article under the CC BY license (http://creativecommons.org/licenses/by/4.0/).
accumulation rates by developing machine learning models based on more than 13,000 locations derived from literature. Cook-Patton et al. (2020) find significant variability in predicted carbon accumulation rates compared to defaults from the Intergovernmental Panel on Climate Change (IPCC) at the ecozone scale. In the African tropical montane forests, Cuni-Sanchez et al. (2021) model forest carbon density based on 72,336 measurements of height and tree diameter, identifying two-thirds higher carbon stocks than the respective IPCC default values.

The uncertainty of biomass modeling also affects the uncertainty of the carbon implications of deforestation and regrowth. Tree-based FLR, including agroforestry, reforestation, natural regeneration, and enrichment planting, is considered to be a cost-effective natural climate solution for adaptation and mitigation. However, evaluating the effectiveness of FLR interventions at a large scale is difficult due to its highly distributed nature, typically being practiced on individual land parcels by respective land owners (Reytar et al., 2020). While carbon reporting frameworks exist for FLR, for example through verified carbon markets, such data are highly project-specific owing to their reliance on intensive manual field measurements. Utilizing remotely sensed data to assess vegetation structure on areas with FLR interventions such as intercropped agroforestry or natural regeneration is difficult due to the presence of multiple species, multiple canopy strata, and trees of different ages (Viani et al., 2018; Vallauri et al., 2005; Camarretta et al., 2020). For instance, Tesfay et al. (2022) found that 70% of the shade trees in an agroforestry system in Ethiopia were below 3 m in height, while 3% were above 12 m in height, with more than a two-order-of-magnitude range of per-tree carbon stocks depending on tree size.

Critical to reducing uncertainty in woody carbon models are measurements of forest height and biomass to improve assessments of the spatial variability of carbon removal rates across forest landscapes that have heterogeneous structure (Harris et al., 2021). Tree height is especially critical to accurately assessing carbon removal rates, as growth rate increases continuously with size (Stephenson et al., 2014). Recent earth observation missions from NASA, namely GEDI and ICESat-2, provide repeated vegetation canopy height maps for the first time. The Global Ecosystem Dynamics Investigation (GEDI) collects canopy height and relative height at a 25 m resolution (Dubayah et al., 2021). ICESat-2 collects canopy height and relative height at a 13 × 100 meter native footprint (Markus et al., 2017). Recently, multi-sensor fusion has demonstrated potential to improve aboveground biomass mapping (Silva et al., 2021). To generate wall-to-wall maps of canopy height, researchers commonly combine active optical LiDAR data from ICESat-2 or GEDI with optical imagery from Sentinel-2 (Lang et al., 2022a; Schwartz et al., 2022) or Landsat satellites (Schwartz et al., 2022; Li et al., 2020).

A number of recent studies have utilized spaceborne lidar data from GEDI and ICESat-2 to produce canopy height maps in combination with multispectral optical imagery. Among them, Potapov et al. (2021) combined GEDI RH95 (95th percentile of Relative Height) data with Landsat data to establish a global map at 30 m resolution, using a bagged regression tree ensemble algorithm. More recently, Lang et al. (2022a) produced a global canopy height map at a 10-m resolution, applying an ensemble of convolutional neural network (CNN) models to Sentinel-2 imagery to predict the GEDI RH98 footprint. Other works have produced regional 10-m CHMs utilizing Sentinel-2 and aerial lidar (Astola et al., 2021; Fayad et al., 2023).

Aerial lidar data has also demonstrated utility as training data for high resolution (< 5 m) and very high resolution (< 1 m) canopy height maps. At a national scale, Csillik et al. (2019) generated biomass maps in Peru by applying gradient boosted regression trees between 3.7 m Planet Dove imagery and airborne lidar, with low uncertainty in dense forests but large amounts of uncertainty in transitional landscapes and areas that are hotspots of land use change. Recently, Liu et al. (2023) computed a canopy height map (CHM) of Europe using 3 m Planet imagery, training two UNets to predict tree extent and CHM using lidar observations and previous CHM predictions from the literature. Utilizing aerial optical imagery, Wagner et al. (2023) generated a submeter CHM of California, USA by training a regression U-Net CNN on 60-cm imagery from the USDA-NAIP program and aerial lidar.

The estimation of canopy height from high resolution optical imagery shares similarities with the computer vision task of monocular depth estimation. Vision transformers, which are a deep learning approach to encoding low-dimensional input into a high dimensional feature space, have established new frontiers in depth estimation compared to convolutional neural networks (Ranftl et al., 2021). While depth estimation models benefit significantly from large receptive fields (Li et al., 2018; Fu et al., 2018; Miangoleh et al., 2021), Luo et al. (2016) demonstrate that the effective receptive fields of CNN models have Gaussian distributions, limiting the ability of CNNs to model long-range spatial dependencies. In contrast to convolutional neural networks (CNNs), which sequentially apply local convolutional operations to enable the modeling of increasingly long-range spatial dependencies, transformers utilize self-attention modules to enable the modeling of global spatial dependencies across the entire image input (Dosovitskiy et al., 2021a).

For dense prediction tasks on high resolution imagery where the context can be sparse, such as ground information in the case of near-closed canopies, the ability of transformers to model global information is promising. Among the applications to aerial imagery, the work of Xu et al. (2021) uses a Swin transformer to classify high-resolution land cover. Finding that a baseline transformer model struggled with edge detection, Xu et al. (2021) utilized a self-supervised edge extraction and enhancement method to improve the definition of class edges. Wang et al. (2022) utilize the vision transformer architecture as a feature encoder, and apply a feature pyramid decoder to the resulting multi-scale feature maps. Gibril et al. (2023) segment individual date palm trees by applying vision transformers to 5- to 30-cm drone-based imagery, finding that the Segformer architecture improves generalizability to different resolution imagery when compared to CNN-based models. More recently, also leveraging vision transformers, Reed et al. (2022) scale the Masked Auto-Encoder approach of He et al. (2022) and apply it to building segmentation.

A major challenge in applying high resolution, airborne lidar data to the generation of wall-to-wall canopy height maps is the relative scarcity of airborne lidar data available to the scientific community. Such scarcity can negatively impact the generalizability of models to unseen geographies, especially data-poor regions where little to no airborne lidar exists (Schacher et al., 2023). In this low-annotation context, Self-Supervised Learning (SSL) is a promising tool to shape more robust features than traditional deep approaches. In particular, the SSL DINOv2 approach of Oquab et al. (2023) recently led to state-of-the-art performances in several computer vision tasks such as image classification, depth prediction, and segmentation. In the context of satellite image analysis, self-supervised learning has been shown to improve the generalizability of building segmentation models in Africa (Sirko et al., 2021). To mitigate the reliance of vision transformers on self-supervised learning, Fayad et al. (2023) utilized knowledge distillation with a U-Net CNN teacher model to generate a 10-m CHM of Ghana using Sentinel-1, Sentinel-2, and aerial lidar.

Understanding the importance of highly spatially explicit vegetation structure mapping to both large-scale carbon modeling and project-specific avoided deforestation and restoration monitoring, the objective of this study is to produce high resolution canopy height maps that are able to scale and generalize to large geographic regions. Our method consists of an image encoder-decoder model, where low spectral dimensional input images are transformed to a high dimensional encoding and subsequently decoded to predict per-pixel canopy height. We employ DINOv2 self-supervised learning to generate universal and generalizable encodings from the input imagery (Oquab et al., 2023), and train a dense vision transformer decoder (Ranftl et al., 2021) to generate canopy height predictions based on aerial lidar data from sites across the USA. To correct a potential bias coming from a geographically limited source of supervision, we finally refine the maps using a convolutional network trained on spaceborne lidar data. We present canopy height maps for the states of São Paulo, Brazil, and California, USA, and provide qualitative and quantitative error analyses of height estimation and the decomposition of height estimates into tree segmentation maps.

2. Data

2.1. Experimental design

This paper presents canopy height maps for São Paulo State, Brazil, and California State, USA. These geographies were chosen due to their prevalence of timber production, presence of old growth forests, mountainous terrains, and high degree of tree biodiversity (Maioli et al., 2020; Luyssaert et al., 2008; Ribeiro et al., 2011). The dataset was generated with a machine learning model utilizing a transformer encoder and convolutional decoder trained with an input composite of approximately 0.59 m GSD Maxar imagery spanning the years 2018 to 2020 and output labels from 1 m GSD aerial lidar. Our data and methods sections are structured as follows. First, we describe the satellite and aerial lidar data used for model training and map generation. Next, we describe the model training specifics, including self-supervised learning and the methods for combining models trained on aerial lidar with models trained on GEDI observations, and the baseline models selected and ablation studies performed. Finally, we present our approach for qualitative and quantitative evaluation of height accuracy and tree segmentation, and discuss the generalization of our model.

2.2. Satellite image data description

Maxar Vivid2 mosaic imagery1 served as input imagery for model training and inference. This dataset provides global coverage by mosaicing together imagery from multiple instruments (WorldView-2 (WV 2), WorldView-3 (WV 3), Quickbird II) and observation dates. By starting with this mosaiced imagery, we leveraged the extensive data selection pipeline from Maxar, resulting in imagery that had less than 2% cloud cover, a global revisit rate predominately (more than 75%) below 36 months (imagery dates from 2017 to 2020 are utilized in this dataset), view angles of less than 30 degrees off nadir, and sun angle of less than 60 degrees from zenith. This imagery consisted of three spectral bands: Red, Green, and Blue (RGB), with approximately a 0.5 m GSD. The imagery was processed in the Web Mercator projection (EPSG:3857) and stored with the Bing tiling scheme.2 Given the high resolution of the original geotiffs, Bing zoom 15 level tiles, with 2048 × 2048 pixels per tile, were used, giving a pixel size of 0.597 m GSD at the equator.

2.3. Satellite image data preparation

2.3.1. Image preparation
For easier training and validation of computer vision models, we extracted small regions from the input satellite imagery. Centered around a given location, a box of fixed ground distance was selected, using a local tangent plane coordinate system. Due to the Web Mercator projection of the image tiles, the extracted images at each position had varying dimensions according to their latitude, which were re-sampled to a fixed number of pixels. We chose a box side length of 152.7 m, which, when re-sampled to 256 × 256 pixel images, provided "thumbnail" images that match the lowest resolution (0.597 m) of the input imagery described in Section 2.2. Using these thumbnail images both for training and inference ensured that the dataset had a constant number of pixels and that the pixel size was the same for all latitudes, preventing potential biases with latitude which may be introduced by variation in pixel size.

2.3.2. Dataset for self-supervised learning
For training the self-supervised encoder, we randomly sampled 18 million 256 × 256 pixel satellite thumbnail images. No labels were used for the SSL stage.

2.3.3. Validation segmentation dataset
We also manually annotated a random selection of 9000 Maxar thumbnail images for segmentation testing. A binary tree / no tree label was applied by human annotators. Pixels estimated to have a canopy height above one meter (1 m) and a canopy diameter of more than three meters (3 m) were labeled as tree.

2.4. Supervised dataset

We gathered approximately 5800 canopy height maps (CHM), selected from the National Ecological Observatory Network (NEON) (2022). Each CHM typically consisted of 1 km × 1 km geotiffs, with a pixel size of one meter (1 m) GSD, in local UTM coordinates. We selected the sites used by Weinstein et al. (2021) and additionally manually filtered for sites that have CHM imagery that was well registered and mostly free from mosaicing artifacts. Additionally, we selected sites with imagery acquired less than two years from the observation date in the associated Maxar satellite imagery. A complete list of NEON sites used for training and validation is contained in Appendix A.

The CHM geotiffs were reprojected to a local tangent plane coordinate system and resized to match the resolution of Maxar images. For each ALS CHM, a corresponding RGB satellite image was linked, and these pairs of imagery served as the training data for our decoder model. The 5800 images in the NEON ALS dataset were split into sets of 80% training images, 10% calibration images, and 10% set-aside validation images. During the training, validation and testing phases, we sampled 256 × 256 random crops from the RGB - ALS image pairs. Model training was conducted over epochs sampled from the training dataset. At the completion of each epoch, metrics were computed from a 10% calibration dataset to calibrate the hyperparameters of the model training process. The calibration dataset was drawn from the same set of sites as the training datasets, but from separate 1 km × 1 km geotiffs to ensure non-overlapping pixels.

We constructed a set-aside validation dataset from a subset of sites in our NEON dataset, which we call "NEON test". None of the sites used in the validation dataset were contained in the training or calibration dataset. A list of NEON sites in the validation set appears in Appendix A. We also prepared two validation datasets from other publicly available ALS lidar datasets, outside of the NEON collection. These datasets covered different geographic locations and ecosystems: "CA-Brande" (Brande, 2021) covered a coastal ecosystem in CA, and "São Paulo" (Dos-Santos et al., 2019) covered a region in the Brazilian São Paulo State. See Fig. A.18 for a visual breakdown of the NEON dataset splits. Where these datasets were available as CHMs, we directly used the supplied CHMs. However, for the São Paulo datasets, which only contained point cloud datasets, we processed CHMs following the pit-free algorithm (Khosravipour et al., 2014). The pit-free algorithm was also adopted by the NEON team for generating their CHM product, and we found that different input parameters to the pit-free algorithm had negligible impact on the CHM output.

2.5. Data augmentation

The 256 × 256 pixel thumbnail images of RGB and CHM imagery were augmented at training time, with random 90 degree rotations, brightness, and contrast jittering. We found that these augmentations improved model prediction stability across the various Maxar observations in the input dataset.

1 https://resources.maxar.com/data-sheets/imagery-basemaps-data-sheet.
2 https://learn.microsoft.com/en-us/bingmaps/articles/bing-maps-tile-system.
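As a quick arithmetic check of the tile geometry above, the quoted 0.597 m GSD and the ~152.7 m thumbnail box length follow directly from the Web Mercator tiling; the short sketch below is our own illustration (not code from the paper) and reproduces both numbers to within rounding.

```python
import math

# Ground sample distance of a Bing zoom-15 tile rendered at 2048 x 2048 pixels
# (Section 2.2): the Web Mercator world spans one equatorial circumference.
earth_circumference = 2 * math.pi * 6378137.0          # WGS84 equator, meters
gsd_equator = earth_circumference / (2 ** 15 * 2048)   # ~0.597 m per pixel

# A 256-pixel thumbnail at this GSD covers ~152.9 m, matching the 152.7 m
# box side of Section 2.3.1 up to rounding of the GSD.
print(f"GSD at equator: {gsd_equator:.3f} m")          # 0.597
print(f"256 px box side: {256 * gsd_equator:.1f} m")   # 152.9
```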
Fig. 1. Overview of our approach for generating ALS-based CHMs. During the first stage, we employed the self-supervised learning approach of Oquab et al. (2023) on 18 million 256 × 256 satellite images, leading to a set of four spatial feature maps and four feature vectors, extracted at different layers of the Vision Transformer (ViT) model. In the second phase, we trained a convolutional DPT decoder to predict CHMs.
3. Model and data generation methods

Our goal was to create a model that produces high resolution canopy height maps and generalizes across large geographic scales. To accomplish that goal, we leveraged the relative strengths of two types of lidar data. Aerial lidar provides high resolution canopy height estimation, but lacks global spatial coverage. In comparison, GEDI has nearly global coverage of transects, but its beam width of approximately 25 m does not allow for the identification of individual trees.

After self-supervised pre-training on satellite images globally, our high-resolution ALS CHM prediction model was trained on images from the NEON dataset, as detailed in Section 3.2 and Fig. 1. As the NEON dataset only has spatial coverage from sites within the United States, we expect this ALS CHM model to perform well on ecosystems similar to the training set. To improve generalization to other ecosystems and locations, a low resolution CHM model was independently trained on global GEDI data (Section 3.3). The GEDI model was used to compute a rescaling factor map (Section 3.4), which adjusted the predictions made by the ALS CHM model.

3.1. Self supervised learning

Following the recent success of self-supervised learning on dense prediction tasks from Oquab et al. (2023), we employed a self-supervised learning step on 18 million globally distributed, randomly sampled 256 × 256 pixel Maxar satellite images to obtain an image encoder delivering features specialized to vegetation images. In the training phase, different views of the image were fed to two versions of the encoder: a teacher model receiving global crops, and a student model receiving local and global views where part of the crops were masked (replaced by zero values). We employ a huge ViT architecture, where the inputs are decomposed into 16 × 16 patches. The two networks were trained jointly to output similar feature representations. The procedure is illustrated as Phase 1 in Fig. 1. In a second phase, described in Section 3.2, we freeze the SSL encoder layers using the weights of the teacher model and train the decoder with ALS data to generate high-resolution canopy height maps.

3.2. High resolution canopy height estimation using ALS

We used the reference dataset described in Section 2.4, prepared following the methods described in Section 2.3.1. The output of the ALS model was a raster of predicted canopy heights at the same resolution as the input imagery. For training the supervised decoder, we used the ALS CHM data described in Section 2.4 to create a connection between the SSL features and the full resolution canopy height image. In this second phase, we trained the decoder introduced in the Dense Prediction Transformer (DPT) (Ranftl et al., 2021) on top of the obtained features. This approach is described in Fig. 1, phase 2. The DPT paper describes a full model composed of a transformer encoder extracting features at different layers; in the decoder, each output is reassembled and all outputs are fused. In our second phase of ALS training, we replaced the transformer of DPT with our own SSL encoder, and trained the DPT decoder part only, from scratch. Our best results were obtained by freezing all layers of the SSL encoder. We employed a one-cycle learning rate schedule with a linear warmup in the encoder training stage and a "Sigloss" loss function. Further architecture and training details are provided in Appendix D.

Sigloss function. We take advantage of the similarity of canopy height mapping to the task of depth estimation and borrow the loss from Eigen et al. (2014). Given a true canopy height map c and our prediction ĉ, the Sigloss is given by
Fig. 2. Overview of our methodology to generate predicted RH95 values using GEDI measurements across the globe. Terrain is used only during training and set to zero during inference.
Fig. 3. Post-processing step using GEDI predictions during inference. We used the GEDI model to correct our CHM predictions by computing a dense scaling factor and multiplying it pointwise with the CHM prediction map.
L = α √( (1/T) Σᵢ δᵢ² − (λ/T²) (Σᵢ δᵢ)² ),   (1)

where δᵢ = log(ĉᵢ) − log(cᵢ), and T is the number of pixels with valid ground truth values. As in previous works, we fix λ = 0.85 and α = 10.
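For concreteness, a minimal PyTorch sketch of Eq. (1) follows; the validity mask and the numerical floor inside the logarithms are our assumptions, since the text does not spell out how zero-height pixels are handled.

```python
import torch

def sigloss(pred, target, lam=0.85, alpha=10.0, eps=1e-6):
    """Scale-invariant log loss of Eq. (1), after Eigen et al. (2014)."""
    valid = target > 0                   # T = pixels with valid ground truth
    delta = torch.log(pred[valid].clamp_min(eps)) - torch.log(target[valid])
    t = delta.numel()
    return alpha * torch.sqrt((delta ** 2).sum() / t
                              - lam * (delta.sum() ** 2) / t ** 2)
```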
Classification output. To avoid a bias towards small predicted values, we implemented a classification step first, combined with the Sigloss defined above. The strategy is described by Bhat et al. (2021) as the uniform strategy. Specifically, we modified the output of our decoder to return, instead of one scalar per pixel, a range of B bins. After a normalization of the predictions, we computed the scalar product between the obtained histogram of predicted bins and a linear vector ranging over [0, B], with B set to 256.
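A sketch of this output head, following our reading of the uniform strategy of Bhat et al. (2021); the exact placement of the bin centers is an assumption:

```python
import torch
import torch.nn.functional as F

def bins_to_height(logits):
    """Decoder emits B=256 logits per pixel; a softmax yields a histogram whose
    scalar product with a linear vector spanning [0, B] gives one height per
    pixel."""
    n, b, h, w = logits.shape
    probs = F.softmax(logits, dim=1)                         # (N, B, H, W)
    centers = torch.linspace(0.0, float(b), b, device=logits.device)
    return torch.einsum("nbhw,b->nhw", probs, centers)       # (N, H, W)
```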
3.3. Large scale canopy height estimation using GEDI prediction model

To mitigate the effect of the limited geographic distribution of available ALS data, we employed a second regression network trained on GEDI data to rescale the ALS network outputs. The GEDI prediction model was a simple convolutional network, containing five convolutional layers, followed by five fully connected layers. The inputs to the model were 128 × 128 pixel Maxar images containing three RGB bands, in topocentric coordinates, processed as described in Section 2.3.1. The ground truth data consisted of 13 million GEDI measurements, which were randomly sampled from the full GEDI dataset described in Appendix B.1. We trained the GEDI model to output a single scalar value for a 128 × 128 pixel image patch, with an L1 loss on a regression task against the RH95 value from the GEDI instrument. The training details are specified in Appendix B.3.

3.4. Combining ALS and GEDI model outputs

In this section, we describe the process of connecting our GEDI model outputs (Section 3.3) with ALS model outputs (Section 3.2). Conceptually, the ALS model output provides high resolution canopy estimates but lacks the global context to correctly estimate the absolute height of vegetation in different ecosystems. Conversely, the GEDI model is trained on a global dataset and contains position and metadata inputs (Fig. 2). A schematic of the process is shown in Fig. 3.

Correlation between different lidar sources. The first step in making the GEDI/ALS connection is understanding the relationship between the two sets of lidar data: ALS CHM (Section 2.4) and GEDI lidar (Appendix B.1). These two datasets make measurements of fundamentally different properties of canopy structure. GEDI measures the relative height of canopy based on the full waveform measurement of the return energy from 25 m diameter beam footprints, while aerial lidar constructs higher resolution point clouds. To connect these two, we ran simulations
Fig. 4. Canopy Height Map (CHM) for the state of California, with an inset showing a zoomed-in region and the input RGB imagery.
with the GEDI simulator from Hancock et al. (2019) on the NEON ALS point clouds. We found that there was a strong correlation (R² = 0.88) between the 95th percentile of ALS canopy height maps and the simulated GEDI RH95 (see Appendix B.2).

GEDI based correction of ALS trained maps. We used this correlation to scale the ALS model canopy height maps by computing a scalar multiplier that matches percentiles of the CHM map with the GEDI model predicted value for GEDI RH95. This process works as follows: given an input RGB image x, we combined the outputs of the ALS and GEDI models by computing a dense correction factor γ(x), so that the novel prediction C′(x) was related to the ALS model CHM C(x):

C′(x) = γ(x) ⊙ C(x),   (2)

where

γ(x) = (1 + s_σ(G(x))) / (1 + s_σ(Q(x)₉₅)).   (3)

Here G(x) is the output CHM of our GEDI model and Q(x)₉₅ is a per-block upsampled 95th percentile of the ALS model CHM in meters, computed over the exact same 128 × 128 pixel input regions as the input to the GEDI model in G(x). More specifically, each input image was divided into four crops, each one independently fed to the height prediction model, leading to four scalars, which were concatenated and upsampled. From the CHM map predicted by our ALS model, we computed four percentiles from the same crops, concatenated and upsampled in the same way.

We used the ratio in Eq. (3) rather than G(x)/Q(x)₉₅ to down-weight noisy model estimates near zero canopy height. Since G(x) and Q(x)₉₅ are lower resolution than C(x), the correction factor map was upsampled to match the resolution of the ALS CHM, C(x). The ALS and GEDI maps were smoothed with a 20 pixel sigma Gaussian kernel s_σ to prevent sharp transitions, and the correction factor was clipped between 0.5 and 2 to avoid drastic rescaling.
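The correction of Eqs. (2)-(3) can be sketched as follows (NumPy/SciPy; the crop layout, interpolation order, and smoothing details are our assumptions based on the description above):

```python
import numpy as np
from scipy.ndimage import gaussian_filter, zoom

def apply_gedi_correction(chm, gedi_pred, sigma=20.0):
    """chm: 256x256 ALS-model CHM; gedi_pred: 2x2 array of GEDI-model RH95
    scalars, one per 128x128 crop of the same image."""
    # Q(x)_95: 95th percentile of the ALS CHM over the same four crops.
    q95 = np.array([[np.percentile(chm[128*i:128*(i+1), 128*j:128*(j+1)], 95)
                     for j in range(2)] for i in range(2)])
    g = gaussian_filter(zoom(gedi_pred, 128, order=1), sigma)  # s_sigma(G(x))
    q = gaussian_filter(zoom(q95, 128, order=1), sigma)        # s_sigma(Q(x)_95)
    gamma = np.clip((1.0 + g) / (1.0 + q), 0.5, 2.0)           # Eq. (3) + clip
    return gamma * chm                                         # Eq. (2)
```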
3.5. Baselines

3.5.1. ResUNet-based approach
We utilized a ResUNet-18 architecture for our baseline (Zhang et al., 2017), which is an encoder-decoder architecture predicting an N × N canopy height map from a 3 × N × N RGB image, with N = 256. The baseline model was trained with the Sigloss between the predicted and ground truth CHMs. We also experimented with a classification output; however, we did not obtain improvements from this approach.

3.5.2. Supervised transformer-based approach
To assess the benefit of the self-supervised training phase on satellite data, we consider a baseline given by the state-of-the-art vision SWAG encoder (Singh et al., 2022). We used the large version of this Vision Transformer (ViT), which was trained to perform hashtag prediction from Instagram images. At the time of writing this manuscript, this model was among the top ten models with the highest accuracy on the ImageNet, CUB, Places, and iNaturalist datasets, providing a guarantee of feature quality. This model contains the same number of parameters as our SSL encoder, allowing a fair comparison in terms of model size.

3.6. Data validation

We evaluated the model performance against a variety of metrics, which we divided into two broad classes: (1) metrics which primarily evaluated the accuracy of canopy height maps, which we call canopy height metrics (Section 4.1), and (2) metrics which primarily evaluated the accuracy of image segmentation into tree or no tree pixels, which we call segmentation metrics (Section 4.2). The set-aside validation dataset of ALS canopy height maps described in Section 2 served as the primary dataset for all types of metrics. For the segmentation metrics, we also evaluated the model predictions against a dataset of human-annotated labels independently labeled by photo-interpretation of Maxar imagery (Section 4.2.1).

Fig. 5. Canopy Height Map (CHM) for the state of São Paulo, with an inset showing a zoomed-in region and the input RGB imagery.

Fig. 6. Selected sample regions from the canopy height predictions (log scale), overlaid on the input Maxar imagery (RGB). Canopy height predictions below 0.1 m are transparent. The top row corresponds to regions in California and the bottom row to São Paulo.
Fig. 7. Comparison of our CHM (second column) with that of Lang et al. (2022a) (third column) and Potapov et al. (2021) (fourth column).
Table 1
Comparison of results with SSL pre-training on different datasets and with other supervised strategies (ResUNet, SWAG). IN: ImageNet. Sat: dataset described in Section 2.3.2. IG: Instagram dataset. R: DPT decoder with a regression (scalar) output. C: DPT decoder with a classification (256 bins) output. ViT L: large, H: huge. Note that the results are non GEDI corrected in this table, and all models were trained with a Sigloss. We later denote the model in the last line as the "SSL" model.

Model | Pre-training | NEON test set (MAE / R² / ME) | São Paulo (MAE / R² / ME) | CA Brande (MAE / R² / ME)
ResUNet RN18 | IN1k | 3.1 / 0.63 / 0.0 | 5.2 / 0.42 / −2.2 | 0.6 / 0.74 / −0.1
SWAG C ViT L | IG | 3.0 / 0.63 / −1.6 | 5.8 / 0.16 / −4.3 | 0.7 / 0.56 / −0.6
DINOv2 R ViT L | IN1k | 3.4 / 0.52 / −1.4 | 6.8 / −0.20 / −5.2 | 0.6 / 0.67 / −0.4
DINOv2 R ViT H | IN22k+ | 3.0 / 0.62 / −1.4 | 5.7 / 0.27 / −2.9 | 0.6 / 0.62 / −0.4
DINOv2 R ViT L | Sat 3.5 M | 2.8 / 0.67 / −1.2 | 6.0 / 0.14 / −4.2 | 0.6 / 0.70 / −0.5
DINOv2 R ViT L | Sat 18 M | 2.9 / 0.66 / −1.4 | 4.9 / 0.46 / −1.9 | 0.6 / 0.68 / −0.5
DINOv2 C ViT L | Sat 18 M | 2.7 / 0.70 / −0.9 | 5.0 / 0.46 / −2.1 | 0.6 / 0.80 / −0.3
DINOv2 C ViT H | Sat 18 M | 2.6 / 0.70 / −0.9 | 5.2 / 0.39 / 0.4 | 0.6 / 0.81 / −0.1
4.1. Canopy height metrics

We compared the predicted canopy height maps with aerial lidar data in terms of mean absolute error (MAE), Root Mean Squared Error (RMSE), and R²-block (R²). The R²-block score is the coefficient of determination, which we computed on cropped images with a resolution of approximately 30 × 30 m (see Appendix C.1).

4.1.1. Canopy height metrics for ALS models
We present in Table 1 an ablation study of different pre-training data, model size and output on the NEON and São Paulo test sets. From this ablation study, we selected the SSL model trained on 18 million images utilizing the classification output, which achieved the highest canopy height accuracy metrics. We also trained a huge model instead of a large one, which significantly reduced the bias of the predictions on the São Paulo dataset. We refer to this model as the SSL model throughout the paper. Table 1 suggests that pre-training on satellite images gives better results compared to pre-training on ImageNet. Compared to the ViTs that are pre-trained on ImageNet, including the SWAG approach, the ResUNet remains the strongest baseline. The SSL model clearly outperforms the ResUNet on NEON, reducing the MAE from 3.1 to 2.6 m, also improving results on CA-Brande, and leading to similar results on São Paulo, with a slightly worse R² but a much lower ME. We also experimented with different loss functions, and a smaller dataset for self-supervised pre-training. We found that training on more data led to much better results in São Paulo. Comparing L1, L2 and Sigloss, we found that Sigloss and L2 led to the best results. Additional discussion of these trials can be found in Appendix E.

3 https://registry.opendata.aws/dataforgood-fb-forests/.
4 https://wri-datalab.earthengine.app/view/submeter-canopyheight.
Table 2
Canopy height metrics to assess the GEDI correction step. R² corresponds to ∼30 × 30 meter block R². "Average" is the unweighted average across datasets.

Model | NEON test (MAE / RMSE / R²) | CA-Brande (MAE / RMSE / R²) | São Paulo (MAE / RMSE / R²) | Average (MAE / RMSE / R²)
ResUNet | 3.1 / 4.9 / 0.63 | 0.6 / 1.6 / 0.75 | 5.2 / 7.4 / 0.42 | 3.0 / 4.6 / 0.60
ResUNet + GEDI | 3.0 / 4.8 / 0.64 | 0.6 / 1.6 / 0.74 | 5.4 / 7.7 / 0.35 | 3.0 / 4.7 / 0.58
SSL | 2.6 / 4.4 / 0.70 | 0.6 / 1.4 / 0.82 | 5.2 / 7.5 / 0.39 | 2.8 / 4.5 / 0.64
SSL + GEDI | 2.7 / 4.5 / 0.69 | 0.6 / 1.5 / 0.80 | 5.1 / 7.3 / 0.41 | 2.8 / 4.4 / 0.63
Fig. 8. Block (∼ 30m × 30m) aggregated SSL + GEDI model predictions compared to ALS ground truth measurements for different set-aside validation datasets.
Colormap density is normalized to the 99.6th percentile of the heatmaps.
Fig. 9. Global model evaluation on held-out GEDI data. (a) p95 of block (76 m × 76 m) model CHM predictions compared to the measured GEDI RH95 metrics. (b) Left: difference between the p95 of block model CHM predictions and the measured GEDI RH95 metrics with respect to model CHM predictions. Negative values indicate that the model estimates are lower than the GEDI RH95 values. Residuals as a function of RH95 appear in Appendix F. Right: CHM p95 as a function of RH95.
4.1.2. Canopy height metrics for ALS + GEDI models
Table 2 presents a quantitative validation of the best performing models, namely the ResUNet and the self-supervised model (SSL), combined with the GEDI correction step (ResUNet + GEDI, SSL + GEDI). We note the improved performance of the SSL model compared to the ResUNet in the NEON test and CA-Brande datasets. Although the SSL model performed the best across the datasets in the USA (NEON test and CA-Brande), it performed worse than the ResUNet and ResUNet + GEDI for São Paulo, possibly due to the large domain shift in ecosystems. In the case of São Paulo, we found that the inclusion of GEDI ("SSL + GEDI") produced the best results, possibly indicating better generalization by including the globally trained GEDI model, which also includes additional metadata such as geographic position (Fig. 2).

Fig. 8 shows 2D-histograms of the SSL + GEDI model predictions vs the set-aside validation ALS-derived canopy height averaged over ∼30 m blocks and the corresponding pixel MAE and block-R² scores.

Table 3
R² between predicted CHM p95 and GEDI RH95 by geographic subregion for 20,000 test GEDI observations for models with and without the GEDI calibration model.

Subregion | SSL (+GEDI / −GEDI) | ResUNet (+GEDI / −GEDI) | SWAG (+GEDI / −GEDI)
Central Asia | 0.22 / 0.19 | 0.25 / 0.23 | 0.23 / 0.17
Eastern Asia | 0.50 / 0.44 | 0.47 / 0.42 | 0.43 / 0.38
Eastern Europe | 0.70 / 0.66 | 0.67 / 0.61 | 0.67 / 0.63
Latin America + Caribbean | 0.68 / 0.64 | 0.65 / 0.56 | 0.64 / 0.56
Melanesia | 0.52 / 0.48 | 0.51 / 0.41 | 0.44 / 0.45
Northern Africa | 0.12 / 0.11 | 0.10 / 0.06 | 0.06 / 0.05
Northern America | 0.73 / 0.69 | 0.70 / 0.65 | 0.69 / 0.64
Northern Europe | 0.54 / 0.46 | 0.41 / 0.30 | 0.33 / 0.33
Oceania | 0.68 / 0.63 | 0.66 / 0.58 | 0.61 / 0.54
South East Asia | 0.46 / 0.36 | 0.45 / 0.34 | 0.44 / 0.32
Southern Asia | 0.52 / 0.49 | 0.52 / 0.48 | 0.47 / 0.42
Southern Europe | 0.46 / 0.47 | 0.42 / 0.37 | 0.46 / 0.40
Sub-Saharan Africa | 0.68 / 0.66 | 0.58 / 0.50 | 0.64 / 0.59
Western Asia | 0.53 / 0.49 | 0.53 / 0.47 | 0.47 / 0.42
Western Europe | 0.64 / 0.59 | 0.64 / 0.55 | 0.58 / 0.50
Overall | 0.61 / 0.52 | 0.59 / 0.44 | 0.54 / 0.37

4.1.4. Correlation with field data
To measure the agreement between our computed CHMs and field-collected tree height data, we utilize the Brazilian National Forest Inventory (NFI) data, which consists of systematic field plot inventories of tree count and height (da Luz et al., 2018). Because the NFI data for São Paulo was not yet available, we additionally generate a CHM of the nearby Espirito Santo state and evaluate its agreement with the NFI data for Espirito Santo. The NFI data analyzed encompassed 1450 10 × 10 m subplots within 87 plots positioned within a 20 × 20 km grid in Espirito Santo. The field data was collected primarily in November and December 2014, and includes the height of each tree within each subplot having a diameter at breast height (DBH) of at least 10 cm. Of the 1450 initial plots considered, we removed 291 that had tree cover loss since 2014 in the dataset of Hansen et al. (2016). Fig. 10 visualizes box plots of the 95th percentile CHM by reference NFI height bins. The overall ME was 0.72 m while the RMSE was 4.25 m, with a slight positive bias for trees ≤15 m (ME = 1.10 m, RMSE = 4.28 m), and a negative bias for trees >15 m (ME = −1.00 m, RMSE = 3.79 m).

Fig. 10. CHM error compared to reference tree height as indicated in the Brazilian National Forest Inventory for Espirito Santo.
Table 4
Segmentation metrics. U/P corresponds to pixel user's / producer's accuracy of the tree class; IOU to the average of tree & no tree IOU class scores; EE: edge error.

Model | NEON test (U/P, IOU) | CA-Brande (U/P, IOU) | São Paulo (U/P, IOU) | Average (U/P, IOU) | EE
ResUNet | 0.74/0.75, 0.58 | 0.72/0.64, 0.70 | 0.91/0.85, 0.67 | 0.79/0.75, 0.65 | 0.50
ResUNet + GEDI | 0.77/0.68, 0.53 | 0.73/0.52, 0.68 | 0.91/0.84, 0.65 | 0.80/0.68, 0.62 | 0.52
SSL | 0.81/0.76, 0.65 | 0.71/0.75, 0.76 | 0.90/0.88, 0.67 | 0.82/0.81, 0.68 | 0.50
SSL + GEDI | 0.82/0.71, 0.59 | 0.74/0.60, 0.74 | 0.91/0.86, 0.66 | 0.83/0.76, 0.66 | 0.49
Table 5
Segmentation metrics on the global, human annotated dataset. U/P corresponds to pixel user's / producer's accuracy; IOU to the average of tree & no tree IOU scores. Since the GEDI correction only adjusts large scale height percentiles, the "+GEDI" rows show only small improvement over the base ALS models.

Model | U/P | IOU
ResUNet | 0.89/0.86 | 0.75
ResUNet + GEDI | 0.90/0.86 | 0.74
SSL | 0.83/0.87 | 0.77
SSL + GEDI | 0.82/0.88 | 0.77

4.2. Segmentation metrics

In addition to the canopy height metrics discussed in Section 4.1, we compute a number of metrics that reflect the ability of the model to correctly assign individual pixels as trees. CHMs were converted into binary masks by thresholding height values of at least five meters (5 m) as tree canopy extent. Table 4 shows the pixel user's and producer's accuracy values (also known as precision and recall, respectively) for pixels labeled as trees. Table 4 also shows the Intersection Over Union (IOU) for the binary masks, which was calculated as the average of IOU
Fig. 11. Tree segmentation predictions from the SSL + GEDI model vs human annotated ground truth. Binary prediction masks were created from the CHM by
thresholding at 1 m. U/P corresponds to pixel user's / producer's accuracy of the tree class. The IOU represents the Intersection-Over-Union score for the tree class.
Fig. 12. Pixelwise user's accuracy (UA) and producer's accuracy (PA) for 8903 validation plots stratified by geographic sub-region. Error bars represent the 80, 90,
and 95% confidence intervals as derived from 10,000 bootstrap iterations. Numbers in the x-axis tick labels denote sample size.
Fig. 13. Qualitative comparison between different models for example imagery. Left: Input Maxar “thumbnail” image, 256 × 256 pixels, in local tangent plane
coordinate system. Second from left: ALS CHM image, in same projection and pixelization. Right columns: Model CHMs.
for pixels labeled as tree and the IOU for pixels labeled as ground.

Additionally, we introduce an Edge Error (EE) metric that computes the ratio of the sum of the absolute difference between edges from predicted and ground truth CHM, normalized by the sum of detected edges in both maps. Scores range between 0 and 1, where lower scores indicate improved accuracy along patch edges. In Table 4, the edge error is computed over all set-aside validation datasets. We detail the formula with a figure illustrating the behavior of this metric in Appendix C.3.
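The pixel metrics of this section can be written compactly; the sketch below is our own illustration, using the 5 m threshold from the text, and computes user's/producer's accuracy together with the mean of the tree and no-tree IOU from a pair of CHMs:

```python
import numpy as np

def segmentation_metrics(pred_chm, gt_chm, thresh=5.0):
    """User's/producer's accuracy (precision/recall) of the tree class and the
    average of tree / ground IOU, from CHMs thresholded at `thresh` meters."""
    pred, gt = pred_chm >= thresh, gt_chm >= thresh   # binary tree masks
    tp = np.sum(pred & gt)
    users = tp / max(np.sum(pred), 1)                 # precision on tree pixels
    producers = tp / max(np.sum(gt), 1)               # recall on tree pixels
    iou_tree = tp / max(np.sum(pred | gt), 1)
    tn = np.sum(~(pred | gt))                         # ground in both masks
    iou_ground = tn / max(np.sum(~(pred & gt)), 1)    # union of ground masks
    return users, producers, 0.5 * (iou_tree + iou_ground)
```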
4.2.1. Tree detection metrics evaluated against human annotated validation data

To assess the ability of the model to generalize to new geographies, we compiled human-annotated validation labels for tree detection (binary classification of tree vs no-tree) across 8,903 Maxar thumbnail images. Human annotators were instructed to label any trees above one meter (1 m) tall and with a canopy diameter of more than three meters (3 m). Annotators were to include standing dead trees and tree-like shrubs, but exclude any grasslands, row crops, lawns, or otherwise vegetative ground cover whose peak height was estimated to be less than 1 m from the ground surface. To create the model binary masks for the annotated dataset, we thresholded the model CHM at 1 m.

The geographic locations for the images in the dataset correspond to randomly sampled GEDI measurement footprints from our global set-aside validation set where the GEDI measurement had RH95 greater than 3 m, a constraint we enforced to bias the dataset towards vegetated areas. The data is independent of the aerial lidar measurements used to train the model. Over the entire dataset, the user's and producer's accuracy was 0.88 ± 0.006 and 0.82 ± 0.008, while the IOU was 0.77 ± 0.006, indicating good agreement with the human annotations, cf. Table 5. Fig. 11 shows examples of model predictions and their corresponding annotations.

We additionally calculated user's and producer's accuracy by geographic subregion according to the United Nations geoscheme. Bootstrapping with 10,000 iterations was used to calculate uncertainty for tree segmentation accuracy metrics, rather than the methods of Stehman (2014), because a cluster sampling approach was used to generate validation data (Olofsson et al., 2014; Mugabowindekwe et al., 2022; Maxwell et al., 2021). This validation analysis indicated strong generalizability across different geographic regions, without significantly different accuracy metrics in geographic regions where we had training data and where we did not (Fig. 12). This suggests that the use of self-supervised learning on global images facilitated the creation of a generalizable segmentation network.
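A sketch of the cluster bootstrap behind the error bars in Fig. 12 (our own illustration; the function and variable names are hypothetical, and whole validation images are taken as the resampling unit, per our reading of the text):

```python
import numpy as np

def bootstrap_users_accuracy(tp_per_image, pred_per_image, iters=10_000):
    """tp_per_image / pred_per_image: per-image counts of true-positive and
    predicted tree pixels. Returns percentile confidence bounds for the
    user's accuracy at the 80/90/95% levels."""
    rng = np.random.default_rng(seed=0)
    tp = np.asarray(tp_per_image, dtype=float)
    pp = np.asarray(pred_per_image, dtype=float)
    n = len(tp)
    idx = rng.integers(0, n, size=(iters, n))          # resample whole images
    ua = tp[idx].sum(axis=1) / np.maximum(pp[idx].sum(axis=1), 1.0)
    return {lvl: np.percentile(ua, [50 - lvl / 2, 50 + lvl / 2])
            for lvl in (80, 90, 95)}
```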
4.5. Generalization to aerial imagery

Using a model fully trained on satellite images. To assess the generalization ability of our approach to other input imagery, we measure model performance using airborne imagery at inference. For inference, we resized the NEON aerial images to match the size of the corresponding satellite image, and applied a normalization of the aerial image to match the color histogram of the satellite imagery. Details about image normalization are provided in Appendix G.

The second line of Table 6 shows canopy height metrics computed on predictions made from NEON input RGB imagery. The SSL model almost doubles the R² of the ResUNet baseline. Compared to the performance of the SSL model with satellite images as input as reported in Table 1, the MAE is only slightly higher (3.0 instead of 2.7), the R² is a bit more
Fig. 15. Performance of models given aerial image inputs. Top: model fully trained on satellite images. Bottom: performance of the encoder trained on satellite images with the decoder trained on aerial images.
Fig. 16. Generalization of our SSL approach. Even though it was trained on satellite images, inference on airborne images does not seem to suffer from a domain shift.
Fig. 17. Comparison of our aerial model, where we trained the DPT decoder on NEON aerial RGB images, with the approach of Wagner et al. (2023). Panels, left to right: RGB aerial image, lidar ground truth, Wagner et al. (2023), our predicted CHM. Note that despite a slight change in the scale of the input image, which was zoomed to obtain a 256 × 256 input, and despite the fact that we did not use the infrared input, we obtain a result similar to that of Wagner et al. (2023). (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
impacted (0.55 instead of 0.70), while the bias is much higher, but evenly distributed between different height bins (Fig. 15). Fig. 16 displays a qualitative example, where we observe that despite a blurrier result, the accuracy of the model given an out-of-domain aerial image seems similar to the one obtained using in-domain satellite imagery. Despite changes in color intensity, image angle, and sun angle, our approach manages to generate predictions with consistent visual quality. From an application point of view, the robustness of SSL predictions
without the need to retrain models on new lidar datasets is very interesting.

Training a new decoder on aerial images. We compared these results to another baseline, training a decoder on top of our pretrained SSL features on NEON RGBs (last line of Table 6). Given a better alignment with the ground truth CHMs, and view angles close to nadir across the NEON dataset, this aerial model performed reasonably well compared to the recent result of Wagner et al. (2023), only using the RGB channels, as illustrated in Fig. 17.

5. Discussion

observations above 30 m.

The generated maps are limited by variation in input imagery, particularly by variation in view angle, sun angle, acquisition times and dates, and optical aerosol. As shown in Fig. 17, qualitative data quality improves considerably when processed on VHR aerial optical imagery, as opposed to VHR satellite optical imagery. Additionally, terrain slope appears to influence predicted height, since it affects the length of shadow an individual tree casts. At present, the ability to conduct tree height change detection assessments is limited by the need for improved input image processing to better align these differences between image pairs.
Acknowledgments

We would like to thank Ben Weinstein for helpful discussions regarding the NEON dataset. We thank Andi Gros and Saikat Basu for their technical advice. We would like to thank Shmulik Eisenmann, Patrick Louis, Lee Francisco, Leah Harwell, Eric Alamillo, Sylvia Lee, Patrick Nease, Alex Pompe and Shawn Mcguire for their project support.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.
Fig. A.18. Distribution of ALS Datasets: Train/Calibration/set-aside validation (aka Train/Validation/Test): (a) All ALS datasets. Here Train and Calibration points
overlap and are shown in blue. Set-aside validation (Test) datasets are from non-overlapping geographic regions. (b) Zooming in on one Train / Calibration site
(NEON GRSM) - we have randomly split the data into non spatially overlapping tiles so that calibration data is drawn from the same sites and ecosystems as training
data, but separated spatially. (For interpretation of the references to color in this figure legend, the reader is referred to the web version of this article.)
Appendix A. ALS datasets

The NEON sites used during training / calibration are: SJER, SOAP, TEAK, BART, DSNY, HARV, JERC, OSBS, DELA, GRSM, LENO, MLBS, BLAN, CLBJ, KONZ, NOGP, SCBI, TALL, UKFS, WOOD, ABBY, BONA, DEJU, JORN, MOAB, OAES, ONAQ, SERC, SRER, UNDE, WREF, HEAL, LAJA, RMNP, PUUM.
The set-aside validation dataset, "NEON test", contains the following NEON sites: CUPE, REDB, WLOU, HOPB, GUAN.
To ensure repeatability of our approach, we provide a complete list of CHM files used during training/calibration at: https://dataforgood-fb-data.s3.amazonaws.com/forests/v1/NEON_training_images.csv
B.1. GEDI data

The GEDI instrument is a full waveform lidar instrument aboard the International Space Station which has sampled global regions between 51.6° N & S latitude with a ∼25 m beam footprint at the ground surface. The instrument details are described in Dubayah et al. (2020), and its measurements of canopy height are described in Dubayah et al. (2022). We used the GEDI L2A Version 2 product and filtered the dataset to reduce noise by only including data which had: degrade flag = 0, surface flag = 1, solar elevation < 0, and sensitivity > 0.95. After this filtering, we were left with a total sample size of 1.3 × 10⁹ measurements. We used the 95th percentile of relative height (RH95), which we paired to 128 × 128 pixel (76 × 76 meter) satellite images from Maxar. These images were processed as described in Section 2.3.1, but were smaller to more closely approximate the GEDI footprint. Although these images are still significantly larger than the 25 m GEDI footprint, we found improved results from our GEDI model using larger areas, potentially due to pointing errors in the GEDI data and a larger spatial context improving the CNN model results.
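Expressed as code, the quality filter reads as below (a pandas sketch of our own; the column names follow the GEDI L2A product fields, and the data-access layer is assumed):

```python
import pandas as pd

def filter_gedi_l2a(shots: pd.DataFrame) -> pd.DataFrame:
    """Noise filter of Appendix B.1 applied to GEDI L2A Version 2 shots."""
    keep = ((shots["degrade_flag"] == 0)
            & (shots["surface_flag"] == 1)
            & (shots["solar_elevation"] < 0)   # night-time acquisitions only
            & (shots["sensitivity"] > 0.95))
    return shots.loc[keep]
```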
B.2. Connection between ALS CHM 95th percentiles and GEDI RH95
To leverage the GEDI model output, we made the following assumption: the GEDI model, on a 128 × 128 pixel sample, approximates the 95th percentile (p95) of the sample's ground truth canopy height map. This is justified by running simulations with the GEDI simulator from Hancock et al. (2019) on the NEON ALS point clouds. We used simulated values rather than actual GEDI measurements because the GEDI measurements suffer from pointing errors, and because the simulator allows for denser sampling within the limited geographic footprint of our ALS dataset.
The GEDI RH95 measurement used for training the GEDI model corresponds to the 95th percentile of the lidar's energy response. We simulated the GEDI RH95 values and found that they have excellent correlation (R² = 0.88) with the 95th percentile of the canopy height map around the corresponding GEDI footprints. This high correlation between GEDI RH95 and p95 of CHM was consistent across the diverse ecosystems covered in all 40 NEON sites in Appendix A.
B.3. GEDI model training

The GEDI measurements were split into 80/10/10% train/calibration/set-aside validation subsets. During training, the samples were drawn with a weight inversely proportional to the natural log of the total number of global samples in their RH95 bin, where each bin has a width of 1 m. We found that this sampling method provided a good number of training samples from higher canopy height locations while not overly biasing the model towards ecosystems with the highest canopy heights. Log inverse sample weighting is a less aggressive re-weighting than the typical linear inverse weighting, as done in Lang et al. (2022a), which we chose so as not to overly bias the model towards the relatively few high canopy height samples.
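A sketch of this weighting scheme (our own; the guard against empty or single-sample bins is an assumption):

```python
import numpy as np

def rh95_sampling_weights(rh95_all, rh95_pool):
    """Draw weights inversely proportional to ln(count) of each 1 m RH95 bin,
    with bin counts taken over the full global sample `rh95_all`."""
    counts = np.bincount(np.floor(np.asarray(rh95_all)).astype(int))
    pool_bins = np.floor(np.asarray(rh95_pool)).astype(int)
    w = 1.0 / np.log(np.maximum(counts[pool_bins], 2))  # avoid log(0), log(1)
    return w / w.sum()                                   # normalized draw probs
```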
After the convolutional layers, we also input a collection of scalar values, designated as "Satellite Metadata" in Fig. 2. This metadata included: the latitude/longitude position of each image, the off-nadir view angle of the satellite, the angle between zenith and sun position at capture, and the terrain slope (Mapzen, 2017) of the image footprint. Measured terrain slope is used during training but set to zero during forward inference, which allows the model to reduce the systematic error resulting from the bias of GEDI measurements towards higher canopy heights when the beam width straddles large surface gradients (see Section 4.1.3, Appendix B.4).
When training the GEDI model, we only used random 90 degree rotations and random horizontal and vertical flips, since the larger volume of data
made augmentation less helpful.
Fig. B.19. Correlation between 95th percentiles of ALS Canopy Height Maps and simulated GEDI RH95 values from the same maps. The 95th percentile is computed
within weighted Gaussians with σ = 12.5m, in order to roughly approximate the GEDI beam width.
B.4. Influence of terrain slope

As has been noted in Adam et al. (2020), the GEDI instrument's estimate of canopy height is influenced by the terrain slope. We found evidence of this correlation in the data, and due to this have chosen to set the terrain slope to zero during inference to mitigate this systematic error.
Fig. B.20. Correlation between terrain slope and GEDI RH95 for samples in CA. The dashed line indicates the height of the terrain change within the GEDI beam (GEDI beam radius times the terrain slope). The heatmap is predominantly above this line, indicating that there are no GEDI height estimates which fall below the terrain change within the beam.
C.1. Block R2
To compute the block R² score, we split the ground truth CHM c and the prediction ĉ into 50 × 50 pixel blocks and average their values, leading to a 5 × 5 array, reshaped into 1 × 25 vectors c⁽ᵇ⁾ and ĉ⁽ᵇ⁾. Given c̄⁽ᵇ⁾, the average of all ground truth c⁽ᵇ⁾ in the test set, the classical R² score is then computed:

R²_block = 1 − Σᵢ (cᵢ⁽ᵇ⁾ − ĉᵢ⁽ᵇ⁾)² / Σᵢ (cᵢ⁽ᵇ⁾ − c̄⁽ᵇ⁾)².   (C.1)
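A direct implementation of Eq. (C.1) (our sketch; edge pixels beyond a multiple of the block size are cropped):

```python
import numpy as np

def block_r2(preds, gts, block=50):
    """preds, gts: lists of CHM arrays. Blocks are 50x50-pixel means, pooled
    over the whole test set before computing the R^2 of Eq. (C.1)."""
    def block_means(x):
        h, w = x.shape[0] // block, x.shape[1] // block
        x = x[:h * block, :w * block]
        return x.reshape(h, block, w, block).mean(axis=(1, 3)).ravel()
    c = np.concatenate([block_means(g) for g in gts])        # ground truth
    c_hat = np.concatenate([block_means(p) for p in preds])  # predictions
    return 1.0 - ((c - c_hat) ** 2).sum() / ((c - c.mean()) ** 2).sum()
```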
C.3. Edge error

We are interested in measuring the sharpness of the CHM while remaining close to the ground truth. Because a blurry prediction can lead to the same MAE, block R², or PSNR as a sharp one, those metrics do not serve this purpose. Therefore, we established a metric comparing the image gradients of the maps, dubbed the "edge error score", given by Algorithm 1. Fig. C.21 illustrates how this metric is computed on an example.
Algorithm 1. Edge error metric.
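The printed algorithm did not survive extraction; the sketch below is a hypothetical reconstruction from the prose in Section 4.2 (the gradient operator and normalization details are assumptions), yielding a score in [0, 1] that is lower for sharper, better-aligned edges:

```python
import numpy as np
from scipy.ndimage import sobel

def edge_error(pred_chm, gt_chm, eps=1e-6):
    """Sum of absolute differences between predicted and ground-truth edge
    maps, normalized by the total detected edge energy in both maps."""
    def edge_map(x):
        return np.hypot(sobel(x, axis=0), sobel(x, axis=1))
    e_p = edge_map(pred_chm.astype(float))
    e_g = edge_map(gt_chm.astype(float))
    return float(np.abs(e_p - e_g).sum() / (e_p.sum() + e_g.sum() + eps))
```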
Fig. C.21. Illustration of the edge error metric for two results: the ResUNet (trained with an L1 loss) edge error score is 0.66 in this example, while the score of the SSL model is 0.55, computed using the difference of the prediction and ground truth edge maps appearing in the two images at the right.
Appendix E. Loss function and pre-training ablations

We compare in Table E.7 results of models trained with the L1 loss or Sigloss, and using different sizes of pre-training dataset: one with 3.5 × 10⁶ images (referred to as "3.5 M") and one with 18 × 10⁶ images ("18 M"). More pre-training data improves the performance of the SSL models. In terms of loss, we did not notice a strong difference between L2 and Sigloss, while the L1 results were slightly worse.
Table E.7
CHM prediction accuracy metrics with different loss functions. sl: Sigloss. cl: using classification output. Linear: using a linear layer instead of DPT. CA Brande results are not displayed to improve visibility, but they are included in the average.

Model | NEON test (MAE / R² / ME / EE) | São Paulo (MAE / R² / ME / EE) | Average (MAE / R² / ME / EE)
3.5 M sl | 2.8 / 0.67 / −1.2 / 0.51 | 6.0 / 0.14 / −4.2 / 0.60 | 3.1 / 0.56 / 1.9 / 0.54
18 M sl | 2.9 / 0.66 / −1.3 / 0.52 | 4.9 / 0.46 / −2.1 / 0.59 | 2.9 / 0.64 / 1.3 / 0.54
18 M linear sl | 3.0 / 0.58 / −1.8 / 0.68 | 7.1 / −0.27 / −6.7 / 0.71 | 3.6 / 0.41 / 2.8 / 0.67
18 M cl sl | 2.6 / 0.71 / −0.9 / 0.48 | 4.9 / 0.47 / −1.9 / 0.55 | 2.7 / 0.70 / 1.0 / 0.51
18 M cl l1 | 2.5 / 0.80 / 0.0 / 0.51 | 5.2 / 0.39 / −2.6 / 0.56 | 2.9 / 0.72 / 0.7 / 0.53
18 M cl l2 | 2.6 / 0.86 / −0.1 / 0.52 | 5.1 / 0.43 / −1.4 / 0.55 | 2.8 / 0.75 / 0.5 / 0.51
Fig. F.22 displays the difference between the p95 of the block model CHM predictions and the measured GEDI RH95, as a function of GEDI RH95.
Fig. F.22. Residuals of p95 CHM predictions relative to GEDI RH95, plotted against GEDI RH95.
An image normalization step is necessary to improve SSL inference performance on aerial images when the model is trained only on satellite imagery. We perform a classical histogram normalization of the aerial images (i.e., we normalize the RGB channels of the aerial image to the p5–p95 distribution of the satellite image). This makes the color balance much more similar, leading to better performance for the SSL model. The satellite image is taken through much more atmosphere, and we expect it to be less blue on average because of the preferential scattering of shorter wavelengths. Denoting by I the satellite image and A the original aerial image, we first compute, for each color channel i and each image X, the 5th percentile p5(X)_i and the 95th percentile p95(X)_i. The normalized aerial image is then given by
$$
A_i' = \left( A_i - p_5(A)_i \right) \cdot \frac{p_{95}(I)_i - p_5(I)_i}{p_{95}(A)_i - p_5(A)_i} + p_5(I)_i.
$$
We apply this normalization only to the SSL model trained on satellite imagery; applying it to the ResUNet model deteriorated the results.
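A minimal NumPy sketch of this per-channel percentile matching follows; the array shapes (H × W × 3 floats) and the epsilon guard against degenerate channels are assumptions of the sketch, not details from the paper.

```python
import numpy as np

def match_percentiles(aerial, satellite, lo=5, hi=95):
    """Rescale each RGB channel of an aerial image so that its
    [p5, p95] range matches that of a reference satellite image."""
    out = np.empty_like(aerial, dtype=float)
    for i in range(aerial.shape[-1]):
        a_lo, a_hi = np.percentile(aerial[..., i], [lo, hi])
        s_lo, s_hi = np.percentile(satellite[..., i], [lo, hi])
        out[..., i] = (aerial[..., i] - a_lo) * (s_hi - s_lo) / (a_hi - a_lo + 1e-9) + s_lo
    return out
```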
References

Adam, M., Urbazaev, M., Dubois, C., Schmullius, C., 2020. Accuracy assessment of GEDI terrain elevation and canopy height estimates in European temperate forests: influence of environmental and acquisition parameters. Remote Sens. 12. https://doi.org/10.3390/rs12233948.

Astola, H., Seitsonen, L., Halme, E., Molinier, M., Lönnqvist, A., 2021. Deep neural networks with transfer learning for forest variable estimation using Sentinel-2 imagery in boreal forest. Remote Sens. 13. https://doi.org/10.3390/rs13122392.

Azevedo, T., Souza, C., Zanin Shimbo, J., Alencar, A., 2018. MapBiomas Initiative: Mapping Annual Land Cover and Land Use Changes in Brazil from 1985 to 2017.

Bhat, S.F., Alhashim, I., Wonka, P., 2021. AdaBins: depth estimation using adaptive bins. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 4009–4018.

Brande, K., 2021. 3D Fuel Structure in Relation to Prescribed Fire, CA 2020. National Center for Airborne Laser Mapping (NCALM), distributed by OpenTopography. https://doi.org/10.5069/G9C53J18. Accessed: 2023-02-15.

Brandt, M., Tucker, C.J., Kariryaa, A., Rasmussen, K., Abel, C., Small, J., Chave, J., Rasmussen, L.V., Hiernaux, P., Diouf, A.A., Kergoat, L., Mertz, O., Igel, C., Gieseke, F., Schöning, J., Li, S., Melocik, K., Meyer, J., Sinno, S., Romero, E., Glennie, E., Montagu, A., Dendoncker, M., Fensholt, R., 2020. An unexpectedly large count of trees in the West African Sahara and Sahel. Nature 587, 78–82.

Camarretta, N., Harrison, P.A., Bailey, T., Potts, B., Lucieer, A., Davidson, N., Hunt, M., 2020. Monitoring forest structure to guide adaptive management of forest restoration: a review of remote sensing approaches. New For. 51, 573–596. https://doi.org/10.1007/s11056-019-09754-5.

Cook-Patton, S.C., Leavitt, S.M., Gibbs, D., Harris, N.L., Lister, K., Anderson-Teixeira, K.J., Briggs, R.D., Chazdon, R.L., Crowther, T.W., Ellis, P.W., Griscom, H.P., Herrmann, V., Holl, K.D., Houghton, R.A., Larrosa, C., Lomax, G., Lucas, R., Madsen, P., Malhi, Y., Paquette, A., Parker, J.D., Paul, K., Routh, D., Roxburgh, S., Saatchi, S., van den Hoogen, J., Walker, W.S., Wheeler, C.E., Wood, S.A., Xu, L., Griscom, B.W., 2020. Mapping carbon accumulation potential from global natural forest regrowth. Nature 585, 545–550. https://doi.org/10.1038/s41586-020-2686-x.

Csillik, O., Kumar, P., Mascaro, J., O'Shea, T., Asner, G.P., 2019. Monitoring tropical forest carbon stocks and emissions using Planet satellite data. Sci. Rep. 9, 17831. https://doi.org/10.1038/s41598-019-54386-6.

Cuni-Sanchez, A., Sullivan, M.J.P., Platts, P., et al., 2021. High aboveground carbon stock of African tropical montane forests. Nature 596, 536–542. https://doi.org/10.1038/s41586-021-03728-4.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021a. An image is worth 16x16 words: transformers for image recognition at scale. In: 9th International Conference on Learning Representations, ICLR 2021, Virtual Event, Austria, May 3-7, 2021. https://doi.org/10.48550/ARXIV.2010.11929.

Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., Dehghani, M., Minderer, M., Heigold, G., Gelly, S., Uszkoreit, J., Houlsby, N., 2021b. An image is worth 16x16 words: transformers for image recognition at scale. arXiv:2010.11929.

Dos-Santos, M., Keller, M., Morton, D., 2019. Lidar Surveys over Selected Forest Research Sites, Brazilian Amazon, 2008–2018. ORNL DAAC, Oak Ridge, Tennessee, USA. URL: https://daac.ornl.gov/CMS/guides/LiDAR_Forest_Inventory_Brazil.html.

Dubayah, R., Blair, J.B., Goetz, S., Fatoyinbo, L., Hansen, M., Healey, S., Hofton, M., Hurtt, G., Kellner, J., Luthcke, S., Armston, J., Tang, H., Duncanson, L., Hancock, S., Jantz, P., Marselis, S., Patterson, P.L., Qi, W., Silva, C., 2020. The Global Ecosystem Dynamics Investigation: high-resolution laser ranging of the Earth's forests and topography. Sci. Remote Sens. 1, 100002. https://doi.org/10.1016/j.srs.2020.100002.

Dubayah, R., Luthcke, S., Sabaka, T., Nicholas, J., Preaux, S., Hofton, M. GEDI L3 Gridded Land Surface Metrics, Version 1. URL: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=1865.

Dubayah, R., Armston, J., Kellner, J., Duncanson, L., Healey, S., Patterson, P., Hancock, S., Tang, H., Bruening, J., Hofton, M., Blair, J., Luthcke, S. GEDI L4A Footprint Level Aboveground Biomass Density, Version 2.1. URL: https://daac.ornl.gov/cgi-bin/dsviewer.pl?ds_id=2056. https://doi.org/10.3334/ORNLDAAC/2056.

Duncanson, L., Neuenschwander, A., Hancock, S., Thomas, N., Fatoyinbo, T., Simard, M., Silva, C.A., Armston, J., Luthcke, S.B., Hofton, M., Kellner, J.R., Dubayah, R., 2020. Biomass estimation from simulated GEDI, ICESat-2 and NISAR across environmental gradients in Sonoma County, California. Remote Sens. Environ. 242, 111779. https://doi.org/10.1016/j.rse.2020.111779.

Eigen, D., Puhrsch, C., Fergus, R., 2014. Depth map prediction from a single image using a multi-scale deep network. Adv. Neural Inf. Proces. Syst. 27.

Fayad, I., Ciais, P., Schwartz, M., Wigneron, J.P., Baghdadi, N., de Truchis, A., d'Aspremont, A., Frappart, F., Saatchi, S., Pellissier-Tanon, A., Bazzi, H., 2023. Vision transformers, a new approach for high-resolution and large-scale mapping of canopy heights. arXiv:2304.11487.

Friedlingstein, P., Jones, M.W., O'Sullivan, M., Andrew, R.M., Hauck, J., Peters, G.P., Peters, W., Pongratz, J., Sitch, S., Le Quéré, C., Bakker, D.C.E., Canadell, J.G., Ciais, P., Jackson, R.B., Anthoni, P., Barbero, L., Bastos, A., Bastrikov, V., Becker, M., Bopp, L., Buitenhuis, E., Chandra, N., Chevallier, F., Chini, L.P., Currie, K.I., Feely, R.A., Gehlen, M., Gilfillan, D., Gkritzalis, T., Goll, D.S., Gruber, N., Gutekunst, S., Harris, I., Haverd, V., Houghton, R.A., Hurtt, G., Ilyina, T., Jain, A.K., Joetzjer, E., Kaplan, J.O., Kato, E., Klein Goldewijk, K., Korsbakken, J.I., Landschützer, P., Lauvset, S.K., Lefèvre, N., Lenton, A., Lienert, S., Lombardozzi, D., Marland, G., McGuire, P.C., Melton, J.R., Metzl, N., Munro, D.R., Nabel, J.E.M.S., Nakaoka, S.I., Neill, C., Omar, A.M., Ono, T., Peregon, A., Pierrot, D., Poulter, B., Rehder, G., Resplandy, L., Robertson, E., Rödenbeck, C., Séférian, R., Schwinger, J., Smith, N., Tans, P.P., Tian, H., Tilbrook, B., Tubiello, F.N., van der Werf, G.R., Wiltshire, A.J., Zaehle, S., 2019. Global carbon budget 2019. Earth Syst. Sci. Data 11, 1783–1838. https://doi.org/10.5194/essd-11-1783-2019.

Fu, H., Gong, M., Wang, C., Tao, D., 2018. A compromise principle in deep monocular depth estimation. arXiv:1708.08267.

Gibril, M.B.A., Shafri, H.Z.M., Al-Ruzouq, R., Shanableh, A., Nahas, F., Al Mansoori, S., 2023. Large-scale date palm tree segmentation from multiscale UAV-based and aerial images using deep vision transformers. Drones 7. https://doi.org/10.3390/drones7020093.

Hancock, S., Armston, J., Hofton, M., Sun, X., Tang, H., Duncanson, L.I., Kellner, J.R., Dubayah, R., 2019. The GEDI simulator: a large-footprint waveform lidar simulator for calibration and validation of spaceborne missions. Earth Space Sci. 6, 294–310. https://doi.org/10.1029/2018EA000506.

Hansen, M.C., Potapov, P.V., Moore, R., Hancher, M., Turubanova, S.A., Tyukavina, A., Thau, D., Stehman, S.V., Goetz, S.J., Loveland, T.R., Kommareddy, A., Egorov, A., Chini, L., Justice, C.O., Townshend, J.R.G., 2013. High-resolution global maps of 21st-century forest cover change. Science 342, 850–853. https://doi.org/10.1126/science.1244693.

Hansen, M.C., Krylov, A., Tyukavina, A., Potapov, P.V., Turubanova, S., Zutta, B., Ifo, S., Margono, B., Stolle, F., Moore, R., 2016. Humid tropical forest disturbance alerts using Landsat data. Environ. Res. Lett. 11, 034008. https://doi.org/10.1088/1748-9326/11/3/034008.

Harris, N.L., Gibbs, D.A., Baccini, A., Birdsey, R.A., de Bruin, S., Farina, M., Fatoyinbo, L., Hansen, M.C., Herold, M., Houghton, R.A., Potapov, P.V., Suarez, D.R., Roman-Cuesta, R.M., Saatchi, S.S., Slay, C.M., Turubanova, S.A., Tyukavina, A., 2021. Global maps of twenty-first century forest carbon fluxes. Nat. Clim. Chang. 11, 234–240. https://doi.org/10.1038/s41558-020-00976-6.

He, K., Chen, X., Xie, S., Li, Y., Dollár, P., Girshick, R., 2022. Masked autoencoders are scalable vision learners. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition, pp. 16000–16009.

Khosravipour, A., Skidmore, A.K., Isenburg, M., Wang, T., Hussin, Y.A., 2014. Generating pit-free canopy height models from airborne lidar. Photogramm. Eng. Remote. Sens. 80, 863–872. https://doi.org/10.14358/PERS.80.9.863.

Lang, N., Jetz, W., Schindler, K., Wegner, J.D., 2022a. A high-resolution canopy height model of the Earth. https://doi.org/10.48550/ARXIV.2204.08322.

Lang, N., Kalischek, N., Armston, J., Schindler, K., Dubayah, R., Wegner, J.D., 2022b. Global canopy height regression and uncertainty estimation from GEDI LIDAR waveforms with deep ensembles. Remote Sens. Environ. 268, 112760. https://doi.org/10.1016/j.rse.2021.112760.

Li, B., Dai, Y., He, M., 2018. Monocular depth estimation with hierarchical fusion of dilated CNNs and soft-weighted-sum inference. Pattern Recogn. 83, 328–339. https://doi.org/10.1016/j.patcog.2018.05.029.

Li, W., Niu, Z., Shang, R., Qin, Y., Wang, L., Chen, H., 2020. High-resolution mapping of forest canopy height using machine learning by coupling ICESat-2 lidar with Sentinel-1, Sentinel-2 and Landsat-8 data. Int. J. Appl. Earth Obs. Geoinf. 92, 102163. https://doi.org/10.1016/j.jag.2020.102163.

Liu, S., Brandt, M., Nord-Larsen, T., Chave, J., Reiner, F., Lang, N., Tong, X., Ciais, P., Igel, C., Li, S., Mugabowindekwe, M., Saatchi, S., Yue, Y., Chen, Z., Fensholt, R., 2023. The overlooked contribution of trees outside forests to tree cover and woody biomass across Europe. https://doi.org/10.21203/rs.3.rs-2573442/v1.

Luo, W., Li, Y., Urtasun, R., Zemel, R., 2016. Understanding the effective receptive field in deep convolutional neural networks. Adv. Neural Inf. Proces. Syst. 29. https://doi.org/10.48550/ARXIV.1701.04128.

Luyssaert, S., Schulze, E.D., Börner, A., Knohl, A., Hessenmöller, D., Law, B.E., Ciais, P., Grace, J., 2008. Old-growth forests as global carbon sinks. Nature 455, 213–215. https://doi.org/10.1038/nature07276.

da Luz, N.B., Garrastazu, M.C., Rosot, M.A.D., Maran, J.C., de Oliveira, Y.M.M., Franciscon, L., Cardoso, D.J., de Freitas, J.V., 2018. Inventário florestal nacional do Brasil - uma abordagem em escala de paisagem para monitorar e avaliar paisagens florestais. Pesquisa Florestal Bras. 38. https://doi.org/10.4336/2018.pfb.38e201701493.

Maioli, V., Belharte, S., Stuker Kropf, M., Callado, C.H., 2020. Timber exploitation in colonial Brazil: a historical perspective of the Atlantic Forest. Hist. Ambient. Latinoamericana Caribeña (HALAC) Rev. Solcha 10, 46–73. https://doi.org/10.32991/2237-2717.2020v10i2.p74-101.

Mapzen, 2017. Terrain Tiles on AWS. Amazon. URL: https://registry.opendata.aws/terrain-tiles.

Markus, T., Neumann, T., Martino, A., Abdalati, W., Brunt, K., Csatho, B., Farrell, S., Fricker, H., Gardner, A., Harding, D., Jasinski, M., Kwok, R., Magruder, L., Lubin, D., Luthcke, S., Morison, J., Nelson, R., Neuenschwander, A., Palm, S., Popescu, S., Shum, C., Schutz, B.E., Smith, B., Yang, Y., Zwally, J., 2017. The Ice, Cloud, and land Elevation Satellite-2 (ICESat-2): science requirements, concept, and implementation. Remote Sens. Environ. 190, 260–273. https://doi.org/10.1016/j.rse.2016.12.029.

Maxwell, A.E., Warner, T.A., Guillén, L.A., 2021. Accuracy assessment in convolutional neural network-based deep learning remote sensing studies, part 2: recommendations and best practices. Remote Sens. 13. https://doi.org/10.3390/rs13132591.

Miangoleh, S.M.H., Dille, S., Mai, L., Paris, S., Aksoy, Y., 2021. Boosting monocular depth estimation models to high-resolution via content-adaptive multi-resolution merging. In: Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition (CVPR), pp. 9685–9694.

Mugabowindekwe, M., Brandt, M., Chave, J., Reiner, F., Skole, D.L., Kariryaa, A., Igel, C., Hiernaux, P., Ciais, P., Mertz, O., et al., 2022. Nation-wide mapping of tree-level aboveground carbon stocks in Rwanda. Nat. Clim. Chang. 1–7.

National Ecological Observatory Network (NEON), 2022. Ecosystem Structure (DP3.30015.001). URL: https://data.neonscience.org/data-products/DP3.30015.001.

Olofsson, P., Foody, G.M., Herold, M., Stehman, S.V., Woodcock, C.E., Wulder, M.A., 2014. Good practices for estimating area and assessing accuracy of land change. Remote Sens. Environ. 148, 42–57. https://doi.org/10.1016/j.rse.2014.02.015.

Oquab, M., Darcet, T., Moutakanni, T., Vo, H., Szafraniec, M., Khalidov, V., Fernandez, P., Haziza, D., Massa, F., El-Nouby, A., Assran, M., Ballas, N., Galuba, W., Howes, R., Huang, P.Y., Li, S.W., Misra, I., Rabbat, M., Sharma, V., Synnaeve, G., Xu, H., Jegou, H., Mairal, J., Labatut, P., Joulin, A., Bojanowski, P., 2023. DINOv2: learning robust visual features without supervision. arXiv:2304.07193.

Popkin, G., 2015. The hunt for the world's missing carbon. Nature 523, 20–22. https://doi.org/10.1038/523020a.

Potapov, P., Li, X., Hernandez-Serna, A., Tyukavina, A., Hansen, M.C., Kommareddy, A., Pickens, A., Turubanova, S., Tang, H., Silva, C.E., Armston, J., Dubayah, R., Blair, J.B., Hofton, M., 2021. Mapping global forest canopy height through integration of GEDI and Landsat data. Remote Sens. Environ. 253, 112165. https://doi.org/10.1016/j.rse.2020.112165.

Ranftl, R., Bochkovskiy, A., Koltun, V., 2021. Vision transformers for dense prediction. In: International Conference on Computer Vision.

Reed, C.J., Gupta, R., Li, S., Brockman, S., Funk, C., Clipp, B., Candido, S., Uyttendaele, M., Darrell, T., 2022. Scale-MAE: a scale-aware masked autoencoder for multiscale geospatial representation learning. arXiv:2212.14532.

Reytar, K., Buckingham, K., Stolle, F., Brandt, J., Zamora-Cristales, R., Landsberg, F., Singh, R., Streck, C., Saint-Laurent, C., Tucker, C., Henry, M., Walji, K., Finegold, Y., Aga, Y., Rezende, M., 2020. Measuring progress in forest and landscape restoration. Unasylva 71, 62.

Ribeiro, M.C., Martensen, A.C., Metzger, J.P., Tabarelli, M., Scarano, F., Fortin, M.J., 2011. The Brazilian Atlantic Forest: A Shrinking Biodiversity Hotspot. Springer, Berlin Heidelberg, pp. 405–434. https://doi.org/10.1007/978-3-642-20992-5_21.

Ronneberger, O., Fischer, P., Brox, T., 2015. U-Net: convolutional networks for biomedical image segmentation. In: Medical Image Computing and Computer-Assisted Intervention – MICCAI 2015: 18th International Conference, Munich, Germany, October 5-9, 2015, Proceedings, Part III. Springer, pp. 234–241.

Schacher, A., Roger, E., Williams, K.J., Stenson, M.P., Sparrow, B., Lacey, J., 2023. Use-specific considerations for optimising data quality trade-offs in citizen science: recommendations from a targeted literature review to improve the usability and utility for the calibration and validation of remotely sensed products. Remote Sens. 15. https://doi.org/10.3390/rs15051407.

Schwartz, M., Ciais, P., Ottlé, C., De Truchis, A., Vega, C., Fayad, I., Brandt, M., Fensholt, R., Baghdadi, N., Morneau, F., Morin, D., Guyon, D., Dayau, S., Wigneron, J.P., 2022. High-resolution canopy height map in the Landes forest (France) based on GEDI, Sentinel-1, and Sentinel-2 data with a deep learning approach. arXiv:2212.10265.

Silva, C.A., Duncanson, L., Hancock, S., Neuenschwander, A., Thomas, N., Hofton, M., Fatoyinbo, L., Simard, M., Marshak, C.Z., Armston, J., Lutchke, S., Dubayah, R., 2021. Fusing simulated GEDI, ICESat-2 and NISAR data for regional aboveground biomass mapping. Remote Sens. Environ. 253, 112234. https://doi.org/10.1016/j.rse.2020.112234.

Singh, M., Gustafson, L., Adcock, A., Reis, V.D.F., Gedik, B., Kosaraju, R.P., Mahajan, D., Girshick, R., Dollár, P., van der Maaten, L., 2022. Revisiting weakly supervised pre-training of visual perception models. In: CVPR.

Sirko, W., Kashubin, S., Ritter, M., Annkah, A., Bouchareb, Y.S.E., Dauphin, Y.N., Keysers, D., Neumann, M., Cissé, M., Quinn, J., 2021. Continental-scale building detection from high resolution satellite imagery. CoRR abs/2107.12283. URL: https://arxiv.org/abs/2107.12283.

Skole, D.L., Samek, J.H., Dieng, M., Mbow, C., 2021. The contribution of trees outside of forests to landscape carbon and climate change mitigation in West Africa. Forests 12. https://doi.org/10.3390/f12121652.

Stehman, S.V., 2014. Estimating area and map accuracy for stratified random sampling when the strata are different from the map classes. Int. J. Remote Sens. 35, 4923–4939. https://doi.org/10.1080/01431161.2014.930207.

Stephenson, N.L., Das, A.J., Condit, R., Russo, S.E., Baker, P.J., Beckman, N.G., Coomes, D.A., Lines, E.R., Morris, W.K., Rüger, N., Álvarez, E., Blundo, C., Bunyavejchewin, S., Chuyong, G., Davies, S.J., Duque, Á., Ewango, C.N., Flores, O., Franklin, J.F., Grau, H.R., Hao, Z., Harmon, M.E., Hubbell, S.P., Kenfack, D., Lin, Y., Makana, J.R., Malizia, A., Malizia, L.R., Pabst, R.J., Pongpattananurak, N., Su, S.H., Sun, I.F., Tan, S., Thomas, D., van Mantgem, P.J., Wang, X., Wiser, S.K., Zavala, M.A., 2014. Rate of tree carbon accumulation increases continuously with tree size. Nature 507, 90–93. https://doi.org/10.1038/nature12914.

Tesfay, F., Moges, Y., Asfaw, Z., 2022. Woody species composition, structure, and carbon stock of coffee-based agroforestry system along an elevation gradient in the moist mid-highlands of southern Ethiopia. Int. J. Forest. Res. 2022, 1–12. https://doi.org/10.1155/2022/4729336.

Vallauri, D., Aronson, J., Dudley, N., Vallejo, R., 2005. Monitoring and Evaluating Forest Restoration Success. Springer, New York, NY, pp. 150–158. https://doi.org/10.1007/0-387-29112-1_21.

Viani, R.A.G., Barreto, T.E., Farah, F.T., Rodrigues, R.R., Brancalion, P.H.S., 2018. Monitoring young tropical forest restoration sites: how much to measure? Trop. Conserv. Sci. 11. https://doi.org/10.1177/1940082918780916.

Wagner, F.H., Roberts, S., Ritz, A.L., Carter, G., Dalagnol, R., Favrichon, S., Hirye, M.C., Brandt, M., Ciais, P., Saatchi, S., 2023. Sub-meter tree height mapping of California using aerial images and lidar-informed U-Net model. arXiv:2306.01936.

Wang, W., Tang, C., Wang, X., Zheng, B., 2022. A ViT-based multiscale feature fusion approach for remote sensing image segmentation. IEEE Geosci. Remote Sens. Lett. 19, 1–5. https://doi.org/10.1109/LGRS.2022.3187135.

Weinstein, B.G., Graves, S.J., Marconi, S., Singh, A., Zare, A., Stewart, D., Bohlman, S.A., White, E.P., 2021. A benchmark dataset for canopy crown detection and delineation in co-registered airborne RGB, LiDAR and hyperspectral imagery from the National Ecological Observation Network. PLoS Comput. Biol. 17 (7), e1009180.

Xu, Z., Zhang, W., Zhang, T., Yang, Z., Li, J., 2021. Efficient transformer for remote sensing image segmentation. Remote Sens. 13. https://doi.org/10.3390/rs13183585.

Yanai, R.D., Wayson, C., Lee, D., Espejo, A.B., Campbell, J.L., Green, M.B., Zukswert, J.M., Yoffe, S.B., Aukema, J.E., Lister, A.J., Kirchner, J.W., Gamarra, J.G.P., 2020. Improving uncertainty in forest carbon accounting for REDD+ mitigation efforts. Environ. Res. Lett. 15, 124002. https://doi.org/10.1088/1748-9326/abb96f.

Zhang, Z., Liu, Q., Wang, Y., 2017. Road extraction by deep residual U-Net. CoRR abs/1711.10684. URL: http://arxiv.org/abs/1711.10684.