Deep Learning for the Earth Sciences: A Comprehensive Approach to Remote Sensing, Climate Science, and Geosciences
Edited by
Gustau Camps-Valls
Universitat de València, Spain
Devis Tuia
EPFL, Switzerland
Xiao Xiang Zhu
Technical University of Munich, Germany
Markus Reichstein
Max Planck Institute, Germany
This edition first published 2021
© 2021 John Wiley & Sons Ltd
Chapter 14 © 2021 John Wiley & Sons Ltd. The contributions to the chapter written by Samantha Adams
© Crown copyright 2021, Met Office. Reproduced with the permission of the Controller of Her Majesty’s
Stationery Office. All Other Rights Reserved.
All rights reserved. No part of this publication may be reproduced, stored in a retrieval system, or
transmitted, in any form or by any means, electronic, mechanical, photocopying, recording or otherwise,
except as permitted by law. Advice on how to obtain permission to reuse material from this title is available
at http://www.wiley.com/go/permissions.
The right of Gustau Camps-Valls, Devis Tuia, Xiao Xiang Zhu, and Markus Reichstein to be identified as
the authors of the editorial material in this work has been asserted in accordance with law.
Registered Offices
John Wiley & Sons, Inc., 111 River Street, Hoboken, NJ 07030, USA
John Wiley & Sons Ltd, The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
Editorial Office
The Atrium, Southern Gate, Chichester, West Sussex, PO19 8SQ, UK
For details of our global editorial offices, customer services, and more information about Wiley products
visit us at www.wiley.com.
Wiley also publishes its books in a variety of electronic formats and by print-on-demand. Some content that
appears in standard print versions of this book may not be available in other formats.
Limit of Liability/Disclaimer of Warranty
While the publisher and authors have used their best efforts in preparing this work, they make no
representations or warranties with respect to the accuracy or completeness of the contents of this work and
specifically disclaim all warranties, including without limitation any implied warranties of merchantability
or fitness for a particular purpose. No warranty may be created or extended by sales representatives, written
sales materials or promotional statements for this work. The fact that an organization, website, or product
is referred to in this work as a citation and/or potential source of further information does not mean that
the publisher and authors endorse the information or services the organization, website, or product may
provide or recommendations it may make. This work is sold with the understanding that the publisher is
not engaged in rendering professional services. The advice and strategies contained herein may not be
suitable for your situation. You should consult with a specialist where appropriate. Further, readers should
be aware that websites listed in this work may have changed or disappeared between when this work was
written and when it is read. Neither the publisher nor authors shall be liable for any loss of profit or any
other commercial damages, including but not limited to special, incidental, consequential, or other
damages.
Library of Congress Cataloging-in-Publication Data
Name: Camps-Valls, Gustau, editor.
Title: Deep learning for the earth sciences : a comprehensive approach to
remote sensing, climate science and geosciences / edited by Gustau
Camps-Valls [and three others].
Description: Hoboken, NJ : Wiley, 2021. | Includes bibliographical
references and index.
Identifiers: LCCN 2021012965 (print) | LCCN 2021012966 (ebook) | ISBN
9781119646143 (cloth) | ISBN 9781119646150 (adobe pdf) | ISBN
9781119646167 (epub)
Subjects: LCSH: Earth sciences–Study and teaching. | Algorithms–Study and
teaching.
Classification: LCC QE26.3 .D44 2021 (print) | LCC QE26.3 (ebook) | DDC
550.71–dc23
LC record available at https://lccn.loc.gov/2021012965
LC ebook record available at https://lccn.loc.gov/2021012966
Cover Design: Wiley
Cover Image: © iStock.com/monsitj, Emilia Szymanek/Getty Images
Set in 9.5/12.5pt STIXTwoText by Straive, Chennai, India
10 9 8 7 6 5 4 3 2 1
To Adrian Albert, in memoriam
Contents
Foreword xvi
Acknowledgments xvii
List of Contributors xviii
List of Acronyms xxiv
1 Introduction 1
Gustau Camps-Valls, Xiao Xiang Zhu, Devis Tuia, and Markus Reichstein
1.1 A Taxonomy of Deep Learning Approaches 2
1.2 Deep Learning in Remote Sensing 3
1.3 Deep Learning in Geosciences and Climate 7
1.4 Book Structure and Roadmap 9
23 Outlook 328
Markus Reichstein, Gustau Camps-Valls, Devis Tuia, and Xiao Xiang Zhu
Bibliography 331
Index 401
Foreword
Earth science, like many other scientific disciplines, is undergoing a data revolution. In
particular, a massive amount of data about Earth and its environment is now continuously
being generated by Earth observing satellites as well as physics-based earth system mod-
els running on large-scale computational platforms. These information-rich datasets offer
huge potential for understanding how the Earth’s climate and ecosystem have been chang-
ing, and for addressing societal grand challenges relating to food/water/energy security and
climate change.
Deep learning, which has already revolutionized many disciplines (e.g., computer vision,
natural language processing) holds tremendous promise to revolutionize earth and environ-
mental sciences. In fact, recent years have seen an exponential growth in the use of deep
learning in Earth Science, with many amazing results. Deep learning also faces challenges
that are unique to earth science data: multimodality; high degree of heterogeneity in space
and time; and the fact that earth science data can only provide an incomplete and noisy
view of the underlying eco-geo-physical processes that are interacting and unfolding at dif-
ferent spatial and temporal scales. Addressing these challenges requires development of
entirely new approaches that can effectively incorporate existing earth science knowledge
inside the deep learning learning framework. Success in addressing these challenges stands
to revolutionize deep learning itself and accelerate discovery across many other scientific
domains.
The book does a fantastic job of capturing the state of the art in this fast-evolving area. It is logically organized into three coherent parts, each containing chapters written by experts in the field. Each chapter provides easy-to-understand introductory material followed by an in-depth treatment of the applications of deep learning to specific earth science problems as well as ideas for future research. This book is a must-read for students and researchers alike who would like to harness the data revolution in earth sciences to address pressing societal challenges.
Acknowledgments
We would like to acknowledge the help of all involved in the collation and review
process of the book, without whose support the project could not have been satisfactorily
completed. A further special note of thanks goes also to all the staff at Wiley, whose
contributions throughout the whole process, from inception of the initial idea to final
publication, have been valuable. Special thanks also go to the publishing team at Wiley,
who continuously prodded via e-mail, keeping the project on schedule.
We wish to thank all of the authors for their insights and excellent contributions to this
book. Most of the authors of chapters included in this book also served as referees for
chapters written by other authors. Thanks go to all those who provided constructive and
comprehensive reviews.
This book was completed without any dedicated funding, but the editors’ and authors’ research was partially supported by research projects that made it possible. We want to thank all agencies and organizations for supporting our research in general, and this book indirectly.
Gustau Camps-Valls acknowledges support by the European Research Council (ERC) under
the ERC-CoG-2014 project 647423.
Thanks all!
List of Contributors
Xingjian Shi
Amazon
USA
List of Acronyms
AE Autoencoder
AI Artificial Intelligence
AIC Akaike’s Information Criterion
AP Access Point
AR Autoregressive
ARMA Autoregressive and Moving Average
ARX Autoregressive eXogenous
AWGN Additive white Gaussian noise
BCE Binary Cross-Entropy
BER Bit Error Rate
BP Back-propagation
BPTT Back-propagation through Time
BRT Bootstrap Resampling Techniques
BSS Blind Source Separation
CAE Contractive Autoencoder
CBIR Content-based Image Retrieval
CCA Canonical Correlation Analysis
CCE Categorical Cross-Entropy
CGAN Conditional Generative Adversarial Network
CNN Convolutional Neural Network
CONUS Conterminous United States
CPC Contrastive Predictive Coding
CSVM Complex Support Vector Machine
CV Cross Validation
CWT Continuous Wavelet Transform
DAE Denoising Autoencoder
DCT Discrete Cosine Transform
DFT Discrete Fourier Transform
DL Deep Learning
DNN Deep Neural Network
DSM Dual Signal Model
DSP Digital Signal Processing
DSTL Deep Self-taught Learning
1
Introduction
Gustau Camps-Valls, Xiao Xiang Zhu, Devis Tuia, and Markus Reichstein
Machine learning methods are widely used to extract patterns and insights from the
ever-increasing data streams from sensory systems. Recently, deep learning, a particular
type of machine learning algorithm (Goodfellow et al. 2016), has excelled in tackling data
science problems, mainly in the fields of computer vision, natural language processing,
and speech recognition. In recent years, it has become impossible to ignore deep learning. Having started as a curiosity in the 1990s, it has imposed itself as the prime machine learning paradigm over the last ten years, especially thanks to the availability of
large datasets and of the advances in hardware and parallelization allowing them to be
learned from. Nowadays, most machine learning research is somehow deep learning-based
and new heights in performance have been reached in virtually all fields of data science,
both applied and theoretical. Adding to this the community efforts in sharing code and the availability of computational resources, deep learning seems to hold the key to unlocking data science research.
In recent years, deep learning has shown increased evidence of the potential to address
problems in Earth and climate sciences as well (Reichstein et al. 2019). As for many
applied fields of science, Earth observation and climate science are increasingly data-driven. Deep learning strategies are currently being explored by a growing number of researchers, and neural networks are used in many operational systems. The advances in
the field are impressive, but there is still much ground to cover to understand the complex
systems that are our Earth and its climate. Why deep learning is working in Earth data
problems is also a challenging question, for which one could argue a statistical reason.
As in computer vision or language processing, Earth Sciences also consider spatial and
temporal data that exhibit high autocorrelation functions which deep learning methods
treat very well. But what is the physical reason, if any? Is deep learning discovering
guiding or first principles in the data automatically? Why do convolutions in space or time
lead to appropriate feature representations? Are those representations sparse, physically
consistent, or even causal? Explaining what the deep learning model actually learned is
a challenge in itself. Even though AI has promised to change the way we do science, with DL as the first step in this endeavor, this will not happen unless we resolve these questions.
The field of deep learning for Earth and climate sciences is so wide and fast-evolving
that we could not cover all different methodological approaches and geoscientific problems.
A representative subset of methods, problems, and promising approaches was selected for the book. With this introduction (and more generally with this book), we want
to take a picture of the state of the art of the efforts in machine learning (section 1.1), in
the remote sensing (section 1.2) and geosciences and climate (section 1.3) communities
to integrate, use, and improve deep learning methods. We also want to provide resources
for researchers who want to start including neural network-based solutions in their data problems.
1.1 A Taxonomy of Deep Learning Approaches
Given the current pace of the advances in deep learning, providing a taxonomy of
approaches is not an easy task. The field is full of creativity and new inventive approaches
can be found on a regular basis. Without the pretension of being exhaustive, most deep
learning approaches can be placed along the lines of the following dimensions:
● Supervised vs. unsupervised. This is probably the most traditional distinction in machine
learning and also applies in deep learning methodologies. Basically, it boils down to
knowing whether the method uses labeled information to train or not. The best known
examples of supervised deep methods are the Convolutional Neural Network (CNN, Fukushima (1980); LeCun et al. (1998a); Krizhevsky et al. (2012)) and the recurrent neural
network (RNN, Hochreiter and Schmidhuber (1997)), both using labels to evaluate
the loss function and backpropagate errors to update weights, the former for image
data and the latter for data sequences. As for unsupervised methods, they do not use
ground truth information and therefore rely on unsupervised criteria to train. Among
unsupervised methods, autoencoders (Kramer 1991; Hinton and Zemel 1994) are the best known. They use the error in reconstructing the original input to train and are often used to learn low-dimensional representations (Hinton and Salakhutdinov 2006a) or for denoising images (Vincent and Larochelle 2010); a minimal code sketch of an autoencoder is given after this list.
In between these two endpoints, one can find a number of approaches tuning the level
and the nature of supervision: weakly supervised models (Zhou 2018), for instance, use
image-level supervision to predict phenomena at a finer resolution (e.g. localize objects
by only knowing whether they are present in the image), while self-supervised models use
the content of the image itself as a supervisory signal; proceeding this way, the labels to
train the model come for free. For example, self-supervised tasks include predicting the
color values from a greyscale version of the image (Zhang et al. 2016c), predicting the relative position of patches to learn part-to-object relations (Doersch et al. 2015), or predicting the
rotation that has been applied to an image (Gidaris et al. 2018).
● Generative vs. discriminative. Most methods described above are discriminative, in the
sense that they minimize an error function comparing the prediction with the true output
(a label or the image itself when reconstructing). They model the conditional probability
of the target Y given an observation x, i.e., P(Y |X = x). A generative model generates
possible inputs that respect the joint input/outputs distribution. In other words it models
the conditional probability of the data X given an output y, i.e. P(X|Y = y). Generative
models can therefore sample instances (e.g. patches, objects, images) from a distribution,
rather than only choosing the most likely one, which is a great advantage when data are
complex and show multimodalities. For instance, when generating images of birds, they
could generate different instances of birds of the same species with subtle shape or color
differences. Examples of generative deep models are the variational autoencoders (VAE,
Kingma and Welling (2014); Doersch (2016)) and the generative adversarial networks
(GAN, Goodfellow et al. (2014a)), where a generative model is trained to generate images
that are so realistic that a model trained to recognize real from fake ones fails.
● Forward vs. recurrent. The third dimension concerns the functioning of the network.
Most models described above are forward models, meaning that the information
flows once from the input to prediction before errors are backpropagated. However,
when dealing with data structured as sequences (e.g. temporal data) one could make
information flow across the sequence dimension. Recurrent models (RNNs, firstly
introduced in Williams et al. (1986)) exploit this structure to inform the next step in
the sequence of the hidden representations learned by the previous. Backpropagating
information along the sequence also has its drawbacks, especially in terms of vanishing
gradients, i.e. gradients that, after a few recursion steps, become zero and no longer update the model: to cope with this, networks including skip connections called memory gates have been proposed: the Long Short-Term Memory network (LSTM, Hochreiter and Schmidhuber (1997)) and the Gated Recurrent Unit (GRU, Cho et al. (2014)) are the best known.
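To make the supervised/unsupervised distinction concrete, the following is a minimal sketch (not taken from any chapter of this book) of an unsupervised convolutional autoencoder, written assuming PyTorch; the layer sizes, the four-band input, and all names are illustrative assumptions. Here the input itself serves as the training target, in contrast to the supervised CNN/RNN case where labels drive the loss.

```python
# Minimal sketch (illustrative only): an unsupervised convolutional autoencoder.
import torch
import torch.nn as nn

class ConvAutoencoder(nn.Module):
    def __init__(self, in_channels=4, code_channels=8):
        super().__init__()
        # Encoder: compress the image patch into a low-dimensional feature map
        self.encoder = nn.Sequential(
            nn.Conv2d(in_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.MaxPool2d(2),
            nn.Conv2d(16, code_channels, kernel_size=3, padding=1), nn.ReLU(),
        )
        # Decoder: reconstruct the input from the compressed code
        self.decoder = nn.Sequential(
            nn.Upsample(scale_factor=2),
            nn.Conv2d(code_channels, 16, kernel_size=3, padding=1), nn.ReLU(),
            nn.Conv2d(16, in_channels, kernel_size=3, padding=1),
        )

    def forward(self, x):
        return self.decoder(self.encoder(x))

model = ConvAutoencoder()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)
criterion = nn.MSELoss()

x = torch.rand(8, 4, 64, 64)       # a batch of unlabeled 4-band image patches
x_hat = model(x)
loss = criterion(x_hat, x)         # unsupervised: the input itself is the target
optimizer.zero_grad()
loss.backward()
optimizer.step()
```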
1.2 Deep Learning in Remote Sensing
The uptake of deep learning in remote sensing can be roughly organized into the following phases:
● Phase 1: Exploration (2014 to date): The exploration phase is characterized by quick wins,
often achieved by the transfer and tailoring of network architectures from other fields,
most notably from computer vision. To name a few early examples, stacked autoencoders
are applied to extract high-level features from hyperspectral data for classification pur-
poses in Chen et al. (2014). Bentes et al. have exploited deep neural networks for the
detection and classification of objects, such as ships and windparks, in oceanographic
SAR images (Bentes et al. 2015). In 2015, Marmanis et al. (2015) fine-tuned ImageNet pre-trained networks to boost the performance of land use classification with aerial images. Since then, researchers have explored the power of deep learning for a wide range of
classic tasks and applications in remote sensing, such as classification, detection, seman-
tic segmentation, instance segmentation, 3D reconstruction, data fusion, and many more.
Whether using pre-trained models or training models from scratch, it is always about
addressing new and intrinsic characteristics of remote sensing data (Zhu et al. 2017):
– Remote sensing data are often multi-modal. Tailored architectures must be developed
for, e.g. optical (multi- and hyperspectral) (Audebert et al. 2019) and synthetic aperture
radar (SAR) data (Chen et al. 2016; Zhang et al. 2017; Marmanis et al. 2017; Shahzad
et al. 2019), where both the imaging geometries and the content are completely differ-
ent. Data and information fusion use these complementary data sources in a synergistic
way (Schmitt and Zhu 2016). Already prior to a joint information extraction, a crucial
step is to develop novel architectures for the matching of images taken from different perspectives and even different imaging modalities, preferably without requiring an
existing 3D model (Marcos et al. 2016; Merkle et al. 2017; Hughes et al. 2018). Also,
besides conventional decision fusion, an alternative is to investigate transfer learning
from deep features of different imaging modalities (Xie et al. 2016).
– Remote sensing data are geo-located, i.e., each pixel in a remote sensing image corresponds to a geospatial coordinate. This facilitates the fusion of pixel information with
other sources of data, such as GIS layers (Chen and Zipf 2017; Vargas et al. 2019; Zhang
et al. 2019b), streetview images (Lefèvre et al. 2017; Srivastava et al. 2019; Kang et al.
2018; Hoffmann et al. 2019a), geo-tagged images from social media (Hoffmann et al.
2019b; Huang et al. 2018c), or simply other sensors as above.
– Remote sensing time series data is becoming standard, enabled by Landsat, ESA’s
Copernicus program, and the blooming NewSpace industry. This capability is trigger-
ing a shift from individual image analysis to time-series processing. Novel network
architectures must be developed for optimally exploiting the temporal information
jointly with the spatial and spectral information of these data. For example, convo-
lutional recurrent neural networks are becoming baselines in multitemporal remote
sensing data analysis applied to change detection (Mou et al. 2018), crop monitoring
(Rußwurm and Körner 2018b; Wolanin et al. 2020), as well as land use and land
cover classification (Qiu et al. 2019). An important research direction is unsupervised
or weakly supervised learning for change detection (Saha et al. 2019b) or anomaly
detection (Munir et al. 2018) from time series data.
– Remote sensing has irreversibly entered the big data era. We are dealing with very
large and ever-growing data volumes, and often on a global scale. On the one hand this
allows large-scale or even global applications, such as monitoring global urbanization
(Qiu et al. 2020), large-scale mapping of land use/cover (Li et al. 2016a), large-scale
cloud detection (Mateo-García et al. 2018) or cloud removal (Grohnfeldt et al. 2018),
and retrieval of global greenhouse gas concentrations (Buchwitz et al. 2017) and a mul-
titude of trace gases resolved in space, time, and vertical domains (Malmgren-Hansen
et al. 2019). On the other hand, algorithms must be fast enough and sufficiently trans-
ferable to be applied for the whole Earth surface/atmosphere, which in turn calls for
large and representative training datasets, which is the main topic of phase 2.
In addition, it is important to mention that – unlike in computer vision – classification
and detection are only small fractions of remote sensing and Earth observation prob-
lems. Actually, most of the problems are related to the retrieval of bio-geo-physical or
bio-chemical variables. This will be discussed in section 1.3.
● Phase 2: Benchmarking (2016 to date): To train deep learning methods with good gen-
eralization abilities and to compare different deep learning models, large-scale bench-
mark datasets are of great importance. In the computer vision community, there are
many high-quality datasets available which are dedicated to, for example, image clas-
sification, semantic segmentation, object detection, and pose estimation tasks. To give an
example, the well-known ImageNet image classification database consists of more than
14 million hand-annotated images cataloged into more than 20,000 categories (Deng et al.
2009). It is debatable whether the computer vision community is too much driven by the
benchmark culture, instead of caring about real-world challenges. In remote sensing it is,
however, the other extreme – we are lacking sufficient training data. For example, most
classic methodological developments in hyperspectral remote sensing have been based on
only a few benchmark images of limited sizes, let alone the annotation-demanding deep learning methods. To push deep learning related research in remote sensing, community efforts in generating large-scale real-world scenario benchmarks are due. Motivated by
this, since 2016 an increasing number of large-scale remote sensing datasets have become
available covering a variety of problems, such as instance segmentation (Chiu et al. 2020;
Weir et al. 2019; Gupta et al. 2019), object detection (Xia et al. 2018; Lam et al. 2018),
semantic segmentation (Azimi et al. 2019; Schmitt et al. 2019; Mohajerani and Saeedi
2020), (multi-label) scene classification (Sumbul et al. 2019; Zhu et al. 2020), and data
fusion (Demir et al. 2018; Le Saux et al. 2019). To name a few examples:
– DOTA (Xia et al. 2018): This is a large-scale dataset for object detection in aerial
images, which collects 2806 aerial images from different sensors and platforms containing objects exhibiting a wide variety of scales, orientations, and shapes. In total,
it contains 188,282 object instances in 15 common object categories and serves as a
very important benchmark for development of advanced object detection algorithms
in very high resolution remote sensing.
– So2Sat LCZ42 (Zhu et al. 2020): This is a benchmark dataset for global local climate
zones classification. It is a rigorously labeled reference dataset in EO. Over one month
15 domain experts carefully designed the labeling workflow, the error mitigation strat-
egy, the validation methods, and conducted the data labeling. It consists of manually
assigned local climate zone labels of 400,673 Sentinel-1 and Sentinel-2 image patch
pairs globally distributed in 42 urban agglomerations covering all the inhabited con-
tinents and 10 cultural zones. In particular, it is the first EO dataset that provides a
quantitative measure of the label uncertainty, achieved by letting a group of domain
experts cast 10 independent votes on 19 cities in the dataset.
Ilg et al. (2018); Kohl et al. (2018) proposed BNNs that output a number of plausible hypotheses, enabling the creation of distributions over outputs and the measurement of uncertainties. Bayesian deep learning (BDL) offers a probabilistic interpretation of deep
learning models by inferring distributions over the models’ weights (Wang and Yeung
2016; Kendall and Gal 2017). These models, however, have not been applied extensively
in the Earth Sciences, where, given the relevance of uncertainty propagation and quan-
tification, they could find wide adoption. Only some pilot applications of deep Gaussian
processes (Svendsen et al. 2018) for parameter retrieval and BNNs for time series data
analysis (Rußwurm et al. 2020) are worth mentioning. In summary, the Bayesian deep
learning community has developed model-agnostic and easy-to-implement methodol-
ogy to estimate both data and model uncertainty within deep learning models, which
has great potential when applied to remote sensing problems (Rußwurm et al. 2020).
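As an illustration of such model-agnostic uncertainty estimation, the following minimal sketch (assuming PyTorch; the network size and inputs are illustrative assumptions) uses Monte Carlo dropout, one simple approximation in the spirit of the Bayesian deep learning methods cited above; it is not the specific method of any of the cited works.

```python
# Minimal sketch (illustrative only): Monte Carlo dropout for predictive uncertainty.
import torch
import torch.nn as nn

net = nn.Sequential(
    nn.Linear(12, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 64), nn.ReLU(), nn.Dropout(p=0.2),
    nn.Linear(64, 1),
)

def mc_dropout_predict(model, x, n_samples=50):
    model.train()  # keep dropout active at inference time
    with torch.no_grad():
        preds = torch.stack([model(x) for _ in range(n_samples)])
    # mean = prediction, std = a proxy for model (epistemic) uncertainty
    return preds.mean(dim=0), preds.std(dim=0)

x = torch.rand(4, 12)              # e.g. 12 illustrative input features per pixel
mean, std = mc_dropout_predict(net, x)
```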
Other open issues that recently caught the attention in the remote sensing community
include but are not limited to: hybrid models integrating physics-based modeling into
deep neural networks, efficient deep nets, unsupervised and weakly supervised learning,
network architecture search, and robustness in deep nets.
1.3 Deep Learning in Geosciences and Climate
Physical modeling and machine learning have often been treated as completely different and irreconcilable fields, as if scientists had to adhere to either a theory-driven or a data-driven approach. Yet these approaches are indeed complementary: physical approaches are
interpretable and allow extrapolation beyond the observation space by construction, and
data-driven approaches are highly flexible and adaptive to data. Their synergy has gained
attention lately in the geosciences (Karpatne et al. 2017a; Camps-Valls et al. 2018b).
Interactions can be diverse (Reichstein et al. 2019). There are several ways in which Physics
and DL can interact within Earth Sciences:
● Improving parameterizations. Physical models require parameters that can seldom be derived from first principles. Deep learning can learn such parameterizations to opti-
mally describe the ground truth that can be observed or generated from detailed and
high-resolution models of clouds (Gentine et al. 2018). In the land domain, for example,
instead of assigning parameters of the vegetation in an Earth system model to plant func-
tional types, one can allow these parameterizations to be learned from proxy covariates
with machine learning, allowing them to be more dynamic, interdependent, and contex-
tual (Moreno-Martínez et al. 2018).
● Surrogate modeling and emulation. Surrogate modeling, also known as emulation,
is gaining popularity in remote sensing (Camps-Valls et al. 2016; Reichstein et al. 2019;
Camps-Valls et al. 2019). Emulators are essentially statistical models that learn to mimic
the energy transfer code using a small yet representative dataset of simulations. Emula-
tors allow one to readily perform fast forward simulations, which in turn allow improved
inversion. However, replacing a simulator (e.g. an RTM or a climate model (sub)component) with a deep model requires running expensive evaluations offline first. Recent, more efficient
alternatives construct an approximation to the forward model starting with a set of
optimal RTM simulations selected iteratively (Camps-Valls et al. 2018a; Svendsen et al.
2020). This topic is related to active learning and Bayesian optimization, which might
push results further in accuracy and sparsity, especially when modeling complex codes
such as climate model components.
● Blending networks and process-based models. Including knowledge through extra regu-
larization that forces DL models to respect some physics laws can be seen as a form of
inductive bias for which ML is prepared with many optimization techniques (Kashinath
et al. 2019; Wu et al. 2018). A fully coupled net can be devised: here, layers describ-
ing complicated and uncertain processes feed physics-layers that encode known rela-
tions of intermediate feature representation with the target variables (Reichstein et al.
2019). The integration of physics into DL models not only achieves improved generalization but, more importantly, endows DL models with consistency and faithfulness. As a
by-product, the hybridization process has an interesting regularization effect, as physics
discards implausible models and promotes simpler, sparser structures.
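A minimal sketch of the physics-as-regularization idea in the last item is given below, assuming PyTorch; the toy constraint (predicted runoff should not exceed precipitation), the weighting factor, and all variable names are illustrative assumptions, not a method described in this book.

```python
# Minimal sketch (illustrative only): adding a physics-based penalty to the loss
# so the network is pushed to respect a known constraint.
import torch
import torch.nn as nn
import torch.nn.functional as F

model = nn.Sequential(nn.Linear(5, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)

x = torch.rand(64, 5)        # inputs; the first column is taken as precipitation
y = torch.rand(64, 1)        # observed target, e.g. runoff
precip = x[:, :1]

pred = model(x)
data_loss = F.mse_loss(pred, y)
# Physics penalty: discourage predictions that violate the (illustrative) constraint
physics_penalty = torch.relu(pred - precip).mean()
loss = data_loss + 0.1 * physics_penalty   # 0.1 weights the physics-based inductive bias
optimizer.zero_grad()
loss.backward()
optimizer.step()
```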
An important point and active research field is that of explainability of the derived DL
models. Interpreting what the DL model learned is important to understand how the system works, to debug or improve it, to anticipate unforeseen circumstances, to build up trust in
the technology, to understand the strengths and limitations, to audit a prediction/decision,
to facilitate monitoring and testing, and to guide users into actions or behaviors. A plethora
of techniques have been developed to gain insight from a model (Molnar 2019; Samek
et al. 2017): (1) feature visualization to characterize the network architecture (e.g. how
redundant, outlier-prone, or adversarial-sensitive the network is); (2) feature attribution
to analyze how each input contributed to a particular prediction; and (3) model distillation
that explains a neural network with a surrogate simpler (often linear or tree-based) model.
Several works in remote sensing and geosciences have studied interpretability of deep nets.
For example, Kraft et al. (2019) introduced a model-agnostic method based on time-series permutation, which allows memory effects of climate and vegetation affecting net ecosys-
tem CO2 fluxes in forests to be studied. In Wolanin et al. (2020), activation maps of hidden
units in convolutional nets were studied for crop yield estimation from remote sensing data;
analysis suggested that networks mainly focus on growing seasons and can provide a rank-
ing of more important covariates. Recently, in Toms et al. (2019a), the method of layer-wise
relevance propagation (LRP) (Montavon et al. 2018) was used to study patterns in Earth
System variability.
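As a concrete, hedged illustration of technique (2), the following minimal sketch (assuming PyTorch; the network and input sizes are arbitrary assumptions) computes a simple gradient×input attribution; it is not the LRP or permutation methods cited above.

```python
# Minimal sketch (illustrative only): gradient-based feature attribution.
import torch
import torch.nn as nn

model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))

x = torch.rand(1, 10, requires_grad=True)   # one sample with 10 input covariates
prediction = model(x)
prediction.sum().backward()                 # gradient of the prediction w.r.t. the input

# Inputs with large |gradient * input| are those the prediction is most sensitive to
attribution = (x.grad * x.detach()).abs().squeeze()
ranking = attribution.argsort(descending=True)
```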
1.4 Book Structure and Roadmap
This book is not conceived as a mere compilation of works about Deep Learning in Earth Sciences but rather aims to put a carefully selected set of puzzle pieces together to highlight the scope of relevant milestones at the intersection of both fields. We start the book with this introductory chapter, which treats the main challenges and opportunities in Earth Sciences. After this, the book is split into three main Parts:
Part I. Deep learning to extract information from remote sensing images. The first
part is devoted to extracting information from remote sensing images. We depart from novel developments in unsupervised learning, move to weakly supervised models, and then review the main applications that involve supervised learning.
The field of unsupervised learning in Earth observation problems is of paramount relevance, given the high cost of obtaining labeled data in terms of resources, both human and economic. The concepts of sparse representations, compactness, and expressive features have emerged in canonical – yet unsupervised – convolutional neural networks (Ch. 2). Simulating processes with neural networks has also found wide application in geoscientific problems, in particular with Generative Adversarial Networks (GANs) (Ch. 3). When a few labeled data are also available, semisupervised and self-taught learning emerge as potentially useful fields (Ch. 4). Supervised learning is, however, the most active area in the field, and we included dedicated chapters on the most relevant aspects: image classification and segmentation (Ch. 5) and object recognition in remote sensing images (Ch. 6). However, problems involving data classification often fail because the training and test domains differ in their statistics. Here, adapting either the classifier or the feature representation becomes crucial. The key problem of adapting domains for data classification is surveyed in Ch. 7. When time is involved in the detection and classification of land use and land covers, recurrent neural networks excel; an extensive review of these deep learning techniques is provided in Ch. 8. Yet deep learning has also been used in contexts where information extraction does not mean straightforward goals like classification or detection, for instance, whenever one needs to perform particular operations like image matching and co-registration (Ch. 9), multisource image fusion (Ch. 10), and search and retrieval of images in huge data archives (Ch. 11).
Part II. Making a difference in the geosciences with deep learning. The second part
of the book deals with a selected collection of problems where deep learning has made
a big difference compared to previous approaches: problems that involve real target
variables, particular data structures (spatio-temporal, strongly time autocorrelated,
extremely high dimensional volumetric atmospheric data), challenging problems like
the detection of climate extremes, weather forecasting and nowcasting, and the study of
the cryosphere.
The part starts with a chapter dedicated to the detection of extreme weather patterns (Ch. 12). Spatio-temporal data and teleconnections are present in many weather and climate studies; here, spatio-temporal autoencoders can discover the underlying modes of variability in data cubes to represent the underlying processes (Ch. 13). Deep learning to
improve weather predictions is treated thoroughly in Ch. 14. Ch. 15 reviews the problem
of weather forecasting and in particular that of precipitation nowcasting architectures.
Deep learning has found enormous challenges when working with high-dimensional
data for parameter retrieval; Ch. 16 shows developments to approach the problem in the
atmosphere and the cryosphere. An extensive review of DL for cryospheric studies is
provided in Ch. 17. The part ends with the application of recurrent networks to emulate
and learn ecological memory (Ch. 18).
Part III. Linking physics and deep learning models. The field has grown enormously
in methods and applications. Yet a wide adoption in the broad field of Earth Sciences
is still missing, mainly due to the fact that (a) models are often overparameterized black
boxes, hence interpretability is often compromised, (b) deep learning models often do not
respect the most elementary laws of physics (like advection, convection, or the conser-
vation of mass and energy), and (c) they are, after all, costly models to train from scratch, needing a huge amount of labeled data, hence the democratization of machine learning is not yet a reality. These issues have recently been tackled by pioneering works at the
interface between machine learning and physics, and will be reviewed as well in this last
part of the book, where physics-aware deep learning hybrid models will be treated.
We start the part with a chapter dedicated to the impact of deep learning in hydrology, another field where DL has recently made an impact; applications and new architectures suited
to the problems are treated in detail in Ch. 19. A field where DL has found enormous
interest and adoption recently is that of parametrization of subgrid processes for unre-
solved turbulent ocean processes (Ch. 20) and climate models in general (Ch. 21). Using
deep learning to correct theoretically-derived models opens a new path of interesting
applications where machine learning and physics interact (Ch. 22).
We end the book with some final words and perspectives in Ch. 23. We review where we are now and the challenges ahead. We treat issues such as adapting architectures to data characteristics (to deal with, e.g., long-range relations), interpretability and explain-
ability, hybrid modeling as evolved forms of data assimilation techniques, learning plausible
physics models, and the more ambitious goal of learning expressive causal representations
from data, DL models, and assumptions.
Supporting material is also included in two forms. On the one hand, real and advanced application examples are provided in each chapter. On the other hand, we also provide scripts,
code, and pointers to toolboxes and applications of deep learning in the geosciences in a
dedicated GitHub site:
https://github.com/DL4ES
In this dedicated repository, many links are maintained to other widely used software
toolboxes for deep learning and applications in the Earth Sciences. This repository is
periodically updated with the latest contributions, and can be helpful for the Earth and
climate data scientist.
We sincerely hope you enjoy the material and that it serves your purposes!
Part I
2
Learning Unsupervised Feature Representations of
Remote Sensing Data with Sparse Convolutional
Networks
Jose E. Adsuara, Manuel Campos-Taberner, Javier García-Haro, Carlo Gatta, Adriana Romero, and Gustau Camps-Valls
2.1 Introduction
Fields like remote sensing, computer vision, or natural language processing typically work
in the so-called structured domains, in which the original data representation has temporal
and/or spatial dimensions defined in uniform grids. From a geometrical viewpoint, data can
be represented in their original coordinates, but visualizing, understanding, and designing
algorithms therein is challenging, mainly due to the high dimensionality and correlation
between covariates. This is why learning alternative, typically simpler and compact, feature
representations of the data has captured a lot of interest in the scientific community. This
is the field of dimensionality reduction or feature extraction, for which one has both super-
vised and unsupervised algorithms (Rojo-Álvarez et al. 2018; Bengio et al. 2013; Hinton and
Salakhutdinov 2006a).
Unsupervised learning is the preferred approach in cases of label sparsity. Different
algorithms implementing different criteria are available. Principal Component Analysis
(PCA) (Jolliffe 1986) is one of the most popular methods for dimensionality reduction
due to its easy implementation and interpretability. Two relevant, and often unrealistic, assumptions are made, though: linearity and orthogonality. In the last decade, a profusion of
non-linear dimensionality reduction methods including both manifold (Lee and Verleysen
2007) and dictionary learning (Kreutz-Delgado et al. 2003) have sprung up in the literature.
Non-linear manifold learning methods can be mainly categorized as either local or global
approaches (Silva and Tenenbaum 2003). Local methods retain local geometry of data,
and are computationally efficient since only sparse matrix computations are needed.
Global manifold methods, on the other hand, keep the entire topology of the dataset,
yet are computationally expensive for large datasets. They have higher generalization
power, but local ones can perform well on datasets with different sub-manifolds. Local
manifold learning methods include, inter alia, locally linear embedding (LLE) (Roweis and
Saul 2000), local tangent space alignment (LTSA) (Zhang and Zha 2004), and Laplacian
eigenmaps (LE) (Belkin and Niyogi 2003). Basically, these approaches build local structures
to obtain a global manifold. Among the most widely used global manifold methods the
ISOMAP (Tenenbaum et al. 2000) and the kernel version of PCA (kPCA) (Schölkopf
et al. 1998) stand out, as well as kernel-based and spectral decompositions that learn
mappings optimizing for maximum variance, correlation, entropy, or minimum noise
fraction (Arenas-García et al. 2013), and their sparse versions (Tipping 2001). In addition
there exist neural networks that generalize PCA to encode non-linear data structures via
autoencoding networks (Hinton and Salakhutdinov 2006a), as well as projection pursuit
approaches leading to convenient Gaussian domains (Laparra et al. 2011).
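For readers wishing to experiment, the following is a minimal sketch (assuming scikit-learn; data shapes are illustrative and not those of the chapter's experiments) contrasting linear PCA with its kernel counterpart for unsupervised dimensionality reduction of pixel spectra.

```python
# Minimal sketch (illustrative only): linear PCA versus kernel PCA (kPCA).
import numpy as np
from sklearn.decomposition import PCA, KernelPCA

X = np.random.rand(1000, 64)          # 1000 pixels with 64 spectral bands

pca = PCA(n_components=10)
Z_linear = pca.fit_transform(X)       # linear, orthogonal projections

kpca = KernelPCA(n_components=10, kernel="rbf", gamma=0.1)
Z_nonlinear = kpca.fit_transform(X)   # non-linear features via the RBF kernel
```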
In recent years, the use of deep learning techniques has become a trending topic in remote sensing and geosciences due to the increasing availability of large datasets. Some excellent reviews on this topic have been published by Zhang et al. (2016b),
Zhu et al. (2017), and Ma et al. (2019). Deep learning methods can deal with the intrinsic
problems related with the analysis of non-linear spatial-spectral datasets.
Several unsupervised neural nets are available too; for example, an autoencoder is a
deep learning architecture in which the input signal is reconstructed/decoded at the
output layer through an intermediate layer with reduced number of hidden nodes.
Basically, the autoencoder aims at reproducing the inputs at the output layer by using
the high-abstraction features learned in the intermediate layer. The use of autoencoders
implies, however, the tuning of several free parameters, addressing the regularization issue
mainly by limiting the structure of the network heuristically. The use of autoencoders
in remote sensing is widespread for a wide range of applications, including feature
extraction and image classification (Zabalza et al. 2016; Othman et al. 2016), spectral
unmixing (Guo et al. 2015; Su et al. 2019), image fusion (Azarang et al. 2019), and
change detection (Lv et al. 2018). However, autoencoders require tuning several critical
hyperparameters.
While the issue of non-linear feature extraction can be resolved efficiently with deep networks, there is still the issue of what sensible criterion should be employed. Natural data, namely, data generated by a physical process or mechanism, often show strong autocorrelation functions (in space and time), heavy tails, and strong, not necessarily linear, feature correlations, and (from a manifold learning perspective) the data live in subspaces of (much) lower dimensionality. This can be interpreted as meaning that the observations were generated by much simpler mechanisms, and that one should seek a sparse representation.
In this sense, sparse coding and dictionary learning offer an efficient way to learn sparse image features in unsupervised settings, which are eventually used for image classification and object recognition. A dictionary can be understood as a set of bases that sparsely represent the original data. The main goal in dictionary learning is to find a dictionary that best represents the inputs by using only a small subset of its elements (also known as atoms). Many applications and approaches have been developed using dictionary
learning: image denoising via sparse and redundant representations (Elad and Aharon
2006), spatial-spectral sparse-representation and image classification using discriminative
dictionaries (Wang et al. 2013b), change detection based on joint dictionary learning (Lu
et al. 2016), image classification with sparse kernel networks (Yang et al. 2014), image
pansharpening by means of sparse representations over learned dictionaries (Li et al.
2013), large-scale remote sensing image representation based on including particle swarm
optimization into online dictionary learning (Wang et al. 2015), image segmentation with
saliency-based codes (Dai and Yang 2010; Rigas et al. 2013), image super-resolution based
on patch-wise sparse recovery (Yang et al. 2012), automatic target detection employing
sparse bag-of-words codes (Sun et al. 2012), unsupervised learning of sparse features
for aerial image classification (Cheriyadat 2014), and cloud removal based on sparse
representation using multitemporal dictionary learning (Xu et al. 2016). These methods
describe the input images in sparse representation spaces but do not take advantage of the highly non-linear nature of deep architectures.
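As a concrete illustration of the classical (shallow) sparse coding setting just reviewed, here is a minimal sketch assuming scikit-learn; patch and dictionary sizes are illustrative assumptions.

```python
# Minimal sketch (illustrative only): learning a dictionary of atoms so that
# image patches are represented by sparse codes.
import numpy as np
from sklearn.decomposition import MiniBatchDictionaryLearning

patches = np.random.rand(5000, 8 * 8)      # flattened 8x8 patches

dico = MiniBatchDictionaryLearning(
    n_components=64,                       # number of atoms in the dictionary
    transform_algorithm="omp",             # sparse coding by orthogonal matching pursuit
    transform_n_nonzero_coefs=5,           # at most 5 active atoms per patch
)
codes = dico.fit(patches).transform(patches)   # sparse representation of each patch
```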
However, attaining sparse non-linear deep networks is still unresolved in the literature,
especially in unsupervised learning. In the next section, we introduce a methodology to
learn sparse spatial-spectral feature representations in deep convolutional neural network
architectures in an unsupervised way.
The optimization is performed by means of Stochastic Gradient Descent (SGD) with adap-
tive learning rates (Schaul et al. 2013).
2.2.3 Remarks
The learned hierarchical representations of the input data (in our case, remote sensing
images) are used for classification, where lower layers extract low-level features and higher
layers exhibit more abstract and complex representations. The methodology is fully unsu-
pervised, which is a different (and more challenging) setting from the common supervised use of convolutional nets.
[Figure 2.1 diagram: a convolutional layer maps the input representation H^(l−1) to activations H^l, which the EPLS algorithm turns into a sparse target matrix T^l; the layer parameters are fit as W^(l*) = argmin_{W^l} ||H^l − T^l||_2^2 and b^(l*) = argmin_{b^l} ||H^l − T^l||_2^2.]
Figure 2.1 Scheme of the proposed method for unsupervised and sparse learning of image feature representations, where a convolutional neural network is trained iteratively, driven by the EPLS algorithm that generates a sparse output (pseudo-labels) target matrix. The EPLS algorithm selects the output that has the maximal activation value, thus ensuring population sparsity, while balancing how frequently each element activates, ensuring lifetime sparsity. More details on the EPLS algorithm can be found in Romero et al. (2014).
After training the parameters of a network, we can proceed to extract feature represen-
tations. To do so, we must choose an encoder to map the input feature map of each layer
to its representation, i.e. we must choose the non-linearity to be used after applying the
learned filters to all input locations. A straightforward choice is the use of a natural encod-
ing, i.e. the non-linearity used to compute the output of each layer. However, different
training and encoding strategies might be combined together. Summarizing, we train deep
architectures by means of greedy layer-wise unsupervised pre-training in conjunction with
EPLS and choose a feature encoding strategy for each specific problem. The interested
reader may find an implementation of the EPLS algorithm at https://sites.google.com/site/adriromsor/epls.
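The following is a minimal sketch of one such greedy, layer-wise unsupervised training step, written assuming PyTorch. It only mimics the population-sparsity part of the scheme (one active output per location, taken at the maximal activation); the actual EPLS algorithm (Romero et al. 2014) also balances how often each output is selected and uses adaptive learning rates (Schaul et al. 2013). All sizes and names are illustrative assumptions.

```python
# Minimal sketch (illustrative only): one layer-wise unsupervised step with a
# sparse pseudo-label target built from the layer activations.
import torch
import torch.nn as nn

layer = nn.Sequential(nn.Conv2d(4, 64, kernel_size=3, padding=1), nn.Sigmoid())
optimizer = torch.optim.SGD(layer.parameters(), lr=0.1)   # plain SGD for brevity

x = torch.rand(16, 4, 32, 32)          # unlabeled image patches
H = layer(x)                           # activations: (batch, 64, 32, 32)

# Build a sparse pseudo-label target: one active output per spatial location
with torch.no_grad():
    T = torch.zeros_like(H)
    winners = H.argmax(dim=1, keepdim=True)
    T.scatter_(1, winners, 1.0)

loss = ((H - T) ** 2).mean()           # fit the layer to the sparse targets
optimizer.zero_grad()
loss.backward()
optimizer.step()
```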
2.3 Applications
To make the potential of the introduced methodology clear, we use it for the classification of hyperspectral images and for multisensor multispectral and LiDAR image fusion.
[Figure 2.2 plots: κ statistic curves comparing PCA, kPCA, and the single-layer network (NNET) with 1×1, 3×3, and 5×5 receptive fields (left panel), and networks with 2 to 7 layers, L2–L7 (right panel).]
Figure 2.2 Kappa statistic (classification accuracy estimated) for several numbers of features (left),
spatial extent of the receptive fields (for the single-layer network) or the included Gaussian filtered
features (for PCA and kPCA) using 30% of data for training; and for different rates of training
samples (right), {1%, 5%, 10%, 20%, 30%, 50%}, with pooling.
layer-wise fashion by means of EPLS with logistic non-linearity. Then, natural encoding has
been used without polarity split to extract the network’s features. Figure 2.2(b) highlights
that using a few supervised samples to train a deep CNN can provide better results than
using far more supervised samples to train a single-layer one. Note, for instance, that the
6-layer network using 5% samples/class outperforms the best single-layer network using
30% of the samples/class.
An important aspect of the proposed deep architectures lies in the fact that they typi-
cally give rise to compact hierarchical representations. In Figure 2.3, and for a subset of
the whole image, the best three features extracted by the networks according to the mutual
information between features and labels are depicted. The deeper we go, the more compli-
cated and abstract features we retrieve, except for the seventh layer that provides spatially
over-regularized features due to the downscaling impact of the max-pooling stages. Interest-
ingly, it is also observed that, the deeper structures we use, the higher spatial decorrelation
of the best features we obtain.
Figure 2.3 The three most informative features (in rows) of the outputs of layers 1 to 7 (in columns), according to the mutual information with the labels, for a subregion of the whole image.
Figure 2.4 Top: bases learned by the convolutional network using EPLS for RGB (a), LiDAR (b), and RGB+LiDAR (c). Bottom: the corresponding topological representations using the first two ISOMAP components.
2.4 Conclusions
Unsupervised learning remains an important and challenging endeavor for artificial intel-
ligence in general, and for Earth Sciences in particular. We treated the topic of learning
feature representations from multivariate remote sensing data when no labels are available.
Other chapters of this book deal with deep generative models like Generative Adversarial Networks and Variational AutoEncoders, and with self-taught learning; all rely on the idea of finding a good latent embedding space autonomously.
Extracting meaningful representations in this scenario is challenging because an objective criterion is missing. We want to highlight that whatever criterion is chosen will be, to some extent, arbitrary. Here we focused on including sparsity in standard convnets, even if the frame-
work around the EPLS algorithm could be in principle applied to any other deep learning
architecture. Sparsity is a very sensible criterion, motivated by the neuroscience literature,
and very useful in practice: sparse representations enforce faster, more compact and inter-
pretable models. We reviewed the field of sparse coding in deep learning, illustrated the
framework in several remote sensing applications, and paid special attention to accuracy,
robustness, and interpretability of the extracted features. We have confirmed that the deeper the neural network, the more advantageous the sparse representation, in line with results widely observed in the supervised literature. We have also shown that the network results in a less compact representation when it fuses data with (physically orthogonal and
3
Generative Adversarial Networks in the Geosciences
Gonzalo Mateo-García, Valero Laparra, Christian Requena-Mesa, and Luis Gómez-Chova
3.1 Introduction
In recent years, deep learning has been applied to develop generative models using mainly three different approaches.
Variational autoencoders (VAEs) (Doersch 2016) are a class of autoencoders that use
deep neural networks architectures for the encoder and the decoder, and the parameters
are optimized to enforce the statistical properties of the data in the latent space. VAEs allow
new synthetic data to be generated following the distribution of the training data. In order
to do so, one only has to generate data following the distribution defined in the interme-
diate stage and apply the decoder network. While VAE is an interesting technique, it has
not been widely adopted in remote sensing yet. Further insight into VAEs and their use for
Earth System Science can be found in Chapter 13.
Another popular approach is based on normalizing flows (Jimenez Rezende and
Mohamed 2015), which rely on architectures that respect some properties of the data,
like the dimensionality. Similarly to VAEs, they enforce a particular distribution in the transformed domain, the selected distribution being a multivariate Gaussian in most cases. However, this technique has not been extensively used in remote sensing problems.
Probably the most used generative methods based on deep learning are generative adver-
sarial networks (GANs) (Goodfellow et al. 2014b). Of the three mentioned methods, GANs
have excelled in several problems, and in the Earth Sciences in particular. Application of
GANs has had an enormous impact on fields like image and language processing. Some of
the applications of GANs have become the state of the art in these fields where there is a
clear spatial and/or temporal data structure (see for instance Gonog and Zhou (2019)).
In the last decade, a plethora of models and architectures based on the fundamentals of GANs have been developed theoretically and widely used in real-world applications. While presenting a taxonomy of this huge number of methods is beyond the scope of this chapter, most of the approaches can be divided into three main families. The first family corresponds
to the regular GANs, where the architecture is similar to the VAE or the normalizing flow
methods. The second family is the conditional GANs, where an extra input, on which the
3.2 Generative Adversarial Networks 25
model conditions some properties of the generated data, is added to the model. Finally, the third family is devoted to merging different GAN structures, which helps to deal with unaligned datasets, as we will see in detail in this chapter.
In the field of Earth Sciences, GANs have been used for multiple problems, as we will
review in section 3.3. Firstly, GANs have been used for generating synthetic samples to be
used in supervised, semi-supervised, and unsupervised problems. However, their use is not
restricted to these tasks. For instance, an interesting problem in remote sensing is domain
adaptation (DA), since some GANs architectures can be used to adapt existing datasets or
algorithms to different satellite features.
3.2 Generative Adversarial Networks
In this section, we review the three main families of GAN architectures. The original GANs were proposed as an unsupervised method, in which the only requirement is to have data from the distribution from which we want to synthesize new samples. However, in practice, when we generate new samples, we might want to control the characteristics of the generated data; for instance, the class to which the generated samples belong. Conditional GANs were proposed to address this issue. In conditional GANs, the generator synthesizes data taking into account particular auxiliary information given as input. Conditional GANs
thus need to be provided with this extra information. Note that this approach loses the main
advantage of GANs, which is being an unsupervised method. The idea of the third family
is a little bit different from the previous ones but probably is the most interesting one. This
family is based on an autoencoder architecture where the encoder and decoder networks
are the generator part of two different GANs. Therefore, it is devoted to finding transforma-
tions to convert from one class of data to another. The extra discriminator terms added to the autoencoder architecture help the learning process when using unpaired datasets for training, since having paired data is a strong requirement, e.g. for domain adaptation
problems.
The discriminator outputs the probability of the sample being real. When the provided data comes from the generator, this probability is used as the error metric in order to improve the generator.
In particular, the loss functions of the discriminator (D) and the generator (G) are given by:
$$ \mathcal{L}_{\mathrm{GAN}}(D) = -\mathbb{E}_X[\log(D(x))] - \mathbb{E}_R[\log(1 - D(\underbrace{G(r)}_{\hat{x}}))], \qquad (3.1) $$
$$ \mathcal{L}_{\mathrm{GAN}}(G) = -\mathbb{E}_R[\log(D(\underbrace{G(r)}_{\hat{x}}))]. \qquad (3.2) $$
Note that Equations 3.1 and 3.2 are adversarial since minimizing Equation 3.1 will push $D(\hat{x})$ towards zero, whereas minimizing Equation 3.2 will push $D(\hat{x})$ towards 1. The GAN proposal consists of minimizing these equations iteratively using a gradient-descent-based method w.r.t. the weights of the discriminator and generator networks, respectively.
The main novelty of GANs is the proposal of a non-parametric and adaptive cost function. This cost function can be seen as a way to evaluate the likelihood of the generated data but without using a fixed likelihood function (Gutmann et al. 2018). Instead, a non-parametric method (e.g. a neural network) is used to learn the likelihood function while training the generator. This allows the likelihood function to change from less to more restrictive during training, which helps the generator learning procedure since, at the beginning of training, it is difficult for the generator to produce reliable samples. A very restrictive likelihood function at this stage would stall the generator learning process, i.e., the likelihood of the generated samples would always be zero and the generator would have no useful gradients towards improving the generated samples.
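To make Equations 3.1 and 3.2 concrete, the following is a minimal PyTorch sketch of one adversarial update. The modules G and D, their optimizers, and the assumption that D outputs probabilities in (0, 1) are illustrative placeholders, not details taken from the chapter.

```python
import torch

def gan_step(G, D, opt_G, opt_D, x_real, r, eps=1e-8):
    """One adversarial update following Eqs. (3.1)-(3.2).
    G and D are assumed torch.nn.Module instances; D ends with a sigmoid."""
    # --- Discriminator update: minimize Eq. (3.1) ---
    x_fake = G(r).detach()                      # stop gradients flowing into G
    loss_D = -(torch.log(D(x_real) + eps).mean()
               + torch.log(1.0 - D(x_fake) + eps).mean())
    opt_D.zero_grad(); loss_D.backward(); opt_D.step()

    # --- Generator update: minimize Eq. (3.2) ---
    loss_G = -torch.log(D(G(r)) + eps).mean()
    opt_G.zero_grad(); loss_G.backward(); opt_G.step()
    return loss_D.item(), loss_G.item()
```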
While GAN-generated samples are state of the art, several extensions have been proposed in order to restrict the generated data to have particular characteristics. For instance, InfoGAN (Chen et al. 2016) maximizes the mutual information between some latent space features and the target space. By doing so, we can impose different features when generating the data while, at the same time, exploring the feature space.
We see that, in this approach, the discriminator also has two inputs: either real ($x$) or fake data ($\hat{x}$), and the auxiliary information ($y$). Hence, the discriminator tries to distinguish samples from the joint distribution of $X$ and $Y$. This in turn forces the generator to be consistent with its input (the auxiliary information $y$), avoiding the mode collapse problem (where the generator memorizes one sample that is always produced as output). On the other hand, this makes the method dependent on paired samples.
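A toy sketch of such a conditional discriminator is given below; it assumes the auxiliary information is available as an image-shaped tensor of the same spatial size as the sample (all layer sizes are arbitrary and not from the chapter).

```python
import torch
import torch.nn as nn

class CondDiscriminator(nn.Module):
    """Toy conditional discriminator: the condition y (e.g. a class map or an
    auxiliary raster) is stacked with the image as extra channels, so the
    network scores samples from the joint distribution of (x, y)."""
    def __init__(self, x_channels, y_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(x_channels + y_channels, 32, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.Conv2d(32, 64, 3, stride=2, padding=1),
            nn.LeakyReLU(0.2),
            nn.AdaptiveAvgPool2d(1),
            nn.Flatten(),
            nn.Linear(64, 1),
            nn.Sigmoid(),              # outputs P(real | x, y)
        )

    def forward(self, x, y):
        return self.net(torch.cat([x, y], dim=1))
```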
The first ($\mathcal{L}_{\mathrm{GAN}_1}$) and second ($\mathcal{L}_{\mathrm{GAN}_2}$) parts of the cost function correspond to the cost function of the conditional GANs, where the auxiliary input information corresponds to the output of the other generator. The third part ($\mathcal{L}_{\mathrm{CYC}}$) is the classical autoencoder cost
function. It enforces that a sample passed through both generators has to be similar to itself:
$$ \mathcal{L}_{\mathrm{CYC}} = \left\| x_1 - G_2(\underbrace{G_1(x_1)}_{\hat{x}_2}) \right\|. \qquad (3.6) $$
As in classical autoencoders, different norm functions can be used in this term depending on the final goal.
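As a minimal illustration of Equation 3.6 (not from the original text; G1 and G2 are assumed generator modules mapping between the two domains):

```python
import torch

def cycle_consistency_loss(x1, G1, G2, norm="l1"):
    """Cycle-consistency term of Eq. (3.6): a sample mapped to the other domain
    by G1 and back by G2 should reconstruct itself."""
    x1_rec = G2(G1(x1))
    if norm == "l1":                          # robust choice, common in practice
        return (x1 - x1_rec).abs().mean()
    return ((x1 - x1_rec) ** 2).mean()        # alternative: squared L2 norm
```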
Domain adaptation methods assume that the training and test datasets come from different domains, and GANs can be useful to find invariant representations for both the training and test data (Elshamli et al. 2017). This happens, for example, in the classification of aerial images when one wants to transfer previously labeled data to the new images (Zhu et al. 2019b), and DA with GANs helps to reduce the bias between the source and target distributions, increasing the discriminative ability of generated features (Yan et al. 2019; Liu et al.
2019). On the other hand, a similar but more difficult situation arises when dealing
with two or more different satellite platforms. The idea is still the same and GANs
are used to obtain a better adaptation between the two satellites’ data during the
test phase without carrying out a separate training for each platform. For example,
in Ye et al. (2019), an unsupervised DA model was presented to learn invariant fea-
tures between SAR and optical images for image retrieval. Even when working with multiple optical remote sensing platforms, DA can increase the robustness of the models, allowing the knowledge gained from one trained domain to be transferred to the target domain, in terms of both transfer learning and data augmentation (Segal-Rozenhaimer et al. 2020).
Following a similar reasoning as for DA, different adversarial architectures can also be used to extract the intrinsic data features in a deep latent space. The idea is to exploit the powerful ability of deep learning models to extract spatial and spectral features in an unsupervised manner, providing a feature representation that can eventually be useful for anomaly detection in hyperspectral imagery (Xie et al. 2019), hyperspectral image processing (Zhang et al. 2019), or aerial scene classification (Yu et al. 2020).
The list of specific remote sensing and geosciences applications that have benefited from the advent of generative adversarial methods is large. However, two of these applications, change detection and super-resolution, deserve special attention due to the number of generative adversarial approaches that have been proposed for them.
On the one hand, the first change detection approach based on GANs was proposed
by Gong et al. (2017), where change detection was handled as a generative learning
procedure that modeled the relation between bitemporal images and the desired change
map. In Zhao et al. (2019), a seasonal invariant term was introduced to avoid unde-
sired changes in the final maps due to seasonality trends. Finally, in Hou et al. (2019),
GANs were used to reformulate change detection as an image translation problem,
differencing bitemporal images in the feature domain rather than in the traditional
image domain. On the other hand, in recent years, much attention has also been devoted to super-resolution techniques based on generative adversarial methods. GAN-based methods have been shown to provide high-resolution images with higher perceptual quality than mean-square-error-based methods, which tend to generate smoothed images. In Jiang et al. (2019), a GAN-based edge-enhancement network was proposed for super-resolution reconstruction along with noise-insensitive adversarial learning. In
Li et al. (2020), a multiband super-resolution method based on GANs was proposed to
exploit the correlation of spectral bands and avoid the spectral distortion in hyperspectral
images. Finally, in Zhang et al. (2020a), a visual saliency GAN was proposed to enhance
the perceptual quality of generated super-resolution images avoiding undesired pseudo
textures.
GAN-based methods have achieved SAR-to-optical image translation, at the cost of training the models in a supervised manner with paired images. One of the consequences of SAR-to-optical image translation is that the generated optical images are free of clouds, which is critical for land studies. In this context, CycleGANs
have also been applied for cloud removal in optical images by learning the mapping between
unpaired cloudy images and cloud-free images (Singh and Komodakis 2018). On the other
hand, CycleGANs have also been used in remote sensing to exploit their DA capabilities. In
Liu et al. (2018b), a CycleGAN was used to adapt simulated samples to be more similar to
real samples, which allows an improved data augmentation with simulation approaches.
In a similar way, in Saha et al. (2019a), CycleGANs are used to mitigate multisensor differ-
ences and to adapt the different source domains before applying an unsupervised change
detection.
In this section, we are going to present two illustrative applications of GANs in real remote
sensing problems. The goal is to pass from theory to practice in two relevant case studies.
These applications are domain adaptation of images coming from two different satellites
and landscape emulation using climate, geological, and anthropogenic variables as input.
Using paired training data would mean that simultaneous images of the same location and acquisition time from both sensors are needed. In some cases, the time constraint can be relaxed to images that are close in time; in other cases, however, this might not be enough, for instance for applications that look for sudden changes in images such as cloud, flood, and wildfire detection. In those cases, the CycleGAN formulation is more appealing since it does not require paired images for training. See the work of Hoffman et al. (2018) for a comprehensive reference on the use of conditional GANs and CycleGANs for DA problems in computer vision.
One illustrative example of DA using conditional GANs applied to the multi-sensor
scenario is the work in Mateo-García et al. (2019). In this work, the authors propose a slight
modification of the conditional GANs formulation to build a DA transformation between
Landsat-8 and Proba-V satellites that does not require paired samples. The goal is to
exploit Landsat-8 manually annotated cloud masks to build a cloud detection algorithm for
Proba-V. In order to build the DA transformation, firstly the overlapping bands of Landsat-8
and Proba-V are selected and then Landsat-8 images are upscaled from the 30m Landsat-8
spatial resolution to the 333m Proba-V resolution using the physical characteristics of both
sensors: the point spread function of Landsat-8 and Proba-V and their spectral response.
After this physically based transformation, the spatio-spectral properties of the images
are the same (similar spectral bands and same spatial resolution), however, statistical
differences between upscaled Landsat-8 and Proba-V images still remain as shown in
Figure 3.4 (see e.g. the blueish color of clouds in Proba-V). These differences are probably due to differences in the Proba-V and Landsat-8 instruments. Since Proba-V is a smaller and somewhat less accurate satellite, the authors build a DA model from Proba-V to the upscaled Landsat-8 images. This model can be seen as a noise removal method for Proba-V images.
Figure 3.5 summarizes the procedure to train the DA model (the generator in the CGAN
scheme). The conditional GAN model is trained using unpaired Landsat-8 and Proba-V
images adding a consistency loss for the generator between the real and the generated
image. This is required since only the generated data is used as input to the discriminator,
but not the input and the labels as in Equations 3.3 and 3.4. Results show a better quality of the adapted Proba-V images, with fewer saturated values, as seen in Figure 3.4. In addition, the cloud detection model trained on upscaled Landsat-8 images by Mateo-García et al. (2020) performs better on the denoised Proba-V images than on the raw Proba-V imagery.
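To make the role of the consistency term concrete, here is a minimal sketch of a generator loss combining an adversarial term with an L1 consistency term between input and output; the module names and the weight lambda_cons are illustrative assumptions, not values from the original work.

```python
import torch

def da_generator_loss(G, D, x_probav, lambda_cons=10.0, eps=1e-8):
    """Generator loss: fool the discriminator while staying close to the input.
    G adapts a Proba-V-like image towards the upscaled Landsat-8 domain;
    D scores how 'Landsat-8-like' an image looks. Both are assumed nn.Modules."""
    x_adapted = G(x_probav)
    adv = -torch.log(D(x_adapted) + eps).mean()        # adversarial term
    cons = (x_adapted - x_probav).abs().mean()          # consistency (L1) term
    return adv + lambda_cons * cons
```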
Figure 3.4 Close in time upscaled (333m resolution) Landsat-8 and Proba-V images before and
after the domain adaptation (Generator).
Figure 3.5 Example of architecture for Domain Adaptation between two satellites proposed in Mateo-García et al. (2019).
In Requena-Mesa et al. (2019), a model capable of predicting landscapes as seen from space
is introduced for the first time. Raw remote sensing data is used as a proxy of the landscapes
of Earth. Landscapes are complex systems, and their evolution is linked to the interplay
of climatic, geological and anthropogenic factors. They built a predictive model based
on conditional GANs capable of generating new landscapes, as seen from space, given
a set of landscape forming variables (e.g. average temperature, precipitation, geological
substrate, etc.). To model the problem uncertainty, they defined the ground truth as a
probability distribution over the remote sensing data conditioned on a set of environmental
conditions C. They then trained a generative neural model G as an approximation of the
unknown function that relates environmental conditions to landscapes as seen from space:
$$ G(C, r; \theta) \approx f(\mathrm{Clim}, \mathrm{Geo}, \mathrm{AI}), \qquad (3.7) $$
where 𝜃 denotes the network parameters and r is a probabilistic latent space useful to sam-
ple multiple plausible landscapes for each set of environmental variables (Climatic vari-
ables, Geological variables, and Anthropogenic Intervention indicators).
The study deployed a spatial-to-spatial generative model (see Figure 3.6). Such a model convolutionally encodes into the latent code spatial information describing higher-order relationships across the environmental variables, and deconvolutionally decodes satellite imagery features from the latent code. The model also makes use of skip connections, as these are needed to keep the landscape features in the right spatial locations; e.g., if there is a steep slope in the top-right corner of the environmental predictors, there should probably be a mountain in the generated imagery. The study shows that conditional GANs can generate landscapes that closely resemble the real ones, as measured by patch-level metrics, while simpler models cannot replicate these high-level metrics to a usable degree. In addition, they show that both the use of a convolutional-deconvolutional network architecture and discriminator-based training are key to achieving a good landscape prediction. Convolutions and adversarial training are among the greatest recent advances in deep learning, and it is only now that their application to relevant Earth system problems is being demonstrated.
While there are only very few works deploying generative models for systems that are hard to predict numerically, it is not hard to imagine that they could be used for many other challenging tasks, for example wildfire nowcasting, long-term fluvial and coastal sediment dynamics, landscape evolution over time, or patterns of urban growth, among others.
The use of spatio-temporal generator networks to emulate existing numerical simulations is in its infancy. However, the dimensionality of the data, large in both space and time, makes the deep learning architectures currently used for video prediction of special interest for Earth system science. Especially those architectures that explicitly model temporal dependencies (with LSTM-like structures), spatial dependencies (with convolutional steps), and the inherent stochasticity and ambiguity of the ground truth (with a probabilistic latent space), such as Babaeizadeh et al. (2017) or Lee et al. (2018a), seem to have all the key ingredients to be able to emulate complex Earth system models. There are currently ongoing works examining the suitability of conditioned stochastic spatio-temporal generators to emulate some of the current classical models, showing promising preliminary results. These models might see much wider usage in the following years.
Figure 3.6 An example architecture of a convolutional generative adversarial model. We can use generative models to unsupervisedly learn the distribution of Earth system variables and expand our available datasets.
4
Deep Self-taught Learning in Remote Sensing
Ribana Roscher
4.1 Introduction
Self-taught learning (STL), originally proposed by Raina et al. (2007), has become a promising paradigm to exploit large amounts of unlabeled data for analysis and interpretation tasks. Its main strength is to exploit unlabeled data samples, which do not have to belong to the same classes as the labeled data samples, nor do they have to follow the same data distribution as the labeled data samples (see Figure 4.1). This makes the approach advantageous over common approaches such as semi-supervised learning or active learning. The most common approach to STL is sparse representation (SR), which learns features in an unsupervised way and uses them for supervised classification.
Deep self-taught learning (DSTL) extends this approach by combining STL and deep learning. It has the same goal
as deep neural networks, namely to learn a representation that is better suited for the cho-
sen task than the original representation. However, the basis of self-taught learning is sparse
representation instead of a neural network, and thus other possibilities exist such as design-
ing and learning of an interpretable model. In the literature, several variants of deep sparse
representations have been proposed, which show considerable improvements over shallow
sparse representations. Recent approaches learn multiple layers of sparse representations:
He et al. (2014) propose a fully unsupervised feature learning procedure and combine it with hand-crafted feature extraction and pooling layers. Combining the representations
from multiple layers, they achieve state-of-the-art performance on object recognition. Lin
et al. (2010) use deep belief networks with local coordinate coding, which represents all data
samples as a sparse linear combination of anchor points. All these approaches use labeled
information only as a final step for classification purposes, and thus the representation is
not optimized given labeled information. In contrast, Gwon et al. (2016) applied the con-
cept of backpropagation which is commonly used for optimizing the parameters in neural
networks.
The above approaches determine the representation in such a way that the data is represented as well as possible and, in some cases, additionally such that the classification task is solved as accurately as possible. Since the representation does not consist of real data samples and no restrictions are imposed, the representation cannot be interpreted and explained. However, this is often necessary, for example, for unmixing tasks (Bioucas-Dias et al. 2012). One
approach in this direction is presented by Bettge et al. (2017), where the representation can be directly related to real data samples.

Figure 4.1 Schematic illustration of different learning paradigms and their use of labeled (red) and unlabeled (blue) data samples. In contrast to semi-supervised learning (data samples used shown in dotted boxes), self-taught learning also uses unlabeled data, which need not belong to the same classes as the labeled data. Images are from the UC Merced dataset (Yang and Newsam 2010).
The first section of the chapter describes the basic principle of STL and introduces an interpretable version of it. Next, the motivation behind a deep version of STL is given, and the deep framework and its single steps are explained in detail. Moreover, the main components, sparse representation and dictionary learning, are introduced. We refer the reader
to Goodfellow et al. (2016) for details about the basic concept of deep learning. The goal of
this chapter is to describe an interpretable and explainable deep learning framework which
is able to learn deep features with the help of large amounts of unlabeled data. The chapter
shows how the set of unlabeled data is optimized to be used for classification and how inter-
pretability can be enforced, the latter being important for various tasks in remote sensing
(Roscher et al. 2020; Reichstein et al. 2019).
The following section explains sparse representation, which is the most commonly used
approach to STL. The basic idea of sparse representation is that each sample can be repre-
sented by a weighted linear combination of a few elements from a dictionary. For STL, the
dictionary contains unlabeled, yet powerful basis elements which are relevant to a given
task (e.g., classification, anomaly, or change detection). For classification tasks, for example,
the estimated weights of the sparse linear combination are used as input into a classifier and
thus, the dictionary elements need to be chosen accordingly such that the weights are highly
discriminative. Earth observation data such as remote sensing images are particularly suitable for an efficient representation as a sparse linear combination of dictionary elements due to their statistical structure, especially the high spatial redundancy among neighboring observations (Thiagarajan et al. 2015).
In terms of basic sparse representation a (V × 1)-dimensional sample x is represented by
a weighted linear combination of a few elements taken from a (V × T)-dimensional dictio-
nary D such that
$$ x = D\boldsymbol{\alpha} + \boldsymbol{\epsilon}, \qquad (4.1) $$
with ||𝝐||2 being the reconstruction error, V the dimension of the sample, and T the number
of dictionary elements. The coefficient vector comprising the weights is given by 𝜶. The
sample x can be, for example, an (M × 1)-dimensional pixel from an image, so that V = M,
or an ((M ⋅ Z) × 1)-dimensional vectorized image patch x = vec(X) with Z being the number
of pixels in patch X and V = M ⋅ Z.
The sparse coding optimization problem for the determination of the optimal $\hat{\boldsymbol{\alpha}}$ is given by
$$ \hat{\boldsymbol{\alpha}} = \arg\min_{\boldsymbol{\alpha}} \|D\boldsymbol{\alpha} - x\|_2 \quad \text{subject to} \quad \|\boldsymbol{\alpha}\|_f < \rho. \qquad (4.2) $$
The norm $\|\cdot\|_f$ needs to be chosen such that it induces sparsity. If the $L_0$-norm is used, i.e. $f = 0$, then $\rho$ is the number of non-zero elements in the coefficient vector. Less intuitively, the $L_1$-norm with $f = 1$ is defined as the sum of absolute values and demands a suitable choice of the threshold $\rho$. A commonly used optimization procedure for the $L_0$-optimization task is orthogonal matching pursuit (Zhang et al. 2015; Elad 2010). Further constraints can also be applied, for example non-negativity constraints or a constraint that the coefficients sum to 1, which enhances the interpretability of the result.
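As an illustrative sketch (not from the chapter), sparse codes as in Equation 4.2 with an L0 constraint can be computed with orthogonal matching pursuit; here the dictionary is simply a random matrix standing in for unlabeled samples, and all sizes are assumptions.

```python
import numpy as np
from sklearn.linear_model import OrthogonalMatchingPursuit

rng = np.random.default_rng(0)
D = rng.normal(size=(3072, 50))        # dictionary: V x T (e.g. T unlabeled patches)
D /= np.linalg.norm(D, axis=0)         # normalize dictionary elements
x = rng.normal(size=3072)              # one sample to encode

# L0-constrained sparse coding: at most rho non-zero coefficients (cf. Eq. 4.2)
omp = OrthogonalMatchingPursuit(n_nonzero_coefs=5, fit_intercept=False)
omp.fit(D, x)
alpha = omp.coef_                      # sparse code, used as the feature vector
reconstruction_error = np.linalg.norm(D @ alpha - x)
```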
A structured dictionary can also be built based on the class assignment of the dictionary elements. An additional constraint enforcing samples to be reconstructed by one class-specific dictionary only can be introduced, which has been shown to improve classification results (Chen and Zhang 2011). Approaches
which construct dictionaries from given samples or representatives of them have the advantage of providing interpretable and explainable results, which is of great interest in a wide variety of applications in the Earth Sciences. Typical applications based on satellite-based
and close-range Earth observation data comprise unmixing tasks (Bioucas-Dias et al. 2012)
or plant phenotyping (Wahabzada et al. 2016; Römer et al. 2012). However, for Earth obser-
vation data, the number of labeled samples is limited and thus, class-wise dictionaries may
not be representative enough.
The labels are given by $y \in \mathcal{Y} = \{1, \ldots, K\}$, where $K$ is the number of classes. The labels are also represented by target vectors $t = [t_k]$ of length $K$, coding the label with $t_k = 1$ for $y = k$ and $t_k = 0$ otherwise. The dictionary $D = [{}^{u}x_1, \ldots, {}^{u}x_T]$ is embodied by unlabeled data samples, where generally $T \leq U$. In the case of supervised classification, a classifier model is trained with ${}^{tr}\hat{\boldsymbol{\alpha}}_n$ being the new higher-level feature representation of ${}^{tr}x_n$ with respect to the dictionary $D$. In the same way, higher-level features are extracted for the test data ${}^{t}x_u$, which are classified by the learned model.
Figure 4.2 Schematic illustration of the deep self-taught learning framework: Deep feature
representations are learned layer-wise by an iterative procedure of updating the sparse activations
(steps 1, 2, and 4) and learning updated dictionaries (step 3).
The schematic structure and the workflow of DSTL is illustrated in Figure 4.2. The left
side illustrates the part of the architecture which uses unlabeled samples and the right side
illustrates the part of the architecture which uses labeled samples, where the samples and
their representations are depicted with blue and red circles. The network consists of L layers
and is trained layer-wise. The following steps describe the procedure.
Initialization The initialization is performed layer-wise, so that in each layer $l = 1, \ldots, L$ the dictionary elements $D^{(l)}$ are determined. The first layer $l = 1$ is initialized by extracting relevant dictionary elements from the unlabeled data ${}^{u}X$. Depending on which of the procedures specified in section 4.2.1 is used, the dictionary elements either represent real data, so that the dictionary is interpretable, or they are calculated with respect to an optimization criterion. Given the dictionaries, ${}^{u}A^{(1)}$ is estimated using sparse coding. Likewise, in all subsequent layers $l > 1$, representative samples are extracted from ${}^{u}A^{(l-1)}$ to build $D^{(l)}$, and sparse coding is performed, yielding the representations for the labeled data, ${}^{tr}A^{(l)} = D^{(l+1)}\,{}^{tr}A^{(l+1)} + E^{(l+1)}$.
Classifier Training The last layer of the DSTL framework is intended to perform the
classification given the learned representation. Regarding neural networks, the most
common method is the use of a softmax layer. Transferred to DSTL, the layer performs
a logistic regression for classification given the learned representation in the last layer,
see (Bishop 2006, chapter 4). The posterior probabilities derived by logistic regression are
given by
$$ P\left(C_k \mid {}^{tr}\boldsymbol{\alpha}_n^{(L)}\right) = h_k = \frac{\exp\left(\mathbf{w}_k^{T}\, {}^{tr}\boldsymbol{\alpha}_n^{(L)}\right)}{\sum_{k'} \exp\left(\mathbf{w}_{k'}^{T}\, {}^{tr}\boldsymbol{\alpha}_n^{(L)}\right)}, \qquad (4.6) $$
where the weight matrix $W = [\mathbf{w}_k]$ contains the parameters of the separating hyperplanes in feature space. Given the sparse representations ${}^{tr}A^{(L)} = [{}^{tr}\boldsymbol{\alpha}_n^{(L)}]$, the goal is to learn a classifier $h_k({}^{tr}A^{(L)})$ for all classes $\{C_1, \ldots, C_K\}$. For this, the posterior probabilities
$$ {}^{tr}\tilde{T} = \begin{bmatrix} P(C_1 \mid {}^{tr}\boldsymbol{\alpha}_n^{(L)}) \\ \vdots \\ P(C_K \mid {}^{tr}\boldsymbol{\alpha}_n^{(L)}) \end{bmatrix} \qquad (4.7) $$
are compared to the reference targets $T = [\mathbf{t}_n]$. The error is minimized by updating the classifier parameters, the learned dictionaries, and the sparse representations using the update steps explained in the following.
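For illustration only (not part of the chapter), the softmax posteriors of Equations 4.6 and 4.7 can be computed from the last-layer sparse codes as follows; all array shapes are assumptions of this sketch.

```python
import numpy as np

def softmax_posteriors(A_L, W):
    """Eqs. (4.6)/(4.7): class posteriors from last-layer sparse codes.
    A_L: (T_L, N) matrix of sparse codes (one column per training sample).
    W:   (T_L, K) weight matrix, one column of hyperplane parameters per class."""
    scores = W.T @ A_L                              # (K, N) class scores
    scores -= scores.max(axis=0, keepdims=True)     # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=0, keepdims=True)         # (K, N) posterior probabilities
```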
Update Procedure In order to learn these parameters, we perform the following steps illus-
trated by (1)–(4) in Figure 4.2.
Step 1: The output from the last layer L is fed into a classifier and the classification loss
function value is computed. Therefore, in the first update step, the dictionaries are fixed and
only the training representations are updated. Given the reference targets tr tn and estimated
posterior probabilities ${}^{tr}\tilde{\mathbf{t}}_n$, the loss function is given by
$$ \mathcal{L}\left({}^{tr}\boldsymbol{\alpha}_n^{(L)}\right) = \frac{1}{2}\left\|{}^{tr}\mathbf{t}_n - {}^{tr}\tilde{\mathbf{t}}_n\right\|^2. \qquad (4.8) $$
With the gradient of the loss function with respect to the training representations ${}^{tr}\boldsymbol{\alpha}_n^{(L)}$, the representations in the last layer are updated with
$$ {}^{tr}\alpha_{rn}^{*(L)} = {}^{tr}\alpha_{rn}^{(L)} + \rho\,\frac{\partial \mathcal{L}\left({}^{tr}\boldsymbol{\alpha}_n^{(L)}\right)}{\partial\, {}^{tr}\alpha_{rn}^{(L)}}, \qquad (4.9) $$
with the upper index $(\cdot)^{*}$ indicating the updated representations, $\rho$ being the learning rate, and the index $r$ denoting the $r$-th row.
Step 2: In order to update the representations in layers $l = 2, \ldots, L-1$, the updated representation in the last layer is used subsequently with
$$ {}^{tr}A^{*(l)} = D^{(l+1)}\,{}^{tr}A^{*(l+1)}, \qquad (4.10) $$
starting with $l = L - 1$.
Step 3: Given the updated training representations, the dictionaries D(l) are updated using
the gradient descent method as proposed by Gwon et al. (2016). The following loss function
is used to compute the dictionary update,
$$ J_D\left(D^{(l)}\right) = \frac{1}{2}\left\|D^{(l)}\,{}^{tr}A^{*(l)} - {}^{tr}A^{*(l-1)}\right\|^2, \qquad (4.11) $$
which will be minimized. In the zeroth layer (input layer), tr A∗(0) is set to be the original
data ${}^{tr}X$.

Figure 4.3 Example images from the UC Merced dataset for the classes agriculture, forest, and buildings.

The gradient descent updating rule for the $r$-th dictionary element of the $l$-th layer is given by
$$ \mathbf{d}_r^{(l)} = \mathbf{d}_r^{(l)} - \gamma\left(D^{(l)}\,{}^{tr}A^{*(l)} - {}^{tr}A^{*(l-1)}\right){}^{tr}\boldsymbol{\alpha}_r^{(l)}, \qquad (4.12) $$
where 𝛾 is the learning rate. In the same way, step 2 is repeated for the unlabeled data and
dictionary updates are repeated using the unlabeled data representation u A(l) .
Due to the dictionary updates, their entries no longer represent real samples of data, so
they can no longer be interpreted and explained in the context of a specific application. As
an optional step and to keep them interpretable, the dictionary elements in the first layer
are limited to real data samples. This can be achieved, for example, by moving a sufficiently changed dictionary element to its nearest neighbor in the set of unlabeled data points in feature space. Requiring a sufficiently large change ensures that the updated dictionary elements are not simply mapped back to the original data samples, which would cause the optimization to get stuck.
Step 4: This step readjusts the labeled representations with the updated dictionaries
D(l) by minimizing the reconstruction error of tr X using the sparse coding procedure and
updates the classifier. We iterate steps (1)–(4) until convergence of the dictionaries or by
applying early stopping.
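The following is a compact numerical sketch of steps 1 and 3 above (not taken from the chapter); all shapes, the inline softmax, and the sign convention of the gradient step are assumptions of this illustration.

```python
import numpy as np

def dstl_update_last_layer(A_L, W, T, D, A_prev, rho=0.1, gamma=0.1):
    """One illustrative pass over steps 1 and 3 of the DSTL update.
    A_L:    (T_L, N)      last-layer sparse codes of the training data
    W:      (T_L, K)      classifier weights (one column per class)
    T:      (K, N)        one-hot reference targets
    D:      (T_prev, T_L) dictionary of the last layer
    A_prev: (T_prev, N)   representations of the previous layer."""
    # posteriors of Eq. (4.6)
    scores = W.T @ A_L
    scores -= scores.max(axis=0, keepdims=True)
    T_hat = np.exp(scores) / np.exp(scores).sum(axis=0, keepdims=True)

    # Step 1: gradient of the squared loss of Eq. (4.8) w.r.t. the codes,
    # propagated through the softmax, then a step on the codes (cf. Eq. (4.9))
    g = T_hat - T
    dL_dscores = T_hat * (g - (g * T_hat).sum(axis=0, keepdims=True))
    A_star = A_L - rho * (W @ dL_dscores)      # descent step on the loss

    # Step 3: dictionary update, matrix form of Eq. (4.12) for all elements
    residual = D @ A_star - A_prev
    D_new = D - gamma * residual @ A_star.T
    return A_star, D_new
```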
Extensions: The presented workflow illustrates the basic variant of the DSTL framework. However, it only yields a linear representation, which may not be flexible enough to solve an intended task. To learn a more complex, yet interpretable, representation, operations like pooling can be introduced. Further operations that are common in neural networks can also be applied.
Experimental Setup The RGB images used are resized to 32 × 32 × 3 pixels, leading to 3072-dimensional input feature vectors. All samples are normalized to the range [0, 1]. We randomly extract 150 training samples (tr X) and use the remaining 150 samples as test
samples (t X). We use archetypal analysis on the unlabeled dataset to extract dictionary
elements, where the size of the dictionary is limited by how many archetypes can be
extracted (Cutler and Breiman 1994).
Table 4.1 Class-wise accuracies [%], overall accuracy [%], average accuracy [%], and Kappa
coefficient obtained by logistic regression (LR) using various approaches. The best results are
highlighted in bold-print.
The experiment compares logistic regression using the original samples, STL with logis-
tic regression, and DSTL with two layers and logistic regression to investigate whether the
accuracy benefits from the DSTL approach over basic logistic regression and STL. The learn-
ing rate to update the sparse representation of the training data in the last layer is set to
𝜌 = 0.1 and the dictionary learning rate is set to 𝛾 = 0.1. Logistic regression is performed
with gradient descent with a learning rate of 0.1 and 1000 iterations in each DSTL itera-
tion. The weights are initialized by the solution from the last iteration. The DSTL update is
iterated 100 times to find the best dictionary, judged by application to the validation data.
Results Table 4.1 shows the class-wise, overall, and average test accuracy. In all our exper-
iments, the STL and DSTL with logistic regression achieve an improvement over the orig-
inal representations with logistic regression. For DSTL initialization, 10 archetypes were
extracted in the first layer and 16 archetypes could be extracted in the second layer.
The approach was also implemented as an interpretable framework, resulting in a slight
drop in accuracy of about 2%, by moving the dictionary elements to the nearest neighbors
from the set of unlabeled data at every fifth iteration after the dictionary update using the
unlabeled data. The unlabeled dictionary elements used show beaches, storage tanks, harbors, tennis courts, runways, and freeways. Overall, similarities to known, though unlabeled, scenes can be helpful for classification. In the context of other applications where these relations may be important, this approach can help to gain further insight into the results.
The DSTL framework also shows parallels to convolutional neural networks (CNNs), whose filters can likewise be adjusted to yield interpretable dictionary elements. The filter responses in
the CNN are derived independently in each layer by summed multiplications in contrast
to the jointly derived activations in the sparse linear combination in the DSTL framework.
Nevertheless, the filters and the responses in the CNN are jointly optimized across all layers. An even closer relation is represented by networks using convolutional sparse coding (Bristow et al. 2013), which replace the multiplication in the sparse coding procedure by convolutions. Kemker and Kanan (2017) also introduce a related approach, which uses stacked convolutional autoencoders for learning deep representations.
4.4 Conclusion
In this chapter deep self-taught learning was introduced to combine the advantages of
self-taught learning and deep learning. In our example experiment, we showed that the
framework for self-taught learning benefits from unlabeled data, so that the learned deep
features can be used for an improved classification compared to a classification with the
original feature representation. Since the dictionaries can be restricted to unlabeled data
samples, they are interpretable and explainable in the context of a specific application and
can be used to derive further insights into the learned model. Deep self-taught learning
shows many parallels to other methods that use deep learning, especially neural networks.
Many operations that are used in neural networks can also be used for deep self-taught
learning, so the presented method can benefit from the previous knowledge of neural net-
works. The advantage of deep self-taught learning compared to many other methods is that the interpretability of the model can be influenced in a simple way. However, this also increases the computing time, which calls for the development of more efficient methods for learning
interpretable dictionaries. Self-taught learning has not yet been used for many applications,
and especially in the field of remote sensing the potential has not yet been fully analyzed.
In contrast to many other deep learning methods, self-taught learning can use a lot of unla-
beled data and work with few labeled data, which is a typical scenario for remote sensing.
In addition, in remote sensing the interpretability and explainability of the model are often more important than in other communities: on the one hand, previous knowledge can be exploited, and on the other hand, scientific consistency of the results must often be demonstrated.
5
Deep Learning-based Semantic Segmentation
in Remote Sensing
Devis Tuia, Diego Marcos, Konrad Schindler, and Bertrand Le Saux
5.1 Introduction
Semantic segmentation is the task of attributing each pixel in an image to a semantic class.
In the case of Earth observation images, it is also called semantic labeling and is often
related to some kind of mapping, for instance of land use types, vulnerability/risk, or in
order to detect changes that have occurred in between acquisitions. Semantic segmenta-
tion is generally framed as a supervised task: labeled examples of each class are provided
to a model which learns the input/output relations necessary to predict the same classes in
as-yet unseen data.
Segmenting images from an overhead perspective has always been related to the need to integrate some kind of a priori knowledge about spatial structures (Fauvel et al. 2013): in urban environments, the co-occurrence of classes and typical geometrical arrangements of objects are precious information that can lead to more accurate models, while in agricultural applications, the mixture of spectral signatures of crops and soil, as well as the textures observed at the leaf level, can be used to characterize stages of growth or to detect diseases attacking the crops ahead of time.
This need to integrate priors about spatial arrangements, as well as the spatio-temporal
correlations observed in remote sensing signals, made the transition to deep learning algo-
rithms very natural: convolutional neural networks are spatial feature extractors by design
and were rapidly adopted by the optical remote sensing community, which had been using
convolutional image filters for decades. Questions of scale and rotation invariance as well as
multi-sensor processing then became drivers for new developments of algorithms custom
tailored to remote sensing data, which we will review in this chapter.
The chapter is organized as follows: in section 5.2 we review recent literature on semantic
segmentation for remote sensing. In section 5.3, we present the most common approaches
to semantic segmentation as they were introduced in computer vision. Finally, in section
5.4 we present three approaches from the recent literature where these architectures were
introduced in remote sensing and modified to cope with the data specificities or the prob-
lem’s own requirements.
In early patch-based approaches, the prediction of two nearby pixels was obtained independently in subsequent inference passes (Volpi and Tuia 2017).
To cope with these shortcomings, fully convolutional approaches were explored (Sherrah
2016) and have nowadays become the state of the art in remote sensing semantic segmen-
tation: these approaches are mostly based on encoder-decoder structures (Audebert et al.
2016; Kampffmeyer et al. 2016; Volpi and Tuia 2017; Maggiori et al. 2017a; Daudt et al. 2019)
or on the creation of multiscale tensors by stacking activation maps at different levels of the
CNN (Maggiori et al. 2017b; Fu et al. 2017; Volpi and Tuia 2018). Further works pushed
the accuracy of the networks, for example by making use of Conditional Random Fields for
post-processing (Paisitkriangkrai et al. 2016), class-specific edge information (Marmanis
et al. 2018), multiscale context (Liu et al. 2018), co-occurrences between classes (Volpi and
Tuia 2018), or by post-processing the resulting maps, for instance by deploying recurrent
networks refining the maps iteratively (Maggiori et al. 2017b).
The success of these remote sensing specific approaches was enabled by new, public
datasets with high spatial resolution and dense ground references. A large palette of datasets
that focuses on sub-metric pixel segmentation in urban areas is nowadays available, for
land use mapping (ISPRS 2D segmentation benchmark1 , IEEE Data Fusion Contest 2015
(Campos-Taberner et al. 2016)2 , Zurich summer (Volpi and Ferrari 2015)3 ), and building
detection (Inria dataset (Maggiori et al. 2017c)4 , Spacenet5 ). Some other datasets tackle
several of these challenges in parallel, such as DeepGlobe dataset (Demir et al. 2018)6 ,
which includes tasks of land cover classification as well as building and road extraction
and can be used for the development of multitask and lifelong learning methods. Other
data modalities than multispectral very high resolution images are also gaining momen-
tum through competitions, such as SAR imagery (see the recent SpaceNet6 (Shermeyer
et al. 2020)7 about urban classification and Sen1Floods11 (Bonafilia et al. 2020)8 about flood
water identification) and hyperspectral images (IEEE GRSS Data Fusion Contest 20189 ) or
LiDAR point clouds (DALES dataset (Varney et al. 2020)10 ). Finally, recent datasets aiming
at large scale (e.g. multi-city) classification with high-resolution data (e.g. Sentinel-2) are
also more and more present, for instance for the classification of Local Climate zones (IEEE
GRSS Data Fusion Contest 2017 (Yokoya et al. 2018)11 or So2Sat LCZ42 (Zhu et al. 2020)12 ),
land use (MiniFrance dataset (Castillo-Navarro et al. 2020)13 ), cloud detection (38-Cloud
1 http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
2 http://www.grss-ieee.org/community/technical-committees/data-fusion/2015-ieee-grss-data-fusion-
contest/
3 https://sites.google.com/site/michelevolpiresearch/data/zurich-dataset
4 https://project.inria.fr/aerialimagelabeling/
5 https://spacenetchallenge.github.io
6 http://deepglobe.org/
7 https://spacenet.ai/sn6-challenge/
8 https://github.com/cloudtostreet/Sen1Floods11
9 http://www.grss-ieee.org/community/technical-committees/data-fusion/2018-ieee-grss-data-fusion-
contest/
10 https://udayton.edu/engineering/research/centers/vision_lab/research/was_data_analysis_and_
processing/dale.php
11 http://www.grss-ieee.org/community/technical-committees/data-fusion/2017-ieee-grss-data-fusion-
contest-2/
12 https://mediatum.ub.tum.de/1454690
13 http://dx.doi.org/10.21227/b9pt-8x03
dataset (Mohajerani and Saeedi 2019)14 ) or temporal land cover changes (OSCD dataset (Daudt et al. 2018)15 ).
However, obtaining sufficient ground truth data for particular tasks and locations
remains a challenge. In computer vision, the use of pretrained models is common to
reduce the amount of required labels, thanks to the availability of large-scale datasets such
as ImageNet. This has been shown to help in some remote sensing tasks (Cheng et al.
2017), but is not straightforward to use, for instance with images with more than three
input bands or statistics that differ substantially from RGB images (e.g. SAR, thermal).
As alternatives, it has been proposed to pre-train models on related remote sensing tasks
(Wurm et al. 2019), render synthetic images by simulating an appropriate sensor (Kemker
et al. 2018) or make use of publicly available maps as ground truth (Kaiser et al. 2017).
Finally, domain adaptation (Wang and Deng 2018) is also being considered to adapt deep models trained on one geographic region to another: we refer the interested reader to the dedicated chapter in this book (cf. Chapter 7).
As we show in the rest of this chapter, Earth observation data also possesses unique char-
acteristics that can be exploited for semantic segmentation. For instance, data from different
modalities (Audebert et al. 2018) and acquisition times (Zhong et al. 2019) are often avail-
able, which demands segmentation models specifically adapted to each type of data.
5.3 Basics on Deep Semantic Segmentation: Computer Vision Models
Semantic segmentation as a task is rooted in the computer vision literature and many archi-
tectures that have been proposed for handling natural images can in principle be adapted
for the labeling of pixels composing aerial or satellite images. In this section, we review
standard architectures as they were originally proposed and in the following section we
discuss potential bottlenecks when adapting them to serve Earth observation purposes.
14 https://github.com/SorourMo/38-Cloud-A-Cloud-Segmentation-Dataset
15 https://rcdaudt.github.io/oscd/
Figure 5.1 Comparison of pipelines for (a) image classification versus (b) semantic segmentation.
In image classification CNNs, the last layer is a tensor of size 1 × 1 × cl and has the whole
image as its receptive field. This tensor is treated as a vector and serves as input to the fully
connected part of the model that outputs a vector of class scores of size C, summarizing
the content of the image: in the case of remote sensing image semantic segmentation, this
should represent the class the central pixel belongs to.
However, this aggregation of contextual information comes at the price of spatial detail, which is detrimental for semantic segmentation. On the other hand, not performing any downsampling would allow maintaining the spatial detail at the cost of a smaller receptive field, which could prevent access to enough spatial context to assign a class to each pixel.
In semantic segmentation (Figure 5.1b), the desired output is a tensor of size M∕d ×
N∕d × C, where d is the downsampling factor. Often d = 1, meaning that each individ-
ual pixel in the input image is assigned to a class. As pointed out in section 5.2, treating
semantic segmentation as M × N classification problems, i.e. one prediction per pixel, is
highly inefficient. This has driven the research of deep learning architectures for semantic
segmentation that are able to simultaneously extract contextual information while main-
taining spatial detail. We invite the reader to consult (Minaee et al. 2020) for a detailed
survey of methods. In the following we will group and present the main architectures fol-
lowing an arbitrary distinction based on how they perform the upsampling from the CNN’s
last layer to the expected prediction support (one prediction per each original pixel).
The first approaches, such as Fully Convolutional Networks (FCN) (Long et al. 2015),
consisted of substituting the fully connected operators with 1 × 1 convolutions, which are
algebraically equivalent. In this way, an increase in the size on the input image results in a
proportional increase in the size of the output tensor, which then becomes a map of class
probabilities. This map is then upsampled to the resolution of the ground truth map in order to compute a pixel-wise loss. However, the resulting map tends to be coarse due to the spatial information lost through downsampling. In FCN, the authors compute multiple class probability maps, using tensors from different layers in the backbone, and average the results to minimize the loss of spatial detail.
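As a toy illustration (the backbone and all sizes are placeholders, not the original FCN), replacing a fully connected classification head with a 1 × 1 convolution turns an image classifier into a dense predictor:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

backbone = nn.Sequential(                    # toy downsampling backbone
    nn.Conv2d(3, 64, 3, stride=2, padding=1), nn.ReLU(),
    nn.Conv2d(64, 128, 3, stride=2, padding=1), nn.ReLU(),
)
num_classes = 6

# Classification head: flatten + fully connected -> one score vector per image.
# Segmentation head: the algebraically equivalent 1x1 convolution -> a score map.
seg_head = nn.Conv2d(128, num_classes, kernel_size=1)

x = torch.randn(1, 3, 256, 256)
scores = seg_head(backbone(x))               # (1, 6, 64, 64) coarse class scores
upsampled = F.interpolate(                   # upsample to the label resolution
    scores, size=x.shape[-2:], mode="bilinear", align_corners=False)
```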
Hypercolumns (Hariharan et al. 2015) is a similar architecture, but where feature maps
obtained at different scales are upsampled and stacked, allowing each scale to specialize
in different features. In the example shown in section 5.4.1, the authors use variations of this method, in which the features are fused earlier by stacking them and applying several fully connected layers that allow learning interactions between features at different scales (see Figure 5.2).
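A minimal sketch of the hypercolumn idea (feature maps and shapes are illustrative assumptions): feature maps from several depths are bilinearly upsampled to the input resolution and stacked along the channel dimension before a per-pixel classifier.

```python
import torch
import torch.nn.functional as F

def hypercolumns(feature_maps, out_size):
    """Upsample each feature map to out_size=(H, W) and stack along channels."""
    up = [F.interpolate(f, size=out_size, mode="bilinear", align_corners=False)
          for f in feature_maps]
    return torch.cat(up, dim=1)

# e.g. three feature maps from increasing depths of a backbone
f1 = torch.randn(1, 32, 128, 128)
f2 = torch.randn(1, 64, 64, 64)
f3 = torch.randn(1, 128, 32, 32)
stack = hypercolumns([f1, f2, f3], out_size=(256, 256))   # (1, 224, 256, 256)
```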
In PSPNet (Zhao et al. 2017), the authors propose to increase the receptive field to the whole image by applying average pooling with different numbers of bins, including a global average pooling, to a feature map. The resulting downsampled feature maps undergo an additional convolution before being upsampled back to the original resolution and stacked.
Figure 5.2 Example of architecture with a hard-coded upsampling, in which every feature map in the downsampling backbone is bilinearly interpolated to the image resolution and stacked.
Another way of improving the resolution/receptive field trade-off is to use dilated con-
volutions, also known as à-trous convolutions. Filters in dilated convolutions are sparse,
resulting in larger kernels, and therefore larger receptive fields, without an increase in the
number of parameters. The DeepLab (Chen et al. 2017a) pipeline uses dilated convolutions
to reduce the impact of downsampling, but the authors found that a post-processing step
based on Conditional Random Fields was needed to obtain a satisfactory level of spatial
detail.
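For instance (a minimal sketch, not from the chapter), a dilated convolution in PyTorch enlarges the receptive field without adding parameters:

```python
import torch
import torch.nn as nn

x = torch.randn(1, 64, 128, 128)

# Standard 3x3 convolution: receptive field of 3x3
conv = nn.Conv2d(64, 64, kernel_size=3, padding=1)

# Dilated 3x3 convolution (dilation=4): the same 9 weights per filter,
# but the kernel now spans a 9x9 neighborhood
atrous = nn.Conv2d(64, 64, kernel_size=3, padding=4, dilation=4)

print(conv(x).shape, atrous(x).shape)   # both keep the 128x128 resolution
```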
Architectures Learning the Upsampling A different type of approach are the encoder-decoder architectures (Noh et al. 2015), which consist of coupling the downsampling network, or encoder, with one that computes the upsampling in a cascade of stages, the decoder. These are often designed to be approximately symmetric, such that information from each layer in the encoder can be transmitted to the corresponding layer in the decoder.
This can be done by transferring the indices from each max-pooling layer in the encoder
to un-pooling layers in the decoder, such as in SegNet (Badrinarayanan et al. 2017), see
Figure 5.3a. Alternatively, each entire feature map from the encoder can be appended to
the corresponding feature map in the decoder. In U-Net (Ronneberger et al. 2015a), both
feature maps are stacked and upsampled with a deconvolutional layer, which is equivalent
to a convolutional layer with fractional stride, see Figure 5.3b. Instead of a deconvolution, a
bilinear upsampling can be applied, followed by convolutional layers to learn how to refine
the result (Pinheiro et al. 2016). This learned upsampling can add a substantial amount of
additional parameters and computational cost to the downsampling backbone, but allows
the context both at the input and at the output levels to be learned.
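A minimal sketch (toy channel sizes, not the original SegNet or U-Net) of the two skip mechanisms just described: transferring max-pooling indices to an un-pooling layer, and stacking an encoder feature map onto the decoder before a deconvolution.

```python
import torch
import torch.nn as nn

x = torch.randn(1, 16, 64, 64)

# SegNet-style: pooling indices from the encoder drive un-pooling in the decoder
pool = nn.MaxPool2d(2, stride=2, return_indices=True)
unpool = nn.MaxUnpool2d(2, stride=2)
y, idx = pool(x)                       # (1, 16, 32, 32) plus pooling indices
x_rec = unpool(y, idx)                 # back to (1, 16, 64, 64), sparse activations

# U-Net-style: stack an encoder feature map with the decoder one, then deconvolve
decoder_feat = torch.randn(1, 16, 32, 32)
stacked = torch.cat([y, decoder_feat], dim=1)          # (1, 32, 32, 32)
deconv = nn.ConvTranspose2d(32, 16, kernel_size=2, stride=2)
upsampled = deconv(stacked)                            # (1, 16, 64, 64)
```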
Loss Functions Independently of the chosen architecture, a CNN for semantic segmenta-
tion generates a map of per-pixel class scores. This map needs to be compared to the ground
truth map in order to produce the learning signal that allows the model to be trained. The
exact nature of this comparison is defined by the loss function. Since semantic segmenta-
tion can be posed as pixel-wise classification, the majority of methods use variants of the
cross entropy loss, also called multinomial logistic loss, used in classification (Volpi and
Tuia 2017; Audebert et al. 2016; Maggiori et al. 2017a; Wurm et al. 2019). These variants
often aim to compensate the imbalance present in many semantic segmentation datasets
by re-weighting the importance of each class (Kampffmeyer et al. 2016). In addition, the
per-pixel nature of the task can be leveraged by using loss functions that aim to exploit aux-
iliary information and spatial relations, such as those based on the Euclidean distance to a
height map (Audebert et al. 2016) or to a distance map to the nearest semantic boundary
(Yuan 2016; Marmanis et al. 2018; Audebert et al. 2019a). More complex loss functions that
take explicitly into account the geometry of the segmented objects (Marcos et al. 2018b) or
the nature of the noise in the ground truth (Mnih and Hinton 2012) have also been explored.
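A short sketch (the class frequencies are made up) of the class-re-weighted cross-entropy commonly used to counter class imbalance:

```python
import torch
import torch.nn as nn

num_classes = 6
class_freq = torch.tensor([0.30, 0.25, 0.20, 0.15, 0.08, 0.02])  # assumed frequencies
weights = 1.0 / class_freq
weights = weights / weights.sum() * num_classes      # normalize around 1

criterion = nn.CrossEntropyLoss(weight=weights)      # pixel-wise weighted CE

scores = torch.randn(2, num_classes, 64, 64)         # per-pixel class scores
labels = torch.randint(0, num_classes, (2, 64, 64))  # ground truth map
loss = criterion(scores, labels)
```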
Figure 5.3 Encoder-decoder architectures: (a) un-pooling driven by the pooling indices transferred from the encoder; (b) stacking of encoder activations with decoder activations followed by deconvolution.
For the semantic segmentation of 3D point clouds, the principles emphasized previously also apply: the segmentation model is learned in a supervised manner and the spatial configuration of points matters. However, due to point cloud peculiarities, most methods described previously are not directly transferable. Therefore, research on point cloud semantic segmentation is very active and the current state of the art is blooming with several new approaches.
Statistical learning approaches for 3D aim to efficiently sample the spatial arrangement
of local neighborhoods, while ensuring invariance at global scale and over various scenes.
This leads to different families of approaches that we detail in the following: graph-based, 3D, 2D, or 1D (i.e. approaches acting directly on every point). We also refer to the comprehensive review of Xie et al. (2019) for an overview of the state of the art.
Graph-based Approaches These methods build a graph over points or locally consistent sub-
sets of points and use graph neural networks for classification. For example, SuperPoint-
Graph (Landrieu and Simonovsky 2018) first creates superpoints (which are geometrically
simple shapes) and then builds the graph of superpoints using rich relationship features
to link the superpoints. Finally, contextual segmentation is performed using local neural
networks and graph learning.
3D-based Approaches These approaches are similar to the image-based CNNs described in
the previous section, but consider an input space with an extra dimension. In VoxNet (Matu-
rana and Scherer 2015), the local 3D neighborhoods are sampled with 3D convolutions over
voxels (the 3D analog of pixels). Sparsity of points in 3D is a key issue here. It is handled
with trilinear interpolation in SegCloud (Tchapmi et al. 2017) to refine the characterization
of points with respect to their precise location. Another trick consists in using octrees as in
OctNet (Riegler et al. 2017) to allocate voxels of various sizes according to the local density
of points.
Point-based Sampling This is the most prolific category of algorithms for point-cloud seg-
mentation. PointNet (Qi et al. 2017b) and PointNet++ (Qi et al. 2017b) learn global and local
representations by applying fully-connected Multi-Layer Perceptrons over a set of points,
thus encoding their geometric relationship. To offer a better characterization, point descrip-
tors can be used as the input of PointNet rather than the simple location, as in PointSIFT
(Jiang et al. 2018b).
Such approaches were surprisingly efficient and paved the way for developments of algo-
rithms which try to emulate on point clouds the behavior of convolutions in 2D. Indeed,
they define local transforms with global and local invariance properties. As for global sam-
pling, local neighborhood characterization can be 3D, 2D, or 1D. In 3D, Flex-convolutions
(Groh et al. 2018) use a local 3D voxel grid to capture the surroundings of each point. In
2D, Tangent-Conv (Tatarchenko et al. 2018) projects points on locally tangent 2D planes.
Finally, local point-based approaches include PointCNN Li et al. (2018b) which introduced
𝜒-convolutions over local subsets of points, KPConv (Thomas et al. 2019) and ConvPoint
(Boulch 2019), which defined discrete convolutions weighted with respect to point distance.
5.4 Selected Examples
In the previous sections, we have presented deep learning architectures designed in other
fields (mostly computer vision) and explained their functioning. As mentioned in the intro-
duction, these architectures are very effective and can be used out of the box on aerial and
satellite images, but with some points of attention. First, they only consider RGB images as
inputs: to accommodate the high-dimensional input space offered, for example, by hyper-
spectral images, specific architectures must be designed (for a review of architectures for
hyperspectral imaging, see Audebert et al. (2019b)). Second, they are not specific to remote
sensing image characteristics and do not take into account priors such as the behavior of
the image with respect to rotation (in optical aerial images, rotation is arbitrary and should
not influence prediction) or looking geometry in SAR. Third, environmental monitoring
often involves repetitive sensing, i.e. images of the same place being acquired several times:
Semantic segmentation models can take advantage of this strong prior about spatial consis-
tency, while focusing on learning temporal changes.
In this section, we discuss three case studies, each one dealing with one of the points mentioned above: first, we will see the benefits of encoding rotational invariance in a small CNN, and show that an invariant model can match the performance of models that are larger by orders of magnitude in terms of learnable parameters. Then, we will show a solution to process point clouds by characterizing local neighborhoods with 2D sampling. Finally, in the third case we will present a study in environmental monitoring, specifically lake ice detection, where static webcams across seasons and synthetic aperture radar images are used.
Approach The same principle can be applied to rotation. Indeed, semantic segmentation
is by nature rotation equivariant, since a rotation of the image would ideally result in the
same rotation of the segmentation map. This can be implemented by using a sliding win-
dow that also rotates, applying the same function at all pixel locations and a discrete set of
orientations, with 𝛼r ∈ {𝛼, 2𝛼, … , R𝛼}, 𝛼 = 2𝜋∕R. Given a filter W ∈ ℝm×n×c0 , we can see it
as a collection of feature vectors Wi,j,∶ ∈ ℝc0 , each associated with a spatial location [i, j].
A rotated version Wr of the filter can be obtained by computing a new location for these
vectors and interpolating to the nearest grid points:
$$ [i', j'] = [i, j] \begin{bmatrix} \cos(\alpha_r) & \sin(\alpha_r) \\ -\sin(\alpha_r) & \cos(\alpha_r) \end{bmatrix}. \qquad (5.1) $$
Applying a filter bank of c1 filters in a sliding and rotating window fashion, which we
will call RotConv filter, to an image X ∈ ℝM×N×c0 results in a tensor Y ∈ ℝM×N×R×c1 . Note
how this tensor, as well as any filter we would like to apply to it, can be interpreted as a
collection of feature vectors Yi,j,r,∶ ∈ ℝc1 , one per location of the roto-translational space.
This increases substantially the memory and computational footprint of the model as R
becomes larger. To prevent this, we could max-pool across the rotation dimension, but this
would result in the loss of information related to relative orientation (e.g. we would see that
a car and a road edge have been detected, but without information about their orientation
with respect to each other). We propose to use max-pooling across the rotation dimension
but returning both the maximally activating magnitude and orientation:
Y𝜌 i,j,∶ = maxr Yi,j,r,∶ Y𝜃 i,j,∶ = 360
R
arg maxr Yi,j,r,∶ . (5.2)
These two tensors can be interpreted as the polar representation of a 2D tensor field if Y^ρ_{i,j,:} ≥ 0 ∀i, j, which can be enforced by applying a linear rectifier ReLU(𝜌) = max(𝜌, 0). A Cartesian representation Z ∈ ℝ^{M×N×2×c1} is then computed by converting each polar pair (Y^ρ, Y^θ) into its two Cartesian components. Since the input tensor to the following RotConv layer is a stack of vector fields, a filter bank Q ∈ ℝ^{m×n×2×c2} also needs to consist of vector fields with u and 𝑣 components, and the convolution operator is computed separately for each component.
Note that the rotation described in Equation 5.1 can be applied to Q but requires the addi-
tional step of rotating each 2D vector Qi,j,∶,k by 𝛼r according to Equation 5.1.
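To make the orientation pooling concrete, the following minimal PyTorch sketch implements Equation 5.2 and the subsequent polar-to-Cartesian conversion, assuming the rotated responses have already been stacked into a tensor of shape (batch, R, c1, H, W); the layout and function name are illustrative and not taken from the authors' implementation.

```python
import math
import torch

def orientation_pool(y, R):
    """Orientation max-pooling (Eq. 5.2): keep, per pixel and channel, the magnitude
    of the maximally activating rotation and its angle, then convert the polar pair
    to Cartesian vector-field components."""
    rho, idx = y.max(dim=1)                    # max over the R rotated responses
    theta = idx.float() * (2.0 * math.pi / R)  # angle of the winning rotation (radians)
    rho = torch.relu(rho)                      # enforce rho >= 0 so (rho, theta) is a valid polar form
    u = rho * torch.cos(theta)                 # Cartesian components of the vector field
    v = rho * torch.sin(theta)
    return rho, theta, torch.stack((u, v), dim=1)   # Z with shape (batch, 2, c1, H, W)

# toy usage: batch of 1, R = 4 orientations, c1 = 8 filters, 32 x 32 feature map
y = torch.randn(1, 4, 8, 32, 32)
rho, theta, z = orientation_pool(y, R=4)
```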
Data and Setup We performed experiments on the ISPRS Vaihingen “2D semantic labeling
contest” benchmark16 , which consists of 33 tiles of 9 cm resolution aerial imagery acquired
over the town of Vaihingen (Germany), with an average tile size of 2494 × 2064, three opti-
cal bands (near infrared, red, and green), and a photogrammetric digital surface model
(DSM) (see examples in Figure 5.5). Sixteen of the tiles are publicly available and contain
six land-cover classes: “impervious surfaces” (roads, concrete flat surfaces), “buildings”, “low vegetation”, “trees”, “cars”, and a class of “clutter” grouping uncategorized surfaces and noisy structures.
16 http://www2.isprs.org/commissions/comm3/wg4/semantic-labeling.html
Figure 5.4 (Adapted from (Marcos et al. 2018a)) Diagram of the first RotConv layer with two filters and R = 4. The output is a stack of 2D vector fields
that encode which rotated version of each filter is maximally activated, in terms of both magnitude and angle.
Table 5.1 Results on the Vaihingen validation set. F1 scores per class and global average (AA) and
overall accuracies (OA). Best result per row is in dark gray, second in light gray.
Classes are highly imbalanced, with “buildings” and “impervious surfaces” accounting for roughly 50% of the data, while classes such as “cars” and “clutter” account for only 2% of the total labels.
The CNN architecture we used is inspired by Hypercolumns (Hariharan et al. 2015) and follows the structure depicted in Figure 5.2. It consists of six RotConv layers with 7 × 7 filters, each followed by orientation pooling, factor-two spatial pooling, and a modified batch normalization. The magnitudes of all vector field maps are upsampled to the original resolution with bilinear interpolation and stacked before applying three layers of 1 × 1 convolutions, which are also equivariant to rotation because they do not capture any spatial patterns. In order to test the effect of changing the size of the model, the number of filters in each layer was parametrized by a single integer, Nf, as [2, 2, 3, 4, 4, 4] ⋅ Nf.
Results and Discussion We investigated the effect of varying the amount of training data (100%, 12%, and 4% of the total) on the generalization capabilities of the model and compared against a standard CNN with an equivalent architecture. On the full training set, RotEqNet saturated in performance with Nf = 3 (approximately 10^5 parameters), while the standard CNN needed Nf = 12 (approximately 10^6 parameters). As shown in Table 5.1, RotEqNet obtains a comparable overall accuracy in all the studied settings with one order of magnitude fewer parameters. In addition, the average accuracy obtained by RotEqNet is higher than that of a standard CNN, mostly because of its higher performance on the classes “cars” (clearly visible in Figure 5.5) and “buildings” (in the 4% and 12% training scenarios). These results suggest that RotEqNet offers a bigger advantage when segmenting classes with complex geometry, such as “cars” and “buildings”, compared to those with simpler, texture-like characteristics, such as vegetation.
Figure 5.5 Examples of classification maps obtained in the Vaihingen validation images with the
RotEqNet and the standard CNN models (from Marcos et al. (2018a)). Best viewed in color.
Approach The idea behind SnapNet (Boulch et al. 2018) comes from the Building Rome in a Day article (Agarwal et al. 2011), in which an approach for bundle adjustment able to produce a point cloud from thousands of images (such as tourist snapshots) was proposed. Conversely, SnapNet produces thousands of views from a single point cloud and learns to classify them to retrieve the 3D semantics.
The strengths of the approach lie in making it possible to leverage the power of 2D convolutional networks (including the use of pre-training) and in the ability to process appearance and geometry information jointly.
Discussion Figure 5.7 shows results of the SnapNet approach for semantic segmentation of point clouds. The scene is one of the test point clouds of the Semantic3D dataset (Hackel et al. 2017) and was obtained by terrestrial laser scanning in an urban environment. These point clouds are large (4 ⋅ 10^8 points on average) and require computationally efficient processing algorithms. The obtained semantic 3D maps show precise classification of points belonging to buildings, terrain, roads, or urban hardscape.
Figure 5.6 SnapNet processing: (1) The point-cloud is meshed to enable the (2) generation of random views at multiple scales, both in appearance and
geometry. (3) Semantic segmentation is performed in the 2D domain, and results are (4) back-projected in 3D for voting and 3D semantic segmentation.
Figure 5.7 SnapNet results on the Semantic3D dataset (Hackel et al. 2017): colored point cloud
captured in St Gall, Switzerland (left) and semantic 3D map with buildings in red, natural terrain
in green, impervious surfaces in gray, etc. (right).
SnapNet integrates two features which are essential in modern point cloud segmentation
approaches: learning local representations while maintaining global statistics on the scene.
Local spatial patterns are encoded by 2D convolutional filters applied on both appearance
and geometric features after projection in the image space. Global statistics are computed
through the multiscale view generation strategy.
Approach We tackle the semantic segmentation of lake ice with a state-of-the-art convo-
lutional network, DeepLab v3+ (Chen et al. 2018a), an encoder-decoder architecture that
uses separable convolutions and Atrous Spatial Pyramid Pooling (ASPP). For SAR images,
we use a variant of DeepLab v3+ with MobileNetV2 (Sandler et al. 2018) as the encoder, and train it on 128×128 pixel patches with batch size 8, minimizing the cross-entropy loss with stochastic gradient descent. Atrous rates were set to [1, 2, 3]. For webcams we use Xception65 (Chollet 2017) as the encoder backbone, with atrous rates [6, 12, 18], and train with 321×321 patches and batch size 8. Overall, that encoder has an output stride (spatial downsampling from input to final feature encoding) of 16, which we upsample in the decoder stage in two steps of factor 4 each, with additional skip connections to re-inject high-frequency detail (similar to the U-Net model described in Section 5.3.1). In both cases we employ models
pre-trained on the PASCAL VOC 2012 close-range dataset. It turns out that pre-training on
RGB amateur images greatly improves the performance not only for webcams, but, some-
what surprisingly, also for SAR amplitude images. It appears that, despite the completely
different sensing principle, the local structure of SAR data after standard preprocessing (see
below) is similar enough to that of optical images to benefit from the pre-trained initial
weights.
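As an illustration of this patch-based training setup, the sketch below uses torchvision's DeepLabv3 with a ResNet-50 backbone as a stand-in for the DeepLab v3+ models described above (MobileNetV2 encoder for SAR, Xception65 for webcams); the data loader, the stacking of SAR bands into three channels, and the learning-rate/momentum values are assumptions made for the example.

```python
import torch
import torch.nn as nn
from torchvision.models.segmentation import deeplabv3_resnet50

# Stand-in model: torchvision's DeepLabv3 (ResNet-50 encoder); the study itself uses
# DeepLab v3+ with a MobileNetV2 encoder for SAR and Xception65 for webcams.
model = deeplabv3_resnet50(num_classes=2)                               # frozen vs. non-frozen
criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.SGD(model.parameters(), lr=1e-2, momentum=0.9)  # assumed values

def train_one_epoch(loader):
    model.train()
    for patches, labels in loader:        # patches: (8, 3, 128, 128); SAR bands stacked to 3 channels
        logits = model(patches)['out']    # per-pixel class scores, (8, 2, 128, 128)
        loss = criterion(logits, labels)  # labels: (8, 128, 128), dtype long
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
```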
Figure 5.8 The four Sentinel-1 orbits (15, 66, 17, 168) that scan Region Sils (shown as yellow filled
rectangle).
17 https://earthengine.google.com
Table 5.2 Leave-one-winter-out results (left, over all three lakes) and Leave-one-lake-out results
(right, over both winters).
Figure 5.9 Example results for St. Moritz on a non-frozen day (row 1), Silvaplana on a frozen day
(row 2), and Sils on a transition day (row 3). Best viewed in color.
having seen data from any day within the test period. Leave-one-lake-out CV evaluates the capability to generalize to unseen lakes. The results are shown in Table 5.2. Depending on the lake, the predictions are 84–96% correct, meaning that ice segmentation also works well for new lakes (with similar imaging conditions). In all cases, a single model was trained for images from all orbits. Fitting separate models for ascending and descending orbits (respectively, morning and afternoon) resulted in performance drops of 5–7 percentage points; see Tom et al. (2018).
Figure 5.9 shows exemplary qualitative results on frozen, non-frozen, and transition
dates, as well as the corresponding soft probability maps (blue denotes higher probability
for frozen, red higher probability for non-frozen). To give a better visual impression we
also show the corresponding image from Sentinel-2.
For the webcam data, annotated ground-truth masks of the lakes are available, with the labels water, ice, snow, and clutter for the lake pixels. The numbers of images and their resolutions are given in Table 5.3; for further details see Prabha et al. (2020). There are two different, fixed webcams, which we call Cam0 and Cam1, both observing lake St. Moritz at different zoom levels. Example images are shown in Figure 5.10.
We report results for different train/test settings, see Table 5.4. In the same-camera/both-winters setting, the model is trained on a random 75% of the images from a webcam stream and tested on the remaining images from the same webcam. The cross-winter setting again evaluates generalization to the potentially different conditions of an unseen year. The model also generalizes quite well across winters, with an average IoU score of 78%, although not quite as well as the SAR version. In the cross-camera setting, we train on one camera and test on the other, so as to check how well the model generalizes to an unseen viewpoint, image scale, and lighting. While there is a noticeable performance drop, the segmentation still works surprisingly well, reaching mean IoU scores around 70%.
Qualitative example results are shown in Figure 5.10. Some images are difficult even for humans to annotate correctly; e.g., row 2 shows an example of ice with smudged snow on
top, for which the “correct” labeling is not well-defined. One can see that the segmentation
is robust against cloud/mountain shadows cast on the lake (row 3). There are also cases of
label noise where the network “corrects” human errors, such as in row 5, where humans
present on the frozen lake were not annotated due to their small size.
6
Object Detection in Remote Sensing
Jian Ding, Jinwang Wang, Wen Yang, and Gui-Song Xia
6.1 Introduction
6.1.1 Problem Description
Object detection is a fundamental task towards an automatic understanding of remote sensing images. The aim of object detection in remote sensing images is to locate the objects of interest on the ground (e.g., vehicles, airplanes) and to identify their categories. Remote sensing images are acquired from a variety of platforms, including satellites, airplanes, and drones, equipped with different sensors such as optical cameras or synthetic aperture radars (SARs). Figure 6.1 shows several images containing objects taken with optical and SAR sensors. In the past decades, extensive research has been devoted to object detection in remote sensing images using hand-crafted features (Porway et al. 2010; Lin et al. 2015; Cheng et al. 2016b; Moranduzzo and Melgani 2014; Wang et al. 2017a; Wan et al. 2017; Ok et al. 2013; Shi et al. 2013; Kembhavi et al. 2010; Proia and Pagé 2009); traditional descriptors like HOG (Dalal and Triggs 2005) and SIFT (Lowe 1999) are widely used for feature extraction. However, these shallow models have limited ability to detect objects in complex environments. Nowadays, with the development of deep learning and its successful application to object detection, Earth vision researchers have explored methods (Liu et al. 2016d, 2017c; Liao et al. 2018b; Yang et al. 2019c; Cheng et al. 2016b) based on fine-tuning networks pre-trained on large-scale datasets of natural images such as ImageNet (Deng et al. 2009). Nevertheless, there is a huge domain shift between natural images and remote sensing ones, so object detection methods developed for natural images cannot be directly applied to remote sensing images. We summarize the difficulties of object detection in remote sensing as follows:
● Arbitrary orientation of objects. Objects in remote sensing images can appear in arbitrary orientations without any restriction, because the sensors observe the objects on the ground from a bird’s eye view. This largely challenges conventional systems, since rotation-invariant features are required to obtain good performance.
Figure 6.1 Examples of remote sensing images containing objects of interest. (a) An image from Google Earth, containing ships and harbors. (b) An image from the JL-1 satellite, including planes. (c) A drone-based image containing many vehicles. (d) A SAR image, containing ships.
Figure 6.2 Challenges of object detection in remote sensing. (a) Arbitrary orientations. (b) Huge
scale variations. (c) Densely packed instances.
● Huge scale variations. Objects in remote sensing vary over a wide range of sizes, depending on the ground sampling distance (GSD) of the sensor and the actual physical sizes of the objects, from 10 pixels (e.g., small vehicles) to above 1000 pixels (e.g., ground track field), all in one scene (Figure 6.2).
● Large-size images. Images in remote sensing may be very large (above 20,000 pixels), and such large-size images are challenging for current computational hardware. Besides, the instances are distributed non-uniformly in remote sensing images: some small-size (e.g., 1k×1k) chips can contain hundreds to thousands of instances, while some large-size images (above 20,000 pixels) sometimes only contain a few.
● Densely packed instances. The instances are usually densely packed in some specific scenes, such as harbors and parking lots. This makes them hard to distinguish and separate. Figure 6.2 shows some examples of these difficulties.
where cij (cij ∈ C) and bij denote the categorical label and bounding box of the j-th object in xi, respectively. The bounding box is the minimum rectangle that encloses the object; it is commonly represented as (cx, cy, 𝑤, h), denoting the coordinates of the center, the width, and the height of the box. The model weights of the detector are parameterized by 𝜃. For each image xi, the prediction yi^{pred} shares an identical format with yi:
y_i^{pred} = \{(c_{i1}^{pred}, b_{i1}^{pred}), (c_{i2}^{pred}, b_{i2}^{pred}), \ldots\}.   (6.2)
                              Ground truths
                        Positives        Negatives
Predictions  Positives     TP               FP
             Negatives     FN               TN
For oriented object detection, the IoU between two OBBs is calculated with computational geometry, which is more complicated than for horizontal boxes. Specifically, as shown in Figure 6.3, we need to find the intersection polygon IJKCLE of the two OBBs ABCD and EFGH, and calculate the IoU as
\mathrm{IoU}_{OBB} = \frac{S_{IJKCLE}}{S_{ABCD} + S_{EFGH} - S_{IJKCLE}} = \frac{S_{ILE} + S_{ICL} + S_{IKC} + S_{IJK}}{S_{ABCD} + S_{EFGH} - S_{IJKCLE}},   (6.5)
where S denotes the area of a region.
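In practice, this polygon intersection can be delegated to a computational geometry library; the sketch below uses shapely as one possible choice (not necessarily the implementation used in the works cited here).

```python
from shapely.geometry import Polygon

def iou_obb(corners_a, corners_b):
    """IoU between two oriented boxes given as lists of four (x, y) corners."""
    pa, pb = Polygon(corners_a), Polygon(corners_b)
    inter = pa.intersection(pb).area          # area of the intersection polygon
    union = pa.area + pb.area - inter         # inclusion-exclusion for the union
    return inter / union if union > 0 else 0.0

# two boxes rotated against each other
print(iou_obb([(0, 0), (4, 0), (4, 2), (0, 2)],
              [(1, -1), (3, 1), (1, 3), (-1, 1)]))
```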
● Precision and Recall. For a common binary classification task, Table 6.1 illustrates the rules for classifying samples. In object detection, a detection whose IoU with the nearest ground truth is below a pre-defined threshold (usually 0.5) is directly regarded as a False Positive (FP). For every ground truth box, at most one prediction is counted as a True Positive (TP); any additional prediction with IoU above the threshold is counted as an FP instead. A False Negative (FN) indicates a ground truth that is not detected. We can then define Precision (P) and Recall (R):
P = \frac{TP}{TP + FP},   (6.6)
R = \frac{TP}{TP + FN}.   (6.7)
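The following sketch illustrates how detections are greedily matched to ground truths (highest confidence first) at an IoU threshold of 0.5 to count TPs, FPs, and FNs and derive precision and recall; the IoU function is passed in as an argument, and all names are illustrative.

```python
def precision_recall(detections, ground_truths, iou_fn, iou_thr=0.5):
    """detections: list of (confidence, box); ground_truths: list of boxes.
    Each ground truth can match at most one detection (a TP); all other
    detections count as FPs, and unmatched ground truths as FNs."""
    matched = [False] * len(ground_truths)
    tp = fp = 0
    for conf, box in sorted(detections, key=lambda d: -d[0]):
        # best-overlapping, still unmatched ground truth
        best_iou, best_j = 0.0, -1
        for j, gt in enumerate(ground_truths):
            iou = iou_fn(box, gt)
            if iou > best_iou and not matched[j]:
                best_iou, best_j = iou, j
        if best_iou >= iou_thr:
            matched[best_j] = True
            tp += 1
        else:
            fp += 1
    fn = matched.count(False)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall
```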
Figure 6.4 Examples of precision-recall curves. (a) maintains high precision as recall increases, while the precision of (b) drops significantly.
The precision-recall curve plots the trade-off between precision and recall as the confidence threshold varies. As shown in Figure 6.4, if an object detector maintains high precision as recall increases (e.g., Figure 6.4(a)), it can be considered a good detector: even if the confidence threshold changes, precision and recall remain high. However, comparing the curves of different classes and detectors in the same plot is not an easy task, as the curves often cross each other.
6.1.5 Applications
● Ship detection. With the development of remote sensing technology, more and more
high-resolution remote sensing images are available for ship detection and recognition.
There are a wide range of applications for ship detection, such as fishery management,
vessel traffic services, and maritime and dock surveillance. Both SAR images and optical
images have been used for ship detection (Liu et al. 2017c; Yang et al. 2018a; Zhang et al.
2018d).
● Vehicle detection. Vehicle detection in aerial images is important for applications such as traffic management, parking lot utilization, and urban planning. By collecting traffic and parking information from remote sensing images, we can quickly obtain coverage over a
large area at lower cost and with fewer deployed sensors compared with traditional approaches (e.g., road monitors). Liu and Máttyus (2015) and Deng et al. (2017) show that it is possible to detect small vehicles in large remote sensing images and still be fast enough.
● Airport and plane detection. Airport and plane detection in remote sensing images has attracted much attention for its importance in military and civilian fields. Xiao et al. (2017) use a multiscale fusion feature for optical remote sensing images to detect airports. Mohammadi (2018) proposes a rotation- and scale-invariant airplane proposal generator to handle the scale variation and orientation of airplanes.
● Others. As a fundamental problem in remote sensing image analysis, object detection has many further applications, such as environmental monitoring, land use and land cover mapping, geographic information system updates, and so on.
● Faster R-CNN. The proposal generation in Fast R-CNN is still hand-crafted. Faster R-CNN (Ren et al. 2017) proposes an efficient fully convolutional network to generate region proposals, called the Region Proposal Network (RPN). The RPN learns the “objectness” of all instances and accumulates proposals, which are then used by the detector; the detector subsequently classifies the proposals and refines their bounding boxes. RPN and detector can be trained end-to-end. Faster R-CNN uses thousands of reference boxes, also called anchors. The anchors form a grid of boxes that act as starting points for bounding box regression, and the network is trained to regress bounding box offsets and to score the objectness of each anchor (the anchor grid is sketched below). The sizes and aspect ratios of the anchors are determined by the typical range of instance sizes in the dataset and by the receptive field of the convolutional layers. RoI Pooling then warps the proposals generated by the RPN into fixed-size features, which are fed into fully connected layers for classification and box regression.
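The anchor grid mentioned above can be sketched as follows; the scales, aspect ratios, and stride are illustrative values rather than those of any specific detector.

```python
import numpy as np

def generate_anchors(feat_h, feat_w, stride, scales=(32, 64, 128), ratios=(0.5, 1.0, 2.0)):
    """Anchor grid for an RPN: boxes of several scales and aspect ratios centered
    on every cell of a feature map with the given stride (image pixels per cell).
    Returns an array of shape (feat_h * feat_w * len(scales) * len(ratios), 4)
    with boxes encoded as (cx, cy, w, h) in image coordinates."""
    anchors = []
    for i in range(feat_h):
        for j in range(feat_w):
            cx, cy = (j + 0.5) * stride, (i + 0.5) * stride
            for s in scales:
                for r in ratios:                 # r = h / w, area kept close to s**2
                    w, h = s / np.sqrt(r), s * np.sqrt(r)
                    anchors.append((cx, cy, w, h))
    return np.array(anchors)

# e.g. a 50 x 50 feature map with stride 16 yields 50 * 50 * 9 = 22,500 anchors
anchors = generate_anchors(50, 50, stride=16)
```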
6.2.1.2 R-FCN
In Faster R-CNN, each proposal generated by the RPN is processed individually to predict its class and box offsets, which is time-consuming. To reduce this repeated per-proposal computation, Dai et al. (2016) proposed R-FCN (Region-based Fully Convolutional Networks). As illustrated in Figure 6.5, R-FCN crops features from the last layer before prediction, instead of cropping features from the same layer where the RPN is applied. Besides, to add localization representations that respect translation variance in the detection task, a position-sensitive cropping strategy is proposed in R-FCN, replacing the standard RoI pooling operations used in Fast R-CNN and Faster R-CNN. R-FCN achieves accuracy comparable to Faster R-CNN while running faster.
6.2.2.2 SSD
Inspired by Faster R-CNN, SSD (Liu et al. 2016a) uses reference boxes with various aspect ratios and sizes to predict object instances, while the region proposal stage is removed entirely. During training, thousands of default boxes corresponding to different anchors on different feature maps are learned to distinguish objects from background, localize them, and predict the class probabilities of the object instances, with the help of a multitask loss.
The Rotation Region Proposal Network (RRPN) generates prior proposals carrying object orientation angle information, and then regresses the offsets of OBBs relative to the oriented proposals. R-DFPN (Yang et al. 2018a) adopts RRPN and puts forward the Dense Feature Pyramid Network to handle the narrow width of objects like ships. Ding et al. (2019) design an RoI learner to transform horizontal RoIs into oriented RoIs in a supervised way. All these regression-based methods formulate the problem as the regression of OBB offsets relative to HBBs or OBBs, and they rely on an accurate representation of the OBB. There are also methods that seek the object region at the pixel level and then apply post-processing to obtain OBBs; we call these segmentation-based methods. For instance, Wang et al. (2019b) propose Mask OBB, which uses a binary segmentation map to represent oriented objects. SegmRDet (Wang et al. 2020) uses the Mask R-CNN (He et al. 2017) structure to generate box masks for detecting oriented objects.
is applied when it is dominated by small objects. The results show that it reduces the
running time by 50% while keeping similar accuracy on the challenging aerial dataset,
xView (Lam et al. 2018).
6.3.2.1 DOTA
The original data of DOTA (Xia et al. 2018)1 mainly come from the China Resources Satellite Data and Application Center, Google Earth, and the JL-1 and GF-2 satellites. The dataset contains 2806 remote sensing images acquired by different sensors, with image sizes ranging from 800 × 800 to 4000 × 4000 pixels, covering 15 categories. In total, 188,282 object instances are annotated with oriented bounding boxes, and the dataset is divided into a training set, a validation set, and a test set according to the ratio of 1/2, 1/6, and 1/3.
6.3.2.2 VisDrone
The VisDrone (Zhu et al. 2018)2 dataset is collected and annotated by the AISKYEYE team of the Machine Learning and Data Mining Laboratory of Tianjin University. The dataset includes 263 drone videos and 10,209 drone images; the videos contain a total of 179,264 frames, with image widths of about 2000 pixels. The dataset contains more than 2.5 million manually annotated objects with axis-aligned bounding boxes in 10 categories, such as pedestrians, cars, bicycles, and tricycles. It also provides additional attributes, including scene visibility, object class, and occlusion, to make better use of the data.
VisDrone comprises four tasks: object detection in images, object detection in videos, single-object tracking, and multi-object tracking.
6.3.2.3 DIOR
DIOR (Li et al. 2020) images are collected from Google Earth. It consists of 23,463 images containing 20 object categories; each category contains about 1200 images, for a total of 192,472 instances. The objects are annotated with axis-aligned bounding boxes.
6.3.2.4 xView
xView (Lam et al. 2018)3 contains data from the WorldView-3 satellite at 0.3 m ground sample distance, giving higher-resolution imagery than many other satellite datasets. The dataset covers over 1400 km² of ground area. It has 60 fine-grained classes and over 1 million objects, annotated with axis-aligned bounding boxes.
1 https://captain-whu.github.io/DOTA/
2 http://aiskyeye.com/
3 http://xviewdataset.org/
Figure 6.6 (a–b) Borderline states of regression-based OBB representations. The solid line, dashed line, and gray region represent the horizontal bounding box, the oriented bounding box, and the oriented object, respectively. The feature map of the left instance should be very similar to that of the right one, but with the definition in Xia et al. (2018) for choosing the first vertex (yellow vertex of the OBB in (a) and (b)), the coordinates of the 𝜃-based, point-based, and h-based OBB representations differ greatly. The Mask OBB representation avoids this ambiguity and obtains better detection results.
For h-based OBB and 𝜃-based OBB, 𝜋∕4 is still a discontinuity point. As shown in Figure
6.6 (a) and (b), with 𝜃 oscillating near 𝜋∕4, the h-based OBB representation would switch
between (x1 , y1 , x2 , y2 , h) and (x4 , y4 , x1 , y1 , 𝑤). The 𝜃-based OBB representation would
switch back and forth between (cx, cy, h, 𝑤, 𝜃) and (cx, cy, 𝑤, h, 𝜃 ′ ) similarly.
Mask OBB for Oriented Object Detection To handle the ambiguity problem, Wang et al. (2019b) represent the oriented object as a binary segmentation map, which naturally ensures uniqueness; the problem of detecting an oriented bounding box can then be treated as pixel-level classification for each proposal. The oriented bounding boxes are generated from the predicted masks by post-processing, and this kind of OBB representation is called the mask-oriented bounding box representation (Mask OBB). Under this representation, there is no discontinuity point and no ambiguity problem.
Furthermore, aerial image datasets like DOTA (Xia et al. 2018) and HRSC2016 (Liu et al. 2016d) provide regression-based oriented bounding boxes as ground truth. Specifically, DOTA uses point-based OBBs {(xi, yi) | i = 1, 2, 3, 4} (Xia et al. 2018) and HRSC2016 uses 𝜃-based OBBs {(cx, cy, h, 𝑤, 𝜃)} (Liu et al. 2016d). However, pixel-level classification requires pixel-level annotations. To handle this, pixel-level annotations are converted from the original OBB ground truth: pixels inside oriented bounding boxes are labeled as positive and pixels outside as negative. The resulting pixel-level annotations are treated as the ground truth for pixel-level classification. Figure 6.7 illustrates point-based OBBs and the converted Mask OBBs on DOTA images: the highlighted points are the original ground truth, and the highlighted regions inside the point-based OBBs are the new ground truth for pixel-level classification, which is well known as an instance segmentation problem. Unlike point-based OBB, h-based OBB, and 𝜃-based OBB, Mask OBB is unique by definition no matter how the point-based OBB changes. Using Mask OBB, the ground-truth ambiguity problem is naturally solved, and there are no discontinuity points in this representation. A sketch of this ground-truth conversion is given below.
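A minimal sketch of this conversion, rasterizing a point-based OBB into a binary mask with OpenCV's fillPoly, is given here; array shapes and names are illustrative.

```python
import numpy as np
import cv2

def obb_to_mask(obb_points, height, width):
    """obb_points: point-based OBB (x1, y1, ..., x4, y4). Returns a binary mask
    with 1 inside the oriented bounding box and 0 outside."""
    mask = np.zeros((height, width), dtype=np.uint8)
    polygon = np.array(obb_points, dtype=np.int32).reshape(1, 4, 2)  # the 4 corners
    cv2.fillPoly(mask, polygon, 1)       # rasterize the oriented box into the mask
    return mask

# toy usage on a 100 x 100 image
mask = obb_to_mask((10, 20, 60, 10, 70, 50, 20, 60), height=100, width=100)
```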
Figure 6.7 Samples for illustrating mask-oriented bounding box representation (Mask OBB). The
corner points are original ground truth (point-based OBB), and the regions inside point-based OBBs
are ground truth for pixel-level classification.
Figure 6.8 Overview of the pipeline for detecting oriented objects by Mask OBB. Horizontal
bounding boxes and oriented bounding boxes are generated by HBB branch and OBB branch,
respectively.
The overall architecture of Mask OBB is illustrated in Figure 6.8. Mask OBB is a two-stage
method based on Mask R-CNN (He et al. 2017), which is well known as an instance segmen-
tation framework. In the first stage, a number of region proposals are generated by a RPN
(Ren et al. 2015). In the second stage, after the RoI Align (He et al. 2017) for each proposal,
aligned features extracted from FPN (Lin et al. 2017) features are fed into the HBB branch
and OBB branch to generate the horizontal bounding boxes and instance masks. Finally, the
oriented bounding boxes are obtained by OBB branch based on predicted instance masks.
Besides, Mask OBB applies FPN (Lin et al. 2017) with ResNet as the backbone to fuse
low-level features and high-level features. Each level of the pyramid will be used for detect-
ing objects at a different scale. We denote the output as {C2 , C3 , C4 , C5 } for conv2, conv3,
conv4, and conv5 of ResNet, and call the final feature map set of FPN as {P2 , P3 , P4 , P5 , P6 }.
In this work, {P2 , P3 , P4 , P5 , P6 } have strides of {4, 8, 16, 32, 64} pixels with respect to the
input image.
In the inference stage, we compute the minimum-area oriented bounding box of the predicted segmentation map using the topological structural analysis algorithm of Suzuki et al. (1985). The minimum-area oriented bounding box has the same representation as the 𝜃-based OBB, which can be used directly for calculating mAP on the HRSC2016 dataset. For DOTA, the four vertices of the minimum-area oriented bounding boxes are used for evaluating performance.
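This post-processing step can be sketched as follows with OpenCV, whose contour extraction implements the border-following algorithm of Suzuki et al. (1985); an OpenCV ≥ 4 return signature and an illustrative probability threshold are assumed.

```python
import numpy as np
import cv2

def mask_to_obb(prob_map, threshold=0.5):
    """Convert a predicted per-pixel probability map into theta-based OBBs
    (cx, cy, w, h, angle), one per connected contour."""
    binary = (prob_map > threshold).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_EXTERNAL, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for contour in contours:
        (cx, cy), (w, h), angle = cv2.minAreaRect(contour)   # minimum-area rotated rectangle
        boxes.append((cx, cy, w, h, angle))
        # corners = cv2.boxPoints(((cx, cy), (w, h), angle))  # 4 vertices, e.g. for DOTA evaluation
    return boxes
```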
Experiments In this section, we first study different “first vertex” definition methods, which affect the performance of the point-based OBB and h-based OBB representations (Table 6.2), and then study the effect of different OBB representations (Table 6.3). For a fair comparison, we re-implement the above three bounding box representations on the same basic network structure as Mask OBB.
For the first vertex definition, we compare two methods. The first is the same as in Xia et al. (2018) and chooses the vertex closest to the “top left” vertex of the corresponding horizontal bounding box; we call this method “best point”. The other is defined by ourselves and chooses the “extreme top” vertex of the oriented bounding box as the first vertex, with the remaining vertices following in clockwise order; we call this method “extreme point”. As shown in Table 6.2, the “best point” method significantly outperforms the “extreme point” method on the OBB task of the DOTA dataset. Different “first vertex” definitions thus significantly affect the mAP of the OBB task, so if we want to obtain good performance on the OBB task using the point-based OBB and h-based
Table 6.2 Comparisons with different first vertex definition methods on the mAP of point-based
OBB and h-based OBB representations. “Best point” method significantly outperforms the “extreme
point” method on the OBB task of DOTA dataset.
dataset first vertex OBB represent. backbone OBB (%) HBB (%) gap (%)
Table 6.3 Comparison with different methods on the gap of mAP between HBB and OBB.
implementations OBB representation backbone OBB (%) HBB (%) gap (%)
OBB representations, we should design a special “first vertex” definition method that represents the OBB uniquely.
Regarding the different oriented bounding box representations, the gap between HBB and OBB performance is larger for the 𝜃-based, point-based, and h-based OBB representations than for Mask OBB. Theoretically, changing from the prediction of HBBs to OBBs should not affect the classification precision, but as shown in Table 6.3, the methods using regression-based OBB representations achieve higher HBB-task performance than OBB-task performance. We argue that this reduction is due to low-quality localization caused by the discontinuity point; there should not be such a large gap between HBB and OBB performance if the OBB representation is well defined, and the result of Mask OBB verifies that. In addition, the HBB and OBB mAPs of Mask OBB are nearly all higher than those of the other three OBB representations in our implementations.
For other implementations, FR-O (Xia et al. 2018) uses point-based OBB and obtains 60.46% HBB mAP and 54.13% OBB mAP, a gap of 6.33%. ICN (Azimi et al. 2018) also uses point-based OBB and obtains 72.45% HBB mAP and 68.16% OBB mAP, a gap of 4.29%. SCRDet (Yang et al. 2019c) uses 𝜃-based OBB and obtains 72.61% OBB mAP and 75.35% HBB mAP, a gap of 2.70%. Li et al. (2019) also use 𝜃-based OBB and obtain 73.28% OBB mAP and 75.38% HBB mAP, a gap of 2.10%. Note that the performances of ICN, SCRDet, and Li et al. are obtained using additional modules and data augmentation. The gaps between the HBB and OBB tasks of these methods (6.33%, 4.29%, 2.70%, 2.10%) are all
higher than that of Mask OBB (−0.16%). Therefore, we can conclude that Mask OBB is a better representation for the oriented object detection problem.
RRoI Learner The RRoI Learner aims to infer RRoIs from horizontal RoI features. Suppose we have obtained n horizontal RoIs, denoted Hi. For each Hi, we use (x, y, 𝑤, h) to
represent the center, width, and height of the horizontal RoI. The corresponding feature map is denoted as Fi. We can infer the geometry of the RRoI from Fi through a fully connected layer. The learning targets are formulated as:
t_x^* = \frac{1}{w_r}\bigl((x^* - x_r)\cos\theta_r + (y^* - y_r)\sin\theta_r\bigr),
t_y^* = \frac{1}{h_r}\bigl((y^* - y_r)\cos\theta_r - (x^* - x_r)\sin\theta_r\bigr),
t_w^* = \log\frac{w^*}{w_r}, \qquad t_h^* = \log\frac{h^*}{h_r},
t_\theta^* = \frac{1}{2\pi}\bigl((\theta^* - \theta_r) \bmod 2\pi\bigr),   (6.10)
where (xr, yr, 𝑤r, hr, 𝜃r) represent the center, width, height, and orientation of the RRoI and (x∗, y∗, 𝑤∗, h∗, 𝜃∗) the ground truth annotation. In fact, if 𝜃∗ = 3𝜋∕2, the offset relative to a horizontal RoI is a particular case of Eq. (6.10). The general relative offset is shown in Figure 6.11. There are three coordinate systems: XOY is the global coordinate system bound to the image, while x1o1y1 and x2o2y2 are two local coordinate systems bound to the RRoIs. (Δx, Δy) represents the offset between the oriented bounding box annotation and the RRoI, and 𝛼1 and 𝛼2 are the angles of the two RRoIs. The yellow rotated rectangle represents the ground truth annotation. We can transform the two rectangles on the left into the two on the right via translation and rotation, keeping the relative position unchanged. (Δx1, Δy1) and (Δx2, Δy2) are the same if we observe them in x1o1y1 and x2o2y2 respectively, but they are not the same if we observe them in XOY. To derive Equation 6.10, we therefore need to calculate the offsets in local coordinates such as x1o1y1.
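For concreteness, the following is a small numpy sketch of the targets in Equation 6.10 for one RRoI/ground-truth pair; names are illustrative and do not reproduce the authors' code.

```python
import numpy as np

def rroi_targets(rroi, gt):
    """rroi and gt: (cx, cy, w, h, theta). Returns (tx, ty, tw, th, ttheta) as in Eq. (6.10)."""
    xr, yr, wr, hr, thr = rroi
    xs, ys, ws, hs, ths = gt
    tx = ((xs - xr) * np.cos(thr) + (ys - yr) * np.sin(thr)) / wr   # offset along the RRoI x-axis
    ty = ((ys - yr) * np.cos(thr) - (xs - xr) * np.sin(thr)) / hr   # offset along the RRoI y-axis
    tw = np.log(ws / wr)
    th = np.log(hs / hr)
    ttheta = ((ths - thr) % (2 * np.pi)) / (2 * np.pi)              # normalized angle offset
    return tx, ty, tw, th, ttheta
```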
For each feature map ℱi, a fully connected layer outputs a vector (tx, ty, t𝑤, th, t𝜃):
t = 𝒢(ℱ; Θ),   (6.11)
where 𝒢 and Θ denote the fully connected layer and its parameters, and ℱ denotes the feature map of each HRoI. During training, there is a matching process between input HRoIs and the
annotated oriented bounding boxes (OBBs). For efficiency, we calculate the IoU between
input HRoIs and the corresponding HRoIs of annotated ground truths. For each matched
HRoI, we assign the learning target by Equation 6.10. We use the Smooth L1 Loss (Girshick
et al. 2014) for regression loss. For every predicted t, we decode it to the RRoI.
RRoI Warping Once we have the RRoI, rotation-invariant features can be extracted by RRoI Warping. Here, we implement RRoI Warping as Rotated Position Sensitive (RPS) RoI Align, because our baseline model is Light-Head R-CNN (Li et al. 2017). Given an input feature map of shape (H, W, K × K × C) and an RRoI (xr, yr, 𝑤r, hr, 𝜃r), where (xr, yr) denotes the center of the RRoI, (𝑤r, hr) its width and height, and 𝜃r its orientation, we divide the RRoI into K × K bins and output a feature map of shape (K, K, C). For the bin at location (i, j) (0 ≤ i, j < K) and channel c (0 ≤ c < C), we have
\mathcal{Y}_c(i, j) = \sum_{(x,y)\in\mathrm{bin}(i,j)} D_{i,j,c}\bigl(\mathcal{T}_\theta(x, y)\bigr)/n,   (6.12)
where \mathcal{Y}_c(i, j) is the pooled output for bin (i, j) and channel c, D_{i,j,c} is one feature map out of the K × K × C feature maps, and n is the number of sampling points per dimension. bin(i, j) is the coordinate set \{i\,w_r/K + (s_x + 0.5)\,w_r/(Kn);\ s_x = 0, 1, \ldots, n-1\} \times \{j\,h_r/K + (s_y + 0.5)\,h_r/(Kn);\ s_y = 0, 1, \ldots, n-1\}. Each (x, y) ∈ bin(i, j) is transformed to (x′, y′) by \mathcal{T}_\theta, where
\begin{pmatrix} x' \\ y' \end{pmatrix} = \begin{pmatrix} \cos\theta & -\sin\theta \\ \sin\theta & \cos\theta \end{pmatrix} \begin{pmatrix} x - w_r/2 \\ y - h_r/2 \end{pmatrix} + \begin{pmatrix} x_r \\ y_r \end{pmatrix}.   (6.13)
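The sampling transform of Equation 6.13 can be sketched as follows (numpy, illustrative names): it maps a sampling location defined in the upright 𝑤r × hr frame of the RRoI back into feature-map coordinates.

```python
import numpy as np

def rotate_sample_point(x, y, rroi):
    """Map an (x, y) sampling location inside an upright w_r x h_r box to the
    feature-map coordinates of the rotated RoI (Eq. 6.13)."""
    xr, yr, wr, hr, theta = rroi
    rot = np.array([[np.cos(theta), -np.sin(theta)],
                    [np.sin(theta),  np.cos(theta)]])
    offset = np.array([x - wr / 2.0, y - hr / 2.0])   # coordinates relative to the RoI center
    return rot @ offset + np.array([xr, yr])
```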
RoI Transformer for Oriented Object Detection The RoI Transformer can be used to replace the regular RoI warping operation such as RoI Align or RoI Pooling. The RoI Transformer outputs rotation-invariant features and a better initialization for the subsequent regression. After RRoI warping, we add one 2048-dimensional fully connected layer and two sibling fully connected layers for classification and regression, respectively. The classification targets are calculated as usual, whereas the regression targets are calculated differently, relative to the RRoIs.
Table 6.4 Comparison between deformable RoI pooling and RoI Transformer. DPSRP is the abbreviation for deformable position-sensitive RoI pooling.
Experiments We conduct the experiments on the DOTA (Xia et al. 2018) dataset. To verify that the performance gain does not come from extra computation, we compare the RoI Transformer with deformable position-sensitive RoI pooling, since both deformable RoI pooling and the RoI Transformer are RoI warping operations that can model geometric transformations. We use Light-Head R-CNN OBB (Li et al. 2017) as the baseline, and replace its position-sensitive RoI Align with either deformable position-sensitive RoI pooling or the RoI Transformer. The detailed results are shown in Table 6.4: the RoI Transformer is lightweight and efficient. We also compare with other methods in Table 6.5; the model Light-Head R-CNN + RoI Transformer outperforms them. The code of the RoI Transformer is available4; besides the original MXNet implementation, there is also a PyTorch version5.
4 https://github.com/dingjiansw101/RoITransformer_DOTA
5 https://github.com/dingjiansw101/AerialDetection
The detection process is shown in Figure 6.12. First, a false-alarm probability value is set, and the statistical characteristics of the background clutter are estimated from the pixels near the pixel under test; the detection threshold is then estimated adaptively from these statistics; finally, the value of the pixel under test is compared with the estimated threshold. If the value is larger than the threshold, the pixel is declared a target point; otherwise, it is a background pixel. There are many kinds of CFAR algorithms. According to the distribution assumed for the background clutter in SAR images, they can be divided into two categories: single-parameter CFAR (Finn and Johnson 1968) and multi-parameter CFAR (Leslie and Michael 1988). If the background clutter distribution is described by a single parameter (such as a Rayleigh or exponential distribution) (Goldstein 1973), it is a single-parameter CFAR, such as Cell-Averaging CFAR and Greatest-Of CFAR. If the background clutter distribution contains two or more parameters (such as Gaussian or Weibull distributions), it is a multi-parameter CFAR, such as the two-parameter CFAR. A simplified sketch of a cell-averaging CFAR detector is given below.
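As announced above, here is a simplified sketch of a cell-averaging CFAR detector on a SAR amplitude image; the guard/training window sizes and the scaling factor are illustrative, and in practice the factor would be derived from the desired false-alarm probability.

```python
import numpy as np

def ca_cfar(image, guard=2, train=6, scale=3.0):
    """Cell-averaging CFAR: the clutter level around each pixel is estimated from a ring
    of training cells (guard cells and the cell under test are excluded), and the pixel is
    declared a target if it exceeds `scale` times that estimate."""
    img = image.astype(float)
    h, w = img.shape
    detections = np.zeros((h, w), dtype=bool)
    r = guard + train                                       # half-size of the full window
    for i in range(r, h - r):
        for j in range(r, w - r):
            window = img[i - r:i + r + 1, j - r:j + r + 1].copy()
            window[train:-train, train:-train] = np.nan     # mask guard cells and the cell under test
            threshold = scale * np.nanmean(window)          # adaptive background clutter estimate
            detections[i, j] = img[i, j] > threshold
    return detections
```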
More specifically, to address the detection of small objects, a ship detector composed of an RPN and an object detection network with contextual features has been proposed (Miao et al. 2017). The strategy of fusing deep semantic and shallow high-resolution features helps to improve the detection performance for small-sized ships. A coupled convolutional neural network (CCNN) has also been designed for small and densely
clustered ships (Zhao et al. 2019). In CCNN, an exhaustive ship proposal network (ESPN) is designed for proposal generation, while an accurate ship discriminative network (ASDN) is used to exclude false alarms. In ESPN, features from different layers are reused, and representative intermediate layers are used to generate more reliable ship proposals. To rule out false alarms as accurately as possible, the context information of each proposal is combined with the original deep features in ASDN. To deal with the multiscale problem, a densely connected multiscale neural network (DCMSNN) has been proposed. Clearly, CNN-based methods have achieved great success. However, for SAR images there are not enough annotated samples for model training. Therefore, most deep-learning-based detectors for SAR images have to fine-tune networks pre-trained on large-scale natural image datasets, such as ImageNet. But a huge domain gap exists between SAR and natural images, which incurs a learning bias. To solve this problem, a method to learn a deep ship detector from scratch has been proposed (Deng et al. 2019). Learning from scratch also makes a redesign of the network structure possible: a condensed backbone network is designed, and several dense blocks are included to receive additional supervision from the objective function through the dense connections.
● SAR-Ship-Dataset (Wang et al. 2019a). The SAR ship detection dataset contains 102 high-resolution Gaofen-3 SAR images and 108 Sentinel-1 SAR images. At present, the dataset has 43,819 ships in complex backgrounds and is suitable for many SAR image applications.
● SSDD (Li et al. 2017). This is the first publicly available dataset for SAR image ship detection. It contains 1160 images and 2456 ships, with an average of 2.12 ships per image. The public SAR images were downloaded from the Internet and cropped to 500 × 500 pixels. The dataset collects images from the RadarSat-2, TerraSAR-X, and Sentinel-1 sensors, with resolutions of 1–15 m and four polarizations (HH, HV, VV, and VH). It contains ships both at sea and near the shore, and is therefore sufficient for training ship detection models.
● SpaceNet 6 Multi-Sensor All-Weather (Shermeyer et al. 2020). Capella Space collected the data via an airborne sensor. Each image has four polarizations (HH, HV, VH, and VV) and is preprocessed to show the backscatter intensity at a spatial resolution of 0.5 m. The dataset contains more than 48,000 high-quality building footprint annotations, with extra quality control over the labels: incorrectly marked areas were removed, and labels were added for unmarked or destroyed buildings. The dataset also contains a 3D component from a publicly available digital elevation model derived from airborne lidar; for each annotation, the mean, median, and standard deviation of the height in meters are reported for the 75th percentile. The height information will be valuable for estimating object height from overhead imagery.
● AIR-SARShip-1.0 (Xian et al. 2019). The dataset contains 31 large images collected from the Gaofen-3 satellite, at 1 m and 3 m resolution, including spotlight and stripmap imaging modes. The polarization mode is single-polarization, and the image
format is TIFF. Most of the images have a size of 3000 × 3000 pixels. The dataset also contains relevant scene context such as the surrounding sea, land, and ports, which makes it closer to real-world applications.
6.5 Conclusion
In this chapter, we first gave the definition of object detection and introduced the evaluation metrics as well as applications. Object detection in remote sensing images is quite different from object detection in natural images, and challenging. Because objects can appear in arbitrary orientations in remote sensing images, the oriented bounding box is more suitable for object detection in this domain. Besides arbitrary orientation, densely packed instances, scale variation, and inference on large-size images are also challenges for object detection in remote sensing images. We categorized previous works on object detection in remote sensing images according to the challenges they address, described two example algorithms, and analyzed their experimental results to show how some of these challenges can be tackled. With large-scale datasets for object detection in optical remote sensing images now available, significant progress has been made. However, the challenges mentioned above are still not fully solved and remain open problems. Besides, large-scale datasets for object detection in SAR remote sensing images still need to be established.
7
Deep Domain Adaptation in Earth Observation
Benjamin Kellenberger , Onur Tasar , Bharath Bhushan Damodaran , Nicolas Courty ,
and Devis Tuia
7.1 Introduction
Environmental data, and in particular Earth Observation data, are not stationary in space
and time. Signals acquired by sensors tend to vary (or shift) according to differences in acqui-
sition conditions. For example, a crop field will return a very different signature if observed
in the morning or at noon, after seeding, or at full growth of the crops or in different years of
exploitation. This is due to different atmospheric effects, different relative positions of the
sensor and the sun, or design differences in the sensors acquiring the measurements: for
example, satellites have slightly different spectral ranges for bands, even if named identi-
cally (e.g. the first near infrared band of WorldView 2 covers the range of 770–895 nm, while
the one of the French Pléiades covers 740–940 nm). All these factors lead to (spectral) sig-
natures that are different when observing the same object. Since machine learning models
base their decision on observed data only, these differences are oftentimes challenging. This
problem is denoted as dataset shift, and particularly affects the generalization capability of
a model. For example, a model trained to detect crops in Earth Observation data that has
been acquired in the morning may learn to use the shadows cast by the crop plants. If such
a model is applied on scenes acquired during mid-day, where such shadows are absent, it
will likely fail.
A second challenge related to multiple data acquisition campaigns is that class concepts
also drift in space and time and depend on where (and when) they are observed. Taking
again crops as example, crops are dynamic, as they grow and change in time: a model
trained on data acquired in an early growth stage will not be effective in recognizing crops at
later stages, since leaf characteristics and biomass-to-soil ratios will be different, even when
observed under the exact same experimental conditions. Shifts may be even worse in the
case of classification across different geographical regions, for example when developing a
model for building detection (as described in Chapter 5 of this book): models trained for
suburban areas of Western cities will specialize in the detection of buildings with very dif-
ferent definitions compared to those trained for e.g. central business districts of mega-cities.
Moreover, applying either of those models to a city in an Eastern country would probably
be unsuccessful, because of differences in architecture, materials used and planning habits.
This concept drift is also a serious problem hampering model generalization: when class def-
initions change between training and testing times, a model must be adapted accordingly
to be successful.
A third challenge is to develop models that can generalize across different sensor types
and configurations: this case arises when the same concept of interest is measured from dif-
ferent devices like sensors with different bands and spectral ranges or other data modalities
(sensor networks, LiDAR, etc.). In this case, traditional models cannot work, as they expect
a specific format of the input data (e.g. an image with a three-layered structure in case of
RGB data) that would not be respected at inference time. This last type of domain differ-
ence has been referred to as multi-modal domain shift and is related to the domain shift
case discussed above, but further requires models that are able to account for the different
sensors configurations and specifications explicitly. Note finally that it is realistic to expect
that most operational settings would suffer simultaneously from the three aforementioned
challenges, making the domain adaptation at the same time particularly challenging, but
omnipresent in modern applications.
In this chapter, we will discuss recent advances in methodologies addressing either of the
three challenges discussed above and present different approaches to achieve generalization
despite sensors or concepts differences, i.e. domain adaptation (Quiñonero-Candela et al.
2009). The field is actively studied in the communities of statistical learning (Redko et al.
2019), computer vision (Csurka 2017), and Earth Observation (Tuia et al. 2016b). Domain
adaptation in the geo-spatial domain has strong implications on the ambition of building
models that can be applied globally, at fine temporal scale, and with multiple sensors observ-
ing the planet from above. We will focus on methodologies addressing domain adaptation
for deep learning models (Wang and Deng 2018), which, at least in the Earth observation
community, is a young field with large potential for innovation.
In a domain adaptation context, the problem changes as the testing distribution (target
domain) differs from the training distribution (source domain). Two variants of method-
ologies can be distinguished: supervised domain adaptation deals with the case where
labeled data is also available in the target domain, although not in sufficient proportions to
train an accurate model, and unsupervised domain adaptation considers cases where
no labeled data is available for the target domain. Shallow (not deep learning-based) meth-
ods partially solve this problem by either reweighing source samples (Hidetoshi 2000), or
by learning a common representation through a subspace projection that is invariant to
domain shifts (Baktashmotlagh et al. 2013). In a deep learning setting, one can try to lever-
age the representational power of the network to mitigate the effect of domain shift. Deep
learning-based domain adaptation methods generally belong to one of three families:
● Adapting the inner representation: these methods attempt to minimize a statistical divergence criterion between the representations of the two domains at a given layer in the network. Popular choices for computationally tractable divergences include aligning second-order statistics (i.e. covariances) (Sun and Saenko 2016), contrastive domain discrepancy (Kang et al. 2019), maximum mean discrepancy (Mingsheng et al. 2015), or Wasserstein metrics (Damodaran et al. 2018). An alternative approach lies in adversarial training, which uses learning signals from a domain classifier to align inner source and target features (Ganin et al. 2016). A minimal sketch of such a covariance-alignment loss is given after this list.
● Adapting the input distribution: other approaches align the input data distributions
in the source and target domain before training the classifier. A first class of methods
adapts the image statistics of the target domain, either using a common latent repre-
sentation (as autoencoders) or using image-to-image translation principles (Zhu et al.
2017; Hoffman et al. 2018). A second class of methods focuses on generating adversar-
ial examples that fit as close as possible into the target domain distribution, and then
use these artificial data to train a domain-agnostic classifier. An example of this second
strategy is CoGAN (Liu and Tuzel 2016).
● Using (few, well-chosen) labels from the target domain: sometimes the shift
between domains is too severe or the class proportions vary too much, so that meth-
ods aligning domains in an unsupervised way cannot succeed. Supervised methods,
i.e. methods using labels from the target domain, address those cases, but at the price
of needing to annotate images from target. Strategies using selective sampling, or active
learning (Settles 2012) can be used in this case to minimize the sampling effort.
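As an illustration of the first family, the following minimal sketch re-implements a CORAL-style covariance-alignment loss (Sun and Saenko 2016) on mini-batches of features from a shared layer; it is an illustrative re-implementation, not the exact code of the cited work.

```python
import torch

def coral_loss(source_feats, target_feats):
    """Distance between the feature covariances of a source and a target mini-batch.
    source_feats, target_feats: (n_s, d) and (n_t, d) activations of a shared layer."""
    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return (x.t() @ x) / (x.size(0) - 1)
    d = source_feats.size(1)
    diff = covariance(source_feats) - covariance(target_feats)
    return (diff ** 2).sum() / (4 * d * d)   # squared Frobenius norm, as in Deep CORAL

# during training, this term is added (with a trade-off weight) to the supervised
# classification loss computed on the labeled source batch
```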
It is worth mentioning that most of these methods are known to work in controlled set-
tings, where the balance between classes is similar in the source and target domains. Vari-
ants of those methods, closer to real-world applications, are an active subject of research.
This includes models for target shift, open set domain adaptation, partial domain adap-
tation, and multi-modal, sometimes called heterogeneous domain adaptation. Target shift
occurs when source and target domains do not share the same class proportions. In open
set domain adaptation, new classes are present in the target domain. Conversely, in par-
tial domain adaptation, few classes are absent in the target domain, which is a specific
case of the concept drift problem presented in the introduction. Finally, the multi-modal
adaptation problem occurs when the source and target domains do not live in the same
space. In the remainder of this chapter, we discuss examples of those three families of meth-
ods to give the reader an idea of concrete deep domain adaptation methods, without being
exhaustive.
Figure 7.1 Domain adaptation loss (red) imposed on a CNN’s feature vectors produced by the penultimate layer (“fc1”).
● Deep Joint Optimal Transport (DeepJDOT; Damodaran et al. (2018)) likewise attempts to
minimize the discrepancy between the feature vector distributions of both domains, but
also incorporates the label information associated to the features, thus aligning the joint
distributions between feature and labels. It does so with Optimal Transport, which yields
source-to-target couplings that provide the minimum Wasserstein distance between the
two domains (Courty et al. 2017).
Experiments In the following, we evaluate the adaptation performance of all three method-
ologies on a remote sensing image classification problem (as described in Chapter 5 of
this book). We employ two datasets named UC Merced (Yang and Newsam 2010) and
WHU-RS19 (Dai and Yang 2010). Both datasets contain satellite-derived RGB images, but
obtained from different sources (USGS National Map Urban Area Imagery Collection for
UC Merced, and Google Earth for WHU-RS19), with different image sizes (256 × 256
for UC Merced, 600 × 600 for WHU-RS19), different numbers of images per class (100 for
UC Merced and 50 for WHU-RS19) and different class definitions. The two datasets were
therefore limited to a set of ten overlapping classes, as described in Table 7.1. Both datasets
were further divided by selecting 60% of the images of each class at random for training,
10% for validation, and 30% for testing. Example images are shown in Figure 7.2. As can
be seen, the different resolutions, label classes, and geographic locations together pose
a comparably strong domain shift. The experiments below will investigate adaptation
performances of the three models above, and in both directions (i.e., UC Merced →
WHU-RS19, and the inverse). Code to reproduce the experiments is provided on the
dedicated GitHub page1 .
As a base classifier, we employ a ResNet-50 (He et al. 2016b) that has been pre-trained on
the ImageNet classification dataset (Deng et al. 2009). We replace the last fully-connected
layer with a new one with random initialization to match the number of classes in our
Table 7.1 Common label classes between the UC Merced and WHU-RS19 datasets.

Abbr.  UC Merced                      WHU-RS19
AG     agricultural                   farmland
AP     airplane                       airport
BE     beach                          beach
DR     buildings, dense residential   commercial, industrial
FO     forest                         forest
HA     harbor                         port
MR     medium residential             residential
VI     overpass                       viaduct
PA     parking lot                    parking
RI     river                          river
1 https://github.com/bkellenb/da-dl4eo
Figure 7.2 Examples from the UC Merced (top) and WHU-RS19 (bottom) datasets.
datasets. We initially train one model for each of the two datasets, with equal settings and
hyperparameters for both: we draw minibatches of 32 images from the training set, resize
the images to 128 × 128 pixels, and apply data augmentation in the form of random hor-
izontal and vertical flips, as well as a slight color jitter. We use a softmax and multi-class
cross-entropy loss for training, and employ the Adam optimizer (Kingma and Ba 2014) with
an initial learning rate of 10−4 that gets divided by 10 after every ten epochs. We do not use
weight decay for training. In order to ensure convergence, these base models are trained for
100 epochs on their respective dataset.
In a second step, we use the pre-trained models as a starting point for the three domain
adaptation strategies presented above. We keep all settings constant for all strategies, but
start with a learning rate of 10−5 . In addition to the cross-entropy loss on the predictions
in the source domain, we add a domain adaptation loss from one of the three strategies
on the 2048-dimensional feature vector output after the global average pooling layer in
the ResNet-50 (i.e., the penultimate layer in the model). We train the respective model for
another 100 epochs, with one epoch being defined as the maximum of the lengths of the
two datasets, drawing 32 images per minibatch from each dataset at random.
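As an illustration of how such an adaptation loss can be attached to the penultimate features, the following is a minimal, hedged sketch of a Deep CORAL-style term (Sun and Saenko 2016), i.e. a squared Frobenius distance between source and target feature covariances; the tensor shapes and the weighting factor lambda_da are assumptions made for illustration, not the exact configuration used in these experiments.

```python
import torch

def coral_loss(source_feats, target_feats):
    """Deep CORAL-style adaptation loss: squared Frobenius distance between the
    covariance matrices of source and target minibatch features.
    source_feats: (n_s, d), target_feats: (n_t, d), e.g. the 2048-dimensional
    pooled ResNet-50 features mentioned above."""
    d = source_feats.size(1)

    def covariance(x):
        x = x - x.mean(dim=0, keepdim=True)
        return x.t() @ x / (x.size(0) - 1)

    c_s, c_t = covariance(source_feats), covariance(target_feats)
    return ((c_s - c_t) ** 2).sum() / (4 * d * d)

# Hypothetical combined objective during adaptation:
# loss = cross_entropy(source_logits, source_labels) + lambda_da * coral_loss(f_s, f_t)
```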
Table 7.2 shows overall accuracies on the test sets for the unadapted baseline models
(top), the three domain adaptation strategies (middle), as well as the target models (bottom).
A first observation to make is that both source models yield perfect predictions on their
respective datasets’ test sets (“target only”), indicating that the separability of the datasets
Table 7.2 Overall accuracies for the discussed datasets and domain adaptation methods.
Figure 7.3 Confusion matrix of the source only model (top left) and differences to it for the domain adaptation strategies (MMD, Deep CORAL, DeepJDOT) on the WHU-RS19 test set. Best viewed in color.
and learning capacity of the ResNet-50 are sufficient if enough labeled data from the specific
domain are available. If applied without adaptation to the other domain’s test sets (“source
only”), accuracies drop by 34 and 41 absolute percentage points, respectively. A look at the per-class predictions
(Figure 7.3) reveals that the primary confusion occurs between “airplane” and “commer-
cial, industrial”, with 88% of the WHU-RS19 “industrial” images misclassified as “airplane”.
This is likely to be attributable to WHU-RS19 showing entire airports rather than single air-
planes, which looks similar to more industrial scenes. Other classes being confused are
“buildings” and “industrial” (around 32% false positives), and “agricultural” and “viaduct”
(around 20% false positives).
The three domain adaptation methods manage to slightly improve the overall accuracy
in the UC Merced → WHU-RS19 case, and significantly raise the accuracy in the inverse
adaptation experiment. As for the first adaptation direction, MMD lowers the confusion
between “airplane” and “commercial, industrial” from 88% to 69%, but increases other con-
fusions, such as between “industrial” and “residential”. DeepCORAL does not significantly
reduce the confusion between any two specific classes, but provides a more mixed result, decreasing confusion between some pairs while increasing it in others. Finally,
DeepJDOT significantly increases the true positive predictions or leaves them unaffected
for all but two classes (“viaduct” and “parking”). It also significantly reduces the confusion
between “dense residential” and “airplane”. These improvements can in parts be attributed
to the fact that DeepJDOT tries to minimize the feature vector distance between specific
source-target samples, retrieved through Optimal Transport. This stands in contrast to
MMD and DeepCORAL, which align samples that are close in feature space, but not
necessarily with respect to the global distributions. This only works well if the source and
target distributions lie in manifolds that are comparable with respect to global distribution
characteristics. As soon as source and target samples of dissimilar classes lie closely
together, those methods may force the model to consistently mis-predict target samples.
● Generate synthetic data: as described above, the intent behind synthetic data is to capture
the distribution of the target domain. Those artificial data samples can then in turn be
used to train or fine-tune a model on the fake source data. By proceeding this way, the
distributions of the fake source and target data resemble each other, which is supposed
to increase final model performance.
● Standardize both domains: a second strategy, often used in hyperspectral imaging (Gross
et al. 2019) and multi-source image processing (Tuia et al. 2016a), maps the samples from
both source and target domains into a common subspace. This is done in such a way that
the samples belonging to the source domain are representative for the target domain.
Then, good predictions can be obtained by training a model on the standardized source
data and classifying the mapped target samples.
● CycleGAN (Zhu et al. 2017) trains two GANs, with the first one generating target
synthetic samples, and the second mapping the generated target samples back to the
source domain. The aim is that the back-transformed, generated samples are realistic
with respect to the source domain.
● UNIT (Liu et al. 2017b) maps both source and target data to a common latent space. Fake
data are then sampled from the latent space.
● MUNIT (Huang et al. 2018d) combines the content code of a domain with the style code
of another domain via Adaptive Instance Normalization (Huang and Belongie 2017).
● DRIT (Lee et al. 2018b): Similar to MUNIT, DRIT also combines the content code of one domain with the style code of another domain. The difference is that the combination is performed through concatenation, rather than Adaptive Instance Normalization.
● ColorMapGAN (Tasar et al. 2020) maps colors of the source domain to those of the target
domain to correct the domain shift. Unlike the other GANs, the generator of ColorMap-
GAN performs only matrix multiplication and addition operations.
These deep learning-based methods will also be compared with two traditional image
normalization methods:
● Histogram matching (Gonzalez and Woods 2006): The histograms for the spectral bands
of the source data are matched with the histograms of the target data. This method is not
based on GANs or any deep learning-based approaches.
● Gray world (Buchsbaum 1980): All the methods described above align the data distribution of the source domain to the distribution of the target domain. The Gray world algorithm, on the other hand, assumes that the color of the illuminant strongly affects the colors of the objects and aims at removing this effect. We use the original Gray world algorithm to standardize both domains separately (a minimal sketch follows this list).
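The sketch below illustrates the Gray world normalization referred to in the last item; the (H, W, bands) array layout and per-band rescaling to the global mean are assumptions made for illustration.

```python
import numpy as np

def gray_world(image):
    """Gray world normalization (Buchsbaum 1980): assuming the average scene
    reflectance is achromatic, each band is rescaled so that its mean matches
    the global mean intensity, removing the cast of the illuminant.
    image: float array of shape (H, W, bands)."""
    band_means = image.reshape(-1, image.shape[-1]).mean(axis=0)
    return image * (band_means.mean() / band_means)

# Applied independently to the source and the target domain images.
```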
Discussion As can be seen in Figure 7.4, the fake source data generated by UNIT, MUNIT,
and DRIT are semantically inconsistent with the real source data. Therefore, the fake
source data and the ground truth for the real source data do not match, which results in a
Figure 7.4 Source, target, and fake source images: (a) target city, (b) source city, (c) CycleGAN, (d) UNIT, (e) MUNIT, (f) DRIT, (g) histogram matching, (h) ColorMapGAN. Best viewed in color.
Figure 7.5 Real data and the images standardized by the Gray world algorithm: (a) source, (b) standardized source, (c) standardized target, (d) target. Best viewed in color.
poor performance. Hence, we conclude that these algorithms should not be considered for
data augmentation. Figure 7.5 shows that the large gap between domains can be reduced
by the Gray world algorithm. However, as confirmed by Table 7.3, the performance is not
satisfactory.
Figure 7.6 Classification maps on the target city (Villach) by the U-net fine-tuned on the fake data: (a) target city, (b) ground truth, (c) U-net, (d) CycleGAN, (e) UNIT, (f) MUNIT, (g) DRIT, (h) Gray world, (i) histogram matching, (j) ColorMapGAN. Blue, green, and white colors show the building, road, and tree classes, respectively. In black are pixels for which no class was predicted with more than 50% confidence.
Figure 7.8 Example drone images from the source (left) and target (right) domains.
cameras, they exhibit domain shifts due to multiple causes, such as terrain properties,
animal species, and illumination conditions. The consequence of these shifts is that a CNN trained on the source domain will generate a high number of false positives in the target domain. Even
worse, the number of objects of interest is minuscule in comparison to the vast amounts of
background pixels, which makes this a challenging needle-in-the-haystack problem. This
setting requires domain adaptation strategies that are robust to class imbalances.
Method One way to achieve robustness to imbalance is to focus on predictions that are most
likely to be true positives, with respect to their location in the feature space. Figure 7.9
shows t-SNE embeddings (van der Maaten and Hinton 2008) of source (left) and target
(right) domain predictions of a CNN detector. Since the model has been trained on source,
it predicts a higher number of true positives (blue in the color version, black when printed
in black and white) in that domain, compared to the significantly more false alarms (red
in the color version, gray when printed in black white) in target. Of particular interest is
the rightmost feature space area of the source domain, where the concentration of true
Figure 7.9 Feature space projections using t-SNE (van der Maaten and Hinton 2008) for
predictions of the unadapted model in the source (left) and target (right) domains. True positives
are shown in blue (black when printed in black and white), false positives in red (gray when printed
in black and white). The gray lines show Optimal Transport correspondences between the target
true positives and associated source samples.
102 7 Deep Domain Adaptation in Earth Observation
positives is highest. The domain adaptation criterion of Kellenberger et al. (2019), named
“Transfer Sampling” (TS), exploits the fact that this concentration in source exists, and tries
to find the same hotspot of true positives in the (unlabeled) target domain. To this end,
the work employs Optimal Transport (OT; Courty et al. (2017)), which is a distribution
matching framework that finds correspondences between all samples of two distributions
with respect to a minimal global cost. In essence, this means that source and target samples
correspond well to each other if they are close according to a distance, such as an 𝓁2 norm,
and lie in similar regions of the distribution.
These source-to-target associations are shown with gray lines in Figure 7.9. In the domain
adaptation setting, OT essentially helps re-finding the hotspot of true positives in the target
domain – note how most of the lines point from the source true positives hotspot to the
target true positives. At runtime, the labels of the target domain are initially unknown,
but the source ground truth is assumed to be present. Hence, by localizing the source true
positives hotspot and establishing an OT source-to-target correspondence, it is theoretically
possible to transfer the source labels to the target domain. For additional robustness to class
imbalance, Kellenberger et al. (2019) instead transfers the “goodness” of the source samples,
e.g. by their distance from the false positives hotspot (leftmost area in Figure 7.9), to the
target domain and uses the scores as an AL criterion.
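To give a concrete flavor of the OT step, the snippet below sketches an entropy-regularized Sinkhorn coupling between source and target feature sets, used here as a stand-in for the exact OT solver of Courty et al. (2017); the feature dimensions, regularization strength, and iteration count are illustrative assumptions.

```python
import torch

def sinkhorn_coupling(xs, xt, reg=0.1, n_iters=100):
    """Entropy-regularized OT coupling between xs (n_s, d) and xt (n_t, d)."""
    cost = torch.cdist(xs, xt, p=2) ** 2               # pairwise squared distances
    K = torch.exp(-cost / reg)                          # Gibbs kernel
    a = torch.full((xs.size(0),), 1.0 / xs.size(0))     # uniform source marginal
    b = torch.full((xt.size(0),), 1.0 / xt.size(0))     # uniform target marginal
    u, v = torch.ones_like(a), torch.ones_like(b)
    for _ in range(n_iters):                            # Sinkhorn-Knopp scaling
        u = a / (K @ v)
        v = b / (K.t() @ u)
    return u[:, None] * K * v[None, :]                  # coupling matrix (n_s, n_t)
```

Each row of the resulting coupling indicates which target samples a given source sample is associated with, which is the correspondence structure exploited by Transfer Sampling.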
Results Figure 7.10 shows precision-recall curves of the CNN in the source (left) and tar-
get (right) domains, and also model performances after adaptation. The unadapted model
(right panel, black curve) clearly loses precision in target, but still reaches roughly the same
recall of more than 80%. When adapted, the model significantly gains in precision, and does
so the most when using TS.
In addition, Figure 7.11 shows the number of animals found in the target dataset during
the ten AL iterations. Both TS and max confidence find about 80% of the total number of
animals present (dash-dotted line), but TS manages to do so one entire AL iteration (50 label
2 https://kuzikus-namibia.de/xe_index.html
Figure 7.10 Precision-recall curves for the CNN on the source (left) and target (right) datasets, without domain adaptation (black) and with adaptation (colored: random, Breaking Ties, max confidence, Transfer Sampling). Colored curves show model performances on target after the final AL iteration. Best viewed in color.
Figure 7.11 Cumulative number of animals found over the course of the ten AL iterations (50 to 500 oracle queries), for random sampling, Breaking Ties, max confidence, and Transfer Sampling, with CNN updates at the end of every AL iteration (solid), and by simply sampling from the initial, unadapted CNN (dashed). The black dashed line denotes the total number of animals present in the dataset (upper bound).
queries) earlier. Breaking Ties and random sampling are not designed to focus on the most likely true positives, and hence retrieve fewer animals in the same span. Finally, the graph fur-
ther shows that fine-tuning the CNN after every iteration (solid) yields significantly higher
animal counts than just employing the criteria, without adapting the model (dashed).
Discussion One may find that unsupervised domain adaptation is not always suitable,
depending on the data and application. In the example shown, the scarcity of the animals poses a significant class imbalance problem for the adaptation process, which requires
a strategy that is virtually immune to such imbalances. Furthermore, the specific use
case of animal detection not only requires satisfactory model performance in the target
domain, but also an economic animal retrieval rate during adaptation. To this end, it may
be required to obtain ground truth for the most relevant target domain samples, which
can be achieved using AL. The presented TS criterion was designed explicitly for such
purposes and, contrary to conventional AL criteria, focuses on the most probable target
samples, which makes it robust to class imbalances and provides a high object retrieval
rate already during the adaptation process.
In this chapter, we presented recent advances in domain adaptation for geospatial data.
Starting from a categorization of methods based on the stage they affect (inputs, inner repre-
sentation or usage of labeled data), we presented a series of comparisons on remote sensing
data and described pros and cons of the different approaches. We also provided reproducible
code (on GitHub) to allow experimenting and better understanding of the properties of the
model and further adoption of domain adaptation methods in geosciences.
The categories of approaches presented are clearly not exclusive, and one could poten-
tially design hybrid methods using several aspects at once. But independently of the method
of preference, we hope that we raised awareness of the need to deal with dataset shift and the pitfalls one could fall into if such distribution changes are not taken into account
during model design and training.
8
Recurrent Neural Networks and the Temporal
Component
Marco Körner and Marc Rußwurm
In the previous chapters, input data was assumed to be individual multi-dimensional measurements $x \in \mathbb{R}^{D}$, $0 < D \in \mathbb{N}$. If these signals come organized in a matrix-like structure, i.e. $x \in \mathbb{R}^{M \times N}$, $0 < M, N \in \mathbb{N}$, convolutional neural networks are, by design, able to process each individual measurement, i.e. each pixel, considering its spatial context and, thus, can be applied to images of various scales $M \times N$.
Earth observation data, in general, is mostly provided in the form of sequential data, i.e.
$\mathbb{X} = \left\{ x_t = h(f(t)) \in \mathbb{R}^{D} \right\}_t$  (8.1)
or
$\mathbb{X} = \left\{ x_t = h(f(t)) \in \mathbb{R}^{M \times N} \right\}_t, \quad 0 \le t \le T \in \mathbb{N}$  (8.2)
representing a continuous-time dynamical system f sampled at discrete time-stamps t and
projected into any observation space by an observation model h. As illustrated in Figure 8.1,
regular DNNs can only process observations instance-wise, i.e.
$y_t = f_{\text{DNN}}(x_t)$,  (8.3)
or fixed-length concatenations of multiple observations, i.e.
$y_t = \tilde{f}_{\text{DNN}}(x_t, x_{t-1}, \ldots, x_{t-p})$.  (8.4)
On the contrary, it appears natural that the temporal context should be taken into account
dynamically when processing this kind of data, i.e.
$y_\tau = \hat{f}_{\text{DNN}}\left( \{ x_t \}_{t=0}^{\tau} \right)$.  (8.5)
For a long time, graphical models, e.g. hidden Markov models (HMMs) (Rabiner and
Juang 1986), were considered the method of choice when dealing with time series of Earth
observation data (Siachalou et al. 2015; Hoberg et al. 2015). These generative models update
an unobservable hidden state of a dynamical system following the Markov assumption, i.e.
solely by means of the state at the previous time step and the current observation, yielding
the generic update rule
$s_t \leftarrow \pi\left( s_{t-1}, x_t \right)$.  (8.6)
Figure 8.1 (a) A single-time feed-forward neural network; (b) a multi-temporal feed-forward neural network.
This abstract representation of the underlying processes producing the observations is fur-
ther used to derive the final outputs, e.g. classification labels,
$y_t = g\left( s_t \right)$.  (8.7)
Their straightforward formulation and tractable training regimes, like the Baum–Welch algorithm (Baum and Petrie 1966), made them convenient to use and brought considerable success.
Despite that, HMMs show some major drawbacks, as their individual designs encode a
lot of domain and expert knowledge. In particular, suitable state-space formulations and
the correct temporal dependencies have to be modelled manually. Especially the extent of
the temporal context used within the model needs to be hard-wired a-priori and remains
unchanged throughout the entire process. This evidently contradicts the fundamental prin-
ciple of representation learning that underlies the deep learning concept.
(8.14)
encode the connections of an RNN, as shown in Figure 8.3, where bold edges highlighted
in grey and respective matrix elements correspond to recurrent feed-back loops.
$\mathfrak{L} = \sum_{t=0}^{T} \mathfrak{L}_t$.  (8.15)
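As a minimal illustration of Equation 8.15, the sketch below unrolls a simple recurrent cell over a sequence and accumulates a per-time-step loss; the tanh cell, layer sizes, and mean-squared-error objective are assumptions made only for this example.

```python
import torch
import torch.nn as nn

cell = nn.RNNCell(input_size=4, hidden_size=16)   # h_t = tanh(W_in x_t + W_rec h_{t-1} + b)
readout = nn.Linear(16, 1)

def sequence_loss(x, y):
    """x: (T, 4) input sequence, y: (T, 1) targets.
    Returns the total loss L = sum_t L_t as in Equation 8.15."""
    h = torch.zeros(1, 16)
    total = torch.zeros(())
    for t in range(x.size(0)):
        h = cell(x[t:t + 1], h)                   # recurrent state update
        total = total + nn.functional.mse_loss(readout(h), y[t:t + 1])
    return total
```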
Figure 8.3 When a recursive, feed-back neural network is unrolled through time, it can be represented as a feed-forward neural network with layers corresponding to the individual time steps.
Figure 8.4 The computational graph of an unrolled RNN with forward (black arrows) and backward passes (colored arrows).
For updating, e.g., the shared recurrent weights $W_{\text{rec}}$, the partial derivatives
$\frac{\partial \mathfrak{L}}{\partial W_{\text{rec}}} = \sum_{t=0}^{T} \frac{\partial \mathfrak{L}_t}{\partial W_{\text{rec}}}$  (8.16)
of the loss need to be computed. Similarly, for updating the internal states $h$, gradients need to be propagated from $h_{t+\tau}$ back to $h_t$. For simplicity, let the nonlinear activation
$\varsigma(x) = \varsigma_{\text{ReLU}}(x) = \max(0, x)$  (8.17)
in Equation 8.13 be the rectified linear unit (ReLU); Equation 8.16 then reformulates to
$\frac{\partial \mathfrak{L}}{\partial W_{\text{rec}}} = \sum_{t=0}^{T} \frac{\partial \mathfrak{L}_t}{\partial W_{\text{rec}}} = \sum_{t=0}^{T} \frac{\partial \mathfrak{L}_t}{\partial h_t} \cdot \prod_{\tau=0}^{t} \frac{\partial h_{t-\tau}}{\partial h_{t-\tau-1}} \cdot \frac{\partial h_1}{\partial W_{\text{rec}}}$  (8.18)
which contains partial gradients
$\frac{\partial \mathfrak{L}_t}{\partial h_{t-k-1}} = W_{\text{rec}} \cdot \left( \mathbb{1}_{\ge 0}\left( h_{t-k-1} \right) \odot \frac{\partial \mathfrak{L}_t}{\partial h_{t-k}} \right)$  (8.19)
iteratively defined by their upstream gradients $\frac{\partial \mathfrak{L}_t}{\partial h_{t-k}}$. Here, it can be clearly seen that
backpropagating gradients through $\tau \in \mathbb{N}$ time steps requires $\tau$ multiplications of the recurrent weight matrix $W_{\text{rec}}$, i.e. $\left(W_{\text{rec}}\right)^{\tau}$. Depending on their actual values – or, more precisely, their eigenvalues after eigenvalue decomposition¹ –, this yields an
exponential growth or shrinkage of the recurrent weights, such that small perturba-
tions in early iterations might manifest massive effects in later iterations. This effect is
commonly referred to as exploding and vanishing gradients (Hochreiter 1991; Bengio
et al. 1994), respectively, and poses a major problem when training very deep neural
networks, and so RNNs, using gradient-based optimization, as the objective becomes
effectively discontinuous. Thus, RNNs forfeit their ability to capture long-term temporal
dependencies.
Real-time Recurrent Learning Probably the most obvious countermeasure against exploding and vanishing gradients is to avoid backward passes through time entirely. The real-time
recurrent learning (RTRL) algorithm (Williams and Zipser 1989), for instance, deter-
mines an optimal parameter update with only one complete forward pass and without
memorizing elapsed hidden states. Thus, RTRL resembles a purely online learning
procedure, as opposed to the batch-wise offline learning strategy in BPTT. Nevertheless,
its extraordinarily high runtime and memory complexity make this procedure impractical
to be used.
¹ $\left(W_{\text{rec}}\right)^{\tau} = \left(U \boldsymbol{\Lambda} U^{-1}\right)^{\tau} = \underbrace{U \boldsymbol{\Lambda} U^{-1} \cdot U \boldsymbol{\Lambda} U^{-1} \cdot \ldots}_{\tau \text{ times}} = U \boldsymbol{\Lambda}^{\tau} U^{-1}$, since $U^{-1} U = 1$.
Truncated back-propagation through time In a similar way that deep neural network architectures are suggested to be kept as shallow as possible, RNNs could be constrained to propagate gradients not back to the very beginning of the time series, but only for a fixed number of steps. Hence, approximating Equation 8.18 by
$\frac{\partial \mathfrak{L}}{\partial W_{\text{rec}}} \approx \sum_{t=0}^{T} \frac{\partial \mathfrak{L}_t}{\partial h_t} \cdot \prod_{\tau=0}^{k} \frac{\partial h_{t-\tau}}{\partial h_{t-\tau-1}} \cdot \frac{\partial h_1}{\partial W_{\text{rec}}}, \quad t > k \in \mathbb{N},$  (8.20)
$\frac{\partial \mathfrak{L}}{\partial W_{\text{rec}}} \approx \sum_{t \,\equiv\, T \,(\mathrm{mod}\ \kappa)} \frac{\partial \mathfrak{L}_t}{\partial h_t} \cdot \prod_{\tau=0}^{k} \frac{\partial h_{t-\tau}}{\partial h_{t-\tau-1}} \cdot \frac{\partial h_1}{\partial W_{\text{rec}}}, \quad T > \kappa \in \mathbb{N},$  (8.21)
yields the truncated back-propagation through time (tBPTT) algorithm (Elman 1990;
Williams and Peng 1990; Williams and Zipser 1995).
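In modern automatic-differentiation frameworks, the truncation of Equations 8.20–8.21 is typically realized by detaching the hidden state between chunks of length k. The following is a minimal, illustrative sketch; the network sizes, loss, and chunk length are assumptions rather than a prescribed configuration.

```python
import torch
import torch.nn as nn

# Hypothetical toy model: sizes are illustrative assumptions.
rnn = nn.RNN(input_size=8, hidden_size=32, batch_first=True)
head = nn.Linear(32, 1)
optim = torch.optim.Adam(list(rnn.parameters()) + list(head.parameters()), lr=1e-3)

def tbptt(x, y, k=10):
    """x: (1, T, 8) input sequence, y: (1, T, 1) targets.
    Gradients are propagated through at most k time steps per chunk."""
    h = None
    for start in range(0, x.size(1), k):
        xc, yc = x[:, start:start + k], y[:, start:start + k]
        out, h = rnn(xc, h)                       # forward over one chunk
        loss = nn.functional.mse_loss(head(out), yc)
        optim.zero_grad()
        loss.backward()                           # BPTT within the chunk only
        optim.step()
        h = h.detach()                            # truncate the gradient path
```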
Temporal Skip Connections As the explosion and vanishing rates of an RNN are proportional to a function of the number of time steps $\tau$ to be back-propagated, temporal skip connections introduced into the recursive architecture might help to diminish this effect. If, for instance, a static time delay $\delta_t$ is added (Lin et al. 1996), the vanishing rates will now grow proportionally to a function of $\frac{\tau}{\delta_t}$ instead, while gradient explosion is still possible.
Truncated Gradients When optimizing parameterized systems using gradient descent, large gradients $g = \frac{\partial \mathfrak{L}}{\partial \boldsymbol{\omega}}$ with respect to certain parameters $\boldsymbol{\omega}$ yield large updates for these parameters. In awareness of the ability of gradients to explode, one commonly used strategy is to limit the gradient values, or the length of the gradient vector $g$, not to exceed a certain threshold $0 < c \in \mathbb{R}$, e.g. by rescaling
$g := \begin{cases} c \cdot \frac{g}{\lVert g \rVert} & \text{if } \lVert g \rVert \ge c \\ g & \text{else} \end{cases}$.  (8.22)
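Equation 8.22 corresponds to what most deep learning libraries implement as gradient norm clipping; a minimal sketch follows, with the threshold value chosen arbitrarily for illustration.

```python
import torch

def clip_gradient(g, c=1.0):
    """Rescale the gradient vector g so that its norm does not exceed the
    threshold c, as in Equation 8.22."""
    norm = g.norm()
    return g * (c / norm) if norm >= c else g

# In practice, the same effect is obtained for all parameters of a model with
# torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0).
```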
Regularization Another commonly used way to restrict the parameter space during optimization is regularization, i.e.
$\boldsymbol{\theta}^{*} = \operatorname{argmin}_{\boldsymbol{\theta}} \, \mathfrak{L}(f, \boldsymbol{\theta}) + \lambda \cdot \Omega(\boldsymbol{\theta}),$  (8.23)
where $\Omega(\boldsymbol{\theta})$ is a defined measure of complexity of the parameter set $\boldsymbol{\theta}$. In the context of training RNNs, Pascanu et al. (2013) proposed the regularizer
$\Omega(\boldsymbol{\theta}) = \sum_{t=0}^{T-1} \left( \frac{\left\lVert \frac{\partial \mathfrak{L}}{\partial h_{t+1}} \cdot \frac{\partial h_{t+1}}{\partial h_t} \right\rVert}{\left\lVert \frac{\partial \mathfrak{L}}{\partial h_{t+1}} \right\rVert} - 1 \right)^{2}$  (8.24)
to enforce norm-preserving error updates.
Weight Initializations The problem when magnitudes of parameter updates grow without control is that a (local) optimum $\boldsymbol{\theta}^{*}$ could easily be missed. As this is only problematic in non-convex optimization settings, a careful selection of the initial parameterization $\boldsymbol{\theta}^{(0)}$ is often
used to ensure stable convergence. For RNNs entirely based on ReLU activations, Le et al.
(2015) propose to initialize the recurrent weights and biases with Wrec = I and brec = 0,
respectively. In contrast, Talathi and Vartak (2015) propose to initialize Wrec in a way that
its eigenvalues are normalized with respect to its largest one.
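The initialization proposed by Le et al. (2015) can be expressed in a few lines; the layer sizes below are illustrative assumptions.

```python
import torch.nn as nn

# ReLU-based RNN with identity recurrent weights and zero recurrent biases,
# following the initialization proposed by Le et al. (2015).
rnn = nn.RNN(input_size=8, hidden_size=32, nonlinearity='relu')
nn.init.eye_(rnn.weight_hh_l0)     # W_rec = I
nn.init.zeros_(rnn.bias_hh_l0)     # b_rec = 0
```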
The aforementioned strategies to mitigate the problems of exploding and vanishing gradients all come with their particular advantages and downsides and are, thus, only rarely used in practice. Opposed to that, the idea of augmenting RNNs with internal gates that actively control the extent of temporal information to be memorized – or to be forgotten – has brought the community considerable success in addressing these problems.
single forget gate. While these reformulations reduce the number of learnable parameters,
they lose their ability to detect context-free languages and, hence, to count or to model the
frequency of perceived events (Weiss et al. 2018).
A single recurrent cell (cf. Figure 8.6(a)) comes with a certain capacity and can, thus, store
a limited amount of information. This capacity is architecture-specific and grows linearly
with the number of its parameters (Collins et al. 2017). The particular gates of LSTM cells
and their variants, which are introduced to direct the flow of information to improve the
practical training properties, result in a reduced information storage capacity. For this rea-
son, recurrent cells are commonly organized in various topologies to increase their repre-
sentative capabilities.
Figure 8.6 As the capacity of a single RNN cell is limited, several RNN cells can be organized in different sequential topologies, resulting in higher-capacitive networks.
Figure 8.7 Bi-directional RNNs (Schuster and Paliwal, 1997) and LSTM networks (Graves and Schmidhuber, 2005) contain an additional backward path to process the input data in inverse sequential order. This allows them to access past and future data, which makes them suitable to be employed in offline problem settings.
predicted whenever new data $x_{t+1}$ becomes available. For offline problems, when the entire data sequence $\mathbb{X}_{1 \to T}$ is already available, this becomes an undesired and unnecessary restriction. To mitigate this limitation, bi-directional models (Schuster and Paliwal 1997; Graves and Schmidhuber 2005), for instance, introduce an additional backward pass processing the input data sequence in inverted sequential order, i.e. $\mathbb{X}_{T \to 1}$, as illustrated in Figure 8.7.
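In common deep learning frameworks, the bi-directional processing of Figure 8.7 amounts to a single flag; the following is a minimal sketch with illustrative sizes.

```python
import torch
import torch.nn as nn

bilstm = nn.LSTM(input_size=10, hidden_size=64, bidirectional=True, batch_first=True)
x = torch.randn(2, 30, 10)    # (batch, time steps, features), e.g. a satellite time series
out, _ = bilstm(x)            # out: (2, 30, 128) – forward and backward states concatenated
```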
Considering that Earth observation data, especially obtained by space-borne sensors,
is mostly organized in matrix form, the vector-valued formulation of RNNs turns out to
be another undesired restriction. Multi-dimensional network topologies (Fernández et al.
2007; Kalchbrenner et al. 2016) augment this purely sequential structure by further spatial
dimensions and are, thus, able to process matrix-valued input signals. Irregular local
neighborhood relations can be modeled with more general graph RNN (Goller and Küchler
1996) or graph LSTM (Liang et al. 2016) network architectures. In order to avoid full
connectivity of recurrent LSTM cells, ConvLSTMs (Sainath et al. 2015) allow convolutional
operations and, thus, local weight sharing and translation equivariance.
8.3.2 Experiments
As motivated earlier, global geophysical Earth system processes can be seen as dynami-
cal systems that are governed by an as yet undiscovered multitude of sub-processes. These
can be chaotic or deterministic in nature and spread over time spans of various lengths. For
that reason, computational models used to capture such processes based on time-discrete
observations are required to show massive capacities in order to process the entire spatial
and temporal variance of these Earth systems.
As recurrent neural networks are, formally, Turing-complete (Siegelmann and Sontag
1991, 1995) and can, thus, approximate any arbitrary program, they are theoretically suit-
able for such tasks. Their vanilla realizations, however, suffer from several severe practical
limitations that restrict them in terms of their computational power. As described before,
gated RNN variants aim to mitigate these shortcomings, mostly by counteracting the van-
ishing and exploding gradients phenomenon.
Figure 8.8 Two recurrent network models – i.e. (a) a vanilla recurrent neural network and (b) a
long short-term memory neural network – have been trained to solve an auto-regressive NDVI data
prediction task. The models have been shown NDVI time series acquired by MODIS from 2000 to
2010 over central Europe and predicted them further until 2020. The gradients are evaluated at the
last known time point (2010) and indicate the influence of previous data (before 2010) on the
observation in 2010. It can be seen that the vanishing gradients of the standard recurrent neural
network restrict the ability to retrieve long-term temporal context while the LSTM network uses
data from the previous five years.
In order to exemplify this effect, Figure 8.8 shows the results of a real-world experiment.
For that purpose, a series of NDVI values has been derived from optical MODIS satellite
observation from central Europe over the years from 2000 to 2010. Assuming that a
high-capacity model is expected to be able to predict future data based on a sequence of
past observations, this task was chosen as a proxy problem. The figure compares these
prediction capabilities of a vanilla RNN and a LSTM network. As can be seen in the
figure, while both models were able to estimate the unseen future data, i.e. the time
span from 2010 to 2020, the LSTM model produced comparatively smoother time series
while still being able to reproduce periodical variation at different temporal scales. One
reason for this behavior can be attributed to the evolution of their temporal gradients.
The green curves (in log scale) show these gradients evaluated at the time step of the
first prediction, i.e. 2010-01-01. It becomes evident that the gradient magnitudes decayed
exponentially in the case of the RNN model, while they remained almost equally strong
in the LSTM case. The ability to actively steer the gradient flows back through time
enabled the LSTM network to consider a much longer temporal context and, thus, to
predict the unseen data more stably. These stable predictions can build the basis for further
classification tasks.
While LSTM networks have been shown to maintain the expressive power of RNNs, i.e. they can learn context-sensitive languages (Gers and Schmidhuber 2001), further design choices realize their increased long-term stability at the cost of reduced computational capacity. It, thus, depends on the entirety of circumstances – i.e. the
problem formulation, the used data, the available computational resources, etc. – which
model variant performs best in a certain practical scenario.
Figure 8.9 The vegetation activity of two field parcels – cultivated with meadow and corn crops,
respectively – monitored over one entire season of 2016 by means of NDVI values derived from
repeated Sentinel-2 observations. From these curves, it is possible to deduce the effects of various
processes influencing crop growth, like, for instance, climatic and weather conditions (e.g. clouds),
crop-specific phenological dynamics (e.g. growth onsets), and agricultural cultivation (cutting and
harvesting events).
Figure 8.10 (a) Crop classification accuracy (Kappa over the day of year for the CNN and LSTM models); (b) training process (validation accuracy over training iterations). Recurrent models can outperform feed-forward baselines in crop classification tasks, as they are able to extract crop-specific phenological patterns in Earth observation data. Furthermore, they show better training behavior, as they converge faster and more reliably to their optimal parameter settings.
Taking a closer look at the training process itself reveals another important observa-
tion. Figure 8.10(b) visualizes the validation accuracy of an LSTM and CNN model while
training, as a function of the iterations performed for parameter optimization and accumu-
lated over ten runs with varying initializations. It becomes evident that the recurrent LSTM
model consistently converged faster to its final optimum, while the individual runs did, in
general, show a smaller variance compared to the CNN baseline.
8.5 Conclusion
We have shown that recurrent neural network models, in their different formulations and
variants, are able to capture the dynamics of Earth observation data that is assumed to be
driven by complex latent dynamical systems. Most importantly, the active steering of gradi-
ent flows while training increases the representative power of these models which can, thus,
produce more stable predictions over longer periods. These capabilities open the potential
for tackling more complex machine learning tasks. Thus, such models can be employed in
various Earth observation data processing systems, like, for instance, in land cover and crop
type classification tasks (Rußwurm and Körner 2018, 2017a) or, at a global scale, for climate
system analysis (see Chapter 18; (Kraft et al. 2020, 2019)).
Methodological research on this topic of recurrent data processing has gathered pace
remarkably in recent years and these models have already been brought to numerous prac-
tical applications. They undoubtedly come with the potential to help to exploit the massive
data stocks piled up since the rise of modern Earth observation satellite missions. Never-
theless, the research community is still facing unanswered questions. While feed-forward
neural networks do already lack transparency, the information flows in trained recurrent
neural networks are even harder to analyze and to interpret. Visualizing the information
steering mechanisms of particular cells of gated variants already gives valuable insights,
but the majority of these cells show complex, non-intuitive behavior. Further, active and
prospective fields of research try to find approaches to integrate expert model knowledge
into such data-driven models, to derive causal relationships between patterns present in
Earth observation data, or to estimate the certainty and confidence at which predictions
are made by such models.
9
Deep Learning for Image Matching and Co-registration
Maria Vakalopoulou, Stergios Christodoulidis, Mihir Sahasrabudhe, and Nikos Paragios
9.1 Introduction
Image matching and registration are some of the most important and popular problems for
many communities, including Earth observation. Efficient and robust algorithms that can
address such topics are essential for several other tasks including, but not limited to, optical
flow, stereo vision, 3D reconstruction, image fusion, and change detection. Deep learning
algorithms are becoming more and more popular, providing state-of-the-art performance
for various problems including image matching and registration. These algorithms prove
very efficient running time and robustness with variety of studies reporting their success in
supervised and unsupervised settings.
Given a pair of images depicting the same area, image matching is the process of compar-
ing the two images (source image S and target (or reference) image R) to obtain a measure
of their similarity, while image registration is the process that aligns or maps these images
by finding a suitable transformation between S and R. In particular, both image matching
and registration measure or map identical pixels in the pair of images, with the second focusing on aligning S to R as accurately as possible. Although these problems seem
to be easy conceptually, they are still an open research area for a variety of communities,
considered as ill-posed problems that suffer from many uncertainties. This is mainly due to
the nature of algorithms and images on which small changes in translation, illumination,
or viewpoint can significantly affect these algorithms’ performance even if the depicted
areas are exactly identical. Therefore, numerous approaches have been proposed to address
these problems and have been summarized in different surveys (Zitova and Flusser 2003;
Sotiras et al. 2013; Leng et al. 2018; Burger and Burge 2016; Weng and He 2018). Nowadays,
however, with the recent advances of deep learning, more and more techniques integrate
these technologies for both matching and image registration, offering better performances
especially in time requirements.
Some of the most common problems that image matching and registration algorithms
need to address, especially for earth observation applications can be grouped in four main
categories, namely (i) radiation distortions; (ii) geometric changes; (iii) areas including
changes; and (iv) multimodal, large-scale sources of data. Starting with the first group,
radiation distortions refer to the difference between the real emissivity of the ground
objects and the one that is represented in the image level. This difference can be caused
mainly by the imaging properties of the sensor itself or radiation transmission errors
caused by the atmosphere during the acquisition of the objects’ emissions. The latter is also known as the Bidirectional Reflectance Distribution Function (BRDF), with a lot of
algorithms being proposed for its modeling (Montes and Ureña 2012). The second group
refers to the geometric differences in the ground objects that have to do with differences
in viewpoints of the sensors and the height of objects influencing mainly high-resolution
images (with spatial resolution higher than 10 meters). The next group refers to the
difficulty of these methods to work on places that contain changed areas. Both image
matching and registration problems assume that the depicted regions are mainly identical,
being unable by default to work properly on regions that contain changes, something that
is quite common on earth observation and remote sensing datasets. Finally, it is also very
challenging to create algorithms that are working robustly for multimodal datasets such
as Synthetic-Aperture Radar (SAR) and multispectral or hyperspectral optical sensors or
images produced by sensors with significantly different spatial resolutions. A variety of
different sensor characteristics are available for earth observation and there is a significant
need for matching and registration algorithms that can fuse their multitemporal informa-
tion. All these cases make the problems of matching and co-registration very challenging, and algorithms proposed by other communities such as computer vision or robotics are difficult to adapt to satellite data (Le Moigne et al. 2011; Nicolas and Inglada 2014).
Traditionally, image matching and image registration are two closely related problems,
with the first usually providing reliable information for the second (Figure 9.1). Differ-
ent sub-regions or pixels from the source and reference images are matched and used to
define the best transformation model, G, to map S to R, thus resulting in a warped image
D. Depending on the implemented strategy the matching algorithm can be applied glob-
ally, searching for the most similar regions in the entire image, or it can be applied locally
Figure 9.1 A schematic diagram of the image matching and image registration techniques for Earth observation. Image registration has two main components: the matching algorithm, which measures how similar different regions in the images are; and the definition of the transformation G, which is applied to the S image to generate the warped image D.
by searching for the best matching on predefined regions. The choice of the matching strat-
egy also depends on the choice of transformation model used for the registration algorithm.
Nowadays, various methods based on deep learning approaches are proposed from both the
computer vision and earth observation fields. Most of the techniques that originate from
the earth observation field are focusing on high-resolution datasets (Ma et al. 2019) due
to the more challenging nature of this kind of images and their need for dense and more
complex registration models.
Starting with image matching, traditionally two main components are usually applied –
the feature extraction of keypoints or sub-regions of the images and the establishment
of the proper correspondences. Feature extraction methods typically rely either on
intensity-based methods using the raw image’s intensities directly or on higher-level
representations extracted from the pair of images. These representations are typically
produced using either classical image descriptors or they are obtained using deep learning
approaches. After feature extraction, optimal correspondences are found using a similarity
function. Typical choices for this similarity function are mean squared error, normalized
cross-correlation, and mutual information. The implementation of the similarity function
in a deep learning framework is usually achieved using Siamese or triplet networks that
share their weights (Kaya and Bilge 2019).
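A minimal sketch of such a weight-sharing (Siamese) patch matcher is given below; the encoder architecture, patch size, and cosine similarity score are illustrative assumptions rather than any specific published model.

```python
import torch
import torch.nn as nn

class SiameseMatcher(nn.Module):
    """A shared CNN embeds both patches; a similarity score is then computed
    between the two embeddings (the weights are shared by construction)."""
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Conv2d(3, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(64, 128),
        )

    def forward(self, patch_a, patch_b):
        za, zb = self.encoder(patch_a), self.encoder(patch_b)   # shared weights
        return nn.functional.cosine_similarity(za, zb)          # similarity per pair
```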
As far as the image registration task is concerned, depending on the transformation used,
the methods can be categorized into: (i) rigid or linear; and (ii) deformable or elastic. Rigid
methods define maps with transformations that include, e.g., rotations, scaling, and transla-
tions. They are global, and hence cannot model local geometric differences between images,
which is usually the case in high-resolution datasets. However, they are very efficient and
robust for the co-registration of satellite imagery. On the other hand, the deformable meth-
ods rely on a spatially varying model by associating the observed pair of images through
non-linear dense transformations. After obtaining the optimal transformation G, the source
image is resampled to construct the warped image which is in the same coordinate system
as the reference image. Deep learning and convolutional neural networks have been used
for image registration (Kuppala et al. 2020) while methods also based on generative mod-
els (Mahapatra and Ge 2019) and deep reinforcement learning are proposed in the literature
for both 2D and 3D registration (Liao et al. 2016) mainly for the medical and computer
vision communities.
This chapter focuses on recent advances in image matching and registration for earth
observation tasks with emphasis on emerging methods in the domain and the integration of
deep learning techniques. To study these recent advances we analyse their key components
independently. The rest of the chapter is organized as follows. In section 9.2, we present
a detailed overview of existing literature for both image matching and image registration
focusing on the recent deep learning techniques. In section 9.3, we discuss and present an
unsupervised deep learning technique applied to high-resolution datasets and compared
with conventional image registration techniques. We describe the dataset used for this study
in section 9.3.4, followed by experiments and results in 9.3.5. Finally, in 9.4 we summarize
the chapter and enumerate future research directions for these algorithms.
Table 9.1 Grouping of image matching techniques depending on the type of imagery they have
been applied to.
Optical to Optical Altwaijry et al. (2016); Zhu et al. (2019a); En et al. (2018); Liu et al.
(2018a); Chen et al. (2017b); Yang et al. (2018); He et al. (2018); Dong
et al. (2019); Zhu et al. (2019); Jia et al. (2018); He et al. (2019a);
Tharani et al. (2018); Wang et al. (2018)
SAR to SAR Quan et al. (2016); Wang et al. (2018)
SAR to Optical Merkle et al. (2017); Bürgmann et al. (2019); Merkle et al. (2018);
Hughes et al. (2018); Quan et al. (2018); Ma et al. (2019); Merkle et al.
(2017); Zhang et al. (2019)
Other Ma et al. (2019); Zhang et al. (2019)
Siamese architectures are also popular for SAR to Optical image matching. In Merkle et al.
(2017) a Siamese architecture is proposed to generate reliable matching points between
TerraSAR-X and PRISM images. For the same type of images, a conditional generative
adversarial network (see Chapter 3) is trained in Merkle et al. (2018) to generate SAR-like
image patches from optical images to enhance the performance of known classical match-
ing approaches. Moreover, a (pseudo-) Siamese network is proposed in Hughes et al. (2018).
Medium- and high-resolution SAR and optical data are evaluated in Bürgmann et al. (2019),
presenting an approach for matching ground control points (GCPs) from SAR to optical
imagery. The training of conditional generative adversarial networks (see Chapter 3) is also
proposed in Merkle et al. (2017) to generate artificial templates and the matching of optical
to SAR data. Finally, a combination of deep and local features is used in Ma et al. (2019) to
match and register multimodal remote sensing data.
Deep learning methods are also used to match images from completely different sources
of data such as satellite images with maps. In Ma et al. (2019) the authors evaluated their
methods using a pair of an optical image and the corresponding Tencent Map providing
very promising results. A similar approach based on Siamese architectures is proposed
in Zhang et al. (2019), evaluating its performance on multimodal data including optical
to map matching.
Additionally, the authors proposed an improvement to their previous work in Girard et al.
(2019) by developing a multi-task scheme for simultaneous registration and segmentation,
which improved the performance of the reported registration.
where p⃗ and q⃗ denote pixel locations, d ∈ {x, y} denotes an axis, and [G(⃗p)]d denotes the
d-component of G(⃗p).
The formulation in this case consists of two different components – one which calculates
a linear/affine transformation, and another that calculates a dense transformation. Depend-
ing on the application and the type of data, these two terms can be used and trained together
or separately. In case these two operations are trained at the same time, they are applied one after the other, by first applying the affine component and then the deformable one for finer transformations. Such a scheme can be described by
where GA represents the affine deformation grid, while GN represents the deformable one.
Here it should be mentioned that the network is trained end-to-end, optimizing both linear
and deformable parts simultaneously. GA is computed from six regressed affine transfor-
mation components, denoted by a 2 × 3 matrix A. For the deformable part GN , an approach
similar to Shu et al. (2018) is adopted. Instead of regressing the components of GN directly,
we regress instead a matrix Φ of spatial gradients along x- and y- axes. As is discussed in Shu
et al. (2018), this approach helps generate smoother grids that render the deformable com-
ponent, making it easier to train. The actual grid GN can then be obtained by applying an
integration operation on Φ along x- and y-axes, which is approximated by the cumulative
sum in the discrete case. Adopting such a strategy enables us to draw some conclusions on
the relative position of adjacent pixels in the warped image based on Φ. Concretely, two pix-
els p⃗ and p⃗ + 1 will have moved closer, maintained distance, or moved apart in the warped
image, if Φ(p) is respectively less than 1, equal to 1, or greater than 1. Such an approach
avoids self-crossings, while allows the control of maximum displacements among consec-
utive pixels.
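The integration step described above can be sketched as follows; the tensor layout and the absence of grid normalization are assumptions made for illustration, not the exact implementation of Vakalopoulou et al. (2019).

```python
import torch

def grid_from_spatial_gradients(phi):
    """phi: (B, 2, H, W) tensor of non-negative spatial gradients (e.g. scaled
    sigmoid outputs). Values below 1 pull adjacent pixels closer, values above 1
    push them apart; the cumulative sum approximates the integration of phi."""
    grid_y = torch.cumsum(phi[:, 0], dim=1)          # integrate along the y-axis
    grid_x = torch.cumsum(phi[:, 1], dim=2)          # integrate along the x-axis
    return torch.stack((grid_x, grid_y), dim=-1)     # sampling grid of shape (B, H, W, 2)
```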
Figure 9.2 A schematic diagram of two different architectures presented in this chapter. The architecture consists of two different parts following an autoencoder scheme: the feature extraction part and the part for the prediction of the transformation.
The decoder part has two different branches, one that calculates the affine parameters
and one the deformable ones. For the linear/affine parameters A, a linear layer was used
together with a global average pooling to reduce the spatial dimensions, while for the spatial
gradients Φ a sigmoid activation was employed. Finally, the output of the sigmoid activation
was scaled by a factor of 2 to allow consecutive pixels to have larger displacements than the
initial.
where AI represents the identity affine transformation matrix, ΦI the spatial gradients
of the identity deformation, and 𝛼 and 𝛽 are regularization weights, controlling the
influence of the regularization terms on the obtained displacements. The higher the values
of 𝛼 and 𝛽, the closer the deformation is to the identity. The regularization parameters
are essential for the joint optimization, as they ensure that the predicted deformations
will be smooth for both components. Moreover, the regularization parameters are very
important in the regions of change, as they do not allow the deformations to become
very large.
The most commonly employed reconstruction loss is the mean-squared error (MSE).
However, MSE suffers from several drawbacks. Firstly, it cannot account for changes
in contrast, brightness, tint, etc. Secondly, MSE tends to produce smooth images.
Thirdly, MSE does not account for the perceptual information in the image (Wang et al.
2004). Recent papers have hence reported the use of other types of similarity functions,
either instead of MSE or in combination with it, to construct more descriptive loss
functions. One such loss is the local cross-correlation (LCC) presented in Balakrishnan et al. (2019).
Table 9.2 Errors measured as average Euclidean distances between estimated landmark locations. dx and dy denote distances along the x- and y-axes, respectively, while ds denotes the average error along all axes per pixel.
Figure 9.3 Qualitative evaluation for three different pairs of images. From top to bottom:
unregistered, (Karantzalos et al. 2014) with NCC, dilated (Vakalopoulou et al. 2019) only A,
dilated (Vakalopoulou et al. 2019) only Φ, dilated (Vakalopoulou et al. 2019) A and Φ.
9.3 Image Registration with Deep Learning 133
Figure 9.4 Qualitative evaluation for the different methods ( (Vakalopoulou and Karantzalos
2014), (Karantzalos et al. 2014), (Vakalopoulou et al. 2019) respectively). From top to bottom:
Quickbird 2006 - WorldView-2 2011, Quickbird 2007 - WorldView-2 2011, Quickbird 2009 -
WorldView 2011, WorldView-2 2010 - WorldView-2 2011.
for these two problems. Generative Adversarial Networks can provide very valuable tools
towards developments in that direction.
10
Multisource Remote Sensing Image Fusion
Wei He, Danfeng Hong, Giuseppe Scarpa, Tatsumi Uezato, and Naoto Yokoya
10.1 Introduction
Multisource remote sensing image fusion is used to obtain detailed and accurate informa-
tion regarding the surface, which cannot be acquired from a single image, by fusing multiple
image sources (Ghamisi et al. 2019). Typical examples of multisource image fusion used
in remote sensing are the resolution enhancement tasks, which include (i) pansharpening
to enhance the spatial resolution of multispectral imagery by fusing it with panchromatic
imagery (Garzelli et al. 2007; Vivone et al. 2014; Loncan et al. 2015), and (ii) multi-
band image fusion to reconstruct high-spatial-resolution and high-spectral-resolution
images (Lanaras et al. 2017; Yokoya et al. 2017; Ghamisi et al. 2019).
Since 1990, pansharpening has been actively researched to produce higher-level prod-
ucts for optical satellites that are composed of panchromatic and multispectral imagers.
Furthermore, multiband image fusion has received great attention in the last decade with
the emergence of space-borne hyperspectral sensors. Traditional approaches based on com-
ponent substitution and multi-resolution analysis have been studied in detail and are com-
monly applied for practical applications. These approaches extract spatial details from a
high-spatial-resolution image and inject them into an upsampled low-spatial-resolution
image. These methods differ in how the spatial details are extracted and injected. As the next trend,
researchers have formulated image fusion tasks as optimization problems and implemented
priors of data structures in various models, such as low-rank, sparse, variational, and non-
local modeling, to improve the quality of reconstruction. Even though these priors were
useful for achieving significant improvement, the accompanying high computational cost
has been a serious issue when applied to generate higher-level products of operational satel-
lites that acquire large-scale data.
Deep learning (DL) has proven to be a powerful tool in many fields as well as various
image processing tasks. Recently, DL-based methods have been proposed as a groundbreak-
ing approach for pansharpening and multiband image fusion (Masi et al. 2016; Scarpa et al.
2018; Xie et al. 2019); these methods achieved state-of-the-art performance and compu-
tational efficiency at the inference phase. Such DL-based methods are capable of learn-
ing complex data transformations from input images to target images of training samples
in an end-to-end manner. This book chapter provides an overview of state-of-the-art DL
methods for pansharpening and multiband image fusion as well as comparative experi-
ments to demonstrate their advantages and characteristics. This is followed by a discussion
of the future challenges and research directions.
10.2 Pansharpening
Pansharpening is a special type of super-resolution where, in addition to the multiband or
multispectral (MS) image to be super-resolved, a high-resolution but single-band image,
referred to as the panchromatic (PAN) band, which is spectrally overlapped with the MS, is
also available in input. It can be considered as a single-sensor fusion task wherein spectral
and spatial information are obtained by two distinct channels, i.e., MS and PAN, respec-
tively. The goal of this technique is to obtain a spatial and spectral full-resolution data cube.
A survey of the pansharpening methods proposed prior to the advent of the deep learn-
ing era can be found in Vivone et al. (2015). In the following sections, we first provide an
overview of the most recent deep learning approaches for pansharpening and subsequently
present and discuss a few related experimental results.
Figure 10.1 Training samples generation workflow (top) and iterative network parameters
learning (bottom).
the PAN band, and a tiling phase is used to create mini-batches for the sake of computational
efficiency during training. This generation process has quickly become a standard method
for CNN-based pansharpening approaches following its introduction in Masi et al. (2016).
The interested reader may refer to that work for further details. However, it is worth underlining
that such a resolution shift comes at a price, because pansharpening is scale invariant only to a
limited extent. This is particularly evident for data-driven approaches, since the appearance of the
image content used for training depends strictly on its ground sample distance (GSD).
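The sketch below illustrates, under simplifying assumptions (block averaging instead of an MTF-matched filter, a resolution ratio of 4, and an arbitrary tile size), how reduced-resolution training pairs can be generated by degrading the original MS and PAN data so that the original MS serves as the reference, followed by tiling into patches; it is a hypothetical rendering, not the exact pipeline of Masi et al. (2016).

```python
import numpy as np

def downscale(img, ratio=4):
    """Crude spatial degradation by block averaging (a stand-in for an MTF-matched filter)."""
    h = img.shape[0] - img.shape[0] % ratio
    w = img.shape[1] - img.shape[1] % ratio
    img = img[:h, :w]
    return img.reshape(h // ratio, ratio, w // ratio, ratio, -1).mean(axis=(1, 3))

def make_training_tiles(ms, pan, ratio=4, tile=32):
    """Reduced-resolution protocol: degrade MS and PAN, use the original MS as reference."""
    ms_lr = downscale(ms, ratio)                       # reduced-resolution MS input
    pan_lr = downscale(pan[..., None], ratio)[..., 0]  # reduced-resolution PAN input
    ref = ms[:ms_lr.shape[0] * ratio, :ms_lr.shape[1] * ratio]  # reference = original MS

    inputs, targets = [], []
    for i in range(0, ms_lr.shape[0] - tile + 1, tile):
        for j in range(0, ms_lr.shape[1] - tile + 1, tile):
            x_ms = ms_lr[i:i + tile, j:j + tile]
            x_pan = pan_lr[i:i + tile, j:j + tile]
            y = ref[i * ratio:(i + tile) * ratio, j * ratio:(j + tile) * ratio]
            inputs.append((x_ms, x_pan))
            targets.append(y)
    return inputs, targets
```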
Network architecture design (2) amounts to the definition of a directed acyclic graph
(DAG), which defines the input-output information flow, thereby associating a specific task,
such as convolution, point-wise nonlinearities (e.g., ReLU), batch normalization, concate-
nation, sum, and pooling to each DAG vertex. The specific sub-graph structures obtained
by combining these elementary operations are also commonly employed (e.g., Residual,
Dense, or Inception modules). The PNN model (Masi et al. 2016) is rather simple: a serial net
composed of three sequential convolutional units interleaved by ReLU activations. In 2016,
this simple architecture exhibited the significant potential of DL in contrast to traditional
methods, and achieved state-of-the-art results. It has been further improved by convert-
ing it to a residual net named PNN+ (Scarpa et al. 2018). Residual learning, introduced
by He et al. (2016a) as a very effective strategy for training very deep models, has quickly
proved to be the optimal choice for resolution enhancement (Kim et al. 2016). Moreover, the
desired super-resolved image can be viewed as a composition of its low and high frequency
components, the former essentially being the input low-resolution image, and the latter
being the missing (or residual) part that has to be restored. Owing to this partition, residual
schemes naturally address problems associated with super-resolution or pansharpening,
thereby avoiding an unnecessary reconstruction of the entire desired output and reducing
the risk of altering the low-frequency contents of the image. The PanNet model proposed by
Yang et al. (2017) is a further improvement in this direction; it eliminates the low-frequency
contents from the input stack as well. Additionally, a majority of the recent DL pansharpen-
ing models include residual modules (Wei et al. 2017; Yang et al. 2017; Scarpa et al. 2018; Liu
et al. 2018c; Shao and Cai 2018; Zhang et al. 2019), although they can be significantly differ-
ent in terms of complexity. While Scarpa et al. (2018) keep a relatively shallow architecture
with only three convolutional layers, other models employ tens of them (Wei et al. 2017;
Yang et al. 2017; Liu et al. 2018c). Shao and Cai (2018) have presented a two-branch archi-
tecture wherein the MS and PAN components follow different convolutional paths before
they are combined, and the influence of the number of convolutional layers per branch is
analyzed. Their main conclusion was that the PAN branch should be relatively deeper than
the MS one; in particular, the optimal values for the proposed model were eight and two
layers, respectively. Finally, it is also worth mentioning a complementary approach such as
the U-Net-like model proposed by Yao et al. (2018).
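As a rough illustration of the serial and residual architectures discussed above, the following PyTorch sketch stacks three convolutional layers with ReLU activations on the concatenated upsampled MS and PAN bands and adds a skip connection from the upsampled MS input; the layer widths and kernel sizes are illustrative placeholders rather than the published hyper-parameters of PNN or PNN+.

```python
import torch
import torch.nn as nn

class ResidualPansharpeningCNN(nn.Module):
    """Three-layer serial CNN with a residual skip, loosely in the spirit of PNN/PNN+
    (channel counts and kernel sizes are illustrative, not the published values)."""

    def __init__(self, ms_bands=4, feats=48):
        super().__init__()
        in_ch = ms_bands + 1  # upsampled MS bands stacked with the PAN band
        self.body = nn.Sequential(
            nn.Conv2d(in_ch, feats, kernel_size=9, padding=4), nn.ReLU(inplace=True),
            nn.Conv2d(feats, 32, kernel_size=5, padding=2), nn.ReLU(inplace=True),
            nn.Conv2d(32, ms_bands, kernel_size=5, padding=2),
        )

    def forward(self, ms_up, pan):
        # ms_up: (B, ms_bands, H, W) MS interpolated to PAN size; pan: (B, 1, H, W)
        x = torch.cat([ms_up, pan], dim=1)
        residual = self.body(x)   # the network predicts only the missing high-frequency detail
        return ms_up + residual   # residual learning: add the detail to the upsampled MS

# usage sketch (inputs assumed co-registered and normalized):
# model = ResidualPansharpeningCNN(ms_bands=4)
# fused = model(ms_up, pan)
```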
Once the model architecture is fixed, say fΦ(x), its parameters Φ have to be learned (4),
with the help of a suitably chosen guiding loss function ℓ(fΦ(x), r) (3) that quantifies the
network error at each iteration over a training mini-batch; the parameters are then adjusted
by moving in the opposite direction of their gradient,

Φ ← Φ − η ∇Φ ℓ(fΦ(x), r),

where η denotes the learning rate.
The top-level workflow of this process is summarized in the bottom part of Figure 10.1.
Multiple optimizers, all variants of the stochastic gradient descent algorithm, can be used for this
purpose. For the sake of brevity, we skip this point, as it is not highly relevant to our specific
problem, and refer the interested reader to the pansharpening papers discussed above. A notable
exception worth mentioning is the so-called generative adversarial learning paradigm adopted in
Liu et al. (2018c).
A more interesting aspect is the selection of the loss function. The L2-norm can be con-
sidered a standard option as its minimization corresponds to the minimization of the mean
squared error. It has been actually adopted in many cases (Masi et al. 2016; Wei et al. 2017;
Yao et al. 2018; Shao and Cai 2018; Yuan et al. 2018) for its simplicity and convergence
properties. However, pansharpening quality assessment is still an open issue, as is the
intimately related choice of a proper loss function for optimal training. In fact, in
addition to the classical mean squared (MSE) and the mean absolute (MAE) errors, both
spectral oriented measurements with slightly different properties, many other quantitative
quality measurements have been proposed during the last decades (see Vivone et al. (2015)).
Some of these, like the erreur relative globale adimensionnelle de synthèse (ERGAS), a revis-
ited MSE measure with bandwise weighting, or the Spectral Angle Mapper (SAM), are
more related to spectral fidelity, whereas measures such as the spatial cross-correlation
(the average cross-correlation between gradient images) focus on spatial consistency. Moreover,
different loss functions may present different convergence properties during training. For
these reasons Scarpa et al. (2018) decided to use the L1-norm, achieving better (less smooth)
results than with the L2-norm. The same choice was adopted by Liu et al. (2018c), while
Zhang et al. (2019) have recently proposed to use ERGAS, which relates to the root mean
squared error:

ERGAS = (100 / R) sqrt( (1/L) Σ_{l=1}^{L} ( RMSE(l) / 𝜇(l) )² ),

where l ∈ {1, … , L} indexes the generic band, RMSE(l) is the root mean squared error of the
l-th band, 𝜇(l) is the l-th band average, and R is the resolution ratio.
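A minimal NumPy implementation of ERGAS as a quality index (and hence a candidate training loss) might look as follows; the band-last array layout is an assumption, while the formula follows the conventional definition given above.

```python
import numpy as np

def ergas(reference, estimate, ratio=4):
    """ERGAS between a reference and an estimated image of shape (H, W, L).

    `ratio` is the resolution ratio R between PAN and MS ground sample distances.
    Lower is better; 0 is the ideal value.
    """
    L = reference.shape[-1]
    acc = 0.0
    for l in range(L):
        rmse_l = np.sqrt(np.mean((reference[..., l] - estimate[..., l]) ** 2))
        mu_l = np.mean(reference[..., l])
        acc += (rmse_l / mu_l) ** 2
    return 100.0 / ratio * np.sqrt(acc / L)
```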
Finally, a validation process (5) completes the network design cycle. Samples reserved
for validation, hence “unseen” by the training process, are used both for detecting
overfitting and for comparing different hyper-parameter configurations. Once the
training-validation process is stopped, the hyper-parameters and parameters are frozen and
the net is ready to be used.
1 BDSD: http://openremotesensing.net/wp-content/uploads/2015/01/pansharpeningtoolver_1_3.rar
2 MTF-GLP: http://openremotesensing.net/wp-content/uploads/2015/01/pansharpeningtoolver_1_3.rar
3 SIRF: http://cchen156.web.engr.illinois.edu/code/CODE_SIRF.zip
4 PNN: http://www.grip.unina.it/download/prog/PNN/PNN_v0.1.zip
5 DRPNN: https://github.com/Decri/DRPNN-Deep-Residual-Pan-sharpening-Neural-Network
6 PanNet: https://xueyangfu.github.io/paper/2017/iccv/ICCV17_training_code.zip
7 PNN+: https://github.com/sergiovitale/pansharpening-cnn
Table 10.1 Performance comparison of three non-DL and four DL methods at reduced (reference)
and full (no-reference) resolutions on the WV-2 datasets. The best results are shown in bold.
BDSD 0.888 0.893 28.719 6.319 4.899 0.878 0.894 0.045 0.064
MTF-GLP 0.907 0.910 30.114 5.745 4.163 0.902 0.918 0.040 0.049
SIRF 0.894 0.901 31.041 5.925 3.984 0.897 0.911 0.053 0.039
PNN 0.907 0.921 30.777 6.559 4.155 0.913 0.927 0.032 0.043
DRPNN 0.922 0.933 31.016 5.822 3.792 0.913 0.921 0.031 0.050
PanNet 0.916 0.916 30.392 5.448 3.976 0.887 0.939 0.021 0.042
PNN+ 0.923 0.933 31.598 5.796 3.743 0.905 0.949 0.025 0.027
Ideal value ↑1 ↑1 ↑ ↓0 ↓0 ↑1 ↑1 ↓0 ↓0
Figure 10.2 Pansharpening results with different compared methods at a reduced resolution
WV-2 image. An enlarged region is framed in green and the corresponding residual image between
the fused image and MS-GT is framed in red.
Figure 10.3 Pansharpening results with different compared methods at a full-resolution WV-2
image. An enlarged region is framed in green and the corresponding residual image between the
fused image and MS-GT is framed in red.
Table 10.2 Processing time comparison of three non-DL and four DL methods in the test phase.
                 BDSD    MTF-GLP  SIRF    PNN     DRPNN   PanNet  PNN+
Running Time (s) 0.1278  0.1729   9.3736  0.1095  0.1184  0.1036  0.1101
edges and also yield lower residual errors between the MS-GT up-sampled via bi-cubic
interpolation.
Computational time: All test experiments in this chapter were run under the Windows 10
operating system on an Intel Core i7-8700K 3.70 GHz desktop with 64 GB of memory. With this
setting, the running times of the compared non-DL and DL methods are given in Table 10.2.
Overall, the running time of DL-based techniques is generally superior to that of non-DL ones,
particularly the optimization-based approaches (e.g., SIRF). Remarkably, the DL-based methods
achieve similarly fast running speeds in practice, owing to the linear processing time of their
feed-forward mechanism in the inference process.
Figure 10.4 Example of HS and MS data fusion: (a) supervised approaches and (b) unsupervised
approaches.
The supervised approaches commonly assume that the paired training data are avail-
able. The training data used in the supervised approaches include low spatial resolution
HS (LR-HS) and MS images as the inputs and the HR-HS images as the outputs. In the
methods (Han et al. 2018; Palsson et al. 2017), the LR-HS and MS images are simply con-
catenated as a single input after applying spatial resampling. In the other methods (Xie et al.
2019; Dian et al. 2018), the LR-HS and MS images are separately incorporated as inputs in
the optimization process.
The architecture of the network differs significantly, depending on the method. The spec-
tral spatial fusion architectures of CNN (SSF-CNN) (Han et al. 2018) are aimed at learning
the nonlinear relationship between the concatenated LR-HS and MS images and the HR-HS
images, using CNN. Additionally, the training loss between the estimated and reference
HR-HS images is considered. Once the model has been trained, the HR-HS images are com-
puted using the trained CNN for a new given input (i.e., the concatenated LR-HS and MS
images).
A 3D convolutional neural network (3D-CNN) (Palsson et al. 2017) also adopts a similar
approach. One noticeable difference is that 3D-CNN uses different training data: the
input data are spatially decimated HS and MS images, while the observed HS image is used
as the target HR-HS image. A similar trick is also explored in pansharpening. The
relationship is learned by using 3D-CNN in an end-to-end manner. Once the model has
been trained, the observed HS and MS images are provided as inputs to compute HR-HS
images. This is based on the assumption that the relationship learned by 3D-CNN at a low
resolution can also be applicable at higher resolutions.
The aforementioned two models possess a simple network architecture and perform well
in the experiments. However, the CNN models used are not specifically designed for the
MS/HS fusion problem. MS/HS Fusion Net (MHF-net) (Xie et al. 2019) was proposed to
incorporate the following observation models:
Y = XR + Ny , (10.1)
Z = CX + Nz , (10.2)
where Y is the observed MS image, X is the HR-HS image, R is the spectral response
of the multispectral sensor, Z is the observed HS image, C is the linear operator that is
composed of a cyclic convolution operator and a downsampling operator, and Ny and Nz
represent noise present in the MS and HS images, respectively. MHF-net formulates a
new optimization problem from the observation models and shows that the optimization
problem can be solved by a specifically designed deep network. MHF-net automatically
estimates the parameters related to the downsampling operator within a supervised deep
learning framework. MHF-net can exploit the general prior structure of the latent HS
images (e.g., low-rankness) and also enables each step of the network architecture to be
interpretable.
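To make the observation models (10.1) and (10.2) concrete, the following NumPy sketch simulates MS and HS observations from a latent HR-HS image; the spectral response matrix R is assumed to be given, and the operator C is approximated by a per-band Gaussian blur followed by decimation, which is an illustrative stand-in for the sensor-specific degradation.

```python
import numpy as np
from scipy.ndimage import gaussian_filter

def simulate_observations(X, R, shape, ratio=4, sigma=1.0, noise_std=0.0):
    """Simulate the observation models Y = XR + Ny and Z = CX + Nz.

    X: latent HR-HS image flattened to (num_pixels, num_hs_bands); shape = (H, W).
    R: assumed spectral response matrix of the MS sensor, (num_hs_bands, num_ms_bands).
    C: approximated here by a per-band Gaussian blur followed by decimation by `ratio`.
    """
    H, W = shape
    B = X.shape[1]
    cube = X.reshape(H, W, B)

    # Spectral degradation: high-resolution MS observation Y
    Y = cube @ R + noise_std * np.random.randn(H, W, R.shape[1])

    # Spatial degradation: blur each band, then decimate -> low-resolution HS observation Z
    blurred = np.stack([gaussian_filter(cube[..., b], sigma) for b in range(B)], axis=-1)
    Z = blurred[::ratio, ::ratio, :]
    Z = Z + noise_std * np.random.randn(*Z.shape)
    return Y, Z
```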
A deep HSI sharpening method DHSIS which can incorporate the observation model has
been developed (Dian et al. 2018). DHSIS comprises three steps. The first step estimates the
initial HR-HS image from a conventional optimization problem derived by the observation
model. In the second step, it learns image priors representing the mapping between the ini-
tialized HR-HS image and the reference HR-HS image using deep residual learning. Finally,
the learned image priors are incorporated into an image fusion optimization framework to
reconstruct the final HR-HS image.
An unsupervised sparse Dirichlet net (uSDN) (Qu et al. 2018) has been proposed
to address the HS and MS image fusion problem. uSDN is based on an unsupervised
encoder-decoder architecture. The architecture comprises two encoder-decoder networks.
One network encodes and decodes an HS image, whereas the other network encodes
and decodes an MS image. The decoder is shared between the two networks, and the
network activations derived from the encoders are promoted to have a similar pattern. In
this architecture, the HS and MS images are encoded as proportional coefficients, and the
decoder represents spectral signatures corresponding to the coefficients. The parameters
of the encoder and the decoder are alternately optimized until the network converges to an HR-HS
image.
8 http://naotoyokoya.com/Download.html
9 https://github.com/qw245/BlindFuse
10 https://github.com/alfaiate
11 https://sites.google.com/site/harikanats/
12 https://sites.google.com/view/renweidian/
13 https://github.com/aicip/uSDN
14 https://github.com/XieQi2015/MHF-net
15 http://naotoyokoya.com/Download.html
16 http://www1.cs.columbia.edu/CAVE/databases/
Table 10.3 The size of the image used for HSR experiments.
Table 10.4 Quantitative comparison of different algorithms on two different images. The best
results are marked in bold.
CNMF 35.21 5.09 6.54 5.591 0.906 34.40 5.84 0.77 8.05 0.942
FUSE 34.98 5.79 6.83 6.418 0.859 34.88 5.94 0.71 12.24 0.902
HySure 35.85 5.17 6.23 5.698 0.904 34.45 6.50 0.79 19.43 0.897
STEREO 34.66 6.52 7.40 10.272 0.832 34.56 5.10 0.80 17.09 0.907
CSTF 36.39 5.37 6.11 7.616 0.855 38.46 4.75 0.54 11.86 0.939
uSDN 34.04 6.30 7.48 7.055 0.888 33.17 6.38 0.83 14.21 0.905
MHF-net 39.52 4.09 4.52 5.896 0.932 39.22 3.34 0.39 6.38 0.978
Ideal value ↑ ↓0 ↓0 ↓0 ↑1 ↑ ↓0 ↓0 ↓0 ↑1
Figure 10.5 HSR results of different methods with Chikusei image. We choose bands 70, 100, 36
for illustration. An enlarged region is framed in green and the corresponding residual image
between the fused image and MS-GT is framed in red.
Table 10.5 Processing time comparison of five non-DL and two DL methods in the test phase.
10.4 Conclusion and Outlook
This chapter presents a review of state-of-the-art DL-based image fusion techniques for
two spatial-spectral resolution enhancement tasks, namely, pansharpening and multiband
image fusion. The DL-based methods exhibit the capability to learn complex data transfor-
mations from input image sources to the targets. Unlike the conventional approaches that
are based on hand-crafted filters and priors (or regularizations), DL-based methods have
the potential to learn priors from the training samples in an end-to-end manner. However,
a careful design of the network architecture and a definition of the loss function are required
for the DL-based methods. The popular concepts of conventional approaches, such as the
injection of spatial details and spatial-spectral preservation based on observation models,
can facilitate more efficient learning. As demonstrated in the comparative experiments,
the DL-based algorithms achieved higher reconstruction quality as compared to the con-
ventional approaches, with a relatively fast speed at the inference. Hence, DL-based image
fusion is suitable for processing large-scale optical satellite images.
A limitation of DL-based approaches is that a majority of these methods require numer-
ous input-output training samples for each combination of the sensors, which cannot be
easily obtained. Owing to this limitation, generalization so as to accept a pair of images
acquired by any combination of sensors as the input remains a key challenge from a
practical point of view. Unsupervised DL approaches are a potential solution to address
this challenge because they can be trained by only using the input data; however, their
computational cost is high and there is still room for improvement in their reconstruction
performance. Transfer learning is a possible direction to improve computational efficiency
and reconstruction performance of unsupervised DL approaches. A majority of the
architectures developed for multisource remote sensing image fusion have been manually
designed by humans. An automated neural architecture search can be another direction
for future research.
11
Deep Learning for Image Search and Retrieval in Large
Remote Sensing Archives
Gencer Sumbul, Jian Kang, and Begüm Demir
11.1 Introduction
With the unprecedented advances in satellite technology, recent years have witnessed a
significant increase in the volume of remote sensing (RS) image archives. Thus, the devel-
opment of efficient and accurate content-based image retrieval (CBIR) systems in massive
archives of RS images is a growing research interest in RS. CBIR aims to search for RS
images with similar information content to a query image within a large archive. To this
end, CBIR systems are defined based on two main steps: (i) the image description step (which
characterizes the spatial and spectral information content of RS images); and (ii) the image
retrieval step (which evaluates the similarity among the considered descriptors and then
retrieves the images most similar to the query image, in order of similarity). A general
block scheme of a CBIR system is shown in Figure 11.1.
Traditional CBIR systems extract and exploit hand-crafted features to describe the con-
tent of RS images. As an example, bag-of-visual-words representations of the local invariant
features extracted by the scale invariant feature transform (SIFT) are introduced in Yang
and Newsam (2013). In Aptoula (2014), a bag-of-morphological-words representation of
the local morphological texture features (descriptors) is proposed in the context of CBIR.
Local Binary Patterns (LBPs), which represent the relationship of each pattern (i.e., pixel)
in a given image with its neighbors located on a circle around that pixel, are found very effi-
cient in RS. In Tekeste and Demir (2018), a comparative study that analyzes and compares
different LBPs in RS CBIR problems is presented. To define the spectral information con-
tent of high-dimensional RS images the bag-of-spectral-values descriptors are presented in
Dai et al. (2018). Graph-based image representations, where the nodes describe the image
region properties and the edges represent the spatial relationships among the regions, are
presented in Li and Bretschneider (2007); Chaudhuri et al. (2016, 2018). Hashing methods
that embed high-dimensional image features into a low-dimensional Hamming (binary)
space by a set of hash functions are found very effective in RS (Demir and Bruzzone 2016;
Li and Ren 2017; Reato et al. 2019). By this method, the images are represented by binary
hash codes that can significantly reduce the amount of memory required for storing the
RS images with respect to the other descriptors. Hashing methods differ from each other
on how the hash functions are generated. As an example, in Demir and Bruzzone (2016);
Figure 11.1 General block scheme of a CBIR system: the query image and the images in the RS
image archive are passed through image characterization, and the resulting image descriptors are
compared in the retrieval step to obtain the retrieved images.
Reato et al. (2019) kernel-based hashing methods that define hash functions in the kernel
space are presented, whereas a partial randomness hashing method that defines the hash
functions based on a weight matrix defined using labeled images is introduced in Li and
Ren (2017). More details on hashing for RS CBIR problems are given in section 11.3.
Once image descriptors are obtained, one can use the k-nearest neighbor (k-NN) algo-
rithm, which computes the similarity between the query image and all archive images
to find the k most similar images to the query. If the images are represented by graphs,
graph matching techniques can be used. As an example, in Chaudhuri et al. (2016) an inex-
act graph matching approach, which is based on the sub-graph isomorphism and spectral
embedding algorithms, is presented. If the images are represented by binary hash codes,
image retrieval can be achieved by calculating the Hamming distances with simple bit-wise
XOR operations that allow time-efficient search capability (Demir and Bruzzone 2016).
However, these unsupervised systems do not always result in satisfactory query responses
due to the semantic gap, which occurs between the low-level features and the high-level
semantic content of RS images (Demir and Bruzzone 2015). To overcome this problem and
improve the performance of CBIR systems, semi-supervised and fully supervised systems,
which require user feedback in terms of RS image annotations, are introduced (Demir and
Bruzzone 2015). Most of these systems depend on the availability of training images, each
of which is annotated with a single broad category label that is associated to the most sig-
nificant content of the image. However, RS images typically contain multiple classes and
thus can simultaneously be associated with different class labels. Thus, CBIR methods that
properly exploit training images annotated by multi-labels have recently been found very promis-
ing in RS. As an example, in Dai et al. (2018) a CBIR system that exploits a measure of label
likelihood based on a sparse reconstruction-based classifier is presented in the framework
of multi-label RS CBIR problems. Semi-supervised CBIR systems based on graph matching
algorithms are proposed in Wang et al. (2016); Chaudhuri et al. (2018). In detail, in Wang
et al. (2016) a three-layer framework in the context of graph-based learning is proposed for
query expansion and fusion of global and local features by using the label information of
query images. In Chaudhuri et al. (2018) a correlated label propagation algorithm, which
operates on a neighborhood graph for automatic labeling of images by using a small number
of training images, is proposed.
The above-mentioned CBIR systems rely on shallow learning architectures and
hand-crafted features. Thus, they cannot simultaneously optimize feature learning and
image retrieval, resulting in limited capability to represent the high-level semantic content
of RS images. This issue leads to inaccurate search and retrieval performance in practice.
Recent advances in deep neural networks (DNNs) have triggered substantial performance
gain for image retrieval due to their high capability to encode higher-level semantics present
in RS images. Differently from conventional CBIR systems, deep learning (DL)-based CBIR
systems learn image descriptors in such a way that feature representations are optimized
during the image retrieval process. In other words, DNNs eliminate the need for human
effort to design discriminative and descriptive image descriptors for the retrieval problems.
Most of the existing RS CBIR systems based on DNNs attempt to improve image retrieval
performance by: (i) learning discriminative image descriptors; and (ii) achieving scalable
image search and retrieval. The aim of this chapter is to present different DNNs proposed
in the literature for the retrieval of RS images. The rest of this chapter is organized as
follows. Section 11.2 reviews the DNNs proposed in the literature for the description of
the complex information content of RS images in the framework of CBIR. Section 11.3
presents the recent progress on the scalable CBIR systems defined based on DNNs in RS.
Finally, section 11.4 draws the conclusion of this chapter.
Figure 11.2 Different strategies considered within the DL-based RS CBIR systems (mini-batch
sampling: image pairs, image triplets; network initialization: pre-trained weights, greedy
layer-wise pre-training; DNN type: autoencoder networks, convolutional neural networks, graph
convolutional networks; learning strategy: classification, clustering, metric learning,
reconstruction, data augmentation).
for mini-batches and initializes the model parameters of a Convolutional Neural Network
(CNN). Then, it employs the CNN for k-means clustering (instead of classification). To this
end, a reconstruction loss function is utilized to minimize the error induced between the
CNN results and the cluster assignments. Collaborative affinity metric fusion is employed
to incorporate the traditional image descriptors (e.g., SIFT, LBP) with those extracted from
different layers of the CNN. A CBIR system with deep bag-of-words is proposed in Tang
et al. (2018b). This system employs a convolutional autoencoder (CAE) for extracting image
descriptors in an unsupervised manner. The method first encodes the local areas of ran-
domly selected RS images into a descriptor space and then decodes from descriptors to
image space. Since encoding and decoding steps are based on convolutional layers, a recon-
struction loss function is directly applied to reduce the error between the input and con-
structed local areas for the unsupervised reconstruction based learning. Since this system
operates on local areas of the images, bag-of-words approach with k-means clustering is
applied to define the global image descriptor from local areas. Although this system has the
same learning strategy as Zhou et al. (2015), its advantages are two-fold compared to Zhou
et al. (2015). First, model parameters are initialized with greedy layer-wise pre-training
that allows a more effective learning procedure with respect to the random initialization
approach. Second, the CAE model has better capability to characterize the semantic con-
tent of images since it considers the neighborhood relationship through the convolution
operations. The reader is referred to Chapter 2 for the detailed discussion on unsupervised
feature learning in RS.
local convolutional descriptors from multiplicative and additive attention mechanisms are
considered to characterize the descriptors of the most relevant regions of the RS images.
This is achieved based on three steps. In the first step, similar to Ye et al. (2018) and Zhou
et al. (2017a), the system operates on randomly selected RS images and applies fine-tuning
to a state-of-the-art CNN model while relying on a classification-based learning strategy
with the cross-entropy loss function. In the second step, additive and multiplicative
attention mechanisms are integrated into the convolutional layers of the CNN and thus
are retrained to learn their parameters. Then, local descriptors are characterized based on
the attention scores of the resized RS images at different scales (which is achieved based
on data augmentation). In the last step, the system transforms VLAD representations with
Memory Vector (MV) construction (which produces the expanded query descriptor) to
make the CBIR system sensitive to the selected query images. In this system, the query
expansion strategy is applied after obtaining all the local descriptors. This query-sensitive
CBIR approach further improves the discrimination capability of image descriptors, since
it adapts the overall learning procedure of DNNs based on the selected queries. Thus, it
has a huge potential for RS CBIR problems.
Most of the above-mentioned DL-based supervised CBIR systems learn an image feature
space directly optimized for a classification task by considering entropy-based loss func-
tions. Thus, the image descriptors are designed to discriminate the pre-defined classes by
taking into account the class based similarities rather than the image based similarities dur-
ing the training stage of the DL models. The absence of positive and negative images with
respect to the selected query image during the training phase can lead to a poor CBIR perfor-
mance. To overcome this limitation, metric learning is recently introduced in RS to take into
account image similarities within DNNs. Accordingly, a Siamese graph convolutional net-
work is introduced in Chaudhuri et al. (2019) to model the weighted region adjacency graph
(RAG) based image descriptors by a metric learning strategy. To this end, mini-batches are
first constructed to include either similar or dissimilar RS images (Siamese pairs). If a pair of
images belongs to the same class, they are assumed to be similar images, and vice versa. Then,
RAGs are fed into two graph convolutional networks with shared parameters to model
image similarities with the contrastive loss function. Due to the considered metric learning
strategy (which is guided by the contrastive loss function) the distance between the descrip-
tors of similar images is decreased, while that between dissimilar images is increased. The
contrastive loss function only considers the similarity estimated among image pairs, i.e.,
similarities among multiple images are not evaluated, which can limit the success of simi-
larity learning for CBIR problems.
To address this limitation, a triplet deep metric learning network (TDMLN) is proposed
in Cao et al. (2020). TDMLN employs three CNNs with shared model parameters for simi-
larity learning through image triplets in the context of metric learning. Model parameters
of the TDMLN are initialized with a state-of-the-art CNN model pre-trained on ImageNet.
For the mini-batch sampling, TDMLN considers an anchor image together with a similar
(i.e., positive) image and a dissimilar (i.e., negative) image to the anchor image at a time.
Image triplets are constructed based on the annotated training images (Chaudhuri et al.
2019). While anchor and positive images belong to the same class, the negative image is
associated to a different class. Then, similarity learning of the triplets is achieved based on
the triplet loss function. By the use of triplet loss function, the distance estimated between
Figure 11.3 The intuition behind the triplet loss function: after training, a positive sample is
moved closer to the anchor sample than the negative samples of the other classes.
the anchor and positive images in the descriptor (i.e., feature) space is minimized, whereas
that computed between the anchor and negative images is separated by a certain margin.
Figure 11.3 illustrates the intuition behind the triplet loss function. Metric learning guided by
the triplet loss function learns similarity based on the image triplets and thus provides
highly discriminative image descriptors in the framework of CBIR. However, how to define
and select image triplets is still an open question. Current methods rely on the image-level
annotations based on the land-cover land-use class labels, which do not directly repre-
sent the similarity of RS images. Thus, metric learning-based CBIR systems need further
improvements to characterize retrieval specific image descriptors. One possible way to over-
come this limitation can be an identification of image triplets through visual interpreta-
tion instead of defining triplets based on the class labels. Tabular overview of the recent
DL-based CBIR systems in RS is presented in Table 11.1.
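A minimal PyTorch-style sketch of the triplet loss described above is given below; the margin value and the use of squared Euclidean distances are illustrative assumptions.

```python
import torch

def triplet_loss(anchor, positive, negative, margin=0.5):
    """Triplet loss: pull the positive towards the anchor and push the negative
    at least `margin` further away (squared Euclidean distances).

    anchor, positive, negative: (batch, dim) descriptor tensors.
    """
    d_pos = torch.sum((anchor - positive) ** 2, dim=1)
    d_neg = torch.sum((anchor - negative) ** 2, dim=1)
    return torch.clamp(d_pos - d_neg + margin, min=0).mean()
```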
Hashing methods embed high-dimensional image representations into low-dimensional binary codes, such that the similarity to the original
space can be well preserved (Demir and Bruzzone 2016; Li and Ren 2017; Reato et al. 2019;
Fernandez-Beltran et al. 2020). Thus, descriptor extraction and binary code generation are
applied independently from each other, resulting in sub-optimal hash codes. Success of
DNNs in image feature learning has inspired research on developing DL-based hashing
methods (i.e., deep hashing methods).
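Whatever way the hash functions are learned, retrieval with binary codes reduces to Hamming-distance ranking, which can be computed with bit-wise XOR operations and bit counting; the following NumPy sketch uses boolean code vectors and brute-force ranking purely for illustration.

```python
import numpy as np

def hamming_rank(query_code, archive_codes, top_k=10):
    """Rank archive images by Hamming distance to a query hash code.

    Codes are boolean vectors of equal length; XOR marks differing bits and
    counting them (popcount) gives the distance. Brute-force ranking for clarity.
    """
    distances = np.count_nonzero(np.logical_xor(archive_codes, query_code), axis=1)
    order = np.argsort(distances)[:top_k]
    return order, distances[order]

# usage sketch with random 64-bit codes:
# archive = np.random.rand(10000, 64) > 0.5
# query = np.random.rand(64) > 0.5
# idx, dist = hamming_rank(query, archive)
```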
Recently, several deep hashing-based CBIR systems that simultaneously learn image rep-
resentations and hash functions based on the suitable loss functions are introduced in RS
(see Table 11.2). As an example, in Li et al. (2018b) a supervised deep hashing neural net-
work (DHNN) that learns deep features and binary hash codes by using the contrastive
and quantization loss functions in an end-to-end manner is introduced. The contrastive
loss function can also be considered as the binary cross-entropy loss function, which is
optimized to classify whether an input image pair is similar or not. One advantage of the
contrastive loss function is its capability of similarity learning, where similar images can
be grouped together, while moving away dissimilar images from each other in the feature
space. Due to the ill-posed gradient problem, the standard back-propagation of DL mod-
els to directly optimize hash codes is not feasible. The use of the quantization loss miti-
gates the performance degradation of the generated hash codes through the binarization
Table 11.2 Main characteristics of the state-of-the-art deep hashing-based CBIR systems in RS.
on the CNN outputs. In Li et al. (2018a) the quantization and contrastive loss functions
are combined in the framework of the source-invariant deep hashing CNNs for learning a
cross-modality hashing system. Without introducing a margin threshold between the sim-
ilar and dissimilar images, only a limited image retrieval performance can be achieved based on
the contrastive loss function. To address this issue, a metric-learning based supervised deep
hashing network (MiLaN) is recently introduced in Roy et al. (2020). MiLaN is trained by
using three different loss functions: (i) the triplet loss function for learning a metric space
(where semantically similar images are close to each other and dissimilar images are sep-
arated); (ii) the bit balance loss function (which aims at forcing the hash codes to have a
balanced number of binary values); and (iii) the quantization loss function. The bit balance
loss function forces each bit of the hash codes to have a 50% chance of being activated, and
different bits to be independent of each other. As noted in Roy et al. (2020), the learned
hash codes based on the considered loss functions can efficiently characterize the complex
semantics in RS images. A supervised deep hashing CNN (DHCNN) is proposed in Song
et al. (2019) in order to retrieve the semantically similar images in an end-to-end man-
ner. In detail, DHCNN utilizes the joint loss function composed of: (i) the contrastive loss
function; (ii) the cross-entropy loss function (which aims at increasing the class discrimina-
tion capability of hash codes); and (iii) the quantization loss. In order to predict the classes
based on the hash codes, a FC layer is connected to the hash layer in DHCNN. As men-
tioned above, one disadvantage of the cross-entropy loss function is its inability to define
a metric space, where similar images are clustered together. To address this issue, the con-
trastive loss function is jointly optimized with the cross-entropy loss function in DHCNN.
A semi-supervised deep hashing method based on the adversarial autoencoder network
(SSHAAE) is proposed in Tang et al. (2019) for RS CBIR problems. In order to generate the
discriminative and similarity preserved hash codes with low quantization errors, SSHAAE
exploits the joint loss function composed of: (i) the cross-entropy loss function; (ii) a recon-
struction loss function; (iii) the contrastive loss function; (iv) the bit balance loss function;
and (v) the quantization loss function. By minimizing the reconstruction loss function, the
label vectors and hash codes can be obtained as the latent outputs of the AEs. A supervised
deep hashing method based on a generative adversarial network (GAN) is proposed in Liu
et al. (2019). For the generator of the GAN, this method introduces a joint loss function that
Table 11.3 Comparison of the DL loss functions considered within the deep hashing-based RS
CBIR systems. Different marks are provided: “×” (no) or “✓” (yes).
is composed of: (i) the cross-entropy loss function; (ii) the contrastive loss function; and (iii)
the quantization loss function. For the discriminator of the GAN, the sigmoid function is
used for the classification of the generated hash codes as true codes. This constrains the learned
hash codes to follow a uniform binary distribution. Thus, the bit bal-
ance capability of hash codes can be achieved. It is worth noting that the above-mentioned
supervised deep hashing methods preserve the discrimination capability and the semantic
similarity of the hash codes in the Hamming space by using annotated training images.
In Table 11.3, we analyze and compare all the above-mentioned loss functions based on
their: (i) capability on similarity learning, (ii) requirement on the mini-batch sampling;
(iii) capability of assessing the bit balance issues; (iv) capability of binarization of the image
descriptors; and (v) requirement on the annotated images. For instance, the contrastive and
triplet loss functions have the capabilities to learn the relationship among the images in the
feature space, where the semantic similarity of hash codes can be preserved. Regarding the
requirement of mini-batch sampling, pairs of images should be sampled for the contrastive loss
function, whereas image triplets should be constructed for the triplet loss function. The
bit balance and adversarial loss functions are exploited for learning the hash codes with the
uniform binary distribution. It is worth noting that an adversarial loss function can be also
exploited for other purposes, such as for image augmentation problems to avoid overfitting
(Cao et al. 2018). The quantization loss function enforces the produced low-dimensional
features by the DNN models to approximate the binary hash codes. With regard to the
requirement on image annotations, the contrastive and triplet loss functions require the
semantic labels to construct the relationships among the images.
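As a rough sketch of how such terms can be combined in a deep hashing network (loosely in the spirit of MiLaN, although the exact formulations and weights here are assumptions), consider the following PyTorch example operating on continuous code vectors that are later binarized by sign thresholding.

```python
import torch

def deep_hashing_loss(anchor, positive, negative, margin=0.5,
                      w_balance=0.1, w_quant=0.1):
    """Joint loss = triplet similarity term + bit-balance term + quantization term.

    Inputs are continuous code vectors in [-1, 1] (e.g., tanh outputs) of shape
    (batch, bits); binary hash codes are obtained afterwards by sign thresholding.
    """
    # (i) triplet term: similarity learning in the code space
    d_pos = torch.sum((anchor - positive) ** 2, dim=1)
    d_neg = torch.sum((anchor - negative) ** 2, dim=1)
    triplet = torch.clamp(d_pos - d_neg + margin, min=0).mean()

    codes = torch.cat([anchor, positive, negative], dim=0)
    # (ii) bit-balance term: each bit should be active roughly 50% of the time
    balance = torch.mean(codes.mean(dim=0) ** 2)
    # (iii) quantization term: push continuous outputs towards the binary values {-1, +1}
    quant = torch.mean((codes.abs() - 1.0) ** 2)

    return triplet + w_balance * balance + w_quant * quant
```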
11.4 Discussion and Conclusion
This chapter has focused its attention on the DL-based CBIR systems in RS. We initially analyzed the recent DL-based
CBIR systems based on: (i) the strategies considered for the mini-batch sampling; (ii) the
approaches used for the initialization of the parameters of the considered DNN models; (iii)
the type of the considered DNNs; and (iv) the strategies used for image representation learn-
ing. Then, the most recent methodological developments in RS related to scalable image
search and retrieval were discussed. In particular, we reviewed the deep hashing-based
CBIR systems and analyzed the loss functions considered within these systems based on
their: (i) capability of similarity learning, (ii) requirement on the mini-batch sampling; (iii)
capability of assessing the bit balance issues; (iv) capability of binarization; and (v) require-
ment on the annotated images. Analysis of the loss functions under these factors provides
a guideline to select the most appropriate loss function for large-scale RS CBIR problems.
It is worth emphasizing that developing accurate and scalable CBIR systems is becoming
more and more important due to the increased number of images in the RS data archives.
In this context, the CBIR systems discussed in this chapter are very promising. Despite the
promising developments discussed in this chapter (e.g., metric learning, local feature aggre-
gation, and graph learning), it is still necessary to develop more advanced CBIR systems.
For example, most of the systems are based on the direct use of the CNNs for the retrieval
tasks, whereas the adapted CNNs are mainly designed for learning a classification prob-
lem and thus model the discrimination of pre-defined classes. Thus, the image descriptors
obtained through these networks cannot learn an image feature space that is directly opti-
mized for the retrieval problems. Siamese and triplet networks are defined in the context of
metric learning in RS to address this problem. However, the image similarity information to
train these networks is still provided based on the pre-defined classes, preventing the achievement
of retrieval-specific image descriptors. Thus, CBIR systems that can efficiently learn image
features optimized for retrieval problems are needed. Furthermore, the existing supervised
DL-based CBIR systems require a balanced and complete training set with annotated image
pairs or triplets, which is difficult to collect in RS. Learning an accurate CBIR model from
imbalanced and incomplete training data is very crucial and thus there is a need for devel-
oping systems addressing this problem for operational CBIR applications. Furthermore, the
availability of an increased number of multi-source RS images (multispectral, hyperspectral
and SAR) associated to the same geographical area motivates the need for effective CBIR
systems, which can extract and exploit multi-source image descriptors to achieve rich char-
acterization of RS images (and thus to improve image retrieval performance). However,
multi-source RS CBIR has not been explored yet (i.e., all the deep hashing-based CBIR
systems are defined for images acquired by single sensors). Thus, it is necessary to study
CBIR systems that can mitigate the aforementioned problems.
Acknowledgement
This work is supported by the European Research Council (ERC) through the
ERC-2017-STG BigEarth Project under Grant 759764.
Part II
12
Deep Learning for Detecting Extreme Weather Patterns
Mayur Mudigonda, Prabhat Ram, Karthik Kashinath, Evan Racah, Ankur Mahesh,
Yunjie Liu, Christopher Beckham, Jim Biard, Thorsten Kurth, Sookyung Kim,
Samira Kahou, Tegan Maharaj, Burlen Loring, Christopher Pal, Travis O’Brien,
Ken Kunkel, Michael F. Wehner, and William D. Collins
moderate agreement between the different tracking methods, with some models and exper-
iments showing better agreement across schemes than others. When comparing responses
between experiments, it is found that much of the disagreement between schemes is due
to differences in duration, wind speed, and formation-latitude thresholds. After homoge-
nization in these thresholds, agreement between different tracking methods is improved.
However, much disagreement remains, accountable for by more fundamental differences
between the tracking schemes. The results indicate that sensitivity testing and selection of
objective thresholds are the key factors in obtaining meaningful, reproducible results when
tracking tropical cyclones in climate model data at these resolutions, but that more fun-
damental differences between tracking methods can also have a significant impact on the
responses in activity detected.” (Horn et al. 2014a)
Extratropical cyclones (ETC) are often identified by conditions of locally maximal vortic-
ity and minimal pressure but are considered more difficult to identify than tropical cyclones
due to their larger and more asymmetric physical characteristics, faster propagation speeds,
and greater numbers. The Intercomparison of Mid-Latitude Storm Diagnostics (IMILAST)
project examined 15 different ETC identification schemes applied to a common reanalysis
and found profound sensitivities in annual global counts, ranging from 400 to 2600 storms
per year (Neu et al. 2013). Atmospheric rivers (AR) are characterized by “a long, narrow, and
transient corridor of strong horizontal water vapor transport that is typically associated with
a low-level jet stream ahead of the cold front of an extratropical cyclone” (American Mete-
orological Society 2017). As a non-cyclonic event, AR identification schemes are even more
heterogeneous and diverse, and are based upon a wide variety of criteria involving total
precipitable water, integrated water transport, and several other variables. The AR commu-
nity has recently self-organized the Atmospheric River Tracking Method Intercomparison Project (ARTMIP)
along similar lines to the IMILAST project (Shields et al. 2018). Some of the recent con-
clusions of the ARTMIP project are: (i) AR frequency, duration, and seasonality exhibit
a wide range of results; and (ii) “understanding the uncertainties and how the choice of
detection algorithm impacts quantities such as precipitation is imperative for stakehold-
ers such as water managers, city and transportation planners, agriculture, or any industry
that depends on global and regional water cycle information for the near term and into
the future. Understanding and quantifying AR algorithm uncertainty is also important for
developing metrics and diagnostics for evaluating model fidelity in simulating ARs and
their impacts. ARTMIP launched a multitiered intercomparison effort designed to fill this
community need. The first tier of the project is aimed at understanding the impact of AR
algorithm on quantitative baseline statistics and characteristics of ARs, and the second tier
of the project includes sensitivity studies designed around specific science questions, such
as reanalysis uncertainty and climate change.”
Other types of weather events are less amenable to such heuristics-based automated
identification. Blocking events are obstructions “on a large scale, of the normal west-to-east
progress of migratory cyclones and anticyclones” (American Meteorological Society 2017)
and are associated with both extreme precipitation and extreme temperature events.
Three objective schemes were compared by Barnes et al. (2012), who found that differing
block structure affects the robustness of identification. Objective identification of fronts,
“interface or transition zone between two air masses of different density” (American
Meteorological Society 2017), is even less developed. Location of fronts can often be
detected visually from maps of pressure and temperature, but a clear identification of the
boundary usually requires the synthesis of multiple variables. Hewson (1998) summarizes
efforts to objectively detect fronts.
The climate science community is yet to fully exploit the capabilities of modern machine
learning and deep learning methods for pattern detection. Self-organizing maps (SOMs)
have been used to a limited extent for visually summarizing patterns in atmospheric fields
(Sheridan and Lee 2011). For example, Hewitson and Crane (2002) and Skific et al. (2009)
used this technique to detect patterns in surface pressure fields in the eastern U.S. and
the Arctic, respectively. Loikith et al. (2017) found large-scale meteorological patterns asso-
ciated with temperature and precipitation extremes for the northeastern U.S. by utilizing
SOMs. Gibson et al. (2017) applied the method to examination of extreme events in Aus-
tralia. While SOMs are a powerful tool to visualize the variations in atmospheric patterns,
these patterns often represent different locations of the same meteorological phenomenon.
Thus, they are not relevant to the purposes of our study.
These heuristic schemes can be abstracted into two distinct parts. The first part (detec-
tion) scans each time step of high frequency model output for candidate events according
to the specified criteria while the second part (tracking) implements a continuity condition
across space and time (Prabhat et al. 2012, 2015a). Details, of course, vary significantly. The
detection step is a massive data reduction but may still contain many candidates that will
not satisfy the tracking criteria. This is especially the case for tropical cyclones but much
less so for atmospheric rivers.
Supervised machine learning techniques tailored to identify extreme weather events offer
an alternative to these objective schemes as well as provide an automated method to imple-
ment subjective schemes. The latter is critical to understanding how climate change affects
weather systems, such as frontal systems, for which objective identification schemes have
not been developed. In both cases, the construction of suitable labeled training datasets is
necessary and the details of how they are constructed are important. We discuss towards
the end some progress on this front as well.
The heuristics-based identification schemes described above present quantitative
definitions of the weather events in question. Hence, it should come as no surprise that
different definitions yield different results. Supervised machine learning techniques
trained on datasets produced by these schemes should, at the very least, mimic the original
identification scheme.
Modern day deep learning presents opportunities for computational modeling that are
to-date unparalleled. Deep learning has shown remarkable successes on a wide range of
problems in vision, robotics, natural language processing and more. A large category of
problems in climate science can be posed as supervised learning problems such as weather
front classification, identifying atmospheric rivers and tropical cyclones, etc. There are
many parallel applications to these problems in computer vision.
In summary, identifying, detecting, and localizing extreme weather events is a crucial first
step in understanding how they may vary under different climate change scenarios. Pattern
recognition tasks such as classification, object detection, and segmentation have remained
challenging problems in the weather and climate sciences. While there exist many empirical
heuristics for detecting extreme events, the disparities between the output of these different
methods even for a single event are large and often difficult to reconcile. Given the success of deep learning on analogous pattern recognition tasks, we explore its application to the detection of extreme weather patterns in the remainder of this chapter.
12.2 Tropical Cyclone and Atmospheric River Classification
12.2.1 Methods
We collected ground truth labeling of TCs and ARs obtained via application of heuristic
multivariate threshold-based criteria (Prabhat et al. 2015a; Knutson et al. 2007) and manual
Figure 12.1 Contrasting traditional heuristics-based event detection versus deep learning-based
detection.
classification by expert meteorologists (Neiman et al. 2008; Lavers et al. 2012). Training
data for these two types of events consist of image patches, defined as a prescribed geo-
metrical box that encloses an event, and a corresponding spatial grid of relevant variables
extracted from global climate model simulations or reanalyses. The size of the box is based
on domain knowledge – for example, a 500 × 500 km box is likely to contain most tropical
cyclones. To facilitate model training, an image patch is centered over the event. Because
the spatial extents of events vary and the spatial resolution of simulation and reanalysis data
is non-uniform, final training images differ in their size across the two types of events. This
is one of the key limitations that prevents developing one single convolutional neural net-
work to classify both types of storms simultaneously. The images are classified as those that
contain events and those that do not contain events.
A summary of the attributes of training data is listed in Table 12.1, and attributes of orig-
inal reanalysis and model simulation data are documented in Table 12.2. The datasets are
split 80% and 20% for training and testing respectively.
Table 12.2 Dimension of image, diagnostic variables (channels) and labeled dataset size for
classification task (PSL: sea surface pressure, U: zonal wind, V: meridional wind, T: temperature,
TMQ: vertical integrated water vapor, Pr: precipitation).
Figure 12.2 Top: architecture for tropical cyclone classification. Right: architecture for atmospheric
rivers. Precise details can be found in Table 12.3.
Table 12.3 Classification CNN architecture and layer parameters. The convolutional layer
parameters are denoted as <kernel size> − <number of feature maps> (e.g. 5 × 5 − 8). The pooling
layer parameters are denoted as <pooling window> (e.g. 2 × 2). The fully connected layer
parameter are denoted as <number of units> (e.g. 2). Non-linear activation function of hidden unit
is shown in parentheses.
Conv1 (ReLu) Pool1 Conv2 (ReLu) Pool2 Fully (ReLu) Fully (Sigmoid)
The architecture consists of convolutional and pooling layers, followed by a fully connected layer and a sigmoid unit that predicts the binary label.
Training deep architectures is known to be difficult (Larochelle et al. 2009; Glorot and Bengio 2010) and requires careful tuning of parameters. In this study, we employ a Bayesian hyper-parameter optimization framework to facilitate parameter selection. Following AlexNet (Krizhevsky et al. 2012), we build a classification system with two convolutional layers followed by two fully connected layers. Details of the architecture and
layer parameters can be found in Table 12.3 and Figure 12.2.
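As an illustration of this kind of architecture, the following is a minimal Keras sketch of a patch classifier in the spirit of Figure 12.2 and Table 12.3: two convolutional layers with ReLU activations and max-pooling, followed by a fully connected ReLU layer and a sigmoid output unit. The input shape, filter counts, kernel sizes, and optimizer are illustrative placeholders rather than the exact values used in the study; in practice they would be chosen by the Bayesian hyper-parameter optimization mentioned above.

```python
# Hypothetical Keras sketch of a two-convolutional-layer binary classifier in the
# spirit of Figure 12.2 / Table 12.3.  All sizes and settings are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_storm_classifier(input_shape=(32, 32, 8)):
    """Binary classifier: does this image patch contain an event (TC or AR)?"""
    model = models.Sequential([
        layers.Input(shape=input_shape),                        # patch of climate variables
        layers.Conv2D(8, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Conv2D(16, (5, 5), padding="same", activation="relu"),
        layers.MaxPooling2D((2, 2)),
        layers.Flatten(),
        layers.Dense(50, activation="relu"),                    # fully connected hidden layer
        layers.Dense(1, activation="sigmoid"),                  # event / no-event probability
    ])
    model.compile(optimizer="adam", loss="binary_crossentropy", metrics=["accuracy"])
    return model

model = build_storm_classifier()
```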
12.2.3 Results
The distinct characteristics of tropical cyclones, such as a low pressure center with strong winds circulating around the center and a warm temperature core in the upper troposphere, make their patterns relatively easy to learn to represent with a CNN. Our deep CNNs achieved nearly perfect (99%) classification accuracy, with failures associated with weakly developed storms that did not have the distinct features described above. Table 12.5 shows the confusion matrix for TCs; it reports the accuracy of the predicted labels compared to ground truth. Further details on the statistics and accuracy,
including examples of failure modes, are presented in Liu et al. (2016b, c).
In contrast, deep CNNs achieve 90% classification accuracy for ARs. The challenges faced by deep CNNs in AR detection are the relatively weak and/or disjointed (broken) IWV features of ARs and the presence of other weather systems in the vicinity, such as extra-tropical cyclones and jet streams. Table 12.6 shows the confusion matrix for ARs.
Further details on the statistics and accuracy, including examples of failure modes, are
in Liu et al. (2016b, c).
The results in this section suggest that deep convolutional neural networks are a pow-
erful and novel approach for extreme event detection that does not rely on cherry-picked
features and thresholds. They also motivate the application of deep learning for a broader
class of pattern detection problems in climate science.
Figure 12.3 4-layer front detection CNN architecture with 64 filters of size 5 × 5 per layer in the first three convolutional layers and five 5 × 5 filters in the last layer. All convolutional layers are padded to have the same output shape.
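A minimal Keras sketch of a fully convolutional network matching this description is given below; the per-pixel softmax over the five front categories and the choice of optimizer and loss are assumptions, not details taken from the original implementation.

```python
# Sketch of the 4-layer fully convolutional front-detection network of Figure 12.3.
# The per-pixel softmax over the five categories (cold, warm, stationary, occluded,
# none) is an assumption about how the likelihood maps are produced.
import tensorflow as tf
from tensorflow.keras import layers, models

def build_front_cnn(height=None, width=None, n_inputs=5, n_classes=5):
    # Five surface input fields (e.g. 2 m T, 2 m q, SLP, 10 m u, 10 m v) on a lat/lon grid.
    inputs = layers.Input(shape=(height, width, n_inputs))
    x = inputs
    for _ in range(3):
        # 64 filters of size 5x5; "same" padding keeps the grid shape unchanged.
        x = layers.Conv2D(64, (5, 5), padding="same", activation="relu")(x)
    # Last layer: five 5x5 filters, one likelihood map per front category.
    outputs = layers.Conv2D(n_classes, (5, 5), padding="same", activation="softmax")(x)
    model = models.Model(inputs, outputs)
    model.compile(optimizer="adam", loss="categorical_crossentropy")
    return model

model = build_front_cnn()
```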
12.3.2 Dataset
The input dataset consisted of gridded fields of five surface variables taken from the MERRA-2 reanalysis (https://gmao.gsfc.nasa.gov/reanalysis/merra2). The variables were 3-hourly instantaneous values of 2 m air temperature, 2 m specific humidity, sea level pressure, and the two components of the 10 m wind velocity.
Our truth dataset was extracted from the Coded Surface Bulletin (CSB) (http://www.nws
.noaa.gov/noaaport/html/noaaport.shtml). Each text bulletin contains latitudes and longi-
tudes specifying the locations of pressure centers, fronts, and troughs identified visually.
Each front and trough is represented by a polyline. We obtained all the bulletins possible for
2003–2016 and produced a 5-channel (categories of cold, warm, stationary, occluded, and
none) image for each time step by drawing the front lines into latitude/longitude grids with
one degree cell size. The image is filtered so that only one channel is set in each pixel, with
a front-type preference order of warm over occluded over cold over stationary over none.
Each front is drawn with a transverse extent of three degrees to account for the fact that a
front is not a zero-width line and to add tolerances for slight lateral differences in position
between the CSB and MERRA-2 derived fronts. In addition, the quantitative evaluation
was restricted to regions where the frequency of fronts is at least 40 per year. The network
was implemented in Keras and Theano, and trained with the data for 2003–2016 using an
80%-20% training-test split.
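To make the rasterization step concrete, the following NumPy sketch draws front polylines into a one-degree, 5-channel grid with an approximately three-degree transverse extent, resolving overlapping categories with the stated preference order. The domain boundaries, the segment sampling, and the helper names are hypothetical and only illustrate the idea.

```python
# Illustrative sketch of rasterizing CSB front polylines into a one-degree,
# 5-channel truth grid.  Geometry and sampling choices are placeholders.
import numpy as np

# Priority from lowest to highest: warm wins over occluded over cold over stationary over none.
CATEGORIES = ["none", "stationary", "cold", "occluded", "warm"]

def rasterize_fronts(fronts, lat0=20.0, lat1=60.0, lon0=-130.0, lon1=-60.0,
                     cell=1.0, half_width=1.5):
    """fronts: list of (category, [(lat, lon), ...]) polylines from one bulletin."""
    nlat = int(round((lat1 - lat0) / cell))
    nlon = int(round((lon1 - lon0) / cell))
    grid_lat, grid_lon = np.meshgrid(lat0 + cell * (np.arange(nlat) + 0.5),
                                     lon0 + cell * (np.arange(nlon) + 0.5), indexing="ij")
    priority = np.zeros((nlat, nlon), dtype=int)          # every cell starts as "none"

    for category, polyline in fronts:
        rank = CATEGORIES.index(category)
        pts = np.asarray(polyline, dtype=float)
        for (lat_a, lon_a), (lat_b, lon_b) in zip(pts[:-1], pts[1:]):
            # Sample points along the segment and mark cells within the transverse extent.
            for t in np.linspace(0.0, 1.0, 50):
                la = lat_a + t * (lat_b - lat_a)
                lo = lon_a + t * (lon_b - lon_a)
                near = (np.abs(grid_lat - la) <= half_width) & (np.abs(grid_lon - lo) <= half_width)
                priority[near] = np.maximum(priority[near], rank)

    # One-hot image: exactly one of the five channels set in each pixel.
    return np.eye(len(CATEGORIES), dtype=np.float32)[priority]

# Example: one cold front crossing the domain.
truth = rasterize_fronts([("cold", [(45.0, -100.0), (40.0, -90.0), (35.0, -85.0)])])
print(truth.shape)  # (40, 70, 5)
```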
We then tested and calculated the confusion matrix and per-category IOU (ratio of cor-
rectly categorized pixels to total pixels in that category in either truth or CNN data, com-
puted over each category) for the entire set of images. We also extracted polylines describing
the fronts in each timestep by tracing out the lines following the maxima for each type of
front. These were used to calculate the annual average number of front crossings at each
96 km × 96 km grid cell in a Lambert Conformal Conic map. We then compared the results
with the annual average number of front crossings found using the original polylines from
the CSB.
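The per-category IOU described above can be computed with a few lines of NumPy; the sketch below assumes the truth and CNN outputs have already been reduced to integer category maps of identical shape.

```python
# Minimal sketch of the per-category IOU: pixels labeled with a category in both
# truth and prediction, divided by pixels carrying that category in either of the two.
import numpy as np

def per_category_iou(truth, pred, n_categories=5):
    ious = {}
    for c in range(n_categories):
        in_truth = truth == c
        in_pred = pred == c
        intersection = np.logical_and(in_truth, in_pred).sum()
        union = np.logical_or(in_truth, in_pred).sum()
        ious[c] = intersection / union if union > 0 else np.nan
    return ious

# Tiny example with three categories present.
truth = np.array([[0, 2, 2], [0, 0, 4]])
pred  = np.array([[0, 2, 0], [0, 4, 4]])
print(per_category_iou(truth, pred))  # categories 0, 2 and 4 each have IOU 0.5 here
```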
12.3.3 Results
Figure 12.4 shows an example of the output from the CNN using MERRA2 inputs for 20
March, 2010 at 12:00 UTC along with the fronts produced by NWS meteorologists for the
same date and time. All four front types have been drawn together using different colors,
with the color intensity in the CNN-generated image correlating with the likelihood value
produced for that front type. Areas of mixed color correspond with regions where more
than one front type produced a significant response. This image pair shows that the CNN is capturing a large majority of the CSB features. There are some, mostly low-likelihood, fronts found by the CNN that may not be physical, and there are some regions of disagreement.
Figure 12.4 Coded Surface Bulletin fronts and CNN-generated front likelihoods (cold, warm, occluded, and stationary).
Table 12.7 Per-category counts and IOU front detection metrics for 2013–2016.
Table 12.7 shows the per-category IOU metric for the front likelihood predictions. The
IOU values indicate low agreement between the truth and predicted data, but inspection
of the images (as seen in Figure 12.4) suggests that this is largely due to spatial differences
between CSB and CNN in the location of fronts. The counts and IOU values show that the
performance of the CNN is best for cold and occluded fronts, and worst for warm fronts.
Table 12.8 shows the confusion matrices for the cases when both the truth and CNN
datasets indicated the presence of some type of front. Warm fronts are shown to have the
greatest confusion, with true warm front pixels being categorized by the CNN as a different
type of front almost 60% of the time. The other three types of fronts have much better per-
formance, with cold and occluded fronts nearing 80% accuracy relative to the truth counts
for those front types.
Figure 12.5 displays the results of taking the per-pixel mean of the number of occurrences of fronts of all types over the entire 2003–2016 period covered by both of our datasets. The spatial pattern for the CSB fronts (Figure 12.5a) shows the highest values off the east and west coasts and in the central U.S. The absolute numbers for the MERRA2 fronts (Figure 12.5b) are lower than the CSB numbers almost everywhere, but the spatial pattern is very similar, with maxima in the same locations. The overall difference between the
frequencies indicates that the CNN is detecting approximately 80% of the fronts found by
NWS meteorologists.
12.3.4 Limitations
The current front detection CNN consistently under-detects fronts by 20%, particularly
warm fronts, and tends to conflate cold and stationary fronts. Improvement is likely pos-
sible by experimenting with a larger number of layers, with different numbers of filters,
and with adjustments to the relative weights given to the different front types in the loss
function used for training. The CNN currently treats each 3-hourly image independently
and does not take advantage of the high temporal correlations in front locations. Building
a CNN or a hybrid LSTM architecture that makes use of multiple time steps has potential
to produce significant improvements. The current system also only uses surface variables
as inputs. Adding one or more variables at some elevation above the surface also has the
potential to produce improvements.
Table 12.9 Class frequency breakdown for Tropical Cyclones (TC), Extra-Tropical Cyclones (ETC),
Tropical Depressions (TD), and Atmospheric Rivers (AR). Raw counts in parentheses. Table
reproduced with permission from (Racah et al. 2016).
consecutive simulation frames in order to extract spatiotemporal features that can poten-
tially assist with climate event detection.
12.4.2 Results
12.4.2.1 Frame-wise Reconstruction
Before bounding box prediction, we first trained a 2D convolutional autoencoder on the
data by treating each time-step as an individual training example, in order to visually assess
reconstructions and to ensure the unsupervised part of the architecture was extracting use-
ful features. Figure 12.8 shows the original and reconstructed feature maps for the 16 cli-
mate variables of one image in the dataset. We are able to achieve excellent reconstructions
using an extremely compressed bottleneck representation (slightly less than 1% of the original input size).
Figure 12.6 Diagram of the 3D semi-supervised convolutional network architecture. Output shapes of the various layers are denoted below the feature maps. The small network at the bottom of the diagram is responsible for predicting the bounding boxes, while the decoder part of the network (right), which is symmetric to the encoder, is responsible for reconstructing the output. Because both the bounding box predictor and decoder feed off of the encoder part of the network (left), they both contribute useful error signals to shape the underlying representation of the network.
Table 12.10 Semi-Supervised Accuracy Results: Mean AP for the models. Table reproduced with permission from (Racah et al. 2016).
Table 12.11 AP for each class. Frequency of each class in the test set shown in parentheses. First
number is at IOU = 0.1 and second number after the semicolon is at IOU = 0.5 as the criteria for a
true positive. In each column, highlighted in bold is the best result for that particular class and IOU
setting. Table reproduced with permission from (Racah et al. 2016).
Model  Mode  Parameters (millions)  𝝀  ETC (46.47%) AP (%)  TC (39.04%) AP (%)  TD (9.44%) AP (%)  AR (5.04%) AP (%)
2D Sup 66.53 0 21.92; 14.42 52.26; 9.23 95.91; 10.76 35.61; 33.51
2D Semi 66.53 1 18.05; 5.00 52.37; 5.26 97.69; 14.60 36.33; 0.00
2D Semi 66.53 10 15.57; 5.87 44.22; 2.53 98.99; 28.56 36.61; 0.00
2D Sup 16.68 0 13.90; 5.25 49.74; 15.33 97.58; 7.56 35.63; 33.84
2D Semi 16.68 1 15.80; 9.62 39.49; 4.84 99.50; 3.26 21.26; 13.12
3D Sup 50.02 0 22.65; 15.53 50.01; 9.12 97.31; 3.81 34.05; 17.94
3D Semi 50.02 1 24.74; 14.46 56.40; 9.00 96.57; 5.80 33.95; 0.00
Figure 12.7 Bounding box predictions shown on 2 consecutive (6 hours in between) simulation
frames (integrated water vapor column). Green = ground truth, red = high confidence predictions
(confidence above 0.8 IoU). Left: 3D supervised model, right: 3D semi-supervised model. Figure
reproduced with permission from (Racah et al. 2016).
Figure 12.8 Feature maps for the 16 channels for one of the frames in the dataset (left) versus
their reconstructions from the 2D convolutional autoencoder (right). The dimensions of the
bottleneck of the encoder are roughly 0.8% of the size of the input dimensionality, which
demonstrates the ability of deep autoencoders to find a robust and compressed representation of
their inputs. Figure reproduced with permission from (Racah et al. 2016).
Figure 12.9 Schematic of the modified DeepLabv3+ network used in this work. The encoder
(which uses a ResNet-50 core) and atrous spatial pyramid pooling (ASPP) blocks are changed
for the larger input resolution. The DeepLabv3+ decoder has been replaced with one that operates
at full resolution to produce precise segmentation boundaries. Standard convolutions are in dark
blue, and deconvolutional layers are light blue. Atrous convolution layers are in green and specify
the dilation parameter used.
that this approach led to numerical stability issues, especially with FP16 training, due to
the large difference in per-pixel loss magnitudes. We examined more moderate weightings
of the classes and found that using the inverse square root of the frequencies addressed
stability concerns while still encouraging the network to learn to recognize the minority
classes (see Figure 12.10).
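A sketch of such a weighting is shown below: per-pixel cross-entropy in which each class is weighted by the inverse square root of its frequency. The class frequencies, the number of classes, and the loss implementation are illustrative assumptions rather than the exact setup used for the Tiramisu and DeepLabv3+ experiments.

```python
# Sketch of per-pixel cross-entropy with class weights proportional to the inverse
# square root of the class frequencies.  Frequencies are made-up placeholders for
# background / TC / AR pixels.
import numpy as np
import tensorflow as tf

class_freq = np.array([0.94, 0.02, 0.04])             # hypothetical pixel frequencies
class_weights = 1.0 / np.sqrt(class_freq)
class_weights /= class_weights.sum()                   # optional normalization
weights_const = tf.constant(class_weights, dtype=tf.float32)

def weighted_pixel_loss(y_true, y_pred):
    """y_true: one-hot (batch, H, W, 3); y_pred: softmax probabilities, same shape."""
    per_pixel_ce = -tf.reduce_sum(y_true * tf.math.log(y_pred + 1e-7), axis=-1)
    per_pixel_w = tf.reduce_sum(y_true * weights_const, axis=-1)   # weight of the true class
    return tf.reduce_mean(per_pixel_w * per_pixel_ce)
```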
The developers of the original Tiramisu network advocate the use of many layers with
a relatively small growth rate per layer (e.g. 12 or 16) (Jégou et al. 2017) and our initial
network design used a growth rate of 16. This network learned well, but performance anal-
ysis of the resulting TensorFlow operations on Pascal and Volta GPUs found considerable
room for improvement and we determined that a growth rate of 32 would be significantly
more efficient. To keep the overall network size roughly the same, we reduced the number
of layers in each dense block by a factor of two and changed the convolutions from 3 × 3
to 5 × 5 to maintain the same receptive field. Not only was the new network much faster
to compute, we found that it trained faster and yielded a better model than our original network.
Figure 12.10 Top: Segmentation masks overlaid on a globe. Colors (white-yellow) indicate IWV (integrated water vapor, kg/m2), one of the 16 input channels used by the network. Bottom: Detailed inset showing predictions (red and blue) vs. labels used in training (black). Segmentation results from modified DeepLabv3+ network. Atmospheric rivers (ARs) are labeled in blue, while tropical cyclones (TCs) are labeled in red.
For DeepLabv3+, the atrous convolutions result in a more computationally expensive
network than Tiramisu. The standard DeepLabv3+ design makes the compromise of per-
forming segmentation at one-quarter resolution (i.e. 288 × 192 rather than 1152 × 768) to
keep the computation tractable for less-powerful systems, at the cost of fidelity in the result-
ing masks. The irregular and fine-scale nature of our segmentation labels requires operating
at the native resolution of the dataset. With the performance of Summit available for this
work, we were able to replace the standard DeepLabv3+ decoder with one that operates at
full resolution, thereby benefiting the science use case.
12.5.3 Results
Segmentation accuracy is often measured using the intersection over union (IoU) metric.
The Tiramisu network obtained an IoU of 59% on our validation dataset, while our modified
DeepLabv3+ network was able to achieve 73% IoU. Visually, this translates into qualita-
tively pleasing masks as seen in Figure 12.10. Not only does the network find the same
The work presented in this chapter is an important first step towards establishing the rele-
vance and success of deep learning methods in finding extreme weather patterns. We now
enumerate a number of open challenges, and encourage the community to work with us in
addressing these problems in the future.
● Training time: Deep learning is computationally expensive. Our current front detec-
tion and supervised classification implementations take several days to train; the
semi-supervised architectures currently take 1–2 weeks to converge. Kurth et al. (2017,
2018) and Mudigonda et al. (2018) present early work targeting this issue, with impressive results and training times of less than a day in many cases. It is very important that
the climate science community have access to optimized, multi-node implementations
of deep learning libraries;
● Hyper-parameter optimization: Determining the right DL network architecture for a given
problem is currently an art. Practitioners have to typically conduct some amount of explo-
ration in the space of number of layers, type of layers, learning rates, training schedules,
etc. If each hyper-parameter combination requires a few days to converge, this exploration
quickly becomes infeasible. We would like to request the mainstream AI research com-
munity to develop efficient software and techniques for solving the problem of finding
optimal DL architectures.
● Extension to temporal patterns: Our current results are largely based on processing instan-
taneous snapshots of 2D fields. Climate patterns often span 3D fields and persist for
extended periods of time. Ideally, we would train hybrid convolutional + LSTM archi-
tectures (Xingjian et al. 2015);
The challenge of modeling time in high-dimensional systems is non-trivial. That said,
there are many physical constraints that we are aware of (such as conservation of energy,
etc.) that can help constrain the manifold for learning.
While we present some work (based on Kurth et al. (2018) and Mudigonda et al. (2018))
on higher-dimensional grids, more work is required to understand larger-scale phenom-
ena.
● Interpretability: Deep Networks are complicated functions with several layers of linear
and non-linear filtering. While some effort has been spent in understanding ImageNet
architectures (Zeiler and Fergus 2013), there is currently a gap in mapping the extracted
feature hierarchy to semantic concepts from climate science. Recent approaches targeted
at interpreting climate data (Toms et al. 2019b) demonstrate promising results; however,
much remains to be accomplished towards the goal of developing interpretable, explain-
able DL networks.
12.7 Conclusions
This chapter presents the first comprehensive assessment of deep learning for extracting
extreme weather patterns from climate datasets. We have demonstrated the application
of supervised convolutional architectures for detecting tropical cyclones and atmospheric
rivers in cropped, centered image patches. Subsequently, we demonstrated the applica-
tion of similar architectures to predicting the type of weather front at the granularity of a
grid-cell. Finally, we developed a unified architecture for simultaneously localizing, as well
as classifying tropical cyclones, extra-tropical cyclones and atmospheric rivers. The benefit
of the semi-supervised approach lies in the possibility of detecting other coherent fluid-flow
structures that may not yet have a semantic label attached to them.
This work also highlights a number of avenues for future work motivated by the prag-
matic challenges associated with improving the performance and scaling of deep learning
methods and hyper-parameter optimization. Extending the methods to 3D space-time grids
is an obvious next step. However, this will require creation of large training datasets, requir-
ing the climate science community to conduct labeling campaigns. Finally, improving the
interpretability of these methods will be key to ensure adoption by the broader climate sci-
ence community.
13
Spatio-temporal Autoencoders in Weather
and Climate Research
Xavier-Andoni Tibau, Christian Reimers, Christian Requena-Mesa, and Jakob Runge
13.1 Introduction
Understanding and predicting weather and climate is one of the main concerns of
humankind today. This is more true now than ever before due to the urgent need to
understand how climate change will affect the Earth’s atmosphere with its severe societal
impacts (Stocker et al. 2013).
The main tools in weather and climate research are physics-based models and obser-
vational data analysis. In contrast to other complex systems, such as the human brain, a
large body of knowledge on the underlying physical processes governing weather and cli-
mate exists. Since the 1960s numerical weather and climate models that simulate these
processes have greatly progressed (Simmons and Hollingsworth 2002), with each new gen-
eration of models bringing better resolution and more accurate forecasts. Physics-based
models are also used to understand particular processes, for example, by targeted model
experiments to evaluate how the atmosphere is coupled to the ocean. A major challenge of
such models today lies in the computational complexity of simulating all relevant physical
processes, from atmospheric turbulence to the biosphere, where dynamics are chaotic and
occur on multiple scales. Deep learning could make simulations much faster by augment-
ing physical climate models with faster deep learning-based parametrization schemes for
such processes, termed hybrid modeling, with the challenge to preserve physical consis-
tency (Reichstein et al. 2019).
On the other hand, in the last decades satellite systems and ground-based measurement
stations have led to vast amounts of observational data on the various subsystems of the Earth. Such
datasets, together with increasing computational power, pave the way for novel data-driven
learning algorithms to better understand the underlying processes governing the Earth’s
climate. Two prominent data analysis approaches, causal inference (Runge et al. 2019) and
deep learning (Reichstein et al. 2019), both bear great promise in this endeavor. However,
typical Earth science datasets pose major challenges for such methods, from dataset sizes
and nonlinearity to the spatio-temporal nature of the underlying system.
13.2 Autoencoders
The goal of an AE is to encode information in an efficient way. The basic idea is to build
two neural networks, one encoder and one decoder. The encoder contains multiple layers,
each with fewer neurons than the layer before. The decoder is inverse to the encoder in the
sense that every layer contains more neurons than the layer before. The two networks do
not have to be symmetric. They connect at the layer with the smallest number of neurons,
the bottleneck. The general architecture of an autoencoder can be seen in Figure 13.1.
To fit the parameters of the AE, the standard back-propagation (Rumelhart and Williams
1986) is used to minimize the loss function 𝔏(X, X ′ ), for example, the mean squared error,
between the input X and output X ′ . In this process the parameters of the AE converge in the
direction of steepest descent. Formally, the encoder 𝜙 and the decoder 𝜑 are given by
𝜙 ∶ X ↦ H ,    𝜑 ∶ H ↦ X′ .  (13.1)
We adapt the parameters of these functions to minimize the distance between 𝜑(𝜙(X)) and X. The idea is that during this training the shrinking layers of the AE will encode
the information in an efficient way in the bottleneck.
Figure 13.1 The general architecture of a spatial AE. The left-most layer constitutes the input
data, e.g., a spatio-temporal field. Each next layer is a simple function of the previous layer. The
size of the representation gets smaller in every layer which forces the AE to compress
the information of the input efficiently in the center bottleneck of the AE. The symmetric layers
to the right represent the decoder.
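As a concrete illustration, the following is a minimal Keras sketch of such an AE: a dense encoder that shrinks the input towards a small bottleneck H, a symmetric decoder that maps back to the reconstruction X′, and training by back-propagation on the mean squared error. All layer sizes are placeholders.

```python
# Minimal sketch of the autoencoder in Figure 13.1; sizes are illustrative.
import tensorflow as tf
from tensorflow.keras import layers, models

n_input, n_bottleneck = 1024, 16

encoder = models.Sequential([
    layers.Input(shape=(n_input,)),
    layers.Dense(256, activation="relu"),
    layers.Dense(64, activation="relu"),
    layers.Dense(n_bottleneck, activation="relu"),   # bottleneck H
], name="encoder")

decoder = models.Sequential([
    layers.Input(shape=(n_bottleneck,)),
    layers.Dense(64, activation="relu"),
    layers.Dense(256, activation="relu"),
    layers.Dense(n_input, activation="linear"),      # reconstruction X'
], name="decoder")

autoencoder = models.Sequential([encoder, decoder])
# Back-propagation minimizes the reconstruction loss L(X, X'), here the MSE.
autoencoder.compile(optimizer="adam", loss="mse")
```

Training with autoencoder.fit(X, X, ...) then fits encoder and decoder jointly on the self-reconstruction task.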
Here 𝑤ij is the weight between the units i and j, by si we denote the state of unit i, and 𝜃i
is a threshold. Some of the units are associated to observables while some of the states are
associated to latent variables. The machine is trained by fixing the units associated with
observables and then computing the other states such that the energy is minimized. The
advantage of this approach is that the optimal state of a unit can be determined locally by
the difference in energy between a unit being on or off,
ΔEk = ∑i 𝑤ki si − 𝜃k .  (13.3)
The actual state of all units is decided randomly to avoid getting stuck in local minima. The
level of randomness is given by a temperature constant T. The probability P𝛼 to be in global
state 𝛼 compared to the probability P𝛽 to be in global state 𝛽 is given by the Boltzmann
distribution
P𝛼 ∕ P𝛽 = e−(E𝛼−E𝛽)∕T .  (13.4)
Boltzmann machines are trained by adapting the weights such that the expected energy in
the minimum of energy for any observation is minimal. The linearity of the energy further
allows the derivative of the log probability to be computed depending on the strength of the
connections as
𝜕 ln P𝛼 ∕ 𝜕𝑤ij = (1∕T) (s𝛼i s𝛼j − pij ) ,  (13.5)
where s𝛼i is the state of unit i in global state 𝛼 and pij is the probability of both units being
on at the same time across the dataset.
The authors of Ackley et al. (1985) call the abovementioned problem “The Encoder Prob-
lem” and credit it to Sanjaya Addanki. They use an architecture with one input and one
output layer of size 𝜈 and log 𝜈 hidden units. Due to the symmetric connections of the
Boltzmann machine, the hidden units are not ordered in layers, but every hidden unit
is connected to every other hidden unit. In their experiments the authors (Ackley et al.
1985) show that a Boltzmann machine with four input and output neurons can solve the
auto-associative task reliably. For larger input-sizes, however, many learning cycles are
needed to find at least a close-to-optimal solution. While the learning problem can be solved reliably by allowing for more hidden units, the problem that training Boltzmann machines
is slow remains.
Today, the term “neural network” is mainly used for feed forward networks that are
trained using backpropagation. This kind of network was first presented by Rumelhart and
Williams (1986); Rumelhart et al. (1985). They used the same problem setting as presented
above and found that their AE consistently managed to learn the task while being simpler
and faster in training than Boltzmann machines.
In the following, several authors presented examples that indicated that AEs could solve
difficult and interesting problems. For example, Cottrell et al. demonstrated (Cottrell and
Willen 1988) that AEs with only three layers reach results on image compression that are
comparable with the state-of-the-art of their time. They achieved this using an architecture
similar to a modern convolutional neural network with just one filter. They divided the
image in patches of 8 × 8 pixels and used a network consisting of 64 input neurons, 64 out-
put neurons and 16 hidden neurons to encode and decode every single patch of the image
individually. Their approach did not just learn to encode images with information loss close
to the state-of-the-art of that time but also generalized to different, unseen images. One of
the main drawbacks was that the AEs as described above converge towards a principal com-
ponent analysis (PCA). This phenomenon is discussed, for example, in Cottrell and Willen
(1988); Bourlard and Kamp (1988); Baldi and Hornik (1989). We discuss the similarities
between AEs and standard machine learning methods in subsection 13.2.4.
In the following, the development of AEs was closely related to the general development
of deep neural networks, which made it possible to fit more complex and expressive encoders
and decoders. We now mention some steps of this development. Firstly, the development
of convolutional neural networks, an idea proposed by Fukushima (1980) and further
considered by Rumelhart and Williams (1986); Rumelhart et al. (1985), demonstrated
impressive effectiveness on real-world problems (see, e.g., LeCun et al. (1989)). Secondly,
the work of Nair and Hinton (2010) and Krizhevsky et al. (2012) demonstrated that
these methods can be scaled up to large image datasets and outperform state-of-the-art
methods at that time. Thirdly, Rumelhart and Williams (1986); Rumelhart et al. (1985),
Hochreiter and Schmidhuber (1997), and Cho et al. (2014) developed recurrent neural
networks and made them feasible for large datasets. Recurrent neural networks have the
advantage that they can handle inputs of different size, for example time series of different
length. Additionally, such networks can couple the weights and create models with fewer
parameters to optimize. The scope of this chapter does not allow for a discussion of
the progress that enabled deep learning to reach the results and applications that it has
achieved today.
Table 13.1 This table summarizes all discussed variations of the standard AE. We briefly describe how they approximate the data manifold better than a standard AE and which form of regularization is used. We use 𝔏 to denote the standard loss function of the AE, and F(⋅) stands for the function fitted by the AE (i.e., F(x) = 𝜑(𝜙(x))).
Generalized Autoencoder (GAE) The GAE was introduced by Wang et al. (2014). A GAE is
based on the common assumption in dimensionality reduction that the data is forming a
lower-dimensional manifold in the input space. Its main goal is to reconstruct this man-
ifold and the relations between examples that it represents. This is achieved by training
the AE to not only reconstruct the input but a set of similar input samples. The loss for all
these reconstructions is weighted by the similarity of these samples to the original input as
measured by k-nearest neighbor distance or by all images belonging to the same class. The
authors argue that the GAE captures the structure of the dataset better than an AE because
the encoder is forced to map similar input examples closer on the manifold.
Relational Autoencoder (RAE) Presented in 2017 by Meng et al. (Meng et al. 2017), this AE is
designed to not only reconstruct the input, but additionally preserve the relation between
different samples in the dataset. While the loss function for the AE is
min𝜃 𝔏(X, F(X)) ,  (13.6)
where 𝜃 denotes all network parameters, the loss for the RAE contains an additional term that penalizes the difference between R(X) and R(F(X)). Here R is a measure of some relation between the data points (e.g., the variance) and F is the function realized by the AE. The authors claim that considering the relationships between different inputs decreases the reconstruction error and generates more robust features.
Sparse Autoencoder (SAE) The underlying idea of the SAE is to find a sound and meaningful
representation of the inputs that allows transferring knowledge between similar datasets.
The SAE was introduced by Deng et al. (2013). The standard loss function of the AE,
Equation 13.6, is enhanced by a term
𝜆 ∑j=1,…,m [ 𝜌 log(𝜌 ∕ 𝜌̂j) + (1 − 𝜌) log((1 − 𝜌) ∕ (1 − 𝜌̂j)) ]  (13.8)
that enforces sparsity (Lee et al. 2008). Here, 𝜆 is a constant to weigh the importance of spar-
sity compared to reconstruction performance. The parameter 𝜌 is a fixed sparsity level, for
example, 0.01. The dimension of the embedding is denoted by m and 𝜌̂j is the average acti-
vation of hidden unit j averaged over all inputs. Notice that in AEs, the compression effect
emerges from the bottleneck being much smaller than the input, while in SAE, the com-
pression is given by the redundant nature of sparse representations. Indeed, the bottleneck
of a SAE does not have to be smaller than the input size.
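The sparsity penalty of Equation 13.8 can be written down directly from the hidden activations; the NumPy sketch below assumes activations in (0, 1) (e.g., from sigmoid units) and illustrative values for 𝜌 and 𝜆.

```python
# Sketch of the KL-based sparsity penalty of Equation 13.8, added to the
# reconstruction loss with weight lam.
import numpy as np

def sparsity_penalty(hidden_activations, rho=0.01, lam=1e-3, eps=1e-8):
    """hidden_activations: array of shape (n_samples, m) with values in (0, 1)."""
    rho_hat = hidden_activations.mean(axis=0)                  # average activation per unit
    rho_hat = np.clip(rho_hat, eps, 1.0 - eps)
    kl = rho * np.log(rho / rho_hat) + (1.0 - rho) * np.log((1.0 - rho) / (1.0 - rho_hat))
    return lam * kl.sum()

# Example: 100 samples, 32 hidden units with mostly small activations.
h = np.random.uniform(0.0, 0.1, size=(100, 32))
print(sparsity_penalty(h))
```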
Denoising Autoencoder (DAE) The core assumption leading to a DAE is that the data is located
on a low dimensional manifold in the input space. Additionally, it is assumed that one can
only observe noisy versions of these inputs. To allow an AE to learn from noisy inputs and
identify useful features for classification, Vincent et al. (2010) introduced the DAE.
The DAE is related to the idea of data augmentation. For training, instead of the original
input X, a corrupted version X̃ is used. The target to reconstruct is the clean input X, which
has not been observed by the network. The goal of this training method is that the latent
space captures the areas of high probability, i.e., the data manifold, and maps inputs of low
probability onto these areas by deleting the noise.
While this method was used for feature extraction in the paper mentioned above, the
actual classification is trained using clean inputs only. Hence, denoising is different from
data augmentation. Interestingly, the authors compare a three-layer stacked DAE with a
network built of three stacked Boltzmann Machines and could not outperform the latter on
some datasets, showing that the main drawback of Boltzmann Machines is the complicated
training and not the resulting performance.
Contractive Autoencoder (CAE) To determine the data manifold in the input space, this AE
uses a penalty on the Frobenius norm of the Jacobian of the encoder. The idea was first
presented by Rifai et al. (2011). In comparison to the loss function of the AE (Equation 13.6),
the loss function of the CAE is given by
min𝜃 𝔏(X, F(X)) + 𝜆 ||JF (X)||2F .
The penalty term on the Jacobian matrix JF leads to the representation being robust to
small perturbations. It also enforces that the eigenvectors corresponding to large eigenval-
ues of JF point into directions of high variance of the data points, which indicates a more
meaningful representation.
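Assuming the encoder and decoder are differentiable Keras models such as those sketched earlier, the contractive penalty can be written with automatic differentiation as the squared Frobenius norm of the encoder Jacobian; the snippet below is an illustrative sketch, not the implementation of Rifai et al. (2011).

```python
# Sketch of the contractive penalty: squared Frobenius norm of the encoder Jacobian,
# added to the reconstruction loss with weight lam.  `encoder` and `decoder` are
# assumed to be Keras models like those sketched earlier in this section.
import tensorflow as tf

def contractive_loss(encoder, decoder, x, lam=1e-4):
    x = tf.convert_to_tensor(x, dtype=tf.float32)
    with tf.GradientTape() as tape:
        tape.watch(x)
        h = encoder(x)                                   # latent code, shape (batch, dim_h)
    jac = tape.batch_jacobian(h, x)                      # shape (batch, dim_h, dim_x)
    frobenius_sq = tf.reduce_sum(tf.square(jac), axis=[1, 2])
    recon = tf.reduce_sum(tf.square(x - decoder(h)), axis=-1)
    return tf.reduce_mean(recon + lam * frobenius_sq)
```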
Variational Autoencoder (VAE) In a VAE, the latent representation is a probability distribution rather than a single point; this latent Gaussian distribution is centered at zero with unit variance (identity) matrix Id , where d is the dimensionality of the compressed feature space. The VAE was first described by Kingma and Welling (2013).
We denote probability distributions by P(X) and the corresponding density functions
by lowercase p(X). The objective is to ensure two things. Firstly, the distribution of the
output data is intended to approximate the distribution of the input data and secondly,
to force the latent distribution to be multivariate standard normal. The first goal is addressed
by maximizing the likelihood of the inputs under the output distribution. The main prob-
lem is that such a distribution is unknown. The VAE approach solves this problem by using
the evidence lower bound, also called ELBO,
log p(x) ≥ 𝔼H∼q (log p(x, H)) + ℍ(H) = log p(x) − DKL (q(H) || p(H|x)) .
Here ℍ denotes the entropy of H and 𝔼H∼q (⋅) denotes the expectation over any density function q. The quality of this bound depends on the density function q. The better q(H) approximates p(H|x), the tighter the bound. Therefore, by finding a q such that the Kullback–Leibler divergence DKL (q(H) || p(H|x)) is minimized, one can push both
distributions to be close. The calculation of the ELBO is mostly an application of Jensen’s
inequality (see Yang (2017)).
The second goal is to force the latent distribution to be multivariate standard normal,
i.e., q = 𝒩(0, Id ). The VAE is trained as follows: The encoder maps from the origi-
nal distribution to the latent distribution (and not a representation as in a standard
AE). The reconstruction of the input is performed by sampling from that distribution
and feeding the decoder with that sample. Now the problem is that it is not possible
to calculate the derivative of the sampling process. This problem is solved by the so-called
reparametrization trick. Essentially, the encoder learns a distribution and the network
tries to determine the parameters of that distribution. Hence, the latent distribution is
restricted to be a factorized Gaussian, and the encoder only derives the mean 𝜇 and the
variance 𝜎 2 of every latent component. Then the latent sample h is obtained by sampling
h̃ from a standard normal distribution and computing
h = 𝜎 h̃ + 𝜇.
This representation allows derivatives for 𝜇 and 𝜎 to be calculated and backpropagation
to be used to train the network. An example of an architecture for a VAE can be seen
in Figure 13.2.
Figure 13.2 The architecture of a variational autoencoder. The main difference to an autoencoder
is the random sampling process in the center.
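The reparametrization trick can be sketched in a few lines of Keras, as below: the encoder outputs 𝜇 and log 𝜎², a latent sample is drawn as h = 𝜎 h̃ + 𝜇 with h̃ from a standard normal distribution, and the training loss combines a reconstruction term with the Kullback–Leibler divergence that pushes the latent distribution towards 𝒩(0, Id). Layer sizes are placeholders, and the way the two loss terms are combined is a common pattern rather than the specific implementation of Kingma and Welling (2013).

```python
# Illustrative sketch of the VAE reparametrization trick; all sizes are placeholders.
import tensorflow as tf
from tensorflow.keras import layers, models

n_input, d = 1024, 8                       # input size and latent dimensionality

class Sampling(layers.Layer):
    """Draw h = mu + sigma * h_tilde with h_tilde ~ N(0, I)."""
    def call(self, inputs):
        mu, log_var = inputs
        h_tilde = tf.random.normal(tf.shape(mu))
        return mu + tf.exp(0.5 * log_var) * h_tilde

inputs = layers.Input(shape=(n_input,))
x = layers.Dense(128, activation="relu")(inputs)
mu = layers.Dense(d)(x)                    # mean of every latent component
log_var = layers.Dense(d)(x)               # log-variance of every latent component
h = Sampling()([mu, log_var])              # differentiable with respect to mu and log_var
x = layers.Dense(128, activation="relu")(h)
outputs = layers.Dense(n_input)(x)
vae = models.Model(inputs, [outputs, mu, log_var])

def vae_loss(x_true, x_recon, mu, log_var):
    # Reconstruction term plus KL divergence of N(mu, sigma^2) from N(0, I_d).
    recon = tf.reduce_sum(tf.square(x_true - x_recon), axis=-1)
    kl = -0.5 * tf.reduce_sum(1.0 + log_var - tf.square(mu) - tf.exp(log_var), axis=-1)
    return tf.reduce_mean(recon + kl)
```

In a full implementation, vae(x) returns the reconstruction together with 𝜇 and log 𝜎², and vae_loss would be minimized in a custom training step.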
the highest variance. From this point of view, the nonlinear encoder and decoder of the AE
approximate this feature function, and hence optimizing an AE on data can be thought of
as finding the optimal kernel for a kernel-PCA.
Note that the search space of this optimization does not span all possible kernels. The
dimension of the latent space has to be decided before starting the optimization. This
implies that, in general, it is only possible to optimize over kernels with the given latent
dimension. For example, it is not possible to approximate the RBF-Kernel which has an
infinite latent dimensionality. If the encoder is a close approximation to the optimal kernel,
then the solution found to the dimensionality reduction problem is also optimal. Ham et al.
(2004) show how many standard dimensionality reduction methods can be understood as
a kernel-PCA.
13.3 Applications
In this section, we suggest several different applications that AEs can have in weather and
climate sciences, divided into the common uses of AEs. The applications are divided into
two main categories “Use of the latent space” (section 13.3.1) and “Use of the decoder”
(section 13.3.2). In the first, it is shown how the lower-dimensional latent variables can
be used to improve predictions and for knowledge extraction. In the second, we describe
the capabilities of a neural network based decoder for denoising, sampling generation, and
anomaly detection. Figure 13.3 synthesizes this classification.
Figure 13.3 Summary of the use of an AE for weather and climate. These can be divided in
(a) direct utilization of the latent space and (b) the use of the decoder function. (a) can be
subdivided into (a.1) to use extracted features for better understanding the encoded data and (a.2)
to use the latent representation for prediction. (b) can be subdivided into the use of the decoder
function for (b.1) generating new samples, (b.2) denoising, and (b.3) as an anomaly detector.
account for virtually all the variance among individuals….” In climate and weather sci-
ence, the first approach came from Edward N. Lorenz, who, in 1956, used PCA (in his work called Empirical Orthogonal Functions, EOFs) as a method for reducing the data
dimension before prediction (Lorenz 1956). Later on, to improve interpretability of these
lower-dimensional variables, some modifications to PCA were introduced, again, first in
the psychological (Kaiser 1958) and later in the climatology domain (Richman 1986).
Next to the methods used to visualize linear dependencies among the data, there have
been several approaches to capture nonlinear dependencies, motivated by the fact that
real-world data, and especially weather, often has nonlinear dependencies. One of the
leading methods is Kernel-PCA (Schölkopf et al. 1998), which uses the kernel-trick to
account for nonlinearities in the data. However, two main problems arise with Kernel-PCA:
First, the new space is often no longer interpretable, and, second, it is not easy to know
beforehand which kernel is suitable (Rasmussen 1999).
In this context, AEs can address both problems, finding a suitable kernel for dimension-
ality reduction and understanding the latent space. It is well known that AEs are able to
represent original data in a lower-dimensional space, often called hidden representation,
latent space, embedding, code, bottleneck, or probabilistic latent space (in case of the VAE).
One key feature of this mapping is that the samples that are close to each other in the latent
representation are also close in the original space. However, this property is not ensured by
a plain AE and this issue can be addressed with the introduction of additional loss terms.
The latent representation aims to be a good lower-dimensional description of latent
variables, and therefore one can use it for different purposes; the first, and most
straightforward one, is to predict future states of the system or classify the input data from
these extracted features, as is done with other dimensionality reduction methods like PCA.
In section 13.3.1.2 we show several examples. Another way to make use of these extracted
latent variables is to help scientists to understand the governing dynamics of the system
better.
Figure 13.4 An example of the results in Tibau et al. (2018). Plot of (a) , (b) the 1st PC for the
Lorenz ’96 dataset and (c) the 1st PC for Lorenz ’96 after applying SupernoVAE. E.V.: Explained
Variation by the first principal component.
Table 13.2 Summary of results in Tibau et al. (2018). The column Reconstruction shows the
coefficient of determination of the VAE reconstructions and the input time series, the columns 1st ,
2nd and 3rd PC show the coefficient of determination between the kth principal component and the
forcing pattern . 𝜃 stands for the ratio between the bottleneck dimension and the input dimension. The PC marked * is represented in Figure 13.4 (c).
As mentioned above, one of the main problems with this approach is that linearity in
the latent space is not ensured. Another interesting work that addresses this issue is Lusch
et al. (2018). Lusch et al. rely on the fact that the eigenfunctions of the Koopman operator
provide intrinsic coordinates that globally linearize the dynamics. Since the identification
and representation of such functions is complex for highly nonlinear dynamics, the authors
propose an AE for embedding it into a lower-dimensional manifold. The authors aim to
satisfy:
𝜙(Xk+1 ) = K 𝜙(Xk )  (13.9)
where k denotes the different states of the system over time. They expect 𝜙 to project the
data in a space where the Koopman operators are linear, that is, 𝜙 ∶ ℝn → ℝp and then,
Hk+1 = KHk , where n and p are the dimensions in the original and in the latent space,
respectively. To train 𝜙, a regular AE with the MSE loss function 𝔏1 = ||Xk − 𝜑(𝜙(Xk ))||22
is used. The final network is built with this encoder-decoder function, where the encoder
encodes Xk and the decoder decodes Hk+1 through a dense layer with weights K and a
linear activation between Hk and Hk+1 (Figure 13.5 shows a schematic view of the architecture used in Lusch et al. (2018)). To train the entire network they add two more losses: one for the linear dynamics
in the latent space, 𝔏2 = ||𝜙(Xk+1 ) − K𝜙(Xk )||22 , and the other for future state predictions, 𝔏3 = ||Xk+1 − 𝜑(K𝜙(Xk ))||22 .
As a proof of concept, the authors conducted three experiments. In the first, they used a
well-studied model with a discrete spectrum of eigenvalues. In the second, the system was
based on a nonlinear pendulum with a continuous spectrum of eigenvalues and increasing
energy. Finally, in the third experiment, they used a high-dimensional model of nonlinear
fluid flow. In all three cases, the authors achieved linear representations of the dynamics in
space.
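The training objective described above can be sketched as the sum of the three losses, assuming a Keras encoder, decoder, and a bias-free linear layer playing the role of K; the names and latent dimension below are placeholders.

```python
# Sketch of the three losses used to train the Koopman autoencoder described above:
# reconstruction (L1), linearity of the latent dynamics under K (L2), and prediction
# of the next state through K and the decoder (L3).  `encoder`, `decoder`, and the
# linear layer K are assumed Keras components with compatible shapes.
import tensorflow as tf
from tensorflow.keras import layers

p = 16                                   # latent dimension (placeholder)
K = layers.Dense(p, use_bias=False)      # dense layer with weights K and linear activation

def koopman_losses(encoder, decoder, x_k, x_k1):
    h_k, h_k1 = encoder(x_k), encoder(x_k1)
    l1 = tf.reduce_mean(tf.square(x_k - decoder(h_k)))          # L1: reconstruction
    l2 = tf.reduce_mean(tf.square(h_k1 - K(h_k)))               # L2: linear latent dynamics
    l3 = tf.reduce_mean(tf.square(x_k1 - decoder(K(h_k))))      # L3: future state prediction
    return l1 + l2 + l3
```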
The representation of the latent space using a dimensionality reduction method is also
explored in Racah et al. (2017). The goal was to detect and track extreme events, using an AE
as a method to extract features. The architecture proposed by the authors consisted of an AE
where a bottleneck connects to three CNNs to predict the type of event, its location, and the
confidence interval. As is often done with dimensionality reduction methods, the authors
use t-SNE to visualize the different representations of the original data in the latent space.
The resulting visualization showed the data grouped by the original categories, even if they
were unknown to the AE. The authors stress the importance of using these visualization
techniques to better understand weather data and how features interact with each other.
For a more detailed explanation of how to use deep learning for extreme event detection
and tracking, see Chapter 12.
Another example of the use of the latent space to understand data is found in Krinit-
skiy et al. (2019). There, the authors employed a sparse variational autoencoder (SVAE)
to cluster polar vortex states. The authors posit that the idea behind the use of an SVAE is that the variational inference ensures that similar examples in the original space are close in the latent space. The sparsity constraint enforces features in the latent space to be Bernoulli-distributed, allowing a clustering that produces more valuable results than
standard normal distributed latent spaces (see subsection 13.2.2). They also applied the
commonly used techniques for convolutional networks, such as transfer learning, deploy-
ing an encoder derived from a powerful network previously trained on ImageNet. In the
Table 13.3 Summary of results in Klampanos et al. (2018). Accuracy of the different methods used
to classify clusters of dispersion of a nuclear plume. Data input was as a function of km2 or density,
and for 1 or 3 likely origins.
Raw k-means 0.287 0.290 0.584 0.581 0.377 0.385 0.699 0.696
Shallow DAE 0.310 0.296 0.609 0.601 0.409 0.384 0.739 0.723
Deep DAE 0.301 0.303 0.599 0.616 0.406 0.393 0.727 0.727
Deep Conv AE 0.305 0.288 0.603 0.583 0.399 0.389 0.721 0.710
Deep MC Conv AE 0.266 0.301 0.559 0.602 0.359 0.416 0.675 0.745
PCA (16) 0.297 0.291 0.592 0.595 0.402 0.376 0.722 0.713
PCAT 0.294 0.251 0.584 0.568 0.394 0.328 0.702 0.683
Figure 13.6 Schematic view of the architecture used by Li and Misra (2017). In a first step, a VAE is trained on the results of NMR (T2); then the encoder is replaced by a fully connected layer that maps from standard measures of soil composition to the latent representation, and the decoder maps to T2'. In this way the network learns how to obtain the results of an NMR from another source.
standard measurements of fluids and mineral content to the main NMR features as previ-
ously represented by the VAE in the latent space. See the architecture used in Figure 13.6.
After the second step, training is complete and the model can generate the desired NMR T2
distributions from the fluids and mineral contents measurements.
Scher (2018) implemented an autoencoder-like network to emulate a simple General
Circulation Model (GCM). While the network he implemented does not attempt a
self-reconstruction of the inputs, it shares a similar bottlenecked architecture. He used
as input a complete set of atmospheric fields of the GCM and as target the set of fields
13.3 Applications 201
at a later time. Given the spatial nature of both inputs and outputs, he made use of a
convolutional encoder and a deconvolutional decoder. While one cannot expect such a
model to be able to generate cohesive time series that respect the physics of the system in
the long run, in his study, the network was able to generate similar time series.
Figure 13.7 Climate data is typically represented on a grid at different levels in both satellite data
and the output of numerical models. The spatial structure makes these datasets an ideal case for
the use of convolutional architectures. Unsupervised models such as spatio-temporal VAEs can
unravel climatic processes and different regimes by analyzing the latent representation of those
datasets.
and climate. For a more detailed overview of the use of deep learning for anomaly detection,
see Chapter 6.
the noise represents the prediction error, and the network corrects the forecast toward a
“clean” image. Using this scheme, the authors achieved state-of-the-art performance on
the forecasting of atmospheric rivers.
14
Deep Learning to Improve Weather Predictions
Peter D. Dueben, Peter Bauer, and Samantha Adams
Figure 14.1 Processes that influence weather and climate. The figure is reproduced from Bauer
et al. (2015).
insufficiencies of the model tend to grow linearly over time (Magnusson and Källén 2013)
and will eventually make a weather prediction useless even if the model could have been
started from perfect initial conditions.
Furthermore, the Earth is large and current computing resources restrict the resolution
that is available within models to a typical grid-spacing of around 10 km for global weather
predictions and around 2 km for limited area models. As illustrated in Figure 14.1, the
Earth is also complex, with many processes interacting with each other, which makes it
difficult to represent all important physical processes within numerical models. The indi-
vidual components of the Earth system will also show non-linear behavior and respond on
different timescales from seconds to decades. Due to the complexity of the underlying sys-
tem, forecast model software is also very complex, often with more than one million lines
of code. Even so, many of the important processes cannot be represented explicitly within
model simulations since they can either not be resolved due to the lack of spatial reso-
lution – the horizontal extent of clouds is, for example, typically smaller than 10 km – or
since the equations of the model components are unknown – for example for soil physics.
Sub-grid-scale processes need to be described by so-called parametrization schemes, which
need to be based on model fields at spatial scales that can be resolved within simulations.
Edward Lorenz has argued that the forecast horizon of weather predictions is limited
since the error doubling time (which is the time that it takes for an error to double in size
on average) reduces if smaller and smaller scales of the weather are considered (Lorenz
1969; Palmer et al. 2014). Increasing the resolution of a weather model by a factor of two increases the computational cost of the model significantly (a factor of two for each dimension in space and time). However, if the error doubling time reduces towards smaller scales, this results in ever smaller gains in the forecast horizon as resolution is increased. This may eventually cause a stagnation of improvements of weather predictions
as resolution of observations, DA and the forecast model is increased. However, the fact
that seasonal predictions are showing some predictive skill for weeks and months into the
future (Weisheimer and Palmer 2014) suggests that we are still far away from a limit of
predictability. That said, skill in seasonal predictions can partly be explained by the fact that the Earth system consists of a number of interacting components, with components such as
the ocean and land surface that act on slower timescales when compared to the atmosphere.
To be useful, weather forecasts are not only required to provide the most likely future
scenario (for example that there will most likely be 1 mm of precipitation in London on
Saturday) but are equally required to produce the probability of specific weather events
(for example the probability for precipitation of more than 2 mm, which may influence the
decision whether to take an umbrella). To get estimates of probability distributions for pre-
dictions taking into account all known sources of forecasting uncertainty, weather forecast
centers typically run ensemble simulations that perform a number of simulations in paral-
lel, with each simulation being perturbed by either a change in initial conditions and/or by
adding stochastic perturbations to the model simulation (Berner et al. 2017).
Once the model simulations are finished, model output needs to be post-processed and
disseminated. This typically requires selection and compression of model output. At the
European Centre for Medium-Range Weather Forecasts (ECMWF) model variables are typ-
ically stored with 16 bits per variable while the forecast model is using 64 bits per variable.
The forecast results are then distributed to end-users and will eventually reach the general
public. The data output is huge: the ensemble forecast of ECMWF produces more than
70 terabytes of output data during a single day.
This complex workflow needs to be completed end-to-end within about three hours to be
useful to the public audience since a weather prediction can only be useful if it is timely. To
run through the workflow processing as many observations as possible and with a model
at optimal resolution requires supercomputing facilities and therefore weather prediction
centers like ECMWF or the UK Met Office have supercomputers that rank within the fastest
50 supercomputers of the world1 .
To build DA and modeling frameworks that scale efficiently to peta-scale supercom-
puters requires a significant investment in software engineering infrastructure. Examples
are the Scalability Programme at ECMWF2 and the LFRic project at the UK Met Office
(Adams et al. 2019). International collaborations between the Weather and Climate
community across Europe and worldwide are also required (Lawrence et al. 2018). To
optimize models, the community is investigating a number of directions of research
including the development of new dynamical cores, the use of domain-specific languages
to improve portability, improvements of workflow management, and mixed precision
(Biercamp et al. 2019).
Next to the computing challenge, there is also a challenge to manage Earth system data.
This includes observations but also the output of weather models and distribution of fore-
cast data. ECMWF’s data archive is growing by 233 terabytes per day and has a total storage
of 210 petabytes of primary data3 . This amount of data will likely increase further and it
will be more and more challenging to process the data in a timely fashion and to make it
1 https://www.top500.org/
2 https://www.ecmwf.int/en/about/what-we-do/scalability
3 https://www.ecmwf.int/en/computing/our-facilities/data-handling-system
accessible to users, in particular since data storage capacity has not been growing at the
same pace as supercomputing processing power.
Despite these challenges, there have been significant improvements in weather forecasts
during the past decades for many reasons, including: an increase in the performance of
supercomputers, the increase in the number of observations that can be assimilated, the
increase in resolution of forecast models, improvements in the efficiency of models, and
the increase of complexity of modeling frameworks (Bauer et al. 2015).
As outlined in the previous section, the Earth system is complex and consists of a number
of components that show non-linear dynamics. Furthermore, a lot of data is available for
training (both observations and model data). Such an environment therefore offers a large
number of potential application areas for ML tools. In fact, applications for ML potentially
cover the entire workflow of numerical weather predictions as described in the following
section.
Figure 14.2 Workflow of weather prediction from observations to forecast dissemination and the impact of machine learning on the different workflow components (blue boxes). The figure is reproduced from Dueben et al. (2019). [Workflow stages shown include the forecast run, product generation, web services, dissemination via the internet, and the data handling system/archive; machine learning touch points include quality control, adaptive thinning, adaptive bias correction, error covariance statistics, surrogate model components, tangent-linear/adjoint models, data compression, adaptive information extraction (resolution, ensembles, features), and integration of downstream applications.]
spatio-temporal forecasting problems (Shi et al. 2015b; Heye et al. 2017; Zhang et al. 2017;
Foresti et al. 2019; Lebedev et al. 2019b). There are, however, challenges in applying these
techniques as they have been designed for quite different applications. Although initial
results are promising, ML solutions are not yet improving on traditional methods. More
research into probabilistic methods (in order to model the chaotic nature of convective evo-
lution), incorporating other sources of information (such as orography) and incorporating
physical mechanisms is needed (Prudden et al. 2020).
Data assimilation: There are many potential applications of ML to DA that could help to
make better use of the increasing volume and diversity of observational data, thus leading to more
accurate weather predictions. In fact, DA itself can be considered a machine
learning process. There are, for example, many similarities between the minimization performed
when training a neural network and the minimization used during 4DVar DA or in
applications of Kalman filters (Kalman 1960). However, standard methods often assume
that model dynamics and observation operators are linear and that probability distributions
are Gaussian. This is one area where newer ML techniques may be able to improve matters.
ML could be used to speed up the DA process in the same ways as suggested for the
forecast (see discussion of the Forecast model below), thus allowing for more time to spend
on model simulation (for example, adding in more science or increasing resolution). If it
is possible to emulate model components of the forward model using ML, it should also
be comparably simple to generate tangent linear or adjoint code for ML emulators which
are required for 4DVar DA. This exercise is usually difficult and consumes a significant
amount of a scientist’s time for complex parts of the non-linear forward model, and in
particular for the physical parametrization schemes. Neural network emulators could
therefore allow for more complex representations of the forward model as tangent linear
and adjoint code. However, whether this approach is applicable for complex networks
still needs to be shown as gradients of the tangent linear code may become too steep and
irregular for use in the 4DVar framework.
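As an illustration of why emulators make tangent-linear and adjoint code comparably simple to obtain, the sketch below uses automatic differentiation (here JAX) to derive both from a small stand-in emulator; the network and its sizes are hypothetical and do not correspond to any operational scheme.

```python
# A minimal sketch (hypothetical emulator) of how tangent-linear and adjoint
# versions of a neural-network emulator come essentially for free from
# automatic differentiation, here with JAX.
import jax
import jax.numpy as jnp

def emulator(params, x):
    """A small MLP standing in for an emulated parametrization scheme."""
    w1, b1, w2, b2 = params
    h = jnp.tanh(x @ w1 + b1)
    return h @ w2 + b2

key = jax.random.PRNGKey(0)
k1, k2, kx, kdx = jax.random.split(key, 4)
params = (jax.random.normal(k1, (10, 32)), jnp.zeros(32),
          jax.random.normal(k2, (32, 10)), jnp.zeros(10))

x = jax.random.normal(kx, (10,))       # model state (input to the scheme)
dx = jax.random.normal(kdx, (10,))     # perturbation of the input

# Tangent-linear model: directional derivative of the emulator at x.
y, dy = jax.jvp(lambda x: emulator(params, x), (x,), (dx,))

# Adjoint model: transpose of the tangent-linear, as needed in 4DVar.
y2, vjp_fun = jax.vjp(lambda x: emulator(params, x), x)
(adj,) = vjp_fun(dy)                   # gradient propagated back to the input
print(y.shape, dy.shape, adj.shape)
```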
Since conventional DA requires estimation of the error covariance matrix, ML could
be used to learn this matrix dependent on specific weather situations. Furthermore,
bias correction could be performed during the mapping from the spatial distribution in
the observation space to model space. ML could also be used to learn model bias when
comparing the tendency of the model with analysis increments (this is the forcing that is
pushing the model towards observations during DA). Some recent work (Poterjoy et al.
2017, 2019) shows that Monte Carlo methods can be used to create ‘local’ particle filters
without assumptions on the prior or posterior error distributions. Gilbert et al. (2010)
have looked at kernel methods and Support Vector Machines as an alternative to standard
Kalman filter and variational methods.
One promising application by Hall et al. (2018) uses Generative Adversarial Networks
(GANs; see chapter 3) to learn a direct mapping from GOES-15 satellite upper tropospheric water vapor observations to the total column precipitable water (pwat) variable in
the Global Forecast System (GFS) model. This shows that ML techniques can help when
the observed (sensed) quantities do not map exactly to model variables, in particular if the
relationships are non-linear. GANs have also been used by Finn et al. for DA in a Lorenz’96
model (the Lorenz’96 model is a dynamical system formulated by Edward Lorenz (Lorenz
1996) that often serves as toy model for atmospheric dynamics), where they concluded that
although the technique is much faster than standard methods, the error is about the same
and success is very dependent on the stability of GAN training. Another possible direction
is to combine ML with standard DA techniques. Moosavi et al. (2019) have, for example,
used ML for adaptive tuning of localization in standard ensemble Kalman filter methods. It
has recently been shown for the Lorenz’96 model that ML can be used within a DA frame-
work to either learn the equations of motion or to develop model emulators (Bocquet et al.
2019; Brajard et al. 2019).
Forecast model: In order to make use of accelerator technologies such as GPUs, and
potentially also programmable hardware such as FPGAs, many weather and climate cen-
ters have been working to port parts of their code bases to these types of hardware (Fuhrer
et al. 2018). To port an entire forecast model onto accelerators is a difficult exercise since it
requires specialist knowledge and very often code refactoring since traditional weather and
climate codes have not been designed with these hardware architectures in mind. Phys-
ical parametrization schemes within the model are particularly difficult since they often
comprise a very large fraction of the model code, are typically written by domain scien-
tists and are very heterogeneous regarding the code and underlying analytic structures. In
contrast, ML techniques such as Neural Networks are ideally suited to running on GPUs
and the use of dense linear algebra makes them efficient on almost all hardware. So if it is
possible to emulate some of the model components via neural networks they will be able
to run efficiently on GPUs. In order to do this, the original model would be run for a long
time and input/output pairs of a specific part of the model would be stored. In a second
step, the data pairs would be used to train a neural network which would eventually allow
replacement of the original model component within forecast simulations. This approach
has now been tested by a number of groups and results show great potential (Chevallier
et al. 1998; Krasnopolsky et al. 2010; Pal et al. 2019). As well as emulating parametrizations,
ML could also be used to develop parametrization schemes that improve on the schemes
currently used in forecast models. If, for example, neural network
emulators are trained from super-parametrized simulations, which use a two-dimensional
large eddy simulation model to mimic sub-grid-scale cloud features, the neural network
emulator may become more realistic in comparison to existing parametrization schemes,
in particular for cloud physics and convection (Brenowitz and Bretherton 2019; Gentine
et al. 2018; Rasp et al. 2018). For an emulator of the radiation scheme, the neural networks
could be trained from schemes that can represent three-dimensional cloud effects or a more
detailed treatment of gas optics, which are currently too expensive to be used as the default
option. Observations could also be taken as reference (Ukkonen and Mäkelä 2019). Another
example where a neural network has been used to emulate and improve on the representa-
tion in a model is ozone (He et al. 2019b).
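The emulation workflow described above can be sketched in a few lines: run a reference scheme to store input/output pairs, then fit a neural network to those pairs. The `reference_scheme` function and all sizes below are placeholders for illustration only, not an actual parametrization code.

```python
# A minimal sketch of the emulation workflow: run a reference scheme to collect
# input/output pairs, then fit a neural network to replace it in the model.
import numpy as np
from sklearn.neural_network import MLPRegressor

def reference_scheme(profiles):
    # Placeholder for an expensive parametrization (e.g. a column radiation scheme).
    return np.tanh(profiles @ np.linspace(-1.0, 1.0, profiles.shape[1]))

rng = np.random.default_rng(0)
X = rng.normal(size=(50_000, 60))          # e.g. 60-level column inputs
y = reference_scheme(X)                    # stored outputs of the original scheme

emulator = MLPRegressor(hidden_layer_sizes=(128, 128), max_iter=50)
emulator.fit(X[:40_000], y[:40_000])       # train on the stored pairs
rmse = np.sqrt(np.mean((emulator.predict(X[40_000:]) - y[40_000:]) ** 2))
print("emulation RMSE:", rmse)
```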
Finally, it will also be interesting to see whether weather and climate models will be able to
make use of the newer accelerators developed specifically for deep learning applications
that will be available on the next generation of supercomputers, e.g. in the form of tensor processing units (TPUs) or accelerators for low-precision matrix-matrix multiplications. Concerning the latter, a recent study has shown that the TensorCore accelerators of NVIDIA Volta
GPUs could, in principle, be used to calculate the most expensive kernel of spectral atmosphere models – the Legendre Transformation (Hatfield et al. 2019). The peak performance
applied both within a running model and as post-processing, and allow specific
weather regimes to be taken into account. ML approaches to uncertainty quantification in order to
produce better probabilistic forecasts are another possibility since many of the modern
deep learning models incorporate probabilistic mechanisms (e.g. Deep belief networks,
Variational Autoencoders, GANs) and the idea of quantifying predictive uncertainty in
NNs is already being considered in general ML research (Lakshminarayanan et al. 2017).
Wang et al. (2019a) have already explored a deep neural network ensemble approach to
weather forecasting that incorporates uncertainty quantification.
[Score-card entries shown: geopotential, temperature, and wind at 100, 250, 500, and 850 hPa; relative humidity at 200 and 700 hPa; 2 m temperature; 10 m wind; and significant wave height.]
Figure 14.3 Score-card for ensemble simulations at ECMWF (reproduced from Dueben et al. (2019)) that shows the differences between single precision
(SP) and double precision (DP) simulations. In a tightly linked system such as the Earth system, a change in one model component (such as
parametrization of deep convection) can have unforeseen effects on many components (such as surface temperature at the pole). If the score-card shows a
strong negative impact on any of the scores, it is unlikely that the change will be adopted for operational predictions.
to optimize global tendencies often results in compensating errors between the different
model components. If individual model components are replaced by ML tools that are,
for example, trained from observations or high-resolution simulations, this may generate
a degradation of forecast scores for some quantities due to un-compensated errors.
This is not a new problem caused by ML and is also a challenge for other approaches
that replace individual model components. Furthermore, it would be beneficial in the
long term to remove compensating errors from the model. However, it will still make it
difficult to introduce ML tools within complex prediction models.
● How can we ensure reliability in a changing climate? As the climate is changing,
new weather situations will happen and ML tools that are trained on the current climate
may fail if, for example, the Arctic is suddenly ice-free in summer. If model compo-
nents are based on physical understanding, they will be more likely to provide a reliable
response to an unforeseen weather situation.
● How can we ensure reliability when changing the design of parametrization
schemes? Weather and climate models grow and develop. It will therefore be impor-
tant that the tool-chain is adjustable. If, for example, a neural network emulator of the
radiation scheme is developed for the current operational setting, it will be essential that
the neural network will also be able to perform a proper emulation if vertical resolution
is increased or if the underlying conventional scheme is improved. At the moment, we
are still missing the confidence that this will be possible.
● How can we optimize hyper-parameters? Finding the optimal configuration of ML tools, and
in particular deep neural networks, requires the optimal setting of hyper-parameters
– such as the number of layers, the number of neurons per layer, the activation
function, the optimizer, the use of recurrent or convolutional networks, the number of
epochs, etc. Today, this optimization requires a very large number of trial-and-error
tests, and the result may change abruptly when one of the
hyper-parameters is changed. Having a high training cost and a comparably small cost for forecast
applications is, in principle, well suited to weather and climate predictions due to the critical
time window of forecasts. However, training costs may become prohibitive as data and
network sizes increase.
● How can we prepare supercomputers of weather centers to balance different
needs between ML and conventional models? Today, most weather and climate mod-
els are still unable to run on hardware accelerators such as GPUs despite their introduc-
tion in HPC in 2006. On the other hand, ML tools, and in particular neural networks,
are most efficient on GPUs. This will make the allocation of hardware difficult if a model
configuration is relying on both ML and conventional methods.
● How can we scale neural networks to more complexity? At the moment, most of
the ML solutions that are investigated are still not supercomputing applications but rather
networks that are trained on single GPUs with a couple of gigabytes of training data. For
challenging ML applications, such as global weather and climate predictions,
networks with millions of input and output parameters will be required, which is still not
possible.
● How can we prepare for different use of data, data mining and data fusion in
the future with more/larger data requests? Deep learning can make use of a nearly
unlimited amount of data and still improve the quality of its results. As more and more users
work with complex deep learning tools on both model output (for example,
from CMIP simulations) and observations, weather and climate prediction centers will need to
prepare for changing data requests and larger data needs.
● How can we embed user products within the operational forecast model and
how can we interface ML code with legacy code? To run ML tools live within fore-
cast models (as described above) may allow for a new family of use-cases for numerical
weather and climate predictions. However, it is still unclear how to couple conventional
models with new ML applications and how to make sure that load-balancing and the
critical time window for predictions remain unaffected. It is also surprisingly difficult to
use common libraries for ML, such as TensorFlow, within the framework of conventional
forecast models that are typically based on Fortran.
● How can we design good training data and labeled datasets? While there is an
enormous amount of Earth system data available, this data is not suitable for many ML
applications, mainly because the temporal frequency of data snapshots is not sufficient to resolve
the evolution of important dynamic features. If ML models are trained from observations,
these observations will often have biases between products, and satellite measurements
have only been available for a couple of decades. This may often be insufficient for training
of complex ML solutions. There is also a lack of large labeled datasets for many applica-
tions. If a feature detection algorithm is trained from data that a conventional algorithm
for feature detection has labeled, the usefulness of ML is reduced significantly. However,
there are first initiatives to generate large labeled datasets (Rasp et al. 2019).
● How should we train the next generation of weather and climate scientists?
Traditionally, numerical weather predictions have required skills in the domain sciences
of the different model components (for example Meteorology, Physics, Chemistry,
or Applied Mathematics) but also Software Engineering and HPC. The skill of ML
Engineering may need to be added into this mix in the future and it will become even
more important to foster collaborations across the borders of the individual domains.
predictions and a bottom-up approach that is investigating basic scientific questions within idealized systems. It will furthermore require the study of idealized equations to learn basic rules for ML applications in physical systems, the study of uncertainty quantification for ML tools, the development of benchmark problems to tie ML to weather and climate models as well as the development of scalable solutions that are ready for implementation on modern supercomputers.
[Figure: schematic contrasting a top-down approach that starts from weather and climate models (with elements such as dynamical cores, parametrization schemes for clouds, convection and radiation, code and data) with a bottom-up approach that starts from machine learning (with elements such as idealized and differential equations, uncertainty quantification, numbers of layers and neurons, normalization, and tools such as TensorFlow), and the elements connecting the two.]
15
Deep Learning and the Weather Forecasting Problem:
Precipitation Nowcasting
Zhihan Gao, Xingjian Shi, Hao Wang, Dit-Yan Yeung, Wang-chun Woo, and Wai-Kin
Wong
15.1 Introduction
Precipitation nowcasting refers to the forecasting of rainfall and other types of precipitation
up to 6 hours ahead (as defined by the World Meteorological Organization)1 . Since rainfall
can be localized and highly changeable, users of precipitation nowcasts typically demand to
know the exact time, location, and intensity of rainfall. It is therefore necessary to produce
precipitation nowcast products at very high spatial and temporal resolution in a timely
manner, typically on the order of minutes. The most important use of precipitation nowcasts is to support the operations of rainstorm warning systems managed by meteorological
services around the world. Rainstorm warning systems provide early alerts to the public,
disaster risk reduction agencies, government departments, in particular those related to
public security and works, as well as managers of infrastructure and facilities. Upon the
issuance of rainstorm warnings, these parties take action according to their own standard
operating procedures with a view to saving lives and protecting property. Precipitation nowcasting thus has a tremendous impact on various areas, from aviation services and public safety to people's daily lives.
For example, commercial airlines rely on precipitation nowcasting to predict extreme
weather events and ensure flight safety. On land, heavy rainfall can severely affect road
conditions and increase the risk of traffic accidents, which can be mitigated with the help of
precipitation nowcasting. For local businesses, the number of customers and their feedback
about a restaurant are largely related to the weather (Bujisic et al. 2019), especially the rain
rate. Thus, accurate and timely prediction of rainfall helps restaurants predict and adjust
their sales strategies. Therefore, the past years have seen an ever-growing need for real-time,
large-scale, timely, and fine-grained precipitation nowcasting (Xingjian et al. 2015; Shi et al.
2017; Lebedev et al. 2019b; Agrawal et al. 2019). Due to the inherent complexities of the
atmosphere and relevant dynamical processes, the problem poses new challenges to the
meteorological community (Sun et al. 2014).
Traditionally, precipitation nowcasting is approached by either Optical Flow (OF)-based
methods (Li et al. 2000; Reyniers 2008) or numerical methods (Weisman et al. 2008; Sun
et al. 2014; Benjamin et al. 2016). OF-based methods first estimate the flow field, which
represents the convective motion of the precipitation, with the observed weather data (e.g.,
the Constant Altitude Plan Position Indicator (CAPPI) radar echo maps (Douglas 1990))
and then use the flow field for extrapolation (Woo and Wong 2017). The numerical meth-
ods build mathematical models of the atmosphere on top of the physical principles such as
the dynamic and thermodynamic laws. Future rainfall intensities are predicted by numeri-
cally solving partial differential equations within the mathematical models. However, both
approaches have deficiencies that limit their success. The OF-based methods attempt to
identify the convective motion of the cloud, but they fail to represent cloud initiation or
decay and lack the ability of expressing strong nonlinear dynamics. In addition, the flow
field estimation step and the radar echo extrapolation step are separated, making it chal-
lenging to determine the best model parameters. Numerical methods can provide reliable
forecast but require meticulous simulation of the physical equations. The inference time of
numerical models usually takes several hours, and they are therefore not suitable for generating the fine-grained predictions required by precipitation nowcasting.
Recently, a new approach, deep learning for precipitation nowcasting, has emerged in
the area and shown promising results. Shi et al. (Xingjian et al. 2015) first formulated
precipitation nowcasting as a spatiotemporal sequence forecasting problem and proposed
a DL-based model, dubbed Convolutional Long Short-Term Memory (ConvLSTM), to
directly predict the future rainfall intensities based on the past radar echo maps. The
model is learned end-to-end with a large amount of historical weather data and performs
substantially better than the OF-based algorithm in the operational Short-range Warning of
Intense Rainstorms in Localized Systems (SWIRLS) developed by the Hong Kong Observatory
(HKO) (Li et al. 2000; Woo and Wong 2017). After this seminal work, researchers started
to explore DL-based methods for precipitation nowcasting and have built models with
state-of-the-art performance (Hernández et al. 2016; Shi et al. 2017; Qiu et al. 2017; Tran
and Song 2019; Lebedev et al. 2019b; Chen et al. 2019; Agrawal et al. 2019). In essence,
precipitation nowcasting is well-suited for DL for three reasons. Firstly, the problem
satisfies the big data requirement of DL. Vast amounts of weather data are generated
on a daily basis and can be used to train the nowcasting model. For example, at the National
Oceanic and Atmospheric Administration (NOAA), tens of terabytes of data are generated
in a single day (Szura 2018). Secondly, DL is suitable for modeling complex dynamical
systems (Goodfellow et al. 2016); a single-hidden-layer Multi-Layer Perceptron (MLP),
which is the most basic form of DL models, is a universal function approximator (Csáji
et al. 2001). Thirdly, the inference speed of DL models is faster than that of numerical meth-
ods (Agrawal et al. 2019). Moreover, in the inference stage, we can dynamically update the
DL model with the newly observed data (Shi et al. 2017), making the model more adaptive
to the emerging weather patterns.
In this chapter, we introduce the current progress of DL-based methods for precipita-
tion nowcasting. In section 15.2, we describe how to mathematically formulate precipi-
tation nowcasting as a spatiotemporal sequence forecasting problem. In section 15.3, we
review the high-level strategies for constructing and learning DL models for precipitation
nowcasting; because precipitation nowcasting requires predicting rainfall intensities for
multiple timestamps ahead, we introduce various strategies to learn such a multi-step fore-
casting model. In section 15.4.1 and section 15.4.2, we introduce the DL models in two
categories: Feed-forward Neural Network (FNN)-based models and Recurrent Neural Net-
work (RNN)-based models. In section 15.5, we describe the first systematic benchmark
of the DL models for precipitation nowcasting, the HKO-7 benchmark. We conclude this
chapter and discuss the potential future works along this area in section 15.6.
15.2 Formulation
There are two advantages of the IME approach: (i) the objective function in Equation 15.2
is easy to optimize because it only involves the one-step-ahead forecasting error
and (ii) we can predict for an arbitrary horizon in the future by recursively applying the
basic forecaster. However, there is an intrinsic discrepancy between training and testing in
IME. In the training phase, we use the ground-truths from t + 1 to t + i − 1 to predict the
regional rainfall at timestamp t + i, which is also known as teacher-forcing (Goodfellow
et al. 2016). While in the testing phase, we feed the model predictions instead of the
ground-truths back to the forecaster. This makes the model prone to accumulative
errors in the forecasting process (Bengio et al. 2015). Usually, the optimal forecaster for
timestamp $t+i$, which is obtained by maximizing $\mathbb{E}_{\hat{p}_{\mathrm{data}}}\left[\log p(\mathbf{X}_{t+i} \mid \mathbf{X}_{t-J+1:t}; \boldsymbol{\theta})\right]$, is not
the same as recursively applying the optimal one-step-ahead forecaster when the model
is nonlinear. This is because the forecasting error at earlier timestamps will propagate to
later timestamps (Lin and Granger 1994).
Direct Multi-Step Estimation The main motivation behind DME is to avoid the error drifting
problem in IME by directly minimizing the long-term prediction error. Instead of training a
single model, DME trains a different model p(Xt+i ∣ Xt−J+1∶t ; 𝜽i ) for each forecasting horizon
i, in which 𝜽i is the parameter. There can thus be L models in the DME approach. The
set of optimal parameters $\{\boldsymbol{\theta}_1^\star, \ldots, \boldsymbol{\theta}_L^\star\}$ can be estimated from the following optimization problem:
$$\boldsymbol{\theta}_1^\star, \ldots, \boldsymbol{\theta}_L^\star = \arg\max_{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_L} \mathbb{E}_{\hat{p}_{\mathrm{data}}}\left[\sum_{i=1}^{L} \log p(\mathbf{X}_{t+i} \mid \mathbf{X}_{t-J+1:t}; \boldsymbol{\theta}_i)\right] \tag{15.3}$$
To disentangle the model size from the number of forecasting steps $L$, we can also construct $p(\mathbf{X}_{t+i} \mid \mathbf{X}_{t-J+1:t}; \boldsymbol{\theta}_i)$ by recursively applying the single-step forecaster $p(\mathbf{X}_{t+1} \mid \mathbf{X}_{1:t}; \boldsymbol{\theta})$. In this case, the model parameters $\{\boldsymbol{\theta}_1, \ldots, \boldsymbol{\theta}_L\}$ are shared. For example, when the single-step forecasting model is deterministic and predicts $\tilde{\mathbf{X}}_{t+1}$ as $m(\mathbf{X}_{1:t}; \boldsymbol{\theta})$, we can obtain the second-step prediction by feeding in the predicted rainfall intensity, i.e., $\tilde{\mathbf{X}}_{t+2} = m\!\left(\mathbf{X}_1, \mathbf{X}_2, \ldots, \mathbf{X}_t, m(\mathbf{X}_{1:t}; \boldsymbol{\theta}); \boldsymbol{\theta}\right)$. By repeating the process $L$ times, we obtain the predictions $\tilde{\mathbf{X}}_{t+1:t+L}$. The optimal parameter $\boldsymbol{\theta}^\star$ can be estimated by minimizing the distance between the prediction and the ground-truth, i.e., $\boldsymbol{\theta}^\star = \arg\min_{\boldsymbol{\theta}} \mathbb{E}_{\hat{p}_{\mathrm{data}}}\, d(\tilde{\mathbf{X}}_{t+1:t+L}, \mathbf{X}_{t+1:t+L})$, in which $d(\cdot, \cdot)$ is a distance function. We need to emphasize here that this objective function directly optimizes the multi-step-ahead forecasting error and is different from Equation 15.2, which only minimizes the one-step-ahead forecasting error.
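A minimal sketch of this recursive multi-step forecasting is given below; `single_step` is a hypothetical placeholder (simple persistence) standing in for any trained one-step forecaster.

```python
# A minimal sketch of multi-step forecasting by recursively applying a shared
# single-step forecaster m(X_{1:t}; theta): each prediction is fed back as input.
import numpy as np

def single_step(context):
    # Placeholder forecaster: persistence of the last observed/predicted frame.
    return context[-1]

def rollout(context, horizon):
    """Predict `horizon` frames by feeding predictions back as inputs."""
    context = list(context)
    preds = []
    for _ in range(horizon):
        x_next = single_step(np.stack(context))
        preds.append(x_next)
        context.append(x_next)          # model output becomes the next input
    return np.stack(preds)

frames = np.random.rand(5, 480, 480)    # J = 5 observed radar frames
forecast = rollout(frames, horizon=20)  # L = 20 future frames
print(forecast.shape)
```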
Scheduled Sampling According to Chevillon (2007), DME leads to more accurate predic-
tions when (i) the model is misspecified, (ii) the sequences are non-stationary, or (iii) the
training set is too small. However, DME is more computationally expensive than IME. For
DME, if the $\boldsymbol{\theta}_i$ are not shared, we need to store and train $L$ models. If the $\boldsymbol{\theta}_i$ are shared,
we need to recursively apply the basic forecasting model for $O(L)$ steps (Chevillon 2007;
Bengio et al. 2015; Lamb et al. 2016). Both cases require larger memory storage or longer
running time than solving the IME objective. Overall, IME is easier to train but less accurate
for multi-step forecasting, while DME is more difficult to train but more accurate.
Scheduled sampling (SS) (Bengio et al. 2015) tries to bridge the gap between IME and
DME. The idea of SS is to first train the model with IME and then gradually replace
the ground-truths in the objective function with samples generated by the model itself.
When all ground-truth samples are replaced with model-generated samples, the training
objective falls back into the DME objective. The generation process of SS is described
in Equation 15.4:
$$\begin{aligned}
&\forall\, 1 \le i \le L:\\
&\tilde{\mathbf{X}}_{t+i} \sim p(\mathbf{X}_{t+i} \mid \hat{\mathbf{X}}_{t-J+1:t}, \mathbf{X}_{t+1:t+i-1}; \boldsymbol{\theta}),\\
&\mathbf{X}_{t+i} = (1 - \tau_{t+i})\,\tilde{\mathbf{X}}_{t+i} + \tau_{t+i}\,\hat{\mathbf{X}}_{t+i},\\
&\tau_{t+i} \sim \mathrm{Binomial}(1, \epsilon_k).
\end{aligned} \tag{15.4}$$
Here, $\tilde{\mathbf{X}}_{t+i}$ and $\hat{\mathbf{X}}_{t+i}$ are, respectively, the generated sample and the ground-truth at timestamp $t+i$; $p(\mathbf{X}_{t+i} \mid \hat{\mathbf{X}}_{t-J+1:t}, \mathbf{X}_{t+1:t+i-1}; \boldsymbol{\theta})$ is the basic single-step forecasting model. Meanwhile, $\tau_{t+i}$ is generated from a binomial distribution and controls whether to use the ground-truth or the generated sample, and $\epsilon_k$ is the probability of choosing the ground-truth at the $k$th iteration. In the training phase, SS minimizes the distance between $\tilde{\mathbf{X}}_{t+1:t+L}$ and $\hat{\mathbf{X}}_{t+1:t+L}$. In the testing phase, the $\tau_{t+i}$ are fixed to 0, meaning that the model-generated
samples are always used.
15.4 Models 223
SS lies in the middle ground between IME and DME. If $\epsilon_k$ equals 1, the ground-truths are
always chosen, and the objective function will be the same as in the IME strategy. If $\epsilon_k$ is 0,
the generated samples are always chosen, and the optimization objective will be the same
as in the DME strategy. In practice (Bengio et al. 2015; Wang et al. 2019b), 𝜖k is gradually
decayed during the training phase to make the optimization objective shift smoothly from
IME to DME, which is a type of curriculum learning (Bengio et al. 2009).
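The sketch below illustrates the scheduled-sampling mixing step of Equation 15.4 with a linearly decaying $\epsilon_k$; the schedule and array sizes are illustrative assumptions, not those of any particular paper.

```python
# A minimal sketch of scheduled sampling: at each step the next input is the
# ground-truth frame with probability eps_k and the model's own prediction
# otherwise, with eps_k decayed over training iterations (curriculum learning).
import numpy as np

rng = np.random.default_rng(0)

def eps_schedule(k, k_total=10_000):
    return max(0.0, 1.0 - k / k_total)       # linear decay from 1 (IME) to 0 (DME)

def ss_inputs(truth_frames, predicted_frames, k):
    """Mix ground-truth and generated frames for the next training step."""
    eps_k = eps_schedule(k)
    mixed = []
    for x_hat, x_tilde in zip(truth_frames, predicted_frames):
        tau = rng.binomial(1, eps_k)          # tau ~ Binomial(1, eps_k)
        mixed.append(x_hat if tau == 1 else x_tilde)
    return np.stack(mixed)

truth = np.random.rand(10, 64, 64)            # ground-truth future frames
preds = np.random.rand(10, 64, 64)            # model-generated frames
print(ss_inputs(truth, preds, k=2_000).shape)
```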
When applied to precipitation nowcasting, existing DL models adopt one of these three
learning strategies. We will introduce the detailed architectures of the two types of models
(FNN-based and RNN-based) in section 15.4.1 and section 15.4.2, and give an overview of the learning strategy that each
model uses in section 15.6.
15.4 Models
$C_i' = C_i \times J$. Notice that in this manner the channel number $C_i'$ is determined by the input length;
hence, unlike the RNN-based models, which will be explained in detail in section 15.4.2,
the input length must be fixed for the FNN-based models. The radar image at the next timestamp $\mathbf{X}_{t+1}$ is predicted by feeding the preprocessed input sequence $\mathbf{X}_{\mathrm{in}}$ into the FNN: $\mathbf{X}_{t+1} = f(\mathbf{X}_{\mathrm{in}}; \boldsymbol{\theta})$. To predict multiple steps ahead, the author adopted the IME strategy
by feeding the predicted radar image back to the network in the inference phase. The author also
compared different transformation techniques for preprocessing the 2D radar
images and used two radar images taken at a 5-minute interval as the input to predict the
next radar image.
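The preprocessing step described above amounts to stacking the $J$ input frames along the channel dimension, as in the following sketch (sizes are illustrative):

```python
# A minimal sketch of preparing inputs for an FNN-based nowcasting model: a
# sequence of J radar frames with C channels each is concatenated along the
# channel dimension, so C' = C * J and the input length must be fixed.
import numpy as np

J, C, H, W = 2, 1, 480, 480                  # e.g. two radar frames 5 min apart
frames = np.random.rand(J, C, H, W)

x_in = frames.reshape(J * C, H, W)           # stack time into channels: C' = C*J
print(x_in.shape)                            # (2, 480, 480) fed to a 2D CNN
```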
Agrawal et al. (2019) also concatenate the input images along the temporal dimension and
use the U-Net (Ronneberger et al. 2015a) architecture for prediction. U-Net combines
down-sampling, up-sampling, and skip connections to learn better hidden representations. Figure 15.1 illustrates how these building blocks are organized. The iterative
down-sampling part extracts more global and more abstract representations, and the
up-sampling part gradually refines the representation and adds the finer details to the
generated output. The skip connections help preserve high-resolution details and facilitate
gradient backpropagation. The authors used QPE data from the Multi-Radar Multi-Sensor
(MRMS) system (Zhang et al. 2016a) for training and testing the model. The input is a
sequence of radar images taken at a 2-minute interval over one hour and the output is the
sequence of radar images for the next several hours. Experiments show that the U-Net-based
DL model outperforms an OF-based model and NOAA's HRRR model (Benjamin
et al. 2016).
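The following PyTorch sketch illustrates the U-Net ingredients discussed above (down-sampling, up-sampling, and a skip connection); the layer sizes are illustrative and do not reproduce the architecture of Agrawal et al. (2019).

```python
# A minimal PyTorch sketch of the U-Net idea: down-sample to abstract features,
# up-sample back, and re-inject high-resolution detail via a skip connection.
import torch
import torch.nn as nn

class TinyUNet(nn.Module):
    def __init__(self, in_ch=6, out_ch=1):
        super().__init__()
        self.enc = nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.LeakyReLU())
        self.down = nn.MaxPool2d(2)                          # down-sample
        self.bottom = nn.Sequential(nn.Conv2d(32, 64, 3, padding=1), nn.LeakyReLU())
        self.up = nn.ConvTranspose2d(64, 32, 2, stride=2)    # up-sample
        self.dec = nn.Sequential(nn.Conv2d(64, 32, 3, padding=1), nn.LeakyReLU(),
                                 nn.Conv2d(32, out_ch, 1))

    def forward(self, x):
        e = self.enc(x)
        b = self.bottom(self.down(e))
        u = self.up(b)
        u = torch.cat([u, e], dim=1)           # skip connection preserves detail
        return self.dec(u)

x = torch.randn(1, 6, 256, 256)                # e.g. 6 stacked past radar frames
print(TinyUNet()(x).shape)                     # -> (1, 1, 256, 256)
```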
Figure 15.1 (a) The overall structure of the U-Net in Agrawal et al. (2019). Solid lines indicate input connections between layers. Dashed lines indicate skip connections. (b) The operations within the basic layer. (c) The operations within the down-sample layers. (d) The operations within the up-sample layers.
Figure 15.2 The dynamic convolutional layer. The input is fed into two sub-networks. The features are the result of sub-network A while the convolution filters are obtained from sub-network B. The final output of the dynamic convolution layer is computed by convolving the filters from sub-network B across the features from sub-network A.
Klein et al. (2015) designed the dynamic convolution layer to replace the
conventional convolution layer. Instead of using a data-independent filter, the dynamic
convolution layer generates both the feature maps and the filters from the input and con-
volves the filters with the feature maps to get the output. The feature maps and the filters
are obtained from the input with two sub-networks. Because the filters are dependent on
the input, they will vary from one sample to another in the testing phase. The author con-
catenates four radar images as the input to predict the next radar image. The author also
proposed a patch-by-patch synthesis technique which predicts a 10 × 10 patch in the output
from a sequence of 70 × 70 patches in the input. Figure 15.2 illustrates the workflow of the
dynamic convolution layer. Notice that this layer is different from the dynamic filter layer (Jia
et al. 2016) that is introduced in section 15.4.2. In the dynamic convolution layer, the filter
is shared across all locations in the input, whereas filters are adaptively selected per location in the dynamic
filter layer.
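A minimal PyTorch sketch of the dynamic convolution idea is given below: one sub-network produces feature maps, another produces input-dependent filters, and the two are convolved. All sizes and the pooling-based filter generator are illustrative assumptions, not the design of Klein et al. (2015).

```python
# A minimal sketch of a dynamic convolution layer: sub-network A yields feature
# maps, sub-network B yields per-sample filters, and the two are convolved.
import torch
import torch.nn as nn
import torch.nn.functional as F

class DynamicConv(nn.Module):
    def __init__(self, in_ch=4, feat_ch=16, k=3):
        super().__init__()
        self.k = k
        self.sub_a = nn.Conv2d(in_ch, feat_ch, 3, padding=1)           # features
        self.sub_b = nn.Sequential(nn.AdaptiveAvgPool2d(1), nn.Flatten(),
                                   nn.Linear(in_ch, feat_ch * k * k))  # filters

    def forward(self, x):
        feats = F.leaky_relu(self.sub_a(x))                     # (B, C, H, W)
        filt = self.sub_b(x).view(x.size(0), 1, -1, self.k, self.k)
        out = []
        for b in range(x.size(0)):                              # per-sample filter
            out.append(F.conv2d(feats[b:b+1], filt[b], padding=self.k // 2))
        return torch.cat(out, dim=0)

x = torch.randn(2, 4, 64, 64)        # four concatenated radar frames per sample
print(DynamicConv()(x).shape)        # -> (2, 1, 64, 64)
```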
Besides the radar images, satellite images are also commonly used as input in FNN-based
models. In Lebedev et al. (2019b), satellite images and observations from the Global Forecast
System (GFS) (Center 2003) are combined and used as the input. These two types of data are
of different modalities and are misaligned with regard to spatial and temporal resolution.
Thus, the authors remapped them onto the same spatial and temporal grid by interpolation.
Lebedev et al. (2019b) also applied the U-Net architecture.
Similar to Lebedev et al. (2019b), Hernández et al. (2016) and Qiu et al. (2017) also
deal with meteorological data from multiple modalities, including temperature, humidity,
wind speed, barometric pressure, dew point, etc. However, they do not consider the
spatial dimension of these data. FNNs with 1D convolution layers and FC layers are built
for the 1 × D input data. The weather nowcasting problem is formulated as learning a
deterministic mapping Yt+1 = f (Xt ) that maps the current meteorological observation
Xt to the precipitation at next step Yt+1 . Since the formulation has not fully utilized the
spatiotemporal structure of the data, we will not go into the details here.
Different from FNN-based models, which are designed for modeling inputs with static
shapes, RNN-based models are designed for modeling dynamic systems. In this section,
we introduce the RNN-based models for precipitation nowcasting. We first introduce the
encoder-forecaster structure which is the common approach for constructing RNN-based
models for spatiotemporal sequence forecasting. Then we introduce the Convolutional
LSTM (ConvLSTM) network (Xingjian et al. 2015), which combines the advantages of
CNN and RNN and is the first DL-based model for precipitation nowcasting. After that,
we introduce other RNN-based models like the ConvLSTM with star-shaped bridge (Cao
et al. 2019; Chen et al. 2019), Predictive RNN (PredRNN) (Wang et al. 2017d), Memory In
Memory (MIM) Network (Wang et al. 2019b), and the Trajectory GRU (TrajGRU) (Shi et al.
2017), which improve upon ConvLSTM in different directions.
Apart from the star-shaped bridge, the authors also insert Group Normalization (GN) (Wu
and He 2018) between ConvLSTM layers. An ablation study shows that the best performance
is obtained by combining ConvLSTM, the star-shaped bridge, and GN. Also, experiments on
4-year radar echo data from Shanghai, China showed that the learning-based model outperforms the conventional COTREC method (Chen et al. 2019).
Figure 15.5 Connection structure of PredRNN. The orange arrows denote the flow of the spatiotemporal memory $\mathcal{M}_t^l$.
zigzag order. For the bottom ST-LSTM with $l = 1$, the memory cell from the previous layer is
defined as $\mathcal{M}_t^{l-1} = \mathcal{M}_{t-1}^{L}$, which results in a zigzag update flow. Experiments show that PredRNN outperforms the ConvLSTM structure in precipitation nowcasting. The experiment
is conducted on a dataset with 10,000 consecutive radar observations recorded every 6 minutes in Guangzhou, China. 10 frames are used as the input to predict the future 10 frames.
Figure 15.6 ST-LSTM block (top) and Memory In Memory block (bottom). For brevity, $\mathbf{G}_t = \tanh(\mathbf{W}_{xc} * \mathbf{X}_t + \mathbf{W}_{hc} * \mathbf{H}_{t-1}^l + \mathbf{b}_c)$ and $\mathbf{G}'_t = \tanh(\mathbf{W}_{xm} * \mathbf{X}_t + \mathbf{W}_{hm} * \mathcal{M}_t^{l-1} + \mathbf{b}_m)$. MIM is designed to introduce two recurrent modules (yellow squares) to replace the forget gate (dashed box) in ST-LSTM. MIM-N is the non-stationary module and MIM-S is the stationary module.
Figure 15.7 The non-stationary module (MIM-N) and the stationary module (MIM-S), which are
interlinked in a cascaded structure in the MIM block. Non-stationarity is modeled by differencing.
Figure 15.8 Encoder-forecaster architecture adopted in Shi et al. (2017). Source: Shi et al. (2017).
where $\mathbf{S}$ and $\mathbf{N}$ denote the horizontally-transited memory cells in the non-stationary module (MIM-N) and the stationary module (MIM-S), respectively; the $\mathbf{D}_t^l$ are the differential features
learned by MIM-N; $\mathbf{T}_t^l$ is the memory passing the virtual “forget gate”. MIM-N is a ConvLSTM with $\mathbf{H}_t^{l-1} - \mathbf{H}_{t-1}^{l-1}$ as the hidden state input. MIM-S is a ConvLSTM with $\mathbf{D}_t^l$ as the
hidden state input. The detailed formulas are omitted here and readers can refer to Wang
et al. (2019b) for more details.
Here, $\mathbf{H}_t$, $\mathbf{R}_t$, $\mathbf{Z}_t$, $\mathbf{H}'_t \in \mathbb{R}^{C_h \times H \times W}$ are the memory state, reset gate, update gate, and new information, respectively.
As stated in Shi et al. (2017), when used for capturing spatiotemporal correlations, the
deficiency of ConvGRU and ConvLSTM is that the connection structure and weights are
fixed for all the locations. The convolution operation basically applies a location-invariant
filter to the input. If the inputs are all zero and the reset gates are all one, the author pointed
out that the calculation process of $\mathbf{H}'_t$ at a specific location $(i,j)$, i.e., $\mathbf{H}'_{t,:,i,j}$, can be rewritten
as follows:
$$\mathbf{H}'_{t,:,i,j} = f\left(\sum_{l=1}^{|\mathcal{N}_{i,j}^{h}|} \mathbf{W}_{hh}^{l}\, \mathbf{H}_{t-1,:,p(l,i,j),q(l,i,j)}\right), \tag{15.12}$$
in which $\mathcal{N}_{i,j}^{h}$ is the ordered neighborhood set at location $(i,j)$ defined by the hyper-parameters of the state-state convolution such as kernel size, dilation and padding (Yu and Koltun 2016). $(p(l,i,j), q(l,i,j))$ is the $l$th neighborhood location corresponding to position $(i,j)$.
Based on this observation, TrajGRU uses the current input and previous state to gen-
erate the local neighborhood set for each location at each timestamp. The detailed for-
mula is given in Equation 15.13. Here, $L$ is the number of allowed links, and $\mathbf{U}_t, \mathbf{V}_t \in \mathbb{R}^{L \times H \times W}$
are the flow fields that store the local connection structure generated by $\gamma(\mathbf{X}_t, \mathbf{H}_{t-1})$. The
$\mathbf{W}_{hz}^{l}, \mathbf{W}_{hr}^{l}, \mathbf{W}_{hh}^{l}$ are the weights for projecting the channels and were chosen as $1 \times 1$ convolutions in the paper. The $\mathrm{warp}(\mathbf{H}_{t-1}, \mathbf{U}_{t,l}, \mathbf{V}_{t,l})$ function selects the positions pointed out
by $\mathbf{U}_{t,l}, \mathbf{V}_{t,l}$ from $\mathbf{H}_{t-1}$ via the bilinear sampling kernel (Jaderberg et al. 2015; Ilg et al. 2017;
Shi et al. 2017).
$$\begin{aligned}
\mathbf{U}_t, \mathbf{V}_t &= \gamma(\mathbf{X}_t, \mathbf{H}_{t-1}), \\
\mathbf{Z}_t &= \sigma\Big(\mathbf{W}_{xz} * \mathbf{X}_t + \sum_{l=1}^{L} \mathbf{W}_{hz}^{l} * \mathrm{warp}(\mathbf{H}_{t-1}, \mathbf{U}_{t,l}, \mathbf{V}_{t,l})\Big), \\
\mathbf{R}_t &= \sigma\Big(\mathbf{W}_{xr} * \mathbf{X}_t + \sum_{l=1}^{L} \mathbf{W}_{hr}^{l} * \mathrm{warp}(\mathbf{H}_{t-1}, \mathbf{U}_{t,l}, \mathbf{V}_{t,l})\Big), \\
\mathbf{H}'_t &= f\Big(\mathbf{W}_{xh} * \mathbf{X}_t + \mathbf{R}_t \odot \sum_{l=1}^{L} \mathbf{W}_{hh}^{l} * \mathrm{warp}(\mathbf{H}_{t-1}, \mathbf{U}_{t,l}, \mathbf{V}_{t,l})\Big), \\
\mathbf{H}_t &= (1 - \mathbf{Z}_t) \odot \mathbf{H}'_t + \mathbf{Z}_t \odot \mathbf{H}_{t-1}.
\end{aligned} \tag{15.13}$$
The advantage of such a structure is that it could learn the connection topology by learning the parameters of the subnetwork $\gamma$. $\gamma$ has only a small number of parameters and
thus adds nearly no cost to the overall computation. Compared to a ConvGRU with $K \times K$
state-state convolution, TrajGRU is able to learn a more efficient connection structure with
$L < K^2$. For ConvGRU and TrajGRU, the number of model parameters is dominated by the
size of the state-state weights, which is $O(L \times C_h^2)$ for TrajGRU and $O(K^2 \times C_h^2)$ for ConvGRU. If $L$ is chosen to be smaller than $K^2$, the number of parameters of TrajGRU can also
be smaller than that of ConvGRU and the TrajGRU model is able to use the parameters more
efficiently. An illustration of the recurrent connection structures of ConvGRU and TrajGRU is
given in Figure 15.9.
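The warp operation in Equation 15.13 can be sketched with bilinear sampling as follows; the use of `grid_sample` and the pixel-unit flow convention are assumptions made for illustration.

```python
# A minimal sketch of warp(H, U, V): the flow fields U, V select positions from
# the previous state H_{t-1} via bilinear sampling (here with grid_sample).
import torch
import torch.nn.functional as F

def warp(h_prev, u, v):
    """h_prev: (B, C, H, W); u, v: (B, H, W) flow in pixel units."""
    B, C, H, W = h_prev.shape
    ys, xs = torch.meshgrid(torch.arange(H, dtype=torch.float32),
                            torch.arange(W, dtype=torch.float32), indexing="ij")
    # Shift the sampling grid by the learned flow and normalize to [-1, 1].
    gx = 2.0 * (xs.unsqueeze(0) + u) / max(W - 1, 1) - 1.0
    gy = 2.0 * (ys.unsqueeze(0) + v) / max(H - 1, 1) - 1.0
    grid = torch.stack([gx, gy], dim=-1)                    # (B, H, W, 2)
    return F.grid_sample(h_prev, grid, align_corners=True)

h_prev = torch.randn(2, 32, 48, 48)
u = torch.zeros(2, 48, 48)
v = torch.zeros(2, 48, 48)
print(torch.allclose(warp(h_prev, u, v), h_prev, atol=1e-5))  # zero flow = identity
```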
Figure 15.9 Comparison of the connection structures of convolutional RNN and trajectory RNN. For convolutional RNN, the recurrent connections are fixed over time; for trajectory RNN, the recurrent connections are dynamically determined. Links with the same color share the same transition weights (best viewed in color). Source: Shi et al. (2017).
15.5 Benchmark
Despite the rapid development of DL models for this problem, the way the models are evaluated
has some deficiencies. Firstly, the deep learning models are only evaluated on
relatively small datasets containing a limited number of data frames. Secondly, different models report
evaluation results on different criteria. As the needs of real-world precipitation nowcasting
systems range from indicating whether it will rain at all to issuing rainstorm alerts, a single criterion is not
sufficient for demonstrating an algorithm's overall performance. Thirdly, in the real-world
scenario, the meteorological data arrive in a stream and the nowcasting algorithm should
be able to actively adapt to the newly arriving sequences. Considering this online setting is
no less important than considering the offline setting with fixed-length input. In fact, as the
area of deep learning for precipitation nowcasting is still in its early stages, it is not clear how
models should be evaluated to meet the needs of real-world applications.
Shi et al. (2017) proposed the large-scale HKO-7 benchmark for precipitation nowcast-
ing to address these problems. The HKO-7 benchmark is built on the HKO-7 dataset containing
radar echo data from 2009 to 2015 near Hong Kong. Since the radar echo maps arrive in a
stream in the real-world scenario, the nowcasting algorithms can adopt online learning to
adapt to the newly emerging patterns dynamically. To take this setting into account, there
are two testing protocols in this benchmark: the offline setting in which the algorithm can
only use a fixed window of the previous radar echo maps and the online setting in which the
algorithm is free to use all the historical data and any online learning algorithm. Another
issue for the precipitation nowcasting task is that the proportions of rainfall events at dif-
ferent rain-rate thresholds are highly imbalanced. Heavier rainfall occurs less often but has
a higher real-world impact. Balanced Mean Squared Error (B-MSE) and Balanced Mean
Absolute Error (B-MAE) measures are thus introduced for training and evaluation, which
assign more weight to heavier rainfall in the calculation of MSE and MAE. An empirical
study showed that the balanced variants of the loss functions are more consistent with the
overall nowcasting performance at multiple rain-rate thresholds than the original loss functions. Moreover, training with the balanced loss functions is essential for deep learning
models to achieve good performance at higher rain-rate thresholds.
Using the new dataset, testing protocols, and training loss, seven models were
extensively evaluated, including a simple baseline model which always predicts the last
frame, two OF-based models (ROVER and its nonlinear variant), and four representative
deep learning models (TrajGRU, ConvGRU, 2D CNN, and 3D CNN). This large-scale bench-
mark for precipitation nowcasting is the first comprehensive benchmark of deep learning
models for the precipitation nowcasting problem.
rainfall levels in HKO-7 dataset. The thresholds 0.5, 2, 5, 10, 30 are selected to calculate
the CSI and Heidke Skill Score (HSS) (Hogan et al. 2010). For calculating the skill score
at a specific threshold 𝜏, which is 0.5, 2, 5, 10 or 30, the pixel values in prediction and
ground-truth are first converted to 0/1 by thresholding with $\tau$. Then calculate the TP
(prediction=1, truth=1), FN (prediction=0, truth=1), FP (prediction=1, truth=0), and
TN (prediction=0, truth=0). The CSI score is calculated as
$$\mathrm{CSI} = \frac{\mathrm{TP}}{\mathrm{TP} + \mathrm{FN} + \mathrm{FP}}$$
and the HSS score is calculated as
$$\mathrm{HSS} = \frac{\mathrm{TP} \times \mathrm{TN} - \mathrm{FN} \times \mathrm{FP}}{(\mathrm{TP} + \mathrm{FN})(\mathrm{FN} + \mathrm{TN}) + (\mathrm{TP} + \mathrm{FP})(\mathrm{FP} + \mathrm{TN})}.$$
During the computation, the masked noisy points are ignored.
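For concreteness, the following sketch computes CSI and HSS at a given threshold from a prediction, a ground-truth field, and a validity mask; the synthetic rain-rate fields are only for demonstration.

```python
# A minimal sketch of the CSI and HSS computation: threshold prediction and
# ground truth at tau, count TP/FN/FP/TN over valid pixels, apply the formulas.
import numpy as np

def csi_hss(pred, truth, tau, mask=None):
    valid = np.ones_like(pred, dtype=bool) if mask is None else mask
    p, t = (pred >= tau)[valid], (truth >= tau)[valid]
    tp = np.sum(p & t)
    fn = np.sum(~p & t)
    fp = np.sum(p & ~t)
    tn = np.sum(~p & ~t)
    csi = tp / (tp + fn + fp)
    hss = (tp * tn - fn * fp) / ((tp + fn) * (fn + tn) + (tp + fp) * (fp + tn))
    return csi, hss

pred = np.random.gamma(2.0, 5.0, size=(480, 480))    # synthetic rain rates [mm/h]
truth = np.random.gamma(2.0, 5.0, size=(480, 480))
for tau in (0.5, 2, 5, 10, 30):
    print(tau, csi_hss(pred, truth, tau))
```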
As shown in Table 15.1, the frequencies of different rainfall levels are highly imbalanced.
Using a weighted loss function helps solve this problem. Specifically, a weight $w(x)$ is assigned
to each pixel according to its rainfall intensity $x$:
$$w(x) = \begin{cases} 1, & x < 2 \\ 2, & 2 \le x < 5 \\ 5, & 5 \le x < 10 \\ 10, & 10 \le x < 30 \\ 30, & x \ge 30. \end{cases}$$
Also, the masked pixels have weight 0. The resulting B-MSE and B-MAE scores are computed as
$$\text{B-MSE} = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{480}\sum_{j=1}^{480} w_{n,i,j}\,(x_{n,i,j} - \hat{x}_{n,i,j})^2 \quad\text{and}\quad \text{B-MAE} = \frac{1}{N}\sum_{n=1}^{N}\sum_{i=1}^{480}\sum_{j=1}^{480} w_{n,i,j}\,|x_{n,i,j} - \hat{x}_{n,i,j}|,$$
where $N$ is the total number of frames and $w_{n,i,j}$ is the weight corresponding to the $(i,j)$th pixel in the $n$th frame. For the conventional MSE and MAE measures, all the weights are simply set to 1 except at the masked points.
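A corresponding sketch of the balanced losses is given below, with the per-pixel weights taken from the ground-truth intensity as defined above; the synthetic data are for demonstration only.

```python
# A minimal sketch of the balanced losses: per-pixel weights follow the rain-rate
# thresholds above, and B-MSE/B-MAE average the weighted errors over N frames.
import numpy as np

def rain_weight(x):
    # Weights computed from the ground-truth rain intensity (masked pixels get 0 via mask).
    w = np.ones_like(x)
    w[x >= 2] = 2
    w[x >= 5] = 5
    w[x >= 10] = 10
    w[x >= 30] = 30
    return w

def balanced_losses(pred, truth, mask=None):
    """pred, truth: (N, H, W) rain-rate fields; mask: 1 for valid pixels."""
    w = rain_weight(truth)
    if mask is not None:
        w = w * mask
    n = pred.shape[0]
    b_mse = np.sum(w * (truth - pred) ** 2) / n
    b_mae = np.sum(w * np.abs(truth - pred)) / n
    return b_mse, b_mae

truth = np.random.gamma(2.0, 5.0, size=(8, 480, 480))
pred = truth + np.random.normal(0, 1, size=truth.shape)
print(balanced_losses(pred, truth))
```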
training process, all models are optimized by the Adam optimizer with a learning rate of $10^{-4}$
and momentum of 0.5, with early stopping on the sum of B-MSE and B-MAE.
The ConvGRU model is also trained with the original MSE and MAE loss, which is named
“ConvGRU-nobal” in the paper (Shi et al. 2017), to evaluate the improvement obtained by training
with the B-MSE and B-MAE loss.
15.6 Discussion
In this chapter, we reviewed the DL-based methods for precipitation nowcasting. The
architecture, building blocks, training objective function, metrics, and data source of the
reviewed methods are summarized in Table 15.2. Precipitation nowcasting is formulated
as a spatiotemporal sequence forecasting problem from the machine learning perspective.
Thanks to increased computational power and growing amounts of data, the area is making
rapid progress. Machine learning, specifically deep learning, leverages the large amount
of available weather data and provides promising research directions for better modeling and
understanding of the precipitation nowcasting problem. Despite the success that DL-based
methods have achieved on precipitation nowcasting, the problem is still challenging. Below we
list several major future research directions that are unsolved or have not been fully explored:
Table 15.2 Summary of reviewed methods. The first half are FNN-based models and the second
half are RNN-based models.
fail to accurately model it and infer its future evolution. Multi-source data, in contrast,
provide multi-modal and multiscale meteorological information, giving the model a
more holistic view of the system. Therefore exploring DL models that are able to jointly
process complementary multi-source data can certainly help learn better representations
of the observing systems.
● Handling uncertainty
Precipitation nowcasting involves complex physical dynamics. According to chaos theory,
chaotic behaviors in a meteorological system make it unpredictable due to a high degree of
uncertainty. Learning to capture this internal uncertainty is one of the major challenges in
modeling and understanding the latent dynamics. However, most DL models address precipitation nowcasting in a deterministic manner, which averages all possible futures into a
single output without describing the whole distribution. For some application scenarios,
such as rainstorm alerts, not only the average and most likely futures are of concern;
possible extreme cases should also be considered. There are some recent works
(Xue et al. 2016; Babaeizadeh et al. 2018; Denton and Fergus 2018; Lee et al. 2018a) that
developed stochastic spatiotemporal models to predict different possible futures through
variational inference. Though stochastic spatiotemporal models have not yet been evaluated on
precipitation nowcasting tasks, they are promising potential solutions for handling uncertainty in precipitation nowcasting.
● Integration with numerical methods
Compared with theory-driven quantitative precipitation forecast (QPF) methods with
clear physical meanings, deep learning models are data-driven and typically suffer
from poor interpretability. Although theory-driven precipitation nowcasting models are
derived from physical theories, they are essentially phenomenological models built by
summarizing the empirical relationship of observations instead of deriving from first
principles, which means theory-driven models are not entirely different from but in
essence analogous to data-driven models. Theory-driven models consist of interpretable
components to describe the observed data, while keeping consistent with physical
laws including conservation of mass, momentum, energy, etc. They are determined by
human experts and are hence hard to adjust according to different data from different
distributions. On the contrary, data-driven models are equipped with high flexibility
to adapt to different datasets since they directly learn parameters from data under few
constraints. These two approaches are complementary in respect of interpretability
and flexibility. Integrating theory-driven and data-driven approaches provides new
opportunities in future precipitation nowcasting research, including but not limited to
model calibration, recognizing unidentified observations, etc.
Appendix
The deep learning precipitation nowcasting models introduced in this chapter have been
utilized to support the development of the operational rainfall nowcasting system, namely
SWIRLS (Short-range Warning of Intense Rainstorms in Localized Systems), of the Hong
Kong Observatory (HKO). In particular, TrajGRU has also been made available in the
community version of SWIRLS (a.k.a. Com-SWIRLS) as part of core components under the
Regional Specialized Meteorological Centre (RSMC) for Nowcasting of HKO, see https://
rsmc.hko.gov.hk. The rainfall nowcasting models including TrajGRU are shared with the
National Meteorological and Hydrological Services (NMHSs) of the World Meteorological
Acknowledgement
This research has been partially supported by General Research Fund 16207316 from the
Research Grants Council of Hong Kong.
16
Deep Learning for High-dimensional Parameter
Retrieval
David Malmgren-Hansen
16.1 Introduction
Various models in the Earth Sciences, e.g., related to climate, ecology, and meteorology, are
based on input parameters describing current and past states of different variables. The goal
of Earth Science models is to understand the relationships between their parameters by
describing the dynamic processes through which they vary. In this chapter we will refer to these
parameters as bio-geo-physical parameters.
To be able to update our models' parameters we need to feed them with measurements
from ground sensors (in-situ) or with values retrieved from satellite observations, i.e., parameter
retrieval. In-situ measurements often come with high cost, due to the installation of sensors
in remote places. Further, in-situ measurements often provide only sparse geographical
coverage, and the value of adding satellite observations, which provide frequent global
coverage, is therefore high.
One example of a bio-geophysical parameter is atmospheric temperature measured from
sensors or satellites, which is often used as input to a Numerical Weather Prediction model.
Many other examples of bio-geophysical parameters exist, and they can be grouped into:
● biological, e.g. leaf area index or other vegetation associated indices;
● physical, e.g. soil moisture indices, temperature or humidity, (see Figure 16.1);
● chemical, e.g. atmospheric trace gases, chlorophyll content in plants, hydrocarbon con-
centrations;
● geographical, land cover, sea ice cover, etc.
The task of retrieving parameters from observations is most commonly associated with find-
ing functions that map observations to the parameter values by learning the inversion of a
model that accounts for effects such as atmospheric distortion, geometry of the observation,
and different noise sources. This is called the inverse problem because we map from effect
(radiometric measurement) to cause (bio-geophysical parameter) rather than the opposite
forward problem performed in e.g. radiative transfer models (RTMs). The inversion prob-
lems can be multi-modal or ill-posed, hence the mapping between observations and target
parameters could have several or infinitely many solutions (Tsagkatakis et al. 2019). Further, the
Figure 16.1 Model forecast parameters at the surface, extracted at the measurement
positions of the IASI sensor on board the MetOp satellite series, plotted over two days of orbits.
retrieval task can be based on high-dimensional sensor measurements, or the target parameter can consist of several variables, which leads to statistical challenges when looking for
significant mapping coefficients.
It is important to note that the difference between parameter retrieval and, more generally,
prediction of a target variable from data relates to how we use the predictions. For forecasting applications where satellite-retrieved parameters are used, we often consider dense
predictions over a certain geographical area. Hence, predicting the crop types for a number
of fields is not necessarily to be considered parameter retrieval. If, on the other hand, we
apply our crop predictor's output in a model to understand agricultural parameters' effect
on the surrounding environment, we would be solving a parameter retrieval problem.
Deep learning (DL) has the potential to play an increasing role in the future of
bio-geophysical parameter retrieval from remote sensing data. The data can be
high-dimensional and the relationships between measurements and the parameters are
often highly non-linear. These relationships can also have dependencies in the time, space, and
spectral dimensions. DL offers a flexible framework to model such complex problems, and
the data we can use to train statistical models keep growing. Despite the flexibility of deep
learning, challenges must be overcome in order to extend its use further than today. These
challenges include:
16.2.1 Land
Land-based parameter retrieval often concerns bio-chemical parameters such as vegetation
indices but can also include physical parameters such as LST. Deep Learning was used to
retrieve LST from Microwave Radiometer data in Tan et al. (2019b), and the method was
tested on reference data from both ground stations and other optical satellite data with good
results.
Bio-chemical related retrieval often concerns indices such as Leaf-Area-Index (LAI)
or Leaf-Chlorophyll-Content (LCC). Verrelst et al. (2015) provides an overview of
bio-geophysical parameter retrieval and methods. Artificial Neural Networks (ANNs) are
part of the standard toolbox for parameter retrieval, yet mostly confined to the use of
shallower architectures (Verrelst et al. 2015). Simple ANNs with multiple layers (>1 hidden
layer) can be considered part of the deep learning methods, but recent trends in deep
learning research go towards the use of neural network variants such as Convolutional
Neural Networks or Recurrent Neural Networks, where depth is a common and powerful
characteristic.
Research in farming applications also relates to biological parameter retrieval. Often
though, the goal is not to use predictions as parameters in models, but as proxies for
the health condition of the crops in so-called smart farming applications. By monitoring
and optimizing these vegetation indices the goal is to increase crop yield. The variables
of interest, such as crop type, crop yield, soil moisture, and weather variables, can also
be used to model and understand the ecosystems that farming affects (Kamilaris and
Prenafeta-Boldú 2018). As opposed to biological parameter retrieval applications, DL is
frequently used in farming applications, though most often it is applied to datasets covering
only smaller regions of agricultural areas. Some country-level work on agriculture
has been done, e.g., for corn crop yield (Kuwata and Shibasaki 2016), but little work
exists on larger-scale studies where predictions could be used in models. Kim et al.
(2017a) provide a comparison of several artificial intelligence methods on a case study in
the midwestern USA.
Forest cover and biomass are other types of biological parameters which are of high importance for understanding and monitoring the Earth. Martone et al. (2018) addresses forest cover with fuzzy clustering machine learning techniques to create global forest maps in 50 × 50 m grids. DL has also been applied to this problem, although mostly at continental scale,
for example by Ye et al. (2019) where a Recurrent Neural Network (RNN), based on the
Long-Short-Term-Memory (LSTM) architecture, outperformed the other Machine Learn-
ing methods in their benchmark. Khan et al. (2017) models forest dynamics over a 28-year
period by stacking time-series and formulating the task as a change-classification problem.
Zhang et al. (2019) predicts forest biomass from Lidar and Landsat 8 data, with a com-
parison between four different Machine Learning models. The best performing model is a
DL model based on the concept of Stacked Sparse Autoencoders (SSAE). This model likely
performs well due to the unsupervised pretraining.
16.2.2 Ocean
Sometimes deep learning can be beneficial in hybrid approaches where the model acts as a post-processing step for physics-based retrieval. This is the case in Krasnopolsky et al. (2016), where a Neural Network works on physically retrieved ocean color chlorophyll to fill the gaps in data from NOAA’s operational Visible Infrared Imaging Radiometer Suite (VIIRS). This approach can be seen as a general image post-processing technique, but it is a clever way to exploit NNs’ abilities to learn patterns, which is necessary to perform well in gap filling. A similar approach was applied to microwave radiometer parameters in Jo et al. (2018) to retrieve chlorophyll-a content in oceans. Direct ocean parameter retrieval has also been done,
e.g. from hyperspectral satellite images in Awad (2014), or from ground station measured
water quality with temporal modeling by an LSTM in Cho et al. (2018). Hyperspectral data
can also be modelled with a CNN, a more advanced deep learning technique that includes
spatial information, to extract chlorophyll (CHL), colored dissolved organic matter (CDOM), and total suspended sediments (TSS) (Nock et al. 2019).
16.2.3 Cryosphere
Typical bio-geophysical parameters in cryospheric studies can be sea ice cover/
concentrations (SIC), sea ice types (SIT), snow depth, snow water equivalent, and
snow cover. Deep learning has been used for predicting SICs in Wang et al. (2016, 2017c);
Karvonen (2017); Malmgren-Hansen et al. (2020) and for distinguishing SITs in Boulze
et al. (2020). Both SIC and SIT are parameters used in climate models and models that
forecast ice drift for marine users.
Snow cover applications of various kinds have been explored with ANN and DL techniques, e.g. in Tsai et al. (2019); Çiftçi et al. (2017); Nijhawan et al. (2019b); Gatebe et al. (2018). Çiftçi et al. (2017) and Dobreva and Klein (2011) used physically derived parameters, NDSI and NDVI, as inputs to ANNs together with optical satellite data, which performs well compared to models working on optical satellite data alone. Snow depth and snow water equivalent have long been retrieved by ANNs, especially from microwave radiometer data (Davis et al. 1993; Tanikawa et al. 2015) and sometimes with auxiliary terrain information to correct for various effects (Bair et al. 2018).
observe, which emits or reflects a certain spectral signature. If we plot our parameters on a map we see the spatial dependencies as neighborhood correlation, i.e. patterns can be observed. An example can be seen in the surface temperatures in Figure 16.1(a), where the poles exhibit colder temperature regions while equatorial locations are warmer.
The spatio-spectral-temporal dependencies can be quantified by measures such as covari-
ance/correlation, mutual-information, multi-information or the Rotation based iterative
Gaussianization (RBIG) (Laparra et al. 2011). The RBIG method is used in Laparra and
Santos-Rodríguez (2015); Johnson et al. (Submitted) to find the optimal balance between
spatial and spectral neighborhood samples.
The temporal dependencies occur as the processes we observe change over time, e.g. sea-
sonal vegetation changes. If we measure frequently enough compared to the time-constant
of the temporal development we will be able to observe and potentially model this depen-
dency.
Example: Let us consider an example of monitoring sea ice in the Arctic. If we take two images a few days apart over the same location, covering a large area with land-fast ice, it is most likely that the images will overall look the same, and we have time dependency in our samples. This is opposed to zooming in on a small area near the edge of the land-fast ice, where the ice moves fast due to sea currents and wind and the whole image would change in the course of a day.
Figure 16.2 Three ways of modelling spatial, spectral and temporal relationships. (a) Cubic
convolutions over space and frequencies and stacking the n time steps. (b) Stacking frequency, B,
and time, T, and performing 2D convolutions over spatial dimensions (X, Y). (c) Hybrid approach of
combining CNN to handle space and frequencies, (X, Y, B), with an RNN to handle time T.
Originally, the concept was developed for visual recognition problems with time-varying inputs by
Donahue et al. (2015).
It is important to note that when stacking channels, e.g. (B, T), the order of them is not
taken into account by the model. Cubic convolutions or sequential learning with an RNN,
on the other hand, assume the order to follow the natural order of the samples.
One could have input data with three spatial dimensions (X, Y , Z) although this is
not commonly seen in RS research. Here, Z could be an altitude of the measurement.
This would require a cubic convolution over the three spatial dimensions and either
stacking of time sequences, or handling time with an RNN hybrid approach.
A practical problem arises on most computing platforms with increasing dimensions. The memory requirements increase drastically, and a trade-off between the size of each dimension per data sample and the number of dimensions must be made. Typically, DL research crops or resizes 2D data into samples of around 250 × 250 pixels and 3D data into 96 × 96 × 96 voxels, as described in Yang et al. (2009), for typical GPU-based computing platforms.
Table 16.1 Summary of CNN model used in the retrieval of atmospheric temperatures in
Figure 16.4. The amount of Product-Sum Operations (PSops ) is given per layer in the fourth column.
data type size (DS) in bytes. For the above example with batch size 32 and data stored in float32 (4 bytes per number) we have 32 × 15 × 15 × 125 × 4 = 3.6 MB per batch. Now, with an increase of the spectral dimension to 4699, we have 32 × 15 × 15 × 4699 × 4 = 135.3 MB per batch. A chunk size of 135.3 MB can cause too large delays when transferring data to a GPU, resulting in non-optimal utilization of GPU cores. Whether this is a problem in practice, though, depends on the platform, the amount of training data, how many epochs are necessary, and how many weights the network has.
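As a quick sanity check of these numbers, the memory of one input batch can be computed directly from its dimensions and the data type size; the small helper below (its name and structure are ours, for illustration only) reproduces the two figures above.

```python
def batch_megabytes(batch_size, height, width, channels, dtype_bytes=4):
    """Approximate size of one input batch in MB (float32 uses 4 bytes per value)."""
    return batch_size * height * width * channels * dtype_bytes / 1e6

print(batch_megabytes(32, 15, 15, 125))   # ~3.6 MB with 125 MNF components
print(batch_megabytes(32, 15, 15, 4699))  # ~135.3 MB with the full 4699 channels
```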
As previously discussed, one might consider spatial and spectral dimensions a data cube
and perform cubic convolutions to exploit a combination of spatio-spectral feature extrac-
tion. This, however, comes at a higher computational cost. The extension of Equation 16.1
would be straightforward,
PSops = (f𝑤 × fh × fd ) × Ninp × (K𝑤 × Kh × Kd ) × Nout , (16.2)
with fd being the filter-depth and the Kd the output cube depth. A consequence of consid-
ering the spectral dimension in the convolutions for the above IASI example is that Ninp
becomes 1, so for a layer with a 3 × 3 × 3 filter kernel and 100 output nodes we have
PSops = (3 × 3 × 3) × 1 × (15 × 15 × 4699) × 100 = 2,854,642,500 ≈ 2.8G (16.3)
for a single convolutional input layer. An alternative to cubic convolutions is the use of depth-wise separable convolutions. This method consists of first applying 2D spatial filter kernels to each input channel separately and then applying Nout 1 × 1 × fd depth-wise filters to combine the spatially extracted information. This splits our product-sum formula into two terms, and it reduces Equation 16.3 to
PSops = (f𝑤 × fh ) × Ninp−chan × (K𝑤 × Kh ) + fd × Ninp−chan × Nout−nod (16.4)
For our example, PSops = (3 × 3) × 4699 × (15 × 15) + 3 × 4699 × 100 = 10,925,175 ≈ 11M,
which is a drastic decrease in product-sum operations compared to the 2.8 billion for the
cubic case.
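A small sketch, assuming the per-layer operation counts of Equations 16.2 and 16.4 (function and variable names are ours), reproduces the two counts for the IASI example:

```python
def psops_cubic(fw, fh, fd, n_inp, kw, kh, kd, n_out):
    """Product-sum operations for a cubic (3D) convolutional layer, Equation 16.2."""
    return (fw * fh * fd) * n_inp * (kw * kh * kd) * n_out

def psops_separable(fw, fh, fd, n_inp_chan, kw, kh, n_out):
    """Product-sum operations for a depth-wise separable layer, Equation 16.4."""
    return (fw * fh) * n_inp_chan * (kw * kh) + fd * n_inp_chan * n_out

# IASI example: 15 x 15 spatial patch, 4699 spectral channels, 100 output nodes.
print(psops_cubic(3, 3, 3, 1, 15, 15, 4699, 100))   # 2,854,642,500  (~2.8 G)
print(psops_separable(3, 3, 3, 4699, 15, 15, 100))  # 10,925,175     (~11 M)
```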
In Equation 16.5 the n’th sample xn is modelled by our neural network output y(xn , w) with the weights w, and the target tn given an input sample is assumed to be Gaussian distributed with an xn-dependent mean. The output y(...) should be a linear projection, i.e. have no output activation function.
When using an MSE loss function we assume Gaussian distributed prediction errors and that the optimal solution can be found by minimizing the conditional average of this error. This might not be true for all problems, though. Problems of multi-modal character or ill-posed inversion problems will not follow the assumption that the optimum solution is the minimization of Eq. 16.5. To address this, Bishop (2006) suggests a variant of neural networks called Mixture Density Networks (MDN). MDNs have an alternative formulation of the loss function over a space of continuous values, predicting a probability distribution parameterized by a Mixture Model, with e.g. Gaussian kernels. Instead of predicting a specific value, as an MSE-trained network does, one can sample from the probability distribution predicted by an MDN.
For Bernoulli distributed targets (tn ∈ [0, 1]), such as how we would encode categorical problems, the maximum likelihood solution becomes the cross-entropy error function:
E(w) = − ∑_{n=1}^{N} ∑_{k=1}^{K} t_{nk} log y_k(x_n , w). (16.6)
For the output activation in this case, we need a canonical link that fulfills y ∈ [0, 1]; hence the activations should be mapped to probabilities by the logistic sigmoid for the binary case, or by its multi-class extension, softmax, when the output belongs to one of several classes.
It can be worth considering reformulating problems from one loss function to the other. Oord et al. (2016) saw improvements in building a Text-To-Speech model based on Neural Networks that predicted the wave signals directly by a probability distribution over their discretized values. In order not to have too many classes, K, and thereby make the problem intractable, the authors transformed the waveform with a 𝜇-law algorithm to 255 discrete values rather than the 65,536 values necessary to represent the full 16-bit range. The authors in Oord et al. (2016) describe the advantages of this modeling approach thus:
One of the reasons is that a categorical distribution is more flexible and can more easily
model arbitrary distributions because it makes no assumptions about their shape.
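For context, μ-law companding follows a standard formula and is not code from the cited work; a minimal sketch of how a waveform in [−1, 1] can be mapped to a small set of integer classes looks as follows.

```python
import numpy as np

def mu_law_encode(x, mu=255):
    """Compress a waveform in [-1, 1] with mu-law companding and quantize to integer classes."""
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    return np.rint((compressed + 1) / 2 * mu).astype(np.int64)

x = np.sin(np.linspace(0, 2 * np.pi, 8))   # toy waveform
print(mu_law_encode(x))                    # integer class indices, suitable for a CCE loss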
In Wang et al. (2017c) the authors model concentrations of sea ice in the Gulf of St. Lawrence and the Newfoundland Sea, Canada, with a convolutional neural network, feeding it Synthetic Aperture Radar images from the RADARSAT-2 sensor. The authors choose to model the fractions of sea ice in square blocks of 18 × 18 km with a least square error function, but a categorical error function could have been used as well. In the simplest case of modeling sea ice concentrations with a cross-entropy error, we could set the target equal to the percentage of sea ice in each block and have a model with one output that predicts this concentration.
Establishing the relationships between a measured IASI spectrum and e.g. the temper-
ature at a certain altitude, is often a statistical under-determined problem. Due to this,
dimensionality reduction is often applied as the first step (Pellet and Aires 2018).
IASI – Dataset The Infrared Atmospheric Sounding Interferometer (IASI) on board the MetOp satellite series measures the infrared spectrum with high resolution (Malmgren-Hansen et al. 2020). The ground footprint resolution of the instrument is 12 km at nadir, and the spectral resolution is 0.25 cm−1 over the spectrum between 645 cm−1 and 2760 cm−1. This results in 8461 spectral samples, covering a 2200 km scan-swath with 60 points per line. IASI is an ideal instrument for monitoring different physical/chemical parameters in the atmosphere, e.g. temperature, humidity, and trace gases such as ozone. Energy emitted from different altitudes appears at different spectral positions; in this way atmospheric profiles can be obtained, and these provide important information for e.g. NWP models. In the IASI dataset, channel selection has been performed, reducing the spectral components from the original 8461 channels to 4699 before any further processing. For statistical modeling of atmospheric parameters, forecast models are used to provide dense target values for every point observed by IASI. The IASI dataset has been matched with forecasts from the European Centre for Medium-Range Weather Forecasts (ECMWF) model. This includes temperatures, humidity, and ozone concentrations for 137 altitudes through the atmosphere. EUMETSAT, which operates the MetOp satellites, uses forecasting data together with IASI measurements to provide derived products. These retrievals are validated both against the forecasts and against in-situ measurements from e.g. radiosondes. Temperatures can be derived down to 1 K accuracy, humidity to within 10%, and ozone to within 5%, all at 25 km ground resolution. For the following experiments, 13 orbits from 17-08-2013 were used, with the first 7 for training and the rest for testing. In situ measurements were not used for validation, as this is a relative comparison of the performance between a CNN and the traditionally used Ordinary Least Squares (OLS) regression model.
Figure 16.3 Input: decomposed IASI spectrum using the MNF transform (260 components). The x-axis is the along-track orbital direction and the y-axis the across-track direction; the z-axis represents the spectral MNF components. The input cube is sliced in the corner to illustrate the rapidly decreasing color intensity of the sorted MNF components, as most of the information is compressed into the first components. Output: atmospheric temperatures. The x-axis is the along-track orbital direction and the y-axis the across-track direction; the z-axis represents altitudes in the atmosphere. The white square on the input depicts the 15 × 15 spatial neighborhood sample that is passed to the CNN.
While an OLS regression would capture immediate correlation between spectral components and target, a neural network would be able to extend this to more complex and non-linear relationships. Further, an advantage of the CNN variant is that it shares all weights between all outputs, except at the very last layer. On the contrary, an OLS regression model will have 90 independent output predictions. This gives the CNN the advantage of smoother transitions between predictions, as seen in the results in Figure 16.4.
On the test set the OLS regression model achieves a Root-Mean-Square Error (RMSE) of 2.85 K, while the CNN has an RMSE of 1.94 K. As opposed to many other studies, the RMSE is calculated over all measurements, regardless of whether they are marked as cloud-contaminated, and it contains samples over land as well as ocean. The main conclusion that can be drawn from this experiment is that the spatial dependencies are better modelled with the DL model than with the traditional OLS regression.
Figure 16.4 Transect profile of RMSE for Linear Regression (OLS) and CNN on cubes of IASI data of size 15 × 15 × 125 (height × width × MNF components). The pressure on the y-axis corresponds to altitudes in the atmosphere and the x-axis shows distance along an arbitrary line on the surface of the Earth.
One of the advantages the CNN has over the OLS regression is the filtering operations that transform the spatial dimensions into a feature space of lower dimension than the original input. This is both a good way to model spatial dependencies and a way to tackle the statistical under-determination from which the problem suffers. The OLS regression will estimate least square residuals from a regressor parameterized over the input variables (15 × 15 × 125 = 28,125) for each target variable, here 90. Another advantage of the CNN's filtering properties is that the filters can perform noise reduction (averaging), edge detection, or contrast enhancement, which might help the model tackle difficult predictions, e.g. around clouds, coastal areas, and between weather fronts.
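For reference, the OLS baseline described above amounts to a multi-output linear regression on the flattened 15 × 15 × 125 patches; a minimal scikit-learn sketch (with random placeholder data instead of the IASI/ECMWF match-ups) looks as follows.

```python
import numpy as np
from sklearn.linear_model import LinearRegression

rng = np.random.default_rng(0)
n_samples = 1000                                 # placeholder number of patches
X = rng.normal(size=(n_samples, 15 * 15 * 125))  # flattened MNF patches: 28,125 inputs
y = rng.normal(size=(n_samples, 90))             # temperatures at 90 altitude levels

ols = LinearRegression().fit(X, y)               # 90 independent output regressions
y_hat = ols.predict(X)                           # shape (n_samples, 90)
```

With only a few thousand samples and 28,125 inputs per sample, the system is under-determined, which illustrates why dimensionality reduction or the CNN's spatial feature compression is needed.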
ASIP Sea Ice Dataset The ASIP Sea Ice Dataset version 1 (ASID-v1, publicly available at Malmgren-Hansen et al. (2020)) was collected for the Automated Sea Ice Products (ASIP) research project, which aimed at automating sea ice information retrieval in the Arctic. Today, monitoring sea ice mainly consists of a time-consuming manual process where experts draw polygons, typically on Synthetic Aperture Radar (SAR) imagery, and assign information about ice conditions, referred to as ice charts or ice image analysis. ASID-v1 consists of Sentinel-1 SAR images matched with expert-drawn ice charts containing polygons with sea ice concentrations. The concentrations range from 0% to 100% in steps of 5%. The dataset covers the period from November 2014 to December 2017 and is gathered across 912 Sentinel-1 SAR scenes. All seasons are covered and all coastal areas of Greenland, except for the northernmost region, are represented. The polygons’ geometry follows no strict definition and is based on the sea ice experts’ intuition of natural segments in the scenes. Further, the dataset also includes brightness temperatures from a Microwave Radiometer (MWR). MWR measurements from the Advanced Microwave Scanning Radiometer 2 (AMSR2) are recorded in 10 × 10 km grids at frequencies from 6.9 GHz to 36.5 GHz and in a 5 × 5 km grid at 89 GHz, although the footprint resolution ranges from 35 × 62 km (6.9 GHz) to 3 × 5 km (89 GHz). All these frequencies, available in both horizontal and vertical polarization, give 14 brightness temperatures per measurement. Here, all AMSR2 measurements are resampled to a 2 × 2 km grid where each grid cell center aligns with every 50th Sentinel-1 pixel. The dataset is split 90%/10% into training and test sets at scene level. In this way an independent test can be made, as samples in training and test are separated in time or space, and it reflects the operational scenario of an automatic ice chart extraction algorithm.
Figure 16.5 Polygon ice chart overlay on the HH polarization of a Sentinel-1 scene. After the ice
experts have marked the points that outline the ice boundary, a spline curve fit is applied to make
the polygon smooth. This results in ice occasionally not being encapsulated by the polygon along
the edge.
The ice concentrations, provided as averages over polygons, pose a challenge when mapping them to target values for the SAR pixels. Assigning the average concentration to every pixel inside a polygon will lead to label errors, as the distribution of ice within the polygon is unknown unless the sea ice concentration is 0% or 100%, see Figure 16.5. At the scale of
the SAR image (40 × 40 m pixel-spacing), most pixels will be either wholly open water or
wholly sea ice.
Example of Sea Ice Concentration Estimation In the following experiments we aim to apply a CNN to fuse the information in the SAR and MWR data and model the sea ice concentration values. Figure 16.6 shows a conceptual flow of the CNN fusion architecture that combines SAR and MWR data to predict pixel-wise sea ice concentrations. The CNN takes in a patch of 300 × 300 SAR pixels with the corresponding 6 × 6 MWR pixels and predicts ice concentrations in 300 × 300 pixel prediction maps. The output activation
Figure 16.6 Conceptual flow of the prediction of sea ice maps with a CNN applied to SAR images for feature extraction, whose features are concatenated with upsampled MWR measurements for fusion of the satellite sources. The output function can either be linear with a least-square loss or a sigmoid with binary cross-entropy.
function is chosen according to the loss (Bishop 2006), i.e. a linear function for the mean square loss or a sigmoid function for the cross-entropy loss.
One challenge here is the choice of loss function. If we model the sea ice concentrations as a continuous variable over pixels, a Mean Square Error (MSE) loss might be chosen, Equation 16.5. We know, though, that there are many label errors associated with the pixels, and this might make it hard to minimize the error residuals. The problem could also be considered as modeling the probability of a pixel being ice. In this way we assume that a random sample from a polygon with a concentration of e.g. 40% has a 40% probability of being ice. Modeling this probability could be done with a Binary Cross-Entropy (BCE) loss function, where we replace the typically binary encoded categorical target with the discrete ice-probability values. From Equation 16.6 with K = 1 we have
E(w) = − ∑_i t_i log(y(w, x_i)), (16.7)
where t_i is the pixel-wise sea ice concentration for the ith image patch x_i and y(w, x_i) is the CNN output prediction map. A summary of the CNN architecture is given in Table 16.2.
Both models were optimized with the Adam optimizer, with hyperparameters as given in
Kingma and Ba (2014), for 80 epochs, i.e. runs over the training data.
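As a minimal sketch of the two training options (in PyTorch, with illustrative shapes; this is not the exact ASIP pipeline), the same raw network output can be trained either with an MSE loss on the concentration value or with a binary cross-entropy loss that treats the polygon-average concentration as a soft per-pixel ice probability, as in Equation 16.7.

```python
import torch
import torch.nn as nn

logits = torch.randn(4, 1, 300, 300)           # raw CNN outputs for four 300 x 300 patches
targets = torch.full((4, 1, 300, 300), 0.40)   # polygon-average concentration, here 40%

# Option 1: linear output with a mean square error loss on the concentration value.
mse_loss = nn.MSELoss()(logits, targets)

# Option 2: sigmoid output with binary cross-entropy; the concentration acts as a
# soft per-pixel ice probability (Equation 16.7 with non-binary targets).
bce_loss = nn.BCEWithLogitsLoss()(logits, targets)
```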
Table 16.2 Summary of CNN model. S is the window size in the average pooling operation and DR
is the Dilation-Rate of the convolutional filter, i.e. the pixel spacing between each filter coefficient.
Figure 16.7 Results of Fusion-CNN. Left: ice chart from DMI experts, Mid: prediction from model
with binary cross-entropy loss, Right: prediction from model optimized with mean square error.
Results from the two experiments can be seen in Figures 16.7 and 16.8. When comparing the results in Figure 16.7, it is natural that they do not match the label polygons exactly. The network learns to relate input SAR backscatter values to ice concentrations, and the predictions contain more detail than the polygons. For validation it is necessary to compare predictions at the same scale as the polygons, by comparing the average prediction within each.
This can be done by comparing the mean concentration of a polygon with the mean of the pixel predictions within that polygon. Figure 16.8 shows such a comparison for the test set data. As each scene contains several polygons, a mean (red dot) and a standard error (black vertical lines) are estimated for the predictions at each unique ice concentration value.
The resolutions of Sentinel-1 (EW mode: 93 × 87 m) and AMSR2 (at 6.9 GHz: 35 × 62 km) are orders of magnitude apart, but in the sea ice case it still makes sense to fuse them due to the different advantages of each sensor. The model presented in Table 16.2 by-passes the AMSR2 input around the CNN layers to merge the information from all 14 channels at the end layer. Other approaches could have been chosen, but since there only
are 6 × 6 pixels of AMSR2 data for every patch, there are far fewer spatial features, and applying the same number of filters to these data would lead to redundant convolutional operations. A typical approach with CNNs for sensor fusion is to resample all data to the same pixel spacing and stack them as extended “color” channels. If the 2 channels of the Sentinel-1 image were stacked with the 14 upsampled AMSR2 channels, the amount of PSops in the first layer would, according to Equation 16.1, rise from (3 × 3) × 2 × (300 × 300) × 12 = 19.44e6 to (3 × 3) × 16 × (300 × 300) × 12 = 155.52e6. This is a large increase in computational load, and the stacking method is therefore not optimal when the resolutions of the fused datasets differ so much.
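The sketch below (PyTorch, with illustrative filter counts rather than the exact architecture of Table 16.2) shows the late-fusion idea: convolutional features are computed from the two Sentinel-1 channels only, and the 14 upsampled AMSR2 channels are concatenated just before the output layer.

```python
import torch
import torch.nn as nn

class LateFusionSIC(nn.Module):
    """Illustrative SAR/MWR fusion: CNN on SAR, AMSR2 bypassed to the last layer."""
    def __init__(self):
        super().__init__()
        self.sar_cnn = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
        )
        self.head = nn.Conv2d(32 + 14, 1, kernel_size=1)  # fuse CNN features with AMSR2

    def forward(self, sar, amsr2):
        # sar: (N, 2, 300, 300) HH+HV; amsr2: (N, 14, 6, 6) brightness temperatures
        feats = self.sar_cnn(sar)
        amsr2_up = nn.functional.interpolate(amsr2, size=sar.shape[-2:], mode="nearest")
        fused = torch.cat([feats, amsr2_up], dim=1)
        return torch.sigmoid(self.head(fused))            # per-pixel ice concentration

model = LateFusionSIC()
out = model(torch.randn(1, 2, 300, 300), torch.randn(1, 14, 6, 6))  # -> (1, 1, 300, 300)
```

Because the expensive convolutions see only the two SAR channels, this design avoids the roughly eightfold increase in first-layer operations computed above for the stacked-channel approach.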
Generally, we can conclude that the BCE loss does significantly better in optimizing the weights for sea ice predictions at 40 × 40 m scale. The BCE-trained model aligns much better with the expert annotated mean sea ice concentrations, Figure 16.8(a), as opposed to the MSE-trained model, which struggles to properly capture the full range of values, Figure 16.8(b). Other approaches to loss functions could have been taken as well. Following the previously discussed approach in Oord et al. (2016), with a multi-class categorical cross-entropy (CCE) loss, we would assign each unique ice concentration a class. Another approach could be to use the MDN networks proposed by Bishop (2006). Both a CCE-trained network and an MDN use several output nodes to model the probability distribution over the full range of possible outputs, as opposed to the single-output approaches shown in these experiments.
16.5 Conclusion
Many challenges exist for deep learning in bio-geophysical parameter retrieval problems. Since we are modeling the Earth's state, we need to apply algorithms to large amounts of data. Further, we have many sources of variance in our observations, caused by e.g. seasonal, yearly, and geographical variation. There are also many different types of sensors, some measuring very small signal values resulting in noise problems, and sometimes we need to fuse data from several sensors to optimize predictions. Deep learning has several advantages, though. The end-to-end learning concept makes it possible to map highly non-linear relationships, and the latent feature space inside a neural network imposes some sparsity when working with high-dimensional problems. Further, the architectures of DL models are flexible and can be tailored to the many sources of Earth observation data. Once trained, DL models are typically not a large computational burden, which often makes it possible to incorporate them into operational pipelines. It is therefore very likely that Earth parameter retrieval will see increasing use of the DL framework in the future.
17
A Review of Deep Learning for Cryospheric Studies
Lin Liu
17.1 Introduction
The cryosphere refers to the portions of the Earth's surface where water is frozen. Its major components include snow cover, glaciers, ice sheets, permafrost and seasonally frozen ground, sea ice, lake ice, river ice, ice shelves, and icebergs. Storing about 75% of the world's fresh water in a frozen state, the cryosphere plays an important role in the global water cycle. As an
integrated part of the Earth system, the cryosphere modulates the surface energy and gas
exchange with the biosphere, atmosphere, and ocean. We refer interested readers to Barry
and Gan (2011) and Marshall (2011) for a comprehensive and detailed description of the
cryosphere.
In recent decades, the cryosphere has been undergoing strong warming and area
reduction. The Special Report on the Ocean and Cryosphere in a Changing Climate,
released in September 2019 by the United Nations’ Intergovernmental Panel on Climate
Change (IPCC), provides an up-to-date and comprehensive summary of the past, ongoing,
and future changes of the cryosphere (IPCC 2019). For instance, according to the most
recent Ice Sheet Mass Balance Inter-comparison Exercises (IMBIE), both the Greenland
and Antarctic Ice Sheets have been losing ice mass at accelerated rates in the past two
decades (IMBIE 2018, 2019). According to space-borne passive microwave measurements,
the Arctic sea ice extent in September has decreased by ∼ 12.8% per decade between
1979 and 2018 (Onarheim et al. 2018). Globally, the ground temperature at the depth of
zero amplitude in the continuous permafrost zone rose by ∼ 0.39∘ C from 2007 to 2016
(Biskaborn et al. 2019).
The rapid changes of the cryosphere have numerous profound implications for human
society such as opening of new shipping routes in the Arctic, inundation and land loss
associated with rising sea level, glacial lake outburst flood, and slope instability and
infrastructure damage from permafrost degradation (IPCC 2019). Cryospheric changes
also affect the global climate system through feedbacks associated with the decrease of
Earth’s albedo (Flanner et al. 2011), modifications to ocean circulation caused by the influx
of fresh meltwater (Böning et al. 2016), and the release of carbon from thawing permafrost
(Schuur et al. 2015).
This chapter reviews the use of deep learning for cryospheric studies, aiming to showcase the diverse applications of deep learning for tackling research tasks that are becoming more challenging for conventional approaches. Even though such applications are still in an early stage, deep learning has been utilized to characterize nearly all the cryospheric components in a diverse manner (see Figure 17.1 for a graph summary). In terms of datasets, a majority of deep learning studies have been applied to remote sensing observations, largely due to the effectiveness and versatility of remote sensing tools for observing cryospheric systems in remote and often inaccessible places (Tedesco 2014). Moreover, many deep learning algorithms that have been well-developed for computer vision can be directly applied to remote sensing observations (also see chapters in Part I). Some modeling studies make use of deep learning for parameterization and prediction. Some deep learning studies on Arctic vegetation (e.g., Langford et al. 2019), Arctic wetland (e.g., Jiang et al. 2019a), and land surface temperature (e.g., Tan et al. 2019) are all related to cryospheric variables and processes, but are beyond the scope of this review.
Because the applications are diverse, with no single common dataset, methodology, or evaluation metric, we are not going to compare the various works. Instead, we will highlight a few studies that represent the growing literature on DL applications, summarize some innovative and unique uses of DL, and offer some thoughts on common strengths and limitations as well as future directions.
Section 17.2 summarizes DL-based remote sensing studies of various components of the
cryosphere, including glaciers, ice sheets, snow cover, permafrost, sea ice, and freshwater
ice. Section 17.3 describes the use of deep learning for modeling the cryosphere. Section 17.4
concludes this review by highlighting the key achievements and future directions of deep
learning, followed by a list of public data and codes in the appendix.
Figure 17.2 Jakobshavn Isbræ in western Greenland. Left: aerial photo (oblique view) of the
glacier. Photo credit: NASA/OIB/John Sonntag (https://www.jpl.nasa.gov/news/news.php?
feature=7356). Its calving front, manually delineated, separates the glacier and ice mélange (a
mixture of calved icebergs and sea ice). Right: TerraSAR-X image taken on August 28, 2013. The
calving front was delineated by Zhang et al. (2019) using DL. This figure was modified from Figure
4a in Zhang et al. (2019) with the authors’ permission.
fronts directly. Zhang et al. (2019) only mapped one glacier, and data diversity originates
from hundreds of SAR images taken in both summer and winter seasons over several years.
Mohajerani et al. (2019) and Baumhoer et al. (2019) tested the networks’ performance on
glaciers/regions not used in training, showing good transferability of their DL networks.
In mountain areas, the ablation zones of many valley glaciers are covered by debris at the surface. The supraglacial debris presents spectral properties similar to the surrounding landscape on remote sensing imagery, making it challenging to map the boundaries of debris-covered glaciers. Xie et al. (2020) took the first step towards DL-based mapping of debris-covered glaciers. Their input data consist of 17 layers, including all 11 Landsat-8 bands, 1 DEM, and 5 DEM-derived layers containing topo-geomorphic parameters (slope angle, slope-azimuth divergence index, etc.). They used the manually-delineated boundaries from the GLIMS dataset to train a feed-forward neural network. Testing in two regions in the central Karakoram and the Nepal Himalaya, they evaluated the network's performance when using different portions of the ground truth data for training and demonstrated an overall high accuracy, even in complex cases where lakes and proglacial moraines are present. They also conducted a transfer-learning experiment that used the network pre-trained on the Karakoram debris-covered glaciers as the base model to train on the Nepal data. This strategy helped to reduce the training time and slightly improve the mapping accuracy.
preexisting 1-km-resolution bed elevation map, alone cannot meet their goal of producing a
higher-resolution bed map, they added three remote-sensing-based datasets, including the
surface elevation (100 m resolution), ice velocity (500 m), and snow accumulation (1 km) as
conditional input grids. Trained with 250-m-gridded ground-truth bed elevation resampled
from ice-penetrating radar surveys at five locations in western Antarctica, the GAN gen-
erated a higher 250-m-resolution bed elevation map over the entire ice sheet. The authors
demonstrated that this new map is generally realistic but still exaggerates the roughness or
introduces obvious artifacts such as ridges and speckles in some places. The major limita-
tions are that the training dataset only covers an extremely small fraction of the ice sheet
and the high-resolution details in the GAN-based bed elevation rely on the conditional data
from the surface of the ice sheet, especially the surface elevation. More work is needed to
enhance the quantity of training data and incorporate glacial flow mechanics into the DL
model.
Yuan et al. (2020a) extracted supraglacial lakes in central west Greenland from Landsat
8 Operational Land Imager (OLI) data and further documented their changes during the
melt seasons from 2014 to 2018. Their input data are the mean of Bands 1 to 8, Normalized
Difference Water Index calculated from the green and near-infrared bands, and Normal-
ized Difference Vegetation Index calculated from the red and near-infrared bands. Their
training data are manually-delineated supraglacial lake outlines from Landsat 8 RGB images. Comparing with an unsupervised image thresholding method (Otsu) and two supervised methods (Random Forests and Support Vector Machine), they demonstrated that the CNN outputs contain the least noise and omission errors.
17.2.3 Snow
Due to the high albedo of snow, its presence can often be easily identified from visible and passive microwave images. Numerous hand-crafted retrieval algorithms, tailored towards specific sensors, have been developed for routine products of snow extent, melt, albedo, depth, and water equivalent (see Tedesco 2014, Chapters 3, 4, 5, 6).
Recently, a few studies explored the use of DL to extract snow cover from remote sensing imagery and evaluated its potential against conventional methods. For instance, Xia et al. (2019) used a multi-dimensional deep residual network to classify snow and cloud cover on multi-spectral HuanJing-1 satellite imagery over Tibet. Nijhawan et al. (2019a) developed a hybrid method that integrates an AlexNet-based DL network with Sentinel-2 optical images as input and a Random Forest classifier with hand-crafted features based on Sentinel-1 SAR and the SRTM DEM to extract snow cover in northern India. Validating against ground truth based on field observations, they showed that their hybrid method gave the best accuracy (98.1%) and highest Kappa coefficient (0.98) compared with conventional machine-learning methods (accuracies ranging from 77% to 95%). Guo et al. (2020) first trained DeepLabv3+ (Chen et al. 2018a) with a 30-m snow-cover product based on the Landsat-8-based Normalized Difference Snow Index and then fine-tuned the DL network using a smaller amount of training data based on 3.2-m-resolution Gaofen-2 imagery. Their initial experiments demonstrated that such a DL-based model can differentiate snow from clouds and even recognize snow in image shadows.
In addition to snow cover, DL has also been used to retrieve snow depth. Braakmann-
Folgmann and Donlon (2019) proposed a DL-based retrieval algorithm for estimating snow
depth on Arctic sea ice from passive microwave radiometer measurements. Their input
data included three AMSR2-based brightness temperature (Tb) ratios (vertically polarized
Tb at 18.7 vs. at 36.5 GHz, vertically polarized Tb at 6.9 vs. 18.7 GHz, and vertically vs.
horizontally polarized Tb at 36.5 GHz) plus one SMOS-based Tb ratio (vertically vs. horizon-
tally polarized Tb at 1.4 GHz). They used snow depth measured from an airborne radar by
NASA’s airborne Operation IceBridge campaigns to train their DL network that consists of
five fully connected hidden layers. Comparing with three empirical snow depth algorithms,
the authors showed that the DL-based one gives the highest accuracy. They further demon-
strated that the DL-estimated snow depth could improve the retrieval of sea ice thickness
when converting altimeter measurements of ice freeboard to sea ice thickness.
17.2.4 Permafrost
Permafrost refers to the ground that remains at or below 0 ∘ C for at least two consecu-
tive years (French 2017). Because permafrost is an underground thermal phenomenon, it
is challenging to observe directly using remote sensing. The use of DL has thus far been
limited to mapping ice wedge polygons and hillslope thermokarst landforms from remote
sensing data.
In a pioneering study, Zhang et al. (2018) applied DL to detect, delineate, and classify ice wedge polygons from aerial orthophotos (with spatial resolution ranging from 0.15 m to 1 m) taken in northern Alaska. They manually annotated the training samples as high-centered and low-centered polygons (a high/low-centered polygon is slightly higher/lower at the center than at its rim). They then carried out object instance segmentation using Mask R-CNN, which outputs binary masks (ice wedges or non-ice-wedges) together with a classification. They separately evaluated their accuracies of detection, delineation, and classification and reported that the DL-based method could detect about 79% of ice wedge polygons across a 134-km2 area. Demonstrating the degree of transferability of DL by applying the trained network to coarser-resolution images (0.5 m to 1 m) taken in a new area, they showed that the DL can still achieve a 72% detection accuracy.
Abolt et al. (2019) applied a simpler, 10-layer CNN to 50-cm-resolution DEMs constructed from airborne LiDAR data over two areas on the Arctic coast in northern Alaska. The training data are manually-delineated ice wedge boundaries and non-ice-wedge boundaries. After the CNN-based delineation, they applied a watershed analysis based on the DEM, measured the microtopography, and classified the polygons into high- and low-centered types. They detected up to 5000 ice wedge polygons per square kilometer and more than 1,000,000 over an area of 1200 km2 near Prudhoe Bay (Abolt and Young 2020). It is arguably only feasible with DL to generate such high-resolution, high-density, and extensive maps.
Thermokarst is a generic term that refers to “the process by which characteristic
landforms result from the thawing of ice-rich permafrost or the melting of massive ice”
(Harris et al. 1988). Thermokarst landforms are important surface expressions and visual
indicators of permafrost degradation. According to their geomorphological and hydrolog-
ical characteristics, thermokarst landforms are classified into more than 20 types, such
as thermokarst lakes, thermo-erosion gullies, active layer detachments, and retrogressive
thaw slumps (Jorgenson 2013).
Figure 17.3 Deep-learning-based delineation of thermokarst landforms. This example shows one of 16 landforms that Huang et al. (2018) identified from a high-resolution UAV image (background) using DL. This figure was modified from Figure 8c in Huang et al. (2018) with the authors' permission.
Thermokarst landforms are common on the Qinghai-Tibet
Plateau and high mountains of China, but their locations and surface dynamics are still
poorly quantified or understood, especially compared with their counterparts in the Arctic.
Huang et al. (2018) identified thermo-erosion gullies in a small watershed (6 km2) in northeastern Tibet. They applied DeepLab v2 (Chen et al. 2016) to a 0.15-m-resolution digital orthophoto map constructed from aerial photographs taken by an unmanned aerial vehicle (UAV). Validating the results against field-mapped boundaries, they showed that the DL method successfully mapped all 16 thermokarst landforms in the watershed (see one of them in Figure 17.3). They also showed the drastic improvement of DL over a conventional object-based image analysis method, the latter of which has many difficulties in identifying thermo-erosion gullies with complex geometric and geomorphic features. Applying a newer and improved version, DeepLabv3+, to 3-m-resolution CubeSat images taken by the Planet constellation, Huang et al. (2020) successfully delineated 220 retrogressive thaw slumps within an area of 5200 km2 in central Tibet. They also proved the robustness of their results based on more than 100 experiments with different data augmentation strategies and portions of ground truth data used for training.
simple CNN (two convolutional layers, two max-pooling layers, and a fully-connected layer) to dual-polarized (HH and HV) RADARSAT-2 ScanSAR images taken over the Beaufort Sea (Arctic). The training data they used were ice concentration charts manually produced by experts from visual interpretation of SAR images. Validating against AMSR-E ice concentration products, they showed the robustness of their DL-based method even in cases of significant SAR speckle noise, varying incidence angle, and in areas of low ice concentration. Cooke and Scott (2019) presented an innovative use of DL that is trained using passive microwave sea ice concentration products (from AMSR-E) and performs inference on higher-resolution RADARSAT-2 SAR images. Yan et al. (2017); Yan and Huang (2018) used DL to detect the presence of sea ice and estimate its concentration from TechDemoSat-1 Delay-Doppler maps obtained using Global Navigation Satellite System Reflectometry.
Mei et al. (2019) estimated sea ice thickness in the Ross Sea (Antarctica) from profiles of
snow surface acquired from terrestrial laser scanning. The highlight of this work is that the
input does not include any snow depth or surface densities. Instead, the DL, which consists
of three convolutional layers and two fully connected layers, learns 3D geomorphic features
in the laser scanning and builds a non-linear link with sea ice thickness. Additionally, DL
has been used to classify sea ice types from Earth Observing-1 (EO-1) hyperspectral imagery
(Han et al. 2019) and detect sea ice changes from multi-temporal SAR images (Gao et al.
2019).
Nash-Sutcliffe efficiency, mean absolute error, root-mean-square error, and its normaliza-
tion). They also demonstrated the superior performance of their CNN model for predicting
sea ice concentration in extreme cases such as the significant sea ice loss in the summers of
2007 and 2012.
Bolibar et al. (2020) used DL for simulating and reconstructing the surface mass balance (SMB) of glaciers in the French Alps. In contrast to empirical or physics-based models, their data-driven method used DL to parameterize the non-linear link between annual glacier-wide SMB and topographic variables, such as mean and maximum altitude, and meteorological/climatic variables, such as cumulative positive degree days and snow precipitation and temperature anomalies. Because of the small size of the annual SMB data, covering 32 glaciers spanning 31 years, they designed a simple, 6-layer feed-forward fully-connected network and implemented it in Keras. Comparing with two classic linear regression methods, they showed that the DL-based model gives improved explained variance (by up to 64% in space and 100% in time) and better accuracy (by up to 47% in space and 58% in time). In particular, the DL model captures about one-third of the non-linear variability in the temporal changes. This DL-based SMB reconstruction is now included in the open-source ALpine Parameterized Glacier Model (ALPGM, https://github.com/JordiBolibar/ALPGM). However, because the DL model was trained using data from the French Alps, it needs to be retrained if applied to other regions.
Here are the major data centers, repositories, and providers for cryospheric studies:
Below we list the data and codes published in the cryospheric studies reviewed in this
chapter, grouped by the cryospheric components.
1. Glaciers
● Detection of glacier calving margins with convolutional neural networks (Mohajerani
et al. 2019)
Code and data: https://github.com/yaramohajerani/FrontLearning
et al. 2020)
Code: https://github.com/yghlc/Landuse_DL
Training and test data: https://doi.pangaea.de/10.1594/PANGAEA.908909
● High-resolution mapping of spatial heterogeneity in ice wedge polygon geomorphol-
Code: https://github.com/abhineet123/river_ice_segmentation
Data: https://ieee-dataport.org/open-access/alberta-river-ice-segmentation-dataset
18
Emulating Ecological Memory with Recurrent
Neural Networks
Basil Kraft, Simon Besnard, and Sujan Koirala
Ecological memory can be broadly defined as the encoding of past environmental condi-
tions in the current ecosystem state that affects its future trajectory. The consequent effects,
known as memory effects, are the direct influence of ecological memory on the current
ecosystem functions (Peterson 2002; Ogle et al. 2015). Such memory effects are prevalent
across several spatial and temporal scales. For example, at the seasonal scale, the variabil-
ity of spring temperature affects ecosystem productivity over the subsequent summer and
autumn (Buermann et al. 2018). Inter-annually, moisture availability over the previous year
is linked to contemporary ecosystem carbon uptake (Aubinet et al. 2018; Barron-Gafford
et al. 2011; Ryan et al. 2015). Furthermore, less frequent and large extreme events (e.g.,
heat waves, frost, fires, or insect outbreaks) can lead to short-term phenological changes
(Marino et al. 2011) or long-term damage to the ecosystem with diverse effects on present
and future ecosystem dynamics (Larcher 2003; Lobell et al. 2012; Niu et al. 2014). This
evidence highlights the relevance of short to long-term temporal dependencies on past envi-
ronmental conditions in terrestrial ecosystems. However, due to the large spectrum of the
environmental conditions and their consequent effects on the ecosystem, quantifying and
understanding the strength and persistence of memory effects is often challenging.
Ecological memory effects may comprise direct and indirect influences of external and
internal factors (Ogle et al. 2015) that are either concurrent or lagged in time. For instance,
a drought may directly decrease ecosystem productivity, with indirect concurrent effects on
loss of biomass due to the drought-induced fire (t3 in Figure 18.1). Additionally, ecosystems
may not only be responding to concurrent factors, but also the lagged effects of past
environmental conditions. A drought event can further impact the ecosystem productivity
for months to years through a direct but lagged effect. On the other hand, indirect lagged
effects involve external factors that affect the ecosystem productivity during a drought,
e.g., disturbances like tree mortality and deadwood accumulation (t4 in Figure 18.1),
which may lead to insect outbreaks with further influences on the ecosystem (t5 and t6 in
Figure 18.1).
Figure 18.1 Schematic diagram illustrating the temporal forest dynamics during and post-disturbance: drought conditions occurring in t2 and t3, a fire event in t2, and insect outbreaks in t5.
Memory effects are not exclusive to ecosystem productivity, but encompass a large num-
ber of Earth system processes of carbon (Green et al. 2019) and water cycles (Humphrey
et al. 2017). A key variable that encodes memory effects in the Earth system is soil moisture.
Soil moisture is controlled by instantaneous and long-term climate regimes, vegetation
properties, soil hydraulic properties, topography, and geology. As such, soil moisture
exhibits complex variabilities in space, time, and along with soil depth. Owing to its central
role but large complexity, most physical models are built around the parameterization of
moisture storage, which in turn affects the responses of land surface to environmental
conditions. Nevertheless, physical models have inherent uncertainties due to differences
in structure and complexity and input data as well as unconstrained model parameters.
Several data-driven methods have therefore been developed to address the shortcomings
of physical models for understanding Earth system processes as observed in the data. But,
the data-driven methods may also be limited by data quality and availability. For example,
the vegetation state over the land surface can be observed with satellite remote sensing. Yet,
state variables such as soil moisture, which have imprints of ecological memory, are difficult
to measure to meaningful soil depths and across larger scales. This poses a key challenge in
capturing the memory effects using conventional data-driven methods. As such, dynamic
statistical methods (cf. Chapter 8), such as Recurrent Neural Networks (RNNs, LeCun et al.,
2015), may address these shortcomings, as they do not necessarily require measurements
or observations of state variables. In this context, RNNs have a large potential for bringing
the data-driven estimates on par with Earth system models with regards to capturing the
ecological memory effects on land surface responses. This chapter focuses on this aspect
and demonstrates the capabilities of RNNs to quantify memory effects with and without
the use of state variables.
through time via the system state St , at every time step, which can be expressed as
St = f (St−1 , Xt ). (18.1)
Yt = g(St ). (18.2)
The St encodes all memory effects needed to compute Yt , and it can be interpreted as the
ecological memory. From a data-driven perspective, the memory St emerges solely from
the effects of “unobserved” previous states that are not directly encoded in any given obser-
vations (Jung et al. 2019). For example, if instantaneous vegetation state (e.g., vegetation
greenness) and current climatic conditions (e.g., air temperature or rainfall) are included
in the observed state Ot , their effects are not necessarily encoded in St . Therefore, St can be
mathematically expressed as
St = f (St−1 , Xt , Ot ). (18.3)
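In code, Equations 18.1–18.3 correspond directly to the recurrence of an RNN cell, with the hidden state playing the role of the ecological memory St; the minimal PyTorch sketch below uses illustrative dimensions and random placeholder forcing.

```python
import torch
import torch.nn as nn

n_forcing, n_obs, n_state = 5, 3, 16       # illustrative dimensions
cell = nn.LSTMCell(input_size=n_forcing + n_obs, hidden_size=n_state)
readout = nn.Linear(n_state, 1)            # g(.): maps the state to the target Yt

h = torch.zeros(1, n_state)                # St: learned, unobserved ecosystem state
c = torch.zeros(1, n_state)
predictions = []

for t in range(365):                       # one year of daily steps with random forcing
    x_t = torch.randn(1, n_forcing)        # meteorological forcing Xt
    o_t = torch.randn(1, n_obs)            # observed state Ot (e.g. vegetation greenness)
    h, c = cell(torch.cat([x_t, o_t], dim=1), (h, c))  # St = f(St-1, Xt, Ot), Eq. 18.3
    predictions.append(readout(h))                     # Yt = g(St), Eq. 18.2
```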
Table 18.2 Factorial experimental design: the four models are trained individually to
assess the capability of an LSTM to learn ecological memory (LSTMSM , with soil moisture
vs. LSTM¬SM , without soil moisture as input) and to quantify the local relevance of soil
moisture for ET (FCSM vs. FC¬SM ). The temporal models learn a mapping from the
concurrent and past features X≤t to the target Yt , while the non-temporal models have
access to the concurrent features Xt only. St is the ecosystem state, i.e., soil moisture.
                          model type
                          temporal        non-temporal
  with soil moisture      LSTMSM          FCSM
  without soil moisture   LSTM¬SM         FC¬SM
The predictions from four model setups were evaluated against the MATSIRO simula-
tion at global and regional scales. At the grid-scale, the overall performances were evalu-
ated using the Nash–Sutcliffe model efficiency coefficient (NSE) (Nash and Sutcliffe 1970)
and the Root Mean Square Error (RMSE) (Omlin and Reichert 1999). Globally, the perfor-
mances are also summarized across different temporal (daily, daily anomalies, daily sea-
sonal cycle, interannual variation) scales. At the regional scale, our evaluation focused on
the capability of LSTM to simulate temporal ET dynamics in two focus regions: the humid
Amazon and semi-arid Australia. In these two example cases, the mean seasonal cycle for
the period 2001–2013 and seasonal anomalies observed during climate extreme events (2005
drought in the Amazon (Phillips et al. 2009) and the 2010 La Niña in Australia (Boening
et al. 2012)) were evaluated. Table 18.3 summarizes the main features of the evaluations.
The LSTM takes the multivariate time-series and static variables as input, which is fol-
lowed by a hyperbolic tangent activation and a linear layer that maps the LSTM output at
each time step to a single value: the predicted ET. The FC models consist of several fully
connected layers, each followed by a non-linear activation function, except for the output
layer, where no activation function is used. The FC model takes the static variables and only
a single time-step of the time-series as input.
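A minimal sketch of the two model families described above (PyTorch, with illustrative layer sizes; the tuned values are listed in Table 18.4) could look as follows.

```python
import torch
import torch.nn as nn

class TemporalET(nn.Module):
    """LSTM setup: time series plus static variables -> daily ET (illustrative sizes)."""
    def __init__(self, n_features, hidden=200):
        super().__init__()
        self.lstm = nn.LSTM(n_features, hidden, num_layers=1, batch_first=True)
        self.out = nn.Linear(hidden, 1)

    def forward(self, x):                   # x: (batch, time, n_features)
        h, _ = self.lstm(x)
        return self.out(torch.tanh(h))      # ET prediction at every time step

class NonTemporalET(nn.Module):
    """FC setup: only the concurrent features of a single time step are used."""
    def __init__(self, n_features, hidden=200, n_layers=4):
        super().__init__()
        layers, width = [], n_features
        for _ in range(n_layers):
            layers += [nn.Linear(width, hidden), nn.ReLU()]
            width = hidden
        layers.append(nn.Linear(width, 1))  # no activation on the output layer
        self.net = nn.Sequential(*layers)

    def forward(self, x):                   # x: (batch, n_features)
        return self.net(x)
```

The temporal model sees the full forcing history up to time t, whereas the non-temporal model only sees the concurrent features, which is exactly the contrast exploited in the factorial design of Table 18.2.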
The final model architectures (Table 18.4) were selected using a hyper-parameter
optimization approach: the Bayesian optimization hyper-band algorithm (Falkner
et al. 2018). The state-of-the-art optimization algorithm efficiently finds optimal
hyper-parameters by combining an early stopping mechanism (dropping non-promising
runs early) and a Bayesian sampling of promising hyper-parameters, with a surrogate
loss model for the existing samples. To prevent over-fitting of the hyper-parameters, we
used only every 6th latitude/longitude grid-cell (approximately 3% of the data) during
hyper-parameter optimization. To avoid over-fitting of the residuals caused by temporal
Table 18.4 The model and training parameters from hyper-parameter optimization and their
ranges searched. Both LSTM models (SM vs. ¬SM) consist of several LSTM layers, followed by
multiple fully connected layers. The non-temporal FC models consist of several stacked fully
connected layers. In all setups, dropout was enabled for the input data and between all layers. Note
that a dropout of 0.0 means that no dropout is applied.
LSTM models                 range searched            LSTMSM     LSTM¬SM
dropout (input)             (0.0, 0.5)                0.0        0.0
LSTM number of layers       (1, 3)                    2          1
LSTM hidden size            (50, 300)                 300        200
LSTM dropout                (0.0, 0.5)                0.4        0.3
FC number of layers         (2, 6)                    3          5
FC hidden size              (50, 300)                 300        300
FC activation               {ReLU, softplus, tanh}    ReLU       ReLU
FC dropout                  (0.0, 0.5)                0.3        0.1
learning rate               (0.001, 0.0001)           0.001      0.001
weight decay                (0.01, 0.0001)            0.01       0.01

FC models                   range searched            FCSM       FC¬SM
dropout (input)             (0.0, 0.5)                0.0        0.0
FC number of layers         (2, 6)                    6          4
FC hidden size              (50, 600)                 200        200
FC activation               {ReLU, softplus, tanh}    ReLU       ReLU
FC dropout                  (0.0, 0.5)                0.0        0.0
learning rate               (0.001, 0.0001)           0.01       0.01
weight decay                (0.01, 0.0001)            0.001      0.001
276 18 Emulating Ecological Memory with Recurrent Neural Networks
auto-correlation and to test how the model generalizes, the data were split into two
sets: training data from 1981 to 1999 inclusive and test data from 2000 to 2013 inclu-
sive. For both sets, an additional period of 5 years was used for model warm-up. For
all four setups, the hyper-parameter optimization and model training were carried out
independently.
Figure 18.2 Global distributions of performances of different model setups based on daily model
predictions from the test dataset. Nash-Sutcliffe model efficiency coefficient (NSE) is shown in the
left and Root Mean Square Error (RMSE) in the right column for the temporal LSTM and
non-temporal FC models with (SM) and without (¬SM) soil moisture input, respectively. The inset
histogram represents the global distribution of the metrics.
Figure 18.3 Difference maps of Nash-Sutcliffe model efficiency coefficient (NSE) and Root Mean
Square Error (RMSE) for the LSTM (first row) and FC (second row) model setups. For the LSTM
models, differences in NSE or RMSE were computed as LSTM¬SM − LSTMSM , while for the FC
models, differences were computed as FC¬SM − FCSM . While the SM models have the ecosystem
state (soil moisture) as input, the ¬SM do not have it. Red colors indicate that the SM model
performs better than ¬SM. The inset histogram represents the global distribution of the differences.
performances of the LSTM models were still good with a median NSE of 0.91 (LSTMSM )
and 0.88 (LSTM¬SM ) for the anomalies.
The FC models performed worse than the LSTM models on the daily time series, particularly when soil moisture was not used as an input variable (FC¬SM ). Decomposing the daily time series into the mean seasonal cycle and anomalies suggested that the lower performance of the FC models compared to the LSTM models was mostly driven by weaker performance on the anomalies (median NSE of 0.75 for FCSM and 0.63 for FC¬SM ). The mean seasonal cycle was captured similarly well by the LSTM and FC models (median NSE from 0.97 to 1.00, where the lowest is FC¬SM and the highest is LSTMSM ), although with larger variability across grid-cells for the FC models: 25th to 75th percentile ranges of 0.95 to 1.00 (FCSM ) and 0.82 to 0.99 (FC¬SM ), versus 1.00 to 1.00 (LSTMSM ) and 0.97 to 1.00 (LSTM¬SM ). The model performances for the anomalies were substantially lower for the FC models than for the LSTM models. These results suggest that ecological memory effects are especially relevant for capturing the daily and annual anomalies.
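The decomposition used here (and described in the Figure 18.4 caption below) can be written compactly as in the following sketch, assuming a daily ET series stored in a pandas Series with a DatetimeIndex; this is an illustration, not the authors' code.

import pandas as pd

def decompose(et: pd.Series):
    # Split a daily ET series into mean seasonal cycle, daily anomalies, and annual anomalies.
    doy = et.index.dayofyear
    seasonal_cycle = et.groupby(doy).transform("mean")              # mean of each day across years
    daily_anomaly = et - seasonal_cycle                              # raw estimate minus day-of-year mean
    annual_anomaly = et.groupby(et.index.year).mean() - et.mean()    # mean annual minus overall mean
    return seasonal_cycle, daily_anomaly, annual_anomaly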
Surprisingly, the FCSM model performed worse than the LSTM models, particularly for
the anomalies, even though the only relevant state variable for a given time step, SMt , was
known to the model. This contradiction may be associated with several factors. First, in
the MATSIRO simulation, ET is based on the transient soil moisture, with losses and gains within a day between SMt−1 and SMt . In the experiment here, SMt−1 was used as an input for the FCSM model, and as such, one would expect some minor differences. Additionally,
Figure 18.4 Box and whisker plots showing grid-level model performances across timescales
(i.e., daily, daily seasonal cycle, daily anomalies, and annual anomalies) for the training and test
sets. Daily seasonal cycle are calculated as the mean of each day across different years, daily
anomalies are calculated as the difference between daily raw estimates and the mean of each day
and annual anomalies are calculated as the difference between mean annual and mean estimates
within each grid-cell. Nash-Sutcliffe model efficiency coefficient (NSE) and Root Mean Square Error
(RMSE) are shown. The whiskers represent the 1.5 ⋅ inter-quartile range (IQR) of the spatial
variability of the model performances.
although this remains hypothetical, the FC model may not have enough capacity to extract high-level features for an instantaneous mapping from the concurrent time step of the input data, whereas the LSTM models can learn complex representations from a series of past time steps. The LSTM can therefore learn part of the ecological memory effects from the temporal dynamics of soil moisture, in addition to the instantaneous soil moisture available to FCSM . This also extends to the potential utilization of the distribution of the input data by the LSTM model, which has access to the full global distribution of all the inputs.
Figure 18.5 Seasonal cycle (first row), seasonal variation of the residuals (second row)
and seasonal anomaly (third row) in the Amazon region (first column) and Australia (second
column). Seasonal residuals were computed as ET_residuals,i = [ET_MATSIRO,i − mean(ET_MATSIRO)] − [ET_predicted,i − mean(ET_predicted)], where i is a month. Seasonal anomalies are shown for the
years 2005 and 2010 for the Amazon region and Australia, respectively.
All model setups broadly reproduce the mean seasonal cycle of ET in both regions (Figure 18.5, first row). However, not all the models perform well under all conditions. For example, in the humid Amazon, the LSTMSM performs best across all months, while the other models perform relatively worse in drier conditions (July-December). The mean seasonal variations of the residuals (second row) show that the LSTM models learn the temporal dynamics of ET better than the FC models, as their residuals (blue lines) are closer to zero over the entire year. The FC models have larger residuals, with particularly high values for the FC¬SM model, especially during the dry season in the Amazon region and over the growing season in Australia (August to May). The high values in the seasonal patterns of residuals in Australia for the FC¬SM experiment, but not in the FCSM model, suggest an apparent importance of soil moisture in controlling ET in this region.
We further investigated the performance of the LSTM models under two extreme climatic conditions: the 2005 drought in the Amazon, and the 2010 La Niña in Australia (Figure 18.5, bottom row). The LSTMSM (dashed blue line) and LSTM¬SM (solid blue line) models reproduce the MATSIRO simulation of strong seasonal anomalies even under these extreme conditions. As also shown in the previous sections, the FCSM model cannot reproduce the seasonal anomalies as well as the LSTM models do.
18.5 Conclusions
This chapter provided an overview of ecological memory effects in the Earth system, along with a case study applying a deep learning method, recurrent neural networks (RNNs), to represent ecological memory effects. The case study used the simulations of a physical model as pseudo-observations to evaluate the capability of RNN models to predict ET and the ecological memory effects therein.
The LSTM model was able to capture the ecological memory effects inherent in the phys-
ical model. Moreover, the difference in the performances of the LSTM model with and
without soil moisture state was found to be negligible. This appeared to be consistent from
daily to annual temporal scales, and over most regions globally. This finding demonstrated
that the LSTM, through its hidden states, is indeed able to learn the memory effects that are
explicitly encoded in the state variables of a physical model.
We further found that the LSTM was able to predict the soil moisture-ET dynamics even
during anomalous climatic conditions demonstrating that the predictions of the LSTM are
general and applicable under a wide range of environmental conditions. This was true for
seasonal responses of ET to the 2005 dry spell in the Amazon, and the 2010 La Niña event in
Australia. The non-temporal FC models generally performed worse, especially with regards
to anomalies when soil moisture was not given as input (FC¬SM ). Under the assumption that the physical model is analogous to reality, the poorer performance of this model can be interpreted as reflecting the importance of soil moisture memory effects for ET. The relatively weaker performance of the FC model that has access to soil moisture (FCSM ), compared to the LSTM architectures, could not be fully explained conceptually. We hypothesize that the LSTM models' access to the distribution of past climate observations, and their ability to compensate for biases emerging from temporal aggregation, may be associated with their better performance.
In summary, our results, evaluated against the simulations of a physical model, demonstrated the usefulness of the LSTM architecture for learning the dynamics and the ecological memory of unobserved state variables, such as soil moisture. This justifies the need, and provides confidence, for using a dynamic statistical model, such as an LSTM, when investigating temporally dependent ecohydrological processes with (often limited) observation-based datasets. The coupling of dynamic data-driven methods either with physically-based models (i.e., hybrid modeling, cf. Chapter 22) or with complementary machine learning approaches (e.g., convolutional neural networks, cf. Chapter 2) will pave the way for a better understanding of known as well as unknown Earth system processes.
Part III
19
Applications of Deep Learning in Hydrology
Chaopeng Shen and Kathryn Lawson
19.1 Introduction
Hydrologists have had a long history working with neural networks in myriad applications
including rainfall runoff modeling, groundwater management, water quality, stream salin-
ity, and precipitation forecasting (Gupta et al. 2000; Govindaraju 2000). As one of the largest
fields in geoscience by population, hydrology was also one of the early geoscientific fields
to adopt deep learning (DL) (Shen et al. 2018; Shen 2018a). Following several pioneering
applications (Tao et al. 2016; Fang et al. 2017; Kratzert et al. 2018), DL has gradually taken
hold in hydrology. As a 2018 open-access review paper has already summarized some of the
applications of hydrologic DL (Shen 2018a), the main purpose of this chapter is to account
for the recent trends from late 2017 to early 2020 and provide some outlooks into the next
stage.
The 2017-early 2020 era marked a proof-of-capability phase for DL in hydrology as well as
a period of fast researcher onboarding and radically increasing hydrologic applications for
many topics (section 19.2). DL has been evolving from a niche tool to a method of choice for
some prediction tasks, while a wide range of approaches have been attempted to offer the
full suite of services commonly provided by traditional hydrologic models (e.g., dynamical
modeling, forecasting, data collection). Nevertheless, at the same time, DL is still a skill
wielded by a minority in the field. This may be mainly because the educational background
required for DL is fundamentally different from the traditional hydrology curriculum (Shen
et al. 2018). However, with the current growth rate, it is possible that DL will one day be an
integral component of the hydrologic discipline (Shen 2018b).
DL was developed primarily as a tool to learn from data and extract information. It is no
surprise that hydrologists first used DL to learn from the most prevalent hydrologic datasets,
including both satellite-based and gage-based observations. Within this realm, the applica-
tions can be grouped into big-data DL (section 19.2.1.1), small-data DL (section 19.2.1.2),
and information retrieval (section 19.2.3) (Figure 19.1). However, applied mathematicians
and modelers have come to realize that the fundamental approach of DL, including the
tracking of the derivative chain and the back-propagation of error, provides new ways to
support scientific computing and new ways to ask questions (section 19.2.2).
Figure 19.1 A summary of recent progress on deep learning applications in hydrology. DL has
been used to process raw data and extract information, which can then be consumed by other
models (dashed line). With physical intuition and physics-inspired problem setup, data-driven time
series models have been created to successfully model a range of hydrologic variables, in both
small-data and large-data domains, for both forward model runs and short-term forecasting. When
data is limited but the physics is relatively straightforward, physically-informed neural networks
can be incorporated to allow forward modeling or inference of parameters or constitutive
relationships under a limited-data regime. This figure has been inspired by Figure 1 in Tartakovsky
et al. (2018). Dark grey-colored components represent the models.
instances, then it is likely to have learned the fundamental relationship. The use of big data
is the differentiating factor from previous neural network applications, which fitted curves
to a single site/basin. Optionally, one can also provide simulated results from another model
as additional input items.
Time series deep learning models have demonstrated unrivaled predictive accuracy: they
directly learn from the target data and excel at reproducing dynamics of the observations.
Fang et al. (2017) demonstrated that LSTM could learn from soil moisture dynamics
observed by the Soil Moisture Active Passive (SMAP) satellite with high fidelity. Inputs
include atmospheric forcings and land surface characteristics such as soil texture and
slope. The problem is posed in a fundamental and universal way so the trained LSTM is
essentially a hydrologic model that predicts surface soil moisture, and could replace the
corresponding component in land surface models. The test error of the conterminous
United States (CONUS)-scale LSTM model was 0.027, significantly smaller than
SMAP’s design accuracy. In addition, even if only trained for 3 years, the model could
capture multi-year trends in surface and root-zone soil moisture for an unseen period
(Fang et al. 2018), and is thus applicable in the face of mild climate non-stationarity.
However, this evaluation has only been demonstrated on mild trends, as soil moisture
is bounded and has limited length of memory. The effectiveness of LSTM in predicting
stronger trends has not been assessed yet. In rainfall-runoff modeling, Kratzert et al. (2018)
showed that a regional-scale LSTM model trained without basin attributes can produce
high mean Nash-Sutcliffe model efficiency coefficients (NSE). Once basin attributes were
included, a CONUS-scale model gave the highest performance metrics compared to other
conceptual rainfall-runoff models when evaluated on hundreds of basins, with a median
NSE of around 0.75 for their ensemble mean streamflow (Kratzert et al. 2019). Similar
metrics were reported in Feng et al. (2020a), with the forcing data introducing some minor
differences. Rainfall-runoff modeling is the most classic modeling task for hydrologists,
and arguably attracts the most attention. Countless rainfall-runoff models have been
developed in the past with different mechanisms and complexities. The demonstration
that a DL model was able to outperform many different hydrologic models should have
been shocking, but similar feats have already been accomplished in many domains such as
chemistry (Goh et al. 2017) and physics (Baldi et al. 2014).
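A key practical ingredient of these big-data models is that a single network is trained on samples drawn from many basins or grid cells at once, rather than being fitted per site. A minimal sketch of such a sampling step is shown below; the array names and shapes are illustrative assumptions, not the setup of any particular study.

import numpy as np

def sample_minibatch(forcings, attributes, streamflow, seq_len=365, batch_size=256, rng=None):
    # Draw random (basin, time-window) pairs so one network learns across all basins.
    # forcings:   array (n_basins, n_days, n_forcings) of meteorological inputs
    # attributes: array (n_basins, n_static) of static basin descriptors
    # streamflow: array (n_basins, n_days) of observed discharge
    rng = rng or np.random.default_rng()
    n_basins, n_days, _ = forcings.shape
    b = rng.integers(0, n_basins, size=batch_size)
    t0 = rng.integers(0, n_days - seq_len, size=batch_size)
    x_dyn = np.stack([forcings[i, s:s + seq_len] for i, s in zip(b, t0)])
    x_sta = attributes[b]
    y = np.stack([streamflow[i, s:s + seq_len] for i, s in zip(b, t0)])
    return x_dyn, x_sta, y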
Yang et al. (2019b) used basin-average climate forcings along with simulated outputs
from a global hydrologic model to predict floods for global basins. The authors showed
that several global hydrologic models performed reasonably well in terms of amplitude of
peak discharge, but poorly in terms of their timing. Utilizing an LSTM model that received
inputs from the global hydrologic model simulations, the authors pushed the median NSE
from −0.54 to 0.49 on a global scale. Although not mentioned by the authors, the results
also suggested that the LSTM model was learning principles of routing from the mismatch
between model and observations.
For forecasting, all available information including the most recent (or lagged)
observations should be employed with any model to maximally improve the fore-
cast accuracy. Traditionally, the primary way of achieving such a forecast was either
autoregression for statistical models or data assimilation for process-based models. For
autoregression models, the choice of formulation was limited for describing the coupling
between lagged observations and environmental variables. For data assimilation with
Figure 19.2 Performance of the LSTM forecast model for the CAMELS data, in comparison to the
SAC-SMA model, a well-established operational hydrologic model. Figure is from Feng et al.
(2020a) with permission from the authors. DI(1) is the forecast model with data integration of the
1-day-lag streamflow observations. The “-Sub” suffix refers to the 531-basin subset used in previous studies. FLV: the percent bias of the low-flow regime (bottom 30%); FHV: the percent bias of the high-flow regime (top 2%).
this case, the CNN was able to predict the mismatch, and greatly reduce the error with the
simulated water storage.
Rather than directly learning from observations and building a DL model, machine
learning can be employed to estimate parameter sets for process-based models. Krapu et al.
(2019) compared automatic differentiation variational inference, a Bayesian gradient-based
optimization method scheme, to several Markov Chain Monte Carlo schemes for estimat-
ing parameter distributions for a hydrologic model. They coded a hydrologic model in
Theano, a deep learning platform, which allowed the tracking of derivatives through the
hydrologic model, and hence, gradient-based parameter adjustment. The approach was
reported to be a highly effective parameter estimator. At the core, this scheme still solves
an inverse problem. Tsai et al. (2020) proposed a parameter learning scheme that linked
deep learning with a hydrologic model. They turned the parameter estimation problem
into a big data machine learning problem, and demonstrated substantial advantages as the
method scales with more data.
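As a generic illustration of differentiating through a model to adjust its parameters, the sketch below calibrates a toy linear-reservoir bucket model by gradient descent using PyTorch autograd; it is not the Theano model of Krapu et al. (2019) nor the scheme of Tsai et al. (2020), and all forcings and observations are synthetic placeholders.

import torch

def linear_reservoir(precip, pet, k, smax):
    # Toy daily bucket model: storage gains rain, loses ET, and drains linearly (q = k * s).
    s, flows = torch.zeros(()), []
    for p, e in zip(precip, pet):
        s = torch.minimum(torch.clamp(s + p - e, min=0.0), smax)
        q = k * s
        s = s - q
        flows.append(q)
    return torch.stack(flows)

# Synthetic placeholders for forcings and "observed" discharge.
precip, pet, q_obs = torch.rand(365), 0.3 * torch.rand(365), torch.rand(365)
log_k, log_smax = torch.tensor(-2.0, requires_grad=True), torch.tensor(2.0, requires_grad=True)
optimizer = torch.optim.Adam([log_k, log_smax], lr=0.05)
for _ in range(200):
    optimizer.zero_grad()
    q_sim = linear_reservoir(precip, pet, torch.sigmoid(log_k), torch.exp(log_smax))
    loss = torch.mean((q_sim - q_obs) ** 2)
    loss.backward()      # gradients propagate through every time step of the model
    optimizer.step()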
weather forecast, and reservoir characteristics (storage volume). The model achieved NSE
values of 0.85, 0.93, and 0.66 for three reservoirs in Southeast Asia with catchment sizes of
26,386 km2 , 13,130 km2 , and 4,254 km2 respectively, which seems to suggest that smaller
dams are more difficult to predict. The outflow in the dry season of 2012 was underesti-
mated because of an abrupt change in the operation rules during that period, which had
never been seen in the training data. For the Gezhouba reservoir in China (catchment area
>1×106 km2 ), Zhang et al. (2018a) used current inflow rates, lagged inflow and outflow
rates, water levels in the downstream region, and month of the year as inputs to an LSTM
model. They showed that their LSTM model outperformed other data-driven methods,
such as a support vector machine and a simpler neural network. While these results
showed the promise of LSTM models, it was less clear what features the LSTM constructed to model the operation, and even less clear whether longer-range forecasts can succeed using lagged outflows, as this variable may not be attainable for longer-term simulations.
Apart from LSTM, Yen et al. (2019) used a deep echo state network (also called reservoir
computing), a different form of recurrent neural network, to forecast rainfall using hourly
meteorological variables as inputs, including air pressure, temperature, humidity, wind
speed, wind direction, current precipitation, and sea level. Perhaps because the analysis
scope was local (no spatial dependence or large-scale dataset was considered), the forecast
had limited capability. The authors also did not compare the results with LSTM. However,
this work is mentioned here because reservoir computing has shown great success in
modeling chaos (Pathak et al. 2017), and should gain more attention from hydrologists.
We are currently witnessing a surge in time series DL applications in a data-limited
setting. Despite the solid progress and widely reported superior performance, there are
nonetheless some potential concerns regarding such DL applications. The most significant
one is that because the data from a small number of sites are employed for both training and
test, there are limited opportunities to train or evaluate the model for extreme conditions.
These pitfalls are typically mitigated by exposing the model to myriad situations in a big
data setting, but are more problematic for data-limited scenarios. This situation can lead
to reduced reliability for rare events, as witnessed by Yang et al. (2019a) described above.
The implications of such limitations should be carefully considered for mission-critical
applications. Also, as such a model is entrained in the specific dynamics at a site and
therefore does not need to understand the implications of control factors, these models
do not capture the fundamental processes, e.g., rainfall-runoff responses or reservoir
operation policies, and thus cannot be migrated to other regions.
It is well-known that DL, and especially reinforcement learning, can be used for
decision making. AlphaGo (Silver et al. 2016) has rocked the world with an AI
(a combination of Monte Carlo tree search and CNNs that assess the strength of the
player’s position in the game and make proposals of next moves) that is capable of very
long-term, better-than-human decision-making for an extremely complex game. Such
decision-making applications are still rare in hydrology, but the reservoir management
problem seems like an obvious target. Matheussen et al. (2019) combined direct policy
search methods with DL to optimize the reservoir management in a Norwegian system of
two reservoirs. The authors used a direct policy search to optimize reservoir operations
and obtain maximum power generation profits given variables such as hydropower price, inflows, constraints on minimum reservoir levels, and start-stop
costs of machinery. Then, they ran an ensemble of simulations to produce inflow and price
scenarios and obtain their respective best policies, which were used as a training dataset
for a multilayer perceptron (MLP) network. This network could then directly predict the
optimum policy, and was used in a sequential simulation to determine the overall system
performance in terms of profits.
Urban water systems are highly complex systems to model, yet data-driven approaches
can prove to be a cost-effective option to enable rapid deployment and fast response to
infrastructure management. Karimi et al. (2019) simulated wastewater flow rate in a sewage
network based on hour index, rainfall rate, and groundwater level, and were able to obtain
an R2 value of around 0.81 for different periods. They showed that including groundwater
data improved the model, suggesting a connection between groundwater and the sewage
network. However, more evidence in different scenarios and locations would be needed to
confirm that the model is not overfitting to this signal. Liu et al. (2019) forecasted water
quality parameters for city drinking water intakes based on LSTM and lagged water quality
data. However, it was not entirely clear if weather forcing attributes were included in the
inputs, and if not, what inputs were driving the model.
With the help of DL and more instrumentation, the operation of the urban water system
can be automated, providing more lead time and allowing for more monitoring. On the
other hand, due to the uniqueness of each water system, most of the applications belong in
the data-limited setting, meaning that models built in one city cannot be directly employed
in other cities, and so only places with sufficient history of monitoring could tap into this
prediction potential.
The features of DL, e.g., high accuracy, high efficiency, low cost, and low barriers, are
bound to increase the public’s access to hydrologic predictions. If better prediction is the
sole purpose, DL offers not only highly accurate results but orders-of-magnitude lower cost
in terms of both model preparation/validation and run-time computation efforts. If under-
standing the relationships or causes and effects is the priority, then we need to employ more
interpretable AI techniques, which are so far rare in geosciences.
in the training dataset. However, this model required many supporting variables that were
only available at the monitoring sites.
Read et al. (2019) created a process-guided deep learning framework to simulate lake
temperature. This model was based on LSTM, but modified to impose a soft penalty for vio-
lating the conservation of energy. The authors also employed model pre-training to initialize
the network weights, using outputs from a process-based model. They reported a supe-
rior performance of the hybrid model as compared to either the process-based model or a
data-driven model.
It is worth mentioning that the principles that can be integrated into DL models so far are
few, and integrating them is more difficult for more complicated models. The engineering complexity will
undoubtedly increase with more complex systems. It also requires delicate decisions with
respect to which physical laws are to be retained. However, these hybrid models seem to be
an important direction for improvement in the accuracy of future DL-based hydrological
solutions.
have already demonstrated the potential for precipitation retrieval from a mixture of data
sources including microwave data.
The current level of AI makes it possible to retrieve information from massive and uncon-
ventional datasets, and leverage help from citizen scientists. For example, the PhenoCam
network consists of nearly 500 cameras in North America for vegetation monitoring (Seyed-
nasrollah et al. 2019; Richardson et al. 2018). Kosmala et al. (2018) enlisted crowdsourced
labeling on whether snow exists on the images, which they first verified against expert labels
and then used to further tune a pre-trained CNN. They replaced the classification layer of
the model by a Support Vector Machine (SVM) and trained the SVM, which obtained better
results than classifying the scene with Places365-VGG and determining if one of the top five
categories for an image contained snow. Despite the heterogeneity in camera model, camera
view/configuration, and background vegetation, the SVM trained on Places365-VGG features produced an accuracy higher than 97%.
Jiang et al. (2018a) used transfer learning and Inception V3, a pre-trained CNN model,
to extract urban water-logging depths from video images. They showed an R2 of 0.75–0.98 and a root-mean-square error of 0.031–0.033 m for their two test datasets. Pan et al. (2018)
proposed a low-cost system to read water level information from unmanned surveillance
videos.
Pan et al. (2019) trained a CNN to estimate precipitation from geopotential height
and precipitable water data from the National Centers for Environmental Prediction
(NCEP) North American Regional Reanalysis (NARR) dataset, which was obtained by
regionally downscaling NCEP global reanalysis. For the western and eastern coasts of the
United States, where there is more rainfall, the CNN model was stronger than conventional
stochastic downscaling products and the reference NARR precipitation data, a high-quality
baseline that has already assimilated precipitation observations. It is noteworthy that this
CNN model is closer to a weather downscaling model than to information retrieval. Since it
has to perform forecasts, the model needs to capture how precipitation evolves over time
from given initial conditions. The authors argued that this model can be seamlessly trans-
ferred to numerical weather modeling, suggesting the model could be an improvement
over our present precipitation prediction methods.
models are used in a forward mode, i.e., solving for model states when given parameters.
The inverse problem is solved via various algorithms that run the costly forward simula-
tions many times to maximize the fit between the simulated states and the observations,
or to estimate the distribution for the parameter sets that make the observations possible.
However, DL provides a novel and efficient way to obtain such inverse mapping. Of course,
the inverse mappings are not unique, so the uncertainty needs to be estimated. Sun (2018)
demonstrated that it was possible to use the Generative Adversarial Network (GAN)
method to generate conductivity (K) fields using hydraulic head (H) fields, and vice versa.
The GAN, composed of two CNNs, was trained using 400 pairs of K fields (generated by
a geostatistical method) and their resulting H fields (obtained using a groundwater flow
solver after the solutions were evolved for a fixed time step). Both log-normal K fields with a
correlation length and K fields with bimodal distribution were tested, and in both scenarios
the GAN-estimated fields were similar to the original ones. Although there could be some
concerns regarding the sample size and the model performance in real-world situations,
the study showed the possibility for DL models to directly learn the inverse problem.
Compared to multi-point statistical methods, the CNNs can capture highly complex
spatial structure that was not limited to the first two statistical moments (Laloy et al. 2018).
Similarly, Mo et al. (2019a, b) turned the multi-phase flow and reactive transport prediction
problem into an image-to-image regression problem with a densely connected convolutional neural
network (DenseNet), where the time step was used as an input argument whose effects are
learned. DenseNet was essentially used as a surrogate model, but compared to previous
surrogate models, it can reproduce the full 3D dynamics governed by the partial differential
equation (PDE) to enable fast simulations and uncertainty estimates.
The networks used in Sun (2018) and Mo et al. (2019b) were trained entirely on numerical
solutions. If the problem is complex with varied boundary and forcing conditions, it could
require a large number of expensive numerical solutions to train, which weakens the moti-
vation for using a data-driven approach. The branch of physics-informed neural networks
(PINNs) has developed rapidly to address this issue. It should be noted that PINN has a dif-
ferent scope than previously advocated theory-guided data science (TGDS) (Karpatne et al.
2017a). TGDS is an overall concept to bring physical principles such as mass conservation
into neural networks. PINN is more targeted toward data-driven scientific computing, and
seeks to encode PDEs into the formulation of the neural network.
Raissi et al. (2019) proposed a form of PINN by supervising the derivatives of a network
with the PDE, instead of learning all the physics from numerical solutions. Such supervi-
sion is possible because modern machine learning infrastructure allows one to calculate the
derivatives of the output with respect to its inputs. If a neural network can predict u = f (x, t),
then its derivatives 𝜕u∕𝜕x and 𝜕u∕𝜕t can be extracted by automatic differentiation. These
derivatives can be put together as dictated by the PDE, and thus a network can be trained
to respect a PDE. They demonstrated that this approach can infer solutions with multi-
ple governing equations, and can also identify system equations, with a limited amount of
training data. Tartakovsky et al. (2018) extended this framework for saturated groundwa-
ter flow problems to estimate (i) a heterogeneous K field using scattered observations of H;
and (ii) a nonlinear K as a function of H. Problem (i) is somewhat similar to Sun (2018), but Tartakovsky et al. (2018) only trained the network on steady-state solutions and scattered observations, which is closer to reality; the inclusion of physics allowed the model to be trained with a single K field. Because of its unique way to model u = f (x, t), PINN is also
useful for data assimilation (He et al. 2020). It is noteworthy that the PINN framework can
be cast in either a discretized time-step version or a continuous version, and it was reported
that just learning the constitutive relationships (the dependence of the parameters on the system
states) produced more reliable results than learning the whole dynamics (Tipireddy et al.
2019). Furthermore, this framework has been scaled to very high performance on the Sum-
mit supercomputer (Yang et al. 2019). A different flavor of PINN proposed to transform the
PDE into a minimization problem (Zhu and Zabaras 2018; Zhu et al. 2019c). However, this
kind of transformation requires substantial customized adaptation of the method to each
physical equation.
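As a generic sketch of this derivative supervision, the snippet below builds the residual of a simple 1D diffusion equation, ∂u/∂t = D ∂²u/∂x² (a toy equation chosen for illustration, not the groundwater equations of the cited studies), and combines it with a data misfit term; all tensors are placeholders.

import torch
import torch.nn as nn

net = nn.Sequential(nn.Linear(2, 64), nn.Tanh(), nn.Linear(64, 64), nn.Tanh(), nn.Linear(64, 1))
D = 0.1  # assumed diffusivity for this toy example

def pde_residual(x, t):
    # Residual of u_t - D * u_xx at collocation points (x, t), via automatic differentiation.
    x = x.clone().requires_grad_(True)
    t = t.clone().requires_grad_(True)
    u = net(torch.stack([x, t], dim=-1)).squeeze(-1)
    u_t = torch.autograd.grad(u.sum(), t, create_graph=True)[0]
    u_x = torch.autograd.grad(u.sum(), x, create_graph=True)[0]
    u_xx = torch.autograd.grad(u_x.sum(), x, create_graph=True)[0]
    return u_t - D * u_xx

# Total loss = data misfit at scattered observations + PDE residual at collocation points.
x_obs, t_obs, u_obs = torch.rand(50), torch.rand(50), torch.rand(50)   # placeholder observations
x_col, t_col = torch.rand(1000), torch.rand(1000)                      # placeholder collocation points
u_pred = net(torch.stack([x_obs, t_obs], dim=-1)).squeeze(-1)
loss = torch.mean((u_pred - u_obs) ** 2) + torch.mean(pde_residual(x_col, t_col) ** 2)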
PINN is still nascent and has its limitations, e.g., for every combination of initial and
boundary conditions, the PINN needs to be retrained. Thus it is more suitable for solv-
ing inverse problems than being used as a replacement for traditional numerical modeling.
Ultimately, the advantages of DL include allowing us to ask questions in a novel manner
(mapping relationships that could not be pursued using traditional modeling approaches)
and the ability to continue to learn from data beyond known physics. While research on
data-driven scientific computing is advancing rapidly, there are currently still some limita-
tions with respect to flexibly handling boundaries, time stepping, and full 3D simulations.
Presently most DL models focus on a single task and are tailored to their respective appli-
cations and domains. In the numerical modeling domain, multiple components for differ-
ent tasks can be coupled together as integrated land surface hydrologic models (Maxwell
et al. 2014; Ji et al. 2019; Lawrence et al. 2018). For them, the interfacing requires specific
handling and skills (Kollet and Maxwell 2006; Shen et al. 2016; Camporese et al. 2010).
Interface handling may similarly be needed when networks are put together to enable
multi-physics large-domain simulations, for example, if a network predicting groundwater
level fluctuation is coupled to a network predicting streamflow.
Uncertainty quantification remains challenging for varied model architectures. A cen-
tral question is whether we can tell if a new instance is close to the training dataset. There has been
some initial investigation, e.g., Fang et al. (2020) tested the Monte Carlo dropout scheme,
in which repeatedly running the network with randomized dropout masks approximates variational Bayesian inference (Gal and Ghahramani 2015). However, much more testing and algorithm improvement is required
for different models under different application scenarios.
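A minimal sketch of Monte Carlo dropout at prediction time is given below; it assumes a model whose only stochastic layers are dropout (calling model.train() would otherwise also affect layers such as batch normalization), and the function name is ours.

import torch

def mc_dropout_predict(model, x, n_samples=100):
    # Approximate predictive mean and spread by keeping dropout active at inference time.
    model.train()          # keeps dropout layers stochastic (Gal and Ghahramani 2015)
    with torch.no_grad():
        draws = torch.stack([model(x) for _ in range(n_samples)])
    return draws.mean(dim=0), draws.std(dim=0)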
For physics-guided machine learning for scientific computing, many of the demonstrated
cases are for steady-state solutions, 1D or 2D. Learning to reproduce 3D transient solutions
remains difficult, as it can require too much training data or too large a computational
demand. In addition, numerical solutions can accommodate various boundary and initial
conditions, where these different configurations would need to be covered in the training
dataset and thus could entail significant effort in training data preparation. Future efforts
will need to develop ways to more easily teach DL models the meaning of different boundary
and source conditions.
Present applications of time series deep learning models have been mostly learning
directly from data, and are thus limited to variables that can be directly observed.
We anticipate future uses will feature a deep meshing and integration between DL and
physically-based models, to overcome multiple issues facing purely data-driven models
and to use DL as a knowledge discovery tool. Process-based models could be used to assess
causal controls and distinguish between competing factors in an adversarial fashion (Fang et al. 2020). The next stage may see in-depth modification of DL algorithms to fit the
needs of hydrology and to offer a full suite of services to fit the tasks that society asks of
hydrologists.
Acknowledgments
This work was supported by National Science Foundation Award OAC #1940190 and the
Office of Biological and Environmental Research of the U.S. Department of Energy under
contract DE-SC0016605.
20
Deep Learning of Unresolved Turbulent Ocean Processes
in Climate Models
Laure Zanna and Thomas Bolton
20.1 Introduction
Current climate models do not resolve many nonlinear turbulent processes, which occur on
scales smaller than 100 km, and are key in setting the large-scale ocean circulation and the
transport of heat, carbon and oxygen in the ocean. The spatial-resolution of the ocean com-
ponent of climate models, in the most recent phases of the Coupled Model Intercomparison
Project, CMIP5 and CMIP6, ranges from 0.1∘ to 1∘ (Taylor et al. 2012; Eyring et al. 2016b).
For example, at such resolution, mesoscale eddies, which have characteristic horizontal
scales of 10–100 km, are only partially resolved – or not resolved at all – in most regions
of the ocean (Hallberg 2013). While numerical models contribute to our understanding of
the future of our climate, they do not fully capture the physical effects of processes such
as mesoscale eddies. The lack of a resolved mesoscale eddy field leads to biases in ocean
currents (e.g., the Gulf Stream or the Kuroshio Extension), stratification, and ocean heat
and carbon uptake (Griffies et al. 2015).
To resolve turbulent processes, we can increase the spatial resolution of climate mod-
els. However, we are limited by the computational costs of an increase in resolution
(Fox-Kemper et al. 2014). We must instead approximate the effects of turbulent processes,
which cannot be resolved in climate models. This problem is known as the parame-
terization (or closure) problem. For the past several decades, parameterizations have
conventionally been derived from semi-empirical physical principles, and when imple-
mented in coarse resolution climate models, they can lead to improvements in the mean
state of the climate (Danabasoglu et al. 1994). However, these parameterizations remain
imperfect and can lead to large biases in ocean currents, ocean heat and carbon uptake.
The amount – and availability – of data from observations and high-resolution simula-
tions has been increasing. These data contain spatio-temporal information that can com-
plement or surpass our theoretical understanding of the effects of unresolved (subgrid)
processes on the large-scale, such as mesoscale eddies. Efficient and accurate deep learning
algorithms can now be used to leverage information within this data, exploiting subtle pat-
terns previously inaccessible to former data-driven techniques. The ability of deep learning
to extract complex spatio-temporal patterns can be used to improve the parameterizations
of subgrid scale processes, and ultimately improve coarse resolution climate models.
the parameterization and its functional form. Some early attempts at data-driven ocean
eddy (subgrid) parameterizations showed interesting results in idealized model setups
(e.g., Berloff 2005; Mana and Zanna 2014; Zanna et al. 2017). More generally, data-driven
modeling of turbulence has advanced in recent years (Duraisamy et al. 2019). The advent
of new tools from machine learning can improve the computational efficiency and gener-
alization of data-driven parameterizations. Combined with the increasing wealth of data
from observations and high-resolution simulations, it is now possible to begin investigating
more thoroughly how data-driven parameterizations can improve the representation of
unresolved processes and reduce model biases in long-range climate simulations.
where the coefficients gn are predicted by the NN using only Galilean-invariant inputs.
The deep NNs outperform linear regression models only after applying these physical con-
straints. Integrating physical principles into data-driven algorithms is important for fidelity,
but can also boost the predictive skill of the resulting parameterization.
Additional studies have used NNs to parameterize eddy momentum fluxes in models of
freely-decaying 2D turbulence (Maulik and San 2017; San and Maulik 2018; Maulik et al.
2019; Cruz et al. 2019) or in large-eddy simulations (Zhou et al. 2019). For example, Maulik
et al. (2019) used NNs to parameterize eddy vorticity fluxes, which were then implemented
back into the same model. They used the conventional Smagorinsky and Leith eddy vis-
cosity functions (Smagorinsky 1963; Leith 1968) as input features to one of the NNs: this
did not improve the predictive skill of the NN but did improve the numerical stability of
the turbulence model once the NN is implemented. By doing so, they removed upgradient
momentum fluxes to stabilize the numerical simulations but therefore altered the physics
of turbulence processes. Nonetheless, incorporating physical and mathematical properties
into DL algorithms is an important step for making parameterizations physically-consistent
and potentially improving their performance when implemented into a climate model.
For parameterizations of ocean turbulence, a handful of studies using CNNs have
emerged. Bolton and Zanna (2019) and Zanna and Bolton (2020) used CNNs to param-
eterize ocean mesoscale eddies in idealized models. They showed that CNNs can be
extremely skillful for eddy momentum parameterizations, with predictions that generalize
very well to different dynamical regimes (e.g., different ocean conditions and turbulence
regimes). Another idealized study by Salehipour and Peltier (2018) showed the potential
of CNNs to parameterize ocean vertical mixing rates. The DL algorithm could predict the
mixing efficiency well beyond the range of the training data, producing a more universal
parameterization compared to previous studies.
However, the underlying assumptions behind this approach are that the structural form of the parameterization is a correct representation of a given process, and that no parameterizations other than the ones already in use are needed. Neither of these assumptions is valid in
ocean models (Zanna et al. 2018), since not all parameterizations included in ocean models
are correct or encompass all the missing processes.
A second avenue is a change in the loss function used during optimizations. The loss
function can be adjusted to include additional constraints, such as global conservation of
mass, momentum, salt, or energy. This simple approach helps ensure that the system tends
toward such conservation principles (Beucler et al. 2019). However, the conservation laws
may not be strictly enforced, only approximately, unless hard constraints are used.
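As an illustration of such a soft constraint, the sketch below adds a penalty on the area-integrated subgrid momentum forcing to a standard regression loss; the specific constraint (that the global integral of the predicted forcing should vanish) and the penalty weight are assumptions made for illustration, not the formulation of any particular study.

import torch

def loss_with_conservation(s_pred, s_true, area, weight=0.1):
    # MSE plus a soft penalty pushing the area-integrated subgrid momentum forcing toward zero.
    mse = torch.mean((s_pred - s_true) ** 2)
    integral = torch.sum(s_pred * area) / torch.sum(area)   # area-weighted global mean forcing
    return mse + weight * integral ** 2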
The third avenue is to modify the architecture of the NNs (Ling et al. 2016a; Zanna
and Bolton 2020). For example, Zanna and Bolton (2020) used maps of resolved velocity
components as input to the CNN in order to predict both components of the subgrid
eddy momentum forcing Ŝ x and Ŝ y . To physically-constrain the architecture, they used
a specifically-constructed final convolutional layer with fixed parameters (Figure 20.1).
The activation maps of the second-to-last convolution layer represent the elements of an
eddy stress tensor T. The final convolution layer then takes the spatial derivatives of the
activation maps of the second-to-last convolutional layer (i.e., the eddy tensor elements)
using fixed filters, representing central-difference stencils to form the two outputs Ŝ x , and
Ŝ y for eddy momentum forcing. This ensures that the final prediction originates from
taking the divergence of a symmetric eddy stress tensor, achieving global momentum and vorticity conservation.
[Figure 20.1 Physically constrained CNN architecture: the resolved velocity components (u, v) pass through trainable convolutional layers (128, 64, and 16 filters), whose final activation maps represent the stress-tensor components T00, T10, and T11; a last convolutional layer with fixed central-difference filters (∂/∂x, ∂/∂y) then computes Ŝx = ∂T00/∂x + ∂T10/∂y and Ŝy = ∂T10/∂x + ∂T11/∂y, i.e., Ŝ = ∇ ⋅ T. The diagram also points to interpretability (to enhance knowledge) and to implementation in coarse-resolution simulations (to improve climate projections).]
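To make this construction concrete, a minimal sketch of such a fixed, non-trainable final layer is given below; the grid spacing, padding, and axis orientation are illustrative assumptions rather than details of the cited implementation.

import torch
import torch.nn.functional as F

def divergence_of_stress(T, dx=1.0):
    # T: tensor (batch, 3, H, W) holding the predicted stress components (T00, T10, T11).
    # Returns Sx = dT00/dx + dT10/dy and Sy = dT10/dx + dT11/dy using fixed central-difference
    # stencils, so the output is the divergence of a symmetric stress tensor.
    kx = torch.tensor([[0., 0., 0.], [-1., 0., 1.], [0., 0., 0.]]).view(1, 1, 3, 3) / (2 * dx)
    ky = torch.tensor([[0., -1., 0.], [0., 0., 0.], [0., 1., 0.]]).view(1, 1, 3, 3) / (2 * dx)
    t00, t10, t11 = T[:, 0:1], T[:, 1:2], T[:, 2:3]
    sx = F.conv2d(t00, kx, padding=1) + F.conv2d(t10, ky, padding=1)
    sy = F.conv2d(t10, kx, padding=1) + F.conv2d(t11, ky, padding=1)
    return torch.cat([sx, sy], dim=1)   # sign of the y derivative depends on the grid convention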
Figure 20.2 Evaluation in an idealized model, zonal velocity (time-mean, left and standard
deviation, right): a) Coarse-resolution (30 km), b) Coarse-resolution (30 km) with physics-aware CNN
parameterization implemented, c) High-resolution (3.75 km).
Conventional physics-based parameterizations, while showing some success, remain sub-optimal as many processes remain poorly
represented or are missing from models. Deep learning can help bridge the gap and
improve the representation of missing processes using the wealth of new data from
high-resolutions simulations and observations, together with physical constraints (as
described in section 20.4 and other chapters of this book). There are, however, several
challenges ahead in developing physics-aware ML parameterizations, which relate to: how
and what to learn from data; how to improve the generalization of ML parameterizations;
the interpretability of the resulting algorithm.
Learning from data. Dealing with substantial amounts of data to train ML algorithms
remains an obstacle in deriving subgrid parameterizations, but coordinated efforts are well
underway to break this barrier, such as the Pangeo project (e.g. Eynard-Bontemps et al.
2019).
However, defining “subgrid” (or unresolved) scales, via an averaging procedure,
from either model or observational data is a non-trivial but crucial component of any
data-driven parameterization, one that is often overlooked. The choice of subgrid definition
directly impacts what physical processes will be captured by the data-driven parameteriza-
tion. Gentine et al. (2018) and Rasp et al. (2018) were able to by-pass this problem by using
data directly extracted from a 2D high-resolution model embedded into a coarse-resolution
climate model; therefore, the “subgrid” scales were available without additional processing.
However, this case is an exception. Most other groups tackling ML parameterizations have
so far used spatial coarse-graining, which produces a local definition of eddy forcing, on
a uniform grid (Bolton and Zanna 2019; Zanna and Bolton 2020) or in non-spherical geometry
(Brenowitz and Bretherton 2018; Yuval and O’Gorman 2020). The choice of averaging
procedure has a significant impact on the nature of the resulting subgrid forcing and the
separation of scales (as illustrated in Figure 20.3 for the subgrid momentum forcing). The
choice of how to separate resolved and unresolved scales can lead to artifacts in the evalua-
tion of nonlinear subgrid forcing (e.g., panel d in which a simple coarse-graining procedure
is used), or can produce different patterns and magnitudes of subgrid forcings (e.g., panels
d–f, which show the effects of using coarse-graining, a low-pass filter, or a combination
of a low pass filter with coarse-graining). If using a (low-pass) filter, the spatial scale of
the filter should also be carefully considered when dealing with spherical coordinates as
the subgrid forcing will change in spatial scale as well; e.g., the Rossby deformation scale
at which mesoscale eddies are resolved varies with latitudes. Whether these definitions
(panels d–f) are truly representative of the missing forcing in a coarse-resolution model
remains to be determined.
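The sketch below contrasts two such subgrid definitions for the zonal eddy momentum forcing, using simple block averaging versus a Gaussian low-pass filter on a doubly periodic grid; the forcing definition Sx = filter(u·∇u) − ū·∇ū, the filter scale, and the coarsening factor are assumptions made for illustration only.

import numpy as np
from scipy.ndimage import gaussian_filter

def ddx(f, dx):   # centered, periodic derivative along the last (x) axis
    return (np.roll(f, -1, axis=-1) - np.roll(f, 1, axis=-1)) / (2 * dx)

def ddy(f, dy):   # centered, periodic derivative along the second-to-last (y) axis
    return (np.roll(f, -1, axis=-2) - np.roll(f, 1, axis=-2)) / (2 * dy)

def coarse_grain(f, factor):
    # Block-average onto a grid `factor` times coarser in each direction (shapes assumed divisible).
    ny, nx = f.shape
    return f.reshape(ny // factor, factor, nx // factor, factor).mean(axis=(1, 3))

def subgrid_zonal_forcing(u, v, dx, dy, sigma=4, factor=4):
    # Two definitions of the zonal eddy momentum forcing Sx = bar(u·∇u) − ū·∇ū.
    adv = u * ddx(u, dx) + v * ddy(u, dy)
    # (1) Low-pass filter: filtered advection minus advection of the filtered flow.
    u_f, v_f = gaussian_filter(u, sigma), gaussian_filter(v, sigma)
    s_filter = gaussian_filter(adv, sigma) - (u_f * ddx(u_f, dx) + v_f * ddy(u_f, dy))
    # (2) Coarse-graining: same idea, but with block averaging onto the coarse grid.
    u_c, v_c = coarse_grain(u, factor), coarse_grain(v, factor)
    s_coarse = coarse_grain(adv, factor) - (u_c * ddx(u_c, factor * dx) + v_c * ddy(u_c, factor * dy))
    return s_filter, s_coarse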
Generalization of ML parameterizations. Another obstacle to accurate ML parame-
terizations is their ability to generalize to different regimes or conditions (i.e., to extrapolate
outside the range on which they were trained). While CNNs for ocean eddy momentum
parameterizations have shown great success in generalizing to different turbulent regimes
(Bolton and Zanna 2019), when implemented into ocean models, even if physical constraints are imposed, they can lead to unphysical behaviors without ad-hoc tuning (Zanna and Bolton 2020). There are several ways to improve the generalization of ML parameterizations, which include: (i) learning from a range of high-resolution simulations under different
regimes (O’Gorman and Dwyer 2018) and optimally combining the resulting DL parame-
terizations as suggested by Bolton and Zanna (2019) while imposing physical constraints;
(ii) the use of causal inference to target physical relationships in the training data to be used
as input to the DL algorithm; this has the potential to select variables which co-vary according to physical laws and therefore to constrain the algorithm to reproduce that relationship, even in unseen conditions. The training data, whether from high-resolution numerical models or observations, possess some biases which may limit the performance or accuracy of the ML parameterizations. A potential way forward is transfer learning: one trains ML algorithms with abundant model data and re-tunes the ML parameterizations with observations (Chattopadhyay et al. 2020), which have fewer biases. Transfer learning could also potentially improve the generalization of these deep learning models.
Figure 20.3 Illustrative example of the effects of the averaging procedure on the corresponding zonal eddy momentum forcing (Sx ): assume a Gaussian ellipse streamfunction ψ ∝ exp(−(ax² + by²)), which emulates a coherent vortex (panel a), with associated velocity components u = −∂ψ/∂y (panel b) and v = ∂ψ/∂x (panel c). Panel (d) shows coarse-graining, panel (e) a Gaussian spatial filter, and panel (f) Gaussian spatial filtering followed by coarse-graining.
Interpretability. Finally, for deep learning models in general, predictive skill is valued
above other factors such as interpretability. In general, it is difficult to understand how deep
learning methods transform an input into the target variable. The final prediction of a CNN
is a culmination of the information extracted from the previous convolutional layers of the
network. We can talk broadly about how convolution layers automate feature extraction,
and then attempt to dissect the feature maps of the intermediate layers, but identifying
exactly what features are being extracted by the many learnt filters can be cumbersome
or sometimes completely unfeasible. For example, in Figure 20.4, the first layers extract first- and second-order spatial derivatives of the input streamfunction, i.e., velocities and velocity shears.
Figure 20.4 Interpretability: Activation maps are the result of the convolution acting on the previous layer's output, which is then passed through the activation function. Here, a radially-symmetric Gaussian function generating an eddy is fed into the already-trained NN for an ocean subgrid parameterization by Bolton and Zanna (2019). The activation maps for each convolutional layer are shown. The activation maps for the first convolution layer are a collection of first- and second-order derivatives. Therefore, without a-priori knowledge, the neural network learns to take derivatives of the input streamfunction, which correspond to velocities and velocity shears. This is a robust feature across all of the NNs trained to predict the eddy momentum forcing.
21
Deep Learning for the Parametrization of Subgrid
Processes in Climate Models
Pierre Gentine, Veronika Eyring, and Tom Beucler
21.1 Introduction
Earth system models simulate the physical climate and biogeochemical cycles under a wide
range of forcings (e.g., greenhouse gases, land use, and land cover changes). Given their
complexity and number of processes represented, there is persistent inter-model spread in
their projections even for a given prescribed carbon dioxide (CO2 ) concentration pathway
(IPCC 2013; Schneider et al. 2017b). Despite significant progress in climate modeling over
the last decades (Taylor et al. 2012; Eyring et al. 2016c), the simulated range for effective cli-
mate sensitivity (ECS), i.e. the change in global mean surface temperature for a doubling of
atmospheric CO2 concentration, has not decreased since the 1970s. It still ranges between
2.1 and 4.7 ∘ C and is even increasing in the newest generation of climate models participat-
ing in the World Climate Research Programme (WCRP) Coupled Model Intercomparison
Project Phase 6 (CMIP6, Eyring et al. (2016c)) (see Figure 21.1).
One of the largest contributions to this uncertainty stems from differences in the
representation of clouds and convection (i.e, deep clouds) occurring at scales smaller than
the model grid resolution (Schneider et al. 2019; Stevens and Bony 2013; Bony et al. 2015;
Stevens et al. 2016; Sherwood et al. 2014). These processes need to be approximated in
global models using so-called parametrizations, i.e. an empirical representation of the
process at play, because the typical horizontal resolution of today’s global Earth system
and climate models is around 100 km or more. This limits the models’ ability to accurately
project global and regional climate changes, as well as climate variability, extremes and
their impacts on ecosystems and biogeochemical cycles. Yet, accurate projections are
essential for efficient adaptation and mitigation strategies (IPCC 2018) and for assessing
targets to limit global mean temperature increase below 1.5 ∘ C above pre-industrial levels,
as defined in the Paris Agreement (UNFCCC 2015). Reducing uncertainties in climate
projections in the next decade can also significantly reduce associated economic costs
(Hope 2015).
The long-standing deficiencies in cloud parametrizations (Randall et al. 2003; Boucher
et al. 2013; Flato et al. 2013) have motivated the developments of high-resolution global
cloud-resolving climate models with the ultimate goal to explicitly resolve clouds and con-
vection (Schneider et al. 2017b; Stevens et al. 2019), as well as shorter duration large-eddy
simulations (LES), resolving most of the energy-containing atmospheric turbulence and
Figure 21.1 Effective Climate Sensitivity. Assessed range of effective climate sensitivity in IPCC
Reports over the years (blue bars). ECS values from individual models participating in CMIP5 and
CMIP6 are shown in addition (symbols). Source: Modified from Meehl et al. (2020).
covering up to a few hundreds of kilometers (Tonttila et al. 2017). Yet, these simulations are
extremely computationally demanding so they can only be run for a few days to months and
cannot be used for long-term climate projections in the foreseeable future. Coarse-scale model
simulations, in particular those from Earth system models that include additional Earth sys-
tem components beyond the physical climate such as the carbon cycle, will therefore con-
tinue to be required. These additional processes are needed to represent key feedbacks that
affect climate change, such as the biogeochemistry cycle, but are also likely to increase the
spread of climate projections across the multi-model ensemble even further. For instance,
future terrestrial carbon uptake remains one of the most uncertain processes in the Earth
system, as even its mere sign in the future is unknown (Friedlingstein et al. 2014).
Yet, as many cloud and convection processes are explicitly resolved in high-resolution
simulations (Figure 21.2), these simulations can serve as important sources of information
to constrain small-scale representation (parametrizations) in coarse-resolution Earth sys-
tem and climate models. Recent developments in machine learning, in particular deep learning, thus provide unique new opportunities for the development of improved Earth system and climate models.
In this chapter, we present pioneering results and progress on the use of machine learn-
ing for cloud parametrizations that can replace typical parametrizations in coarse-scale
Earth system and climate models (section 21.2) and discuss studies that particularly address
generalization and the implementation of physical constraints into machine learning algo-
rithms (section 21.3). Section 21.4 closes with an outlook of remaining challenges for this
new and exciting interdisciplinary research field, which we argue has huge potential to
improve understanding and modeling of the Earth system, in particular if guided both by
data and by physical knowledge.
21.2 Deep Neural Networks for Moist Convection (Deep Clouds) Parametrization
Figure 21.2 Schematic representation of clouds in current climate models and the objective to
represent them similarly to very fine resolution models. In coarse-scale climate models (left)
small-scale physical processes need to be empirically represented as a function of the coarse-scale
resolved variables, such as the mean temperature or humidity over the grid at a given vertical level.
These small-scale processes can be explicitly resolved in high-resolution cloud resolving models
(right).
Gentine et al. (2018) and Rasp et al. (2018) demonstrated that deep convection simulated
by a cloud-resolving model (CRM) could be correctly emulated by a deep neural network,
at a scale comparable to a coarse global climate model (GCM). The authors used a
super-parametrization (SP) of convection, where 2-dimensional (2D) CRMs (in the y
and z directions) were embedded in a coarse GCM. They further idealized the setup by
prescribing oceanic surface conditions with a steady latitudinal temperature gradient, and
without continents, topography, and sea ice (aquaplanet setup as in Stevens and Bony
(2013)). This strategy allowed the authors to bypass the coarse-graining of CRMs typically
required for ML subgrid-scale parametrizations (Figure 21.3), a notoriously difficult step.
The NN maps the vertical temperature T(z), specific humidity q(z), surface pressure ps,
solar insolation S0, surface sensible flux H, and latent heat flux LE (inputs) to the heating
and moistening tendencies of the coarse-grained SP model (outputs). In more technical
terms, this means that the predicted heating ∂T/∂t and moistening ∂q/∂t physics tendencies
(rate of change of temperature and moisture due to the physical components of the model
unlinked to advection) of the coarse pixels (Figure 21.3) can be written as:
∂X/∂t = NN(T, q, ps, S0, H, LE),    (21.1)
with X the grid mean value of either temperature (X = T) or specific humidity (X = q).
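To make the mapping in Equation 21.1 concrete, the sketch below outlines a fully connected emulator of this kind in PyTorch. It is a minimal illustration only: the number of vertical levels, the hidden-layer sizes, and all variable names are assumptions made for the example, not the configuration used by Rasp et al. (2018).

```python
import torch
import torch.nn as nn

N_LEV = 30                     # assumed number of vertical levels (illustrative)
N_IN = 2 * N_LEV + 4           # T(z) and q(z) profiles plus ps, S0, H, LE
N_OUT = 2 * N_LEV              # heating and moistening tendency profiles (Eq. 21.1)

class SubgridEmulator(nn.Module):
    """Fully connected emulator mapping a coarse-grid column state to its
    subgrid heating/moistening tendencies (toy configuration)."""
    def __init__(self, hidden=256, layers=5):
        super().__init__()
        blocks, d = [], N_IN
        for _ in range(layers):
            blocks += [nn.Linear(d, hidden), nn.LeakyReLU()]
            d = hidden
        blocks.append(nn.Linear(d, N_OUT))
        self.net = nn.Sequential(*blocks)

    def forward(self, x):
        return self.net(x)

model = SubgridEmulator()
x = torch.randn(8, N_IN)       # batch of 8 columns with placeholder input data
tendencies = model(x)          # stacked heating and moistening profiles (illustrative ordering)
```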
Rasp et al. (2018)’s NN was trained on labeled data from the simulation’s first year and
validated on data from the second year. The training required more than 6 months of data
Figure 21.3 Schematic diagram for ML-based cloud parametrizations for climate models.
High-resolution cloud-resolving model simulations are coarse-grained to the scale of the climate
models (∼100 km) with the help of convolutional or recurrent NNs to learn the impact of convection
on the resolved coarse-scale variables.
(approximately 140 million training samples) to reach final convergence. Alternatively, the
NN can be trained on fewer samples (e.g., 40 million), and computational resources can be
invested to tune its hyperparameters (e.g., number of layers, number of nodes per layer,
learning rate, etc.) instead to guarantee optimal performance on the validation dataset. The
NN used for this chapter’s figures was trained using this second strategy (see Beucler et al.
(2019) for details). Note that neither NN included any temporal or spatial covariations
of the coarse-resolution pixels, similar to the embedded 2D CRM.
The NN reproduced not only the heating and moistening due to deep convection, but
also due to all other subgrid processes such as turbulence, radiation, waves, and shallow
convection. The NN was able to correctly reproduce the CRM, as seen in uncoupled mode
(i.e., by prescribing at each time step the input features T, q, ps , S0 , H, LE from the CRM
model) with mostly the correct spatial structure (Figure 21.4), even if less stochasticity (i.e.,
random noise) was present. The heating ∂T/∂t and moistening ∂q/∂t profiles were also well
reproduced in terms of both the mean and standard deviation for the total moistening and heating
as well as for the radiative components only in the longwave and shortwave (Figure 21.5).
Similarly, Brenowitz and Bretherton (2018) showed that a NN could reproduce a full 3D
CRM in offline mode, but this 3D model was more difficult to use in coupled mode and gen-
erated coupled model instabilities. These instabilities led the model to blow up. To solve this
issue, they removed the upper-atmospheric temperature and moisture inputs to guarantee
a bottom-up only effect of convection (Brenowitz and Bretherton 2019). In addition they
developed a general diagnostic tool that identified regimes during which NNs were creating
unrealistic convection in two different climate models (Brenowitz et al. 2020). Based on a
linear stability analysis of the NN convection scheme coupled to a simplified wave dynamics
model, this tool was used to diagnose and prevent the instability of NN convection schemes.
Brenowitz et al. (2020) additionally showed that regardless of the climate model they were
trained on, NNs exhibited physically-consistent behavior, such as increased convective
Figure 21.4 Snapshot comparison of the CRM and NN convective responses. Snapshot of
convective moistening (left) and heating (right) over the globe in energy units, from an offline
comparison between the NN (bottom) and the Cloud Resolving Model (top).
Figure 21.5 Comparison of the thermodynamic profiles predicted by the CRM and NN. Ensemble
mean (dotted lines) and standard deviation (full lines) of total subgrid moistening (left), total
subgrid heating (second to the left), subgrid longwave heating (second to the right), and subgrid
shortwave heating (right) in energy units, from an offline comparison between the NN (blue) and
the Cloud Resolving Model (black).
Figure 21.6 Architecture-constrained NNs can enforce conservation laws to within machine
precision (∼10⁻⁸ W² m⁻⁴ compared to ∼10² W² m⁻⁴). (a) Schematic of the architecture-constrained
network from Beucler et al. (2019). Histogram of the mean squared spurious energy production
associated to enthalpy (orange), mass (black), longwave (blue), and shortwave (red) conservation for
(b) a standard NN and (c) a constrained NN.
A first challenge, the enforcement of physical consistency, was addressed by Beucler et al. (2019). The authors demonstrated how strict physical constraints can be imposed within
an NN architecture. The conservation of mass and energy can be strictly imposed through
the addition of so-called “constraints layers” that combine inputs and outputs in order to
impose strict equalities (Figure 21.6a). In the example of moist convection, these equalities
are enthalpy, mass, as well as longwave and shortwave radiation conservation, which the
constrained-NN enforces to within machine precision (Figure 21.6c). This goes beyond
the traditional way of imposing constraints softly, using a regularization of the loss
function with Lagrange multipliers (Márquez-Neila et al. 2017; Karpatne et al. 2017d).
Indeed, in the latter approach the physical constraint is only approximately satisfied. However,
for climate modeling, energy and mass conservation need to be exactly satisfied at every
time step. Note that this framework also goes beyond parametrizations that only enforce
linear constraints by default, such as random forests (O’Gorman and Dwyer 2018; Yuval
and O’Gorman 2020), as it can enforce non-linear constraints as long as they are analytic.
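To illustrate the idea of a constraints layer, the sketch below enforces a single linear conservation relation exactly by computing the last output residually from the others. It is a toy example under assumed names and shapes; the actual layers of Beucler et al. (2019) couple inputs and outputs through the full enthalpy, mass, and radiation budgets.

```python
import torch
import torch.nn as nn

class ConservationLayer(nn.Module):
    """Toy 'constraints layer': enforce a linear relation sum_i w_i * y_i = b(x)
    exactly by solving for the last output residually, instead of penalizing
    violations in the loss."""
    def __init__(self, weights):
        super().__init__()
        self.register_buffer("w", weights)       # constraint coefficients w_i

    def forward(self, y_free, budget):
        # y_free: unconstrained network outputs for all but the last variable
        # budget: right-hand side b(x) computed from the inputs
        residual = (budget - (self.w[:-1] * y_free).sum(dim=1)) / self.w[-1]
        return torch.cat([y_free, residual.unsqueeze(1)], dim=1)

# Toy usage: four outputs whose (equally weighted) sum must equal a zero budget
layer = ConservationLayer(torch.tensor([1.0, 1.0, 1.0, 1.0]))
y = layer(torch.randn(2, 3), budget=torch.zeros(2))
print(y.sum(dim=1))            # zero to machine precision for every sample
```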
A second challenge relates to generalization well outside of the training regime.
For instance, Rasp et al. (2018) tested the capacity of the NN trained on a given climate
(0 Kelvin experiment) to generalize to a much warmer world (+4 Kelvin). The algorithm
failed and exhibited the typical double intertropical convergence zone bias (i.e. two tropical
rain bands) similarly to many global climate models (Oueslati and Bellon 2015; Flato
et al. 2013). A similar experiment showed that the model was also unable to generalize
to a colder climate (-4 Kelvin), but this time mostly at the poles, again emphasizing the
difficulty to generalize. Krasnopolsky et al. (2008) suggested training a NN to anticipate
22
Using Deep Learning to Correct Theoretically-derived
Models
Peter A. G. Watson
Earth system simulators are dynamical models based on the laws of physics, chemistry,
and biology as far as we understand them. However, the laws cannot be used directly as
this would be too computationally costly. Instead, approximate equations are used, leading
to substantial errors in the output. Examples of particularly difficult problems are predicting
climate change feedbacks due to clouds (Vial et al. 2013) and representing tropical rainfall
variability (e.g. Westra et al. 2014; Watson et al. 2017). Reducing these errors would be very
valuable for giving better warnings of severe climatic events (also see Chapter 21).
Some attempts to apply deep learning to produce better simulators have proposed learn-
ing all of the equations from data (e.g. in the case of the atmosphere by Dueben and Bauer
2018; Weyn et al. 2019), but as of the time of writing, these have not come very close to
matching the skill of state-of-the-art weather forecasts.
It is potentially more promising to combine deep learning with the theoretically-derived
models we currently have. Karpatne et al. (2017a) present a set of such approaches they call
“theory-guided data science”, which includes:
● using theory to restrict statistical models to have physically-consistent structures (for
example, ensuring that quantities like rainfall cannot become negative; see the sketch after this list);
● guiding these models to learn physically-consistent solutions, such as by includ-
ing known scientific laws in objective functions (for example, penalizing energy
non-conservation);
● using theory to refine models’ outputs (for example, using a data-driven model to produce
possible solutions to a problem and then validating these with a theoretically-derived
model);
● creating hybrids of models based on theory and statistical learning (for example, using a
statistical model to predict the error term of a theoretically-derived model);
● enhancing theory-based models using statistics (for example, by finding optimal param-
eter settings).
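As an illustration of the first approach in the list above (restricting the model structure so that rainfall cannot become negative), the sketch below builds the non-negativity constraint directly into the output layer of a network. The module name, feature size, and the choice of a softplus activation are assumptions made for the example.

```python
import torch
import torch.nn as nn

class RainfallHead(nn.Module):
    """Output head that builds the constraint 'rainfall >= 0' into the model
    structure (assumed names and sizes, for illustration only)."""
    def __init__(self, n_features=16):
        super().__init__()
        self.linear = nn.Linear(n_features, 1)
        self.softplus = nn.Softplus()          # smooth, strictly non-negative mapping

    def forward(self, x):
        return self.softplus(self.linear(x))   # predicted rain rate, always >= 0

head = RainfallHead()
rain = head(torch.randn(4, 16))                # placeholder features
assert (rain >= 0).all()
```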
Reichstein et al. (2019) reviewed approaches to integrate deep learning with theory-based
modeling, including by learning parameterizations for processes that are particularly hard
to derive from theory and emulating parts of physical models to enable them to be run more
Δx/Δt = 𝒯(x) + 𝜖(x),    (22.1)
where x is the vector of state variables, Δx/Δt is the tendency over one discretized time
step Δt, 𝒯(x) is the prediction of the theoretically-derived model and 𝜖(x) is the correction
produced by the algorithm (each presumed not to have explicit time-dependence). 𝜖(x)
is trained to optimize a cost function based on differences between the overall
predictions and data from the target system. 𝜖(x) affects predictions made for subsequent
time steps. In this set-up, the theoretically-derived model retains a critical role and keeps a
link between the predictions and our theoretical understanding, whilst the deep learning
component allows potential skill increases. This can greatly reduce the complexity
of the necessary machine learning algorithms compared with learning to represent Δx∕Δt
entirely (as in Dueben and Bauer 2018; Weyn et al. 2019), since achieving a skill improve-
ment does not require learning all the knowledge encoded in the existing simulators. 𝜖(x)
does not need to be as complex as the theoretically-derived model – for example, not all
input and output variables need to be included, as long as what is included is enough to
give a skill gain. This is a particular advantage for subject areas that use very complex
simulators, such as climate modeling. This may make it a more practical approach for a
research program, since improvements can be made using simple algorithms and then
built on from there, rather than waiting many years for an algorithm that outperforms
current models. Additionally, keeping the theoretically-derived model components will
help to maintain interpretability of the simulator’s behavior. This approach may also
work better in novel physical situations than replacing the full equations because the
theoretically-derived model provides a prediction based on theory and it is not all left to
the algorithm, as long as |𝜖(x)| is not generally much larger than |𝒯(x)| (though it would
be a challenge to firmly guarantee reasonable behavior of 𝜖(x)).
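The sketch below illustrates one discretized step of the hybrid scheme in Equation 22.1, with a placeholder function standing in for the theoretically-derived tendency and a small network for the learned correction 𝜖(x). All names, sizes, and the toy tendency are illustrative assumptions, not a specific published configuration.

```python
import torch
import torch.nn as nn

def theory_tendency(x):
    """Stand-in for the theoretically-derived tendency in Eq. (22.1); a real
    application would call the existing dynamical model here."""
    return -0.1 * x                         # placeholder linear damping

class Correction(nn.Module):
    """Learned correction term eps(x)."""
    def __init__(self, n):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n, 32), nn.Tanh(), nn.Linear(32, n))

    def forward(self, x):
        return self.net(x)

def hybrid_step(x, eps, dt=0.005):
    """One discretized step: x_next = x + dt * (theory tendency + correction)."""
    return x + dt * (theory_tendency(x) + eps(x))

eps = Correction(n=8)
x = torch.randn(1, 8)                       # placeholder state vector
x_next = hybrid_step(x, eps)                # corrected prediction for the next step
```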
Using machine learning algorithms to correct output from dynamical models in “offline”
mode (without using the correction at one time step in calculating the tendency at the next
time step) has been applied in several contexts. Xu and Valocchi (2015) found that support
vector machines and random forests could improve the skill of predictions of groundwa-
ter flow. Their approach reduced the bias in the mean predicted flow from 18% to nearly
zero and reduced the mean absolute error by 50%. Karpatne et al. (2017d) trained a neu-
ral network to correct predictions from a dynamic model of lake temperatures. As well as
minimizing the root mean square error (RMSE) of prediction errors of standardized tem-
peratures at each time step, they included a “physics-based” term in the loss function to
penalize the model if it predicted that the water density would increase with height, and
attained a 30% reduction in the RMSE. Rasp et al. (2018) used ANNs to postprocess weather
forecasts of temperature in Germany and found that this gave better skill than some other
frequently-used statistical methods. Together, these results indicate that machine learning
has the potential to considerably reduce errors in predictions from dynamical models.
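The following sketch shows what such a physics-based loss term can look like for the lake-temperature example: a data-misfit term plus a penalty that is non-zero only where the predicted density profile increases with height. The tensor layout, penalty weight, and function names are assumptions for illustration, not the exact formulation of Karpatne et al. (2017d).

```python
import torch
import torch.nn.functional as F

def physics_penalty(density_profile):
    """Penalty that is positive only where predicted water density increases
    with height; density_profile has shape (batch, levels) ordered from the
    surface downward, so it should be non-decreasing along the level axis."""
    shallower, deeper = density_profile[:, :-1], density_profile[:, 1:]
    return F.relu(shallower - deeper).mean()

def total_loss(pred_temp, obs_temp, pred_density, weight=0.1):
    """Data misfit (RMSE) plus a weighted physics-based term; the weight is an
    illustrative hyperparameter."""
    rmse = torch.sqrt(F.mse_loss(pred_temp, obs_temp))
    return rmse + weight * physics_penalty(pred_density)

# Placeholder tensors standing in for model predictions and observations
pred_t, obs_t, pred_rho = torch.randn(4, 10), torch.randn(4, 10), torch.randn(4, 10)
loss = total_loss(pred_t, obs_t, pred_rho)
```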
There do not seem to be many examples in the literature of applying this approach to
Earth science applications in “online” mode, where the corrected output of dynamical mod-
els is used as input for predicting the next time step. Early work on this method was done by
Forssell and Lindskog (1997), who addressed a laboratory-scale problem of predicting the
water level in a tank that was being filled by a pump driven by a time-varying voltage. Com-
bining a theoretically-derived model with a neural network to account for phenomena that
the theoretically-derived model did not represent (for example, eddies in the tank) reduced
prediction RMSE by a factor of two. It also did not predict unphysical situations like the
water level becoming negative, which did happen when a neural network alone was used
to try to solve the problem.
Cooper and Zanna (2015) used a stochastic linear algorithm added to a low-resolution
model to improve simulations of an idealized two-dimensional ocean with a double gyre.
“Truth” simulations were produced using a horizontal resolution of 7.5 km and it was
attempted to reproduce the long-term statistics of this run using models with a resolution
of 30 km. The targeted statistics were the mean, variance, and autocorrelation in time at
each grid point. An iterative approach was used to learn parameters of the algorithm that
produced the best results, requiring ∼100 integrations of the low-resolution model. The
resulting simulations have mean squared errors in the bias and variance of the horizontal
flow velocities that are reduced by more than a factor of 10, and also substantially reduced
errors of their 5-day lag covariance. The response to a change in wind forcing of the system,
whilst still far from perfect, had a mean squared error that was 50% and 60% of that for the
uncorrected low-resolution model for the eastward and northward components of the flow
respectively. However, it is unclear whether their linear approach could give substantial
improvements of nonlinear processes in the atmosphere that are difficult to model, such
as convection (Arakawa 2004).
Pathak et al. (2018) applied an algorithmic correction approach with a non-linear
machine learning method to simulate low-dimensional chaotic systems, whose variabil-
ity was coming entirely from internal dynamics rather than being forced. They tested
“knowledge-based” predictors of these systems that differed from the true equations
through a parameter change, a “reservoir computer”, and a hybrid of the two. A reservoir
computer is similar to a neural network in that it consists internally of a large number of
non-linear functions relating inputs to outputs, but only parameters relating the outputs
of these functions to the predictions are learnt. So, unlike in deep learning, the non-linear
functions between the inputs and outputs are not optimized for the prediction task. For both
the Lorenz (1963) system of equations (that which produces the famous “butterfly attrac-
tor”) and the one-dimensional Kuramoto–Sivashinsky equations, the hybrid system made
predictions that remained close to those from the true equations for typically ∼2–3 times
longer than predictions made by the separate knowledge-based and reservoir-based models.
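The sketch below illustrates the reservoir-computing idea in a few lines: a fixed random recurrent reservoir generates non-linear features of the input sequence, and only a linear readout is fitted by ridge regression. The reservoir size, scaling factors, and placeholder data are illustrative assumptions, not the configuration of Pathak et al. (2018).

```python
import numpy as np

# Echo-state-style reservoir: fixed random input and recurrent weights provide
# the internal non-linear functions; only the linear readout is trained.
rng = np.random.default_rng(0)
n_in, n_res = 3, 300                              # assumed sizes
W_in = 0.5 * rng.standard_normal((n_res, n_in))
W = rng.standard_normal((n_res, n_res))
W *= 0.9 / np.max(np.abs(np.linalg.eigvals(W)))   # keep spectral radius below 1

def run_reservoir(inputs):
    """inputs: (T, n_in) driving sequence; returns (T, n_res) reservoir states."""
    r, states = np.zeros(n_res), []
    for u in inputs:
        r = np.tanh(W @ r + W_in @ u)
        states.append(r.copy())
    return np.array(states)

def fit_readout(states, targets, ridge=1e-6):
    """The only trained component: ridge regression from states to targets."""
    return np.linalg.solve(states.T @ states + ridge * np.eye(n_res),
                           states.T @ targets)

# Placeholder data; in Pathak et al. (2018) inputs and targets come from the
# chaotic system being predicted.
u = rng.standard_normal((500, n_in))
y = rng.standard_normal((500, n_in))
W_out = fit_readout(run_reservoir(u), y)
```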
that are closer to the truth. This is a key aim in Earth system modeling in order to have
models to use both for making short-range predictions and simulating long-term effects of
climate change. As well as it being more practical to have one set of models for both tasks
rather than maintaining them separately, showing that models used for climate change pro-
jections can make good short-range forecasts can also increase our confidence in their abil-
ity to correctly simulate the dynamics of phenomena like extreme weather events (discussed
more in section 22.1.3).
To test whether this is possible, and also to give a more detailed explanation of how the
error-correction approach might be applied, an example based on the chaotic Lorenz ’96
dynamical system (Lorenz 1996, sometimes also referred to as the Lorenz ’95 system) will
now be presented in detail. This follows the approach of Watson (2019), but focuses more
on how varying the ANN structure affects the prediction skill and demonstrates how it can
be used to test ideas like seamless prediction of weather and climate (Palmer et al. 2008).
The Lorenz ’96 system is instructive for testing new concepts for approaches to improve
Earth system models and has been used numerous times before in this way (e.g. Wilks
2005; Arnold et al. 2013; Schneider et al. 2017a; Dueben and Bauer 2018), although it is
a lot simpler than a model of the Earth’s climate. The set up is to have a “Truth” system
with complex, fine scale behavior and try to simulate its behavior using a system that only
has coarse-scale information. This is analogous to trying to simulate the Earth, with all of
its important small-scale phenomena such as tropical thunderstorms and oceanic eddies,
using weather and climate models with ∼10–100 km scale resolution, which is the limit
of what can currently be afforded computationally. The experimental method is described
below, with more details given by Watson (2019).
The "Truth" system is given by
dXk/dt = −Xk−1(Xk−2 − Xk+1) − Xk + F − (hc/b) ∑j=1…J Yj,k,    (22.2)
dYj,k/dt = −cb Yj+1,k(Yj+2,k − Yj−1,k) − c Yj,k + (hc/b) Xk,
with cyclic boundary conditions Xk = Xk+K and Yj,k = Yj,k+K. This work uses parameter values
K = 8, J = 32, h = 1, F = 20, b = 10 and c = 4, following the work of Arnold et al. (2013).
This ensures that Xk, defined for k = 1, …, K, are slowly-varying variables and Yj,k, defined
additionally for j = 0, …, J + 2, are quickly-varying. The Y variables are connected in a ring
with Y0,k = YJ,k−1, YJ+1,k = Y1,k+1, and YJ+2,k = Y2,k+1. This means there are J unique Yj,k
variables associated with each Xk variable. Lorenz (1996) suggested that the Yj,k be consid-
ered analogous to a convective-scale quantity in the real atmosphere and Xk analogous to
an environmental variable that favors convective activity. These equations are integrated
in time using a fourth-order Runge–Kutta time-stepping scheme with a time step of 0.001
time units. Equation 22.2 is hereafter referred to as the “Truth” system.
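A minimal NumPy sketch of this Truth system is given below, using the parameter values above and a fourth-order Runge-Kutta step. The flattening of the Y ring into a single cyclic vector is one possible indexing convention, and the initial conditions are arbitrary.

```python
import numpy as np

# Two-scale Lorenz '96 "Truth" system (Eq. 22.2) with the chapter's parameters.
K, J = 8, 32
h, F, b, c = 1.0, 20.0, 10.0, 4.0
DT = 0.001

def tendencies(X, Y):
    """X has shape (K,); Y is the ring of fast variables flattened to (K*J,),
    with index k*J + j holding the j-th fast variable attached to X[k]."""
    dX = (-np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F
          - (h * c / b) * Y.reshape(K, J).sum(axis=1))
    dY = (-c * b * np.roll(Y, -1) * (np.roll(Y, -2) - np.roll(Y, 1))
          - c * Y + (h * c / b) * np.repeat(X, J))
    return dX, dY

def rk4_step(X, Y, dt=DT):
    """Fourth-order Runge-Kutta step of the coupled system."""
    k1x, k1y = tendencies(X, Y)
    k2x, k2y = tendencies(X + dt / 2 * k1x, Y + dt / 2 * k1y)
    k3x, k3y = tendencies(X + dt / 2 * k2x, Y + dt / 2 * k2y)
    k4x, k4y = tendencies(X + dt * k3x, Y + dt * k3y)
    return (X + dt / 6 * (k1x + 2 * k2x + 2 * k3x + k4x),
            Y + dt / 6 * (k1y + 2 * k2y + 2 * k3y + k4y))

rng = np.random.default_rng(0)
X, Y = rng.standard_normal(K), 0.1 * rng.standard_normal(K * J)
for _ in range(1000):                      # integrate one time unit
    X, Y = rk4_step(X, Y)
```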
The coarse-scale model to be improved does not simulate the Y variables but parameterizes their effect on the X variables. This is conceptu-
ally similar to how the effect of unresolved phenomena on larger scales is treated in Earth
system models. Thus we obtain the system of equations
dX*k/dt = −X*k−1(X*k−2 − X*k+1) − X*k + F − U(X*k),    (22.3)
with X*k = X*k+K. U(X*k) is a function that parameterizes the effect of the Y variables on
the X variables. The time step is lengthened to 0.005 time units, which is analogous to how
Earth system models do not properly simulate processes that occur on very short timescales,
as well as those on short length scales.
U(X*k) is assumed to be a cubic polynomial, following Arnold et al. (2013), such that
U(X) = a0 + a1 X + a2 X² + a3 X³.
Its parameters are derived using a coarse-graining approach, and have values a0 = −0.207,
a1 = 0.577, a2 = −0.00553 and a3 = −0.000220, following Watson (2019).
The model given by equation 22.3 will be referred to as the “No-ANN model” from
here on.
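The No-ANN model can be sketched in the same style, with the cubic polynomial U(X) using the coefficients given above; the assumption that the coarse model is integrated with the same Runge-Kutta scheme is made for illustration only.

```python
import numpy as np

# Coarse "No-ANN" model (Eq. 22.3): the Y variables are not simulated and their
# effect is represented by the cubic polynomial U(X) with the coefficients above.
A = np.array([-0.207, 0.577, -0.00553, -0.000220])     # a0, a1, a2, a3
K, F, DT_COARSE = 8, 20.0, 0.005

def U(X):
    return A[0] + A[1] * X + A[2] * X**2 + A[3] * X**3

def coarse_tendency(X):
    return -np.roll(X, 1) * (np.roll(X, 2) - np.roll(X, -1)) - X + F - U(X)

def coarse_rk4_step(X, dt=DT_COARSE):
    k1 = coarse_tendency(X)
    k2 = coarse_tendency(X + dt / 2 * k1)
    k3 = coarse_tendency(X + dt / 2 * k2)
    k4 = coarse_tendency(X + dt * k3)
    return X + dt / 6 * (k1 + 2 * k2 + 2 * k3 + k4)

# An error-correcting ANN would simply add its output to coarse_tendency(X),
# as in Eq. (22.1).
X = np.random.default_rng(0).standard_normal(K)
X = coarse_rk4_step(X)
```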
ANN training was stopped once the prediction error failed to decrease by at least 10⁻⁴ twice consecutively after iterating over the whole training dataset.
A validation dataset for evaluating the models with ANNs was produced by separately
creating 3000 time units of data from equation 22.2.
22.1.2 Results
Here, performance of the models with ANNs is presented, and the difference made by
choosing different ANN structures is explored. Note that sampling variability is not gen-
erally large compared to the differences between coarse models with and without ANNs,
with very similar results being found using the first and second halves of the datasets only.
(The exception is the biases in the time-mean of the X variables, which differed between the
first and second halves of the datasets, and were not found to be statistically significantly
different between the models in most cases (not shown), so there is substantial uncertainty
in the patterns in the results for this statistic, but it is not important for the conclusions.)
Figure 22.1 RMSEs of single-time step tendency predictions by coarse-scale models with
error-correcting ANNs as a function of the number of ANN parameters. Different symbols are used
for models with ANNs of different depths (number of hidden layers). Opaque symbols show RMSEs
on validation data and partially transparent symbols those on training data. The black and grey
horizontal dashed lines show the RMSE for the No-ANN model in the validation and training
datasets respectively. Note the horizontal axis is logarithmic. The models with ANNs robustly
outperform the No-ANN model, with increases in the number of parameters giving smaller gains as
the number of parameters increases.
Figure 22.1 shows the RMSEs of single-time step tendency predictions as a function of the number of ANN parameters, which indicates the ANN complexity and is
approximately proportional to the number of floating point operations required to make
a prediction. ANNs are grouped according to the number of hidden layers. The RMSE is
calculated using 10,000 randomly chosen time steps in the training and validation datasets
separately, with the same time steps selected for each ANN structure.
Every ANN tested reduces the RMSE compared to the No-ANN model (whose RMSE is
shown by the dashed line). This demonstrates that even ANNs of low complexity can give
a better performance, including an ANN with just two neurons in one hidden layer. The
RMSE decreases as the number of parameters increases, by up to 42% for the validation
data, and has not saturated for the ANNs tested here – further error reductions are probably
possible, but this is not tested here. The relationship between the RMSE and the number
of parameters seems quite independent of the number of hidden layers when the number
of parameters is more than about 100. The RMSEs on the training data are not very much
lower than those on the validation data, indicating that overfitting is not occurring to a
large extent.
Watson (2019) showed that ANNs in this set up could also improve predictions for
extreme positive and negative tendencies in the validation dataset. This suggests the
ANNs have actually learnt to better represent the dynamics, rather than learning how to
reproduce examples seen in training. This is very important for applications like Earth
system models, for which predictions of extreme situations are a large part of the total
value they produce.
Figure 22.2 The ACC and RMSE of ensemble forecasts of the Truth validation run trajectory made
by models with error-correcting ANNs as a function of the number of ANN parameters, plotted as in
Figure 22.1 but not showing results for evaluation using the training dataset. It is better to have a
larger ACC and a smaller RMSE. All but two models with ANNs outperform the No-ANN model, but
there is no large gain with increasing model complexity.
It is possible that when predicting more complex systems, where the theoretically-derived mod-
els have substantially lower skill compared to the maximum attainable, the improvements
made by using ANNs in this way would be relatively much greater.
Increasing the ANN’s complexity does not appear to give higher forecast skill improve-
ments beyond having ∼50 parameters, despite improvements being found for predicting
single-time step tendencies (Figure 22.1).
Figure 22.3 shows diagnostics of the quality of long-term statistics diagnosed from
free-running 3000 time unit simulations with each coarse-resolution model – these
are analogous to “climate” statistics for the Earth system. The diagnosed biases in the
time-mean of the X variables for the models with ANNs lie both above and below the value
for the No-ANN model, and are not generally statistically significantly different from it,
apart from in the cases of the two models with the highest biases (not shown). However,
the two-sample Kolmogorov–Smirnov statistic is improved by nearly all ANN structures.
Figure 22.3 The bias in the climate mean and Kolmogorov–Smirnov (KS) statistic of long time
series produced by models with error-correcting ANNs as a function of the number of ANN
parameters, plotted as in Figure 22.2. It is better for both diagnostics to be as small as possible. The
ANNs do not give a clear improvement in the mean bias, but the KS statistic is improved in all but
one case, with greater model complexity giving improved results up to having ∼100 parameters.
This is the maximum difference between the cumulative distribution functions of X in the
Truth system and in each respective coarse-resolution model, and therefore depends on
the shape of the distributions of X variables as well as their means. Watson (2019) shows
that this reflects the fact that the ANNs reduce the excessive occurrence of X values near
the central peak of the distribution and reduce the deficit in its negative flank. However,
the ANNs do not improve the low frequency bias of extreme X values. Subsequent work
by Chattopadhyay et al. (2019) found that algorithms that can incorporate memory over
sequences of time steps can simulate the frequency of extreme values well in a system very
similar to the one being considered here. Therefore, a promising direction is to modify
error-correcting algorithms to also be able to do this.
For the KS statistic, there is some evidence of improving skill with increasing model com-
plexity up to ∼100 parameters. Using a 3-layer ANN gives a slightly worse performance than
a 2-layer ANN with a comparable number of parameters.
Figure 22.4 Simulation quality diagnostics plotted against each other: (a) and (b) show forecast
ACC and RMSE at lead time 1 time unit against the single-time step prediction error respectively;
(c) and (d) show similar results for the long-term mean bias and KS statistic; and (e) and (f) show the
long-term mean bias and KS statistic plotted against the forecast RMSE at lead time 1 time unit.
Different symbols are used for ANNs with different numbers of hidden layers. There are substantial
correlations between these diagnostics, signaling that skill at simulating shorter timescales is
indicative of skill on longer timescales, but these are sensitive to the exclusion of outliers (see text).
at tasks like predicting how such systems will respond to external forcing, which is very
relevant for getting better climate change projections. The final part of this chapter discusses
particular challenges that deserve attention.
the Earth system because observations in a given place are generally separated by six hours
or more. The time steps in state-of-the-art dynamical Earth system models are a lot shorter,
which is likely to be needed so the models are numerically stable, and is desirable in order
to better represent the true continuous-time equations. The method therefore needs to be
extended to allow an algorithm to be learnt when there is only data many time steps apart.
In a free-running system, the impact of perturbations to the algorithm’s parameters on pre-
dictions over multiple time steps needs to be taken into account. One possibility is to use
the “backpropagation through time” algorithm (Werbos 1990a), as used in recurrent ANNs
(Funahashi and Nakamura 1993). This would require the tangent linear approximation of
the theoretically-derived model. This becomes more complicated if the algorithm is “local”,
so predictions at a point only depend on inputs from nearby points, which is likely to be
beneficial for parallel computing and also incorporates spatial invariance of the prediction
equations if the same algorithm is used at all grid points, as in the above work. Then the
effect of perturbing parameters on predictions at those nearby points likely also needs to
be taken into account, so that backpropagation needs to be done “backwards through time
and sideways through space”.
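The sketch below shows the basic shape of such multi-step training with a differentiable stand-in for the theoretically-derived model: the hybrid model is rolled out over several steps and the mismatch with an observation available only at the end is backpropagated through every intermediate step. The toy tendency, network size, and data are illustrative assumptions; a real application would need the actual simulator (or its tangent linear) inside the loop.

```python
import torch
import torch.nn as nn

def theory_step(x, dt=0.005):
    """Differentiable stand-in for one step of the theoretically-derived model;
    a real application needs the actual simulator (or its tangent linear)."""
    return x + dt * (-0.1 * x)              # placeholder dynamics

class Correction(nn.Module):
    def __init__(self, n):
        super().__init__()
        self.net = nn.Sequential(nn.Linear(n, 32), nn.Tanh(), nn.Linear(32, n))

    def forward(self, x):
        return self.net(x)

def rollout_loss(eps, x0, obs, n_steps, dt=0.005):
    """Observations are only available n_steps ahead of x0, so the gradient of
    the mismatch flows back through every intermediate step (backpropagation
    through time)."""
    x = x0
    for _ in range(n_steps):
        x = theory_step(x) + dt * eps(x)
    return ((x - obs) ** 2).mean()

eps = Correction(n=8)
opt = torch.optim.Adam(eps.parameters(), lr=1e-3)
x0, obs = torch.randn(16, 8), torch.randn(16, 8)   # placeholder training pairs
opt.zero_grad()
loss = rollout_loss(eps, x0, obs, n_steps=10)
loss.backward()
opt.step()
```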
Once appropriate algorithms have been shown to work in simple cases, improving Earth
system simulations would require them to be trained on data either from high-quality sim-
ulators that we cannot afford to use in all of the experiments we would like to, or on data
based on observations of the real system. Reanalysis data is a possible choice for the latter
(e.g. Dee et al. 2011). Although it is not perfect, it is likely to be closer to the behavior of
the real system than existing simulators, so learning to simulate it would yield modeling
improvements. Using the improved models to create a better reanalysis could then give a
self-improvement cycle, where improved reanalysis is used to produce improved simula-
tors, which are used to make a better reanalysis, and so on. Alternatively, Bocquet et al.
(2020) have shown that the model and system state may be learnt together.
they occur and counterfactual values corresponding to a world with less anthropogenic
climate change. If the counterfactual values are taken from an earlier observed period,
then it may be possible to train systems to predict weather risk for both cases based on
observations, and so derive the difference between them, without requiring extrapolation
into unseen climates.
In order to predict the Earth system’s behavior in future climatic states, it is clearly neces-
sary to represent the effect of changing forcings, for example the increasing concentrations
of carbon dioxide. Our current simulators rely on data based on laboratory experiments and
calculations from the laws of physics for this, back to the work of Tyndall (1861). I do not
know a way to integrate this data into simulators trained to reproduce observed Earth sys-
tem behavior with no theoretically-derived structure. Combining machine learning compo-
nents with theoretically-derived modules has the advantage that the theoretically-derived
part can incorporate such knowledge. Therefore, a theoretically-derived simulator with an
error correction algorithm could reproduce effects such as the warming produced by car-
bon dioxide, whilst also giving a more realistic simulation of things like weather variability.
The algorithm would become increasingly less trustworthy as the climate changed more,
but may still add value. For example, Scher and Messori (2019) found some skill for neural
networks simulating simple dynamical systems with external forcings for some way outside
the range seen in training.
Tests of whether combining theoretically-derived and machine-learnt components performs
better at predicting the effect on chaotic systems of changes in external forcings than
using either approach in isolation would be very valuable.
22.3 Conclusion
This chapter has discussed using machine learning algorithms to correct errors in simu-
lated tendencies of theoretically-derived models of dynamical systems. This is a promis-
ing approach to produce improved simulators, including for the Earth system. It has been
shown here that this can yield better quality simulations in the chaotic Lorenz ’96 system
and produce insights into theories like seamless prediction. The main challenges for future
development are applying this in more complex models in situations where observational
data is sparse in time and making it reliable at predicting the impact of changing external
boundary conditions, such as greenhouse gas emissions.
23
Outlook
Markus Reichstein, Gustau Camps-Valls, Devis Tuia, and Xiao Xiang Zhu
Deep learning has in the past decade surpassed the boundaries of pure machine learning
and computer vision research, and has become a state-of-the-art tool in almost every scientific
discipline, with exponentially growing use. On Google Scholar, more than half of the total
literature on the terms “Earth Science and deep learning” has been recorded since 2019
(excluding pre-2016 articles, where “deep learning” is found with an educational meaning).
More importantly, while the success of deep learning has started with “black-box” classifi-
cation tasks, deep learning is contributing more and more in diverse ways to the scientific
process of knowledge generation, with some latest examples reported in the chapters of this
book. As an example, remote sensing (Chapters 2–11) exploited the ability of deep learning
to align and fuse heterogeneous sources of data (Chapters 9 and 10) before extracting
spatial, often geometrical features (Chapters 5 and 6), as well as temporal features in
observational multitemporal data (Chapters 8 and 18), as indicated in Figure 23.1, arrow A.
Both discriminative and generative deep learning models, as well as the whole continuum
from fully supervised to unsupervised approaches, are applicable. In particular, to deal with
domain adaptation (Chapter 7) and sparse labels, smart combinations of supervised and
unsupervised approaches need to be researched further: purely unsupervised (Chapter 2),
semi-supervised, and self-supervised (Chapter 4) are strong candidates in this direction.
Another strand of deep learning for geosciences has emerged more recently, which relates to
exploiting synergies with system modeling (Figure 23.1B). System modeling is often termed
“physical modeling”, which is too narrow, because chemical, biological, ecophysiological,
and physical processes may be modeled here. System modeling (also called process-based
modeling or mechanistic modeling) refers to an approach which attempts to create
the behavior of a system from the behavior and interactions of its components, ideally
derived from fundamental laws or at least ample empirically established knowledge. With
respect to system modeling, deep learning can be used as an accelerator of the calibration
processes: the examples in this book include emulation of dynamical spatio-temporal
systems (Chapter 18), parameterizations of processes which (mostly for computational
reasons) cannot be explicitly resolved at the required resolution (Chapters 20 and 21),
and bias corrections of system models (Chapter 14). These examples mostly deal indeed
[Figure 23.1 elements: Observations; Real-world experiments; Deep learning; System modelling; Causal interpretation/XAI; arrows A–D; triangles: Experimental design, Hybrid modelling, Causal testing, Causal modelling.]
Figure 23.1 Future challenges of deep learning in linking to observations, experiments, causal
interpretation and system modeling.
with physical systems, but the approaches have high potential also for geo-biological and
eco-physiological processes in the future.
There are a number of future challenges ahead, which relate to the integration of
deep learning with other approaches, most notably four main pillars: experimental design,
hybrid modeling, causal testing, and causal modeling. In Figure 23.1, these appear as the
triangles. For instance, deep learning, system modeling, and observations can be brought
together in a hybrid modeling framework, which complements physics-guided machine
learning (Reichstein et al. 2019); see the triangle “Hybrid modelling” in Figure 23.1.
While technically this sort of hybrid modeling can be addressed seamlessly within a
differentiable programming framework (e.g. PyTorch, JuliaFlux), and first examples
exist (de Bezenac et al. 2019; Kraft et al. 2020), many conceptual questions remain to be
addressed. One of the advantages is that in such a framework physically interpretable
parameters or states can be estimated as in a data assimilation system, but the identifiability
of the parameters and states is a big challenge, in particular if several of these are estimated
together with all the weights of the DNN. Regularization techniques have to be explored,
for instance.
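A minimal sketch of this kind of hybrid, differentiable set-up is given below: a single physically interpretable parameter is estimated jointly with the weights of a small NN residual term, with a simple regularization term pulling the parameter towards a prior value. The process term, parameter name, and regularization weight are illustrative assumptions.

```python
import torch
import torch.nn as nn

class HybridModel(nn.Module):
    """Toy hybrid model: a physically interpretable parameter (a decay time
    'tau') is estimated jointly with the weights of a small NN residual term."""
    def __init__(self, n):
        super().__init__()
        self.log_tau = nn.Parameter(torch.zeros(1))        # interpretable parameter
        self.residual = nn.Sequential(nn.Linear(n, 32), nn.Tanh(), nn.Linear(32, n))

    def forward(self, x):
        tau = torch.exp(self.log_tau)                      # keep tau positive
        return -x / tau + self.residual(x)                 # process term + learned term

model = HybridModel(n=4)
x, target = torch.randn(64, 4), torch.randn(64, 4)         # placeholder data
prior_tau = torch.tensor(1.0)
data_term = ((model(x) - target) ** 2).mean()
reg_term = 0.1 * (torch.exp(model.log_tau) - prior_tau) ** 2   # pull tau towards a prior
loss = data_term + reg_term
loss.backward()
```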
DL models are often opaque and overparameterized, and the learned relations are
difficult to understand and visualize. One would need to decompose model responses into
interpretable building blocks, in order to understand the drivers of the different (climatic,
perceptive, demographic, …) processes being modeled by the network. Explainable deep
learning is a rising trend (Lapuschkin et al. 2019; Samek et al. 2019), which is also picking
up in spatial studies (Marcos et al. 2019; Roscher et al. 2020). Making the black boxes grayer
will definitely help not only in gaining confidence and trust in machine decisions, but
would also be a decisive step forward in understanding through data analysis.
The other big challenge lies in the links between deep learning and XAI/Causal inference.
By construction DNNs are not causal, but simply find the “best” associations/correlations
between data patterns, given a defined cost function. There are two ways of making link-
ages: (i) using causal theory to explain the functioning (e.g. feature relevance) of a DNN
(e.g. Tibau et al. (2020)) and (ii) using DNNs for causal hypothesis generation or causal infer-
ence in general (e.g. Bengio et al. (2019)). DL models learn a useful discriminative feature
representation from data, but the network has not necessarily learned any causal relation
between the involved covariates. Learning causal feature representations should be a pri-
ority, particularly in current times where accountability, explainability, and interpretability
(see last point) are needed in general, and in the relevant case of attribution of causes to
climate change. In this context, the link to system modeling offers at least good opportuni-
ties for ground-truthing respective approaches across various levels of complexity, since in
system models the causal relations are defined (triangle “Causal modelling”, Figure 23.1).
In the real world, these kinds of tests can be achieved via experimental approaches,
yet usually only in less complex systems and only for selected variables for pragmatic
reasons (e.g. we cannot yet build an analog of the world), cf. triangle “Causal testing” in
Figure 23.1.
Last but not least, real-world experiments are important to test the hypotheses we
generate from observations or from theoretical reasoning (Cuttler et al. 2020). Optimal
experimental design attempts to design experiments that can best constrain parameters of
a model, or distinguish between different model structures, but also for non-parametric
estimation (Winer 1962). A geo-scientific non-parametric example strongly related to
machine learning is an estimation of a spatio-temporal stochastic process where optimal
experimental design tells where in space observations should be placed for maximal infor-
mation. Certainly, deep generative models can play an interesting role in the “Experimental
design” triangle (Figure 23.1). For model parameter estimation and model selection, it
will be interesting to link this to hybrid modeling as well, for instance asking which are the
most informative experiments or spatio-temporal sampling strategies to constrain hybrid
models and/or the physically interpretable parameters and latent states. While the Bayesian
nonparametric literature has been very active in this regard, the deep learning literature
appears to be lacking on this topic, hence there are very good prospects to make impactful
first contributions.
The future of the interface between DL and the Earth and climate sciences is bright and
exciting. We now have access to operational tools that allow optimizing arbitrary networks,
losses, and physics-aware architectures. Besides, current methods are able to make sense
of the learned latent network representations: interpretability is just the first step; eXplain-
able AI (XAI) and causal inference have to guide network training. Our long-term vision
is tied to these open frontiers and fosters research towards algorithms capable of discover-
ing knowledge from Earth data, a stepping stone before the more ambitious final goal of
machine reasoning about anthropogenic climate change.
However, while the field of machine/DL has traditionally progressed very rapidly, we
observe that this is not the case in tackling such grand challenges. Cognitive barriers are still
on our pathway: domain knowledge is elusive and difficult to encode, interaction between
computer scientists and physicists is still complicated, and education in these synergistic
concepts will be a hard task to achieve in the upcoming years. The ways forward we have
promoted, based on experimental design, hybrid DL, interpretability, and causal discov-
ery, definitely call for an active and continuous interaction between domain knowledge
experts and computer scientists. The new era for AI in geoscience is knocking at the door
and shouting out: “collaborate, collaborate!”
Bibliography
C.J. Abolt, M.H. Young, A.L. Atchley, and C.J. Wilson. Brief communication: Rapid
machine-learning-based extraction and measurement of ice wedge polygons in high-
resolution digital elevation models. The Cryosphere, 13(1):237–245, 2019. doi: 10.5194/
tc-13-237-2019.
C.J. Abolt and M.H. Young. High-resolution mapping of spatial heterogeneity in ice wedge
polygon geomorphology near Prudhoe Bay, Alaska. Scientific Data, 7(1):87, 2020. doi:
10.1038/s41597-020-0423-9.
D.H. Ackley, G.E. Hinton, and T.J. Sejnowski. A learning algorithm for Boltzmann machines.
Cognitive Science, 9(1):147–169, 1985.
S.V. Adams, R.W. Ford, M. Hambley, J.M. Hobson, I. Kavčič, C.M. Maynard, T. Melvin, E.H.
Müller, S. Mullerworth, A.R. Porter, M. Rezny, B.J. Shipway, and R. Wong. Lfric: Meeting the
challenges of scalability and performance portability in weather and climate models. Journal
of Parallel and Distributed Computing, 132:383–396, 2019. ISSN 0743-7315. doi:
https://doi.org/10.1016/j.jpdc.2019.02.007. URL http://www.sciencedirect.com/science/
article/pii/S0743731518305306.
S. Agarwal, Y. Furukawa, N. Snavely, I. Simon, B. Curless, S.M. Seitz, and R. Szeliski. Building
Rome in a day. Communications of the ACM, 54(10):105–112, 2011.
S. Agrawal, L. Barrington, C. Bromberg, J. Burge, C. Gazen, and J. Hickey. Machine learning
for precipitation nowcasting from radar images. arXiv preprint arXiv:1912.12132, 2019.
M. Aharon, M. Elad, and A. Bruckstein. K-SVD: An algorithm for designing overcomplete
dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):
4311–4322, Nov 2006. ISSN 1053-587X. doi: 10.1109/TSP.2006.881199.
F. Aires, W.B. Rossow, N.A. Scott, and A. Chédin. Remote sensing from the infrared
atmospheric sounding interferometer instrument 2. Simultaneous retrieval of temperature,
water vapor, and ozone atmospheric profiles. Journal of Geophysical Research: Atmospheres,
107(D22), 2002.
G.C. Allen. Understanding China’s AI strategy: Clues to Chinese strategic thinking on artificial
intelligence and national security. Technical report, Center for a New American Security,
February 2019. URL https://www.cnas.org/publications/reports/understanding-chinas-ai-
strategy.
M. Allen. Liability for climate change. Nature, 421:891–892, 2003. ISSN 02624079. doi:
10.1016/S0262-4079(10)62047-7.
N. Audebert, B. Le Saux, and S. Lefevre. Deep learning for classification of hyperspectral data:
A comparative review. IEEE Geoscience and Remote Sensing Magazine, 7(2):159–173, 2019.
M. Awad. Sea water chlorophyll-a estimation using hyperspectral images and supervised
artificial neural network. Ecological Informatics, 24:60–68, 2014.
G. Ayzel, M. Heistermann, A. Sorokin, O. Nikitin, and O. Lukyanova. All convolutional neural
networks for radar-based precipitation nowcasting. Procedia Computer Science, 150:186–192,
2019.
A. Azarang, H.E. Manoochehri, and N. Kehtarnavaz. Convolutional autoencoder-based
multispectral image fusion. IEEE Access, 7:35673–35683, 2019.
S.M. Azimi, E. Vig, R. Bahmanyar, M. Körner, and P. Reinartz. Towards multi-class object
detection in unconstrained remote sensing imagery. In ACCV, pages 150–165, 2018. doi:
10.1007/978-3-030-20893-6_10.
S.M. Azimi, C. Henry, L. Sommer, A. Schumann, and E. Vig. Skyscapes – fine-grained semantic
understanding of aerial scenes. In 2019 IEEE/CVF International Conference on Computer
Vision (ICCV), pages 7392–7402, 2019.
M. Babaeizadeh, C. Finn, D. Erhan, R.H. Campbell, and S. Levine. Stochastic variational video
prediction. arXiv preprint arXiv:1710.11252, 2017.
M. Babaeizadeh, C. Finn, D. Erhan, R.H. Campbell, and S. Levine. Stochastic variational video
prediction. In 6th International Conference on Learning Representations, ICLR 2018, 2018.
V. Badrinarayanan, A. Kendall, and R. Cipolla. Segnet: A deep convolutional encoder-decoder
architecture for image segmentation. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 39(12):2481–2495, 2017.
E.H. Bair, A.A. Calfa, K. Rittger, and J. Dozier. Using machine learning for real-time estimates
of snow water equivalent in the watersheds of Afghanistan. Cryosphere, 12(5), 2018.
M. Baktashmotlagh, M. Harandi, B. Lovell, and M. Salzmann. Unsupervised domain
adaptation by domain invariant projection. In International Conference on Computer Vision,
pages 769–776, 2013.
G. Balakrishnan, A. Zhao, M.R. Sabuncu, J. Guttag, and A.V. Dalca. Voxelmorph: A learning
framework for deformable medical image registration. IEEE Transactions on Medical
Imaging, 38(8):1788–1800, Aug 2019. ISSN 1558-254X. doi: 10.1109/TMI.2019.2897538.
P. Baldi and K. Hornik. Neural networks and principal component analysis: Learning from
examples without local minima. Neural Networks, 2(1):53–58, 1989.
P. Baldi, P. Sadowski, and D. Whiteson. Searching for exotic particles in high-energy physics
with deep learning. Nature Communications, 5, Jul 2014. ISSN 2041-1723. doi:
10.1038/ncomms5308. URL http://www.nature.com/doifinder/10.1038/ncomms5308.
E.A. Barnes, J. Slingo, and T. Woollings. A methodology for the comparison of blocking
climatologies across indices, models and climate scenarios. Climate Dynamics,
38(11):2467–2481, Jun 2012. ISSN 1432-0894. doi: 10.1007/s00382-011-1243-6. URL https://
doi.org/10.1007/s00382-011-1243-6.
G.A. Barron-Gafford, R.L. Scott, G.D. Jenerette, and T.E. Huxman. The relative controls of
temperature, soil moisture, and plant functional group on soil CO2 efflux at diel, seasonal,
and annual scales. Journal of Geophysical Research: Biogeosciences, 116, 2011. doi:
10.1029/2010JG001442.
R. Barry and T.Y. Gan. The Global Cryosphere: Past, Present and Future. Cambridge University
Press, 2011.
P. Bauer, A. Thorpe, and G. Brunet. The quiet revolution of numerical weather prediction.
Nature, 525, 2015. URL https://doi.org/10.1038/nature14956.
L.E. Baum and T. Petrie. Statistical inference for probabilistic functions of finite state Markov
chains. The Annals of Mathematical Statistics, 37(6):1554–1563, 1966.
C.A. Baumhoer, A.J. Dietz, C. Kneisel, and C. Kuenzer. Automated extraction of antarctic
glacier and ice shelf fronts from Sentinel-1 imagery using deep learning. Remote Sensing,
11(21), 2019. doi: 10.3390/rs11212529.
H. Bay, A. Ess, T. Tuytelaars, and L. Van Gool. Speeded-up robust features (SURF). Computer
Vision and Image Understanding, 110(3):346–359, June 2008. ISSN 1077-3142. doi:
10.1016/j.cviu.2007.09.014. URL https://doi.org/10.1016/j.cviu.2007.09.014.
M. Belkin and P. Niyogi. Laplacian eigenmaps for dimensionality reduction and data
representation. Neural computation, 15(6):1373–1396, 2003.
J.A. Benediktsson, M. Pesaresi, and K. Arnason. Classification and feature extraction for
remote sensing images from urban areas based on morphological transformations. IEEE
Transactions in Geoscience and Remote Sensing, 41(9):1940–1949, 2003.
S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer. Scheduled sampling for sequence prediction
with recurrent neural networks. In NIPS, 2015.
Y. Bengio, T. Deleu, N. Rahaman, R. Ke, S. Lachapelle, O. Bilaniuk, A. Goyal, and C. Pal. A
meta-transfer objective for learning to disentangle causal mechanisms. 01 2019.
Y. Bengio, P. Simard, and P. Frasconi. Learning long-term dependencies with gradient
descent is difficult. IEEE Transactions on Neural Networks, 5(2):157–166, 1994. doi:
10.1109/72.279181.
Y. Bengio, J. Louradour, R. Collobert, and J. Weston. Curriculum learning. In ICML, 2009.
Y. Bengio, A.C. Courville, and P. Vincent. Representation learning: A review and new
perspectives. IEEE TPAMI, 35(8):1798–1828, 2013.
S.G. Benjamin, S.S. Weygandt, J.M. Brown, M. Hu, C.R. Alexander, T.G. Smirnova, J.B. Olson,
E.P. James, D.C. Dowell, G.A. Grell, et al. A North American hourly assimilation and model
forecast cycle: The rapid refresh. Monthly Weather Review, 144(4): 1669–1694, 2016.
C. Bentes, D. Velotto, and S. Lehner. Target classification in oceanographic SAR images with
deep neural networks: Architecture and initial results. In 2015 IEEE International Geoscience
and Remote Sensing Symposium (IGARSS), pages 3703–3706, 2015.
K.J. Bergen, P.A. Johnson, V. Maarten, and G.C. Beroza. Machine learning for data-driven
discovery in solid earth geoscience. Science, 363(6433):eaau0323, 2019.
P.S. Berloff. Random-forcing model of the mesoscale oceanic eddies. Journal of Fluid
Mechanics, 529:71–95, 2005.
J.D. Bermudez, P.N. Happ, R.Q. Feitosa, and D.A.B. Oliveira. Synthesis of multispectral optical
images from SAR/optical multitemporal data using conditional generative adversarial
networks. IEEE Geoscience and Remote Sensing Letters, 16(8):1220–1224, Aug 2019.
J. Berner, U. Achatz, L. Batté, L. Bengtsson, A. de la Cámara, Hannah M. Christensen, M.
Colangeli, D.R.B. Coleman, D. Crommelin, S.I. Dolaptchiev, C.L.E. Franzke, P. Friederichs,
P. Imkeller, H. Järvinen, S. Juricke, V. Kitsios, F. Lott, V. Lucarini, S. Mahajan, T.N. Palmer,
C. Penland, M. Sakradzija, J.-S. von Storch, A. Weisheimer, M. Weniger, P.D. Williams, and
J.-I. Yano. Stochastic parameterization: Toward a new view of weather and climate models.
Bulletin of the American Meteorological Society, 98(3):565–588, 2017. doi:
10.1175/BAMS-D-15-00268.1.
Y. Boualleg and M. Farah. Enhanced interactive remote sensing image retrieval with scene
classification convolutional neural networks model. In IEEE International Geoscience and
Remote Sensing Symposium, pages 4748–4751, July 2018.
O. Boucher, D. Randall, P. Artaxo, C. Bretherton, G. Feingold, P. Forster, V.-M. Kerminen,
Y. Kondo, H. Liao, U. Lohmann, P. Rasch, S.K. Satheesh, S. Sherwood, B. Stevens, and X.Y.
Zhang. Clouds and aerosols. In Climate Change 2013: The Physical Science Basis.
Contribution of Working Group I to the Fifth Assessment Report of the Intergovernmental Panel
on Climate Change [T.F. Stocker, D. Qin, G.-K. Plattner, M.M.B. Tignor, S.K. Allen,
J. Boschung, A. Nauels, Y. Xia, V. Bex and P.M. Midgley (eds.)]. Cambridge University Press,
Cambridge, United Kingdom and New York, NY, USA, 2013.
A. Boulch. Generalizing discrete convolutions for unstructured point clouds. In Eurographics
3DOR, April 2019.
A. Boulch and R. Marlet. Deep learning for robust normal estimation in unstructured point
clouds. Computer Graphics Forum, 2016.
A. Boulch, J. Guerry, B. Le Saux, and N. Audebert. SnapNet: 3D point cloud semantic labeling
with 2d deep segmentation networks. Computers & Graphics, 71:189–198, 2018. ISSN
00978493.
H. Boulze, A. Korosov, and J. Brajard. Classification of sea ice types in Sentinel-1 SAR data
using convolutional neural networks. Remote Sensing, 12(13):2165, 2020.
H. Bourlard and Y. Kamp. Auto-association by multilayer perceptrons and singular value
decomposition. Biological Cybernetics, 59(4–5):291–294, 1988.
B.D. Bowes, J.M. Sadler, M.M. Morsy, M. Behl, and J.L. Goodall. Forecasting groundwater table
in a flood prone coastal city with long short-term memory and recurrent neural networks.
Water, 11 (5):1098, May 2019. ISSN 2073-4441. doi: 10.3390/w11051098. URL http://dx.doi
.org/10.3390/w11051098.
A. Braakmann-Folgmann and C. Donlon. Estimating snow depth on arctic sea ice using
satellite microwave radiometry and a neural network. The Cryosphere, 13(9):2421–2438,
2019. doi: 10.5194/tc-13-2421-2019.
J. Brajard, A. Carrassi, M. Bocquet, and L. Bertino. Combining data assimilation and machine
learning to emulate a dynamical model from sparse and noisy observations: a case study
with the Lorenz 96 model. Geoscientific Model Development Discussions, pages 1–21, May
2019. ISSN 1991-962X. doi: 10.5194/gmd-2019-136. URL https://www.geosci-model-dev-
discuss.net/gmd-2019-136/.
N.D. Brenowitz and C.S. Bretherton. Prognostic validation of a neural network unified physics
parameterization. Geophysical Research Letters, 45(12):6289–6298, 2018.
N.D. Brenowitz and C.S. Bretherton. Spatially extended tests of a neural network
parametrization trained by coarse-graining. Apr 2019. URL http://arxiv.org/abs/1904.03327.
N.D. Brenowitz, T. Beucler, M. Pritchard, and C.S. Bretherton. Interpreting and stabilizing
machine-learning parametrizations of convection. arXiv preprint arXiv:2003.06549, 2020.
H. Bristow, A. Eriksson, and S. Lucey. Fast convolutional sparse coding. In Proceedings of
CVPR, pages 391–398, 2013.
G. Buchsbaum. A spatial processor model for object colour perception. Journal of the Franklin
Institute, 310(1):1–26, 1980.
M. Buchwitz, M. Reuter, O. Schneising, W. Hewson, R.G. Detmers, H. Boesch, O.P. Hasekamp,
I. Aben, H. Bovensmann, J.P. Burrows, et al. Global satellite observations of
column-averaged carbon dioxide and methane: The GHG-CCI XCO2 and XCH4 CRDP3 data
set. Remote Sensing of Environment, 203:276–295, 2017.
W. Buermann, M. Forkel, M. O’Sullivan, S. Sitch, P. Friedlingstein, V. Haverd, A.K. Jain,
E. Kato, M. Kautz, S. Lienert, D. Lombardozzi, J.E.M.S. Nabel, H. Tian, A.J. Wiltshire,
D. Zhu, W.K. Smith, and A.D. Richardson. Widespread seasonal compensation effects of
spring warming on northern plant productivity. Nature, 562:110, 2018. doi: 10.1038/
s41586-018-0555-7.
M. Bujisic, V. Bogicevic, H.G. Parsa, V. Jovanovic, and A. Sukhu. It’s raining complaints! How
weather factors drive consumer comments and word-of-mouth. Journal of Hospitality &
Tourism Research, 43(5): 656–681, 2019.
W. Burger and M.J. Burge. Image Matching and Registration, pages 565–585. Springer London,
London, 2016. ISBN 978-1-4471-6684-9. doi: 10.1007/978-1-4471-6684-9_23. URL https://doi
.org/10.1007/978-1-4471-6684-9_23.
A. Buslaev, A. Parinov, E. Khvedchenya, V.I. Iglovikov, and A.A. Kalinin. Albumentations: Fast
and flexible image augmentations. ArXiv e-prints, 2018.
T. Bürgmann, W. Koppe, and M. Schmitt. Matching of TerraSAR-X derived ground control points
to optical image patches using deep learning. ISPRS Journal of Photogrammetry and Remote
Sensing, 158:241–248, 2019. ISSN 0924-2716. doi:
https://doi.org/10.1016/j.isprsjprs.2019.09.010.
M. Calonder, V. Lepetit, M. Ozuysal, T. Trzcinski, C. Strecha, and P. Fua. BRIEF: Computing
a local binary descriptor very fast. IEEE Transactions on Pattern Analysis and Machine
Intelligence, 34(7):1281–1298, July 2012. ISSN 1939-3539. doi: 10.1109/
TPAMI.2011.222.
M. Camporese, C. Paniconi, M. Putti, and S. Orlandini. Surface-subsurface flow modeling with
path-based runoff routing, boundary condition-based coupling, and assimilation of
multisource observation data. Water Resources Research, 46(2):W02512, Feb 2010. ISSN
0043-1397. doi: 10.1029/2008WR007536. URL http://www.agu.org/pubs/crossref/2010/
2008WR007536.shtml.
M. Campos-Taberner, A. Romero-Soriano, C. Gatta, G. Camps-Valls, A. Lagrange, B. Le Saux,
A. Beaupère, A. Boulch, A. Chan-Hon-Tong, S. Herbin, H. Randrianarivo, M. Ferecatu, M.
Shimoni, G. Moser, and D. Tuia. Processing of extremely high resolution LiDAR and RGB
data: Outcome of the 2015 IEEE GRSS Data Fusion Contest. Part A: 2D contest. IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing, 9(12):
5547–5559, 2016.
G. Camps-Valls, L. Gómez-Chova, J. Muñoz-Marí, J. Vila-Francés, and J. Calpe-Maravilla.
Composite kernels for hyperspectral image classification. IEEE Geoscience and Remote
Sensing Letters, 3(1):93–97, 2006.
G. Camps-Valls, D. Svendsen, L. Martino, J. Muñoz-Marí, V. Laparra, M. Campos-Taberner, and
D. Luengo. Physics-aware Gaussian processes in remote sensing. Applied Soft Computing,
68:69–82, Jul 2018a. doi: https://doi.org/10.1016/j.asoc.2018.03.021.
G. Camps-Valls, J. Verrelst, J. Munoz-Mari, V. Laparra, F. Mateo-Jimenez, and J. Gomez-Dans.
A survey on Gaussian processes for earth-observation data analysis: A comprehensive
investigation. IEEE Geoscience and Remote Sensing Magazine, 4(2):58–78, 2016.
L.-C. Chen, Y. Zhu, G. Papandreou, F. Schroff, and H. Adam. Encoder-decoder with atrous
separable convolution for semantic image segmentation. In European Conference on
Computer Vision, 2018a.
L.-C. Chen, Y. Cao, L. Ma, and J. Zhang. A deep learning based methodology for precipitation
nowcasting with radar. Earth and Space Science, page e2019EA000812, 2019.
S. Chen and D. Zhang. Semisupervised dimensionality reduction with pairwise constraints for
hyperspectral image classification. IEEE Geoscience and Remote Sensing Letters,
8(2):369–373, Mar 2011.
S. Chen, H. Wang, F. Xu, and Y. Jin. Target classification using the deep convolutional
networks for SAR images. IEEE Transactions on Geoscience and Remote Sensing,
54(8):4806–4817, 2016.
S. Chen, X. Li, Y. Zhang, R. Feng, and Ch. Zhang. Local deep hashing matching of aerial images
based on relative distance and absolute distance constraints. Remote Sensing, 9(12), 2017b.
ISSN 2072-4292. doi: 10.3390/rs9121244.
S. Chen, X. Yuan, W. Yuan, J. Niu, F. Xu, and Y. Zhang. Matching multi-sensor remote sensing
images via an affinity tensor. Remote Sensing, 10(7), 2018c. ISSN 2072-4292. doi:
10.3390/rs10071104. URL https://www.mdpi.com/2072-4292/10/7/1104.
T. Chen, S. Kornblith, M. Norouzi, and G. Hinton. A simple framework for contrastive learning
of visual representations, 2020. URL http://arxiv.org/abs/2002.05709.
X. Chen, Y. Duan, R. Houthooft, J. Schulman, I. Sutskever, and P. Abbeel. InfoGAN:
Interpretable representation learning by information maximizing generative adversarial
nets. In D.D. Lee, M. Sugiyama, U.V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in
Neural Information Processing Systems 29, pages 2172–2180. Curran Associates, Inc., 2016.
Y. Chen, Z. Lin, X. Zhao, G. Wang, and Y. Gu. Deep learning-based classification of
hyperspectral data. IEEE Journal of Selected Topics in Applied Earth Observations and Remote
Sensing, 7(6):2094–2107, 2014.
Y. Chen, J. Mairal, Z. Harchaoui, et al. Fast and robust archetypal analysis for representation
learning. In CVPR 2014-IEEE Conference on Computer Vision & Pattern Recognition, 2014.
G. Cheng, J. Han, and X. Lu. Remote sensing image scene classification: Benchmark and state
of the art. Proceedings of the IEEE, 105(10):1865–1883, 2017.
G. Cheng, P. Zhou, and J. Han. Learning rotation-invariant convolutional neural networks for
object detection in VHR optical remote sensing images. IEEE Transactions on Geoscience and
Remote Sensing, 54(12):7405–7415, 2016a.
G. Cheng, P. Zhou, and J. Han. RIFD-CNN: Rotation-invariant and Fisher discriminative
convolutional neural networks for object detection. In CVPR, pages 2884–2893, 2016b.
Y. Cheng, M. Giometto, P. Kauffmann, L. Lin, C. Cao, C. Zupnick, H. Li, Q. Li, R. Abernathey,
and P. Gentine. Deep learning for subgrid-scale turbulence modeling in large-eddy
simulations of the atmospheric boundary layer. arXiv preprint arXiv:1910.12125, 2019.
A.M. Cheriyadat. Unsupervised feature learning for aerial scene classification. IEEE
Transactions on Geoscience and Remote Sensing, 52(1):439–451, Jan 2014. ISSN 0196-2892.
doi: 10.1109/TGRS.2013.2241444.
F. Chevallier, F. Chéruy, N.A. Scott, and A. Chédin. A neural network approach for a fast and
accurate computation of a longwave radiative budget. Journal of Applied Meteorology,
37(11):1385–1397, 1998. doi: 10.1175/1520-0450(1998)037<1385:ANNAFA>2.0.CO;2.
M.A. Cruz, R.L. Thompson, L.E.B. Sampaio, and R.D.A. Bacchi. The use of the Reynolds force
vector in a physics informed machine learning approach for predictive turbulence modeling.
Computers & Fluids, page 104258, 2019.
B.C. Csáji et al. Approximation with artificial neural networks. Faculty of Sciences, Eötvös Loránd
University, Hungary, 24(48):7, 2001.
G. Csurka. Domain Adaptation in Computer Vision Applications. Springer, 2017.
A. Cutler and L. Breiman. Archetypal analysis. Technometrics, 36:338–347, 1994.
C. Cuttler, R.S. Jhangiani, and D.C. Leighton. Research Methods in Psychology. Open Textbook
Library, 2020.
D. Dai and W. Yang. Satellite image classification via two-layer sparse coding with biased image
representation. IEEE Geoscience and Remote Sensing Letters, 8(1):173–176, 2010.
J. Dai, Y. Li, K. He, and J. Sun. R-FCN: object detection via region-based fully convolutional
networks. In NIPS, pages 379–387, 2016.
J. Dai, H. Qi, Y. Xiong, Y. Li, G. Zhang, H. Hu, and Y. Wei. Deformable convolutional networks.
CoRR, abs/1703.06211, 2017.
O.E. Dai, B. Demir, B. Sankur, and L. Bruzzone. A novel system for content-based retrieval
of single and multi-label high-dimensional remote sensing images. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 11(7):2473–2490, July
2018.
N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In 2005 IEEE
Computer Society Conference on Computer Vision and Pattern Recognition (CVPR’05), volume
1, pages 886–893. IEEE, 2005.
B.B. Damodaran, B. Kellenberger, R. Flamary, D. Tuia, and N. Courty. DeepJDOT: Deep joint
distribution optimal transport for unsupervised domain adaptation. In European Conference
on Computer Vision, pages 467–483. Springer International Publishing, 2018.
G. Danabasoglu, J.C. McWilliams, and P.R. Gent. The role of mesoscale tracer transports in the
global ocean circulation. Science, 264(5162): 1123–1126, 1994.
R.C. Daudt, B. Le Saux, A. Boulch, and Y. Gousseau. Multitask learning for large-scale semantic
change detection. Computer Vision and Image Understanding, 187:102783, 2019.
R. Caye Daudt, B. Le Saux, A. Boulch, and Y. Gousseau. Urban change detection for
multispectral earth observation using convolutional neural networks. In IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), Valencia, Spain, 2018.
D.T. Davis, Z. Chen, L. Tsang, J.-N. Hwang, and A.T.C. Chang. Retrieval of snow parameters by
iterative inversion of a neural network. IEEE Transactions on Geoscience and Remote Sensing,
31(4):842–852, 1993.
E. De Bézenac, A. Pajot, and P. Gallinari. Towards a hybrid approach to physical process
modeling. Technical report, 2017.
E. de Bezenac, A. Pajot, and P. Gallinari. Deep learning for physical processes: Incorporating
prior scientific knowledge. Journal of Statistical Mechanics: Theory and Experiment,
2019(12):124009, 2019.
A. de la Fuente, V. Meruane, and C. Meruane. Hydrological early warning system based on a
deep learning runoff model coupled with a meteorological forecast. Water, 11(9):1808, Aug
2019. ISSN 2073-4441. doi: 10.3390/w11091808. URL http://dx.doi.org/10.3390/w11091808.
G.J.M. De Lannoy, R.H. Reichle, P.R. Houser, V.R.N. Pauwels, and N.E.C. Verhoest. Correcting
for forecast bias in soil moisture assimilation with the ensemble Kalman filter. Water
Resources Research, 43(9), Sep 2007. ISSN 00431397. doi: 10.1029/2006WR005449. URL
http://doi.wiley.com/10.1029/2006WR005449.
D.P. Dee, S.M. Uppala, A.J. Simmons, P. Berrisford, P. Poli, S. Kobayashi, U. Andrae, M.A.
Balmaseda, G. Balsamo, P. Bauer, P. Bechtold, A.C.M. Beljaars, L. van de Berg, J. Bidlot,
N. Bormann, C. Delsol, R. Dragani, M. Fuentes, A.J. Geer, L. Haimberger, S.B. Healy, H.
Hersbach, E. V. Hólm, L. Isaksen, P. Kållberg, M. Köhler, M. Matricardi, A.P. McNally, B.M.
Monge-Sanz, J.-J. Morcrette, B.-K. Park, C. Peubey, P. de Rosnay, C. Tavolato, J.-N. Thépaut,
and F. Vitart. The ERA-Interim reanalysis: configuration and performance of the data
assimilation system. Quarterly Journal of the Royal Meteorological Society, 137(656):
553–597, Apr 2011. ISSN 00359009. doi: 10.1002/qj.828. URL http://doi.wiley.com/10.1002/
qj.828.
B. Demir and L. Bruzzone. A novel active learning method in relevance feedback for
content-based remote sensing image retrieval. IEEE Transactions on Geoscience and Remote
Sensing, 53(5):2323–2334, May 2015.
B. Demir and L. Bruzzone. Hashing-based scalable remote sensing image search and retrieval
in large archives. IEEE Transactions on Geoscience and Remote Sensing, 54(2):892–904,
February 2016.
I. Demir, K. Koperski, D. Lindenbaum, G. Pang, J. Huang, S. Basu, F. Hughes, D. Tuia, and
R. Raskar. DeepGlobe 2018: A challenge to parse the Earth through satellite images. 2018
IEEE/CVF Conference on Computer Vision and Pattern Recognition Workshops (CVPRW),
Jun 2018. doi: 10.1109/cvprw.2018.00031. URL http://dx.doi.org/10.1109/CVPRW.2018
.00031.
J. Deng, W. Dong, R. Socher, L.J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical
image database. In IEEE Conference on Computer Vision and Pattern Recognition, pages
248–255. IEEE, 2009.
J. Deng, Z. Zhang, E. Marchi, and B. Schuller. Sparse autoencoder-based feature transfer
learning for speech emotion recognition. In 2013 Humaine Association Conference on
Affective Computing and Intelligent Interaction, pages 511–516. IEEE, 2013.
Z. Deng, H. Sun, S. Zhou, J. Zhao, and H. Zou. Toward fast and accurate vehicle detection in
aerial images using coupled region-based convolutional neural networks. J-STARS,
10(8):3652–3664, 2017.
Z. Deng, H. Sun, S. Zhou, and J. Zhao. Learning deep ship detector in SAR images from scratch.
IEEE Transactions on Geoscience and Remote Sensing, 57(6):4021–4039, 2019.
E. Denton and R. Fergus. Stochastic video generation with a learned prior. In International
Conference on Machine Learning, pages 1174–1183, 2018.
R. Dian, S. Li, A. Guo, and L. Fang. Deep hyperspectral image sharpening. IEEE Transactions on
Neural Networks and Learning Systems, 29(11):5345–5355, 2018.
J. Ding, N. Xue, Y. Long, G.-S. Xia, and Q. Lu. Learning RoI transformer for oriented object
detection in aerial images. In CVPR, pages 2849–2858, 2019.
I.D. Dobreva and A.G. Klein. Fractional snow cover mapping through artificial neural network
analysis of MODIS surface reflectance. Remote Sensing of Environment, 115(12):3355–3366,
2011.
C. Doersch. Tutorial on variational autoencoders, 2016. URL http://arxiv.org/abs/1606.05908.
C. Doersch, A. Gupta, and A.A. Efros. Unsupervised visual representation learning by context
prediction. In International Conference on Computer Vision (ICCV), 2015.
J. Donahue, L.A. Hendricks, S. Guadarrama, M. Rohrbach, S. Venugopalan, K. Saenko, and
T. Darrell. Long-term recurrent convolutional networks for visual recognition and
description. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 2625–2634, 2015.
G. Dong, W. Huang, W.A.P. Smith, and P. Ren. A shadow constrained conditional generative
adversarial net for SRTM data restoration. Remote Sensing of Environment, 237:111602, 2020.
J. Dong, R. Yin, X. Sun, Q. Li, Y. Yang, and X. Qin. Inpainting of remote sensing SST images with
deep convolutional generative adversarial network. IEEE Geoscience and Remote Sensing
Letters, 16(2):173–177, Feb 2019.
Y. Dong, W. Jiao, T. Long, L. Liu, G. He, Ch. Gong, and Y. Guo. Local deep descriptor for remote
sensing image feature matching. Remote Sensing, 11(4), 2019. ISSN 2072-4292. doi:
10.3390/rs11040430.
R.H. Douglas. The stormy weather group (Canada). In Radar in Meteorology, pages 61–68.
Springer, 1990.
J. Duchi, E. Hazan, and Y. Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.
P. Dueben, P. Bauer, J.-N. Thepaut, V.-H. Peuch, A. Geer, and S. English. Machine learning at
ECMWF. ECMWF Memorandum, 2019.
P.D. Dueben and P. Bauer. Challenges and design choices for global weather and climate
models based on machine learning. Geoscientific Model Development, 11(10):3999–4009, Oct
2018. ISSN 19919603. doi: 10.5194/gmd-11-3999-2018.
V. Dumoulin and F. Visin. A guide to convolution arithmetic for deep learning. arXiv preprint
arXiv:1603.07285, 2016.
K. Duraisamy, G. Iaccarino, and H. Xiao. Turbulence modeling in the age of data. Annual
Review of Fluid Mechanics, 51(1):357–377, 2019. doi: 10.1146/annurev-fluid-010518-040547.
M. Elad. Sparse and Redundant Representations: From Theory to Applications in Signal and
Image Processing. Springer Science & Business Media, 2010.
M. Elad and M. Aharon. Image denoising via sparse and redundant representations over
learned dictionaries. IEEE Transactions on Image processing, 15(12):3736–3745, 2006.
J.L. Elman. Finding structure in time. Cognitive Science, 14(2): 179–211, 1990. doi:
10.1016/0364-0213(90)90002-E.
A. Elshamli, G.W. Taylor, A. Berg, and S. Areibi. Domain adaptation using representation
learning for the classification of remote sensing images. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, 10(9):4198–4209, Sep. 2017.
S. En, A. Lechervy, and F. Jurie. TS-Net: Combining modality specific and common features for
multimodal patch matching. In 2018 25th IEEE International Conference on Image Processing
(ICIP), pages 3024–3028, 2018. doi: 10.1109/icip.2018.8451804.
G. Evensen. Sequential data assimilation with a nonlinear quasi-geostrophic model using
Monte Carlo methods to forecast error statistics. Journal of Geophysical Research,
99(C5):10143, 1994. ISSN 0148-0227. doi: 10.1029/94JC00572. URL http://doi.wiley.com/10
.1029/94JC00572.
M. Everingham, L. Van Gool, C. Williams, J. Winn, and A. Zisserman. The pascal visual object
classes (VOC) challenge. International Journal of Computer Vision, 88(2):303–338, 2010.
G. Eynard-Bontemps, R. Abernathey, J. Hamman, A. Ponte, and W. Rath. The Pangeo big data
ecosystem and its use at CNES. In Big Data from Space (BiDS’19): Turning Data into Insights,
19–21 February 2019, Munich, Germany, 2019.
V. Eyring, S. Bony, G.A. Meehl, C.A. Senior, B. Stevens, R.J. Stouffer, and K.E. Taylor. Overview
of the coupled model intercomparison project phase 6 (CMIP6) experimental design and
organization. Geoscientific Model Development, 9(5):1937–1958, 2016a. doi:
10.5194/gmd-9-1937-2016. URL https://www.geosci-model-dev.net/9/1937/2016/.
V. Eyring, S. Bony, G.A. Meehl, C.A. Senior, B. Stevens, R.J. Stouffer, and K.E. Taylor. Overview
of the coupled model intercomparison project phase 6 (CMIP6) experimental design and
organization. Geoscientific Model Development (Online), 9 (LLNL-JRNL-736881), 2016b.
V. Eyring, M. Righi, A. Lauer, M. Evaldsson, S. Wenzel, C. Jones, A. Anav,
O. Andrews, I. Cionni, and E.L. Davin. ESMValTool (v1.0) – a community diagnostic and
performance metrics tool for routine evaluation of Earth system models in CMIP.
Geoscientific Model Development, 9:1747–1802, 2016c.
S. Falkner, A. Klein, and F. Hutter. Bohb: Robust and efficient hyperparameter optimization at
scale. arXiv preprint arXiv:1807.01774, 2018.
H. Fan, M. Jiang, L. Xu, H. Zhu, J. Cheng, and J. Jiang. Comparison of long short term memory
networks and the hydrological model in runoff simulation. Water, 12(1):175, Jan 2020. ISSN
2073-4441. doi: 10.3390/w12010175. URL http://dx.doi.org/10.3390/w12010175.
K. Fang and C. Shen. Near-real-time forecast of satellite-based soil moisture using long
short-term memory with an adaptive data integration kernel. Journal of Hydrometeorology,
pages JHM–D–19–0169.1, Jan 2020. ISSN 1525-755X. doi: 10.1175/JHM-D-19-0169.1. URL
http://journals.ametsoc.org/doi/10.1175/JHM-D-19-0169.1.
K. Fang, C. Shen, D. Kifer, and X. Yang. Prolongation of SMAP to spatio-temporally seamless
coverage of continental US using a deep learning neural network. Geophysical Research
Letters, 44:11030–11039, 2017. doi: 10.1002/2017GL075619. URL https://arxiv.org/abs/1707
.06611.
K. Fang, M. Pan, and C. Shen. The value of SMAP for long-term soil moisture estimation with
the help of deep learning. IEEE Transactions on Geoscience and Remote Sensing, pages 1–13,
2018. ISSN 0196-2892. doi: 10.1109/TGRS.2018.2872131. URL https://ieeexplore.ieee.org/
document/8497052/.
K. Fang, D. Kifer, K. Lawson, and C. Shen. Evaluating the potential and challenges of an
uncertainty quantification method for long short-term memory models for soil moisture
predictions. Water Resources Research, 2020. doi: 10.1029/2020WR028095.
K. Fang, W.-P. Tsai, X. Ji, K. Lawson, and C. Shen. Revealing causal controls of storage-streamflow
relationships with a data-centric Bayesian framework combining machine learning and
process-based modeling. Frontiers in Water, 2020. doi: 10.3389/frwa.2020.583000.
W. Fang, C. Wang, X. Chen, W. Wan, H. Li, S. Zhu, Y. Fang, B. Liu, and Y. Hong. Recognizing
global reservoirs from Landsat 8 images: A deep learning approach. IEEE Journal of Selected
Topics in Applied Earth Observations and Remote Sensing, 12(9):3168–3177, Sep. 2019. ISSN
2151-1535. doi: 10.1109/JSTARS.2019.2929601.
P. Friedlingstein, M. Meinshausen, V.K. Arora, C.D. Jones, A. Anav, S.K. Liddicoat, and
R. Knutti. Uncertainties in CMIP5 climate projections due to carbon cycle feedbacks. Journal
of Climate, 27(2):511–526, 2014. doi: 10.1175/JCLI-D-12-00579.1. URL https://doi.org/10
.1175/JCLI-D-12-00579.1.
G. Fu, C. Liu, R. Zhou, T. Sun, and Q. Zhang. Classification for high resolution remote sensing
imagery using a fully convolutional network. Remote Sensing, 9(5):498, 2017.
Y. Fu, T. Zhang, Y. Zheng, D. Zhang, and H. Huang. Hyperspectral image super-resolution with
optimized RGB guidance. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 11661–11670, 2019.
O. Fuhrer, T. Chadha, T. Hoefler, G. Kwasniewski, X. Lapillonne, D. Leutwyler, D. Lüthi,
C. Osuna, C. Schär, T.C. Schulthess, and H. Vogt. Near-global climate simulation at 1km
resolution: establishing a performance baseline on 4888 GPUs with COSMO 5.0. Geoscientific
Model Development, 11 (4):1665–1681, 2018. doi: 10.5194/gmd-11-1665-2018. URL https://
www.geosci-model-dev.net/11/1665/2018/.
K. Fukushima. Neocognitron: A self-organizing neural network model for a mechanism of
pattern recognition unaffected by shift in position. Biological Cybernetics, 36(4):193–202,
1980.
K. Funahashi and Y. Nakamura. Approximation of dynamical systems by continuous time
recurrent neural networks. Neural Networks, 6(6):801–806, 1993. ISSN 0893-6080. doi:
https://doi.org/10.1016/S0893-6080(05)80125-X. URL http://www.sciencedirect.com/
science/article/pii/S089360800580125X.
G.B. Goldstein. False-alarm regulation in log-normal and Weibull clutter. IEEE Transactions on
Aerospace and Electronic Systems, 9(1):84–92, 1973.
Y. Gal and Z. Ghahramani. A theoretically grounded application of dropout in recurrent neural
networks. arxiv preprint, Dec 2015. URL http://arxiv.org/abs/1512.05287.
Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and
V. Lempitsky. Domain-adversarial training of neural networks. Journal of Machine Learning
Research, 17(1):2096–2030, 2016.
Y. Gao, F. Gao, J. Dong, and S. Wang. Transferred deep learning for sea ice change detection
from synthetic-aperture radar images. IEEE Geoscience and Remote Sensing Letters,
16(10):1655–1659, Oct 2019. doi: 10.1109/LGRS.2019.2906279.
A. Garzelli, F. Nencini, and L. Capobianco. Optimal MMSE pan sharpening of very high
resolution multispectral images. IEEE Transactions on Geoscience and Remote Sensing,
46(1):228–236, 2007.
C. Gatebe, W. Li, N. Chen, Y. Fan, R. Poudyal, L. Brucker, and K. Stamnes. Snow-covered area
using machine learning techniques. In IGARSS 2018-2018 IEEE International Geoscience and
Remote Sensing Symposium, pages 6291–6293. IEEE, 2018.
P. Gentine, M. Pritchard, S. Rasp, G. Reinaudi, and G. Yacalis. Could machine learning break
the convection parameterization deadlock? Geophysical Research Letters, 45(11):5742–5751,
Jun 2018. ISSN 19448007. doi: 10.1029/2018GL078202.
F.A. Gers and J. Schmidhuber. Recurrent nets that time and count. In IEEE-INNS-ENNS
International Joint Conference on Neural Networks (IJCNN), volume 3, pages 189–194, 2000.
doi: 10.1109/IJCNN.2000.861302.
F.A. Gers and J. Schmidhuber. LSTM recurrent networks learn simple context-free and
context-sensitive languages. IEEE Transactions on Neural Networks, 12(6):1333–1340, 2001.
ISSN 1941-0093. doi: 10.1109/72.963769.
F.A. Gers, J. Schmidhuber, and F. Cummins. Learning to forget: Continual prediction with
LSTM. Neural Computation, 12(10):2451–2471, 2000. doi: 10.1162/089976600300015015.
F.A. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent
networks. Journal of Machine Learning Research, 3:115–143, 2003. doi:
10.1162/153244303768966139.
P. Ghamisi and N. Yokoya. IMG2DSM: Height simulation from single imagery using
conditional generative adversarial net. IEEE Geoscience and Remote Sensing Letters,
15(5):794–798, May 2018.
P. Ghamisi, B. Rasti, N. Yokoya, Q. Wang, B. Hofle, L. Bruzzone, F. Bovolo, M. Chi, K. Anders,
R. Gloaguen, et al. Multisource and multitemporal data fusion in remote sensing: A
comprehensive review of the state of the art. IEEE Geoscience and Remote Sensing Magazine,
7(1):6–39, 2019.
P.B. Gibson, S.E. Perkins-Kirkpatrick, P. Uotila, A.S. Pepler, and L.V. Alexander. On the use of
self-organizing maps for studying climate extremes. Journal of Geophysical Research:
Atmospheres, 122(7):3891–3903, 2017.
S. Gidaris, P. Singh, and N. Komodakis. Unsupervised representation learning by predicting
image rotations. In International Conference on Learning Representations (ICLR), 2018.
R.C. Gilbert, M.B. Richman, T.B. Trafalis, and L.M. Leslie. Machine learning methods for data
assimilation. In Intelligent Engineering Systems through Artificial Neural Networks, Volume
20, pages 105–112. ASME Press, Nov 2010. doi: 10.1115/1.859599.paper14.
N. Girard, G. Charpiat, and Y. Tarabalka. Aligning and updating cadaster maps with aerial
images by multi-task, multi-resolution deep learning. In C.V. Jawahar, Hongdong Li, Greg
Mori, and Konrad Schindler, editors, Computer Vision – ACCV 2018, pages 675–690, Cham,
2019. Springer International Publishing. ISBN 978-3-030-20873-8.
R. Girshick. Fast R-CNN. In CVPR, pages 1440–1448, 2015.
R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object
detection and semantic segmentation. In CVPR, pages 580–587, 2014.
X. Glorot and Y. Bengio. Understanding the difficulty of training deep feedforward neural
networks. In International Conference on Artificial Intelligence and Statistics, pages 249–256,
2010.
G.B. Goh, N.O. Hodas, and A. Vishnu. Deep learning for computational chemistry. Journal of
Computational Chemistry, 38(16):1291–1307, Jun 2017. ISSN 01928651. doi:
10.1002/jcc.24764. URL http://doi.wiley.com/10.1002/jcc.24764.
C. Goller and A. Küchler. Learning task-dependent distributed representations by
backpropagation through structure. In International Conference on Neural Networks (ICNN),
volume 1, pages 347–352, Jun 1996. doi: 10.1109/ICNN.1996.548916.
L. Gómez-Chova, D. Tuia, G. Moser, and G. Camps-Valls. Multimodal classification of remote
sensing images: A review and future directions. Proceedings of the IEEE, 103(9):1560–1584,
Sep 2015. ISSN 0018-9219. doi: 10.1109/JPROC.2015.2449668.
L. Gómez-Chova, G. Mateo-García, J. Muñoz-Marí, and G. Camps-Valls. Cloud detection
machine learning algorithms for PROBA-V. In 2017 IEEE International Geoscience and Remote
Sensing Symposium (IGARSS), pages 2251–2254. IEEE, 2017.
M. Gong, X. Niu, P. Zhang, and Z. Li. Generative adversarial networks for change detection in
multispectral imagery. IEEE Geoscience and Remote Sensing Letters, 14(12):2310–2314, Dec
2017.
L. Gonog and Y. Zhou. A review: Generative adversarial networks. In 2019 14th IEEE
Conference on Industrial Electronics and Applications (ICIEA), pages 505–510, 2019.
R.C. Gonzalez and R.E. Woods. Digital Image Processing (3rd Edition). Prentice-Hall, Inc.,
Upper Saddle River, NJ, USA, 2006. ISBN 013168728X.
I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and
Y. Bengio. Generative adversarial nets. In Advances in Neural Information Processing
Systems, pages 2672–2680, 2014a.
I. Goodfellow, Y. Bengio, and A. Courville. Deep Learning. Book in preparation for MIT press,
2016. URL http://www.deeplearningbook.org.
I.J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A.C. Courville,
and Y. Bengio. Generative adversarial networks. CoRR, abs/1406.2661, 2014b.
R.S. Govindaraju. Artificial neural networks in hydrology. II: Hydrologic applications. Journal
of Hydrologic Engineering, 5(2):124–137, 2000. doi: 10.1061/(ASCE)1084-0699(2000)5:2(124).
A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural
networks. In Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International
Conference on, pages 6645–6649. IEEE, 2013.
A. Graves and J. Schmidhuber. Framewise phoneme classification with bidirectional LSTM and
other neural network architectures. Neural Networks, 18(5):602–610, 2005. doi:
10.1016/j.neunet.2005.06.042.
A. Graves and J. Schmidhuber. Offline handwriting recognition with multidimensional
recurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors,
Advances in Neural Information Processing Systems 21, pages 545–552. Curran Associates,
Inc., 2009. URL http://papers.nips.cc/paper/3449-offline-handwriting-recognition-with-
multidimensional-recurrent-neural-networks.pdf.
A.A. Green, M. Berman, P. Switzer, and M.D. Craig. A transformation for ordering
multispectral data in terms of image quality with implications for noise removal. IEEE
Transactions on Geoscience and Remote Sensing, 26(1):65–74, 1988.
J.K. Green, S.I. Seneviratne, A.M. Berg, K.L. Findell, S. Hagemann, D.M. Lawrence, and
P. Gentine. Large influence of soil moisture on long-term terrestrial carbon uptake. Nature,
565:476–479, 2019. doi: 10.1038/s41586-018-0848-x.
K. Greff, R.K. Srivastava, J. Koutník, B.R. Steunebrink, and J. Schmidhuber. LSTM: A
search space odyssey. IEEE Transactions on Neural Networks and Learning Systems (TNNLS),
28(10):2222–2232, 2017. doi: 10.1109/TNNLS.2016.2582924.
K. Gregor, I. Danihelka, A. Graves, D. Rezende, and D. Wierstra. DRAW: A recurrent neural
network for image generation. In F. Bach and D. Blei, editors, Proceedings of the 32nd
International Conference on Machine Learning, volume 37 of Proceedings of Machine Learning
Research, pages 1462–1471, Lille, France, 07–09 Jul 2015. PMLR.
S.M. Griffies, M. Winton, W.G. Anderson, R. Benson, T.L. Delworth, C.O. Dufour, J.P. Dunne,
P. Goddard, A.K. Morrison, A. Rosati, et al. Impacts on ocean heat from transient mesoscale
eddies in a hierarchy of climate models. Journal of Climate, 28(3): 952–977, 2015.
F. Groh, P. Wieschollek, and H.P.A. Lensch. Flex-convolution (million-scale point-cloud
learning beyond grid-worlds). In Asian Conference on Computer Vision (ACCV), 2018.
J.H. Ham, D.D. Lee, S. Mika, and B. Schölkopf. A kernel view of the dimensionality reduction
of manifolds. Departmental Papers (ESE), page 93, 2004.
X. Han, T. Leung, Y. Jia, R. Sukthankar, and A.C. Berg. MatchNet: Unifying feature and
metric learning for patch-based matching. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2015.
X. Han, B. Shi, and Y. Zheng. SSF-CNN: Spatial and spectral fusion with CNN for hyperspectral
image super-resolution. In Proceedings of the 25th IEEE Conference on Image Processing
(ICIP), pages 2506–2510, 2018.
Y. Han, Y. Gao, Y. Zhang, J. Wang, and S. Yang. Hyperspectral sea ice image classification based
on the spectral-spatial-joint feature with deep learning. Remote Sensing, 11(18), 2019. doi:
10.3390/rs11182170.
B. Hariharan, P. Arbeláez, R. Girshick, and J. Malik. Hypercolumns for object segmentation
and fine-grained localization. In Computer Vision and Pattern Recognition (CVPR), 2015.
S.A. Harris, H.M. French, J.A. Heginbottom, G.H. Johnston, B. Ladanyi, D.C. Sego, and R.O.
van Everdingen. Glossary of permafrost and related ground-ice terms. Technical
Memorandum of The National Research Council of Canada, Ottawa, 1988.
S. Hatfield, M. Chantry, P. Dueben, and T. Palmer. Accelerating high-resolution weather
models with deep-learning hardware. In Proceedings of the Platform for Advanced Scientific
Computing Conference, PASC’2019, pages 1:1–1:11, New York, NY, USA, 2019. ACM. ISBN
978-1-4503-6770-7. doi: 10.1145/3324989.3325711. URL http://doi.acm.org/10.1145/3324989
.3325711.
J.B. Haurum, C.H. Bahnsen, and T.B. Moeslund. Is it raining outside? Detection of rainfall
using general-purpose surveillance cameras, 2019. URL https://vbn.aau.dk/en/publications/
is-it-raining-outside-detection-of-rainfall-using-general-purpose.
H. He, M. Chen, T. Chen, and D. Li. Matching of remote sensing images with complex
background variations via Siamese convolutional neural network. Remote Sensing, 10(2),
2018. ISSN 2072-4292. doi: 10.3390/rs10020355.
H. He, M. Chen, T. Chen, D. Li, and P. Cheng. Learning to match multitemporal optical
satellite images using multi-support-patches Siamese networks. Remote Sensing Letters,
10(6):516–525, 2019a. doi: 10.1080/2150704X.2019.1577572.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In 2016 IEEE
Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, June 2016a.
K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Computer
Vision and Pattern Recognition (CVPR), pages 770–778, 2016b.
K. He, G. Gkioxari, P. Dollár, and R. Girshick. Mask R-CNN. In ICCV, pages 2980–2988. IEEE,
2017.
Q. He, D. Barajas-Solano, G. Tartakovsky, and A.M. Tartakovsky. Physics-informed neural
networks for multiphysics data assimilation with application to subsurface transport.
Advances in Water Resources, 141:103610, Jul 2020. ISSN 0309-1708. doi: 10.1016/
J.ADVWATRES.2020.103610. URL https://www.sciencedirect.com/science/
article/pii/S0309170819311649.
T.-L. He, D.B.A. Jones, B. Huang, Y. Liu, K. Miyazaki, Z. Jiang, E.C. White, H.M. Worden, and
J.R. Worden. Recurrent U-net: Deep learning to predict daily summertime ozone in the
United States. Aug 2019b. URL http://arxiv.org/abs/1908.05841.
Y. He, K. Kavukcuoglu, Y. Wang, A. Szlam, and Y. Qi. Unsupervised feature learning by deep
sparse coding. In Proceedings of the SIAM International Conference on Data Mining, pages
902–910, 2014.
E. Hernández, V. Sanchez-Anguix, V. Julian, J. Palanca, and N. Duque. Rainfall prediction:
A deep learning approach. In International Conference on Hybrid Artificial Intelligence
Systems, pages 151–162. Springer, 2016.
B.C. Hewitson and R.G. Crane. Self-organizing maps: applications to synoptic climatology.
Climate Research, 22(1):13–26, 2002.
T.D. Hewson. Objective fronts. Meteorological Applications, 5(1):37–65, 1998. ISSN 1469-8080.
doi: 10.1017/S1350482798000553. URL http://dx.doi.org/10.1017/S1350482798000553.
A. Heye, K. Venkatesan, and J. Cain. Precipitation nowcasting: Leveraging deep recurrent
convolutional neural networks. Technical report, 2017.
H. Shimodaira. Improving predictive inference under covariate shift by weighting the
log-likelihood function. Journal of Statistical Planning and Inference, 90 (2):227–244, 2000.
G.E. Hinton and R.R. Salakhutdinov. Reducing the dimensionality of data with neural
networks. Science, 313(5786):504–507, July 2006a.
G.E. Hinton and R.S. Zemel. Autoencoders, minimum description length and Helmholtz free
energy. In Neural Information Processing Systems, 1994.
G.E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets. Neural
Computation, 18(7):1527–1554, July 2006.
G.E. Hinton. A parallel computation that assigns canonical object-based frames of reference.
In Proceedings of the 7th International Joint Conference on Artificial Intelligence – Volume 2,
IJCAI’81, pages 683–685, San Francisco, CA, USA, 1981. Morgan Kaufmann Publishers
Inc.
T. Hoberg, F. Rottensteiner, R. Queiroz Feitosa, and C. Heipke. Conditional random fields for
multitemporal and multiscale classification of optical satellite imagery. IEEE Transactions
on Geoscience and Remote Sensing (TGRS), 53(2):659–673, 2015. doi:
10.1109/TGRS.2014.2326886.
J. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Technical
University of Munich, 1991.
S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural Computation,
9(8):1735–1780, 1997. doi: 10.1162/neco.1997.9.8.1735.
S. Hochreiter, Y. Bengio, P. Frasconi, and J. Schmidhuber. Gradient flow in recurrent nets: the
difficulty of learning long-term dependencies. In S.C. Kremer and J.F. Kolen, editors, A Field
Guide to Dynamical Recurrent Neural Networks, pages 237–244. IEEE Press, Piscataway, NJ,
USA, 2001.
J. Hoffman, E. Tzeng, T. Park, J.-Y. Zhu, P. Isola, K. Saenko, A. Efros, and T. Darrell. CyCADA:
Cycle-consistent adversarial domain adaptation. In International Conference on Machine
Learning 2018, pages 1989–1998, July 2018. URL http://proceedings.mlr.press/v80/
hoffman18a.html.
E.J. Hoffmann, Y. Wang, M. Werner, J. Kang, and X.X. Zhu. Model fusion for building type
classification from aerial and street view images. Remote Sensing, 11(11):1259, 2019a.
E.J. Hoffmann, M. Werner, and X.X. Zhu. Building instance classification using social media
images. In 2019 Joint Urban Remote Sensing Event (JURSE), pages 1–4. IEEE, 2019b.
R.J. Hogan, C.A.T. Ferro, I.T. Jolliffe, and D.B. Stephenson. Equitability revisited: Why the
“equitable threat score” is not equitable. Weather and Forecasting, 25(2):710–726, 2010.
D. Hong, N. Yokoya, J. Chanussot, and X. Zhu. An augmented linear mixing model to address
spectral variability for hyperspectral unmixing. IEEE Transactions on Image Processing,
28(4):1923–1938, 2019.
C. Hope. The $10 trillion value of better information about the transient climate response.
Philosophical Transactions of the Royal Society A, 373: 20140429, 2015. URL http://dx.doi
.org/10.1098/rsta.2014.0429.
M. Horn, K. Walsh, M. Zhao, S.J. Camargo, E. Scoccimarro, H. Murakami, H. Wang,
A. Ballinger, A. Kumar, D.A. Shaevitz, J.A. Jonas, and K. Oouchi. Tracking scheme
dependence of simulated tropical cyclone response to idealized climate simulations. Journal
of Climate, 27(24):9197–9213, 2014a. doi: 10.1175/JCLI-D-14-00200.1. URL https://doi.org/
10.1175/JCLI-D-14-00200.1.
H. Hotelling. Analysis of a complex of statistical variables into principal components. Journal
of Educational Psychology, 24(6):417, 1933.
B. Hou, Q. Liu, H. Wang, and Y. Wang. From W-Net to CDGAN: Bitemporal change detection
via deep learning techniques. IEEE Transactions on Geoscience and Remote Sensing, pages
1–13, 2019.
R.A. Houze Jr. Mesoscale convective systems. Reviews of Geophysics, 42(4), 2004.
F. Hu, X. Tong, G. Xia, and L. Zhang. Delving into deep representations for remote sensing
image retrieval. In IEEE International Conference on Signal Processing, pages 198–203,
November 2016.
Y. Hua, L. Mou, and X.X. Zhu. Relation network for multilabel aerial image classification. IEEE
Transactions on Geoscience and Remote Sensing, 2020.
B. Huang, K. Zhang, Y. Lin, B. Schölkopf, and C. Glymour. Generalized score functions for
causal discovery. Pages 1551–1560, Jul 2018a.
C. Huang, H. Ai, Y. Li, and S. Lao. High-performance rotation invariant multiview face
detection. IEEE TPAMI, 29(4):671–686, 2007.
K. Huang, J. Xia, Y. Wang, A. Ahlström, J. Chen, R.B. Cook, E. Cui, Y. Fang, J.B. Fisher, D.N.
Huntzinger, Z. Li, A.M. Michalak, Y. Qiao, K. Schaefer, C. Schwalm, J. Wang, Y. Wei, X. Xu,
L. Yan, C. Bian, and Y. Luo. Enhanced peak growth of global vegetation and its key
mechanisms. Nature Ecology & Evolution, 2(12):1897–1905, December 2018b. ISSN
2397-334X. doi: 10.1038/s41559-018-0714-0.
L. Huang, J. Luo, Z. Lin, F. Niu, and L. Liu. Using deep learning to map retrogressive thaw
slumps in the Beiluhe region (Tibetan Plateau) from CubeSat images. Remote Sensing of
Environment, 237:111534, 2020. doi: https://doi.org/10.1016/j.rse.2019.111534.
L. Huang, L. Liu, L. Jiang, and T. Zhang. Automatic mapping of thermokarst landforms from
remote sensing images using deep learning: A case study in the northeastern Tibetan
plateau. Remote Sensing, 10 (12), 2018. doi: 10.3390/rs10122067.
R. Huang, H. Taubenböck, L. Mou, and X.X. Zhu. Classification of settlement types from tweets
using LDA and LSTM. In IGARSS 2018-2018 IEEE International Geoscience and Remote
Sensing Symposium, pages 6408–6411. IEEE, 2018c.
X. Huang and S. Belongie. Arbitrary style transfer in real-time with adaptive instance
normalization. In IEEE International Conference on Computer Vision, pages 1501–1510, 2017.
X. Ji, L. Lesack, J.M. Melack, S. Wang, W.J. Riley, and C. Shen. Seasonal and inter-annual
patterns and controls of hydrological fluxes in an Amazon floodplain lake with a
surface-subsurface processes model. Water Resources Research, 55(4):3056–3075, 2019. doi:
10.1029/2018WR023897.
Q. Jia, X. Wan, B. Hei, and S. Li. DispNet based stereo matching for planetary scene depth
estimation using remote sensing images. In 2018 10th IAPR Workshop on Pattern Recognition
in Remote Sensing (PRRS), Aug 2018. doi: 10.1109/PRRS.2018.8486195.
X. Jia, B. De Brabandere, T. Tuytelaars, and L.V. Gool. Dynamic filter networks. In Advances in
Neural Information Processing Systems, pages 667–675, 2016.
J. Jiang, J. Liu, C.-Z. Qin, and D. Wang. Extraction of urban waterlogging depth from video
images using transfer learning. Water, 10(10), 2018a. ISSN 2073-4441. doi:
10.3390/w10101485. URL https://www.mdpi.com/2073-4441/10/10/1485.
K. Jiang, Z. Wang, P. Yi, G. Wang, T. Lu, and J. Jiang. Edge-enhanced GAN for remote sensing
image super-resolution. IEEE Transactions on Geoscience and Remote Sensing,
57(8):5799–5812, Aug 2019.
M. Jiang, Y. Wu, T. Zhao, Z. Zhao, and C. Lu. PointSIFT: A SIFT-like network module for 3D
point cloud semantic segmentation, July 2018b.
S. Jiang, V. Babovic, Y. Zheng, and J. Xiong. Advancing opportunistic sensing in hydrology:
A novel approach to measuring rainfall with ordinary surveillance cameras. Water Resources
Research, 55(4): 3004–3027, 2019a. doi: 10.1029/2018WR024480. URL https://agupubs
.onlinelibrary.wiley.com/doi/abs/10.1029/2018WR024480.
Y. Jiang, X. Zhu, X. Wang, S. Yang, W. Li, H. Wang, P. Fu, and Z. Luo. R2CNN: Rotational region
CNN for orientation robust scene text detection. arXiv:1706.09579, 2017.
Z. Jiang, K. Von Ness, J. Loisel, and Z. Wang. ArcticNet: A deep learning solution to classify
Arctic wetlands, 2019.
D. Jimenez Rezende and S. Mohamed. Variational inference with normalizing flows. May 2015.
Y.-H. Jo, D.-W. Kim, and H. Kim. Chlorophyll concentration derived from microwave remote
sensing measurements using artificial neural network algorithm. Journal of the Academy of
Marketing Science, 26:102–110, 2018.
J.E. Johnson, V. Laparra, M. Piles, and G. Camps-Valls. Gaussianizing the Earth:
Multidimensional information measures for earth data analysis. Submitted.
I.T. Jolliffe. Principal components in regression analysis. In Principal Component Analysis,
pages 129–155. Springer, 1986.
M.T. Jorgenson. Thermokarst terrains. Treatise on Geomorphology, 8:313–324, 2013.
R. Jozefowicz, W. Zaremba, and I. Sutskever. An empirical exploration of recurrent network
architectures. In F. Bach and D. Blei, editors, International Conference on Machine Learning
(ICML), volume 37 of Proceedings of Machine Learning Research (PRML), pages 2342–2350,
2015.
M. Jung, C. Schwalm, M. Migliavacca, S. Walther, G. Camps-Valls, S. Koirala, P. Anthoni, S.
Besnard, P. Bodesheim, N. Carvalhais, F. Chevallier, F. Gans, D.S. Groll, V. Haverd, K. Ichii,
A.K. Jain, J. Liu, D. Lombardozzi, J.E.M.S. Nabel, J.A. Nelson, M. Pallandt, D. Papale,
W. Peters, J. Pongratz, C. Rödenbeck, S. Sitch, G. Tramontana, U. Weber, M. Reichstein,
P. Koehler, M. O’Sullivan, and A. Walker. Scaling carbon fluxes from eddy covariance sites to
Y.J. Kim, H.-C. Kim, D. Han, S. Lee, and J. Im. Prediction of monthly Arctic sea ice
concentrations using satellite and reanalysis data based on convolutional neural networks.
The Cryosphere, 14(3):1083–1104, 2020. doi: 10.5194/tc-14-1083-2020.
D.P. Kingma and M. Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114,
2013.
D.P. Kingma and J. Ba. Adam: A method for stochastic optimization. arXiv preprint
arXiv:1412.6980, 2014.
D.P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on
Learning Representations (ICLR), 2014.
J. Kirkpatrick, R. Pascanu, N. Rabinowitz, J. Veness, G. Desjardins, A.A. Rusu, K. Milan,
J. Quan, T. Ramalho, A. Grabska-Barwinska, D. Hassabis, C. Clopath, D. Kumaran, and
R. Hadsell. Overcoming catastrophic forgetting in neural networks. Proceedings of the
National Academy of Sciences, 114(13): 3521–3526, 2017.
I.A. Klampanos, A. Davvetas, S. Andronopoulos, C. Pappas, A. Ikonomopoulos, and V.
Karkaletsis. Autoencoder-driven weather clustering for source estimation during nuclear
events. Environmental Modelling & Software, 102:84–93, 2018.
B. Klein, L. Wolf, and Y. Afek. A dynamic convolutional layer for short range weather
prediction. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 4840–4848, 2015.
T.R. Knutson, J.J. Sirutis, S.T. Garner, I.M. Held, and R.E. Tuleya. Simulation of the recent
multidecadal increase of Atlantic hurricane activity using an 18-km-grid regional model.
Bulletin of the American Meteorological Society, 88(10):1549–1565, 2007.
S. Kohl, B. Romera-Paredes, C. Meyer, J. De Fauw, J.R. Ledsam, K. Maier-Hein, S.M.
Ali Eslami, D. Jimenez Rezende, and O. Ronneberger. A probabilistic U-Net for segmentation
of ambiguous images. In Advances in Neural Information Processing Systems, pages
6965–6975, 2018.
S. Koirala, P.J.-F. Yeh, Y. Hirabayashi, S. Kanae, and T. Oki. Global-scale land surface
hydrologic modeling with the representation of water table dynamics. Journal of Geophysical
Research: Atmospheres, 119: 75–89, 2014. doi: 10.1002/2013JD020398.
S.J. Kollet and R.M. Maxwell. Integrated surface–groundwater flow modeling: A free-surface
overland flow boundary condition in a parallel groundwater flow model. Advances in Water
Resources, 29(7):945–958, Jul 2006. ISSN 03091708. doi: 10.1016/j.advwatres.2005.08.006.
URL http://dx.doi.org/10.1016/j.advwatres.2005.08.006.
G.J. Kooperman, M.S. Pritchard, M.A. Burt, M.D. Branson, and D.A. Randall. Robust effects of
cloud superparameterization on simulated daily rainfall intensity statistics across multiple
versions of the Community Earth System Model. Journal of Advances in Modeling Earth
Systems, 8(1):140–165, 2016.
E.N. Kornaropoulos, E.I. Zacharaki, P. Zerbib, C. Lin, A. Rahmouni, and N. Paragios.
Deformable group-wise registration using a physiological model: Application to
diffusion-weighted MRI. In 2016 IEEE International Conference on Image Processing (ICIP),
pages 2345–2349, Sep. 2016. doi: 10.1109/ICIP.2016.7532778.
M. Kosmala, K. Hufkens, and A.D. Richardson. Integrating camera imagery, crowdsourcing,
and deep learning to improve high-frequency automated monitoring of snow at
continental-to-global scales. PLoS ONE, 13 (12):1–19, 2018. doi: 10/ggh5m7. URL https://doi
.org/10.1371/journal.pone.0209649.
A. Krizhevsky, I. Sutskever, and G.E. Hinton. ImageNet classification with deep convolutional
neural networks. In F. Pereira, C.J.C. Burges, L. Bottou, and K.Q. Weinberger, editors,
Advances in Neural Information Processing Systems 25, pages 1097–1105. 2012.
K.E. Kunkel, D.R. Easterling, D.A.R. Kristovich, B. Gleason, L. Stoecker, and R. Smith.
Meteorological causes of the secular variations in observed extreme precipitation events for
the conterminous United States. Journal of Hydrometeorology, 13(3):1131–1141, 2012. doi:
10.1175/JHM-D-11-0108.1. URL https://doi.org/10.1175/JHM-D-11-0108.1.
K. Kuppala, S. Banda, and Th. R. Barige. An overview of deep learning methods for image
registration with focus on feature-based approaches. International Journal of Image and
Data Fusion, pages 1–23, 2020. doi: 10.1080/19479832.2019.1707720. URL https://doi.org/10
.1080/19479832.2019.1707720.
T. Kurth, J. Yang, N. Satish, M. Patwary, E. Racah, N. Mitliagkas, I. Sundaram, M. Patwary,
Prabhat, and P. Dubey. Deep learning at 15PF: Supervised and semi-supervised
learning for scientific data. In Supercomputing, 2017.
T. Kurth, S. Treichler, J. Romero, M. Mudigonda, N. Luehr, E. Phillips, A. Mahesh, M.
Matheson, J. Deslippe, M. Fatica, Prabhat, and M. Houston. Exascale deep learning for
climate analytics. In Proceedings of the International Conference for High Performance
Computing, Networking, Storage, and Analysis, SC ‘18, pages 51:1–51:12, Piscataway, NJ, USA,
2018. IEEE Press. doi: 10.1109/SC.2018.00054. URL https://doi.org/10.1109/SC.2018.00054.
K. Kuwata and R. Shibasaki. Estimating corn yield in the United States with MODIS EVI and
machine learning methods. ISPRS Annals of Photogrammetry, Remote Sensing & Spatial
Information Sciences, 3(8), 2016.
A. Lagrange, B. Le Saux, A. Beaupère, A. Boulch, A. Chan-Hon-Tong, S. Herbin,
H. Randrianarivo, and M. Ferecatu. Benchmarking classification of earth-observation
data: From learning explicit features to convolutional networks. In IEEE International
Geoscience and Remote Sensing Symposium (IGARSS), Milan, Italy, 2015.
B. Lakshminarayanan, A. Pritzel, and C. Blundell. Simple and scalable predictive uncertainty
estimation using deep ensembles. In Advances in Neural Information Processing Systems,
pages 6403–6414, 2017.
R. LaLonde, D. Zhang, and M. Shah. ClusterNet: Detecting small objects in large scenes by
exploiting spatio-temporal information. In CVPR, June 2018.
E. Laloy, R. Hérault, D. Jacques, and N. Linde. Training-image based geostatistical inversion
using a spatial generative adversarial neural network. Water Resources Research,
54(1):381–406, 2018. doi: 10.1002/2017WR022148. URL https://agupubs.onlinelibrary.wiley
.com/doi/abs/10.1002/2017WR022148.
D. Lam, R. Kuzma, K. McGee, S. Dooley, M. Laielli, M. Klaric, Y. Bulatov, and
B. McCord. xView: Objects in context in overhead imagery, 2018. URL https://aps.arxiv.org/
abs/1802.07856.
A.M. Lamb, A. Goyal, Y. Zhang, S. Zhang, A.C. Courville, and Y. Bengio.
Professor forcing: A new algorithm for training recurrent networks. In NIPS. 2016.
C. Lanaras, E. Baltsavias, and K. Schindler. Hyperspectral super-resolution with spectral
unmixing constraints. Remote Sensing, 9(11): 1196, 2017.
L. Landrieu and M. Simonovsky. Large-scale point cloud semantic segmentation with
superpoint graphs. In Computer Vision and Pattern Recognition (CVPR), pages 4558–4567,
Salt Lake City, UT, 2018.
Z.L. Langford, J. Kumar, F.M. Hoffman, A.L. Breen, and C.M. Iversen. Arctic vegetation
mapping using unsupervised training datasets and convolutional neural networks. Remote
Sensing, 11(1), 2019. doi: 10.3390/rs11010069.
V. Laparra and R. Santos-Rodríguez. Spatial/spectral information trade-off in hyperspectral
images. In 2015 IEEE International Geoscience and Remote Sensing Symposium (IGARSS),
pages 1124–1127, 2015.
V. Laparra, G. Camps, and J. Malo. Iterative Gaussianization: from ICA to random rotations.
IEEE Transactions on Neural Networks, 22(4):537–549, 2011.
S. Lapuschkin, S. Wäldchen, A. Binder, G. Montavon, W. Samek, and K.-R. Müller. Unmasking
Clever Hans predictors and assessing what machines really learn. Nature Communications,
10:1096, 2019. doi: 10.1038/s41467-019-08987-4. URL http://dx.doi.org/10.1038/s41467-019-
08987-4.
W. Larcher. Physiological Plant Ecology: Ecophysiology and Stress Physiology of Functional
Groups. Springer-Verlag, Berlin Heidelberg, 2003. ISBN 978-3-540-43516-7.
H. Larochelle, Y. Bengio, J. Louradour, and P. Lamblin. Exploring strategies for training deep
neural networks. The Journal of Machine Learning Research, 10:1–40, 2009.
D.A. Lavers, G. Villarini, R.P. Allan, E.F. Wood, and A.J. Wade. The detection of atmospheric
rivers in atmospheric reanalyses and their links to British winter floods and the large-scale
climatic circulation. Journal of Geophysical Research: Atmospheres, 117(D20), 2012.
B.N. Lawrence, M. Rezny, R. Budich, P. Bauer, J. Behrens, M. Carter, W. Deconinck, R. Ford,
C. Maynard, S. Mullerworth, C. Osuna, A. Porter, K. Serradell, S. Valcke, N. Wedi, and S.
Wilson. Crossing the chasm: how to develop weather and climate models for next generation
computers? Geoscientific Model Development, 11(5):1799–1821, 2018. doi:
10.5194/gmd-11-1799-2018. URL https://www.geosci-model-dev.net/11/1799/2018/.
D.M. Lawrence, R.A. Fisher, C.D. Koven, K.W. Oleson, S.C. Swenson, G. Bonan, N. Collier,
B. Ghimire, L. van Kampenhout, D. Kennedy, E. Kluzek, P.J. Lawrence, F. Li, H. Li,
D. Lombardozzi, W.J. Riley, W.J. Sacks, Mingjie Shi, M. Vertenstein, W.R. Wieder, C. Xu, A.A.
Ali, A.M. Badger, G. Bisht, M. van den Broeke, M.A. Brunke, S.P. Burns, J. Buzan, M. Clark,
A. Craig, K. Dahlin, B. Drewniak, J.B. Fisher, M. Flanner, A.M. Fox, P. Gentine, F. Hoffman,
G. Keppel-Aleks, R. Knox, S. Kumar, J. Lenaerts, L.R. Leung, W.H. Lipscomb, Y. Lu,
A. Pandey, J.D. Pelletier, J. Perket, J.T. Randerson, D.M. Ricciuto, B.M. Sanderson, A. Slater,
Z.M. Subin, J. Tang, R.Q. Thomas, M.V. Martin, and X. Zeng. The community land model
version 5: Description of new features, benchmarking, and impact of forcing uncertainty.
Journal of Advances in Modeling Earth Systems, 2019. doi: 10.1029/2018MS001583. URL
https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018MS001583.
Q.V. Le, N. Jaitly, and G.E. Hinton. A simple way to initialize recurrent networks of rectified
linear units. 2015. URL https://aps.arxiv.org/abs/1504.00941.
J. Le Moigne, N. Netanyahu, and R. Eastman. Image Registration for Remote Sensing.
Cambridge University Press, 2011. doi: 10.1017/CBO9780511777684.
B. Le Saux, N. Yokoya, R. Hansch, M. Brown, and G. Hager. 2019 data fusion contest
[technical committees]. IEEE Geoscience and Remote Sensing Magazine, 7(1):103–105,
2019.
T.P. Leahy, F.P. Llopis, M.D. Palmer, and N.H. Robinson. Using neural networks to correct
historical climate observations. Journal of Atmospheric and Oceanic Technology,
35(10):2053–2059, Oct 2018. ISSN 15200426. doi: 10.1175/JTECH-D-18-0012.1.
Z. Li, C. Peng, G. Yu, X. Zhang, Y. Deng, and J. Sun. Light-head R-CNN: In defense of two-stage
object detector. arXiv:1711.07264, 2017.
C. Liang, H. Li, M. Lei, and Q. Du. Dongting Lake water level forecast and its relationship with
the Three Gorges Dam based on a long short-term memory network. Water, 10(10):1389, Oct
2018. ISSN 2073-4441. doi: 10.3390/w10101389. URL http://dx.doi.org/10.3390/w10101389.
X. Liang, X. Shen, J. Feng, L. Lin, and S. Yan. Semantic object parsing with graph LSTM. In
European Conference on Computer Vision (ECCV), pages 125–143, 2016. doi:
10.1007/978-3-319-46448-0_8.
M. Liao, B. Shi, and X. Bai. TextBoxes++: A single-shot oriented scene text detector. IEEE TIP,
27(8):3676–3690, 2018a.
M. Liao, Z. Zhu, B. Shi, G. Xia, and X. Bai. Rotation-sensitive regression for oriented scene text
detection. In CVPR, pages 5909–5918, 2018b.
R. Liao, S. Miao, P. Tournemire, S. Grbic, A. Kamen, T. Mansi, and D. Comaniciu. An artificial
agent for robust image registration. 2016. URL https://aps.arxiv.org/abs/1611.10336.
J.-L. Lin and C.W.J. Granger. Forecasting from non-linear models in practice. Journal of
Forecasting, 13(1):1–9, 1994.
T.-Y. Lin, M. Maire, S. Belongie, J. Hays, P. Perona, D. Ramanan, P. Dollár, and C.L. Zitnick.
Microsoft COCO: Common objects in context. In ECCV, pages 740–755, 2014.
T.-Y. Lin, P. Dollar, R. Girshick, K. He, B. Hariharan, and S. Belongie. Feature pyramid
networks for object detection. In IEEE Conference on Computer Vision and Pattern
Recognition, pages 2117–2125, 2017.
T. Lin, B.G. Horne, P. Tiňo, and C.L. Giles. Learning long-term dependencies in NARX
recurrent neural networks. IEEE Transactions on Neural Networks, 7(6):1329–1338, 1996.
doi: 10.1109/72.548162.
Y. Lin, T. Zhang, S. Zhu, and K. Yu. Deep coding network. In Proceedings of NIPS, pages
1405–1413, 2010.
Y. Lin, H. He, Z. Yin, and F. Chen. Rotation-invariant object detection in remote sensing
images based on radial-gradient angle. IEEE Geoscience and Remote Sensing Letters,
12(4):746–750, 2015.
J. Ling, R. Jones, and J. Templeton. Machine learning strategies for systems with invariance
properties. Journal of Computational Physics, 318:22–35, 2016a.
J. Ling, A. Kurzawski, and J. Templeton. Reynolds averaged turbulence modelling using deep
neural networks with embedded invariance. Journal of Fluid Mechanics, 807:155–166, 2016b.
Z.C. Lipton, J. Berkowitz, and C. Elkan. A critical review of recurrent neural networks for
sequence learning. arXiv:1506.00019 [cs], 2015.
C. Liu, J. Ma, X. Tang, X. Zhang, and L. Jiao. Adversarial hash-code learning for remote sensing
image retrieval. In IEEE International Geoscience and Remote Sensing Symposium, pages
4324–4327, July 2019.
H. Liu and K.C. Jezek. A complete high-resolution coastline of Antarctica extracted from
orthorectified Radarsat SAR imagery. Photogrammetric Engineering & Remote Sensing,
70(5):605–616, 2004.
J. Liu, S. Ji, C. Zhang, and Z. Qin. Evaluation of deep learning based stereo matching methods:
from ground to aerial images. ISPRS – International Archives of the Photogrammetry, Remote
Sensing and Spatial Information Sciences, XLII-2:593–597, May 2018a. doi:
10.5194/isprs-archives-XLII-2-593-2018.
K. Liu and G. Máttyus. Fast multiclass vehicle detection on aerial images. IEEE Geoscience and
Remote Sensing Letters, 12(9):1938–1942, 2015.
L. Liu and B. Lei. Can SAR images and optical images transfer with each other? In IGARSS
2018 – 2018 IEEE International Geoscience and Remote Sensing Symposium, pages
7019–7022, July 2018.
L. Liu, Z. Pan, X. Qiu, and L. Peng. SAR target classification with CycleGAN transferred simulated
samples. In IGARSS 2018 – 2018 IEEE International Geoscience and Remote Sensing
Symposium, pages 4411–4414, July 2018b.
L. Liu, Z. Pan, and B. Lei. Learning a rotation invariant detector with rotatable bounding box.
CoRR, abs/1711.09405, 2017a.
M.Y. Liu and O. Tuzel. Coupled generative adversarial networks. In Advances in Neural
Information Processing Systems, pages 469–477, 2016.
M.Y. Liu, T. Breuel, and J. Kautz. Unsupervised image-to-image translation networks. In
Advances in Neural Information Processing Systems, pages 700–708, 2017b.
P. Liu, J. Wang, A.K. Sangaiah, Y. Xie, and X. Yin. Analysis and prediction of water quality
using LSTM deep neural networks in IoT environment. Sustainability, 11(7), 2019. ISSN
2071-1050. doi: 10.3390/su11072058. URL https://www.mdpi.com/2071-1050/11/7/2058.
W. Liu, F. Su, and X. Huang. Unsupervised adversarial domain adaptation network for
semantic segmentation. IEEE Geoscience and Remote Sensing Letters, pages 1–5, 2019.
W. Liu, D. Anguelov, D. Erhan, C. Szegedy, S. Reed, C.-Y. Fu, and A.C. Berg. SSD: Single shot
multibox detector. arXiv preprint arXiv:1512.02325, 2015.
X. Liu, Y. Wang, and Q. Liu. PSGAN: A generative adversarial network for remote sensing image
pan-sharpening. In 2018 25th IEEE International Conference on Image Processing (ICIP),
pages 873–877, 2018c.
Y. Liu, B. Fan, L. Wang, J. Bai, S. Xiang, and C. Pan. Semantic labeling in very high resolution
images via a self-cascaded convolutional neural network. ISPRS Journal of Photogrammetry
and Remote Sensing, 145:78–95, 2018.
Y. Liu, C.R. Schwalm, K.E. Samuels-Crow, and K. Ogle. Ecological memory of daily carbon
exchange across the globe and its importance in drylands. Ecology Letters, 22:1806–1816,
2019. doi: 10.1111/ele.13363.
Y. Liu, Prabhat, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, and
W. Collins. Application of deep convolutional neural networks for detecting extreme
weather in climate datasets. In Advances in Big Data Analytics, pages 81–88, 2016b.
Y. Liu, Prabhat, E. Racah, J. Correa, A. Khosrowshahi, D. Lavers, K. Kunkel, M. Wehner, and
William Collins. Extreme weather pattern detection using deep convolutional neural
network. In Proceedings of the 6th International Workshop on Climate Informatics, pages
109–112, 2016c.
Z. Liu, H. Wang, L. Weng, and Y. Yang. Ship rotated bounding box space for ship extraction
from high-resolution optical satellite images with complex backgrounds. IEEE Geoscience and
Remote Sensing Letters, 13(8):1074–1078, 2016d.
Z. Liu, J. Hu, L. Weng, and Y. Yang. Rotated region based CNN for ship detection. In ICIP,
pages 900–904. IEEE, 2017c.
D.B. Lobell, A. Sibley, and J.I. Ortiz-Monasterio. Extreme heat effects on wheat senescence in
India. Nature Climate Change, 2:186–189, 2012. doi: 10.1038/nclimate1356.
S. Lobry, J. Murray, D. Marcos, and D. Tuia. Visual question answering from remote sensing
images. In IGARSS 2019-2019 IEEE International Geoscience and Remote Sensing
Symposium, pages 4951–4954. IEEE, 2019.
P.C. Loikith, B.R. Lintner, and A. Sweeney. Characterizing large-scale meteorological patterns
and associated temperature and precipitation extremes over the northwestern United States
using self-organizing maps. Journal of Climate, 30(8):2829–2847, 2017.
L. Loncan, L.B. De Almeida, J.M. Bioucas-Dias, X. Briottet, J. Chanussot, N. Dobigeon, S.
Fabre, W. Liao, G.A. Licciardi, M. Simoes, et al. Hyperspectral pansharpening: A review.
IEEE Geoscience and Remote Sensing Magazine, 3(3):27–46, 2015.
J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
3431–3440, 2015.
E.N. Lorenz. Empirical orthogonal functions and statistical weather prediction. Scientific
Reports 1, Statistical Forecasting Project, 1956.
E.N. Lorenz. Deterministic Nonperiodic Flow. Journal of the Atmospheric Sciences,
20(2):130–141, March 1963. doi: 10.1175/1520-0469(1963)020<0130:DNF>2.0.CO;2.
E.N. Lorenz. The predictability of a flow which possesses many scales of motion. Tellus,
21(3):289–307, 1969. doi: 10.1111/j.2153-3490.1969.tb00444.x. URL https://onlinelibrary
.wiley.com/doi/abs/10.1111/j.2153-3490.1969.tb00444.x .
E.N. Lorenz. Predictability: A problem partly solved. In Proceedings of the Seminar on
Predictability, volume 1, 1996.
D.G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the Seventh
IEEE International Conference on Computer Vision, volume 2, Sep. 1999. doi:
10.1109/ICCV.1999.790410.
X. Lu, Y. Yuan, and X. Zheng. Joint dictionary learning for multispectral change detection.
IEEE Transactions on Cybernetics, 47(4): 884–897, 2016.
T. Luo, K. Kramer, D.B. Goldgof, L.O. Hall, S. Samson, A. Remsen, and T. Hopkins. Active
learning to recognize multiple types of plankton. Journal of Machine Learning Research,
6:589–613, 2005.
B. Lusch, J.N. Kutz, and S.L. Brunton. Deep learning for universal linear embeddings of
nonlinear dynamics. Nature Communications, 9(1):4950, 2018.
N. Lv, C. Chen, T. Qiu, and A.K. Sangaiah. Deep learning and superpixel feature extraction
based on contractive autoencoder for change detection in SAR images. IEEE Transactions on
Industrial Informatics, 14(12):5530–5538, 2018.
M. Kang, K. Ji, X. Leng and Z. Lin. Contextual region-based convolutional neural network with
multilayer fusion for SAR ship detection. Remote Sensing, 9(8):860, 2017.
H.-Y. Ma, S. Xie, S.A. Klein, K.D. Williams, J.S. Boyle, S. Bony, H. Douville, S. Fermepin, B.
Medeiros, S. Tyteca, M. Watanabe, and D. Williamson. On the correspondence between
mean forecast errors and climate errors in CMIP5 Models. Journal of Climate,
27(4):1781–1798, Nov 2013. ISSN 0894-8755. doi: 10.1175/JCLI-D-13-00474.1. URL https://
doi.org/10.1175/JCLI-D-13-00474.1.
J. Ma, W. Shao, H. Ye, L. Wang, H. Wang, Y. Zheng, and X. Xue. Arbitrary-oriented scene text
detection via rotation proposals. IEEE Transactions on Multimedia, 20(11):3111–3122, 2018.
K. Ma, D. Feng, K. Lawson, W.-P. Tsai, C. Liang, X. Huang, A. Sharma, and C. Shen.
Transferring hydrologic data across continents – leveraging data-rich regions to improve
hydrologic prediction in data-sparse regions. Water Resources Research, 57, e2020WR028600,
2021. https://doi.org/10.1029/2020wr028600.
L. Ma, Y. Liu, X. Zhang, Y. Ye, G. Yin, and B.A. Johnson. Deep learning in remote sensing
applications: A meta-analysis and review. ISPRS Journal of Photogrammetry and Remote
Sensing, 152:166–177, 2019.
W. Ma, J. Zhang, Y. Wu, L. Jiao, H. Zhu, and W. Zhao. A novel two-step registration method for
remote sensing images based on deep and local features. IEEE Transactions on Geoscience
and Remote Sensing, 57(7): 4834–4843, July 2019. ISSN 1558-0644. doi: 10.1109/
TGRS.2019.2893310.
E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez. Convolutional neural networks for
large-scale remote-sensing image classification. IEEE Transactions on Geoscience and Remote
Sensing, 55(2):645–657, 2017a.
E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez. High-resolution aerial image labeling with
convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing,
55(12):7092–7103, 2017b.
E. Maggiori, Y. Tarabalka, G. Charpiat, and P. Alliez. Can semantic labeling methods generalize
to any city? The Inria aerial image labeling benchmark. In IEEE International Geoscience and
Remote Sensing Symposium (IGARSS). IEEE, 2017c.
L. Magnusson and E. Källén. Factors influencing skill improvements in the ECMWF
forecasting system. Monthly Weather Review, 141(9): 3142–3153, 2013. doi: 10.1175/
MWR-D-12-00318.1.
D. Mahapatra and Z. Ge. Combining transfer learning and segmentation information with GANs
for training data independent image registration. CoRR, abs/1903.10139, 2019. URL http://
arxiv.org/abs/1903.10139.
J. Mairal, M. Elad, and G. Sapiro. Sparse representation for color image restoration. IEEE
Transactions on Image Processing, 17(1):53–69, 2008.
J. Mairal, F. Bach, and J. Ponce. Task-driven dictionary learning. IEEE Transactions on Pattern
Analysis and Machine Intelligence, 34(4): 791–804, 2011.
D. Malmgren-Hansen, V. Laparra, A.A. Nielsen, and G. Camps-Valls. Statistical retrieval of
atmospheric profiles with deep convolutional neural networks. ISPRS Journal of
Photogrammetry and Remote Sensing, 158:231–240, 2019.
D. Malmgren-Hansen, V. Laparra, G. Camps-Valls, and X. Calbet. IASI dataset v1. Technical
University of Denmark. Dataset. 2020. https://doi.org/10.11583/DTU.12999642.v1
D. Malmgren-Hansen, L.T. Pedersen, A.A. Nielsen, H. Skriver, R. Saldo, M.B. Kreiner, and J.
Buus-Hinkler. ASIP Sea Ice Dataset – version 1, March 2020. doi: 10.11583/DTU.11920416.v1.
P.P. Mana and L. Zanna. Toward a stochastic parameterization of ocean mesoscale eddies.
Ocean Modelling, 79: 1–20, 2014.
D. Marcos, R. Hamid, and D. Tuia. Geospatial correspondence for multimodal registration. In
IEEE International Conference on Computer Vision and Pattern Recognition (CVPR), 2016.
D. Marcos, M. Volpi, B. Kellenberger, and D. Tuia. Land cover mapping at very high resolution
with rotation equivariant CNNs: Towards small yet accurate models. ISPRS Journal of the
International Society of Photogrammetry and Remote Sensing, 145:96–107, 2018a.
Q. Meng, D. Catchpoole, D. Skillicorn, and P.J. Kennedy. Relational autoencoder for feature
extraction. In 2017 International Joint Conference on Neural Networks (IJCNN), pages
364–371. IEEE, 2017.
N. Merkle, P. Fischer, S. Auer, and R. Müller. On the possibility of conditional adversarial
networks for multi-sensor image matching. In 2017 IEEE International Geoscience and
Remote Sensing Symposium (IGARSS), pages 2633–2636, July 2017. doi:
10.1109/IGARSS.2017.8127535.
N. Merkle, W. Luo, S. Auer, R. Müller, and R. Urtasun. Exploiting deep matching and SAR data
for the geo-localization accuracy improvement of optical satellite images. Remote Sensing,
9(6), 2017. ISSN 2072-4292. doi: 10.3390/rs9060586.
N. Merkle, S. Auer, R. Müller, and P. Reinartz. Exploring the potential of conditional
adversarial networks for optical and SAR image matching. IEEE Journal of Selected Topics in
Applied Earth Observations and Remote Sensing, 11(6):1811–1820, June 2018. ISSN
2151-1535.
K. Miao, et al. Contextual region-based convolutional neural network with multilayer fusion
for SAR ship detection. Remote Sensing, 9(8):860, 2017.
S. Minaee, Y. Boykov, F. Porikli, A. Plaza, N. Kehtarnavaz, and D. Terzopoulos. Image
segmentation using deep learning: A survey. arXiv preprint arXiv:2001.05566, 2020.
M. Long, Y. Cao, J. Wang, and M.I. Jordan. Learning transferable features with deep
adaptation networks. In International Conference on Machine Learning, volume 37, pages
97–105, 2015.
M. Mirza and S. Osindero. Conditional generative adversarial nets, 2014. URL http://arxiv.org/
abs/1411.1784.
V. Mnih and G.E. Hinton. Learning to label aerial images from noisy data. In Proceedings of the
29th International Conference on Machine Learning (ICML-12), pages 567–574, 2012.
S. Mo, N. Zabaras, X. Shi, and J. Wu. Deep autoregressive neural networks for
high-dimensional inverse problems in groundwater contaminant source identification.
Water Resources Research, 55 (5):3856–3881, 2019a. doi: 10.1029/2018WR024638. URL
https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018WR024638.
S. Mo, Y. Zhu, N. Zabaras, X. Shi, and J. Wu. Deep convolutional encoder-decoder networks for
uncertainty quantification of dynamic multiphase flow in heterogeneous media. Water
Resources Research, 55(1):703–728, 2019b. doi: 10.1029/2018WR023528. URL https://
agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018WR023528.
S. Mohajerani and P. Saeedi. Cloud-Net: An end-to-end cloud detection algorithm for Landsat 8
imagery. In IGARSS 2019 - 2019 IEEE International Geoscience and Remote Sensing
Symposium, pages 1029–1032, July 2019. doi: 10.1109/IGARSS.2019.8898776.
S. Mohajerani and P. Saeedi. Cloud-Net+: A cloud segmentation CNN for Landsat 8 remote
sensing imagery optimized with filtered Jaccard loss function, 2020.
Y. Mohajerani, M. Wood, I. Velicogna, and E. Rignot. Detection of glacier calving margins with
convolutional neural networks: A case study. Remote Sensing, 11(1), 2019. doi:
10.3390/rs11010074.
M.R. Mohammadi. Deep multiple instance learning for airplane detection in high resolution
imagery. CoRR, abs/1808.06178, 2018. URL http://arxiv.org/abs/1808.06178.
S. Molins, D. Trebotich, C.I. Steefel, and C. Shen. An investigation of the effect of pore scale
flow on average geochemical reaction rates using direct numerical simulation. Water
Resources Research, 48(3): n/a–n/a, Mar 2012. ISSN 00431397. doi: 10.1029/2011WR011404.
C. Molnar. Interpretable Machine Learning. 2019. https://christophm.github.io/
interpretable-ml-book/.
G. Montavon, W. Samek, and K.-R. Müller. Methods for interpreting and understanding deep
neural networks. Digital Signal Processing, 73:1–15, 2018.
R. Montes and C. Ureña. An overview of BRDF models. University of Granada, Technical Report
LSI-2012, 1, 2012.
A. Moosavi, A. Attia, and A. Sandu. Tuning covariance localization using machine
learning, 2019.
T. Moranduzzo and F. Melgani. Detecting cars in UAV images with a catalog-based approach.
IEEE Transactions on Geoscience and Remote Sensing, 52(10):6356–6367, 2014.
Á. Moreno-Martínez, G. Camps-Valls, J. Kattge, N. Robinson, M. Reichstein, P. van Bodegom,
K. Kramer, J.H.C. Cornelissen, P. Reich, M. Bahn, et al. A methodology to derive global maps
of leaf traits using remote sensing and climate data. Remote Sensing of Environment,
218:69–88, 2018.
L. Mou, L. Bruzzone, and X.X. Zhu. Learning spectral-spatial-temporal features via a recurrent
convolutional neural network for change detection in multispectral imagery. IEEE
Transactions on Geoscience and Remote Sensing, 57(2):924–935, 2018.
L. Mou, Y. Hua, and X.X. Zhu. A relation-augmented fully convolutional network for semantic
segmentation in aerial scenes. In Proceedings of the IEEE Conference on Computer Vision and
Pattern Recognition, pages 12416–12425, 2019.
S. Mouatadid, J.F. Adamowski, M.K. Tiwari, and J.M. Quilty. Coupling the maximum overlap
discrete wavelet transform and long short-term memory networks for irrigation flow
forecasting. Agricultural Water Management, 219:72–85, 2019. ISSN 0378-3774. doi:
https://doi.org/10.1016/j.agwat.2019.03.045. URL http://www.sciencedirect.com/science/
article/pii/S0378377418311831.
M. Mudigonda, S. Kim, A. Mahesh, S. Kahou, K. Kashinath, D. Williams, V. Michalski,
T. O’Brien, and Prabhat. Segmenting and Tracking Extreme Climate Events using Neural
Networks. Technical report, 2017.
M. Mudigonda, K. Kashinath, Prabhat, S. Kim, L. Kapp-Schoerer, E. Karaismailoglu,
A. Graubner, L. von Kleist, K. Yang, C. Lewis, J. Chen, A. Greiner, T. Kurth, T. O’Brien,
W. Chapman, C. Shields, K. Dagon, A. Albert, M. Wehner, and W. Collins. ClimateNet:
Bringing the power of deep learning to weather and climate sciences via open datasets and
architectures. In SC18: International Conference for High Performance Computing,
Networking, Storage and Analysis, pages 649–660. IEEE, 2018.
M. Munir, S.A. Siddiqui, A. Dengel, and S. Ahmed. DeepAnT: A deep learning approach for
unsupervised anomaly detection in time series. IEEE Access, 7:1991–2005, 2018.
H. Murakami. Tropical cyclones in reanalysis data sets. Geophysical Research Letters,
41(6):2133–2141, 2014. ISSN 1944-8007. doi: 10.1002/2014GL059519. URL
http://dx.doi.org/10.1002/2014GL059519.
H. Murakami, Y. Wang, H. Yoshimura, R. Mizuta, M. Sugi, E. Shindo, Y. Adachi, S. Yukimoto,
M. Hosaka, S. Kusunoki, T. Ose, and A. Kitoh. Future changes in tropical cyclone activity
Sensor/Information Fusion, and Target Recognition XXVIII, volume 11018, page 110180Y.
International Society for Optics and Photonics, 2019.
H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In
International Conference on Computer Vision (ICCV), pages 1520–1528, 2015.
D.S. Nolan and M.G. McGauley. Tropical cyclogenesis in wind shear: Climatological
relationships and physical processes. In Cyclones: Formation, Triggers, and Control, pages
1–36. 2012.
P.D. Nooteboom, Q.Y. Feng, C. López, E. Hernández-García, and H. A. Dijkstra. Using network
theory and machine learning to predict El Niño. Earth System Dynamics, 9(3):969–983, 2018.
doi: 10.5194/esd-9-969-2018. URL https://www.earth-syst-dynam.net/9/969/2018/.
D.O. North. An analysis of the factors which determine signal/noise discrimination in
pulsed-carrier systems. Proceedings of the IEEE, 51(7): 1016–1027, 1963.
L.M. Novak and M.C. Burl. Optimal speckle reduction in polarimetric SAR imagery. IEEE
Transactions on Aerospace and Electronic Systems, 26(2): 293–305, 1988.
K. Ogle, J.J. Barber, G.A. Barron-Gafford, L. Patrick Bentley, J.M. Young, T.E. Huxman, M.E.
Loik, and D.T. Tissue. Quantifying ecological memory in plant and ecosystem processes.
Ecology Letters, 18:221–235, 2015. doi: 10.1111/ele.12399.
P.A. O’Gorman and J.G. Dwyer. Using machine learning to parameterize moist convection:
Potential for modeling of climate, climate change, and extreme events. Journal of Advances
in Modeling Earth Systems, 10(10): 2548–2563, 2018.
A. Özgün Ok, Ç. Senaras, and B. Yüksel. Automated detection of arbitrarily shaped buildings in
complex environments from monocular VHR optical satellite imagery. IEEE Transactions on
Geoscience and Remote Sensing, 51(3-2): 1701–1717, 2013.
D.A.B. Oliveira, R.S. Ferreira, R. Silva, and E.V. Brazil. Improving seismic data resolution with
deep generative networks. IEEE Geoscience and Remote Sensing Letters, 16(12):1929–1933,
Dec 2019.
B. Olshausen and D.J. Field. Sparse coding with an overcomplete basis set: a strategy employed
by V1? Vision Research, 37(23):3311–3325, 1997.
M. Omlin and P. Reichert. A comparison of techniques for the estimation of model prediction
uncertainty. Ecological Modelling, 115:45–59, 1999.
I.H. Onarheim, T. Eldevik, L.H. Smedsrud, and J.C. Stroeve. Seasonal and regional
manifestation of Arctic sea ice loss. Journal of Climate, 31(12):4917–4932, 2018.
A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner,
A. Senior, and K. Kavukcuoglu. Wavenet: A generative model for raw audio. arXiv preprint
arXiv:1609.03499, 2016.
K. Oouchi, J. Yoshimura, H. Yoshimura, R. Mizuta, S. Kusunoki, and A. Noda. Tropical cyclone
climatology in a global-warming climate as simulated in a 20 km-mesh global atmospheric
model: Frequency and wind intensity analyses. Journal of the Meteorological Society of Japan.
Ser. II, 84(2):259–276, 2006. doi: 10.2151/jmsj.84.259.
E. Othman, Y. Bazi, N. Alajlan, H. Alhichri, and F. Melgani. Using convolutional features and a
sparse autoencoder for land-use scene classification. International Journal of Remote
Sensing, 37(10):2149–2167, 2016.
B. Oueslati and G. Bellon. The double ITCZ bias in CMIP5 models: interaction between SST,
large-scale circulation and precipitation. Climate Dynamics, 44(3-4):585–607, 2015.
V. Pellet and F. Aires. Bottleneck channels algorithm for satellite data dimension reduction:
A case study for IASI. IEEE Transactions on Geoscience and Remote Sensing, 56(10):6069–6081,
2018. ISSN 1558-0644. doi: 10.1109/tgrs.2018.2830123.
G.D. Peterson. Contagious disturbance, ecological memory, and the emergence of landscape
pattern. Ecosystems, 5:329–338, 2002. doi: 10.1007/s10021-001-0077-1.
O.L. Phillips, L.E.O.C. Aragão, S.L. Lewis, J.B. Fisher, J. Lloyd, G. López-González, Y. Malhi,
A. Monteagudo, J. Peacock, C.A. Quesada, G. van der Heijden, S. Almeida, I. Amaral,
L. Arroyo, G. Aymard, T.R. Baker, O. Bánki, L. Blanc, D. Bonal, P. Brando, J. Chave, Á.C.
Alves de Oliveira, N.D. Cardozo, C.I. Czimczik, T.R. Feldpausch, M. Aparecida Freitas,
E. Gloor, N. Higuchi, E. Jiménez, G. Lloyd, P. Meir, C. Mendoza, A. Morel, D.A. Neill,
D. Nepstad, S. Patiño, M.C. Peñuela, A. Prieto, F. Ramírez, M. Schwarz, J. Silva, M. Silveira,
S. Sota Thomas, H. ter Steege, J. Stropp, R. Vásquez, P. Zelazowski, E. Alvarez Dávila,
S. Andelman, A. Andrade, K.-J. Chao, T. Erwin, A. Di Fiore, E. Honorio C, H. Keeling, T.J.
Killeen, W.F. Laurance, A. Peña Cruz, N.C.A. Pitman, P. Núñez Vargas, H. Ramírez-Angulo,
A. Rudas, R. Salamão, N. Silva, J. Terborgh, and A. Torres-Lezama. Drought sensitivity of
the Amazon rainforest. Science, 323:1344–1347, 2009. doi: 10.1126/science.1164033.
P.O. Pinheiro, T.-Y. Lin, R. Collobert, and P. Dollár. Learning to refine object segments. In
European Conference on Computer Vision (ECCV), pages 75–91. Springer, 2016.
R.S. Plant and G.C. Craig. A stochastic parameterization for deep convection based on
equilibrium statistics. Journal of the Atmospheric Sciences, 65(1):87–105, 2008.
S.B. Pope. A more general effective-viscosity hypothesis. Journal of Fluid Mechanics,
72(2):331–340, 1975.
J. Porway, Q. Wang, and S.C. Zhu. A hierarchical and contextual model for aerial image
parsing. International Journal of Computer Vision, 88(2):254–283, 2010.
J. Poterjoy, R.A. Sobash, and J.L. Anderson. Convective-scale data assimilation for the weather
research and forecasting model using the local particle filter. Monthly Weather Review,
145(5):1897–1918, Mar 2017. ISSN 0027-0644. doi: 10.1175/mwr-d-16-0298.1.
J. Poterjoy, L. Wicker, and M. Buehner. Progress toward the application of a localized particle
filter for numerical weather prediction. Monthly Weather Review, 147(4):1107–1126, Apr
2019. ISSN 15200493. doi: 10.1175/MWR-D-17-0344.1.
R. Prabha, M. Tom, M. Rothermel, E. Baltsavias, L. Leal-Taixé, and K. Schindler. Lake ice
monitoring with webcams and crowd-sourced images. In ISPRS Annals of Photogrammetry,
Remote Sensing and Spatial Information Sciences, 2020. (to appear).
Prabhat, O. Rübel, S. Byna, K. Wu, F. Li, M. Wehner, W. Bethel, et al. TECA: A parallel toolkit for
extreme climate analysis. In Third Workshop on Data Mining in Earth System Science
(DMESS) at the International Conference on Computational Science (ICCS), 2012.
Prabhat, S. Byna, V. Vishwanath, E. Dart, M. Wehner, and W.D. Collins. TECA: Petascale pattern
recognition for climate science. In Computer Analysis of Images and Patterns, pages 426–436.
Springer, 2015a.
Prabhat, K. Kashinath, T. Kurth, M. Mudigonda, A. Mahesh, B.A. Toms, J. Biard, S.K. Kim,
S. Kahou, B. Loring, et al. ClimateNet: Bringing the power of deep learning to the climate
community via open datasets and architectures. In AGU Fall Meeting Abstracts, 2018.
W.K. Pratt. Digital Image Processing. John Wiley & Sons, Inc., USA, 1978. ISBN 0471018880.
N. Proia and V. Pagé. Characterization of a Bayesian ship detection method in optical satellite
images. IEEE Geoscience and Remote Sensing Letters, 7(2):226–230, 2009.
R. Raina, A. Battle, H. Lee, B. Packer, and A.Y. Ng. Self-taught learning: transfer learning from
unlabeled data. In International Conference on Machine Learning, pages 759–766. ACM,
2007.
M. Raissi, P. Perdikaris, and G.E. Karniadakis. Physics-informed neural networks: A deep
learning framework for solving forward and inverse problems involving nonlinear partial
differential equations. Journal of Computational Physics, 378:686–707, 2019. ISSN 0021-9991.
doi: 10/gfzbvx. URL http://www.sciencedirect.com/science/article/pii/S0021999118307125.
D. Randall, M. Khairoutdinov, A. Arakawa, and W. Grabowski. Breaking the cloud
parameterization deadlock. Bulletin of the American Meteorological Society,
84(11):1547–1564, 2003.
M.A. Ranzato, C. Poultney, S. Chopra, and Y. LeCun. Efficient learning of sparse
representations with an energy-based model. In NIPS, pages 1137–1144, 2006.
M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Collobert, and S. Chopra. Video (language)
modeling: A baseline for generative models of natural videos. arXiv preprint arXiv:1412.6604,
2014.
C.E. Rasmussen. Evaluation of Gaussian Processes and Other Methods for Non-linear Regression.
University of Toronto, 1999.
S. Rasp and S. Lerch. Neural networks for postprocessing ensemble weather forecasts.
Monthly Weather Review, 146(11):3885–3900, 2018. doi: 10.1175/MWR-D-18-0187.1.
S. Rasp, M.S. Pritchard, and P. Gentine. Deep learning to represent subgrid processes in
climate models. Proceedings of the National Academy of Sciences, 115(39):9684–9689,
2018.
S. Rasp, H. Schulz, S. Bony, and B. Stevens. Combining crowd-sourcing and deep learning to
understand meso-scale organization of shallow convection. 2019. URL https://aps.arxiv.org/
abs/1906.01906.
S. Rasp, P.D. Dueben, S. Scher, J.A. Weyn, S. Mouatadid, and N. Thuerey. WeatherBench:
A benchmark dataset for data-driven weather forecasting, 2020. URL https://aps.arxiv.org/
abs/2002.00469.
B.H. Raup, A. Racoviteanu, S.J.S. Khalsa, C. Helm, R. Armstrong, and Y. Arnaud. The GLIMS
geospatial glacier database: a new tool for studying glacier change. Global and Planetary
Change, 56(1-2):101–110, 2007.
B.H. Raup, L.M. Andreassen, T. Bolch, and S. Bevan. Remote Sensing of Glaciers, chapter 7,
pages 123–156. John Wiley & Sons, 2014. doi: 10.1002/9781118368909.ch7.
J.S. Read, X. Jia, J. Willard, A.P. Appling, J.A. Zwart, S.K. Oliver, A. Karpatne, G.J.A. Hansen,
P.C. Hanson, W. Watkins, M. Steinbach, and V. Kumar. Process-guided deep learning
predictions of lake water temperature. Water Resources Research, 55 (11):9173–9190,
2019. doi: 10.1029/2019WR024922. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/
10.1029/2019WR024922.
T. Reato, B. Demir, and L. Bruzzone. An unsupervised multicode hashing method for accurate
and scalable remote sensing image retrieval. IEEE Geoscience and Remote Sensing Letters,
16(2):276–280, October 2019.
I. Redko, E. Morvant, A. Habrard, M. Sebban, and Y. Bennani. Advances in Domain Adaptation
Theory. Elsevier, 2019.
J. Redmon, S. Divvala, R. Girshick, and A. Farhadi. You only look once: Unified, real-time
object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern
Recognition, pages 779–788, 2016.
M. Reichstein, S. Besnard, N. Carvalhais, F. Gans, M. Jung, B. Kraft, and M. Mahecha.
Modelling Landsurface Time-Series with Recurrent Neural Nets. In IGARSS 2018 – 2018
IEEE International Geoscience and Remote Sensing Symposium, pages 7640–7643, 2018. doi:
10.1109/IGARSS.2018.8518007.
M. Reichstein, G. Camps-Valls, B. Stevens, M. Jung, J. Denzler, N. Carvalhais, et al. Deep
learning and process understanding for data-driven Earth system science. Nature,
566(7743):195–204, 2019. doi: 10.1038/s41586-019-0912-1. URL https://doi.org/10.1038/
s41586-019-0912-1.
S. Ren, K. He, R. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. In Advances in Neural Information Processing Systems, pages
91–99, 2015.
S. Ren, K. He, R.B. Girshick, and J. Sun. Faster R-CNN: Towards real-time object detection with
region proposal networks. IEEE TPAMI, 39(6):1137–1149, 2017.
C. Requena-Mesa, M. Reichstein, M. Mahecha, B. Kraft, and J. Denzler. Predicting landscapes
from environmental conditions using generative networks. In German Conference on Pattern
Recognition, pages 203–217. Springer, 2019.
J. Revaud, Ph. Weinzaepfel, Z. Harchaoui, and C. Schmid. DeepMatching: Hierarchical
deformable dense matching. International Journal of Computer Vision, 120, April 2016. doi:
10.1007/s11263-016-0908-3.
M. Reyniers. Quantitative Precipitation Forecasts Based on Radar Observations: Principles,
Algorithms and Operational Systems. Institut Royal Météorologique de Belgique Brussel,
Belgium, 2008.
A.D. Richardson, K. Hufkens, T. Milliman, D.M. Aubrecht, M. Chen, J.M. Gray, M.R. Johnston,
T.F. Keenan, S.T. Klosterman, M. Kosmala, E.K. Melaas, M.A. Friedl, and S. Frolking.
Tracking vegetation phenology across diverse North American biomes using PhenoCam
imagery. Scientific Data, 5(1):1–24, March 2018. ISSN 2052-4463. doi: 10/gc6crk. URL
https://www.nature.com/articles/sdata201828.
M.B. Richman. Rotation of principal components. Journal of Climatology, 6(3): 293–335,
1986.
G. Riegler, A.O. Ulusoy, and A. Geiger. OctNet: Learning deep 3D representations at high
resolutions. In Computer Vision and Pattern Recognition (CVPR), pages 6620–6629,
Honolulu, HI, 2017. IEEE.
C. Rieke et al. Awesome satellite imagery datasets. https://github.com/chrieke/
awesome-satellite-imagery-datasets, 2020.
S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit
invariance during feature extraction. In Proceedings of the 28th International Conference on
Machine Learning (ICML-11), pages 833–840, 2011.
I. Rigas, G. Economou, and S. Fotopoulos. Low-level visual saliency with application on aerial
imagery. IEEE Geoscience and Remote Sensing Letters, 10(6): 1389–1393, Nov 2013. ISSN
1545-598X. doi: 10.1109/LGRS.2013.2243402.
S. Scher and G. Messori. Weather and climate forecasting with neural networks: using GCMs
with different complexity as study-ground. Geoscientific Model Development Discussions,
pages 1–15, Mar 2019. doi: 10.5194/gmd-2019-53.
M. Schmitt and X.X. Zhu. Data fusion and remote sensing: An ever-growing relationship. IEEE
Geoscience and Remote Sensing Magazine, 4(4):6–23, 2016.
M. Schmitt, L.H. Hughes, C. Qiu, and X.X. Zhu. SEN12MS – a curated dataset of georeferenced
multi-spectral Sentinel-1/2 imagery for deep learning and data fusion. ISPRS Annals of
Photogrammetry, Remote Sensing and Spatial Information Sciences, IV-2/W7:153–160, Sep
2019. ISSN 2194-9050. doi: 10.5194/isprs-annals-iv-2-w7-153-2019. URL http://dx.doi.org/10
.5194/isprs-annals-iv-2-w7-153-2019.
T. Schneider, S. Lan, A. Stuart, and J. Teixeira. Earth system modeling 2.0: A blueprint for
models that learn from observations and targeted high-resolution simulations. Geophysical
Research Letters, 44(24):12,396–12,417, 2017a.
T. Schneider, J. Teixeira, C.S. Bretherton, F. Brient, K.G. Pressel, C. Schär, and A.P. Siebesma.
Climate goals and computing the future of clouds. Nature Climate Change, 7(1):3–5,
2017b.
T. Schneider, C.M. Kaul, and K.G. Pressel. Possible climate transitions from breakup of
stratocumulus decks under greenhouse warming. Nature Geoscience, 12(3):163, 2019.
B. Schölkopf, A. Smola, and K.-R. Müller. Nonlinear component analysis as a kernel eigenvalue
problem. Neural Computation, 10 (5):1299–1319, 1998.
M. Schuster and K.K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on
Signal Processing, 45(11):2673–2681, 1997. doi: 10.1109/78.650093.
E.A.G. Schuur, A.D. McGuire, C. Schädel, G. Grosse, J.W. Harden, D.J. Hayes, G. Hugelius, C.D.
Koven, P. Kuhry, D.M. Lawrence, et al. Climate change and the permafrost carbon feedback.
Nature, 520(7546):171–179, 2015.
E. Scoccimarro. Modeling tropical cyclones in a changing climate, 2016. URL https://
naturalhazardscience.oxfordre.com/10.1093/acrefore/9780199389407.001.0001/acrefore-
9780199389407-e-22.
A. Seale, P. Christoffersen, R.I. Mugford, and M. O’Leary. Ocean forcing of the Greenland Ice
Sheet: Calving fronts and patterns of retreat identified by automatic satellite monitoring of
eastern outlet glaciers. Journal of Geophysical Research: Earth Surface, 116:F03013, 2011.
A. Sedaghat and N. Mohammadi. Illumination-robust remote sensing image matching based
on oriented self-similarity. ISPRS Journal of Photogrammetry and Remote Sensing, 153:21–35,
2019. ISSN 0924-2716. doi: https://doi.org/10.1016/j.isprsjprs.2019.04.018.
M. Segal-Rozenhaimer, A. Li, K. Das, and V. Chirayath. Cloud detection algorithm for
multi-modal satellite imagery using convolutional neural-networks (CNN). Remote Sensing
of Environment, 237:111446, 2020.
P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated
recognition, localization and detection using convolutional networks. In International
Conference on Learning Representations (ICLR2014). CBLS, April 2014. URL http://
openreview.net/document/d332e77d-459a-4af8-b3ed-55ba.
B. Settles. Active learning. Synthesis Lectures on Artificial Intelligence and Machine Learning,
6(1):1–114, 2012.
D.M.H. Sexton, A.V. Karmalkar, J.M. Murphy, K.D. Williams, I.A. Boutle, C.J. Morcrette, A.J.
Stirling, and S.B. Vosper. Finding plausible and diverse variants of a climate model. Part 1:
establishing the relationship between errors at weather and climate time scales. Climate Dynamics,
53:989–1022, 2019. ISSN 1432-0894. doi: 10.1007/s00382-019-04625-3. URL https://doi.org/
10.1007/s00382-019-04625-3.
B. Seyednasrollah, A.M. Young, K. Hufkens, T. Milliman, M.A. Friedl, S. Frolking, and A.D.
Richardson. Publisher correction: Tracking vegetation phenology across diverse biomes
using Version 2.0 of the PhenoCam Dataset. Scientific Data, 6(1):1–1, November 2019.
ISSN 2052-4463. doi: 10/ggh5m4. URL https://www.nature.com/articles/s41597-019-0270-8.
K. Sfikas, T. Theoharis, and I. Pratikakis. Exploiting the PANORAMA representation for
convolutional neural network classification and retrieval. In Eurographics Workshop on 3D
Object Retrieval, Lyon, France, 2017.
M. Shahzad, M. Maurer, F. Fraundorfer, Y. Wang, and X.X. Zhu. Buildings detection in VHR SAR
images using fully convolutional neural networks. IEEE Transactions on Geoscience and
Remote Sensing, 57(2):1100–1116, 2019.
Z. Shao and J. Cai. Remote sensing image fusion with deep convolutional neural network. IEEE
Journal of Selected Topics in Applied Earth Observations and Remote Sensing,
11(5):1656–1669, May 2018.
Z. Shao, Z. Lu, M. Ran, L. Fang, J. Zhou, and Y. Zhang. Residual encoder-decoder conditional
generative adversarial network for pansharpening. IEEE Geoscience and Remote Sensing
Letters, pages 1–5, 2019.
C. Shen. A trans-disciplinary review of deep learning research and its relevance for water
resources scientists. Water Resources Research, 54(11): 8558–8593, Dec 2018a. doi:
10.1029/2018WR022643. URL https://doi.org/10.1029/2018WR022643.
C. Shen. Deep learning: A next-generation big-data approach for hydrology, April 2018b. URL
https://eos.org/editors-vox/deep-learning-a-next-generation-big-data-approach-for-
hydrology.
C. Shen, W.J. Riley, K.M. Smithgall, J.M. Melack, and K. Fang. The fan of influence of streams
and channel feedbacks to simulated land surface water and carbon dynamics. Water
Resources Research, 52(2): 880–902, 2016. doi: 10.1002/2015WR018086.
C. Shen, E. Laloy, A. Elshorbagy, A. Albert, J. Bales, F.-J. Chang, S. Ganguly, K.-L. Hsu,
D. Kifer, Z. Fang, K. Fang, D. Li, X. Li, and W.-P. Tsai. HESS Opinions: Incubating
deep-learning-powered hydrologic science advances as a community. Hydrology and Earth
System Sciences, 22(11):5639–5656, Nov 2018. ISSN 1607-7938. doi:
10.5194/hess-22-5639-2018. URL https://www.hydrol-earth-syst-sci.net/22/5639/2018/.
S.C. Sheridan and C.C. Lee. The self-organizing map in synoptic climatological research.
Progress in Physical Geography, 35(1):109–119, 2011.
J. Shermeyer, D. Hogan, J. Brown, A. Van Etten, N. Weir, F. Pacifici, R. Haensch, A. Bastidas,
S. Soenen, T. Bacastow, and R. Lewis. SpaceNet 6: Multi-sensor all weather mapping dataset.
In Computer Vision and Pattern Recognition Workshops (CVPRw), 2020.
J. Sherrah. Fully convolutional networks for dense semantic labelling of high-resolution aerial
imagery. arXiv preprint arXiv:1606.02585, 2016.
S.C. Sherwood, S. Bony, and J.-L. Dufresne. Spread in model climate sensitivity traced to
atmospheric convective mixing. Nature, 505(7481):37, 2014.
B. Shi, S. Bai, Z. Zhou, and X. Bai. DeepPano: Deep Panoramic Representation for 3-D Shape
Recognition. IEEE Signal Processing Letters, 22(12): 2339–2343, December 2015a.
X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Convolutional LSTM
network: A machine learning approach for precipitation nowcasting. arXiv, 2015b.
X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-K. Wong, and W.-C. Woo. Deep learning for
precipitation nowcasting: A benchmark and a new model. In Advances in Neural
Information Processing Systems, pages 5617–5627, 2017.
Z. Shi, X. Yu, Z. Jiang, and B. Li. Ship detection in high-resolution optical imagery based on
anomaly detector and local shape feature. IEEE Transactions on Geoscience and Remote
Sensing, 52(8): 4511–4523, 2013.
C.A. Shields, J.J. Rutz, L.-Y. Leung, F.M. Ralph, M. Wehner, B. Kawzenuk, J.M. Lora,
E. McClenny, T. Osborne, A.E. Payne, et al. Atmospheric river tracking method
intercomparison project (ARTMIP): project goals and experimental design. Geoscientific
Model Development, 11(6):2455–2474, 2018.
Z. Shu, M. Sahasrabudhe, A. Guler Riza, D. Samaras, N. Paragios, and I. Kokkinos. Deforming
autoencoders: Unsupervised disentangling of shape and appearance. In The European
Conference on Computer Vision (ECCV), September 2018.
S. Siachalou, G. Mallinis, and M. Tsakiri-Strati. A hidden Markov models approach for crop
classification: linking crop phenology to time series of multi-sensor remote sensing data.
Remote Sensing, 7(4):3633–3650, 2015. doi: 10.3390/rs70403633.
H.T. Siegelmann and E.D. Sontag. Turing computability with neural nets. Applied Mathematics
Letters, 4(6):77–80, 1991. doi: 10.1016/0893-9659(91)90080-f.
H.T. Siegelmann and E.D. Sontag. On the computational power of neural nets. Journal of
Computer and System Sciences, 50(1):132–150, 1995. doi: 10.1006/jcss.1995.1013.
V.D. Silva and J.B. Tenenbaum. Global versus local methods in nonlinear dimensionality
reduction. In Advances in Neural Information Processing Systems, pages 721–728, 2003.
D. Silver, A. Huang, C.J. Maddison, A. Guez, L. Sifre, G. van den Driessche, J. Schrittwieser, I.
Antonoglou, V. Panneershelvam, M. Lanctot, S. Dieleman, D. Grewe, J. Nham, N.
Kalchbrenner, I. Sutskever, T. Lillicrap, M. Leach, K. Kavukcuoglu, T. Graepel, and D.
Hassabis. Mastering the game of Go with deep neural networks and tree search. Nature,
529(7587): 484–489, Jan 2016. ISSN 0028-0836. doi: 10.1038/nature16961. URL http://www
.nature.com/doifinder/10.1038/nature16961.
A.J. Simmons and A. Hollingsworth. Some aspects of the improvement in skill of numerical
weather prediction. Quarterly Journal of the Royal Meteorological Society: A Journal of the
Atmospheric Sciences, Applied Meteorology and Physical Oceanography, 128(580):647–677,
2002.
M. Simões, J. Bioucas-Dias, L.B. Almeida, and J. Chanussot. A convex formulation for
hyperspectral image super-resolution via subspace-based regularization. IEEE Transactions
on Geoscience and Remote Sensing, 53(6):3373–3388, 2015.
K. Simonyan and A. Zisserman. Very deep convolutional networks for large-scale image
recognition. In International Conference on Learning Representations (ICLR), 2015.
A. Singh, H. Kalke, M. Loewen, and N. Ray. River ice segmentation with deep learning, 2019.
URL https://aps.arxiv.org/abs/1901.04412.
B. Singh, M. Najibi, and L.S. Davis. SNIPER: Efficient multi-scale training. arXiv:1805.09300,
2018.
P. Singh and N. Komodakis. Cloud-GAN: Cloud removal for Sentinel-2 imagery using a cyclic
consistent generative adversarial networks. In IGARSS 2018 – 2018 IEEE International
Geoscience and Remote Sensing Symposium, pages 1772–1775, July 2018.
N. Skific, J.A. Francis, and J.J. Cassano. Attribution of projected changes in atmospheric
moisture transport in the Arctic: A self-organizing map perspective. Journal of Climate,
22(15):4135–4153, 2009.
J. Smagorinsky. General circulation experiments with the primitive equations: I. The basic
experiment. Monthly Weather Review, 91(3):99–164, 1963.
H.-G. Sohn and K.C. Jezek. Mapping ice sheet margins from ERS-1 SAR and SPOT imagery.
International Journal of Remote Sensing, 20(15-16): 3201–3216, 1999.
W. Song, S. Li, and J. Benediktsson. Deep hashing learning for visual and semantic retrieval of
remote sensing images. arXiv preprint arXiv:1909.04614, 2019.
A. Sotiras, C. Davatzikos, and N. Paragios. Deformable medical image registration: A survey.
IEEE Transactions on Medical Imaging, 32(7), 2013.
J.T. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all
convolutional net. arXiv preprint arXiv:1412.6806, 2014.
N. Srivastava, E. Mansimov, and R. Salakhutdinov. Unsupervised learning of video
representations using LSTMs. In ICML, 2015.
S. Srivastava, J.E. Vargas, and D. Tuia. Understanding urban landuse from the above and
ground perspectives: a deep learning, multimodal solution. Remote Sensing of Environment,
228:129–143, 2019.
B. Stevens and S. Bony. What are climate models missing? Science, 340(6136):1053–1054, 2013.
ISSN 0036-8075. doi: 10.1126/science.1237554. URL https://science.sciencemag.org/content/
340/6136/1053.
B. Stevens, S.C. Sherwood, S. Bony, and M.J. Webb. Prospects for narrowing bounds on Earth’s
equilibrium climate sensitivity. Earth’s Future, 4(11):512–522, 2016.
B. Stevens, M. Satoh, L. Auger, J. Biercamp, C.S. Bretherton, X. Chen, P. Düben, F. Judt, M.
Khairoutdinov, D. Klocke, et al. DYAMOND: The dynamics of the atmospheric general
circulation modeled on non-hydrostatic domains. Progress in Earth and Planetary Science,
6(1):61, 2019.
T.F. Stocker, D. Qin, G. Plattner, M. Tignor, S.K. Allen, J. Boschung, A. Nauels, Y. Xia, V. Bex,
and P.M. Midgley. IPCC, 2013: summary for policymakers in climate change 2013: the
physical science basis, contribution of Working Group I to the Fifth Assessment Report of the
Intergovernmental Panel on Climate Change. Cambridge University Press, Cambridge, UK and New York, NY, USA, 2013.
J. Strachan and J. Camp. Tropical cyclones of 2012. Weather, 68 (5):122–125, 2013. ISSN
1477-8696. doi: 10.1002/wea.2096. URL http://dx.doi.org/10.1002/wea.2096.
R. Stull. Practical Meteorology: An Algebra-based Survey of Atmospheric Science. University of
British Columbia, 2015. ISBN 9780888651761.
H. Su, S. Maji, E. Kalogerakis, and E. Learned-Miller. Multi-view convolutional neural
networks for 3d shape recognition. In Computer Vision and Pattern Recognition (CVPR),
pages 945–953, 2015.
Y. Su, J. Li, A. Plaza, A. Marinoni, P. Gamba, and S. Chakravortty. DAEN: Deep autoencoder
networks for hyperspectral unmixing. IEEE Transactions on Geoscience and Remote Sensing,
57(7):4309–4321, 2019.
C.H. Sudre, W. Li, T. Vercauteren, S. Ourselin, and M.J. Cardoso. Generalised dice overlap as a
deep learning loss function for highly unbalanced segmentations. In Deep Learning in
Medical Image Analysis and Multimodal Learning for Clinical Decision Support, pages
240–248. Springer, 2017.
G. Sumbul, M. Charfuelan, B. Demir, and V. Markl. BigEarthNet: A large-scale benchmark
archive for remote sensing image understanding. IGARSS 2019 – 2019 IEEE International
Geoscience and Remote Sensing Symposium, Jul 2019. doi: 10.1109/igarss.2019.8900532. URL
http://dx.doi.org/10.1109/IGARSS.2019.8900532.
A.Y. Sun. Discovering state-parameter mappings in subsurface models using generative
adversarial networks. Geophysical Research Letters, 45(20):11,137–11,146, 2018. doi:
10.1029/2018GL080404. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/
2018GL080404.
A.Y. Sun, B.R. Scanlon, Z. Zhang, D. Walling, S.N. Bhanja, A. Mukherjee, and Z. Zhong.
Combining physically based modeling and deep learning for fusing grace satellite data: Can
we learn from mismatch? Water Resources Research, 55(2):1179–1195, 2019. doi:
10.1029/2018WR023333. URL https://agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/
2018WR023333.
B. Sun and K. Saenko. Deep CORAL: Correlation alignment for deep domain adaptation. In
European Conference on Computer Vision Workshops, pages 443–450. Springer, 2016.
H. Sun, X. Sun, H. Wang, Y. Li, and X. Li. Automatic target detection in high-resolution remote
sensing images using spatial sparse coding bag-of-words model. IEEE Geoscience and Remote
Sensing Letters, 9(1):109–113, Jan 2012. ISSN 1545-598X. doi: 10.1109/LGRS.2011.2161569.
J. Sun, M. Xue, J.W. Wilson, I. Zawadzki, S.P. Ballard, J. Onvlee-Hooimeyer, P. Joe, D.M. Barker,
P.-W. Li, B. Golding, et al. Use of NWP for nowcasting convective precipitation: Recent
progress and challenges. Bulletin of the American Meteorological Society, 95 (3):409–426,
2014.
F. Sung, Y. Yang, L. Zhang, T. Xiang, Ph.H.S. Torr, and T.M. Hospedales. Learning to compare:
Relation network for few-shot learning. In The IEEE Conference on Computer Vision and
Pattern Recognition (CVPR), June 2018.
I. Sutskever, O. Vinyals, and Q.V. Le. Sequence to sequence learning with neural networks. In
Advances in Neural Information Processing Systems, pages 3104–3112, 2014.
S. Suzuki et al. Topological structural analysis of digitized binary images by border following.
Computer Vision, Graphics, and Image Processing, 30(1):32–46, 1985.
D.H. Svendsen, P. Morales-Álvarez, R. Molina, and G. Camps-Valls. Deep Gaussian processes
for geophysical parameter retrieval. In IGARSS 2018-2018 IEEE International Geoscience and
Remote Sensing Symposium, pages 6175–6178. IEEE, 2018.
D.H. Svendsen, L. Martino, and G. Camps-Valls. Active emulation of computer codes with
Gaussian processes – application to remote sensing. Pattern Recognition, 100(107103):1–12,
2020. doi: https://doi.org/10.1016/j.patcog.2019.107103.
C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, V. Vanhoucke, and
A. Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on
Computer Vision and Pattern Recognition (CVPR), pages 1–9, 2015.
R. Szeliski. Computer Vision: Algorithms and Applications. Springer-Verlag, Berlin, Heidelberg,
1st edition, 2010. ISBN 1848829345.
K. Szura. The big data project: Enhancing public access to NOAA's open data. In AGU Fall
Meeting Abstracts, 2018.
K. Takata, S. Emori, and T. Watanabe. Development of the minimal advanced treatments of
surface interaction and runoff. Global and Planetary Change, 38:209–222, 2003. doi:
10.1016/S0921-8181(03)00030-4.
S.S. Talathi and A. Vartak. Improving performance of recurrent neural network with ReLU
nonlinearity. In International Conference on Learning Representations (ICLR) Workshops,
2015.
J. Tan, N. NourEldeen, K. Mao, J. Shi, Z. Li, T. Xu, and Z. Yuan. Deep learning convolutional
neural network for the retrieval of land surface temperature from AMSR2 data in China.
Sensors, 19(13), 2019. doi: 10.3390/s19132987.
G. Tang, D. Long, A. Behrangi, C. Wang, and Y. Hong. Exploring deep neural networks to
retrieve rain and snow in high latitudes using multisensor and reanalysis data. Water
Resources Research, 54(10): 8253–8278, 2018a. doi: 10.1029/2018WR023830. URL https://
agupubs.onlinelibrary.wiley.com/doi/abs/10.1029/2018WR023830.
X. Tang, X. Zhang, F. Liu, and L. Jiao. Unsupervised deep feature learning for remote sensing
image retrieval. Remote Sensing, 10(8):1243, August 2018b.
X. Tang, C. Liu, X. Zhang, J. Ma, C. Jiao, and L. Jiao. Remote sensing image retrieval based on
semi-supervised deep hashing learning. In IEEE International Geoscience and Remote
Sensing Symposium, pages 879–882, July 2019.
T. Tanikawa, W. Li, K. Kuchiki, T. Aoki, M. Hori, and K. Stamnes. Retrieval of snow physical
parameters by neural networks and optimal estimation: case study for ground-based spectral
radiometer system. Optics Express, 23(24):A1442–A1462, 2015.
Y. Tao, X. Gao, K. Hsu, S. Sorooshian, and A. Ihler. A deep neural network modeling
framework to reduce bias in satellite precipitation products. Journal of Hydrometeorology,
17:931–945, 2016. doi: 10.1175/JHM-D-15-0075.1.
Y. Tao, X. Gao, A. Ihler, S. Sorooshian, and K. Hsu. Precipitation identification with bispectral
satellite information using deep learning
approaches. Journal of Hydrometeorology, 18(5):1271–1283, May 2017. ISSN 1525-755X. doi:
10.1175/JHM-D-16-0176.1. URL http://journals.ametsoc.org/doi/10.1175/JHM-D-16-0176.1.
Y. Tao, K. Hsu, A. Ihler, X. Gao, and S. Sorooshian. A two-stage deep neural network
framework for precipitation estimation from bispectral satellite information. Journal of
Hydrometeorology, 19(2):393–408, Feb 2018. doi: 10.1175/JHM-D-17-0077.1. URL http://
journals.ametsoc.org/doi/10.1175/JHM-D-17-0077.1.
A.M. Tartakovsky, C. Ortiz Marrero, P. Perdikaris, G. Tartakovsky, and D.A. Barajas-Solano.
Learning parameters and constitutive relationships with physics informed
deep neural networks. 2018.
O. Tasar, Y. Tarabalka, and P. Alliez. Incremental learning for semantic segmentation of
large-scale remote sensing data. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 12(9):3524–3537, 2019.
O. Tasar, S.L. Happy, Y. Tarabalka, and P. Alliez. ColorMapGAN: Unsupervised domain
adaptation for semantic segmentation using color mapping generative adversarial networks.
IEEE Transactions on Geoscience and Remote Sensing, 2020.
M. Tatarchenko, J. Park, V. Koltun, and Q.-Y. Zhou. Tangent convolutions for dense prediction
in 3D. In Computer Vision and Pattern Recognition (CVPR), pages 3887–3896, Salt Lake City,
UT, USA, June 2018. ISBN 978-1-5386-6420-9.
K.E. Taylor, R.J. Stouffer, and G.A. Meehl. An overview of CMIP5 and the experiment design.
Bulletin of the American Meteorological Society, 93(4):485–498, 2012.
L. Tchapmi, C. Choy, I. Armeni, J.Y. Gwak, and S. Savarese. SEGCloud: Semantic segmentation
of 3D point clouds. In 2017 International Conference on 3D Vision (3DV), pages 537–547,
Qingdao, 2017. 3DV.
C. Tebaldi and R. Knutti. The use of the multi-model ensemble in probabilistic climate
projections. Philosophical Transactions of the Royal Society A: Mathematical, Physical and
Engineering Sciences, 365(1857):2053–2075, Aug 2007. doi: 10.1098/rsta.2007.2076. URL
https://doi.org/10.1098/rsta.2007.2076.
M. Tedesco. Remote Sensing of the Cryosphere. John Wiley & Sons, 2014.
J. Teixeira and C.A. Reynolds. Stochastic nature of physical parameterizations in ensemble
prediction: A stochastic convection approach. Monthly Weather Review, 136(2):483–496,
2008.
I. Tekeste and B. Demir. Advanced local binary patterns for remote sensing image retrieval. In
IEEE International Geoscience and Remote Sensing Symposium, pages 6855–6858, July 2018.
J.B. Tenenbaum, V. De Silva, and J.C. Langford. A global geometric framework for nonlinear
dimensionality reduction. Science, 290 (5500):2319–2323, 2000.
M. Tharani, N. Khurshid, and M. Taj. Unsupervised deep features for remote sensing image
matching via discriminator network, 2018. URL https://aps.arxiv.org/abs/1810.06470.
The RGI Consortium. Randolph glacier inventory – a dataset of global glacier outlines: Version
6.0: technical report, Global Land Ice Measurements From Space, Colorado, USA, 2017.
J.J. Thiagarajan, K.N. Ramamurthy, and A. Spanias. Learning stable multilevel dictionaries for
sparse representations. IEEE Transactions on Neural Networks and Learning Systems,
26(9):1913–1926, 2015.
H. Thomas, C.R. Qi, J.-E. Deschaud, B. Marcotegui, F. Goulette, and L. J. Guibas. KPConv:
flexible and deformable convolution for point clouds, April 2019. URL https://arxiv.org/abs/
1904.08889.
V. Thompson, N.J. Dunstone, A.A. Scaife, D.M. Smith, J.M. Slingo, S. Brown, and S.E. Belcher.
High risk of unprecedented UK rainfall in the current climate. Nature Communications,
8(1):107, 2017. ISSN 2041-1723. doi: 10.1038/s41467-017-00275-3. URL https://doi.org/10
.1038/s41467-017-00275-3.
X.-A. Tibau, C. Requena-Mesa, C. Reimers, J. Denzler, V. Eyring, M. Reichstein, and J. Runge.
SupernoVAE: VAE based kernel-PCA for analysis of spatio-temporal Earth data. In Proceedings
of the 8th International Workshop on Climate Informatics: CI 2018, pages 73–77. NCAR, 2018.
X.-A. Tibau, C. Reimers, V. Eyring, J. Denzler, M. Reichstein, and J. Runge. Spatiotemporal
model for benchmarking causal discovery algorithms. EGU General Assembly Conference
Abstracts, 2020.
R. Tipireddy, P. Perdikaris, P. Stinis, and A.M. Tartakovsky. A comparative study of
physics-informed neural network models for learning unknown dynamics and constitutive
relations. ArXiv, abs/1904.04058, 2019.
M.E. Tipping. Sparse kernel principal component analysis. In Advances in Neural Information
Processing Systems, pages 633–639, 2001.
P. Tokarczyk, J.D. Wegner, S. Walk, and K. Schindler. Features, color spaces, and boosting: New
insights on semantic classification of remote sensing images. IEEE Transactions on
Geoscience and Remote Sensing, 53(1):280–295, 2015.
E. Tola, V. Lepetit, and P. Fua. Daisy: An efficient dense descriptor applied to wide-baseline
stereo. IEEE Transactions on Pattern Analysis and Machine Intelligence, 32(5):815–830, May
2010. ISSN 1939-3539. doi: 10.1109/TPAMI.2009.77.
M. Tom, U. Kälin, M. Sütterlin, E. Baltsavias, and K. Schindler. Lake ice detection in
low-resolution optical satellite images. ISPRS Annals of Photogrammetry, Remote Sensing
and Spatial Information Sciences, IV-2, 2018.
M. Tom, R. Aguilar, S. Leinss, E. Baltsavias, and K. Schindler. Lake ice detection from
Sentinel-1 SAR with deep learning. In ISPRS Annals of Photogrammetry, Remote Sensing and
Spatial Information Sciences, 2020. (to appear).
B.A. Toms, E.A. Barnes, and I. Ebert-Uphoff. Physically interpretable neural networks for the
geosciences: Applications to earth system variability. arXiv preprint arXiv:1912.01752, 2019a.
B.A. Toms, K. Kashinath, D. Yang, et al. Deep learning for scientific inference from geophysical
data: The Madden-Julian Oscillation as a test case. arXiv preprint arXiv:1902.04621, 2019b.
B.A. Toms, E.A. Barnes, and I. Ebert-Uphoff. Physically interpretable neural networks for the
geosciences: Applications to earth system variability. Journal of Advances in Modeling Earth
Systems, 12 (9), Aug 2020. ISSN 1942-2466. doi: 10.1029/2019ms002002. URL http://dx.doi
.org/10.1029/2019ms002002.
J. Tonttila, Z. Maalick, T. Raatikainen, H. Kokkola, T. Kuhn, and S. Romakkaniemi.
UCLALES-SALSA v1.0: a large-eddy model with interactive sectional microphysics for
aerosol, clouds and precipitation. Geoscientific Model Development, 10(1):169–188, 2017. doi:
10.5194/gmd-10-169-2017.
B.D. Tracey, K. Duraisamy, and J.J. Alonso. A machine learning strategy to assist turbulence
model development. In 53rd AIAA Aerospace Sciences Meeting, page 1287, 2015.
G. Tramontana, M. Jung, C.R. Schwalm, K. Ichii, G. Camps-Valls, B. Ráduly, M. Reichstein,
M.A. Arain, A. Cescatti, G. Kiely, L. Merbold, P. Serrano-Ortiz, S. Sickert, S. Wolf, and D.
Papale. Predicting carbon dioxide and energy fluxes across global FLUXNET sites with
regression algorithms. Biogeosciences, 13:4291–4313, 2016. doi: 10.5194/bg-13-4291-2016.
D. Tran, L. Bourdev, R. Fergus, L. Torresani, and M. Paluri. Learning spatiotemporal features
with 3d convolutional networks. In ICCV, 2015.
Q.-K. Tran and S. Song. Computer vision in precipitation nowcasting: Applying image quality
assessment metrics for training deep neural networks. Atmosphere, 10(5):244, 2019.
G. Tsagkatakis, A. Aidini, K. Fotiadou, M. Giannopoulos, A. Pentari, and P. Tsakalides. Survey
of deep-learning approaches for remote sensing observation enhancement. Sensors,
19(18):3929, 2019.
Y.-L.S. Tsai, A. Dietz, N. Oppelt, and C. Kuenzer. Remote sensing of snow cover using
spaceborne SAR: A review. Remote Sensing, 11 (12):1456, 2019.
W.-P. Tsai, D. Feng, M. Pan, H. Beck, K. Lawson, Y. Yang, J. Liu, and C. Shen. From parameter
calibration to parameter learning: Revolutionizing large-scale geoscientific modeling with
big data, 2020. URL https://arxiv.org/abs/2007.15751.
D. Tuia, M. Volpi, L. Copa, M. Kanevski, and J. Munoz-Mari. A survey of active learning
algorithms for supervised remote sensing image classification. IEEE Journal of Selected
Topics in Signal Processing, 5(3):606–617, 2011.
A. van den Oord, Y. Li, and O. Vinyals. Representation learning with contrastive predictive
coding. CoRR, abs/1807.03748, 2018. URL http://arxiv.org/abs/1807.03748.
L. van der Maaten and G. Hinton. Visualizing data using t-SNE. Journal of Machine Learning
Research, 9:2579–2605, 2008.
J.E. Vargas, S. Lobry, A.X. Falcão, and D. Tuia. Correcting rural building annotations in
OpenStreetMap using convolutional neural networks. ISPRS Journal of Photogrammetry and
Remote Sensing, 147:283–293, 2019.
N. Varney, V.K. Asari, and Q. Graehling. DALES: A large-scale aerial LiDAR data set for semantic
segmentation. In Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern
Recognition Workshops, pages 186–187, 2020.
J. Verrelst, G. Camps-Valls, J. Muñoz-Marí, J. Pablo Rivera, F. Veroustraete, J.G.P.W. Clevers,
and J. Moreno. Optical remote sensing and the retrieval of terrestrial vegetation
bio-geophysical properties – a review. ISPRS Journal of Photogrammetry and Remote Sensing,
108:273–290, 2015.
J. Vial, J.-L. Dufresne, and S. Bony. On the interpretation of intermodel spread in CMIP5 climate
sensitivity estimates. Climate Dynamics, 41(11-12): 3339–3362, 2013.
P. Vincent, H. Larochelle, I. Lajoie, Y. Bengio, and P.-A. Manzagol. Stacked denoising
autoencoders: Learning useful representations in a deep network with a local denoising
criterion. Journal of Machine Learning Research, 11 (Dec):3371–3408, 2010.
P. Vincent and H. Larochelle. Stacked denoising autoencoders: Learning useful representations
in a deep network with a local denoising criterion. Journal of Machine Learning Research,
11:3371–3408, 2010.
P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust
features with denoising autoencoders. In Proceedings of the 25th International Conference on
Machine Learning, pages 1096–1103. ACM, 2008.
P. Viola and W. Wells. Alignment by maximization of mutual information. In Proceedings of the IEEE
International Conference on Computer Vision, volume 24, pages 16–23, 1995. ISBN
0-8186-7042-8. doi: 10.1109/ICCV.1995.466930.
G. Vivone, L. Alparone, J. Chanussot, M. Dalla Mura, A. Garzelli, G. Licciardi, R. Restaino, and
L. Wald. A critical comparison among pansharpening algorithms. IEEE Transactions on
Geoscience and Remote Sensing, 53(5): 2565–2586, 2014.
G. Vivone, L. Alparone, J. Chanussot, M. Dalla Mura, A. Garzelli, G.A. Licciardi, R. Restaino,
and L. Wald. A critical comparison among pansharpening algorithms. IEEE Transactions on
Geoscience and Remote Sensing, 53(5): 2565–2586, May 2015.
M. Volpi and V. Ferrari. Semantic segmentation of urban scenes by learning local class
interactions. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2015.
M. Volpi and D. Tuia. Dense semantic labeling of subdecimeter resolution images with
convolutional neural networks. IEEE Transactions on Geoscience and Remote Sensing, 55
(2):881–893, 2017.
M. Volpi and D. Tuia. Deep multi-task learning for a geographically-regularized semantic
segmentation of aerial images. ISPRS Journal of Photogrammetry and Remote Sensing,
144:48–60, 2018.
M. Wahabzada, A.-K. Mahlein, C. Bauckhage, U. Steiner, E.-C. Oerke, and K. Kersting. Plant
phenotyping using probabilistic topic models: Uncovering the hyperspectral language of
plants. Scientific Reports, 6, 2016.
L. Wang, H. Geng, P. Liu, K. Lu, J. Kolodziej, R. Ranjan, and A.Y. Zomaya. Particle swarm
optimization based dictionary learning for remote sensing big data. Knowledge-Based
Systems, 79:43–50, 2015.
M. Wang and W. Deng. Deep visual domain adaptation: A survey. Neurocomputing,
312:135–153, 2018.
S. Wang, D. Quan, X. Liang, M. Ning, Y. Guo, and L. Jiao. A deep learning framework for
remote sensing image registration. ISPRS Journal of Photogrammetry and Remote Sensing,
145:148–164, 2018. ISSN 0924-2716. doi: https://doi.org/10.1016/j.isprsjprs.2017.12.012.
Deep Learning RS Data.
W. Wang, Y. Huang, Y. Wang, and L. Wang. Generalized autoencoder: A neural network
framework for dimensionality reduction. In Proceedings of the IEEE Conference on Computer
Vision and Pattern Recognition Workshops, pages 490–497, 2014.
Y. Wang, L. Zhang, X. Tong, L. Zhang, Z. Zhang, H. Liu, X. Xing, and P.T. Mathiopoulos.
A three-layered graph-based learning approach for remote sensing image retrieval. IEEE
Transactions on Geoscience and Remote Sensing, 54(10):6020–6034, October 2016.
Y. Wang, L. Wang, H. Lu, and Y. He. Segmentation based rotated bounding boxes prediction
and image synthesizing for object detection of high resolution aerial images.
Neurocomputing, 2020.
Y. Wang, C. Wang, H. Zhang, Y. Dong, and S. Wei. A SAR dataset of ship detection for deep
learning under complex backgrounds. Remote Sensing, 11(7):765, 2019a.
Y. Wang, M. Long, J. Wang, Z. Gao, and P.S. Yu. PredRNN: Recurrent neural networks for
predictive learning using spatiotemporal LSTMs. In Advances in Neural Information
Processing Systems, pages 879–888, 2017d.
Y. Wang, J. Zhang, H. Zhu, M. Long, J. Wang, and P.S. Yu. Memory in memory: A predictive
neural network for learning higher-order non-stationarity from spatiotemporal dynamics. In
Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages
9154–9162, 2019b.
Z. Wang and A.C. Bovik. A universal image quality index. IEEE Signal Processing Letters,
9(3):81–84, 2002.
Z. Wang, N.M. Nasrabadi, and T.S. Huang. Spatial-spectral classification of hyperspectral
images using discriminative dictionary designed by learning vector quantization. IEEE
Transactions on Geoscience and Remote Sensing, PP(99):1–15, 2013b. ISSN 0196-2892. doi:
10.1109/TGRS.2013.2285049.
Z. Wang and A.C. Bovik. Mean squared error: Love it or leave it? A new look at signal fidelity
measures. IEEE Signal Processing Magazine, 26(1):98–117, 2009.
Z. Wang, E.P. Simoncelli, and A.C. Bovik. Multiscale structural similarity for image quality
assessment. In The Thirty-Seventh Asilomar Conference on Signals, Systems & Computers,
2003, volume 2, pages 1398–1402. IEEE, 2003.
Z. Wang, A.C. Bovik, H.R. Sheikh, E.P. Simoncelli, et al. Image quality assessment: From error
visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600–612,
2004.
P.A.G. Watson. Applying machine learning to improve simulations of a chaotic dynamical
system using empirical error correction. Apr 2019. URL http://arxiv.org/abs/
1904.10904.
Q. Yan and W. Huang. Sea ice sensing from GNSS-R data using convolutional neural networks.
IEEE Geoscience and Remote Sensing Letters, 15(10): 1510–1514, Oct 2018. doi:
10.1109/LGRS.2018.2852143.
Q. Yan, W. Huang, and C. Moloney. Neural networks based sea ice detection and concentration
retrieval from GNSS-R delay-Doppler maps. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 10(8):3789–3798, 2017.
F. Yang, H. Fan, P. Chu, E. Blasch, and H. Ling. Clustered object detection in aerial images. In
CVPR, pages 8311–8320, 2019.
G.-Z. Yang, D.J. Hawkes, D. Rueckert, A. Noble, and C. Taylor. Medical Image Computing and
Computer-Assisted Intervention–MICCAI 2009: 12th International Conference, London, UK,
September 20–24, 2009, Proceedings, volume 5761. Springer, 2009.
J. Yang, X. Fu, Y. Hu, Y. Huang, X. Ding, and J. Paisley. PanNet: A deep network architecture
for pan-sharpening. In 2017 IEEE International Conference on Computer Vision (ICCV),
pages 1753–1761, Oct 2017.
J. Yang, Z. Wang, Z. Lin, S. Cohen, and T. Huang. Coupled dictionary training for image
super-resolution. IEEE Transactions on Image Processing, 21(8):3467–3478, 2012.
L. Yang, S. Treichler, T. Kurth, K. Fischer, D. Barajas-Solano, J. Romero, V. Churavy, A.
Tartakovsky, M. Houston, M. Prabhat, and G. Karniadakis. Highly-scalable,
physics-informed GANs for learning solutions of stochastic PDEs. In 2019 IEEE/ACM Third
Workshop on Deep Learning on Supercomputers (DLS), pages 1–11, Nov 2019. doi:
10.1109/DLS49591.2019.00006.
S. Yang, H. Jin, M. Wang, Y. Ren, and L. Jiao. Data-driven compressive sampling and learning
sparse coding for hyperspectral image classification. IEEE Geoscience and Remote Sensing
Letters, 11(2):479–483, Feb 2014. ISSN 1545-598X. doi: 10.1109/LGRS.2013.2268847.
S. Yang, D. Yang, J. Chen, and B. Zhao. Real-time reservoir operation using recurrent neural
networks and inflow forecast from a distributed hydrological model. Journal of Hydrology,
579:124229, Dec 2019a. ISSN 0022-1694. doi: 10.1016/J.JHYDROL.2019.124229. URL https://
www.sciencedirect.com/science/article/pii/S0022169419309643.
T. Yang, F. Sun, P. Gentine, W. Liu, H. Wang, J. Yin, M. Du, and C. Liu. Evaluation and machine
learning improvement of global hydrological model-based flood simulations. Environmental
Research Letters, 14(11):114027, Nov 2019b.
X. Yang. Understanding the variational lower bound. 2017. URL http://legacydirs.umiacs.umd
.edu/~xyang35/files/understanding-variational-lower.pdf
X. Yang, H. Sun, K. Fu, J. Yang, X. Sun, M. Yan, and Z. Guo. Automatic ship detection in
remote sensing images from Google Earth of complex scenes based on multiscale rotation
dense feature pyramid networks. Remote Sensing, 10(1):132, 2018a. doi: 10.3390/rs10010132.
URL https://doi.org/10.3390/rs10010132.
X. Yang, H. Sun, X. Sun, M. Yan, Z. Guo, and K. Fu. Position detection and direction prediction
for arbitrary-oriented ships via multiscale rotation region convolutional neural network.
arXiv:1806.04828, 2018b.
X. Yang, J. Yang, J. Yan, Y. Zhang, T. Zhang, Z. Guo, X. Sun, and K. Fu. SCRDet: Towards more
robust detection for small, cluttered and rotated objects. In ICCV, pages 8232–8241, 2019c.
Y. Yang and S. Newsam. Geographic image retrieval using local invariant features. IEEE
Transactions on Geoscience and Remote Sensing, 51(2): 818–832, February 2013.
Z. Yang and S. Newsam. Bag-of-visual-words and spatial extensions for land-use classification.
In Proceedings of the 18th SIGSPATIAL International Conference on Advances in Geographic
Information Systems, pages 270–279, 2010.
Z. Yang, T. Dan, and Y. Yang. Multi-temporal remote sensing image registration using deep
convolutional features. IEEE Access, 6:38544–38555, 2018. ISSN 2169-3536.
W. Yao, Z. Zeng, C. Lian, and H. Tang. Pixel-wise regression using U-Net and its application on
pansharpening. Neurocomputing, 312:364–371, 2018.
F. Ye, H. Xiao, X. Zhao, M. Dong, W. Luo, and W. Min. Remote sensing image retrieval using
convolutional neural network features and weighted distance. IEEE Geoscience and Remote
Sensing Letters, 15(10):1535–1539, Oct 2018.
F. Ye, W. Luo, M. Dong, H. He, and W. Min. SAR image retrieval based on unsupervised
domain adaptation and clustering. IEEE Geoscience and Remote Sensing Letters,
16(9):1482–1486, Sep 2019.
L. Ye, L. Gao, R. Marcos-Martinez, D. Mallants, and B.A. Bryan. Projecting Australia’s forest
cover dynamics and exploring influential factors using deep learning. Environmental
Modelling & Software, 119:407–417, 2019.
M.-H. Yen, D.-W. Liu, Y.-C. Hsin, C.-E. Lin, and C.-C. Chen. Application of the deep learning
for the prediction of rainfall in Southern Taiwan. Scientific Reports, 9(1):1–9, September
2019. ISSN 2045-2322. doi: 10/ggcfxm. URL https://www.nature.com/articles/s41598-019-
49242-6.
Z. Yi, H. Zhang, P. Tan, and M. Gong. DualGAN: Unsupervised dual learning for image-to-image
translation. 2017 IEEE International Conference on Computer Vision (ICCV), pages
2868–2876, 2017.
N. Yokoya, P. Ghamisi, J. Xia, S. Sukhanov, R. Heremans, C. Debes, B. Bechtel, B. Le Saux,
G. Moser, and D. Tuia. Open data for global multimodal land use classification: Outcome of
the 2017 IEEE GRSS Data Fusion Contest. IEEE Journal of Selected Topics in Applied Earth
Observations and Remote Sensing, 11(5):1363–1377, 2018.
N. Yokoya, T. Yairi, and A. Iwasaki. Coupled nonnegative matrix factorization unmixing for
hyperspectral and multispectral data fusion. IEEE Transactions on Geoscience and Remote
Sensing, 50(2):528–537, 2012.
N. Yokoya, C. Grohnfeldt, and J. Chanussot. Hyperspectral and multispectral data fusion: A
comparative review of the recent literature. IEEE Geoscience and Remote Sensing Magazine,
5(2):29–56, 2017.
F. Yu and V. Koltun. Multi-scale context aggregation by dilated convolutions. In ICLR, 2016.
Y. Yu, X. Li, and F. Liu. Attention GANs: Unsupervised deep feature learning for aerial scene
classification. IEEE Transactions on Geoscience and Remote Sensing, 58(1):519–531, Jan 2020.
J. Yuan. Automatic building extraction in aerial scenes using convolutional networks. arXiv
preprint arXiv:1602.06564, 2016.
J. Yuan, Z. Chi, X. Cheng, T. Zhang, T. Li, and Z. Chen. Automatic extraction of supraglacial
lakes in southwest Greenland during the 2014–2018 melt seasons based on convolutional
neural network. Water, 12(3), 2020. ISSN 2073-4441. doi: 10.3390/w12030891.
Q. Yuan, Y. Wei, X. Meng, H. Shen, and L. Zhang. A multiscale and multidepth convolutional
neural network for remote sensing imagery pan-sharpening. IEEE Journal of Selected Topics
in Applied Earth Observations and Remote Sensing, 11(3):978–989, March 2018.
Q. Yuan, H. Shen, T. Li, Z. Li, S. Li, Y. Jiang, H. Xu, W. Tan, Q. Yang, J. Wang, et al. Deep
learning in environmental remote sensing: Achievements and challenges. Remote Sensing of
Environment, 241:111716, 2020b.
J. Yuval and P.A. O’Gorman. Use of machine learning to improve simulations of climate. Jan
2020. URL http://arxiv.org/abs/2001.03151.
J. Yuval and P.A. O’Gorman. Stable machine-learning parameterization of subgrid processes for
climate modeling at a range of resolutions. Nature Communications, 11(1):1–10, 2020.
J. Zabalza, J. Ren, J. Zheng, H. Zhao, C. Qing, Z. Yang, P. Du, and S. Marshall. Novel segmented
stacked autoencoder for effective dimensionality reduction and feature extraction in
hyperspectral imaging. Neurocomputing, 185:1–10, 2016.
S. Zagoruyko and N. Komodakis. Learning to compare image patches via convolutional neural
networks. In 2015 IEEE Conference on Computer Vision and Pattern Recognition (CVPR),
pages 4353–4361. doi: 10.1109/CVPR.2015.7299064.
A. Zampieri, G. Charpiat, N. Girard, and Y. Tarabalka. Multimodal image alignment through a
multiscale chain of neural networks with application to remote sensing. In Computer
Vision – ECCV 2018 – 15th European Conference, Munich, Germany, September 8-14, 2018,
Proceedings, Part XVI, pages 679–696, 2018. doi: 10.1007/978-3-030-01270-0_40.
L. Zanna, J.M. Brankart, M. Huber, S. Leroux, T. Penduff, and P.D. Williams. Uncertainty and
scale interactions in ocean ensembles: From seasonal forecasts to multidecadal climate
predictions. Quarterly Journal of the Royal Meteorological Society, 2018.
L. Zanna and T. Bolton. Data-driven discovery of mesoscale eddy closures. Geophysical
Research Letters, 2020. doi: 10.1029/2020GL088376.
L. Zanna, P. Porta Mana, J. Anstey, T. David, and T. Bolton. Scale-aware deterministic and
stochastic parametrizations of eddy-mean flow interaction. Ocean Modelling, 111:66–80,
2017.
M.D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. arXiv
preprint arXiv:1311.2901, 2013.
D. Zhang, J. Lin, Q. Peng, D. Wang, T. Yang, S. Sorooshian, X. Liu, and J. Zhuang. Modeling and
simulating of reservoir operation using the artificial neural network, support vector
regression, deep learning algorithm. Journal of Hydrology, 565:720–736, 2018a. ISSN
0022-1694. doi: https://doi.org/10.1016/j.jhydrol.2018.08.050. URL http://www.sciencedirect
.com/science/article/pii/S0022169418306462.
E. Zhang, L. Liu, and L. Huang. Automatically delineating the calving front of Jakobshavn
Isbræ from multitemporal TerraSAR-X images: a deep learning approach. The Cryosphere,
13(6):1729–1741, 2019. doi: 10.5194/tc-13-1729-2019.
G. Zhang, P. Ghamisi, and X.X. Zhu. Fusion of heterogeneous earth observation data for the
classification of local climate zones. IEEE Transactions on Geoscience and Remote Sensing,
57(10):7623–7642, 2019b.
H. Zhang, W. Ni, W. Yan, D. Xiang, J. Wu, X. Yang, and H. Bian. Registration of multimodal
remote sensing image based on deep fully convolutional neural network. IEEE Journal of
Selected Topics in Applied Earth Observations and Remote Sensing, 12(8):3028–3042, Aug
2019. ISSN 2151-1535. doi: 10.1109/JSTARS.2019.2916560.
J. Zhang, K. Howard, C. Langston, B. Kaney, Y. Qi, L. Tang, H. Grams, Y. Wang, S. Cocks,
S. Martinaitis, et al. Multi-Radar Multi-Sensor (MRMS) quantitative precipitation estimation:
J. Zhao, W. Guo, Z. Zhang, and Y. Wenxian. A coupled convolutional neural network for small
and densely clustered ship detection in SAR images. Science China Information
Sciences, pages 1–16, 2019.
W. Zhao, L. Mou, J. Chen, Y. Bo, and W.J. Emery. Incorporating metric learning and adversarial
network for seasonal invariant change detection. IEEE Transactions on Geoscience and
Remote Sensing, pages 1–12, 2019.
W.L. Zhao, P. Gentine, M. Reichstein, Y. Zhang, S. Zhou, Y. Wen, C. Lin, X. Li, and G.Y. Qiu.
Physics-constrained machine learning of evapotranspiration. Geophysical Research Letters,
page 2019GL085291, Dec 2019. ISSN 0094-8276. doi: 10.1029/2019GL085291. URL https://
onlinelibrary.wiley.com/doi/abs/10.1029/2019GL085291.
W. Zhi, D. Feng, W.-P. Tsai, G. Sterle, A. Harpold, C. Shen, and L. Li. From hydrometeorology to
river water quality: Can a deep learning model predict dissolved oxygen at the continental
scale? Environmental Science & Technology, 2021. doi: 10.1021/acs.est.0c06783.
L. Zhong, L. Hu, and H. Zhou. Deep learning based multi-temporal crop classification. Remote
Sensing of Environment, 221:430–443, 2019.
B. Zhou, A. Andonian, A. Oliva, and A. Torralba. Temporal relational reasoning in videos. In
Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.
G.-B. Zhou, J. Wu, C.-L. Zhang, and Z.-H. Zhou. Minimal gated unit for recurrent neural
networks. International Journal of Automation and Computing, 13:226–234, 2016. doi:
10.1007/s11633-016-1006-2.
J. Zhou, D. Civco, and J. Silander. A wavelet transform method to merge Landsat TM and SPOT
panchromatic data. International Journal of Remote Sensing, 19(4):743–757, 1998.
W. Zhou, Z. Shao, C. Diao, and Q. Cheng. High-resolution remote-sensing imagery retrieval
using sparse features by auto-encoder. Remote Sensing Letters, 6(10):775–783, October 2015.
W. Zhou, S. Newsam, C. Li, and Z. Shao. Learning low dimensional convolutional neural
networks for high-resolution remote sensing image retrieval. Remote Sensing, 9(5):489, May
2017a.
Y. Zhou, Q. Ye, Q. Qiu, and J. Jiao. Oriented response networks. In CVPR, pages 4961–4970.
IEEE, 2017b.
Z.-H. Zhou. A brief introduction to weakly supervised learning. National Science Review,
5:44–53, 2018.
Z. Zhou, G. He, S. Wang, and G. Jin. Subgrid-scale model for large-eddy simulation of isotropic
turbulent flows using an artificial neural network. Computers & Fluids, page 104319, 2019.
H. Zhu, L. Jiao, W. Ma, F. Liu, and W. Zhao. A novel neural network for remote sensing image
matching. IEEE Transactions on Neural Networks and Learning Systems, 30(9):2853–2865,
Sep 2019. ISSN 2162-2388.
J.-Y. Zhu, T. Park, P. Isola, and A.A. Efros. Unpaired image-to-image translation using
cycle-consistent adversarial networks. In IEEE International Conference on Computer Vision,
pages 2223–2232, 2017.
P. Zhu, L. Wen, X. Bian, H. Ling, and Q. Hu. Vision meets drones: A challenge.
arXiv:1804.07437, 2018.
R. Zhu, D. Yu, Sh. Ji, and M. Lu. Matching RGB and infrared remote sensing images with
densely-connected convolutional neural networks. Remote Sensing, 11(23), 2019a. ISSN
2072-4292. doi: 10.3390/rs11232836.
Index
lifetime 17, 18, 21
population 17, 18
Sparse Autoencoder (SAE) 17, 190–191, 199, 243
Sparse Representation 9, 16, 17, 22, 37, 38–40, 42, 44, 191
spatial correlation coefficient (SCC) 140, 141
spatial gradients 126–129
spatial-spectral preservation 148
spatial transformer 75, 125, 128
spectral angle mapper (SAM) 139–141, 146, 147
stochastic spatio-temporal generator 34
stream flow 287–289, 297
structured domain 15
subgrid parameterization 299, 300, 304, 306
Supercomputer 206, 207, 211, 215, 217, 296
super-resolution 16, 29, 137, 138
supervised 2, 4, 7, 9, 15, 17–18, 20, 22, 25, 28, 30–31, 37, 38, 40, 46, 54, 76, 92, 100, 120, 124, 143–147, 151, 154, 155, 157–160, 165–167, 170, 176, 178, 180, 184–185, 201, 262, 300, 328
surrogate model 8, 209, 295
Synthetic Aperture Radar (SAR) 4, 28–31, 48–49, 55, 62–65, 67–68, 71, 86–89, 121, 124–125, 160, 250, 253–254, 256, 259–262, 265, 268
system modelling 329

t
terrestrial water storage anomaly 288
theory-guided data science 295, 315
Tibet see Qinghai-Tibetan Plateau
topography 166, 212, 259, 260, 261–263, 270, 309, 314
transfer learning 4, 66, 91, 149, 154, 198, 261, 293, 294, 305
triplet loss 156, 158, 159, 160
truncated back-propagation through time (tBPTT) 110

u
uncertainty 5–7, 33, 34, 164, 206, 213, 217, 238, 244, 272, 295, 297, 307, 320
U-Net 52, 53, 60, 62, 98–100, 139, 208, 224, 225, 231, 237, 260, 265
unsupervised 2, 4, 7, 9, 15–23, 25–29, 31, 35, 37–39, 92–93, 100, 103, 120, 124, 126, 128, 130, 133, 143–146, 149, 151–154, 157, 176, 195, 201–203, 243, 262, 328
unsupervised deep learning 122, 145–146
unsupervised learning 9, 15, 17, 22, 23, 39, 154, 207
update gate 113, 232
upsampling 50–53, 127, 141, 231

v
vanishing gradients 3, 107–109, 111, 113, 126, 226
variational autoencoders (VAE) 3, 22, 24, 191–192, 194, 195, 197, 199–201, 213

w
water cycle 164, 258, 270
water demand 290
water level 289–291, 294, 317
weakly supervised 2, 4, 7, 9
weather forecasting/prediction 7, 10, 210, 216, 222–239, 324