
Deep Learning in Hydrology

Doctoral Thesis
to obtain the academic degree of
Doktor der Naturwissenschaften
in the Doctoral Program Naturwissenschaften

Submitted by: Frederik Kratzert
Submitted at: Institute for Machine Learning
Thesis Supervisor / First Evaluator: Sepp Hochreiter
Second Evaluator: Bart Nijssen
June 2021

JOHANNES KEPLER UNIVERSITY LINZ
Altenbergerstraße 69, 4040 Linz, Austria
www.jku.at
DVR 0093696
Statutory Declaration

I hereby declare that the thesis submitted is my own unaided work, that I have not
used other than the sources indicated, and that all direct and indirect sources are
acknowledged as references.

This printed thesis is identical with the electronic version submitted.

Linz, June 2021 Frederik Kratzert

Abstract

In the last decade, methods from the field of deep learning have transformed various fields of science and technology. In this thesis, I investigate the applicability of time series deep learning models for rainfall–runoff modeling, one of the key tasks in hydrology, where the aim is to predict streamflow given meteorological observations. Those predictions are, for example, required for early flood warnings, efficient management of water resources and the optimization of hydropower plants.

The transformation from precipitation into streamflow is a highly complex environmental process that is often conceptualized using physically-based storage models. In this thesis, I explore a different approach for rainfall–runoff modeling, focusing on a data-driven approach using the Long Short-Term Memory Network (LSTM). The LSTM is a recurrent neural network with explicit memory states, which makes it conceptually similar to classical hydrological models. However, the significant difference is that the LSTM can learn any process that is deducible from the data, while classical models are limited by the explicitly defined system description that is implemented in the model.

In my thesis, I show how training an LSTM on data of hundreds of basins, using meteorological time series features and static watershed attributes as inputs, results in the new state of the art for rainfall–runoff modeling. I show how these models generalize better to unseen rivers than traditional hydrological models that were specifically calibrated for these rivers. In an analytical study, I investigate the learned process understanding of LSTMs in the context of rainfall–runoff modeling and how parts of the learned behavior match hydrological domain knowledge. Lastly, I present a new model architecture that is inspired by the LSTM but constrained by conservation laws. This model architecture, called Mass-Conserving LSTM, is tested on a range of different tasks, one of which is rainfall–runoff modeling, where it outperforms the LSTM on low probability events.

Kurzfassung

Over the last years, methods from deep learning (DL) have become an essential component of many areas of science and technology. This dissertation investigates the potential of DL methods for rainfall–runoff modeling, one of the central tasks of hydrology. The goal of rainfall–runoff models is to model the discharge of a river from various meteorological input variables. Such models are needed, among other things, for flood warnings, the sustainable management of available water resources, and the efficient operation of hydropower plants.

The water cycle is a highly complex, interacting system that in hydrology is usually abstracted by conceptual storage models. In this dissertation, I investigate whether and how Long Short-Term Memory networks (LSTMs) can be used as an alternative modeling approach for describing rainfall–runoff processes. LSTMs belong to the family of recurrent neural networks and possess explicit memory cells, which makes them structurally similar to classical hydrological storage models. The major difference, however, is that LSTMs can learn any existing relationship between input variables and the target variable, whereas classical models are limited by the existing knowledge used to describe these relationships as well as by the specific model formulation.

Based on meteorological time series and static attributes of the river catchment, the LSTM is able to learn spatially transferable representations. In the course of this dissertation, I show how an LSTM can be trained on data from hundreds of different rivers and thereby becomes a significantly better model than classical hydrological modeling approaches; even in rivers that were not used for training, the LSTM delivers better streamflow predictions than classical hydrological models that were specifically calibrated for these rivers. I further show that the relationships between inputs and discharge learned by the LSTM can be interpreted physically and are consistent with hydrological process understanding. Finally, I present a new DL architecture that builds on the principles of the LSTM but explicitly obeys the physical law of conservation of mass.
Acknowledgements

This thesis would never have been possible without the guidance, interaction and help of many different people around the world.

First, I want to thank my supervisor Sepp. I am thankful that you gave me the opportunity to join your lab as a PhD student, even though my formal background was in environmental engineering. I appreciated all the discussions we had over the years, as well as your feedback on all my questions. I am especially grateful for the knowledge that you shared with me about machine learning in general and the Long Short-Term Memory Network in particular.

My sincere gratitude also goes to the Flood Forecasting team at Google, especially Sella Nevo. You did everything possible to find a way to fund my PhD, because you believed in my work. The funding that you organized was the sole reason I was able to dedicate all of my time to the topics presented in this thesis.

This PhD was made enjoyable in large part thanks to my nice and brilliant colleagues at the Institute for Machine Learning. A special shout-out to everyone in the Viennese office, with whom I spent the majority of my time: Günter, for his guidance and supervision in our Viennese office, as well as PJ and Johannes, with whom I shared my office.

Undoubtedly, I have to give special credit to Daniel, who was there with me from the first minute. Together, in our free time, we started the journey that resulted in the content of this thesis. I am happy that I could convince you to join the Institute for Machine Learning as well, where we started the AI for Earth Sciences group. I want you to know that hardly any of this would have been possible without you. Speaking of the AI for Earth Sciences group, I want to thank Martin, who recently joined our group. You were a huge help with the release of our library and you taught me a lot about software design. I hope that in the future we as a group will continue to work together, no matter where each of us ends up. We achieved a lot over the past years and there are exciting times ahead.

My sincere thanks to Grey Nearing, who also had a major impact on all work
presented in this thesis. After we met at the AGU 2018 in Washington, we spent
countless days/nights talking and discussing ideas, which resulted in many of the
papers presented here.

I am grateful for my family, especially my mother, who always believed in me and allowed me to go my own way at a very young age. You taught me to question everything and to stand up for my beliefs. I hope that I will become as good a parent as you have been to me.

Without a single doubt, the most important person during my journey as a PhD student was my partner Claire. You led by example and finished your PhD with brilliance. You always support me without question and provide me a safe space whenever I am in need. You are the reason for many decisions that I have taken in my life, some of which led to this PhD and some of which resulted in our children. I do not regret a single one of them and I am joyfully looking toward our future.

Last but not least, I am grateful for our son Mo, who allowed me to observe how neural networks learn in the real world. Watching you grow up and learn everything from scratch is a good example of how much trial and error is needed. Soon your sibling will arrive and we can experience this process again, starting from a different initialization.

Contents

Statutory Declaration iii

Abstract v

Kurzfassung vii

Acknowledgements ix

1 Introduction 1

2 Overview 3
2.1 Long Short-Term Memory networks for rainfall–runoff modeling . . . 4
2.2 Towards learning universal, regional, and local hydrological behaviors
via machine learning applied to large-sample datasets . . . . . . . . . 5
2.3 Toward improved predictions in ungauged basins: Exploiting the
power of machine learning . . . . . . . . . . . . . . . . . . . . . . . . 6
2.4 NeuralHydrology – Interpreting LSTMs in Hydrology . . . . . . . . . . 7
2.5 MC-LSTM: Mass-Conserving LSTM . . . . . . . . . . . . . . . . . . . 7

3 Publications 9
F. Kratzert et al. (2018). „Rainfall–runoff modelling using Long Short-
Term Memory (LSTM) networks“. In: Hydrology and Earth System
Sciences 22.11, pp. 6005–6022 . . . . . . . . . . . . . . . . . . . . . . 9
F. Kratzert et al. (2019c). „Towards learning universal, regional, and
local hydrological behaviors via machine learning applied to large-
sample datasets“. In: Hydrology and Earth System Sciences 23.12,
pp. 5089–5110 . . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 28
F. Kratzert et al. (2019b). „Toward Improved Predictions in Un-
gauged Basins: Exploiting the Power of Machine Learning“. In: Water
Resources Research 55.12, pp. 11344–11354 . . . . . . . . . . . . . . 51
F. Kratzert et al. (2019a). „NeuralHydrology – Interpreting LSTMs in
Hydrology“. In: Samek W., Montavon G., Vedaldi A., Hansen L., Müller
KR. (eds) Explainable AI: Interpreting, Explaining and Visualizing Deep
Learning. Lecture Notes in Computer Science 11700 . . . . . . . . . . 63

P.-J. Hoedt et al. (2021). „MC-LSTM: Mass-Conserving LSTM“. in:
Proceedings of the 38th International Conference on Machine Learning,
ICML 2021, 18-24 July 2021, Virtual Event . . . . . . . . . . . . . . . 81

4 Conclusion 111

Bibliography 113

Appendices 117
Curriculum Vitae . . . . . . . . . . . . . . . . . . . . . . . . . . . . . 119

1 Introduction
This thesis investigates the use of deep learning methods for solving one of the most
important tasks in hydrology: rainfall–runoff modeling.

Deep Learning (DL), as a subfield of Machine Learning, has brought advances to numerous fields of science and technology. Computer vision and natural language processing are perhaps the most prominent examples, where approaches based on deep neural networks dominate the field. The truth is, however, that it is almost impossible to find any area of technology or science that is not affected by innovations in DL.

Hydrology, on the other hand, is an applied field of science that studies the movement of water on our planet. One of the field's most fundamental tasks for society is rainfall–runoff modeling, where the goal is to predict the amount of water in a river given meteorological data, such as precipitation and temperature. Those predictions are, for instance, necessary for early flood warnings and the efficient management of hydropower plants, thus helping to save lives as well as to optimize the generation of renewable energy.

Although neural networks have been tested for rainfall–runoff modeling since the 1990s (Daniell 1991; Halff et al. 1993; Carriere et al. 1996; Hsu et al. 1997; Abrahart et al. 2012), the dominant type of model is still the conceptual hydrological model. These models implement equations that conceptualize our understanding of (some of the most) important physical processes involved in the transformation of precipitation into streamflow, leaving only a handful of free model parameters that can be used to fit the model to a specific river. However, these models are limited to the description of input–output relationships that are already known, since the processes must be hard-coded in the model implementation. Unfortunately, these process descriptions are not always correct, or they might not be relevant to the problem at hand. Additionally, it has been shown that those models suffer from poor generalization, which complicates their application to rivers for which they were not calibrated (Sivapalan 2003; Hrachowitz et al. 2013).

Therefore, the goal of this thesis is to investigate the potential of DL for rainfall–runoff modeling. More specifically, the focus is on DL models based on the Long Short-Term Memory Network (LSTM; Hochreiter 1991; Hochreiter and Schmidhuber 1997). LSTMs are recurrent neural networks with dedicated memory cells and a series of gates. The cells can be used to store information over time and the gates are used to control the information flow from inputs to outputs. Having internal memory seems to be an ideal model configuration, since storage processes play a key role in the transformation of rainfall into runoff (e.g., snow, soil moisture, groundwater). Conceptually, this makes LSTMs similar to traditional hydrological models, yet the main difference is that LSTMs can learn anything that is deducible from the data and that is important for solving the task.

Fig. 1.1: Visualization of a virtual watershed and some of the most important processes that are involved in the transformation from precipitation into streamflow.
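To make this conceptual similarity concrete, the update rule of a simple bucket-type storage model and the LSTM cell-state update can be written side by side. The linear reservoir below is a generic textbook example and not a model used in this thesis; the pairing is an illustrative analogy, not an equivalence:

S_t = S_{t-1} - k · S_{t-1} + P_t          (linear reservoir: storage S, outflow k·S, precipitation input P)
c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t            (LSTM memory cells: forget gate f_t, input gate i_t, candidate update c̃_t)

In both cases, part of the previous storage is retained, part is released, and new input is added; the difference is that the LSTM learns its retention and release behavior from data.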

This thesis addresses the following questions:

1. Are DL models suited for modeling the complex physical real-world processes involved in the transformation of precipitation into streamflow?

2. Can DL-based rainfall–runoff models generalize to rivers outside of their training domain?

3. Can we benefit from advances in the field of Explainable AI to understand what the DL models have learned?

4. Does the introduction of stronger inductive biases (e.g. conservation principles) into the model architecture improve model predictions, especially for low probability events?

2 Overview
The work presented in this chapter has been published in different journals and
conference proceedings:

Kratzert, F., D. Klotz, C. Brenner, K. Schulz, and M. Herrnegger (2018). „Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks". In: Hydrology and Earth System Sciences 22.11, pp. 6005–6022. DOI: 10.5194/hess-22-6005-2018.

Kratzert, F., D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing (2019c). „Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets". In: Hydrology and Earth System Sciences 23.12, pp. 5089–5110. DOI: 10.5194/hess-23-5089-2019.

Kratzert, F., D. Klotz, M. Herrnegger, A. Sampson, S. Hochreiter, and G. Nearing (2019b). „Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning". In: Water Resources Research 55.12, pp. 11344–11354. DOI: 10.1029/2019WR026065.

Kratzert, F., M. Herrnegger, D. Klotz, S. Hochreiter, and G. Klambauer (2019a). „NeuralHydrology – Interpreting LSTMs in Hydrology". In: Samek W., Montavon G., Vedaldi A., Hansen L., Müller K.-R. (eds) Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science 11700. DOI: 10.1007/978-3-030-28954-6_19.

Hoedt, P.-J.*, F. Kratzert*, D. Klotz, C. Halmich, M. Holzleitner, G. Nearing, S. Hochreiter, and G. Klambauer (2021). „MC-LSTM: Mass-Conserving LSTM". In: Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event.

In this list, * denotes shared first authorship. The full publications can be found
in Chapter 3.

Related Publications. While not part of this thesis, the following journal publications are related to the topics it covers:
Gauch, M., F. Kratzert, D. Klotz, G. Nearing, J. Lin, and S. Hochreiter (2021). „Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network". In: Hydrology and Earth System Sciences 25.4, pp. 2045–2062. DOI: 10.5194/hess-25-2045-2021.

Kratzert, F., D. Klotz, S. Hochreiter, and G. S. Nearing (2021). „A note on leveraging synergy in multiple meteorological data sets with deep learning for rainfall–runoff modeling". In: Hydrology and Earth System Sciences 25.5, pp. 2685–2703. DOI: 10.5194/hess-25-2685-2021.

Klotz, D., F. Kratzert, M. Gauch, A. Sampson, J. Brandstetter, G. Klambauer, S. Hochreiter, and G. Nearing (2021). „Uncertainty Estimation with Deep Learning for Rainfall–Runoff Modelling". In: Hydrology and Earth System Sciences Discussions 2021, pp. 1–32. DOI: 10.5194/hess-2021-154.

Nearing, G., F. Kratzert, A. Sampson, C. Pelissier, D. Klotz, J. Frame, C. Prieto, and H. Gupta (2021). „What Role Does Hydrological Science Play in the Age of Machine Learning?" In: Water Resources Research 57.3, e2020WR028091. DOI: 10.1029/2020WR028091.

The following sections summarize the key publications of this thesis and describe
how these publications are connected to each other.

2.1 Long Short-Term Memory networks for rainfall–runoff modeling
In Kratzert et al. (2018), I conducted the first experiments to investigate the suitability of LSTMs as rainfall–runoff models. Using 241 watersheds from the publicly available CAMELS dataset (Catchment Attributes and MEteorology for Large-sample Studies; Newman et al. 2015; Addor et al. 2017), I compared the LSTM to the well-established Sacramento Soil Moisture Accounting Model (SAC-SMA; Burnash et al. 1973). The LSTMs were trained with time series of different meteorological variables (precipitation, temperature, solar radiation, vapor pressure). While, from a DL perspective, including historic runoff as an input might simplify the prediction task, it was deliberately left out here: runoff predictions in ungauged watersheds, i.e. watersheds without runoff observations, are essential, given that the majority of rivers in the world are either ungauged or poorly gauged (Sivapalan 2003; Goswami et al. 2007). I ran three different experiments: First, following standard practice for hydrological models, I trained one LSTM for each watershed separately.
Second, similar to approaches for regional hydrological modeling, I trained a single
LSTM for all watersheds of a region combined. Third, inspired by methods from DL,
I finetuned the regional model to each watershed individually. The results showed
that the LSTM is indeed able to learn the rainfall–runoff relationship. However,
while training a model for an individual watershed showed no benefit over SAC-
SMA, learning a regional model from multiple watersheds and then finetuning these
models to individual watersheds did result in better prediction accuracy.
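The regional-training-plus-finetuning procedure can be sketched in a few lines. The snippet below is a minimal illustration using Keras; the array shapes, epoch counts and learning rates are made up for the example and are not the configuration used in the study (which is described in the reprinted paper in Chapter 3):

import numpy as np
from tensorflow import keras

n_timesteps, n_features = 365, 5          # illustrative look-back length and input dimension

# Hypothetical regional training data: samples pooled from many basins.
x_regional = np.random.rand(10_000, n_timesteps, n_features).astype("float32")
y_regional = np.random.rand(10_000, 1).astype("float32")

# A small LSTM-based rainfall-runoff model (architecture only for illustration).
model = keras.Sequential([
    keras.layers.LSTM(20, input_shape=(n_timesteps, n_features)),
    keras.layers.Dense(1),
])
model.compile(optimizer="adam", loss="mse")

# Step 1: train one regional model on the data of all basins combined.
model.fit(x_regional, y_regional, batch_size=512, epochs=10)

# Step 2: fine-tune a copy of the regional model on a single basin's data,
# typically for a few epochs and/or with a lower learning rate.
x_basin = np.random.rand(5_000, n_timesteps, n_features).astype("float32")
y_basin = np.random.rand(5_000, 1).astype("float32")

finetuned = keras.models.clone_model(model)
finetuned.set_weights(model.get_weights())
finetuned.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4), loss="mse")
finetuned.fit(x_basin, y_basin, batch_size=512, epochs=3)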

2.2 Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets
Given the promising results of the regional modeling experiment in Kratzert
et al. (2018), in Kratzert et al. (2019c) I investigated how this approach can be
enhanced by making use of additional static watershed attributes.

To recall from above, in Kratzert et al. (2018) the only model inputs to the LSTM were meteorological time series data, even in the case of the regional modeling experiment. Thus, given an input sequence of meteorological data, the LSTM had no indication of which river or type of eco-region it was making predictions for. However, the catchment response to the same meteorological time series varies heavily, depending on the characteristics of the watershed. For example, forests usually absorb and store a large amount of the fallen precipitation, leading to a delayed or damped signal in the discharge time series, while the same precipitation in a watershed that is dominated by bare soil leads to an immediate, undamped spike in the hydrograph.

Having dynamic time series data and static watershed attributes, in Kratzert
et al. (2019c) I compared two different model architectures for this setup. First,
the LSTM by Hochreiter (1991), where the static features are concatenated to the
dynamic features at each time step. Second, the Entity-Aware-LSTM (EA-LSTM), a
variant of the LSTM, where the static features are used exclusively to modulate the
input gate, while the dynamic features are used as inputs in all other gates. The idea
was to make the model more interpretable by using exclusively the static features to
decide which parts of the model are used in a given watershed.
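The difference between the two variants can be sketched in a few lines of NumPy. This is a simplified illustration of the gating idea only; the dimensions, weight names and sigmoid parameterization are mine and not the exact implementation of Kratzert et al. (2019c):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Illustrative dimensions: n_dyn dynamic (meteorological) inputs, n_stat static attributes.
n_dyn, n_stat, n_hidden = 5, 27, 20
rng = np.random.default_rng(0)
W_i, b_i = rng.normal(size=(n_hidden, n_stat)), np.zeros(n_hidden)

def ea_lstm_input_gate(x_static):
    # EA-LSTM: the input gate is computed from the static catchment attributes only,
    # so it is constant over time for a given basin and acts as a learned basin embedding.
    return sigmoid(W_i @ x_static + b_i)

def standard_lstm_input(x_dynamic_t, x_static):
    # Standard LSTM variant: statics are simply concatenated to the dynamic inputs
    # at every time step, and all gates see the combined vector.
    return np.concatenate([x_dynamic_t, x_static])

i = ea_lstm_input_gate(rng.random(n_stat))   # one gate vector per basin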

In a large-sample benchmarking experiment with more than 500 watersheds, I could show that both the LSTM and the EA-LSTM outperform various well-established hydrological models by large margins, making LSTM-based models the new state of the art for rainfall–runoff modeling. While the EA-LSTM performs slightly worse than the LSTM, the watershed encoding that it learns in its input gate revealed an understanding of watershed similarities and differences that corresponds well to what we would expect from prior hydrological understanding.

2.3 Toward improved predictions in ungauged basins: Exploiting the power of machine learning
In hydrology, it is often argued that models based on machine learning will underperform in conditions that differ from the training data (e.g., Kirchner 2006; Milly et al. 2008; Vaze et al. 2015). The task of Prediction in Ungauged Basins (PUB) is an example of such a setting: we want to make predictions in watersheds without streamflow observations, i.e. no model can be trained specifically for these locations. PUB was a decadal problem of the International Association of Hydrological Sciences (IAHS) from 2003–2012 (Sivapalan 2003) and, despite a combined effort by the hydrology community, it is still largely unresolved (Hrachowitz et al. 2013).

Given the excellent regional modeling results in Kratzert et al. (2019c), in Kratzert et al. (2019b) I tested how well LSTM-based models simulate runoff in watersheds that were not included in the training data, i.e. a simulated PUB setting. For that, I split the basins into 10 random folds, using k-fold cross validation. I then trained one model using the data of all basins from nine out of ten folds and evaluated it on the hold-out basins in the tenth fold. This was repeated ten times, so that each basin appeared exactly once in the hold-out test set. Again, as in the previous study (Kratzert et al. 2019c), the model inputs were a combination of meteorological time series features and static watershed attributes.
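This basin-wise cross-validation setup can be written down in a few lines. The sketch below uses scikit-learn's KFold and a hypothetical list of basin identifiers; the paper does not prescribe a specific library, and the training and evaluation functions are only indicated:

from sklearn.model_selection import KFold

basins = [f"basin_{i:03d}" for i in range(531)]   # hypothetical basin IDs

# 10 random folds over basins (not over time): each model is trained on the basins
# of nine folds and evaluated on the held-out basins it has never seen.
kfold = KFold(n_splits=10, shuffle=True, random_state=0)
for fold, (train_idx, test_idx) in enumerate(kfold.split(basins)):
    train_basins = [basins[i] for i in train_idx]
    test_basins = [basins[i] for i in test_idx]
    # train_one_model(train_basins) and evaluate it on test_basins (not shown);
    # over all ten folds, every basin ends up in the hold-out set exactly once.
    print(fold, len(train_basins), len(test_basins))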

I compared the LSTM predictions to the current US National Water Model, as well as to the SAC-SMA, which was individually calibrated for each basin (best practice for hydrological models). The results showed that, on average, the LSTM gives better predictions than both hydrological models in rivers that were not included in the training data. This suggests that the LSTM learns a general understanding of the relationship between inputs and outputs that is transferable to previously unseen rivers. Further, the argument that process-based models might be preferable for out-of-sample conditions might not hold water.

2.4 NeuralHydrology – Interpreting LSTMs in Hydrology
Despite the long history of neural network applications in hydrology and the most recent successes of LSTMs, those models still do not have the best reputation among hydrologists. A phrase that is often raised in this context is that models "must work well for the right reasons" (Kirchner 2006). In Kratzert et al. (2019a), I looked at different possibilities to qualitatively interpret LSTMs and their learned process understanding in the context of rainfall–runoff modeling. Using state correlations, I could show that the LSTM learned to model hydrological states internally (such as soil moisture and snow), while only being trained to predict discharge from the meteorological inputs. Furthermore, I analyzed the influence of the inputs on the prediction over time using the integrated gradients method by Sundararajan et al. (2017) and could show that the behavioral patterns match our hydrological system understanding.
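Integrated gradients attribute a prediction to each input value by integrating the model's gradient along a straight path from a baseline input to the actual input. A minimal TensorFlow sketch is given below; the model, the all-zeros baseline and the number of integration steps are placeholders rather than the exact setup used in the chapter:

import tensorflow as tf

def integrated_gradients(model, x, baseline=None, steps=50):
    # x: one input sequence of shape (timesteps, features); baseline defaults to all zeros.
    if baseline is None:
        baseline = tf.zeros_like(x)
    # Interpolate between baseline and input and accumulate gradients along the path.
    alphas = tf.reshape(tf.linspace(0.0, 1.0, steps), (steps, 1, 1))
    interpolated = baseline[None, ...] + alphas * (x - baseline)[None, ...]
    with tf.GradientTape() as tape:
        tape.watch(interpolated)
        predictions = model(interpolated)
    grads = tape.gradient(predictions, interpolated)
    avg_grads = tf.reduce_mean(grads, axis=0)
    # Attribution of the prediction to every input value (Sundararajan et al. 2017).
    return (x - baseline) * avg_grads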

2.5 MC-LSTM: Mass-Conserving LSTM


Lastly, in Hoedt et al. (2021), we investigated the benefit of inductive biases for time series problems in which certain quantities need to be conserved, e.g. water in the context of rainfall–runoff modeling. We modified the LSTM to enforce the conservation of mass by design. The mass in this case does not have to be physical mass, but can be any conserved quantity, such as energy, particles, or the count of a specific input. The core concepts of the Mass-Conserving LSTM (MC-LSTM) are as follows:

• The inputs are split into mass inputs and auxiliary inputs. The mass inputs are
conserved by design, while the auxiliary inputs are used to modulate the gates
of the model.

• The model uses stochastic matrices in the input and a (newly added) redistri-
bution gate to conserve the incoming mass, as well as the mass stored in the
cell states.

• The outgoing mass, determined by a sigmoidal gate, is removed from the cell
states.
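A single time step of this scheme can be sketched in NumPy as follows. This is a strongly simplified illustration of the conservation idea only; in the actual MC-LSTM of Hoedt et al. (2021) the gates are learned functions of the auxiliary inputs and the cell states, which is omitted here:

import numpy as np

def mc_lstm_step(c_prev, x_mass, i, R, o):
    """One illustrative mass-conserving update.

    c_prev : (k,) mass stored in the k cell states
    x_mass : scalar mass input at this time step (e.g. precipitation)
    i      : (k,) input gate, non-negative and summing to 1 (distributes new mass)
    R      : (k, k) redistribution matrix, columns non-negative and summing to 1
    o      : (k,) output gate in (0, 1), removes mass from the cells
    """
    m = R @ c_prev + i * x_mass      # redistribute stored mass and add the incoming mass
    h = o * m                        # outgoing mass (e.g. contribution to discharge)
    c = (1.0 - o) * m                # what is not released stays in the cell states
    return c, h

# Mass balance check: stored plus released mass equals previous storage plus input.
rng = np.random.default_rng(1)
k = 4
c_prev = rng.random(k)
i = rng.random(k); i /= i.sum()
R = rng.random((k, k)); R /= R.sum(axis=0, keepdims=True)
o = rng.random(k)
c, h = mc_lstm_step(c_prev, 1.5, i, R, o)
assert np.isclose(c.sum() + h.sum(), c_prev.sum() + 1.5)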

We tested the model on different tasks where conservation matters, one of which was rainfall–runoff modeling. Here, we could show that the inductive bias of the MC-LSTM is especially beneficial at the tails of the distribution, e.g. for predicting flood peaks.



3 Publications

This chapter presents publications as originally published, reprinted with permission from the corresponding publishers. The copyright of the original publications is held by the respective copyright holders. In order to fit the page dimensions, reprinted publications may be scaled in size and/or cropped.
Hydrol. Earth Syst. Sci., 22, 6005–6022, 2018
https://doi.org/10.5194/hess-22-6005-2018
© Author(s) 2018. This work is distributed under the Creative Commons Attribution 4.0 License.

Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks

Frederik Kratzert1,*, Daniel Klotz1, Claire Brenner1, Karsten Schulz1, and Mathew Herrnegger1
1 Institute of Water Management, Hydrology and Hydraulic Engineering, University of Natural Resources and Life Sciences, Vienna, 1190, Austria
* Invited contribution by Frederik Kratzert, recipient of the EGU Hydrological Sciences Outstanding Student Poster and PICO Award 2016.

Correspondence: Frederik Kratzert ([email protected])

Received: 4 May 2018 – Discussion started: 14 May 2018
Revised: 25 September 2018 – Accepted: 13 November 2018 – Published: 22 November 2018
Abstract. Rainfall–runoff modelling is one of the key challenges in the field of hydrology. Various approaches exist, ranging from physically based over conceptual to fully data-driven models. In this paper, we propose a novel data-driven approach, using the Long Short-Term Memory (LSTM) network, a special type of recurrent neural network. The advantage of the LSTM is its ability to learn long-term dependencies between the provided input and output of the network, which are essential for modelling storage effects in e.g. catchments with snow influence. We use 241 catchments of the freely available CAMELS data set to test our approach and also compare the results to the well-known Sacramento Soil Moisture Accounting Model (SAC-SMA) coupled with the Snow-17 snow routine. We also show the potential of the LSTM as a regional hydrological model in which one model predicts the discharge for a variety of catchments. In our last experiment, we show the possibility to transfer process understanding, learned at regional scale, to individual catchments and thereby increasing model performance when compared to a LSTM trained only on the data of single catchments. Using this approach, we were able to achieve better model performance as the SAC-SMA + Snow-17, which underlines the potential of the LSTM for hydrological modelling applications.

1 Introduction

Rainfall–runoff modelling has a long history in hydrological sciences and the first attempts to predict the discharge as a function of precipitation events using regression-type approaches date back 170 years (Beven, 2001; Mulvaney, 1850). Since then, modelling concepts have been further developed by progressively incorporating physically based process understanding and concepts into the (mathematical) model formulations. These include explicitly addressing the spatial variability of processes, boundary conditions and physical properties of the catchments (Freeze and Harlan, 1969; Kirchner, 2006; Schulla, 2007). These developments are largely driven by the advancements in computer technology and the availability of (remote sensing) data at high spatial and temporal resolution (Hengl et al., 2017; Kollet et al., 2010; Mu et al., 2011; Myneni et al., 2002; Rennó et al., 2008).

However, the development towards coupled, physically based and spatially explicit representations of hydrological processes at the catchment scale has come at the price of high computational costs and a high demand for necessary (meteorological) input data (Wood et al., 2011). Therefore, physically based models are still rarely used in operational rainfall–runoff forecasting. In addition, the current data sets for the parameterization of these kind of models, e.g. the 3-D information on the physical characteristics of the sub-surface, are mostly only available for small, experimental watersheds, limiting the model's applicability for larger river basins in an operational context. The high computational costs further limit their application, especially if
uncertainty estimations and multiple model runs within an ensemble forecasting framework are required (Clark et al., 2017). Thus, simplified physically based or conceptual models are still routinely applied for operational purposes (Adams and Pagano, 2016; Herrnegger et al., 2018; Lindström et al., 2010; Stanzel et al., 2008; Thielen et al., 2009; Wesemann et al., 2018). In addition, data-based mechanistic modelling concepts (Young and Beven, 1994) or fully data-driven approaches such as regression, fuzzy-based or artificial neural networks (ANNs) have been developed and explored in this context (Remesan and Mathew, 2014; Solomatine et al., 2009; Zhu and Fujita, 1993).

ANNs are especially known to mimic highly non-linear and complex systems well. Therefore, the first studies using ANNs for rainfall–runoff prediction date back to the early 1990s (Daniell, 1991; Halff et al., 1993). Since then, many studies applied ANNs for modelling runoff processes (see for example Abrahart et al., 2012; ASCE Task Committee on Application of Artificial Neural Networks, 2000, for a historic overview). However, a drawback of feed-forward ANNs, which have mainly been used in the past, for time series analysis is that any information about the sequential order of the inputs is lost. Recurrent neural networks (RNNs) are a special type of neural network architecture that have been specifically designed to understand temporal dynamics by processing the input in its sequential order (Rumelhart et al., 1986). Carriere et al. (1996) and Hsu et al. (1997) conducted the first studies using RNNs for rainfall–runoff modelling. The former authors tested the use of RNNs within laboratory conditions and demonstrated their potential use for event-based applications. In their study, Hsu et al. (1997) compared a RNN to a traditional ANN. Even though the traditional ANN in general performed equally well, they found that the number of delayed inputs, which are provided as driving inputs to the ANN, is a critical hyperparameter. However, the RNN, due to its architecture, made the search for this number obsolete. Kumar et al. (2004) also used RNNs for monthly streamflow prediction and found them to outperform a traditional feed-forward ANN.

For problems however for which the sequential order of the inputs matters, the current state-of-the-art network architecture is the so-called "Long Short-Term Memory" (LSTM), which in its initial form was introduced by Hochreiter and Schmidhuber (1997). Through a specially designed architecture, the LSTM overcomes the problem of the traditional RNN of learning long-term dependencies representing e.g. storage effects within hydrological catchments, which may play an important role for hydrological processes, for example in snow-driven catchments.

In recent years, neural networks have gained a lot of attention under the name of deep learning (DL). As in hydrological modelling, the success of DL approaches is largely facilitated by the improvements in computer technology (especially through graphic processing units or GPUs; Schmidhuber, 2015) and the availability of huge data sets (Halevy et al., 2009; Schmidhuber, 2015). While most well-known applications of DL are in the field of computer vision (Farabet et al., 2013; Krizhevsky et al., 2012; Tompson et al., 2014), speech recognition (Hinton et al., 2012) or natural language processing (Sutskever et al., 2014) few attempts have been made to apply recent advances in DL to hydrological problems. Shi et al. (2015) investigated a deep learning approach for precipitation nowcasting. Tao et al. (2016) used a deep neural network for bias correction of satellite precipitation products. Fang et al. (2017) investigated the use of deep learning models to predict soil moisture in the context of NASA's Soil Moisture Active Passive (SMAP) satellite mission. Assem et al. (2017) compared the performance of a deep learning approach for water flow level and flow predictions for the Shannon River in Ireland with multiple baseline models. They reported that the deep learning approach outperforms all baseline models consistently. More recently, D. Zhang et al. (2018) compared the performance of different neural network architectures for simulating and predicting the water levels of a combined sewer structure in Drammen (Norway), based on online data from rain gauges and water-level sensors. They confirmed that LSTM (as well as another recurrent neural network architecture with cell memory) are better suited for multi-step-ahead predictions than traditional architectures without explicit cell memory. J. Zhang et al. (2018) used an LSTM for predicting water tables in agricultural areas. Among other things, the authors compared the resulting simulation from the LSTM-based approach with that of a traditional neural network and found that the former outperforms the latter. In general, the potential use and benefits of DL approaches in the field of hydrology and water sciences has only recently come into the focus of discussion (Marçais and de Dreuzy, 2017; Shen, 2018; Shen et al., 2018). In this context we would like to mention Shen (2018) more explicitly, since he provides an ambitious argument for the potential of DL in earth sciences/hydrology. In doing so he also provides an overview of various applications of DL in earth sciences. Of special interest for the present case is his point that DL might also provide an avenue for discovering emergent behaviours of hydrological phenomena.

Regardless of the hydrological modelling approach applied, any model will be typically calibrated for specific catchments for which observed time series of meteorological and hydrological data are available. The calibration procedure is required because models are only simplifications of real catchment hydrology and model parameters have to effectively represent non-resolved processes and any effect of subgrid-scale heterogeneity in catchment characteristics (e.g. soil hydraulic properties) (Beven, 1995; Merz et al., 2006). The transferability of model parameters (regionalization) from catchments where meteorological and runoff data are available to ungauged or data-scarce basins is one of the ongoing challenges in hydrology (Buytaert and Beven, 2009; He et al., 2011; Samaniego et al., 2010).
The aim of this study is to explore the potential of the LSTM architecture (in the adapted version proposed by Gers et al., 2000) to describe the rainfall–runoff behaviour of a large number of differently complex catchments at the daily timescale. Additionally, we want to analyse the potential of LSTMs for regionalizing the rainfall–runoff response by training a single model for a multitude of catchments. In order to allow for a more general conclusion about the suitability of our modelling approach, we test this approach on a large number of catchments of the CAMELS data set (Addor et al., 2017b; Newman et al., 2014). This data set is freely available and includes meteorological forcing data and observed discharge for 671 catchments across the contiguous United States. For each basin, the CAMELS data set also includes time series of simulated discharge from the Sacramento Soil Moisture Accounting Model (Burnash et al., 1973) coupled with the Snow-17 snow model (Anderson, 1973). In our study, we use these simulations as a benchmark, to compare our model results with an established modelling approach.

The paper is structured in the following way: in Sect. 2, we will briefly describe the LSTM network architecture and the data set used. This is followed by an introduction into three different experiments: in the first experiment, we test the general ability of the LSTM to model rainfall–runoff processes for a large number of individual catchments. The second experiment investigates the capability of LSTMs for regional modelling, and the last tests whether the regional models can help to enhance the simulation performance for individual catchments. Section 3 presents and discusses the results of our experiments, before we end our paper with a conclusion and outlook for future studies.

2 Methods and database

2.1 Long Short-Term Memory network

In this section, we introduce the LSTM architecture in more detail, using the notation of Graves et al. (2013). Beside a technical description of the network internals, we added a "hydrological interpretation of the LSTM" in Sect. 3.5 in order to bridge differences between the hydrological and deep learning research communities.

The LSTM architecture is a special kind of recurrent neural network (RNN), designed to overcome the weakness of the traditional RNN to learn long-term dependencies. Bengio et al. (1994) have shown that the traditional RNN can hardly remember sequences with a length of over 10. For daily streamflow modelling, this would imply that we could only use the last 10 days of meteorological data as input to predict the streamflow of the next day. This period is too short considering the memory of catchments including groundwater, snow or even glacier storages, with lag times between precipitation and discharge up to several years.

Figure 1. A general example of a two-layer recurrent neural network unrolled over time. The outputs from the last recurrent layer (second layer in this example) and the last time step (x_n) are fed into a dense layer to calculate the final prediction (y).

To explain how the RNN and the LSTM work, we unfold the recurrence of the network into a directed acyclic graph (see Fig. 1). The output (in our case discharge) for a specific time step is predicted from the input x = [x_1, ..., x_n] consisting of the last n consecutive time steps of independent variables (in our case daily precipitation, min/max temperature, solar radiation and vapour pressure) and is processed sequentially. In each time step t (1 ≤ t ≤ n), the current input x_t is processed in the recurrent cells of each layer in the network.

The differences of the traditional RNN and the LSTM are the internal operations of the recurrent cell (encircled in Fig. 1) that are depicted in Fig. 2.

In a traditional RNN cell, only one internal state h_t exists (see Fig. 2a), which is recomputed in every time step by the following equation:

h_t = g(W x_t + U h_{t-1} + b),   (1)

where g(·) is the activation function (typically the hyperbolic tangent), W and U are the adjustable weight matrices of the hidden state h and the input x, and b is an adjustable bias vector. In the first time step, the hidden state is initialized as a vector of zeros and its length is a user-defined hyperparameter of the network.

In comparison, the LSTM has (i) an additional cell state or cell memory c_t in which information can be stored, and (ii) gates (three encircled letters in Fig. 2b) that control the information flow within the LSTM cell (Hochreiter and Schmidhuber, 1997). The first gate is the forget gate, introduced by Gers et al. (2000). It controls which elements of the cell state vector c_{t-1} will be forgotten (to which degree):

f_t = σ(W_f x_t + U_f h_{t-1} + b_f),   (2)

where f_t is a resulting vector with values in the range (0, 1), σ(·) represents the logistic sigmoid function and W_f, U_f and
b_f define the set of learnable parameters for the forget gate, i.e. two adjustable weight matrices and a bias vector. As for the traditional RNN, the hidden state h is initialized in the first time step by a vector of zeros with a user-defined length.

Figure 2. (a) The internal operation of a traditional RNN cell: h_t stands for hidden state and x_t for the input at time step t. (b) The internals of a LSTM cell, where f stands for the forget gate (Eq. 2), i for the input gate (Eqs. 3–4), and o for the output gate (Eqs. 6–7). c_t denotes the cell state at time step t and h_t the hidden state.

In the next step, a potential update vector for the cell state is computed from the current input (x_t) and the last hidden state (h_{t-1}) given by the following equation:

c̃_t = tanh(W_c̃ x_t + U_c̃ h_{t-1} + b_c̃),   (3)

where c̃_t is a vector with values in the range (−1, 1), tanh(·) is the hyperbolic tangent and W_c̃, U_c̃ and b_c̃ are another set of learnable parameters.

Additionally, the second gate is computed, the input gate, defining which (and to what degree) information of c̃_t is used to update the cell state in the current time step:

i_t = σ(W_i x_t + U_i h_{t-1} + b_i),   (4)

where i_t is a vector with values in the range (0, 1), and W_i, U_i and b_i are a set of learnable parameters, defined for the input gate.

With the results of Eqs. (2)–(4) the cell state c_t is updated by the following equation:

c_t = f_t ⊙ c_{t-1} + i_t ⊙ c̃_t,   (5)

where ⊙ denotes element-wise multiplication. Because the vectors f_t and i_t both have entries in the range (0, 1), Eq. (5) can be interpreted in the way that it defines which information stored in c_{t-1} will be forgotten (values of f_t of approx. 0) and which will be kept (values of f_t of approx. 1). Similarly, i_t decides which new information stored in c̃_t will be added to the cell state (values of i_t of approx. 1) and which will be ignored (values of i_t of approx. 0). Like the hidden state vector, the cell state is initialized by a vector of zeros in the first time step. Its length corresponds to the length of the hidden state vector.

The third and last gate is the output gate, which controls the information of the cell state c_t that flows into the new hidden state h_t. The output gate is calculated by the following equation:

o_t = σ(W_o x_t + U_o h_{t-1} + b_o),   (6)

where o_t is a vector with values in the range (0, 1), and W_o, U_o and b_o are a set of learnable parameters, defined for the output gate. From this vector, the new hidden state h_t is calculated by combining the results of Eqs. (5) and (6):

h_t = tanh(c_t) ⊙ o_t.   (7)

It is in particular the cell state (c_t) that allows for an effective learning of long-term dependencies. Due to its very simple linear interactions with the remaining LSTM cell, it can store information unchanged over a long period of time steps. During training, this characteristic helps to prevent the problem of the exploding or vanishing gradients in the back-propagation step (Hochreiter and Schmidhuber, 1997). As with other neural networks, where one layer can consist of multiple units (or neurons), the length of the cell and hidden state vectors in the LSTM can be chosen freely. Additionally, we can stack multiple layers on top of each other. The output from the last LSTM layer at the last time step (h_n) is connected through a traditional dense layer to a single output neuron, which computes the final discharge prediction (as shown schematically in Fig. 1). The calculation of the dense layer is given by the following equation:

y = W_d h_n + b_d,   (8)

where y is the final discharge, h_n is the output of the last LSTM layer at the last time step derived from Eq. (7), W_d is the weight matrix of the dense layer, and b_d is the bias term.

To conclude, Algorithm 1 shows the pseudocode of the entire LSTM layer. As indicated above and shown in Fig. 1, the input for the complete sequence of meteorological observations x = [x_1, ..., x_n], where x_t is a vector containing the meteorological inputs of time step t, is processed time step by time step and in each time step Eqs. (2)–(7) are repeated. In the case of multiple stacked LSTM layers, the next layer takes the output h = [h_1, ..., h_n] of the first layer as input. The final output, the discharge, is then calculated by Eq. (8), where h_n is the last output of the last LSTM layer.
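As a complement to Algorithm 1 below, Eqs. (2)–(8) can be written directly in NumPy. The sketch follows the notation of the paper; the random parameter initialization and the shapes at the bottom are purely illustrative and are not the trained weights of the study:

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_layer(x, params):
    """Process a sequence x of shape (n, m) with one LSTM layer of p hidden units."""
    W_f, U_f, b_f, W_c, U_c, b_c, W_i, U_i, b_i, W_o, U_o, b_o = params
    p = b_f.shape[0]
    h, c = np.zeros(p), np.zeros(p)                      # h_0 and c_0 (Algorithm 1, line 3)
    hs = []
    for x_t in x:
        f = sigmoid(W_f @ x_t + U_f @ h + b_f)           # forget gate, Eq. (2)
        c_tilde = np.tanh(W_c @ x_t + U_c @ h + b_c)     # candidate cell state, Eq. (3)
        i = sigmoid(W_i @ x_t + U_i @ h + b_i)           # input gate, Eq. (4)
        c = f * c + i * c_tilde                          # cell state update, Eq. (5)
        o = sigmoid(W_o @ x_t + U_o @ h + b_o)           # output gate, Eq. (6)
        h = np.tanh(c) * o                               # new hidden state, Eq. (7)
        hs.append(h)
    return np.stack(hs)                                  # h_1, ..., h_n

# Illustrative shapes: m = 5 meteorological inputs, p = 20 hidden units, n = 365 days.
rng = np.random.default_rng(0)
m, p, n = 5, 20, 365
params = [rng.normal(scale=0.1, size=s) for s in [(p, m), (p, p), (p,)] * 4]
x = rng.normal(size=(n, m))
h_seq = lstm_layer(x, params)
W_d, b_d = rng.normal(scale=0.1, size=(1, p)), np.zeros(1)
y = W_d @ h_seq[-1] + b_d                                # dense output layer, Eq. (8)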
F. Kratzert et al.: Rainfall–runoff modelling using LSTMs 6009

Algorithm 1 Pseudocode of the LSTM layer

1: Input: x = [x 1 , . . ., x n ], x t ∈ Rm
2: Given parameters: Wf , Uf , bf , We c , Ue c , Wi , Ui , bi ,
c , be
Wo , Uo , bo


3: Initialize h0 , c0 = 0 of length p
4: for t=1, ..., n do
5: Calculate f t (Eq. 2), e ct (Eq. 3), i t (Eq. 4)
6: Update cell state ct (Eq. 5)
7: Calculate ot (Eq. 6), ht (Eq. 7)
8: end for
9: Output: h = [h1 , . . ., hn ], ht ∈ Rp Figure 3. Illustration of one iteration step in the training process of
the LSTM. A random batch of input data x consisting of m inde-
pendent training samples (depicted by the colours) is used in each
step. Each training sample consists of n days of look-back data and
calibration period with a given set of model parameters and
one target value (yobs ) to predict. The loss is computed from the
evaluating the model performance with some objective crite- observed discharge and the network’s predictions ysim and is used
ria. The model parameters are, regardless of the applied opti- to update the network parameters.
mization technique (global and/or local), perturbed in such a
way that the maximum (or minimum) of an objective criteria
is found. Regarding the training of a LSTM, the adaptable gorithm without a convergence criterion). The correspond-
(or learnable) parameters of the network, the weights and ing term for neural networks is called epoch. One epoch is
biases, are also updated depending on a given loss function defined as the period in which each training sample is used
of an iteration step. In this study we used the mean-squared once for updating the model parameters. For example, if the
error (MSE) as an objective criterion. data set consists of 1000 training samples and the batch size
In contrast to most hydrological models, the neural net- is 10, one epoch would consist of 100 iteration steps (num-
work exhibits the property of differentiability of the net- ber of training samples divided by the number of samples
work equations. Therefore, the gradient of the loss func- per batch). In each iteration step, 10 of the 1000 samples are
tion with respect to any network parameter can always be taken without replacement until all 1000 samples are used
calculated explicitly. This property is used in the so-called once. In our case this means, each time step of the discharge
back-propagation step in which the network parameters are time series in the training data is simulated exactly once.
adapted to minimize the overall loss. For a detailed descrip- This is somewhat similar to one iteration in the calibration
tion see e.g. Goodfellow et al. (2016). of a classical hydrological model, with the significant differ-
A schematic illustration of one iteration step in the LSTM ence however that every sample is generated independently
training/calibration is is provided in Fig. 3. One iteration step of each other. Figure 4 shows the learning process of the
during the training of LSTMs usually works with a subset LSTM over a number of training epochs. We can see that the
(called batch or mini-batch) of the available training data. network has to learn the entire rainfall–runoff relation from
The number of samples per batch is a hyperparameter, which scratch (grey line of random weights) and is able to better
in our case was defined to be 512. Each of these samples represent the discharge dynamics with each epoch.
consists of one discharge value of a given day and the me- For efficient learning, all input features (the meteorolog-
teorological input of the n preceding days. In every iteration ical variables) as well as the output (the discharge) data are
step, the loss function is calculated as the average of the MSE normalized by subtracting the mean and dividing by the stan-
of simulated and observed runoff of these 512 samples. Since dard deviation (LeCun et al., 2012; Minns and Hall, 1996).
the discharge of a specific time step is only a function of the The mean and standard deviation used for the normalization
meteorological inputs of the last n days, the samples within a are calculated from the calibration period only. To receive
batch can consist of random time steps (depicted in Fig. 3 by the final discharge prediction, the output of the network is re-
the different colours), which must not necessarily be ordered transformed using the normalization parameters from the cal-
chronologically. For faster convergence, it is even advanta- ibration period (Fig. 4 shows the retransformed model out-
geous to have random samples in one batch (LeCun et al., puts).
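The sample generation and normalization described above can be sketched as follows. All array names, shapes and the random data are illustrative only and stand in for one basin's meteorological forcings and discharge:

import numpy as np

def make_samples(meteo, discharge, n=365):
    """Build training samples: n days of meteorological look-back data per discharge value."""
    x, y = [], []
    for t in range(n - 1, len(discharge)):
        x.append(meteo[t - n + 1:t + 1])   # the n days of meteorological data up to day t
        y.append(discharge[t])             # one target value per sample
    return np.asarray(x), np.asarray(y)

# Illustrative data: 15 years of daily values, 5 meteorological variables.
rng = np.random.default_rng(0)
meteo = rng.random((15 * 365, 5))
discharge = rng.random(15 * 365)

# Normalization statistics are computed on the calibration period only and are later
# also used to retransform the network output into discharge.
meteo_mean, meteo_std = meteo.mean(axis=0), meteo.std(axis=0)
q_mean, q_std = discharge.mean(), discharge.std()

x, y = make_samples((meteo - meteo_mean) / meteo_std, (discharge - q_mean) / q_std)

# One training iteration uses a random batch of 512 samples drawn without replacement.
batch_idx = rng.choice(len(x), size=512, replace=False)
x_batch, y_batch = x[batch_idx], y[batch_idx]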
2.3 Open-source software

Our research heavily relies on open source software. The programming language of choice is Python 3.6 (van Rossum, 1995). The libraries we use for preprocessing our data and for data management in general are Numpy (Van Der Walt et al., 2011), Pandas (McKinney, 2010) and Scikit-Learn (Pedregosa et al., 2011). The Deep-Learning frameworks we use are TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015). All figures are made using Matplotlib (Hunter, 2007).
6010 F. Kratzert et al.: Rainfall–runoff modelling using LSTMs

Evolution of hydrograph over the numbers of training epochs


Observed discharge
15 Random weights
Discharge (mm d )

Epoch 1
-1

Epoch 2
Epoch 3
10 Epoch 5
Epoch 10
Epoch 20
Epoch 50
5

3 4 84 4 5 85 5 6 86 6
98 98 19 98 98 19 98 98 19 98
t1 b1 t1 b1 t1 b1 t1
Oc Fe Jun Oc Fe Jun Oc Fe Jun Oc
Date

Figure 4. Improvement of the runoff simulation during the learning process of the LSTM. Visualized are the observed discharge and LSTM
output after various epochs for the basin 13337000 of the CAMELs data set from 1 October 1983 until 30 September 1986. Random weights
represent randomly initialized weights of a LSTM before the first iteration step in the training process.

Scikit-Learn (Pedregosa et al., 2011). The deep-learning frameworks we use are TensorFlow (Abadi et al., 2016) and Keras (Chollet, 2015). All figures are made using Matplotlib (Hunter, 2007).

2.4 The CAMELS data set

The underlying data for our study is the CAMELS data set (Addor et al., 2017b; Newman et al., 2014). The acronym stands for "Catchment Attributes for Large-Sample Studies" and it is a freely available data set of 671 catchments with minimal human disturbances across the contiguous United States (CONUS). The data set contains catchment-aggregated (lumped) meteorological forcing data and observed discharge at the daily timescale, starting (for most catchments) from 1980. The meteorological data are calculated from three different gridded data sources (Daymet, Thornton et al., 2012; Maurer, Maurer et al., 2002; and NLDAS, Xia et al., 2012) and consist of day length, precipitation, shortwave downward radiation, maximum and minimum temperature, snow-water equivalent and humidity. We used the Daymet data, since it has the highest spatial resolution (1 km grid compared to a 12 km grid for Maurer and NLDAS), as a basis for calculating the catchment averages, and all available meteorological input variables with the exception of the snow-water equivalent and the day length.

The 671 catchments in the data set are grouped into 18 hydrological units (HUCs) following the U.S. Geological Survey's HUC map (Seaber et al., 1987). These groups correspond to geographic areas that represent the drainage area of either a major river or the combined drainage area of a series of rivers.

In our study, we used 4 out of the 18 hydrological units with their 241 catchments (see Fig. 5 and Table 1) in order to cover a wide range of different hydrological conditions on the one hand and to limit the computational costs on the other hand. The New England region in the north-east contains 27 more or less homogeneous basins (e.g. in terms of snow influence or aridity). The Arkansas-White-Red region in the center of CONUS has a comparable number of basins, namely 32, but is completely different otherwise. Within this region, attributes such as aridity and mean annual precipitation have a high variance and a strong gradient from east to west (see Fig. 5). Also comparable in size but with disparate hydro-climatic conditions are the South Atlantic-Gulf region (92 basins) and the Pacific Northwest region (91 basins). The latter spans from the Pacific coast to the Rocky Mountains and also exhibits a high variance of attributes across the basins, comparable to the Arkansas-White-Red region. For example, there are very humid catchments with more than 3000 mm yr⁻¹ precipitation close to the Pacific coast and very arid basins (aridity index 2.17, mean annual precipitation 500 mm yr⁻¹) in the south-east of this region. The relatively flat South Atlantic-Gulf region contains more homogeneous basins, but in contrast to the New England region it is not influenced by snow.

Additionally, the CAMELS data set contains time series of simulated discharge from the calibrated Snow-17 models coupled with the Sacramento Soil Moisture Accounting Model. Roughly 35 years of meteorological observations and streamflow records are available for most basins. The first 15 hydrological years with streamflow data (in most cases 1 October 1980 until 30 September 1995) are used for calibrating the model, while the remaining data are used for validation. For each basin, 10 models were calibrated, starting with different random seeds, using the shuffled complex evolution algorithm by Duan et al. (1993) and the root mean squared error (RMSE) as objective function. Of these 10 models, the one with the lowest RMSE in the calibration period is used for validation. For further details see Newman et al. (2015).
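As a rough illustration of how the lumped forcings and the observed discharge of one basin are combined into model samples, the following is a minimal sketch of the windowing step; the column names and the randomly generated data are illustrative placeholders and do not reproduce the actual CAMELS file layout.

    import numpy as np
    import pandas as pd

    # Stand-in for one basin's daily, catchment-averaged forcings and observed discharge.
    # In the real setup these columns come from the CAMELS/Daymet files; random numbers
    # are used here only so that the windowing logic below is runnable on its own.
    rng = np.random.default_rng(0)
    dates = pd.date_range("1980-10-01", periods=4000, freq="D")
    df = pd.DataFrame(
        rng.random((len(dates), 6)),
        index=dates,
        columns=["prcp", "srad", "tmax", "tmin", "vp", "discharge"],
    )

    forcing_cols = ["prcp", "srad", "tmax", "tmin", "vp"]  # the 5 meteorological inputs
    seq_length = 365                                       # days of forcing per sample

    x_raw = df[forcing_cols].to_numpy(dtype=np.float32)
    y_raw = df["discharge"].to_numpy(dtype=np.float32)

    # One sample per day: the preceding 365 days of forcings predict that day's discharge.
    n_samples = len(df) - seq_length + 1
    x = np.stack([x_raw[i:i + seq_length] for i in range(n_samples)])  # (n_samples, 365, 5)
    y = y_raw[seq_length - 1:]                                         # (n_samples,)
    print(x.shape, y.shape)  # (3636, 365, 5) (3636,)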


Table 1. Overview of the HUCs considered in this study and some region statistics averaged over all basins in that region. For each variable the mean and standard deviation are reported.

HUC  Region name          No. of basins  Mean precipitation (mm day⁻¹)  Mean aridity¹ (–)  Mean altitude (m)  Mean snow frac.² (–)  Mean seasonality³ (–)
01   New England          27             3.61 ± 0.26                    0.60 ± 0.03        316 ± 182          0.24 ± 0.06           0.10 ± 0.08
03   South Atlantic-Gulf  92             3.79 ± 0.49                    0.87 ± 0.14        189 ± 179          0.02 ± 0.02           0.12 ± 0.26
11   Arkansas-White-Red   31             2.86 ± 0.89                    1.18 ± 0.50        613 ± 713          0.08 ± 0.13           0.25 ± 0.29
17   Pacific Northwest    91             5.22 ± 2.03                    0.59 ± 0.40        1077 ± 589         0.33 ± 0.23           −0.72 ± 0.17

¹ PET/P; see Addor et al. (2017a). ² Fraction of precipitation falling on days with temperatures below 0 °C. ³ Positive values indicate that precipitation peaks in summer, negative values that precipitation peaks in the winter months, and values close to 0 that the precipitation is uniform throughout the year (see Addor et al., 2017a).

Table 2. Shapes of the learnable parameters of all layers.

Layer           Parameters                Shape
1st LSTM layer  W_f, W_c̃, W_i, W_o       [20, 5]
                U_f, U_c̃, U_i, U_o       [20, 20]
                b_f, b_c̃, b_i, b_o       [20]
2nd LSTM layer  W_f, W_c̃, W_i, W_o       [20, 20]
                U_f, U_c̃, U_i, U_o       [20, 20]
                b_f, b_c̃, b_i, b_o       [20]
Dense layer     W_d                       [20, 1]
                b_d                       [1]

Figure 5. Overview of the location of the four hydrological units from the CAMELS data set used in this study, including all their basins. Panel (a) shows the mean annual precipitation of each basin, whereas the type of marker symbolizes the snow influence of the basin. Panel (b) shows the aridity index of each basin, calculated as PET/P (see Addor et al., 2017a).

2.5 Experimental design

Throughout all of our experiments, we used a two-layer LSTM network, with each layer having a cell/hidden state length of 20. Table 2 shows the resulting shapes of all model parameters from Eqs. (2) to (8). Between the layers, we added dropout, a technique to prevent the model from overfitting (Srivastava et al., 2014). Dropout sets a certain percentage (10 % in our case) of random neurons to zero during training in order to force the network into more robust feature learning. Another hyperparameter is the length of the input sequence, which corresponds to the number of days of meteorological input data provided to the network for the prediction of the next discharge value. We kept this value constant at 365 days for this study in order to capture at least the dynamics of a full annual cycle.

The specific design of the network architecture, i.e. the number of layers, the cell/hidden state length, the dropout rate and the input sequence length, was found through a number of experiments in several seasonally influenced catchments in Austria. In these experiments, different architectures (e.g. one or two LSTM layers, or 5, 10, 15, or 20 cell/hidden units) were varied manually. The architecture used in this study proved to work well for these catchments (in comparison to a calibrated hydrological model we had available from previous studies; Herrnegger et al., 2018) and was therefore applied here without further tuning. A systematic sensitivity analysis of the effects of different hyper-parameters was, however, not performed and remains a task for future work.
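For illustration, the following is a minimal sketch of a network with the dimensions described above, written with the Keras API mentioned in Sect. 2.3; the optimizer, loss and batching shown here are assumptions for the example and not necessarily identical to the exact training configuration of the study.

    import numpy as np
    from tensorflow.keras import Sequential
    from tensorflow.keras.layers import LSTM, Dropout, Dense

    # Two stacked LSTM layers with 20 cell/hidden units each, 10 % dropout in between,
    # and a dense output layer mapping the last hidden state to one discharge value.
    model = Sequential([
        LSTM(20, return_sequences=True, input_shape=(365, 5)),  # 365 days x 5 forcings
        Dropout(0.1),
        LSTM(20),
        Dense(1),
    ])
    model.compile(optimizer="adam", loss="mse")  # assumed settings for this sketch
    model.summary()

    # Dummy data with the right shapes, only to show the expected input/output format.
    x = np.random.rand(32, 365, 5).astype("float32")
    y = np.random.rand(32, 1).astype("float32")
    model.fit(x, y, epochs=1, batch_size=32, verbose=0)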
We want to mention here that our calibration scheme (see the description of the three experiments below) is not the standard way of calibrating and selecting data-driven models, especially neural networks. As of today, a widespread calibration strategy for DL models is to subdivide the data into three parts, referred to as training, validation and test data (see Goodfellow et al., 2016). The first two splits are used to derive the parametrization of the networks and the remainder of the data to diagnose the actual performance. We decided not to implement this splitting strategy, because we are limited to the periods Newman et al. (2015) used so that our models


are comparable with their results. Theoretically, it would be possible to split the 15-year calibration period of Newman et al. (2015) further into a training and a validation set. However, this would lead to (a) a much shorter period of data being used for the actual weight updates or (b) a high risk of overfitting to the short validation period, depending on how this 15-year period is divided. In addition, LSTMs with a low number of hidden units are quite sensitive to the initialization of their weights. It is thus common practice to repeat the calibration task several times with different random seeds and to select the best performing realization of the model (Bengio, 2012). For the present purpose we decided not to implement these strategies, since it would make it more difficult or even impossible to compare the LSTM approach to the SAC-SMA + Snow-17 reference model. The goal of this study is therefore not to find the best per-catchment model, but rather to investigate the general potential of LSTMs for the task of rainfall–runoff modelling. However, we think that the sample size of 241 catchments is large enough to infer some of the (average) properties of the LSTM-based approach.

2.5.1 Experiment 1: one model for each catchment

With the first experiment, we test the general ability of our LSTM network to model rainfall–runoff processes. Here, we train one network separately for each of the 241 catchments. To avoid overfitting of the network on the training data, we identified the number of epochs (for a definition of an epoch, see Sect. 2.2) in a preliminary step, which yielded, on average, the highest Nash–Sutcliffe efficiency (NSE) across all basins for an independent validation period. For this preliminary experiment, we used the first 14 years of the 15-year calibration period as training data and the last, fifteenth, year as the independent validation period. With the 14 years of data, we trained a model for in total 200 epochs for each catchment and evaluated each model after each epoch with the validation data. Across all catchments, the highest mean NSE was achieved after 50 epochs in this preliminary experiment. Thus, for the final training of the LSTM with the full 15 years of the calibration period as training data, we use the resulting number of 50 epochs for all catchments. Experiment 1 yields 241 separately trained networks, one for each of the 241 catchments.
2.5.2 Experiment 2: one regional model for each hydrological unit

Our second experiment is motivated by two different ideas: (i) deep learning models really excel when many training data are available (Hestness et al., 2017; Schmidhuber, 2015), and (ii) regional models are a potential solution for prediction in ungauged basins.

Regarding the first motivation, having a huge training data set allows the network to learn more general and abstract patterns of the input-to-output relationship. As for all data-driven approaches, the network has to learn the entire "hydrological model" purely from the available data (see Fig. 4). Therefore, having more than just the data of a single catchment available would help to obtain a more general understanding of the rainfall–runoff processes. An illustrative example are two similarly behaving catchments of which one lacks high precipitation events or extended drought periods in the calibration period, while having these events in the validation period. Given that the second catchment experienced these conditions in the calibration set, the LSTM could learn the response behaviour to those extremes and use this knowledge in the first catchment. Classical hydrological models have the process understanding implemented in the model structure itself, and therefore – at least in theory – it is not strictly necessary to have these kinds of events in the calibration period.

The second motivation is the prediction of runoff in ungauged basins, one of the main challenges in the field of hydrology (Blöschl et al., 2013; Sivapalan, 2003). A regional model that performs reasonably well across all catchments within a region could potentially be a step towards the prediction of runoff for such basins.

Therefore, the aim of the second experiment is to analyse how well the network architecture can generalize (or regionalize) to all catchments within a certain region. We use the HUCs that are used for grouping the catchments in the CAMELS data set for the definition of the regions (four in this case). The training data for these regional models are the combined data of the calibration period of all catchments within the same HUC.

To determine the number of training epochs, we performed the same preliminary experiment as described for Experiment 1. Across all catchments, the highest mean NSE was achieved after 20 epochs in this case. Although the number of epochs is smaller compared to Experiment 1, the number of weight updates is much larger. This is because the number of available training samples has increased and the same batch size as in Experiment 1 is used (see Sect. 2.2 for an explanation of the connection between the number of iterations, the number of training samples and the number of epochs). Thus, for the final training, we train one LSTM for each of the four used HUCs for 20 epochs with the entire 15-year long calibration period.

2.5.3 Experiment 3: fine-tuning the regional model for each catchment

In the third experiment, we want to test whether the more general knowledge of the regional model (Experiment 2) can help to increase the performance of the LSTM in a single catchment. In the field of DL this is a common approach called fine-tuning (Razavian et al., 2014; Yosinski et al., 2014), where a model is first trained on a huge data set to learn general patterns and relationships between (meteorological) input data and (streamflow) output data (this is referred to as pre-training).


Then, the pre-trained network is further trained for a small number of epochs with the data of a specific catchment alone, to adapt the more generally learned processes to that catchment. Loosely speaking, the LSTM first learns the general behaviour of the runoff-generating processes from a large data set and is in a second step adapted in order to account for the specific behaviour of a given catchment (e.g. the scaling of the runoff response in a specific catchment).

In this study, the regional models of Experiment 2 serve as pre-trained models. Therefore, depending on the affiliation of a catchment to a certain HUC, the specific regional model of this HUC is taken as the starting point for the fine-tuning. With the initial LSTM weights from the regional model, the training is continued only with the training data of a specific catchment for a few epochs (ranging from 0 to 20, median 10). Thus, similar to Experiment 1, we finally have 241 different models, one for each of the 241 catchments. Different from the two previous experiments, we do not use a global number of epochs for fine-tuning. Instead, we used the 14-year/1-year split to determine the optimal number of epochs for each catchment individually. The reason is that the regional model fits the individual catchments within a HUC differently well. Therefore, the number of epochs the LSTM needs to adapt to a certain catchment before it starts to overfit differs between catchments.
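In Keras terms, this fine-tuning step amounts to reusing the weights of the regional model and continuing gradient descent on the data of one basin; the sketch below illustrates the idea with hypothetical variable names (regional_model, x_basin, y_basin) and an assumed learning configuration, not the exact settings of the study.

    import numpy as np
    import tensorflow as tf

    # Stand-in for the regional model of the basin's HUC (in the study: the pre-trained
    # Experiment-2 model); here it is freshly built so that the snippet runs on its own.
    regional_model = tf.keras.Sequential([
        tf.keras.layers.LSTM(20, return_sequences=True, input_shape=(365, 5)),
        tf.keras.layers.Dropout(0.1),
        tf.keras.layers.LSTM(20),
        tf.keras.layers.Dense(1),
    ])

    # Fine-tuning: start from the regional weights and continue training on the
    # calibration data of a single catchment for a small number of epochs.
    finetuned_model = tf.keras.models.clone_model(regional_model)
    finetuned_model.set_weights(regional_model.get_weights())
    finetuned_model.compile(optimizer="adam", loss="mse")

    x_basin = np.random.rand(200, 365, 5).astype("float32")  # single-basin samples (dummy)
    y_basin = np.random.rand(200, 1).astype("float32")
    finetuned_model.fit(x_basin, y_basin, epochs=10, batch_size=64, verbose=0)  # e.g. the median of 10 epochs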
2.6 Evaluation metrics

The metrics for model evaluation are the Nash–Sutcliffe efficiency (Nash and Sutcliffe, 1970) and the three decompositions following Gupta et al. (2009). These are the correlation coefficient of the observed and simulated discharge (r), the variance bias (α) and the total volume bias (β). While all of these measures evaluate the performance over the entire time series, we also use three different signatures of the flow duration curve (FDC) that evaluate the performance for specific ranges of discharge. Following Yilmaz et al. (2008), we calculate the bias of the top 2 % flows (the peak flows, FHV), the bias of the slope of the middle section of the FDC (FMS) and the bias of the bottom 30 % low flows (FLV).

Because our modelling approach needs 365 days of meteorological data as input for predicting one time step of discharge, we cannot simulate the first year of the calibration period. To be able to compare our models to the SAC-SMA + Snow-17 benchmark model, we recomputed all metrics for the benchmark model for the same simulation periods.
graph simulated by the RNN has a lot more variance com-
pared to the smooth line of the LSTM. (ii) The RNN under-
3 Results and discussion estimates the discharge during the melting season and early
summer, which is strongly driven snowmelt and by the pre-
We start presenting our results by showing an illustra- cipitation that has fallen through the winter months. (iii) In
tive comparison of the modelling capabilities of traditional the winter period, the RNN systematically overestimates ob-
RNNs and the LSTM to highlight the problems of RNNs to served discharge, since snow accumulation is not accounted
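The claim that 41 simple-RNN units roughly match the parameter count of the 20-unit LSTM can be checked directly from the shapes in Table 2; the short sketch below performs this count with Keras (both totals come out at roughly 5.4 k parameters, which is a property of the architecture sizes, not a result reported in the text).

    import tensorflow as tf

    def count_params(recurrent_layer_cls, units):
        # Two stacked recurrent layers followed by a single-output dense layer,
        # with 5 input features per time step (as in Table 2).
        model = tf.keras.Sequential([
            recurrent_layer_cls(units, return_sequences=True, input_shape=(365, 5)),
            recurrent_layer_cls(units),
            tf.keras.layers.Dense(1),
        ])
        return model.count_params()

    print("LSTM, 20 units:      ", count_params(tf.keras.layers.LSTM, 20))       # 4*(5*20+20*20+20) + 4*(20*20+20*20+20) + 21
    print("Simple RNN, 41 units:", count_params(tf.keras.layers.SimpleRNN, 41))  # (5*41+41*41+41) + (41*41+41*41+41) + 42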


Figure 6. (a) Two years of observed as well as simulated discharge of the LSTM and the RNN from the validation period of basin 13340600. The precipitation is plotted from top to bottom and days with minimum temperature below zero are marked as snow (black bars). (b) The corresponding daily maximum and minimum temperature.

Figure 6a shows 2 years of the validation period of observed discharge as well as the simulations by the LSTM and the RNN. We would like to highlight three points. (i) The hydrograph simulated by the RNN has a lot more variance compared to the smooth line of the LSTM. (ii) The RNN underestimates the discharge during the melting season and early summer, which is strongly driven by snowmelt and by the precipitation that has fallen through the winter months. (iii) In the winter period, the RNN systematically overestimates the observed discharge, since snow accumulation is not accounted for. These simulation deficits can be explained by the inability of the RNN to learn and store long-term dependencies, whereby especially the last two points are interesting and connected. Recall that the RNN is trained to minimize the average RMSE between observation and simulation. The RNN is not able to store the amount of water which has fallen as snow during the winter and is, in consequence, also not able to generate sufficient discharge during the time of snowmelt. The RNN, minimizing the average RMSE, therefore overestimates the discharge during most times of the year by a constant bias and underestimates the peak flows, thus being closer to predicting the mean flow. Only for a short period at the end of the summer is it close to predicting the low flow correctly.

In contrast, the LSTM seems to have (i) no or fewer problems with predicting the correct amount of discharge during the snowmelt season, and (ii) the predicted hydrograph is much smoother and fits the general trends of the hydrograph much better. Note that both networks are trained with the exact same data and have the same data available for predicting a single day of discharge.

Here we have only shown a single example for a snow-influenced basin. We also compared the modelling behaviour in one of the arid catchments of the Arkansas-White-Red region, and found that the trends and conclusions were similar. Although only based on a single illustrative example that shows the problems of RNNs with long-term dependencies, we can conclude that traditional RNNs should not be used if (e.g. daily) discharge is predicted only from meteorological observations.

3.2 Using LSTMs as hydrological models

Figure 7a shows the spatial distribution of the LSTM performances for Experiment 1 in the validation period. In over 50 % of the catchments, an NSE of 0.65 or above is found, with a mean NSE of 0.63 over all catchments. We can see that the LSTM performs better in catchments with snow influence (the New England and Pacific Northwest regions) and in catchments with higher mean annual precipitation (also the New England and Pacific Northwest regions, but also basins in the western part of the Arkansas-White-Red region; see Fig. 5a for the precipitation distribution). The performance deteriorates in the more arid catchments, which are located in the western part of the Arkansas-White-Red region, where no discharge is observed for longer periods of the year (see Fig. 5b). Having a constant discharge value (zero in this case) for a high percentage of the training samples seems to be difficult information for the LSTM to learn and reproduce. However, if we compare the results for these basins to the benchmark model (Fig. 7b), we see that for most of these dry catchments the LSTM outperforms the latter, meaning that the benchmark model did not yield satisfactory results for these catchments either. In general, the visualization of the differences in the NSE shows that the LSTM performs slightly better in the northern, more snow-influenced catchments, while the SAC-SMA + Snow-17 performs better in the catchments in the south-east. This clearly shows the benefit of using LSTMs, since the snow accumulation and snowmelt processes are correctly reproduced, despite their inherent complexity. Our results suggest that the model learns these long-term dependencies, i.e. the time lag between precipitation falling as snow during the winter period and runoff generation in spring with warmer temperatures. The median value of the NSE differences is −0.03, which means that the benchmark model slightly outperforms the LSTM. Based on the mean NSE value (0.58 for the benchmark model, compared to 0.63 for the LSTM of this experiment), the LSTM outperforms the benchmark results.

In Fig. 8, we present the cumulative density functions (CDF) of various metrics for the calibration and validation period. We see that the LSTM and the benchmark model work comparably well for all but the FLV (bias of the bottom 30 % low flows) metric. The underestimation of the peak flows in both models could be expected when using the MSE as the objective function for calibration (Gupta et al., 2009). However, the LSTM underestimates the peaks more strongly compared to the benchmark model (Fig. 8d). In contrast, the middle section of the FDC is better represented in the LSTM (Fig. 8e). Regarding the performance in terms of the NSE, the LSTM shows fewer negative outliers and thus seems to be more robust.


Figure 7. Panel (a) shows the NSE of the validation period of the models from Experiment 1 and panel (b) the difference in the NSE between the LSTM and the benchmark model (blue colours (> 0) indicate that the LSTM performs better than the benchmark model, red (< 0) the other way around). The colour maps are limited to [0, 1] for the NSE and [−0.4, 0.4] for the NSE differences for better visualization.

Figure 8. Cumulative density functions of various metrics for the calibration and validation period of Experiment 1 compared to the benchmark model (panels a–c: NSE, flow variance α and flow volume β; panels d–f: FHV, FMS and FLV). FHV is the bias of the top 2 % flows (the peak flows), FMS is the bias of the slope of the middle section of the flow duration curve and FLV is the bias of the bottom 30 % low flows.

The poorest model performance in the validation period is an NSE of −0.42, compared to −20.68 for the SAC-SMA + Snow-17. Figure 8f shows large differences between the LSTM and the SAC-SMA + Snow-17 model regarding the FLV metric. The FLV is highly sensitive to the single minimum flow in the time series, since it compares the area between the FDC and this minimum value in the log space of the observed and simulated discharge. The discharge from the LSTM model, which has no exponential outflow function like traditional hydrological models, can easily drop to diminutive numbers or even zero, to which we limited our model output. A rather simple solution for this issue is to introduce just one additional parameter and to limit the simulated discharge not to zero, but to the minimum observed flow from the calibration period. Figure 9 shows the effect of this approach on the CDF of the FLV. We can see that this simple solution leads to better FLV values compared to the benchmark model. Other metrics, such as the NSE, are almost unaffected by this change, since these low-flow values only marginally influence the resulting NSE values (not shown here).
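This post-processing step is a one-line operation on the simulated series; a minimal sketch (variable names are illustrative):

    import numpy as np

    q_sim = np.array([0.0, 0.02, 1.5, 3.2, 0.0])   # simulated discharge (dummy values)
    q_obs_calibration = np.array([0.4, 0.9, 2.1])  # observed discharge in the calibration period

    # Instead of clipping the simulation at zero, clip it at the minimum observed
    # flow of the calibration period (the "one additional parameter").
    q_min = q_obs_calibration.min()
    q_sim_adjusted = np.maximum(q_sim, q_min)
    print(q_sim_adjusted)  # [0.4 0.4 1.5 3.2 0.4]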
Figure 9. The effect of limiting the discharge prediction of the network not to zero (blue lines) but instead to the minimum observed discharge of the calibration period (green lines) on the FLV. The benchmark model (orange lines) is shown for comparison.

From the CDF of the NSE in Fig. 8a, we can also observe a trend towards higher values in the calibration compared to the validation period for both modelling approaches. This is a sign of overfitting, which, in the case of the LSTM, could be tackled by a smaller network size, stronger regularization or more data. However, we want to highlight again that achieving the best possible model performance was not the aim of this study, but rather testing the general ability of the LSTM to reproduce runoff processes.


Figure 10. Difference of the regional model compared to the models from Experiment 1 for each basin regarding the NSE of the validation period. Blue colours (> 0) mean the regional model performed better than the models from Experiment 1, red (< 0) the other way around.

3.3 LSTMs as regional hydrological models

We now analyse the results of the four regional models that we trained for the four investigated HUCs in Experiment 2. Figure 10 shows the difference in the NSE between the model outputs from Experiments 1 and 2. For some basins, the regional models perform significantly worse (dark red) than the individually trained models from Experiment 1. However, from the histograms of the differences we can see that the median is almost zero, meaning that in 50 % of the basins the regional model performs better than the model specifically trained for a single basin. Especially in the New England region the regional model performed better for almost all basins (except for two in the far north-east). In general, over all HUCs and catchments, the median difference is −0.001.

Figure 11. Cumulative density functions of several metrics for the calibration and validation period of the models from Experiment 1 compared to the regional models from Experiment 2 (panels a–c: NSE, flow variance α and flow volume β; panels d–f: FHV, FMS and FLV). FHV is the bias of the top 2 % flows (the peak flows), FMS is the bias of the slope of the middle section of the flow duration curve and FLV is the bias of the bottom 30 % low flows.

From Fig. 11 it is evident that the increased data size of the regional modelling approach (Experiment 2) helps to attenuate the drop in model performance between the calibration and validation periods, which could be observed in Experiment 1, probably as a result of overfitting. From the CDF of the NSE (Fig. 11a) we can see that Experiment 2 performed worse for approximately 20 % of the basins, while being comparable or even slightly better for the remaining watersheds. We can also observe that the regional models show a more balanced under- and over-estimation, while the models from Experiment 1 as well as the benchmark model tend to underestimate the discharge (see Fig. 11d–f, e.g. the flow variance, the top 2 % flow bias or the bias of the middle flows). This is not too surprising, since we train one model on a range of different basins with different discharge characteristics, where the model minimizes the error between simulated and observed discharge for all basins at the same time. On average, the regional model will therefore equally over- and under-estimate the observed discharge.

The comparison of the performances of Experiments 1 and 2 shows no clear consistent pattern for the investigated HUCs, but reveals a trend toward higher NSE values in the New England region and toward lower NSE values in the Arkansas-White-Red region. The reason for these differences might become clearer once we look at the correlation of the observed discharge time series of the basins within both HUCs (see Fig. 12). We can see that in the New England region (where the regional model performed better for most of the catchments compared to the individual models of Experiment 1) many basins have a strong correlation in their discharge time series. Conversely, for the Arkansas-White-Red region the overall image of the correlation plot is much different.


Figure 12. Correlation matrices of the observed discharge of all basins in (a) the New England region and (b) the Arkansas-White-Red region. The basins in both subplots are ordered by longitude from east to west.

While some basins in the eastern part of the HUC show correlated discharge, the basins in the western, more arid part have no inter-correlation at all. The results suggest that a single, regionally calibrated LSTM could generally be better at predicting the discharge of a group of basins than many LSTMs trained separately for each of the basins within the group, especially when the group's basins exhibit a strong correlation in their discharge behaviour.
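Such a correlation matrix is straightforward to derive from the observed discharge series; a minimal sketch, assuming the discharge of all basins of one HUC is available as columns of a pandas DataFrame ordered by gauge longitude (all names and the random data are illustrative):

    import numpy as np
    import pandas as pd

    # Dummy stand-in: daily discharge of a few basins as DataFrame columns, ordered
    # from east to west (in the study these are the observed CAMELS discharge series).
    rng = np.random.default_rng(1)
    dates = pd.date_range("1980-10-01", periods=3650, freq="D")
    discharge = pd.DataFrame(
        rng.random((len(dates), 5)),
        index=dates,
        columns=["basin_east", "basin_2", "basin_3", "basin_4", "basin_west"],
    )

    # Pairwise Pearson correlation of the discharge time series (as visualized in Fig. 12).
    corr_matrix = discharge.corr()
    print(corr_matrix.round(2))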
3.4 The effect of fine-tuning

In this section, we analyse the effect of fine-tuning the regional model for a small number of epochs on a specific catchment.

Figure 13. Panel (a) shows the difference in the NSE in the validation period of Experiment 3 compared to the models of Experiment 1 and panel (b) in comparison to the models of Experiment 2. Blue colours (> 0) indicate in both cases that the fine-tuned models of Experiment 3 perform better and red colours (< 0) the opposite. The NSE differences are capped at [−0.4, 0.4] for better visualization.

Figure 13 shows two effects of the fine-tuning process. In the comparison with the model performance of Experiment 1, and from the histogram of the differences (Fig. 13a), we see that in general the pre-training and fine-tuning improve the NSE of the runoff prediction. Comparing the results of Experiment 3 to the regional models of Experiment 2 (Fig. 13b), we can see the biggest improvement in those basins in which the regional models performed poorly (see also Fig. 10). It is worth highlighting that, even though the models in Experiment 3 have seen the data of their specific basins for fewer epochs in total than in Experiment 1, they still perform better on average. Therefore, it seems that pre-training with a bigger data set before fine-tuning for a specific catchment helps the model to learn general rainfall–runoff processes and that this knowledge is transferable to single basins. It is also worth noting that the group of catchments we used as one region (the HUC) can be quite inhomogeneous regarding their hydrological catchment properties.

Figure 14 finally shows that the models of Experiment 3 and the benchmark model perform comparably well over all catchments. The median of the NSE for the validation period is almost the same (0.72 and 0.71 for Experiment 3 and the benchmark model, respectively), while the mean for the models of Experiment 3 is about 15 % higher (0.68 compared to 0.58). In addition, more basins have an NSE above a threshold of 0.8 (27.4 % of all basins compared to 17.4 % for the benchmark model), which is often taken as a threshold value for reasonably well-performing models (Newman et al., 2015).

3.5 A hydrological interpretation of the LSTM

To round off the discussion of this manuscript, we want to come back to the LSTM and explain it again in comparison to the functioning of a classical hydrological model. Similar to continuous hydrological models, the LSTM processes the input data time step after time step. In every time step, the input data (here the meteorological forcing data) are used to update a number of values in the LSTM internal cell states. In comparison to traditional hydrological models, the cell states can be interpreted as storages that are often used for e.g. snow accumulation, soil water content, or groundwater storage. Updating the internal cell states (or storages) is regulated through a number of so-called gates: one that regulates the depletion of the storages, a second that regulates the increase of the storages and a third that regulates the outflow of the storages. Each of these gates comes with a set of adjustable parameters that are adapted during a calibration period (referred to as training). During the validation period, the updates of the cell states depend only on the input at a specific time step and the states of the last time step (given the learned parameters of the calibration period).


Figure 14. Boxplot of the NSE of the validation period for our three experiments and the benchmark model. The NSE is capped to −1 for better visualization. The green marker shows the mean in addition to the median (red line).

Figure 15. Evolution of a specific cell state in the LSTM (b) compared to the daily minimum and maximum temperature (a), with accumulation in winter and depletion in spring. The vertical grey lines are included for better guidance.

In contrast to hydrological models, however, the LSTM does not "know" the principle of water/mass conservation or the governing process equations describing e.g. infiltration or evapotranspiration processes a priori. Compared to traditional hydrological models, the LSTM is optimized to predict the streamflow as well as possible and has to learn these physical principles and laws during the calibration process purely from the data.

Finally, we want to show the results of a preliminary analysis in which we inspect the internals of the LSTM. Neural networks (as well as other data-driven approaches) are often criticized for their "black box"-like nature. However, here we want to argue that the internals of the LSTM can be inspected as well as interpreted, thus taking away some of the "black-box-ness".

Figure 15 shows the evolution of a single LSTM cell state (ct; see Sect. 2.1) of a trained LSTM over the period of one input sequence (which equals 365 days in this study) for an arbitrary, snow-influenced catchment. We can see that the cell state matches the dynamics of the temperature curves, as well as our understanding of snow accumulation and snowmelt. As soon as temperatures fall below 0 °C the cell state starts to increase (around time step 60), until the minimum temperature rises above the freezing point (around time step 200) and the cell state depletes quickly. Also, the fluctuations between time steps 60 and 120 match the fluctuations visible in the temperature around the freezing point. Thus, albeit the LSTM was only trained to predict runoff from meteorological observations, it has learned to model snow dynamics without any forcing to do so.
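One way to perform such an inspection is to step an LSTM cell manually through one input sequence and record the cell state vector at every time step; the following is a minimal sketch of that idea, with randomly initialized weights and random inputs standing in for a trained model and real forcings.

    import numpy as np
    import tensorflow as tf

    cell = tf.keras.layers.LSTMCell(20)                # same size as one layer of the model
    x_seq = np.random.rand(365, 5).astype("float32")   # one input sequence (dummy forcings)

    # The initial hidden state h and cell state c are zero vectors.
    states = [tf.zeros((1, 20)), tf.zeros((1, 20))]
    cell_states = []
    for t in range(x_seq.shape[0]):
        _, states = cell(tf.constant(x_seq[t:t + 1]), states)
        cell_states.append(states[1].numpy().squeeze())  # states[1] is the cell state c_t

    cell_states = np.stack(cell_states)  # shape (365, 20): one trajectory per memory cell
    print(cell_states.shape)
    # A single column of this array, e.g. cell_states[:, 3], is the kind of trajectory
    # that is compared against the temperature curves in Fig. 15.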
4 Summary and conclusion

This contribution investigated the potential of using Long Short-Term Memory networks (LSTMs) for simulating runoff from meteorological observations. LSTMs are a special type of recurrent neural network with an internal memory that has the ability to learn and store long-term dependencies of the input–output relationship. Within three experiments, we explored possible applications of LSTMs and demonstrated that they are able to simulate the runoff with competitive performance compared to a baseline hydrological model (here the SAC-SMA + Snow-17 model). In the first experiment we looked at classical single-basin modelling, in the second experiment we trained one model for all basins in each of the regions we investigated, and in the third experiment we showed that using a pre-trained model helps to increase the model performance in single basins. Additionally, we showed an illustrative example of why traditional RNNs should be avoided in favour of LSTMs if the task is to predict runoff from meteorological observations.

The goal of this study was to explore the potential of the method and not to obtain the best possible realization of the LSTM model per catchment (see Sect. 2.5). It is therefore very likely that better performing LSTMs can be found by an exhaustive (catchment-wise) hyperparameter search. However, with our simple calibration approach, we were already able to obtain comparable (or even slightly higher) model performances compared to the well-established SAC-SMA + Snow-17 model.

In summary, the major findings of the present study are the following.

Hydrol. Earth Syst. Sci., 22, 6005–6022, 2018 www.hydrol-earth-syst-sci.net/22/6005/2018/

23
F. Kratzert et al.: Rainfall–runoff modelling using LSTMs 6019

a. LSTMs are able to predict runoff from meteorological observations with accuracies comparable to the well-established SAC-SMA + Snow-17 model.

b. The 15 years of daily data used for calibration seem to constitute a lower bound of the data requirements.

c. Pre-trained knowledge can be transferred into different catchments, which might be a possible approach for reducing the data demand and/or for regionalization applications, as well as for prediction in ungauged basins or basins with few observations.

The data-intensive nature of LSTMs (as of any deep learning model) is a potential barrier to applying them in data-scarce problems (e.g. for the usage within a single basin with limited data). We do believe that the use of "pre-trained LSTMs" (as explored in Experiment 3) is a promising way to reduce the large data demand for an individual basin. However, further research is needed to verify this hypothesis. Ultimately, however, LSTMs will always strongly rely on the available data for calibration. Thus, even if less data are needed, this can be seen as a disadvantage in comparison to physically based models, which – at least in theory – are not reliant on calibration and can thus be applied with ease to new situations or catchments. However, more and more large-sample data sets are emerging, which will catalyse future applications of LSTMs. In this context, it is also imaginable that adding physical catchment properties as an additional input layer into the LSTM may enhance the predictive power and the ability of LSTMs to work as regional models and to make predictions in ungauged basins.

An entirely justifiable barrier to using LSTMs (or any other data-driven model) in real-world applications is their black-box nature. Like every common data-driven tool in hydrology, LSTMs have no explicit internal representation of the water balance. However, for the LSTM at least, it might be possible to analyse the behaviour of the cell states and link them to basic hydrological patterns (such as the snow accumulation and melt processes), as we showed briefly in Sect. 3.5. We hypothesize that a systematic interpretation, or the interpretability in general, of the network internals would increase the trust in data-driven approaches, especially those of LSTMs, leading to their use in more (novel) applications in the environmental sciences in the near future.

Data availability. All underlying data used in this research study are openly available. The sources are mentioned in Sect. 2.4. Model outputs as well as code may be made available by request to the corresponding author.

Author contributions. FK and DK designed all experiments. FK conducted all experiments and analysed the results. FK prepared the paper with contributions from DK, CB, KS and MH.

Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. Part of the research was funded by the Austrian Science Fund (FWF) through project P31213-N29. Furthermore, we would like to thank the two anonymous reviewers for their comments that helped to improve this paper.

Edited by: Uwe Ehret
Reviewed by: Niels Schuetze and one anonymous referee

References

Abadi, M., Agarwal, A., Barham, P., Brevdo, E., Chen, Z., Citro, C., Corrado, G. S., Davis, A., Dean, J., Devin, M., Ghemawat, S., Goodfellow, I., Harp, A., Irving, G., Isard, M., Jia, Y., Jozefowicz, R., Kaiser, L., Kudlur, M., Levenberg, J., Mane, D., Monga, R., Moore, S., Murray, D., Olah, C., Schuster, M., Shlens, J., Steiner, B., Sutskever, I., Talwar, K., Tucker, P., Vanhoucke, V., Vasudevan, V., Viegas, F., Vinyals, O., Warden, P., Wattenberg, M., Wicke, M., Yu, Y., and Zheng, X.: TensorFlow: Large-Scale Machine Learning on Heterogeneous Systems, available at: https://www.tensorflow.org/ (last access: 21 November 2018), 2016.
Abrahart, R. J., Anctil, F., Coulibaly, P., Dawson, C. W., Mount, N. J., See, L. M., Shamseldin, A. Y., Solomatine, D. P., Toth, E., and Wilby, R. L.: Two decades of anarchy? Emerging themes and outstanding challenges for neural network river forecasting, Prog. Phys. Geog., 36, 480–513, 2012.
Adams, T. E. and Pagano, T. C. (Eds.): Flood Forecasting: A Global Perspective, Academic Press, Boston, MA, USA, 2016.
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017, 2017a.
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: Catchment attributes for large-sample studies, UCAR/NCAR, Boulder, CO, USA, https://doi.org/10.5065/D6G73C3Q, 2017b.
Anderson, E. A.: National Weather Service River Forecast System – Snow Accumulation and Ablation Model, Tech. Rep. November, US Department of Commerce, Silver Spring, USA, 1973.
ASCE Task Committee on Application of Artificial Neural Networks: Artificial Neural Networks in Hydrology. II: Hydrologic Applications, J. Hydrol. Eng., 52, 124–137, 2000.
Assem, H., Ghariba, S., Makrai, G., Johnston, P., Gill, L., and Pilla, F.: Urban Water Flow and Water Level Prediction Based on Deep Learning, in: ECML PKDD 2017: Machine Learning and Knowledge Discovery in Databases, 317–329, Springer, Cham, 2017.
Bengio, Y.: Practical recommendations for gradient-based training of deep architectures, in: Neural Networks: Tricks of the Trade, 437–478, Springer, Berlin, Heidelberg, 2012.
Bengio, Y., Simard, P., and Frasconi, P.: Learning long-term dependencies with gradient descent is difficult, IEEE T. Neural Networ., 5, 157–166, 1994.


Beven, K.: Linking parameters across scales: subgrid parameterizations and scale dependent hydrological models, Hydrol. Process., 9, 507–525, 1995.
Beven, K.: Rainfall-Runoff Modelling: The Primer, John Wiley & Sons, Chichester, UK, 2001.
Blöschl, G., Sivapalan, M., Wagener, T., Viglione, A., and Savenije, H. (Eds.): Runoff Prediction in Ungauged Basins: Synthesis Across Processes, Places and Scales, Cambridge University Press, UK, 465 pp., 2013.
Burnash, R. J. C., Ferral, R. L., and McGuire, R. A.: A generalised streamflow simulation system conceptual modelling for digital computers, Tech. rep., US Department of Commerce National Weather Service and State of California Department of Water Resources, Sacramento, CA, USA, 1973.
Buytaert, W. and Beven, K.: Regionalization as a learning process, Water Resour. Res., 45, 1–13, 2009.
Carriere, P., Mohaghegh, S., and Gaskar, R.: Performance of a Virtual Runoff Hydrographic System, Water Resources Planning and Management, 122, 120–125, 1996.
Chollet, F.: Keras, available at: https://github.com/fchollet/keras (last access: 1 April 2018), 2015.
Clark, M. P., Bierkens, M. F. P., Samaniego, L., Woods, R. A., Uijlenhoet, R., Bennett, K. E., Pauwels, V. R. N., Cai, X., Wood, A. W., and Peters-Lidard, C. D.: The evolution of process-based hydrologic models: historical challenges and the collective quest for physical realism, Hydrol. Earth Syst. Sci., 21, 3427–3440, https://doi.org/10.5194/hess-21-3427-2017, 2017.
Daniell, T. M.: Neural networks. Applications in hydrology and water resources engineering, in: Proceedings of the International Hydrology and Water Resource Symposium, vol. 3, 797–802, Institution of Engineers, Perth, Australia, 1991.
Duan, Q., Gupta, V. K., and Sorooshian, S.: Shuffled complex evolution approach for effective and efficient global minimization, J. Optimiz. Theory App., 76, 501–521, 1993.
Fang, K., Shen, C., Kifer, D., and Yang, X.: Prolongation of SMAP to Spatiotemporally Seamless Coverage of Continental U.S. Using a Deep Learning Neural Network, Geophys. Res. Lett., 44, 11030–11039, 2017.
Farabet, C., Couprie, C., Najman, L., and LeCun, Y.: Learning Hierarchical Features for Scene Labeling, IEEE T. Pattern Anal., 35, 1915–1929, 2013.
Freeze, R. A. and Harlan, R. L.: Blueprint for a physically-based, digitally-simulated hydrologic response model, J. Hydrol., 9, 237–258, 1969.
Gers, F. A., Schmidhuber, J., and Cummins, F.: Learning to Forget: Continual Prediction with LSTM, Neural Comput., 12, 2451–2471, 2000.
Goodfellow, I., Bengio, Y., and Courville, A.: Deep Learning, MIT Press, available at: http://www.deeplearningbook.org (last access: 1 April 2018), 2016.
Graves, A., Mohamed, A.-R., and Hinton, G.: Speech recognition with deep recurrent neural networks, in: Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on, 6645–6649, Vancouver, Canada, 2013.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, 2009.
Halevy, A., Norvig, P., and Pereira, F.: The Unreasonable Effectiveness of Data, IEEE Intell. Syst., 24, 8–12, 2009.
Halff, A. H., Halff, H. M., and Azmoodeh, M.: Predicting runoff from rainfall using neural networks, in: Engineering Hydrology, ASCE, 760–765, 1993.
He, Y., Bárdossy, A., and Zehe, E.: A review of regionalisation for continuous streamflow simulation, Hydrol. Earth Syst. Sci., 15, 3539–3553, https://doi.org/10.5194/hess-15-3539-2011, 2011.
Hengl, T., Mendes de Jesus, J., Heuvelink, G. B. M., Ruiperez Gonzalez, M., Kilibarda, M., Blagotic, A., Shangguan, W., Wright, M. N., Geng, X., Bauer-Marschallinger, B., Guevara, M. A., Vargas, R., MacMillan, R. A., Batjes, N. H., Leenaars, J. G. B., Ribeiro, E., Wheeler, I., Mantel, S., and Kempen, B.: SoilGrids250m: Global gridded soil information based on machine learning, PLOS ONE, 12, 1–40, https://doi.org/10.1371/journal.pone.0169748, 2017.
Herrnegger, M., Senoner, T., and Nachtnebel, H. P.: Adjustment of spatio-temporal precipitation patterns in a high Alpine environment, J. Hydrol., 556, 913–921, 2018.
Hestness, J., Narang, S., Ardalani, N., Diamos, G. F., Jun, H., Kianinejad, H., Patwary, M. M. A., Yang, Y., and Zhou, Y.: Deep Learning Scaling is Predictable, Empirically, available at: https://arxiv.org/abs/1712.00409 (last access: 21 November 2018), 2017.
Hinton, G., Deng, L., Yu, D., Dahl, G. E., Mohamed, A.-R., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Sainath, T. N., and Kingsbury, B.: Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research Groups, IEEE Signal Proc. Mag., 29, 82–97, 2012.
Hochreiter, S. and Schmidhuber, J.: Long Short-Term Memory, Neural Comput., 9, 1735–1780, 1997.
Hsu, K., Gupta, H. V., and Sorooshian, S.: Application of a recurrent neural network to rainfall-runoff modeling, Proc., Aesthetics in the Constructed Environment, ASCE, New York, 68–73, 1997.
Hunter, J. D.: Matplotlib: A 2D graphics environment, Comput. Sci. Eng., 9, 90–95, 2007.
Kirchner, J. W.: Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science of hydrology, Water Resour. Res., 42, 1–5, 2006.
Kollet, S. J., Maxwell, R. M., Woodward, C. S., Smith, S., Vanderborght, J., Vereecken, H., and Simmer, C.: Proof of concept of regional scale hydrologic simulations at hydrologic resolution utilizing massively parallel computer resources, Water Resour. Res., 46, 1–7, 2010.
Krizhevsky, A., Sutskever, I., and Hinton, G. E.: ImageNet Classification with Deep Convolutional Neural Networks, Adv. Neur. In., 1097–1105, 2012.
Kumar, D. N., Raju, K. S., and Sathish, T.: River Flow Forecasting using Recurrent Neural Networks, Water Resour. Manag., 18, 143–161, 2004.
LeCun, Y. A., Bottou, L., Orr, G. B., and Müller, K. R.: Efficient backprop, Springer, Berlin, Heidelberg, Germany, 2012.
Lindström, G., Pers, C., Rosberg, J., Strömqvist, J., and Arheimer, B.: Development and testing of the HYPE (Hydrological Predictions for the Environment) water quality model for different spatial scales, Hydrol. Res., 41, 295–319, 2010.
Marçais, J. and de Dreuzy, J. R.: Prospective Interest of Deep Learning for Hydrological Inference, Groundwater, 55, 688–692, 2017.


Maurer, E. P., Wood, A. W., Adam, J. C., Lettenmaier, D. P., and Nijssen, B.: A long-term hydrologically based dataset of land surface fluxes and states for the conterminous United States, J. Climate, 15, 3237–3251, 2002.
McKinney, W.: Data Structures for Statistical Computing in Python, in: Proceedings of the 9th Python in Science Conference, Austin, Texas, USA, 51–56, available at: http://conference.scipy.org/proceedings/scipy2010/mckinney.html (last access: 1 April 2018), 2010.
Merz, R., Blöschl, G., and Parajka, J.: Regionalisation methods in rainfall-runoff modelling using large samples, Large Sample Basin Experiments for Hydrological Model Parameterization: Results of the Model Parameter Experiment – MOPEX, IAHS Publ., 307, 117–125, 2006.
Minns, A. W. and Hall, M. J.: Artificial neural networks as rainfall-runoff models, Hydrolog. Sci. J., 41, 399–417, 1996.
Mu, Q., Zhao, M., and Running, S. W.: Improvements to a MODIS global terrestrial evapotranspiration algorithm, Remote Sens. Environ., 115, 1781–1800, 2011.
Mulvaney, T. J.: On the use of self-registering rain and flood gauges in making observations of the relations of rainfall and of flood discharges in a given catchment, in: Proceedings Institution of Civil Engineers, Dublin, vol. 4, 18–31, 1850.
Myneni, R. B., Hoffman, S., Knyazikhin, Y., Privette, J. L., Glassy, J., Tian, Y., Wang, Y., Song, X., Zhang, Y., Smith, G. R., Lotsch, A., Friedl, M., Morisette, J. T., Votava, P., Nemani, R. R., and Running, S. W.: Global products of vegetation leaf area and fraction absorbed PAR from year one of MODIS data, Remote Sens. Environ., 83, 214–231, 2002.
Nash, J. E. and Sutcliffe, J. V.: River Flow Forecasting Through Conceptual Models Part I – a Discussion of Principles, J. Hydrol., 10, 282–290, 1970.
Newman, A., Sampson, K., Clark, M., Bock, A., Viger, R., and Blodgett, D.: A large-sample watershed-scale hydrometeorological dataset for the contiguous USA, UCAR/NCAR, Boulder, CO, USA, https://doi.org/10.5065/D6MW2F4D, 2014.
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, https://doi.org/10.5194/hess-19-209-2015, 2015.
Pedregosa, F., Varoquaux, G., Gramfort, A., Michel, V., Thirion, B., Grisel, O., Blondel, M., Prettenhofer, P., Weiss, R., Dubourg, V., Vanderplas, J., Passos, A., Cournapeau, D., Brucher, M., Perrot, M., and Duchesnay, E.: Scikit-learn: Machine Learning in Python, J. Mach. Learn. Res., 12, 2825–2830, 2011.
Razavian, A. S., Azizpour, H., Sullivan, J., and Carlsson, S.: CNN features off-the-shelf: An astounding baseline for recognition, IEEE Computer Society Conference on Computer Vision and Pattern Recognition Workshops, 24–28 June 2014, Columbus, Ohio, USA, 512–519, 2014.
Remesan, R. and Mathew, J.: Hydrological Data Driven Modelling: A Case Study Approach, vol. 1, Springer International Publishing, 2014.
Rennó, C. D., Nobre, A. D., Cuartas, L. A., Soares, J. V., Hodnett, M. G., Tomasella, J., and Waterloo, M. J.: HAND, a new terrain descriptor using SRTM-DEM: Mapping terra-firme rainforest environments in Amazonia, Remote Sens. Environ., 112, 3469–3481, 2008.
Rumelhart, D. E., Hinton, G. E., and Williams, R. J.: Learning internal representations by error propagation (No. ICS-8506), California Univ San Diego La Jolla Inst for Cognitive Science, 1985.
Samaniego, L., Kumar, R., and Attinger, S.: Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale, Water Resour. Res., 46, 1–25, 2010.
Schmidhuber, J.: Deep learning in neural networks: An overview, Neural Networks, 61, 85–117, 2015.
Schulla, J.: Model description WaSiM (Water balance Simulation Model), completely revised version 2012, last change: 19 June 2012, available at: http://www.wasim.ch/downloads/doku/wasim/wasim_2012_ed2_en.pdf (last access: 1 April 2018), 2017.
Seaber, P. R., Kapinos, F. P., and Knapp, G. L.: Hydrologic Unit Maps, Tech. rep., U.S. Geological Survey, Water Supply Paper 2294, Reston, Virginia, USA, 1987.
Shen, C.: A transdisciplinary review of deep learning research and its relevance for water resources scientists, Water Resour. Res., 54, https://doi.org/10.1029/2018WR022643, 2018.
Shen, C., Laloy, E., Elshorbagy, A., Albert, A., Bales, J., Chang, F.-J., Ganguly, S., Hsu, K.-L., Kifer, D., Fang, Z., Fang, K., Li, D., Li, X., and Tsai, W.-P.: HESS Opinions: Incubating deep-learning-powered hydrologic science advances as a community, Hydrol. Earth Syst. Sci., 22, 5639–5656, https://doi.org/10.5194/hess-22-5639-2018, 2018.
Shi, X., Chen, Z., Wang, H., Yeung, D.-Y., Wong, W.-K., and Woo, W.-C.: Convolutional LSTM network: A machine learning approach for precipitation nowcasting, Adv. Neur. In., 28, 802–810, 2015.
Sivapalan, M.: Prediction in ungauged basins: a grand challenge for theoretical hydrology, Hydrol. Process., 17, 3163–3170, 2003.
Solomatine, D., See, L. M., and Abrahart, R. J.: Data-driven modelling: concepts, approaches and experiences, in: Practical Hydroinformatics, 17–30, Springer, Berlin, Heidelberg, 2009.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: Dropout: A Simple Way to Prevent Neural Networks from Overfitting, J. Mach. Learn. Res., 15, 1929–1958, 2014.
Stanzel, P., Kahl, B., Haberl, U., Herrnegger, M., and Nachtnebel, H. P.: Continuous hydrological modelling in the context of real time flood forecasting in alpine Danube tributary catchments, IOP C. Ser. Earth Env., 4, 012005, https://doi.org/10.1088/1755-1307/4/1/012005, 2008.
Sutskever, I., Vinyals, O., and Le, Q. V.: Sequence to sequence learning with neural networks, in: Advances in Neural Information Processing Systems, 3104–3112, 2014.
Tao, Y., Gao, X., Hsu, K., Sorooshian, S., and Ihler, A.: A Deep Neural Network Modeling Framework to Reduce Bias in Satellite Precipitation Products, J. Hydrometeorol., 17, 931–945, 2016.
Thielen, J., Bartholmes, J., Ramos, M.-H., and de Roo, A.: The European Flood Alert System – Part 1: Concept and development, Hydrol. Earth Syst. Sci., 13, 125–140, https://doi.org/10.5194/hess-13-125-2009, 2009.
Thornton, P. E., Thornton, M. M., Mayer, B. W., Wilhelmi, N., Wei, Y., Devarakonda, R., and Cook, R.: Daymet: Daily surface weather on a 1 km grid for North America, 1980–2008, Oak Ridge National Laboratory (ORNL) Distributed Active Archive Center for Biogeochemical Dynamics (DAAC), Oak Ridge, Tennessee, USA, 2012.

www.hydrol-earth-syst-sci.net/22/6005/2018/ Hydrol. Earth Syst. Sci., 22, 6005–6022, 2018

26 Chapter 3 Publications
6022 F. Kratzert et al.: Rainfall–runoff modelling using LSTMs

Ridge National Laboratory (ORNL) Distributed Active Archive Yilmaz, K. K., Gupta, H. V., and Wagener, T.: A process-based di-
Center for Biogeochemical Dynamics (DAAC), Oak Ridge, Ten- agnostic approach to model evaluation: Application to the NWS
nessee, USA, 2012. distributed hydrologic model, Water Resour. Res., 44, 1–18,
Tompson, J., Jain, A., LeCun, Y., and Bregler, C.: Joint Training of a 2008.
Convolutional Network and a Graphical Model for Human Pose Yosinski, J., Clune, J., Bengio, Y., and Lipson, H.: How transferable
Estimation, in: Proceedings of Advances in Neural Information are features in deep neural networks?, Adv. Neur. In., 27, 1–9,
Processing Systems, 27, 1799–1807, 2014. 2014.
Van Der Walt, S., Colbert, S. C., and Varoquaux, G.: The NumPy Young, P. C. and Beven, K. J.: Data-based mechanistic modelling
array: A structure for efficient numerical computation, Comput. and the rainfall-flow non-linearity, Environmetrics, 5, 335–363,
Sci. Eng., 13, 22–30, 2011. 1994.
van Rossum, G.: Python tutorial, Technical Report CS-R9526, Zhang, D., Lindholm, G., and Ratnaweera, H.: Use long short-term
Tech. rep., Centrum voor Wiskunde en Informatica (CWI), Am- memory to enhance Internet of Things for combined sewer over-
sterdam, the Netherlands, 1995. flow monitoring, J. Hydrol., 556, 409–418, 2018.
Wesemann, J., Herrnegger, M., and Schulz, K.: Hydrological mod- Zhang, J., Zhu, Y., Zhang, X., Ye, M., and Yang, J.: Developing a
elling in the anthroposphere: predicting local runoff in a heavily Long Short-Term Memory (LSTM) based model for predicting
modified high-alpine catchment, J. Mt. Sci., 15, 921–938, 2018. water table depth in agricultural areas, J. Hydrol., 561, 918–929,
Wood, E. F., Roundy, J. K., Troy, T. J., van Beek, L. P. H., Bierkens, 2018.
M. F. P., Blyth, E., de Roo, A., Döll, P., Ek, M., Famiglietti, Zhu, M. and Fujita, M.: Application of neural networks to runoff
J., Gochis, D., van de Giesen, N., Houser, P., Jaffé, P. R., Kol- forecast, vol. 3, Springer, Dodrecht, the Netherlands, 1993.
let, S., Lehner, B., Lettenmaier, D. P., Peters-Lidard, C., Siva-
palan, M., Sheffield, J., Wade, A., and Whitehead, P.: Hyperres-
olution global land surface modeling: Meeting a grand challenge
for monitoring Earth’s terrestrial water, Water Resour. Res., 47,
W05301, https://doi.org/10.1029/2010WR010090, 2011.
Xia, Y., Mitchell, K., Ek, M., Sheffield, J., Cosgrove, B., Wood,
E., Luo, L., Alonge, C., Wei, H., Meng, J., and Livneh, B.:
Continental-scale water and energy flux analysis and valida-
tion for the North American Land Data Assimilation System
project phase 2 (NLDAS-2): 1. Intercomparison and applica-
tion of model products, J. Geophys. Res.-Atmos., 117, D03109,
https://doi.org/10.1029/2011JD016048, 2012.

Hydrol. Earth Syst. Sci., 23, 5089–5110, 2019
https://doi.org/10.5194/hess-23-5089-2019
© Author(s) 2019. This work is distributed under
the Creative Commons Attribution 4.0 License.

Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets
Frederik Kratzert1 , Daniel Klotz1 , Guy Shalev2 , Günter Klambauer1 , Sepp Hochreiter1,* , and Grey Nearing3,*
1 LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, Linz, Austria
2 Google Research, Tel Aviv, Israel
3 Department of Geological Sciences, University of Alabama, Tuscaloosa, AL, USA
∗ These authors contributed equally to this work.

Correspondence: Frederik Kratzert ([email protected])

Received: 15 July 2019 – Discussion started: 2 August 2019


Revised: 29 October 2019 – Accepted: 6 November 2019 – Published: 17 December 2019

Abstract. Regional rainfall–runoff modeling is an old but still mostly outstanding problem in the hydrological sciences. The problem currently is that traditional hydrological models degrade significantly in performance when calibrated for multiple basins together instead of for a single basin alone. In this paper, we propose a novel, data-driven approach using Long Short-Term Memory networks (LSTMs) and demonstrate that under a "big data" paradigm, this is not necessarily the case. By training a single LSTM model on 531 basins from the CAMELS dataset using meteorological time series data and static catchment attributes, we were able to significantly improve performance compared to a set of several different hydrological benchmark models. Our proposed approach not only significantly outperforms hydrological models that were calibrated regionally, but also achieves better performance than hydrological models that were calibrated for each basin individually. Furthermore, we propose an adaption to the standard LSTM architecture, which we call an Entity-Aware-LSTM (EA-LSTM), that allows for learning catchment similarities as a feature layer in a deep learning model. We show that these learned catchment similarities correspond well to what we would expect from prior hydrological understanding.

1 Introduction

A long-standing problem in the hydrological sciences is about how to use one model, or one set of models, to provide spatially continuous hydrological simulations across large areas (e.g., regional, continental, global). This is the so-called regional modeling problem, and the central challenge is about how to extrapolate hydrologic information from one area to another – e.g., from gauged to ungauged watersheds, from instrumented to non-instrumented hillslopes, or from areas with flux towers to areas without (Blöschl and Sivapalan, 1995). Often this is done using ancillary data (e.g., soil maps, remote sensing, digital elevation maps) to help understand similarities and differences between different areas. The regional modeling problem is thus closely related to the problem of prediction in ungauged basins (Blöschl et al., 2013; Sivapalan et al., 2003). This problem is well documented in several review papers; therefore, we point the interested reader to the comprehensive reviews by Razavi and Coulibaly (2013) and Hrachowitz et al. (2013) and to the more recent review in the introduction by Prieto et al. (2019).

Currently, the most successful hydrological models are calibrated to one specific basin, whereas a regional model must be somehow "aware" of differences between hydrologic behaviors in different catchments (e.g., ecology, geology, pedology, topography, geometry). The challenge of regional modeling is to learn and encode these differences so that differences in catchment characteristics translate into appropriately heterogeneous hydrologic behavior.

Razavi and Coulibaly (2013) recognize two primary types of strategies for regional modeling: model-dependent methods and model-independent (data-driven) methods. Here, model-dependent denotes approaches where regionalization explicitly depends on a pre-defined hydrological model (e.g., classical process-based models), while model-independent denotes data-driven approaches that do not include a specific model. The critical difference is that the first tries to derive hydrologic parameters that can be used to run simulation models from available data (i.e., observable catchment characteristics). In this case, the central challenge is the fact that there is typically strong interaction between individual model parameters (e.g., between soil porosity and soil depth, or between saturated conductivity and an infiltration rate parameter), such that any meaningful joint probability distribution over model parameters will be complex and multimodal. This is closely related to the problem of equifinality (Beven and Freer, 2001).

Model-dependent regionalization has enjoyed major attention from the hydrological community, so that today a large variety of approaches exist. To give a few selective examples, Seibert (1999) calibrated a conceptual model for 11 catchments and regressed them against the available catchment characteristics. The regionalization capacity was tested against seven other catchments, where the reported performance ranged between a Nash–Sutcliffe efficiency (NSE) of 0.42 and 0.76. Samaniego et al. (2010) proposed a multiscale parameter regionalization (MPR) method, which simultaneously sets up the model and a regionalization scheme by regressing the global parameters of a set of a priori defined transfer functions that map from ancillary data like soil properties to hydrological model parameters. Beck et al. (2016) calibrated a conceptual model for 1787 catchments around the globe, used these as a catalog of "donor catchments", and then extended this library to new catchments by identifying the 10 most similar catchments from the library in terms of climatic and physiographic characteristics to parameterize a simulation ensemble. Prieto et al. (2019) first regionalized hydrologic signatures (Gupta et al., 2008) using a regression model (random forests) and then calibrated a rainfall–runoff model to the regionalized hydrologic signatures.

Model-independent methods, in contrast, do not rely on prior knowledge of the hydrological system. Instead, these methods learn the entire mapping from ancillary data and meteorological inputs to streamflow or other output fluxes directly. A model of this type has to "learn" how catchment attributes or other ancillary data distinguish between different catchment response behaviors. However, hydrological modeling typically provides the most accurate predictions when a model is calibrated to a single specific catchment (Mizukami et al., 2017), whereas data-driven approaches might benefit from a large cross section of diverse training data, because knowledge can be transferred across sites. Among the category of data-driven approaches are neural networks. Besaw et al. (2010) showed that an artificial neural network trained on one catchment (using only meteorological inputs) could be moved to a similar catchment (during a similar time period). However, the accuracy of their network in the training catchment was only an NSE of 0.29. Recently, Kratzert et al. (2018b) have shown that Long Short-Term Memory (LSTM) networks, a special type of recurrent neural network, are well suited for the task of rainfall–runoff modeling. This study already included the first experiments towards regional modeling while still using only meteorological inputs and ignoring ancillary catchment attributes. In a preliminary study Kratzert et al. (2018c) demonstrated that their LSTM-based approach outperforms, on average, the well-calibrated Sacramento Soil Moisture Accounting Model (SAC-SMA) in an asymmetrical comparison where the LSTM was used in an ungauged setting and SAC-SMA was used in a gauged setting – i.e., SAC-SMA was calibrated individually for each basin, whereas the LSTM never saw training data from any catchment where it was used for prediction. This was done by providing the LSTM-based model with meteorological forcing data and additional catchment attributes. From these preliminary results we can already assume that this general modeling approach is promising and has the potential for regionalization.

The objectives of this study are

i. to demonstrate that we can use large-sample hydrology data (Gupta et al., 2014; Peters-Lidard et al., 2017) to develop a regional rainfall–runoff model that capitalizes on observable ancillary data in the form of catchment attributes to produce accurate streamflow estimates over a large number of basins,

ii. to benchmark the performance of our neural network model against several existing hydrology models, and

iii. to show how the model uses information about catchment characteristics to differentiate between different rainfall–runoff behaviors.

To this end we built an LSTM-based model that learns catchment similarities directly from meteorological forcing data and ancillary data of multiple basins and evaluate its performance in a "gauged" setting, meaning that we never ask our model to predict in a basin where it did not see training data. Concretely, we propose an adaption of the LSTM where catchment attributes explicitly control which parts of the LSTM state space are used for a given basin. Because the model is trained using both catchment attributes and meteorological time series data, to predict streamflow, it can learn how to combine different parts of the network to simulate different types of rainfall–runoff behaviors. In principle, the approach explicitly allows for sharing of parts of the networks for similarly behaving basins while using different independent parts for basins with completely different rainfall–runoff behavior.

Furthermore, our adaption provides a mapping from catchment attribute space into a learned, high-dimensional space, i.e., a so-called embedding, in which catchments with similar rainfall–runoff behavior can be placed together. This embedding can be used to perform data-driven catchment similarity analysis.

The paper is organized as follows. Section 2 (Methods) describes our LSTM-based model, the data, the benchmark hydrological models, and the experimental design. Section 3 (Results) presents our modeling results, the benchmarking results, and the results of our embedding layer analysis. Section 4 (Discussion and conclusion) reviews certain implications of our model and results and summarizes the advantages of using data-driven methods for extracting information from catchment observables for regional modeling.

2 Methods

2.1 A brief overview of the Long Short-Term Memory network

An LSTM network is a type of recurrent neural network that includes dedicated memory cells that store information over long time periods. A specific configuration of operations in this network, so-called gates, controls the information flow within the LSTM (Hochreiter, 1991; Hochreiter and Schmidhuber, 1997). These memory cells are, in a sense, analogous to a state vector in a traditional dynamical systems model, which makes LSTMs potentially an ideal candidate for modeling dynamical systems like watersheds. Compared to other types of recurrent neural networks, LSTMs do not have a problem with exploding and/or vanishing gradients, which allows them to learn long-term dependencies between input and output features. This is desirable for modeling catchment processes like snow accumulation and snowmelt that have relatively long timescales compared with the timescales of purely input-driven domains (i.e., precipitation events).

An LSTM works as follows (see also Fig. 1a): given an input sequence x = [x[1], ..., x[T]] with T time steps, where each element x[t] is a vector containing input features (model inputs) at time step t (1 ≤ t ≤ T), the following equations describe the forward pass through the LSTM:

i[t] = σ(W_i x[t] + U_i h[t−1] + b_i),    (1)
f[t] = σ(W_f x[t] + U_f h[t−1] + b_f),    (2)
g[t] = tanh(W_g x[t] + U_g h[t−1] + b_g),    (3)
o[t] = σ(W_o x[t] + U_o h[t−1] + b_o),    (4)
c[t] = f[t] ⊙ c[t−1] + i[t] ⊙ g[t],    (5)
h[t] = o[t] ⊙ tanh(c[t]),    (6)

where i[t], f[t], and o[t] are the input gate, forget gate, and output gate, respectively, g[t] is the cell input and x[t] is the network input at time step t (1 ≤ t ≤ T), and h[t−1] is the recurrent input, c[t−1] the cell state from the previous time step. At the first time step, the hidden and cell states are initialized as a vector of zeros. W, U, and b are learnable parameters for each gate, where subscripts indicate which gate the particular weight matrix/vector is used for, σ(·) is the sigmoid function, tanh(·) is the hyperbolic tangent function, and ⊙ is element-wise multiplication. The intuition behind this network is that the cell states (c[t]) characterize the memory of the system. The cell states can get modified by the forget gate (f[t]), which can delete states, and the input gate (i[t]) and cell update (g[t]), which can add new information. In the latter case, the cell update is seen as the information that is added and the input gate controls into which cells new information is added. Finally, the output gate (o[t]) controls which information, stored in the cell states, is outputted. For a more detailed description, as well as a hydrological interpretation, see Kratzert et al. (2018b).

2.2 A new type of recurrent network: the Entity-Aware-LSTM

To reiterate from the introduction, our objective is to build a network that learns to extract information that is relevant to rainfall–runoff behaviors from observable catchment attributes. To achieve this, it is necessary to provide the network with information on the catchment characteristics that contain some amount of information that allows for discrimination between different catchments. Ideally, we want the network to condition the processing of the dynamic inputs on a set of static catchment characteristics. That is, we want the network to learn a mapping from meteorological time series into streamflow that itself (i.e., the mapping) depends on a set of static catchment characteristics that could, in principle, be measured anywhere in our modeling domain.

One way to do this would be to add the static features as additional inputs at every time step. That is, we could simply augment the vectors x[t] at every time step with a set of catchment characteristics that do not (necessarily) change over time. However, this approach does not allow us to directly inspect what the LSTM learns from these static catchment attributes.

Our proposal is therefore to use a slight variation on the normal LSTM architecture (an illustration is given in Fig. 1b):

i = σ(W_i x_s + b_i),    (7)
f[t] = σ(W_f x_d[t] + U_f h[t−1] + b_f),    (8)
g[t] = tanh(W_g x_d[t] + U_g h[t−1] + b_g),    (9)
o[t] = σ(W_o x_d[t] + U_o h[t−1] + b_o),    (10)
c[t] = f[t] ⊙ c[t−1] + i ⊙ g[t],    (11)
h[t] = o[t] ⊙ tanh(c[t]).    (12)

Here i is an input gate, which now does not change over time. x_s are the static inputs (e.g., catchment attributes) and x_d[t] are the dynamic inputs (e.g., meteorological forcings) at time step t (1 ≤ t ≤ T). The rest of the LSTM remains unchanged.
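To make the difference to the standard LSTM of Eqs. (1)–(6) concrete, a minimal sketch of the EA-LSTM forward pass of Eqs. (7)–(12) could look as follows. This is an illustration only, assuming NumPy arrays for the (already trained) weights; the function and variable names are ours and not part of any released code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def ea_lstm_forward(x_s, x_d, W, U, b):
    """Sketch of the EA-LSTM forward pass (Eqs. 7-12).

    x_s : static catchment attributes, shape (n_static,)
    x_d : dynamic (meteorological) inputs, shape (T, n_dynamic)
    W, U, b : dicts of gate weights, keyed by "i", "f", "g", "o",
              with shapes following Eqs. (7)-(10)
    """
    n_hidden = b["f"].shape[0]
    h = np.zeros(n_hidden)  # hidden state, initialized with zeros
    c = np.zeros(n_hidden)  # cell state, initialized with zeros

    # Eq. (7): the input gate is computed once from the static inputs
    # and stays fixed over the whole sequence (unlike Eq. 1, where it
    # would be recomputed from x[t] and h[t-1] at every time step).
    i = sigmoid(W["i"] @ x_s + b["i"])

    for t in range(x_d.shape[0]):
        f = sigmoid(W["f"] @ x_d[t] + U["f"] @ h + b["f"])    # Eq. (8)
        g = np.tanh(W["g"] @ x_d[t] + U["g"] @ h + b["g"])    # Eq. (9)
        o = sigmoid(W["o"] @ x_d[t] + U["o"] @ h + b["o"])    # Eq. (10)
        c = f * c + i * g                                     # Eq. (11)
        h = o * np.tanh(c)                                    # Eq. (12)
    return h, c, i
```

The returned vector i is exactly the per-catchment embedding discussed below: it is constant in time and scales how strongly each cell state is written to for a given basin.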

Figure 1. Visualization of (a) the standard (LSTM) cell as defined by Eqs. (1)–(6) and (b) the proposed Entity-Aware-LSTM (EA-LSTM) cell as defined by Eqs. (7)–(12).

The intuition is as follows: we explicitly process the static inputs x_s and the dynamic inputs x_d[t] separately within the architecture and assign them special tasks. The static features control, through input gate (i), which parts of the LSTM are activated for any individual catchment, while the dynamic and recurrent inputs control what information is written into the memory (g[t]), what is deleted (f[t]), and what of the stored information to output (o[t]) at the current time step t. We call this an Entity-Aware-LSTM (EA-LSTM) because it explicitly differentiates between similar types of dynamical behaviors (here rainfall–runoff processes) that differ between individual entities (here different watersheds). After training, the static input gate of the EA-LSTM contains a series of real values in the range (0, 1) that allow certain parts of the input gate to be active through the simulation of any individual catchment. In principle, different groups of catchments can share different parts of the full trained network.

This is an embedding layer, which allows for non-naive information sharing between the catchments. For example, we could potentially discover, after training, that two particular catchments share certain parts of the activated network based on geological similarities while other parts of the network remain distinct due to ecological dissimilarities. This embedding layer allows for complex interactions between catchment characteristics, and – importantly – makes it possible for those interactions to be directly informed by the rainfall–runoff data from all catchments used for training.

2.3 Objective function: a smooth-joint NSE

An objective function is required for training the network. For regression tasks such as runoff prediction, the mean-squared error (MSE) is commonly used. Hydrologists also sometimes use the NSE because it has an interpretable range of (−∞, 1). Both the MSE and NSE are squared error loss functions, with the difference being that the latter is normalized by the total variance of the observations. For single-basin optimization, the MSE and NSE will typically yield the same optimum parameter values, discounting any effects in the numerical optimizer that depend on the absolute magnitude of the loss value.

The linear relation between these two metrics (MSE and NSE) is lost, however, when calculated over data from multiple basins. In this case, the means and variances of the observation data are no longer constant because they differ between basins. We will exploit this fact. In our case, the MSE from a basin with low average discharge (e.g., smaller, arid basins) is generally smaller than the MSE from a basin with high average discharge (e.g., larger, humid basins). We need an objective function that does not depend on basin-specific mean discharge so that we do not overweight large humid basins (and thus perform poorly on small, arid basins). Our loss function is therefore the average of the NSE values calculated at each basin that supplies training data – referred to as basin-averaged Nash–Sutcliffe efficiency (NSE*). Additionally, we add a constant term to the denominator (ε = 0.1), the variance of the observations, so that our loss function does not explode (to negative infinity) for catchments with very low flow variance. Our loss function is therefore

NSE* = (1/B) Σ_{b=1}^{B} Σ_{n=1}^{N} (ŷ_n − y_n)² / (s(b) + ε)²,    (13)

where B is the number of basins, N is the number of samples (days) per basin B, ŷ_n is the prediction of sample n (1 ≤ n ≤ N), y_n is the observation, and s(b) is the standard deviation of the discharge in basin b (1 ≤ b ≤ B), calculated from the training period. In general, an entity-aware deep learning model will need a loss function that does not underweight entities with lower (relative to other entities in the training dataset) absolute values in the target data.
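For a mini-batch that mixes samples from many basins, the per-basin normalization of Eq. (13) can be applied sample-wise, so that the batch mean approximates the basin average. A minimal PyTorch sketch of such a loss (illustrative naming, not the original training code) is:

```python
import torch

def nse_star_loss(y_hat, y, q_std, eps=0.1):
    """Sketch of the basin-averaged NSE* loss of Eq. (13).

    y_hat, y : simulated and observed discharge, shape (n_samples,)
    q_std    : per-sample standard deviation of the observed discharge
               of the corresponding basin, computed over the training
               period, shape (n_samples,)
    eps      : constant added to the denominator so that the loss does
               not explode for basins with very low flow variance
    """
    squared_error = (y_hat - y) ** 2
    weights = 1.0 / (q_std + eps) ** 2   # per-basin normalization
    return torch.mean(weights * squared_error)
```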

2.4 The NCAR CAMELS dataset

To benchmark our proposed EA-LSTM model and to assess its ability to learn meaningful catchment similarities, we will use the Catchment Attributes and Meteorological (CAMELS) dataset (Newman et al., 2014; Addor et al., 2017b). CAMELS is a set of data concerning 671 basins that is curated by the US National Center for Atmospheric Research (NCAR). The CAMELS basins range in size between 4 and 25 000 km² and were chosen because they have relatively low anthropogenic impacts. These catchments span a range of geologies and ecoclimatologies, as described in Newman et al. (2015) and Addor et al. (2017a).

We used the same subselection of 531 basins from the CAMELS dataset that was used by Newman et al. (2017). These basins are mapped in Fig. 2 and were chosen (by Newman et al., 2017) out of the full set because some of the basins have a large (> 10 %) discrepancy between different strategies for calculating the basin area, and incorrect basin area would introduce significant uncertainty into a modeling study. Furthermore, only basins with a catchment area smaller than 2000 km² were kept.

Figure 2. Overview of the basin location and corresponding catchment attributes. (a) The mean catchment elevation, (b) the catchment aridity (PET/P), (c) the fraction of the catchment covered by forest, and (d) the daily average precipitation.

For time-dependent meteorological inputs (x_d[t]), we used the daily, basin-averaged Maurer forcings (Wood et al., 2002) supplied with CAMELS. Our input data include (i) daily cumulative precipitation, (ii) daily minimum air temperature, (iii) daily maximum air temperature, (iv) average short-wave radiation, and (v) vapor pressure. Furthermore, 27 CAMELS catchment characteristics were used as static input features (x_s); these were chosen as a subset of the full set of characteristics explored by Addor et al. (2017b) that are derivable from remote sensing or CONUS-wide available data products. These catchment attributes include climatic and vegetation indices, as well as soil and topographical properties (see Table A1 for an exhaustive list).

2.5 Benchmark models

The first part of this study benchmarks our proposed model against several high-quality benchmarks. The purpose of this exercise is to show that the EA-LSTM provides reasonable hydrological simulations.

To do this, we collected a set of existing hydrological models¹ that were configured, calibrated, and run by several previous studies over the CAMELS catchments. These models are (i) SAC-SMA (Burnash et al., 1973; Burnash, 1995) coupled with the Snow-17 snow routine (Anderson, 1973), hereafter referred to as SAC-SMA, (ii) VIC (Liang et al., 1994), (iii) FUSE (Clark et al., 2008; Henn et al., 2008) (three different model structures, 900, 902, 904), (iv) HBV (Seibert and Vis, 2012), and (v) mHM (Samaniego et al., 2010; Kumar et al., 2013). In some cases, these models were calibrated to individual basins, and in other cases they were not. All of these benchmark models were run by other groups – we did not run any of our own benchmarks. We chose to use existing model runs so as not to bias the calibration of the benchmarks to possibly favor our own model. Each set of simulations that we used for benchmarking is documented elsewhere in the hydrology literature (references below). Each of these benchmark models use the same daily Maurer forcings that we used with our EA-LSTM, and all were calibrated and validated on the same time period(s). These benchmark models can be distinguished into two different groups.

1. Models calibrated for each basin individually. These are SAC-SMA (Newman et al., 2017), VIC (Newman et al., 2017), FUSE², mHM (Mizukami et al., 2019), and HBV (Seibert et al., 2018). The HBV model supplied both a lower and an upper benchmark, where the lower benchmark is an ensemble mean of 1000 uncalibrated HBV models and the upper benchmark is an ensemble of 100 calibrated HBV models.

2. Models that were regionally calibrated. These share one parameter set for all basins in the dataset. Here we have calibrations of the VIC model (Mizukami et al., 2017) and mHM (Rakovec et al., 2019).

¹ Will be released on HydroShare (https://doi.org/10.4211/hs.474ecc37e7db45baa425cdb4fc1b61e1).
² The FUSE runs were generated by Nans Addor ([email protected]) and given to us by personal communication. These runs are part of current development by N. Addor on the FUSE model itself and might not reflect the final performance of the FUSE model.

2.6 Experimental setup

All model calibration and training were performed using data from the time period 1 October 1999 through 30 September 2008. All model and benchmark evaluation was done using data from the time period 1 October 1989 through 30 September 1999.

We trained a single LSTM or EA-LSTM model using calibration period data from all basins and evaluated this model using validation period data from all basins. This implies that a single parameter set (i.e., W, U, b from Eqs. 1–4 and 7–10) was trained to work across all basins.

We trained and tested the following three model configurations.

– LSTM without static inputs. A single LSTM trained on the combined calibration data from all basins, using only the meteorological forcing data and ignoring static catchment attributes.

– LSTM with static inputs. A single LSTM trained on the combined calibration data of all basins, using the meteorological features as well as the static catchment attributes. These catchment descriptors were concatenated to the meteorological inputs at each time step.

– EA-LSTM with static inputs. A single EA-LSTM trained on the combined calibration data of all basins, using the meteorological features as well as the static catchment attributes. The catchment attributes were input to the static input gate in Eq. (7), while the meteorological inputs were used at all remaining parts of the network (Eqs. 8–10).

All three model configurations were trained using the squared-error performance metrics discussed in Sect. 2.3 (MSE and NSE*). This resulted in six different model/training configurations.

To account for stochasticity in the network initialization and in the optimization procedure (we used stochastic gradient descent), all networks were trained with n = 8 different random seeds. Predictions from the different seeds were combined into an ensemble by taking the mean prediction at each time step of all n different models under each configuration. In total, we trained and tested six different settings and eight different models per setting for a total of 48 different trained LSTM-type models. For all LSTMs we used the same architecture (apart from the inclusion of a static input gate in the EA-LSTM), which we found through hyperparameter optimization (see Appendix B for more details about the hyperparameter search). The LSTMs had 256 memory cells and a single fully connected layer with a dropout rate (Srivastava et al., 2014) of 0.4. The LSTMs were run in sequence-to-value mode (as opposed to sequence-to-sequence mode), so that to predict a single (daily) discharge value required meteorological forcings from 269 preceding days, as well as the forcing data of the target day, making the input sequences 270 time steps long.
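A short sketch of how such sequence-to-value training samples could be assembled from one basin's daily data is given below (illustrative function names, assuming NumPy arrays). Predictions from the n = 8 seeds would then simply be averaged per time step to form the ensemble.

```python
import numpy as np

def make_sequence_to_value_samples(forcings, discharge, seq_length=270):
    """Sketch of sequence-to-value sample construction.

    forcings  : array of shape (n_days, n_features) with the daily
                meteorological inputs of one basin
    discharge : array of shape (n_days,) with observed daily discharge
    seq_length: time steps per input sequence (here 270, i.e., 269
                preceding days plus the target day)
    """
    x, y = [], []
    for t in range(seq_length - 1, len(discharge)):
        x.append(forcings[t - seq_length + 1 : t + 1])  # 270-day window
        y.append(discharge[t])                           # discharge at day t
    return np.stack(x), np.array(y)
```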

2.6.1 Assessing model performance

Because no one evaluation metric can fully capture the consistency, reliability, accuracy, and precision of a streamflow model, it was necessary to use a variety of performance metrics for model benchmarking (Gupta et al., 1998). Evaluation metrics used to compare models are listed in Table 1. These metrics focus specifically on assessing the ability of the model to capture high flows and low flows as well as on assessing overall performance using a decomposition of the standard squared error metrics that is less sensitive to bias (Gupta et al., 2009).

Table 1. Overview of used evaluation metrics. The notation of the original publications is kept.

Metric | Reference | Equation
Nash–Sutcliffe efficiency (NSE) | Nash and Sutcliffe (1970) | 1 − Σ_{t=1}^{T} (Q_m[t] − Q_o[t])² / Σ_{t=1}^{T} (Q_o[t] − Q̄_o)²
α-NSE decomposition | Gupta et al. (2009) | σ_s / σ_o
β-NSE decomposition | Gupta et al. (2009) | (μ_s − μ_o) / σ_o
Top 2 % peak flow bias (FHV) | Yilmaz et al. (2008) | Σ_{h=1}^{H} (QS_h − QO_h) / Σ_{h=1}^{H} QO_h × 100
Bias of FDC midsegment slope (FMS) | Yilmaz et al. (2008) | [(log(QS_m1) − log(QS_m2)) − (log(QO_m1) − log(QO_m2))] / (log(QO_m1) − log(QO_m2)) × 100
30 % low flow bias (FLV) | Yilmaz et al. (2008) | [Σ_{l=1}^{L} (log(QS_l) − log(QS_L)) − Σ_{l=1}^{L} (log(QO_l) − log(QO_L))] / Σ_{l=1}^{L} (log(QO_l) − log(QO_L)) × 100
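For orientation, the first few metrics of Table 1 could be computed along the following lines (a sketch only; the exact flow-duration-curve handling for FHV, FMS, and FLV follows Yilmaz et al., 2008, and the threshold choices here are simplifications):

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency (Nash and Sutcliffe, 1970)."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def alpha_nse(sim, obs):
    """alpha-NSE decomposition (Gupta et al., 2009): ratio of standard deviations."""
    return sim.std() / obs.std()

def beta_nse(sim, obs):
    """beta-NSE decomposition (Gupta et al., 2009): mean bias scaled by obs. std."""
    return (sim.mean() - obs.mean()) / obs.std()

def fhv(sim, obs, fraction=0.02):
    """Approximate top 2 % peak flow bias (FHV, Yilmaz et al., 2008), in percent."""
    n_peaks = max(1, int(fraction * len(obs)))
    sim_peaks = np.sort(sim)[-n_peaks:]   # highest simulated flows
    obs_peaks = np.sort(obs)[-n_peaks:]   # highest observed flows
    return np.sum(sim_peaks - obs_peaks) / np.sum(obs_peaks) * 100
```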

2.6.2 Robustness and feature ranking

All catchment attributes used in this study are derived from gridded data products (Addor et al., 2017a). Taking the catchment's mean elevation as an example, we would get different mean elevations depending on the resolution of the gridded digital elevation model. More generally, there is uncertainty in all CAMELS catchment attributes. Thus, it is important that we evaluate the robustness of our model and of our embedding layer (particular values of the 256 static input gates) to changes in the exact values of the catchment attributes. Additionally, we want some idea about the relative importance of different catchment attributes.

To estimate the robustness of the trained model to uncertainty in the catchment attributes, we added Gaussian noise N(0, σ) with increasing standard deviation to the individual attribute values and assessed resulting changes in model performance for each noise level. Concretely, additive noise was drawn from normal distributions with 10 different standard deviations: σ = [0.1, 0.2, ..., 0.9, 1.0]. All input features (both static and dynamic) were standardized (zero mean, unit variance) before training, so these perturbation sigmas did not depend on the units or relative magnitudes of the individual catchment attributes. For each basin and each standard deviation we drew 50 random noise vectors, resulting in 531 × 10 × 50 = 265 500 evaluations of each trained EA-LSTM.

To provide a simple estimate of the most important static features of the trained model, we used the method of Morris (1991). Although the Morris method is relatively simple, it has been shown to provide meaningful estimations of the global sensitivity and is widely used (e.g., Herman et al., 2013; Wang and Solomatine, 2019). The method of Morris uses an approximation of local derivatives, which can be extracted directly from neural networks without additional computations, which makes this a highly efficient method of sensitivity analysis. The method of Morris typically estimates feature sensitivities (EE_i) from local (numerical) derivatives:

EE_i = [f(x_1, ..., x_i + Δ_i, ..., x_p) − f(x)] / Δ_i.    (14)

Neural networks are completely differentiable (to allow for back-propagation) and thus it is possible to calculate the exact gradient with respect to the static input features. Thus, for neural networks the method of Morris can be applied analytically:

EE_i = lim_{Δ_i → 0} [f(x_1, ..., x_i + Δ_i, ..., x_p) − f(x)] / Δ_i = ∂f(x)/∂x_i.    (15)

This makes it unnecessary to run computationally expensive sampling methods to approximate the local gradient. Further, since we predict one time step of discharge at a time, we obtain this sensitivity measure for each static input for each day in the validation period. A global sensitivity measure for each basin and each feature is then derived by taking the average absolute gradient (Saltelli et al., 2004).

2.6.3 Analysis of catchment similarity from the embedding layer

Once the model is trained, the input gate vector (i; see Eq. 7) for each catchment is fixed for the simulation period. This results in a vector that represents an embedding of the static catchment features (here in R^27) into the high-dimensional space of the LSTM (here in R^256). The result is a set of real-valued numbers that map the catchment characteristics onto a strength, or weight, associated with each particular cell state in the EA-LSTM. This weight controls how much of the cell input (g[t]; see Eq. 9) is written into the corresponding cell state (c[t]; see Eq. 11).

Per design, our hypothesis is that the EA-LSTM will learn to group similar basins together into the high-dimensional space, so that hydrologically similar basins use similar parts of the LSTM cell states. This is dependent, of course, on the information content of the catchment attributes used as inputs, but the model should at least not degrade the quality of this information and should learn hydrologic similarity in a way that is useful for rainfall–runoff prediction. We tested this hypothesis by analyzing the learned catchment embedding from a hydrological perspective. We analyzed geographical similarity by using k-means clustering on the R^256 feature space of the input gate embedding to delineate basin groupings and then plotted the clustering results geographically. The number of clusters was determined using a mean silhouette score.

In addition to visually analyzing the k-means clustering results by plotting them spatially (to ensure that the input embedding preserved expected geographical similarity), we measured the ability of these cluster groupings to explain variance in certain hydrological signatures in the CAMELS basins. For this, we used 13 of the hydrologic signatures that were used by Addor et al. (2018): (i) mean annual discharge (q mean), (ii) runoff ratio, (iii) slope of the flow duration curve (slope-fdc), (iv) baseflow index, (v) streamflow–precipitation elasticity (stream-elas), (vi) 5th percentile flow (q5), (vii) 95th percentile flow (q95), (viii) frequency of high-flow days (high-q-freq), (ix) mean duration of high-flow events (high-q-dur), (x) frequency of low-flow days (low-q-freq), (xi) mean duration of low-flow events (low-q-dur), (xii) zero flow frequency (zero-q-freq), and (xiii) average day of year when half of the cumulative annual flow occurs (mean-hfd).

Finally, we reduced the dimension of the input gate embedding layer (from R^256 to R^2) so as to be able to visualize dominant features in the input embedding. To do this we use a dimension reduction algorithm called UMAP (McInnes et al., 2018), for "Uniform Manifold Approximation and Projection for Dimension Reduction". UMAP is based on neighbor graphs (while, e.g., principal component analysis is based on matrix factorization), and it uses ideas from topological data analysis and manifold learning techniques to guarantee that information from the high-dimensional space is preserved in the reduced space. For further details we refer the reader to the original publication by McInnes et al. (2018).
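A minimal sketch of the analytic (gradient-based) variant of Eq. (15), assuming a trained PyTorch model that maps one sequence of dynamic inputs plus the static attributes to a single discharge value (the model interface shown here is an assumption, not the original code):

```python
import torch

def static_feature_sensitivity(model, x_d, x_s):
    """Absolute gradient of one predicted discharge value w.r.t. the statics.

    model : trained EA-LSTM returning one scalar discharge value (assumed)
    x_d   : dynamic inputs, shape (seq_length, n_dynamic)
    x_s   : static catchment attributes, shape (n_static,)
    """
    x_s = x_s.clone().requires_grad_(True)  # track gradients w.r.t. statics
    y_hat = model(x_d, x_s)                 # one simulated discharge value
    y_hat.backward()                        # back-propagate to the inputs
    # |dy/dx_s| is the analytic elementary effect of each static feature;
    # averaging over all days and basins gives the global measure.
    return x_s.grad.abs()
```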
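The embedding analysis itself could be sketched as follows with scikit-learn (Pedregosa et al., 2011) and the umap-learn package; the file name holding the per-basin input-gate vectors is hypothetical, and the cluster-number range is illustrative.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_score
import umap

# assumed: per-basin input gate vectors i from Eq. (7), shape (n_basins, 256)
embedding = np.load("input_gate_embedding.npy")  # hypothetical file

# choose the number of clusters via the mean silhouette score
scores = {}
for k in range(2, 15):
    labels = KMeans(n_clusters=k, random_state=0).fit_predict(embedding)
    scores[k] = silhouette_score(embedding, labels)
best_k = max(scores, key=scores.get)
cluster_labels = KMeans(n_clusters=best_k, random_state=0).fit_predict(embedding)

# reduce the 256-dimensional embedding to 2-D for visualization
embedding_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(embedding)
```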

3 Results

This section is organized as follows.

– The first subsection (Sect. 3.1) presents a comparison between the three different LSTM-type model configurations discussed in Sect. 2.6.1. The emphasis in this comparison is to examine the effect of adding catchment attributes as additional inputs to the LSTM using the standard vs. adapted EA-LSTM architectures.

– The second subsection (Sect. 3.2) presents results from our benchmarking analysis, that is, the direct comparison between the performances of our EA-LSTM model with the full set of benchmark models outlined in Sect. 2.5.

– The third subsection (Sect. 3.3) presents results of the sensitivity analysis outlined in Sect. 2.6.2.

– The final subsection (Sect. 3.4) presents an analysis of the EA-LSTM embedding layer to demonstrate that the model learned how to differentiate between different rainfall–runoff behaviors across different catchments.

3.1 Comparison between LSTM modeling approaches

The key results from a comparison between the LSTM approaches are in Fig. 3, which shows the cumulative density functions (CDFs) of the basin-specific NSE values for all six LSTM models (three model configurations and two loss functions) over the 531 basins.

Figure 3. Cumulative density functions of the NSE for all LSTM-type model configurations described in Sect. 2.6.1. For each model type the ensemble mean and one of the n = 8 repetitions are shown. LSTM configurations are shown in orange (with catchment attributes) and purple (without catchment attributes), and the EA-LSTM configurations (always with catchment attributes) are shown in green.

Table 2 contains average key overall performance statistics. Statistical significance was evaluated using the paired Wilcoxon test (Wilcoxon, 1945), and the effect size was evaluated using Cohen's d (Cohen, 1988). The comparison contains four key results.

i. Using catchment attributes as static input features improves overall model performance as compared with not providing the model with catchment attributes. This is expected, but worth confirming.

ii. Training against the basin-average NSE* loss function improves overall model performance as compared with training against an MSE loss function, especially in the low NSE spectra.

iii. There is a statistically significant difference between the performance of the standard LSTM with static input features and the EA-LSTM, however, with a small effect size.

iv. Some of the error in the LSTM-type models is due to randomness in the training procedure and can be mitigated by running model ensembles.

Related to result (i), there was a significant difference between LSTMs with standard architecture trained with vs. without static features (square vs. triangle markers in Fig. 3). The mean (over basins) NSE improved in comparison with the LSTM that did not take catchment characteristics as inputs by 0.44 (range (0.38, 0.56)) when optimized using the MSE and 0.30 (range (0.22, 0.43)) when optimized using the basin-average NSE*. To assess statistical significance for single models, we first calculated the mean basin performance, i.e., the mean SE per basin across the eight repetitions. The mean basin performance thus derived was then used for the test of significance. To assess statistical significance for ensemble means, the ensemble prediction (i.e., the mean discharge prediction of the eight model repetitions) was used to compare between different model approaches. For models trained using the standard MSE loss function, the p value for the single model was p = 1.2 × 10⁻⁷⁵ and the p value between the ensemble means was p = 4 × 10⁻⁶⁸. When optimized using the basin-average NSE*, the p value for the single model was p = 8.8 × 10⁻⁸¹ and the p value between the ensemble means was p = 3.3 × 10⁻⁷⁵.

It is worth emphasizing that the improvement in overall model performance due to including catchment attributes implies that these attributes contain information that helps to distinguish different catchment-specific rainfall–runoff behaviors. This is especially interesting since these attributes are derived from remote sensing and other ubiquitously available data products, as described by Addor et al. (2017b). Our benchmarking analysis presented in the next subsection (Sect. 3.2) shows that this information content is sufficient to perform high-quality regional modeling (i.e., competitive with lumped models calibrated separately for each basin).
Table 2. Evaluation results of the single models and ensemble means.

Model | NSE mean | NSE median | No. of basins with NSE ≤ 0
LSTM without static inputs, using MSE – single model | 0.24 (±0.049) | 0.60 (±0.005) | 44 (±4)
LSTM without static inputs, using MSE – ensemble mean (n = 8) | 0.36 | 0.65 | 31
LSTM without static inputs, using NSE* – single model | 0.39 (±0.059) | 0.59 (±0.008) | 28 (±3)
LSTM without static inputs, using NSE* – ensemble mean (n = 8) | 0.49 | 0.64 | 20
LSTM with static inputs, using MSE – single model | 0.66 (±0.012) | 0.73 (±0.003) | 6 (±2)
LSTM with static inputs, using MSE – ensemble mean (n = 8) | 0.71 | 0.76 | 3
LSTM with static inputs, using NSE* – single model | 0.69 (±0.013) | 0.73 (±0.002) | 2 (±1)
LSTM with static inputs, using NSE* – ensemble mean (n = 8) | 0.72 | 0.76 | 2
EA-LSTM, using MSE – single model | 0.63 (±0.018) | 0.71 (±0.005) | 9 (±1)
EA-LSTM, using MSE – ensemble mean (n = 8) | 0.68 | 0.74 | 6
EA-LSTM, using NSE* – single model | 0.67 (±0.006) | 0.71 (±0.005) | 3 (±1)
EA-LSTM, using NSE* – ensemble mean (n = 8) | 0.70 | 0.74 | 2

NSE: Nash–Sutcliffe efficiency, (−∞, 1]; values closer to 1 are desirable.

Related to result (ii), using the basin-average NSE* loss function instead of a standard MSE loss function improved performance for single models (different individual seeds) as well as for the ensemble means across all model configurations (see Table 2). The differences are most pronounced for the EA-LSTM and for the LSTM without static features. For the EA-LSTM, the mean NSE for the single model increased from 0.63 when optimized with MSE to 0.67 when optimized with the basin average NSE*. For the LSTM trained without catchment characteristics the mean NSE went from 0.23 when optimized with MSE to 0.39 when optimized with NSE*. Further, the median NSE did not change significantly depending on loss function due to the fact that the improvements from using the NSE* are mostly to performance in basins at the lower end of the NSE spectra (see also Fig. 3, dashed vs. solid lines). This is as expected as catchments with relatively low average flows have a small influence on (LSTM) training with an MSE loss function, which results in poor performance in these basins. Using the NSE* loss function helps to mitigate this problem. It is important to note that this is not the only reason why certain catchments have low skill scores, which can happen for a variety of reasons with any type of hydrological model (e.g., bad input data, unique catchment behaviors). This improvement at the low-performance end of the spectrum can also be seen by looking at the number of "catastrophic failures", i.e., basins with an NSE value of less than zero. Across all models we see a reduction in this number when optimized with the basin average NSE*, compared to optimizing with MSE.

Related to result (iii), Fig. 3 shows a small difference in the empirical CDFs between the standard LSTM with static input features and the EA-LSTM under both loss functions (compare green vs. orange lines). The difference is significant (p value for single model p = 1 × 10⁻²⁸, p value for the ensemble mean p = 2.1 × 10⁻²⁶, paired Wilcoxon test); however, the effect size is small: d = 0.055. This is important because the embedding layer in the EA-LSTM adds a layer of interpretability to the LSTM, which we argue is desirable for scientific modeling in general and is useful in our case for understanding catchment similarity. This is only useful, however, if the EA-LSTM does not sacrifice performance compared to the less interpretable traditional LSTM. There is some small performance sacrifice in this case, likely due to an increase in the number of tunable parameters in the network, but the benefit of this small reduction in performance is explicability.
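The significance test and effect size used throughout this section could be reproduced along the following lines (a sketch with SciPy; the pooled-standard-deviation form of Cohen's d shown here is one common choice and an assumption on our part):

```python
import numpy as np
from scipy.stats import wilcoxon

def compare_models(nse_model_a, nse_model_b):
    """Paired Wilcoxon test (Wilcoxon, 1945) and Cohen's d (Cohen, 1988).

    nse_model_a, nse_model_b : per-basin NSE values of two models,
                               arrays of shape (n_basins,)
    """
    statistic, p_value = wilcoxon(nse_model_a, nse_model_b)

    diff = nse_model_a.mean() - nse_model_b.mean()
    pooled_std = np.sqrt((nse_model_a.std(ddof=1) ** 2 +
                          nse_model_b.std(ddof=1) ** 2) / 2.0)
    cohens_d = diff / pooled_std
    return p_value, cohens_d
```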

Related to result (iv), in all cases there were several basins with very low NSE values (this is also true for the benchmark models, which we will discuss in Sect. 3.2). Using catchment characteristics as static input features with the EA-LSTM architecture reduced the number of such basins from 44 (31) to 9 (6) for the average single model (ensemble mean) when optimized with the MSE and from 28 (20) to 3 (2) for the average single model (ensemble mean) if optimized using the basin-average NSE*. This result is worth emphasizing: each LSTM or EA-LSTM trained over all basins results in a certain number of basins that perform poorly (NSE ≤ 0), but the basins where this happens are not always the same. The model outputs, and therefore the number of catastrophic failures, differ depending on the randomness in the weight initialization and optimization procedure and, thus, running an ensemble of LSTMs substantively reduces this effect. This is good news for deep learning – it means that at least a portion of uncertainty can be mitigated using model ensembles. We leave as an open question for future research how many ensemble members, as well as how these are initialized, should be used to minimize uncertainty for a given dataset.

3.2 Model benchmarking: EA-LSTM vs. calibrated hydrology models

The results in this section are calculated from 447 basins that were modeled by all benchmark models, as well as our EA-LSTM. In this section, we concentrate on benchmarking the EA-LSTM; however, for the sake of completeness, we added the results of the LSTM with static inputs to all figures and tables.

First we compared the EA-LSTM against the two hydrological models that were regionally calibrated (VIC and mHM). Specifically, what was calibrated for each model was a single set of transfer functions that map from static catchment characteristics to model parameters. The procedure for parameterizing these models for regional simulations is described in detail by the original authors: Mizukami et al. (2017) for VIC and Rakovec et al. (2019) for mHM. Figure 4 shows that the EA-LSTM outperformed both regionally calibrated benchmark models by a large margin. Even the LSTM trained without static catchment attributes (only trained on meteorological forcing data) outperformed both regionally calibrated models consistently as a single model, and even more so as an ensemble.

Figure 4. Cumulative density functions of the NSE of two regionally calibrated benchmark models (VIC and mHM), compared to the EA-LSTM and the LSTM trained with and without static input features.

The mean and median NSE scores across the basins of the individual EA-LSTM models (N_ensemble = 8) were 0.67 ± 0.006 (0.71) and 0.71 ± 0.004 (0.74), respectively. In contrast, VIC had a mean NSE of 0.17 and a median NSE of 0.31 and the mHM had a mean NSE of 0.44 and a median NSE of 0.53. Overall, VIC scored higher than the EA-LSTM ensemble in 2 out of 447 basins (0.4 %) and mHM scored higher than the EA-LSTM ensemble in 16 basins (3.58 %). Investigating the number of catastrophic failures (the number of basins where NSE ≤ 0), the average single EA-LSTM failed in approximately 2 basins out of 447 basins (0.4 ± 0.2 %) and the ensemble mean of the EA-LSTM failed in only a single basin (i.e., 0.2 %). In comparison, mHM failed in 29 basins (6.49 %) and VIC failed in 41 basins (9.17 %).

Second, we compared our multi-basin calibrated EA-LSTMs to individual-basin calibrated hydrological models. This is a more rigorous benchmark than the regionally calibrated models, since hydrological models usually perform better when trained for specific basins. Figure 5 compares CDFs of the basin-specific NSE values for all benchmark models over the 447 basins. Table 3 contains the performance statistics for these benchmark models as well as for the recalculated EA-LSTM.

The main benchmarking result is that the EA-LSTM significantly outperforms all benchmark models in the overall NSE. The two best-performing hydrological models were the ensemble (n = 100) of basin-calibrated HBV models and a single basin-calibrated mHM model. The EA-LSTM outperformed both of these models at any reasonable alpha level. The p value for the single model, compared to the HBV upper bound, was p = 1.9 × 10⁻⁴ and for the ensemble mean p = 6.2 × 10⁻¹¹ with a medium effect size (Cohen's d for single model d = 0.22 and for the ensemble mean d = 0.40). The p value for the single model, compared to the basin-wise calibrated mHM, was p = 4.3 × 10⁻⁶ and for the ensemble mean p = 1.0 × 10⁻¹³ with a medium effect size (Cohen's d for single model d = 0.26 and for the ensemble mean d = 0.45).
Table 3. Comparison of the EA-LSTM and LSTM (with static inputs) average single model and ensemble mean to the full set of benchmark models. VIC (basin) and mHM (basin) denote the basin-wise calibrated models, while VIC (CONUS) and mHM (CONUS) denote the CONUS-wide calibrated models. HBV (lower) denotes the ensemble mean of n = 1000 uncalibrated HBVs, while HBV (upper) denotes the ensemble mean of n = 100 calibrated HBVs (for details, see Seibert et al., 2018). For the FUSE model, the numbers behind the name denote different FUSE model structures. All statistics were calculated from the validation period of all 447 commonly modeled basins.

Model | NSE mean (a) | NSE median (a) | No. of basins with NSE ≤ 0 | α-NSE median (b) | β-NSE median (c) | FHV median (d) | FMS median (e) | FLV median (f)
EA-LSTM single | 0.674 (±0.006) | 0.714 (±0.004) | 2 (±1) | 0.82 (±0.013) | −0.03 (±0.009) | −16.9 (±1.1) | −10.0 (±1.7) | 2.0 (±7.6)
EA-LSTM ensemble | 0.705 | 0.742 | 1 | 0.81 | −0.03 | −18.1 | −11.3 | 31.9
LSTM single | 0.685 (±0.015) | 0.731 (±0.002) | 1 (±0) | 0.85 (±0.011) | −0.03 (±0.007) | −14.8 (±1.1) | −8.3 (±1.2) | 26.5 (±7.6)
LSTM ensemble | 0.718 | 0.758 | 1 | 0.84 | −0.03 | −15.7 | −8.8 | 55.1
SAC-SMA | 0.564 | 0.603 | 13 | 0.78 | −0.07 | −20.4 | −14.3 | 37.3
VIC (basin) | 0.518 | 0.551 | 10 | 0.72 | −0.02 | −28.1 | −6.6 | −70.0
VIC (CONUS) | 0.167 | 0.307 | 41 | 0.46 | −0.07 | −56.5 | −28.0 | 17.4
mHM (basin) | 0.627 | 0.666 | 7 | 0.81 | −0.04 | −18.6 | −7.2 | 11.4
mHM (CONUS) | 0.442 | 0.527 | 29 | 0.59 | −0.04 | −40.2 | −30.4 | 36.4
HBV (lower) | 0.237 | 0.416 | 35 | 0.58 | −0.02 | −41.9 | −15.9 | 23.9
HBV (upper) | 0.631 | 0.676 | 9 | 0.79 | −0.01 | −18.5 | −24.9 | 18.3
FUSE (900) | 0.587 | 0.639 | 12 | 0.80 | −0.03 | −18.9 | −5.1 | −11.4
FUSE (902) | 0.611 | 0.650 | 10 | 0.80 | −0.05 | −19.4 | 9.6 | −33.2
FUSE (904) | 0.582 | 0.622 | 9 | 0.78 | −0.07 | −21.4 | 15.5 | −66.7

(a) Nash–Sutcliffe efficiency: (−∞, 1]; values closer to one are desirable. (b) α-NSE decomposition: (0, ∞); values close to one are desirable. (c) β-NSE decomposition: (−∞, ∞); values close to zero are desirable. (d) Top 2 % peak flow bias: (−∞, ∞); values close to zero are desirable. (e) Bias of FDC midsegment slope: (−∞, ∞); values close to zero are desirable. (f) 30 % low flow bias: (−∞, ∞); values close to zero are desirable.

Regarding all other metrics except the Kling–Gupta decomposition of the NSE, there was no statistically significant difference between the EA-LSTM and the two best-performing hydrological models. The β decomposition of the NSE measures a scaled difference in simulated vs. observed mean streamflow values, and in this case the HBV benchmark performed better than the EA-LSTM, with an average scaled absolute bias (normalized by the root variance of observations) of −0.01, whereas the EA-LSTM had an average scaled bias of −0.03 for the individual model as well as for the ensemble (p = 3.5 × 10⁻⁴).

3.3 Robustness and feature ranking

In Sect. 3.1, we found that adding static features provided a large boost in performance. We would like to check that the model is not simply "remembering" each basin instead of learning a general relation between static features and catchment-specific hydrologic behavior. To this end, we examined model robustness with respect to noisy perturbations of the catchment attributes. Figure 6 shows the results of this experiment by comparing the model performance when forced (not trained) with perturbed static features in each catchment against model performance using the same static feature values that were used for training. As expected, the model performance degrades with increasing noise in the static inputs. However, the degradation does not happen abruptly, but smoothly with increasing levels of noise, which is an indication that the LSTM is not overfitting on the static catchment attributes. That is, it is not remembering each basin with its set of attributes exactly, but rather learns a smooth mapping between attributes and model output. To reiterate from Sect. 2.6.2, the perturbation noise is always relative to the overall standard deviation of the static features across all catchments, which is always σ = 1 (i.e., all static input features were normalized prior to training). When noise with small standard deviation was added (e.g., σ = 0.1 and σ = 0.2) the mean and median NSE were relatively stable. The median NSE decreased from 0.71 without noise to 0.48 with an added noise equal to the total variance

www.hydrol-earth-syst-sci.net/23/5089/2019/ Hydrol. Earth Syst. Sci., 23, 5089–5110, 2019

39
5100 F. Kratzert et al.: Learning universal hydrological behaviors via machine learning

Figure 5. Cumulative density function of the NSE for all basin-


Figure 6. Boxplot showing degradation of model performance with
wise calibrated benchmark models compared to the EA-LSTM and
increasing noise level added to the catchment attributes. Orange
the LSTM with static input features.
lines denote the median across catchments, green markers represent
means across catchments, box denote the 25th and 75th percentiles,
whiskers denote the 5th and 95th percentiles, and circles are catch-
of the input features (σ = 1). This is roughly similar to the ments that fall outside the 5th–95th percentile range.
performance of the LSTM without static input features (Ta-
ble 2). In contrast, the lower percentiles of the NSE distribu-
tions were more strongly affected by input noise. For exam-
ple, the 1st (5th) percentile of the NSE values decreased from
an NSE of 0.13 (0.34) to −5.87 (−0.94) when going from Table 4 provides an overall ranking of dominant sensi-
zero noise (the catchment attribute data from CAMELS) to tivities for one of the eight model repetitions of the EA-
additive noise with variance equal to the total variance of the LSTM. These were derived by normalizing the sensitivity
inputs (i.e., σ = 1). This confirms that static features are es- measures per basin to the range (0, 1) and then calculating
pecially helpful for increasing performance in basins at the the overall mean across all features. As might be inferred
lower end of the NSE spectrum, that is, differentiating hy- from Fig. 7, the most sensitive catchment attributes are topo-
drological behaviors that are underrepresented in the training logical features (mean elevation and catchment area) and cli-
dataset. mate indices (mean precipitation, aridity, duration of high-
Figure 7 plots a spatial map where each basin is labeled precipitation events, and the fraction of precipitation falling
corresponding to the most sensitive catchment attribute de- as snow). Certain groups of catchment attributes did not typ-
rived from the explicit Morris method for neural networks ically provide much additional information. These include
(Sect. 2.6.2). In the Appalachian Mountains, sensitivity in vegetation indices like maximum leaf area index or maxi-
most catchments is dominated by topological features (e.g., mum green vegetation fraction as well as the annual vege-
mean catchment elevation and catchment area), and in the tation differences. Most soil features were at the lower end
eastern US more generally, sensitivity is dominated by cli- of the feature ranking. This sensitivity ranking is interesting
mate indices (e.g., mean precipitation, high precipitation du- in that most of the top-ranked features are relatively easy to
ration). Meteorological patterns like aridity and mean precip- measure or estimate globally from readily available gridded
itation become more important as we move away from the data products. Soil maps are one of the hardest features to
Appalachians and towards the Great Plains, likely because obtain accurately at a regional scale because they require ex-
elevation and slope begin to play less of a role. The aridity tensive in situ mapping and interpolation. Note that the re-
index dominates sensitivity in the Central Great Plains. In sults between the eight model repetitions (not shown here)
the Rocky Mountains most basins are sensitive climate in- vary slightly in terms of sensitivity values and ranks. How-
dices (mean precipitation and high precipitation duration), ever, the quantitative ranking is robust between all eight rep-
with some sensitivity to vegetation in the Four Corners re- etitions, meaning that climate indices (e.g., aridity and mean
gion (northern New Mexico). In the West Coast there is a precipitation) and topological features (e.g., catchment area
wider variety of dominant sensitivities, reflecting a diversity and mean catchment elevation) are always ranked highest,
of catchments. while soil and vegetation features are of less importance and

Hydrol. Earth Syst. Sci., 23, 5089–5110, 2019 www.hydrol-earth-syst-sci.net/23/5089/2019/

40 Chapter 3 Publications
F. Kratzert et al.: Learning universal hydrological behaviors via machine learning 5101
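A minimal sketch of the robustness experiment described in Sect. 3.3 is given below: Gaussian noise with standard deviation σ is added to the (already normalized) static attributes before they are passed to a trained model, and the resulting per-basin NSE values are compared against the unperturbed case. The loader and model interfaces shown here are placeholders for illustration, not the interfaces of the published code.

```python
import numpy as np

def perturbed_attributes(static_attrs: np.ndarray, sigma: float, seed: int = 0) -> np.ndarray:
    """Add Gaussian noise to normalized static features (overall std of the features = 1)."""
    rng = np.random.default_rng(seed)
    return static_attrs + rng.normal(0.0, sigma, size=static_attrs.shape)

def nse_under_noise(model, basins, sigma):
    """Evaluate a trained model when forced (not trained) with perturbed attributes."""
    scores = {}
    for basin in basins:
        forcings, static_attrs, obs = basin.load()      # hypothetical data loader
        sim = model.simulate(forcings, perturbed_attributes(static_attrs, sigma))
        # Nash-Sutcliffe efficiency of the perturbed simulation
        scores[basin.id] = 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)
    return scores

# Comparing median NSE for increasing noise levels, e.g.:
# for sigma in [0.0, 0.1, 0.2, 0.5, 1.0]:
#     print(sigma, np.median(list(nse_under_noise(model, basins, sigma).values())))
```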

Figure 7. Spatial map of all basins in the dataset. Markers denote the individual catchment characteristic with the highest sensitivity value
for each particular basin.

3.4 Analysis of catchment similarity from the embedding layer

Kratzert et al. (2018a, 2019) showed that these LSTM networks are able to learn to model snow and store this information in specific memory cells without ever directly training on any type of snow-related observation data other than total precipitation and temperature. Multiple types of catchments will use snow-related states in mixture with other states that represent other processes or combinations of processes. The memory cells allow an interpretation along the time axis for each specific basin and are part of both the standard LSTM and the EA-LSTM. A more detailed analysis of the specific functionality of individual cell states is out-of-scope for this paper and will be part of future work. Here, we focus on analysis of the embedding layer, which is a unique feature of the EA-LSTM.

From each of the trained EA-LSTM models, we calculated the input gate vector (Eq. 7) for each basin. The raw EA-LSTM embedding from one of the models trained over all catchments is shown in Fig. 8. Yellow colors indicate that a particular one of the 256 cell states is activated and contributes to the simulation of a particular catchment. Blue colors indicate that a particular cell state is not used for a particular catchment. These (real-valued) activations are a function of the 27 catchment characteristics input into the static feature layer of the EA-LSTM.

Figure 8. Input gate activations (y axis) for all 531 basins (x axis). The basins are ordered from left to right according to the ascending eight-digit USGS gauge ID. Yellow colors denote open input gate cells and blue colors denote closed input gate cells for a particular basin.

The embedding layer is necessarily high dimensional – in this case R256 – due to the fact that the LSTM layer of the model requires sufficient cell states to simulate a wide variety of catchments. Ideally, hydrologically similar catchments should utilize overlapping parts of the LSTM network – this would mean that the network is both learning and using catchment similarity to train a regionalizable simulation model.
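Because the embedding is simply the static input gate evaluated on a basin's attributes, it can be extracted with a single matrix-vector product once the gate parameters of a trained model are available. The following sketch assumes those parameters are given; W_s and b_s are placeholder names and not identifiers from the published code.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def basin_embedding(static_attrs: np.ndarray, W_s: np.ndarray, b_s: np.ndarray) -> np.ndarray:
    """Static input gate of the EA-LSTM: i = sigmoid(W_s @ x_s + b_s).

    static_attrs: normalized catchment attributes, shape (27,)
    returns: input gate activations, shape (256,), one value per LSTM cell
    """
    return sigmoid(W_s @ static_attrs + b_s)

# Stacking the embeddings of all basins gives the matrix visualized in Fig. 8:
# embedding = np.stack([basin_embedding(x, W_s, b_s) for x in all_static_attrs])  # (531, 256)
```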

Table 4. Feature ranking derived from the explicit Morris method


for one of the EA-LSTM model repetitions.

Rank Catchment characteristic Sensitivity


1. Mean precipitation 0.68
2. Aridity 0.56
3. Area 0.50
4. Mean elevation 0.46
5. High precip. duration 0.41
6. Fraction of snow 0.41
7. High precip. frequency 0.38
8. Mean slope 0.37
9. Geological permeability 0.35
10. Frac. of carbonate sedimentary rock 0.34
11. Clay fraction 0.33
12. Mean PET 0.31
13. Low precip. frequency 0.30
14. Soil depth to bedrock 0.27
15. Precip. seasonality 0.27
16. Frac. of forest 0.27
17. Sand fraction 0.26
18. Saturated hyd. conductivity 0.24
19. Low precip. duration 0.22
20. Max. green veg. frac. (GVF) 0.21
21. Annual GVF diff. 0.21
22. Annual leaf area index (LAI) diff. 0.21
23. Volumetric porosity 0.19
24. Soil depth 0.19
25. Max. LAI 0.19
26. Silt fraction 0.18
27. Max. water content 0.16
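The ranking in Table 4 follows the procedure described in Sect. 3.3: per-basin sensitivity measures are rescaled to the range (0, 1) and then averaged over all basins. A minimal NumPy sketch (array and function names are illustrative):

```python
import numpy as np

def rank_features(sensitivities: np.ndarray, feature_names: list):
    """sensitivities: array of shape (n_basins, n_features) with raw Morris-type
    sensitivity measures. Each basin is rescaled to (0, 1) before averaging."""
    per_basin_min = sensitivities.min(axis=1, keepdims=True)
    per_basin_range = sensitivities.max(axis=1, keepdims=True) - per_basin_min
    normalized = (sensitivities - per_basin_min) / per_basin_range
    mean_sensitivity = normalized.mean(axis=0)
    order = np.argsort(mean_sensitivity)[::-1]
    return [(feature_names[i], float(mean_sensitivity[i])) for i in order]
```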

Figure 9. Mean and minimum silhouette scores over varying cluster sizes. For the LSTM embeddings, the line denotes the mean of the n = 8 repetitions and the vertical lines the standard deviation over 10 random restarts of the k-means clustering algorithm.

To assess whether this happened, we first performed a clustering analysis on the R256 embedding space using k-means with a Euclidean distance criterion. We compared this with a k-means clustering analysis using directly the 27 catchment characteristics to see whether there was a difference in clusters before vs. after the transformation into the embedding layer – remember that this transform was informed by rainfall–runoff training data. To choose an appropriate cluster size, we looked at the mean (and minimum) silhouette value. Silhouette values measure within-cluster similarity and range between [−1, 1], with positive values indicating a high degree of separation between clusters and negative values indicating a low degree of separation between clusters. The mean and minimum silhouette values for different cluster sizes are shown in Fig. 9. In all cases with cluster sizes less than 15, we see that clustering by the values of the embedding layer provides more distinct catchment clusters than when clustering by the raw catchment attributes. This indicates that the EA-LSTM is able to use catchment attribute data to effectively cluster basins into distinct groups.

The highest mean silhouette value from clustering with the raw catchment attributes was k = 6 and the highest mean silhouette value from clustering with the embedding layer was k = 5. Ideally, these clusters would be related to hydrologic behavior. To test this, Fig. 10 shows the fractional reduction in variance of 13 hydrologic signatures due to clustering by both raw catchment attributes vs. by the EA-LSTM embedding layer. Ideally, the within-cluster variance of any particular hydrological signature should be as small as possible, so that the fractional reduction in variance is as large (close to one) as possible. In both the k = 5 and k = 6 cluster examples, clustering by the EA-LSTM embedding layer reduced variance in the hydrological signatures by more or approximately the same amount as by clustering on the raw catchment attributes. The exception to this was the hfd-mean date, which represents an annual timing process (i.e., the day of year when the catchment releases half of its annual flow). This indicates that the EA-LSTM embedding layer largely preserves the information content about hydrological behaviors while overall increasing distinctions between groups of similar catchments. The EA-LSTM was able to learn about hydrologic similarity between catchments by directly training on both catchment attributes and rainfall–runoff time series data. Remember that the EA-LSTMs were trained on the time series of streamflow data that these signatures were calculated from, but were not trained directly on these hydrologic signatures.
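The clustering comparison above can be sketched compactly with scikit-learn and NumPy: k-means under a Euclidean criterion, mean and minimum silhouette values over a range of cluster sizes (Fig. 9), and the fractional reduction of within-cluster variance of a hydrologic signature (Fig. 10). Array and function names are illustrative and not taken from the published code.

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.metrics import silhouette_samples

def silhouette_over_k(features, k_values=range(2, 15), n_restarts=10):
    """Mean/minimum silhouette value of k-means clusterings for several cluster sizes.
    features: (n_basins, n_dims) array, e.g. the (531, 256) embedding or the (531, 27) attributes."""
    results = {}
    for k in k_values:
        means, minima = [], []
        for seed in range(n_restarts):
            labels = KMeans(n_clusters=k, n_init=10, random_state=seed).fit_predict(features)
            scores = silhouette_samples(features, labels)
            means.append(scores.mean())
            minima.append(scores.min())
        results[k] = (float(np.mean(means)), float(np.mean(minima)))
    return results

def fractional_variance_reduction(signature, labels):
    """1 - (weighted within-cluster variance) / (total variance) of one hydrologic signature."""
    within = sum(signature[labels == c].var() * np.mean(labels == c) for c in np.unique(labels))
    return 1.0 - within / signature.var()
```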
Figure 10. Fractional reduction in variance about different hydrological signatures due to k-means clustering on catchment attributes vs. the EA-LSTM embedding layer.

Clustering maps for k = 5 and k = 6 are shown in Fig. 11. Although latitude and longitude were not part of the catchment attributes vector that was used as input into the embedding layer, both the raw catchment attributes and the embedding layer clearly delineated catchments that correspond to different geographical regions within the CONUS.

To visualize the high-dimensional embedding learned by the EA-LSTM, we used UMAP (McInnes et al., 2018) to project the full R256 embedding onto R2. Figure 12 shows results of the UMAP transformation for one of the eight EA-LSTMs. In each subplot in Fig. 12, each point corresponds to one basin. The absolute values of the transformed embedding are not of particular interest, but we are interested in the relative arrangement of the basins in this two-dimensional space. Because this is a reduced-dimension transformation, the fact that there are four clear clusters of basins does not necessarily indicate that these are the only distinct basin clusters in our 256-dimensional embedding layer (as we saw above). Figure 12 shows that there is strong interaction between the different catchment characteristics in this embedding layer. For example, high-elevation dry catchments with low forest cover are in the same cluster as low-elevation wet catchments with high forest cover (see cluster B in Fig. 12). These two groups of catchments share parts of their network functionality in the LSTM, whereas highly seasonal catchments activate a different part of the network. Additionally, there are two groups of basins with high forest fractions (clusters A and B); however, if we also consider the mean annual green vegetation difference, both of these clusters are quite distinct. Cluster A in the upper left of each subplot in Fig. 12 contains forest-type basins with a high annual variation in the green vegetation fraction (possibly deciduous forests) and cluster B on the right has almost no annual variation (possibly coniferous forests). One feature that does not appear to affect catchment groupings (i.e., apparently acts independently of other catchment characteristics) is basin size – large and small basins are distributed throughout the three UMAP clusters. To summarize, this analysis demonstrates that the EA-LSTM is able to learn complex interactions between catchment attributes, which allows for grouping of different basins (i.e., choosing which cell states in the LSTM any particular basin or group of basins will use) in ways that account for interaction between different catchment attributes.

4 Discussion and conclusion

The EA-LSTM is an example of what Razavi and Coulibaly (2013) called a model-independent method for regional modeling. We cited Besaw et al. (2010) as an earlier example of this type of approach, since they used classical feed-forward neural networks. In our case, the EA-LSTM achieved state-of-the-art results, outperforming multiple locally and regionally calibrated benchmark models. These benchmarking results are arguably a pivotal part of this paper.

The results of the experiments described above demonstrate that a single “universal” deep learning model can learn both regionally consistent and location-specific hydrologic behaviors. The innovation in this study – besides benchmarking the LSTM family of rainfall–runoff models – was to add a static embedding layer in the form of our EA-LSTM. This model offered similar performance as compared with a conventional LSTM (Sect. 3.1) but offers a level of interpretability about how the model learns to differentiate aspects of complex catchment-specific behaviors (Sect. 3.3 and 3.4).

Figure 11. Clustering maps for the LSTM embeddings (a, b) and the raw catchment attributes (c, d) using k = 5 clusters (a, c, optimal choice
for LSTM embeddings) and k = 6 clusters (b, d, optimal choice for the raw catchment attributes).

Figure 12. UMAP transformation of the R256 EA-LSTM catchment embedding onto R2 . Each dot in each subplot corresponds to one basin.
The colors denote specific catchment attributes (notated in subplot titles) for each particular basin. In the upper left plot, clusters are encircled
and named to facilitate the description in the text.
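A two-dimensional view like the one in Fig. 12 only requires projecting the 256-dimensional embedding with UMAP and coloring the points by a catchment attribute. Below is a minimal, self-contained sketch using the umap-learn package; the random arrays stand in for the actual embedding and attribute values, and the UMAP settings shown are not necessarily those used for the figure.

```python
import numpy as np
import umap
import matplotlib.pyplot as plt

# Placeholder data with the shapes used in the paper: 531 basins, 256 embedding dimensions.
embedding = np.random.rand(531, 256)   # EA-LSTM input gate activations per basin
aridity = np.random.rand(531)          # one catchment attribute used for coloring

# Project the embedding from R256 onto R2.
embedding_2d = umap.UMAP(n_components=2, random_state=0).fit_transform(embedding)

fig, ax = plt.subplots()
sc = ax.scatter(embedding_2d[:, 0], embedding_2d[:, 1], c=aridity, s=10)
fig.colorbar(sc, ax=ax, label="aridity")
plt.show()
```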

In a certain sense, this is similar to the aforementioned MPR approach, which links its model parameters to the given spatial characteristics (in a nonlinear way, by using transfer functions) but has a fixed model structure to work with. In comparison, our EA-LSTM links catchment characteristics to the dynamics of specific sites and learns the overall model from the combined data of all catchments. Again, the critical take-away, in our opinion, is that the EA-LSTM learns a single model from large catchment datasets in a way that explicitly incorporates local (catchment) similarities and differences.
Neural networks generally require a lot of training data (our unpublished results indicate that it is often difficult to reliably train an LSTM) at a single catchment, even with multi-decade data records, and adding the ability for the LSTM architecture to transfer information from similar catchments is critical for this to be a viable approach for regional modeling. This is in contrast with traditional hydrological modeling and model calibration, which typically has the best results when models are calibrated independently for each basin. This property of classical models is somewhat problematic, since it has been observed that the spatial patterns of model parameters obtained by ad hoc extrapolations based on calibrated parameters from reference catchments can lead to unrealistic parameter fields and spatial discontinuities of the hydrological states (Mizukami et al., 2017). As shown in Sect. 3.4, this does not occur with our proposed approach. Thus, by leveraging the ability of deep learning to simultaneously learn time series relationships and also spatial relationships in the same predictive framework, we sidestep many problems that are currently associated with the estimation and transfer of hydrologic model parameters.

Moving forward, it is worth mentioning that treating catchment attributes as static is a strong assumption (especially over long time periods), which is not necessarily reflected in the real world. In reality, catchment attributes may continually change at various timescales (e.g., vegetation, topography, pedology, climate). In future studies it will be important to develop strategies to derive analogs to our embedding layer that allow for dynamic or evolving catchment attributes or features – perhaps ones that act on raw remote sensing data inputs rather than aggregated indexes derived from time series of remote sensing products. In principle, our embedding layer could learn directly from raw brightness temperatures, since there is no requirement that the inputs be hydrologically relevant – only that these inputs are related to hydrological behavior. A dynamic input gate is, at least in principle, possible without significant modification to the proposed EA-LSTM approach, for example, by using a separate sequence-to-sequence LSTM that encodes time-dependent catchment observables (e.g., from climate models or remote sensing) and feeds an embedding layer that is updated at each time step. This would allow the model to “learn” a dynamic embedding that turns off and on different parts of the rainfall–runoff portion of the LSTM over the course of a simulation.

A notable corollary of our main result is that the catchment attributes collected by Addor et al. (2017b) appear to contain sufficient information to distinguish between diverse rainfall–runoff behaviors, at least to a meaningful degree. It is arguable whether this was known previously, since regional modeling studies have largely struggled to fully extract this information (Mizukami et al., 2017) – i.e., existing regional models do not perform with accuracy similar to models calibrated in a specific catchment. In contrast, our regional EA-LSTM actually performs better than models calibrated separately for individual catchments. This result challenges the idea that runoff time series alone only contain enough information to restrict a handful of parameters (Naef, 1981; Jakeman and Hornberger, 1993; Perrin et al., 2001; Kirchner, 2006) and implies that structural improvements are still possible for most large-scale hydrology models, given the size of today's datasets.

Code and data availability. The CAMELS input data are freely available at the homepage of the NCAR. The validation periods of all benchmark models used in this study are available at https://doi.org/10.4211/hs.474ecc37e7db45baa425cdb4fc1b61e1 (Kratzert, 2019a), the extended Maurer forcings, including daily minimum and maximum temperature, are available at https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077 (Kratzert, 2019b), our trained networks as well as our model outputs are available at https://doi.org/10.4211/hs.83ea5312635e44dc824eeb99eda12f06 (Kratzert, 2019d), and finally, the code to reproduce our results is available at https://doi.org/10.5281/zenodo.3530884 (Kratzert, 2019c) (https://github.com/kratzert/ealstm_regional_modeling, last access: 13 December 2019).

www.hydrol-earth-syst-sci.net/23/5089/2019/ Hydrol. Earth Syst. Sci., 23, 5089–5110, 2019

45
5106 F. Kratzert et al.: Learning universal hydrological behaviors via machine learning

Appendix A: Full list of the used CAMELS catchment


characteristics

A1 Table of catchment attributes used in this


experiment. Description taken from the dataset of
Addor et al. (2017a)

p_mean Mean daily precipitation.


pet_mean Mean daily potential evapotranspiration.
aridity Ratio of mean PET to mean precipitation.
p_seasonality Seasonality and timing of precipitation. Estimated by representing annual
precipitation and temperature as sine waves. Positive (negative) values indicate
precipitation peaks during the summer (winter). Values of approx. 0 indicate
uniform precipitation throughout the year.
frac_snow_daily Fraction of precipitation falling on days with temperatures below 0 ◦ C.
high_prec_freq Frequency of high-precipitation days (≥ 5 times mean daily precipitation).
high_prec_dur Average duration of high-precipitation events (number of consecutive days with
≥ 5 times mean daily precipitation).
low_prec_freq Frequency of dry days (< 1 mm d−1 ).
low_prec_dur Average duration of dry periods (number of consecutive days
with precipitation < 1 mm d−1 ).
elev_mean Catchment mean elevation.
slope_mean Catchment mean slope.
area_gages2 Catchment area.
forest_frac Forest fraction.
lai_max Maximum monthly mean of leaf area index.
lai_diff Difference between the max. and min. mean of the leaf area index.
gvf_max Maximum monthly mean of green vegetation fraction.
gvf_diff Difference between the maximum and minimum monthly
mean of the green vegetation fraction.
soil_depth_pelletier Depth to bedrock (maximum 50 m).
soil_depth_statsgo Soil depth (maximum 1.5 m).
soil_porosity Volumetric porosity.
soil_conductivity Saturated hydraulic conductivity.
max_water_content Maximum water content of the soil.
sand_frac Fraction of sand in the soil.
silt_frac Fraction of silt in the soil.
clay_frac Fraction of clay in the soil.
carb_rocks_frac Fraction of the catchment area characterized as “Carbonate sedimentary rocks”.
geol_permeability Surface permeability (log10).


Appendix B: Hyperparameter tuning

The hyperparameters, i.e., the number of hidden/cell states,


dropout rate, length of the input sequence, and number of
stacked LSTM layers for our model, were found by running
a grid search over a range of parameter values. Concretely
we considered the following possible parameter values.

1. Hidden states: 64, 96, 128, 156, 196, 224, 256


2. Dropout rate: 0.0, 0.25, 0.4, 0.5
3. Length of input sequence: 90, 180, 270, 365
4. Number of stacked LSTM layer: 1, 2

We used k-fold cross-validation (k = 4) to split the basins


into a training set and an independent test set. We trained
one model for each split for each parameter combination on
the combined calibration period of all basins in the specific
training set and evaluated the model performance on the cal-
ibration data of the test basins. The final configuration was
chosen by taking the parameter set that resulted in the high-
est median NSE over all possible parameter configurations.
The chosen parameters are the following (a schematic sketch of this grid search is given after the list).
1. Hidden states: 256

2. Dropout rate: 0.4


3. Length of input sequence: 270
4. Number of stacked LSTM layer: 1
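A schematic sketch of the basin-wise k-fold grid search described above; the actual training loop, data loading, and NSE evaluation are abstracted behind a placeholder function, so this is an illustration of the procedure rather than the published implementation.

```python
from itertools import product
import numpy as np
from sklearn.model_selection import KFold

grid = {
    "hidden_size": [64, 96, 128, 156, 196, 224, 256],
    "dropout": [0.0, 0.25, 0.4, 0.5],
    "seq_length": [90, 180, 270, 365],
    "num_layers": [1, 2],
}

def grid_search(basin_ids, train_and_evaluate, n_splits=4):
    """train_and_evaluate(config, train_basins, test_basins) -> median NSE (placeholder)."""
    results = {}
    for values in product(*grid.values()):
        config = dict(zip(grid.keys(), values))
        fold_scores = []
        kfold = KFold(n_splits=n_splits, shuffle=True, random_state=0)
        for train_idx, test_idx in kfold.split(basin_ids):
            train_basins = [basin_ids[i] for i in train_idx]
            test_basins = [basin_ids[i] for i in test_idx]
            fold_scores.append(train_and_evaluate(config, train_basins, test_basins))
        results[tuple(values)] = float(np.median(fold_scores))
    # return the configuration with the highest median NSE
    return dict(zip(grid.keys(), max(results, key=results.get)))
```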


Author contributions. FK had the idea for the regional modeling approach. SH proposed the adapted LSTM architecture. FK, DK, and GN designed all the experiments. FK conducted all the experiments and analyzed the results, together with DK, GS, and GN. GN supervised the manuscript from the hydrological perspective and GK and SH from the machine-learning perspective. GN and SH share the responsibility for the last authorship in the respective fields. All the authors worked on the manuscript.

Competing interests. The authors declare that they have no conflict of interest.

Acknowledgements. The project relies heavily on open-source software. All programming was done in Python version 3.7 (van Rossum, 1995) and associated libraries, including Numpy (Van Der Walt et al., 2011), Pandas (McKinney, 2010), PyTorch (Paszke et al., 2017), and Matplotlib (Hunter, 2007).

Financial support. This research has been supported by Bosch, ZF, Google, and the NVIDIA Corporation with GPU donations, LIT (grant no. LIT-2017-3-YOU-003), and FWF (grant no. P 28660-N31).

Review statement. This paper was edited by Nadav Peleg and reviewed by Hoshin Gupta and two anonymous referees.

References

Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: The CAMELS data set: catchment attributes and meteorology for large-sample studies, Hydrol. Earth Syst. Sci., 21, 5293–5313, https://doi.org/10.5194/hess-21-5293-2017, 2017a.
Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P.: Catchment attributes for large-sample studies, UCAR/NCAR, Boulder, CO, USA, https://doi.org/10.5065/D6G73C3Q, 2017b.
Addor, N., Nearing, G., Prieto, C., Newman, A. J., Le Vine, N., and Clark, M. P.: A ranking of hydrological signatures based on their predictability in space, Water Resour. Res., 54, 8792–8812, https://doi.org/10.1029/2018WR022606, 2018.
Anderson, E. A.: National Weather Service river forecast system: Snow accumulation and ablation model, NOAA Tech. Memo. NWS HYDRO-17, 87 pp., 1973.
Beck, H. E., van Dijk, A. I. J. M., de Roo, A., Miralles, D. G., McVicar, T. R., Schellekens, J., and Bruijnzeel, L. A.: Global-scale regionalization of hydrologic model parameters, Water Resour. Res., 52, 3599–3622, https://doi.org/10.1002/2015WR018247, 2016.
Besaw, L. E., Rizzo, D. M., Bierman, P. R., and Hackett, W. R.: Advances in ungauged streamflow prediction using artificial neural networks, J. Hydrol., 386, 27–37, 2010.
Beven, K. and Freer, J.: Equifinality, data assimilation, and uncertainty estimation in mechanistic modelling of complex environmental systems using the GLUE methodology, J. Hydrol., 249, 11–29, 2001.
Blöschl, G. and Sivapalan, M.: Scale issues in hydrological modelling: a review, Hydrol. Process., 9, 251–290, 1995.
Blöschl, G., Sivapalan, M., Savenije, H., Wagener, T., and Viglione, A.: Runoff prediction in ungauged basins: synthesis across processes, places and scales, Cambridge University Press, Cambridge, 2013.
Burnash, R. J. C.: The NWS river forecast system–catchment modeling, in: Computer models of watershed hydrology, edited by: Singh, V. P., Water Resources Publications, Littleton, CO, 311–366, 1995.
Burnash, R. J., Ferral, R. L., and McGuire, R. A.: A generalized streamflow simulation system, conceptual modeling for digital computers, Joint Federal and State River Forecast Center, U.S. National Weather Service, and California Department of Water Resources Tech. Rep., 204 pp., 1973.
Clark, M. P., Slater, A. G., Rupp, D. E., Woods, R. A., Vrugt, J. A., Gupta, H. V., Wagener, T., and Hay, L. E.: Framework for Understanding Structural Errors (FUSE): A modular framework to diagnose differences between hydrological models, Water Resour. Res., 44, W00B02, https://doi.org/10.1029/2007WR006735, 2008.
Cohen, J.: Statistical power analysis for the behavioral sciences, 2nd Edn., Erlbaum, Hillsdale, NJ, 1988.
Gupta, H. V., Sorooshian, S., and Yapo, P. O.: Toward improved calibration of hydrologic models: Multiple and noncommensurable measures of information, Water Resour. Res., 34, 751–763, 1998.
Gupta, H. V., Wagener, T., and Liu, Y.: Reconciling theory with observations: elements of a diagnostic approach to model evaluation, Hydrol. Process., 22, 3802–3813, 2008.
Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F.: Decomposition of the mean squared error and NSE performance criteria: Implications for improving hydrological modelling, J. Hydrol., 377, 80–91, 2009.
Gupta, H. V., Perrin, C., Blöschl, G., Montanari, A., Kumar, R., Clark, M., and Andréassian, V.: Large-sample hydrology: a need to balance depth with breadth, Hydrol. Earth Syst. Sci., 18, 463–477, https://doi.org/10.5194/hess-18-463-2014, 2014.
Henn, B., Clark, M. P., Kavetski, D., and Lundquist, J. D.: Estimating mountain basin-mean precipitation from streamflow using Bayesian inference, Water Resour. Res., 51, 8012–8033, 2008.
Herman, J. D., Kollat, J. B., Reed, P. M., and Wagener, T.: Technical Note: Method of Morris effectively reduces the computational demands of global sensitivity analysis for distributed watershed models, Hydrol. Earth Syst. Sci., 17, 2893–2903, https://doi.org/10.5194/hess-17-2893-2013, 2013.
Hochreiter, S.: Untersuchungen zu dynamischen neuronalen Netzen, Diploma, Technische Universität München, Germany, 1991.
Hochreiter, S. and Schmidhuber, J.: Long short-term memory, Neural Comput., 9, 1735–1780, 1997.
Hrachowitz, M., Savenije, H., Blöschl, G., McDonnell, J., Sivapalan, M., Pomeroy, J., Arheimer, B., Blume, T., Clark, M., Ehret, U., Fenicia, F., Freer, J. E., Gelfan, A., Gupta, H. V., Hughes, D. A., Hut, R. W., Montanari, A., Pande, S., Tetzlaff, D., Troch, P. A., Uhlenbrook, S., Wagener, T., Winsemius, H. C., Woods, R. A., Zehe, E., and Cudennec, C.: A decade of Predictions in Ungauged Basins (PUB) – a review, Hydrolog. Sci. J., 58, 1198–1255, 2013.

Hunter, J. D.: Matplotlib: A 2D graphics environment, Comput. Sci. Eng., 9, 90–95, 2007.
Jakeman, A. J. and Hornberger, G. M.: How much complexity is warranted in a rainfall-runoff model?, Water Resour. Res., 29, 2637–2649, 1993.
Kirchner, J. W.: Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science of hydrology, Water Resour. Res., 42, W03S04, https://doi.org/10.1029/2005WR004362, 2006.
Kratzert, F.: Benchmark models, HydroShare, https://doi.org/10.4211/hs.474ecc37e7db45baa425cdb4fc1b61e1, 2019a.
Kratzert, F.: CAMELS extended Maurer forcings, HydroShare, https://doi.org/10.4211/hs.17c896843cf940339c3c3496d0c1c077, 2019b.
Kratzert, F.: kratzert/ealstm_regional_modeling: Code to reproduce paper experiments/results, Zenodo, https://doi.org/10.5281/zenodo.3530884, 2019c.
Kratzert, F.: Pre-trained models, HydroShare, https://doi.org/10.4211/hs.83ea5312635e44dc824eeb99eda12f06, 2019d.
Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., and Klambauer, G.: Do internals of neural networks make sense in the context of hydrology?, in: AGU Fall Meeting Abstracts, 2018AGUFM.H13B..06K, 2018a.
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M.: Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks, Hydrol. Earth Syst. Sci., 22, 6005–6022, https://doi.org/10.5194/hess-22-6005-2018, 2018b.
Kratzert, F., Klotz, D., Herrnegger, M., and Hochreiter, S.: A glimpse into the Unobserved: Runoff simulation for ungauged catchments with LSTMs, in: Workshop on Modeling and Decision-Making in the Spatiotemporal Domain, 32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montreal, Canada, 3–8 December 2018c.
Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., and Klambauer, G.: NeuralHydrology – Interpreting LSTMs in Hydrology, arXiv preprint arXiv:1903.07903, 2019.
Kumar, R., Samaniego, L., and Attinger, S.: Implications of distributed hydrologic model parameterization on water fluxes at multiple scales and locations, Water Resour. Res., 49, 360–379, 2013.
Liang, X., Lettenmaier, D. P., Wood, E. F., and Burges, S. J.: A simple hydrologically based model of land surface water and energy fluxes for general circulation models, J. Geophys. Res., 99, 14415–14428, 1994.
McInnes, L., Healy, J., and Melville, J.: UMAP: Uniform manifold approximation and projection for dimension reduction, arXiv preprint arXiv:1802.03426, 2018.
McKinney, W.: Data Structures for Statistical Computing in Python, Proceedings of the 9th Python in Science Conference, Austin, Texas, 28 June–2 July 2010, 1697900, 51–56, 2010.
Mizukami, N., Clark, M. P., Newman, A. J., Wood, A. W., Gutmann, E. D., Nijssen, B., Rakovec, O., and Samaniego, L.: Towards seamless large-domain parameter estimation for hydrologic models, Water Resour. Res., 53, 8020–8040, 2017.
Mizukami, N., Rakovec, O., Newman, A. J., Clark, M. P., Wood, A. W., Gupta, H. V., and Kumar, R.: On the choice of calibration metrics for “high-flow” estimation using hydrologic models, Hydrol. Earth Syst. Sci., 23, 2601–2614, https://doi.org/10.5194/hess-23-2601-2019, 2019.
Morris, M. D.: Factorial sampling plans for preliminary computational experiments, Technometrics, 33, 161–174, 1991.
Naef, F.: Can we model the rainfall-runoff process today?/Peut-on actuellement mettre en modèle le processus pluie-écoulement?, Hydrol. Sci. B., 26, 281–289, 1981.
Nash, J. E. and Sutcliffe, J. V.: River flow forecasting through conceptual models part I – A discussion of principles, J. Hydrol., 10, 282–290, 1970.
Newman, A., Sampson, K., Clark, M., Bock, A., Viger, R., and Blodgett, D.: A large-sample watershed-scale hydrometeorological dataset for the contiguous USA, UCAR/NCAR, Boulder, CO, USA, https://doi.org/10.5065/D6MW2F4D, 2014.
Newman, A. J., Clark, M. P., Sampson, K., Wood, A., Hay, L. E., Bock, A., Viger, R. J., Blodgett, D., Brekke, L., Arnold, J. R., Hopson, T., and Duan, Q.: Development of a large-sample watershed-scale hydrometeorological data set for the contiguous USA: data set characteristics and assessment of regional variability in hydrologic model performance, Hydrol. Earth Syst. Sci., 19, 209–223, https://doi.org/10.5194/hess-19-209-2015, 2015.
Newman, A. J., Mizukami, N., Clark, M. P., Wood, A. W., Nijssen, B., and Nearing, G.: Benchmarking of a physically based hydrologic model, J. Hydrometeorol., 18, 2215–2225, 2017.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A.: Automatic differentiation in PyTorch, in: NIPS 2017 Autodiff Workshop: The Future of Gradient-based Machine Learning Software and Techniques, Long Beach, CA, US, 9 December 2017.
Perrin, C., Michel, C., and Andréassian, V.: Does a large number of parameters enhance model performance? Comparative assessment of common catchment model structures on 429 catchments, J. Hydrol., 242, 275–301, 2001.
Peters-Lidard, C. D., Clark, M., Samaniego, L., Verhoest, N. E. C., van Emmerik, T., Uijlenhoet, R., Achieng, K., Franz, T. E., and Woods, R.: Scaling, similarity, and the fourth paradigm for hydrology, Hydrol. Earth Syst. Sci., 21, 3701–3713, https://doi.org/10.5194/hess-21-3701-2017, 2017.
Prieto, C., Le Vine, N., Kavetski, D., García, E., and Medina, R.: Flow Prediction in Ungauged Catchments Using Probabilistic Random Forests Regionalization and New Statistical Adequacy Tests, Water Resour. Res., 55, 4364–4392, 2019.
Rakovec, O., Mizukami, N., Kumar, R., Newman, A. J., Thober, S., Wood, A. W., Clark, M. P., and Samaniego, L.: Diagnostic Evaluation of Large-domain Hydrologic Models calibrated across the Contiguous United States, J. Geophys. Res.-Atmos., in review, 2019.
Razavi, T. and Coulibaly, P.: Streamflow Prediction in Ungauged Basins: Review of Regionalization Methods, J. Hydrol. Eng., 18, 958–975, 2013.
Saltelli, A., Tarantola, S., Campolongo, F., and Ratto, M.: Sensitivity analysis in practice: a guide to assessing scientific models, Wiley Online Library, 94–100, 2004.
Samaniego, L., Kumar, R., and Attinger, S.: Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale, Water Resour. Res., 46, W05523, https://doi.org/10.1029/2008WR007327, 2010.
Seibert, J.: Regionalisation of parameters for a conceptual rainfall–runoff model, Agr. Forest Meteorol., 98–99, 279–293, 1999.
Seibert, J. and Vis, M. J. P.: Teaching hydrological modeling with a user-friendly catchment-runoff-model software package, Hydrol. Earth Syst. Sci., 16, 3315–3325, https://doi.org/10.5194/hess-16-3315-2012, 2012.

Seibert, J., Vis, M. J. P., Lewis, E., and van Meerveld, H. J.: Upper and lower benchmarks in hydrological modelling, Hydrol. Process., 32, 1120–1125, 2018.
Sivapalan, M., Takeuchi, K., Franks, S. W., Gupta, V. K., Karambiri, H., Lakshmi, V., Liang, X., McDonnell, J. J., Mendiondo, E. M., O'Connell, P. E., Oki, T., Pomeroy, J. W., Schertzer, D., Uhlenbrook, S., and Zehe, E.: IAHS Decade on Predictions in Ungauged Basins (PUB), 2003–2012: Shaping an exciting future for the hydrological sciences, Hydrolog. Sci. J., 48, 857–880, 2003.
Srivastava, N., Hinton, G., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R.: Dropout: a simple way to prevent neural networks from overfitting, J. Mach. Learn. Res., 15, 1929–1958, 2014.
Van Der Walt, S., Colbert, S. C., and Varoquaux, G.: The NumPy array: A structure for efficient numerical computation, Comput. Sci. Eng., 13, 22–30, 2011.
van Rossum, G.: Python tutorial, Technical Report CS-R9526, Centrum voor Wiskunde en Informatica (CWI), Amsterdam, 1995.
Wang, A. and Solomatine, D. P.: Practical Experience of Sensitivity Analysis: Comparing Six Methods, on Three Hydrological Models, with Three Performance Criteria, Water, 11, 1062, https://doi.org/10.3390/w11051062, 2019.
Wilcoxon, F.: Individual comparisons by ranking methods, Biometrics Bull., 1, 80–83, 1945.
Wood, A. W., Maurer, E. P., Kumar, A., and Lettenmaier, D. P.: Long-range experimental hydrologic forecasting for the eastern United States, J. Geophys. Res., 107, 4429, https://doi.org/10.1029/2001JD000659, 2002.
Yilmaz, K. K., Gupta, H. V., and Wagener, T.: A process-based diagnostic approach to model evaluation: Application to the NWS distributed hydrologic model, Water Resour. Res., 44, 1–18, 2008.
TECHNICAL REPORTS: METHODS
10.1029/2019WR026065

Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning

Frederik Kratzert1, Daniel Klotz1, Mathew Herrnegger2, Alden K. Sampson3, Sepp Hochreiter1, and Grey S. Nearing4

1 LIT AI Lab and Institute for Machine Learning, Johannes Kepler University, Linz, Austria, 2 Institute for Hydrology and Water Management, University of Natural Resources and Life Sciences, Vienna, Austria, 3 Upstream Tech, Natel Energy Inc., Alameda, CA, USA, 4 Department of Geological Sciences, University of Alabama, Tuscaloosa, AL, USA

Special Section: Big Data & Machine Learning in Water Sciences: Recent Progress and Their Use in Advancing Science and Water Management

Key Points:
• Overall accuracy of LSTMs in ungauged basins is comparable to standard hydrology models in gauged basins
• There is sufficient information in catchment characteristics data to differentiate between catchment-specific rainfall-runoff behaviors

Correspondence to: G. S. Nearing, [email protected]

Citation: Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., & Nearing, G. S. (2019). Toward improved predictions in ungauged basins: Exploiting the power of machine learning. Water Resources Research, 55, 11,344–11,354. https://doi.org/10.1029/2019WR026065

Received 29 JUL 2019; Accepted 19 NOV 2019; Accepted article online 23 NOV 2019; Published online 23 DEC 2019

©2019. The Authors. This is an open access article under the terms of the Creative Commons Attribution License, which permits use, distribution and reproduction in any medium, provided the original work is properly cited.

Abstract  Long short-term memory (LSTM) networks offer unprecedented accuracy for prediction in ungauged basins. We trained and tested several LSTMs on 531 basins from the CAMELS data set using k-fold validation, so that predictions were made in basins that supplied no training data. The training and test data set included ∼30 years of daily rainfall-runoff data from catchments in the United States ranging in size from 4 to 2,000 km2 with aridity index from 0.22 to 5.20, and including 12 of the 13 IGBP vegetated land cover classifications. This effectively “ungauged” model was benchmarked over a 15-year validation period against the Sacramento Soil Moisture Accounting (SAC-SMA) model and also against the NOAA National Water Model reanalysis. SAC-SMA was calibrated separately for each basin using 15 years of daily data. The out-of-sample LSTM had higher median Nash-Sutcliffe Efficiencies across the 531 basins (0.69) than either the calibrated SAC-SMA (0.64) or the National Water Model (0.58). This indicates that there is (typically) sufficient information in available catchment attributes data about similarities and differences between catchment-level rainfall-runoff behaviors to provide out-of-sample simulations that are generally more accurate than current models under ideal (i.e., calibrated) conditions. We found evidence that adding physical constraints to the LSTM models might improve simulations, which we suggest motivates future research related to physics-guided machine learning.

1. Introduction

Science and society are firmly in the age of machine learning (ML; McAfee & Brynjolfsson, 2017). ML models currently outperform state-of-the-art techniques at some of the most sophisticated domain problems across the Natural Sciences (e.g., AlQuraishi, 2019; He et al., 2019; Liu et al., 2016; Mayr et al., 2016). In Hydrology, the first demonstration of ML outperforming a process-based model that we are aware of was by Hsu et al. (1995), who compared a calibrated Sacramento Soil Moisture Accounting Model (SAC-SMA) against a feed-forward artificial neural network across a range of flow regimes. More recently, Nearing et al. (2018) compared neural networks against the half-hourly surface energy balance of hydrometeorological models used operationally by several international weather and climate forecasting agencies, and showed that the former generally out-performed the latter at out-of-sample FluxNet sites. In a companion paper to this one, Kratzert et al. (2019) showed that a regionally trained long short-term memory (LSTM) network outperforms basin-specific calibrations of several traditional hydrology models and demonstrated that LSTM-type models were able to extract information from observable catchment characteristics to differentiate between different rainfall-runoff behaviors in hydrologically diverse catchments. The purpose of this paper is to show that we can leverage this capability for prediction in ungauged basins.

There is a long-standing discussion in the field of Hydrology about the relative merits of data-driven versus process-driven models (e.g., Klemeš, 1986). In their summary of a recent workshop on “Big Data and the Earth Sciences,” Sellars (2018) noted that “Many participants who have worked in modeling physical-based systems continue to raise caution about the lack of physical understanding of machine learning methods that rely on data-driven approaches.” It is often argued that data-driven models might underperform relative to models that include explicit process representations in conditions that are dissimilar to training data (e.g., Kirchner, 2006; Milly et al., 2008; Vaze et al., 2015). While this may or may not be true (we are unaware of any study that has tested this hypothesis directly), in any case where an ML model does outperform relative to a given process-based model, we can conclude that the process-based model does not take advantage of
the full information content of the input/output data (Nearing & Gupta, 2015). At the very least, such cases
indicate that there is potential to improve the process-based model(s).
One of the situations where the accuracy of out-of-sample predictions matter is for prediction in ungauged
basins (PUB). PUB was the decadal problem of the International Association of Hydrological Sciences
(IAHS) from 2003–2012 (Hrachowitz et al., 2013; Sivapalan et al., 2003). State-of-the-art regionalization,
parameter transfer, catchment similarity, and surrogate basin techniques (e.g., Parajka et al., 2013; Razavi
& Coulibaly, 2012; Samaniego et al., 2017) result in streamflow predictions that are less accurate than from
models calibrated individually in gauged catchments. Current community best practices for PUB center
around obtaining detailed local knowledge of a particular basin (Blöschl, 2016), which is expensive for indi-
vidual catchments and impossible for large-scale (e.g., continental) simulations like those from the U.S.
National Water Model (NWM; Salas et al., 2018) or the streamflow component of the North American
Land Data Assimilation System (NLDAS; Xia et al., 2012). Moreover, Vrugt et al. (2006) argued that reli-
able streamflow predictions from lumped catchment models typically require at least 2 to 3 years of gauge
data for calibration (even this is likely an underestimate of the amount of data necessary for reliable model
calibration). PUB remains an important challenge because the majority of streams in the world are either
ungauged or poorly gauged (Goswami et al., 2007; Sivapalan, 2003), and the number of gauged catchments,
even in the United States, is shrinking (Fekete et al., 2015).
In this technical note, we demonstrate an ML strategy for PUB. Our results show that out-of-sample LSTMs
outperform, on average, a conceptual model (SAC-SMA) calibrated independently for each catchment, and
also a distributed, process-based model (NWM). The purpose of this demonstration is twofold. First, to
show that there is sufficient information in the available hydrological data record to provide meaningful
predictions in ungauged basins—at least a significant portion of the time. Second, to show that ML offers
a promising path forward for extracting this information, and for PUB in general. The current authors are
unaware of any existing model that performs as well, on average, as the LSTMs that we demonstrate here.
At the end of this technical note we offer some thoughts—both philosophical and practical—about future
work that could be done to advance the utility of ML in a complex systems science like Hydrology.
To reemphasize our primary findings succinctly, ML in ungauged basins outperforms, on average
(i.e., in more catchments than not) a lumped conceptual model calibrated in gauged basins, and also a
state-of-the-art distributed process-based model. This rapid correspondence is intended to highlight initial
results that might motivate continued development of these and similar techniques—this is not intended to
be a comprehensive analysis of the application of LSTMs or deep learning in general to PUB.

2. Data
Experimental data for our analysis came from the publicly available Catchment Attributes and Meteorology
for Large-Sample Studies (CAMELS) data set curated by National Center for Atmospheric Research
(NCAR; Addor et al., 2017; Newman et al., 2014; Newman et al., 2015). CAMELS consists of 671 catchments
in the continental United States ranging in size from 4 to 25,000 km2 . These catchments were chosen from
the available gauged catchments in the United States due to the fact that they are largely natural and have
long gauge records (1980–2010) available from the United States Geological Survey National Water Infor-
mation System. CAMELS includes daily forcing from Daymet, Maurer, and NLDAS, as well as several static
catchment attributes related to soils, climate, vegetation, topography, and geology (Addor et al., 2018). It is
important to point out that these catchment attributes were derived from maps, remote sensing products,
and climate data that are generally available over the continental United States and either exactly or in close
approximation, globally. For this project, we used only 531 of 671 CAMELS catchments—these were the
same basins that were used for model benchmarking by Newman et al. (2017), who removed basins from
the full CAMELS data set with (i) large discrepancies between different methods of calculating catchment
area and (ii) areas larger than 2,000 km2 .
The CAMELS repository also includes daily streamflow values simulated by 10 SAC-SMA models calibrated
separately in each catchment using Shuffled Complex Evolution (SCE; Duan et al., 1993) with 10 random
seeds. Each SAC-SMA was calibrated on 15 years of data in each catchment (1980–1995). These calibrations
were performed in previous work by NCAR (Newman et al., 2015). We used this ensemble of SAC-SMA mod-
els as a benchmark for our LSTMs. In addition, we benchmarked against the NWM reanalysis, which spans
the years 1993–2017 (https://docs.opendata.aws/nwm-archive). All performance statistics that we report


Figure 1. Visualization of (a) the standard LSTM cell as defined by equations (1)–(6).

(for all models) are from the water years 1996–2010, so that the SAC-SMA models were tested out of sample
in time but at the same basins where they were calibrated.

3. Methods
3.1. A Brief Overview of LSTM Networks
LSTMs are a type of recurrent neural network (RNN) first proposed by Hochreiter and Schmidhuber (1997).
LSTMs have memory cells that are analogous to the states of a traditional dynamical systems model, which
make them useful for simulating natural systems like watersheds. Compared with other types of recurrent
neural networks, LSTMs avoid exploding and/or vanishing gradients, which allows them to learn long-term
dependencies between input and output features. This is desirable for modeling catchment processes like
snow accumulation and seasonal vegetation patterns that have relatively long timescales as compared with
input-driven processes like direct surface runoff. Kratzert, Klotz, et al. (2018) applied LSTMs to the problem
of rainfall-runoff modeling and later demonstrated that the internal memory states of the network were
highly correlated with observed snow and soil moisture states without the model seeing any type of snow
or soil moisture data during training (Kratzert, Herrnegger, Kratzert et al., 2018).
Figure 1 provides an illustration of an LSTM, which works as follows. The model takes a time series (more
generally, a sequence) of inputs x = [x[1], .., x[T]] of data over T time steps, where each element x[t] is a
vector containing features (model inputs) at time step t. This is not dissimilar to any standard hydrological
simulation model (i.e., it is not a one-step-ahead forecast model). The LSTM model structure is described
by the following equations:

i[t] = σ(Wi x[t] + Ui h[t − 1] + bi)    (1)

f[t] = σ(Wf x[t] + Uf h[t − 1] + bf)    (2)

g[t] = tanh(Wg x[t] + Ug h[t − 1] + bg)    (3)

o[t] = σ(Wo x[t] + Uo h[t − 1] + bo)    (4)

c[t] = f[t] ⊙ c[t − 1] + i[t] ⊙ g[t]    (5)

h[t] = o[t] ⊙ tanh(c[t]),    (6)

where i[t], f[t], and o[t] are the input gate, forget gate, and output gate, respectively, g[t] is the cell input and
x[t] is the network input at time step t (1 ≤ t ≤ T), h[t − 1] is the recurrent input, and c[t − 1] is the cell state
from the previous time step. At the first time step, the hidden and cell states are initialized as a vector of
zeros. W, U, and b are calibrated parameters. These are specific to each gate, and subscripts indicate which
gate the particular weight matrix/vector is associated with. 𝜎(·) is the sigmoid activation function, tanh(·)
the hyperbolic tangent function, and ⊙ is element-wise multiplication. The intuition is that the cell states
(c[t]) characterize the memory of the system. These are modified by (i) the forget gate (f[t]), which allows
attenuation of information in the states over time, and by (ii) a combination of the input gate (i[t]) and cell
update (g[t]), which can add new information. In the latter case, the cell update contains information to be
added to each cell state, and the input gate (which is a sigmoid function) controls which cells are “allowed”


Table 1
Table of LSTM Inputs
Meteorological forcing data
Maximum air temp 2 m daily maximum air temperature (◦ C)
Minimum air temp 2 m daily minimum air temperature (◦ C)
Precipitation Average daily precipitation (mm/day)
Radiation Surface-incident solar radiation (W/m2 )
Vapor pressure Near-surface daily average (Pa )
Static catchment attributes
Precipitation mean Mean daily precipitation.
PET mean Mean daily potential evapotranspiration
Aridity index Ratio of Mean PET to Mean Precipitation
Precip seasonality Estimated by representing annual precipitation and temperature as sine waves.
Positive (negative) values indicate precipitation peaks during the summer (winter). Values of ∼0 indicate
uniform precipitation throughout the year.
Snow fraction Fraction of precipitation falling on days with temp < 0 ◦C.
High precipitation frequency Frequency of days with ≥ 5× mean daily precipitation.
High precip duration Average duration of high-precipitation events
(number of consecutive days with ≥ 5× mean daily precipitation).
Low precip frequency Frequency of dry days (< 1 mm/day).
Low precip duration Average duration of dry periods
(number of consecutive days with precipitation < 1 mm/day).
Elevation Catchment mean elevation.
Slope Catchment mean slope.
Area Catchment area.
Forest fraction Fraction of catchment covered by forest.
LAI max Maximum monthly mean of leaf area index.
LAI difference Difference between the max. and min. mean of the leaf area index.
GVF max Maximum monthly mean of green vegetation fraction.
GVF difference Difference between the maximum and minimum monthly mean of the
green vegetation fraction.
Soil depth (Pelletier) Depth to bedrock (maximum 50 m).
Soil depth (STATSGO) Soil depth (maximum 1.5 m).
Soil Porosity Volumetric porosity.
Soil conductivity Saturated hydraulic conductivity.
Max water content Maximum water content of the soil.
Sand fraction Fraction of sand in the soil.
Silt fraction Fraction of silt in the soil.
Clay fraction Fraction of clay in the soil.
Carbonate rocks fraction Fraction of the catchment area characterized as
“carbonate sedimentary rocks.”
Geological permeability Surface permeability (log10).

to receive new information. Finally, the output gate (o[t]) controls the flow of information from states to
model output.
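For illustration, equations (1)–(6) can be written out directly. The following NumPy sketch steps a single LSTM cell over an input sequence; the weight matrices and bias vectors are passed in as placeholders rather than trained values, so this is a schematic of the cell mechanics, not the trained rainfall-runoff model.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def lstm_forward(x, W, U, b, n_hidden):
    """x: input sequence of shape (T, n_inputs); W, U, b: dicts of per-gate parameters."""
    h = np.zeros(n_hidden)   # hidden state, initialized to zeros at the first time step
    c = np.zeros(n_hidden)   # cell state, initialized to zeros at the first time step
    for x_t in x:                                         # loop over time steps 1..T
        i = sigmoid(W["i"] @ x_t + U["i"] @ h + b["i"])   # input gate, Eq. (1)
        f = sigmoid(W["f"] @ x_t + U["f"] @ h + b["f"])   # forget gate, Eq. (2)
        g = np.tanh(W["g"] @ x_t + U["g"] @ h + b["g"])   # cell update, Eq. (3)
        o = sigmoid(W["o"] @ x_t + U["o"] @ h + b["o"])   # output gate, Eq. (4)
        c = f * c + i * g                                 # cell state update, Eq. (5)
        h = o * np.tanh(c)                                # hidden state, Eq. (6)
    return h, c
```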

3.2. Experimental Design


The LSTMs used in this study took as inputs at each time step the NLDAS meteorological forcing data listed
in Table 1. Additionally, at each time step, the meteorological inputs were augmented with the catchment


Figure 2. Frequencies of NSE values from 531 catchments given by “gauged” and “ungauged” LSTMs, calibrated
(gauged) SAC-SMA, and the National Water Model reanalysis.

attributes also listed in Table 1. These catchment attributes were described in detail by Addor et al. (2017)
and remain constant in time throughout the simulation (training and testing). In total we used 32 LSTM
inputs at each daily time step: 5 meteorological forcings and 27 catchment characteristics. All LSTMs were
configured to have 256 cell states with a dropout rate of 0.4 applied to the LSTM output before a single
regression layer.
We trained and tested three types of LSTM models:
1. Global LSTM without static features: LSTMs with only meteorological forcing inputs, and without catch-
ment attributes, trained on all catchments simultaneously (without k-fold validation).
2. Global LSTM with static features: LSTMs with both meteorological forcings and catchment characteristics
as inputs, trained on all catchments simultaneously (without k-fold validation).
3. PUB LSTM: LSTMs with both meteorological forcings and catchment characteristics as inputs, trained
and tested with k-fold validation (k = 12).
The third model is the one we want to test—this is the one that simulates in basins that are different than the
ones that the models were trained on. Out-of-sample testing was done by k-fold validation, which splits the
531 basins randomly into k = 12 groups of approximately equal size, uses all basins from k-1 groups to train
the model, and then tests the model on the single group of holdout basins. This procedure is repeated k = 12
times so that out-of-sample predictions are available from every basin. The second model sets an upper
benchmark for our PUB LSTMs. In particular, comparison between the second and third models tells us how
much information was lost due to prediction in out-of-sample basins versus in-sample basins. Similarly, a
comparison between the first and second models lets us evaluate the value of adding catchment attributes to
the model inputs, since these are what will, at least potentially, allow the model to be transferable between
catchments.
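A minimal sketch of such a basin-wise k-fold split is given below (scikit-learn; the basin identifiers are placeholders for the 531 CAMELS basins):

```python
import numpy as np
from sklearn.model_selection import KFold

basin_ids = np.array([f"basin_{i:03d}" for i in range(531)])  # placeholder IDs
kfold = KFold(n_splits=12, shuffle=True, random_state=0)

for fold, (train_idx, test_idx) in enumerate(kfold.split(basin_ids)):
    train_basins = basin_ids[train_idx]  # basins seen during training
    test_basins = basin_ids[test_idx]    # held-out ("ungauged") basins of this fold
    # train an LSTM on train_basins, then simulate streamflow in test_basins
```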
For each model type we trained and tested an ensemble of N = 10 LSTM models to match the 10 SCE restarts
used to calibrate the SAC-SMA models. All metrics reported in Section 4 were calculated from the mean of
the 10-member ensembles, except for the NWM reanalysis.
All LSTM models were trained on the first 15 years of CAMELS data (1981–1995 water years)—this is the
same data period that Newman et al. (2015) used to calibrate SAC-SMA. And all models (LSTMs, SAC-SMA,
and NWM) were evaluated on the last 15 years of CAMELS data (1996–2010 water years). LSTMs were
trained and evaluated using a k-fold approach (k = 12). The training loss function was the average NSE
over all training catchments; this is a squared-error loss function that, unlike a more traditional MSE loss
function, does not overweight catchments with larger mean streamflow values (i.e., does not overweight
large, humid catchments) (Kratzert et al., 2019).
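One way to implement such a basin-normalized squared-error loss is sketched below (PyTorch). Normalizing each sample by its basin's discharge standard deviation follows the description above; the exact form and the value of the small constant eps are assumptions of this sketch:

```python
import torch

def basin_nse_loss(y_hat, y, q_stds, eps: float = 0.1):
    """Squared-error loss weighted by each basin's discharge variability.

    y_hat, y : simulated and observed discharge for a batch of samples
    q_stds   : standard deviation of observed discharge of the basin each
               sample belongs to, computed over the training period
    eps      : keeps the loss finite for basins with near-constant discharge
    """
    weights = 1.0 / (q_stds + eps) ** 2
    return torch.mean(weights * (y_hat - y) ** 2)
```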

4. Results
A comparison between interpolated frequency distributions over the NSE values from 531 CAMELS catch-
ments from all three LSTM models and both benchmark models (SAC-SMA, NWM) is shown in Figure 2.


Table 2
Summary of Benchmark Statistics for All Models Across 531 Catchments

                              Median   Mean    Minimum   Maximum
Nash-Sutcliffe efficiency: (−∞, 1] – values close to 1 are desirable.
SAC-SMA:                      0.64     0.51    −12.28    0.88
NWM:                          0.58     0.31    −20.28    0.89
Global LSTM (no statics):     0.63     0.45    −31.72    0.90
Global LSTM (with statics):   0.74     0.68    −1.78     0.93
PUB LSTM:                     0.69     0.54    −13.02    0.90

Fractional Bias: (−∞, 1] – values close to 0 are desirable.
SAC-SMA:                      0.04     0.02    −1.76     0.71
NWM:                          0.05     −0.01   −4.80     1.00 (c)
Global LSTM (no statics):     0.01     −0.03   −3.01     0.77
Global LSTM (with statics):   −0.01    −0.04   −2.19     0.49
PUB LSTM:                     −0.02    −0.09   −4.86     0.72

Standard Deviation Ratio (a): [0, ∞) – values close to 1 are desirable.
SAC-SMA:                      0.83     0.87    0.10      3.76
NWM:                          0.86     0.93    0.00 (c)  4.04
Global LSTM (no statics):     0.74     0.81    0.10      5.83
Global LSTM (with statics):   0.88     0.89    0.17      1.96
PUB LSTM:                     0.86     0.91    0.10      3.23

95th Percentile Difference (b): (−∞, 1] – values close to 0 are desirable.
SAC-SMA:                      0.02     −0.05   −3.98     0.83
NWM:                          0.07     −0.07   −8.59     1.00 (c)
Global LSTM (no statics):     0.12     0.02    −4.97     0.81
Global LSTM (with statics):   0.03     −0.03   −3.30     0.63
PUB LSTM:                     0.03     −0.08   −5.26     0.78

(a) Ratio of the standard deviation of simulated versus observed flows at each catchment. (b) Difference between the values of the observed versus simulated 95th percentile flows divided by the observed 95th percentile flows at each catchment. (c) Values of zero and one in the NWM max/min statistics are due to rounding. In particular, for one basin (USGS basin ID: 2108000) the NWM simulates a 95th flow percentile of ~1 × 10⁻³ (mm/day), whereas the 95th percentile of observed flow is ~4 (mm/day).
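For reference, the two footnoted metrics (a) and (b) can be computed as in the sketch below (function names are ours; the sign convention of the 95th percentile difference, observed minus simulated, is inferred from the reported ranges):

```python
import numpy as np

def std_ratio(q_sim, q_obs):
    """Footnote (a): ratio of the standard deviations of simulated vs. observed flows."""
    return np.std(q_sim) / np.std(q_obs)

def percentile_95_difference(q_sim, q_obs):
    """Footnote (b): (observed - simulated) 95th percentile flow, divided by
    the observed 95th percentile flow."""
    p_obs = np.percentile(q_obs, 95)
    p_sim = np.percentile(q_sim, 95)
    return (p_obs - p_sim) / p_obs
```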

Mean and median values of several performance statistics are given in Table 2. Interpolation was done with
kernel density estimation using Gaussian kernels and an optimized bandwidth.
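A sketch of this interpolation step is given below (SciPy; gaussian_kde defaults to Scott's rule rather than the optimized bandwidth used here, and the NSE array is a placeholder):

```python
import numpy as np
from scipy.stats import gaussian_kde

nse_values = np.random.uniform(-1.0, 1.0, size=531)  # placeholder for per-basin NSE scores
kde = gaussian_kde(nse_values)       # Gaussian kernels
grid = np.linspace(-1.0, 1.0, 200)   # NSE values at which the density is evaluated
density = kde(grid)                  # interpolated frequency distribution as in Figure 2
```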
The primary result is that the out-of-sample PUB LSTM ensemble performed at least as well as both of the
in-sample benchmarks in more than half of the catchments against all four performance metrics that we
tested, except that the basin-calibrated SAC-SMA has a slightly lower average difference between the 95th
percentile flows (both SAC-SMA and the PUB LSTM underestimated peak flows to some extent). The PUB
LSTM had a higher NSE than SAC-SMA in 307 of 531 (58%) catchments, and higher than the NWM in 347
of 531 (66%) catchments. The PUB LSTM ensemble also had higher mean and maximum NSE scores than
the benchmark models; however, SAC-SMA tended to outperform the PUB LSTM in catchments with low
NSE values (see the CDF plot in Figure 2).
There is some amount of stochasticity associated with training the LSTMs, especially through the random
weight initialization of the LSTMs, but also by the weight optimization strategy (we used an ADAM opti-
mizer, Kingma & Ba, 2014). Because of this, the LSTM-type models give better predictions when used as
an ensemble. It is not necessarily the case that if one particular LSTM model performs poorly in one catch-
ment that a different LSTM trained on exactly the same data will also perform poorly. In our case, we used
an ensemble of N = 10 (the same size as the SAC-SMA ensemble developed by Newman et al., 2015 that
was used here for benchmarking). Figure 3 shows the NSE values for each ensemble member for the PUB
LSTM models. In total, there were 103 basins with at least one PUB LSTM ensemble member with an NSE


Figure 3. NSE scores for all PUB LSTM ensemble members. In some number of basins, certain ensemble members
perform well and certain ensemble members perform poorly. This motivates the use of ensembles of LSTMs.

score of below zero. Only 9 of these 103 basins have all N = 10 ensemble members with NSE < 0, while 55 of
the 103 have at least one ensemble member with NSE > 0.5. As an example, one of the basins (USGS basin
ID: 01142500, which is basin number 232 in Figure 3) had 9 of 10 ensemble members with NSE < 0, but
one ensemble member with NSE > 0.7. This indicates that a substantial portion of the uncertainty in these
LSTM models is due to randomness, rather than to systematic model structural error.
The global LSTM model with static catchment attributes performs better than all other models against the
metrics that we tested. Figure 4 compares the performance of the Global LSTM with other benchmark mod-
els (SAC-SMA and the Global LSTM without static catchment attributes). The Global LSTM with catchment
attributes performs better in most—but not all—catchments. This indicates two things. First, the compari-
son between the Global LSTM with and without static catchment attributes indicates that although there is
useful information in the catchment attributes, in some catchments having these data actually hurts us. We
explored this relationship briefly, but did not find any patterns in terms of which catchment attributes might
tend to lead to underperformance. Specifically, Figure 5 shows that there is generally no correlation between

Figure 4. Comparison between the Global LSTM model with static catchment attributes and other benchmark models
used in this study: (top row) LSTM without static catchment attributes and (bottom row) SAC-SMA.


Figure 5. Scatterplots of the LSTM NSE scores in each basin with versus without static catchment attributes as model
inputs. Colors indicate the relative values of a subselection of the static catchment attributes from Table 1—each
subplot has a different colorscale depending on the absolute magnitudes of the specific attributes data. It is the relative
values of the attributes that we care about here. There are no apparent direct relationships between the values of
different catchment attributes and basins where adding catchment attribute data hurts model performance.

the values of individual catchment attributes and whether the Global LSTM with versus without statics
performs better. Our initial conclusion is that the basins where the LSTM without catchment attributes performs better likely indicate error or uncertainty in the catchment attributes data. Nonetheless,
these data did generally add significant skill to the model (the difference in NSE scores was statistically significant at p < 1e−9). Future work might use a more sophisticated sensitivity analysis (e.g., sequential model
building or a Sobol'-type analysis) to test which specific catchment attributes cause this underperformance
when added to the model.
The second thing that we want to highlight from the comparison between the Global LSTM and SAC-SMA
(Figure 4) is that there is substantial room to improve SAC-SMA overall. This clearly shows that the LSTM
finds rainfall-runoff relationships in individual catchments that SAC-SMA cannot emulate. However, the
fact that SAC-SMA performs better in some catchments indicates the potential value of having physical
constraints in a hydrological model. The LSTMs in these cases are either overfit or are not able to simulate
behaviors of certain similar catchments in the training data set.

5. Discussion
The results illustrated in the previous section tell us three things:
1. The process-driven hydrology models that we used here as benchmarks could be improved. The LSTM
often finds a better functional representation of rainfall-runoff behavior in most catchments than either
SAC-SMA or the NWM.


2. The argument that process-driven models may be preferable in out-of-sample conditions may not hold
water. Modern ML methods are quite powerful at extracting information from large, diverse data sets
under a variety of hydrological conditions.
3. The comparison between models with and without static catchment attributes as inputs demonstrates
that there is sufficient information contained in catchment attribute data to distinguish between different
rainfall-runoff relationships in at least most of the U.S. catchments that we tested.
Related to the third conclusion, the challenge going forward is about how to extract the useful informa-
tion from catchment attributes data for regional modeling. One of the historical reasons why this has been
a hard problem is because the usual strategy is to use observable catchment attributes or characteristics to
identify or “regionalize” parameters of conceptual or process-based simulation models (e.g., Prieto et al.,
2019; Razavi & Coulibaly, 2012). This is hard because of strong interactions in high-dimensional param-
eter spaces. There are many methods for this—notably a family of regionalization methods that Razavi
and Coulibaly (2012) called “model independent”; however, we are unaware of any approach that is as
effective as LSTMs at extracting this information for streamflow simulation. This is also in line with the
recent results by Kratzert et al. (2019), where similar LSTMs were compared against models calibrated
with a parameter regionalization strategy (Samaniego et al., 2010). That paper additionally showed that
the response of LSTM-type models was relatively smooth with respect to perturbing catchment attributes,
indicating a robust fit (i.e., that the models were not overfit or simply remembering different catchments).
The results presented here show that the LSTM is able to extrapolate on catchment attributes to new catch-
ments. Taken together, these results indicate that the catchment attribute data set (Addor et al., 2017)
contains a significant amount of useful information about the differences between rainfall-runoff behav-
iors across (eco)hydrological regimes, and that machine learning is effective at extracting and using these
patterns.
Related to the first conclusion, this is yet another example where traditional hydrological models do not
take full advantage of the information available from the Earth-observation data record. In this case, nei-
ther SAC-SMA nor the NWM are able to directly use the catchment attribute data that we use here, but
even if those models could leverage this information, they still could not compete with the LSTM, since the
LSTM outperforms even when the conceptual model is calibrated in-basin. This means that not only is there
useful information in catchment attributes data, but also that there is more information in meteorological
forcing data than is used by the traditional models. Several recent experiments have shown the same thing
for a number of operational terrestrial hydrology models (e.g., Nearing et al., 2018, 2016). Hrachowitz et al.
(2013) and others suggest that better process-based understanding of catchment behaviors should result in
better out-of-sample predictions. In reality, it is data-driven models that have consistently given increasingly
better predictions. From a more optimistic perspective, ML benchmarking experiments like the one in this
paper show that there are probably organizing theories about watersheds yet to be discovered, since machine
learning models are able to find informative patterns in multibasin data sets that our current models do
not reproduce.
The power of big data and machine learning for problems like this is that such techniques can synthesize
information from multiple sites and situations into a single model. As an example, if we were to want to
simulate catchment behavior under nonstationary conditions (e.g., evolving climate), then a single LSTM
trained to recognize and distinguish different types of hydrological behavior (as shown here) will have a
larger range of conditions where it can be expected to remain realistic than a model calibrated to past conditions in only a single basin.
In our opinion, the most effective strategy moving forward will probably be theory-guided data science (Karpatne et al., 2017). There are now numerous strategies across scientific disciplines that allow for mean-
ingful fusions of domain knowledge with machine learning and other algorithms for learning and predicting
directly from data. Adopting approaches like this will be critical moving forward.

6. Code and Data Availability


CAMELS data, including SAC-SMA simulations, are available from NCAR (https://ral.ucar.edu/solutions/products/camels). National Water Model reanalysis data are available from the NOAA Big Data Repository (https://registry.opendata.aws/nwm-archive/). All code used for this project is available at https://github.com/kratzert/lstm_for_pub.


Acknowledgments
The project relies heavily on open source software. All programming was done in Python version 3.7 (van Rossum, 1995) and associated libraries including: Numpy (Van Der Walt et al., 2011), Pandas (McKinney, 2010), PyTorch (Paszke et al., 2017), and Matplotlib (Hunter, 2007). This work was supported by Bosch, ZF, and Google. We thank the NVIDIA Corporation for the GPU donations, LIT with Grant LIT-2017-3-YOU-003 and FWF Grant P 28660-N31.

References
Addor, N., Nearing, G., Prieto, C., Newman, A., Le Vine, N., & Clark, M. (2018). A ranking of hydrological signatures based on their predictability in space. Water Resources Research, 54, 8792–8812. https://doi.org/10.1029/2018WR022606
Addor, N., Newman, A., Mizukami, N., & Clark, M. P. (2017). Catchment attributes for large-sample studies. https://doi.org/10.5065/D6G73C3Q
Addor, N., Newman, A. J., Mizukami, N., & Clark, M. P. (2017). The CAMELS data set: Catchment attributes and meteorology for large-sample studies. Hydrology and Earth System Sciences (HESS), 21(10), 5293–5313.
AlQuraishi, M. (2019). End-to-end differentiable learning of protein structure. Cell Systems, 8(4), 292–301.
Blöschl, G. (2016). Predictions in ungauged basins—Where do we stand? Proceedings of the International Association of Hydrological Sciences, 373, 57–60.
Duan, Q., Gupta, V. K., & Sorooshian, S. (1993). Shuffled complex evolution approach for effective and efficient global minimization. Journal of Optimization Theory and Applications, 76(3), 501–521.
Fekete, B. M., Robarts, R. D., Kumagai, M., Nachtnebel, H.-P., Odada, E., & Zhulidov, A. V. (2015). Time for in situ renaissance. Science, 349(6249), 685–686.
Goswami, M., Oconnor, K., & Bhattarai, K. (2007). Development of regionalisation procedures using a multi-model approach for flow
simulation in an ungauged catchment. Journal of Hydrology, 333(2-4), 517–531.
He, S., Li, Y., Feng, Y., Ho, S., Ravanbakhsh, S., Chen, W., & Póczos, B. (2019). Learning to predict the cosmological structure formation.
Proceedings of the National Academy of Sciences, 116, 13,825–13,832.
Hochreiter, S., & Schmidhuber, J. (1997). Long short-term memory. Neural computation, 9(8), 1735–1780.
Hrachowitz, M., Savenije, H., Blöschl, G, McDonnell, J., Sivapalan, M., Pomeroy, J., et al. (2013). A decade of Predictions in Ungauged
Basins (PUB)—A review. Hydrological sciences journal, 58(6), 1198–1255.
Hsu, K.-l., Gupta, H. V., & Sorooshian, S. (1995). Artificial neural network modeling of the rainfall-runoff process. Water resources research,
31(10), 2517–2530.
Hunter, J. D. (2007). Matplotlib: A 2D graphics environment. Computing In Science & Engineering, 9(3), 90–95.
Karpatne, A., Atluri, G., Faghmous, J. H., Steinbach, M., Banerjee, A., Ganguly, A., et al. (2017). Theory-guided data science: A new
paradigm for scientific discovery from data. IEEE Transactions on Knowledge and Data Engineering, 29(10), 2318–2331.
Kingma, D. P., & Ba, J. (2014). Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.
Kirchner, J. W. (2006). Getting the right answers for the right reasons: Linking measurements, analyses, and models to advance the science
of hydrology. Water Resources Research, 42, W03S04. https://doi.org/10.1029/2005WR005362
Klemeš, V. (1986). Dilettantism in hydrology: Transition or destiny? Water Resources Research, 22(9S), 177S–188S.
Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., & Klambauer, G. (2018). Do internals of neural networks make sense in the context
of hydrology? In Proceedings of the 2018 AGU fall meeting.Washington, DC.
Kratzert, F., Klotz, D., Brenner, C., Schulz, K., & Herrnegger, M. (2018). Rainfall–runoff modelling using long short-term memory (LSTM)
networks. Hydrology and Earth System Sciences, 22(11), 6005–6022.
Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., & Nearing, G. (2019). Towards learning universal, regional, and local
hydrological behaviors via machine learning applied to large-sample datasets. Hydrology and Earth System Sciences, 23, 5089–5110.
https://doi.org/10.5194/hess-23-5089-2019
Liu, Y., Racah, E., Correa, J., Khosrowshahi, A., Lavers, D., Kunkel, K., et al. (2016). Application of deep convolutional neural networks
for detecting extreme weather in climate datasets. arXiv preprint arXiv:1605.01156.
Mayr, A., Klambauer, G., Unterthiner, T., & Hochreiter, S. (2016). Deeptox: Toxicity prediction using deep learning. Frontiers in
Environmental Science, 3, 80.
McAfee, A., & Brynjolfsson, E. (2017). Machine, platform, crowd: Harnessing our digital future. New York, NY: WW Norton & Company.
McKinney, W. (2010). Data structures for statistical computing in Python. Proceedings of the 9th Python in Science Conference,
1697900(Scipy), 51–56.
Milly, P. C. D., Betancourt, J., Falkenmark, M., Hirsch, R. M., Kundzewicz, Z. W., Lettenmaier, D. P., & Stouffer, R. J. (2008). Stationarity is
dead: Whither water management? Science, 319(5863), 573–574.
Nearing, G. S., & Gupta, H. V. (2015). The quantity and quality of information in hydrologic models. Water Resources Research, 51, 524–538.
https://doi.org/10.1002/2014WR015895
Nearing, G. S., Mocko, D. M., Peters-Lidard, C. D., Kumar, S. V., & Xia, Y. (2016). Benchmarking NLDAS-2 soil moisture and evapotran-
spiration to separate uncertainty contributions. Journal of Hydrometeorology, 17(3), 745–759.
Nearing, G. S., Ruddell, B. L., Clark, M. P., Nijssen, B., & Peters-Lidard, C. (2018). Benchmarking and process diagnostics of land models.
Journal of Hydrometeorology, 19(11), 1835–1852.
Newman, A., Clark, M., Sampson, K., Wood, A., Hay, L., Bock, A., et al. (2015). Development of a large-sample watershed-scale
hydrometeorological data set for the contiguous USA: Data set characteristics and assessment of regional variability in hydrologic model
performance. Hydrology and Earth System Sciences, 19(1), 209–223.
Newman, A., Sampson, K., Clark, M. P., Bock, A., Viger, R. J., & Blodgett, D. (2014). A large-sample watershed-scale hydrometeorological
dataset for the contiguous USA. Boulder, CO: UCAR/NCAR. https://doi.org/10.5065/D6MW2F4D
Newman, A. J., Mizukami, N., Clark, M. P., Wood, A. W., Nijssen, B., & Nearing, G. (2017). Benchmarking of a physically based hydrologic
model. Journal of Hydrometeorology, 18(8), 2215–2225.
Parajka, J., Viglione, A., Rogger, M., Salinas, J., Sivapalan, M., & Blöschl, G (2013). Comparative assessment of predictions in ungauged
basins—Part 1: Runoff-hydrograph studies. Hydrology and Earth System Sciences, 17(5), 1783–1795.
Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., et al. (2017). Automatic differentiation in PyTorch.
Prieto, C., Le Vine, N., Kavetski, D., Garcia, E., & Medina, R. (2019). Flow prediction in ungauged catchments using probabilistic
random forests regionalization and new statistical adequacy tests. Water Resources Research, 55, 4364–4392. https://doi.org/10.1029/2018WR023254
Razavi, T., & Coulibaly, P. (2012). Streamflow prediction in ungauged basins: Review of regionalization methods. Journal of Hydrologic
Engineering, 18(8), 958–975.
Salas, F. R., Somos-Valenzuela, M. A., Dugger, A., Maidment, D. R., Gochis, D. J., David, C. H., et al. (2018). Towards real-time continental
scale streamflow simulation in continuous and discrete space. JAWRA Journal of the American Water Resources Association, 54(1), 7–27.
Samaniego, L., Kumar, R., & Attinger, S. (2010). Multiscale parameter regionalization of a grid-based hydrologic model at the mesoscale.
Water Resources Research, 46, W05523. https://doi.org/10.1029/2008WR007327


Samaniego, L., Kumar, R., Thober, S., Rakovec, O., Zink, M., Wanders, N., et al. (2017). Toward seamless hydrologic predictions across
spatial scales. Hydrology and Earth System Sciences, 21(9), 4323–4346. https://doi.org/10.5194/hess-21-4323-2017
Sellars, S. (2018). “Grand challenges” in big data and the earth sciences. Bulletin of the American Meteorological Society, 99(6), ES95–ES98.
Sivapalan, M. (2003). Prediction in ungauged basins: A grand challenge for theoretical hydrology. Hydrological Processes, 17(15), 3163–3170.
Sivapalan, M., Takeuchi, K., Franks, S., Gupta, V., Karambiri, H., Lakshmi, V., et al. (2003). IAHS decade on Predictions in Ungauged
Basins (PUB), 2003–2012: Shaping an exciting future for the hydrological sciences. Hydrological sciences journal, 48(6), 857–880.
Van Der Walt, S., Colbert, S. C., & Varoquaux, G. (2011). The NumPy array: A structure for efficient numerical computation. Computing
in Science and Engineering, 13(2), 22–30.
van Rossum, G. (1995). Python tutorial (Technical Report CS-R9526). Amsterdam: Centrum voor Wiskunde en Informatica (CWI).
Vaze, J., Chiew, F., Hughes, D., & Andréassian, V. (2015). Preface: Hs02–hydrologic non-stationarity and extrapolating models to predict
the future. Proceedings of the International Association of Hydrological Sciences, 371, 1–2.
Vrugt, J. A., Gupta, H. V., Dekker, S. C., Sorooshian, S., Wagener, T., & Bouten, W. (2006). Application of stochastic parameter optimization
to the Sacramento Soil Moisture Accounting Model. Journal of Hydrology, 325(1-4), 288–307.
Xia, Y., Mitchell, K., Ek, M., Cosgrove, B., Sheffield, J., Luo, L., et al. (2012). Continental-scale water and energy flux analysis and validation
for North American land data assimilation system project phase 2 (NLDAS-2): 2. Validation of model-simulated streamflow. Journal of
Geophysical Research, 117, D03110. https://doi.org/10.1029/2011JD016051

NeuralHydrology - Interpreting LSTMs in Hydrology

Frederik Kratzert1*, Mathew Herrnegger2**, Daniel Klotz1, Sepp Hochreiter1, and Günter Klambauer1

1 LIT AI Lab & Institute for Machine Learning, Johannes Kepler University Linz, A-4040 Linz, Austria
2 Institute of Hydrology and Watermanagement, University of Natural Resources and Life Sciences, Vienna, A-1190 Vienna, Austria

Abstract. Despite the huge success of Long Short-Term Memory networks, their applications in environmental sciences are scarce. We argue
that one reason is the difficulty to interpret the internals of trained net-
works. In this study, we look at the application of LSTMs for rainfall-
runoff forecasting, one of the central tasks in the field of hydrology, in
which the river discharge has to be predicted from meteorological obser-
vations. LSTMs are particularly well-suited for this problem since mem-
ory cells can represent dynamic reservoirs and storages, which are essen-
tial components in state-space modelling approaches of the hydrological
system. On basis of two different catchments, one with snow influence and
one without, we demonstrate how the trained model can be analyzed and
interpreted. In the process, we show that the network internally learns to
represent patterns that are consistent with our qualitative understanding
of the hydrological system.

Keywords: Neural Networks · LSTM · Interpretability · Hydrology · Rainfall-Runoff modelling

1 Introduction

Describing the relationship between rainfall and runoff is one of the central
tasks in the field of hydrology [21]. This involves the prediction of the river
discharge from meteorological observations from a river basin. The basin or
catchment of a river is defined by the area from which all (surface) runoff drains
to a common outlet [40]. Predicting the discharge of the river is necessary for
e.g. flood forecasting, the design of flood protection measures, or the efficient
management of hydropower plants.
* 1st corresponding author: [email protected]
** 2nd corresponding author: [email protected]


Within the basin of a river, various hydrological processes take place that
influence and lead to the river discharge, including, for example, evapotranspira-
tion, where water is lost to the atmosphere, snow accumulation and snow melt,
water movement in the soil or groundwater recharge and discharge (see Fig. 1).

Fig. 1. Simplified visualization of processes and fluxes that influence the river dis-
charge, such as precipitation, snow melt, surface runoff or subsurface flows.

The hydrological processes have highly non-linear interactions and depend, to a large degree, on the states of the system, which represent the memory of,
e.g. a river basin. Consequently, hydrological models are formulated in a state-
space approach where the states at a specific time depend on the input It , the
system states at the previous time step St−1 , and a set of parameters Θi [14]:

St = f (It , St−1 ; Θi ) (1)

The discharge at a given time step t is driven by the system states and in
consequence by the meteorological events of the preceding time steps. More gen-
erally, any output Ot of a hydrological system (e.g. the runoff) can be described
as:

Ot = g(It , St ; Θj ), (2)


where g(·) is a mapping function that connects the states of the system and
the inputs to the system output, and Θj is the corresponding subset of model
parameters.
Because of these non-linearities it is inevitable (at least in classical process-based hydrological models) to explicitly implement the hydrological processes in order to make proficient predictions [15,25,32,38]. However, defining the mathematical repre-
sentations of the processes, including the model structures and determining their
effective parameters so that the resulting system exhibits good performance and
generalizable properties (e.g. in the form of seamless parameter fields) still re-
mains a challenge in the field of hydrology [11,22,35].
A significant problem and limiting factor in this context is the frequently
missing information regarding the physical properties of the system [5,10]. These
tend to be highly heterogeneous in space (e.g. soil characteristics) and can addi-
tionally change over time, e.g. vegetation cover. Our knowledge of the properties
on, or near the surface has increased significantly in the last decades. This is
mainly due to advances in high-resolution air- as well as spaceborne remote sens-
ing [7,13,29]. However, hydrology, to a significant part, takes place underground,
for which detailed information is rarely available. In essence, the process-based
models try to describe a system determined by spatially and temporally dis-
tributed system states and physical parameters, which are most of the time
unknown.
In contrast, data-driven methods, such as Neural Networks, are solely trained
to predict the discharge, given meteorological observations, and do not necessi-
tate an explicit definition of the underlying processes. However, these models do not have the best reputation among many hydrologists because of the prevailing opinion
that models “must work well for the right reasons” [20]. However, due to their
predictive power, first studies of using Neural Networks for predicting the river
discharge date back to the early 90s [9,12].
Recently, Kratzert et al. [23] used Long Short-Term Memory networks (LSTMs)
[17] for daily rainfall-runoff modelling and showed that LSTMs achieve com-
petitive results, compared to the well established Sacramento Soil Moisture Ac-
counting Model [8] coupled with the Snow-17 snow model [2]. LSTM is an es-
pecially well-suited network architecture for Hydrological applications, since the
evolution of states can be modelled explicitly through time and mapped to a
given output. The approach is very similar to rainfall-runoff models defined by
Equations 1-2 (in the case of the LSTM the system states are the memory cell
states and the parameters are the learnable network weights, [23]).
The aim of this chapter is to show different possibilities that enable the
interpretation of the LSTM and its internals in the context of rainfall-runoff
simulations. Concretely, we explore and investigate the following questions: How
many days of the past influence the output of the network at a given day of the
year? Do some of the memory cells correlate with hydrological states? If yes,
which input variables influence these cells and how? Answering these questions
is important to a) gain confidence in data-driven models, e.g. in case of the
necessity for extrapolation, b) have tools to understand possible mistakes and


difficulties in the learning process and c) potentially learn from findings for future
applications.

1.1 Related work


In the field of water resources and hydrology, a lot of effort has been made on
interpreting neural networks and analyzing the importance of input variables (see
[6] for an overview). However, so far only feed-forward neural networks have been
applied in these studies. Only recently, Kratzert et al. [23] have demonstrated
the potential use of LSTMs for the task of rainfall-runoff modelling. In their
work they have also shown that memory cells with interpretable functions exist,
which were found by visual inspection.
Outside of hydrology, LSTMs have found a wide range of applications, which attracted researchers to analyze and interpret the network internals.
For example Hochreiter et al. [16] found new protein motifs through analyzing
LSTM memory cells. Karpathy et al. [18] inspected memory cells in character
level language modelling and identified some interpretable memory cells, e.g. cells that track the line length or cells that check if the current text is inside brackets or quotation marks. Li et al. [24] inspected trained LSTMs in the application of
sentence- and phrase-based sentiment classification and showed through saliency
analysis, which parts of the inputs are influencing the network prediction most.
Arras et al. [3] used Layer-wise Relevance Propagation to calculate the impact
of single words on sentiment analysis from text sequences. Also for sentiment
analysis, Murdoch et al. [28] present a decomposition strategy for the hidden
and cell state of the LSTM to extract the contributions of single words on the
overall prediction. Poerner et al. [33] summarize various interpretability efforts
in the natural language processing domain and present an extension of the LIME
framework, introduced originally by Ribeiro et al. [34]. Strobelt et al. [36] de-
veloped LSTMVis, a general purpose visual analysis tool for inspecting hidden
state values in recurrent neural networks.
Inspired by these studies, we investigate the internals of LSTMs in the do-
main of environmental science and compare our findings to hydrological domain
knowledge.

2 Methods
2.1 Model architecture
In this study, we will use a network consisting of a single LSTM layer with
10 hidden units and a dense layer that connects the output of the LSTM at
the last time step to a single output neuron with linear activation. To predict
the discharge of a single time step (day) we provide the last 365 time steps of
meteorological observations as inputs. Compared to Eq. 1 we can formulate the
LSTM as:

{ct , ht } = fLSTM (it , ct−1 , ht−1 ; Θk ), (3)


where fLSTM (·) symbolizes the LSTM cell that is a function of the meteoro-
logical input it at time t, and the previous cell state ct−1 as well as the previous
hidden state ht−1 , parametrized by the network weights Θk . The output of the
system, formally described in Eq. 2, would in this specific case be given by:

y = fDense (h365 ; Θl ), (4)


where y is the output of a dense layer fDense (·) parametrized by the weights
Θl , which predicts the river discharge from the hidden state at the end of the
input sequence h365 .
The difference between the LSTM and conventional rainfall-runoff models is
that the former has the ability to infer the needed structure/parametrization
from data without preconceived assumptions about the nature of the processes.
This makes them extremely attractive for hydrological applications.
The network is trained for 50 epochs to minimize the mean squared error
using RMSprop [39] with an initial learning rate of 1e-2. The final model is
selected based on the score of an independent validation set.
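A minimal PyTorch sketch of this architecture and training setup is given below; the implementation actually used for the experiments may differ in details:

```python
import torch
import torch.nn as nn

class LSTMRunoffModel(nn.Module):
    """Single LSTM layer with 10 hidden units and a linear output neuron.
    Input: sequences of 365 daily time steps with 5 meteorological variables."""

    def __init__(self, n_inputs: int = 5, hidden_size: int = 10):
        super().__init__()
        self.lstm = nn.LSTM(input_size=n_inputs, hidden_size=hidden_size,
                            batch_first=True)
        self.head = nn.Linear(hidden_size, 1)

    def forward(self, x):
        # x: (batch, 365, n_inputs); h_n is the hidden state of the last time step
        _, (h_n, _) = self.lstm(x)
        return self.head(h_n[-1])  # predicted discharge, shape (batch, 1)

model = LSTMRunoffModel()
optimizer = torch.optim.RMSprop(model.parameters(), lr=1e-2)
loss_fn = nn.MSELoss()
```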

2.2 Data

In this work, we concentrate on two different basins from the publicly available
CAMELS data set [1,31]: basin A, which is influenced by snow, and basin B, which is not. Some key attributes of both basins can be found
in Table 1.

Table 1. Basin overview.

Basin   ID (USGS gauge)                   Snow fraction*   Area (km²)   NSE validation   NSE test
A       13340600 (Clearwater River, CA)   56 %             3357         0.79             0.76
B       11481200 (Little River, CA)       0 %              105          0.72             0.72

* Fraction of precipitation falling with temperatures below 0 °C.

For meteorological forcings, the data set contains basin averaged daily records
of precipitation (mm/d), solar radiation (W/m2 ), minimum and maximum tem-
perature (◦ C) and vapor pressure (Pa). The streamflow is reported as daily
average (m3 /s) and is normalized by the basin area (m2 ) to (mm/d). Approxi-
mately 33 years of data is available, of which we use the first 15 for training the
LSTMs. Of the remaining years the first 25 % is used as validation data by which
we select the final model. The remaining data points (approx. 13 years) are used
for the final evaluation and for all experiments in this study. The meteorological


input features, as well as the target variable (the discharge), are normalized by
the mean and standard deviation of the training period.
One LSTM is trained for each basin separately and the trained model is
evaluated using the Nash-Sutcliffe-Efficiency [30], an established measure used
to evaluate hydrological time series given by the following equation:
NSE = 1 − ( Σ_{t=1}^{T} (Q_m^t − Q_o^t)² ) / ( Σ_{t=1}^{T} (Q_o^t − Q̄_o)² ),    (5)

where T is the total number of time steps, Q_m^t is the simulated discharge at time t (1 ≤ t ≤ T), Q_o^t is the observed discharge at time t and Q̄_o is the
mean observed discharge. The range of the NSE is (-inf, 1], where a value of
1 means a perfect simulation, a NSE of 0 means the simulation is as good as
the mean of the observation and everything below zero means the simulation is
worse compared to using the observed mean as a prediction.
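Equation 5 translates directly into code, for example as a short sketch:

```python
import numpy as np

def nse(q_sim, q_obs):
    """Nash-Sutcliffe efficiency (Eq. 5) of a simulated discharge series."""
    q_sim, q_obs = np.asarray(q_sim), np.asarray(q_obs)
    numerator = np.sum((q_sim - q_obs) ** 2)
    denominator = np.sum((q_obs - q_obs.mean()) ** 2)
    return 1.0 - numerator / denominator
```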
In the test period the LSTM achieves a NSE of above 0.7 (see Table 1), which
can be considered a reasonably good result [27].

[Figure 2: hydrograph for basin A (13340600), showing daily discharge (mm/d) and daily precipitation (mm/d, plotted upside down) from October 2011 to July 2013, with observed and predicted series.]

Fig. 2. Example of predicted (dashed line) and observed discharge (solid line) of two
years of the test period in the snow influenced basin. Corresponding daily precipitation
sums are plotted upside down, where snow (temperature below 0◦ C) is plotted darker.

Figure 2 shows observed and simulated discharge of two years of the test
period in the snow influenced basin A, as well as the input variable precipitation.
We can see that the discharge has its peak in the spring/early summer and that
the model in the year 2012 underestimates the discharge, while in the second
year it fits the observed discharge pretty well. The time lag between precipitation
and discharge can be explained by snow accumulation in the winter months and
subsequent melt of this snow layer in spring.


2.3 Integrated gradients

Different methods have been presented recently to analyze the attribution of in-
put variables on the network output (or any in-between neuron) (e.g. [3,4,19,26,37]). In this study we focus on integrated gradients by Sundararajan et al. [37]. Here,
the attribution of each input to e.g. the output neuron is calculated by looking
at the change of this neuron when the input shifts from a baseline input to the
target input of interest. Formally, let x be the input of interest (in our case a sequence of 365 time steps with 5 meteorological observations each), x′ the baseline input and F(·) the neuron of interest. Then the integrated gradients for the i-th input variable x_i can be approximated by:

IntegratedGrads_i^approx(x) := (x_i − x′_i)/m · Σ_{k=1}^{m} ∂F(x̃)/∂x̃_i |_{x̃ = x′ + (k/m)(x − x′)},    (6)

where m is the number of steps used to approximate the integral (here m = 1000). As baseline x′, we used an input sequence of zeros.
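A naive sketch of this approximation for a single input sequence is shown below (PyTorch). The target_fn argument selects the scalar neuron of interest (e.g. the discharge prediction or a single memory cell) and is a convention of this sketch, not of the original implementation:

```python
import torch

def integrated_gradients(model, x, target_fn, m: int = 1000):
    """Approximate Eq. 6 for one input sequence x of shape (1, 365, 5),
    using a baseline of zeros as described in the text."""
    baseline = torch.zeros_like(x)
    grads = torch.zeros_like(x)
    for k in range(1, m + 1):
        x_interp = (baseline + (k / m) * (x - baseline)).requires_grad_(True)
        out = target_fn(model(x_interp))  # scalar neuron of interest
        out.backward()
        grads += x_interp.grad
    return (x - baseline) * grads / m     # attribution per input and time step
```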

2.4 Experiments

Question 1: How many days of past influence the network output?


The discharge of a river in a seasonally influenced region varies strongly throughout the year. In snow influenced basins, for example, the discharge usually peaks in the
spring or early summer, when not only precipitation and groundwater but also
snow melt contributes to the discharge generation. Therefore, at least from a
hydrological point of view, the precipitation of the entire winter might be in-
fluential for the correct prediction of the discharge. In contrast, in drier periods
(e.g. here at the end of summer) the discharge likely depends on far fewer time
steps of the meteorological past. Since we provide a constant number of time
steps (365 days) of meteorological data as input, it is interesting to see how
many of the previous days are really used by the LSTM for the final prediction.
To answer this question, we calculate the integrated gradients for one sample
of the test period w.r.t. the input variables and sum the integrated gradients
across the features for each time step. We then calculate the difference from
time step to time step and determine the first time step n (1 ≤ n ≤ T) at which
the difference surpasses a threshold of 2e-3, with T being the total length of the
sequence. We have chosen the threshold value empirically so that noise in the
integrated gradient signal is ignored. For each sample the number of Time Steps
Of Influence (TSOI) on the final prediction can then be calculated by:

TSOI = T − n (7)

This is repeated for each day of each year in the test period.
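A sketch of this procedure is given below; whether the per-time-step sums and step-to-step differences are taken in absolute value is an assumption of this sketch:

```python
import numpy as np

def time_steps_of_influence(ig, threshold: float = 2e-3):
    """TSOI (Eq. 7) from the integrated gradients of one input sequence.

    ig: array of shape (T, n_features) holding the integrated gradients."""
    per_step = np.abs(ig).sum(axis=1)   # aggregated attribution per time step
    diffs = np.abs(np.diff(per_step))   # change from time step to time step
    above = np.where(diffs > threshold)[0]
    if len(above) == 0:
        return 0                        # no time step exceeds the threshold
    n = above[0] + 1                    # first time step surpassing the threshold
    return len(per_step) - n            # TSOI = T - n
```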


Question 2: Do memory cells correlate with hydrological states?


The discharge of a river basin is frequently approximated by decomposing its
(hypothetical) components into a set of interacting reservoirs or storages (see Fig.
1). Take snow as an example, which is precipitation that falls if temperatures
are below 0◦ C. It can be represented in a storage S (see Eq. 1), which generally
accumulates during the winter period and depletes from spring to summer when
the temperatures rise above the melting point. Similarly other components of
the system can be modelled as reservoirs of lower or higher complexity. The soil
layer, for example, can also be represented by a bucket, which is filled - up to a
certain point - by incoming water (e.g. rainfall) and depleted by evapotranspi-
ration, horizontal subsurface flow and water movement into deeper layers, e.g.
the groundwater body.
Theoretically, memory cells of LSTMs could learn to mimic these storage pro-
cesses. This is a crucial property, at least from a hydrological point of view, to
be able to correctly predict the river discharge. Therefore the aim of this exper-
iment is to see if certain memory cells ct (Eq. 3) correlate to these hydrological
states St (Eq. 1).
Because the CAMELS data does not include observations for these states,
we take the system states of the included SAC-SMA + Snow-17 model as a
proxy. This is far from optimal, but since this is a well established and studied
hydrological model, we can assume that at least the trend and tendencies of these
simulations are correct. Furthermore, we only want to test in this experiment if
memory cells correlate with these system states and not if they quantitatively
match these states exactly. Of the calibrated SAC-SMA + Snow-17 we use the
following states as a reference in this experiment:

– SWE (snow water equivalent): This is the amount of water stored in the
snow layer. This would be available if the entire snow in the system would
melt.
– UZS (upper zone storage): This state is calculated as the sum of the UZTWC
(upper zone tension water storage content) and the UZFWC (upper zone free
water storage) of the SAC-SMA + Snow-17 model. This storage represents
upper layer soil moisture and controls the fast response of the soil, e.g. direct
surface runoff and interflow.
– LZS (lower zone storage): This state is calculated by the sum of LZTWC
(lower zone tension water content), LZFSC (lower zone free supplemental
water storage content) and LZFPC (lower zone free primary water storage
content). This storage represents the groundwater storage and is relevant for
the baseflowv .

For each sample in the test period we calculate the correlation of the cell
states with the corresponding time series of these states.
v Following [40] the baseflow is defined as: "Discharge which enters a stream channel mainly from groundwater, but also from lakes and glaciers, during long periods when no precipitation or snowmelt occurs."
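A sketch of this correlation analysis for one reference state (e.g. SWE from SAC-SMA + Snow-17) could look as follows:

```python
import numpy as np

def cell_state_correlations(cell_states, reference_state):
    """Pearson correlation of each LSTM memory cell with a reference time series.

    cell_states     : array of shape (T, n_cells) with the cell states over time
    reference_state : array of shape (T,) with the hydrological state (proxy)"""
    return np.array([
        np.corrcoef(cell_states[:, i], reference_state)[0, 1]
        for i in range(cell_states.shape[1])
    ])
```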


Question 3: Which inputs influence a specific memory cell?


Suppose that we find memory cells that correlate with time series of hydro-
logical system states, then a natural question would be if the inputs influencing
these memory cells agree with our understanding of the hydrological system.
For example, a storage that represents the snow layer in a catchment should be influenced by precipitation and solar radiation in opposing ways. Solid
precipitation or snow would increase the amount of snow available in the system
during winter. At the same time, solar radiation, providing energy for subli-
mation, would effectively reduce the amount of snow stored in the snow layer.
Therefore, in this experiment we look in more detail at specific cells that emerged
from the previous experiment and analyse the variables influencing these cells.
We do this by calculating the integrated gradients from a single memory cell at
the last time step of the input sequence w.r.t. the input variables.

3 Results and Discussion

3.1 Timesteps influencing the network output

Figure 3 shows how many time steps with meteorological inputs from the past
have an influence on the LSTM output at the time step of prediction (TSOI).
The TSOI does thereby not differentiate between single inputs. It is rather the
integrated signal of all inputs. Instead of using specific dates, we here show the
temporal dimension in the unit day of year (DOY). Because all years of the test
period are integrated in the plot, we show the 25 %, 50 % and 75 % percent
quantiles. For the sake of interpretation and the seasonal context, Fig. 3 also
includes the temporal dynamics of the median precipitation, temperature and
discharge.
The left column of Fig. 3 shows the results for snow influenced basin A. Here,
3 different periods can be distinguished in the TSOI time series:
(1) Between DOY 200 and 300 the TSOI shows very low values of less than
10-20 days. This period corresponds to the summer season characterised by high
temperatures and low flows, with fairly little precipitation. In this period the
time span of influence of the inputs on the output is short. From a hydrological
perspective this makes sense, since the discharge in this period is driven by short-
term rainfall events, which lead to a very limited response in the discharge. The
short time span of influence can be explained by higher evapotranspiration rates
in this season and lower precipitation amounts. Higher evapotranspiration rates
lead to the loss of precipitation to the atmosphere, which is then missing in the
discharge. The behaviour of the hydrograph is fairly easy to predict and does
not necessitate much information about the past.
(2) In the winter period, starting with DOY 300, the TSOI increases over
time reaching a plateau of 140-150 days around DOY 100. In this period the daily
minimum temperature is below 0◦ C, leading to solid precipitation (snow) and
therefore water being stored as snow in the catchment without leading to runoff.
This is underlined with the low discharge values, despite high precipitation input.


[Figure 3: two-column panel plot (basin A - 13340600 left, basin B - 11481200 right) showing TSOI, mean precipitation (mm/d), discharge (mm/d) and minimum temperature (°C) over the day of the year, with medians and 25/75 % quantiles.]

Fig. 3. Time steps of influence (TSOI) on the network output over the unit day of year
(DOY) for the snow influenced basin A (left column) and the basin B without snow
influence (right column). Corresponding median precipitation, discharge and minimum
temperature are shown for reference. For the snow-influenced basin A, we can see
for example that the TSOI increases during the winter period and is largest during
the snow melting period (∼DOY 100-160), which matches our understanding of the
hydrological processes.


Thus the LSTM has to understand that this input does not lead to an immediate
output and therefore the TSOI has to increase. The plateau is reached, as soon
as the minimum temperature values are higher than the freezing point. From a
hydrological perspective, it is interesting to observe that the TSOI at the end of
the winter season has a value which corresponds to the beginning of the winter
season (∼DOY 300), when the snow accumulation begins. It should be noted
that the transition between the winter period and the following spring season is
not sharp, at least when taking the hydrograph as a reference. It is visible that,
although the TSOI is still increasing, the discharge is also increasing. From a
hydrological point of view this can be explained by a mixed signal in discharge -
appart from melting snow (daily maximum temperature is larger than 0◦ C), we
still have negative minimum temperatures, which would lead to snow fall.
(3) In spring, during the melting season, the TSOI stays constant (DOY
100-160) followed by a sharp decrease until the summer low flows. During the
main melting period, the TSOI of approximately 140-150 days highlights that
the LSTM uses the entire information of the winter period to predict the dis-
charge. The precipitation in this period now falls as rain, directly leading to
runoff, without increasing the TSOI. At the same time, all the inputs from the
winter still influence the river discharge explaining the stable plateau. The sharp
decrease of the TSOI around DOY 160 represents a transition period where the
snow of the winter continuously loses its influence until all snow has melted.
Although it has the same precipitation seasonality, Basin B (Fig. 3, right
column) has different characteristics compared to basin A, since it is not influ-
enced by snow. Here, only 2 different periods can be distinguished, where the
transitions periods are however more pronounced:
(1) Between DOY 180 and 280, the warm and dry summer season, the catch-
ment is characterised by very low flows. In this period the TSOI is also constantly
low, with values of around 10-15 days. The discharge can be predicted with a
very short input time series, since rainfall as input is missing and the hydrology
does not depend on any inputs.
(2) Following summer, the wet period between DOY 280 and 180 is charac-
terised by a steady increase in the TSOI. The temporal influence of rainfall on
runoff becomes longer, the longer the season lasts. The general level of runoff
is now significantly higher compared to the summer. It does not solely depend
on single rainfall events, but is driven by the integrated signal of inputs from
the past, explaining the increasing TSOI. The TSOI reaches a maximum median
value of around 120 days. This peak is followed by a decrease towards the end
of the wet period, which is however not as rapid as in basin A. This distinct
transition period in TSOI between wet and dry (∼DOY 100-180) season corre-
sponds very well with the observed falling limb in the hydrograph. As the runoff
declines, the influence of past meteorological inputs also declines. Compared to
basin A, a higher variability in TSOI is evident, which can be explained with
a higher variability in precipitation inputs in the single years. In basin A, a
high interannual variability in the rainfall is also observable. However, the lower
temperatures below freezing level lead to the precipitation falling as snow, and


therefore act as a filter. This leads to a lower variability in discharge and in consequence in TSOI.
Overall, the TSOI results of the two contrasting basins match well with our
hydrological understanding of the preceding days influencing the runoff signal at a
specific day. It is interesting to see that the LSTM shows the capability to learn
these differing, basin specific properties of long-term dependencies.

3.2 Correlation of memory cells with hydrological states

Figure 4 shows the average correlation of every memory cell with the hydrological
states considered in this experiment. The correlation is averaged over all samples
in the test period and only correlations with ρ > 0.5 are shown. We can see that in both basins some of the memory cells have a particularly high correlation with the provided hydrological states. For both basins several cells exhibit a high correlation with both the upper (UZS) and lower (LZS) soil states. Although
the majority of the cells show a positive correlation, negative correlations are
also visible, which are however of a lower absolute magnitude.

[Figure 4: correlation matrices between memory cells (Cell 0 to Cell 9) and the states SWE, UZS and LZS for basin A - 13340600 (top) and UZS and LZS for basin B - 11481200 (bottom); correlation scale from -1 to 1.]

Fig. 4. Average correlations between memory cells and hydrological states for basin
A (upper plot) and basin B (lower plot). Only correlations with ρ > 0.5 are shown.
Ellipses are scaled by the absolute correlation and rotated by their sign (left inclined for negative correlation and right inclined for positive correlation).

The correlation between the LSTM cells and the baseflow influencing state
LZS is significantly higher for the drier basin B. However, the baseflow index,
a measure to define the importance of the contribution of the baseflow to total


runoff, is lower for this basin. In basins A and B the ratios of mean daily base-
flow to mean daily discharge are about 66 % and 44 %, respectively. Currently,
we cannot explain this discrepancy. In the snow influenced basin, the trained
LSTM also has some cells with high correlation with the snow-water-equivalent
(SWE). The occurrence of multiple cells with high correlation to different sys-
tem states can be seen as an indicator that the LSTM is not yet defined in a
parsimonious way. Therefore, hydrologists can use this information to restrict the
neural network even further.
In general, the correlation analysis is difficult to interpret in detail. However, high correlations frequently exist, indicating a strong relationship be-
tween LSTM memory cells and system states from a well-established hydrologi-
cal model.

3.3 Inspection of memory cells


In the first experiment, we used the integrated gradient method to calculate
the attribution of the input variables to the output of the model. In this
experiment, we apply it to analyse interactions between arbitrary neurons within
the neural network (here, e.g., a memory cell at the last time step). The previous
experiment showed that memory cells with a high correlation to some of the
hydrological system states exist. The aim of this experiment is therefore to
analyse the influences and functionality of a given cell. Here, we can explore (i)
which (meteorological) inputs are important and (ii) at what time in the past
this influence was high. We chose a “snow-cell” from the previous experiment to
demonstrate this task, since the accumulation and depletion of snow is a
particularly illustrative example. To be more precise, we depict a single sample of
the test period from the cell with the highest correlation to the SWE, which is
memory cell 4 of basin A (see Fig. 4). Figure 5 shows the integrated gradients of
the meteorological inputs in the top row, the evolution of the memory cell value in
the second row and the corresponding minimum and maximum temperature in the third
row.
One can see that the snow-cell is mainly influenced by the meteorological
inputs of precipitation, minimum temperature, maximum temperature and solar
radiation. Precipitation shows the largest magnitude of positive influence. Solar
radiation, in contrast, has a negative sign of influence, possibly reproducing the
sublimation from the snow layer and leading to a reduction of the snow state.
All influencing factors only play a role at temperatures around or below the
freezing level. This matches the expected behaviour of a snow storage from a
hydrological point of view: lower temperatures and concurrent precipitation are
associated with snow accumulation. Consequently, this leads to an increase in
magnitude of the memory cell value (especially for the first temperature below the
freezing point). This can be observed e.g. in October 1999, where the temperature
values decrease, the influences of the meteorological parameters appear and the
snow-cell begins to accumulate. In contrast, as the temperature rises, the value of
the cell decreases again, especially when the daily minimum temperature also rises
above 0 °C.


(Figure 5: integrated gradient signals for precipitation, solar radiation, maximum
and minimum temperature and vapor pressure over the input sequence; the value of
memory cell 4; and the daily minimum and maximum temperature from August 1999 to
July 2000.)

Fig. 5. Integrated gradient analysis of the snow-cell (cell 4) of the LSTM trained for
basin A. The upper plot shows the integrated gradient signal on each input variable at
each time step. The plot in the center shows the memory cell state over time for
reference and the bottom plot the minimum and maximum daily temperature.

This suggests that the LSTM realistically represents short- as well as long-
term dynamics in the snow cell storage and their connection to the meteorological
inputs.
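
To make the attribution step concrete, the following is a minimal sketch of the
integrated gradient computation for a single memory cell, written in PyTorch. It
assumes a callable `model` that maps an input sequence to the memory cell values of
the last time step; this wrapper, the zero baseline, and the number of interpolation
steps are illustrative assumptions rather than the exact setup used in this chapter:

```python
import torch

def integrated_gradients(model, x, cell_index, baseline=None, steps=50):
    """Attribution of every input entry (time step, variable) to one memory cell.

    model: callable mapping an input sequence of shape (seq_len, n_inputs)
           to the vector of memory cell values at the last time step.
    x:     input sequence, tensor of shape (seq_len, n_inputs).
    """
    x = x.detach()
    baseline = torch.zeros_like(x) if baseline is None else baseline.detach()
    # straight-line path from the baseline to the actual input
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1)
    path = (baseline.unsqueeze(0) + alphas * (x - baseline).unsqueeze(0)).requires_grad_(True)
    # model output (the chosen memory cell) for every point on the path
    cell_values = torch.stack([model(p)[cell_index] for p in path])
    grads = torch.autograd.grad(cell_values.sum(), path)[0]
    # average the gradients along the path and scale by the input difference
    return (x - baseline) * grads.mean(dim=0)
```

The returned matrix has the same shape as the input sequence, so each row can be
read as the influence of one day of meteorological forcing on the chosen cell, as in
the upper panel of Fig. 5.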

4 Conclusion

LSTM networks are a versatile tool for time series predictions, with many potential
applications in hydrology and environmental sciences in general. However, they are
currently not widely applied. We argue that one reason is the difficulty of
interpreting LSTMs. The methods presented in this book provide solutions regarding
interpretability and allow a deeper analysis of these models. In this chapter, we
demonstrate this for the task of rainfall-runoff modelling (where the aim is to
predict the river discharge from meteorological observations). In particular, we
were able to show that the processes learned by the LSTM match our understanding of
a real-world environmental system. For this study, we focused on a qualitative
analysis of the correspondence between the hydrological system and the learned
behaviour of the LSTM. In a first experiment, we looked at the number of time steps
of influence on the network output (TSOI) and how this number varies throughout the
year. We saw that the TSOI pattern matches our hydrological understanding of the
yearly pattern. In the next experiment, we looked at the correlation of the memory
cells of the network with some selected states of the hydrological system (such as
snow or soil moisture). We found some cells that exhibited relatively high
correlation with the chosen states, which strengthens the hypothesis that the LSTM
obtained some general understanding of the runoff-generation processes. In the last
experiment, we inspected a single memory cell that exhibited a high correlation with
the snow state. We analyzed the influencing inputs over time through the integrated
gradient method and could see that the behavioral patterns manifested in the cell
closely resemble the ones suggested by hydrological theory. We view this as a
further underpinning of the observation that the internally modelled processes of
the network follow some sort of physically viable pattern. We hypothesize that this
relation can be seen as a legitimization of the use of LSTMs within environmental
science applications, and thus believe that the presented methods will pave the way
for their future use in environmental modelling. The correspondence of the memory
cells and the physical states can be especially useful in novel situations, which
often arise in this context. Environmental scientists and practitioners can exploit
it (together with the proposed techniques) to “peek into the LSTM” and argue about
potential behaviours. Our demonstration was certainly not exhaustive and should
rather be seen as an indicative application study.
The most important message is that the combination of domain knowledge (in this case
hydrology) and the insights provided by the proposed interpretation techniques
provides the foundation for designing environmental forecasting systems with neural
networks. Consequently, we expect that the combination of powerful data-driven
models (such as LSTMs) with the possibility of interpretation by experts will lead
to new insights in the field of application.

References
1. Addor, N., Newman, A.J., Mizukami, N., Clark, M.P.: Catchment attributes for
large-sample studies. Boulder, CO: UCAR/NCAR (2017)
2. Anderson, E.A.: National Weather Service River Forecast System - Snow Accumu-
lation and Ablation Model. Tech. Rep. November, US Department of Commerce,
Silver Spring (1973)
3. Arras, L., Montavon, G., Müller, K.R., Samek, W.: Explaining Recurrent Neu-
ral Network Predictions in Sentiment Analysis. arXiv preprint arXiv:1706.07206
(2017)
4. Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.R., Samek, W.: On
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
propagation. PLoS ONE 10(7), 1–46 (2015)
5. Beven, K.: How far can we go in distributed hydrological modelling ? Hydrological
and Earth System Sciences 5(1), 1–12 (2001)
6. Bowden, G.J., Dandy, G.C., Maier, H.R.: Input determination for neural network
models in water resources applications. Part 1 - Background and methodology.
Journal of Hydrology 301(1-4), 75–92 (2005)
7. Brenner, C., Thiem, C.E., Wizemann, H.D., Bernhardt, M., Schulz, K.: Estimating
spatially distributed turbulent heat fluxes from high-resolution thermal imagery
acquired with a UAV system. International Journal of Remote Sensing 38(8-10),
3003–3026 (2017)


8. Burnash, R.J.C., Ferral, R.L., McGuire, R.A.: A generalised streamflow simulation
system–conceptual modelling for digital computers. Tech. rep., US Department of
Commerce National Weather Service and State of California Department of Water
Resources (1973)
9. Daniell, T.M.: Neural networks—applications in hydrology and water resources
engineering. In: Proceedings of the International Hydrology and Water Resources
Symposium. vol. 3, pp. 797–802. Institution of Engineers, Perth, Australia (1991)
10. Freeze, R.A., Harlan, R.L.: Blueprint for a physically-based, digitally-simulated
hydrologic response model. Journal of Hydrology 9(3), 237–258 (1969)
11. Gupta, H.V., Sorooshian, S., Yapo, P.O.: Status of Automatic Calibration for Hy-
drologic Models: Comparison with Multilevel Expert Calibration. Journal of Hy-
drologic Engineering 4(2), 135–143 (1999)
12. Half, A.H., Half, H.M., Azmoodeh, M.: Predicting runoff from rainfall using neural
networks, pp. 760–765. ASCE, New York, USA (1993)
13. Hengl, T., De Jesus, J.M., Heuvelink, G.B., Gonzalez, M.R., Kilibarda, M.,
Blagotić, A., Shangguan, W., Wright, M.N., Geng, X., Bauer-Marschallinger, B.,
Guevara, M.A., Vargas, R., MacMillan, R.A., Batjes, N.H., Leenaars, J.G., Ribeiro,
E., Wheeler, I., Mantel, S., Kempen, B.: SoilGrids250m: Global gridded soil infor-
mation based on machine learning, vol. 12 (2017)
14. Herrnegger, M., Nachtnebel, H.P., Schulz, K.: From runoff to rainfall: Inverse
rainfall-runoff modelling in a high temporal resolution. Hydrology and Earth Sys-
tem Sciences 19(11), 4619–4639 (2015)
15. Herrnegger, M., Nachtnebel, H.P., Haiden, T.: Evapotranspiration in high alpine
catchments – an important part of the water balance! Hydrology Research 43(4),
460 (2012)
16. Hochreiter, S., Heusel, M., Obermayer, K.: Fast model-based protein homology
detection without alignment. Bioinformatics 23(14), 1728–1736 (2007)
17. Hochreiter, S., Schmidhuber, J.: Long Short-Term Memory. Neural Computation
9(8), 1735–1780 (1997)
18. Karpathy, A., Johnson, J., Fei-Fei, L.: Visualizing and Understanding Recurrent
Networks. arXiv preprint arXiv:1506.02078 (2015)
19. Kindermans, P.J., Schütt, K.T., Alber, M., Müller, K.R., Erhan, D., Kim, B.,
Dähne, S.: Learning how to explain neural networks: PatternNet and PatternAt-
tribution pp. 1–12 (2017)
20. Klemeš, V.: Dilettantism in hydrology: Transition or destiny? Water Resources
Research 22(9 S), 177S–188S (1986)
21. Klemes, V.: Stochastic models of rainfall-runoff relationship (1982)
22. Klotz, D., Herrnegger, M., Schulz, K.: Symbolic Regression for the Estimation
of Transfer Functions of Hydrological Models. Water Resources Research 53(11),
9402–9423 (2017)
23. Kratzert, F., Klotz, D., Brenner, C., Schulz, K., Herrnegger, M.: Rainfall–runoff
modelling using Long Short-Term Memory (LSTM) networks. Hydrology and
Earth System Sciences 22(11), 6005–6022 (2018)
24. Li, J., Chen, X., Hovy, E., Jurafsky, D.: Visualizing and Understanding Neural
Models in NLP. arXiv preprint arXiv:1506.01066 (2015)
25. Lindström, G., Pers, C., Rosberg, J., Strömqvist, J., Arheimer, B.: Development
and testing of the HYPE (Hydrological Predictions for the Environment) water
quality model for different spatial scales. Hydrology Research 41(3-4), 295 (2010)
26. Montavon, G., Lapuschkin, S., Binder, A., Samek, W., Müller, K.R.: Explaining
nonlinear classification decisions with deep Taylor decomposition. Pattern Recog-
nition 65(May 2016), 211–222 (2017)


27. Moriasi, D.N., Gitau, M.W., Pai, N., Daggupati, P.: Hydrologic and Water Qual-
ity Models: Performance Measures and Evaluation Criteria. Transactions of the
ASABE 58(6), 1763–1785 (2015)
28. Murdoch, W.J., Liu, P.J., Yu, B.: Beyond word importance: Contextual decomposi-
tion to extract interactions from LSTMs. In: International Conference on Learning
Representations (2018)
29. Myneni, R.B., Hoffman, S., Knyazikhin, Y., Privette, J.L., Glassy, J., Tian, Y.,
Wang, Y., Song, X., Zhang, Y., Smith, G.R., Lotsch, A., Friedl, M., Morisette,
J.T., Votava, P., Nemani, R.R., Running, S.W.: Global products of vegetation leaf
area and fraction absorbed PAR from year one of MODIS data. Remote Sensing
of Environment 83(1-2), 214–231 (2002)
30. Nash, J.E., Sutcliffe, J.V.: River flow forecasting through conceptual models part
I - A discussion of principles. Journal of Hydrology 10(3), 282–290 (1970)
31. Newman, A., Sampson, K., Clark, M., Bock, A., Viger, R., Blodgett, D.: A large-
sample watershed-scale hydrometeorological dataset for the contiguous USA. Boul-
der, CO: UCAR/NCAR. (2014)
32. Perrin, C., Michel, C., Andréassian, V.: Improvement of a parsimonious model for
streamflow simulation. Journal of Hydrology 279(1-4), 275–289 (2003)
33. Poerner, N., Schütze, H., Roth, B.: Evaluating neural network explanation methods
using hybrid documents and morphosyntactic agreement. In: Proceedings of the
56th Annual Meeting of the Association for Computational Linguistics (Volume 1:
Long Papers). vol. 1, pp. 340–350 (2018)
34. Ribeiro, M.T., Singh, S., Guestrin, C.: Why should i trust you?: Explaining the
predictions of any classifier. In: Proceedings of the 22nd ACM SIGKDD interna-
tional conference on knowledge discovery and data mining. pp. 1135–1144. ACM
(2016)
35. Samaniego, L., Kumar, R., Thober, S., Rakovec, O., Zink, M., Wanders, N., Eisner,
S., Müller Schmied, H., Sutanudjaja, E., Warrach-Sagi, K., Attinger, S.: Toward
seamless hydrologic predictions across spatial scales. Hydrology and Earth System
Sciences 21(9), 4323–4346 (2017)
36. Strobelt, H., Gehrmann, S., Pfister, H., Rush, A.M.: LSTMVis: A Tool for Visual
Analysis of Hidden State Dynamics in Recurrent Neural Networks. IEEE Transac-
tions on Visualization and Computer Graphics 24(1), 667–676 (2018)
37. Sundararajan, M., Taly, A., Yan, Q.: Axiomatic attribution for deep networks. In:
Proceedings of the 34th International Conference on Machine Learning-Volume 70.
pp. 3319–3328. JMLR. org (2017)
38. Thielen, J., Bartholmes, J., Ramos, M.H., de Roo, A.: The European Flood Alert
System – Part 1: Concept and development. Hydrology and Earth System
Sciences Discussions 5(1), 257–287 (2008)
39. Tieleman, T., Hinton, G.: Lecture 6.5 - RMSProp, COURSERA: Neural Networks
for Machine Learning. Tech. rep. (2012)
40. WMO, UNESCO: International Glossary of Hydrology. No. 12, Geneva, Switzerland (1998)

MC-LSTM: Mass-Conserving LSTM

Pieter-Jan Hoedt * 1 Frederik Kratzert * 1 Daniel Klotz 1 Christina Halmich 1 Markus Holzleitner 1
Grey Nearing 2 Sepp Hochreiter 1 3 Günter Klambauer 1

Abstract

The success of Convolutional Neural Networks (CNNs) in computer vision is mainly
driven by their strong inductive bias, which is strong enough to allow CNNs to solve
vision-related tasks with random weights, meaning without learning. Similarly, Long
Short-Term Memory (LSTM) has a strong inductive bias toward storing information over
time. However, many real-world systems are governed by conservation laws, which lead
to the redistribution of particular quantities — e.g. in physical and economical
systems. Our novel Mass-Conserving LSTM (MC-LSTM) adheres to these conservation laws
by extending the inductive bias of LSTM to model the redistribution of those stored
quantities. MC-LSTMs set a new state-of-the-art for neural arithmetic units at
learning arithmetic operations, such as addition tasks, which have a strong
conservation law, as the sum is constant over time. Further, MC-LSTM is applied to
traffic forecasting, modeling a damped pendulum, and a large benchmark dataset in
hydrology, where it sets a new state-of-the-art for predicting peak flows. In the
hydrology example, we show that MC-LSTM states correlate with real world processes
and are therefore interpretable.

* Equal contribution. 1 Institute for Machine Learning (ELLIS Unit), Johannes Kepler
University, Linz, Austria. 2 Google Research, Mountain View, CA, USA. 3 Institute of
Advanced Research in Artificial Intelligence (IARAI). Correspondence to: Pieter-Jan
Hoedt <[email protected]>, Frederik Kratzert <[email protected]>.

1. Introduction

Inductive biases enabled the success of CNNs and LSTMs. One of the greatest success
stories of deep learning are Convolutional Neural Networks (CNNs) (Fukushima, 1980;
LeCun & Bengio, 1998; Schmidhuber, 2015; LeCun et al., 2015), whose proficiency can
be attributed to their strong inductive bias toward visual tasks (Cohen & Shashua,
2017; Gaier & Ha, 2019). The effect of this inductive bias has been demonstrated by
CNNs that solve vision-related tasks with random weights, meaning without learning
(He et al., 2016; Gaier & Ha, 2019; Ulyanov et al., 2020). Another success story is
Long Short-Term Memory (LSTM) (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997),
which has a strong inductive bias toward storing information through its memory
cells. This inductive bias allows LSTM to excel at speech, text, and language tasks
(Sutskever et al., 2014; Bohnet et al., 2018; Kochkina et al., 2017; Liu & Guo,
2019), as well as time-series prediction. Even with random weights and only a
learned linear output layer, LSTM is better at predicting time-series than reservoir
methods (Schmidhuber et al., 2007). In a seminal paper on biases in machine
learning, Mitchell (1980) stated that “biases and initial knowledge are at the heart
of the ability to generalize beyond observed data”. Therefore, choosing an
appropriate architecture and inductive bias for deep neural networks is key to
generalization.

Mechanisms beyond storing are required for real-world applications. While LSTM can
store information over time, real-world applications require mechanisms that go
beyond storing. Many real-world systems are governed by conservation laws related to
mass, energy, momentum, charge, or particle counts, which are often expressed through
continuity equations. In physical systems, different types of energies, mass or
particles have to be conserved (Evans & Hanney, 2005; Rabitz et al., 1999; van der
Schaft et al., 1996), in hydrology it is the amount of water (Freeze & Harlan, 1969;
Beven, 2011), in traffic and transportation the number of vehicles (Vanajakshi &
Rilett, 2004; Xiao & Duan, 2020; Zhao et al., 2017), and in logistics the amount of
goods, money or products. A real-world task could be to predict outgoing goods from
a warehouse based on a general state of the warehouse, i.e., how many goods are in
storage, and incoming supplies. If the predictions are not precise, then they do not
lead to an optimal control of the production process. For modeling such systems,
certain inputs must be conserved but also redistributed across storage locations
within the system. We will refer to conserved inputs as mass, but note that this can
be any type of conserved quantity. We argue that for modeling such systems,
specialized mechanisms should be used to represent locations & whereabouts, objects,
or storage & placing locations and thus enable conservation.

Conservation laws should pervade machine learning models in the physical world.
Since a large part of machine learning models are developed to be deployed in the
real world, in which conservation laws are omnipresent rather than the exception,
these models should adhere to them automatically and benefit from them. However,
standard deep learning approaches struggle at conserving quantities across layers or
timesteps (Beucler et al., 2019b; Greydanus et al., 2019; Song & Hopke, 1996; Yitian
& Gu, 2003), and often solve a task by exploiting spurious correlations (Szegedy et
al., 2014; Lapuschkin et al., 2019). Thus, an inductive bias of deep learning
approaches via mass conservation over time in an open system, where mass can be
added and removed, could lead to a higher generalization performance than standard
deep learning for the above-mentioned tasks.

Figure 1. Schematic representation of the main operations in the MC-LSTM
architecture (adapted from: Olah, 2015).

A mass-conserving LSTM. In this work, we introduce Mass-Conserving LSTM (MC-LSTM), a
variant of LSTM that enforces mass conservation by design. MC-LSTM is a recurrent
neural network with an architecture inspired by the gating mechanism in LSTMs.
MC-LSTM has a strong inductive bias to guarantee the conservation of mass. This
conservation is implemented by means of left-stochastic matrices, which ensure the
sum of the memory cells in the network represents the current mass in the system.
These left-stochastic matrices also enforce the mass to be conserved through time.
The MC-LSTM gates operate as control units on mass flux. Inputs are divided into a
subset of mass inputs, which are propagated through time and are conserved, and a
subset of auxiliary inputs, which serve as inputs to the gates for controlling mass
fluxes. We demonstrate that MC-LSTMs excel at tasks where conservation of mass is
required and that it is highly apt at solving real-world problems in the physical
domain.

Contributions. We propose a novel neural network architecture based on LSTM that
conserves quantities, such as mass, energy, or count, of a specified set of inputs.
We show properties of this novel architecture, called MC-LSTM, and demonstrate that
these properties render it a powerful neural arithmetic unit. Further, we show its
applicability in real-world areas of traffic forecasting and modeling the damped
pendulum. In hydrology, large-scale benchmark experiments reveal that MC-LSTM has
powerful predictive quality and can supply interpretable representations.

2. Mass-Conserving LSTM

The original LSTM introduced memory cells to Recurrent Neural Networks (RNNs), which
alleviate the vanishing gradient problem (Hochreiter, 1991). This is achieved by
means of a fixed recurrent self-connection of the memory cells. If we denote the
values in the memory cells at time t by c_t, this recurrence can be formulated as

$$c_t = c_{t-1} + f(x_t, h_{t-1}), \tag{1}$$

where x and h are, respectively, the forward inputs and recurrent inputs, and f is
some function that computes the increment for the memory cells. Here, we used the
original formulation of LSTM without forget gate (Hochreiter & Schmidhuber, 1997),
but in all experiments we also consider LSTM with forget gate (Gers et al., 2000).

MC-LSTMs modify this recurrence to guarantee the conservation of the mass input. The
key idea is to use the memory cells from LSTMs as mass accumulators, or mass
storage. The conservation law is implemented by three architectural changes. First,
the increment, computed by f in Eq. (1), has to distribute mass from inputs into
accumulators. Second, the mass that leaves MC-LSTM must also disappear from the
accumulators. Third, mass has to be redistributed between mass accumulators. These
changes mean that all gates explicitly represent mass fluxes.

Since, in general, not all inputs must be conserved, we distinguish between mass
inputs, x, and auxiliary inputs, a. The former represents the quantity to be
conserved and will fill the mass accumulators in MC-LSTM. The auxiliary inputs are
used to control the gates. To keep the notation uncluttered, and without loss of
generality, we use a single mass input at each timestep, x^t, to introduce the
architecture.

The forward pass of MC-LSTM at timestep t can be specified as follows:

$$m_{tot}^t = R^t \cdot c^{t-1} + i^t \cdot x^t \tag{2}$$
$$c^t = (1 - o^t) \odot m_{tot}^t \tag{3}$$
$$h^t = o^t \odot m_{tot}^t, \tag{4}$$


where i^t and o^t are the input- and output gates, respectively, and R is a positive
left-stochastic matrix, i.e., $\mathbf{1}^T \cdot R = \mathbf{1}^T$, for
redistributing mass in the accumulators. The total mass m_tot is the redistributed
mass, $R^t \cdot c^{t-1}$, plus the mass influx, or new mass, $i^t \cdot x^t$. The
current mass in the system is stored in c^t. Finally, h^t is the mass leaving the
system.

Note the differences between Eq. (1) and Eq. (3). First, the increment of the memory
cells no longer depends on h^t. Instead, mass inputs are distributed by means of the
normalized i (see Eq. 5). Furthermore, R^t replaces the implicit identity matrix of
LSTM to redistribute mass among memory cells. Finally, Eq. (3) introduces 1 − o^t as
a forget gate on the total mass, m_tot. Together with Eq. (4), this assures that no
outgoing mass is stored in the accumulators. This formulation has some similarity to
Gated Recurrent Units (GRU) (Cho et al., 2014), however MC-LSTM gates are used to
split off the output instead of mixing the old and new cell state.

Basic gating and redistribution. The MC-LSTM gates at timestep t are computed as
follows:

$$i^t = \mathrm{softmax}\left(W_i \cdot a^t + U_i \cdot \frac{c^{t-1}}{\lVert c^{t-1} \rVert_1} + b_i\right) \tag{5}$$
$$o^t = \sigma\left(W_o \cdot a^t + U_o \cdot \frac{c^{t-1}}{\lVert c^{t-1} \rVert_1} + b_o\right) \tag{6}$$
$$R^t = \mathrm{softmax}(B_r), \tag{7}$$

where the softmax operator is applied column-wise, σ is the logistic sigmoid
function, and W_i, b_i, W_o, b_o, and B_r are learnable model parameters. Note that
the input gate and redistribution matrix are required to be column normalized. This
can also be achieved by other means than using the softmax function. For example, an
alternative way to ensure a column-normalized matrix R^t is to use a normalized
logistic, $\tilde{\sigma}(r_{kj}) = \sigma(r_{kj}) / \sum_n \sigma(r_{kn})$. Also
note that MC-LSTMs compute the gates from the memory cells, directly. This is in
contrast with the original LSTM, which uses the activations from the previous time
step. The accumulated values from the memory cells, c^t, are normalized to counter
saturation of the sigmoids and to supply probability vectors that represent the
current distribution of the mass across cell states. We use this variation e.g. in
our experiments with neural arithmetics (see Sec. 5.1).

Time-dependent redistribution. It can also be useful to predict a redistribution
matrix for each sample and timestep, similar to how the gates are computed:

$$R^t = \mathrm{softmax}\left(W_r \cdot a^t + U_r \cdot \frac{c^{t-1}}{\lVert c^{t-1} \rVert_1} + B_r\right), \tag{8}$$

where the parameters W_r and U_r are weight tensors and their multiplications result
in K × K matrices. Again, the softmax function is applied column-wise. This version
collapses to a time-independent redistribution matrix if W_r and U_r are equal to 0.
Thus, there exists the option to initialize W_r and U_r with weights that are small
in absolute value compared to the weights of B_r, to favour learning
time-independent redistribution matrices. We use this variant in the hydrology
experiments (see Sec. 5.4).

Redistribution via a hypernetwork. Even more general, a hypernetwork (Schmidhuber,
1992; Ha et al., 2017) that we denote with g can be used to procure R. The
hypernetwork has to produce a column-normalized, square matrix
$R^t = g(a^0, \ldots, a^t, c^0, \ldots, c^{t-1})$. Notably, a hypernetwork can be
used to design an autoregressive version of MC-LSTMs, if the network additionally
predicts auxiliary inputs for the next time step. We use this variant in the
pendulum experiments (see Sec. 5.3).

3. Properties

Conservation. MC-LSTM guarantees that mass is conserved over time. This is a direct
consequence of connecting memory cells with stochastic matrices. The mass
conservation ensures that no mass can be removed or added implicitly, which makes it
easier to learn functions that generalize well. The exact meaning of mass
conservation is formalized in the following Theorem.

Theorem 1 (Conservation property). Let $m_c^\tau = \sum_{k=1}^{K} c_k^\tau$ be the
mass contained in the system and $m_h^\tau = \sum_{k=1}^{K} h_k^\tau$ be the mass
efflux, or, respectively, the accumulated mass in the MC-LSTM storage and the
outputs at time τ. At any timestep τ, we have:

$$m_c^\tau = m_c^0 + \sum_{t=1}^{\tau} x^t - \sum_{t=1}^{\tau} m_h^t. \tag{9}$$

That is, the change of mass in the memory cells is the difference between the input
and output mass, accumulated over time.

The proof is by induction over τ (see Appendix C). Note that it is still possible
for input mass to be stored indefinitely in a memory cell so that it does not appear
at the output. This can be a useful feature if not all of the input mass is needed
at the output. In this case, the network can learn that one cell should operate as a
collector for excess mass in the system.

Boundedness of cell states. In each timestep τ, the memory cells, c_k^τ, are bounded
by the sum of mass inputs, that is
$|c_k^\tau| \le \sum_{t=1}^{\tau} x^t + m_c^0$. Furthermore, if the series of mass
inputs converges, $\lim_{\tau \to \infty} \sum_{t=1}^{\tau} x^\tau = m_x$, then also
the sum of cell states converges (see Appendix, Corollary 1).
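
As a concrete illustration of Eqs. (2)-(7) and of the conservation property
formalized in Theorem 1, the following is a minimal sketch of an MC-LSTM cell with a
single mass input, written in PyTorch. It is a simplified illustration based on the
equations above, not the reference implementation of the paper; the fused gate
parameterization, the row-vector convention for the cell states, and the
initialization are our own choices:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class MCLSTMCell(nn.Module):
    """Minimal MC-LSTM cell with a single mass input (cf. Eqs. 2-7)."""

    def __init__(self, n_aux: int, n_cells: int):
        super().__init__()
        # W·a + U·c/||c||_1 + b is folded into one linear layer on [a, c_norm]
        self.gate_i = nn.Linear(n_aux + n_cells, n_cells)  # input gate logits
        self.gate_o = nn.Linear(n_aux + n_cells, n_cells)  # output gate logits
        # time-independent redistribution logits, initialised towards the identity
        self.B_r = nn.Parameter(3.0 * torch.eye(n_cells))

    def forward(self, x_m, a, c):
        # gates see the mass distribution, not its absolute scale
        c_norm = c / (c.abs().sum(dim=-1, keepdim=True) + 1e-8)
        z = torch.cat([a, c_norm], dim=-1)
        i = F.softmax(self.gate_i(z), dim=-1)   # Eq. (5): normalised, distributes new mass
        o = torch.sigmoid(self.gate_o(z))       # Eq. (6)
        # Eq. (7); rows sum to 1 here because c is kept as a row vector,
        # i.e. the transposed convention of the column-normalised R in the text
        R = F.softmax(self.B_r, dim=-1)
        m_tot = c @ R + i * x_m                 # Eq. (2): redistribute old mass, add new mass
        c_new = (1.0 - o) * m_tot               # Eq. (3): mass that stays in the system
        h = o * m_tot                           # Eq. (4): mass that leaves the system
        return h, c_new


# numerical check of the conservation property (Theorem 1 / Eq. 9)
torch.manual_seed(0)
cell = MCLSTMCell(n_aux=3, n_cells=4)
c = torch.zeros(1, 4)
mass_in, mass_out = 0.0, 0.0
for _ in range(200):
    x_m = torch.rand(1, 1)    # mass input (e.g. precipitation)
    a = torch.randn(1, 3)     # auxiliary inputs (e.g. temperature, radiation)
    h, c = cell(x_m, a, c)
    mass_in += x_m.sum().item()
    mass_out += h.sum().item()
print(abs(mass_in - (mass_out + c.sum().item())))  # close to 0 up to float rounding
```

Because the input gate and the redistribution matrix are normalized, the printed
residual is only floating-point error: the balance of Eq. (9) holds by construction
rather than having to be learned from data.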


Initialization and gradient flow. MC-LSTM with $R^t = I$ has a similar gradient flow
to LSTM with forget gate (Gers et al., 2000). Thus, the main difference in the
gradient flow is determined by the redistribution matrix R. The forward pass of
MC-LSTM without gates, $c^t = R^t c^{t-1}$, leads to the following backward
expression: $\partial c^t / \partial c^{t-1} = R^t$. Hence, MC-LSTM should be
initialized with a redistribution matrix close to the identity matrix to ensure a
stable gradient flow as in LSTMs. For random redistribution matrices, the circular
law theorem for random Markov matrices (Bordenave et al., 2012) can be used to
analyze the gradient flow in more detail, see Appendix, Section D.

Computational complexity. Whereas the gates in a traditional LSTM are vectors, the
input gate and redistribution matrix of an MC-LSTM are matrices in the most general
case. This means that MC-LSTM is, in general, computationally more demanding than
LSTM. Concretely, the forward pass for a single timestep in MC-LSTM requires
$O(K^3 + K^2(M + L) + KML)$ Multiply-Accumulate operations (MACs), whereas LSTM
takes $O(K^2 + K(M + L))$ MACs per timestep. Here, M, L and K are the number of mass
inputs, auxiliary inputs and outputs, respectively. When using a time-independent
redistribution matrix, cf. Eq. (7), the complexity reduces to $O(K^2 M + KML)$ MACs.

Potential interpretability through inductive bias and accessible mass in cell
states. The representations within the model can be interpreted directly as
accumulated mass. If one mass or energy quantity is known, the MC-LSTM architecture
would allow to force a particular cell state to represent this quantity, which could
facilitate learning and interpretability. An illustrative example is the case of
rainfall runoff modelling, where observations, say of the soil moisture or
groundwater state, could be used to guide the learning of an explicit memory cell of
MC-LSTM.

4. Special cases and related work

Relation to Markov chains. In a special case MC-LSTM collapses to a finite Markov
chain, when c^0 is a probability vector, the mass input is zero, $x^t = 0$ for all
t, there is no input and output gate, and the redistribution matrix is constant over
time, $R^t = R$. For finite Markov chains, the dynamics are known to converge, if R
is irreducible (see e.g. Hairer (2018, Theorem 3.13.)). Awiszus & Rosenhahn (2018)
aim to model a Markov Chain by having a feed-forward network predict the state
distribution given the current state distribution. In order to insert randomness
into the network, a random seed is appended to the input, which allows to simulate
Markov processes. Although MC-LSTMs are closely related to Markov chains, they do
not explicitly learn the transition matrix, as is the case for Markov chain neural
networks. MC-LSTMs would have to learn the transition matrix implicitly.

Relation to normalizing flows and volume-conserving neural networks. In contrast to
normalizing flows (Rezende & Mohamed, 2015; Papamakarios et al., 2019), which
transform inputs in each layer and trace their density through layers or timesteps,
MC-LSTMs transform distributions and do not aim to trace individual inputs through
timesteps. Normalizing flows thereby conserve information about the input in the
first layer and can use the inverted mapping to trace an input back to the initial
space. MC-LSTMs are concerned with modeling the changes of the initial distribution
over time and can guarantee that a multinomial distribution is mapped to a
multinomial distribution. For MC-LSTMs without gates, the sequence of cell states
$c^0, \ldots, c^T$ constitutes a normalizing flow if an initial distribution
$p_0(c^0)$ is available. In more detail, MC-LSTM can be considered a linear flow
with the mapping $c^{t+1} = R^t c^t$ and $p(c^{t+1}) = p(c^t)\,|\det R^t|^{-1}$ in
this case. The gate providing the redistribution matrix (see Eq. 8) is the
conditioner in a normalizing flow model. From the perspective of normalizing flows,
MC-LSTM can be considered as a flow trained in a supervised fashion. Deco & Brauer
(1995) proposed volume-conserving neural networks, which conserve the volume spanned
by input vectors and thus keep the information of the starting point of an input. In
other words, they are constructed so that the Jacobians of the mapping from one
layer to the next have a determinant of 1. In contrast, the determinant of the
Jacobians in MC-LSTMs is generally smaller than 1 (except for degenerate cases),
which means that the volume of the inputs is not conserved.

Relation to Layer-wise Relevance Propagation. Layer-wise Relevance Propagation (LRP)
(Bach et al., 2015) is similar to our approach with respect to the idea that the sum
of a quantity, the relevance Q^l, is conserved over layers l. LRP aims to maintain
the sum of the relevance values, $\sum_{k=1}^{K} Q_k^{l-1} = \sum_{k=1}^{K} Q_k^l$,
backward through a classifier in order to obtain relevance values for each input
feature.

Relation to other networks that conserve particular properties. A standard
feed-forward neural network does not give guarantees aside from the conservation of
the proximity of datapoints through the continuity property. The conservation of the
first moments of the data distribution in the form of normalization techniques
(Ioffe & Szegedy, 2015) has had tremendous success. Here, batch normalization (Ioffe
& Szegedy, 2015) could exactly conserve mean and variance across layers, whereas
self-normalization (Klambauer et al., 2017) conserves those approximately. The
conservation of the spectral norm of each layer in the forward pass has enabled the
stable training of generative ad-


Table 1. Performance of different models on the LSTM addition task in terms of the
MSE. MC-LSTM significantly (all p-values below .05) outperforms its competitors,
LSTM (with high initial forget gate bias), NALU and NAU. Error bars represent
95%-confidence intervals across 100 runs.

            reference (a)    seq length (b)   input range (c)  count (d)   combo (e)    NaN (f)
MC-LSTM     0.004 ± 0.003    0.009 ± 0.004     0.8 ± 0.5       0.6 ± 0.4    4.0 ± 2.5     0
LSTM        0.008 ± 0.003    0.727 ± 0.169    21.4 ± 0.6       9.5 ± 0.6   54.6 ± 1.0     0
NALU        0.060 ± 0.008    0.059 ± 0.009    25.3 ± 0.2       7.4 ± 0.1   63.7 ± 0.6    93
NAU         0.248 ± 0.019    0.252 ± 0.020    28.3 ± 0.5       9.1 ± 0.2   68.5 ± 0.8    24

(a) training regime: summing 2 out of 100 numbers between 0 and 0.5.
(b) longer sequence lengths: summing 2 out of 1 000 numbers between 0 and 0.5.
(c) more mass in the input: summing 2 out of 100 numbers between 0 and 5.0.
(d) higher number of summands: summing 20 out of 100 numbers between 0 and 0.5.
(e) combination of previous scenarios: summing 10 out of 500 numbers between 0 and 2.5.
(f) number of runs that did not converge.
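
As a concrete illustration of the training regime described in footnote (a), the
addition-task data can be generated along the following lines (a sketch under our
own parameterization; names and defaults are illustrative and not taken from the
benchmark code):

```python
import numpy as np

def addition_task_sample(seq_len=100, n_marked=2, low=0.0, high=0.5, rng=None):
    """One sample of the addition task: the target is the sum of the marked values."""
    if rng is None:
        rng = np.random.default_rng()
    values = rng.uniform(low, high, size=seq_len)   # mass inputs
    marks = np.zeros(seq_len)                       # auxiliary input: 1 marks a summand
    marks[rng.choice(seq_len, size=n_marked, replace=False)] = 1.0
    target = float(values[marks == 1.0].sum())
    return np.stack([values, marks], axis=-1), target

# training regime (a): sum 2 of 100 numbers in [0, 0.5]
x, y = addition_task_sample()
# generalization scenario "combo" (e): sum 10 of 500 numbers in [0, 2.5]
x_combo, y_combo = addition_task_sample(seq_len=500, n_marked=10, high=2.5)
```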

versarial networks (Miyato et al., 2018). The conservation of the spectral norm of
the errors through the backward pass of an RNN has enabled the avoidance of the
vanishing gradient problem (Hochreiter, 1991; Hochreiter & Schmidhuber, 1997). In
this work, we explore an architecture that exactly conserves the mass of a subset of
the input, where mass is defined as a physical quantity such as mass or energy.

Relation to neural networks for physical systems. Neural networks have been shown to
discover physical concepts such as the conservation of energies (Iten et al., 2020),
and neural networks could allow to learn natural laws from observations (Schmidt &
Lipson, 2009; Cranmer et al., 2020b). MC-LSTM can be seen as a neural network
architecture with physical constraints (Karpatne et al., 2017; Beucler et al.,
2019c). It is however also possible to impose conservation laws by using other
means, e.g. initialization, constrained optimization or soft constraints (as, for
example, proposed by Karpatne et al., 2017; Beucler et al., 2019c;a; Jia et al.,
2019). Hamiltonian Neural Networks (HNNs) (Greydanus et al., 2019) and Symplectic
Recurrent Neural Networks (Chen et al., 2019) make energy conserving predictions by
using the Hamiltonian, a function that maps the inputs to the quantity that needs to
be conserved. By using the symplectic gradients, it is possible to move around in
the input space, without changing the output of the Hamiltonian. Lagrangian Neural
Networks (Cranmer et al., 2020a) extend the Hamiltonian concept by making it
possible to use arbitrary coordinates as inputs.

All of these approaches, while very promising, assume closed physical systems and
are thus too restrictive for the application we have in mind. Raissi et al. (2019)
propose to enforce physical constraints on simple feed-forward networks by computing
the partial derivatives with respect to the inputs and computing the partial
differential equations explicitly with the resulting terms. This approach, while
promising, does require an exact knowledge of the governing equations. By contrast,
our approach is able to learn its own representation of the underlying process,
while obeying the pre-specified conservation properties.

5. Experiments

In the following, we demonstrate the broad applicability and high predictive
performance of MC-LSTM in settings where mass conservation is required. Since there
is no quantity to conserve in standard benchmarks for language models, we use
benchmarks from areas in which a quantity has to be conserved. We assess MC-LSTM on
the benchmarking setting in the area of neural arithmetics (Madsen & Johansen, 2020;
Trask et al., 2018), in physical modeling on the damped pendulum modeling task by
(Iten et al., 2020), and in environmental modeling on flood forecasting (Kratzert et
al., 2019c). Additionally, we demonstrate the applicability of MC-LSTM to a traffic
forecasting setting. For more details on the datasets and hyperparameter selection
for each experiment, we refer to Appendix B.

5.1. Arithmetic tasks

Addition problem. We first considered a problem for which exact mass conservation
is required. One example for such a problem has been described in the original LSTM
paper (Hochreiter & Schmidhuber, 1997), showing that LSTM is capable of summing two
arbitrarily marked elements in a sequence of random numbers. We show that MC-LSTM is
able to solve this task, but also generalizes better to longer sequences, input
values in a different range and more summands. Table 1 summarizes the results of
this method comparison and shows that MC-LSTM significantly outperformed the other
models on all tests (p-value ≤ 0.03,


Wilcoxon test). In Appendix B.1.5, we provide a qualitative analysis of the learned
model behavior for this task.

Figure 2. MNIST arithmetic task results for MC-LSTM and NAU. The task is to
correctly predict the sum of a sequence of presented MNIST digits. The success rates
are depicted on the y-axis in dependency of the length of the sequence (x-axis) of
MNIST digits. Error bars represent 95%-confidence intervals.

Recurrent arithmetic. Following Madsen & Johansen (2020), the inputs for this task
are sequences of vectors, uniformly drawn from [1, 2]^10. For each vector in the
sequence, the sum over two random subsets is calculated. Those values are then
summed over time, leading to two values. The target output is obtained by applying
the arithmetic operation to these two values. The auxiliary input for MC-LSTM is a
sequence of ones, where the last element is −1 to signal the end of the sequence.

We evaluated MC-LSTM against NAUs and Neural Accumulators (NACs) directly in the
framework of Madsen & Johansen (2020). NACs and NAUs use the architecture as
presented in (Madsen & Johansen, 2020). That is, a single hidden layer with two
neurons, where the first layer is recurrent. The MC-LSTM model has two layers, of
which the second one is a fully connected linear layer. For subtraction an extra
cell was necessary to properly discard redundant input mass.

For testing, the model with the lowest validation error was used, c.f. early
stopping. The performance is measured by the percentage of runs that successfully
generalized to longer sequences. Generalization is considered successful if the
error is lower than the numerical imprecision of the exact operation (Madsen &
Johansen, 2020). The summary in Tab. 2 shows that MC-LSTM was able to significantly
outperform the competing models (p-value 0.03 for addition and 3e−6 for
multiplication, proportion test). In Appendix B.1.5, we provide a qualitative
analysis of the learned model behavior for this task.

Static arithmetic. To enable a direct comparison with the results reported in Madsen
& Johansen (2020), we also compared a feed-forward variant of MC-LSTM on the static
arithmetic task, see Appendix B.1.3.

MNIST arithmetic. We tested that feature extractors can be learned from MNIST images
(LeCun et al., 1998) to perform arithmetic on the images (Madsen & Johansen, 2020).
The input is a sequence of MNIST images and the target output is the corresponding
sum of the labels. Auxiliary inputs are all 1, except the last entry, which is −1,
to indicate the end of the sequence. The models are the same as in the recurrent
arithmetic task with a CNN to convert the images to (mass) inputs for these
networks. The network is learned end-to-end. L2-regularization is added to the
output of the CNN to prevent its outputs from growing arbitrarily large. The results
for this experiment are depicted in Fig. 2. MC-LSTM significantly outperforms the
state-of-the-art, NAU (p-value 0.002, Binomial test).

5.2. Inbound-outbound traffic forecasting

Figure 3. Schematic depiction of inbound-outbound traffic situations that require
the conservation-of-vehicles principle. All vehicles on outbound roads (yellow
arrows) must have entered the city center before (green arrows) or have been present
in the first timestep.

We examined the usage of MC-LSTMs for traffic forecasting in situations in which
inbound and outbound traffic counts of a city are available (see Fig. 3). For this
type of data, a conservation-of-vehicles principle (Nam & Drew, 1996) must hold,
since vehicles can only leave the city if they have entered it before or had been
there in the first place. Based on data from the traffic4cast 2020 challenge (Kreil
et al., 2020), we constructed a dataset to model inbound and outbound traffic in
three different cities: Berlin, Istanbul and Moscow. We compared MC-LSTM against
LSTM, which is the state-of-the-art method for several types of traffic forecasting
situations (Zhao et al., 2017; Tedjopurnomo et al., 2020), and found that MC-LSTM
significantly outperforms LSTM in this traffic forecasting setting (all p-values
≤ 0.01, Wilcoxon test). For details, see Appendix B.2.


Table 2. Recurrent arithmetic task results. MC-LSTMs for addition and subtraction/multiplication have two and three neurons, respectively.
Error bars represent 95%-confidence intervals.

                 addition                         subtraction                      multiplication
                 success rate (a)   updates (b)   success rate (a)   updates (b)   success rate (a)   updates (b)
MC-LSTM          96% (+2%/−6%)      4.6 · 10^5    81% (+6%/−9%)      1.2 · 10^5    67% (+8%/−10%)     1.8 · 10^5
NAU / NMU        88% (+5%/−8%)      8.1 · 10^4    60% (+9%/−10%)     6.1 · 10^4    34% (+10%/−9%)     8.5 · 10^4
NAC              56% (+9%/−10%)     3.2 · 10^5    86% (+5%/−8%)      4.5 · 10^4     0% (+4%/−0%)      –
NALU             10% (+7%/−4%)      1.0 · 10^6     0% (+4%/−0%)      –              1% (+4%/−1%)      4.3 · 10^5

(a) Percentage of runs that generalized to longer sequences.
(b) Median number of updates necessary to solve the task.

Figure 4. Example for the pendulum-modelling exercise. (a) LSTM trained for
predicting energies of the pendulum with friction in autoregressive fashion, (b)
MC-LSTM trained in the same setting. Each subplot shows the potential- and kinetic
energy and the respective predictions.

5.3. Damped pendulum

In the area of physics, we examined the usability of MC-LSTM for the problem of
modeling a swinging damped pendulum. Here, the total energy is the conserved
property. During the movement of the pendulum, kinetic energy is converted into
potential energy and vice-versa. This conversion between both energies has to be
learned by the off-diagonal values of the redistribution matrix. A qualitative
analysis of a trained MC-LSTM for this problem can be found in Appendix B.3.1.

Accounting for friction, energy dissipates and the swinging slows over time, toward
a fixed point. This type of behavior presents a difficulty for machine learning and
is impossible for methods that assume the pendulum to be a closed system, such as
HNNs (Greydanus et al., 2019) (see Appendix B.3.2). We generated 120 datasets with
timeseries of a pendulum, where we used multiple different settings for initial
angle, length of the pendulum, and the amount of friction. We then selected LSTM and
MC-LSTM models and compared them with respect to the analytical solution in terms of
MSE. For an example, see Fig. 4. Overall, MC-LSTM significantly outperformed LSTM
with a mean MSE of 0.01 (standard deviation 0.02) compared to 0.07 (standard
deviation 0.14; with a p-value 4.7e−10, Wilcoxon test). In the friction-free case,
no significant difference to HNNs was found (see Appendix B.3.2).

5.4. Hydrology: rainfall runoff modeling

We tested MC-LSTM for large-sample hydrological modeling following Kratzert et al.
(2019c). An ensemble of 10 MC-LSTMs was trained on 10 years of data from 447 basins
using the publicly-available CAMELS dataset (Newman et al., 2015; Addor et al.,
2017a). The mass input is precipitation and auxiliary inputs are: daily min. and
max. temperature, solar radiation, and vapor pressure, plus 27 basin characteristics
related to geology, vegetation, and climate (described by Kratzert et al., 2019c).
All models, apart from MC-LSTM and LSTM, were trained by different research groups
with experience using each model. More details are given in Appendix B.4.2.

As shown in Tab. 3, MC-LSTM performed better with respect to the Nash–Sutcliffe
Efficiency (NSE; the R^2 between simulated and observed runoff) than any other
mass-conserving hydrology model, although slightly worse than LSTM.

NSE is often not the most important metric in hydrology, since water managers are
typically concerned primarily with extremes (e.g. floods). MC-LSTM performed
significantly better (p = 0.025, Wilcoxon test) than all models, including LSTM,
with respect to high volume flows (FHV), at or above the 98th percentile flow in
each basin. This makes MC-LSTM the current state-of-the-art model for flood
prediction. MC-LSTM also performed significantly better than LSTM on low volume
flows (FLV) and overall bias, however there are other hydrology models that are
better for predicting low flows (which is important, e.g. for managing droughts).


Table 3. Hydrology benchmark results. All values represent the median over the 447
basins.

                       MC (a)   NSE (b)   β-NSE (c)   FLV (d)   FHV (e)
MC-LSTM Ensemble       yes      0.744     -0.020      -24.7     -14.7
LSTM Ensemble          no       0.763     -0.034       36.3     -15.7
SAC-SMA                yes      0.603     -0.066       37.4     -20.4
VIC (basin)            yes      0.551     -0.018      -74.8     -28.1
VIC (regional)         yes      0.307     -0.074       18.9     -56.5
mHM (basin)            yes      0.666     -0.040       11.4     -18.6
mHM (regional)         yes      0.527     -0.039       36.8     -40.2
HBV (lower)            yes      0.417     -0.023       23.9     -41.9
HBV (upper)            yes      0.676     -0.012       18.3     -18.5
FUSE (900)             yes      0.639     -0.031      -10.5     -18.9
FUSE (902)             yes      0.650     -0.047      -68.2     -19.4
FUSE (904)             yes      0.622     -0.067      -67.6     -21.4

(a) Mass conservation (MC).
(b) Nash-Sutcliffe efficiency: (−∞, 1], values closer to one are desirable.
(c) β-NSE decomposition: (−∞, ∞), values closer to zero are desirable.
(d) Bottom 30% low flow bias: (−∞, ∞), values closer to zero are desirable.
(e) Top 2% peak flow bias: (−∞, ∞), values closer to zero are desirable.
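
For reference, the two metrics most discussed in the text, NSE (footnote b) and the
top-2% peak-flow bias (footnote e), can be computed as in the following sketch. This
is a minimal implementation of the standard definitions, with the peak-flow bias
evaluated on sorted flows; it is not the benchmark evaluation code used for Table 3:

```python
import numpy as np

def nse(sim, obs):
    """Nash-Sutcliffe efficiency: 1 - SSE / variance of the observed flows."""
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)

def peak_flow_bias(sim, obs, fraction=0.02):
    """Percent bias of the highest flows (FHV-style, top 2% of sorted flows)."""
    n = max(1, int(np.ceil(fraction * len(obs))))
    sim_peaks = np.sort(sim)[-n:]
    obs_peaks = np.sort(obs)[-n:]
    return 100.0 * (sim_peaks.sum() - obs_peaks.sum()) / obs_peaks.sum()

# toy usage with synthetic data (real evaluations use observed and simulated discharge)
rng = np.random.default_rng(0)
obs = rng.gamma(shape=2.0, scale=1.0, size=3650)
sim = 0.95 * obs
print(nse(sim, obs), peak_flow_bias(sim, obs))
```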

Model states and environmental processes. It is an open challenge to bridge the gap
between the fact that LSTM approaches give generally better predictions than other
models (especially for flood prediction) and the fact that water managers need
predictions that help them understand not only how much water will be in a river at
a given time, but also how water moves through a basin.

Figure 5. Snow-water-equivalent (SWE) from a single basin. The blue line is SWE
modeled by Newman et al. (2015). The orange line is the sum over 4 MC-LSTM memory
cells (Pearson correlation coefficient r ≥ 0.8).

Snow processes are difficult to observe and model. Kratzert et al. (2019a) showed
that LSTM learns to track snow in memory cells without requiring snow data for
training. We found similar behavior in MC-LSTMs, which has the advantage of doing
this with memory cells that are true mass storages. Figure 5 shows the snow as the
sum over a subset of MC-LSTM memory states and snow water equivalent (SWE) modeled
by the well-established Snow-17 snow model (Anderson, 1973) (Pearson correlation
coefficient r ≥ 0.91). It is important to note that MC-LSTMs did not have access to
any snow data during training. In the best case, it is possible to take advantage of
the inductive bias to predict how much water will be stored as snow under different
conditions by using simple combinations or mixtures of the internal states. Future
work will determine whether this is possible with other difficult-to-observe states
and fluxes.

5.5. Ablation study

In order to demonstrate that the design choices of MC-LSTM are necessary together to
enable accurate predictive models, we performed an ablation study. In this study, we
made changes that disrupt the mass conservation property a) of the input gate, b)
the redistribution operation, and c) the output gate. We tested these three variants
on data from the hydrology experiments. We chose 5 random basins to limit
computational expenses and trained nine repetitions for each configuration and
basin. The strongest decrease in performance is observed if the redistribution
matrix does not conserve mass, and smaller decreases if input or output gate do not
conserve mass. The results of the ablation study indicate that the design of the
input gate, redistribution matrix, and output gate, are necessary together to obtain
accurate and mass-conserving models (see Appendix Tab. B.8).

6. Conclusion

We have demonstrated how to design an RNN that has the property to conserve mass of
particular inputs. This architecture is proficient as neural arithmetic unit and is
well-suited for predicting physical systems like hydrological processes, in which
water mass has to be conserved. We envision that MC-LSTM can become a powerful tool
in modeling environmental, sustainability, and biogeochemical cycles.


Acknowledgments

The ELLIS Unit Linz, the LIT AI Lab, the Institute for Machine Learning, are
supported by the Federal State Upper Austria. IARAI is supported by Here
Technologies. We thank the projects AI-MOTION (LIT-2018-6-YOU-212), DeepToxGen
(LIT-2017-3-YOU-003), AI-SNN (LIT-2018-6-YOU-214), DeepFlood (LIT-2019-8-YOU-213),
Medical Cognitive Computing Center (MC3), PRIMAL (FFG-873979), S3AI (FFG-872172), DL
for granular flow (FFG-871302), ELISE (H2020-ICT-2019-3 ID: 951847), AIDD
(MSCA-ITN-2020 ID: 956832). We thank Janssen Pharmaceutica, UCB Biopharma SRL, Merck
Healthcare KGaA, Audi.JKU Deep Learning Center, TGW LOGISTICS GROUP GMBH, Silicon
Austria Labs (SAL), FILL Gesellschaft mbH, Anyline GmbH, Google, ZF Friedrichshafen
AG, Robert Bosch GmbH, Software Competence Center Hagenberg GmbH, TÜV Austria, and
the NVIDIA Corporation.

References

Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P. The camels data set:
catchment attributes and meteorology for large-sample studies. Hydrology and Earth
System Sciences (HESS), 21(10):5293–5313, 2017a.

Addor, N., Newman, A. J., Mizukami, N., and Clark, M. P. Catchment attributes for
large-sample studies. Boulder, CO: UCAR/NCAR, 2017b.

Anderson, E. A. National weather service river forecast system: Snow accumulation
and ablation model. NOAA Tech. Memo. NWS HYDRO-17, 87 pp., 1973.

Awiszus, M. and Rosenhahn, B. Markov chain neural networks. In 2018 IEEE/CVF
Conference on Computer Vision and Pattern Recognition Workshops (CVPRW), pp.
2261–22617, 2018.

Bach, S., Binder, A., Montavon, G., Klauschen, F., Müller, K.-R., and Samek, W. On
pixel-wise explanations for non-linear classifier decisions by layer-wise relevance
propagation. PloS one, 10(7):1–46, 2015.

Bengio, S., Vinyals, O., Jaitly, N., and Shazeer, N. Scheduled sampling for sequence
prediction with recurrent neural networks. In Advances in Neural Information
Processing Systems 28, pp. 1171–1179. Curran Associates, Inc., 2015.

Beucler, T., Pritchard, M., Rasp, S., Gentine, P., Ott, J., and Baldi, P. Enforcing
analytic constraints in neural-networks emulating physical systems, 2019a.

Beucler, T., Rasp, S., Pritchard, M., and Gentine, P. Achieving conservation of
energy in neural network emulators for climate modeling. arXiv preprint
arXiv:1906.06622, 2019b.

Beucler, T., Rasp, S., Pritchard, M., and Gentine, P. Achieving conservation of
energy in neural network emulators for climate modeling. ICML Workshop “Climate
Change: How Can AI Help?”, 2019c.

Beven, K. Deep learning, hydrological processes and the uniqueness of place.
Hydrological Processes, 34(16):3608–3613, 2020.

Beven, K. J. Rainfall-runoff modelling: the primer. John Wiley & Sons, 2011.

Bohnet, B., McDonald, R., Simoes, G., Andor, D., Pitler, E., and Maynez, J.
Morphosyntactic tagging with a meta-bilstm model over context sensitive token
encodings. arXiv preprint arXiv:1805.08237, 2018.

Bordenave, C., Caputo, P., and Chafai, D. Circular law theorem for random markov
matrices. Probability Theory and Related Fields, 152(3-4):751–779, 2012.

Chen, Z., Zhang, J., Arjovsky, M., and Bottou, L. Symplectic recurrent neural
networks. arXiv preprint arXiv:1909.13334, 2019.

Cho, K., van Merriënboer, B., Gulcehre, C., Bahdanau, D., Bougares, F., Schwenk, H.,
and Bengio, Y. Learning phrase representations using rnn encoder-decoder for
statistical machine translation. In Proceedings of the Conference on Empirical
Methods in Natural Language Processing, pp. 1724–1734. Association for Computational
Linguistics, 2014.

Cohen, N. and Shashua, A. Inductive bias of deep convolutional networks through
pooling geometry. In International Conference on Learning Representations, 2017.

Cranmer, M., Greydanus, S., Hoyer, S., Battaglia, P., Spergel, D., and Ho, S.
Lagrangian neural networks. arXiv preprint arXiv:2003.04630, 2020a.

Cranmer, M., Sanchez Gonzalez, A., Battaglia, P., Xu, R., Cranmer, K., Spergel, D.,
and Ho, S. Discovering symbolic models from deep learning with inductive biases.
Advances in Neural Information Processing Systems, 33, 2020b.

Cui, Z., Henrickson, K., Ke, R., and Wang, Y. Traffic graph convolutional recurrent
neural network: A deep learning framework for network-scale traffic learning and
forecasting. IEEE Transactions on Intelligent Transportation Systems, 2019.

Deco, G. and Brauer, W. Nonlinear higher-order statistical decorrelation by
volume-conserving neural architectures. Neural Networks, 8(4):525–535, 1995. ISSN
0893-6080.


Dehaene, S. The number sense: How the mind creates mathematics. Oxford University
Press, 2 edition, 2011. ISBN 9780199753871.

Evans, M. R. and Hanney, T. Nonequilibrium statistical mechanics of the zero-range
process and related models. Journal of Physics A: Mathematical and General,
38(19):R195, 2005.

Freeze, R. A. and Harlan, R. Blueprint for a physically-based, digitally-simulated
hydrologic response model. Journal of Hydrology, 9(3):237–258, 1969.

Fukushima, K. Neocognitron: A self-organizing neural network model for a mechanism
of pattern recognition unaffected by shift in position. Biological Cybernetics,
36(4):193–202, 1980.

Gaier, A. and Ha, D. Weight agnostic neural networks. In Advances in Neural
Information Processing Systems 32, pp. 5364–5378. Curran Associates, Inc., 2019.

Gallistel, C. R. Finding numbers in the brain. Philosophical Transactions of the
Royal Society B: Biological Sciences, 373(1740), 2018. doi: 10.1098/rstb.2017.0119.

Gers, F. A., Schmidhuber, J., and Cummins, F. Learning to forget: Continual
prediction with lstm. Neural Computation, 12(10):2451–2471, 2000.

Greydanus, S., Dzamba, M., and Yosinski, J. Hamiltonian neural networks. In Advances
in Neural Information Processing Systems 32, pp. 15353–15363. Curran Associates,
Inc., 2019.

Gupta, H. V., Kling, H., Yilmaz, K. K., and Martinez, G. F. Decomposition of the
mean squared error and nse performance criteria: Implications for improving
hydrological modelling. Journal of hydrology, 377(1-2):80–91, 2009.

Ha, D., Dai, A., and Le, Q. Hypernetworks. In International Conference on Learning
Representations, 2017.

Hairer, M. Ergodic properties of markov processes. Lecture notes, 2018.

He, K., Wang, Y., and Hopcroft, J. A powerful generative model using random weights
for the deep image representation. In Advances in Neural Information Processing
Systems 29, pp. 631–639. Curran Associates, Inc., 2016.

Hochreiter, S. Untersuchungen zu dynamischen neuronalen Netzen. PhD thesis,
Technische Universität München, 1991.

Hochreiter, S. and Schmidhuber, J. Long short-term memory. Neural Computation,
9(8):1735–1780, 1997.

Ioffe, S. and Szegedy, C. Batch normalization: Accelerating deep network training by
reducing internal covariate shift. In Proceedings of the 32nd International
Conference on Machine Learning, volume 37, pp. 448–456. PMLR, 2015.

Iten, R., Metger, T., Wilming, H., Del Rio, L., and Renner, R. Discovering physical
concepts with neural networks. Physical Review Letters, 124(1):010508, 2020.

Jia, X., Willard, J., Karpatne, A., Read, J., Zwart, J., Steinbach, M., and Kumar,
V. Physics guided rnns for modeling dynamical systems: A case study in simulating
lake temperature profiles. In Proceedings of the 2019 SIAM International Conference
on Data Mining, pp. 558–566. SIAM, 2019.

Karpatne, A., Atluri, G., Faghmous, J. H., Steinbach, M., Banerjee, A., Ganguly, A.,
Shekhar, S., Samatova, N., and Kumar, V. Theory-guided data science: A new paradigm
for scientific discovery from data. IEEE Transactions on Knowledge and Data
Engineering, 29(10):2318–2331, 2017.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. In
International Conference on Learning Representations, 2015.

Klambauer, G., Unterthiner, T., Mayr, A., and Hochreiter, S. Self-normalizing neural
networks. In Advances in neural information processing systems 30, pp. 971–980,
2017.

Kochkina, E., Liakata, M., and Augenstein, I. Turing at semeval-2017 task 8:
Sequential approach to rumour stance classification with branch-lstm. arXiv preprint
arXiv:1704.07221, 2017.

Kratzert, F., Klotz, D., Brenner, C., Schulz, K., and Herrnegger, M. Rainfall–runoff
modelling using long short-term memory (lstm) networks. Hydrology and Earth System
Sciences, 22(11):6005–6022, 2018.

Kratzert, F., Herrnegger, M., Klotz, D., Hochreiter, S., and Klambauer, G.
NeuralHydrology–Interpreting LSTMs in Hydrology, pp. 347–362. Springer, 2019a.

Kratzert, F., Klotz, D., Herrnegger, M., Sampson, A. K., Hochreiter, S., and
Nearing, G. S. Toward improved predictions in ungauged basins: Exploiting the power
of machine learning. Water Resources Research, 55(12):11344–11354, 2019b.

Kratzert, F., Klotz, D., Shalev, G., Klambauer, G., Hochreiter, S., and Nearing, G.
Towards learning universal, regional, and local hydrological behaviors via machine
learning applied to large-sample datasets. Hydrology and Earth System Sciences,
23(12):5089–5110, 2019c.

91
MC-LSTM

Kratzert, F., Klotz, D., Hochreiter, S., and Nearing, G. A Mizukami, N., Rakovec, O., Newman, A. J., Clark, M. P.,
note on leveraging synergy in multiple meteorological Wood, A. W., Gupta, H. V., and Kumar, R. On the choice
datasets with deep learning for rainfall-runoff modeling. of calibration metrics for “high-flow” estimation using hy-
Hydrology and Earth System Sciences Discussions, 2020: drologic models. Hydrology and Earth System Sciences,
1–26, 2020. 23(6):2601–2614, 2019.

Kreil, D. P., Kopp, M. K., Jonietz, D., Neun, M., Gruca, A., Nam, D. H. and Drew, D. R. Traffic dynamics: Method
Herruzo, P., Martin, H., Soleymani, A., and Hochreiter, for estimating freeway travel times in real time from flow
S. The surprising efficiency of framing geo-spatial time measurements. Journal of Transportation Engineering,
series forecasting as a video prediction task–insights from 122(3):185–191, 1996.
the iarai traffic4cast competition at neurips 2019. In Nash, J. E. and Sutcliffe, J. V. River flow forecasting
NeurIPS 2019 Competition and Demonstration Track, pp. through conceptual models part i—a discussion of princi-
232–241. PMLR, 2020. ples. Journal of hydrology, 10(3):282–290, 1970.
Lapuschkin, S., Wäldchen, S., Binder, A., Montavon, G., Nearing, G. S., Tian, Y., Gupta, H. V., Clark, M. P., Harri-
Samek, W., and Müller, K.-R. Unmasking clever hans pre- son, K. W., and Weijs, S. V. A philosophical basis for
dictors and assessing what machines really learn. Nature hydrological uncertainty. Hydrological Sciences Journal,
communications, 10(1):1–8, 2019. 61(9):1666–1678, 2016.
LeCun, Y. and Bengio, Y. Convolutional Networks for Newman, A., Sampson, K., Clark, M., Bock, A., Viger, R.,
Images, Speech, and Time Series, pp. 255–258. MIT and Blodgett, D. A large-sample watershed-scale hydrom-
Press, Cambridge, MA, USA, 1998. eteorological dataset for the contiguous USA. Boulder,
CO: UCAR/NCAR, 2014.
LeCun, Y., Bottou, L., Bengio, Y., and Haffner, P. Gradient-
based learning applied to document recognition. Proceed- Newman, A., Clark, M., Sampson, K., Wood, A., Hay, L.,
ings of the IEEE, 86(11):2278–2324, 1998. Bock, A., Viger, R., Blodgett, D., Brekke, L., Arnold, J.,
et al. Development of a large-sample watershed-scale hy-
LeCun, Y., Bengio, Y., and Hinton, G. Deep learning. Na- drometeorological data set for the contiguous USA: data
ture, 521(7553):436–444, 2015. set characteristics and assessment of regional variability
Liu, G. and Guo, J. Bidirectional lstm with attention mecha- in hydrologic model performance. Hydrology and Earth
nism and convolutional layer for text classification. Neu- System Sciences, 19(1):209–223, 2015.
rocomputing, 337:325–338, 2019. Newman, A. J., Mizukami, N., Clark, M. P., Wood, A. W.,
Nijssen, B., and Nearing, G. Benchmarking of a physi-
Liu, Y., Liu, Z., and Jia, R. Deeppf: A deep learning
cally based hydrologic model. Journal of Hydrometeorol-
based architecture for metro passenger flow prediction.
ogy, 18(8):2215–2225, 2017.
Transportation Research Part C: Emerging Technologies,
101:18–34, 2019. Nieder, A. The neuronal code for number. Nature Reviews
Neuroscience, 17(6):366–382, 2016. doi: https://doi.org/
Madsen, A. and Johansen, A. R. Neural arithmetic units. In 10.1038/nrn.2016.40.
International Conference on Learning Representations,
2020. Olah, C. Understanding LSTM networks, 2015.
URL https://colah.github.io/posts/
Mitchell, T. M. The need for biases in learning generaliza- 2015-08-Understanding-LSTMs/.
tions. Technical Report CBM-TR-117, Rutgers Univer-
sity, Computer Science Department, New Brunswick, NJ, Papamakarios, G., Nalisnick, E., Rezende, D. J., Mohamed,
1980. S., and Lakshminarayanan, B. Normalizing flows for
probabilistic modeling and inference. Technical report,
Miyato, T., Kataoka, T., Koyama, M., and Yoshida, Y. Spec- DeepMind, 2019.
tral normalization for generative adversarial networks. In
International Conference on Learning Representations, Paszke, A., Gross, S., Massa, F., Lerer, A., Bradbury, J.,
2018. Chanan, G., Killeen, T., Lin, Z., Gimelshein, N., Antiga,
L., Desmaison, A., Kopf, A., Yang, E., DeVito, Z., Raison,
Mizukami, N., Clark, M. P., Newman, A. J., Wood, M., Tejani, A., Chilamkurthy, S., Steiner, B., Fang, L.,
A. W., Gutmann, E. D., Nijssen, B., Rakovec, O., and Bai, J., and Chintala, S. Pytorch: An imperative style,
Samaniego, L. Towards seamless large-domain parame- high-performance deep learning library. In Advances
ter estimation for hydrologic models. Water Resources in Neural Information Processing Systems 32, pp. 8024–
Research, 53(9):8020–8040, 2017. 8035. Curran Associates, Inc., 2019.

92 Chapter 3 Publications
MC-LSTM

Rabitz, H., Aliş, Ö. F., Shorter, J., and Shim, K. Efficient Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan,
input—output model representations. Computer physics D., Goodfellow, I., and Fergus, R. Intriguing proper-
communications, 117(1-2):11–20, 1999. ties of neural networks. In International Conference on
Learning Representations, 2014.
Raissi, M., Perdikaris, P., and Karniadakis, G. E. Physics-
informed neural networks: A deep learning framework for Tedjopurnomo, D. A., Bao, Z., Zheng, B., Choudhury, F.,
solving forward and inverse problems involving nonlinear and Qin, A. A survey on modern deep neural network for
partial differential equations. Journal of Computational traffic prediction: Trends, methods and challenges. IEEE
Physics, 378:686–707, 2019. Transactions on Knowledge and Data Engineering, 2020.

Rakovec, O., Mizukami, N., Kumar, R., Newman, A. J., Todini, E. Rainfall-runoff modeling — past, present and
Thober, S., Wood, A. W., Clark, M. P., and Samaniego, L. future. Journal of Hydrology, 100(1):341–352, 1988.
Diagnostic evaluation of large-domain hydrologic models ISSN 0022-1694.
calibrated across the contiguous united states. Journal
of Geophysical Research: Atmospheres, 124(24):13991– Trask, A., Hill, F., Reed, S. E., Rae, J., Dyer, C., and Blun-
14007, 2019. som, P. Neural arithmetic logic units. In Advances in Neu-
ral Information Processing Systems 31, pp. 8035–8044.
Rezende, D. and Mohamed, S. Variational inference with Curran Associates, Inc., 2018.
normalizing flows. In Proceedings of the 32nd Interna-
Ulyanov, D., Vedaldi, A., and Lempitsky, V. Deep image
tional Conference on Machine Learning, volume 37, pp.
prior. International Journal of Computer Vision, 128(7):
1530–1538. PMLR, 2015.
1867–1888, 2020.
Saxe, A. M., McClelland, J. L., and Ganguli, S. Exact
van der Schaft, A. J., Dalsmo, M., and Maschke, B. M.
solutions to the nonlinear dynamics of learning in deep
Mathematical structures in the network representation
linear neural networks. In International Conference on
of energy-conserving physical systems. In Proceedings
Learning Representations, 2014.
of 35th IEEE Conference on Decision and Control, vol-
Schmidhuber, J. Learning to control fast-weight memories: ume 1, pp. 201–206, 1996.
An alternative to dynamic recurrent networks. Neural
Vanajakshi, L. and Rilett, L. Loop detector data diagnostics
Computation, 4(1):131–139, 1992.
based on conservation-of-vehicles principle. Transporta-
Schmidhuber, J. Deep learning in neural networks: An tion research record, 1870(1):162–169, 2004.
overview. Neural networks, 61:85–117, 2015.
Xiao, X. and Duan, H. A new grey model for traffic flow
Schmidhuber, J., Wierstra, D., Gagliolo, M., and Gomez, F. mechanics. Engineering Applications of Artificial Intelli-
Training recurrent networks by Evolino. Neural Compu- gence, 88:103350, 2020.
tation, 19(3):757–779, 2007.
Yilmaz, K. K., Gupta, H. V., and Wagener, T. A process-
Schmidt, M. and Lipson, H. Distilling free-form natural based diagnostic approach to model evaluation: Appli-
laws from experimental data. science, 324(5923):81–85, cation to the nws distributed hydrologic model. Water
2009. Resources Research, 44(9):1–18, 2008. ISSN 00431397.

Seibert, J., Vis, M. J. P., Lewis, E., and van Meerveld, H. J. Yitian, L. and Gu, R. R. Modeling flow and sediment trans-
Upper and lower benchmarks in hydrological modelling. port in a river system using an artificial neural network.
Hydrological Processes, 32(8):1120–1125, 2018. Environmental management, 31(1):0122–0134, 2003.

Sellars, S. “grand challenges” in big data and the earth sci- Zhao, Z., Chen, W., Wu, X., Chen, P. C., and Liu, J. Lstm
ences. Bulletin of the American Meteorological Society, network: a deep learning approach for short-term traffic
99(6):ES95–ES98, 2018. forecast. IET Intelligent Transport Systems, 11(2):68–75,
2017.
Song, X.-H. and Hopke, P. K. Solving the chemical mass
balance problem using an artificial neural network. Envi-
ronmental science & technology, 30(2):531–535, 1996.

Sutskever, I., Vinyals, O., and Le, Q. V. Sequence to se-


quence learning with neural networks. In Advances in
neural information processing systems, pp. 3104–3112,
2014.

93
MC-LSTM: Appendix

A. Notation overview

Most of the notation used throughout the paper is summarized in Tab. A.1.

B. Experimental Details

In the following, we provide further details on the experimental setups.

B.1. Neural arithmetic

Neural networks that learn arithmetic operations have recently come into focus (Trask et al., 2018; Madsen & Johansen, 2020). Specialized neural modules for arithmetic operations could play a role for complex AI systems since cognitive studies indicate that there is a part of the brain that enables animals and humans to perform basic arithmetic operations (Nieder, 2016; Gallistel, 2018). Although this primitive number processor can only perform approximate arithmetic, it is a fundamental part of our ability to understand and interpret numbers (Dehaene, 2011).

B.1.1. Details on datasets

We consider the addition problem that was proposed in the original LSTM paper (Hochreiter & Schmidhuber, 1997). We chose input values in the range [0, 0.5] in order to be able to use the fast standard implementations of LSTM. For this task, 20 000 samples were generated using a fixed random seed to create a dataset, which was split into 50% training and 50% validation samples. For the test data, 1 000 samples were generated with a different random seed.

A definition of the static arithmetic task is provided by Madsen & Johansen (2020). The following presents this definition and its extension to the recurrent arithmetic task (cf. Trask et al., 2018).

The input for the static version is a vector, x ∈ U(1, 2)^100, consisting of numbers that are drawn randomly from a uniform distribution. The target, y, is computed as

y = \left( \sum_{k=a}^{a+c} x_k \right) \odot \left( \sum_{k=b}^{b+c} x_k \right),

where c ∈ N, a ≤ b ≤ a + c ∈ N and \odot ∈ {+, −, ·}. For the recurrent variant, the input consists of a sequence of T vectors, denoted by x^t ∈ U(1, 2)^10, t ∈ {1, . . . , T}, and the labels are computed as

y = \left( \sum_{t=1}^{T} \sum_{k=a}^{a+c} x_k^t \right) \odot \left( \sum_{t=1}^{T} \sum_{k=b}^{b+c} x_k^t \right).

For these experiments, no fixed datasets were used. Instead, samples were generated on the fly. For the recurrent tasks, 2 000 000 batches of 128 problems were created, and for the static tasks 500 000 batches of 128 samples were used in the addition and subtraction tasks and 3 000 000 batches for multiplication. Note that since the subsets overlap, i.e., inputs are re-used, this data does not have mass-conservation properties.

For a more detailed description of the MNIST addition data, we refer to Trask et al. (2018) and the appendix of Madsen & Johansen (2020).

B.1.2. Details on hyperparameters

For the addition problem, every network had a single hidden layer with 10 units. The output layer was a linear, fully connected layer for all MC-LSTM and LSTM variants. The NAU (Madsen & Johansen, 2020) and NALU/NAC (Trask et al., 2018) networks used their corresponding output layer. Also, we used a more common L2 regularization scheme with a low regularization constant (10^-4) to keep the weights ternary for the NAU, rather than the strategy used in the reference implementation from Madsen & Johansen (2020). Optimization was done using Adam (Kingma & Ba, 2015) for all models. The initial learning rate was selected from {0.1, 0.05, 0.01, 0.005, 0.001} on the validation data for each method individually. All methods were trained for 100 epochs.

The weight matrices of LSTM were initialized in a standard way, using orthogonal and identity matrices for the forward and recurrent weights, respectively. Biases were initialized to be zero, except for the bias in the forget gate, which was initialized to 3. This should benefit the gradient flow for the first updates. Similarly, MC-LSTM is initialized so that the redistribution matrix (cf. Eq. 7) is (close to) the identity matrix. Otherwise, we used orthogonal initialization (Saxe et al., 2014). The bias for the output gate was initialized to −3. This stimulates the output gates to stay closed (keep mass in the system), which has a similar effect as setting the forget gate bias in LSTM. This practically holds for all subsequently described experiments.

For the recurrent arithmetic tasks, we tried to stay as close as possible to the setup that was used by Madsen & Johansen (2020). This means that all networks again had a single hidden layer. The NAU, Neural Multiplication Unit (NMU) and NALU networks all had two hidden units and, respectively, NAU, NMU and NALU output layers. The first, recurrent layer for the first two networks was a NAU and the NALU network used a recurrent NALU layer. For the exact initialization of NAU and NALU, we refer to Madsen & Johansen (2020).

The MC-LSTM models used a fully connected linear layer with L2-regularization for projecting the hidden state to the output prediction for the addition and subtraction tasks.
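To make the data setup of Appendix B.1.1 concrete, the following minimal Python sketch generates one batch of the recurrent arithmetic task. The index bounds a, b, c, the sequence length and the function name are illustrative assumptions; only the sampling range U(1, 2) and the batch size of 128 are taken from the description above.

import numpy as np

def recurrent_arithmetic_batch(op, batch_size=128, seq_len=10, n_features=10,
                               a=5, b=6, c=2, rng=None):
    # Inputs are drawn from U(1, 2); the two (possibly overlapping) index
    # subsets [a, a+c] and [b, b+c] are summed over features and timesteps
    # before the operation `op` (+, - or *) is applied.
    rng = np.random.default_rng() if rng is None else rng
    x = rng.uniform(1.0, 2.0, size=(batch_size, seq_len, n_features))
    left = x[:, :, a:a + c + 1].sum(axis=(1, 2))
    right = x[:, :, b:b + c + 1].sum(axis=(1, 2))
    ops = {"+": np.add, "-": np.subtract, "*": np.multiply}
    return x, ops[op](left, right)

# e.g., one addition batch (the static task corresponds to seq_len = 1):
x, y = recurrent_arithmetic_batch("+")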

Table A.1. Symbols and notations used in this paper.

Definition                                  Symbol/Notation    Dimension
mass input at timestep t                    x^t or x_t         M or 1
auxiliary input at timestep t               a^t                L
cell state at timestep t                    c^t                K
limit of the sequence of cell states        c^∞
hidden state at timestep t                  h^t                K
redistribution matrix                       R                  K × K
input gate                                  i                  K
output gate                                 o                  K
mass                                        m                  K
input gate weight matrix                    W_i                K × L
output gate weight matrix                   W_o                K × L
input gate weight matrix                    U_i                K × K
output gate weight matrix                   U_o                K × K
identity matrix                             I                  K × K
input gate bias                             b_i                K
output gate bias                            b_o                K
arbitrary differentiable function           f
hypernetwork function (conditioner)         g
redistribution gate bias                    B_R                K × K
stored mass                                 m_c
mass efflux                                 m_h
limit of the series of mass inputs          m_x^∞
timestep index                              t
an arbitrary timestep                       τ
last timestep of a sequence                 T
redistribution gate weight tensor           W_r                K × K × L
redistribution gate weight tensor           U_r                K × K × K
arbitrary feature index                     a
arbitrary feature index                     b
arbitrary feature index                     c

A free linear layer was used to compensate for the fact that the data does not have mass-conserving properties. However, it is important to note that the mass conservation in MC-LSTM is still necessary to solve this task. For the multiplication problem, we used a multiplicative, non-recurrent variant of MC-LSTM with an extra scalar parameter to allow the conserved mass to be re-scaled if necessary. This multiplicative layer is described in more detail in Appendix B.1.3. Whereas the addition could be solved with two hidden units, MC-LSTM needed three hidden units to solve both subtraction and multiplication. This extra unit, which we refer to as the trash cell, allows MC-LSTMs to get rid of excessive mass that should not influence the prediction. Note that, since the mass inputs are vectors, the input gate has to be computed in a similar fashion as the redistribution matrix.

Adam was again used for the optimization. We used the same learning rate (0.001) as Madsen & Johansen (2020) to train the NAU, NMU and NALU networks. For MC-LSTM, the learning rate was increased to 0.01 for addition and subtraction and to 0.05 for multiplication after a manual search on the validation set. All models were trained for two million update steps.

In a similar fashion, we used the same models from Madsen & Johansen (2020) for the MNIST addition task. For MC-LSTM, we replaced the recurrent NAU layer with an MC-LSTM layer and the output layer was replaced with a fully connected linear layer. In this scenario, increasing the learning rate was not necessary. This can probably be explained by the fact that training the CNN to regress the MNIST images is the main challenge during learning. We also used a standard L2-regularization on the outputs of the CNN instead of the implementation proposed in Madsen & Johansen (2020) for this task.

B.1.3. Static arithmetic

This experiment should enable a more direct comparison to the results from Madsen & Johansen (2020) than the recurrent variant. The data for the static task is equivalent to that of the recurrent task with sequence length one. For more details on the data, we refer to Appendix B.1.1 or Madsen & Johansen (2020).

Since the static task does not require a recurrent model, we discarded the redistribution matrix in MC-LSTM. The result is a layer with only input and output gates, which we refer to as a Mass-Conserving Fully Connected (MC-FC) layer. We compared this model to the results reported in Madsen & Johansen (2020), using the code base that accompanied the paper. All NALU and NAU networks had a single hidden layer. Similar to the recurrent task, MC-LSTM required two hidden units for addition and three for subtraction. Mathematically, an MC-FC with K hidden neurons and M inputs can be defined as MC-FC: R^M → R^K: x ↦ y, where

y = \mathrm{diag}(o) \cdot I \cdot x, \quad I = \mathrm{softmax}(B_I), \quad o = \sigma(b_o),

where the softmax operates on the row dimension to get a column-normalized matrix, I, for the input gate.

Using the log-exp transform (cf. Trask et al., 2018), a multiplicative MC-FC with scaling parameter α can be constructed as follows: exp(MC-FC(log(x)) + α). The scaling parameter is necessary to break the mass conservation when it is not needed. By replacing the output layer with this multiplicative MC-FC, it can also be used to solve the multiplication problem. This network also required three hidden neurons. This model was compared to an NMU network with two hidden neurons and a NALU network.

All models were trained for two million updates with the Adam optimizer (Kingma & Ba, 2015). The learning rate was set to 0.001 for all networks, except for the MC-FC network, which needed a lower learning rate of 0.0001, and the multiplicative MC-FC variant, which was trained with learning rate 0.01. These hyperparameters were found using a manual search.

Since the input consists of a vector, the input gate predicts a left-stochastic matrix, similar to the redistribution matrix. This allows us to verify the generalization abilities of the inductive bias in MC-LSTMs. The performance was measured in a similar way as for the recurrent task, except that generalization was tested over the range of the input values (Madsen & Johansen, 2020). Concretely, the models were trained on input values in [1, 2] and tested on input values in the range [2, 6]. Table B.2 shows that MC-FC is able to match or outperform both NALU and NAU on this task.

B.1.4. Comparison with time-dependent MC-LSTM

We used MC-LSTM with a time-independent redistribution matrix, as in Eq. (7), to solve the addition problem. This resembles another form of inductive bias, since we know that no redistribution across cells is necessary to solve this problem, and it also results in a more efficient model, because fewer parameters have to be learned. However, for the sake of flexibility, we also verified that it is possible to use the more general time-dependent redistribution matrix (cf. Eq. 8). The results of this experiment can be found in Table B.3. Although the performance of MC-LSTM with a time-dependent redistribution matrix is slightly worse than that of the more efficient MC-LSTM variant, it still outperforms all other models on the generalisation tasks. This can partly be explained by the fact that it is harder to train a time-dependent redistribution matrix while the training budget is limited to 100 epochs.
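A minimal PyTorch sketch of the MC-FC layer defined in Appendix B.1.3 is given below. The class name, zero initialization, and the exact implementation details are illustrative assumptions; the equation y = diag(o) · I · x with a column-normalized input gate and the multiplicative log-exp variant are taken from the description above.

import torch
import torch.nn as nn

class MCFC(nn.Module):
    # Mass-conserving fully connected layer: y = diag(o) @ I @ x,
    # with column-normalized input gate I = softmax(B_I) and output gate o = sigmoid(b_o).
    def __init__(self, in_features: int, out_features: int):
        super().__init__()
        self.B_I = nn.Parameter(torch.zeros(out_features, in_features))
        self.b_o = nn.Parameter(torch.zeros(out_features))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # softmax over the row (output) dimension -> every column sums to one,
        # so incoming mass is only redistributed, never created or destroyed
        I = torch.softmax(self.B_I, dim=0)
        o = torch.sigmoid(self.b_o)
        return o * (x @ I.t())

def multiplicative_mcfc(layer: MCFC, x: torch.Tensor, alpha: torch.Tensor) -> torch.Tensor:
    # Multiplicative variant via the log-exp transform: exp(MC-FC(log(x)) + alpha).
    return torch.exp(layer(torch.log(x)) + alpha)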

Table B.2. Results for the static arithmetic task. MC-FC is a mass-conserving variant of MC-LSTM based on fully-connected layers for non-recurrent tasks. MC-FCs for addition and subtraction/multiplication have two and three neurons, respectively. Error bars represent 95% confidence intervals.

               addition                           subtraction                        multiplication
               success rate(a)   updates(b)       success rate(a)   updates(b)       success rate(a)   updates(b)
MC-FC          100% (+0%/−4%)    2.1 · 10^5       100% (+0%/−4%)    1.6 · 10^5       100% (+0%/−4%)    1.4 · 10^6
NAU / NMU      100% (+0%/−4%)    1.8 · 10^4       100% (+0%/−4%)    5.0 · 10^3       98% (+1%/−5%)     1.4 · 10^6
NAC            100% (+0%/−4%)    2.5 · 10^5       100% (+0%/−4%)    9.0 · 10^3       31% (+10%/−8%)    2.8 · 10^6
NALU           14% (+8%/−5%)     1.5 · 10^6       14% (+8%/−5%)     1.9 · 10^6       0% (+4%/−0%)      –

(a) Percentage of runs that generalized to a different input range.
(b) Median number of updates necessary to solve the task.

Table B.3. Performance of different models on the LSTM addition task in terms of the MSE. MC-LSTM significantly (all p-values below .05) outperforms its competitors, LSTM (with high initial forget gate bias), NALU and NAU. Error bars represent 95%-confidence intervals across 100 runs.

               reference(a)      seq length(b)     input range(c)   count(d)      combo(e)       NaN(f)
MC-LSTM(*)     0.013 ± 0.004     0.022 ± 0.010     2.6 ± 0.8        2.2 ± 0.7     13.6 ± 4.0     0
MC-LSTM        0.004 ± 0.003     0.009 ± 0.004     0.8 ± 0.5        0.6 ± 0.4     4.0 ± 2.5      0
LSTM           0.008 ± 0.003     0.727 ± 0.169     21.4 ± 0.6       9.5 ± 0.6     54.6 ± 1.0     0
NALU           0.060 ± 0.008     0.059 ± 0.009     25.3 ± 0.2       7.4 ± 0.1     63.7 ± 0.6     93
NAU            0.248 ± 0.019     0.252 ± 0.020     28.3 ± 0.5       9.1 ± 0.2     68.5 ± 0.8     24

(a) training regime: summing 2 out of 100 numbers between 0 and 0.5.
(b) longer sequence lengths: summing 2 out of 1 000 numbers between 0 and 0.5.
(c) more mass in the input: summing 2 out of 100 numbers between 0 and 5.0.
(d) higher number of summands: summing 20 out of 100 numbers between 0 and 0.5.
(e) combination of previous scenarios: summing 10 out of 500 numbers between 0 and 2.5.
(f) Number of runs that did not converge.
(*) MC-LSTM with time-dependent redistribution matrix.

B.1.5. Qualitative analysis of the MC-LSTM models trained on arithmetic tasks

Addition Problem. To reiterate, we used MC-LSTM with 10 hidden units and replaced the linear output layer by a simple summation. The model has to learn to sum all mass inputs of the timesteps where the auxiliary input (the marker) equals a^t = 1, and ignore all other values. At the final timestep — where the auxiliary input equals a^t = −1 — the network should output the sum of all previously marked mass inputs.

In our experiment, the model has learned to store the marked input values in a single cell, while all other mass inputs mainly end up in a single, different cell. That is, a single cell learns to accumulate the inputs to compute the solution and the other cells are used as trash cells. In Fig. B.1, we visualize the cell states for a single input sample over time, where the orange and the blue line denote the mass accumulator and the main trash cell, respectively.

We can see that at the last timestep — where the network is queried to return the accumulated sum — the value of this mass accumulator drops to zero, i.e., the output gate is completely open. Note that this would not be the case for a model with a fully connected layer. After all, the fully connected layer can arbitrarily scale the output of the MC-LSTM layer, which allows the output gate to open only partially. Apart from this distinction in the last timestep, the cell states for both models behave the same way. For all other cells (grey lines), the output gate at the last timestep is zero. This illustrates nicely how the model output is only determined by the value of the single cell that acted as accumulator of the marked values (orange line).

[Figure B.1: cell state value plotted against the timestep for the trash cell, the main cell, and all other cells.]
Figure B.1. MC-LSTM cell states over time for a model trained to solve the addition problem (see Appendix B.1.1). Each line denotes the value of one particular cell over time, while the two vertical grey indicator lines denote the timesteps where the auxiliary input was 1 (i.e., which numbers in the sequence have to be added).

Recurrent Arithmetic. In the following, we take a closer look at the solution that is learned with MC-LSTM. Concretely, we look at the weights of an MC-LSTM model that successfully solves the following recurrent arithmetic task:

y = \sum_{t=1}^{T} \left( x_6^t + x_7^t \right) \odot \sum_{t=1}^{T} \left( x_7^t + x_8^t \right),

where \odot ∈ {−, +}, given a sequence of input vectors x^t ∈ R^10 (the only purpose of the colors is to provide an aid to readers). We highlight the following observations:

1. For the addition task (i.e., \odot ≡ +), MC-LSTM has two units (see Appendix B.1.2 for details on the experiments). Trask et al. (2018) and Madsen & Johansen (2020) fixed the number of hidden units to two with the idea that each unit can learn one term of the addition operation (\odot). However, if we take a look at the input gate of our model, we find that the first cell is used to accumulate (x_1^t + . . . + x_5^t + 0.5 x_6^t + 0.5 x_8^t + x_9^t + x_{10}^t) and the second cell collects (0.5 x_6^t + x_7^t + 0.5 x_8^t). Since the learned redistribution matrix is the identity matrix, these accumulators operate individually.

This means that, instead of computing the individual terms, MC-LSTM directly computes the solution, scaled by a factor 1/2, in its second cell. The first cell accumulates the rest of the mass, which it does not need for the prediction. In other words, it operates as some sort of trash cell. Note that, due to the mass-conservation property, it would be impossible to compute each side of the operation individually. After all, x_7^t appears on both sides of the central operation (\odot), and therefore the data is not mass conserving.

The output gate is always open for the trash cell and closed for the other cell, indicating that redundant mass is discarded through the output of the MC-LSTM in every timestep and the scaled solution is properly accumulated. However, in the final timestep — when the prediction is to be made — the output gate for the trash cell is closed and opened for the other cell. That is, the accumulated solution is passed to the final linear layer, which scales the output of MC-LSTM by a factor of two to get the correct solution.

2. For the subtraction task (i.e., \odot ≡ −), a similar behavior can be observed. In this case, the final model requires three units to properly generalize. The first two cells accumulate x_6^t and x_8^t, respectively. The last cell operates as trash cell and collects (x_1^t + . . . + x_5^t + x_7^t + x_9^t + x_{10}^t). The redistribution matrix is the identity matrix for the first two cells. For the trash cell, equal parts (0.4938) are redistributed to the two other cells. The output gate operates in a similar fashion as for addition. Finally, the linear layer computes the difference between the first two cells with weights 1 and −1, and the trash cell is ignored with weight 0.

Although MC-LSTM with two units was not able to generalize well enough for the Madsen & Johansen (2020) benchmarks, it did turn out to be able to provide a reasonable solution (albeit with numerical flaws). With two cells, the network learned to store (0.5 x_1^t + . . . + 0.5 x_5^t + x_6^t + 0.5 x_7^t + 0.5 x_9^t + 0.5 x_{10}^t) in one cell, and (0.5 x_1^t + . . . + 0.5 x_5^t + 0.5 x_7^t + x_8^t + 0.5 x_9^t + 0.5 x_{10}^t) in the other cell. With a similar linear layer as for the three-unit variant, this solution should also compute a correct solution for the subtraction task.

B.2. Inbound-outbound traffic forecast

Traffic forecasting considers a large number of different settings and tasks (Tedjopurnomo et al., 2020), for example, whether the physical network topology of streets can be exploited by using graph neural networks combined with LSTMs (Cui et al., 2019). Within traffic forecasting, mass conservation translates to a conservation-of-vehicles principle. Generally, models that adhere to this principle are desired (Vanajakshi & Rilett, 2004; Zhao et al., 2017) since they could be useful for long-term forecasts. Many recent benchmarking datasets for traffic forecasts are uni-directional and are measured at few streets. Thus, conservation laws cannot be directly applied (Tedjopurnomo et al., 2020).

We demonstrate how MC-LSTM can be used in traffic forecasting settings. A typical setting for vehicle conservation is when traffic counts for inbound and outbound roads of a city are available. In this case, all vehicles that come from an inbound road must either be within the city or leave the city on an outbound road. The setting is similar to passenger flows in inbound and outbound metro (Liu et al., 2019), where LSTMs have also prevailed. We were able to extract such data from a recent dataset based on GPS locations of vehicles (Kreil et al., 2020) on a fine geographic grid around cities, which represents a good approximation of a vehicle-conserving scenario.

An approximately mass-conserving traffic dataset. Based on the data for the Traffic4cast 2020 challenge (Kreil et al., 2020), we constructed a dataset to model inbound and outbound traffic of three different cities: Berlin, Istanbul and Moscow. The original data consists of 181 sequences of multi-channel images encoding traffic volume and speed for every five minutes in four (binned) directions. Every sequence corresponds to a single day in the first half of the year. In order to get the traffic flow from the multi-channel images at every timestep, we defined a frame around the city and collected the traffic-volume data for every pixel on the border of this frame. This is illustrated in Fig. 3. For simplicity, we ignored the fact that a single-pixel frame might have issues with fast-moving vehicles.

By taking into account the direction of the vehicles, the inbound and outbound traffic can be combined for every pixel on the border of our frame. To get a more tractable dataset, we additionally combined the pixels of the four edges of the frame to end up with eight values: four values for the incoming traffic, i.e., one for each border of the frame, and four values for the outgoing traffic. The inbound traffic is the mass input for MC-LSTM and the target outputs are the outbound traffic along the different borders. The auxiliary input is the current daytime, encoded as a value between zero and one.

To model the sparsity that is often present in other traffic counting problems, we chose three time slots (6 am, 12 pm and 6 pm) for which we use fifteen minutes of the actual measurements — i.e., three timesteps. This could, for example, simulate the deployment of mobile traffic counting stations. The other inputs are imputed by the average inbound traffic over the training data, which consists of 181 days. Outputs are only available when the actual measurements are used. This gives a total of 9 timesteps per day on which the loss can be computed. For training, this dataset is randomly split into 85% training and 15% validation samples.

During inference, all 288 timesteps of the inbound and outbound measurements are used to find out which model learned the traffic dynamics from the sparse training data best. For this purpose, we used the 18 sequences of validation data from the original dataset as test set, which are distributed across the second half of the year. In order to enable a fair comparison between LSTM and MC-LSTM, the data for LSTM was normalized to zero mean and unit variance for training and inference (using statistics from the training data). MC-LSTM does not need this pre-processing step and is fed the raw data.
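The border aggregation described above can be sketched as follows. The array layout, the frame coordinates, and the assumption that the per-pixel counts are already split into inbound and outbound vehicles are illustrative; the actual preprocessing works on the Traffic4cast multi-channel images with four binned heading directions.

import numpy as np

def aggregate_frame(inbound, outbound, top, bottom, left, right):
    # `inbound`/`outbound`: (H, W) arrays of vehicle counts for one timestep,
    # already split by driving direction relative to the frame (assumption).
    # Returns four inbound and four outbound values, one per frame border.
    borders = {
        "north": (slice(top, top + 1), slice(left, right + 1)),
        "south": (slice(bottom, bottom + 1), slice(left, right + 1)),
        "west": (slice(top, bottom + 1), slice(left, left + 1)),
        "east": (slice(top, bottom + 1), slice(right, right + 1)),
    }
    x_in = np.array([inbound[idx].sum() for idx in borders.values()])
    y_out = np.array([outbound[idx].sum() for idx in borders.values()])
    return x_in, y_out  # mass inputs and target outputs for one timestep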

Model and Hyperparameters. For the traffic prediction, we used LSTM followed by a fully connected layer as baseline (cf. Zhao et al., 2017; Liu et al., 2019). For MC-LSTM, we chose to enforce end-to-end mass conservation by using an MC-FC output layer, which is described in detail in Appendix B.1.3. For the initialization of the models, we refer to the details of the arithmetic experiments in Appendix B.1.

For each model and for each city, the best hyperparameters were found by performing a grid search on the validation data. This means that the hyperparameters were chosen to minimize the error on the nine 5-minute intervals. For all models, the number of hidden neurons was chosen from {10, 50, 100} and for the learning rate, the options were {0.100, 0.050, 0.010, 0.005, 0.001}. All models were trained for 2 000 epochs using the Adam optimizer (Kingma & Ba, 2015). Additionally, we considered values in {0, 5} for the initial value of the forget gate bias in LSTM. For MC-LSTM, the extra hyperparameters were the initial cell state value (∈ {0, 100}) — i.e., how many cars are in each memory cell at timestep zero — and whether or not the initial cell state should be trained via backpropagation. The results of the hyperparameter search can be found in Tab. B.4.

The idea behind tuning the initial cell state is that, unlike with LSTM, the cell state in MC-LSTM directly reflects the number of cars that can drive out of a city during the first timesteps. If the initial cell state is too high or too low, this might negatively affect the prediction capabilities of the model. If it were possible to estimate the number of cars in a city at the start of the sequence, this could also be used to get better estimates for the initial cell state. However, from the results of the hyperparameter search (see Tab. B.4), we might have overestimated the importance of these hyperparameters.

Results. All models were evaluated on the test data, using the checkpoint after 2 000 epochs for fifty runs. An example of what the predictions of both models look like for an arbitrary day in an arbitrarily chosen city is displayed in Fig. B.2. The average root MSE (RMSE) and mean absolute error (MAE) are summarized in Tab. B.5. The results show that MC-LSTM is able to generalize significantly better than LSTM for this task. The RMSE of MC-LSTM is significantly better than that of LSTM (p-values 4e−10, 8e−3, and 4e−10 for Istanbul, Berlin, and Moscow, respectively; Wilcoxon test).

Table B.4. The hyperparameters resulting from the grid search for the traffic forecast experiment.

City       Model      hidden   lr      forget bias   initial state   learnable state
Berlin     LSTM       10       0.01    0             –               –
Berlin     MC-LSTM    100      0.01    –             0               True
Istanbul   LSTM       100      0.005   5             –               –
Istanbul   MC-LSTM    50       0.01    –             0               False
Moscow     LSTM       50       0.001   5             –               –
Moscow     MC-LSTM    10       0.01    –             0               False

Table B.5. Results on the outbound traffic forecast: average RMSE and MAE with 95% confidence intervals over 50 runs.

            Istanbul                  Berlin                    Moscow
            RMSE          MAE         RMSE          MAE         RMSE          MAE
MC-LSTM     7.3 ± 0.1     28 ± 2      13.6 ± 1.8    66 ± 1      25.5 ± 1.1    27.8 ± 1.1
LSTM        142.6 ± 4.4   84 ± 3      135.4 ± 5.0   84 ± 3      45.6 ± 0.8    31.7 ± 0.5

Figure B.2. Traffic forecasting models for outbound traffic in Moscow. An arbitrary day has been chosen for display. Note that both models have only been trained on data at timesteps 71-73, 143-145, and 215-217. Colors indicate the four borders of the frame, i.e., north, east, south and west. Left: LSTM predictions shown in dashed lines versus the actual traffic counts (solid lines). Right: MC-LSTM predictions shown in dashed lines versus the actual traffic counts (solid lines).

B.3. Damped pendulum

In the area of physics, we consider the problem of modeling a swinging pendulum with friction. The conserved quantity of interest is the total energy. During the movement of the pendulum, kinetic energy is converted into potential energy and vice versa. Neglecting friction, the total energy is conserved and the movement would continue indefinitely. Accounting for friction, energy dissipates and the swinging slows over time until a fixed point is reached. This type of behavior presents a difficulty for machine learning and is impossible for methods that assume the pendulum to be a closed system, such as HNNs (Greydanus et al., 2019). We postulated that both energy conversion and dissipation can be fitted by machine learning models, but that an appropriate inductive bias will allow the models to generalize from the learned data with more ease.

To train the models, we generated a set of timeseries using the differential equations for a pendulum with friction. For small angles, this problem is equivalent to the harmonic oscillator and an analytic solution exists with which we can compare the models (Iten et al., 2020). We used multiple different settings for the initial angle, the length of the pendulum, the amount of friction, the length of the training period, and with and without Gaussian noise. Each model received the initial kinetic and potential energy of the pendulum and must predict the consecutive timesteps. The time series always starts with the pendulum at the maximum displacement — i.e., the entire energy in the system is potential energy. We generated timeseries of potential and kinetic energies by iterating the following settings/conditions: initial amplitude ({0.2, 0.4}), pendulum length ({0.75, 1}), length of training sequence in terms of timesteps ({100, 200, 400}), noise level ({0, 0.01}), and dampening constant ({0.0, 0.1, 0.2, 0.4, 0.8}). All combinations of those settings were used to generate a total of 120 datasets, for which we train both models (the autoregressive LSTM and MC-LSTM).
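As a rough illustration of this data generation, the following sketch integrates the damped pendulum and returns potential and kinetic energy time series. The exact ODE form, the gravity constant, the pendulum mass, the timestep size, and the noise model are assumptions; only the iterated settings listed above are taken from our setup.

import numpy as np
from scipy.integrate import solve_ivp

def pendulum_energies(amplitude=0.2, length=0.75, damping=0.1, n_steps=400,
                      dt=0.05, noise=0.0, g=9.81, mass=1.0, seed=0):
    # Damped pendulum: theta'' = -(g/L) sin(theta) - c * theta'.
    # The pendulum starts at maximum displacement (all energy is potential).
    def ode(_, state):
        theta, omega = state
        return [omega, -(g / length) * np.sin(theta) - damping * omega]

    t = np.arange(n_steps) * dt
    sol = solve_ivp(ode, (t[0], t[-1]), [amplitude, 0.0], t_eval=t, rtol=1e-8)
    theta, omega = sol.y
    e_pot = mass * g * length * (1.0 - np.cos(theta))
    e_kin = 0.5 * mass * (length * omega) ** 2
    if noise > 0:
        rng = np.random.default_rng(seed)
        e_pot = e_pot + rng.normal(0.0, noise, size=e_pot.shape)
        e_kin = e_kin + rng.normal(0.0, noise, size=e_kin.shape)
    return np.stack([e_pot, e_kin], axis=1)  # shape: (n_steps, 2)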

We trained an autoregressive LSTM that receives its current state and a low-dimensional temporal embedding (using nine sinusoidal curves with different frequencies) to predict the potential and kinetic energy of the pendulum. Similarly, MC-LSTM is trained in an autoregressive mode, where a hypernetwork obtains the current state and the same temporal embedding as LSTM. The model setup is thus similar to an autoregressive model with exogenous variables from the classical timeseries modelling literature. To obtain suitable hyperparameters, we manually adjusted the learning rate (0.01), the hidden size of LSTM (256), the hypernetwork for estimating the redistribution (a fully connected network with 3 layers, ReLU activations and hidden sizes of 50, 100, and 2, respectively), the optimizer (Adam; Kingma & Ba, 2015) and the training procedure (crucially, the amount of additionally considered timesteps in the loss after a threshold is reached; see the explanation of the used loss below) on a separately generated validation dataset.

For MC-LSTM, a hidden size of two was used so that each state directly maps to the two energies. The hypernetwork consists of three fully connected layers of size 50, 100 and 4, respectively. To account for the critical values at the extreme points of the pendulum (i.e., the amplitudes — where the energy is present only in the form of potential energy — and the midpoint — where only kinetic energy exists), we slightly offset the cell state from the actual predicted value by using a linear regression with a slope of 1.02 and an intercept of −0.01.

For both models, we used a combination of Pearson's correlation of the energy signals and the MSE as a loss function (by subtracting the mean of the former from the latter). Further, we used a simple curriculum to deal with the long autoregressive nature of the timeseries (Bengio et al., 2015): starting at a time window of eleven, we added five additional timesteps whenever the combined loss was below −0.9.

Overall, MC-LSTM significantly outperformed LSTM with a mean MSE of 0.01 (standard deviation 0.02) compared to 0.07 (standard deviation 0.14; p-value 4.7e−10, Wilcoxon test).
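A minimal sketch of this combined objective is given below. The exact reduction over signals and the epsilon term are assumptions based on the description above (MSE minus the mean Pearson correlation of the predicted and observed potential and kinetic energy).

import torch

def pendulum_loss(pred: torch.Tensor, target: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # pred, target: tensors of shape (timesteps, 2) holding E_pot and E_kin.
    mse = torch.mean((pred - target) ** 2)
    p = pred - pred.mean(dim=0, keepdim=True)
    t = target - target.mean(dim=0, keepdim=True)
    corr = (p * t).sum(dim=0) / (p.norm(dim=0) * t.norm(dim=0) + eps)  # per-signal Pearson r
    return mse - corr.mean()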

B.3.1. Qualitative analysis of the MC-LSTM models trained for a pendulum

In the following, we analyse the behavior of the simplest pendulum setup, i.e., the one without friction. Special to the problem of the pendulum without friction is that there are no mass in- or outputs, and the whole dynamic of the system has to be modeled by the redistribution matrix. The initial state of the system is given by the displacement of the pendulum at the start, where all energy is stored as potential energy. Afterwards, the pendulum oscillates, converting potential to kinetic energy and vice versa.

In MC-LSTM, the conversion between the two forms of energy has to be learned by the redistribution matrix. More specifically, the off-diagonal elements denote the fraction of energy that is converted from one form to the other. In contrast, the diagonal elements of the redistribution matrix denote the fraction of energy that is not converted.

In Fig. B.3, we visualize the off-diagonal elements of the redistribution matrix (i.e., the conversion of energy) for the pendulum task without friction, as well as the modeled potential and kinetic energy. We can see that an increasing fraction of energy is converted into the other form, until the total energy of the system is stored as either kinetic or potential energy. As soon as the total energy is, e.g., converted into kinetic energy, the corresponding off-diagonal element (the orange line of the upper plot in Fig. B.3) drops to zero. Here, the other off-diagonal element (the blue line of the upper plot in Fig. B.3) starts to increase, meaning that energy is converted back from kinetic into potential energy. Note that the differences in the maximum values of the off-diagonal elements are not important, since at this point the corresponding energy is already approximately zero.

[Figure B.3: upper panel — fraction of moving energy (flows Epot→Ekin and Ekin→Epot) over time; lower panel — potential and kinetic energy (in J) over time.]
Figure B.3. Redistribution of energies in a pendulum learned by MC-LSTM. The upper plot shows the fraction of energy that is redistributed between the two cells that model Epot and Ekin over time. The continuous redistribution of energy results in the two time series of potential and kinetic energy displayed in the lower plot.

B.3.2. Comparison with Hamiltonian Neural Networks

We aimed at a comparison with HNN in the case of the friction-free pendulum. To this end, we use the data generation process by Greydanus et al. (2019). We use amplitudes of {0.2, 0.3, 0.4, 1}, training sequence lengths of {100, 200, 400}, and noise levels of {0, 0.01}, which leads to 24 time series. We adhere to the HNN reference implementation, which contains a gravity constant of g = 6 and mass m = 0.5. In the case of the pendulum with friction, the assumptions of HNNs are not met, which leads to problematic modeling behavior (see Figure B.4).

The HNNs directly predict the symplectic gradients that provide the dynamics for the pendulum. These gradients can then be integrated to obtain position and momentum for future timesteps. From these predictions, we compute the potential and kinetic energy over time. For MC-LSTM, we used the autoregressive version as described above and used position and momentum, both rescaled to amplitude 1, as auxiliary inputs. Note that HNNs are feed-forward networks, and the dynamics are obtained by integrating over their predictions. This implies that due to the periodicity of the data, the samples in the test set could also be in the training data. Moreover, there is only noise on the input data, i.e., position and momentum, but not on the time derivatives, such that HNNs receive non-noisy labels. Therefore, the training could be considered less noisy for HNN compared to MC-LSTM. The mean squared error of the predictions for the potential and kinetic energy is compared against the analytic solution. Concretely, the average MSE of MC-LSTM is 4.6e−4, and the MSE of HNNs is 3.0e−4. On 16 out of 24 datasets, MC-LSTM outperformed HNN, which indicates that there is no significant difference between the two methods (p-value 0.15, binomial test).

Figure B.4. Example of modeling a pendulum with friction with a HNN. HNNs assume a closed system and cannot model the pendulum with friction, from which energy dissipates.

B.4. Hydrology

Modeling river discharge from meteorological data (e.g., precipitation, temperature) is one of the most important tasks in hydrology, and is necessary for water resource management and risk mitigation related to flooding. Recently, Kratzert et al. (2019c; 2020) established LSTM-based models as the state of the art in rainfall–runoff modeling, outperforming traditional hydrological models by a large margin against most metrics (including peak flows, which is critical for flood prediction). However, the hydrology community is still reluctant to adopt these methods (e.g., Beven, 2020). A recent workshop on 'Big Data and the Earth Sciences' (Sellars, 2018) reported that "[m]any participants who have worked in modeling physical-based systems continue to raise caution about the lack of physical understanding of ML methods that rely on data-driven approaches."

One of the most basic principles in watershed modeling is mass conservation.

Whether water is treated as a resource (e.g., droughts) or hazard (e.g., floods), a modeller must be sure that they are accounting for all of the water in a catchment. Thus, most models conserve mass (Todini, 1988), and attempt to explicitly implement the most important physical processes. The downside of this 'model everything' strategy is that errors are introduced for every real-world process that is not implemented in a model, or implemented incorrectly. In contrast, MC-LSTM is able to learn any necessary behavior that can be induced from the signal (like LSTM) while still conserving the overall water budget.

B.4.1. Details on the dataset

The data used in all hydrology related experiments is the publicly available Catchment Attributes and Meteorology for Large-sample Studies (CAMELS) dataset (Newman et al., 2014; Addor et al., 2017b). CAMELS contains data for 671 basins and is curated by the US National Center for Atmospheric Research (NCAR). It contains only basins with relatively low anthropogenic influence (e.g., dams and reservoirs), and basin sizes range from 4 to 25 000 km². The basins cover a range of different geo- and eco-climatologies, as described by Newman et al. (2015) and Addor et al. (2017a). Out of all 671 basins, we used 447 — these are the basins for which simulations from all benchmark models are available (see Sec. B.4.5). To reiterate, we used benchmark hydrology models that were trained and tested by other groups with experience using these models, and were therefore limited to the 447 basins with results for all benchmark models. The spatial distribution of the 447 basins across the contiguous USA (CONUS) is shown in Fig. B.5.

[Figure B.5: map of the 447 basins across the contiguous USA; the color bar shows NSE values from 0.0 to 1.0.]
Figure B.5. Spatial distribution of the 447 catchments considered in this study. The color denotes the Nash-Sutcliffe Efficiency of the MC-LSTM ensemble for each basin, where a value of 1 means perfect predictions.

For each catchment, roughly 30 years of daily meteorological data from three different products exist (DayMet, Maurer, NLDAS). Each meteorological dataset consists of five different variables: daily cumulative precipitation, daily minimum and maximum temperature, average short-wave radiation, and vapor pressure. We used the Maurer forcing data because this is the data product that was used by all benchmark models (see Sec. B.4.5). In addition to meteorological data, CAMELS also includes a set of static catchment attributes derived from remote sensing or CONUS-wide available data products. The static catchment attributes can broadly be grouped into climatic, vegetation or hydrological indices, as well as soil and topological properties. In this study, we used the same 27 catchment attributes as Kratzert et al. (2019c). Target data were daily averaged streamflow observations, originally from the USGS streamflow gauge network, which are also included in the CAMELS dataset.

Training, validation and test set. Following the calibration and test procedure of the benchmark hydrology models, we trained on streamflow observations from 1 October 1999 through 30 September 2008 and tested on observations from 1 October 1989 to 30 September 1999. The remaining period (1 October 1980 to 30 September 1989) was used as validation period for hyperparameter tuning.

B.4.2. Details on the training setup and MC-LSTM hyperparameters

The general model setup follows insights from previous studies (Kratzert et al., 2018; 2019b;c; 2020), where LSTMs were used for the same task. We use sequences of 365 timesteps (days) of meteorological inputs to predict discharge at the last timestep of the sequence (sequence-to-one prediction). The mass input x in this experiment was catchment-averaged precipitation (mm/day) and the auxiliary inputs a were the 4 remaining meteorological variables (min. and max. temperature, short-wave radiation and vapor pressure) as well as the 27 static catchment attributes, which are constant over time.

We tested a variety of MC-LSTM model configurations and adaptions for this specific task, which are briefly described below:

1. Processing auxiliary inputs with LSTM: Instead of directly using the auxiliary inputs in the input gate (Eq. 5), output gate (Eq. 6) and time-dependent mass redistribution (Eq. 8), we first processed the auxiliary inputs a with an LSTM and then used the output of this LSTM as the auxiliary inputs. The idea was to add additional memory for the auxiliary inputs, since in its base form only mass can be stored in the cell states of MC-LSTM. This could be seen as a specific adaption for the rainfall–runoff modeling application, since information about the weather today and in the past ought to be useful for controlling the gates and mass redistribution.

Empirically, however, we could not see any significant performance gain and therefore decided to not use the more complex version with an additional LSTM.

2. Auxiliary output + regularization to account for evapotranspiration: Of all precipitation falling in a catchment, only a part ends up as discharge in the river. Large portions of precipitation are lost to the atmosphere in the form of evaporation (from e.g. open water surfaces) and transpiration (from e.g. plants and trees), and to groundwater. One approach to account for this "mass loss" is the following: instead of summing over the outgoing mass (Eq. 4), we used a linear layer to connect the outgoing mass to two output neurons. One neuron was fitted against the observed discharge data, while the second was used to estimate water loss due to unobserved sinks. A regularization term was added to the loss function to account for this. This regularization term was computed as the difference between the sum of the outgoing mass from MC-LSTM and the sum over the two output neurons. This did work, and the timeseries of the second auxiliary output neuron gave interesting results (i.e., matching the expected behavior of the annual evapotranspiration cycle); however, results were not significantly better compared to our final model setup, which is why we rejected this architectural change.

3. Explicit trash cell: Another way to account for evapotranspiration that we tested is to allow the model to use one memory cell as an explicit "trash cell". That is, instead of deriving the final model prediction as the sum over the entire outgoing mass vector, we only calculate the sum over all but e.g. one element (see Eq. 13). This simple modification allows the model to use e.g. the first memory cell to discard mass from the system, which is then ignored for the model prediction. We found that this modification improved performance, and thus integrated it into our final model setup.

4. Input/output scaling to account for input/output uncertainty: Both input and output data in our applications inherit large uncertainties (Nearing et al., 2016), which is not ideal for mass-conserving models (and likely one of the reasons why LSTM performs so well compared to all other mass-conserving models). To account for that, we tried three different adaptions. First, we used a small fully connected network to derive time-dependent scaling weights for the mass input, which we regularized to be close to one. Second, we used a linear layer with positive weights to map the outgoing mass to the final model prediction, where all weights were initialized to one and the bias to zero. Third, we combined both. Out of the three, the input scaling resulted in the best performing model; however, the results were worse than not scaling.

5. Time-dependent redistribution matrix variants: For this experiment, a time-dependent redistribution matrix is necessary, since the underlying real-world processes (such as snow melt and thus the conversion of snow into e.g. soil moisture or surface runoff) are time-dependent. Since using the redistribution matrix as proposed in Eq. 8 is memory-demanding, especially for models with larger numbers of memory cells, we also tried a different method for this experiment. Here, we learned a fixed matrix (as in Eq. 7) and only calculated two vectors for each timestep. The final redistribution matrix was then derived as the outer product of the two time-dependent vectors and the static matrix. This resulted in lower memory consumption; however, the model performance deteriorated significantly, which could be a hint toward the complexity required to learn the redistribution processes in this problem.

6. Activation function of the redistribution matrix: We tested several different activation functions for the redistribution matrix in this experiment. Among those were the normalized sigmoid function (that is used e.g. for the input gate), the softmax function (as in Eq. 8) and the normalized ReLU activation function (see Eq. 18). We achieved the best results using the normalized ReLU variant and can only hypothesize about the reason: In this application (rainfall–runoff modelling) there are several state processes that are strictly disconnected. One example is snow and groundwater: groundwater will never turn into snow, and snow will never transform into groundwater (not directly at least; it will first need to percolate through upper soil layers). Using normalized sigmoids or softmax makes it numerically harder (or impossible) to not distribute at least some mass between every cell, because these activations can never be exactly zero. The normalized ReLU activation can do so, which might be the reason that it worked better in this case.

As an extension to the standard MC-LSTM model introduced in Eq. (5) to Eq. (8), we also used the mass input (precipitation) in all gates. The reason is the following: Different amounts of precipitation can lead to different processes. For example, low amounts of precipitation could be absorbed by the soil and stored as soil moisture, leading to effectively no immediate discharge contribution. Large amounts of precipitation, on the other hand, could lead to direct surface runoff if the water cannot infiltrate the soil at the rate at which the precipitation falls. Therefore, it is crucial that the gates have access to the information contained in the precipitation input. The final model design used in all hydrology experiments is described by the following equations:

$$m_{tot}^t = R^t \cdot c^{t-1} + i^t \cdot x^t \qquad (10)$$
$$c^t = (1 - o^t) \odot m_{tot}^t \qquad (11)$$
$$h^t = o^t \odot m_{tot}^t \qquad (12)$$
$$\hat{y} = \sum_{i=2}^{n} h_i^t, \qquad (13)$$

with the gates being defined by

$$i^t = \tilde{\sigma}\left(W_i \cdot a^t + U_i \cdot \frac{c^{t-1}}{\lVert c^{t-1}\rVert_1} + V_i \cdot x^t + b_i\right) \qquad (14)$$
$$o^t = \sigma\left(W_o \cdot a^t + U_o \cdot \frac{c^{t-1}}{\lVert c^{t-1}\rVert_1} + V_o \cdot x^t + b_o\right) \qquad (15)$$
$$R^t = \widetilde{\mathrm{ReLU}}\left(W_r \cdot a^t + U_r \cdot \frac{c^{t-1}}{\lVert c^{t-1}\rVert_1} + V_r \cdot x^t + B_r\right), \qquad (16)$$

where $\tilde{\sigma}$ is the normalized logistic function and $\widetilde{\mathrm{ReLU}}$ is the normalized rectified linear unit (ReLU) that we define in the following. The normalized logistic function of the input gate is defined by:

$$\tilde{\sigma}(i_k) = \frac{\sigma(i_k)}{\sum_k \sigma(i_k)}. \qquad (17)$$

In this experiment, the activation function for the redistribution gate is the normalized ReLU function, defined by:

$$\widetilde{\mathrm{ReLU}}(s_k) = \frac{\max(s_k, 0)}{\sum_k \max(s_k, 0)}, \qquad (18)$$

where $s$ is some input vector to the normalized ReLU function.

We manually tried different sets of hyperparameters, because a large-scale automatic hyperparameter search was not feasible. Besides trying out all variants as described above, the main hyperparameter that we tuned for the final model was the number of memory cells. For other parameters, such as the learning rate, mini-batch size and number of training epochs, we relied on previous work using LSTMs on the same dataset.

The final hyperparameters are a hidden size of 64 memory cells and a mini-batch size of 256. We used the Adam optimizer (Kingma & Ba, 2015) with a scheduled learning rate starting at 0.01, lowered to 0.005 after 20 epochs and to 0.001 after another 5 epochs. We trained the model for a total of 30 epochs and used the weights of the last epoch for the final model evaluation.
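To make Eqs. (10) to (18) concrete, the following is a minimal sketch of a single MC-LSTM timestep in PyTorch. It is not the implementation used for the experiments in this thesis; the parameter names, the tensor shapes and the small epsilon added for numerical stability are illustrative assumptions.

```python
import torch


def normalized_sigmoid(s, eps=1e-8):
    # Eq. (17): sigmoid activations rescaled so that they sum to one.
    s = torch.sigmoid(s)
    return s / (s.sum(dim=-1, keepdim=True) + eps)


def normalized_relu(s, dim, eps=1e-8):
    # Eq. (18): ReLU activations rescaled so that they sum to one along `dim`.
    s = torch.relu(s)
    return s / (s.sum(dim=dim, keepdim=True) + eps)


def mc_lstm_step(c_prev, a_t, x_t, params, eps=1e-8):
    """One MC-LSTM timestep following Eqs. (10) to (16).

    c_prev: (n,) cell states, a_t: (m,) auxiliary inputs, x_t: scalar mass input
    (precipitation). `params` holds weights with illustrative names and shapes,
    e.g. W_i: (n, m), U_i: (n, n), V_i and b_i: (n,), W_r: (n*n, m),
    U_r: (n*n, n), V_r and B_r: (n*n,).
    """
    n = c_prev.shape[0]
    c_norm = c_prev / (c_prev.norm(p=1) + eps)

    # Eq. (14): input gate, normalized so that the new mass is fully distributed.
    i_t = normalized_sigmoid(params["W_i"] @ a_t + params["U_i"] @ c_norm
                             + params["V_i"] * x_t + params["b_i"])
    # Eq. (15): output gate, ordinary sigmoid in (0, 1).
    o_t = torch.sigmoid(params["W_o"] @ a_t + params["U_o"] @ c_norm
                        + params["V_o"] * x_t + params["b_o"])
    # Eq. (16): redistribution matrix; each column sums to one, so mass is only
    # moved between cells, never created or destroyed.
    r_pre = params["W_r"] @ a_t + params["U_r"] @ c_norm + params["V_r"] * x_t + params["B_r"]
    R_t = normalized_relu(r_pre.view(n, n), dim=0)

    m_tot = R_t @ c_prev + i_t * x_t   # Eq. (10): redistributed old mass plus new mass
    c_t = (1.0 - o_t) * m_tot          # Eq. (11): mass that remains in the cells
    h_t = o_t * m_tot                  # Eq. (12): mass that leaves the system
    y_hat = h_t[1:].sum()              # Eq. (13): the first cell acts as a trash cell
    return c_t, h_t, y_hat
```

Because the input gate sums to one, every column of the redistribution matrix sums to one, and the output gate only removes part of the total mass, the cell states plus the accumulated outputs always balance the accumulated mass inputs (cf. Theorem 1 below).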

All weight matrices were initialized as (semi) orthogonal matrices (Saxe et al., 2014) and all bias terms with a constant value of zero. The only exception was the bias of the output gate, which we initialized to −3 to keep the output gate closed at the beginning of the training.

B.4.3. DETAILS ON THE EVALUATION METRICS

Table B.6 lists the definitions of all metrics used in the hydrology experiments as well as the corresponding references.

B.4.4. DETAILS ON THE LSTM MODEL

For the LSTM, we largely relied on expertise from previous studies (Kratzert et al., 2018; 2019c;b; 2020). The only hyperparameter we adapted was the number of memory cells, since we used fewer basins (447) than the previous studies (531). We found that an LSTM with 128 memory cells, compared to the 256 used in previous studies, gave slightly better results. Apart from that, we trained LSTMs with the same inputs and settings (sequence-to-one with a sequence length of 365) as described in the previous section for the MC-LSTM. We used the standard LSTM implementation from the PyTorch package (Paszke et al., 2019), i.e., with a forget gate (Gers et al., 2000). We manually initialized the bias of the forget gate to 3 in order to keep the forget gate open at the beginning of the training.

B.4.5. DETAILS ON THE BENCHMARK MODELS

The benchmark models were first collected by Kratzert et al. (2019c). All models were configured, trained and run by several different research groups, most often the respective model developers themselves. This was done to avoid any potential to favor our own models. All models used the same forcing data (Maurer) and the same time periods to train and test. The models can be classified into two groups:

1. Models trained for individual watersheds. These are SAC-SMA (Newman et al., 2017), VIC (Newman et al., 2017), three different model structures of FUSE (provided by Nans Addor via personal communication), mHM (Mizukami et al., 2019) and HBV (Seibert et al., 2018). For the HBV model, two different simulations exist: first, the ensemble average of 1000 untrained HBV models (lower benchmark) and, second, the ensemble average of 100 trained HBV models (upper benchmark). For details see Seibert et al. (2018).

2. Models trained regionally. For hydrological models, regional training means that one parameter transfer model was trained, which estimates watershed-specific model parameters through globally trained model functions from e.g. soil maps or other catchment attributes. For this setting, the benchmark dataset includes simulations of the VIC model (Mizukami et al., 2017) and mHM (Rakovec et al., 2019).

B.4.6. DETAILED RESULTS

Table B.7 provides results for MC-LSTM and LSTM averaged over the n = 10 model repetitions.

B.5. Ablation study

In order to demonstrate that the design choices of MC-LSTM are necessary together to enable accurate predictive models, we performed an ablation study. In this study, we make the following changes to the input gate, the redistribution operation, and the output gate, to test if mass conservation in the individual parts is necessary.

1. Input gate: We change the activation function of the input gate from a normalized sigmoid function to the standard sigmoid function, thus resulting in the input gate of a standard LSTM. Since the sigmoid function is bounded to (0, 1), the mass input x^t that is added into the system at every timestep t can be scaled to anywhere in (0, n · x^t).

2. Redistribution matrix: We remove the normalized activation function from the redistribution matrix and instead use a linear activation function. This allows for unconstrained and scaled flow of mass from each memory cell into each other memory cell.

3. Output gate: Instead of removing the outgoing mass (o^t ⊙ m_tot^t) from the cell states at each timestep t, we leave the cell states unchanged and keep all mass within the system.

We test these variants on data from the hydrology experiment. We chose 5 random basins to limit computational expenses and trained nine repetitions for each configuration and basin. The results are compared against the full mass-conserving MC-LSTM architecture as described in App. B.4.2 and are reported in Table B.8. The results of the ablation study indicate that the designs of the input gate, redistribution matrix, and output gate are necessary together for proficient predictive performance. The strongest decrease in performance is observed if the redistribution matrix does not conserve mass, and smaller decreases if the input or output gate does not conserve mass.
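To make the three ablation variants explicit, the sketch below shows how each of them would change a single timestep relative to the MC-LSTM sketch given earlier in this appendix (it reuses the normalized_sigmoid and normalized_relu helpers and the assumed parameter names from that sketch). It is an illustration only, not the code used for the ablation experiments.

```python
def mc_lstm_step_ablation(c_prev, a_t, x_t, params, variant="full", eps=1e-8):
    """One timestep; `variant` selects which component no longer conserves mass."""
    n = c_prev.shape[0]
    c_norm = c_prev / (c_prev.norm(p=1) + eps)

    i_pre = params["W_i"] @ a_t + params["U_i"] @ c_norm + params["V_i"] * x_t + params["b_i"]
    r_pre = (params["W_r"] @ a_t + params["U_r"] @ c_norm
             + params["V_r"] * x_t + params["B_r"]).view(n, n)
    o_t = torch.sigmoid(params["W_o"] @ a_t + params["U_o"] @ c_norm
                        + params["V_o"] * x_t + params["b_o"])

    # Variant 1: standard sigmoid input gate, so the mass added per step is no longer exactly x_t.
    i_t = torch.sigmoid(i_pre) if variant == "input" else normalized_sigmoid(i_pre)
    # Variant 2: linear (unnormalized) redistribution, so mass can be created or destroyed.
    R_t = r_pre if variant == "redistribution" else normalized_relu(r_pre, dim=0)

    m_tot = R_t @ c_prev + i_t * x_t
    h_t = o_t * m_tot
    # Variant 3: the outgoing mass is not removed from the cell states.
    c_t = m_tot if variant == "output" else (1.0 - o_t) * m_tot
    return c_t, h_t
```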
Table B.6. Definition of all metrics used in the hydrology experiments. The NSE is defined as the R² between simulated, ŷ, and observed, y, runoff and is listed for completeness. FHV and FLV are both derived from the flow duration curve, which is a cumulative frequency curve of the discharge. H for the FHV and L for the FLV correspond to the 2% highest flows and the 30% lowest flows, respectively.

Metric | Reference | Equation
Nash-Sutcliffe efficiency (NSE)^a | Nash & Sutcliffe (1970) | $1 - \frac{\sum_{t=1}^{T} (\hat{y}^t - y^t)^2}{\sum_{t=1}^{T} (y^t - \bar{y})^2}$
β-NSE decomposition^b | Gupta et al. (2009) | $(\mu_{\hat{y}} - \mu_y) / \sigma_y$
Top 2% peak flow bias (FHV)^c | Yilmaz et al. (2008) | $\frac{\sum_{h=1}^{H} (\hat{y}_h - y_h)}{\sum_{h=1}^{H} y_h} \times 100$
30% low flow bias (FLV)^d | Yilmaz et al. (2008) | $\frac{\sum_{l=1}^{L} (\log \hat{y}_l - \log \hat{y}_L) - \sum_{l=1}^{L} (\log y_l - \log y_L)}{\sum_{l=1}^{L} (\log y_l - \log y_L)} \times 100$

a: Nash-Sutcliffe efficiency: (−∞, 1], values closer to one are desirable.
b: β-NSE decomposition: (−∞, ∞), values closer to zero are desirable.
c: Top 2% peak flow bias: (−∞, ∞), values closer to zero are desirable.
d: Bottom 30% low flow bias: (−∞, ∞), values closer to zero are desirable.
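As a concrete reading of the formulas in Table B.6, the metrics could be computed from arrays of simulated and observed discharge roughly as sketched below. This is only an illustration derived from the definitions above; details of the actual evaluation code (e.g. the treatment of missing values or zero flows in the logarithms) are not shown.

```python
import numpy as np


def nse(sim, obs):
    # Nash-Sutcliffe efficiency: 1 minus the error variance scaled by the observed variance.
    return 1.0 - np.sum((sim - obs) ** 2) / np.sum((obs - obs.mean()) ** 2)


def beta_nse(sim, obs):
    # Beta component of the NSE decomposition: bias of the means, scaled by the observed std.
    return (sim.mean() - obs.mean()) / obs.std()


def fhv(sim, obs, h=0.02):
    # Top 2% peak flow bias, evaluated on the flow duration curve (flows sorted descending).
    sim_sorted, obs_sorted = np.sort(sim)[::-1], np.sort(obs)[::-1]
    n = round(h * len(obs))
    return np.sum(sim_sorted[:n] - obs_sorted[:n]) / np.sum(obs_sorted[:n]) * 100


def flv(sim, obs, l=0.3):
    # Bottom 30% low flow bias, evaluated on log flows relative to the minimum low flow.
    sim_sorted, obs_sorted = np.sort(sim)[::-1], np.sort(obs)[::-1]
    n = round(l * len(obs))
    sim_low, obs_low = np.log(sim_sorted[-n:]), np.log(obs_sorted[-n:])
    sim_sum = np.sum(sim_low - sim_low.min())
    obs_sum = np.sum(obs_low - obs_low.min())
    return (sim_sum - obs_sum) / obs_sum * 100
```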
Table B.7. Model robustness of MC-LSTM and LSTM results over the n = 10 different random seeds. For all n = 10 models, we calculated the median performance for each metric and report the mean and standard deviation of the median values in this table.

Model | MC^a | NSE^b | β-NSE^c | FLV^d | FHV^e
MC-LSTM Single | ✓ | 0.726±0.003 | -0.021±0.003 | -38.7±3.2 | -13.9±0.7
LSTM Single | ✗ | 0.737±0.003 | -0.035±0.005 | 13.6±3.4 | -14.8±1.0

a: Mass conservation (MC).
b: Nash-Sutcliffe efficiency: (−∞, 1], values closer to one are desirable.
c: β-NSE decomposition: (−∞, ∞), values closer to zero are desirable.
d: Bottom 30% low flow bias: (−∞, ∞), values closer to zero are desirable.
e: Top 2% peak flow bias: (−∞, ∞), values closer to zero are desirable.
Table B.8. Ablation study results of the hydrology experiment. Models are trained for five random basins with nine model repetitions. We computed the median over the repetitions and then the mean over the five basins.

Model | MC^a | NSE^b
MC-LSTM | ✓ | 0.635 ± 0.102
MC-LSTM − input | ✗ | 0.603 ± 0.123
MC-LSTM − output | ✗ | 0.55 ± 0.097
MC-LSTM − redis.^c | ✗ | −4.229 ± 8.982

a: Mass conservation (MC).
b: Nash-Sutcliffe efficiency: (−∞, 1], values closer to one are desirable.
c: For one out of five basins, all nine model repetitions resulted in NaNs during training. Here, we report the statistics calculated from only the four successful basins.

C. Theorems & proofs

Theorem 1 (Conservation property). Let $m_c^\tau = \sum_k c_k^\tau$ and $m_h^\tau = \sum_k h_k^\tau$ be, respectively, the mass in the MC-LSTM storage and the outputs at time $\tau$. At any timestep $\tau$, we have:

$$m_c^\tau = m_c^0 + \sum_{t=1}^{\tau} x^t - \sum_{t=1}^{\tau} m_h^t.$$

That is, the change of mass in the cell states is the difference between input and output mass, accumulated over time.

Proof. The proof is by induction and we use $m_{tot} = R^t \cdot c^{t-1} + i^t \cdot x^t$ from Eq. (2).

For $\tau = 0$, we have $m_c^0 = m_c^0 + \sum_{t=1}^{0} x^t - \sum_{t=1}^{0} m_h^t$, which is trivially true when using the convention that $\sum_{t=1}^{0} = 0$.

Assuming that the statement holds for $\tau = T$, we show that it must also hold for $\tau = T + 1$. Starting from Eq. (3), the mass of the cell states at time $T + 1$ is given by:

$$m_c^{T+1} = \sum_{k=1}^{K} (1 - o_k)\, m_{tot,k}^{T+1} = \sum_{k=1}^{K} m_{tot,k}^{T+1} - \sum_{k=1}^{K} o_k\, m_{tot,k}^{T+1},$$

where $m_{tot,k}^t$ is the $k$-th entry of the result from Eq. (2) (at timestep $t$). The sum over the entries in the first term can be simplified as follows:

$$\sum_{k=1}^{K} m_{tot,k}^{T+1} = \sum_{k=1}^{K} \left( \sum_{j=1}^{K} r_{kj}\, c_j^T + i_k\, x^{T+1} \right) = \sum_{j=1}^{K} c_j^T \left( \sum_{k=1}^{K} r_{kj} \right) + x^{T+1} \sum_{k=1}^{K} i_k = m_c^T + x^{T+1}.$$

The final simplification is possible because $R$ and $i$ are (left-)stochastic. The mass of the outputs can then be computed from Eq. (4):

$$m_h^{T+1} = \sum_{k=1}^{K} o_k\, m_{tot,k}^{T+1}.$$

Putting everything together, we find

$$m_c^{T+1} = \sum_{k=1}^{K} m_{tot,k}^{T+1} - \sum_{k=1}^{K} o_k\, m_{tot,k}^{T+1} = m_c^T + x^{T+1} - m_h^{T+1} = m_c^0 + \sum_{t=1}^{T} x^t - \sum_{t=1}^{T} m_h^t + x^{T+1} - m_h^{T+1} = m_c^0 + \sum_{t=1}^{T+1} x^t - \sum_{t=1}^{T+1} m_h^t.$$

By the principle of induction, we conclude that mass is conserved, as specified in Eq. (9).

Corollary 1. In each timestep $\tau$, the cell states $c_k$ are bounded by the sum of mass inputs $\sum_{t=1}^{\tau} x^t + m_c^0$, that is, $|c_k^\tau| \le \sum_{t=1}^{\tau} x^t + m_c^0$. Furthermore, if the series of mass inputs converges, $\lim_{\tau \to \infty} \sum_{t=1}^{\tau} x^t = m_x^\infty$, then the sum of cell states also converges.

Proof. Since $c_k^t \ge 0$, $x^t \ge 0$ and $m_h^t \ge 0$ for all $k$ and $t$,

$$|c_k^\tau| = c_k^\tau \le \sum_{k=1}^{K} c_k^\tau = m_c^\tau \le \sum_{t=1}^{\tau} x^t + m_c^0, \qquad (19)$$

where we used Theorem 1. Convergence follows immediately through the comparison test.
K K K
D. On random Markov matrices.
X X X
mTc +1 = (1 − ok )mTtot,k
+1
= mTtot,k
+1
− ok mTtot,k
+1
, When initializing an MC-LSTM model, the entries of the re-
k=1 k=1 k=1 distribution matrix R of dimension K × K are created from
non-negative and iid random variables (sij )1≤i,j≤K with
where mttot,k is the k-th entry of the result from Eq. (2) (at
finite means m and variances σ 2 and bounded fourth mo-
timestep t). The sum over entries in the first term can be
ments. We collect them in a matrix S. Next we assume that
simplified as follows:
those entries get column-normalized to obtain the random
 
XK K
X XK Markov matrix R.
m T +1
=  T
rkj cj + ik x T +1 
tot,k
k=1 k=1 j=1 Properties of Markov matrices and random Markov
K K
! K matrices. Let λ1 , . . . , λK be the eigenvalues and
X X X
= cTj rkj + xT +1 ik s1 , . . . , sK be the singular values of R, ordered such that
j=1 k=1 k=1 |λ1 | ≥ . . . ≥ |λK | and s1 ≥ . . . ≥ sk . We then have the
following properties for any Markov matrix (not necessarily
= mTc + xT +1 .
random):
The final simplification is possible because R and i are (left-
)stochastic. The mass of the outputs can then be computed • λ1 = 1.


• $1^T R = 1^T$.

• $s_1 = \lVert R \rVert_2 \le \sqrt{K}$.

Furthermore, for random Markov matrices, we have

• $\lim_{K \to \infty} s_1 = 1$ (Bordenave et al., 2012, Theorem 1.2).

For the reader's convenience we briefly discuss further selected interesting properties of random Markov matrices in the next paragraph, especially concerning the global behavior of their eigenvalues and singular values.

Circular and quartercircular law for random Markov matrices. In random matrix theory, one major field of interest concerns the behavior of eigenvalues and singular values when $K \to \infty$. One would like to find out what the limiting distribution of the eigenvalues or singular values looks like. To discuss the most important results in this direction for large Markov matrices $R$, let us introduce some notation.

• $\delta_a$ denotes the Dirac delta measure centered at $a$.

• By $\mu_R = \frac{1}{K} \sum_{k=1}^{K} \delta_{\lambda_k}$ we denote the empirical spectral density of the eigenvalues of $R$.

• Similarly, we define the empirical spectral density of the singular values of $R$ as $\nu_R = \frac{1}{K} \sum_{k=1}^{K} \delta_{s_k}$.

• $Q_\sigma$ denotes the quartercircular distribution on the interval $[0, \sigma]$ and

• $U_\sigma$ the uniform distribution on the disk $\{z \in \mathbb{C} : |z| \le \sigma\}$.

Then we have as $K \to \infty$:

• Quartercircular law theorem (Bordenave et al., 2012, Theorem 1.1): $\nu_{\sqrt{K} R} \to Q_\sigma$ almost surely.

• Circular law theorem (Bordenave et al., 2012, Theorem 1.3): $\mu_{\sqrt{K} R} \to U_\sigma$ almost surely.

The convergence here is understood in the sense of weak convergence of probability measures with respect to bounded continuous functions. Note that those two famous theorems originally appeared for $\frac{1}{\sqrt{K}} S$ instead of $\sqrt{K} R$. Much more detail on those results can be found in Bordenave et al. (2012).

Gradient flow of MC-LSTM for random redistributions. Here we provide a short note on the gradient dynamics of the cell state in a random MC-LSTM, hence at initialization of the model. Specifically, we want to provide some heuristics based on arguments about the behavior of large stochastic matrices. Let us start by recalling the formula for $c^t$:

$$c^t = (1 - o^t) \odot (R^t \cdot c^{t-1} + i^t \cdot x^t). \qquad (20)$$

Now we investigate the gradient $\left\lVert \frac{\partial c^t}{\partial c^{t-1}} \right\rVert_2$ in the limit $K \to \infty$. We assume that for $K \to \infty$, $o^t \approx 0$ and $i^t \approx 0$ for all $t$. Thus we approximately have:

$$\left\lVert \frac{\partial c^t}{\partial c^{t-1}} \right\rVert_2 \approx \lVert R^t \rVert_2. \qquad (21)$$

$R^t$ is a stochastic matrix, and $s_1 = \lVert R^t \rVert_2$ is its largest singular value. Theorem 1.2 from Bordenave et al. (2012) ensures that $\lVert R^t \rVert_2 \to 1$ for $K \to \infty$ under reasonable moment assumptions on the distribution of the unnormalized entries (see above). Thus we are able to conclude that $\left\lVert \frac{\partial c^t}{\partial c^{t-1}} \right\rVert_2 \approx 1$ for large $K$ and all $t$, which can prevent the gradients from exploding.
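The limiting behavior of the largest singular value that this argument relies on is easy to probe empirically. The sketch below (not part of the original analysis) column-normalizes random non-negative matrices of growing size and prints their spectral norms, which approach one as K grows.

```python
import torch

torch.manual_seed(0)
for K in (8, 64, 512, 2048):
    S = torch.rand(K, K)                        # non-negative iid entries
    R = S / S.sum(dim=0, keepdim=True)          # column-normalize: random Markov matrix
    s1 = torch.linalg.svdvals(R)[0]             # largest singular value (spectral norm)
    print(K, float(s1))                         # tends toward 1 as K grows
```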
Chapter 4: Conclusion
The aim of this thesis was to investigate the applicability of Deep Learning methods
in hydrology, more specifically that of LSTMs (and their variants) for rainfall–runoff
modeling. As I have shown in several publications, LSTMs excel at this task and are
the new state of the art for rainfall–runoff modeling.

I started from a very classical point of view, where LSTMs were trained in-
dividually for each watershed, which resembles the best practice for traditional
hydrological models. Although the performance of the LSTM was on a par with
the hydrology benchmark model, the most interesting result of the first study was
that regional training (i.e. training a single model on the data of multiple water-
sheds) seemed to result in better prediction accuracy. This is in contrast to regional
hydrological models, which usually perform considerably worse when calibrated
for multiple watersheds and not for each watershed separately (Mizukami et al.
2017). Continuing this line of research, I showed that combining meteorological
time series data with static watershed attributes from multiple (e.g. hundreds of)
different watersheds in a single LSTM results in the state of the art for rainfall–runoff
modeling.

There has been a long discussion in hydrology whether or not scale-relevant


watershed theories (Beven 1987) exist that are transferable across space, or if the
"uniqueness of place" (Beven 2000) is the more dominant factor (i.e. the hetero-
geneity of watersheds dominates and little consistency exists between watersheds).
Another argument that is raised in the context of poorly performing regional models
is the lack of sufficient observations to discover similarities between watersheds.
However, the results presented in this thesis suggest that a) there is sufficient
information available in the existing data and b) transferable relationships between
inputs and outputs exist. Otherwise, the LSTM would not be able to provide better
predictions in hold-out watersheds than hydrological models that were specifically
calibrated for these watersheds. In Nearing et al. (2021), we discussed these find-
ings in the context of hydrological modeling in more detail. One future avenue
of research could be to investigate what the LSTM learned and why it is much
better than existing hydrological models. I have already taken first steps in this direction,
showing that LSTMs use their internal memory cells to mimic known hydrological
storage processes, such as snow accumulation and snow melt as well as soil moisture
dynamics.

In the last part of my thesis, we started to investigate the benefit of adding
(physical) constraints to the LSTM. That is, we built an LSTM variant with mass-
conserving properties. We showed that this model, called the Mass-Conserving LSTM, has
high predictive capabilities across various tasks, ranging from arithmetic operations
to traffic forecasting, modeling a damped pendulum, and rainfall–runoff modeling.
The last application, rainfall–runoff modeling, fits especially well into the context of
this thesis. Here, we showed that the inductive bias of the MC-LSTM increases
the performance of the model in the tails of the discharge distribution, i.e. floods
and droughts, compared to the LSTM.

To summarize: over the course of this thesis, I established LSTMs (and their
variants) as the new state of the art for rainfall–runoff modeling. I was able to
show that the model learns internal representations that match hydrological domain
expertise. Furthermore, we presented a variant with mass-conserving properties
that shows great potential for applications that focus on floods and droughts. This
research has sparked a lot of interesting follow-up work from our group, as well as
from many other researchers around the world. I am curious to see what the next
years will bring, as this thesis is certainly only the beginning.



Bibliography

Abrahart, R. J., F. Anctil, P. Coulibaly, C. W. Dawson, N. J. Mount, L. M. See, A. Y. Shamseldin,


D. P. Solomatine, E. Toth, and R. L. Wilby (2012). „Two decades of anarchy? Emerging
themes and outstanding challenges for neural network river forecasting“. In: Prog. Phys.
Geog. 36, pp. 480–513.

Addor, N., A. J. Newman, N. Mizukami, and M. P. Clark (2017). „The CAMELS data set:
catchment attributes and meteorology for large-sample studies“. In: Hydrology and Earth
System Sciences 21.10, pp. 5293–5313. DOI: 10.5194/hess-21-5293-2017. URL: https:
//hess.copernicus.org/articles/21/5293/2017/.

Beven, K. J. (1987). „Towards a new paradigm in hydrology“. In: Water for the Future:
Hydrology in Perspective. IAHS Publication 164, pp. 393–403.

– (2000). „Uniqueness of place and process representations in hydrological modelling“. In:


Hydrology and Earth System Sciences 4.2, pp. 203–213. DOI: 10.5194/hess-4-203-2000.

Burnash, R. J. C., R. L. Ferral, and R. A. McGuire (1973). A generalised streamflow simulation


system conceptual modelling for digital computers. Tech. rep. Sacramento, CA, USA: US
Department of Commerce National Weather Service and State of California Department
of Water Resources.

Carriere, P., S. Mohaghegh, and R. Gaskar (1996). „Performance of a Virtual Runoff Hydro-
graphic System“. In: Water Resources Planning and Management 122, pp. 120–125.

Daniell, T. M. (1991). „Neural networks. Applications in hydrology and water resources


engineering“. In: Proceedings of the International Hydrology and Water Resource Symposium.
3. Institution of Engineers, Perth, Australia, pp. 797–802.

Goswami, M., K. O'Connor, and K. Bhattarai (2007). „Development of regionalisation proce-


dures using a multi-model approach for flow simulation in an ungauged catchment.“ In:
Journal of Hydrology 333.2-4, pp. 517–531.

Halff, A. H., H. M. Halff, and M. Azmoodeh (1993). „Predicting runoff from rainfall using
neural networks“. In: Engineering hydrology, ASCE, pp. 760–765.

Hochreiter, S. (1991). „Untersuchungen zu dynamischen neuronalen Netzen“. Diploma thesis.


Munich, Germany: Technische Universität München.

Hochreiter, S. and J. Schmidhuber (1997). „Long Short-Term Memory“. In: Neural Computation 9, pp. 1735–1780.

Hoedt, P.-J., F. Kratzert, D. Klotz, C. Halmmich, M. Holzleitner, G. Nearing, S. Hochreiter,
and G. Klambauer (2021). „MC-LSTM: Mass-Conserving LSTM“. In: Proceedings of the
38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual
Event.

Hrachowitz, M., H.H.G. Savenije, G. Blöschl, J.J. McDonnell, M. Sivapalan, J.W. Pomeroy,
B. Arheimer, T. Blume, M.P. Clark, U. Ehret, F. Fenicia, J.E. Freer, A. Gelfan, H.V. Gupta,
D.A. Hughes, R.W. Hut, A. Montanari, S. Pande, D. Tetzlaff, P.A. Troch, S. Uhlenbrook,
T. Wagener, H.C. Winsemius, R.A. Woods, E. Zehe, and C. Cudennec (2013). „A decade of
Predictions in Ungauged Basins (PUB)—a review“. In: Hydrological Sciences Journal 58.6,
pp. 1198–1255. DOI: 10.1080/02626667.2013.803183.

Hsu, K., H. V. Gupta, and S. Sorooshian (1997). „Application of a recurrent neural network
to rainfall-runoff modeling“. In: Proc., Aesthetics in the Constructed Environment. ASCE,
New York, pp. 68–73.

Kirchner, J. (2006). „Getting the right answers for the right reasons: Linking measurements,
analyses, and models to advance the science of hydrology“. In: Water Resources Research
42. DOI: 10.1029/2005WR005362.

Kratzert, F., M. Herrnegger, D. Klotz, S. Hochreiter, and G. Klambauer (2019a). „Neural-


Hydrology – Interpreting LSTMs in Hydrology“. In: Samek W., Montavon G., Vedaldi A.,
Hansen L., Müller KR. (eds) Explainable AI: Interpreting, Explaining and Visualizing Deep
Learning. Lecture Notes in Computer Science 11700.

Kratzert, F., D. Klotz, C. Brenner, K. Schulz, and M. Herrnegger (2018). „Rainfall–runoff


modelling using Long Short-Term Memory (LSTM) networks“. In: Hydrology and Earth
System Sciences 22.11, pp. 6005–6022.

Kratzert, F., D. Klotz, M. Herrnegger, A. Sampson, S. Hochreiter, and G. Nearing (2019b).


„Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine
Learning“. In: Water Resources Research 55.12, pp. 11344–11354.

Kratzert, F., D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, and G. Nearing (2019c).


„Towards learning universal, regional, and local hydrological behaviors via machine
learning applied to large-sample datasets“. In: Hydrology and Earth System Sciences 23.12,
pp. 5089–5110.

Milly, P. C. D., J. Betancourt, M. Falkenmark, R. M. Hirsch, Z. W. Kundzewicz, D. P. Letten-


maier, and R. J. Stouffer (2008). „Stationarity is dead: Whither water management?“ In:
Science 319, pp. 573–574.

Mizukami, N., M. P. Clark, A. J. Newman, A. W. Wood, E. D. Gutmann, B. Nijssen, O. Rakovec,


and L. Samaniego (2017). „Towards seamless large-domain parameter estimation for
hydrologic models“. In: Water Resour. Res. 53, pp. 8020–8040.

Nearing, G., F. Kratzert, A. Sampson, C. Pelissier, D. Klotz, J. Frame, C. Prieto, and H. Gupta
(2021). „What Role Does Hydrological Science Play in the Age of Machine Learning?“
In: Water Resources Research 57.3, e2020WR028091. DOI: 10.1029/2020WR028091.

Newman, A. J., M. P. Clark, K. Sampson, A. Wood, L. E. Hay, A. Bock, R. J. Viger, D.
Blodgett, L. Brekke, J. R. Arnold, T. Hopson, and Q. Duan (2015). „Development of a
large-sample watershed-scale hydrometeorological data set for the contiguous USA: data
set characteristics and assessment of regional variability in hydrologic model performance“.
In: Hydrology and Earth System Sciences 19.1, pp. 209–223. DOI: 10.5194/hess-19-209-
2015. URL: https://hess.copernicus.org/articles/19/209/2015/.

Sivapalan, M. (2003). „Prediction in ungauged basins: A grand challenge for theoretical


hydrology“. In: Hydrological Processes 17.15, pp. 3163–3170.

Sundararajan, M., A. Taly, and Q. Yan (2017). „Axiomatic attribution for deep networks“. In:
Proceedings of the 34th International Conference on Machine Learning 70, pp. 3319–3328.

Vaze, J., F. Chiew, D. Hughes, and V. Andréassian (2015). „Preface: Hs02–hydrologic non-
stationarity and extrapolating models to predict the future“. In: Proceedings of the Interna-
tional Association of Hydrological Sciences 371, pp. 1–2.

Appendices

Frederik Kratzert
MACHINE LEARNING RESEARCHER, PHD. CAND.
[email protected] | https://neuralhydrology.github.io | kratzert

Education
Johannes Kepler University Linz, Austria
PHD CAND. IN NATURAL SCIENCES (MACHINE LEARNING) May 2018 -
• Research around the use of deep learning models for (global) rainfall-runoff modeling, supervised by Sepp Hochreiter
• Funded by Google Faculty Research Award
University of Natural Resources and Life Sciences Vienna, Austria
DIPL. ING. (M.SC) IN CIVIL ENGINEERING AND WATER MANAGEMENT Feb. 2014 - Nov. 2016
• Master thesis: ”Entwicklung einer Software zur automatisierten Objekterkennung in videoüberwachten Fischaufstiegen”
• Graduated with distinction
Universidad Politécnica de Valencia Valencia, Spain
ERASMUS EXCHANGE Sep. 2011 - Aug. 2012

Technical University Berlin Berlin, Germany


B.SC. IN CIVIL ENGINEERING Oct. 2010 - Sep. 2013
• Bachelor thesis: ”Ermittlung von Struktur- und Materialparametern an isotropen flächigen Bauteilen unter Ausnutzung
der Ausbreitung elastischer Wellen”

Skills
Programming Python, MATLAB, LaTeX, (Basic knowledge: JavaScript, Lua, Fortran, C++)
Languages German, English, Spanish
IT tools Git, GIS, Linux/Unix, SQL

Work experience
Johannes Kepler University Linz, Austria
RESEARCH ASSISTANT, PHD CAND. May. 2018 -
• Research mainly focused around the Long Short-Term Memory Network (development, application and interpretability studies).
University of Natural Resources and Life Sciences Vienna, Austria
RESEARCH ASSISTANT Nov. 2016 - Apr. 2018
• Development and application of a software for automatic video monitoring of technical fish passes.
• Development of a QGIS Plugin version of the ”digHAO - digitaler Hydrologischer Atlas Österreichs”.
University of Natural Resources and Life Sciences Vienna, Austria
GRADUATE RESEARCH ASSISTANT Apr. 2014 - Nov. 2016
• Development and application of a software for automatic video monitoring of technical fish passes.
Technical University Berlin Berlin, Germany
UNDERGRADUATE RESEARCH ASSISTANT Oct. 2012 - Sep. 2013
• Part of the ”Bladetester” project for automatic fault detection on blades of wind turbines.

Teaching experience
LSTM and Recurrent Neural Nets Linz, Austria
JOHANNES KEPLER UNIVERSITY, WINTER, 2019
• Obligatory master class of the ”AI curriculum” at the Johannes Kepler University, Linz, Austria.
• Helped create the lecture notes and designed the programming assignments
Programming in Python I Linz, Austria
JOHANNES KEPLER UNIVERSITY, WINTER, 2019
• Obligatory bachelor class of the ”AI curriculum” at the Johannes Kepler University, Linz, Austria.

Honors & Awards


2019 Google Faculty Research Award, 3-year PhD funding
2017 Klaus Fischer Innovationspreis für Technik und Umwelt, for my master thesis
2016 OSPP, Outstanding Student Poster & PICO Award, EGU General Assembly 2016

Publications
PEER REVIEWED
Rainfall–runoff prediction at multiple timescales with a single Long Short-Term Memory network
M. Gauch, F. Kratzert, D. Klotz, G. Nearing, J. Lin, S. Hochreiter
Hydrology and Earth System Sciences 25.4 (2021) PP. 2045–2062. 2021

MC-LSTM: Mass-Conserving LSTM


P.-J. Hoedt, F. Kratzert, D. Klotz, C. Halmmich, M. Holzleitner, G. Nearing, S. Hochreiter, G. Klambauer
Proceedings of the 38th International Conference on Machine Learning, ICML 2021, 18-24 July 2021, Virtual Event (2021). 2021

Niederschlags-Abflussmodellierung mit Long Short-Term Memory (LSTM)


F. Kratzert, M. Gauch, S. Hochreiter, D. Klotz
Österr Wasser- und Abfallw (2021). 2021

Toward Improved Predictions in Ungauged Basins: Exploiting the Power of Machine Learning
F. Kratzert, D. Klotz, M. Herrnegger, A. Sampson, S. Hochreiter, G. Nearing
Water Resources Research 55.12 (2019) PP. 11344–11354. 2019

Towards learning universal, regional, and local hydrological behaviors via machine learning applied to large-sample datasets
F. Kratzert, D. Klotz, G. Shalev, G. Klambauer, S. Hochreiter, G. Nearing
Hydrology and Earth System Sciences 23.12 (2019) PP. 5089–5110. 2019

What Role Does Hydrological Science Play in the Age of Machine Learning?
G. Nearing, F. Kratzert, A. Sampson, C. Pelissier, D. Klotz, J. Frame, C. Prieto, H. Gupta
Water Resources Research (2019) E2020WR028091. 2019

Rainfall–runoff modelling using Long Short-Term Memory (LSTM) networks


F. Kratzert, D. Klotz, C. Brenner, K. Schulz, M. Herrnegger
Hydrology and Earth System Sciences 22.11 (2018) PP. 6005–6022. 2018

SUBMITTED
Uncertainty Estimation with Deep Learning for Rainfall–Runoff Modelling
D. Klotz, F. Kratzert, M. Gauch, A. Keefe Sampson, J. Brandstetter, G. Klambauer, S. Hochreiter, G. Nearing
Hydrology and Earth System Sciences Discussions 2021 (2021) PP. 1–32. 2021

Post processing the U.S. National Water Model with a Long Short-Term Memory network
J. Frame, G. Nearing, F. Kratzert, A. Raney, M. Rahman
JAWRA Journal of the American Water Resources Association (2020). 2020

BOOK CHAPTER
NeuralHydrology – Interpreting LSTMs in Hydrology
F. Kratzert, M. Herrnegger, D. Klotz, S. Hochreiter, G. Klambauer
Explainable AI: Interpreting, Explaining and Visualizing Deep Learning. Lecture Notes in Computer Science, vol 11700, 2019, Cham

Scientific Community Services


REVIEWER
• Reviewed manuscripts for Neural Information Processing Systems, Hydrology and Earth System Sciences, Water Resources Research, Journal
of Hydrology, River Research and Applications
SESSION CONVENER
• Session convener of the ”Deep Learning in Hydrology” session at the European Geosciences Union 2020 and 2021 General Assembly
• Session convener of the ”Machine Learning in Hydrologic Modeling” session at the American Geosciences Union 2020 Fall Meeting

