
RESEARCH ARTICLE

Deep convolutional recurrent autoencoders for learning low-dimensional feature dynamics of fluid systems

Francisco J. Gonzalez* | Maciej Balajewicz


arXiv:1808.01346v2 [math.DS] 22 Aug 2018

Department of Aerospace Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA

Correspondence
*Francisco J. Gonzalez, Department of Aerospace Engineering, University of Illinois at Urbana-Champaign, Urbana, Illinois, USA. Email: [email protected]

Summary
Model reduction of high-dimensional dynamical systems alleviates computational burdens faced in various tasks from design optimization to model predictive control. One popular model reduction approach is based on projecting the governing equations onto a subspace spanned by basis functions obtained from the compression of a dataset of solution snapshots. However, this method is intrusive since the projection requires access to the system operators. Further, some systems may require special treatment of nonlinearities to ensure computational efficiency or additional modeling to preserve stability. In this work we propose a deep learning-based strategy for nonlinear model reduction that is inspired by projection-based model reduction, where the idea is to identify some optimal low-dimensional representation and evolve it in time. Our approach constructs a modular model consisting of a deep convolutional autoencoder and a modified LSTM network. The deep convolutional autoencoder returns a low-dimensional representation in terms of coordinates on some expressive nonlinear data-supporting manifold. The dynamics on this manifold are then modeled by the modified LSTM network in a computationally efficient manner. An offline unsupervised training strategy that exploits the model modularity is also developed. We demonstrate our model on three illustrative examples, each highlighting the model's performance in prediction tasks for fluid systems with large parameter variations and its stability in long-term prediction.

KEYWORDS:
nonlinear model reduction, deep learning, convolutional neural networks, LSTM, dynamical systems

1 INTRODUCTION

Dynamical systems are used to describe the rich and complex evolution of many real-world processes. Modeling the dynamics of
physical, engineering, and biological systems is thus of great importance in their analysis, design, and control. Many fields, such
as the physical sciences, are in the fortunate position of having first-principles models that describe the evolution of certain systems with near-perfect accuracy (e.g., the Navier-Stokes equations in fluid mechanics, or the Schrödinger equation in quantum mechanics). Although it is in principle possible to numerically solve these equations through direct numerical simulations
(DNS), this often yields systems of equations with millions or billions of degrees of freedom. Even with recent advances in
computational power and memory capacity, solving these high-fidelity models (HFMs) is still computationally intractable for
multi-query and time-critical applications such as design optimization, uncertainty quantification, and model predictive control.
Model reduction aims to alleviate this burden by constructing reduced order models (ROMs) that capture the large-scale system
behavior while retaining physical fidelity.
Some fields, however, such as finance and neuroscience, lack governing laws thereby restricting the applicability of principled
strategies for constructing low-order models. In recent years, the rise of machine learning and big data has driven a shift in the way complex spatiotemporal systems are modeled 1,2,3,4,5 . The abundance of data has facilitated the construction of so-called data-driven models of systems lacking high-fidelity governing laws. In areas where HFMs do exist, data-driven methods
have become an increasingly popular approach to tackle previously challenging problems wherein solutions are learned from
physical or numerical data 6,7,8 .
In model reduction, machine learning strategies have recently been applied to many remaining challenges, including learning
stabilizing closure terms in unstable POD-Galerkin models 8,9 , and data-driven model identification for truncated generalized
POD coordinates 10,11,12 . A more recent approach involved learning a set of observable functions spanning a Koopman invariant
subspace from which low-order linear dynamics of nonlinear systems are modeled 13 . These approaches constitute just a small
portion of the outstanding challenges in which machine learning can aid in modeling low-dimensional dynamics of complex
systems.
In this work we make progress toward this end by proposing a method that uses a completely data-driven approach to identify and
evolve a low-dimensional representation of a spatiotemporal system. In particular, we employ a deep convolutional autoencoder
to learn an optimal low-dimensional representation of the full state of the system in the form of a feature vector, or coordinates
of some low-dimensional nonlinear manifold. The dynamics on this manifold are then learned using a recurrent neural network
trained jointly with the autoencoder in an end-to-end fashion using a set of finite-time trajectories of the system.

1.1 Reduced order and surrogate modeling


Model order reduction is part of a broader family of surrogate modeling strategies that attempt to reduce the computational
burden of solving HFMs by instead solving approximate, low-complexity models. Surrogate models can be broadly classified
into three groups: 1) data-fit models, 2) hierarchical models, and 3) projection-based model reduction 14,15 . Data-fit models use
simulation or experimental data to fit an input-output map as a function of system parameters. Some examples include models
based on Gaussian processes 1,16 , and feed-forward neural networks 8 . Hierarchical or low-fidelity models substitute the HFM
with a lower-fidelity physics-based model that makes simplifying physics assumptions (e.g., ignoring viscous effects in a fluid flow), uses coarser computational grids, or relaxes solver tolerances.
In contrast to the first two surrogate modeling approaches, projection-based model reduction works by directly exploiting the
low-dimensional behavior inherent in many high-dimensional dynamical systems. These methods approximate the state of the
system by an affine trial subspace, and project the HFM onto a test subspace resulting in a square system of dimension much
smaller than the original high dimension. Over the years, a large variety of empirically-based approaches for generating the trial
and test subspaces have been developed, including proper orthogonal decomposition (POD) 17,18 , Krylov subspace methods 19 ,
and dynamic mode decomposition 20 . Despite the successes of projection-based model reduction, there exist a number of issues
limiting the applicability of these methods.
One issue is that although the projection step effectively constrains the HFM to a lower dimensional subspace, this does not
necessarily provide computational efficiency for general nonlinear models. Systems with generic, nonpolynomial nonlinearities
or time-varying parameters require an additional layer of approximation, or hyper-reduction, to gain a computational speed-up.†
Some approaches for the treatment of nonlinearities include ROMs based on discrete empirical interpolation (DEIM) 21 , or
Gauss-Newton with approximated tensors (GNAT) 22 . Other methods employ a patchwork of local state-space approximations at multiple locations, including trajectory piecewise linearization (TPWL) 23 and ROMs based on trajectory piecewise quadratic
(TPWQ) approximations 24 . A second well known issue, particularly when dealing with high-Reynolds number fluid flows,
is that of stability. POD-based ROMs are biased towards large energy-producing scales and are not endowed with the small energy-dissipating scales that may be dynamically significant 25 . Moreover, projection-based model reduction has the major
disadvantage of being intrusive, requiring access to the system operators during the projection step. Thus, while optimal, say
POD-based, approximations can be made of any dataset, the projection step is still limited to systems with existing governing
laws.

† For linear time-invariant systems, or systems with polynomial nonlinearities, all projection coefficients can be precomputed offline.

1.2 Contributions and outline


In this work we develop a deep learning-based nonlinear model reduction strategy which is completely data-driven. This method
employs a deep convolutional autoencoder to learn an optimal low-dimensional representation of each solution snapshot and later
evolves this representation in time using a type of recurrent neural network (RNN) called a long short-term memory (LSTM)
network. This work has important similarities to previous work using RNNs to evolve reduced order models 10,12 and work that
employs autoencoders for dimensionality reduction 26,27,28,13 .
Although previous work regarding neural-network-based reduced order models has shown great promise, a number of significant issues remain. Notably, while deep fully-connected autoencoders, such as the ones employed in 13,27,28 , work well for small systems with a few thousand degrees of freedom, this approach alone is not scalable as input data increases to DNS-level sizes (e.g., 10^6 − 10^9 degrees of freedom (dof)). For example, an autoencoder with just a single layer reducing input data from 10^6 dof to 100 would require training well over 10^8 parameters, a feat that quickly becomes computationally intractable as the autoencoder increases in depth.
To avoid this curse of dimensionality, we instead propose a convolutional recurrent autoencoding model that differs
significantly from existing autoencoder-based model reduction approaches in two main ways:
(i) We propose an autoencoding method that exploits local, location-invariant correlations present in physical data through
the use of convolutional neural networks. That is, rather than applying a fully-connected autoencoder to the high-
dimensional input data we instead apply it to a vectorized feature map produced by a convolutional encoder, and
similarly the reverse is done for reconstruction. The result is the identification of an expressive low-dimensional man-
ifold obtained at a much lower cost while offering specific advantages over both traditional POD-based ROMs and
fully-connected autoencoders.
(ii) We propose a modified LSTM network to model the evolution of low-dimensional data representations on this manifold
that avoids costly state reconstructions at every step. In doing this, we ensure that the evaluation of new steps scales
only with the size of the low-dimensional representation and not with the size of the full dimensional data, which may
be large for some problems.
Taken together, this end-to-end approach both identifies an optimal low-dimensional representation of a high-dimensional spatiotemporal dataset and models its dynamics on the underlying data-supporting manifold. Additionally, a two-step unsupervised training strategy is developed that exploits the modularity of the convolutional recurrent autoencoder model.
The paper is organized as follows. Section 2 formulates the problem of interest and outlines the constraints under which our model is applied. Section 3 briefly reviews the core concepts of deep learning used in this work, including recurrent and convolutional networks. A brief review of projection-based model reduction is also given in this section. Finally, we review the
important connection between autoencoders and POD. The key contributions of this work are presented in section 4. Namely,
the construction of the convolutional autoencoder for nonlinear dimensionality reduction and the construction of our modified
LSTM network for modeling of feature dynamics. In this section, we also discuss the construction of the training datasets and
develop our training strategy. Section 5 demonstrates the use of our method on three illustrative examples. The first example
considers a simple one-dimensional model reduction problem based on the viscous Burgers equation. This serves to highlight the
expressive power of nonlinear autoencoders when compared to POD-based methods. Second, we consider a parametric model
reduction problem based on an incompressible flow inside a periodic domain and evaluate our model’s predictive performance
with large parameter variations. This example has the merit of showcasing the benefits of the location-invariant properties of
the convolutional autoencoder as compared to POD-based models. The last example highlights the stability characteristics of
the convolutional recurrent autoencoder through a model reduction problem based on a chaotic incompressible flow inside a
lid-driven cavity. Finally, section 6 presents a summary and discussion of our work.

2 PROBLEM FORMULATION

2.1 Nonlinear computational physics problem


Consider a high-dimensional ODE resulting from the semi-discretization of a time-dependent PDE
$$\dot{\mathbf{x}}(t) = F(\mathbf{x}(t), t; \boldsymbol{\mu}), \qquad \mathbf{x}(t_0) = \mathbf{x}_0(\boldsymbol{\mu}), \tag{1}$$
where 𝑡 ∈ [𝑡0 , 𝑇 ] ⊂ ℝ+ denotes time, 𝐱 ∈ ℝ𝑁 is the spatially discretized state variable where 𝑁 is large, and 𝝁 ∈  ⊆ ℝ𝑑 is
the vector of parameters sampled from the feasible parameter set . Here, 𝐹 ∶ ℝ𝑁 × ℝ+ × ℝ𝑑 → ℝ𝑁 is a nonlinear function
representing the dynamics of the discretized system. Such large nonlinear systems are typical in the computational sciences such
as when numerically solving the Navier-Stokes equations describing a fluid flow. In the parameter-varying case 𝝁 may represent
initial and boundary conditions, material properties, or shape parameters of interest.
Often, in engineering design and analysis the interest is on the evolution of certain outputs
$$\mathbf{y} = G(\mathbf{x}(t), \boldsymbol{\mu}), \tag{2}$$
where $\mathbf{y} \in \mathbb{R}^{p}$ may represent, e.g., lift, drag, or some other performance criteria. In this work, the attention is focused only on
the evolution of the full state 𝐱.

2.2 Completely data-driven model reduction


When the number of degrees of freedom 𝑁 is large, evaluating Equation 1 for a given initial condition and input parameter
𝝁 becomes computationally challenging in two particular applications. The first are time-critical applications, or applications
where a solution needs to be attained within a given threshold of time. Some examples include routine analysis applications
and model predictive control of distributed parameter systems where near-real time solutions are crucial. The second are multi-
query applications, i.e., applications where one needs to sample a large number of parameters from . Examples of multi-query
applications include shape optimization and uncertainty quantification.
To alleviate this computational burden an offline-online strategy is usually employed in which a dataset of solution snapshots
$\mathcal{X} = \{\mathbf{x}(t_i ; \boldsymbol{\mu}_i)\}_{i=1}^{N_{data}}$ of Equation 1 is used to construct a surrogate model that is capable of approximating new solutions at
a fraction of the cost. A wide variety of strategies exist for constructing these so-called data-driven models including data-fit
methods which use numerical or experimental data to fit an input-output map, and projection-based reduced order models which
approximately solve Equation 1 in a reduced subspace constructed from numerical or experimental data.
While projection-based reduced order models are physics-based, and thus offer an advantage over data-fit methods when it
comes to physical interpretation, they are often intrusive. That is, one requires access to the system operators when performing the
projection step. In this work, we will restrict our attention to non-intrusive, purely data-driven reduced order modeling. Thus,
the construction of the surrogate model will require only the dataset  and no information about Equation 1. Indeed there are
many situations, e.g. in neuroscience and finance, in which data are abundant but governing laws are uncertain or do not exist
altogether. For the purposes of this work, we will work under the assumption that we do not have access to Equation 1 from
which the datasets are generated.

2.3 Single vs. multiple parameter-varying trajectories


The construction and availability of solution snapshot datasets is inherently problem dependent. Here, we focus on the two
common cases encountered in model reduction:

(i) The dataset  = {𝐱(𝑡1 ; 𝝁), 𝐱(𝑡2 ; 𝝁), ...} is constructed using snapshots from a single, statistically stationary trajectory
of Equation 1. In this case, 𝝁 is the same for all snapshots. This is relevant to situations in which obtaining snapshot
data is exceedingly expensive such as in large direct numerical simulations and the interest is on obtaining “quick”
approximate solutions.

(ii) The dataset  = {𝑋 𝝁1 , 𝑋 𝝁2 , ...} is constructed using multiple, parameter varying trajectories 𝑋 𝝁𝑖 =
{𝐱(𝑡1 ; 𝝁𝑖 ), 𝐱(𝑡2 ; 𝝁𝑖 ), ...}. This case is relevant to multi-query applications or applications in which the interest is on
capturing the parameter-dependent transient behavior of Equation 1.

In both cases the surrogate model is constructed in a non-intrusive fashion using the same procedure and only the dataset is
changed.

3 BACKGROUND

In this section, we introduce the basic notions of deep learning and two key architectures used in this work: 1) recurrent neural
networks, and 2) convolutional neural networks. We finish by summarizing the connections between POD and fully-
connected autoencoders.

3.1 Deep learning


Deep learning has enjoyed great success in recent years in areas from image and speech recognition 29,30,31 to genomics 32,33 . At
the core of deep learning are deep neural networks, whose layered structure allows them to learn at each layer a representation of
the raw input with increasing levels of abstraction 34,35 . With enough layers, deep neural networks can learn intricate structures
in high-dimensional data. For example, given an image as an array of pixel values, the first layer of a deep neural network might
learn to identify edges in various orientations. The second layer then is able to detect particular arrangements of edges, and so
on until a complex hierarchy of features leads to the detection of a face or a road sign. Here, we briefly review some concepts
and common network architectures used in this work.
Neural networks are models of computation loosely inspired by biological neurons. Generally, given a vector of real-valued
inputs $\mathbf{x} \in \mathbb{R}^{N}$, a single-layer artificial neural network is an affine transformation of the input $\mathbf{x}$ fed through a nonlinear function
$$\hat{\mathbf{y}} = f(\mathbf{W}\mathbf{x} + \mathbf{b}), \tag{3}$$
where $\mathbf{W} \in \mathbb{R}^{M \times N}$ is the weight matrix, $\mathbf{b} \in \mathbb{R}^{M}$ is a bias term, and $f(\cdot)$ is a nonlinear function that acts element-wise on its
inputs.
To create multilayered neural networks, the output $\mathbf{h}_l$ of a layer $l$ is fed as the input of the following layer, thus
$$\mathbf{h}_{l+1} = f_{l+1}(\mathbf{W}_l \mathbf{h}_l + \mathbf{b}_l) = f_{l+1}\!\left(\mathbf{W}_l f_l(\cdots(f_1(\mathbf{W}_0 \mathbf{x} + \mathbf{b}_0))\cdots + \mathbf{b}_{l-1}) + \mathbf{b}_l\right), \tag{4}$$
where 𝐡1 = 𝑓1 (𝐖0 𝐱 + 𝐛0 ) is the output of the first layer. The vector 𝐡𝑙 is often referred to as the hidden state or feature
vector at the $l$-th layer. This process continues for $L$ layers, where at the final layer the output of the network is given by $\hat{\mathbf{y}} = f_L(\mathbf{W}_{L-1}\mathbf{h}_{L-1} + \mathbf{b}_{L-1})$. In supervised learning, training the network then involves finding the parameters $\boldsymbol{\theta} = \{\mathbf{W}_l, \mathbf{b}_l\}_{l=0}^{L-1}$
such that the expected loss between the output 𝐲̂ and the target value 𝐲 is minimized
$$\boldsymbol{\theta}^{*} = \arg\min_{\boldsymbol{\theta}} \; \mathbb{E}_{(\mathbf{x},\mathbf{y})\sim \mathcal{P}_{data}}\left[ \mathcal{L}(f(\mathbf{x}; \boldsymbol{\theta}), \mathbf{y}) \right], \tag{5}$$
where $\mathcal{P}_{data}$ is the data-generating distribution and $\mathcal{L}(\hat{\mathbf{y}}, \mathbf{y})$ is some measure of discrepancy between the predicted and target
outputs. What distinguishes machine learning from straightforward optimization is that the model 𝑓 parameterized by 𝜽∗ should be expected to generalize well for all examples drawn from $\mathcal{P}_{data}$, even if they were not witnessed during training 34 . Most neural
networks are trained using stochastic gradient descent (SGD), or one of its many variants 36,37 , in which gradients are computed
using the backpropagation procedure 38 .
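
As a concrete illustration of Equations 3-5, the following minimal NumPy sketch (not the authors' implementation; the layer sizes and the squared-error loss are arbitrary choices for the example) evaluates the forward pass of a small fully-connected network and one possible discrepancy measure:

```python
import numpy as np

def sigmoid(s):
    # Element-wise nonlinearity f(.)
    return 1.0 / (1.0 + np.exp(-s))

def forward(x, params):
    """Forward pass of Eq. (4): h_{l+1} = f(W_l h_l + b_l)."""
    h = x
    for W, b in params:
        h = sigmoid(W @ h + b)
    return h

rng = np.random.default_rng(0)
layer_sizes = [8, 16, 4]          # example sizes: input 8, hidden 16, output 4
params = [(rng.standard_normal((m, n)) * 0.1, np.zeros(m))
          for n, m in zip(layer_sizes[:-1], layer_sizes[1:])]

x = rng.standard_normal(8)
y = rng.standard_normal(4)        # target output
y_hat = forward(x, params)
loss = np.mean((y_hat - y) ** 2)  # one choice of discrepancy L(y_hat, y) in Eq. (5)
print(loss)
```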

3.2 Recurrent neural networks


A natural extension of feed-forward networks to sequential data are networks with self-referential, or recurrent, connections.
These recurrent neural networks (RNNs) process a sequence of inputs one element at a time, maintaining in the hidden state
an implicit history of previous inputs. Consider a sequence of inputs {𝐱0 , 𝐱1 , ..., 𝐱𝑚 }, with each 𝐱𝑛 ∈ ℝ𝑁 , the 𝑛-th hidden state
𝐡𝑛 ∈ ℝ𝑁ℎ of a simple RNN is evaluated by the following update
𝐡𝑛 = 𝑓 (𝐖𝐡𝑛−1 + 𝐔𝐱𝑛 + 𝐛), (6)
where 𝐖 ∈ ℝ𝑁ℎ ×𝑁ℎ and 𝐔 ∈ ℝ𝑁ℎ ×𝑁 are the hidden and input weight matrices respectively, and 𝐛 ∈ ℝ𝑁ℎ is a bias term. RNNs
are also typically trained using SGD, or some variant, but the gradients are calculated using the backpropagation through time
(BPTT) algorithm 39 . In BPTT, the RNN is first “unrolled" in time, stacking one copy of the RNN per time step. This results in
a weight-tied deep feed forward neural network on which the standard backpropagation algorithm can be employed.
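
A minimal NumPy sketch of the recurrence in Equation 6, applied to a short input sequence with placeholder weights (training is omitted), reads:

```python
import numpy as np

def rnn_step(h_prev, x_n, W, U, b):
    # Simple RNN update of Eq. (6): h_n = f(W h_{n-1} + U x_n + b), with f = tanh here
    return np.tanh(W @ h_prev + U @ x_n + b)

rng = np.random.default_rng(0)
N, N_h, m = 4, 6, 10                       # input size, hidden size, sequence length (assumed)
W = rng.standard_normal((N_h, N_h)) * 0.1  # hidden-to-hidden weights
U = rng.standard_normal((N_h, N)) * 0.1    # input-to-hidden weights
b = np.zeros(N_h)

xs = rng.standard_normal((m, N))           # a sequence of inputs {x_1, ..., x_m}
h = np.zeros(N_h)                          # initial hidden state
hidden_states = []
for x_n in xs:                             # "unrolled" recurrence: one copy per time step
    h = rnn_step(h, x_n, W, U, b)
    hidden_states.append(h)
```
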
Training RNNs has long been considered to be challenging 40 . The main difficulty is due to the exponential growth or decay
of gradients as they are backpropagated through each time step, so over many time steps they will either vanish or explode.
This is especially problematic when learning sequences with long-term dependencies. The vanishing or exploding gradient
problem is typically addressed by using gated RNNs, including long short-term memory (LSTM) networks 41 and networks
based on the gated recurrent unit (GRU) 42 . These networks have additional paths through which gradients neither vanish nor
explode, allowing gradients of the loss function to backpropagate across multiple time-steps and thereby making the appropriate
parameter updates. This work will only consider RNNs equipped with LSTM units.

3.3 Convolutional neural networks


The final standard neural network architecture considered in this work is the convolutional neural network. These networks were first introduced as an alternative to fully-connected networks for data structured as multiple arrays (e.g., 1D signals and sequences,
2D images or spectrograms, and 3D video). The two key properties of convolutional neural networks are: 1) local connections,
and 2) shared weights 34,35 . In arrayed data, often local groups of values are highly correlated, assembling into distinct features
that can be easily detected using a local approach. Additionally, weight sharing across the input domain works to detect location-
invariant features.

FIGURE 1 Sliding convolutional filter with varying stride values (stride 1, 2, and 3).

FIGURE 2 Convolutional filters with varying dilation rates (dilation rate 1, 2, and 3).

In convolutional neural networks, layers are organized into feature maps, where each unit in a feature map is connected to a
local domain of the previous layer through a filter bank. Consider a 2D input $\mathbf{X} \in \mathbb{R}^{N_x \times N_y}$; a convolutional layer consists of a set of $F$ filters $\mathbf{K}^f \in \mathbb{R}^{a \times b}$, $f = 1, \ldots, F$, each of which generates a feature map $\mathbf{Y}^f \in \mathbb{R}^{N'_x \times N'_y}$ by a 2D discrete convolution
$$\mathbf{Y}^{f}_{i,j} = \sum_{k=0}^{a-1} \sum_{l=0}^{b-1} \mathbf{K}^{f}_{a-k,\,b-l}\, \mathbf{X}_{1+s(i-1)-k,\; 1+s(j-1)-l}, \tag{7}$$
where $N'_x = 1 + \frac{N_x + a - 2}{s}$, $N'_y = 1 + \frac{N_y + b - 2}{s}$, and $s \geq 1$ is an integer value called the stride. Figure 1 shows the effect of different
stride values of a filter acting on an input feature map. As before, the feature map can be passed through an element-wise
nonlinear function. Typically, the dimension of the feature map is reduced by using a pooling layer, in which a single value is
computed from a small 𝑎′ × 𝑏′ patch of the feature map, either by taking the maximum value or by averaging. A slightly more general
approach is to employ a convolutional layer with a stride of 𝑠 > 1, in which instead of taking the maximum or average value,
some weighted sum of the local patch of the input feature map is learned by adjusting the respective filter 𝐊𝑓 . In addition, dilated
convolutional filters (see Figure 2) are often employed to significantly increase the receptive field without loss of resolution,
effectively capturing larger features in highly dense data 43,44 .
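
For illustration, a direct loop-based NumPy transcription of Equation 7 for a single filter is given below; it is intentionally unoptimized and assumes implicit zero padding for indices that fall outside the input.

```python
import numpy as np

def conv2d_strided(X, K, s=1):
    """Discrete 2D convolution of Eq. (7) with stride s and implicit zero padding.

    X : (Nx, Ny) input array, K : (a, b) filter. Out-of-range indices contribute zero.
    """
    Nx, Ny = X.shape
    a, b = K.shape
    Nx_out = 1 + (Nx + a - 2) // s     # output sizes as given below Eq. (7), integer division
    Ny_out = 1 + (Ny + b - 2) // s
    Y = np.zeros((Nx_out, Ny_out))
    for i in range(Nx_out):
        for j in range(Ny_out):
            acc = 0.0
            for k in range(a):
                for l in range(b):
                    # 0-based version of X_{1+s(i-1)-k, 1+s(j-1)-l} with flipped kernel K_{a-k, b-l}
                    p, q = s * i - k, s * j - l
                    if 0 <= p < Nx and 0 <= q < Ny:
                        acc += K[a - 1 - k, b - 1 - l] * X[p, q]
            Y[i, j] = acc
    return Y

# Example: effect of the stride on the output feature-map size (cf. Figure 1)
X = np.random.default_rng(0).standard_normal((8, 8))
K = np.ones((3, 3)) / 9.0
for s in (1, 2, 3):
    print(s, conv2d_strided(X, K, s).shape)
```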

3.4 Projection-based model reduction


In projection-based MOR, the state vector $\mathbf{x} \in \mathbb{R}^{N}$ is approximated by a global affine trial subspace $\mathbf{x}_0 + \mathcal{S} \subset \mathbb{R}^{N}$ of dimension $N_h \ll N$,
$$\mathbf{x} \approx \tilde{\mathbf{x}} = \mathbf{x}_0 + \boldsymbol{\Psi}_{N_h} \mathbf{h}, \tag{8}$$
where the columns of $\boldsymbol{\Psi}_{N_h} \in \mathbb{R}^{N \times N_h}$ contain the basis for the subspace $\mathcal{S}$, the initial condition is given by $\mathbf{x}_0$, and $\mathbf{h} \in \mathbb{R}^{N_h}$ represents the generalized coordinates in this subspace. Substituting Equation 8 into Equation 1 yields
$$\boldsymbol{\Psi}_{N_h} \frac{d\mathbf{h}}{dt} = F(\mathbf{x}_0 + \boldsymbol{\Psi}_{N_h} \mathbf{h}(t; \boldsymbol{\mu})), \tag{9}$$
which is an overdetermined system with 𝑁 equations and 𝑁ℎ unknowns. Additional constraints are imposed by enforcing the
orthogonality of the residual of Equation 9 on a test subspace represented by 𝚽 ∈ ℝ𝑁×𝑁ℎ through a Petrov-Galerkin projection
𝚽𝑇 𝑅(𝐱0 + 𝚿𝑁ℎ 𝐡(𝑡; 𝝁)) = 0, (10)
resulting in a square system with 𝑁ℎ equations and 𝑁ℎ unknowns, where 𝑅(⋅) represents the residual of Equation 9. In a
Galerkin projection, 𝚽 = 𝚿𝑁ℎ . An important task is now choosing the subspace $\mathcal{S}$ on which to project the governing Equation 1. One popular method is to obtain the basis of $\mathcal{S}$ through proper orthogonal decomposition (POD).‡
Beginning with a set of 𝑚 observations {𝐱𝑛 }𝑚𝑛=1 , 𝐱𝑛 ∈ ℝ𝑁 , formed into a data matrix 𝐗 = [𝐱1 , 𝐱2 , ..., 𝐱𝑚 ] ∈ ℝ𝑁×𝑚 , POD
consists of performing the singular value decomposition (SVD) of this data matrix
$$\mathbf{X} = \boldsymbol{\Psi}\boldsymbol{\Sigma}\mathbf{V}^{T}, \tag{11}$$
where $\boldsymbol{\Psi} \in \mathbb{R}^{N \times r}$ and $\mathbf{V} \in \mathbb{R}^{m \times r}$ are orthonormal, i.e., $\boldsymbol{\Psi}^{T}\boldsymbol{\Psi} = \mathbf{V}^{T}\mathbf{V} = \mathbf{I}_{r \times r}$, and $\boldsymbol{\Sigma} \in \mathbb{R}^{r \times r}$ is a diagonal matrix whose entries $\sigma_i \geq 0$, ordered as $\sigma_1 \geq \sigma_2 \geq \ldots \geq \sigma_r$, are the singular values. The columns $\boldsymbol{\psi}_i$ of $\boldsymbol{\Psi}$ are sometimes called the principal components, features, or POD modes. These modes have the property that the linear subspace $\mathcal{S}$ spanned by $\boldsymbol{\Psi}_{N_h} = [\boldsymbol{\psi}_1, \ldots, \boldsymbol{\psi}_{N_h}]$, $N_h < r$, optimally represents the data in the $L_2$ sense
$$\min_{\boldsymbol{\Psi}_{N_h}} \|\mathbf{X} - \boldsymbol{\Psi}_{N_h}\boldsymbol{\Psi}_{N_h}^{T}\mathbf{X}\|_2^2 = \min_{\boldsymbol{\Psi}_{N_h}} \sum_{i=1}^{m} \|\mathbf{x}_i - \boldsymbol{\Psi}_{N_h}\boldsymbol{\Psi}_{N_h}^{T}\mathbf{x}_i\|_2^2. \tag{12}$$
The net result is an optimal low-dimensional representation $\mathbf{h} = \boldsymbol{\Psi}_{N_h}^{T}\mathbf{x}$ of an input $\mathbf{x}$, where again $\mathbf{h}$ can be thought of as the intrinsic coordinates on the linear subspace $\mathcal{S}$.
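
In practice, Equations 11-12 amount to a truncated SVD of the snapshot matrix; a minimal NumPy sketch on a synthetic low-rank data matrix is:

```python
import numpy as np

rng = np.random.default_rng(0)
N, m, N_h = 1000, 50, 10                  # state size, snapshots, kept modes (example values)
X = rng.standard_normal((N, 20)) @ rng.standard_normal((20, m))  # synthetic snapshot matrix

# POD / SVD of the data matrix, Eq. (11)
Psi, sigma, Vt = np.linalg.svd(X, full_matrices=False)

# Rank-N_h basis and intrinsic coordinates h = Psi_{N_h}^T x
Psi_Nh = Psi[:, :N_h]
H = Psi_Nh.T @ X                          # low-dimensional representations of all snapshots

# L2-optimal rank-N_h reconstruction error, Eq. (12)
X_rec = Psi_Nh @ H
print(np.linalg.norm(X - X_rec) ** 2)
```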

3.5 Connection between autoencoders and POD


In data-driven sciences, dimensionality reduction attempts to approximately describe high-dimensional data in terms of a low-
dimensional representation. Central to this is the manifold hypothesis, which presumes that real-world high-dimensional data
lies near a low-dimensional manifold  embedded in ℝ𝑁 , where 𝑁 is large 45 . As a result POD has found broad applications
from pre-training machine learning models to dimensionality reduction of physical systems. However it has the major drawback
of constructing only an optimal linear manifold. This is quite significant since data sampled from complex, real-world systems
is more often than not strongly nonlinear. A wide variety of strategies for more accurate modeling of  have been developed

‡ This method is known under different names in various fields: POD, principal component analysis (PCA), Karhunen-Loève decomposition, empirical orthogonal functions, and many others. In this work we will adopt the name POD.

over the years, most involving a patchwork of local subspaces $\{\mathcal{S}_l\}_{l=1}^{L}$ obtained through linearizations or higher-order
approximations of the state-space 23,45,24 .
A nonlinear generalization of POD is the under-complete autoencoder 26,34 . An under-complete autoencoder consists of a
single or multiple-layer encoder network
𝐡 = 𝑓𝐸 (𝐱; 𝜽𝐸 ), (13)
where 𝐱 ∈ ℝ𝑁 is the input state, 𝐡 ∈ ℝ𝑁ℎ is the feature or representation vector, and 𝑁ℎ < 𝑁. A decoder network is then used
to reconstruct 𝐱 by
𝐱̂ = 𝑓𝐷 (𝐡; 𝜽𝐷 ). (14)
Training this autoencoder then consists of finding the parameters that minimize the expected reconstruction error over all training
examples
$$\boldsymbol{\theta}_E^{*}, \boldsymbol{\theta}_D^{*} = \arg\min_{\boldsymbol{\theta}_E, \boldsymbol{\theta}_D} \; \mathbb{E}_{\mathbf{x}\sim\mathcal{P}_{data}}\left[ \mathcal{L}(\hat{\mathbf{x}}, \mathbf{x}) \right], \tag{15}$$
where $\mathcal{L}(\hat{\mathbf{x}}, \mathbf{x})$ is some measure of discrepancy between $\mathbf{x}$ and its reconstruction $\hat{\mathbf{x}}$. Restricting $N_h < N$ serves as a form of regularization, preventing the autoencoder from learning the identity function. Rather, it captures the salient features of the data-generating distribution $\mathcal{P}_{data}$. Under-complete autoencoders are just one of a family of regularized autoencoders, which also include contractive autoencoders, denoising autoencoders, and sparse autoencoders 26,34 .
Remark 1. The choice of $f_E$, $f_D$, and $\mathcal{L}(\hat{\mathbf{x}}, \mathbf{x})$ largely depends on the application. Indeed, if one chooses a linear encoder and a linear decoder of the form
$$\mathbf{h} = \mathbf{W}_E \mathbf{x}, \tag{16}$$
$$\hat{\mathbf{x}} = \mathbf{W}_D \mathbf{h}, \tag{17}$$
where $\mathbf{W}_E \in \mathbb{R}^{N_h \times N}$ and $\mathbf{W}_D \in \mathbb{R}^{N \times N_h}$, then with a squared reconstruction error
$$\mathcal{L}(\hat{\mathbf{x}}, \mathbf{x}) = \|\mathbf{x} - \hat{\mathbf{x}}\|_2^2 = \|\mathbf{x} - \mathbf{W}\mathbf{W}^{T}\mathbf{x}\|_2^2, \tag{18}$$
the autoencoder will learn the same subspace as the one spanned by the first 𝑁ℎ POD modes if 𝐖 = 𝐖𝐷 = 𝐖𝑇𝐸 . However,
without additional constraints on 𝐖, i.e., 𝐖𝑇 𝐖 = 𝐈𝑁ℎ ×𝑁ℎ , the columns of 𝐖 will not form an orthonormal basis or have any
hierarchical ordering 45,46 .

4 CONVOLUTIONAL RECURRENT AUTOENCODERS FOR MODEL REDUCTION

4.1 Previous work and objectives


In the past few years machine learning has become an increasingly attractive tool in modeling or augmenting low-dimensional
models of complex systems. Broadly, machine learning has been used in three ways in this respect: 1) as input-output maps to
model closure terms in unstable POD-Galerkin models, 2) as a means to model the evolution of the intrinsic coordinates from
an optimal subspace approximation of the state, and more recently 3) as an approach to construct end-to-end models that both
find optimal representations of the system variables and linearly evolve these representations. Our work was motivated by, and thus has important similarities to, previous work in both the second and third approaches.
The main idea behind modeling the evolution of the optimal subspace approximations of the state variable directly addresses
one of the main challenges of projection-based model reduction. Namely, for systems where governing laws do not exist, a
simple yet powerful approach is to model the evolution of the intrinsic coordinates, obtained for example through POD, using a recurrent neural network
𝐡𝑛+1 = 𝑓𝑅𝑁𝑁 (𝐡𝑛 ), (19)
where the representation vector 𝐡 ∈ ℝ𝑁ℎ , is of much lower dimension than the data from which it is approximated. This
strategy has previously been explored in the context of model reduction where 𝐡 is obtained through POD 12,10 , and in the more
general case where Equation 19 may model the dynamic behavior of complex 47 or chaotic systems 48 . This opens up a family
of strategies for modeling the dynamics of not just systems without HFM, but systems with heterogeneous data sources, and
systems with a priori unknown optimal subspace approximations – a feature which we make use of in this work.

A more completely data-driven approach, and one that is more closely related to our work, is to both learn a low-dimensional
representation of the state variable and to learn the evolution of this representation. This approach has been explored in 13 in
which an autoencoder is used to learn a low-dimensional representation of the high-dimensional state,
$$\mathbf{h} = f_E(\mathbf{x}), \tag{20}$$
where $\mathbf{x} \in \mathbb{R}^{N}$ is the high-dimensional state of the system, $\mathbf{h} \in \mathbb{R}^{N_h}$, $N_h < N$, and a linear recurrent model is used to evolve the
low-dimensional features
𝐡𝑛+1 = 𝐊𝐡𝑛 , (21)
where 𝐊 ∈ ℝ𝑁ℎ ×𝑁ℎ . This approach was first introduced in the context of learning a dictionary of functions used in extended
dynamic mode decomposition to approximate the Koopman operator of a nonlinear system 49 .
The central theme in these approaches and projection-based model reduction in general is the following two-step process:

1. The identification of a low-dimensional manifold  embedded in ℝ𝑁 on which most of the data is supported. This
yields, in some sense, an optimal low-dimensional representations 𝐡 = 𝑓 (𝐱) of the data 𝐱 in terms of intrinsic
coordinates on , and

2. The identification of a dynamic model which efficiently evolves the low-dimensional representation 𝐡 on the manifold
.

In this work, we build on the framework introduced in 10,12,13 for constructing or augmenting reduced order models, and extend
it in multiple directions. First, we introduce a deep convolutional autoencoder architecture which provides certain advantages
in identifying low-dimensional representations of the input data. Second, since the dynamics of the reduced state vector on the manifold may
not necessarily be linear, we employ a single-layer LSTM network to model the possibly nonlinear evolution of 𝐡 on . Lastly,
we introduce an unsupervised training strategy which trains the convolutional autoencoder while using the current reduced state
vectors to dynamically train the LSTM network.

4.2 Dimensionality reduction via convolutional autoencoders


Dimensionality reduction through fully-connected autoencoders has long been used in a wide variety of applications. However, one quickly runs into the curse of dimensionality when considering DNS-level input data, which can easily reach 10^9 dof as mentioned in the introduction. Directly applying large physical or simulation data to fully-connected autoencoders is not only computationally prohibitive, but the approach itself ignores the opportunity to exploit the structure of features in high-dimensional data. That is, since fully-connected autoencoders require that the input data be flattened into a 1D array, the local
spatial relations between values are eliminated and can only be recovered by initially considering dense models. Sparsity can be
achieved either a posteriori by pruning individual connections (setting 𝑤𝑖𝑗 = 0 for some 𝑖, 𝑗) after training, or encouraged dur-
ing training by using 𝐿1 regularization. Here, we seek to exploit local correlations present in many physics-based data through
the use of convolutional neural networks.
In particular, rather than applying a fully-connected autoencoder directly to complex, high-dimensional simulation or exper-
imental data, we apply it to a vectorized feature map of a much lower-dimension obtained from a deep convolutional network
acting directly on the high-dimensional data. Wrapping the fully-connected autoencoder with a convolutional neural network
has two significant advantages:

(i) The local approach of each convolutional layer helps to exploit local correlations in field values. Thus, much the same
way finite-difference stencils can capture local gradients, each filter 𝐊𝑓 in a filter bank computes local low-level features
from a small subset of the input.

(ii) The shared nature of each filter bank both allows similar features to be identified throughout the input domain and reduces the overall number of trainable parameters compared to a fully-connected layer with the same input size.

Consider the following 12-layer convolutional autoencoder model depicted graphically in Figure 3. A 2D arrayed input
𝐗 ∈ ℝ𝑁𝑥 ×𝑁𝑦 , with 𝑁𝑥 = 𝑁𝑦 = 128, is first passed through a 4-layer convolutional encoder. Each convolutional encoder layer
uses a filter bank 𝐊𝑓 ∈ ℝ5×5 , with the first layer having a dilation rate of 2 and the number of filters 𝑓 increasing from 4 in
the first layer to 32 in the fourth layer using Equation 7. At the opposite end of the convolutional autoencoder network we use
a 4-layer decoder network consisting of transpose convolutional layers. Often erroneously referred to as “deconvolutional"

FIGURE 3 Network architecture of the convolutional autoencoder. The encoder network consists of a 4-layer convolutional encoder (blue), a 4-layer fully-connected encoder and decoder (yellow), and a 4-layer transpose convolutional decoder (red). The low-dimensional representation is depicted in green.

layers, transpose convolutional layers multiply each element of the input with a filter 𝐊𝑓 and sum over the resulting feature
map, effectively swapping the forward and backward passes of a regular convolutional layer. The effect of using transpose
convolutional layers with a stride 𝑠 > 1 is to decode low-dimensional abstract features into a larger-dimensional representation. Table 1 outlines the architecture of the convolutional encoder and decoder subgraphs. In this work we will consider the sigmoid activation function $\sigma(s) = 1/(1 + \exp(-s))$ for each layer of the autoencoder.§

TABLE 1 Convolutional encoder (left) and decoder (right) filter sizes and strides

Layer   filter size   filters   stride      Layer   filter size   filters   stride
  1        5 × 5          4      2 × 2        9        5 × 5         16      2 × 2
  2        5 × 5          8      2 × 2       10        5 × 5          8      2 × 2
  3        5 × 5         16      2 × 2       11        5 × 5          4      2 × 2
  4        5 × 5         32      2 × 2       12        5 × 5          1      2 × 2

In between the convolutional encoder and decoder is a regular fully-connected autoencoder consisting of a 2-layer encoder
which takes the vectorized form of the 32 feature maps from the last convolutional encoder layer, $\mathrm{vec}(\mathcal{Y}) \in \mathbb{R}^{512}$, where $\mathcal{Y} = [\mathbf{Y}^1, \ldots, \mathbf{Y}^{32}] \in \mathbb{R}^{4 \times 4 \times 32}$, and returns the final low-dimensional representation of the input data
$$\mathbf{h} = \sigma\!\left(\mathbf{W}_E^2\, \sigma\!\left(\mathbf{W}_E^1\, \mathrm{vec}(\mathcal{Y}) + \mathbf{b}_E^1\right) + \mathbf{b}_E^2\right) \in \mathbb{R}^{N_h}, \qquad N_h \ll (N_x \cdot N_y), \tag{22}$$
where $\mathbf{W}_E^1, \mathbf{b}_E^1$ and $\mathbf{W}_E^2, \mathbf{b}_E^2$ are the parameters of the first and second fully-connected encoder layers (the 5th and 6th layers of the whole model). To reconstruct the original input data from the low-dimensional representation, a similar 2-layer fully-connected decoder, parameterized by $\mathbf{W}_D^1, \mathbf{b}_D^1$ and $\mathbf{W}_D^2, \mathbf{b}_D^2$, is used, whose result is reshaped and passed to the transpose convolutional decoder network.
Hierarchical convolutional feature learning through similar strategies has previously been proposed for visual tracking 51 and scene labeling or semantic segmentation of images 30,52,53 . However, this is the first time, to the authors' knowledge, that convolutional autoencoders have been applied to model reduction of large numerical data of physical dynamical systems. The
key innovation in using convolutional autoencoders in model reduction is that it allows for nonlinear autoencoders and thus
nonlinear model reduction to be applied to large input data in a way that exploits structures inherent in many physical systems.

§ In recent years the rectified linear unit (ReLU) 34 , given by ReLU(𝑠) = max(0, 𝑠), and its many variants like the ELU 50 , have been favored over the sigmoid activation function. However, in this work we have found that ReLUs produce results similar to linear model reduction theory since ReLU(𝑠) is linear for inputs 𝑠 ∈ ℝ+ .

Remark 2. In this work we restrict our attention to 2D input data of size $N_x \times N_y = 128 \times 128$, with the first-layer convolutional filter having a dilation rate of 2. In practice, however, an equivalent memory-reducing approach was employed by using input data of size $N_x \times N_y = 64 \times 64$. In addition, the low-dimensional representations considered in this work are of size $N_h = 64$ or smaller. To this effect, the hidden state sizes of the middle fully-connected autoencoder were chosen to be 512 and 256, such that $\mathbf{W}_E^1, (\mathbf{W}_D^2)^T \in \mathbb{R}^{512 \times 256}$ and $\mathbf{W}_E^2, (\mathbf{W}_D^1)^T \in \mathbb{R}^{256 \times N_h}$, with the bias terms shaped accordingly. The net result is an autoencoder with a maximum of 330k parameters for $N_h = 64$. A similar 12-layer fully-connected autoencoder would require over 22M parameters.
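
To make the architecture of Table 1, Equation 22, and Remark 2 concrete, the following is a minimal sketch using the Keras API (an assumption; the paper states only that TensorFlow was used) and the equivalent 64 × 64 input of Remark 2. The padding scheme, the omission of the first-layer dilation, and the weight initialization are assumptions of this sketch rather than the authors' exact implementation.

```python
import tensorflow as tf

N_h = 64  # size of the low-dimensional representation (cf. Remark 3)

# --- Convolutional + fully-connected encoder (layers 1-6) ---
inp = tf.keras.Input(shape=(64, 64, 1))                     # 64 x 64 input as in Remark 2
x = inp
for filters in (4, 8, 16, 32):                              # Table 1: 5x5 filters, stride 2
    x = tf.keras.layers.Conv2D(filters, 5, strides=2, padding="same", activation="sigmoid")(x)
x = tf.keras.layers.Flatten()(x)                            # vec(Y) in Eq. (22), length 4*4*32 = 512
x = tf.keras.layers.Dense(256, activation="sigmoid")(x)
h = tf.keras.layers.Dense(N_h, activation="sigmoid")(x)
encoder = tf.keras.Model(inp, h, name="encoder")

# --- Fully-connected + transpose-convolutional decoder (layers 7-12) ---
z = tf.keras.Input(shape=(N_h,))
y = tf.keras.layers.Dense(256, activation="sigmoid")(z)
y = tf.keras.layers.Dense(4 * 4 * 32, activation="sigmoid")(y)
y = tf.keras.layers.Reshape((4, 4, 32))(y)
for filters in (16, 8, 4, 1):                               # Table 1: transpose convolutions
    y = tf.keras.layers.Conv2DTranspose(filters, 5, strides=2, padding="same", activation="sigmoid")(y)
decoder = tf.keras.Model(z, y, name="decoder")

encoder.summary()   # reports the per-layer parameter counts
decoder.summary()
```
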
Remark 3. The size of the low-dimensional representation 𝑁ℎ must be chosen a priori for each model. Currently, no principled
approach exists for the choice of 𝑁ℎ . One possible heuristic for an upper bound is to choose 𝑁ℎ such that
$$\frac{\sum_{i=1}^{N_h} \sigma_i^2}{\sum_{i=1}^{m} \sigma_i^2} < \kappa, \tag{23}$$
where 𝜎𝑖 ≥ 0 are the singular values of the data matrix 𝐗 ∈ ℝ𝑁×𝑚 and 𝜅 is usually taken to be 99.9%. This approach is often
employed when selecting the number of POD modes to keep in POD-Galerkin reduced order models 14 , where in the context
of fluid flows this corresponds to choosing enough modes such that 99.9% of the energy content in the flow is preserved.
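
A short NumPy sketch of this heuristic, returning the number of leading modes whose cumulative energy fraction remains below 𝜅 (applied here to an arbitrary synthetic data matrix), is:

```python
import numpy as np

def choose_Nh(X, kappa=0.999):
    """Heuristic upper bound for N_h from Eq. (23), based on singular-value energy."""
    sigma = np.linalg.svd(X, compute_uv=False)
    energy = np.cumsum(sigma**2) / np.sum(sigma**2)
    # number of leading modes whose cumulative energy fraction is still below kappa
    Nh = int(np.searchsorted(energy, kappa))
    return max(Nh, 1)                       # keep at least one mode

rng = np.random.default_rng(0)
X = rng.standard_normal((1024, 40)) @ np.diag(np.logspace(0, -4, 40))  # synthetic data matrix
print(choose_Nh(X))
```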

4.3 Learning feature dynamics


The second component of projection-based model reduction is modeling the evolution of low-dimensional features 𝐡 in a com-
putationally efficient manner. Though identifying linear dynamics of 𝐡 is beneficial from an analysis perspective, here we will
consider the general case of learning arbitrary feature dynamics.
Consider a set of observations {𝐱𝑛 }𝑚𝑛=0 , 𝐱𝑛 ∈ ℝ𝑁 obtained from an HFM or through experimental sampling. Furthermore, for each observation consider some optimal low-dimensional representation 𝐡𝑛 ∈ ℝ𝑁ℎ , where 𝑁ℎ << 𝑁. This low-dimensional
representation can come from an optimal rank-𝑁ℎ POD representation 𝐡𝑛 = 𝚿𝑇𝑁 𝐱𝑛 , where the columns of 𝚿𝑁ℎ are the first 𝑁ℎ

POD modes, or through a neural network approach such as an autoencoder. We seek to construct a model for the evolution of this
low-dimensional representation in a completely data-driven fashion, i.e., without access or knowledge of the system operators.
This is particularly useful for cases where HFM are uncertain or do not exist altogether.
To model the evolution of 𝐡 we employ a modified version of the long short term memory (LSTM) network. LSTM networks
were first proposed primarily to overcome the vanishing or exploding gradient problem and are equipped with an explicit memory
cell and four gating units which adaptively control the flow of information through the network 41,54 . LSTM networks have
demonstrated impressive results in modeling relationships between sequences such as in machine translation tasks 55,42,56,35 .
More recently, they have been used to predict conditional probability distributions in chaotic dynamical systems 48 and in
modeling the evolution of low-dimensional POD representations 10,12 .
In this work we are interested in evolving feature vectors whose size correspond to the intrinsic dimensionality of a physical
system, which may be small compared to the number of hidden states and layers used in e.g., machine translation. In addition,
for large-scale systems it may be inefficient, if not computationally prohibitive to reconstruct the full high-dimensional state
at every time-step. With these restrictions, we construct a modified single-layer LSTM network to evolve the low-dimensional
representation 𝐡𝑛 without full state reconstruction with the following components:
• Input gate:
𝐢𝑛 = 𝜎(𝐖𝑖 𝐡𝑛−1 + 𝐛𝑖 )

• Forget gate:
𝐟 𝑛 = 𝜎(𝐖𝑓 𝐡𝑛−1 + 𝐛𝑓 )

• Output gate:
𝐨𝑛 = 𝜎(𝐖𝑜 𝐡𝑛−1 + 𝐛𝑜 )

• Cell state:
𝐜𝑛 = 𝐟 𝑛 ⊙ 𝐜𝑛−1 + 𝐢𝑛 ⊙ tanh(𝐖𝑐 𝐡𝑛−1 + 𝐛𝑐 )
where all four gates are used to update the feature vector by
𝐡𝑛 = 𝐨𝑛 ⊙ tanh(𝐜𝑛 ). (24)

FIGURE 4 LSTM model that iteratively updates the low-dimensional representation 𝐡.

Here, ⊙ represents the Hadamard product. Intuitively, at each step 𝑛 the input and forget gates choose what information gets
passed and dropped from the cell state 𝐜𝑛 , while the output gate controls the flow of information from the cell state to the feature
vector. It is important to note that the evolution of 𝐡 does not require information from the full state 𝐱, thereby avoiding a
costly reconstruction at every step.
Initializing with a known low-dimensional representation 𝐡0 one obtains a prediction for the following steps by iteratively
applying
̂𝐡𝑛+1 = 𝑓𝐿𝑆𝑇 𝑀 (̂𝐡𝑛 ) 𝑛 = 1, 2, 3, ... (25)
where ̂𝐡1 = 𝑓𝐿𝑆𝑇 𝑀 (𝐡0 ), and 𝑓𝐿𝑆𝑇 𝑀 (⋅) represents the action of Equation 24 and its subcomponents. A graphical representation
of this model is depicted in Figure 4.
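
The following NumPy sketch illustrates a single step of this modified LSTM together with the iterative rollout of Equation 25; the weights are random placeholders and no training is performed.

```python
import numpy as np

def sigmoid(s):
    return 1.0 / (1.0 + np.exp(-s))

def lstm_step(h_prev, c_prev, p):
    """One step of the modified LSTM: all gates depend only on h_{n-1}, not on the full state x."""
    i = sigmoid(p["Wi"] @ h_prev + p["bi"])                    # input gate
    f = sigmoid(p["Wf"] @ h_prev + p["bf"])                    # forget gate
    o = sigmoid(p["Wo"] @ h_prev + p["bo"])                    # output gate
    c = f * c_prev + i * np.tanh(p["Wc"] @ h_prev + p["bc"])   # cell-state update
    h = o * np.tanh(c)                                         # Eq. (24)
    return h, c

rng = np.random.default_rng(0)
N_h = 16                                                       # feature-vector size (example)
p = {k: rng.standard_normal((N_h, N_h)) * 0.1 for k in ("Wi", "Wf", "Wo", "Wc")}
p.update({k: np.zeros(N_h) for k in ("bi", "bf", "bo", "bc")})

h = rng.standard_normal(N_h)                                   # h_0, e.g. from the encoder
c = np.zeros(N_h)
trajectory = []
for n in range(30):                                            # iterative rollout of Eq. (25)
    h, c = lstm_step(h, c, p)
    trajectory.append(h)
```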

4.4 Unsupervised training strategy


A critical component of this work is the development of an unsupervised training approach that adjusts both the convolutional
autoencoder and recurrent model in a joint fashion. The main obstacle is in preventing either the convolutional autoencoder or
RNN portion of the model from overfitting. Here, we discuss the construction of the training dataset as well as the training and
evaluation algorithms.

4.4.1 Constructing the training dataset


Consider a dataset {𝐱1 , 𝐱2 , ..., 𝐱𝑚 }, where 𝐱 ∈ ℝ𝑁𝑥 ×𝑁𝑦 is a 2D snapshot of some dynamical system (e.g., a velocity field defined
on a 2D grid). To make this dataset amenable to training, it is broken up into a set of 𝑁𝑠 finite-time training sequences
{𝐗1 , ..., 𝐗𝑁𝑠 }, where each training sequence 𝐗𝑖 ∈ ℝ𝑁𝑥 ×𝑁𝑦 ×𝑁𝑡 consists of 𝑁𝑡 snapshots. Parameter-varying datasets naturally
break up in this form where each 𝐗𝑖 may represent a small sequence of snapshots corresponding to a single parameter value 𝝁𝑖 .
A common strategy for improving training is to consider only the fluctuations around the temporal mean
$$\mathbf{x}^{\prime n} = \mathbf{x}^{n} - \bar{\mathbf{x}}, \tag{26}$$
where $\bar{\mathbf{x}} = \frac{1}{m}\sum_{n=1}^{m} \mathbf{x}^{n}$ is the temporal average over the entire dataset and $\mathbf{x}'$ are the fluctuations around this mean. In our case, each layer in the convolutional autoencoder uses the sigmoid activation function, which maps each real-valued input to the interval $(0, 1)$, requiring our dataset to be feature-scaled in order to prevent saturation of the activation 34 . Thus, our training dataset consists of feature-scaled snapshots
$$\mathbf{x}_{s}^{\prime n} = \frac{\mathbf{x}^{\prime n} - \mathbf{x}'_{min}}{\mathbf{x}'_{max} - \mathbf{x}'_{min}}, \tag{27}$$
where each $\mathbf{x}_{s}^{\prime n} \in [0, 1]^{N_x \times N_y}$. With these modifications, the resulting training dataset has the following form
$$\mathcal{X} = \{\mathbf{X}_{s}^{\prime 1}, \ldots, \mathbf{X}_{s}^{\prime N_s}\} \in [0, 1]^{N_x \times N_y \times N_t \times N_s}, \tag{28}$$
where each training sample $\mathbf{X}_{s}^{\prime i} = [\mathbf{x}_{s,i}^{\prime 1}, \ldots, \mathbf{x}_{s,i}^{\prime N_t}]$ is a matrix consisting of the feature-scaled fluctuations.
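
A sketch of this preprocessing (Equations 26-28), assuming the snapshots are already arranged as a 4D NumPy array of sequences, might read:

```python
import numpy as np

def build_training_set(snapshots):
    """Mean-subtract and feature-scale snapshots into [0, 1], per Eqs. (26)-(27).

    snapshots : array of shape (N_s, N_t, N_x, N_y), i.e. N_s sequences of N_t 2D snapshots.
    """
    x_bar = snapshots.mean(axis=(0, 1), keepdims=True)   # temporal mean over the entire dataset
    fluct = snapshots - x_bar                             # Eq. (26)
    x_min, x_max = fluct.min(), fluct.max()
    scaled = (fluct - x_min) / (x_max - x_min)            # Eq. (27), values in [0, 1]
    return scaled, (x_bar, x_min, x_max)                  # keep statistics to undo the scaling later

# Example with synthetic data: N_s = 8 sequences of N_t = 30 snapshots of size 64 x 64
rng = np.random.default_rng(0)
data = rng.standard_normal((8, 30, 64, 64))
X_train, stats = build_training_set(data)
print(X_train.shape, X_train.min(), X_train.max())
```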

4.4.2 Offline training and online prediction algorithms


Our approach to train both components of the convolutional recurrent autoencoder model is to split the forward pass into two
stages. In the first stage, the autoencoder takes an $N_b$-sized batch of the training data $\mathcal{X}^b \subset \mathcal{X}$, where $\mathcal{X}^b \in [0, 1]^{N_x \times N_y \times N_t \times N_b}$, and outputs both the current $N_b$-sized batch of low-dimensional representations of the training sequences
$$\mathcal{H}^b = \{\mathbf{H}_1, \ldots, \mathbf{H}_{N_b}\} \in \mathbb{R}^{N_h \times N_t \times N_b}, \tag{29}$$
where $\mathbf{H}_i = [\mathbf{h}_i^1, \ldots, \mathbf{h}_i^{N_t}] \in \mathbb{R}^{N_h \times N_t}$, and a reconstruction $\hat{\mathcal{X}}^b$ of the original input training batch. In the second stage of the forward pass, the first feature vector of each sequence is used to initialize and iteratively update Equation 25 to get a reconstruction $\hat{\mathcal{H}}^b$ of the low-dimensional representations of the training batch, Equation 28.
We seek to construct a loss function that equally weights the error in the full-state reconstruction and the evolution of the
low-dimensional representations. In general, we would like to find the model parameters $\boldsymbol{\theta}$ that, for any sequence $\mathbf{X}'_s = [\mathbf{x}_s^{\prime 1}, \ldots, \mathbf{x}_s^{\prime N_t}]$ with $\mathbf{x}_s^{\prime n} \sim \mathcal{P}_{data}$, where $\mathcal{P}_{data}$ is the data-generating distribution, and its corresponding low-dimensional representation $\mathbf{H} = [\mathbf{h}^1, \ldots, \mathbf{h}^{N_t}]$, minimize the following expected error between the model and the data
$$\mathcal{L}(\boldsymbol{\theta}) = \mathbb{E}_{\mathbf{x}_s^{\prime n} \sim \mathcal{P}_{data}}\left[ \mathcal{L}(\hat{\mathbf{X}}'_s, \mathbf{X}'_s, \hat{\mathbf{H}}, \mathbf{H}) \right] = \mathbb{E}_{\mathbf{x}_s^{\prime n} \sim \mathcal{P}_{data}}\left[ \frac{\alpha}{N_t} \sum_{n=1}^{N_t} \frac{\|\mathbf{x}_s^{\prime n} - \hat{\mathbf{x}}_s^{\prime n}\|_F^2}{\|\mathbf{x}_s^{\prime n}\|_F^2 + \epsilon} + \frac{\beta}{N_t - 1} \sum_{n=2}^{N_t} \frac{\|\mathbf{h}^n - \hat{\mathbf{h}}^n\|_2^2}{\|\mathbf{h}^n\|_2^2 + \epsilon} \right], \tag{30}$$
where $\epsilon > 0$ is a small positive number and $\alpha = \beta = 0.5$. In practice, the expected error is approximated by averaging $\mathcal{L}(\hat{\mathbf{X}}'_s, \mathbf{X}'_s, \hat{\mathbf{H}}, \mathbf{H})$ over all samples in a training batch during each backward pass. Intuitively, at every training step, the autoencoder
performs a regular forward pass while constructing a new batch of low-dimensional representations which are used to train the
RNN. In this work we use the ADAM optimizer 36 , a version of stochastic gradient descent that computes adaptive learning
rates for different parameters using estimates of first and second moments of the gradients. Algorithm 4.1 outlines the offline
training of the convolutional recurrent autoencoder in more detail. This model was built and trained using the open-source deep
learning library TensorFlow 57 .
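
For reference, a NumPy sketch of the batch-averaged loss of Equation 30, with 𝛼 = 𝛽 = 0.5 and a small 𝜖 as in the text, could be written as:

```python
import numpy as np

def recurrent_autoencoder_loss(X, X_hat, H, H_hat, alpha=0.5, beta=0.5, eps=1e-8):
    """Batch-averaged loss of Eq. (30).

    X, X_hat : (N_b, N_t, N_x, N_y) true and reconstructed snapshot sequences
    H, H_hat : (N_b, N_t, N_h) encoder features and their LSTM predictions
    """
    # Relative full-state reconstruction error, averaged over the N_t snapshots
    num_x = np.sum((X - X_hat) ** 2, axis=(2, 3))
    den_x = np.sum(X ** 2, axis=(2, 3)) + eps
    rec_term = alpha * np.mean(num_x / den_x, axis=1)          # shape (N_b,)

    # Relative feature-prediction error over steps n = 2, ..., N_t
    num_h = np.sum((H[:, 1:] - H_hat[:, 1:]) ** 2, axis=2)
    den_h = np.sum(H[:, 1:] ** 2, axis=2) + eps
    dyn_term = beta * np.mean(num_h / den_h, axis=1)           # shape (N_b,)

    return np.mean(rec_term + dyn_term)                        # average over the batch
```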

Algorithm 4.1: Convolutional Recurrent Autoencoder Training Algorithm


Input: Training dataset $\mathcal{X} \in [0, 1]^{N_x \times N_y \times N_t \times N_s}$, number of train-steps $N_{train}$, batch size $N_b$.
Result: Trained model parameters $\boldsymbol{\theta}$
1 Randomly initialize $\boldsymbol{\theta}$;
2 for $i \in \{1, ..., N_{train}\}$ do
3 Randomly sample a batch from the training data: $\mathcal{X}^b \subset \mathcal{X}$;
4 Flatten batch-mode: $\mathcal{X}^b_{AE} \leftarrow \mathrm{flatten}(\mathcal{X}^b)$ s.t. $\mathcal{X}^b_{AE} \in [0, 1]^{N_x \times N_y \times (N_t \cdot N_b)}$;
5 Encoder forward pass: $\tilde{\mathcal{H}}^b \leftarrow f_{enc}(\mathcal{X}^b_{AE})$, where $\tilde{\mathcal{H}}^b \in \mathbb{R}^{N_h \times (N_t \cdot N_b)}$;
6 Decoder forward pass: $\hat{\mathcal{X}}^b_{AE} \leftarrow f_{dec}(\tilde{\mathcal{H}}^b)$;
7 Reshape low-dimensional features: $\mathcal{H}^b \in \mathbb{R}^{N_h \times N_t \times N_b} \leftarrow \mathrm{reshape}(\tilde{\mathcal{H}}^b)$;
8 Initialize RNN subgraph loop: $\hat{\mathbf{h}}^2_i \leftarrow f_{LSTM}(\mathbf{h}^1_i)$ for $i \in \{1, ..., N_b\}$, $\mathbf{h}^1_i \subset \mathcal{H}^b$;
9 for $n \in \{2, ..., N_t - 1\}$ do
10 $\hat{\mathbf{h}}^{n+1}_i \leftarrow f_{LSTM}(\hat{\mathbf{h}}^n_i)$ for $i \in \{1, ..., N_b\}$, $\hat{\mathbf{h}}^n_i \subset \hat{\mathcal{H}}^b$;
11 end
12 Using $\mathcal{X}^b$, $\hat{\mathcal{X}}^b$, $\mathcal{H}^b$, and $\hat{\mathcal{H}}^b$, calculate the approximate gradient $\hat{\mathbf{g}}$ of Equation 30;
13 Update parameters: $\boldsymbol{\theta} \leftarrow \mathrm{ADAM}(\hat{\mathbf{g}})$
14 end

Once the model is trained, online prediction is straightforward. Using the trained parameters 𝜽∗ , and given an initial condition
𝐱0 ∈ [0, 1]𝑁𝑥 ×𝑁𝑦 , a low-dimensional representation of the initial condition 𝐡0 ∈ ℝ𝑁 ℎ
is constructed using the encoder network.
Iterative applications of Equation 25 are then used to evolve this low-dimensional representation for 𝑁𝑡 steps. The modular
construction of the convolutional recurrent autoencoder model allows the user to reconstruct from ̂𝐡𝑛 the full-dimensional state
𝐱̂ 𝑛 at every time step or at any specific instance. The online prediction algorithm is outlined in Algorithm 4.2.

Algorithm 4.2: Convolutional Recurrent Autoencoder Prediction Algorithm


Input: Initial condition $\mathbf{x}^0 \in [0, 1]^{N_x \times N_y}$, number of prediction steps $N_t$.
Result: Model prediction $\hat{\mathbf{X}} = [\hat{\mathbf{x}}^1, ..., \hat{\mathbf{x}}^{N_t}] \in [0, 1]^{N_x \times N_y \times N_t}$
1 Load trained parameters $\boldsymbol{\theta}^{*}$;
2 Encoder forward pass: $\mathbf{h}^0 \leftarrow f_{enc}(\mathbf{x}^0)$;
3 Initialize RNN subgraph loop: $\hat{\mathbf{h}}^1 \leftarrow f_{LSTM}(\mathbf{h}^0)$;
4 for $n \in \{1, ..., N_t - 1\}$ do
5 $\hat{\mathbf{h}}^{n+1} \leftarrow f_{LSTM}(\hat{\mathbf{h}}^n)$;
6 end
7 Decoder forward pass: $\hat{\mathbf{X}} \leftarrow f_{dec}(\hat{\mathbf{H}})$, where $\hat{\mathbf{H}} = [\hat{\mathbf{h}}^1, ..., \hat{\mathbf{h}}^{N_t}]$;

5 NUMERICAL EXPERIMENTS

We apply the methods described in the previous sections on three representative examples to illustrate the effectiveness of deep
autoencoder-based approaches to nonlinear model reduction. The first one considers only a 4-layer fully-connected recurrent
autoencoder model applied to a simple one-dimensional problem based on the viscous Burgers equation. This has the merit of
demonstrating the performance of autoencoders equipped with nonlinear activation functions on tasks where linear methods
tend to struggle. The second example considers a parametric model reduction problem based on two-dimensional fluid flow in
a periodic domain with significant parameter variations. In this case, our convolutional recurrent autoencoder model is tasked
with predicting new solutions given new parameters (i.e., parameters unseen during training). The third example focuses on
long-term prediction of an incompressible flow inside a lid-driven cavity. This case serves to highlight the long-term stability
and overall performance of the convolutional recurrent autoencoder model in contrast to the unstable behavior exhibited by
POD-Galerkin ROMs.

5.1 Viscous Burgers equation


First, we consider the one-dimensional viscous Burgers equation given by
$$\frac{\partial u}{\partial t} + u \frac{\partial u}{\partial x} = \frac{1}{\mathrm{Re}} \frac{\partial^2 u}{\partial x^2}, \qquad (x, t) \in [0, L] \times [0, T],$$
$$u(x, 0) = 1 + \exp\!\left(-\frac{2(x - x_0)^2}{0.1^2}\right), \qquad u(0, t) = 0, \tag{31}$$
where 𝐿 = 1.5, 𝑇 = 0.3, 𝑥0 is the initial location of the Gaussian initial condition, and the Reynolds-like number is set to
𝑅𝑒 = 200. This problem is spatially discretized onto a uniform 𝑁𝑥 = 1024 grid using a second order finite difference scheme
with a grid spacing of Δ𝑥 = 𝐿∕𝑁𝑥 . A parameter-varying dataset consisting of 𝑁𝑠 = 128 training samples is created by randomly
sampling 𝑥0 ∈ [0, 𝐿] and solving Equation 31 using a fourth-order Runge-Kutta scheme with Δ𝑡 = 0.5Δ𝑥. After subtracting the
mean and feature scaling each solution snapshot, the training dataset has the form of Equation 28, where each training sample is a matrix of solution snapshots $\mathbf{X}_{s}^{\prime i} = [\mathbf{u}_{s,i}^{\prime 1}, \ldots, \mathbf{u}_{s,i}^{\prime N_t}] \in \mathbb{R}^{N_x \times N_t}$ and corresponds to a different initial condition $x_0^i$. In this case, $N_t = 40$ is the number of equally spaced snapshots sampled from each trajectory.
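
A minimal sketch of this data-generation step (second-order central differences in space, classical RK4 in time) is given below; the boundary treatment and the smaller, explicitly stable time step are simplifications of this sketch and not necessarily the authors' exact discretization.

```python
import numpy as np

L, T, Re, Nx = 1.5, 0.3, 200.0, 1024
dx = L / Nx
dt = 0.25 * Re * dx**2          # conservative step for this fully explicit sketch (the paper uses dt = 0.5*dx)
x = np.linspace(0.0, L, Nx)

def rhs(u):
    dudx = np.gradient(u, dx)                              # second-order first derivative
    d2udx2 = np.zeros_like(u)
    d2udx2[1:-1] = (u[2:] - 2 * u[1:-1] + u[:-2]) / dx**2  # second-order second derivative
    du = -u * dudx + d2udx2 / Re
    du[0] = 0.0                                            # hold the left boundary value fixed (simplified BC)
    return du

def solve(x0, n_steps):
    u = 1.0 + np.exp(-2.0 * (x - x0) ** 2 / 0.1**2)        # Gaussian initial condition of Eq. (31)
    snapshots = [u.copy()]
    for _ in range(n_steps):                               # classical fourth-order Runge-Kutta
        k1 = rhs(u)
        k2 = rhs(u + 0.5 * dt * k1)
        k3 = rhs(u + 0.5 * dt * k2)
        k4 = rhs(u + dt * k3)
        u = u + dt * (k1 + 2 * k2 + 2 * k3 + k4) / 6.0
        snapshots.append(u.copy())
    return np.array(snapshots)

U = solve(x0=0.5, n_steps=int(T / dt))   # one trajectory; x0 is sampled randomly in [0, L] for the dataset
```
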
This example was crafted to highlight an important performance benefit of using nonlinear fully-connected autoencoder-based
model reduction approaches in contrast to POD-based ROMs. First, we train a 4-layer fully-connected autoencoder (2 encoder
layers and 2 decoder layers) to produce a low-dimensional representation $\mathbf{h}_{AE} \in \mathbb{R}^{N_h}$, $N_h = 20$, with an intermediate layer of size 512. The evolution of this representation and of an equivalently sized optimal POD representation $\mathbf{h}_{POD} = \boldsymbol{\Psi}_{N_h}^{T}\mathbf{u}'_s$ are both modeled

using separate single layer modified LSTM networks trained according to a simplified version of Algorithm 4.1 using a batch
size 𝑁𝑏 = 8. The best-case scenario for any POD-based ROM is a snapshot reconstruction satisfying Equation 12; therefore, in lieu of a POD-Galerkin ROM we will consider only the projected solution snapshots. These proof-of-concept models were each
trained over 𝑁𝑡𝑟𝑎𝑖𝑛 = 100, 000 iterations on a desktop computer in a matter of minutes.
Figure 5 depicts the comparison between the exact solution, the best-case optimal POD reconstruction, the POD-LSTM recon-
struction, and finally the shallow recurrent autoencoder reconstruction. As expected, due to the truncation of higher-frequency

FIGURE 5 (a) Exact solution; (b) 𝑁ℎ = 20: 𝐿2-optimal POD reconstruction (solid orange), POD-LSTM reconstruction (dashed orange), exact solution (light blue); (c) 𝑁ℎ = 20: shallow autoencoder-LSTM reconstruction (dashed orange), exact solution (light blue).

POD modes, the 𝐿2-optimal POD reconstruction exhibits spurious oscillations. The spurious oscillations, aside from signifying a poor reconstruction, may lead to stability issues in POD-Galerkin ROMs. This is widely known to be a problem for model reduction of fluid flows, where POD-Galerkin ROMs, while capturing nearly all the energy of the system, truncate low-energy modes which can have a large influence on the dynamics. Additionally, in agreement with similar work in 10,12 , the POD-LSTM
model was able to accurately capture the evolution of the optimal POD representation in a non-intrusive manner.
More importantly, the power of recurrent autoencoder-based approaches for nonlinear model reduction is exhibited in the
reconstruction using the shallow recurrent autoencoder model. The nonlinearities in the fully-connected autoencoder help to identify a more expressive low-dimensional representation of the full state. Combining this with an LSTM network to evolve these low-dimensional representations yields an effective nonlinear reduced order modeling approach that outperforms best-case-scenario POD-based ROMs while using models of the same size.

5.2 Parameter-varying flow in a periodic box


Next we will consider the problem of a two-dimensional incompressible flow in a square periodic domain prescribed by the
Navier-Stokes equations in vorticity formulation
$$\frac{\partial \omega}{\partial t} + \mathbf{u} \cdot \nabla\omega = \frac{1}{\mathrm{Re}} \nabla^2 \omega, \tag{32}$$
defined on the domain $(x, y) \in [0, 2\pi] \times [0, 2\pi]$, where $\omega(x, y, t)$ is the vorticity field and $\mathbf{u}(x, y, t)$ is the velocity vector field. The
Reynolds number is set to Re = 5 × 10³. For the construction of the dataset, we will consider a family of initial conditions given
by a mixture of $N_v$ Gaussian vortices
$$\omega(x, y, 0) = \sum_{i=1}^{N_v} \delta(i) \exp\!\left(-\frac{(x - x_i)^2 + (y - y_i)^2}{0.1}\right), \tag{33}$$
parameterized by the location of the center of each vortex. Each vortex center is sampled randomly from a square subdomain
(𝑥𝑖 , 𝑦𝑖 ) ∈ [𝜋∕2, 3𝜋∕2] × [𝜋∕2, 3𝜋∕2] ∀𝑖 as depicted in Figure 6, and the sign of each vortex is governed by 𝛿(𝑖) ∈ {−1, +1} ∀𝑖.
We consider two cases: (a) 𝑁𝑣 = 2, with each vortex of opposite sign, and (b) 𝑁𝑣 = 3, with one positive vortex and the rest
negative. A representative set of initial conditions for the 𝑁𝑣 = 2 and 𝑁𝑣 = 3 cases can be seen in Figure 7 and Figure 8,
respectively. This example was constructed both to showcase the application to larger scale problems that would otherwise be
too computationally intensive using a fully-connected autoencoder and to highlight the location-invariance capabilities of the
convolutional autoencoder. The main idea is that, similar to detecting an instance of an object anywhere in an image, the
shared-weight property of each convolutional layer in the autoencoder should be able to capture the large-parameter variations
implicitly defined in the initial condition.

FIGURE 6 Square domain with periodic boundary conditions. The positive and negative vortices of equal strength are randomly initialized within the grey subdomain.

To create a training dataset, Equation 32 is discretized pseudospectrally on a uniform 128² grid and integrated in time using the Crank–Nicolson method up to 𝑇 = 250 with a time step of Δ𝑡 = 1 × 10⁻². A parameter-varying dataset is created by randomly sampling the initial Gaussian center locations from a square subdomain as previously described. Similar to the first example, after subtracting the temporal mean and feature scaling, the resulting dataset has the form
\mathcal{X} = \{ \mathbf{X}'^{1}_{s}, \ldots, \mathbf{X}'^{N_s}_{s} \} \in [0, 1]^{N_x \times N_y \times N_t \times N_s},    (34)
where each training sample \mathbf{X}'^{i}_{s} = [ \boldsymbol{\omega}'^{1}_{s,i}, \ldots, \boldsymbol{\omega}'^{N_t}_{s,i} ] is a matrix of two-dimensional discretized snapshots corresponding to a different
set of initial conditions. In this case, the dataset consists of a total of 𝑁𝑠 = 5120 training samples, each with 𝑁𝑡 = 30 evenly sampled snapshots. Since we are interested in employing the convolutional recurrent autoencoder, each 𝝎′𝑠,𝑖 is kept as a two-dimensional array.
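A sketch of the corresponding preprocessing is given below; the exact centering and scaling conventions (a per-trajectory temporal mean and a global min-max scaling) are assumptions made for illustration.

```python
import numpy as np

def build_dataset(trajectories):
    """Assemble a dataset of the form of Equation (34).

    trajectories: array of shape (Ns, Nx, Ny, Nt) of vorticity snapshots.
    """
    # subtract the temporal mean of each trajectory
    fluct = trajectories - trajectories.mean(axis=-1, keepdims=True)
    # scale all fluctuations to the interval [0, 1]
    lo, hi = fluct.min(), fluct.max()
    return (fluct - lo) / (hi - lo)
```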
Three convolutional recurrent autoencoder models, with feature vector sizes 𝑁ℎ = 8, 16, and 64, were trained using the dataset in Equation 34 with both two and three initial vortices. Each model was trained on a single Nvidia Tesla K20 GPU for 𝑁𝑡𝑟𝑎𝑖𝑛 = 1,000,000 iterations. Once trained, the three models were used to predict the evolution of the vorticity field for new
initial conditions. To highlight the benefits of convolutional recurrent autoencoders for location-invariant feature learning, we
compare our prediction with a set of best-case scenario rank-8 POD reconstructions. These rank-8 POD reconstructions use a
dataset containing snapshots from just 128 separate trajectories. In this case, a rank-8 POD reconstruction is not sufficient to
accurately capture the correct solution since the inclusion of randomly varying initial conditions has created a dataset that is no
longer low-rank. This is clearly shown in Figure 9 and Figure 10 for the two and three vortex cases. The need for more and
more POD modes to achieve a good reconstruction underscores a significant disadvantage of POD-based ROMs for systems
with large variations in parameters. The convolutional recurrent autoencoder overcomes these challenges and performs well in predicting new solutions without the need to resort to larger-rank models. In contrast to POD-Galerkin ROMs, increasing the number of separate trajectories in the dataset is beneficial to learning the correct dynamic behavior.
Similar to the first numerical example, the predictions are devoid of the spurious oscillations that are commonplace in POD-based ROMs. Considering a single initial condition, Figure 11 shows the performance of each sized model in predicting the location of each vortex as it evolves up to the training sequence length for the two-vortex case. Further, Figure 12 shows the
FIGURE 7 A set of initial conditions with two randomly located Gaussian vortices of equal and opposite strength.

mean and standard deviation of the scaled squared reconstruction error


\frac{\| \boldsymbol{\omega}'^{n}_{s} - \hat{\boldsymbol{\omega}}'^{n}_{s} \|_F^2}{\| \boldsymbol{\omega}'^{n}_{s} \|_F^2 + \epsilon}    (35)
at each time step as calculated from 512 new prediction runs using the trained models. In all three cases, the error did not grow
significantly over the length of the training sequence.
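The error statistics reported in Figure 12 can be reproduced from stored predictions with a computation of the following form (a sketch; array names and shapes are illustrative):

```python
import numpy as np

def scaled_error_stats(omega_true, omega_pred, eps=1e-8):
    """Mean/std over runs of the scaled squared error of Equation (35) at each time step.

    omega_true, omega_pred: arrays of shape (runs, Nx, Ny, Nt).
    """
    num = np.sum((omega_true - omega_pred) ** 2, axis=(1, 2))  # squared Frobenius norms, (runs, Nt)
    den = np.sum(omega_true ** 2, axis=(1, 2)) + eps
    err = num / den
    return err.mean(axis=0), err.std(axis=0)
```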

5.3 Lid-driven cavity flow


In the final example we consider a two-dimensional incompressible flow inside a square cavity with a lid velocity of 𝐮_lid = (1 − x²)² at a moderate Reynolds number Re = 2.75 × 10⁴. A graphic of the domain is depicted in Figure 13. At these Reynolds numbers the lid-driven cavity flow is known to settle into a statistically stationary solution far from the initial condition, making it a well-known benchmark for the validation of numerical schemes and reduced order models. In particular, this benchmark is useful for testing the stability of reduced order models 25. The characteristic length and velocity scales used in defining the Reynolds number are the cavity width and the maximum lid velocity.
Consider the two-dimensional Navier-Stokes equations in streamfunction-vorticity formulation defined on the square domain (𝑥, 𝑦) ∈ [−1, 1] × [−1, 1]
\frac{\partial}{\partial t}\left(\nabla^2 \Psi\right) + \frac{\partial \Psi}{\partial y} \frac{\partial}{\partial x}\left(\nabla^2 \Psi\right) - \frac{\partial \Psi}{\partial x} \frac{\partial}{\partial y}\left(\nabla^2 \Psi\right) = \nu \nabla^4 \Psi,    (36)
FIGURE 8 A set of initial conditions with three randomly located Gaussian vortices, one positive and two negative all with
equal strength.

where Ψ(𝑥, 𝑦, 𝑡) is the streamfunction and ∇⁴ = ∇²∇² is the biharmonic operator. To generate the training dataset, Equation 36 is spatially discretized on a 128² Chebyshev grid and solved numerically. The Chebyshev coefficients are computed using the fast Fourier transform (FFT), and the convective nonlinearities are handled pseudospectrally. The equations are integrated in time using a semi-implicit, second-order Euler scheme. Since the statistically stationary solution is far from the initial condition, we first initialize the simulation over 7,500,000 time steps with a time-step size of Δ𝑡 = 1 × 10⁻⁴. The following 2,500,000 time steps are then used to create a dataset in the form of Equation 28 with 𝑁𝑠 = 1110, where now each training sample is
\mathbf{X}'^{i}_{s} = \left[ \boldsymbol{\Psi}'^{i}_{s}, \boldsymbol{\Psi}'^{i+m}_{s}, \boldsymbol{\Psi}'^{i+2m}_{s}, \ldots, \boldsymbol{\Psi}'^{i+(N_t-1)m}_{s} \right] \in \mathbb{R}^{N_x \times N_y \times N_t},    (37)
where each 𝚿′𝑖𝑠 is a discretized two-dimensional snapshot of Equation 36, 𝑁𝑡 = 35, and 𝑚 is taken to be 100. In doing this, we ensure that the initial training snapshots used to initialize the RNN portion of the model evenly sample the entire trajectory of Equation 36. The result is a training dataset that gives a good representation of the dynamics for the RNN to learn. In addition, an interpolation step onto a uniform 128² grid is performed to ensure each filter 𝐊𝑓 acts on receptive fields of equal physical size.
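A sketch of this sliding-window construction is given below, assuming the stored snapshots have already been interpolated onto the uniform grid; the offsets and total number of samples used to reach 𝑁𝑠 = 1110 are not reproduced exactly.

```python
import numpy as np

def strided_samples(psi, Nt=35, m=100):
    """Build training samples of the form of Equation (37) from a long trajectory.

    psi: array of shape (T, Nx, Ny) of streamfunction snapshots.
    Each sample contains Nt snapshots separated by m time steps.
    """
    T = psi.shape[0]
    samples = [psi[i : i + (Nt - 1) * m + 1 : m]       # snapshots i, i+m, ..., i+(Nt-1)m
               for i in range(T - (Nt - 1) * m)]
    return np.stack(samples)                           # shape (Ns, Nt, Nx, Ny)
```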
Three convolutional recurrent autoencoder models were trained using this dataset, again with low-dimensional representations of sizes 𝑁ℎ = 8, 16, and 64. In this case all three models were trained on a single Nvidia Tesla K20 GPU for 𝑁𝑡𝑟𝑎𝑖𝑛 = 600,000 iterations. The online performance of each model was evaluated by initializing it with a slightly perturbed version of the first snapshot of the entire dataset and evaluating for 2500 prediction steps, over 70 times the length of each training sequence. We perform the same with three equivalently sized POD-Galerkin ROMs. Figures 14, 15, and 16 depict the final predicted velocity fields

FIGURE 9 Comparison at 𝑡 = 0, 40, 80, 120 of a sample trajectory using two initial vortices: (a) true solution, (b) rank-8 POD
reconstruction using dataset with 128 trajectories, and (c) prediction using a trained convolutional recurrent autoencoder of size
𝑁ℎ = 8.

𝑢(𝑥, 𝑦, 𝑡), 𝑣(𝑥, 𝑦, 𝑡), as well as the predicted vorticity field 𝜔(𝑥, 𝑦, 𝑡), using traditional POD-Galerkin ROMs and our convolutional recurrent autoencoder model for 𝑁ℎ = 8, 16, and 64. In all three reconstructed fields, the poor performance of the POD-Galerkin ROMs is evident from the spurious oscillations present in the field. This is in contrast to the predictions produced by our approach, which closely match the exact solution even after long-term prediction.
In fact, we only present predictions up until 𝑡 = 60 for the 𝑁ℎ = 8 POD-Galerkin ROM since instabilities cause the solution
to diverge. This can be seen more clearly in Figure 17, which compares the instantaneous turbulent kinetic energy (TKE) of the
flow
E(t) = \frac{1}{2} \int_{\Omega} \left( u(t)'^2 + v(t)'^2 \right) \, \mathrm{d}\Omega    (38)
where 𝑢(𝑡)′ and 𝑣(𝑡)′ are the instantaneous velocity fluctuations around the mean and Ω represents the fluid domain. The TKE
can be seen as a measure of the energy content within the flow. For statistically stationary flows, such as the one considered in
this example, the TKE should hover around a mean value. In Figure 17 we see that the POD-Galerkin models fail to capture
the correct TKE, and in the case of 𝑁ℎ = 8 instabilities lead to eventual divergence.
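For completeness, a sketch of how Equation 38 can be evaluated on the uniform grid used for the reconstructions is shown below; the simple Riemann-sum quadrature with cell area dA is an assumption, not necessarily the quadrature used for Figure 17.

```python
import numpy as np

def turbulent_kinetic_energy(u, v, u_mean, v_mean, dA):
    """Instantaneous TKE, Equation (38), approximated by a Riemann sum on a uniform grid.

    u, v: instantaneous velocity fields of shape (Nx, Ny);
    u_mean, v_mean: temporal mean fields; dA: area of one grid cell.
    """
    up, vp = u - u_mean, v - v_mean                    # velocity fluctuations
    return 0.5 * np.sum(up ** 2 + vp ** 2) * dA
```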
Against this backdrop, our approach vastly outperforms traditional POD-Galerkin ROMs. All velocity and vorticity reconstructions are in good agreement with the HFM solution. As the size of the model increases to 𝑁ℎ = 64, the predicted TKE comes into good agreement with that of the HFM. It should be noted that the lid-driven cavity flow at these Reynolds numbers exhibits chaotic motion; thus, the best-case scenario is to capture the correct TKE in a statistical sense. This can be seen further in Figure 18, which compares the power spectral density of each predicted TKE with that of the HFM.
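The exact spectral estimator behind Figure 18 is not detailed here; one standard way to produce a comparable spectrum is Welch's method, sketched below (the file name, sampling interval, and segment length are placeholders).

```python
import numpy as np
from scipy.signal import welch

dt = 0.1                                               # placeholder sampling interval of E(t)
E = np.load("tke_series.npy")                          # placeholder file of TKE samples
freq, psd = welch(E - E.mean(), fs=1.0 / dt, nperseg=512)
psd_db = 10.0 * np.log10(psd + 1e-20)                  # power in dB, as plotted in Figure 18
```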

FIGURE 10 Comparison at 𝑡 = 0, 40, 80, 120 of a sample trajectory using three initial vortices: (a) true solution, (b) rank-8
POD reconstruction using dataset with 128 trajectories, and (c) prediction using a trained convolutional recurrent autoencoder
of size 𝑁ℎ = 8.



FIGURE 11 Evolution of the vortex centers as given by the HFM solution and by the predicted solutions using 𝑁ℎ = 8, 16, 64.

FIGURE 12 Mean and standard deviation of error at every time step for online predictions using (a) 𝑁ℎ = 8, (b) 𝑁ℎ = 16, and
(c) 𝑁ℎ = 64.

FIGURE 13 Lid-driven cavity domain, with lid velocity 𝐮_lid = (1 − x²)².

While each model prediction captures the general behavior of the HFM, there is some high-spatial-frequency error evident throughout the domain in each reconstruction. Interestingly, the stability of the RNN portion of each model remains unaffected by this high-frequency noise, suggesting that it is due only to the transpose convolutional decoder. This is possibly a result of performing a strided transpose convolution at each layer of the decoder. It is possible, and perhaps beneficial, to include a final undilated convolutional layer with a single feature map to filter some of the high-frequency reconstruction noise.
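Such a filtering layer could be appended to the decoder as a stride-1, undilated convolution with a single feature map; a hypothetical Keras-style sketch (not part of the trained models reported here) is:

```python
import tensorflow as tf

# Hypothetical smoothing layer appended after the decoder's last transpose convolution.
smoothing = tf.keras.layers.Conv2D(
    filters=1,              # single feature map: the reconstructed field
    kernel_size=3,
    strides=1,
    dilation_rate=1,        # undilated
    padding="same",
    activation=None,
)
# decoded_smooth = smoothing(decoder_output)   # decoder_output: (batch, Nx, Ny, channels)
```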

6 CONCLUSIONS

In this work we propose a completely data-driven nonlinear reduced order model based on a convolutional recurrent autoencoder architecture for application to parameter-varying systems and systems requiring long-term stability. The construction of the convolutional recurrent autoencoder consists of two major components, each of which performs a key task in projection-based reduced order modeling. First, a convolutional autoencoder is designed to identify a low-dimensional representation of two-dimensional input data in terms of intrinsic coordinates on some low-dimensional manifold embedded in the original, high-dimensional space. This is done by considering a 4-layer convolutional encoder which computes a hierarchy of localized, location-invariant features that are passed to a two-layer fully-connected encoder. The result is a mapping from the high-dimensional input space to a low-dimensional data-supporting manifold. An equivalent decoder architecture is considered for efficiently mapping from the low-dimensional representation back to the original space. This can be intuitively understood

FIGURE 14 𝑢(𝑥, 𝑦, 𝑡) contours of the lid-driven cavity flow at 𝑡 = 250 𝑠 using the optimal POD reconstruction: (a) 𝑁ℎ = 8
(note: 𝑡 = 60 shown, right before blowup), (b) 𝑁ℎ = 16, (c) 𝑁ℎ = 64, (d) true solution; and predicted contours using the
convolutional recurrent autoencoder model with hidden state sizes (e) 𝑁ℎ = 8, (f) 𝑁ℎ = 16, (g) 𝑁ℎ = 64, (h) true solution.

as a nonlinear generalization of POD, where the structure of the manifold is more expressive than the linear subspaces learned
by POD-based methods. The second important component of the proposed convolutional recurrent autoencoder is a modified
version of an LSTM network which models the dynamics on the manifold learned by the autoencoder. The LSTM network is
modified to require only information from the low-dimensional representation thereby avoiding costly reconstruction of the full
state at every evolution step.
An offline training and online prediction strategy for the convolutional recurrent autoencoder is also proposed in this work. The training algorithm exploits the modularity of the model by splitting each forward pass into two steps. The first step runs a forward pass on the autoencoder while creating a temporary batch of target low-dimensional representations, which are then used in the second step, the forward pass of the modified LSTM network. The backward pass, or parameter update, is then performed jointly, equally weighting the autoencoder reconstruction error and the prediction error of the modified LSTM network.
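A minimal sketch of this two-step forward pass and joint update is shown below; `encoder`, `decoder`, and `latent_lstm` stand in for the model components described above, and the `rollout` interface of the latent LSTM is a hypothetical convenience rather than the exact implementation used in this work.

```python
import tensorflow as tf

optimizer = tf.keras.optimizers.Adam(1e-4)

def train_step(batch):
    """batch: float tensor of shape (B, Nt, Nx, Ny, 1) of scaled snapshots."""
    B, Nt = batch.shape[0], batch.shape[1]
    frames = tf.reshape(batch, (B * Nt, *batch.shape[2:]))        # fold time into the batch axis
    with tf.GradientTape() as tape:
        h = encoder(frames)                                       # step 1: target latent codes
        recon = decoder(h)
        h_seq = tf.reshape(h, (B, Nt, -1))
        # step 2: roll the modified LSTM forward from the first latent code
        h_pred = latent_lstm.rollout(h_seq[:, 0], steps=Nt - 1)   # (B, Nt-1, Nh)
        loss = (tf.reduce_mean(tf.square(recon - frames))             # reconstruction error
                + tf.reduce_mean(tf.square(h_pred - h_seq[:, 1:])))   # latent prediction error
    variables = (encoder.trainable_variables + decoder.trainable_variables
                 + latent_lstm.trainable_variables)
    optimizer.apply_gradients(zip(tape.gradient(loss, variables), variables))
    return loss
```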
We demonstrated our approach on three illustrative nonlinear model reduction examples. The first emphasizes the expressive power of fully-connected autoencoders equipped with nonlinear activation functions in performing model reduction tasks, in contrast to POD-based methods. The second highlights the performance of the convolutional recurrent autoencoder, and in particular its location-invariant properties, in parametric model reduction with initial conditions exhibiting large parameter variations. The final example demonstrates the stability of convolutional recurrent autoencoders when performing long-term predictions of chaotic incompressible flows. Collectively, these numerical examples show that our convolutional recurrent autoencoder model outperforms traditional POD-Galerkin ROMs in terms of prediction quality, robustness to parameter variations, and stability, while also offering other advantages such as location-invariant feature learning and non-intrusiveness. In fact, although in this work we make use of canonical model reduction examples based on computational physics problems, our approach is completely general and can be applied to arbitrary high-dimensional spatiotemporal data. When compared to existing autoencoder-based reduced order modeling strategies, our model provides access to larger-sized problems while keeping the number of trainable parameters low compared to fully-connected autoencoders.

6.1 Future work


This work shows the feasibility of using deep learning-based strategies for performing nonlinear model reduction and, more generally, modeling complex dynamical systems in a completely data-driven and non-intrusive manner. Although this work presents

FIGURE 15 𝑣(𝑥, 𝑦, 𝑡) contours of the lid-driven cavity flow at 𝑡 = 250 𝑠 using the optimal POD reconstruction: (a) 𝑁ℎ = 8
(note: 𝑡 = 60 shown, right before blowup), (b) 𝑁ℎ = 16, (c) 𝑁ℎ = 64, (d) true solution; and predicted contours using the
convolutional recurrent autoencoder model with hidden state sizes (e) 𝑁ℎ = 8, (f) 𝑁ℎ = 16, (g) 𝑁ℎ = 64, (h) true solution.

promising predictive results for both parameter-varying model reduction problems and problems requiring long-term stability, these methods remain in their infancy and their full capabilities are yet unknown. There are multiple directions in which this work can be extended. One such direction is improving the design of the convolutional transpose decoder. As it stands, the main source of error in our results is high-frequency in nature and appears only during the decoding phase. Considering this, future decoder designs could include more efficient filtering strategies. Another possible direction is in the dynamic modeling of the low-dimensional representations. In this work, we considered samples with spatial parameter variations and thus the design of the LSTM network could remain unchanged. However, there is potential for deep learning-based dynamic modeling approaches that exploit the multi-scale phenomena inherent in many physical systems. Finally, a much more challenging problem is the reconciliation of deep learning-based performance gains with physical intuition. This issue permeates all fields where deep learning has made an impact: what is it actually doing? Developing our understanding of deep learning-based modeling strategies can potentially provide us with deeper insight into the dynamics inherent in a physical system.

ACKNOWLEDGMENTS

This material is based upon work supported by the Air Force Office of Scientific Research under Grant No. FA9550-17-1-
0203. Simulations and model training were also made possible in part by an exploratory award from the Blue Waters sustained-
petascale computing project, which is supported by the National Science Foundation (awards OCI-0725070 and ACI-1238993)
and the state of Illinois. Blue Waters is a joint effort of the University of Illinois at Urbana-Champaign and its National Center
for Supercomputing Applications.

References

1. Raissi M, others . Machine learning of linear differential equations using Gaussian processes. Journal of Computational
Physics 2017; 348: 683–693. doi: 10.1016/j.jcp.2017.07.050

FIGURE 16 Vorticity contours of the lid-driven cavity flow at 𝑡 = 250 𝑠 using the optimal POD reconstruction: (a) 𝑁ℎ = 8
(note: 𝑡 = 60 shown, right before blowup), (b) 𝑁ℎ = 16, (c) 𝑁ℎ = 64, (d) true solution; and predicted contours using the
convolutional recurrent autoencoder model with hidden state sizes (e) 𝑁ℎ = 8, (f) 𝑁ℎ = 16, (g) 𝑁ℎ = 64, (h) true solution.

2. Brunton SL, Brunton BW, Proctor JL, Kaiser E, Kutz JN. Chaos as an intermittently forced linear system. Nature
Communications 2017; 8(1): 19. doi: 10.1038/s41467-017-00030-8

3. Bongard J, Lipson H. Automated reverse engineering of nonlinear dynamical systems. Proceedings of the National Academy
of Sciences 2007; 104(24): 9943–9948. doi: 10.1073/pnas.0609476104

4. Schaeffer H. Learning partial differential equations via data discovery and sparse optimization. Proceedings of the Royal
Society A: Mathematical, Physical and Engineering Science 2017; 473(2197): 20160446. doi: 10.1098/rspa.2016.0446

5. Tran G, Ward R. Exact Recovery of Chaotic Systems from Highly Corrupted Data. Multiscale Modeling & Simulation
2017; 15(3): 1108–1129. doi: 10.1137/16M1086637

6. Raissi M, Perdikaris P, Karniadakis GE. Inferring solutions of differential equations using noisy multi-fidelity data. Journal
of Computational Physics 2016; 335: 736–746. doi: 10.1016/j.jcp.2017.01.060

7. Brunton SL, Proctor JL, Kutz JN. Discovering governing equations from data by sparse identification of nonlinear dynamical
systems. Proceedings of the National Academy of Sciences 2016; 113(15): 3932–3937. doi: 10.1073/pnas.1517384113

8. San O, Maulik R. Neural network closures for nonlinear model order reduction. 2017; 1(405): 1–33.

9. Benosman M, Borggaard J, San O, Kramer B. Learning-based robust stabilization for reduced-order models of 2D and 3D
Boussinesq equations. Applied Mathematical Modelling 2017; 49: 162–181. doi: 10.1016/j.apm.2017.04.032

10. Wang Z, others . Model identification of reduced order fluid dynamics systems using deep learning. International Journal
for Numerical Methods in Fluids 2017(July): 1–14. doi: 10.1002/fld.4416

11. Wang Q, Hesthaven JS, Ray D. Non-intrusive reduced order modeling of unsteady flows using artificial neural networks
with application to a combustion problem. 2018.

12. Kani JN, Elsheikh AH. DR-RNN: A deep residual recurrent neural network for model reduction. arXiv 2017.

13. Otto SE, Rowley CW. Linearly-Recurrent Autoencoder Networks for Learning Dynamics. arXiv 2017: 1–37.

FIGURE 17 The evolution of the instantaneous turbulent kinetic energy for the lid-driven cavity flow from the DNS (thick grey
lines), standard POD-based Galerkin ROMs (blue dashed lines), and our method (solid black lines).

14. Benner P, Gugercin S, Willcox K. A Survey of Projection-Based Model Reduction Methods for Parametric Dynamical
Systems. SIAM Review 2015; 57(4): 483–531. doi: 10.1137/130932715
15. Carlberg K, Bou-Mosleh C, Farhat C. Efficient non-linear model reduction via a least-squares Petrov-Galerkin projection and
compressive tensor approximations. International Journal for Numerical Methods in Engineering 2011; 86(2): 155–181.
doi: 10.1002/nme.3050
16. Parish EJ, Duraisamy K. A paradigm for data-driven predictive modeling using field inversion and machine learning. Journal
of Computational Physics 2016; 305: 758–774. doi: 10.1016/j.jcp.2015.11.012
17. Lumley JL. Stochastic Tools in Turbulence. Elsevier . 1970.
18. Holmes P, Lumley JL, Berkooz G. Turbulence, Coherent Structures, Dynamical Systems and Symmetry. Cambridge:
Cambridge University Press . 1996
19. Bai Z. Krylov subspace techniques for reduced-order modeling of large-scale dynamical systems. Applied Numerical
Mathematics 2002; 43(1-2): 9–44. doi: 10.1016/S0168-9274(02)00116-2

FIGURE 18 PSD of the turbulent kinetic energy of the lid-driven cavity flow.

20. Schmid PJ. Dynamic mode decomposition of numerical and experimental data. Journal of Fluid Mechanics 2010; 656:
5–28. doi: 10.1017/S0022112010001217

21. Chaturantabut S, Sorensen DC. Nonlinear Model Reduction via Discrete Empirical Interpolation. SIAM Journal on
Scientific Computing 2010; 32(5): 2737–2764. doi: 10.1137/090766498

22. Carlberg K, Farhat C, Cortial J, Amsallem D. The GNAT method for nonlinear model reduction: Effective implementation
and application to computational fluid dynamics and turbulent flows. Journal of Computational Physics 2013; 242: 623–647.
doi: 10.1016/j.jcp.2013.02.028

23. Rewieński M, White J. Model order reduction for nonlinear dynamical systems based on trajectory piecewise-linear
approximations. Linear Algebra and its Applications 2006; 415(2-3): 426–454. doi: 10.1016/j.laa.2003.11.034

24. Trehan S, Durlofsky LJ. Trajectory piecewise quadratic reduced-order model for subsurface flow, with application to PDE-
constrained optimization. Journal of Computational Physics 2016; 326: 446–473. doi: 10.1016/j.jcp.2016.08.032

25. Balajewicz MJ, Dowell EH, Noack BR. Low-dimensional modelling of high-Reynolds-number shear flows incorporating constraints from the Navier–Stokes equation. Journal of Fluid Mechanics 2013; 729: 285–308. doi: 10.1017/jfm.2013.278

26. Hinton GE. Reducing the Dimensionality of Data with Neural Networks. Science 2006; 313(5786): 504–507. doi:
10.1126/science.1127647

27. Wang Y, Yao H, Zhao S. Auto-encoder based dimensionality reduction. Neurocomputing 2016; 184: 232–242. doi:
10.1016/j.neucom.2015.08.104

28. Hartman D, Mestha LK. A Deep Learning Framework for Model Reduction of Dynamical Systems. In: ; 2017: 1917–1922.

29. Krizhevsky A, Sutskever I, Hinton GE. ImageNet Classification with Deep Convolutional Neural Networks. Advances In
Neural Information Processing Systems 2012: 1–9.

30. Farabet C, Couprie C, Najman L, LeCun Y. Learning Hierarchical Features for Scene Labeling. IEEE Transactions on
Pattern Analysis and Machine Intelligence 2013; 35(8): 1915–1929. doi: 10.1109/TPAMI.2012.231

31. Hinton G, others . Deep Neural Networks for Acoustic Modeling in Speech Recognition: The Shared Views of Four Research
Groups. IEEE Signal Processing Magazine 2012; 29(6): 82–97. doi: 10.1109/MSP.2012.2205597

32. Xiong HY, others . The human splicing code reveals new insights into the genetic determinants of disease. Science 2015;
347(6218). doi: 10.1126/science.1254806

33. Leung MKK, Xiong HY, Lee LJ, Frey BJ. Deep learning of the tissue-regulated splicing code. Bioinformatics 2014; 30(12):
i121–i129. doi: 10.1093/bioinformatics/btu277

34. Goodfellow I, Bengio Y, Courville A. Deep Learning. MIT Press . 2016.

35. LeCun Y, Bengio Y, Hinton G. Deep learning. Nature 2015; 521(7553): 436–444. doi: 10.1038/nature14539

36. Kingma DP, Ba J. Adam: A Method for Stochastic Optimization. arXiv 2014: 1–15.

37. Zeiler MD. ADADELTA: An Adaptive Learning Rate Method. arXiv 2012.

38. Rumelhart DE, Hinton GE, Williams RJ. Learning representations by back-propagating errors. Nature 1986; 323(6088):
533–536. doi: 10.1038/323533a0

39. Werbos P. Backpropagation through time: what it does and how to do it. Proceedings of the IEEE 1990; 78(10): 1550–1560.
doi: 10.1109/5.58337

40. Bengio Y, Simard P, Frasconi P. Learning long-term dependencies with gradient descent is difficult. IEEE Transactions on
Neural Networks 1994; 5(2): 157–166. doi: 10.1109/72.279181

41. Hochreiter S, Schmidhuber J. Long Short-Term Memory. Neural Computation 1997; 9(8): 1735–1780. doi:
10.1162/neco.1997.9.8.1735

42. Cho K, others . Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation. In: Association for Computational Linguistics; 2014; Stroudsburg, PA, USA: 1724–1734

43. Yu F, Koltun V. Multi-Scale Context Aggregation by Dilated Convolutions. 2015. doi: 10.16373/j.cnki.ahr.150049

44. Li Y, Zhang X, Chen D. CSRNet: Dilated Convolutional Neural Networks for Understanding the Highly Congested Scenes.
2018. doi: 10.1109/CVPR.2018.00120

45. Bengio Y, Courville A, Vincent P. Representation Learning: A Review and New Perspectives. IEEE Transactions on Pattern
Analysis and Machine Intelligence 2013; 35(8): 1798–1828. doi: 10.1109/TPAMI.2013.50

46. Plaut E. From Principal Subspaces to Principal Components with Linear Autoencoders. arXiv 2018: 1–6.

47. Ogunmolu O, Gu X, Jiang S, Gans N. Nonlinear Systems Identification Using Deep Dynamic Neural Networks. 2016.

48. Yeo K. Model-free prediction of noisy chaotic time series by deep learning. arXiv 2017(10): 1–5.

49. Li Q, Dietrich F, Bollt EM, Kevrekidis IG. Extended dynamic mode decomposition with dictionary learning: A data-driven
adaptive spectral decomposition of the koopman operator. Chaos 2017; 27(10). doi: 10.1063/1.4993854

50. Clevert DA, Unterthiner T, Hochreiter S. Fast and Accurate Deep Network Learning by Exponential Linear Units (ELUs).
2015: 1–14. doi: 10.3233/978-1-61499-672-9-1760

51. Ma C, Huang Jb, Yang X, Yang Mh. Hierarchical Convolutional Features for Visual Tracking. In: IEEE; 2015: 3074–3082

52. Noh H, Hong S, Han B. Learning deconvolution network for semantic segmentation. Proceedings of the IEEE International
Conference on Computer Vision 2015; 2015 Inter: 1520–1528. doi: 10.1109/ICCV.2015.178

53. Badrinarayanan V, Kendall A, Cipolla R. SegNet: A Deep Convolutional Encoder-Decoder Architecture for Image
Segmentation. IEEE Transactions on Pattern Analysis and Machine Intelligence 2017; 39(12): 2481–2495. doi:
10.1109/TPAMI.2016.2644615

54. Lipton ZC, Berkowitz J, Elkan C. A Critical Review of Recurrent Neural Networks for Sequence Learning. Proceedings of
the ACM International Conference on Multimedia - MM ’14 2015: 675–678.

55. Sutskever I, Vinyals O, Le QV. Sequence to Sequence Learning with Neural Networks. arXiv 2014: 1–9.

56. Wu Y, others . Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation.
arXiv 2016: 1–23. doi: abs/1609.08144

57. Abadi M, others . TensorFlow: A system for large-scale machine learning. 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI '16) 2016: 265–284.
