Linearly Recurrent Autoencoder Networks For Learning Dynamics
Abstract. This paper describes a method for learning low-dimensional approximations of nonlinear dynamical
systems, based on neural-network approximations of the underlying Koopman operator. Extended
Dynamic Mode Decomposition (EDMD) provides a useful data-driven approximation of the Koop-
man operator for analyzing dynamical systems. This paper addresses a fundamental problem asso-
ciated with EDMD: a trade-off between representational capacity of the dictionary and over-fitting
due to insufficient data. A new neural network architecture combining an autoencoder with linear
recurrent dynamics in the encoded state is used to learn a low-dimensional and highly informative
Koopman-invariant subspace of observables. A method is also presented for balanced model reduction of over-specified EDMD systems in feature space. Nonlinear reconstruction using partially linear
multi-kernel regression aims to improve reconstruction accuracy from the low-dimensional state when
the data has complex but intrinsically low-dimensional structure. The techniques demonstrate the
ability to identify Koopman eigenfunctions of the unforced Duffing equation, create accurate low-
dimensional models of an unstable cylinder wake flow, and make short-time predictions of the chaotic
Kuramoto-Sivashinsky equation.
Key words. nonlinear systems, high-dimensional systems, reduced-order modeling, neural networks, data-
driven analysis, Koopman operator
1. Introduction. The Koopman operator, first introduced in [11], describes how Hilbert
space functions on the state of a dynamical system evolve in time. These functions, referred to
as observables, may correspond to measurements taken during an experiment or the output of
a simulation. This makes the Koopman operator a natural object to consider for data-driven
analysis of dynamical systems. Such an approach is also appealing because the Koopman
operator is linear, though infinite dimensional, enabling the concepts of modal analysis for
linear systems to be extended to dynamics of observables in nonlinear systems. Hence, the
invariant subspaces and eigenfunctions of the Koopman operator are of particular interest and
provide useful features for describing the system if they can be found. For example, level sets
of Koopman eigenfunctions may be used to form partitions of the phase space into ergodic
sets along with periodic and wandering chains of sets [2]. They allow us to parameterize limit
cycles and tori as well as their basins of attraction. The eigenvalues allow us to determine
the stability of these structures and the frequencies of periodic and quasiperiodic attractors
[20]. Furthermore, by projecting the full state as an observable onto the eigenfunctions of
the Koopman operator, it is decomposed into a linear superposition of components called
Koopman modes which each have a fixed frequency and rate of decay. Koopman modes
therefore provide useful coherent structures for studying the system’s evolution and dominant
pattern-forming behaviors. This has made the Koopman operator a particularly useful object
∗ Submitted to the editors January 17, 2019.
Funding: This work was funded by ARO award W911NF-17-0512 and DARPA.
† Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ ([email protected]).
‡ Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ ([email protected]).
of study for high-dimensional spatiotemporal systems like unsteady fluid dynamics, beginning
with the work of Mezić on spectral properties of dynamical systems [19], followed by Rowley [27] and
Schmid [28] on the Dynamic Mode Decomposition (DMD) algorithm. Rowley, recognizing
that DMD furnishes an approximation of the Koopman operator and its modes, applied the
technique to data collected by simulating a jet in a crossflow. The Koopman modes identified
salient patterns of spatially coherent structure in the flow which evolved at fixed frequencies.
The Extended Dynamic Mode Decomposition (EDMD) [33] is an algorithm for approx-
imating the Koopman operator on a dictionary of observable functions using data. If a
Koopman-invariant subspace is contained in the span of the observables included in the dic-
tionary, then as long as enough data is used, the representation on this subspace will be exact.
EDMD is a Galerkin method with a particular data-driven inner product as long as enough
data is used. Specifically, this will be true as long as the rank of the data matrix is the same
as the dimension of the subspace spanned by the (nonlinear) observables [25]. However, the
choice of dictionary is ad hoc, and it is often not clear how to choose a dictionary that is
sufficiently rich to span a useful Koopman-invariant subspace. One might then be tempted
to consider a very large dictionary, with enough capacity to represent any complex-valued
function on the state space to within an ε tolerance. However, such a dictionary has combina-
torial growth with the dimension of the state space and would be enormous for even modestly
high-dimensional problems.
One approach to mitigate the cost of large or even infinite dictionaries is to formulate
EDMD as a kernel method referred to as KDMD [34]. However, we are still essentially
left with the same problem of deciding which kernel function to use. Furthermore, if the
kernel or EDMD feature space is endowed with too much representational capacity (a large
dictionary), the algorithm will over-fit the data (as we shall demonstrate with a toy problem in
Example 2.1). EDMD and KDMD also identify a number of eigenvalues, eigenfunctions, and
modes which grows with the size of the dictionary. If we want to build reduced order models
of the dynamics, a small collection of salient modes or a low-dimensional Koopman invariant
subspace must be identified. It is worth mentioning two related algorithms for identifying low-
rank approximations of the Koopman operator. Optimal Mode Decomposition (OMD) [35]
finds the optimal orthogonal projection subspace of user-specified rank for approximating the
Koopman operator. Sparsity-promoting DMD [10] is a post-processing method which identifies
the optimal amplitudes of Koopman modes for reconstructing the snapshot sequence with an ℓ_1
penalty. The sparsity-promoting penalty picks only the most salient Koopman modes to have
nonzero amplitudes. Another related scheme is Sparse Identification of Nonlinear Dynamics
(SINDy) [1] which employs a sparse regression penalty on the number of observables used
to approximate nonlinear evolution equations. By forcing the dictionary to be sparse, the
over-fitting problem is reduced.
In this paper, we present a new technique for learning a very small collection of informative
observable functions spanning a Koopman invariant subspace from data. Two neural networks
in an architecture similar to an under-complete autoencoder [7] represent the collection of ob-
servables together with a nonlinear reconstruction of the full state from these features. A
learned linear transformation evolves the function values in time as in a recurrent neural net-
work, furnishing our approximation of the Koopman operator on the subspace of observables.
This approach differs from recent efforts that use neural networks to learn dictionaries for
EDMD [37, 13] in that we employ a second neural network to reconstruct the full state. Ours
and concurrent approaches utilizing nonlinear decoder neural networks [30, 16] enable learning
of very small sets of features that carry rich information about the state and evolve linearly in
time. Previous methods for data-driven analysis based on the Koopman operator utilize linear
state reconstruction via the Koopman modes. Therefore they rely on an assumption that the
full state observable is in the learned Koopman invariant subspace. Nonlinear reconstruction
is advantageous since it relaxes this strong assumption, allowing recent techniques to recover
more information about the state from fewer observables. By minimizing the state recon-
struction error over several time steps into the future, our architecture aims to detect highly
observable features even if they have small amplitudes. This is the case in non-normal linear
systems, for instance as arise in many fluid flows (in particular, shear flows [29]), in which
small disturbances can siphon energy from mean flow gradients and excite larger-amplitude
modes. The underlying philosophy of our approach is similar to Visual Interaction Networks
(VINs) [32] that learn physics-based dynamics models for encoded latent variables.
Deep neural networks have gained attention over the last decade due to their ability to
efficiently represent complicated functions learned from data. Since each layer of the network
performs simple operations on the output of the previous layer, a deep network can learn and
represent functions corresponding to high-level or abstract features. For example, your visual
cortex assembles progressively more complex information sequentially from retinal intensity
values to edges, to shapes, all the way up to the facial features that let you recognize your
friend. By contrast, shallow networks — though still universal approximators — require ex-
ponentially more parameters to represent classes of natural functions like polynomials [23, 14]
or the presence of eyes in a photograph. Function approximation using linear combinations
of preselected dictionary elements is somewhat analogous to a shallow neural network where
capacity is built by adding more functions. We therefore expect deep neural networks to
represent certain complex nonlinear observables more efficiently than a large, shallow dictio-
nary. Even with the high representational capacity of our deep neural networks, the proposed
technique is regularized by the small number of observables we learn and is therefore unlikely
to over-fit the data.
We also present a technique for constructing reduced order models in nonlinear feature
space from over-specified KDMD models. Recognizing that the systems identified from data
by EDMD/KDMD can be viewed as state-space systems where the output is a reconstruction
of the full state using Koopman modes, we use Balanced Proper Orthogonal Decomposition
(BPOD) [24] to construct a balanced reduced-order model. The resulting model consists of
only those nonlinear features that are most excited and observable over a finite time horizon.
Nonlinear reconstruction of the full state is introduced in order to account for complicated,
but intrinsically low-dimensional data. In this way, the method is analogous to an autoencoder
where the nonlinear decoder is learned separately from the encoder and dynamics.
Finally, the two techniques we introduce are tested on a range of example problems. We
first investigate the eigenfunctions learned by the autoencoder and the KDMD reduced-order
model by identifying and parameterizing basins of attraction for the unforced Duffing equation.
The prediction accuracy of the models is then tested on a high-dimensional cylinder wake flow
problem. Finally, we see if the methods can be used to construct reduced order models for
the short-time dynamics of the chaotic Kuramoto-Sivashinsky equation. Several avenues for
future work and extensions of our proposed methods are discussed in the conclusion.
2. Extended Dynamic Mode Decomposition. Before discussing the new method for ap-
proximating the Koopman operator, it will be beneficial to review the formulation of Extended
Dynamic Mode Decomposition (EDMD) [33] and its kernel variant KDMD [34]. Besides pro-
viding the context for developing the new technique, it will be useful to compare our results
to those obtained using reduced order KDMD models.
2.1. The Koopman operator and its modes. Consider a discrete-time autonomous dy-
namical system on the state space M ⊂ Rn given by the function xt+1 = f (xt ). Let F be a
Hilbert space of complex-valued functions on M. We refer to elements of F as observables.
The Koopman operator acts on an observable ψ ∈ F by composition with the dynamics:
(1) Kψ = ψ ◦ f .
It is easy to see that the Koopman operator is linear; however, the Hilbert space F on which
it acts is often infinite dimensional.1 Since the operator K is linear, it may have eigenvalues
and eigenfunctions. If a given observable lies within the span of these eigenfunctions, then
we can predict the time evolution of the observable’s values, as the state evolves according to
the dynamics. Let g : M → C^{N_0} be a vector-valued observable whose components are in the
span of the Koopman eigenfunctions. The vector-valued coefficients needed to reconstruct g
in a Koopman eigenfunction basis are called the Koopman modes associated with g.
In particular, the dynamics of the original system can be recovered by taking the observ-
able g to be the full-state observable defined by g(x) = x. Assume K has eigenfunctions
{ϕ1 , . . . , ϕK } with corresponding eigenvalues {µ1 , . . . , µK }, and suppose the components of
the vector-valued function g lie within the span of {ϕk }. The Koopman modes ξ k are then
defined by

(2) x = \sum_{k=1}^{K} ξ_k ϕ_k(x).
The entire orbit of an initial point x0 may thus be determined by evaluating the eigenfunc-
tions at x0 and evolving the coefficients ξ k in time by multiplying by the eigenvalues. The
eigenfunctions ϕk are intrinsic features of the dynamical system which decompose the state
dynamics into a linear superposition of autonomous first-order systems. The Koopman modes
ξ k depend on the coordinates we use to represent the dynamics, and allow us to reconstruct
the dynamics in those coordinates.
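As a concrete illustration of the mode expansion (2) and its use for prediction, the following NumPy sketch evaluates the eigenfunctions once at the initial condition and evolves the coefficients by powers of the eigenvalues; the function and variable names are illustrative, not from a reference implementation.

```python
import numpy as np

def predict_orbit(x0, eigvals, eigfuns, modes, n_steps):
    """Predict an orbit via the Koopman mode expansion (2).

    eigvals : (K,) complex array of Koopman eigenvalues mu_k
    eigfuns : list of K callables evaluating the eigenfunctions phi_k at a state
    modes   : (n, K) complex array whose columns are the Koopman modes xi_k
    """
    # The eigenfunctions only ever need to be evaluated at the initial state.
    phi0 = np.array([phi(x0) for phi in eigfuns])
    orbit = []
    for t in range(n_steps):
        # x_t = sum_k xi_k * mu_k^t * phi_k(x_0)
        orbit.append((modes * (eigvals ** t) * phi0).sum(axis=1))
    return np.real(np.array(orbit))
```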
¹ One must also be careful about the choice of the space F, since ψ ◦ f must also lie in F for any ψ ∈ F. It
is common, especially in the ergodic theory literature, to assume that M is a measure space and f is measure
preserving. In this case, this difficulty goes away: one lets F = L2 (M), and since f is measure preserving, it
follows that K is an isometry.
2.2. Approximating Koopman on an explicit dictionary with EDMD. The aim of EDMD
is to approximate the Koopman operator using data snapshot pairs taken from the system
{(x_j, y_j)}_{j=1}^{M}, where y_j = f(x_j). For convenience, we organize these data into matrices

(4) X = \begin{bmatrix} x_1 & x_2 & \cdots & x_M \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 & y_2 & \cdots & y_M \end{bmatrix}.
Regardless of how the above observables are chosen, the matrix K that minimizes (9) is given by

(10) K = G^+ A, \qquad G = \frac{1}{M}\sum_{i=1}^{M} Ψ(x_i)Ψ(x_i)^*, \qquad A = \frac{1}{M}\sum_{i=1}^{M} Ψ(x_i)Ψ(y_i)^*,
in feature space i.e., after applying the now only hypothetical feature map Ψ to the snapshots.
We will see that the final results of this approach make reference only to inner products
Ψ(x)∗ Ψ(z) which will be defined using a suitable non-negative definite kernel function k(x, z).
Choice of such a kernel function implicitly defines the corresponding dictionary via Mercer’s
theorem. By employing simply-defined kernel functions, the inner products are evaluated at
a lower computational cost than would be required to evaluate a high or infinite dimensional
feature map and compute inner products in the feature space explicitly.
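Before turning to the kernel formulation, the explicit-dictionary computation in (10) is simple enough to sketch directly; the NumPy fragment below assumes the feature data matrices have already been formed, and the helper name is hypothetical.

```python
import numpy as np

def edmd_matrix(Psi_X, Psi_Y):
    """EDMD approximation (10) of the Koopman operator on an explicit dictionary.

    Psi_X, Psi_Y : arrays of shape (N_dict, M) whose columns are Psi(x_j) and Psi(y_j).
    Returns K such that Psi_Y^* is approximated by Psi_X^* K in the least-squares sense.
    """
    M = Psi_X.shape[1]
    G = Psi_X @ Psi_X.conj().T / M   # Gram matrix of the dictionary on the data
    A = Psi_X @ Psi_Y.conj().T / M   # cross matrix between current and advanced snapshots
    return np.linalg.pinv(G) @ A     # K = G^+ A
```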
The total empirical error for EDMD (9) can be written as the Frobenius norm

(12) J = \left\| Ψ_Y^* - Ψ_X^* K \right\|_F^2.
Let us consider an economy sized SVD ΨX = UΣV∗ , the existence of which is guaranteed by
the finite rank r of our feature data matrix. In (12) we see that any components of the range
R(K) orthogonal to R(ΨX ) are annihilated by Ψ∗X and cannot be inferred from the data.
We therefore restrict the dictionary to those features which can be represented in the range
of the feature space data FD = {ψa = Ψ∗ a : a ∈ R(U)} and represent K = UK̂U∗ for some
matrix K̂ ∈ Cr×r . After some manipulation, it can be shown that minimizing the empirical
error (12) with respect to K̂ is equivalent to minimizing
(13) J' = \left\| \left( V^* Ψ_Y^* - Σ K̂ U^* \right) A \right\|_F^2.
Regardless of how the columns of A are chosen, as long as R(A) = R(U) the minimum norm solution for the KDMD matrix is

(14) K̂ = Σ^+ V^* Ψ_Y^* Ψ_X V Σ^+.
Each component in the above KDMD approximation can be found entirely in terms of in-
ner products in the feature space, enabling the use of a kernel function to implicitly define
the feature space. The two matrices whose entries are [K_{XX}]_{ij} = Ψ(x_i)^* Ψ(x_j) = k(x_i, x_j) and [K_{YX}]_{ij} = [Ψ_Y^* Ψ_X]_{ij} = Ψ(y_i)^* Ψ(x_j) = k(y_i, x_j) are computed using the kernel. The Hermitian eigenvalue decomposition K_{XX} = V Σ^2 V^* provides the matrices V and Σ.
It is worth pointing out that the EDMD and KDMD solutions (10) and (14) can be
regularized by truncating the rank r of the SVD ΨX = UΣV∗ . In EDMD, we recognize
that G = \frac{1}{M} Ψ_X Ψ_X^* = \frac{1}{M} U Σ^2 U^* is a Hermitian eigendecomposition. Before finding the
pseudoinverse, the rank is truncated to remove the dyadic components having small singular
values.
2.4. Computing Koopman eigenvalues, eigenfunctions, and modes. Suppose that ϕ =
Ψ∗ w is an eigenvector of the Koopman operator in the span of the dictionary with eigenvalue
µ. Suppose also that w = Uŵ is in the span of the data in feature space. From Kϕ = µϕ
it follows that Ψ_Y^* w = µ Ψ_X^* w by substituting all of the snapshot pairs. Left-multiplying by \frac{1}{M} Ψ_X and taking the pseudoinverse, we obtain (G^+ A)w = µ(G^+ G)w = µw, where the
second equality holds because w ∈ R(ΨX ). Therefore, w is an eigenvector with eigenvalue
µ of the EDMD matrix (10). In terms of the coefficients ŵ, we have Ψ_Y^* Uŵ = µ Ψ_X^* Uŵ, which upon substituting the definition Ψ_X = UΣV^* gives Ψ_Y^* Ψ_X V Σ^+ ŵ = µ V Σ ŵ. From the previous statement it is evident that Σ^+ V^* Ψ_Y^* Ψ_X V Σ^+ ŵ = K̂ ŵ = µ ŵ. Hence, ŵ is
an eigenvector of the KDMD matrix (14) with eigenvalue µ. Unfortunately, the converses of
these statements do not hold. Nonetheless, approximations of Koopman eigenfunctions,
(15) ϕ(x) = Ψ(x)^* w = Ψ(x)^* Ψ_X V Σ^+ ŵ,

are formed using the right eigenvectors w and ŵ of K and K̂ respectively. In (15) the inner products Ψ(x)^* Ψ_X can be found by evaluating the kernel function between x and each point in the training data {x_j}_{j=1}^{M}, yielding a row vector.
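A minimal sketch of this kernel-based computation is given below: it assembles K̂ from the kernel matrices as in (14) and evaluates eigenfunctions as in (15). The function names and the default rank tolerance are illustrative assumptions.

```python
import numpy as np

def kdmd(KXX, KYX, rank=None):
    """Assemble the KDMD matrix (14) from kernel matrices and return its eigendecomposition.

    KXX[i, j] = k(x_i, x_j) and KYX[i, j] = k(y_i, x_j).
    """
    sig2, V = np.linalg.eigh(KXX)                    # K_XX = V Sigma^2 V^*
    order = np.argsort(sig2)[::-1]
    sig2, V = sig2[order], V[:, order]
    r = rank if rank is not None else np.sum(sig2 > 1e-12 * sig2[0])
    sig2, V = sig2[:r], V[:, :r]                     # truncating the rank regularizes the fit
    Sig_pinv = np.diag(1.0 / np.sqrt(sig2))
    K_hat = Sig_pinv @ V.conj().T @ KYX @ V @ Sig_pinv
    mu, W_hat = np.linalg.eig(K_hat)                 # Koopman eigenvalue approximations
    return mu, W_hat, V, Sig_pinv

def eigenfunctions_at(k_x_row, V, Sig_pinv, W_hat):
    """Evaluate (15): phi(x) = Psi(x)^* Psi_X V Sigma^+ w_hat, where k_x_row = [k(x, x_j)]_j."""
    return k_x_row @ V @ Sig_pinv @ W_hat
```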
The Koopman modes {ξ_k}_{k=1}^{r} associated with the full state observable reconstruct the state as a linear combination of Koopman eigenfunctions. They can be found from the provided training data using a linear regression process. Let us define the matrices

(16) Ξ = \begin{bmatrix} ξ_1 & ξ_2 & \cdots & ξ_r \end{bmatrix}, \qquad Φ_X = \begin{bmatrix} ϕ_1(x_1) & \cdots & ϕ_1(x_M) \\ \vdots & \ddots & \vdots \\ ϕ_r(x_1) & \cdots & ϕ_r(x_M) \end{bmatrix} = W_R^T Ψ_X = Ŵ_R^T Σ V^T,
containing the Koopman modes and eigenfunction values at the training points. In the above,
WR and ŴR are the matrices whose columns are the right eigenvectors of K and K̂ respec-
tively. Seeking to linearly reconstruct the state from the eigenfunction values at each training
point, the regression problem,
(17) \underset{Ξ \in \mathbb{C}^{n \times r}}{\text{minimize}} \; \| X - Ξ Φ_X \|_F^2,
where WL and ŴL are the left eigenvector matrices of K and K̂ respectively. These matrices
must be suitably normalized so that the left and right eigenvectors form bi-orthonormal sets
must be suitably normalized so that the left and right eigenvectors form bi-orthonormal sets W_L^* W_R = I_r and Ŵ_L^* Ŵ_R = I_r.
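The regression (17) admits a simple least-squares solution via the pseudoinverse of the eigenfunction data, sketched below; the function name is illustrative, and a regularized solver could equally be substituted.

```python
import numpy as np

def koopman_modes(X, Phi_X):
    """Solve (17): minimize ||X - Xi Phi_X||_F over Xi.

    X     : (n, M) matrix of training states
    Phi_X : (r, M) matrix of eigenfunction values at the training points
    """
    # Minimum-norm least-squares solution Xi = X Phi_X^+; columns are Koopman modes.
    return X @ np.linalg.pinv(Phi_X)
```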
2.5. Drawbacks of EDMD. One of the drawbacks of EDMD and KDMD is that the ac-
curacy depends on the chosen dictionary. For high-dimensional data sets, constructing and
evaluating an explicit dictionary becomes prohibitively expensive. Though the kernel method
allows us to use high-dimensional dictionaries implicitly, the choice of kernel function signifi-
cantly impacts the results. In both techniques, higher resolution is achieved directly by adding
more dictionary elements. Therefore, enormous dictionaries are needed in order to represent
complex features. The shallow representation of features in terms of linear combinations of
dictionary elements means that the effective size of the dictionary must be limited by the rank
of the training data in feature space. As one increases the resolution of the dictionary, the
rank r of the feature space data ΨX grows and eventually reaches the number of points M
assuming the points are distinct. The number of data points therefore is an upper bound on
the effective number of features we can retain for EDMD or KDMD. This effective dictionary
selection is implicit when we truncate the SVD of G or ΨX . It is when r = M that we
have retained enough features to memorize the data set up to projection of ΨY onto R(ΨX ).
Consequently, over-fitting becomes problematic as we seek dictionaries with high enough res-
olution to capture complex features. We illustrate this problem with the following simple
example.
Example 2.1. Let us consider the linear dynamical system

(19) x_{t+1} = f(x_t) = \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} x_t

with x = [x_1, x_2]^T ∈ R^2. We construct this example to reflect the behavior of EDMD with rich dictionaries containing more elements than snapshots. Suppose that we have only two snapshot pairs,

(20) X = \begin{bmatrix} 1 & 1 \\ 1 & 0.5 \end{bmatrix}, \qquad Y = \begin{bmatrix} 1 & 1 \\ 0.5 & 0.25 \end{bmatrix},
taken by evolving the trajectory two steps from the initial condition x0 = [1, 1]T . Let us
define the following dictionary. Its first two elements are Koopman eigenfunctions whose
values are sufficient to describe the full state. In fact, EDMD recovers the original dynamics
perfectly from the given snapshots when we take only these first two observables. In this
example, we show that by including an extra, unnecessary observable we get a much worse
approximation of the dynamics. A third dictionary element which is not an eigenfunction
is included in order to demonstrate the effects of an overcomplete dictionary. With these
dictionary elements, the data matrices are
(21) Ψ(x) = \begin{bmatrix} x_1 \\ x_2 \\ (x_1)^2 + (x_2)^2 \end{bmatrix} \;\Longrightarrow\; Ψ_X = \begin{bmatrix} 1 & 1 \\ 1 & 0.5 \\ 2 & 1.25 \end{bmatrix}, \quad Ψ_Y = \begin{bmatrix} 1 & 1 \\ 0.5 & 0.25 \\ 1.25 & 1.0625 \end{bmatrix}.
Applying (10) we compute the EDMD matrix and its eigendecomposition,

(22) K = \begin{bmatrix} 0.9286 & -0.1071 & 0.7321 \\ -0.2143 & 0.1786 & -0.0536 \\ 0.1429 & 0.2143 & 0.2857 \end{bmatrix} \;\Longrightarrow\; µ_1 = 1.0413, \; µ_2 = 0, \; µ_3 = 0.3515.
It is easy to see that none of the eigenfunctions or eigenvalues are correct for the given system
even though the learned matrix satisfies ‖Ψ_Y^* − Ψ_X^* K‖_F < 6 × 10^{-15} with 16-digit precision computed with standard MATLAB tools. This shows that even with a single additional function in the dictionary, we have severely over-fit the data. This is surprising since our original dictionary included two eigenfunctions by definition. The nuance arises because EDMD is only
guaranteed to capture eigenfunctions ϕ(x) = wT Ψ(x) where w is in the span of the feature
space data R(ΨX ). In this example, the true eigenfunctions do not satisfy this condition; one
can check that neither w = [1, 0, 0]T nor w = [0, 1, 0]T is in R(ΨX ).
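The over-fitting in Example 2.1 is easy to reproduce numerically; the short NumPy sketch below assembles the matrices in (20)–(21), applies (10), and prints the near-zero residual alongside the spurious eigenvalues.

```python
import numpy as np

# Snapshot matrices (20) and three-element dictionary data (21)
X = np.array([[1.0, 1.0], [1.0, 0.5]])
Y = np.array([[1.0, 1.0], [0.5, 0.25]])
psi = lambda x: np.array([x[0], x[1], x[0]**2 + x[1]**2])
Psi_X = np.column_stack([psi(x) for x in X.T])
Psi_Y = np.column_stack([psi(y) for y in Y.T])

M = Psi_X.shape[1]
G = Psi_X @ Psi_X.T / M
A = Psi_X @ Psi_Y.T / M
K = np.linalg.pinv(G) @ A                        # EDMD matrix (10)

print(np.linalg.norm(Psi_Y.T - Psi_X.T @ K))     # ~1e-15: the snapshots are fit exactly
print(np.linalg.eigvals(K))                      # none match the true eigenvalues 1 and 0.5
```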
3. Recent approach for dictionary learning. Example 2.1 makes clear the importance of
choosing an appropriate dictionary prior to performing EDMD. In two recent papers [37, 13],
the universal function approximation property of neural networks was used to learn dictio-
naries for approximating the Koopman operator. A fixed number of observables making up
the dictionary are given by a neural network Ψ(x; θ) ∈ Rd parameterized by θ. The linear
operator KT ∈ Rd×d evolving the dictionary function values one time step into the future is
learned simultaneously through minimization of
(24) J(K, θ) = \sum_{i=1}^{M} \left\| Ψ(y_i; θ) - K^T Ψ(x_i; θ) \right\|^2 + Ω(K, θ).
S = D and the only way to increase resolution and feature complexity is to grow the dictionary
— leading to the over-fitting problems illustrated in Example 2.1. By contrast, the dictionary
learning approach enables us to keep the size of the dictionary relatively small while exploring
a much larger space S. In particular, the dictionary size is presumed to be much smaller
than the total number of training data points and probably small enough so that FX = FD .
Otherwise, the number of functions d learned by the network could be reduced so that this
becomes true. The small dictionary size therefore prevents the method from memorizing the
snapshot pairs without learning true invariant subspaces. This is not done at the expense of
resolution since the allowable complexity of functions in S is extremely high.
Deep neural networks are advantageous since they enable highly efficient representations
of certain natural classes of complex features [23, 14]. In particular, deep neural networks
are capable of learning functions whose values are built by applying many simple operations
in succession. It has been shown empirically that this is indeed an important and natural class of functions, since deep neural networks have recently enabled near-human-level performance on tasks like image and handwritten digit recognition [7]. This proved to be a useful property for dictionary learning for EDMD, since [37, 13] achieve state-of-the-art results on examples includ-
ing the Duffing equation, Kuramoto-Sivashinsky PDE, a system representing the glycolysis
pathway, and power systems.
4. New approach: deep feature learning using the LRAN. By removing the constraint
that the full state observable is in the learned Koopman invariant subspace, one can do even
better. This is especially important for high-dimensional systems where it would be pro-
hibitive to train such a large neural network-based dictionary with limited training data.
Furthermore, it may simply not be the case that the full state observable lies in a finite-
dimensional Koopman invariant subspace. The method described here is capable of learning
The technique includes a linear time evolution process given by the matrix K(θ K ) pa-
rameterized by θ K . This matrix furnishes our approximation of the Koopman operator on
the learned dictionary. Taking the eigendecomposition K = W_R Λ W_L^* allows us to compute
the Koopman eigenvalues, eigenfunctions, and modes exactly as we would for EDMD using
(15) and (18). By training the operator K simultaneously with the encoder and decoder net-
works, the dictionary of observables learned by the encoder is forced to span a low-dimensional
Koopman invariant subspace which is sufficiently informative to approximately reconstruct
the full state. In many real-world applications, the scientist has access to data sets consisting
of several sequential snapshots. The LRAN architecture shown in Figure 2 takes advantage of
longer sequences of snapshots during training. This is especially important when the system
dynamics are highly non-normal. In such systems, low-amplitude features which could oth-
erwise be ignored for reconstruction purposes are highly observable and influence the larger
amplitude dynamics several time-steps into the future. One may be able to achieve reasonable
accuracy on snapshot pairs by neglecting some of these low-energy modes, but accuracy will
suffer as more time steps are predicted. Inclusion of multiple time steps where possible forces
the LRAN to incorporate these dynamically important non-normal features in the dictionary.
As we will discuss later, it is possible to generalize the LRAN architecture to continuous time
systems with irregular sampling intervals and sequence lengths. It is also possible to restrict
the LRAN to the case when only snapshot pairs are available. Here we consider the case when
our data contains equally-spaced snapshot sequences {xt , xt+1 , . . . , xt+T −1 } of length T . The
loss function
(25) J(θ_enc, θ_dec, θ_K) = \underset{x_t,\ldots,x_{t+T-1} \sim \mathcal{P}_{data}}{E} \left[ \frac{1}{1+β} \sum_{τ=0}^{T-1} \frac{δ^τ}{N_1(δ)} \frac{\| x̂_{t+τ} - x_{t+τ} \|^2}{\| x_{t+τ} \|^2 + ε_1} + \frac{β}{1+β} \sum_{τ=1}^{T-1} \frac{δ^{τ-1}}{N_2(δ)} \frac{\| ẑ_{t+τ} - z_{t+τ} \|^2}{\| z_{t+τ} \|^2 + ε_2} \right] + Ω(θ_enc, θ_dec, θ_K)
is minimized during training, where E denotes the expectation over the data distribution. The encoding, latent state dynamics, and decoding processes are given by

z_{t+τ} = Ψ(x_{t+τ}; θ_enc), \qquad ẑ_{t+τ} = \left[ K(θ_K)^T \right]^{τ} z_t, \qquad x̂_{t+τ} = Ψ̃(ẑ_{t+τ}; θ_dec),

respectively. The regularization term Ω has been included for generality, though our numerical
experiments show that it was not necessary. Choosing a small dictionary size d provides
sufficient regularization for the network. The loss function (25) consists of a weighted average
of the reconstruction error and the hidden state time evolution error. The parameter β
determines the relative importance of these two terms. The errors themselves are relative
square errors between the predictions and the ground truth summed over time with a decaying
weight 0 < δ ≤ 1. This decaying weight is used to facilitate training by prioritizing short term
predictions. The corresponding normalizing constants,
N_1(δ) = \sum_{τ=0}^{T-1} δ^τ, \qquad N_2(δ) = \sum_{τ=1}^{T-1} δ^{τ-1},

ensure that the decay-weighted average is being taken over time. The small constants ε_1 and ε_2
are used to avoid division by 0 in the case that the ground truth values vanish. The expectation
value was estimated empirically using minibatches consisting of sequences of length T drawn
randomly from the training data. Stochastic gradient descent with the Adaptive Moment
Estimation (ADAM) method and slowly decaying learning rate was used to simultaneously
optimize the parameters θ enc , θ dec , and θ K in the open-source software package TensorFlow.
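The following NumPy sketch evaluates the loss (25) for a single snapshot sequence; in practice the quantity would be averaged over minibatches and differentiated automatically in a framework such as TensorFlow, and the default hyperparameter values shown are placeholders rather than those used in the experiments.

```python
import numpy as np

def lran_sequence_loss(x_seq, encoder, decoder, K, beta=1.0, delta=0.9,
                       eps1=1e-6, eps2=1e-6):
    """Evaluate the loss (25) on one equally spaced sequence x_seq of shape (T, n)."""
    T = x_seq.shape[0]
    z = np.array([encoder(x) for x in x_seq])        # true encoded trajectory
    z_hat = [z[0]]
    for _ in range(T - 1):                           # linear latent dynamics z -> K^T z
        z_hat.append(K.T @ z_hat[-1])
    z_hat = np.array(z_hat)
    x_hat = np.array([decoder(zh) for zh in z_hat])  # decoded state predictions

    N1 = sum(delta**tau for tau in range(T))
    N2 = sum(delta**(tau - 1) for tau in range(1, T))
    recon = sum(delta**tau / N1 * np.sum((x_hat[tau] - x_seq[tau])**2)
                / (np.sum(x_seq[tau]**2) + eps1) for tau in range(T))
    latent = sum(delta**(tau - 1) / N2 * np.sum((z_hat[tau] - z[tau])**2)
                 / (np.sum(z[tau]**2) + eps2) for tau in range(1, T))
    return recon / (1.0 + beta) + beta * latent / (1.0 + beta)
```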
4.1. Neural network architecture and initialization. The encoder and decoder consist of
deep neural networks whose schematic is sketched in Figure 3. The figure is only a sketch since
many more hidden layers were actually used in the architectures applied to example problems
in this paper. In order to achieve sufficient depth in the encoder and decoder networks, hidden
layers employed exponential linear units or “elu’s” as the nonlinearity [3]. These units mitigate
the problem of vanishing and exploding gradients in deep networks by employing the identity
function for all non-negative arguments. A shifted exponential function for negative arguments
is smoothly matched to the linear section at the origin, giving the activation function
(26) \mathrm{elu}(x) = \begin{cases} x & x \geq 0 \\ \exp(x) - 1 & x < 0 \end{cases}.
This prevents the units from “dying” as standard rectified linear units or “ReLU’s” do when
the arguments are always negative on the data. Furthermore, “elu’s” have the advantage of
being continuously differentiable. This will be a nice property if we want to approximate a C^1
data manifold whose chart map and its inverse are given by the encoder and decoder. If the
maps are differentiable, then the tangent spaces can be defined as well as push-forward, pull-
back, and connection forms. Hidden layers map the activations x(l) at layer l to activations at
the next layer l + 1 given by a linear transformation and subsequent element-wise application
of the activation function,
(27) x^{(l+1)} = \mathrm{elu}\left[ W^{(l)}(θ) x^{(l)} + b^{(l)}(θ) \right], \qquad W^{(l)}(θ) \in \mathbb{R}^{n_{l+1} \times n_l}, \; b^{(l)}(θ) \in \mathbb{R}^{n_{l+1}}.
The weight matrices W and vector biases b parameterized by θ are learned by the network
during training. The output layers L for both the encoder and decoder networks utilize linear
transformations without a nonlinear activation function:
(28) x^{(L)} = W^{(L-1)}(θ) x^{(L-1)} + b^{(L-1)}(θ), \qquad W^{(L-1)}(θ) \in \mathbb{R}^{n_L \times n_{L-1}}, \; b^{(L-1)}(θ) \in \mathbb{R}^{n_L},
where L = L_{enc} or L = L_{dec} is the last layer of the encoder or decoder with n_{L_{enc}} = d or n_{L_{dec}} = n respectively. This allows for smooth and consistent treatment of positive and
negative output values without limiting the flow of gradients back through the network.
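A bare-bones forward pass through such a network, matching (26)–(28), might look as follows; this is a sketch with simple Python lists as parameter containers, not the TensorFlow implementation used for the experiments.

```python
import numpy as np

def elu(x):
    # Exponential linear unit (26): identity for x >= 0, shifted exponential otherwise.
    return np.where(x >= 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mlp_forward(x, weights, biases):
    """Hidden layers (27) followed by a purely linear output layer (28).

    weights, biases : per-layer parameter lists; only the last layer omits the nonlinearity.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = elu(W @ a + b)
    return weights[-1] @ a + biases[-1]
```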
The weight matrices were initialized using the Xavier initializer in TensorFlow. This initialization distributes the entries in W^{(l)} uniformly over the interval [−α, α] where α = \sqrt{6/(n_l + n_{l+1})} in order to keep the scale of gradients approximately the same in all layers.
This initialization method together with the use of exponential linear units allowed deep net-
works to be used for the encoder and decoder. The bias vectors b^{(l)} were initialized to be zero. The transition matrix K was initialized to have diagonal blocks of the form \begin{bmatrix} σ & ω \\ -ω & σ \end{bmatrix} with eigenvalues λ = σ ± ωı equally spaced around the circle of radius r = \sqrt{σ^2 + ω^2} = 0.8.
This was done heuristically to give the initial eigenvalues good coverage of the unit disc. One
could also initialize this matrix using a low-rank DMD matrix.
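One possible version of this block-diagonal initialization is sketched below; the particular placement of the conjugate pairs on the circle is an illustrative choice.

```python
import numpy as np

def init_transition_matrix(d, radius=0.8):
    """Initialize K with 2x2 blocks [[s, w], [-w, s]] whose eigenvalues s +/- i w
    lie on a circle of the given radius, roughly covering the unit disc."""
    K = np.zeros((d, d))
    angles = np.linspace(0.0, np.pi, d // 2, endpoint=False)
    for j, theta in enumerate(angles):
        s, w = radius * np.cos(theta), radius * np.sin(theta)
        K[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[s, w], [-w, s]]
    if d % 2:                       # odd dimension: leave one real eigenvalue on the circle
        K[-1, -1] = radius
    return K
```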
4.2. Simple modifications of LRANs. Several extensions and modifications of the LRAN
architecture are possible. Some simple modifications are discussed here, with several more
involved extensions suggested in the conclusion. In the first extension, we observe that it is
easy to learn Koopman eigenfunctions associated with known eigenvalues simply by fixing the
appropriate entries in the matrix K. In particular, if we know that our system has Koopman
eigenvalue µ = σ + ωı then we may formulate the state transition matrix
(29) K(θ) = \begin{bmatrix} \begin{matrix} σ & ω \\ -ω & σ \end{matrix} & 0_{2 \times (d-2)} \\ 0_{(d-2) \times 2} & K̃(θ) \end{bmatrix}.
In the above, the known eigenvalue is fixed and only the entries of K̃ are allowed to be
trained. If more eigenvalues are known, we simply fix the additional entries of K in the same
way. The case where some eigenvalues are known is particularly interesting because in certain
cases, eigenvalues of the linearized system are Koopman eigenvalues whose eigenfunctions have
useful properties. Supposing the autonomous system under consideration has a fixed point
with all eigenvalues µi , i = 1, . . . , n inside the unit circle, the Hartman-Grobman theorem
establishes a topological conjugacy to a linear system with the same eigenvalues in a small
neighborhood U of the fixed point. One easily checks that the coordinate transformations hi :
M∩U → R, i = 1, . . . , n establishing this conjugacy are Koopman eigenfunctions restricted to
the neighborhood. Composing them with the flow map allows us to extend the eigenfunctions
to the entire basin of attraction by defining ϕ_i(x) = µ_i^{-τ(x)} h_i\left( f^{τ(x)}(x) \right), where τ(x) is the smallest integer τ such that f^τ(x) ∈ U. These eigenfunctions extend the topological conjugacy
by parameterizing the basin. Similar results hold for stable limit cycles and tori [20]. This is
nice because we can often find eigenvalues at fixed points explicitly by linearization. Choosing
to fix these eigenvalues in the K matrix forces the LRAN to learn corresponding eigenfunctions
parameterizing the basin of attraction. It is also easy to include a set of observables explicitly
by appending them to the encoder function Ψ(x; θ_enc) = \begin{bmatrix} Ψ_{fixed}(x)^T, & Ψ̃(x; θ_enc)^T \end{bmatrix}^T so that
only the functions Ψ̃ are learned by the network. This may be useful if we want to accurately
reconstruct some observables Ψ_{fixed} linearly using Koopman modes.
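Assembling the constrained transition matrix (29) is straightforward; a sketch with a single fixed complex eigenvalue is shown below (in a training framework only the K_free block would receive gradient updates).

```python
import numpy as np

def constrained_K(K_free, mu):
    """Build the block matrix (29) with one fixed eigenvalue mu = sigma + i*omega."""
    sigma, omega = mu.real, mu.imag
    d = K_free.shape[0] + 2
    K = np.zeros((d, d))
    K[:2, :2] = [[sigma, omega], [-omega, sigma]]   # fixed, untrained block
    K[2:, 2:] = K_free                              # trainable block K-tilde
    return K
```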
The LRAN architecture and loss function (25) may be further generalized to non-uniform
sampling of continuous-time systems. In this case, we consider T sequential snapshots
{x(t0 ), x(t1 ), . . . , x(tT −1 )} where the times t0 , t1 , . . . , tT −1 are not necessarily evenly spaced.
In the continuous time case, we have a Koopman operator semigroup K^{t+s} = K^t K^s defined as K^t ψ(x) = ψ(Φ^t(x)) and generated by the operator Kψ(x) = ψ̇(x) = (∇_x ψ(x)) f(x), where the dynamics are given by ẋ = f(x) and Φ^t is the time-t flow map. The generator K is clearly
a linear operator which we can approximate on our dictionary of observables with a matrix
K. By integrating, we can approximate elements from the semigroup K^t using the matrices K_t = \exp(Kt) on the dictionary. Finally, in order to formulate the analogous loss function,
we might utilize continuously decaying weights
(30) ρ_1(t) = \frac{δ^t}{\sum_{k=0}^{T-1} δ^{t_k}}, \qquad ρ_2(t) = \frac{δ^t}{\sum_{k=1}^{T-1} δ^{t_k}},
normalized so that they sum to 1 for the given sampling times. Neural networks can be used
for the encoder and decoder together with the loss function
(31) J(θ_enc, θ_dec, θ_K) = \underset{x(t_0),\ldots,x(t_{T-1}) \sim \mathcal{P}_{data}}{E} \left[ \frac{1}{1+β} \sum_{k=0}^{T-1} ρ_1(t_k) \frac{\| x̂(t_k) - x(t_k) \|^2}{\| x(t_k) \|^2 + ε_1} + \frac{β}{1+β} \sum_{k=1}^{T-1} ρ_2(t_k) \frac{\| ẑ(t_k) - z(t_k) \|^2}{\| z(t_k) \|^2 + ε_2} \right] + Ω(θ_enc, θ_dec, θ_K)
to be minimized during training. In this case, the dynamics evolve the observables linearly in
continuous time, so we let
z(t_k) = Ψ(x(t_k); θ_enc), \qquad ẑ(t_k) = \exp\left[ K(θ_K)(t_k - t_0) \right]^T z(t_0), \qquad x̂(t_k) = Ψ̃(ẑ(t_k); θ_dec).
This loss function can be evaluated on the training data and minimized in essentially the
same way as (25). The only difference is that we are discovering a matrix approximation to
the generator of the Koopman semigroup. We will not explore irregularly sampled continuous
time systems further in this paper, leaving it as a subject for future work.
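Propagating the encoded state under the approximate generator, as in the expressions above, reduces to a matrix exponential; a short SciPy-based sketch is given below with illustrative names.

```python
import numpy as np
from scipy.linalg import expm

def predict_latent(z0, K, times):
    """Advance z(t_0) to the (possibly unevenly spaced) sample times via
    z_hat(t_k) = exp(K (t_k - t_0))^T z(t_0)."""
    return np.array([expm(K * (t - times[0])).T @ z0 for t in times])
```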
We briefly remark that the general LRAN architecture can be restricted to the case of
snapshot pairs as shown in Figure 4. In this special case, training might be accelerated using
a technique similar to Algorithm 1. During the initial stage of training, it may be beneficial
to periodically re-initialize the K matrix with its EDMD approximation using the current
dictionary functions and a subset of the training data. This might provide a more suitable
initial condition for the matrix as well as accelerate the training process. However, this update
for K is not consistent with all the terms in the loss function J since it does not account for
reconstruction errors. Therefore, the final stages of training must always proceed by gradient
descent on the complete loss function.
Finally, we remark that the LRAN architecture sacrifices linear reconstruction using Koop-
man modes for nonlinear reconstruction using a decoder neural network in order to learn
The global coordinates on the manifold are (ℜ(α), ℑ(α)). Coordinate projection of the full state onto the NNM,

(33) \begin{bmatrix} ℜ(α) \\ ℑ(α) \end{bmatrix} = \begin{bmatrix} ℜ(w_R)^T \\ ℑ(w_R)^T \end{bmatrix} Ψ(x; θ_enc),
is accomplished by employing the encoder network and the right eigenvector wR corresponding
to eigenvalue µ. These coordinates are the real and imaginary parts of the associated Koopman
eigenfunction α = ϕ(x). The NNM has angular frequency ∠(µ)/∆t where ∆t is the sampling
interval between the snapshots in the case of the discrete time LRAN.
We may further generalize the notion of NNMs by considering the Koopman mode expan-
sion of the real-valued observable vector Ψ making up our dictionary. In this particular case,
the associated Koopman modes are the complex conjugate left eigenvectors of K. They allow
exact reconstruction and prediction using the decomposition
(34) z_t = \sum_{j=1}^{r} \overline{w}_{L,j} \, µ_j^t \, w_{R,j}^T z_0 = \sum_{j=1}^{r} \overline{w}_{L,j} \, µ_j^t \, ϕ_j(x_0),

assuming a Koopman invariant subspace has been learned that contains the full state observable.
Therefore, each invariant subspace of K given by its left eigenvectors corresponds to an in-
variant manifold in the n-dimensional phase space. These manifolds have global charts whose
coordinate projections are given by the Koopman eigenfunctions ϕ_j(x) = w_{R,j}^T Ψ(x; θ_enc).
The dynamics on these manifolds is incredibly simple and entails repeated multiplication of
the coordinates by the eigenvalues. Generalized eigenspaces may also be considered in the
natural way by using the Jordan normal form of K instead of its eigendecomposition in the
above arguments. The only necessary change is in the evolution equations, where instead of
taking powers of Λ = diag {µ1 , . . . , µr }, we take powers of J, the matrix consisting of Jordan
blocks [20]. Future work might use a variational autoencoder (VAE) formulation [7, 18, 17, 12]
where a given distribution is imposed on the latent state in order to facilitate sampling.
5. EDMD-based model reduction as a shallow autoencoder. In this section we examine
how the EDMD method might be used to construct low-dimensional Koopman invariant
subspaces while still allowing for accurate reconstructions and predictions of the full state.
The idea is to find a reduced order model of the large linear system identified by EDMD in the
space of observables. This method is sometimes called overspecification [26] and essentially
determines an encoder function into an appropriate small set of features. From this reduced set
of features, we then employ nonlinear reconstruction of the full state through a learned decoder
function. Introduction of the nonlinear decoder should allow for lower-dimensional models to
be identified which are still able to make accurate predictions. The proposed framework
therefore constructs a kind of autoencoder where encoded features evolve with linear time
invariant dynamics. The encoder functions are found explicitly as linear combinations of
EDMD observables and are therefore analogous to a shallow neural network with a single
hidden layer. The nonlinear decoder function is also found explicitly through a regression
process involving linear combinations of basis functions.
We remark that this approach differs from training an LRAN by minimization of (25)
in two important ways. First, the EDMD-based model reduction and reconstruction pro-
cesses are performed sequentially; thus, the parts are not simultaneously optimized as in the
LRAN. The LRAN is advantageous since we only learn to encode observables which the de-
coder can successfully use for reconstruction. There are no such guarantees here. Second,
the EDMD dictionary remains fixed albeit overspecified whereas the LRAN explicitly learns
an appropriate dictionary. Therefore, the EDMD shallow autoencoder framework will still
suffer from the overfitting problem illustrated in Example 2.1. If the EDMD-identified matrix
K does not correctly represent the dynamics on a Koopman invariant subspace, then any
reduced order models derived from it cannot be expected to accurately reflect the dynamics
either. Nonetheless, in many cases, this method could provide a less computationally expen-
sive alternative to training a LRAN which retains some of the benefits owing to nonlinear
reconstruction.
Dimensionality reduction is achieved by first performing EDMD with a large dictionary,
then projecting the linear dynamics onto a low-dimensional subspace. A naive approach
would be to simply project the large feature space system onto the most energetic POD
modes — equivalent to low-rank truncation of the SVD ΨX = UΣV∗ . While effective for
normal systems with a few dominant modes, this approach yields very poor predictions in
non-normal systems since low amplitude modes with large impact on the dynamics would be
excluded from the model. One method which resolves this issue is balanced truncation of
the identified feature space system. Such an idea is suggested in [26] for reducing the system
identified by linear DMD. Drawing from the model reduction procedure for snapshot-based
realizations developed in [15], we will construct a balanced reduced order model for the system
identified using EDMD or KDMD. In the formulation of EDMD that led to the kernel method,
an approximation of the Koopman operator,
was obtained. The approximation allows us to model the dynamics of a vector of observables,
where K̂ is the matrix (14) identified by EDMD or KDMD. The input ut is provided in order
to equate varying initial conditions ΨU (x0 ) with impulse responses of (38). Since the input
is used to set the initial condition, we choose to scale each component by its singular value to
reflect the covariance
(39) \underset{x \sim \mathcal{P}_{data}}{E}\left[ Ψ_U(x) Ψ_U(x)^* \right] \approx \frac{1}{M} U^* Ψ_X Ψ_X^* U = \frac{1}{M} Σ^2 = \frac{1}{M} \underset{u \sim \mathcal{N}(0, I_r)}{E}\left[ Σ u u^* Σ^* \right],
in the observed data. Therefore, initializing the system using impulse responses u0 = ej ,
j = 1, . . . , r from the σ-points of the distribution u ∼ N (0, Ir ) ensures that the correct
empirical covariances are obtained. The output matrix,
(40) C = XVΣ+ ,
is used to linearly reconstruct the full state observable from the complete set of features. It
is found using linear regression similar to the Koopman modes (18). The low-dimensional set
of observables making up the encoder will be found using a balanced reduced order model of
(38).
5.1. Balanced Model Reduction. Balanced truncation [21] is a projection-based model reduction technique that attempts to retain a subspace in which (38) is both maximally ob-
servable and controllable. While these notions generally do not coincide in the original space,
remarkably it is possible to find a left-invertible linear transformation ΨU (x) = Tz under
which these properties are balanced. This so called balancing transformation of the learning
5.2. Finite-horizon Gramians and Balanced POD. Typically one would find the infinite
horizon Gramians for an overspecified Hurwitz EDMD system (38) by solving the Lyapunov
equations
In the case of neutrally stable or unstable systems, unique positive definite solutions do
not exist and one must use generalized Gramians [38]. When used to form balanced reduced
order models, this will always result in the unstable and neutrally stable modes being chosen
before the stable modes. This could be problematic for our intended framework since EDMD
can identify many spurious and sometimes unstable eigenvalues corresponding to noisy low-
amplitude fluctuations. While these noisy modes remain insignificant over finite times of
interest, they will dominate EDMD-based predictions over long times. Therefore it makes
sense to consider the dominant modes identified by EDMD over a finite time interval of
interest. Using finite horizon Gramians reduces the effect of spurious modes on the reduced
order model, making it more consistent with the data. The time horizon can be chosen to
reflect a desired future prediction time or the number of sequential snapshots in the training
data.
The method of Balanced Proper Orthogonal Decomposition or BPOD [24] allows us to
find balancing and adjoint modes of the finite horizon system. In BPOD, we observe that the
finite-horizon Gramians are empirical covariance matrices formed by evolving the dynamics
from impulsive initial conditions for time T . This gives the specific form for matrices
(47) A = \begin{bmatrix} C^* & K̂ C^* & \cdots & (K̂)^T C^* \end{bmatrix}, \qquad B = \frac{1}{\sqrt{M}} \begin{bmatrix} Σ & K̂^* Σ & \cdots & (K̂^*)^T Σ \end{bmatrix},
allowing for computation of the balancing and adjoint modes without ever forming the Grami-
ans. This is known as the method of snapshots. Since the output dimension is large, we
consider its projection onto the most energetic modes. These are identified by forming the
economy sized SVD of the impulse responses CB = U_{OP} Σ_{OP} V_{OP}^*. Projecting the output onto these modes allows the adjoint snapshots to be formed from fewer initial conditions than A would otherwise require. In particular, the initial conditions are the first few
columns of UOP with the largest singular values [24].
Observe that the unit impulses place the initial conditions precisely at the σ-points of
the data distribution in feature space. If this distribution is Gaussian, then the empirical
expectations obtained by evolving the linear system agree with the true expectations taken
over the entire data distribution. Therefore, the finite horizon controllability Gramian cor-
responds to the covariance matrix taken over all time T trajectories starting at initial data
points coming from a Gaussian distribution in feature space. Consequently, controllability in
this case corresponds exactly with feature variance or expected square amplitude over time.
We remark that in the infinite-horizon limit T → ∞, BPOD converges on a transformation
which balances the generalized Gramians introduced in [38]. Application of BPOD to unstable
systems is discussed in [6] which provides justification for the approach.
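A sketch of this finite-horizon balancing computation is given below. It forms the snapshot matrices in (47) and then applies the usual BPOD method of snapshots (an SVD of the product of adjoint and direct snapshots), under the assumptions that A collects the observability snapshots, B the controllability snapshots, and that the latent dynamics matrix is K̂^*; the output-projection step is omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def finite_horizon_bpod(K_hat, C, Sigma, horizon, M, order):
    """Balanced reduction of the feature-space system via the method of snapshots."""
    A_snaps, B_snaps = [C.conj().T], [Sigma]         # start from columns of C^* and Sigma, per (47)
    for _ in range(horizon):
        A_snaps.append(K_hat @ A_snaps[-1])
        B_snaps.append(K_hat.conj().T @ B_snaps[-1])
    A = np.hstack(A_snaps)
    B = np.hstack(B_snaps) / np.sqrt(M)

    # Hankel-like product of adjoint and direct snapshots, then truncate its SVD.
    U, s, Vh = np.linalg.svd(A.conj().T @ B, full_matrices=False)
    U, s, Vh = U[:, :order], s[:order], Vh[:order, :]
    T_bal = B @ Vh.conj().T / np.sqrt(s)             # balancing modes
    S_bal = A @ U / np.sqrt(s)                       # adjoint modes, S_bal^* T_bal = I
    # Reduced latent dynamics, assuming the feature-space state evolves by K_hat^*.
    K_reduced = S_bal.conj().T @ K_hat.conj().T @ T_bal
    return K_reduced, T_bal, S_bal
```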
Another option to avoid spurious modes from corrupting the long-time dynamics is to
consider pre-selection of EDMD modes which are nearly Koopman invariant. The development
of such an accuracy criterion for selecting modes is the subject of a forthcoming paper by H.
Zhang and C. W. Rowley. One may then apply balanced model reduction to the feature space
system containing only the most accurate modes.
A partially linear model for reconstructing the full state from the reduced feature vector z,

(49) x = C_1 z + C_2 Ψ(z) + e,

is formulated based on [5, 36] to include linear and nonlinear components. In the above, Ψ :
Cd → H is a nonlinear feature map into reproducing kernel Hilbert space H and C1 : Cd → Cn
and C_2 : H → C^n are linear operators. These operators are found by solving the ℓ_2-regularized optimization problem,

(50) \underset{C_1, C_2}{\text{minimize}} \; \| X - C_1 Z - C_2 Ψ_Z \|_F^2 + γ \operatorname{Tr}(C_2 C_2^*),

involving the empirical square error on the training data {(z_j, x_j)}_{j=1}^{M} arranged into columns of the matrices Z = \begin{bmatrix} z_1 & \cdots & z_M \end{bmatrix} and X = \begin{bmatrix} x_1 & \cdots & x_M \end{bmatrix}. The regularization penalty
is placed only on the coefficients of nonlinear terms to control over-fitting while the linear
term, which we expect to dominate, is not penalized. The operator Ψ_Z : C^M → H forms linear combinations of the data in feature space, v ↦ v_1 Ψ(z_1) + · · · + v_M Ψ(z_M). Since Z
and ΨZ are operators with finite ranks r1 and r2 ≤ M , we may consider their economy
sized singular value decompositions: Z = U1 Σ1 V1∗ and ΨZ = U2 Σ2 V2∗ . Observe that it is
impossible to infer any components of R(C∗1 ) orthogonal to R(Z) since they are annihilated
by Z∗ . Therefore, we apply Occam’s razor and assume that C∗1 = U1 Ĉ∗1 for some Ĉ∗1 ∈ Cr1 ×n .
By the same argument, R(C∗2 ) cannot have any components orthogonal to R(ΨZ ) since they
are annihilated by Ψ∗Z and have a positive contribution to the regularization penalty term
Tr(C2 C∗2 ). Hence, we must also have C∗2 = U2 Ĉ∗2 for some Ĉ∗2 ∈ Cr2 ×n . Substituting these
relationships into (50) allows it to be formulated as the standard least squares problem
(51) J = \left\| X^* - V_1 Σ_1 Ĉ_1^* - V_2 Σ_2 Ĉ_2^* \right\|_F^2 + γ \left\| Ĉ_2 \right\|_F^2 = \left\| \begin{bmatrix} X^* \\ 0_{r_2 \times n} \end{bmatrix} - \begin{bmatrix} V_1 Σ_1 & V_2 Σ_2 \\ 0_{r_2 \times r_1} & \sqrt{γ} I_{r_2} \end{bmatrix} \begin{bmatrix} Ĉ_1^* \\ Ĉ_2^* \end{bmatrix} \right\|_F^2.
The block-wise matrix clearly has full column rank r_1 + r_2 for γ > 0, and the normal equations for this least squares problem are found by projecting onto its range. The solution,
(52) \begin{bmatrix} Ĉ_1^* \\ Ĉ_2^* \end{bmatrix} = \begin{bmatrix} Σ_1 & 0_{r_1 \times r_2} \\ 0_{r_2 \times r_1} & Σ_2 \end{bmatrix}^{-1} \begin{bmatrix} I_{r_1} & V_1^* V_2 \\ V_2^* V_1 & I_{r_2} + γ Σ_2^{-2} \end{bmatrix}^{-1} \begin{bmatrix} V_1^* X^* \\ V_2^* X^* \end{bmatrix},
corresponds to taking the left pseudoinverse and simplifying the resulting expression. The
matrices V1,2 and Σ1,2 are found by solving Hermitian eigenvalue problems using the (kernel)
matrices of inner products Z^* Z = V_1 Σ_1^2 V_1^* and Ψ_Z^* Ψ_Z = V_2 Σ_2^2 V_2^*. At a new point z, the
approximate reconstruction using the partially linear kernel regression model is
(53) x \approx Ĉ_1 Σ_1^{-1} V_1^* Z^* z + Ĉ_2 Σ_2^{-1} V_2^* \left( Ψ_Z^* Ψ(z) \right).
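A NumPy sketch of the fit (52) and the evaluation (53) follows; the kernel callable, the numerical rank tolerance, and the function names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def fit_partially_linear(Z, X, kernel, gamma, tol=1e-12):
    """Fit the model (49) by solving (50) through the closed form (52).

    Z : (d, M) latent training data, X : (n, M) full training states,
    kernel(A, B) : matrix of k(a_i, b_j) for the columns of A and B.
    """
    s1sq, V1 = np.linalg.eigh(Z.conj().T @ Z)        # Z^* Z = V1 Sigma1^2 V1^*
    s2sq, V2 = np.linalg.eigh(kernel(Z, Z))          # Psi_Z^* Psi_Z = V2 Sigma2^2 V2^*
    V1, S1 = V1[:, s1sq > tol], np.sqrt(s1sq[s1sq > tol])
    V2, S2 = V2[:, s2sq > tol], np.sqrt(s2sq[s2sq > tol])
    r1, r2 = S1.size, S2.size

    # Block system from (52)
    middle = np.block([[np.eye(r1), V1.conj().T @ V2],
                       [V2.conj().T @ V1, np.eye(r2) + gamma * np.diag(1.0 / S2**2)]])
    rhs = np.vstack([V1.conj().T @ X.conj().T, V2.conj().T @ X.conj().T])
    sol = np.diag(np.concatenate([1.0 / S1, 1.0 / S2])) @ np.linalg.solve(middle, rhs)
    C1h, C2h = sol[:r1].conj().T, sol[r1:].conj().T  # hat-C_1 and hat-C_2 from (52)
    return C1h, C2h, V1, S1, V2, S2

def reconstruct(z, Z, C1h, C2h, V1, S1, V2, S2, kernel):
    """Evaluate the reconstruction (53) at a new latent point z."""
    linear = C1h @ ((V1 / S1).conj().T @ (Z.conj().T @ z))
    nonlinear = C2h @ ((V2 / S2).conj().T @ kernel(Z, z.reshape(-1, 1)).ravel())
    return linear + nonlinear
```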
The unforced Duffing equation, ẍ = -δẋ - x(β + αx^2), is considered, where the parameters δ = 0.5, β = -1, and α = 1 are chosen. The equation exhibits stable equilibria at x = ±1 with eigenvalues λ_{1,2} = \frac{1}{4}\left( -1 ± \sqrt{31}\,ı \right) associated with the linearizations
at these points. One can show that these (continuous-time) eigenvalues also correspond to
Koopman eigenfunctions whose magnitude and complex argument act like action-angle vari-
ables parameterizing the entire basins. A non-trivial Koopman eigenfunction with eigenvalue
λ0 = 0 takes different constant values in each basin, acting like an indicator function to
distinguish them.
We will see whether the LRAN and the reduced KDMD model can learn these eigenfunc-
tions from data and use them to predict the dynamics of the unforced Duffing equation as well as to determine to which basin of attraction a given point belongs. The training data are
generated by simulating the unforced Duffing equation from uniform random initial conditions
(x(0), ẋ(0)) ∈ [−2, 2] × [−2, 2]. From each trajectory 11 samples are recorded ∆t = 0.25 apart.
The training data for LRAN models consists of M = 10^4 such trajectories. Since the KDMD method requires us to evaluate the kernel function between a given example and each training point, we limit the number of training data points to 10^3 randomly selected snapshot pairs
from the original set. It is worth mentioning that the LRAN model handles large data sets
more efficiently than KDMD since the significant cost goes into training the model which is
then inexpensive and fast to evaluate on new examples.
Since three of the Koopman eigenvalues are known ahead of time we train an LRAN model
where the transition matrix K is fixed to have discrete time eigenvalues µk = exp(λk ∆t). We
refer to this as the “constrained LRAN” and compare its performance to a “free LRAN”
model where K is learned and a 5th order balanced truncation using KDMD called “KDMD
ROM”. The hyperparameters of each model are reported in Appendix A.
The learned eigenfunctions for each model are plotted in Figures 5 to 7. The corresponding
eigenvalues learned or fixed in the model are also reported. The complex eigenfunctions are
plotted in terms of their magnitude and phase. In each case, the eigenfunction associated with
the continuous-time eigenvalue λ0 closest to zero appears to partition the phase space into
basins of attraction for each fixed point as one would expect. In order to test this hypothesis,
we use the median eigenfunction value for each model as a threshold to classify test data
points between the basins. The eigenfunction learned by the constrained LRAN was used to
correctly classify 0.9274 of the testing data points. The free LRAN eigenfunction and the
KDMD balanced reduced order model eigenfunction correctly classified 0.9488 and 0.9650 of
the testing data respectively. M_test = 11 × 10^4 test data points were used to evaluate the LRAN models, though this number was reduced to a randomly selected M_test = 1000 to test
the KDMD model due to the exceedingly high computational cost of the kernel evaluations.
The other eigenfunction learned in each case parameterizes the basins of attraction and
therefore is used to account for the dynamics in each basin. Each model appears to have
learned a similar action-angle parameterization regardless of whether the eigenvalues were
specified ahead of time. However, the constrained LRAN shows the best agreement with the
true fixed point locations at x = ±1 where |ϕ1 | → 0. The mean square relative prediction
error was evaluated for each model by making predictions on the testing data set at various
times in the future. The results plotted in Figure 8 show that the free LRAN has by far the
lowest prediction error likely due to the lack of constraints on the functions it could learn. It is
surprising, however, that nonlinear reconstruction hurt the performance of the KDMD reduced
order model. This illustrates a potential difficulty with this method since the nonlinear part
of the reconstruction is prone to over-fit without sufficient regularization.
Figure 8. Unforced Duffing testing data mean square relative prediction errors for each model plotted
against the prediction time
6.2. Cylinder wake. The next example we consider is the formation of a Kármán vortex
street downstream of a cylinder in a fluid flow. This problem was chosen since the data has low
intrinsic dimensionality due to the simple flow structure but is embedded in high-dimensional
snapshots. We are interested in whether the proposed techniques can be used to discover
very low dimensional models that accurately predict the dynamics over many time steps. We
consider the growth of instabilities near an unstable base flow shown in Figure 9a at Reynolds
number Re = 60 all the way until a stable limit cycle shown in Figure 9b is reached. The
models will have to learn to make predictions over a range of unsteady flow conditions from
the unstable equilibrium to the stable limit cycle.
Figure 9. Example cylinder wake flow snapshots at the unstable equilibrium and on the stable limit cycle
The raw data consisted of 2000 simulated snapshots of the velocity field taken at time
intervals 0.2D/U∞ , where D is the cylinder diameter and U∞ is the free-stream velocity.
These data were split into Mtrain = 1000 training, Meval = 500 evaluation, and Mtest = 500
testing data points. Odd numbered points ∆t = 0.4D/U∞ apart were used for training.
The remaining 1000 points were divided again into even and odd numbered points 2∆t =
0.8D/U∞ apart for evaluation and testing. This enabled training, evaluation, and testing on
long data sequences while retaining coverage over the complete trajectory. The continuous-
time eigenvalues are found from the discrete-time eigenvalues according to λ = log(µ)/∆t =
log(µ)U∞ /(0.4D).
The raw data was projected onto its 200 most energetic POD modes which captured
essentially all of the energy in order to reduce the cost of storage and training. 400-dimensional
time delay embedded snapshots were formed from the state at time t and t + ∆t. A 5th-
order LRAN model and the 5th-order KDMD reduced order model were trained using the
hyperparameters in Tables 3 and 4. In Figure 10a, many of the discrete-time eigenvalues
given by the over-specified KDMD model have approximately neutral stability with some
being slightly unstable. However, the finite horizon formulation for balanced truncation allows
us to learn the most dynamically salient eigenfunctions over a given length of time, in this
case T = 20 steps or 8.0D/U∞ . We see in Figure 10b that three of the eigenvalues learned by
the two models are in close agreement and all are approximately neutrally stable.
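The preprocessing described above (projection onto the leading POD modes followed by a
two-snapshot time-delay embedding) can be sketched as follows. The SVD-based POD computation,
the mean subtraction, and all variable names are illustrative assumptions, and the stand-in
array sizes are scaled down from the 2000 snapshots and 200 modes used here.

```python
import numpy as np

def pod_project(snapshots, r):
    """Project snapshots (n_snapshots x n_state) onto their r most energetic
    POD modes via the SVD. Mean subtraction is an assumption; the paper does
    not state whether the mean is removed before the projection."""
    mean = snapshots.mean(axis=0)
    X = snapshots - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are POD modes
    modes = Vt[:r].T                                   # n_state x r
    return X @ modes, modes, mean                      # coefficients, modes, mean

def delay_embed_pairs(coeffs):
    """Stack the reduced state at times t and t + dt into one snapshot,
    e.g. 200 POD coefficients -> 400-dimensional embedded snapshots."""
    return np.hstack([coeffs[:-1], coeffs[1:]])

# Stand-in data with dimensions scaled down for illustration.
coeffs, modes, mean = pod_project(np.random.randn(400, 1000), r=50)
embedded = delay_embed_pairs(coeffs)                   # shape (399, 100)
```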
Figure 10. Discrete-time Koopman eigenvalues approximated by the KDMD ROM and the LRAN
A side-by-side comparison of the Koopman modes gives some insight into the flow struc-
tures whose dynamics the Koopman eigenvalues describe. We notice right away that the
Koopman modes in Figure 12 corresponding to continuous-time eigenvalue λ1 are very similar
for both models and indicate the pattern of vortex shedding downstream. This makes sense
since a single frequency and mode will account for most of the amplitude as the limit cycle
is approached. Evidently both models discover these limiting periodic dynamics. For the
KDMD ROM, λ2 is almost exactly the higher harmonic 2λ1. The corresponding Koopman
mode in Figure 13 also reflects smaller flow structures which oscillate at twice the frequency
of λ1 . Interestingly, the LRAN does not learn the same second eigenvalue as the KDMD
ROM. The LRAN continuous-time eigenvalue λ2 is very close to λ1, which suggests that these
frequencies might interact to produce the low frequency λ1 − λ2. The second LRAN Koop-
man mode in Figure 13 also bears qualitative resemblance to the first Koopman mode in
Figure 12, but with a slightly narrower pattern in the y-direction. The LRAN may be using
the information at these frequencies to capture some of the slower transition process from
the unstable fixed point to the limit cycle. The Koopman modes corresponding to λ0 = 0
are also qualitatively different indicating that the LRAN and KDMD ROM are extracting
different constant features from the data. We must be careful in our interpretation, however,
since the LRAN's Koopman modes are only least-squares approximations to the nonlinear
reconstruction process performed by the decoder.
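One plausible reading of this regression-based construction is sketched below: eigenfunction
values are obtained as linear functionals of the encoded states through the eigenvectors of the
learned dynamics matrix, and the modes are then fit by least squares. The variable names and
this particular formulation are assumptions, not the paper's code.

```python
import numpy as np

def regression_koopman_modes(Y, A, X):
    """Approximate Koopman modes by least squares.

    Y : (M, d) encoded states produced by the (hypothetical) trained encoder
    A : (d, d) learned linear dynamics matrix in the encoded space
    X : (M, n) corresponding original snapshots

    With A = V diag(mu) V^{-1}, the functions phi_i(y) = (V^{-1} y)_i evolve
    as phi_i -> mu_i phi_i, so they serve as approximate eigenfunctions; the
    modes Xi are fit so that X ~ Phi @ Xi in the least-squares sense."""
    mu, V = np.linalg.eig(A)                     # discrete-time eigenvalues
    Phi = Y @ np.linalg.inv(V).T                 # eigenfunction values per snapshot
    Xi, *_ = np.linalg.lstsq(Phi, X.astype(complex), rcond=None)
    return mu, Phi, Xi                           # rows of Xi are the Koopman modes
```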
Figure 11. LRAN and KDMD ROM Koopman modes associated with λ0 ≈ 0
Figure 12. LRAN and KDMD ROM Koopman modes associated with λ1 ≈ 0.002 + 0.845ı
Figure 13. LRAN and KDMD ROM Koopman modes associated with λ2 which differs greatly between the
models
Plotting the model prediction error in Figure 14 shows that the linear reconstructions using
both models have comparable performance, with errors growing slowly over time. Therefore,
the choice of the second Koopman mode does not seem to play a large role in the recon-
struction process. However, when the nonlinear decoder is used to reconstruct the LRAN
struction process. However, when the nonlinear decoder is used to reconstruct the LRAN
predictions, the mean relative error is roughly an order of magnitude smaller than the nonlin-
early reconstructed KDMD ROM over many time steps. The LRAN has evidently learned an
advantageous nonlinear transformation for reconstructing the data using the features evolving
according to λ2 . The second Koopman mode reflects a linear approximation of this nonlinear
transformation in the least squares sense.
Another remark is that nonlinear reconstruction using the KDMD ROM did significantly
improve the accuracy in this example. This indicates that many of the complex variations in
the data are really enslaved to a small number of modes. This makes sense since the dynamics
are periodic on the limit cycle. Finally, it is worth mentioning that the prediction accuracy
was achieved on average over all portions of the trajectory from the unstable equilibrium to
the limit cycle. Both models therefore have demonstrated predictive accuracy and validity
over a wide range of qualitatively different flow conditions. The nonlinearly reconstructed
LRAN achieves a constant low prediction error over the entire time interval used for training
T ∆t = 8.0D/U∞ . The error only begins to grow outside the interval used for training. The
high prediction accuracy could likely be extended by training on longer data sequences.
Figure 14. Cylinder wake testing data mean square relative prediction errors for each model plotted against
the prediction time
6.3. Kuramoto-Sivashinsky equation. The final example uses data generated by simulating the Kuramoto-
Sivashinsky equation,
(56)    u_t + u_{xx} + u_{xxxx} + u u_x = 0,    x ∈ [0, L],
using a semi-implicit Fourier pseudo-spectral method. The length L = 8π was chosen where
the equation first begins to exhibit chaotic dynamics [9]. 128 Fourier modes were used to
resolve all of the dissipative scales. Each data set (training, evaluation, and test) consisted
of 20 simulations from different initial conditions, each with 500 recorded states spaced by
∆t = 1.0. Snapshots consisted of time-delay-embedded states at t and t + ∆t. The initial
conditions were formed by applying Gaussian random perturbations to the coefficients of the
three linearly unstable Fourier modes, those with 0 < 2πk/L < 1, i.e., k = 1, 2, 3.
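The time-stepping details beyond "semi-implicit Fourier pseudo-spectral" are not spelled out
here, so the sketch below uses a simple first-order scheme (implicit linear terms, explicit
nonlinear term) as one possible realization; the perturbation amplitude, the internal time step,
and the omission of dealiasing are our assumptions.

```python
import numpy as np

L, N = 8 * np.pi, 128
x = L * np.arange(N) / N
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)   # wavenumbers (k = n/4 for L = 8*pi)
lin = k**2 - k**4                            # from u_t = -u_xx - u_xxxx - u u_x

def random_initial_condition(rng, eps=0.1):
    """Gaussian random perturbations on the three linearly unstable modes
    (wavenumber indices 1, 2, 3); the amplitude eps is an assumption."""
    u = np.zeros(N)
    for m in (1, 2, 3):
        u += eps * rng.standard_normal() * np.cos(2 * np.pi * m * x / L) \
           + eps * rng.standard_normal() * np.sin(2 * np.pi * m * x / L)
    return u

def step(u_hat, dt):
    """One semi-implicit step: linear terms treated implicitly, the
    nonlinear term -u u_x = -(u^2/2)_x treated explicitly (no dealiasing)."""
    u = np.real(np.fft.ifft(u_hat))
    nonlinear = -0.5j * k * np.fft.fft(u * u)
    return (u_hat + dt * nonlinear) / (1.0 - dt * lin)

rng = np.random.default_rng(0)
u_hat = np.fft.fft(random_initial_condition(rng))
dt, substeps = 0.025, 40                     # 40 substeps per recorded interval of 1.0
snapshots = []
for _ in range(500):
    for _ in range(substeps):
        u_hat = step(u_hat, dt)
    snapshots.append(np.real(np.fft.ifft(u_hat)))
```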
An LRAN and a KDMD balanced ROM were trained to make predictions over a
time horizon of T = 5 steps using only d = 16-dimensional models. Model parameters are given
in Tables 5 and 6. The learned approximate Koopman eigenvalues are plotted in Figure 15.
We notice that there are some slightly unstable eigenvalues, which makes sense since there are
certainly unstable modes including the three linearly unstable Fourier modes. Additionally,
Figure 15b shows that some of the eigenvalues with large magnitude learned by the LRAN
and the KDMD ROM are in near agreement.
Figure 15. Discrete-time Koopman eigenvalues for the Kuramoto-Sivashinsky equation approximated by
the KDMD ROM and the LRAN
The plot of mean square relative prediction error on the testing data set in Figure 16 indi-
cates that our addition of nonlinear reconstruction from the low-dimensional KDMD ROM
state does not change the accuracy of the reconstruction. The performance of the KDMD
ROM and the LRAN is comparable, with the LRAN showing a modest reduction in error
over all prediction times. It is interesting to note that the LRAN does not produce accurate
reconstructions using the regression-based Koopman modes. In this example, the LRAN’s
nonlinear decoder is essential for the reconstruction process. Evidently, the dictionary func-
tions learned by the encoder require nonlinearity to reconstruct the state. Again, both models
are most accurate over the specified time horizon T = 5 used for training.
Plotting some examples of ground truth and predicted test data sequences in Figure 17
illustrates the behavior of the models. These examples show that both the LRAN and the
KDMD ROM make quantitatively accurate short-term predictions. While the predictions after
t ≈ 5 lose their accuracy as one would expect when trying to make linear approximations of
chaotic dynamics, they remain qualitatively plausible. The LRAN model in particular is able
to model and predict grouping and merging events between traveling waves in the solution.
For example in Figure 17a the LRAN successfully predicts the merging of two wave crests
(in red) taking place between t = 2 and t = 5. The LRAN also predicts the meeting of
a peak and trough in Figure 17b at t = 5. These results are encouraging considering the
substantial reduction in dimensionality from a time delay embedded state of dimension 256
to a 16-dimensional encoded state having linear time evolution.
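For context, multi-step prediction with such a model amounts to encoding once, advancing the
encoded state linearly, and decoding. The sketch below assumes trained `encoder` and `decoder`
callables and the learned dynamics matrix A; these names are hypothetical stand-ins for the
trained networks.

```python
import numpy as np

def predict(x0, n_steps, encoder, A, decoder):
    """n-step prediction with a learned model: encode the snapshot once,
    advance the code linearly for n_steps, then decode. encoder/decoder
    stand in for the trained networks; A is the learned d x d dynamics
    matrix in the encoded space (here d = 16)."""
    y = encoder(x0)                                  # e.g. 256 -> 16 dimensions
    y_n = np.linalg.matrix_power(A, n_steps) @ y     # linear time evolution
    return decoder(y_n)
```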
Figure 16. Kuramoto-Sivashinsky testing data mean square relative prediction errors for each model plotted
against the prediction time
Figure 17. LRAN and KDMD ROM model predictions on Kuramoto-Sivashinsky test data examples
We also presented a nonlinear, multi-kernel reconstruction technique for improving the
reconstruction accuracy of the KDMD ROMs from very low-dimensional spaces. Our examples
show that in some cases, like the cylinder wake example, it can greatly improve the accuracy.
We think this is because the data is intrinsically low-dimensional, but curves in such a way
as to extend in many dimensions of the embedding space. The limiting case of the cylinder
flow is an extreme example where the data becomes one-dimensional on the limit cycle. In
some other cases, however, nonlinear reconstruction does not help, is sensitive to parameter
choices, or decreases the accuracy due to over-fitting.
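As a rough illustration of what a partially linear reconstruction can look like, the sketch below
fits a linear map from the reduced state to the full state and then models the residual with
Gaussian-RBF kernel ridge regression. This two-stage fit is a simple stand-in, not the joint
multi-kernel formulation used in the paper; the default kernel width and regularization constant
simply mirror the entries in Table 2.

```python
import numpy as np

def fit_partially_linear(Z, X, sigma=10.0, gamma=1e-4):
    """Two-stage stand-in for partially linear reconstruction from reduced
    states Z (M, d) to full states X (M, n): a least-squares linear map
    plus Gaussian-RBF kernel ridge regression on the residual."""
    W, *_ = np.linalg.lstsq(Z, X, rcond=None)               # linear part
    R = X - Z @ W                                           # nonlinear residual
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))                        # Gaussian RBF kernel
    alpha = np.linalg.solve(K + gamma * np.eye(len(Z)), R)
    return W, alpha

def reconstruct(z, Z_train, W, alpha, sigma=10.0):
    """Reconstruct one full state from a single reduced state z."""
    kvec = np.exp(-np.sum((Z_train - z) ** 2, axis=-1) / (2 * sigma**2))
    return z @ W + kvec @ alpha
```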
Our numerical examples indicate that unfolding the linear recurrence for many steps can
improve the accuracy of LRAN predictions, especially within the time horizon used during
training. This is observed in the error versus prediction time plots in our examples: the error
remains low and relatively flat for predictions made inside the training time horizon T . The
error then grows for predictions exceeding this length of time. However, for more complicated
systems like the Kuramoto-Sivashinsky equation, one cannot unfold the network for too many
steps before additional dimensions must be added to retain accuracy of the linear model
approximation over time. These observations are also approximately true of the finite-horizon
BPOD formulation used to create approximate balanced truncations of KDMD models. One
additional consideration in forming balanced reduced-order models from finite-horizon impulse
responses of over-specified systems is the problem of spurious eigenvalues whose associated
modes only become significant for approximations as t → ∞. The use of carefully chosen
finite time horizons allows us to pick features which are the most relevant (observable and
controllable) over the chosen horizon.
To reconstruct fine-scale details consistent with the learned dynamics, it might be possible to
employ adversarial training [8] in a similar manner to the adversarial
autoencoder [17]. Training the generative network used for reconstruction against a discrim-
inator network will encourage the generator to produce more plausible details like turbulent
eddies in fluid flows which are not easily distinguished from the real thing.
Appendix A. Hyperparameters used to train models. The same hyperparameters in
Table 1 were used to train the constrained and free LRANs in the unforced Duffing equation
example.
Table 1
Constrained LRAN hyperparameters for unforced Duffing example
Parameter Value(s)
Time-delays embedded in a snapshot 1
Encoder layer widths (left to right) 2, 32, 32, 16, 16, 8, 3
Decoder layer widths (left to right) 3, 8, 16, 16, 32, 32, 2
Snapshot sequence length, T 10
Weight decay rate, δ 0.8
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.01 per 4 × 10^5 steps
Number of training steps 4 × 10^5
Table 2 summarizes the hyperparameters used to train the KDMD Reduced Order Model
for the unforced Duffing example.
Table 2
KDMD ROM hyperparameters for unforced Duffing example
Parameter Value(s)
Time-delays embedded in a snapshot 1
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 27
BPOD time horizon, T 10
BPOD output projection rank 2 (no projection)
Balanced model order, d 3
Nonlinear reconstruction kernel function Gaussian RBF, σ = 10.0
Multi-kernel linear part truncation rank, r1 3
Multi-kernel nonlinear part truncation rank, r2 8
Multi-kernel regularization constant, γ 10^{-4}
The hyperparameters used to train the LRAN model on the cylinder wake data are given
in Table 3.
Table 4 summarizes the hyperparameters used to train the KDMD Reduced Order Model
on the cylinder wake data.
Table 3
LRAN hyperparameters for cylinder wake example
Parameter Value(s)
Time-delays embedded in a snapshot 2
Encoder layer widths (left to right) 400, 100, 50, 20, 10, 5
Decoder layer widths (left to right) 5, 10, 20, 50, 100, 400
Snapshot sequence length, T 20
Weight decay rate, δ 0.95
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.01 per 2 × 10^5 steps
Number of training steps 2 × 10^5
Table 4
KDMD ROM hyperparameters for cylinder wake example
Parameter Value(s)
Time-delays embedded in a snapshot 2
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 100
BPOD time horizon, T 20
BPOD output projection rank 100
Balanced model order, d 5
Nonlinear reconstruction kernel function Gaussian RBF, σ = 10.0
Multi-kernel linear part truncation rank, r1 5
Multi-kernel nonlinear part truncation rank, r2 15
Multi-kernel regularization constant, γ 10^{-8}
Table 5 lists the hyperparameters used to train the LRAN model on the Kuramoto-
Sivashinsky equation example.
Table 6 summarizes the hyperparameters used to train the KDMD Reduced Order Model
on the Kuramoto-Sivashinsky equation example data.
Acknowledgments. We would like to thank Scott Dawson for providing us with the
data from his cylinder wake simulations. We would also like to thank William Eggert for his
invaluable help and collaboration on the initial iterations of the LRAN code.
REFERENCES
[1] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse
identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences of
the United States of America, 113 (2016), pp. 3932–7, https://doi.org/10.1073/pnas.1517384113.
[2] M. Budišić, R. Mohr, and I. Mezić, Applied Koopmanism, Chaos: An Interdisciplinary Journal of
Nonlinear Science, 22 (2012), p. 047510.
Table 5
LRAN hyperparameters for Kuramoto-Sivashinsky example
Parameter Value(s)
Time-delays embedded in a snapshot 2
Encoder layer widths (left to right) 256, 32, 32, 16, 16
Decoder layer widths (left to right) 16, 16, 32, 32, 256
Snapshot sequence length, T 5
Weight decay rate, δ 0.9
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.1 per 2 × 10^5 steps
Number of training steps 4 × 10^5
Table 6
KDMD ROM hyperparameters for Kuramoto-Sivashinsky example
Parameter Value(s)
Time-delays embedded in a snapshot 2
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 60
BPOD time horizon, T 5
BPOD output projection rank 60
Balanced model order, d 16
Nonlinear reconstruction kernel function Gaussian RBF, σ = 100.0
Multi-kernel linear part truncation rank, r1 16
Multi-kernel nonlinear part truncation rank, r2 60
Multi-kernel regularization constant, γ 10^{-7}
[3] D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by expo-
nential linear units (ELUs), arXiv preprint arXiv:1511.07289, (2015).
[4] G. E. Dullerud and F. Paganini, A Course in Robust Control Theory : a Convex Approach, Springer
New York, 2000.
[5] M. Espinoza, J. A. K. Suykens, and B. De Moor, Kernel based partially linear models and nonlinear
identification, IEEE Transactions on Automatic Control, 50 (2005), pp. 1602–1606.
[6] T. L. B. Flinois, A. S. Morgans, and P. J. Schmid, Projection-free approximate balanced truncation
of large unstable systems, Physical Review E, 92 (2015), p. 023012.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, Generative adversarial nets, in Advances in neural information processing systems,
2014, pp. 2672–2680.
[9] P. Holmes, J. L. Lumley, G. Berkooz, and C. W. Rowley, Turbulence, Coherent Structures, Dy-
namical Systems and Symmetry, Cambridge University Press, 2012.
[10] M. R. Jovanović, P. J. Schmid, and J. W. Nichols, Sparsity-promoting dynamic mode decomposition,
Physics of Fluids, 26 (2014), p. 024103, https://doi.org/10.1063/1.4863670, http://aip.scitation.org/
doi/10.1063/1.4863670.
[11] B. O. Koopman, Hamiltonian systems and transformation in Hilbert space, Proceedings of the National
Academy of Sciences of the United States of America, 17 (1931), pp. 315–318.