Linearly Recurrent Autoencoder Networks For Learning Dynamics
Abstract. This paper describes a method for learning low-dimensional approximations of nonlinear dynamical
systems, based on neural-network approximations of the underlying Koopman operator. Extended
Dynamic Mode Decomposition (EDMD) provides a useful data-driven approximation of the Koop-
man operator for analyzing dynamical systems. This paper addresses a fundamental problem asso-
ciated with EDMD: a trade-off between representational capacity of the dictionary and over-fitting
due to insufficient data. A new neural network architecture combining an autoencoder with linear
recurrent dynamics in the encoded state is used to learn a low-dimensional and highly informative
Koopman-invariant subspace of observables. A method is also presented for balanced model reduction of over-specified EDMD systems in feature space. Nonlinear reconstruction using partially linear
multi-kernel regression aims to improve reconstruction accuracy from the low-dimensional state when
the data has complex but intrinsically low-dimensional structure. The techniques demonstrate the
ability to identify Koopman eigenfunctions of the unforced Duffing equation, create accurate low-
dimensional models of an unstable cylinder wake flow, and make short-time predictions of the chaotic
Kuramoto-Sivashinsky equation.
Key words. nonlinear systems, high-dimensional systems, reduced-order modeling, neural networks, data-
driven analysis, Koopman operator
1. Introduction. The Koopman operator, first introduced in [11], describes how Hilbert
space functions on the state of a dynamical system evolve in time. These functions, referred to
as observables, may correspond to measurements taken during an experiment or the output of
a simulation. This makes the Koopman operator a natural object to consider for data-driven
analysis of dynamical systems. Such an approach is also appealing because the Koopman
operator is linear, though infinite dimensional, enabling the concepts of modal analysis for
linear systems to be extended to dynamics of observables in nonlinear systems. Hence, the
invariant subspaces and eigenfunctions of the Koopman operator are of particular interest and
provide useful features for describing the system if they can be found. For example, level sets
of Koopman eigenfunctions may be used to form partitions of the phase space into ergodic
sets along with periodic and wandering chains of sets [2]. They allow us to parameterize limit
cycles and tori as well as their basins of attraction. The eigenvalues allow us to determine
the stability of these structures and the frequencies of periodic and quasiperiodic attractors
[20]. Furthermore, by projecting the full state as an observable onto the eigenfunctions of
the Koopman operator, it is decomposed into a linear superposition of components called
Koopman modes which each have a fixed frequency and rate of decay. Koopman modes
therefore provide useful coherent structures for studying the system’s evolution and dominant
pattern-forming behaviors. This has made the Koopman operator a particularly useful object
∗ Submitted to the editors January 17, 2019.
Funding: This work was funded by ARO award W911NF-17-0512 and DARPA.
† Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ ([email protected]).
‡ Mechanical and Aerospace Engineering, Princeton University, Princeton, NJ ([email protected]).
of study for high-dimensional spatiotemporal systems like unsteady fluid dynamics, beginning
with the work of Mezić on spectral properties of dynamical systems [19], followed by Rowley [27] and
Schmid [28] on the Dynamic Mode Decomposition (DMD) algorithm. Rowley, recognizing
that DMD furnishes an approximation of the Koopman operator and its modes, applied the
technique to data collected by simulating a jet in a crossflow. The Koopman modes identified
salient patterns of spatially coherent structure in the flow which evolved at fixed frequencies.
The Extended Dynamic Mode Decomposition (EDMD) [33] is an algorithm for approx-
imating the Koopman operator on a dictionary of observable functions using data. If a
Koopman-invariant subspace is contained in the span of the observables included in the dic-
tionary, then as long as enough data is used, the representation on this subspace will be exact.
EDMD is a Galerkin method with a particular data-driven inner product as long as enough
data is used. Specifically, this will be true as long as the rank of the data matrix is the same
as the dimension of the subspace spanned by the (nonlinear) observables [25]. However, the
choice of dictionary is ad hoc, and it is often not clear how to choose a dictionary that is
sufficiently rich to span a useful Koopman-invariant subspace. One might then be tempted
to consider a very large dictionary, with enough capacity to represent any complex-valued
function on the state space to within an ε tolerance. However, such a dictionary has combina-
torial growth with the dimension of the state space and would be enormous for even modestly
high-dimensional problems.
One approach to mitigate the cost of large or even infinite dictionaries is to formulate
EDMD as a kernel method referred to as KDMD [34]. However, we are still essentially
left with the same problem of deciding which kernel function to use. Furthermore, if the
kernel or EDMD feature space is endowed with too much representational capacity (a large
dictionary), the algorithm will over-fit the data (as we shall demonstrate with a toy problem in
Example 2.1). EDMD and KDMD also identify a number of eigenvalues, eigenfunctions, and
modes which grows with the size of the dictionary. If we want to build reduced order models
of the dynamics, a small collection of salient modes or a low-dimensional Koopman invariant
subspace must be identified. It is worth mentioning two related algorithms for identifying low-
rank approximations of the Koopman operator. Optimal Mode Decomposition (OMD) [35]
finds the optimal orthogonal projection subspace of user-specified rank for approximating the
Koopman operator. Sparsity-promoting DMD [10] is a post-processing method which identifies
the optimal amplitudes of Koopman modes for reconstructing the snapshot sequence with an ℓ_1
penalty. The sparsity-promoting penalty picks only the most salient Koopman modes to have
nonzero amplitudes. Another related scheme is Sparse Identification of Nonlinear Dynamics
(SINDy) [1] which employs a sparse regression penalty on the number of observables used
to approximate nonlinear evolution equations. By forcing the dictionary to be sparse, the
over-fitting problem is reduced.
In this paper, we present a new technique for learning a very small collection of informative
observable functions spanning a Koopman invariant subspace from data. Two neural networks
in an architecture similar to an under-complete autoencoder [7] represent the collection of ob-
servables together with a nonlinear reconstruction of the full state from these features. A
learned linear transformation evolves the function values in time as in a recurrent neural net-
work, furnishing our approximation of the Koopman operator on the subspace of observables.
This approach differs from recent efforts that use neural networks to learn dictionaries for
EDMD [37, 13] in that we employ a second neural network to reconstruct the full state. Ours
and concurrent approaches utilizing nonlinear decoder neural networks [30, 16] enable learning
of very small sets of features that carry rich information about the state and evolve linearly in
time. Previous methods for data-driven analysis based on the Koopman operator utilize linear
state reconstruction via the Koopman modes. Therefore they rely on an assumption that the
full state observable is in the learned Koopman invariant subspace. Nonlinear reconstruction
is advantageous since it relaxes this strong assumption, allowing recent techniques to recover
more information about the state from fewer observables. By minimizing the state recon-
struction error over several time steps into the future, our architecture aims to detect highly
observable features even if they have small amplitudes. This is the case in non-normal linear
systems, for instance as arise in many fluid flows (in particular, shear flows [29]), in which
small disturbances can siphon energy from mean flow gradients and excite larger-amplitude
modes. The underlying philosophy of our approach is similar to Visual Interaction Networks
(VINs) [32] that learn physics-based dynamics models for encoded latent variables.
Deep neural networks have gained attention over the last decade due to their ability to
efficiently represent complicated functions learned from data. Since each layer of the network
performs simple operations on the output of the previous layer, a deep network can learn and
represent functions corresponding to high-level or abstract features. For example, your visual
cortex assembles progressively more complex information sequentially from retinal intensity
values to edges, to shapes, all the way up to the facial features that let you recognize your
friend. By contrast, shallow networks — though still universal approximators — require ex-
ponentially more parameters to represent classes of natural functions like polynomials [23, 14]
or the presence of eyes in a photograph. Function approximation using linear combinations
of preselected dictionary elements is somewhat analogous to a shallow neural network where
capacity is built by adding more functions. We therefore expect deep neural networks to
represent certain complex nonlinear observables more efficiently than a large, shallow dictio-
nary. Even with the high representational capacity of our deep neural networks, the proposed
technique is regularized by the small number of observables we learn and is therefore unlikely
to over-fit the data.
We also present a technique for constructing reduced order models in nonlinear feature
space from over-specified KDMD models. Recognizing that the systems identified from data
by EDMD/KDMD can be viewed as state-space systems where the output is a reconstruction
of the full state using Koopman modes, we use Balanced Proper Orthogonal Decomposition
(BPOD) [24] to construct a balanced reduced-order model. The resulting model consists of
only those nonlinear features that are most excited and observable over a finite time horizon.
Nonlinear reconstruction of the full state is introduced in order to account for complicated,
but intrinsically low-dimensional data. In this way, the method is analogous to an autoencoder
where the nonlinear decoder is learned separately from the encoder and dynamics.
Finally, the two techniques we introduce are tested on a range of example problems. We
first investigate the eigenfunctions learned by the autoencoder and the KDMD reduced-order
model by identifying and parameterizing basins of attraction for the unforced Duffing equation.
The prediction accuracy of the models is then tested on a high-dimensional cylinder wake flow
problem. Finally, we see if the methods can be used to construct reduced order models for
the short-time dynamics of the chaotic Kuramoto-Sivashinsky equation. Several avenues for
future work and extensions of our proposed methods are discussed in the conclusion.
2. Extended Dynamic Mode Decomposition. Before discussing the new method for ap-
proximating the Koopman operator, it will be beneficial to review the formulation of Extended
Dynamic Mode Decomposition (EDMD) [33] and its kernel variant KDMD [34]. Besides pro-
viding the context for developing the new technique, it will be useful to compare our results
to those obtained using reduced order KDMD models.
2.1. The Koopman operator and its modes. Consider a discrete-time autonomous dy-
namical system on the state space M ⊂ Rn given by the function xt+1 = f (xt ). Let F be a
Hilbert space of complex-valued functions on M. We refer to elements of F as observables.
The Koopman operator acts on an observable ψ ∈ F by composition with the dynamics:
(1) Kψ = ψ ◦ f .
It is easy to see that the Koopman operator is linear; however, the Hilbert space F on which
it acts is often infinite dimensional.1 Since the operator K is linear, it may have eigenvalues
and eigenfunctions. If a given observable lies within the span of these eigenfunctions, then
we can predict the time evolution of the observable’s values, as the state evolves according to
the dynamics. Let g : M → C^{N_0} be a vector-valued observable whose components are in the
span of the Koopman eigenfunctions. The vector-valued coefficients needed to reconstruct g
in a Koopman eigenfunction basis are called the Koopman modes associated with g.
In particular, the dynamics of the original system can be recovered by taking the observ-
able g to be the full-state observable defined by g(x) = x. Assume K has eigenfunctions
{ϕ1 , . . . , ϕK } with corresponding eigenvalues {µ1 , . . . , µK }, and suppose the components of
the vector-valued function g lie within the span of {ϕk }. The Koopman modes ξ k are then
defined by

(2) x = \sum_{k=1}^{K} ξ_k ϕ_k(x).
The entire orbit of an initial point x0 may thus be determined by evaluating the eigenfunc-
tions at x0 and evolving the coefficients ξ k in time by multiplying by the eigenvalues. The
eigenfunctions ϕk are intrinsic features of the dynamical system which decompose the state
dynamics into a linear superposition of autonomous first-order systems. The Koopman modes
ξ k depend on the coordinates we use to represent the dynamics, and allow us to reconstruct
the dynamics in those coordinates.
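As a concrete illustration of the mode expansion (2) and its use for prediction, the following NumPy sketch evaluates the eigenfunctions once at the initial condition and evolves the coefficients by powers of the eigenvalues; the function and variable names are illustrative, not from a reference implementation.

```python
import numpy as np

def predict_orbit(x0, eigvals, eigfuns, modes, n_steps):
    """Predict an orbit via the Koopman mode expansion (2).

    eigvals : (K,) complex array of Koopman eigenvalues mu_k
    eigfuns : list of K callables evaluating the eigenfunctions phi_k at a state
    modes   : (n, K) complex array whose columns are the Koopman modes xi_k
    """
    # The eigenfunctions only ever need to be evaluated at the initial state.
    phi0 = np.array([phi(x0) for phi in eigfuns])
    orbit = []
    for t in range(n_steps):
        # x_t = sum_k xi_k * mu_k^t * phi_k(x_0)
        orbit.append((modes * (eigvals ** t) * phi0).sum(axis=1))
    return np.real(np.array(orbit))
```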
¹ One must also be careful about the choice of the space F, since ψ ◦ f must also lie in F for any ψ ∈ F. It
is common, especially in the ergodic theory literature, to assume that M is a measure space and f is measure
preserving. In this case, this difficulty goes away: one lets F = L2 (M), and since f is measure preserving, it
follows that K is an isometry.
2.2. Approximating Koopman on an explicit dictionary with EDMD. The aim of EDMD
is to approximate the Koopman operator using data snapshot pairs taken from the system
{(x_j, y_j)}_{j=1}^{M}, where y_j = f(x_j). For convenience, we organize these data into matrices

(4) X = \begin{bmatrix} x_1 & x_2 & \cdots & x_M \end{bmatrix}, \qquad Y = \begin{bmatrix} y_1 & y_2 & \cdots & y_M \end{bmatrix}.
Regardless of how the above observables are chosen, the matrix K that minimizes (9) is given by

(10) K = G^+ A, \qquad G = \frac{1}{M}\sum_{i=1}^{M} Ψ(x_i)Ψ(x_i)^*, \qquad A = \frac{1}{M}\sum_{i=1}^{M} Ψ(x_i)Ψ(y_i)^*,
in feature space i.e., after applying the now only hypothetical feature map Ψ to the snapshots.
We will see that the final results of this approach make reference only to inner products
Ψ(x)∗ Ψ(z) which will be defined using a suitable non-negative definite kernel function k(x, z).
Choice of such a kernel function implicitly defines the corresponding dictionary via Mercer’s
theorem. By employing simply-defined kernel functions, the inner products are evaluated at
a lower computational cost than would be required to evaluate a high or infinite dimensional
feature map and compute inner products in the feature space explicitly.
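Before turning to the kernel formulation, the explicit-dictionary computation in (10) is simple enough to sketch directly; the NumPy fragment below assumes the feature data matrices have already been formed, and the helper name is hypothetical.

```python
import numpy as np

def edmd_matrix(Psi_X, Psi_Y):
    """EDMD approximation (10) of the Koopman operator on an explicit dictionary.

    Psi_X, Psi_Y : arrays of shape (N_dict, M) whose columns are Psi(x_j) and Psi(y_j).
    Returns K such that Psi_Y^* is approximated by Psi_X^* K in the least-squares sense.
    """
    M = Psi_X.shape[1]
    G = Psi_X @ Psi_X.conj().T / M   # Gram matrix of the dictionary on the data
    A = Psi_X @ Psi_Y.conj().T / M   # cross matrix between current and advanced snapshots
    return np.linalg.pinv(G) @ A     # K = G^+ A
```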
The total empirical error for EDMD (9) can be written as the Frobenius norm

(12) J = \left\| Ψ_Y^* - Ψ_X^* K \right\|_F^2.
Let us consider an economy sized SVD ΨX = UΣV∗ , the existence of which is guaranteed by
the finite rank r of our feature data matrix. In (12) we see that any components of the range
R(K) orthogonal to R(ΨX ) are annihilated by Ψ∗X and cannot be inferred from the data.
We therefore restrict the dictionary to those features which can be represented in the range
of the feature space data FD = {ψa = Ψ∗ a : a ∈ R(U)} and represent K = UK̂U∗ for some
matrix K̂ ∈ Cr×r . After some manipulation, it can be shown that minimizing the empirical
error (12) with respect to K̂ is equivalent to minimizing
(13) J' = \left\| \left( V^* Ψ_Y^* - Σ K̂ U^* \right) A \right\|_F^2.
Regardless of how the columns of A are chosen, as long as R(A) = R(U) the minimum norm solution for the KDMD matrix is

(14) K̂ = Σ^+ V^* Ψ_Y^* Ψ_X V Σ^+.
Each component in the above KDMD approximation can be found entirely in terms of in-
ner products in the feature space, enabling the use of a kernel function to implicitly define
the feature space. The two matrices whose entries are [K_{XX}]_{ij} = Ψ(x_i)^* Ψ(x_j) = k(x_i, x_j) and [K_{YX}]_{ij} = [Ψ_Y^* Ψ_X]_{ij} = Ψ(y_i)^* Ψ(x_j) = k(y_i, x_j) are computed using the kernel. The Hermitian eigenvalue decomposition K_{XX} = V Σ^2 V^* provides the matrices V and Σ.
It is worth pointing out that the EDMD and KDMD solutions (10) and (14) can be
regularized by truncating the rank r of the SVD ΨX = UΣV∗ . In EDMD, we recognize
that G = \frac{1}{M} Ψ_X Ψ_X^* = \frac{1}{M} U Σ^2 U^* is a Hermitian eigendecomposition. Before finding the
pseudoinverse, the rank is truncated to remove the dyadic components having small singular
values.
2.4. Computing Koopman eigenvalues, eigenfunctions, and modes. Suppose that ϕ =
Ψ∗ w is an eigenvector of the Koopman operator in the span of the dictionary with eigenvalue
µ. Suppose also that w = Uŵ is in the span of the data in feature space. From Kϕ = µϕ
it follows that Ψ_Y^* w = µ Ψ_X^* w by substituting all of the snapshot pairs. Left-multiplying by \frac{1}{M} Ψ_X and taking the pseudoinverse, we obtain (G^+ A)w = µ(G^+ G)w = µw, where the
second equality holds because w ∈ R(ΨX ). Therefore, w is an eigenvector with eigenvalue
µ of the EDMD matrix (10). In terms of the coefficients ŵ, we have Ψ_Y^* Uŵ = µ Ψ_X^* Uŵ, which upon substituting the definition Ψ_X = UΣV^* gives Ψ_Y^* Ψ_X V Σ^+ ŵ = µ V Σ ŵ. From the previous statement it is evident that Σ^+ V^* Ψ_Y^* Ψ_X V Σ^+ ŵ = K̂ ŵ = µ ŵ. Hence, ŵ is
an eigenvector of the KDMD matrix (14) with eigenvalue µ. Unfortunately, the converses of
these statements do not hold. Nonetheless, approximations of Koopman eigenfunctions,
(15) ϕ(x) = Ψ(x)^* w = Ψ(x)^* Ψ_X V Σ^+ ŵ,

are formed using the right eigenvectors w and ŵ of K and K̂ respectively. In (15) the inner products Ψ(x)^* Ψ_X can be found by evaluating the kernel function between x and each point in the training data {x_j}_{j=1}^{M}, yielding a row vector.
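A minimal sketch of this kernel-based computation is given below: it assembles K̂ from the kernel matrices as in (14) and evaluates eigenfunctions as in (15). The function names and the default rank tolerance are illustrative assumptions.

```python
import numpy as np

def kdmd(KXX, KYX, rank=None):
    """Assemble the KDMD matrix (14) from kernel matrices and return its eigendecomposition.

    KXX[i, j] = k(x_i, x_j) and KYX[i, j] = k(y_i, x_j).
    """
    sig2, V = np.linalg.eigh(KXX)                    # K_XX = V Sigma^2 V^*
    order = np.argsort(sig2)[::-1]
    sig2, V = sig2[order], V[:, order]
    r = rank if rank is not None else np.sum(sig2 > 1e-12 * sig2[0])
    sig2, V = sig2[:r], V[:, :r]                     # truncating the rank regularizes the fit
    Sig_pinv = np.diag(1.0 / np.sqrt(sig2))
    K_hat = Sig_pinv @ V.conj().T @ KYX @ V @ Sig_pinv
    mu, W_hat = np.linalg.eig(K_hat)                 # Koopman eigenvalue approximations
    return mu, W_hat, V, Sig_pinv

def eigenfunctions_at(k_x_row, V, Sig_pinv, W_hat):
    """Evaluate (15): phi(x) = Psi(x)^* Psi_X V Sigma^+ w_hat, where k_x_row = [k(x, x_j)]_j."""
    return k_x_row @ V @ Sig_pinv @ W_hat
```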
The Koopman modes {ξ_k}_{k=1}^{r} associated with the full state observable reconstruct the state as a linear combination of Koopman eigenfunctions. They can be found from the provided training data using a linear regression process. Let us define the matrices

(16) Ξ = \begin{bmatrix} ξ_1 & ξ_2 & \cdots & ξ_r \end{bmatrix}, \qquad Φ_X = \begin{bmatrix} ϕ_1(x_1) & \cdots & ϕ_1(x_M) \\ \vdots & \ddots & \vdots \\ ϕ_r(x_1) & \cdots & ϕ_r(x_M) \end{bmatrix} = W_R^T Ψ_X = Ŵ_R^T Σ V^T,
containing the Koopman modes and eigenfunction values at the training points. In the above,
WR and ŴR are the matrices whose columns are the right eigenvectors of K and K̂ respec-
tively. Seeking to linearly reconstruct the state from the eigenfunction values at each training
point, the regression problem,
(17) \underset{Ξ \in \mathbb{C}^{n \times r}}{\text{minimize}} \; \| X - Ξ Φ_X \|_F^2,
where WL and ŴL are the left eigenvector matrices of K and K̂ respectively. These matrices
must be suitably normalized so that the left and right eigenvectors form bi-orthonormal sets
must be suitably normalized so that the left and right eigenvectors form bi-orthonormal sets W_L^* W_R = I_r and Ŵ_L^* Ŵ_R = I_r.
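The regression (17) admits a simple least-squares solution via the pseudoinverse of the eigenfunction data, sketched below; the function name is illustrative, and a regularized solver could equally be substituted.

```python
import numpy as np

def koopman_modes(X, Phi_X):
    """Solve (17): minimize ||X - Xi Phi_X||_F over Xi.

    X     : (n, M) matrix of training states
    Phi_X : (r, M) matrix of eigenfunction values at the training points
    """
    # Minimum-norm least-squares solution Xi = X Phi_X^+; columns are Koopman modes.
    return X @ np.linalg.pinv(Phi_X)
```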
2.5. Drawbacks of EDMD. One of the drawbacks of EDMD and KDMD is that the ac-
curacy depends on the chosen dictionary. For high-dimensional data sets, constructing and
evaluating an explicit dictionary becomes prohibitively expensive. Though the kernel method
allows us to use high-dimensional dictionaries implicitly, the choice of kernel function signifi-
cantly impacts the results. In both techniques, higher resolution is achieved directly by adding
more dictionary elements. Therefore, enormous dictionaries are needed in order to represent
complex features. The shallow representation of features in terms of linear combinations of
dictionary elements means that the effective size of the dictionary must be limited by the rank
of the training data in feature space. As one increases the resolution of the dictionary, the
rank r of the feature space data ΨX grows and eventually reaches the number of points M
assuming the points are distinct. The number of data points therefore is an upper bound on
the effective number of features we can retain for EDMD or KDMD. This effective dictionary
selection is implicit when we truncate the SVD of G or ΨX . It is when r = M that we
have retained enough features to memorize the data set up to projection of ΨY onto R(ΨX ).
Consequently, over-fitting becomes problematic as we seek dictionaries with high enough res-
olution to capture complex features. We illustrate this problem with the following simple
example.
Example 2.1. Let us consider the linear dynamical system

(19) x_{t+1} = f(x_t) = \begin{bmatrix} 1 & 0 \\ 0 & 0.5 \end{bmatrix} x_t

with x = [x_1, x_2]^T ∈ R^2. We construct this example to reflect the behavior of EDMD with rich dictionaries containing more elements than snapshots. Suppose that we have only two snapshot pairs,

(20) X = \begin{bmatrix} 1 & 1 \\ 1 & 0.5 \end{bmatrix}, \qquad Y = \begin{bmatrix} 1 & 1 \\ 0.5 & 0.25 \end{bmatrix},
taken by evolving the trajectory two steps from the initial condition x0 = [1, 1]T . Let us
define the following dictionary. Its first two elements are Koopman eigenfunctions whose
values are sufficient to describe the full state. In fact, EDMD recovers the original dynamics
perfectly from the given snapshots when we take only these first two observables. In this
example, we show that by including an extra, unnecessary observable we get a much worse
approximation of the dynamics. A third dictionary element which is not an eigenfunction
is included in order to demonstrate the effects of an overcomplete dictionary. With these
dictionary elements, the data matrices are
(21) Ψ(x) = \begin{bmatrix} x_1 \\ x_2 \\ (x_1)^2 + (x_2)^2 \end{bmatrix} \;\Longrightarrow\; Ψ_X = \begin{bmatrix} 1 & 1 \\ 1 & 0.5 \\ 2 & 1.25 \end{bmatrix}, \quad Ψ_Y = \begin{bmatrix} 1 & 1 \\ 0.5 & 0.25 \\ 1.25 & 1.0625 \end{bmatrix}.
Applying (10) we compute the EDMD matrix and its eigendecomposition,

(22) K = \begin{bmatrix} 0.9286 & -0.1071 & 0.7321 \\ -0.2143 & 0.1786 & -0.0536 \\ 0.1429 & 0.2143 & 0.2857 \end{bmatrix} \;\Longrightarrow\; µ_1 = 1.0413, \; µ_2 = 0, \; µ_3 = 0.3515.
It is easy to see that none of the eigenfunctions or eigenvalues are correct for the given system
even though the learned matrix satisfies ‖Ψ_Y^* − Ψ_X^* K‖_F < 6 × 10^{-15} with 16-digit precision computed with standard MATLAB tools. This shows that even with a single additional function in the dictionary, we have severely over-fit the data. This is surprising since our original dictionary included two eigenfunctions by definition. The nuance arises because EDMD is only
guaranteed to capture eigenfunctions ϕ(x) = wT Ψ(x) where w is in the span of the feature
space data R(ΨX ). In this example, the true eigenfunctions do not satisfy this condition; one
can check that neither w = [1, 0, 0]T nor w = [0, 1, 0]T is in R(ΨX ).
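The over-fitting in Example 2.1 is easy to reproduce numerically; the short NumPy sketch below assembles the matrices in (20)–(21), applies (10), and prints the near-zero residual alongside the spurious eigenvalues.

```python
import numpy as np

# Snapshot matrices (20) and three-element dictionary data (21)
X = np.array([[1.0, 1.0], [1.0, 0.5]])
Y = np.array([[1.0, 1.0], [0.5, 0.25]])
psi = lambda x: np.array([x[0], x[1], x[0]**2 + x[1]**2])
Psi_X = np.column_stack([psi(x) for x in X.T])
Psi_Y = np.column_stack([psi(y) for y in Y.T])

M = Psi_X.shape[1]
G = Psi_X @ Psi_X.T / M
A = Psi_X @ Psi_Y.T / M
K = np.linalg.pinv(G) @ A                        # EDMD matrix (10)

print(np.linalg.norm(Psi_Y.T - Psi_X.T @ K))     # ~1e-15: the snapshots are fit exactly
print(np.linalg.eigvals(K))                      # none match the true eigenvalues 1 and 0.5
```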
3. Recent approach for dictionary learning. Example 2.1 makes clear the importance of
choosing an appropriate dictionary prior to performing EDMD. In two recent papers [37, 13],
the universal function approximation property of neural networks was used to learn dictio-
naries for approximating the Koopman operator. A fixed number of observables making up
the dictionary are given by a neural network Ψ(x; θ) ∈ Rd parameterized by θ. The linear
operator KT ∈ Rd×d evolving the dictionary function values one time step into the future is
learned simultaneously through minimization of
(24) J(K, θ) = \sum_{i=1}^{M} \left\| Ψ(y_i; θ) - K^T Ψ(x_i; θ) \right\|^2 + Ω(K, θ).
S = D and the only way to increase resolution and feature complexity is to grow the dictionary
— leading to the over-fitting problems illustrated in Example 2.1. By contrast, the dictionary
learning approach enables us to keep the size of the dictionary relatively small while exploring
a much larger space S. In particular, the dictionary size is presumed to be much smaller
than the total number of training data points and probably small enough so that FX = FD .
Otherwise, the number of functions d learned by the network could be reduced so that this
becomes true. The small dictionary size therefore prevents the method from memorizing the
snapshot pairs without learning true invariant subspaces. This is not done at the expense of
resolution since the allowable complexity of functions in S is extremely high.
Deep neural networks are advantageous since they enable highly efficient representations
of certain natural classes of complex features [23, 14]. In particular, deep neural networks
are capable of learning functions whose values are built by applying many simple operations
in succession. It has been shown empirically that this is indeed an important and natural class of functions, since deep neural networks have recently enabled near-human-level performance on tasks like image and handwritten digit recognition [7]. This proved to be a useful property for dictionary learning for EDMD, since [37, 13] achieve state-of-the-art results on examples includ-
ing the Duffing equation, Kuramoto-Sivashinsky PDE, a system representing the glycolysis
pathway, and power systems.
4. New approach: deep feature learning using the LRAN. By removing the constraint
that the full state observable is in the learned Koopman invariant subspace, one can do even
better. This is especially important for high-dimensional systems where it would be pro-
hibitive to train such a large neural network-based dictionary with limited training data.
Furthermore, it may simply not be the case that the full state observable lies in a finite-
dimensional Koopman invariant subspace. The method described here is capable of learning
The technique includes a linear time evolution process given by the matrix K(θ K ) pa-
rameterized by θ K . This matrix furnishes our approximation of the Koopman operator on
the learned dictionary. Taking the eigendecomposition K = W_R Λ W_L^* allows us to compute
the Koopman eigenvalues, eigenfunctions, and modes exactly as we would for EDMD using
(15) and (18). By training the operator K simultaneously with the encoder and decoder net-
works, the dictionary of observables learned by the encoder is forced to span a low-dimensional
Koopman invariant subspace which is sufficiently informative to approximately reconstruct
the full state. In many real-world applications, the scientist has access to data sets consisting
of several sequential snapshots. The LRAN architecture shown in Figure 2 takes advantage of
longer sequences of snapshots during training. This is especially important when the system
dynamics are highly non-normal. In such systems, low-amplitude features which could oth-
erwise be ignored for reconstruction purposes are highly observable and influence the larger
amplitude dynamics several time-steps into the future. One may be able to achieve reasonable
accuracy on snapshot pairs by neglecting some of these low-energy modes, but accuracy will
suffer as more time steps are predicted. Inclusion of multiple time steps where possible forces
the LRAN to incorporate these dynamically important non-normal features in the dictionary.
As we will discuss later, it is possible to generalize the LRAN architecture to continuous time
systems with irregular sampling intervals and sequence lengths. It is also possible to restrict
the LRAN to the case when only snapshot pairs are available. Here we consider the case when
our data contains equally-spaced snapshot sequences {xt , xt+1 , . . . , xt+T −1 } of length T . The
loss function
(25) J(θ_enc, θ_dec, θ_K) = \underset{x_t,\ldots,x_{t+T-1} \sim \mathcal{P}_{data}}{E} \left[ \frac{1}{1+β} \sum_{τ=0}^{T-1} \frac{δ^τ}{N_1(δ)} \frac{\| x̂_{t+τ} - x_{t+τ} \|^2}{\| x_{t+τ} \|^2 + ε_1} + \frac{β}{1+β} \sum_{τ=1}^{T-1} \frac{δ^{τ-1}}{N_2(δ)} \frac{\| ẑ_{t+τ} - z_{t+τ} \|^2}{\| z_{t+τ} \|^2 + ε_2} \right] + Ω(θ_enc, θ_dec, θ_K)
is minimized during training, where E denotes the expectation over the data distribution. The encoding, latent state dynamics, and decoding processes are given by

z_{t+τ} = Ψ(x_{t+τ}; θ_enc), \qquad ẑ_{t+τ} = \left[ K(θ_K)^T \right]^{τ} z_t, \qquad x̂_{t+τ} = Ψ̃(ẑ_{t+τ}; θ_dec),

respectively. The regularization term Ω has been included for generality, though our numerical
experiments show that it was not necessary. Choosing a small dictionary size d provides
sufficient regularization for the network. The loss function (25) consists of a weighted average
of the reconstruction error and the hidden state time evolution error. The parameter β
determines the relative importance of these two terms. The errors themselves are relative
square errors between the predictions and the ground truth summed over time with a decaying
weight 0 < δ ≤ 1. This decaying weight is used to facilitate training by prioritizing short term
predictions. The corresponding normalizing constants,
N_1(δ) = \sum_{τ=0}^{T-1} δ^τ, \qquad N_2(δ) = \sum_{τ=1}^{T-1} δ^{τ-1},

ensure that the decay-weighted average is being taken over time. The small constants ε_1 and ε_2
are used to avoid division by 0 in the case that the ground truth values vanish. The expectation
value was estimated empirically using minibatches consisting of sequences of length T drawn
randomly from the training data. Stochastic gradient descent with the Adaptive Moment
Estimation (ADAM) method and slowly decaying learning rate was used to simultaneously
optimize the parameters θ enc , θ dec , and θ K in the open-source software package TensorFlow.
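The following NumPy sketch evaluates the loss (25) for a single snapshot sequence; in practice the quantity would be averaged over minibatches and differentiated automatically in a framework such as TensorFlow, and the default hyperparameter values shown are placeholders rather than those used in the experiments.

```python
import numpy as np

def lran_sequence_loss(x_seq, encoder, decoder, K, beta=1.0, delta=0.9,
                       eps1=1e-6, eps2=1e-6):
    """Evaluate the loss (25) on one equally spaced sequence x_seq of shape (T, n)."""
    T = x_seq.shape[0]
    z = np.array([encoder(x) for x in x_seq])        # true encoded trajectory
    z_hat = [z[0]]
    for _ in range(T - 1):                           # linear latent dynamics z -> K^T z
        z_hat.append(K.T @ z_hat[-1])
    z_hat = np.array(z_hat)
    x_hat = np.array([decoder(zh) for zh in z_hat])  # decoded state predictions

    N1 = sum(delta**tau for tau in range(T))
    N2 = sum(delta**(tau - 1) for tau in range(1, T))
    recon = sum(delta**tau / N1 * np.sum((x_hat[tau] - x_seq[tau])**2)
                / (np.sum(x_seq[tau]**2) + eps1) for tau in range(T))
    latent = sum(delta**(tau - 1) / N2 * np.sum((z_hat[tau] - z[tau])**2)
                 / (np.sum(z[tau]**2) + eps2) for tau in range(1, T))
    return recon / (1.0 + beta) + beta * latent / (1.0 + beta)
```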
4.1. Neural network architecture and initialization. The encoder and decoder consist of
deep neural networks whose schematic is sketched in Figure 3. The figure is only a sketch since
many more hidden layers were actually used in the architectures applied to example problems
in this paper. In order to achieve sufficient depth in the encoder and decoder networks, hidden
layers employed exponential linear units or “elu’s” as the nonlinearity [3]. These units mitigate
the problem of vanishing and exploding gradients in deep networks by employing the identity
function for all non-negative arguments. A shifted exponential function for negative arguments
is smoothly matched to the linear section at the origin, giving the activation function
(26) \mathrm{elu}(x) = \begin{cases} x & x \geq 0 \\ \exp(x) - 1 & x < 0 \end{cases}.
This prevents the units from “dying” as standard rectified linear units or “ReLU’s” do when
the arguments are always negative on the data. Furthermore, “elu’s” have the advantage of
being continuously differentiable. This will be a nice property if we want to approximate a C^1
data manifold whose chart map and its inverse are given by the encoder and decoder. If the
maps are differentiable, then the tangent spaces can be defined as well as push-forward, pull-
back, and connection forms. Hidden layers map the activations x(l) at layer l to activations at
the next layer l + 1 given by a linear transformation and subsequent element-wise application
of the activation function,
(27) x^{(l+1)} = \mathrm{elu}\left[ W^{(l)}(θ) x^{(l)} + b^{(l)}(θ) \right], \qquad W^{(l)}(θ) \in \mathbb{R}^{n_{l+1} \times n_l}, \; b^{(l)}(θ) \in \mathbb{R}^{n_{l+1}}.
The weight matrices W and vector biases b parameterized by θ are learned by the network
during training. The output layers L for both the encoder and decoder networks utilize linear
transformations without a nonlinear activation function:
(28) x^{(L)} = W^{(L-1)}(θ) x^{(L-1)} + b^{(L-1)}(θ), \qquad W^{(L-1)}(θ) \in \mathbb{R}^{n_L \times n_{L-1}}, \; b^{(L-1)}(θ) \in \mathbb{R}^{n_L},
where L = L_{enc} or L = L_{dec} is the last layer of the encoder or decoder with n_{L_{enc}} = d or n_{L_{dec}} = n respectively. This allows for smooth and consistent treatment of positive and
negative output values without limiting the flow of gradients back through the network.
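A bare-bones forward pass through such a network, matching (26)–(28), might look as follows; this is a sketch with simple Python lists as parameter containers, not the TensorFlow implementation used for the experiments.

```python
import numpy as np

def elu(x):
    # Exponential linear unit (26): identity for x >= 0, shifted exponential otherwise.
    return np.where(x >= 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def mlp_forward(x, weights, biases):
    """Hidden layers (27) followed by a purely linear output layer (28).

    weights, biases : per-layer parameter lists; only the last layer omits the nonlinearity.
    """
    a = x
    for W, b in zip(weights[:-1], biases[:-1]):
        a = elu(W @ a + b)
    return weights[-1] @ a + biases[-1]
```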
The weight matrices were initialized using the Xavier initializer in TensorFlow. This initialization distributes the entries in W^{(l)} uniformly over the interval [−α, α] where α = \sqrt{6/(n_l + n_{l+1})} in order to keep the scale of gradients approximately the same in all layers.
This initialization method together with the use of exponential linear units allowed deep net-
works to be used for the encoder and decoder. The bias vectors b^{(l)} were initialized to be zero. The transition matrix K was initialized to have diagonal blocks of the form \begin{bmatrix} σ & ω \\ -ω & σ \end{bmatrix} with eigenvalues λ = σ ± ωı equally spaced around the circle of radius r = \sqrt{σ^2 + ω^2} = 0.8.
This was done heuristically to give the initial eigenvalues good coverage of the unit disc. One
could also initialize this matrix using a low-rank DMD matrix.
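One possible version of this block-diagonal initialization is sketched below; the particular placement of the conjugate pairs on the circle is an illustrative choice.

```python
import numpy as np

def init_transition_matrix(d, radius=0.8):
    """Initialize K with 2x2 blocks [[s, w], [-w, s]] whose eigenvalues s +/- i w
    lie on a circle of the given radius, roughly covering the unit disc."""
    K = np.zeros((d, d))
    angles = np.linspace(0.0, np.pi, d // 2, endpoint=False)
    for j, theta in enumerate(angles):
        s, w = radius * np.cos(theta), radius * np.sin(theta)
        K[2 * j:2 * j + 2, 2 * j:2 * j + 2] = [[s, w], [-w, s]]
    if d % 2:                       # odd dimension: leave one real eigenvalue on the circle
        K[-1, -1] = radius
    return K
```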
4.2. Simple modifications of LRANs. Several extensions and modifications of the LRAN
architecture are possible. Some simple modifications are discussed here, with several more
involved extensions suggested in the conclusion. In the first extension, we observe that it is
easy to learn Koopman eigenfunctions associated with known eigenvalues simply by fixing the
appropriate entries in the matrix K. In particular, if we know that our system has Koopman
eigenvalue µ = σ + ωı then we may formulate the state transition matrix
(29) K(θ) = \begin{bmatrix} \begin{matrix} σ & ω \\ -ω & σ \end{matrix} & 0_{2 \times (d-2)} \\ 0_{(d-2) \times 2} & K̃(θ) \end{bmatrix}.
In the above, the known eigenvalue is fixed and only the entries of K̃ are allowed to be
trained. If more eigenvalues are known, we simply fix the additional entries of K in the same
way. The case where some eigenvalues are known is particularly interesting because in certain
cases, eigenvalues of the linearized system are Koopman eigenvalues whose eigenfunctions have
useful properties. Supposing the autonomous system under consideration has a fixed point
with all eigenvalues µi , i = 1, . . . , n inside the unit circle, the Hartman-Grobman theorem
establishes a topological conjugacy to a linear system with the same eigenvalues in a small
neighborhood U of the fixed point. One easily checks that the coordinate transformations hi :
M∩U → R, i = 1, . . . , n establishing this conjugacy are Koopman eigenfunctions restricted to
the neighborhood. Composing them with the flow map allows us to extend the eigenfunctions
to the entire basin of attraction by defining ϕ_i(x) = µ_i^{-τ(x)} h_i\left( f^{τ(x)}(x) \right), where τ(x) is the smallest integer τ such that f^τ(x) ∈ U. These eigenfunctions extend the topological conjugacy
by parameterizing the basin. Similar results hold for stable limit cycles and tori [20]. This is
nice because we can often find eigenvalues at fixed points explicitly by linearization. Choosing
to fix these eigenvalues in the K matrix forces the LRAN to learn corresponding eigenfunctions
parameterizing the basin of attraction. It is also easy to include a set of observables explicitly
by appending them to the encoder function Ψ(x; θ_enc) = \begin{bmatrix} Ψ_{fixed}(x)^T, & Ψ̃(x; θ_enc)^T \end{bmatrix}^T so that
only the functions Ψ̃ are learned by the network. This may be useful if we want to accurately
reconstruct some observables Ψ_{fixed} linearly using Koopman modes.
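Assembling the constrained transition matrix (29) is straightforward; a sketch with a single fixed complex eigenvalue is shown below (in a training framework only the K_free block would receive gradient updates).

```python
import numpy as np

def constrained_K(K_free, mu):
    """Build the block matrix (29) with one fixed eigenvalue mu = sigma + i*omega."""
    sigma, omega = mu.real, mu.imag
    d = K_free.shape[0] + 2
    K = np.zeros((d, d))
    K[:2, :2] = [[sigma, omega], [-omega, sigma]]   # fixed, untrained block
    K[2:, 2:] = K_free                              # trainable block K-tilde
    return K
```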
The LRAN architecture and loss function (25) may be further generalized to non-uniform
sampling of continuous-time systems. In this case, we consider T sequential snapshots
{x(t0 ), x(t1 ), . . . , x(tT −1 )} where the times t0 , t1 , . . . , tT −1 are not necessarily evenly spaced.
In the continuous time case, we have a Koopman operator semigroup K^{t+s} = K^t K^s defined as K^t ψ(x) = ψ(Φ^t(x)) and generated by the operator Kψ(x) = ψ̇(x) = (∇_x ψ(x)) f(x), where the dynamics are given by ẋ = f(x) and Φ^t is the time-t flow map. The generator K is clearly
a linear operator which we can approximate on our dictionary of observables with a matrix
K. By integrating, we can approximate elements from the semigroup K^t using the matrices K_t = \exp(Kt) on the dictionary. Finally, in order to formulate the analogous loss function,
we might utilize continuously decaying weights
(30) ρ_1(t) = \frac{δ^t}{\sum_{k=0}^{T-1} δ^{t_k}}, \qquad ρ_2(t) = \frac{δ^t}{\sum_{k=1}^{T-1} δ^{t_k}},
normalized so that they sum to 1 for the given sampling times. Neural networks can be used
for the encoder and decoder together with the loss function
(31) J(θ_enc, θ_dec, θ_K) = \underset{x(t_0),\ldots,x(t_{T-1}) \sim \mathcal{P}_{data}}{E} \left[ \frac{1}{1+β} \sum_{k=0}^{T-1} ρ_1(t_k) \frac{\| x̂(t_k) - x(t_k) \|^2}{\| x(t_k) \|^2 + ε_1} + \frac{β}{1+β} \sum_{k=1}^{T-1} ρ_2(t_k) \frac{\| ẑ(t_k) - z(t_k) \|^2}{\| z(t_k) \|^2 + ε_2} \right] + Ω(θ_enc, θ_dec, θ_K)
to be minimized during training. In this case, the dynamics evolve the observables linearly in
continuous time, so we let
z(t_k) = Ψ(x(t_k); θ_enc), \qquad ẑ(t_k) = \exp\left[ K(θ_K)(t_k - t_0) \right]^T z(t_0), \qquad x̂(t_k) = Ψ̃(ẑ(t_k); θ_dec).
This loss function can be evaluated on the training data and minimized in essentially the
same way as (25). The only difference is that we are discovering a matrix approximation to
the generator of the Koopman semigroup. We will not explore irregularly sampled continuous
time systems further in this paper, leaving it as a subject for future work.
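Propagating the encoded state under the approximate generator, as in the expressions above, reduces to a matrix exponential; a short SciPy-based sketch is given below with illustrative names.

```python
import numpy as np
from scipy.linalg import expm

def predict_latent(z0, K, times):
    """Advance z(t_0) to the (possibly unevenly spaced) sample times via
    z_hat(t_k) = exp(K (t_k - t_0))^T z(t_0)."""
    return np.array([expm(K * (t - times[0])).T @ z0 for t in times])
```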
We briefly remark that the general LRAN architecture can be restricted to the case of
snapshot pairs as shown in Figure 4. In this special case, training might be accelerated using
a technique similar to Algorithm 1. During the initial stage of training, it may be beneficial
to periodically re-initialize the K matrix with its EDMD approximation using the current
dictionary functions and a subset of the training data. This might provide a more suitable
initial condition for the matrix as well as accelerate the training process. However, this update
for K is not consistent with all the terms in the loss function J since it does not account for
reconstruction errors. Therefore, the final stages of training must always proceed by gradient
descent on the complete loss function.
Finally, we remark that the LRAN architecture sacrifices linear reconstruction using Koop-
man modes for nonlinear reconstruction using a decoder neural network in order to learn
The global coordinates on the manifold are (ℜ(α), ℑ(α)). Coordinate projection of the full state onto the NNM,

(33) \begin{bmatrix} ℜ(α) \\ ℑ(α) \end{bmatrix} = \begin{bmatrix} ℜ(w_R)^T \\ ℑ(w_R)^T \end{bmatrix} Ψ(x; θ_enc),
is accomplished by employing the encoder network and the right eigenvector wR corresponding
to eigenvalue µ. These coordinates are the real and imaginary parts of the associated Koopman
eigenfunction α = ϕ(x). The NNM has angular frequency ∠(µ)/∆t where ∆t is the sampling
interval between the snapshots in the case of the discrete time LRAN.
We may further generalize the notion of NNMs by considering the Koopman mode expan-
sion of the real-valued observable vector Ψ making up our dictionary. In this particular case,
the associated Koopman modes are the complex conjugate left eigenvectors of K. They allow
exact reconstruction and prediction using the decomposition
(34) z_t = \sum_{j=1}^{r} \overline{w}_{L,j} \, µ_j^t \, w_{R,j}^T z_0 = \sum_{j=1}^{r} \overline{w}_{L,j} \, µ_j^t \, ϕ_j(x_0),

assuming a Koopman invariant subspace has been learned that contains the full state observable.
Therefore, each invariant subspace of K given by its left eigenvectors corresponds to an in-
variant manifold in the n-dimensional phase space. These manifolds have global charts whose
coordinate projections are given by the Koopman eigenfunctions ϕ_j(x) = w_{R,j}^T Ψ(x; θ_enc).
The dynamics on these manifolds is incredibly simple and entails repeated multiplication of
the coordinates by the eigenvalues. Generalized eigenspaces may also be considered in the
natural way by using the Jordan normal form of K instead of its eigendecomposition in the
above arguments. The only necessary change is in the evolution equations, where instead of
taking powers of Λ = diag {µ1 , . . . , µr }, we take powers of J, the matrix consisting of Jordan
blocks [20]. Future work might use a variational autoencoder (VAE) formulation [7, 18, 17, 12]
where a given distribution is imposed on the latent state in order to facilitate sampling.
5. EDMD-based model reduction as a shallow autoencoder. In this section we examine
how the EDMD method might be used to construct low-dimensional Koopman invariant
subspaces while still allowing for accurate reconstructions and predictions of the full state.
The idea is to find a reduced order model of the large linear system identified by EDMD in the
space of observables. This method is sometimes called overspecification [26] and essentially
determines an encoder function into an appropriate small set of features. From this reduced set
of features, we then employ nonlinear reconstruction of the full state through a learned decoder
function. Introduction of the nonlinear decoder should allow for lower-dimensional models to
be identified which are still able to make accurate predictions. The proposed framework
therefore constructs a kind of autoencoder where encoded features evolve with linear time
invariant dynamics. The encoder functions are found explicitly as linear combinations of
EDMD observables and are therefore analogous to a shallow neural network with a single
hidden layer. The nonlinear decoder function is also found explicitly through a regression
process involving linear combinations of basis functions.
We remark that this approach differs from training an LRAN by minimization of (25)
in two important ways. First, the EDMD-based model reduction and reconstruction pro-
cesses are performed sequentially; thus, the parts are not simultaneously optimized as in the
LRAN. The LRAN is advantageous since we only learn to encode observables which the de-
coder can successfully use for reconstruction. There are no such guarantees here. Second,
the EDMD dictionary remains fixed albeit overspecified whereas the LRAN explicitly learns
an appropriate dictionary. Therefore, the EDMD shallow autoencoder framework will still
suffer from the overfitting problem illustrated in Example 2.1. If the EDMD-identified matrix
K does not correctly represent the dynamics on a Koopman invariant subspace, then any
reduced order models derived from it cannot be expected to accurately reflect the dynamics
either. Nonetheless, in many cases, this method could provide a less computationally expen-
sive alternative to training a LRAN which retains some of the benefits owing to nonlinear
reconstruction.
Dimensionality reduction is achieved by first performing EDMD with a large dictionary,
then projecting the linear dynamics onto a low-dimensional subspace. A naive approach
would be to simply project the large feature space system onto the most energetic POD
modes — equivalent to low-rank truncation of the SVD ΨX = UΣV∗ . While effective for
normal systems with a few dominant modes, this approach yields very poor predictions in
non-normal systems since low amplitude modes with large impact on the dynamics would be
excluded from the model. One method which resolves this issue is balanced truncation of
the identified feature space system. Such an idea is suggested in [26] for reducing the system
identified by linear DMD. Drawing from the model reduction procedure for snapshot-based
realizations developed in [15], we will construct a balanced reduced order model for the system
identified using EDMD or KDMD. In the formulation of EDMD that led to the kernel method,
an approximation of the Koopman operator,
was obtained. The approximation allows us to model the dynamics of a vector of observables,
where K̂ is the matrix (14) identified by EDMD or KDMD. The input ut is provided in order
to equate varying initial conditions ΨU (x0 ) with impulse responses of (38). Since the input
is used to set the initial condition, we choose to scale each component by its singular value to
reflect the covariance
(39) \underset{x \sim \mathcal{P}_{data}}{E}\left[ Ψ_U(x) Ψ_U(x)^* \right] \approx \frac{1}{M} U^* Ψ_X Ψ_X^* U = \frac{1}{M} Σ^2 = \frac{1}{M} \underset{u \sim \mathcal{N}(0, I_r)}{E}\left[ Σ u u^* Σ^* \right],
in the observed data. Therefore, initializing the system using impulse responses u0 = ej ,
j = 1, . . . , r from the σ-points of the distribution u ∼ N (0, Ir ) ensures that the correct
empirical covariances are obtained. The output matrix,
(40) C = XVΣ+ ,
is used to linearly reconstruct the full state observable from the complete set of features. It
is found using linear regression similar to the Koopman modes (18). The low-dimensional set
of observables making up the encoder will be found using a balanced reduced order model of
(38).
5.1. Balanced Model Reduction. Balanced truncation [21] is a projection-based model reduction technique that attempts to retain a subspace in which (38) is both maximally ob-
servable and controllable. While these notions generally do not coincide in the original space,
remarkably it is possible to find a left-invertible linear transformation ΨU (x) = Tz under
which these properties are balanced. This so called balancing transformation of the learning
5.2. Finite-horizon Gramians and Balanced POD. Typically one would find the infinite
horizon Gramians for an overspecified Hurwitz EDMD system (38) by solving the Lyapunov
equations
In the case of neutrally stable or unstable systems, unique positive definite solutions do
not exist and one must use generalized Gramians [38]. When used to form balanced reduced
order models, this will always result in the unstable and neutrally stable modes being chosen
before the stable modes. This could be problematic for our intended framework since EDMD
can identify many spurious and sometimes unstable eigenvalues corresponding to noisy low-
amplitude fluctuations. While these noisy modes remain insignificant over finite times of
interest, they will dominate EDMD-based predictions over long times. Therefore it makes
sense to consider the dominant modes identified by EDMD over a finite time interval of
interest. Using finite horizon Gramians reduces the effect of spurious modes on the reduced
order model, making it more consistent with the data. The time horizon can be chosen to
reflect a desired future prediction time or the number of sequential snapshots in the training
data.
The method of Balanced Proper Orthogonal Decomposition or BPOD [24] allows us to
find balancing and adjoint modes of the finite horizon system. In BPOD, we observe that the
finite-horizon Gramians are empirical covariance matrices formed by evolving the dynamics
from impulsive initial conditions for time T . This gives the specific form for matrices
(47) A = \begin{bmatrix} C^* & K̂ C^* & \cdots & (K̂)^T C^* \end{bmatrix}, \qquad B = \frac{1}{\sqrt{M}} \begin{bmatrix} Σ & K̂^* Σ & \cdots & (K̂^*)^T Σ \end{bmatrix},
allowing for computation of the balancing and adjoint modes without ever forming the Grami-
ans. This is known as the method of snapshots. Since the output dimension is large, we
consider its projection onto the most energetic modes. These are identified by forming the
economy sized SVD of the impulse responses CB = U_{OP} Σ_{OP} V_{OP}^*. Projecting the output onto these modes allows the adjoint snapshots to be formed from fewer initial conditions than A would otherwise require. In particular, the initial conditions are the first few
columns of UOP with the largest singular values [24].
Observe that the unit impulses place the initial conditions precisely at the σ-points of
the data distribution in feature space. If this distribution is Gaussian, then the empirical
expectations obtained by evolving the linear system agree with the true expectations taken
over the entire data distribution. Therefore, the finite horizon controllability Gramian cor-
responds to the covariance matrix taken over all time T trajectories starting at initial data
points coming from a Gaussian distribution in feature space. Consequently, controllability in
this case corresponds exactly with feature variance or expected square amplitude over time.
We remark that in the infinite-horizon limit T → ∞, BPOD converges on a transformation
which balances the generalized Gramians introduced in [38]. Application of BPOD to unstable
systems is discussed in [6] which provides justification for the approach.
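A sketch of this finite-horizon balancing computation is given below. It forms the snapshot matrices in (47) and then applies the usual BPOD method of snapshots (an SVD of the product of adjoint and direct snapshots), under the assumptions that A collects the observability snapshots, B the controllability snapshots, and that the latent dynamics matrix is K̂^*; the output-projection step is omitted for brevity, and the function name is illustrative.

```python
import numpy as np

def finite_horizon_bpod(K_hat, C, Sigma, horizon, M, order):
    """Balanced reduction of the feature-space system via the method of snapshots."""
    A_snaps, B_snaps = [C.conj().T], [Sigma]         # start from columns of C^* and Sigma, per (47)
    for _ in range(horizon):
        A_snaps.append(K_hat @ A_snaps[-1])
        B_snaps.append(K_hat.conj().T @ B_snaps[-1])
    A = np.hstack(A_snaps)
    B = np.hstack(B_snaps) / np.sqrt(M)

    # Hankel-like product of adjoint and direct snapshots, then truncate its SVD.
    U, s, Vh = np.linalg.svd(A.conj().T @ B, full_matrices=False)
    U, s, Vh = U[:, :order], s[:order], Vh[:order, :]
    T_bal = B @ Vh.conj().T / np.sqrt(s)             # balancing modes
    S_bal = A @ U / np.sqrt(s)                       # adjoint modes, S_bal^* T_bal = I
    # Reduced latent dynamics, assuming the feature-space state evolves by K_hat^*.
    K_reduced = S_bal.conj().T @ K_hat.conj().T @ T_bal
    return K_reduced, T_bal, S_bal
```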
Another option to avoid spurious modes from corrupting the long-time dynamics is to
consider pre-selection of EDMD modes which are nearly Koopman invariant. The development
of such an accuracy criterion for selecting modes is the subject of a forthcoming paper by H.
Zhang and C. W. Rowley. One may then apply balanced model reduction to the feature space
system containing only the most accurate modes.
A partially linear model for reconstructing the full state from the reduced feature vector z,

(49) x = C_1 z + C_2 Ψ(z) + e,

is formulated based on [5, 36] to include linear and nonlinear components. In the above, Ψ :
Cd → H is a nonlinear feature map into reproducing kernel Hilbert space H and C1 : Cd → Cn
and C_2 : H → C^n are linear operators. These operators are found by solving the ℓ_2-regularized optimization problem,

(50) \underset{C_1, C_2}{\text{minimize}} \; \| X - C_1 Z - C_2 Ψ_Z \|_F^2 + γ \operatorname{Tr}(C_2 C_2^*),

involving the empirical square error on the training data {(z_j, x_j)}_{j=1}^{M} arranged into columns of the matrices Z = \begin{bmatrix} z_1 & \cdots & z_M \end{bmatrix} and X = \begin{bmatrix} x_1 & \cdots & x_M \end{bmatrix}. The regularization penalty
is placed only on the coefficients of nonlinear terms to control over-fitting while the linear
term, which we expect to dominate, is not penalized. The operator Ψ_Z : C^M → H forms linear combinations of the data in feature space, v ↦ v_1 Ψ(z_1) + · · · + v_M Ψ(z_M). Since Z
and ΨZ are operators with finite ranks r1 and r2 ≤ M , we may consider their economy
sized singular value decompositions: Z = U1 Σ1 V1∗ and ΨZ = U2 Σ2 V2∗ . Observe that it is
impossible to infer any components of R(C∗1 ) orthogonal to R(Z) since they are annihilated
by Z∗ . Therefore, we apply Occam’s razor and assume that C∗1 = U1 Ĉ∗1 for some Ĉ∗1 ∈ Cr1 ×n .
By the same argument, R(C∗2 ) cannot have any components orthogonal to R(ΨZ ) since they
are annihilated by Ψ∗Z and have a positive contribution to the regularization penalty term
Tr(C2 C∗2 ). Hence, we must also have C∗2 = U2 Ĉ∗2 for some Ĉ∗2 ∈ Cr2 ×n . Substituting these
relationships into (50) allows it to be formulated as the standard least squares problem
(51) J = \left\| X^* - V_1 Σ_1 Ĉ_1^* - V_2 Σ_2 Ĉ_2^* \right\|_F^2 + γ \left\| Ĉ_2 \right\|_F^2 = \left\| \begin{bmatrix} X^* \\ 0_{r_2 \times n} \end{bmatrix} - \begin{bmatrix} V_1 Σ_1 & V_2 Σ_2 \\ 0_{r_2 \times r_1} & \sqrt{γ} I_{r_2} \end{bmatrix} \begin{bmatrix} Ĉ_1^* \\ Ĉ_2^* \end{bmatrix} \right\|_F^2.
The block-wise matrix clearly has full column rank r_1 + r_2 for γ > 0, and the normal equations for this least squares problem are found by projecting onto its range. The solution,
(52) \begin{bmatrix} Ĉ_1^* \\ Ĉ_2^* \end{bmatrix} = \begin{bmatrix} Σ_1 & 0_{r_1 \times r_2} \\ 0_{r_2 \times r_1} & Σ_2 \end{bmatrix}^{-1} \begin{bmatrix} I_{r_1} & V_1^* V_2 \\ V_2^* V_1 & I_{r_2} + γ Σ_2^{-2} \end{bmatrix}^{-1} \begin{bmatrix} V_1^* X^* \\ V_2^* X^* \end{bmatrix},
corresponds to taking the left pseudoinverse and simplifying the resulting expression. The
matrices V1,2 and Σ1,2 are found by solving Hermitian eigenvalue problems using the (kernel)
matrices of inner products Z^* Z = V_1 Σ_1^2 V_1^* and Ψ_Z^* Ψ_Z = V_2 Σ_2^2 V_2^*. At a new point z, the
approximate reconstruction using the partially linear kernel regression model is
(53) x \approx Ĉ_1 Σ_1^{-1} V_1^* Z^* z + Ĉ_2 Σ_2^{-1} V_2^* \left( Ψ_Z^* Ψ(z) \right).
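A NumPy sketch of the fit (52) and the evaluation (53) follows; the kernel callable, the numerical rank tolerance, and the function names are illustrative assumptions rather than a reference implementation.

```python
import numpy as np

def fit_partially_linear(Z, X, kernel, gamma, tol=1e-12):
    """Fit the model (49) by solving (50) through the closed form (52).

    Z : (d, M) latent training data, X : (n, M) full training states,
    kernel(A, B) : matrix of k(a_i, b_j) for the columns of A and B.
    """
    s1sq, V1 = np.linalg.eigh(Z.conj().T @ Z)        # Z^* Z = V1 Sigma1^2 V1^*
    s2sq, V2 = np.linalg.eigh(kernel(Z, Z))          # Psi_Z^* Psi_Z = V2 Sigma2^2 V2^*
    V1, S1 = V1[:, s1sq > tol], np.sqrt(s1sq[s1sq > tol])
    V2, S2 = V2[:, s2sq > tol], np.sqrt(s2sq[s2sq > tol])
    r1, r2 = S1.size, S2.size

    # Block system from (52)
    middle = np.block([[np.eye(r1), V1.conj().T @ V2],
                       [V2.conj().T @ V1, np.eye(r2) + gamma * np.diag(1.0 / S2**2)]])
    rhs = np.vstack([V1.conj().T @ X.conj().T, V2.conj().T @ X.conj().T])
    sol = np.diag(np.concatenate([1.0 / S1, 1.0 / S2])) @ np.linalg.solve(middle, rhs)
    C1h, C2h = sol[:r1].conj().T, sol[r1:].conj().T  # hat-C_1 and hat-C_2 from (52)
    return C1h, C2h, V1, S1, V2, S2

def reconstruct(z, Z, C1h, C2h, V1, S1, V2, S2, kernel):
    """Evaluate the reconstruction (53) at a new latent point z."""
    linear = C1h @ ((V1 / S1).conj().T @ (Z.conj().T @ z))
    nonlinear = C2h @ ((V2 / S2).conj().T @ kernel(Z, z.reshape(-1, 1)).ravel())
    return linear + nonlinear
```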
The unforced Duffing equation, ẍ = -δẋ - x(β + αx^2), is considered, where the parameters δ = 0.5, β = -1, and α = 1 are chosen. The equation exhibits stable equilibria at x = ±1 with eigenvalues λ_{1,2} = \frac{1}{4}\left( -1 ± \sqrt{31}\,ı \right) associated with the linearizations
at these points. One can show that these (continuous-time) eigenvalues also correspond to
Koopman eigenfunctions whose magnitude and complex argument act like action-angle vari-
ables parameterizing the entire basins. A non-trivial Koopman eigenfunction with eigenvalue
λ0 = 0 takes different constant values in each basin, acting like an indicator function to
distinguish them.
We will see whether the LRAN and the reduced KDMD model can learn these eigenfunc-
tions from data and use them to predict the dynamics of the unforced Duffing equation as well as to determine to which basin of attraction a given point belongs. The training data are
generated by simulating the unforced Duffing equation from uniform random initial conditions
(x(0), ẋ(0)) ∈ [−2, 2] × [−2, 2]. From each trajectory 11 samples are recorded ∆t = 0.25 apart.
The training data for LRAN models consists of M = 10^4 such trajectories. Since the KDMD method requires us to evaluate the kernel function between a given example and each training point, we limit the number of training data points to 10^3 randomly selected snapshot pairs
from the original set. It is worth mentioning that the LRAN model handles large data sets
more efficiently than KDMD since the significant cost goes into training the model which is
then inexpensive and fast to evaluate on new examples.
Since three of the Koopman eigenvalues are known ahead of time we train an LRAN model
where the transition matrix K is fixed to have discrete time eigenvalues µk = exp(λk ∆t). We
refer to this as the “constrained LRAN” and compare its performance to a “free LRAN”
model where K is learned and a 5th order balanced truncation using KDMD called “KDMD
ROM”. The hyperparameters of each model are reported in Appendix A.
The learned eigenfunctions for each model are plotted in Figures 5 to 7. The corresponding
eigenvalues learned or fixed in the model are also reported. The complex eigenfunctions are
plotted in terms of their magnitude and phase. In each case, the eigenfunction associated with
the continuous-time eigenvalue λ0 closest to zero appears to partition the phase space into
basins of attraction for each fixed point as one would expect. In order to test this hypothesis,
we use the median eigenfunction value for each model as a threshold to classify test data
points between the basins. The eigenfunction learned by the constrained LRAN was used to
correctly classify 0.9274 of the testing data points. The free LRAN eigenfunction and the
KDMD balanced reduced order model eigenfunction correctly classified 0.9488 and 0.9650 of
the testing data respectively. M_test = 11 × 10^4 test data points were used to evaluate the LRAN models, though this number was reduced to a randomly selected M_test = 1000 to test
the KDMD model due to the exceedingly high computational cost of the kernel evaluations.
The other eigenfunction learned in each case parameterizes the basins of attraction and
therefore is used to account for the dynamics in each basin. Each model appears to have
learned a similar action-angle parameterization regardless of whether the eigenvalues were
specified ahead of time. However, the constrained LRAN shows the best agreement with the
true fixed point locations at x = ±1 where |ϕ1 | → 0. The mean square relative prediction
error was evaluated for each model by making predictions on the testing data set at various
times in the future. The results plotted in Figure 8 show that the free LRAN has by far the
lowest prediction error likely due to the lack of constraints on the functions it could learn. It is
surprising, however, that nonlinear reconstruction hurt the performance of the KDMD reduced
order model. This illustrates a potential difficulty with this method since the nonlinear part
of the reconstruction is prone to over-fit without sufficient regularization.
Figure 8. Unforced Duffing testing data mean square relative prediction errors for each model plotted
against the prediction time
6.2. Cylinder wake. The next example we consider is the formation of a Kármán vortex
street downstream of a cylinder in a fluid flow. This problem was chosen since the data has low
intrinsic dimensionality due to the simple flow structure but is embedded in high-dimensional
snapshots. We are interested in whether the proposed techniques can be used to discover
very low dimensional models that accurately predict the dynamics over many time steps. We
consider the growth of instabilities near an unstable base flow shown in Figure 9a at Reynolds
number Re = 60 all the way until a stable limit cycle shown in Figure 9b is reached. The
models will have to learn to make predictions over a range of unsteady flow conditions from
the unstable equilibrium to the stable limit cycle.
Figure 9. Example cylinder wake flow snapshots at the unstable equilibrium and on the stable limit cycle
The raw data consisted of 2000 simulated snapshots of the velocity field taken at time
intervals 0.2D/U∞ , where D is the cylinder diameter and U∞ is the free-stream velocity.
These data were split into Mtrain = 1000 training, Meval = 500 evaluation, and Mtest = 500
testing data points. Odd numbered points ∆t = 0.4D/U∞ apart were used for training.
The remaining 1000 points were divided again into even and odd numbered points 2∆t =
0.8D/U∞ apart for evaluation and testing. This enabled training, evaluation, and testing on
long data sequences while retaining coverage over the complete trajectory. The continuous-
time eigenvalues are found from the discrete-time eigenvalues according to λ = log(µ)/∆t =
log(µ)U∞ /(0.4D).
The raw data was projected onto its 200 most energetic POD modes which captured
essentially all of the energy in order to reduce the cost of storage and training. 400-dimensional
time delay embedded snapshots were formed from the state at time t and t + ∆t. A 5th-
order LRAN model and the 5th-order KDMD reduced order model were trained using the
hyperparameters in Tables 3 and 4. In Figure 10a, many of the discrete-time eigenvalues
given by the over-specified KDMD model have approximately neutral stability with some
being slightly unstable. However, the finite horizon formulation for balanced truncation allows
us to learn the most dynamically salient eigenfunctions over a given length of time, in this
case T = 20 steps or 8.0D/U∞ . We see in Figure 10b that three of the eigenvalues learned by
the two models are in close agreement and all are approximately neutrally stable.
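The preprocessing described above (projection onto the leading POD modes followed by a
two-snapshot time-delay embedding) can be sketched as follows. The SVD-based POD computation,
the mean subtraction, and all variable names are illustrative assumptions, and the stand-in
array sizes are scaled down from the 2000 snapshots and 200 modes used here.

```python
import numpy as np

def pod_project(snapshots, r):
    """Project snapshots (n_snapshots x n_state) onto their r most energetic
    POD modes via the SVD. Mean subtraction is an assumption; the paper does
    not state whether the mean is removed before the projection."""
    mean = snapshots.mean(axis=0)
    X = snapshots - mean
    _, _, Vt = np.linalg.svd(X, full_matrices=False)   # rows of Vt are POD modes
    modes = Vt[:r].T                                   # n_state x r
    return X @ modes, modes, mean                      # coefficients, modes, mean

def delay_embed_pairs(coeffs):
    """Stack the reduced state at times t and t + dt into one snapshot,
    e.g. 200 POD coefficients -> 400-dimensional embedded snapshots."""
    return np.hstack([coeffs[:-1], coeffs[1:]])

# Stand-in data with dimensions scaled down for illustration.
coeffs, modes, mean = pod_project(np.random.randn(400, 1000), r=50)
embedded = delay_embed_pairs(coeffs)                   # shape (399, 100)
```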
Figure 10. Discrete-time Koopman eigenvalues approximated by the KDMD ROM and the LRAN
A side-by-side comparison of the Koopman modes gives some insight into the flow struc-
tures whose dynamics the Koopman eigenvalues describe. We notice right away that the
Koopman modes in Figure 12 corresponding to continuous-time eigenvalue λ1 are very similar
for both models and indicate the pattern of vortex shedding downstream. This makes sense
since a single frequency and mode will account for most of the amplitude as the limit cycle
is approached. Evidently both models discover these limiting periodic dynamics. For the
KDMD ROM, λ2 is almost exactly the higher harmonic 2λ1. The corresponding Koopman
mode in Figure 13 also reflects smaller flow structures which oscillate at twice the frequency
of λ1 . Interestingly, the LRAN does not learn the same second eigenvalue as the KDMD
ROM. The LRAN continuous-time eigenvalue λ2 is very close to λ1, which suggests that these
frequencies might interact to produce the low frequency λ1 − λ2. The second LRAN Koop-
man mode in Figure 13 also bears qualitative resemblance to the first Koopman mode in
Figure 12, but with a slightly narrower pattern in the y-direction. The LRAN may be using
the information at these frequencies to capture some of the slower transition process from
the unstable fixed point to the limit cycle. The Koopman modes corresponding to λ0 = 0
are also qualitatively different indicating that the LRAN and KDMD ROM are extracting
different constant features from the data. We must be careful in our interpretation, however,
since the LRAN's Koopman modes are only least-squares approximations to the nonlinear
reconstruction process performed by the decoder.
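One plausible reading of this regression-based construction is sketched below: eigenfunction
values are obtained as linear functionals of the encoded states through the eigenvectors of the
learned dynamics matrix, and the modes are then fit by least squares. The variable names and
this particular formulation are assumptions, not the paper's code.

```python
import numpy as np

def regression_koopman_modes(Y, A, X):
    """Approximate Koopman modes by least squares.

    Y : (M, d) encoded states produced by the (hypothetical) trained encoder
    A : (d, d) learned linear dynamics matrix in the encoded space
    X : (M, n) corresponding original snapshots

    With A = V diag(mu) V^{-1}, the functions phi_i(y) = (V^{-1} y)_i evolve
    as phi_i -> mu_i phi_i, so they serve as approximate eigenfunctions; the
    modes Xi are fit so that X ~ Phi @ Xi in the least-squares sense."""
    mu, V = np.linalg.eig(A)                     # discrete-time eigenvalues
    Phi = Y @ np.linalg.inv(V).T                 # eigenfunction values per snapshot
    Xi, *_ = np.linalg.lstsq(Phi, X.astype(complex), rcond=None)
    return mu, Phi, Xi                           # rows of Xi are the Koopman modes
```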
Figure 11. LRAN and KDMD ROM Koopman modes associated with λ0 ≈ 0
Figure 12. LRAN and KDMD ROM Koopman modes associated with λ1 ≈ 0.002 + 0.845ı
Figure 13. LRAN and KDMD ROM Koopman modes associated with λ2 which differs greatly between the
models
Plotting the model prediction error in Figure 14 shows that the linear reconstructions using
both models have comparable performance, with errors growing slowly over time. Therefore,
the choice of the second Koopman mode does not seem to play a large role in the recon-
struction process. However, when the nonlinear decoder is used to reconstruct the LRAN
struction process. However, when the nonlinear decoder is used to reconstruct the LRAN
predictions, the mean relative error is roughly an order of magnitude smaller than the nonlin-
early reconstructed KDMD ROM over many time steps. The LRAN has evidently learned an
advantageous nonlinear transformation for reconstructing the data using the features evolving
according to λ2 . The second Koopman mode reflects a linear approximation of this nonlinear
transformation in the least squares sense.
Another remark is that nonlinear reconstruction using the KDMD ROM did significantly
improve the accuracy in this example. This indicates that many of the complex variations in
the data are really enslaved to a small number of modes. This makes sense since the dynamics
are periodic on the limit cycle. Finally, it is worth mentioning that the prediction accuracy
was achieved on average over all portions of the trajectory from the unstable equilibrium to
the limit cycle. Both models therefore have demonstrated predictive accuracy and validity
over a wide range of qualitatively different flow conditions. The nonlinearly reconstructed
LRAN achieves a constant low prediction error over the entire time interval used for training
T ∆t = 8.0D/U∞ . The error only begins to grow outside the interval used for training. The
high prediction accuracy could likely be extended by training on longer data sequences.
Figure 14. Cylinder wake testing data mean square relative prediction errors for each model plotted against
the prediction time
6.3. Kuramoto-Sivashinsky equation. The final example uses data generated by simulating the Kuramoto-
Sivashinsky equation,
(56)    u_t + u_{xx} + u_{xxxx} + u u_x = 0,    x ∈ [0, L],
using a semi-implicit Fourier pseudo-spectral method. The length L = 8π was chosen where
the equation first begins to exhibit chaotic dynamics [9]. 128 Fourier modes were used to
resolve all of the dissipative scales. Each data set (training, evaluation, and test) consisted
of 20 simulations from different initial conditions, each with 500 recorded states spaced by
∆t = 1.0. Snapshots consisted of time-delay-embedded states at t and t + ∆t. The initial
conditions were formed by applying Gaussian random perturbations to the coefficients of the
three linearly unstable Fourier modes, those with 0 < 2πk/L < 1, i.e., k = 1, 2, 3.
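The time-stepping details beyond "semi-implicit Fourier pseudo-spectral" are not spelled out
here, so the sketch below uses a simple first-order scheme (implicit linear terms, explicit
nonlinear term) as one possible realization; the perturbation amplitude, the internal time step,
and the omission of dealiasing are our assumptions.

```python
import numpy as np

L, N = 8 * np.pi, 128
x = L * np.arange(N) / N
k = 2 * np.pi * np.fft.fftfreq(N, d=L / N)   # wavenumbers (k = n/4 for L = 8*pi)
lin = k**2 - k**4                            # from u_t = -u_xx - u_xxxx - u u_x

def random_initial_condition(rng, eps=0.1):
    """Gaussian random perturbations on the three linearly unstable modes
    (wavenumber indices 1, 2, 3); the amplitude eps is an assumption."""
    u = np.zeros(N)
    for m in (1, 2, 3):
        u += eps * rng.standard_normal() * np.cos(2 * np.pi * m * x / L) \
           + eps * rng.standard_normal() * np.sin(2 * np.pi * m * x / L)
    return u

def step(u_hat, dt):
    """One semi-implicit step: linear terms treated implicitly, the
    nonlinear term -u u_x = -(u^2/2)_x treated explicitly (no dealiasing)."""
    u = np.real(np.fft.ifft(u_hat))
    nonlinear = -0.5j * k * np.fft.fft(u * u)
    return (u_hat + dt * nonlinear) / (1.0 - dt * lin)

rng = np.random.default_rng(0)
u_hat = np.fft.fft(random_initial_condition(rng))
dt, substeps = 0.025, 40                     # 40 substeps per recorded interval of 1.0
snapshots = []
for _ in range(500):
    for _ in range(substeps):
        u_hat = step(u_hat, dt)
    snapshots.append(np.real(np.fft.ifft(u_hat)))
```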
An LRAN and a KDMD balanced ROM were trained to make predictions over a
time horizon of T = 5 steps using only d = 16-dimensional models. Model parameters are given
in Tables 5 and 6. The learned approximate Koopman eigenvalues are plotted in Figure 15.
We notice that there are some slightly unstable eigenvalues, which makes sense since there are
certainly unstable modes including the three linearly unstable Fourier modes. Additionally,
Figure 15b shows that some of the eigenvalues with large magnitude learned by the LRAN
and the KDMD ROM are in near agreement.
Figure 15. Discrete-time Koopman eigenvalues for the Kuramoto-Sivashinsky equation approximated by
the KDMD ROM and the LRAN
The plot of mean square relative prediction error on the testing data set in Figure 16 indi-
cates that our addition of nonlinear reconstruction from the low-dimensional KDMD ROM
state does not change the accuracy of the reconstruction. The performance of the KDMD
ROM and the LRAN is comparable, with the LRAN showing a modest reduction in error
over all prediction times. It is interesting to note that the LRAN does not produce accurate
reconstructions using the regression-based Koopman modes. In this example, the LRAN’s
nonlinear decoder is essential for the reconstruction process. Evidently, the dictionary func-
tions learned by the encoder require nonlinearity to reconstruct the state. Again, both models
are most accurate over the specified time horizon T = 5 used for training.
Plotting some examples of ground truth and predicted test data sequences in Figure 17
illustrates the behavior of the models. These examples show that both the LRAN and the
KDMD ROM make quantitatively accurate short-term predictions. While the predictions after
t ≈ 5 lose their accuracy as one would expect when trying to make linear approximations of
chaotic dynamics, they remain qualitatively plausible. The LRAN model in particular is able
to model and predict grouping and merging events between traveling waves in the solution.
For example in Figure 17a the LRAN successfully predicts the merging of two wave crests
(in red) taking place between t = 2 and t = 5. The LRAN also predicts the meeting of
a peak and trough in Figure 17b at t = 5. These results are encouraging considering the
substantial reduction in dimensionality from a time delay embedded state of dimension 256
to a 16-dimensional encoded state having linear time evolution.
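For context, multi-step prediction with such a model amounts to encoding once, advancing the
encoded state linearly, and decoding. The sketch below assumes trained `encoder` and `decoder`
callables and the learned dynamics matrix A; these names are hypothetical stand-ins for the
trained networks.

```python
import numpy as np

def predict(x0, n_steps, encoder, A, decoder):
    """n-step prediction with a learned model: encode the snapshot once,
    advance the code linearly for n_steps, then decode. encoder/decoder
    stand in for the trained networks; A is the learned d x d dynamics
    matrix in the encoded space (here d = 16)."""
    y = encoder(x0)                                  # e.g. 256 -> 16 dimensions
    y_n = np.linalg.matrix_power(A, n_steps) @ y     # linear time evolution
    return decoder(y_n)
```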
Figure 16. Kuramoto-Sivashinsky testing data mean square relative prediction errors for each model plotted
against the prediction time
Figure 17. LRAN and KDMD ROM model predictions on Kuramoto-Sivashinsky test data examples
We also presented a nonlinear, multi-kernel reconstruction technique for improving the
reconstruction accuracy of the KDMD ROMs from very low-dimensional spaces. Our examples
show that in some cases, like the cylinder wake example, it can greatly improve the accuracy.
We think this is because the data is intrinsically low-dimensional, but curves in such a way
as to extend in many dimensions of the embedding space. The limiting case of the cylinder
flow is an extreme example where the data becomes one-dimensional on the limit cycle. In
some other cases, however, nonlinear reconstruction does not help, is sensitive to parameter
choices, or decreases the accuracy due to over-fitting.
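As a rough illustration of what a partially linear reconstruction can look like, the sketch below
fits a linear map from the reduced state to the full state and then models the residual with
Gaussian-RBF kernel ridge regression. This two-stage fit is a simple stand-in, not the joint
multi-kernel formulation used in the paper; the default kernel width and regularization constant
simply mirror the entries in Table 2.

```python
import numpy as np

def fit_partially_linear(Z, X, sigma=10.0, gamma=1e-4):
    """Two-stage stand-in for partially linear reconstruction from reduced
    states Z (M, d) to full states X (M, n): a least-squares linear map
    plus Gaussian-RBF kernel ridge regression on the residual."""
    W, *_ = np.linalg.lstsq(Z, X, rcond=None)               # linear part
    R = X - Z @ W                                           # nonlinear residual
    sq = np.sum((Z[:, None, :] - Z[None, :, :]) ** 2, axis=-1)
    K = np.exp(-sq / (2 * sigma**2))                        # Gaussian RBF kernel
    alpha = np.linalg.solve(K + gamma * np.eye(len(Z)), R)
    return W, alpha

def reconstruct(z, Z_train, W, alpha, sigma=10.0):
    """Reconstruct one full state from a single reduced state z."""
    kvec = np.exp(-np.sum((Z_train - z) ** 2, axis=-1) / (2 * sigma**2))
    return z @ W + kvec @ alpha
```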
Our numerical examples indicate that unfolding the linear recurrence for many steps can
improve the accuracy of LRAN predictions, especially within the time horizon used during
training. This is observed in the error versus prediction time plots in our examples: the error
remains low and relatively flat for predictions made inside the training time horizon T . The
error then grows for predictions exceeding this length of time. However, for more complicated
systems like the Kuramoto-Sivashinsky equation, one cannot unfold the network for too many
steps before additional dimensions must be added to retain accuracy of the linear model
approximation over time. These observations are also approximately true of the finite-horizon
BPOD formulation used to create approximate balanced truncations of KDMD models. One
additional consideration in forming balanced reduced-order models from finite-horizon impulse
responses of over-specified systems is the problem of spurious eigenvalues whose associated
modes only become significant for approximations as t → ∞. The use of carefully chosen
finite time horizons allows us to pick features which are the most relevant (observable and
controllable) over the chosen horizon.
To reconstruct fine-scale details consistent with the learned dynamics, it might be possible to
employ adversarial training [8] in a similar manner to the adversarial
autoencoder [17]. Training the generative network used for reconstruction against a discrim-
inator network will encourage the generator to produce more plausible details like turbulent
eddies in fluid flows which are not easily distinguished from the real thing.
Appendix A. Hyperparameters used to train models. The same hyperparameters in
Table 1 were used to train the constrained and free LRANs in the unforced Duffing equation
example.
Table 1
Constrained LRAN hyperparameters for unforced Duffing example
Parameter Value(s)
Time-delays embedded in a snapshot 1
Encoder layer widths (left to right) 2, 32, 32, 16, 16, 8, 3
Decoder layer widths (left to right) 3, 8, 16, 16, 32, 32, 2
Snapshot sequence length, T 10
Weight decay rate, δ 0.8
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.01 per 4 × 10^5 steps
Number of training steps 4 × 10^5
Table 2 summarizes the hyperparameters used to train the KDMD Reduced Order Model
for the unforced Duffing example.
Table 2
KDMD ROM hyperparameters for unforced Duffing example
Parameter Value(s)
Time-delays embedded in a snapshot 1
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 27
BPOD time horizon, T 10
BPOD output projection rank 2 (no projection)
Balanced model order, d 3
Nonlinear reconstruction kernel function Gaussian RBF, σ = 10.0
Multi-kernel linear part truncation rank, r1 3
Multi-kernel nonlinear part truncation rank, r2 8
Multi-kernel regularization constant, γ 10^{-4}
The hyperparameters used to train the LRAN model on the cylinder wake data are given
in Table 3.
Table 4 summarizes the hyperparameters used to train the KDMD Reduced Order Model
on the cylinder wake data.
Table 3
LRAN hyperparameters for cylinder wake example
Parameter Value(s)
Time-delays embedded in a snapshot 2
Encoder layer widths (left to right) 400, 100, 50, 20, 10, 5
Decoder layer widths (left to right) 5, 10, 20, 50, 100, 400
Snapshot sequence length, T 20
Weight decay rate, δ 0.95
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.01 per 2 × 10^5 steps
Number of training steps 2 × 10^5
Table 4
KDMD ROM hyperparameters for cylinder wake example
Parameter Value(s)
Time-delays embedded in a snapshot 2
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 100
BPOD time horizon, T 20
BPOD output projection rank 100
Balanced model order, d 5
Nonlinear reconstruction kernel function Gaussian RBF, σ = 10.0
Multi-kernel linear part truncation rank, r1 5
Multi-kernel nonlinear part truncation rank, r2 15
Multi-kernel regularization constant, γ 10^{-8}
Table 5 lists the hyperparameters used to train the LRAN model on the Kuramoto-
Sivashinsky equation example.
Table 6 summarizes the hyperparameters used to train the KDMD Reduced Order Model
on the Kuramoto-Sivashinsky equation example data.
Acknowledgments. We would like to thank Scott Dawson for providing us with the
data from his cylinder wake simulations. We would also like to thank William Eggert for his
invaluable help and collaboration on the initial iterations of the LRAN code.
REFERENCES
[1] S. L. Brunton, J. L. Proctor, and J. N. Kutz, Discovering governing equations from data by sparse
identification of nonlinear dynamical systems, Proceedings of the National Academy of Sciences of
the United States of America, 113 (2016), pp. 3932–7, https://doi.org/10.1073/pnas.1517384113.
[2] M. Budišić, R. Mohr, and I. Mezić, Applied Koopmanism, Chaos: An Interdisciplinary Journal of
Nonlinear Science, 22 (2012), p. 047510.
Table 5
LRAN hyperparameters for Kuramoto-Sivashinsky example
Parameter Value(s)
Time-delays embedded in a snapshot 2
Encoder layer widths (left to right) 256, 32, 32, 16, 16
Decoder layer widths (left to right) 16, 16, 32, 32, 256
Snapshot sequence length, T 5
Weight decay rate, δ 0.9
Relative weight on encoded state, β 1.0
Minibatch size 50 examples
Initial learning rate 10^{-3}
Geometric learning rate decay factor 0.1 per 2 × 10^5 steps
Number of training steps 4 × 10^5
Table 6
KDMD ROM hyperparameters for Kuramoto-Sivashinsky example
Parameter Value(s)
Time-delays embedded in a snapshot 2
EDMD Dictionary kernel function Gaussian RBF, σ = 10.0
KDMD SVD rank, r 60
BPOD time horizon, T 5
BPOD output projection rank 60
Balanced model order, d 16
Nonlinear reconstruction kernel function Gaussian RBF, σ = 100.0
Multi-kernel linear part truncation rank, r1 16
Multi-kernel nonlinear part truncation rank, r2 60
Multi-kernel regularization constant, γ 10^{-7}
[3] D. Clevert, T. Unterthiner, and S. Hochreiter, Fast and accurate deep network learning by expo-
nential linear units (ELUs), arXiv preprint arXiv:1511.07289, (2015).
[4] G. E. Dullerud and F. Paganini, A Course in Robust Control Theory : a Convex Approach, Springer
New York, 2000.
[5] M. Espinoza, J. A. K. Suykens, and B. De Moor, Kernel based partially linear models and nonlinear
identification, IEEE Transactions on Automatic Control, 50 (2005), pp. 1602–1606.
[6] T. L. B. Flinois, A. S. Morgans, and P. J. Schmid, Projection-free approximate balanced truncation
of large unstable systems, Physical Review E, 92 (2015), p. 023012.
[7] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning, MIT Press, 2016.
[8] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville,
and Y. Bengio, Generative adversarial nets, in Advances in neural information processing systems,
2014, pp. 2672–2680.
[9] P. Holmes, J. L. Lumley, G. Berkooz, and C. W. Rowley, Turbulence, Coherent Structures, Dy-
namical Systems and Symmetry, Cambridge University Press, 2012.
[10] M. R. Jovanović, P. J. Schmid, and J. W. Nichols, Sparsity-promoting dynamic mode decomposition,
Physics of Fluids, 26 (2014), p. 024103, https://doi.org/10.1063/1.4863670, http://aip.scitation.org/
doi/10.1063/1.4863670.
[11] B. O. Koopman, Hamiltonian systems and transformation in Hilbert space, Proceedings of the National
Academy of Sciences of the United States of America, 17 (1931), pp. 315–318.