Continuous Adjoint Method
Many popular mathematical models, such as common hidden Markov models, utilize sequences of discrete states implicitly defined through forward difference equations,
$$u_{n+1} = u_n + \Delta_n(u_n, \psi),$$
to capture the regular evolution of a latent system; here u_n denotes the nth latent state of the system and ψ the model parameters. Typically these sequences are incorporated into larger models through discrete functionals that consume particular sequences and return scalar values,
$$J(\psi) = \sum_{n=0}^{N-1} j_n(u_n, \psi, n).$$
We can quantify the impact of the parameters, ψ, on these functionals by evaluating the total derivatives, dJ/dψ. The evaluation of these derivatives is complicated by the dependence of the sequences on the parameters enforced by the forward difference equations; the total derivative of a functional has to take into account both the explicit dependence of the j_n on ψ and also the implicit dependence mediated by the latent states u_n.
We can always compute each sensitivity, du_n/dψ, by propagating derivatives along the forward difference equations and constructing the corresponding sequence of sensitivities. This quickly becomes expensive, however, when there are many parameters that each require their own sensitivities. In order to better scale we need to bypass the superfluous computation of these intermediate derivatives and only propagate the minimal information needed to construct the total derivatives of the desired functionals.
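To make the bookkeeping concrete, the following is a minimal sketch of this forward sensitivity propagation for a toy scalar sequence; the particular forward difference Δ and summand j, and all parameter values, are hypothetical choices for illustration rather than anything from the text.

```python
# A toy forward difference equation u_{n+1} = u_n + Delta(u_n, psi) with a
# single latent state, a single parameter psi, and the functional
# J(psi) = sum_{n=0}^{N-1} j(u_n, psi).

def Delta(u, psi):        # hypothetical forward difference increment
    return psi * u * (1.0 - u)

def dDelta_du(u, psi):    # its partial derivative with respect to the state
    return psi * (1.0 - 2.0 * u)

def dDelta_dpsi(u, psi):  # its partial derivative with respect to psi
    return u * (1.0 - u)

def j(u, psi):            # hypothetical summand of the functional
    return (u - 0.5) ** 2

def dj_du(u, psi):
    return 2.0 * (u - 0.5)

def dj_dpsi(u, psi):
    return 0.0

def forward_sensitivity(u0, psi, N):
    """Propagate u_n and its sensitivity eta_n = du_n / dpsi together."""
    u, eta = u0, 0.0      # u_0 is taken to be independent of psi here
    J, dJ = 0.0, 0.0
    for n in range(N):
        J += j(u, psi)
        dJ += dj_dpsi(u, psi) + dj_du(u, psi) * eta
        # differentiate u_{n+1} = u_n + Delta(u_n, psi) through the recursion
        eta = eta + dDelta_du(u, psi) * eta + dDelta_dpsi(u, psi)
        u = u + Delta(u, psi)
    return J, dJ

print(forward_sensitivity(u0=0.1, psi=0.8, N=50))
```

Each additional parameter requires carrying its own sensitivity η_n alongside the states, which is exactly the cost the adjoint construction below avoids.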
In this paper we introduce a discrete adjoint technique that efficiently computes total
derivatives without explicitly calculating intermediate sensitivities. We begin by reviewing
the powerful continuous adjoint method for ordinary differential equations before deriving a
discrete analog. Finally we demonstrate how the method can be applied to hidden Markov
models.
Consider a system of states u(t) whose evolution is governed by a system of ordinary differential equations,
$$\frac{\mathrm{d}u}{\mathrm{d}t} = f(u, \psi, t),$$
with the initial condition
$$u(t = 0) = \upsilon(\psi).$$
A functional consumes the state trajectory and returns a single real number through an integration over time,
$$J(\psi) = \int_0^T \mathrm{d}t \, j(u, \psi, t).$$
Our goal is then to compute the total derivative of J with respect to the parameter ψ, taking into account not only the explicit dependence of j on ψ but also the implicit dependence through the influence of ψ on the evolution of the states u(t). For a thorough review of the possible strategies see Sections 2.6 and 2.7 of Hindmarsh and Serban (2020).
1.1 Adjoint Task Force
An immediate way to compute gradients of functionals like this is to explicitly compute the state sensitivities,
$$\eta = \frac{\mathrm{d}u}{\mathrm{d}\psi},$$
by integrating the forward sensitivity equations obtained by differentiating the original system,
$$\frac{\mathrm{d}\eta}{\mathrm{d}t} = \frac{\partial f}{\partial u} \cdot \eta + \frac{\partial f}{\partial \psi}, \qquad \eta(0) = \frac{\partial \upsilon}{\partial \psi},$$
alongside the states themselves.
Once we’ve solved for the state sensitivities we can construct the total derivative of the
desired functional through the chain rule,
$$\begin{aligned}
\frac{\mathrm{d}J}{\mathrm{d}\psi}(\psi) &= \frac{\mathrm{d}}{\mathrm{d}\psi} \int_0^T \mathrm{d}t \, j(u, \psi, t) \\
&= \int_0^T \mathrm{d}t \, \frac{\mathrm{d}}{\mathrm{d}\psi} j(u, \psi, t) \\
&= \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial \psi} + \left( \frac{\partial j}{\partial u} \right)^{\dagger} \cdot \eta \right].
\end{aligned}$$
This approach becomes burdensome, however, once we consider multiple parameters and
hence multiple total derivatives, each of which requires integrating over its own trajectory
of sensitivities.
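As a concrete illustration of this direct approach, the following sketch integrates a toy one-dimensional system together with its sensitivity and the running values of J and dJ/dψ; the particular f, j, and parameter values are hypothetical choices for illustration, not anything prescribed by the text.

```python
from scipy.integrate import solve_ivp

# Toy system: du/dt = f(u, psi) = -psi * u with u(0) = 1 (independent of psi)
# and the functional J(psi) = int_0^T u(t)^2 dt, so j(u, psi, t) = u^2.
psi, T = 0.7, 2.0

def augmented_rhs(t, state):
    u, eta, J, dJ = state
    du   = -psi * u              # f(u, psi)
    deta = -psi * eta - u        # (df/du) * eta + df/dpsi
    dj   = u ** 2                # j(u, psi, t)
    ddj  = 2.0 * u * eta         # dj/dpsi + (dj/du) * eta
    return [du, deta, dj, ddj]

sol = solve_ivp(augmented_rhs, (0.0, T), [1.0, 0.0, 0.0, 0.0],
                rtol=1e-10, atol=1e-12)
u_T, eta_T, J, dJ_dpsi = sol.y[:, -1]
print(J, dJ_dpsi)
```

With K parameters the single sensitivity η becomes a full matrix of sensitivities and the augmented system grows accordingly, which is the scaling problem addressed below.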
Another way to work out the total derivative of the functional is to treat the influence of
the parameter on the state trajectory as constraints (Hannemann-Tamás, Muñoz and Marquardt,
2015),
$$0 = u(0) - \upsilon(\psi)$$
$$0 = \frac{\mathrm{d}u}{\mathrm{d}t} - f(u, \psi, t),$$
which are explicitly incorporated into the functional with Lagrange multipliers, µ and λ(t),
$$\begin{aligned}
J(\psi) &= \int_0^T \mathrm{d}t \, j(u, \psi, t) \\
&= 0 + \int_0^T \mathrm{d}t \, j(u, \psi, t) + 0 \\
&= \mu^{\dagger} \cdot \left[ u(0) - \upsilon(\psi) \right]
+ \int_0^T \mathrm{d}t \left[ j(u, \psi, t) + \lambda^{\dagger}(t) \cdot \left( \frac{\mathrm{d}u}{\mathrm{d}t} - f(u, \psi, t) \right) \right] \\
&\equiv \mathcal{L}(\psi).
\end{aligned}$$
As long as the constraints are satisfied this modified functional will equal our target functional for any values of the Lagrange multipliers.
Under these constraints we can compute the total derivative of the functional by instead
differentiating this modified functional. If we assume that everything is smooth then we
can exchange the order of integration and differentiation to give
$$\begin{aligned}
\frac{\mathrm{d}J}{\mathrm{d}\psi} &= \frac{\mathrm{d}\mathcal{L}}{\mathrm{d}\psi} \\
&= \mu^{\dagger} \cdot \left[ \frac{\mathrm{d}u}{\mathrm{d}\psi}(0) - \frac{\mathrm{d}\upsilon}{\mathrm{d}\psi} \right]
+ \int_0^T \mathrm{d}t \left[ \frac{\mathrm{d}j}{\mathrm{d}\psi} + \lambda^{\dagger}(t) \cdot \left( \frac{\mathrm{d}}{\mathrm{d}\psi} \frac{\mathrm{d}u}{\mathrm{d}t} - \frac{\mathrm{d}f}{\mathrm{d}\psi} \right) \right] \\
&= \mu^{\dagger} \cdot \left[ \frac{\mathrm{d}u}{\mathrm{d}\psi}(0) - \frac{\partial \upsilon}{\partial \psi} \right]
+ \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial \psi} + \left( \frac{\partial j}{\partial u} \right)^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}
+ \lambda^{\dagger}(t) \cdot \left( \frac{\mathrm{d}}{\mathrm{d}t} \frac{\mathrm{d}u}{\mathrm{d}\psi} - \frac{\partial f}{\partial \psi} - \frac{\partial f}{\partial u} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi} \right) \right].
\end{aligned}$$
Once again a boldfaced fraction is shorthand for a Jacobian matrix. For example,
$$\frac{\partial j}{\partial u} = \left( \frac{\partial j}{\partial u_1}, \ldots, \frac{\partial j}{\partial u_N} \right)^{\dagger}.$$
The benefit of this approach is that we can use the freedom in our Lagrange multipliers
to eliminate the expensive state sensitivities entirely! First we need to integrate the time
derivative of the sensitivities by parts to recover a pure sensitivity,
$$\int_0^T \mathrm{d}t \, \lambda^{\dagger}(t) \cdot \frac{\mathrm{d}}{\mathrm{d}t} \frac{\mathrm{d}u}{\mathrm{d}\psi}
= \lambda^{\dagger}(T) \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(T)
- \lambda^{\dagger}(0) \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(0)
- \int_0^T \mathrm{d}t \, \left( \frac{\mathrm{d}\lambda}{\mathrm{d}t} \right)^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}.$$
Then we substitute this result into the total derivative and gather all the sensitivity terms
together,
$$\begin{aligned}
\frac{\mathrm{d}J}{\mathrm{d}\psi}
&= \mu^{\dagger} \cdot \left[ \frac{\mathrm{d}u}{\mathrm{d}\psi}(0) - \frac{\partial \upsilon}{\partial \psi} \right]
+ \lambda^{\dagger}(T) \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(T)
- \lambda^{\dagger}(0) \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(0) \\
&\quad + \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial \psi}
+ \left( \frac{\partial j}{\partial u} \right)^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}
- \left( \frac{\mathrm{d}\lambda}{\mathrm{d}t} \right)^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}
- \lambda^{\dagger}(t) \cdot \frac{\partial f}{\partial \psi}
- \lambda^{\dagger}(t) \cdot \frac{\partial f}{\partial u} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi} \right] \\
&= \left[ \mu - \lambda(0) \right]^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(0)
- \mu^{\dagger} \cdot \frac{\partial \upsilon}{\partial \psi}
+ \lambda^{\dagger}(T) \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(T) \\
&\quad + \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial \psi} - \lambda^{\dagger}(t) \cdot \frac{\partial f}{\partial \psi} \right]
+ \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial u} - \frac{\mathrm{d}\lambda}{\mathrm{d}t} - \lambda(t) \cdot \frac{\partial f}{\partial u} \right]^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}.
\end{aligned}$$
Now we can exploit the freedom in our Lagrange multipliers to remove all vestiges of
the sensitivities. First let’s set µ = λ(0) to remove the initial sensitivities and λ(T ) = 0 to
remove the final sensitivities. We can then remove the integral term that depends on the
intermediate sensitivities if we set
$$\frac{\partial j}{\partial u} - \frac{\mathrm{d}\lambda}{\mathrm{d}t} - \lambda(t) \cdot \frac{\partial f}{\partial u} = 0,$$
or
$$\frac{\mathrm{d}\lambda}{\mathrm{d}t} = \frac{\partial j}{\partial u} - \lambda(t) \cdot \frac{\partial f}{\partial u}.$$
In other words, provided that λ(t) satisfies the differential equation
$$\frac{\mathrm{d}\lambda}{\mathrm{d}t} = \frac{\partial j}{\partial u}(u, \psi, t) - \lambda(t) \cdot \frac{\partial f}{\partial u}(u, \psi, t)$$
with the initial condition
$$\lambda(T) = 0,$$
then the total derivative of our target functional reduces to
$$\frac{\mathrm{d}J}{\mathrm{d}\psi}(\psi) = -\lambda^{\dagger}(0) \cdot \frac{\partial \upsilon}{\partial \psi}
+ \int_0^T \mathrm{d}t \left[ \frac{\partial j}{\partial \psi}(u, \psi, t) - \lambda^{\dagger}(t) \cdot \frac{\partial f}{\partial \psi}(u, \psi, t) \right].$$
The system of differential equations for λ(t) is known as the adjoint system relative to
the original system of ordinary differential equations. If we first solve for u(t) then we
can solve for the adjoint λ(t) and compute the total derivative dJ /dψ at the same time
without having to compute any explicit sensitivities.
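The following sketch applies these formulas to the same toy system used above: a forward solve for u(t), a backward solve for the adjoint λ(t), and a one-dimensional quadrature for the total derivative accumulated during the backward pass. As before the specific f, j, and parameter values are hypothetical illustrations rather than anything prescribed by the text.

```python
from scipy.integrate import solve_ivp

# Toy system: du/dt = f(u, psi) = -psi * u, u(0) = upsilon(psi) = 1, and
# J(psi) = int_0^T u(t)^2 dt so that j(u, psi, t) = u^2.
psi, T = 0.7, 2.0

def df_du(u):    return -psi
def df_dpsi(u):  return -u
def dj_du(u):    return 2.0 * u
def dj_dpsi(u):  return 0.0

# Forward solve for the state trajectory, with dense output for interpolation.
fwd = solve_ivp(lambda t, u: [-psi * u[0]], (0.0, T), [1.0],
                dense_output=True, rtol=1e-10, atol=1e-12)

# Backward solve for the adjoint, dlambda/dt = dj/du - lambda * df/du with
# lambda(T) = 0, together with the integrand dj/dpsi - lambda * df/dpsi.
def adjoint_rhs(t, state):
    u = fwd.sol(t)[0]
    lam, _ = state
    return [dj_du(u) - lam * df_du(u),
            dj_dpsi(u) - lam * df_dpsi(u)]

bwd = solve_ivp(adjoint_rhs, (T, 0.0), [0.0, 0.0], rtol=1e-10, atol=1e-12)
lam0, quad = bwd.y[:, -1]

# dJ/dpsi = -lambda(0) * dupsilon/dpsi + int_0^T [dj/dpsi - lambda * df/dpsi] dt;
# the quadrature was accumulated from T down to 0, so flip its sign, and the
# initial condition does not depend on psi here so the boundary term vanishes.
dupsilon_dpsi = 0.0
dJ_dpsi = -lam0 * dupsilon_dpsi - quad
print(dJ_dpsi)
```

The result can be checked against the forward sensitivity sketch above, or against the closed form J(ψ) = (1 − e^{−2ψT})/(2ψ) for this toy system.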
1.2 Computational Scalings
For a single parameter the direct approach is slightly more efficient, requiring two N-dimensional integrations for the states and their sensitivities compared to the adjoint approach which requires two N-dimensional integrations, one for the states and one for the adjoint states, and the extra one-dimensional integration to solve for the total derivative.
The adjoint method, however, quickly becomes more efficient as we consider multiple parameters because the adjoint states are the same for all parameters.
When we have K parameters the forward sensitivity approach requires an N-dimensional integration for each sensitivity and the total cost scales as N + N · K. The adjoint approach, however, requires only two N-dimensional solves to set up the states and the adjoint states and then K one-dimensional solves for each gradient component, yielding a total cost scaling of 2N + K.
Comparing these two scalings we see that the adjoint method is better when
N
< K,
N −1
a condition verified for any N provided that K ≥ 2. In other words the adjoint method will generally feature the highest performance in any application with at least two parameters. As the number of parameters increases the O(NK) scaling of the forward sensitivity approach grows much faster than the O(N + K) scaling of the adjoint method, and the performance gap only becomes more substantial.
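As a concrete illustration with hypothetical sizes, take N = 10 states and K = 100 parameters:
$$N + N \cdot K = 10 + 1000 = 1010 \qquad \text{versus} \qquad 2N + K = 20 + 100 = 120,$$
an order-of-magnitude difference that only widens as K grows.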
1.3 An Application to Automatic Differentiation
A particularly useful application of the continuous adjoint method is for the reverse mode
automatic differentiation (Bücker et al., 2006; Griewank and Walther, 2008; Margossian,
2019) of functions incorporating the solutions of ordinary differential equations. In order
to propagate the needed differential information through the composite function we need
to be able to evaluate the Jacobian of the final state with respect to the parameters,
$$\frac{\mathrm{d}u}{\mathrm{d}\psi}(T),$$
contracted against a vector, δ,
$$\delta^{\dagger} \cdot \frac{\mathrm{d}u}{\mathrm{d}\psi}(T),$$
where † denotes transposition. This arises, for example, when computing the gradient of a scalar function, such as a probability density or an objective function, that implicitly depends on ψ through u.
We can recover the above contraction by defining the integrand
$$j(u, \psi, t) = \delta^{\dagger} \cdot f(u, \psi, t).$$
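To see why this integrand is convenient, consider what the adjoint system of the previous section becomes for it; the short calculation below is our own sketch of where the construction leads, with λ̃ introduced purely as a label of convenience, and is not necessarily how the full derivation proceeds. Because ∂j/∂u = (∂f/∂u)† · δ the adjoint system reads
$$\frac{\mathrm{d}\lambda}{\mathrm{d}t} = \left( \frac{\partial f}{\partial u} \right)^{\dagger} \cdot \left( \delta - \lambda(t) \right), \qquad \lambda(T) = 0,$$
so that λ̃(t) ≡ δ − λ(t) satisfies the homogeneous system
$$\frac{\mathrm{d}\tilde{\lambda}}{\mathrm{d}t} = -\left( \frac{\partial f}{\partial u} \right)^{\dagger} \cdot \tilde{\lambda}(t), \qquad \tilde{\lambda}(T) = \delta,$$
which is the familiar costate that reverse-mode automatic differentiation propagates backwards through an ordinary differential equation solve.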
As in the continuous case we can exploit the freedom in our Lagrange multipliers to remove all of the sensitivity terms. We first set
$$\mu + \frac{\partial j_0}{\partial u_0} - \lambda_0 \cdot \frac{\partial \Delta_0}{\partial u_0} - \lambda_0 = 0,$$
or
$$\mu = -\frac{\partial j_0}{\partial u_0} + \lambda_0 \cdot \frac{\partial \Delta_0}{\partial u_0} + \lambda_0,$$
and then
$$\lambda_{N-1} = 0$$
to remove all the sensitivities outside of the summations. We then eliminate the second summation by choosing the rest of the λ_n to satisfy
$$\frac{\partial j_n}{\partial u_n} - \lambda_n + \lambda_{n-1} - \lambda_n \cdot \frac{\partial \Delta_n}{\partial u_n} = 0,$$
or equivalently
$$\frac{\partial j_{n+1}}{\partial u_{n+1}} - \lambda_{n+1} + \lambda_n - \lambda_{n+1} \cdot \frac{\partial \Delta_{n+1}}{\partial u_{n+1}} = 0.$$
This defines an adjoint system governed by the backward difference equations
$$\lambda_n - \lambda_{n+1} = -\frac{\partial j_{n+1}}{\partial u_{n+1}} + \lambda_{n+1} \cdot \frac{\partial \Delta_{n+1}}{\partial u_{n+1}}.$$
Fig 1. [Graphical model with observations y_{n−1}, y_n, y_{n+1} and hidden states z_{n−1}, z_n, z_{n+1}.] The conditional dependence structure of a hidden Markov model admits efficient marginalization of the discrete hidden states into state probabilities. Derivatives of the state probabilities with respect to the model parameters also have to navigate this conditional dependence structure.
Writing the observational densities and transition probabilities as
$$\omega_{n,i} \equiv \pi(y_n \mid z_n = i)$$
$$\Gamma_{n,ij} \equiv \pi(z_{n+1} = i \mid z_n = j)$$
we can marginalize the hidden states into the forward state probabilities α_n(ψ), whose component α_{n,i}(ψ) gives the joint probability of the observations up to index n and the hidden state z_n = i.
Because of the defining conditional structure these state probabilities satisfy the recursion relation
$$\alpha_{n+1}(\psi) = \omega_{n+1}(\psi) \circ \left( \Gamma_{n+1}(\psi) \cdot \alpha_n(\psi) \right),$$
where ◦ denotes the element-wise Hadamard product, along with the initial condition
$$\alpha_0(\psi) = \omega_0(\psi) \circ \rho(\psi),$$
where ρ denotes the initial hidden state probabilities.
Forward solving the recursion relation efficiently computes each of the state probabilities, the last of which gives the desired marginal likelihood
$$\pi(y_1, \ldots, y_N, \psi) = \sum_{m=1}^{M} \alpha_{N,m}(\psi) = \mathbf{1}^{\dagger} \cdot \alpha_N(\psi).$$
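A minimal sketch of this forward pass in Python, with conventions matching the text (ω_n, Γ_n, ρ, and α_n as above); the array layout and function name are our own illustrative choices.

```python
import numpy as np

def hmm_forward(omega, Gamma, rho):
    """Forward pass of a hidden Markov model with M hidden states.

    omega : (N + 1, M) array, omega[n, i] = pi(y_n | z_n = i)
    Gamma : (N + 1, M, M) array of transition matrices; Gamma[0] is unused
            padding so that indices match the text
    rho   : (M,) array of initial hidden state probabilities
    """
    N = omega.shape[0] - 1
    alpha = np.empty_like(omega)
    alpha[0] = omega[0] * rho                          # alpha_0 = omega_0 o rho
    for n in range(N):
        # alpha_{n+1} = omega_{n+1} o (Gamma_{n+1} . alpha_n)
        alpha[n + 1] = omega[n + 1] * (Gamma[n + 1] @ alpha[n])
    marginal = alpha[N].sum()                          # 1' . alpha_N
    return alpha, marginal
```

A finite difference of the returned marginal with respect to any parameter entering ω, Γ, or ρ gives a direct check on the adjoint gradient constructed below.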
$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\psi} \pi(y_1, \ldots, y_N)
&= \sum_{j=0}^{N-1} \left[ \left[ \prod_{i=N-1}^{j+2} \Gamma^{\dagger}_{i+1} \cdot \Omega^{\dagger}_{i} \right] \cdot \mathbf{1} \right]^{\dagger} \cdot \left[ \frac{\mathrm{d}\Omega_{j+1}}{\mathrm{d}\psi} \cdot \Gamma_{j+2} + \Omega_{j+1} \cdot \frac{\mathrm{d}\Gamma_{j+2}}{\mathrm{d}\psi} \right] \cdot \alpha_{j} \\
&\quad + \left[ \left[ \prod_{i=N-1}^{1} \Gamma^{\dagger}_{i+1} \cdot \Omega^{\dagger}_{i} \right] \cdot \mathbf{1} \right]^{\dagger} \cdot \left[ \frac{\mathrm{d}\Omega_{0}}{\mathrm{d}\psi} \cdot \rho + \Omega_{0} \cdot \frac{\mathrm{d}\rho}{\mathrm{d}\psi} \right] \\
&= \sum_{j=0}^{N-1} \beta^{\dagger}_{j+1} \cdot \left[ \frac{\mathrm{d}\Omega_{j+1}}{\mathrm{d}\psi} \cdot \Gamma_{j+2} + \Omega_{j+1} \cdot \frac{\mathrm{d}\Gamma_{j+2}}{\mathrm{d}\psi} \right] \cdot \alpha_{j}
+ \beta^{\dagger}_{0} \cdot \left[ \frac{\mathrm{d}\Omega_{0}}{\mathrm{d}\psi} \cdot \rho + \Omega_{0} \cdot \frac{\mathrm{d}\rho}{\mathrm{d}\psi} \right],
\end{aligned}$$
A third, novel approach to deriving the marginal likelihood gradient is to interpret the recursion as a forward difference equation and apply the discrete adjoint method. Let u_n = α_n and manipulate the defining recursion relation into a forward difference
$$\Delta_n = \omega_{n+1} \circ (\Gamma_{n+1} \cdot \alpha_n) - \alpha_n,$$
so that
$$\frac{\partial \Delta_{n,i}}{\partial \alpha_{n,j}} = \omega_{n+1,i} \, \Gamma_{n+1,ij} - \delta_{ij}$$
and hence
$$\sum_{i=1}^{K} (\lambda_{n,i} - 1) \frac{\partial \Delta_{n,i}}{\partial \alpha_{n,j}}
= \sum_{i=1}^{K} (\lambda_{n,i} - 1) \, \omega_{n+1,i} \, \Gamma_{n+1,ij} - (\lambda_{n,j} - 1),$$
or in matrix notation,
$$(\lambda_n - \mathbf{1}) \cdot \frac{\partial \Delta_n}{\partial \alpha_n}
= \Gamma^{\dagger}_{n+1} \cdot \left( \omega_{n+1} \circ (\lambda_n - \mathbf{1}) \right) - \lambda_n + \mathbf{1}.$$
The backwards updates then become
$$\begin{aligned}
\lambda_n - \lambda_{n+1} &= (\lambda_{n+1} - \mathbf{1}) \cdot \frac{\partial \Delta_{n+1}}{\partial \alpha_{n+1}} \\
\lambda_n - \lambda_{n+1} &= \Gamma^{\dagger}_{n+2} \cdot \left( \omega_{n+2} \circ (\lambda_{n+1} - \mathbf{1}) \right) - \lambda_{n+1} + \mathbf{1} \\
\lambda_n &= \Gamma^{\dagger}_{n+2} \cdot \left( \omega_{n+2} \circ (\lambda_{n+1} - \mathbf{1}) \right) + \mathbf{1},
\end{aligned}$$
or, in terms of κ_n ≡ 1 − λ_n,
$$\kappa_n = \Gamma^{\dagger}_{n+2} \cdot \left( \omega_{n+2} \circ \kappa_{n+1} \right),$$
which is just the backward states encountered above with a shifted index,
$$\kappa_n = \beta_{n-1}.$$
Lastly we work out the boundary term. Recalling that υ = ω_0 ∘ ρ, the boundary term is
$$\begin{aligned}
\left[ \mathbf{1} + \frac{\partial j_0}{\partial \alpha_0} - \lambda_0 \cdot \frac{\partial \Delta_0}{\partial \alpha_0} - \lambda_0 \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi}
&= \left[ \mathbf{1} + \mathbf{1} \cdot \frac{\partial \Delta_0}{\partial \alpha_0} - \lambda_0 \cdot \frac{\partial \Delta_0}{\partial \alpha_0} - \lambda_0 \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi} \\
&= \left[ (\mathbf{1} - \lambda_0) \cdot \frac{\partial \Delta_0}{\partial \alpha_0} + \mathbf{1} - \lambda_0 \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi} \\
&= \left[ \Gamma^{\dagger}_{1} \cdot \left( \omega_1 \circ (\mathbf{1} - \lambda_0) \right) - (\mathbf{1} - \lambda_0) + \mathbf{1} - \lambda_0 \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi} \\
&= \left[ \Gamma^{\dagger}_{1} \cdot \left( \omega_1 \circ (\mathbf{1} - \lambda_0) \right) \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi} \\
&= \left[ \Gamma^{\dagger}_{1} \cdot \left( \omega_1 \circ \kappa_0 \right) \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi} \\
&= \left[ \Gamma^{\dagger}_{1} \cdot \left( \omega_1 \circ \kappa_0 \right) \right]^{\dagger} \cdot \left[ \omega_0 \circ \frac{\partial \rho}{\partial \psi} + \frac{\partial \omega_0}{\partial \psi} \circ \rho \right].
\end{aligned}$$
Putting all of this together we can recover the derivative of the marginal likelihood by computing
$$\begin{aligned}
\frac{\mathrm{d}}{\mathrm{d}\psi} \pi(y_1, \ldots, y_N)
&= \mathbf{1}^{\dagger} \cdot \frac{\mathrm{d}\alpha_N}{\mathrm{d}\psi} \\
&= \left[ \mathbf{1} + \frac{\partial j_0}{\partial \alpha_0} - \lambda_0 \cdot \frac{\partial \Delta_0}{\partial \alpha_0} - \lambda_0 \right]^{\dagger} \cdot \frac{\partial (\omega_0 \circ \rho)}{\partial \psi}
+ \sum_{n=0}^{N-1} \left[ \frac{\partial j_n}{\partial \psi} - \lambda^{\dagger}_{n} \cdot \frac{\partial \Delta_n}{\partial \psi} \right] \\
&= \left[ \Gamma^{\dagger}_{1} \cdot \left( \omega_1 \circ \kappa_0 \right) \right]^{\dagger} \cdot \left[ \omega_0 \circ \frac{\partial \rho}{\partial \psi} + \frac{\partial \omega_0}{\partial \psi} \circ \rho \right]
+ \sum_{n=0}^{N-1} \kappa^{\dagger}_{n} \cdot \left[ \frac{\partial \omega_{n+1}}{\partial \psi} \circ \left( \Gamma_{n+1} \cdot \alpha_n \right) + \omega_{n+1} \circ \left( \frac{\partial \Gamma_{n+1}}{\partial \psi} \cdot \alpha_n \right) \right].
\end{aligned}$$
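Putting the forward and backward passes together gives a compact implementation of this gradient. The sketch below is our own rendering of the final formula for a single scalar parameter ψ; domega, dGamma, and drho are assumed to hold the elementwise derivatives of ω, Γ, and ρ with respect to ψ, and the conventions follow the hmm_forward sketch above.

```python
import numpy as np

def hmm_marginal_gradient(omega, Gamma, rho, domega, dGamma, drho):
    """Discrete adjoint gradient of the HMM marginal likelihood in psi."""
    N, M = omega.shape[0] - 1, omega.shape[1]

    # Forward pass for the state probabilities alpha_0, ..., alpha_N.
    alpha = np.empty((N + 1, M))
    alpha[0] = omega[0] * rho
    for n in range(N):
        alpha[n + 1] = omega[n + 1] * (Gamma[n + 1] @ alpha[n])

    # Backward pass for the adjoint states: kappa_{N-1} = 1 and
    # kappa_n = Gamma_{n+2}^T . (omega_{n+2} o kappa_{n+1}).
    kappa = np.empty((N, M))
    kappa[N - 1] = np.ones(M)
    for n in range(N - 2, -1, -1):
        kappa[n] = Gamma[n + 2].T @ (omega[n + 2] * kappa[n + 1])

    # Boundary term from the initial condition alpha_0 = omega_0 o rho.
    grad = (Gamma[1].T @ (omega[1] * kappa[0])) @ (omega[0] * drho + domega[0] * rho)

    # Contributions from each forward difference equation.
    for n in range(N):
        grad += kappa[n] @ (domega[n + 1] * (Gamma[n + 1] @ alpha[n])
                            + omega[n + 1] * (dGamma[n + 1] @ alpha[n]))

    return alpha[N].sum(), grad
```

Comparing the returned gradient against a finite difference of the marginal likelihood from hmm_forward provides a quick correctness check; derivatives with respect to a vector of parameters reuse the same α and κ and differ only in the ∂ω/∂ψ, ∂Γ/∂ψ, and ∂ρ/∂ψ contractions.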
4. CONCLUSION
In analogy to the continuous adjoint methods used with ordinary differential equations, the discrete adjoint method defines a procedure to efficiently evaluate the derivatives of functionals over the evolution of discrete sequences. Because this procedure is fully determined by the derivatives of the forward difference equations and the summands defining the functional, it yields an efficient sequential differentiation algorithm that mirrors the structure of the original sequence. The beneficial scaling of this procedure makes the resulting implementations especially useful in practical applications.
We can apply the method to any mathematical model that depends on the parameters through an (implicit) forward difference equation. Once we have made this equation explicit the derivation of a differentiation algorithm is completely mechanical, minimizing the burden of its implementation.
ACKNOWLEDGEMENTS
We thank Bob Carpenter for helpful discussions.
REFERENCES
Bücker, H. M., Corliss, G., Hovland, P., Naumann, U. and Norris, B. (2006). Automatic Differentiation: Applications, Theory, and Implementations. Springer.
Cappé, O., Moulines, E. and Rydén, T. (2005). Inference in Hidden Markov Models. Springer Series in Statistics. Springer, New York.
Griewank, A. and Walther, A. (2008). Evaluating Derivatives, Second ed. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, PA.
Hannemann-Tamás, R., Muñoz, D. A. and Marquardt, W. (2015). Adjoint sensitivity analysis for nonsmooth differential-algebraic equation systems. SIAM J. Sci. Comput. 37 A2380–A2402.
Hindmarsh, A. and Serban, R. (2020). User Documentation for CVODES v5.1.0. Technical Report, Lawrence Livermore National Laboratory.
Margossian, C. C. (2019). A review of automatic differentiation and its efficient implementation. Wiley Interdisciplinary Reviews: Data Mining and Knowledge Discovery 9.