© Cambridge University Press, 2003. DOI: 10.1017/S0962492902000132
A mathematical view of
automatic differentiation
Andreas Griewank
Institute of Scientific Computing,
Department of Mathematics,
Technische Universität Dresden,
01052 Dresden, Germany
E-mail: [email protected]
CONTENTS
1 Introduction
2 Evaluation procedures in incremental form
3 Basic forward and reverse mode of AD
4 Serial and parallel reversal schedules
5 Differentiating iterative solvers
6 Jacobian matrices and graphs
References
1. Introduction
Practically all calculus-based numerical methods for nonlinear computations
are based on truncated Taylor expansions of problem-specific functions. Nat-
urally we have to exempt from this blanket assertion those methods that
are targeted for models defined by functions that are very rough, or even
nondeterministic, as is the case for functions that can only be measured ex-
perimentally, or evaluated by Monte Carlo simulations. However, we should
bear in mind that, in many other cases, the roughness and nondeterminacy
may only be an artifact of the particular way in which the function is
evaluated. When derivatives are instead approximated by one-sided divided
differences, it is well known that, even for the increment size that optimally balances truncation and
rounding error, half of the significant digits are lost. Of course, the optimal
increment typically differs for each variable/function pair and the situation
is still more serious when it comes to approximating higher derivatives. No
such difficulty occurs in AD, as no parameter needs to be selected and there is
no significant loss of accuracy at all. Of course, the application of the chain
rule does entail multiplications and additions of floating point numbers,
which can only be performed with platform-dependent finite precision.
Since AD has really very little in common with divided differences, it is
rather inappropriate to describe it as a ‘halfway house’ between numeric
and symbolic differentiation, as is occasionally done in the literature. What
then is the key feature that distinguishes AD from fully symbolic differen-
tiation, as performed by most computer algebra (CA) systems? A short
answer would be to say that AD applies the chain rule to floating point
numbers rather than algebraic expressions. Indeed, no AD package yields
tidy mathematical formulas for the derivatives of functions, no matter how
algebraically simple they may be. All we get is more code, typically source
or binary. It is usually not nice to look at and, as such, it never provides
any analytical insight into the nature of the function and its derivatives.
Of course, for multi-layered models, such insight by inspection is generally
impossible anyway, as the direct expression of function values in terms of
the independent variables leads to formulas that are already impenetrably
complex. Instead, the aim in AD is the accurate evaluation of derivative
values at a sequence of arguments, with an a priori bounded complexity in
terms of the operation count, memory accesses, and memory size.
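To make this distinction concrete, the following sketch (not from the paper; a Python illustration under assumed names such as Dual, dexp and dsin) applies the chain rule to pairs of floating point numbers rather than to algebraic expressions: each operation propagates a numerical derivative alongside the numerical value, and nothing symbolic is ever built up.

```python
import math

class Dual:
    """A value paired with its directional derivative (forward-mode AD)."""
    def __init__(self, val, dot=0.0):
        self.val, self.dot = val, dot
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.val + other.val, self.dot + other.dot)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # product rule: (u*w)' = u*w' + w*u'
        return Dual(self.val * other.val,
                    self.val * other.dot + other.val * self.dot)
    __rmul__ = __mul__

def dexp(u):
    return Dual(math.exp(u.val), math.exp(u.val) * u.dot)

def dsin(u):
    return Dual(math.sin(u.val), math.cos(u.val) * u.dot)

def f(x1, x2):
    # the small example used later in the text: v5 = exp(v1) * sin(v1 + v2)
    return dexp(x1) * dsin(x1 + x2)

# directional derivative along (1, 0), i.e. df/dx1, at (1.0, 0.5)
y = f(Dual(1.0, 1.0), Dual(0.5, 0.0))
print(y.val, y.dot)   # function value and derivative value, no formulas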
The cost of evaluating a gradient by the reverse mode exceeds the cost of evaluating
the underlying function only by a small constant factor. When only arithmetic
operations are counted, this growth factor is at most 3, and for a typical evaluation
code probably more like 2 on average. That kind of value seems reasonable
to people who skilfully write adjoint code by hand (Zou, Vandenberghe,
Pondeca and Kuo 1997), and it opens up the possibility of turning simulation
codes into optimization codes with a growth of the total runtime by a factor
in the tens rather than the hundreds. Largely because of memory effects,
it is not uncommon that the runtime ratio for a single gradient achieved
by current AD tools is of the order 10 rather than 2. Fortunately, the
memory-induced cost penalty is the same if a handful of gradients, forming
the Jacobian of a vector function, is evaluated simultaneously in what is
called the vector-reverse mode.
In general, it is very important that the user or algorithm designer cor-
rectly identifies which derivative information is actually needed at any par-
ticular point in time, and then gets the AD tool to generate all of it jointly.
In this way optimal use can be made of common subcalculations or memory
accesses. The latter may well determine the wall clock time on a modern
computer. Moreover, the term derivative information needs to be inter-
preted in a wider sense, not just denoting vectors, matrices or even tensors
of partial derivatives, as is usually understood in hand-written formulas or
the input to computer algebra systems. Such rectangular derivative ar-
rays can often be contracted by certain weighting and direction vectors to
yield derivative objects with fewer components. By building this contrac-
tion into the AD process all aspects of the computational costs can usually
be significantly reduced. A typical scenario is the iterative calculation of
approximate Newton steps, where only a sequence of Jacobian–vector, and
possibly vector–Jacobian, products are needed, but never the Jacobian as
such. In fact, as we will discuss in Section 6, it is not really clear what
the ‘Jacobian as such’ really is. Instead we may have to consider various
representations depending on the ultimate purpose of our numerical calcu-
lation. Even if this ultimate purpose is to compute exact Newton steps by
a finite procedure, first calculating all Jacobian entries may not be a good
idea, irrespective of whether it is sparse or not. In some sense Newton steps
themselves become derivative objects. While that may seem a conceptual
stretch, there are some other mathematical objects, such as Lipschitz con-
stants, error estimates, and interval enclosures, which are naturally related
to derivatives and can be evaluated by techniques familiar from AD.
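As a hedged illustration of such contractions (again not taken from the paper), the sketch below computes an approximate Newton step using only Jacobian–vector products, never the Jacobian as such. The functions residual and jvp are stand-ins for what an AD tool would generate, and the plain conjugate gradient inner solver presumes a symmetric positive definite Jacobian, which holds for this particular toy system.

```python
import numpy as np

n = 5
A = 2.0 * np.eye(n) + 0.5 * np.eye(n, k=1) + 0.5 * np.eye(n, k=-1)

def residual(x):
    # F(x) = A x + sin(x); its Jacobian A + diag(cos x) is symmetric positive definite
    return A @ x + np.sin(x)

def jvp(x, v):
    # Jacobian-vector product F'(x) v, as the forward mode of AD would deliver it
    return A @ v + np.cos(x) * v

def newton_step(x, tol=1e-12, maxit=50):
    """Solve F'(x) s = -F(x) by conjugate gradients, touching only jvp()."""
    b = -residual(x)
    s = np.zeros_like(x)
    r = b.copy()
    p = r.copy()
    rs = r @ r
    for _ in range(maxit):
        Ap = jvp(x, p)
        alpha = rs / (p @ Ap)
        s += alpha * p
        r -= alpha * Ap
        rs_new = r @ r
        if np.sqrt(rs_new) < tol:
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return s

x = np.linspace(-1.0, 1.0, n)
for _ in range(10):                      # outer Newton iteration
    x = x + newton_step(x)
print(np.linalg.norm(residual(x)))       # final residual norm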
For the propagation of univariate Taylor polynomials of degree d the operation count
per multiplication grows quadratically in d, although fast convolution algorithms could
in principle reduce this complexity to order d ln(1 + d), a possibility that has not been exploited in
practice. Following a suggestion by Rall, it was shown in Bischof, Corliss and
Griewank (1993), Griewank, Utke and Walther (2000) and Neidinger (200x)
that multivariate Taylor expansions can be computed rather efficiently us-
ing families of univariate polynomials. Similarly, the reverse propagation of
Taylor polynomials does not introduce any really new aspects, and can be
interpreted as performing the usual reverse mode in Taylor series arithmetic
(Christianson 1992, Griewank 2000).
Like the kind of material presented, the style and depth of treatment in
this article is rather heterogeneous. Many observations are just asserted ver-
bally with references to the literature but others are formalized as lemmas
or propositions. Proofs have been omitted, except for a new, shorter demon-
stration that binomial reversal schedules are optimal in Proposition 4.2, the
proof of the folklore result, Proposition 6.3, concerning the generic rank of a
Jacobian, and some simple observations regarding the new concept of Jaco-
bian scarcity introduced in Section 6. For a more detailed exposition of the
basic material the reader should consult the author’s book (Griewank 2000),
and for more recent results and application studies the proceedings volume
edited by Corliss et al. (2001). A rather comprehensive bibliography on AD
is maintained by Corliss (1991).
Program:
    v3 = exp(v1 )
    v4 = sin(v1 + v2 )
    v5 = v3 ∗ v4

Graph: [vertices 1, 2 (independents), 3, 4 (intermediates) and 5 (dependent),
with arcs 1 → 3, 1 → 4, 2 → 4, 3 → 5, 4 → 5]
Original evaluation procedure (Table 2.1):
    v = 0
    vi = xi                              for i = 1 . . . n
    vi ∗= σi ;  vi += ϕi (ui )           for i = n + 1 . . . l
Table 3.1. Tangent procedure derived from original procedure in Table 2.1.
    [v, v̇] = 0
    [vi , v̇i ] = [xi , ẋi ]                                       for i = 1 . . . n
    [vi , v̇i ] ∗= σi ;  [vi , v̇i ] += [ϕi (ui ), ϕ̇i (ui , u̇i )]     for i = n + 1 . . . l
We have bracketed [ϕi , ϕ̇i ] side-by-side to indicate that good use can of-
ten be made of common subexpressions in evaluating the function and its
derivative. In any case we can show that
$$ \frac{\mathrm{OPS}\,[F(x),\ \dot F(x,\dot x)]}{\mathrm{OPS}\,F(x)} \;\le\; \max_{1\le i\le l} \frac{\mathrm{OPS}\,[\varphi_i(u_i),\ \dot\varphi_i(u_i,\dot u_i)]}{\mathrm{OPS}\,\varphi_i(u_i)} \;\le\; 3. \qquad (3.6) $$
The upper bound 3 on the relative cost of performing a single tangent
calculation is arrived at by taking the maximum over all elemental functions.
It is actually attained for a single multiplication, which spawns two extra
multiplications by the chain rule
v = ϕ(u, w) ≡ u ∗ w → v̇ = ϕ̇(u, w, u̇, ẇ) = u ∗ ẇ + w ∗ u̇.
Since the data dependence relation ≺ applies to the ϕ̇i in exactly the same
way as to the ϕi , we obtain on a parallel machine
CLOCK [F (x), Ḟ (x, ẋ)] ≤ 3 CLOCK F (x) ,
where CLOCK is the idealized wall clock time introduced in (2.13).
The bound 3 is pessimistic as the cost ratio between ϕ and ϕ̇ is more
advantageous for most other elemental functions. On the other hand actual
runtime ratios between codes representing Table 3.1 and Table 2.1 may well
be worse on account of various effects including loss of vectorization. This
gap between operation count and actual runtime is likely to be even more
marked for the following adjoint calculation.
Proof. The only fact to ascertain is that the argument ui of the ϕ̄i defined
in (3.8) still has the correct value when it is called up in the third line of
Table 3.2. However, this follows from the definition of ui in (2.6) and our
assumption (2.7), so that none of the statements ϕj with j > i, nor of course
the corresponding ϕ̄j , can alter ui before it is used again by ϕ̄i .
So far, we have treated the v̄i merely as auxiliary quantities during re-
verse matrix multiplications. In fact, their final values can be interpreted as
adjoint vectors in the sense that
$$ \bar v_i \;\equiv\; \bar y\, \frac{\partial}{\partial v_i}\, y. \qquad (3.9) $$
Here ȳ is considered constant and the notation ∂/∂vi requires further expla-
nation. Partial differentiation is normally defined with respect to one of
several independent variables whose values uniquely determine the function
being differentiated. However, here y = F (x) is fully determined, via the
evaluation procedure of Table 2.1, as a function of x1 . . . xn , with each vi , for
i > n, occurring merely as an intermediate variable. Its value, vi = vi (x),
    v̄ = 0

Left part:
    v̄5 = ȳ
    (v̄3, v̄4) += v̄5 ∗ (v4, v3)
    (v̄1, v̄2) += v̄4 ∗ cos(v1 + v2) ∗ (1, 1)
    v̄1 += v̄3 ∗ exp(v1)
    (x̄1, x̄2) = (v̄1, v̄2)

Right part:
    v̄5 = ȳ
    (v̄3, v̄4) += v̄5 ∗ (v5(1), v5(2))
    (v̄1, v̄2) += v̄4 ∗ v4(1) ∗ (1, 1)
    v̄1 += v̄3 ∗ v3
    (x̄1, x̄2) = (v̄1, v̄2)
[Graphs: the computational graph on vertices 1, 2, 3, 4, 5 and its adjoint counterpart on 5̄, 4̄, 3̄, 2̄, 1̄.]
Figure 3.1. Graph corresponding to left (· · · ) and right (- - -) part of Table 3.3.
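The adjoint statements of Table 3.3 are easily written out by hand for the small example. The following Python sketch (my own illustration, not the output of any AD tool) performs the forward sweep of v5 = exp(v1) ∗ sin(v1 + v2), keeps the intermediate values, and then propagates the adjoints backwards, returning the function value together with x̄ = ȳ F′(x).

```python
import math

def f_and_gradient(x1, x2, ybar=1.0):
    # forward sweep: evaluate v5 = exp(v1) * sin(v1 + v2), keep intermediates
    v1, v2 = x1, x2
    v3 = math.exp(v1)
    v4 = math.sin(v1 + v2)
    v5 = v3 * v4
    # reverse sweep: adjoints start at zero, the dependent is seeded with ybar
    v1b = v2b = v3b = v4b = 0.0
    v5b = ybar
    v3b += v5b * v4                      # from v5 = v3 * v4
    v4b += v5b * v3
    v1b += v4b * math.cos(v1 + v2)       # from v4 = sin(v1 + v2)
    v2b += v4b * math.cos(v1 + v2)
    v1b += v3b * v3                      # from v3 = exp(v1); exp'(v1) equals v3
    return v5, (v1b, v2b)

y, (xbar1, xbar2) = f_and_gradient(1.0, 0.5)
print(y, xbar1, xbar2)
# closed-form check for exp(x1) * sin(x1 + x2)
print(math.exp(1.0) * (math.sin(1.5) + math.cos(1.5)),
      math.exp(1.0) * math.cos(1.5))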
If all elementals are treated according to (i) in Table 3.4, we obtain the
bound
$$ \mathrm{MEM}\,\bar F(x,\bar y) \;\approx\; \mathrm{MEM}\,F \;+\; \sum_{i=1}^{l} m_i \;\sim\; \mathrm{OPS}\,F(x). \qquad (3.16) $$
[Schematic: the source code of y = F(x) ∈ Rm is transformed by ∂-forward and ∂-reverse modes into object code evaluating ẏ = F′(x)ẋ ∈ Rm and x̄ = ȳF′(x) ∈ Rn, respectively, with x̄˙ = ȳF″(x)ẋ ∈ Rn in the combined second-order case; during program execution the two share intermediate values, and OPS grows only by a constant factor.]
and
$$ \frac{d}{dt}\, v(t) \;=\; \Phi(v(t), x(t)) \qquad \text{for } 0 \le t \le T \qquad (3.18) $$
has the adjoint system
$$ \frac{d}{dt}\, \bar v(t) \;=\; -\,\bar v(t)\, \Phi_v(v(t), x(t)), \qquad (3.19) $$
$$ \bar x(t) \;=\; \bar v(t)\, \Phi_x(v(t), x(t)), \qquad (3.20) $$
with the terminal condition
$$ \bar v(T) \;=\; \nabla\psi(v(T)). \qquad (3.21) $$
Here the trajectory v(t) is a continuous solution path, whose points enter
into the coefficients of the corresponding adjoint differential equation (3.19)
for v̄(t). Rather than storing the whole trajectory, we may instead store
only certain checkpoints and repeat the forward simulation in pieces on the
way back. This technique is the subject of the subsequent Section 4.
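For a discretized version of (3.18)–(3.21), the 'record all' strategy amounts to storing the whole trajectory on the way forward. The following sketch (a minimal illustration assuming a forward Euler discretization, a scalar state and a scalar parameter held fixed; the names Phi, forward and reverse are mine) shows the resulting discrete adjoint sweep with the terminal condition v̄(T) = ∇ψ(v(T)).

```python
import math

def Phi(v, x):       # a small stand-in right-hand side
    return -x * v + math.sin(v)

def Phi_v(v, x):     # partial derivative with respect to the state v
    return -x + math.cos(v)

def Phi_x(v, x):     # partial derivative with respect to the parameter x
    return -v

def forward(v0, x, T, l):
    """Forward Euler over l steps, storing the whole trajectory."""
    h = T / l
    traj = [v0]
    for _ in range(l):
        traj.append(traj[-1] + h * Phi(traj[-1], x))
    return traj

def reverse(traj, x, T, grad_psi):
    """Discrete adjoint sweep with terminal condition vbar(T) = grad psi(v(T))."""
    l = len(traj) - 1
    h = T / l
    vbar = grad_psi(traj[-1])
    xbar = 0.0
    for k in range(l - 1, -1, -1):
        v = traj[k]
        xbar += h * vbar * Phi_x(v, x)          # accumulate x-sensitivity
        vbar = vbar * (1.0 + h * Phi_v(v, x))   # adjoint of one Euler step
    return vbar, xbar

traj = forward(1.0, x=0.3, T=1.0, l=200)
vbar0, xbar = reverse(traj, x=0.3, T=1.0, grad_psi=lambda v: v)  # psi(v) = v^2/2
print(traj[-1], vbar0, xbar)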
and
CLOCK F̄ (x, Ȳ) ≤ (1 + 2 q) CLOCK(F (x)).
Since the trajectory size is independent of the adjoint dimension q, we obtain
from (3.16) the spatial complexity
MEM(F̄ (x, Ȳ)) ∼ q MEM(F (x)) + OPS(F (x)) ∼ OPS(F (x)).
The most important application of the vector mode is probably the effi-
cient evaluation of sparse matrices by matrix compression. Here Ẋ ∈ X p is
chosen as a seed matrix for a given sparsity pattern such that the resulting
compressed Jacobian F′(x)Ẋ allows the reconstruction of all nonzero entries
in F′(x). This technique apparently originated with the grouping proposal
of Curtis, Powell and Reid (1974), where each row of Ẋ contains exactly one
nonzero element and the p columns of Ẏ are approximated by the divided
differences
$$ \frac{1}{\varepsilon}\Bigl[ F\bigl(x + \varepsilon\, \dot X e_j\bigr) - F(x) \Bigr] \;=\; \dot Y e_j + O(\varepsilon) \qquad \text{for } j = 1 \ldots p. $$
Here ej denotes the jth Cartesian basis vector in Rp . In AD the matrix
Ẏ is obtained with working accuracy, so that the conditioning of Ẋ is not
quite so critical. The reconstruction of F′(x) from F′(x)Ẋ relies on cer-
tain submatrices of Ẋ being nonsingular. In the CPR approach they are
permutations of identity matrices and thus ideally conditioned. However,
there is a price to pay, namely the number of columns p must be greater than or
equal to the chromatic number of the column-incidence graph introduced by
Coleman and Moré (1984). Any such colouring number is bounded below
by n̂ ≤ n, the maximal number of nonzeros in any row of the Jacobian.
By a degree of freedom argument, we see immediately that F′(x) cannot
be reconstructed from F′(x)Ẋ if p < n̂, but p = n̂ suffices for almost all
choices of the seed matrix Ẋ. The gap between the chromatic number and
n̂ can be as large as n − n̂, as demonstrated by an example of Hossain and
Steihaug (1998). Whenever the gap is significant we should instead apply
dense seeds Ẋ ∈ X n̂ , which were proposed by Newsam and Ramsdell (1983).
Rather than using the seemingly simple choice of Vandermonde matrices, we
may prefer the much better conditioned Pascal or Lagrange seeds proposed
by Hossain and Steihaug (2002) and Griewank and Verma (2003), respec-
tively. In many applications sparsity can be enhanced by exploiting partial
separability, which sometimes even allows the efficient calculation of dense
gradients using the forward mode.
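A hedged sketch of CPR-style compression for a tridiagonal Jacobian follows (not from the paper). The seed matrix groups columns j, j+3, j+6, ... into one colour, the compressed Jacobian F′(x)Ẋ is formed from exact Jacobian–vector products, and the entries are then read off using the known sparsity pattern. The complex-step trick inside jvp merely stands in for an AD-generated tangent code so that the example stays self-contained.

```python
import numpy as np

n = 9
def F(x):
    # tridiagonal coupling: F_i depends on x_{i-1}, x_i, x_{i+1}
    y = 3.0 * x + x**2
    y[1:]  += np.sin(x[:-1])
    y[:-1] += np.cos(x[1:])
    return y

def jvp(x, v, eps=1e-40):
    # stand-in for an AD tangent; the complex step gives F'(x) v to working accuracy
    return np.imag(F(x + 1j * eps * v)) / eps

# CPR seed: columns j, j+3, j+6 share one colour, so p = 3 columns suffice
p = 3
X_seed = np.zeros((n, p))
for j in range(n):
    X_seed[j, j % p] = 1.0

x = np.linspace(0.1, 0.9, n)
Y = np.column_stack([jvp(x, X_seed[:, k]) for k in range(p)])  # compressed Jacobian

# reconstruction: entry (i, j) of the Jacobian sits in column j % p of Y
J = np.zeros((n, n))
for j in range(n):
    for i in range(max(0, j - 1), min(n, j + 2)):   # known tridiagonal pattern
        J[i, j] = Y[i, j % p]

# compare with a dense reference Jacobian obtained column by column
J_ref = np.column_stack([jvp(x, e) for e in np.eye(n)])
print(np.max(np.abs(J - J_ref)))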
The compression techniques discussed above require the a priori knowl-
edge of the sparsity pattern, which may be rather complicated and thus
tedious for the user to supply. Then we may prefer to execute the for-
ward mode with p = n and the vi stored and manipulated as dynamically
sparse vectors (Bischof, Carle, Khademi and Mauer 1996). Excluding ex-
act cancellations, we may conclude that the operation count for computing
the whole Jacobians in this sparse vector mode is also bounded above by
(1 + 2 n̂)OPS(F (x)). Unfortunately this bound may not be a very good in-
dication of actual runtimes since the dynamic manipulation of sparse data
structures typically incurs a rather large overhead cost. Alternatively we
may propagate the sparsity pattern of the vi as bit patterns encoded in
n/32 integers (Giering and Kaminski 2001a). In this way the sparsity pat-
tern of F′(x) can be computed with about n/32 times the operation count of
F itself and very little overhead. By so-called probing algorithms (Griewank
and Mitev 2002) the cost factor n/32 can often be reduced to O(n̂ log n) for
a seemingly large class of sparse matrices.
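The bit-pattern idea can be illustrated in a few lines (an illustration only; a compiled tool would pack the bits of each index domain into n/32 machine words rather than one arbitrary-precision Python integer per variable):

```python
# Sparsity pattern propagation with bit masks: bit j of pat[v] is set whenever
# the variable v depends on the independent x_j.

def bit(j):
    return 1 << j

def union(*patterns):
    out = 0
    for p in patterns:
        out |= p
    return out

# computational graph of the example v5 = exp(v1) * sin(v1 + v2)
pat = {1: bit(0), 2: bit(1)}          # independents x_0, x_1
pat[3] = pat[1]                        # v3 = exp(v1)
pat[4] = union(pat[1], pat[2])         # v4 = sin(v1 + v2)
pat[5] = union(pat[3], pat[4])         # v5 = v3 * v4

for i in (3, 4, 5):
    nz = [j for j in range(2) if pat[i] >> j & 1]
    print(f"v{i} depends on independents {nz}")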
Throughout this subsection we have tacitly assumed that the sparsity
pattern of F′(x) is static, i.e., does not vary as a function of the evaluation
point x. If it does, we have to either recompute the pattern at each new
argument x or determine certain envelope patterns that are valid, at least
in a certain subregion of the domain. All the techniques we have discussed
here in their application to the Jacobian F′(x) can be applied analogously
to its transpose F′(x)T using the reverse mode (Coleman and Verma 1996).
For certain matrices such as arrowheads, a combination of both modes is
called for.
for any given (x, ẋ) ∈ X² and (ȳ, ȳ˙) ∈ (Y*)². In Figure 3.2 the vector
ȳ˙ ∈ Y* was assumed to vanish, and the abbreviation
$$ \dot{\bar F}(x, \bar y, \dot x, 0) \;\equiv\; \bar y\, F''(x)\,\dot x \;\equiv\; \nabla_x \bigl[ \bar y\, F'(x) \bigr]\,\dot x \;\in\; X^* $$
is certainly stretching conventional matrix notation. To obtain an evaluation
procedure for F̄˙ we simply have to differentiate the combination of Table 2.1
and Table 3.2 once more in the forward mode. Consequently, composing
(3.6) and (3.14) we obtain the bound
tion. One second-order adjoint calculation then yields exactly the informa-
tion needed to perform one inner iteration step within a truncated Newton
method applied to the KKT system
ȳF′(x) = 0, c(x) = 0.
Consequently the cost of executing one inner iteration step is roughly ten
times that of evaluating F = (f, c). Note in particular that a procedure for
the evaluation of the whole constraint Jacobian ∇ c need not be developed
either automatically or by hand. Walther (2002) found that iteratively solv-
ing the linearized KKT system using exact second-order adjoints was much
more reliable and also more efficient than the same method based on divided
differences on the gradient of the Lagrangian function.
The brief description of second-order adjoints above begs the question of
their relation to a nested application of the reverse mode. The answer is that
x̄˙ = ȳF″(x)ẋ could also be computed in this way, but that the complication
of such a nested reversal yields no advantage whatever. To see this we
note first that, for a symmetric Jacobian F′(x) = F′(x)T, we obviously
have ȳF′(x) = (F′(x) ẋ)T if ẋ = ȳT. Hence the adjoint vector x̄ = ȳF′(x)
can be computed by forward differentiation if F′(x) is symmetric and thus
actually the Hessian ∇²ψ(x) of a scalar function ψ(x) with the gradient
∇ψ(x) = F (x). In other words, it never makes sense to apply the reverse
mode to vector functions that are gradients. On the other hand, applying
reverse differentiation to y = F (x) yields
$$ \bigl( F(x)^T,\; \bar y\, F'(x) \bigr) \;=\; \nabla_{\bar y,\,x}\, \bigl[ \bar y\, F(x) \bigr]. $$
This partitioned vector is the gradient of the Lagrangian ȳF (x), so that a
second differentiation may be carried out without loss of generality in the
forward mode, which is exactly what we have done above. The process
can be repeated to yield x̄¨ = ȳF‴(x)ẋẋ and so on by the higher forward
differentiation techniques described in Chapter 10 of Griewank (2000).
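The 'forward over reverse' composition can be spelled out for the running example. In the sketch below (my own illustration; the helper names are assumptions), every quantity is a pair (value, directional derivative), and running the hand-written adjoint code of ψ(x1, x2) = exp(x1) sin(x1 + x2) in this tangent arithmetic yields both the gradient and the Hessian–vector product ∇²ψ(x)ẋ.

```python
import math

# Tangent arithmetic on (value, directional derivative) pairs.
def add(a, b):  return (a[0] + b[0], a[1] + b[1])
def mul(a, b):  return (a[0] * b[0], a[0] * b[1] + b[0] * a[1])
def texp(a):    return (math.exp(a[0]), math.exp(a[0]) * a[1])
def tsin(a):    return (math.sin(a[0]), math.cos(a[0]) * a[1])
def tcos(a):    return (math.cos(a[0]), -math.sin(a[0]) * a[1])

def grad_and_hessvec(x1, x2, xdot1, xdot2):
    v1, v2 = (x1, xdot1), (x2, xdot2)
    # forward sweep in tangent arithmetic
    v3 = texp(v1)
    v4 = tsin(add(v1, v2))
    # reverse sweep in tangent arithmetic, seed ybar = 1
    v5b = (1.0, 0.0)
    v3b = mul(v5b, v4)
    v4b = mul(v5b, v3)
    c   = tcos(add(v1, v2))
    v1b = add(mul(v4b, c), mul(v3b, v3))
    v2b = mul(v4b, c)
    # first components: gradient; second components: Hessian times (xdot1, xdot2)
    return (v1b[0], v2b[0]), (v1b[1], v2b[1])

g, Hv = grad_and_hessvec(1.0, 0.5, 1.0, 0.0)   # Hv is the first Hessian column
print(g, Hv)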
[Diagram: checkpoint levels c0–c3 plotted over steps and cycles.]
Figure 4.1. Binary serial reversal schedule for l = 16 = 2⁴ and s = 4.
[Diagram: processors p0–p4 and checkpoints c0–c4 plotted over steps and cycles.]
Figure 4.2. Binary parallel reversal schedule for l = 32 = 2⁵.
processor pi , which keeps restarting from the same state saved in ci . For
the optimal Fibonacci schedules considered below the task assignment for
the individual processors is considerably more complicated. For the binary
schedule executed in parallel we now have the complexity estimate
$$ \frac{\mathrm{CLOCK}(\bar F(x,\bar y))}{\mathrm{CLOCK}(F(x))} \;\approx\; 2 \qquad\text{and}\qquad \frac{\mathrm{PROCS}(\bar F(x,\bar y))}{\mathrm{PROCS}(F(x))} \;\approx\; \log_2 l. $$
Here the original function evaluation may be carried out in parallel so that
PROCS(F (x)) > 1, but then log2 l times as many processors must be available
for the reversal with minimal wall clock time. If not quite as many are
available, we may of course compress the sequential schedule somewhat along
the computing axis without reaching the absolute minimal wall clock time.
Such hybrid schemes will not be elaborated here.
Here REPS(s, l) denotes the minimal repetition count for a reversal of l steps
using just s checkpoints. The three terms on the right represent the effort of,
respectively, advancing ˇl steps, reversing the right subchain [ˇl . . . l] using s−1
checkpoints, and finally reversing the left subchain [0 . . . ˇl], again using all s
checkpoints. To derive a nearly explicit formula for the function REPS(s, l),
we will consider values
lr (s, l) ≤ l for r ≥ 0 < s.
They are defined as the number of steps i = 1 . . . l that are repeated at most r
[Diagram: a single checkpoint c0 plotted over steps and cycles.]
Figure 4.3. Reversal with a single checkpoint (s = 1, l = 6).
times, maximized over all reversals on the range [0 . . . l]. By definition, the
lr (s, l) are nondecreasing as functions of r, such that lr+1 (s, l) ≥ lr (s, l).
Moreover, as it turns out, these numbers have maxima
$$ l_r(s) \;\equiv\; \max_{l \ge 0}\; l_r(s, l), $$
(iii) REPS(s, l) does attain its lower bound (4.5), with (4.4) holding as equal-
ity for all 0 ≤ r < rmax , and consequently
$$ l_{r_{\max}}(s, l) \;=\; l - \beta(s, r_{\max} - 1) \;=\; l - \sum_{r=0}^{r_{\max}-1} l_r(s, l). $$
because all steps in the left half [0 . . . ˇl] get repeated one more time during
the initial advance to ˇl. Now, by taking the maximum over l and ˇl < l, we
obtain the recursive bound
lr (s) ≤ lr−1 (s) + lr (s − 1).
Together with the initial conditions (4.2) and (4.3), we immediately arrive
at the binomial bound
lr (s) ≤ β(s, r) ≡ (s + r)!/(s! r!),
which establishes (i). To prove the remaining assertions let us suppose we
have a given chain length l and some serial reversal schedule using s check-
points that repeats ∆˜lr ≥ 0 steps exactly r times for r = 0, 1 . . . l. Looking
for the most efficient schedule, we then obtain the following constrained
minimization problem:
$$ \mathrm{Min}\; \sum_{r=0}^{\infty} r\,\Delta\tilde l_r \quad\text{such that}\quad l = \sum_{r=0}^{\infty} \Delta\tilde l_r \quad\text{and}\quad \tilde l_r \;\equiv\; \sum_{i=0}^{r} \Delta\tilde l_i \;\le\; l_r(s) \quad\text{for all } r \ge 0. \qquad (4.6) $$
where the equation follows from standard binomial identities (Knuth 1973).
That the lower bound REPS(s, l) is actually attained is established by the
following construction of suitable schedules by recursion.
When s = 1 we can only use the schedule displayed in Figure 4.3, where
∆l̃r = 1 for all 0 ≤ r < l so that rmax = l − 1 and REPS = l (l − 1) −
β(2, rmax − 1) = l (l − 1)/2. When l = 1 no step needs to be repeated at all,
so that ∆˜l0 = 1, rmax = 0 and thus REPS = 0 = β(s + 1, rmax − 1). As an
induction hypothesis we suppose that a binomial schedule satisfying (4.4)
and (4.5) as equalities exists for all pairs s̃ ≥ 1 ≤ ˜l that precede (s, l) in the
lexicographic ordering. This induction hypothesis is trivially true for s = 1
and l = 1. There is a unique value rmax such that
β(s, rmax − 2) + β(s − 1, rmax − 1) = β(s, rmax − 1) < l (4.7)
and l ≤ β(s, rmax ) = β(s, rmax − 1) + β(s − 1, rmax ).
[Diagram: repetition levels r = 0, . . . , 9.]
Figure 4.4. Repetition levels for s = 3 and l = 68.
[Diagram: checkpoints c0–c3 plotted over steps and cycles.]
Figure 4.5. Binomial serial reversal schedule for l = 16 and s = 4.
[Plot: REPS(s, l) versus chain length l for s = 2, 3, 4, 5.]
Figure 4.6. Repetition count for chain length l and checkpoint number s.
checkpoints s we have at our disposal the slower the cost grows. Asymptot-
ically it follows from β(s, rmax ) ≈ l by Stirling’s formula that, for fixed s,
the approximate complexity growth is
$$ r_{\max} \;\sim\; \frac{s}{e}\,\sqrt[s]{l} \qquad\text{and}\qquad \mathrm{REPS}(s, l) \;\sim\; \frac{s}{e}\, l^{\,1+1/s}, \qquad (4.10) $$
where e is Euler’s number.
A slight complication arises when the total number of steps is not known
a priori , so that a set of optimal checkpoint positions cannot be determined
beforehand. Rather than performing an initial sweep just for the purpose of
determining l, we may construct online methods that release earlier check-
points whenever the simulation continues for longer than expected. Stern-
berg (2002) found that the increase of the repeated step numbers compared
with the (in hindsight) optimal schemes is rarely more than 5%.
A key advantage of the binomial reversal schedule compared with the
simpler binary one is that many steps are repeated exactly rmax (s, l) times.
Since none of them are repeated more often than r = rmax times, we obtain
the complexity bounds
$$ \frac{\mathrm{OPS}(\bar F(x,\bar y))}{\mathrm{OPS}(F(x))} \;\le\; 3 + r \qquad\text{and}\qquad \frac{\mathrm{MEM}(\bar F(x,\bar y))}{\mathrm{MEM}(F(x))} \;\le\; 3 + s \qquad (4.11) $$
provided l ≤ β(s, r) ≡ (s + r)!/(s! r!).
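The recursive structure behind these schedules can be sketched in a few lines (an illustration in the spirit of the revolve algorithm of Griewank and Walther, not the production code): store a checkpoint after advancing ˇl steps, reverse the right subchain with one checkpoint fewer, then reverse the left subchain with the full set. The simple split rule below reproduces the binomial schedules whenever l is exactly a binomial number β(s, r); for general l the optimal split has to be chosen more carefully.

```python
from math import comb

def beta(s, r):
    """beta(s, r) = (s + r)! / (s! r!), the maximal chain length reversible with
    s checkpoints when no step is advanced more than r extra times."""
    return comb(s + r, r)

def reverse_chain(start, length, s, advance, reverse_step):
    """Reverse `length` steps beginning at state `start`, using s additional
    checkpoints beyond the stored state at `start`."""
    if length == 0:
        return
    if length == 1 or s == 0:
        # no spare checkpoint: re-advance from `start` for every remaining step
        for i in range(length - 1, -1, -1):
            for k in range(start, start + i):
                advance(k)
            reverse_step(start + i)
        return
    r = 0
    while beta(s, r) < length:
        r += 1
    split = min(length - 1, beta(s, r - 1))   # checkpoint position (naive rule)
    for k in range(start, start + split):     # advance to the checkpoint
        advance(k)
    reverse_chain(start + split, length - split, s - 1, advance, reverse_step)
    reverse_chain(start, split, s, advance, reverse_step)

# count how often forward steps are repeated for l = 15 = beta(4, 2), s = 4
counts = {}
reverse_chain(0, 15, 4,
              advance=lambda i: counts.__setitem__(i, counts.get(i, 0) + 1),
              reverse_step=lambda i: None)
print(sum(counts.values()), max(counts.values()))
```

For l = 15 and s = 4 this prints 24 repeated forward steps with no step repeated more than twice, consistent with the bound (4.11) for r = 2.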
[Plot: run time ratio versus #Checkpoints*100/#Timesteps for three discretizations
(2113 v-nodes, 545 p-nodes, 1024 triangles; 545 v-nodes, 145 p-nodes, 256 triangles;
145 v-nodes, 41 p-nodes, 64 triangles).]
[Diagram panels: resource profiles r over cycles t for subschedules of length up to l = 8.]
Figure 4.8. Optimal parallel reversal schedule with recursive decomposition
into subschedules and corresponding resource profiles.
the two. At first, one checkpoint keeps the initial state, and one processor
advances along the diagonals until it kicks off the bottom left subschedule,
which itself needs at first two resources, and then for a brief period of the
two cycles 8 and 9 it requires three. In cycle 10 that demand drops again
to two, and one resource can be released to the top right subschedule. As
we can see, the subprofiles fit nicely together and we always obtain a grad-
ual increase toward the vertex cycles l and l + 1, and subsequently a gentle
decline.
Each subschedule is decomposed again into smaller schedules of sizes 3,
2 and ultimately 1. The fact that the schedule sizes (1, 2, 3, 5, and 8) are
Fibonacci numbers is not coincidental. A rigorous derivation would require
too much space, but we give the following argument, whose key observations
were proved by Walther (1999).
• Not only checkpoints but also processors persist, i.e., run without in-
terruption until they reach the step to be reversed immediately (graph-
ically that means that there are no hooks in Figures 4.2 and 4.8).
• The first checkpoint position ˇl partitions the original range [0 . . . l] into
  two subranges [0 . . . ˇl] and [ˇl . . . l] such that, if l is maximal with respect
  to a given number of resources, then the two parts are in turn maximal:
  ˇl for one resource fewer and l − ˇl < ˇl for two resources fewer.
Imagine we have a simulation that runs forward in time with a fair bit
of diffusion or energy dissipation, so that we cannot integrate the model
equations backwards. In order to be able to run the ‘movie’ backwards
nevertheless, we take checkpoints at selected intermediate states and then
repeat partial forward runs using auxiliary processors. The reversal is sup-
posed to start instantaneously and to run at exactly the same speed as the
forward simulation. Moreover, the user can switch directions at any time.
As it turns out, the Fibonacci schedules can be updated adaptively so that
a fixed number of resources, with roughly half of them processors, suffices as
long as l does not exceed the corresponding Fibonacci number; when that
happens, one extra resource must be added.
An example using maximally 5 processors and 3 checkpoints is depicted
in Figure 4.9. First the simulation is advanced from state 0 to state 48 <
55 = l9 . In the 48th transition step 6 checkpoints have been set and 2
processors are running. Then the ‘movie’ is run backwards at exactly the
same speed until state 32 < 34 = l8 , at which point 5 processors are running
and 3 checkpoints are in memory. Then we advance again until state 40,
reverse once more and so on. The key property of the Fibonacci numbers
used in these bidirectional simulation schedules is that, by the removal of a
checkpoint between gaps whose size equals neighbouring Fibonacci numbers,
the size of the new gap equals the next-larger Fibonacci number.
[Diagram: steps plotted over cycles.]
Figure 4.9. Bidirectional simulation using up to 9 processors and checkpoints.
[Diagram: three sweeps from t = 0 (state v0) to t = T (adjoints v̄, Q̄), carrying
x(t), v(t), B(t) and ∆x(t) of dimensions p, q, q·p and p, with thin checkpoints of
size q and thick checkpoints of size q².]
Figure 4.10. Nested reversal for Riccati/Pantoja computation of Newton step.
The horizontal lines represent information flow between the three sweeps
that are represented by slanted lines. The two ellipses across the initial
and intermediate sweep represent checkpoints, and are annotated by the
dominant dimension of the states that need to be saved for a restart on
that sweep. Hence we have thin checkpoints of size q on the initial, forward
sweep, and thick checkpoints of size q 2 on the immediate, reverse sweep.
A simple-minded ‘record all’ approach would thus require memory of order
l q 2 , where l is now the number of discrete time-steps between 0 and T . The
intermediate sweep is likely to be by far the most computationally expensive,
as it involves matrix products and factorizations.
The final sweep again runs forward in time and again has a state dimension
q + p, which will be advanced in spurts as information from the intermediate
sweep becomes available. Christianson (1999) has suggested a two-level
checkpointing scheme, which reduces the total storage from order q² l for a
simple log-all approach to q² √l. Here checkpoints are placed every √l time-
steps along the initial and the intermediate sweep. The subproblems of
length √l are then treated with complete logging of all intermediate states.
More sophisticated serial and parallel reversal schedules are currently under
This result was originally established by Gilbert (1992) and later generalized
by Griewank, Bischof, Corliss, Carle and Williamson (1993) to a much more
general class of Newton-like methods. We may abbreviate (5.8) to żi = ż∗ +
Õ(ρ^i) for ease of algebraic manipulation, and it is an immediate consequence
for the reduced function f (x) defined in (5.5) that
F (x, zi ) = f (x) + Õ(ρ^i)   and   Ḟ (x, zi , ẋ, żi ) = f′(x)ẋ + Õ(ρ^i).
As discussed by Ortega and Rheinboldt (1970), R-linear convergence is a
little weaker than Q-linear convergence, in that successive discrepancies
żi −ż∗ need not go down monotonically but must merely decline on average
by the factor ρ. This lack of monotonicity comes about because the leading
term on the right-hand side of (5.6), as defined in (5.7), may also approach
its limit Gx (x, z∗ )ẋ in an irregular fashion. A very similar effect occurs for
Lagrange multipliers in nonlinear programming. Since (5.3) implies by the
triangle inequality
$$ \limsup_{i\to\infty} \frac{\lVert z_i - z_* \rVert}{\lVert z_i - z_{i-1} \rVert} \;\le\; \frac{\rho}{1-\rho}, $$
we may use the last step-size ‖zi − zi−1 ‖ as an estimate for the remain-
ing discrepancy ‖zi − z∗ ‖. In contrast, this is not possible for R-linearly
converging sequences, so the last step-size ‖żi − żi−1 ‖ does not provide a
reliable indication of the remaining discrepancy ‖żi − ż∗ ‖.
In order to gauge whether the current value żi is a reasonable approxi-
mation to ż∗ , we recall that the latter is a solution of the direct sensitivity
equation
[I − Gz (x, z∗ )]ż = Gx (x, z∗ )ẋ, (5.9)
which is a mere rewrite of (5.4). Hence it follows under our assumptions
that, for any candidate pair (z, ż),
$$ \lVert z - z_* \rVert + \lVert \dot z - \dot z_* \rVert \;=\; O\bigl( \lVert z - G(x, z) \rVert + \lVert \dot z - \dot G(x, z, \dot x, \dot z) \rVert \bigr), \qquad (5.10) $$
with Ġ as defined in (5.7). The directional derivative Ġ(x, z, ẋ, ż) can be
computed quite easily in the forward mode of automatic differentiation so
that the residual on the right-hand side of (5.10) is constructively available.
Hence we may execute (5.1) and (5.6) simultaneously, and stop the itera-
tion when not only ‖zi − G(x, zi )‖ but also ‖żi − Ġ(x, zi , ẋ, żi )‖ is sufficiently
small. The latter condition requires a modification of the stopping criterion,
but otherwise the directional derivative żi ≈ ż∗ = ż∗ (x, ẋ) and the resulting
ẏi = Ḟ (x, zi , ẋ, żi ) ≈ ẏ∗ can be obtained by black box differentiation of the
original iterative solver.
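A piggy-back implementation of this idea takes only a few lines. In the sketch below (an illustration only; the fixed point map is a toy example of mine, and the hand-coded partials stand in for the tangent Ġ an AD tool would supply), the state iteration and the derivative iteration run side by side and the loop terminates only when both residuals are small; the result is checked against the implicit-function value Gx /(1 − Gz ).

```python
import math

# A contractive scalar fixed point z = G(x, z); |G_z| <= 0.5 guarantees convergence.
def G(x, z):    return 0.5 * math.cos(z) + x
def G_z(x, z):  return -0.5 * math.sin(z)
def G_x(x, z):  return 1.0

def solve_with_tangent(x, xdot, tol=1e-12, maxit=200):
    z, zdot = 0.0, 0.0
    for i in range(maxit):
        res_state = abs(z - G(x, z))
        res_tang  = abs(zdot - (G_x(x, z) * xdot + G_z(x, z) * zdot))
        if res_state < tol and res_tang < tol:     # modified stopping criterion
            return z, zdot, i
        # state iteration (5.1) and derivative iteration (5.6), run in lockstep
        z, zdot = G(x, z), G_x(x, z) * xdot + G_z(x, z) * zdot
    return z, zdot, maxit

z, zdot, its = solve_with_tangent(x=0.3, xdot=1.0)
# compare zdot with the implicit-function value dz/dx = Gx / (1 - Gz) at z*
print(z, zdot, G_x(0.3, z) / (1.0 - G_z(0.3, z)), its)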
Sometimes G takes the form
G(x, z) = z − P (x, z) · H(x, z),
with H(x, z) = 0 the state equation to be solved and P (x, z) ≈ Hz−1 (x, z) a
suitable preconditioner. The optimal choice P (x, z) = Hz−1 (x, z) represents
Newton’s method but can rarely be realized exactly on large-scale problems.
Anyway we find
$$ \dot z - \dot G(x, z, \dot x, \dot z) \;=\; P(x, z)\bigl[ H_x(x, z)\,\dot x + H_z(x, z)\,\dot z \bigr] + \dot P(x, z)\, H(x, z), $$
where
Ṗ (x, z) = Px (x, z) ẋ + Pz (x, z) ż,
provided P (x, z) is differentiable at all. As H(x, zi ) converges to zero, the
second term Ṗ (x, z) H(x, zi ) could, and probably should, be omitted when
evaluating Ġ(x, zi , ẋ, żi ). This applies in particular when P (x, z) is not even
continuously differentiable, for example due to pivoting in a preconditioner
based on an incomplete LU factorization. Setting Ṗ = 0 does not affect the
R-linear convergence result (5.8), and may reduce the cost of the derivative
iteration. However, it does require separating the preconditioner P from the
residual H, which may not be a simple task if the whole iteration function
G is incorporated in a legacy code. In Figure 5.1 the curves labelled ‘state
equation residual’ and ‘direct derivative residual’ represent ‖zi − G(x, zi )‖
and ‖żi − Ġ(x, zi , ẋ, żi )‖, respectively. The other two residuals are explained
in the following subsection.
[Plot: log10(residual norm) versus iteration number, showing the state equation
residual, the direct derivative residual, the adjoint derivative residual and the
second-order residual.]
Figure 5.1. History of residuals on 2D Euler solver, from
Griewank and Faure (2002).
this problem looks somewhat simpler than the original task of solving the
nonlinear state equation z = G(x, z). When G represents a Newton step we
have asymptotically ρ = 0, and the solution of (5.9) can be achieved in a
single step
ż = Ġ(x, z, ẋ, 0) = −P (x, z) Hx (x, z) ẋ
where z ≈ z∗ represents the final iterate of the state vector. Also, the
simplified iteration may be applied since Ṗ is multiplied by H(x, z) ≈ 0.
When G represents an inexact version of Newton’s method based on an
iterative linear equation solver, it seems a natural idea to apply exactly the
same method to the sensitivity equation (5.9). This may be quite economical
because spectral properties and other information, that is known a priori
or gathered during the first phase iteration, may be put to good use once
more. Of course, we must be willing and able to modify the code by hand
unless AD provides a suitable tool (Giering and Kaminski 2000). The ability
to do this would normally presume that a fairly standard iterative solver,
for example of Krylov type, is in use. If nothing is known about the solver
except that it is assumed to represent a contractive fixed point iteration, we
may still apply (5.6) with zi−1 = z fixed so that
$$ \dot z_i \;=\; G_x(x, z)\,\dot x + G_z(x, z)\,\dot z_{i-1} \qquad \text{for } i = 1 \ldots l. \qquad (5.11) $$
Theoretically, the AD tool could exploit the constancy of z to avoid the
repeated evaluations of certain intermediates. In effect we apply the lin-
earization of the last state space iteration to propagate derivatives forward.
The idea of just exploiting the linearization of the last step has actually
been advocated more frequently for evaluating adjoints of iterative solvers
(Giering and Kaminski 1998, Christianson 1994).
To elaborate on this we first need to derive the adjoint of the fixed point
equation z = G(z, x). It follows from (5.4) by the chain rule that the total
derivative of the reduced response function defined in (5.5) is given by
$$ f'(x) \;=\; F_x + F_z \bigl[ I - G_z(x, z_*) \bigr]^{-1} G_x(x, z_*). $$
Applying a weighting functional ȳ to y, we obtain the adjoint vector
$$ \bar x_* \;=\; \bar y f'(x) \;=\; \bar y\, F_x(x, z_*) + \bar g_*\, G_x(x, z_*), \qquad (5.12) $$
where
$$ \bar g_* \;\equiv\; \bar z_* \bigl[ I - G_z(x, z_*) \bigr]^{-1} \quad\text{with}\quad \bar z_* \;\equiv\; \bar y\, F_z(x, z_*). \qquad (5.13) $$
While the definition of z̄∗ follows our usual concept of an adjoint vector, the
role of ḡ∗ ∈ Z warrants some further explanation. Suppose we introduce an
additive perturbation g of G so that we have the system
z = G(z, x) + g, y = F (z, x).
Then it follows from the implicit function theorem that ḡ∗ , as given by (5.13),
and
$$ 0 \;=\; z_* - G(x, z_*) \;=\; z - G(x, z) + \bigl[ I - G_z(x, z) \bigr](z_* - z) + O\bigl( \lVert z - z_* \rVert^2 \bigr). $$
By subtracting ḡ times the second equation from the first, we obtain the
estimate
$$ \bar y\, F(x, z) - \bar g\,\bigl[ z - G(x, z) \bigr] - \bar y\, f(x) \;=\; O\Bigl( \bigl\lVert \bar g\,\bigl[ I - G_z(x, z) \bigr] - \bar z \bigr\rVert \, \lVert z - z_* \rVert + \lVert z - z_* \rVert^2 \Bigr). $$
Now, if z = zi and ḡ = ḡi are generated according to (5.1) and (5.15), we
derive from (5.3) and (5.16) that
$$ \bar y\, F(x, z_i) - \bar g_i\,\bigl[ z_i - G(x, z_i) \bigr] \;=\; \bar y\, f(x) + \tilde O(\rho^{2i}). $$
Hence the corrected estimate converges twice as fast as the uncorrected
one. It has been shown by Griewank and Faure (2002) that, for linear
G and F and zero initializations z0 = 0 and ḡ0 = 0, the ith corrected
estimate is exactly equal to the 2i-th uncorrected value ȳ F (x, z_{2i}). The
same effect occurs when the initial z0 is quite good compared to z̄0 , since
the adjoint iteration (5.15) is then almost linear. This effect is displayed
in Figure 5.2, where the normal (uncorrected) value lags behind the double
(adjoint corrected) one by almost exactly the factor 2. Here the weighted
response ȳF was the drag coefficient of the NACA0012 airfoil.
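The doubling effect is easy to reproduce on a small linear model problem. In the sketch below (my own illustration; the adjoint recurrence ḡ_{i+1} = ȳFz + ḡi Gz used here is the natural fixed point iteration for (5.13) and is assumed to be what the extract refers to as (5.15)), the plain estimate ȳF(x, zi) and the corrected estimate ȳF(x, zi) − ḡi [zi − G(x, zi)] are printed side by side; the corrected error decays roughly like ρ^{2i}.

```python
import numpy as np

# Linear model problem: z = G(z) = A z + b,  response y = F(z) = c @ z.
A = np.array([[0.6, 0.2], [0.1, 0.5]])
b = np.array([1.0, 2.0])
c = np.array([1.0, -1.0])
y_exact = c @ np.linalg.solve(np.eye(2) - A, b)

# state iteration z_{i+1} = A z_i + b and adjoint iteration gbar_{i+1} = c + gbar_i A,
# both started from zero
z = np.zeros(2)
gbar = np.zeros(2)
for i in range(1, 21):
    z = A @ z + b
    gbar = c + gbar @ A
    plain     = c @ z
    corrected = c @ z - gbar @ (z - (A @ z + b))   # subtract weighted residual
    if i % 4 == 0:
        print(i, abs(plain - y_exact), abs(corrected - y_exact))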
In our setting the discrepancies z − z∗ and ḡ − g∗ come about through it-
erative equation solving. The same duality arguments apply if z∗ and ḡ∗ are
solutions of operator equations that are approximated by solutions z and
ḡ of corresponding discretizations. Under suitable conditions elaborated
[Plot (Figure 5.2): response function values, around 0.28, for the normal and the adjoint-corrected ('double') estimates.]
in Giles and Pierce (2001), Giles and Süli (2002) and Becker and Rannacher
(2001), the adjoint correction technique then doubles the order of conver-
gence with respect to the mesh-width. In both scenarios, solving the adjoint
equation provides accurate sensitivities of the weighted response with re-
spect to solution inaccuracies. For discretized PDEs this information may
then be used to selectively refine the grid where solution inaccuracies have
the largest effect on the weighted response (Becker and Rannacher 2001).
analogous to (5.8). The ḡi allow the computation of the approximate re-
duced gradients
x̄i ≡ ȳ Fx (x, zi ) + ḡi Gx (x, zi ) = x̄∗ + Õ(ρ^i). (5.20)
Differentiating (5.18) once more, we obtain the second-order adjoint itera-
tion
ḡ˙ i+1 = ȳḞz (x, zi , ẋ, żi ) + ḡi Ġz (x, zi , ẋ, żi ) + ḡ˙ i Gz (x, zi ), (5.21)
where
Ḟz (x, zi , ẋ, żi ) ≡ Fzx (x, zi )ẋ + Fzz (x, zi )żi
and Ġz is defined analogously. The vector ḡ˙ i ∈ X ∗ obtained from (5.21)
may then be used to calculate
$$ \dot{\bar x}_i \;=\; \bar y\, \dot F_x(x, z_i, \dot x, \dot z_i) + \dot{\bar g}_i\, G_x(x, z_i) + \bar g_i\, \dot G_x(x, z_i, \dot x, \dot z_i) \;=\; \dot{\bar x}_* + \tilde O(\rho^i) \;=\; \bar y\, f''(x)\,\dot x + \tilde O(\rho^i), $$
where Ḟx and Ġx are also defined by analogy with Ḟz . The vector x̄˙i repre-
sents a first-order approximation to the product of the reduced Hessian
ȳf″(x) = ∇²x [ȳf (x)] with the direction ẋ.
While the right-hand side of (5.21) looks rather complicated, it can be
evaluated as a second-order adjoint of the vector function
E(x, z) ≡ [F (x, z), G(x, z)],
for a very significant reduction in operation count, although the search for
an elimination ordering with absolutely minimal operation count is a hard
combinatorial problem.
v̄5 = ȳ
v̄4 = v̄5 ∗ v3
v̄3 = v̄5 ∗ v4
v̄2 = v̄4 ∗ cos(v1 + v2 )
v̄1 = v̄4 ∗ cos(v1 + v2 ) + v̄3 ∗ v3
[Diagrams: a linearized computational graph with independents 1, 2, intermediates 3, 4, 5 and dependents 6, 7, edges labelled by the partials Cij, and the bipartite graph obtained after eliminating all intermediate vertices.]
Figure 6.2. Jacobian vertex accumulation on a simple graph.
labels to the edges of the computation graph, as sketched in Figure 6.2 for
a particular case with two independents and two dependents.
Obviously the local derivatives Cij uniquely determine the overall Ja-
cobian F′(x), whose calculation from the Cij we will call accumulation.
Formally we may use the following explicit expression for the individual Ja-
cobian blocks, which can be derived from the chain rule by induction on l.
Lemma 6.1. (Bauer’s formula) The derivative of any dependent vari-
able yi = vl−m+i with respect to any independent variable xj = vj is given by
$$ \frac{\partial y_i}{\partial x_j} \;=\; \sum_{P \in [\,j \to \hat\imath\,]} \;\prod_{(\tilde\jmath,\, \tilde\imath) \subset P} C_{\tilde\imath \tilde\jmath}\,, \qquad (6.1) $$
[Diagrams: the graphs obtained after successively eliminating intermediate vertices, with updated edge labels.]
Figure 6.3. Successive vertex eliminations on problem displayed in Figure 6.2.
are directly connected by an edge, afterwards the vertex j and its edges are
deleted. Again assuming elementary matrix–matrix arithmetic, we have the
total elimination cost
$$ \mathrm{MARK}(v_j) \;\equiv\; \mathrm{OPS}\bigl(\mathrm{Elim}(v_j)\bigr) \;=\; \sum_{k \prec j \prec i} m_j\, m_k\, m_i. $$
When all vertices are scalars MARK(vj ) reduces to the Markowitz degree fa-
miliar from sparse Gaussian elimination (Rose and Tarjan 1978). There as
here, the overall objective is to minimize the vertex accumulation cost
$$ \mathrm{VACC}\bigl(F'(x)\bigr) \;\equiv\; \sum_{n < j \le l-m} \mathrm{MARK}(v_j), $$
where the degree MARK(vj ) needs to be computed just before the elimination
of the jth vertex. This accumulation effort is likely to dominate the overall
operation count compared to the effort for evaluating the elemental partials
$$ \mathrm{OPS}\bigl(F'(x)\bigr) - \mathrm{VACC}\bigl(F'(x)\bigr) \;\equiv\; \sum_{i=1}^{l} \mathrm{OPS}\bigl( \{ C_{ij} \}_{j \prec i} \bigr) \;\le\; q\; \mathrm{OPS}\bigl(F(x)\bigr), $$
where
$$ q \;\equiv\; \max_{i}\; \mathrm{OPS}\bigl( \{ C_{ij} \}_{j \prec i} \bigr) \big/ \mathrm{OPS}(\varphi_i). $$
Typically this maximal ratio q is close to 2 or some other small number for
any given library of elementary functions ϕi . In contrast to sparse Gaussian
elimination, the linearized computational graph stays acyclic throughout,
and the minimal and maximal vertices, whose Markowitz degrees vanish
trivially, are precluded from elimination. They form the vertex set of the
final bipartite graph whose edges represent the nonzero (block) entries of
the accumulated Jacobian, as shown on the right-hand side of Figure 6.2.
For the small example Figure 6.2 with scalar vertices, the elimination or-
der 4 , 5 and 3 discussed above requires 1 + 2 + 4 = 7 multiplications.
In contrast, elimination in the natural order 3 , 4 and 5 , or its exact
opposite 5 , 4 and 3 , requires 2 + 2 + 4 = 8 multiplications. Without too
much difficulty it can be seen that, in general, eliminating all intermediate
vertices in the orders n + 1, n + 2, . . . , (l − 1), l or l, l − 1, . . . , n + 1, n is math-
ematically equivalent to applying the sparse vector forward or the sparse
reverse mode discussed in Section 3.5, respectively. Thus our tiny example
demonstrates that, already, neither of the two standard modes of AD needs
to be optimal in terms of the operation count for Jacobian accumulation.
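Vertex elimination and its Markowitz-style cost count are easily simulated. The sketch below (an illustration only; the edge set is my reading of the garbled figure for the small example with independents 1, 2, intermediates 3, 4, 5 and dependents 6, 7) eliminates the intermediate vertices in three different orders, checks that the accumulated entries agree, and reports the multiplication counts. With this assumed edge set the counts come out as 8, 8 and 7, matching the comparison made in the text.

```python
import random

def eliminate(edges, order):
    """Eliminate intermediate vertices in the given order.
    Returns the multiplication count and the remaining (accumulated) edges."""
    c = dict(edges)                       # (j, i) -> value of the partial C_ij
    cost = 0
    for v in order:
        preds = [k for (k, i) in c if i == v]
        succs = [i for (k, i) in c if k == v]
        cost += len(preds) * len(succs)   # Markowitz degree of v at this moment
        for k in preds:
            for i in succs:
                # create or absorb the new direct edge k -> i
                c[(k, i)] = c.get((k, i), 0.0) + c[(k, v)] * c[(v, i)]
        for k in preds:
            del c[(k, v)]
        for i in succs:
            del c[(v, i)]
    return cost, c

random.seed(1)
edges = {e: random.random() for e in
         [(1, 3), (2, 3), (3, 4), (4, 5), (5, 6), (5, 7), (1, 6), (2, 7)]}

reference = None
for order in [(3, 4, 5), (5, 4, 3), (4, 5, 3)]:
    cost, jac = eliminate(edges, order)
    print(order, "multiplications:", cost)
    assert reference is None or all(abs(jac[e] - reference[e]) < 1e-12 for e in jac)
    reference = jac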
As in the case of sparse Gaussian elimination (Rose and Tarjan 1978), it
has been shown that minimizing the fill-in during the accumulation of the
Jacobian is an NP-hard problem, and the same is probably true for minimizing
the operation count. In any case no convincing heuristics for selecting one
[Diagram: a chain of 2k vertices with edge labels u and h.]
Figure 6.4. Reid’s example of potential accumulation instability.
[Elementwise sparsity pattern omitted; in block form the extended Jacobian reads]
$$ \begin{pmatrix} B & L - I & 0 \\ R & T & -I \end{pmatrix}. $$
Here δij is the Kronecker delta and Cij ≠ 0 ⇐⇒ j ≺ i =⇒ j < i. The last
relation implies in particular that the matrix L is strictly lower-triangular.
Applying the implicit function theorem to E(x; v) = 0, we obtain the deriva-
tive
$$ F'(x) \;\equiv\; R + T\,(I - L)^{-1} B \;=\; R + T\,\bigl[ (I - L)^{-1} B \bigr] \;=\; R + \bigl[ T\,(I - L)^{-1} \bigr] B. \qquad (6.3) $$
The two bracketings on the last line represent two particular ways of accu-
mulating the Jacobian F′(x). One involves the solution of n linear systems
in the lower-triangular matrix (L − I) and the other requires the
solution of m linear systems in its transpose (L − I)T . As observed in Chap-
ter 8 of Griewank (2000), the two alternatives correspond once more to the
forward and reverse mode of AD, and their relative cost need not be deter-
mined by the ratio m/n nor even by the sparsity structure of the Jacobian
F′(x) alone. Instead, what matters is the sparsity of the extended Jacobian
E′(x; v), which is usually too large to handle explicitly. In the scalar
case with R = 0 we can rewrite (6.3) for any two Cartesian basis vectors
ei ∈ Rm and ej ∈ Rn by Cramer’s rule as
$$ e_i^T\, F'(x)\, e_j \;=\; \det \begin{pmatrix} I - L & B\, e_j \\ -\,e_i^T\, T & 0 \end{pmatrix}. \qquad (6.4) $$
Thus we see that Bauer’s formula given in Lemma 6.1 indeed represents
the determinant of a sparse matrix. Moreover, we may derive the following
alternative interpretation of Jacobian accumulation procedures.
Program:
    v3 = c31 ∗ v1 + c32 ∗ v2
    v4 = c43 ∗ v3
    v5 = c54 ∗ v4
    v6 = c64 ∗ v4
    v7 = c74 ∗ v4
    v8 = c84 ∗ v4 + c83 ∗ v3

Graph: [vertices 1, 2 (independents), 3, 4 (intermediates) and 5, 6, 7, 8 (dependents),
with arcs 1 → 3, 2 → 3, 3 → 4, 3 → 8, 4 → 5, 4 → 6, 4 → 7, 4 → 8]
[Diagram: the line graph of the computational graph above, with vertices
(−∞, 1), (−∞, 2), (1, 3), (2, 3), (3, 4), (3, 8), (4, 5), (4, 6), (4, 7), (4, 8),
(5, ∞), (6, ∞), (7, ∞), (8, ∞).]
where [oj → di ] represents the set of all paths P connecting oj to di , and the
j̃, ĩ range over all vertices belonging to path P in descending order. The
elimination of a face ((k, j), (j, i)) with k ≠ −∞ and i ≠ +∞ can now proceed
by introducing a new line vertex with the value Cij Cjk , and connecting it
to all predecessors of (k, j) and all successors of (j, i). It is very easy to
see that this modification leaves (6.7) valid and again reduces the total sum
of the length of all maximal paths in the line graph. After the successive
elimination of all interior edges of the line graph whose vertices correspond
initially to edges in the computational graph, we arrive at a tripartite graph.
The latter can be interpreted as the line graph of a bipartite graph, whose
edge values are the vertex labels of the central layer of the tripartite graph.
Naumann (2001) has constructed an example where accumulation by face
elimination reduces the number of multiplications below the minimal count
achievable by edge elimination. It is believed that face elimination on the
scalar line graph is the most general procedure for accumulating Jacobians.
Conjecture 6.2. Given a scalar computational graph, consider a fixed
sequence of multiplications and additions to compute the Jacobian entries
∂yi /∂xj given by (6.1) and (6.7) for arbitrary real values of the Cij for j ≺ i.
Then the number of multiplications required is no smaller than that required
by some face elimination sequence on the computational graph, that is,
$$ \mathrm{ACC}\bigl(F'(x)\bigr) \;=\; \mathrm{FAAC}\bigl(F'(x)\bigr) \;\le\; \mathrm{EACC}\bigl(F'(x)\bigr) \;\le\; \mathrm{VACC}\bigl(F'(x)\bigr) $$
where the inequalities hold strictly for certain example problems, and FAAC,
EACC denote the minimal operation count achievable by face and edge elim-
ination, respectively.
Proof. We prove the assertion under the additional assumptions that all
cij = Cij are scalar, and that the computational graph is absorption-free
in that any two vertices j and i are connected by at most one directed
path. The property is inherited by the line graph. Then the accumulation
procedure involves only multiplications, and must generate a sequence of
[Diagram: a computational graph with independents 1, 2, 3, two intermediate
vertices 4 and 5 joined by a central arc, and dependents 6, 7; the edge values
are a1, a2, a3 on the input side, z on the central arc and b1, b2 on the output
side, so that the accumulated Jacobian has the rank-one entries z ai bj.]
is, for almost all combinations {cij }j≺i , equal to the size r of a maximal
match, that is, the largest number of disjoint paths connecting the roots to
the leaves of the graph.
Proof. Suppose we split all interior vertices into an input port and an
output port connected by an internal edge of capacity 1. Furthermore we
connect all minimal vertices by edges of unit capacity to a common source,
and analogously all maximal ones to a common sink. All other edges are
given infinite capacity. Then the solution of the max-flow (Tarjan 1983)
problem is integral and represents exactly a maximal match as defined above.
The corresponding min-cut solution consists of a set of internal edges of finite
and thus unit capacity, whose removal would split the modified graph into
two halves. The values vj at these split vertices form a vector z of reals that
effect can occur when the computational graph rather than the Jacobian is
sparse in a certain sense.
More specifically, we will call the computational graph and the resulting
Jacobian scarce if the matrices (∂yi /∂xj )j=1...n,i=1...m defined by Bauer’s for-
mula (6.7) for arbitrary values of the elemental partials Cı̃̃ do not range over
all of Rm×n . For the example considered in Section 6.7, we obtain exactly
the set of matrices whose rank equals one or zero, which is a smooth man-
ifold of dimension m + n − 1. Naturally, this manifold need not be regular
and, owing to its homogeneity, there are always bifurcations at the origin.
The difference between n m and the dimension of the manifold may be de-
fined as the degree of scarcity. The dimension of the manifold is bounded
above but generally not equal to the number of arcs in the computational
graph. The discrepancy is exactly two for the example above.
A particularly simple kind of scarcity is sparsity, where the number of
zeros yields the degree of sparsity, provided the other entries are free and
independent of each other. It may then still make sense to accumulate the
Jacobian, that is, represent it as set of partial derivatives. We contend that
this is not true for Jacobians that are scarce but not sparse, such as the
rank-one example.
In the light of Proposition 6.3 and our observation on the rank-one exam-
ple, we might think that the scarcity structure can always be characterized in
terms of a collection of vanishing subdeterminants. This conjecture is refuted
by the following example, which is also representative for a more significant
class of problems for which full accumulation seems a rather bad idea.
Suppose we have a 3 × 3 grid of values vj for j = 1 . . . 9, as displayed in
Figure 6.8, that are subject to two successive transformations. Each time
the new value is a function of the old values at the same place and its
neighbours to the west, south and southwest. At the southern and western
boundary we continue the dependence periodically so that the new value
of v4 , for example, depends on the old values of v4 , v3 , v1 , and v6 . This
dependency is displayed in Figure 6.8. Hence the 9 × 9 Jacobians of each
of the two transitions contain exactly four nonzero elements in each row.
Consequently we have 4 · 9 · 2 = 72 free edge values but 9 · 9 = 81 entries
in the product of the two matrices, which is the Jacobian of the final state
with respect to the initial state.
Since by following the arrows in Figure 6.8 we can get from any grid point
to any other point in two moves, the Jacobian is dense. Therefore the degree
of scarcity defined above is at least 81−72 = 9. However, the scarcity of this
Jacobian cannot be explained in terms of singular submatrices. Any such
square submatrix would be characterized by two sets Ji ⊂ {1, 2, . . . , 9} for
i = 0, 2 with J0 selecting the indices of columns (= independent variables)
and J2 the rows (= dependent variables) of the submatrix. It is then not
too difficult to find an intermediate set J1 such that each element in J1 is
[Figure 6.8: the 3 × 3 grid of values, numbered
    7 8 9
    4 5 6
    1 2 3
with arrows indicating the (periodic) dependence of each new value on its
west, south and southwest neighbours.]
which follow from Tables 3.1 and 3.2, provided all vi are scalar and disjoint.
In either loop there is exactly one multiplication associated with each arc
value cij , so that the number of arcs is an exact measure for the cost of com-
puting Jacobian–vector products ẏ = F′(x)ẋ and vector–Jacobian products
x̄ = ȳF′(x).
In order to minimize that cost we may perform certain preaccumulations
by eliminating edges or whole vertices. For example, in Figure 6.7 we can
eliminate the central arc either at the front or at the back, and thus reduce
the number of edge values from n+m+1 to n+m without changing the reach
of the Jacobian. This elimination requires here min(n, m) multiplications
and will therefore be worthwhile even if only one Jacobian–vector product is
to be computed subsequently. Further vertex, edge, or even face eliminations
would make the size of the edge set grow again, and cause a loss of scarcity
in the sense discussed above. The only further simplification respecting the
Jacobian structure would be to normalize any one of the edge values to 1
and thus save additional multiplications during subsequent vector product
calculations. Unfortunately, at least this last simplification is clearly not
unique and it would appear that there is in general no unique minimal
representation of a computational graph. As might have been expected,
there is a rather close connection between the avoidance of fill-in and the
maintenance of scarcity.
Proposition 6.4.
(i) If the front- or back-elimination of an edge in a computational graph
does not increase the total number of edges, then the degree of scarcity
remains constant.
(ii) If the elimination of a vertex would lead to a reduction in the total
number of edges, then at least one of its edges can be eliminated via (i)
without loss of scarcity.
[Diagrams: elimination of an edge into vertex i, replacing the labels c_{i k1}, ..., c_{i ks}
by updated labels c̃_{i k1}, ..., c̃_{i ks}; and a 3 × 3 subgraph in which a central vertex j
with predecessors k0, k1, k2 and successors i0, i1, i2 is eliminated.]
Since we may assume without loss of generality that cjk0 ≠ 0, this rela-
tion may be reversed so that the degree of scarcity is indeed maintained as
asserted.
To prove the second assertion, let us assume that the vertex j to be
eliminated has (p+1) predecessors and (s+1) successors, which span together
with j a subgraph of p+s+3 vertices. After the elimination of j they form
a dense bipartite graph with (s + 1)(p + 1) edges. Because of the negative
fill-in, the number of direct arcs between the predecessors and successors of
j is at least
(s + 1)(p + 1) − (s + 1 + p + 1) + 1 = s p.
From this it can easily be shown by contradiction that at least one prede-
cessor of j is connected to all but possibly one of its successors, so that its
link to j can be front-eliminated.
The application of the proposition to the 3 × 3 subgraph depicted in
Figure 6.10 shows that there are 6 + 5 edges in the original graph on the left
which would be reduced to 9 by the elimination of the central node.
However, this operation would destroy the property that the Jacobian
of the vertices i0 and i1 with respect to k0 and k1 is always singular by
Proposition 6.3. Instead we should front-eliminate (k2 , j) and back-eliminate
(j, i2 ), which reduces the number of arcs also to 9, while maintaining the
scarcity.
Obviously Proposition 6.4 may be applied repeatedly until any additional
edge elimination would increase the total edge count and no additional ver-
tex elimination could reduce it further. However, there are examples where
the resulting representation is still not minimal, and other, more global,
scarcity-preserving simplifications may be applied. Also the issue of nor-
malization, that is, the question of which arc values can be set to 1 to save
extra multiplications, appears to be completely open. Hence we cannot yet
set up an optimal Jacobian representation for the seemingly simple, and
certainly practical, task of computing a large number of Jacobian–vector or
vector–Jacobian products with the minimal number of multiplications.
Acknowledgements
The author is greatly indebted to many collaborators and co-authors, as well
as the AD community as a whole. To name some would mean to disregard
others, and quite a few are likely to disagree with my view of the field anyway.
The initial typesetting of the manuscript was done by Sigrid Eckstein, who
took particular care in preparing the many tables and figures.
REFERENCES
A. Aho, J. Hopcroft and J. Ullman (1974), The Design and Analysis of Computer
Algorithms, Addison-Wesley, Reading, MA.
R. Anderssen and P. Bloomfield (1974), ‘Numerical differentiation proceedings for
non-exact data’, Numer. Math. 22, 157–182.
W. Ball (1969), Material and Energy Balance Computations, Wiley, pp. 560–566.
R. Becker and R. Rannacher (2001), An optimal control approach to error control
and mesh adaptation in finite element methods, in Acta Numerica, Vol. 10,
Cambridge University Press, pp. 1–102.
C. Bennett (1973), ‘Logical reversibility of computation’, IBM J. Research Devel-
opment 17, 525–532.
M. Berz, C. Bischof, G. Corliss and A. Griewank, eds (1996), Computational Dif-
ferentiation: Techniques, Applications, and Tools, SIAM, Philadelphia, PA.
C. Bischof, G. Corliss and A. Griewank (1993), ‘Structured second- and higher-
order derivatives through univariate Taylor series’, Optim. Methods Software
2, 211–232.
C. Bischof, A. Carle, P. Khademi and A. Mauer (1996), ‘The ADIFOR 2.0 system
for the automatic differentiation of Fortran 77 programs’, IEEE Comput. Sci.
Engr. 3, 18–32.
T. Braconnier and P. Langlois (2001), From rounding error estimation to automatic
correction with AD, in Corliss et al. (2001), Chapter 42, pp. 333–339.
D. Cacuci, C. Weber, E. Oblow and J. Marable (1980), ‘Sensitivity theory for
general systems of nonlinear equations’, Nucl. Sci. Engr. 88, 88–110.
J.-B. Caillau and J. Noailles (2001), Optimal control sensitivity analysis with AD,
in Corliss et al. (2001), Chapter 11, pp. 105–111.
S. Campbell and R. Hollenbeck (1996), Automatic differentiation and implicit dif-
ferential equations, in Berz et al. (1996), pp. 215–227.
S. Campbell, E. Moore and Y. Zhong (1994), ‘Utilization of automatic differentia-
tion in control algorithms’, IEEE Trans. Automatic Control 39, 1047–1052.
B. Cappelaere, D. Elizondo and C. Faure (2001), Odyssée versus hand differenti-
ation of a terrain modelling application, in Corliss et al. (2001), Chapter 7,
pp. 71–78.
A. Carle and M. Fagan (1996), Improving derivative performance for CFD by using
simplified recurrences, in Berz et al. (1996), pp. 343–351.
B. Christianson (1992), ‘Automatic Hessians by reverse accumulation’, IMA J.
Numer. Anal. 12, 135–150.
B. Christianson (1994), ‘Reverse accumulation and attractive fixed points’, Op-
tim. Methods Software 3, 311–326.
R. Tarjan (1983), ‘Data structures and network algorithms’, CBMS-NSF Reg. Conf.
Ser. Appl. Math. 44, 131.
J. van der Snepscheut (1993), What Computing Is All About, Suppl. 2 of Texts and
Monographs in Computer Science, Springer, Berlin.
D. Venditti and D. Darmofal (2000), ‘Adjoint error estimation and grid adaptation
for functional outputs: application to quasi-one-dimensional flow’, J. Comput.
Phys. 164, 204–227.
Y. Volin and G. Ostrovskii (1985), ‘Automatic computation of derivatives with the
use of the multilevel differentiating technique, I: Algorithmic basis’, Comput.
Math. Appl. 11, 1099–1114.
A. Walther (1999), Program reversal schedules for single- and multi-processor ma-
chines, PhD thesis, Institute of Scientific Computing, Germany.
A. Walther (2002), ‘Adjoint based truncated Newton methods for equality con-
strained optimization’.
X. Zou, F. Vandenberghe, M. Pondeca and Y.-H. Kuo (1997), Introduction to
adjoint techniques and the MM5 adjoint modeling system, NCAR Technical
Note NCAR/TN-435-STR, Mesoscale and Microscale Meteorology Division,
National Center for Atmospheric Research, Boulder, Colorado.