
Acta Numerica (2003), pp. 321–398
© Cambridge University Press, 2003
DOI: 10.1017/S0962492902000132    Printed in the United Kingdom

A mathematical view of
automatic differentiation
Andreas Griewank
Institute of Scientific Computing,
Department of Mathematics,
Technische Universität Dresden,
01052 Dresden, Germany
E-mail: [email protected]

Automatic, or algorithmic, differentiation addresses the need for the accurate
and efficient calculation of derivative values in scientific computing. To this
end procedural programs for the evaluation of problem-specific functions are
transformed into programs that also compute the required derivative values
at the same numerical arguments in floating point arithmetic. Disregarding
many important implementation issues, we examine in this article complexity
bounds and other more mathematical aspects of the program transformation
task sketched above.

CONTENTS
1 Introduction 321
2 Evaluation procedures in incremental form 329
3 Basic forward and reverse mode of AD 335
4 Serial and parallel reversal schedules 349
5 Differentiating iterative solvers 365
6 Jacobian matrices and graphs 374
References 393

1. Introduction
Practically all calculus-based numerical methods for nonlinear computations
are based on truncated Taylor expansions of problem-specific functions. Nat-
urally we have to exempt from this blanket assertion those methods that
are targeted for models defined by functions that are very rough, or even
nondeterministic, as is the case for functions that can only be measured ex-
perimentally, or evaluated by Monte Carlo simulations. However, we should
bear in mind that, in many other cases, the roughness and nondetermi-
nacy may only be an artifact of the particular way in which the function is
evaluated. Then a reasonably accurate evaluation of derivatives may well
be possible using variants of the methodology described here. Consequently,
we may view the subject of this article as an effort to extend the applica-
bility of classical numerical methodology, dear to our heart, to the complex,
large-scale models arising today in many fields of science and engineering.
Sometimes, numerical analysts are content to verify the efficacy of their so-
phisticated methods on suites of academic test problems. These may have
a very large number of variables and exhibit other complications, but they
may still be comparatively easy to manipulate, especially as far as the calcu-
lation of derivatives is concerned. Potential users with more complex models
may then give up on calculus-based models and resort to rather crude meth-
ods, possibly after drastically reducing the number of free variables in order
to avoid the prohibitive runtimes resulting from excessive numbers of model
reruns.
Not surprisingly, on the kind of real-life model indicated above, nothing
can be achieved in an entirely automatic fashion. Therefore, the author
much prefers the term ‘algorithmic differentiation’ and will refer to the sub-
ject from here on using the common acronym AD. Beyond this minor la-
belling issue lies the much more serious task of deciding which research and
development activities ought to be described in a fair but focused survey
on AD. To some, AD is an exercise in software development, about which
they want to read (or write) nothing but a crisp online documentation. For
others, AD has become the mainstay of their research activity with plenty of
intriguing questions to resolve. As in all interdisciplinary endeavours, there
are many close connections to other fields, especially numerical linear alge-
bra, computer algebra, and compiler writing. It would be preposterous as
well as imprudent to stake out an exclusive claim for any particular subject
area or theoretical result. There has been a series of workshops and confer-
ences focused on AD and its applications (Griewank and Corliss 1991, Berz,
Bischof, Corliss and Griewank 1996, Corliss, Faure, Griewank, Hascoët and
Naumann 2001), but nobody has seen the need for a dedicated journal. Re-
sults appear mostly in journals on optimization, or more generally, numerical
analysis and scientific computing.

1.1. How numeric and how symbolic is AD?


There is a certain dichotomy in that the final results of AD are numeri-
cal derivative values, which are usually thought to belong to the domain of
nonlinear analysis and optimization. On the other hand, the development
of AD techniques and tools can for the most part stay blissfully oblivious
to issues of floating point arithmetic, much like sparse matrix factoriza-
tion in the positive definite case. Instead, we need to analyse and manip-
ulate discrete objects like computer codes and related graphs, sometimes
employing tools of combinatorial optimization. There are no division op-
erations at all, and questions of numerical stability or convergence orders
and rates arise mostly for dynamic codes that represent iterative solvers or
adaptive discretizations.
The last assertion flies in the face of the notion that differentiation is an
ill-conditioned process, which is firmly ingrained in the minds of numerical
mathematicians. Whether or not this conviction is appropriate depends on
the way in which the functions to be differentiated are provided by the user.
If we merely have an oracle generating function values with prescribed accu-
racy, derivatives can indeed only be estimated quite inaccurately as divided
differences; possibly averaged over a number of trial evaluations in situa-
tions where errors can be assumed to have a zero mean statistically. If, on
the other hand, the oracle takes the form of a computer code for evaluating
the function, then this code can often be analysed and transformed to yield
an extended code that also evaluates desired derivatives. In both scenarios
we have backward stability à la Wilkinson, in that the floating point values
obtained can be interpreted as the exact derivatives of a ‘slightly’ perturbed
problem with the same structure. The crucial difference is that in the first
scenario there is not much structure to be preserved, as the functions may
be subjected to discontinuous perturbations, generating slope changes of ar-
bitrary size. In the second scenario the function is effectively prescribed as
the composite of elementary functions, whose values and derivative are only
subject to variations on the order of the machine accuracy.
More abstractly, we may simply state that, as a consequence of the chain
rule, the composition operation
F ≡ F2 ◦ F1 for Fi : Ωi−1 → Ωi
is a jointly continuous mapping between Banach spaces of differentiable
functions, i.e., it belongs to
C 1 (Ω0 , Ω1 ) × C 1 (Ω1 , Ω2 ) → C 1 (Ω0 , Ω2 ).
In practical terms this means that we differentiate the composite function
F by appropriately combining the derivatives of the two constituents F1
and F2 , assuming that procedures for their evaluation are already in place.
By doing this recursively, we effectively extend an original procedure for
evaluating function values by themselves into one that also evaluates deriva-
tives. This is essentially an algebraic manipulation, which again reflects the
two-sided nature of AD, partly symbolic and partly numeric.
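Written out for the two constituents, the combination in question is nothing but the chain rule F ′(x) = F2′ (F1 (x)) F1′ (x) for x ∈ Ω0 ; an AD transformation evaluates the two factors numerically at the current argument and multiplies them, rather than building a symbolic expression for F ′.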
The term numeric differentiation is widely used to describe the common
practice of approximating derivatives by divided differences (Anderssen and
Bloomfield 1974). This approximation of differential quotients by difference
quotients means geometrically the replacement of tangent slopes by secant
slopes over a certain increment in each independent variable. It is well known
that, even for the increment size that optimally balances truncation and
rounding error, half of the significant digits are lost. Of course, the optimal
increment typically differs for each variable/function pair and the situation
is still more serious when it comes to approximating higher derivatives. No
such difficulty occurs in AD, as no parameter needs to be selected and there is
no significant loss of accuracy at all. Of course, the application of the chain
rule does entail multiplications and additions of floating point numbers,
which can only be performed with platform-dependent finite precision.
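To make the loss of accuracy concrete, the following small sketch (an illustration in Python with an arbitrarily chosen test function, not taken from the article) compares a one-sided divided difference at a near-optimal increment with the derivative value obtained by applying the chain rule to the elementals, as AD would:

    import math

    def f(x):                          # hypothetical smooth test function
        return math.exp(x) * math.sin(x)

    def df(x):                         # derivative via the product/chain rule,
        return math.exp(x) * (math.sin(x) + math.cos(x))   # as AD would evaluate it

    x = 1.0
    h = 2.0 ** -26                     # roughly sqrt(machine epsilon), near-optimal step
    fd = (f(x + h) - f(x)) / h         # one-sided divided difference
    print(abs(fd - df(x)) / abs(df(x)))   # relative error ~ 1e-8: about half the digits lost
    print(df(x))                       # accurate to working precision

With h ≈ sqrt(machine epsilon) the relative error of the difference quotient is of the order 10−8, that is, roughly half of the sixteen significant digits are lost, while the chain-rule value is accurate to working precision.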
Since AD has really very little in common with divided differences, it is
rather inappropriate to describe it as a ‘halfway house’ between numeric
and symbolic differentiation, as is occasionally done in the literature. What
then is the key feature that distinguishes AD from fully symbolic differen-
tiation, as performed by most computer algebra (CA) systems? A short
answer would be to say that AD applies the chain rule to floating point
numbers rather than algebraic expressions. Indeed, no AD package yields
tidy mathematical formulas for the derivatives of functions, no matter how
algebraically simple they may be. All we get is more code, typically source
or binary. It is usually not nice to look at and, as such, it never provides
any analytical insight into the nature of the function and its derivatives.
Of course, for multi-layered models, such insight by inspection is generally
impossible anyway, as the direct expression of function values in terms of
the independent variables leads to formulas that are already impenetrably
complex. Instead, the aim in AD is the accurate evaluation of derivative
values at a sequence of arguments, with an a priori bounded complexity in
terms of the operation count, memory accesses, and memory size.

1.2. Various phases and costs of AD


Relative to the complexity of the function itself, the complexity growth is
at worst linear in the number of independent or dependent variables and
quadratic in the degree of the derivatives required. This moderate and
predictable extra cost, combined with the achievement of good derivative
accuracy and the absence of any free method parameter, has led to the
wide-spread use of AD as built-in functionality in AMPL, GAMS, NEOS,
and other modelling or optimization systems. In contrast, the complexity
of symbolic manipulations typically grows exponentially with the depth of
the expression tree or the computational graph that represents a function
evaluation procedure. By making such blanket statements we are in danger
of comparing apples to oranges.
In computing derivatives using either CA or AD we have to distinguish at
least two distinct phases and their respective costs: first, a symbolic phase
in which the given function specification is analysed and a procedure for
evaluating the desired derivatives is prepared in a suitable fashion; second,
a numeric phase where this evaluation is actually carried out at one or
several numerical arguments. The more effort we invest in the symbolic
phase the more runtime efficiency we can expect in the numeric phase, which
will hopefully be applied sufficiently often to amortize the initial symbolic
investment. For example, extensive symbolic preprocessing of a routine
specifying a certain kind of finite element or the right-hand side of a stiff
ODE is likely to pay off if the same routine and its derivative procedure
are later called in the numeric phase at thousands of grid points or time-
steps, respectively. On the other hand, investing much effort in the symbolic
phase may not pay off, or simply be impossible, if the control flow of the
original routine is very dynamic. For example, one may think of a recursive
quadrature routine that adaptively subdivides the domains depending on
certain error estimates that can vary dramatically from call to call. Then
very little control flow information can be gleaned at compile-time and code
optimization is nearly impossible.
Of course, we can easily conceive of a multi-phase scenario where the
function specification is specialized at various levels and the corresponding
derivative procedures are successively refined accordingly. For example, we
may have an intermediate qualitative phase where, given the specification
of certain parameters like mesh sizes and vector dimensions, the AD tool
determines the sparsity pattern of a Jacobian matrix and correspondingly
allocates suitable data structures in preparation for the subsequent numeric
phase. Naturally, it may also output the sparsity pattern or provide other
qualitative dependence information, such as the maximal rank of the Jaco-
bian or even the algebraic multiplicity of certain eigenvalues.
Traditionally in AD the symbolic effort has been kept to the equivalent
of a few compiler-type passes through the function specification, possibly
including a final compilation by a standard compiler. Hence the symbolic
effort is at most of the order of a single function evaluation, albeit probably
with a rather large constant. However, this characteristic of all current AD
tools may change when more extensive dependence analyses or combinatorial
optimizations of the way in which the chain rule is applied are incorporated.
For the most part these considerable extra efforts in the symbolic phase will
reduce the resulting costs in the numerical phase only by constants but not
by orders of magnitude. Some of them are rather mundane rearrangements
or software modifications with no allure for the mathematician. Of course,
such improvements can make all the difference for the practical feasibility
of certain calculations.
This applies in particular to the calculation of gradients in the reverse
mode of AD. Being a discrete analogue of adjoints in ODEs, this backward
application of the chain rule yields all partials of a scalar function with
respect to an arbitrary number of variables at the cost of a small multiple of
the operation count for evaluating the function itself. If only multiplications
are counted, this growth factor is at most 3, and for a typical evaluation
code probably more like 2 on average. That kind of value seems reasonable
to people who skilfully write adjoint code by hand (Zou, Vandenberghe,
Pondeca and Kuo 1997), and it opens up the possibility of turning simulation
codes into optimization codes with a growth of the total runtime by a factor
in the tens rather than the hundreds. Largely because of memory effects,
it is not uncommon that the runtime ratio for a single gradient achieved
by current AD tools is of the order 10 rather than 2. Fortunately, the
memory-induced cost penalty is the same if a handful of gradients, forming
the Jacobian of a vector function, is evaluated simultaneously in what is
called the vector-reverse mode.
In general, it is very important that the user or algorithm designer cor-
rectly identifies which derivative information is actually needed at any par-
ticular point in time, and then gets the AD tool to generate all of it jointly.
In this way optimal use can be made of common subcalculations or memory
accesses. The latter may well determine the wall clock time on a modern
computer. Moreover, the term derivative information needs to be inter-
preted in a wider sense, not just denoting vectors, matrices or even tensors
of partial derivatives, as is usually understood in hand-written formulas or
the input to computer algebra systems. Such rectangular derivative ar-
rays can often be contracted by certain weighting and direction vectors to
yield derivative objects with fewer components. By building this contrac-
tion into the AD process all aspects of the computational costs can usually
be significantly reduced. A typical scenario is the iterative calculation of
approximate Newton steps, where only a sequence of Jacobian–vector, and
possibly vector–Jacobian, products are needed, but never the Jacobian as
such. In fact, as we will discuss in Section 6, it is not really clear what
the ‘Jacobian as such’ really is. Instead we may have to consider various
representations depending on the ultimate purpose of our numerical calcu-
lation. Even if this ultimate purpose is to compute exact Newton steps by
a finite procedure, first calculating all Jacobian entries may not be a good
idea, irrespective of whether it is sparse or not. In some sense Newton steps
themselves become derivative objects. While that may seem a conceptual
stretch, there are some other mathematical objects, such as Lipschitz con-
stants, error estimates, and interval enclosures, which are naturally related
to derivatives and can be evaluated by techniques familiar from AD.

1.3. Some historical remarks


Historically, there has been a particularly close connection between the re-
verse mode of AD and efforts to estimate the effects of rounding errors in
evaluating functions or performing certain numerical algorithms. The ad-
joint quantities generated in the reverse mode for each intermediate variable
of an evaluation process are simply the sensitivities of the weighted output
with respect to perturbations of this particular variable. Hence we obtain the
first-order Taylor expansion of the output error with respect to all interme-
diate errors, a linearized estimate apparently first published by Linnainmaa
in 1972 (see Linnainmaa (1976)). Even earlier, in 1966, Volin and Ostrovski
(see G. M. Ostrovskii and Borisov (1971)) suggested the reverse mode for
the optimization of certain process models in chemical engineering. Inter-
estingly, the first major publication dedicated to AD, namely the seminal
book by Louis Rall (1983) did not recognize the reverse, or adjoint, mode
as a variant of AD at all, but instead covered exclusively the forward or
direct mode. This straightforward application of the chain rule goes back
much further, at least to the early 1950s, when it was realized that comput-
ers could perform symbolic as well as numerical manipulations. In the past
decade there has been growing interest in combinations of the forward and
reverse mode, a wide range of possibilities, which had in principle already
been recognized by Ostrovski et al. (see Volin and Ostrovskii (1985)).
Based on theoretical work by Cacuci, Weber, Oblow and Marable (1980),
the first general-purpose tool implementing the forward and reverse mode
was developed in the 1980s by Ed Oblow, Brian Worley and Jim Horwedel
at Oak Ridge National Laboratory. Their system GRESS/ADGEN was
successfully applied to many large scientific and industrial codes, especially
from nuclear engineering (Griewank and Corliss 1991). Hence there was an
early proof of concept, which did, however, have next to no impact in the
numerical analysis community, where the notion that derivatives are always
hard if not impossible to come by persisted for a long time, possibly to
this day. Around the time of the first workshop on AD at Breckenridge in
1991, several system projects were started (e.g., ADIFOR, Odyssée, TAMC,
ADOL-C), though regrettably none with sufficient resources for the devel-
opment of a truly professional tool. Currently several projects are under
way, some to cover MATLAB, and others that aim at an even closer inte-
gration into a compilation platform for several source languages (Hague and
Naumann 2001).
As suggested by the title, in this article we will concentrate on the math-
ematical questions of AD and leave the implementation issues largely aside.
Following current practice in numerical linear algebra, we will measure com-
putational complexity by counting fused multiply-adds and denote them
simply by OPS. They can be performed in one or two cycles on modern
super-scalar processors, and are essentially the only extra arithmetic opera-
tions introduced by AD. Furthermore, in Section 4 on reversal schedules, we
will also emphasize the maximal memory requirement and the total number
of memory accesses. They may dominate the runtime even though most
of them occur in a strictly sequential fashion, thus causing only a minimal
number of cache misses.

1.4. Structure of the article


The paper is organized as follows. In Section 2 we set up a framework
of function evaluation procedures that is invariant with respect to adjoin-
ing, i.e., application of reverse mode AD. To achieve this we consider from
the beginning vector-valued and incremental elemental functions, whereas
in Griewank (2000) and most other presentations of AD only scalar-valued
assignments are discussed. These generalizations make the notation in some
respects a little more complicated, and will therefore be suspended in the
second part of Section 6. That final section discusses the rather intricate
relationship between Jacobian matrices and computational graphs. It is
in part speculative and meant to suggest directions of future research in
AD. Section 3 reviews the basic and well-established techniques of AD, with
methods for the evaluation of sparse Jacobians being merely sketched ver-
bally. For further details on this and other aspects one may consult the
author’s book (Griewank 2000).
The following three sections treat rather different topics and can be read
separately from each other. In many people’s minds the main objection to
the reverse mode in its basic form is its potentially very large demand for
temporary memory. This may either exceed available storage or make the
execution painfully slow owing to extensive data transfers to remote regions
of the memory hierarchy. Therefore, in Section 4 we have elaborated check-
pointing techniques for program reversals on serial and parallel machines in
considerable detail. The emphasis is on the discrete optimization problem
of checkpoint placement rather than their implementation from a computer
science perspective. In Section 5 we will consider adjoints of iterative pro-
cesses for the solution of linear or nonlinear state equations. In that case
loop reversals can be avoided altogether using techniques that were also
considered in Hascoët, Fidanova and Held (2001), Giles and Süli (2002) and
Becker and Rannacher (2001).
Apart from second-order adjoints, which are covered in Sections 3 and 5,
we will not discuss the evaluation of higher derivatives even though there
are some interesting recent applications to DAEs and ODEs (see Pantelides
(1988), Pryce (1998) and Röbenack and Reinschke (2000)). Explicit formu-
las for the higher derivatives of composite functions were published by Faa
di Bruno (1856). Recursive formulas for the forward propagation of Taylor
series with respect to a single variable were published by Moore (1979) and
have been reproduced many times since: see, for instance, Kedem (1980),
Rall (1983), Lohner (1992) and Griewank (2000). The key observation is
that, because all elemental functions of interest, e.g., exp(x), sin(x), etc., are
solutions of linear ODEs, univariate Taylor series can be propagated with a
complexity that grows only quadratically with the degree d of the highest co-
efficient. Using FFT or other fast convolution algorithms we can reduce this
complexity to order d ln(1 + d), a possibility that has not been exploited in
practice. Following a suggestion by Rall, it was shown in Bischof, Corliss and
Griewank (1993), Griewank, Utke and Walther (2000) and Neidinger (200x)
that multivariate Taylor expansions can be computed rather efficiently us-
ing families of univariate polynomials. Similarly, the reverse propagation of
Taylor polynomials does not introduce any really new aspects, and can be
interpreted as performing the usual reverse mode in Taylor series arithmetic
(Christianson 1992, Griewank 2000).
Like the kind of material presented, the style and depth of treatment in
this article is rather heterogeneous. Many observations are just asserted ver-
bally with references to the literature but others are formalized as lemmas
or propositions. Proofs have been omitted, except for a new, shorter demon-
stration that binomial reversal schedules are optimal in Proposition 4.2, the
proof of the folklore result, Proposition 6.3, concerning the generic rank of a
Jacobian, and some simple observations regarding the new concept of Jaco-
bian scarcity introduced in Section 6. For a more detailed exposition of the
basic material the reader should consult the author’s book (Griewank 2000),
and for more recent results and application studies the proceedings volume
edited by Corliss et al. (2001). A rather comprehensive bibliography on AD
is maintained by Corliss (1991).

2. Evaluation procedures in incremental form


Rather than considering functions ‘merely’ as mathematical mappings we
have to specify them by evaluation procedures and to a large extent identify
the two. These conceptual codes are abstractions of actual computer codes,
as they may be written in Fortran or C and their various extensions. Gen-
erally, the quality and complexity of the derived procedures for evaluating
derivatives will reflect the properties of the underlying function evaluation
procedure quite closely. The following formalism does not make any really
substantial assumptions other than that the procedure represents a finite
algorithm without branching of the control flow. We just have to be a little
careful concerning the handling of intermediate quantities.
The basic assumption of AD is that the function to be differentiated is, at
least conceptually, evaluated by a sequence of elemental statements, such as
vi = ϕi (ui ) or vi += ϕi (ui ) for all i ∈ I. (2.1)
In other words we have the evaluation of an elemental function ϕi followed
by a standard assignment or an additive incrementation. Here the C-style
notation a += b abbreviates a = a+b for any two compatible vectors a and b.
We include incremental assignments and, later, allow overlap between the
left-hand sides vi , because both aspects occur frequently in adjoint programs.
To obtain a unified notation for both kinds of assignment we introduce for
each index i ∈ I a flag σi ∈ {0, 1}, and write
vi = σi vi + ϕi (ui ) for all i ∈ I. (2.2)
To make the adjoint procedure even more symmetric we may prefer the
equivalent statement pair
[ vi ∗= σi ; vi += ϕi (ui ) ] for all i ∈ I, (2.3)
where a ∗= α abbreviates a = α ∗ a for any vector a and scalar α. In this
way we have a fully incremental form, which turns out to be convenient
for adjoining later on. Throughout we will neglect the cost of the additive
and multiplicative incrementations. Normally the ϕi will be taken from a
library of arithmetic operations and intrinsic functions. However, we may
also include basic linear algebra subroutines (BLAS) and other library or
user-defined functions, provided corresponding derivative procedures ϕ̇i and
ϕ̄i , as defined in Section 3, can be supplied.
The index set I is assumed to be partially ordered by an acyclic, nonre-
flexive precedence relation j ≺ i, which indicates that ϕj must be applied
before ϕi . Deviating from the more mathematical AD literature, includ-
ing Griewank (2000), we will identify the indices i ∈ I with the whole
statement (2.2) rather than just the output variable vi . This approach is
more in line with compiler literature (Hascoët 2001). The difference dis-
appears whenever we exclude incremental statements. We may interpret
I as the vertex set of the directed acyclic graph G = (I, E) with edge set
E ≡ {(j, i) ∈ I × I : j ≺ i}. When the vertices of G are annotated with the
elemental functions ϕi , it is often called the computational graph, a concept
apparently due to Kantorovich (1957). As a small example we consider the
scalar function
y = exp(x1 ) ∗ sin(x1 + x2 ). (2.4)
It can be evaluated using 3 intermediates by the program listed in Figure 2.1;
on the right is displayed the corresponding computational graph.

Program:                      Graph:

v3 = exp(v1 )                 [vertices 1, 2 (independents x1 , x2 ) and 3, 4, 5,
v4 = sin(v1 + v2 )             with edges 1 → 3, 1 → 4, 2 → 4, 3 → 5, 4 → 5]
v5 = v3 ∗ v4

Figure 2.1. Program and graph associated with (2.4).
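As a sketch of how the three-statement program of Figure 2.1 might be written as executable code (plain Python chosen purely for illustration; the function name F and the argument values are ours), the elemental decomposition of (2.4) reads:

    import math

    def F(x1, x2):
        # elemental decomposition of y = exp(x1) * sin(x1 + x2), cf. (2.4)
        v1, v2 = x1, x2           # independents
        v3 = math.exp(v1)         # elemental phi_3
        v4 = math.sin(v1 + v2)    # elemental phi_4
        v5 = v3 * v4              # elemental phi_5
        return v5                 # dependent y

    print(F(1.0, 0.5))            # evaluates (2.4) at (x1, x2) = (1.0, 0.5)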

Without loss of generality, we may assume that I equals the range of
integers I = [1 . . . l], and consider (2.2) as a loop representing the succession
of statements executed by a computer program for a given set of data.
In particular we consider the control flow to be fixed. The partial ordering
≺ of I allows us to discuss the runtime ratios between the derived and the
original evaluation procedures on parallel as well as serial machines. We
will relax the usual convention that the variables vi are real scalars, and
instead allow them to be elements of an arbitrary real Hilbert space Hi of
dimension mi ≡ dim(Hi ). We will refer to the scalar case whenever mi = 1
for all i = 1 . . . l.

2.1. The state space and its transformations


In real computer programs several variables are often stored in the same
location. Hence we do not require the Hi to be completely separate but to
have a natural embedding into a common state space H, such as
Pi Hi ⊂ H and vi = PiT v for v ∈ H,
where Pi is orthogonal, in that PiT Pi = I is the identity on Hi . Here
and throughout we will identify linear mappings between finite-dimensional
spaces with their matrix description in terms of suitable orthonormal bases.
Correspondingly we will denote the adjoint mapping of P as the transpose
P T . We will choose H minimal such that
H ≡ P1 H1 + P2 H2 + · · · + Pl−1 Hl−1 + Pl Hl .
In the simplest scenario H is the Cartesian product of all Pi Hi , in which case
we have a so-called single assignment procedure. Otherwise the variables vi
are said to overlap, and we have to specify more carefully what (2.2) and
thus (2.3) actually mean. Namely, we assume that there is an underlying
state vector v ∈ H, which is transformed according to
Φi (v) ≡ [ I − (1 − σi ) Pi PiT ] v + Pi ϕi (Qi v). (2.5)
Here the argument selections Qi denote orthogonal projections from H into
the domains Di of the elemental functions ϕi so that
ui = Qi v ∈ Di = dom(ϕi ) with ni = dim(Di ). (2.6)
Rather than restricting ourselves to Cartesian projections that pick out coor-
dinate components of v, we allow general linear Qi , and thus include the im-
portant concept of group partial separability (Conn, Gould and Toint 1992)
in a natural way. Of course, the ϕi need not be well defined on the whole
Euclidean space Di in practice, but we will assume here that this difficulty
does not occur at arguments of interest. For basic observations regarding
exceptional arguments consider Sections 11.2 and 11.3 in Griewank (2000).
To distinguish them verbally from the underlying elemental functions ϕi
we will refer to the Φi as elemental transitions. Given any initial state u ∈ H
we can now apply the Φi in any specified order and obtain a corresponding
final state v ≡ Φ(u) ∈ H where Φ ≡ Φl ◦ Φl−1 ◦ · · · ◦ Φ1 denotes the
composition of the elemental transitions.
For example (2.4) we may set
H = R5 , Hi = R and Pi = ei for i = 1 . . . 5.
Here ei ∈ R5 denotes the ith Cartesian basis vector. Furthermore we select
Q3 = (1 0 0 0 0),   Q4 = (1 1 0 0 0),   Q5 = ( 0 0 1 0 0
                                               0 0 0 1 0 ),
and may thus evaluate (2.4) by the transformations
v(0) = 0,
v(1) = (I − e1 e1T ) v(0) + e1 x1 ,
v(2) = (I − e2 e2T ) v(1) + e2 x2 ,
v(3) = (I − e3 e3T ) v(2) + e3 exp(Q3 v(2) ),
v(4) = (I − e4 e4T ) v(3) + e4 sin(Q4 v(3) ),
v(5) = (I − e5 e5T ) v(4) + e5 prod(Q5 v(4) ),
where I = I5 and prod(a, b) ≡ a ∗ b for (a, b) ∈ R2 .
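The transformations above can be reproduced verbatim in a few lines; the following sketch (using NumPy purely for illustration, with the particular choices Pi = ei and the Qi given above) carries the state vector through the five elemental transitions and recovers y = exp(x1 ) sin(x1 + x2 ):

    import numpy as np

    def e(i, n=5):
        """i-th Cartesian basis vector of R^n (1-based, matching the text)."""
        v = np.zeros(n)
        v[i - 1] = 1.0
        return v

    Q3 = np.array([[1.0, 0, 0, 0, 0]])
    Q4 = np.array([[1.0, 1, 0, 0, 0]])
    Q5 = np.array([[0.0, 0, 1, 0, 0],
                   [0.0, 0, 0, 1, 0]])

    x1, x2 = 1.0, 0.5
    v = np.zeros(5)                                        # v(0) = 0

    # independents: trivial argument mapping, constant elementals
    v = v - e(1) * (e(1) @ v) + e(1) * x1                  # v(1)
    v = v - e(2) * (e(2) @ v) + e(2) * x2                  # v(2)

    # elemental transitions (2.5), all with sigma_i = 0
    v = v - e(3) * (e(3) @ v) + e(3) * np.exp(Q3 @ v).item()   # v(3)
    v = v - e(4) * (e(4) @ v) + e(4) * np.sin(Q4 @ v).item()   # v(4)
    u = Q5 @ v                                             # picks (v3, v4)
    v = v - e(5) * (e(5) @ v) + e(5) * (u[0] * u[1])       # v(5), prod(a, b) = a*b

    y = e(5) @ v                                           # projection onto the dependent
    assert np.isclose(y, np.exp(x1) * np.sin(x1 + x2))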

2.2. No-overwrite conditions


Example (2.4) already shows some properties that we require in general.
Namely, we normally impose the natural condition that the function ϕi may
only be applied when its argument ui has reached its final value. In other
words we must have evaluated all ϕj that precede ϕi so that
Qi Pj ≠ 0 ⇒ j ≺ i. (2.7)
In particular we always have Qi Pi = 0, as ≺ is assumed nonreflexive. In
addition to these write–read dependences between the two statements ϕj
and ϕi , we also have to worry about write–write dependences where several
elemental functions ‘overlap’, in that equivalently
PiT Pj ≠ 0 ⇔ PjT Pi ≠ 0.
Overlapping really only makes sense if all but possibly one of the ϕi are
incremental, i.e., σi = 1, since otherwise values are overwritten before they
are ever used. Moreover, since by their very nature the incremental ϕi with
overlapping Pi do commute, their relative order does not matter, but they
all must succeed a possible, nonincremental one. Formally we impose the
condition
PjT Pi ≠ 0 ⇒ ( j ≺ i and σi = 1 ) or ( i ≺ j and σj = 1 ). (2.8)

From now on we will assume that we have an acyclic ordering satisfying
conditions (2.7), (2.8) and furthermore make the monotonicity assumption
j ≺ i ⇒ j < i, where < denotes the usual ordering of integers by size.

2.3. Independent and dependent variables


Even though they are patently obvious in many practical codes, we may
characterize the independent and dependent variables in terms of the prece-
dence relation ≺ as follows. All indices j that are minimal with respect to ≺
must have a trivial argument mapping Qj = 0. Otherwise, by the assumed
minimality of the state space we would have Qj Pk = 0 for some k. All such
minimal ϕj initialize vj to some constant vector xj unless j is incremental,
which we will preclude by assumption. Furthermore, we assume without
loss of generality that the minimal indices are given by j = 1 . . . n and may
thus write
vj = xj for j = 1 . . . n.
We consider
x = (xj )j=1...n ∈ X ≡ P1 H1 × P2 H2 × · · · × Pn−1 Hn−1 × Pn Hn .
as the vector of independent variables. Similarly, we assume that the max-
imal indices i ∈ I with respect to ≺ are given by i = l − m + 1 . . . l. The
values vi with i > l − m do not impact any other elemental function and are
therefore considered as dependent variables
yi = vı̂ with ı̂ ≡ l − m + i for i = 1 . . . m.
We consider
y = (yi )i=1...m ∈ Y ≡ Pl−m+1 Hl−m+1 × · · · × Pl−1 Hl−1 × Pl Hl
as the vector of dependent variables. Combining our assumption on the in-
dependents and dependents in the following additional condition, we obtain
j ≺ i ⇒ i > n and j ≤ l − m. (2.9)
Also, we exclude independent or dependent elementals from being incremen-
tal, so that
i ≤ n or i > l − m ⇒ σi = 0.
Consequently, by (2.8) the Pi Hi for i = 1 . . . n and i = l − m + 1 . . . l must
be mutually orthogonal, so that we have in fact X ⊂ H and Y ⊂ H.

2.4. Four-part form and complexity


To ensure that the result of our evaluation procedure is uniquely defined
even when there are incremental assignments, we will assume that the state


Table 2.1. Original function evaluation procedure.

v = 0
vi = xi                                      for i = 1...n
[ vi ∗= σi ; vi += ϕi (ui ) ]                for i = n + 1...l
ym−i = vl−i                                  for i = m − 1...0

vector v is initialized to zero. Hence we obtain the program structure dis-
played in Table 2.1.
With PX and PY the orthogonal projections from H onto its subspaces
X and Y , we obtain the mapping from x = (xj )j=1...n to y = (yi )i=1...m as the
composite function
F ≡ PY ◦ Φl ◦ Φl−1 ◦ · · · ◦ Φ2 ◦ Φ1 ◦ PXT , (2.10)
where the transitions Φi are as defined in (2.5). The application of (2.10)
to a particular vector x ∈ X can be viewed as the loop
v(0) = PXT x, v(i) = Φi (v(i−1) ) for i = 1 . . . l, y = PY v(l) . (2.11)
We may summarize the mathematical development in this section as fol-
lows.
General Assumption: Elemental Decomposition.
(i) The elemental functions ϕi : Di → Hi = PiT H are d ≥ 1 times contin-
uously differentiable on their open domains Di = Qi H.
(ii) The pairs of linear mappings {Pi , Qi } and the partial ordering ≺ are
consistent in that (2.7), (2.8), and (2.9) hold.
As an immediate consequence we state the following result without proof.
Proposition 2.1. Given the General Assumption, Table 2.1 yields, for
any monotonic total ordering of I, the same unique vector function defined
in (2.10), that is,
X ∋ x → y = F (x) ∈ Y,
which is d times continuously differentiable, by the chain rule.
Only in Sections 3.6 and 5.4 will we need d > 1. The partial ordering in I
is only important for parallel evaluation and reversal without a value stack,
as described in Section 3.2.


As to the computational cost, we will assume that the operation count is
additive in that
OPS(F (x)) = Σi=1...l OPS(ϕi (ui )), (2.12)
where we have neglected the cost of performing the incrementations and
projections.
On a parallel machine elemental functions ϕi and ϕj that do not depend
on each others’ results can be executed simultaneously. It is customary to
assume that a read conflict, i.e., Qi QjT ≠ 0, is no inhibition for concurrent
execution, and we will assume here that the same is true for the incremen-
tation of results, i.e., PiT Pj ≠ 0. This is a crucial assumption for showing
that adjoints have the same degree of parallelism as the original evaluation
procedure. Whether or not it is realistic depends on the computing plat-
form and the nature of the elemental functions ϕi . If they are rather chunky,
involving many internal calculations, the delays through incremental write
conflict are indeed likely to be negligible. In any case, it makes no sense to
schedule individual operations separately on a single processor machine, so
that we may consider the ϕi to be more substantial subtasks in a parallel
context. Then we can estimate the wall clock time by the longest, and thus
critical path, i.e.,
CLOCK(F (x)) = maxP⊂G Σi∈P OPS(ϕi ). (2.13)
Here P denotes a directed path in G with i ∈ P ranging over its vertices.
Throughout we will use CLOCK to denote the idealized wall clock time on a
parallel machine with an unlimited supply of processors and zero commu-
nication costs. For practical implementations of AD with parallelism, see
Carle and Fagan (1996), Christianson, Dixon and Brown (1997) and Mancini
(2001).
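As a toy illustration of (2.12) and (2.13) (a sketch under the assumption of unit cost for each of the three nonlinear elementals of Figure 2.1 and zero cost for the independents), the critical-path value can be obtained by a longest-path recursion over the computational graph:

    # vertices 3, 4, 5 carry the elementals exp, sin, prod of Figure 2.1;
    # vertices 1, 2 are independents with zero cost (assumed here).
    preds = {1: [], 2: [], 3: [1], 4: [1, 2], 5: [3, 4]}
    ops = {1: 0, 2: 0, 3: 1, 4: 1, 5: 1}        # assumed unit cost per elemental

    clock = {}
    for i in sorted(preds):                      # topological, since j < i whenever j precedes i
        clock[i] = ops[i] + max((clock[j] for j in preds[i]), default=0)

    print(max(clock.values()))                   # CLOCK(F) = 2 for this graph, cf. (2.13)
    print(sum(ops.values()))                     # OPS(F)   = 3, cf. (2.12)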

3. Basic forward and reverse mode of AD


In this section we introduce the basic forward and reverse mode of AD and
develop estimates for its computational complexity. They may be geomet-
rically interpreted as the forward propagation of tangents or the backward
propagation of normals or cotangents. Therefore they are often referred to
as the tangent and the cotangent mode, respectively. Rather than propagat-
ing a single tangent or normal, we may also carry forward or back bundles
of them in the so-called vector mode. It amortizes certain overhead costs
and allows the efficient evaluation of sparse Jacobians by matrix compres-
sion. These aspects as well as techniques for propagating bit patterns will
be sketched in Section 3.5. The section concludes with some remarks on
higher-order adjoints and a brief section summary.

3.1. Forward differentiation


Differentiating (2.10), we obtain by the chain rule the Jacobian
F ′(x) ≡ PY Φ′l Φ′l−1 · · · Φ′2 Φ′1 PXT , (3.1)
where
Φ′i = I − (1 − σi ) Pi PiT + Pi ϕ′i (ui ) Qi . (3.2)
Rather than computing the whole Jacobian explicitly, we usually prefer in
AD to propagate directional derivatives along a smooth curve x(t) ⊂ X with
t ∈ (−ε, ε) for some ε > 0. Then y(t) ≡ F (x(t)) ⊂ Y is also a smooth curve
with the tangent
ẏ(t) = (d/dt) F (x(t)) = F ′(x(t)) (d/dt) x(t) ⊂ Y. (3.3)
Similarly all states v(i) = v(i) (t) and intermediates vi = vi (t) = PiT v(i) are
differentiable functions of t ∈ (−ε, ε) with the tangent values
v̇i ≡ (d/dt) vi (t)|t=0 .

Multiplying (3.1) by the input vector ẋ = dx(t)/dt|0 , we obtain the vector
equation
ẏ = PY ( Φ′l ( Φ′l−1 ( · · · ( Φ′2 ( Φ′1 ( PXT ẋ ))) · · · ))).
The bracketing is equivalent to the loop of matrix-vector products
v̇(0) = PXT ẋ, v̇(i) = Φ′i (v(i−1) ) v̇(i−1) for i = 1 . . . l, ẏ = PY v̇(l) . (3.4)
With ui = Qi v(i−1) , u̇i = Qi v̇(i−1) and ϕ̇i (ui , u̇i ) ≡ ϕ′i (ui ) u̇i this may be
rewritten as the so-called tangent procedure listed in Table 3.1, and it can
now be stated formally.
Proposition 3.1. Given the General Assumption, the procedure listed in
Table 3.1 yields the value ẏ = F ′(x)ẋ, with F as defined in Proposition 2.1.
Obviously Table 3.1 has the same form as Table 2.1 with H replaced by
H × H, Qi by Qi × Qi , and Pi by Pi × Pi . In other words, all argument and
value spaces have been doubled up by a derivative companion, but the flags
σi , the precedence relation ≺ and thus the structure of the computational
graph remain unchanged.
As we can see, the evaluation of the combined function
[F (x), Ḟ (x, ẋ)] ≡ [F (x), F ′(x)ẋ] (3.5)
requires exactly one evaluation of the corresponding elemental combinations
[ϕi (ui ), ϕ̇i (ui , u̇i )] ≡ [ϕi (ui ), ϕ′i (ui ) u̇i ].


Table 3.1. Tangent procedure derived from original procedure in Table 2.1.

[v, v̇] = 0
[vi , v̇i ] = [xi , ẋi ]                                          for i = 1...n
[ [vi , v̇i ] ∗= σi ; [vi , v̇i ] += [ϕi (ui ), ϕ̇i (ui , u̇i )] ]      for i = n + 1...l
[ym−i , ẏm−i ] = [vl−i , v̇l−i ]                                  for i = m − 1...0

We have bracketed [ϕi , ϕ̇i ] side-by-side to indicate that good use can of-
ten be made of common subexpressions in evaluating the function and its
derivative. In any case we can show that
OPS([F (x), Ḟ (x, ẋ)]) / OPS(F (x)) ≤ max1≤i≤l OPS([ϕi (ui ), ϕ̇i (ui , u̇i )]) / OPS(ϕi (ui )) ≤ 3. (3.6)
The upper bound 3 on the relative cost of performing a single tangent
calculation is arrived at by taking the maximum over all elemental functions.
It is actually attained for a single multiplication, which spawns two extra
multiplications by the chain rule
v = ϕ(u, w) ≡ u ∗ w → v̇ = ϕ̇(u, w, u̇, ẇ) = u ∗ ẇ + w ∗ u̇.
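A minimal operator-overloading sketch of this tangent propagation (a hypothetical Tangent class written for illustration, not the interface of any particular AD package) carries the pairs [vi , v̇i ] through example (2.4); the multiplication rule above appears in __mul__ and indeed costs two extra multiplications:

    import math

    class Tangent:
        """Pair [value, tangent], propagated as in Table 3.1 (scalar case)."""
        def __init__(self, val, dot=0.0):
            self.val, self.dot = val, dot
        def __add__(self, other):
            return Tangent(self.val + other.val, self.dot + other.dot)
        def __mul__(self, other):
            # v = u*w  ->  v_dot = u*w_dot + w*u_dot  (two extra multiplications)
            return Tangent(self.val * other.val,
                           self.val * other.dot + other.val * self.dot)

    def t_exp(u):
        v = math.exp(u.val)
        return Tangent(v, v * u.dot)

    def t_sin(u):
        return Tangent(math.sin(u.val), math.cos(u.val) * u.dot)

    # directional derivative of (2.4) at (x1, x2) = (1.0, 0.5) along xdot = (1, 0)
    x1 = Tangent(1.0, 1.0)
    x2 = Tangent(0.5, 0.0)
    y = t_exp(x1) * t_sin(x1 + x2)
    print(y.val, y.dot)   # y.dot = dF/dx1 = exp(x1)*(sin(x1+x2) + cos(x1+x2))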
Since the data dependence relation ≺ applies to the ϕ̇i in exactly the same
way as to the ϕi , we obtain on a parallel machine
CLOCK([F (x), Ḟ (x, ẋ)]) ≤ 3 CLOCK(F (x)),
where CLOCK is the idealized wall clock time introduced in (2.13).
The bound 3 is pessimistic as the cost ratio between ϕ and ϕ̇ is more
advantageous for most other elemental functions. On the other hand actual
runtime ratios between codes representing Table 3.1 and Table 2.1 may well
be worse on account of various effects including loss of vectorization. This
gap between operation count and actual runtime is likely to be even more
marked for the following adjoint calculation.

3.2. Reverse differentiation and adjoint vectors


Rather than propagating tangents forward we may propagate normals back-
ward, that is, instead of computing ẏ = Ḟ (x, ẋ) = F ′(x)ẋ we may eval-
uate x̄ = F̄ (x, ȳ) ≡ ȳ F ′(x). Here the dual vectors ȳ ∈ Y ∗ = Y T and
x̄ ∈ X ∗ = X T are thought of as row vectors. Using the transpose of the
Jacobian product representation (3.1) we find that
x̄T = PX ( [Φ′1 ]T ( [Φ′2 ]T ( · · · ( [Φ′l−1 ]T ( [Φ′l ]T ( PYT ȳT ))) · · · ))).
The bracketing is equivalent to the loop of vector-matrix products
v̄(l) = ȳ PY , v̄(i−1) = v̄(i) Φ′i (v(i−1) ) for i = l . . . 1, x̄ = v̄(0) PXT .
Formally we will write, in agreement with (3.2),
v̄(i−1) = Φ̄i (v(i−1) , v̄(i) ) ≡ v̄(i) Φ′i (v(i−1) ),
where, as ever, adjoints are interpreted as row vectors. With v̄i ≡ v̄(i) Pi and
ūi = v̄(i) QiT , we obtain from (3.2), using Qi Pi = 0, the adjoint elemental
function
[ ūi += v̄i ϕ′i (ui ); v̄i ∗= σi ]. (3.7)
In other words the multiplicative statement vi ∗= σi is self-adjoint in that,
correspondingly, v̄i ∗= σi . The incremental part vi += ϕ(ui ) generates the
equally incremental statement
ūi += ϕ̄i (ui , v̄i ) ≡ v̄i ϕ′i (ui ). (3.8)
Following the forward sweep of Table 2.1, we may then execute the so-called
reverse sweep listed in Table 3.2. Now, using the assumptions on the partial
ordering in I, we obtain the following result.
Proposition 3.2. Given the General Assumption, the procedure listed in
Table 3.2 yields the value x̄ = ȳ F ′(x), with F as defined in Proposition 2.1.

Proof. The only fact to ascertain is that the argument ui of the ϕ̄i defined
in (3.8) still has the correct value when it is called up in the third line of
Table 3.2. However, this follows from the definition of ui in (2.6) and our
assumption (2.7), so that none of the statements ϕj with j > i, nor of course
the corresponding ϕ̄j , can alter ui before it is used again by ϕ̄i .
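For example (2.4), the forward sweep of Table 2.1 followed by the reverse sweep of Table 3.2 can be written out by hand as a few incremental statements; the sketch below (scalar case, variable names chosen to mirror Figure 2.1, not tool output) checks the result against the analytic gradient:

    import math

    def grad_F(x1, x2, ybar=1.0):
        # forward sweep (Table 2.1)
        v1, v2 = x1, x2
        v3 = math.exp(v1)
        v4 = math.sin(v1 + v2)
        v5 = v3 * v4
        # reverse sweep (Table 3.2): adjoints start at zero and are incremented
        vb1 = vb2 = vb3 = vb4 = 0.0
        vb5 = ybar
        vb3 += vb5 * v4                   # adjoint of v5 = v3 * v4
        vb4 += vb5 * v3
        vb1 += vb4 * math.cos(v1 + v2)    # adjoint of v4 = sin(v1 + v2)
        vb2 += vb4 * math.cos(v1 + v2)
        vb1 += vb3 * v3                   # adjoint of v3 = exp(v1), using exp'(v1) = v3
        return v5, (vb1, vb2)             # y and xbar = ybar * F'(x)

    y, (g1, g2) = grad_F(1.0, 0.5)
    # analytic check: dF/dx1 = exp(x1)*(sin(x1+x2)+cos(x1+x2)), dF/dx2 = exp(x1)*cos(x1+x2)
    assert abs(g1 - math.exp(1.0) * (math.sin(1.5) + math.cos(1.5))) < 1e-12
    assert abs(g2 - math.exp(1.0) * math.cos(1.5)) < 1e-12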
So far, we have treated the v̄i merely as auxiliary quantities during re-
verse matrix multiplications. In fact, their final values can be interpreted as
adjoint vectors in the sense that
v̄i ≡ ȳ ∂y/∂vi . (3.9)
Here ȳ is considered constant and the notation ∂/∂vi requires further expla-
nation. Partial differentiation is normally defined with respect to one of
several independent variables whose values uniquely determine the function
being differentiated. However, here y = F (x) is fully determined, via the
evaluation procedure of Table 2.1, as a function of x1 . . . xn , with each vi , for
i > n, occurring merely as an intermediate variable. Its value, vi = vi (x),


Table 3.2. Reverse sweep following forward sweep of Table 2.1.

v̄ = 0
v̄l−m+i = ȳi                                  for i = m...1
[ ūi += ϕ̄i (ui , v̄i ) ; v̄i ∗= σi ]             for i = l...n + 1
x̄j = v̄j                                      for j = n...1

being also uniquely determined by x, the intermediate vi is certainly not an-
other independent variable. To define v̄i unambiguously, we should perturb
the assignment (2.2) to
vi = σi vi + ϕi (ui ) + δi for δi ∈ Hi ,
and then set
v̄i = ∇δi (ȳ y)|δi =0 ∈ Hi∗ = Hi . (3.10)

Thus, v̄i quantifies the sensitivity of ȳ y = ȳ F (x) to a small perturbation δi


in the ith assignment. Clearly, δi can be viewed as a variable independent
of x, and we may allow one such perturbation for each vi .
The resulting perturbation δ of ȳ y can then be estimated by the first-
order Taylor expansion
δ ≈ Σi=1...l v̄i · δi . (3.11)
When vi is scalar the δi may be interpreted as rounding errors incurred in
evaluating vi = ϕi (ui ). Then we may assume δi = vi ηi with all | ηi | ≤ η, the
relative machine precision. Consequently, we obtain the approximate bound
| δ | ≈ | Σi=1...l v̄i vi ηi | ≤ η Σi=1...l | v̄i vi |. (3.12)
This relation has been used by many authors for the estimation of round-off
errors (Iri, Tsuchiya and Hoshi 1988). In particular, it led Seppo Linnainmaa
to the first publication of the reverse mode in English (Linnainmaa 1983).
The factor Σi=1...l | v̄i vi | has been extensively used by Stummel (1981) and oth-
ers as a condition estimate for the function evaluation procedure. Braconnier
and Langlois (2001) used the adjoints v̄i to compensate evaluation errors.
In the context of AD we mostly regard adjoints as vectors rather than ad-
joint equations. They represent gradients, or more generally Fréchet deriva-
tives, of the weighted final result ȳ y with respect to all kinds of intermediate

https://doi.org/10.1017/S0962492902000132 Published online by Cambridge University Press


340 A. Griewank

vectors that are generated in a well-defined hierarchical fashion. This pro-


cedural view differs strongly from the more global, equation-based concept
prevalent in the literature (see, e.g., Marcuk (1996)).
Viewing all ui as given constants in Table 3.2, this procedure may by
itself be identified with the form of Table 2.1. However, the roles of Pi and
Qi have been interchanged and the precedence relation ≺ must be reversed.
To interpret the combination of Tables 2.1 and 3.2 as a single evaluation
procedure, we introduce an isomorphic copy Ī of I, denoting the bijection
between elements of I and Ī by an overline:
i∈I ⇔ ı̄ ∈ Ī ⇒ ϕı̄ (ui , v̄i ) ≡ ϕ̄i (ui , v̄i ). (3.13)
Identifying the Hilbert space H and its subspaces with their duals, we ob-
tain the state space H × H for the combined procedure. Correspondingly
the argument selections and value projections can be defined such that the
computational graph for F̄ (x, ȳ) = ȳ F ′(x) has the following ordering.
Lemma 3.3. With ϕı̄ as defined in (3.13), the extension of the precedence
relation ≺ to I ∪ Ī is given by
ı̄ ≺ ̄ ⇔ j ≺ i ⇔ j ≺ ı̄.
Proof. The extension can be written as
(Qi , 0) for i ∈ I and Qı̄ ≡ ( Qi  0
                              0   PiT )    for ı̄ ∈ Ī
and
( Pi
  0  )   for i ∈ I and Pı̄ = ( 0
                              QiT )    for ı̄ ∈ Ī.
Hence we find, for the precedence between adjoint indices,
Qı̄ P̄ = PiT QjT = (Qj Pi )T ≠ 0 ⇔ ̄ ≺ ı̄ ⇔ j ≻ i.
For the precedence between direct and adjoint indices, we obtain
Qı̄ Pj = Qi Pj so that j ≺ ı̄ ⇔ j ≺ i.
Hence we see that the graph with the vertex set I ∪ Ī and the extended
precedence relation consists of two halves whose internal dependencies are
mirror images of each other. However, the connections between the two
halves are not symmetric, in that j ≺ ı̄ normally precludes rather than
implies the relation i ≺ ̄, as shown in Figure 3.1.
It is very important to notice that the adjoint operations are executed
in reverse order. As we can see from Tables 2.1 and 3.2, the evaluation
of the combined function [F (x), F̄ (x, ȳ)] requires exactly one evaluation of
the corresponding elemental combinations ϕi (ui ) and ϕ̄i (ui , v̄i ), though this
time the two must be evaluated separately, i.e., at different times. In any
case, we find that
OPS([F (x), F̄ (x, ȳ)]) / OPS(F (x)) ≤ max1≤i≤l [ OPS(ϕi (ui )) + OPS(ϕ̄i (ui , v̄i )) ] / OPS(ϕi (ui )) ≤ 3. (3.14)
The upper bound 2 on the relative cost of performing a single reverse opera-
tion can again be found by taking the maximum over all elemental functions,
which is actually obtained for a single multiplication,
v += ϕ(u, w) ≡ u ∗ w → [ū, w̄] += [v̄ ∗ w, v̄ ∗ u],
only counting multiplications as elsewhere. For the estimate (2.13) of par-
allel execution time, our assumptions now allow us to conclude that
CLOCK([F (x), F̄ (x, ȳ)]) ≤ 3 CLOCK(F (x)). (3.15)
This estimate holds because every critical path in I ∪ Ī must be the union
of two paths P ⊂ I and P̄ ⊂ Ī, whose complexity is bounded by one and
two times CLOCK(F (x)), respectively.

3.3. Symmetry of Hessian graphs


To make the extended graph fully symmetric we must be able to rewrite
ϕı̄ (ui , v̄i ) as ϕı̄ (vi , v̄i ) so that
Qı̄ ≡ ( PiT  0
       0    PiT )    and hence j ≺ ı̄ ⇔ PiT Pj ≠ 0 ⇔ i ≺ ̄.
In other words, we must pack all information needed for the adjoint op-
eration into the value of the original one. For certain elementals such as
v = ϕ(u) ≡ exp(u) this is very natural, for in that particular case ϕ′(u) = v.
For others like ϕ(u, w) = u ∗ w we would have to append the partials
∂ϕ/∂u = w and ∂ϕ/∂w = u to the result, and thus return the vector
value v = (u ∗ w, w, u). This is more or less what is done when a tangent
linear code is first developed and then the partials are used in the cotangent
linear, or adjoint code.
As an example we may again consider the function defined in (2.4). In
Table 3.3 we have listed on the left the combination of the forward and
reverse sweep without the inclusion of partials in the elemental values. On
the right we have listed the extended symmetrized version.
The corresponding computational graphs are depicted in Figure 3.1. With
the curved, dotted arcs included, we have the original version associated
with the left-hand side of Table 3.3. The dotted arcs can be replaced by
the direct, dashed arcs if the computational procedure is modified according
to the right-hand side of Table 3.3. The superscripts denote additional
components of the intermediate values, which contain the partial derivatives
needed on the way back. Apart from its aesthetic appeal, the symmetry
of the Hessian graph promises at least a halving of the operation count


Table 3.3. Adjoint procedure = forward sweep + reverse sweep.

Nonsymmetric adjoint                            Adjoint with symmetric graph

(v1 , v2 ) = (x1 , x2 )                          (v1 , v2 ) = (x1 , x2 )
v3 = exp(v1 )                                    v3 = exp(v1 )
v4 = sin(v1 +v2 )                                (v4(0) , v4(1) ) = (sin(v1 +v2 ), cos(v1 +v2 ))
v5 = v3 ∗ v4                                     (v5(0) , v5(1) , v5(2) ) = (v3 ∗ v4(0) , v4(0) , v3 )
y = v5                                           y = v5(0)

v̄5 = ȳ                                           v̄5(0) = ȳ
(v̄3 , v̄4 ) += v̄5 ∗ (v4 , v3 )                     (v̄3 , v̄4 ) += v̄5(0) ∗ (v5(1) , v5(2) )
(v̄1 , v̄2 ) += v̄4 ∗ cos(v1 +v2 ) ∗ (1, 1)          (v̄1 , v̄2 ) += v̄4 ∗ v4(1) ∗ (1, 1)
v̄1 += v̄3 ∗ exp(v1 )                               v̄1 += v̄3 ∗ v3
(x̄1 , x̄2 ) = (v̄1 , v̄2 )                           (x̄1 , x̄2 ) = (v̄1 , v̄2 )

[Figure: the computational graph of Figure 2.1 (vertices 1–5) mirrored by the adjoint vertices 5̄, 4̄, 3̄, 2̄, 1̄.]

Figure 3.1. Graph corresponding to left (· · · ) and right (- - -) part of Table 3.3.

and storage requirement for the accumulation task discussed in Section 6.


A different but related symmetrization of Hessian graphs was developed by
Dixon (1991).

3.4. The memory issue


The fact that gradients and other adjoint vectors x̄ = F̄ (x, ȳ) = ȳ F ′(x)
can, according to (3.14) and (3.15), be computed with essentially the same
operation count as the underlying y = F (x) is certainly impressive, and
possibly still a little bit surprising. However, its practical benefit is some-
times in doubt on account of its potentially large memory requirement. Our
no-overwrite conditions (2.7) and (2.8) mean that, except for incremental as-
signments, each elemental function requires new storage for its value. As we


Table 3.4. Direct/adjoint statement pairs.

Case     On forward sweep              On reverse sweep

(i)      vi −→ STACK                   vi ←− STACK
         vi = ϕi (ui )                 ūi += ϕ̄i (ui , v̄i )

(ii)     vi += ϕi (ui )                ūi += ϕ̄i (ui , v̄i )
                                       vi −= ϕi (ui )

(iii)    vi = ϕi (ui )                 ūi += ϕ̄i (vi , v̄i )
         vi −→ STACK                   vi ←− STACK

noted before, the composition (2.10) always represents a well-defined vector


function y = F (x), no matter how the projections Qi and Pi are defined,
provided the state v is initialized to zero. As a consequence the tangent
procedure listed in Table 3.1 can also be applied without any difficulties,
yielding consistent results.
Things are very different for the reverse mode. Here the ϕ̄i (ui , v̄i ) or
ϕ̄i (vi , v̄i ) require the old values of ui or vi from the time ϕi (ui ) itself was
originally evaluated. One simple way to ensure this is to save the vi = PiT v
on a value stack during the forward sweep, and then to recover them on the
way back just before or just after the form ϕ̄i (ui , v̄i ) or ϕ̄i (vi , v̄i ) is used,
respectively. Then we may modify the statements in the main loops of
Table 2.1 and Table 3.2 by one of the three pairs listed in Table 3.4.
The first case (i) describes a copy-on-write strategy for nonincremental
statements where all pre-values are saved on a stack just before they are
overwritten. On the reverse sweep the value is read back from the stack to
restore the state v(i−1) that was valid just before the original call to ϕi . No
extra storage is necessary in case (ii) of additively incremental statement
as the previous state can be restored by the corresponding decremental
operation. This mechanism allows, for example, the adjoining of an LU
factorization procedure with a stack whose size grows only quadratically
rather than cubically in the matrix dimension. This growth rate can be made
linear if we also choose to invert multiplicatively incremental operations like
v ∗= u and v /= u, provided we can ensure u ≠ 0, which is of course a
key point of pivoting in matrix factorization. In general, we face a choice
between restoration from a stack and various modes of recomputation. For
a more thorough discussion of these issues see Faure and Naumann (2001)
and Giering and Kaminski (2001b). The main difference between cases (i)
and (iii) is that, in the latter, the new value of vi is needed for the adjoint
ϕ̄(vi , v̄i ) so that restoration of the old value takes place afterwards.
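To make these statement pairs concrete, the following Python sketch (an added illustration with ad hoc names, not code from any of the tools cited in this article) implements the copy-on-write strategy (i) for a toy three-address program: the forward sweep pushes the pre-value of every target location onto a value STACK, and the reverse sweep pops it to restore the state v(i−1) before applying ūi += ϕ̄i(ui, v̄i). Case (ii) would push nothing at all, since the decrement vi −= ϕi(ui) undoes the increment.

```python
import math

# elemental functions and their partial derivatives with respect to the arguments
OPS = {'exp': (lambda u: math.exp(u[0]),  lambda u: [math.exp(u[0])]),
       'sin': (lambda u: math.sin(u[0]),  lambda u: [math.cos(u[0])]),
       'add': (lambda u: u[0] + u[1],     lambda u: [1.0, 1.0]),
       'mul': (lambda u: u[0] * u[1],     lambda u: [u[1], u[0]])}

def taped_adjoint(program, x, ybar):
    """program: list of (op, target, args) acting on numbered storage locations."""
    v = dict(enumerate(x))                    # locations 0 .. n-1 hold the independents
    stack = []                                # the value STACK (execution log)
    for op, tgt, args in program:             # ---- forward sweep ----
        stack.append(v.get(tgt))              # case (i): save the pre-value of the target
        v[tgt] = OPS[op][0]([v[j] for j in args])
    y = v[program[-1][1]]
    vbar = {k: 0.0 for k in v}                # ---- reverse sweep ----
    vbar[program[-1][1]] = ybar
    for op, tgt, args in reversed(program):
        pre = stack.pop()                     # vi <-- STACK: restore the state v(i-1) ...
        tbar, vbar[tgt] = vbar[tgt], 0.0      # ... and consume the adjoint of the result
        if pre is None:
            del v[tgt]
        else:
            v[tgt] = pre
        for j, g in zip(args, OPS[op][1]([v[j] for j in args])):
            vbar[j] += tbar * g               # u_bar += phi_bar(u, v_bar)
    return y, [vbar[i] for i in range(len(x))]

# The example of Table 3.3, y = exp(x1) * sin(x1 + x2), with location 4
# deliberately overwritten by the sin so that the STACK is actually exercised.
prog = [('exp', 2, (0,)), ('add', 4, (0, 1)), ('sin', 4, (4,)), ('mul', 5, (2, 4))]
print(taped_adjoint(prog, [0.3, 0.7], 1.0))
```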


If all elementals are treated according to (i) in Table 3.4, we obtain the
bound

    MEM(F̄(x, ȳ)) ≈ MEM(F) + Σ_{i=1}^{l} mi ∼ OPS(F(x)).        (3.16)

Here the proportionality relation assumes that OPS(ϕi ) ∼ mi ∼ MEM(ϕi ),


which is reasonable under most circumstances. The situation is depicted
schematically in Figure 3.2. The STACK of intermediates vi is sometimes
called a trajectory or execution log of the evaluation procedure.

[Figure: the source code for y = F(x) ∈ R^m is transformed by ∂-forward into object code evaluating y = F(x) and ẏ = F′(x)ẋ from (x, ẋ), and by ∂-reverse into object code evaluating x̄ = ȳF′(x) and x̄˙ = ȳF″(x)ẋ from (x, ȳ, ẋ); the two communicate intermediate values through the STACK.]

Figure 3.2. Practical execution of forward and reverse differentiation.

The situation is completely analogous to backward integration of the


costate in the optimal control of ordinary differential equations (Campbell,
Moore and Zhong 1994). More specifically, the problem
    min_{x ∈ L∞([0,T], R^n)} ψ(v(T))   such that   v(0) = v0 ∈ R^n        (3.17)

and

    (d/dt) v(t) = Φ(v(t), x(t))   for 0 ≤ t ≤ T                            (3.18)

has the adjoint system

    (d/dt) v̄(t) = −v̄(t) Φv(v(t), x(t)),                                    (3.19)
    x̄(t) = v̄(t) Φx(v(t), x(t)),                                            (3.20)

with the terminal condition

    v̄(T) = ∇ψ(v(T)).                                                        (3.21)


Here the trajectory v(t) is a continuous solution path, whose points enter
into the coefficients of the corresponding adjoint differential equation (3.19)
for v̄(t). Rather than storing the whole trajectory, we may instead store
only certain checkpoints and repeat the forward simulation in pieces on the
way back. This technique is the subject of the subsequent Section 4.
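For a concrete, if simplistic, picture of this analogy, the following Python sketch (an added illustration with a made-up right-hand side, not taken from the cited reference) integrates the state equation (3.18) forward with explicit Euler while recording the whole trajectory, and then integrates the costate equations (3.19)–(3.21) backwards along the stored states, which is the continuous-time counterpart of keeping the complete execution log.

```python
import math

def Phi(v, x):   return -v + math.sin(x)    # a made-up right-hand side of (3.18)
def Phi_v(v, x): return -1.0                # its partial derivative with respect to v
def Phi_x(v, x): return math.cos(x)         # its partial derivative with respect to x
def dpsi(v):     return v                   # gradient of the objective psi(v) = v^2/2

T, l = 1.0, 1000
h = T / l
x = [0.3] * l                               # a fixed control discretization
v = [1.0]                                   # v(0) = v0
for i in range(l):                          # forward sweep, storing every state
    v.append(v[i] + h * Phi(v[i], x[i]))

vbar = dpsi(v[l])                           # terminal condition (3.21)
xbar = [0.0] * l
for i in reversed(range(l)):                # reverse sweep along the stored trajectory
    xbar[i] = vbar * Phi_x(v[i], x[i])      # (3.20)
    vbar   += h * vbar * Phi_v(v[i], x[i])  # Euler step for (3.19), run backwards in time
print(v[l], vbar, xbar[0])
```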

3.5. Propagating vectors or sparsity patterns


Rather than just evaluating one tangent ẏ = Ḟ (x, ẋ) or one adjoint
x̄ = ȳF′(x), we frequently wish to evaluate several of them, or even the
whole Jacobian F′(x) simultaneously. While this bundling increases the
memory requirement, it means that the general overhead, and especially
the effort for evaluating the ϕi and their partial derivatives, may be bet-
ter amortized. With Ẋ ∈ X^p any set of p directions in X and t ∈ R^p
a p-dimensional parameter, we may redefine for each intermediate vi the
derivative vector as

    v̇i ≡ ∇t vi(x + Ẋ t)|_{t=0} ∈ H_i^p.

Then the tangent procedure Table 3.1 can be applied without any formal
change, though the individual products v̇i = ϕ̇i(ui, u̇i) = ϕ′i(ui) u̇i must now
be computed for u̇i ∈ D_i^p. The overall result of this vector forward mode is
the matrix

    Ẏ = Ḟ(x, Ẋ) ≡ F′(x)Ẋ ∈ Y^p.
In terms of complexity we can derive the upper bounds

    OPS(Ḟ(x, Ẋ)) ≤ (1 + 2p) OPS(F(x)) ≤ p OPS(Ḟ(x, ẋ)),        (3.22)
    MEM(Ḟ(x, Ẋ)) ≤ (1 + p) MEM(F(x)),

and

    CLOCK(Ḟ(x, Ẋ)) ≤ (1 + 2p) CLOCK(F(x)).
With a bit of optimism we could replace the factor 2 in (3.22) by 1, which
would make the forward mode as expensive as one-sided differences. In fact,
depending on the problem and the computing platform, the forward mode
tool ADIFOR 2.0 typically achieves a factor somewhere between 0.5 and 2.
Unfortunately, this cannot yet be said for reverse mode tools.
There we obtain, for given Ȳ ∈ (Y*)^q, the adjoint bundle

    X̄ = F̄(x, Ȳ) ≡ ȲF′(x) ∈ (X*)^q

at the temporal complexity

    OPS(F̄(x, Ȳ)) ≤ (1 + 2q) OPS(F(x)) ≤ q OPS(F̄(x, ȳ)),


and

    CLOCK(F̄(x, Ȳ)) ≤ (1 + 2q) CLOCK(F(x)).
Since the trajectory size is independent of the adjoint dimension q, we obtain
from (3.16) the spatial complexity
MEM(F̄ (x, Ȳ)) ∼ q MEM(F (x)) + OPS(F (x)) ∼ OPS(F (x)).
The most important application of the vector mode is probably the effi-
cient evaluation of sparse matrices by matrix compression. Here Ẋ ∈ X^p is
chosen as a seed matrix for a given sparsity pattern such that the resulting
compressed Jacobian F′(x)Ẋ allows the reconstruction of all nonzero entries
in F′(x). This technique apparently originated with the grouping proposal
of Curtis, Powell and Reid (1974), where each row of Ẋ contains exactly one
nonzero element and the p columns of Ẏ are approximated by the divided
differences
1   
F x + ε Ẋ ej − F (x) = Ẏej + 0(ε) for j = 1 . . . p.
ε
Here ej denotes the jth Cartesian basis vector in Rp . In AD the matrix
Ẏ is obtained with working accuracy, so that the conditioning of Ẋ is not
quite so critical. The reconstruction of F  (x) from F  (x)Ẋ relies on cer-
tain submatrices of Ẋ being nonsingular. In the CPR approach they are
permutations of identity matrices and thus ideally conditioned. However,
there is a price to pay, namely the number of columns p must be greater or
equal to the chromatic number of the column-incidence graph introduced by
Coleman and Moré (1984). Any such colouring number is bounded below
by n̂ ≤ n, the maximal number of nonzeros in any row of the Jacobian.
By a degree of freedom argument, we see immediately that F′(x) cannot
be reconstructed from F′(x)Ẋ if p < n̂, but p = n̂ suffices for almost all
choices of the seed matrix Ẋ. The gap between the chromatic number and
n̂ can be as large as n − n̂, as demonstrated by an example of Hossain and
Steihaug (1998). Whenever the gap is significant we should instead apply
dense seeds Ẋ ∈ X^n̂, which were proposed by Newsam and Ramsdell (1983).
Rather than using the seemingly simple choice of Vandermonde matrices, we
may prefer the much better conditioned Pascal or Lagrange seeds proposed
by Hossain and Steihaug (2002) and Griewank and Verma (2003), respec-
tively. In many applications sparsity can be enhanced by exploiting partial
separability, which sometimes even allows the efficient calculation of dense
gradients using the forward mode.
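As an illustration of this compression idea, here is a small Python sketch (my own simplification with a greedy grouping, not code from the cited references): columns of the Jacobian that never share a row are merged into one column of the seed matrix Ẋ, and each nonzero of F′(x) is then read off the compressed product, which in AD would be delivered by the vector forward mode.

```python
import numpy as np

def cpr_seed(pattern, n):
    """Greedy CPR grouping; pattern is a list of column-index sets, one per row."""
    group = [-1] * n
    for j in range(n):
        taken = {group[k] for row in pattern if j in row for k in row if k != j}
        g = 0
        while g in taken:
            g += 1
        group[j] = g
    Xdot = np.zeros((n, max(group) + 1))
    for j, g in enumerate(group):
        Xdot[j, g] = 1.0
    return Xdot, group

def recover(B, pattern, group):
    """Reconstruct the sparse Jacobian from the compressed product B = F'(x) Xdot."""
    J = np.zeros((len(pattern), len(group)))
    for i, row in enumerate(pattern):
        for j in row:
            J[i, j] = B[i, group[j]]   # j is the only column of its group in row i
    return J

# Tridiagonal example: three groups suffice regardless of the dimension n.
n = 6
pattern = [set(range(max(0, i - 1), min(n, i + 2))) for i in range(n)]
J = np.diag(np.arange(1.0, n + 1)) + np.diag(np.ones(n - 1), 1) + np.diag(2 * np.ones(n - 1), -1)
Xdot, group = cpr_seed(pattern, n)
B = J @ Xdot                            # in AD: the vector forward mode F'(x) Xdot
assert np.allclose(recover(B, pattern, group), J)
print(Xdot.shape[1], 'columns instead of', n)
```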
The compression techniques discussed above require the a priori knowl-
edge of the sparsity pattern, which may be rather complicated and thus
tedious for the user to supply. Then we may prefer to execute the for-
ward mode with p = n and the vi stored and manipulated as dynamically


sparse vectors (Bischof, Carle, Khademi and Mauer 1996). Excluding ex-
act cancellations, we may conclude that the operation count for computing
the whole Jacobians in this sparse vector mode is also bounded above by
(1 + 2 n̂)OPS(F (x)). Unfortunately this bound may not be a very good in-
dication of actual runtimes since the dynamic manipulation of sparse data
structures typically incurs a rather large overhead cost. Alternatively we
may propagate the sparsity pattern of the vi as bit patterns encoded in
n/32 integers (Giering and Kaminski 2001a). In this way the sparsity pat-
tern of F′(x) can be computed with about n/32 times the operation count of
F itself and very little overhead. By so-called probing algorithms (Griewank
and Mitev 2002) the cost factor n/32 can often be reduced to O(n̂ log n) for
a seemingly large class of sparse matrices.
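The bit-pattern variant is easily mimicked in a few lines; in the following sketch (an added illustration) Python integers play the role of the packed words, and the index sets are propagated through the example of Table 3.3 by bitwise union.

```python
def jacobian_pattern():
    x1, x2 = 1 << 0, 1 << 1     # seed patterns: bit j marks dependence on x_{j+1}
    v3 = x1                     # v3 = exp(v1)        depends on x1
    v4 = x1 | x2                # v4 = sin(v1 + v2)   depends on x1 and x2
    v5 = v3 | v4                # v5 = v3 * v4
    return [v5]                 # one row per dependent variable

for row in jacobian_pattern():
    print(format(row, '02b'))   # prints '11': y depends on both independents
```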
Throughout this subsection we have tacitly assumed that the sparsity
pattern of F′(x) is static, i.e., does not vary as a function of the evaluation
point x. If it does, we have to either recompute the pattern at each new
argument x or determine certain envelope patterns that are valid, at least
in a certain subregion of the domain. All the techniques we have discussed
here in their application to the Jacobian F′(x) can be applied analogously
to its transpose F′(x)^T using the reverse mode (Coleman and Verma 1996).
For certain matrices such as arrowheads, a combination of both modes is
called for.

3.6. Higher-order adjoints


In Figure 3.2 we have already indicated that a combination of forward and
reverse mode yields so-called second-order adjoints of the form

    x̄˙ ≡ F̄˙(x, ȳ, ẋ, ȳ˙) ≡ ȳF″(x)ẋ + ȳ˙F′(x) ∈ X = X*

for any given (x, ẋ) ∈ X² and (ȳ, ȳ˙) ∈ (Y*)². In Figure 3.2 the vector
ȳ˙ ∈ Y* was assumed to vanish, and the abbreviation

    F̄˙(x, ȳ, ẋ, 0) ≡ ȳF″(x)ẋ ≡ ∇x (ȳF′(x)ẋ) ∈ X*
is certainly stretching conventional matrix notation. To obtain an evaluation
procedure for F̄˙ we simply have to differentiate the combination of Table 2.1
and Table 3.2 once more in the forward mode. Consequently, composing
(3.6) and (3.14) we obtain the bound

    OPS(F̄˙(x, ȳ, ẋ, ȳ˙)) ≤ 3 OPS(F̄(x, ȳ)) ≤ 9 OPS(F(x)).

Though based on a simple-minded count of multiplicative operations, this


bound is, in our experience, not a bad estimate for actual runtimes.
In the context of optimization we may think of F ≡ (f, c) as the com-
bination of a scalar objective function f : X → R and a vector constraint


c : X → R^{m−1}. With ȳ a vector of Lagrange multipliers ȳi, the symmetric
matrix ȳF″(x) = Σ_{i=1}^{m} ȳi ∇²Fi is the Hessian of the Lagrangian function.
One second-order adjoint calculation then yields exactly the information
needed to perform one inner iteration step within a truncated Newton
method applied to the KKT system

    ȳF′(x) = 0,    c(x) = 0.
Consequently the cost of executing one inner iteration step is roughly ten
times that of evaluating F = (f, c). Note in particular that a procedure for
the evaluation of the whole constraint Jacobian ∇ c need not be developed
either automatically or by hand. Walther (2002) found that iteratively solv-
ing the linearized KKT system using exact second-order adjoints was much
more reliable and also more efficient than the same method based on divided
differences on the gradient of the Lagrangian function.
The brief description of second-order adjoints above begs the question of
their relation to a nested application of the reverse mode. The answer is that
x̄˙ = ȳF″(x)ẋ could also be computed in this way, but that the complication
of such a nested reversal yields no advantage whatever. To see this we
note first that, for a symmetric Jacobian F′(x) = F′(x)^T, we obviously
have ȳF′(x) = (F′(x) ẋ)^T if ẋ = ȳ^T. Hence the adjoint vector x̄ = ȳF′(x)
can be computed by forward differentiation if F′(x) is symmetric and thus
actually the Hessian ∇²ψ(x) of a scalar function ψ(x) with the gradient
∇ψ(x) = F(x). In other words, it never makes sense to apply the reverse
mode to vector functions that are gradients. On the other hand, applying
reverse differentiation to y = F(x) yields

    (F(x)^T, ȳF′(x))^T = ∇_{ȳ,x} (ȳF(x)).

This partitioned vector is the gradient of the Lagrangian ȳF(x), so that a
second differentiation may be carried out without loss of generality in the
forward mode, which is exactly what we have done above. The process
can be repeated to yield x̄¨ = ȳF‴(x)ẋẋ and so on by the higher forward
differentiation techniques described in Chapter 10 of Griewank (2000).
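The following Python sketch (an added illustration; the dual-number class and all names are ad hoc) carries out this forward-over-reverse combination for the scalar example of Table 3.3 with ȳ = 1: the adjoint procedure is simply executed in dual-number arithmetic, so that the value parts reproduce the gradient while the derivative parts deliver the second-order adjoint, here the Hessian–vector product ∇²f(x)ẋ.

```python
import math

class Dual:
    def __init__(self, val, dot=0.0): self.val, self.dot = val, dot
    def __add__(self, o): o = o if isinstance(o, Dual) else Dual(o); return Dual(self.val + o.val, self.dot + o.dot)
    __radd__ = __add__
    def __mul__(self, o): o = o if isinstance(o, Dual) else Dual(o); return Dual(self.val * o.val, self.dot * o.val + self.val * o.dot)
    __rmul__ = __mul__

def dexp(u): return Dual(math.exp(u.val), math.exp(u.val) * u.dot)
def dsin(u): return Dual(math.sin(u.val), math.cos(u.val) * u.dot)
def dcos(u): return Dual(math.cos(u.val), -math.sin(u.val) * u.dot)

def adjoint(x1, x2, ybar=1.0):
    """Forward + reverse sweep of Table 3.3 for y = exp(x1) * sin(x1 + x2)."""
    v3 = dexp(x1); v4 = dsin(x1 + x2); v5 = v3 * v4     # forward sweep
    v5b = ybar                                          # reverse sweep
    v3b, v4b = v5b * v4, v5b * v3
    v1b = v4b * dcos(x1 + x2) + v3b * v3                # exp'(v1) = v3
    v2b = v4b * dcos(x1 + x2)
    return v1b, v2b

xdot = (1.0, 0.0)                           # direction for the second-order adjoint
g1, g2 = adjoint(Dual(0.3, xdot[0]), Dual(0.7, xdot[1]))
print(g1.val, g2.val)                       # the gradient  ybar F'(x)
print(g1.dot, g2.dot)                       # the second-order adjoint  ybar F''(x) xdot
```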

3.7. Section summary


For functions y = F(x) evaluated by procedures of the kind specified in
Section 2, we have derived procedures for computing corresponding tan-
gents F′(x)ẋ and gradients ȳF′(x), whose composition yields second-order
adjoints ȳF″(x)ẋ. All these derivative vectors are obtained with a small
multiple of the operation count for y itself. Derivative matrices can be ob-
tained by bundling these vectors, where sparsity can be exploited through
matrix compression. By combining the well-known derivatives of elemental
functions using the chain rule, the derived procedures can be generated line


by line and, more generally, subroutine by subroutine. These simple obser-


vations form the core of AD, and everything else is questionable as to its
relevance to AD and scientific computing in general.

4. Serial and parallel reversal schedules


The key difficulty in adjoining large evaluation programs is their reversal
within a reasonable amount of memory. Performing the actual adjoint op-
erations is a simple task compared to (re)producing all intermediate results
in reverse order. The problem of reversing a program execution has re-
ceived some perfunctory attention in the computer science literature (see,
e.g., Bennett (1973) and van der Snepscheut (1993)). The first authors to
consider the problem from an AD point of view were apparently Volin and
Ostrovskii (1985) and later Horwedel (1991). The problem of nested reversal
that occurs in Pantoja’s algorithm (Pantoja 1988) for computing Newton
steps in discrete optimal control was first considered by Christianson (1999).
No optimal reversal schedule has yet been devised for this discrete variant of
the matrix Riccati equation. It might merely be a first example of computa-
tional schemes with an intricate, multi-directional information flow that can
be treated by the checkpointing techniques developed in AD. In this sur-
vey we will examine only the case of discrete evolutions of the form (2.11).
They may be interpreted as a chain of l subroutine calls. More generally
we may consider the reversal of calling trees with arbitrary structure, whose
depth in particular is a critical parameter for memory–runtime trade-offs
(see Section 12.2 in Griewank (2000)).
Throughout this section we suppose that the transition functions Φi change
all or at least most components of the state vector v(i−1) so that l now
represents the number of time-steps rather than elemental functions. Then
the basic form of the reverse mode requires the storage of l times the full state
vector v(i−1) of dimension N = dim(H). Since both l and N may be rather
large, their product may indicate an enormous memory requirement, which
can however be avoided in one of two ways. First, without any conditions
on the transition functions Φi we may apply checkpointing, as discussed in
this section. The simple idea is to store only a selection of the intermediate
state vectors on the first sweep through and then to recalculate the others
using repeated partial forward sweeps. These alternative trade-offs between
spatial and temporal complexities generate rather intricate serial or parallel
reversal schedules, which are discussed in this section. For certain choices
the resulting growth in memory and runtime or processor number is only
logarithmically dependent on l.
Second, the dependence on l can be avoided completely if (2.12) is not a
true iterative loop, but is instead either parallel or represents a contractive
fixed-point iteration. In both cases the individual transitions Φi commute


at least asymptotically and, as we will see in Section 5, they can be adjoined


individually without the need for a global execution log.

4.1. Binary schedules for serial reversals


First we consider a simple reversal strategy, which already exhibits certain
key characteristics, though it is never quite optimal. Suppose we have ex-
actly l = 16 = 2⁴ time-steps and enough memory to accommodate s = 4
states on a stack of checkpoints. On the first forward sweep we may then
store the 0th, 8th, 12th, and 14th intermediate states in the checkpoints c0 ,
c1 , c2 , and c3 , respectively. This process is displayed in Figure 4.1, where
the vertical axis represents physical time-steps and the horizontal axis com-
putational time-steps or cycles.
After advancing to state 15 and then reversing the last step, i.e., applying
Φ̄16 , we may then restart the calculation at the last checkpoint c3 containing
the 14th state, and immediately perform Φ̄15 . The execution of these adjoint
steps is represented by the little hooks at the end of each downward slanting
run. They are assumed to take twice as long as the forward steps. Horizontal
lines indicate checkpoint positions and slanted lines to the right represent
forward transition steps, i.e., applications of the Φi . After the successive
reversals Φ̄15 , Φ̄14 , and Φ̄13 , the last two checkpoints can be abandoned
and one of them subsequently overwritten by the 10th state, which can
be reached by two steps from the 8th state, which resides in the second
checkpoint.
Throughout we will let REPS(l) denote the repetition count, i.e., the total
number of repeated forward steps needed by some reversal schedule. Rather
than formalizing the concept of a reversal schedule as a sequence of certain

Figure 4.1. Binary serial reversal schedule for l = 16 = 2⁴ and s = 4.


actions, we recommend thinking of them in terms of their graphical repre-


sentation as depicted here in Figures 4.1 and 4.2. For a binary reversal of
an even number 2l of time-steps we obtain the recurrence

    REPS(2l) = 2 REPS(l) + l   with   REPS(1) = 0.

Here l represents the cost of advancing to the middle and 2 REPS(l) the cost
of reversing the two halves. Consequently, we derive quite easily that, for
all binary powers l = 2^p,

    REPS(l) = l p/2 = (l log₂ l)/2.
For l = 16 we obtain a total of 16 · 4/2 = 32 simple forward steps plus 16
adjoint steps, which adds up to a total runtime of 32 + 2 · 16 = 64 cycles,
as depicted in Figure 4.1. The profile on the right margin indicates how
often each one of the physical steps is repeated, a number r that varies
here between 0 and 4. The total operation count for evaluating the adjoint
F̄(x, ȳ) is bounded by

    REPS(l) OPS(Φi) + l OPS(Φ̄i) ≤ (REPS(l)/l + 3) OPS(F),
where the last inequality follows from the application of (3.14) to a single
transition Φi . We also need p = log2 l checkpoints, which is the maximal
number of horizontal lines intersecting any imaginary vertical in Figure 4.1.
Hence, both temporal and spatial complexity grow essentially by the same
factor, namely,
    2 OPS(F̄(x, ȳ)) / OPS(F(x)) ≈ log₂ l ≈ MEM(F̄(x, ȳ)) / MEM(F(x)).
This common logarithmic factor must be compared with the ratios of 3 and
l obtained for the operation count and memory requirement in basic reverse
mode. Thus we have a drastic reduction in the memory requirement at the
cost of a rather moderate increase in the operation count. On a sizable
problem and a machine with a hierarchical memory the increase may not
necessarily translate into an increased runtime, as data transfers to remote
regions of memory can be avoided. Alternatively we may keep the wall clock
time to its minimal value by performing the repeated forward sweeps using
auxiliary processes on a parallel machine.
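The recursive structure of the binary schedule can be written down in a few lines. The following Python sketch (an added illustration with placeholder step functions, not code from any published tool) reproduces the counts quoted above for l = 16; the recursion depth, log₂ l, corresponds to the number of checkpoints held at any one time.

```python
counters = {'forward': 0, 'adjoint': 0}

def advance(state):                  # placeholder for one transition Phi_i
    counters['forward'] += 1
    return state + 1                 # here the 'state' is just the step index

def adjoint_step(i):                 # placeholder for one adjoint step Phi_bar_{i+1}
    counters['adjoint'] += 1

def binary_reverse(state, lo, hi):
    """Reverse the steps of [lo, hi) given the checkpointed state at lo."""
    if hi - lo == 1:
        adjoint_step(lo)             # the state at lo is available, apply Phi_bar_{lo+1}
        return
    mid = (lo + hi) // 2
    s = state
    for _ in range(lo, mid):         # advance from the checkpoint at lo to the midpoint
        s = advance(s)
    binary_reverse(s, mid, hi)       # reverse the right half first
    binary_reverse(state, lo, mid)   # then the left half, reusing the checkpoint at lo

binary_reverse(0, 0, 16)
print(counters, counters['forward'] + 2 * counters['adjoint'], 'cycles')
# {'forward': 32, 'adjoint': 16} 64 cycles, as in Figure 4.1
```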
For example, if we have log2 l processors as well as checkpoints we may
execute the binary schedule depicted in Figure 4.2 for l = 32 = 2⁵ in 64
cycles. We call such schedules time-minimal because both the initial for-
ward and the subsequent reverse sweep proceed uninterrupted, requiring a
total of 2l computational cycles. Here the maximal number of slanted lines
intersecting an imaginary vertical line gives the number of processors re-
quired. There is a natural association between the checkpoint ci and the


Figure 4.2. Binary parallel reversal schedule for l = 32 = 2⁵.

processor pi , which keeps restarting from the same state saved in ci . For
the optimal Fibonacci schedules considered below the task assignment for
the individual processors is considerably more complicated. For the binary
schedule executed in parallel we now have the complexity estimate
    CLOCK(F̄(x, ȳ)) / CLOCK(F(x)) ≈ 2   and   PROCS(F̄(x, ȳ)) / PROCS(F(x)) ≈ log₂ l.
Here the original function evaluation may be carried out in parallel so that
PROCS(F (x)) > 1, but then log2 l times as many processors must be available
for the reversal with minimal wall clock time. If not quite as many are
available, we may of course compress the sequential schedule somewhat along
the computing axis without reaching the absolute minimal wall clock time.
Such hybrid schemes will not be elaborated here.

4.2. Binomial schedules for serial reversal


While elegant in its simplicity, the binary schedule (like other schemes that
recursively partition the remaining simulation lengths by a fixed propor-
tion) is not optimal in either the serial or parallel context. More specifically,
given memory for log2 l checkpoints, we can construct serial reversal sched-
ules that involve significantly fewer than l log2 l/2 repeated forward steps or
time-minimal parallel reversal schedules that get by with fewer than log2 l
processors. The binary schedules have another major drawback, namely
they cannot be implemented online for cases where the total number of steps
to be taken is not known a priori . This situation may arise, for example,
when a differential equation is integrated with adaptive step-size, even if the
period of integration is fixed beforehand. Optimal online schedules have so
far only been developed for multi-processors.


The construction of serially optimal schedules is possible by variations


of dynamic programming. The key observation for this decomposition into
smaller subschedules is that checkpoints are persistent, in that they exist
until all later steps to the right of them have been reversed. In Figures 4.1
and 4.2 this means graphically that all horizontal lines continue until they
(almost) reach the upward slanting line representing the step reversal.
Lemma 4.1. (Checkpoint persistence) Any reversal schedule can be
modified without a reduction of its length l or an increase in its repetition
count such that, once established at a state j, a checkpoint stays fixed and is
not overwritten until the (j + 1)st step has been reversed, i.e., Φ̄j+1 applied.
Moreover, during the ‘life-span’ of the checkpoint at state j, all actions occur
to the right, i.e., concern only states k ≥ j.
This checkpoint persistence principle is easy to prove (Griewank 2000)
under the crucial assumption that all state vectors v(i) are indeed of the
same dimension and thus require the same amount of storage. While this
assumption could be violated due to the use of adaptive grids, it would be
hard to imagine that the variations in size could be so large that taking
the upper bound of the checkpoint size as a uniform measure would result
in very significant inefficiencies. Another uniformity assumption that we
will use, but which can be quite easily relaxed, is that the computational
effort for all individual transitions Φi is the same. Thus the optimal reversal
schedules depend only on the total number l of steps to be taken.
In Figure 4.1 we see that the setting of the second checkpoint c1 effectively
splits the reversal into two subproblems. After storing the initial state into
c0 and then advancing to state eight, the remaining eight physical steps are
reversed using just the three checkpoints c1 , c2 and c3 . Subsequently the
first eight steps are reversed also using only three checkpoints, namely c0 ,
c1 , c2 , although c3 is again available. In fact, it would be better to set c1
at a state ˇl that solves, for s = 4 and l = 16, the dynamic programming
problem
    REPS(s, l) ≡ min_{1<ľ<l} { ľ + REPS(s − 1, l − ľ) + REPS(s, ľ) }.        (4.1)

Here REPS(s, l) denotes the minimal repetition count for a reversal of l steps
using just s checkpoints. The three terms on the right represent the effort of,
respectively, advancing ˇl steps, reversing the right subchain [ˇl . . . l] using s−1
checkpoints, and finally reversing the left subchain [0 . . . ˇl], again using all s
checkpoints. To derive a nearly explicit formula for the function REPS(s, l),
we will consider values
lr (s, l) ≤ l for r ≥ 0 < s.
They are defined as the number of steps i = 1 . . . l that are repeated at most r


Figure 4.3. Reversal with a single checkpoint (s = 1, l = 6).

times, maximized over all reversals on the range [0 . . . l]. By definition, the
lr (s, l) are nondecreasing as functions of r, such that lr+1 (s, l) ≥ lr (s, l).
Moreover, as it turns out, these numbers have maxima
    lr(s) ≡ max_{l≥0} lr(s, l),

which are attained for all sufficiently large l.


For s = 1 this is easy to see. As there is just one checkpoint it must be set
to the initial state, and the reversal schedule takes the trivial form depicted
in Figure 4.3 for the case l = 6. Hence we see there is exactly one step that
is never repeated, one that is repeated once, twice, and so on, so that
lr (1) = r + 1 = lr (1, l) for all 0 ≤ r < l. (4.2)
On the other hand, in any reversal schedule all but the final step must be
repeated at least once, so that we have
l0 (s) = 1 = l0 (s, l) for all 0 ≤ s < l. (4.3)
The initial values (4.2) and (4.3) for s = 1 or r = 0 facilitate the proof of the
binomial formula below by induction. The other assertions were originally
established by Grimm, Pottier and Rostaing-Schmidt (1996).
Proposition 4.2.
(i) Lemma 4.1 implies that, for any reversal schedule,
lr (s) ≤ β(s, r) ≡ (s + r)!/(s! r!), (4.4)
where l is the chain length and s the number of checkpoints.
(ii) The resulting repetition count is bounded below by
REPS(s, l) ≥ l rmax − β(s + 1, rmax − 1) (4.5)
where rmax ≡ rmax (s, l) is uniquely defined by
β(s, rmax − 1) < l ≤ β(s, rmax ).


(iii) REPS(s, l) does attain its lower bound (4.5), with (4.4) holding as equal-
ity for all 0 ≤ r < rmax , and consequently
    l_{rmax}(s, l) = l − β(s, rmax − 1) = l − Σ_{r=0}^{rmax−1} lr(s, l).

Proof. It follows from checkpoint persistence that


    lr(s, l) ≤ max_{1<ľ<l} { l_{r−1}(s, ľ) + lr(s − 1, l − ľ) },

because all steps in the left half [0 . . . ˇl] get repeated one more time during
the initial advance to ˇl. Now, by taking the maximum over l and ˇl < l, we
obtain the recursive bound
lr (s) ≤ lr−1 (s) + lr (s − 1).
Together with the initial conditions (4.2) and (4.3), we immediately arrive
at the binomial bound
lr (s) ≤ β(s, r) ≡ (s + r)!/(s! r!),
which establishes (i). To prove the remaining assertions let us suppose we
have a given chain length l and some serial reversal schedule using s check-
points that repeats ∆˜lr ≥ 0 steps exactly r times for r = 0, 1 . . . l. Looking
for the most efficient schedule, we then obtain the following constrained
minimization problem:

    Min  Σ_{r=0}^{∞} r ∆l̃r   such that   l = Σ_{r=0}^{∞} ∆l̃r   and

    l̃r ≡ Σ_{i=0}^{r} ∆l̃i ≤ lr(s)   for all r ≥ 0.        (4.6)

By replacing the right sides of the inequality constraints by their upper


bounds β(s, r), we relax the problem so that the minimal value can only go
down. Also relaxing the integrality constraint on the variables ∆˜lr for r ≥ 0
we obtain a standard LP whose minimum is obtained at
    ∆l̃r = β(s, r) − β(s, r − 1) = β(s − 1, r)   for 0 ≤ r < rmax   and
    ∆l̃_{rmax} = l − β(s, rmax − 1) ≥ 0   where   β(s, rmax − 1) < l ≤ β(s, rmax).
This solution is integral and the resulting lower bound on the cost is given by
    REPS(s, l) ≥ (l − β(s, rmax − 1)) rmax + Σ_{j=0}^{rmax−1} j β(s − 1, j)
                = rmax l − β(s + 1, rmax − 1) ≤ rmax l,


where the equation follows from standard binomial identities (Knuth 1973).
That the lower bound REPS(s, l) is actually attained is established by the
following construction of suitable schedules by recursion.
When s = 1 we can only use the schedule displayed in Figure 4.3, where
∆l̃r = 1 for all 0 ≤ r < l so that rmax = l − 1 and REPS = l (l − 1) −
β(2, rmax − 1) = l (l − 1)/2. When l = 1 no step needs to be repeated at all,
so that ∆˜l0 = 1, rmax = 0 and thus REPS = 0 = β(s + 1, rmax − 1). As an
induction hypothesis we suppose that a binomial schedule satisfying (4.4)
and (4.5) as equalities exists for all pairs s̃ ≥ 1 ≤ ˜l that precede (s, l) in the
lexicographic ordering. This induction hypothesis is trivially true for s = 1
and l = 1. There is a unique value rmax such that
β(s, rmax − 2) + β(s − 1, rmax − 1) = β(s, rmax − 1) < l (4.7)
and l ≤ β(s, rmax ) = β(s, rmax − 1) + β(s − 1, rmax ).

Then we can clearly partition l into ˇl and l − ˇl such that


β(s, rmax − 2) < ˇl ≤ β(s, rmax − 1) (4.8)
and β(s − 1, rmax − 1) < l − ˇl ≤ β(s − 1, rmax ), (4.9)
which means that the induction hypothesis is satisfied for the two subranges
[0 . . . ˇl] and [ˇl . . . l]. This concludes the proof by induction.

Figure 4.4 displays the incremental counts


∆lr (s) = lr (s) − lr−1 (s) ≈ β(s − 1, r),
where the approximation holds as equality when lr (s) and lr−1 (s) achieve
their maximal values β(s, r) and β(s, r − 1). The solid line represents an
optimal schedule and the dashed line some other schedule for the given
parameters s = 3 and l = 68. The area under the graphs is the same, as it
must equal the number of steps, 68.
An optimal schedule for l = 16 and s = 4 is displayed in Figure 4.5.
It achieves with rmax = 3 the minimal repetition count REPS(4, 16) = 3 ·
16 − β(5, 2) = 48 − 21 = 27, which yields together with 16 adjoint steps
a total number of 59 cycles. This compares with a total of 65 cycles for
the binary schedule depicted in Figure 4.1, where the repetition count is 33
and the distribution amongst the physical steps less even. Of course, the
discrepancy is larger on bigger problems.
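The dynamic program (4.1) and the closed form of Proposition 4.2 are easy to check numerically. The following sketch (an added illustration, not the revolve routine mentioned below) does so for a few small cases; in particular it reproduces the value 27 for l = 16 and s = 4.

```python
from functools import lru_cache
from math import comb

def beta(s, r):
    return comb(s + r, r) if r >= 0 else 0       # beta(s, r) = (s + r)! / (s! r!)

@lru_cache(maxsize=None)
def reps(s, l):
    """Minimal number of forward steps to reverse l steps with s checkpoints."""
    if l == 1:
        return 0
    if s == 1:
        return l * (l - 1) // 2                  # the schedule of Figure 4.3
    return min(lcheck + reps(s - 1, l - lcheck) + reps(s, lcheck)
               for lcheck in range(1, l))

def reps_formula(s, l):
    r = 0
    while beta(s, r) < l:                        # rmax: beta(s, rmax-1) < l <= beta(s, rmax)
        r += 1
    return l * r - beta(s + 1, r - 1)

for s in (2, 3, 4):
    for l in (2, 5, 16, 30):
        assert reps(s, l) == reps_formula(s, l)
print(reps(4, 16))                               # 27, as quoted for Figure 4.5
```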
The formula REPS(s, l) = l rmax − β(s + 1, rmax − 1) shows that the num-
ber rmax , which is defined as the maximal number of times that any step
is repeated, gives a very good estimate of the temporal complexity growth
between F̄ (x, ȳ) and F (x), given s checkpoints. The resulting complexity
is piecewise linear, as displayed in Figure 4.6. Not surprisingly, the more

Figure 4.4. Repetition levels for s = 3 and l = 68.

Figure 4.5. Binomial serial reversal schedule for l = 16 and s = 4.

Figure 4.6. Repetition count for chain length l and checkpoint number s.


checkpoints s we have at our disposal the slower the cost grows. Asymptot-
ically it follows from β(s, rmax ) ≈ l by Stirling’s formula that, for fixed s,
the approximate complexity growth is
    rmax ∼ (s/e) l^{1/s}   and   REPS(s, l) ∼ (s/e) l^{1+1/s},        (4.10)
where e is Euler’s number.
A slight complication arises when the total number of steps is not known
a priori , so that a set of optimal checkpoint positions cannot be determined
beforehand. Rather than performing an initial sweep just for the purpose of
determining l, we may construct online methods that release earlier check-
points whenever the simulation continues for longer than expected. Stern-
berg (2002) found that the increase of the repeated step numbers compared
with the (in hindsight) optimal schemes is rarely more than 5%.
A key advantage of the binomial reversal schedule compared with the
simpler binary one is that many steps are repeated exactly rmax (s, l) times.
Since none of them are repeated more often than r = rmax times, we obtain
the complexity bounds
    OPS(F̄(x, ȳ)) / OPS(F(x)) ≤ 3 + r   and   MEM(F̄(x, ȳ)) / MEM(F(x)) ≤ 3 + s        (4.11)
provided l ≤ β(s, r) ≡ (s + r)!/(s! r!).

[Plot of runtime ratio against #Checkpoints*100/#Timesteps for three discretizations: 2113, 545 and 145 velocity nodes.]

Figure 4.7. Runtime behaviour, 1000 time-steps.


The binomial reversal schedules were applied in an oceanographic study by


Restrepo, Leaf and Griewank (1998), have been implemented in the software
routine revolve (Griewank and Walther 2000), and are used in the Hilbert
class library (Gockenbach, Petro and Symes 1999). Figure 4.7 reports the
experimental runtime ratios on a linearized Navier–Stokes equation for the
driven cavity problem. The problem was discretized using Taylor–Hood
elements on 2113 velocity nodes and 545 pressure nodes. We need 38 kbytes
to store one state: for details see Hinze and Walther (2002). Over 1000
time-steps the memory requirement for the basic approach of storing all
intermediate states would be 38 Mbytes.
While this amount of storage may appear manageable it turns out that
it can be reduced by a factor of about a hundred without any noticeable
increase in runtime despite a growth factor of rmax = 8 ≈ (10/e)(1000)^{1/10} in
the repetition count. The total operation count ratio is somewhat smaller
because of the constant effort for adjoint steps. By inspection of Figure 4.7
we also note that the runtime ratios are lower for coarser discretizations
and thus smaller state space dimensions. This confirms the notion that the
amount of data transfer from and to various parts of the memory hierarchy
has become critical for the execution times.
The bound (4.11) is valid even when the assumption that OPS(Φi ) is the
same for all i is not true. Only when these individual step costs vary by
orders of magnitude is it worthwhile to minimize the temporal complexity
of F̄ (x, ȳ) more carefully. The task of finding optimal reversal schedules
for sequences of steps with nonuniform step costs is quite similar to the
construction of optimal binary search trees considered by Knuth (1973). As
in that setting, exact solutions can be computed with an effort of order l2
and nearly optimal ones with an effort of order l log2 l.
The binomial reversal schedules can be generalized to the multi-step sit-
uation where
v(i) = Φi (v(i−1) , v(i−2) , . . . , v(i−q) ) for some q > 1. (4.12)
Then checkpoints consist of q consecutive states [i − q, i) that are needed
to advance repeatedly towards i and beyond. Under the critical uniformity
assumption MEM(Φi ) = MEM(Φl ) for i = 1 . . . l, we can show that the check-
point persistence principle still applies, so that dynamic programming is also
applicable. More specifically, according to Theorem 3.1 in Walther (1999),
we have for the minimal repetition count REPS(s, l, q)
    lim_{l→∞} REPS(s, l, q) / l^{1+q/s} = ((s/q)!)^{q/s} / q ≈ s / (e q^{1+q/s}).        (4.13)
Exactly the same asymptotic complexity can be achieved by the one-step
checkpointing schemes discussed above if they are adapted in the following
way. Suppose the l original time-steps are interpreted as l/q mega-time-


steps between mega-states comprising q consecutive states [qi − q, qi) for i =


1 . . . l/q. Here we may have to increase l formally to the next integer multiple
of q. While the number of time-steps is thus divided by q, the complexity
of a mega-step is of course q times that of a normal step. Moreover, since
the dimension of the state space has also grown q-fold, we have to assume
that the number of checkpoints s is divisible by q so that we can now keep
up to s/q mega-states in memory. Then, replacing REPS by REPS/q, l by l/q
and s by s/q in (4.10) yields asymptotically the right-hand side of (4.13).
Hence we may conclude that directly exploiting multi-step structure is only
worthwhile when l/q is comparatively small. Otherwise, interpreting multi-
step evolutions as one-step evolutions on mega-states yields nearly optimal
reversal results.

4.3. Fibonacci schedules for parallel reversals


Just as in the serial case the binary schedules discussed in Section 4.1 are
also quite good but not really optimal in the parallel scenario. Moreover,
they cannot be conveniently updated when l changes in the bidirectional
simulation scenario discussed below. First let us briefly consider the scenario
which is depicted for l = 8 in Figure 4.8.
By inspection, the maximal numbers of horizontal and slanted lines in-
tersecting any vertical line are 2 and 3, respectively. This can be executed
using 2 checkpoints and 3 processors. In contrast, the binary schedule for
that problem defined by the lowest quarter of Figure 4.2 requires 3 check-
points and processors each. On larger problems the gap is considerably
bigger, with the numbers of checkpoints and processors each being reduced
by the factor 1/log₂[(1 + √5)/2] ≈ 1.44.
The construction of optimal schedules is again recursive, though no de-
composition into completely separate subtasks is possible. For our example
the optimal position for setting checkpoint c1 is state 5, so that we have
two subschedules of length 5 and 3, represented by the two shaded trian-
gles in the second layer of Figure 4.8 from the top. However, they must be
performed at least in parts simultaneously, with computing resources being
passed from the bottom left triangle to the top right triangle in a rather
intricate manner.
As it turns out, it is best to count both checkpoints and processors to-
gether as resources, whose number is given by ρ. Obviously, a processor
can always transmute into a checkpoint by just standing still. The maximal
length of any chain that can be reverted without any interruption using ρ
resources will be denoted by l_ρ. The resources in use during any cycle are
depicted in the histograms at the right-hand side of Figure 4.8. These re-
source profiles are partitioned into the subprofiles corresponding to the two
subschedules, and the shaded area caused by additional activities linking


[Five stacked panels showing the schedule for l = 8 and its recursive decomposition into subschedules, each with its resource profile on the right.]

Figure 4.8. Optimal parallel reversal schedule with recursive decomposition
into subschedules and corresponding resource profiles.


the two. At first, one checkpoint keeps the initial state, and one processor
advances along the diagonals until it kicks off the bottom left subschedule,
which itself needs at first two resources, and then for a brief period of the
two cycles 8 and 9 it requires three. In cycle 10 that demand drops again
to two, and one resource can be released to the top right subschedule. As
we can see, the subprofiles fit nicely together and we always obtain a grad-
ual increase toward the vertex cycles l and l + 1, and subsequently a gentle
decline.
Each subschedule is decomposed again into smaller schedules of sizes 3,
2 and ultimately 1. The fact that the schedule sizes (1, 2, 3, 5, and 8) are
Fibonacci numbers is not coincidental. A rigorous derivation would require
too much space, but we give the following argument, whose key observations
were proved by Walther (1999).

• Not only checkpoints but also processors persist, i.e., run without in-
terruption until they reach the step to be reversed immediately (graph-
ically that means that there are no hooks in Figures 4.2 and 4.8).
• The first checkpoint position ľ partitions the original range [0 . . . l] into
two subranges [0 . . . ľ] and [ľ . . . l] such that, if l = l_ρ is maximal with
respect to a given number ρ of resources, so are the two parts

    ľ = l_{ρ−1}   and   l − ľ = l_{ρ−2} < l_{ρ−1}.

As a consequence of these observations, we have

    l_ρ = l_{ρ−1} + l_{ρ−2}   starting with   l_ρ = ρ for ρ ≤ 3,

where the initial values l₁ and l₂ are obtained by the schedules depicted in
the right corner of Figure 4.8.
Hence we obtain the Fibonacci numbers shifted in index by one, so that,
according to the usual representation,

    l_ρ = round( ((1 + √5)/2)^{ρ+1} / √5 )   for ρ = 1, 2, . . . .

Conversely, the number ρ of resources needed for given l is approximately

    ρ ≈ log_{(1+√5)/2} l ≈ 0.7202 (2 log₂ l).
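A few lines suffice to tabulate this relation (an added sketch, not from the paper): starting from l₁ = 1 and l₂ = 2, the Fibonacci recurrence yields the smallest ρ for which a given chain length can be reversed without interruption.

```python
def resources_needed(l):
    lo, hi, rho = 1, 2, 2          # l_1 = 1, l_2 = 2
    while hi < l:                  # grow l_rho = l_{rho-1} + l_{rho-2}
        lo, hi, rho = hi, lo + hi, rho + 1
    return rho                     # smallest rho with l_rho >= l

print([resources_needed(l) for l in (8, 34, 55, 10**6)])   # [5, 8, 9, 30]
```

For l = 8 this gives ρ = 5, in agreement with the two checkpoints and three processors of Figure 4.8.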

As was shown by Walther (1999), of the resources needed by the Fibonacci


schedule at most one more than half must be processors. Consequently,
compared with the binary schedule the optimal parallel schedule achieves
a reduction in both resources by about 28%. While this gain is not re-
ally enormous, the difference between the binary and Fibonacci reversals
becomes even more pronounced if we consider the scenario where l is not
known a priori .


Imagine we have a simulation that runs forward in time with a fair bit
of diffusion or energy dissipation, so that we cannot integrate the model
equations backwards. In order to be able to run the ‘movie’ backwards
nevertheless, we take checkpoints at selected intermediate states and then
repeat partial forward runs using auxiliary processors. The reversal is sup-
posed to start instantaneously and to run at exactly the same speed as the
forward simulation. Moreover, the user can switch directions at any time.
As it turns out, the Fibonacci schedules can be updated adaptively so that
ρ resources, with roughly half of them processors, suffice as long as l does
not exceed l_ρ. When that happens, one extra resource must be added so
that ρ += 1.
An example using maximally 5 processors and 3 checkpoints is depicted
in Figure 4.9. First the simulation is advanced from state 0 to state 48 <
55 = l9 . In the 48th transition step 6 checkpoints have been set and 2
processors are running. Then the ‘movie’ is run backwards at exactly the
same speed until state 32 < 34 = l8 , at which point 5 processors are running
and 3 checkpoints are in memory. Then we advance again until state 40,
reverse once more and so on. The key property of the Fibonacci numbers
used in these bidirectional simulation schedules is that, by the removal of a
checkpoint between gaps whose size equals neighbouring Fibonacci numbers,
the size of the new gap equals the next-larger Fibonacci number.

Figure 4.9. Bidirectional simulation using up to 9 processors and checkpoints.

4.4. Pantoja’s algorithm and the Riccati equation


To compute a Newton step ∆x(t) of the control x(t) for the optimality
condition x̄(t) = 0 associated with (3.17)–(3.21), we have to solve a linear–
quadratic regulator problem that approximates the given nonlinear control
problem along the current trajectory. Its solution requires the integration
of the matrix Riccati equation (Maurer and Augustin 2001, Caillau and


Noailles 2001), backward from a terminal condition involving the objective


Hessian ∇2 ψ. After this intermediate, reverse sweep, a third, forward sweep
yields the control correction ∆x(t). Pantoja’s algorithm (Pantoja 1988)
may be viewed as a discrete variant of these coupled ODEs (Dunn and
Bertsekas 1989). It was originally derived as a clever way of computing
the Newton step on the equality-constrained finite-dimensional problem ob-
tained by discretizing (3.17)–(3.21). Here, we are interested primarily in
the dimensions of the state spaces for the various evaluation sweeps and the
information that flows between them. Assuming that the control dimen-
sion p = dim(x) is much smaller than the original state space dimension
q = dim(v), we arrive at the scenario sketched in Figure 4.10. Here B(t) is
a p × q matrix path that must be communicated from the intermediate to
the third sweep.

Figure 4.10. Nested reversal for Riccati/Pantoja computation of Newton step.

The horizontal lines represent information flow between the three sweeps
that are represented by slanted lines. The two ellipses across the initial
and intermediate sweep represent checkpoints, and are annotated by the
dominant dimension of the states that need to be saved for a restart on
that sweep. Hence we have thin checkpoints of size q on the initial, forward
sweep, and thick checkpoints of size q² on the intermediate, reverse sweep.
A simple-minded ‘record all’ approach would thus require memory of order
l q², where l is now the number of discrete time-steps between 0 and T. The
intermediate sweep is likely to be by far the most computationally expensive,
as it involves matrix products and factorizations.
The final sweep again runs forward in time and again has a state dimension
q + p, which will be advanced in spurts as information from the intermediate
sweep becomes available. Christianson (1999) has suggested a two-level
checkpointing scheme, which reduces the total storage from order q² l for a
simple log-all approach to q² √l. Here checkpoints are placed every √l time-
steps along the initial and the intermediate sweep. The subproblems of
length √l are then treated with complete logging of all intermediate states.
More sophisticated serial and parallel reversal schedules are currently under


development and will certainly achieve a merely logarithmic dependence of


the temporal and spatial complexity on l.

4.5. Section summary


 
The basic reverse mode makes both the operation count OPS(F̄(x, ȳ)) and
the memory requirement MEM(F̄(x, ȳ)) for adjoints proportional to the oper-
ation count OPS(F(x)) ≥ MEM(F(x)). By serial checkpointing on explicitly
time-dependent problems, both ratios

    OPS(F̄(x, ȳ)) / OPS(F(x))   and   MEM(F̄(x, ȳ)) / MEM(F(x))

can be made proportional to the logarithm of the number of time-steps,
which equals roughly OPS(F(x)) / MEM(F(x)). A similar logarithmic number
of checkpoints and processors suffices to maintain the minimal wall clock
ratio

    CLOCK(F̄(x, ȳ)) / CLOCK(F(x)) ∈ [1, 2]

that is obtained theoretically when all intermediate results can be stored. In
practice the massive data movements may well lead to significant slowdown.
Parallel reversal schedules can be implemented online so that, at any point
in time, the direction of simulation can be reversed without any delay and
repeatedly.

5. Differentiating iterative solvers


Throughout the previous sections we implicitly made the crucial assumption
that the trajectory traced out by the forward simulation actually matters.
Otherwise the taking of checkpoints, and multiple restarts, would not be
worth the trouble at all.
Many large-scale computations involve pseudo-time-stepping for solving
systems of equations. The resulting solution vector can then be used to
evaluate certain performance measures whose adjoints might be required for
design optimization. In that case, a mechanical application of the adjoining
process developed in the previous sections may still yield reasonably accu-
rate reduced gradients. However, the accuracy achieved is nearly impossible
to control and the cost is unnecessarily high, especially in terms of mem-
ory space. With regard to derivative accuracy, we should first recall that
the derivatives obtained in the forward mode are, up to rounding errors,
identical to the ones obtained in the (mechanical) reverse mode, while the
corresponding computational cost can differ widely, of course. Therefore we
begin this section with an analysis of the application of the forward mode
to a contractive fixed-point iteration.


5.1. Black box differentiation in forward mode


Let us consider a loop of the form (2.11) with the vector v(i) = (x, zi , y) ∈
X × Z × Y = H partitioned into the vector of independent variables x, the
proper state vector zi and the vector of dependent variables y. While x is
assumed constant throughout the iteration, each vector y is a simple output,
or function, and may therefore not affect any of the other values. Then we
may write
z0 = z0 (x), zi = G(x, zi−1 ) for i = 1 . . . l, y = F (x, zl ) (5.1)
for suitable functions
G : X × Z → Z and F : X × Z → Y.
The statement z0 = z0 (x) is meant to indicate that z0 is suitably initialized
given the value of x. In many applications there might be an explicit depen-
dence of G on the iterations counter i so that, in effect, zi = G(i) (x, zi−1 ),
a situation that was considered in Campbell and Hollenbeck (1996) and
Griewank and Faure (2002). For simplicity we will assume here that the
iteration is stationary in that G(i) = G, an assumption that was also made
in the fundamental papers by Gilbert (1992) and Christianson (1994). To
assure convergence to a fixed point we make the following assumption.
Contractivity Assumption. The equation
z = G(x, z)
has a solution z∗ = z∗ (x), and there are constants δ > 0 and ρ < 1 such
that
    ‖z − z∗(x)‖ ≤ δ   ⇒   ‖Gz(x, z)‖ ≤ ρ < 1,        (5.2)

where Gz ≡ ∂G/∂z represents the Jacobian of G with respect to z and
‖Gz‖ denotes a matrix norm that must be consistent with a corresponding
vector norm ‖z‖ on Z.
According to Ostrowski’s theorem (see Propositions 10.1.3 and 10.1.4 in
Ortega and Rheinboldt (1970)), it follows immediately that the initial con-
dition ‖z0 − z∗‖ < δ implies Q-linear convergence in that

    Q{zi − z∗} ≡ lim sup_{i→∞} ‖zi − z∗‖ / ‖zi−1 − z∗‖ ≤ ρ.        (5.3)

Another consequence of our contractivity assumption is that 1 is not an


eigenvalue of Gz (x, z∗ ), so that, by the implicit function theorem, the locally
unique solution z∗ = z∗ (x) has the derivative

    ∂z∗/∂x = [I − Gz(x, z∗)]^{−1} Gx(x, z∗).        (5.4)
For a given tangent ẋ, we obtain the directional derivative ż∗ ≡ [∂z∗ /∂x]ẋ.


We continue to use such restrictions to one-dimensional subspaces of domain


and (dual) range in order to stay within matrix-vector notation as much as
possible. The resulting vector
ẏ∗ = Ḟ (x, z∗ , ẋ, ż∗ ) ≡ Fx (x, z∗ )ẋ + Fz (x, z∗ )ż∗
represents a directional derivative f′(x)ẋ of the reduced function
f (x) ≡ F (x, z∗ (x)). (5.5)
Now the crucial question is whether and how this implicit derivative can
be evaluated at least approximately by AD. The simplest approach is to
differentiate the loop (5.1) in black box fashion.
Provided not only G but also F is differentiable in some neighbourhood of
the fixed point (x, z∗ (x)), we obtain by differentiation in the forward mode
ż0 = 0, żi = Ġ(x, zi−1 , ẋ, żi−1 ) for i = 1 . . . l, ẏ = Ḟ (x, zl , ẋ, żl ), (5.6)
where
Ġ(x, z, ẋ, ż) ≡ Gx (x, z)ẋ + Gz (x, z)ż (5.7)
and Ḟ is defined analogously. Since the derivative recurrence (5.6) is merely
a linearization of the original iteration (5.1), the contractivity is inherited
and we obtain for any initial ż0 the R-linear convergence result

    R{żi − ż∗} ≡ lim sup_{i→∞} ‖żi − ż∗‖^{1/i} ≤ ρ.        (5.8)

This result was originally established by Gilbert (1992) and later generalized
by Griewank, Bischof, Corliss, Carle and Williamson (1993) to a much more
general class of Newton-like methods. We may abbreviate (5.8) to żi = ż∗ +
Õ(ρⁱ) for ease of algebraic manipulation, and it is an immediate consequence
for the reduced function f(x) defined in (5.5) that

    F(x, zi) = f(x) + Õ(ρⁱ)   and   Ḟ(x, zi, ẋ, żi) = f′(x)ẋ + Õ(ρⁱ).
As discussed by Ortega and Rheinboldt (1970), R-linear convergence is a
little weaker than Q-linear convergence, in that successive discrepancies
‖żi − ż∗‖ need not go down monotonically but must merely decline on average
by the factor ρ. This lack of monotonicity comes about because the leading
term on the right-hand side of (5.6), as defined in (5.7), may also approach
its limit Gx(x, z∗)ẋ in an irregular fashion. A very similar effect occurs for
Lagrange multipliers in nonlinear programming. Since (5.3) implies by the
triangle inequality

    lim sup_{i→∞} ‖zi − z∗‖ / ‖zi − zi−1‖ ≤ ρ/(1 − ρ),

we may use the last step-size ‖zi − zi−1‖ as an estimate for the remain-
ing discrepancy ‖zi − z∗‖. In contrast, this is not possible for R-linearly


converging sequences, so the last step-size ‖żi − żi−1‖ does not provide a
reliable indication of the remaining discrepancy ‖żi − ż∗‖.
In order to gauge whether the current value żi is a reasonable approxi-
mation to ż∗ , we recall that the latter is a solution of the direct sensitivity
equation
[I − Gz (x, z∗ )]ż = Gx (x, z∗ )ẋ, (5.9)
which is a mere rewrite of (5.4). Hence it follows under our assumptions
that, for any candidate pair (z, ż),
   
    ‖z − z∗‖ + ‖ż − ż∗‖ = O( ‖z − G(x, z)‖ + ‖ż − Ġ(x, z, ẋ, ż)‖ ),        (5.10)
with Ġ as defined in (5.7). The directional derivative Ġ(x, z, ẋ, ż) can be
computed quite easily in the forward mode of automatic differentiation so
that the residual on the right-hand side of (5.10) is constructively available.
Hence we may execute (5.1) and (5.6) simultaneously, and stop the itera-
tion when not only ‖zi − G(x, zi)‖ but also ‖żi − Ġ(x, zi, ẋ, żi)‖ is sufficiently
small. The latter condition requires a modification of the stopping criterion,
but otherwise the directional derivative żi ≈ ż∗ = ż∗ (x, ẋ) and the resulting
ẏi = Ḟ (x, zi , ẋ, żi ) ≈ ẏ∗ can be obtained by black box differentiation of the
original iterative solver.
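The following scalar Python sketch (an added illustration with a made-up contractive iteration function, not taken from the cited experiments) runs the state iteration (5.1) and its linearization (5.6) simultaneously and stops only when both residuals appearing in (5.10) are small; the converged ż agrees with the implicit derivative (5.4).

```python
import math

def G(x, z):   return 0.5 * math.cos(x * z)           # contractive for |x| <= 1
def Gz(x, z):  return -0.5 * x * math.sin(x * z)      # in practice supplied by the forward mode
def Gx(x, z):  return -0.5 * z * math.sin(x * z)

def solve_with_tangent(x, xdot, tol=1e-12):
    z, zdot = 0.0, 0.0
    while True:
        state_res   = abs(z - G(x, z))                                  # ||z_i - G(x, z_i)||
        tangent_res = abs(zdot - (Gx(x, z) * xdot + Gz(x, z) * zdot))   # tangent residual of (5.9)
        if state_res < tol and tangent_res < tol:                       # modified stopping test
            return z, zdot
        z, zdot = G(x, z), Gx(x, z) * xdot + Gz(x, z) * zdot            # (5.1) and (5.6) together

x, xdot = 1.0, 1.0
z, zdot = solve_with_tangent(x, xdot)
exact = Gx(x, z) * xdot / (1.0 - Gz(x, z))            # the sensitivity equation (5.9)
print(z, zdot, abs(zdot - exact))
```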
Sometimes G takes the form
G(x, z) = z − P (x, z) · H(x, z),
with H(x, z) = 0 the state equation to be solved and P(x, z) ≈ Hz(x, z)^{−1} a
suitable preconditioner. The optimal choice P(x, z) = Hz(x, z)^{−1} represents
Newton’s method but can rarely be realized exactly on large-scale problems.
Anyway we find
 
ż − Ġ(x, z, ẋ, ż) = P (x, z) [Hx (x, z) ẋ + Hz (x, z) ż] + Ṗ (x, z) H(x, z),
where
Ṗ (x, z) = Px (x, z) ẋ + Pz (x, z) ż,
provided P (x, z) is differentiable at all. As H(x, zi ) converges to zero, the
second term Ṗ (x, z) H(x, zi ) could, and probably should, be omitted when
evaluating Ġ(x, zi , ẋ, żi ). This applies in particular when P (x, z) is not even
continuously differentiable, for example due to pivoting in a preconditioner
based on an incomplete LU factorization. Setting Ṗ = 0 does not affect the
R-linear convergence result (5.8), and may reduce the cost of the derivative
iteration. However, it does require separating the preconditioner P from the
residual H, which may not be a simple task if the whole iteration function
G is incorporated in a legacy code. In Figure 5.1 the curves labelled ‘state
equation residual’ and ‘direct derivative residual’ represent zi − G(x, zi )
and żi − Ġ(x, zi , ẋ, żi ), respectively. The other two residuals are explained
in the following subsection.

[Figure 5.1 plots log10 of the residual norm against the iteration count for four quantities: the state equation residual, the direct derivative residual, the adjoint derivative residual and the second-order residual.]
Figure 5.1. History of residuals on 2D Euler solver, from Griewank and Faure (2002).

While the contractivity assumption (5.2) seems quite natural, it is by no
means always satisfied. For example, conjugate gradient and other general
Krylov space methods cannot be written in the form (5.1) with an iteration
function G(x, z) that has a bounded derivative with respect to z. The
problem is that the current residual H(z, x) often occurs as a norm or in
inner products in denominators, so that Gz (z, x) turns out to be unbounded
in the vicinity of a fixed point z∗ = z∗ (x). The same is also true for quasi-
Newton methods based on secant updating, but a rather careful analysis
showed that (5.8) is still true if (5.2) is satisfied initially (Griewank et al.
1993). In practice R-linear convergence of the derivatives has been observed
for other Krylov subspace methods, but we know of no proof that this must
occur under suitable assumptions.

5.2. Two-phase and adjoint differentiation


Many authors (Ball 1969, Cappelaere, Elizondo and Faure 2001) have ad-
vocated the following approach, for which directives are included in TAMC
(Giering and Kaminski 2000) and possibly other AD tools. Rather than
differentiating the fixed point iteration (5.1) itself, we may let it run undif-
ferentiated until the desired solution accuracy has been obtained, and then
solve the sensitivity equation (5.9) in a second phase. Owing to its linearity,

this problem looks somewhat simpler than the original task of solving the
nonlinear state equation z = G(x, z). When G represents a Newton step we
have asymptotically ρ = 0, and the solution of (5.9) can be achieved in a
single step
ż = Ġ(x, z, ẋ, 0) = −P (x, z) Hx (x, z) ẋ
where z ≈ z∗ represents the final iterate of the state vector. Also, the
simplified iteration may be applied since Ṗ is multiplied by H(x, z) ≈ 0.
When G represents an inexact version of Newton’s method based on an
iterative linear equation solver, it seems a natural idea to apply exactly the
same method to the sensitivity equation (5.9). This may be quite economical
because spectral properties and other information, that is known a priori
or gathered during the first phase iteration, may be put to good use once
more. Of course, we must be willing and able to modify the code by hand
unless AD provides a suitable tool (Giering and Kaminski 2000). The ability
to do this would normally presume that a fairly standard iterative solver,
for example of Krylov type, is in use. If nothing is known about the solver
except that it is assumed to represent a contractive fixed point iteration, we
may still apply (5.6) with zi−1 = z fixed so that
żi = Gx (x, z)ẋ + Gz (x, z)żi−1 for i = 1 . . . l. (5.11)
Theoretically, the AD tool could exploit the constancy of z to avoid the
repeated evaluations of certain intermediates. In effect we apply the lin-
earization of the last state space iteration to propagate derivatives forward.
The idea of just exploiting the linearization of the last step has actually
been advocated more frequently for evaluating adjoints of iterative solvers
(Giering and Kaminski 1998, Christianson 1994).
To elaborate on this we first need to derive the adjoint of the fixed point
equation z = G(z, x). It follows from (5.4) by the chain rule that the total
derivative of the reduced response function defined in (5.5) is given by
f ′(x) = Fx + Fz [I − Gz (x, z∗ )]^{−1} Gx (x, z∗ ).
Applying a weighting functional ȳ to y, we obtain the adjoint vector
x̄∗ = ȳf ′(x) = ȳ Fx (x, z∗ ) + ḡ∗ Gx (x, z∗ ),    (5.12)
where
ḡ∗ ≡ z̄∗ [I − Gz (x, z∗ )]^{−1} with z̄∗ ≡ ȳ Fz (x, z∗ ).    (5.13)
While the definition of z̄∗ follows our usual concept of an adjoint vector, the
role of ḡ∗ ∈ Z warrants some further explanation. Suppose we introduce an
additive perturbation g of G so that we have the system
z = G(z, x) + g, y = F (z, x).
Then it follows from the implicit function theorem that ḡ∗ , as given by (5.13),

is exactly the gradient of ȳ y with respect to g evaluated at g = 0. In other
words ḡ∗ is the vector of Lagrange multipliers of the constraint z = G(x, z)
given the objective function ȳ y. From (5.13) it follows directly that ḡ∗ is
the unique solution of the adjoint sensitivity equation
 
ḡ [I − Gz (x, z∗ )] = z̄∗ .    (5.14)
Under our contractivity assumption (5.2) on Gz , the corresponding fixed
point iteration
ḡi+1 = z̄∗ + ḡi Gz (x, z∗ ) (5.15)
is also convergent, whence
  
Q{ḡi − ḡ∗ } ≡ lim sup_{i→∞} ‖ḡi − ḡ∗ ‖ / ‖ḡi−1 − ḡ∗ ‖ ≤ ρ.    (5.16)

In the notation of Section 3, the right-hand side of (5.15) can be evaluated as


ḡi Gz (x, z∗ ) ≡ Ḡ(x, z∗ , ḡi )
by a reverse sweep on the procedure for evaluating G(x, z∗ ). In many appli-
cations, this adjoining of the iteration function G(x, z) is no serious problem,
and requires only a moderate amount of memory space. When G = I − P H
with P = Hz−1 representing Newton’s method, we have Gz (x, z∗ ) = 0, and
the equation (5.15) reduces to ḡ1 = z̄∗ = ḡ∗ . The relation between (5.15)
and mechanical adjoining of the original fixed point iteration was analysed care-
fully in Giles (2001) for linear time-variant systems and their discretizations.
Both (5.11) and (5.15) assume that we really have a contractive, single-
step iteration. If in fact zi = G(i) (x, zi ) and each individual G(i) only
contracts the solution error on some subspace, as occurs for example in
multi-grid methods, then the linearization of what happens to be the last
iteration will not provide a convergent solver for the direct or adjoint sensi-
tivity equation.
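For the two-phase variant the adjoint iteration (5.15) can be sketched analogously. Here Gbar(x, z, gbar) is assumed to return the row vector ḡ Gz(x, z) by a reverse sweep through G, and Fzbar, Fxbar, Gxbar stand for the corresponding adjoint products; all names are illustrative and not those of an actual tool.

import numpy as np

def adjoint_fixed_point(Gbar, Fzbar, Fxbar, Gxbar, x, z, ybar,
                        tol=1e-10, maxit=10**5):
    """Solve the adjoint sensitivity equation by the iteration (5.15) at the
    (approximately) converged state z, then return the reduced gradient (5.12)."""
    zbar = Fzbar(x, z, ybar)             # zbar_* = ybar F_z(x, z)
    gbar = np.zeros_like(zbar)
    for _ in range(maxit):
        gbar_new = zbar + Gbar(x, z, gbar)
        if np.linalg.norm(gbar_new - gbar) < tol:
            gbar = gbar_new
            break
        gbar = gbar_new
    return Fxbar(x, z, ybar) + Gxbar(x, z, gbar)   # xbar = ybar F_x + gbar G_x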

5.3. Adjoint-based error correction


Whenever we have computed approximate solutions z and z̄ to the state
equation (5.2) and the adjoint sensitivity equation (5.14), we have in analogy
to (5.10) the error bound
   
‖z − z∗ ‖ + ‖z̄ − z̄∗ ‖ = O(‖z − G (x, z)‖ + ‖z̄ − Ḡ (x, z, x̄, z̄)‖),    (5.17)
Using the approximate adjoint solution ḡ we may improve the estimate for
the weighted response ȳ F (x, z∗ ) = ȳ f (x) by the correction ḡ(G(z, x) − z).
More specifically, we have for Lipschitz continuously differentiable F and G
the Taylor expansion
 
F (x, z∗ ) = F (x, z) + Fz (x, z)(z∗ − z) + O(‖z − z∗ ‖²)

and
   
0 = z∗ − G(x, z∗ ) = [z − G(x, z)] + [I − Gz (x, z)](z∗ − z) + O(‖z − z∗ ‖²).
By subtracting ḡ times the second equation from the first, we obtain the
estimate
 
ȳ F (x, z) − ḡ [z − G(x, z)] − ȳ f (x)
= O(‖ḡ [I − Gz (x, z)] − z̄‖ ‖z − z∗ ‖ + ‖z − z∗ ‖²).
Now, if z = zi and ḡ = ḡi are generated according to (5.1) and (5.15), we
derive from (5.3) and (5.16) that
 
ȳ F (x, zi ) − ḡi [zi − G(x, zi )] = ȳ f (x) + Õ(ρ^{2i} ).
Hence the corrected estimate converges twice as fast as the uncorrected
one. It has been shown by Griewank and Faure (2002) that, for linear
G and F and zero initializations z0 = 0 and ḡ0 = 0, the ith corrected
estimate is exactly equal to the 2i-th uncorrected value ȳ F (x, z2i ). The
same effect occurs when the initial z0 is quite good compared to z̄0 , since
the adjoint iteration (5.15) is then almost linear. This effect is displayed
in Figure 5.2, where the normal (uncorrected) value lags behind the double
(adjoint corrected) one by almost exactly the factor 2. Here the weighted
response ȳF was the drag coefficient of the NACA0012 airfoil.
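In code the correction is a single line; the sketch below (with the same kind of hypothetical callables as before, and numpy arrays assumed throughout) simply returns the quantity whose error decays like Õ(ρ^{2i}) rather than Õ(ρ^i ) when zi and ḡi are produced by (5.1) and (5.15).

def corrected_response(F, G, x, z_i, ybar, gbar_i):
    """Adjoint-corrected estimate of ybar f(x):
    ybar F(x, z_i) - gbar_i (z_i - G(x, z_i))."""
    return ybar @ F(x, z_i) - gbar_i @ (z_i - G(x, z_i))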
In our setting the discrepancies z − z∗ and ḡ − ḡ∗ come about through iterative equation solving. The same duality arguments apply if z∗ and ḡ∗ are
solutions of operator equations that are approximated by solutions z and
ḡ of corresponding discretizations. Under suitable conditions elaborated

[Figure 5.2 plots the response function against the iteration count (logarithmically scaled) for the normal and the adjoint-corrected (‘double’) values.]
Figure 5.2. Normal and corrected values of drag coefficient.
in Giles and Pierce (2001), Giles and Süli (2002) and Becker and Rannacher
(2001), the adjoint correction technique then doubles the order of conver-
gence with respect to the mesh-width. In both scenarios, solving the adjoint
equation provides accurate sensitivities of the weighted response with re-
spect to solution inaccuracies. For discretized PDEs this information may
then be used to selectively refine the grid where solution inaccuracies have
the largest effect on the weighted response (Becker and Rannacher 2001).

5.4. Grey box differentiation and one-shot optimization


The two-phase approach considered above seems inappropriate in all situ-
ations where the design vector x is not constant, but updated repeatedly
to arrive at some desirable value of the response f (x) = F (x, z∗ (x)), for
example a minimum. Then it makes more sense to apply simultaneously
with the state space iteration (5.1) the adjoint iteration
ḡi+1 = z̄i + ḡi Gz (x, zi ) = ȳ Fz (x, zi ) + ḡi Gz (x, zi ). (5.18)
As we observed for the forward derivative recurrence (5.6), the continuing
variation in zi destroys the proof of Q-linear convergence, but we have

R{ḡi − ḡ∗ } ≡ lim sup_{i→∞} ‖ḡi − ḡ∗ ‖^{1/i} ≤ ρ,    (5.19)

analogous to (5.8). The ḡi allow the computation of the approximate re-
duced gradients
x̄i ≡ ȳ Fx (x, zi ) + ḡi Gx (x, zi ) = x̄∗ + Õ(ρ^i ).    (5.20)
Differentiating (5.18) once more, we obtain the second-order adjoint itera-
tion
ḡ˙ i+1 = ȳḞz (x, zi , ẋ, żi ) + ḡi Ġz (x, zi , ẋ, żi ) + ḡ˙ i Gz (x, zi ), (5.21)
where
Ḟz (x, zi , ẋ, żi ) ≡ Fzx (x, zi )ẋ + Fzz (x, zi )żi
and Ġz is defined analogously. The vector ḡ˙ i ∈ X ∗ obtained from (5.21)
may then be used to calculate
x̄˙ i = ȳḞx (x, zi , ẋ, żi ) + ḡ˙ i Gx (x, zi ) + ḡi Ġx (x, zi , ẋ, żi )
    = x̄˙ ∗ + Õ(ρ^i ) = ȳf ″(x)ẋ + Õ(ρ^i ),
where Ḟx and Ġx are also defined by analogy with Ḟz . The vector x̄˙ represents a first-order approximation to the product of the reduced Hessian ȳf ″(x) = ∇²x [ȳf (x)] with the direction ẋ.
While the right-hand side of (5.21) looks rather complicated, it can be
evaluated as a second-order adjoint of the vector function
E(x, z) ≡ [F (x, z), G(x, z)],

whose first-order adjoint Ē appears on the right-hand side of (5.18).


Overall we have the following derivative enhanced iterations for i = 0, 1 . . . :
(y, z) = E (x, z) (original),
(x̄, ḡ) = Ē (x, z, ȳ, ḡ) (adjoint),
(ẏ, ż) = Ė (x, z, ẋ, ż) (direct),
(x̄˙ , ḡ˙ ) = Ē˙ (x, z, ȳ, ḡ, ẋ, ż, ḡ˙ ) (second).
Here we have omitted the indices, because the new versions of y, z, x̄, ḡ, ẏ, ż,
x̄˙ and ḡ˙ can immediately overwrite the old ones. Under our contractivity
assumption all converge with the same R-factor ρ. Norms of the discrep-
ancies between the left- and right-hand sides can be used to measure the
progress of the iteration. Their plot in Figure 5.1 confirms the asymptotic
result, but also suggests that higher derivatives lag behind lower derivatives,
as was to be expected.
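In outline the piggyback scheme is just one loop. The sketch below uses placeholder functions E, Ebar, Edot and Ebardot for the original evaluation and its first- and second-order AD extensions; the names and calling conventions are invented for illustration, and the iterates are overwritten in place as described above.

import numpy as np

def one_shot_sweep(E, Ebar, Edot, Ebardot, x, z, ybar, xdot, sweeps=100):
    """Advance the original, adjoint, direct and second-order adjoint
    recurrences in lockstep; all converge with the same R-factor rho."""
    gbar = np.zeros_like(z)          # adjoint of the fixed point equation
    zdot = np.zeros_like(z)          # tangent of the state
    gbardot = np.zeros_like(z)       # second-order adjoint
    for _ in range(sweeps):
        y, z = E(x, z)                                        # original
        xbar, gbar = Ebar(x, z, ybar, gbar)                   # adjoint
        ydot, zdot = Edot(x, z, xdot, zdot)                   # direct
        xbardot, gbardot = Ebardot(x, z, ybar, gbar,
                                   xdot, zdot, gbardot)       # second
    return y, xbar, ydot, xbardot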
We have named this section ‘grey box differentiation’ because the genera-
tion of the nonincremental adjoint statement (x̄, ḡ) = Ē(x, z, ȳ, ḡ) from the
original (y, z) = E(x, z) does not follow the usual recipe of adjoint gener-
ation, first and foremost because the outer loop need not be reversed, and
consequently the old versions of the various variables need not be saved on
a stack. Also, the adjoints are not updated incrementally by += but properly
assigned new values. There are many possible variations of the derived fixed
point iterations. In ‘one shot’ optimization we try to solve simultaneously
the stationarity condition x̄ = ∇f (x) = 0, possibly by another fixed point
iteration
x = x − Q−1 x̄ with Q ≈ ∇2 f (x).
The reduced Hessian approximation Q might be based on dim(X) second-order adjoints of the form x̄˙ , with ẋ ranging over a basis in X. Alternatively, we may use secant updates or apply Uzawa-like algorithms. Some of these
variants are employed in Newman, Hou, Jones, Taylor and Korivi (1992),
Keyes, Hovland, McInnes and Samyono (2001), Forth and Evans (2001),
Mohammadi and Pironneau (2001), Venditti and Darmofal (2000), Hinze
and Slawig (2002) and Courty, Dervieux, Koobus and Hascoët (2002).

6. Jacobian matrices and graphs


The forward and reverse mode are nearly optimal choices when it comes to
evaluating a single tangent ẏ = F  (x)ẋ or a single gradient x̄ = ȳF  (x),
respectively. In contrast, when we wish to evaluate a Jacobian with m > 1
rows and n > 1 columns, we observe in this section that the forward and
reverse mode are just two extreme options from a wide range of possible
ways to apply the chain rule. This generalization opens up the chance

for a very significant reduction in operation count, although the search for
an elimination ordering with absolutely minimal operation count is a hard
combinatorial problem.

6.1. Nonincremental structure and Bauer’s formula


In this subsection we will exclude all incremental statements so that there
is a 1–1 correspondence between the ϕi and their results vi . They must
then, by (2.8), all be independent of each other in that PiT Pj = 0 if i = j.
Theoretically any evaluation procedure with incremental statements can be
rewritten by combining all contributions to a certain value in one statement.
For example, the reverse sweep of the adjoint procedure on the right-hand
side of Table 3.3 can be combined to the procedure listed in Figure 6.1.

v̄5 = ȳ
v̄4 = v̄5 ∗ v3
v̄3 = v̄5 ∗ v4
v̄2 = v̄4 ∗ cos(v1 + v2 )
v̄1 = v̄4 ∗ cos(v1 + v2 ) + v̄3 ∗ v3

Figure 6.1. Nonincremental adjoint.

On larger codes this elimination of incremental statements may be quite
laborious and change the structure significantly. Without incremental as-
signments we have the orthogonal decomposition
H = H1 ⊕ H2 ⊕ · · · ⊕ Hl and vi ∈ Hi for i = 1 . . . l.
Consequently we have
 
Qi = Qi Σ_{j≺i} Pj PjT = Σ_{j≺i} Qi Pj PjT

and may write


 

vi = ϕi ( Σ_{j≺i} Qi Pj vj ).

Thus we obtain the partial derivatives


Cij = ∂ϕi /∂vj = ϕ′i (ui ) Qi Pj ∈ Rmi ×mj .
The interpretation of the Cij as mi × mj matrices again assumes a fixed
orthonormal basis for all Hi ≃ Rmi and thus for their Cartesian product H.
Since Cij is nontrivial exactly when j ≺ i, we may attach these matrices as

[Figure 6.2 shows a computational graph with independents 1, 2, intermediates 3, 4, 5 and dependents 6, 7, whose edges carry the labels C3 1 , C3 2 , C4 3 , C5 4 , C6 5 , C7 5 , C6 1 and C7 2 ; accumulation reduces it to the bipartite graph with edge labels C6 1 , C6 2 , C7 1 and C7 2 .]
Figure 6.2. Jacobian vertex accumulation on a simple graph.

labels to the edges of the computation graph, as sketched in Figure 6.2 for
a particular case with two independents and two dependents.
Obviously the local derivatives Cij uniquely determine the overall Ja-
cobian F  (x), whose calculation from the Cij we will call accumulation.
Formally we may use the following explicit expression for the individual Ja-
cobian blocks, which can be derived from the chain rule by induction on l.
Lemma 6.1. (Bauer’s formula) The derivative of any dependent vari-
able yi = vl−m+i with respect to any independent variable xj = vj is given by
∂yi /∂xj = Σ_{P∈[j→ı̂]} Π_{(̃,ı̃)⊂P} Cı̃ ̃ ,    (6.1)
where [j → ı̂] denotes the set of all paths connecting j to ı̂ ≡ l − m + i and the pair (̃, ı̃) ranges over all edges in P in descending order from left to right.
The order of the factors Cij along any one of the paths matters only if
some of the vertex dimensions mi = dim(vi ) are greater than 1. For the
example depicted in Figure 6.2 we find, according to Lemma 6.1,
∂y1 /∂x1 = ∂v6 /∂v1 ≡ C6 1 = C6 1 + C6 5 C5 4 C4 3 C3 1 ,
∂y1 /∂x2 = ∂v6 /∂v2 ≡ C6 2 = C6 5 C5 4 C4 3 C3 2 ,
∂y2 /∂x1 = ∂v7 /∂v1 ≡ C7 1 = C7 5 C5 4 C4 3 C3 1 ,
∂y2 /∂x2 = ∂v7 /∂v2 ≡ C7 2 = C7 2 + C7 5 C5 4 C4 3 C3 2 .
In the scalar case the number of multiplications or additions needed to ac-
cumulate all entries of the Jacobian by Bauer’s formula is given by

SIZE(F ′(x)) = Σ_{j=1}^{n} Σ_{i=1}^{m} Σ_{P∈[j→ı̂]} |P| ,

where |P| counts the number of vertices in P. It is not hard to see that this
often enormous number is proportional to the length of the formulas that
express each yi directly in terms of all xj without any named intermediates.
The expression (6.1) has the flavour of a determinant, which is no coin-
cidence since in the scalar case it is in fact a determinant, as we shall see
in Section 6.3. Naturally, an explicit evaluation is usually very wasteful for
there are often common subexpressions, such as C5 4 C4 3 , that should prob-
ably be calculated first. Actually this is not necessarily so, unless we assume
all intermediates vi to be scalars. For instance, if the two independents v1 , v2 and dependents v6 , v7 were scalars but the three intermediates v3 , v4 and v5 were vectors of length d, then the standard matrix product C5 4 C4 3 of two d × d matrices would cost d³ multiplications, whereas a total of 8 matrix–vector multiplications would suffice to compute the four Jacobian entries by bracketing their expressions from left to right or vice versa.
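For scalar edge labels, formula (6.1) can be transcribed almost literally; the little Python routine below sums the products of edge values over all paths by recursing over predecessors. The graph is stored as a dictionary c[i] = {j: c_ij}, and the numerical edge values chosen for the Figure 6.2 topology are entirely arbitrary.

def bauer_entry(c, j, i_hat):
    """Sum over all paths from independent j to dependent i_hat of the
    product of edge values, i.e. formula (6.1) for scalar c_ij.
    c[i] is a dict {predecessor j: c_ij}."""
    if i_hat == j:
        return 1.0
    total = 0.0
    for pred, cij in c.get(i_hat, {}).items():
        total += cij * bauer_entry(c, j, pred)
    return total

# the Figure 6.2 topology with made-up edge values
c = {3: {1: 0.5, 2: 2.0}, 4: {3: -1.0}, 5: {4: 3.0},
     6: {1: 0.1, 5: 1.5}, 7: {2: 0.2, 5: -0.7}}
print(bauer_entry(c, 1, 6))   # dy1/dx1 = c61 + c65*c54*c43*c31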

6.2. Jacobian accumulation by vertex elimination


In graph terminology we can interpret the calculation of the product
C5 3 ≡ C5 4 C4 3 as the elimination of the vertex 4 in Figure 6.2. Then
we have afterwards the simplified graph depicted on the left-hand side of
Figure 6.3.

[Figure 6.3 shows the graph of Figure 6.2 after the elimination of vertex 4, which introduces the edge label C5 3 , and indicates the subsequent elimination of vertex 5, which produces the labels C6 3 and C7 3 .]
Figure 6.3. Successive vertex eliminations on problem displayed in Figure 6.2.

Subsequently we may eliminate vertex 5 by setting C6 3 ≡ C6 5 C5 3 and
C7 3 = C7 5 C5 3 . Finally, eliminating vertex 3 we arrive at the labels C6 1 ,
C6 2 , C7 1 , and C7 2 , which are exactly the (block) elements of the Jacobian
F  (x) represented as a bipartite graph in Figure 6.2.
In general, elimination of an intermediate vertex vj from the computa-
tional graph requires the incrementation
Cik += Cij Cjk ∈ Rmi ×mk
for all predecessor–successor pairs (k, i) with k ≺ j ≺ i. If a direct edge
(i, k) did not exist beforehand it must now be introduced with the initial
value Cik = Cij Cjk . In other words, the precedence relation ≺ and the
graph structure must be updated such that all pairs (k, i) with k ≺ j ≺ i

are directly connected by an edge; afterwards the vertex j and its edges are deleted. Again assuming elementary matrix–matrix arithmetic, we have the total elimination cost
MARK(vj ) ≡ OPS{Elim(vj )} = mj Σ_{k≺j≺i} mk mi .
When all vertices are scalars MARK(vj ) reduces to the Markowitz degree fa-
miliar from sparse Gaussian elimination (Rose and Tarjan 1978). There as
here, the overall objective is to minimize the vertex accumulation cost
  
VACC{F ′(x)} ≡ Σ_{n<j≤l−m} MARK(vj ),

where the degree MARK(vj ) needs to be computed just before the elimination
of the jth vertex. This accumulation effort is likely to dominate the overall
operation count compared to the effort for evaluating the elemental partials
    
OPS{F ′(x)} − VACC{F ′(x)} ≡ Σ_{i=1}^{l} OPS{{Cij }j≺i } ≤ q OPS{F (x)},
where
 
q ≡ max_i OPS{{Cij }j≺i }/OPS(ϕi ).
Typically this maximal ratio q is close to 2 or some other small number for
any given library of elementary functions ϕi . In contrast to sparse Gaussian
elimination, the linearized computational graph stays acyclic throughout,
and the minimal and maximal vertices, whose Markowitz degrees vanish
trivially, are precluded from elimination. They form the vertex set of the
final bipartite graph whose edges represent the nonzero (block) entries of
the accumulated Jacobian, as shown on the right-hand side of Figure 6.2.
For the small example Figure 6.2 with scalar vertices, the elimination or-
der 4 , 5 and 3 discussed above requires 1 + 2 + 4 = 7 multiplications.
In contrast, elimination in the natural order 3 , 4 and 5 , or its exact
opposite 5 , 4 and 3 , requires 2 + 2 + 4 = 8 multiplications. Without too
much difficulty it can be seen that, in general, eliminating all intermediate
vertices in the orders n + 1, n + 2, . . . , (l − 1), l or l, l − 1, . . . , n + 1, n is math-
ematically equivalent to applying the sparse vector forward or the sparse
reverse mode discussed in Section 3.5, respectively. Thus our tiny example
demonstrates that, already, neither of the two standard modes of AD needs
to be optimal in terms of the operation count for Jacobian accumulation.
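To make the elimination rule and its cost count concrete, here is a small Python sketch of scalar vertex elimination that charges the Markowitz degree for each step; the dictionary representation and the edge values are illustrative choices only. Applied to the Figure 6.2 topology in the order 4, 5, 3, it reproduces the count of 7 multiplications mentioned above.

def eliminate_vertex(c, j):
    """Eliminate intermediate vertex j from the linearized graph:
    c[i][k] += c[i][j]*c[j][k] for all k -> j -> i, then delete j;
    returns the number of multiplications, i.e. the Markowitz degree of j."""
    preds = dict(c.get(j, {}))                           # k with k -> j
    succs = {i: d[j] for i, d in c.items() if j in d}    # i with j -> i
    for i, cij in succs.items():
        del c[i][j]
        for k, cjk in preds.items():
            c[i][k] = c[i].get(k, 0.0) + cij * cjk       # fill-in if absent
    c.pop(j, None)
    return len(preds) * len(succs)

# accumulate the Jacobian of the Figure 6.2 example in the order 4, 5, 3
c = {3: {1: 0.5, 2: 2.0}, 4: {3: -1.0}, 5: {4: 3.0},
     6: {1: 0.1, 5: 1.5}, 7: {2: 0.2, 5: -0.7}}
cost = sum(eliminate_vertex(c, j) for j in (4, 5, 3))
print(cost, c)   # 7 multiplications; c now holds the bipartite Jacobian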
As in the case of sparse Gaussian elimination (Rose and Tarjan 1978), it
has been shown that minimizing the fill-in during the accumulation of the Jacobian is an NP-hard problem and the same is probably true for minimizing
the operation count. In any case no convincing heuristics for selecting one

of the (l − m − n)! vertex elimination orderings have yet been developed. A
good reason for this lack of progress may be the observation that, on one
hand, Jacobian accumulation can be performed in more general ways, and
on the other hand, it may not be a good idea in the first place. These two
aspects are discussed in the remainder of this section.
Recently John Reid has observed that, even though Jacobian accumula-
tion involves no divisions, certain vertex elimination orderings may lead to
numerical instabilities. He gave an example similar to the one depicted in
Figure 6.4, where the arcs are annotated by constant partials u and h, whose
values are assumed to be rather close to 2 and −1, respectively.

[Figure 6.4 shows a chain-structured computational graph on the vertices 1, 2, . . . , 2k + 1 whose arcs carry the constant partials u and h.]
Figure 6.4. Reid’s example of potential accumulation instability.

When the intermediate vertices 2, 3, 4, . . . , 2k − 1, 2k are eliminated for-
ward or backward, all newly calculated partial derivatives are close to zero,
as should be the final result c2k+1,1 . However, if we first eliminate only the
odd vertices 3, 5, . . . , 2k − 1, the arc between vertex 1 and vertex 2k + 1 tem-
porarily reaches the value uk ≈ 2k . The subsequent elimination of the even
intermediate vertices 2, 4, 6, . . . , 2k theoretically balances out this enormous
value to nearly zero. However, in floating point arithmetic errors are cer-
tain to be amplified enormously. Hence, it is clear that numerical stability
must be a concern in the study of suitable elimination orderings. In this
particular case the numerically unstable, prior elimination of the odd ver-
tices is also the one that makes the least sense in terms of minimizing fill-in
and operation count. More specifically, the number of newly allocated and
computed arcs grows quadratically with the depth k of the computational
graph. Practically any other elimination order will result in a temporal and
spatial complexity that is linear in k.
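The instability is easy to reproduce numerically. The sketch below builds a chain of the kind just described (the precise graph is our reconstruction, so its details are only illustrative) and accumulates it once in the natural forward order and once with the odd interior vertices eliminated first; in exact arithmetic both orders give [u(1 + h)]^k, but in floating point the second ordering loses essentially all accuracy.

def eliminate_vertex(c, j):
    # same update as in the earlier sketch: c[i][k] += c[i][j]*c[j][k], then delete j
    preds = dict(c.get(j, {}))
    succs = {i: d[j] for i, d in c.items() if j in d}
    for i, cij in succs.items():
        del c[i][j]
        for k, cjk in preds.items():
            c[i][k] = c[i].get(k, 0.0) + cij * cjk
    c.pop(j, None)

def ladder(k, u, h):
    # vertices 1 .. 2k+1; arcs (2i-1)->(2i+1) with value u,
    # plus detours (2i-1)->(2i) with value u and (2i)->(2i+1) with value h
    c = {}
    for i in range(1, k + 1):
        c.setdefault(2 * i + 1, {})[2 * i - 1] = u
        c.setdefault(2 * i, {})[2 * i - 1] = u
        c[2 * i + 1][2 * i] = h
    return c

k, u, h = 40, 2.0, -1.0 + 1e-4
exact = (u * (1.0 + h)) ** k                  # (2e-4)**40, about 1.1e-148
for order in (list(range(2, 2 * k + 1)),                                  # forward
              list(range(3, 2 * k, 2)) + list(range(2, 2 * k + 1, 2))):   # odd vertices first
    c = ladder(k, u, h)
    for j in order:
        eliminate_vertex(c, j)
    print(c[2 * k + 1][1], 'versus exact', exact)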

6.3. Jacobians as Schur complements


Because we have excluded incremental assignments and imposed the sin-
gle assignment condition (2.7), the structure of the evaluation procedure
Table 2.1 is exactly reflected in the nonlinear system
0 = E(x; v) = (ϕi (ui ) − vi )i=1...l ,
where, as before, ϕi (ui ) = ϕi ( ) = xi for i = 1 . . . n. By our assumptions
on the data dependence ≺, the system is triangular and its l × l (block)

Jacobian has the structure


E ′(x; v) = (Cij − δij I)_{i,j=1...l} ≡ C − I =
    [ −I      0      0 ]
    [  B    L − I    0 ]    (6.2)
    [  R      T     −I ] .
Here δij is the Kronecker delta and Cij ≠ 0 ⇐⇒ j ≺ i =⇒ j < i. The last
relation implies in particular that the matrix L is strictly lower-triangular.
Applying the implicit function theorem to E(x; v) = 0, we obtain the deriva-
tive
F ′(x) ≡ R + T (I − L)^{−1} B    (6.3)
       = R + T [(I − L)^{−1} B] = R + [T (I − L)^{−1} ]B.

The two bracketings on the last line represent two particular ways of accu-
mulating the Jacobian F  (x). One involves the solution of n linear systems
in the unary lower-triangular matrix (L − I) and the other requires the
solution of m linear systems in its transpose (L − I)T . As observed in Chap-
ter 8 of Griewank (2000), the two alternatives correspond once more to the
forward and reverse mode of AD, and their relative cost need not be deter-
mined by the ratio m/n nor even by the sparsity structure of the Jacobian
F  (x) alone. Instead, what matters is the sparsity of the extended Jacobian
E  (v, x), which is usually too huge to deal with in many cases. In the scalar
case with R = 0 we can rewrite (6.3) for any two Cartesian basis vectors
ei ∈ Rm and ej ∈ Rn by Cramer’s rule as

eTi F ′(x) ej = det [ I − L     Bej ]
                   [ −eTi T      0 ] .    (6.4)

Thus we see that Bauer’s formula given in Lemma 6.1 indeed represents
the determinant of a sparse matrix. Moreover, we may derive the following
alternative interpretation of Jacobian accumulation procedures.
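In dense matrix terms, (6.3) boils down to solves with (I − L) or its transpose; the toy numpy sketch below generates random blocks of the required shapes (sizes purely illustrative, and a general solve stands in for the triangular substitution) and confirms that the two bracketings give the same Jacobian, at the cost of n respectively m solves.

import numpy as np

rng = np.random.default_rng(0)
n, m, p = 3, 2, 5                                # independents, dependents, intermediates
L = np.tril(rng.standard_normal((p, p)), k=-1)   # strictly lower triangular
B = rng.standard_normal((p, n))
T = rng.standard_normal((m, p))
R = rng.standard_normal((m, n))
I = np.eye(p)

# forward-mode bracketing: n solves with (I - L), one per independent variable
J_forward = R + T @ np.linalg.solve(I - L, B)
# reverse-mode bracketing: m solves with (I - L)^T, one per dependent variable
J_reverse = R + np.linalg.solve((I - L).T, T.T).T @ B
print(np.allclose(J_forward, J_reverse))         # both equal R + T (I-L)^{-1} B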

6.4. Accumulation by edge elimination


Eliminating a certain vertex j in the computational graph, as discussed in
the previous section, corresponds to using the −I in the jth diagonal block
of (6.2) to eliminate all other (block) elements in its row or column. This
can be done for all n < j ≤ l − m in arbitrary order until the lower-triangular
part L has been zeroed out completely, and the bottom left block R now
contains the Jacobian F  (x). Rather than eliminating all subdiagonal blocks
in a row or column at once, we may also just zero out one of them, say Cij .
This can be interpreted as back- or front-elimination of the edge (j, i) from
the computational graph, and requires the updates
Cik += Cij Cjk for all k ≺ j (6.5)
or
Chj += Chi Cij for all h ≻ i,    (6.6)
respectively. It is easy to check that, after the precedence relation ≺ has been updated accordingly, Bauer’s formula is unaffected. Front-eliminating
all edges incoming to a node j or back-eliminating all outgoing edges is
equivalent to the vertex elimination of j. The back-elimination (6.5) or the
front-elimination (6.6) of the edge (j, i) may actually increase the number of
nonzero (block-)elements in (6.2), which equals the number of edges in the
computational graph. However, we can show that the sum over the length
of all directed path in the graph and also the number of nonzeros in the
subdiagonals of (6.2) are monotonically decreasing. Therefore, elimination
of edges in any order must again lead eventually to the same bipartite graph
representing the Jacobian.
Let us briefly consider edge elimination for the ‘lion’ example of Naumann
displayed in Figure 6.5. It can be checked quite easily that eliminating the
two intermediate vertices 1 and 2 in either order requires a total of 12
multiplications. However, if the edge c84 is first back-eliminated, the total
effort is reduced by one multiplication. Hence we see that the generalization
from vertex to edge elimination makes a further reduction in the operation
count possible. However, it is not yet really known how to make good use of
this extra freedom, and there is the danger that a poor choice of the order
in which the edges are eliminated can lead to an exponential effort overall.
This cannot happen in vertex elimination, for which VACC(F ′(x)) always has the cubic bound (l max_{1≤i≤l} mi )³.
The same upper bound applies to edge elimination if we ensure that no
edge is eliminated and reintroduced repeatedly through fill-in. To exclude
this possibility, edge (j, i) may only be eliminated when it is final, i.e., no
other path connects the origin j to the destination i. Certain edge elim-
ination orderings correspond to breaking down the product representation
(3.1) further into a matrix chain, where originally each factor corresponds

Program:
v3 = c31 ∗ v1 + c32 ∗ v2
v4 = c43 ∗ v3
v5 = c54 ∗ v4
v6 = c64 ∗ v4
v7 = c74 ∗ v4
v8 = c84 ∗ v4 + c83 ∗ v3
[Graph: the independents 1 and 2 feed vertex 3, vertex 3 feeds vertices 4 and 8, and vertex 4 feeds vertices 5, 6, 7 and 8.]
Figure 6.5. The ‘lion’ example from Naumann (1999) with l = 8.

to a single elemental partial Cij . Many of these extremely sparse matri-
ces commute and can thus be reordered. Moreover, the resulting chain can
be bracketed in many different ways. The bracketing can be optimized by
dynamic programming (Griewank and Naumann 2003b), which is classical
when all matrix factors are dense (Aho, Hopcroft and Ullman 1974). Var-
ious elimination strategies were tested on the stencil generated by the Roe
flux for hyperbolic PDEs (Tadjouddine, Forth and Pryce 2001).
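The bracketing subproblem in isolation is the classical matrix chain ordering task, solvable by the textbook dynamic program sketched below; the dimension vector in the example is made up purely for illustration.

def matrix_chain_cost(dims):
    """dims[i], dims[i+1] are the row and column counts of factor i;
    returns the minimal number of scalar multiplications for the chain."""
    k = len(dims) - 1                     # number of factors
    best = [[0] * k for _ in range(k)]
    for span in range(1, k):
        for i in range(k - span):
            j = i + span
            best[i][j] = min(best[i][s] + best[s + 1][j]
                             + dims[i] * dims[s + 1] * dims[j + 1]
                             for s in range(i, j))
    return best[0][k - 1]

print(matrix_chain_cost([2, 8, 1, 8, 2]))   # chain 2x8 * 8x1 * 1x8 * 8x2, optimum 36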

6.5. Face elimination and its optimality


Rather than combining one edge with all its successors or with all its pre-
decessors, we may prefer to combine only two edges at a time. In other
words, we just wish to update Cik += Cij Cjk for a single triple k ≺ j ≺ i,
which is called a ‘face’ in Griewank and Naumann (2003a). Unfortunately,
this kind of modification cannot be represented in a simple way on the level
of the computational graph. Instead we have to perform face eliminations
on the line graph whose vertices j, i correspond initially to the edges (j, i)
of the original graph appended by two extra sets of minimal vertices oj for
j = 1 . . . n and maximal vertices di for i = 1 . . . m. They represent edges
oj = (−∞, j) connecting a common source (−∞) to the independent ver-
tices of the original graphs and edges di = (ı̂, ∞) connecting the dependent
vertices ı̂ = l − m + i for i = 1 . . . m to a common sink (+∞). A line edge
connects two line vertices k, i and j, h exactly when j = i. Now the ver-
tices rather than the edges are labelled by the values Cij . Without a formal
definition we display in Figure 6.6 the line graph corresponding to the ‘lion’
example defined in Figure 6.5. It is easy to check that line graphs of DAGs
(directed acyclic graphs) are also acyclic. They have certain special prop-
erties (Griewank and Naumann 2003a), which may, however, be lost for a
while by face elimination, as discussed below.
Bauer’s formula (6.1) may be rewritten in terms of the line graph as
∂yi /∂xj = Σ_{P∈[oj →di ]} Π_{⟨j̃,ĩ⟩⊂P} Cĩj̃ ,    (6.7)

[Figure 6.6 shows the line graph with vertices ⟨−∞, 1⟩, ⟨−∞, 2⟩, ⟨1, 3⟩, ⟨2, 3⟩, ⟨3, 4⟩, ⟨3, 8⟩, ⟨4, 5⟩, ⟨4, 6⟩, ⟨4, 7⟩, ⟨4, 8⟩, ⟨5, ∞⟩, ⟨6, ∞⟩, ⟨7, ∞⟩ and ⟨8, ∞⟩.]
Figure 6.6. Line graph for Naumann’s ‘lion’ example.

where [oj → di ] represents the set of all paths P connecting oj to di , and the ⟨j̃, ĩ⟩ range over all vertices belonging to path P in descending order. The elimination of a face ⟨k, j⟩, ⟨j, i⟩ with k ≠ −∞ and i ≠ +∞ can now proceed by introducing a new line vertex with the value Cij Cjk , and connecting it to all predecessors of ⟨k, j⟩ and all successors of ⟨j, i⟩. It is very easy to
see that this modification leaves (6.7) valid and again reduces the total sum
of the length of all maximal paths in the line graph. After the successive
elimination of all interior edges of the line graph whose vertices correspond
initially to edges in the computational graph, we arrive at a tripartite graph.
The latter can be interpreted as the line graph of a bipartite graph, whose
edge values are the vertex labels of the central layer of the tripartite graph.
Naumann (2001) has constructed an example where accumulation by face
elimination reduces the number of multiplications below the minimal count
achievable by edge elimination. It is believed that face elimination on the
scalar line graph is the most general procedure for accumulating Jacobians.
Conjecture 6.2. Given a scalar computational graph, consider a fixed
sequence of multiplications and additions to compute the Jacobian entries
∂yi /∂xj given by (6.1) and (6.7) for arbitrary real values of the Cij for j ≺ i.
Then the number of multiplications required is no smaller than that required
by some face elimination sequence on the computational graph, that is,
       
ACC{F ′(x)} = FAAC{F ′(x)} ≤ EACC{F ′(x)} ≤ VACC{F ′(x)},
where the inequalities hold strictly for certain example problems, and FAAC,
EACC denote the minimal operation count achievable by face and edge elim-
ination, respectively.
Proof. We prove the assertion under the additional assumptions that all
cij = Cij are scalar, and that the computational graph is absorption-free
in that any two vertices j and i are connected by at most one directed
path. The property is inherited by the line graph. Then the accumulation
procedure involves only multiplications, and must generate a sequence of

partial products of the form



cP ≡ Π_{(j,i)⊂P} cij ,    (6.8)

where P ⊂ P^c ⊂ G is a subset of a connected path P^c . Multiplying coefficients cij that are not contained in a common path makes no sense, since the
resulting products cannot occur in any one of the Jacobian entries. Now sup-
pose we modify the original accumulation procedure by keeping all products
in factored form, that is,

cPk = Π_{(j,i)∈Pk} cij for k = 0, 1, . . . , k̄,
where P = ∪_{k=0}^{k̄} Pk is the decomposition of P into its k̄ + 1 maximal con-
nected subcomponents. We show, by induction on the number of elements
in P, that this partitioned representation can be obtained by exactly k̄ fewer
multiplications than cP itself. The assertion is trivially true for all individ-
ual arcs, which form a single element path with k̄ = 0. Now suppose we
multiply two partial products cP and cP ′ with P ∩ P ′ = ∅. Then the new partial path
P ∪ P ′ = (∪_{k=0}^{k̄} Pk ) ∪ (∪_{k′=0}^{k̄′} P ′k′ )
must be partitioned again into maximal connected components. Before these
mergers we have k̄ + k̄′ + 2 subpaths, and, by induction hypothesis, k̄ + k̄′ + 1
saved multiplications. Whenever two subpaths Pk and Pk′ are adjacent we
merge them, expending one multiplication and eliminating one gap. Hence
the number of connected components minus one stays exactly the same as
the number of saved multiplications. In summary, we find that we may
rewrite the accumulation procedure such that all P successively occurring
in (6.8) are, in fact, connected paths. Furthermore, without loss of generality
we may assume that they are ordered according to their length. Now we will
show that each of them can be interpreted as a face elimination on a line
graph, whose initial structure was defined above. The first accumulation step
must be of the form cijk = cij cjk with k ≺ j and j ≺ i. This simplification
may be interpreted as a face elimination on the line graph, which stays
absorption-free. Hence we face the same problem as before, but for a graph
of somewhat reduced total length. Consequently, the assertion follows by
induction on the total length, i.e., the sum of the lengths of all maximal
paths.
According to the conjecture, accumulating Jacobians with minimal multi-
plication count requires the selection of an optimal face elimination sequence
of which there exists a truly enormous number, even for comparatively small

initial line graphs. Before attacking such a difficult combinatorial problem
we may pause to question whether the accumulation of Jacobians is really
such an inevitable task as it may seem at first. The answer is in general ‘no’.
To demonstrate this we make two rather simple but fundamental observa-
tions. Firstly, Newton steps may sometimes be calculated more cheaply by
a finite algorithm (of infinite precision) that does not involve accumulating
the corresponding square Jacobian. Secondly, the representation of the Ja-
cobian even as a sparse matrix, i.e., a rectangular array of numbers, may be
inappropriate, because it hides structure and wastes storage.

6.6. Jacobians with singular minors


Neither in Section 3 nor in Sections 6.4 and 6.5 have we obtained a definite
answer concerning the cost ratio
   
OPS{F (x), F ′(x)} / OPS(F (x)) ∈ [1, 3 min(n̂, m̂)].
Here n̂ ≤ n and m̂ ≤ m denote, as in Section 3.5, the maximal number of
nonzeros per row and column, respectively. A very simple dense example
for which the upper bound is attained up to the constant factor 6 is
F (x) = ½ (aTx)² b with a ∈ Rn , b ∈ Rm . (6.9)
Here we have OPS(F (x)) = n + m, but F ′(x) = b(aTx)aT ∈ Rm×n has in general mn distinct entries so that
OPS(F ′(x)) / OPS(F (x)) ≥ mn/(m + n) ≥ min(m, n)/2,
for absolutely any method of calculating the Jacobians as a rectangular array
of reals.
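The point is quickly verified numerically: the sketch below (arbitrary dimensions and random data) forms both the dense array and the factored representation b (aTx) aT; the latter needs only n + m + 1 numbers, multiplies by a vector with n + m + 1 multiplications, and makes the rank bound obvious.

import numpy as np

rng = np.random.default_rng(1)
n, m = 6, 4
a, b = rng.standard_normal(n), rng.standard_normal(m)
x, v = rng.standard_normal(n), rng.standard_normal(n)

z = a @ x                            # the single degree of freedom a^T x
J_dense = np.outer(b, a) * z         # m*n entries, structure hidden
print(np.linalg.matrix_rank(J_dense))                 # 1 (up to rounding)
print(np.allclose(J_dense @ v, b * (z * (a @ v))))    # True: n+m+1 mults suffice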
The troubling aspect of this example is that we can hardly imagine a
numerical purpose where it would make sense to accumulate the Jacobian –
quite the opposite. If we treat the inner product z = aTx = Σ_{i=1}^{n} ai xi as a single elemental operation, the computational graph of (6.9) takes for n = 3 and m = 2 the form sketched in Figure 6.7. Here z is also the derivative of ½z².
The elimination of the two intermediate nodes in Figure 6.7 yields an
m × n matrix with mn nonzero elements. Owing to rounding errors this
matrix will typically have full rank, whereas the unaccumulated representa-
tion F  (x) = b(aT x)aT reveals that rank(F  (x)) ≤ 1 for all x ∈ Rn . Thus
we conclude that important structural information can be lost during the
accumulation process. Even if we consider the n + m components of a and
b as free parameters, the set
Reach{F ′(x)} ≡ {F ′(x) : x, a ∈ Rn , b ∈ Rm } ⊂ Rm×n

[Figure 6.7 shows the independents 1, 2, 3 connected to the intermediate vertex computing z = aTx by edges labelled a1 , a2 , a3 , the edge labelled z from that vertex to the vertex for ½z², and the edges labelled b1 , b2 to the dependents 6, 7; vertex elimination accumulates this into the bipartite graph with entries z aj bi .]
Figure 6.7. Vertex accumulation of a Jacobian on a simple graph.

is only an (n + m − 1)-dimensional submanifold of the set of all real m × n
matrices. We can check without too much difficulty that, in this case,
Reach{F  (x)} is characterized uniquely as the set of m×n matrices A = (aij )
for which all 2 × 2 submatrices are singular, that is,
 
0 = det [ ai k  ai j ; ah k  ah j ]  for all 1 ≤ i < h ≤ m, 1 ≤ k < j ≤ n.

6.7. Generic rank


In the computational graph displayed in Figure 6.7, all 2 × 2 matrices corre-
spond to a subgraph with the structure obtained overall for n = 2 = m. It
is obvious that such matrices are singular since z = aT x is the single degree
of freedom into which all the independent variables x1 . . . xn are mapped.
More generally, we obtain the following fundamental result.
Proposition 6.3. Let the edge values {cij } of a certain linearized com-
putational graph be considered as free parameters. Then the rank of the
matrix A with elements
 
aij = Σ_{P∈[j→ı̂]} Π_{(̃,ı̃)∈P} cı̃ ̃ ,    1 ≤ i ≤ m, 1 ≤ j ≤ n,

is, for almost all combinations {cij }j≺i , equal to the size r of a maximal
match, that is, the largest number of disjoint paths connecting the roots to
the leaves of the graph.
Proof. Suppose we split all interior vertices into an input port and an
output port connected by an internal edge of capacity 1. Furthermore we
connect all minimal vertices by edges of unit capacity to a common source,
and analogously all maximal ones to a common sink. All other edges are
given infinite capacity. Then the solution of the max-flow (Tarjan 1983)
problem is integral and represents exactly a maximal match as defined above.
The corresponding min-cut solution consists of a set of internal edges of finite
and thus unit capacity, whose removal would split the modified graph into
two halves. The values vj at these split vertices form a vector z of reals that

depends differentiably on x and determines y differentiably. Hence we have
a decomposition
y = F (x) = H(z) with z = G(x),
so that
   
rank{F ′(x)} = rank{H ′(z) G ′(x)}
≤ max{rank H ′(z), rank G ′(x)} ≤ r ≡ dim(z).
To attain the upper bound, we simply set cij = 1 for every edge (j, i) in the
maximal match, all others being set to zero. Then the resulting matrix A
is a permutation of a diagonal matrix with the number of nonzero elements
being equal to its rank r. Because the determinant corresponding to the
permuted diagonal has the value 1 for the special choice of the edge values
given above, and it is a polynomial in the cij , it must in fact be nonzero for
almost all combinations of the cij . This completes the proof.
As we have seen in the proof, generic rank can be determined by a max flow
computation for which very efficient algorithms are now available (Tarjan
1983). There are other structural properties which, like rank, can be deduced
from the structure of the computational graph, such as the multiplicity
of the zero eigenvalue (Röbenack and Reinschke 2001). All this kind of
structural information is obliterated when we accumulate the Jacobian into
a rectangular array of numbers. Therefore, in the remainder of this section
we hope to motivate and initiate an investigation into the properties of
Jacobian graphs and their optimal representation.
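The vertex-splitting argument of the proof is straightforward to mechanize. Assuming the networkx package is available, the sketch below builds the unit-capacity network and returns the maximal number of vertex-disjoint paths, i.e. the generic rank, for an arbitrary DAG given as an edge list; the test graph reproduces the topology of Figure 6.7.

import networkx as nx

def generic_rank(edges, independents, dependents):
    """Maximal number of vertex-disjoint paths from independents to dependents.
    Each vertex v is split into (v,'in') -> (v,'out') with capacity 1;
    the original edges get a capacity that never binds."""
    G = nx.DiGraph()
    vertices = {v for e in edges for v in e} | set(independents) | set(dependents)
    for v in vertices:
        G.add_edge((v, 'in'), (v, 'out'), capacity=1)
    for j, i in edges:                       # edge j -> i of the DAG
        G.add_edge((j, 'out'), (i, 'in'), capacity=len(vertices))
    for j in independents:
        G.add_edge('source', (j, 'in'), capacity=1)
    for i in dependents:
        G.add_edge((i, 'out'), 'sink', capacity=1)
    value, _ = nx.maximum_flow(G, 'source', 'sink')
    return value

# the rank-one graph of Figure 6.7: three inputs, one bottleneck chain, two outputs
edges = [(1, 4), (2, 4), (3, 4), (4, 5), (5, 6), (5, 7)]
print(generic_rank(edges, [1, 2, 3], [6, 7]))   # 1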

6.8. Scarce Jacobians and their representation


A simple criterion for the loss of information in accumulating Jacobians
is a count on the number of degrees of freedom. In the graph of Figure 6.7
of the rank-one example (6.9) we have n + m + 1 free edge values whose
accumulation into dense m × n matrices cannot possibly reach all of Rm×n
when n > 1 < m. In some sense that is nothing special, since, for sufficiently
smooth F , the set of reachable Jacobians
{F ′(x) ∈ Rm×n : x ∈ Rn }
forms by its very definition an n-dimensional manifold embedded in the mn-dimensional linear space Rm×n .
ble as sparsity, where certain Jacobian entries are zero or otherwise constant.
Then the set of reachable Jacobians is in fact contained in an affine subspace
of Rm×n , which often has only a dimension of order O(n + m). It is well un-
derstood that such sparsity structure can be exploited for storing, factoring
and otherwise manipulating these matrices economically, especially when
the dimensions m and n are rather large. We contend here that a similar

effect can occur when the computational graph rather than the Jacobian is
sparse in a certain sense.
More specifically, we will call the computational graph and the resulting
Jacobian scarce if the matrices (∂yi /∂xj )j=1...n,i=1...m defined by Bauer’s for-
mula (6.7) for arbitrary values of the elemental partials Cı̃̃ do not range over
all of Rm×n . For the example considered in Section 6.7, we obtain exactly
the set of matrices whose rank equals one or zero, which is a smooth man-
ifold of dimension m + n − 1. Naturally, this manifold need not be regular
and, owing to its homogeneity, there are always bifurcations at the origin.
The difference between n m and the dimension of the manifold may be de-
fined as the degree of scarcity. The dimension of the manifold is bounded
above but generally not equal to the number of arcs in the computational
graph. The discrepancy is exactly two for the example above.
A particularly simple kind of scarcity is sparsity, where the number of
zeros yields the degree of sparsity, provided the other entries are free and
independent of each other. It may then still make sense to accumulate the
Jacobian, that is, represent it as set of partial derivatives. We contend that
this is not true for Jacobians that are scarce but not sparse, such as the
rank-one example.
In the light of Proposition 6.3 and our observation on the rank-one exam-
ple, we might think that the scarcity structure can always be characterized in
terms of a collection of vanishing subdeterminants. This conjecture is refuted
by the following example, which is also representative for a more significant
class of problems for which full accumulation seems a rather bad idea.
Suppose we have a 3 × 3 grid of values vj for j = 1 . . . 9, as displayed in
Figure 6.8, that are subject to two successive transformations. Each time
the new value is a function of the old values at the same place and its
neighbours to the west, south and southwest. At the southern and western
boundary we continue the dependence periodically so that the new value
of v4 , for example, depends on the old values of v4 , v3 , v1 , and v6 . This
dependency is displayed in Figure 6.8. Hence the 9 × 9 Jacobians of each
of the two transitions contain exactly four nonzero elements in each row.
Consequently we have 4 · 9 · 2 = 72 free edge values but 9 · 9 = 81 entries
in the product of the two matrices, which is the Jacobian of the final state
with respect to the initial state.
Since by following the arrows in Figure 6.8 we can get from any grid point
to any other point in two moves, the Jacobian is dense. Therefore the degree
of scarcity defined above is at least 81−72 = 9. However, the scarcity of this
Jacobian cannot be explained in terms of singular submatrices. Any such
square submatrix would be characterized by two sets Ji ⊂ {1, 2, . . . , 9} for
i = 0, 2 with J0 selecting the indices of columns (= independent variables)
and J2 the rows (= dependent variables) of the submatrix. It is then not
too difficult to find an intermediate set J1 such that each element in J1 is

[Figure 6.8 shows the 3 × 3 grid of values v1 , . . . , v9 with arrows indicating that each new value depends on the old values at the same place and at its western, southern and southwestern neighbours, continued periodically across the boundary.]
Figure 6.8. Update dependences on the upwinding example.

the middle of a directed path from an element in J0 to an element in J2 with
all these paths being disjoint. Then it follows from Proposition 6.3 that the
submatrix has generically full rank, and hence its determinant cannot vanish.
The example sketched in Figure 6.8 may be viewed as prototypical of up-
wind discretizations of a time-dependent PDE on a two-dimensional domain.
There the Jacobian of the final state with respect to the initial state is the
product of transformation matrices that are block-tridiagonal for reasonably
regular discretizations. If there are just enough time-steps such that infor-
mation can traverse the whole domain, the Jacobian will be dense but still
scarce, as shown in Section 8.3 of Griewank (2000). It was also observed
in that book that the greedy Markowitz heuristic has significantly worse
performance than forward and reverse. We may view this as yet another
indication that accumulating Jacobians may not be such a good idea in the
first place.
That leaves the question of determining what is the best representation
of scarce Jacobians whose full accumulation is redundant by definition of
scarcity. Suppose we have evaluated all local partials cij and are told that a
rather large number of Jacobian–vector or vector–Jacobian products must
be computed subsequently. This requires execution of the forward loop

v̇i = Σ_{j≺i} cij ∗ v̇j for i = n + 1 . . . l    (6.10)

or the (nonincremental) reverse loop



v̄j = Σ_{i≻j} v̄i ∗ cij for j = l − m . . . 1,    (6.11)

which follow from Tables 3.1 and 3.2, provided all vi are scalar and disjoint.
In either loop there is exactly one multiplication associated with each arc
value cij , so that the number of arcs is an exact measure for the cost of com-
puting Jacobian–vector products ẏ = F ′(x)ẋ and vector–Jacobian products
x̄ = ȳF ′(x).
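Transcribed for scalar vertices, the two loops look as follows; the graph dictionary and the numbering follow the earlier sketches of the Figure 6.2 topology, with independents v1, v2 and dependents v6, v7, and the edge values are again arbitrary.

def jac_vec(c, order, n, m, l, xdot):
    """Forward loop (6.10): one multiplication per arc value c[i][j]."""
    vdot = {j + 1: xdot[j] for j in range(n)}
    for i in order:                              # intermediate and dependent vertices
        vdot[i] = sum(cij * vdot[j] for j, cij in c[i].items())
    return [vdot[l - m + i] for i in range(1, m + 1)]

def vec_jac(c, order, n, m, l, ybar):
    """Reverse propagation of (6.11), written incrementally: again one
    multiplication per arc value."""
    vbar = {i: 0.0 for i in range(1, l + 1)}
    for i in range(1, m + 1):
        vbar[l - m + i] = ybar[i - 1]
    for i in reversed(order):
        for j, cij in c[i].items():
            vbar[j] += vbar[i] * cij
    return [vbar[j + 1] for j in range(n)]

# Figure 6.2 example: n = 2, m = 2, l = 7, intermediate/dependent vertices 3..7
c = {3: {1: 0.5, 2: 2.0}, 4: {3: -1.0}, 5: {4: 3.0},
     6: {1: 0.1, 5: 1.5}, 7: {2: 0.2, 5: -0.7}}
print(jac_vec(c, range(3, 8), 2, 2, 7, [1.0, 0.0]))   # first Jacobian column
print(vec_jac(c, range(3, 8), 2, 2, 7, [1.0, 0.0]))   # first Jacobian row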
In order to minimize that cost we may perform certain preaccumulations
by eliminating edges or whole vertices. For example, in Figure 6.7 we can
eliminate the central arc either at the front or at the back, and thus reduce
the number of edge values from n+m+1 to n+m without changing the reach
of the Jacobian. This elimination requires here min(n, m) multiplications
and will therefore be worthwhile even if only one Jacobian–vector product is
to be computed subsequently. Further vertex, edge, or even face eliminations
would make the size of the edge set grow again, and cause a loss of scarcity
in the sense discussed above. The only further simplification respecting the
Jacobian structure would be to normalize any one of the edge values to 1
and thus save additional multiplications during subsequent vector product
calculations. Unfortunately, at least this last simplification is clearly not
unique and it would appear that there is in general no unique minimal
representation of a computational graph. As might have been expected,
there is a rather close connection between the avoidance of fill-in and the
maintenance of scarcity.
Proposition 6.4.
(i) If the front- or back-elimination of an edge in a computational graph
does not increase the total number of edges, then the degree of scarcity
remains constant.
(ii) If the elimination of a vertex would lead to a reduction in the total
number of edges, then at least one of its edges can be eliminated via (i)
without loss of scarcity.

Proof. Without loss of generality we may consider back-elimination of an
edge (j, i), as depicted in Figure 6.9. The no fill-in condition requires that all
predecessors of j but possibly one, say k0 , are already directly connected
to i . Now we have to show that the values of the new arcs after the
elimination can be chosen such that they reproduce any possible sensitivities
in the subgraph displayed in Figure 6.9. Since the arc from j to h and
possibly other successors of j remain unchanged, we must also keep all arcs
cjkt for t = 0 . . . s unaffected. If also k0 ≺ i there is no difficulty at all, as we
may scale cij to one without loss of generality so that the new arc values are
given by c̃kt = cikt + cjkt , which is obviously reversible. Otherwise we have
c̃ik0 = cij ∗ cjk0 and c̃ikt = cikt + cij ∗ cjkt for t = 1 . . . s.

[Figure 6.9 shows the back-elimination of the edge (j, i): the edges from the predecessors k0 , k1 , . . . , ks of j into i receive the updated values c̃ikt , while the edges cjkt and the edge from j to its successor h remain unchanged.]
Figure 6.9. Scarcity-preserving back-elimination of (j, i).

[Figure 6.10 shows a subgraph with predecessors k0 , k1 , k2 , a central vertex j and successors i0 , i1 , i2 , before and after the elimination of the edges (j, i2 ) and (k2 , j).]
Figure 6.10. Scarcity-preserving elimination of (j, i2 ) and (k2 , j).

Since we may assume without loss of generality that cjk0 ≠ 0, this rela-
tion may be reversed so that the degree of scarcity is indeed maintained as
asserted.
To prove the second assertion, let us assume that the vertex j to be
eliminated has (p+1) predecessors and (s+1) successors, which span together
with j a subgraph of p+s+3 vertices. After the elimination of j they form
a dense bipartite graph with (s + 1)(p + 1) edges. Because of the negative
fill-in, the number of direct arcs between the predecessors and successors of
j is at least

(s + 1)(p + 1) − (s + 1 + p + 1) + 1 = s p.
From this it can easily be shown by contradiction that at least one prede-
cessor of j is connected to all but possibly one of its successors, so that its
link to j can be front-eliminated.
The application of the proposition to the 3 × 3 subgraph depicted in
Figure 6.10 shows that there are 6 + 5 edges in the original graph on the left
which would be reduced to 9 by the elimination of the central node.

However, this operation would destroy the property that the Jacobian
of the vertices i0 and i1 with respect to k0 and k1 is always singular by
Proposition 6.3. Instead we should front-eliminate (k2 , j) and back-eliminate
(j, i2 ), which reduces the number of arcs also to 9, while maintaining the
scarcity.
Obviously Proposition 6.4 may be applied repeatedly until any additional
edge elimination would increase the total edge count and no additional ver-
tex elimination could reduce it further. However, there are examples where
the resulting representation is still not minimal, and other, more global,
scarcity-preserving simplifications may be applied. Also the issue of nor-
malization, that is, the question of which arc values can be set to 1 to save
extra multiplications, appears to be completely open. Hence we cannot yet
set up an optimal Jacobian representation for the seemingly simple, and
certainly practical, task of completing a large number of Jacobian–vector or
vector–Jacobian products with the minimal number of multiplications.

6.9. Section summary


For vector functions defined by an evaluation procedure, each partial deriva-
tive ∂yi /∂xj of a dependent variable yi , with respect to an independent vari-
able xj , can be written down explicitly as a polynomial of elemental partial
derivatives cı̃̃ . However, naively applying this formula of Bauer is grossly
inefficient, even for obtaining one ∂yi /∂xj , let alone for a simultaneous eval-
uation of all of them. The identification of common subexpressions can
be interpreted as vertex and edge elimination on the computational graph
or face elimination on the associated line graph. The latter approach is be-
lieved to be the most general, and thus potentially most efficient, procedure.
In contrast to vertex and edge elimination, face elimination can apparently
not be interpreted in terms of linear algebra operations on the extended
Jacobian (6.2). In any case, heuristics to determine good, if suboptimal,
elimination sequences remain to be developed.
The accumulation of Jacobians as rectangular arrays of numbers appears
to be a bad idea when they are scarce without being sparse, as defined in
Section 6.8. Then there is a certain intrinsic relation between the matrix
entries which is lost by accumulation. Instead we should strive to simplify
the computational graph by suitable eliminations until a minimal represen-
tation is reached. This would, for example, be useful for the subsequent
evaluation of Jacobian–vector or vector–Jacobian products with minimal
costs. More importantly, structural properties like the generic rank should
be respected. Thus we face the challenge of developing a theory of canonical
representations of linear computational graphs, including in particular their
composition.

Acknowledgements
The author is greatly indebted to many collaborators and co-authors, as well
as the AD community as a whole. To name some would mean to disregard
others, and quite a few are likely to disagree with my view of the field anyway.
The initial typesetting of the manuscript was done by Sigrid Eckstein, who
took particular care in preparing the many tables and figures.
