The Art of Differentiating Computer Programs

An Introduction to Algorithmic Differentiation
The SIAM series on Software, Environments, and Tools focuses on the practical implementation of
computational methods and the high performance aspects of scientific computation by emphasizing
in-demand software, computing environments, and tools for computing. Software technology development
issues such as current status, applications and algorithms, mathematical software, software tools, languages
and compilers, computing environments, and visualization are presented.
Editor-in-Chief
Jack J. Dongarra
University of Tennessee and Oak Ridge National Laboratory
Editorial Board
James W. Demmel, University of California, Berkeley
Dennis Gannon, Indiana University
Eric Grosse, AT&T Bell Laboratories
Jorge J. Moré, Argonne National Laboratory
Uwe Naumann
RWTH Aachen University
Aachen, Germany
Partial royalties from the sale of this book are placed in a fund to help
students attend SIAM meetings and other SIAM-related activities. This fund
is administered by SIAM, and qualified individuals are encouraged to write
directly to SIAM for guidelines.
Contents

Preface
Acknowledgments
Optimality
Chapters 1 through 5
Appendix C  Hints on the Exercises
    C.2  Chapter 2: Exercises 2.4.1 to 2.4.5
    C.3  Chapter 3: Exercises 3.5.1 to 3.5.3
    C.4  Chapter 4: Exercises 4.7.1 to 4.7.4
Bibliography
Index
Preface
How sensitive are the values of the outputs of my computer program with respect
to changes in the values of the inputs? How sensitive are these first-order sensitivities
with respect to changes in the values of the inputs? How sensitive are the second-order
sensitivities with respect to changes in the values of the inputs? . . .
Computational scientists, engineers, and economists as well as quantitative analysts
in computational finance tend to ask these questions on a regular basis. They write computer
programs in order to simulate diverse real-world phenomena. The underlying mathematical models often depend on a possibly large number of (typically unknown or uncertain)
parameters. Values for the corresponding inputs of the numerical simulation programs can,
for example, be the result of (typically error-prone) observations and measurements. If
very small perturbations in these uncertain values yield large changes in the values of the
outputs, then the feasibility of the entire simulation becomes questionable. Nobody should
make decisions based on such highly uncertain data.
Quantitative information about the extent of this uncertainty is crucial. First- and
higher-order sensitivities of outputs of numerical simulation programs with respect to their
inputs (also first and higher derivatives) form the basis for various approximations of uncertainty. They are also crucial ingredients of a large number of numerical algorithms ranging
from the solution of (systems of) nonlinear equations to optimization under constraints
given as (systems of) partial differential equations. This book describes a set of techniques
for modifying the semantics of numerical simulation programs such that the desired first
and higher derivatives can be computed accurately and efficiently. Computer programs implement algorithms. Consequently, the subject is known as Algorithmic (also Automatic)
Differentiation (AD).
AD provides two fundamental modes. In forward mode, a tangent-linear version of
the original program is built. The sensitivities of all outputs of the program with respect
to its inputs can be computed at a computational cost that is proportional to the number of
inputs. The computational complexity is similar to that of finite difference approximation.
At the same time, the desired derivatives are computed with machine accuracy. Truncation
is avoided.
Reverse mode yields an adjoint program that can be used to perform the same task at a
computational cost that is proportional to the number of outputs. For example, in large-scale
nonlinear optimization a scalar objective that is returned by the given computer program
can depend on a very large number of input parameters. The adjoint program allows for
the computation of the gradient (the first-order sensitivities of the objective with respect to
all parameters) at a small constant multiple R (typically between 3 and 30) of the cost of
running the original program. It outperforms gradient accumulation routines that are based
on finite differences or on tangent-linear code as soon as the size of the gradient exceeds R.
The ratio R plays a very prominent role in the evaluation of the quality of derivative code.
It will reappear several times in this book.
The generation of tangent-linear and adjoint code is the main topic of this introduction to The Art of Differentiating Computer Programs by AD. Repeated applications of
forward and reverse modes yield higher-order tangent-linear and adjoint code. Two ways
of implementing AD are presented. Derivative code compilers take a source transformation approach in order to realize the semantic modification. Alternatively, run time support
libraries can be developed that use operator and function overloading based on a redefined
floating-point data type to propagate tangent-linear as well as adjoint sensitivities. Note that
AD differentiates what you implement!1
Many successful applications of AD are described in the proceedings of five international conferences [10, 11, 13, 18, 19]. The standard book on the subject by Griewank and
Walther [36] covers a wide range of basic, as well as advanced, topics in AD. Our focus is
different. We aim to present a textbook style introduction to AD for undergraduate and graduate students as well as for practitioners in computational science, engineering, economics,
and finance. The material was developed to support courses on Computational Differentiation and Derivative Code Compilers for students of Computational Engineering
Science, Mathematics, and Computer Science at RWTH Aachen University. Project-style
exercises come with detailed hints on possible solutions. All software is provided as open
source. In particular, we present a fully functional derivative code compiler (dcc) for a
(very) limited subset of C/C++. It can be used to generate tangent-linear and adjoint code of
arbitrary order by reapplication to its own output. Our run time support library dco provides
a better language coverage at the expense of less efficient derivative code. It uses operator
and function overloading in C++. Both tools form the basis for the ongoing development
of production versions that are actively used in a number of collaborative projects among
scientists and engineers from various application areas.
Except for relatively simple cases, the differentiation of computer programs is not
automatic despite the existence of many reasonably mature AD software packages.2 To
reveal their full power, AD solutions need to be integrated into existing numerical simulation
software. Targeted application of AD tools and intervention by educated users is crucial.
We expect AD to become truly automatic at some time in the (distant) future. In
particular, the automatic generation of optimal (in terms of robustness and efficiency) adjoint
versions of large-scale simulation code is one of the great open challenges in the field of
High-Performance Scientific Computing. With this book, we hope to contribute to a better
understanding of AD by a wider range of potential users of this technology. Combine it
with the book of Griewank and Walther [36] for a comprehensive introduction to the state
of the art in the field.
There are several reasonable paths through this book that depend on your specific
interests. Chapter 1 motivates the use of differentiated computer programs in the context of
methods for the solution of systems of nonlinear equations and for nonlinear programming.
The drawbacks of closed-form symbolic differentiation and finite difference approximations are discussed, and the superiority of adjoint over tangent-linear code is shown if the
1 Which occasionally differs from what you think you implement!
2 See www.autodiff.org.
number of inputs exceeds the number of outputs significantly. The generation of tangent-linear and adjoint code by forward and reverse mode AD is the subject of Chapter 2. If
you are a potential user of first-order AD exclusively, then you may proceed immediately
to the relevant sections of Chapter 5, covering the use of dcc for the generation of first
derivative code. Otherwise, read Chapter 3 to find out more about the generation of second- or higher-order tangent-linear and adjoint code. The remaining sections in Chapter 5 illustrate the use of dcc for the partial automation of the corresponding source transformation.
Prospective developers of derivative code compilers should not skip Chapter 4. There, we
relate well-known material from compiler construction to the task of differentiating computer programs. The scanner and parser generators flex and bison are used to build a
compiler front-end that is suitable for both single- and multipass compilation of derivative
code. Further relevant material, including hints on the solutions for all exercises, is collected
in the Appendix.
The supplementary website for this book, http://www.siam.org/se22, contains sources
of all software discussed in the book, further exercises and comments on their solutions
(growing over the coming years), links to further sites on AD, and errata.
In practice, the programming language that is used for the implementation of the original program accounts for many of the problems to be addressed by users of AD technology.
Each language deserves to be covered by a separate book. The given computing infrastructure (hardware, native compilers, concurrency/parallelism, external libraries, handling data,
i/o, etc.) and software policies (level of robustness and safety, version management) may
complicate things even further. Nevertheless, AD is actively used in many large projects,
each of them posing specific challenges. The collection of these issues and their structured presentation in the form of a book can probably only be achieved by a group of AD
practitioners and is clearly beyond the scope of this introduction.
Let us conclude these opening remarks with comments on the book's title, which
might sound vaguely familiar. While its scope is obviously much narrower than that of the
classic by Knuth [45], the application of AD to computer programs still deserves to be called
an art. Educated users are crucial prerequisites for robust and efficient AD solutions in the
context of large-scale numerical simulation programs. In AD details really do matter.3
With this book, we hope to set the stage for many more artists to enter this exciting field.
Uwe Naumann
July 2011
Acknowledgments
I could probably fill a few pages acknowledging the exceptional role of my family in my
(professional) life including the evolution of this book. You know what I am talking about.
My wife Ines, who is by far the better artist, has helped me with the cover art. My own
attempts never resulted in a drawing that adequately reflects the intrinsic joy of differentiating
computer programs. I did contribute the code fragment, though.
I am grateful to Markus Beckers, Michael Förster, Boris Gendler, Johannes Lotz,
Andrew Lyons, Viktor Mosenkis, Jan Riehme, Niloofar Safiran, Michel Schanen, Ebadollah
Varnik, and Claudia Yastremiz for (repeatedly) proofreading the manuscript. Any remaining
shortcomings should be blamed on them.
Three anonymous referees provided valuable feedback on various versions of the
manuscript.
Last, but not least, I would like to thank my former Ph.D. supervisor Andreas Griewank
for seeding my interest in AD. As a pioneer in this field, he has always been keen on
promoting AD technology by using various techniques. One of them is music.
Optimality
4 Originally presented in Nice, France on April 8, 2010 at the Algorithmic Differentiation, Optimization, and Beyond meeting in honor of Andreas Griewank's 60th birthday.
A few hours later my talk's getting rude.
The sole thing descending seems to be my mood.
How can guessing the Hessian only take this much time?
N squared function runs appear to be the crime.
The facts support this thesis, and I wonder . . .

Isolation due to KKT.
Isolation, why not simply drop feasibility?

The guy next door's been sayin' again and again:
An adjoint Lagrangian might relieve my pain.
Though I don't quite believe him, I surrender.

I wonder how but I still give it a try.
Gradients and Hessians in the blink of an eye.
Still all I'd like to see is simply optimality.

Epsilon itself has finally disappeared.
Reverse mode AD works, no matter how weird,
and I'm about to see local optimality.
Yes, I wonder, I wonder . . .

I wonder how but I still give it a try.
Gradients and Hessians in the blink of an eye.
Still all I'd like to see . . .
I really need to see . . .
now I can finally see my cherished optimality :-)
Chapter 1
The computation of derivatives plays a central role in many numerical algorithms. First- and higher-order sensitivities of selected outputs of numerical simulation programs with
respect to certain inputs as well as projections of the corresponding derivative tensors may
be required. Often the computational effort of these algorithms is dominated by the run time
and the memory requirement of the computation of derivatives. Their accuracy may have a
dramatic effect on both convergence behavior and run time of the underlying iterative numerical schemes. We illustrate these claims with simple case studies in Section 1.1, namely
the solution of systems of nonlinear equations using the Newton method in Section 1.1.1
and basic first- and second-order nonlinear programming algorithms in Section 1.1.2. The
use of derivatives with numerical libraries is demonstrated in the context of the NAG Numerical Library as a prominent representative for a number of similar commercial and
noncommercial numerical software tools in Section 1.1.3.
This first chapter aims to set the stage for the following discussion of Algorithmic Differentiation (AD) for the accurate and efficient computation of first and higher derivatives.
Traditionally, numerical differentiation has been performed manually, possibly supported
by symbolic differentiation capabilities of modern computer algebra systems, or derivatives
have been approximated by finite difference quotients. Neither approach turns out to be a
serious competitor for AD. Manual differentiation is tedious and error-prone, while finite
differences are often highly inefficient and potentially inaccurate. These two techniques are
discussed briefly in Section 1.2 and Section 1.3, respectively.
1.1 Motivation
Numerical simulation enables computational scientists and engineers to study the behavior
of various kinds of real-world systems in ways that are impossible (or at least extremely
difficult) in reality. The quality of the results depends largely on the quality of the underlying
mathematical model F : R^n → R^m. Computer programs are developed to simulate the functional dependence of one or more objectives y ∈ R^m on a potentially very large number of parameters x ∈ R^n. For a given set of input parameters, the corresponding values of the objectives can be obtained by a single run of the simulation program as y = F(x). This
simulation of the studied real-world system can be extremely useful. However, it leaves
various questions unanswered.
One of the simpler open questions is about sensitivities of the objective with respect
to the input parameters with the goal of quantifying the change in the objective values for
slight (infinitesimal) changes in the parameters. Suppose that the values of the parameters
are defined through measurements within the simulated system (for example, an ocean or
the atmosphere). The accuracy of such measurements is less important for small sensitivities
as inaccuracies will not translate into significant variations of the objectives. Large sensitivities, however, indicate critical parameters whose inaccurate measurement may yield
dramatically different results. More accurate measuring devices are likely to be more costly.
Even worse, an adequate measuring strategy may be infeasible due to excessive run time or
other hard constraints. Mathematical remodeling may turn out to be the only solution.
Sensitivity analysis is one of many areas requiring the Jacobian matrix of F,
$$\nabla F = \nabla F(x) \equiv \left( \frac{\partial y_j}{\partial x_i} \right)_{\substack{j=0,\dots,m-1 \\ i=0,\dots,n-1}},$$
whose rows contain the sensitivities of the outputs y_j, j = 0, ..., m−1, of the numerical simulation y = F(x) with respect to the input parameters x_i, i = 0, ..., n−1. Higher
derivative tensors, including the Hessian of F,
$$\nabla^2 F = \nabla^2 F(x) \equiv \left( \frac{\partial^2 y_j}{\partial x_i \, \partial x_k} \right)_{\substack{j=0,\dots,m-1 \\ i,k=0,\dots,n-1}},$$
are used in corresponding higher-order methods. This book is based on C/C++ as the underlying programming language; hence, vectors are indexed starting from zero instead of one.
In the following, highlighted terminology is used without definition. Formal explanations
are given in the subsequent chapters.
1.1.1 Systems of Nonlinear Equations

Algorithm 1.1 Newton algorithm for solving the nonlinear system F(x) = 0.
In:
  implementation of the residual y at the current point x ∈ R^n:
    F : R^n → R^n, y = F(x)
  implementation of the Jacobian A ≡ ∇F(x) of the residual at the current point x:
    ∇F : R^n → R^{n×n}, A = ∇F(x)
  solver for computing the Newton step dx ∈ R^n as the solution of the linear Newton system A · dx = −y:
    s : R^n × R^{n×n} → R^n, dx = s(y, A)
  starting point: x ∈ R^n
  upper bound on the norm ‖F(x)‖ of the residual at the approximate solution: ε ∈ R
Out:
  approximate solution of the nonlinear system F(x) = 0: x ∈ R^n
Algorithm:
  1: y = F(x)
  2: while ‖y‖ > ε do
  3:   A = ∇F(x)
  4:   dx = s(y, A)
  5:   x ← x + dx
  6:   y = F(x)
  7: end while
The iteration terminates as soon as the norm of the residual falls below a given bound. Refer to [44] for further details on the Newton algorithm. A basic version without error handling is shown in Algorithm 1.1. Convergence is assumed.
The computational cost of Algorithm 1.1 is dominated by the accumulation of the Jacobian A ≡ ∇F in line 3 and by the solution of the Newton system in line 4. The quality of the Newton step dx depends on the accuracy of the Jacobian ∇F. Traditionally, ∇F is
approximated using finite difference quotients as shown in Algorithm 1.2, where e^i denotes the ith Cartesian basis vector in R^n, that is,
$$e^i = \left( e^i_j \right)_{j=0,\dots,n-1} \quad \text{with} \quad e^i_j = \begin{cases} 1 & i = j, \\ 0 & \text{otherwise}. \end{cases}$$
The value of the residual is computed at no extra cost when using forward or backward
finite differences, to be discussed in further detail in Section 1.3.
In Algorithm 1.2, a single evaluation of the residual at the current point x in line 1 is succeeded by n evaluations at perturbed points in line 4. The components of x are perturbed individually in line 3. The columns of the Jacobian are approximated separately in lines 5-7, yielding a computational cost of O(n) · Cost(F), where Cost(F) denotes the cost of a single
Algorithm 1.2 Jacobian accumulation by (forward) finite differences in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.
In:
  implementation of the residual y = (y_k)_{k=0,...,n−1} at the current point x ∈ R^n:
    F : R^n → R^n, y = F(x)
  current point: x ∈ R^n
  perturbation: δ ∈ R
Out:
  residual at the current point: y = F(x) ∈ R^n
  approximate Jacobian of the residual at the current point:
    A = (a_{k,i})_{k,i=0,...,n−1} ≈ ∇F(x) ∈ R^{n×n}
Algorithm:
  1: y = F(x)
  2: for i = 0 to n − 1 do
  3:   x̃ ← x + δ · e^i
  4:   ỹ = F(x̃)
  5:   for k = 0 to n − 1 do
  6:     a_{k,i} ← (ỹ_k − y_k)/δ
  7:   end for
  8: end for
evaluation of the residual. Refer to Section 1.3 for further details on finite differences as
well as on alternative approaches to their implementation. Potential sparsity of the Jacobian
should be exploited to reduce the computational effort as discussed in Section 2.1.3.
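To make Algorithm 1.2 concrete, the following is a minimal C++ sketch. The assumed residual signature void F(int n, const double* x, double* y), the routine name fd_jacobian, and the restore of the perturbed component are illustrative assumptions, not the book's code.

#include <vector>

// A minimal sketch of Algorithm 1.2: Jacobian approximation by forward
// finite differences (illustrative, not the book's code).
void fd_jacobian(int n, void (*F)(int, const double*, double*),
                 double* x, double* y, double* A, double delta) {
  std::vector<double> y_tilde(n);
  F(n, x, y);                          // line 1: residual at the current point
  for (int i = 0; i < n; i++) {        // line 2
    double xi = x[i];
    x[i] += delta;                     // line 3: perturb the ith component
    F(n, x, y_tilde.data());           // line 4: residual at the perturbed point
    for (int k = 0; k < n; k++)        // lines 5-7: ith column of A (row-major)
      A[k * n + i] = (y_tilde[k] - y[k]) / delta;
    x[i] = xi;                         // undo the perturbation
  }
}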
The inherent inaccuracy of the approximation of the Jacobian by finite differences may
have a negative impact on the convergence of the Newton algorithm. Exact (up to machine
accuracy) Jacobians can be computed by the tangent-linear mode of AD as described in
Section 2.1. The corresponding part of the Newton algorithm is replaced by Algorithm 1.3.
Columns of the Jacobian are computed in line 3 after setting x^(1) equal to the corresponding Cartesian basis vector in line 2. The value of the residual y is computed in line 3 by the given implementation of the tangent-linear residual F^(1) at almost no extra cost. Details on the construction of F^(1) are discussed in Chapter 2. The superscript (1) is used to denote first-order tangent-linear versions of functions and variables. This notation will prove advantageous for generalization in the context of higher derivatives in Chapter 3.
Example 1.1 For y = F(x) defined as
$$y_0 = 4 \, x_0 \, (x_0^2 + x_1^2), \qquad y_1 = 4 \, x_1 \, (x_0^2 + x_1^2)$$
Algorithm 1.3 Jacobian accumulation by tangent-linear mode AD in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.
  1: for i = 0 to n − 1 do
  2:   x^(1) ← e^i
  3:   (y, y^(1)) = F^(1)(x, x^(1))
  4:   for k = 0 to n − 1 do
  5:     a_{k,i} ← y_k^(1)
  6:   end for
  7: end for
and starting from x^T = (1, 1), a total of 21 Newton iterations are performed by the code in Section C.1.2 to drive the norm of the residual ‖y‖ below 10^{-9}:

  1    x_0 = 0.666667      x_1 = 0.666667      ‖y‖ = 11.3137
  2    x_0 = 0.444444      x_1 = 0.444444      ‖y‖ = 3.35221
  3    x_0 = 0.296296      x_1 = 0.296296      ‖y‖ = 0.993247
  ...
  20   x_0 = 0.000300729   x_1 = 0.000300729   ‖y‖ = 1.03849e-09
  21   x_0 = 0.000200486   x_1 = 0.000200486   ‖y‖ = 3.07701e-10
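To make the tangent-linear model concrete, here is a hand-derived sketch of F^(1) for Example 1.1. The routine name t1_f mirrors the naming used for the tangent-linear residual in Table 1.1 below; the signature is an illustrative assumption rather than dcc output.

// Hand-derived tangent-linear residual for Example 1.1: computes
// y = F(x) and t1_y = dF(x) * t1_x in one sweep (a sketch, not dcc output).
void t1_f(const double* x, const double* t1_x, double* y, double* t1_y) {
  double s    = x[0] * x[0] + x[1] * x[1];               // s = x0^2 + x1^2
  double t1_s = 2 * x[0] * t1_x[0] + 2 * x[1] * t1_x[1]; // tangent of s
  y[0]    = 4 * x[0] * s;
  t1_y[0] = 4 * t1_x[0] * s + 4 * x[0] * t1_s;           // product rule
  y[1]    = 4 * x[1] * s;
  t1_y[1] = 4 * t1_x[1] * s + 4 * x[1] * t1_s;
}

Seeding t1_x with the two Cartesian basis vectors as in Algorithm 1.3 yields the two columns of the Jacobian with machine accuracy.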
The solution of the linear Newton system by a direct method (such as Gaussian LU factorization or Cholesky LL^T factorization if the Jacobian is symmetric positive definite as in the previous example) is an O(n^3) algorithm. Hence, the overall cost of the Newton method is dominated by the solution of the linear system in addition to the accumulation of the Jacobian. The computational complexity of the direct linear solver can be decreased by exploiting possible sparsity of the Jacobian [24].
Alternatively, iterative solvers can be used to approximate the Newton step. Matrix-free implementations of Krylov subspace methods avoid the accumulation of the full Jacobian. Consider, for example, the Conjugate Gradient (CG) algorithm [39] in Algorithm 1.4
Algorithm 1.4 Matrix-free CG algorithm for computing the Newton step in the context of the Newton algorithm for solving the nonlinear system F(x) = 0.
In:
  implementation of the tangent-linear residual F^(1) for computing the residual y ← F(x) and its directional derivative y^(1) ← ∇F(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n at the current point x ∈ R^n:
    F^(1) : R^n × R^n → R^n × R^n, (y, y^(1)) = F^(1)(x, x^(1))
  starting point for the Newton step: dx ≡ x^(1) ∈ R^n
  upper bound on the norm of the residual −y − ∇F(x) · dx at the approximate solution for the Newton step: ε ∈ R
Out:
  approximate solution for the Newton step: dx ∈ R^n
Algorithm:
  1: x^(1) ← dx
  2: (y, y^(1)) ← F^(1)(x, x^(1))
  3: p ← −y − y^(1)
  4: r ← p
  5: while ‖r‖ ≥ ε do
  6:   x^(1) ← p
  7:   (y, y^(1)) ← F^(1)(x, x^(1))
  8:   α ← r^T r / (p^T y^(1))
  9:   dx ← dx + α · p
  10:  r_prev ← r
  11:  r ← r − α · y^(1)
  12:  β ← r^T r / (r_prev^T r_prev)
  13:  p ← r + β · p
  14: end while
for symmetric positive definite systems. It aims to drive the norm of the residual −y − ∇F · dx toward zero during each Newton iteration. Note that only the function value (line 3) and projections of the Jacobian (Jacobian-vector products in lines 3, 8, and 11) are required. These directional derivatives are delivered efficiently by a single run of the implementation of the tangent-linear residual F^(1) in lines 2 and 7, respectively. The exact solution is obtained in infinite precision arithmetic after a total of n steps. Approximations of the solution in floating-point arithmetic can often be obtained much sooner provided that a suitable preconditioner is available [58]. For notational simplicity, Algorithm 1.4 assumes that no preconditioner is required.
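In a matrix-free setting, the only Jacobian access Algorithm 1.4 needs is the product ∇F(x) · v. With the hand-derived t1_f sketched above for Example 1.1, such a product costs a single tangent-linear run; the wrapper below is an illustrative sketch.

// Jacobian-vector product for Example 1.1 (n = 2) via one run of the
// t1_f sketch above: Av = dF(x) * v, with y = F(x) as a by-product.
void jacobian_vector_product(const double* x, const double* v,
                             double* y, double* Av) {
  t1_f(x, v, y, Av);
}

Seeding v with p delivers the products required in lines 3, 8, and 11 of Algorithm 1.4 without ever forming A.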
Example 1.2 Solutions of systems of nonlinear equations play an important role in various fields of Computational Science and Engineering. They result, for example, from the discretization of nonlinear partial differential equations such as the solid fuel ignition (SFI) problem
$$\frac{\partial^2 y}{\partial x_0^2} + \frac{\partial^2 y}{\partial x_1^2} = e^y \qquad (1.1)$$
on the unit square. Discretization replaces the differential operators with a set of algebraic equations, thus transforming (1.1) into a system of nonlinear equations that can be solved by Algorithm 1.1.
Let (1.1) be discretized using central finite differences with step size h = 1/s in both the x_0 and x_1 directions. The second derivative with respect to x_0 at some point (x_0^i, x_1^j) (for example, (x_0^2, x_1^2) in Figure 1.1, where s = 4) is approximated based on the finite difference approximation of the first derivative with respect to x_0 at points a = (x_0^i − h/2, x_1^j) and b = (x_0^i + h/2, x_1^j). Similarly, the second derivative with respect to x_1 at the same point
[Figure 1.1: Central finite difference discretization of the unit square with step size h = 1/s (shown for s = 4); grid points are labeled (i, j), and the points a, b, c, d mark the intermediate locations used to approximate the second derivatives at (x_0^2, x_1^2).]
is approximated based on the finite difference approximation of the first derivative with respect to x_1 at points c = (x_0^i, x_1^j − h/2) and d = (x_0^i, x_1^j + h/2). As the result of
$$\frac{\partial y(x_0,x_1)}{\partial x_0}(a) \approx \frac{y_{i,j} - y_{i-1,j}}{h} \quad \text{and} \quad \frac{\partial y(x_0,x_1)}{\partial x_0}(b) \approx \frac{y_{i+1,j} - y_{i,j}}{h},$$
we have
$$\frac{\partial^2 y(x_0,x_1)}{\partial x_0^2}(x_0^i, x_1^j) \approx \frac{\frac{\partial y(x_0,x_1)}{\partial x_0}(b) - \frac{\partial y(x_0,x_1)}{\partial x_0}(a)}{h} = \frac{y_{i+1,j} - 2\,y_{i,j} + y_{i-1,j}}{h^2}.$$
Similarly,
$$\frac{\partial^2 y(x_0,x_1)}{\partial x_1^2}(x_0^i, x_1^j) \approx \frac{y_{i,j+1} - 2\,y_{i,j} + y_{i,j-1}}{h^2}$$
follows from
$$\frac{\partial^2 y(x_0,x_1)}{\partial x_1^2}(x_0^i, x_1^j) \approx \frac{\frac{\partial y(x_0,x_1)}{\partial x_1}(d) - \frac{\partial y(x_0,x_1)}{\partial x_1}(c)}{h}$$
with
$$\frac{\partial y(x_0,x_1)}{\partial x_1}(c) \approx \frac{y_{i,j} - y_{i,j-1}}{h} \quad \text{and} \quad \frac{\partial y(x_0,x_1)}{\partial x_1}(d) \approx \frac{y_{i,j+1} - y_{i,j}}{h}.$$
Both y and r cover the entire discretized unit square, that is, y ∈ R^{(s+1)×(s+1)} as well as r ∈ R^{(s+1)×(s+1)}. Both derivatives of boundary values as well as derivatives with respect to boundary values turn out to be equal to zero. For s = 3, we get
$$\begin{aligned}
-4\,y_{1,1} + y_{2,1} + y_{0,1} + y_{1,2} + y_{1,0} &= h^2 e^{y_{1,1}} \\
-4\,y_{1,2} + y_{2,2} + y_{0,2} + y_{1,3} + y_{1,1} &= h^2 e^{y_{1,2}} \\
-4\,y_{2,1} + y_{3,1} + y_{1,1} + y_{2,2} + y_{2,0} &= h^2 e^{y_{2,1}} \\
-4\,y_{2,2} + y_{3,2} + y_{1,2} + y_{2,3} + y_{2,1} &= h^2 e^{y_{2,2}}.
\end{aligned}$$
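The residual for general s can be coded directly from the stencil above. The following is a minimal sketch (not the book's Section C.1 code), assuming homogeneous Dirichlet boundary values and row-major grid indexing:

#include <cmath>

// Residual r = F(y) of the discretized SFI problem on the (s+1) x (s+1)
// grid; a sketch assuming zero boundary residuals and row-major indexing.
void sfi_residual(int s, const double* y, double* r) {
  double h = 1.0 / s;
  for (int i = 0; i <= s; i++)
    for (int j = 0; j <= s; j++) {
      int k = i * (s + 1) + j;                      // grid point (i, j)
      if (i == 0 || i == s || j == 0 || j == s)
        r[k] = 0;                                    // boundary entries vanish
      else
        r[k] = -4 * y[k] + y[k + s + 1] + y[k - s - 1] + y[k + 1] + y[k - 1]
               - h * h * std::exp(y[k]);
    }
}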
Table 1.1. Run time statistics for the SFI problem. The solution is computed by the standard Newton algorithm and by a matrix-free implementation of the Newton-CG algorithm. For different resolutions of the mesh (defined by the step size s) and for varying levels of accuracy of the Newton iteration (defined by ε), we list the run time t (in seconds) of the Newton algorithm, the number i of Newton iterations, and the number of Jacobian-vector products computed by calling the tangent-linear routine t1_f.

  s     ε         Newton                     Newton-CG
                  t       i    t1_f          t      i    t1_f
  25    10^{-5}   6.0     8    4608          0.0    8    281
  30    10^{-5}   24.6    8    6720          0.0    8    333
  35    10^{-5}   71.4    8    9248          0.0    8    384
  25    10^{-10}  6.8     9    5184          0.0    9    528
  30    10^{-10}  27.1    9    7569          0.1    9    629
  35    10^{-10}  79.6    9    10404         0.1    9    724
  300   10^{-5}   -       -    -             31.0   8    2963
  300   10^{-8}   -       -    -             44.4   9    4905
  300   10^{-10}  -       -    -             53.3   9    5842
1.1.2 Nonlinear Programming
The following example has been designed to compare the performance of finite difference approximation of first and second derivatives with that of derivative code that computes exact values in the context of basic unconstrained optimization algorithms. Adjoint code exceeds the efficiency of finite differences by a factor at the order of n. In many cases, this factor makes the difference between derivative-based methods being applicable to large-scale optimization problems or not.
Consider the nonlinear programming problem
$$\min_{x \in \mathbb{R}^n} f(x), \quad \text{where} \quad y = f(x) = \left( \sum_{i=0}^{n-1} x_i^2 \right)^2 \qquad (1.2)$$
is implemented as follows:

void f ( int n, double *x, double &y) {
  y = 0;
  for (int i = 0; i < n; i++) y = y + x[i] * x[i];
  y = y * y;
}
For educational reasons, we intentionally avoid the use of the more compact C++ notation y += x[i] * x[i]. The clean separation of left- and right-hand sides of assignments will prove advantageous in Chapters 2 and 3.
The function in (1.2) has a global minimum at x = 0. We use this problem to illustrate
issues that arise in derivative-based optimization methods. Our objective is not to cover the
state of the art in nonlinear optimization; refer to [50] for a survey of such techniques.
We apply basic line search methods to the given implementation of the objective. Such methods compute iterates
$$x^{k+1} = x^k - \alpha_k \, B_k^{-1} \, \nabla f(x^k) \qquad (1.3)$$
for some suitable starting value x^0 = (x_i^0)_{i=0,...,n−1} and with a step length α_k > 0, where ∇f(x^k) denotes the gradient (the transposed single-row Jacobian) of f at the current iterate. Simple first-order (B_k is equal to the identity I_n in R^{n×n}) and second-order (B_k is equal to the Hessian ∇²f(x^k) of f at point x^k) methods are discussed below. We aim to find a local minimizer by starting at x_i^0 = 1 for i = 0, ..., n − 1.
Steepest Descent Algorithm

In the simplest case, (1.3) becomes
$$x^{k+1} = x^k - \alpha_k \, \nabla f(x^k).$$
The step length α_k > 0 is, for example, chosen by recursive bisection on α_k starting from α_k = 1 (0.5, 0.25, ...) and such that a decrease in the objective value is ensured. This simple method is known as the steepest descent algorithm. It is stated formally in Algorithm 1.5, where it is assumed that a suitable α can always be found. Refer to [50] for details on exceptions.
Algorithm 1.5 Steepest descent algorithm for solving the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the objective y ∈ R at the current point x ∈ R^n:
    f : R^n → R, y = f(x)
  implementation of ∇f for computing the objective y ← f(x) and its gradient g ← ∇f(x) at the current point x:
    ∇f : R^n → R × R^n, (y, g) = ∇f(x)
  starting point: x ∈ R^n
  upper bound on the gradient norm ‖g‖ at the approximate minimal point: ε ∈ R
Out:
  approximate minimal value of the objective: y ∈ R
  approximate minimal point: x ∈ R^n
Algorithm:
  1: repeat
  2:   (y, g) = ∇f(x)
  3:   if ‖g‖ > ε then
  4:     α ← 1
  5:     ỹ ← y
  6:     while ỹ ≥ y do
  7:       x̃ ← x − α · g
  8:       ỹ = f(x̃)
  9:       α ← α/2
  10:    end while
  11:    x ← x̃
  12:  end if
  13: until ‖g‖ ≤ ε
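A compact C++ rendering of Algorithm 1.5 might look as follows; grad_f is a hypothetical placeholder for any routine that returns the objective and its gradient (for example, via Algorithm 1.6, 1.7, or 1.8), and the L2-norm computation is inlined for brevity.

#include <cmath>
#include <vector>

void f(int n, double* x, double& y);                 // given objective (1.2)
void grad_f(int n, double* x, double& y, double* g); // placeholder gradient routine

// Sketch of Algorithm 1.5: steepest descent with recursive bisection.
void steepest_descent(int n, double* x, double& y, double eps) {
  std::vector<double> g(n), x_tilde(n);
  double gnorm;
  do {
    grad_f(n, x, y, g.data());                       // line 2
    gnorm = 0;
    for (int i = 0; i < n; i++) gnorm += g[i] * g[i];
    gnorm = std::sqrt(gnorm);
    if (gnorm > eps) {                               // line 3
      double alpha = 1, y_tilde;                     // line 4
      do {                                           // lines 6-10: bisection
        for (int i = 0; i < n; i++) x_tilde[i] = x[i] - alpha * g[i];
        f(n, x_tilde.data(), y_tilde);
        alpha /= 2;
      } while (y_tilde >= y);
      for (int i = 0; i < n; i++) x[i] = x_tilde[i]; // line 11
    }
  } while (gnorm > eps);                             // line 13
}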
A handwritten implementation of the gradient
$$\nabla f(x) = \left( 4 \, x_i \sum_{j=0}^{n-1} x_j^2 \right)_{i=0,\dots,n-1}$$
is certainly an option. This situation will change for more complex objectives implemented
as computer programs with many thousand lines of source code. Typically, the efficiency
of handwritten derivative code is regarded as close to optimal. While this is a reasonable
assumption in many cases, it still depends very much on the author of the derivative code.
Algorithm 1.6 Gradient approximation by (forward) finite differences in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the objective function: f : R^n → R, y = f(x)
  current point: x ∈ R^n
  perturbation: δ ∈ R
Out:
  objective at the current point: y = f(x) ∈ R
  approximate gradient of the objective at the current point:
    g ≡ (g_i)_{i=0,...,n−1} ≈ ∇f(x) ∈ R^n
Algorithm:
  1: y = f(x)
  2: for i = 0 to n − 1 do
  3:   x̃ ← x + δ · e^i
  4:   ỹ = f(x̃)
  5:   g_i ← (ỹ − y)/δ
  6: end for
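Applied to the given implementation of (1.2), Algorithm 1.6 translates into a few lines of C++; the routine name and the restore of the perturbed component are illustrative choices.

void f(int n, double* x, double& y);   // given objective (1.2)

// Sketch of Algorithm 1.6: gradient approximation by forward finite differences.
void fd_gradient(int n, double* x, double& y, double* g, double delta) {
  f(n, x, y);                          // line 1: objective at the current point
  for (int i = 0; i < n; i++) {        // line 2
    double xi = x[i], y_tilde;
    x[i] += delta;                     // line 3: perturb the ith component
    f(n, x, y_tilde);                  // line 4
    g[i] = (y_tilde - y) / delta;      // line 5: forward difference quotient
    x[i] = xi;                         // undo the perturbation
  }
}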
Moreover, hand-coding may be infeasible within the time frame allocated to the project.
Debugging is likely to occupy the bigger part of the development time. Hence, we aim to
build up a set of rules that will allow us to automate the generation of derivative code to
the greatest possible extent. All of these rules can be applied manually. Our ultimate goal,
however, is the development of corresponding software tools in order to make this process
less tedious and less error-prone.
As in Section 1.1.1, the gradient can be approximated using finite difference quotients (see Algorithm 1.6), provided that the computational complexity of O(n) · Cost(f) remains feasible. Refer to Section 1.3 for further details on finite differences. This approach has two major disadvantages. First, the approximation may be poor. Second, a minimum of n + 1 function evaluations are required. If, for example, a single function evaluation takes one minute on the given computer architecture, and if n = 10^6 (corresponding, for example, to a temperature distribution in a very coarse-grain discretization of a global three-dimensional atmospheric model), then a single evaluation of the gradient would take almost two years. Serious climate simulation would not be possible.
The computational complexity is not decreased when using tangent-linear AD as
outlined in Algorithm 1.7. Nevertheless, the improved accuracy of the computed gradient
may lead to faster convergence of the steepest descent algorithm.
Large-scale and long-term climate simulations are performed by many researchers
worldwide. A single function evaluation is likely to run for much longer than one minute,
Algorithm 1.7 Gradient accumulation by tangent-linear mode AD in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
  1: for i = 0 to n − 1 do
  2:   x^(1) ← e^i
  3:   (y, y^(1)) = f^(1)(x, x^(1))
  4:   g_i ← y^(1)
  5: end for
even on the latest high-performance computer architectures. It may take hours or days to
perform climate simulations at physically meaningful spatial discretization levels and over
relevant time intervals. Typically only a few runs are feasible. The solution to this problem
comes in the form of various flavors of so-called adjoint methods. In particular, adjoint AD allows us to generate for a given implementation of f an adjoint program that computes the gradient ∇f at a computational cost of O(1) · Cost(f). As opposed to finite differences and tangent-linear AD, adjoint AD thus makes the computational cost independent of n. It
enables large-scale sensitivity analysis as well as high-dimensional nonlinear optimization
and uncertainty quantification for practically relevant problems in science and engineering.
This observation is worth highlighting even at this early stage, and it serves as motivation
for the better part of the remaining chapters in this book.
Algorithm 1.8 illustrates the use of an adjoint code for f. The adjoint objective is called only once (in line 2). We use the subscript (1) to denote adjoint functions and variables. The advantages of this notation will become obvious in Chapter 3, in the context of higher-order adjoints.
Table 1.2 summarizes the impact of the various differentiation methods on the run
time of the steepest descent algorithm when applied to our example problem in (1.2). These
results were obtained on a standard Linux PC running the GNU C++ compiler with optimization level 3, which will henceforth be referred to as the reference platform. The numbers illustrate the superiority of adjoint over both tangent-linear code and finite difference
Algorithm 1.8 Gradient accumulation by adjoint mode AD in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the adjoint objective function f_(1) for computing the objective y ← f(x) and the product x_(1) ← y_(1) · ∇f(x) of its gradient at the current point x ∈ R^n with a factor y_(1) ∈ R:
    f_(1) : R^n × R → R × R^n, (y, x_(1)) = f_(1)(x, y_(1))
  current point: x ∈ R^n
Out:
  objective at the current point: y = f(x) ∈ R
  gradient of the objective at the current point: g = ∇f(x) ∈ R^n
Algorithm:
  1: y_(1) ← 1
  2: (y, x_(1)) = f_(1)(x, y_(1))
  3: g ← x_(1)
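For the objective in (1.2), an adjoint routine with the interface of Algorithm 1.8 can be derived by hand. The following sketch (with the illustrative name a1_f, not dcc-generated code) returns the full gradient in a single call:

// Hand-derived adjoint of the given implementation of (1.2): computes
// y = f(x) and a1_x = a1_y * grad f(x); a sketch, not dcc output.
void a1_f(int n, double* x, double& y, double* a1_x, double a1_y) {
  double s = 0;                                   // forward sweep: s = sum x_i^2
  for (int i = 0; i < n; i++) s = s + x[i] * x[i];
  y = s * s;
  double a1_s = 2 * s * a1_y;                     // reverse sweep: adjoint of y = s*s
  for (int i = 0; i < n; i++)
    a1_x[i] = 2 * x[i] * a1_s;                    // adjoint of s = s + x_i^2
}

With a1_y = 1, a1_x holds ∇f(x) = (4 x_i Σ_j x_j²)_i at a cost of one forward and one reverse sweep, independent of n.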
Table 1.2. Run time of the steepest descent algorithm (in seconds and starting from x = 1). The gradient (of size n) is approximated by central finite differences
(FD; see Section 1.3) or computed with machine accuracy by a tangent-linear code (TLC;
see Section 2.1) or an adjoint code (ADC; see Section 2.2). The tangent-linear and
adjoint codes are generated automatically by the derivative code compiler (dcc) (see
Chapter 5).
  n      FD      TLC    ADC
  100    13      8      <1
  200    47      28     1
  300    104     63     2
  400    184     113    2.5
  500    284     173    3
  1000   1129    689    6
approximations in terms of the overall computational effort that is dominated by the cost of the gradient evaluation. Convergence of the steepest descent algorithm is defined as the L_2-norm of the gradient falling below 10^{-9}. The steepest descent algorithm is expected to perform a large number of iterations (with potentially very small step sizes) to reach this high level of accuracy. Similar numbers of iterations (over 3 · 10^5) are performed independently of the method used for the evaluation of the gradient. As expected, the step size α_k is reduced to values below 10^{-4} to reach convergence while ensuring strict descent in the objective function value.
Newton Algorithm
Second-order methods based on the Newton algorithm promise faster convergence in the
neighborhood of the minimum by taking into account second derivative information. We
consider the Newton algorithm discussed in Section 1.1.1 extended by a local line search
to determine k for k = 0, 1, . . . in (1.3) as
1
xk+1 = xk k 2 f (xk )
f (xk ).
As in Section 1.1.1, the Newton method is applied to find a stationary point of f by solving
the nonlinear system f = 0. The Newton step
1
f (xk )
dxk 2 f (xk )
is obtained as the solution of the linear Newton system
2 f (xk ) dxk = f (xk )
at each iteration. If xk is far from a solution, then sufficient descent in the residual can be
obtained using a local line search to determine k such that the L2 -norm of the residual at
the next iterate is minimized, that is, the scalar nonlinear optimization problem
min ||f (xk k dxk ))||2
k
needs to be solved. The first and, potentially also required, second derivatives of the
objective with respect to k can be computed efficiently using the methods discussed in
this book. Alternatively, a simple recursive bisection algorithm similar to that used in the
steepest descent method can help to improve the robustness of the Newton method.
A formal description of the Newton algorithm for unconstrained nonlinear optimization is given in Algorithm 1.9. Again, convergence is assumed. The computational cost
is dominated by the accumulation in lines 1, 3, and 14 of gradient and Hessian and by the
solution of the linear Newton system in line 4. Both the gradient and the Hessian should
be accurate in order to ensure the expected convergence behavior. Approximation by finite
differences may not be good enough due to an inaccurate Hessian in particular.
An algorithmic view of the (second-order) finite difference method for approximating the Hessian is given in Algorithm 1.10. Refer to Section 1.3 for details on first- and second-order finite differences as well as for a description of alternative approaches to their implementation. The shortcomings of finite difference approximation become even more apparent in the second-order case. The inaccuracy is likely to become more significant due to the limitations of floating-point arithmetic. Moreover, O(n^2) function evaluations are required for the approximation of the Hessian.
The first problem can be overcome by applying tangent-linear AD to Algorithm 1.7, yielding Algorithm 1.11. Each of the n calls of the second-order tangent-linear function G^(1), where G ≡ ∇f is defined as in Algorithm 1.7, involves n calls of the tangent-linear code. The overall computational complexity of the Hessian accumulation adds up to O(n^2) · Cost(f). Both the gradient and the Hessian are obtained with machine accuracy. Symmetry of the Hessian is exploited in neither Algorithm 1.10 nor Algorithm 1.11.
Algorithm 1.9 Newton algorithm for solving the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the objective y ∈ R at the current point x ∈ R^n:
    f : R^n → R, y = f(x)
  implementation of the differentiated objective function ∇f for computing the objective y ← f(x) and its gradient g ← ∇f(x) at the current point x:
    ∇f : R^n → R × R^n, (y, g) = ∇f(x)
  implementation of the differentiated objective function ∇²f for computing the objective y, its gradient g, and its Hessian H ≡ ∇²f(x) at the current point x:
    ∇²f : R^n → R × R^n × R^{n×n}, (y, g, H) = ∇²f(x)
  solver to determine the Newton step dx ∈ R^n as the solution of the linear Newton system H · dx = g:
    s : R^n × R^{n×n} → R^n, dx = s(g, H)
  starting point: x ∈ R^n
  upper bound on the gradient norm ‖g‖ at the approximate solution: ε ∈ R
Out:
  approximate minimal value: y ∈ R
  approximate minimal point: x ∈ R^n
Algorithm:
  1: (y, g) = ∇f(x)
  2: while ‖g‖ > ε do
  3:   (y, g, H) = ∇²f(x)
  4:   dx = s(g, H)
  5:   α ← 1
  6:   ỹ ← y
  7:   x̃ ← x
  8:   while ỹ ≥ y do
  9:     x̃ ← x − α · dx
  10:    ỹ = f(x̃)
  11:    α ← α/2
  12:  end while
  13:  x ← x̃
  14:  (y, g) = ∇f(x)
  15: end while
Algorithm 1.10 Hessian approximation by (forward) finite differences in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of ∇f for computing the objective y ← f(x) and its approximate gradient g = (g_j)_{j=0,...,n−1} ≈ ∇f(x) at the current point x ∈ R^n as defined in Algorithm 1.6:
    ∇f : R^n → R × R^n, (y, g) = ∇f(x)
  current point: x ∈ R^n
  perturbation: δ ∈ R
Out:
  objective function value at the current point: y = f(x)
  approximate gradient of the objective at the current point: g ≈ ∇f(x) ∈ R^n
  approximate Hessian of the objective at the current point:
    H = (h_{j,i})_{j,i=0,...,n−1} ≈ ∇²f(x) ∈ R^{n×n}
Algorithm:
  1: (y, g) = ∇f(x)
  2: for i = 0 to n − 1 do
  3:   x̃ ← x + δ · e^i
  4:   (ỹ, g̃) = ∇f(x̃)
  5:   for j = 0 to n − 1 do
  6:     h_{j,i} ← (g̃_j − g_j)/δ
  7:   end for
  8: end for
Substantial savings in the computational cost result from performing the gradient computation in adjoint mode. The savings are due to each of the n calls of the second-order adjoint function G^(1), where G ≡ ∇f is defined as in Algorithm 1.8, now involving merely a single call of the adjoint code. The overall computational complexity becomes O(n) · Cost(f) instead of O(n^2) · Cost(f).
The tangent-linear version of an adjoint code is referred to as second-order adjoint code. It can be used to compute both the gradient as w · ∇f(x) as well as projections of the Hessian in direction v ∈ R^n as w · ∇²f(x) · v by setting w = 1. A second-order tangent-linear code (tangent-linear version of the tangent-linear code) can be used to compute single entries of the Hessian as u^T · ∇²f(x) · w by letting u and w range independently over the Cartesian basis vectors in R^n. Consequently, the computational complexity of Hessian approximation using finite differences or the second-order tangent-linear model is O(n^2) · Cost(f). Second-order adjoint code delivers the Hessian at a computational cost of O(n) · Cost(f). Savings at the order of n are likely to make the difference between second-order methods being applicable or not. Refer to Table 1.3 for numerical results that support these findings. Note that further combinations of tangent-linear and adjoint AD are possible when computing second derivatives. Refer to Chapter 3 for details.
Algorithm 1.11 Hessian accumulation by second-order AD in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the tangent-linear version G^(1) of the differentiated objective function G ≡ ∇f defined in Algorithm 1.7 (yielding second-order tangent-linear mode AD) or Algorithm 1.8 (yielding second-order adjoint mode AD) for computing the objective y ← f(x), its directional derivative y^(1) ← ∇f(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n, its gradient g ← ∇f(x), and its second directional derivative g^(1) = (g_j^(1))_{j=0,...,n−1} ← ∇²f(x) · x^(1) in direction x^(1) at the current point x ∈ R^n:
    G^(1) : R^n × R^n → R × R × R^n × R^n, (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
  current point: x ∈ R^n
Out:
  objective function value at the current point: y = f(x)
  gradient at the current point: g = ∇f(x) ∈ R^n
  Hessian at the current point: H = (h_{j,i})_{j,i=0,...,n−1} = ∇²f(x) ∈ R^{n×n}
Algorithm:
  1: for i = 0 to n − 1 do
  2:   x^(1) ← e^i
  3:   (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
  4:   for j = 0 to n − 1 do
  5:     h_{j,i} ← g_j^(1)
  6:   end for
  7: end for
As already observed in Section 1.1.1, the solution of the linear Newton system by a direct method yields a computational complexity at the order of O(n^3). Hence, the overall cost of the Newton method applied to the given implementation of (1.2) is dominated by the solution of the linear system in columns SOFD, SOTLC, and SOADC of Table 1.3. Exploitation of possible sparsity of the Hessian can reduce the computational complexity as discussed in Chapter 3. The run time of the Hessian accumulation is higher when using the second-order tangent-linear model (column SOTLC) or, similarly, a second-order finite difference approximation (column SOFD). Use of the second-order adjoint model reduces the computational cost (column SOADC). The Hessian may become indefinite very quickly when using finite difference approximation.
Matrix-free implementations of the Conjugate Gradient solver avoid the accumulation of the full Hessian. Note that in Algorithm 1.12 only the gradient g (line 3) and projections of the Hessian g^(1) (lines 3, 8, and 11) are required. Both are delivered efficiently by a single run of G^(1) (in lines 2 and 7, respectively). Again, preconditioning has been omitted for the sake of notational simplicity. If a second-order adjoint code is used
Table 1.3. Run time of the Newton algorithm (in seconds and starting from x = 1). The gradient and Hessian of the given implementation of (1.2) are approximated by second-order central finite differences (SOFD; see Section 1.3) or computed with machine accuracy by a second-order tangent-linear (SOTLC; see Section 3.2) or adjoint (SOADC; see Section 3.3) code. The Newton system is solved using a Cholesky factorization that dominates both the run time and the memory requirement for increasing n due to the relatively low cost of the function evaluation itself. The last column shows the run times for a matrix-free implementation of a Newton-Krylov algorithm that uses the CG algorithm to approximate the Newton step based on the second-order adjoint model. As expected, the algorithm scales well beyond the problem sizes that could be handled by the other three approaches. A run time of more than 1 second is observed only for n ≥ 10^5.

  n       SOFD     SOTLC    SOADC    SOADC (CG)
  100     <1       <1       <1       <1
  200     2        1        <1       <1
  300     7        3        1        <1
  400     17       9        4        <1
  500     36       21       10       <1
  1000    365      231      138      <1
  ...     ...      ...      ...      ...
  10^5    >10^4    >10^4    >10^4    1
(see Algorithm 1.11), then the computational complexity of evaluating the gradient and a Hessian-vector product is O(1) · Cost(f). We take this result as further motivation for an in-depth look into the generation of first- and higher-order adjoint code in the following chapters.
Nonlinear Programming with Constraints
Practically relevant optimization problems are most likely subject to constraints, which are
often nonlinear. For example, the solution may be required to satisfy a set of nonlinear
PDEs as in many data assimilation problems in the atmospheric sciences. Discretization of
the PDEs yields a system of nonlinear algebraic equations to be solved by the solution of
the optimization problem.
The core of many algorithms for constrained optimization is the solution of the equality-constrained problem
$$\min_{x \in \mathbb{R}^n} o(x) \quad \text{subject to} \quad c(x) = 0,$$
with associated Lagrangian L(x, λ) ≡ o(x) + ⟨λ, c(x)⟩. First-order optimality conditions yield the Karush-Kuhn-Tucker (KKT) system
$$\begin{pmatrix} \nabla_x L(x, \lambda) \\ c(x) \end{pmatrix} = 0$$
of n + m nonlinear equations in n + m unknowns x ∈ R^n and λ ∈ R^m. The Newton algorithm can be used to solve the KKT system subject to the following conditions: The Jacobian of
Algorithm 1.12 CG algorithm for computing the Newton step in the context of the unconstrained nonlinear programming problem min_{x∈R^n} f(x).
In:
  implementation of the tangent-linear version G^(1) of the differentiated objective function G ≡ ∇f defined in Algorithm 1.7 (yielding second-order tangent-linear mode AD) or Algorithm 1.8 (yielding a potentially matrix-free implementation based on second-order adjoint mode AD) for computing the objective y ← f(x), its directional derivative y^(1) ← ∇f(x) · x^(1) in the tangent-linear direction x^(1) ∈ R^n, its gradient g ← ∇f(x), and its second directional derivative g^(1) = (g_j^(1))_{j=0,...,n−1} ← ∇²f(x) · x^(1) in direction x^(1) at the current point x ∈ R^n:
    G^(1) : R^n × R^n → R × R × R^n × R^n, (y, y^(1), g, g^(1)) = G^(1)(x, x^(1))
  starting point: dx ∈ R^n
  upper bound on the norm of the residual g − H · dx, where H ≡ ∇²f(x), at the approximate solution for the Newton step: ε ∈ R
Out:
  approximate solution for the Newton step: dx ∈ R^n
Algorithm:
  1: x^(1) ← dx
  2: (y, y^(1), g, g^(1)) ← G^(1)(x, x^(1))
  3: p ← g − g^(1)
  4: r ← p
  5: while ‖r‖ ≥ ε do
  6:   x^(1) ← p
  7:   (y, y^(1), g, g^(1)) ← G^(1)(x, x^(1))
  8:   α ← r^T r / (p^T g^(1))
  9:   dx ← dx + α · p
  10:  r_prev ← r
  11:  r ← r − α · g^(1)
  12:  β ← r^T r / (r_prev^T r_prev)
  13:  p ← r + β · p
  14: end while
the constraints ∇c(x) needs to have full row rank; the Hessian of the Lagrangian
$$\nabla^2 L(x, \lambda) \equiv \frac{\partial^2 L}{\partial x^2}(x, \lambda)$$
needs to be positive definite on the null space of the constraint Jacobian. The Newton iteration computes
$$\begin{pmatrix} x^{k+1} \\ \lambda^{k+1} \end{pmatrix} = \begin{pmatrix} x^k \\ \lambda^k \end{pmatrix} + \begin{pmatrix} dx^k \\ d\lambda^k \end{pmatrix},$$
where the Newton step is computed as the solution of the linear system
$$\begin{pmatrix} \nabla^2 L(x^k, \lambda^k) & (\nabla c(x^k))^T \\ \nabla c(x^k) & 0 \end{pmatrix} \begin{pmatrix} dx^k \\ d\lambda^k \end{pmatrix} = - \begin{pmatrix} \nabla o(x^k) + (\nabla c(x^k))^T \lambda^k \\ c(x^k) \end{pmatrix}.$$
Note that
$$\nabla^2 L(x, \lambda) = \nabla^2 o(x) + \langle \lambda, \nabla^2 c(x) \rangle,$$
where the notation for the projection ⟨λ, ∇²c(x)⟩ of the 3-tensor ∇²c(x) ∈ R^{m×n×n} in direction λ ∈ R^m is formally introduced in Chapter 3.
Many modern algorithms for constrained nonlinear optimization are based on the
solution of the KKT system. See, for example, [50] for a comprehensive survey. Our focus
is on the efficient provision of the required derivatives.
If a direct linear solver is used, then the following derivatives need to be computed:
- ∇o(x) and ∇²o(x) at O(n) · Cost(o) using a second-order adjoint model of o;
- ∇c(x) and ⟨λ, ∇²c(x)⟩ at O(n) · Cost(c) using a second-order adjoint model of c;
- ⟨λ, ∇c⟩ at O(1) · Cost(c) using the adjoint model of c.
A matrix-free implementation of a Newton-Krylov algorithm requires the following derivatives:
- ∇o(x) and ⟨∇²o(x), v⟩, where v ∈ R^n. Both can be computed at the cost of O(1) · Cost(o) using a second-order adjoint model of o;
- ⟨w, ∇c(x)⟩, ⟨∇c(x), v⟩, and ⟨λ, ∇²c(x), v⟩, where v ∈ R^n and w ∈ R^m. The first- and second-order adjoint projections can be computed at the cost of O(1) · Cost(c) using a second-order adjoint model of c. A tangent-linear model of c permits the evaluation of Jacobian-vector products at the same relative computational cost.
Refer to Section 3.1 for formal definitions of the projection operator ⟨·, ·⟩.
A detailed discussion of constrained nonlinear optimization is beyond the scope of this book. Various software packages have been developed to solve this type of problem. Some packages use AD techniques or can be coupled with code generated by AD. Examples include AMPL [26], IPOPT [59], and KNITRO [14]. Both IPOPT and KNITRO can be accessed via the Network-Enabled Optimization Server (NEOS⁵) maintained by Argonne National Laboratory's Mathematics and Computer Science Division. NEOS uses a variety of AD tools. Refer to the NEOS website for further information.
A case study for the use of AD in the context of constrained nonlinear programming is presented in [41]. Moreover, we discuss various combinatorial issues related to AD and to the use of sparse direct linear solvers. Our derivative code compiler dcc is combined with IPOPT [59] and PARDISO [55] to solve an inverse medium problem.
5 neos.mcs.anl.gov
1.1.3 Numerical Libraries

We consider the NAG Numerical Library as a prominent representative for a number of similar commercial and noncommercial numerical software tools.
The library provides a custom integer data type Integer. Algorithm 1.2 or, preferably, Algorithm 1.3 can be used to approximate the Jacobian or to accumulate its entries with machine accuracy at a computational cost of O(n) · Cost(F), respectively.
A similar approach can be taken for the integration of stiff systems of ordinary differential equations
$$\frac{\partial x}{\partial t} = F(t, x)$$
using various NAG C Library routines. The API is similar to the above. The accumulation of the Jacobian of F(t, x) with respect to x is analogous.
Unconstrained Nonlinear Optimization

The e04dgc section of the library deals with the minimization of an unconstrained nonlinear function f : R^n → R, y = f(x), where n is assumed to be very large. It uses a preconditioned, limited-memory quasi-Newton CG method and is based upon algorithm PLMA as described in [33]. The user must provide code for the accumulation of the gradient g ← ∇f(x) as part of a function

void g_f ( Integer n, const double x[],
           double *f, double g[],
           ... );

Both Algorithm 1.2 and Algorithm 1.3 can be used to approximate the gradient or to accumulate its entries with machine accuracy at a computational cost of O(n) · Cost(f), respectively. A better choice is Algorithm 1.8, which delivers the gradient with machine accuracy at a computational cost of O(1) · Cost(f).
Bound-Constrained Nonlinear Optimization

The e04lbc section of the library provides a modified Newton algorithm for finding unconstrained or bound-constrained minima of twice continuously differentiable nonlinear functions f : R^n → R, y = f(x) [32]. The user needs to provide code for the accumulation of the gradient g ← ∇f(x) and for the computation of the function value y as part of a function

void g_f ( Integer n, const double x[],
           double *y, double g[],
           ... );

The gradient should be computed with machine accuracy by Algorithm 1.8 at a computational cost of O(1) · Cost(f). For the Hessian, we can choose second-order finite differences, the second-order tangent-linear model, or the second-order adjoint model. While the first two run at a computational cost of O(n^2) · Cost(f), the second-order adjoint code delivers the Hessian with machine accuracy at a computational cost of only O(n) · Cost(f).
1.2 Manual Differentiation
Closed-form symbolic as well as algorithmic differentiation are based on two key ingredients: First, expressions for partial derivatives of the various arithmetic operations and
intrinsic functions provided by programming languages are well known. Second, the chain
rule of differential calculus holds.
Theorem 1.3 (Chain Rule of Differential Calculus). Let
$$y = F(x) = G_1(G_0(x))$$
such that G_0 : R^n → R^k, z = G_0(x), is differentiable at x and G_1 : R^k → R^m, y = G_1(z), is differentiable at z. Then F is differentiable at x and
$$\nabla F = \frac{\partial y}{\partial x} = \frac{\partial y}{\partial z} \cdot \frac{\partial z}{\partial x} = \frac{\partial G_1}{\partial z} \cdot \frac{\partial G_0}{\partial x}.$$
Proof. See, for example, [4].
Example 1.4 Let y = F(x) = G_1(G_0(x)) such that z = G_0(x_0, x_1) = x_0 · x_1 and
$$G_1(z) = \begin{pmatrix} \sin(z) \\ \cos(z) \end{pmatrix}.$$
Then,
$$\frac{\partial F}{\partial x} = \begin{pmatrix} \cos(z) \\ -\sin(z) \end{pmatrix} \cdot \begin{pmatrix} x_1 & x_0 \end{pmatrix} = \begin{pmatrix} \cos(z) \cdot x_1 & \cos(z) \cdot x_0 \\ -\sin(z) \cdot x_1 & -\sin(z) \cdot x_0 \end{pmatrix} = \begin{pmatrix} \cos(x_0 x_1) \cdot x_1 & \cos(x_0 x_1) \cdot x_0 \\ -\sin(x_0 x_1) \cdot x_1 & -\sin(x_0 x_1) \cdot x_0 \end{pmatrix}.$$
The fundamental assumption we make is that at run time a computer program can
be regarded as a sequence of assignments with arithmetic operations or intrinsic functions
on their right-hand sides. The flow of control does not represent a serious problem as it is
resolved uniquely for any given set of inputs.
Definition 1.5. The given implementation of F as a (numerical) program is assumed to
decompose into a single assignment code (SAC) at every point of interest as follows:
For j = n, ..., n + p + m − 1,
$$v_j = \varphi_j \left( (v_i)_{i \prec j} \right), \qquad (1.4)$$
where i ≺ j denotes a direct dependence of v_j on the argument v_i.
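For the function from Example 1.4, the SAC of Definition 1.5 (n = 2 inputs, p = 1 intermediate, m = 2 outputs) can be written out directly; the variable names mirror the v_j above.

#include <cmath>

// Single assignment code (SAC) for Example 1.4: inputs v0, v1,
// intermediate v2, outputs v3, v4.
void F_sac(const double* x, double* y) {
  double v0 = x[0], v1 = x[1];
  double v2 = v0 * v1;        // v2 = phi_2(v0, v1)
  double v3 = std::sin(v2);   // v3 = phi_3(v2)
  double v4 = std::cos(v2);   // v4 = phi_4(v2)
  y[0] = v3;
  y[1] = v4;
}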
[Figure 1.2: Linearized DAG of the SAC for Example 1.4 with independent vertices v_0, v_1, intermediate vertex v_2 = v_0 · v_1, and dependent vertices v_3 = sin(v_2) and v_4 = cos(v_2); the edges carry the local partial derivatives v_1, v_0, cos(v_2), and −sin(v_2).]
Let A = (a_{i,j}) ≡ ∇F(x). As an immediate consequence of the chain rule, the individual entries of the Jacobian can be computed as
$$a_{i,j} = \sum_{\pi \in [i \to n+p+j]} \; \prod_{(k,l) \in \pi} c_{l,k}, \qquad (1.5)$$
where
$$c_{l,k} \equiv \frac{\partial \varphi_l}{\partial v_k} \left( (v_q)_{q \prec l} \right)$$
and [i → n + p + j] denotes the set of all paths that connect the independent vertex i with the dependent vertex n + p + j [6].
Example 1.8 From Figure 1.2 we get immediately

$$\nabla F = \begin{pmatrix} \cos(v_2) \cdot v_1 & \cos(v_2) \cdot v_0 \\ -\sin(v_2) \cdot v_1 & -\sin(v_2) \cdot v_0 \end{pmatrix}
= \begin{pmatrix} \cos(x_0 x_1)\, x_1 & \cos(x_0 x_1)\, x_0 \\ -\sin(x_0 x_1)\, x_1 & -\sin(x_0 x_1)\, x_0 \end{pmatrix}.$$
Linearized DAGs and the chain rule (for example, formulated as in (1.5)) can be useful tools
for the manual differentiation of numerical simulation programs. Nonetheless, this process
can be extremely tedious and highly error-prone.
When differentiating computer programs, one aims for a derivative code that covers
the entire domain of the original code. Correct derivatives should be computed for any set
of inputs for which the underlying function is defined. Manual differentiation is reasonably
straightforward for straight-line code, that is, for sequences of assignments that are not
interrupted by control flow statements or subprogram calls. In this case, the DAG is static,
meaning that its structure remains unchanged for varying values of the inputs. Manual differentiation of computer programs becomes much more challenging in the presence of control flow.
Consider the given implementation of (1.2):
void f(int n, double *x, double &y) {
  y = 0;
  for (int i = 0; i < n; i++) y = y + x[i] * x[i];
  y = y * y;
}
The structure of the DAG varies with the value of n indicated by the variable index i in
Figure 1.3. A handwritten gradient code might look as follows:
1 void g_f(int n, double *x, double &y, double *g) {
2   y = 0;
3   for (int i = 0; i < n; i++) {
4     g[i] = 2 * x[i];
5     y = y + x[i] * x[i];
6   }
7   for (int i = 0; i < n; i++) g[i] = g[i] * 2 * y;
8   y = y * y;
9 }
Local gradients of the sums in line 5 are built in line 4. Each of them needs to be multiplied
in line 7 with the local partial derivative of the square operation in line 8 to obtain the
final gradient.

[Figure 1.3: linearized DAG of the code above for y = (Σ_{i=0}^{n−1} x_i²)²; the squares x_i² carry edge labels 2x_i, the sum carries labels 1, and the final square carries label 2·Σ_{i=0}^{n−1} x_i².]

While this simple example is certainly manageable, it still demonstrates how
painful this procedure is likely to become for complex code involving nontrivial control
flow and many subprograms.
Repeating the above process for second derivatives computed by the differentiated
gradient code yields the following handwritten Hessian code:
 1 void h_g_f(int n, double *x, double &y,
             double *g, double **H) {
 2   y = 0;
 3   for (int i = 0; i < n; i++) {
 4     g[i] = 2 * x[i];
 5     y = y + x[i] * x[i];
 6   }
 7   for (int i = 0; i < n; i++)
 8     for (int j = 0; j <= i; j++) {
 9       H[i][j] = 4 * x[j] * g[i];
10       if (i == j)
11         H[i][j] = H[i][j] + 4 * y;
12       else
13         H[j][i] = H[i][j];
14     }
15   for (int i = 0; i < n; i++) g[i] = g[i] * 2 * y;
16   y = y * y;
17 }
This code is based on the linearized DAG of the gradient code shown in Figure 1.4. Again,
the structure of this DAG depends on the value of n. The entries of the Hessian are assembled
in lines 9, 11, and 13.
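As a consistency check (our derivation, not part of the original text), abbreviate s ≡ Σ_{i=0}^{n−1} x_i², so that y = s². Then

$$\frac{\partial y}{\partial x_i} = 4 s x_i, \qquad
\frac{\partial^2 y}{\partial x_i \partial x_j} = 8 x_i x_j + 4 s \, \delta_{i,j}.$$

Line 9 contributes 8x_i x_j = 4x_j · (2x_i), and line 11 adds 4s (held in y before line 16) on the diagonal.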
[Figure 1.4: linearized DAG of the gradient code g_f for y = (Σ_{i=0}^{n−1} x_i²)² with vertices x_i, x_i², g_i, and y; the edges carry the labels 2x_i, 2g_i, and 2y.]
$$F = \begin{pmatrix} F_0 \\ \vdots \\ F_{m-1} \end{pmatrix}, \qquad
\frac{\partial F}{\partial x_i}(x) = \begin{pmatrix} \frac{\partial F_0}{\partial x_i}(x) \\ \vdots \\ \frac{\partial F_{m-1}}{\partial x_i}(x) \end{pmatrix}
\approx \frac{1}{\varepsilon} \left( F(x + \varepsilon \, e^i) - F(x) \right). \qquad (1.6)$$
$$f(x') = f(x) + \sum_{j=1}^{\infty} \frac{1}{j!} \frac{\partial^j f}{\partial x^j}(x) \cdot (x' - x)^j. \qquad (1.8)$$

For x' = x + ε we get

$$f(x+\varepsilon) = f(x) + \varepsilon \frac{\partial f}{\partial x}(x) + \frac{\varepsilon^2}{2!} \frac{\partial^2 f}{\partial x^2}(x) + \frac{\varepsilon^3}{3!} \frac{\partial^3 f}{\partial x^3}(x) + \cdots. \qquad (1.9)$$

Subtracting the corresponding expansion of f(x − ε) yields

$$f(x+\varepsilon) - f(x-\varepsilon) = 2 \varepsilon \frac{\partial f}{\partial x}(x) + \frac{2 \varepsilon^3}{3!} \frac{\partial^3 f}{\partial x^3}(x) + \cdots.$$

Truncation after the first derivative term yields the scalar univariate version of (1.11). For small values of ε, the truncation error is dominated by the value of the ε³ term, which implies that only accuracy up to the order of ε² (second-order accuracy) can be expected.
Example 1.11 The gradient of the given implementation of (1.2) is accumulated by forward finite differences with perturbation h = 10⁻⁹ as follows:
 1 void g_f_ffd(int n, double *x, double &y, double *g) {
 2   const double h = 1e-9;
 3   double y_ph;
 4   double *x_ph = new double[n];
 5   for (int i = 0; i < n; i++) x_ph[i] = x[i];
 6   f(n, x, y);
 7   for (int i = 0; i < n; i++) {
 8     x_ph[i] += h;
 9     f(n, x_ph, y_ph);
10     g[i] = (y_ph - y) / h;
11     x_ph[i] = x[i];
12   }
13   delete [] x_ph;
14 }
The driver routine for backward finite differences is obtained by subtracting (instead of
adding) h in line 8 and by switching the operands in the subtraction in line 10.
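For concreteness, a minimal sketch of that backward difference driver (the name g_f_bfd is ours; f is the implementation of (1.2) given above):

void g_f_bfd(int n, double *x, double &y, double *g) {
  const double h = 1e-9;
  double y_mh;
  double *x_mh = new double[n];
  for (int i = 0; i < n; i++) x_mh[i] = x[i];
  f(n, x, y);
  for (int i = 0; i < n; i++) {
    x_mh[i] -= h;             // subtract h instead of adding it (line 8)
    f(n, x_mh, y_mh);
    g[i] = (y - y_mh) / h;    // operands of the subtraction switched (line 10)
    x_mh[i] = x[i];
  }
  delete [] x_mh;
}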
Extension of the driver to central finite differences is straightforward:
 1 void g_f_cfd(int n, double *x, double &y, double *g) {
 2   const double h = 5e-10;
 3   double y_mh, y_ph;
 4   double *x_mh = new double[n];
 5   double *x_ph = new double[n];
 6   for (int i = 0; i < n; i++) x_ph[i] = x_mh[i] = x[i];
 7   for (int i = 0; i < n; i++) {
 8     x_mh[i] -= h;
 9     f(n, x_mh, y_mh);
10     x_ph[i] += h;
11     f(n, x_ph, y_ph);
12     g[i] = (y_ph - y_mh) / (2 * h);
13     x_ph[i] = x_mh[i] = x[i];
14   }
15   f(n, x, y);
16   delete [] x_mh;
17   delete [] x_ph;
18 }
The call of f at the original point x in line 15 ensures the return of the correct function
value y.
Directional derivatives can be approximated by forward finite differences as

$$y^{(1)} = \nabla F(x) \cdot x^{(1)} \approx \frac{1}{\varepsilon} \left( F(x + \varepsilon \, x^{(1)}) - F(x) \right).$$

The correctness of this statement follows immediately from the forward finite difference approximation of the Jacobian of F(x + s · x⁽¹⁾) at point s = 0 as

$$y^{(1)} = \nabla F(x) \cdot x^{(1)} \approx \left. \frac{F(x + (s+\varepsilon) \, x^{(1)}) - F(x + s \, x^{(1)})}{\varepsilon} \right|_{s=0}
= \frac{1}{\varepsilon} \left( F(x + \varepsilon \, x^{(1)}) - F(x) \right).$$

The backward and central finite difference versions follow analogously as

$$y^{(1)} \approx \frac{1}{\varepsilon} \left( F(x) - F(x - \varepsilon \, x^{(1)}) \right)
\quad \text{and} \quad
y^{(1)} \approx \frac{F(x + \varepsilon \, x^{(1)}) - F(x - \varepsilon \, x^{(1)})}{2 \varepsilon}.$$
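A hedged sketch of the forward difference variant for the scalar-valued f from this section (the name g_f_dd and the choice eps = 1e-9 are our assumptions):

void g_f_dd(int n, double *x, double *x1, double &y, double &y1) {
  const double eps = 1e-9;
  double y_ph;
  double *x_ph = new double[n];
  for (int i = 0; i < n; i++) x_ph[i] = x[i] + eps * x1[i];  // x + eps * x^(1)
  f(n, x, y);
  f(n, x_ph, y_ph);
  y1 = (y_ph - y) / eps;  // approximates the directional derivative grad f(x)^T x^(1)
  delete [] x_ph;
}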
The classical definition of derivatives as limits of finite difference quotients suggests that the quality of an approximation is improved by making the perturbation smaller. Unfortunately, this assumption is not valid in finite precision arithmetic such as that implemented by today's computers.
The way in which numbers are represented on a computer is defined by the IEEE 754 standard [1]. Real numbers x ∈ R are represented with base β, precision t, and exponent range [L, U] as

$$x = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \cdots + \frac{d_{t-1}}{\beta^{t-1}} \right) \cdot \beta^e,$$

where 0 ≤ d_i < β and e ∈ [L, U].
Example 1.12 Let β = 2, t = 3, and [L, U] = [−1, 1]. The corresponding normalized floating-point number system contains the following 25 elements:

0,
±1.00₂ · 2⁻¹ = 0.5₁₀,   ±1.01₂ · 2⁻¹ = 0.625₁₀,
±1.10₂ · 2⁻¹ = 0.75₁₀,  ±1.11₂ · 2⁻¹ = 0.875₁₀,
±1.00₂ · 2⁰ = 1₁₀,      ±1.01₂ · 2⁰ = 1.25₁₀,
±1.10₂ · 2⁰ = 1.5₁₀,    ±1.11₂ · 2⁰ = 1.75₁₀,
±1.00₂ · 2¹ = 2₁₀,      ±1.01₂ · 2¹ = 2.5₁₀,
±1.10₂ · 2¹ = 3₁₀,      ±1.11₂ · 2¹ = 3.5₁₀.
The IEEE single precision floating-point number data type float uses 32 bits: 23 bits for its mantissa, 8 bits for the exponent, and one sign bit. In decimal representation, we get 6 significant digits with minimal and maximal absolute values of 1.17549e-38 and 3.40282e+38, respectively. The stored exponent is biased by adding 2⁷ − 1 = 127 to its actual signed value.

The double precision floating-point number data type double uses 64 bits: 52 bits for its mantissa, 11 bits for the exponent, and one sign bit. In decimal representation we get 15 significant digits with minimal and maximal absolute values of 2.22507e-308 and 1.79769e+308, respectively. The stored biased exponent is obtained by adding 2¹⁰ − 1 = 1023 to its actual signed value. Higher precision floating-point types are defined accordingly.
If x R is not exactly representable in the given floating-point number system, then
it must be approximated by a nearby floating-point number. This process is known as
rounding. The default algorithm for rounding in binary floating-point arithmetic is rounding
to nearest, where x is represented by the nearest floating-point number. Ties are resolved
by choosing the floating-point number whose last stored digit is even, i.e., equal to zero in
binary floating-point arithmetic.
Example 1.13 In the previously discussed (β = 2, t = 3, [L, U] = [−1, 1]) floating-point number system, the decimal value 1.126 is represented as 1.25 when rounding to nearest. Tie breaking toward the trailing zero in the mantissa yields 1 for 1.125.
Let x and y be two floating-point numbers that agree in all but the last few digits. If we compute z = x − y, then z may have only a few digits of accuracy due to cancellation. Subsequent use of z in a calculation may impact the accuracy of the result negatively. Finite difference approximation of derivatives is a prime example of potentially catastrophic loss in accuracy due to cancellation and rounding.
Example 1.14 Consider the approximation of the first derivative of y = f(x) = x in single precision IEEE floating-point arithmetic at x = 10⁶ by the forward finite difference quotient

$$\frac{\partial f}{\partial x}(x) \approx \frac{f(x + \varepsilon) - f(x)}{\varepsilon}. \qquad (1.12)$$
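The effect can be reproduced with a few lines of C++ (our illustration, not the book's code; the spacing between neighboring single precision numbers near 10⁶ is 2⁻⁴ = 0.0625):

#include <cstdio>
int main() {
  float x = 1e6f;
  for (float eps = 1.0f; eps >= 1e-4f; eps /= 10.0f) {
    // forward difference quotient for f(x) = x; the exact derivative is 1
    float d = ((x + eps) - x) / eps;
    std::printf("eps = %g: approximation = %g\n", eps, d);
  }
  return 0;
}

For eps = 1 the quotient is exact; for eps = 0.1 rounding yields 1.25; for eps = 0.01 and below the perturbed argument rounds back to x and the quotient collapses to 0.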
$$\frac{\partial^2 F_k}{\partial x_i \partial x_j}(x)
\approx \frac{\frac{\partial F_k}{\partial x_i}(x + \varepsilon e^j) - \frac{\partial F_k}{\partial x_i}(x - \varepsilon e^j)}{2 \varepsilon}
\approx \left( \frac{F_k(x + \varepsilon e^j + \varepsilon e^i) - F_k(x + \varepsilon e^j - \varepsilon e^i)}{2 \varepsilon}
- \frac{F_k(x - \varepsilon e^j + \varepsilon e^i) - F_k(x - \varepsilon e^j - \varepsilon e^i)}{2 \varepsilon} \right) \Big/ (2 \varepsilon).$$

For f : R → R, we get the well-known formula

$$\frac{\partial^2 f}{\partial x^2}(x) \approx \frac{f(x + \varepsilon) - 2 f(x) + f(x - \varepsilon)}{\varepsilon^2}.$$
Example 1.16 The Hessian of the given implementation f of Equation (1.2) can be accumulated by second-order central finite differences with h = 10⁻⁶ as follows:
void h_cfd(int n, double *x, double **H) {
  const double h = 1e-6;
  double yp1, yp2;
  double *xp = new double[n];
  for (int i = 0; i < n; i++) xp[i] = x[i];
  for (int i = 0; i < n; i++) {
    for (int j = 0; j <= i; j++) {
      xp[i] = x[i]; xp[j] = x[j];
      xp[i] += h; xp[j] += h;
      f(n, xp, yp2);
      xp[i] = x[i]; xp[j] = x[j];
      xp[i] -= h; xp[j] += h;
      f(n, xp, yp1); yp2 -= yp1;
      xp[i] = x[i]; xp[j] = x[j];
      xp[i] += h; xp[j] -= h;
      f(n, xp, yp1); yp2 -= yp1;
      xp[i] = x[i]; xp[j] = x[j];
      xp[i] -= h; xp[j] -= h;
      f(n, xp, yp1); yp2 += yp1;
      H[i][j] = H[j][i] = yp2 / (4 * h * h);
    }
  }
  delete [] xp;
}
1.4 Exercises

1.4.1 Finite Differences and Floating-Point Arithmetic
Write a C++ program that converts single precision floating-point variables into their bit
representation (see Section 1.3). Investigate the effects of cancellation and rounding on the
finite difference approximation of first and second derivatives of a set of functions of your
choice.
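A possible starting point for this exercise (our sketch; it assumes a 32-bit float and standard headers):

#include <cstdio>
#include <cstring>
#include <cstdint>
int main() {
  float x = 0.1f;
  uint32_t bits;
  std::memcpy(&bits, &x, sizeof(bits));  // copy the raw 32 bits of x
  // print sign, exponent, and mantissa bits from most to least significant
  for (int i = 31; i >= 0; i--)
    std::printf("%u", (bits >> i) & 1u);
  std::printf("\n");
  return 0;
}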
1.4.2
Apply Algorithm 1.1 to approximate a solution y = y(x0 , x1 ) of the discrete SFI problem
introduced in Example 1.2.
1. Approximate the Jacobian of the residual r = F (y) by finite differences. Write exact
derivative code based on (1.5) for comparison.
2. Use finite differences to approximate the product of the Jacobian with a vector within
a matrix-free implementation of the Newton algorithm based on Algorithm 1.4.
3. Repeat the above for further problems from the MINPACK-2 test problem collection
[5], for example, for the Flow in a Channel and Elastic Plastic Torsion problems.
1.4.3
Apply the steepest descent and Newton algorithms to an extended version of the Rosenbrock function [54], which is defined as

$$y = f(x) \equiv \sum_{i=0}^{n-2} \left( (1 - x_i)^2 + 10 \cdot (x_{i+1} - x_i^2)^2 \right)$$

for n = 10, 100, 1000 and for varying starting values of your choice. The function has a global minimum at x_i = 1, i = 0, …, n − 1, where f(x) = 0. Approximate the required derivatives by finite differences. Observe the behavior (development of function values and L₂-norm of the gradient; run time) of the algorithms for varying values of the perturbation size. Use (1.5) to derive (handwritten) exact derivatives for comparison.
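A minimal sketch of the extended Rosenbrock function in the style of the implementations above (the name rosenbrock is ours, not part of the exercise text):

void rosenbrock(int n, double *x, double &y) {
  y = 0;
  for (int i = 0; i < n - 1; i++) {
    double a = 1 - x[i];
    double b = x[i + 1] - x[i] * x[i];
    y = y + a * a + 10 * b * b;
  }
}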
Repeat the above for

$$y = f(x) \equiv \sum_{i=0}^{n-2} \left( (x_i^2)^{x_{i+1}^2 + 1} + (x_{i+1}^2)^{x_i^2 + 1} \right).$$
1.4.4
Use manual differentiation and finite differences with your favorite solver for
1. Systems of nonlinear equations to find a numerical solution of the SFI problem introduced in Section 1.4.2; repeat for further MINPACK-2 test problems.
2. Nonlinear programming to minimize the Rosenbrock function; repeat for the other
two test problems from Section 1.4.3.
Chapter 2
Chapter 2 aims to equip the reader with the fundamentals of first derivative code generated by AD. Two methods for implementing tangent-linear and adjoint models are considered: source transformation and operator overloading. The former is introduced as a technique for rewriting numerical simulation programs manually. With current AD technology still being far from plug-and-play, we feel that users of AD tools should be capable of performing the underlying semantic transformations by hand. Otherwise it is unlikely that they will unleash the full power of these tools.
The automated generation of first derivative code is based on knowledge of the partial derivatives of the intrinsic functions and arithmetic operators offered by programming languages and on the chain rule of differential calculus, its associativity in particular. We focus on aspects of AD with immediate relevance to the derivative code compiler dcc that is presented in Chapter 5. For a comprehensive discussion of advanced issues in AD we refer the reader to [36]. A combination of both texts will be a very good starting point for anyone interested in AD and derivative code compiler technology.
Some notation is needed for the upcoming material.
Definition 2.1. Let D ⊆ R^n be an open domain and let f : D → R be continuously differentiable on D. The partial derivative of y = f(x), x = (x_i)_{i=0,…,n−1}, at point x⁰ with respect to x_j is denoted as

$$f_{x_j}(x^0) \equiv \frac{\partial f}{\partial x_j}(x^0).$$

The vector

$$\nabla f(x^0) \equiv \begin{pmatrix} f_{x_0}(x^0) \\ \vdots \\ f_{x_{n-1}}(x^0) \end{pmatrix} \in \mathbb{R}^n$$
The matrix

$$\nabla F(x^0) \equiv \begin{pmatrix} (\nabla F_0(x^0))^T \\ \vdots \\ (\nabla F_{m-1}(x^0))^T \end{pmatrix} \in \mathbb{R}^{m \times n}$$

containing the gradients of each of the m components F_i of F as rows is called the Jacobian of F at point x⁰.
Example 2.4 Consider the Jacobian ∇r = ∇r(y, λ) ∈ R^{4×4} of the discrete residual of the SFI problem introduced in Example 1.2. For s = 3 and h = 1/s the Jacobian of

r_0 = −4 · y_{1,1} + y_{2,1} + y_{1,2} − h² · λ · e^{y_{1,1}}
r_1 = −4 · y_{1,2} + y_{2,2} + y_{1,1} − h² · λ · e^{y_{1,2}}
r_2 = −4 · y_{2,1} + y_{1,1} + y_{2,2} − h² · λ · e^{y_{2,1}} − 1
r_3 = −4 · y_{2,2} + y_{1,2} + y_{2,1} − h² · λ · e^{y_{2,2}} − 1

with respect to y becomes

$$\nabla r = \begin{pmatrix}
-h^2 \lambda e^{y_{1,1}} - 4 & 1 & 1 & 0 \\
1 & -h^2 \lambda e^{y_{1,2}} - 4 & 0 & 1 \\
1 & 0 & -h^2 \lambda e^{y_{2,1}} - 4 & 1 \\
0 & 1 & 1 & -h^2 \lambda e^{y_{2,2}} - 4
\end{pmatrix}$$

yielding

$$\nabla r = \begin{pmatrix}
-4.15 & 1 & 1 & 0 \\
1 & -4.15 & 0 & 1 \\
1 & 0 & -4.15 & 1 \\
0 & 1 & 1 & -4.15
\end{pmatrix}.$$
exists at all of these points, and it contains the numerical values of the corresponding partial derivatives of the components of y with respect to the components of x:

$$\nabla F(x) = \left( \frac{\partial y_j}{\partial x_i} \right)_{\substack{j = 0, \ldots, m-1 \\ i = 0, \ldots, n-1}}.$$
This chapter forms the basis for the remaining material discussed in Chapters 3, 4,
and 5. Tangent-linear and adjoint models of numerical simulation programs and their generation using forward and reverse mode AD are discussed in Sections 2.1 and 2.2, respectively.
We focus on the (manual) implementation of tangent-linear and adjoint code and its semiautomatic generation by means of overloading of arithmetic operators and intrinsic functions.
Compiler-based source transformation techniques are considered in Chapter 4. The bottleneck of adjoint code is its often excessive memory requirement, which is proportional to the number of statements executed by the original code. Its reduction through trade-offs between storage and recomputation in the context of checkpointing schemes is the subject of Section 2.3.
Let x = x(s) depend on a scalar parameter s and let x⁽¹⁾ ≡ ∂x/∂s. Then

$$y^{(1)} \equiv \frac{\partial y}{\partial s} = \frac{\partial y}{\partial x} \cdot \frac{\partial x}{\partial s} = \nabla F(x) \cdot x^{(1)}. \qquad (2.1)$$
Derivative code compilers such as dcc (see Chapter 5) transform a given implementation

void f(int n, int m, double *x, double *y)
Figure 2.1. High-level linearized DAG of y = F (x) (a) and its tangent-linear
extension (b).
of the function y = F(x) into tangent-linear code (y⁽¹⁾, y) = F⁽¹⁾(x, x⁽¹⁾) for computing both the function value and the directional derivative:

$$y^{(1)} = \nabla F(x) \cdot x^{(1)}, \qquad y = F(x).$$
The signature of the resulting tangent-linear subroutine is the following:

void t1_f(int n, int m, double *x, double *t1_x,
          double *y, double *t1_y);

Superscripts of tangent-linear subroutine names and tangent-linear variable names are replaced with the prefix t1_, for example, v⁽¹⁾ ↦ t1_v.
The entire Jacobian can be accumulated by letting x⁽¹⁾ range over the Cartesian basis vectors in R^n.
Linearized DAGs can be derived for multivariate vector functions y = F(x) at various levels of granularity depending on which parts of the computation are considered to be elemental. The most high-level view is shown in Figure 2.1 (a). Its tangent-linear extension represents the augmentation with the auxiliary variable s and the corresponding local partial derivative x⁽¹⁾. It is shown in Figure 2.1 (b). The tangent-linear model follows immediately from the application of the chain rule on linearized DAGs (see (1.5)) to Figure 2.1 (b).
Theorem 2.6. The tangent-linear model y⁽¹⁾ = F⁽¹⁾(x, x⁽¹⁾) of a program implementing y = F(x), F : R^n → R^m, as in Definition 1.5, is evaluated for given inputs (v_0, …, v_{n−1}) = x by the following recurrence:

For j = n, …, n + p + m − 1,

$$v_j^{(1)} = \sum_{i \prec j} \frac{\partial \varphi_j}{\partial v_i} \cdot v_i^{(1)}, \qquad (2.2)$$

$$v_j = \varphi_j \left( v_i \right)_{i \prec j}.$$
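For instance (our illustration), a single SAC statement v₂ = v₀ · v₁ yields under (2.2)

$$v_2^{(1)} = \frac{\partial \varphi_2}{\partial v_0} \cdot v_0^{(1)} + \frac{\partial \varphi_2}{\partial v_1} \cdot v_1^{(1)} = v_1 \cdot v_0^{(1)} + v_0 \cdot v_1^{(1)},$$

which is exactly the product rule executed immediately before the statement v₂ = v₀ · v₁ in the generated code.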
All SAC statements are preceded by local tangent-linear models as defined in (2.1). The directional derivative of y with respect to x is returned as

$$y^{(1)} = \left( v^{(1)}_{n+p+j} \right)_{j = 0, \ldots, m-1}.$$

Each step of (2.2) amounts to multiplication with an extended Jacobian ∇Φ_j, j = 1, …, p + m, whose entries are the local partial derivatives c_{k,i} ≡ ∂φ_k/∂v_i wherever k − n = j and i ≺ k, are equal to one on the remaining diagonal (k − n ≠ j and k = i), and vanish otherwise. Chaining all p + m steps yields

$$v^{(1)}_{p+m} = \nabla F(v_0) \cdot v^{(1)}_0 = \nabla \Phi_{p+m}(v_{p+m-1}) \cdot \nabla \Phi_{p+m-1}(v_{p+m-2}) \cdots \nabla \Phi_1(v_0) \cdot v^{(1)}_0.$$

The forward mode of AD is obtained immediately from the last equation by considering single elements of the v_j⁽¹⁾ for j = 0, …, p + m.
The choice of x uniquely determines the flow of control in the given implementation
of F as well as in its tangent-linear version. Single assignment code can be generated for
each assignment separately, making (2.2) applicable to arbitrary (intra- and interprocedural)
flow of control. The correctness of this approach follows immediately from the chain rule.
We formulate these observations as a set of rules, each of which is illustrated by an example.
Thus we aim to provide a cookbook that helps to produce tangent-linear code for arbitrary
numerical programs. Advanced features provided by modern programming languages will
require careful adaptation of these rules. This process is not expected to pose any conceptual
difficulties.
Frequently, special care must be taken when implementing even seemingly simple
mathematical ideas such as forward mode AD on a computer. Difficulties may arise
from the conceptual difference between mathematical variables (e.g., the SAC variables
in Theorem 2.6) and declared program variables that represent memory locations in the
implementation. As previously mentioned, a single program variable may represent several mathematical variables by storing their values in the same memory location. Values
of mathematical variables get lost due to overwriting. Program variables become invalid
when leaving their scope. The associated memory can be reassigned by the operating system, thus overwriting its contents or making it inaccessible (depending on the programming
language in use). While the implications turn out to be rather straightforward for forward
mode AD, they will cause substantial trouble when implementing the only slightly more
mathematically complicated reverse mode.
Tangent-Linear Code Generation Rule 1: Duplicate Active Data Segment
Tangent-linear code augments the original computation with the evaluation of directional
derivatives. Hence any stored active value in the original program must be matched by a
memory location storing the corresponding derivative. This statement applies to global as
well as local and temporary SAC variables.
Definition 2.7. We refer to a computed value as active if it depends on the value of at least one independent variable. Additionally, it must have an impact on the value of at least one dependent variable.

For illustration, consider a function computing y = x + 2 · g for a global variable g. Decomposed into a SAC and augmented with directional derivatives, its tangent-linear version becomes
double g, t1_g;
void t1_f(double &x, double &t1_x, double &y, double &t1_y) {
  double v0, t1_v0;
  double v1, t1_v1;
  double v2, t1_v2;
  double v3, t1_v3;
  t1_v0 = t1_x; v0 = x;
  t1_v1 = t1_g; v1 = g;
  t1_v2 = 2 * t1_v1; v2 = 2 * v1;
  t1_v3 = t1_v0 + t1_v2; v3 = v0 + v2;
  t1_y = t1_v3; y = v3;
}
Throughout this book we assume that subroutine arguments are passed by reference; this is
indicated by the & character in C++. Arrays are assumed to be passed as pointers to their
respective first entry. Issues arising from the fact that parameters are passed by value (e.g.,
in C/C++) or are marked as input-only or output-only (e.g., in Fortran) are beyond the scope
of this book. Otherwise, special treatment becomes necessary in the context of adjoint code.
Copy propagation [2] simplifies the tangent-linear code to

void t1_f(double &x, double &t1_x, double &y, double &t1_y) {
  double v2, t1_v2;
  double v3, t1_v3;
  t1_v2 = 2 * t1_g; v2 = 2 * g;
  t1_v3 = t1_x + t1_v2; v3 = x + v2;
  t1_y = t1_v3; y = v3;
}
or even further to

void t1_f(double &x, double &t1_x, double &y, double &t1_y) {
  t1_y = t1_x + 2 * t1_g;
  y = x + 2 * g;
}
[Figure 2.2: SAC variables held by the program variables over successive loop iterations: 0: x; 1: v1 = x · x; 2: x = v2 = sin(v1); 3: v1 = x · x; 4: x = v2 = sin(v1); …]
is impossible for an unknown n. The chain rule allows us to restrict the building of SAC to
static code fragments such as individual assignments or sequences thereof, which are also
referred to as basic blocks. The same program variable may represent multiple global SAC
variables. For example, in the tangent-linear code
void t1_f(int n, double &x, double &t1_x) {
  double v1, t1_v1;
  double v2, t1_v2;
  for (int i = 0; i < n; i++) {
    t1_v1 = 2 * x * t1_x; v1 = x * x;
    t1_v2 = cos(v1) * t1_v1; v2 = sin(v1);
    t1_x = t1_v2; x = v2;
  }
}
the memory location accessed through x=v2 may hold v0 , v2 , v4 , and so forth. Similarly, v1 ,
v3 , v5 , . . . are stored in v1. Refer to Figure 2.2 for graphical illustration.
Tangent-Linear Code Generation Rule 3: Interprocedural Tangent-Linear Code
Subroutine calls are simply replaced by calls of their tangent-linear versions. The correctness
of this approach follows immediately from inlining the respective tangent-linear subroutine
calls. Consider, for example, the following interprocedural version of the code used to
illustrate Rule 2:
void g(double &x) {
  x = x * x;
}
void f(int n, double &x) {
  for (int i = 0; i < n; i++) {
    g(x);
    x = sin(x);
  }
}
The square operation has been extracted into the subroutine g. In the tangent-linear code,
the call to g is simply replaced by a call to its tangent-linear version t1_g:
void t1_g(double &x, double &t1_x) {
  t1_x = 2 * x * t1_x;
  x = x * x;
}
void t1_f(int n, double &x, double &t1_x) {
  for (int i = 0; i < n; i++) {
    t1_g(x, t1_x);
    t1_x = cos(x) * t1_x;
    x = sin(x);
  }
}
Example 2.8 Consider the implementation of (1.2) given in Section 1.1.2. Figure 2.3 shows
the corresponding linearized DAG; Figure 2.4 shows its tangent-linear extension for n = 3.
A tangent-linear version of the code is the following:
void t1_f(int n, double *x, double *t1_x,
          double &y, double &t1_y) {
  t1_y = 0; y = 0;
  for (int i = 0; i < n; i++) {
    t1_y = t1_y + 2 * x[i] * t1_x[i];
    y = y + x[i] * x[i];
  }
  t1_y = 2 * y * t1_y;
  y = y * y;
}

Figure 2.3. Linearized DAG of y = (Σ_{i=0}^{n−1} x_i²)² for n = 3; a single argument v of a multiplication denotes the square operation v · v. [Vertices 0–2: x_0, x_1, x_2; vertices 3, 4, 6: squares with edge labels 2x_i; vertices 5, 7: additions with edge labels 1; vertex 8: y with edge label 2v_7.]
The following driver accumulates the gradient entry-by-entry using the tangent-linear function t1_f.

void driver(int n, double *x, double *g) {
  double y;
  double *t1_x = new double[n];
  for (int i = 0; i < n; i++) t1_x[i] = 0;
  for (int i = 0; i < n; i++) {
    t1_x[i] = 1;
    t1_f(n, x, t1_x, y, g[i]);
    t1_x[i] = 0;
  }
  delete [] t1_x;
}
A total of n calls to the tangent-linear routine is required to compute the full gradient.
[Figure 2.4: tangent-linear extension of the linearized DAG of y = (Σ_{i=0}^{n−1} x_i²)² for n = 3, augmented with the tangent-linear inputs x_0⁽¹⁾, x_1⁽¹⁾, x_2⁽¹⁾.]
Table 2.1. Run times for tangent-linear code (in seconds). n function evaluations
are compared with n evaluations of the tangent-linear code required for a full gradient
accumulation. The compiler optimizations are either switched off (g++ -O0) or the full set
of optimizations is enabled (g++ -O3); refer to the g++ manual pages for documentation
on the different optimization levels. We observe a difference of a factor R of less than 2
when comparing the run time of a single run of the tangent-linear code with that of an
original function evaluation. Full compiler optimization reduces R to about 1.2 as shown
in the rightmost column.
        |        g++ -O0           |        g++ -O3           |
  n     | 10^4   2·10^4   4·10^4   | 10^4   2·10^4   4·10^4   |  R
  f     |  0.9     3.6     13.7    |  0.2     0.8      3.1    |  1
  t1_f  |  1.4     5.6     22.2    |  0.2     0.9      3.7    |  1.2
Implementations of all relevant arithmetic operators and intrinsic functions are required, for
example,
dco_t1s_type operator*(const dco_t1s_type &x1,
                       const dco_t1s_type &x2) {
  dco_t1s_type tmp;
  tmp.v = x1.v * x2.v;
  tmp.t = x1.t * x2.v + x1.v * x2.t;
  return tmp;
}

(Footnote 6: More substantial modifications may become necessary in languages that do not have full support for object-oriented programming.)
and
dco_t1s_type sin(const dco_t1s_type &x) {
  dco_t1s_type tmp;
  tmp.v = sin(x.v);
  tmp.t = cos(x.v) * x.t;
  return tmp;
}
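For orientation, a minimal sketch of the underlying class (the complete version is in Section A.1; everything beyond the v and t members is our assumption):

class dco_t1s_type {
public:
  double v;  // function value component
  double t;  // tangent (directional derivative) component
  dco_t1s_type() : v(0), t(0) {}
  dco_t1s_type(double val) : v(val), t(0) {}  // passive values carry a zero tangent
};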
Refer to Section A.1 for a more complete version of the source code. The driver program in
Listing 2.1 uses the implementation of class dco_t1s_type to compute the gradient of (1.2)
for n = 4 at the point xi = 1 for i = 0, . . . , 3. Four evaluations of the tangent-linear routine
void f(dco_t1s_type *x, dco_t1s_type &y)
are performed with the derivative components of x initialized to the Cartesian basis vectors
in R4 .
Listing 2.1. Driver for tangent-linear code by overloading.

#include <iostream>
using namespace std;
#include "dco_t1s_type.hpp"

const int n = 4;

void f(dco_t1s_type *x, dco_t1s_type &y) {
  y = 0;
  for (int i = 0; i < n; i++) y = y + x[i] * x[i];
  y = y * y;
}

int main() {
  dco_t1s_type x[n], y;
  for (int i = 0; i < n; i++) x[i] = 1;
  for (int i = 0; i < n; i++) {
    x[i].t = 1;
    f(x, y);
    x[i].t = 0;
    cout << y.t << endl;
  }
  return 0;
}
Let class dco_t1s_type be defined in the C++ source files dco_t1s_type.hpp and
dco_t1s_type.cpp, and let the driver program be stored as main.cpp. An executable
is built by calling
$(CPPC) -c dco_t1s_type.cpp
$(CPPC) -c main.cpp
$(CPPL) -o main dco_t1s_type.o main.o
Table 2.2. Run times for tangent-linear code by overloading (in seconds). n
function evaluations are compared with n evaluations of the tangent-linear code required
for a full gradient accumulation. With the full set of compiler optimizations enabled, we
observe a factor R of less than 10 when comparing the run time of a single run of the
tangent-linear code with that of an original function evaluation in the right-most column.
The overloading solution turns out to be more than 5 times slower than the hand-written
tangent-linear code due to less effective compiler optimization of the overloaded code.
        |        g++ -O0           |        g++ -O3           |
  n     | 10^4   2·10^4   4·10^4   | 10^4   2·10^4   4·10^4   |  R
  f     |  0.9     3.6     13.7    |  0.2     0.8      3.1    |  1
  t1_f  |  7.7    29.5    116.5    |  1.9     7.1     28.3    |  9.1
where $(CPPC) and $(CPPL) should be replaced by a C++ compiler and a corresponding
linker, respectively (for example, g++). Run time measurements are reported in Table 2.2.
When computing several directional derivatives at the same time, it is favorable to
evaluate the function and its local partial derivatives only once, followed by products of the
latter with vectors of directional derivatives in correspondence with the chain rule. This
vector forward mode of AD can be implemented by overloading all intrinsic functions and
arithmetic operators for the user-defined data type
class dco_t1v_type {
public:
  double v;
  double *t;
  ...
};
The t component becomes a vector whose size is set at run time and stored in the static variable dco_t1v_type::t_length. All overloaded functions and arithmetic operators are modified accordingly, for example,
dco_t1v_type operator*(const dco_t1v_type &x1,
                       const dco_t1v_type &x2) {
  dco_t1v_type tmp;
  tmp.v = x1.v * x2.v;
  for (int i = 0; i < dco_t1v_type::t_length; i++)
    tmp.t[i] = x1.t[i] * x2.v + x1.v * x2.t[i];
  return tmp;
}
All constructors need to allocate the vector t; the destructor deallocates it. This dynamic allocation and deallocation is performed as part of each arithmetic operation or intrinsic function call, which results in a significant and potentially infeasible run-time overhead. Alternatively, one may choose to allocate t statically. A recompilation of dco_t1v_type may be necessary whenever the required size of t changes. As usual in AD, the implementation of an efficient library (here for vector forward mode) needs to take into account the characteristics of the given computing platform including hardware (CPU, memory hierarchy, I/O system) and system software (operating system, memory management, native compiler, run-time support libraries).
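A hedged sketch of the dynamic allocation scheme just described (our reconstruction; the actual implementation may differ):

class dco_t1v_type {
public:
  static int t_length;  // number of directional derivatives, set at run time
  double v;
  double *t;
  dco_t1v_type() : v(0), t(new double[t_length]()) {}
  dco_t1v_type(const dco_t1v_type &o) : v(o.v), t(new double[t_length]) {
    for (int i = 0; i < t_length; i++) t[i] = o.t[i];
  }
  dco_t1v_type& operator=(const dco_t1v_type &o) {
    v = o.v;
    for (int i = 0; i < t_length; i++) t[i] = o.t[i];
    return *this;
  }
  ~dco_t1v_type() { delete [] t; }  // deallocation on every destruction
};
int dco_t1v_type::t_length = 1;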
2.1.3
$$\frac{\partial F}{\partial x_j}(x) \approx \frac{F(x + h \, e^j) - F(x)}{h}. \qquad (2.3)$$
In order to minimize the number of function evaluations required, we intend to avoid the
computation of statically known entries. Without loss of generality, such entries are assumed
to be zeros. Hence, knowledge of the sparsity pattern of the Jacobian is the key to all results
in this section.
Potential savings in the run time of Jacobian accumulation result from the observation that the columns of the Jacobian can be partitioned into mutually disjoint subsets I_1, …, I_l of the column index set {0, …, n − 1}. Any two columns u ≡ ∇F_{·,i} and v ≡ ∇F_{·,j} that belong to the same subset I_h, h ∈ {1, …, l}, are assumed to be structurally orthogonal; that is, ∀k : u_k = 0 ∨ v_k = 0. Hence the desired entries of all columns in a given set I_j can be computed simultaneously as

$$\nabla F_{\cdot, I_j} \approx \frac{F \left( x + h \sum_{i \in I_j} e^i \right) - F(x)}{h}.$$
As a relevant example, the authors of [21] discuss band matrices of band width w. Obviously, any two columns ∇F_{·,i} and ∇F_{·,j} with |j − i| > w are structurally orthogonal. Moreover, a sequential greedy approach is proposed as a heuristic for determining a feasible (structurally orthogonal but not necessarily optimal in terms of a minimal value for the number of index sets l) partitioning of the columns. When considering the ith column, the remaining columns ∇F_{·,j}, i < j < n, are tested for structural orthogonality with ∇F_{·,i}. If the test is successful, then j becomes part of the same partition. This procedure is run iteratively for increasing i = 0, 1, … until no more columns remain unassigned.
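A hedged sketch of this sequential greedy heuristic (our code; pattern[k][i] flags a structural nonzero in row k of column i, and color[i] returns the index of the subset containing column i):

#include <vector>

std::vector<int> greedy_partition(const std::vector<std::vector<bool>> &pattern,
                                  int m, int n) {
  std::vector<int> color(n, -1);
  int next = 0;
  for (int i = 0; i < n; i++) {
    if (color[i] != -1) continue;
    color[i] = next;
    for (int j = i + 1; j < n; j++) {
      if (color[j] != -1) continue;
      bool orthogonal = true;  // j must not share a nonzero row with the subset
      for (int k = 0; k < m && orthogonal; k++) {
        if (!pattern[k][j]) continue;
        for (int c = 0; c < n; c++)
          if (color[c] == next && pattern[k][c]) { orthogonal = false; break; }
      }
      if (orthogonal) color[j] = next;
    }
    next++;
  }
  return color;  // applied to (2.4) this yields I1 = {0,1}, I2 = {2}, I3 = {3}
}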
An obvious lower bound for l is the maximum number of nonzero elements in any
single row in the Jacobian. Sequential partitioning reaches this lower bound for band
matrices. However, as shown in the following example, it does not produce an optimal
partitioning in all cases.
Example 2.9 Sequential partitioning applied to

$$\nabla F = \begin{pmatrix}
a_{0,0} & 0 & 0 & a_{0,3} \\
0 & 0 & a_{1,2} & a_{1,3} \\
0 & a_{2,1} & a_{2,2} & 0
\end{pmatrix} \qquad (2.4)$$
results in I1 = {0, 1}, I2 = {2}, and I3 = {3}. A better solution is to partition as I1 = {0, 2}
and I2 = {1, 3}.
The column partitioning problem applies both to finite difference approximation of the
Jacobian and to Jacobian accumulation in tangent-linear mode AD. With the latter, we aim
to compute

$$B_t = A \cdot S_t, \qquad (2.5)$$

where S_t ∈ {0, 1}^{n×l_t} and B_t ∈ R^{m×l_t} are referred to as the seed matrix and the compressed Jacobian, respectively. The number of columns in S_t is denoted by l_t (l_a will be used for adjoint seeding). The term harvesting refers to the retrieval of the uncompressed Jacobian matrix. Harvesting is performed by solving the system of simultaneous linear equations in (2.5). For direct methods, the solution is obtained by a simple substitution procedure. The combinatorial problem is to find a minimal l_t; the resulting coloring problems on various graph representations of the Jacobian are discussed in detail in [30]. The coloring problem is known to be NP-complete [28], which makes heuristics the preferred approach to the determination of a feasible, and hopefully close to optimal, partitioning of the columns of the Jacobian.
Example 2.10 Suppose that (2.4) results from a function F : R⁴ → R³ implemented as

void f(int n, double *x, int m, double *y)

with tangent-linear version t1_f generated by forward mode AD. The following example driver for computing the compressed Jacobian B_t uses the column partitioning I_1 = {0, 2} and I_2 = {1, 3}.
int main() {
  double x[4] = ...;
  double y[3], t1_y[3];
  {
    double t1_x[4] = {1, 0, 1, 0};
    t1_f(4, x, t1_x, 3, y, t1_y); // columns 0 and 2
  }
  ...
  {
    double t1_x[4] = {0, 1, 0, 1};
    t1_f(4, x, t1_x, 3, y, t1_y); // columns 1 and 3
  }
  ...
}
For the known sparsity pattern of ∇F and the resulting seed matrix S_t, all unknown nonzero entries x_{j,i} are obtained by simple substitution from

$$\begin{pmatrix}
x_{0,0} & 0 & 0 & x_{0,3} \\
0 & 0 & x_{1,2} & x_{1,3} \\
0 & x_{2,1} & x_{2,2} & 0
\end{pmatrix} \cdot
\begin{pmatrix}
1 & 0 \\ 0 & 1 \\ 1 & 0 \\ 0 & 1
\end{pmatrix} =
\begin{pmatrix}
a_{0,0} & a_{0,3} \\
a_{1,2} & a_{1,3} \\
a_{2,2} & a_{2,1}
\end{pmatrix}.$$
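Harvesting is then a matter of reading off the compressed entries (our sketch; Bt denotes the 3 × 2 compressed Jacobian returned by the two t1_f calls above, and a the 3 × 4 Jacobian):

// column of Bt from seed e0 + e2:
a[0][0] = Bt[0][0];  a[1][2] = Bt[1][0];  a[2][2] = Bt[2][0];
// column of Bt from seed e1 + e3:
a[0][3] = Bt[0][1];  a[1][3] = Bt[1][1];  a[2][1] = Bt[2][1];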
$$\langle x_{(1)}, x^{(1)} \rangle_{\mathbb{R}^n}
= \sum_{i=0}^{n-1} x_i^{(1)} \cdot \sum_{j=0}^{m-1} a_{i,j}^T \, y_{(1)_j}
\overset{[D]}{=} \sum_{i=0}^{n-1} \sum_{j=0}^{m-1} x_i^{(1)} \, a_{i,j}^T \, y_{(1)_j}
\overset{[K+]}{=} \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} x_i^{(1)} \, a_{i,j}^T \, y_{(1)_j}$$

$$\overset{[K\cdot]}{=} \sum_{j=0}^{m-1} \sum_{i=0}^{n-1} y_{(1)_j} \, a_{i,j}^T \, x_i^{(1)}
\overset{[D]}{=} \sum_{j=0}^{m-1} y_{(1)_j} \sum_{i=0}^{n-1} a_{i,j}^T \, x_i^{(1)}
\overset{[(A^T)^T]}{=} \sum_{j=0}^{m-1} y_{(1)_j} \sum_{i=0}^{n-1} a_{j,i} \, x_i^{(1)}
= \langle y_{(1)}, \nabla F(x) \cdot x^{(1)} \rangle_{\mathbb{R}^m}.$$

We have used distributivity [D], commutativity of addition [K+] and multiplication [K·], and the fact that (Aᵀ)ᵀ = A. With (2.1) and (2.7) it follows that ⟨x₍₁₎, x⁽¹⁾⟩_{R^n} = ⟨y₍₁₎, y⁽¹⁾⟩_{R^m}.
An immediate consequence of Definition 2.11 is the following: If the adjoint of the output, y₍₁₎, is chosen orthogonal to the directional derivative y⁽¹⁾ = ∇F(x) · x⁽¹⁾, then the adjoint of the input, x₍₁₎ = ∇F(x)ᵀ · y₍₁₎, is orthogonal to x⁽¹⁾.
Definition 2.13. The Jacobian A = ∇F(x) induces a linear mapping R^m → R^n defined by y₍₁₎ ↦ Aᵀ · y₍₁₎. The function F₍₁₎ : R^{n+m} → R^n defined as

$$x_{(1)} = F_{(1)}(x, y_{(1)}) \equiv \nabla F(x)^T \cdot y_{(1)} \qquad (2.7)$$
is called the adjoint model of F. With an auxiliary scalar t and y₍₁₎ ≡ (∂t/∂y)ᵀ we have

$$x_{(1)} \equiv \left( \frac{\partial t}{\partial x} \right)^T
= \left( \frac{\partial t}{\partial y} \cdot \frac{\partial y}{\partial x} \right)^T
= \left( \frac{\partial y}{\partial x} \right)^T \cdot \left( \frac{\partial t}{\partial y} \right)^T
= \nabla F(x)^T \cdot y_{(1)}.$$
A graphical illustration in the form of the adjoint extension of the linearized DAG for
y = F (x) is shown in Figure 2.5. The adjoint extension of the linearized DAG of (1.2) is
displayed in Figure 2.6.
The derivative code compiler dcc transforms the given implementation

void f(int n, int m, double *x, double *y)

of the function y = F(x) into adjoint code (y, x₍₁₎, y₍₁₎) = F₍₁₎(x, x₍₁₎, y₍₁₎), which computes

y = F(x),
x₍₁₎ = x₍₁₎ + ∇F(x)ᵀ · y₍₁₎,
y₍₁₎ = 0.
Figure 2.5. Adjoint extension of the linearized DAG for y = F(x).

[Figure 2.6: adjoint extension of the linearized DAG of y = (Σ_{i=0}^{n−1} x_i²)², augmented with the auxiliary variable t and the adjoint input y₍₁₎.]
Subscripts of adjoint subroutine names and adjoint variable names are replaced with the prefix a1_, such as v₍₁₎ ↦ a1_v. The entire Jacobian is accumulated by letting y₍₁₎ range over the Cartesian basis vectors in R^m. There is no approximate model for adjoints as there is for directional derivatives in the form of finite differences.
The product

$$y = \prod_{i=0}^{n-1} x_i$$

illustrates the ability of the reverse mode to compute gradients cheaply (with a computational cost that exceeds that of the pure function evaluation by a small constant factor). This function is known as Speelpenning's example and is used extensively to illustrate the power of reverse mode AD. Equation (1.2) exhibits similar properties while being better suited for the discussion of unconstrained nonlinear optimization methods in Chapter 1.
Let us take a closer look at the structure of the adjoint code that is generated by reverse
mode AD.
Theorem 2.14. For given adjoints of the dependent variables, (nonincremental) reverse mode AD propagates adjoints backwards through the SAC as follows:

For j = n, …, n + p + m − 1:

$$v_j = \varphi_j \left( v_i \right)_{i \prec j};$$

for i = n + p − 1, …, 0:

$$v_{(1)_i} = \sum_{j \,:\, i \prec j} \frac{\partial \varphi_j}{\partial v_i} \cdot v_{(1)_j}. \qquad (2.8)$$
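For instance (our illustration), the SAC statement v₂ = sin(v₁) contributes to (2.8) as follows: since v₁ enters only φ₂, the sum collapses to

$$v_{(1)_1} = \frac{\partial \varphi_2}{\partial v_1} \cdot v_{(1)_2} = \cos(v_1) \cdot v_{(1)_2}.$$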
For illustration, consider again a function computing y = x + 2 · g for a global variable g. Its adjoint version becomes
double g, a1_g;
void a1_f(double &x, double &a1_x, double &y, double &a1_y) {
  double v2, a1_v2;
  double v3, a1_v3;
  // forward section
  v2 = 2 * g; v3 = x + v2; y = v3;
  // reverse section
  a1_v3 = a1_y;
  a1_v2 = a1_v3;
  a1_x = a1_v3;
  a1_g = 2 * a1_v2;
}
In the forward section of the adjoint code, the SAC computes all intermediate values that
enter the computation of the local partial derivatives in (2.8). For the linear function given
above the values of the SAC variables (v2, v3) are not required for the evaluation of the
constant local partial derivatives. Hence, the construction of the SAC in the forward section
is actually obsolete in this particular case.
Adjoints of all SAC and program variables (a1_v3, a1_v2, a1_x, a1_g) are computed
as a function of the adjoint output a1_y in the reverse section of the adjoint code. Copy
propagation combined with the observation from the previous paragraph yields an optimized
version of this code as follows:
double g, a1_g;
void a1_f(double &x, double &a1_x, double &y, double &a1_y) {
  // forward section
  y = x + 2 * g;
  // reverse section
  a1_x = a1_y; a1_g = 2 * a1_y;
}
Adjoint Code Generation Rule 2: Increment and Reset Adjoint Program Variables
Consider the following implementation of a function f : R² → R²:
void f(double *x, double *y) {
  y[0] = sin(x[0]); y[1] = x[0] * x[1];
}
Its DAG is shown in Figure 2.7. The program variable x[0] appears on the right-hand side
of the two assignments. In both cases, it represents the same mathematical variable (node
[Figure 2.7: DAG with vertices v0: x[0], v1: x[1], and v3: y[1] = x[0] · x[1].]

Figure 2.7. Adjoint Code Generation Rule 2: Increment adjoint program variables.
in the DAG). Hence, the adjoint versions of both assignments contribute to the adjoint
a1_x[0]. An implementation of the current definition of reverse mode AD may require
access to information about two or more assignments when generating code for computing
the adjoint of some program variable. In the current example, to generate
a1_x[0] = cos(x[0]) * a1_y[0] + x[1] * a1_y[1];
we need to access the adjoints of both left-hand sides (a1_y [0], a1_y[1]) in addition to the
arguments (x [0], x[1] ) of the corresponding partial derivatives. Note that, in general, these
assignments may not lie in close proximity in the code. Ideally, we would like to find a
method that allows us to process the original assignments in strictly reverse order. That is,
each adjoint assignment should spread its contributions to the adjoints of its right-hand side
arguments instead of adjoint program variables having to collect them. The result is the
following incremental adjoint code:
void a1_f(double *x, double *a1_x, double *y, double *a1_y) {
  double v2, a1_v2;
  double v3, a1_v3;
  // forward section
  v2 = sin(x[0]);
  y[0] = v2;
  v3 = x[0] * x[1];
  y[1] = v3;
  // reverse section
  a1_v3 = a1_y[1]; a1_y[1] = 0;
  a1_x[0] += x[1] * a1_v3;
  a1_x[1] += x[0] * a1_v3;
  a1_v2 = a1_y[0]; a1_y[0] = 0;
  a1_x[0] += cos(x[0]) * a1_v2;
}
The auxiliary variables (v2, v3) are each used exactly once. Consequently, their adjoints
(a1_v2, a1_v3) are defined exactly once and hence do not need to be incremented. To
avoid incrementation of invalid values, adjoints of program variables on left-hand sides of
assignments need to be reset to zero after the corresponding adjoint assignments. Refer to
the example used to explain Adjoint Code Generation Rule 4 for an illustration. Adjoint
inputs (a1_x) are expected to be initialized by the calling routine. Further optimization of
the adjoint code yields
void a1_f(double *x, double *a1_x, double *y, double *a1_y) {
  // forward section
  y[0] = sin(x[0]);
  y[1] = x[0] * x[1];
  // reverse section
  a1_x[0] += x[1] * a1_y[1];
  a1_x[1] += x[0] * a1_y[1]; a1_y[1] = 0;
  a1_x[0] += cos(x[0]) * a1_y[0]; a1_y[0] = 0;
}
Consider next an implementation of x = cos(2 · x) (2.9) in which the intermediate value 2 · x is stored in a local variable z.
While the local variable z is in fact obsolete in this simple example, in a more complex
situation it may well be used by subsequent computations. We keep it simple for the sake of
clarity. Mechanical application of incremental reverse mode yields the following incorrect
adjoint code:
void a1_f(double &x, double a1_x) {
  double z, a1_z = 0;
  double v1, a1_v1;
  double v2, a1_v2;
  // forward section
  v1 = 2 * x;
  z = v1;
  v2 = cos(z);
  x = v2;
  // reverse section
  a1_v2 = a1_x;
  a1_z += -sin(z) * a1_v2;
  a1_v1 = a1_z;
  a1_x += 2 * a1_v1;
}
[Figure 2.8: DAG with vertices v0: x, v1: z = 2 · x, and v2: x = cos(z).]

Figure 2.8. Adjoint Code Generation Rule 2: Reset adjoint program variables.
Adjoint program variables (a1_x, a1_z) are incremented. Adjoint local variables are initialized to zero (a1_z=0). Nevertheless, this code returns the wrong adjoint. The problem is the incrementation of a1_x by the last assignment. Refer to Figure 2.8 for illustration. In its current form, the code preserves the value of a1_x ≡ v₍₁₎₂. However, according to (2.9) the last assignment is assumed to increment a1_x ≡ v₍₁₎₀ = 0, which fails because v₍₁₎₀ and v₍₁₎₂ share the same memory location. A feasible solution sets a1_x=0 immediately after the adjoint of x=v2. The correct adjoint code becomes
void a1_f(double &x, double a1_x) {
  ...
  // reverse section
  a1_v2 = a1_x; a1_x = 0;
  ...
}
Application of Adjoint Code Generation Rules 1 and 2 yields the following incorrect adjoint
code:
 1 void a1_f(double &x, double &a1_x) {
 2   // SAC variables
 3   double v = 0, a1_v = 0;
 4
 5   // forward section
 6   v = x * x;
 7   x = sin(v);
 8
 9   // reverse section
10   a1_v = cos(v) * a1_x; a1_x = 0;
11   a1_x += 2 * x * a1_v;
12 }
The problem lies in line 11, where the partial derivative of x*x is evaluated incorrectly. The reason is not the expression itself; 2*x is certainly correct. However, the value of x at this point is not what it should be due to overwriting in line 7. When evaluating the local partial derivative of the right-hand side of the assignment in line 6, we need the value of x before it is overwritten. Our preferred solution is to augment the forward section with statements that push any required (by the reverse section) value onto a stack before it is overwritten by the following assignment. We use stacks from the C++ standard library [42]. A single stack entry is required for our simple example, which yields
 1 void a1_f(double &x, double &a1_x) {
 2   // SAC variables
 3   double v = 0, a1_v = 0;
 4
 5   // augmented forward section
 6   v = x * x;
 7   required_double.push(x); x = sin(v);
 8
 9   // reverse section
10   x = required_double.top(); required_double.pop();
11   a1_v = cos(v) * a1_x; a1_x = 0;
12   a1_x += 2 * x * a1_v;
13 }
Values of different data types may need to be stored within the augmented forward section.
Floating-point values as well as integers and values of other data types may be required for
a correct evaluation of the reverse section of the adjoint code. Consequently, several typed
stacks may have to be provided. They are referred to as required [double, integer, . . .] data
stacks. As an immediate consequence of Theorem 2.14, data access is guaranteed to be
LIFO (Last In First Out), making stacks the preferred data structure.
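A minimal sketch of the global stack declarations assumed by the listings in this section (the names match the code; the declarations themselves are our reconstruction):

#include <stack>

std::stack<double> required_double;   // overwritten floating-point values
std::stack<int>    required_integer;  // overwritten integer values
std::stack<double> result_double;     // result checkpoint
std::stack<int>    control;           // control flow indices (used from Rule 5 onward)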
While these changes ensure that the adjoint is computed correctly, there is still one
more problem to solve. As a result of restoring the input value of x in line 10, an incorrect
function value is returned. If the latter is not used by the enclosing computation, then no
further action is required. Otherwise, the function value(s) should be stored at the end of the
augmented forward section, and subsequently should be restored at the end of the adjoint
code. To this end, a result checkpoint is written. A correct adjoint code that fully satisfies
all requirements is the following:
void a1_f(double &x, double &a1_x) {
  // SAC variables
  double v = 0, a1_v = 0;
  // augmented forward section
  v = x * x;
  required_double.push(x); x = sin(v);
  // store result
  result_double.push(x);
  // reverse section
  x = required_double.top(); required_double.pop();
  a1_v = cos(v) * a1_x; a1_x = 0;
  a1_x += 2 * x * a1_v;
  // restore result
  x = result_double.top(); result_double.pop();
}
The value of v that is overwritten in line 8 is required by the adjoint of the assignment in
line 7. Hence this value needs to be stored in addition to the instances of x in lines 7 and 9.
Note that without resetting a1_x to zero in lines 16 and 20 the wrong base values would be
incremented in lines 18 and 21.
The amount of memory occupied by the required double data stack can be reduced by
moving the construction of the assignment-level SACs to the reverse section of the adjoint
code as follows:
 1 void a1_f(double &x, double &a1_x) {
 2   // SAC variables
 3   double v = 0, a1_v = 0;
 4
 5   // augmented forward section
 6   required_double.push(x); x = sin(x * x);
 7   required_double.push(x); x = sin(x * x);
 8
 9   // store result
10   result_double.push(x);
11
12   // reverse section
13   x = required_double.top(); required_double.pop();
14   v = x * x; // incomplete SAC
15   a1_v = cos(v) * a1_x; a1_x = 0;
16   a1_x += 2 * x * a1_v;
17   x = required_double.top(); required_double.pop();
18   v = x * x; // incomplete SAC
19   a1_v = cos(v) * a1_x; a1_x = 0;
20   a1_x += 2 * x * a1_v;
21
22   // restore result checkpoint
23   x = result_double.top(); result_double.pop();
24 }
The forward sweep consists of the original statements augmented with code for storing all required data (lines 6 and 7). Adjoint versions are generated for all original assignments in reverse order; the corresponding stored data is recovered (lines 13 and 17), followed by the execution of the incomplete SACs (lines 14 and 18). The final assignment to the original right-hand side is omitted, as it would undo the previous recovery of the required values for x and thus lead to potentially incorrect adjoints. The adjoint statements (lines 15-16 and 19-20) remain unchanged.
For this simple example the savings in memory occupied by the required data stack are not impressive, but they are likely to be significant for larger code. Further replication of the assignment x=sin(x*x) allows us to save a factor of 2, asymptotically. This number grows with growing right-hand sides of assignments.
Adjoint Code Generation Rule 5: Intraprocedural Control Flow Reversal
Consider the following implementation of (1.2):
1 void f(int n, double *x, double &y) {
2   int i = 0;
3   y = 0;
4   while (i < n) {
5     y = y + x[i] * x[i];
6     i = i + 1;
7   }
8   y = y * y;
9 }
For any given value of n, the loop in line 4 can be unrolled, and the adjoint of the resulting straight-line code can be built according to Adjoint Code Generation Rules 1-4. A general-purpose adjoint code needs to be valid for arbitrary n. The order in which the assignments are executed in the original program for any set of inputs (n and x in this case) needs to be reversed. While this information is rather easily extracted from the given example code, the solution to this problem may be less straightforward for larger programs. A generally valid algorithmic approach is required.
The simplest way to reverse the order of all executed assignments is to enumerate them in the augmented forward section followed by pushing their respective indices onto a control flow stack, for example, stack<int> control. All indices are retrieved in LIFO order in the reverse section and the corresponding adjoint statements are executed. For example, the assignments in lines 3, 5, 6, and 8 receive indices 0, 1, 2, and 3, respectively, yielding the following adjoint code:
void a1_f(int n, double *x, double *a1_x,
          double &y, double a1_y) {
  int i = 0;
  // augmented forward sweep
  control.push(0); y = 0;
  while (i < n) {
    control.push(1); y = y + x[i] * x[i];
    control.push(2); required_integer.push(i); i = i + 1;
  }
  control.push(3); required_double.push(y); y = y * y;
  // store result
  result_double.push(y);
  // reverse sweep
  while (!control.empty()) {
    if (control.top() == 0)
      a1_y = 0;
    else if (control.top() == 1)
      a1_x[i] += 2 * x[i] * a1_y;
    else if (control.top() == 2) {
      i = required_integer.top();
      required_integer.pop();
    }
    else if (control.top() == 3) {
      y = required_double.top();
      required_double.pop();
      a1_y = 2 * y * a1_y;
    }
    control.pop();
  }
  // restore result
  y = result_double.top(); result_double.pop();
}
The individual adjoint statements are constructed according to Adjoint Code Generation Rules 1-4 followed by some obvious code optimizations. For example, the adjoint of y=y+x[i]*x[i] is constructed from
1 // incomplete SAC
2 v1 = x[i] * x[i];
3 v2 = y + v1;
4 // adjoint statements
5 a1_v2 = a1_y; a1_y = 0;
6 a1_v1 = a1_v2; a1_y += a1_v2;
7 a1_x[i] += 2 * x[i] * a1_v1;
Neither v1 nor v2 is used by the adjoint statements, making lines 2 and 3 obsolete. Copy propagation in lines 5-7 yields a1_x[i]+=2*x[i]*a1_y. The driver calls a1_f once to compute the entire gradient.
void driver(int n, double *x, double *g) {
  double y;
  for (int i = 0; i < n; i++) g[i] = 0;
  a1_f(n, x, g, y, 1);
}
The size of the control stack can be significantly reduced by enumerating basic blocks
instead of individual assignments. Reversing the order of the assignments within a basic
block is trivial. Consequently, the augmented forward section of our example code becomes
control.push(0); y = 0;
while (i < n) {
  control.push(1); y = y + x[i] * x[i];
  required_integer.push(i); i = i + 1;
}
control.push(2); required_double.push(y); y = y * y;
    a1_y = 2 * y * a1_y;
  }
  control.pop();
}
The major advantages of the basic block enumeration method are its relative simplicity and
its applicability to arbitrary flow of control. Older legacy code in particular sometimes
makes excessive use of goto statements, thus making it hard or even impossible to identify
loops in the flow of control at compile time. Further improvements are possible for reducible
flow of control [2], which allows for all loops to be detected at compile time, potentially
followed by a syntactic modification of the source code to make such loops explicit. In this
case, the number of iterations can be counted for each loop within the augmented forward
section. The adjoint loop is constructed to perform the same number of executions of the
adjoint loop body within the reverse section as illustrated below.
void a1_f(int n, double *x, double *a1_x,
          double &y, double a1_y) {
  int i = 0;
  // augmented forward sweep
  y = 0;
  int loop_counter = 0;
  while (i < n) {
    y = y + x[i] * x[i];
    required_integer.push(i); i = i + 1;
    loop_counter++;
  }
  control.push(loop_counter);
  required_double.push(y); y = y * y;
  // store result
  result_double.push(y);
  // reverse sweep
  y = required_double.top();
  required_double.pop();
  a1_y = 2 * y * a1_y;
  loop_counter = 0;
  while (loop_counter < control.top()) {
    i = required_integer.top();
    required_integer.pop();
    a1_x[i] += 2 * x[i] * a1_y;
    loop_counter++;
  }
  control.pop();
  a1_y = 0;
  // restore result
  y = result_double.top();
  result_double.pop();
}
The memory savings are dramatic if, as in this case, the loop body is a basic block. Instead
of storing n copies of the basic block index (1 in our case) we need only store a single integer
that represents the number of iterations actually performed. If the loop body is not a basic
block, then the savings due to counting loop iterations are mostly insignificant. Moreover,
special care must be taken when considering nontrivial control flow constructs involving
nested loops and branches. Refer to the AD tool TAPENADE [52] for an implementation
of control flow reversal by counting loops and enumerating branches.
Sometimes the semantics of certain syntactic structures can be exploited for the generation of optimized adjoint code. For example, the reversal of simple for-loops in C/C++ such as

for (int i = 0; i < n; i++) ...

can be implemented by an inverse for-loop starting with the target index value n − 1 and decrementing the counter i down to the start value 0:

for (int i = n - 1; i >= 0; i--) ...
This technique is illustrated in the following implementation of an adjoint for (1.2). Note
that even though the loop index is required by the adjoint code of the loop body, instead of
saving it after each loop iteration, its required values are generated explicitly by the reversed
loop.
 1 void a1_f(int n, double *x, double *a1_x,
 2           double &y, double a1_y) {
 3   // augmented forward section
 4   y = 0;
 5   for (int i = 0; i < n; i++) y = y + x[i] * x[i];
 6   required_double.push(y); y = y * y;
 7
 8   // store result
 9   result_double.push(y);
10
11   // reverse section
12   y = required_double.top(); required_double.pop();
13   a1_y = 2 * y * a1_y;
14   for (int i = n - 1; i >= 0; i--)
15     a1_x[i] = a1_x[i] + 2 * x[i] * a1_y;
16
17   // restore result
18   y = result_double.top(); result_double.pop();
19 }
The right-hand side value of y in line 6 is required in line 13 for the evaluation of the partial derivative of y*y; it is stored on the required double data stack required_double. If a1_f is not required to return the correct function value y computed in line 6, then lines 6-9 in the augmented forward section and lines 12 and 17-18 of the reverse section can be removed from the adjoint code. Thus, for this simple example the additional memory requirement of the adjoint code can be reduced to zero when compared with the tangent-linear code.
Run time results for the different approaches to intraprocedural control flow reversal
are reported in Table 2.3.
Table 2.3. Run times for adjoint code (in seconds). n function evaluations are compared with n evaluations of the adjoint code. The convenience of dynamic memory management provided by the C++ standard library is paid for with a considerable increase of this factor, both in version v.1 (enumerate basic blocks) and in version v.2 (count loop iterations). Version v.3 (explicit for-loop reversal) avoids stack accesses for the most part and takes less than R = 3 times the run time of a function evaluation (see rightmost column), which is close to optimal. Missing compiler optimization yields an (often dramatic) increase in the observed run-time ratio.
             |        g++ -O0           |        g++ -O3           |
  n          | 10^4   2·10^4   4·10^4   | 10^4   2·10^4   4·10^4   |   R
  f          |  0.9     3.6     13.7    |  0.2     0.8      3.1    |   1
  a1_f (v.1) | 46.5   186.2    746.0    |  2.4     9.5     39.0    |  12.6
  a1_f (v.2) | 15.1    65.3    243.1    |  1.0     3.9     15.1    |   4.9
  a1_f (v.3) |  2.1     8.3     33.3    |  0.4     1.8      7.0    |   2.3
In general, the optimization of adjoint code combines elements from both classical
compiler theory and numerical analysis. Determining whether some overwritten value is
required or not may not be entirely straightforward for complex programs. Conservatively,
one may decide to store all overwritten values on the corresponding stacks; the resulting
memory requirement is likely to become infeasible. Various program analysis techniques
have been developed to identify a minimal set of required values; see, for example, [37,
38]. Refer to [31] for an alternative approach to adjoint code generation based on the
recomputation of required values from selected checkpoints. Similar ideas will be exploited
in Section 2.3 for the optimization of interprocedural adjoint code.
Adjoint Code Generation Rule 6: Interprocedural Adjoint Code
Particular attention must be paid to the scope of variables. When a variable v leaves its scope, the corresponding memory can be reassigned by the compiler to other variables. The value of v, which may be required, is lost.
Conceptually, the generation of interprocedural adjoint code does not pose any further difficulties. For illustration, we split the computation of x = sinⁿ(x) ≡ sin(sin(… (sin(x)) …)) implemented as

void f(int n, double &x) {
  int i = 0;
  while (i < n) {
    x = sin(x);
    i = i + 1;
  }
}
70
v o i d g ( i n t n , d o u b l e& x ) {
double l ;
f o r ( i n t i = 0 ; i <n ; i ++) {
l =x ;
x= s i n ( l ) ;
}
}
v o i d f ( i n t n , d o u b l e& x ) {
i n t n1 , n2 , n3 ;
n1=n / 3 ; n2=n / 3 ; n3=nn1n2 ;
f o r ( i n t i = 0 ; i <n1 ; i ++) x= s i n ( x ) ;
g ( n2 , x ) ;
f o r ( i n t i = 0 ; i <n3 ; i ++) x= s i n ( x ) ;
}
The local variable l , which is actually obsolete, has been added to g for illustration of the
impact of variable scopes on the adjoint code.
The augmented forward section of f records the overwritten required values of x.
Basic block enumeration is not necessary, as the intraprocedural flow of control is described
entirely by the two simple for-loops. The augmented forward section of g needs to be
executed as part of the augmented forward section of f . Therefore, the augmented forward
and reverse sections are separated in g and can be called individually by setting the first
integer argument of a1_g equal to 1 (augmented forward section) or 2 (reverse section); see
line 9 in the following code listing:
1 v o i d a 1 _ f ( i n t n , d o u b l e& x , d o u b l e& a1_x ) {
2
i n t n1 , n2 , n3 ;
3
/ / augmented forward s e c t i o n
4
n1=n / 3 ; n2=n / 3 ; n3=nn1n2 ;
5
f o r ( i n t i = 0 ; i <n1 ; i ++) {
6
r e q u i r e d _ d o u b l e . push ( x ) ;
7
x= s i n ( x ) ;
8
}
9
a1_g ( 1 , n2 , x , a1_x ) ;
10
f o r ( i n t i = 0 ; i <n3 ; i ++) {
11
r e q u i r e d _ d o u b l e . push ( x ) ;
12
x= s i n ( x ) ;
13
}
The adjoint of the last loop is executed first within the reverse section (lines 25 of the
following code listing). It is followed in line 6 by a call of the reverse section of a1_g.
Finally, in lines 710, the adjoint of the originally first loop is executed.
1
2
3
4
5
6
7
/ / reverse section
f o r ( i n t i =n3 1; i >=0; i ) {
x= r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
a1_x = c o s ( x ) a1_x ;
}
a1_g ( 2 , n2 , x , a1_x ) ;
f o r ( i n t i =n1 1; i >=0; i ) {
71
r e q u i r e d _ d o u b l e . pop ( ) ;
The following adjoint version of g separates the augmented forward and reverse sections
(lines 511 and lines 1319, respectively). An integer parameter mode is used to choose
between them. The value of l overwritten in line 8 is required in line 16 by the partial
derivative of sin ( l ) ; l is stored in line 7 and restored in line 17.
1 v o i d a1_g ( i n t mode , i n t n , d o u b l e& x , d o u b l e& a1_x ) {
2
d o u b l e l =0 , a 1 _ l = 0 ;
3
i n t i =0;
4
i f ( mode ==1) {
5
/ / augmented forward s e c t i o n
6
f o r ( i n t i = 0 ; i <n ; i ++) {
7
r e q u i r e d _ d o u b l e . push ( l ) ;
8
l =x ;
9
x= s i n ( l ) ;
10
}
11
r e q u i r e d _ d o u b l e . push ( l ) ;
12
} e l s e i f ( mode ==2) {
13
/ / reverse section
14
l = r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
15
f o r ( i n t i =n 1; i >=0; i ) {
16
a 1 _ l = a 1 _ l + c o s ( l ) a1_x ; a1_x = 0 ;
17
l = r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
18
a1_x = a1_x + a 1 _ l ; a 1 _ l = 0 ;
19
}
20
}
21 }
The storage of the value of l in line 11 and the subsequent recovery in line 14 are necessary
because l leaves its scope after the execution of the augmented forward section. The value
of l required to compute the correct local partial derivative of the last execution of x=sin( l )
in line 9 would otherwise be lost.
72
As in forward mode, an augmented data type is defined to replace the type of every active
floating-point variable. The corresponding class dco_a1s_type (dcos adjoint 1st-order
scalar type) contains the virtual address va (position in tape) of the current variable in
addition to its value v.
class dco_a1s_type {
public :
i n t va ;
double v ;
d c o _ a 1 s _ t y p e ( ) : va ( DCO_A1S_UNDEF ) , v ( 0 ) { } ;
d c o _ a 1 s _ t y p e ( c o n s t d o u b l e &) ;
d c o _ a 1 s _ t y p e& o p e r a t o r = ( c o n s t d c o _ a 1 s _ t y p e &) ;
};
Special constructors and a custom assignment operator are required. The latter either handles
a self-assignment or generates a new tape entry with corresponding operation code and with
copies of the right-hand sides value and virtual address. A global virtual address counter
dco_a1s_vac is used to populate the tape.
d c o _ a 1 s _ t y p e& d c o _ a 1 s _ t y p e : : o p e r a t o r = ( c o n s t d c o _ a 1 s _ t y p e& x ) {
i f ( t h i s ==&x ) r e t u r n t h i s ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . oc =DCO_A1S_ASG ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . v=v=x . v ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . a r g 1 =x . va ;
va = d c o _ a 1 s _ v a c ++;
return t h i s ;
}
All arithmetic operators and intrinsic functions make similar recordings on the tape, for
example,
d c o _ a 1 s _ t y p e o p e r a t o r ( c o n s t d c o _ a 1 s _ t y p e& x1 ,
c o n s t d c o _ a 1 s _ t y p e& x2 ) {
d c o _ a 1 s _ t y p e tmp ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . oc =DCO_A1S_MUL;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . a r g 1 =x1 . va ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . a r g 2 =x2 . va ;
73
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . v=tmp . v=x1 . v x2 . v ;
tmp . va = d c o _ a 1 s _ v a c ++;
r e t u r n tmp ;
}
and
d c o _ a 1 s _ t y p e s i n ( c o n s t d c o _ a 1 s _ t y p e& x ) {
d c o _ a 1 s _ t y p e tmp ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . oc =DCO_A1S_SIN ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . a r g 1 =x . va ;
d c o _ a 1 s _ t a p e [ d c o _ a 1 s _ v a c ] . v=tmp . v= s i n ( x . v ) ;
tmp . va = d c o _ a 1 s _ v a c ++;
r e t u r n tmp ;
}
.
The driver program in Listing 2.2 uses the implementation of class dco_a1s_type in connection with a tape of size DCO_A1S_TAPE_SIZE (to be replaced with an integer value by
the C preprocessor). The tape is allocated statically in dco_a1s_type.cpp and is later
linked to the object code of the driver program. The latter computes the gradient of the
74
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
n1
2
xi2
i=0
followed by the tape interpretation yields the two tapes in Figure 2.9. Arguments are
referenced by their virtual address within the tape. For example, tape entry 11 represents the
sum (oc=2) of the two arguments represented by tape entries 9 and 10. The tape is structurally
equivalent to the DAG. The propagation of adjoints is preceded by the initialization of the
adjoint of the tape entry that corresponds to the dependent variable y (tape entry 23). The
desired gradient is accumulated in the adjoint components of the four tape entries 1, 3,
5, and 7.
Tape entries 07 correspond to the initialization of the x[ j ] in line 19 of Listing 2.2.
The initialization of y inside of f (line 10) yields tape entries 8 and 9. The loop in line 11
produces the following twelve (four triplets) entries 1021. Squaring y in line 12 adds the
last two tape entries 22 and 23.
75
Tape:
0:
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
16:
17:
18:
19:
20:
21:
22:
23:
Interpreted Tape:
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0,
1,
0,
1,
0,
1,
0,
1,
0,
1,
4,
2,
1,
4,
2,
1,
4,
2,
1,
4,
2,
1,
4,
1,
-1,
0,
-1,
2,
-1,
4,
-1,
6,
-1,
8,
1,
9,
11,
3,
12,
14,
5,
15,
17,
7,
18,
20,
21,
22,
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
][
0,
1,
0,
1,
0,
1,
0,
1,
0,
1,
4,
2,
1,
4,
2,
1,
4,
2,
1,
4,
2,
1,
4,
1,
-1,
0,
-1,
2,
-1,
4,
-1,
6,
-1,
8,
1,
9,
11,
3,
12,
14,
5,
15,
17,
7,
18,
20,
21,
22,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 1.0,
-1, 0.0,
-1, 0.0,
1, 1.0,
10, 1.0,
-1, 1.0,
3, 1.0,
13, 2.0,
-1, 2.0,
5, 1.0,
16, 3.0,
-1, 3.0,
7, 1.0,
19, 4.0,
-1, 4.0,
21, 16.0,
-1, 16.0,
16.0
16.0
16.0
16.0
16.0
16.0
16.0
16.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
8.0
1.0
1.0
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
(b)
Figure 2.9. dco_a1s_tape for the computation of the gradient of (1.2) for n = 4
at the point xi = 1 for i = 0, . . . , 3. The five columns show for each tape entry with virtual
addresses from 0 to 23, the operation code, the virtual addresses of the (up to two) arguments, the function value, and the adjoint value, where -1 DCO_A1S_UNDEF in the third
and fourth columns and with operation codes 0 DCO_A1S_CONST, 1 DCO_A1S_ASG,
2 DCO_A1S_ADD, and 4 DCO_A1S_MUL.
The tape interpreter implements (2.9) without modification. Starting from tape entry
23, the adjoint value 1 of the dependent variable y is propagated to the single argument of the
underlying assignment. The adjoint of tape entry 22 is set to 1 as the local partial derivative
of an assignment is equal to 1. Tape entry 22 represents the multiplication y=yy in line 12
of Listing 2.2, where the value of y on the right-hand side of the assignment is represented
by tape entry 21. The value of the local partial derivative (2y=2*4=8) is multiplied with
the adjoint of tape entry 22, followed by incrementing the adjoint of tape entry 21, whose
initial value is equal to 0. This process continues until all tape entries have been visited.
The gradient can be retrieved from tape entries 1, 3, 5, and 7. If none of the independent
variables is overwritten, then their va components contain the correct virtual addresses after
calling the overloaded version of f . This is the case in the given example. Hence, lines
2427 deliver the correct gradient in Listing 2.2. Otherwise, the virtual addresses of the
independent variables need to be stored in order to ensure a correct retrieval of the gradient.
76
Table 2.4. Run times for adjoint code by overloading (in seconds). n function
evaluations are compared with n evaluations of the adjoint code including the generation
and interpretation of the tape. We observe a difference of a factor of at least 16 when
comparing the run time of the adjoint code with that of an original function evaluation in
the rightmost column. This factor increases with growing values of n. Compiler optimization
has almost no effect on the quality of the adjoint code. With increasing tape size, the run
time is dominated by the memory accesses. The observed factor rises quickly to 100 and
more. Version 1.0 of dco keeps the factor below 20 by exploiting advanced techniques
whose discussion is beyond the scope of this introduction.
g++ -O0
n
f
a1_f
104
0.9
23.8
2 104
3.6
110.3
g++ -O3
4 104
13.7
551.9
104
0.2
12.7
2 104
0.8
73.5
4 104
3.1
478.2
/T(f)
1
> 16
Listings of the full source code that implements adjoint mode AD by overloading
can be found in Section A.2. If both class dco_a1s_type and the tape are implemented in
the files dco_a1s_type.hpp and dco_a1s_type.cpp, and if the driver program is
stored as main.cpp, then the build process is similar to that in Section 2.1.2. Run time
measurements are reported in Table 2.4.
Tape-based reverse mode AD can be implemented in vector mode by redefining tape
entries as follows:
class dco_a1v_tape_entry {
public :
i n t oc , a r g 1 , a r g 2 ;
double v , a ;
...
};
The overloaded operators and functions remain unchanged. The tape interpreter needs to
be altered to enable the propagation of vectors of adjoints. Remarks similar to those made
in Section 2.1.2 apply.
Several implementations of reverse mode AD by overloading have been proposed
over the past decades. Popular representatives for C++ include ADOL-C [34], cppAD [7],
and FADBAD [9]. While the fundamental concepts are similar to what we have described
here, the actual implementations vary in terms of the functionality and efficiency of the
resulting code. Version 0.9 of dco is not meant to compete with the established tools. Later
versions of dco provide a wider range of functionalities (checkpointing, parallelism, hybrid
tangent-linear and adjoint modes) while yielding more robust and efficient derivative code.
2.2.3
(2.10)
a B :a
a
where A F (x) and Sa {0, 1}mla such that ai,j = 0 bl,i
a
i,j = bl,i . Each
nonzero element ai,j in the Jacobian must be present in Ba . Similar to the tangent-linear
77
case, the matrix Sa is referred to as the seed matrix and Ba as the compressed transposed
Jacobian. The number of columns in Sa is denoted by la . Harvesting solves (2.10) by
substitution. Refer to [30] for details on the combinatorial problem that is to minimize la
by graph coloring algorithms.
Example 2.15 An adjoint version of the implementation in Example 2.10,
v o i d a 1 _ f ( i n t n , d o u b l e x , d o u b l e a1_x ,
i n t m, d o u b l e y , d o u b l e a1_y ) ,
is generated by reverse mode AD. A driver for computing the compressed transposed Jacobian Ba uses the row-partition I1 = {0, 2} and I2 = {1} as follows:
i n t main ( ) {
d o u b l e x [ 4 ] = . . . , a1_x [ 4 ] ;
double y [ 3 ] ;
{
d o u b l e a1_y [ 3 ] = { 1 , 0 , 1 } ;
a 1 _ f ( 4 , x , a1_x , 3 , y , a1_y ) ; / / rows 0 and 2
}
...
{
d o u b l e a1_y [ 3 ] = { 0 , 1 , 0 } ;
a 1 _ f ( 4 , x , a1_x , 3 , y , a1_y ) ; / / row 1
}
...
}
The unknown nonzero entries xi,j of the transposed Jacobian are obtained by substitution
from
a0,0
0
0
0
x0,0
1 0
0 x2,1
0
a
0
.
0 1 = 2,1
0
x1,2 x2,2
a2,2 a1,2
1 0
x0,3 x1,3
0
a0,3 a1,3
Combinations of tangent-linear and adjoint compression may give better compression rates
as described, for example, in [36]. Arrow-shaped matrices are prime examples for this type
of bidirectional seeding and harvesting.
2.3
Consider the interprocedural adjoint code used to illustrate Adjoint Code Generation Rule 6
in Section 2.2.1. For n = 10, the size of the required double data stack hits its maximum
of 11 at the end of the augmented forward section. Suppose that the available memory
allows the storage of only 9 double precision floating-point values.7 It would follow that
this adjoint code cannot be run on the given computer.
Adjoint Code Generation Rule 7: Subroutine Argument Checkpointing
Buying more memory may be an option. However, we prefer an algorithmic solution that
will allow us to generate suitable adjoint code for arbitrary available hardware. We focus
7 We leave it to the reader to multiply this number by 10k in order to get to a more realistic number.
78
There is no need to store the outputs at the end of the augmented forward section, as their
values are dead within the reverse section of the calling routine a1_f. Liveness analysis [2]
eliminates the corresponding statements from the adjoint code for g.
The adjoint code for f calls g and its adjoint a1_g as follows:
1 v o i d a 1 _ f ( i n t n , d o u b l e& x , d o u b l e& a1_x ) {
2
i n t n1 , n2 , n3 ;
3
/ / augmented forward s e c t i o n
4
n1=n / 3 ; n2=n / 3 ; n3=nn1n2 ;
5
f o r ( i n t i = 0 ; i <n1 ; i ++) {
6
r e q u i r e d _ d o u b l e . push ( x ) ;
7
x= s i n ( x ) ;
79
}
/ / s t o r e argument c h e c k p o i n t
a1_g ( 3 , n2 , x , a1_x ) ;
g ( n2 , x ) ;
f o r ( i n t i = 0 ; i <n3 ; i ++) {
r e q u i r e d _ d o u b l e . push ( x ) ;
x= s i n ( x ) ;
}
// store results
r e s u l t s _ d o u b l e . push ( x ) ;
/ / reverse section
f o r ( i n t i =n3 1; i >=0; i ) {
x= r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
a1_x = c o s ( x ) a1_x ;
}
/ / r e s t o r e argument c h e c k p o i n t
a1_g ( 4 , n2 , x , a1_x ) ;
a1_g ( 1 , n2 , x , a1_x ) ;
f o r ( i n t i =n1 1; i >=0; i ) {
x= r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
a1_x = c o s ( x ) a1_x ;
}
// restore results
x= r e s u l t s _ d o u b l e . t o p ( ) ; r e s u l t s _ d o u b l e . pop ( ) ;
The original version of g is called in line 11 as part of the augmented forward section
of f . No required data is recorded. No additional memory is required. Instead, the value
of x is stored as an argument checkpoint in line 10. Note that x is the only input of g
whose value is overwritten by the subsequent statements in g or f . The value of the second
input n2 remains unchanged throughout the entire program and thus does not need to be
checkpointed. Once the propagation of adjoints through f reaches the point where adjoints
need to be propagated through g, the argument checkpoint is restored (line 26). Subsequent
recording of all required data within the augmented forward section of g is followed by the
propagation of the adjoints through its reverse section. The results enter the remainder of
the reverse section of f (lines 2831). Result checkpointing is taken care of in lines 18 and
34 if required.
Note that the size of the required double data stack never exceeds 7. The additional
memory requirement of the adjoint code is increased to 8 by the argument checkpoint
of g. The first part of the reverse section of f (lines 2124) decreases the stack size to 3.
Subsequent execution of the augmented forward section of g lets it grow up to 6 again before
all of the remaining entries are recovered. The reduced memory requirement comes at the
expense of a single evaluation of g. Thus checkpointing enables the computation of adjoints
within the given memory constraints at an additional computational cost.
80
2.3.1
The general Data Flow Reversal (also DAG Reversal) problem concerns the selection
of appropriate intermediate values as checkpoints for a given upper bound on the available
additional memory, that is, the memory available on top of the duplicated data segment of
the original program. This problem is NP-complete [49]. It is therefore unlikely that an
efficient algorithm (with run time polynomial in the size of the DAG) for its deterministic
solution can be formulated.
Our approach to the generation of interprocedural adjoint code suggests a focus on
subroutine arguments as potential checkpoints. The associated Call Tree Reversal (CTR)
problem is a special case of DAG Reversal and is also NP-complete [48]. (Approximate)
solutions for given instances of CTR turn out to be easier to integrate into adjoint versions
of the corresponding code.
For a call of a subroutine g inside of another subroutine f represented by the call tree
|_ f
|_ g
Subroutine calls are denoted by | _. The augmented forward section (RECORD) generates
a recording of all data that is required by the reverse section and is potentially lost due to
overwriting / deallocation. Adjoints are propagated by the reverse section (ADJOIN). The
order of execution in such reversal trees is top-down.
Joint Call Reversal: The joint reversal of the call of g inside of f is defined as
| _ a 1 _ f (RECORD)
|
| _ a1_g ( STORE_INPUTS )
|
|_ g
| _ a 1 _ f (ADJOIN )
| _ a1_g ( RESTORE_INPUTS )
| _ a1_g (RECORD)
| _ a1_g (ADJOIN )
Rather than recording the data that is required by the reverse section of a1_g, an argument
checkpoint is stored (STORE_INPUT) and the original subroutine g is executed. The checkpoint is restored within the reverse section of a1_f, followed by runs of the augmented
forward (RECORD) and reverse (ADJOIN) sections of a1_g.
Refer to [36] as the original source of the terms split and joint reversal modes. Split
mode refers to the augmented forward and reverse sections being separated during the
execution of the entire adjoint code. In joint mode, the reverse section follows the forward
augmented section immediately.
81
(a)
(b)
Figure 2.10. Call reversal in spilt (a) and joint (b) modes; squares represent
(sections of) subroutines. Execution of a subroutine is denoted by an overset arrow pointing
to the right. A downward arrow indicates the storage of an argument checkpoint; its recovery
is denoted by an upward arrow. Two rightward pointing arrows represent the augmented
forward section. A reverse section is denoted by two leftward pointing arrows. The order
of execution is depth-first and from left to right.
Split and joint call reversal exhibit different memory requirements. If the size
MEM(xg ) of an argument checkpoint for g is considerably smaller than the amount of
data to be recorded for the reversal of its data flow, then joint reversal yields a decreased
memory requirement at the expense of an additional function evaluation.
For illustration, let f 0 and f 1 denote the two parts of f preceding and succeeding the
call of g, respectively, as in the example discussed in the context of Adjoint Code Generation
Rules 6 and 7. While the maximal memory requirement of split reversal is
MEM( f 0 ) + MEM(g) + MEM( f 1 ),
that of joint reversal amounts to
MEM( f 1 ) + max(MEM(xg ) + MEM( f 1 ), MEM(g)).
For example, if MEM( f 0 ) = MEM( f 1 ) = MEM(g) = 10 (memory units) and MEM(xg ) = 1,
then the memory requirement of joint reversal (21) undercuts that of split reversal (30) by
nearly a third. Graphical representations of split and joint call reversals are shown in
Figure 2.10.
2.3.2
The computational cost of a reversal scheme R = R(T ) for a call tree T = (N , A) with nodes
N and arcs A is defined by
1. the maximum amount of memory consumed in addition to the memory requirement
of the original program, which is denoted by MEM(R);
2. the number of arithmetic operations performed in addition to those required for recording, denoted by OPS(R);
The choice between split and joint reversal is made independently for each arc in the
call tree. Consequently, the call tree T = (N , A) given as
82
(a)
(b)
Figure 2.11. Call tree reversal in global split (a) and global joint (b) modes.
|_ f
|_ g
|_ h
yields a total of four possible data flow reversal schemes Rj A {0, 1}, j = 1, . . . , 4.
The reversal of a call of g inside of f in split ( joint) mode is denoted as ( f , g, 0) [( f , g, 1)].
A subroutine f is separated into f 0 , . . . , f k if it performs k subroutine calls. MEM( f i ) denotes
the memory required to record f i for i = 0, . . . , k. The computational cost of running f i is
denoted by OPS( f i ). We set MEM( f ) = ki=0 MEM( f i ) and OPS( f ) = ki=0 OPS( f i ). The
memory occupied by an input checkpoint of f is denoted by MEM(x f ). Consequently, we
have the choice between the following four CTR schemes:
R1 = {( f , g, 0), (g, h, 0)} (global split)
| _ a 1 _ f (RECORD)
|
| _ a1_g (RECORD)
|
| _ a1_h (RECORD)
| _ a 1 _ f (ADJOIN )
| _ a1_g (ADJOIN )
| _ a1_h (ADJOIN )
A graphical representation is shown in Figure 2.11 (a). Additional memory requirement and operations count are given by
MEM(R1 ) = MEM( f ) + MEM(g) + MEM(h),
OPS(R1 ) = OPS( f ) + OPS(g) + OPS(h).
R2 = {( f , g, 1), (g, h, 0)} ( joint over split mode)
| _ a 1 _ f (RECORD)
|
| _ a1_g ( STORE_INPUTS )
|
|_ g
|
|_ h
| _ a 1 _ f (ADJOIN )
83
(a)
(b)
Figure 2.12. CTR in joint over split (a) and split over joint (b) modes.
| _ a1_g ( RESTORE_INPUTS )
| _ a1_g (RECORD)
|
| _ a1_h (RECORD)
| _ a1_g (ADJOIN )
| _ a1_h (ADJOIN )
A graphical representation is shown in Figure 2.12 (a). Additional memory requirement and operations count are given by
MEM( f ) + MEM(xg )
MEM(R2 ) = max
MEM( f 0 ) + MEM(g) + MEM(h)
,
A graphical representation is shown in Figure 2.12 (b). Additional memory requirement and operations count are given by
,
84
A graphical representation is shown in Figure 2.11 (b). Additional memory requirement and operations count are given by
MEM( f ) + MEM(x )
0
0
OPS(R4 ) = OPS( f ) + 2 OPS(g) + 3 OPS(h).
Formally, the CTR problem aims to determine for a given call tree T = (N , A) and an integer
K > 0 a reversal scheme R A {0, 1} such that OPS(R) min subject to MEM(R) K.
Ongoing research investigates heuristics for determining a near-optimal reversal scheme in
(preferably) linear time. A simple greedy smallest-recording-first heuristic starts with a
global joint reversal and switches edge labels from 1 to 0 in increasing order of the callees
recording size. Ties are broken according to some enumeration of the nodes in T . The
constraints of the CTR problem are guaranteed to be satisfied under the assumption that
MEM(x f ) MEM( f ) for all f N . In this case, global joint reversal yields the minimal
memory requirement. Effective use of the larger available memory may allow for certain
calls to be reversed in split rather than joint mode, as illustrated by the following example.
Example 2.16 Consider the call tree T = (N , A) in Figure 2.13 (a). Nodes are annotated
with the sizes of the respective input checkpoints (left) and the sizes of the recordings
(below the nodes). For example, MEM(xg ) = 5 and MEM(g0 ) = MEM(g1 ) = MEM(g2 ) =
10, and hence MEM(g) = 30. We assume MEM(p) = OPS(p) for any program code
fragment p and for some R. This assumption turns out to be reasonable in most practical
situations. For simplicity, we set = 1 in this example.
There are 2|A| = 8 distinct reversal schemes, each with a potentially different computational cost. Global joint reversal Rj = (( f , g, 1), (g, s, 1), (g, h, 1)) yields MEM(Rj ) = 225
and OPS(Rj ) = 830 which minimizes the overall memory requirement. Global split mode
Rs = (( f , g, 0), (g, s, 0), (g, h, 0)) minimizes the operation count (OPS(Rs ) = 300) at the expense of a maximum memory requirement of MEM(Rs ) = 300. A graphical illustration
of global split and global joint CTR modes is given in Figure 2.13 (b) and Figure 2.14,
respectively.
5
15
10
85
10
10
10
h
s
200
50
(a)
(b)
Figure 2.13. Annotated call tree (a) and its global split reversal (b).
Figure 2.14. Global joint reversal of the call tree in Figure 2.13 (a).
Let the available memory be of size 250. Global split reversal becomes infeasible.
Global joint reversal is an option, but can we do better? The given small example allows us
to perform an exhaustive search for a solution of the CTR problem yielding the following
six reversal schemes in addition to the global split and joint reversals discussed above:
R1 = (( f , g, 1), (g, s, 1), (g, h, 0)) with MEM(R1 ) = 225 and OPS(R1 ) = 780;
R2 = (( f , g, 1), (g, s, 0), (g, h, 1)) with MEM(R2 ) = 285 and OPS(R2 ) = 630;
R3 = (( f , g, 1), (g, s, 0), (g, h, 0)) with MEM(R3 ) = 295 and OPS(R3 ) = 580;
R4 = (( f , g, 0), (g, s, 1), (g, h, 1)) with MEM(R4 ) = 225 and OPS(R4 ) = 550;
R5 = (( f , g, 0), (g, s, 1), (g, h, 0)) with MEM(R5 ) = 225 and OPS(R5 ) = 500;
R6 = (( f , g, 0), (g, s, 0), (g, h, 1)) with MEM(R6 ) = 285 and OPS(R6 ) = 350;
R1 , R4 , and R5 turn out to be feasible. R5 yields the lowest operation count and represents
the unique solution for the given instance of the CTR problem.
The greedy smallest-recording-first heuristic starts with the global joint reversal
scheme and switches the reversal mode to split for the call of the subroutine with the smallest recording size; that is, ( f , g, 1) ( f , g, 0). The operation count is decreased significantly
86
Figure 2.15. Optimal reversal scheme for the call tree in Figure 2.13 (a) for an
available memory of size 250.
to 550, whereas the memory requirement remains unchanged. Performing the next switch
(g, h, 1) (g, h, 0) reduces the operation count even further to 500 while preserving the
memory requirement of 225. A last potential split (g, s, 1) (g, s, 0) fails due to violation of
the memory bound since 300 > 250. The greedy smallest-recording-first heuristic succeeds
in finding the optimal CTR scheme R5 (shown in Figure 2.15) for the given instance of the
CTR problem.
The greedy largest-recording-first heuristic starts with the global joint reversal scheme
and attempts to switch the reversal mode to split for the call of the subroutine with the
largest recording size, that is, (g, s, 1) (g, s, 0), which yields the infeasible CTR scheme
R3 . Rejection of the first switch is followed by (g, h, 1) (g, h, 0), resulting in the feasible
reversal scheme R1 . Finally, switching ( f , g, 1) ( f , g, 0) gives the optimal result R5 (shown
in Figure 2.15).
Both greedy heuristics happen to lead to the same solution for the given example.
Refer to the exercise in Section 2.4.5 for a call tree instance in which the two heuristics
yield different results.
A call tree is an image of the calling structure of a given program at run time.
A (near-)optimal reversal scheme for a given call tree is of limited use if the calling structure
of the program changes dynamically as a function of the inputs. In such cases, we need
conservative solutions that guarantee feasible and reasonably efficient run time characteristics for all possible call trees on average. The automation of the (near-)optimal placement
of checkpoints is the subject of ongoing research. AD will never become truly automatic
unless robust software solutions for the DAG Reversal problem are developed.
2.4
Exercises
2.4.1
2.4. Exercises
87
Listing 2.3. Disputable implementation of a function.
v o i d g ( i n t n , d o u b l e x , d o u b l e& y ) {
y =1.0;
f o r ( i n t i = 0 ; i <n ; i ++)
y =x [ i ] x [ i ] ;
}
v o i d f ( i n t n , d o u b l e x , d o u b l e &y ) {
f o r ( i n t i = 0 ; i <n ; i ++) x [ i ] = s q r t ( x [ i ] / x [ ( i + 1 )%n ] ) ;
g(n ,x , y) ;
y= c o s ( y ) ;
}
and use it for the computation of the gradient of the dependent output y with respect
to the independent input x. Apply backward finite differences for verification.
3. Write adjoint code (split mode) for the example code in Listing 2.3. Use the adjoint
code to accumulate the gradient of the dependent output y with respect to the independent input x. Ensure that the correct function values are returned in addition to
the gradient.
4. Write adjoint code ( joint mode) for the example code in Listing 2.3. Use it to accumulate the gradient of the dependent output y with respect to the independent input x.
Correct function values need not be returned.
5. Use the adjoint code developed in Exercises 3 and 4 to compute the gradient of the
dependent output x[0] with respect to the independent input x. Optimize the adjoint
code by eliminating obsolete (dead) statements.
2.4.2
Consider an implementation of the discrete residual r = F (y) for the SFI problem introduced
in Example 1.2.
1. Implement the tangent-linear model r(1) = F (y) y(1) by writing a tangent-linear
code by hand, and use it to accumulate F (y) with machine accuracy. Compare
the numerical results with those obtained by the finite difference approximation in
Section 1.4.2.
88
2.4.3
2.4.4
1. Use the tangent-linear model with your favorite solver for systems of nonlinear equations to find a numerical solution of the SFI problem; repeat for further MINPACK-2
test problems.
2. Use the adjoint model with your favorite solver for nonlinear programming to minimize the extended Rosenbrock function; repeat for the other two test problems from
Section 1.4.3.
2.4.5
1. Consider the following modification of the example code from Section 2.4.1:
v o i d h ( d o u b l e& x ) {
x =x ;
}
v o i d g ( i n t n , d o u b l e x , d o u b l e& y ) {
y =0;
f o r ( i n t i = 0 ; i <n ; i ++) {
2.4. Exercises
89
f
10
10
50
5
50
s
20
h
50
Chapter 3
3.1
92
(x )
Fn1
F0 (x0 )
0
Fn (x0 )
F
(x
2n1 0 )
2 F (x0 )
Rmnn
..
..
.
F(m1)n
(x0 ) Fmn1
(x0 )
is called the Hessian of F at point x0 .
Example 3.2 The Hessian 2 r(y, ) R444 of the residual of the SFI problem from
Example 1.2 becomes very sparse with
2 ri,j ,k
2
h ey1,1
2
y
h e 1,2
= h2 ey2,1
h2 ey2,2
if i = j = k = 0,
if i = j = k = 1,
if i = j = k = 2,
if i = j = k = 3,
otherwise.
2
y
h e 1,2 if i = j = k = l = 1,
3
ri,j ,k,l = h2 ey2,1 if i = j = k = l = 2,
h2 ey2,2 if i = j = k = l = 3,
0
otherwise.
Conceptually, the computation of higher derivatives does not pose any exceptional
difficulties. The notation is complicated by the need to work with higher-order tensors.
Tensor notation is not necessarily required for first derivatives. Nevertheless we use it as an
intuitive entry point into the following formalism.
mn be a 2-tensor (a matrix). A first-order
Definition 3.4. Let A (ak,j )k=0,...,m1
j =0,...,n1 R
tangent-linear projection of A in direction v Rn is defined as the usual matrix vector
product A v. Alternatively, we use the inner product notation
b A, v Rm ,
93
n1
ak,l vl
l=0
for k = 0, . . . , m 1. The kth row of A is denoted by ak, . The expression ak, , v
denotes
the usual scalar product of two vectors in Rn .
A first-order adjoint projection
c w, A
Rn
of A in direction w Rm , where c = (cj )j =0,...,n1 , is defined as
cj = w, a,j
m1
wl al,j
l=0
94
1
3
3
3
3
3
< A, v >
Figure 3.1. First-order tangent-linear (equivalently, adjoint) projection of a symmetric 3-tensor A R466 in direction v R6 : A, v
= v, A
R46 . The line of
symmetry in A is shown as well as the direction of the projection and its result.
of A in direction v Rn with B = (bk,j )k=0,...,m1
j =0,...,n1 is defined as
bk,j = ak,j , , v
n1
ak,j ,l vl
l=0
for k = 0, . . . , m 1 and l = 0, . . . , n 1.
A first-order adjoint projection
C w, A
Rnn
j =0,...,n1
m1
wl al,j ,i
l=0
for i, j = 0, . . . , n 1.
For technical reasons, a first-order adjoint projection of A in direction v Rn is
defined to be equivalent to the corresponding tangent-linear projection, that is,
v, A
A, v
Rmn .
Figures 3.1 and 3.2 provide a graphical illustration. Refer to the exercises in Section 3.5.3
and to their solutions in Section C.3.3 for applications of first-order adjoint projections of
second derivative tensors in directions in Rn in the context of third- and higher-order adjoint
models.
Lemma 3.6. Let A Rmnn be a symmetric 3-tensor as defined in Definition 3.5. Then,
C = w, A
Rnn is symmetric for all w Rm .
95
3
1
3
3
1
1
< w, A >
Proof. This result follows immediately from Definition 3.5, as C = (cj ,i )i=0,...,n1 , where
cj ,i = w, a,j ,i
m1
wl al,j ,i .
l=0
96
5
3
1
3
1
< A, v, u >
5
3
5
1
< w, A, v >
ck =
n1
bk,l ul
l=0
n1
n1
l=0 p=0
ak,l,p vp ul
(substitution)
n1
n1
97
ak,l,p vp ul
(switch loops)
ak,l,p ul vp
(commutativity)
p=0 l=0
n1
n1
p=0 l=0
n1
dk,p vp
p=0
n1
bj ,l vl
l=0
n1 m1
ap,j ,l wp vl
(substitution)
ap,j ,l wp vl
(switch loops)
ap,j ,l vl wp
(commutativity)
l=0 p=0
m1
n1
p=0 l=0
m1
n1
p=0 l=0
m1
dp,j wp
p=0
98
n1
bj ,l vl
l=0
n1 m1
wp ap,j ,l vl
(substitution)
wp ap,j ,l vl
(switch loops)
l=0 p=0
m1
n1
p=0 l=0
m1
wp
p=0
m1
n1
ap,j ,l vl
(distributivity)
l=0
wp dp,j
p=0
3.2
(3.1)
99
x(1)
x(2)
x(1,2)
s
x
s
and
x(1,2)
x(1)
,
s
(3.2)
100
v o i d t 1 _ f ( i n t n , i n t m, d o u b l e x , d o u b l e t 1 _ x ,
double y , double t1_y )
that implements
y
= F (1) (x, x(1) )
y(1)
yields an implementation of
F (1,2) : Rn Rn Rn Rn Rm Rm Rm Rm :
y
y(2)
(1) = F (1,2) (x, x(2) , x(1) , x(1,2) ),
y
y(1,2)
where
y = F (x)
y
(2)
= F (x), x(2)
Superscripts of second-order tangent-linear subroutine and variable names are replaced with
the prefixes t2_ and t1_ , that is, v(2) t2_v and v(1,2) t2_t1_v . The Hessian at point x can
be accumulated at the computational cost of O(n2 ) Cost(F ), where Cost(F ) denotes the
computational cost of evaluating F , by setting x(1,2) = 0 initially and by letting x(2) and x(1)
range independently over the Cartesian basis vectors in Rn . The computational complexity
is the same as that of second-order finite differences. Some run-time savings result from
the possible exploitation of symmetry as in second-order finite differences.
3.2.1
Source Transformation
101
(1)
(2)
(2)
(1,2)
According to Tangent-Linear Code Generation Rule 1, all double parameters of the tangentlinear subroutine (x, t1_x, y,and t1_y) are duplicated. The new variables are augmented with
the t2_ prefix. Inputs are augmented with inputs and outputs with outputs. Tangent-linear
102
versions of all assignments are inserted into the tangent-linear code in lines 5, 7, 10, 12,
15, and 17. The flow of control remains unchanged. First-order tangent-linear projections of the gradient in directions t1_x and t2_x are returned in t1_y and t2_y, respectively. The function value is returned in y. If only second derivatives are required, then
dead code elimination results in further optimization of the second-order tangent-linear
code. For example, lines 8, 13, and 1618 become obsolete in this case. Knowing that
all entries of t2_t1_x are equal to zero, the assignment in line 10 can be simplified to
t2_t1_y=t2_t1_y+2t2_x[i] t1_x[ i ] .
The following driver computes all entries of the lower triangular part of the Hessian.
1 i n t main ( ) {
2
const i n t n =4; i n t i , j ;
3
double x [ n ] , t1_x [ n ] , t2_x [ n ] , t 2 _ t 1 _ x [ n ] , y , t1_y , t2_y , t 2 _ t 1 _ y ;
4
5
f o r ( j = 0 ; j <n ; j ++) { x [ j ] = 1 ; t 2 _ t 1 _ x [ j ] = t 2 _ x [ j ] = t 1 _ x [ j ] = 0 ; }
6
f o r ( j = 0 ; j <n ; j ++) {
7
t1_x [ j ]=1;
8
f o r ( i = 0 ; i <= j ; i ++) {
9
t2_x [ i ]=1;
10
t 2 _ t 1 _ f ( n , x , t2_x , t1_x , t2_t1_x , y , t2_y , t1_y , t 2 _ t 1 _ y ) ;
11
c o u t << "H[ " << j << " ] [ " << i << " ] = " << t 2 _ t 1 _ y << e n d l ;
12
t2_x [ i ]=0;
13
}
14
t1_x [ j ]=0;
15
}
16
return 0 ;
17 }
It exploits the fact that x is not overwritten inside of f . Consequently, neither t1_x nor t2_x
is overwritten in t2_t1_f , and their entries can be set and reset individually in lines 7, 9, 12,
and 14. O(n2 ) evaluations of the second-order tangent-linear code are performed in line 10.
The Hessian entries are returned individually in t2_t1_y , and they are printed to the standard
output in line 11.
3.2.2
Overloading
The computation of second derivatives is supported by dco through the provision of the
second-order scalar tangent-linear data type dco_t2s_t1s_type whose value (v) and derivative
( t ) components are tangent-linear scalars of type dco_t1s_type .
class dco_t2s_t1s_type {
public :
dco_t1s_type v , t ;
...
};
The definition of the arithmetic operators and intrinsic functions does not yield any surprises;
for example,
d c o _ t 2 s _ t 1 s _ t y p e o p e r a t o r ( c o n s t d c o _ t 2 s _ t 1 s _ t y p e& x1 ,
c o n s t d c o _ t 2 s _ t 1 s _ t y p e& x2 ) {
d c o _ t 2 s _ t 1 s _ t y p e tmp ;
103
tmp . v=x1 . v x2 . v ;
tmp . t =x1 . t x2 . v+x1 . v x2 . t ;
r e t u r n tmp ;
}.
The driver program in Listing 3.1 uses the implementation of the second-order tangentlinear model by overloading to compute the Hessian 2 F of (1.2) for n = 4 at the point
xi = 1 for i = 0, . . . , 3. All data members of variables of type dco_t2s_t1s_type are initialized
(1,2)
to zero at the time of construction. Hence, xi
x[ i ]. t . t does not need to be initialized
(1)
(2)
explicitly. Both xi x[ i ]. t .v and xi x[ i ]. v. t range in lines 17 and 19 independently
over the Cartesian basis vectors in Rn . Resetting can be restricted in lines 24 and 21 to the
individual components as x is not overwritten in f . The desired Hessian entries are retrieved
from y. t . t y (1,2) , and they are printed in line 22 to the standard output.
Listing 3.1. Driver for second-order tangent-linear code by overloading.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
104
Table 3.1. Run times for second-order tangent-linear code (in seconds). In order
to determine the relative computational complexity R of the derivative code, n function evaluations are compared with n evaluations of the second-order tangent-linear code (t2_t1_f)
and with the same number of evaluations of an implementation of the second-order tangentlinear model by overloading (dco_t2s_t1s_f). As in Chapter 2, the compiler optimizations
are either switched off (g++ -O0) or the full set of optimizations is enabled (g++ -O3).
We observe a factor of approximately 2.3 when comparing the run time of a single run of
the second-order tangent-linear code with that of an original function evaluation in the
rightmost column. Implementation by overloading adds a factor of almost 20 due to less
effective compiler optimization.
n
f
t2_t1_f
dco_t2s_t1s_f
104
0.9
3.0
35.1
g++ -O0
2 104 4 104
3.6
13.7
12.0
47.3
134.6
562.3
104
0.2
0.5
8.6
g++ -O3
2 104 4 104
0.8
3.1
1.9
7.2
32.9
128.4
R
1
2.3
41.5
is built by calling
$(CPPC)
$(CPPC)
$(CPPC)
$(CPPL)
-c
-c
-c
-o
dco_t1s_type.cpp
dco_t2s_t1s_type.cpp
main.cpp
main dco_t1s_type.o dco_t2s_t1s_type.o main.o
where $(CPPC) and $(CPPL) denote the native C++ compiler and linker, respectively.
Run time results are reported in Table 3.1.
3.3
The remaining three approaches to the generation of second derivative code involve at
least one application of reverse mode AD. According to Lemmas 3.9 and 3.10, all three
alternatives implement the second-order adjoint model that is defined next.
Definition 3.14. The Hessian 2 F = 2 F (x) Rmnn of a multivariate vector function
y = F (x), F : Rn Rm , induces a bilinear mapping Rn Rm Rn defined by
(v, w) w, 2 F , v
.
The function F : R2n+m Rn that is defined as
F (x, v, w) w, 2 F (x), v
(3.4)
105
x(1) [<, >]
y(1)
F
F(x)
2 F(x)
y(1)
x
y(2)
(1)
x(2)
Figure 3.6. Tangent-linear extension of the linearized DAG of the adjoint model
x(1) = y(1) , F (x)
of y = F (x).
Forward-over-Reverse Mode
Theorem 3.15. The application of forward mode AD to the adjoint model yields an implementation of the second-order adjoint model.
Proof. The application of forward mode AD as defined in Section 2.1.1 to the adjoint model
x(1) = y(1) , F (x)
gives
(2)
(2)
x(1)
,
s
(2)
y(1)
y(1)
,
s
and where x(1) is computed as the partial derivative of x(1) with respect to s according
to (1.5).
106
F(1) : Rn Rn Rn Rn Rm Rm Rm Rm Rn Rn Rm Rm :
(2)
(2)
(2)
(2)
(2)
(y, y(2) , x(1) , x(1) , y(1) , y(1) ) = F(1) (x, x(2) , x(1) , x(1) , y(1) , y(1) ),
such that
y = F (x)
y(2) = F (x), x(2)
x(1) = x(1) + y(1) , F (x)
(2)
(2)
(3.5)
(2)
y(1) = 0.
For m = 1, we get y(1) , 2 F (x), x(2)
= y(1) 2 F (x) x(2) . The corresponding second-order
adjoint subroutine has the following signature:
v o i d t 2 _ a 1 _ f ( i n t n , i n t m,
d o u b l e x , d o u b l e t 2 _ x , d o u b l e a1_x , d o u b l e t 2 _ a 1 _ x ,
d o u b l e y , d o u b l e t 2 _ y , d o u b l e a1_y , d o u b l e t 2 _ a 1 _ y ) ;
Subscripts (superscripts) of second-order adjoint subroutine and variable names are re(2)
placed with the prefixes a1_ and t2_ ; for example, v(2) t2_v and v(1) t2_a1_v. The
(2)
computation of projections of the Hessian in directions x(2) and y(1) requires y(1) = 0 initially. The entire Hessian can be accumulated at a computational cost of O(n m) Cost(F )
by letting x(2) and y(1) range over the Cartesian basis vectors in Rn and Rm , respectively.
For m = 1, a single Hessian-vector product can be computed at the computational cost of
O(1) Cost(F ), that is, at a constant multiple of the cost of evaluating F . The magnitude of
this constant factor depends on details of the implementation as illustrated in Sections 3.3.1
and 3.3.2.
Reverse-over-Forward Mode
Theorem 3.16. The application of reverse mode AD to the tangent-linear model yields an
implementation of the second-order adjoint model.
Proof. The application of reverse mode AD as defined in Section 2.2.1 to the tangent-linear
model y(1) = F (x), x(1)
gives
y(1) = F (x), x(1)
(1)
107
(1)
y(2)
x(1)
Figure 3.7. Adjoint extension of the linearized DAG of the tangent-linear model
y(1) = F (x), x(1)
of y = F (x).
(1)
(1)
(1)
y(2) = 0.
With x(2) = 0 initially, the second line yields (3.4).
A graphical illustration in the form of the adjoint extension of the linearized DAG of
the tangent-linear model can be found in Figure 3.7, where
(1)
y(2)
t
,
y(1)
(1)
and where x(2) and x(2) are computed according to (1.5) as partial derivatives of t with
respect to x and x(1) .
The application of reverse mode AD to a tangent-linear code that implements
(y, y(1) ) = F (1) (x, x(1) )
results in
(1)
F(2) : Rn Rn Rn Rn Rm Rm Rm Rm Rn Rn Rm Rm :
(1)
(1)
(1)
(1)
(1)
(y, y(1) , x(2) , x(2) , y(2) , y(2) ) = F(2) (x, x(2) , x(1) , x(2) , y(2) , y(2) ),
108
where
y = F (x)
y
(1)
= F (x), x(1)
(1)
(1)
(1)
(3.6)
y(2) = 0
(1)
y(2) = 0.
The corresponding second-order adjoint subroutine has the following signature:
v o i d a 2 _ t 1 _ f ( i n t n , i n t m,
d o u b l e x , d o u b l e a2_x , d o u b l e t 1 _ x , d o u b l e a 2 _ t 1 _ x ,
d o u b l e y , d o u b l e a2_y , d o u b l e t 1 _ y , d o u b l e a 2 _ t 1 _ y ) ;
Subscripts (superscripts) of second-order adjoint subroutine and variable names are replaced
(1)
with the prefixes t1_ and a2_; for example, v(2) a2_v and v(2) a2_t1_v. The entire Hessian
(1)
can be accumulated at a computational cost of O(n m) Cost(F ) by letting x(1) and y(2)
range over the Cartesian basis vectors in Rn and Rm , respectively. Hence, for m = 1, the
computational cost of products of the Hessian with a vector x(1) Rn is O(1) Cost(F ).
Reverse-over-Reverse Mode
Theorem 3.17. The application of reverse mode AD to the adjoint model yields an implementation of the second-order adjoint model.
Proof. The application of reverse mode AD as defined in Section 2.2.1 to the adjoint model
x(1) = y(1) , F (x)
gives
x(1) = y(1) , F (x)
x(2) = x(2) + x(1,2) , y(1) , 2 F (x)
y(1,2) = y(1,2) + x(1,2) , F (x)
x(1,2) = 0.
If x(2) is equal to zero initially, then the second line yields (3.4) as, according to Lemma 3.9,
x(1,2) , y(1) , 2 F (x)
= y(1) , 2 F (x), x(1,2)
.
A graphical illustration in the form of the adjoint extension of the linearized DAG of
the adjoint model can be found in Figure 3.8, where
x(1,2)
t
,
x(1)
and where x(2) and y(1,2) are computed according to (1.5) as the partial derivatives of t with
respect to x and y(1) .
109
x(1,2)
y(1)
Figure 3.8. Adjoint extension of the linearized DAG of the adjoint model
x(1) = y(1) , F (x)
of y = F (x).
The application of reverse mode AD with required data stack s and result checkpoint
r to an adjoint code that implements (y, x(1) , y(1) ) = F(1) (x, x(1) , y(1) ) yields
F(1,2) : Rn Rn Rm Rn Rn Rm Rm Rn Rn Rm Rm :
(y, x(1) , x(2) , y(1,2) , y(2) ) = F(1,2) (x, x(1) , y(1) , x(2) , x(1,2) , y(2) ),
where
y = F (x)
x(1) = x(1) + y(1) , F (x)
s[0] = y(1)
y(1) = 0
r[0] = y; r[1] = x(1) ; r[2] = y(1)
y(1) = s[0]
y(1,2) = 0
x(2) = x(2) + x(1,2) , y(1) , 2 F (x)
y(1,2) = y(1,2) + x(1,2) , F (x)
x(2) = x(2) + y(2) , F
; y(2) = 0
y = r[0]; x(1) = r[1]; y(1) = r[2].
110
(3.7)
Subscripts of second-order adjoint subroutine and variable names are replaced with the
prefixes a1_ and a2_; for example, v(2) a2_v and v(1,2) a2_a1_v. The entire Hessian
can be accumulated at a computational cost of O(n m) Cost(F ) by letting x(1,2) and y(1)
range over the Cartesian basis vectors in Rn and Rm , respectively. For m = 1, a single
Hessian-vector product can be computed at a computational cost of O(1) Cost(F ).
3.3.1
Source Transformation
Second derivative code is most likely to be used in the context of numerical algorithms
that require second as well as first derivatives. The Hessian or projections thereof may
be needed in addition to the gradient at the current point. If we assume that an adjoint
code exists in order to compute the gradient efficiently, then a second-order adjoint code
generated in forward-over-reverse mode is the most likely choice despite the fact that the
computational complexity of computing a projected Hessian is the same in reverse-overforward mode. In practice, the repeated reversal of the data flow in reverse-over-reverse
mode turns out to result in a considerable computational overhead.
Forward-over-Reverse Mode
The application of forward mode AD to an adjoint SAC generated in incremental reverse
mode as defined in (2.9) yields
for j = n, . . . , n + p + m 1
!
j (vi )ij
(2)
(2)
, vk
vj =
kj
vk
kj
vj = j (vi )ij
111
for j = n + p+m 1, . . . , n
!
2 j (vi )ij
(2)
(2)
(2)
= v(1)i
+ v(1)j ,
, vk
v(1)i
ij
ij
kj
vk vl
{k,l}j
!
j (vi )ij
(2)
+ v(1)j ,
vk
kj
!
j (vi )ij
.
v(1)i ij = v(1)i ij + v(1)j ,
vk
kj
(3.8)
As in (2.9), the v(1)n+p+j are assumed to be initialized to y(1)j for j = 0, . . . , m1. Moreover,
(2)
(2)
(2)
(2)
the caller is expected to set vj = xj and v(1)n+p+j = v(1)i = 0 for j = 0, . . . , m 1 and
i = 0, . . . , n 1 if projections of the Hessian in directions y(1) (v(1)n+p+j )j =0,...,m1 and
(2)
(2)
(2)
x(2) (vj )j =0,...,n1 shall be returned in x(1) (v(1)j )j =0,...,n1 . Adjoints of intermediate
variables are initialized to zero by default, which is exploited in the following example.
Example 3.18 For illustration, we consider the scalar function y = f (x) = sin(x0 x1 ). In
forward-over-reverse mode, the SAC and its adjoint are differentiated in forward mode
yielding
[tangent-linear SAC]
(2)
(2)
(2)
v2 = v0 v1 + v0 v1
v2 = v0 v1
(2)
(2)
v3 = cos(v2 ) v2
v3 = sin(v2 )
[tangent-linear adjoint SAC]
(2)
(2)
(2)
(2)
(2)
(2)
(2)
(2)
112
where
cos(x0 x1 ) x0 x1 sin(x0 x1 )
x12 sin(x0 x1 )
f (x) =
x02 sin(x0 x1 )
cos(x0 x1 ) x0 x1 sin(x0 x1 )
(2)
(2)
(2)
(2)
(2)
if both x(1) = (v(1)0 , v(1)1 )T and y(1) = v(1)3 are initialized to zero.
Example 3.19 The application of forward mode AD to the adjoint version
v o i d a 1 _ f ( i n t n , d o u b l e x , d o u b l e a1_x ,
d o u b l e& y , d o u b l e a1_y ) {
y =0;
f o r ( i n t i = 0 ; i <n ; i ++) y=y+x [ i ] x [ i ] ;
r e q u i r e d _ d o u b l e . push ( y ) ;
y=y y ;
y= r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
a1_y =2 y a1_y ;
f o r ( i n t i =n 1; i >=0; i )
a1_x [ i ] = a1_x [ i ] + 2 x [ i ] a1_y ;
}
113
The store/restore mechanism for required data prevents y and t2_y from holding the correct
function and first derivative values on output. See lines 9, 14, and 15. If the correct value
of y is preserved by writing a result checkpoint in the adjoint code, then the tangent-linear
version of the store/restore statements for y also recovers the correct directional derivative
t2_y. See lines 12, 24, and 25.
The following driver computes the Hessian of (1.2) at point xi = 1, i = 0, . . . , 3, set
in line 4.
1 i n t main ( ) {
2
const i n t n =4;
3
d o u b l e x [ n ] , y , t 2 _ x [ n ] , t 2 _ y , a1_x [ n ] , a1_y , t 2 _ a 1 _ x [ n ] , t 2 _ a 1 _ y ;
4
f o r ( i n t i = 0 ; i <n ; i ++) { x [ i ] = 1 ; t 2 _ x [ i ] = 0 ; }
5
t2_a1_y =0;
6
f o r ( i n t i = 0 ; i <n ; i ++) {
7
f o r ( i n t j = 0 ; j <n ; j ++) a1_x [ j ] = t 2 _ a 1 _ x [ j ] = 0 ;
8
a1_y = 1 ;
9
t2_x [ i ]=1;
10
t 2 _ a 1 _ f ( n , x , t 2 _ x , a1_x , t 2 _ a 1 _ x , y , t 2 _ y , a1_y , t 2 _ a 1 _ y ) ;
11
f o r ( i n t j = 0 ; j <= i ; j ++)
12
c o u t << "H[ " << i << " ] [ " << j << " ] = "
13
<< t 2 _ a 1 _ x [ j ] << e n d l ;
14
t2_x [ i ]=0;
15
}
16
return 0 ;
17 }
It contains in line 6 a loop over the Cartesian basis vectors in R4 that are assigned in line 9 to
the initially zero vector t2_x (see line 6) for the fixed first-order adjoint a1_y=1 of the original
output, set in line 8. The corresponding t2_x entries can be reset to zero individually in line
14 as x is not overwritten in a1_f; hence, it is not modified by t2_a1_f either. Initialization
of t2_a1_y in line 5 is crucial for avoiding the addition of first derivative information to
t2_a1_x. According to (3.5), t2_a1_y is kept equal to zero by the repeated calls of t2_a1_f .
The columns of the Hessian are returned in t2_a1_x and they are printed to the standard
output in lines 1113. Both a1_x and t2_a1_x need to be reset to zero prior to each iteration
(see line 7) because of the incremental nature of the adjoint code.
A total of n = 4 evaluations of the second-order adjoint code are required to compute
all entries of the Hessian. A single Hessian-vector product is obtained at a constant factor
of the cost of evaluating the original code in Section 1.1.2. Refer to Table 3.2 for run-time
measurements.
Reverse-over-Forward Mode
To obtain an implementation of the second-order adjoint model in reverse-over-forward
mode, reverse mode AD is applied to (2.2) yielding
for j = n, . . . , n + p + m 1
!
j (vi )ij
(1)
(1)
, vk
vj =
kj
vk
kj
vj = j (vi )ij
114
v(2)i
v(2)i
(1)
v(2)k
ij
ij
kj
(1)
!
j (vi )ij
= v(2)i ij + v(2)j ,
vk
kj
!
2
j (vi )ij
(1)
(1)
= v(2)i ij + v(2)j ,
, vk
kj
vk vl
{k,l}j
!
j (vi )ij
(1)
(1)
.
= v(2)k
+ v(2)j ,
kj
vk
kj
(3.9)
(1)
a projection of the Hessian in directions y(2) and x(1) if v(2)n+p+i = y(2)i and v(2)j = x(2)j
are initialized to zero for i = 0, . . . , m 1 and i = 0, . . . , n 1, respectively. Adjoints of
intermediate variables are initialized to zero by default, which is exploited in the following
example.
Example 3.20 Again, we consider y = f (x) = sin(x0 x1 ). In reverse-over-forward mode
the original tangent-linear SAC is succeeded by its adjoint yielding
[tangent-linear SAC]
(1)
(1)
(1)
v2 = v0 v1 + v0 v1
v2 = v0 v1
(1)
(1)
v3 = cos(v2 ) v2
v3 = sin(v2 )
adjoint [tangent-linear SAC]:
(1)
(1)
(1)
(1)
(1)
(1)
v(2)1 = v(2)2 v0
v(2)1 = v(2)1 + v(2)2 v0
(1)
(1)
v(2)0 = v(2)2 v1 .
(1)
(1)
(1)
(1)
115
(1)
x
x(2)0
(1)
2
x(2)
= y(2) f (x) 0(1)
x(2)1
x
)T
Values of y and t1_y that are used in lines 13, 15, and 16 of the reverse section are overwritten
by the assignments in lines 9 and 10. Consequently, their values are stored on the required
data stack in lines 9 and 10, and they are restored in lines 12 and 14.
The following driver computes the Hessian of (1.2) at point xi = 1, i = 0, . . . , 3, set
in line 4.
1 i n t main ( ) {
2
const i n t n =4;
3
d o u b l e x [ n ] , y , a2_x [ n ] , a2_y , t 1 _ x [ n ] , t 1 _ y , a 2 _ t 1 _ x [ n ] , a 2 _ t 1 _ y ;
4
f o r ( i n t i = 0 ; i <n ; i ++) { x [ i ] = 1 ; t 1 _ x [ i ] = 0 ; }
116
5
6
7
8
9
10
11
12
13
14
15
16 }
It contains in line 6 a loop over the Cartesian basis vectors in R4 that are assigned to the initially zero (see line 4) vector t1_x in line 9. The second-order adjoint a2_t1_y of the original
output is set to one in line 8. The corresponding t1_x entries can be reset to zero individually
in line 13 as t1_x is not overwritten in t1_f and hence is not modified by a2_t1_f . Initialization of a2_y=0 in line 5 is crucial for avoiding the addition of first derivative information
to a2_x. According to (3.6), a2_y is kept equal to zero by the repeated calls of a2_t1_f . The
columns of the Hessian are returned in a2_x, and they are printed to the standard output in
lines 11 and 12. Both a2_x and a2_t1_x need to be reset to zero in line 7 prior to each call
of a2_t1_f as the adjoint code is generated in incremental reverse mode.
A total of n = 4 evaluations of the second-order adjoint code are required to compute
all entries of the Hessian. A single Hessian-vector product is obtained at a constant factor
of the cost of evaluating the original code in Section 1.1.2. Typically, second-order adjoint
code generated in reverse-over-forward mode is slightly less efficient than its competitor
that is generated in forward-over-reverse mode. The latter can be optimized more effectively by the native C++ compiler. The impact of this effect is almost negligible for our
simple example as illustrated by the run-time measurements in Table 3.2. It turns out to be
more significant for larger simulations.
Table 3.2. Run times for second-order adjoint code (in seconds). In order to determine the relative computational complexity R of the derivative code, n function evaluations
are compared with a full Hessian accumulation. We observe a factor of approximately 3.3
when comparing the time taken by a single run of the second-order adjoint code that was
generated in forward-over-reverse mode with that of an original function evaluation in the
right-most column. Reverse-over-forward mode performs slightly worse if compiler optimization is switched off (g++ -O0). Reverse-over-reverse mode turns out to be infeasible
(runs out of memory) for n = 4 104 . Its run time significantly exceeds that of the other two
modes for n = 104 and n = 2 104 .
n
f
t2_a1_f
a2_t1_f
a2_a1_f
104
0.9
4.0
4.6
12.3
g++ -O0
2 104 4 104
3.6
13.7
15.9
62.2
18.1
69.6
47.4
fail
104
0.2
0.7
0.7
2.2
g++ -O3
2 104 4 104
0.8
3.1
2.6
10.2
2.6
10.4
8.8
fail
R
1
3.3
3.3
11
117
Reverse-over-Reverse Mode
The implementation of reverse-over-reverse mode becomes very tedious, even for simple
cases. Its performance falls below that of forward-over-reverse and reverse-over-forward
modes because of the repeated data flow reversal. While reverse-over-reverse mode is likely
not to be used in practice, its investigation contributes to a better understanding of first- and
higher-order adjoint code, which is why we decided to consider it here. In order to obtain
an implementation of the second-order adjoint model in reverse-over-reverse mode, reverse
mode AD is applied to (2.9) yielding
for j = n, . . . , n + p + m 1
vj = j (vi )ij
for j = n + p + m 1, . . . , n
v(1)k
kj
= v(1)k
j (vi )ij
+ v(1)j ,
kj
vk
for j = n, . . . , n + p + m 1
kj
!
j (vi )ij
v(1,2)j = v(1,2)j + v(1,2)k kj ,
vk
kj
2
j (vi )ij
v(2)i ij = v(2)i ij + v(1,2)k kj , v(1)j ,
vk vl
(3.10)
{k,l}j
for j = n + p + m 1, . . . , n
!
j (vi )ij
v(2)i ij = v(2)i ij + v(2)j ,
.
vk
kj
118
of the implementation of (1.2) given in Section 1.1.2. The required data stack is renamed in
order to distinguish it from the stack that is generated by the second application of reverse
mode. The second-order adjoint code becomes
119
/ / 16
/ / 15
/ / 14
/ / 13
//
//
//
//
//
12
11
9
10
8
// 7
// 6
It consists of the usual four parts, namely, the augmented forward section (the first-order
adjoint code augmented with the storage of required overwritten values; lines 616), the
storage of the results (of the first-order adjoint code; lines 1920), the reverse section
(adjoint versions of all statements in the first-order adjoint code augmented with the recovery
of required values that are stored in the augmented forward section; lines 2340), and
the recovery of the results (lines 4345). Comments link adjoint statements with their
counterparts in the augmented forward section; for example, line 36 holds the adjoint version
of the assignment in line 10.
120
The entire data segment of the first-order adjoint code is duplicated according to
Adjoint Code Generation Rule 1 including the required data and result checkpoint stacks,
yielding a2_required_double_1 and a2_result_double_1 . The treatment of stack accesses exploits the fact that all stack values are both written and read exactly once. Hence, the adjoint
version of required_double_1 . push(y) in line 8 yields reading a2_y from a2_required_double_1
in line 37 followed by removing in line 38 the top of the stack. No further augmentation is
necessary as no required value is overwritten. Lines 11 and 34 form an analogous pair. All
remaining statements are the result of the straight application of the Adjoint Code Generation
Rules to the first-order adjoint code. For example, a required value of a1_y is overwritten
in line 13 and hence is stored on the required_double_2 stack. The corresponding adjoint
assignments in lines 31 and 32 are preceded by the recovery of the required value and its
removal in line 30 from required_double_2 .
The following driver computes the Hessian of (1.2) at point xi = 1, i = 0, . . . , 3, set
in line 4.
1 i n t main ( ) {
2
const i n t n =4;
3
d o u b l e x [ n ] , y , a2_x [ n ] , a2_y , a1_x [ n ] , a1_y , a 2 _ a 1 _ x [ n ] , a 2 _ a 1 _ y ;
4
f o r ( i n t i = 0 ; i <n ; i ++) { x [ i ] = 1 ; a 2 _ a 1 _ x [ i ] = 0 ; }
5
a2_y = 0 ;
6
f o r ( i n t i = 0 ; i <n ; i ++) {
7
f o r ( i n t j = 0 ; j <n ; j ++) a2_x [ j ] = 0 ;
8
a1_y = 1 ;
9
a2_a1_x [ i ] = 1 ;
10
a 2 _ a 1 _ f ( n , x , a2_x , a1_x , a2_a1_x , y , a2_y , a1_y , a 2 _ a 1 _ y ) ;
11
f o r ( i n t j = 0 ; j <= i ; j ++)
12
c o u t << "H[ " << i << " ] [ " << j << " ] = " << a2_x [ j ] << e n d l ;
13
a2_a1_x [ i ] = 0 ;
14
}
15
return 0 ;
16 }
It contains in line 6 a loop over the Cartesian basis vectors in R4 that are assigned to the
initially zero (see line 4) vector a2_a1_x. The first-order adjoint a1_y of the original output
is set to one in line 8. The corresponding a2_a1_x entries can be reset to zero individually
in line 13 as a2_a1_x is, according to (3.7), left unchanged by a2_a1_f. Initialization of
a2_y=0 in line 5 is crucial for avoiding the addition of first derivative information to a2_x.
According to (3.7), a2_y is kept equal to zero by the repeated calls of a2_a1_f. The columns
of the Hessian are returned in a2_x, and they are sent to the standard output in line 12. Only
a2_x needs to be reset to zero in line 7 prior to each call of a2_a1_f as the adjoint code is
generated in incremental reverse mode.
Again, n = 4 evaluations of the second-order adjoint code are required to compute
all entries of the Hessian. A single Hessian-vector product is obtained at a constant factor
of the cost of evaluating the original code in Section 1.1.2. Typically, second-order adjoint
code generated in reverse-over-reverse mode is significantly less efficient than the other
two variants of implementing the second-order adjoint model, as illustrated by the run-time
measurements in Table 3.2.
The derivative code compiler dcc supports the generation of second-order adjoint
code in all three modes. Thus, it may contribute to a better understanding of the principles
that AD is based on. Refer to Chapter 5 for further details.
3.3.2
121
Overloading
dco supports both forward-over-reverse and reverse-over-forward modes. Reverse-overreverse mode has been omitted due to its obvious drawbacks as a result of the repeated data
flow reversal. The given tangent-linear or adjoint code is treated analogous to any other
target code. Active floating-point variables are redeclared as dco_t1s_type in forward mode
or as dco_a1s_type in reverse mode. A tape is generated and interpreted in reverse mode.
The use of the corresponding second derivative code is very similar to what was discussed in
the previous section. Qualitatively, the run time behavior matches that of second derivative
code generated by source transformation.
Forward-over-Reverse Mode
The second-order adjoint model can be implemented by changing the types of all floatingpoint members in class dco_a1s_tape_entry and class dco_a1s_type from Section 2.2.2 to
dco_t1s_type as defined in Section 2.1.2 yielding the data type class dco_t2s_a1s_tape_entry
and class dco_t2s_a1s_type shown in the following code listing. See lines 4 and 11 for the
respective type changes.
1
2
3
4
5
6
7
8
9
10
11
12
13
class dco_t2s_a1s_tape_entry {
public :
i n t oc , a r g 1 , a r g 2 ;
dco_t1s_type v , a ;
...
};
class dco_t2s_a1s_type {
public :
i n t va ;
dco_t1s_type v ;
...
};
This approach yields an implementation of the second-order adjoint model in forwardover-reverse mode. The driver program in Listing 3.2 uses this implementation of the
second-order adjoint model to compute the Hessian of (1.2) for n = 4, set in line 5, at the
point xi = 1 for i = 0, . . . , 3.
The second-order adjoint data type dco_t2s_a1s_type is declared in the header
file dco_t2s_a1s_type.hpp. Its declaration is included in the driver program in
line 3, and it is used to activate the target code (the function f in lines 711) by changing the type of all floating-point variables from double to dco_t2s_a1s_type . A tape of size
DCO_T2S_A1S_TAPE_SIZE (to be replaced with an integer value by the C preprocessor) is
allocated statically in dco_t2s_a1s_type.cpp and is later linked to the object code of
the driver program.
Both the taping and the interpretation of the tape are performed in tangent-linear mode.
(1)
(1)
Hence, x[ i ]. v. t xi and dco_t2s_a1s_tape [x[ i ]. va ]. v. t xi need to be initialized
simultaneously as shown in line 21. A new tape is generated for each column of the Hessian.
Therefore, the virtual address counter dco_t2s_a1s_vac as well as all adjoint tape entries are
reset to zero prior to each iteration of the loop in line 18 by calling in line 19 the function
122
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
123
Tape:
0:
1:
2:
3:
4:
5:
6:
7:
8:
9:
10:
11:
12:
13:
14:
15:
...
21:
22:
23:
Interpreted Tape:
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
0,-1,-1,
1, 0,-1,
0,-1,-1,
1, 2,-1,
0,-1,-1,
1, 4,-1,
0,-1,-1,
1, 6,-1,
0,-1,-1,
1, 8,-1,
4, 1, 1,
2, 9,10,
1,11,-1,
4, 3, 3,
2,12,13,
1,14,-1,
(1.,0.),
(1.,0.),
(1.,0.),
(1.,0.),
(1.,0.),
(1.,1.),
(1.,0.),
(1.,0.),
(0.,0.),
(0.,0.),
(1.,0.),
(1.,0.),
(1.,0.),
(1.,0.),
(2.,0.),
(2.,0.),
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
(0.,0.)
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
]
[ 1,20,-1,
(4.,2.), (0.,0.) ]
[ 4,21,21, (16.,16.), (0.,0.) ]
[ 1,22,-1, (16.,16.), (0.,0.) ]
(a)
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
[
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
...
(1.,0.), (16.,8.) ]
(1.,0.), (16.,8.) ]
(1.,0.), (16.,8.) ]
(1.,0.), (16.,8.) ]
(1.,0.), (16.,24.) ]
(1.,1.), (16.,24.) ]
(1.,0.), (16.,8.) ]
(1.,0.), (16.,8.) ]
(0.,0.),
(8.,4.) ]
(0.,0.),
(8.,4.) ]
(1.,0.),
(8.,4.) ]
(1.,0.),
(8.,4.) ]
(1.,0.),
(8.,4.) ]
(1.,0.),
(8.,4.) ]
(2.,0.),
(8.,4.) ]
(2.,0.),
(8.,4.) ]
[ ...
(4.,2.),
[ ... (16.,16.),
[ ... (16.,16.),
(8.,4.) ]
(1.,0.) ]
(1.,0.) ]
(b)
Figure 3.9. dco_t2s_a1s_tape for the computation of third column of the Hessian.
The five columns show for each tape entry the operation code, the virtual addresses of the (up
to two) arguments, the tangent-linear function value, and the tangent-linear adjoint value.
Tangent-linear quantities are pairs that consist of the original value and the corresponding
directional derivative.
The tape entries 0, 2, 4, and 6 represent the constants on the right-hand side of the
assignment in line 20 of Listing 3.2 that are converted into variables of type dco_t2s_a1s_type
by the constructor
1 d c o _ t 2 s _ a 1 s _ t y p e : : d c o _ t 2 s _ a 1 s _ t y p e ( c o n s t d o u b l e& x ) : v ( x ) {
2
d c o _ t 2 s _ a 1 s _ t a p e [ d c o _ t 2 s _ a 1 s _ v a c ] . oc =DCO_T2S_A1S_CONST ;
3
d c o _ t 2 s _ a 1 s _ t a p e [ d c o _ t 2 s _ a 1 s _ v a c ] . v=x ;
4
va = d c o _ t 2 s _ a 1 s _ v a c ++;
5 };
Their values are assigned to x[ j ] , yielding tape entries 1, 3, 5, and 7 as a result of calling
the overloaded assignment operator
1 d c o _ t 2 s _ a 1 s _ t y p e&
2 d c o _ t 2 s _ a 1 s _ t y p e : : operator =( const
3
i f ( t h i s ==&x ) r e t u r n t h i s ;
4
dco_t2s_a1s_tape [ dco_t2s_a1s_vac
5
dco_t2s_a1s_tape [ dco_t2s_a1s_vac
6
dco_t2s_a1s_tape [ dco_t2s_a1s_vac
7
va = d c o _ t 2 s _ a 1 s _ v a c ++;
8
return t h i s ;
9 }
d c o _ t 2 s _ a 1 s _ t y p e& x ) {
] . oc =DCO_T2S_A1S_ASG ;
] . v=v=x . v ;
] . a r g 1 =x . va ;
124
All tangent-linear components v (2) are initialized to zero by the assignments in lines 3 and 5
of the constructor and the assignment operator, respectively. Note that these assignments are
overloaded for variables of type dco_t1s_type . To compute the third column of the Hessian,
(2)
x2 x [2]. v. t dco_t2s_a1s_tape [x [2]. va ]. v. t is set to one in line 21 of Listing 3.2. The
(2)
fifth components of all tape entries are initialized to (v(1) , v(1) ) = (0, 0) by the function
dco_t2s_a1s_reset_tape that is called in line 19.
The tape entries 8 and 9 correspond to line 8 in Listing 3.2. Four evaluations of
the assignment in line 9 yield the twelve tape entries 1021. For example, tape entry 10
stands for the square operation applied to x[0] (tape entry 1) followed by tape entry 11 that
represents the addition of the result to y (tape entry 9). Anew live instance of y is generated by
the subsequent assignment (tape entry 12). This new instance of y is incremented during the
next loop iteration (see tape entry 14) and so forth. When processing line 10 of Listing 3.2,
the square of the live instance of y at the end of the loop (tape entry 21) is squared (tape entry
22) and the result is assigned to the output y of the subroutine f (tape entry 23). According
to (3.5), we obtain
(2)
The interpretation of the tape is preceded by the initialization of the adjoint output
y(1) dco_t2s_a1s_tape [y.va ]. a . v yielding a modification of the first-order adjoint component of tape entry 23 in Figure 3.9 (b). Initialization with one results in a first-order adjoint
accumulation of the gradient. Overloading of the interpreter in tangent-linear mode adds the
(2)
propagation of tangent-linear projections of the Hessian in direction xi dco_t2s_a1s_type
[x[ i ]. va ]. v. t for i = 0, . . . , 3. The first- and second-order adjoints are copied into tape entry
22 that represents the result of the product yy in line 10 of Listing 3.2 without modification.
According to (3.5), the interpretation of tape entry 22 yields
v(1)21 = v(1)21 + v(1)22 2 v21 = 0 + 1 2 4 = 8,
(2)
(2)
(2)
class dco_a2s_t1s_type {
public :
125
dco_a1s_type v , t ;
...
3
4
5
};
The driver program performs the same task as that in Listing 3.2.
A shortened version of the tape that is generated for the computation of the second
column of the Hessian is shown in Figure 3.10 (a) (tape after recording) and Figure 3.10 (b)
(tape after interpretation). The increase in the length of the tape by a factor of approximately
four is due to the entire tangent-linear code being recorded. Initialization of the independent
variables in line 20 of Listing 3.3 yields four tape entries, respectively. For example, tape
entries 4 and 5 represent a call of the constructor
d c o _ a 2 s _ t 1 s _ t y p e : : d c o _ a 2 s _ t 1 s _ t y p e ( c o n s t d o u b l e& x ) : t ( 0 ) v ( x ) {}
that converts the constants on the right-hand side of the assignment x[ i]=1 to variables of
type dco_a2s_t1s_type. One tape entry is generated for the value (v; tape entry 5) and for the
Listing 3.3. Driver for reverse-over-forward mode by overloading.
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
126
Interpreted Tape:
0, -1, -1,
0, -1, -1,
1, 4, -1,
1, 5, -1,
0.,
1.,
1.,
1.,
0.
0.
0.
0.
]
]
]
]
4, 7, 6,
4, 6, 7,
2, 39, 38,
1, 40, -1,
4, 7, 7,
1, 42, -1,
1.,
1.,
2.,
2.,
1.,
1.,
0.
0.
0.
0.
0.
0.
]
]
]
]
]
]
2,
1,
2,
1,
34,
46,
35,
48,
41,
-1,
43,
-1,
2.,
2.,
2.,
2.,
0.
0.
0.
0.
]
]
]
]
4,
4,
2,
1,
4,
1,
1,
1,
83,
82,
87,
88,
83,
90,
89,
91,
82,
83,
86,
-1,
83,
-1,
-1,
-1,
8.,
8.,
16.,
16.,
16.,
16.,
16.,
16.,
0.
0.
0.
0.
0.
0.
0.
0.
]
]
]
]
]
]
]
]
...
4:
5:
6:
7:
...
38:
39:
40:
41:
42:
43:
...
82:
83:
...
86:
87:
88:
89:
90:
91:
92:
93:
(a)
[
[
[
[
0, -1, -1,
0, -1, -1,
1, 4, -1,
1, 5, -1,
0.,
1.,
1.,
1.,
16.
24.
16.
24.
]
]
]
]
[
[
[
[
[
[
4, 7, 6,
4, 6, 7,
2, 39, 38,
1, 40, -1,
4, 7, 7,
1, 42, -1,
1.,
1.,
2.,
2.,
1.,
1.,
8.
8.
8.
8.
4.
4.
]
]
]
]
]
]
[ 1, 79, -1,
[ 1, 81, -1,
2.,
4.,
8. ]
4. ]
[
[
[
[
[
[
[
[
4,
4,
2,
1,
4,
1,
1,
1,
83,
82,
87,
88,
83,
90,
89,
91,
82,
83,
86,
-1,
83,
-1,
-1,
-1,
8.,
8.,
16.,
16.,
16.,
16.,
16.,
16.,
1.
1.
1.
1.
0.
0.
1.
0.
]
]
]
]
]
]
]
]
(b)
Figure 3.10. dco_a2s_t1s_tape for the computation of the second column of the
Hessian; the five columns show for each tape entry the operation code, the virtual addresses
of the (up to two) arguments, the function value, and the adjoint value. First- and secondorder adjoints are propagated during the interpretation of the tape of the underlying firstorder tangent-linear code.
directional derivative component ( t ; tape entry 4), respectively. The overloaded assignment
operator
1 d c o _ a 2 s _ t 1 s _ t y p e&
2 d c o _ a 2 s _ t 1 s _ t y p e : : operator =( const
3
i f ( t h i s ==&x ) r e t u r n t h i s ;
4
t =x . t ; v=x . v ;
5
return t h i s ;
6
}
d c o _ a 2 s _ t 1 s _ t y p e& x ) {
adds tape entries 6 and 7 that represent the two assignments in line 4. According to (3.6),
(1)
(1)
we obtain v6 v7 = 1 v5 v4 = 0, which does not match the value in Figure 3.10 (a).
127
(2)
(1)
The displayed value is due to line 21 in Listing 3.3, where x1 v7 v6 is set equal to
one explicitly.
A total of six tape entries are generated for products as the original multiplication
is augmented with the product rule in the tangent-linear code. For example, the product
performed in line 9 of Listing 3.3 during the second iteration of the enclosing loop yields
tape entries 38, 39, 40, and 42. The respective assignments to temporary variables that are
performed by the overloaded multiplication operator
1 d c o _ a 2 s _ t 1 s _ t y p e operator ( const
2
const
3
d c o _ a 2 s _ t 1 s _ t y p e tmp ;
4
tmp . t =x1 . t x2 . v+x1 . v x2 . t ;
5
tmp . v=x1 . v x2 . v ;
6
r e t u r n tmp ;
7 }
d c o _ a 2 s _ t 1 s _ t y p e& x1 ,
d c o _ a 2 s _ t 1 s _ t y p e& x2 ) {
are represented by tape entries 41 and 43. According to (3.6), the two assignments in lines
4 and 5 of the code for operator result in the following computation:
(1)
v38 = v7 v7 = v7 v6 = 1 1 = 1
(1)
v39 = v7 v7 = v6 v7 = 1 1 = 1
v40 = v39 + v38 = 1 + 1 = 2
(1)
Additions yield four tape entries; for example, 4649, where v47 v49 . The product
in line 10 of Listing 3.3 followed by the assignment to the output y of the subroutine f
(1)
(1)
is represented by the last eight tape entries 8693, where v89 v91 and v92 v93 . With
(1)
v83 = 4 and v82 v83 = 2, the following steps are performed according to (3.6):
(1)
128
Table 3.3. Run times for second-order adjoint code by overloading (in seconds).
In order to determine the relative computational complexity R of the derivative code, n
function evaluations are compared with n evaluations of the second-order adjoint code
required for a full Hessian accumulation. We observe a constantly growing factor of at
least 39 when comparing the run time of a single run of the second-order adjoint that was
code generated in forward-over-reverse mode with that of an original function evaluation
in the right-most column. This factor is approximately double the factor that was observed
for first-order adjoint code generated by overloading. The impact of compiler optimization
is even less significant for second-order adjoint code generated in reverse-over-forward
mode.
g++ -O0
n
f
t2_a1_f
a2_t1_f
104
0.9
87.8
147.6
2 104
3.6
360.6
562.6
g++ -O3
4 104
13.7
1435.0
2297.9
2 104
0.8
164.1
474.0
104
0.2
31.4
101.8
4 104
3.1
772.6
1886.3
1
> 39
> 127
interpretation of the tape, all first-order adjoint components of tape entries that represent
variables in the original code contain second-order adjoint projections of the local Hessians;
for example, tape entries 82 and 83 represent the right-hand side instance of y in line 10
of Listing 3.3. The local Hessian is equal to the constant scalar 2. According to (3.6),
we get
(1)
(1)
(1)
(1)
Similarly,
(1)
(1)
= v(2)42 2 v7 + v(2)40 2 v6 = 4 2 1 + 8 2 1 = 24
(1)
v(2)6 v(2)7
(1)
(1)
= v(2)7 + v(2)42 2 v7
= v(2)6 + v(2)40 2 v7 = 0 + 8 2 1 = 16.
The second diagonal entry of the Hessian (24) is accumulated in the adjoint component of
tape entry 7. Tape entry 6 contains the second gradient entry (16) in its adjoint component.
According to Theorem 3.16, the same value is contained in y (1) v92 . Refer to Table 3.3
for run time measurements.
Reverse-over-Reverse Mode
The discussion of an implementation of second-order adjoint code in reverse-over-reverse
mode by overloading is omitted; this approach is irrelevant in practice. The repeated reversal
129
of the data flow yields an excessive memory requirement due to recursive taping as well as
an increased computational cost caused by the complexity of the interpretation procedure.
3.3.3
We recall the basics of compression techniques for second derivative tensors based on
second-order adjoint projections as in [17]. Scalar functions F : Rn Rm , where m =
1 are of particular interest in the context of nonlinear programming. A corresponding
second-order adjoint code computes products of the Hessian with a vector as y(1) 2 F (x)
x(2) , where y(1) = 1 and x(2) Rn . Tangent-linear compression techniques as described in
Section 2.1.3 can be applied. Moreover, symmetry should be exploited potentially yielding
better compression rates. For vector functions (m > 1), the sparsity of the Hessian is
closely related to that of the Jacobian. Hence, a combination of tangent-linear and adjoint
compression is likely to give the best compression rate as described in [36].
Example 3.24 Let
h0,0
2F = 0
h0,2
0
h1,1
h1,2
h0,2
h1,2 .
h2,2
The dense third row appears to make unidirectional compression as in Section 2.1.3 inapplicable. However, symmetry of the Hessian implies that only one instance of h0,2 and
h1,2 needs to be recovered, respectively. Consequently, the following compression can be
applied:
h0,0
1 0
h0,0
0
h0,2
h0,2
h1,1 h1,2 1 0 = h1,1
h1,2 .
2 F St = 0
h0,2 h1,2 h2,2
h0,2 + h1,2 h2,2
0 1
All five distinct nonzero entries of the Hessian can be recovered by direct substitution.
Definition 3.25. Let A Rmnn be a symmetric 3-tensor, Sa = (sja,i ) Rmla , and
St = (sjt ,i ) Rnlt . Then, the compressed Hessian B (bk,j ,i ) = Sa , A, St
Rla lt lt
is defined as
a
t
bk,j , = sk,
, A
, s,j
for k = 1, . . . , la and j = 1, . . . , lt .
When applying compression techniques in second-order adjoint mode, we aim to find
seed matrices Sa and St with minimal numbers of columns la and lt such that
B = Sa , A, St
Rla lt ,
(3.11)
where A 2 F (x), Sa {0, 1}mla , St {0, 1}nlt , and ai,j = 0, i j bl,i B : ai,j = bl,i .
All nonzero entries of the lower (resp., upper) triangular submatrix of the Hessian need to be
present in the compressed Hessian B. Harvesting solves the system in (3.11) by substitution
if direct methods are applied. Again, indirect methods may result in a better compression
rate. Refer to [30] for details on the combinatorial problem of minimizing lt and la by graph
coloring algorithms.
130
Example 3.26 Let the second-order adjoint model of the implementation of the SFI problem
(see Example 1.2) be implemented as a subroutine
v o i d t 2 _ a 1 _ f ( i n t s , d o u b l e y , d o u b l e t 2 _ y ,
d o u b l e a1_y , d o u b l e t 2 _ a 1 _ y ,
d o u b l e& l , d o u b l e& t 2 _ l ,
d o u b l e& a 1 _ l , d o u b l e& t 2 _ a 1 _ l ,
d o u b l e r , d o u b l e t 2 _ r ,
d o u b l e a 1 _ r , d o u b l e t 2 _ a 1 _ r ) ;
The Hessian tensor of the residual is very sparse. Its computation is complicated by the fact
that we are actually dealing with a 6-tensor instead of a 3-tensor, because both y and r are
implemented as matrices. Thus, the Hessian for s = 3 determined in Example 3.2 becomes
2
h ey1,1
h2 ey1,2
= h2 ey2,1
h2 ey2,2
if i1 = j1 = k1 = 1 and i2 = j2 = k2 = 1
if i1 = j1 = k1 = 1 and i2 = j2 = k2 = 2
if i1 = j1 = k1 = 2 and i2 = j2 = k2 = 1
if i1 = j1 = k1 = 2 and i2 = j2 = k2 = 2
otherwise.
Knowledge about this sparsity pattern can be exploited by seeding and harvesting the secondorder adjoint routine as shown in the following driver fragment:
...
f o r ( i n t i = 1 ; i < s ; i ++)
f o r ( i n t j = 1 ; j < s ; j ++)
a1_r [ i ] [ j ]= t2_y [ i ] [ j ] = 1 ;
t 2 _ a 1 _ f ( 1 , s , y , t 2 _ y , a1_y , t 2 _ a 1 _ y ,
lambda , t 2 _ l a m b d a , a1_lambda , t 2 _ a 1 _ l a m b d a ,
r , t2_r , a1_r , t 2 _ a 1 _ r ) ;
...
The nonzero entries of the Hessian are returned in t2_a1_y whose entries are assumed to be
initialized to zero prior to the single run of t2_a1_f .
In general, the adjoint seed matrix used for the first-order adjoint can also be applied
to compress the Hessian tensor. Linearities in the underlying function yield constants in
the Jacobian and further zero entries in the Hessian as in the given example. While this
way of exploiting sparsity appears to yield optimal computational complexity, there is still
room for improvement. The preferred approach to the computation of the Hessian of the
SFI problem uses an implementation of the second-order tangent-linear model
v o i d t 2 _ t 1 _ f ( i n t s , d o u b l e y , d o u b l e t 2 _ y ,
d o u b l e t 1 _ y , d o u b l e t 2 _ t 1 _ y ,
d o u b l e& l , d o u b l e& t 2 _ l ,
d o u b l e& t 1 _ l , d o u b l e& t 2 _ t 1 _ l ,
d o u b l e r , d o u b l e t 2 _ r ,
d o u b l e t 1 _ r , d o u b l e t 2 _ t 1 _ r )
131
as follows:
...
f o r ( i n t i = 1 ; i < s ; i ++)
f o r ( i n t j = 1 ; j < s ; j ++)
t1_y [ i ] [ j ]= t2_y [ i ] [ j ] = 1 ;
t 2 _ t 1 _ f ( s , y , t2_y , t1_y , t2_t1_y ,
lambda , t 2 _ l a m b d a , t 1 _ l a m b d a , t 2 _ t 1 _ l a m b d a ,
r , t2_r , t1_r , t2_t1_r ) ;
...
All nonzero entries of the Hessian are returned in t2_t1_r . A single call of t2_t1_f is
performed. The generation and evaluation of the computationally more challenging adjoint
code can be avoided due to the strict symmetry of the Hessian tensor of the SFI problem
under arbitrary projections.
3.4
The application of forward or reverse mode AD to any of the second derivative models yields
third derivative information and so forth. In order to formalize this repeated reapplication
of AD, we need to generalize the tensor notation introduced in Section 3.1.
p
j =0,...,m1
with T = (tj ,i1 ,...,ip1 )ik =0,...,n1 for k=1,...,p1 and
n1
tj ,i1 ,...,ip1 ,l vl
l=0
for ik = 0, . . . , n 1 (k = 1, . . . , p 1) and j = 0, . . . , m 1.
Higher-order tangent-linear projections are defined recursively as
p2
T , v1 , v2 T , v1 , v2 Rmn
p3
T , v1 , v2 , v3
T , v1
, v2
, v3
Rmn
..
.
T , v1 , v2 , . . . , vp
T , v1
, v2
, . . .
, vp
Rm .
First- and higher-order adjoint projections in directions in Rn are defined as the corresponding tangent-linear projections. Such projections appear in the context of higher-order
adjoint code similar to the case considered in Section C.3.3 in the appendix.
132
m1
ul tl,i1 ,...,ip
l=0
for ik = 0, . . . , n 1 (k = 1, . . . , p).
A second-order adjoint projection of T in directions u Rm and v Rn is defined as
a first-order (tangent-linear or adjoint) projection in direction v of the first-order adjoint
projection in direction u.
Higher-order adjoint projections are defined recursively as
v1 , u, T
v1 , u, T
Rn
p1
p2
v2 , v1 , u, T
v2 , v1 , u, T
Rn
..
.
vp , . . . , v1 , u, T
vp , v1 , u, T
, . . .
R.
p
Lemma 3.31. Let T Rmn be defined as in Definition 3.27, and let k p. Then,
u, T , v1 , v2 , . . . , vk
= u, T , (v1 , v2 , . . . , vk )
for any permutation of the vi Rn for i = 1, . . . , k, and where u Rm .
Proof. Again, this result follows immediately from the symmetry of T .
133
v(j11 ,...,jfb )
denotes the dth derivative of v, where d = f + b. The current value of the dth derivative
of v is computed by a derivative code that resulted from the ik th differentiation performed
in forward mode for k = 1, . . . , f and where the jl th differentiation is performed in reverse
(2,7)
mode for l = 1, . . . , b. For example, v(1,6) represents a fourth derivative of v in a kth-order
adjoint code (k 7) that is obtained by a sequence of applications of forward and reverse
mode AD, where the first and sixth applications are performed in reverse mode, and the
second and seventh applications are performed in forward mode.
Definition 3.32. The third derivative tensor 3 F = 3 F (x) Rmnnn of a multivariate
vector function F : Rn Rm , where y = F (x), induces a trilinear mapping Rn Rn Rn
Rm defined by
(u, v, w) 3 F , u, v, w
.
The function F (1,2,3) : R4n Rm that is defined as
y(1,2,3) = F (1,2,3) (x, u, v, w) 3 F (x), u, v, w
(3.12)
(2,3)
(3.13)
(1,3)
(1,2)
(3)
(2)
(1)
134
x(1)
= x(1,3) ,
s
and
x(2)
= x(2,3)
s
135
x(1)
x(3)
x(1,3)
x(2)
x(2,3)
23
24
25
26
27
28
29
30
31
32
33
34
35
36 }
t 2 _ y = t 2 _ y +2 x [ i ] t 2 _ x [ i ] ;
t 3 _ y = t 3 _ y +2 x [ i ] t 3 _ x [ i ] ;
y=y+x [ i ] x [ i ] ;
}
t 3 _ t 2 _ t 1 _ y =2( t 3 _ t 2 _ y t1_y + t2_y t 3 _ t 1 _ y + t3_y t 2 _ t 1 _ y
+y t 3 _ t 2 _ t 1 _ y ) ;
t 2 _ t 1 _ y =2 t 2 _ y t 1 _ y +2 y t 2 _ t 1 _ y ;
t 3 _ t 1 _ y =2 t 3 _ y t 1 _ y +2 y t 3 _ t 1 _ y ;
t 1 _ y =2 y t 1 _ y ;
t 3 _ t 2 _ y =2 t 3 _ y t 2 _ y +2 y t 3 _ t 2 _ y ;
t 2 _ y =2 y t 2 _ y ;
t 3 _ y =2 y t 3 _ y ;
y=y y ;
136
Each assignment is preceded by its tangent-linear version; for example, the tangent-linear
version of the assignment in line 33 is inserted in line 32. First-order projections of 3 F (x)
in directions t1_x, t2_x , and t3_x are returned in t1_y, t2_y, and t3_y, respectively. Corresponding second-order projections are returned in t2_t1_y , t3_t2_y , and t3_t1_y .
The following driver program computes the entire third derivative tensor:
i n t main ( ) {
const i n t n =4;
double x [ n ] , t1_x [ n ] , t2_x [ n ] , t 2 _ t 1 _ x [ n ] ;
double t3_x [ n ] , t 3 _ t 1 _ x [ n ] , t 3 _ t 2 _ x [ n ] , t 3 _ t 2 _ t 1 _ x [ n ] ;
double y , t1_y , t2_y , t 2 _ t 1 _ y ;
double t3_y , t3_t1_y , t3_t2_y , t 3 _ t 2 _ t 1 _ y ;
f o r ( i n t j = 0 ; j <n ; j ++) {
x[ j ]=1;
t 2 _ t 1 _ x [ j ]= t2_x [ j ]= t1_x [ j ]= t3_x [ j ]
= t 3 _ t 2 _ t 1 _ x [ j ]= t 3 _ t 2 _ x [ j ]= t 3 _ t 1 _ x [ j ] = 0 ;
}
f o r ( i n t k = 0 ; k<n ; k ++) {
t1_x [ k ]=1;
f o r ( i n t j = 0 ; j <=k ; j ++) {
t2_x [ j ]=1;
f o r ( i n t i = 0 ; i <= j ; i ++) {
t3_x [ i ]=1;
t 3 _ t 2 _ t 1 _ f ( n , x , t3_x , t2_x , t3_t2_x , t1_x ,
t3_t1_x , t2_t1_x , t3_t2_t1_x ,
y , t3_y , t2_y , t3_t2_y , t1_y ,
t3_t1_y , t2_t1_y , t3_t2_t1_y ) ;
c o u t << "H[ " << k << " ] [ " << j << " ] [ " << i << " ] = "
<< t 3 _ t 2 _ t 1 _ y << e n d l ;
t3_x [ i ]=0;
}
t2_x [ j ]=0;
}
t1_x [ k ]=0;
}
return 0 ;
}
137
(3)
y(1)
(3)
= y(1) ,
s
and
x(2)
= x(2,3)
s
138
(2)
2 F(x)
2 F
3 F(x)
y(1)
(3)
y(1)
x(3)
x(2)
x(2,3)
y=y y ;
t3_t2_y= t3_t2_required_double . top ( ) ;
t 3 _ t 2 _ r e q u i r e d _ d o u b l e . pop ( ) ;
t 2 _ y = t 2 _ r e q u i r e d _ d o u b l e . t o p ( ) ; t 2 _ r e q u i r e d _ d o u b l e . pop ( ) ;
t 3 _ y = t 3 _ r e q u i r e d _ d o u b l e . t o p ( ) ; t 3 _ r e q u i r e d _ d o u b l e . pop ( ) ;
y= r e q u i r e d _ d o u b l e . t o p ( ) ; r e q u i r e d _ d o u b l e . pop ( ) ;
t 3 _ t 2 _ a 1 _ y = 2 ( t 3 _ t 2 _ y a1_y + t 2 _ y t 3 _ a 1 _ y
+ t 3 _ y t 2 _ a 1 _ y +y t 3 _ t 2 _ a 1 _ y ) ;
t 2 _ a 1 _ y = 2 ( t 2 _ y a1_y +y t 2 _ a 1 _ y ) ;
t 3 _ a 1 _ y = 2 ( t 3 _ y a1_y +y t 3 _ a 1 _ y ) ;
a1_y =2 y a1_y ;
f o r ( i n t i =n 1; i >=0; i ) {
t 3 _ t 2 _ a 1 _ x [ i ] = t 3 _ t 2 _ a 1 _ x [ i ] + 2 ( t 3 _ t 2 _ x [ i ] a1_y
+ t 2 _ x [ i ] t 3 _ a 1 _ y + t 3 _ x [ i ] t 2 _ a 1 _ y +x [ i ] t 3 _ t 2 _ a 1 _ y ) ;
t 2 _ a 1 _ x [ i ] = t 2 _ a 1 _ x [ i ] + 2 ( t 2 _ x [ i ] a1_y +x [ i ] t 2 _ a 1 _ y ) ;
139
42
t 3 _ a 1 _ x [ i ] = t 3 _ a 1 _ x [ i ] + 2 ( t 3 _ x [ i ] a1_y +x [ i ] t 3 _ a 1 _ y ) ;
43
a1_x [ i ] = a1_x [ i ] + 2 x [ i ] a1_y ;
44
}
45 }
Each assignment is preceded by its tangent-linear version; for example, the tangent-linear
version of the assignment in line 41 is inserted in lines 3940. All stacks are duplicated.
The respective accesses are augmented with corresponding accesses of the tangent-linear
stacks.
The following driver program computes the whole third derivative tensor:
i n t main ( ) {
const i n t n =4;
double x [ n ] , y ;
double t3_x [ n ] , t3_y ;
double t2_x [ n ] , t2_y ;
double t 3 _ t 2 _ x [ n ] , t 3 _ t 2 _ y ;
d o u b l e a1_x [ n ] , a1_y ;
double t3_a1_x [ n ] , t3_a1_y ;
double t2_a1_x [ n ] , t2_a1_y ;
double t 3 _ t 2 _ a 1 _ x [ n ] , t 3 _ t 2 _ a 1 _ y ;
f o r ( i n t i = 0 ; i <n ; i ++) {
x [ i ] = 1 ; t3_x [ i ]= t2_x [ i ]= t 3 _ t 2 _ x [ i ] = 0 ;
}
f o r ( i n t k = 0 ; k<n ; k ++) {
t3_x [ k ]=1;
f o r ( i n t i = 0 ; i <n ; i ++) {
f o r ( i n t j = 0 ; j <n ; j ++)
a1_x [ j ] = t 3 _ a 1 _ x [ j ] = t 2 _ a 1 _ x [ j ] = t 3 _ t 2 _ a 1 _ x [ j ] = 0 ;
y= t 3 _ y = t 2 _ y = t 3 _ t 2 _ y = t 3 _ a 1 _ y = t 2 _ a 1 _ y = t 3 _ t 2 _ a 1 _ y = 0 ;
a1_y = 1 ;
t2_x [ i ]=1;
t 3 _ t 2 _ a 1 _ f ( n , x , t 3 _ x , t 2 _ x , t 3 _ t 2 _ x , a1_x ,
t3_a1_x , t2_a1_x , t3_t2_a1_x ,
y , t 3 _ y , t 2 _ y , t 3 _ t 2 _ y , a1_y ,
t3_a1_y , t2_a1_y , t3_t2_a1_y ) ;
f o r ( i n t j = 0 ; j <n ; j ++)
c o u t << "H[ " << k << " ] [ " << i << " ] [ " << j << " ] = "
<< t 3 _ t 2 _ a 1 _ x [ j ] << e n d l ;
t2_x [ i ]=0;
}
t3_x [ k ]=0;
}
return 0 ;
}
140
Table 3.4. Run times for third-order tangent-linear and adjoint code (in seconds).
In order to determine the relative computational complexity R of the derivative code, n
function evaluations are compared with n evaluations of the third-order tangent-linear
code (t3_t2_t1_f) and with the same number of evaluations of the third-order adjoint code
that was generated in forward-over-forward-over-reverse mode (t3_t2_a1_f). We observe
a factor less than 8 when comparing the run time of a single run of the third-order tangentlinear code with that of an original function evaluation. O(n3 ) runs of the third-order
tangent-linear code are required for the evaluation of the whole third derivative tensor
or of projections thereof. Even less time is taken by a single execution of the third-order
adjoint code due to more effective compiler optimization. Moreover, only O(n2 ) runs are
required for the evaluation of the whole third derivative tensor. Second- and third-order
projections thereof can even be computed with a computational complexity of O(n) and
O(1), respectively.
g++ -O0
2 104 4 104
3.6
13.7
28.1
113.6
31.2
129.3
104
0.9
7.4
8.2
n
f
t3_t2_t1_f
t3_t2_a1_f
104
0.2
1.4
1.0
g++ -O3
2 104 4 104
0.8
3.1
5.5
23.8
4.0
16.9
R
1
7.7
5.5
checkpoint r to the second-order adjoint model in (3.6) yields reverse-over-reverse-overforward mode. The augmented forward section becomes
y = F (x)
y(1) = F (x), x(1)
(1)
(1)
(1)
s[1] = y(2)
(1)
y(2) = 0.
It is succeeded by the following reverse section.
(1)
y(2) = s[1]
(1)
y(2,3) = 0
y(2) = s[0]
y(2,3) = 0
(1)
(1)
(1)
(1)
141
(1)
(1)
(1)
(1)
(1)
y(3) = 0
x(3) = x(3) + y(3) , F (x)
y(3) = 0.
Constant-folding, copy propagation, and substitution yield
y = F (x)
y(1) = F (x), x(1)
(1)
(1)
(1)
(1)
(1)
+ y(3) , F (x)
(1)
= x(2,3) , F (x)
+ x(2,3) , 2 F (x), x(1)
(1)
(1)
(1)
y(3) = 0
y(3) = 0.
The entire third derivative tensor can be accumulated at the computational cost of O(m
(1)
(1)
n2 ) Cost(F ) by setting x(3) = y(2) = x(2,3) = y(3) = y(3) = 0 initially and by letting x(2,3) ,
(1)
y(2) , and x(1) range independently over the Cartesian basis vectors in Rn , Rm , and Rn ,
respectively. Projections of 3 F (x) can be obtained at a lower computational cost; for
example,
(1)
x(2,3) , 3 F (x), x(1) Rm at the cost of O(m) Cost(F ) (y(2) ranges over the Cartesian basis vectors in Rm );
142
x(2,3) , 3 F (x)
Rmn at the cost of O(m n) Cost(F ) (y(2) and x(1) range independently over the Cartesian basis vectors in Rm and Rn , respectively);
(1)
y(2) , 3 F (x)
Rnn at the cost of O(n2 ) Cost(F ) (x(2,3) and x(1) range independently over the Cartesian basis vectors in Rn );
Moreover, the third-order adjoint code returns arbitrary projections of the second and first
derivative tensors in addition to the original function value. Potential sparsity should be
exploited whenever applicable.
3.4.3
Projections of fourth and potentially higher derivative tensors may be required if a numerical
second-order algorithm is applied to a simulation code that already contains second or higher
derivatives of some underlying function. The use of AD remains straightforward.
Definition 3.38. The fourth derivative tensor 4 F = 4 F (x) Rmnnnn of a multivariate vector function F : Rn Rm , where y = F (x), induces a quadrilinear mapping
Rn Rn Rn Rn Rm defined by
(t, u, v, w) 4 F , t, u, v, w
.
The function F (1,2,3,4) : R5n Rm that is defined as
y(1,2,3,4) = F (1,2,3,4) (x, t, u, v, w) 4 F (x), t, u, v, w
(3.14)
x(1)
(2,3,4)
= F(1)
(x, u, t, v, w) u, 4 F (x), t, v, w
(3.15)
143
x(1)
(4)
(4)
144
3.5
3.5.1
Exercises
Second Derivative Code
3.5.2
Consider the given implementation of the extended Rosenbrock function f from Section 1.4.3.
1. Write a second-order tangent-linear code and use it to accumulate 2 f with machine
accuracy. Compare the numerical results with those obtained by finite difference
approximation.
2. Write a second-order adjoint code in forward-over-reverse mode and use it to accumulate 2 f with machine accuracy. Compare the numerical results with those obtained
with the second-order tangent-linear approach.
3. Use dco to accumulate 2 f in second-order tangent-linear and adjoint modes with
machine accuracy. Compare the numerical results with those obtained from the handwritten derivative code.
4. Use the Newton algorithm and a corresponding matrix-free implementation based
on the CG method for the solution of the Newton system to minimize the extended
Rosenbrock function for different start values of your own choice.
5. Compare the run times for the various approaches to computing the required derivatives as well as the run times of the optimization algorithms for increasing values
of n.
3.5. Exercises
145
Chapter 4
The following chapter serves as the basis for a one-semester lab on derivative code compilers.
Through inclusion of this material, we hope to give the interested reader some insight
into the automatization of derivative code generation as introduced in Chapters 2 and 3.
Readers without interest in technical issues related to derivative code compilers may skip
this chapter and proceed to Chapter 5, where the result in the form of a fully functional
prototype derivative code compiler for a small subset of C++ is presented.
Typically, the purpose of a compiler front-end is twofold: The given source code is
verified syntactically, that is, the front-end checks whether the input is a valid sequence of
words from the given programming language. Moreover, the source is transformed from
a pure sequence of characters into a structured representation that captures the syntactic
properties of the program. A basic internal representation consists of an abstract syntax (or
parse) tree and a symbol table. Various extensions and modifications are used in practice.
Semantic tests are needed to verify the correctness of the given source code completely. In
this book we assume that any input program is correct, both syntactically and semantically.
In any case, a native compiler for the source language is required in order to be able to process
the generated derivative code. The user of our basic derivative code compiler is expected
to have validated the inputs syntax and semantics with the help of the native compiler.
Hence, we use the compiler front-end simply as a transformation engine delivering an
abstract intermediate representation of the input that is then used for semantic modification.
No semantic analysis is performed.
We start this chapter with a brief overview of the basic structure of a derivative code
compiler in Section 4.1. Fundamental terminology required for lexical and syntax analysis is
introduced in Section 4.2. Lexical analysis and its implementation by using the scanner generator flex is covered in Section 4.3 followed by syntax analysis and its implementation
using the parser generator bison in Section 4.4. The logic behind parse tree construction
algorithms is exploited for the single-pass syntax-directed compilation of derivative code
in Section 4.5. The advantage of multipass compilation algorithms is the ability to annotate
the intermediate representation with information obtained by static program analysis. Thus,
more complex language constructs can be analyzed semantically, and potentially more efficient derivative code can be generated. The foundations of multipass source transformation
in the form of an abstract intermediate representation are laid in Section 4.6.
147
148
Sequence of Tokens
Unparser
f2.c
Figure 4.1. Derivative code compiler.
4.1
Overview
The main stages of a simple derivative code compiler are shown in Figure 4.1. The characters
in a given input file (e.g. f.c) are grouped by the scanner into so-called tokens that are
defined by a regular grammar that guides the lexical analysis as described in Section 4.3.
For example, the simple product reduction
void f ( i n t n , double x , double & y ) {
i n t i =0;
w h i l e ( i <n )
i f ( i ==0)
y=x [ 0 ] ;
else
y=y x [ i ] ;
i = i +1;
}
becomes
VOID N ( INT N , DBL N , DBL & N ) {
INT N = C ;
WHILE ( N LT N )
I F ( N EQ C )
4.1. Overview
149
sub
args
decls
while
==
stats
<
stats
if
stats
stats
x[0]
x[i]
N = N [ C ] ;
ELSE
N = N N [ N ] ;
N = N + C ;
}
150
2), types (double 1 or integer 2), and shapes (scalar 1 or vector 2) of symbols.8
For example,
Name
f
n
x
y
i
Kind
1
2
2
2
2
Type
0
2
1
1
2
Shape
0
1
2
1
1
Undefined properties are marked with 0. Access to symbol information through the parse
tree can be implemented in the leaf nodes as pointers to the corresponding symbol table
entries. Syntax analysis techniques are discussed in Section 4.4.
Syntax analysis is driven by a grammar. Instead of building a parse tree, a single-pass
derivative code compiler can write derivative code as it parses the input. This syntax-directed
approach to the semantic transformation of computer programs is discussed in Section 4.5
and is not generally applicable. It can be used to generate derivative code for the subset
of C++ considered in this book. For example, parsing of the arguments (construction of
the parse tree rooted by the args node in Figure 4.2) identifies both x and y as variables of
type double. When processing the assignment y=yx[i] (construction of the corresponding
subtree in Figure 4.2), the parser can immediately generate the tangent-linear assignment
t1_y=t1_yx[i]+yt1_x[ i ] by applying the well-known differentiation rules (here the product
rule) built on predefined prefixes (for example, t1_ ) to variable names for the directional
derivatives. This single-pass approach is illustrated in Figure 4.3 showing the relevant part
of the parse tree. We assume that the latter is built from the bottom up (following the enumeration of the vertices from 1 to 5) based, for example, on the following simplified set of
syntax (also production) rules:
(P 1)
(P 2)
(P 3)
(P 4)
(P 5)
a : v = e ;
e : e e
| v
v : N
| N [ N ]
These rules describe assignments that consist of a variable v on the left-hand side and
an expression e on the right-hand side. Expressions are defined recursively as products
of expressions or as single variables. Variables can be scalars (symbols) or elements of
vectors, where the index is assumed to be represented by a symbol. Although this grammar
is ambiguous (see Section 4.2), it is well-suited for the illustration of the syntax-directed
generation of tangent-linear code for the assignment y=yx[i] .
The identification of y and x[ i ] as variables gives the compiler access to the corresponding variable names through pointers into the symbol table. Matching variables to
hold the directional derivatives are generated by adding the predefined prefix t1_ to the
original variable name. The left-hand side of the assignment is represented by vertex 1.
Both y and x[ i ] are recognized as expressions according to production rule P 3, yielding
vertices 2 and 3. Reduction of the product of both expressions using rule P 2 introduces
8 Our example is based on the syntax and semantics accepted by version 0.9 of dcc. In particular, all subprograms are expected to have return type void. Pointer arguments are always interpreted as arrays.
4.1. Overview
151
(5) a : v = e ;
d1_y=d1_y*x[i]+y*d1_x[i];
y=y*x[i];
(1) v
d1_y
y
(4) e : e * e
d1_y*x[i]+y*d1_x[i]
y*x[i]
(2) e : v
d1_y
y
(3) e : v
d1_x[i]
x[i]
152
y=x[0];
i=i+1;
<-
y=y*x[i];
abstract interpretation [20] might give a more precise representation of the flow of control.
For example, the fact that y=yx[i] can never be the first assignment could be revealed. The
corresponding edge that emanates from the entry node could be removed.
Suppose that a tangent-linear code is to be constructed that computes directional
derivatives of the scalar output y with respect to the input vector x. Activity analysis will
mark both assignments to y as active as both left-hand sides depend on some component
of the independent input vector x. Moreover, they have an impact on y as the dependent
output of the subroutine f . The incrementation of i is found to be passive due to its missing
dependence on x. Consequently, the tangent-linear unparser modifies the signature of f
and it inserts tangent-linear assignments prior to the two assignments to y leading to the
following output:
void t 1 _ f ( i n t n , double x , double t1_x ,
double & y , double & t1_y ) {
i n t i =0;
w h i l e ( i <n )
i f ( i ==0) {
t1_y=t1_x [ 0 ] ;
y=x [ 0 ] ;
}
else {
t1_y = t1_y x [ i ]+ y t1_x [ i ] ;
y=y x [ i ] ;
}
i = i +1;
}
More substantial improvements of the generated derivative code can be expected in practice.
A detailed discussion of static program analysis methods specific to the domain of derivative
code generation is beyond the scope of this introductory text. Refer to [37, 38] for further
information on this topic. The tangent-linear and adjoint unparsers discussed in this book
operate directly on the intermediate representation. Further static program analysis is not
required.
4.2
153
Similar to human languages, the syntax of programming languages is defined through grammars over alphabets forming words, sometimes also referred to as sentences. An alphabet
is a finite, nonempty set of symbols. Examples are the binary alphabet = {0, 1}, the
alphabet containing all uppercase letters = {A, B, . . .}, or the set of all ASCII characters.
Words (strings) are finite sequences of symbols from an alphabet . The empty string has
zero occurrences of symbols from . For example, 010001111 is a binary word. Languages
"
i
are all L , where 0 , 1 = , 2 = , and so forth, and =
i=0 .
The set of all C++ programs forms a language; so does the set of all SL programs (see
Section 4.3) as a subset of all C++ programs.
Definition 4.1. A grammar G is a quadruple G = (Vt , Vn , s, P ) where
Vt is a finite set of terminal symbols (also: terminals);
Vn is a finite set of nonterminal symbols (also: nonterminals) such that Vt Vn = ;
s Vn is the start symbol;
P is a nonempty finite set of production rules (also: productions) of the form u v,
where u Vn and v (Vt Vn ) .
Definition 4.2. Consider a grammar G = (Vt , Vn , s, P ) and let V = Vt Vn . A word 2 = xvz
over V with x, z and v V can be derived from a word 1 = xuz over V with u Vn
if (u v) P . Derivation is denoted by 1 2 . The relation denotes the reflexive
and transitive closure of .
Any sequence of derivations can be represented by a tree, referred to as the abstract
syntax tree (AST) or parse tree. The root stands for the start symbol. The children of a
node in the tree correspond to the symbols on the right-hand side of the production used to
perform the respective derivation.
Definition 4.3. The language L(G) = { : s } that is generated by a grammar
G = (Vt , Vn , s, P ) contains all words that can be derived from the start symbol.
Example 4.4 Let G = (Vt , Vn , s, P ) with terminal symbols Vt = {W , O}, nonterminals
Vn = {a, b, c, d}, start symbol s = a, and production rules a W b, b Oc, b Ob,
c W d, d . A possible derivation of W OOOW L(G) is the following: a
W Ob W OOOW d W OOOW as a W b W Ob and W Ob W OOb
W OOOc W OOOW d.
Grammars are classified according to the Chomsky Hierarchy [15]. Four types of
grammars are distinguished; neither Type 0 (unrestricted) nor type 1 (context sensitive)
grammars play a significant role in classical compiler technology. Instead, we take a closer
look at type 2 and type 3 grammars. In type 2 or context-free grammars, all productions
have the form a v where a Vn and v (Vn Vt ) . Context-free grammars form the
basis for the parsing algorithms in Section 4.3. Lexical analysis is based on type 3 or regular
grammars. In a (right-linear) regular grammar, all productions have the form a T b or
a T or a , where a, b Vn , T Vt . The grammar in Example 4.4 is regular.
154
155
O
156
W
a
4.3
Lexical Analysis
Lexical analysis is performed by so-called scanners. It aims to group the sequence of input
characters into substrings that belong to logical groups or tokens. The patterns to be matched
by the scanner are described by regular grammars. The preferred way to specify regular
grammars is via regular expressions (REs).
Definition 4.9. REs are defined recursively as follows:
is a RE.
is a RE.
is a RE.
If a and b are REs, then a|b is the RE that denotes the union L(a) L(b) {w|w
L(a) w L(b)}, where L(a) denotes the language defined by the RE a.
If a and b are REs, then ab is the RE that denotes the concatenation L(a)L(b)
{vw|v L(a) w L(b)}.
"
i
If a is a RE, then a is the RE that denotes the Kleene closure L
i=0 L .
If a is a RE, then (a) is a RE that denotes the same language.
Examples for REs are 0 |1 or (01) |(10) . The scanner generator flex (see Section 4.3.4)
uses an extended set of regular expressions. For example, a + aa , a? |a, a{n}
(n times)
(n times)
a a, and a{n, m} a a | | a
flex manual9 for further information.
9 http://flex.sourceforge.net/manual/
m times
157
v
w
v
v
v|w
v
v
4.3.1
From RE to NFA
An NFA can be constructed for any set of REs. Figure 4.8 shows some of the basic building
blocks of the recursive construction algorithm that is due to Thompson [57]. Automata
that recognize the grammar symbols in a given regular expression consist of two nodes
(local start and final states) connected by an arc that is labeled with the respective symbol.
Concatenation vw is represented by connecting the local final state of the automaton that
recognizes v with the local start state of the automaton for w via an -arc. Union and
Kleene closure are constructed as shown in Figure 4.8. Unlabeled arcs are -arcs. See also
Example 4.10.
4.3.2
Nondeterministic automata are often not the best choice for language recognition as they
may require backtracking in order to try all possible sequences of transitions. A deterministic
approach promises to be superior in most cases. Hence, we are looking for a method that
allows us to transform a given NFA into a DFA. The corresponding subset construction
algorithm [2] may result in a DFA whose size is exponential in that of the original NFA. In
this case a backtracking algorithm based on the latter may be the preferred method. Most
of the time, subset construction yields useful results.
We consider the construction of a DFA Ad = (Qd , , d , q0d , F d ) from a NFA An =
(Qn , , n , q0n , F n ) by the subset construction algorithm. Let -closure(q) denote all states
reachable from q by in An . Let move (q, ) be the set of all states reachable from q by
in An . The algorithm proceeds as follows:
q0d := -closure(q0n )
Qd := q0d
qid Qd :
qjd := -closure(move(qid , ))
Qd = Qd qjd
d = d ((qid , ) qjd )
The subset construction algorithm is a fixed-point computation. It terminates as soon as both
Qd and d have reached their maximum sizes. Termination follows immediately from the
fact that the number of distinct subsets of the finite set Qn is finite. Moreover, the number
158
1
9
8
5
12
11
10
NFA
3
{12, 4, 5}
{10}
{11, 10, 8, 6, 7}
{9, 10, 8, 6, 7}
v
{12, 4, 5}
{10}
{11, 10, 8, 6, 7}
{9, 10, 8, 6, 7}
{9, 10, 8, 6, 7}
{9, 10, 8, 6, 7}
{9, 10, 8, 6, 7}
of edges carrying distinct labels that are drawn from the finite set must be finite for any
source state in Qd .
Example 4.10 The NFA obtained by Thompsons construction applied to the regular expression v(0|1(0|1)*) is shown in Figure 4.9. The numbering of the states corresponds
to that used by the scanner generator flex; see Section 4.3.4. Section 4.5.3 shows only
the relevant part of the NFA that is generated by flex. Subset construction yields the
transitions in Table 4.1. The corresponding DFA is shown in Figure 4.10. Its unique start
state is the -closure of the start state of the NFA (3). Three final (also accepting) states (7,
8, and 9) are induced by the final state of the NFA (10).
In some cases, NFAs can be more expressive than their deterministic counterparts.
Refer to [40] for further details.
4.3.3
Minimization of DFA
There is always a unique minimum state DFA for any regular language. It is constructed
recursively by verifying if states are distinguishable by transitions on certain input symbols.
In a first step, all accepting states are separated from the non-accepting states and the error
state yielding three initial groups. A state is distinguishable from another state if on input
of some symbol the transitions lead into different groups. Otherwise, the two states are
159
0
1
1
0|1
9
0|1
v
{6}
{7}
{8, 9}
{8, 9}
{8, 9}
The start state of the minimal DFA is the group that contains the original start state, that is,
{1}. The accepting states are those groups that contain at least one accepting state from the
original DFA, that is {7} and {8, 9}. A graphical representation of the minimal DFA is shown
in Figure 4.11.
160
0
{1}
{6}
{7}
1
{8, 9}
0|1
4.3.4 flex
We start this section with a quote from the flex manual:10 flex is a tool for generating
scanners. A scanner is a program which recognizes lexical patterns in text. The flex
program reads the given input files, or its standard input if no file names are given, for
a description of a scanner to generate. The description is in the form of pairs of regular
expressions and C code, called rules. flex generates as output a C source file, lex.yy.c
by default, which defines a routine yylex(). This file can be compiled and linked with
the flex run time library to produce an executable. When the executable is run, it analyzes
its input for occurrences of the regular expressions. Whenever it finds one, it executes the
corresponding C code.
Listing 4.1 shows a flex input file for the RE v(0|1(0|1)*). The file consists
of three sections separated by %%. REs are used to define tokens in the first section. The
second section contains rules for actions to be taken when matching tokens. In our simple
example, we read strings (their respective ends marked by \ n) from standard input for
as long as they match the definition of variables given by the regular expression in the first
section of the flex input file. We stop the scanner as soon as a string that is not a variable
is typed in. The third section contains user-defined routinesjust the main function in this
example.
Listing 4.1. flex input file.
variable
v ( 0 | 1 ( 0 | 1 ) )
%%
{ variable }
.
{ }
{ return 0 ; }
%%
i n t main ( )
{
yylex ( ) ;
return 0 ;
}
10 http://flex.sourceforge.net/manual/
161
flex generates an NFA that contains the NFA in Figure 4.9. The corresponding DFA
contains the DFA in Figure 4.10. Running flex with the -T option generates diagnostic
output containing information on both automata. As an example, we consider the diagnostic
output generated for Listing 4.1. It starts with an enumeration of the REs that describe the
tokens to be recognized by the scanner (here only rule 1) in addition to the remaining
single-character tokens (rule 2) and the special end (of string) marker \ n (rule 3).
1
2
3
( v ( 0 | 1 ( 0 | 1 ) ) )
.
End Marker
Note that the scanner generated by flex recognizes arbitrary input. In the worst case, single
characters are matched individually by using rule 2. Potentially desired error handling needs
to be implemented by the user in the form of appropriate actions.
The relevant part of the NFA that corresponds to the three rules is the following:
...
state
state
state
state
state
state
state
state
state
state
state
state
state
state
state
state
state
b e g i n n i n g dump o f n f a w i t h s t a r t
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
#
end
3
118:
4
48:
5
49:
6
48:
7
49:
8
257:
9
257:
10
257:
11
257:
12
257:
13
257:
14
1:
15
257:
16
257:
17
2:
18
257:
19
257:
o f dump
12 ,
10 ,
11 ,
9,
9,
6,
8,
0,
8,
4,
1,
15 ,
0,
13 ,
18 ,
0,
16 ,
0
0
0
0
0
7
10
0
10
5
3
0
0
14
0
0
17
s t a t e 19
[1]
[2]
[3]
Each state is followed by an integer that encodes an input symbol. The next two columns
contain the corresponding successor states. For example, in state 3, the input of v (ASCII
code 118) leads to state 12. Zeros denote undefined states. From the start state 19, two
-transitions (encoded as 257) lead into states 16 and 17, respectively. Reading the end
marker (encoded as -2) in state 17 yields a transition into the accepting state 18. Acceptance
is indicated by the number of the matched REs given in square brackets. All single character
tokens except for v (this set is represented by -1) are accepted by state 15 that is reached
from state 16 via state 14. Further -transitions take the NFA from state 16 via state 13 to
state 3. Input of v (ASCII 118) leads to state 12 and hence, to states 4 and 5 via respective
-transitions. Acceptance of v0 is represented by the transition from state 4 to state 10 on
input of 0 (ASCII 48). Closer inspection of states 511 and of the corresponding transitions
identifies the spanned sub-NFAas one that accepts strings described by the regular expression
v(0|1(0|1)*).
162
Subset construction yields the DFAthat is listed in the diagnostic output (edited slightly
for brevity) as follows:
s t a t e # 1:
1
4
2
5
3
4
4
4
5
6
s t a t e # 4:
s t a t e # 5:
s t a t e # 6:
3
7
4
8
s t a t e # 7:
s t a t e # 8:
3
9
4
9
s t a t e # 9:
3
9
4
9
state # 4 accepts :
state # 5 accepts :
state # 6 accepts :
state # 7 accepts :
state # 8 accepts :
state # 9 accepts :
[2]
[3]
[2]
[1]
[1]
[1]
All 256 of the 8-bit characters are grouped into equivalence classes used to specify the
transitions in the DFA. For example, \ n=2, 0 =3, 1 =4, v=5, and the group of
the remaining 252 characters is encoded as 1. Reading v in start state 1 yields a transition
to state 6. From there, 0 and 1 take the DFA to states 7 and 8, respectively. State 7
accepts v0 as a special case of a token described by the first RE v (0|1(0|1) ). State 8
accepts v1. Further instances of 0 and 1 lead from state 8 to state 9 followed
by leaving the DFA in state 9. The latter accepts all strings that consist of at least three
characters and that can be derived from the RE v (0|1(0|1) ). This part of the DFA is shown
in Figure 4.10. States that accept \n (state 5) and all remaining single character tokens
(states 4 and 6) are not shown in Figure 4.10. For example, if the characters read in state
1 are either 0 , 1 , or any of the characters in equivalence class 1, then acceptance is
due to the second rule in state 4. Moreover, if v is read and is not followed by 0 or
1 , then it is accepted as a single character token in state 6.
4.4
Syntax Analysis
163
2. An intermediate representation (IR) of the program is built (see Section 4.6). The IR is
used for static program analysis as well as for semantic transformation and unparsing,
that is, the generation of the desired output.

Numerous semantic tests are performed by standard compilers including, for example,
type checking. In this book, we assume that all input programs are guaranteed to be both
syntactically and semantically correct. Users of derivative code compilers must have access
to native compilers, which can be used to verify the semantic soundness of a syntactically
correct input program. The derivative code compiler front-end can thus be kept simpler.
The emphasis is put on the domain-specific issues instead of well-known and widely studied
standard compiler problems. The development of a robust compiler for a modern programming
language such as C or C++ is a highly complex and challenging project that is clearly
beyond the scope of this book. Our intention is not to present an in-depth discussion of
syntax analysis techniques, which would merely repeat material that has been available in the
literature for many years; see, for example, [2]. Instead, we focus on an intuitive description
of the fundamental concepts based on examples. Our goal is to provide the essential
amount of understanding that enables the reader to follow the ideas behind the generation
of derivative code by semantic source code transformation.
We focus on a very small subset of C++ that is still rich enough to capture the fundamental
issues in derivative code compiler development and use. For example, version 0.9 of
dcc (see Chapter 5) accepts input consisting of several subroutines containing branches and
loops and with arithmetic performed on scalars as well as on multi-dimensional arrays. For
the sake of brevity and conciseness of the following discussions of various parsing algorithms,
it turns out to be advantageous to impose further syntactic restrictions. We consider
variants of a Simple Language (SL). SL programs consist of (possibly nested) sequences
of assignments, loops, and branches. All variables are assumed to be floating-point scalars.
Arithmetic is performed using linear (e.g., +), nonlinear (e.g., *), and relational (e.g., <)
operators and intrinsic functions (e.g., sin). An example of an SL program is the following:
if (x<y) {
  x=x*y;
  while (y<x) {
    x=sin(x+y*3);
  }
}
nonterminal symbols

Vn = { s (sequence of statements), a (assignment), e (expression) },

terminal symbols

Vt = { V (program variables), C (constants), F (unary intrinsic),
       L (linear operator), N (nonlinear operator), ; ( ) = },

and productions

P = { (P 1) s : a,   (P 2) s : as,   (P 3) a : V = e;,
      (P 4) e : eLe, (P 5) e : eN e, (P 6) e : F (e),
      (P 7) e : V,   (P 8) e : C }.
Note that G has been made ambiguous for the purpose of illustrating certain fundamental
aspects of syntax analysis. For example, the word V = V N V LV ; has two feasible
right-most derivations

s ⇒ a ⇒ V = e; ⇒ V = eLe; ⇒ V = eLV ; ⇒ V = eN eLV ; ⇒ V = eN V LV ; ⇒ V = V N V LV ;

and

s ⇒ a ⇒ V = e; ⇒ V = eN e; ⇒ V = eN eLe; ⇒ V = eN eLV ; ⇒ V = eN V LV ; ⇒ V = V N V LV ;

as previously discussed. Moreover, the missing handling of operator precedence may result
in numerically incorrect code. These issues will be dealt with. Let us first recall some
classical work in the field of syntax analysis, which every developer of (derivative code)
compilers should be familiar with.
All context-free grammars can be converted into Chomsky normal form [15], where
all productions have one of the following formats:

a : bc
a : A
s : ε
The cubic (in the length of the input word) computational complexity is paid for with a quadratic memory
requirement. The word is verified as an element of the language generated by the grammar
G. It is successfully derived from the start symbol e. Ambiguity of G results in two distinct
derivations marked with overset, resp., underset, bars in Figure 4.12. The corresponding
alternatives are

e ⇒ ce ⇒ cV ⇒ elV ⇒ eLV ⇒ deLV ⇒ dV LV ⇒ enV LV ⇒ eN V LV ⇒ V N V LV

and

e ⇒ de ⇒ dce ⇒ dcV ⇒ delV ⇒ deLV ⇒ dV LV ⇒ enV LV ⇒ eN V LV ⇒ V N V LV .
represent valid derivations. The former results from the greedy approach that is used, for
example, in ANTLR to resolve nondeterminism. A lookahead of length one identifies L as
the next token to be read. The algorithm picks the corresponding production rule t : Let. In
general, a lookahead of length k may be required to make this choice unique.
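The tabulation behind the membership verification discussed above can be made concrete. The following C++ sketch implements a CYK-style recognizer for an arbitrary grammar in Chomsky normal form; the rule encoding and all identifiers are illustrative assumptions:

#include <cstddef>
#include <string>
#include <vector>
#include <set>

struct cnf_rule { char lhs; std::string rhs; }; // rhs: two nonterminals "bc" or one terminal "A"

bool cyk(const std::vector<cnf_rule>& P, char start, const std::string& w) {
  size_t n = w.size();
  if (n == 0) return false; // the empty word requires the special rule s : epsilon
  // T[i][j]: nonterminals deriving the subword of length j+1 starting at position i
  std::vector<std::vector<std::set<char> > > T(n, std::vector<std::set<char> >(n));
  for (size_t i = 0; i < n; i++)                       // subwords of length 1
    for (size_t r = 0; r < P.size(); r++)
      if (P[r].rhs.size() == 1 && P[r].rhs[0] == w[i]) T[i][0].insert(P[r].lhs);
  for (size_t j = 1; j < n; j++)                       // subwords of length j+1
    for (size_t i = 0; i + j < n; i++)
      for (size_t k = 0; k < j; k++)                   // split position
        for (size_t r = 0; r < P.size(); r++)
          if (P[r].rhs.size() == 2
              && T[i][k].count(P[r].rhs[0])
              && T[i + k + 1][j - k - 1].count(P[r].rhs[1]))
            T[i][j].insert(P[r].lhs);
  return T[0][n - 1].count(start) > 0;                 // start symbol derives w?
}

The triply nested loop over j, i, and k accounts for the cubic run time, while the table T realizes the quadratic memory requirement.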
4.4.2 Bottom-Up Parsing
Conceptually, the term bottom-up parsing refers to the construction of parse trees for a
sequence of tokens starting from the leaves and working up toward the root. We consider
shift-reduce parsers as a general approach to bottom-up parsing. The key decisions are
about when to reduce and what production to apply. A reduction is defined as the reverse
of a step in the derivation. A handle is a substring that matches the right-hand side of some
production.

A basic shift-reduce parser uses a stack to hold grammar symbols, and it reads from
an input buffer (left to right) holding the string to be parsed. The stack is empty initially.
Symbols are read and pushed onto the stack (shift S) until the top-most symbols on the
stack form a handle. The handle is replaced by the left-hand side of the corresponding
production (reduce R). This process is repeated iteratively until the string has been parsed
successfully (ACCEPT) or until an error has occurred (ERROR). A successful parsing
procedure is characterized by an empty input buffer and a stack that contains only the start
symbol. It can be shown that, for any right-most derivation, handles always appear on top of
the stack, never inside. Shift-reduce parsing may lead to conflicts where the next action is
not determined uniquely. We distinguish between shift-reduce and reduce-reduce conflicts.
Example 4.14 Consider the same context-free grammar as in Example 4.13. A shift-reduce
parser processes the string F (F (V )) by shifting tokens from left to right onto a stack with
reductions performed for handles occurring on top of the stack:
     STACK       INPUT        ACTION
 1               F (F (V ))   S
 2   F           (F (V ))     S
 3   F (         F (V ))      S
 4   F (F        (V ))        S
 5   F (F (      V ))         S
 6   F (F (V     ))           R (e : V )
 7   F (F (e     ))           S
 8   F (F (e)    )            R (e : F (e))
 9   F (e        )            S
10   F (e)                    R (e : F (e))
11   e                        ACCEPT
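The mechanics of this trace are easy to replicate by hand. The following C++ sketch hard-codes a shift-reduce recognizer for the two productions e : F (e) and e : V used here; all identifiers are ours, and the eager reduction policy works only because this tiny grammar is conflict-free:

#include <cstdio>
#include <cstring>

bool parse(const char* input) {
  char stack[256]; int top = 0;
  size_t pos = 0, len = strlen(input);
  while (true) {
    if (top >= 1 && stack[top - 1] == 'V') {                 // handle: e : V
      stack[top - 1] = 'e';
    } else if (top >= 4 && stack[top - 4] == 'F' && stack[top - 3] == '('
               && stack[top - 2] == 'e' && stack[top - 1] == ')') {
      top -= 4; stack[top++] = 'e';                          // handle: e : F(e)
    } else if (pos < len && top < 256) {                     // shift
      stack[top++] = input[pos++];
    } else break;                                            // no action possible
  }
  return top == 1 && stack[0] == 'e';                        // ACCEPT
}

int main() { printf("%d\n", parse("F(F(V))")); return 0; }   // prints 1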
    STACK    INPUT    ACTION
1            V = V ;  S
2   V        = V ;    S (or R (e : V )?)
3   V =      V ;      S
4   V = V    ;        R (e : V ) (or S?)
5   V = e    ;        S
6   V = e;            R (a : V = e;)
7   a                 ACCEPT
The lookahead token = is feasible in line 2 as it follows V in the first production rule.
Hence, a shift is performed instead of the potential reduction. Formal approaches to making
this decision are discussed in the remainder of this section. Similarly, the lookahead token ;
is not feasible in line 4, yielding the reduction based on e : V .

Reduce-reduce conflicts can occur in languages such as Fortran, where a(i) can denote
both an array access and a univariate function call. Additional information on the kind of
tokens is required to resolve these conflicts.
Vt = { V (program variables), C (constants), F (unary intrinsic), L, N , ; ( ) = },

and productions

P = { (P 1) s : a,   (P 2) s : as,   (P 3) a : V = e;,
      (P 4) e : eLt, (P 5) e : t,    (P 6) t : tNf,   (P 7) t : f,
      (P 8) f : F (e), (P 9) f : V,  (P 10) f : C }.
Closure(I ) = { f : F (.e),  e : .eLt | .t,  t : .tNf | .f,  f : .F (e) | .V | .C }.
As a first step, the initial configurations of both production rules for e are added to f : F (.e).
We use the more compact notation e : .eLt | .t instead of {e : .eLt, e : .t}. The nonterminal
symbol t can be obtained by reduction using the two production rules t : tNf and t : f . Hence,
both initial configurations are added to Closure(I ). Similarly, the initial configurations of the
three production rules for f are added, which completes the closure operation. Termination
follows immediately from the finite number of configurations over the finite set of production
rules. The closure operation yields all feasible paths to the current state of the parser. For
example, if the current state is defined by the closure of the configuration f : F (.e), then
anything but a succeeding reduction to e results in a syntax error. This reduction can be
obtained by reducing eLt or t to e. Recursively, this reduction must be preceded by reductions
to e or t, and so forth. The closure operation thus captures all possible terminal symbols that
are allowed to be read in the current state of the parser, for example, F , V , or C.
The configurating sets (together with a dedicated error state) define the vertices of the
characteristic automaton (also: LR(0) automaton)

A^{LR(0)} = (V^{LR(0)}, E^{LR(0)}).

The labeled edges (also: transitions) are defined as

E^{LR(0)} = {((i, j), v) : [a : α.vβ] ∈ i and [a : αv.β] ∈ j},

where a ∈ Vn, α, β ∈ (Vn ∪ Vt)*, and v ∈ Vn ∪ Vt. A stack is used to store the history of
transitions of the characteristic automaton. To illustrate this procedure, let the characteristic
automaton be in a state j defined by the corresponding configurating set. Reading a new
terminal symbol B from the input results in the forward transition to a state k defined as
the closure of a configuration [a : αB.β] ∈ k obtained from [a : α.Bβ] ∈ j. The index j of
this shift state is pushed onto the stack. If [a : α.Bβ] ∉ j, then the transition leads into the
dedicated error state and a syntax error is reported.

A reduce state j contains a final configuration b : β., where the dot appears at the end
of the right-hand side of the production rule. When the parser reaches state j, a reduction of
β to b is performed unless j contains another configuration a : α.Aβ′ and the next terminal
symbol to be read from the input (the lookahead) is A. SLR parsers perform the shift
operation in this case. Reduction yields a backward transition to a state k, where k is the |β|th
element on the stack and where |β| denotes the length of the string β, that is, the number of
symbols in β. The part of the parsing history that yielded the handle becomes obsolete.
Hence, the top |β| elements can be removed from the stack. A forward transition from k to
k′ follows if k contains a configuration [a : α.bβ]; a syntax error is reported otherwise.
Conflicts that cannot be resolved using this technique identify SLR parsers as infeasible for
the recognition of the language L(G) that is generated by the given grammar G.
The characteristic automaton of SL2 is shown in Figure 4.13. An auxiliary production
rule

(P 0) $accept : s$end

is introduced to mark the end of the input to be parsed with $end. The characteristic
automaton becomes A^{LR(0)} = (V^{LR(0)}, E^{LR(0)}) with states
V^{LR(0)} = {
  0 : [ $accept : .s$end    s : .a | .as    a : .V = e; ],
  1 : [ a : V. = e; ],
  2 : [ $accept : s.$end ],
  3 : [ s : a. | a.s    s : .a | .as    a : .V = e; ],
  4 : [ a : V = .e;    e : .eLt | .t    t : .tNf | .f    f : .F (e) | .V | .C ],
  5 : [ $accept : s$end. ],
  6 : [ s : as. ],
  7 : [ f : V. ],
  8 : [ f : C. ],
  9 : [ f : F.(e) ],
 10 : [ a : V = e.;    e : e.Lt ],
 11 : [ e : t.    t : t.Nf ],
 12 : [ t : f. ],
 13 : [ f : F (.e)    e : .eLt | .t    t : .tNf | .f    f : .F (e) | .V | .C ],
 14 : [ e : eL.t    t : .tNf | .f    f : .F (e) | .V | .C ],
 15 : [ a : V = e;. ],
 16 : [ t : tN.f    f : .F (e) | .V | .C ],
 17 : [ f : F (e.)    e : e.Lt ],
 18 : [ e : eLt.    t : t.Nf ],
 19 : [ t : tNf. ],
 20 : [ f : F (e). ]
}

(Figure 4.13 depicts this automaton graphically: each state is drawn with its configurating set, and the edges carry the grammar symbols of the transitions listed below.)
and transitions

E^{LR(0)} = {
[0, 1, V ], [0, 2, s], [0, 3, a], [1, 4, =], [2, 5, $end], [3, 1, V ], [3, 3, a], [3, 6, s],
[4, 7, V ], [4, 8, C], [4, 9, F ], [4, 10, e], [4, 11, t], [4, 12, f ], [9, 13, (], [10, 14, L],
[10, 15, ;], [11, 16, N ], [13, 7, V ], [13, 8, C], [13, 9, F ], [13, 11, t], [13, 12, f ],
[13, 17, e], [14, 7, V ], [14, 8, C], [14, 9, F ], [14, 12, f ], [14, 18, t], [16, 7, V ],
[16, 8, C], [16, 9, F ], [16, 19, f ], [17, 14, L], [17, 20, )], [18, 16, N ]
}.
The presence of shift-reduce conflicts indicates that SL2 cannot be parsed without taking
lookahead into account. For example, when reaching state 18, it is unclear whether to reduce
using production rule P 4 or whether to shift. Our SLR parser uses the Follow sets of the
nonterminal symbols (the set of terminal symbols that can follow the nonterminal symbol
in some derivation) to make the decision about the next action deterministic. Shift-reduce
conflicts are resolved by shifting whenever there is an outgoing edge that is labeled with the
next input symbol and whose target is not the error state. A shift is performed in state 18 if the
lookahead is N . Otherwise, the handle eLt is reduced to e using production rule P 4.
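For SL2 these sets can be read off the productions directly: Follow(e) = {L, ), ;} (e is followed by L in e : eLt, by ) in f : F (e), and by ; in a : V = e;), Follow(t) = Follow(e) ∪ {N } (from e : eLt, e : t, and t : tNf ), and Follow(f ) = Follow(t). In state 18, the reduction by P 4 is therefore performed exactly for the lookaheads L, ), and ;.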
SLR parsing of the input string V = F (V N C); is illustrated in Table 4.2. Initially,
the stack is empty in state 0 and the first token is read (shifted). Acceptance of the given
string is obtained after a total of 32 steps. For example, line 5 in Table 4.2 shows the parser
in state 13 after reading the first four tokens V = F (. State 13 is a shift state. The next
token (V ) is read from the input yielding the transition into state 7 while 13 is pushed onto
the stack. State 7 is a reduce state. Production rule P 9 is used to replace V by f followed
by a backward transition into state 13. With the length of V being equal to one, only the
top element 13 is popped from the stack. The following transition on f is from state 13 to
state 12. Similar arguments yield the remaining entries in Table 4.2. The shift-reduce conflict
Table 4.2. SLR parsing of V = F (V N C);.

     STACK                STATE  PARSED        INPUT            ACTION
 1                            0                V = F (V N C);   S
 2   0                        1  V             = F (V N C);     S
 3   0,1                      4  V =           F (V N C);       S
 4   0,1,4                    9  V = F         (V N C);         S
 5   0,1,4,9                 13  V = F (       V N C);          S
 6   0,1,4,9,13               7  V = F (V      N C);            R(P 9)
 7   0,1,4,9                 13  V = F (f      N C);            S
 8   0,1,4,9,13              12  V = F (f      N C);            R(P 7)
 9   0,1,4,9                 13  V = F (t      N C);            S
10   0,1,4,9,13              11  V = F (t      N C);            S
11   0,1,4,9,13,11           16  V = F (tN     C);              S
12   0,1,4,9,13,11,16         8  V = F (tN C   );               R(P 10)
13   0,1,4,9,13,11           16  V = F (tNf    );               S
14   0,1,4,9,13,11,16        19  V = F (tNf    );               R(P 6)
15   0,1,4,9                 13  V = F (t      );               S
16   0,1,4,9,13              11  V = F (t      );               R(P 5)
17   0,1,4,9                 13  V = F (e      );               S
18   0,1,4,9,13              17  V = F (e      );               S
19   0,1,4,9,13,17           20  V = F (e)     ;                R(P 8)
20   0,1                      4  V = f         ;                S
21   0,1,4                   12  V = f         ;                R(P 7)
22   0,1                      4  V = t         ;                S
23   0,1,4                   11  V = t         ;                R(P 5)
24   0,1                      4  V = e         ;                S
25   0,1,4                   10  V = e         ;                S
26   0,1,4,10                15  V = e;                         R(P 3)
27                            0  a                              S
28   0                        3  a                              R(P 1)
29                            0  s                              S
30   0                        2  s                              S
31   0,2                      5  s$end                          R(P 0)
32                            0  $accept                        ACCEPT
in state 11 is resolved in favor of shifting whenever the next token is N . Consequently, a
shift is performed in line 10 of Table 4.2, whereas reductions take place in lines 16 and 23.
nonterminal symbols

Vn := Vn ∪ { b (branch statement), l (loop statement), r (result of relational operator) },

terminal symbols

Vt := Vt ∪ { IF (branch keyword), WHILE (loop keyword), R (relational operator), { } },

and productions

P := P ∪ { (P 1a) s : b,  (P 2a) s : bs,  (P 1b) s : l,  (P 2b) s : ls,
           (P 9) b : IF (r){s},  (P 10) l : WHILE(r){s},  (P 11) r : V RV }.
4.4.5
We use flex and bison to implement a parser for SL2 programs. For the sake of brevity,
variables and constants are restricted to lowercase letters and single digits, respectively.
The flex source file is shown in Listing 4.2. Whitespaces are ignored (line 5). Intrinsic
functions (line 12), linear operators (line 13), and nonlinear operators (line 14) are represented
by a single instance each. The named tokens (F, L, N, V, and C) to be returned to
the parser are encoded as integers in the file parser.tab.h. Its inclusion into lex.yy.c prior
to any other automatically generated code is triggered by the corresponding preprocessor
directive at the beginning of the first section of the flex input file (lines 1–3). The remaining
unnamed single character tokens are simply forwarded to the parser (line 17). A special
lexinit routine is provided to set the pointer yyin to the given source file (line 21).
[Diagram: the characteristic automaton for the grammar of Definition 4.12, with the ambiguous
expression configurations e : .eLe | .eN e | .F (e) | .V | .C in place of the factored e/t/f
configurations above; its states 0–18 are the ones referenced in Table 4.3 below.]
Table 4.3. SLR parsing of V = F (V N C); with the ambiguous expression grammar.

     STACK                STATE  PARSED       INPUT            ACTION
 1                            0               V = F (V N C);   S
 2   0                        1  V            = F (V N C);     S
 3   0,1                      4  V =          F (V N C);       S
 4   0,1,4                    9  V = F        (V N C);         S
 5   0,1,4,9                 11  V = F (      V N C);          S
 6   0,1,4,9,11               7  V = F (V     N C);            R(P 7)
 7   0,1,4,9                 11  V = F (e     N C);            S
 8   0,1,4,9,11              15  V = F (e     N C);            S
 9   0,1,4,9,11,15           13  V = F (eN    C);              S
10   0,1,4,9,11,15,13         8  V = F (eN C  );               R(P 8)
11   0,1,4,9,11,15           13  V = F (eN e  );               S
12   0,1,4,9,11,15,13        17  V = F (eN e  );               R(P 5)
13   0,1,4,9                 11  V = F (e     );               S
14   0,1,4,9,11              15  V = F (e     );               S
15   0,1,4,9,11,15           18  V = F (e)    ;                R(P 6)
16   0,1                      4  V = e        ;                S
17   0,1,4                   10  V = e        ;                S
18   0,1,4,10                14  V = e;                        R(P 3)
19                            0  a                             S
20   0                        3  a                             R(P 1)
21                            0  s                             S
22   0                        2  s                             S
23   0,2                      5  s$end                         R(P 0)
24                            0  $accept                       ACCEPT

Listing 4.2. The flex input file.

 1 %{
 2 # include "parser.tab.h"
 3 %}
 4
 5 whitespace  [ \t\n]+
 6 variable    [a-z]
 7 constant    [0-9]
 8
 9 %%
10
11 {whitespace}  { }
12 "sin"         { return F; }
13 "+"           { return L; }
14 "*"           { return N; }
15 {variable}    { return V; }
16 {constant}    { return C; }
17 .             { return yytext[0]; }
18
19 %%
20
21 void lexinit(FILE *source) { yyin = source; }
Similar to the flex input file, the bison input file consists of three sections separated by %% that contain definitions (e.g., of tokens and rules for resolving associativity and
operator precedence; see Listing 4.3), production rules (Listing 4.4), and user-defined routines (Listing 4.5), respectively. The five named tokens are defined in line 1 of Listing 4.3.
Lines 3 and 4 set nonlinear operators to precede linear ones. Associativity is resolved by
generating locally left-most parse trees.
Listing 4.3. First section of the bison input file.
1 %token V C F L N
2
3 %left L
4 %left N
5
6 %%
7 ...
Unnamed single character tokens are enclosed within single quotes inside the production
rules.
Listing 4.4. Second section of the bison input file.
...
s : a
  | a s
  ;
a : V '=' e ';' ;
e : e L e
  | e N e
  | F '(' e ')'
  | V
  | C
  ;
%%
...
Two user-defined routines are provided. The basic error handler in line 5 of Listing 4.5
simply prints the error message that is generated by the parser to standard output. Inside the
main routine, the source file is opened for read-only access (line 9) and the corresponding
FILE pointer is passed on to the scanner (line 10). The parser itself is started by calling
yyparse () in line 11. It calls the scanner routine yylex () to get the next token as required.
Finally, the source file is closed (line 12).
Listing 4.5. Third section of the bison input file.

 1 ...
 2
 3 # include <stdio.h>
 4
 5 int yyerror(char *msg) { printf("%s\n", msg); return -1; }
 6
 7 int main(int argc, char *argv[])
 8 {
 9   FILE *source_file = fopen(argv[1], "r");
10   lexinit(source_file);
11   yyparse();
12   fclose(source_file);
13   return 0;
14 }
The dependencies within the build process are best illustrated with the following makefile
[46]:

parse : lex.yy.c parser.tab.c
	gcc parser.tab.c lex.yy.c -lfl -o parse

parser.tab.c : parser.y
	bison -d parser.y

lex.yy.c : scanner.l
	flex scanner.l
The executable parse is built from the two C-files with default names lex.yy.c and
parser.tab.c that are generated by flex and bison, respectively. Running bison with the -d
option yields the generation of parser.tab.h that contains all token definitions to be included
into lex.yy.c as described. The object code is linked with the flex run-time support library
(-lfl).

Similar to flex, the parser generator bison can generate diagnostic information.
When run as bison -v parser.y, the diagnostic output is written into a file named
parser.output. Its contents start with a summary of the underlying augmented grammar
followed by information on terminal and nonterminal symbols and the rules where they
appear. Production rules are enumerated as follows:
0 $accept : s $end
1 s : a
2   | a s
3 a : V = e ;
4 e : e L e
5   | e N e
6   | F ( e )
7   | V
8   | C
Most importantly, bison reports on the characteristic finite automaton that the generated
parser is based on. The following output is generated for the specifications in Listing 4.4.

state 0

    0 $accept: . s $end

    V  shift, and go to state 1

    s  go to state 2
    a  go to state 3

...

state 3

    1 s: a .
    2  | a . s

    V  shift, and go to state 1

    $default  reduce using rule 1 (s)

    s  go to state 6
    a  go to state 3

...

state 18

    6 e: F ( e ) .

    $default  reduce using rule 6 (e)
The output has been edited for brevity as indicated by the three consecutive dots. All states
list the kernels of their respective configurating sets while omitting the remaining production
rules of their closures. Transitions that correspond to shift operations are listed as well as
potential reductions and their effects. For example, in state 3 shifting requires the next
token to be read to be V. The characteristic automaton moves into state 1 in this case. If the
preceding reduction is to s or a, then the characteristic automaton moves into state 6 or
remains in state 3, respectively. Otherwise it attempts to reduce to s using production rule 1.
A syntax error is reported if none of the previously mentioned situations occurs. bison can
also generate a graphical representation of the characteristic automaton. Refer to bison's
online documentation for further up-to-date information on its diagnostic capabilities.
4.4.6
The following two case studies are used to discuss the interaction between flex and bison
in more detail.
The first case study generates dot output for the parse tree of the SL2 program

1 x=x+sin(y*3);
2 y=sin(x+y);

[Figure: the resulting parse tree; vertices are numbered in the order of the reductions that
create them and labeled with the corresponding productions, e.g., 1 [e:V], 5 [e:F(e)],
6 [e:eLe], 7 [a:V=e;], 13 [s:a], 14 [s:as].]

Listing 4.6. First section of the bison input file.

1 %{
2   int i; // vertex counter
3   # include <stdio.h>
4 %}
5
6 %token V C F L N
7
8 %left L
9 %left N
The flex input file is similar to Listing 4.2. We consider the three sections of the
bison input file separately. Its first part contains a new section in addition to the definition
of tokens (line 6 in Listing 4.6) and of rules for resolving ambiguity/shift-reduce conflicts
due to operator precedence/associativity of the binary arithmetic operators (lines 8 and
9). Code enclosed within %{ and %} is copied by bison into the generated C-file without
modification.

Listing 4.7. Second section of the bison input file.
 1 s : a
 2   {
 3     $$=++i;
 4     printf("%d [label=\"%d [s:a]\"]\n", $$, $$);
 5     printf("%d->%d\n", $1, $$);
 6   }
 7 ...
 8 a : V = e ;
 9   {
10     $$=++i;
11     printf("%d [label=\"%d [a:V=e;]\"]\n", $$, $$);
12     printf("%d->%d\n", $3, $$);
13   }
14   ;
15 e : e L e
16   {
17     $$=++i;
18     printf("%d [label=\"%d [e:eLe]\"]\n", $$, $$);
19     printf("%d->%d\n", $1, $$);
20     printf("%d->%d\n", $3, $$);
21   }
22 ...
23   | V
24   {
25     $$=++i;
26     printf("%d [label=\"%d [e:V]\"]\n", $$, $$);
27   }
28 ...
Each reduction causes the incrementation of the global vertex counter (lines 3, 10, 17, and
25). The new vertex that represents the nonterminal symbol on the left-hand side of the
respective production rule is labeled with the corresponding index (lines 4, 11, 18, and 26).
Indices of predecessors are accessed through their position in the right-hand side. Edges
are added to the dot output accordingly (lines 5, 12, 19, and 20).

The third part of the bison input file is shown in Listing 4.8. It contains a basic
version of the error handling routine yyerror and the main routine. The latter opens the file to
be parsed, provides this information to the scanner, initializes the parse tree vertex counter,
and calls the parsing routine. Suitable dot output of the parse tree as a directed graph
drawn from bottom (the leaves) to top (toward the root s) is generated. The orientation is
set via the rankdir attribute. Corresponding wrapper code that is written in lines 8 and 10
encloses the code generated in Listing 4.7.
Listing 4.8. bison input file Part 3.
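A minimal body consistent with the description above (digraph wrapper in lines 8 and 10, rankdir set to BT for the bottom-to-top orientation, vertex counter initialization) is the following sketch; all details beyond those are assumptions:

 1 %%
 2
 3 int yyerror(char *msg) { printf("%s\n", msg); return -1; }
 4
 5 int main(int argc, char *argv[])
 6 {
 7   FILE *source_file = fopen(argv[1], "r");
 8   printf("digraph { rankdir=BT;\n");     // wrapper: opening (line 8)
 9   lexinit(source_file); i=0; yyparse();  // init vertex counter, parse
10   printf("}\n");                         // wrapper: closing (line 10)
11   fclose(source_file);
12   return 0;
13 }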
Syntax-Directed Unparser We consider the single-pass generation of a verified syntactically equivalent copy of the input code as an important next step toward single-pass
derivative code generation. Relevant modifications to the flex input file are documented
by Listing 4.9 and Listing 4.10.
Listing 4.9. First section of the flex input file.
 1 %{
 2 #define YYSTYPE char*  // needs to be defined prior to inclusion
 3                        // of parser.tab.h
 4 # include "parser.tab.h"
 5
 6 #define BUFFER_SIZE 3
 7
 8 # include <stdlib.h>   // malloc
 9 # include <string.h>   // strcpy
10 void to_parser() {
11   yylval=(char*) malloc(BUFFER_SIZE*sizeof(char));
12   strcpy(yylval, yytext);
13 }
14 %}
15
16 whitespace  [ \t\n]+
17 variable    [a-z]
18 constant    [0-9]
Specific names of all tokens need to be passed from the scanner to the parser in order to
be copied correctly to the output. Therefore, the default type of the information that is
associated with all parse tree nodes needs to be changed to the C-string type char*. The
preprocessor macro YYSTYPE is redefined accordingly in line 2 prior to the inclusion of
parser.tab.h that contains references to YYSTYPE. A buffer of characters of sufficient
size BUFFER_SIZE and with built-in name yylval is allocated in line 11, and it is used by
the function to_parser to pass the string associated with the current token to the parser.
Appropriate declarations from the C standard library need to be included (lines 8 and 9).
Simplified lexical definitions of whitespaces, variables, and constants follow in lines 16–18.

The various tokens are handled in the second part of the flex input file. For simplicity,
whitespaces are ignored in line 1 of Listing 4.10. While passing whitespaces on to the parser
would result in an exact copy of the input code, the size of the buffer (here, set equal to 3, as
none of the tokens is represented by a string of length greater than 3) would become
unpredictable. Formatting of the output is taken care of by the parser.
Listing 4.10. Second section of the flex input file.
 1 {whitespace}  { }
 2 "sin"         { to_parser(); return F; }
 3 "cos"         { to_parser(); return F; }
 4 "*"           { to_parser(); return N; }
 5 "/"           { to_parser(); return N; }
 6 "+"           { to_parser(); return L; }
 7 "-"           { to_parser(); return L; }
 8 {variable}    { to_parser(); return V; }
 9 {constant}    { to_parser(); return C; }
10 .             { return yytext[0]; }
Several instances of the same token type are distinguished through their actual names. For
example, both sin and cos are tokens of type F. Single character tokens that are not explicitly
listed are simply passed on to the parser in line 10. The third section of the flex input file
is not listed, as it contains nothing but the standard void lexinit(FILE*) routine.

Section one of the bison input file is similar to Listing 4.6 except for the missing
declaration of the parse tree vertex counter that is not required by the syntax-directed
unparser. Its listing is omitted. The second section of the bison input file augments the
production rules with appropriate actions for printing a syntactically equivalent copy of the
input code.
Listing 4.11. Second section of the bison input file.
 1 s : a
 2   | a s
 3   ;
 4 a : V '=' { printf("%s = ", $1); } e ';' { printf(";\n"); }
 5   ;
 6 e : e L { printf("%s ", $2); } e
 7   | e N { printf("%s ", $2); } e
 8   | F '(' { printf("%s (", $1); } e ')' { printf(")"); }
 9   | V { printf("%s ", $1); }
10   | C { printf("%s ", $1); }
11   ;
In line 4 of Listing 4.11, reading the left-hand side of an assignment and the assignment
operator is succeeded by the output of the string that is associated with the token V followed
by =. Bottom-up parsing of the expression e on the right-hand side yields corresponding
output due to the actions associated with the production rules in lines 6–10. All assignments
are finished with a semicolon. Output of the following newline character ensures that each
assignment is printed onto a new line.

The third part of the bison input file is similar to that in Listing 4.5.
4.5

This section deals with the simplest possible method for generating derivative code. SL has
been designed to facilitate this approach, and it shows nicely the link between differentiation
and compiler construction. Many statements and algorithms in this section are mostly
conceptual; state-of-the-art derivative code compilers implement variants thereof in order
to achieve the desired efficiency. Our aim is to communicate fundamental concepts as
the basis for a better understanding of the advanced concepts and source code transformation
algorithms that are described in the literature.
Example 4.20 illustrates the syntax-directed enumeration of subexpressions on the right-hand
side of assignments. It is based on Example 4.19, which shows how to augment the
SL2 grammar with rules for counting subexpressions in parse trees of right-hand sides of
assignments.
Example 4.19 (S-attributed Counting of Subexpressions) Without loss of generality, we
consider all SL2 programs that consist of a single assignment only. Production rules P 3–P 8
in Definition 4.16 (see also Definition 4.12) are augmented with actions on the synthesized
attribute s that holds the number of subexpressions in the parse tree rooted by the current
vertex. Any reduction to e (rules P 4–P 8) adds a new subexpression. Unary intrinsics
increment the number of subexpressions in their single argument. Binary operators add one
to the sum of the numbers of subexpressions in the two operands. We use superscripts to
distinguish between instances of the same nonterminal symbol within the same production
rule. The left-hand side of a production is potentially augmented with superscript l. Superscripts
r1 and r2 denote the first and second occurrences of a symbol on the right-hand side of the
production, respectively.
a : V = e;
el : F (er)      { el.s := er.s + 1 }
   : er1 Ler2    { el.s := er1.s + er2.s + 1 }
   : er1 N er2   { el.s := er1.s + er2.s + 1 }
   : V           { el.s := 1 }
   : C           { el.s := 1 }
The implementation with flex and bison is straightforward. The scanner is the same
as in Section 4.4.5. Modifications are restricted to the first and second parts of the bison
input file. The latter becomes

a : V = e ; { printf("%d\n", $3); } ;
e : e L e   { $$=$1+$3+1; }
  | e N e   { $$=$1+$3+1; }
  | F ( e ) { $$=$3+1; }
  | V       { $$=1; }
  | C       { $$=1; }
  ;
The directive # include <stdio.h> is added to the first part in order to gain access to the
definition of printf. Application to y=sin(x*2); yields the output 4, which corresponds to
the four subexpressions x, 2, x*2, and sin(x*2).
Example 4.20 (L-attributed Enumeration of Subexpressions) We consider the same
grammar as in the previous example. An inherited attribute i that represents the unique index of each subexpression is added to the S-attributed grammar developed in Example 4.19.
This index will be used to generate single assignment code in Section 4.5.2. The bottom-up
evaluation of the synthesized attribute values is followed by a top-down sweep on the parse
tree to propagate i.
a : V = e;       { e.i := 0 }
el : F (er)      { el.s := er.s + 1;  er.i := el.i + 1 }
   : er1 Ler2    { el.s := er1.s + er2.s + 1;
                   er1.i := el.i + 1;  er2.i := er1.i + er1.s }
   : er1 N er2   { el.s := er1.s + er2.s + 1;
                   er1.i := el.i + 1;  er2.i := er1.i + er1.s }
   : V           { el.s := 1 }
   : C           { el.s := 1 }
 1 %{ unsigned int sacvc; %}
 2
 3 %token V C F L N
 4
 5 %left L
 6 %left N
 7
 8 %%
 9
10 a : V = { sacvc=0; } e ;
11 e : e L e   { $$=sacvc++; }
12   | e N e   { $$=sacvc++; }
13   | F ( e ) { $$=sacvc++; }
14   | V       { $$=sacvc++; }
15   | C       { $$=sacvc++; }
16   ;
17
18 %%
19 ...
Reduction of a new subexpression in lines 11–15 yields the incrementation of the initially
zero (see line 10) global counter, thus ensuring uniqueness of the assigned index. The
order of enumeration is bottom-up instead of top-down. It is irrelevant in the context
of assignment-level SAC, whose syntax-directed generation is discussed in the following
section.
4.5.2
(P 3) a : V = e;
      e.i = 0
      a.c = e.c + V.c + "=v0;"

(P 4) el : er1 Ler2
      el.s = er1.s + er2.s + 1
      er1.i = el.i + 1
      er2.i = er1.i + er1.s
      el.c = er1.c + er2.c
           + "v" + el.i + "=v" + er1.i + L.c + "v" + er2.i + ";"

(P 5) el : er1 N er2
      el.s = er1.s + er2.s + 1
      er1.i = el.i + 1
      er2.i = er1.i + er1.s
      el.c = er1.c + er2.c
           + "v" + el.i + "=v" + er1.i + N.c + "v" + er2.i + ";"

11 The semantics of the operator + is modified according to the principles of operator overloading in, for
example, C++.
Shift-reduce conflicts are resolved by specifying the order of evaluation for associativity
and operator precedence as discussed in Section 4.4.4.

(P 6) el : F (er)
      el.s = er.s + 1
      er.i = el.i + 1
      el.c = er.c
           + "v" + el.i + "=" + F.c + "(v" + er.i + ");"

(P 7) e : V
      e.s = 1
      e.c = "v" + e.i + "=" + V.c + ";"

(P 8) e : C
      e.s = 1
      e.c = "v" + e.i + "=" + C.c + ";"
Table 4.4 illustrates the attribute grammar's use in a syntax-directed assignment-level SAC
generator for SL2 programs. The sequence of tokens in the assignment y=sin(x*2); is
parsed, and its SAC is synthesized in the c attributes of the nonterminals. Subexpressions
are enumerated top-down on the parse tree as shown in Example 4.20. The SAC of subtrees
with roots that represent nonterminals on the right-hand side of the production is followed
by the contribution of the production itself. An annotated representation of the parse tree is
shown in Figure 4.16. The local contributions to the value of the c attribute can be unparsed
immediately. An explicit construction of the parse tree is not necessary, as shown in the
following proof-of-concept implementation.

Implementation

We use flex and bison to build assignment-level SACs for SL2 programs. The corresponding
flex and bison input files are shown in Listings 4.12 and 4.13, respectively.
An extension to SL programs is a straightforward exercise.
The single-pass SAC generator is based on Examples 4.19 and 4.20. It uses the
structured type ptNodeType that is defined in a file ast.h to associate the two attributes i
and c with the nodes in the parse tree.

#define BUFFER_SIZE 100000

typedef struct {
  int j;    // unique index (the attribute i of the attribute grammar)
  char* c;  // accumulated single assignment code
} ptNodeType;

#define YYSTYPE ptNodeType

The bison preprocessor macro YYSTYPE is set to ptNodeType for this purpose. A sufficiently
large buffer of characters is required to store the SAC. For simplicity, we define its
size statically in ast.h.
Table 4.4. Syntax-directed assignment-level SAC generation for y = sin(x*2);
(shift steps omitted).

PARSED        ACTION   $$.i   $$.c                                     Comment
V             S
V = F (V      R(P 7)    2     v2=x;                                    V.c = "x"
V = F (eN C   R(P 8)    3     v3=2;                                    C.c = "2"
V = F (eN e   R(P 5)    1     v2=x; v3=2; v1=v2*v3;                    N.c = "*", er1.i = 2, er2.i = 3
V = F (e)     R(P 6)    0     v2=x; v3=2; v1=v2*v3; v0=sin(v1);        F.c = "sin", er.i = 1
V = e;        R(P 3)          v2=x; v3=2; v1=v2*v3; v0=sin(v1); y=v0;  V.c = "y", e.i = 0
$accept       ACCEPT

[Figure 4.16: the parse tree of y = sin(x*2); annotated with the accumulated values of
the c attribute at each vertex.]

Listing 4.12. The flex input file.

whitespace  [ \t\n]+
variable    [a-z]
constant    [0-9]

%%

{whitespace}  { }
"sin"         { to_parser(); return F; }
"+"           { to_parser(); return L; }
"*"           { to_parser(); return N; }
{variable}    { to_parser(); return V; }
{constant}    { to_parser(); return C; }
.             { return yytext[0]; }

%%

void lexinit(FILE *source) { yyin = source; }
The bison input file does not yield many surprises. It uses routines get_buffer
and free_buffer to manage the memory that is required during the synthesis of the SAC.
Individual SAC variables are enumerated as in Example 4.20.
Listing 4.13. The bison input file.

%{
# include <stdio.h>
# include <stdlib.h>
# include "ast.h"

unsigned int sacvc;

void get_buffer(YYSTYPE *v) {
  v->c=malloc(BUFFER_SIZE*sizeof(char));
}

void free_buffer(YYSTYPE *v) {
  if (v->c) free(v->c);
}
%}

%token V C F L N

%left L
%left N

%%

sl2_program : s {
      printf("%s", $1.c);
      free_buffer(&$1);
    };
s : a
  | a s {
      get_buffer(&$$);
      sprintf($$.c, "%s%s", $1.c, $2.c);
      free_buffer(&$1); free_buffer(&$2);
    };
a : V = { sacvc=0; } e ; {
      get_buffer(&$$);
      sprintf($$.c, "%s%s=v%d;\n", $4.c, $1.c, $4.j);
      free_buffer(&$1); free_buffer(&$4);
    };
e : e L e {
      $$.j=sacvc++;
      get_buffer(&$$);
      sprintf($$.c, "%s%sv%d=v%d%sv%d;\n",
              $1.c, $3.c, $$.j, $1.j, $2.c, $3.j);
      free_buffer(&$1);
    }
  | e N e {
      // same as above
    }
  | F ( e ) {
      $$.j=sacvc++;
      get_buffer(&$$);
      sprintf($$.c, "%sv%d=sin(v%d);\n",
              $3.c, $$.j, $3.j);
      free_buffer(&$3);
    }
  | V {
      $$.j=sacvc++;
      get_buffer(&$$);
      sprintf($$.c, "v%d=%s;\n", $$.j, $1.c);
      free_buffer(&$1);
    }
  | C {
      // same as above
    };

%%

int yyerror(char *msg) { printf("ERROR: %s\n", msg); return -1; }

int main(int argc, char *argv[])
{
  FILE *source_file=fopen(argv[1], "r");
  lexinit(source_file); yyparse(); fclose(source_file);
  return 0;
}
Actions in the bison input file can be associated with reduce as well as with shift operations.
For example, the rule

a : V "=" { sacvc=0; } e ";"

causes the variable sacvc to be initialized immediately after reading = from the input. The
C library function sprintf is used to implement the overloaded + operator of the associated
attribute grammar. The SAC statements are printed into the corresponding buffers. Local
buffers are freed as soon as they are no longer required.
Example 4.21 Application of the syntax-directed assignment-level SAC compiler to the
SL2 program

x=x*y;
x=sin(x*y+3);

yields

v0=x; v1=y; v2=v0*v1; x=v2;
v0=x; v1=y; v2=v0*v1; v3=3; v4=v2+v3; v5=sin(v4); x=v5;
All variables are assumed to be scalar floating-point variables. Again, the numbering of
the subexpressions on the right-hand side of assignments does not match the values of the
corresponding inherited attribute in the associated attribute grammar. This difference is
irrelevant in the present context. All that is needed is uniqueness, which is guaranteed by
the global counter mechanism.
4.5.3
Tangent-linear code is generated conceptually by attaching (directional) derivative components
to each floating-point variable followed by differentiating all SAC assignments.
According to (2.1), each SAC assignment v_j = φ_j((v_i)_{i≺j}) is preceded by code for computing
the inner product of the gradient of φ_j with respect to all SAC variables v_i, i ≺ j, on
the right-hand side with the vector (v_i^(1))_{i≺j} of directional derivatives of these
SAC variables. We use the underscore character to denote the directional derivative of a
variable v, that is, v_ ≡ v^(1).

The attributes are the same as in Section 4.5.2. The synthesized attribute c now
contains the sequence of assignment-level SAC statements, each of them augmented with the
corresponding elementary tangent-linear assignments. The resulting L-attributed grammar
for tangent-linear versions of single assignments is the following:
(P 3) a : V = e;
      e.i = 0
      a.c = e.c
          + V.c + "_=v0_;"
          + V.c + "=v0;"

(For V.c = "y" this yields y_=v0_; y=v0;, that is, y^(1) = v0^(1) and y = v0.)
Linear (P 4) and nonlinear (P 5) operators can be described by a single rule P 4/5. The
differences are restricted to the expressions for the local partial derivatives.

(P 4/5) el : er1 Oer2
      el.s = er1.s + er2.s + 1
      er1.i = el.i + 1
      er2.i = er1.i + er1.s
      el.c = er1.c + er2.c
          + "v" + el.i + "_=" + O_{er1.i} + "*v" + er1.i + "_+"
                              + O_{er2.i} + "*v" + er2.i + "_;"
          + "v" + el.i + "=v" + er1.i + O.c + "v" + er2.i + ";"

where, for the linear operators,

L_{er1.i} := 1 (for both + and -),
L_{er2.i} := 1 if L.c = "+",  L_{er2.i} := -1 if L.c = "-",

and, for the nonlinear operators,

N_{er1.i} := "v" + er2.i                                   if N.c = "*",
N_{er1.i} := "1/v" + er2.i                                 if N.c = "/",
N_{er2.i} := "v" + er1.i                                   if N.c = "*",
N_{er2.i} := "-" + N_{er1.i} + "*" + N_{er1.i} + "*v" + er1.i  if N.c = "/".
As before, shift-reduce conflicts are resolved by specifying the order of evaluation for
associativity and operator precedence.
(P 6) el : F (er)
      el.s = er.s + 1
      er.i = el.i + 1
      el.c = er.c
          + "v" + el.i + "_=" + F_{er.i} + "*v" + er.i + "_;"
          + "v" + el.i + "=" + F.c + "(v" + er.i + ");"

where

F_{er.i} := "cos(v" + er.i + ")"   if F.c = "sin",
F_{er.i} := "-sin(v" + er.i + ")"  if F.c = "cos",
F_{er.i} := "exp(v" + er.i + ")"   if F.c = "exp",
... etc.
(P 7) e : V
      e.s = 1
      e.c = "v" + e.i + "_=" + V.c + "_;"
          + "v" + e.i + "=" + V.c + ";"

(P 8) e : C
      e.s = 1
      e.c = "v" + e.i + "_=0;"
          + "v" + e.i + "=" + C.c + ";"
Certain production rules of the SL grammar are omitted, as the flow of control in the
tangent-linear code is the same as in the original code. The actions associated with rules P 9–P 11
are simple unparsing steps. Rules P 1 and P 2 yield a simple concatenation of the tangent-linear
code of sequences of statements.

The use of the attribute grammar in a syntax-directed tangent-linear code compiler is
illustrated in Table 4.5 for the assignment y = sin(x*2);. The derivation of the corresponding
annotated parse tree is straightforward. A proof-of-concept implementation based
on Listings 4.12 and 4.13 is left as an exercise.
Example 4.22 Application of the syntax-directed tangent-linear code compiler to the SL
program
Table 4.5. Syntax-directed tangent-linear code for y = sin(x*2); set v_i^(1) ≡ vi_ to
establish the link with the code that is generated by the syntax-directed tangent-linear code
compiler (shift steps omitted).

PARSED        ACTION   $$.i   $$.c
V             S
V = F (V      R(P 7)    2     v2^(1)=x^(1); v2=x;
V = F (eN C   R(P 8)    3     v3^(1)=0; v3=2;
V = F (eN e   R(P 5)    1     v2^(1)=x^(1); v2=x;
                              v3^(1)=0; v3=2;
                              v1^(1)=v2^(1)*v3+v2*v3^(1); v1=v2*v3;
V = F (e)     R(P 6)    0     (as above, followed by)
                              v0^(1)=cos(v1)*v1^(1); v0=sin(v1);
V = e;        R(P 3)          (as above, followed by)
                              y^(1)=v0^(1); y=v0;
$accept       ACCEPT
if (x<y) {
  x=sin(x);
  while (y<x) { x=sin(x*3); }
  y=4*x+y;
}
yields
if (x<y) {
  v0_=x_; v0=x;
  v1_=cos(v0)*v0_; v1=sin(v0);
  x_=v1_; x=v1;
  while (y<x) {
    v0_=x_; v0=x;
    v1_=0; v1=3;
    v2_=v0_*v1+v0*v1_; v2=v0*v1;
    v3_=cos(v2)*v2_; v3=sin(v2);
    x_=v3_; x=v3;
  }
  v0_=0; v0=4;
  v1_=x_; v1=x;
  v2_=v0_*v1+v0*v1_; v2=v0*v1;
  v3_=y_; v3=y;
  v4_=v2_+v3_; v4=v2+v3;
  y_=v4_; y=v4;
}
All variables are assumed to be scalar floating-point variables. The flow of control remains
unchanged. Each assignment is simply augmented with its tangent-linear code. As in
Example 4.21, the numbering of the subexpressions is bottom-up instead of top-down due
to the replacement of the inherited attribute i by a global counter.
In order to run this code, it must be wrapped into a function
void f_(double& x, double& x_, double& y, double& y_) {
  double v0, v1, v2, v3, v4;
  double v0_, v1_, v2_, v3_, v4_;
  // generated code goes here
}
that includes appropriate declarations of the SAC variables and of their tangent-linear
versions.
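Calling the tangent-linear routine then amounts to seeding a direction; a minimal hypothetical driver is the following sketch:

#include <cstdio>

void f_(double& x, double& x_, double& y, double& y_); // as above

int main() {
  double x = 1.0, y = 0.5;
  double x_ = 1.0, y_ = 0.0; // seed the direction (1, 0), i.e., d/dx
  f_(x, x_, y, y_);
  printf("x=%f x_=%f y=%f y_=%f\n", x, x_, y, y_);
  return 0;
}

On return, x_ and y_ hold the directional derivatives of the two outputs with respect to the original value of x.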
4.5.4
$accept : s$end
   s.k = 0
   $accept.cf = s.cf
       + "int i;"
       + "while (pop_c(i)){"
       +   "if (i==1){" + s.c_1^b + "}"
       +   "else if (i==2){" + s.c_2^b + "}"
       +   ...
       +   "else if (i==" + s.k + "){" + s.c_{s.k}^b + "}"
       + "}"

(P 1) s : a
   a.k = s.k + 1
   s.k = a.k;  s.cf = a.cf;  s.cb = a.cb
The vector assignment s.c = a.c is defined componentwise, that is, s.c_i^• = a.c_i^• for
i = 1, . . . , s.k and • ∈ {f, b}. We present the production rules together with their associated
actions similar to a corresponding implementation in bison. For example, the attribute k of a
is set prior to parsing the assignment itself. The attribute k as well as the forward and backward
code of s are synthesized at the time of the reduction to s. The value of the inherited assignment
counter k is propagated top-down through sequences of statements that are described by rules
P 1, P 1a, . . . , P 2b. It is incremented whenever a new assignment is parsed; see rules P 1 and P 2.
(P 1a) s : b
   b.k = s.k
   s.k = b.k;  s.cf = b.cf;  s.cb = b.cb

(P 1b) s : l
   l.k = s.k
   s.k = l.k;  s.cf = l.cf;  s.cb = l.cb
In production rules P 2, P 2a, and P 2b that describe sequences of statements with more
than one element, the value of the assignment counter is passed from left to right prior to
processing the respective nonterminal symbol. Its value is returned to the left-hand side of
the production rule at the time of reduction. Moreover, forward and backward sections of
adjoint sequences of statements are built as concatenations of the respective code fragments
that are associated with their children. The order is reversed in the synthesis of the reverse
section.

(P 2) s^l : a s^r
   a.k = s^l.k + 1
   s^r.k = a.k
   s^l.k = s^r.k
   s^l.cf = a.cf + s^r.cf;  s^l.cb = s^r.cb + a.cb

Again, the vector sum is elemental, that is, s^l.c_i^• = s^r.c_i^• + a.c_i^• for
i = 1, . . . , s^l.k and • ∈ {f, b}.
(P 2a) s^l : b s^r
   b.k = s^l.k;  s^r.k = b.k;  s^l.k = s^r.k
   s^l.cf = b.cf + s^r.cf;  s^l.cb = s^r.cb + b.cb

(P 2b) s^l : l s^r
   l.k = s^l.k;  s^r.k = l.k;  s^l.k = s^r.k
   s^l.cf = l.cf + s^r.cf;  s^l.cb = s^r.cb + l.cb
Assignment-level SAC is built as in Section 4.5.3. The root of the syntax tree of the
expression on the right-hand side has fixed SAC variable index 0. Variable names are stored
in V.cf by the scanner. The assignment of v0 to the variable on the left-hand side is
preceded by push statements for storing the unique number a.k of the assignment and
the current value of its left-hand side V.cf on appropriately typed stacks. While, due to
missing static data-flow analysis, it cannot be decided whether this value is needed by the
reverse section, this conservative approach ensures the correctness of the adjoint code. The
storage of the unique identifier of the assignment is necessary for the correct reversal of the
flow of control. Alternative approaches to control-flow reversal are discussed in the literature;
see Chapter 2.
(P 3) a : V = e;
   e.k = a.k;  e.i = 0
   a.cf = e.cf + "push_c(" + a.k + ");"
        + "push_d(" + V.cf + ");"
        + V.cf + "=v0;"

The adjoint assignment is built according to Adjoint Code Generation Rule 2; see Section
2.2.1. Incrementation of adjoint SAC variables, such as v0_, is not necessary, as values
of SAC variables are read exactly once. The adjoint of the program variable on the left-hand
side of the assignment is set equal to zero, followed by the execution of the adjoint code
that corresponds to the SAC of the right-hand side of the assignment.

   a.c_{a.k}^b = "pop_d(" + V.cf + ");"
              + "v0_=" + V.cf + "_;" + V.cf + "_=0;"
              + e.c_{e.k}^b
Again, linear and nonlinear operations are treated similarly with differences restricted to
the local partial derivatives. Both the attributes for enumeration of subexpressions (i) and
for identifying assignments uniquely (k) are propagated top-down. The left-hand sides of
all SAC statements are stored on the data stack prior to being overwritten. Their values
are recovered before the execution of the corresponding adjoint assignments in the reverse
section. Note that this approach yields a larger memory requirement than the code that results
from Adjoint Code Generation Rule 4. There, the storage of overwritten values is restricted
to program variables, and assignment-level incomplete SAC is built within the reverse
section to ensure access to arguments of the local partial derivatives. The corresponding
modification of the attribute grammar is straightforward and hence left as an exercise.
(P 4/5) el : er1 Oer2
   er1.i = el.i + 1;  er2.i = er1.i + er1.s
   er1.k = el.k;  er2.k = el.k
   el.s = er1.s + er2.s + 1
   el.cf = er1.cf + er2.cf
        + "push_d(v" + el.i + ");"
        + "v" + el.i + "=v" + er1.i + O.cf + "v" + er2.i + ";"

   el.c_{e.k}^b = "pop_d(v" + el.i + ");"
              + "v" + er2.i + "_=" + O_{er2.i} + "*v" + el.i + "_;"
              + "v" + er1.i + "_=" + O_{er1.i} + "*v" + el.i + "_;"
              + er2.c_{e.k}^b + er1.c_{e.k}^b

where O ∈ {L, N }. As in Section 4.5.3, O_{er1.i} denotes the partial derivative of the operation O
with respect to the SAC variable that holds the value of the expression er1 (similarly for er2).
Shift-reduce conflicts are resolved by specifying the order of evaluation for associativity
and operator precedence.
A feasible treatment of unary intrinsics follows immediately from the previous discussion.

(P 6) el : F (er)
   er.i = el.i + 1;  er.k = el.k
   el.s = er.s + 1
   el.cf = er.cf
        + "push_d(v" + el.i + ");"
        + "v" + el.i + "=" + F.cf + "(v" + er.i + ");"

   el.c_{e.k}^b = "pop_d(v" + el.i + ");"
              + "v" + er.i + "_=" + F_{er.i} + "*v" + el.i + "_;"
              + er.c_{e.k}^b

F is an arbitrary unary function, such as sin or exp. F_{er.i} denotes the partial derivative of
F with respect to the SAC variable that holds the value of the expression er.
Assignments of values of program variables to SAC variables do not yield any surprises in
the forward section.

(P 7) e : V
   e.s = 1
   e.cf = "push_d(v" + e.i + ");"
        + "v" + e.i + "=" + V.cf + ";"

In the reverse section, the adjoint program variable needs to be incremented according to
Adjoint Code Generation Rule 2.

   e.c_{e.k}^b = "pop_d(v" + e.i + ");"
              + V.cf + "_+=v" + e.i + "_;"

(P 8) e : C
   e.s = 1
   e.cf = "push_d(v" + e.i + ");"
        + "v" + e.i + "=" + C.cf + ";"

   e.c_{e.k}^b = "pop_d(v" + e.i + ");"
Control-flow statements such as branches and loops as well as the associated conditions are
simply unparsed in the forward section. They have no impact on the reverse section due to
the chosen conservative control-flow reversal method.

(P 9) b : IF (r){s}
   s.k = b.k;  b.k = s.k
   b.cf = "if" + "(" + r.cf + ")" + "{" + s.cf + "}"
   b.cb = s.cb

(P 10) l : WHILE(r){s}
   s.k = l.k;  l.k = s.k
   l.cf = "while" + "(" + r.cf + ")" + "{" + s.cf + "}"
   l.cb = s.cb

(P 11) r : V^{r1} RV^{r2}
   r.cf = V^{r1}.cf + R.cf + V^{r2}.cf
Table 4.6 illustrates the syntax-directed synthesis of the forward and reverse sections of the
adjoint code. A proof-of-concept implementation based on Listings 4.12 and 4.13 is left as
an exercise.
Example 4.23 Application of the syntax-directed adjoint code compiler to

t=0;
while (x<t) {
  if (x<y) {
    x=y+1;
  }
  x=sin(x*y);
}

yields

push_c(0);
push_d(v1); v1=0;
push_d(t); t=v1;
Table 4.6. Syntax-directed adjoint SAC for y = sin(x*2); set v_{i(1)} ≡ vi_ to
establish the link with the code that is generated by the syntax-directed adjoint code compiler
(shift steps omitted; push/pop stand for push_d/pop_d).

PARSED        ACTION   $$.cf                               $$.cb
V             S
V = F (V      R(P 7)   push(v2); v2=x;                     pop(v2); x_+=v2_;
V = F (eN C   R(P 8)   push(v3); v3=2;                     pop(v3);
V = F (eN e   R(P 5)   push(v2); v2=x; push(v3); v3=2;     pop(v1); v3_=v2*v1_;
                       push(v1); v1=v2*v3;                 v2_=v3*v1_; pop(v3);
                                                           pop(v2); x_+=v2_;
V = F (e)     R(P 6)   (as above, followed by)             pop(v0); v1_=cos(v1)*v0_;
                       push(v0); v0=sin(v1);               (followed by the above)
V = e;        R(P 3)   (as above, followed by)             pop(y); v0_=y_; y_=0;
                       push(y); y=v0;                      (followed by the above)
$accept       ACCEPT
while (x<t) {
  if (x<y) {
    push_c(1);
    push_d(v1); v1=y;
    push_d(v2); v2=1;
    push_d(v3); v3=v1+v2;
    push_d(x); x=v3;
  }
  push_c(2);
  push_d(v1); v1=x;
  push_d(v2); v2=y;
  push_d(v3); v3=v1*v2;
  push_d(v4); v4=sin(v3);
  push_d(x); x=v4;
}
int i_;
while (pop_c(i_)) {
  if (i_==0) {
    pop_d(t); v1_=t_; t_=0;
    pop_d(v1);
  }
  else if (i_==1) {
    pop_d(x); v3_=x_; x_=0;
    pop_d(v3); v1_=v3_; v2_=v3_;
    pop_d(v2);
    pop_d(v1); y_+=v1_;
  }
  else if (i_==2) {
    pop_d(x); v4_=x_; x_=0;
    pop_d(v4); v3_=cos(v3)*v4_;
    pop_d(v3); v1_=v3_*v2; v2_=v3_*v1;
    pop_d(v2); y_+=v2_;
    pop_d(v1); x_+=v1_;
  }
}
Appropriate implementations of the stack access routines must be supplied. Again, the
generated code must be wrapped into an appropriate function in order to run it. This wrapper
must declare all program and SAC variables as well as their respective adjoints.
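The following minimal C++ sketch of the stack access routines is consistent with the calls in the generated code; the capacity and all names beyond push_c/pop_c/push_d/pop_d are assumptions:

const int STACK_SIZE = 1000000;                    // placeholder capacity

static int    cs[STACK_SIZE]; static int cn = 0;   // control stack
static double ds[STACK_SIZE]; static int dn = 0;   // data stack

void push_c(int i)    { cs[cn++] = i; }
bool pop_c(int& i)    { if (cn == 0) return false; i = cs[--cn]; return true; }
void push_d(double v) { ds[dn++] = v; }
void pop_d(double& v) { v = ds[--dn]; }

pop_c returns false on an empty control stack, which terminates the while loop that drives the reverse section.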
There are four (double precision) floating-point variables x, y, a, and p and two integer
variables n and i. The value of i is assumed to be equal to zero at the beginning of the SL
code fragment.
4.6.1 Symbol Table
Symbols are described by regular expressions that are recognized by the scanner. Associated
parse tree leaf nodes are generated that contain a reference (pointer) to the corresponding
symbol table entries. The entire procedure is illustrated by the following fragments from
the flex input file.
 1 %{
 2 # include "parse_tree.hpp"
 3 # include "parser.tab.h"
 4 ...
 5 %}
 6
 7 ...
 8 symbol  [a-z]
 9 ...
10
11 %%
12 ...
13
14 {symbol} {
15   yylval=new parse_tree_vertex_symbol(SYMBOL_PTV, yytext);
16   return V;
17 }
18 ...
19
20 %%
21 ...
A new leaf node is added to the parse tree in line 15. Parse tree vertices that are referenced
through yylval are declared as of type parse_tree_vertex* in the file parse_tree.hpp
that is included in line 2.
class parse_tree_vertex {
  unsigned short type;
  list<parse_tree_vertex*> succ;
  ...
};

class parse_tree_vertex_named : public parse_tree_vertex {
  string name;
  ...
};

class parse_tree_vertex_symbol : public parse_tree_vertex {
  symbol* sym;
  ...
};

#define YYSTYPE parse_tree_vertex*
When calling the constructor of parse_tree_vertex_symbol in line 15 with the string yytext
as its second argument, the name of the new symbol is set to yytext while its type is left
undefined. Types of variables can only be determined after parsing the associated declaration
(see Section 4.6.2). If a symbol with the same name as in yytext already exists, then the
address of the existing entry is returned. The type of the newly generated parse tree vertex
is set to SYMBOL_PTV. Token identifiers to be returned to the parser (for example, V
returned in line 16) are defined in the file parser.tab.h included in line 3. After lexical
analysis the parse tree consists of a number of leaf nodes that represent symbols referenced
via pointers into the symbol table.
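The lookup behavior just described can be sketched as follows; the container, the function name get_symbol, and all other details are illustrative assumptions, not dcc's actual implementation:

#include <map>
#include <string>

struct symbol { std::string name; unsigned short type; };

static std::map<std::string, symbol*> symbol_table;

symbol* get_symbol(const std::string& name) {
  std::map<std::string, symbol*>::iterator it = symbol_table.find(name);
  if (it != symbol_table.end()) return it->second; // existing entry is reused
  symbol* s = new symbol;                          // new entry; its type is
  s->name = name;                                  // still undefined at this point
  symbol_table[name] = s;
  return s;
}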
4.6.2 Parse Tree
The parser that is generated by bison based on the code fragments listed below performs
two main tasks. It sets the types of all variables while parsing the respective declarations
(see line 13), and it inserts new parse tree vertices when reducing a handle to the left-hand
side of the associated production (lines 20–26).
 1 %{
 2 ...
 3 # include "parse_tree.hpp"
 4 ...
 5 extern parse_tree_vertex* pt_root;
 6 %}
 7 ..
 8 %%
 9
10 sl : d s { pt_root=$2; };
11 d :
12 ...
13   | FLOAT V ";" d { $2->symbol_type()=FLOAT_ST; }
14   ;
15 ...
16 s : a
17 ...
18 a : V "=" e ";"
19 ...
20 e : e N e
21   {
22     if ($2->get_name()=="*")
23       $$=new parse_tree_vertex(MULTIPLICATION_PTV);
24     $$->succ.push_back($1); $$->succ.push_back($3);
25     delete $2;
26   }
27 ...
28 %%
29 ...

[Figure 4.17: parse tree of the SL code in Listing 4.14; the vertices are numbered and labeled
with their symbols, for example, 32 (s), 29 (=), 16 (sin), 15 (*), 13 (x), 14 (y).]
...
The parse tree is synthesized bottom-up by inserting new typed vertices including references
to their children. A global pointer to the unique root of the parse tree is stored once the whole
input program is parsed (line 10). An example parse tree is shown in Figure 4.17 for the
SL code in Listing 4.14. The implementation of tangent-linear and adjoint code unparsers
is reasonably straightforward. Precise descriptions are given by the attribute grammars in
Section 4.5.3 and Section 4.5.4.
This chapter can only be a first step toward a comprehensive discussion of issues in derivative
code compiler construction. As mentioned previously, a large number of technical
challenges are caused by the various advanced syntactic and semantic concepts of modern
programming languages. It is up to the users of these languages to decide which features are
absolutely necessary in the context of numerical simulation software development. Existing
semantic source transformation tools for numerical code rarely support entire language
standards. Failure to apply these tools to a given code is often due to rather basic incompatibilities
that could be avoided if code and tool development took place in parallel.
Communication between both sides is crucial.
4.7 Exercises

4.7.1 Lexical Analysis
Derive DFAs for recognizing the languages that are defined by the following regular expressions:

1. 0|1+(0|1)*
2. 0+|1(0|1)+

Implement scanners for these languages with flex and gcc. Compare the NFAs and DFAs
derived by yourself with the ones that are generated by flex.
4.7.2 Syntax Analysis

1. Use the parser for SL2 to parse the assignment y = sin(x) + x*2; as shown in
Table 4.3. Draw the parse tree.

2. Extend SL2 and its parser to include the ternary fused multiply-add operation, defined
as y = fma(a, b, c) ≡ a*b + c. Derive the characteristic automaton.

3. Use flex and bison to implement a parser for SL programs that prints a syntactically
equivalent copy of the input code.
4.7.3
1. Use flex and bison to implement a single-pass tangent-linear code compiler for
SL2 programs. Extend it to SL.
2. Use flex and bison to implement a single-pass adjoint code compiler for SL2
programs. Extend it to SL.
Chapter 5
This last chapter combines the material presented in the previous chapters to form the
prototype derivative code compiler dcc. Version 0.9 of dcc can be used to verify the given
examples as well as to run more complex experiments. It serves as an introductory case
study for more mature derivative code compilers.
5.1 Functionality

dcc generates tangent-linear and adjoint versions of implementations of multivariate vector functions

$$y = F(x),$$

given as subroutines

void f(int n, int m, double *x, double *y).
Its results vary depending on whether certain inputs and outputs are aliased (represented by the same program variable) or not. Hence, the two cases

$$y = F(x) \qquad\text{and}\qquad \begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

(x and y unaliased) are considered separately. For y = F(x) with x and y not aliased, the generated derivative code behaves similarly to what has been presented in Chapters 2 and 3.
The tangent-linear code implements

$$F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \to \mathbb{R}^m \times \mathbb{R}^m.$$

All superscripts of the tangent-linear subroutine and variable names are replaced with the prefix t1_, that is, $v^{(1)} \mapsto$ t1_v.
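Following this prefix rule, the tangent-linear subroutine generated for f can be expected to have the signature below (a sketch; the general pattern is confirmed by the signatures shown later for the aliased case):

void t1_f(int n, int m, double *x, double *t1_x, double *y, double *t1_y);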
For

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z), \qquad F : \mathbb{R}^n \times \mathbb{R}^p \to \mathbb{R}^m \times \mathbb{R}^p,$$

we obtain

$$F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \to \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^p \times \mathbb{R}^p :$$

$$\begin{pmatrix} y \\ y^{(1)} \\ z \\ z^{(1)} \end{pmatrix} = F^{(1)}(x, x^{(1)}, z, z^{(1)}),$$

where

$$\begin{pmatrix} y^{(1)} \\ z^{(1)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle, \qquad \begin{pmatrix} y \\ z \end{pmatrix} = F(x, z).$$
The Jacobian of F at point $\binom{x}{z} \in \mathbb{R}^{n+p}$ is accumulated by letting the input vector (t1_x, t1_z) range over the Cartesian basis vectors in $\mathbb{R}^{n+p}$. Potential sparsity of the Jacobian should be exploited. Details of the generated tangent-linear code will be discussed in Section 5.3.
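A driver loop of the following form illustrates this seeding (a sketch, assuming the t1_f signature for the aliased case and user-allocated arrays; dcc itself generates only t1_f):

// Sketch: column-wise accumulation of the Jacobian with tangent-linear code.
for (int j = 0; j < n + p; j++) {
  for (int k = 0; k < n; k++) t1_x[k] = 0;
  for (int k = 0; k < p; k++) t1_z[k] = 0;
  if (j < n) t1_x[j] = 1; else t1_z[j - n] = 1;  // jth Cartesian basis vector
  // reinitialize x and z here if f overwrites them
  t1_f(n, p, m, x, t1_x, z, t1_z, y, t1_y);
  // the jth column of the Jacobian is now given by (t1_y, t1_z)
}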
All subscripts of the adjoint subroutine and variable names are replaced with the prefix a1_, that is, $v_{(1)} \mapsto$ a1_v. The integer parameter a1_mode selects between various modes required in the context of interprocedural adjoint code. Details will be discussed in Section 5.3.
The adjoint of F(x, z) becomes

$$F_{(1)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m :$$

$$\begin{pmatrix} x_{(1)} \\ z \\ z_{(1)} \\ y \\ y_{(1)} \end{pmatrix} = F_{(1)}(x, x_{(1)}, z, z_{(1)}, y_{(1)}),$$

where

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle, \qquad \begin{pmatrix} y \\ z \end{pmatrix} = F(x, z) \tag{5.1}$$

$$y_{(1)} = 0.$$
The input value of $z_{(1)}$ is overwritten instead of incremented because z is both an input and an output of F(x, z). Correctness of (5.1) follows immediately from the decomposition

$$\begin{pmatrix} v_y \\ v_z \end{pmatrix} = F(x, z), \qquad \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} v_y \\ v_z \end{pmatrix}$$

using an auxiliary variable $v \in \mathbb{R}^{m+p}$. The decomposition ensures that local inputs and outputs are mutually unaliased. Application of incremental reverse mode with required data stack s and result checkpoint r to the decomposed function yields
$$v^{y}_{(1)} = 0; \qquad v^{z}_{(1)} = 0$$

$$\begin{pmatrix} v^{y}_{(1)} \\ v^{z}_{(1)} \end{pmatrix} = \begin{pmatrix} v^{y}_{(1)} \\ v^{z}_{(1)} \end{pmatrix} + \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}; \qquad \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix} = 0; \qquad z = s[0];$$

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{y}_{(1)} \\ v^{z}_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle; \qquad \begin{pmatrix} v^{y}_{(1)} \\ v^{z}_{(1)} \end{pmatrix} = 0$$

$$y = r[0]; \qquad z = r[1],$$

which is easily simplified to (5.1).
For the given implementation of F as

void f(int n, int p, int m, double *x, double *z, double *y),

dcc generates an adjoint subroutine with the signature

void a1_f(int a1_mode, int n, int p, int m,
          double *x, double *a1_x,
          double *z, double *a1_z,
          double *y, double *a1_y);
The Jacobian is computed by setting a1_mode = 1 and a1_x = 0, followed by letting the input vector (a1_y, a1_z) range over the Cartesian basis vectors in $\mathbb{R}^{m+p}$. Potential sparsity of the Jacobian should be exploited.
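Again as a sketch (same assumptions as for the tangent-linear loop above), the adjoint code delivers the Jacobian row by row:

// Sketch: row-wise accumulation of the Jacobian with adjoint code.
for (int i = 0; i < m + p; i++) {
  for (int k = 0; k < n; k++) a1_x[k] = 0;
  for (int k = 0; k < p; k++) a1_z[k] = 0;
  for (int k = 0; k < m; k++) a1_y[k] = 0;
  if (i < m) a1_y[i] = 1; else a1_z[i - m] = 1;  // ith Cartesian basis vector
  // reinitialize x and z here if f overwrites them
  a1_f(1, n, p, m, x, a1_x, z, a1_z, y, a1_y);
  // the ith row of the Jacobian is now given by (a1_x, a1_z)
}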
5.1.3 Second-Order Tangent-Linear Code
dcc behaves exactly as described in Section 3.2 when applied in forward mode to the tangent-linear code $F^{(1)}(x, x^{(1)})$. Application of dcc in forward mode to a tangent-linear code

$$\begin{pmatrix} y^{(1)} \\ z^{(1)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle, \qquad \begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$
yields a second-order tangent-linear code computing

$$\begin{pmatrix} y \\ y^{(2)} \\ y^{(1)} \\ y^{(1,2)} \\ z \\ z^{(2)} \\ z^{(1)} \\ z^{(1,2)} \end{pmatrix} = F^{(1,2)}(x, x^{(2)}, x^{(1)}, x^{(1,2)}, z, z^{(2)}, z^{(1)}, z^{(1,2)}),$$

where

$$\begin{pmatrix} y^{(1,2)} \\ z^{(1,2)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1,2)} \\ z^{(1,2)} \end{pmatrix} \right\rangle + \left\langle \nabla^2 F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix}, \begin{pmatrix} x^{(2)} \\ z^{(2)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} y^{(2)} \\ z^{(2)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(2)} \\ z^{(2)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} y^{(1)} \\ z^{(1)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z).$$
The second-order tangent-linear subroutine generated for f has the signature

void t2_t1_f(int n, int p, int m,
             double *x, double *t2_x, double *t1_x, double *t2_t1_x,
             double *z, double *t2_z, double *t1_z, double *t2_t1_z,
             double *y, double *t2_y, double *t1_y, double *t2_t1_y).

Superscripts of the second-order tangent-linear subroutine and variable names are replaced with the prefixes t2_ and t1_, that is, $v^{(1,2)} \mapsto$ t2_t1_v.
The Hessian at point $\binom{x}{z} \in \mathbb{R}^{n+p}$ is accumulated by letting (t1_x, t1_z) and (t2_x, t2_z) range independently over the Cartesian basis vectors in $\mathbb{R}^{n+p}$.
5.1.4 Second-Order Adjoint Code
dcc supports all three modes for generating second-order adjoint code. Its output is such that an arbitrary number of reapplications is possible after some minor preprocessing.
Forward-over-Reverse Mode
dcc behaves exactly as described in Section 3.3 when applied in forward mode to the adjoint code $F_{(1)}(x, x_{(1)}, y_{(1)})$. Application of dcc in forward mode to an adjoint code

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

$$y_{(1)} = 0$$

that implements $F_{(1)}(x, x_{(1)}, z, z_{(1)}, y_{(1)})$ yields
$$F_{(1)}^{(2)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m :$$

$$\begin{pmatrix} x_{(1)} \\ x_{(1)}^{(2)} \\ z \\ z^{(2)} \\ z_{(1)} \\ z_{(1)}^{(2)} \\ y \\ y^{(2)} \\ y_{(1)} \\ y_{(1)}^{(2)} \end{pmatrix} = F_{(1)}^{(2)}(x, x^{(2)}, x_{(1)}, x_{(1)}^{(2)}, z, z^{(2)}, z_{(1)}, z_{(1)}^{(2)}, y_{(1)}, y_{(1)}^{(2)}),$$

where

$$\begin{pmatrix} x_{(1)}^{(2)} \\ z_{(1)}^{(2)} \end{pmatrix} = \begin{pmatrix} x_{(1)}^{(2)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)}^{(2)} \\ z_{(1)}^{(2)} \end{pmatrix}, \nabla F(x, z) \right\rangle + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla^2 F(x, z), \begin{pmatrix} x^{(2)} \\ z^{(2)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} y^{(2)} \\ z^{(2)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(2)} \\ z^{(2)} \end{pmatrix} \right\rangle$$

$$y_{(1)}^{(2)} = 0$$

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

$$y_{(1)} = 0.$$
For the given adjoint subroutine

void a1_f(int a1_mode, int n, int p, int m,
          double *x, double *a1_x,
          double *z, double *a1_z,
          double *y, double *a1_y),

dcc generates the second-order adjoint subroutine

void t2_a1_f(int a1_mode, int n, int p, int m,
             double *x, double *t2_x, double *a1_x, double *t2_a1_x,
             double *z, double *t2_z, double *a1_z, double *t2_a1_z,
             double *y, double *t2_y, double *a1_y, double *t2_a1_y).
Super- and subscripts of the second-order adjoint subroutine and variable names are replaced with the prefixes t2_ and a1_, respectively; that is, $v_{(1)}^{(2)} \mapsto$ t2_a1_v. The Hessian at point $\binom{x}{z} \in \mathbb{R}^{n+p}$ is accumulated by letting (a1_y, a1_z) and (t2_x, t2_z) range independently over the Cartesian basis vectors in $\mathbb{R}^{m+p}$ and $\mathbb{R}^{n+p}$, respectively.
Reverse-over-Forward Mode

dcc behaves exactly as described in Section 3.3 when applied in reverse mode to the tangent-linear code $F^{(1)}(x, x^{(1)})$. Application of dcc in reverse mode with required data stack s and result checkpoint r to a tangent-linear code

$$\begin{pmatrix} y^{(1)} \\ z^{(1)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle, \qquad \begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

yields
$$F_{(2)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m :$$

$$\begin{pmatrix} x_{(2)} \\ x_{(2)}^{(1)} \\ z \\ z_{(2)} \\ z^{(1)} \\ z_{(2)}^{(1)} \\ y \\ y_{(2)} \\ y^{(1)} \\ y_{(2)}^{(1)} \end{pmatrix} = F_{(2)}(x, x_{(2)}, x^{(1)}, x_{(2)}^{(1)}, z, z_{(2)}, z^{(1)}, z_{(2)}^{(1)}, y_{(2)}, y_{(2)}^{(1)}),$$
where

[augmented forward section]

$$\begin{pmatrix} v_y^{(1)} \\ v_z^{(1)} \end{pmatrix} = \left\langle \nabla F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle$$

$$s[0] = z^{(1)}; \qquad \begin{pmatrix} y^{(1)} \\ z^{(1)} \end{pmatrix} = \begin{pmatrix} v_y^{(1)} \\ v_z^{(1)} \end{pmatrix}$$

$$\begin{pmatrix} v_y \\ v_z \end{pmatrix} = F(x, z)$$

$$s[1] = z; \qquad \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} v_y \\ v_z \end{pmatrix}$$

$$r[0] = y; \quad r[1] = z; \quad r[2] = y^{(1)}; \quad r[3] = z^{(1)}$$
[reverse section]

$$v^{y}_{(2)} = 0; \quad v^{z}_{(2)} = 0; \quad v^{y^{(1)}}_{(2)} = 0; \quad v^{z^{(1)}}_{(2)} = 0$$

$$z = s[1]; \qquad \begin{pmatrix} v^{y}_{(2)} \\ v^{z}_{(2)} \end{pmatrix} = \begin{pmatrix} v^{y}_{(2)} \\ v^{z}_{(2)} \end{pmatrix} + \begin{pmatrix} y_{(2)} \\ z_{(2)} \end{pmatrix}; \qquad \begin{pmatrix} y_{(2)} \\ z_{(2)} \end{pmatrix} = 0$$

$$\begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} = \begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{y}_{(2)} \\ v^{z}_{(2)} \end{pmatrix}, \nabla F(x, z) \right\rangle; \qquad \begin{pmatrix} v^{y}_{(2)} \\ v^{z}_{(2)} \end{pmatrix} = 0$$

$$z^{(1)} = s[0]; \qquad \begin{pmatrix} v^{y^{(1)}}_{(2)} \\ v^{z^{(1)}}_{(2)} \end{pmatrix} = \begin{pmatrix} v^{y^{(1)}}_{(2)} \\ v^{z^{(1)}}_{(2)} \end{pmatrix} + \begin{pmatrix} y^{(1)}_{(2)} \\ z^{(1)}_{(2)} \end{pmatrix}; \qquad \begin{pmatrix} y^{(1)}_{(2)} \\ z^{(1)}_{(2)} \end{pmatrix} = 0$$

$$\begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} = \begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{y^{(1)}}_{(2)} \\ v^{z^{(1)}}_{(2)} \end{pmatrix}, \nabla^2 F(x, z), \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} x^{(1)}_{(2)} \\ z^{(1)}_{(2)} \end{pmatrix} = \begin{pmatrix} x^{(1)}_{(2)} \\ z^{(1)}_{(2)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{y^{(1)}}_{(2)} \\ v^{z^{(1)}}_{(2)} \end{pmatrix}, \nabla F(x, z) \right\rangle; \qquad \begin{pmatrix} v^{y^{(1)}}_{(2)} \\ v^{z^{(1)}}_{(2)} \end{pmatrix} = 0$$

$$y_{(2)} = 0; \qquad y^{(1)}_{(2)} = 0.$$
For the given tangent-linear subroutine

void t1_f(int n, int p, int m, double *x, double *t1_x,
          double *z, double *t1_z,
          double *y, double *t1_y),

dcc generates the second-order adjoint subroutine

void a2_t1_f(int a2_mode, int n, int p, int m,
             double *x, double *a2_x, double *t1_x, double *a2_t1_x,
             double *z, double *a2_z, double *t1_z, double *a2_t1_z,
             double *y, double *a2_y, double *t1_y, double *a2_t1_y).
Super- and subscripts of second-order adjoint subroutine and variable names are replaced with the prefixes t1_ and a2_, respectively; that is, $v_{(2)}^{(1)} \mapsto$ a2_t1_v. The Hessian at point $\binom{x}{z} \in \mathbb{R}^{n+p}$ is accumulated by letting (a2_t1_y, a2_t1_z) and (t1_x, t1_z) range independently over the Cartesian basis vectors in $\mathbb{R}^{m+p}$ and $\mathbb{R}^{n+p}$, respectively.
Reverse-over-Reverse Mode

While reverse-over-reverse mode has no relevance for practical applications, its discussion is useful as it provides deeper insight into adjoint code in general. dcc behaves exactly as described in Section 3.3 when applied in reverse mode to the adjoint code $F_{(1)}(x, x_{(1)}, y_{(1)})$.
Application of dcc in reverse mode with required data stack s and result checkpoint r to the adjoint code

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

$$y_{(1)} = 0$$
yields

$$F_{(1,2)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \to \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^p \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^m :$$

$$\begin{pmatrix} x_{(1)} \\ x_{(1,2)} \\ z \\ z_{(2)} \\ z_{(1)} \\ z_{(1,2)} \\ y \\ y_{(1)} \\ y_{(2)} \\ y_{(1,2)} \end{pmatrix} = F_{(1,2)}(x, x_{(2)}, x_{(1)}, x_{(1,2)}, z, z_{(2)}, z_{(1)}, z_{(1,2)}, y_{(1)}, y_{(2)}),$$
where

[augmented forward section]

$$\begin{pmatrix} v_x \\ v_z \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$s[0] = z_{(1)}; \qquad \begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} v_x \\ v_z \end{pmatrix}$$

$$\begin{pmatrix} v_y \\ v_z \end{pmatrix} = F(x, z)$$

$$s[1] = z; \qquad \begin{pmatrix} y \\ z \end{pmatrix} = \begin{pmatrix} v_y \\ v_z \end{pmatrix}$$

$$s[2] = y_{(1)}; \qquad y_{(1)} = 0$$

$$r[0] = x_{(1)}; \quad r[1] = z_{(1)}; \quad r[2] = y; \quad r[3] = z; \quad r[4] = y_{(1)}$$
[reverse section]

$$v^{x}_{(2)} = 0; \quad v^{y}_{(2)} = 0; \quad v^{z}_{(2)} = 0$$

$$y_{(1)} = s[2]; \qquad y_{(1,2)} = 0$$
$$x_{(1,2)} = x_{(1,2)} + v^{x}_{(2)}$$

$$\begin{pmatrix} y_{(1,2)} \\ z_{(1,2)} \end{pmatrix} = \begin{pmatrix} y_{(1,2)} \\ z_{(1,2)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{x}_{(2)} \\ v^{z}_{(2)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$\begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} = \begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} + \left\langle \begin{pmatrix} v^{x}_{(2)} \\ v^{z}_{(2)} \end{pmatrix}, \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla^2 F(x, z) \right\rangle; \qquad \begin{pmatrix} v^{x}_{(2)} \\ v^{z}_{(2)} \end{pmatrix} = 0$$

$$\begin{pmatrix} x_{(2)} \\ z_{(2)} \end{pmatrix} = \begin{pmatrix} x_{(2)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(2)} \\ z_{(2)} \end{pmatrix}, \nabla F(x, z) \right\rangle + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla^2 F(x, z), \begin{pmatrix} x_{(1,2)} \\ z_{(1,2)} \end{pmatrix} \right\rangle$$

$$\begin{pmatrix} y_{(1,2)} \\ z_{(1,2)} \end{pmatrix} = \left\langle \begin{pmatrix} x_{(1,2)} \\ z_{(1,2)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$y_{(2)} = 0$$

$$\begin{pmatrix} x_{(1)} \\ z_{(1)} \end{pmatrix} = \begin{pmatrix} x_{(1)} \\ 0 \end{pmatrix} + \left\langle \begin{pmatrix} y_{(1)} \\ z_{(1)} \end{pmatrix}, \nabla F(x, z) \right\rangle$$

$$\begin{pmatrix} y \\ z \end{pmatrix} = F(x, z)$$

$$y_{(1)} = 0.$$
For the given adjoint subroutine

void a1_f(int a1_mode, int n, int p, int m,
          double *x, double *a1_x,
          double *z, double *a1_z,
          double *y, double *a1_y),

dcc generates the second-order adjoint subroutine

void a2_a1_f(int a2_mode, int a1_mode, int n, int p, int m,
             double *x, double *a2_x, double *a1_x, double *a2_a1_x,
             double *z, double *a2_z, double *a1_z, double *a2_a1_z,
             double *y, double *a2_y, double *a1_y, double *a2_a1_y).
Subscripts of second-order adjoint subroutine and variable names are replaced with the prefixes a2_ and a1_, respectively; that is, $v_{(1,2)} \mapsto$ a2_a1_v. The Hessian at point $\binom{x}{z} \in \mathbb{R}^{n+p}$ is accumulated by letting (a1_y, a1_z) and (a2_a1_x, a2_a1_z) range independently over the Cartesian basis vectors in $\mathbb{R}^{m+p}$ and $\mathbb{R}^{n+p}$, respectively.
5.1.5
5.2 Installation of dcc
The compiler has been tested on various Linux platforms. Its installation files come as
a compressed tar archive file dcc-0.9.tar.gz. It is unpacked into a subdirectory
./dcc-0.9, e.g., by running
tar -xzvf dcc-0.9.tar.gz.
To build the compiler, enter the subdirectory ./dcc-0.9 and type
./configure --prefix=$(INSTALL_DIR)
make
make check
make install
The executable dcc can be found in $(INSTALL_DIR)/bin.
make check runs the compiler in both supported modes (tangent-linear and adjoint)
on test input code stored in subdirectories of ./dcc-0.9/src/tests. The generated
output is verified against a reference. An error message is generated for anything but
identical matches.
5.3 Use of dcc
Let the original source code reside in a file named f.c in subdirectory $(SRC_DIR) and
let the top-level directory of the dcc installation be $(DCC_DIR).
A first-order tangent-linear code is built in $(SRC_DIR) by typing
$(DCC_DIR)/dcc f.c 1 1.
The name of the source file f.c is followed by two command-line parameters for setting tangent-linear mode (1) and the order of the derivative (1). The generated code is stored in a file named t1_f.c.
A first-order adjoint code is built in $(SRC_DIR) by typing
$(DCC_DIR)/dcc f.c 2 1.
The first-order (third command-line parameter set to 1) adjoint (second command-line parameter set to 2) version of the code in f.c is stored in a file named a1_f.c.
Higher derivative code can be obtained by reapplying dcc to a previously generated
derivative code in either tangent-linear or adjoint mode. Reapplication of dcc to a previously generated adjoint code a1_f.c requires running the C preprocessor on a1_f.c first
as described in Section 5.4.4. For example, the second-order adjoint code t2_a1_f.c
results from running
$(DCC_DIR)/dcc a1_f.c 1 2
on the preprocessed version of a1_f.c. A third derivative code can be generated, for
example, by running
$(DCC_DIR)/dcc t2_a1_f.c 2 3.
The result is stored in a3_t2_a1_f.c. While reapplication of dcc in adjoint mode to
a previously generated first- or higher-order adjoint model is feasible, this feature is less
likely to be used in practice for reasons outlined in Chapter 3. A third-order adjoint model
is best generated by running
$(DCC_DIR)/dcc t2_a1_f.c 1 3.
Nevertheless, repeated code transformations in adjoint mode have been found to be enlightening ingredients of our lectures and tutorials on computational differentiation.
5.4
dcc expects all double parameters to be passed by reference. Call by value is supported for integer parameters only. Single-line comments are not preserved in the output code. We use the following trivial input code to take a closer look at the result of the semantic transformations performed by dcc; larger inputs result in tangent-linear and adjoint code whose listing becomes unreasonable due to excessive length.
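The trivial input code in question is essentially the following (inferred from the generated tangent-linear and adjoint listings discussed below):

void f(double &x, double &y)
{
  y = sin(x);
}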
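Judging from the surrounding description and the analogous listing of t1_g in Section 5.6, the generated tangent-linear code t1_f.c reads essentially as follows (the line numbers correspond to the references in the next paragraph):

 1  void t1_f(double &x, double &t1_x, double &y, double &t1_y)
 2  {
 3    double v1_0=0;
 4    double t1_v1_0=0;
 5    double v1_1=0;
 6    double t1_v1_1=0;
 7    t1_v1_0=t1_x;
 8    v1_0=x;
 9    t1_v1_1=cos(v1_0)*t1_v1_0;
10    v1_1=sin(v1_0);
11    t1_y=t1_v1_1;
12    y=v1_1;
13  }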
The original assignment is decomposed into the SAC (see Section 2.1.1) listed in lines 8,
10, and 12. Two auxiliary SAC variables v1_0 and v1_1 are declared in lines 3 and 5. dcc
expects a separate declaration for each variable as well as its initialization with some constant
(e.g., 0). Tangent-linear versions of both auxiliary variables are declared and initialized in
lines 4 and 6. All three SAC statements are augmented with local tangent-linear models
(lines 7, 9, and 11).
Auxiliary variable names are built from the base string v by appending the order of
differentiation (1) and a unique counter (0, 1, . . .) separated by an underscore. Potential
name clashes with variables present in the original source code could be avoided by a more
sophisticated naming strategy. Version 0.9 of dcc does not support such a mechanism.
Its source code would need to be edited in order to replace the base string v with some
alternative. The native C++ compiler can be expected to eliminate most auxiliary variables
as the result of copy propagation.
A driver program/function must be supplied by the user, for example:

 1  #include <fstream>
 2  #include <cmath>
 3  using namespace std;
 4
 5  #include "t1_f.c"
 6
 7  int main() {
 8    ofstream t1_out("t1.out");
 9    double x=1, t1_x=1;
10    double y, t1_y;
11    t1_f(x, t1_x, y, t1_y);
12    t1_out << y << " " << t1_y << endl;
13    return 0;
14  }
It computes the partial derivative of the output y with respect to the input x at point x=1. Relevant parts of the C++ standard library are used for file i/o (fstream) and to provide an implementation of the intrinsic sine function (cmath). Global use of the std namespace is crucial as dcc neither accepts nor generates namespace prefixes such as std::. The file t1_f.c is included into the driver in line 5 in order to make these preprocessor settings applicable to the tangent-linear output of dcc. Both the value of x and its directional derivative t1_x are set to one at the time of their declaration in line 9, followed by declarations of the outputs y and t1_y and the call of the tangent-linear function t1_f in lines 10 and 11,
respectively. The results are written into the file t1.out for later validation. Optimistically, zero is returned to indicate an error-free execution of the driver program.
 1  int cs[10];
 2  int csc=0;
 3  double fds[10];
 4  int fdsc=0;
 5  int ids[10];
 6  int idsc=0;
 7  #include "declare_checkpoints.inc"
 8
 9  void a1_f(int a1_mode, double &x, double &a1_x,
10            double &y, double &a1_y)
11  {
12    double v1_0=0;
13    double a1_v1_0=0;
14    double v1_1=0;
15    double a1_v1_1=0;
16    if (a1_mode==1) {
17      cs[csc]=0; csc=csc+1;
18      fds[fdsc]=y; fdsc=fdsc+1; y=sin(x);
19      #include "f_store_results.inc"
20      while (csc>0) {
21        csc=csc-1;
22        if (cs[csc]==0) {
23          fdsc=fdsc-1; y=fds[fdsc];
24          v1_0=x;
25          v1_1=sin(v1_0);
26          a1_v1_1=a1_y; a1_y=0;
27          a1_v1_0=cos(v1_0)*a1_v1_1;
28          a1_x=a1_x+a1_v1_0;
29        }
30      }
31      #include "f_restore_results.inc"
32    }
33  }
The adjoint function needs to be called in first-order adjoint calling mode a1_mode=1 to
invoke the propagation of adjoints from the adjoint output a1_y to the adjoint input a1_x.
Further calling modes will be added when considering call tree reversal in the interprocedural
case in Section 5.6.
An augmented version of the original code enumerates basic blocks in the order
of their execution (line 17; see Adjoint Code Generation Rule 5) and it saves left-hand
sides of assignments before they get overwritten (line 18; see Adjoint Code Generation
Rule 3). Three global stacks are declared for this purpose with default sizes set to 10
to be adapted by the user. The sizes of both the control flow stack ( cs ) and the required
floating-point data stack ( fds ) can be reduced to 1 in the given example. Counter variables
csc and fdsc are declared as references to the tops of the respective stacks. Missing integer
assignments make the required integer data stack ( ids ) in line 5, as well as its counter
variable idsc in line 6, obsolete. Code for allocating memory required for the potential
storage of argument and/or result checkpoints needs to be provided by the user in a file
named declare_checkpoints.inc. In version 0.9 of dcc, all memory required for
the data-flow reversal is allocated globally. Related issues such as thread safety of the
generated adjoint code are the subject of ongoing research and development.
The reverse section of the adjoint code (lines 20 to 30) runs the adjoint basic blocks
in reverse order driven by their indices retrieved one by one from the top of the control
stack (lines 20 to 22). Processing of the original assignments within a basic block in reverse
order starts with the recovery of the original value of the variable on the left-hand side of
the assignment (line 23). An incomplete version of the assignment's SAC (without storage
of the value of the right-hand side expression in the variable on the left-hand side of the
original assignment; lines 24 and 25; see Adjoint Code Generation Rule 4) is built to ensure
availability of all arguments of local partial derivatives potentially needed by the adjoint
SAC (lines 26 to 28). The corresponding auxiliary SAC variables and their adjoints are
declared in lines 12 to 15 (see Adjoint Code Generation Rule 1). dcc expects all local
variables to be initialized, e.g., to zero. Adjoints of variables declared in the original code
are incremented (line 28) while adjoints of (single-use) auxiliary variables are overwritten
(lines 26 and 27). Adjoints of left-hand sides of assignments are set to zero after their use
by the corresponding adjoint SAC statement (line 26; see Adjoint Code Generation Rule 2).
The user is given the opportunity to ensure the return of the correct original function
value through provision of three appropriate files to be included into the adjoint code. By
default, the data flow reversal mechanism restores the input values of all parameters. For
example, one could store y (rescp=y;) in f_store_results.inc and recover it (y=rescp;) in f_restore_results.inc in addition to the declaration and initialization of the checkpoint (double rescp=0;) in declare_checkpoints.inc. Automation of this
kind of checkpointing is impossible if arrays are passed as pointer parameters due to missing
size information in C/C++.
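For this example, the three user-supplied files could simply contain (a minimal sketch using the names introduced above):

// declare_checkpoints.inc
double rescp=0;

// f_store_results.inc
rescp=y;

// f_restore_results.inc
y=rescp;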
The determination of sufficiently large stack sizes may turn out to be nontrivial. For given values of the inputs, one could check the maxima of the stack counters csc, fdsc, and idsc by inserting

cout << csc << " " << fdsc << " " << idsc << endl;

in between the augmented forward and reverse sections of the adjoint code (right before or after line 19).
Again, a driver program/function needs to be supplied by the user. For our simple
example, it looks very similar to the tangent-linear driver discussed in Section 5.4.1.
 1  #include <fstream>
 2  #include <cmath>
 3  using namespace std;
 4
 5  #include "a1_f.c"
 6
 7  int main() {
 8    ofstream a1_out("a1.out");
 9    double x=1, a1_x=0;
10    double y, a1_y=1;
11    a1_f(1, x, a1_x, y, a1_y);
12    a1_out << y << " " << a1_x << endl;
13    return 0;
14  }
To compute the partial derivative of y with respect to x at point x = 1, the value a1_y of the
adjoint of the output is set to one while the adjoint a1_x of the input needs to be initialized
to zero. The correct calling mode (1) is passed to the adjoint function a1_f in line 11. In line
12, the result a1_x is written to a file for later validation. Compilation of this driver followed
by linking with the C++ standard library yields a program whose execution generates the
same output as the tangent-linear driver in Section 5.4.1. A typical correctness check comes
in the form of a comparison of the results obtained from the tangent-linear and adjoint code,
for example running
diff t1.out a1.out.
5.4.3 Second-Order Tangent-Linear Code
The result t2_t1_y is written to the file t2t1.out for later validation, for example, by comparison with the second derivative generated by a second-order adjoint code to be discussed in the next section. Listings of second and higher derivative codes are omitted due to their considerable length; the reader is encouraged to generate them with dcc.
5.4.4 Second-Order Adjoint Code
We consider all three combinations of tangent-linear and adjoint modes to obtain second-order adjoint code with dcc.
Forward-over-Reverse Mode
A second-order adjoint code is obtained by application of dcc to a preprocessed version of
a1_f.c in tangent-linear mode as
$(DCC_DIR)/dcc a1_f.c 1 2.
The C preprocessor needs to be called with the -P (inhibit generation of line markers) option
to resolve all #include statements. Its output corresponds to the syntax accepted by dcc.
As a result, the code associated with checkpointing (declarations, read and write accesses)
is inlined. No argument checkpointing code is required for this simple example.
To compute the second partial derivative of the output y with respect to the input x at
point x = 1, the values a1_y and t2_x are set to one while the second derivatives t2_a1_x and
t2_a1_y need to be set to zero.
 1  #include <fstream>
 2  #include <cmath>
 3  using namespace std;
 4
 5  #include "t2_a1_f.c"
 6
 7  int main() {
 8    ofstream t2a1_out("t2a1.out");
 9    double x=1, t2_x=1, a1_x=0, t2_a1_x=0;
10    double y, t2_y, a1_y=1, t2_a1_y=0;
11    t2_a1_f(1, x, t2_x, a1_x, t2_a1_x, y, t2_y, a1_y, t2_a1_y);
12    t2a1_out << t2_a1_x << endl;
13    return 0;
14  }
Both y and t2_y are pure outputs and thus do not need to be initialized. The result t2_a1_x is written to the file t2a1.out for comparison with the previously generated t2t1.out.

In addition to the second-order adjoint projection t2_a1_x of the Hessian, the second-order adjoint code also computes tangent-linear and adjoint projections of the Jacobian. For the given scalar case we obtain the value of the first derivative of y with respect to x both in t2_y (tangent-linear projection) and a1_x (adjoint projection). Hence, changing line 12 in each of the previously listed drivers for the second-order tangent-linear and second-order adjoint code to

t2t1_out << y << t2_y << t1_y << t2_t1_y << endl;

and

t2a1_out << y << a1_x << t2_y << t2_a1_x << endl;

respectively, makes both drivers print the function value together with the corresponding first and second derivatives for direct comparison.
Reverse-over-Forward Mode
A second-order adjoint code is obtained by application of dcc to t1_f.c in adjoint mode as
$(DCC_DIR)/dcc t1_f.c 2 2.
To compute the second partial derivative of the output y with respect to the input x at point
x=1, the values t1_x and a2_t1_y are set to one while the first-order adjoints a2_x and a2_y
need to be set to zero.
#include <fstream>
#include <cmath>
using namespace std;

#include "a2_t1_f.c"

int main() {
  ofstream a2t1_out("a2t1.out");
  double x=1, t1_x=1, a2_x=0, a2_t1_x=0;
  double y, t1_y, a2_y=0, a2_t1_y=1;
  a2_t1_f(1, x, a2_x, t1_x, a2_t1_x, y, a2_y, t1_y, a2_t1_y);
  a2t1_out << a2_x << endl;
  return 0;
}
Both y and t1_y are pure outputs and thus do not need to be initialized. The result a2_x is
written to the file a2t1.out for comparison with the previously generated t2t1.out.
In addition to the second-order adjoint projection a2_x of the Hessian, the second-order
adjoint code also computes tangent-linear and adjoint projections of the Jacobian. For the
given scalar case we obtain the value of the first derivative of y with respect to x both in
t1_y (tangent-linear projection in direction t1_x) and a2_t1_x (adjoint projection in direction
a2_t1_y). Proper initialization of a2_t1_x to zero is crucial in this case.
Reverse-over-Reverse Mode
As a third alternative, second-order adjoint code is obtained by application of dcc to a
preprocessed version of a1_f.c in adjoint mode as
$(DCC_DIR)/dcc a1_f.c 2 2.
The names of all global variables need to be modified in a1_f.c to avoid name clashes
with the global variables generated by the second application of adjoint mode. This step is
not automatic when working with version 0.9 of dcc. The user needs to change the source
code in a1_f.c manually. This restriction is mostly irrelevant as second-order adjoint
code is unlikely to be generated in reverse-over-reverse mode in practice anyway.
To compute the second partial derivative of y with respect to x at point x=1, both a1_y
and a2_a1_x are set to one while a2_x and a2_y need to be initialized to zero.
#include <fstream>
#include <cmath>
using namespace std;
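The remainder of this driver can be expected to look as follows (a sketch by analogy with the previous drivers; the two leading mode arguments and the exact parameter order of a2_a1_f are assumptions):

#include "a2_a1_f.c"

int main() {
  ofstream a2a1_out("a2a1.out");
  double x=1, a2_x=0, a1_x=0, a2_a1_x=1;
  double y, a2_y=0, a1_y=1, a2_a1_y;
  a2_a1_f(1, 1, x, a2_x, a1_x, a2_a1_x, y, a2_y, a1_y, a2_a1_y);
  a2a1_out << a2_x << endl;
  return 0;
}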
Both y and a2_a1_y are pure outputs and can thus be initialized arbitrarily. The result a2_x
is written to the file a2a1.out for comparison with the previously generated t2t1.out.
In addition to the second-order adjoint projection a2_x of the Hessian, the second-order
adjoint code also computes adjoint projections of the Jacobian. For the given scalar case, we
obtain the value of the first derivative of y with respect to x both in a1_x (adjoint projection
in direction a1_y) and a2_a1_y (adjoint projection in direction a2_a1_x). Initialization of
a1_x to zero is crucial in this case.
5.4.5 Higher Derivative Code

Higher derivative code is generated by repeated application of dcc to its own (preprocessed) output. Listings become rather lengthy even for the simplest code.
The third-order tangent-linear subroutine has the following signature:

t3_t2_t1_f(x, t3_x, t2_x, t3_t2_x, t1_x, t3_t1_x, t2_t1_x,
           t3_t2_t1_x, y, t3_y, t2_y, t3_t2_y, t1_y, t3_t1_y, t2_t1_y,
           t3_t2_t1_y);

Initialization of t1_x, t2_x, and t3_x to one, while all second- and third-order directional derivatives are set equal to zero, yields the first (partial) derivative of y with respect to x in t1_y, t2_y, and t3_y, respectively, the second derivative in t2_t1_y, t3_t1_y, and t3_t2_y, respectively, and the third derivative in t3_t2_t1_y.
To obtain the same derivative information, the third-order adjoint routine obtained by running dcc in forward-over-forward-over-reverse mode is called as follows:

t3_t2_a1_f(1, x, t3_x, t2_x, t3_t2_x, a1_x, t3_a1_x, t2_a1_x,
           t3_t2_a1_x, y, t3_y, t2_y, t3_t2_y, a1_y, t3_a1_y, t2_a1_y,
           t3_t2_a1_y);

a1_y, t2_x, and t3_x need to be initialized to one while the remaining second- and third-order directional derivatives and adjoints are set equal to zero. The first derivative is returned in a1_x, t2_y, and t3_y, respectively, the second derivative in t2_a1_x, t3_a1_x, and t3_t2_y, respectively, and the third derivative in t3_t2_a1_x.
We leave the generation and use of fourth and higher derivative code to the reader.
5.5
Table 5.1 quantifies the performance of the corresponding first and second derivative code generated by dcc on our reference platform. We compare the run times of n executions of the respective derivative code for $n = 2 \cdot 10^4$. Thus, we are able to quantify easily the computational complexity R of the derivative code relative to the cost of an original function evaluation. For example, the ratio between the run time of a single evaluation of the first-order adjoint code and the run time of a single function evaluation is 4.4/0.8 = 5.5. We show the numbers of lines of code (loc) in the second column. Optimization by the native C++ compiler is either switched off (-O0) or full optimization is applied (-O3).
The results document the superiority of the hand-written derivative code discussed in Chapters 2 and 3. A single execution of the adjoint code generated by dcc takes about five times as long as an original function evaluation. Much better performance can be observed for the tangent-linear code. However, only a single execution of the adjoint code is required to compute all gradient entries, as opposed to n executions of the tangent-linear code.

The second-order tangent-linear code can be optimized very effectively by the native C++ compiler performing copy propagation and elimination of common subexpressions. The optimized second-order adjoint code generated in forward-over-reverse mode is only about 50 percent more expensive than the first-order adjoint code. Compiler optimization turns out to be less effective if reverse-over-forward mode is used. This is mostly due to all auxiliary variables getting pushed onto the global required data stack within the augmented forward section.

Missing native compiler optimizations decrease the performance of the generated code significantly. A second-order adjoint code generated in forward-over-reverse mode outperforms the one generated in reverse-over-forward mode. A second-order adjoint code generated in reverse-over-reverse mode turns out not to be competitive.
Table 5.1. Run time of first and second derivative code generated by dcc (in seconds).

            loc     -O0     -O3
f            10     3.6     0.8
t1_f         41    11.1     0.9
a1_f         80    23.9     4.4
t2_t1_f     177    37.7     2.1
t2_a1_f     320    71.4     6.0
a2_t1_f     236    80.8    15.3
a2_a1_f     453   181.9    73.0
We encourage the reader to run similar tests on their favorite computer architectures.
Experience shows that the actual run time of (derivative) code depends significantly on the
given platform consisting of the hardware, the optimizing native C++ compiler, and the implementation of the C++ standard library and other libraries used. Typically, there is plenty
of room for improving automatically generated derivative code either by postprocessing or
by adaptation of the source code transformation algorithms to the given platform. Pragmatically, the extent to which such optimizations pay off depends on the context. Derivative
code compilers can be tuned for given applications depending on their relevance. Automatically generated derivative code can be tuned (semi-)manually for speed and memory
requirement if the resulting code is used extensively over a long period of time.
5.6 Interprocedural Derivative Code

For the generation of interprocedural derivative code, dcc expects all subroutines to be provided in a single file, for example,
void g(double &x) {
  x=sin(x);
}

void f(double &x, double &y) {
  g(x);
  y=sqrt(x);
}
dcc generates the following tangent-linear code:

 1  void t1_g(double &x, double &t1_x)
 2  {
 3    double v1_0=0;
 4    double t1_v1_0=0;
 5    double v1_1=0;
 6    double t1_v1_1=0;
 7    t1_v1_0=t1_x;
 8    v1_0=x;
 9    t1_v1_1=cos(v1_0)*t1_v1_0;
10    v1_1=sin(v1_0);
11    t1_x=t1_v1_1;
12    x=v1_1;
13  }
14  void t1_f(double &x, double &t1_x,
15            double &y, double &t1_y)
16  {
17    double v1_0=0;
18    double t1_v1_0=0;
19    double v1_1=0;
20    double t1_v1_1=0;
21    t1_g(x, t1_x);
22    t1_v1_0=t1_x;
23    v1_0=x;
24    t1_v1_1=(1/(2*sqrt(v1_0)))*t1_v1_0;
25    v1_1=sqrt(v1_0);
26    t1_y=t1_v1_1;
27    y=v1_1;
28  }
The original call of g is replaced by its tangent-linear version t1_g in line 21. Copy propagation (elimination of auxiliary variables) and the elimination of common subexpressions (for example, sqrt(v1_0) in lines 24 and 25) are again left to the native C++ compiler.
1  int cs[10];
2  int csc=0;
3  double fds[10];
4  int fdsc=0;
5  int ids[10];
6  int idsc=0;
7  #include "declare_checkpoints.inc"
8  #include "f.c"
It is the user's responsibility to declare sufficiently large stacks. Moreover, name clashes with variables declared in the original program must be avoided. The preset sizes (here 10) need to be adapted accordingly. The file declare_checkpoints.inc is extended with variable declarations required for the implementation of the subroutine argument checkpointing scheme in joint call tree reversal mode. For example,
double rescp=0;
double argcp=0;
allocates memory for storing the input value x of g that is needed for running g out of context in joint call tree reversal mode. As in Section 5.4.2, these declarations need to be supplied by the user since the problem of generating correct checkpoints for C++ code is statically undecidable. Sizes of vector arguments passed as pointers are generally unknown due to missing array descriptors. While the scalar case could be treated automatically, it is probably not worth the effort since in numerical simulation code handled by dcc most subroutine arguments are arrays. The inclusion of the original code in line 8 is necessary as g is called within the augmented forward section of the adjoint version a1_f of f.
Adjoint subroutines can be called in three modes selected by setting the integer parameter a1_mode. The prefix a1 indicates the order of differentiation; for example, a further application of dcc in adjoint mode introduces an additional mode parameter a2_mode.
The three modes are represented by three if statements in lines 8, 27, and 31. If the adjoint subroutine is called with a1_mode set equal to one, then an adjoint code similar to the one discussed in Section 5.4.2 is executed. Note that the control flow reversal uses an additional auxiliary variable save_csc to store the state of the control stack counter (csc) in line 7, followed by stepping through the local csc - save_csc adjoint basic blocks in lines 14-24. Code for storage and recovery of g's results needs to be supplied by the user.

The two remaining adjoint calling modes invoke user-supplied code for storage (a1_mode==2) and recovery (a1_mode==3) of the subroutine's inputs. Two dummy assignments are generated in lines 29 and 33 to ensure correct syntax of the adjoint code even if no argument checkpointing code is provided, that is, if both files g_store_inputs.inc and g_restore_inputs.inc are left empty. These dummy assignments are eliminated
by the optimizing native C++ compiler. In the current example the input value x of g is saved by

argcp=x;

and restored by

x=argcp;

Not saving x results in incorrect adjoints as its input value is overwritten by the call of g in line 13 of the following listing of a1_f. The adjoint a1_g would hence be called with the wrong input value for x in line 27.
[Listing of the generated adjoint subroutine a1_f (40 lines): the argument checkpoint of g is stored in lines 12 and 13 of the augmented forward section; the arguments are restored in line 26 and the adjoint a1_g is called in line 27 of the reverse section.]
Apart from the treatment of the subroutine call in lines 12, 13, 26, and 27, the adjoint version of f is structurally similar to a1_g. Subroutine calls are preceded by the storage of their argument checkpoints within the augmented forward section (lines 12 and 13). In the reverse section, the correct arguments are restored (line 26) before the adjoint subroutine is executed (line 27). The correct result y of f is preserved by the user-provided code for storing (rescp=y;) and restoring (y=rescp;) the result checkpoint.
5.6.3
The second-order tangent-linear code contains the two second-order tangent-linear subroutines t2_t1_g and t2_t1_f, adding up to a total of 123 lines of code. The call of g inside of f (of t1_g inside of t1_f) is replaced by dcc by a call of t2_t1_g inside of t2_t1_f:

void t2_t1_g(double &x, double &t2_x,
             double &t1_x, double &t2_t1_x) {
  ...
}

void t2_t1_f(double &x, double &t2_x,
             double &t1_x, double &t2_t1_x,
             double &y, double &t2_y,
             double &t1_y, double &t2_t1_y) {
  double v1_0=0;
  double t2_v1_0=0;
  ...
  t2_t1_g(x, t2_x, t1_x, t2_t1_x);
  ...
}
5.6.4

The call tree of the adjoint code in joint reversal mode is given by

|_ a1_f (RECORD)
|  |_ a1_g (STORE_INPUTS)
|  |_ g
|_ a1_f (ADJOIN)
   |_ a1_g (RESTORE_INPUTS)
   |_ a1_g (RECORD)
   |_ a1_g (ADJOIN)

The inputs to g are stored within the augmented forward section of f and the original g is executed. Once the propagation of adjoints within the reverse section of f reaches the point where the adjoint a1_g of g must be evaluated, the original inputs to g are restored and the augmented forward section of g is executed, followed by its reverse section. The adjoint results of a1_g are passed into the remainder of the reverse section of f.
The call tree of a second-order adjoint code constructed in forward-over-reverse mode becomes

|_ t2_a1_f (RECORD)
|  |_ t2_a1_g (STORE_INPUTS)
|  |_ t2_g
|_ t2_a1_f (ADJOIN)
   |_ t2_a1_g (RESTORE_INPUTS)
   |_ t2_a1_g (RECORD)
   |_ t2_a1_g (ADJOIN)

If a1_f returns the correct result y of f in addition to the first-order adjoint a1_x, then t2_a1_f returns both values as well as the first-order directional derivative t2_y and the second-order adjoint t2_a1_x.
5.6.5
While after calling f the output y still contains the desired function value, side effects have been added to illustrate certain complications that arise frequently in real-world situations. In fact, f (together with g) is an implementation of the multivariate vector function $F : \mathbb{R}^n \to \mathbb{R}^{n+1}$ defined as

$$y = F(x) = \begin{pmatrix} x_0 \\ x_0^2 + x_1^2 \\ \vdots \\ \sum_{i=0}^{n-1} x_i^2 \\ \sqrt{\sum_{i=0}^{n-1} x_i^2} \end{pmatrix}.$$

Only the first component of the input vector x remains unchanged.
The input vector x needs to be reinitialized correctly prior to each call of t1_f as it is overwritten inside of t1_f. We call a subroutine init for this purpose in line 7. Similarly, the vector of directional derivatives of x, as an output of f, needs to be reset to the correct Cartesian basis vector in line 8 before getting passed as an input to the next call of t1_f in line 9. All elements of t1_x are reinitialized to zero by the subroutine zero except for the ith entry, which is set to one. Thus, a single gradient entry is computed during each loop iteration. All entries are collected in the vector g in line 10 and the whole gradient g is printed to the file t1.out by calling the function print.
For the adjoint a1_x of x to contain the gradient after running a1_f, it must be initialized to
zero. This initialization is performed explicitly in line 6. An incorrectly initialized vector
a1_x makes the incremental adjoint code increment the wrong values. The adjoint a1_y of
the scalar output y of f is set to one followed by running a1_f. Again, the gradient is written
to a file for subsequent verification.
Further user intervention is required to make this adjoint computation of the gradient a success. Correct argument checkpoints need to be defined and implemented manually. Joint call tree reversal results in a1_g being called out of context within the reverse section of a1_f. A checkpoint is needed to ensure correct input values for x. For example, the user may declare

double argcp_g[10];
int argcp_g_c;

store the inputs of g in g_store_inputs.inc, and recover them in g_restore_inputs.inc by running

argcp_g_c=0;
while (argcp_g_c<n) {
  x[argcp_g_c]=argcp_g[argcp_g_c];
  argcp_g_c=argcp_g_c+1;
}
Execution of the adjoint version of g at the end of the reverse section of a1_f is preceded by the recovery of the correct input values:

...
a1_g(3, n, x, a1_x, y, a1_y);
a1_g(1, n, x, a1_x, y, a1_y);
...
The remaining .inc files may remain empty unless the user wants the correct result to be
returned in y. Corresponding result checkpoints need to be declared and implemented in
this case. See also Section 5.6.2.
5.7.3
The main conceptual difference between the second-order tangent-linear driver routine in Section 5.4.3 and the following one is the need for $O(n^2)$ runs of the second-order tangent-linear routine when computing the whole Hessian. Overwriting of x in g makes the proper reinitialization of all first- and second-order directional derivatives of x, as well as of x itself (line 9 of the listing below), prior to each call of the second-order tangent-linear routine t2_t1_f crucial.
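A driver consistent with these line references might read as follows (a sketch; init, zero, and print denote the user-supplied helper routines introduced above, and the parameter order of t2_t1_f follows Section 5.6.3):

 1  #include "t2_t1_f.c"
 2  const int n=10;
 3
 4  int main() {
 5    double x[n], y, H[n][n];
 6    double t1_x[n], t2_x[n], t2_t1_x[n], t1_y, t2_y, t2_t1_y;
 7    for (int i=0; i<n; i++)
 8      for (int j=i; j<n; j++) {
 9        init(x); zero(t1_x); zero(t2_x); zero(t2_t1_x); t2_t1_y=0;
10        t1_x[i]=1;
11        t2_x[j]=1;
12        t2_t1_f(x, t2_x, t1_x, t2_t1_x, y, t2_y, t1_y, t2_t1_y);
13        H[i][j]=H[j][i]=t2_t1_y;
14      }
15    print(H, "t2t1.out");
16    return 0;
17  }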
Symmetry of the Hessian is exploited by computing only its upper (or lower) triangular submatrix. Again, the first-order directional derivatives of the input vector x are set to range independently over the Cartesian basis vectors in $\mathbb{R}^n$ in line 10. The Hessian itself is written to a file for later comparison with the result obtained from a second-order adjoint code.
5.7.4
While O(n2 ) runs of the second-order tangent-linear model are required to accumulate all
entries of the Hessian, the same task can be performed by the second-order adjoint model
at a computational cost that depends only linearly on n.
 1  #include "t2_a1_f.c"
 2  const int n=10;
 3
 4  int main() {
 5    double x[n], y, H[n][n];
 6    init(x);
 7    double t2_x[n], a1_x[n], t2_a1_x[n], t2_y, a1_y, t2_a1_y;
 8    for (int j=0; j<n; j++) {
 9      t2_a1_y=0; zero(t2_a1_x); zero(a1_x);
10      a1_y=1;
11      zero(t2_x); t2_x[j]=1;
12      t2_a1_f(1, n, x, t2_x, a1_x, t2_a1_x, y, t2_y, a1_y, t2_a1_y);
13      for (int i=0; i<n; i++) H[j][i]=t2_a1_x[i];
14    }
15    print(H, "t2a1.out");
16    return 0;
17  }
According to Theorem 3.15, the first-order directional derivative of x as an input to the adjoint routine needs to range over the Cartesian basis vectors in $\mathbb{R}^n$ (line 11) to obtain the Hessian column by column. The first-order adjoint a1_y of the original output y is set to one for this purpose in line 10. The initialization of x in line 6 outside of the loop is feasible despite the fact that x is overwritten inside of t2_a1_f. The data flow reversal mechanism of dcc ensures that the values of variables are equal to their input values at the end of the reverse section of the adjoint code.
Hessian-vector products $\nabla^2 F \cdot v$ can be computed by a single run of the second-order adjoint code as shown in the following listing:
1  ...
2    t2_a1_y=0; zero(t2_a1_x); zero(a1_x);
3    init(t2_x);
4    a1_y=1;
5    t2_a1_f(1, n, x, t2_x, a1_x, t2_a1_x, y, t2_y, a1_y, t2_a1_y);
6  ...
7  }
The values of v are simply passed into t2_a1_f through t2_x by calling the subroutine init
in line 3 of the above listing. The Hessian-vector product is returned through t2_a1_x.
To summarize, there are three issues to be taken care of by the user of dcc.

1. The input code needs to satisfy the syntactic and semantic constraints imposed by dcc's front-end. See Appendix B for details. In particular, all subprograms must be provided in a single file.

2. The sizes of the stacks generated for the data and control flow reversal need to be adapted to the memory requirement of the adjoint code. Clashes between names generated by dcc (for required data and control stacks and for the associated counter variables) and names of variables present in the input program must be avoided.

3. Checkpoints need to be stored and restored correctly in interprocedural adjoint code. If a reapplication of dcc to the adjoint code is planned, then the checkpointing code also needs to satisfy the syntactic and semantic constraints imposed by dcc's front-end.
The second point requires knowledge of upper bounds for the number of executions of basic
blocks and for the number of overwrites performed on integer and floating-point variables,
respectively. A possible solution is the inspection of the respective stack counters during
a profiling run of the adjoint code. Setting the corresponding parameters exactly to these
values allows us to evaluate correct adjoints for the same inputs that were used for the
profiling runs. Different inputs may lead to different memory requirements in the presence
of nontrivial flow of control. Failure to allocate sufficient memory may result in incorrect
adjoints. In reality the original source code may have to be restructured to yield a feasible
memory requirement in fully joint call tree reversal mode.
5.8 Projects
The range of potential exercises that involve dcc is very large. Many of the exercises in the previous chapters can (in fact, should) be solved with the help of dcc. Readers are encouraged to use the compiler with their favorite solvers, for example, for systems of nonlinear equations or for nonlinear optimization. Many small- to medium-size problems can be implemented in the subset of C++ that is accepted by dcc. Combinations of overloading and source transformation tools for AD are typically employed to handle more complicated simulation code.

Ongoing developments by various groups aim to provide derivative code compilers that cover an extended set of C/C++ language features. Complete language coverage appears to be unlikely for the foreseeable future. Refer to the AD community's web portal www.autodiff.org for up-to-date information on available AD tools.
Appendix A
Derivative Code by Overloading
We present parts of the dco source code implementing the scalar tangent-linear (A.1), adjoint (A.2), second-order tangent-linear (A.3), and second-order adjoint (A.4) modes of AD. Listings are restricted to a few selected arithmetic operators and intrinsic functions. Extensions are reasonably straightforward. Refer to Sections 2.1.2 (for tangent-linear mode), 2.2.2 (for adjoint mode), 3.2.2 (for second-order tangent-linear mode), and 3.3.2 (for second-order adjoint mode) for explanation of the code.
A.1
#include <cmath>
using namespace std;
#include "dco_t1s_type.hpp"

dco_t1s_type::dco_t1s_type(const double& x) : v(x), t(0) {};
dco_t1s_type::dco_t1s_type() : v(0), t(0) {};

dco_t1s_type& dco_t1s_type::operator=(const dco_t1s_type& x) {
  if (this==&x) return *this;
  v=x.v; t=x.t;
  return *this;
}

dco_t1s_type operator*(const dco_t1s_type& x1, const dco_t1s_type& x2) {
  dco_t1s_type tmp;
  tmp.v=x1.v*x2.v;
  tmp.t=x1.t*x2.v+x1.v*x2.t;
  return tmp;
}

dco_t1s_type operator+(const dco_t1s_type& x1, const dco_t1s_type& x2) {
  dco_t1s_type tmp;
  tmp.v=x1.v+x2.v;
  tmp.t=x1.t+x2.t;
  return tmp;
}

dco_t1s_type operator-(const dco_t1s_type& x1, const dco_t1s_type& x2) {
  dco_t1s_type tmp;
  tmp.v=x1.v-x2.v;
  tmp.t=x1.t-x2.t;
  return tmp;
}

dco_t1s_type sin(const dco_t1s_type& x) {
  dco_t1s_type tmp;
  tmp.v=sin(x.v);
  tmp.t=cos(x.v)*x.t;
  return tmp;
}

dco_t1s_type cos(const dco_t1s_type& x) {
  dco_t1s_type tmp;
  tmp.v=cos(x.v);
  tmp.t=-sin(x.v)*x.t;
  return tmp;
}

dco_t1s_type exp(const dco_t1s_type& x) {
  dco_t1s_type tmp;
  tmp.v=exp(x.v);
  tmp.t=tmp.v*x.t;
  return tmp;
}
A.2

#define DCO_A1S_UNDEF -1
#define DCO_A1S_CONST 0
#define DCO_A1S_ASG 1
#define DCO_A1S_ADD 2
#define DCO_A1S_SUB 3
#define DCO_A1S_MUL 4
#define DCO_A1S_SIN 5
#define DCO_A1S_COS 6
#define DCO_A1S_EXP 7
class dco_a1s_tape_entry {
public:
  int oc;
  int arg1;
  int arg2;
  double v;
  double a;
  dco_a1s_tape_entry() : oc(DCO_A1S_UNDEF), arg1(DCO_A1S_UNDEF),
    arg2(DCO_A1S_UNDEF), v(0), a(0) {};
};

class dco_a1s_type {
public:
  int va;
  double v;
  dco_a1s_type() : va(DCO_A1S_UNDEF), v(0) {};
  dco_a1s_type(const double&);
  dco_a1s_type& operator=(const dco_a1s_type&);
};

dco_a1s_type operator*(const dco_a1s_type&, const dco_a1s_type&);
dco_a1s_type operator+(const dco_a1s_type&, const dco_a1s_type&);
dco_a1s_type operator-(const dco_a1s_type&, const dco_a1s_type&);
dco_a1s_type sin(const dco_a1s_type&);
dco_a1s_type cos(const dco_a1s_type&);
dco_a1s_type exp(const dco_a1s_type&);
void dco_a1s_print_tape();
void dco_a1s_interpret_tape();
void dco_a1s_reset_tape();
#endif
dco_a1s_type operator+(const dco_a1s_type& x1, const dco_a1s_type& x2) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_ADD;
  dco_a1s_tape[dco_a1s_vac].arg1=x1.va;
  dco_a1s_tape[dco_a1s_vac].arg2=x2.va;
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=x1.v+x2.v;
  tmp.va=dco_a1s_vac++;
  return tmp;
}
dco_a1s_type operator-(const dco_a1s_type& x1, const dco_a1s_type& x2) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_SUB;
  dco_a1s_tape[dco_a1s_vac].arg1=x1.va;
  dco_a1s_tape[dco_a1s_vac].arg2=x2.va;
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=x1.v-x2.v;
  tmp.va=dco_a1s_vac++;
  return tmp;
}

dco_a1s_type sin(const dco_a1s_type& x) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_SIN;
  dco_a1s_tape[dco_a1s_vac].arg1=x.va;
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=sin(x.v);
  tmp.va=dco_a1s_vac++;
  return tmp;
}

dco_a1s_type cos(const dco_a1s_type& x) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_COS;
  dco_a1s_tape[dco_a1s_vac].arg1=x.va;
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=cos(x.v);
  tmp.va=dco_a1s_vac++;
  return tmp;
}

dco_a1s_type exp(const dco_a1s_type& x) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_EXP;
  dco_a1s_tape[dco_a1s_vac].arg1=x.va;
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=exp(x.v);
  tmp.va=dco_a1s_vac++;
  return tmp;
}
void dco_a1s_print_tape() {
  cout << "tape:" << endl;
  for (int i=0; i<dco_a1s_vac; i++)
    ...
}

void dco_a1s_reset_tape() {
  for (int i=0; i<dco_a1s_vac; i++)
    dco_a1s_tape[i].a=0;
  dco_a1s_vac=0;
}
void dco_a1s_interpret_tape() {
  for (int i=dco_a1s_vac-1; i>=0; i--) {
    switch (dco_a1s_tape[i].oc) {
      case DCO_A1S_ASG : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_ADD : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=dco_a1s_tape[i].a;
        dco_a1s_tape[dco_a1s_tape[i].arg2].a+=dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_SUB : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=dco_a1s_tape[i].a;
        dco_a1s_tape[dco_a1s_tape[i].arg2].a-=dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_MUL : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=dco_a1s_tape[
          dco_a1s_tape[i].arg2].v*dco_a1s_tape[i].a;
        dco_a1s_tape[dco_a1s_tape[i].arg2].a+=dco_a1s_tape[
          dco_a1s_tape[i].arg1].v*dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_SIN : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=cos(dco_a1s_tape[
          dco_a1s_tape[i].arg1].v)*dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_COS : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a-=sin(dco_a1s_tape[
          dco_a1s_tape[i].arg1].v)*dco_a1s_tape[i].a;
        break;
      }
      case DCO_A1S_EXP : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=dco_a1s_tape[i].v
          *dco_a1s_tape[i].a;
        break;
      }
    }
  }
}
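A corresponding usage sketch for the tape-based adjoint type (assuming the global tape dco_a1s_tape and counter dco_a1s_vac declared in the implementation file, and that the double constructor registers a tape entry):

int main() {
  dco_a1s_reset_tape();
  dco_a1s_type x(2.0);
  dco_a1s_type y = sin(x) * x;           // every operation appends a tape entry
  dco_a1s_tape[y.va].a = 1;              // seed the adjoint of the output
  dco_a1s_interpret_tape();              // propagate adjoints in reverse order
  cout << dco_a1s_tape[x.va].a << endl;  // dy/dx = cos(2)*2 + sin(2)
  return 0;
}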
A.3
#ifndef DCO_T2S_T1S_INCLUDED_
#define DCO_T2S_T1S_INCLUDED_

#include "dco_t1s_type.hpp"

class dco_t2s_t1s_type {
public:
  dco_t1s_type v;
  dco_t1s_type t;
  dco_t2s_t1s_type(const double&);
  dco_t2s_t1s_type();
  dco_t2s_t1s_type& operator=(const dco_t2s_t1s_type&);
};

dco_t2s_t1s_type operator*(const dco_t2s_t1s_type&, const dco_t2s_t1s_type&);
dco_t2s_t1s_type operator+(const dco_t2s_t1s_type&, const dco_t2s_t1s_type&);
dco_t2s_t1s_type operator-(const dco_t2s_t1s_type&, const dco_t2s_t1s_type&);
dco_t2s_t1s_type sin(const dco_t2s_t1s_type&);
dco_t2s_t1s_type cos(const dco_t2s_t1s_type&);
dco_t2s_t1s_type exp(const dco_t2s_t1s_type&);
#endif
dco_t2s_t1s_type operator*(const dco_t2s_t1s_type& x1, const dco_t2s_t1s_type& x2) {
  dco_t2s_t1s_type tmp;
  tmp.v=x1.v*x2.v;
  tmp.t=x1.t*x2.v+x1.v*x2.t;
  return tmp;
}

dco_t2s_t1s_type operator+(const dco_t2s_t1s_type& x1, const dco_t2s_t1s_type& x2) {
  dco_t2s_t1s_type tmp;
  tmp.v=x1.v+x2.v;
  tmp.t=x1.t+x2.t;
  return tmp;
}

dco_t2s_t1s_type operator-(const dco_t2s_t1s_type& x1, const dco_t2s_t1s_type& x2) {
  dco_t2s_t1s_type tmp;
  tmp.v=x1.v-x2.v;
  tmp.t=x1.t-x2.t;
  return tmp;
}

dco_t2s_t1s_type sin(const dco_t2s_t1s_type& x) {
  dco_t2s_t1s_type tmp;
  tmp.v=sin(x.v);
  tmp.t=cos(x.v)*x.t;
  return tmp;
}

dco_t2s_t1s_type cos(const dco_t2s_t1s_type& x) {
  dco_t2s_t1s_type tmp;
  tmp.v=cos(x.v);
  tmp.t=0-sin(x.v)*x.t;
  return tmp;
}

dco_t2s_t1s_type exp(const dco_t2s_t1s_type& x) {
  dco_t2s_t1s_type tmp;
  tmp.v=exp(x.v);
  tmp.t=tmp.v*x.t;
  return tmp;
}
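By nesting the tangent-linear type in itself, a second derivative can be read off directly; a minimal sketch (assuming the public members shown in the class definition above):

#include <iostream>
#include <cmath>
using namespace std;
// assumes the dco_t2s_t1s_type implementation listed above

int main() {
  dco_t2s_t1s_type x(2.0);
  x.v.t = 1.0;  // seed the inner (t2) tangent direction
  x.t.v = 1.0;  // seed the outer (t1) tangent direction
  dco_t2s_t1s_type y = sin(x) * x;
  // expected: y.t.t = 2*cos(2.0) - 2.0*sin(2.0), the second derivative of x*sin(x)
  cout << y.t.t << endl;
  return 0;
}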
A.4
#ifndef DCO_T2S_A1S_INCLUDED_
#define DCO_T2S_A1S_INCLUDED_

#include "dco_t1s_type.hpp"

#define DCO_T2S_A1S_TAPE_SIZE 1000000

#define DCO_T2S_A1S_UNDEF -1
#define DCO_T2S_A1S_CONST 0
#define DCO_T2S_A1S_ASG 1
#define DCO_T2S_A1S_ADD 2
#define DCO_T2S_A1S_SUB 3
#define DCO_T2S_A1S_MUL 4
#define DCO_T2S_A1S_SIN 5
#define DCO_T2S_A1S_COS 6
#define DCO_T2S_A1S_EXP 7

class dco_t2s_a1s_tape_entry {
public:
  int oc;
  int arg1;
  int arg2;
  dco_t1s_type v;
  dco_t1s_type a;
  dco_t2s_a1s_tape_entry() : oc(0), arg1(DCO_T2S_A1S_UNDEF),
    arg2(DCO_T2S_A1S_UNDEF), v(0), a(0) {};
};

class dco_t2s_a1s_type {
public:
  int va;
  dco_t1s_type v;
  dco_t2s_a1s_type() : va(DCO_T2S_A1S_UNDEF), v(0) {};
  dco_t2s_a1s_type(const double&);
  dco_t2s_a1s_type& operator=(const dco_t2s_a1s_type&);
};

dco_t2s_a1s_type operator*(const dco_t2s_a1s_type&, const dco_t2s_a1s_type&);
dco_t2s_a1s_type operator+(const dco_t2s_a1s_type&, const dco_t2s_a1s_type&);
dco_t2s_a1s_type operator-(const dco_t2s_a1s_type&, const dco_t2s_a1s_type&);
dco_t2s_a1s_type sin(const dco_t2s_a1s_type&);
dco_t2s_a1s_type exp(const dco_t2s_a1s_type&);
void dco_t2s_a1s_print_tape();
void dco_t2s_a1s_interpret_tape();
void dco_t2s_a1s_reset_tape();
#endif
dco_t2s_a1s_type& dco_t2s_a1s_type::operator=(const dco_t2s_a1s_type& x) {
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].oc=DCO_T2S_A1S_ASG;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].arg1=x.va;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].v=v=x.v;
  va=dco_t2s_a1s_vac++;
  return *this;
}

dco_t2s_a1s_type operator*(const dco_t2s_a1s_type& x1, const dco_t2s_a1s_type& x2) {
  dco_t2s_a1s_type tmp;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].oc=DCO_T2S_A1S_MUL;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].arg1=x1.va;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].arg2=x2.va;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].v=tmp.v=x1.v*x2.v;
  tmp.va=dco_t2s_a1s_vac++;
  return tmp;
}

dco_t2s_a1s_type operator+(const dco_t2s_a1s_type& x1, const dco_t2s_a1s_type& x2) {
  dco_t2s_a1s_type tmp;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].oc=DCO_T2S_A1S_ADD;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].arg1=x1.va;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].arg2=x2.va;
  dco_t2s_a1s_tape[dco_t2s_a1s_vac].v=tmp.v=x1.v+x2.v;
  tmp.va=dco_t2s_a1s_vac++;
  return tmp;
}
void dco_t2s_a1s_reset_tape() {
  for (int i=0; i<dco_t2s_a1s_vac; i++)
    dco_t2s_a1s_tape[i].a=0;
  dco_t2s_a1s_vac=0;
}

void dco_t2s_a1s_interpret_tape() {
  for (int i=dco_t2s_a1s_vac-1; i>=0; i--) {
    switch (dco_t2s_a1s_tape[i].oc) {
      case DCO_T2S_A1S_ASG : {
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a+
          dco_t2s_a1s_tape[i].a;
        break;
      }
      case DCO_T2S_A1S_ADD : {
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a+
          dco_t2s_a1s_tape[i].a;
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a+
          dco_t2s_a1s_tape[i].a;
        break;
      }
      case DCO_T2S_A1S_SUB : {
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a+
          dco_t2s_a1s_tape[i].a;
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a-
          dco_t2s_a1s_tape[i].a;
        break;
      }
      case DCO_T2S_A1S_MUL : {
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a+
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].v*
          dco_t2s_a1s_tape[i].a;
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg2].a+
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].v*
          dco_t2s_a1s_tape[i].a;
        break;
      }
      ...
255
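The unary cases follow the same pattern; a minimal sketch of the SIN case, assuming the overloaded cos and operator* for dco_t1s_type, would be:

      case DCO_T2S_A1S_SIN : {
        // d sin(x)/dx = cos(x); since v and a are of type dco_t1s_type,
        // tangent components are propagated along with the adjoints
        dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a=
          dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].a+
          cos(dco_t2s_a1s_tape[dco_t2s_a1s_tape[i].arg1].v)*
          dco_t2s_a1s_tape[i].a;
        break;
      }

The COS and EXP cases are analogous with local partial derivatives -sin(x) and exp(x), respectively.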
Appendix B
This appendix contains a summary of the syntax accepted by version 0.9 of dcc. The
same information can be obtained by running flex and bison in their diagnostic modes
on the respective input files scanner.l and parser.y. Refer to Chapter 4 for further
explanation.
B.1 bison Grammar
code : sequence_of_global_declarations sequence_of_subroutines
sequence_of_subroutines : subroutine
  | sequence_of_subroutines subroutine
subroutine : VOID SYMBOL ( list_of_arguments ) {
    sequence_of_local_declarations sequence_of_statements }
list_of_arguments : argument
  | list_of_arguments , argument
sequence_of_asterixes : /* empty */
  | sequence_of_asterixes *
argument :
|
|
|
|
sequence_of_global_declarations : /* empty */
  | sequence_of_global_declarations global_declaration
sequence_of_local_declarations : /* empty */
  | sequence_of_local_declarations local_declaration
sequence_of_statements : statement
  | sequence_of_statements statement
statement : assignment
  | if_statement
  | while_statement
  | subroutine_call_statement
if_statement : IF ( condition ) { sequence_of_statements }
    else_branch
else_branch : /* empty */
  | ELSE { sequence_of_statements }
while_statement : WHILE ( condition ) {
    sequence_of_statements }
condition : memref_or_constant < memref_or_constant
  | memref_or_constant > memref_or_constant
  | memref_or_constant == memref_or_constant
  | memref_or_constant != memref_or_constant
  | memref_or_constant >= memref_or_constant
  | memref_or_constant <= memref_or_constant
subroutine_call_statement : SYMBOL ( list_of_args ) ;
assignment : memref = expression ;
expression : ( expression )
  | expression * expression
  | expression / expression
  | expression + expression
  | expression - expression
  | SIN ( expression )
  | COS ( expression )
  | EXP ( expression )
  | SQRT ( expression )
  | TAN ( expression )
  | ATAN ( expression )
  | LOG ( expression )
  | POW ( expression , SYMBOL )
  | memref
  | CONSTANT
list_of_args : memref_or_constant
  | memref_or_constant , list_of_args
memref_or_constant : memref
  | CONSTANT
array_index : SYMBOL
  | CONSTANT
memref : SYMBOL
  | array_reference
array_reference : SYMBOL array_access
array_access : [ array_index ]
array_access : [ array_index ] array_access
B.2 flex Grammar
Whitespace is ignored. Single-line comments are allowed, starting with //. Some integer and floating-point constants are supported. Variable names start with lowercase and/or uppercase letters followed by further letters, underscores, or digits.

int      0|[1-9][0-9]*
float    {int}"."[0-9]*
const    {int}|{float}
symbol   ([A-Z]|[a-z])(([A-Z]|[a-z])|_|{int})*

Supported keywords are the following: double, int, void, if, else, while, sin, cos, exp, sqrt, atan, tan, pow, and log.
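As an illustration, a small program that exercises most of these rules might read as follows (a sketch; the argument syntax is assumed to follow the C-style declarations used for dcc inputs elsewhere in this book):

void f(double *x, double *y)
{
  double t;
  int i;
  i=0;
  t=0;
  while (i<10) {
    t=t+sin(x[i])*x[i];
    i=i+1;
  }
  y[0]=exp(t);
}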
Appendix C
C.1 Chapter 1
C.1.1 Exercise 1.4.1
Write a C++ program that converts single precision floating-point variables into their bit representation (see Section 1.3). Investigate the effects of cancellation and rounding on the finite difference approximation of first and second derivatives of a set of functions of your choice.
The following function prints the binary representation of single as well as double and higher precision floating-point variables on little endian architectures:
#include <cstring>
#include <iostream>
using namespace std;

template<class T>
void to_bin(T v) {
  union {
    T value;
    unsigned char bytes[sizeof(T)];
  };
  memset(&bytes, 0, sizeof(T));
  value=v;
  // assumes little endian architecture
  for (size_t i=sizeof(T); i>0; i--) {
    unsigned char pot=128;
    for (int j=7; j>=0; j--, pot/=2)
      if (bytes[i-1]&pot)
        cout << "1";
      else
        cout << "0";
    cout << " ";
  }
  cout << endl;
}
It can be used to investigate numerical effects due to rounding and cancellation in finite
difference approximations of first and higher derivatives as in Section 1.3.
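For example, a driver along the following lines (a sketch; the test function and the range of perturbations are arbitrary choices, not from the book) exposes the loss of accuracy as the perturbation h shrinks:

#include <cmath>
int main() {
  // forward finite difference approximation of d/dx sin(x) at x=1;
  // the exact value is cos(1) = 0.540302...
  double x=1;
  for (double h=1e-1; h>1e-15; h/=10) {
    double dfdx=(sin(x+h)-sin(x))/h;
    cout << h << " : " << dfdx-cos(x) << endl;
    to_bin(dfdx); // inspect the bit pattern of the approximation
  }
  return 0;
}

The error first decreases with h and then grows again once cancellation in sin(x+h)-sin(x) dominates.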
C.1.2 Exercise 1.4.2
Apply Algorithm 1.1 to approximate a solution y = y(x0 , x1 ) of the discrete SFI problem
introduced in Example 1.2.
1. Approximate the Jacobian of the residual r = F (y) by finite differences. Write exact
derivative code based on (1.5) for comparison.
2. Use finite differences to approximate the product of the Jacobian with a vector within
a matrix-free implementation of the Newton algorithm based on Algorithm 1.4.
3. Repeat the above for further problems from the MINPACK-2 test problem collection [5], for example, for the Flow in a Channel and Elastic Plastic Torsion problems.
Listing C.1 shows an implementation of Newton's algorithm for solving the system of nonlinear equations that implements the SFI problem over the unit square for an equidistant finite difference discretization with step size h = 1/s. Refer to Section 1.1.1 for details. The discrete two-dimensional domain is flattened by storing the rows of the matrix Y = (y_{i,j}) of the nonlinear system

  -4 y_{i,j} + y_{i+1,j} + y_{i-1,j} + y_{i,j+1} + y_{i,j-1} = -h^2 e^{y_{i,j}}

for i, j = 1, ..., s-1 consecutively in a vector y in R^{(s-1)(s-1)} as shown in the following code listing.
void f(int s, double *y, double *r) {
  double h=0;
  double left=0; double right=0; double up=0; double down=0.;
  double value_ij=0; double dyy=0; double drr=0;
  int i=0; int j=0; int smo=0; int idx_ij=0; int k=0;
  i=1; j=1;
  h=1.0/s;
  while (i<s) {
    j=1;
    while (j<s) {
      idx_ij=(i-1)*(s-1)+j-1;
      value_ij=y[idx_ij];
      up=0; down=0; left=0; right=0;
      smo=s-1;
      if (i!=1) { k=idx_ij-(s-1); down=y[k]; }
      if (i!=smo) { k=idx_ij+s-1; up=y[k]; }
      if (j!=1) { k=idx_ij-1; left=y[k]; }
      if (j!=smo) { k=idx_ij+1; right=y[k]; }
      dyy=(right-2*value_ij+left);
      drr=(up-2*value_ij+down);
      r[idx_ij]=dyy+drr+h*h*exp(value_ij);
      j=j+1;
    }
    i=i+1;
  }
}
Both Listing C.1 and Listing C.2 assume that f returns the negative residual that is required by Newton's algorithm, for a constant parameter lambda = 1. The given implementation is accepted as input by dcc. An implementation of Newton's algorithm is shown in Listing C.1.
Listing C.1. Newton's algorithm.
 1  int main() {
 2    const int s=50;
 3    const int n=(s-1)*(s-1);
 4    double y[n], delta_y[n], r[n], *J[n];
 5    for (int i=0;i<n;i++) J[i]=new double[n];
 6    for (int i=0;i<n;i++) y[i]=0;
 7    do {
 8      df(s,y,r,J);
 9      Factorize(n,J);
10      double z[n];
11      FSubstitute(n,J,r,z);
12      BSubstitute(n,J,z,delta_y);
13      for (int i=0;i<n;i++) y[i]=y[i]+delta_y[i];
14      f(s,y,r);
15    } while (norm(n,r)>1e-9);
16    plot_solution(s,y);
17    for (int i=0;i<n;i++) delete [] J[i];
18    return 0;
19  }
At the beginning of each Newton iteration the Jacobian J of the residual r at the current point y is approximated using finite differences by the function df as shown in Listing C.2. A direct linear solver (for example, Cholesky) is used to decompose J into a lower and an upper triangular factor. Both factors overwrite the memory allocated for J. The Newton step delta_y is computed by forward and backward substitution in lines 11 and 12, and it is used to update the current point y. A reevaluation of the residual r at the new point is performed in line 14, followed by checking the convergence criterion, which is defined as the Euclidean norm of r falling below 10^{-9}. A converged solution is written by the function plot_solution into a file whose format is suitable for visualization with gnuplot. Figure C.1 shows a corresponding plot.
Listing C.2. Jacobian by forward finite differences.
void df(int s, double *y, double *r, double **J) {
  const double h=1e-8;
  int n=(s-1)*(s-1);
  double *y_ph=new double[n];
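A completion of df consistent with the negative finite difference quotient used in Listing C.4 might read as follows (a sketch, not verbatim):

  double *r_ph=new double[n];
  f(s,y,r); // (negative) residual at the current point
  for (int j=0;j<n;j++) {
    for (int i=0;i<n;i++) y_ph[i]=y[i];
    y_ph[j]+=h; // perturb the jth component of y
    f(s,y_ph,r_ph);
    // negative forward finite difference quotient, since f returns -F(y)
    for (int i=0;i<n;i++) J[i][j]=(r[i]-r_ph[i])/h;
  }
  delete [] r_ph;
  delete [] y_ph;
}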
[Figure C.1: plot of the converged solution of the SFI problem for s = 50, produced with gnuplot; values range from 0 to about 0.08 over the interior grid.]
For the given problem specification, finite differences turn out to be rather robust with respect to changes in the size of the perturbation. The fact that f returns the negative residual is taken into account by computing the negative finite difference quotient.
The manual derivation of derivative code for the SFI problem is reasonably straightforward. In particular, there is a good check for correctness in the form of the finite difference code. This part of the exercise is meant to illustrate the disadvantages of hand-written derivative code in terms of development and debugging effort.
A matrix-free implementation of the Conjugate Gradient algorithm (Algorithm 1.4)
for solving the Newton system is shown in Listing C.3. It returns the residual r and an
approximation of the Newton step delta_y with accuracy eps at the current point y for given
s and n.
Listing C.3. Matrix-free CG solver for Newton system.
 1  void cg(double eps, int s, int n, double *y, double *r, double *delta_y) {
 2    double *t=new double[n];
 3    double *p=new double[n];
 4    double *Jv=new double[n];
 5    df(s,y,r,delta_y,Jv);
 6    for (int i=0;i<n;i++) p[i]=t[i]=r[i]-Jv[i];
 7    double normt=norm(n,t);
 8    while (normt>eps) {
 9      df(s,y,r,p,Jv);
10      double tTt=xTy(n,t,t);
11      double alpha=tTt/xTy(n,p,Jv);
12      axpy(n,alpha,p,delta_y,delta_y);
13      axpy(n,-alpha,Jv,t,t);
14      double beta=xTy(n,t,t)/tTt;
15      axpy(n,beta,p,t,p);
16      normt=norm(n,t);
17    }
18    delete [] Jv;
19    delete [] p;
20    delete [] t;
21  }
Standard inner product and axpy operations are implemented by the functions xTy and axpy, respectively. The Euclidean norm of a vector is returned by the function norm. Ideally, optimized native implementations of the Basic Linear Algebra Subprograms (BLAS) [12] should be used.
Required projections of the Jacobian in directions delta_y and p are approximated by the function df called in lines 5 and 9. A corresponding implementation of df is shown in Listing C.4. There, the current point y is perturbed in direction v in line 7. Refer to [44] for a discussion of the scaling that may be required for numerically less stable problems. Again, f is assumed to return the negative residual, yielding the finite difference quotient in line 9.
Listing C.4. Projection of Jacobian by forward finite differences.
 1  void df(int s, double *y, double *r, double *v, double *Jv) {
 2    const double h=1e-8;
 3    int n=(s-1)*(s-1);
 4    double *y_ph=new double[n];
 5    double *r_ph=new double[n];
 6    f(s,y,r);
 7    for (int i=0;i<n;i++) y_ph[i]=y[i]+v[i]*h;
 8    f(s,y_ph,r_ph);
 9    for (int i=0;i<n;i++) Jv[i]=(r[i]-r_ph[i])/h;
10    delete [] r_ph;
11    delete [] y_ph;
12  }
The Newton algorithm initializes each call of cg with the Newton step computed in the previous iteration. Initially, delta_y and y are set to zero in line 5 of Listing C.5. The main loop approximates the Newton step in line 7, followed by updating the current point in line 8. Convergence is again defined as the norm of the residual reaching some upper bound. Successful termination can be observed for both the accuracy of the CG solution and the accuracy of the Newton iteration itself set to 10^{-6}. Interested readers are encouraged to investigate the behavior of the inexact Newton algorithm for increased target accuracies.
Listing C.5. Matrix-free Newton-CG algorithm.
 1  int main() {
 2    const int s=50;
 3    const int n=(s-1)*(s-1);
 4    double y[n], delta_y[n], r[n];
 5    for (int i=0;i<n;i++) y[i]=delta_y[i]=0;
 6    do {
 7      cg(1e-6,s,n,y,r,delta_y);
 8      for (int i=0;i<n;i++) y[i]=y[i]+delta_y[i];
 9      f(s,y,r);
10    } while (norm(n,r)>1e-6);
11    plot_solution(s,y);
12    return 0;
13  }
For s = 50 the solution of the SFI problem takes more than 11 seconds on our reference architecture when using the standard Newton algorithm with a direct linear solver. This run time grows rapidly, reaching more than 12 minutes for s = 100. Convergence is defined as the norm of the residual falling below 10^{-6}. The code is compiled at the highest optimization level. Less than one second is required by the matrix-free Newton-CG algorithm for s = 100. Its run time increases to 2.3 seconds for s = 200 and to 8.5 seconds for s = 300. The standard Newton algorithm fails to allocate enough memory for s >= 200. Exploitation of sparsity is crucial in this case. A compressed Jacobian can be computed at significantly lower cost as described in Section 2.1.3. Moreover, sparse direct linear solvers are likely to decrease the memory requirement substantially.
The software infrastructure developed in this section should be applied to further MINPACK-2 test problems. Some functions may require additional parameters that can either be fixed (as done in the SFI problem for the parameter lambda) or passed through slightly modified interfaces for f, df, and cg. (Matrix-free) preconditioning is likely to become an issue when considering less well-conditioned problems.
C.1.3 Exercise 1.4.3
Apply the steepest descent and Newton algorithms to an extended version of the Rosenbrock function [54], which is defined as

  y = f(x) = \sum_{i=0}^{n-2} \left( (1 - x_i)^2 + 10 (x_{i+1} - x_i^2)^2 \right)    (C.1)

for n = 10, 100, 1000 and for varying starting values of your choice. The function has a global minimum at x_i = 1 for i = 0, ..., n-1, where f(x) = 0. Approximate the required derivatives by finite differences. Observe the behavior (development of function values and L2-norm of the gradient; run time) of the algorithms for varying values of the perturbation size. Use (1.5) to derive (hand-written) exact derivatives for comparison.
The implementation of the Steepest Descent algorithm shown in Listing C.6 uses iterative bisection as a local line search technique to determine the step size such that a strict decrease in the objective function value is obtained.
Listing C.6. Steepest descent algorithm.
 1  int main() {
 2    const int n=10;
 3    double x[n], y, g[n], alpha;
 4    const double eps=1e-5;
 5    for (int i=0;i<n;i++) x[i]=0;
 6    df(n,x,y,g);
 7    do {
 8      alpha=1;
 9      double x_t[n], y_t=y;
10      while (y_t>=y) {
11        for (int i=0;i<n;i++) x_t[i]=x[i]-alpha*g[i];
12        f(n,x_t,y_t);
13        alpha/=2;
14      }
15      for (int i=0;i<n;i++) x[i]=x_t[i];
16      df(n,x,y,g);
17    } while (norm(n,g)>eps);
18    cout << "y=" << y << endl;
19    for (int i=0;i<n;i++)
20      cout << "x[" << i << "]=" << x[i] << endl;
21    return 0;
22  }
The gradient is provided by calls to df in lines 6 and 16, followed by the local line search in lines 7-14 and, in case of termination of the line search, the update of the current point in line 15. Convergence is defined as the Euclidean norm of the gradient reaching or falling below 10^{-5}.
Starting from the origin for n = 10, a total of 1580 major and 11584 minor (local line search) iterations are performed to approximate the minimum f(x) = 0 at x_i = 1 for i = 0, ..., n-1. The gradient is approximated by finite differences as shown in Listing C.7.
Listing C.7. Gradient by forward finite differences.
void df(int n, double *x, double &y, double *g) {
  const double h=1e-8;
  double *x_ph=new double[n], y_ph;
  f(n,x,y);
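The rest of the listing is a straightforward loop over the components of x; a completion might read as follows (a sketch, not verbatim):

  for (int j=0;j<n;j++) {
    for (int i=0;i<n;i++) x_ph[i]=x[i];
    x_ph[j]+=h; // perturb the jth component
    f(n,x_ph,y_ph);
    g[j]=(y_ph-y)/h; // forward finite difference quotient
  }
  delete [] x_ph;
}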
The function df also returns the current objective function value y required by the local line search.
Steepest Descent typically requires a large number of iterations in order to reach a satisfactory level of accuracy. The corresponding run time may become infeasible very quickly for large values of n. For n = 100 the 1609 major and 11877 minor iterations still take less than one second on our reference architecture. While even fewer major (1604) and minor (11837) iterations are required for n = 1000, the run time increases to nearly twelve seconds due to the higher cost of the gradient approximation.
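For the comparison with hand-written exact derivatives asked for in the exercise, the gradient of (C.1) can be accumulated term by term; a sketch (the name g_exact is not from the book):

void g_exact(int n, double *x, double &y, double *g) {
  y=0;
  for (int i=0;i<n;i++) g[i]=0;
  for (int i=0;i<n-1;i++) {
    double t1=1-x[i];
    double t2=x[i+1]-x[i]*x[i];
    y+=t1*t1+10*t2*t2;
    g[i]+=-2*t1-40*x[i]*t2; // contribution of both terms to d/dx_i
    g[i+1]+=20*t2;          // contribution of the quadratic term to d/dx_{i+1}
  }
}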
An implementation of Newton's algorithm for the minimization of the extended Rosenbrock function is shown in Listing C.8.
Listing C.8. Newton's algorithm.
int main() {
  const int n=10;
  double x[n], delta_x[n], g[n];
  double **H=new double*[n];
  for (int i=0;i<n;i++) H[i]=new double[n];
  double y;
  const double eps=1e-9;
  for (int i=0;i<n;i++) x[i]=0;
  int niters=0;
  h_g_f_cfd(n,x,y,g,H);
  if (norm(n,g)>eps)
    do {
      Factorize(n,H);
      for (int i=0;i<n;i++) g[i]=-g[i];
      double z[n];
      FSubstitute(n,H,g,z);
      BSubstitute(n,H,z,delta_x);
      for (int i=0;i<n;i++) x[i]=x[i]+delta_x[i];
      h_g_f_cfd(n,x,y,g,H);
    } while (norm(n,g)>eps);
  cout << "y=" << y << endl;
  for (int i=0;i<n;i++)
    cout << "x[" << i << "]=" << x[i] << endl;
  for (int i=0;i<n;i++) delete [] H[i];
  delete [] H;
  return 0;
}
At the beginning of each Newton iteration the gradient g and the Hessian H of the objective y at the current point x are approximated using finite differences by the function h_g_f_cfd as shown in Section 1.3. Cholesky decomposition is used to factorize H into a lower and an upper triangular factor. Both factors overwrite the memory allocated for H. The Newton step delta_x is computed by forward and backward substitution, and it is used to update the current point x. Convergence is defined as the Euclidean norm of g falling below 10^{-9}. A converged solution is written to the screen.
While the given configuration converges, the Hessian approximation becomes indefinite very quickly, resulting in a failure to compute the Cholesky factorization. Try n = 11 to observe this behavior.
C.1.4 Exercise 1.4.4
Use manual differentiation and finite differences with your favorite solver for
1. systems of nonlinear equations to find a numerical solution of the SFI problem introduced in Section 1.4.2; repeat for further MINPACK-2 test problems;
2. nonlinear programming to minimize the Rosenbrock function; repeat for the other two test problems from Section 1.4.3.
Refer to Section C.2.4 for case studies that illustrate the use of derivative code with the NAG Library. Adaptation to finite difference or manually derived code is straightforward.
C.2 Chapter 2
C.2.1 Exercise 2.4.1
1. Use the tangent-linear code to compute the Jacobian of the dependent outputs x and y with respect to the independent input x. Use central finite differences for verification.
Seeding the tangent-linear code with the Cartesian basis vectors in R^3 yields

F'(x) =
  [ -0.17109    -0.20204     0        ]
  [  0.18112    -0.213896    0        ]
  [  0           0.24927    -0.399798 ]
  [ -0.100604    0.11881     0.381113 ]
The first row contains the partial derivatives of y with respect to x. Closer inspection of the computation reveals that its third entry should vanish identically in infinite precision arithmetic. We leave the bit-level exploration of the numerical effects caused by the data-flow dependence of y on x[2] to the reader; see also Section C.1.1.
Qualitatively, the results obtained from the tangent-linear code can be verified by the following central finite difference approximation:
int main() {
  const int n=3;
  const double h=1e-8;
  double xph[n], yph;
  double xmh[n], ymh;
  for (int i=0;i<n;i++) {
    for (int j=0;j<n;j++) xmh[j]=xph[j]=2+cos(j);
    xph[i]+=h;
    f(n,xph,yph);
    xmh[i]-=h;
    f(n,xmh,ymh);
    cout << (yph-ymh)/(2*h) << endl;
    for (int j=0;j<n;j++)
      cout << (xph[j]-xmh[j])/(2*h) << endl;
    cout << endl;
  }
  return 0;
}
The third entry in the first row of the Jacobian matrix is approximated as 2.77556e-08. Truncation amplifies the previously observed numerical effects even further.
2. Write adjoint code for

void g(int n, double *x, double& y) {
  double l;
  int i=0;
  y=0;
  while (i<n) {
    l=x[i];
    y+=x[i]*l;
    i=i+1;
  }
}
and use it for the computation of the gradient of the dependent output y with respect
to the independent input x. Apply backward finite differences for verification.
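A minimal hand-written adjoint of g, following the Adjoint Code Generation Rules from Section 2.2.1, might read as follows (a sketch):

void a1_g(int n, double *x, double *a1_x, double &y, double &a1_y) {
  // forward section
  int i=0;
  y=0;
  while (i<n) {
    y+=x[i]*x[i];
    i=i+1;
  }
  // reverse section: y is a plain accumulation, so each input receives
  // a1_x[i] += d(x[i]*x[i])/dx[i] * a1_y = 2*x[i]*a1_y
  i=n-1;
  while (i>=0) {
    a1_x[i]+=2*x[i]*a1_y;
    i=i-1;
  }
  a1_y=0; // y is overwritten by the forward section
}

Seeding a1_y = 1 and reading a1_x after the call yields the gradient, which the following driver verifies by backward finite differences: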
int main() {
  const int n=3;
  const double h=1e-8;
  double x[n], y, xmh[n], ymh;
  for (int i=0;i<n;i++) xmh[i]=x[i]=1./(1.+i);
  g(n,x,y);
  for (int i=0;i<n;i++) {
    xmh[i]-=h;
    g(n,xmh,ymh);
    xmh[i]+=h;
    cout << (y-ymh)/h << endl;
  }
  return 0;
}
3. Write adjoint code (split mode) for the example code in Listing C.9. Use the adjoint code to accumulate the gradient of the dependent output y with respect to the independent input x. Ensure that the correct function values are returned in addition to the gradient.
The given code is regarded as a function f : R^n -> R, y = f(x). An implementation of the corresponding adjoint model in split call tree reversal mode is the following:
stack<double> fds;
void a1_g(int a1_mode, int n,
    double *x, double *a1_x, double &y, double &a1_y)
{
  if (a1_mode==1) { // augmented forward section
    y=1.0;
    for (int i=0;i<n;i++) {
      fds.push(y);
      y*=x[i]*x[i];
    }
  }
  else { // reverse section
    for (int i=n-1;i>=0;i--) {
      y=fds.top(); fds.pop();
      a1_x[i]+=2*x[i]*y*a1_y;
      a1_y=x[i]*x[i]*a1_y;
    }
    a1_y=0;
  }
}
void a1_f(int n, double *x, double *a1_x,
    double &y, double &a1_y) {
  // augmented forward section
  for (int i=0;i<n;i++) {
    fds.push(x[i]);
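    // (the remainder of this split-mode listing is reconstructed here as a
    //  sketch; it mirrors the joint-mode a1_f shown below)
    x[i]=sqrt(x[i]/x[(i+1)%n]);
  }
  a1_g(1,n,x,a1_x,y,a1_y); // augmented forward section of g
  fds.push(y);
  y=cos(y);
  // reverse section
  y=fds.top(); fds.pop();
  a1_y=-sin(y)*a1_y;
  a1_g(2,n,x,a1_x,y,a1_y); // reverse section of g
  for (int i=n-1;i>=0;i--) {
    x[i]=fds.top(); fds.pop();
    double v=x[i]/x[(i+1)%n];
    double a1_v=a1_x[i]/(2*sqrt(v)); a1_x[i]=0;
    a1_x[i]+=a1_v/x[(i+1)%n];
    a1_x[(i+1)%n]-=a1_v*x[i]/(x[(i+1)%n]*x[(i+1)%n]);
  }
}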
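4. For joint call tree reversal mode, the adjoint of g additionally checkpoints its arguments on a stack arg_cp_g. The opening of the corresponding joint-mode a1_g, reconstructed here by analogy with Section C.2.5, presumably reads:

stack<double> arg_cp_g;
void a1_g(int a1_mode, int n, double *x, double *a1_x,
    double &y, double &a1_y) {
  if (a1_mode==1) { // joint augmented forward ...
    y=1.0;
    for (int i=0;i<n;i++) {
      fds.push(y);
      y*=x[i]*x[i];
    }
    // ... and reverse sections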
    for (int i=n-1;i>=0;i--) {
      y=fds.top(); fds.pop();
      a1_x[i]+=2*x[i]*y*a1_y;
      a1_y=x[i]*x[i]*a1_y;
    }
    a1_y=0;
  }
  else if (a1_mode==3) { // store inputs
    for (int i=0;i<n;i++) arg_cp_g.push(x[i]);
  }
  else if (a1_mode==4) { // restore inputs
    for (int i=n-1;i>=0;i--) {
      x[i]=arg_cp_g.top(); arg_cp_g.pop();
    }
  }
}
void a1_f(int n, double *x, double *a1_x,
    double &y, double &a1_y) {
  // augmented forward section
  for (int i=0;i<n;i++) {
    fds.push(x[i]);
    x[i]=sqrt(x[i]/x[(i+1)%n]);
  }
  a1_g(3,n,x,a1_x,y,a1_y);
  g(n,x,y);
  fds.push(y);
  y=cos(y);
  // reverse section
  y=fds.top(); fds.pop();
  a1_y=-sin(y)*a1_y;
  a1_g(4,n,x,a1_x,y,a1_y);
  a1_g(1,n,x,a1_x,y,a1_y);
  for (int i=n-1;i>=0;i--) {
    x[i]=fds.top(); fds.pop();
    double v=x[i]/x[(i+1)%n];
    double a1_v=a1_x[i]/(2*sqrt(v)); a1_x[i]=0;
    a1_x[i]+=a1_v/x[(i+1)%n];
    a1_x[(i+1)%n]-=a1_v*x[i]/(x[(i+1)%n]*x[(i+1)%n]);
  }
}
5. Use the adjoint code developed under 3 and 4 to compute the gradient of the dependent output x[0] with respect to the independent input x. Optimize the adjoint code by eliminating obsolete (dead) statements.
If the given code is regarded as a function f : R^n -> R, x_0 = f(x), then many statements in the original adjoint code become obsolete. The only active assignment is

x[0]=sqrt(x[0]/x[1%n]);

yielding the gradient grad f(x) = (0.18112, -0.213896, 0)^T at x = (3, 2.5403, 1.58385)^T. It is equal to the second row of the Jacobian computed under 1.
C.2.2 Exercise 2.4.2
Consider an implementation of the discrete residual r = F(y) for the SFI problem introduced in Example 1.2.
1. Implement the tangent-linear model r^(1) = < grad F(y), y^(1) > by writing a tangent-linear code by hand and use it to accumulate grad F(y) with machine accuracy. Verify the numerical results with finite differences.
We consider a flattened version of the residual of the SFI problem similar to the solution of Exercise 1.4.2.
void f(int s, double *y, double l, double *r) {
  double left, right, up, down, dyy, drr;
  int idx_ij=0;
  for (int i=1;i<s;i++)
    for (int j=1;j<s;j++) {
      idx_ij=(i-1)*(s-1)+j-1;
      up=0; down=0; left=0; right=0;
      if (i!=1) down=y[idx_ij-(s-1)];
      if (i!=s-1) up=y[idx_ij+s-1];
      if (j!=1) left=y[idx_ij-1];
      if (j!=s-1) right=y[idx_ij+1];
      dyy=right-2*y[idx_ij]+left;
      drr=up-2*y[idx_ij]+down;
      r[idx_ij]=-dyy-drr-l*exp(y[idx_ij])/(s*s);
    }
}
void t1_f(int s, double *y, double *t1_y, double l,
    double *r, double *t1_r) {
  double left, right, up, down, dyy, drr;
  double t1_left, t1_right, t1_up, t1_down, t1_dyy, t1_drr;
  int idx_ij=0;
  for (int i=1;i<s;i++)
    for (int j=1;j<s;j++) {
      idx_ij=(i-1)*(s-1)+j-1;
      t1_up=0; t1_down=0; t1_left=0; t1_right=0;
      up=0; down=0; left=0; right=0;
      if (i!=1) {
        t1_down=t1_y[idx_ij-(s-1)];
        down=y[idx_ij-(s-1)];
      }
      if (i!=s-1) {
        t1_up=t1_y[idx_ij+s-1];
        up=y[idx_ij+s-1];
      }
      if (j!=1) {
        t1_left=t1_y[idx_ij-1];
        left=y[idx_ij-1];
      }
      if (j!=s-1) {
        t1_right=t1_y[idx_ij+1];
        right=y[idx_ij+1];
      }
      t1_dyy=t1_right-2*t1_y[idx_ij]+t1_left;
      dyy=right-2*y[idx_ij]+left;
      t1_drr=t1_up-2*t1_y[idx_ij]+t1_down;
      drr=up-2*y[idx_ij]+down;
      t1_r[idx_ij]=-t1_dyy-t1_drr
        -t1_y[idx_ij]*l*exp(y[idx_ij])/(s*s);
      r[idx_ij]=-dyy-drr-l*exp(y[idx_ij])/(s*s);
    }
}
Sparsity is not taken into account; it should be exploited. Correctness of the tangent-linear code is easily verified by finite differences.
2. Implement the adjoint model y_(1) = y_(1) + < grad F(y)^T, r_(1) > by writing an adjoint code by hand and use it to accumulate grad F(y) with machine accuracy. Compare the numerical results with those obtained by the tangent-linear approach.
An adjoint version of the same implementation as considered under 1 is the following:
void a1_f(int s, double *y, double *a1_y, double l,
    double *r, double *a1_r) {
  double left, right, up, down, dyy, drr;
  double a1_left=0, a1_right=0, a1_up=0;
  double a1_down=0, a1_dyy=0, a1_drr=0;
  int idx_ij=0;
  // augmented forward section
  for (int i=1;i<s;i++)
    for (int j=1;j<s;j++) {
      idx_ij=(i-1)*(s-1)+j-1;
      up=0; down=0; left=0; right=0;
      if (i!=1) down=y[idx_ij-(s-1)];
      if (i!=s-1) up=y[idx_ij+s-1];
      if (j!=1) left=y[idx_ij-1];
      if (j!=s-1) right=y[idx_ij+1];
      dyy=right-2*y[idx_ij]+left;
      drr=up-2*y[idx_ij]+down;
      r[idx_ij]=-dyy-drr-l*exp(y[idx_ij])/(s*s);
    }
  // reverse section
  for (int i=s-1;i>0;i--)
    for (int j=s-1;j>0;j--) {
      idx_ij=(i-1)*(s-1)+j-1;
      a1_dyy=-a1_r[idx_ij];
      a1_drr=-a1_r[idx_ij];
      a1_y[idx_ij]-=l*exp(y[idx_ij])/(s*s)*a1_r[idx_ij];
      a1_r[idx_ij]=0;
      a1_up+=a1_drr;
      a1_y[idx_ij]-=2*a1_drr;
      a1_down+=a1_drr;
      a1_drr=0;
      a1_right+=a1_dyy;
      a1_y[idx_ij]-=2*a1_dyy;
      a1_left+=a1_dyy;
      a1_dyy=0;
      if (j!=s-1) { a1_y[idx_ij+1]+=a1_right; a1_right=0; }
      if (j!=1) { a1_y[idx_ij-1]+=a1_left; a1_left=0; }
      if (i!=s-1) { a1_y[idx_ij+s-1]+=a1_up; a1_up=0; }
      if (i!=1) { a1_y[idx_ij-(s-1)]+=a1_down; a1_down=0; }
      a1_up=0; a1_down=0; a1_left=0; a1_right=0;
    }
}
Again, sparsity is not taken into account. The numerical results are equal to those computed by the tangent-linear code.
3. Use dco to implement the tangent-linear and adjoint models.
A tangent-linear version of the given implementation of the SFI problem that uses dco is the following:
#include "dco_t1s_type.hpp"
void f(int s, dco_t1s_type *y, double l, dco_t1s_type *r) {
  dco_t1s_type left, right, up, down, dyy, drr;
  ...
}
int main() {
  const int s=5;
  const int n=(s-1)*(s-1);
  const double l=1;
  dco_t1s_type y[n], r[n];
  for (int i=0;i<n;i++) {
    for (int j=0;j<n;j++) y[j]=r[j]=0;
    y[i].t=1;
    f(s,y,l,r);
    for (int j=0;j<n;j++) cout << r[j].t << " ";
    cout << endl;
  }
  return 0;
}
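An adjoint counterpart can be built by analogy with the dco_a1s driver shown in Section C.2.3 below; a sketch (assuming a dco_a1s_reset_tape with the semantics of the dco_t2s_a1s_reset_tape listed at the beginning of this appendix):

#include "dco_a1s_type.hpp"
extern dco_a1s_tape_entry dco_a1s_tape[DCO_A1S_TAPE_SIZE];
void f(int s, dco_a1s_type *y, double l, dco_a1s_type *r) {
  dco_a1s_type left, right, up, down, dyy, drr;
  ...
}
int main() {
  const int s=5;
  const int n=(s-1)*(s-1);
  const double l=1;
  dco_a1s_type y[n], r[n];
  for (int i=0;i<n;i++) {
    dco_a1s_reset_tape(); // assumed: reset adjoints and the tape counter
    for (int j=0;j<n;j++) y[j]=r[j]=0;
    f(s,y,l,r);
    dco_a1s_tape[r[i].va].a=1; // seed the ith row of the Jacobian
    dco_a1s_interpret_tape();
    for (int j=0;j<n;j++)
      cout << dco_a1s_tape[y[j].va].a << " ";
    cout << endl;
  }
  return 0;
}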
C.2.3 Exercise 2.4.3
The returned gradient is equal to the gradient computed by seeding the tangent-linear code with the identity in R^n.
3. Use dco to implement the tangent-linear and adjoint models.
Again, the types of all active floating-point variables in f are switched to dco_t1s_type to obtain a tangent-linear code based on dco.
#include "dco_t1s_type.hpp"
void f(int n, dco_t1s_type *x, dco_t1s_type &y) {
  dco_t1s_type t;
  ...
}
int main() {
  const int n=3;
  dco_t1s_type x[n], y;
  for (int i=0;i<n;i++) {
    for (int j=0;j<n;j++) x[j]=2+cos(j);
    x[i].t=1;
    f(n,x,y);
    cout << y.t << endl;
  }
  return 0;
}

The main routine seeds the tangent-linear components x[i].t of all inputs with the Cartesian basis vectors in R^n. The entries of the gradient are extracted from the tangent-linear component y.t of the output. Similarly, an adjoint version that uses dco is the following:
#include "dco_a1s_type.hpp"
extern dco_a1s_tape_entry dco_a1s_tape[DCO_A1S_TAPE_SIZE];
void f(int n, dco_a1s_type *x, dco_a1s_type &y) {
  dco_a1s_type t;
  ...
}
int main() {
  const int n=3;
  dco_a1s_type x[n], y;
  for (int j=0;j<n;j++) x[j]=2+cos(j);
  f(n,x,y);
  dco_a1s_tape[y.va].a=1;
  dco_a1s_interpret_tape();
  for (int j=0;j<n;j++)
    cout << dco_a1s_tape[x[j].va].a << endl;
  return 0;
}
C.2.4 Exercise 2.4.4
We use the NAG Library as a case study for a wide range of numerical libraries. Examples from the online documentation of the library are used for easier cross-reference. Tangent-linear and adjoint code is generated by dcc.
1. Use the tangent-linear model with your favorite solver for systems of nonlinear equations to find a numerical solution of the SFI problem; repeat for further MINPACK-2 test problems.
To demonstrate the use of the nonlinear equations solver from the NAG Library (c05ubc), an example is used that computes the values x_0, ..., x_8 of the tridiagonal system F(x) = 0, where the residual F : R^9 -> R^9, y = F(x), is implemented as

  y_0 = (3 - 2 x_0) x_0 - 2 x_1 + 1,
  y_i = -x_{i-1} + (3 - 2 x_i) x_i - 2 x_{i+1} + 1,
  y_8 = -x_7 + (3 - 2 x_8) x_8 + 1

for i = 1, ..., 7. A corresponding C++ implementation that is accepted by dcc is the following:
void f(int n, double *x, double *y) {
  int k=0;
  int nm1=0;
  int km1=0;
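The body of the listing continues with the three cases of the tridiagonal residual; a completion consistent with the equations above might read as follows (a sketch in the loop style dcc accepts, updating y[k-1] incrementally so that array indices remain plain symbols):

  nm1=n-1;
  y[0]=(3-2*x[0])*x[0]+1;
  k=1;
  while (k<=nm1) {
    km1=k-1;
    y[km1]=y[km1]-2*x[k];
    y[k]=(3-2*x[k])*x[k]-x[km1]+1;
    k=k+1;
  }
}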
Applying dcc in tangent-linear mode yields a routine t1_f that can be used to accumulate the Jacobian matrix grad F(x) alongside the value of the residual as part of the following driver that needs to be passed to the NAG library routine nag_zero_nonlin_eqns_deriv_1.
static void NAG_CALL f(Integer n, double x[], double fvec[],
    double fjac[], Integer tdfjac,
    Integer userflag, Nag_User *comm) {
#define FJAC(I, J) fjac[((I))*tdfjac+(J)]
  Integer j, k;
  double *t1_x=new double[n];
  double *t1_fvec=new double[n];
  if (userflag!=2)
    f(n,x,fvec);
  else {
    memset(t1_x,0,n*sizeof(double));
    for (int i=0;i<n;i++) {
      t1_x[i]=1;
      t1_f(n,x,t1_x,fvec,t1_fvec);
      t1_x[i]=0;
      for (int j=0;j<n;j++) FJAC(j,i)=t1_fvec[j];
    }
  }
  delete [] t1_fvec; delete [] t1_x;
}
The parameter userflag is used to select between pure function evaluation and the computation of the Jacobian. Simply replacing the hand-written version of f that is originally provided by NAG, followed by building the example as outlined in the documentation, yields the desired output.

nag_zero_nonlin_eqns_deriv_1 (c05ubc)
Example Program Results
Final approximate solution
-0.5707  -0.6816  -0.7017
-0.7042  -0.7014  -0.6919
-0.6658  -0.5960  -0.4164
The application of the same steps to the SFI problem as well as to other MINPACK-2 test problems is straightforward.
2. Use the adjoint model with your favorite solver for nonlinear programming to minimize the extended Rosenbrock function; repeat for the other two test problems from Section 1.4.3.
The function F : R^2 -> R,

  y = F(x) = e^{x_0} (4 x_0^2 + 2 x_1^2 + 4 x_0 x_1 + 2 x_1 + 1),

implemented as

void f(int n, double *x, double& y) {
  y=exp(x[0])*(4*x[0]*x[0]+2*x[1]*x[1]
    +4*x[0]*x[1]+2*x[1]+1);
}
is to be passed to the NAG library routine, which expects a1_f to return the correct function value in addition to the gradient at the point x. The former needs to be stored as a result checkpoint after the augmented forward section in order to be restored after the reverse section of the adjoint code. Hence, the files declare_checkpoints.inc, f_store_results.inc, and f_restore_results.inc need to contain the corresponding declarations, store code, and restore code, respectively. For example,

declare_checkpoints.inc:  double rescp=0;
f_store_results.inc:      rescp=y;
f_restore_results.inc:    y=rescp;
Table C.1. Run time of minimization of the extended Rosenbrock function using e04dgc of the NAG C Library.

n       TLM    ADM
100     0.2    0.01
500     18     0.1
1000    220    0.5
C.2.5 Exercise 2.4.5
1. Consider the following modification of the example code from Section 2.4.1:

void h(double& x) {
  x*=x;
}
void g(int n, double *x, double& y) {
  y=1;
  for (int i=0;i<n;i++) {
    h(x[i]); y*=x[i];
  }
}
void f(int n, double *x, double &y) {
  for (int i=0;i<n;i++) x[i]=sqrt(x[i]/x[(i+1)%n]);
  g(n,x,y);
  y=cos(y);
}
Write adjoint code that corresponds to the four call tree reversal schemes

R1 = {(f, g, 0), (g, h, 0)},
R2 = {(f, g, 1), (g, h, 0)},
R3 = {(f, g, 0), (g, h, 1)},
R4 = {(f, g, 1), (g, h, 1)},

respectively. Apply the reversal mode of (g, h) to all n calls of h inside of g.
Globally split call tree reversal (R1 = {(f, g, 0), (g, h, 0)}) yields the following adjoint code:
stack<double> fds;
void a1_h(int a1_mode, double& x, double& a1_x) {
  if (a1_mode==1) { // split augmented forward ...
    fds.push(x);
    x*=x;
  }
  else { // ... and reverse sections
    x=fds.top(); fds.pop();
    a1_x*=2*x;
  }
}
void a1_g(int a1_mode, int n, double *x, double *a1_x,
    double& y, double& a1_y) {
  if (a1_mode==1) { // split augmented forward
    y=1.0;
    for (int i=0;i<n;i++) {
      a1_h(1,x[i],a1_x[i]);
      fds.push(y);
      y*=x[i];
    }
  }
  else { // ... and reverse sections
    for (int i=n-1;i>=0;i--) {
      y=fds.top(); fds.pop();
      a1_x[i]+=y*a1_y;
      a1_y*=x[i];
      a1_h(2,x[i],a1_x[i]);
    }
    a1_y=0;
  }
}
void a1_f(int n, double *x, double *a1_x,
    double &y, double &a1_y) {
  // joint augmented forward ...
  for (int i=0;i<n;i++) {
    fds.push(x[i]);
    x[i]=sqrt(x[i]/x[(i+1)%n]);
  }
  a1_g(1,n,x,a1_x,y,a1_y);
  fds.push(y);
  y=cos(y);
  double res_cp=y; // store result
  // ... and reverse sections
  y=fds.top(); fds.pop();
  a1_y=-sin(y)*a1_y;
  a1_g(2,n,x,a1_x,y,a1_y);
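  // (the tail of this listing is reconstructed as a sketch; it mirrors the
  //  joint-mode a1_f variants of this exercise)
  for (int i=n-1;i>=0;i--) {
    x[i]=fds.top(); fds.pop();
    double v=x[i]/x[(i+1)%n];
    double a1_v=a1_x[i]/(2*sqrt(v)); a1_x[i]=0;
    a1_x[i]+=a1_v/x[(i+1)%n];
    a1_x[(i+1)%n]-=a1_v*x[i]/(x[(i+1)%n]*x[(i+1)%n]);
  }
  y=res_cp; // restore result
}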
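2. Joint-over-split call tree reversal (R2 = {(f, g, 1), (g, h, 0)}) equips a1_g with argument checkpointing on a stack arg_cp_g; its opening is reconstructed here by analogy with R1 and R3, and the restore branch together with the corresponding a1_f follow below:

stack<double> arg_cp_g;
void a1_g(int a1_mode, int n, double *x, double *a1_x,
    double& y, double& a1_y) {
  if (a1_mode==1) { // joint augmented forward ...
    y=1.0;
    for (int i=0;i<n;i++) {
      a1_h(1,x[i],a1_x[i]);
      fds.push(y);
      y*=x[i];
    }
    // ... and reverse sections
    for (int i=n-1;i>=0;i--) {
      y=fds.top(); fds.pop();
      a1_x[i]+=y*a1_y;
      a1_y*=x[i];
      a1_h(2,x[i],a1_x[i]);
    }
    a1_y=0;
  }
  else if (a1_mode==3) { // store inputs
    for (int i=0;i<n;i++) arg_cp_g.push(x[i]);
  }
  else if (a1_mode==4) { // restore inputs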
    for (int i=n-1;i>=0;i--) {
      x[i]=arg_cp_g.top(); arg_cp_g.pop();
    }
  }
}
void a1_f(int n, double *x, double *a1_x,
    double &y, double &a1_y) {
  // joint augmented forward ...
  for (int i=0;i<n;i++) {
    fds.push(x[i]);
    x[i]=sqrt(x[i]/x[(i+1)%n]);
  }
  a1_g(3,n,x,a1_x,y,a1_y);
  g(n,x,y);
  fds.push(y);
  y=cos(y);
  double res_cp=y; // store result
  // ... and reverse sections
  y=fds.top(); fds.pop();
  a1_y=-sin(y)*a1_y;
  a1_g(4,n,x,a1_x,y,a1_y);
  a1_g(1,n,x,a1_x,y,a1_y);
  for (int i=n-1;i>=0;i--) {
    x[i]=fds.top(); fds.pop();
    double v=x[i]/x[(i+1)%n];
    double a1_v=a1_x[i]/(2*sqrt(v)); a1_x[i]=0;
    a1_x[i]+=a1_v/x[(i+1)%n];
    a1_x[(i+1)%n]-=a1_v*x[i]/(x[(i+1)%n]*x[(i+1)%n]);
  }
  y=res_cp; // restore result
}
3. Split-over-joint call tree reversal (R3 = {(f, g, 0), (g, h, 1)}) yields the following adjoint code:
stack<double> fds;
stack<double> arg_cp_h;
void a1_h(int a1_mode, double& x, double& a1_x) {
  if (a1_mode==1) { // joint augmented forward ...
    fds.push(x);
    x*=x;
    // ... and reverse sections
    x=fds.top(); fds.pop();
    a1_x*=2*x;
  }
  else if (a1_mode==3) { // store inputs
    arg_cp_h.push(x);
  }
  else if (a1_mode==4) { // restore inputs
    x=arg_cp_h.top(); arg_cp_h.pop();
  }
}
void a1_g(int a1_mode, int n, double *x, double *a1_x,
    double& y, double& a1_y) {
  if (a1_mode==1) { // joint augmented forward ...
    y=1.0;
    for (int i=0;i<n;i++) {
      a1_h(3,x[i],a1_x[i]);
      h(x[i]);
      fds.push(y);
      y*=x[i];
    }
    // ... and reverse sections
    for (int i=n-1;i>=0;i--) {
      y=fds.top(); fds.pop();
      a1_x[i]+=y*a1_y;
      a1_y*=x[i];
      a1_h(4,x[i],a1_x[i]);
      a1_h(1,x[i],a1_x[i]);
    }
    a1_y=0;
  }
  else if (a1_mode==3) { // store inputs
    for (int i=0;i<n;i++) arg_cp_g.push(x[i]);
  }
  else if (a1_mode==4) { // restore inputs
    for (int i=n-1;i>=0;i--) {
      x[i]=arg_cp_g.top(); arg_cp_g.pop();
    }
  }
}
[Call tree figure: f calls g and s, and g calls h; the nodes are annotated with recording memory requirements (values between 5 and 50) used in the reversal scheme comparison below.]
R2 = ((f, g, 1), (f, s, 0), (g, h, 1)) : MEM(R2) = 115, OPS(R2) = 390;

|_ a1_f (RECORD)
|  |_ a1_g (STORE_INPUTS)
|  |  |_ g
|  |  |  |_ h
|  |_ a1_s (RECORD)
|_ a1_f (ADJOIN)
|_ a1_s (ADJOIN)
|_ a1_g (RESTORE_INPUTS)
|_ a1_g (RECORD)
|  |_ a1_h (STORE_INPUTS)
|  |  |_ h
|_ a1_g (ADJOIN)
|_ a1_h (RESTORE_INPUTS)
|_ a1_h (RECORD)
|_ a1_h (ADJOIN)
For an available memory of size 140, both the greedy Smallest-Recording-First and Largest-Recording-First heuristics yield the optimal reversal scheme R2, performing OPS(R2) = 390 operations. If a memory of size 150 is at our disposal, then the Largest-Recording-First heuristic selects the optimal reversal scheme R6 (OPS(R6) = 240), whereas the Smallest-Recording-First heuristic fails to improve on R2. The Largest-Recording-First heuristic also outperforms its competitor for an available memory of size 160 by selecting reversal scheme R6 as opposed to R3 (OPS(R3) = 340).
C.3 Chapter 3
C.3.1 Exercise 3.5.1
2. Write second-order adjoint code based on the adjoint code that was developed in Section 2.4.1 (forward-over-reverse mode in both split and joint modes); use it to accumulate the same Hessian as in 1.
Implementations of the second-order adjoint model can be obtained by applying the Tangent-Linear Code Generation Rules from Section 2.1.1 to the adjoint code developed in Section C.2.1. Alternatively, or in order to verify the solutions, the adjoint code can be reimplemented in a syntax that is accepted by dcc. Application of dcc in tangent-linear mode yields the desired second-order adjoint code.
3. Write second-order adjoint code based on the tangent-linear code that was developed in Section 2.4.1 (reverse-over-forward mode in both split and joint modes); use it to accumulate the same Hessian as in 1.
The Adjoint Code Generation Rules from Section 2.2.1 need to be applied to the code developed in Section C.2.1. Alternatively, the tangent-linear code can be reimplemented in a syntax that is accepted by dcc. Application of dcc in adjoint mode yields the desired second-order adjoint code in joint call tree reversal mode. Special care must be taken when defining the argument checkpoint of g. It should contain both x and its tangent-linear counterpart t1_x. Split call tree reversal can be derived by simple local modifications. Again, listings are omitted due to their excessive length.
The correctness of a given solution can always be verified for a given input by comparing the numerical results with those obtained by second derivative code that was generated by dcc.
C.3.2 Exercise 3.5.2
Consider the given implementation of the extended Rosenbrock function f from Section 1.4.3.
1. Write a second-order tangent-linear code and use it to accumulate the Hessian of f with machine accuracy. Compare the numerical results with those obtained by finite difference approximation.
Application of the Tangent-Linear Code Generation Rules from Section 2.1.1 to an implementation of the tangent-linear extended Rosenbrock function (see t1_f in Section C.2.3) yields the second-order tangent-linear code. Alternatively, dcc can be applied in second-order adjoint mode to the following variant of an implementation of the extended Rosenbrock function:
void f(int n, double *x, double &y) {
  int i=0;
  int nm1=0;
  int ip1=0;
  double t1=0;
  double t2=0;
  y=0;
  nm1=n-1;
  while (i<nm1) {
    t1=1-x[i];
    ip1=i+1;
    t2=(x[ip1]-x[i]*x[i]);
    y=y+t1*t1+10*t2*t2;
    i=i+1;
  }
}

The driver routines, as well as the build process, are the same as in the examples in Sections 3.2.2 and 3.3.2.
4. Use the Newton algorithm and a corresponding matrix-free implementation based on Conjugate Gradients for the solution of the Newton system to minimize the extended Rosenbrock function for different start values of your own choice. Compare the run times for the various approaches to computing the required derivatives as well as the run times of the optimization algorithms for increasing values of n.
The various implementations of the second-order adjoint model replace the respective code for the evaluation / approximation of the gradient and the Hessian in the solution of Exercise 1.4.3 discussed in Section C.1.3. A matrix-free implementation uses the second-order adjoint version t2_a1_f of the extended Rosenbrock function to compute the required objective values, gradients, and Hessian-vector products. Qualitatively, the observed run times are similar to Table 1.3.
C.3.3 Exercise 3.5.3
1. Write third-order tangent-linear and adjoint versions for the code in Section 3.5.1. Run numerical tests to verify correctness.
The application of the Tangent-Linear Code Generation Rules to the previously developed second-order tangent-linear and adjoint code is straightforward. Listings become rather lengthy and are hence omitted. Finite differences can be applied to the second derivative code to qualitatively verify the numerical correctness of third derivative code at selected points.
2. Given y = F(x), derive the following higher derivative code and provide drivers for its use in the accumulation of the corresponding derivative tensors:
(a) third-order adjoint code in reverse-over-reverse-over-reverse mode;
(b) fourth-order adjoint code in forward-over-forward-over-forward-over-reverse mode;
(c) fourth-order adjoint code in reverse-over-forward-over-reverse-over-forward mode.
Discuss the complexity of computing various projections of the third and fourth derivative tensors.
Third-order adjoint code in reverse-over-reverse-over-reverse mode:
Application of reverse mode AD to an implementation of y = F(x) yields the first-order adjoint code

  y = F(x)                                           (C.2)
  x_(1) = x_(1) + < y_(1), grad F(x) >               (C.3)
  y_(1) = 0.                                         (C.4)

Application of reverse mode AD with the required floating-point data stack s to the first-order adjoint code yields the second-order adjoint code with augmented forward section

  y = F(x)                                           (C.5)
  x_(1) = x_(1) + < y_(1), grad F(x) >               (C.6)
  s[0] = y_(1)                                       (C.7)
  y_(1) = 0                                          (C.8)

and reverse section

  y_(1) = s[0]                                       (C.9)
  y_(1,2) = 0                                        (C.10)
  y_(1,2) = y_(1,2) + < x_(1,2), grad F(x) >         (C.11)
  x_(2) = x_(2) + < x_(1,2), y_(1), grad^2 F(x) >    (C.12)
  x_(2) = x_(2) + < y_(2), grad F(x) >               (C.13)
  y_(2) = 0.                                         (C.14)

This second-order adjoint code computes

  y = F(x)                                           (C.15)
  x_(1) = x_(1) + < y_(1), grad F(x) >               (C.16)
  s[0]_(3) = 0                                       (C.29)
  y_(1,3) = y_(1,3) + < x_(1,3), grad F(x) >         (C.30)
  x_(3) = x_(3) + < x_(1,3), y_(1), grad^2 F(x) >    (C.31)
  x_(3) = x_(3) + < y_(3), grad F(x) >               (C.32)
  y_(3) = 0.                                         (C.33)
  y_(1)^(2) = 0;  y_(1) = 0.

Application of forward mode AD to the second-order adjoint code yields the third-order adjoint code

  y^(2,3) = < grad^2 F(x), x^(2), x^(3) > + < grad F(x), x^(2,3) >
  y^(2) = < grad F(x), x^(2) >
  y^(3) = < grad F(x), x^(3) >
  y = F(x)
Application of forward mode AD to the third-order adjoint code yields the fourth-order adjoint code

  y^(2,3,4) = < grad^3 F(x), x^(2), x^(3), x^(4) >
            + < grad^2 F(x), x^(2,4), x^(3) >
            + < grad^2 F(x), x^(2), x^(3,4) >
            + < grad^2 F(x), x^(2,3), x^(4) >
            + < grad F(x), x^(2,3,4) >
  y^(2,3) = < grad^2 F(x), x^(2), x^(3) > + < grad F(x), x^(2,3) >
  y^(2,4) = < grad^2 F(x), x^(2), x^(4) > + < grad F(x), x^(2,4) >
  y^(2) = < grad F(x), x^(2) >
  y^(3,4) = < grad^2 F(x), x^(3), x^(4) > + < grad F(x), x^(3,4) >
  y^(3) = < grad F(x), x^(3) >
  y^(4) = < grad F(x), x^(4) >
  y = F(x)
  y_(1)^(2) = 0;  y_(1) = 0.

Application of reverse mode AD with the required floating-point data stack s to the second-order adjoint code yields the third-order adjoint code with augmented forward section

  y^(2) = < grad F(x), x^(2) >
  y = F(x)
  y_(3) = 0.

The resulting third-order adjoint code computes

  y^(2) = < grad F(x), x^(2) >
  y = F(x)
  y_(3)^(2) = 0;  y_(3) = 0.
Application of forward mode AD to this third-order adjoint code yields the fourth-order adjoint code

  y^(2,4) = < grad^2 F(x), x^(2), x^(4) > + < grad F(x), x^(2,4) >
  y^(2) = < grad F(x), x^(2) >
  y^(4) = < grad F(x), x^(4) >
  y = F(x)
by setting x, x_(1,3), y_(1)^(2), x^(2), and x^(4) appropriately while ensuring that the other terms vanish identically as the result of initializing the remaining inputs to zero.
The whole fourth derivative tensor can be accumulated by letting x_(1,3), y_(1)^(2), x^(2), and x^(4) range independently over the Cartesian basis vectors in R^n, R^m, R^n, and R^n, respectively. Projections of grad^4 F(x) can be obtained at a lower computational cost, for example,

  < x_(1,3), grad^4 F(x), x^(2), x^(4) > in R^m at the cost of O(m) * Cost(F) (y_(1)^(2) ranges over the Cartesian basis vectors in R^m);

  < y_(1)^(2), grad^4 F(x), x^(2), x^(4) > in R^n at the cost of O(n) * Cost(F) (x_(1,3) ranges over the Cartesian basis vectors in R^n);

  < x_(1,3), grad^4 F(x), x^(2) > in R^{m x n} at the cost of O(m * n) * Cost(F) (y_(1)^(2) and x^(4) range independently over the Cartesian basis vectors in R^m and R^n, respectively).

Moreover, the fourth-order adjoint code returns arbitrary projections of the third, second, and first derivative tensors in addition to the original function value. Potential sparsity should be exploited to reduce the cost of computing certain required projections.
C.4 Chapter 4
C.4.1 Exercise 4.7.1
Derive DFAs for recognizing the languages that are defined by the following regular expressions:
1. 0|1+(0|1)*
2. 0+|1(0|1)+
Implement scanners for these languages with flex and gcc. Compare the NFAs and DFAs derived by yourself with the ones that are generated by flex.
Refer to Figures C.3, C.4, and C.5 for the NFAs and DFAs. Transitions into the dedicated error states are omitted. The corresponding flex input files are analogous to the one discussed in Section 4.3.4. Running flex with the -T option produces diagnostic output that contains the automata shown in Figures C.3, C.4, and C.5.
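For illustration, a minimal flex input for the first language might look like this (a sketch; the printed messages are arbitrary choices):

%option noyywrap
%%
0|1+(0|1)*  { printf("ACCEPT: %s\n", yytext); }
.|\n        { /* anything else: not in the language */ }
%%
int main() { return yylex(); }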
[Figures C.3, C.4, and C.5: NFAs and DFAs for the two regular expressions; states are numbered and transitions are labeled 0, 1, or 0|1.]
C.4.2 Exercise 4.7.2
1. Use the parser for SL2 to parse the assignment y = sin(x) + x * 2; as shown in Table 4.3. Draw the parse tree.
Refer to Table C.2 for illustration. The parse tree is derived by applying the reductions in the ACTION column in reverse order.
2. Extend SL2 and its parser to include the ternary fused multiply-add operation, defined as y = fma(a, b, c) = a * b + c. Derive the characteristic automaton.
Both the flex and the bison input files are listed below.
[Table C.2: bottom-up parse of y = sin(x) + x * 2; (token stream V = F ( V ) L V N C ;). The columns STATE, PARSED, INPUT, and ACTION trace the parser through shift (S) and reduce steps R(P0)-R(P8): the PARSED prefix grows from V via V = F ( e ) and V = e L e N e to V = e ;, and the trace ends with ACCEPT on the augmented production $accept : s $end.]
whitespace  [ \t\n]+
variable    [a-z]
constant    [0-9]
%%
{whitespace}  { }
"sin"         { return F; }
"fma"         { return T; }
"+"           { return L; }
"*"           { return N; }
{variable}    { return V; }
{constant}    { return C; }
.             { return yytext[0]; }
%%
void lexinit(FILE *source) { yyin=source; }
s : a
  | a s
  ;
a : V = e ;
  ;
e : e L e
  | e N e
  | F ( e )
  | T ( e , e , e )
  | V
  | C
  ;
%%
#include <stdio.h>
int yyerror(char *msg) {
  printf("ERROR: %s\n", msg);
  return 1;
}
int main(int argc, char **argv)
{
  FILE *source_file=fopen(argv[1],"r");
  lexinit(source_file);
  yyparse();
  fclose(source_file);
  return 0;
}
"if"     { return IF; }
"while"  { return WHILE; }

Corresponding new tokens are defined in the bison input file in addition to proper actions associated with the loop and branch statements.
...
%token V C F IF WHILE O R
...
%%
s : a | a s | b | b s | l | l s ;
b : IF ( { printf("if ("); } c ) { { printf(") {\n"); } s }
    { printf("}\n"); } ;
l : WHILE ( { printf("while ("); } c ) { { printf(") {\n"); } s }
    { printf("}\n"); } ;
c : V R V { printf("%s%s%s", $1, $2, $3); } ;
...
%%
...
C.4.3 Exercise 4.7.3
1. Use flex and bison to implement a single-pass tangent-linear code compiler for SL2 programs. Extend it to SL.
The corresponding flex and bison input files for SL2 are listed below. The extension to SL is straightforward as control-flow statements are simply unparsed. Refer to Section 4.5.3 for conceptual details on this syntax-directed tangent-linear code compiler.
Listing C.14. Definition of parse tree node.
#define BUFFER_SIZE 100000
typedef struct {
  int j;
  char *c;
} astNodeType;
#define YYSTYPE astNodeType
whitespace  [ \t\n]+
variable    [a-z]
constant    [0-9]
%%
{whitespace}  { }
"sin"         { to_parser(); return F; }
"+"           { to_parser(); return L; }
"*"           { to_parser(); return N; }
{variable}    { to_parser(); return V; }
{constant}    { to_parser(); return C; }
.             { return yytext[0]; }
%%
void lexinit(FILE *source) { yyin=source; }
      $$.j=sacvc++;
      get_memory(&$$);
      sprintf($$.c, "%s%s v%d_=v%d_%s v%d_; v%d=v%d%s v%d;\n",
              $1.c, $3.c, $$.j, $1.j, $2.c, $3.j,
              $$.j, $1.j, $2.c, $3.j);
      free_memory(&$1);
    }
  | e N e
    {
      if (!strcmp($2.c,"*")) {
        $$.j=sacvc++;
        get_memory(&$$);
        sprintf($$.c,
          "%s%s v%d_=v%d_*v%d+v%d*v%d_; v%d=v%d%s v%d;\n",
          $1.c, $3.c, $$.j, $1.j, $3.j, $1.j, $3.j,
          $$.j, $1.j, $2.c, $3.j);
        free_memory(&$1);
      }
    }
  | F ( e )
    {
      if (!strcmp($1.c,"sin")) {
        $$.j=sacvc++;
        get_memory(&$$);
        sprintf($$.c, "%s v%d_=cos(v%d)*v%d_; v%d=sin(v%d);\n",
                $3.c, $$.j, $3.j, $3.j, $$.j, $3.j);
        free_memory(&$3);
      }
    }
  | V
    {
      $$.j=sacvc++;
      get_memory(&$$);
      sprintf($$.c, "v%d_=%s_; v%d=%s;\n", $$.j, $1.c, $$.j, $1.c);
      free_memory(&$1);
    }
  | C
    {
      $$.j=sacvc++;
      get_memory(&$$);
      sprintf($$.c, "v%d_=0; v%d=%s;\n", $$.j, $$.j, $1.c);
      free_memory(&$1);
    }
  ;
%%
int yyerror(char *msg) {
  printf("ERROR: %s\n", msg);
  return 1;
2. Use flex and bison to implement a single-pass adjoint code compiler for SL2
programs. Extend it to SL.
The corresponding flex and bison input files for SL are listed below. A solution
for SL2 is implied. Refer to Section 4.5.4 for conceptual details of the
syntax-directed adjoint code compiler.
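As an illustration of the generated code (not reproduced from the text): for the right-hand side sin(x) the forward section assembled in the af attribute reads

push_d(v0); v0=x;
push_d(v1); v1=sin(v0);

while the reverse section accumulated in the ar attribute contains, in opposite order,

pop_d(v1); v0_=cos(v0)*v1_;
pop_d(v0); ...

where push_d and pop_d save and restore overwritten values on the required data stack and the remaining adjoint of the program variable x is produced by the V rule.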
Listing C.17. Definition of parse tree node.
#define maxBB 100
typedef struct {
  int j;
  char *af;
  char *ar[maxBB];
} astNodeType;
#define YYSTYPE astNodeType
whitespace  [ \t\n]+
symbol      [a-z]
const       [0-9]
%%
{whitespace}  { }
"if"          { return IF; }
"while"       { return WHILE; }
"sin" {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return F;
}
"<" {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return R;
}
"+" {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return L;
}
"*" {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return N;
}
{symbol} {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return V;
}
{const} {
  yylval.af = (char*)malloc((strlen(yytext)+1)*sizeof(char));
  strcpy(yylval.af, yytext);
  int i;
  for (i = 0; i < maxBB; i++) yylval.ar[i] = 0;
  yylval.j = 0;
  return C;
}
. { return yytext[0]; }
%%
void lexinit(FILE *source) { yyin = source; }
...
      $$.af = (char*)malloc(bs*sizeof(char));
      sprintf($$.af, "%s%s", $1.af, $2.af);
      free($2.af); free($1.af);
      int i;
      for (i = 0; i <= idxBB; i++) {
        if ($2.ar[i] && $1.ar[i]) {
          $$.ar[i] = (char*)malloc(bs*sizeof(char));
          sprintf($$.ar[i], "%s%s", $2.ar[i], $1.ar[i]);
          free($2.ar[i]); free($1.ar[i]);
        }
        else if ($2.ar[i]) {
          $$.ar[i] = (char*)malloc(bs*sizeof(char));
          sprintf($$.ar[i], "%s", $2.ar[i]);
          free($2.ar[i]);
        }
        else if ($1.ar[i]) {
          $$.ar[i] = (char*)malloc(bs*sizeof(char));
          sprintf($$.ar[i], "%s", $1.ar[i]);
          free($1.ar[i]);
        }
      }
    }
  ;
s : a { $$=$1; }
  | b { $$=$1; }
  | l { $$=$1; }
  ;
b : IF ( c ) {
      {
        newBB = 1;
      }
    s }
      {
        $$.af = (char*)malloc(bs*sizeof(char));
        sprintf($$.af, "if (%s) {\n%s}\n", $3.af, $7.af);
        free($3.af); free($7.af);
        int i;
        for (i = 0; i <= idxBB; i++) {
          if ($7.ar[i]) {
            $$.ar[i] = (char*)malloc(bs*sizeof(char));
            sprintf($$.ar[i], "%s", $7.ar[i]);
            free($7.ar[i]);
          }
        }
        newBB = 1;
      }
  ;
l : WHILE ( c ) {
      ...
  ;
e : e N e
    {
      $$.j = c++; if (c > cmax) cmax = c;
      $$.af = (char*)malloc(bs*sizeof(char));
      sprintf($$.af, "%s%s push_d(v%d); v%d=v%d*v%d;\n",
              $1.af, $3.af, $$.j, $$.j, $1.j, $3.j);
      free($1.af); free($3.af);
      $$.ar[idxBB] = (char*)malloc(bs*sizeof(char));
      sprintf($$.ar[idxBB],
              "pop_d(v%d); v%d_=v%d_*v%d; v%d_=v%d_*v%d;\n%s%s",
              $$.j, $1.j, $$.j, $3.j, $3.j, $$.j, $1.j,
              $3.ar[idxBB], $1.ar[idxBB]);
      free($1.ar[idxBB]); free($3.ar[idxBB]);
    }
  | e L e
    {
      $$.j = c++; if (c > cmax) cmax = c;
      $$.af = (char*)malloc(bs*sizeof(char));
      sprintf($$.af, "%s%s push_d(v%d); v%d=v%d+v%d;\n",
              $1.af, $3.af, $$.j, $$.j, $1.j, $3.j);
      free($1.af); free($3.af);
      $$.ar[idxBB] = (char*)malloc(bs*sizeof(char));
      sprintf($$.ar[idxBB],
              "pop_d(v%d); v%d_=v%d_; v%d_=v%d_;\n%s%s",
              $$.j, $1.j, $$.j, $3.j, $$.j,
              $3.ar[idxBB], $1.ar[idxBB]);
      free($1.ar[idxBB]); free($3.ar[idxBB]);
    }
  | F ( e )
    {
      $$.j = c++; if (c > cmax) cmax = c;
      $$.af = (char*)malloc(bs*sizeof(char));
      sprintf($$.af, "%s push_d(v%d); v%d=%s(v%d);\n",
              $3.af, $$.j, $$.j, $1.af, $3.j);
      free($3.af);
      $$.ar[idxBB] = (char*)malloc(bs*sizeof(char));
      if (!strcmp($1.af, "sin"))
        sprintf($$.ar[idxBB],
                "pop_d(v%d); v%d_=cos(v%d)*v%d_;\n%s",
                $$.j, $3.j, $3.j, $$.j, $3.ar[idxBB]);
      free($3.ar[idxBB]);
    }
  | V
    {
      $$.j = c++; if (c > cmax) cmax = c;
      $$.af = (char*)malloc(bs*sizeof(char));
      sprintf($$.af, "push_d(v%d); v%d=%s;\n", $$.j, $$.j, $1.af);
      $$.ar[idxBB] = (char*)malloc(bs*sizeof(char));
      ...
C.4.4 Exercise 4.7.4
Use flex and bison to implement a compiler that generates an intermediate representation for explicitly typed SL programs in the form of a parse tree and a symbol table.
Implement an unparser.
A fully functional solution is listed below. Refer to Section 4.6 for details.
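For example (an illustrative run, not reproduced from the text), the explicitly typed SL program

float x;
x=sin(x)+x*x;

is parsed into a tree built from the vertex types defined in Listing C.20 and, assuming the symbol table unparser emits one declaration per line, unparsed by the driver below as

int main() {
float x;
x=sin(x)+x*x;
return 0;
}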
Listing C.20. parse_tree.hpp.
#ifndef PARSE_TREE_INC
#define PARSE_TREE_INC
#include <string>
#include <list>
using namespace std;
#include "symbol_table.hpp"
const unsigned short UNDEFINED_PTV = 0;
const unsigned short SEQUENCE_OF_STATEMENTS_PTV = 1;
const unsigned short LOOP_PTV = 2;
const unsigned short BRANCH_PTV = 3;
const unsigned short ASSIGNMENT_PTV = 4;
const unsigned short INTRINSIC_CALL_PTV = 5;
const unsigned short ADDITION_PTV = 6;
const unsigned short MULTIPLICATION_PTV = 7;
const unsigned short SYMBOL_PTV = 8;
const unsigned short CONSTANT_PTV = 9;
const unsigned short LT_CONDITION_PTV = 10;
const unsigned short PARENTHESES_PTV = 11;
class parse_tree_vertex {
public:
  unsigned short type;
  list<parse_tree_vertex*> succ;
  parse_tree_vertex(unsigned short);
  virtual ~parse_tree_vertex();
  virtual const string& get_name() const;
  virtual int& symbol_type();
  virtual void unparse() const;
};

class parse_tree_vertex_named : public parse_tree_vertex {
public:
  string name;
  parse_tree_vertex_named(unsigned short, string);
  ~parse_tree_vertex_named();
  const string& get_name() const;
  void unparse() const;
};

class parse_tree_vertex_symbol : public parse_tree_vertex {
public:
  symbol* sym;
  parse_tree_vertex_symbol(unsigned short, string);
  ~parse_tree_vertex_symbol();
  void unparse() const;
  int& symbol_type();
};

#define YYSTYPE parse_tree_vertex*
#endif
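The following hypothetical fragment (not part of the solution) illustrates how the three vertex classes cooperate; it assumes the global symbol table stab declared in the driver at the end of this section.

// Hypothetical usage sketch; relies only on the interface above.
parse_tree_vertex* arg = new parse_tree_vertex_symbol(SYMBOL_PTV, "x");
parse_tree_vertex* call = new parse_tree_vertex_named(INTRINSIC_CALL_PTV, "sin");
call->succ.push_back(arg);
parse_tree_vertex* stmt = new parse_tree_vertex_symbol(ASSIGNMENT_PTV, "x");
stmt->succ.push_back(call);
stmt->unparse();  // prints: x=sin(x);
delete stmt;      // the destructor deletes all successors recursively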
#include <assert.h>
#include <iostream>
using namespace std;
#include "parse_tree.hpp"
extern symbol_table stab;
parse_tree_vertex::parse_tree_vertex(unsigned short t)
  : type(t) {}

parse_tree_vertex::~parse_tree_vertex() {
  list<parse_tree_vertex*>::iterator i;
  for (i = succ.begin(); i != succ.end(); i++) {
    delete (*i);
  }
}

const string& parse_tree_vertex::get_name() const {
  assert(false);
}

int& parse_tree_vertex::symbol_type() {
  assert(false);
}

void parse_tree_vertex::unparse() const {
  switch (type) {
    case SEQUENCE_OF_STATEMENTS_PTV : {
      list<parse_tree_vertex*>::const_iterator i;
      for (i = succ.begin(); i != succ.end(); i++)
        (*i)->unparse();
      break;
    }
    case LOOP_PTV : {
      cout << "while (";
      (*(succ.begin()))->unparse();
      cout << ") {" << endl;
      (*(++(succ.begin())))->unparse();
      cout << "}" << endl;
      break;
    }
    case BRANCH_PTV : {
      cout << "if (";
      (*(succ.begin()))->unparse();
      cout << ") {" << endl;
      (*(++(succ.begin())))->unparse();
      cout << "}" << endl;
      break;
    }
    case ADDITION_PTV : {
      list<parse_tree_vertex*>::const_iterator i = succ.begin();
      (*i++)->unparse();
      cout << "+";
      (*i)->unparse();
      break;
    }
    case MULTIPLICATION_PTV : {
      (*(succ.begin()))->unparse();
      cout << "*";
      (*(++(succ.begin())))->unparse();
      break;
    }
    case LT_CONDITION_PTV : {
      (*(succ.begin()))->unparse();
      cout << "<";
      (*(++(succ.begin())))->unparse();
      break;
    }
    case PARENTHESES_PTV : {
      cout << "(";
      (*(succ.begin()))->unparse();
      cout << ")";
      break;
    }
  }
}
parse_tree_vertex_named::parse_tree_vertex_named
  (unsigned short t, string n) :
  parse_tree_vertex(t), name(n) {}

parse_tree_vertex_named::~parse_tree_vertex_named() {}

const string& parse_tree_vertex_named::get_name() const {
  return name;
}

void parse_tree_vertex_named::unparse() const {
  switch (type) {
    case CONSTANT_PTV : {
      cout << name;
      break;
    }
    case INTRINSIC_CALL_PTV : {
      cout << name << "(";
      (*(succ.begin()))->unparse();
      cout << ")";
      break;
    }
  }
}
parse_tree_vertex_symbol::parse_tree_vertex_symbol(
  unsigned short t, string n) : parse_tree_vertex(t) {
  sym = stab.insert(n);
}

parse_tree_vertex_symbol::~parse_tree_vertex_symbol() {}

int& parse_tree_vertex_symbol::symbol_type() {
  return sym->type;
}

void parse_tree_vertex_symbol::unparse() const {
  switch (type) {
    case ASSIGNMENT_PTV : {
      cout << sym->name << "=";
      (*(succ.begin()))->unparse();
      cout << ";" << endl;
      break;
    }
    case SYMBOL_PTV : {
      cout << sym->name;
      break;
    }
  }
}
...
  string name;
  int type;
  symbol();
};

/*
  symbol table
*/
class symbol_table {
public:
  /*
    symbol table is stored as simple list of symbols
  */
  list<symbol*> tab;
  symbol_table();
  ~symbol_table();
  /*
    insert a string into the symbol table; checks for duplication
  */
  symbol* insert(string);
  void unparse() const;
};
#endif
...
int line_counter;
int yylex();
int yyerror(const char*);
void lexinit(FILE*);
extern parse_tree_vertex* pt_root;
%}
%token INT FLOAT IF WHILE F L N R V C
%left L
%left N
%%
sl : d s
     {
       pt_root = $2;
     }
   ;
d :
  | INT V ; d
     {
       $2->symbol_type() = INTEGER_ST;
     }
  | FLOAT V ; d
     {
       $2->symbol_type() = FLOAT_ST;
     }
  ;
s : a
     { }
  | a s
     {
       $$ = new parse_tree_vertex(SEQUENCE_OF_STATEMENTS_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($2);
     }
  | b
     { }
  | b s
     {
       $$ = new parse_tree_vertex(SEQUENCE_OF_STATEMENTS_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($2);
     }
  | l
     { }
  | l s
     {
       $$ = new parse_tree_vertex(SEQUENCE_OF_STATEMENTS_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($2);
     }
  ;
b : IF ( c ) { s }
     {
       $$ = new parse_tree_vertex(BRANCH_PTV);
       $$->succ.push_back($3);
       $$->succ.push_back($6);
     }
  ;
l : WHILE ( c ) { s }
     {
       $$ = new parse_tree_vertex(LOOP_PTV);
       $$->succ.push_back($3);
       $$->succ.push_back($6);
     }
  ;
c : V R V
     {
       if ($2->get_name() == "<")
         $$ = new parse_tree_vertex(LT_CONDITION_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($3);
       delete $2;
     }
  ;
a : V = e ;
     {
       $$ = $1; $$->type = ASSIGNMENT_PTV;
       $$->succ.push_back($3);
     }
  ;
e : e N e
     {
       if ($2->get_name() == "*")
         $$ = new parse_tree_vertex(MULTIPLICATION_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($3);
       delete $2;
     }
  | e L e
     {
       if ($2->get_name() == "+")
         $$ = new parse_tree_vertex(ADDITION_PTV);
       $$->succ.push_back($1);
       $$->succ.push_back($3);
       delete $2;
     }
  | F ( e )
     {
       $$ = $1;
       $$->type = INTRINSIC_CALL_PTV;
       $$->succ.push_back($3);
     }
  | ( e )
     {
       $$ = new parse_tree_vertex(PARENTHESES_PTV);
       $$->succ.push_back($2);
     }
  | V
     {
       $$ = $1;
     }
  | C
     {
       $$ = $1;
     }
  ;
%%
int yyerror(const char* msg) {
  ...
whitespace  [ \t]+
linefeed    \n
constant    [0-9]
symbol      [a-z]
%%
{whitespace}  { }
{linefeed}    { line_counter++; }
"int"         { return INT; }
"float"       { return FLOAT; }
"if"          { return IF; }
"while"       { return WHILE; }
"sin" {
  yylval = new parse_tree_vertex_named(UNDEFINED_PTV, yytext);
  return F;
}
"+" {
  yylval = new parse_tree_vertex_named(UNDEFINED_PTV, yytext);
  return L;
}
"*" {
  yylval = new parse_tree_vertex_named(UNDEFINED_PTV, yytext);
  return N;
}
"<" {
  yylval = new parse_tree_vertex_named(UNDEFINED_PTV, yytext);
  return R;
}
{symbol} {
  yylval = new parse_tree_vertex_symbol(SYMBOL_PTV, yytext);
  return V;
}
{constant} {
  yylval = new parse_tree_vertex_named(CONSTANT_PTV, yytext);
  return C;
}
. { return yytext[0]; }
%%
void lexinit(FILE *source) { yyin = source; }
#include <stdio.h>
#include <cstdlib>
#include <iostream>
#include "parse_tree.hpp"
#include "symbol_table.hpp"
using namespace std;

extern void lexinit(FILE*);
extern void yyparse();
parse_tree_vertex* pt_root;
symbol_table stab;

int main(int argc, char* argv[]) {
  // open source file
  FILE* source_file = fopen(argv[1], "r");
  // parse
  lexinit(source_file);
  yyparse();
  // close source file
  fclose(source_file);
  // unparse
  cout << "int main() {" << endl;
  stab.unparse();
  pt_root->unparse();
  cout << "return 0;" << endl << "}" << endl;
  return 0;
}
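A possible build sequence (all file names are illustrative; the text does not prescribe them):

flex scanner.l
bison -d parser.y
g++ lex.yy.c parser.tab.c parse_tree.cpp symbol_table.cpp main.cpp -o sl2ir

The resulting compiler reads the SL source file named by its first command line argument, e.g., ./sl2ir program.sl.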