
Computational Differentiation
WS 16/17

Uwe Naumann
LuFG Informatik 12: Software and Tools for
Computational Engineering, RWTH Aachen

Who knows how to differentiate ...

static expression: $y = \sin(x)$

static expression: $x_{123} = z_7^4$

static program:

void my_sin(const double& x, double& y) {
  y=sin(x);
}

static program:

void my_pow(const double z_7, double& x123) {
  x123=z_7*z_7;
  x123=x123*x123;
}

???

I do ...


Motivation
Who knows how to differentiate ...

dynamic expression:

  $y = \left( \sum_{i=0}^{n-1} x_i^2 \right)^2$

dynamic program:

void f(const int n,
       const double* const x, double& y) {
  y=0;
  for (int i=0;i<n;i++)
    y+=x[i]*x[i];
  y*=y;
}

???
... you will!

Aim

We consider implementations of multivariate vector functions

  $F : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m : y = F(x)$

as computer programs.

Assuming differentiability of F and of its implementation¹ we aim to study methods for transforming the given computer program into one for computing first or higher derivatives efficiently.

¹ counterexample: if (x) y=sin(x); else y=0;

Outline I

Motivation, Terminology, Finite Differences

First Derivatives of Multivariate Vector Functions

Second Derivatives of Multivariate Scalar Functions

Formalities

Lecture on Monday

Tutorial on Wednesday (J. Hüser)

90min written exams on 13.2.17 and 3.4.17 (see Campus)

SiSc Lab (17.10.-16.11.): First- and Second-Order AD with dco/c++

  Integration of AD into project

  45min written assignment

Registration for lecture and exams separately

L2P access through registration for lecture; all material there

Course on Algorithmic Differentiation (AD)
based on ...

U.N.: The Art of Differentiating Computer Programs. An Introduction to Algorithmic Differentiation. Number 24 in Software, Environments, and Tools, SIAM, 2012.
https://www.stce.rwth-aachen.de/research/the-art

Outline

Motivation, Terminology, Finite Differences

First Derivatives of Multivariate Vector Functions

Second Derivatives of Multivariate Scalar Functions

Motivation: Cheap Gradients
Diffusion (See Tutorial)

[Figure omitted.]

Motivation: Cheap Gradients
Diffusion

$T = T(t, x, c(x)) : \mathbb{R} \times \mathbb{R} \times \mathbb{R} \rightarrow \mathbb{R}$ is given as the solution of the 1D diffusion equation

  $\frac{dT}{dt} = c(x) \cdot \frac{d^2 T}{dx^2}$

over the domain $\Omega = [0, 1]$ and with initial condition $T(0, x) = i(x)$ for $x \in \Omega$ and Dirichlet boundary $T(t, 0) = b_0(t)$ and $T(t, 1) = b_1(t)$ for $t \in [0, 1]$.

The numerical solution is based on a central finite difference / implicit (backward) Euler integration scheme that exploits linearity of the residual r in T (single evaluation of the constant Jacobian $\frac{dr}{dT}$ and factorization).

Motivation: Cheap Gradients
Diffusion

We aim to analyze sensitivities of the predicted T(x, c(x)) with respect to c(x), or even calibrate the uncertain c(x) to given observations O(x) for T(x, c(x)) at time t = 1 by solving the (possibly constrained) least squares problem

  $\min_{c(x)} f(c(x), x) \quad \text{where} \quad f \equiv \int_\Omega \left( T(1, x, c(x)) - O(x) \right)^2 dx .$

Assuming a spatial discretization of $\Omega$ with n grid points we require

  $\frac{df}{dc} = \frac{df}{dT(1)} \cdot \frac{dT(1)}{dc} \in \mathbb{R}^n$

(→ scalar adjoint integration at O(1))

and possibly

  $\frac{d^2 f}{dc^2} \in \mathbb{R}^{n \times n}$

(→ n 2nd-order vector adjoint integrations at O(n))

...
For differentiation, is there anything else?
Perturbing the inputs, can't imagine this fails.
I pick a small Epsilon, and I wonder ...
...

from: Optimality (Lyrics: Naumann; Music: Think of Fool's Garden's Lemon Tree) in Naumann: The Art of Differentiating Computer Programs. An Introduction to Algorithmic Differentiation. Number 24 in Software, Environments, and Tools, SIAM, 2012. Page xvii

Motivation: Cheap Gradients
Diffusion

Naive application of the Algorithmic Differentiation (AD) tool dco/c++ yields:

  df/dc (n=300, m=50), run time (s):    central FD: 15.2    adjoint AD: 0.7
  d²f/dc² (n=100, m=50), run time (s):  central FD: 63.6    adjoint AD: 3.9

... while ignoring accuracy for the time being ...

Nice to have?

MITgcm (EAPS, MIT), in collaboration with ANL, MIT, Rice, UColorado

J. Utke, U.N. et al: OpenAD/F: A modular, open-source tool for automatic differentiation of Fortran codes. ACM TOMS 34(4), 2008.

Plot: A tangent computation / finite difference approximation for 64,800 grid points at 1 min each would keep us waiting for a month and a half ... :-((( We can do it in less than 10 minutes thanks to adjoints computed by a differentiated version of the MITgcm :-)

Adjoints Are Everywhere ...
Topology Optimization with Adjoint OpenFOAM [AboutFlow]

design space → optimization → result

primal (penalized Navier-Stokes [Othmer, 2008])

  $(v \cdot \nabla) v = \nu \nabla^2 v - \nabla p - \alpha v, \quad \alpha > 0$

objective (total pressure loss between in- and outlet)

  $J = \int_\Gamma \left( p + \frac{1}{2} v_n^2 \right) d\Gamma$

solver (gradient descent): $\alpha^{n+1} = \alpha^n - \lambda \frac{dJ}{d\alpha^n}$

requires $\frac{dJ}{d\alpha^n}$ computed by dco/c++ / AMPI

aiming for unsteady, coupled (e.g., fluid/heat, fluid/structure)

[6]: MPI-Parallel Discrete Adjoint OpenFOAM, ICCS Conf., 2015.

Adjoints Are Everywhere ...
DAEs with Optimality Conditions (DAEO) [FZJ-IBG, AVT]

The Jülich-Aachen Dynamic Optimization Environment (JADE) targets DAEO

  $\dot{y}_d(t) = f(y_d(t), y_a(t), t, p), \quad \text{for given } y_d(0)$
  $y_a(t) = \operatorname{argmin}_{y_a \in \mathbb{R}^{n_a}} h(y_d(t), y_a, t, p)$
  $\text{s.t.} \quad 0 = g_i(y_d(t), y_a, t, p), \quad i = 1, \ldots, n_e$
  $\qquad\; 0 \le g_j(y_d(t), y_a, t, p), \quad j = n_e + 1, \ldots, n_g .$

Embedding the KKT condition

  $\begin{bmatrix} \nabla_{y_a} h - \nabla^T_{y_a} g \cdot \lambda \\ g \end{bmatrix} = 0$

for the Lagrangian $L = h - \langle \lambda, g \rangle$ yields the need for $\nabla_{y_a} h$, $\nabla^2_{y_a} h \cdot u$, $\nabla_{y_a} g \cdot v$, $\nabla^T_{y_a} g \cdot w$, and $\langle w, \nabla^2_{y_a} g, v \rangle$, generated by dcc / dco/c++.

[2]: First- and Second-Order Parameter Sensitivities of a Metabolically and Isotopically Non-Stationary Biochemical Network Model, Modelica Conf., 2012.

Adjoints Are Everywhere ...
Parameter Estimation with JURASSIC2 [FZJ-IEK]

recovery of atmospheric state in upper troposphere / lower stratosphere for given radiance measurements along the line-of-sight

primal model $y_i = g(x, \epsilon_i)$ with simulated radiances $y \in \mathbb{R}^m$, atmospheric parameters $x \in \mathbb{R}^n$, and line-of-sight elevations $\epsilon \in \mathbb{R}^m$

residual $\mathbb{R}^m \ni F = (o_i - g(x, \epsilon_i))_{i=1 \ldots m}$ for measured radiances $o \in \mathbb{R}^m$

objective $G(x, \epsilon) = F^T S^{-1} F + (x - x_a)^T S_a^{-1} (x - x_a)$ with measurement error correlation matrix $S \in \mathbb{R}^{m \times m}$, typical atmospheric state values from historical data $x_a \in \mathbb{R}^n$, and (Tikhonov) regularization matrix $S_a \in \mathbb{R}^{n \times n}$

Gauss-Newton solver requires $\nabla_x F$ computed by dco/c++

[7]: A 3-D tomographic retrieval approach with advection compensation for the air-borne limb-imager GLORIA, Atmos. Meas. Tech., 2011.
[3]: A Case Study in Adjoint Sensitivity Analysis of Parameter Calibration, Submitted, 2016.

Adjoints Are Everywhere ...
Waterways Engineering [BAW, EDF, HRW]

Sensitivity of local shear stress with respect to geometry (top) and roughness of sediment (bottom) generated by dco/fortran.

[4]: Reverse engineering of initial and boundary conditions with Telemac and algorithmic differentiation, Wasserwirtschaft, 2013.

Adjoints Are Everywhere ...
Physical Oceanography [MPI-M]

Adjoint version of ICON (Icosahedral Non-hydrostatic General Circulation Model) generated by dco/fortran for use in 4DVAR data assimilation and error estimation.

[9]: Estimation of Data Assimilation Error: A Shallow-Water Model Study, Monthly Weather Review, 2014.

Adjoints Are Everywhere ...
Option Pricing / Risk Management [NAG]

Let's price a simple European call option written on an underlying $S = S(t) : \mathbb{R}^+ \rightarrow \mathbb{R}^+$, described by the SDE

  $dS(t) = S(t)\, r\, dt + S(t)\, \sigma(t, S(t))\, dW(t)$

with time $t \ge 0$, maturity $T > 0$, strike $K > 0$, constant interest rate $r > 0$, volatility $\sigma = \sigma(t, S) : \mathbb{R}^+ \times \mathbb{R}^+ \rightarrow \mathbb{R}^+$, and Brownian motion $W = W(t) : \mathbb{R}^+ \rightarrow \mathbb{R}$.

The value of a European call option driven by S is given by the expectation

  $V = E\left[ e^{-rT} \left( S(T) - K \right)^+ \right]$

for given interest r, strike K, and maturity T. It is usually evaluated using Monte Carlo simulation.

Scenario
  10⁴ paths
  360 Euler steps
  62 uncertain parameters
  pricer takes 1s

Greeks
  first order (dco/c++):   finite diffs: 81s    adjoint: 5s
  second order (dco/c++):  finite diffs: 480s   adjoint: 53s

[?]: Adjoint Algorithmic Differentiation Tool Support for Typical Numerical Patterns in Computational Finance, NAG, 2014.

Adjoints Are Everywhere ...
Robust Non-Convex Optimization [AVT, Imperial College]

Branch and (Improved) Bounds through Subgradients of McCormick Relaxations

Robust Objective through Variance Penalty: $E[F(x)] + \frac{1}{2} \mathrm{Var}[F(x)]$

[Plots: original function, natural interval extension underestimator, convex underestimator, affine underestimator, global/local upper bounds, before and after the variance penalty.]

[1]: Adjoint Mode Computations of Subgradients for McCormick Relaxations, AD Conf., LNCSE, 2012.
M. Beckers: Toward Global Robust Optimization, Ph.D. Thesis, RWTH Aachen, 2014.

Adjoints Are Everywhere ...
Adjoint Numerical Methods / Libraries [NAG]

For example, nag_zero_cont_func_brent locates a simple zero x of a continuous function f in a given interval [a,b] using Brent's method. The adjoint version enables computation of sensitivities of the solution x wrt. all relevant input parameters (potentially passed through comm).

void
nag_zero_cont_func_brent_dco_a1s (
  dco::a1s::type a, // in: lower bound
  dco::a1s::type b, // in: upper bound
  dco::a1s::type eps, // in: termination tolerance on x
  dco::a1s::type eta, // in: acceptance tolerance for vanishing f(x)
  dco::a1s::type (*f)(dco::a1s::type x, Nag_Comm_dco_a1s* comm), // in: f
  dco::a1s::type* x, // out: x
  Nag_Comm_dco_a1s* comm, // inout: parameters
  NagError* fail // out: error code
);

We develop first- and higher-order tangent and adjoint versions of a growing number of numerical methods using dco/c++, dco/fortran, and hand-coding.

Adjoints Are Everywhere ...
Significance-Based Approximate Computing [SCoRPiO]

For illustration consider

  y=log(exp(sin(x1))*(x1*x2))/10+x2/100;

Interval evaluation for x1, x2 ∈ [1, 36000] yields

  v5=log(exp(sin(x1))*(x1*x2)) → [1,22]

A significance criterion, e.g.,

  w([v5] · [v5 → y]) ≤ 2.5

selects v5=10.5; and hence y=1.05+x2/100;

Significance analysis requires the gradient of the objective y with respect to all intermediates, computed by dco/scorpio (→ scenarios through interval splitting and exploratory spawning).

[DAG of the evaluation with per-vertex value ranges and adjoint intervals omitted.]

[8]: Towards Automatic Significance Analysis for Approximate Computing, IEEE/ACM CGO, 2016.

Recall ...
Basic Mathematical Terminology

continuity

differentiability

gradient, Jacobian, Hessian, higher derivative tensors

Taylor expansion

chain rule

Continuity
Univariate Scalar Functions

Let $D \subseteq \mathbb{R}$ be the open domain of the univariate scalar function $f : D \rightarrow \mathbb{R}$.

f(x) is right-continuous at $x_0 \in D$ if

  $\lim_{h \to 0, h > 0} f(x_0 + h) = f(x_0) .$

f is left-continuous at $x_0$ if

  $\lim_{h \to 0, h > 0} f(x_0 - h) = f(x_0) .$

f is continuous at $x_0$ if it is both left- and right-continuous at $x_0$.

Continuity is a necessary condition for differentiability.

Continuity
Univariate Scalar Functions: Alternative Formulation

Let $D \subseteq \mathbb{R}$ be the open domain of the univariate scalar function $f : D \rightarrow \mathbb{R}$.

The function f is continuous at a point $x_0 \in D$ if

  $\lim_{x \to x_0} f(x) = f(x_0) .$

The above implies that for all series $(x_i)_{i=1}^{\infty}$ with $\lim_{i \to \infty} x_i = x_0$ and $x_i \neq x_0$ the series $(f(x_i))_{i=1}^{\infty}$ converges to $f(x_0)$.

Continuity in $B \subseteq D$ requires continuity at all $x_0 \in B$.

Continuity
Univariate Scalar Functions: Example

We investigate the continuity of f(x) = |x| at x = 0:

The left limit

  $\lim_{h \to 0, h > 0} f(0 - h) = f(0) = 0$

and the right limit

  $\lim_{h \to 0, h > 0} f(0 + h) = f(0) = 0$

are identical, proving that |x| is continuous at the origin.

In fact, |x| is continuous throughout its domain $\mathbb{R}$.

Continuity
Multivariate Vector Functions

Let $D \subseteq \mathbb{R}^n$ be the open domain of the multivariate scalar function $f : D \rightarrow \mathbb{R}$. The function f is continuous at a point $x_0 \in D$ if

  $\lim_{x \to x_0} f(x) = f(x_0) .$

Continuity in $B \subseteq D$ requires continuity at all $x_0 \in B$.

A multivariate vector function

  $F = \begin{pmatrix} f_1 \\ \vdots \\ f_m \end{pmatrix} : D \rightarrow \mathbb{R}^m$

is continuous if and only if all its component functions $f_i$, $i = 1, \ldots, m$, are continuous.

Differentiability
Univariate Scalar Functions

Let $D \subseteq \mathbb{R}$ be the open domain of the univariate scalar function $f : D \rightarrow \mathbb{R}$.

f(x) is right-differentiable at $x_0 \in D$ if the limit

  $\delta^+ = \lim_{h \downarrow 0} \frac{f(x_0 + h) - f(x_0)}{h}$

exists (is finite). f is left-differentiable at $x_0$ if

  $\delta^- = \lim_{h \downarrow 0} \frac{f(x_0) - f(x_0 - h)}{h}$

exists (is finite). f is differentiable at $x_0$ if it is both left- and right-differentiable and

  $\delta^+ = \delta^- \equiv \frac{df}{dx}(x_0) .$

Differentiability
Univariate Scalar Functions: Example

We investigate the differentiability of f(x) = |x| at x = 0: The left limit is derived from the backward difference

  $\lim_{h \to 0, h > 0} \frac{f(0) - f(0 - h)}{h} = \lim_{h \to 0, h > 0} \frac{0 - h}{h} = -1$

while a forward difference is used to get the right limit

  $\lim_{h \to 0, h > 0} \frac{f(0 + h) - f(0)}{h} = \lim_{h \to 0, h > 0} \frac{h}{h} = 1 .$

The limits are distinct, proving that |x| is not differentiable at the origin. However, |x| is differentiable anywhere else in its domain $\mathbb{R}$.

Continuity and Differentiability
abs

[Plot of |x|: continuous everywhere, not differentiable at the origin.]

Differentiability
Univariate Scalar Functions: Alternative Formulation

Let $D \subseteq \mathbb{R}$ be the open domain of the univariate scalar function $f : D \rightarrow \mathbb{R}$.

The function f is differentiable at point $x_0 \in D$ if there is a scalar $a \in \mathbb{R}$ such that

  $f(x) = f(x_0) + a \cdot (x - x_0) + r$

with asymptotically vanishing remainder $r = r(x) \in \mathbb{R}$, that is,

  $\lim_{x \to x_0} \frac{r(x)}{|x - x_0|} = 0 .$

Differentiability in $B \subseteq D$ requires differentiability at all $x_0 \in B$.

The function $a = f'(x_0) = \frac{df}{dx}(x_0) : \mathbb{R} \rightarrow \mathbb{R}$ is called the [first] derivative of f.

If f' is continuous [at some point, in some subdomain], then f is called continuously differentiable [at this point, in this subdomain].

Differentiability
Multivariate Scalar Functions

Let $D \subseteq \mathbb{R}^n$ be the open domain of the multivariate scalar function $f : D \rightarrow \mathbb{R}$. The function f is differentiable at point $x_0 \in D$ if there is a vector $a \in \mathbb{R}^n$ such that

  $f(x) = f(x_0) + a \cdot (x - x_0) + r$

with asymptotically vanishing remainder $r = r(x) \in \mathbb{R}$, that is,

  $\lim_{x \to x_0} \frac{r(x)}{\| x - x_0 \|} = 0 .$

Differentiability in $B \subseteq D$ requires differentiability at all $x_0 \in B$.

$a = \nabla f(x_0) = \frac{df}{dx}(x_0) : \mathbb{R}^n \rightarrow \mathbb{R}^n$ is called the gradient of f.

If $\nabla f$ is continuous [at some point, in some subdomain], then f is called continuously differentiable [at this point, in this subdomain].

Gradient and Partial Derivatives

Let $D \subseteq \mathbb{R}^n$ be the open domain of a continuously differentiable multivariate scalar function $f : D \rightarrow \mathbb{R}$.

The [partial] derivative of $y = f(x)$, $x = (x_i)_{i=0,\ldots,n-1}$, at point $x_0$ with respect to $x_j$ is denoted as

  $f_{x_j}(x_0) \equiv \frac{df}{dx_j}(x_0) .$

The vector

  $\nabla f(x_0) \equiv \begin{pmatrix} f_{x_0}(x_0) \\ \vdots \\ f_{x_{n-1}}(x_0) \end{pmatrix} \in \mathbb{R}^n$

is called the gradient of f at point $x_0$.

We write $\nabla f$ to denote the vector of partial derivatives of f with respect to all its arguments.

Gradients
Example

The gradient of the Rosenbrock function

  $y = f(x) = (1 - x_0)^2 + 100 \, (x_1 - x_0^2)^2$

is a vector $\nabla f \in \mathbb{R}^2$ defined as

  $\nabla f = \nabla f(x) = \begin{pmatrix} 400\, x_0^3 + 2\, x_0 - 400\, x_1 x_0 - 2 \\ 200\, x_1 - 200\, x_0^2 \end{pmatrix} .$

It vanishes identically at x = (1, 1) due to a local extremum.
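A minimal cross-check (not on the slide; the helper name is mine) that evaluates the function and the gradient above and confirms that the gradient vanishes at the minimum x = (1, 1):

#include <cstdio>

// Rosenbrock function and its analytic gradient, as derived above
void f_and_gradient(double x0, double x1, double& y, double g[2]) {
  y = (1 - x0) * (1 - x0) + 100 * (x1 - x0 * x0) * (x1 - x0 * x0);
  g[0] = 400 * x0 * x0 * x0 + 2 * x0 - 400 * x1 * x0 - 2;
  g[1] = 200 * x1 - 200 * x0 * x0;
}

int main() {
  double y, g[2];
  f_and_gradient(1, 1, y, g); // expect y = 0 and g = (0, 0)
  printf("y=%g g=(%g,%g)\n", y, g[0], g[1]);
}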


Gradients
Example

[Surface / contour plot of the Rosenbrock function.]

Differentiability
Multivariate Vector Functions

Let $D \subseteq \mathbb{R}^n$ be the open domain of the multivariate vector function $F : D \rightarrow \mathbb{R}^m$. The function F is differentiable at point $x_0 \in D$ if there is a matrix $A \in \mathbb{R}^{m \times n}$ such that

  $F(x) = F(x_0) + A \cdot (x - x_0) + r$

with asymptotically vanishing remainder $r = r(x) \in \mathbb{R}^m$, that is,

  $\lim_{x \to x_0} \frac{\| r(x) \|}{\| x - x_0 \|} = 0 .$

Differentiability in $B \subseteq D$ requires differentiability at all $x_0 \in B$.

$A = \nabla F(x_0) = \frac{dF}{dx}(x_0) : \mathbb{R}^n \rightarrow \mathbb{R}^{m \times n}$ is called the Jacobian of F.

If $\nabla F$ is continuous [at some point, in some subdomain], then F is called continuously differentiable [at this point, in this subdomain].

Jacobians

Let $D \subseteq \mathbb{R}^n$ be an open domain and let

  $F \equiv (F_i)_{i=0,\ldots,m-1} : D \rightarrow \mathbb{R}^m$

be continuously differentiable on D. The matrix

  $\nabla F(x_0) \equiv \begin{pmatrix} (\nabla F_0(x_0))^T \\ \vdots \\ (\nabla F_{m-1}(x_0))^T \end{pmatrix} \in \mathbb{R}^{m \times n}$

containing the gradients of each of the m components $F_i$ of F as rows is called the Jacobian of F at point $x_0$.

Assumption: Differentiability

In Algorithmic Differentiation we assume differentiability of the target functions as well as of their implementations as algorithms / computer programs. However, there are methods for handling non-differentiability and even (mild) discontinuities.

[Figure: graph of y = F(x) with a kink inside the interval (-eps, eps).]

Hessians and Higher Derivative Tensors

Let $D \subseteq \mathbb{R}^n$ be an open domain and let $F : D \rightarrow \mathbb{R}^m$ be twice continuously differentiable on D. Let $F' \equiv \nabla F$ denote the Jacobian of F. The 3-tensor of all second partial derivatives

  $\nabla^2 F(x_0) \equiv \left( \frac{\partial^2 F_i}{\partial x_j \, \partial x_k}(x_0) \right)_{i=0,\ldots,m-1;\; j,k=0,\ldots,n-1} \in \mathbb{R}^{m \times n \times n}$

is called the Hessian of F at point $x_0$.

k-th derivative tensors are defined recursively as Jacobians of (k-1)-th derivatives.

Hessians
Example

[Worked Hessian example omitted (figure).]

Taylor Expansion

Every sufficiently often differentiable function $f : \mathbb{R} \supseteq (a, b) \rightarrow \mathbb{R}$ can be represented for $x, x + h \in (a, b)$ and $h \in \mathbb{R}$, $h > 0$, as a Taylor expansion

  $f(x + h) = f(x) + h f'(x) + O(h^2) = f(x) + \frac{h}{1!} f'(x) + \frac{h^2}{2!} f''(x) + \frac{h^3}{3!} f'''(x) + \ldots$

It follows

  $f(x - h) = f(x) - h f'(x) + O(h^2) = f(x) - \frac{h}{1!} f'(x) + \frac{h^2}{2!} f''(x) - \frac{h^3}{3!} f'''(x) + \ldots$

$O(h^2)$: the absolute error decreases as $h^2$ ($h^3, \ldots$), i.e., an approximation of 2nd (3rd, ...) order of $f(x + h)$. For $h < 1$ we get $O(h^{k+1}) < O(h^k)$, $k \ge 1$ (better approximation of $f(x \pm h)$).

e.g., $f(x) = x + 1 \Rightarrow f(h) = f(0) + h \cdot f'(0) = 1 + h$.

Taylor Expansion
Illustration

int main() {
  double x=1;
  for (double h=1e-1;h>=1e-4;h=h/10)
    cout << h << "\t"
         << abs(sin(x+h)-(sin(x)+h*cos(x)))
         << endl;
  cout << endl;
  for (double h=1e-1;h>=1e-4;h=h/10)
    cout << h << "\t"
         << abs(sin(x+h)-(sin(x)+h*cos(x)-h*h/2*sin(x)))
         << endl;
  cout << endl;
}

Output:
0.1     0.00429385533327507
0.01    4.21632485627078e-05
0.001   4.20825507812877e-07
0.0001  4.2074449518501e-09

0.1     8.65004092356039e-05
0.01    8.96993222595979e-08
0.001   9.00153702661569e-11
0.0001  9.0023962676794e-14

The difference between the exact value sin(x + h) and the Taylor approximation of 2nd resp. 3rd order decreases as $h^2$ resp. $h^3$.

Chain Rule (of Differential Calculus)
Univariate Scalar Functions

Let $y = f(x) : D_f \subseteq \mathbb{R} \rightarrow I_f \subseteq \mathbb{R}$ be defined over $D_f$ and let

  $y = f(x) = g(h(x)) = g(v)$

be such that both g and h are continuously differentiable over their respective domains $D_g = I_h$ and $D_h = D_f$. Then f is continuously differentiable over $D_f$ and

  $\frac{df}{dx}(x^*) = \frac{dg}{dx}(v^*) = \frac{dg}{dv}(v^*) \cdot \frac{dh}{dx}(x^*)$

for all $x^* \in D_f$ and $v^* = h(x^*)$.

Chain Rule
Multivariate Vector Functions (Standard Formulation)

Let $y = F(x) : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m$ be defined over $D_F$ and let

  $y = F(x) = G(H(x)) = G(z)$

be such that both G and H are continuously differentiable over their respective domains $D_G = I_H$ and $D_H = D_F$. Then F is continuously differentiable over $D_F$ and

  $\frac{dF}{dx}(x^*) = \frac{dG}{dz}(z^*) \cdot \frac{dH}{dx}(x^*)$

for all $x^* \in D_F$ and $z^* = H(x^*)$.

The proof follows immediately from the product of the two Jacobians.

Chain Rule
Multivariate Vector Functions (Generalization)

Let $y = F(x) : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m$ be defined over $D_F$ and let

  $y = F(x) = G(H(x), x) = G(z, x)$

be such that both $G : D_G \subseteq \mathbb{R}^p \times \mathbb{R}^n \rightarrow I_G \subseteq \mathbb{R}^m$ and $H : D_H \subseteq \mathbb{R}^n \rightarrow I_H \subseteq \mathbb{R}^p$ are continuously differentiable over their respective domains $D_G = I_H \times D_F$ and $D_H \subseteq D_F$. Then F is continuously differentiable over $D_F$ and

  $\frac{dF}{dx}(x^*) = \frac{dG}{dx}(z^*, x^*) = \frac{dG}{dz}(z^*, x^*) \cdot \frac{dH}{dx}(x^*) + \frac{\partial G}{\partial x}(z^*, x^*)$

for all $x^* \in D_F$ and $z^* = H(x^*)$.

Notation: $\frac{\partial G}{\partial x}$ incomplete derivative; $\frac{dG}{dx}$ [complete] derivative

Chain Rule
Multivariate Vector Functions (Generalization): Proof

Let $u, v, w \in \mathbb{R}^{n+p+m}$ such that $u = (x \; 0 \; 0)^T$, $v = \tilde H(u) = (x \; H(x) \; 0)^T$, and $w = \tilde G(v) = (x \; H(x) \; G(H(x), x))^T$.

By the standard chain rule

  $\frac{d\tilde F}{du} = \frac{d\tilde G}{dv} \cdot \frac{d\tilde H}{du} = \begin{pmatrix} I_n & 0 & 0 \\ 0 & I_p & 0 \\ \frac{\partial G}{\partial x} & \frac{dG}{dz} & 0 \end{pmatrix} \cdot \begin{pmatrix} I_n & 0 & 0 \\ \frac{dH}{dx} & 0 & 0 \\ 0 & 0 & 0 \end{pmatrix} = \begin{pmatrix} I_n & 0 & 0 \\ \frac{dH}{dx} & 0 & 0 \\ \frac{\partial G}{\partial x} + \frac{dG}{dz} \frac{dH}{dx} & 0 & 0 \end{pmatrix}$

Hence,

  $\frac{dF}{dx} = \frac{dy}{dw} \cdot \frac{dw}{du} \cdot \frac{du}{dx} = Q_m \cdot \frac{d\tilde F}{du} \cdot P_n^T = \frac{\partial G}{\partial x} + \frac{dG}{dz} \cdot \frac{dH}{dx}$

where $P_n = (I_n \; 0) \in \mathbb{R}^{n \times (n+p+m)}$ and $Q_m = (0 \; I_m) \in \mathbb{R}^{m \times (n+p+m)}$, with $I_k$ denoting the identity matrix in $\mathbb{R}^{k \times k}$, $k \in \{n, m\}$, and using appropriate numbers of zero padding columns. □

Chain Rule
Complete vs. Incomplete Derivatives

Let $y = F(x) : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m$ be defined over $D_F$ and let

  $y = F(x) = G(H_1(x), \ldots, H_k(x))$

be such that G and all $H_i$, $i = 1, \ldots, k$, are continuously differentiable over their respective domains.

Then F is continuously differentiable over $D_F$ and its [complete] derivative (Jacobian) $\frac{dF}{dx}$ is uniquely defined. Moreover, there are $2^k$ incomplete derivatives $\frac{\partial F}{\partial x}$ due to considering subsets of the $H_i$ as constant (or passive).

For notational simplicity we avoid precise annotation of incomplete derivatives. The incomplete derivatives referred to will be well defined by the context.

Chain Rule
Complete vs. Incomplete Derivatives: Example

Let

  $y = f(x) = g(h_1(x), h_2(x)) = \sin(x) \cdot \cos(x) .$

By the chain rule the complete [derivative] becomes

  $\frac{df}{dx} = \cos(x)^2 - \sin(x)^2 .$

The following $2^2 = 4$ incomplete derivatives exist:

  $\frac{\partial f}{\partial x} \in \left\{ 0, \; \cos(x)^2, \; -\sin(x)^2, \; \cos(x)^2 - \sin(x)^2 \right\},$

where, admittedly, the vanishing derivative due to assuming no dependence on x could be considered as obsolete.

Chain Rule
Graphical Illustration

[Figure omitted.]

Notation
Partial and Total Derivatives

For $F : \mathbb{R}^n \rightarrow \mathbb{R}^m : y = F(x)$ we use the following equivalent notations for total derivatives

  $F'(x) \equiv \nabla F(x) \equiv \frac{dF(x)}{dx}, \qquad F''(x) \equiv \nabla^2 F(x) \equiv \frac{d^2 F(x)}{dx^2}, \qquad F'''(x) \; \ldots$

and partial derivatives

  $F_{x_i}(x) \equiv \nabla_{x_i} F(x) \equiv \frac{dF(x)}{dx_i}, \qquad F_{x_i,x_j}(x) \equiv \nabla_{x_i,x_j} F(x) \equiv \frac{d^2 F(x)}{dx_i \, dx_j} \; \left( \equiv \frac{d^2 F(x)}{dx_i^2} \text{ for } i = j \right), \qquad F_{x_i,x_j,x_k}(x) \; \ldots$

Derivatives in Numerical Methods
Newton's Method for Systems of Nonlinear Equations

[Illustration omitted.]

Derivatives in Numerical Methods
Newton's Method for Systems of Nonlinear Equations I

Wanted: Solution to the system of n nonlinear equations F(x) = 0

In:
  implementation of the residual y at the current point $x \in \mathbb{R}^n$:
    $F : \mathbb{R}^n \rightarrow \mathbb{R}^n, \; y = F(x)$
  implementation of the Jacobian $A \equiv \nabla F(x)$ of the residual at the current point x:
    $F' : \mathbb{R}^n \rightarrow \mathbb{R}^{n \times n}, \; A = F'(x)$
  solver for computing the Newton step $dx \in \mathbb{R}^n$ as the solution of the linear Newton system $A \cdot dx = -y$:
    $s : \mathbb{R}^n \times \mathbb{R}^{n \times n} \rightarrow \mathbb{R}^n, \; dx = s(y, A)$
  starting point: $x \in \mathbb{R}^n$
  upper bound on the norm of the residual $\| F(x) \|$ at the approximate solution: $\varepsilon \in \mathbb{R}$

Out:
  approximate solution of the nonlinear system F(x) = 0: $x \in \mathbb{R}^n$

Derivatives in Numerical Methods
Newton's Method for Systems of Nonlinear Equations II

Algorithm:
1: y = F(x)
2: while $\| y \| > \varepsilon$ do
3:   A = F'(x)
4:   dx = s(y, A)
5:   x ← x + dx
6:   y = F(x)
7: end while
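A compact sketch of this driver loop (not the course's reference implementation; see the tutorial), instantiated for a scalar residual so the linear solve reduces to a division:

#include <cmath>
#include <cstdio>

// scalar instance (n = 1): F(x) = x*x - 2, F'(x) = 2*x
double F(double x) { return x * x - 2; }
double dFdx(double x) { return 2 * x; }

int main() {
  double x = 1, eps = 1e-12;
  double y = F(x);
  while (std::fabs(y) > eps) {
    double A = dFdx(x);
    double dx = -y / A; // Newton step: solve A*dx = -y
    x += dx;
    y = F(x);
  }
  printf("x=%.15g\n", x); // approximately sqrt(2)
}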


Derivatives in Numerical Methods
Newton's Method for Systems of Nonlinear Equations

See tutorial for implementation.

Derivatives in Numerical Methods
Gradient Descent for Nonlinear Optimization

[Illustration omitted.]

Derivatives in Numerical Methods
Gradient Descent for Nonlinear Optimization I

Wanted: Solution to $\operatorname{argmin}_{x \in \mathbb{R}^n} f(x)$

In:
  implementation of the objective $y \in \mathbb{R}$ at the current point $x \in \mathbb{R}^n$:
    $f : \mathbb{R}^n \rightarrow \mathbb{R}, \; y = f(x)$
  implementation of f' for computing the objective $y \equiv f(x)$ and its gradient $g \equiv \nabla f(x)$ at the current point x:
    $f' : \mathbb{R}^n \rightarrow \mathbb{R} \times \mathbb{R}^n, \; (y, g) = f'(x)$
  starting point: $x \in \mathbb{R}^n$
  upper bound on the gradient norm $\| g \|$ at the approximate minimal point: $\varepsilon \in \mathbb{R}$

Out:
  approximate minimal value of the objective: $y \in \mathbb{R}$
  approximate minimal point: $x \in \mathbb{R}^n$

Derivatives in Numerical Methods
Gradient Descent for Nonlinear Optimization II

Algorithm:
1: repeat
2:   (y, g) = f'(x)
3:   if $\| g \| > \varepsilon$ then
4:     α ← 1
5:     ỹ ← y
6:     while ỹ ≥ y do
7:       x̃ ← x - α · g
8:       ỹ = f(x̃)
9:       α ← α/2
10:    end while
11:    x ← x̃
12:  end if
13: until $\| g \| \le \varepsilon$
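A minimal sketch of this loop (my transcription, assuming an analytic gradient of the running example objective rather than AD):

#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// objective y = (sum x_i^2)^2
double f(const vector<double>& x) {
  double s = 0;
  for (double xi : x) s += xi * xi;
  return s * s;
}

// value and analytic gradient g_i = 4 * (sum x_j^2) * x_i
double fg(const vector<double>& x, vector<double>& g) {
  double s = 0;
  for (double xi : x) s += xi * xi;
  for (size_t i = 0; i < x.size(); i++) g[i] = 4 * s * x[i];
  return s * s;
}

int main() {
  const size_t n = 4;
  const double eps = 1e-8;
  vector<double> x(n, 1), g(n), xt(n);
  for (;;) {
    double y = fg(x, g);
    double gnorm = 0;
    for (double gi : g) gnorm += gi * gi;
    gnorm = sqrt(gnorm);
    if (gnorm <= eps) break;
    double alpha = 1, yt = y;
    do { // halve the step size until descent is achieved
      for (size_t i = 0; i < n; i++) xt[i] = x[i] - alpha * g[i];
      yt = f(xt);
      alpha /= 2;
    } while (yt >= y);
    x = xt;
  }
  printf("f(x*)=%g\n", f(x));
}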


Derivatives in Numerical Methods
Gradient Descent for Nonlinear Optimization

See tutorial for implementation.

Derivatives in Numerical Methods
Gradient Descent for Nonlinear Optimization Race

Consider

  $\operatorname{argmin}_{x \in \mathbb{R}^n} \left( \sum_{i=0}^{n-1} x_i^2 \right)^2$

Run time (s) of the gradient by finite differences, tangent AD, and adjoint AD:

  n     | FD: f(x + e_i h) | tangent f^(1) | adjoint f_(1)
  100   | 13               | 8             | <1
  200   | 47               | 28            | 1
  300   | 104              | 63            | 2
  400   | 184              | 113           | 2.5
  500   | 284              | 173           | 3
  1000  | 1129             | 689           | 6

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization

[Illustration omitted.]

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization I

Wanted: Solution to $\operatorname{argmin}_{x \in \mathbb{R}^n} f(x)$

In:
  implementation of the objective $y \in \mathbb{R}$ at the current point $x \in \mathbb{R}^n$:
    $f : \mathbb{R}^n \rightarrow \mathbb{R}, \; y = f(x)$
  implementation of the differentiated objective function f' for computing the objective $y \equiv f(x)$ and its gradient $g \equiv \nabla f(x)$ at the current point x:
    $f' : \mathbb{R}^n \rightarrow \mathbb{R} \times \mathbb{R}^n, \; (y, g) = f'(x)$
  implementation of the differentiated objective function f'' for computing the objective y, its gradient g, and its Hessian $H \equiv \nabla^2 f(x)$ at the current point x:
    $f'' : \mathbb{R}^n \rightarrow \mathbb{R} \times \mathbb{R}^n \times \mathbb{R}^{n \times n}, \; (y, g, H) = f''(x)$
  solver to determine the Newton step $dx \in \mathbb{R}^n$ as the solution of the linear Newton system $H \cdot dx = -g$:
    $s : \mathbb{R}^n \times \mathbb{R}^{n \times n} \rightarrow \mathbb{R}^n, \; dx = s(g, H)$
  starting point: $x \in \mathbb{R}^n$

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization II

upper bound on the gradient norm $\| g \|$ at the approximate solution: $\varepsilon \in \mathbb{R}$

Out:
  approximate minimal value: $y \in \mathbb{R}$
  approximate minimal point: $x \in \mathbb{R}^n$

Algorithm:
1: (y, g) = f'(x)
2: while $\| g \| > \varepsilon$ do
3:   (y, g, H) = f''(x)
4:   dx = s(g, H)
5:   α ← 1
6:   ỹ ← y
7:   x̃ ← x
8:   while ỹ ≥ y do
9:     x̃ ← x + α · dx
10:    ỹ = f(x̃)
11:    α ← α/2

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization III

12:   end while
13:   x ← x̃
14:   (y, g) = f'(x)
15: end while

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization

See tutorial for implementation.

Derivatives in Numerical Methods
Newton-CG Method for Nonlinear Optimization I

Wanted: Newton step by the Conjugate Gradient (CG) method

In:
  implementation of the tangent-linear residual F^(1) for computing the residual $y \equiv F(x)$ and its directional derivative $y^{(1)} \equiv \nabla F(x) \cdot x^{(1)}$ in the tangent-linear direction $x^{(1)} \in \mathbb{R}^n$ at the current point $x \in \mathbb{R}^n$:
    $F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}^n \times \mathbb{R}^n, \; (y, y^{(1)}) = F^{(1)}(x, x^{(1)})$
  starting point for the Newton step: $dx \equiv x^{(1)} \in \mathbb{R}^n$
  upper bound on the norm of the residual $\| -y - \nabla F(x) \cdot dx \|$ at the approximate solution for the Newton step: $\varepsilon \in \mathbb{R}$

Out:
  approximate solution for the Newton step: $dx \in \mathbb{R}^n$

Derivatives in Numerical Methods
Newton-CG Method for Nonlinear Optimization II

Algorithm:
1: x^(1) ← dx
2: (y, y^(1)) ← F^(1)(x, x^(1))
3: p ← -y - y^(1)
4: r ← p
5: while $\| r \| \ge \varepsilon$ do
6:   x^(1) ← p
7:   (y, y^(1)) ← F^(1)(x, x^(1))
8:   α ← rᵀr / (pᵀ y^(1))
9:   dx ← dx + α · p
10:  r_prev ← r
11:  r ← r - α · y^(1)
12:  β ← rᵀr / (r_prevᵀ r_prev)
13:  p ← r + β · p
14: end while
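The algorithm only ever touches the Hessian through products H · p, which is exactly what one tangent sweep of a gradient code delivers. A sketch (my transcription; the Hessian-vector product below is hand-derived for the running example objective, standing in for the tangent-linear gradient code F^(1)):

#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// For f(x) = (sum x_j^2)^2 with s = sum x_j^2: gradient g_i = 4*s*x_i,
// Hessian-vector product (H*v)_i = 8*(x.v)*x_i + 4*s*v_i.
void grad(const vector<double>& x, vector<double>& g) {
  double s = 0; for (double xi : x) s += xi * xi;
  for (size_t i = 0; i < x.size(); i++) g[i] = 4 * s * x[i];
}
void hessvec(const vector<double>& x, const vector<double>& v, vector<double>& hv) {
  double s = 0, xv = 0;
  for (size_t i = 0; i < x.size(); i++) { s += x[i] * x[i]; xv += x[i] * v[i]; }
  for (size_t i = 0; i < x.size(); i++) hv[i] = 8 * xv * x[i] + 4 * s * v[i];
}
double dot(const vector<double>& a, const vector<double>& b) {
  double d = 0; for (size_t i = 0; i < a.size(); i++) d += a[i] * b[i]; return d;
}

int main() {
  const size_t n = 4; const double eps = 1e-12;
  vector<double> x(n, 1), g(n), dx(n, 0), p(n), r(n), hv(n);
  grad(x, g);
  hessvec(x, dx, hv);
  for (size_t i = 0; i < n; i++) r[i] = -g[i] - hv[i]; // r = -g - H*dx
  p = r;
  while (sqrt(dot(r, r)) >= eps) {
    hessvec(x, p, hv);                        // y^(1) = H*p, one "tangent sweep"
    double alpha = dot(r, r) / dot(p, hv);
    for (size_t i = 0; i < n; i++) dx[i] += alpha * p[i];
    double rr_old = dot(r, r);
    for (size_t i = 0; i < n; i++) r[i] -= alpha * hv[i];
    double beta = dot(r, r) / rr_old;
    for (size_t i = 0; i < n; i++) p[i] = r[i] + beta * p[i];
  }
  printf("Newton step computed; dx[0]=%g\n", dx[0]);
}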


Derivatives in Numerical Methods
Newton-CG Method for Nonlinear Optimization

See tutorial for implementation.

Derivatives in Numerical Methods
Newton's Method for Nonlinear Optimization: Race

Consider

  $\operatorname{argmin}_{x \in \mathbb{R}^n} \left( \sum_{i=0}^{n-1} x_i^2 \right)^2$

Run time (s) of Hessian (and Hessian-vector) computation:

  n    | FD: f(x + e_i h) | f^(1,2) | f^(2)_(1) | f^(2)_(1) · v
  100  | <1               | <1      | <1        | <1
  200  | 2                | 1       | <1        | <1
  300  | 7                | 3       | 1         | <1
  400  | 17               | 9       | 4         | <1
  500  | 36               | 21      | 10        | <1
  1000 | 365              | 231     | 138       | <1
  ⋮    | ⋮                | ⋮       | ⋮         | ⋮
  10⁵  | >10⁴             | >10⁴    | >10⁴      |

→ the matrix-free Newton-CG method is not dominated by the cost of the linear solver

Optimality
Poem / Song ... I

I'm sittin' in front of the computer screen.
Newton's second iteration is what I've just seen.
It's not quite the progress that I would expect
from a code such as mine, no doubt it must be perfect!
Just the facts are not supportive, and I wonder ...

My linear solver is state-of-the-art.
It does not get better wherever I start.
For differentiation, is there anything else?
Perturbing the inputs, can't imagine this fails.
I pick a small Epsilon, and I wonder ...

I wonder how, but I still give it a try.
The next change in step size is bound to fly.
'cause all I'd like to see is simply optimality.

Epsilon, in fact, appears to be rather small.
A factor of ten should improve it all.

Optimality
Poem / Song ... II

'cause all I'd like to see is nearly optimality.
A DAD ADADA DAD ADADA DADAD.

A few hours later my talk's getting rude.
The sole thing descending seems to be my mood.
How can guessing the Hessian only take this much time?
N squared function runs appear to be the crime.
The facts support this thesis, and I wonder ...

Isolation due to KKT.
Isolation, why not simply drop feasibility?

The guy next door's been sayin' again and again:
An adjoint Lagrangian might relieve my pain.
Though I don't quite believe him, I surrender.

I wonder how, but I still give it a try:

Optimality
Poem / Song ... III

Gradients and Hessians in the blink of an eye.
Still all I'd like to see is simply optimality.

Epsilon itself has finally disappeared.
Reverse mode AD works, no matter how weird,
and I'm about to see local optimality.

Yes, I wonder, I wonder ...
I wonder how, but I still give it a try:
Gradients and Hessians in the blink of an eye.
Still all I'd like to see ...
I really need to see ...
now I can finally see my cherished optimality :-)

Numerical Approximation of First Derivatives
Finite Difference Quotients

We consider implementations of multivariate vector functions

  $F : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m : y = F(x)$

as computer programs. F is assumed to be (k times) continuously differentiable over its entire (open) domain², implying the existence of the Jacobian (Hessian etc.)

  $\nabla F(x) \equiv \frac{dy}{dx} = \frac{dF}{dx} \in \mathbb{R}^{m \times n},$

the individual columns of which can be approximated at all points $x^* \in D_F$ by forward, central, and backward finite difference quotients as follows:

  $\nabla F(x^*) \approx \left( \frac{F(x^* + h\, e_i) - F(x^*)}{h} \right)_{i=0}^{n-1}$

  $\nabla F(x^*) \approx \left( \frac{F(x^* + h\, e_i) - F(x^* - h\, e_i)}{2h} \right)_{i=0}^{n-1}$

  $\nabla F(x^*) \approx \left( \frac{F(x^*) - F(x^* - h\, e_i)}{h} \right)_{i=0}^{n-1}$

² to
be relaxed later
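A minimal sketch (not from the slides; F is a stand-in example) that approximates the Jacobian column-wise by the forward difference quotient above:

#include <cmath>
#include <cstdio>
#include <vector>
using namespace std;

// example F : R^2 -> R^2
void F(const vector<double>& x, vector<double>& y) {
  y[0] = sin(x[0] * x[1]);
  y[1] = x[0] / x[1];
}

// forward finite difference approximation of the m x n Jacobian, one column per input
void jacobian_ffd(vector<double> x, vector<vector<double>>& J, double h) {
  size_t n = x.size(), m = J.size();
  vector<double> y0(m), y1(m);
  F(x, y0);
  for (size_t i = 0; i < n; i++) {
    x[i] += h;
    F(x, y1);
    x[i] -= h;
    for (size_t j = 0; j < m; j++) J[j][i] = (y1[j] - y0[j]) / h;
  }
}

int main() {
  vector<double> x = {1, 2};
  vector<vector<double>> J(2, vector<double>(2));
  jacobian_ffd(x, J, 1e-8); // h near sqrt(machine eps), cf. the rule of thumb below
  printf("J = [[%g, %g], [%g, %g]]\n", J[0][0], J[0][1], J[1][0], J[1][1]);
}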


Numerical Approximation of First Derivatives
Finite Difference Quotients

[Illustration: secants approximating the tangent for forward, backward, and central differences.]

Accuracy of Forward/Backward Finite Differences

W.l.o.g., let m = 1. For $x = x_0 + h$ we get

  $f(x_0 + h) = f(x_0) + \frac{df}{dx}(x_0)\, h + \frac{1}{2!} \frac{d^2 f}{dx^2}(x_0)\, h^2 + \frac{1}{3!} \frac{d^3 f}{dx^3}(x_0)\, h^3 + \ldots$

and similarly for $x = x_0 - h$

  $f(x_0 - h) = f(x_0) - \frac{df}{dx}(x_0)\, h + \frac{1}{2!} \frac{d^2 f}{dx^2}(x_0)\, h^2 - \frac{1}{3!} \frac{d^3 f}{dx^3}(x_0)\, h^3 + \ldots$

Truncation after the respective first derivative terms yields scalar univariate versions of forward and backward finite difference quotients, respectively, e.g., from

  $f(x_0 + h) = f(x_0) + h \frac{df}{dx}(x_0) + O(h^2) .$

For 0 < h << 1 the truncation error is dominated by the value of the h² term, which implies that only accuracy up to the order of h (= h¹ and hence first-order accuracy) can be expected, e.g.,

  $\frac{df}{dx}(x_0) = \frac{f(x_0 + h) - f(x_0) + O(h^2)}{h} = \frac{f(x_0 + h) - f(x_0)}{h} + O(h) .$

Accuracy of Central Finite Differences

Second-order accuracy (h²) follows immediately from the previous Taylor expansions. Their subtraction yields

  $f(x_0 + h) - f(x_0 - h) = 2 \frac{df}{dx}(x_0)\, h + \frac{2}{3!} \frac{d^3 f}{dx^3}(x_0)\, h^3 + \ldots$

Truncation after the first derivative term yields the scalar univariate version of the central finite difference quotient. For small values of h the truncation error is dominated by the value of the h³ term, which implies that only accuracy up to the order of h² (second-order accuracy) can be expected, i.e.,

  $\frac{df}{dx}(x_0) = \frac{f(x_0 + h) - f(x_0 - h) + O(h^3)}{2h} = \frac{f(x_0 + h) - f(x_0 - h)}{2h} + O(h^2) .$

Accuracy of Finite Differences
Case Study (sin.cpp)

double f(double x) { return sin(x); }

int main() {
  double x=1;
  for (double h=1e-1;h>=1e-15;h=h/10)
    cout << h << "\t" << (f(x+h)-f(x))/h << "\t"
         << (f(x+h/2)-f(x-h/2))/h << "\t" << cos(x) << endl;
}

h       FFD                  CFD                  EXACT
0.1     0.497363752535389    0.540077208046432    0.54030230586814
0.01    0.536085981011869    0.540300054611342    0.54030230586814
0.001   0.539881480360327    0.540302283355554    0.54030230586814
0.0001  0.540260231418621    0.540302305643836    0.54030230586814
1e-05   0.540298098505865    0.540302305873652    0.54030230586814
1e-06   0.54030188512133     0.540302305895857    0.54030230586814
1e-07   0.540302264040449    0.540302306228924    0.54030230586814
1e-08   0.540302302898254    0.540302291796024    0.54030230586814
1e-09   0.540302358409406    0.540302358409406    0.54030230586814
1e-10   0.540302247387103    0.540303357610128    0.54030230586814
1e-11   0.540301137164079    0.540301137164079    0.54030230586814
1e-12   0.540345546085064    0.540345546085064    0.54030230586814
1e-13   0.539568389967826    0.539568389967826    0.54030230586814
1e-14   0.544009282066327    0.544009282066327    0.54030230586814
1e-15   0.555111512312578    0.555111512312578    0.54030230586814

Real Numbers
Floating-Point Format

Real numbers $x \in \mathbb{R}$ are represented internally as floating-point numbers with base $\beta$, accuracy t, and exponent range [L, U] as follows:

  $x = \pm \left( d_0 + \frac{d_1}{\beta} + \frac{d_2}{\beta^2} + \ldots + \frac{d_{t-1}}{\beta^{t-1}} \right) \cdot \beta^e$

where $0 \le d_i \le \beta - 1$ for $i = 0, \ldots, t-1$ and $L \le e \le U$. The base-$\beta$ sequence of digits $m = d_0 d_1 \ldots d_{t-1}$ is called mantissa and e is called exponent.

A floating-point system is called normalized if $d_0 \neq 0$ for $x \neq 0$, i.e., $1 \le m < \beta$.

Floating-Point Numbers
Example

Let $\beta = 2$, t = 3 and [L, U] = [-1, 1]. The corresponding normalized floating-point system contains the following 25 elements:

  0
  ±1.00₂ · 2⁻¹ = ±0.5₁₀,    ±1.01₂ · 2⁻¹ = ±0.625₁₀
  ±1.10₂ · 2⁻¹ = ±0.75₁₀,   ±1.11₂ · 2⁻¹ = ±0.875₁₀
  ±1.00₂ · 2⁰ = ±1₁₀,       ±1.01₂ · 2⁰ = ±1.25₁₀
  ±1.10₂ · 2⁰ = ±1.5₁₀,     ±1.11₂ · 2⁰ = ±1.75₁₀
  ±1.00₂ · 2¹ = ±2₁₀,       ±1.01₂ · 2¹ = ±2.5₁₀
  ±1.10₂ · 2¹ = ±3₁₀,       ±1.11₂ · 2¹ = ±3.5₁₀
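A small cross-check (not from the slides) that enumerates this system programmatically:

#include <cmath>
#include <cstdio>

// enumerate the normalized system with base 2, t = 3, exponents in [-1, 1];
// together with 0 and both signs this gives the 25 elements listed above
int main() {
  printf("0\n");
  for (int e = -1; e <= 1; e++)
    for (int m = 4; m <= 7; m++)       // mantissas 1.00, 1.01, 1.10, 1.11 = m/4
      printf("+/- %g\n", ldexp(m / 4.0, e));
}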


Floating-Point Numbers
Example: β = 2, t = 3, [L, U] = [-1, 1]

[Plot ("fps_ex.gnuplot") of the 25 floating-point numbers on the real line: the spacing between neighboring numbers grows with their magnitude.]

Floating-Point Number Types
Single and Double Precision

float uses 32 bits:

  23 bits for the mantissa
  8 bits for the exponent
  1 bit for the sign

and hence provides 6 significant digits in decimal notation with absolute minimum 1.17549e-38 and maximum 3.40282e+38.

double uses 64 bits:

  52 bits for the mantissa
  11 bits for the exponent
  1 bit for the sign

and hence provides 15 significant digits in decimal notation with absolute minimum 2.22507e-308 and maximum 1.79769e+308.
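These constants need not be memorized; a two-line check (mine, not from the slides) prints them straight from the standard header:

#include <cfloat>
#include <cstdio>

// print the limits quoted above directly from <cfloat>
int main() {
  printf("float:  digits=%d min=%g max=%g\n", FLT_DIG, FLT_MIN, FLT_MAX);
  printf("double: digits=%d min=%g max=%g\n", DBL_DIG, DBL_MIN, DBL_MAX);
}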


Impact of Perturbation
Case Study (h.cpp)

The above happens to be reasonably representative for the general case. Hence, perturbation of half of the mantissa appears to be a good rule of thumb.

#include <iostream>
#include <cmath>
#include <cfloat>
using namespace std;
int main() {
  cout.precision(15);
  double x1=1, x2=111111111111111;
  cout << x1+sqrt(DBL_EPSILON) << endl;
  cout << x2+sqrt(DBL_EPSILON) << endl;
  cout << x1+abs(x1)*sqrt(DBL_EPSILON) << endl;
  cout << x2+abs(x2)*sqrt(DBL_EPSILON) << endl;
}

Output:
1.00000001490116
111111111111111
1.00000001490116
111111112766796

Numerical Approximation of First Derivatives
Example

For example, let

  $y = \left( \sum_{i=0}^{n-1} x_i^2 \right)^2$

be implemented in C++ as

template<class T>
void f(const vector<T>& x, T &y) {
  y=0;
  for (size_t i=0; i<x.size(); i++) y=y+x[i]*x[i];
  y=y*y;
}

Numerical Approximation of First Derivatives
Example

We are looking for a routine fg(...) returning, for a given vector x of length n, the value y of f and its gradient g.

int main(int argc, char* argv[]) {
  assert(argc==2); cout.precision(15);
  size_t n=atoi(argv[1]);
  vector<double> x(n,0), g(n,0); double y=0;
  for (size_t i=0;i<n;i++) x[i]=cos(static_cast<double>(i));
  fg(x,y,g);
  cout << y << endl;
  for (size_t i=0;i<n;i++) cout << g[i] << endl;
  return 0;
}

Numerical Approximation of First Derivatives
Example

Live for $y = \left( \sum_{i=0}^{n-1} x_i^2 \right)^2$:

implementation of fg for given

template<typename T>
void fg(const vector<T>& x, T &y, vector<T>& g) {
  ...
}

build and run for increasing values of n

Numerical Approximation of First Derivatives
Example (fd.cpp)

A central finite difference approximation of the gradient can be computed alongside the function value as follows:

template<typename T>
void fg(const vector<T>& x, T &y, vector<T>& g) {
  size_t n=x.size();
  for (size_t i=0;i<n;i++) {
    vector<T> x_pp(x), x_mp(x); T y_pp=0, y_mp=0;
    double p=(x_mp[i]==0) ? sqrt(DBL_EPSILON)
                          : sqrt(DBL_EPSILON)*abs(x_mp[i]);
    x_mp[i]-=p; f(x_mp,y_mp);
    x_pp[i]+=p; f(x_pp,y_pp);
    g[i]=(y_pp-y_mp)/(2*p);
  }
  f(x,y);
}

... takes more than 1 minute for n = 10⁵ to produce the gradient with second-order accuracy.

Motivation
I can do better ... (ga1s.cpp)

The adjoint

#include "dco.hpp"
using namespace dco;
template<typename T>
void fg(const vector<T> &xv, T &yv, vector<T> &g) {
  typedef ga1s<T> DCO_M;
  typedef typename DCO_M::type DCO_T;
  typedef typename DCO_M::tape_t DCO_TAPE_T;
  ...
}
int main(int argc, char* argv[]) {
  ...
  fg(x,y,g);
  ...
}

... takes less than 1 second for n = 10⁵ to produce the gradient with machine accuracy.

Outline

Motivation, Terminology, Finite Differences

First Derivatives of Multivariate Vector Functions

Second Derivatives of Multivariate Scalar Functions

First Derivative Models
Outline

Linearized Single Assignment Code (lSAC)

Linearized Directed Acyclic Graph (lDAG)

Chain Rule on lDAG

First-Order Tangent Model
  Edge Elimination on Tangent-Augmented lDAG
  Implementation with dco/c++

First-Order Adjoint Model
  Vertex Elimination on Adjoint-Augmented lDAG
  Implementation with dco/c++

Chain Rule
Recall ...

Let $y = F(x) : D_F \subseteq \mathbb{R}^n \rightarrow I_F \subseteq \mathbb{R}^m$ be defined over $D_F$ and let

  $y = F(x) = G(H(x), x) = G(z, x)$

be such that both G and H are continuously differentiable over their respective domains $D_G = I_H \times D_F$ and $D_H \subseteq D_F$. Then F is continuously differentiable over $D_F$ and

  $\frac{dF}{dx}(x^*) = \frac{dG}{dx}(z^*, x^*) = \frac{dG}{dz}(z^*, x^*) \cdot \frac{dH}{dx}(x^*) + \frac{\partial G}{\partial x}(z^*, x^*)$

for all $x^* \in D_F$ and $z^* = H(x^*)$.

Notation: $\frac{\partial G}{\partial x}$ partial derivative; $\frac{dG}{dx}$ total derivative

Chain Rule
Recall ...

[Figure omitted.]

Algorithmic Differentiation (AD)
Properties of Feasible Target Functions I

1. The given implementation of $F : \mathbb{R}^n \rightarrow \mathbb{R}^m : y = F(x)$ can be decomposed into a single assignment code (SAC)

  $v_i = \varphi_i(x_i) = x_i \qquad i = 0, \ldots, n-1$
  $v_j = \varphi_j\left( (v_k)_{k \prec j} \right) \qquad j = n, \ldots, n+q-1$
  $y_k = \varphi_{n+q+k}(v_{n+p+k}) = v_{n+p+k} \qquad k = 0, \ldots, m-1$

where q = p + m and $k \prec j$ denotes a direct dependence of $v_j$ on $v_k$ as an argument of $\varphi_j$.

2. All elemental functions $\varphi_j$ possess continuous partial derivatives

  $d_{j,i} \equiv \frac{d\varphi_j}{dv_i}\left( (v_k)_{k \prec j} \right)$

with respect to their arguments $(v_k)_{k \prec j}$ at all points of interest.

Algorithmic Differentiation (AD)
Properties of Feasible Target Functions II

3. A linearized SAC (lSAC) is obtained by augmenting the elemental assignments with computations of the local partial derivatives $d_{j,i}$.

4. The SAC induces a directed acyclic graph (DAG) $G = G(F) = (V, E)$ with integer vertices $V = \{0, \ldots, n+q\}$ and edges $V \times V \supseteq E = \{(i, j) : i \prec j\}$.

5. The set of vertices representing the n inputs is denoted as $X \subseteq V$. The m outputs are collected in $Y \subseteq V$. All remaining intermediate vertices belong to $Z \subsetneq V$.

6. A linearized DAG (lDAG) is obtained by attaching the $d_{j,i}$ to the corresponding edges (i, j) in the DAG.

SAC

[Figure: example SAC and its induced DAG.]
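As a concrete illustration (the original slide shows a figure; this transcription uses the running example that reappears later), the SAC of $y = \sin(x_0 \cdot x_1) / x_1$ with n = 2 inputs, p = 2 intermediates, and m = 1 output, together with its local partial derivatives:

$\begin{aligned}
v_0 &= x_0 & v_1 &= x_1 \\
v_2 &= \varphi_2(v_0, v_1) = v_0 \cdot v_1 & d_{2,0} &= v_1, \quad d_{2,1} = v_0 \\
v_3 &= \varphi_3(v_2) = \sin(v_2) & d_{3,2} &= \cos(v_2) \\
v_4 &= \varphi_4(v_3, v_1) = v_3 / v_1 & d_{4,3} &= 1/v_1, \quad d_{4,1} = -v_3/v_1^2 \\
y &= v_4 &&
\end{aligned}$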

Chain Rule on Linearized DAG
Example

[Figure omitted.]

AD Graphically

SAC:   z := H(x);  y := G(z, x)

DAG:   1: x → 2: z[H] → 3: y[G], plus a direct edge 1: x → 3: y[G]

lDAG:  the same graph with edge labels $\frac{dH}{dx}$ on (1,2), $\frac{dG}{dz}$ on (2,3), and $\frac{\partial G}{\partial x}$ on (1,3)

  $\nabla F(x) \equiv \frac{dy}{dx} = \sum_{\text{path} \subseteq \text{lDAG}} \; \prod_{(i,j) \in \text{path}} d_{j,i}$

Chain Rule on lDAG
Proof (Idea)

[Figure omitted.]

Chain Rule on lDAG
Proof (Formally) I

Baseline is the SAC of $F : \mathbb{R}^n \rightarrow \mathbb{R}^m : y = F(x)$,

  $v_i = \varphi_i(x_i) = x_i \qquad i = 0, \ldots, n-1$
  $v_j = \varphi_j\left( (v_k)_{k \prec j} \right) \qquad j = n, \ldots, n+q-1$
  $y_k = \varphi_{n+q+k}(v_{n+p+k}) = v_{n+p+k} \qquad k = 0, \ldots, m-1$

with elemental partial derivatives

  $d_{j,i} \equiv \frac{d\varphi_j}{dv_i}\left( (v_k)_{k \prec j} \right)$

as introduced above.

Consider

  $\tilde F : \mathbb{R}^{n+q} \rightarrow \mathbb{R}^{n+q}$

defined as

  $v^q = \tilde F(v^0) = \Phi_q(\Phi_{q-1}(\ldots(\Phi_1(v^0))\ldots)),$

Chain Rule on lDAG
Proof (Formally) II

where $v^j = (v_k^j)_{k=0}^{n+q-1} = \Phi_j(v^{j-1})$ for $j = 1, \ldots, q$ and

  $v_k^j = \begin{cases} v_k^{j-1} & k - n + 1 < j \\ \varphi_k\left( (v_i^{j-1})_{i \prec k} \right) & k - n + 1 = j \\ 0 & k - n + 1 > j \end{cases} \qquad k = 0, \ldots, n+q-1 .$

  $v^0 = (x_0 \ldots x_{n-1} \; 0 \ldots 0)^T \;\Rightarrow\; v^q = (x_0 \ldots x_{n-1} \; v_n \ldots v_{n+p-1} \; y_0 \ldots y_{m-1})^T .$

By the chain rule

  $\frac{dy}{dx} = \frac{dy}{dv^q} \cdot \frac{dv^q}{dv^0} \cdot \frac{dv^0}{dx} = \frac{dy}{dv^q} \cdot \frac{d\Phi_q}{dv^{q-1}} \cdots \frac{d\Phi_1}{dv^0} \cdot \frac{dv^0}{dx}$

Chain Rule on lDAG
Proof (Formally) III

yielding an expression of the Jacobian as a chained (sparse) matrix product. It remains to be shown that this chain corresponds exactly to

  $\frac{dy}{dx} = \sum_{(x \to y) \subseteq G(F)} \; \prod_{(i,j) \in (x \to y)} d_{j,i},$

where $x \in X$ and $y \in Y$.

Based on the obvious correctness for the product of two matrices we assume correctness for chains of length k. Proof by induction requires us to show correctness for chains of length k + 1.

Let B denote the result of evaluating the chain of length k. W.l.o.g.,³ consider $C = A \cdot B$. To obtain $c_{j,i}$ the inner product of $a_{j,\cdot}$ and $b_{\cdot,i}$ needs to be computed. For corresponding pairs of nonzero entries $a_{j,l}$ and $b_{l,i}$ we get $c_{j,i} = \sum_l a_{j,l} \cdot b_{l,i}$. With

  $b_{l,i} = \sum_{(i \to l)} \; \prod_{(\mu,\nu) \in (i \to l)} d_{\nu,\mu}$

Chain Rule on lDAG
Proof (Formally) IV

we get

  $a_{j,l} \cdot b_{l,i} = a_{j,l} \cdot \sum_{(i \to l)} \; \prod_{(\mu,\nu) \in (i \to l)} d_{\nu,\mu}$

and hence

  $c_{j,i} = \sum_l \sum_{(i \to l)} a_{j,l} \prod_{(\mu,\nu) \in (i \to l)} d_{\nu,\mu} = \sum_{(i \to j)} \; \prod_{(\mu,\nu) \in (i \to j)} d_{\nu,\mu},$

where l ranges over the index set induced by corresponding nonzero pairs $a_{j,l}$ and $b_{l,i}$. □

³ B · A similar

AD Builds on Chain Rule
Example

Let y = F(x) be defined over $D_F = \mathbb{R} \setminus \{0\}$ as

  $y = \frac{\sin(x)}{x} .$

Both G ≡ / and H ≡ sin are continuously differentiable over their respective domains $D_G = \mathbb{R} \times (\mathbb{R} \setminus \{0\})$ and $D_H = \mathbb{R}$. Hence F is continuously differentiable over its domain $D_F$ and

  $\frac{dF}{dx}(x^*) = \frac{dG}{dv}(v^*, x^*) \cdot \frac{dH}{dx}(x^*) + \frac{\partial G}{\partial x}(v^*, x^*) = \frac{\cos(x^*)}{x^*} - \frac{\sin(x^*)}{(x^*)^2}$

for all $x^* \in D_F$ and $v^* = \sin(x^*)$.

lDAG: 1: x → 2: v[H] (label $\cos(x)$) → 3: y[G] (label $1/x$), plus the direct edge 1: x → 3: y[G] (label $-\sin(x)/x^2$).

Chain Rule on lDAG
... the Essence of AD

Consider

  $y = F(x) = \begin{pmatrix} y_0 \\ y_1 \end{pmatrix} = \begin{pmatrix} F_0(x_0, x_1) \\ F_1(x_0, x_1) \end{pmatrix} = \begin{pmatrix} G_0(x_0, H(x_0, x_1)) \\ G_1(H(x_0, x_1)) \end{pmatrix}$

lDAG: inputs 0: x₀ and 1: x₁ feed 2: z[H] (edge labels $\frac{dH}{dx_0}$, $\frac{dH}{dx_1}$); z feeds 3: y₀[G₀] (label $\frac{dG_0}{dz}$) and 4: y₁[G₁] (label $\frac{dG_1}{dz}$); a direct edge 0: x₀ → 3: y₀[G₀] carries $\frac{\partial G_0}{\partial x_0}$.

Chain Rule on lDAG
... Edge Back-Elimination (to be used for tangents)

  $\frac{dF_0}{dx_0} = \frac{\partial G_0}{\partial x_0} + \frac{dG_0}{dz} \cdot \frac{dH}{dx_0}, \qquad \frac{dF_0}{dx_1} = \frac{dG_0}{dz} \cdot \frac{dH}{dx_1}$

Back-eliminating the edge (2, 3) multiplies its label $\frac{dG_0}{dz}$ with the labels of the incoming edges of vertex 2 and accumulates the products on the (parallel) direct edges: afterwards 3: y₀[F₀] is connected to the inputs by edges labeled $\frac{dF_0}{dx_0}$ and $\frac{dF_0}{dx_1}$, while 4: y₁[G₁] still depends on 2: z[H] via $\frac{dG_1}{dz}$.

Chain Rule on lDAG
... Vertex Elimination (to be used for adjoints)

  $\frac{dF}{dx} = \begin{pmatrix} \frac{dF_0}{dx_0} & \frac{dF_0}{dx_1} \\ \frac{dF_1}{dx_0} & \frac{dF_1}{dx_1} \end{pmatrix} = \begin{pmatrix} \frac{\partial G_0}{\partial x_0} + \frac{dG_0}{dz} \frac{dH}{dx_0} & \frac{dG_0}{dz} \frac{dH}{dx_1} \\ \frac{dG_1}{dz} \frac{dH}{dx_0} & \frac{dG_1}{dz} \frac{dH}{dx_1} \end{pmatrix}$

Eliminating vertex 2: z[H] multiplies the labels of its outgoing edges ($\frac{dG_0}{dz}$, $\frac{dG_1}{dz}$) with those of its incoming edges ($\frac{dH}{dx_0}$, $\frac{dH}{dx_1}$) and accumulates the products on the corresponding direct edges, yielding a bipartite lDAG labeled with the entries of the Jacobian.

Chain Rule on lDAG
Combinatorics

U. N.: Elimination Techniques for Cheap Jacobians. Automatic Differentiation of Algorithms, Springer, 2002.

U. N.: Optimal accumulation of Jacobian matrices by elimination methods on the dual computational graph. Math. Prog. 99(3):399-421, Springer, 2004.

U. N.: Optimal Jacobian accumulation is NP-complete. Math. Prog. 112(2):427-441, Springer, 2008.

See lecture/tutorial Combinatorial Problems in Scientific Computing, each SS (not SS17 due to sabbatical).

First-Order Tangent Model
Mathematician's View

A first-order tangent model $F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}^m \times \mathbb{R}^m$,

  $\begin{pmatrix} y \\ y^{(1)} \end{pmatrix} = F^{(1)}(x, x^{(1)}),$

defines a directional derivative alongside the function value:

  $y = F(x)$
  $y^{(1)} = \nabla F(x) \cdot x^{(1)}$

... definition of the whole Jacobian column-wise by input directions $x^{(1)} \in \mathbb{R}^n$ equal to the Cartesian basis vectors in $\mathbb{R}^n$.

First-Order Tangent Code
Notation

In

  $\frac{dF}{dx} \cdot x^{(1)}$

the superscript on x denotes first directional differentiation of F performed in tangent mode in direction $x^{(1)} \in \mathbb{R}^n$.

Subscripts will be used later to denote adjoints.

Larger values of the superscripts will become relevant in the context of higher derivatives.

First-Order Tangent Code
Computer Scientist's View

A first-order tangent code $F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \times \mathbb{R}^{\hat n} \times \mathbb{R}^m \times \mathbb{R}^m \times \mathbb{R}^{\hat m} \rightarrow \ldots$,

  $\begin{pmatrix} z \\ z^{(1)} \\ \hat z \\ y \\ y^{(1)} \\ \hat y \end{pmatrix} := F^{(1)}(x, x^{(1)}, \hat x, z, z^{(1)}, \hat z),$

computes a Jacobian-vector product alongside the function value:

  $\begin{pmatrix} z \\ y \end{pmatrix} := F(x, \hat x, z, \hat z)$

  $\begin{pmatrix} z^{(1)} \\ y^{(1)} \end{pmatrix} := \nabla F(x, \hat x, z, \hat z) \cdot \begin{pmatrix} x^{(1)} \\ z^{(1)} \end{pmatrix}$

First-Order Tangent Code
Computer Scientist's View

Variables for which derivatives are computed are referred to as active; x and z are active inputs; z and y are active outputs.

Variables which depend on active inputs are referred to as varied.

Variables for which no derivatives are computed are referred to as passive; x̂ and ẑ are passive inputs; ẑ and ŷ are passive outputs.

Variables on which active outputs depend are referred to as useful.

Active variables are both varied and useful.

The whole (dense) Jacobian can be harvested column-wise from the active output directions $(z^{(1)}, y^{(1)})^T \in \mathbb{R}^m$ by seeding the active input directions $(x^{(1)}, z^{(1)})^T \in \mathbb{R}^n$ with the Cartesian basis vectors in $\mathbb{R}^n$.

First-Order Tangent Code
Example: Lighthouse

In his 2008 seminal book Andreas Griewank introduces a lighthouse example to explain the basic concepts behind AD.

It yields a multivariate vector function $F : \mathbb{R}^4 \rightarrow \mathbb{R}^2$ defined as

  $y_1 = \frac{\nu \cdot \tan(\omega \cdot t)}{\gamma - \tan(\omega \cdot t)}, \qquad y_2 = \frac{\gamma \cdot \nu \cdot \tan(\omega \cdot t)}{\gamma - \tan(\omega \cdot t)}$

First-Order Tangent Code
Example: Lighthouse

[Figure: lighthouse beam hitting the quay wall.]

First-Order Tangent Code
Example: Lighthouse (Implementation)

Let the lighthouse function be evaluated by a computer program

  $\begin{pmatrix} y_1 \\ y_2 \end{pmatrix} := F(\omega, \nu, t, \gamma)$

as

  $h_1 := \tan(\omega \cdot t); \quad h_2 := \frac{\nu \cdot h_1}{\gamma - h_1}; \quad y_1 := h_2; \quad y_2 := \gamma \cdot h_2$

The tangent program evaluates

  $\begin{pmatrix} y_1^{(1)} \\ y_2^{(1)} \end{pmatrix} := \frac{dF}{d(\omega, \nu, t, \gamma)} \cdot \begin{pmatrix} \omega^{(1)} \\ \nu^{(1)} \\ t^{(1)} \\ \gamma^{(1)} \end{pmatrix}$

in addition to the function value.
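A hand-written tangent of this program, following the seed-propagate-harvest pattern developed below (a sketch; the function and variable names are mine, not from the slides):

#include <cmath>
#include <cstdio>

// primal and tangent of the lighthouse function in one sweep;
// inputs (omega, nu, t, gamma) with tangent components *_t
void lighthouse_t1(double omega, double nu, double t, double gamma,
                   double omega_t, double nu_t, double t_t, double gamma_t,
                   double& y1, double& y2, double& y1_t, double& y2_t) {
  double h1 = tan(omega * t);
  double h1_t = (t * omega_t + omega * t_t) / (cos(omega * t) * cos(omega * t));
  double d = gamma - h1;
  double d_t = gamma_t - h1_t;
  double h2 = nu * h1 / d;
  double h2_t = (nu_t * h1 + nu * h1_t) / d - nu * h1 * d_t / (d * d);
  y1 = h2;         y1_t = h2_t;
  y2 = gamma * h2; y2_t = gamma_t * h2 + gamma * h2_t;
}

int main() {
  double y1, y2, y1_t, y2_t;
  // seed e_0: directional derivative with respect to omega
  lighthouse_t1(1, 1, 0.5, 2, 1, 0, 0, 0, y1, y2, y1_t, y2_t);
  printf("y=(%g,%g)  dy/domega=(%g,%g)\n", y1, y2, y1_t, y2_t);
}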


First-Order Tangent Code
Computer Scientist's View (Simplified)

A first-order tangent code $F^{(1)} : \mathbb{R}^n \times \mathbb{R}^n \rightarrow \mathbb{R}^m \times \mathbb{R}^m$,

  $\begin{pmatrix} y \\ y^{(1)} \end{pmatrix} := F^{(1)}(x, x^{(1)}),$

computes a Jacobian-vector product alongside the function value:

  $y := F(x)$
  $y^{(1)} := \nabla F(x) \cdot x^{(1)}$

First-Order Tangent Model
Jacobian-Vector Product

[Figure omitted.]

First-Order Tangent Model
Alternative Interpretation via Chain Rule on lDAG

Define

  $v^{(1)} \equiv \frac{dv}{ds}$

for $v \in \{x, y\}$ and some auxiliary $s \in \mathbb{R}$, assuming that F(x(s)) is continuously differentiable over its domain.

By the chain rule

  $\frac{dy}{ds} = \frac{dy}{dx} \cdot \frac{dx}{ds} = \nabla F(x) \cdot x^{(1)}$

and hence

  $y^{(1)} = \nabla F(x) \cdot x^{(1)} .$

lDAG: 0: s → 1: x (label $x^{(1)}$) → 2: y[F] (label $\nabla F$).

First-Order Tangent Model
Graphically

tangent-augmented lDAG: 0: s → 1: x (label $x^{(1)}$) → 2: y[F] (label $\nabla F$)

tangent DAG: 1: x and 2: x^(1) feed 4: ∇F and 3: y[F]; 5: y^(1)[<·,·>] computes the inner product of ∇F and x^(1).

Vertices are sorted (indexed) topologically wrt. data dependence.

Edges are sorted topologically wrt. index of target with index of source as tie-breaker.

Note the inner product notation $\langle \nabla F(x), x^{(1)} \rangle \equiv \nabla F(x) \cdot x^{(1)}$.

First-Order Tangent Code
Implementation as Edge Back Elimination on lDAG

Let y = F(x) = G(H(x), x) with $F : \mathbb{R}^n \rightarrow \mathbb{R}^m$, $H : \mathbb{R}^n \rightarrow \mathbb{R}^k$, z = H(x), and $G : \mathbb{R}^{n+k} \rightarrow \mathbb{R}^m$ continuously differentiable over their respective domains.

By the chain rule

  $y^{(1)} = \frac{dF}{ds} = \frac{dF}{dx} \cdot x^{(1)} = \left( \frac{dG}{dz} \cdot \frac{dH}{dx} + \frac{\partial G}{\partial x} \right) \cdot x^{(1)} = \frac{dG}{dz} \cdot z^{(1)} + \frac{\partial G}{\partial x} \cdot x^{(1)}$

with the individual contributions

  $(1,2): \quad z^{(1)} = \frac{dH}{dx} \cdot x^{(1)}$
  $(1,3): \quad \frac{\partial G}{\partial x} \cdot x^{(1)}$
  $(2,3): \quad \frac{dG}{dz} \cdot z^{(1)}$

Graphically, this sequence of operations can be represented by a sequence of edge back eliminations on the corresponding tangent-augmented lDAG as follows:

First-Order Tangent Code
Implementation as Edge Back Elimination on lDAG

[Sequence of lDAGs: starting from 0: s → 1: x → 2: z[H] → 3: y[G] with labels x^(1), dH/dx, dG/dz, and ∂G/∂x, the edges are back eliminated in topological order until only y^(1) remains on the edge into 3: y[G].]

An edge is back eliminated by multiplying its label with the label(s) of the incoming edge(s) of its source, followed by its removal. If the source has no further emanating edges, then it is also removed.

First-order tangent code back eliminates all back-eliminatable edges in topological order (no storage of the lDAG).

First-Order Tangents by Overloading
Tangent SAC

For i = 0, ..., n-1

  $\begin{pmatrix} v_i \\ v_i^{(1)} \end{pmatrix} := \begin{pmatrix} x_i \\ x_i^{(1)} \end{pmatrix}$  (seed)

For i = n, ..., n+q-1

  $\begin{pmatrix} v_i \\ v_i^{(1)} \end{pmatrix} := \begin{pmatrix} \varphi_i\left( (v_k)_{k \prec i} \right) \\ \sum_{j \prec i} \frac{d\varphi_i}{dv_j}\left( (v_k)_{k \prec i} \right) \cdot v_j^{(1)} \end{pmatrix}$  (propagate)

For i = 0, ..., m-1

  $\begin{pmatrix} y_i \\ y_i^{(1)} \end{pmatrix} := \begin{pmatrix} v_{n+p+i} \\ v_{n+p+i}^{(1)} \end{pmatrix}$  (harvest)

First-Order Tangents by Overloading
Forward Edge Back-Elimination on Tangent lDAG

We consider

  $\begin{pmatrix} y_0 \\ y_1 \end{pmatrix} = \begin{pmatrix} x_0 \cdot \sin(x_0 \cdot x_1)/x_1 \\ \sin(x_0 \cdot x_1)/x_1 \cdot c \end{pmatrix}$

implemented as

  t := sin(x0*x1)/x1
  y0 := x0*t; y1 := t*c

yielding the SAC

  v2 := x0 · x1
  v3 := sin(v2)
  v4 := v3 / x1
  y0 := x0 · v4; y1 := v4 · c

for some passive value c, i.e., no derivatives of or with respect to c are required; x, y, and t are active.

[Tangent lDAG with edge labels x1, x0, cos(v2), 1/x1, -v4/x1, v4, x0, c and tangent seeds x0^(1), x1^(1).]

First-Order Tangents by Overloading
Seed

  x0 := ?;  x1 := ?
  x0^(1) := ?;  x1^(1) := ?

The input values and their tangent components are set by the driver.

First-Order Tangents by Overloading
Propagate (Local Directional Derivatives)

  v2 := x0 · x1
  v2^(1) := x1 · x0^(1) + x0 · x1^(1)

First-Order Tangents by Overloading
Propagate

  v2 := x0 · x1
  v2^(1) := x1 · x0^(1) + x0 · x1^(1)
  v3 := sin(v2)
  v3^(1) := cos(v2) · v2^(1)

First-Order Tangents by Overloading
Propagate

  v2 := x0 · x1
  v2^(1) := x1 · x0^(1) + x0 · x1^(1)
  v3 := sin(v2)
  v3^(1) := cos(v2) · v2^(1)
  v4 := v3 / x1
  v4^(1) := (v3^(1) - v4 · x1^(1)) / x1

First-Order Tangents by Overloading


Propagate

[Figure: tangent lDAG; all vertices up to 5: y_0 have been processed.]

  v_2 := x_0 · x_1
  v_2^(1) := x_1 · x_0^(1) + x_0 · x_1^(1)
  v_3 := sin(v_2)
  v_3^(1) := cos(v_2) · v_2^(1)
  v_4 := v_3/x_1
  v_4^(1) := (v_3^(1) − v_4 · x_1^(1))/x_1
  y_0 := x_0 · v_4
  y_0^(1) := v_4 · x_0^(1) + x_0 · v_4^(1)

,
Computational Differentiation, WS 16/17

127

First-Order Tangents by Overloading


Propagate

[Figure: tangent lDAG; all vertices have been processed.]

  v_2 := x_0 · x_1
  v_2^(1) := x_1 · x_0^(1) + x_0 · x_1^(1)
  v_3 := sin(v_2)
  v_3^(1) := cos(v_2) · v_2^(1)
  v_4 := v_3/x_1
  v_4^(1) := (v_3^(1) − v_4 · x_1^(1))/x_1
  y_0 := x_0 · v_4
  y_0^(1) := v_4 · x_0^(1) + x_0 · v_4^(1)
  y_1 := v_4 · c
  y_1^(1) := c · v_4^(1)

,
Computational Differentiation, WS 16/17

128

First-Order Tangents by Overloading


Harvest

[Figure: the harvested tangents y_0^(1), y_1^(1) reside at the roots 5: y_0
and 6: y_1.]

  v_2 := x_0 · x_1
  v_2^(1) := x_1 · x_0^(1) + x_0 · x_1^(1)
  v_3 := sin(v_2)
  v_3^(1) := cos(v_2) · v_2^(1)
  v_4 := v_3/x_1
  v_4^(1) := (v_3^(1) − v_4 · x_1^(1))/x_1
  y_0 := x_0 · v_4
  y_0^(1) := v_4 · x_0^(1) + x_0 · v_4^(1)
  y_1 := v_4 · c
  y_1^(1) := c · v_4^(1)

,
Computational Differentiation, WS 16/17

129

First-Order Tangents by Overloading

STCE

Accumulation of Jacobian (Driver)

For i = 0, ..., n−1:

  (y, ∇F(x)_{*,i}) := F^(1)(x, e_i),

where e_i denotes the i-th Cartesian basis vector in IR^n.
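A sketch of this driver, reusing the hypothetical hand-written tangent f_t
from the previous insert (here m = 1, so the Jacobian reduces to the gradient):

#include <cstddef>
#include <vector>
// Accumulates the gradient of f by seeding the Cartesian basis directions.
void gradient(std::size_t n, const double* x, double& y, double* g) {
  std::vector<double> x_t(n, 0.0);
  for (std::size_t i = 0; i < n; i++) {
    x_t[i] = 1.0;                     // seed e_i
    f_t(n, x, x_t.data(), y, g[i]);   // g[i] = <grad f(x), e_i>
    x_t[i] = 0.0;                     // reset for next direction
  }
}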

,
Computational Differentiation, WS 16/17

130

First-Order Tangent Code


Example

STCE

,
Computational Differentiation, WS 16/17

131

First-Order Tangent Code


Example

STCE

,
Computational Differentiation, WS 16/17

132

First-Order Tangent Code


Example

STCE

,
Computational Differentiation, WS 16/17

133

First-Order Tangents by Overloading


dco/c++ Version 0.9 (see text book)

STCE

For scalar tangent mode AD, a class dco_t1s_type (tangent 1st-order scalar type)
is defined with double precision members v (value) and t (tangent).

// tangent 1st-order scalar derivative type
class dco_t1s_type {
public:
  double v; // value
  double t; // tangent
  dco_t1s_type(const double&);
  dco_t1s_type();
  dco_t1s_type& operator=(const dco_t1s_type&);
};

,
Computational Differentiation, WS 16/17

134

First-Order Tangents by Overloading


Assignment

STCE

dco_t1s_type&
dco_t1s_type::operator=(const dco_t1s_type& x) {
  if (this==&x) return *this; // self-assignment
  v=x.v; t=x.t; // assigns both primal and tangent
  return *this;
}

,
Computational Differentiation, WS 16/17

135

First-Order Tangents by Overloading


Arithmetic Operators

STCE

dco_t1s_type operator*(const dco_t1s_type& x1,
                       const dco_t1s_type& x2) {
  dco_t1s_type tmp;
  tmp.v=x1.v*x2.v; // primal
  tmp.t=x1.t*x2.v+x1.v*x2.t; // tangent
  return tmp;
}

dco_t1s_type sin(const dco_t1s_type& x) {
  dco_t1s_type tmp;
  tmp.v=sin(x.v); // primal
  tmp.t=cos(x.v)*x.t; // tangent
  return tmp;
}
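Further overloads follow the same pattern; e.g., a sketch for addition (not
shown in the text book excerpt above; the tangent of x1 + x2 is simply
x1^(1) + x2^(1)):

// hypothetical additional overload, by analogy with operator* above
dco_t1s_type operator+(const dco_t1s_type& x1,
                       const dco_t1s_type& x2) {
  dco_t1s_type tmp;
  tmp.v=x1.v+x2.v; // primal
  tmp.t=x1.t+x2.t; // tangent
  return tmp;
}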

,
Computational Differentiation, WS 16/17

136

First-Order Tangents by Overloading


Driver

STCE

...
#include "dco_t1s_type.hpp" // tangent type definition
const int n=4;
void f(dco_t1s_type* x, dco_t1s_type &y) { ... }
int main() {
  dco_t1s_type x[n], y;
  for (int i=0;i<n;i++) x[i]=1;
  for (int i=0;i<n;i++) {
    x[i].t=1; // seed
    f(x,y);
    x[i].t=0; // reset for next Cartesian basis direction
    cout << y.t << endl; // harvest
  }
  return 0;
}

,
Computational Differentiation, WS 16/17

137

First-Order Tangent Code with dco/c++


Hands On (gt1s.cpp)

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fg for given instantiation of

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for DCO_M=dco::gt1s<double>

- build and run for increasing values of n
- compare with finite differences

see case_studies/race/gt1s

,
Computational Differentiation, WS 16/17

138

First-Order Tangent Code with dco/c++


Driver

STCE

User Guide: y^(1) := ∇F(x) · x^(1)


template<typename T>
void fg(const vector<T>& xv, T &yv, vector<T> &g) {
  // tangent 1st-order scalar dco type
  typedef typename gt1s<T>::type DCO_T;
  size_t n=xv.size();
  DCO_T y=0;
  for (size_t i=0;i<n;i++) {
    vector<DCO_T> x(n,0);
    for (size_t j=0;j<n;j++) x[j]=xv[j];
    derivative(x[i])=1; // seed directions
    f(x,y); // overloaded primal
    g[i]=derivative(y); // harvest directional derivatives
  }
  yv=value(y); // extract function value
}

,
Computational Differentiation, WS 16/17

139

First-Order Vector Tangent Code

STCE

Computer Scientists View (Simplified)

A first-order vector tangent code F^(1) : IR^n × IR^(n×l) → IR^m × IR^(m×l),

  (y, Y^(1)) := F^(1)(x, X^(1)),

computes a Jacobian matrix product alongside the function value:

  y := F(x)
  Y^(1) := ∇F(x) · X^(1)

... harvesting of the whole Jacobian by seeding the input directions
X^(1)[i] ∈ IR^n, i = 0, ..., n−1, with the Cartesian basis vectors in IR^n.
Note concurrency!

,
Computational Differentiation, WS 16/17

140

First-Order (Vector) Tangent Code


Example: Lighthouse

STCE

Generic Primal F : IR^4 → IR^2

template<typename T>
void f(const vector<T>& x, vector<T>& y) {
  T v = tan(x[2]*x[3]);
  T w = x[1]-v;
  y[0] = x[0]*v/w;
  y[1] = y[0]*x[1];
}

,
Computational Differentiation, WS 16/17

141

First-Order (Vector) Tangent Code

STCE

Example: Lighthouse I

The Jacobian of F at point x = (1, 1, 1, 1)^T is equal to

  ∇F(x) = ( −2.79402  −5.01252  11.025  11.025 )
          ( −2.79402  −7.80654  11.025  11.025 )

Consequently, the sum of its columns is computed as the first directional
derivative of F with respect to x in direction x^(1) = (1, 1, 1, 1)^T, that is,

  y^(1) := ∇F(x) · x^(1)
        = ( −2.79402  −5.01252  11.025  11.025 ) · (1, 1, 1, 1)^T
          ( −2.79402  −7.80654  11.025  11.025 )
        = ( 14.2435 )
          ( 11.4495 ).

,
Computational Differentiation, WS 16/17

142

Example: Lighthouse
First-Order Scalar Tangent Code (lighthouse/gt1s.cpp)

STCE

void driver(
  const vector<double>& xv, const vector<double>& xt,
  vector<double>& yv, vector<double>& yt
) {
  typedef gt1s<double>::type DCO_T;
  const int n=xv.size(), m=yv.size();
  vector<DCO_T> x(n), y(m);
  for (int i=0;i<n;i++) { value(x[i])=xv[i]; derivative(x[i])=xt[i]; } // seed
  f(x,y); // overloaded primal
  for (int i=0;i<m;i++) { yv[i]=value(y[i]); yt[i]=derivative(y[i]); } // harvest
}
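A possible main exercising this driver (an assumption, not part of the slides)
at x = (1, 1, 1, 1)^T with x^(1) = (1, 1, 1, 1)^T, which should reproduce
y^(1) = (14.2435, 11.4495)^T from the previous slide:

#include <iostream>
#include <vector>
using namespace std;

int main() {
  vector<double> x(4,1.0), xt(4,1.0), y(2), yt(2);
  driver(x,xt,y,yt); // scalar tangent driver from above
  cout << yt[0] << " " << yt[1] << endl; // expect approx. 14.2435 11.4495
  return 0;
}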

,
Computational Differentiation, WS 16/17

143

Example: Lighthouse
First-Order Vector Tangent Code (lighthouse/gt1v.cpp)

STCE

const int n=4;


void driver(
  const vector<double>& x, const vector<vector<double> >& xt1,
  vector<double>& y, vector<vector<double> >& yt1
) {
  // generic tangent 1st-order vector dco type
  typedef gt1v<double,n>::type DCO_T;
  int m=y.size(); vector<DCO_T> t1v_x(n), t1v_y(m);
  for (int i=0;i<n;i++) {
    value(t1v_x[i])=x[i];
    // vector tangents
    for (int j=0;j<n;j++) derivative(t1v_x[i])[j]=xt1[i][j];
  }
  f(t1v_x,t1v_y);
  for (int i=0;i<m;i++) {
    y[i]=value(t1v_y[i]);
    for (int j=0;j<n;j++) yt1[i][j]=derivative(t1v_y[i])[j];
  }
}

... requires header-only version of dco/c++.

,
Computational Differentiation, WS 16/17

144

First-Order Tangent Code


Integration into Newton's Algorithm for NLS

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

145

First-Order Tangent Code


Integration into Steepest Descent Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

146

STCE

First-Order Adjoint Model


The Jacobian is a linear operator ∇F : IR^n → IR^m.
Its adjoint is defined as (∇F)* : IR^m → IR^n, where

  ⟨(∇F)* · y_(1), x^(1)⟩_IR^n = ⟨y_(1), ∇F · x^(1)⟩_IR^m

and where ⟨., .⟩_IR^n and ⟨., .⟩_IR^m denote appropriate scalar products in IR^n
and IR^m, respectively.

Theorem
(∇F)* = (∇F)^T, i.e.,

  ⟨(∇F)^T · y_(1), x^(1)⟩_IR^n = ⟨y_(1), ∇F · x^(1)⟩_IR^m
   [=: x_(1)]                    [=: y^(1)]

Note: this invariant holds at each point in the program execution → validation
of derivatives.
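The invariant suggests a simple numerical check; a minimal sketch, assuming
hypothetical first-order tangent and adjoint drivers f_tangent and f_adjoint
with the signatures declared below (their definitions are not part of the
slides):

#include <cmath>
#include <cstddef>
#include <vector>

// hypothetical drivers for y = F(x): tangent (y_t = F'(x) x_t) and
// adjoint (x_a += F'(x)^T y_a), declarations only
void f_tangent(const std::vector<double>& x, const std::vector<double>& x_t,
               std::vector<double>& y, std::vector<double>& y_t);
void f_adjoint(const std::vector<double>& x, std::vector<double>& x_a,
               const std::vector<double>& y_a, std::vector<double>& y);

// checks <y_(1), F'(x) x^(1)> == <F'(x)^T y_(1), x^(1)> up to round-off
bool validate(const std::vector<double>& x,
              const std::vector<double>& xt, const std::vector<double>& ya) {
  std::vector<double> y(ya.size()), yt(ya.size()), xa(x.size(), 0.0);
  f_tangent(x, xt, y, yt);
  f_adjoint(x, xa, ya, y);
  double lhs = 0, rhs = 0;
  for (std::size_t i = 0; i < ya.size(); i++) lhs += ya[i] * yt[i];
  for (std::size_t i = 0; i < x.size(); i++)  rhs += xa[i] * xt[i];
  return std::fabs(lhs - rhs) <= 1e-12 * (std::fabs(lhs) + 1.0);
}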

,
Computational Differentiation, WS 16/17

147

First-Order Adjoint Model

STCE

,
Computational Differentiation, WS 16/17

148

Validation of Algorithmic Derivatives

STCE

,
Computational Differentiation, WS 16/17

149

First-Order Adjoint Model

STCE

Mathematicians View

A first-order adjoint model F_(1) : IR^n × IR^m → IR^m × IR^n,

  (y, x_(1)) = F_(1)(x, y_(1)),

defines an adjoint directional derivative alongside the function value:

  y = F(x)
  x_(1) = ∇F(x)^T · y_(1)

... definition of the whole Jacobian row-wise through input directions
y_(1) ∈ IR^m equal to the Cartesian basis vectors in IR^m.

,
Computational Differentiation, WS 16/17

150

First-Order Adjoint Code

STCE

Notation

In

  x_(1) = (dF/dx)^T · y_(1)

the subscript on y denotes the first directional differentiation of F performed
in adjoint mode in direction y_(1) ∈ IR^m.
Enumeration of derivatives and the distinction between super- and subscripts
will become relevant in the discussion of higher derivatives computed by
combinations of tangent and adjoint modes.

,
Computational Differentiation, WS 16/17

151

First-Order Adjoint Code

STCE

Computer Scientists View

A first-order adjoint code

  F_(1) : IR^n × IR^n × IR^m̃ × IR^m̃ × IR^m → IR^m̃ × IR^m × IR^n × IR^m̃ × IR^m,

  (z, y, x_(1), z_(1), y_(1)) := F_(1)(x, x_(1), z, z_(1), y_(1)),

computes a shifted transposed Jacobian vector product alongside the function
value:

  IR^m̃ × IR^m ∋ (z, y) := F(x, z)

  (x_(1), z_(1)) := (x_(1), 0) + ∇F(x, z)^T · (z_(1), y_(1))

  y_(1) := 0

,
Computational Differentiation, WS 16/17

152

First-Order Adjoint Code


Computer Scientists View

STCE

The whole (dense) Jacobian can be harvested from the active input adjoints

  (x_(1), z_(1))

row-wise by seeding the active output adjoints

  (z_(1), y_(1)) ∈ IR^m

with the Cartesian basis vectors in IR^m and for x_(1) := 0 on input.

,
Computational Differentiation, WS 16/17

153

First-Order Adjoint Code


Example: Lighthouse (Implementation)

STCE

Let the lighthouse function be evaluated by a computer program

  (y, ỹ) := F(ν, γ, t, ω)

as

  h_1 := tan(ω · t); h_2 := ν · h_1 / (γ − h_1); y := h_2; ỹ := h_2 · γ

The adjoint program evaluates

  (ν_(1), γ_(1), t_(1), ω_(1))^T := (ν_(1), γ_(1), t_(1), ω_(1))^T
                                    + (dF/d(ν, γ, t, ω))^T · y_(1)
  y_(1) := 0

in addition to the function value; see later for details.

,
Computational Differentiation, WS 16/17

154

First-Order Adjoint Model


Computer Scientists View (Simplified)

STCE

A first-order adjoint code F_(1) : IR^n × IR^m → IR^m × IR^n,

  (y, x_(1)) := F_(1)(x, y_(1)),

computes a shifted transposed Jacobian vector product alongside the
function value:

  y := F(x)
  x_(1) := ∇F(x)^T · y_(1)

... harvesting of the whole Jacobian row-wise by seeding input directions
y_(1) ∈ IR^m with the Cartesian basis vectors in IR^m and for x_(1) = 0 on input.
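A hand-coded instance for the running example y = (Σ_{i=0}^{n−1} x_i²)²
(f_a is a hypothetical helper, not part of dco/c++; the reverse sweep
propagates adjoints in reverse order of the primal SAC, incrementing x_(1)
and resetting y_(1) as in the shifted semantics above):

#include <cstddef>
// Hand-written adjoint code: forward sweep computes the primal,
// reverse sweep accumulates x_(1) += grad f(x) * y_(1).
void f_a(std::size_t n, const double* x, double* x_a,
         double& y, double& y_a) {
  double v = 0;
  for (std::size_t i = 0; i < n; i++) v += x[i] * x[i]; // forward sweep
  y = v * v;
  double v_a = 2 * v * y_a;                 // reverse: v_(1) = dy/dv * y_(1)
  for (std::size_t i = 0; i < n; i++)
    x_a[i] += 2 * x[i] * v_a;               // x_(1) += dv/dx_i * v_(1)
  y_a = 0;                                  // reset output adjoint
}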

,
Computational Differentiation, WS 16/17

155

First-Order Adjoint Model


Vector-Jacobian Product

STCE

,
Computational Differentiation, WS 16/17

156

First-Order Adjoint Model

STCE

Alternative Interpretation via Chain Rule on lDAG

Define

  v_(1) ≡ (dt/dv)^T

for v ∈ {x, y} and some auxiliary t ∈ IR, assuming that t(F(x)) is continuously
differentiable over its domain.

By the chain rule,

  dt/dx = dt/dy · dF/dx = y_(1)^T · ∇F(x)

and hence

  x_(1) ≡ (dt/dx)^T = ∇F(x)^T · y_(1).

,
Computational Differentiation, WS 16/17

157

First-Order Adjoint Code

STCE

Graphically

[Figure: adjoint-augmented lDAG (1: x, 2: y[F], 3: t with edge labels dF/dx
and y_(1)) and the resulting adjoint DAG (1: x, 2: y_(1), 4: ∇F,
5: x_(1)[⟨., .⟩]).]

Note inner product notation: ⟨y_(1), ∇F(x)⟩ ≡ ∇F(x)^T · y_(1).
Computational Differentiation, WS 16/17

158

First-Order Adjoint Code

STCE

Implementation as Vertex Elimination on lDAG

Let

  y = F(x) = G(H(x), x)

with F : IR^n → IR^m, H : IR^n → IR^k, z = H(x), and G : IR^(n+k) → IR^m
continuously differentiable over their respective domains. By the chain rule,

  x_(1) ≡ (dt/dx)^T = (dF/dx)^T · y_(1)
        = ((dH/dx)^T · (dG/dz)^T + (∂G/∂x)^T) · y_(1)
        = (dH/dx)^T · z_(1) + (∂G/∂x)^T · y_(1)

yielding the vertex elimination sequence

  3 :  (z_(1), x_(1)) := ((dG/dz)^T · y_(1), (∂G/∂x)^T · y_(1))
  2 :  x_(1) += (dH/dx)^T · z_(1)

Graphically, this sequence of operations can be represented by a sequence of
vertex eliminations on the corresponding adjoint-augmented lDAG as follows:

,
Computational Differentiation, WS 16/17

159

First-Order Adjoint Model

STCE

Chain Rule and Vertex Elimination


[Figure: vertex elimination on the adjoint-augmented lDAG with vertices 1: x,
2: z[H], 3: y[G], 4: t: eliminating vertex 3 propagates y_(1)^T through dG/dz
and ∂G/∂x into z_(1)^T and x_(1)^T; eliminating vertex 2 accumulates x_(1)^T
through dH/dx.]

A vertex is eliminated by multiplying the labels of its incoming edges with the
label(s) of its emanating edge(s) (resulting in new edges or in the
incrementation of existing edge labels), followed by its removal.
First-order adjoint code eliminates all eliminatable vertices in reverse
topological order (it reverses the primal data flow).

,
Computational Differentiation, WS 16/17

160

First-Order Adjoint Code


Example

STCE

,
Computational Differentiation, WS 16/17

161

First-Order Adjoint Code


Example

STCE

,
Computational Differentiation, WS 16/17

162

STCE

First-Order Adjoints by Overloading I


1. Record (Tape / lDAG)

For i = 0, ..., n−1:
  v_i := x_i
  record i ∈ V (v_i(1) := x_i(1))
For i = n, ..., q−1:
  v_i := φ_i(v_k)_{k≺i}
  record i ∈ V (v_i(1) := 0)
  For j ≺ i: record (j, i) ∈ E (d_{i,j} := dφ_i(v_k)_{k≺i}/dv_j)
For i = 0, ..., m−1:
  y_i := v_{n+p+i}

,
Computational Differentiation, WS 16/17

163

First-Order Adjoints by Overloading II

STCE

2. Interpret (Tape / lDAG)

For i = 0, ..., m−1:
  v_{n+p+i}(1) := y_i(1)
For i = q−1, ..., n:
  ∀ (j, i) ∈ E : v_j(1) := v_j(1) + v_i(1) · d_{i,j}
For i = 0, ..., n−1:
  x_i(1) := v_i(1)

,
Computational Differentiation, WS 16/17

164

First-Order Adjoints by Overloading


Reverse Vertex Elimination on Adjoint lDAG (Tape)

We consider

  (y_0, y_1) = (x_0 · sin(x_0 · x_1)/x_1, sin(x_0 · x_1)/x_1 · c)

implemented as

  t := sin(x_0 · x_1)/x_1
  y_0 := x_0 · t
  y_1 := t · c

yielding the SAC

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c

for some passive value c.

[Figure: adjoint lDAG with vertices 0: x_0, 1: x_1, 2: *, 3: sin, 4: /,
5: y_0 (*), 6: y_1 (* c) and root t; edges labeled with the local partials
x_1, x_0, cos(v_2), 1/x_1, −v_4/x_1, v_4, c and the adjoint seeds
y_0(1), y_1(1).]

,
Computational Differentiation, WS 16/17

165

First-Order Adjoints by Overloading


Register (Independent Inputs with Tape)

[Figure: adjoint lDAG of the example; the independents are registered with the
tape.]

  x_0 := ?
  x_1 := ?

,
Computational Differentiation, WS 16/17

166

First-Order Adjoints by Overloading


Record (Tape)

[Figure: adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1

,
Computational Differentiation, WS 16/17

167

First-Order Adjoints by Overloading


Record (Tape)

[Figure: adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)

,
Computational Differentiation, WS 16/17

168

First-Order Adjoints by Overloading


Record (Tape)

[Figure: adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1

,
Computational Differentiation, WS 16/17

169

First-Order Adjoints by Overloading


Record (Tape)

[Figure: adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4

,
Computational Differentiation, WS 16/17

170

First-Order Adjoints by Overloading


Record (Tape)

[Figure: adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c

,
Computational Differentiation, WS 16/17

171

First-Order Adjoints by Overloading


Seed

[Figure: adjoint lDAG (tape) with adjoint seeds y_0(1), y_1(1) at the roots.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  y_0(1) := ?
  y_1(1) := ?
  x_0(1) := ?
  x_1(1) := ?
  v_2(1) := 0
  v_3(1) := 0
  v_4(1) := 0

,
Computational Differentiation, WS 16/17

172

First-Order Adjoints by Overloading

STCE

Interpret (Tape)

[Figure: adjoint lDAG; vertex 6: y_1 has been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)

Note C++ syntax:

  v_4(1) += c · y_1(1)  ≡  v_4(1) := v_4(1) + c · y_1(1).

,
Computational Differentiation, WS 16/17

173

First-Order Adjoints by Overloading


Interpret (Tape)

STCE

[Figure: adjoint lDAG; vertices 6 and 5 have been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)

,
Computational Differentiation, WS 16/17

174

First-Order Adjoints by Overloading


Interpret (Tape)

STCE

[Figure: adjoint lDAG; vertex 4: / has been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  u := 1/x_1
  v_3(1) += u · v_4(1)
  x_1(1) −= v_4 · u · v_4(1)

,
Computational Differentiation, WS 16/17

175

First-Order Adjoints by Overloading


Interpret (Tape)

[Figure: adjoint lDAG; vertex 3: sin has been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  u := 1/x_1
  v_3(1) += u · v_4(1)
  x_1(1) −= v_4 · u · v_4(1)
  v_2(1) += cos(v_2) · v_3(1)

,
Computational Differentiation, WS 16/17

176

First-Order Adjoints by Overloading


Interpret (Tape)

[Figure: adjoint lDAG; vertex 2: * has been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  u := 1/x_1
  v_3(1) += u · v_4(1)
  x_1(1) −= v_4 · u · v_4(1)
  v_2(1) += cos(v_2) · v_3(1)
  x_0(1) += x_1 · v_2(1)
  x_1(1) += x_0 · v_2(1)

,
Computational Differentiation, WS 16/17

177

First-Order Adjoints by Overloading


Harvest

[Figure: only the leaves 0: x_0, 1: x_1 remain, holding the harvested adjoints
x_0(1), x_1(1).]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  u := 1/x_1
  v_3(1) += u · v_4(1)
  x_1(1) −= v_4 · u · v_4(1)
  v_2(1) += cos(v_2) · v_3(1)
  x_0(1) += x_1 · v_2(1)
  x_1(1) += x_0 · v_2(1)

,
Computational Differentiation, WS 16/17

178

First-Order Adjoints by Overloading


dco/c++ Version 0.9 (see text book)

STCE

The favored approach to a run-time version of the adjoint model is to build a
tape (an augmented representation of the DAG) by overloading, followed by an
interpretative reverse propagation of adjoints through the tape. In the simplest
case the tape is a statically allocated array of tape entries addressed by their
position in the array.

class dco_a1s_tape_entry {
public:
  int oc; // opcode
  int arg1, arg2; // tape addresses of arguments
  double v,a; // value and adjoint
  dco_a1s_tape_entry() :
    oc(DCO_A1S_UNDEF), arg1(DCO_A1S_UNDEF),
    arg2(DCO_A1S_UNDEF), v(0), a(0)
  {}; // constructor
};

,
Computational Differentiation, WS 16/17

179

First-Order Adjoints by Overloading


Active Data Type

STCE

As in forward mode, an augmented data type is defined to replace the type of
every active floating-point variable. The corresponding class dco_a1s_type
(dco's adjoint 1st-order scalar type) contains the virtual address va (position
in tape) of the current variable in addition to its value v.

// adjoint 1st-order scalar derivative type
class dco_a1s_type {
public:
  int va; // (virtual) tape address
  double v; // value
  dco_a1s_type() : va(DCO_A1S_UNDEF), v(0) {};
  dco_a1s_type(const double&);
  dco_a1s_type& operator=(const dco_a1s_type&);
};

,
Computational Differentiation, WS 16/17

180

First-Order Adjoints by Overloading


Assignment

STCE

dco_a1s_type&
dco_a1s_type::operator=(const dco_a1s_type& x) {
  if (this==&x) return *this; // self-assignment
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_ASG; // opcode
  // value both in tape and active variable
  dco_a1s_tape[dco_a1s_vac].v=v=x.v;
  // single argument is tape entry of right-hand side
  dco_a1s_tape[dco_a1s_vac].arg1=x.va;
  // tape grows sequentially
  va=dco_a1s_vac++;
  return *this;
}

,
Computational Differentiation, WS 16/17

181

First-Order Adjoints by Overloading


Multiplication

STCE

dco_a1s_type
operator*(const dco_a1s_type& x1,
          const dco_a1s_type& x2) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_MUL; // opcode
  // tape entries of two arguments
  dco_a1s_tape[dco_a1s_vac].arg1=x1.va;
  dco_a1s_tape[dco_a1s_vac].arg2=x2.va;
  // value both in tape and active variable
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=x1.v*x2.v;
  // tape grows sequentially
  tmp.va=dco_a1s_vac++;
  return tmp;
}

,
Computational Differentiation, WS 16/17

182

First-Order Adjoints by Overloading


Sine

STCE

dco_a1s_type sin(const dco_a1s_type& x) {
  dco_a1s_type tmp;
  dco_a1s_tape[dco_a1s_vac].oc=DCO_A1S_SIN; // opcode
  // tape entry of single argument
  dco_a1s_tape[dco_a1s_vac].arg1=x.va;
  // value both in tape and active variable
  dco_a1s_tape[dco_a1s_vac].v=tmp.v=sin(x.v);
  // tape grows sequentially
  tmp.va=dco_a1s_vac++;
  return tmp;
}

,
Computational Differentiation, WS 16/17

183

First-Order Adjoints by Overloading


Tape Interpreter

STCE

void dco_a1s_interpret_tape () {
  // tape interpretation sequentially backwards
  for (int i=dco_a1s_vac;i>=0;i--) {
    switch (dco_a1s_tape[i].oc) { // distinguish opcodes
      case DCO_A1S_ASG : {
        // incremental adjoints
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=
          dco_a1s_tape[i].a; break;
      }
      case DCO_A1S_SIN : {
        dco_a1s_tape[dco_a1s_tape[i].arg1].a+=
          cos(dco_a1s_tape[dco_a1s_tape[i].arg1].v)
          *dco_a1s_tape[i].a; break;
      }
      ...
    }
  }
}
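The multiplication case elided in the switch above might look as follows (a
sketch, inserted into the same switch, under the tape layout shown; the
partials of x1*x2 are the respective other argument values, which are
available in the tape):

// hypothetical DCO_A1S_MUL case, consistent with operator* above
case DCO_A1S_MUL : {
  dco_a1s_tape[dco_a1s_tape[i].arg1].a+=
    dco_a1s_tape[dco_a1s_tape[i].arg2].v*dco_a1s_tape[i].a;
  dco_a1s_tape[dco_a1s_tape[i].arg2].a+=
    dco_a1s_tape[dco_a1s_tape[i].arg1].v*dco_a1s_tape[i].a;
  break;
}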

,
Computational Differentiation, WS 16/17

184

First-Order Adjoints by Overloading


Driver

STCE

#include <iostream>
#include "dco_a1s_type.hpp" // adjoint type definition
using namespace std;
const int n=4;
extern dco_a1s_tape_entry dco_a1s_tape[DCO_A1S_TAPE_SIZE]; // tape
// overloaded primal
void f(dco_a1s_type* x, dco_a1s_type &y) {
  y=0;
  for (int i=0;i<n;i++) y=y+x[i]*x[i];
  y=y*y;
}
...

,
Computational Differentiation, WS 16/17

185

First-Order Adjoints by Overloading


Driver (cont.)

STCE

int main() {
  dco_a1s_type x[n], y;
  for (int j=0;j<n;j++) x[j]=1;
  f(x,y); // overloaded primal builds tape
  dco_a1s_tape[y.va].a=1; // seed
  dco_a1s_interpret_tape(); // tape interpreter
  for (int i=0;i<n;i++)
    cout << i << '\t'
         << dco_a1s_tape[x[i].va].a << endl; // harvest
  dco_a1s_reset_tape(); // here obsolete ...
  return 0;
}

,
Computational Differentiation, WS 16/17

186

First-Order Adjoints by Overloading


Types of Tapes

value tape
- records partial derivatives, dependences
- no reevaluation

gradient tape
- records values, opcodes, dependences
- reevaluation by interpretation

partial tape
- records gradients of assignments, dependences
- no reevaluation

mixtures ...

,
Computational Differentiation, WS 16/17

187

First-Order Adjoints by Overloading

STCE

Accumulation of Jacobian (Driver)

For i = 0, ..., n−1:
  x_i(1) := 0
For i = 0, ..., m−1:

  (y, ∇F(x)_{i,*}) := F_(1)(x, x_(1), e_i),

where e_i denotes the i-th Cartesian basis vector in IR^m.
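A sketch of this adjoint driver for m = 1, reusing the hypothetical
hand-written adjoint f_a from above (seeding y_(1) = e_0 = 1 yields the single
Jacobian row, i.e., the gradient):

#include <cstddef>
// Accumulates the gradient of f in one reverse sweep.
void gradient_adjoint(std::size_t n, const double* x, double& y, double* g) {
  for (std::size_t i = 0; i < n; i++) g[i] = 0.0; // x_(1) := 0 on input
  double y_a = 1.0;                               // seed e_0 in IR^1
  f_a(n, x, g, y, y_a);                           // g = grad f(x)
}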

,
Computational Differentiation, WS 16/17

188

First-Order Adjoint Code with dco/c++


Hands On

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fg for given instantiation of

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for DCO_M=dco::ga1s<double>

- build and run for increasing values of n
- compare with finite differences and tangent AD

,
Computational Differentiation, WS 16/17

189

First-Order Adjoint Code with dco/c++


Driver (ga1s.cpp)

STCE

User Guide: x_(1) := ∇F(x)^T · y_(1)

template<typename T>
void fg(const vector<T> &xv, T &yv, vector<T> &g) {
  typedef ga1s<T> DCO_M; // dco mode
  typedef typename DCO_M::type DCO_T; // dco type
  typedef typename DCO_M::tape_t DCO_TAPE_T; // dco tape type
  size_t n=xv.size();
  vector<DCO_T> x(n); DCO_T y;
  DCO_M::global_tape=DCO_TAPE_T::create(); // tape creation
  for (size_t i=0;i<n;i++) { // independent tape entries
    x[i]=xv[i]; DCO_M::global_tape->register_variable(x[i]);
  }
  f(x,y); // overloaded primal
  DCO_M::global_tape->register_output_variable(y); // dependent tape entry
  yv=value(y); derivative(y)=1; // seed
  DCO_M::global_tape->interpret_adjoint(); // tape interpretation
  for (size_t i=0;i<n;i++) { g[i]=derivative(x[i]); } // harvest
  DCO_TAPE_T::remove(DCO_M::global_tape); // release tape
}

,
Computational Differentiation, WS 16/17

190

First-Order Vector Adjoint Code

STCE

Computer Scientists View (Simplified)

A first-order vector adjoint code F_(1) : IR^n × IR^(m×l) → IR^m × IR^(n×l),

  (y, X_(1)) := F_(1)(x, Y_(1)),

computes a transposed Jacobian matrix product alongside the function value:

  y := F(x)
  X_(1) := ∇F(x)^T · Y_(1)

... harvesting of the whole Jacobian by seeding input directions
Y_(1)[i] ∈ IR^m, i = 0, ..., m−1, with the Cartesian basis vectors in IR^m.
Note concurrency!

,
Computational Differentiation, WS 16/17

191

First-Order (Vector) Adjoint Code


Example: Lighthouse

STCE

Generic Primal F : IR^4 → IR^2

template<typename T>
void f(const vector<T>& x, vector<T>& y) {
  T v = tan(x[2]*x[3]);
  T w = x[1]-v;
  y[0] = x[0]*v/w;
  y[1] = y[0]*x[1];
}

,
Computational Differentiation, WS 16/17

192

First-Order (Vector) Adjoint Code


Example: Lighthouse I

STCE

The Jacobian of F at point x = (1, 1, 1, 1)^T is equal to

  ∇F(x) = ( −2.79402  −5.01252  11.025  11.025 )
          ( −2.79402  −7.80654  11.025  11.025 ).

Consequently, the shifted sum of its rows is computed as the first-order
adjoint of F with respect to x in direction y_(1) = (1, 1)^T added to
x_(1) = (1, 1, 1, 1)^T, that is,

  x_(1) := x_(1) + ∇F(x)^T · y_(1)
        = (1, 1, 1, 1)^T + (−5.58804, −12.8191, 22.0501, 22.0501)^T
        = (−4.58804, −11.8191, 23.0501, 23.0501)^T.

,
Computational Differentiation, WS 16/17

193

Example: Lighthouse
First-Order Scalar Adjoint Code (lighthouse/ga1s.cpp) I

STCE

void driver(
  const vector<double>& xv, vector<double>& xa,
  vector<double>& yv, vector<double>& ya
) {
  // generic adjoint 1st-order scalar dco mode
  typedef ga1s<double> DCO_M;
  typedef DCO_M::type DCO_T; // dco type
  typedef DCO_M::tape_t DCO_TAPE_T; // dco tape type
  DCO_M::global_tape=DCO_TAPE_T::create(); // tape creation
  int n=xv.size(), m=yv.size();
  vector<DCO_T> x(n), y(m);
  for (int i=0;i<n;i++) { // independent tape entries
    x[i]=xv[i];
    DCO_M::global_tape->register_variable(x[i]);
  }
  f(x,y); // overloaded primal
  for (int i=0;i<m;i++) {
    DCO_M::global_tape->register_output_variable(y[i]); // dependent tape entries
    yv[i]=value(y[i]); derivative(y[i])=ya[i]; // seed
  }

,
Computational Differentiation, WS 16/17

194

Example: Lighthouse
First-Order Scalar Adjoint Code (lighthouse/ga1s.cpp) II

STCE

  for (int i=0;i<n;i++) derivative(x[i])=xa[i]; // seed
  DCO_M::global_tape->interpret_adjoint(); // tape interpretation
  for (int i=0;i<n;i++) xa[i]=derivative(x[i]); // harvest
  for (int i=0;i<m;i++) ya[i]=derivative(y[i]); // harvest
  DCO_TAPE_T::remove(DCO_M::global_tape); // release tape
}

,
Computational Differentiation, WS 16/17

195

Example: Lighthouse
First-Order Vector Adjoint Code (lighthouse/ga1v.cpp) I

STCE

const int m=2;

void driver(
  const vector<double>& xv, vector<vector<double> >& xa,
  vector<double>& yv, vector<vector<double> >& ya
) {
  // generic adjoint 1st-order vector dco mode
  typedef ga1v<double,m> DCO_M;
  typedef DCO_M::type DCO_T;
  typedef DCO_M::tape_t DCO_TAPE_T;
  DCO_M::global_tape=DCO_TAPE_T::create();
  int n=xv.size();
  vector<DCO_T> x(n), y(m);
  for (int i=0;i<n;i++) {
    x[i]=xv[i];
    DCO_M::global_tape->register_variable(x[i]);
  }
  f(x,y);
  for (int i=0;i<m;i++) {
    DCO_M::global_tape->register_output_variable(y[i]);
,
Computational Differentiation, WS 16/17

196

Example: Lighthouse
First-Order Vector Adjoint Code (lighthouse/ga1v.cpp) II

STCE

    yv[i]=value(y[i]);
    for (int j=0;j<m;j++) derivative(y[i])[j] = ya[i][j]; // vector adjoints
  }
  for (int i=0;i<n;i++) {
    for (int j=0;j<m;j++) derivative(x[i])[j] = xa[i][j];
  }
  DCO_M::global_tape->interpret_adjoint();
  for (int i=0;i<n;i++) {
    for (int j=0;j<m;j++) xa[i][j]=derivative(x[i])[j];
  }
  for (int i=0;i<m;i++) {
    for (int j=0;j<m;j++) ya[i][j]=derivative(y[i])[j];
  }
  DCO_TAPE_T::remove(DCO_M::global_tape);
}

... requires header-only version of dco/c++.

,
Computational Differentiation, WS 16/17

197

Optimized First Derivative Code


Statement-Level Preaccumulation of Gradients

STCE

,
Computational Differentiation, WS 16/17

198

First-Order Adjoints with dco/c++


Register (Independent Inputs with Tape)

[Figure: adjoint lDAG of the example; the independents are registered with the
tape.]

  x_0 := ?
  x_1 := ?

,
Computational Differentiation, WS 16/17

199

First-Order Adjoints with dco/c++


Record (Local Tape)

[Figure: adjoint lDAG; the statement is recorded on a local tape.]

  v_2 := x_0 · x_1

,
Computational Differentiation, WS 16/17

200

First-Order Adjoints with dco/c++


Record (Local Tape)

[Figure: adjoint lDAG; local tape as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)

,
Computational Differentiation, WS 16/17

201

First-Order Adjoints with dco/c++


Record (Local Tape)

[Figure: adjoint lDAG; local tape as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1

,
Computational Differentiation, WS 16/17

202

First-Order Adjoints with dco/c++


Preaccumulate (Local Gradient)

[Figure: the local subgraph of the statement has been collapsed; only the
preaccumulated partials v4_x0(1) and v4_x1(1) label the edges into 0: x_0 and
1: x_1.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  ...

... local gradient code exposed to the compiler (optimization).
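A sketch of this preaccumulated statement gradient as plain C++ (a hypothetical
helper, not dco/c++ internals; this is the straight-line code the compiler may
optimize before only the two partials are written to the tape):

#include <cmath>
// Local gradient of v4 = sin(x0*x1)/x1 wrt. x0 and x1, preaccumulated.
inline void v4_gradient(double x0, double x1,
                        double& v4, double& dv4_dx0, double& dv4_dx1) {
  const double v2 = x0 * x1;
  const double v3 = std::sin(v2);
  v4 = v3 / x1;
  const double v4_v3 = 1.0 / x1;              // dv4/dv3
  const double v4_v2 = std::cos(v2) * v4_v3;  // dv4/dv2
  dv4_dx0 = v4_v2 * x1;                       // dv4/dx0
  dv4_dx1 = v4_v2 * x0 - v4 / x1;             // dv4/dx1
}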

,
Computational Differentiation, WS 16/17

203

First-Order Adjoints with dco/c++


Record (Tape)

[Figure: preaccumulated adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4

,
Computational Differentiation, WS 16/17

204

First-Order Adjoints with dco/c++


Record (Tape)

[Figure: preaccumulated adjoint lDAG (tape) as before.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c

,
Computational Differentiation, WS 16/17

205

First-Order Adjoints with dco/c++


Seed

[Figure: preaccumulated adjoint lDAG (tape) with adjoint seeds y_0(1), y_1(1).]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  y_0(1) := ?
  y_1(1) := ?
  x_0(1) := ?
  x_1(1) := ?
  v_2(1) := 0
  v_3(1) := 0
  v_4(1) := 0

,
Computational Differentiation, WS 16/17

206

First-Order Adjoints with dco/c++


Interpret (Tape)

STCE

[Figure: preaccumulated adjoint lDAG; vertex 6: y_1 has been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)

,
Computational Differentiation, WS 16/17

207

First-Order Adjoints with dco/c++


Interpret (Tape)

[Figure: preaccumulated adjoint lDAG; vertices 6 and 5 have been eliminated.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)

,
Computational Differentiation, WS 16/17

208

First-Order Adjoints with dco/c++


Interpret (Tape)

[Figure: preaccumulated adjoint lDAG; the collapsed statement vertex has been
eliminated using the preaccumulated partials.]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  x_1(1) += v_4(1) · v4_x1(1)
  x_0(1) += v_4(1) · v4_x0(1)

,
Computational Differentiation, WS 16/17

209

First-Order Adjoints with dco/c++


Harvest

[Figure: only the leaves 0: x_0, 1: x_1 remain, holding the harvested adjoints
x_0(1), x_1(1).]

  v_2 := x_0 · x_1
  v_3 := sin(v_2)
  v_4 := v_3/x_1
  v4_v3(1) := 1/x_1 · 1
  v4_v2(1) := cos(v_2) · v4_v3(1)
  v4_x0(1) := v4_v2(1) · x_1
  v4_x1(1) := v4_v2(1) · x_0 − v_4/x_1
  y_0 := x_0 · v_4
  y_1 := v_4 · c
  v_4(1) += c · y_1(1)
  v_4(1) += x_0 · y_0(1)
  x_0(1) += v_4 · y_0(1)
  x_1(1) += v_4(1) · v4_x1(1)
  x_0(1) += v_4(1) · v4_x0(1)

,
Computational Differentiation, WS 16/17

210

First-Order Adjoint Code


Integration into Steepest Descent Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

211

Outline

STCE

Motivation, Terminology, Finite Differences

First Derivatives of Multivariate Vector Functions

Second Derivatives of Multivariate Scalar Functions

,
Computational Differentiation, WS 16/17

212

Second (and Higher) Derivative Models


Outline

Second Derivatives of Multivariate Scalar Functions
- Numerical Approximation of Hessians
- Second-Order Tangent-over-Tangent (ToT) Model
- Second-Order Tangent-over-Adjoint (ToA) Model
- Second-Order Adjoint-over-Tangent (AoT) Model
- Second-Order Adjoint-over-Adjoint (AoA) Model

Second and Higher Derivatives of Multivariate Vector Functions

,
Computational Differentiation, WS 16/17

213

Second Derivative Models

STCE

Multivariate Scalar Functions

Initially we consider multivariate scalar functions

  y = F(x) : DF ⊆ IR^n → IF ⊆ IR

in order to simplify the notation.
We assume F to be twice continuously differentiable over its domain DF,
implying the existence of the Hessian

  ∇²F(x) ≡ d²F/dx²(x).

For multivariate vector functions the Hessian is a three-tensor, complicating
the notation slightly due to the need for tensor arithmetic; see later.

,
Computational Differentiation, WS 16/17

214

Numerical Approximation of Second Derivatives

STCE

A second-order central finite difference quotient

  d²f/(dx_i dx_j)(x⁰) ≈ [ f(x⁰ + (e_j + e_i)·h) − f(x⁰ + (e_j − e_i)·h)
                          − f(x⁰ + (e_i − e_j)·h) + f(x⁰ − (e_j + e_i)·h) ]
                        / (4·h²)                                         (1)

yields an approximation of the second directional derivative

  y^(1,2) = x^(1)T · ∇²f(x) · x^(2)    (w.l.o.g. m = 1)

as

  d²f/(dx_i dx_j)(x⁰)
    ≈ [ df/dx_i(x⁰ + e_j·h) − df/dx_i(x⁰ − e_j·h) ] / (2·h)
    ≈ [ ( f(x⁰ + e_j·h + e_i·h) − f(x⁰ + e_j·h − e_i·h) ) / (2·h)
      − ( f(x⁰ − e_j·h + e_i·h) − f(x⁰ − e_j·h − e_i·h) ) / (2·h) ] / (2·h).

,
Computational Differentiation, WS 16/17

215

Numerical Approximation of Second Derivatives


Example

STCE

For the given implementation of

  y = (Σ_{i=0}^{n−1} x_i²)²

we are looking for a routine fgh(...) returning, for a given vector x of length
n, the value y of f, its gradient g, and its Hessian h.

int main(int argc, char* argv[]) {
  assert(argc==2); cout.precision(15);
  size_t n=atoi(argv[1]);
  vector<double> x(n,0), g(n,0), h(n*(n+1)/2,0); double y=0;
  for (size_t i=0;i<n;i++) x[i]=cos(static_cast<double>(i));
  fgh(x,y,g,h);
  cout << y << endl;
  for (size_t i=0;i<n;i++) cout << g[i] << endl;
  int ii=0;
  for (int i=0;i<n;i++)
    for (int j=0;j<=i;j++,ii++)
      cout << h[ii] << endl;
  return 0;
}
Computational Differentiation, WS 16/17

,
216

Numerical Approximation of Second Derivatives


Example (fd_fd.cpp)

STCE

template<typename T>
void fgh(const vector<T>& x, T &y, vector<T>& g, vector<T>& h) {
  size_t n=x.size();
  int ii=0;
  for (int i=0;i<n;i++) {
    vector<T> x_pp(x), x_mp(x), g_pp(n,0), g_mp(n,0);
    double p=(x_mp[i]==0) ? sqrt(sqrt(DBL_EPSILON))
                          : sqrt(sqrt(DBL_EPSILON))*abs(x_mp[i]);
    x_mp[i]-=p; fg(x_mp,y,g_mp);
    x_pp[i]+=p; fg(x_pp,y,g_pp);
    for (int j=0;j<=i;j++,ii++)
      h[ii]=(g_pp[j]-g_mp[j])/(2*p);
  }
  fg(x,y,g);
}

... takes about 5.5s for n = 10³ to approximate the Hessian.

,
Computational Differentiation, WS 16/17

217

Motivation
I can do better ... (t2s_a1s.cpp)

STCE

The second-order adjoint

#include "dco.hpp"
using namespace dco;
template<typename T>
void fgh(const vector<T>& xv,
         T& yv, vector<T>& g, vector<vector<T> >& h) {
  typedef ga1s<typename gt1s<T>::type> DCO_M;
  typedef typename DCO_M::type DCO_T;
  typedef typename DCO_M::tape_t DCO_TAPE_T;
  typedef typename DCO_TAPE_T::position_t DCO_TAPE_POSITION_T;
  ...
}
int main(int argc, char* argv[]) {
  ...
  fgh(x,y,g,h);
  ...
}

... takes about 0.5s for n = 10³ to produce the Hessian with machine accuracy.

,
Computational Differentiation, WS 16/17

218

Second-Order Tangent-over-Tangent Model


Computer Scientists View (Simplified)

STCE

A second derivative code F^(1,2) : IR^n × IR^n × IR^n × IR^n → IR × IR × IR × IR,
generated in Tangent-over-Tangent (ToT) mode, computes

  (y, y^(2), y^(1), y^(1,2)) = F^(1,2)(x, x^(2), x^(1), x^(1,2))

as follows:

  y       := F(x)
  y^(2)   := ∇F(x) · x^(2)
  y^(1)   := ∇F(x) · x^(1)
  y^(1,2) := x^(1)T · ∇²F(x) · x^(2) + ∇F(x) · x^(1,2)

,
Computational Differentiation, WS 16/17

219

Second-Order ToT Model

STCE

Notation

Second directional differentiation of

  (y, y^(1)) = (F(x), dF(x)/dx · x^(1))

in tangent mode ...

- ... yields for dy/dx in direction x^(2) the result y^(2);
- ... yields for dy/dx^(1) in direction x^(1)(2) ≡ x^(1,2) a vanishing
  contribution to y^(2);
- ... yields for dy^(1)/dx in direction x^(2) the second-order contribution
  x^(1)T · ∇²F(x) · x^(2) to y^(1)(2) ≡ y^(1,2);
- ... yields for dy^(1)/dx^(1) in direction x^(1,2) the first-order
  contribution ∇F(x) · x^(1,2) to y^(1,2).

,
Computational Differentiation, WS 16/17

220

Second-Order ToT Model

STCE

Derivation

Directional differentiation in tangent mode of the first-order tangent model

  (y, y^(1)) = (F(x), dF(x)/dx · x^(1))

in direction (x^(2), x^(1,2))^T yields

  (y^(2), y^(1,2))^T = d(y, y^(1))/d(x, x^(1)) · (x^(2), x^(1,2))^T,

i.e., with dy/dx^(1) = 0,

  y^(2)   = dy/dx · x^(2) + dy/dx^(1) · x^(1,2) = dF(x)/dx · x^(2)
  y^(1,2) = dy^(1)/dx · x^(2) + dy^(1)/dx^(1) · x^(1,2)
          = x^(1)T · d²F(x)/dx² · x^(2) + dF(x)/dx · x^(1,2)

using dy^(1)/dx = x^(1)T · d²F(x)/dx² (for y^(1) = x^(1)T · (dF(x)/dx)^T) and
dy^(1)/dx^(1) = dF(x)/dx.

,
Computational Differentiation, WS 16/17

221

Second-Order ToT Model


Essential Activity

STCE

Aiming for second-order derivatives with respect to x, we may consider y as not
useful and x^(1) as not varied; i.e., both y and x^(1) are passive with respect
to the application of tangent mode AD to the first-order tangent code.
Consequently,

  y^(1,2) ≡ dy^(1)/dx · x^(2) = x^(1)T · d²F(x)/dx² · x^(2).

Graphical illustration is provided by a corresponding reduction of the
tangent-augmented tangent lDAG; see below.

,
Computational Differentiation, WS 16/17

222

Second-Order ToT Model

STCE

Accumulation of Hessian

  x^(1)T · ∇²F(x) · x^(2)

... accumulation of the whole Hessian element-wise by seeding the input
directions x^(1) ∈ IR^n and x^(2) ∈ IR^n independently with the Cartesian basis
vectors in IR^n for x^(1,2) = 0; harvesting from y^(1,2).
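A hand-coded ToT instance for the running example y = (Σ_{i=0}^{n−1} x_i²)²
(f_tt is a hypothetical helper; with x^(1,2) = 0 the second-order tangent
y^(1,2) equals x^(1)T · ∇²f(x) · x^(2)):

#include <cstddef>
// Hand-written tangent-over-tangent code for y = (sum x_i^2)^2.
void f_tt(std::size_t n, const double* x,
          const double* x_t1, const double* x_t2,
          double& y, double& y_t1, double& y_t2, double& y_t12) {
  double v = 0, v_t1 = 0, v_t2 = 0, v_t12 = 0;
  for (std::size_t i = 0; i < n; i++) {
    v     += x[i] * x[i];
    v_t1  += 2 * x[i] * x_t1[i];        // first tangent of sum of squares
    v_t2  += 2 * x[i] * x_t2[i];        // second tangent
    v_t12 += 2 * x_t2[i] * x_t1[i];     // tangent of the tangent (x^(1,2)=0)
  }
  y     = v * v;
  y_t1  = 2 * v * v_t1;
  y_t2  = 2 * v * v_t2;
  y_t12 = 2 * (v_t2 * v_t1 + v * v_t12); // product rule
}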

,
Computational Differentiation, WS 16/17

223

Second-Order Tangent Model

STCE

Alternative Interpretation via Chain Rule on lDAG

Define

  v^(2) ≡ dv/ds

for v ∈ {x, x^(1), y, y^(1)} and some auxiliary s ∈ IR, assuming that
F^(1)(x(s), x^(1)(s)) is continuously differentiable over its domain.
By the chain rule,

  IR ∋ d(x^(1) · ∇F(x)^T)/ds = d(∇F(x) · x^(1))/ds
     = (dx^(1)/ds) · ∇F(x)^T + x^(1) · d(∇F(x)^T)/ds
     = x^(1,2) · ∇F(x)^T + x^(1) · (d²F(x)/dx² · dx/ds)^T
     = ∇F(x) · x^(1,2) + x^(1) · ∇²F(x) · x^(2)

... best illustrated graphically ...

,
Computational Differentiation, WS 16/17

224

Second-Order ToT Model


Graphically on Tangent-Augmented Tangent lDAG

STCE

  y^(2)   = dy/ds = dF(x)/dx · x^(2)
  y^(1,2) = dy^(1)/ds = dF(x)/dx · x^(1,2) + x^(1)T · d²F(x)/dx² · x^(2)

[Figure: tangent-augmented tangent lDAG with vertices 0: s, 1: x, 2: x^(1),
3: y[F], 4: ∇F, 5: y^(1)[·]; edge labels x^(1), x^(2), x^(1,2), ∇²F.]

Comments:
- the second-order term requires proof, since actually
  dy^(1)/d∇F ∈ IR^(1×(1·n)), d∇F/dx ∈ IR^((1·n)×n), and dx/ds ∈ IR^n.
- x, x^(2) → y^(2)
- x, x^(1), x^(2), x^(1,2) → y^(1,2)
- x^(1) varied? y useful?

,
Computational Differentiation, WS 16/17

225

Second-Order ToT Code with dco/c++

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fgh for given instantiation of

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for
  DCO_M=dco::gt1s<dco::gt1s<double>::type>

- build and run for increasing values of n
- compare with finite differences

,
Computational Differentiation, WS 16/17

226

Second-Order ToT Code with dco/c++


gt2s_gt1s.cpp

STCE

User Guide: y^(1,2) := x^(1)T · ∇²F(x) · x^(2) + ∇F(x) · x^(1,2)

template<typename T>
void fgh(const vector<T>& xv,
         T& yv, vector<T>& g, vector<vector<T> >& h) {
  // generic tangent over tangent scalar dco mode
  typedef gt1s<gt1s<double>::type>::type DCO_T;
  size_t n=xv.size();
  DCO_T y=0;
  for (size_t i=0;i<n;i++) {
    for (size_t j=0;j<=i;j++) {
      vector<DCO_T> x(n,0);
      for (size_t k=0;k<n;k++) x[k]=xv[k]; // zero derivatives
      derivative(value(x[i]))=1; // seed
      value(derivative(x[j]))=1; // seed
      f(x,y); // overloaded primal
      h[i][j]=h[j][i]=derivative(derivative(y)); // harvest
    }
    g[i]=derivative(value(y)); // harvest
  }
  yv=passive_value(y); // primal value
}

,
Computational Differentiation, WS 16/17

227

Second-Order ToT Code with dco/c++


Data Access

STCE

,
Computational Differentiation, WS 16/17

228

Second-Order ToT Code with dco/c++


Example

STCE

,
Computational Differentiation, WS 16/17

229

Approximate Second-Order ToT Code


Finite Differences over Tangents (fd_gt1s.cpp)

STCE

... by application of finite differences to the tangent driver ...

template<typename T>
void fgh(const vector<T>& x, T &y, vector<T>& g, vector<T>& h) {
  size_t n=x.size();
  int ii=0;
  for (int i=0;i<n;i++) {
    vector<T> x_pp(x), x_mp(x), g_pp(n,0), g_mp(n,0);
    double p=(x_mp[i]==0) ? sqrt(DBL_EPSILON)
                          : sqrt(DBL_EPSILON)*abs(x_mp[i]);
    x_mp[i]-=p; fg(x_mp,y,g_mp);
    x_pp[i]+=p; fg(x_pp,y,g_pp);
    for (int j=0;j<=i;j++,ii++)
      h[ii]=(g_pp[j]-g_mp[j])/(2*p);
  }
  fg(x,y,g);
}

,
Computational Differentiation, WS 16/17

230

Second-Order ToT Code


Integration into Newton's Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

231

Second-Order Tangent-over-Adjoint Model


Computer Scientists View (Simplified)

STCE

A second derivative code

  F_(1)^(2) : IR^n × IR^n × IR × IR → IR × IR × IR^n × IR^n,

generated in Tangent-over-Adjoint (ToA) mode, computes

  (y, y^(2), x_(1), x_(1)^(2)) = F_(1)^(2)(x, x^(2), y_(1), y_(1)^(2))

as follows:

  y         := F(x)
  y^(2)     := ∇F(x) · x^(2)
  x_(1)     := ∇F(x)^T · y_(1)
  x_(1)^(2) := y_(1) · ∇²F(x) · x^(2) + ∇F(x)^T · y_(1)^(2)
,
Computational Differentiation, WS 16/17

232

Second-Order ToA Model

STCE

Notation

Second directional differentiation of

  (y, x_(1)) = (F(x), (dF(x)/dx)^T · y_(1))

in tangent mode ...

- ... yields for dy/dx in direction x^(2) the result y^(2);
- ... yields for dy/dy_(1) in direction y_(1)^(2) a vanishing contribution
  to y^(2);
- ... yields for dx_(1)/dx in direction x^(2) the second-order contribution
  y_(1) · ∇²F(x) · x^(2) to x_(1)^(2);
- ... yields for dx_(1)/dy_(1) in direction y_(1)^(2) the first-order
  contribution ∇F(x)^T · y_(1)^(2) to x_(1)^(2).

,
Computational Differentiation, WS 16/17

233

Second-Order ToA Model

STCE

Derivation

Directional differentiation in tangent mode of the first-order adjoint model

  (y, x_(1)) = (F(x), (dF(x)/dx)^T · y_(1))

in direction (x^(2), y_(1)^(2))^T yields, with dy/dy_(1) = 0,

  y^(2)     = dy/dx · x^(2) + dy/dy_(1) · y_(1)^(2) = dF(x)/dx · x^(2)
  x_(1)^(2) = dx_(1)/dx · x^(2) + dx_(1)/dy_(1) · y_(1)^(2)
            = y_(1) · d²F(x)/dx² · x^(2) + (dF(x)/dx)^T · y_(1)^(2)

using dx_(1)/dx = y_(1) · d²F(x)/dx² (for x_(1) = y_(1) · dF(x)/dx) and
dx_(1)/dy_(1) = (dF(x)/dx)^T.

,
Computational Differentiation, WS 16/17

234

Second-Order ToA Model


Essential Activity

STCE

Aiming for second-order derivatives with respect to x, we may consider y as not
useful and y_(1) as not varied; i.e., both y and y_(1) are passive with respect
to the application of tangent mode AD to the first-order adjoint code.
Consequently,

  x_(1)^(2) ≡ dx_(1)/dx · x^(2) = y_(1) · d²F(x)/dx² · x^(2).

Graphical illustration is provided by a corresponding reduction of the
tangent-augmented adjoint lDAG; see below.

,
Computational Differentiation, WS 16/17

235

Second-Order ToA Model

STCE

Accumulation of Hessian

  y_(1) · ∇²F(x) · x^(2)

... accumulation of the whole Hessian column-wise by seeding input directions
x^(2) ∈ IR^n with the Cartesian basis vectors in IR^n for y_(1) = 1 and
y_(1)^(2) = 0; harvesting from x_(1)^(2).

,
Computational Differentiation, WS 16/17

236

Second-Order ToA Model


Alternative Interpretation via Chain Rule on lDAG

STCE

Define

  v^(2) ≡ dv/ds

for v ∈ {x, x_(1), y, y_(1)}, assuming that F_(1)(x(s), x_(1)(s), y_(1)(s)) is
continuously differentiable over its domain.

By the chain rule,

  d(∇F(x)^T · y_(1))/ds = (d∇F(x)^T/ds) · y_(1) + ∇F(x)^T · dy_(1)/ds
    = (d²F(x)/dx² · dx/ds)^T · y_(1) + ∇F(x)^T · y_(1)^(2)
    = y_(1) · ∇²F(x) · x^(2) + ∇F(x)^T · y_(1)^(2)

... best illustrated graphically ...

,
Computational Differentiation, WS 16/17

237

Second-Order ToA Model

STCE

Graphically on Tangent-Augmented Adjoint lDAG

  y^(2)     = dy/ds = dF(x)/dx · x^(2)
  x_(1)^(2) = dx_(1)/ds = y_(1) · ∇²F(x) · x^(2) + ∇F(x)^T · y_(1)^(2)

[Figure: tangent-augmented adjoint lDAG with vertices 0: x, 1: y_(1), 4: y[F],
5: ∇F^T, 6: x_(1)[·]; edge labels x^(2), y_(1)^(2), ∇²F, and the auxiliary s.]

Comments:
- the second-order term requires proof, since actually
  dx_(1)/d∇F ∈ IR^(n×(1·n)), d∇F/dx ∈ IR^((1·n)×n), and dx/ds ∈ IR^n.
- x, x^(2) → y^(2)
- x, y_(1), x^(2), y_(1)^(2) → x_(1)^(2)
- y_(1) varied? y useful?
,
Computational Differentiation, WS 16/17

238

Second-Order ToA Code with dco/c++

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fgh for given

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for
  DCO_M=dco::ga1s<dco::gt1s<double>::type>

- build and run for increasing values of n
- compare with second-order tangent AD and with second-order finite
  differences

,
Computational Differentiation, WS 16/17

239

Second-Order ToA Code with dco/c++


gt2s_ga1s.cpp I

STCE

User Guide: x_(1)^(2) := y_(1) · ∇²F(x) · x^(2) + ∇F(x)^T · y_(1)^(2)

template<typename T>
void fgh(const vector<T>& xv,
         T& yv, vector<T>& g, vector<vector<T> >& h) {
  // generic tangent over adjoint scalar dco mode
  typedef ga1s<typename gt1s<T>::type> DCO_M;
  typedef typename DCO_M::type DCO_T;
  typedef typename DCO_M::tape_t DCO_TAPE_T;
  size_t n=xv.size();
  DCO_M::global_tape=DCO_TAPE_T::create();
  for (size_t i=0;i<n;i++) {
    vector<DCO_T> x(n,0); DCO_T y=0;
    for (size_t j=0;j<n;j++) {
      x[j]=xv[j];
      DCO_M::global_tape->register_variable(x[j]);
    }
    derivative(value(x[i]))=1; // seed tangent
    f(x,y); // overloaded primal
    DCO_M::global_tape->register_output_variable(y);
    yv=passive_value(y); // primal value

,
Computational Differentiation, WS 16/17

240

Second-Order ToA Code with dco/c++


gt2s_ga1s.cpp II

STCE

    g[i]=derivative(value(y)); // harvest tangent
    value(derivative(y))=1; // seed adjoint
    DCO_M::global_tape->interpret_adjoint();
    for (size_t j=0;j<n;j++) h[i][j]=derivative(derivative(x[j]));
    // harvest adjoint
    DCO_M::global_tape->reset(); // reset tape to start position
  }
  DCO_TAPE_T::remove(DCO_M::global_tape);
}

,
Computational Differentiation, WS 16/17

241

Second-Order ToA Code with dco/c++


Data Access

STCE

,
Computational Differentiation, WS 16/17

242

Second-Order ToA Code with dco/c++


Example

STCE

,
Computational Differentiation, WS 16/17

243

Second-Order ToA Code


Integration into Newton's Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

244

Second-Order Adjoint-over-Tangent Model

STCE

Computer Scientists View (Simplified)

A second derivative code

  F^(1)_(2) : IR^n × IR^n × IR × IR → IR × IR × IR^n × IR^n,

generated in Adjoint-over-Tangent (AoT) mode, computes

  (y, y^(1), x_(2), x_(2)^(1)) = F^(1)_(2)(x, x^(1), y_(2), y_(2)^(1))

as follows:

  y         := F(x)
  y^(1)     := ∇F(x) · x^(1)
  x_(2)     := y_(2)^(1) · ∇²F(x) · x^(1) + ∇F(x)^T · y_(2)
  x_(2)^(1) := ∇F(x)^T · y_(2)^(1)

,
Computational Differentiation, WS 16/17

245

Second-Order AoT Model


Derivation I

STCE

Directional differentiation in adjoint mode of the first-order tangent model

  (y, y^(1)) = (F(x), dF(x)/dx · x^(1))

in direction (y_(2), y_(2)^(1))^T yields

  (x_(2), x_(2)^(1))^T = (d(y, y^(1))/d(x, x^(1)))^T · (y_(2), y_(2)^(1))^T,

i.e., with dy/dx^(1) = 0,

  x_(2)     = (dy/dx)^T · y_(2) + (dy^(1)/dx)^T · y_(2)^(1)
            = (dF(x)/dx)^T · y_(2) + y_(2)^(1) · d²F(x)/dx² · x^(1)
  x_(2)^(1) = (dy^(1)/dx^(1))^T · y_(2)^(1) = (dF(x)/dx)^T · y_(2)^(1)

using dy^(1)/dx = x^(1)T · d²F(x)/dx² and dy^(1)/dx^(1) = dF(x)/dx.
,
Computational Differentiation, WS 16/17

246

Second-Order AoT Model

STCE

Essential Activity

Aiming for second-order derivatives with respect to x, we may consider y as not
useful and x^(1) as not varied; i.e., both y and x^(1) are passive with respect
to the application of adjoint mode AD to the first-order tangent code.
Consequently,

  x_(2) ≡ (dy^(1)/dx)^T · y_(2)^(1) = y_(2)^(1) · d²F(x)/dx² · x^(1).

Graphical illustration is provided by a corresponding reduction of the
adjoint-augmented tangent lDAG; see below.

,
Computational Differentiation, WS 16/17

247

Second-Order AoT Model

STCE

Graphically on Adjoint-Augmented Tangent lDAG

  x_(2)^(1) ≡ (dt/dx^(1))^T = ∇F(x)^T · y_(2)^(1)
  x_(2)     ≡ (dt/dx)^T = ∇F(x)^T · y_(2) + y_(2)^(1) · ∇²F(x) · x^(1)

[Figure: adjoint-augmented tangent lDAG with vertices 0: x, 2: x^(1), 1: y[F],
3: ∇F, 4: y^(1)[·] and root t; adjoint seeds y_(2), y_(2)^(1); edge label ∇²F.]

Comments:
- the second-order term requires proof, since actually dt/dy^(1) ∈ IR,
  dy^(1)/d∇F ∈ IR^(1×(1·n)), and d∇F/dx ∈ IR^((1·n)×n).
- x, x^(1), y_(2), y_(2)^(1) → x_(2)
- x, y_(2)^(1) → x_(2)^(1)
- x^(1) varied? y useful?

,
Computational Differentiation, WS 16/17

248

Second-Order AoT Model

STCE

Accumulation of Hessian (Complexity)

  y_(2)^(1) · ∇²F(x) · x^(1)

... harvesting of the whole Hessian column-wise by seeding input directions
x^(1) ∈ IR^n with the Cartesian basis vectors in IR^n for y_(2)^(1) = 1,
y_(2) = 0, and x_(2) = 0 on input; harvesting from x_(2).

,
Computational Differentiation, WS 16/17

249

Second-Order AoT Code with dco/c++

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fgh for given

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for
  DCO_M=dco::gt1s<dco::ga1s<double>::type>

- build and run for increasing values of n
- compare with second-order ToT and ToA codes and with second-order
  finite differences

,
Computational Differentiation, WS 16/17

250

Second-Order AoT Code with dco/c++


ga2s_gt1s.cpp I

STCE

User Guide: x_(2) := ∇F(x)^T · y_(2) + y_(2)^(1) · ∇²F(x) · x^(1)

void fgh(const vector<double>& xv,
         double& yv, vector<double>& g, vector<vector<double> >& H) {
  typedef ga1s<double> DCO_BASE_MODE;
  typedef DCO_BASE_MODE::type DCO_BASE_TYPE;
  // generic adjoint over tangent scalar dco mode
  typedef gt1s<DCO_BASE_TYPE> DCO_MODE;
  typedef DCO_MODE::type DCO_TYPE;
  typedef DCO_BASE_MODE::tape_t DCO_TAPE_TYPE;
  DCO_BASE_MODE::global_tape=DCO_TAPE_TYPE::create();
  size_t n=xv.size();
  vector<DCO_TYPE> x(n),x_indep(n);
  DCO_TYPE y;
  for (size_t i=0;i<n;i++) {
    for (size_t j=0;j<n;j++) {
      x[j]=xv[j];
      // independents are both components of wrapping tangent type
      DCO_BASE_MODE::global_tape->register_variable(value(x[j]));
      DCO_BASE_MODE::global_tape->register_variable(derivative(x[j]));
      x_indep[j]=x[j]; // save tape position of independents
,
Computational Differentiation, WS 16/17

251

Second-Order AoT Code with dco/c++


ga2s_gt1s.cpp II

STCE

    }
    value(derivative(x[i]))=1; // seed
    f(x,y); // overloaded primal generates tape
    DCO_BASE_MODE::global_tape->register_output_variable(value(y));
    DCO_BASE_MODE::global_tape->register_output_variable(derivative(y));
    derivative(derivative(y))=1; // seed
    DCO_BASE_MODE::global_tape->interpret_adjoint();
    for (size_t j=0;j<n;j++)
      H[j][i] = derivative(value(x_indep[j])); // harvest
    DCO_BASE_MODE::global_tape->reset(); // reset tape to start position
    g[i]=value(derivative(y)); // harvest
  }
  yv=passive_value(y);
  DCO_TAPE_TYPE::remove(DCO_BASE_MODE::global_tape);
}

,
Computational Differentiation, WS 16/17

252

Second-Order AoT Code with dco/c++


Data Access

STCE

,
Computational Differentiation, WS 16/17

253

Second-Order AoT Code


Integration into Newton's Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

254

Second-Order Adjoint-over-Adjoint Model


Computer Scientists View (Simplified)

STCE

A second derivative code

  F_(1,2) : IR^n × IR^n × IR × IR → IR × IR^n × IR^n × IR,

generated in Adjoint-over-Adjoint (AoA) mode, computes

  (y, x_(1), x_(2), y_(1,2)) = F_(1,2)(x, x_(1,2), y_(1), y_(2))

as follows:

  y       := F(x)
  x_(1)   := ∇F(x)^T · y_(1)
  x_(2)   := y_(1) · ∇²F(x) · x_(1,2) + ∇F(x)^T · y_(2)
  y_(1,2) := ∇F(x) · x_(1,2)

,
Computational Differentiation, WS 16/17

255

Second-Order AoA Model

STCE

Notation

Second directional differentiation of

  (y, x_(1)) = (F(x), (dF(x)/dx)^T · y_(1))

in adjoint mode ...

- ... yields for dy/dx in direction y_(2) the contribution ∇F(x)^T · y_(2)
  to x_(2);
- ... yields for dy/dy_(1) in direction y_(2) a vanishing contribution
  to y_(1,2);
- ... yields for dx_(1)/dx in direction x_(1,2) the second-order contribution
  y_(1) · ∇²F(x) · x_(1,2) to x_(2);
- ... yields for dx_(1)/dy_(1) in direction x_(1,2) the result
  y_(1,2) = ∇F(x) · x_(1,2).

,
Computational Differentiation, WS 16/17

256

Second-Order AoA Model

STCE

Derivation

Directional differentiation in adjoint mode of the first-order adjoint model

  (y, x_(1)) = (F(x), (dF(x)/dx)^T · y_(1))

in direction (y_(2), x_(1,2))^T yields, with dy/dy_(1) = 0 and
x_(1) = y_(1) · dF(x)/dx,

  x_(2)   = (dy/dx)^T · y_(2) + (dx_(1)/dx)^T · x_(1,2)
          = (dF(x)/dx)^T · y_(2) + y_(1) · d²F(x)/dx² · x_(1,2)
  y_(1,2) = (dy/dy_(1))^T · y_(2) + (dx_(1)/dy_(1))^T · x_(1,2)
          = dF(x)/dx · x_(1,2)
,
Computational Differentiation, WS 16/17

257

Second-Order AoA Model


Essential Activity

STCE

Aiming for second-order derivatives with respect to x, we may consider y as not
useful and y_(1) as not varied; i.e., both y and y_(1) are passive with respect
to the application of adjoint mode AD to the first-order adjoint code.
Consequently,

  x_(2) ≡ (dx_(1)/dx)^T · x_(1,2) = y_(1) · d²F(x)/dx² · x_(1,2).

Graphical illustration is provided by a corresponding reduction of the
adjoint-augmented adjoint lDAG; see below.

,
Computational Differentiation, WS 16/17

258

Second-Order AoA Model

STCE

Graphically on Adjoint-Augmented Adjoint lDAG

  x_(2)   ≡ (dt/dx)^T = ∇F(x)^T · y_(2) + y_(1) · ∇²F(x) · x_(1,2)
  y_(1,2) ≡ (dt/dy_(1))^T = ∇F(x) · x_(1,2)

[Figure: adjoint-augmented adjoint lDAG with vertices 0: x, 2: y_(1), 1: y[F],
3: ∇F, 4: x_(1)[·] and root t; adjoint seeds y_(2), x_(1,2); edge label ∇²F.]

Comments:
- the second-order term requires proof, since actually dt/dx_(1) ∈ IR^(1×n),
  dx_(1)/d∇F ∈ IR^(n×(1·n)), and d∇F/dx ∈ IR^((1·n)×n).
- x, y_(1), y_(2), x_(1,2) → x_(2)
- x, x_(1,2) → y_(1,2)
- y_(1) varied? y useful?

,
Computational Differentiation, WS 16/17

259

Second-Order AoA Model

STCE

Accumulation of Hessian (Complexity)

  y_(1) · ∇²F(x) · x_(1,2)

... harvesting of the whole Hessian row-wise by seeding input directions
x_(1,2) ∈ IR^n with the Cartesian basis vectors in IR^n for y_(1) = 1,
y_(2) = 0, and x_(2) = 0 on input; harvesting from x_(2).

,
Computational Differentiation, WS 16/17

260

Second-Order AoA Code with dco/c++

STCE

Live for y = (Σ_{i=0}^{n−1} x_i²)²:

- implementation of driver fgh for given

  template<class T>
  void f(const vector<T>& x, T &y) {
    ...
  }

  instantiated with T=DCO_M::type for
  DCO_M=dco::ga1s<dco::ga1s<double>::type>

- build and run for increasing values of n
- compare with second-order ToT, ToA and AoT codes and with
  second-order finite differences

,
Computational Differentiation, WS 16/17

261

Second-Order AoA Code with dco/c++


ga2s_ga1s.cpp I

STCE

User Guide: x_(2) := ∇F(x)^T · y_(2) + y_(1) · ∇²F(x) · x_(1,2)

void fgh(const vector<double>& xv,
         double& yv, vector<double>& g, vector<vector<double> >& H) {
  typedef ga1s<double> DCO_BASE_MODE;
  typedef DCO_BASE_MODE::type DCO_BASE_TYPE;
  typedef DCO_BASE_MODE::tape_t DCO_BASE_TAPE_TYPE;
  // generic adjoint over adjoint dco mode
  typedef ga1s<DCO_BASE_TYPE> DCO_MODE;
  typedef DCO_MODE::type DCO_TYPE;
  typedef DCO_MODE::tape_t DCO_TAPE_TYPE;
  size_t n=xv.size();
  vector<DCO_TYPE> x(n),x_in(n);
  DCO_TYPE y;
  // local taping and interpretation is taped and interpreted ...
  DCO_BASE_MODE::global_tape=DCO_BASE_TAPE_TYPE::create();
  DCO_MODE::global_tape=DCO_TAPE_TYPE::create();
  for (size_t j=0;j<n;j++) {
    x[j]=xv[j];
    DCO_BASE_MODE::global_tape->register_variable(value(x[j]));
    DCO_MODE::global_tape->register_variable(x[j]);

,
Computational Differentiation, WS 16/17

262

Second-Order AoA Code with dco/c++


ga2s_ga1s.cpp II

STCE

    x_in[j]=x[j];
  }
  f(x,y);
  derivative(y)=1.0;
  DCO_BASE_MODE::global_tape->register_variable(derivative(y));
  DCO_MODE::global_tape->interpret_adjoint();
  for (size_t j=0;j<n;j++)
    g[j]=value(derivative(x_in[j]));
  // repeated interpretation of same tape
  for (size_t i=0;i<n;i++) {
    derivative(derivative(x_in[i]))=1;
    DCO_BASE_MODE::global_tape->interpret_adjoint();
    for (size_t j=0;j<n;j++)
      H[i][j]=derivative(value(x_in[j]));
    // zero adjoints prior to re-interpretation
    DCO_BASE_MODE::global_tape->zero_adjoints();
  }
  yv=passive_value(y);
  DCO_BASE_TAPE_TYPE::remove(DCO_BASE_MODE::global_tape);
  DCO_TAPE_TYPE::remove(DCO_MODE::global_tape);
}

,
Computational Differentiation, WS 16/17

263

Second-Order AoA Code with dco/c++


ga2s_ga1s.cpp III

STCE

,
Computational Differentiation, WS 16/17

264

Second-Order AoA Code with dco/c++


Data Access

STCE

,
Computational Differentiation, WS 16/17

265

Approximate Second-Order ToA Code


Finite Differences over Adjoint

STCE

template<typename T>
void fgh(const vector<T>& x, T &y, vector<T>& g, vector<T>& h) {
  size_t n=x.size();
  int ii=0;
  for (int i=0;i<n;i++) {
    vector<T> x_pp(x), x_mp(x), g_pp(n,0), g_mp(n,0);
    double p=(x_mp[i]==0) ? sqrt(DBL_EPSILON)
                          : sqrt(DBL_EPSILON)*abs(x_mp[i]);
    x_mp[i]-=p; fg(x_mp,y,g_mp);
    x_pp[i]+=p; fg(x_pp,y,g_pp);
    for (int j=0;j<=i;j++,ii++)
      h[ii]=(g_pp[j]-g_mp[j])/(2*p);
  }
  fg(x,y,g);
}

Alternatively: Adjoint over Finite Differences

,
Computational Differentiation, WS 16/17

266

Second-Order AoA Code


Integration into Newton's Algorithm for NLP

STCE

See tutorial for implementation.

,
Computational Differentiation, WS 16/17

267

References I

STCE

M. Beckers, V. Mosenkis, and U. Naumann.
Adjoint mode computations of subgradients for McCormick relaxations.
In Recent Advances in Algorithmic Differentiation, number 87 in Lecture
Notes in Computational Science and Engineering (LNCSE), pages 103-113.
Springer, 2012. Type: Conference Paper.

R. Hannemann-Tamás, J. Tillack, M. Schmitz, M. Förster, J. Wyes, K. Nöh,
E. von Lieres, U. Naumann, W. Wiechert, and W. Marquardt.
First- and second-order parameter sensitivities of a metabolically and
isotopically non-stationary biochemical network model.
In Electronic Proceedings of the 9th International Modelica Conference,
Munich, Sep 3-5, 2012. Modelica Association, 2012. Type: Conference Paper.

,
Computational Differentiation, WS 16/17

268

References II

STCE

J. Lotz, M. Schwalbach, and U. Naumann.
A case study in adjoint sensitivity analysis of parameter calibration.
In International Conference on Computational Science (ICCS 2016),
Procedia Computer Science, 2016. Type: Conference Paper.

U. Merkel, J. Riehme, and U. Naumann.
Reverse engineering of initial and boundary conditions with Telemac and
algorithmic differentiation.
WASSERWIRTSCHAFT, 103(12):22-27, 2013. Type: Journal Paper.

M. Towara, M. Schanen, and U. Naumann.
MPI-parallel discrete adjoint OpenFOAM.
Procedia Computer Science, 51:19-28, 2015. Type: Conference Paper.

,
Computational Differentiation, WS 16/17

269

References III

STCE

J. Ungermann, J. Blank, J. Lotz, K. Leppkes, L. Hoffmann, T. Guggenmoser,
M. Kaufmann, P. Preusse, U. Naumann, and M. Riese.
A 3-D tomographic retrieval approach with advection compensation for the
air-borne limb-imager GLORIA.
Atmospheric Measurement Techniques, 4(11):2509-2529, 2011. Type: Journal Paper.

V. Vassiliadis, J. Riehme, J. Deussen, K. Parasyris, C. Antonopoulos,
N. Bellas, S. Lalis, and U. Naumann.
Towards automatic significance analysis for approximate computing.
In International Symposium on Code Generation and Optimization, pages
182-193. IEEE/ACM, 2016. Type: Conference Paper.

,
Computational Differentiation, WS 16/17

270

References IV

STCE

A. Vlasenko, P. Korn, J. Riehme, and U. Naumann.
Estimation of data assimilation error: A shallow-water model study.
Monthly Weather Review, 142:2502-2520, 2014. Type: Journal Paper.

,
Computational Differentiation, WS 16/17

271
