Autodiff
Lecture Notes
William W. Cohen
October 17, 2016
A Wengert list defines a function f(x1, ..., xk) of some inputs x1, ..., xk.
Syntactically, it is just a list of assignment statements, where the right-hand
side (RHS) of each statement is very simple: a call to a function g (where
g is one of a set of primitive functions G = {g1, ..., gℓ} supported by the
sublanguage), where the arguments to g are either inputs x1, ..., xk or the
left-hand side (LHS) of a previous assignment. (The functions in G will be called
operators here.) The output of the function is the LHS of the last item in
the list. For example, a Wengert list for
\[ f(x_1, x_2) \equiv (2x_1 + x_2)^2 \qquad (1) \]
with the functions G = {add, multiply, square} might be
z1 = add(x1, x1)
z2 = add(z1, x2)
f = square(z2)
A Wengert list for
\[ f(x) \equiv x^3 \]
might be
z1 = multiply(x, x)
f = multiply(z1, x)
The set of functions G defines the sublanguage. It’s convenient if they
have a small fixed number of arguments, and are differentiable with respect
to all their arguments. But there’s no reason that they have to be scalar
functions! For instance, if G contains the appropriate matrix and vector
operations, the loss for logistic regression (for example x with label y and
weight matrix W ) could be written as
f(x, y, W) ≡ crossEntropy(softmax(x · W), y) + frobeniusNorm(W)
and be compiled to the Wengert list
z1 = dot(x, W)
z2 = softmax(z1)
z3 = crossEntropy(z2, y)
z4 = frobeniusNorm(W)
f = add(z3, z4)
This implements k-class logistic regression if x is a d-dimensional row vector,
dot is matrix product, W is a d × k weight matrix, and
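\[ \mathrm{softmax}(\langle a_1, \ldots, a_d \rangle) \equiv \left\langle \frac{e^{a_1}}{\sum_{i=1}^{d} e^{a_i}}, \ldots, \frac{e^{a_d}}{\sum_{i=1}^{d} e^{a_i}} \right\rangle \]
\[ \mathrm{crossEntropy}(\langle a_1, \ldots, a_d \rangle, \langle b_1, \ldots, b_d \rangle) \equiv -\sum_{i=1}^{d} a_i \log b_i \]
\[ \mathrm{frobeniusNorm}(A) \equiv \sqrt{\sum_{i=1}^{d} \sum_{j=1}^{k} a_{i,j}^2} \]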
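As a concrete reading of the three definitions above, here is a minimal pure-Python sketch. The names softmax, cross_entropy and frobenius_norm, and the representation of a matrix as a list of rows, are choices made for this illustration only. Subtracting the maximum inside softmax is an added numerical safeguard that does not change the result.

```python
import math

def softmax(a):
    """Map a list of scores to positive values that sum to 1."""
    m = max(a)  # assumption: shift by the max to avoid overflow; the ratio is unchanged
    exps = [math.exp(ai - m) for ai in a]
    total = sum(exps)
    return [e / total for e in exps]

def cross_entropy(a, b):
    """Compute -sum_i a_i * log(b_i), with the same argument order as the definition."""
    return -sum(ai * math.log(bi) for ai, bi in zip(a, b))

def frobenius_norm(A):
    """Square root of the sum of squared entries of a matrix given as a list of rows."""
    return math.sqrt(sum(aij * aij for row in A for aij in row))

p = softmax([1.0, 2.0, 3.0])
target = [0.0, 1.0, 0.0]          # hypothetical one-hot vector
loss = cross_entropy(target, p)   # log is only taken of the positive entries of p
norm = frobenius_norm([[3.0, 0.0], [0.0, 4.0]])
print(p, loss, norm)
```

In this usage the one-hot vector is passed as the first argument so that the logarithm is only ever applied to strictly positive numbers.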
Each assignment in a Wengert list can be encoded in Python as a nested tuple
(z, g, (y1, ..., yk))
where z is a string that names the LHS variable, g is a string that names the
operator, and the yi's are strings that name the arguments to g. So the list
for the function of Equation 1 would be encoded in Python as
[ ("z1", "add", ("x1","x1")),
  ("z2", "add", ("z1","x2")),
  ("f", "square", ("z2",)) ]
We also store functions for each operator in a Python dictionary G:
G = { "add" : lambda a,b: a+b,
"square": lambda a:a*a }
Before we evaluate the function, we will store the parameters in a dictionary
val: e.g., to evaluate f at x1 = 3, x2 = 7 we will initialize val to
val = { "x1" : 3, "x2" : 7 }
The pseudo-code to evaluate f is:
def eval(f):
    initialize val to the inputs at which f should be evaluated
    for (z, g, (y1, ..., yk)) in the list:
        op = G[g]
        val[z] = op(val[y1], ..., val[yk])
    return val[z] for the LHS z of the last item in the list
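The evaluation loop can be made runnable with very little extra code. The sketch below is one possible rendering, applied to the list for Equation 1. The function is named evaluate (an arbitrary choice, to avoid shadowing Python's built-in eval), and it returns both the result and the filled-in val dictionary.

```python
# Wengert list for f(x1, x2) = (2*x1 + x2)^2, one (lhs, operator, args) tuple per step
wengert = [
    ("z1", "add", ("x1", "x1")),
    ("z2", "add", ("z1", "x2")),
    ("f", "square", ("z2",)),  # the trailing comma makes a one-element tuple
]

# One Python function per operator
G = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "square": lambda a: a * a,
}

def evaluate(wengert_list, inputs):
    """Run the assignments in order; return (output value, all stored values)."""
    val = dict(inputs)  # copy so the caller's dictionary is not modified
    for z, g, args in wengert_list:
        op = G[g]
        val[z] = op(*(val[y] for y in args))
    last_lhs = wengert_list[-1][0]
    return val[last_lhs], val

result, val = evaluate(wengert, {"x1": 3, "x2": 7})
print(result)  # (2*3 + 7)^2 = 169
```

Keeping every intermediate value in val matters because the derivative computation needs the values of z1 and z2 again.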
def backprop(f, val):
    initialize delta: delta[f] = 1
    for (z, g, (y1, ..., yk)) in the list, in reverse order:
        for i = 1, ..., k:
            op_i = DG[g][i]
            if delta[yi] is not defined, set delta[yi] = 0
            delta[yi] = delta[yi] + delta[z] * op_i(val[y1], ..., val[yk])
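A runnable version of this loop might look like the following sketch. It assumes that DG maps each operator name to a list of functions, one per argument position, each returning the partial derivative of the operator with respect to that argument. That layout is inferred from the expression DG[g][i] in the pseudo-code; the concrete entries below are filled in for illustration, and indices are 0-based as usual in Python.

```python
G = {
    "add": lambda a, b: a + b,
    "multiply": lambda a, b: a * b,
    "square": lambda a: a * a,
}

# Assumed layout: DG[g][i] is the partial derivative of g with respect to argument i
DG = {
    "add": [lambda a, b: 1, lambda a, b: 1],
    "multiply": [lambda a, b: b, lambda a, b: a],
    "square": [lambda a: 2 * a],
}

def evaluate(wengert_list, inputs):
    """Forward pass: compute and store the value of every variable."""
    val = dict(inputs)
    for z, g, args in wengert_list:
        val[z] = G[g](*(val[y] for y in args))
    return val

def backprop(wengert_list, val):
    """Backward pass: delta[v] accumulates d(output)/d(v)."""
    delta = {wengert_list[-1][0]: 1}
    for z, g, args in reversed(wengert_list):
        arg_vals = [val[y] for y in args]
        for i, y in enumerate(args):
            d_op = DG[g][i]
            # delta.get(..., 0) plays the role of "if not defined, set to 0"
            delta[y] = delta.get(y, 0) + delta.get(z, 0) * d_op(*arg_vals)
    return delta

wengert = [
    ("z1", "add", ("x1", "x1")),
    ("z2", "add", ("z1", "x2")),
    ("f", "square", ("z2",)),
]
val = evaluate(wengert, {"x1": 3, "x2": 7})
delta = backprop(wengert, val)
print(delta["x1"], delta["x2"])  # 8*3 + 4*7 = 52 and 4*3 + 2*7 = 26
```

Note that x1 appears twice as an argument of z1, so delta["x1"] receives two contributions of 26 each.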
1.4 Examples
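Let’s look at Equation 1. In freshman calculus you’d probably just apply the
chain rule directly:
\[ \frac{df}{dx_1} = 2(2x_1 + x_2) \cdot 2 = 8x_1 + 4x_2, \qquad \frac{df}{dx_2} = 2(2x_1 + x_2) \cdot 1 = 4x_1 + 2x_2 \]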
Here we’ll use the Wengert list, in reverse, and the chain rule. This list is
z1 = add(x1, x1)
z2 = add(z1, x2)
f = square(z2)
Table 1 contains a detailed derivation of df/dx1, where in each step we either
plug in a definition of a variable in the list, or use the derivative of one of
the operators (square or add). Table 2 contains an analogous derivation of
df/dx2. Notice that these derivations are nearly identical. In fact, they are
closely analogous to the computations carried out by the backprop algorithm: can
you see how?
Finally, Table 3 shows a slightly less detailed derivation of the second
sample function, f = x^3. It is instructive to step through the backprop
algorithm for these functions as well: for example, the list z1 = x · x; f = z1 · x
leads to the delta updates listed after Table 3.
Derivation Step                                                       Reason
df/dx1 = d(z2^2)/dz2 · dz2/dx1                                        f = z2^2
df/dx1 = 2z2 · dz2/dx1                                                d(a^2)/da = 2a
df/dx1 = 2z2 · d(z1 + x2)/dx1                                         z2 = z1 + x2
df/dx1 = 2z2 · (1 · dz1/dx1 + 1 · dx2/dx1)                            d(a+b)/da = d(a+b)/db = 1
df/dx1 = 2z2 · (1 · d(x1 + x1)/dx1 + 1 · dx2/dx1)                     z1 = x1 + x1
df/dx1 = 2z2 · (1 · (1 · dx1/dx1 + 1 · dx1/dx1) + 1 · dx2/dx1)        d(a+b)/da = d(a+b)/db = 1
df/dx1 = 2z2 · (1 · (1 · 1 + 1 · 1) + 1 · 0)                          da/da = 1 and da/db = 0 for inputs a, b
df/dx1 = 2z2 · 2 = 8x1 + 4x2                                          simplify
Table 1: A detailed derivation of df/dx1 for f = z2^2; z2 = z1 + x2; z1 = x1 + x1
Derivation Step                                                       Reason
df/dx2 = d(z2^2)/dz2 · dz2/dx2                                        f = z2^2
df/dx2 = 2z2 · dz2/dx2                                                d(a^2)/da = 2a
df/dx2 = 2z2 · d(z1 + x2)/dx2                                         z2 = z1 + x2
df/dx2 = 2z2 · (1 · dz1/dx2 + 1 · dx2/dx2)                            d(a+b)/da = d(a+b)/db = 1
df/dx2 = 2z2 · (1 · d(x1 + x1)/dx2 + 1 · dx2/dx2)                     z1 = x1 + x1
df/dx2 = 2z2 · (1 · (1 · dx1/dx2 + 1 · dx1/dx2) + 1 · dx2/dx2)        d(a+b)/da = d(a+b)/db = 1
df/dx2 = 2z2 · (1 · (1 · 0 + 1 · 0) + 1 · 1)                          da/da = 1 and da/db = 0 for inputs a, b
df/dx2 = 2z2 · 1 = 4x1 + 2x2                                          simplify
Table 2: A detailed derivation of df/dx2 for f = z2^2; z2 = z1 + x2; z1 = x1 + x1
df/dx = (x + x) · x + x^2 = 3x^2                                      z1 = x · x and simplify
Table 3: A derivation of df/dx for f = z1 · x; z1 = x · x
delta[f] = 1
delta[z1] += delta[f] · x = x        arg 1 of f = mul(...)
delta[x] += delta[f] · z1 = x^2      arg 2 of f = mul(...)
delta[x] += delta[z1] · x = x^2      arg 1 of z1 = mul(...)
delta[x] += delta[z1] · x = x^2      arg 2 of z1 = mul(...)
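To see these updates with actual numbers, the following sketch replays them by hand for x = 2 (an arbitrary test point chosen for this illustration) and compares the accumulated delta[x] with 3x^2.

```python
x = 2.0

# Forward pass through the list z1 = multiply(x, x); f = multiply(z1, x)
z1 = x * x
f = z1 * x

# Backward pass, written out one update at a time
delta = {"f": 1.0, "z1": 0.0, "x": 0.0}

# f = multiply(z1, x): d(a*b)/da = b and d(a*b)/db = a
delta["z1"] += delta["f"] * x    # arg 1 of f: contributes x
delta["x"] += delta["f"] * z1    # arg 2 of f: contributes x^2

# z1 = multiply(x, x): both arguments are x
delta["x"] += delta["z1"] * x    # arg 1 of z1: contributes x^2
delta["x"] += delta["z1"] * x    # arg 2 of z1: contributes x^2

print(delta["x"], 3 * x ** 2)
```

The three contributions to delta["x"] are each x^2, which is where the factor of 3 in 3x^2 comes from.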
1.5 Discussion
What’s going on here? Let’s simplify for a minute and assume that the list
is of the form
z1 = f1(z0)
z2 = f2(z1)
...
zm = fm(zm−1)
is hard to read.) Notice that there are m^2 of these h_{i,j} functions, which
is quite a few, but for machine learning applications we won’t care about most
of them: typically we just care about the partial derivative of the cost
function (the final variable in the list) with respect to the parameters, so we
only need h_{m,i} for certain i’s.
Let’s look at evaluating h_{m,0} at some point z0 = a (say z0 = 53.4). Again
to simplify, define
a1 = f1(a)
a2 = f2(f1(a))
...
am = fm(fm−1(fm−2(... f1(a) ...)))
When we write
\[ \frac{dz_m}{dz_0} = \frac{dz_m}{dz_{m-1}} \cdot \frac{dz_{m-1}}{dz_0} \]
we mean that, for all a,
\[ h_{m,0}(a) = f'_m(a_{m-1}) \cdot h_{m-1,0}(a) \]
When we execute the backprop code above, this is what we do: in particular
we group the operations as
\[ h_{m,0}(a) = \left( f'_m(a_{m-1}) \cdot f'_{m-1}(a_{m-2}) \cdot f'_{m-2}(a_{m-3}) \cdots f'_2(a_1) \right) \cdot f'_1(a) \]
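The grouping can be illustrated with a small chain of scalar functions. The sketch below assumes a hypothetical three-step chain (sin, then squaring, then exp) chosen only for this example. It multiplies the derivatives starting from the output end, with each derivative of f_i evaluated at the value a_{i-1} that was fed into f_i (taking a_0 = a), and compares the product with a central finite-difference estimate.

```python
import math

# Hypothetical chain z1 = f1(z0), z2 = f2(z1), z3 = f3(z2), as (function, derivative) pairs
chain = [
    (math.sin, math.cos),
    (lambda z: z * z, lambda z: 2 * z),
    (math.exp, math.exp),
]

def forward(a):
    """Return [a_0, a_1, ..., a_m] with a_0 = a and a_i = f_i(a_{i-1})."""
    values = [a]
    for f, _ in chain:
        values.append(f(values[-1]))
    return values

def derivative_reverse(a):
    """Accumulate f_m'(a_{m-1}) * ... * f_1'(a_0), starting from the output end."""
    values = forward(a)
    acc = 1.0
    for i in range(len(chain), 0, -1):
        _, df_i = chain[i - 1]
        acc *= df_i(values[i - 1])  # derivative of f_i at its own input a_{i-1}
    return acc

a = 0.7
eps = 1e-6
numeric = (forward(a + eps)[-1] - forward(a - eps)[-1]) / (2 * eps)
analytic = derivative_reverse(a)
print(analytic, numeric)
```

Because each factor is a scalar here, the order of multiplication does not change the result; the grouping matters for cost once the intermediate values are vectors or matrices.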
1.6 Constructing Wengert lists
Wengert lists are useful but tedious to program in. Usually they are
constructed using some sort of programming language extension. You will be
provided a package, xman.py, to construct Wengert lists from Python
expressions: xman is short for “expression manager”. Here is an example of using
xman:
In the definition of Triangle, the variables h, w, and area are called registers.
Note that after creating an instance of a subclass of xman.XMan, you need to
call setup(), which returns the newly-created instance. After the setup you
can call the operationSequence method to construct a Wengert list, which
will be encoded in Python as
Internally this works as follows. There are two types of Python objects,
called Registers and Operations. A Register corresponds to a variable, and
an Operation corresponds to a function call.
The base XManFunctions class defines a method input() which creates a
register object that is marked as an “input”, meaning that it has no definition.
(A similar method param() creates a register object that is marked as a
“parameter”, which like an input has no definition.) It also defines a few
functions that correspond to operators, like mul and add. These return a
class XManFunctions(object):
    @staticmethod
    def input(default=None):
        return Register(role='input', default=default)
    ...
    @staticmethod
    def mul(a, b):
        return XManFunctions.registerDefinedByOperator('mul', a, b)
    ...
    @staticmethod
    def registerDefinedByOperator(fun, *args):
        reg = Register(role='operationOutput')
        op = Operation(fun, *args)
        reg.definedAs = op
        op.outputReg = reg
        return reg
Figure 1: The Python data structures created by the Triangle class. Blue
objects are Registers, and green ones are Operations. Arrows are pointers.
register object that is cross-linked to an Operation object, as illustrated in
Figure 1. (The Register class also uses Python’s operator overloading
so that syntax like h*w is expanded to XManFunctions.mul(h,w).)
To produce the Wengert list, the setup() command uses Python
introspection methods to add names to each register, based on the Python
variable that points to it, and generates new variable names for any
reachable registers that cannot be named with Python variables. Finally, the
operationSequence method does a pre-order traversal of the data structure to
create a Wengert list.
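The general idea of cross-linked registers and operations can be shown with a toy version. The sketch below is not the xman.py implementation: the classes, the attribute names defined_as and output_reg, and the functions defined_by_operator and operation_sequence are all hypothetical, input names are given explicitly instead of being found by introspection, and the traversal simply emits each operand's definition before the operation that uses it.

```python
class Operation:
    """A function call: an operator name plus the registers it reads."""
    def __init__(self, fun, *args):
        self.fun = fun
        self.args = args
        self.output_reg = None

class Register:
    """A variable; defined_as is None for inputs and parameters."""
    def __init__(self, role, name=None):
        self.role = role
        self.name = name
        self.defined_as = None

    # Operator overloading: h * w builds a new register defined by a 'mul' operation
    def __mul__(self, other):
        return defined_by_operator("mul", self, other)

    def __add__(self, other):
        return defined_by_operator("add", self, other)

def defined_by_operator(fun, *args):
    """Create an output register and cross-link it with a new Operation."""
    reg = Register(role="operationOutput")
    op = Operation(fun, *args)
    reg.defined_as = op
    op.output_reg = reg
    return reg

def operation_sequence(root):
    """Walk the structure from root and return (lhs, operator, args) tuples."""
    steps, seen = [], set()

    def visit(reg):
        if reg.defined_as is None or id(reg) in seen:
            return  # inputs have no definition; shared registers are emitted once
        seen.add(id(reg))
        for arg in reg.defined_as.args:
            visit(arg)
        if reg.name is None:
            reg.name = "z%d" % (len(steps) + 1)  # generated name for an unnamed register
        steps.append((reg.name, reg.defined_as.fun,
                      tuple(a.name for a in reg.defined_as.args)))

    visit(root)
    return steps

h = Register("input", "h")
w = Register("input", "w")
half = Register("input", "half")
area = h * w * half   # (h * w) is an unnamed intermediate register
area.name = "area"
steps = operation_sequence(area)
print(steps)
```

The generated name z1 for the intermediate product h * w mirrors the idea that registers without a Python variable still need a name in the resulting list.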