
Automatic Reverse-Mode Differentiation:

Lecture Notes
William W. Cohen
October 17, 2016

1 Background: Automatic Differentiation


1.1 Why automatic differentiation is important
Most neural network packages (e.g., Torch or Theano) don't require a user to
actually derive the gradient updates for a neural model. Instead they allow
the user to define a model in a "little language", which supports common
neural-network operations like matrix multiplication, "softmax", etc., and
automatically derive the gradients. Typically, the user will define a loss
function L in this sublanguage: the loss function takes as inputs the training
data, current parameter values, and hyperparameters, and outputs a scalar
value. From the definition of L, the system will compute the partial derivative
of the loss function with respect to every parameter. Using these gradients,
it's straightforward to implement gradient-based learning methods.
Going from the definition of L to its partial derivatives is called automatic
differentiation. In this assignment you will start with a simple automatic
differentiation system written in Python, and use it to implement a neural-
network package.

1.2 Wengert lists


Automatic differentiation proceeds in two stages. First, function definitions
f (x1 , . . . , xk ) in the sublanguage are converted to a format called a Wengert
list. The second stage is to evaluate the function and its gradients using the
Wengert list.

A Wengert list defines a function f(x1, . . . , xk) of some inputs x1, . . . , xk.
Syntactically, it is just a list of assignment statements, where the right-hand-
side (RHS) of each statement is very simple: a call to a function g (where
g is one of a set of primitive functions G = {g1, . . . , gl} supported by the
sublanguage), and where the arguments to g are either inputs x1, . . . , xk, or the
left-hand-side (LHS) of a previous assignment. (The functions in G will be called
operators here.) The output of the function is the LHS of the last item in
the list. For example, a Wengert list for
f(x1, x2) ≡ (2x1 + x2)^2    (1)
with the functions G = {add, multiply, square} might be
z1 = add(x1 , x1 )
z2 = add(z1 , x2 )
f = square(z2 )
A Wengert list for
f(x) ≡ x^3
might be
z1 = multiply(x, x)
f = multiply(z1 , x)
The set of functions G defines the sublanguage. It’s convenient if they
have a small fixed number of arguments, and are differentiable with respect
to all their arguments. But there’s no reason that they have to be scalar
functions! For instance, if G contains the appropriate matrix and vector
operations, the loss for logistic regression (for an example x with label y and
weight matrix W) could be written as
f (x, y, W ) ≡ crossEntropy(softmax(x · W ), y) + frobeniusNorm(W )
and be compiled to the Wengert list
z1 = dot(x, W )
z2 = softmax(z1 )
z3 = crossEntropy(z2 , y)
z4 = frobeniusNorm(W )
f = add(z3 , z4 )

This implements k-class logistic regression if x is a d-dimensional row vector,
dot is matrix product, W is a d × k weight matrix, and
softmax(⟨a1, . . . , ad⟩) ≡ ⟨ e^{a1} / Σ_{i=1..d} e^{ai}, . . . , e^{ad} / Σ_{i=1..d} e^{ai} ⟩

crossEntropy(⟨a1, . . . , ad⟩, ⟨b1, . . . , bd⟩) ≡ − Σ_{i=1..d} ai log bi

frobeniusNorm(A) ≡ sqrt( Σ_{i=1..d} Σ_{j=1..k} a_{i,j}^2 )
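For concreteness, these operators could be implemented with numpy roughly as
follows; the use of numpy and the standalone function loss below are illustrative
assumptions, not part of the package described later in these notes.

import numpy as np

# illustrative numpy versions of the operators above (assumed, not from xman.py)
def softmax(a):
    e = np.exp(a - np.max(a))        # subtract the max for numerical stability
    return e / e.sum()

def crossEntropy(a, b):
    return -np.sum(a * np.log(b))    # exactly the formula above

def frobeniusNorm(A):
    return np.sqrt(np.sum(A * A))

def loss(x, y, W):
    # the uncompiled loss f(x, y, W), for comparison with the Wengert list
    return crossEntropy(softmax(np.dot(x, W)), y) + frobeniusNorm(W)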

1.3 Backpropagation through a Wengert list


We’ll first discuss how to use a Wengert list, and then below, discuss how to
construct one.
Given a Wengert list for f , it’s obvious how to evaluate f : just step
through the assignment statements in order, computing each value as needed.
To be perfectly clear about this, the procedure is as follows. We will encode
each assignment in Python as a nested tuple

(z, g, (y1 , . . . , yk ))

where z is a string that names the LHS variable, g is a string that names the
operator, and the yi ’s are strings that name the arguments to g. So the list
for the function of Equation 1 would be encoded in Python as

[ ("z1", "add", ("x1","x1")),
  ("z2", "add", ("z1","x2")),
  ("f", "square", ("z2",)) ]

(Note the trailing comma in ("z2",), which is needed to make a one-element
Python tuple rather than a plain string.)
We also store functions for each operator in a Python dictionary G:
G = { "add" : lambda a,b: a+b,
"square": lambda a:a*a }
Before we evaluate the function, we will store the parameters in a dictionary
val: e.g., to evaluate f at x1 = 3, x2 = 7 we will initialize val to
val = { "x1" : 3, "x2" : 7 }
The pseudo-code to evaluate f is:

def eval(f):
    initialize val to the inputs at which f should be evaluated
    for (z, g, (y1, ..., yk)) in the list:
        op = G[g]
        val[z] = op(val[y1], ..., val[yk])
    return the last entry stored in val

Some Python hints: (1) to convert (y1, ..., yk) to (val[y1], ..., val[yk])
you might use Python's map function. (2) If args is a length-2 Python list
and g is a function that takes two arguments (like G["add"] above) then
g(*args) will call g with the elements of that list as the two arguments to
g.
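Putting the pseudo-code and these hints together, a minimal runnable sketch of
eval might look like the following (repeating the list and G from above for
completeness). The name eval_list, chosen to avoid shadowing Python's built-in
eval, and the choice to return the filled-in val dictionary along with the
output value are assumptions of this sketch, not part of the assignment's code.

G = { "add"    : lambda a, b: a + b,
      "square" : lambda a: a * a }

wengert_list = [ ("z1", "add",    ("x1", "x1")),
                 ("z2", "add",    ("z1", "x2")),
                 ("f",  "square", ("z2",)) ]

def eval_list(wlist, inputs):
    # inputs maps input names to values, e.g. {"x1": 3, "x2": 7}
    val = dict(inputs)                    # copy so the caller's dict is untouched
    for (z, g, args) in wlist:
        op = G[g]
        val[z] = op(*map(lambda y: val[y], args))   # look up each argument, then apply g
    return val[wlist[-1][0]], val         # value of the last LHS, plus all intermediate values

fval, val = eval_list(wengert_list, {"x1": 3, "x2": 7})   # fval is (2*3 + 7)**2 = 169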
To differentiate, we will use a generalization of backpropagation (backprop).
We'll assume that eval has already been run and val has been populated, and we
will compute, in reverse order, a value delta(zi) for each variable zi that
appears in the list. We initialize delta by setting delta(f) = 1 (where f is
the string that names the function output).

Informally you can think of delta(zi) as the "sensitivity" of f to the
variable zi, at the point where we're evaluating f (i.e., the point a that
corresponds to the initial dictionary entries we stored in val). Here zi can
be an intermediate variable or an input. If it's an input x that we're treating
as a parameter, delta is the gradient of the cost function, evaluated at a:
i.e., delta(x) = df/dx(a).
To compute these sensitivities we need to “backpropagate” through the
list: when we encounter the assignment (z, g, (y1 , . . . , yk )) we will use delta(z)
and the derivatives of g with respect to its inputs to compute the sensitivities
of the y’s. We will store derivatives for each operator in a Python dictionary
DG.
Note that if g has k inputs, then we need k partial derivatives, one for
each input, so the entries in DG are lists of functions. For the functions used
in this example, we’ll need these entries in DG.
DG = { "add" : [ (lambda a,b: 1), (lambda a,b: 1) ],
"square": [ lambda a:2*a ] }
To figure out these functions we used some high-school calculus rules:
d(x + y)/dx = 1, d(x + y)/dy = 1, and d(x^2)/dx = 2x.
Finally, the pseudo-code to compute the deltas is below. Note that we
don’t just store values in delta: we accumulate them additively.

def backprop(f, val):
    initialize delta: delta[f] = 1
    for (z, g, (y1, ..., yk)) in the list, in reverse order:
        for i = 1, ..., k:
            op_i = DG[g][i]
            if delta[yi] is not defined, set delta[yi] = 0
            delta[yi] = delta[yi] + delta[z] * op_i(val[y1], ..., val[yk])
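A direct Python translation of this pseudo-code, reusing the G and DG
dictionaries and the val dictionary produced by the eval_list sketch above,
might look like this; the name backprop_list and the 0-based argument indexing
(instead of the 1-based i above) are assumptions of the sketch.

def backprop_list(wlist, val, out_name):
    # returns delta, mapping each variable name to d(out)/d(variable) at val
    delta = { out_name: 1.0 }
    for (z, g, args) in reversed(wlist):
        arg_vals = [val[y] for y in args]
        for i, y in enumerate(args):
            op_i = DG[g][i]                       # derivative of g w.r.t. its i-th argument
            delta[y] = delta.get(y, 0.0) + delta[z] * op_i(*arg_vals)
    return delta

# continuing the example at x1 = 3, x2 = 7:
# delta["x1"] comes out as 8*3 + 4*7 = 52 and delta["x2"] as 4*3 + 2*7 = 26
delta = backprop_list(wengert_list, val, "f")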

1.4 Examples
Let's look at Equation 1. In freshman calculus you'd probably just do this:

f(x1, x2) = (2x1 + x2)^2 = 4x1^2 + 4x1x2 + x2^2

df/dx1 = 8x1 + 4x2
df/dx2 = 4x1 + 2x2

Here we'll instead use the Wengert list, in reverse order, and the chain rule. The list is

z1 = add(x1 , x1 )
z2 = add(z1 , x2 )
f = square(z2 )
Table 1 contains a detailed derivation of df/dx1, where in each step we either
plug in the definition of a variable in the list, or use the derivative of one of
the operators (square or add). Table 2 contains an analogous derivation of
df/dx2. Notice that these derivations are nearly identical. In fact, they are very
analogous to the computations carried out by the backprop algorithm: can
you see how?

Finally, Table 3 shows a slightly less detailed derivation for the second
sample function, f = x^3. It is instructive to step through the backprop
algorithm for these functions as well: for example, the list z1 = x · x; f = z1 · x
leads to the delta updates shown below the tables.

Derivation Step                                                   Reason
df/dx1 = d(z2^2)/dz2 · dz2/dx1                                    f = z2^2
df/dx1 = 2z2 · dz2/dx1                                            d(a^2)/da = 2a
df/dx1 = 2z2 · d(z1 + x2)/dx1                                     z2 = z1 + x2
df/dx1 = 2z2 · (1 · dz1/dx1 + 1 · dx2/dx1)                        d(a+b)/da = d(a+b)/db = 1
df/dx1 = 2z2 · (1 · d(x1 + x1)/dx1 + 1 · dx2/dx1)                 z1 = x1 + x1
df/dx1 = 2z2 · (1 · (1 · dx1/dx1 + 1 · dx1/dx1) + 1 · dx2/dx1)    d(a+b)/da = d(a+b)/db = 1
df/dx1 = 2z2 · (1 · (1 · 1 + 1 · 1) + 1 · 0)                      da/da = 1 and da/db = 0 for inputs a, b
df/dx1 = 2z2 · 2 = 8x1 + 4x2                                      simplify

Table 1: A detailed derivation of df/dx1 for f = z2^2; z2 = z1 + x2; z1 = x1 + x1

Derivation Step                                                   Reason
df/dx2 = d(z2^2)/dz2 · dz2/dx2                                    f = z2^2
df/dx2 = 2z2 · dz2/dx2                                            d(a^2)/da = 2a
df/dx2 = 2z2 · d(z1 + x2)/dx2                                     z2 = z1 + x2
df/dx2 = 2z2 · (1 · dz1/dx2 + 1 · dx2/dx2)                        d(a+b)/da = d(a+b)/db = 1
df/dx2 = 2z2 · (1 · d(x1 + x1)/dx2 + 1 · dx2/dx2)                 z1 = x1 + x1
df/dx2 = 2z2 · (1 · (1 · dx1/dx2 + 1 · dx1/dx2) + 1 · dx2/dx2)    d(a+b)/da = d(a+b)/db = 1
df/dx2 = 2z2 · (1 · (1 · 0 + 1 · 0) + 1 · 1)                      da/da = 1 and da/db = 0 for inputs a, b
df/dx2 = 2z2 · 1 = 4x1 + 2x2                                      simplify

Table 2: A detailed derivation of df/dx2 for f = z2^2; z2 = z1 + x2; z1 = x1 + x1

Derivation Step                                                   Reason
df/dx = dz1/dx · x + z1 · dx/dx                                   f = z1 · x and d(ab)/dx = (da/dx) · b + a · (db/dx)
df/dx = (dx/dx · x + x · dx/dx) · x + z1 · dx/dx                  z1 = x · x and d(ab)/dx = (da/dx) · b + a · (db/dx)
df/dx = (x + x) · x + x^2 = 3x^2                                  z1 = x · x and simplify

Table 3: A derivation of df/dx for f = z1 · x; z1 = x · x

delta[f]  = 1
delta[z1] += delta[f] · x = x           (arg 1 of f = mul(z1, x))
delta[x]  += delta[f] · z1 = x^2        (arg 2 of f = mul(z1, x))
delta[x]  += delta[z1] · x = x^2        (arg 1 of z1 = mul(x, x))
delta[x]  += delta[z1] · x = x^2        (arg 2 of z1 = mul(x, x))

Summing the three updates to delta[x] gives 3x^2, which is df/dx as expected.
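As a check, the hypothetical eval_list/backprop_list sketches from earlier
reproduce these updates numerically once a multiply operator and its two
partial derivatives are added; the operator name and the choice of x = 2.0
below are just for illustration.

G["multiply"]  = lambda a, b: a * b
DG["multiply"] = [ (lambda a, b: b), (lambda a, b: a) ]   # d(ab)/da = b, d(ab)/db = a

cube_list = [ ("z1", "multiply", ("x", "x")),
              ("f",  "multiply", ("z1", "x")) ]

fval, val = eval_list(cube_list, {"x": 2.0})          # fval = 8.0
delta = backprop_list(cube_list, val, "f")
# delta["x"] accumulates x**2 three times, so it equals 3 * 2.0**2 = 12.0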

1.5 Discussion
What’s going on here? Let’s simplify for a minute and assume that the list
is of the form

z1 = f1 (z0 )
z2 = f2 (z1 )
...
zm = fm (zm−1 )

so f = zm = fm(fm−1(. . . f1(z0) . . .)). We'll assume we can compute the fi
functions and their derivatives fi'. We know that one way to find dzm/dz0 would
be to repeatedly use the chain rule:

dzm/dz0 = (dzm/dzm−1) · (dzm−1/dz0)
        = (dzm/dzm−1) · (dzm−1/dzm−2) · (dzm−2/dz0)
        . . .
        = (dzm/dzm−1) · (dzm−1/dzm−2) · . . . · (dz1/dz0)
Let’s take some time to unpack what this means. When we do derivations
by hand, we are working symbolically: we are constructing a symbolic rep-
resentation of the derivative function. This is an interesting problem—it’s
called symbolic differentiation—but it’s not the same task as automatic dif-
ferentiation. In automatic differentiation, we want instead an algorithm for
evaluating the derivative function.
To simplify notation, let hi,j be the function dzi/dzj. (I'm doing this so that I
can use hi,j(a) to denote the result of evaluating the function dzi/dzj at a:
the notation dzi/dzj(a) is hard to read.) Notice that there are m^2 of these
hi,j functions (quite a few!) but for machine learning applications we won't
care about most of them: typically we just care about the partial derivative of
the cost function (the final variable in the list) with respect to the
parameters, so we only need hm,i for certain i's.
Let's look at evaluating hm,0 at some point z0 = a (say z0 = 53.4). Again
to simplify, define

a1 = f1 (a)
a2 = f2 (f1 (a))
...
am = fm (fm−1 (fm−2 (. . . f1 (a) . . .)))

When we write

dzm/dz0 = (dzm/dzm−1) · (dzm−1/dz0)

we mean that, for all a,

hm,0(a) = fm'(am−1) · hm−1,0(a)

That's a useful step because we have assumed we have available a routine
to evaluate fm'(am−1): in the code this would be the function DG[fm][1].
Continuing, when we write

dzm/dz0 = (dzm/dzm−1) · (dzm−1/dzm−2) · . . . · (dz1/dz0)

it means that

hm,0(a) = fm'(am−1) · fm−1'(am−2) · . . . · f2'(a1) · f1'(a)

When we execute the backprop code above, this is what we do: in particular
we group the operations as

hm,0(a) = (fm'(am−1) · fm−1'(am−2) · . . . · f2'(a1)) · f1'(a)

and the delta's are the partial products: specifically

delta[zi] = fm'(am−1) · . . . · fi+1'(ai)
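For example, for the pure chain z1 = square(z0); z2 = square(z1) with f = z2,
evaluated at z0 = a, we have a1 = a^2 and a2 = a^4. Backprop sets
delta[z2] = 1, then delta[z1] = 1 · f2'(a1) = 2a^2, and finally
delta[z0] = 2a^2 · f1'(a) = 2a^2 · 2a = 4a^3, which is indeed d(a^4)/da.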

1.6 Constructing Wengert lists
Wengert lists are useful but tedious to program in. Usually they are con-
structed using some sort of programming-language extension. You will be pro-
vided a package, xman.py, to construct Wengert lists from Python expres-
sions: xman is short for "expression manager". Here is an example of using
xman:

from xman import *
...
class f(XManFunctions):
    @staticmethod
    def half(a):
        ...
class Triangle(XMan):
    h = f.input()
    w = f.input()
    area = f.half(h*w)
    ...
xm = Triangle().setup()
print xm.operationSequence(xm.area)

In the definition of Triangle, the variables h, w, and area are called registers.
Note that after creating an instance of a subclass of xman.XMan, you need to
call setup(), which returns the newly-created instance. After the setup you
can call the operationSequence method to construct a Wengert list, which
will be encoded in Python as

[('z1', 'mul', ['h', 'w']),
 ('area', 'half', ['z1'])]
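This list can be fed straight into the eval_list/backprop_list sketches from
earlier, assuming 'mul' and 'half' entries in G and DG; the lambdas below
(with half(a) = a/2, as the register name area suggests) are guesses, not the
actual xman operators.

G.update( { "mul": lambda a, b: a * b, "half": lambda a: 0.5 * a } )
DG.update({ "mul": [ (lambda a, b: b), (lambda a, b: a) ],
            "half": [ lambda a: 0.5 ] })

triangle_list = [ ("z1", "mul", ["h", "w"]),
                  ("area", "half", ["z1"]) ]

areaval, val = eval_list(triangle_list, {"h": 3.0, "w": 4.0})   # areaval = 6.0
delta = backprop_list(triangle_list, val, "area")
# d(area)/dh = w/2 = 2.0 and d(area)/dw = h/2 = 1.5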

Internally this works as follows. There are two types of Python objects,
called Registers and Operations. A Register corresponds to a variable, and
an Operation corresponds to a function call.
The base XManFunctions class defines a method input() which creates a
register object that is marked as an “input”, meaning that it has no definition.
(A similar method param() creates a register object that is marked as a
“parameter”, which like an input has no definition.) It also defines a few
functions that correspond to operators, like mul and add:

class XManFunctions(object):
    @staticmethod
    def input(default=None):
        return Register(role='input',default=default)
    ...
    @staticmethod
    def mul(a,b):
        return XManFunctions.registerDefinedByOperator('mul',a,b)
    ...
    @staticmethod
    def registerDefinedByOperator(fun,*args):
        reg = Register(role='operationOutput')
        op = Operation(fun,*args)
        reg.definedAs = op
        op.outputReg = reg
        return reg

Figure 1: The Python data structures created by the Triangle class. Blue
objects are Registers, and green ones are Operations. Arrows are pointers.

Each of these operator functions returns a Register object that is cross-linked
to an Operation object, as illustrated in Figure 1. (The Register class also
uses Python's operator overloading so that syntax like h*w is expanded to
XManFunctions.mul(h,w).)
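The overloading itself presumably amounts to defining methods like __mul__ on
the Register class; the fragment below is a guess at the relevant part of that
class, not the actual xman.py source.

class Register(object):
    def __init__(self, role, default=None):
        self.role = role
        self.default = default
        self.definedAs = None        # filled in by registerDefinedByOperator

    def __mul__(self, other):
        # h * w builds an Operation via XManFunctions.mul instead of multiplying numbers
        return XManFunctions.mul(self, other)

    # other overloaded operators (e.g. __add__) would be defined the same way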
To produce the Wengert list, the setup() command uses Python in-
trospection methods to add names to each register, based on the Python
variable that points to it, and generates new variable names for any reach-
able registers that cannot be named with Python variables. Finally, the
operationSequence method does a pre-order traversal of the data structure to
create a Wengert list.
