A Step-by-Step Introduction to the Implementation of Automatic Differentiation
Yu-Hsueh Fang∗1, He-Zhe Lin∗1, Jie-Jyun Liu1, and Chih-Jen Lin1,2
1 National Taiwan University
{d11725001, r11922027, d11922012}@ntu.edu.tw, [email protected]
2 Mohamed bin Zayed University of Artificial Intelligence
[email protected]
Abstract
Automatic differentiation is a key component in deep learning. This topic is well studied
and excellent surveys such as Baydin et al. (2018) have been available to clearly describe the
basic concepts. Further, sophisticated implementations of automatic differentiation are now
an important part of popular deep learning frameworks. However, it is difficult, if not impos-
sible, to directly teach students the implementation of existing systems due to their complexity.
On the other hand, if the teaching stops at the basic concepts, students fail to see how the
concepts are realized in an actual implementation. For example, we often mention the computational graph in
teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we partially fill the gap by giving a step-by-step introduction to implementing
a simple automatic differentiation system. We streamline the mathematical concepts and the
implementation. Further, we give the motivation behind each implementation detail, so the
whole setting becomes very natural.
1 Introduction
In modern machine learning, derivatives are the cornerstone of numerous applications and stud-
ies. The calculation often relies on automatic differentiation, a classic method for efficiently and
accurately calculating derivatives of numeric functions. For example, deep learning cannot suc-
ceed without automatic differentiation. Therefore, teaching students how automatic differentiation
works is essential.
Automatic differentiation is a well-developed area with rich literature. Excellent surveys in-
cluding Chinchalkar (1994), Bartholomew-Biggs et al. (2000), Baydin et al. (2018) and Margossian
*These authors contributed equally to this work.
(2019) review the algorithms for automatic differentiation and its wide applications. In particular,
Baydin et al. (2018) is a comprehensive work focusing on automatic differentiation in machine
learning. Therefore, there is no lack of materials introducing the concept of automatic differentia-
tion.
On the other hand, as deep learning systems now solve large-scale problems, it is inevitable
that the implementation of automatic differentiation becomes highly sophisticated. For example, in
popular deep learning systems such as PyTorch (Paszke et al., 2017) and Tensorflow (Abadi et al.,
2016), at least thousands of lines of code are needed. Because of this, the teaching of automatic
differentiation for deep learning often stops at the basic concepts. Then students fail to see how
the concepts are realized in an implementation. For example, we often mention the computational graph
in teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we aim to partially fill the gap by giving a tutorial on the basic implementation.
In recent years, many works1,2,3,4,5 have attempted to discuss the basic implementation of au-
tomatic differentiation. However, they still leave room for improvement. For example, some are
not self-contained – they quickly talk about implementations without connecting them to the basic concepts.
Ours, which is very suitable for beginners, has the following features:
• We streamline the mathematical concepts and the implementation. Further, we give the
motivation behind each implementation detail, so the whole setting becomes very natural.
• We use the example from Baydin et al. (2018) for the consistency with past works. Ours is
thus an extension of Baydin et al. (2018) into the implementation details.
• We build a complete tutorial including this document, slides and the source code at
https://www.csie.ntu.edu.tw/~cjlin/papers/autodiff/.
2 Automatic Differentiation
There are two major modes of automatic differentiation. In this section, we introduce the basic
concepts of both modes. Most materials in this section are from Baydin et al. (2018). We consider
the same example function
y = f(x1, x2) = log x1 + x1 x2 − sin x2.
2.1 Forward Mode
First, we discuss the forward mode. Before calculating the derivative, let us check how to calculate
the function value. Assume that we want to calculate the function value at (x1 , x2 ) = (2, 5). Then,
in the following table, we have a forward procedure.
x1 =2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
We use variables vi to record the intermediate outcomes. First, we know the log function is applied to
x1. Therefore, we have log(x1) as a new variable called v1. Similarly, there is a variable v2, which
is x1 × x2. Each vi is related to a simple operation. In the end, our function value at (2, 5) is y = v5. As shown in the table, the
function evaluation is decomposed into a sequence of simple operations. We have a corresponding
computational graph as follows:
[Figure: the computational graph. Edges: x1 → v1, x1 → v2, x2 → v2, x2 → v3, v1 → v4, v2 → v4, v3 → v5, v4 → v5, and y = f(x1, x2) = v5.]
Because calculating both v1 and v2 needs x1 , x1 has two links to them in the graph. The following
graph shows all the intermediate results in the computation.
[Figure: the same computational graph annotated with values: x1 = 2, x2 = 5, v1 = log x1 = log 2, v2 = x1 × x2 = 2 × 5, v3 = sin x2 = sin 5, v4 = v1 + v2 = 0.693 + 10, v5 = v4 − v3 = 10.693 − (−0.959) = f(x1, x2).]
The computational graph tells us the dependencies of variables. Thus, from the inputs x1 and
x2 we can go through all nodes to get the function value y = v5 in the end.
Now, we have learned about the function evaluation. But remember, we would like to calculate
the derivative. Assume that we target the partial derivative ∂y/∂x1. Here, we denote
v̇ = ∂v/∂x1
as the derivative of the variable v with respect to x1. The idea is that by using the chain rule, we
can obtain the following forward derivative calculation to eventually get v̇5 = ∂f/∂x1.
ẋ1 = 1
ẋ2 = 0
v̇1 = ẋ1/x1 = 1/2
v̇2 = ẋ1 × x2 + ẋ2 × x1 = 1 × 5 + 0 × 2
v̇3 = ẋ2 × cos x2 = 0 × cos 5
v̇4 = v̇1 + v̇2 = 0.5 + 5
v̇5 = v̇4 − v̇3 = 5.5 − 0
ẏ = v̇5 = 5.5
The table starts from ẋ1 and ẋ2, which are ∂x1/∂x1 = 1 and ∂x2/∂x1 = 0. Based on ẋ1 and ẋ2, we
can calculate other values. For example, let us check the partial derivative ∂v1/∂x1. From
v1 = log x1,
we have v̇1 = ẋ1/x1 = 1/2 = 0.5.
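To connect the table with code, the following short Python sketch (our own illustration, not the implementation developed later in this document) reproduces the forward derivative trace by carrying each value together with its derivative with respect to x1.

import math

# Each quantity carries (value, derivative w.r.t. x1), evaluated at (x1, x2) = (2, 5).
x1, x1_dot = 2.0, 1.0                               # dx1/dx1 = 1
x2, x2_dot = 5.0, 0.0                               # dx2/dx1 = 0
v1, v1_dot = math.log(x1), x1_dot / x1              # d(log u)/dx1 = u_dot / u
v2, v2_dot = x1 * x2, x1_dot * x2 + x2_dot * x1     # product rule
v3, v3_dot = math.sin(x2), x2_dot * math.cos(x2)
v4, v4_dot = v1 + v2, v1_dot + v2_dot
v5, v5_dot = v4 - v3, v4_dot - v3_dot
print(v5, v5_dot)                                   # about 11.652 and 5.5 = dy/dx1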
2.2 Reverse Mode
Next, we discuss the reverse mode. We denote
v̄ = ∂y/∂v
as the derivative of the function y with respect to the variable v. Note that earlier, in the forward
mode, we considered
v̇ = ∂v/∂x1,
so the focus is on the derivatives of all variables with respect to one input variable. In contrast, the
reverse mode focuses on v̄ = ∂y/∂v for all v, the partial derivatives of one output with respect to
all variables. Therefore, for our example, we can use v̄i ’s and x̄i ’s to get both ∂y/∂x1 and ∂y/∂x2
at once. Now we illustrate the calculation of
∂y/∂x2.
By checking the variable x2 in the computational graph, we see that variable x2 affects y by
affecting v2 and v3 .
[Figure: the computational graph, highlighting that x2 is a parent of both v2 = x1 × x2 and v3 = sin x2.]
This dependency, together with the fact that x1 is fixed, means that calculating ∂y/∂x2
amounts to calculating
∂y/∂x2 = (∂y/∂v2)(∂v2/∂x2) + (∂y/∂v3)(∂v3/∂x2). (1)
We can rewrite Equation 1 as follows with our notation.
x̄2 = v̄2 (∂v2/∂x2) + v̄3 (∂v3/∂x2). (2)
If v̄2 and v̄3 are available beforehand, all we need is to calculate ∂v2 /∂x2 and ∂v3 /∂x2 . From the
operation between x2 and v3 , we know that ∂v3 /∂x2 = cos(x2 ). Similarly, we have ∂v2 /∂x2 = x1 .
Then, the evaluation of x̄2 is done in two steps:
x̄2 ← v̄3 (∂v3/∂x2)
x̄2 ← x̄2 + v̄2 (∂v2/∂x2).
These steps are part of the sequence of a reverse traversal, shown in the following table.
x̄1 = 5.5
x̄2 = 1.716
x̄1 = x̄1 + v̄1 (∂v1/∂x1) = x̄1 + v̄1/x1 = 5.5
x̄2 = x̄2 + v̄2 (∂v2/∂x2) = x̄2 + v̄2 × x1 = 1.716
x̄1 = v̄2 (∂v2/∂x1) = v̄2 × x2 = 5
x̄2 = v̄3 (∂v3/∂x2) = v̄3 × cos x2 = −0.284
v̄2 = v̄4 (∂v4/∂v2) = v̄4 × 1 = 1
v̄1 = v̄4 (∂v4/∂v1) = v̄4 × 1 = 1
v̄3 = v̄5 (∂v5/∂v3) = v̄5 × (−1) = −1
v̄4 = v̄5 (∂v5/∂v4) = v̄5 × 1 = 1
v̄5 = ȳ = 1
To get the desired x̄1 and x̄2 (i.e., ∂y/∂x1 and ∂y/∂x2 ), we begin with
v̄5 = ∂y/∂v5 = ∂y/∂y = 1.
From the computational graph, we then get v̄4 and v̄3 . Because v4 affects y only through v5 , we
have
v̄4 = ∂y/∂v4 = (∂y/∂v5)(∂v5/∂v4) = v̄5 (∂v5/∂v4) = v̄5 × 1.
The above equation is based on the fact that we already know ∂y/∂v5 = v̄5. Also, the operation from v4 to
v5 is an addition, so ∂v5 /∂v4 is a constant 1. By such a sequence, in the end, we obtain
∂y/∂x1 = x̄1 and ∂y/∂x2 = x̄2
at the same time.
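To make the reverse sweep concrete, the following short Python sketch (again our own illustration; the implementation in the remaining sections focuses on the forward mode) reproduces the table above.

import math

# Forward pass: record intermediate values at (x1, x2) = (2, 5).
x1, x2 = 2.0, 5.0
v1 = math.log(x1)
v2 = x1 * x2
v3 = math.sin(x2)
v4 = v1 + v2
v5 = v4 - v3                                    # y = v5, about 11.652

# Reverse sweep: propagate vbar = dy/dv from the output back to the inputs.
v5_bar = 1.0                                    # dy/dv5
v4_bar = v5_bar * 1.0                           # v5 = v4 - v3
v3_bar = v5_bar * (-1.0)
v1_bar = v4_bar * 1.0                           # v4 = v1 + v2
v2_bar = v4_bar * 1.0
x2_bar = v3_bar * math.cos(x2) + v2_bar * x1    # x2 affects v3 and v2
x1_bar = v1_bar / x1 + v2_bar * x2              # x1 affects v1 and v2
print(x1_bar, x2_bar)                           # about 5.5 and 1.716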
We now move to the implementation. Consider a general function
y = f(x) = f(x1, x2, . . . , xn).
For any given x, we show the computation of
∂y/∂x1
as an example.
Our example function can be written as the composition f(x1, x2) = g(h1(x1, x2), h2(x1, x2)) with
g(h1, h2) = h1 − h2,
h1(x1, x2) = log x1 + x1 x2,
h2(x1, x2) = sin(x2).
Recall the function evaluation of our example at (x1, x2) = (2, 5):
x1 = 2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
Also, we have a computational graph to generate the computing order
[Figure: the computational graph annotated with values, as in Section 2.1: x1 = 2, x2 = 5, v1 = log x1 = log 2, v2 = x1 × x2 = 2 × 5, v3 = sin x2 = sin 5, v4 = v1 + v2 = 0.693 + 10, v5 = v4 − v3 = 10.693 − (−0.959) = f(x1, x2).]
In this graph, the nodes correspond to the expressions
v1 = log x1 ,
v2 = x1 × x2 ,
v3 = sin x2 ,
v4 = v1 + v2,
v5 = v4 − v3 .
The expression in each node is obtained by applying an operation to expressions in other nodes. Therefore, it is
natural to construct an edge
u → v,
if the expression of a node v is based on the expression of another node u. We say node u is a parent
node (of v) and node v is a child node (of u). To do the forward calculation, at node v we should
store v’s parents. Additionally, we need to record the operator applied on the node’s parents and
the resulting value. For example, the construction of the node
v2 = x1 × x2 ,
requires storing v2's parent nodes {x1, x2}, the corresponding operator "×" and the resulting value.
Up to now, we can implement each node as a class Node with the following members.
member data type example for Node v2
numerical value float 10
parent nodes List[Node] [x1 , x2 ]
child nodes List[Node] [v4 ]
operator string "mul" (for ×)
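The following is a minimal sketch of such a Node class. It is our illustration, assuming a constructor signature Node(value, parent_nodes, operator) that matches the calls used later (e.g., in Listing 3).

class Node:
    def __init__(self, value, parent_nodes=None, operator=None):
        self.value = value                       # numerical value
        self.parent_nodes = parent_nodes or []   # parent nodes in the graph
        self.child_nodes = []                    # filled in later by child nodes
        self.operator = operator                 # e.g., "mul"

For example, the two inputs of our function can then be created by x1 = Node(2) and x2 = Node(5).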
At this moment, it is unclear why we should store child nodes in our Node class. Later we will
explain why such information is needed. Once the Node class is ready, starting from initial nodes
(which represent xi ’s), we use nested function calls to build the whole graph. In our case, the graph
for y = f(x1, x2) can be constructed via
y = sub(add(log(x1), mul(x1, x2)), sin(x2)).
Let us see this process step by step and check what each function must do. First, our starting point
is the root nodes created by the Node class constructor.
[Figure: two isolated root nodes x1 and x2.]
These root Nodes have empty members "parent nodes," "child nodes," and "operator," with only
"numerical value" set to x1 and x2, respectively. Then, we apply our implemented log(node) to
the node x1.
[Figure: applying log to x1 creates the node v1 = log x1.]
The implementation of our log function should create a Node instance to store log(x1 ). Therefore,
what we have is a wrapping function that does more than the log operation; see details in Section
3.3. The created node is the v1 node in our computational graph. Next, we discuss details of the
node creation. From the current log function and the input node x1 , we know contents of the
following members:
• parent nodes: [x1 ]
• operator: "log"
• numerical value: log 2
However, we have no information about the children of this node. The reason is obvious: we do not
yet have a graph that includes its child nodes. Instead, we leave this member "child nodes"
empty and let child nodes write back the information. By this idea, our log function should add
v1 to the "child nodes" of x1. See more details later in Section 3.3.
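For concreteness, a minimal sketch of such a wrapping log function, under the Node class sketched above, may look as follows (a sketch only; names other than log and Node are our assumptions).

import math

def log(node):
    # Create a new Node storing log(node.value), record the parent and the
    # operator, and write the new node back as a child of its parent.
    newNode = Node(math.log(node.value), [node], "log")
    node.child_nodes.append(newNode)
    return newNode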
We move on to apply mul(node1, node2) on nodes x1 and x2.
[Figure: applying mul to x1 and x2 creates the node v2 = x1 × x2 with two parent nodes.]
Similarly, the mul function generates a Node instance. However, different from log(x1), the node
created here stores two parents (instead of one). Then we apply the function call add(log(x1),
mul(x1, x2)).
[Figure: after add(log(x1), mul(x1, x2)), the node v4 = log x1 + x1 × x2 has parents v1 and v2; similarly, applying sin to x2 creates the node v3 = sin x2.]
Last, applying sub(node1, node2) to the output nodes of add(log(x1), mul(x1, x2)) and sin(x2)
leads to
[Figure: the complete graph; the node v5 = log x1 + x1 × x2 − sin x2 has parents v4 and v3, and y = v5.]
We can conclude that each function generates exactly one Node instance; however, the generated
nodes differ in the operator, the number of parents, etc.
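Following the same pattern, the remaining wrapping functions can be sketched as below. This is only an illustration under the earlier assumptions; the document's Listing 1 (referenced later) and Listing 3 give the reference version of mul.

import math

def mul(node1, node2):
    # Binary wrapper: the new node stores two parent nodes.
    newNode = Node(node1.value * node2.value, [node1, node2], "mul")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def add(node1, node2):
    newNode = Node(node1.value + node2.value, [node1, node2], "add")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def sub(node1, node2):
    newNode = Node(node1.value - node2.value, [node1, node2], "sub")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def sin(node):
    # Unary wrapper, analogous to log.
    newNode = Node(math.sin(node.value), [node], "sin")
    node.child_nodes.append(newNode)
    return newNode

With these functions, the call y = sub(add(log(x1), mul(x1, x2)), sin(x2)) builds the whole graph, and y.value is about 11.652.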
4 Topological Order and Partial Derivatives
Once the computational graph is built, we want to use the information in the graph to compute
∂y/∂x1 = ∂v5/∂x1.
Let G ≡ ⟨V, E⟩ denote the computational graph with node set V and edge set E. Because only the
nodes reachable from x1 depend on x1, we define
VR = {v ∈ V | v is reachable from x1}
and
ER = {(u, v) ∈ E | u ∈ VR , v ∈ VR }.
Then,
GR ≡ ⟨VR , ER ⟩
is a subgraph of G. For our example, GR is the following subgraph with
VR = {x1 , v1 , v2 , v4 , v5 }
and
ER = {(x1, v1), (x1, v2), (v1, v4), (v2, v4), (v4, v5)}.
[Figure: the subgraph GR with edges x1 → v1, x1 → v2, v1 → v4, v2 → v4, v4 → v5.]
We aim to find a “suitable” ordering of VR satisfying that each node u ∈ VR comes before all of its
child nodes in the ordering. By doing so, u̇ can be used in the derivative calculation of its child
nodes; see (6). For our example, a “suitable” ordering can be
x1, v1, v2, v4, v5.
Such an ordering can be obtained by a depth-first search (DFS) from x1, where a node is added to a
list only after all of its child nodes have been processed.6 For our example, the DFS first follows the path
x1 → v1 → v4 → v5, (8)
and we obtain the list
[v5, v4, v1, v2, x1].
6 We do not get into details, but a proof can be found in Kleinberg and Tardos (2005).
Then, by reversing the list, a node always comes before its children. One may wonder whether we
can add a node to the list before adding its child nodes. This way, we have a simpler implementation
without needing to reverse the list in the end. However, this setting may fail to generate a topological
ordering. We obtain the following list for our example:
[x1 , v1 , v4 , v5 , v2 ].
A violation occurs because v2 does not appear before its child v4 . The key reason is that in a DFS
path, a node may point to another node that was added earlier through a different path. Then this
node appears after one of its children. For our example,
x1 → v2 → v4 → v5
is a path processed after the path in (8). Thus, v2 is added after v4 and a violation occurs. Reversing
the list can resolve the problem. To this end, in the function add_children, we must append the
input node at the end.
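A minimal sketch of this procedure is given below. It is our illustration; only the names topological_order (used in Listing 4) and add_children come from the document, and the visited set is a detail we add to avoid processing a node twice.

def topological_order(rootNode):
    # Depth-first search from rootNode: a node is appended to the list only
    # after all of its children have been processed; the list is then reversed.
    ordering = []
    visited = set()

    def add_children(node):
        if node in visited:
            return
        visited.add(node)
        for child in node.child_nodes:
            add_children(child)
        ordering.append(node)    # append the input node at the end

    add_children(rootNode)
    ordering.reverse()           # now every node comes before its children
    return ordering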
In automatic differentiation, methods based on the topological ordering are called tape-based
methods. They are used in some real-world implementations such as Tensorflow. The ordering is
regarded as a tape. We read the nodes one by one from the beginning of the sequence (tape) to
calculate the derivative value.
Based on the obtained ordering, let us now see how to compute each v̇. For example, from
v5(v4, v3) = v4 − v3,
we know
∂v5/∂v4 = 1 and ∂v5/∂v3 = −1.
A general form of our calculation is
v̇ = Σ_{u ∈ v's parents} (∂v/∂u) u̇. (9)
The second term, u̇ = ∂u/∂x1, comes from past calculations due to the topological ordering. We can
calculate the first term because u is one of v's parents and we know the operation at v. For
example, we have v2 = x1 × x2, so
∂v2/∂x1 = x2 and ∂v2/∂x2 = x1.
These values can be immediately computed and stored when we construct the computational graph.
Therefore, we add one member "gradient w.r.t. parents" to our Node class. In addition, we need
a member "partial derivative" to store the accumulated sum in the calculation of (9); its initial value
is zero when a node is created, and in the end it stores the v̇i value. Details of the derivative
evaluation are in Listing 4. The complete list of members of our Node class is in the following table.
member data type example for Node v2
numerical value float 10
parent nodes List[Node] [x1 , x2 ]
child nodes List[Node] [v4 ]
operator string "mul"
gradient w.r.t. parents List[float] [5, 2]
partial derivative float 5
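Accordingly, the Node class sketched in Section 3 can be extended as follows (again a sketch only; the zero initialization of "partial derivative" follows the text above).

class Node:
    def __init__(self, value, parent_nodes=None, operator=None):
        self.value = value
        self.parent_nodes = parent_nodes or []
        self.child_nodes = []
        self.operator = operator
        self.grad_wrt_parents = []       # d(this node)/d(each parent), filled by the wrapping functions
        self.partial_derivative = 0.0    # accumulates the sum in (9); zero when a node is created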
We update the mul function accordingly.
Listing 3: The wrapping function “mul”. The change from Listing 1 is in red color.
def mul(node1, node2):
    value = node1.value * node2.value
    parent_nodes = [node1, node2]
    newNode = Node(value, parent_nodes, "mul")
    newNode.grad_wrt_parents = [node2.value, node1.value]  # the line shown in red
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode
As shown above, in constructing a new child node we must compute
∂newNode/∂parentNode
for each parent node. Here are some examples other than the mul function. For the add function,
the derivatives with respect to both parents are simply 1. For the log function, the derivative with
respect to its single parent node is 1/node.value, so the red line is replaced by
    newNode.grad_wrt_parents = [1 / node.value]
Now, we know how to get each term in (9), i.e., the chain rule for calculating v̇. Therefore, if we
follow the topological ordering, all v̇i (i.e., partial derivatives with respect to x1 ) can be calculated.
An implementation to compute the partial derivatives is as follows. Here we store the resulting
value in the member partial derivative of each node.
Listing 4: Evaluating derivatives
def forward(rootNode):
    rootNode.partial_derivative = 1
    ordering = topological_order(rootNode)
    for node in ordering[1:]:
        partial_derivative = 0
        for i in range(len(node.parent_nodes)):
            dnode_dparent = node.grad_wrt_parents[i]
            dparent_droot = node.parent_nodes[i].partial_derivative
            partial_derivative += dnode_dparent * dparent_droot
        node.partial_derivative = partial_derivative
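As a hypothetical end-to-end example, assuming the Node class above and all wrapping functions updated in the style of Listing 3 (so that each of them also fills grad_wrt_parents), the whole procedure can be exercised as follows.

x1 = Node(2)
x2 = Node(5)
y = sub(add(log(x1), mul(x1, x2)), sin(x2))   # builds the graph; y.value is about 11.652
forward(x1)                                   # forward-mode sweep with respect to x1
print(y.partial_derivative)                   # dy/dx1 = 1/x1 + x2 = 5.5

Note that nodes not reachable from x1 (here v3 and the input x2) keep their initial partial derivative of zero, which is exactly what the sum in (9) requires.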
4.3 Summary
The procedure for the forward mode includes three steps:
1. Construct the computational graph while evaluating the function value.
2. Find a topological ordering of the nodes reachable from x1.
3. Compute the partial derivatives with respect to x1 along the topological ordering.
We discuss not only how to run each step but also what information we should store. This is a
minimal implementation to demonstrate forward-mode automatic differentiation.
Acknowledgements
This work was supported by National Science and Technology Council of Taiwan grant 110-2221-
E-002-115-MY3.
References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker,
V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale
machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI), pages 265–283, 2016.
J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc.,
2005. ISBN 0321295358.