
A Step-by-step Introduction to the Implementation of Automatic Differentiation
Yu-Hsueh Fang∗1, He-Zhe Lin∗1, Jie-Jyun Liu1, and Chih-Jen Lin1,2

1 National Taiwan University
{d11725001, r11922027, d11922012}@ntu.edu.tw
[email protected]

2 Mohamed bin Zayed University of Artificial Intelligence
[email protected]

∗ These authors contributed equally to this work.

December 15, 2024

Abstract
Automatic differentiation is a key component in deep learning. This topic is well studied
and excellent surveys such as Baydin et al. (2018) have been available to clearly describe the
basic concepts. Further, sophisticated implementations of automatic differentiation are now
an important part of popular deep learning frameworks. However, it is difficult, if not impos-
sible, to directly teach students the implementation of existing systems due to the complexity.
On the other hand, if the teaching stops at the basic concept, students fail to sense the re-
alization of an implementation. For example, we often mention the computational graph in
teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we partially fill the gap by giving a step-by-step introduction to implementing
a simple automatic differentiation system. We streamline the mathematical concepts and the
implementation. Further, we give the motivation behind each implementation detail, so the
whole setting becomes very natural.

1 Introduction
In modern machine learning, derivatives are the cornerstone of numerous applications and stud-
ies. The calculation often relies on automatic differentiation, a classic method for efficiently and
accurately calculating derivatives of numeric functions. For example, deep learning cannot suc-
ceed without automatic differentiation. Therefore, teaching students how automatic differentiation
works is highly essential.
Automatic differentiation is a well-developed area with rich literature. Excellent surveys in-
cluding Chinchalkar (1994), Bartholomew-Biggs et al. (2000), Baydin et al. (2018) and Margossian
(2019) review the algorithms for automatic differentiation and its wide applications. In particular,
Baydin et al. (2018) is a comprehensive work focusing on automatic differentiation in machine
learning. Therefore, there is no lack of materials introducing the concept of automatic differentia-
tion.
On the other hand, as deep learning systems now solve large-scale problems, it is inevitable
that the implementation of automatic differentiation becomes highly sophisticated. For example, in
popular deep learning systems such as PyTorch (Paszke et al., 2017) and Tensorflow (Abadi et al.,
2016), at least thousands of lines of code are needed. Because of this, the teaching of
automatic differentiation for deep learning often stops at the basic concepts. Then students fail to sense
the realization of an implementation. For example, we often mention the computational graph
in teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we aim to partially fill the gap by giving a tutorial on the basic implementation.
In recent years, many works1,2,3,4,5 have attempted to discuss the basic implementation of au-
tomatic differentiation. However, they still leave room for improvement. For example, some are
not self-contained – they quickly talk about implementations without connecting to basic concepts.
Ours, which is very suitable for beginners, has the following features:

• We streamline the mathematical concepts and the implementation. Further, we give the
motivation behind each implementation detail, so the whole setting becomes very natural.

• We use the example from Baydin et al. (2018) for consistency with past works. Ours is
thus an extension of Baydin et al. (2018) into the implementation details.

• We build a complete tutorial including this document, slides and the source code at
https://www.csie.ntu.edu.tw/~cjlin/papers/autodiff/.

2 Automatic Differentiation
There are two major modes of automatic differentiation. In this section, we introduce the basic
concepts of both modes. Most materials in this section are from Baydin et al. (2018). We consider
the same example function

y = f (x1 , x2 ) = log x1 + x1 x2 − sin x2 .


1 https://towardsdatascience.com/build-your-own-automatic-differentiation-program-6ecd585eec2a
2 https://sidsite.com/posts/autodiff
3 https://mdrk.io/introduction-to-automatic-differentiation-part2/
4 https://github.com/dlsyscourse/lecture5/blob/main/5_automatic_differentiation_implementation.ipynb
5 https://github.com/karpathy/micrograd

2.1 Forward Mode
First, we discuss the forward mode. Before calculating the derivative, let us check how to calculate
the function value. Assume that we want to calculate the function value at (x1 , x2 ) = (2, 5). Then,
in the following table, we have a forward procedure.
x1 =2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
We use variables vi to record the intermediate outcomes. First, we know the log function is applied to
x1. Therefore, we have log(x1) as a new variable called v1. Similarly, there is a variable v2, which
is x1 × x2. Each vi corresponds to a simple operation. In the end, our function value at (2, 5) is
y = v5. As shown in the table, the function evaluation is decomposed into a sequence of simple
operations. We have a corresponding computational graph as follows:

[Figure: the computational graph, with nodes x1, x2, v1, v2, v3, v4, v5 and edges x1 → v1, x1 → v2, x2 → v2, x2 → v3, v1 → v4, v2 → v4, v3 → v5, v4 → v5; the output is y = f(x1, x2) = v5.]

Because calculating both v1 and v2 needs x1 , x1 has two links to them in the graph. The following
graph shows all the intermediate results in the computation.

[Figure: the same computational graph annotated with intermediate results: x1 = 2, x2 = 5, v1 = log x1 = log 2, v2 = x1 × x2 = 2 × 5, v3 = sin x2 = sin 5, v4 = v1 + v2 = 0.693 + 10, v5 = v4 − v3 = 10.693 − (−0.959) = f(x1, x2).]

The computational graph tells us the dependencies of variables. Thus, from the inputs x1 and
x2 we can go through all nodes for getting the function value y = v5 in the end.
Now, we have learned about the function evaluation. But remember, we would like to calculate
the derivative. Assume that we target at the partial derivative ∂y/∂x1 . Here, we denote

v̇ = ∂v/∂x1
as the derivative of the variable v with respect to x1 . The idea is that by using the chain rule, we
can obtain the following forward derivative calculation to eventually get v̇5 = ∂f /∂x1 .

ẋ1 = ∂x1/∂x1 = 1
ẋ2 = ∂x2/∂x1 = 0
v̇1 = ẋ1/x1 = 1/2
v̇2 = ẋ1 × x2 + ẋ2 × x1 = 1 × 5 + 0 × 2
v̇3 = ẋ2 × cos x2 = 0 × cos 5
v̇4 = v̇1 + v̇2 = 0.5 + 5
v̇5 = v̇4 − v̇3 = 5.5 − 0
ẏ = v̇5 = 5.5

The table starts from ẋ1 and ẋ2 , which are ∂x1 /∂x1 = 1 and ∂x2 /∂x1 = 0. Based on ẋ1 and ẋ2 , we
can calculate other values. For example, let us check the partial derivative ∂v1 /∂x1 . From

v1 = log x1 ,

by the chain rule,


v̇1 = (∂v1/∂x1) × (∂x1/∂x1) = (1/x1) × (∂x1/∂x1) = ẋ1/x1.
Therefore, we need ẋ1 and x1 for calculating v̇1 on the left-hand side. We already have the value of
ẋ1 from the previous step (∂x1 /∂x1 = 1). Also, the function evaluation gives the value of x1 . Then,
we can calculate ẋ1 /x1 = 1/2. Clearly, the chain rule plays an important role here. The calculation
of other v̇i is similar.
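To make the two traces concrete, the function evaluation and the forward derivative calculation above can be transcribed directly into a few lines of Python. This is only a sketch of the hand calculation, not the implementation developed later in this document.

import math

# Forward (primal) trace at (x1, x2) = (2, 5)
x1, x2 = 2.0, 5.0
v1 = math.log(x1)
v2 = x1 * x2
v3 = math.sin(x2)
v4 = v1 + v2
v5 = v4 - v3                           # y = 11.652...

# Tangent trace: each "dot" value is the derivative with respect to x1
x1_dot, x2_dot = 1.0, 0.0              # seeds: dx1/dx1 = 1, dx2/dx1 = 0
v1_dot = x1_dot / x1                   # derivative of log x1
v2_dot = x1_dot * x2 + x2_dot * x1     # product rule
v3_dot = x2_dot * math.cos(x2)         # derivative of sin x2
v4_dot = v1_dot + v2_dot
v5_dot = v4_dot - v3_dot
print(v5, v5_dot)                      # 11.652..., 5.5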

2.2 Reverse Mode
Next, we discuss the reverse mode. We denote
v̄ = ∂y/∂v
as the derivative of the function y with respect to the variable v. Note that earlier, in the forward
mode, we considered
v̇ = ∂v/∂x1,
so the focus is on the derivatives of all variables with respect to one input variable. In contrast, the
reverse mode focuses on v̄ = ∂y/∂v for all v, the partial derivatives of one output with respect to
all variables. Therefore, for our example, we can use v̄i ’s and x̄i ’s to get both ∂y/∂x1 and ∂y/∂x2
at once. Now we illustrate the calculation of
∂y/∂x2.
By checking the variable x2 in the computational graph, we see that variable x2 affects y by
affecting v2 and v3 .

[Figure: the computational graph with arrow legends for the forward calculation of the function value and the reverse calculation of the derivative value; nodes: x1 = 2, x2 = 5, v1 = log x1, v2 = x1 × x2, v3 = sin x2, v4 = v1 + v2, v5 = v4 − v3 = f(x1, x2).]

This dependency, together with the fact that x1 is fixed, means that if we would like to calculate ∂y/∂x2,
then it amounts to calculating

∂y/∂x2 = (∂y/∂v2)(∂v2/∂x2) + (∂y/∂v3)(∂v3/∂x2).    (1)

We can rewrite Equation (1) as follows with our notation.

x̄2 = v̄2 (∂v2/∂x2) + v̄3 (∂v3/∂x2).    (2)

If v̄2 and v̄3 are available beforehand, all we need is to calculate ∂v2 /∂x2 and ∂v3 /∂x2 . From the
operation between x2 and v3 , we know that ∂v3 /∂x2 = cos(x2 ). Similarly, we have ∂v2 /∂x2 = x1 .
Then, the evaluation of x̄2 is done in two steps:
x̄2 ← v̄3 (∂v3/∂x2)
x̄2 ← x̄2 + v̄2 (∂v2/∂x2).
These steps are part of the sequence of a reverse traversal, shown in the following table (the table is read from the bottom up).

x̄1 = 5.5
x̄2 = 1.716
x̄1 = x̄1 + v̄1 (∂v1/∂x1) = x̄1 + v̄1/x1 = 5.5
x̄2 = x̄2 + v̄2 (∂v2/∂x2) = x̄2 + v̄2 × x1 = 1.716
x̄1 = v̄2 (∂v2/∂x1) = v̄2 × x2 = 5
x̄2 = v̄3 (∂v3/∂x2) = v̄3 × cos x2 = −0.284
v̄2 = v̄4 (∂v4/∂v2) = v̄4 × 1 = 1
v̄1 = v̄4 (∂v4/∂v1) = v̄4 × 1 = 1
v̄3 = v̄5 (∂v5/∂v3) = v̄5 × (−1) = −1
v̄4 = v̄5 (∂v5/∂v4) = v̄5 × 1 = 1
v̄5 = ȳ = 1
To get the desired x̄1 and x̄2 (i.e., ∂y/∂x1 and ∂y/∂x2 ), we begin with
v̄5 = ∂y/∂v5 = ∂y/∂y = 1.
From the computational graph, we then get v̄4 and v̄3 . Because v4 affects y only through v5 , we
have
v̄4 = ∂y/∂v4 = (∂y/∂v5)(∂v5/∂v4) = v̄5 (∂v5/∂v4) = v̄5 × 1.
The above equation is based on that we already know ∂y/∂v5 = v̄5 . Also, the operation from v4 to
v5 is an addition, so ∂v5 /∂v4 is a constant 1. By such a sequence, in the end, we obtain
∂y/∂x1 = x̄1 and ∂y/∂x2 = x̄2
at the same time.
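As in the forward mode, the reverse sweep can be transcribed literally into Python to check the numbers in the table. Again, this is only a sketch of the hand calculation.

import math

# Forward pass: function values at (x1, x2) = (2, 5)
x1, x2 = 2.0, 5.0
v1 = math.log(x1); v2 = x1 * x2; v3 = math.sin(x2)
v4 = v1 + v2;      v5 = v4 - v3

# Reverse pass: each "bar" value is the derivative of y with respect to that variable
v5_bar = 1.0                                  # dy/dv5 = dy/dy = 1
v4_bar = v5_bar * 1.0                         # v5 = v4 - v3
v3_bar = v5_bar * (-1.0)
v1_bar = v4_bar * 1.0                         # v4 = v1 + v2
v2_bar = v4_bar * 1.0
x2_bar = v2_bar * x1 + v3_bar * math.cos(x2)  # x2 affects y through v2 and v3
x1_bar = v1_bar / x1 + v2_bar * x2            # x1 affects y through v1 and v2
print(x1_bar, x2_bar)                         # 5.5, 1.716...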

3 Implementation of Function Evaluation and the Computational Graph
With the basic concepts ready in Section 2, we move to the implementation of the automatic
differentiation. For simplicity, we consider the forward mode. The reverse mode can be designed in
a similar way. Consider a function f : R^n → R with

y = f (x) = f (x1 , x2 , . . . , xn ).

For any given x, we show the computation of
∂y/∂x1
as an example.

3.1 The Need to Calculate Function Values


We are calculating the derivative, so at first glance, function values are not needed. However,
we show that function evaluation is necessary due to the function structure and the use of the chain
rule. To explain this, we begin by noting that the function of a neural network is usually a
nested composite function
f (x) = hk (hk−1 (. . . h1 (x)))
due to the layered structure. For easy discussion, let us assume that f (x) is the following general
composite function
f (x) = g(h1 (x), h2 (x), . . . , hk (x)).
We can see that the example function considered earlier

f (x1 , x2 ) = log x1 + x1 x2 − sin x2 (3)

can be written in the following composite function

g(h1 (x1 , x2 ), h2 (x1 , x2 ))

with

g(h1 , h2 ) = h1 − h2 ,
h1 (x1 , x2 ) = log x1 + x1 x2 ,
h2 (x1 , x2 ) = sin(x2 ).

To calculate the derivative at x = x0 using the chain rule, we have


∂f/∂x1 |_{x=x0} = Σ_{i=1}^{k} ( ∂g/∂hi |_{h=h(x0)} × ∂hi/∂x1 |_{x=x0} ),

where the notation ∂g/∂hi |_{h=h(x0)} means the derivative of g with respect to hi evaluated at
h(x0) = [h1(x0) · · · hk(x0)]^T. Clearly,
we must calculate the inner function values h1 (x0 ), . . . , hk (x0 ) first. The process of computing all
hi(x0) is part of (or almost the same as) the process of computing f(x0). This explains why, for
calculating the partial derivatives, we need the function value first.
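As a small numeric check of the above chain rule at x0 = (2, 5), we can evaluate the inner functions first and then combine the pieces (a sketch only). Note that for this particular g the derivatives ∂g/∂hi happen to be constants; in general they must be evaluated at h(x0), which is why the inner function values are needed.

import math

# Chain-rule check for f = g(h1, h2) at x0 = (2, 5)
x1, x2 = 2.0, 5.0
h1 = math.log(x1) + x1 * x2      # inner function values are computed first
h2 = math.sin(x2)
dg_dh1, dg_dh2 = 1.0, -1.0       # g(h1, h2) = h1 - h2
dh1_dx1 = 1.0 / x1 + x2          # derivative of log x1 + x1*x2 with respect to x1
dh2_dx1 = 0.0                    # h2 does not depend on x1
print(dg_dh1 * dh1_dx1 + dg_dh2 * dh2_dx1)   # 5.5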
Next, we discuss the implementation of getting the function value. For the function (3), recall
we have a table recording the order to get f (x1 , x2 ):

x1 =2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
Also, we have a computational graph to generate the computing order

[Figure: the computational graph from Section 2.1, annotated with the intermediate values, which gives the computing order.]

Therefore, we must check how to build the graph.

3.2 Creating the Computational Graph


A graph consists of nodes and edges. We must discuss what a node/edge is and how to store
information. From the graph shown above, we see that each node represents an intermediate
expression:

v1 = log x1 ,
v2 = x1 × x2 ,
v3 = sin x2 ,
v4 = v1 + v2,
v5 = v4 − v3 .

The expression in each node is from an operation to expressions in other nodes. Therefore, it is
natural to construct an edge
u → v,

if the expression of a node v is based on the expression of another node u. We say node u is a parent
node (of v) and node v is a child node (of u). To do the forward calculation, at node v we should

store v’s parents. Additionally, we need to record the operator applied on the node’s parents and
the resulting value. For example, the construction of the node

v2 = x1 × x2 ,

requires to store v2 ’s parent nodes {x1 , x2 }, the corresponding operator “×” and the resulting value.
Up to now, we can implement each node as a class Node with the following members.
member            data type    example for Node v2
numerical value   float        10
parent nodes      List[Node]   [x1, x2]
child nodes       List[Node]   [v4]
operator          string       "mul" (for ×)
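Based on this table, a minimal Node class can be sketched as follows. The member names (value, parent_nodes, child_nodes, operator) are our assumptions, chosen to match the listings later in this document; for convenience, the sketch also already contains the two members (grad_wrt_parents and partial_derivative) that Section 4.2 will add.

class Node:
    """A minimal sketch of the Node class described in the table above."""
    def __init__(self, value, parent_nodes=None, operator=None):
        self.value = value                      # numerical value of the expression
        self.parent_nodes = parent_nodes or []  # nodes this node is computed from
        self.child_nodes = []                   # filled in later by the wrapping functions
        self.operator = operator                # e.g., "mul" for multiplication
        # Members used for the derivative calculation in Section 4.2;
        # the partial derivative starts at zero when a node is created.
        self.grad_wrt_parents = []
        self.partial_derivative = 0.0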
At this moment, it is unclear why we should store child nodes in our Node class. Later we will
explain why such information is needed. Once the Node class is ready, starting from initial nodes
(which represent xi ’s), we use nested function calls to build the whole graph. In our case, the graph
for y = f (x1 , x2 ) can be constructed via

y = sub(add(log(x1), mul(x1, x2)),sin(x2)).

Let us see this process step by step and check what each function must do. First, our starting point
is the root nodes created by the Node class constructor.

[Figure: two root nodes x1 and x2.]

These root Nodes have empty members “parent nodes,” “child nodes,” and “operator” with only
“numerical value” respectively set to x1 and x2 . Then, we apply our implemented log(node) to
the node x1.

[Figure: the node x1 with an edge to the new node v1 = log x1.]

The implementation of our log function should create a Node instance to store log(x1 ). Therefore,
what we have is a wrapping function that does more than the log operation; see details in Section
3.3. The created node is the v1 node in our computational graph. Next, we discuss details of the
node creation. From the current log function and the input node x1 , we know contents of the
following members:

• parent nodes: [x1 ]
• operator: "log"
• numerical value: log 2

However, we have no information about the children of this node, simply because the graph does
not yet include its child nodes. Instead, we leave the member “child nodes” empty and let child
nodes write back the information later. By this idea, our log function should add v1 to the “child
nodes” of x1. See more details later in Section 3.3.
We move on to apply mul(node1, node2) on nodes x1 and x2.

[Figure: the nodes x1 and x2, each with an edge to the new node v2 = x1 × x2.]

Similarly, the mul function generates a Node instance. However, different from log(x1), the node
created here stores two parents (instead of one). Then we apply the function call add(log(x1),
mul(x1, x2)).

[Figure: the graph after add, where v1 = log x1 and v2 = x1 × x2 each link to the new node v4 = log x1 + x1 × x2.]

Next, we apply sin(node) to x2.

[Figure: the node x2 with an edge to the new node v3 = sin x2.]
Last, applying sub(node1, node2) to the output nodes of add(log(x1), mul(x1, x2)) and sin(x2)
leads to

[Figure: the complete computational graph, where sub creates the node v5 = log x1 + x1 × x2 − sin x2 from v4 and v3.]

We can conclude that each function generates exactly one Node instance; however, the generated
nodes differ in the operator, the number of parents, etc.

3.3 Wrapping Functions


We mentioned that a function like “mul” does more than calculating the product of two numbers.
Here we show more details. These customized functions “add”, “mul” and “log” in the previous
pages are wrapping functions, which “wrap” numerical operations with additional code. An important
task is to maintain the relation between the constructed node and its parents/children. This
way, the graph information can be preserved.
For example, we may implement the following “mul” function.
Listing 1: The wrapping function “mul”
def mul(node1, node2):
    value = node1.value * node2.value
    parent_nodes = [node1, node2]
    newNode = Node(value, parent_nodes, "mul")
    # register the new node as a child of both inputs
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode
In this code, we add the created node to the “child nodes” lists of the two input nodes: node1 and
node2. As we mentioned earlier, when node1 and node2 were created, their lists of child nodes
were unknown and left empty. Thus, in creating each node, we append it to the child-node lists of
its parent(s).
The output of the function should be the created node. This setting enables the nested function
call. That is, calling
y = sub(add(log(x1), mul(x1, x2)),sin(x2))
finishes the function evaluation. At the same time, we build the computational graph.
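The nested call also uses log, sin, add, and sub. Following the pattern of Listing 1, they can be sketched as below; the operator strings and the use of Python's math module are our assumptions, not code from the original source.

import math

def log(node):
    newNode = Node(math.log(node.value), [node], "log")
    node.child_nodes.append(newNode)
    return newNode

def sin(node):
    newNode = Node(math.sin(node.value), [node], "sin")
    node.child_nodes.append(newNode)
    return newNode

def add(node1, node2):
    newNode = Node(node1.value + node2.value, [node1, node2], "add")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def sub(node1, node2):
    newNode = Node(node1.value - node2.value, [node1, node2], "sub")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode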

4 Topological Order and Partial Derivatives
Once the computational graph is built, we want to use the information in the graph to compute
∂y/∂x1 = ∂v5/∂x1.

4.1 Finding the Topological Order


Recall that ∂v/∂x1 is denoted by v̇. From the chain rule,
v̇5 = (∂v5/∂v4) v̇4 + (∂v5/∂v3) v̇3.    (4)
We are able to calculate
∂v5/∂v4 and ∂v5/∂v3,    (5)
because the node v5 stores the needed information related to its parents v4 and v3 . We defer the
details on calculating (5), so the focus is now on calculating v̇4 and v̇3 . For v̇4 , we further have
v̇4 = (∂v4/∂v1) v̇1 + (∂v4/∂v2) v̇2,    (6)
which, by the same reason, indicates the need of v̇1 and v̇2 . On the other hand, we have v̇3 = 0
since v3 (i.e., sin(x2 )) is not a function of x1 . The discussion on calculating v̇4 and v̇3 leads us to
find that
v is not reachable from x1 in the graph ⇒ v̇ = 0. (7)
We say a node v is reachable from a node u if there exists a path from u to v in the graph. From (7),
now we only care about nodes reachable from x1 . Further, we must properly order nodes reachable
from x1 so that, for example, in (4), v̇4 and v̇3 are ready before calculating v̇5. Similarly, v̇1 and
v̇2 should be available when calculating v̇4 .
To consider nodes reachable from x1 , from the whole computational graph G = ⟨V, E⟩, where
V and E are respectively sets of nodes and edges, we define

VR = {v ∈ V | v is reachable from x1 }

and
ER = {(u, v) ∈ E | u ∈ VR , v ∈ VR }.
Then,
GR ≡ ⟨VR , ER ⟩
is a subgraph of G. For our example, GR is the following subgraph with

VR = {x1 , v1 , v2 , v4 , v5 }

and
ER = {(x1, v1), (x1, v2), (v1, v4), (v2, v4), (v4, v5)}.

[Figure: the subgraph GR with edges x1 → v1, x1 → v2, v1 → v4, v2 → v4, v4 → v5.]

We aim to find a “suitable” ordering of VR satisfying that each node u ∈ VR comes before all of its
child nodes in the ordering. By doing so, u̇ can be used in the derivative calculation of its child
nodes; see (6). For our example, a “suitable” ordering can be

x1 , v1 , v2 , v4 , v5 .

In graph theory, such an ordering is called a topological ordering of GR. Since GR is a directed
acyclic graph (DAG), a topological ordering must exist.6 We may use depth-first search (DFS) to
traverse GR to find the topological ordering. For the implementation, earlier we included a member
“child nodes” in the Node class, but did not explain why. The reason is that to traverse GR from
x1 , we must access children of each node.
Based on the above idea, we can have the following code to find a topological ordering.
Listing 2: Using depth first search to find a topological ordering
def topological_order(rootNode):
    def add_children(node):
        if node not in visited:
            visited.add(node)
            for child in node.child_nodes:
                add_children(child)
            # append the node only after all nodes reachable from it are appended
            ordering.append(node)
    ordering, visited = [], set()
    add_children(rootNode)
    return list(reversed(ordering))
The function add_children implements the depth-first search of a DAG. From the input node, it
sequentially calls itself by using each child as the input. In this way, it explores all nodes reachable from
the input node. After that, we append the input node to the end of the output list. Also, we must
maintain a set of visited nodes to ensure that each node is included in the ordering exactly once.
For our example, the input in calling the above function is x1 , which is also the root node of GR .
The left-most path of the depth-first search is

x1 → v1 → v4 → v5 , (8)

so v5 is added first. In the end, we get the following list

[v5 , v4 , v1 , v2 , x1 ].
6 We do not get into details, but a proof can be found in Kleinberg and Tardos (2005).

Then, by reversing the list, a node always comes before its children. One may wonder whether we
can add a node to the list before adding its child nodes. This way, we have a simpler implementation
without needing to reverse the list in the end. However, this setting may fail to generate a topological
ordering. We obtain the following list for our example:

[x1 , v1 , v4 , v5 , v2 ].

A violation occurs because v2 does not appear before its child v4 . The key reason is that in a DFS
path, a node may point to another node that was added earlier through a different path. Then this
node appears after one of its children in the list. For our example,

x1 → v2 → v4 → v5

is a path processed after the path in (8). Thus, v2 is added after v4 and a violation occurs. Reversing
the list can resolve the problem. To this end, in the function add_children, we must append the
input node at the end.
In automatic differentiation, methods based on the topological ordering are called tape-based
methods. They are used in some real-world implementations such as Tensorflow. The ordering is
regarded as a tape. We read the nodes one by one from the beginning of the sequence (tape) to
calculate the derivative value.
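As a quick check of Listing 2, we can build the graph of our example with the wrapping functions sketched in Section 3.3 and inspect the resulting ordering (a usage sketch only).

x1, x2 = Node(2.0), Node(5.0)
y = sub(add(log(x1), mul(x1, x2)), sin(x2))
ordering = topological_order(x1)
print([node.operator for node in ordering])
# Parents always precede their children, e.g., [None, 'mul', 'log', 'add', 'sub'];
# the relative order of siblings such as v1 and v2 may differ.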
Based on the obtained ordering, subsequently let us see how to compute each v̇.

4.2 Computing the Partial Derivative


Earlier, by the chain rule, we have
v̇5 = (∂v5/∂v4) v̇4 + (∂v5/∂v3) v̇3.
In Section 4.1, we mentioned that
v̇4 and v̇3

should be ready before calculating v̇5 . For


∂v5/∂v4 and ∂v5/∂v3,
we are able to calculate and store them when v5 is created. The reason is that from

v5 (v4 , v3 ) = v4 − v3 ,

we know
∂v5/∂v4 = 1 and ∂v5/∂v3 = −1.

A general form of our calculation is

v̇ = Σ_{u ∈ v's parents} (∂v/∂u) u̇.    (9)

The second term, u̇ = ∂u/∂x1, comes from past calculation due to the topological ordering. We can
calculate the first term because u is one of v's parent(s) and we know the operation at v. For
example, we have v2 = x1 × x2, so

∂v2/∂x1 = x2 and ∂v2/∂x2 = x1.
These values can be immediately computed and stored when we construct the computational graph.
Therefore, we add one member “gradient w.r.t. parents” to our Node class. In addition, we need
a member “partial derivative” to store the accumulated sum in the calculation of (9); the initial
value of this member is zero when a node is created. In the end, this member stores the v̇i value.
Details of the derivative evaluation are in Listing 4. The complete
list of members of our node class is in the following table.
member                   data type     example for Node v2
numerical value          float         10
parent nodes             List[Node]    [x1, x2]
child nodes              List[Node]    [v4]
operator                 string        "mul"
gradient w.r.t. parents  List[float]   [5, 2]
partial derivative       float         5
We update the mul function accordingly.
Listing 3: The wrapping function “mul”. The change from Listing 1 is the line marked “new”.
def mul(node1, node2):
    value = node1.value * node2.value
    parent_nodes = [node1, node2]
    newNode = Node(value, parent_nodes, "mul")
    newNode.grad_wrt_parents = [node2.value, node1.value]  # new: d(newNode)/d(parent) for each parent
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode
As shown above, we must compute ∂newNode/∂parentNode for each parent node in constructing a
new child node. Here are some examples other than the mul function:

• For add(node1, node2), we have

  ∂newNode/∂node1 = ∂newNode/∂node2 = 1,

  so the marked line is replaced by

  newNode.grad_wrt_parents = [1., 1.].

• For log(node), we have

  ∂newNode/∂node = 1/node.value,

  so the marked line becomes

  newNode.grad_wrt_parents = [1/node.value].
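The other two functions used in the nested call can be handled the same way (these two items are our additions, following the pattern above; the values come from ∂(a − b)/∂a = 1, ∂(a − b)/∂b = −1, and d(sin x)/dx = cos x):

• For sub(node1, node2), we have ∂newNode/∂node1 = 1 and ∂newNode/∂node2 = −1, so the marked line becomes

  newNode.grad_wrt_parents = [1., -1.].

• For sin(node), we have ∂newNode/∂node = cos(node.value), so the marked line becomes

  newNode.grad_wrt_parents = [math.cos(node.value)].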

Now, we know how to get each term in (9), i.e., the chain rule for calculating v̇. Therefore, if we
follow the topological ordering, all v̇i (i.e., partial derivatives with respect to x1 ) can be calculated.
An implementation to compute the partial derivatives is as follows. Here we store the resulting
value in the member partial derivative of each node.
Listing 4: Evaluating derivatives
def forward(rootNode):
    rootNode.partial_derivative = 1
    ordering = topological_order(rootNode)
    for node in ordering[1:]:
        # accumulate the chain rule (9): sum over parents of (dnode/dparent) * (dparent/droot)
        partial_derivative = 0
        for i in range(len(node.parent_nodes)):
            dnode_dparent = node.grad_wrt_parents[i]
            dparent_droot = node.parent_nodes[i].partial_derivative
            partial_derivative += dnode_dparent * dparent_droot
        node.partial_derivative = partial_derivative
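As a small end-to-end sketch, assuming the Node class sketched in Section 3.2 and all wrapping functions updated with grad_wrt_parents as described above, the whole procedure for our example can be run as follows.

x1, x2 = Node(2.0), Node(5.0)                # root nodes holding only values
y = sub(add(log(x1), mul(x1, x2)), sin(x2))  # builds the graph and evaluates the function
forward(x1)                                  # propagate partial derivatives w.r.t. x1
print(y.value)                               # 11.652...
print(y.partial_derivative)                  # 5.5, i.e., dy/dx1 at (2, 5)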

4.3 Summary
The procedure for forward mode includes three steps:

1. Create the computational graph

2. Find a topological order of the graph associated with x1

3. Compute the partial derivative with respect to x1 along the topological order

We discuss not only how to run each step but also what information we should store. This is a
minimal implementation to demonstrate the forward mode automatic differentiation.

Acknowledgements
This work was supported by National Science and Technology Council of Taiwan grant 110-2221-
E-002-115-MY3.

References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker,
V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale
machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI), pages 265–283, 2016.

M. Bartholomew-Biggs, S. Brown, B. Christianson, and L. Dixon. Automatic differentiation of
algorithms. Journal of Computational and Applied Mathematics, 124(1-2):171–190, 2000.

A. G. Baydin, B. A. Pearlmutter, A. A. Radul, and J. M. Siskind. Automatic differentiation in
machine learning: a survey. Journal of Machine Learning Research, 18(153):1–43, 2018.

S. Chinchalkar. The application of automatic differentiation to problems in engineering analysis.
Computer Methods in Applied Mechanics and Engineering, 118(1-2):197–207, 1994.

J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc.,
2005. ISBN 0321295358.

C. C. Margossian. A review of automatic differentiation and its efficient implementation. Wiley
Interdisciplinary Reviews: Data Mining and Knowledge Discovery, 9(4):e1305, 2019.

A. Paszke, S. Gross, S. Chintala, G. Chanan, E. Yang, Z. DeVito, Z. Lin, A. Desmaison, L. Antiga,
and A. Lerer. Automatic differentiation in PyTorch. 2017.
