A Step-by-Step Introduction to the Implementation of Automatic Differentiation
Yu-Hsueh Fang∗1, He-Zhe Lin∗1, Jie-Jyun Liu1, and Chih-Jen Lin1,2
1 National Taiwan University
{d11725001, r11922027, d11922012}@ntu.edu.tw, [email protected]
2 Mohamed bin Zayed University of Artificial Intelligence
[email protected]
Abstract
Automatic differentiation is a key component in deep learning. This topic is well studied
and excellent surveys such as Baydin et al. (2018) have been available to clearly describe the
basic concepts. Further, sophisticated implementations of automatic differentiation are now
an important part of popular deep learning frameworks. However, it is difficult, if not impos-
sible, to directly teach students the implementation of existing systems due to their complexity.
On the other hand, if the teaching stops at the basic concepts, students fail to see how the
concepts are realized in an actual implementation. For example, we often mention the computational graph in
teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we partially fill the gap by giving a step-by-step introduction to implementing
a simple automatic differentiation system. We streamline the mathematical concepts and the
implementation. Further, we give the motivation behind each implementation detail, so the
whole setting becomes very natural.
1 Introduction
In modern machine learning, derivatives are the cornerstone of numerous applications and stud-
ies. The calculation often relies on automatic differentiation, a classic method for efficiently and
accurately calculating derivatives of numeric functions. For example, deep learning cannot suc-
ceed without automatic differentiation. Therefore, teaching students how automatic differentiation
works is essential.
Automatic differentiation is a well-developed area with rich literature. Excellent surveys in-
cluding Chinchalkar (1994), Bartholomew-Biggs et al. (2000), Baydin et al. (2018) and Margossian
*These authors contributed equally to this work.
(2019) review the algorithms for automatic differentiation and its wide applications. In particular,
Baydin et al. (2018) is a comprehensive work focusing on automatic differentiation in machine
learning. Therefore, there is no lack of materials introducing the concept of automatic differentia-
tion.
On the other hand, as deep learning systems now solve large-scale problems, it is inevitable
that the implementation of automatic differentiation becomes highly sophisticated. For example, in
popular deep learning systems such as PyTorch (Paszke et al., 2017) and Tensorflow (Abadi et al.,
2016), at least thousands of lines of code are needed. Because of this, the teaching of automatic
differentiation for deep learning often stops at the basic concepts. Then students fail to see how
the concepts are realized in an implementation. For example, we often mention the computational graph
in teaching automatic differentiation, but students wonder how to implement and use it. In this
document, we aim to partially fill the gap by giving a tutorial on the basic implementation.
In recent years, many works1,2,3,4,5 have attempted to discuss the basic implementation of au-
tomatic differentiation. However, they still leave room for improvement. For example, some are
not self-contained – they quickly talk about implementations without connecting them to the basic concepts.
Ours, which is very suitable for beginners, has the following features:
• We streamline the mathematical concepts and the implementation. Further, we give the
motivation behind each implementation detail, so the whole setting becomes very natural.
• We use the example from Baydin et al. (2018) for the consistency with past works. Ours is
thus an extension of Baydin et al. (2018) into the implementation details.
• We build a complete tutorial including this document, slides and the source code at
https://www.csie.ntu.edu.tw/~cjlin/papers/autodiff/.
2 Automatic Differentiation
There are two major modes of automatic differentiation. In this section, we introduce the basic
concepts of both modes. Most materials in this section are from Baydin et al. (2018). We consider
the same example function
y = f(x1, x2) = log x1 + x1 x2 − sin x2.
2.1 Forward Mode
First, we discuss the forward mode. Before calculating the derivative, let us check how to calculate
the function value. Assume that we want to calculate the function value at (x1 , x2 ) = (2, 5). Then,
in the following table, we have a forward procedure.
x1 =2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
We use variables vi to record the intermediate outcomes. First, we know the log function is applied to
x1. Therefore, we have log(x1) as a new variable called v1. Similarly, there is a variable v2, which
is x1 × x2. Each vi is related to a simple operation. In the end, our function value at (2, 5) is y = v5. As shown in the table, the
function evaluation is decomposed into a sequence of simple operations. We have a corresponding
computational graph as follows:
[Figure: the computational graph. Edges: x1 → v1, x1 → v2, x2 → v2, x2 → v3, v1 → v4, v2 → v4, v3 → v5, v4 → v5, and y = f(x1, x2) = v5.]
Because calculating both v1 and v2 needs x1 , x1 has two links to them in the graph. The following
graph shows all the intermediate results in the computation.
[Figure: the same computational graph annotated with values: x1 = 2, x2 = 5, v1 = log x1 = log 2, v2 = x1 × x2 = 2 × 5, v3 = sin x2 = sin 5, v4 = v1 + v2 = 0.693 + 10, v5 = v4 − v3 = 10.693 − (−0.959) = f(x1, x2).]
The computational graph tells us the dependencies of variables. Thus, from the inputs x1 and
x2 we can go through all nodes to get the function value y = v5 in the end.
Now, we have learned about the function evaluation. But remember, we would like to calculate
the derivative. Assume that we target the partial derivative ∂y/∂x1. Here, we denote
v̇ = ∂v/∂x1
as the derivative of the variable v with respect to x1. The idea is that by using the chain rule, we
can obtain the following forward derivative calculation to eventually get v̇5 = ∂f/∂x1.
ẋ1 = 1
ẋ2 = 0
v̇1 = ẋ1/x1 = 1/2
v̇2 = ẋ1 × x2 + ẋ2 × x1 = 1 × 5 + 0 × 2
v̇3 = ẋ2 × cos x2 = 0 × cos 5
v̇4 = v̇1 + v̇2 = 0.5 + 5
v̇5 = v̇4 − v̇3 = 5.5 − 0
ẏ = v̇5 = 5.5
The table starts from ẋ1 and ẋ2, which are ∂x1/∂x1 = 1 and ∂x2/∂x1 = 0. Based on ẋ1 and ẋ2, we
can calculate other values. For example, let us check the partial derivative ∂v1/∂x1. From
v1 = log x1,
we have v̇1 = ẋ1/x1 = 1/2 = 0.5.
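To connect the table with code, the following short Python sketch (our own illustration, not the implementation developed later in this document) reproduces the forward derivative trace by carrying each value together with its derivative with respect to x1.

import math

# Each quantity carries (value, derivative w.r.t. x1), evaluated at (x1, x2) = (2, 5).
x1, x1_dot = 2.0, 1.0                               # dx1/dx1 = 1
x2, x2_dot = 5.0, 0.0                               # dx2/dx1 = 0
v1, v1_dot = math.log(x1), x1_dot / x1              # d(log u)/dx1 = u_dot / u
v2, v2_dot = x1 * x2, x1_dot * x2 + x2_dot * x1     # product rule
v3, v3_dot = math.sin(x2), x2_dot * math.cos(x2)
v4, v4_dot = v1 + v2, v1_dot + v2_dot
v5, v5_dot = v4 - v3, v4_dot - v3_dot
print(v5, v5_dot)                                   # about 11.652 and 5.5 = dy/dx1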
2.2 Reverse Mode
Next, we discuss the reverse mode. We denote
v̄ = ∂y/∂v
as the derivative of the function y with respect to the variable v. Note that earlier, in the forward
mode, we considered
v̇ = ∂v/∂x1,
so the focus is on the derivatives of all variables with respect to one input variable. In contrast, the
reverse mode focuses on v̄ = ∂y/∂v for all v, the partial derivatives of one output with respect to
all variables. Therefore, for our example, we can use v̄i ’s and x̄i ’s to get both ∂y/∂x1 and ∂y/∂x2
at once. Now we illustrate the calculation of
∂y/∂x2.
By checking the variable x2 in the computational graph, we see that variable x2 affects y by
affecting v2 and v3 .
[Figure: the computational graph, highlighting that x2 is a parent of both v2 = x1 × x2 and v3 = sin x2.]
This dependency, together with the fact that x1 is fixed, means that calculating ∂y/∂x2
amounts to calculating
∂y/∂x2 = (∂y/∂v2)(∂v2/∂x2) + (∂y/∂v3)(∂v3/∂x2). (1)
We can rewrite Equation 1 as follows with our notation.
x̄2 = v̄2 (∂v2/∂x2) + v̄3 (∂v3/∂x2). (2)
If v̄2 and v̄3 are available beforehand, all we need is to calculate ∂v2 /∂x2 and ∂v3 /∂x2 . From the
operation between x2 and v3 , we know that ∂v3 /∂x2 = cos(x2 ). Similarly, we have ∂v2 /∂x2 = x1 .
Then, the evaluation of x̄2 is done in two steps:
x̄2 ← v̄3 (∂v3/∂x2)
x̄2 ← x̄2 + v̄2 (∂v2/∂x2).
These steps are part of the sequence of a reverse traversal, shown in the following table.
x̄1 = 5.5
x̄2 = 1.716
x̄1 = x̄1 + v̄1 (∂v1/∂x1) = x̄1 + v̄1/x1 = 5.5
x̄2 = x̄2 + v̄2 (∂v2/∂x2) = x̄2 + v̄2 × x1 = 1.716
x̄1 = v̄2 (∂v2/∂x1) = v̄2 × x2 = 5
x̄2 = v̄3 (∂v3/∂x2) = v̄3 × cos x2 = −0.284
v̄2 = v̄4 (∂v4/∂v2) = v̄4 × 1 = 1
v̄1 = v̄4 (∂v4/∂v1) = v̄4 × 1 = 1
v̄3 = v̄5 (∂v5/∂v3) = v̄5 × (−1) = −1
v̄4 = v̄5 (∂v5/∂v4) = v̄5 × 1 = 1
v̄5 = ȳ = 1
To get the desired x̄1 and x̄2 (i.e., ∂y/∂x1 and ∂y/∂x2 ), we begin with
v̄5 = ∂y/∂v5 = ∂y/∂y = 1.
From the computational graph, we then get v̄4 and v̄3 . Because v4 affects y only through v5 , we
have
v̄4 = ∂y/∂v4 = (∂y/∂v5)(∂v5/∂v4) = v̄5 (∂v5/∂v4) = v̄5 × 1.
The above equation is based on the fact that we already know ∂y/∂v5 = v̄5. Also, the operation from v4 to
v5 is an addition, so ∂v5 /∂v4 is a constant 1. By such a sequence, in the end, we obtain
∂y/∂x1 = x̄1 and ∂y/∂x2 = x̄2
at the same time.
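To make the reverse sweep concrete, the following short Python sketch (again our own illustration; the implementation in the remaining sections focuses on the forward mode) reproduces the table above.

import math

# Forward pass: record intermediate values at (x1, x2) = (2, 5).
x1, x2 = 2.0, 5.0
v1 = math.log(x1)
v2 = x1 * x2
v3 = math.sin(x2)
v4 = v1 + v2
v5 = v4 - v3                                    # y = v5, about 11.652

# Reverse sweep: propagate vbar = dy/dv from the output back to the inputs.
v5_bar = 1.0                                    # dy/dv5
v4_bar = v5_bar * 1.0                           # v5 = v4 - v3
v3_bar = v5_bar * (-1.0)
v1_bar = v4_bar * 1.0                           # v4 = v1 + v2
v2_bar = v4_bar * 1.0
x2_bar = v3_bar * math.cos(x2) + v2_bar * x1    # x2 affects v3 and v2
x1_bar = v1_bar / x1 + v2_bar * x2              # x1 affects v1 and v2
print(x1_bar, x2_bar)                           # about 5.5 and 1.716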
We now move to the implementation. Consider a general function
y = f(x) = f(x1, x2, . . . , xn).
For any given x, we show the computation of
∂y/∂x1
as an example.
Our example function can be written as the composition f(x1, x2) = g(h1(x1, x2), h2(x1, x2)) with
g(h1, h2) = h1 − h2,
h1(x1, x2) = log x1 + x1 x2,
h2(x1, x2) = sin(x2).
Recall the function evaluation of our example at (x1, x2) = (2, 5):
x1 = 2
x2 =5
v1 = log x1 = log 2
v2 = x1 × x2 =2×5
v3 = sin x2 = sin 5
v4 = v1 + v2 = 0.693 + 10
v5 = v4 − v3 = 10.693 + 0.959
y = v5 = 11.652
Also, we have a computational graph to generate the computing order
[Figure: the computational graph annotated with values, as in Section 2.1: x1 = 2, x2 = 5, v1 = log x1 = log 2, v2 = x1 × x2 = 2 × 5, v3 = sin x2 = sin 5, v4 = v1 + v2 = 0.693 + 10, v5 = v4 − v3 = 10.693 − (−0.959) = f(x1, x2).]
In this graph, the nodes correspond to the expressions
v1 = log x1 ,
v2 = x1 × x2 ,
v3 = sin x2 ,
v4 = v1 + v2,
v5 = v4 − v3 .
The expression in each node is obtained by applying an operation to expressions in other nodes. Therefore, it is
natural to construct an edge
u → v,
if the expression of a node v is based on the expression of another node u. We say node u is a parent
node (of v) and node v is a child node (of u). To do the forward calculation, at node v we should
store v’s parents. Additionally, we need to record the operator applied on the node’s parents and
the resulting value. For example, the construction of the node
v2 = x1 × x2 ,
requires storing v2's parent nodes {x1, x2}, the corresponding operator "×" and the resulting value.
Up to now, we can implement each node as a class Node with the following members.
member data type example for Node v2
numerical value float 10
parent nodes List[Node] [x1 , x2 ]
child nodes List[Node] [v4 ]
operator string "mul" (for ×)
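The following is a minimal sketch of such a Node class. It is our illustration, assuming a constructor signature Node(value, parent_nodes, operator) that matches the calls used later (e.g., in Listing 3).

class Node:
    def __init__(self, value, parent_nodes=None, operator=None):
        self.value = value                       # numerical value
        self.parent_nodes = parent_nodes or []   # parent nodes in the graph
        self.child_nodes = []                    # filled in later by child nodes
        self.operator = operator                 # e.g., "mul"

For example, the two inputs of our function can then be created by x1 = Node(2) and x2 = Node(5).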
At this moment, it is unclear why we should store child nodes in our Node class. Later we will
explain why such information is needed. Once the Node class is ready, starting from initial nodes
(which represent xi ’s), we use nested function calls to build the whole graph. In our case, the graph
for y = f(x1, x2) can be constructed via
y = sub(add(log(x1), mul(x1, x2)), sin(x2)).
Let us see this process step by step and check what each function must do. First, our starting point
is the root nodes created by the Node class constructor.
[Figure: two isolated root nodes x1 and x2.]
These root Nodes have empty members "parent nodes," "child nodes," and "operator," with only
"numerical value" set to x1 and x2, respectively. Then, we apply our implemented log(node) to
the node x1.
[Figure: applying log to x1 creates the node v1 = log x1.]
The implementation of our log function should create a Node instance to store log(x1 ). Therefore,
what we have is a wrapping function that does more than the log operation; see details in Section
3.3. The created node is the v1 node in our computational graph. Next, we discuss details of the
node creation. From the current log function and the input node x1 , we know contents of the
following members:
• parent nodes: [x1 ]
• operator: "log"
• numerical value: log 2
However, we have no information about the children of this node. The reason is obvious: we do not
yet have a graph that includes its child nodes. Instead, we leave this member "child nodes"
empty and let child nodes write back the information. By this idea, our log function should add
v1 to the "child nodes" of x1. See more details later in Section 3.3.
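For concreteness, a minimal sketch of such a wrapping log function, under the Node class sketched above, may look as follows (a sketch only; names other than log and Node are our assumptions).

import math

def log(node):
    # Create a new Node storing log(node.value), record the parent and the
    # operator, and write the new node back as a child of its parent.
    newNode = Node(math.log(node.value), [node], "log")
    node.child_nodes.append(newNode)
    return newNode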
We move on to apply mul(node1, node2) on nodes x1 and x2.
[Figure: applying mul to x1 and x2 creates the node v2 = x1 × x2 with two parent nodes.]
Similarly, the mul function generates a Node instance. However, different from log(x1), the node
created here stores two parents (instead of one). Then we apply the function call add(log(x1),
mul(x1, x2)).
[Figure: after add(log(x1), mul(x1, x2)), the node v4 = log x1 + x1 × x2 has parents v1 and v2; similarly, applying sin to x2 creates the node v3 = sin x2.]
Last, applying sub(node1, node2) to the output nodes of add(log(x1), mul(x1, x2)) and sin(x2)
leads to
[Figure: the complete graph; the node v5 = log x1 + x1 × x2 − sin x2 has parents v4 and v3, and y = v5.]
We can conclude that each function generates exactly one Node instance; however, the generated
nodes differ in the operator, the number of parents, etc.
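Following the same pattern, the remaining wrapping functions can be sketched as below. This is only an illustration under the earlier assumptions; the document's Listing 1 (referenced later) and Listing 3 give the reference version of mul.

import math

def mul(node1, node2):
    # Binary wrapper: the new node stores two parent nodes.
    newNode = Node(node1.value * node2.value, [node1, node2], "mul")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def add(node1, node2):
    newNode = Node(node1.value + node2.value, [node1, node2], "add")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def sub(node1, node2):
    newNode = Node(node1.value - node2.value, [node1, node2], "sub")
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode

def sin(node):
    # Unary wrapper, analogous to log.
    newNode = Node(math.sin(node.value), [node], "sin")
    node.child_nodes.append(newNode)
    return newNode

With these functions, the call y = sub(add(log(x1), mul(x1, x2)), sin(x2)) builds the whole graph, and y.value is about 11.652.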
4 Topological Order and Partial Derivatives
Once the computational graph is built, we want to use the information in the graph to compute
∂y/∂x1 = ∂v5/∂x1.
Let G ≡ ⟨V, E⟩ denote the computational graph with node set V and edge set E. Because only the
nodes reachable from x1 depend on x1, we define
VR = {v ∈ V | v is reachable from x1}
and
ER = {(u, v) ∈ E | u ∈ VR , v ∈ VR }.
Then,
GR ≡ ⟨VR , ER ⟩
is a subgraph of G. For our example, GR is the following subgraph with
VR = {x1 , v1 , v2 , v4 , v5 }
and
ER = {(x1, v1), (x1, v2), (v1, v4), (v2, v4), (v4, v5)}.
[Figure: the subgraph GR with edges x1 → v1, x1 → v2, v1 → v4, v2 → v4, v4 → v5.]
We aim to find a “suitable” ordering of VR satisfying that each node u ∈ VR comes before all of its
child nodes in the ordering. By doing so, u̇ can be used in the derivative calculation of its child
nodes; see (6). For our example, a “suitable” ordering can be
x1, v1, v2, v4, v5.
Such an ordering can be obtained by a depth-first search (DFS) from x1, where a node is added to a
list only after all of its child nodes have been processed.6 For our example, the DFS first follows the path
x1 → v1 → v4 → v5, (8)
and we obtain the list
[v5, v4, v1, v2, x1].
6 We do not get into details, but a proof can be found in Kleinberg and Tardos (2005).
Then, by reversing the list, a node always comes before its children. One may wonder whether we
can add a node to the list before adding its child nodes. This way, we have a simpler implementation
without needing to reverse the list in the end. However, this setting may fail to generate a topological
ordering. We obtain the following list for our example:
[x1 , v1 , v4 , v5 , v2 ].
A violation occurs because v2 does not appear before its child v4 . The key reason is that in a DFS
path, a node may point to another node that was added earlier through a different path. Then this
node appears after one of its children. For our example,
x1 → v2 → v4 → v5
is a path processed after the path in (8). Thus, v2 is added after v4 and a violation occurs. Reversing
the list can resolve the problem. To this end, in the function add_children, we must append the
input node at the end.
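A minimal sketch of this procedure is given below. It is our illustration; only the names topological_order (used in Listing 4) and add_children come from the document, and the visited set is a detail we add to avoid processing a node twice.

def topological_order(rootNode):
    # Depth-first search from rootNode: a node is appended to the list only
    # after all of its children have been processed; the list is then reversed.
    ordering = []
    visited = set()

    def add_children(node):
        if node in visited:
            return
        visited.add(node)
        for child in node.child_nodes:
            add_children(child)
        ordering.append(node)    # append the input node at the end

    add_children(rootNode)
    ordering.reverse()           # now every node comes before its children
    return ordering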
In automatic differentiation, methods based on the topological ordering are called tape-based
methods. They are used in some real-world implementations such as Tensorflow. The ordering is
regarded as a tape. We read the nodes one by one from the beginning of the sequence (tape) to
calculate the derivative value.
Based on the obtained ordering, let us now see how to compute each v̇. For example, from
v5(v4, v3) = v4 − v3,
we know
∂v5/∂v4 = 1 and ∂v5/∂v3 = −1.
A general form of our calculation is
v̇ = Σ_{u ∈ v's parents} (∂v/∂u) u̇. (9)
The second term, u̇ = ∂u/∂x1, comes from past calculations due to the topological ordering. We can
calculate the first term because u is one of v's parents and we know the operation at v. For
example, we have v2 = x1 × x2, so
∂v2/∂x1 = x2 and ∂v2/∂x2 = x1.
These values can be immediately computed and stored when we construct the computational graph.
Therefore, we add one member "gradient w.r.t. parents" to our Node class. In addition, we need
a member "partial derivative" to store the accumulated sum in the calculation of (9); its initial value
is zero when a node is created, and in the end it stores the v̇i value. Details of the derivative
evaluation are in Listing 4. The complete list of members of our Node class is in the following table.
member data type example for Node v2
numerical value float 10
parent nodes List[Node] [x1 , x2 ]
child nodes List[Node] [v4 ]
operator string "mul"
gradient w.r.t. parents List[float] [5, 2]
partial derivative float 5
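Accordingly, the Node class sketched in Section 3 can be extended as follows (again a sketch only; the zero initialization of "partial derivative" follows the text above).

class Node:
    def __init__(self, value, parent_nodes=None, operator=None):
        self.value = value
        self.parent_nodes = parent_nodes or []
        self.child_nodes = []
        self.operator = operator
        self.grad_wrt_parents = []       # d(this node)/d(each parent), filled by the wrapping functions
        self.partial_derivative = 0.0    # accumulates the sum in (9); zero when a node is created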
We update the mul function accordingly.
Listing 3: The wrapping function “mul”. The change from Listing 1 is in red color.
def mul(node1, node2):
    value = node1.value * node2.value
    parent_nodes = [node1, node2]
    newNode = Node(value, parent_nodes, "mul")
    newNode.grad_wrt_parents = [node2.value, node1.value]  # the line shown in red
    node1.child_nodes.append(newNode)
    node2.child_nodes.append(newNode)
    return newNode
As shown above, in constructing a new child node we must compute
∂newNode/∂parentNode
for each parent node. Here are some examples other than the mul function. For the add function,
the derivatives with respect to both parents are simply 1. For the log function, the derivative with
respect to its single parent node is 1/node.value, so the red line is replaced by
    newNode.grad_wrt_parents = [1 / node.value]
Now, we know how to get each term in (9), i.e., the chain rule for calculating v̇. Therefore, if we
follow the topological ordering, all v̇i (i.e., partial derivatives with respect to x1 ) can be calculated.
An implementation to compute the partial derivatives is as follows. Here we store the resulting
value in the member partial derivative of each node.
Listing 4: Evaluating derivatives
def forward(rootNode):
    rootNode.partial_derivative = 1
    ordering = topological_order(rootNode)
    for node in ordering[1:]:
        partial_derivative = 0
        for i in range(len(node.parent_nodes)):
            dnode_dparent = node.grad_wrt_parents[i]
            dparent_droot = node.parent_nodes[i].partial_derivative
            partial_derivative += dnode_dparent * dparent_droot
        node.partial_derivative = partial_derivative
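As a hypothetical end-to-end example, assuming the Node class above and all wrapping functions updated in the style of Listing 3 (so that each of them also fills grad_wrt_parents), the whole procedure can be exercised as follows.

x1 = Node(2)
x2 = Node(5)
y = sub(add(log(x1), mul(x1, x2)), sin(x2))   # builds the graph; y.value is about 11.652
forward(x1)                                   # forward-mode sweep with respect to x1
print(y.partial_derivative)                   # dy/dx1 = 1/x1 + x2 = 5.5

Note that nodes not reachable from x1 (here v3 and the input x2) keep their initial partial derivative of zero, which is exactly what the sum in (9) requires.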
4.3 Summary
The procedure for the forward mode includes three steps:
1. Construct the computational graph while evaluating the function value.
2. Find a topological ordering of the nodes reachable from x1.
3. Compute the partial derivatives with respect to x1 along the topological ordering.
We discuss not only how to run each step but also what information we should store. This is a
minimal implementation to demonstrate forward-mode automatic differentiation.
Acknowledgements
This work was supported by National Science and Technology Council of Taiwan grant 110-2221-
E-002-115-MY3.
References
M. Abadi, P. Barham, J. Chen, Z. Chen, A. Davis, J. Dean, M. Devin, S. Ghemawat, G. Irving,
M. Isard, M. Kudlur, J. Levenberg, R. Monga, S. Moore, D. G. Murray, B. Steiner, P. Tucker,
V. Vasudevan, P. Warden, M. Wicke, Y. Yu, and X. Zheng. Tensorflow: A system for large-scale
machine learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design
and Implementation (OSDI), pages 265–283, 2016.
J. Kleinberg and E. Tardos. Algorithm Design. Addison-Wesley Longman Publishing Co., Inc.,
2005. ISBN 0321295358.