
Neural Networks and Their Derivatives

Yongyi Mao, University of Ottawa


September 14, 2023

This note was first written in June 2016, when I was entertaining myself with some clean algebraic views of the gradient, the Hessian, and the back propagation algorithm for neural networks. The original note also fiddled with some connections to certain topics in the graphical model literature.
As I will be teaching a course on deep learning in fall 2018, I think this note may be useful for some students. Thus I removed some parts of the note and kept only the parts relevant to gradient and back propagation.
This note isn't meant to be a publication. Even though it may contain certain perspectives that might not have appeared in the literature (or maybe they have, but in some disguise), there is not any "new result" to claim. As such, this note is merely a service of mine, offered to my students and to whoever is interested.
But the note isn't written in a sufficiently friendly manner; I do not have time to revise the original writing or make the exposition more rigorous and pedagogical. Neither do I bother to give references as one usually does in a research paper; not claiming any credit, I won't infringe any. But indeed there are a number of places in the note that I do not find satisfactory. I maintain a list of such items in the appendix. Hopefully one day, on a sunny patio of Tim Hortons, I will come back to the list and revise this note.
— August 22, 2018

I have not done it.


— September 14, 2023

1 Neural Networks
1.1 Definitions and graphical notations
We will represent a neural network as a directed acyclic graph (DAG), where the set of all edges will be
denoted by E and the set of vertices will be denoted by V. Each edge e ∈ E is associated with a variable
xe taking values in an arbitrary vector space Ue over the set R of real numbers, and the set of all variables
{xe : e ∈ E} will be denoted by X . Each vertex v ∈ V is associated with a function fv and the set of all
functions {fv : v ∈ V} will be denoted by F. Since there always exists a topological ordering of the vertices
in any DAG, we will regard V as {1, 2, . . . , |V|}, where if there is a directed path from vertex u to vertex v, then u < v; in this case, we say that vertex u (or function fu) is topologically earlier than vertex v (or function fv, resp.).
The set E of edges may be partitioned into three sets, namely, input edges Ein , output edges Eout and
internal edges Einternal , and their associated variable sets are denoted respectively as Xin , Xout and Xinternal .
In addition, the set of incoming edges to vertex v will be denoted by IN(v), and the set of outgoing edges
from vertex v will be denoted by OUT(v). The corresponding input variable set and output variable set
of function fv will be denoted by XIN(v) and XOUT(v) respectively. The neural network G := (V, E, X , F)
then represents a function F with input variables Xin and output variables Xout (namely, in the form of
Xout = F (Xin )) via a series of function compositions in the following order:

Figure 1: An example of a neural network. [Figure omitted: a DAG with function vertices f1, . . . , f5 and edge variables x1, . . . , x10.]

XOUT(1) = f1 (XIN(1) )
XOUT(2) = f2 (XIN(2) )
...
XOUT(|V|) = f|V| (XIN(|V|) )

Example 1 Figure 1 gives an example of a toy neural network.
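The composition order above can be sketched in a few lines of Python. This is a sketch of mine, not part of the note; the two-vertex network and the edge names are hypothetical, but the evaluation scheme is exactly the topological sweep just described.

```python
# Evaluating a DAG of function vertices in topological order (toy sketch).
# vertices: list of (in_edges, out_edges, fn) triples, already sorted
# topologically; inputs: dict mapping input edge names to values.

def forward(vertices, inputs):
    x = dict(inputs)  # edge name -> computed value
    for in_edges, out_edges, fn in vertices:
        outs = fn(*(x[e] for e in in_edges))
        if len(out_edges) == 1:      # wrap single outputs uniformly
            outs = (outs,)
        for e, v in zip(out_edges, outs):
            x[e] = v
    return x

# A tiny two-vertex network: x3 = x1 + x2, then x4 = 2 * x3.
net = [
    (("x1", "x2"), ("x3",), lambda a, b: a + b),
    (("x3",), ("x4",), lambda a: 2 * a),
]
values = forward(net, {"x1": 1.0, "x2": 2.0})
print(values["x4"])  # → 6.0
```

Because the vertices are visited in topological order, every input value is available by the time its vertex fires, which is the only property the definition requires.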

We now describe some elementary functions which appear particularly useful in neural networks.
• (equality function) I= : U → U K which maps any u ∈ U to (u, u, . . . , u) ∈ U K where U is a Euclidean
space over R and K is a positive integer. That is, an equality function takes one input value and
outputs K copies of it.
• (summation function) I+ : U^K → U which maps any (u1, u2, . . . , uK) ∈ U^K to ∑_{k=1}^{K} uk. That is, a summation function takes K input variables from the same space and outputs their sum.
• (product function) I(·; w) : U → V for some Euclidean spaces U and V of arbitrary dimensions, where w is a linear map from U to V mapping u ∈ U to w · u ∈ V. For example, the product operation can be any of the following.
– vector-vector inner product
– matrix-vector product
– scalar-vector product
– scalar-scalar product
We note that for an arbitrary linear map A from a Euclidean space U to a Euclidean space V, its adjoint map A∗ is the unique linear map from V to U satisfying

⟨A(u), v⟩ = ⟨u, A∗(v)⟩

for every u ∈ U and v ∈ V . For example:

– if w is a matrix, x is a vector, and I(x) = wx (matrix-vector product), then

I∗(y) = w^T y

where w^T is the transpose of w.


– if w is a vector having the same dimension as x, and I(x) = ⟨w, x⟩, then

I∗(y) = yw

2
– if w is a vector, x is a scalar, and I(x) = xw, then

I∗(y) = ⟨w, y⟩

– if w is a scalar, x is a vector, and I (x) = wx, then

I ∗ (y) = wy

Finally, we note that a product function is obviously a one-input one-output function. Later, we will consider treating the parameter w as a variable of the function, which will make the function bivariate.

• (stacking function) STACK : R^{K1} × R^{K2} × . . . × R^{Km} → R^{K1+K2+...+Km} which maps m vectors u1, u2, . . . , um to the vector (u1^T, u2^T, . . . , um^T)^T, namely, the output vector obtained by stacking all input vectors. It is worth noting that the stacking function can be expressed as a composition of m product functions and a summation function.
• (splitting function) SPLIT : R^{K1+K2+...+Km} → R^{K1} × R^{K2} × . . . × R^{Km} which maps a vector (u1^T, u2^T, . . . , um^T)^T to m vectors u1, u2, . . . , um; namely, the function splits its input vector into m vectors. It is worth noting that the splitting function can be expressed as a composition of an equality function with m product functions.

• (activation function) g : R → R, which can be an arbitrary function. Although in practice such functions are non-linear and non-decreasing, we do not make such restrictions here. An activation function is a one-input one-output function. We will also extend the definition of activation functions, allowing them to take a vector input. In this case the function acts on the input vector component-wise and outputs a vector of the same length.

The graphical notations for these functions are given in Figure 2.
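These elementary functions are simple enough to sketch in code. The following snippet is a minimal sketch of mine (the function names are my own, and the product function is shown only in its matrix-vector form); it also checks the adjoint identity ⟨wu, v⟩ = ⟨u, w^T v⟩ stated above.

```python
import numpy as np

def equality(u, K):      # I_= : one input, K identical copies
    return [u] * K

def summation(us):       # I_+ : K inputs from the same space, one sum
    return sum(us)

def product(w, u):       # I(.; w): matrix-vector case of the product function
    return w @ u

def stack(us):           # STACK: concatenate m vectors into one
    return np.concatenate(us)

def split(u, sizes):     # SPLIT: cut one vector into m pieces of given sizes
    return np.split(u, np.cumsum(sizes)[:-1])

# Adjoint of the matrix-vector product: <w u, v> = <u, w^T v>.
rng = np.random.default_rng(0)
w = rng.standard_normal((3, 2))
u = rng.standard_normal(2)
v = rng.standard_normal(3)
assert np.isclose((w @ u) @ v, u @ (w.T @ v))
```

Note that `split` followed by `stack` recovers the original vector, consistent with each being built from the other's elementary pieces.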


Using these notations, the structure of a standard neuron in a typical neural network is shown in Figure 3.¹
We wish to note that equality maps are particularly useful when a variable is shared between more than one function.

Example 2 In Figure 4, each input variable is shared by two neurons; this is done by introducing an equality function for each input variable, which creates two copies, one for each neuron.

1.2 Linear network and its dual


A neural network is said to be linear if its global function F is a linear map from the input to the output.
That is, for a linear network, the global function can always be written as

STACK(Xout ) = WF STACK(Xin )

for some matrix WF .


From here on, we will drop the STACK notation for simplicity.
1 With this example, we would also like to remark on a simple graphical technique, the "open the box" / "close the box" technique. Close the box: group, or "box", a set of vertices and treat all functions in the group as a single function; the input edges and output edges of the box then become the input variables and output variables of the new function. Open the box: take a vertex, express it as a series of function compositions, and replace the vertex by the graph representing these function compositions. It is not yet clear whether such a technique may lead to great advantage in analysis/computation. It is worth noting that such a technique on normal factor graphs has led to the fundamental discovery of holographic transformations (Al-Bashabsheh and Mao 2011). It is of great interest to investigate if something similar to holographic transformations can be developed in this context.

Figure 2: Graphical notations for neural networks. [Figure omitted: symbols for the equality, summation, product, stacking, splitting, and activation functions.]

Most neural network models are not linear. However, as we will develop in this note, linear networks are an important object for gradient-based optimization.
It is easy to see that a linear network can always be expressed only in terms of equality functions, summation functions and product functions, noting that splitting functions and stacking functions can always be expressed as such functions.

Figure 3: A typical neuron in standard neural networks (where each variable is a scalar).

Figure 4: An example of a neural network with one hidden layer.

On a linear network G, we can define a dualization procedure, which essentially changes every function node to its adjoint map (noting that all elementary functions can also be expressed using matrices, and therefore their adjoint maps are all well defined). More precisely, the recipe for dualization is the following.
1. Reverse the direction of every edge.
2. Change every product function I to its adjoint I∗.
3. Change every summation function to an equality function, and every equality function to a summation function.
4. Change every stacking function to a splitting function, and every splitting function to a stacking function.
The resulting network, denoted by G^T, is then called the dual network of the original linear network.
Lemma 1 If linear network G represents global function F, then its dual G^T represents the adjoint F∗ of F. That is, when function F is written as

Xout = WF Xin,

for some matrix WF, then function F∗ can be written as

Xin = WF^T Xout.
Proof: First note:
• The equality function and the summation function are adjoints of each other.
• STACK and SPLIT are adjoints of each other.
The lemma follows from recursive application of the following result:² for any two linear maps A : U → V and B : V → W, (B ◦ A)∗ = A∗ ◦ B∗. □
2 This is because for any u ∈ U and w ∈ W,
⟨(B ◦ A)(u), w⟩ = ⟨B(A(u)), w⟩ = ⟨A(u), B∗(w)⟩ = ⟨u, A∗(B∗(w))⟩ = ⟨u, (A∗ ◦ B∗)(w)⟩.

We will also write F ∗ as F T .
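The footnoted identity (B ◦ A)∗ = A∗ ◦ B∗ is easy to check numerically once the maps are written as matrices; in that form it is just (BA)^T = A^T B^T. A quick sketch (random matrices stand in for A and B):

```python
import numpy as np

# (B . A)* = A* . B* for linear maps, checked via the inner-product
# definition of the adjoint: <(B.A)(u), w> = <u, (A*.B*)(w)>.
rng = np.random.default_rng(1)
A = rng.standard_normal((4, 3))   # A : R^3 -> R^4
B = rng.standard_normal((2, 4))   # B : R^4 -> R^2
u = rng.standard_normal(3)
w = rng.standard_normal(2)

lhs = (B @ A @ u) @ w             # <(B.A)(u), w>
rhs = u @ (A.T @ (B.T @ w))       # <u, (A*.B*)(w)>
assert np.isclose(lhs, rhs)
```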

Theorem 1 Suppose that F is the function represented by a linear network G with a single output edge and that the output variable xout of F takes scalar value. That is,

F(Xin) = ∑_{e∈Ein} ae xe,

where for each input edge e ∈ Ein, ae is a real number. Then

ae = Fe^T(1),

namely, ae is the e-component of the output of the dual network G^T when it takes value 1 as its input.

This theorem³ plays a central role in connecting the two algorithms that we will present for computing the gradient for a neural network.
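For a concrete instance of the theorem (a toy example of mine, not from the note), take F(x) = ⟨a, x⟩: a scalar-output linear network. Its adjoint maps a scalar y back to y·a, so feeding 1 into the dual recovers every coefficient ae at once:

```python
import numpy as np

# Theorem 1 on the simplest scalar-output linear network F(x) = <a, x>.
a = np.array([2.0, -1.0, 0.5])     # the coefficients a_e, one per input edge
F = lambda x: a @ x                # the network's global function
F_T = lambda y: y * a              # its adjoint (the dual network)

coeffs = F_T(1.0)                  # one dual pass with input 1
assert np.allclose(coeffs, a)      # a_e = F^T_e(1) for every input edge e
```

One evaluation of the dual network yields all coefficients simultaneously, which is the efficiency back propagation will exploit later.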

1.3 Parameters as Variables


A main purpose of this note is to study the gradient of the function encoded by a neural network with respect to its parameters. In this case, we will consider the parameters of the involved functions as variables. In particular, a linear function I(x; w) may be understood as a two-variable function, one variable being x and the other being w. In this view, the graphical notation of I is shown in Figure 5.

Figure 5: Notation of product function when parameter w is treated as a variable.

In a neural network, parameters are often shared across functions. This is particularly the case in deep
networks. When parameters are considered as variables, it is important to use the equality function to create
the copies of shared parameters. This is shown in the next example.

Example 3 Figure 6 (a) is a toy CNN with one convolution layer. Assuming that each input xi is a scalar, w is a length-3 vector, shared by two product functions. If we are going to consider w as an input variable to the network, variable w needs to be replicated using an equality function, which sends one of the two copies to each of the product functions. Note that in the figure, we express parameter w as a single vector-valued variable. This is only to reduce graph complexity; as a consequence, we need to add stacking functions to stack some copies of the inputs into a vector. It is possible to treat w as three separate variables, and in that case the stacking functions won't be necessary.

For the rest of this note, parameters are always treated as variables.

2 Derivative Networks
Each function fv in the network can be regarded as a set {fv,e : e ∈ OUT(v)} of functions. This is made
clear in the following example.
3 There should be a more general form of the theorem for multiple-output functions. But I have not thought it over.

Figure 6: A toy network with one convolution layer. (a) Parameter w not as a variable. (b) Parameter w as a variable. [Figure omitted: inputs x1, . . . , x4, convolution outputs z1, z2, and shared parameter w.]

Example 4 Suppose that vertex v has two input edges {a, b} and three output edges {1, 2, 3}. That is, the function fv is in the form (x1, x2, x3) = fv(xa, xb). Then the function fv can be expressed as the three functions below.

x1 = fv,1(xa, xb)
x2 = fv,2(xa, xb)
x3 = fv,3(xa, xb)

For any function fv, we define its Jacobian function Jv as the function with input variable set ∆IN(v) and output variable set ∆OUT(v) such that for each e ∈ OUT(v),

∆e = Jv,e(∆IN(v)) = ∑_{e′∈IN(v)} (∂fv,e/∂xe′)(XIN(v)) ∆e′

Note that on each edge e, ∆e takes values from the same space as xe, and the notion of product is the one consistent with the definition of the paired partial derivative. It is worth noting that the Jacobian function Jv depends on the input configuration XIN(v); however, for now, we will suppress this dependency in our notation for Jv.

Example 5 For the function fv in Example 4, suppose that each variable takes scalar value. Then at any input configuration (xa, xb), the Jacobian function of fv can be expressed as the following three functions:

∆1 = Jv,1(∆a, ∆b) = (∂fv,1/∂xa)(xa, xb) ∆a + (∂fv,1/∂xb)(xa, xb) ∆b
∆2 = Jv,2(∆a, ∆b) = (∂fv,2/∂xa)(xa, xb) ∆a + (∂fv,2/∂xb)(xa, xb) ∆b
∆3 = Jv,3(∆a, ∆b) = (∂fv,3/∂xa)(xa, xb) ∆a + (∂fv,3/∂xb)(xa, xb) ∆b
Grouping the three functions as a single multi-output function, the Jacobian function Jv is essentially the
function that maps (∆a , ∆b ) to (∆1 , ∆2 , ∆3 ) via the above three functions Jv,1 , Jv,2 , and Jv,3 .

It is perhaps necessary to spell out the meaning of the Jacobian function Jv,e. Essentially, the Jacobian function answers the following question: for a given input configuration, if each input variable xe′ is increased by a tiny ∆e′, by how much will each output xe of the function fv be increased, due to its gradient at that input configuration? The Jacobian function expresses the answer to this question, ∆e, as a linear combination of the increments ∆e′ on the inputs. In other words, using the Jacobian function, the value of function fv at configuration (xa + ∆a, xb + ∆b) is approximated as

fv(xa + ∆a, xb + ∆b) ≈ fv(xa, xb) + Jv(∆a, ∆b).

That is, the Jacobian function Jv is the first-order term of the Taylor expansion of function fv.
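This first-order Taylor view can be checked numerically on a made-up bivariate function; the note's fv is abstract, so f(xa, xb) = xa·xb below is my own choice, with its Jacobian function written out by hand.

```python
# First-order Taylor check: f(x + dx) ~ f(x) + J(dx), with O(|dx|^2) error.
f = lambda xa, xb: xa * xb
J = lambda xa, xb, da, db: xb * da + xa * db   # Jacobian function of f

xa, xb = 1.5, -2.0
da, db = 1e-6, 2e-6
exact = f(xa + da, xb + db)
approx = f(xa, xb) + J(xa, xb, da, db)
assert abs(exact - approx) < 1e-10             # leftover term is da*db
```

The residual is exactly the second-order term da·db, which is why shrinking the increments by a factor of 10 shrinks the error by a factor of 100.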
It is also worth noting that the notion of Jacobian function is mathematically equivalent to the Jacobian matrix for a multi-input multi-output function. To see this, consider the above example and let J̃v denote the Jacobian matrix of function fv. Then it is easy to see that

(∆1, ∆2, ∆3)^T = J̃v (∆a, ∆b)^T

However, we point out here that the notion of Jacobian function is slightly more flexible than Jacobian
matrix. This is because when the variables of fv take values from spaces of different dimensions, the Jacobian
function will not correspond to a matrix representation.
Graphically, for any given function fv , it is possible to draw a neural network representing its Jacobian
function Jv . An example is shown in Figure 7. Converting a function in a network to its Jacobian function
will be called Jacobian conversion. With respect to Jacobian conversion, the following lemma is elementary.

Lemma 2 The Jacobian conversions of some functions are given in Figure 8.

Regarding the lemma, we make the following remarks.

• The equality function, summation function, stacking function and splitting function are all invariant under Jacobian conversion.

Figure 7: The Jacobian function of function fv in Example 4. [Figure omitted: a network mapping (∆a, ∆b) to (∆1, ∆2, ∆3) through product functions with parameters ∂fv,e/∂xa(xa, xb) and ∂fv,e/∂xb(xa, xb).]

• The product function is also invariant under Jacobian conversion if the function parameter is not treated as an input variable of the function. However, if the parameter is treated as a variable, then the Jacobian function changes its form to

J(∆x, ∆w) = w ∆x + ∆w x,

where the notion of product takes its definition from the original function. In general, before determining the Jacobian function of a function f, we must be clear about what is considered as the function's variables.
• Under Jacobian conversion, an activation function is converted to its derivative, which can be viewed
as another activation function. Note that this result holds irrespective of the dimension of the input
variable.
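The product-rule form J(∆x, ∆w) = w ∆x + ∆w x from the remarks above can be verified against a finite-difference estimate. The sketch below (my own, for the matrix-vector case of the product function) compares the exact change of wx with the Jacobian-function prediction:

```python
import numpy as np

# J(dx, dw) = w dx + dw x for I(x; w) = w x, matrix-vector case.
rng = np.random.default_rng(2)
w = rng.standard_normal((3, 2))
x = rng.standard_normal(2)
dw = 1e-6 * rng.standard_normal((3, 2))   # tiny parameter increment
dx = 1e-6 * rng.standard_normal(2)        # tiny input increment

exact_change = (w + dw) @ (x + dx) - w @ x
predicted = w @ dx + dw @ x               # the Jacobian function's output
assert np.allclose(exact_change, predicted, atol=1e-10)
```

The discrepancy is exactly the cross term dw @ dx, second order in the increments, so it vanishes relative to the first-order prediction.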
On a neural network G = (V, E, X , F) we may convert every function vertex fv locally to its Jacobian
function Jv . The resulting network will be called the derivative network of G and we will denote it by
∂G/∂Xin , or ∂G for simplicity.
Theorem 2 If neural network G = (V, E, X , F) represents function F , then its derivative network ∂G rep-
resents the Jacobian function of F .
The following lemma follows from the definition of Jacobian conversion.
Lemma 3 Every derivative network is linear.
Suppose that a network represents a function F with input variables Xin and output variables Xout. Let J denote the Jacobian function of F. It is also possible to define the notion of a "partial Jacobian", namely, the Jacobian function with respect to a subset of the input variables in Xin. More precisely, let the input edges be partitioned into two sets A and A′. Then the partial Jacobian function of F with respect to variables XA is simply the Jacobian function J evaluated at ∆i = 0 for each i ∈ A′. That is, denoting the partial Jacobian function by J^A,

J^A(∆A) = J(∆A, 0A′).

We won't be careful in distinguishing a Jacobian function J from a partial Jacobian function J^A, and will speak of J^A also as a Jacobian function. The superscript A serves to indicate with respect to which variables the Jacobian function is meant.

Figure 8: The Jacobian conversion of special functions. [Figure omitted: the product function with fixed parameter w converts to itself; with w as a variable it converts to J(∆x, ∆w) = w ∆x + ∆w x; an activation function g converts to a product function with parameter g′(x).]

3 Gradient Computation
A standard training method for neural networks is stochastic gradient descent, which always involves computing the gradient of a function represented by a neural network. We now discuss the computation of this gradient.
To begin, we need to note that the neural network we consider is not one that models the relationship between the observed input (say, features) and the predicted output (say, a class label), but one representing the (optimization) objective function for such a model. More precisely, consider the network in Figure 6 (b) as the neural network model with observed input variables x1, x2, x3, x4 and output variable y, serving as the prediction of an observed target output t. We then create another function E with input variables⁴ y
4 Another way to formulate the network is to regard the network as always modeling some energy function, of which every observed variable is an input. In this formulation, there will be no need to distinguish t from the x-variables. This formulation generalizes to settings of unsupervised learning.

and t. Although this need not be the case, a typical choice of such a function E is

E(y, t) := (1/2)(y − t)²   (1)
The function E, also known as the energy function, is then the objective function for optimization.
We may build this function also into the neural network, as shown in Figure 9 (a), and such a network, representing an optimization objective function, is what we now consider. More specifically, we wish to minimize the function with respect to its parameters. With any gradient-based optimization method, it is critical to compute the gradient of the function (with respect to the parameters), which we now discuss. In particular, let P denote the set of input edges that correspond to parameters; our goal is to compute ∂E/∂XP at an arbitrary input configuration Xin.

3.1 Chain Rules


We begin by introducing some simple graph-theoretic notions, where the graph we consider is the neural network representing the energy function.
• A path from an edge a to an edge b is a sequence of edges (e0, e1, e2, . . . , em) where e0 = a, em = b, and for every i = 0, 1, . . . , m − 1, end(ei) = start(ei+1). If there exists a path from edge a to edge b, then a is said to be an ancestor of b, and b is said to be a descendant of a. Similarly, one can define the notion of a path from a vertex to an edge, or from an edge to a vertex; similar notions of ancestor and descendant may also be defined.
• An arbitrary set A of edges is said to be non-comparable if for any a, b ∈ A, neither is a a descendant of b, nor is b a descendant of a.
• Let A and B be two sets of edges where A ∩ B = ∅ and for any a ∈ A, there exists at least one b ∈ B
such that there is a path from a to b. In this case, we say that A is an ancestor set of B.
• For any two sets A and B of edges where A and B are both non-comparable, A ∩ B = ∅, and A is an ancestor set of B, let P(A : B) denote the set of all edges that are contained in at least one path from an edge a ∈ A to an edge b ∈ B. We then denote by G[A : B] the subgraph of the original graph G that contains all edges in P(A : B) and all edges that are an ancestor of at least one edge in P(A : B) \ A. We will denote the set of all input edges to G[A : B] by Ã; then obviously A ⊆ Ã. If the original network represents global function F, then we denote the function represented by network G[A : B] by F[A : B]. Clearly F[A : B] has input variables XÃ and output variables XB. We will be interested in the Jacobian function of F[A : B] with respect to variables XA, which will be denoted by JB^A.
The reason we introduce the sub-network G[A : B], the function F[A : B], and the Jacobian function JB^A is to allow taking the derivative of an arbitrarily selected set XB of variables with respect to any of its ancestor variable sets XA. The subgraph G[A : B] essentially contains precisely all the function vertices needed to express this derivative.
Theorem 3 Suppose that A and B are two non-comparable sets of edges in a network G, where A is an ancestor set of B. Then

JB^A(∆A) = (∂F[A : B]/∂XA) ∆A.
This is the key theorem in this note, which relates derivatives to Jacobian functions. We have actually seen an example of this theorem in the discussion following Example 5.
Immediately following this theorem, another important theorem can be proved.
Theorem 4 (Chain Rule) Let A, B, S be three non-comparable sets of edges, where A is an ancestor set of B and S blocks all paths from A to B. Then

JB^A(∆A) = JB^S(JS^A(∆A)).   (2)

Note that this theorem is equivalent to the chain rule of derivatives:

∂F[A : B]/∂XA = (∂F[S : B]/∂XS)(∂F[A : S]/∂XA).   (3)

Essentially, Equation (2) states a chain rule for the composition of Jacobian functions, which relates to the chain rule of derivatives precisely via Theorem 3.
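Since a Jacobian function is linear, the chain rule in (2) is, in matrix form, nothing but the associativity behind (3). A small numerical illustration (my own; arbitrary random matrices stand in for JS^A and JB^S):

```python
import numpy as np

# Chain rule for Jacobian functions: J_B^A = J_B^S o J_S^A.
rng = np.random.default_rng(3)
J_S_A = rng.standard_normal((4, 3))    # Jacobian of F[A:S] w.r.t. X_A
J_B_S = rng.standard_normal((2, 4))    # Jacobian of F[S:B] w.r.t. X_S
delta_A = rng.standard_normal(3)

lhs = (J_B_S @ J_S_A) @ delta_A        # eq. (3): multiply matrices first
rhs = J_B_S @ (J_S_A @ delta_A)        # eq. (2): compose the maps on delta_A
assert np.allclose(lhs, rhs)
```

The two evaluation orders give the same answer but different costs; choosing between them is exactly the forward-mode versus reverse-mode distinction developed in the next subsection.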

3.2 Algorithms
We now present two algorithms for computing ∂F/∂XP , each corresponding to one of the chain rules above.
We first introduce some key steps, and then discuss how they may be organized to form the two algorithms.

Forward evaluation: On any network, forward evaluation refers to the computation from input to output on the network. Each function vertex computes its outputs only when its input values have all been computed. The computed value of each variable is stored on the corresponding edge.

Constructing the derivative network: Construct the derivative network ∂G/∂XP that represents the Jacobian function J^P of the objective function F with respect to XP. For example, if Figure 9(a) is the network G that represents F, then Figure 9(b) (solid edges only) is the derivative network. Note that the parameter of each product function in ∂G/∂XP depends on the input value to the corresponding function vertex in the original network G, which is to be calculated during the forward evaluation on G.

Constructing the dual derivative network: Construct the dual (∂G/∂XP)^T of network ∂G/∂XP; e.g., Figure 9(c) (solid edges only) is the dual of Figure 9(b). Again, the parameters of the dual network depend on computations in the forward evaluation on G.

Now we describe two algorithms.

Forward Token Propagation Algorithm

1. Forward evaluation on G using the given input Xin.
2. Determine the parameters for ∂G/∂XP.
3. Set the input on each input edge e to be a token (indeterminate) ∆e. Apply forward evaluation on ∂G/∂XP. The output is a linear combination of all input tokens.
4. Extract the coefficient of each token ∆e, which is equal to ∂F/∂xe evaluated at Xin.

Note that steps 1, 2 and 3 can be performed in parallel. That is, as we perform forward evaluation on G, we can at the same time compute the parameter for each function in ∂G/∂XP and pass the tokens along. So in fact, only one forward round is needed.

Backward Gradient Propagation Algorithm (i.e., the "back propagation" algorithm)

1. Forward evaluation on G using the given input Xin.
2. Determine the parameters for (∂G/∂XP)^T.
3. Set the (only) input variable in (∂G/∂XP)^T to 1. Then apply forward computation on (∂G/∂XP)^T.
4. The output at each e is ∂F/∂xe evaluated at the input configuration Xin.

Theorem 5 The two algorithms both give correct results.

We now briefly explain why the two algorithms work. First, Forward Token Propagation works since the forward evaluation on ∂G/∂XP leads to a linear combination of the input tokens ∆e; the coefficient of each token ∆e is in fact ∂F/∂xe, by the definition of J^P. Second, by Theorem 3, the Jacobian function J^P may also be expressed as

J^P(∆P) = (∂F/∂XP) ∆P.

The dual derivative network (∂G/∂XP)^T then represents the function

(J^P)^T(∆out) = (∂F/∂XP)^T ∆out.

By Theorem 1, the e-component of (J^P)^T(1) equals ∂F/∂xe for each input variable xe in the original network G. Thus Backward Gradient Propagation gives the correct result.
The propagations in the two algorithms are respectively implementations of the two chain rules: Forward Token Propagation implements the chain rule in (2) on the derivative network, and Backward Gradient Propagation implements the chain rule in (3) on the dual derivative network.
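To make the backward pass concrete, here is a minimal end-to-end sketch of mine, loosely shaped like Figure 9 but not reproducing it: a single neuron y = g(⟨w, x⟩) with g = tanh feeding the energy E = (1/2)(y − t)² of eq. (1). The backward sweep applies, in reverse topological order, exactly the adjoints listed in the dualization recipe.

```python
import numpy as np

# Toy network: u = <w, x>, y = tanh(u), E = 0.5*(y - t)^2.
x = np.array([0.5, -1.0, 2.0])
w = np.array([0.3, 0.1, -0.2])
t = 1.0

# Forward evaluation on G (values stored per edge).
u = w @ x
y = np.tanh(u)
E = 0.5 * (y - t) ** 2

# Backward gradient propagation: feed 1 into the dual derivative network.
dE_dy = y - t                  # Jacobian of the energy function, transposed
dE_du = dE_dy * (1 - y ** 2)   # adjoint of g: multiply by g'(u) = 1 - tanh(u)^2
dE_dw = dE_du * x              # adjoint of the product <w, x> w.r.t. w

# Finite-difference check of each component of dE/dw.
eps = 1e-6
for i in range(3):
    w2 = w.copy()
    w2[i] += eps
    E2 = 0.5 * (np.tanh(w2 @ x) - t) ** 2
    assert abs((E2 - E) / eps - dE_dw[i]) < 1e-5
```

One forward pass plus one backward pass yields the full gradient dE/dw, whereas the token-style (forward-mode) alternative would need one sweep per coordinate of w.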

4 Conclusion
This note presents a graphical perspective for understanding the derivative/gradient of functions represented by neural networks. This perspective allows more intuitive and easier access to analyzing/computing the gradient of neural networks, and it can serve as a simple explanation of the back-propagation algorithm.

A Revision Notes
The note has been written without sufficient attention to some details. Below is a list of issues which I may consider in a future revisit of this document.

1. The notion of variable should be replaced by the set of values a variable takes.
2. Instead of defining a function as something like f : U^M → U^K, define a vertex with M input edges and K output edges first, then associate each edge with a domain/space, and then define the function. That is, do not define the function and then express it using a graph vertex, but define a graph vertex and then define the function. This is because f : U^M → U^K could be a function having one input over U^M rather than a function having M inputs.
3. The definition of the Jacobian function might need to be reformulated. In particular, I should define the partial Jacobian function first, then the complete Jacobian as a special case. Also, should I define the Jacobian function for a network or for a function? It seems that making it a definition for a network is cleanest. That is, the Jacobian function is defined via the network, the choice of input variable set A, and the choice of output variable set B (which may or may not coincide with the input and output edges of the network). The downside of this formulation is that it gets into graph theory too early.
4. Perhaps replace the notation JB^A with J^{A→B} to be more intuitive.

Figure 9: A network representing an optimization objective function (a), its derivative network (b), and the dual derivative network (c). [Figure omitted: the toy network of Figure 6 extended with activation functions g, the target t, and the energy function E; in (b) and (c) the activations become products with g′ and the energy function contributes the factor y − t.]
