
Lectures on Machine Learning (Fall 2017)

Hyeong In Choi Seoul National University

Lecture 13: Feedforward Neural Networks


(Draft: version 0.9.1)

Topics to be covered:

• Brief history of neural networks

• XOR problem and neural network with a hidden layer

• Universal approximation theorem

• Multilayer neural network


- Formalism
- Errors
- Backpropagation algorithm
- Backpropagation in matrix form

• Mathematical supplement: automatic differentiation

1 Brief history of neural networks


The neural network has a very interesting history, with a popularity that has ebbed and flowed dramatically. Neural networks began when F. Rosenblatt proposed the so-called Perceptron [4, 5]. Architecturally it is simply a neural network with no hidden layer, although it is trained with the so-called perceptron learning algorithm, which differs from the present-day backpropagation algorithm. It was nonetheless the first model that could learn from data without explicit programming by human experts. This feature created a lot of excitement, and with the perceptron convergence theorem, the Perceptron seemed able to meet many interesting challenges. Mathematically, however, all the Perceptron does is find a linear classification boundary.
This fact drew the attention of Minsky and Papert, who wrote a book pointing out that the Perceptron cannot even distinguish the very simple XOR configuration of four data points on a plane [3]. The book almost instantly killed the many then-flourishing perceptron research projects and marked the close of the first wave of neural networks.
Since then, MIT (and others) led artificial intelligence research along the path of logical analysis and rule-based programming. Researchers contrived many interesting methods and approaches to embody human intelligence in a rule-based programming paradigm. Although this approach had many successes in areas where logical reasoning is key, it failed rather badly at the perception problem, which is exactly what the Perceptron was intended to solve. This lack of common sense became the glaring deficiency of this type of artificial intelligence, and it is the reason why logic-based, rule-driven artificial intelligence gradually withered away.
The second wave of neural networks came when the PDP group led by Rumelhart, McClelland and others discovered the magic of adding a hidden layer to the neural network [6]. In particular, they showed that the XOR problem, which had vexed the Perceptron so much, could be easily overcome. (See Section 2.2 below.) This and many other interesting applications rekindled interest in neural networks. Many other neural network models were proposed and many interesting problems solved. Indeed, it seemed that the secret of human intelligence was about to be revealed. However, it soon dawned on people that complex neural networks with many hidden layers are very hard to train. They frequently overfit, so that a small change in the dataset leads to drastic changes in the neural network itself, earning them the reputation of being "hard to train". So from around the mid-1990s on, PDP-style neural network projects were mostly abandoned, and people turned to other machine learning methods such as Support Vector Machines and kernel methods, ensemble methods like boosting and random forests, and so on.
Such was the state of affairs until, in the mid-2000s, G. Hinton and his colleagues burst onto the scene with fresh ideas about pre-training neural networks. (See for example [2] and our subsequent lectures.) With the drastic increase in computational capability brought by GPU machines, people began to create very complex neural networks with tens of hidden layers and to train them reasonably well, as long as there is enough data to feed to the network.
Currently we are witnessing the third wave of neural networks under the banner of deep learning coupled with big data. Its success is impressive. Almost overnight it replaced speech recognition technology that had a long history of its own; vision research is now mostly based on deep learning; and with deep learning, complicated natural language problems like machine translation have reached a different level of competency. Many commercial applications built with deep learning are now routinely in use.
However, deep learning is not a panacea. Although it is capable of solving many, if not all, complex perception problems, it is not as adept at reasoning tasks. Its black-box nature also makes it harder for humans to accept an outcome that conflicts with their expert knowledge or judgment. Another aspect of deep learning is that it requires a huge amount of data, which contrasts with the way humans learn. These objections and questions will be among the central themes of the next phase of artificial intelligence research, and it will be interesting to see how deep learning meshes with this coming trend.

2 Neural network with hidden layer


2.1 Neural network formalism of logistic regression
In Lecture 3, we showed that logistic regression can be recast in neural
network formalism. It goes as follows: Let x = [x1 , · · · , xd ]T be an input
(vector). Define
z_k = w_k · x + b_k = \sum_{j=1}^{d} w_{kj} x_j + b_k    (1)

and let

z = [z_1, · · · , z_K]^T

be a K-dimensional column vector. Then in matrix notation we have

z = W x + b.

The matrix W and the vector b are to be estimated with data.

Figure 1: Neural network view of logistic regression

The output is the softmax function

softmax(z_1, · · · , z_K) = ( e^{z_1} / \sum_{j=1}^K e^{z_j}, · · · , e^{z_K} / \sum_{j=1}^K e^{z_j} ),

and the k-th element of the output is denoted by

softmax_k(z_1, · · · , z_K) = e^{z_k} / \sum_{j=1}^K e^{z_j}.

In vector notation, we write

softmax(z) = softmax(z_1, · · · , z_K).

This softmax defines the output h = [h_1, · · · , h_K]^T, whose k-th entry is the conditional probability P(y = k | x):

h_k = h_k(x) = P(Y = k | X = x) = softmax_k(z_1, · · · , z_K) = e^{z_k} / \sum_{j=1}^K e^{z_j}.    (2)

The corresponding neural network is shown in Figure 1. The circles in the left box represent the input variables x_1, · · · , x_d, and the ones in the right box the response variables h_1, · · · , h_K. The internal state of the k-th output neuron is the variable z_k, whose value is given by (1). In neural network parlance, z_k is obtained by summing the products w_{kj} x_j over all j and then adding the value b_k, called the bias. Once the values of z_1, · · · , z_K are given, the outputs h_1, · · · , h_K are found by applying the softmax function (2).
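To make the formalism concrete, the following is a minimal NumPy sketch of the forward pass (1) and (2). The dimensions, random weights, and function names are illustrative only, not part of the lecture.

    import numpy as np

    def softmax(z):
        # Subtract the max for numerical stability; the result is unchanged.
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def forward(x, W, b):
        # Pre-activation z = W x + b as in (1), followed by the softmax (2).
        z = W @ x + b
        return softmax(z)

    # Toy dimensions: d = 3 inputs, K = 2 classes; W and b would be learned from data.
    rng = np.random.default_rng(0)
    W = rng.normal(size=(2, 3))
    b = np.zeros(2)
    x = np.array([1.0, -0.5, 2.0])
    h = forward(x, W, b)   # h[k] plays the role of P(Y = k | X = x)
    print(h, h.sum())      # entries are nonnegative and sum to 1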

2.2 XOR problem and neural network with hidden layer
Let D = {(0, 0), (0, 1), (1, 0), (1, 1)} be a dataset consisting of four points in R^2. They belong to one of two classes, as shown in Figure 2. Note that there is no line separating these two classes. Let us see how this problem

Figure 2: XOR Problem

can be solved if we add a hidden layer. Treating x_1 and x_2 as Boolean variables, we can write

XOR(x_1, x_2) = x_1 x̄_2 + x̄_1 x_2.

Thus, if we assign the label x_1 x̄_2 to (x_1, x_2) as in Figure 3, the line x_1 − x_2 − 1/2 = 0 becomes a separating line. Note that if a ≫ 0, σ(at) becomes very close to the step function, as Figure 4 shows. Now let z_1 = a(x_1 − x_2 − 1/2) and h_1 = σ(z_1). Then the value of h_1 becomes (very close to) 1 on the bottom-right side of the line x_1 − x_2 − 1/2 = 0, and 0 on its top-left side. This situation is depicted in Figure 3. In neural network notation, this variable relation is drawn as in Figure 5.

Figure 3: Value of h1 and the separating line

Figure 4: For large a, sigmoid becomes very close to step function

Figure 5: Neural network for h1

Figure 6: Value of h2 and the separating line

Similarly, for the label x̄_1 x_2, its Boolean values and the separating line −x_1 + x_2 − 1/2 = 0 are shown in Figure 6. Now let z_2 = a(−x_1 + x_2 − 1/2) and h_2 = σ(z_2). By the same argument, the value of h_2 becomes (very close to) 0 on the bottom-right side of the line −x_1 + x_2 − 1/2 = 0, and 1 on its top-left side. This situation is depicted in Figure 6. In neural network notation, this variable relation is drawn as in Figure 7. To combine h_1 and h_2, let h_3 = XOR(x_1, x_2) = x_1 x̄_2 + x̄_1 x_2.

Figure 7: Neural network for h2

The values of all the variables are shown in Figure 8.

Figure 8: Value of variables

Note that the range of values (h_1, h_2) can take is restricted: only (0, 0), (0, 1), and (1, 0) occur, and (1, 1) does not show up. This situation is depicted in Figure 9, and thus h_1 + h_2 − 1/2 = 0 is a separating line for these data. Define z_3 = b(h_1 + h_2 − 1/2) for large b ≫ 0, and let h_3 = σ(z_3). The value of h_3 on the plane is also shown in Figure 9. Combining everything done in this subsection yields the neural network in Figure 10. Clearly, this neural network with one hidden layer separates the original XOR data.
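As a sanity check, here is a small NumPy sketch of the hand-built network above, using the weights of z_1, z_2, z_3 with a = b = 10; these specific constants are illustrative, and any sufficiently large values work.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def xor_net(x1, x2, a=10.0, b=10.0):
        # Hidden layer: h1 detects x1 AND (NOT x2), h2 detects (NOT x1) AND x2.
        h1 = sigmoid(a * (x1 - x2 - 0.5))
        h2 = sigmoid(a * (-x1 + x2 - 0.5))
        # Output layer: fires when h1 OR h2 is (close to) 1.
        return sigmoid(b * (h1 + h2 - 0.5))

    for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
        print((x1, x2), round(xor_net(x1, x2), 3))
    # Outputs are close to 0, 1, 1, 0 respectively, matching XOR.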

Figure 9: Value of h3 and the separating line

Figure 10: Final neural network

2.3 Universal approximation theorem


Let us rewrite part of the above neural network construction as in Figure 11. If we take the Boolean OR of the Boolean variables h_1 and h_2, we

Figure 11: Neural network for h1 and h2

have the table in Figure 12. Since the label of the OR operation is 1 except at (0, 0), where the label is 0, it is easy to find a separating line for this OR data. Similarly, we can construct another neural network, as in Figure 13.

Figure 12: Boolean OR of h1 and h2

Figure 13: Neural network for h3 and h4

Its OR value on the plane is depicted in Figure 14. If we combine all

Figure 14: OR value of h3 and h4

four variables h_1, h_2, h_3 and h_4 by the OR operation, its values are shown in the table in Figure 15. Note that all points (h_1, h_2, h_3, h_4) in R^4 have the label 1, except (0, 0, 0, 0), which has label 0. It is very easy to separate these with a hyperplane in R^4, so the neural network in Figure 16 separates all points in the table in Figure 15 by the h_5 label. This neural network is depicted in Figure 16. The value of h_5 as a function of x_1 and x_2 is shown in Figure 17.

Figure 15: Combined OR value table of h1 , h2 , h3 and h4

Figure 16: Neural network for OR of h1 , h2 , h3 and h4


Figure 17: Value of h5

Continuing this way, one can construct an approximate bump function as the output of a neural network with one hidden layer. Furthermore, by combining these bump functions, one can approximate any continuous function. Namely, a neural network with one hidden layer can do any task, at least in principle. This heuristic argument can be made rigorous using an argument in the style of the Stone-Weierstrass theorem to get the following theorem.

Theorem 1 (Cybenko-Hornik-Funahashi Theorem). Let K = [0, 1]^d be the d-dimensional hypercube in R^d. Then sums of the form

f(x) = \sum_i c_i sigmoid( b_i + \sum_{j=1}^d w_{ij} x_j )

can approximate any continuous function on K to any degree of accuracy.

Although this theorem is stated for the hypercube, it is actually valid on any compact set K.
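To illustrate the theorem in one dimension, here is a small NumPy sketch, with arbitrarily chosen constants, that builds an approximate bump on [0.3, 0.7] as a sum of two sigmoids of the form appearing in Theorem 1.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    def bump(x, a=50.0, left=0.3, right=0.7):
        # Difference of two steep sigmoids: roughly 1 on [left, right], 0 outside,
        # i.e. f(x) = c1*sigmoid(b1 + w1*x) + c2*sigmoid(b2 + w2*x) with
        # c1 = 1, c2 = -1, w1 = w2 = a, b1 = -a*left, b2 = -a*right.
        return sigmoid(a * (x - left)) - sigmoid(a * (x - right))

    x = np.linspace(0, 1, 11)
    print(np.round(bump(x), 2))   # near 1 inside [0.3, 0.7], near 0 outside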

3 Multilayer neural network


Logistic regression is an example of the simplest neural network, consisting only of the input and output layers. However, the solution to the XOR problem with a hidden layer presented in the previous section shows the advantage of having hidden layers. In general, a neural network with one or more hidden layers is called a multilayer neural network or multilayer perceptron.

3.1 Formalism of multilayer neural network


Figure 18 depicts the layer structure of a multilayer neural network. (Connection links are not shown.) The left-most group of neurons, called Layer 0, is the input layer. The ℓ-th layer to the right of the input layer is called Layer ℓ, and the right-most layer, Layer L, is the output layer. (So there are L − 1 hidden layers.) The boxes enclosing the neurons in each layer are there only for convenience of presentation; they are sometimes drawn and sometimes not.
Figure 19 shows Layers ℓ − 1 and ℓ and the connection links between them. Every neuron in Layer ℓ − 1 is connected to every neuron in Layer ℓ, but there are no connections between neurons within the same layer. The pre-activation value of (input to) neuron i in Layer ℓ is z_i^ℓ, which is obtained as a weighted sum of the previous layer's outputs:

z_i^ℓ = \sum_j w_{ij}^ℓ h_j^{ℓ−1} + b_i^ℓ,    (3)

Figure 18: MLP Layers

for ℓ = 1, · · · , L. In vector-matrix notation, we have

z^ℓ = W^ℓ h^{ℓ−1} + b^ℓ.    (4)

(Here, the superscript denotes the layer number.)

Figure 19: Connection weights

The activation value of (output of) neuron i in Layer ℓ is h_i^ℓ. The pre-activation and activation values are related by the activation function ϕ(t), so that

h_i^ℓ = ϕ^ℓ(z_i^ℓ),    (5)

for ℓ = 1, · · · , L − 1. Some of the most popular activation functions are

σ(t) = 1 / (1 + e^{−t}),    (sigmoid function)
ReLU(t) = max(0, t),    (rectified linear unit)
tanh(t) = (e^t − e^{−t}) / (e^t + e^{−t}).    (hyperbolic tangent function)

For ℓ = 0, h_j^0 is set to be the input x_j, and we do not use z_j^0. For the output layer, i.e. ℓ = L, the form of the output h_i^L depends on the problem. For regression, the output is as usual

h_i^L = ϕ^L(z_i^L).

For classification, however, the output is the softmax function of the pre-activation values z_i^L. Thus

h_k^L = softmax_k(z_1^L, · · · , z_K^L) = e^{z_k^L} / \sum_{j=1}^K e^{z_j^L}.

In this case, the output has the usual probabilistic interpretation

h_k^L = h_k^L(x) = P(Y = k | X = x).
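The following is a minimal NumPy sketch of the forward pass (3)-(5) for a classification network, assuming ReLU activations in the hidden layers and softmax at the output; the layer sizes and weight initialization are illustrative only.

    import numpy as np

    def relu(t):
        return np.maximum(0.0, t)

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def forward(x, weights, biases):
        # weights[l], biases[l] play the role of W^{l+1}, b^{l+1} in equation (4).
        h = x                               # Layer 0: h^0 = x
        for l, (W, b) in enumerate(zip(weights, biases)):
            z = W @ h + b                   # pre-activation, equation (4)
            if l < len(weights) - 1:
                h = relu(z)                 # hidden layers: activation function (5)
            else:
                h = softmax(z)              # output layer: softmax for classification
        return h

    # Toy network: 3 inputs -> 4 hidden units -> 2 classes.
    rng = np.random.default_rng(0)
    sizes = [3, 4, 2]
    weights = [rng.normal(scale=0.5, size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
    biases = [np.zeros(m) for m in sizes[1:]]
    print(forward(np.array([1.0, -2.0, 0.5]), weights, biases))  # probabilities summing to 1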

3.2 Errors
The error for classification problems is essentially the same as that of logistic regression. As before, the formalism goes as follows. For a generic input-output pair (x, y), where y ∈ {1, · · · , K}, use the one-hot encoding to define

y_k = I(y = k).

So y can be identified with (y_1, · · · , y_K), in which y_k ∈ {0, 1} and y_1 + · · · + y_K = 1. Then the cross entropy error is

E = − \sum_k y_k log h_k^L(x) = − \sum_k y_k log P(Y = k | X = x).

Let D = {(x^(i), y^(i))}_{i=1}^N be a given dataset. For each data point (x^(i), y^(i)), its cross entropy error is

E = − \sum_k y_k^(i) log h_k^L(x^(i)) = − \sum_k y_k^(i) log P(Y = k | X = x^(i)),

where y_k^(i) = I(y^(i) = k). For a mini-batch {(x^(i), y^(i))}_{i=1}^B, the cross entropy error is

E = − \sum_{i=1}^B \sum_k y_k^(i) log h_k^L(x^(i)) = − \sum_{i=1}^B \sum_k y_k^(i) log P(Y = k | X = x^(i)).

Training a neural network means finding good parameters w_{ij}^ℓ and b_i^ℓ that minimize the error, which is the topic of the next subsection and of the subsequent lecture on Training Deep Neural Networks.
For the regression problem, one usually uses the L2 error,

E = \sum_{i=1}^B |y^(i) − h^L(x^(i))|^2,

for the mini-batch {(x^(i), y^(i))}_{i=1}^B, for instance. However, other forms of error can also be used: for instance, the L1 error E = \sum_{i=1}^B |y^(i) − h^L(x^(i))|, and, as we have seen in Lecture 12 on Gradient Boost, errors like the Huber error.
Neural networks are prone to overfitting. To alleviate the problem, one usually resorts to regularization, which adds a regularizing term Ω(θ) to the error terms above. (Here, θ is a generic notation for all parameters, such as w_{ij}^ℓ and b_i^ℓ.) Typically, Ω(θ) takes one of the forms

λ|θ|^2 = λ( \sum |w_{ij}^ℓ|^2 + \sum |b_i^ℓ|^2 ),
µ|θ| = µ( \sum |w_{ij}^ℓ| + \sum |b_i^ℓ| ),
λ|θ|^2 + µ|θ| = λ( \sum |w_{ij}^ℓ|^2 + \sum |b_i^ℓ|^2 ) + µ( \sum |w_{ij}^ℓ| + \sum |b_i^ℓ| ).

However, many other forms can be used.
For classification, the typical error term with regularizer becomes

E = − \sum_{i=1}^B \sum_k y_k^(i) log h_k^L(x^(i)) + Ω(θ) = − \sum_{i=1}^B \sum_k y_k^(i) log P(Y = k | X = x^(i)) + Ω(θ),

and for regression it is of the form

E = \sum_{i=1}^B |y^(i) − h^L(x^(i))|^2 + Ω(θ) = \sum_{i=1}^B \sum_{j=1}^d |y_j^(i) − h_j^L(x^(i))|^2 + Ω(θ).
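As an illustration, the following is a small NumPy sketch of the mini-batch cross entropy error with an L2 regularizer Ω(θ) = λ|θ|^2; the function name and the value of λ are placeholders.

    import numpy as np

    def cross_entropy_with_l2(H, Y, weights, biases, lam=1e-3):
        # H: (B, K) array of predicted probabilities h^L(x^(i)).
        # Y: (B,) array of integer class labels y^(i) in {0, ..., K-1}.
        B = H.shape[0]
        data_term = -np.sum(np.log(H[np.arange(B), Y] + 1e-12))  # -sum_i log h_{y^(i)}^L(x^(i))
        reg_term = lam * (sum(np.sum(W**2) for W in weights)
                          + sum(np.sum(b**2) for b in biases))   # Omega(theta) = lambda |theta|^2
        return data_term + reg_term

    # Example with a mini-batch of B = 2 samples and K = 3 classes.
    H = np.array([[0.7, 0.2, 0.1],
                  [0.1, 0.8, 0.1]])
    Y = np.array([0, 1])
    W = [np.ones((3, 4))]
    b = [np.zeros(3)]
    print(cross_entropy_with_l2(H, Y, W, b))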

3.3 Backpropagation algorithm
Training a neural network is, in a nutshell, a gradient descent algorithm, basically of the form

Δw_{ij}^ℓ = −λ ∂E/∂w_{ij}^ℓ
Δb_i^ℓ = −λ ∂E/∂b_i^ℓ,

where

Δw_{ij}^ℓ = w_{ij}^ℓ(new) − w_{ij}^ℓ(old)
Δb_i^ℓ = b_i^ℓ(new) − b_i^ℓ(old).

In fact, deep learning training methods are not this simplistic, but all of them are variations on this basic idea of gradient descent. For more details, the reader is referred to the forthcoming lecture on "Training Deep Neural Networks."
So to do the training, it is necessary to compute ∂E/∂w_{ij}^ℓ and ∂E/∂b_i^ℓ efficiently. At the output layer, Layer L, it is easy to determine ∂E/∂h_i^L, since we know the expression for the error E. By the chain rule, we can then compute

∂E/∂z_j^L = \sum_i (∂h_i^L/∂z_j^L) (∂E/∂h_i^L).    (6)

The variable dependencies are depicted in Figure 20. From this we can easily

Figure 20: Variable dependency

write by the chain rule that, for ℓ = 1, · · · , L − 1,

∂E/∂z_i^ℓ = (dh_i^ℓ/dz_i^ℓ) (∂E/∂h_i^ℓ),

where

dh_i^ℓ/dz_i^ℓ = h_i^ℓ(1 − h_i^ℓ)   if ϕ^ℓ is σ,
dh_i^ℓ/dz_i^ℓ = I(z_i^ℓ ≥ 0)       if ϕ^ℓ is ReLU,
dh_i^ℓ/dz_i^ℓ = sech^2(z_i^ℓ)      if ϕ^ℓ is tanh.
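The following is a small NumPy sketch of these activation derivatives, with a finite-difference check; the function name and the set of supported activations are illustrative.

    import numpy as np

    def activation_derivative(z, kind):
        # dh/dz for the three activations listed above, where h = phi(z).
        if kind == "sigmoid":
            h = 1.0 / (1.0 + np.exp(-z))
            return h * (1.0 - h)            # h(1 - h)
        if kind == "relu":
            return (z >= 0).astype(float)   # I(z >= 0)
        if kind == "tanh":
            return 1.0 / np.cosh(z) ** 2    # sech^2(z)
        raise ValueError(kind)

    # Quick finite-difference check for tanh at a few points.
    z, eps = np.array([-1.0, 0.5, 2.0]), 1e-6
    numeric = (np.tanh(z + eps) - np.tanh(z)) / eps
    print(np.allclose(numeric, activation_derivative(z, "tanh"), atol=1e-4))  # True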

Applying the chain rule again, we get

∂E/∂h_j^{ℓ−1} = \sum_i (∂z_i^ℓ/∂h_j^{ℓ−1}) (∂E/∂z_i^ℓ).

Using (3), we have

∂z_i^ℓ/∂h_j^{ℓ−1} = w_{ij}^ℓ,
∂z_i^ℓ/∂w_{ij}^ℓ = h_j^{ℓ−1}.

Therefore, we get

∂E/∂h_j^{ℓ−1} = \sum_i w_{ij}^ℓ ∂E/∂z_i^ℓ,    (7)

and

∂E/∂w_{ij}^ℓ = (∂z_i^ℓ/∂w_{ij}^ℓ) (∂E/∂z_i^ℓ) = h_j^{ℓ−1} ∂E/∂z_i^ℓ.    (8)
Note that (7) says the error ∂E/∂h_j^{ℓ−1} is the weighted sum, with weights w_{ij}^ℓ, of the errors ∂E/∂z_i^ℓ in the next layer. This means the error propagates backward from Layer ℓ to Layer ℓ − 1, hence the name backpropagation.

Let us now derive similar formulas with respect to the bias b_i^ℓ. Note from (3) that

∂z_i^ℓ/∂b_i^ℓ = 1.

Thus we have

∂E/∂b_i^ℓ = (∂z_i^ℓ/∂b_i^ℓ) (∂E/∂z_i^ℓ) = ∂E/∂z_i^ℓ.    (9)
This is summed up in the following summary.

Summary
• By (3) and (5), the data flows forward, i.e. from Layer ℓ − 1 to Layer ℓ, hence the name feedforward network.

• By (6) and (7), the error derivative, not the actual error, can be computed backward, i.e.

∂E/∂z_i^{ℓ−1}  ←by (6)←  ∂E/∂h_i^{ℓ−1}  ←by (7)←  ∂E/∂z_i^ℓ  ←by (6)←  ∂E/∂h_i^ℓ  ← · · ·,

hence the name backpropagation.

• Equations (8) and (9) are the basic equations used in the gradient descent algorithm for learning; a sketch of the full procedure follows below.
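The following is a compact NumPy sketch of one backpropagation pass for the network of Section 3.1, assuming ReLU hidden layers, a softmax output, and the cross entropy error; the single-sample setting and variable names are illustrative.

    import numpy as np

    def relu(t):
        return np.maximum(0.0, t)

    def softmax(z):
        e = np.exp(z - np.max(z))
        return e / e.sum()

    def backprop(x, y_onehot, weights, biases):
        # Forward pass: store pre-activations z^l and activations h^l, as in (3) and (5).
        hs, zs = [x], []
        for l, (W, b) in enumerate(zip(weights, biases)):
            z = W @ hs[-1] + b
            zs.append(z)
            hs.append(softmax(z) if l == len(weights) - 1 else relu(z))

        # Backward pass.  For softmax plus cross entropy, dE/dz^L = h^L - y.
        grads_W, grads_b = [None] * len(weights), [None] * len(biases)
        dE_dz = hs[-1] - y_onehot
        for l in reversed(range(len(weights))):
            grads_W[l] = np.outer(dE_dz, hs[l])      # equation (8): dE/dw = (dE/dz) h^{l-1}
            grads_b[l] = dE_dz                       # equation (9): dE/db = dE/dz
            if l > 0:
                dE_dh = weights[l].T @ dE_dz         # equation (7): weighted sum of next-layer errors
                dE_dz = dE_dh * (zs[l - 1] >= 0)     # ReLU derivative I(z >= 0)
        return grads_W, grads_b

Feeding the returned gradients into the update Δw = −λ ∂E/∂w above completes one step of gradient descent.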

3.4 Backpropagation in matrix form


Recall the variable dependency shown in Figure 20. We want to write the chain rule in matrix form. Using vector notation, we denote by (∂E/∂z^L) the column vector whose j-th entry is ∂E/∂z_j^L. Similarly, let (∂h^ℓ/∂z^ℓ) be the Jacobian matrix whose (i, j)-th entry is ∂h_i^ℓ/∂z_j^ℓ. Then in matrix form, (6) can be written as

(∂E/∂z^L) = (∂h^L/∂z^L) (∂E/∂h^L).
Note that the chain rule

∂h_i^ℓ/∂h_j^{ℓ−1} = \sum_k (∂z_k^ℓ/∂h_j^{ℓ−1}) (∂h_i^ℓ/∂z_k^ℓ)

can be written in matrix form as

(∂h^ℓ/∂h^{ℓ−1}) = (∂z^ℓ/∂h^{ℓ−1}) (∂h^ℓ/∂z^ℓ).

Thus the vector used in the backpropagation rule is nothing but a matrix product of the following form:

(∂E/∂h^ℓ) = (∂z^{ℓ+1}/∂h^ℓ) (∂h^{ℓ+1}/∂z^{ℓ+1}) · · · (∂h^L/∂z^L) (∂E/∂h^L),    (10)

and

(∂E/∂z^ℓ) = (∂h^ℓ/∂z^ℓ) (∂z^{ℓ+1}/∂h^ℓ) (∂h^{ℓ+1}/∂z^{ℓ+1}) · · · (∂h^L/∂z^L) (∂E/∂h^L).    (11)
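The following is a small NumPy sketch of this matrix-form backpropagation for a two-layer regression network with a sigmoid hidden layer and squared error; all numbers are illustrative. The sketch stores Jacobians with the convention J[i, j] = ∂out_i/∂in_j, so the products of (10) and (11) appear with explicit transposes.

    import numpy as np

    def sigmoid(t):
        return 1.0 / (1.0 + np.exp(-t))

    # Toy network: 2 inputs -> 3 hidden (sigmoid) -> 2 outputs (identity), E = |y - h^2|^2.
    rng = np.random.default_rng(1)
    W1, b1 = rng.normal(size=(3, 2)), np.zeros(3)
    W2, b2 = rng.normal(size=(2, 3)), np.zeros(2)
    x, y = np.array([0.5, -1.0]), np.array([1.0, 0.0])

    # Forward pass.
    z1 = W1 @ x + b1; h1 = sigmoid(z1)
    z2 = W2 @ h1 + b2; h2 = z2                  # identity output for regression

    # Jacobians of each arrow in Figure 20.
    dE_dh2 = 2 * (h2 - y)                       # dE/dh^2
    dh2_dz2 = np.eye(2)                         # identity activation
    dz2_dh1 = W2                                # dz^2/dh^1 = W^2
    dh1_dz1 = np.diag(h1 * (1 - h1))            # sigmoid derivative, diagonal Jacobian

    # Matrix-form backpropagation, cf. (10) and (11).
    dE_dz2 = dh2_dz2.T @ dE_dh2
    dE_dh1 = dz2_dh1.T @ dE_dz2
    dE_dz1 = dh1_dz1.T @ dE_dh1

    # Check one weight gradient against a finite difference.
    eps = 1e-6
    W1p = W1.copy(); W1p[0, 1] += eps
    h2p = W2 @ sigmoid(W1p @ x + b1) + b2
    numeric = (np.sum((h2p - y)**2) - np.sum((h2 - y)**2)) / eps
    analytic = dE_dz1[0] * x[1]                 # equation (8)
    print(numeric, analytic)                    # the two values agree closely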

4 Mathematical supplement: automatic differentiation
The backpropagation algorithm depends heavily on derivative calculations. Mathematically, calculating derivatives is no big deal. However, if a function is not given explicitly, for example if it is defined in terms of a certain procedure, calculating derivatives can become no small matter.
There are three ways to calculate derivatives. The first is symbolic differentiation. This method requires a symbolic mathematics engine and is not applicable if the function is not given explicitly. The second is to use a numerical method like finite differences; however, numerical differentiation entails numerical error that is rather difficult to control in general. The third way, which by now has become the default in machine learning, is the so-called automatic differentiation or algorithmic differentiation.
Automatic differentiation is a numerical way of calculating the exact value of derivatives without any aid of a symbolic mathematics engine. It relies on the dual numbers invented by Clifford in 1873. A dual number is a pair of real numbers ⟨x, x′⟩, usually written as x + x′ε. The arithmetic operations on dual numbers are as follows:

λ(a + a′ε) = λa + λa′ε
(a + a′ε) + (b + b′ε) = (a + b) + (a′ + b′)ε
(a + a′ε) − (b + b′ε) = (a − b) + (a′ − b′)ε
(a + a′ε)(b + b′ε) = ab + (a′b + ab′)ε
1/(a + a′ε) = 1/a − (a′/a^2)ε
Note that if we put a = b = 0 and a′ = b′ = 1 in the fourth line above, we get

ε^2 = 0,

from which it is easy to prove by induction that

(x + x′ε)^n = x^n + n x^{n−1} x′ε.

In fact, for any function f(x), f(x + x′ε) is defined as

f(x + x′ε) = f(x) + f′(x) x′ε,

which is basically the first-order Taylor series expansion. This mysterious ε can be envisioned as something like the smallest floating point number representable in a computer, so that ε^2 = 0. This is also reminiscent of the complex number i satisfying i^2 = −1.
Let us see how dual numbers can be used to compute partial derivatives. Suppose f(x, y) = x y^2 + y + 5. Since ∂f/∂y (x, y) = 2xy + 1, we can directly compute f(2, 3) = 26 and ∂f/∂y (2, 3) = 13.

To see how automatic differentiation does the same, we look at

f(2, 3 + ε) = 2(3 + ε)^2 + (3 + ε) + 5 = 2(9 + 6ε) + (3 + ε) + 5 = 26 + 13ε,

from which f(2, 3) = 26 and ∂f/∂y (2, 3) = 13 can be read off. This procedure is systematized by constructing the computational graph, which is a parse tree for the computation of f(x, y).

Figure 21: AutoDiff in Forward Mode

An example of the computational graph for this computation is given in Figure 21. To compute ∂f/∂y (2, 3), one feeds the values x = 2 and y = 3 + ε into the bottom variables. One then follows the arrows of the computational graph and performs the appropriate operations, whose results are recorded on the corresponding links. From the value at the top node, one can read off f(2, 3) = 26 and ∂f/∂y (2, 3) = 13. Note that to compute ∂f/∂x (2, 3), we must redo a similar computation, f(2 + ε, 3), on the same computational graph all over again by feeding in x = 2 + ε and y = 3. This may be quite burdensome if there are millions of variables to deal with, as is the case in deep learning. This way of directly computing partial derivatives is called forward mode automatic differentiation.
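Here is a minimal Python sketch of forward mode automatic differentiation using a Dual class; only the operations needed for this example are implemented, and the class name and interface are illustrative.

    class Dual:
        """Dual number a + a'eps with eps^2 = 0: .val holds a, .eps holds a'."""
        def __init__(self, val, eps=0.0):
            self.val, self.eps = val, eps

        def __add__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            return Dual(self.val + other.val, self.eps + other.eps)

        __radd__ = __add__

        def __mul__(self, other):
            other = other if isinstance(other, Dual) else Dual(other)
            # (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
            return Dual(self.val * other.val,
                        self.eps * other.val + self.val * other.eps)

        __rmul__ = __mul__

    def f(x, y):
        return x * y * y + y + 5

    # Seed y with eps to get df/dy; x stays an ordinary (eps = 0) dual number.
    out = f(Dual(2.0), Dual(3.0, 1.0))
    print(out.val, out.eps)   # 26.0 13.0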
An alternative method, called reverse mode automatic differentiation, was devised to alleviate this problem. It is basically a repeated application of the chain rule. It goes as follows. First, use the computational graph to compute the values of all nodes in the forward pass, i.e. from the input values at the bottom nodes upward through all nodes. The number
inside each node in Figure 22 is the value of that node. The symbol n_i at the left of each node is the node number, which one can think of as a variable representing the value of the corresponding node. The idea of reverse mode automatic differentiation is to trace the variable dependencies along the paths of the computational graph and apply the appropriate chain rules. Note that f = n_1 and n_1 = n_2 + n_3. Thus ∂n_1/∂n_2 = 1, and this number is recorded on the link from n_1 to n_2. The number on the link from n_1 to n_3 is obtained similarly. Now n_2 = n_4 n_5, from which we get ∂n_2/∂n_5 = n_4 = 9. Therefore we have

∂n_1/∂n_5 = (∂n_1/∂n_2)(∂n_2/∂n_5) = 9,

which is recorded on the link from n_2 to n_5. Note that this is the chain rule along the path from n_1 to n_5. The reason why no other variables enter into this chain rule calculation is easy to see: a change in n_5 affects only n_2, which in turn affects n_1, and no other variables are affected. One can read off this dependency from the graph structure, i.e. there is only one path from n_1 to n_5. Similarly, ∂n_2/∂n_4 = n_5 = 2. Thus

∂n_1/∂n_4 = (∂n_1/∂n_2)(∂n_2/∂n_4) = 2,
which is recorded on the link from n_2 to n_4. Now n_4 = n_6^2. Thus

∂n_4/∂n_6 = 2 n_6 = 6.    (12)

Thus

∂n_1/∂n_6 = (∂n_1/∂n_4)(∂n_4/∂n_6) = 12,

which is recorded on the link from n_4 to n_6. Note that this does not yet give the value of ∂f/∂y (2, 3), even though, in terms of the values of the variables, f = n_1 and y = n_6. The reason is that there is another path from n_1 to n_6, via n_3, and this calculation reflects only the dependency of n_1 on n_6 through the path via n_4; the number 12 recorded on the link from n_4 to n_6 represents only that part of the dependency. The other path of dependency of n_1 on n_6 goes through n_3. Computing similarly, we get the number 1 on the link from n_3 to n_6. Since n_1 depends on n_6 through these two paths, we add the two numbers to get

∂f/∂y (2, 3) = 12 + 1 = 13.

Figure 22: AutoDiff in Reverse Mode

Reverse mode automatic differentiation depends heavily on the derivative calculation at each node, and this individual differentiation is where forward mode automatic differentiation enters. For instance, (12) is obtained by computing (3 + ε)^2 = 9 + 6ε and noting that this gives ∂n_4/∂n_6 = 6.

Reverse mode automatic differentiation is a way of organizing derivative calculations so as to get all partial derivatives at once with a single computational graph. This is a big advantage in cases like deep learning, where there are millions of input variables but only a few output variables. In contrast, one would have to repeat similar computations millions of times if one relied entirely on forward mode automatic differentiation. This is the reason why most deep learning software implements reverse mode automatic differentiation.
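To make the bookkeeping explicit, here is a minimal Python sketch of the reverse pass for the same example, written out by hand without a graph data structure; both partial derivatives come out of one backward sweep. The node numbering mirrors the one used above and is otherwise illustrative.

    # f(x, y) = x*y^2 + y + 5 at (x, y) = (2, 3).
    x, y = 2.0, 3.0

    # Forward pass: compute and store the value of every node.
    n6 = y            # y
    n5 = x            # x
    n4 = n6 * n6      # y^2
    n3 = n6 + 5.0     # y + 5
    n2 = n5 * n4      # x * y^2
    n1 = n2 + n3      # f(x, y) = 26

    # Backward pass: accumulate dn1/dn_k for every node, children before parents.
    dn1 = 1.0
    dn2 = dn1 * 1.0                  # n1 = n2 + n3
    dn3 = dn1 * 1.0
    dn5 = dn2 * n4                   # n2 = n5*n4  ->  dn2/dn5 = n4 = 9
    dn4 = dn2 * n5                   # n2 = n5*n4  ->  dn2/dn4 = n5 = 2
    dn6 = dn4 * 2 * n6 + dn3 * 1.0   # two paths into n6: via n4 (12) and via n3 (1)

    print(n1, dn5, dn6)   # 26.0, then df/dx = 9.0 and df/dy = 13.0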

References
[1] Community Portal for Automatic Differentiation, http://www.autodiff.org/

[2] Goodfellow, I., Bengio, Y., Courville, A., Deep Learning, MIT Press (2016)

[3] Minsky, M. L. and Papert, S. A., Perceptrons: An Introduction to Computational Geometry, MIT Press (1969)

[4] Rosenblatt, F., A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, 386-408 (1958)

[5] Rosenblatt, F., Principles of Neurodynamics: Perceptrons and the Theory of Brain Mechanisms, Spartan Books (1962)

[6] Rumelhart, D., McClelland, J., PDP Research Group, Parallel Distributed Processing, vol 1 & 2, Bradford Book (1987)
