Rosenblatt [4] introduced the so-called perceptron learning algorithm, which is different from the present-day backpropagation algorithm. It was nonetheless the first algorithm that could learn from data without explicit programming by human experts. This feature created a lot of excitement, and with the perceptron convergence theorem, the perceptron seemed to be able to meet many interesting challenges. However, mathematically what the perceptron does is find a linear classification boundary.
This fact drew the attention of Minsky and Papert, who wrote a book criticizing that the perceptron cannot even separate the very simple XOR configuration of four data points on a plane [3]. This book almost instantly killed the many then-flourishing perceptron research projects, and it marked the close of the first wave of neural networks.
Since then, MIT (and others) led artificial intelligence research along the path of logical analysis and rule-based programming. They contrived many interesting methods and approaches to embody human intelligence through the rule-like programming paradigm. Although this approach had many successes in areas where logical thinking is the key, it failed pretty badly at the perception problem, which is what the perceptron was intended to solve. This lack of common sense became the glaring deficiency of this type of artificial intelligence method, and it is the reason why logic-based, rule-like artificial intelligence gradually withered away.
The second wave of neural networks came when the PDP group led by Rumelhart, McClelland, and others discovered the magic of adding a hidden layer to the neural network [6]. In particular, they showed that the XOR problem, which had vexed the perceptron so much, could easily be overcome. (See Section 2.2 below.) This and many other interesting applications rekindled interest in neural networks. Many other models of neural networks were
proposed and many interesting problems solved. Indeed it seemed that the
secret of human intelligence was about to be revealed. However, it soon
dawned on people that complex neural networks with many hidden layers are
very hard to train. They frequently overfit, and as a consequence a small change in the dataset can lead to drastic changes in the trained network itself, earning them the reputation of being "hard to train". So from around the mid-1990s on, the PDP-style neural network projects were mostly abandoned, and people turned to other machine learning methods such as support vector machines and kernel methods, ensemble methods like boosting and random forests, and so on.
Such was the state of affairs until suddenly G. Hinton and his colleagues came bursting onto the scene with fresh new ideas for pre-training neural networks in the mid-2000s. (See for example [2] and our subsequent lectures.)
With the drastic increase in computational capability brought by GPU machines, people began to create very complex neural networks with tens of hidden layers and to train them reasonably well, as long as there is enough data to feed to the network.
Currently we are witnessing the third wave of neural networks under the banner of deep learning coupled with big data. Its success is impressive. Almost overnight, it has nearly completely replaced speech recognition technology, which has a long history of its own; vision research is now mostly based on deep learning; and with the use of deep learning, complicated natural language problems like machine translation are at a different level of competency. Many commercial applications developed with deep learning are now routinely in use.
However, deep learning is not a panacea. Although it is capable of solving
many, if not all, complex perception problems, it is not as adept at reasoning
tasks. Also, its black-box nature makes it harder for humans to accept the
outcome if it conflicts with one’s expert knowledge or judgment. Another
aspect of deep learning is that it requires a huge amount of data, which
contrasts with the way humans learn. All these objections and questions will be among the central themes of the next phase of artificial intelligence research, and it will be interesting to see how deep learning can be meshed with this coming trend.
and let
$$z = [z_1, \cdots, z_K]^T$$
be a K-dimensional column vector. Then in matrix notation we have
$$z = Wx + b.$$
$$\mathrm{softmax}(z) = \mathrm{softmax}(z_1, \cdots, z_K).$$
Its neural network formalism is as in Figure 1. The circles in the left box represent the input variables x1, · · · , xd and those in the right box the response variables h1, · · · , hK. The internal state of the k-th output neuron is the variable zk, whose value is as described in (1). In neural network parlance, zk is obtained by summing the products wkj xj over all j and then adding the value bk, called the bias. Once the values of z1, · · · , zK are given, the outputs h1, · · · , hK are found by applying the softmax function (2).
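To make this concrete, here is a minimal NumPy sketch of the layer just described: z = Wx + b followed by the softmax of (2). The sizes d and K, the parameter values, and the input are made up for the example.

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize; subtracting the max does not change the
    # result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

d, K = 3, 4                      # input dimension and number of output neurons (made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))      # weights w_kj
b = rng.normal(size=K)           # biases b_k
x = np.array([0.5, -1.0, 2.0])   # input x_1, ..., x_d

z = W @ x + b                    # internal states z_k = sum_j w_kj x_j + b_k
h = softmax(z)                   # outputs h_1, ..., h_K
print(h, h.sum())                # the outputs are positive and sum to 1
```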
Figure 3: Value of h1 and the separating line
Similarly, for the label x̄1 x2, the Boolean value of x̄1 x2 and the separating line −x1 + x2 − 1/2 = 0 are shown in Figure 6. Now let z2 = a(−x1 + x2 − 1/2) and let h2 = σ(z2). Then by the same argument, the value of h2 becomes (very close to) 0 on the bottom right side of the line −x1 + x2 − 1/2 = 0, and 1 on the top left side of that line. This situation is depicted in Figure 6. In neural network notation, this variable relation is drawn as in Figure 7. To combine h1 and h2, we want h3 = XOR(x1, x2) = x1 x̄2 + x̄1 x2. The values of all the variables are shown in Figure 8. Note that the range of values (h1, h2) can take is restricted: they are (0, 0), (0, 1), and (1, 0); (1, 1) never shows up. This situation is depicted in
Figure 9, and thus h1 + h2 − 1/2 = 0 is a separating line for these data. Define z3 = b(h1 + h2 − 1/2) for large b ≫ 0, and h3 = σ(z3). The value of h3 on the plane is also shown in Figure 9. If we combine everything we have
done in this subsection, it can be written as the neural network in Figure 10. Clearly this neural network with one hidden layer separates the original XOR data.
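The construction can be checked numerically. Below is a minimal sketch of the network of Figure 10, assuming (as in Figure 3) that the first hidden unit uses the separating line x1 − x2 − 1/2 = 0; the gain constants a and b are arbitrary large values chosen for the illustration.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

a, b = 20.0, 20.0                       # large gains so the sigmoids saturate

def xor_net(x1, x2):
    h1 = sigma(a * ( x1 - x2 - 0.5))    # approximately x1 AND (NOT x2)
    h2 = sigma(a * (-x1 + x2 - 0.5))    # approximately (NOT x1) AND x2
    h3 = sigma(b * (h1 + h2 - 0.5))     # combines the two hidden units
    return h3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_net(x1, x2), 3))
# Close to 0 at (0, 0) and (1, 1), close to 1 at (0, 1) and (1, 0).
```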
Figure 9: Value of h3 and the separating line
Taking the Boolean OR of h1 and h2, we have the table in Figure 12. Since the label of the OR operation is 1 except at (0, 0), where the label is 0, it is easy to find a separating line for this OR data. Similarly, we can construct another neural network as in Figure 13.
Figure 12: Boolean OR of h1 and h2
Its OR value on the plane is depicted in Figure 14.
Figure 15: Combined OR value table of h1 , h2 , h3 and h4
If we combine all of these, the result is as depicted in Figure 17.
Continuing this way, one can construct an approximation of any bump function as the output of a neural network with one hidden layer. Furthermore, by combining these bump functions, one can approximate any continuous function. Namely, a neural network with one hidden layer can, at least in principle, perform any such approximation task. This heuristic argument can be made rigorous using a Stone-Weierstrass theorem-type argument to get the following theorem.
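As a quick numerical illustration of this heuristic (not a proof of the theorem stated next), the sketch below builds approximate bump functions as differences of two shifted sigmoids, i.e. as outputs of pairs of hidden sigmoid units, and sums scaled bumps to approximate a continuous target. The target sin(πx), the grid, and the gain a are made-up choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, left, right, a=500.0):
    # Approximate indicator of [left, right]: the difference of two shifted
    # sigmoids, i.e. the output of two hidden units with gains +a and -a.
    return sigmoid(a * (x - left)) - sigmoid(a * (x - right))

f = lambda x: np.sin(np.pi * x)            # continuous target to approximate on [0, 1]
edges = np.linspace(0.0, 1.0, 21)          # 20 small intervals
x = np.linspace(0.0, 1.0, 401)

# One-hidden-layer network: a sum of scaled bumps, one per interval.
approx = sum(f((l + r) / 2) * bump(x, l, r) for l, r in zip(edges[:-1], edges[1:]))
print("max |f - approx| =", np.abs(f(x) - approx).max())
# The deviation shrinks as more (narrower) bumps and larger gains are used.
```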
Theorem 1. (Cybenko-Hornik-Funahashi Theorem) Let K = [0, 1]^d be a d-dimensional hypercube in R^d. Then sums of the form
$$f(x) = \sum_i c_i \,\mathrm{sigmoid}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$
are dense in C(K); that is, any continuous function on K can be approximated arbitrarily well by such sums.
Figure 18: MLP Layers
The activation value of (output of) neuron $i$ in Layer $\ell$ is $h_i^\ell$. The pre-activation and activation values are related by the activation function $\varphi(t)$, so that
$$h_i^\ell = \varphi^\ell(z_i^\ell), \qquad (5)$$
for $\ell = 1, \cdots, L - 1$. Some of the most popular activation functions are
$$\sigma(t) = \frac{1}{1 + e^{-t}}, \quad \text{the sigmoid function},$$
$$\mathrm{ReLU}(t) = \max(0, t), \quad \text{the rectified linear unit},$$
$$\tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}, \quad \text{the hyperbolic tangent function}.$$
For $\ell = 0$, $h_j^0$ is set to be the input $x_j$, and we do not use $z_j^0$. For the output layer, i.e. $\ell = L$, the form of the output $h_i^L$ changes depending on the problem. For regression, the output is as usual
$$h_i^L = \varphi^L(z_i^L).$$
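A minimal sketch of the forward pass just described, using the activation functions listed above and assuming the usual affine pre-activation $z^\ell = W^\ell h^{\ell-1} + b^\ell$ together with (5); the layer sizes and parameter values are made up.

```python
import numpy as np

# The activation functions listed above.
def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))
def relu(t):    return np.maximum(0.0, t)
def tanh(t):    return np.tanh(t)

def forward(x, Ws, bs, phis):
    """Forward pass: h^0 = x, z^l = W^l h^(l-1) + b^l, h^l = phi^l(z^l)."""
    h = x
    for W, b, phi in zip(Ws, bs, phis):
        z = W @ h + b
        h = phi(z)
    return h

# A made-up 3-2-2 network.
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(2, 2))]
bs = [rng.normal(size=2), rng.normal(size=2)]
x = np.array([1.0, -0.5, 2.0])
print(forward(x, Ws, bs, phis=[relu, sigmoid]))
```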
3.2 Errors
The error of classification problems is essentially the same as that of
logistic regression. As before, the formalism goes as follows. For a generic
input-output pair (x, y), where y ∈ {1, · · · , K}, use the one-hot encoding to
define
yk = I(y = k).
So y can be identified with $(y_1, \cdots, y_K)$, in which $y_k \in \{0, 1\}$ and $y_1 + \cdots + y_K = 1$. Then the cross entropy error is
$$E = -\sum_k y_k \log h_k^L(x) = -\sum_k y_k \log P(Y = k \mid X = x).$$
Let $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ be a given dataset. For each data point $(x^{(i)}, y^{(i)})$, its cross entropy error is
$$E = -\sum_k y_k^{(i)} \log h_k^L(x^{(i)}) = -\sum_k y_k^{(i)} \log P(Y = k \mid X = x^{(i)}),$$
where $y_k^{(i)} = I(y^{(i)} = k)$. For a mini-batch $\{(x^{(i)}, y^{(i)})\}_{i=1}^B$, its cross entropy error is
$$E = -\sum_{i=1}^B \sum_k y_k^{(i)} \log h_k^L(x^{(i)}) = -\sum_{i=1}^B \sum_k y_k^{(i)} \log P(Y = k \mid X = x^{(i)}).$$
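The mini-batch cross entropy error above is easy to compute directly from the one-hot encoding. In the sketch below the batch of output probabilities $h^L$ is made up; in practice it would come from the forward pass through a softmax output layer.

```python
import numpy as np

def one_hot(y, K):
    """y_k^(i) = I(y^(i) = k) for each label in the batch."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def cross_entropy(H, y):
    """E = - sum_i sum_k y_k^(i) log h_k^L(x^(i)) for a mini-batch.
    H: (B, K) output probabilities, y: (B,) integer labels in {0, ..., K-1}."""
    Y = one_hot(y, H.shape[1])
    return -np.sum(Y * np.log(H))

# Made-up batch of B = 2 outputs over K = 3 classes.
H = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
y = np.array([0, 2])
print(cross_entropy(H, y))   # equals -(log 0.7 + log 0.8)
```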
The training of a neural network is to find good parameters $w_{ij}^\ell$ and $b_i^\ell$ to minimize the error, which is the topic we will cover in the next subsection and in the subsequent lecture on Training Deep Neural Network.
For the regression problem, one usually uses the L2 error, so that
$$E = \sum_{i=1}^B \big|y^{(i)} - h^L(x^{(i)})\big|^2.$$
3.3 Backpropagation algorithm
Training of a neural network, in a nutshell, is a gradient descent algorithm
which is basically of the form:
$$\Delta w_{ij}^\ell = -\lambda \frac{\partial E}{\partial w_{ij}^\ell}, \qquad \Delta b_i^\ell = -\lambda \frac{\partial E}{\partial b_i^\ell},$$
where
$$\Delta w_{ij}^\ell = w_{ij}^\ell(\mathrm{new}) - w_{ij}^\ell(\mathrm{old}), \qquad \Delta b_i^\ell = b_i^\ell(\mathrm{new}) - b_i^\ell(\mathrm{old}).$$
In fact, deep learning training methods are not this simplistic, but all of
them are variations of one form or another of this basic idea of gradient
descent. For more details, the reader is referred to the forthcoming lecture
on "Training Deep Neural Network."
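In code, one step of this basic update is just the following sketch; here dE_dWs and dE_dbs stand for the gradients delivered by backpropagation, and lam plays the role of λ (all names are illustrative, not from any particular library).

```python
def sgd_step(Ws, bs, dE_dWs, dE_dbs, lam=0.01):
    """In-place gradient descent update for NumPy parameter arrays:
    W <- W - lam * dE/dW and b <- b - lam * dE/db for every layer."""
    for W, b, dW, db in zip(Ws, bs, dE_dWs, dE_dbs):
        W -= lam * dW   # Delta w = -lambda * dE/dw
        b -= lam * db   # Delta b = -lambda * dE/db
    return Ws, bs
```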
So to do the training it is necessary to compute $\frac{\partial E}{\partial w_{ij}^\ell}$ and $\frac{\partial E}{\partial b_i^\ell}$ efficiently. At the output layer, Layer $L$, it is easy to determine $\frac{\partial E}{\partial h_i^L}$, as we know the expression of the error $E$. By the chain rule, we can compute
$$\frac{\partial E}{\partial z_j^L} = \sum_i \frac{\partial h_i^L}{\partial z_j^L}\frac{\partial E}{\partial h_i^L}. \qquad (6)$$
The variable dependencies are depicted in Figure 20. From this we can easily see that
$$\frac{\partial E}{\partial z_i^\ell} = \frac{dh_i^\ell}{dz_i^\ell}\frac{\partial E}{\partial h_i^\ell},$$
where
$$\frac{dh_i^\ell}{dz_i^\ell} = \begin{cases} h_i^\ell(1 - h_i^\ell) & \text{if } \varphi^\ell \text{ is } \sigma \\ I(z_i^\ell \ge 0) & \text{if } \varphi^\ell \text{ is ReLU} \\ \operatorname{sech}^2 z_i^\ell & \text{if } \varphi^\ell \text{ is } \tanh \end{cases}$$
Summary
• By (3) and (5), the data flows forward, i.e. from Layer $\ell - 1$ to Layer $\ell$; hence the name feedforward network.
• By (6) and (7), the error derivative, not the actual error, can be computed backward, i.e.
$$\frac{\partial E}{\partial z_i^{\ell-1}} \xleftarrow{\text{by (6)}} \frac{\partial E}{\partial h_i^{\ell-1}} \xleftarrow{\text{by (7)}} \frac{\partial E}{\partial z_i^{\ell}} \xleftarrow{\text{by (6)}} \frac{\partial E}{\partial h_i^{\ell}} \leftarrow \cdots$$
• Equations (8) and (9) are the basic equations to be used for the gradient descent algorithm for learning.
Let $\big(\frac{\partial h^\ell}{\partial z^\ell}\big)$ be the Jacobian matrix whose $(i, j)$-th entry is $\frac{\partial h_i^\ell}{\partial z_j^\ell}$. Then in matrix form (6) can be written as
$$\left(\frac{\partial E}{\partial z^L}\right) = \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right).$$
Note that the chain rule
$$\frac{\partial h_i^\ell}{\partial h_j^{\ell-1}} = \sum_k \frac{\partial z_k^\ell}{\partial h_j^{\ell-1}}\frac{\partial h_i^\ell}{\partial z_k^\ell}$$
can be written in matrix form as
$$\left(\frac{\partial h^\ell}{\partial h^{\ell-1}}\right) = \left(\frac{\partial z^\ell}{\partial h^{\ell-1}}\right)\left(\frac{\partial h^\ell}{\partial z^\ell}\right).$$
Thus the vector used in the backpropagation rule is nothing but the matrix product of the following form:
$$\left(\frac{\partial E}{\partial h^\ell}\right) = \left(\frac{\partial z^{\ell+1}}{\partial h^\ell}\right)\left(\frac{\partial h^{\ell+1}}{\partial z^{\ell+1}}\right) \cdots \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right), \qquad (10)$$
and
$$\left(\frac{\partial E}{\partial z^\ell}\right) = \left(\frac{\partial h^\ell}{\partial z^\ell}\right)\left(\frac{\partial z^{\ell+1}}{\partial h^\ell}\right)\left(\frac{\partial h^{\ell+1}}{\partial z^{\ell+1}}\right) \cdots \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right). \qquad (11)$$
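The sketch below carries out these products for a small all-sigmoid network with a squared error at the output, and checks one gradient entry against a numerical derivative. The error choice and the affine pre-activation $z^\ell = W^\ell h^{\ell-1} + b^\ell$ are assumptions made for the example; for the sigmoid, the Jacobian $\frac{\partial h^\ell}{\partial z^\ell}$ is diagonal, so the matrix products reduce to elementwise multiplications and multiplications by the transposed weight matrices.

```python
import numpy as np

def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))

def forward(x, Ws, bs):
    hs = [x]
    for W, b in zip(Ws, bs):
        z = W @ hs[-1] + b            # affine pre-activation (assumed form)
        hs.append(sigmoid(z))         # h^l = sigma(z^l), as in (5)
    return hs

def backward(hs, Ws, y):
    """Backward pass in the spirit of (10)-(11) for the error E = |y - h^L|^2."""
    grad_h = 2.0 * (hs[-1] - y)                  # dE/dh^L
    dWs, dbs = [], []
    for l in reversed(range(len(Ws))):
        dh_dz = hs[l + 1] * (1.0 - hs[l + 1])    # diagonal of dh^l/dz^l for sigma
        grad_z = dh_dz * grad_h                  # dE/dz^l, cf. (11)
        dWs.insert(0, np.outer(grad_z, hs[l]))   # dE/dW^l = (dE/dz^l)(h^(l-1))^T
        dbs.insert(0, grad_z)                    # dE/db^l = dE/dz^l
        grad_h = Ws[l].T @ grad_z                # dE/dh^(l-1), cf. (10)
    return dWs, dbs

# Compare one backpropagated gradient entry with a numerical derivative.
rng = np.random.default_rng(2)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([0.3, -0.7]), np.array([1.0])

dWs, dbs = backward(forward(x, Ws, bs), Ws, y)

eps = 1e-6
Ws[0][0, 0] += eps
E_plus = np.sum((y - forward(x, Ws, bs)[-1]) ** 2)
Ws[0][0, 0] -= 2 * eps
E_minus = np.sum((y - forward(x, Ws, bs)[-1]) ** 2)
print(dWs[0][0, 0], (E_plus - E_minus) / (2 * eps))   # the two values agree
```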
The arithmetic rules of dual numbers of the form a + a'ε are as follows:
$$\lambda(a + a'\varepsilon) = \lambda a + \lambda a'\varepsilon,$$
$$(a + a'\varepsilon) + (b + b'\varepsilon) = (a + b) + (a' + b')\varepsilon,$$
$$(a + a'\varepsilon) - (b + b'\varepsilon) = (a - b) + (a' - b')\varepsilon,$$
$$(a + a'\varepsilon)(b + b'\varepsilon) = ab + (a'b + ab')\varepsilon,$$
$$\frac{1}{a + a'\varepsilon} = \frac{1}{a} - \frac{a'}{a^2}\varepsilon.$$
Note that if we put a = b = 0 and a' = b' = 1 in the fourth line above, we get
$$\varepsilon^2 = 0,$$
from which it is easy to prove by induction that
$$(x + x'\varepsilon)^n = x^n + n x^{n-1} x'\varepsilon.$$
More generally, for a differentiable function f,
$$f(x + x'\varepsilon) = f(x) + f'(x)\,x'\varepsilon,$$
which is basically the first-order Taylor series expansion. This mysterious ε can be envisioned as being sort of like the smallest floating point number representable in a computer, so that ε² = 0. This is also reminiscent of the complex number i satisfying i² = −1.
Let us see how this dual number can be used to compute partial derivatives. Suppose f(x, y) = xy² + y + 5. Since ∂f/∂y(x, y) = 2xy + 1, we can directly compute f(2, 3) = 26 and ∂f/∂y(2, 3) = 13.
To see how automatic differentiation can be used to do the same, we look at a computational tree for the computation of f(x, y).
Figure 21: AutoDiff in Forward Mode
An example of the computational graph for this computation is given in Figure 21. Here, to compute ∂f/∂y(2, 3), one feeds the values x = 2 and y = 3 + ε into the bottom variables. One then follows the arrows of the computational graph and performs the appropriate operations, whose results are recorded on the corresponding links. From the value at the top node, one can read off f(2, 3) = 26 and ∂f/∂y(2, 3) = 13. Note that to compute ∂f/∂x(2, 3), we must redo a similar computation, f(2 + ε, 3), on the same computational graph all over again by feeding x = 2 + ε and y = 3. This
may be quite burdensome if there are millions of variables to deal with as is
the case in deep learning. This way of directly computing partial derivatives
is called the forward mode automatic differentiation.
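A tiny dual-number class makes this concrete: it implements the arithmetic rules above, and feeding x = 2 and y = 3 + ε reproduces f(2, 3) = 26 and ∂f/∂y(2, 3) = 13. This is a sketch of the idea only, not of any particular automatic differentiation library.

```python
class Dual:
    """Dual number a + a' * eps with eps**2 = 0; the second slot carries the derivative."""
    def __init__(self, a, da=0.0):
        self.a, self.da = a, da
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.da + other.da)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.a * other.a, self.da * other.a + self.a * other.da)
    __rmul__ = __mul__

def f(x, y):
    return x * y * y + y + 5

# Forward mode with respect to y: feed x = 2 and y = 3 + eps.
out = f(Dual(2.0), Dual(3.0, 1.0))
print(out.a, out.da)   # 26.0 13.0
```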
An alternative method, called the reverse mode automatic differentiation, is devised to alleviate this problem. It is basically a repeated application of the chain rule. It goes as follows. First, use the forward computational graph to compute the value of each node in the forward pass, i.e. from the input values at the bottom nodes upward through all the nodes. The number inside each node in Figure 22 is the value of that node. The symbol ni at the left of each node is the node number, which one can think of as a variable representing the value of the corresponding node. The idea of reverse mode automatic differentiation is to trace the variable dependencies along the paths of the computational graph and apply the appropriate chain rules.
Note that f = n1 and n1 = n2 + n3. Thus ∂n1/∂n2 = 1, and this number is recorded on the link from n1 to n2. The number on the link from n1 to n3 is gotten similarly. Now n2 = n4 n5, from which we get ∂n2/∂n5 = n4 = 9. Therefore we have
$$\frac{\partial n_1}{\partial n_5} = \frac{\partial n_1}{\partial n_2}\frac{\partial n_2}{\partial n_5} = 9,$$
which is recorded on the link from n2 to n5. Note that this is the chain rule (along the path from n1 to n5). The reason why no other variables enter
into this chain rule calculation can be easily seen: namely, a change in n5
only affects n2 , which again affects n1 , and no other variables are affected.
One can read off this dependency from the graph structure, i.e. there is only
one path from n1 to n5. Similarly, ∂n2/∂n4 = n5 = 2. Thus
$$\frac{\partial n_1}{\partial n_4} = \frac{\partial n_1}{\partial n_2}\frac{\partial n_2}{\partial n_4} = 2,$$
which is recorded on the link from n2 to n4. Now n4 = n6². Thus
$$\frac{\partial n_4}{\partial n_6} = 2 n_6 = 6. \qquad (12)$$
Thus
$$\frac{\partial n_1}{\partial n_6} = \frac{\partial n_1}{\partial n_4}\frac{\partial n_4}{\partial n_6} = 12,$$
which is recorded on the link from n4 to n6. Note that this does not yet give the value of ∂f/∂y(2, 3), although in terms of values of variables, f = n1 and y = n6. The reason is that there is another path from n1 to n6, via n3, and this calculation only reflects the dependency of n1 on n6 through the path via n4; the number 12 recorded on the link from n4 to n6 represents only that part of the dependency. Computing similarly along the other path, through n3, we get the number 1 on the
link from n3 to n6 . Since n1 depends on n6 through these two paths, we have
to add these two numbers to get
$$\frac{\partial f}{\partial y}(2, 3) = 12 + 1 = 13.$$
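The same bookkeeping can be written out as a small reverse sweep over the node values and local derivatives of this particular graph, with the node numbering used above; a real automatic differentiation system performs the equivalent bookkeeping on arbitrary computational graphs.

```python
# Forward pass: node values for f(x, y) = x*y**2 + y + 5 at (x, y) = (2, 3),
# numbered as above: n6 = y, n5 = x, n4 = n6**2, n3 = n6 + 5, n2 = n4*n5, n1 = n2 + n3.
n6, n5 = 3.0, 2.0
n4 = n6 ** 2
n3 = n6 + 5.0
n2 = n4 * n5
n1 = n2 + n3

# Reverse sweep: accumulate dn1/dn_k, summing the contributions of all incoming paths.
d = {1: 1.0}
d[2] = d[1] * 1.0                     # n1 = n2 + n3
d[3] = d[1] * 1.0
d[4] = d[2] * n5                      # n2 = n4 * n5, so dn2/dn4 = n5 = 2
d[5] = d[2] * n4                      # dn2/dn5 = n4 = 9
d[6] = d[4] * (2 * n6) + d[3] * 1.0   # two paths into n6: via n4 (12) and via n3 (1)
print(n1, d[6])                       # 26.0 13.0
```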
References
[1] Community Portal for Automatic Differentiation
http://www.autodiff.org/
[2] Goodfellow, I., Bengio, Y., Courville, A., Deep Learning, MIT Press
(2016)
[4] Rosenblatt, F., A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, 386–408 (1958)
[6] Rumelhart, D., McClelland, J., PDP Research Group, Parallel Dis-
tributed Processing, vol 1 & 2, Bradford Book (1987)