Rosenblatt [4] introduced the so-called perceptron learning algorithm, which is different from the present-day backpropagation algorithm. It was nonetheless the first algorithm that could learn from data without explicit programming by human experts. This feature created a lot of excitement, and with the perceptron convergence theorem, the perceptron seemed to be able to meet many interesting challenges. However, mathematically what the perceptron does is find a linear classification boundary.
This fact drew the attention of Minsky and Papert, who wrote a book criticizing that the perceptron cannot even separate the very simple XOR configuration of four data points on a plane [3]. This book almost instantly killed the many then-flourishing perceptron research projects, and it marked the close of the first wave of neural networks.
Since then, MIT (and others) led artificial intelligence research along the path of logical analysis and rule-based programming. They contrived many interesting methods and approaches to embody human intelligence through the rule-like programming paradigm. Although this approach had many successes in areas where logical thinking is the key, it failed pretty badly at the perception problem, which is what the perceptron was intended to solve. This lack of common sense became the glaring deficiency of this type of artificial intelligence method, and it is the reason why logic-based, rule-like artificial intelligence gradually withered away.
The second wave of neural networks came when the PDP group led by Rumelhart, McClelland, and others discovered the magic of adding a hidden layer to the neural network [6]. In particular, they showed that the XOR problem, which had vexed the perceptron so much, could easily be overcome. (See Section 2.2 below.) This and many other interesting applications rekindled interest in neural networks. Many other models of neural networks were
proposed and many interesting problems solved. Indeed it seemed that the
secret of human intelligence was about to be revealed. However, it soon
dawned on people that complex neural networks with many hidden layers are
very hard to train. They frequently overfit, and as a consequence a small change in the dataset can lead to drastic changes in the trained network itself, earning them the reputation of being "hard to train". So from around the mid-1990s on, the PDP-style neural network projects were mostly abandoned, and people turned to other machine learning methods such as support vector machines and kernel methods, ensemble methods like boosting and random forests, and so on.
Such was the state of affairs until suddenly G. Hinton and his colleagues came bursting onto the scene with fresh new ideas for pre-training neural networks in the mid-2000s. (See for example [2] and our subsequent lectures.)
With the drastic increase in computational capability brought by GPU machines, people began to create very complex neural networks with tens of hidden layers and to train them reasonably well, as long as there is enough data to feed to the network.
Currently we are witnessing the third wave of neural networks under the banner of deep learning coupled with big data. Its success is impressive. Almost overnight, it has nearly completely replaced speech recognition technology, which has a long history of its own; vision research is now mostly based on deep learning; and with the use of deep learning, complicated natural language problems like machine translation are at a different level of competency. Many commercial applications developed with deep learning are now routinely in use.
However, deep learning is not a panacea. Although it is capable of solving
many, if not all, complex perception problems, it is not as adept at reasoning
tasks. Also, its black-box nature makes it harder for humans to accept the
outcome if it conflicts with one’s expert knowledge or judgment. Another
aspect of deep learning is that it requires a huge amount of data, which
contrasts with the way humans learn. All these objections and questions will be among the central themes of the next phase of artificial intelligence research, and it will be interesting to see how deep learning can be meshed with this coming trend.
and let
$$z = [z_1, \cdots, z_K]^T$$
be a K-dimensional column vector. Then in matrix notation we have
$$z = Wx + b.$$
$$\mathrm{softmax}(z) = \mathrm{softmax}(z_1, \cdots, z_K).$$
Its neural network formalism is as in Figure 1. The circles in the left box represent the input variables x1, · · · , xd and those in the right box the response variables h1, · · · , hK. The internal state of the k-th output neuron is the variable zk, whose value is as described in (1). In neural network parlance, zk is obtained by summing the products wkj xj over all j and then adding the value bk, called the bias. Once the values of z1, · · · , zK are given, the outputs h1, · · · , hK are found by applying the softmax function (2).
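To make this concrete, here is a minimal NumPy sketch of the layer just described: z = Wx + b followed by the softmax of (2). The sizes d and K, the parameter values, and the input are made up for the example.

```python
import numpy as np

def softmax(z):
    # Exponentiate and normalize; subtracting the max does not change the
    # result but avoids overflow.
    e = np.exp(z - np.max(z))
    return e / e.sum()

d, K = 3, 4                      # input dimension and number of output neurons (made up)
rng = np.random.default_rng(0)
W = rng.normal(size=(K, d))      # weights w_kj
b = rng.normal(size=K)           # biases b_k
x = np.array([0.5, -1.0, 2.0])   # input x_1, ..., x_d

z = W @ x + b                    # internal states z_k = sum_j w_kj x_j + b_k
h = softmax(z)                   # outputs h_1, ..., h_K
print(h, h.sum())                # the outputs are positive and sum to 1
```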
Figure 3: Value of h1 and the separating line
Similarly, for the label x̄1 x2, the Boolean value of x̄1 x2 and the separating line −x1 + x2 − 1/2 = 0 are shown in Figure 6. Now let z2 = a(−x1 + x2 − 1/2) and let h2 = σ(z2). Then by the same argument, the value of h2 becomes (very close to) 0 on the bottom right side of the line −x1 + x2 − 1/2 = 0, and 1 on the top left side of that line. This situation is depicted in Figure 6. In neural network notation, this variable relation is drawn as in Figure 7. To combine h1 and h2, we want h3 = XOR(x1, x2) = x1 x̄2 + x̄1 x2. The values of all the variables are shown in Figure 8. Note that the range of values (h1, h2) can take is restricted: they are (0, 0), (0, 1), and (1, 0); (1, 1) never shows up. This situation is depicted in
Figure 9, and thus h1 + h2 − 1/2 = 0 is a separating line for these data. Define z3 = b(h1 + h2 − 1/2) for large b ≫ 0, and h3 = σ(z3). The value of h3 on the plane is also shown in Figure 9. If we combine everything we have
done in this subsection, it can be written as the neural network in Figure 10. Clearly this neural network with one hidden layer separates the original XOR data.
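The construction can be checked numerically. Below is a minimal sketch of the network of Figure 10, assuming (as in Figure 3) that the first hidden unit uses the separating line x1 − x2 − 1/2 = 0; the gain constants a and b are arbitrary large values chosen for the illustration.

```python
import numpy as np

def sigma(t):
    return 1.0 / (1.0 + np.exp(-t))

a, b = 20.0, 20.0                       # large gains so the sigmoids saturate

def xor_net(x1, x2):
    h1 = sigma(a * ( x1 - x2 - 0.5))    # approximately x1 AND (NOT x2)
    h2 = sigma(a * (-x1 + x2 - 0.5))    # approximately (NOT x1) AND x2
    h3 = sigma(b * (h1 + h2 - 0.5))     # combines the two hidden units
    return h3

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), round(xor_net(x1, x2), 3))
# Close to 0 at (0, 0) and (1, 1), close to 1 at (0, 1) and (1, 0).
```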
Figure 9: Value of h3 and the separating line
Taking the Boolean OR of h1 and h2, we have the table in Figure 12. Since the label of the OR operation is 1 except at (0, 0), where the label is 0, it is easy to find a separating line for this OR data. Similarly, we can construct another neural network as in Figure 13.
Figure 12: Boolean OR of h1 and h2
Its OR value on the plane is depicted in Figure 14.
Figure 15: Combined OR value table of h1 , h2 , h3 and h4
If we combine all of these, the result is as depicted in Figure 17.
Continuing this way, one can construct an approximation of any bump function as the output of a neural network with one hidden layer. Furthermore, by combining these bump functions, one can approximate any continuous function. Namely, a neural network with one hidden layer can, at least in principle, perform any such approximation task. This heuristic argument can be made rigorous using a Stone-Weierstrass theorem-type argument to get the following theorem.
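As a quick numerical illustration of this heuristic (not a proof of the theorem stated next), the sketch below builds approximate bump functions as differences of two shifted sigmoids, i.e. as outputs of pairs of hidden sigmoid units, and sums scaled bumps to approximate a continuous target. The target sin(πx), the grid, and the gain a are made-up choices.

```python
import numpy as np

def sigmoid(t):
    return 1.0 / (1.0 + np.exp(-t))

def bump(x, left, right, a=500.0):
    # Approximate indicator of [left, right]: the difference of two shifted
    # sigmoids, i.e. the output of two hidden units with gains +a and -a.
    return sigmoid(a * (x - left)) - sigmoid(a * (x - right))

f = lambda x: np.sin(np.pi * x)            # continuous target to approximate on [0, 1]
edges = np.linspace(0.0, 1.0, 21)          # 20 small intervals
x = np.linspace(0.0, 1.0, 401)

# One-hidden-layer network: a sum of scaled bumps, one per interval.
approx = sum(f((l + r) / 2) * bump(x, l, r) for l, r in zip(edges[:-1], edges[1:]))
print("max |f - approx| =", np.abs(f(x) - approx).max())
# The deviation shrinks as more (narrower) bumps and larger gains are used.
```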
Theorem 1. (Cybenko-Hornik-Funahashi Theorem) Let K = [0, 1]^d be a d-dimensional hypercube in R^d. Then sums of the form
$$f(x) = \sum_i c_i \,\mathrm{sigmoid}\Big(b_i + \sum_{j=1}^d w_{ij} x_j\Big)$$
are dense in C(K); that is, any continuous function on K can be approximated arbitrarily well by such sums.
Figure 18: MLP Layers
The activation value of (output of) neuron $i$ in Layer $\ell$ is $h_i^\ell$. The pre-activation and activation values are related by the activation function $\varphi(t)$, so that
$$h_i^\ell = \varphi^\ell(z_i^\ell), \qquad (5)$$
for $\ell = 1, \cdots, L - 1$. Some of the most popular activation functions are
$$\sigma(t) = \frac{1}{1 + e^{-t}}, \quad \text{the sigmoid function},$$
$$\mathrm{ReLU}(t) = \max(0, t), \quad \text{the rectified linear unit},$$
$$\tanh(t) = \frac{e^t - e^{-t}}{e^t + e^{-t}}, \quad \text{the hyperbolic tangent function}.$$
For $\ell = 0$, $h_j^0$ is set to be the input $x_j$, and we do not use $z_j^0$. For the output layer, i.e. $\ell = L$, the form of the output $h_i^L$ changes depending on the problem. For regression, the output is as usual
$$h_i^L = \varphi^L(z_i^L).$$
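A minimal sketch of the forward pass just described, using the activation functions listed above and assuming the usual affine pre-activation $z^\ell = W^\ell h^{\ell-1} + b^\ell$ together with (5); the layer sizes and parameter values are made up.

```python
import numpy as np

# The activation functions listed above.
def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))
def relu(t):    return np.maximum(0.0, t)
def tanh(t):    return np.tanh(t)

def forward(x, Ws, bs, phis):
    """Forward pass: h^0 = x, z^l = W^l h^(l-1) + b^l, h^l = phi^l(z^l)."""
    h = x
    for W, b, phi in zip(Ws, bs, phis):
        z = W @ h + b
        h = phi(z)
    return h

# A made-up 3-2-2 network.
rng = np.random.default_rng(1)
Ws = [rng.normal(size=(2, 3)), rng.normal(size=(2, 2))]
bs = [rng.normal(size=2), rng.normal(size=2)]
x = np.array([1.0, -0.5, 2.0])
print(forward(x, Ws, bs, phis=[relu, sigmoid]))
```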
3.2 Errors
The error of classification problems is essentially the same as that of
logistic regression. As before, the formalism goes as follows. For a generic
input-output pair (x, y), where y ∈ {1, · · · , K}, use the one-hot encoding to
define
yk = I(y = k).
So y can be identified with $(y_1, \cdots, y_K)$, in which $y_k \in \{0, 1\}$ and $y_1 + \cdots + y_K = 1$. Then the cross entropy error is
$$E = -\sum_k y_k \log h_k^L(x) = -\sum_k y_k \log P(Y = k \mid X = x).$$
Let $D = \{(x^{(i)}, y^{(i)})\}_{i=1}^N$ be a given dataset. For each data point $(x^{(i)}, y^{(i)})$, its cross entropy error is
$$E = -\sum_k y_k^{(i)} \log h_k^L(x^{(i)}) = -\sum_k y_k^{(i)} \log P(Y = k \mid X = x^{(i)}),$$
where $y_k^{(i)} = I(y^{(i)} = k)$. For a mini-batch $\{(x^{(i)}, y^{(i)})\}_{i=1}^B$, its cross entropy error is
$$E = -\sum_{i=1}^B \sum_k y_k^{(i)} \log h_k^L(x^{(i)}) = -\sum_{i=1}^B \sum_k y_k^{(i)} \log P(Y = k \mid X = x^{(i)}).$$
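The mini-batch cross entropy error above is easy to compute directly from the one-hot encoding. In the sketch below the batch of output probabilities $h^L$ is made up; in practice it would come from the forward pass through a softmax output layer.

```python
import numpy as np

def one_hot(y, K):
    """y_k^(i) = I(y^(i) = k) for each label in the batch."""
    Y = np.zeros((len(y), K))
    Y[np.arange(len(y)), y] = 1.0
    return Y

def cross_entropy(H, y):
    """E = - sum_i sum_k y_k^(i) log h_k^L(x^(i)) for a mini-batch.
    H: (B, K) output probabilities, y: (B,) integer labels in {0, ..., K-1}."""
    Y = one_hot(y, H.shape[1])
    return -np.sum(Y * np.log(H))

# Made-up batch of B = 2 outputs over K = 3 classes.
H = np.array([[0.7, 0.2, 0.1],
              [0.1, 0.1, 0.8]])
y = np.array([0, 2])
print(cross_entropy(H, y))   # equals -(log 0.7 + log 0.8)
```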
The training of a neural network is to find good parameters $w_{ij}^\ell$ and $b_i^\ell$ to minimize the error, which is the topic we will cover in the next subsection and in the subsequent lecture on Training Deep Neural Network.
For the regression problem, one usually uses the L2 error, so that
$$E = \sum_{i=1}^B \big|y^{(i)} - h^L(x^{(i)})\big|^2.$$
3.3 Backpropagation algorithm
Training of a neural network, in a nutshell, is a gradient descent algorithm
which is basically of the form:
$$\Delta w_{ij}^\ell = -\lambda \frac{\partial E}{\partial w_{ij}^\ell}, \qquad \Delta b_i^\ell = -\lambda \frac{\partial E}{\partial b_i^\ell},$$
where
$$\Delta w_{ij}^\ell = w_{ij}^\ell(\mathrm{new}) - w_{ij}^\ell(\mathrm{old}), \qquad \Delta b_i^\ell = b_i^\ell(\mathrm{new}) - b_i^\ell(\mathrm{old}).$$
In fact, deep learning training methods are not this simplistic, but all of
them are variations of one form or another of this basic idea of gradient
descent. For more details, the reader is referred to the forthcoming lecture
on "Training Deep Neural Network."
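In code, one step of this basic update is just the following sketch; here dE_dWs and dE_dbs stand for the gradients delivered by backpropagation, and lam plays the role of λ (all names are illustrative, not from any particular library).

```python
def sgd_step(Ws, bs, dE_dWs, dE_dbs, lam=0.01):
    """In-place gradient descent update for NumPy parameter arrays:
    W <- W - lam * dE/dW and b <- b - lam * dE/db for every layer."""
    for W, b, dW, db in zip(Ws, bs, dE_dWs, dE_dbs):
        W -= lam * dW   # Delta w = -lambda * dE/dw
        b -= lam * db   # Delta b = -lambda * dE/db
    return Ws, bs
```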
So to do the training it is necessary to compute $\frac{\partial E}{\partial w_{ij}^\ell}$ and $\frac{\partial E}{\partial b_i^\ell}$ efficiently. At the output layer, Layer $L$, it is easy to determine $\frac{\partial E}{\partial h_i^L}$, as we know the expression of the error $E$. By the chain rule, we can compute
$$\frac{\partial E}{\partial z_j^L} = \sum_i \frac{\partial h_i^L}{\partial z_j^L}\frac{\partial E}{\partial h_i^L}. \qquad (6)$$
The variable dependencies are depicted in Figure 20. From this we can easily see that
$$\frac{\partial E}{\partial z_i^\ell} = \frac{dh_i^\ell}{dz_i^\ell}\frac{\partial E}{\partial h_i^\ell},$$
where
$$\frac{dh_i^\ell}{dz_i^\ell} = \begin{cases} h_i^\ell(1 - h_i^\ell) & \text{if } \varphi^\ell \text{ is } \sigma \\ I(z_i^\ell \ge 0) & \text{if } \varphi^\ell \text{ is ReLU} \\ \operatorname{sech}^2 z_i^\ell & \text{if } \varphi^\ell \text{ is } \tanh \end{cases}$$
Summary
• By (3) and (5), the data flows forward, i.e. from Layer $\ell - 1$ to Layer $\ell$; hence the name feedforward network.
• By (6) and (7), the error derivative, not the actual error, can be computed backward, i.e.
$$\frac{\partial E}{\partial z_i^{\ell-1}} \xleftarrow{\text{by (6)}} \frac{\partial E}{\partial h_i^{\ell-1}} \xleftarrow{\text{by (7)}} \frac{\partial E}{\partial z_i^{\ell}} \xleftarrow{\text{by (6)}} \frac{\partial E}{\partial h_i^{\ell}} \leftarrow \cdots$$
• Equations (8) and (9) are the basic equations to be used for the gradient descent algorithm for learning.
Let $\big(\frac{\partial h^\ell}{\partial z^\ell}\big)$ be the Jacobian matrix whose $(i, j)$-th entry is $\frac{\partial h_i^\ell}{\partial z_j^\ell}$. Then in matrix form (6) can be written as
$$\left(\frac{\partial E}{\partial z^L}\right) = \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right).$$
Note that the chain rule
$$\frac{\partial h_i^\ell}{\partial h_j^{\ell-1}} = \sum_k \frac{\partial z_k^\ell}{\partial h_j^{\ell-1}}\frac{\partial h_i^\ell}{\partial z_k^\ell}$$
can be written in matrix form as
$$\left(\frac{\partial h^\ell}{\partial h^{\ell-1}}\right) = \left(\frac{\partial z^\ell}{\partial h^{\ell-1}}\right)\left(\frac{\partial h^\ell}{\partial z^\ell}\right).$$
Thus the vector used in the backpropagation rule is nothing but the matrix product of the following form:
$$\left(\frac{\partial E}{\partial h^\ell}\right) = \left(\frac{\partial z^{\ell+1}}{\partial h^\ell}\right)\left(\frac{\partial h^{\ell+1}}{\partial z^{\ell+1}}\right) \cdots \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right), \qquad (10)$$
and
$$\left(\frac{\partial E}{\partial z^\ell}\right) = \left(\frac{\partial h^\ell}{\partial z^\ell}\right)\left(\frac{\partial z^{\ell+1}}{\partial h^\ell}\right)\left(\frac{\partial h^{\ell+1}}{\partial z^{\ell+1}}\right) \cdots \left(\frac{\partial h^L}{\partial z^L}\right)\left(\frac{\partial E}{\partial h^L}\right). \qquad (11)$$
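The sketch below carries out these products for a small all-sigmoid network with a squared error at the output, and checks one gradient entry against a numerical derivative. The error choice and the affine pre-activation $z^\ell = W^\ell h^{\ell-1} + b^\ell$ are assumptions made for the example; for the sigmoid, the Jacobian $\frac{\partial h^\ell}{\partial z^\ell}$ is diagonal, so the matrix products reduce to elementwise multiplications and multiplications by the transposed weight matrices.

```python
import numpy as np

def sigmoid(t): return 1.0 / (1.0 + np.exp(-t))

def forward(x, Ws, bs):
    hs = [x]
    for W, b in zip(Ws, bs):
        z = W @ hs[-1] + b            # affine pre-activation (assumed form)
        hs.append(sigmoid(z))         # h^l = sigma(z^l), as in (5)
    return hs

def backward(hs, Ws, y):
    """Backward pass in the spirit of (10)-(11) for the error E = |y - h^L|^2."""
    grad_h = 2.0 * (hs[-1] - y)                  # dE/dh^L
    dWs, dbs = [], []
    for l in reversed(range(len(Ws))):
        dh_dz = hs[l + 1] * (1.0 - hs[l + 1])    # diagonal of dh^l/dz^l for sigma
        grad_z = dh_dz * grad_h                  # dE/dz^l, cf. (11)
        dWs.insert(0, np.outer(grad_z, hs[l]))   # dE/dW^l = (dE/dz^l)(h^(l-1))^T
        dbs.insert(0, grad_z)                    # dE/db^l = dE/dz^l
        grad_h = Ws[l].T @ grad_z                # dE/dh^(l-1), cf. (10)
    return dWs, dbs

# Compare one backpropagated gradient entry with a numerical derivative.
rng = np.random.default_rng(2)
Ws = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]
bs = [rng.normal(size=3), rng.normal(size=1)]
x, y = np.array([0.3, -0.7]), np.array([1.0])

dWs, dbs = backward(forward(x, Ws, bs), Ws, y)

eps = 1e-6
Ws[0][0, 0] += eps
E_plus = np.sum((y - forward(x, Ws, bs)[-1]) ** 2)
Ws[0][0, 0] -= 2 * eps
E_minus = np.sum((y - forward(x, Ws, bs)[-1]) ** 2)
print(dWs[0][0, 0], (E_plus - E_minus) / (2 * eps))   # the two values agree
```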
The arithmetic rules of dual numbers of the form a + a'ε are as follows:
$$\lambda(a + a'\varepsilon) = \lambda a + \lambda a'\varepsilon,$$
$$(a + a'\varepsilon) + (b + b'\varepsilon) = (a + b) + (a' + b')\varepsilon,$$
$$(a + a'\varepsilon) - (b + b'\varepsilon) = (a - b) + (a' - b')\varepsilon,$$
$$(a + a'\varepsilon)(b + b'\varepsilon) = ab + (a'b + ab')\varepsilon,$$
$$\frac{1}{a + a'\varepsilon} = \frac{1}{a} - \frac{a'}{a^2}\varepsilon.$$
Note that if we put a = b = 0 and a' = b' = 1 in the fourth line above, we get
$$\varepsilon^2 = 0,$$
from which it is easy to prove by induction that
$$(x + x'\varepsilon)^n = x^n + n x^{n-1} x'\varepsilon.$$
More generally, for a differentiable function f,
$$f(x + x'\varepsilon) = f(x) + f'(x)\,x'\varepsilon,$$
which is basically the first-order Taylor series expansion. This mysterious ε can be envisioned as being sort of like the smallest floating point number representable in a computer, so that ε² = 0. This is also reminiscent of the complex number i satisfying i² = −1.
Let us see how this dual number can be used to compute partial derivatives. Suppose f(x, y) = xy² + y + 5. Since ∂f/∂y(x, y) = 2xy + 1, we can directly compute f(2, 3) = 26 and ∂f/∂y(2, 3) = 13.
To see how automatic differentiation can be used to do the same, we look at a computational tree for the computation of f(x, y).
Figure 21: AutoDiff in Forward Mode
An example of the computational graph for this computation is given in Figure 21. Here, to compute ∂f/∂y(2, 3), one feeds the values x = 2 and y = 3 + ε into the bottom variables. One then follows the arrows of the computational graph and performs the appropriate operations, whose results are recorded on the corresponding links. From the value at the top node, one can read off f(2, 3) = 26 and ∂f/∂y(2, 3) = 13. Note that to compute ∂f/∂x(2, 3), we must redo a similar computation, f(2 + ε, 3), on the same computational graph all over again by feeding x = 2 + ε and y = 3. This
may be quite burdensome if there are millions of variables to deal with as is
the case in deep learning. This way of directly computing partial derivatives
is called the forward mode automatic differentiation.
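A tiny dual-number class makes this concrete: it implements the arithmetic rules above, and feeding x = 2 and y = 3 + ε reproduces f(2, 3) = 26 and ∂f/∂y(2, 3) = 13. This is a sketch of the idea only, not of any particular automatic differentiation library.

```python
class Dual:
    """Dual number a + a' * eps with eps**2 = 0; the second slot carries the derivative."""
    def __init__(self, a, da=0.0):
        self.a, self.da = a, da
    def __add__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        return Dual(self.a + other.a, self.da + other.da)
    __radd__ = __add__
    def __mul__(self, other):
        other = other if isinstance(other, Dual) else Dual(other)
        # (a + a'eps)(b + b'eps) = ab + (a'b + ab')eps
        return Dual(self.a * other.a, self.da * other.a + self.a * other.da)
    __rmul__ = __mul__

def f(x, y):
    return x * y * y + y + 5

# Forward mode with respect to y: feed x = 2 and y = 3 + eps.
out = f(Dual(2.0), Dual(3.0, 1.0))
print(out.a, out.da)   # 26.0 13.0
```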
An alternative method, called the reverse mode automatic differentiation, is devised to alleviate this problem. It is basically a repeated application of the chain rule. It goes as follows. First, use the forward computational graph to compute the value of each node in the forward pass, i.e. from the input values at the bottom nodes upward through all the nodes. The number inside each node in Figure 22 is the value of that node. The symbol ni at the left of each node is the node number, which one can think of as a variable representing the value of the corresponding node. The idea of reverse mode automatic differentiation is to trace the variable dependencies along the paths of the computational graph and apply the appropriate chain rules.
Note that f = n1 and n1 = n2 + n3. Thus ∂n1/∂n2 = 1, and this number is recorded on the link from n1 to n2. The number on the link from n1 to n3 is gotten similarly. Now n2 = n4 n5, from which we get ∂n2/∂n5 = n4 = 9. Therefore we have
$$\frac{\partial n_1}{\partial n_5} = \frac{\partial n_1}{\partial n_2}\frac{\partial n_2}{\partial n_5} = 9,$$
which is recorded on the link from n2 to n5. Note that this is the chain rule (along the path from n1 to n5). The reason why no other variables enter
into this chain rule calculation can be easily seen: namely, a change in n5
only affects n2 , which again affects n1 , and no other variables are affected.
One can read off this dependency from the graph structure, i.e. there is only
one path from n1 to n5. Similarly, ∂n2/∂n4 = n5 = 2. Thus
$$\frac{\partial n_1}{\partial n_4} = \frac{\partial n_1}{\partial n_2}\frac{\partial n_2}{\partial n_4} = 2,$$
which is recorded on the link from n2 to n4. Now n4 = n6². Thus
$$\frac{\partial n_4}{\partial n_6} = 2 n_6 = 6. \qquad (12)$$
Thus
$$\frac{\partial n_1}{\partial n_6} = \frac{\partial n_1}{\partial n_4}\frac{\partial n_4}{\partial n_6} = 12,$$
which is recorded on the link from n4 to n6. Note that this does not yet give the value of ∂f/∂y(2, 3), although in terms of values of variables, f = n1 and y = n6. The reason is that there is another path from n1 to n6, via n3, and this calculation only reflects the dependency of n1 on n6 through the path via n4; the number 12 recorded on the link from n4 to n6 represents only that part of the dependency. Computing similarly along the other path, through n3, we get the number 1 on the
link from n3 to n6 . Since n1 depends on n6 through these two paths, we have
to add these two numbers to get
$$\frac{\partial f}{\partial y}(2, 3) = 12 + 1 = 13.$$
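The same bookkeeping can be written out as a small reverse sweep over the node values and local derivatives of this particular graph, with the node numbering used above; a real automatic differentiation system performs the equivalent bookkeeping on arbitrary computational graphs.

```python
# Forward pass: node values for f(x, y) = x*y**2 + y + 5 at (x, y) = (2, 3),
# numbered as above: n6 = y, n5 = x, n4 = n6**2, n3 = n6 + 5, n2 = n4*n5, n1 = n2 + n3.
n6, n5 = 3.0, 2.0
n4 = n6 ** 2
n3 = n6 + 5.0
n2 = n4 * n5
n1 = n2 + n3

# Reverse sweep: accumulate dn1/dn_k, summing the contributions of all incoming paths.
d = {1: 1.0}
d[2] = d[1] * 1.0                     # n1 = n2 + n3
d[3] = d[1] * 1.0
d[4] = d[2] * n5                      # n2 = n4 * n5, so dn2/dn4 = n5 = 2
d[5] = d[2] * n4                      # dn2/dn5 = n4 = 9
d[6] = d[4] * (2 * n6) + d[3] * 1.0   # two paths into n6: via n4 (12) and via n3 (1)
print(n1, d[6])                       # 26.0 13.0
```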
References
[1] Community Portal for Automatic Differentiation
http://www.autodiff.org/
[2] Goodfellow, I., Bengio, Y., Courville, A., Deep Learning, MIT Press
(2016)
[4] Rosenblatt, F., A Probabilistic Model for Information Storage and Organization in the Brain, Cornell Aeronautical Laboratory, Psychological Review, v65, No. 6, 386–408 (1958)
[6] Rumelhart, D., McClelland, J., PDP Research Group, Parallel Dis-
tributed Processing, vol 1 & 2, Bradford Book (1987)