
Statistical Machine Learning Notes 9

Automatic Differentiation and Neural Networks


Instructor: Justin Domke

Contents
1 Introduction

2 Automatic Differentiation

3 Multi-Layer Perceptrons

4 MNIST

5 Backpropagation

6 Discussion

1 Introduction
The name “neural network” is sometimes used to refer to many things (e.g. Hopfield networks,
self-organizing maps). In these notes, we are only interested in the most common type of
neural network, the multi-layer perceptron.
A basic problem in machine learning is function approximation. We have some inputs x̂ and
some outputs ŷ, and we want to fit some function f such that it predicts ŷ well. (For some
definition of “well”.) The basic idea of neural networks is very simple. We will make up
a big nonlinear function f(x, θ), parameterized by some vector θ. Next, we pick some loss
function, measuring how well f predicts y. Finally, we use some local optimization algorithm
to minimize the empirical risk

Σ_{(x̂, ŷ)} L(f(x̂; θ), ŷ).    (1.1)


We can also, of course, add a regularization penalty to θ.


As we will see below, multi-layer perceptrons use a quite powerful class of functions f, and
so could approximate many mappings between x and y. The price we pay for this is that
the empirical risk is almost always non-convex. Thus, local minima are a fact of life with
neural networks. How we should react to this fact is a matter of debate.
Understanding the particular class of functions f used in neural networks is not too hard.
The main technical problem is this: how do we compute the derivative of the loss function
for some particular input x̂? What is

dL(f(x̂; θ)) / dθ ?

You may wonder what the problem is. We have a function... We want its derivatives... Isn’t
the answer just a matter of calculus? Unfortunately, basic calculus will not get the job done.
(At least, not quite.) The problem is that f is very large, so much so that we won’t generally
write it down in any closed-form. Even if we did, we would find that just applying standard
calculus rules would cause it to “explode”. That is, the “closed-form” for the derivatives would
be gigantic, compared to the (already huge) form of f. The practical meaning of this is that,
without being careful, it would be much more computationally expensive to compute the
gradient than to compute f.
Luckily, if we are a little bit clever, we can compute the gradient in the same time complexity
as computing f. The method for doing this is called the “backpropagation” algorithm. It
turns out, however, that backpropagation is really a special case of a technique from
numerical analysis known as automatic differentiation. It may be somewhat easier to
understand the basic idea of backprop by seeing the more general algorithm first. For that
reason, we will first talk about autodiff.

2 Automatic Differentiation
Let’s start with an example. Consider the scalar function

f = exp(exp(x) + exp(x)^2) + sin(exp(x) + exp(x)^2)    (2.1)

It isn’t too hard to explicitly write down an expression for the derivative of this.

df/dx = exp(exp(x) + exp(x)^2)(exp(x) + 2 exp(x)^2) + cos(exp(x) + exp(x)^2)(exp(x) + 2 exp(x)^2)    (2.2)

Another way to attack this would be to just define some intermediate variables. Say

a = exp(x)
b = a^2
c = a + b
d = exp(c)
e = sin(c)
f = d + e.    (2.3)

It is convenient to draw a little graph picturing the relationship between all the variables.

Then, we can mechanically write down the derivatives of the individual terms. Given all
these, we can work backwards to compute the derivative of f with respect to each variable.
This is just an application of the chain rule. We have the derivatives with respect to d and
e above. Then, we can do

df/dd = 1
df/de = 1
df/dc = (df/dd)(dd/dc) + (df/de)(de/dc) = exp(c)(df/dd) + cos(c)(df/de)
df/db = (df/dc)(dc/db) = df/dc
df/da = (df/dc)(dc/da) + (df/db)(db/da) = df/dc + 2a(df/db)
df/dx = (df/da)(da/dx) = exp(x)(df/da).    (2.4)
In this way, we can work backwards from the end of the graph, computing the derivative of
each variable, making use of the derivatives of the children of that variable.
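
To make the bookkeeping concrete, here is a small Python sketch (an illustration added to these notes, not part of the original derivation) that runs the forward pass of Eq. 2.3 and then the reverse pass of Eq. 2.4, and checks the result against the closed-form derivative in Eq. 2.2.

```python
import math

def f_and_grad(x):
    # Forward pass: the intermediate variables of Eq. 2.3.
    a = math.exp(x)
    b = a ** 2
    c = a + b
    d = math.exp(c)
    e = math.sin(c)
    f = d + e

    # Reverse pass: the chain-rule steps of Eq. 2.4, from f back to x.
    df_dd = 1.0
    df_de = 1.0
    df_dc = math.exp(c) * df_dd + math.cos(c) * df_de
    df_db = df_dc
    df_da = df_dc + 2 * a * df_db
    df_dx = math.exp(x) * df_da
    return f, df_dx

# Sanity check against the closed-form derivative of Eq. 2.2 at x = 0.3.
x = 0.3
u = math.exp(x) + math.exp(x) ** 2
closed_form = (math.exp(u) + math.cos(u)) * (math.exp(x) + 2 * math.exp(x) ** 2)
print(f_and_grad(x)[1], closed_form)  # the two numbers agree
```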

Something important happened here. Notice that when we differentiated the original ex-
pression in Eq. 2.1, we obtained a significantly larger expression in Eq. 2.2. However, when
we represent the function as a sequence of basic operations, as in Eq. 2.3, then we recover
a very similar sequence of basic operations for computing the derivatives, as in Eq. 2.4.
Automatic differentiation is essentially just a formalization of what we did above. We can
represent many functions f : ℜ^n → ℜ as expression graphs.

Forward Propagation

For i = n + 1, n + 2, ..., N:
    xi ← gi(xPa(i))

Here, we consider x1, ..., xn to be the input, x_{n+1}, ..., x_{N−1} to be the intermediate values,
and xN to be the final function value. The functions gi are the elementary functions evaluated
on the “parents” Pa(i) of variable i.
Now, given a function represented in this way, we can just apply the chain rule step-by-step
to compute derivatives. By definition, f = xN , and so

df/dxN = 1.
Meanwhile, for other values of xi , we have

df/dxi = Σ_{j : i ∈ Pa(j)} (df/dxj)(dxj/dxi)
       = Σ_{j : i ∈ Pa(j)} (df/dxj)(dgj/dxi).

Thus, we can compute the derivatives by the following algorithm.

Back Propagation

(Do forward propagation)


df/dxN ← 1
For i = N − 1, N − 2, ..., 1:
    df/dxi ← Σ_{j : i ∈ Pa(j)} (df/dxj)(dgj/dxi)

The wonderful thing about this is that it works for any differentiable function that can be
phrased as an expression graph. One can really think of this as differentiating programs,
rather than “functions”. The back propagation step always has the same complexity as the
original forward propagation step.
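
As a sketch of these two loops in Python: the graph encoding used below (a parent tuple per node, plus elementary functions that return both their value and their local partial derivatives) is just one convenient representation chosen for this illustration, here instantiated on the example of Eq. 2.3.

```python
import math

# Node 0 is the input x; nodes 1..6 encode Eq. 2.3:
#   x1 = exp(x0), x2 = x1^2, x3 = x1 + x2, x4 = exp(x3), x5 = sin(x3), x6 = x4 + x5.
# Each elementary function g_i returns (value, local partials w.r.t. each parent).
parents = {1: (0,), 2: (1,), 3: (1, 2), 4: (3,), 5: (3,), 6: (4, 5)}
funcs = {
    1: lambda a: (math.exp(a), (math.exp(a),)),
    2: lambda a: (a ** 2, (2 * a,)),
    3: lambda a, b: (a + b, (1.0, 1.0)),
    4: lambda a: (math.exp(a), (math.exp(a),)),
    5: lambda a: (math.sin(a), (math.cos(a),)),
    6: lambda a, b: (a + b, (1.0, 1.0)),
}

def reverse_mode(x_in):
    N = 7
    x, local = [0.0] * N, {}
    x[0] = x_in
    for i in range(1, N):                       # forward propagation
        x[i], local[i] = funcs[i](*[x[p] for p in parents[i]])
    grad = [0.0] * N
    grad[N - 1] = 1.0                           # df/dx_N = 1
    for i in range(N - 1, 0, -1):               # back propagation
        for q, p in enumerate(parents[i]):
            # Scattering grad[i] * dg_i/dx_p into each parent p accumulates
            # exactly the sum over j with i in Pa(j) from the algorithm above.
            grad[p] += grad[i] * local[i][q]
    return x[N - 1], grad[0]

print(reverse_mode(0.3))                        # (f(0.3), df/dx at 0.3)
```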
There are limitations, of course. Notably, if your program isn’t differentiable, this can’t work.
There is also difficulty in handling conditionals like if statements.
Notice also that our entire discussion above was for computing derivatives of functions from
some vector of inputs to a single output. That is, for differentiating functions f : ℜ^n → ℜ.
This is actually a special case of automatic differentiation, known as "reverse-mode". There
exists another (simpler) "forward-mode" of automatic differentiation that can efficiently
compute the derivatives of a function f : ℜ → ℜ^m of a scalar input to a vector of outputs.
There are also more general algorithms for computing the derivatives of functions from vector
inputs to vector outputs f : ℜ^n → ℜ^m. However, these algorithms are in general slower than
just computing f, unlike reverse mode (forward mode) with a single output (input). In any
case, the functions we are interested in in machine learning are from a vector of parameters
to a scalar loss function value, so reverse-mode is all we really need.

3 Multi-Layer Perceptrons
A multi-layer perceptron is just a big network of units connected together by simple
functions.

We will denote the values computed in the network by vi . We put the inputs x1 , ..., xn into
the network by setting v1 = x1 , v2 = x2 , ..., vn = xn . Then, in a neural network, a new unit
computes its value by (for i > n)


vi = σi(wi · vPa(i)),

where Pa(i) are the “parents” of node i. For example, if Pa(i) = (1, 3, 7), then

vPa(i) = (v1, v3, v7).

Here, σi is a possibly nonlinear function. The most common functions are either σ(a) = a
(the identity function), or a “sigmoid” function like

σ(a) = 1/(1 + exp(−a))    or    σ(a) = tanh(a).

[Figure: plot of a sigmoid function σ(a) for a from −5 to 5.]

(Note: there are several “sigmoid” functions in the literature that have a shape that generally
looks like the above. In general, the exact choice of the function does not seem to be too
important.) So, given the weights wi for each variable, computing the network values is
quite trivial.

Forward Propagation (Compute f(x) and L(f(x), y))

Input x.
Set v_{1,...,n} ← x
For i = n + 1, n + 2, ..., N:
    vi ← σi(wi · vPa(i))
Set f ← (v_{N−M+1}, v_{N−M+2}, ..., vN)
Set L ← L(f, y).

We consider the last M values of v to be the output. So, for example, if M = 1, f(x) = vN.
If M = 3, f(x) = (v_{N−2}, v_{N−1}, vN).
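
As a Python sketch of the forward propagation box above (with zero-based indexing and a tiny made-up network, rather than anything from these notes):

```python
import numpy as np

def forward(x, parents, weights, sigmas, M):
    """Forward propagation: x is the input vector, parents[i] is Pa(i),
    weights[i] is w_i, sigmas[i] is sigma_i, and f is the last M values of v."""
    n = len(x)
    N = n + len(parents)
    v = np.zeros(N)
    v[:n] = x
    for i in range(n, N):                         # units in topological order
        v[i] = sigmas[i](np.dot(weights[i], v[list(parents[i])]))
    return v[N - M:]                              # f(x)

# Toy network: 2 inputs, one tanh hidden unit, one identity output unit (M = 1).
x = np.array([0.5, -1.0])
parents = {2: (0, 1), 3: (2,)}
weights = {2: np.array([0.3, 0.7]), 3: np.array([1.5])}
sigmas = {2: np.tanh, 3: lambda a: a}
print(forward(x, parents, weights, sigmas, M=1))
```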
To do learning with neural networks, all that we need to do is fit the weights wi to minimize
the empirical risk. The technically difficult part of this is calculating {dL/dwi }. If we have
those values, then we could apply any standard smooth unconstrained optimization method,
such as BFGS, gradient descent, or stochastic gradient descent. Before worrying about how
to calculate {dL/dwi}, let’s consider an example.

4 MNIST
We return to the MNIST database of handwritten digits, consisting of 6,000 samples of each
of the digits 1, 2, ..., 9, 0. Each sample is a 28×28 grayscale image.
Here, we make use of the multiclass logistic loss.

L(f, ŷ) = log Σ_y exp(fy) − fŷ

This loss can be understood intuitively as follows. Given a vector f, “log-sum-exp” can be
understood as a “soft max”. Very roughly speaking,¹

log Σ_y exp(fy) ≈≈ max_y fy.

Thus, the logistic loss can be thought of as the difference between the maximum value fy
and the value fŷ . So, if fŷ is much bigger than the other values, we have near zero loss. If
there is some fy, y ≠ ŷ, much bigger than all the others, then we suffer approximately the
loss fy − fŷ.
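
A minimal Python sketch of this loss follows; the max-shift inside the log-sum-exp is a standard numerical-stability trick added for this sketch, not something discussed in the notes.

```python
import numpy as np

def multiclass_logistic_loss(f, y_hat):
    """L(f, y_hat) = log(sum_y exp(f_y)) - f_{y_hat}."""
    m = np.max(f)                                   # shift for numerical stability
    log_sum_exp = m + np.log(np.sum(np.exp(f - m)))
    return log_sum_exp - f[y_hat]

f = np.array([0.1, 5.0, -2.0, 0.3])
print(multiclass_logistic_loss(f, 1))   # near zero: the correct class dominates
print(multiclass_logistic_loss(f, 2))   # roughly 5.0 - (-2.0) = 7.0
```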
Here, a network was used with 72 “hidden” units fully connected to all inputs, with a tanh
sigmoid function. The learned weights are shown below.
¹ The two ’≈’ symbols here are because we are being extra approximate.

These 72 hidden units are then fully connected to 10 output units. These used an identity
function. The learned weights are shown below.
[Figure: the learned output weights, one image per class, ordered 1 2 3 4 5 (top row) and 6 7 8 9 0 (bottom row).]

We could also think about this in a vector form. If the input x is our 784-dimensional vector,
we obtain the hidden representation by computing

h = tanh(Mx),

where M is a 72×784 matrix. Then, we produce the output by

f = W h,

where W is a 10×72 matrix. Thus, we are essentially computing a linear classifier on the
basis expansion tanh(Mx). The difference is this: we fit the basis expansion, as well as the
linear classifier.
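
In code, this vectorized forward pass is just two matrix-vector products; the sketch below uses small random matrices as stand-ins for the learned M and W.

```python
import numpy as np

rng = np.random.default_rng(0)
M = 0.01 * rng.normal(size=(72, 784))   # hidden weights, stand-in for the learned M
W = 0.01 * rng.normal(size=(10, 72))    # output weights, stand-in for the learned W

x = rng.random(784)                     # a flattened 28x28 image
h = np.tanh(M @ x)                      # 72 hidden values: the fitted basis expansion
f = W @ h                               # 10 output scores, one per digit class
print(h.shape, f.shape)                 # (72,) (10,)
```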
We can show the results of running the neural network on a variety of images. The features
show the value computed by each of the hidden nodes, while the scores show the value
computed by all the output nodes. The scores are organized as above, from 1 to 0.

[Figure: example query images with their hidden features (h) and output scores (f).]

5 Backpropagation
Fundamentally, backpropagation is just a special case of reverse-mode autodiff, applied to a
neural network. We will derive the method here from first principles, but keep in mind that
autodiff can generalize to essentially arbitrary expression graphs.

We can compute the derivatives of the loss with respect to f directly. Thus, we have

dL/dvi,   i ≥ N − M + 1.

Now, the good-old calculus 101 chain rule tells us that

dL/dvi = Σ_{j : i ∈ Pa(j)} (dL/dvj)(dvj/dvi).

Now, recall that we compute the value of vi by

vi = σi(wi · vPa(i)).

From this, we can calculate that

dvi/dvj = σi′(wi · vPa(i)) wiq,   where Pa(i)q = j.    (5.1)

This is a little hard to understand in the abstract, but can be made clear by an example.
Suppose Pa(i) = (2, 7, 9). Then

vi = σi(wi · v(2,7,9)).

It is not hard to see that we have the three derivatives

dvi/dv2 = σi′(wi · v(2,7,9)) wi1
dvi/dv7 = σi′(wi · v(2,7,9)) wi2
dvi/dv9 = σi′(wi · v(2,7,9)) wi3.

Now, defining the notation vi′ = σi′(wi · vPa(i)), we can rewrite Eq. 5.1 as

dvi/dvj = vi′ wiq,   where Pa(i)q = j.    (5.2)

Yet another way to write this, in vector notation, is



dvi/dvPa(i) = vi′ wi.
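
A few lines of numpy can confirm Eq. 5.2 and its vector form on the Pa(i) = (2, 7, 9) example; the particular numbers below are arbitrary stand-ins.

```python
import numpy as np

v_pa = np.array([0.4, -1.2, 0.8])            # v_{Pa(i)} = (v_2, v_7, v_9), arbitrary
w_i = np.array([0.5, 0.3, -0.7])             # w_i, arbitrary
sigma, dsigma = np.tanh, lambda a: 1.0 - np.tanh(a) ** 2

a = np.dot(w_i, v_pa)
v_i_prime = dsigma(a)                        # v_i' = sigma_i'(w_i . v_{Pa(i)})
dvi_dvpa = v_i_prime * w_i                   # (dv_i/dv_2, dv_i/dv_7, dv_i/dv_9)

# Check the first component against a finite difference in v_2.
eps = 1e-6
numeric = (sigma(np.dot(w_i, v_pa + [eps, 0, 0])) - sigma(a)) / eps
print(dvi_dvpa[0], numeric)                  # the two numbers agree
```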

Meanwhile, we can also easily derive that

dL/dwi = (dL/dvi)(dvi/dwi)
       = (dL/dvi) σi′(wi · vPa(i)) vPa(i)
       = (dL/dvi) vi′ vPa(i).

Putting all the pieces together, we can derive the full Backpropagation algorithm.

Backpropagation (Compute f(x), L(f(x), y), and dL/dwi)

Input x.
Set v_{1,...,n} ← x
For i = n + 1, n + 2, ..., N:
    vi ← σi(wi · vPa(i))
    vi′ ← σi′(wi · vPa(i))
Set f ← (v_{N−M+1}, v_{N−M+2}, ..., vN)
Set L ← L(f, y)
Compute dL/df
Initialize dL/dvi ← 0
For m = 1, 2, ..., M:
    dL/dv_{N−M+m} ← dL/dfm
For i = N, N − 1, ..., n + 1:
    dL/dwi ← (dL/dvi) vi′ vPa(i)
    dL/dvPa(i) ← dL/dvPa(i) + (dL/dvi) vi′ wi
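
The boxed algorithm translates almost line-for-line into numpy. The sketch below is a hand-rolled illustration (zero-based indexing, a toy network, and a squared-error loss standing in for whatever loss one actually uses); it returns f, L, and the gradients {dL/dwi}.

```python
import numpy as np

def backprop(x, parents, weights, sigmas, dsigmas, M, y, loss, dloss):
    n, N = len(x), len(x) + len(parents)
    v, vp = np.zeros(N), np.zeros(N)
    v[:n] = x
    for i in range(n, N):                      # forward propagation
        a = np.dot(weights[i], v[list(parents[i])])
        v[i], vp[i] = sigmas[i](a), dsigmas[i](a)   # vp[i] is v_i'
    f = v[N - M:]
    L = loss(f, y)

    dL_dv = np.zeros(N)
    dL_dv[N - M:] = dloss(f, y)                # dL/dv for the M output units
    dL_dw = {}
    for i in range(N - 1, n - 1, -1):          # back propagation
        pa = list(parents[i])
        dL_dw[i] = dL_dv[i] * vp[i] * v[pa]            # dL/dw_i
        dL_dv[pa] += dL_dv[i] * vp[i] * weights[i]     # accumulate into dL/dv_{Pa(i)}
    return f, L, dL_dw

# Toy network: 2 inputs, two tanh hidden units, one identity output (M = 1),
# with squared-error loss L(f, y) = 0.5 * ||f - y||^2.
x, y = np.array([0.5, -1.0]), np.array([0.2])
parents = {2: (0, 1), 3: (0, 1), 4: (2, 3)}
weights = {2: np.array([0.3, 0.7]), 3: np.array([-0.2, 0.4]), 4: np.array([1.5, -0.8])}
sigmas = {2: np.tanh, 3: np.tanh, 4: lambda a: a}
dsigmas = {2: lambda a: 1 - np.tanh(a) ** 2, 3: lambda a: 1 - np.tanh(a) ** 2,
           4: lambda a: 1.0}
f, L, grads = backprop(x, parents, weights, sigmas, dsigmas, 1, y,
                       lambda f, y: 0.5 * np.sum((f - y) ** 2), lambda f, y: f - y)
print(L, grads)
```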

Note that if we didn’t have Backpropagation, we could still calculate the derivatives dL/dw_{in}
with reasonable accuracy by finite differences.² This would require running the Forward
propagation algorithm roughly P times if we have a total of P parameters. It would also be
somewhat more numerically unstable.
² How would we do this? First, we would compute the loss function L0 for the weights {wi}. Next, we
would create weights {wi′} where w′_{jn} = w_{jn} for all j, n except for w_{in}, which we set equal to
w_{in} + ε for some small constant ε. Then we compute the loss L1 on the weights {wi′}. Finally, we can
approximate dL/dw_{in} ≈ (L1 − L0)/ε.
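
For completeness, here is that finite-difference approximation in code. This is a generic sketch: loss_fn stands for one full run of forward propagation ending in the loss, w is a flat vector of all P parameters, and ε = 10⁻⁶ is just a conventional choice.

```python
import numpy as np

def finite_difference_gradient(loss_fn, w, eps=1e-6):
    """Approximate dL/dw_k by (L(w + eps * e_k) - L(w)) / eps for every k.
    This needs one extra call to loss_fn per parameter, i.e. roughly P
    forward propagations for P parameters."""
    L0 = loss_fn(w)
    grad = np.zeros_like(w)
    for k in range(w.size):
        w_shift = w.copy()
        w_shift[k] += eps
        grad[k] = (loss_fn(w_shift) - L0) / eps
    return grad

# Example on a loss whose gradient we know exactly: L(w) = 0.5 * ||w||^2.
w = np.array([0.3, -1.2, 0.7])
print(finite_difference_gradient(lambda w: 0.5 * np.sum(w ** 2), w))  # approx w
```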

6 Discussion
Neural networks are relatively fast, as classifiers go, as long as there aren’t too many hidden
units. They also seem to work reasonably well for many problems, and so seem to be the
method of choice for a certain range of speed and accuracy requirements.
Perhaps the single biggest drawback of neural networks is the fact that their optimization
is usually non-convex. Local minima are a fact of life. In practice, neural networks seem
to usually find a reasonable solution when the number of layers is not too large, but find
poor solutions when using more than, say, 2 hidden layers. (Convolutional neural networks,
neural networks in which many weights are constrained to be the same in order to enforce
translational invariance, are claimed to be somewhat immune to these problems.)
It is possible to find better minima by, e.g., searching from many initial solutions. If we were
somehow able to identify the global optimum, however, the results would not necessarily be
better! Finding the global solution effectively means searching over a large set of candidate
functions. This decreases the bias of the method, but also increases the variance. This may
or may not be beneficial.
