
CSE 446: Machine Learning Lecture

Multi-layer Perceptrons & the Back-propagation Algorithm

Instructor: Sham Kakade

1 Terminology
• non-linear decision boundaries and the XOR function
• multi-layer neural networks & multi-layer perceptrons
• # of layers (definitions are sometimes inconsistent)
• input layer is x, output layer is y, and hidden layers in between
• activation function or transfer function or link function.
• forward propagation
• back propagation

Issues related to training are:

• non-convexity
• initialization
• weight symmetries and “symmetry breaking”
• saddle points & local optima & global optima
• vanishing gradients

2 Backprop (for MLPs)

2.1 MLPs
We can specify an L-hidden layer network as follows: given the outputs $\{z_j^{(l-1)}\}$ from layer $l - 1$, the input activations at layer $l$ are:
$$a_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)} + w_{j0}^{(l)},$$
where $w_{j0}^{(l)}$ is a "bias" term. For ease of exposition, we drop the bias term and proceed by assuming that:
$$a_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)}.$$

The output activation of each node is:
$$z_j^{(l)} = h(a_j^{(l)}).$$
Remark: The terminology of the "activation" is not necessarily used consistently. Sometimes "activation" refers only to the inputs $a$, as defined above, and sometimes only to the outputs $z$, as defined above. Sometimes the literature uses the terminology input activation and output activation. We will use the latter terminology as it is less confusing, or we will simply use the terminology "inputs" and "outputs" of nodes.
The target function/output, after we go through the L hidden layers, is then:
$$\hat{y}(x) = a^{(L+1)} = \sum_{i=1}^{d^{(L)}} w_i^{(L+1)} z_i^{(L)},$$
where we say that the output is the activation at level $L + 1$. If we have more than one output, then
$$\hat{y}_j(x) = a_j^{(L+1)} = \sum_{i=1}^{d^{(L)}} w_{ji}^{(L+1)} z_i^{(L)},$$
for $j \in \{1, \ldots, K\}$, if we had $K$ outputs (e.g. if we had $K$ classes). It is straightforward to generalize this to force $\hat{y}(x)$ to be bounded between 0 and 1 (using a sigmoid transfer function) or to have multiple outputs. Let us also use the convention that:
$$z_i^{(0)} = x[i].$$
The parameters of the model are all the weights
$$w^{(L+1)}, w^{(L)}, \ldots, w^{(1)}.$$
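To make the notation concrete, here is a minimal NumPy sketch of this forward computation (not part of the original notes). It assumes the transfer function $h$ is tanh, drops the bias terms as above, and stores the weights as a list of matrices where `W[l-1][j, i]` plays the role of $w_{ji}^{(l)}$.

```python
import numpy as np

def forward(x, W, h=np.tanh):
    """Forward pass for the bias-free MLP of Section 2.1.

    W is the list [w^(1), ..., w^(L+1)] of weight matrices; the output layer
    is linear, so yhat = a^(L+1).  The stored a's and z's are returned
    because the backward pass will need them.
    """
    a_list, z_list = [], [np.asarray(x)]   # z^(0) = x by convention
    z = z_list[0]
    for l, Wl in enumerate(W, start=1):
        a = Wl @ z                         # a^(l)_j = sum_i w^(l)_{ji} z^(l-1)_i
        a_list.append(a)
        if l < len(W):                     # hidden layers: z^(l) = h(a^(l))
            z = h(a)
            z_list.append(z)
    yhat = a_list[-1]                      # the output is the activation at level L+1
    return yhat, a_list, z_list
```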

2.2 The Back-Propagation Algorithm for MLPs

The back-propagation algorithm was introduced in [1].


In general, the loss function in an L-hidden layer network is:
$$L(w^{(1)}, w^{(2)}, \ldots, w^{(L+1)}) = \frac{1}{N} \sum_n \ell(y_n, \hat{y}(x_n)),$$
and for the special case of the square loss we have:
$$\frac{1}{N} \sum_n \ell(y_n, \hat{y}(x_n)) = \frac{1}{2N} \sum_n (y_n - \hat{y}(x_n))^2.$$

Again, we seek to compute:
$$\nabla \ell(y_n, \hat{y}(x_n)),$$
where the gradient is with respect to all the parameters.

The Forward Pass

Starting with the input x, go forward (from the input to the output layer), and compute and store in memory the variables
$$a^{(1)}, z^{(1)}, a^{(2)}, z^{(2)}, \ldots, a^{(L)}, z^{(L)}, a^{(L+1)}.$$

The Backward Pass

Note that $\ell(y, \hat{y})$ depends on all the parameters, although we will not write out this functional dependency (e.g. $\hat{y}$ depends on $x$ and all the weights).
We will compute the derivatives by recursion. It is useful to do the recursion by computing the derivatives with respect to the input activations and proceeding "backwards" (from the output layer to the input layer). Define:
$$\delta_j^{(l)} := \frac{\partial \ell(y, \hat{y})}{\partial a_j^{(l)}}.$$
First, let us see that if we had all the $\delta_j^{(l)}$'s, then we would be able to obtain the derivatives with respect to all of our parameters:
$$\frac{\partial \ell(y, \hat{y})}{\partial w_{ji}^{(l)}} = \frac{\partial \ell(y, \hat{y})}{\partial a_j^{(l)}} \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)},$$
where we have used the chain rule and that $\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$. To see that the latter claim is true, note
$$a_j^{(l)} = \sum_{c=1}^{d^{(l-1)}} w_{jc}^{(l)} z_c^{(l-1)}$$
(the sum is over all nodes $c$ in layer $l - 1$). This expression implies $\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$.

Now let us understand how to start the recursion, i.e. how to compute the $\delta$'s for the output layer. If there is only one output node and $\hat{y} = a^{(L+1)}$, then (using the square loss $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$)
$$\delta^{(L+1)} = \frac{\partial \ell(y, \hat{y})}{\partial a^{(L+1)}} = -(y - a^{(L+1)}) = -(y - \hat{y})$$
(so we do not need a subscript $j$, since there is only one node). Hence, for the output layer,
$$\frac{\partial \ell(y, \hat{y})}{\partial w_j^{(L+1)}} = \delta^{(L+1)} z_j^{(L)} = -(y - a^{(L+1)}) z_j^{(L)},$$
where we have used our expression for $\delta^{(L+1)}$.


Thus, we know how to start our recursion. Now let us proceed recursively, computing $\delta_j^{(l)}$ using the $\delta_k^{(l+1)}$'s. Observe that all of the functional dependence on the activations at layer $l$ goes through the activations at layer $l + 1$. This implies, using the chain rule,
$$\delta_j^{(l)} = \frac{\partial \ell(y, \hat{y})}{\partial a_j^{(l)}} = \sum_{k=1}^{d^{(l+1)}} \frac{\partial \ell(y, \hat{y})}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = \sum_{k=1}^{d^{(l+1)}} \delta_k^{(l+1)} \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}.$$
To complete the recursion, we need to evaluate $\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}$. By definition,
$$a_k^{(l+1)} = \sum_{c=1}^{d^{(l)}} w_{kc}^{(l+1)} z_c^{(l)} = \sum_{c=1}^{d^{(l)}} w_{kc}^{(l+1)} h(a_c^{(l)}).$$

This implies:
$$\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = w_{kj}^{(l+1)} h'(a_j^{(l)}),$$
and, by substitution,
$$\delta_j^{(l)} = h'(a_j^{(l)}) \sum_{k=1}^{d^{(l+1)}} w_{kj}^{(l+1)} \delta_k^{(l+1)},$$
which completes our recursion.
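Under the same assumed setup as the earlier forward-pass sketch (tanh units, so $h'(a) = 1 - \tanh(a)^2$, no biases, and the square loss on a single example), the recursion above can be written as follows; this `backward` function is a sketch, not the only way to organize the computation.

```python
import numpy as np

def backward(y, yhat, a_list, z_list, W):
    """Backward pass: computes delta^(l) for l = L+1, ..., 1 and the gradients
    d ell / d w^(l)_{ji} = delta^(l)_j z^(l-1)_i for the square loss on one example."""
    h_prime = lambda a: 1.0 - np.tanh(a) ** 2   # assumes h = tanh
    delta = -(y - yhat)                         # delta^(L+1) = -(y - yhat)
    grads = [None] * len(W)
    for l in range(len(W), 0, -1):              # l = L+1, L, ..., 1
        grads[l - 1] = np.outer(delta, z_list[l - 1])   # delta^(l)_j z^(l-1)_i
        if l > 1:
            # delta^(l-1)_j = h'(a^(l-1)_j) * sum_k w^(l)_{kj} delta^(l)_k
            delta = h_prime(a_list[l - 2]) * (W[l - 1].T @ delta)
    return grads
```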

2.3 The Algorithm

We are now ready to state the algorithm.


The Forward Pass:

1. Starting with the input x, go forward (from the input to the output layer), and compute and store in memory the variables $a^{(1)}, z^{(1)}, a^{(2)}, z^{(2)}, \ldots, a^{(L)}, z^{(L)}, a^{(L+1)}$.

The Backward Pass:

1. Initialize as follows:
$$\delta^{(L+1)} = -(y - \hat{y}) = -(y - a^{(L+1)}),$$
and compute the derivatives at the output layer:
$$\frac{\partial \ell(y, \hat{y})}{\partial w_j^{(L+1)}} = -(y - \hat{y}) z_j^{(L)}.$$

2. Recursively, compute
$$\delta_j^{(l)} = h'(a_j^{(l)}) \sum_{k=1}^{d^{(l+1)}} w_{kj}^{(l+1)} \delta_k^{(l+1)},$$
and also compute our derivatives at layer $l$:
$$\frac{\partial \ell(y, \hat{y})}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}.$$
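A standard way to sanity-check the algorithm (not part of the original notes) is to compare the backprop gradients against central finite differences of the loss. The sketch below reuses the hypothetical `forward` and `backward` functions from the earlier code blocks on a tiny one-hidden-layer network.

```python
import numpy as np

def loss(x, y, W):
    yhat, _, _ = forward(x, W)
    return 0.5 * np.sum((y - yhat) ** 2)        # square loss, single example

rng = np.random.default_rng(0)
W = [rng.normal(size=(3, 2)), rng.normal(size=(1, 3))]  # d^(0)=2, d^(1)=3, one output
x, y = rng.normal(size=2), rng.normal(size=1)

yhat, a_list, z_list = forward(x, W)
grads = backward(y, yhat, a_list, z_list, W)

eps = 1e-6
for l, Wl in enumerate(W):
    for idx in np.ndindex(*Wl.shape):
        Wl[idx] += eps
        plus = loss(x, y, W)
        Wl[idx] -= 2 * eps
        minus = loss(x, y, W)
        Wl[idx] += eps                          # restore the weight
        numeric = (plus - minus) / (2 * eps)
        assert np.isclose(grads[l][idx], numeric, atol=1e-5)
```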

3 Time Complexity

Assume that addition, subtraction, multiplication, and division take “one unit” of time (and let us ignore numerical
precision issues).
Let us say that the time complexity to compute $\ell(y, \hat{y}(x))$ is that of the naive algorithm (which explicitly just computes all the sums in the forward pass).
Theorem 3.1. Suppose that we can compute the derivative $h'(\cdot)$ in an amount of time that is within a constant factor of the time it takes to compute $h(\cdot)$ itself (and suppose that constant is 5). Using the backprop algorithm, the total time to compute both $\ell(y, \hat{y}(x))$ and $\nabla \ell(y, \hat{y}(x))$ (note that $\nabla \ell(y, \hat{y}(x))$ contains one partial derivative per parameter) is within a factor of 5 of the time to compute just the scalar value $\ell(y, \hat{y}(x))$.

Proof. The number of transfer function evaluations in the forward pass is just the number of nodes. In the backward pass, the number of times we need to evaluate the derivative of the transfer function is also just the number of nodes.
In the forward pass, note that the total computation time is a constant factor (actually 2) times the number of weights. To see this, note that computing any activation costs us one addition and one multiplication per weight involved (and every weight is involved in the computation of just one activation). Similarly, by examining the recursion in the backward pass, we see that the total compute time to get all the $\delta$'s is a constant factor times the number of weights (again, there is only one multiplication and one addition associated with each weight). Also, the compute time for obtaining the partial derivatives from the $\delta$'s is equal to the number of weights.
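As a rough illustration of this counting argument (an illustrative sketch, not from the original notes), one can tally the multiply/add operations for a concrete architecture; the layer widths below are arbitrary.

```python
def op_counts(layer_sizes):
    """Rough operation counts for the naive forward and backward passes of a
    bias-free MLP with the given layer widths (illustrative only)."""
    n_weights = sum(d_in * d_out
                    for d_in, d_out in zip(layer_sizes[:-1], layer_sizes[1:]))
    n_nodes = sum(layer_sizes[1:])      # transfer-function (or derivative) evaluations
    forward_ops = 2 * n_weights         # one multiply + one add per weight
    backward_ops = 2 * n_weights + n_weights   # delta recursion + the partials
    return n_weights, n_nodes, forward_ops, backward_ops

# e.g. widths d^(0)=784, d^(1)=100, d^(2)=10 (an arbitrary example):
print(op_counts([784, 100, 10]))        # (79400, 110, 158800, 238200)
```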

Remark: The fact that we defined the time complexity with respect to the "naive" algorithm is irrelevant. If we had a faster algorithm to compute $\ell(y, \hat{y}(x))$ (say, through a faster matrix multiplication algorithm or possibly other tricks), then the above theorem would still hold. This is the remarkable Baur-Strassen theorem [2] (also stated independently by [3]).

References
[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In
David E. Rumelhart, James L. McClelland, and PDP Research Group, editors, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
[2] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317–330, 1983.
[3] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Differentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2008.
