Backprop Unit 2
1 Terminology
• non-linear decision boundaries and the XOR function
• multi-layer neural networks & multi-layer perceptrons
• # of layers (definitions sometimes not consistent)
• input layer is x. output layer is y. hidden layers.
• activation function or transfer function or link function.
• forward propagation
• back propagation
• non-convexity
• initialization
• weight symmetries and “symmetry breaking”
• saddle points & local optima & global optima
• vanishing gradients
2.1 MLPs
We can specify an L-hidden-layer network as follows: given outputs $\{z_j^{(l-1)}\}$ from layer $l-1$, the input activations are:
$$a_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)} + w_{j0}^{(l)}$$
where $w_{j0}^{(l)}$ is a "bias" term. For ease of exposition, we drop the bias term and proceed by assuming that:
$$a_j^{(l)} = \sum_{i=1}^{d^{(l-1)}} w_{ji}^{(l)} z_i^{(l-1)}.$$
The output activation of each node is:
$$z_j^{(l)} = h(a_j^{(l)})$$
Remark: The terminology "activation" is not used consistently in the literature. Sometimes it refers only to the inputs $a$, as defined above, and sometimes only to the outputs $z$. The literature also uses the terms input activation and output activation; we will use this latter terminology, as it is less confusing, or simply use the terminology "inputs" and "outputs" of nodes.
The target function/output, after we go through the L hidden layers, is then:
$$\hat{y}(x) = a^{(L+1)} = \sum_{i=1}^{d^{(L)}} w_i^{(L+1)} z_i^{(L)},$$
where we are saying that the output is the activation at level $L+1$. If we have more than one output, then
$$\hat{y}_j(x) = a_j^{(L+1)} = \sum_{i=1}^{d^{(L)}} w_{ji}^{(L+1)} z_i^{(L)},$$
for $j \in \{1, \ldots, K\}$, if we had K outputs (e.g. if we had K classes). It is straightforward to generalize this to force $\hat{y}(x)$ to be bounded between 0 and 1 (using a sigmoid transfer function) or to have multiple outputs. Let us also use the convention that:
$$z_i^{(0)} = x[i]$$
The parameters of the model are all the weights $w_{ji}^{(l)}$.
Starting with the input x, go forward (from the input to the output layer), and compute and store in memory the variables
$$a^{(1)}, z^{(1)}, a^{(2)}, z^{(2)}, \ldots, a^{(L)}, z^{(L)}, a^{(L+1)}$$
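The forward pass above can be sketched in a few lines of NumPy. This is an illustrative sketch, not part of the notes: the layer sizes, the choice $h = \tanh$, and the function names are our own assumptions.

```python
import numpy as np

def forward(x, weights, h=np.tanh):
    """Forward pass: compute and store a^{(1)}, z^{(1)}, ..., a^{(L+1)}.
    weights[l] holds the w_{ji}^{(l+1)} mapping layer l to layer l+1
    (bias terms dropped, as in the text)."""
    z = [x]       # z^{(0)} = x, by the convention above
    a = [None]    # layer 0 has no input activation
    for l, W in enumerate(weights):
        a.append(W @ z[-1])                          # a^{(l+1)} = W z^{(l)}
        is_output = (l == len(weights) - 1)
        z.append(a[-1] if is_output else h(a[-1]))   # yhat = a^{(L+1)}, no h at the output
    return a, z

# tiny example: a 3 -> 4 -> 1 network (one hidden layer, L = 1)
rng = np.random.default_rng(0)
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
x = rng.standard_normal(3)
a, z = forward(x, weights)
yhat = a[-1]      # the output a^{(L+1)}
```

Storing every $a^{(l)}$ and $z^{(l)}$ (rather than only the output) is exactly what makes the backward pass below cheap.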
2.2 The Backward Pass
Note that $\ell(y, \hat{y})$ depends on all the parameters, and we will not write out this functional dependency (e.g. $\hat{y}$ depends on x and all the weights).
We will compute the derivatives by recursion. It is useful to do the recursion by computing the derivatives with respect to the input activations and proceeding "backwards" (from the output layer to the input layer). Define:
$$\delta_j^{(l)} := \frac{\partial \ell(y, \hat{y})}{\partial a_j^{(l)}}$$
First, let us see that if we had all the $\delta_j^{(l)}$'s then we are able to obtain the derivatives with respect to all of our parameters:
$$\frac{\partial \ell(y, \hat{y})}{\partial w_{ji}^{(l)}} = \frac{\partial \ell(y, \hat{y})}{\partial a_j^{(l)}} \, \frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}$$
where we have used the chain rule and that $\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$. To see the latter claim is true, note
$$a_j^{(l)} = \sum_{c=1}^{d^{(l-1)}} w_{jc}^{(l)} z_c^{(l-1)}$$
(the sum is over all nodes c in layer $l-1$). This expression implies $\frac{\partial a_j^{(l)}}{\partial w_{ji}^{(l)}} = z_i^{(l-1)}$.
Now let us understand how to start the recursion, i.e. to compute the $\delta$'s for the output layer. If there is only one node and $\hat{y} = a^{(L+1)}$, then for the squared loss $\ell(y, \hat{y}) = \frac{1}{2}(y - \hat{y})^2$,
$$\delta^{(L+1)} = \frac{\partial \ell(y, \hat{y})}{\partial a^{(L+1)}} = -(y - a^{(L+1)}) = -(y - \hat{y})$$
(so we don't need a subscript of j since there is only one node). Hence, for the output layer,
$$\frac{\partial \ell(y, \hat{y})}{\partial w_j^{(L+1)}} = \delta^{(L+1)} z_j^{(L)} = -(y - a^{(L+1)}) z_j^{(L)}$$
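This starting value can be sanity-checked numerically. The sketch below (our own illustration, using the squared loss consistent with the $-(y - \hat{y})$ derivative above) compares $\delta^{(L+1)}$ against a central finite difference:

```python
def loss(y, a_out):
    # squared loss: its derivative w.r.t. a_out is -(y - a_out)
    return 0.5 * (y - a_out) ** 2

y, a_out, eps = 1.3, 0.4, 1e-6
delta_out = -(y - a_out)   # delta^{(L+1)} from the formula above
fd = (loss(y, a_out + eps) - loss(y, a_out - eps)) / (2 * eps)
print(delta_out, fd)       # the two values agree
```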
For the recursive step, observe that
$$a_k^{(l+1)} = \sum_{c=1}^{d^{(l)}} w_{kc}^{(l+1)} z_c^{(l)} = \sum_{c=1}^{d^{(l)}} w_{kc}^{(l+1)} h(a_c^{(l)})$$
This implies:
$$\frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}} = w_{kj}^{(l+1)} h'(a_j^{(l)}),$$
and, by substitution into the chain rule $\delta_j^{(l)} = \sum_k \frac{\partial \ell(y,\hat{y})}{\partial a_k^{(l+1)}} \frac{\partial a_k^{(l+1)}}{\partial a_j^{(l)}}$,
$$\delta_j^{(l)} = h'(a_j^{(l)}) \sum_{k=1}^{d^{(l+1)}} w_{kj}^{(l+1)} \delta_k^{(l+1)}$$
The full backprop algorithm:
1. Starting with the input x, go forward (from the input to the output layer), and compute and store in memory the variables $a^{(1)}, z^{(1)}, a^{(2)}, z^{(2)}, \ldots, a^{(L)}, z^{(L)}, a^{(L+1)}$.
2. Initialize as follows:
$$\delta^{(L+1)} = -(y - \hat{y}) = -(y - a^{(L+1)})$$
and compute the derivatives at the output layer:
$$\frac{\partial \ell(y, \hat{y})}{\partial w_j^{(L+1)}} = -(y - \hat{y}) z_j^{(L)}$$
3. Recursively (for $l = L, L-1, \ldots, 1$), compute
$$\delta_j^{(l)} = h'(a_j^{(l)}) \sum_{k=1}^{d^{(l+1)}} w_{kj}^{(l+1)} \delta_k^{(l+1)}$$
and obtain the derivatives $\frac{\partial \ell(y, \hat{y})}{\partial w_{ji}^{(l)}} = \delta_j^{(l)} z_i^{(l-1)}$.
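The three steps can be put together in a minimal backprop sketch, again with $h = \tanh$ and squared loss as our own illustrative assumptions, and verified against a finite-difference gradient:

```python
import numpy as np

def forward(x, weights, h=np.tanh):
    # step 1: compute and store all a^{(l)} and z^{(l)}
    z, a = [x], [None]
    for l, W in enumerate(weights):
        a.append(W @ z[-1])
        z.append(a[-1] if l == len(weights) - 1 else h(a[-1]))
    return a, z

def backprop(x, y, weights):
    """Return dl/dw^{(l)} for every layer (squared loss, h = tanh)."""
    a, z = forward(x, weights)
    L = len(weights)
    grads = [None] * L
    delta = -(y - a[-1])                    # step 2: delta^{(L+1)} = -(y - yhat)
    grads[-1] = np.outer(delta, z[-2])      # output-layer derivatives
    for l in range(L - 1, 0, -1):           # step 3: backward recursion
        # delta^{(l)} = h'(a^{(l)}) * (W^{(l+1)T} delta^{(l+1)}); h' = 1 - tanh^2
        delta = (1 - np.tanh(a[l]) ** 2) * (weights[l].T @ delta)
        grads[l - 1] = np.outer(delta, z[l - 1])
    return grads

# gradient check on a tiny 3 -> 4 -> 1 network
rng = np.random.default_rng(1)
x, y = rng.standard_normal(3), 0.7
weights = [rng.standard_normal((4, 3)), rng.standard_normal((1, 4))]
grads = backprop(x, y, weights)

def scalar_loss(ws):
    return 0.5 * (y - forward(x, ws)[0][-1][0]) ** 2

eps, i, j = 1e-6, 2, 1                      # perturb one hidden-layer weight
Wp = [w.copy() for w in weights]; Wp[0][i, j] += eps
Wm = [w.copy() for w in weights]; Wm[0][i, j] -= eps
fd = (scalar_loss(Wp) - scalar_loss(Wm)) / (2 * eps)
print(abs(grads[0][i, j] - fd))             # near zero: backprop matches
```

Note how the stored $z^{(l)}$ from the forward pass are reused twice in the backward pass: inside $h'(a^{(l)})$ and in the outer product $\delta_j^{(l)} z_i^{(l-1)}$.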
3 Time Complexity
Assume that addition, subtraction, multiplication, and division take “one unit” of time (and let us ignore numerical
precision issues).
Let us say the time complexity to compute $\ell(y, \hat{y}(x))$ is that of the naive algorithm (which explicitly just computes all the sums in the forward pass).
Theorem 3.1. Suppose that we can compute the derivative $h'(\cdot)$ in an amount of time that is within a constant factor of the time it takes to compute $h(\cdot)$ itself (and suppose that constant is 5). Using the backprop algorithm, the (total) time to compute both $\ell(y, \hat{y}(x))$ and $\nabla \ell(y, \hat{y}(x))$ (note that $\nabla \ell(y, \hat{y}(x))$ contains one partial derivative per parameter) is within a factor of 5 of computing just the scalar value $\ell(y, \hat{y}(x))$.
Proof. The number of transfer function evaluations in the forward pass is just the number of nodes. In the backward
pass, the number of times we need to evaluate the derivative of the transfer function is also just the number of nodes.
In the forward pass, note that the total computation time is a constant factor (actually 2) times the number of weights. To see this, note that computing any activation costs one addition and one multiplication per weight involved (and every weight is involved in the computation of just one activation). Similarly, by examining the recursion in the backward pass, we see that the total compute time to get all the $\delta$'s is a constant factor times the number of weights (again there is only one multiplication and one addition associated with each weight). Also, the compute time for obtaining the partial derivatives from the $\delta$'s is equal to the number of weights.
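The counting argument can be made concrete. The sketch below uses an arbitrary architecture of our own choosing and counts only the multiply/add operations tied to weights, as the proof does (transfer-function evaluations, which are proportional to the number of nodes, are ignored):

```python
# layer widths d^{(0)}, ..., d^{(L+1)} for a hypothetical architecture
widths = [10, 32, 32, 1]
num_weights = sum(m * n for m, n in zip(widths[:-1], widths[1:]))

forward_ops = 2 * num_weights   # one multiplication + one addition per weight
delta_ops = 2 * num_weights     # the same for the backward recursion
grad_ops = num_weights          # one multiplication per weight: delta_j * z_i
total = forward_ops + delta_ops + grad_ops
print(total / forward_ops)      # 2.5: comfortably within the factor of 5
```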
Remark: That we defined the time complexity to be under the "naive" algorithm is irrelevant. If we had a faster algorithm to compute $\ell(y, \hat{y}(x))$ (say through a faster matrix multiplication algorithm or possibly other tricks), then the above theorem would still hold. This is the remarkable Baur-Strassen theorem [2] (also independently stated by [3]).
References
[1] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In
David E. Rumelhart, James L. McClelland, and PDP Research Group, editors, Parallel Distributed Processing:
Explorations in the Microstructure of Cognition, Vol. 1, pages 318–362. MIT Press, Cambridge, MA, USA, 1986.
[2] Walter Baur and Volker Strassen. The complexity of partial derivatives. Theoretical Computer Science, 22:317–
330, 1983.
[3] Andreas Griewank and Andrea Walther. Evaluating Derivatives: Principles and Techniques of Algorithmic Dif-
ferentiation. Society for Industrial and Applied Mathematics, Philadelphia, PA, USA, second edition, 2008.