
Back-Propagation Algorithm

- Perceptron
- Gradient Descent
- Multi-layered neural networks
- Back-Propagation
- More on Back-Propagation
- Examples

Inner-product
net = \langle \vec{w}, \vec{x} \rangle = \|\vec{w}\| \cdot \|\vec{x}\| \cdot \cos(\theta)

net = \sum_{i=1}^{n} w_i x_i

A measure of the projection of one vector onto another

Activation function
o = f(net) = f\left(\sum_{i=1}^{n} w_i x_i\right)

Sign function:

f(x) := \mathrm{sgn}(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ -1 & \text{if } x < 0 \end{cases}

Step function:

f(x) := \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0 \\ 0 & \text{if } x < 0 \end{cases}

Piecewise-linear function:

f(x) := \sigma(x) = \begin{cases} 1 & \text{if } x \ge 0.5 \\ x & \text{if } -0.5 < x < 0.5 \\ 0 & \text{if } x \le -0.5 \end{cases}

Sigmoid function:

f(x) := \sigma(x) = \frac{1}{1 + e^{-\alpha x}}
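To make the activation functions concrete, here is a minimal sketch in Python/NumPy (not part of the original slides); the function names and the slope parameter alpha are my own choices:

import numpy as np

def net_input(w, x):
    # Inner product: net = sum_i w_i * x_i
    return np.dot(w, x)

def sgn(net):
    # Sign activation: +1 if net >= 0, -1 otherwise
    return np.where(net >= 0, 1.0, -1.0)

def sigmoid(net, alpha=1.0):
    # Sigmoid activation: 1 / (1 + exp(-alpha * net))
    return 1.0 / (1.0 + np.exp(-alpha * net))

w = np.array([0.5, -0.3, 0.8])
x = np.array([1.0, 2.0, -1.0])
net = net_input(w, x)
print(net, sgn(net), sigmoid(net))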

Gradient Descent

To understand gradient descent, consider a simpler linear unit, where


o = \sum_{i=0}^{n} w_i x_i

Let us learn the weights w_i that minimize the squared error over the training data D = \{(x^1, t^1), (x^2, t^2), \ldots, (x^d, t^d), \ldots, (x^m, t^m)\} (t for target):

E[\vec{w}] = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2

(Figure: the error surface E for different hypotheses, plotted over w_0 and w_1, i.e. in two dimensions.)

We want to move the weight vector in the direction that decreases E:

w_i = w_i + \Delta w_i \qquad\qquad \vec{w} = \vec{w} + \Delta\vec{w}

Differentiating E yields the update rule for gradient descent:

\Delta w_i = \eta \sum_{d \in D} (t_d - o_d)\, x_{id}

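As an illustration of the update rule \Delta w_i = \eta \sum_{d \in D} (t_d - o_d) x_{id}, here is a minimal batch gradient descent sketch for the linear unit (my own code; the learning rate and the synthetic data are assumptions):

import numpy as np

def batch_gradient_descent(X, t, eta=0.01, epochs=200):
    # X: (m, n) inputs, t: (m,) targets; a bias column x_0 = 1 can be appended to X
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        o = X @ w                      # linear unit outputs for all training examples
        w += eta * X.T @ (t - o)       # Delta w_i = eta * sum_d (t_d - o_d) * x_id
    return w

# Toy usage: recover the weights of a noiseless linear target
# (eta must be small enough for stability on this data)
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 3))
t = X @ np.array([1.0, -2.0, 0.5])
print(batch_gradient_descent(X, t))    # should be approximately [1.0, -2.0, 0.5]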

Stochastic Approximation to Gradient Descent

\Delta w_i = \eta\, (t - o)\, x_i


- The gradient descent training rule updates the weights by summing over all training examples in D
- Stochastic gradient approximates gradient descent by updating the weights incrementally, calculating the error for each example
- This is known as the delta rule or LMS (least mean-square) weight update

This is the Adaline rule, used for adaptive filters; Widrow and Hoff (1960).
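For comparison, a sketch (again my own) of the incremental LMS/Adaline-style update \Delta w_i = \eta (t - o) x_i, applied one example at a time; it can be called on the same toy data as the batch version above:

import numpy as np

def stochastic_delta_rule(X, t, eta=0.01, epochs=100):
    # Incremental (stochastic) updates: adjust the weights after each example
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x_d, t_d in zip(X, t):
            o = w @ x_d                  # output of the linear unit for this example
            w += eta * (t_d - o) * x_d   # Delta w_i = eta * (t - o) * x_i
    return w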

XOR problem and Perceptron

By Minsky and Papert in the mid-1960s

Multi-layer Networks

- The limitations of the simple perceptron do not apply to feed-forward networks with intermediate ("hidden") nonlinear units
- A network with just one hidden layer can represent any Boolean function
- The great power of multi-layer networks was realized long ago

- But it was only in the eighties that it was shown how to make them learn

- Multiple layers of cascaded linear units still produce only linear functions; we search for networks capable of representing nonlinear functions

- Units should use nonlinear activation functions; examples of nonlinear activation functions were given above

XOR-example

Back-propagation is a learning algorithm for multi-layer neural networks. It was invented independently several times:

- Bryson and Ho [1969]
- Werbos [1974]
- Parker [1985]
- Rumelhart et al. [1986]

Parallel Distributed Processing, Vol. 1: Foundations
David E. Rumelhart, James L. McClelland and the PDP Research Group
"What makes people smarter than computers?" These volumes by a pioneering neurocomputing...


Back-propagation

The algorithm gives a prescription for changing the weights w_{ij} in any feed-forward network so as to learn a training set of input-output pairs \{x^d, t^d\}. We consider a simple two-layer network:

(Figure: network diagram with input units x_1, x_2, x_3, x_4, x_5 feeding a hidden layer, which feeds the output units.)


Given the pattern x^d, hidden unit j receives a net input

net_j^d = \sum_{k=1}^{5} w_{jk} x_k^d

and produces the output

V_j^d = f(net_j^d) = f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)

Output unit i thus receives

net_i^d = \sum_{j=1}^{3} W_{ij} V_j^d = \sum_{j=1}^{3} \left(W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)

and produces the final output

o_i^d = f(net_i^d) = f\left(\sum_{j=1}^{3} W_{ij} V_j^d\right) = f\left(\sum_{j=1}^{3} \left(W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)\right)

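As a concrete illustration of these equations (my own sketch, not from the slides), the forward pass of the 5-input, 3-hidden-unit, 2-output network, assuming sigmoid units and small random weights:

import numpy as np

def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * net))

def forward(x, w, W):
    # w: (3, 5) input-to-hidden weights, W: (2, 3) hidden-to-output weights
    net_hidden = w @ x            # net_j = sum_k w_jk * x_k
    V = sigmoid(net_hidden)       # hidden outputs V_j = f(net_j)
    net_out = W @ V               # net_i = sum_j W_ij * V_j
    o = sigmoid(net_out)          # final outputs o_i = f(net_i)
    return V, o

rng = np.random.default_rng(0)
w = rng.normal(scale=0.1, size=(3, 5))
W = rng.normal(scale=0.1, size=(2, 3))
x = np.array([1.0, 0.0, -1.0, 0.5, 2.0])
V, o = forward(x, w, W)
print(V, o)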

Our usual error function:

For l outputs and m input-output pairs \{x^d, t^d\},

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{l} (t_i^d - o_i^d)^2

In our example E becomes

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)^2

E[\vec{w}] = \frac{1}{2} \sum_{d=1}^{m} \sum_{i=1}^{2} \left(t_i^d - f\left(\sum_{j=1}^{3} W_{ij} \cdot f\left(\sum_{k=1}^{5} w_{jk} x_k^d\right)\right)\right)^2

E[\vec{w}] is differentiable provided f is differentiable, so gradient descent can be applied.
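Continuing the previous sketch (it reuses the forward function and the weights w, W defined there), the error E[w] over a toy training set of m patterns:

import numpy as np

def squared_error(X, T, w, W):
    # E[w] = 1/2 * sum_d sum_i (t_i^d - o_i^d)^2
    E = 0.0
    for x_d, t_d in zip(X, T):
        _, o_d = forward(x_d, w, W)   # forward() as defined in the earlier sketch
        E += 0.5 * np.sum((t_d - o_d) ** 2)
    return E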


For the hidden-to-output connections the gradient descent rule gives:

\Delta W_{ij} = -\eta \frac{\partial E}{\partial W_{ij}} = -\eta \sum_{d=1}^{m} (t_i^d - o_i^d)\, f'(net_i^d) \cdot (-V_j^d)

\Delta W_{ij} = \eta \sum_{d=1}^{m} (t_i^d - o_i^d)\, f'(net_i^d)\, V_j^d

With \delta_i^d = f'(net_i^d)\,(t_i^d - o_i^d), this becomes

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d\, V_j^d
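A sketch (mine, for the same assumed 5-3-2 sigmoid network) of the hidden-to-output update \Delta W_{ij} = \eta \sum_d \delta_i^d V_j^d:

import numpy as np

def sigmoid(net, alpha=1.0):
    return 1.0 / (1.0 + np.exp(-alpha * net))

def sigmoid_prime(net, alpha=1.0):
    s = sigmoid(net, alpha)
    return alpha * s * (1.0 - s)

def hidden_to_output_update(X, T, w, W, eta=0.1):
    # Batch update of the hidden-to-output weights W_ij
    dW = np.zeros_like(W)
    for x_d, t_d in zip(X, T):
        V = sigmoid(w @ x_d)                                        # hidden outputs V_j^d
        net_o = W @ V
        delta_out = sigmoid_prime(net_o) * (t_d - sigmoid(net_o))   # delta_i^d
        dW += eta * np.outer(delta_out, V)                          # eta * delta_i^d * V_j^d
    return dW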

For the input-to-hidden connections w_{jk} we must differentiate with respect to w_{jk}.

Using the chain rule we obtain

\Delta w_{jk} = -\eta \frac{\partial E}{\partial w_{jk}} = -\eta \sum_{d=1}^{m} \frac{\partial E}{\partial V_j^d} \cdot \frac{\partial V_j^d}{\partial w_{jk}}

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} (t_i^d - o_i^d)\, f'(net_i^d)\, W_{ij}\, f'(net_j^d)\, x_k^d

\delta_i^d = f'(net_i^d)\,(t_i^d - o_i^d)

\Delta w_{jk} = \eta \sum_{d=1}^{m} \sum_{i=1}^{2} \delta_i^d\, W_{ij}\, f'(net_j^d)\, x_k^d

\delta_j^d = f'(net_j^d) \sum_{i=1}^{2} W_{ij}\, \delta_i^d

\Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d\, x_k^d

Side by side, the two update rules are

\Delta W_{ij} = \eta \sum_{d=1}^{m} \delta_i^d\, V_j^d \qquad\qquad \Delta w_{jk} = \eta \sum_{d=1}^{m} \delta_j^d\, x_k^d

We have the same form, with a different definition of \delta.
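A small sketch (mine) of how the hidden-unit deltas are obtained from the output deltas, \delta_j^d = f'(net_j^d) \sum_i W_{ij} \delta_i^d; sigmoid_prime is the derivative function from the previous sketch:

def hidden_deltas(W, delta_out, net_hidden):
    # delta_j = f'(net_j) * sum_i W_ij * delta_i
    # W has shape (n_outputs, n_hidden); W.T @ delta_out propagates the
    # output errors backward through the weights
    return sigmoid_prime(net_hidden) * (W.T @ delta_out)

The output errors flow backward through the transposed weight matrix, which is exactly the "back-propagation" that gives the algorithm its name.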


In general, with an arbitrary number of layers, the back-propagation update rule always has the form

\Delta w_{ij} = \eta \sum_{d=1}^{m} \delta_{\text{output}} \cdot V_{\text{input}}

where "output" and "input" refer to the two ends of the connection concerned; V stands for the appropriate input-end activation (hidden unit or real input x^d), and \delta depends on the layer concerned.

The equation

\delta_j^d = f'(net_j^d) \sum_{i} W_{ij}\, \delta_i^d

allows us to determine the \delta of a given hidden unit V_j in terms of the \delta's of the output units o_i it feeds. The coefficients are the usual forward weights, but the errors are propagated backward:

back-propagation


We have to use a nonlinear differentiable activation function

Examples:

f(x) = \sigma(x) = \frac{1}{1 + e^{-\alpha x}}

f'(x) = \sigma'(x) = \alpha \cdot \sigma(x) \cdot (1 - \sigma(x))

f(x) = \tanh(\beta x)

f'(x) = \beta \cdot (1 - f(x)^2)
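A quick numerical check (my own) that these derivative formulas behave as expected, comparing them against central finite differences; the values of alpha and beta are arbitrary:

import numpy as np

alpha, beta, h, x = 1.5, 0.7, 1e-6, 0.3
sigma = lambda z: 1.0 / (1.0 + np.exp(-alpha * z))

print(alpha * sigma(x) * (1 - sigma(x)),                       # analytic sigmoid derivative
      (sigma(x + h) - sigma(x - h)) / (2 * h))                 # finite-difference estimate
print(beta * (1 - np.tanh(beta * x) ** 2),                     # analytic tanh derivative
      (np.tanh(beta * (x + h)) - np.tanh(beta * (x - h))) / (2 * h))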


Consider a network with M layers, m = 1, 2, ..., M
- V_i^m denotes the output of the ith unit in the mth layer
- V_i^0 is a synonym for x_i, the ith input
- The superscript m labels layers, not patterns
- w_{ij}^m denotes the connection from V_j^{m-1} to V_i^m

Stochastic Back-Propagation Algorithm (mostly used)


1. Initialize the weights to small random values
2. Choose a pattern x^d and apply it to the input layer: V_k^0 = x_k^d for all k
3. Propagate the signal forward through the network:
   V_i^m = f(net_i^m) = f\left(\sum_j w_{ij}^m V_j^{m-1}\right)
4. Compute the deltas for the output layer:
   \delta_i^M = f'(net_i^M)\,(t_i^d - V_i^M)
5. Compute the deltas for the preceding layers, for m = M, M-1, ..., 2:
   \delta_i^{m-1} = f'(net_i^{m-1}) \sum_j w_{ji}^m \delta_j^m
6. Update all connections:
   \Delta w_{ij}^m = \eta\, \delta_i^m V_j^{m-1}, \qquad w_{ij}^{new} = w_{ij}^{old} + \Delta w_{ij}
7. Go to step 2 and repeat for the next pattern

(A code sketch of this loop is given below.)
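A minimal end-to-end sketch of this loop (my own code, not the slides'), for a single hidden layer of sigmoid units; the layer sizes, the learning rate, the bias handling and the XOR data are assumptions made for illustration, and convergence depends on the random initialization:

import numpy as np

def sigmoid(net):
    return 1.0 / (1.0 + np.exp(-net))

def train_backprop(X, T, n_hidden=3, eta=0.5, epochs=10000, seed=0):
    rng = np.random.default_rng(seed)
    # 1. Initialize the weights to small random values
    w = rng.normal(scale=0.5, size=(n_hidden, X.shape[1]))   # input-to-hidden
    W = rng.normal(scale=0.5, size=(T.shape[1], n_hidden))   # hidden-to-output
    for _ in range(epochs):
        for x_d, t_d in zip(X, T):                           # 2. choose a pattern
            # 3. Propagate the signal forward through the network
            V = sigmoid(w @ x_d)
            o = sigmoid(W @ V)
            # 4. Deltas for the output layer; for the sigmoid, f'(net) = o * (1 - o)
            delta_out = o * (1.0 - o) * (t_d - o)
            # 5. Deltas for the preceding (hidden) layer
            delta_hid = V * (1.0 - V) * (W.T @ delta_out)
            # 6. Update all connections
            W += eta * np.outer(delta_out, V)
            w += eta * np.outer(delta_hid, x_d)
    return w, W

# Toy usage: XOR, with a constant bias input of 1 appended to each pattern
X = np.array([[0, 0, 1], [0, 1, 1], [1, 0, 1], [1, 1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
w, W = train_backprop(X, T)
print(sigmoid(W @ sigmoid(w @ X.T)).T.round(2))   # ideally close to [0, 1, 1, 0]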


More on Back-Propagation
- Gradient descent over the entire network weight vector
- Easily generalized to arbitrary directed graphs
- Will find a local, not necessarily global, error minimum

In practice, often works well (can run multiple times)

Gradient descent can be very slow if \eta is too small, and can oscillate widely if \eta is too large. One often includes a weight momentum term:

\Delta w_{pq}(t+1) = -\eta \frac{\partial E}{\partial w_{pq}} + \alpha \cdot \Delta w_{pq}(t)

The momentum parameter \alpha is chosen between 0 and 1; 0.9 is a good value.
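A sketch (mine) of a single weight update with momentum; it assumes some function grad_E(w) returning dE/dw is available, which is not defined in the slides:

def momentum_step(w, delta_w_prev, grad_E, eta=0.1, alpha=0.9):
    # Delta w(t+1) = -eta * dE/dw + alpha * Delta w(t)
    delta_w = -eta * grad_E(w) + alpha * delta_w_prev
    return w + delta_w, delta_w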


Minimizes error over training examples

Will it generalize well to unseen examples?

Training can take thousands of iterations, so it is slow! Using the network after training is very fast.


Convergence of Backpropagation

Gradient descent to some local minimum


Perhaps not the global minimum... Possible remedies:
- Add momentum
- Use stochastic gradient descent
- Train multiple nets with different initial weights

Nature of convergence

- Initialize the weights near zero
- Therefore, the initial network is near-linear
- Increasingly non-linear functions become possible as training progresses


Expressive Capabilities of ANNs

Boolean functions:
- Every Boolean function can be represented by a network with a single hidden layer, but it might require a number of hidden units exponential in the number of inputs

Continuous functions:
- Every bounded continuous function can be approximated with arbitrarily small error by a network with one hidden layer [Cybenko 1989; Hornik et al. 1989]
- Any function can be approximated to arbitrary accuracy by a network with two hidden layers [Cybenko 1988]

NETtalk, Sejnowski et al., 1987


Prediction


- Perceptron
- Gradient Descent
- Multi-layered neural networks
- Back-Propagation
- More on Back-Propagation
- Examples


RBF Networks, Support Vector Machines

