Lec 15: MLP (cont.)

The document describes the backpropagation algorithm for training a multi-layer perceptron neural network. It involves: 1. Forward propagation to calculate activations and outputs for each layer. 2. Calculating error terms for the output layer using the difference between actual and target outputs. 3. Backpropagating the errors to calculate error terms for hidden layers. 4. Updating weights proportionally to the error terms and activations using gradient descent. This process repeats for each training example until the weights converge. The derivative of the activation function, typically sigmoid, determines how much the weights are adjusted at each layer based on the error signals. Backpropagation allows MLPs to be trained on labeled datasets for classification and regression.


Multi-Layer Perceptron (MLP)

Back Propagation Algorithm (sequential)

1. Apply an input vector and calculate all activations: the hidden-layer activations u and the output-layer activations a


2. Evaluate Δ_k for all output units via:

Δ_i(t) = (d_i(t) − y_i(t)) g′(a_i(t))

(Note the similarity to the perceptron learning algorithm.)

3. Backpropagate the Δs to get the error terms δ for the hidden layers using:

δ_i(t) = g′(u_i(t)) Σ_k Δ_k(t) w_ki

4. Evaluate the weight changes using:

v_ij(t+1) = v_ij(t) + η δ_i(t) x_j(t)
w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)

Once the weight changes are computed for all units, the weights are updated at the same time (biases are included as weights here). An example:

[Network diagram: inputs x_1, x_2; hidden units z_1, z_2; outputs y_1, y_2.
1st-layer weights: v_11 = −1, v_21 = 0, v_12 = 0, v_22 = 1; biases v_10 = 1, v_20 = 1.
2nd-layer weights: w_11 = 1, w_21 = −1, w_12 = 0, w_22 = 1.]

Use the identity activation function (i.e. g(a) = a).

All biases are set to 1 (not drawn, for clarity).
Learning rate η = 0.1

[Same network diagram, now with inputs x_1 = 0, x_2 = 1 applied.]

Have input [0 1] with target [1 0].


Forward pass. Calculate 1st layer activations:

[Network diagram: 1st-layer activations u_1 = 1, u_2 = 2.]

u_1 = (−1)(0) + (0)(1) + 1 = 1
u_2 = (0)(0) + (1)(1) + 1 = 2

Calculate the 1st-layer outputs by passing the activations through the activation functions:

[Network diagram: 1st-layer outputs z_1 = 1, z_2 = 2.]

z1 = g(u1) = 1
z2 = g(u2) = 2
Calculate 2nd layer outputs (weighted sum thru activation functions):

[Network diagram: 2nd-layer outputs y_1 = 2, y_2 = 2.]

y_1 = a_1 = (1)(1) + (0)(2) + 1 = 2
y_2 = a_2 = (−1)(1) + (1)(2) + 1 = 2
Backward pass:

w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)
         = w_ij(t) + η (d_i(t) − y_i(t)) g′(a_i(t)) z_j(t)

[Network diagram: output error terms Δ_1 = −1, Δ_2 = −2.]

Target = [1, 0], so d_1 = 1 and d_2 = 0.

Since g′(a) = 1 for the identity activation:
Δ_1 = (d_1 − y_1) = 1 − 2 = −1
Δ_2 = (d_2 − y_2) = 0 − 2 = −2
Calculate the weight changes for the output-layer weights w (cf. perceptron learning):

[Network diagram: z_1 = 1, z_2 = 2; products Δ_1 z_1 = −1, Δ_1 z_2 = −2, Δ_2 z_1 = −2, Δ_2 z_2 = −4.]

w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)
The weight changes (with η = 0.1) will be:

w_ij(t+1) = w_ij(t) + η Δ_i(t) z_j(t)

[Network diagram: updated 2nd-layer weights w_11 = 0.9, w_21 = −1.2, w_12 = −0.2, w_22 = 0.6.]
But first we must calculate the δ's:

δ_i(t) = g′(u_i(t)) Σ_k Δ_k(t) w_ki

[Network diagram: backpropagated products Δ_1 w_11 = −1, Δ_2 w_21 = 2, Δ_1 w_12 = 0, Δ_2 w_22 = −2.]
The Δ's propagate back:

δ_i(t) = g′(u_i(t)) Σ_k Δ_k(t) w_ki

[Network diagram: hidden error terms δ_1 = 1, δ_2 = −2.]

δ_1 = Δ_1 w_11 + Δ_2 w_21 = −1 + 2 = 1
δ_2 = Δ_1 w_12 + Δ_2 w_22 = 0 − 2 = −2
The δ's are then multiplied by the inputs:

v_ij(t+1) = v_ij(t) + η δ_i(t) x_j(t)

[Network diagram: products δ_1 x_1 = 0, δ_1 x_2 = 1, δ_2 x_1 = 0, δ_2 x_2 = −2.]
Finally, change the weights:

v_ij(t+1) = v_ij(t) + η δ_i(t) x_j(t)

[Network diagram: updated 1st-layer weights v_11 = −1, v_21 = 0, v_12 = 0.1, v_22 = 0.8; 2nd-layer weights w_11 = 0.9, w_21 = −1.2, w_12 = −0.2, w_22 = 0.6.]

Note that the weights multiplied by the zero input are unchanged, as they do not contribute to the error.
We have also changed the biases (not shown).
Now go forward again (we would normally use a new input vector):

[Network diagram: with the updated weights and biases, the 1st-layer outputs are z_1 = 1.2, z_2 = 1.6.]
[Network diagram: the 2nd-layer outputs are now y_1 = 1.66, y_2 = 0.32.]

The outputs are now closer to the target value [1, 0].
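The whole worked example can be checked with a short script (a sketch, not from the slides; it uses the identity activation and η = 0.1 as above, with lists indexed to match the slide's v_ij / w_ij notation):

```python
# One sequential backprop step on the 2-2-2 network above, with
# identity activations (g(a) = a, so g'(a) = 1) and learning rate 0.1.
eta = 0.1
x = [0.0, 1.0]                   # input vector
d = [1.0, 0.0]                   # target vector
v = [[-1.0, 0.0], [0.0, 1.0]]    # v[i][j]: weight from input j to hidden i
w = [[1.0, 0.0], [-1.0, 1.0]]    # w[i][j]: weight from hidden j to output i
bv = [1.0, 1.0]                  # hidden biases
bw = [1.0, 1.0]                  # output biases

def forward():
    u = [sum(v[i][j] * x[j] for j in range(2)) + bv[i] for i in range(2)]
    z = u[:]                     # identity activation: z = g(u) = u
    y = [sum(w[i][j] * z[j] for j in range(2)) + bw[i] for i in range(2)]
    return u, z, y

u, z, y = forward()              # u = [1, 2], z = [1, 2], y = [2, 2]

# Output error terms (g' = 1): Delta_i = d_i - y_i
Delta = [d[i] - y[i] for i in range(2)]                                # [-1, -2]
# Hidden error terms: delta_i = sum_k Delta_k * w_ki
delta = [sum(Delta[k] * w[k][i] for k in range(2)) for i in range(2)]  # [1, -2]

# Update all weights (and biases) at the same time
for i in range(2):
    for j in range(2):
        w[i][j] += eta * Delta[i] * z[j]
        v[i][j] += eta * delta[i] * x[j]
    bw[i] += eta * Delta[i]
    bv[i] += eta * delta[i]

_, z, y = forward()
print([round(a, 2) for a in z], [round(a, 2) for a in y])  # [1.2, 1.6] [1.66, 0.32]
```

Running it reproduces every number on the slides, including the updated biases (b = 1.1, 0.8 hidden and 0.9, 0.8 output) that the diagrams leave out.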


Activation Functions
How does the activation function affect the changes?

Δ_i(t) = (d_i(t) − y_i(t)) g′(a_i(t))
δ_i(t) = g′(u_i(t)) Σ_k Δ_k(t) w_ki

where: g′(a_i(t)) = dg(a)/da

- we need to compute the derivative of the activation function g
- to find the derivative, the activation function must be smooth (differentiable)
Sigmoidal (logistic) function - common in MLPs:

g(a_i(t)) = 1 / (1 + exp(−k a_i(t)))

where k is a positive constant. The sigmoidal function gives a value in the range 0 to 1. Alternatively one can use tanh(ka), which has the same shape but the range −1 to 1.

Input-output function of a neuron (rate coding assumption).
Note: when a = 0, g(a) = 0.5.
The derivative of the sigmoidal function is:

g′(a_i(t)) = k exp(−k a_i(t)) / [1 + exp(−k a_i(t))]² = k g(a_i(t)) [1 − g(a_i(t))]

Since y_i(t) = g(a_i(t)), we have: g′(a_i(t)) = k y_i(t) (1 − y_i(t))

The derivative of the sigmoidal function has its maximum at a = 0, is symmetric about this point, and falls to zero as the sigmoid approaches its extreme values.
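The identity g′(a) = k g(a)(1 − g(a)) and the shape of the derivative can be checked numerically (a small sketch; k = 1 is an assumed value):

```python
import math

k = 1.0  # slope constant (assumed 1 here)

def g(a):
    """Logistic sigmoid g(a) = 1 / (1 + exp(-k a))."""
    return 1.0 / (1.0 + math.exp(-k * a))

def g_prime(a):
    """Analytic derivative: k g(a) (1 - g(a))."""
    return k * g(a) * (1.0 - g(a))

# Check against a central finite difference at several points
for a in [-4.0, -1.0, 0.0, 1.0, 4.0]:
    h = 1e-6
    numeric = (g(a + h) - g(a - h)) / (2 * h)
    assert abs(numeric - g_prime(a)) < 1e-6

# Maximum of g' is at a = 0, where g(0) = 0.5 and g'(0) = k/4;
# it is symmetric and falls toward zero at the extremes.
assert g_prime(0.0) == 0.25
assert g_prime(0.0) > g_prime(1.0) > g_prime(4.0)
```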
Since the degree of weight change is proportional to the derivative of the activation function,

Δ_i(t) = (d_i(t) − y_i(t)) g′(a_i(t))
δ_i(t) = g′(u_i(t)) Σ_k Δ_k(t) w_ki

weight changes will be greatest when a unit receives a mid-range functional signal, and zero (or very small) at the extremes. This means that by saturating a neuron (making the activation large) the weights can be forced to be static. This can be a very useful property.
Summary of the (sequential) BP learning algorithm

Set the learning rate η
Set initial weight values (incl. biases): w, v

Loop until the stopping criteria are satisfied:
present an input pattern to the input units
compute the functional signal for the hidden units
compute the functional signal for the output units
present the target response to the output units
compute the error signal for the output units
compute the error signal for the hidden units
update all weights at the same time
increment n to n+1 and select the next input and target
end loop
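The loop above can be sketched in code (a minimal sketch, not the lecture's program: one hidden layer, logistic activations with k = 1, and a toy AND-gate training set chosen purely for illustration):

```python
import math, random

random.seed(0)

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

# Toy data (illustrative): learn logical AND
data = [([0, 0], [0]), ([0, 1], [0]), ([1, 0], [0]), ([1, 1], [1])]

n_in, n_hid, n_out, eta = 2, 2, 1, 0.5
# Last column of each weight matrix is the bias (bias input fixed at 1)
v = [[random.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hid)]
w = [[random.uniform(-0.5, 0.5) for _ in range(n_hid + 1)] for _ in range(n_out)]

errs = []
for epoch in range(2000):
    random.shuffle(data)             # randomize pattern order each epoch
    sse = 0.0
    for x, d in data:
        xb = x + [1]                 # append bias input
        # forward pass: hidden then output functional signals
        z = [sigmoid(sum(v[i][j] * xb[j] for j in range(n_in + 1)))
             for i in range(n_hid)]
        zb = z + [1]
        y = [sigmoid(sum(w[i][j] * zb[j] for j in range(n_hid + 1)))
             for i in range(n_out)]
        # error terms: output layer, then hidden layer (g' = y(1 - y))
        Delta = [(d[i] - y[i]) * y[i] * (1 - y[i]) for i in range(n_out)]
        delta = [z[i] * (1 - z[i]) * sum(Delta[k] * w[k][i] for k in range(n_out))
                 for i in range(n_hid)]
        # update all weights at the same time
        for i in range(n_out):
            for j in range(n_hid + 1):
                w[i][j] += eta * Delta[i] * zb[j]
        for i in range(n_hid):
            for j in range(n_in + 1):
                v[i][j] += eta * delta[i] * xb[j]
        sse += sum((d[i] - y[i]) ** 2 for i in range(n_out))
    errs.append(sse)

print(round(errs[0], 3), "->", round(errs[-1], 3))  # training error decreases
```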
Network training:

The training set is shown repeatedly until the stopping criteria are met.
Each full presentation of all patterns = one 'epoch'.
It is usual to randomize the order of the training patterns presented for each epoch, in order to avoid order effects (correlations between consecutive training pairs being learnt).

Two types of network training:

• Sequential mode (on-line, stochastic, or per-pattern):
weights are updated after each pattern is presented.

• Batch mode (off-line or per-epoch):
calculate the derivatives/weight changes for each pattern in the training set, then calculate the total change by summing the individual changes.
Advantages and disadvantages of the different modes

Sequential mode:
• Less storage for each weighted connection
• Random order of presentation and updating per pattern means the search of weight space is stochastic, reducing the risk of local minima
• Able to take advantage of any redundancy in the training set (i.e. the same pattern occurring more than once, especially for large, difficult training sets)
• Simpler to implement

Batch mode:
• Faster learning than sequential mode
• Easier from a theoretical viewpoint
• Easier to parallelise
Dynamics of BP learning

The aim is to minimise an error function over all training patterns by adapting the weights in the MLP.
Recall, the mean squared error is typically used:

E(t) = (1/2) Σ_{k=1..p} (d_k(t) − O_k(t))²

The idea is to reduce E.

In a single-layer network with linear activation functions, the error function is simple, described by a smooth parabolic surface with a single minimum.

But MLPs with nonlinear activation functions have complex error surfaces (e.g. plateaus, long valleys, etc.) with no single minimum.
Selecting initial weight values

• The choice of initial weight values is important, as it decides the starting position in weight space - that is, how far away from the global minimum.

• The aim is to select weight values which produce mid-range function signals.

• Select weight values randomly from a uniform probability distribution.

• Normalise the weight values so that the number of weighted connections per unit produces a mid-range function signal.
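These guidelines can be sketched as follows (the exact rule is an assumption, since the slides give no formula: a common choice scales a uniform distribution by the fan-in of each unit):

```python
import math, random

random.seed(1)

def init_weights(fan_in, fan_out):
    # Uniform values in [-1/sqrt(fan_in), 1/sqrt(fan_in)]: for inputs of
    # order 1, the summed activation stays of order 1, so sigmoids
    # operate in their mid-range rather than saturating.
    r = 1.0 / math.sqrt(fan_in)
    return [[random.uniform(-r, r) for _ in range(fan_in)]
            for _ in range(fan_out)]

w = init_weights(fan_in=100, fan_out=10)
# With fan_in = 100, every weight lies in [-0.1, 0.1]
assert all(abs(wij) <= 0.1 for row in w for wij in row)
```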
Regularization - a way of reducing variance (taking less notice of the data)

Smooth mappings (or other properties, such as correlations) are obtained by introducing a penalty term into the standard error function:

E(F) = E_s(F) + λ E_R(F)

where λ is the regularization coefficient.

Penalty term: require that the solution should be smooth, etc. E.g.

E_R(F) = ∫ ‖∇y‖² dx

[Figure: the fitted mapping without regularization vs. with regularization.]
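A common concrete instance of this scheme (chosen for illustration; it is weight decay, not the smoothness integral above) takes E_R = ½ Σ w², whose gradient simply shrinks every weight toward zero at each step:

```python
# Gradient descent step with an L2 (weight-decay) penalty:
# E = Es + lam * 0.5 * sum(w^2), so dE/dw = dEs/dw + lam * w.
lam, eta = 0.01, 0.1

def update_with_decay(w, grad_Es):
    return [wi - eta * (gi + lam * wi) for wi, gi in zip(w, grad_Es)]

w = [2.0, -1.0, 0.5]
w = update_with_decay(w, grad_Es=[0.0, 0.0, 0.0])
# With zero data gradient, the penalty alone shrinks the weights
print([round(wi, 4) for wi in w])   # [1.998, -0.999, 0.4995]
```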
Momentum

A method of reducing problems of instability while increasing the rate of convergence.

Adding a momentum term to the weight update equation effectively keeps an exponentially-weighted history of previous weight changes.

The modified weight update equation is:

w_ij(n+1) = w_ij(n) + η δ_j(n) y_i(n) + α [w_ij(n) − w_ij(n−1)]

where α is the momentum constant, which controls how much notice is taken of recent history.

Effect of the momentum term

• If weight changes tend to have the same sign, the momentum term increases and gradient descent speeds up (faster convergence on shallow gradients).
• If weight changes tend to have opposing signs, the momentum term decreases and gradient descent slows, reducing oscillations (stabilizes the search).
• Can help escape being trapped in local minima.
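The behaviour can be seen on a toy one-dimensional error surface (an illustrative sketch; E(w) = w² and the η, α values are assumptions, not from the slides):

```python
# Gradient descent with momentum on E(w) = w^2.
eta, alpha = 0.1, 0.9   # learning rate and momentum constant

w, prev_change = 5.0, 0.0
for n in range(200):
    grad = 2.0 * w                               # dE/dw
    # alpha * [w(n) - w(n-1)] is the momentum term: the previous change
    change = -eta * grad + alpha * prev_change
    w += change
    prev_change = change

print(abs(w) < 0.01)    # prints True: converged to the minimum at w = 0
```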
Stopping criteria

We can assess training performance using:

E = Σ_{i=1..p} Σ_{j=1..M} [d_j(i) − y_j(i)]²

where p = number of training patterns and M = number of output units.

We could stop training when the rate of change of E is small, suggesting convergence.

However, the aim is for new patterns to be classified correctly.

[Figure: error vs. training time - the training error decreases steadily, while the generalisation error reaches a minimum and then rises.]

Typically, though the error on the training set will decrease as training continues, the generalisation error (the error on unseen data) hits a minimum and then increases (cf. model complexity, etc.).
Therefore we want a more complex stopping criterion.
Cross-validation

A method for evaluating the generalisation performance of networks, in order to determine which is best using the available data.

Hold-out method
The simplest method, when data is not scarce.

Divide the available data into sets:
• Training data set
- used to obtain weight and bias values during network training
• Validation data set
- used to periodically test the ability of the network to generalize
-> suggests the 'best' network, based on the smallest error
• Test data set
- used to evaluate the generalisation error, i.e. network performance

Early stopping of learning: stop so as to minimize both the training error and the validation error.
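Early stopping on the validation set can be sketched as follows (a minimal sketch; the `train_epoch` and `val_error` callables and the stand-in error curve are assumptions replacing a real network):

```python
# Early stopping: track the validation error, remember the best epoch,
# and stop after `patience` epochs without improvement.
def train_with_early_stopping(train_epoch, val_error, max_epochs=100, patience=5):
    best_err, best_epoch, since_best = float("inf"), 0, 0
    for epoch in range(max_epochs):
        train_epoch()
        err = val_error()
        if err < best_err:
            best_err, best_epoch, since_best = err, epoch, 0
            # (in a real network: save a copy of the weights here)
        else:
            since_best += 1
            if since_best >= patience:
                break               # validation error has stopped improving
    return best_epoch, best_err

# Stand-in validation curve: falls, then rises again (overfitting)
errors = [1.0, 0.6, 0.4, 0.3, 0.35, 0.4, 0.5, 0.6, 0.7, 0.8, 0.9, 1.0]
it = iter(errors)
epoch, err = train_with_early_stopping(lambda: None, lambda: next(it))
print(epoch, err)   # 3 0.3 - stops at the validation minimum
```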
Universal Function Approximation

How good is an MLP? How general is an MLP?

Universal Approximation Theorem

For any given constant ε > 0 and continuous function h(x_1, ..., x_m), there exists a three-layer MLP with the property that

|h(x_1, ..., x_m) − H(x_1, ..., x_m)| < ε

where H(x_1, ..., x_m) = Σ_{i=1..k} a_i f( Σ_{j=1..m} w_ij x_j + b_i )
