Lect 4

The document describes gradient descent learning and backpropagation algorithms for training neural networks. It discusses how gradient descent aims to find the minimum error by computing derivatives of the error function with respect to weights. Batch and incremental training modes are described. Backpropagation is introduced as a method for calculating error gradients for multi-layer networks using generalized delta rule. Sigmoid activation functions and their use in backpropagation are also mentioned.

University Of Khartoum

Department Of Electronics & Electrical Engineering
Software & Control Engineering

EEE52511: NEURAL NETWORKS & FUZZY SYSTEMS
By: Dr. Hiba Hassan Sayed
Lecture 4
30/1/2023 U of K: Dr. Hiba Hassan 2

GRADIENT DESCENT LEARNING



Gradient Descent Learning in NN

• The gradient is the rate of change of f(x) at a particular value of x.
• Hence, it is the derivative of f(x) with respect to x; when f depends
on several variables, it is the vector of partial derivatives.
• This leads to Gradient Descent Learning, whose aim is to find the
minimum error by computing the derivative of the error function
with respect to each weight and adjusting the weights in the opposite
direction. It is sometimes called Gradient Descent Minimization.
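
As a rough illustration of the idea, the following Python sketch minimizes a simple one-variable function by repeatedly stepping against its derivative. The function f(x) = (x - 3)^2, the step size eta and the name gradient_descent are assumptions chosen for the illustration; they do not come from the lecture.

```python
def gradient_descent(grad, x0, eta=0.1, steps=100):
    """Repeatedly step against the gradient, starting from x0."""
    x = x0
    for _ in range(steps):
        x = x - eta * grad(x)  # move downhill
    return x


# Illustrative function f(x) = (x - 3)^2, whose derivative is 2(x - 3).
x_min = gradient_descent(lambda x: 2.0 * (x - 3.0), x0=0.0)
print(x_min)  # close to 3, the minimiser of f
```

In a neural network the same step is applied to every weight, with the error function playing the role of f.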

Finding the minimum of a function: gradient descent



Cont.

Cont.
• For a target (t) & an actual output (o), the error is given by the
following mean square error cost function,

• Where D is the set of training examples.
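
The cost function itself is not reproduced in this text. A commonly used form that matches the symbols defined above is sketched below; the factor 1/2, the weight vector w and the subscript d (indexing the training examples in D) are assumptions rather than a copy of the slide.

```latex
E(\vec{w}) = \frac{1}{2} \sum_{d \in D} (t_d - o_d)^2
```

Dividing by the number of examples would turn the sum into a mean; this only rescales the error and does not move its minimum.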


• There are two types of gradient descent based on this cost function,
batch mode and incremental mode; they are described next.

Where,

Batch Training
• Batch Training: In batch mode the weights and biases of the
network are updated only after the entire training set has been
applied to the network. The gradients calculated at each training
example are added together to determine the change in the weights
and biases.
• Batch Gradient Descent: In the batch steepest descent training
function the weights and biases are updated in the direction of the
negative gradient of the performance function.
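
A minimal sketch of batch mode for a single linear unit is given below. It assumes the squared error 1/2 (t - o)^2 per example; the data set, the learning rate and the name batch_epoch are hypothetical and only serve the illustration.

```python
def batch_epoch(weights, examples, eta=0.05):
    """One batch update: sum the gradients over all examples, then step once."""
    grad = [0.0] * len(weights)
    for x, t in examples:
        o = sum(w * xi for w, xi in zip(weights, x))  # linear output
        for i, xi in enumerate(x):
            grad[i] += -(t - o) * xi  # dE/dw_i for E = 1/2 * (t - o)^2
    # Weights move along the negative of the accumulated gradient.
    return [w - eta * g for w, g in zip(weights, grad)]


# Hypothetical data generated by t = 2*x1 - 1*x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = [0.0, 0.0]
for _ in range(500):
    w = batch_epoch(w, data)
print(w)  # approaches [2.0, -1.0]
```

Note that the weights change only once per pass over the data, after all per-example gradients have been added together.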

Batch Gradient Descent with Momentum

• This algorithm often provides faster convergence.


• Momentum allows a network to respond not only to the local
gradient, but also to recent trends in the error surface.
• Acting like a low-pass filter, momentum allows the network to ignore
small features in the error surface.
• Without momentum a network may get stuck in a shallow local
minimum, such as shown in the next figure.

Cont.
[Figure: two panels. Left: local and global minima. Right: effect of adding momentum.]
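
The momentum idea can be sketched as follows. The update form (new change = gradient step plus a fraction alpha of the previous change) is a common formulation and is assumed here; the error function E(w) = w^2 and the name momentum_step are illustrative only.

```python
def momentum_step(w, dw_prev, grad, eta=0.1, alpha=0.9):
    """Blend the new gradient step with the previous weight change."""
    dw = -eta * grad + alpha * dw_prev
    return w + dw, dw


# Illustrative error E(w) = w^2 with gradient 2w.
w, dw = 1.0, 0.0
for _ in range(300):
    w, dw = momentum_step(w, dw, grad=2.0 * w)
print(w)  # oscillates toward 0, the minimum of E
```

Because part of the previous change is carried over, a small bump in the error surface does not immediately reverse the direction of travel.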

Incremental Mode Gradient Descent


• When we use the gradient with respect to one training example at
a time, gradient descent becomes the Widrow-Hoff delta rule, which
is given by

$\Delta w_i = \eta (t - o) x_i$

• It is also called the Least Mean Square (LMS) method.
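
A minimal sketch of incremental mode for a single linear unit follows: the delta rule is applied after every example instead of once per pass. The data set, learning rate and function name incremental_epoch are hypothetical.

```python
def incremental_epoch(weights, examples, eta=0.1):
    """Apply the delta rule after every single example (incremental mode)."""
    for x, t in examples:
        o = sum(w * xi for w, xi in zip(weights, x))  # linear output
        weights = [w + eta * (t - o) * xi for w, xi in zip(weights, x)]
    return weights


# Hypothetical data generated by t = 2*x1 - 1*x2.
data = [((1.0, 0.0), 2.0), ((0.0, 1.0), -1.0), ((1.0, 1.0), 1.0)]
w = [0.0, 0.0]
for _ in range(500):
    w = incremental_epoch(w, data)
print(w)  # approaches [2.0, -1.0]
```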

LMS Learning Rule


Mean Square Error:
• Like the perceptron learning rule, the least mean square (LMS) - the delta
rule - algorithm is an example of supervised training, in which the learning
rule is provided with a set of examples of desired network behavior:

$\{p_1, t_1\}, \{p_2, t_2\}, \ldots, \{p_Q, t_Q\}$

• We want to minimize the average of the sum of the squared errors
between target & actual network output:

$mse = \frac{1}{Q} \sum_{k=1}^{Q} e(k)^2 = \frac{1}{Q} \sum_{k=1}^{Q} (t(k) - a(k))^2$

LMS Algorithm/ Widrow-Hoff rule


• The LMS algorithm was presented by Widrow and Hoff, hence, it is
called Widrow-Hoff learning algorithm.
• As seen before, it is based on an approximate steepest descent
procedure.
• Widrow and Hoff decided that they could estimate the mean
square error by using the squared error at each iteration.
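
The step from this estimate to a weight update can be sketched as below. The sketch assumes a linear output a(k) = w^T p(k) + b and writes the learning rate as η; both the bias b and the vector notation are assumptions added for the derivation.

```latex
\begin{aligned}
\hat{F}(\mathbf{w}) &= e^2(k) = \bigl(t(k) - a(k)\bigr)^2,
  \qquad a(k) = \mathbf{w}^{T}\mathbf{p}(k) + b \\
\frac{\partial e^2(k)}{\partial \mathbf{w}} &= 2\,e(k)\,\frac{\partial e(k)}{\partial \mathbf{w}}
  = -2\,e(k)\,\mathbf{p}(k) \\
\mathbf{w}(k+1) &= \mathbf{w}(k) + 2\eta\,e(k)\,\mathbf{p}(k),
  \qquad b(k+1) = b(k) + 2\eta\,e(k)
\end{aligned}
```

If the factor 2 is absorbed into the learning rate, this has the same form as the delta rule.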

Comparing Perceptron & Delta Rules


• Perceptron rule
   • Thresholded output.
   • Converges after a finite number of iterations to a hypothesis that perfectly
   classifies the training data, provided the training examples are linearly
   separable.
   • Linearly separable data.
• Delta rule
   • Unthresholded output.
   • Converges toward the error minimum, possibly requiring unbounded time,
   but converges regardless of whether the training data are linearly
   separable or not.
   • Linearly non-separable data.
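
The difference between the two rules can be sketched in a few lines. Both updates below have the same form and differ only in which output enters the error; the threshold at 0, the 0/1 output coding and the function names are assumptions made for the illustration.

```python
def perceptron_step(w, x, t, eta=0.1):
    """Perceptron rule: the error uses the thresholded output (0 or 1)."""
    net = sum(wi * xi for wi, xi in zip(w, x))
    o = 1.0 if net >= 0 else 0.0
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]


def delta_step(w, x, t, eta=0.1):
    """Delta rule: the error uses the unthresholded linear output."""
    o = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]


w0 = [0.2, 0.2]
x, t = (1.0, 1.0), 1.0
print(perceptron_step(w0, x, t))  # unchanged: the example is already classified correctly
print(delta_step(w0, x, t))       # changed: the linear output 0.4 is still below the target
```

The perceptron rule stops changing the weights once the classification is right, while the delta rule keeps reducing the squared error.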

Adaptive Linear Neuron Network Architecture (ADALINE)

• The ADALINE network is a single layer neural network with multiple
nodes, where each node accepts multiple inputs to generate one output.
• ADALINE networks are similar to the perceptron, but their transfer
function is linear rather than hard-limiting. This allows their outputs to
take on any value, whereas the perceptron output is limited to either 0 or
1.
• Both the ADALINE and the perceptron can only solve linearly separable
problems.

Cont.
• An adaptive linear system responds to changes in its environment
as it is operating.
• These networks are often used in error cancellation, signal
processing, and control systems. For example, they are used by
many long distance phone lines for echo cancellation.
• The pioneering work in this field was done by Widrow and Hoff,
who gave the name ADALINE to adaptive linear elements.

The ADALINE Neural Network



Cont.
• Multiple layer ADALINE is called MADALINE.
• The Widrow-Hoff rule can only train single-layer linear networks.
• This is not much of a disadvantage; single-layer linear networks are
just as capable as multilayer linear networks.
• For every multilayer linear network, there is an equivalent single-
layer linear network.
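
The equivalence can be checked numerically: applying two linear layers in sequence gives the same result as applying one layer whose weight matrix is their product. The weights and input below are hypothetical, and biases are omitted to keep the sketch short.

```python
def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def matmul(a, b):
    """Multiply matrix a by matrix b."""
    cols = list(zip(*b))
    return [[sum(x * y for x, y in zip(row, col)) for col in cols] for row in a]


# Hypothetical weights of a two-layer linear network (biases omitted).
w1 = [[1.0, 2.0], [0.0, -1.0]]   # first layer
w2 = [[3.0, 1.0]]                # second layer
x = [2.0, 5.0]

two_layer = matvec(w2, matvec(w1, x))
one_layer = matvec(matmul(w2, w1), x)
print(two_layer, one_layer)  # both [31.0]
```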

BACKPROPAGATION ALGORITHM

BackPropagation Algorithm
• The backpropagation algorithm was made popular by Rumelhart,
Hinton and Williams in 1986 ["Learning Internal Representations by
Error Propagation", in Rumelhart, David E.; McClelland, James
L. (eds.), Parallel Distributed Processing: Explorations in the
Microstructure of Cognition, Vol. 1: Foundations. Cambridge: MIT
Press. ISBN 0-262-18120-7].
• The researchers used semi-linear neurons with differentiable activation
functions in the hidden neurons (logistic activation functions or
sigmoids).

Cont.
• The error between the target and actual output is calculated at
every iteration and is back propagated through the layers of the
ANN to adapt the weights.
• The weights are adapted such that the error is minimized.
• Once the error has reached an acceptably small value, the training
is stopped.
• Among the first applications of the BP algorithm is the speech
synthesis system NETtalk, developed by Terrence Sejnowski
[Sejnowski & Rosenberg, 1987, “Parallel Networks that Learn to
Pronounce English Text”, Complex Systems 1, 145-168].

Cont.
• The configuration for training a neural network using the BP
algorithm is shown in the figure below.

The Generalized Delta Rule (G.D.R.)


• In the BP algorithm, as in other learning algorithms, the goal is to find
the weight change (Δw) to apply at the next iteration; the rule that
gives it is known as the G.D.R.
• Consider the following ANN model:

Cont.
• We need to obtain the following algorithm to adapt the weights
between the output (k) and hidden (j) layers:

• Where the weights are adapted as follows:

• Here t is the iteration number and δk is the error signal between the
output and hidden layers, which is given by:

Cont.
• Adaptation between input (i) and hidden (j) layers :

• The new weight is thus:

• and the error signal through layer j is:

• Where,

• And,
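
The equations themselves are not reproduced in this text. A common textbook form of the generalized delta rule with momentum, for logistic units, is sketched below as a hedged reconstruction. It uses the lecture's symbols η (learning rate), α (momentum), O (neuron output) and t_k (target); the error-signal names δ_k and δ_j and the exact arrangement of terms are assumptions. As in the slides, t in parentheses is the iteration number, while t_k is the target of output neuron k.

```latex
\begin{aligned}
\Delta w_{kj}(t+1) &= \eta\,\delta_k\,O_j + \alpha\,\Delta w_{kj}(t), &
w_{kj}(t+1) &= w_{kj}(t) + \Delta w_{kj}(t+1), \\
\delta_k &= O_k\,(1 - O_k)\,(t_k - O_k), \\
\Delta w_{ji}(t+1) &= \eta\,\delta_j\,O_i + \alpha\,\Delta w_{ji}(t), &
w_{ji}(t+1) &= w_{ji}(t) + \Delta w_{ji}(t+1), \\
\delta_j &= O_j\,(1 - O_j)\sum_k \delta_k\,w_{kj}.
\end{aligned}
```

The factor O(1 - O) is the derivative of the logistic function, which is why a differentiable activation is needed.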

Backpropagation Algorithm
• The following ANN model is used to derive the backpropagation
algorithm:

BP (cont.)
• The backpropagation has two steps,
• Forward propagation, and
• Backward propagation.
• Our ANN model has the following assumptions:
• A two-layer multilayer NN model, i.e. with 1 set of hidden neurons.
• Neurons in layer i are fully connected to layer j and neurons in
layer j are fully connected to layer k.
• Input layer neurons have linear activation functions and hidden
and output layer neurons have logistic activation functions
(sigmoids).

Note: Sigmoid Function


• Sigmoids have a variable c that controls their firing angle.

Cont.
• When c is large, the sigmoid becomes like a threshold function and
when c is small, the sigmoid becomes more like a straight line
(linear).
• When c is large learning is much faster but a lot of information is
lost, however when c is small, learning is very slow but information
is retained.
• Since this function is differentiable, it enables the B.P. algorithm to
adapt the lower layers of weights in a multilayer neural network.
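
The effect of c can be explored with a short sketch. It assumes the usual logistic form f(x) = 1 / (1 + exp(-c x)), which is not written out in the text; the function names are illustrative.

```python
import math


def sigmoid(x, c=1.0):
    """Logistic function with slope parameter c."""
    return 1.0 / (1.0 + math.exp(-c * x))


def sigmoid_derivative(x, c=1.0):
    """Derivative with respect to x, written in terms of the output."""
    o = sigmoid(x, c)
    return c * o * (1.0 - o)


for c in (0.1, 1.0, 10.0):
    print(c, round(sigmoid(0.5, c), 4))
```

For the same input, a larger c pushes the output closer to 1, which is the threshold-like behaviour described above; the derivative is what backpropagation uses to pass the error to lower layers.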

Cont.
• The firing angle used here is c=1.
• Bias weights are used with bias signals of 1 for hidden (j) and output
layer (k) neurons.
• In many ANN models, bias weights (θ) with bias signals of 1 are used to
speed up the convergence process.
• The learning parameter is given by the symbol η and is usually fixed at a
value between 0 and 1; however, in many applications nowadays an
adaptive η is used.
• Usually η is set large in the initial stage of learning and reduced to a
small value at the final stage of learning.
• A momentum term α is also used in the G.D.R. to help avoid local minima.

Steps of BP Algorithm
• Step 1: Obtain a set of training patterns.
• Step 2: Set up neural network model: No. of Input neurons, Hidden
neurons, and Output Neurons.
• Step 3: Set learning rate η and momentum rate α
• Step 4: Initialize all connection weights Wji, Wkj and bias weights θj, θk to
random values.
• Step 5: Set minimum error, Emin
• Step 6: Start training by applying input patterns one at a time and
propagate through the layers then calculate total error.

Cont.
• Step 7: Backpropagate error through output and hidden layer and
adapt weights.
• Step 8: Backpropagate error through hidden and input layer and
adapt weights.
• Step 9: Check if Error < Emin
• If not repeat Steps 6-9. If yes stop training.
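
Steps 1 to 9 can be sketched in Python as follows. This is a minimal sketch under assumptions that the slides do not state: 2 hidden neurons, initial weights drawn uniformly from [-1, 1], a total error of 1/2 (t - o)^2 summed over the patterns, and a cap on the number of passes. The name train_bp and its arguments are illustrative.

```python
import math
import random


def sigmoid(x):
    x = max(-60.0, min(60.0, x))  # guard against overflow in exp
    return 1.0 / (1.0 + math.exp(-x))


def train_bp(patterns, n_hidden=2, eta=0.5, alpha=0.9, e_min=0.01,
             max_epochs=5000, seed=0):
    """Incremental backpropagation for an n_in - n_hidden - 1 logistic network."""
    rng = random.Random(seed)
    n_in = len(patterns[0][0])
    # Step 4: random initial weights; the last entry of each row is the bias weight.
    w_ji = [[rng.uniform(-1, 1) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_kj = [rng.uniform(-1, 1) for _ in range(n_hidden + 1)]
    dw_ji = [[0.0] * (n_in + 1) for _ in range(n_hidden)]
    dw_kj = [0.0] * (n_hidden + 1)
    history = []
    for _ in range(max_epochs):
        total_error = 0.0
        for x, t in patterns:
            # Step 6: forward propagation (bias signal fixed at 1).
            xi = list(x) + [1.0]
            o_j = [sigmoid(sum(w * v for w, v in zip(row, xi))) for row in w_ji]
            oj = o_j + [1.0]
            o_k = sigmoid(sum(w * v for w, v in zip(w_kj, oj)))
            total_error += 0.5 * (t - o_k) ** 2
            # Error signals, computed before any weight is changed.
            delta_k = o_k * (1.0 - o_k) * (t - o_k)
            delta_j = [o * (1.0 - o) * delta_k * w_kj[j] for j, o in enumerate(o_j)]
            # Step 7: adapt hidden-to-output weights.
            for j in range(n_hidden + 1):
                dw_kj[j] = eta * delta_k * oj[j] + alpha * dw_kj[j]
                w_kj[j] += dw_kj[j]
            # Step 8: adapt input-to-hidden weights.
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    dw_ji[j][i] = eta * delta_j[j] * xi[i] + alpha * dw_ji[j][i]
                    w_ji[j][i] += dw_ji[j][i]
        history.append(total_error)
        # Step 9: stop once the total error is small enough.
        if total_error < e_min:
            break
    return w_ji, w_kj, history


xor = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
_, _, errors = train_bp(xor)
print(len(errors), errors[-1])
```

Steps 1 to 3 and 5 correspond to the function arguments. Whether and how fast the error falls below Emin depends on the random initial weights, so the cap max_epochs is there to guarantee that the loop ends.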

Solving an XOR Problem


• In this example we use the BP algorithm to solve a 2-bit XOR problem.
• The training patterns of this ANN are the XOR examples given in the next
table.
• For simplicity, the ANN model has only 4 neurons (2 inputs, 1 hidden and
1 output) and has no bias weights.
• The input neurons have linear functions and the hidden and output
neurons have sigmoid functions.
• The weights are initialized randomly.
• We train the ANN by providing the patterns #1 to #4 through an iteration
process until the error is minimized.

Cont.
• The training patterns of this ANN are the XOR examples given in
the following table:

Pattern   Input 1   Input 2   Target
#1        0         0         0
#2        0         1         1
#3        1         0         1
#4        1         1         0

Cont.
• The ANN model and its initial weights,

• Training begins when the pattern#1 and its target are provided to the
ANN.
• 1st pattern: 0, 0 target : 0

Compute the error by comparing this value to the target,



Cont.
• This error is now backpropagated through the layers following the
error signal equations given as follows:
• Between output (k) and hidden (j) layer

• Thus
• Between hidden (j) and input (i) layer :

• = -0.0035

Cont.
• Now we have calculated the error signal between layers (k) and (j)

• If we had chosen the learning rate and momentum term as follows :


• η = 0.1 and α= 0.9
• and the previous change in weight is 0 and Ojo= 0.5
• Then,

= -0.0064

Cont.
• This is the increment of the weight after the first iteration for the
weight between layers k and j.
• Now this change in weight is added to the actual weight as follows

• and thus the weight between layers k and j has been adapted.

Cont.
• Similarly for the weights between layers j and i, the adaptation follows

• Now this change in weight is added to the actual weight as follows:

• and this is the adapted weight between layers j and i after pattern#1 is
seen by the ANN in the first iteration.
• The whole calculation is then repeated for the next pattern (pattern#2 =
[0, 1]) with tk=1.
• After all the 4 patterns have been completed the whole process is
repeated for pattern#1 again.
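
The arithmetic for pattern #1 can be retraced with a short sketch. The initial weights of the example are not reproduced in this text, so the hidden-to-output weight w_kj = 0.11 below is an assumption, chosen only because it gives the quoted values -0.0035 and -0.0064 to four decimals. The error-signal formulas are the usual ones for logistic units and are likewise assumed.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


eta, alpha = 0.1, 0.9      # values quoted in the example
w_kj = 0.11                # ASSUMED initial hidden-to-output weight (not given in the text)
x, t_k = (0.0, 0.0), 0.0   # pattern #1 and its target

# Forward pass: with both inputs at 0 and no bias, the hidden net input is 0
# whatever the input-to-hidden weights are, so O_j = sigmoid(0) = 0.5.
o_j = sigmoid(0.0)
o_k = sigmoid(w_kj * o_j)

# Backward pass for logistic units.
delta_k = o_k * (1.0 - o_k) * (t_k - o_k)
delta_j = o_j * (1.0 - o_j) * delta_k * w_kj

# Weight change with momentum; the previous change is 0 on the first step.
dw_kj = eta * delta_k * o_j + alpha * 0.0
w_kj_new = w_kj + dw_kj

print(round(delta_j, 4), round(dw_kj, 4))  # -0.0035 -0.0064
```

Since both inputs of pattern #1 are 0, the change of the input-to-hidden weights is 0 for this pattern under the same formulas.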

UNSUPERVISED LEARNING

Unsupervised Learning
• Unsupervised learning is the process of finding structure, patterns
or correlation in the given data.
• Many times this type of learning depends on associative learning
procedures.
• We focus on two main approaches:
   • Unsupervised Hebbian learning
      • Principal component analysis
   • Unsupervised competitive learning
      • Clustering

Types of Analysis used in Unsupervised Learning
• Correlational analysis
   • Identifying the correlations among features.
   • Accomplished via Hebbian learning.
• Cluster analysis
   • Identifying the relational structure of the data.
   • Accomplished via competitive learning.

• Cluster analysis is a form of categorization, whereas
correlational analysis is a form of simplification.

Hebbian Learning
• An association principle was proposed by Hebb in 1949 in the
context of biological neurons.
• Hebb’s principle
When a neuron repeatedly excites another neuron, then the
threshold of the latter neuron is decreased, or the synaptic
weight between the neurons is increased, in effect increasing
the likelihood of the second neuron to be excited by the first.

Hebbian Learning as Correlation Learning


• Hebbian learning is associative learning: it associates things that
occur together.
• Thus Hebbian learning can be thought of as learning the
autocorrelation of the input space.
• Example: a child recognizes a banana by its shape & wants to eat it.
Then he smells it, and after a couple of exposures to that experience he
starts drooling as soon as he smells it, without even seeing it.
• Conclusion: the child has associated the smell with the banana &
produced a response (hunger effect) even without seeing its shape.

Cont.
• Brilliant idea by Hebb (1949): cells that fire together, wire
together.

[Figure: a banana-smell neuron connected to a hungry neuron]

Hebbian Learning Neural Network

[Figure: Hebbian learning network, with input signals entering neurons i and output signals leaving neurons j]

Banana Associator Example



Example (cont.)
• The inputs are defined as follows:

• If we want the network to associate the response to the shape of
the banana & not its smell, w0 is assigned a value greater than –b,
while w is assigned a value less than –b.
• Hence we choose w0 = 1 & w = 0.
• The output of the network reduces to
a = hardlim(p0 - 0.5)
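
The associator with these weights can be sketched as below. The sketch assumes that p0 and p are 0/1 flags (shape seen, smell detected), that the neuron computes hardlim(w0 p0 + w p + b), and that b = -0.5, which is inferred from the reduced output above; the function names are illustrative.

```python
def hardlim(n):
    """Hard-limit transfer function: 1 if n >= 0, otherwise 0."""
    return 1 if n >= 0 else 0


def banana_associator(p0, p, w0=1.0, w=0.0, b=-0.5):
    """a = hardlim(w0*p0 + w*p + b); p0 = shape seen, p = smell detected."""
    return hardlim(w0 * p0 + w * p + b)


print(banana_associator(p0=1, p=0))  # 1: shape alone triggers the response
print(banana_associator(p0=0, p=1))  # 0: smell alone does not, since w = 0
```

Hebbian learning would then increase w whenever smell and response occur together, until the smell alone is enough.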

Hebbian Learning
• Hebbian learning rule: $\Delta w_{ji} = \eta y_j x_i$
• Consider the update of a single weight w,
$w(n + 1) = w(n) + \eta y(n) x(n)$
• For a linear activation function, y(n) = w(n) x(n), so
$w(n + 1) = w(n)[1 + \eta x^2(n)]$
• The weights grow in magnitude without bound. If the initial weight is
negative, it becomes more negative; if it is positive, it becomes more
positive.
• Hebbian learning is naturally unstable.
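
The instability is easy to see numerically. The sketch below applies the plain Hebbian update to a single weight of a linear unit with a constant input; the values of eta, x and the starting weights are arbitrary choices for the illustration.

```python
def hebbian_step(w, x, eta=0.5):
    """Plain Hebbian update for a linear unit y = w * x."""
    y = w * x
    return w + eta * y * x


for w_start in (0.1, -0.1):
    w = w_start
    for _ in range(10):
        w = hebbian_step(w, x=1.0)
    print(w_start, w)  # magnitude grows by a factor 1.5 at every step
```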

Oja’s Learning Rule


• To solve the problem of the simple Hebbian rule, which causes the
weights to increase (or decrease) without bound, the weights need to
be normalized to unit length as follows,
$w_{ji}(n + 1) = \frac{w_{ji}(n) + \eta y_j(n) x_i(n)}{\sqrt{\sum_i [w_{ji}(n) + \eta y_j(n) x_i(n)]^2}}$
• This equation effectively imposes a constraint on the weights.
• Oja approximated the normalization (for small η) as:

Oja’s Rule (continued)


$w_{ji}(n + 1) = w_{ji}(n) + \eta y_j(n)[x_i(n) - y_j(n) w_{ji}(n)]$

• This rule is also known as the generalized Hebbian rule.


• The 2nd term, $-\eta y_j^2(n) w_{ji}(n)$, is called a weight decay term or a ‘forgetting term’.
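
The stabilizing effect of the decay term can be sketched for a single linear neuron. The input vector, starting weights and learning rate below are illustrative assumptions; a single repeated unit-length input is used so that the behaviour is easy to follow.

```python
import math


def oja_step(w, x, eta=0.1):
    """Oja's rule for one linear neuron: w_i += eta * y * (x_i - y * w_i)."""
    y = sum(wi * xi for wi, xi in zip(w, x))
    return [wi + eta * y * (xi - y * wi) for wi, xi in zip(w, x)]


w = [0.1, 0.0]       # small arbitrary starting weights
x = [0.6, 0.8]       # a single, repeated unit-length input (illustrative)
for _ in range(500):
    w = oja_step(w, x)
print(w, math.hypot(*w))  # the weight norm settles near 1 instead of growing
```

With the plain Hebbian rule the same loop would make the weights grow without bound; here they line up with the input direction and stay at unit length.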
