Eem520l3 2023
Multilayer perceptron (MLP): contains hidden layers and different activation functions.
Nonlinear classification and approximation (differentiable activation functions).
Activation functions for MLP: the sigmoid or the hyperbolic tangent are commonly used.
Sign activation: its non-differentiability prevents its use for creating the loss function
at training time. In MLP implementations the hard-limiter function is usually
replaced by a smooth nonlinear activation function (error-based learning).
Feed-forward network
MLP with two hidden layers: feedforward neural networks are the models most often used for
solving nonlinear classification and regression tasks by learning from data.
Activation function selection for MLP
In recent years, however, a number of piecewise linear activation functions have become more
popular. The ReLU and hard tanh activation functions have substantially replaced the sigmoid and
soft tanh activation functions in modern neural networks.
ReLU is an activation function that maps any number to zero if it is negative, and otherwise maps
it to itself. The ReLU function has been found to work very well for networks with many layers
because it helps prevent vanishing gradients when training deep networks.
The tanh function is preferable to the sigmoid when the outputs of the computations are
desired to be both positive and negative. The sigmoid and the tanh functions have been the
historical tools of choice for incorporating nonlinearity in the neural network.
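The activation functions named above can be sketched in a few lines of Python. This is only an illustration: the function names are chosen here and do not come from the course files, and the hard tanh is assumed to clip its input to the interval [-1, 1].

```python
import math

def sigmoid(x):
    """Logistic sigmoid: smooth, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    """S'(x) = S(x) * (1 - S(x)); never larger than 0.25."""
    s = sigmoid(x)
    return s * (1.0 - s)

def soft_tanh(x):
    """Hyperbolic tangent: smooth, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """Maps negative inputs to zero and leaves other inputs unchanged."""
    return max(0.0, x)

def relu_derivative(x):
    """Slope is 1 for positive inputs, 0 otherwise (the value at 0 is a convention)."""
    return 1.0 if x > 0 else 0.0

def hard_tanh(x):
    """Piecewise linear stand-in for tanh: clips the input to [-1, 1] (assumed range)."""
    return max(-1.0, min(1.0, x))
```

The sigmoid derivative is at most 0.25, so products of many such factors shrink quickly, while the ReLU derivative stays at 1 for active units. This is the usual explanation for why ReLU helps with vanishing gradients.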
Learning in an MLP is similar to the learning algorithm of Adaline (error-based learning: the
learning process is an optimization approach that minimizes an error function). The optimization
algorithm is called “gradient descent”, where “gradient” refers to the calculation of an error
gradient, or slope of the error, and “descent” refers to moving down along that slope towards
some minimum level of error. The algorithm is iterative. The weight update procedure is based on
the backpropagation update algorithm (general approach).
What is gradient descent? Gradient descent is an optimization algorithm often used for finding
the weights or coefficients of machine learning models, such as artificial neural networks and
logistic regression. Optimization is a big part of machine learning.
Consequently, assuming a training set composed of p
samples, the measurement of the global performance of the
backpropagation algorithm can be calculated through the
“mean squared error” defined by
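The formula itself did not survive in these notes. A commonly used form is sketched below, assuming that E(k) is the squared error of the k-th training sample, that d_i(k) and y_i(k) denote the desired and actual outputs of the i-th output neuron, and that q is the number of output neurons. The factor 1/2 is a convention assumed here, not something stated in the text.

```latex
E(k) = \frac{1}{2} \sum_{i=1}^{q} \bigl( d_i(k) - y_i(k) \bigr)^2,
\qquad
E_M = \frac{1}{p} \sum_{k=1}^{p} E(k)
```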
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local
minimum of a given function (stepping along the positive gradient instead, i.e. gradient ascent,
finds a local maximum). The term "gradient" is typically used for functions with several inputs
and a single output (a scalar field). A line can also be said to have a gradient (its slope), but
using "gradient" for single-variable functions is unnecessarily confusing.
"A gradient measures how much the output of a function changes if you change the inputs a little
bit." (Lex Fridman, MIT)
Gradient descent uses the negative of the gradient of the function at the current point, because
the gradient points in the direction of greatest increase. The gradient is represented by the
symbol ∇ (nabla).
Error-based learning is an optimization problem: gradient vector (for multivariable functions).
Numerical methods for nonlinear optimization problems are needed because the
analytical methods for solving some of the problems are too cumbersome to use.
Step 1. Estimate a reasonable starting design x(0). Set the iteration counter k = 0.
Step 2. Compute a search direction d(k) in the design space. This calculation
generally requires a cost function value and its gradient for unconstrained problems
and, in addition, constraint functions and their gradients for constrained problems.
Step 3. Check for convergence of the algorithm. If it has converged, stop; otherwise,
continue.
Step 4. Calculate a positive step size a_k in the direction d(k).
Step 5. Update the design as follows, set k = k + 1 and go to Step 2: x(k+1) = x(k) + a_k d(k)
Gradient-based methods compute both a direction p_k and a step length α_k at each iteration k.
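The iterative scheme in Steps 1 to 5 can be sketched as follows, assuming the simplest choices: the search direction is the negative gradient (steepest descent) and the step size is a fixed constant rather than the result of a line search. The test function and all names are hypothetical and serve only to illustrate the loop.

```python
def gradient_descent(grad, x0, step=0.1, tol=1e-8, max_iter=10_000):
    """Fixed-step steepest descent; returns the final point and iteration count."""
    x = list(x0)                                   # Step 1: starting design, k = 0
    for k in range(max_iter):
        d = [-g for g in grad(x)]                  # Step 2: search direction
        if sum(di * di for di in d) ** 0.5 < tol:  # Step 3: convergence check
            return x, k
        # Steps 4-5: fixed positive step size, then update the design
        x = [xi + step * di for xi, di in zip(x, d)]
    return x, max_iter

def grad_f(x):
    """Gradient of the illustrative cost f(x) = (x0 - 1)^2 + 2 * (x1 + 3)^2."""
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 3.0)]

x_min, iterations = gradient_descent(grad_f, [0.0, 0.0])
```

For this convex cost the iterates approach the minimizer (1, -3). On a neural network error surface the same loop only reaches a local minimum.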
Error-based learning is an optimization problem (gradient vector): learning in an ANN
can be viewed as a nonlinear optimization problem of finding a
set of network parameters (weights) that minimize the cost function (error) for given
examples. The error function contains the activation function.
Weight update
What Is an Epoch?
The number of epochs is a hyperparameter that defines the number of times that the learning
algorithm will work through the entire training dataset.
If the gradient is positive, then we decrease the weights; and conversely, if the gradient is
negative, then we increase them.
Mini-batch gradient descent: 1 < batch size < size of training set.
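The relation between epochs, batch size and the number of weight updates can be illustrated with a short sketch. The helper name `iterate_minibatches` and the toy data are hypothetical, and the weight update itself is left as a comment.

```python
import random

def iterate_minibatches(samples, batch_size, rng):
    """Yield the training set in shuffled chunks of at most batch_size samples."""
    indices = list(range(len(samples)))
    rng.shuffle(indices)
    for start in range(0, len(indices), batch_size):
        yield [samples[i] for i in indices[start:start + batch_size]]

samples = list(range(10))      # stand-in for 10 training patterns
rng = random.Random(0)
n_epochs = 3                   # hyperparameter: passes over the whole training set
updates = 0
epoch_coverage = []
for epoch in range(n_epochs):
    seen = []
    for batch in iterate_minibatches(samples, batch_size=4, rng=rng):
        seen.extend(batch)
        updates += 1           # one weight update per mini-batch would happen here
    epoch_coverage.append(sorted(seen))
```

With a batch size of 1 this reduces to on-line (stochastic) training, and with a batch size equal to the training set size it becomes batch training.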
Mini-batch: a subset of the training set. Off-line: batch training.
E(k): on-line error (training is made pattern by pattern: stochastic/online).
q: total number of output-layer neurons.
Index i represents the i-th neuron of the output layer.
n is the number of training patterns.
Back-propagation neural network (learning algorithm for MLP)
Learning in a multilayer network proceeds the same way as for a perceptron. A training set
of input patterns is presented to the network. The network computes its output pattern, and if
there is an error, in other words a difference between the actual and desired output patterns,
the weights are adjusted to reduce this error. The backpropagation algorithm for training
multilayer neural networks is a generalization of the LMS training procedure (Adaline). In a
back-propagation neural network, the learning algorithm has two phases. First, a training
input pattern is presented to the network input layer. The network propagates the input
pattern from layer to layer until the output pattern is generated by the output layer. If this
pattern is different from the desired output, an error is calculated and then propagated
backwards through the network from the output layer to the input layer. The weights are
modified as the error is propagated. Back-propagation is an automatic differentiation
algorithm for calculating gradients for the weights in a neural network graph structure.
Stochastic gradient descent and the back-propagation of error algorithm are used together
to train neural network models.
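[Figure: three-layer back-propagation network. Input signals x1, x2, ..., xi, ..., xn enter the
input layer (index i), pass through the hidden layer (index j, weights w_ij) and the output layer
(index k, weights w_jk) to the outputs y1, y2, ..., yk, ..., yl. Error signals propagate in the
opposite direction.]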
Features of Backpropagation:
It is the gradient descent method as used in the simple perceptron network, applied to
differentiable units.
It differs from other networks in the process by which the weights are
calculated during the learning period of the network.
Training is done in three stages:
the feed-forward of the input training pattern,
the calculation and backpropagation of the error,
the updating of the weights.
Backpropagation is short for “backward propagation of errors”.
The negative sign is assigned to a gradient to force it to move downhill along the error
surface in the weight space.
Back-Propagation Algorithm
Back-propagation, also called “backpropagation” or simply “backprop”, is an algorithm
for calculating the gradient of a loss function with respect to the variables of a model.
The name “backpropagation” literally comes from “propagating the errors back through the
network”. As the errors are propagated backwards, the gradient computed for the last layer
(the layer closest to the output) is used to calculate the gradient of the second-to-last
layer. This reuse of the gradient from the following layer continues layer by layer until
the first layer, i.e. the layer closest to the input, is reached.
Learning Algorithm: Backpropagation (Ref. book: A Guide to Intelligent Systems)
Sigmoid activation function and its derivative:
$f(x) = S(x) = \frac{1}{1 + e^{-x}}$, $S'(x) = S(x)\,[1 - S(x)]$
Actual outputs of the neurons in the output layer:
$y_k(p) = \mathrm{sigmoid}\left[ \sum_{j=1}^{m} x_{jk}(p) \cdot w_{jk}(p) - \theta_k \right]$
(a) Calculate the error gradient for the neurons in the output layer:
$\delta_k(p) = y_k(p) \cdot [1 - y_k(p)] \cdot e_k(p)$, where $e_k(p) = y_{d,k}(p) - y_k(p)$
Calculate the weight corrections: $\Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)$
Update the weights at the output neurons: $w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$
(b) Calculate the error gradient for the neurons in the hidden layer:
$\delta_j(p) = y_j(p) \cdot [1 - y_j(p)] \cdot \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$
Calculate the weight corrections: $\Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p)$
Update the weights at the hidden neurons: $w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$
Step 4: Iteration: Increase iteration p by one, go back to Step 2 and repeat the process until the
selected error criterion is satisfied.
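One iteration of the algorithm for a network with a single hidden layer can be sketched as below. It is a minimal illustration rather than the book's implementation. Thresholds are treated as weights on a fixed input of −1, so they are corrected by α · (−1) · δ (an assumption based on the network figure). The learning rate α = 0.1 is also assumed here. The starting weights are those quoted in the XOR example of these notes.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def forward(x, w_ih, th_h, w_ho, th_o):
    """w_ih[i][j]: input i -> hidden j; w_ho[j][k]: hidden j -> output k."""
    y_h = [sigmoid(sum(x[i] * w_ih[i][j] for i in range(len(x))) - th_h[j])
           for j in range(len(th_h))]
    y_o = [sigmoid(sum(y_h[j] * w_ho[j][k] for j in range(len(y_h))) - th_o[k])
           for k in range(len(th_o))]
    return y_h, y_o

def backprop_step(x, y_d, w_ih, th_h, w_ho, th_o, alpha):
    """Present one pattern, update all weights in place, return the output errors."""
    y_h, y_o = forward(x, w_ih, th_h, w_ho, th_o)
    e = [y_d[k] - y_o[k] for k in range(len(y_o))]
    # (a) error gradients at the output layer
    delta_o = [y_o[k] * (1.0 - y_o[k]) * e[k] for k in range(len(y_o))]
    # (b) error gradients at the hidden layer, using the weights before the update
    delta_h = [y_h[j] * (1.0 - y_h[j])
               * sum(delta_o[k] * w_ho[j][k] for k in range(len(y_o)))
               for j in range(len(y_h))]
    for j in range(len(y_h)):
        for k in range(len(y_o)):
            w_ho[j][k] += alpha * y_h[j] * delta_o[k]
    for k in range(len(y_o)):
        th_o[k] += alpha * (-1.0) * delta_o[k]      # threshold input fixed at -1
    for i in range(len(x)):
        for j in range(len(y_h)):
            w_ih[i][j] += alpha * x[i] * delta_h[j]
    for j in range(len(y_h)):
        th_h[j] += alpha * (-1.0) * delta_h[j]
    return e

w_ih = [[0.5, 0.9], [0.4, 1.0]]   # [w13, w14], [w23, w24]
th_h = [0.8, -0.1]                # theta3, theta4
w_ho = [[-1.2], [1.1]]            # w35, w45
th_o = [0.3]                      # theta5
error_before = backprop_step([1.0, 1.0], [0.0], w_ih, th_h, w_ho, th_o, alpha=0.1)
_, y_after = forward([1.0, 1.0], w_ih, th_h, w_ho, th_o)
```

After this single update the output for the pattern (1, 1) moves slightly towards the desired value 0. Repeating the step over all four XOR patterns for many epochs is what Step 4 describes.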
Example application for the XOR function: a three-layer back-propagation network is
considered for the logical operation Exclusive-OR. Recall that a single-layer perceptron could
not do this operation. Now we will apply the three-layer net.
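[Figure: three-layer network for the XOR function. Input neurons 1 and 2 (inputs x1, x2) feed
hidden neurons 3 and 4 through weights w13, w14, w23, w24; the hidden neurons feed output
neuron 5 (output y5) through weights w35, w45. Thresholds θ3, θ4, θ5 are attached to fixed
inputs of −1.]
The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1,
θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3. (Ref. book: A Guide to Intelligent Systems)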
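We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0.
The actual outputs of neurons 3 and 4 in the hidden layer are calculated as
$y_3 = \mathrm{sigmoid}(x_1 w_{13} + x_2 w_{23} - \theta_3) = 1 / [1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)}] = 0.5250$
$y_4 = \mathrm{sigmoid}(x_1 w_{14} + x_2 w_{24} - \theta_4) = 1 / [1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)}] = 0.8808$
The actual output of neuron 5 in the output layer is then
$y_5 = \mathrm{sigmoid}(y_3 w_{35} + y_4 w_{45} - \theta_5) = 1 / [1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)}] = 0.5097$
Thus, the following error is obtained: $e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097$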
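The forward pass of this worked example can be checked numerically with a few lines of Python. The variable names mirror the symbols in the text; nothing beyond the stated weights, thresholds and inputs is assumed.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Inputs and desired output of the training pattern
x1, x2, y_d5 = 1.0, 1.0, 0.0

# Initial weights and thresholds quoted in the text
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
theta3, theta4, theta5 = 0.8, -0.1, 0.3

# Hidden layer, then output layer
y3 = sigmoid(x1 * w13 + x2 * w23 - theta3)
y4 = sigmoid(x1 * w14 + x2 * w24 - theta4)
y5 = sigmoid(y3 * w35 + y4 * w45 - theta5)

# Output error for this pattern
e = y_d5 - y5
```

Rounded to four decimals, the values agree with the figures quoted in the text (0.5250, 0.8808, 0.5097 and −0.5097).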
The next step is weight training. We propagate the error, e, from the output layer backward to
the input layer. First, we calculate the error gradient for neuron 5 in the output layer:
$\delta_5 = y_5 (1 - y_5)\, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$
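The error gradients for this pattern follow directly from the formulas of the learning algorithm. The sketch below starts from the rounded neuron outputs quoted in the example and computes the gradient of the output neuron and of the two hidden neurons. The learning rate α = 0.1 used for the weight corrections is an assumption, since these notes do not state it.

```python
# Rounded values from the worked example
y3, y4, y5 = 0.5250, 0.8808, 0.5097
e = -0.5097
w35, w45 = -1.2, 1.1

# Output layer: delta_k = y_k * (1 - y_k) * e_k
delta5 = y5 * (1.0 - y5) * e

# Hidden layer: delta_j = y_j * (1 - y_j) * sum_k delta_k * w_jk
delta3 = y3 * (1.0 - y3) * delta5 * w35
delta4 = y4 * (1.0 - y4) * delta5 * w45

# Weight corrections at the output neuron for an assumed learning rate
alpha = 0.1                       # assumption, not given in these notes
dw35 = alpha * y3 * delta5
dw45 = alpha * y4 * delta5
```

The hidden-layer gradients are computed with the weights w35 and w45 as they were before any correction is applied.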
[Figure: the three-layer XOR network and its learning curve, sum-squared error (log scale)
versus epoch.]
The training process is repeated until the sum of squared errors is less than 0.001.
(Ref. book: A Guide to Intelligent Systems)
The process can be considered converged when the change in mean squared error between two
successive epochs is sufficiently small, that is, below ε, where ε is the precision required for
the convergence process.
Final results of three-layer network learning:
Inputs     Desired output   Actual output   Error      Sum of squared errors
x1   x2    yd               y5              e
1    1     0                0.0155          −0.0155    0.0010
0    1     1                0.9849           0.0151
1    0     1                0.9849           0.0151
0    0     0                0.0175          −0.0175
Several variations of the backpropagation method have been proposed to improve
the efficiency of its convergence. Among them are the method that uses the momentum
parameter, resilient propagation, and the Levenberg-Marquardt method.
For small values of the learning parameter η, the algorithm most often converges very slowly.
Larger learning parameters are known to cause unwanted oscillations in the weight space and
may even make the algorithm diverge. To avoid these issues, researchers have devised a modified
weight-update algorithm in which the weight change of the upcoming iteration (at time t + 1)
depends on the weight change of the current iteration (at time t).
Momentum
Momentum term: when the current solution (reflected by its weight matrices) is far from the
final solution (the minimum point of the error function), the variation in the direction opposite
to the gradient of the squared error function between two successive iterations will be
significant. This implies that the difference between the weight matrices of these two iterations
will be relevant and, in this case, it is possible to take a bigger incremental step for the
weights in the direction of the minimum of the error function. The momentum term is in charge of
this task, since it measures this variation. However, when the current solution is very near the
final solution, the variations in the weight matrices will be small, since the variation of the
mean squared error between two successive iterations will be minor and, consequently, the
contribution of the momentum term to the convergence process will be slight. From this moment on,
adjustments to the weight matrices are (usually) driven only by the learning term.
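The effect of the momentum term can be illustrated on a one-dimensional error function. The sketch assumes the common form of the rule, Δw(t) = −α · dE/dw + β · Δw(t − 1), where β is the momentum constant. The symbol β, the test function E(w) = w² and the parameter values are all choices made here for illustration.

```python
def grad_E(w):
    """Gradient of the illustrative error function E(w) = w^2."""
    return 2.0 * w

def train(alpha, beta, steps, w0=5.0):
    """Gradient descent with a momentum term; beta = 0 gives plain gradient descent."""
    w, dw_prev = w0, 0.0
    for _ in range(steps):
        # learning term plus momentum term (a fraction of the previous change)
        dw = -alpha * grad_E(w) + beta * dw_prev
        w += dw
        dw_prev = dw
    return w

w_plain = train(alpha=0.01, beta=0.0, steps=100)
w_momentum = train(alpha=0.01, beta=0.9, steps=100)
```

With the same small learning rate, the run with momentum ends much closer to the minimum at w = 0 after 100 steps, because successive weight changes in the same direction reinforce each other.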
[Figure: learning with momentum (XOR problem): sum-squared error (log scale) versus epoch.]
[Figure: learning with adaptive learning rate (XOR problem), training for 103 epochs:
sum-squared error (log scale) versus epoch, and learning rate versus epoch.]
[Figure: learning with momentum and adaptive learning rate (XOR problem), training for 85 epochs:
sum-squared error (log scale) versus epoch, and learning rate versus epoch.]
Ref. book: A Guide to Intelligent Systems
Learning with adaptive learning rate
To accelerate the convergence and yet avoid the danger of instability, we can apply
two heuristics:
Heuristic 1
If the change of the sum of squared errors has the same algebraic sign for several
consecutive epochs, then the learning rate parameter, α, should be increased.
Heuristic 2
If the algebraic sign of the change of the sum of squared errors alternates for
several consecutive epochs, then the learning rate parameter, α, should be
decreased.
Adapting the learning rate requires some changes in the back-propagation
algorithm.
If the sum of squared errors at the current epoch exceeds the previous value by
more than a predefined ratio (typically 1.04), the learning rate parameter is
decreased (typically by multiplying by 0.7) and new weights and thresholds are
calculated.
If the error is less than the previous one, the learning rate is increased (typically by
multiplying by 1.05).
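The rule described above can be written as a small helper. It is a sketch that uses the typical constants quoted in the text (1.04, 0.7 and 1.05). The function name and the returned flag, which signals whether the new weights should be kept or recalculated with the smaller rate, are assumptions made for this illustration.

```python
def adapt_learning_rate(alpha, sse_now, sse_prev,
                        max_ratio=1.04, decrease=0.7, increase=1.05):
    """Return (new_alpha, keep_step) after comparing two successive epochs."""
    if sse_now > sse_prev * max_ratio:
        # Error grew too much: shrink the rate and redo the weight calculation.
        return alpha * decrease, False
    if sse_now < sse_prev:
        # Error went down: speed up slightly.
        return alpha * increase, True
    # Error grew, but within the tolerated ratio: leave the rate unchanged.
    return alpha, True
```

For example, `adapt_learning_rate(0.1, 1.10, 1.00)` shrinks the rate to about 0.07 and asks for the step to be recalculated, while `adapt_learning_rate(0.1, 0.90, 1.00)` raises it to about 0.105.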
MLP for pattern classification: an MLP with one hidden layer can map any
pattern classification problem whose elements are within a convex region.
The classification of the samples in Fig. 5.17a would require two neurons in the hidden layer of
the MLP, whereas the classification of the samples in Fig. 5.17b, c would require six and four
neurons, respectively, in the hidden layers of the MLP networks. From a geometric point of view,
a region is considered convex if, and only if, all the points of any line segment defined between
any pair of points from the domain are inside this region. Figure 5.18a presents an illustration
of a convex region, while Fig. 5.18b shows a non-convex region. Thus, considering that MLP
networks with a single hidden layer can classify patterns placed within a convex region, it is
possible to deduce that MLP networks with two hidden layers can classify patterns that are within
any geometric region (Lui 1990; Lippmann 1987), including non-convex regions such as that shown
in Fig.
Pattern classification: in pattern classification problems with more than two
classes, more neurons must be inserted in the output layer of the network,
because an MLP with a single neuron in its output layer can distinguish just two classes. As an
example, an MLP composed of two neurons in its output layer could represent, at most, four
classes (Fig.). A network with three neurons in the output layer could classify a total of eight
classes. Generalizing this concept, an MLP with m neurons in its output layer would be able to
classify, theoretically, up to 2^m classes.
Alternatively, one of the most used methods of codification is the “one of c-class” method, which
consists in associating the output of each neuron directly to the class.
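The two output codings can be compared with a short sketch. The function names are hypothetical. The binary coding uses m output neurons for up to 2^m classes, while the "one of c-class" coding uses one output neuron per class and is assumed here to be decoded by picking the neuron with the largest output.

```python
def binary_code(class_index, m):
    """Code a class index on m output neurons (most significant bit first)."""
    return [(class_index >> bit) & 1 for bit in reversed(range(m))]

def one_of_c(class_index, c):
    """One output neuron per class: a single 1 at the position of the class."""
    return [1 if i == class_index else 0 for i in range(c)]

def decode_one_of_c(outputs):
    """Pick the class whose output neuron responds most strongly."""
    return max(range(len(outputs)), key=lambda i: outputs[i])

# Three output neurons can distinguish 2**3 = 8 classes with binary coding
codes = [tuple(binary_code(c, 3)) for c in range(8)]
```

Binary coding needs fewer output neurons, while the one-of-c coding needs c neurons but lets each output be read directly as evidence for one class.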
Number of neurons in hidden layer affects the accuracy of recognition and speed of learning:
Look at: digital _recognition.m and bit_map.m files Ref. Book A Guide to Intelligent Sys.
digit2 = [0 1 1 1 0
          1 0 0 0 1
          0 0 0 0 1
          0 0 0 0 1
          0 0 0 1 0
          0 0 1 0 0
          0 1 0 0 0
          1 0 0 0 0
          1 1 1 1 1 ];
s1 = 12; % Number of neurons in the hidden layer
s2 = 10; % Number of neurons in the output layer
net = newff(minmax(p),[s1 s2],{'logsig' 'purelin'},'traingdx');
Input matrix p: 45 x 10 (one 45-element bitmap per digit).
Target (output) matrix t: 10 x 10, with a single 1 in each target vector marking the digit's class.
Book: Artificial Intelligence: A Guide to Intelligent Systems
Overfitting problem: a state in which an ANN has memorised all the training examples but cannot
generalise. Overfitting may occur if the number of hidden-layer neurons is too big. To prevent
overfitting, it is better to choose the smallest number of hidden neurons that still gives
adequate performance. In this example, 2, 5, 10 and 20 neurons are selected. The results show
that there is no significant difference between the networks with 10 and 20 neurons.
Command for creating the network: … TF2},BTF,BLF,PF);
Example: application of MLP for function approximation: effect of the hidden layer
[Figure: two panels for networks with 3 and 6 neurons in the hidden layer, each showing the
desired output and the network output versus input x.]
Effect of hidden nodes on function approximation (Example 5.2: effect of the hidden layer).
To illustrate the effects of the number of hidden neurons on the approximation capabilities of
the MLP, we use here the simple function f(x) = x sin(x).
Matlab:New
%Example5;
x = 0:.05:2; y=humps(x);
P=x; T=y;
%Plot the data
plot(P,T,'x')
grid; xlabel('time (s)'); ylabel('output'); title('humps function')
% DESIGN THE NETWORK