Eem520l3 2023

The document discusses neural networks and multilayer perceptrons (MLPs). It covers topics like: - MLPs contain hidden layers and use activation functions like sigmoid, tanh. - Error-based learning and gradient descent are used to minimize error during training. - The gradient vector and Hessian matrix are important for optimization. - Numerical optimization methods are iterative processes for finding minimum error.

Uploaded by

Berkan Tezcan

EEM520/620 (Lecture 3): Neural networks

Multilayer perceptron (MLP), nonlinear classification and function approximation

Gradient descent optimization rule; error minimization as an optimization process
Combining output neurons by a linear function (effect of the hidden layer and activation function)

A multilayer perceptron (MLP) contains hidden layers and can use different activation functions.
Nonlinear classification and approximation (the activation function must be differentiable)
Activation functions for MLP: sigmoid or hyperbolic tangent functions are commonly used.
Sign activation: its non-differentiability prevents its use for creating the loss function at training time. In MLP implementations the hard-limiter function is usually replaced by a smooth nonlinear activation function (error-based learning).

feed-forward network

MLP with two hidden layers: Feedforward neural networks are the
models used most often for solving nonlinear classification and
regression tasks by learning from data.
Activation function selection for MLP

In recent years, however, a number of piecewise linear activation functions have become more popular:

The ReLU and hard tanh activation functions have largely replaced the sigmoid and soft tanh activation functions in modern neural networks.

ReLU is an activation function that maps any number to zero if it is negative, and otherwise maps it to itself. The ReLU function has been found to work very well for networks with many layers because it helps to prevent vanishing gradients when training deep networks.

The tanh function is preferable to the sigmoid when the outputs of the computations are desired to be both positive and negative. The sigmoid and the tanh functions have been the historical tools of choice for incorporating nonlinearity in the neural network.
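
The four activation functions named above can be compared side by side. The following is a minimal sketch using only the Python standard library; the function names are chosen here for illustration and do not come from the lecture.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: maps any real number into the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def soft_tanh(x):
    """Hyperbolic tangent: maps any real number into (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def hard_tanh(x):
    """Hard tanh: piecewise linear, clipped to the interval [-1, 1]."""
    return max(-1.0, min(1.0, x))

# Print the value of each activation at a few sample points.
for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
    print(f"x={x:+.1f}  sigmoid={sigmoid(x):.4f}  tanh={soft_tanh(x):+.4f}  "
          f"relu={relu(x):.1f}  hard_tanh={hard_tanh(x):+.1f}")
```

Sigmoid outputs are always positive, whereas tanh and hard tanh produce both signs, which is the property referred to in the comparison above.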

Learning in an MLP is similar to the learning algorithm of Adaline (error-based learning: minimizing the error function makes the learning process an optimization problem). The optimization algorithm is called "gradient descent", where "gradient" refers to the calculation of an error gradient, or slope of the error, and "descent" refers to moving down along that slope towards some minimum level of error. The algorithm is iterative. The weight update procedure is based on the backpropagation update algorithm (general approach).
What is gradient descent? Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression. Optimization is a big part of machine learning.
Assuming a training set composed of p samples, the global performance of the backpropagation algorithm can be measured through the "mean squared error", that is, the squared error averaged over the p samples.
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum of a given function (stepping along the gradient instead of against it finds a local maximum). The term "gradient" is typically used for functions with several inputs and a single output (a scalar field). A line can be said to have a gradient (its slope), but using "gradient" for single-variable functions is unnecessarily confusing. "A gradient measures how much the output of a function changes if you change the inputs a little bit." (Lex Fridman, MIT)
Gradient descent uses the negative of the gradient of the function at the current point, because the gradient points in the direction of greatest increase. The gradient is represented by the symbol ∇ (nabla).

The GD algorithm does not work for all functions. There are two specific requirements: the function has to be differentiable and convex. (GD can still be run on a non-convex function, but it may then stop at a local minimum or a saddle point.)
GD iteratively calculates the next point using the gradient at the current position: it scales the gradient by a learning rate and subtracts the obtained value from the current position (makes a step).

Error-based learning is an optimization problem: gradient vector (for a multivariable function)

The gradient vector has several properties. The most important property is that the gradient at a point x points in the direction of maximum increase in the cost function.
Geometrically, the gradient vector is normal to the tangent plane at the point x*, as shown in the figure for a function of three variables. These properties will be used in developing optimality conditions and numerical methods for optimum design.
The matrix of second-order partial derivatives is called the Hessian matrix. It is a symmetric matrix.
EXAMPLE: Calculation of gradient vector: calculate the gradient vector for the function at the point x* = (1.8, 1.6).

Example 1: a quadratic function

For this function, by taking a learning rate of 0.1 and a starting point at x = 9, we can easily calculate each step by hand. Let's do it for the first 3 steps:
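
The quadratic function itself is not reproduced in this text, so the sketch below assumes f(x) = x² − 4x + 1 purely for illustration; only the learning rate (0.1) and the starting point (x = 9) are taken from the example. It applies the update "next point = current point − learning rate × gradient" three times.

```python
def f(x):
    # Assumed quadratic for illustration only; the lecture's function is not shown.
    return x ** 2 - 4 * x + 1

def df(x):
    # Derivative of the assumed quadratic.
    return 2 * x - 4

def gradient_descent(derivative, start, learning_rate, n_steps):
    """Return the list of points visited, starting point included."""
    points = [start]
    x = start
    for _ in range(n_steps):
        x = x - learning_rate * derivative(x)  # step against the gradient
        points.append(x)
    return points

path = gradient_descent(df, start=9.0, learning_rate=0.1, n_steps=3)
for step, x in enumerate(path):
    print(f"step {step}: x = {x:.4f}, f(x) = {f(x):.4f}")
```

Under this assumption the first three steps move x from 9 to 7.6, 6.48 and 5.584, approaching the minimizer x = 2 of the assumed function.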

Gradient vector for the function f(x) at the point (1.8, 1.6).

Hessian matrix (the Hessian is an n × n matrix): used for second derivatives.

Differentiating the gradient vector once again, we obtain a matrix of second partial derivatives for the function f(x) called the Hessian matrix, or simply the Hessian. Because the order of differentiation does not matter for a twice continuously differentiable function, the Hessian is always a symmetric matrix. (Used by the Levenberg-Marquardt optimization algorithm.)
EXAMPLE: Evaluation of gradient and Hessian of a function: for the following function, calculate the gradient vector and the Hessian matrix at the point (1, 2):

Substituting the point x1 = 1, x2 = 2, the gradient vector is given as: c = (7, 27).
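
The function for this example is not reproduced in the text. The sketch below therefore assumes the polynomial f(x1, x2) = x1³ + x2³ + 2x1² + 3x2² − x1x2 + 2x1 + 4x2, chosen only because its gradient at (1, 2) evaluates to the stated c = (7, 27); treat it as a hypothetical stand-in. The analytic gradient is cross-checked against a central finite difference.

```python
def f(x1, x2):
    # Assumed polynomial (hypothetical stand-in for the lecture's function).
    return x1**3 + x2**3 + 2*x1**2 + 3*x2**2 - x1*x2 + 2*x1 + 4*x2

def gradient(x1, x2):
    """Vector of first partial derivatives of f."""
    return (3*x1**2 + 4*x1 - x2 + 2,
            3*x2**2 + 6*x2 - x1 + 4)

def hessian(x1, x2):
    """Matrix of second partial derivatives of f (symmetric)."""
    return ((6*x1 + 4, -1),
            (-1, 6*x2 + 6))

def numerical_gradient(func, x1, x2, h=1e-6):
    """Central-difference approximation, used only as a sanity check."""
    g1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    g2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (g1, g2)

c = gradient(1, 2)
H = hessian(1, 2)
print("gradient at (1, 2):", c)
print("Hessian at (1, 2):", H)
print("finite-difference check:", numerical_gradient(f, 1.0, 2.0))
```

The two off-diagonal Hessian entries are equal, which illustrates the symmetry property.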
Numerical optimization (related to weight update)
A general algorithm: many numerical solution methods are described by the following iterative prescription:
Vector form: $x^{(k+1)} = x^{(k)} + \Delta x^{(k)}, \quad k = 0, 1, 2, \ldots$
Component form: $x_i^{(k+1)} = x_i^{(k)} + \Delta x_i^{(k)}, \quad i = 1, \ldots, n; \quad k = 0, 1, 2, \ldots$

In these equations, the superscript k represents the iteration number, the subscript i denotes the design variable number (n is the number of design variables), x(0) is a starting point, and Δx(k) is a change in the current point.
The iterative formula is applicable to constrained as well as unconstrained problems. For unconstrained problems, calculations for Δx(k) depend on the cost function and its derivatives at the current design point. For constrained problems, the constraints must also be considered while computing the change in design Δx(k).
Error-based learning, or weight update, is an optimization problem

Numerical methods for nonlinear optimization problems are needed because the analytical methods for solving some of the problems are too cumbersome to use.

The iterative process is summarized as a general algorithm that is applicable to both constrained and unconstrained problems:

Step 1. Estimate a reasonable starting design x(0). Set the iteration counter k = 0.
Step 2. Compute a search direction d(k) in the design space. This calculation generally requires a cost function value and its gradient for unconstrained problems and, in addition, constraint functions and their gradients for constrained problems.
Step 3. Check for convergence of the algorithm. If it has converged, stop; otherwise, continue.
Step 4. Calculate a positive step size αk in the direction d(k).
Step 5. Update the design as $x^{(k+1)} = x^{(k)} + \alpha_k d^{(k)}$, set k = k + 1 and go to Step 2.
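
The five steps can be sketched for an unconstrained problem as follows. Two choices here are assumptions, not part of the lecture: the search direction is the steepest-descent direction (negative gradient), and the step size comes from a simple backtracking line search. The test function is also made up for the demonstration.

```python
def f(x):
    # Hypothetical cost function with its minimum at (1, -2).
    return (x[0] - 1.0) ** 2 + 10.0 * (x[1] + 2.0) ** 2

def grad_f(x):
    return [2.0 * (x[0] - 1.0), 20.0 * (x[1] + 2.0)]

def descend(cost, grad, x0, tol=1e-6, max_iter=1000):
    x = list(x0)                                   # Step 1: starting design, k = 0
    for k in range(max_iter):
        g = grad(x)
        d = [-gi for gi in g]                      # Step 2: search direction
        if sum(gi * gi for gi in g) ** 0.5 < tol:  # Step 3: convergence check
            return x, k
        alpha, fx = 1.0, cost(x)                   # Step 4: backtracking step size
        slope = sum(gi * di for gi, di in zip(g, d))
        while (alpha > 1e-12 and
               cost([xi + alpha * di for xi, di in zip(x, d)])
               > fx + 1e-4 * alpha * slope):
            alpha *= 0.5                           # halve until sufficient decrease
        x = [xi + alpha * di for xi, di in zip(x, d)]  # Step 5: update the design
    return x, max_iter

x_min, iterations = descend(f, grad_f, x0=[0.0, 0.0])
print("minimum near", x_min, "after", iterations, "iterations")
```

Other line search rules or search directions (for example conjugate gradient) fit the same loop by replacing Steps 2 and 4.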

Descent direction and descent step

There are several methods for calculating the step size αk and the search direction d(k). Algorithms for calculating αk are called line search methods.

Gradient-based methods compute both a direction pk and a step length αk at each iteration k.
Error-based learning is an optimization problem (gradient vector): learning of an ANN can be viewed as a nonlinear optimization problem for finding a set of network parameters (weights) that minimize the cost function (error) for given examples. The error function contains the activation function.

O: desired output; d: the output of the neuron (includes the activation function); (O − d): the difference, i.e. the error.
Chain rule
Note: the mean square error is a quadratic function.
Gradient of error
Weight update
Types of Gradient Descent

BATCH GRADIENT DESCENT (offline): calculates the error for each example within the training dataset and updates the parameters only after all examples have been evaluated. One such pass through the dataset is called a training epoch.
STOCHASTIC GRADIENT DESCENT (online): updates the parameters for each training example, one by one.
MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent is the go-to method since it combines the concepts of SGD and batch gradient descent. It simply splits the training dataset into small batches and performs an update for each of those batches.
What Is a Batch?
The batch size is a hyperparameter that defines the number of samples to work through before updating the internal model parameters.
A training dataset can be divided into one or more batches.

Batch Gradient Descent. Batch Size = Size of Training Set
Stochastic Gradient Descent. Batch Size = 1
Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set
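
The three variants differ only in the batch size, which the following sketch makes concrete. The helper name `iterate_batches` and the ten-sample dataset are illustrative assumptions; in practice the samples are usually shuffled before each epoch, which is omitted here.

```python
def iterate_batches(samples, batch_size):
    """Yield consecutive batches; one parameter update would follow each batch."""
    for start in range(0, len(samples), batch_size):
        yield samples[start:start + batch_size]

training_set = list(range(10))  # stand-in for 10 training examples

batch_gd = list(iterate_batches(training_set, len(training_set)))  # batch size = set size
sgd = list(iterate_batches(training_set, 1))                       # batch size = 1
mini_batch = list(iterate_batches(training_set, 4))                # 1 < batch size < set size

print("updates per epoch, batch GD:  ", len(batch_gd))
print("updates per epoch, SGD:       ", len(sgd))
print("updates per epoch, mini-batch:", len(mini_batch))
```

With ten examples, one epoch produces one update for batch gradient descent, ten for stochastic gradient descent, and three for mini-batches of size four (the last batch holds the two remaining examples).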

What Is an Epoch?
The number of epochs is a hyperparameter that defines the number of times that the learning algorithm will work through the entire training dataset.

If the gradient is positive, then we decrease the weights; conversely, if the gradient is negative, then we increase them.
Mini-batch gradient descent (1 < batch size < size of training set) works on a subset of the offline training set.

Ec: offline error (all the training patterns are presented to the system at once)
E(k): online error (training is made pattern by pattern: stochastic/online)
q: total number of output layer neurons
Index i represents the i-th neuron of the output layer
n is the number of training patterns
Back-propagation neural network (learning algorithm for MLP)
Learning in a multilayer network proceeds the same way as for a perceptron. A training set of input patterns is presented to the network. The network computes its output pattern, and if there is an error (in other words, a difference between actual and desired output patterns) the weights are adjusted to reduce this error. The backpropagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure (Adaline). In a back-propagation neural network, the learning algorithm has two phases. First, a training input pattern is presented to the network input layer. The network propagates the input pattern from layer to layer until the output pattern is generated by the output layer. If this pattern is different from the desired output, an error is calculated and then propagated backwards through the network from the output layer to the input layer. The weights are modified as the error is propagated. Back-propagation is an automatic differentiation algorithm for calculating gradients for the weights in a neural network graph structure. Stochastic gradient descent and the back-propagation of error algorithms together are used to train neural network models.

[Figure: three-layer network with an input layer (inputs x1, x2, …, xi, …, xn), a hidden layer (neurons 1 … m) and an output layer (outputs y1, y2, …, yk, …, yl); weights wij connect input i to hidden neuron j and weights wjk connect hidden neuron j to output neuron k. Input signals flow forward, error signals flow backward.]
Features of Backpropagation:
It is the gradient descent method, as used in the case of a simple perceptron network with differentiable units.
It differs from other networks in the process by which the weights are calculated during the learning period of the network.
Training is done in three stages:
the feed-forward of the input training pattern
the calculation and backpropagation of the error
the updating of the weights
Backpropagation is short for "backward propagation of errors".

The negative sign is assigned to the gradient to force the weights to move downhill along the error surface in the weight space.
Back-Propagation Algorithm
Back-propagation, also called "backpropagation" or simply "backprop", is an algorithm for calculating the gradient of a loss function with respect to the variables of a model.

The name "backpropagation" literally comes from "propagating the errors back through the network". By propagating the errors backwards through the network, the gradient of the last layer (the layer closest to the output) is used to calculate the gradient of the second-to-last layer. This propagation of errors, with each layer reusing the partial derivatives already computed for the layer after it, continues until the first layer, i.e. the layer closest to the input.
Learning Algorithm: Backpropagation (calculation in first layer)

[Figure: three-layer back-propagation neural network with input, hidden and output layers; input signals x1 … xn propagate forward through weights wij and wjk to outputs y1 … yl, and error signals propagate backward.]
Backpropagation: Error propagation

The back-propagation training algorithm
Backpropagation = Chain Rule + Dynamic Programming
Step 1: Initialisation

Step 2: Activation
Activate the back-propagation neural network by applying inputs x1(p), x2(p), …, xn(p) and desired outputs yd,1(p), yd,2(p), …, yd,n(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:
$$y_j(p) = \text{sigmoid}\left[\sum_{i=1}^{n} x_i(p) \cdot w_{ij}(p) - \theta_j\right]$$
where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function:
$$f(x) = S(x) = \frac{1}{1 + e^{-x}}$$

(b) Calculate the actual outputs of the neurons in the output layer:
$$y_k(p) = \text{sigmoid}\left[\sum_{j=1}^{m} x_{jk}(p) \cdot w_{jk}(p) - \theta_k\right]$$
where m is the number of inputs of neuron k in the output layer.

Step 3: Weight training
Update the weights in the back-propagation network by propagating backward the errors associated with the output neurons.
(a) Calculate the error gradient for the neurons in the output layer:
$$\delta_k(p) = y_k(p) \cdot [1 - y_k(p)] \cdot e_k(p)$$
where $e_k(p) = y_{d,k}(p) - y_k(p)$. The factor $y_k(p)[1 - y_k(p)]$ is the derivative of the sigmoid, $S'(x) = S(x)(1 - S(x))$.
Calculate the weight corrections:
$$\Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)$$
Update the weights at the output neurons:
$$w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$$
(b) Calculate the error gradient for the neurons in the hidden layer:
$$\delta_j(p) = y_j(p) \cdot [1 - y_j(p)] \cdot \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$$
Calculate the weight corrections:
$$\Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p)$$
Update the weights at the hidden neurons:
$$w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$$
Ref. Book: A Guide to Intelligent Systems
Step 4: Iteration: increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
Example application for the XOR function: a three-layer back-propagation network is considered for the logical operation Exclusive-OR. Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.

[Figure: three-layer network for the XOR function. Inputs x1 and x2 (neurons 1 and 2, input layer) feed hidden neurons 3 and 4 through weights w13, w14, w23 and w24; the hidden neurons feed output neuron 5 (output y5) through weights w35 and w45. Thresholds θ3, θ4 and θ5 are each connected to a fixed input of −1.]

The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.
Ref. Book: A Guide to Intelligent Systems
We consider a training set where inputs x1 and x2 are equal to 1 and the desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

$$y_3 = \text{sigmoid}(x_1 w_{13} + x_2 w_{23} - \theta_3) = 1 / \left[1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)}\right] = 0.5250$$
$$y_4 = \text{sigmoid}(x_1 w_{14} + x_2 w_{24} - \theta_4) = 1 / \left[1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)}\right] = 0.8808$$

Now the actual output of neuron 5 in the output layer is determined as:

$$y_5 = \text{sigmoid}(y_3 w_{35} + y_4 w_{45} - \theta_5) = 1 / \left[1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)}\right] = 0.5097$$

Thus, the following error is obtained:

$$e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097$$

The next step is weight training. We propagate the error, e, from the output layer backward to the input layer. First, we calculate the error gradient for neuron 5 in the output layer:

$$\delta_5 = y_5 (1 - y_5)\, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$$

Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:

$$\Delta w_{35} = \alpha \cdot y_3 \cdot \delta_5 = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067$$
$$\Delta w_{45} = \alpha \cdot y_4 \cdot \delta_5 = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112$$
$$\Delta \theta_5 = \alpha \cdot (-1) \cdot \delta_5 = 0.1 \cdot (-1) \cdot (-0.1274) = 0.0127$$

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:

$$\delta_3 = y_3 (1 - y_3) \cdot \delta_5 \cdot w_{35} = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381$$
$$\delta_4 = y_4 (1 - y_4) \cdot \delta_5 \cdot w_{45} = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147$$

We then determine the weight corrections:

$$\Delta w_{13} = \alpha \cdot x_1 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$$
$$\Delta w_{23} = \alpha \cdot x_2 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$$
$$\Delta \theta_3 = \alpha \cdot (-1) \cdot \delta_3 = 0.1 \cdot (-1) \cdot 0.0381 = -0.0038$$
$$\Delta w_{14} = \alpha \cdot x_1 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$$
$$\Delta w_{24} = \alpha \cdot x_2 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$$
$$\Delta \theta_4 = \alpha \cdot (-1) \cdot \delta_4 = 0.1 \cdot (-1) \cdot (-0.0147) = 0.0015$$
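
The single training iteration worked through above can be reproduced in a few lines. This is a sketch for checking the arithmetic only: the weights, thresholds, input pattern and learning rate are the ones given in the example, while the variable names are chosen here.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# Initial weights and thresholds from the example
w13, w14, w23, w24, w35, w45 = 0.5, 0.9, 0.4, 1.0, -1.2, 1.1
theta3, theta4, theta5 = 0.8, -0.1, 0.3
alpha = 0.1                # learning rate
x1, x2, yd5 = 1, 1, 0      # training pattern and desired output

# Forward pass (Step 2)
y3 = sigmoid(x1 * w13 + x2 * w23 - theta3)
y4 = sigmoid(x1 * w14 + x2 * w24 - theta4)
y5 = sigmoid(y3 * w35 + y4 * w45 - theta5)
e = yd5 - y5

# Error gradients (Step 3); hidden gradients use the weights before updating
delta5 = y5 * (1 - y5) * e
delta3 = y3 * (1 - y3) * delta5 * w35
delta4 = y4 * (1 - y4) * delta5 * w45

# Weight and threshold corrections (thresholds see a fixed input of -1)
d_w35, d_w45, d_theta5 = alpha * y3 * delta5, alpha * y4 * delta5, alpha * (-1) * delta5
d_w13, d_w23, d_theta3 = alpha * x1 * delta3, alpha * x2 * delta3, alpha * (-1) * delta3
d_w14, d_w24, d_theta4 = alpha * x1 * delta4, alpha * x2 * delta4, alpha * (-1) * delta4

print(f"y3={y3:.4f} y4={y4:.4f} y5={y5:.4f} e={e:.4f}")
print(f"delta5={delta5:.4f} delta3={delta3:.4f} delta4={delta4:.4f}")
print(f"d_w35={d_w35:.4f} d_w45={d_w45:.4f} d_theta5={d_theta5:.4f}")
```

Because δ5 is negative and the threshold input is −1, the correction to θ5 comes out positive, so the threshold of the output neuron increases slightly after this step.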
The training process is repeated until the sum of squared errors is less than 0.001.

Learning curve for operation Exclusive-OR (Ref. Book: A Guide to Intelligent Systems, Matlab file)

[Figure: sum-squared network error for 224 epochs; sum-squared error on a logarithmic axis (10^1 down to 10^-4) versus epoch (0 to 200).]

The stop criterion of the process is defined by the mean squared error. The algorithm converges when the change in the mean squared error between two successive epochs is sufficiently small, that is, |MSE(current epoch) − MSE(previous epoch)| ≤ ε, where ε is the precision required for the convergence process.

Final results of three-layer network learning

Inputs      Desired output   Actual output   Error      Sum of squared errors
x1   x2     yd               y5              e
1    1      0                0.0155          −0.0155    0.0010
0    1      1                0.9849          0.0151
1    0      1                0.9849          0.0151
0    0      0                0.0175          −0.0175
Several variations of the backpropagation method have been proposed in order to enhance
the efficiency of its convergence. Among these variations, one can find the method that uses
the momentum parameter, resilient-propagation, and Levenberg-Marquardt methods.
For small values of the learning parameter η, this leads most often to a very slow convergence
rate of the algorithm. Larger learning parameters have been known to lead to unwanted
oscillations in the weight space and may even cause divergence of the algorithm. To avoid
these issues, researchers have devised a modified weight updating algorithm in which the
change of the weight of the upcoming iteration (at time t + 1) is made dependent on the
weight change of the current iteration (at time t).

Momentum

Momentum term: when the current solution (reflected by its weight matrices) is far from the
final solution (minimum point of the error function), the variation in the opposite direction of
the gradient of the squared error function between two successive iterations will be significant.
This implies that the difference between the error matrices of these two iterations will be
relevant and, in this case, it is possible to perform a bigger incremental step for weights in the
direction of the minimum of the error function. The momentum term is in charge of this task,
since it is responsible for measuring this variation. However, when the current solution is very
near to the final solution, the variations on the weight matrices will be small, since the variation
of the mean squared error between two successive iterations will be minor and, consequently,
the contribution of the momentum term for the convergence process will be slight. From this
moment on, all adjustments on the weight matrices are conducted (usually) only by the learning
term.

(Here α is the momentum rate; its value lies between 0 and 1.)


Generalized delta rule (weight correction with a momentum term):
$$\Delta w_{jk}(p) = \gamma \cdot \Delta w_{jk}(p-1) + \alpha \cdot y_j(p) \cdot \delta_k(p)$$
where γ is a positive number (0 ≤ γ < 1) called the momentum constant and α is the learning rate, as in the weight-correction formulas of the back-propagation algorithm. Typically, the momentum constant is set to 0.95.
The momentum term uses the previous weight (and gradient) information to adjust the motion of the current weight. The momentum term accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
Momentum is a value that is used to help push a network out of a local minimum.
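
A weight update with a momentum term can be sketched on a one-dimensional error function. Everything specific below is an assumption made for illustration: the error E(w) = (w − 3)², the starting weight, and the parameter names `learning_rate` and `momentum`; only the momentum value 0.95 is the typical setting mentioned above.

```python
def error_gradient(w):
    # Gradient of the hypothetical error function E(w) = (w - 3)^2.
    return 2.0 * (w - 3.0)

def train_with_momentum(grad, w0, learning_rate=0.1, momentum=0.95, n_steps=1000):
    """Each correction = momentum * previous correction - learning_rate * gradient."""
    w = w0
    previous_correction = 0.0
    for _ in range(n_steps):
        correction = momentum * previous_correction - learning_rate * grad(w)
        w = w + correction
        previous_correction = correction
    return w

w_final = train_with_momentum(error_gradient, w0=0.0)
print("weight after training:", w_final)
```

Setting `momentum=0.0` reduces the loop to plain gradient descent, which makes it easy to compare the two behaviours on the same error function.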

Adaptive learning rate: (Change steps)


Adaptive learning rates allow the training algorithm to
monitor the performance of the model and automatically
adjust the learning rate for the best performance.
The most basic model of this decreases the learning rate
once the performance of the model reaches a plateau. The
model does so by decreasing the learning rate by a factor
of two or an order of magnitude. However, the learning
rate can be increased again if the performance doesn’t
improve.
Learning with momentum for operation Exclusive-OR (learning rate constant): reduction in epochs

[Figure: sum-squared error (logarithmic axis, 10^1 down to 10^-4) versus epoch (0 to 200), comparing learning without momentum and learning with momentum; plot title "Sum-Squared Network Error for 224 Epochs".]

Ref. Book: Artificial Intelligence A Guide to Intelligent Systems (Matlab file)

[Figure: learning with adaptive learning rate (XOR problem), training for 103 epochs; panels: sum-squared error versus epoch and learning rate versus epoch.]
[Figure: learning with momentum and adaptive learning rate (XOR problem), training for 85 epochs; panels: sum-squared error versus epoch and learning rate versus epoch.]

The Levenberg-Marquardt algorithm is a second-order gradient method, based on the least squares method for nonlinear models, which can be incorporated into the backpropagation algorithm so as to enhance the efficiency of the training process.

Ref. Book
A Guide to Intelligent Sys.
Learning with adaptive learning rate

To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:
Heuristic 1
If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.
Heuristic 2
If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.
Adapting the learning rate requires some changes in the back-propagation algorithm.
If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated.
If the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).
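
The two adjustment rules can be written as a small helper. The ratio 1.04 and the factors 0.7 and 1.05 are the typical values quoted above; the function name and the returned "accept" flag are assumptions of this sketch, the flag being one way to signal that the weights and thresholds of the rejected epoch should be recalculated.

```python
def adapt_learning_rate(previous_sse, current_sse, learning_rate,
                        max_ratio=1.04, decrease=0.7, increase=1.05):
    """Return (new_learning_rate, accept_step) for one epoch."""
    if current_sse > previous_sse * max_ratio:
        # Error grew too much: shrink the rate and redo the update.
        return learning_rate * decrease, False
    if current_sse < previous_sse:
        # Error went down: speed up a little.
        return learning_rate * increase, True
    # Error grew only slightly (within the ratio): keep the rate unchanged.
    return learning_rate, True

print(adapt_learning_rate(1.00, 1.10, 0.5))  # error jumped: rate decreased, step rejected
print(adapt_learning_rate(1.00, 0.90, 0.5))  # error fell: rate increased
print(adapt_learning_rate(1.00, 1.02, 0.5))  # small rise: rate unchanged
```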

Artificial Intelligence A Guide to Intelligent Systems

MLP for pattern classification: an MLP with one hidden layer can map any pattern classification problem whose elements are within a convex region.
The classification of the samples in Fig. 5.17a would require two neurons in the hidden layer of the MLP, whereas the classification of the samples in Fig. 5.17b, c would require six and four neurons in the hidden layers of the MLP networks. From a geometric point of view, a region is considered convex if, and only if, all the points of any line segment defined between any pair of points from the domain are inside this region. Figure 5.18a presents an illustration of a convex region while Fig. 5.18b shows a non-convex region. Thus, considering that MLP networks with a single hidden layer can classify patterns placed within a convex region, it is possible to deduce that MLP networks with two hidden layers can classify patterns that are within any geometric region (Lui 1990; Lippmann 1987), including non-convex regions such as the one shown in the figure.

Fig. 5.17 panels: (a) 2 neurons in 1 hidden layer, (b) 6 neurons in 1 hidden layer, (c) 4 neurons in 1 hidden layer
Book: Artificial Neural Networks A Practical Course
Fig. 5.18 (a) (b) Illustration of a convex region and a non-convex region
Fig. 5.19 Decision boundaries of a pattern classification problem within a non-convex region

For such a case, the configuration of the MLP network shown in Fig. 5.20 represents a topology that can implement the pattern classification for the problem of Fig. 5.19. In this case, it is possible to consider, for instance, that neurons A, B and C of the first hidden layer are responsible for delimiting the left convex region (triangle), while neurons D, E and F are related to the right convex region (triangle). Neuron G of the second middle layer is responsible for combining the outputs of neurons A, B and C in order to represent the group that belongs to the left convex region, so that in this condition its output would be equal to 1. In a similar manner, neuron H responds to the right convex region. Neuron Y is responsible for performing the Boolean operation of disjunction (OR gate), because if one of the outputs produced by neurons G or H is equal to 1, its final response y should also be equal to 1.

Pattern classification: in the case of pattern classification problems with more than two classes, more neurons need to be inserted in the output layer of the network, because an MLP with a single neuron in its output layer can distinguish just two classes. As an example, an MLP composed of two neurons in its output layer could represent, at most, four classes (Fig.). A network with three neurons in the output layer could classify a total of eight classes. Generalizing this concept, an MLP with m neurons in its output layer would be able to classify, theoretically, up to 2^m classes.
Alternatively, one of the most used methods of codification is the "one of c-class", which consists in associating the output of each neuron directly to the class.
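
The two output codings can be contrasted in code. This is an illustrative sketch: the function names are made up here, and the binary coding assumes classes are numbered from 0.

```python
def binary_code(class_index, n_outputs):
    """Binary coding: m output neurons can label up to 2**m classes."""
    return [(class_index >> bit) & 1 for bit in reversed(range(n_outputs))]

def one_of_c(class_index, n_classes):
    """One-of-c coding: one output neuron per class, a single 1 marks the class."""
    return [1 if i == class_index else 0 for i in range(n_classes)]

def max_classes_binary(n_outputs):
    return 2 ** n_outputs

print("classes with 2 outputs:", max_classes_binary(2))
print("classes with 3 outputs:", max_classes_binary(3))
print("class 5 in binary code, 3 outputs:", binary_code(5, 3))
print("class 2 in one-of-c code, 4 classes:", one_of_c(2, 4))
```

One-of-c coding needs more output neurons than binary coding for the same number of classes, but each neuron's output maps directly to one class.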

Book: Artificial Neural Networks A Practical Course
Example: typical applications of MLP: pattern classification, character recognition
Recognition of digits from 0 to 9. In this application, each digit is represented by a 5 × 9 bit map, as shown in the figure. The number of neurons in the input layer is the number of pixels in the bit map. The bit map in our example consists of 45 pixels, and we need 45 input neurons. The output layer has 10 neurons (one for each digit to be recognised). (Where a better resolution is required, at least 16 × 16.)
Neurons in the hidden and output layers use a sigmoid activation function. The momentum constant is set to 0.95.

Number of neurons in hidden layer affects the accuracy of recognition and speed of learning:
Look at: digital _recognition.m and bit_map.m files Ref. Book A Guide to Intelligent Sys.

Application of feed-forward networks: character recognition

digit1 = [0 0 1 0 0
          0 1 1 0 0
          1 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0];

digit2 = [0 1 1 1 0
          1 0 0 0 1
          0 0 0 0 1
          0 0 0 0 1
          0 0 0 1 0
          0 0 1 0 0
          0 1 0 0 0
          1 0 0 0 0
          1 1 1 1 1];

s1 = 12;  % Number of neurons in the hidden layer
s2 = 10;  % Number of neurons in the output layer

net = newff(minmax(p), [s1 s2], {'logsig' 'purelin'}, 'traingdx');

Book: Artificial Intelligence A Guide to Intelligent Systems

p = 45 × 10 input matrix: each column holds the 45-element bit map of one digit (entries not listed in full here).

t = 10 × 10 target matrix (target output, one column per digit):
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1

Input/output data

Digit   Rows 1 to 9 of the bit map                                Desired pattern
1       00100 01100 10100 00100 00100 00100 00100 00100 00100     1000000000
2       01110 10001 00001 00001 00010 00100 01000 10000 11111     0100000000
3       …                                                         …
4       …                                                         …
5       …                                                         …
6       …                                                         …
7       …                                                         …
8       …                                                         …
9       …                                                         …
0       …                                                         0000000001

Improving the performance: training with noisy examples.
"Noise" is the distortion of the input patterns. This distortion can be created, for instance, by adding some small random values chosen from a normal distribution.
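
Adding normally distributed noise to a bit map can be sketched as follows. The noise level `sigma`, the seed, and the helper name `add_noise` are assumptions for illustration; the sample row is the first row of the digit 1 bit map.

```python
import random

def add_noise(pattern, sigma, rng):
    """Return a distorted copy: each pixel plus a sample from N(0, sigma^2)."""
    return [pixel + rng.gauss(0.0, sigma) for pixel in pattern]

first_row_of_digit1 = [0, 0, 1, 0, 0]
rng = random.Random(0)  # fixed seed so the run is repeatable

noisy_row = add_noise(first_row_of_digit1, sigma=0.1, rng=rng)
print("original:", first_row_of_digit1)
print("noisy:   ", [round(v, 3) for v in noisy_row])
```

Training on several noisy copies of each digit alongside the clean patterns is one way to apply the idea of training with noisy examples.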
Learning curve of digit recognition, 3-layer NN (effect of the number of neurons in the hidden layer)
Performance evaluation of digit recognition: performance is measured by the sum of squared errors (SSE). We will examine the system's performance with 2, 5, 10 and 20 hidden neurons and compare results.

Overfitting problem: a state in which an ANN has memorised all the training examples but cannot generalise. Overfitting may occur if the number of hidden layer neurons is too big. To prevent overfitting, it is better to choose the smallest number of hidden neurons that still gives adequate performance. In this example 2, 5, 10 and 20 neurons are selected. The results show that there is no significant difference between the networks with 10 and 20 neurons.

Ref. Book A Guide to Intelligent Sys.

Stopping criteria in learning

• Sensible stopping criteria:
– Total mean squared error change: back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
– Generalization-based criterion: after each epoch the NN is tested for generalization using a different set of examples (validation set). If the generalization performance is adequate, then stop.
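
Both criteria can be checked at the end of every epoch. The sketch below is one possible reading, with assumed names and thresholds: `epsilon` is the tolerated change in training error, and "adequate generalization" is interpreted as the validation error falling below a chosen `target_validation_mse`.

```python
def should_stop(train_mse_history, validation_mse_history,
                epsilon=1e-4, target_validation_mse=0.01):
    """Return True when either stopping criterion is met."""
    # Criterion 1: the training error barely changed between two epochs.
    if len(train_mse_history) >= 2:
        change = abs(train_mse_history[-1] - train_mse_history[-2])
        if change <= epsilon:
            return True
    # Criterion 2: performance on the validation set is adequate.
    if validation_mse_history and validation_mse_history[-1] <= target_validation_mse:
        return True
    return False

print(should_stop([0.50, 0.49995], [0.30]))  # tiny change in training error
print(should_stop([0.50, 0.40], [0.005]))    # validation error low enough
print(should_stop([0.50, 0.40], [0.30]))     # keep training
```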

• Network topology: the number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error, i.e. the topology is usually determined by experimentation.
• Two types of adaptive algorithms can be used (neurons in the hidden layer):
– start from a large network and successively remove some neurons and links until network performance degrades;
– begin with a small network and introduce new neurons until performance is satisfactory.
• Too many hidden neurons: the network will memorise the training set and will not generalise well. Too few: there is a risk that the network may not be able to learn the pattern in the training set.
• It is often necessary to train the network with different learning rates to find the optimum value for the problem under investigation.
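
The two stopping criteria described above can be sketched as a training loop that checks both after every epoch. This is only an illustration under stated assumptions: the model is a simple linear unit trained by batch gradient descent, the data are synthetic, the "rate of change" is interpreted as the relative change of the training MSE per epoch, and the thresholds `tol` and `valGoal` are hypothetical.

```matlab
% Sketch: stop training on a small MSE change or on adequate validation error.
x = linspace(0, 1, 40);  d = 2*x + 1;      % synthetic data (assumed)
xtr = x(1:2:end);  dtr = d(1:2:end);       % training set
xva = x(2:2:end);  dva = d(2:2:end);       % validation set

w = [0; 0];            % [slope; bias] of the linear unit
lr = 0.5;              % learning rate (assumed)
tol = 0.01;            % relative MSE change per epoch regarded as "small"
valGoal = 1e-3;        % validation MSE regarded as "adequate" (assumed)
maxEpochs = 5000;
prevMse = Inf;
reason = 'max epochs';

for epoch = 1:maxEpochs
    e = dtr - (w(1)*xtr + w(2));                 % training errors
    w = w + lr * [mean(e .* xtr); mean(e)];      % batch gradient step
    mseTr = mean((dtr - (w(1)*xtr + w(2))).^2);  % training MSE
    mseVa = mean((dva - (w(1)*xva + w(2))).^2);  % validation MSE
    if mseVa < valGoal                           % generalization criterion
        reason = 'validation goal';  break;
    end
    if abs(prevMse - mseTr) / mseTr < tol        % MSE-change criterion
        reason = 'small MSE change';  break;
    end
    prevMse = mseTr;
end
fprintf('Stopped after %d epochs (%s)\n', epoch, reason);
```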
Other optimization algorithms
The example shows that a combination of four hidden neurons can produce a "bump" in the output space of a two-layered MLP.
A large number of these "bumps" can approximate any surface arbitrarily well [Duda, Hart and Stork, 2001].
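
A minimal sketch of how four sigmoidal hidden neurons can form such a bump over a two-dimensional input is given here. The steepness `k`, the ridge positions at ±1, the output weights and the sigmoidal output unit are all assumptions made for illustration, not values from the cited figure.

```matlab
% Sketch: a "bump" built from four logistic hidden neurons.
sig = @(v) 1 ./ (1 + exp(-v));          % logistic activation
k = 5;                                  % steepness of the hidden units (assumed)
[X, Y] = meshgrid(-4:0.2:4, -4:0.2:4);  % 2-D input grid

h1 = sig(k*(X + 1));  h2 = sig(k*(X - 1));  % pair forming a ridge along Y
h3 = sig(k*(Y + 1));  h4 = sig(k*(Y - 1));  % pair forming a ridge along X

s = (h1 - h2) + (h3 - h4);   % weighted sum: about 2 where the ridges cross
Z = sig(10*(s - 1.5));       % output unit keeps only the crossing region

% surf(X, Y, Z)              % uncomment to view the bump
```

The output `Z` is close to 1 only near the origin, where both ridges overlap, and close to 0 elsewhere. Shifting and scaling many such bumps is the intuition behind approximating an arbitrary surface.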

Other Learning Methods (MLP):


• Other methods that can be used to find the weights of an MLP include:
– Conjugate gradient method
– Levenberg-Marquardt method (the most widely used)
– Quasi-Newton method
– Least-squares Kalman filtering method
Levenberg-Marquardt method: the Levenberg-Marquardt algorithm is a second-order gradient method, based on the least squares method for nonlinear models, which can be incorporated into the backpropagation algorithm to enhance the efficiency of the training process.
The Levenberg-Marquardt method uses an approximation of the Hessian matrix (the matrix of second-order derivatives), built from the Jacobian of the network errors.
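
For reference, the weight update usually associated with the Levenberg-Marquardt method can be written as follows. The notation is an assumption, since the slides do not define it: $\mathbf{e}$ is the vector of errors (desired minus actual output) over all training patterns, $\mathbf{J}$ is the Jacobian of $\mathbf{e}$ with respect to the weights $\mathbf{w}$, and $\mu > 0$ is a damping parameter.

```latex
\mathbf{H} \approx \mathbf{J}^{\top}\mathbf{J},
\qquad
\Delta\mathbf{w} = -\left(\mathbf{J}^{\top}\mathbf{J} + \mu\,\mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{e}
```

For large $\mu$ the step behaves like gradient descent with a small step size; for small $\mu$ it approaches the Gauss-Newton step.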

Linear regression or curve fitting: minimum mean square error

Mean squared error: in statistics and signal processing, a minimum mean square error (MMSE) estimator is an estimation method which minimizes the mean square error (MSE) of the fitted values of a dependent variable. The MSE is a common measure of estimator quality.
The MSE is used to check how close estimates or forecasts are to actual values: the lower the MSE, the closer the forecast is to the actual value. The MSE is perhaps the simplest and most common loss function.

The LMS algorithm adjusts the weights and biases of the linear network to minimize this mean square error (the MSE function).
The MSE is a measure of the quality of an estimator or predictor.
The MSE is never negative and is almost always strictly positive; it is zero only for a perfect fit.

A linear model for learning the parameters of
y = ax + b
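
As a minimal sketch of LMS learning for this linear model, the parameters a and b can be adapted sample by sample so that the MSE decreases. The synthetic data (true values a = 2, b = 1), the learning rate and the number of epochs are assumptions chosen for illustration.

```matlab
% Sketch: LMS (Widrow-Hoff) learning of a and b in y = a*x + b.
x = linspace(0, 1, 50);       % inputs (assumed data)
d = 2*x + 1;                  % desired outputs, generated with a = 2, b = 1

a = 0;  b = 0;                % initial parameters
lr = 0.1;                     % learning rate (assumed)

for epoch = 1:500
    for i = 1:numel(x)
        e = d(i) - (a*x(i) + b);   % error for the current sample
        a = a + lr * e * x(i);     % LMS update of the weight
        b = b + lr * e;            % LMS update of the bias
    end
end

mseFinal = mean((d - (a*x + b)).^2);   % MSE over the whole data set
fprintf('a = %.4f, b = %.4f, MSE = %.2e\n', a, b, mseFinal);
```

With noise-free data the estimates approach the generating values. With noisy data the MSE would settle at a positive value set by the noise level.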
Universal curve fitting (function approximation): similarly to pattern classification problems, an MLP with a single hidden layer using the logistic activation function is sufficient for mapping any nonlinear continuous function.

Book: Artificial Neural Networks: A Practical Course

[Figure: illustration of the superposition of logistic activation functions for curve fitting]

Example : Application of MLP for function approximation (cont.)

Network structure: hidden layer with logsig activation, output node linear (purelin).

Data generation:
x = 0:0.01:4;
y = (sin(2*pi*x)+1).*exp(-x.^2);

MATLAB commands: create a 2-layer network
PR = [min(x) max(x)];   % Range of inputs
S1 = 6;                 % No. of nodes in Layer 1
S2 = 1;                 % No. of nodes in Layer 2
TF1 = 'logsig';         % Activation function of Layer 1
TF2 = 'purelin';        % Activation function of Layer 2
BTF = 'trainlm';        % Training function
BLF = 'learngd';        % Learning function
PF = 'mse';             % Cost function
net = newff(PR,[S1 S2],{TF1 TF2},BTF,BLF,PF);   % Command for creating the network

[Figure: function to be approximated; output y versus input x for x from 0 to 4]
Example: Application of MLP for function approximation: effect of the hidden layer (3 and 6 neurons in the hidden layer)

[Figures: desired output and network output versus input x (0 to 4), for 3 hidden neurons and for 6 hidden neurons]

3 hidden neurons: the number of hidden nodes is too small!
6 hidden neurons: the weighted summation of all outputs from the first-layer nodes yields the function approximation.
Compare the performance of the two networks.

Effect of hidden nodes on function approximation (Example 5.2: effect of the hidden layer)

To illustrate the effects of the number of hidden neurons on the approximation capabilities of the MLP, we use here the simple function f(x) given by: f(x) = x sin(x)

Given a sufficiently large number of hidden neurons, a two-layer MLP can approximate any continuous function arbitrarily well.

Book: Soft Computing and Intelligent Systems Design


Example 5.3: Effect of training patterns (samples) on function approximation
The same function was approximated with a network with a fixed number of nodes (taken as five here), but with a variable number of training patterns.

A higher number of training patterns provided better results, as shown in Figure (f).

Matlab:New

Book:Soft Computing and Intelligent Systems Design

% Example 5
x = 0:.05:2; y = humps(x);
P = x; T = y;
% Plot the data
plot(P,T,'x')
grid; xlabel('time (s)'); ylabel('output'); title('humps function')
% DESIGN THE NETWORK
% Define network. First try a simple one.
% 'trainlm' selects the Levenberg-Marquardt optimization method.
% The input range [0 2] matches the data x = 0:.05:2.
net = newff([0 2], [20,1], {'tansig','purelin'}, 'trainlm');
% Define parameters
net.trainParam.show = 50;
net.trainParam.lr = 0.05;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-4;
% Train network
net1 = train(net, P, T);
% Simulate result
a = sim(net1,P);
% Plot result and compare
plot(P,a,P,a-T)
xlabel('time in secs'); ylabel('Network output and error');
title('Humps function'); grid
% First try a simple one: a feedforward (multilayer perceptron) network
net = newff([0 2], [5,1], {'tansig','purelin'}, 'traingdm'); % alternative: 'traingd'
% Here newff defines the feedforward network architecture.
% The first argument [0 2] defines the range of the input and initializes the
% network parameters.
% The second argument defines the structure of the network. There are two layers:
% 5 is the number of nodes in the first hidden layer,
% 1 is the number of nodes in the output layer.
% Next the activation functions in the layers are defined.
% In the first hidden layer there are 5 tansig functions.
% In the output layer there is 1 linear function.
% 'traingdm' selects the training scheme: gradient descent with momentum
% ('traingd' is the basic gradient descent method).
% Define learning parameters
net.trainParam.show = 50;      % The result is shown at every 50th iteration (epoch)
net.trainParam.lr = 0.05;      % Learning rate used in some gradient schemes
net.trainParam.epochs = 1000;  % Max number of iterations
net.trainParam.goal = 1e-3;    % Error tolerance; stopping criterion
% Train network
net1 = train(net, P, T);       % Iterates gradient type of loop
% Test
a = sim(net1,P);
% Plot result and compare
plot(P,a-T, P,T); grid;
xlabel('Time (s)'); ylabel('Output of network and error'); title('Humps function');
