
Eem520l3 2023

The document discusses neural networks and multilayer perceptrons (MLPs). It covers topics like: - MLPs contain hidden layers and use activation functions like sigmoid, tanh. - Error-based learning and gradient descent are used to minimize error during training. - The gradient vector and Hessian matrix are important for optimization. - Numerical optimization methods are iterative processes for finding minimum error.

Uploaded by

Berkan Tezcan


EEM520/620 (Lecture 3): Neural networks

Multilayer perceptron (MLP), nonlinear classification and function approximation

Gradient descent optimization rule; error minimization as an optimization process
Combining output neurons by a linear function (effect of the hidden layer and activation function)

A multilayer perceptron (MLP) contains hidden layers and different activation functions.
Nonlinear classification and approximation (differentiable functions).
Activation functions for MLP: the sigmoid or the hyperbolic tangent are commonly used.
Sign activation: its non-differentiability prevents its use for creating the loss function at training time. In MLP implementations the hard-limiter function is usually replaced by a smooth nonlinear activation function (error-based learning).

feed-forward network

MLP with two hidden layers: Feedforward neural networks are the
models used most often for solving nonlinear classification and
regression tasks by learning from data.
Activation function selection for MLP

In recent years, however, a number of piecewise linear activation functions have become more popular:

The ReLU and hard tanh activation functions have substantially replaced the sigmoid and soft tanh activation functions in modern neural networks.

The ReLU is an activation function that maps any number to zero if it is negative, and otherwise maps it to itself. The ReLU function has been found to be very good for networks with many layers because it helps to prevent vanishing gradients when training deep networks.

The tanh function is preferable to the sigmoid when the outputs of the computations are
desired to be both positive and negative. The sigmoid and the tanh functions have been the
historical tools of choice for incorporating nonlinearity in the neural network
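
The four activation functions named above can be compared side by side. The following is only a minimal sketch in plain Python; the function names are chosen here for illustration and do not come from the lecture material.

```python
import math

def sigmoid(x):
    """Logistic sigmoid: smooth, output in (0, 1)."""
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    """Hyperbolic tangent: smooth, output in (-1, 1)."""
    return math.tanh(x)

def relu(x):
    """ReLU: zero for negative inputs, identity otherwise."""
    return max(0.0, x)

def hard_tanh(x):
    """Hard tanh: piecewise linear, clips the input to [-1, 1]."""
    return max(-1.0, min(1.0, x))

if __name__ == "__main__":
    for x in (-2.0, -0.5, 0.0, 0.5, 2.0):
        print(x, sigmoid(x), tanh(x), relu(x), hard_tanh(x))
```

The sigmoid output is always positive, while tanh and hard tanh are symmetric around zero, which is the property referred to above when outputs of both signs are desired.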

Learning in an MLP is similar to the learning algorithm of Adaline (error-based learning: the learning process minimizes an error function, so it is an optimization approach). The optimization algorithm is called "gradient descent", where "gradient" refers to the calculation of an error gradient, or slope of the error, and "descent" refers to moving down along that slope towards some minimum level of error. The algorithm is iterative. The weight update procedure is based on the backpropagation update algorithm (general approach).
What is gradient descent? Gradient descent is an optimization algorithm often used for finding the weights or coefficients of machine learning algorithms, such as artificial neural networks and logistic regression. Optimization is a big part of machine learning.
Consequently, assuming a training set composed of p
samples, the measurement of the global performance of the
backpropagation algorithm can be calculated through the
“mean squared error” defined by
Gradient descent (GD) is an iterative first-order optimisation algorithm used to find a local minimum of a given function (stepping along the gradient instead of against it finds a local maximum). The term "gradient" is typically used for functions with several inputs and a single output (a scalar field). A line can be said to have a gradient (its slope), but using "gradient" for single-variable functions is unnecessarily confusing. "A gradient measures how much the output of a function changes if you change the inputs a little bit." (Lex Fridman, MIT)
Gradient descent uses the negative of the gradient of the function at the current point, because the gradient points in the direction of greatest increase. The gradient is represented by the symbol ∇ (nabla).

The GD algorithm does not work for all functions. There are two specific requirements. A function has to be:
differentiable and convex (GD can still be run on a non-convex function, but it may then stop at a local minimum).
GD iteratively calculates the next point using the gradient at the current position, scales it (by a learning rate) and subtracts the obtained value from the current position (makes a step).

Error-based learning is an optimization problem: gradient vector (for a multivariable function)

The gradient vector has several properties. The most important property is that the gradient at a point x points in the direction of maximum increase in the cost function.
Geometrically, the gradient vector is normal to the tangent plane at the point x*, as shown in Fig. for a function of three variables. These properties will be used in developing optimality conditions and numerical methods for optimum design.
The matrix of second-order partial derivatives is called the Hessian matrix. It is a symmetric matrix.
EXAMPLE: Calculation of gradient vector. Calculate the gradient vector for the function at the point x* = (1.8, 1.6).

Example 1: a quadratic function

For this function, by taking a learning rate of 0.1 and a starting point at x = 9, we can easily calculate each step by hand. Let's do it for the first 3 steps:
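
The quadratic itself did not survive in these notes, so the sketch below assumes a stand-in, f(x) = x² − 4x + 1 (a hypothetical choice, used only to make the arithmetic concrete). With the stated learning rate 0.1 and starting point x = 9, each step subtracts 0.1 times the derivative 2x − 4 from the current point.

```python
def gradient_descent(grad, x0, learning_rate, n_steps):
    """Return the list of points visited, starting with x0."""
    points = [x0]
    for _ in range(n_steps):
        x = points[-1]
        # Step against the gradient, scaled by the learning rate.
        points.append(x - learning_rate * grad(x))
    return points

# ASSUMPTION: the lecture's function is not given; f(x) = x^2 - 4x + 1 is used
# here purely as an illustration. Its derivative is f'(x) = 2x - 4.
def f_prime(x):
    return 2.0 * x - 4.0

path = gradient_descent(f_prime, x0=9.0, learning_rate=0.1, n_steps=3)

if __name__ == "__main__":
    for k, x in enumerate(path):
        print(k, round(x, 4))
```

For this assumed function the first three steps move from 9 to 7.6, then 6.48, then 5.584, approaching the minimum at x = 2.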

Figure: gradient vector for the function f(x) at the point (1.8, 1.6).

Hessian matrix (the Hessian is an n × n matrix): used for second-order derivatives.

Differentiating the gradient vector once again, we obtain a matrix of second partial derivatives for the function f(x) called the Hessian matrix, or simply the Hessian. When f is twice continuously differentiable, the order of differentiation can be interchanged, $\partial^2 f / \partial x_i \partial x_j = \partial^2 f / \partial x_j \partial x_i$; therefore, the Hessian is always a symmetric matrix. (Used by the Levenberg-Marquardt optimization algorithm.)
EXAMPLE: Evaluation of gradient and Hessian of a function. For the following function, calculate the gradient vector and the Hessian matrix at the point (1, 2):

Substituting the point x1 = 1, x2 = 2, the gradient vector is given as: c = (7, 27).
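
The function for this example is missing from the notes. As a hedged illustration, the sketch below assumes f(x1, x2) = x1³ + x2³ + 2x1² + 3x2² − x1x2 + 2x1 + 4x2, an assumption chosen because its gradient at (1, 2) reproduces the stated result c = (7, 27). The analytic gradient and Hessian are written out by hand and the gradient is cross-checked with central differences.

```python
# ASSUMPTION: this function is not given in the notes; it is chosen only
# because its gradient at (1, 2) equals the stated value (7, 27).
def f(x1, x2):
    return x1**3 + x2**3 + 2*x1**2 + 3*x2**2 - x1*x2 + 2*x1 + 4*x2

def gradient(x1, x2):
    """First partial derivatives of f."""
    return (3*x1**2 + 4*x1 - x2 + 2,
            3*x2**2 + 6*x2 - x1 + 4)

def hessian(x1, x2):
    """Second partial derivatives of f (symmetric 2 x 2 matrix)."""
    return ((6*x1 + 4, -1),
            (-1, 6*x2 + 6))

def numerical_gradient(func, x1, x2, h=1e-6):
    """Central-difference approximation, used only as a cross-check."""
    d1 = (func(x1 + h, x2) - func(x1 - h, x2)) / (2 * h)
    d2 = (func(x1, x2 + h) - func(x1, x2 - h)) / (2 * h)
    return (d1, d2)

if __name__ == "__main__":
    print(gradient(1, 2))
    print(hessian(1, 2))
    print(numerical_gradient(f, 1.0, 2.0))
```

Under this assumption the Hessian at (1, 2) has equal off-diagonal entries, which illustrates the symmetry property stated above.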
Numerical optimization (related to the weight update)
A general algorithm: many numerical solution methods are described by the following iterative prescription:
Vector form: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \Delta\mathbf{x}^{(k)}, \quad k = 0, 1, 2, \ldots$
Component form: $x_i^{(k+1)} = x_i^{(k)} + \Delta x_i^{(k)}, \quad i = 1, \ldots, n; \quad k = 0, 1, 2, \ldots$

In these equations, the superscript k represents the iteration number, the subscript i denotes the design variable number, x(0) is a starting point, and Δx(k) is a change in the current point.
The iterative formula is applicable to constrained as well as unconstrained problems. For unconstrained problems, calculations for Δx(k) depend on the cost function and its derivatives at the current design point. For constrained problems, the constraints must also be considered while computing the change in design Δx(k).
Error-based learning, or weight update, is an optimization problem.

Numerical methods for nonlinear optimization problems are needed because the
analytical methods for solving some of the problems are too cumbersome to use.

The iterative process is summarized as a general algorithm that is applicable to

both constrained and unconstrained problems:

Step 1. Estimate a reasonable starting design x(0). Set the iteration counter k = 0.
Step 2. Compute a search direction d(k) in the design space. This calculation generally requires a cost function value and its gradient for unconstrained problems and, in addition, constraint functions and their gradients for constrained problems.
Step 3. Check for convergence of the algorithm. If it has converged, stop; otherwise, continue.
Step 4. Calculate a positive step size α_k in the direction d(k).
Step 5. Update the design as follows, set k = k + 1 and go to Step 2: $\mathbf{x}^{(k+1)} = \mathbf{x}^{(k)} + \alpha_k \mathbf{d}^{(k)}$
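
The five steps can be sketched for the unconstrained case as follows. This is one possible instance only: the search direction is taken as the negative gradient (steepest descent) and the step size comes from a simple backtracking rule, both of which are choices made here rather than something the notes prescribe. The test function and all names are hypothetical.

```python
import math

def minimize(f, grad, x0, tol=1e-6, max_iter=1000):
    """General iterative algorithm, steepest-descent variant (a sketch)."""
    x = list(x0)                                    # Step 1: starting design, k = 0
    for k in range(max_iter):
        g = grad(x)
        d = [-gi for gi in g]                       # Step 2: search direction
        if math.sqrt(sum(gi * gi for gi in g)) < tol:
            return x, k                             # Step 3: converged
        # Step 4: positive step size by backtracking (halve until f decreases enough)
        alpha = 1.0
        fx = f(x)
        slope = sum(gi * di for gi, di in zip(g, d))
        while alpha > 1e-12 and \
                f([xi + alpha * di for xi, di in zip(x, d)]) > fx + 1e-4 * alpha * slope:
            alpha *= 0.5
        x = [xi + alpha * di for xi, di in zip(x, d)]  # Step 5: update the design
    return x, max_iter

# Hypothetical test problem: f(x) = (x1 - 1)^2 + 2 (x2 + 0.5)^2, minimum at (1, -0.5)
def cost(x):
    return (x[0] - 1.0) ** 2 + 2.0 * (x[1] + 0.5) ** 2

def cost_grad(x):
    return [2.0 * (x[0] - 1.0), 4.0 * (x[1] + 0.5)]

solution, iterations = minimize(cost, cost_grad, [0.0, 0.0])
```

Swapping the direction in Step 2 or the step-size rule in Step 4 gives other members of the same family of methods.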

Descent direction and descent step

There are several methods for calculating the step size α_k and the search direction d(k). Algorithms for calculating α_k are called line search methods.

Gradient-based methods compute both a direction p_k and a step length α_k at each iteration k.
Error-based learning is an optimization problem: gradient vector
Learning of an ANN can be viewed as a nonlinear optimization problem of finding a set of network parameters (weights) that minimize the cost function (error) for given examples. The error function contains the activation function.

O: desired output; d: the output of the neuron (includes the activation function); (O − d): difference, i.e. the error
Chain rule
Note: the mean square error is a quadratic function
Gradient of the error
Weight update
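
The equations that belong to the labels above are not reproduced in these notes. As a hedged reconstruction for a single neuron, using the symbols defined here (O desired output, d neuron output), and assuming a squared-error cost with a factor of 1/2, an activation function f, a net input written net, and a learning rate η (all four are assumptions introduced for this sketch), the chain rule gives:

```latex
E = \tfrac{1}{2}\,(O - d)^2, \qquad d = f(\mathrm{net}), \qquad \mathrm{net} = \sum_i w_i x_i

\frac{\partial E}{\partial w_i}
  = \frac{\partial E}{\partial d}\,
    \frac{\partial d}{\partial \mathrm{net}}\,
    \frac{\partial \mathrm{net}}{\partial w_i}
  = -(O - d)\, f'(\mathrm{net})\, x_i

w_i \leftarrow w_i - \eta\,\frac{\partial E}{\partial w_i}
  = w_i + \eta\,(O - d)\, f'(\mathrm{net})\, x_i
```

For a sigmoid activation, f'(net) = d(1 − d), so the correction is proportional to the error multiplied by d(1 − d) and by the input x_i.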

Types of Gradient Descent

BATCH GRADIENT DESCENT (offline): calculates the error for each example within the training dataset and updates the parameters only after all examples have been evaluated; one such pass is called a training epoch.
STOCHASTIC GRADIENT DESCENT (online): updates the parameters for each training example, one by one.
MINI-BATCH GRADIENT DESCENT
Mini-batch gradient descent is the go-to method since it’s a combination of the concepts of
SGD and batch gradient descent. It simply splits the training dataset into small batches and
performs an update for each of those batches.
What Is a Batch?
The batch size is a hyperparameter that defines the number of samples to work through
before updating the internal model parameters.
A training dataset can be divided into one or more batches.

Batch Gradient Descent. Batch Size = Size of Training Set

Stochastic Gradient Descent. Batch Size = 1
Mini-Batch Gradient Descent. 1 < Batch Size < Size of Training Set

What Is an Epoch?
The number of epochs is a hyperparameter that defines the number times that the learning
algorithm will work through the entire training dataset.
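
The relation between batch size, parameter updates and epochs can be made concrete with a small sketch. The helper names are hypothetical, and the samples are taken in their stored order (real training code usually shuffles them each epoch).

```python
def iterate_minibatches(n_samples, batch_size):
    """Yield lists of sample indices; each list corresponds to one parameter update."""
    for start in range(0, n_samples, batch_size):
        yield list(range(start, min(start + batch_size, n_samples)))

def updates_per_epoch(n_samples, batch_size):
    """Number of parameter updates made during one pass over the training set."""
    return len(list(iterate_minibatches(n_samples, batch_size)))

if __name__ == "__main__":
    n = 10
    print("batch GD:     ", updates_per_epoch(n, n))   # batch size = training set size
    print("stochastic GD:", updates_per_epoch(n, 1))   # batch size = 1
    print("mini-batch GD:", updates_per_epoch(n, 4))   # 1 < batch size < n
```

With 10 training samples, one epoch therefore means 1 update for batch gradient descent, 10 updates for stochastic gradient descent, and 3 updates for a mini-batch size of 4 (the last batch holds only 2 samples).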

If the gradient is positive, then we decrease the weights; and conversely, if the gradient is
negative, then we increase them.
Mini-batch gradient descent (1 < batch size < size of training set) works on a subset of the offline training set.

Ec: offline error (all the training patterns are presented to the system at once)
E(k): online error (training is made pattern by pattern: stochastic/online)
q: total number of output layer neurons
Index i represents the i-th neuron of the output layer
n is the number of training patterns
Back-propagation neural network (learning algorithm for MLP)
Learning in a multilayer network proceeds the same way as for a perceptron. A training set of input patterns is presented to the network. The network computes its output pattern, and if there is an error, or in other words a difference between actual and desired output patterns, the weights are adjusted to reduce this error. The backpropagation algorithm for training multilayer neural networks is a generalization of the LMS training procedure (Adaline). In a
back-propagation neural network, the learning algorithm has two phases. First, a training
input pattern is presented to the network input layer. The network propagates the input
pattern from layer to layer until the output pattern is generated by the output layer. If this
pattern is different from the desired output, an error is calculated and then propagated
backwards through the network from the output layer to the input layer. The weights are
modified as the error is propagated. Back-propagation is an automatic differentiation
algorithm for calculating gradients for the weights in a neural network graph structure.
Stochastic gradient descent and the back-propagation of error algorithms together are used
to train neural network models.
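Figure: three-layer back-propagation neural network. Input signals x1, x2, …, xi, …, xn enter the input layer (index i), pass through weights w_ij to the hidden layer (index j, m neurons) and through weights w_jk to the output layer (index k, l neurons), giving outputs y1, y2, …, yk, …, yl. Error signals propagate in the opposite direction.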

Features of backpropagation:
It is the gradient descent method as used in the case of a simple perceptron network with differentiable units.
It differs from other networks in the process by which the weights are calculated during the learning period of the network.
Training is done in three stages:
the feed-forward of the input training pattern
the calculation and backpropagation of the error
the updating of the weights
Backpropagation is short for "backward propagation of errors".

The negative sign is assigned to the gradient to force the weight change to move downhill along the error surface in the weight space.
Back-Propagation Algorithm
Back-propagation, also called “backpropagation,” or simply “backprop,” is an algorithm
for calculating the gradient of a loss function with respect to variables of a model

The name “Backpropagation” literally comes from “propagating the errors back to the
network”. By propagating the errors backwards through the network, the partial
derivative of the gradient of the last layer (closest layer to the output layer) is used to
calculate the gradient of the second to the last layer. The propagation of errors through
the layers and the utilization of the partial derivative of the gradient from a previous
layer in the current layer occurs until the first layer i.e. layer closest to the input layer.
Learning algorithm: backpropagation (calculation in first layer)

Figure: three-layer back-propagation neural network (input layer, hidden layer, output layer; weights w_ij and w_jk; input signals travel forward, error signals travel backward).

Backpropagation: Error propagation

The back-propagation training algorithm
Backpropagation = chain rule + dynamic programming
Step 1: Initialisation
Set the initial weights and threshold levels of the network randomly.

Step 2: Activation
Activate the back-propagation neural network by applying inputs x1(p), x2(p), …, xn(p) and desired outputs yd,1(p), yd,2(p), …, yd,n(p).

(a) Calculate the actual outputs of the neurons in the hidden layer:
$y_j(p) = \mathrm{sigmoid}\left[\sum_{i=1}^{n} x_i(p)\, w_{ij}(p) - \theta_j\right]$
where n is the number of inputs of neuron j in the hidden layer, and sigmoid is the sigmoid activation function, $f(x) = S(x) = \dfrac{1}{1 + e^{-x}}$.

(b) Calculate the actual outputs of the neurons in the output layer:
$y_k(p) = \mathrm{sigmoid}\left[\sum_{j=1}^{m} x_{jk}(p)\, w_{jk}(p) - \theta_k\right]$
where m is the number of inputs of neuron k in the output layer.

Ref. Book: A Guide to Intelligent Systems

Step 3: Weight training
Update the weights in the back-propagation network by propagating backward the errors associated with the output neurons.
(a) Calculate the error gradient for the neurons in the output layer:
$\delta_k(p) = y_k(p)\,[1 - y_k(p)]\, e_k(p)$, where $e_k(p) = y_{d,k}(p) - y_k(p)$
The factor $y_k(1 - y_k)$ is the derivative of the sigmoid: $S'(x) = S(x)\,(1 - S(x))$.
Calculate the weight corrections: $\Delta w_{jk}(p) = \alpha \cdot y_j(p) \cdot \delta_k(p)$
Update the weights at the output neurons: $w_{jk}(p+1) = w_{jk}(p) + \Delta w_{jk}(p)$
(b) Calculate the error gradient for the neurons in the hidden layer:
$\delta_j(p) = y_j(p)\,[1 - y_j(p)] \cdot \sum_{k=1}^{l} \delta_k(p)\, w_{jk}(p)$
Calculate the weight corrections: $\Delta w_{ij}(p) = \alpha \cdot x_i(p) \cdot \delta_j(p)$
Update the weights at the hidden neurons: $w_{ij}(p+1) = w_{ij}(p) + \Delta w_{ij}(p)$

Step 4: Iteration. Increase iteration p by one, go back to Step 2 and repeat the process until the selected error criterion is satisfied.
Example application for the XOR function: a three-layer back-propagation network is considered for the logical operation Exclusive-OR. Recall that a single-layer perceptron could not do this operation. Now we will apply the three-layer net.

Figure: three-layer network for the XOR function. Inputs x1 and x2 (input-layer neurons 1 and 2) feed hidden neurons 3 and 4 through weights w13, w14, w23 and w24; the hidden neurons feed output neuron 5 through weights w35 and w45, giving y5. The thresholds θ3, θ4 and θ5 are each connected to a fixed input of −1.

The initial weights and threshold levels are set randomly as follows:
w13 = 0.5, w14 = 0.9, w23 = 0.4, w24 = 1.0, w35 = −1.2, w45 = 1.1, θ3 = 0.8, θ4 = −0.1 and θ5 = 0.3.
Ref. Book: A Guide to Intelligent Systems
We consider a training set where inputs x1 and x2 are equal to 1 and desired output yd,5 is 0. The actual outputs of neurons 3 and 4 in the hidden layer are calculated as

$y_3 = \mathrm{sigmoid}(x_1 w_{13} + x_2 w_{23} - \theta_3) = 1 / [1 + e^{-(1 \cdot 0.5 + 1 \cdot 0.4 - 1 \cdot 0.8)}] = 0.5250$
$y_4 = \mathrm{sigmoid}(x_1 w_{14} + x_2 w_{24} - \theta_4) = 1 / [1 + e^{-(1 \cdot 0.9 + 1 \cdot 1.0 + 1 \cdot 0.1)}] = 0.8808$

Now the actual output of neuron 5 in the output layer is determined as:
$y_5 = \mathrm{sigmoid}(y_3 w_{35} + y_4 w_{45} - \theta_5) = 1 / [1 + e^{-(-0.5250 \cdot 1.2 + 0.8808 \cdot 1.1 - 1 \cdot 0.3)}] = 0.5097$
Thus, the following error is obtained: $e = y_{d,5} - y_5 = 0 - 0.5097 = -0.5097$

The next step is weight training. We propagate the error, e, from the output layer backward to the input layer. First, we calculate the error gradient for neuron 5 in the output layer:
$\delta_5 = y_5 (1 - y_5)\, e = 0.5097 \cdot (1 - 0.5097) \cdot (-0.5097) = -0.1274$

Then we determine the weight corrections, assuming that the learning rate parameter, α, is equal to 0.1:
$\Delta w_{35} = \alpha \cdot y_3 \cdot \delta_5 = 0.1 \cdot 0.5250 \cdot (-0.1274) = -0.0067$
$\Delta w_{45} = \alpha \cdot y_4 \cdot \delta_5 = 0.1 \cdot 0.8808 \cdot (-0.1274) = -0.0112$
$\Delta \theta_5 = \alpha \cdot (-1) \cdot \delta_5 = 0.1 \cdot (-1) \cdot (-0.1274) = 0.0127$

Next we calculate the error gradients for neurons 3 and 4 in the hidden layer:
$\delta_3 = y_3 (1 - y_3) \cdot \delta_5 \cdot w_{35} = 0.5250 \cdot (1 - 0.5250) \cdot (-0.1274) \cdot (-1.2) = 0.0381$
$\delta_4 = y_4 (1 - y_4) \cdot \delta_5 \cdot w_{45} = 0.8808 \cdot (1 - 0.8808) \cdot (-0.1274) \cdot 1.1 = -0.0147$

We then determine the weight corrections:
$\Delta w_{13} = \alpha \cdot x_1 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta w_{23} = \alpha \cdot x_2 \cdot \delta_3 = 0.1 \cdot 1 \cdot 0.0381 = 0.0038$
$\Delta \theta_3 = \alpha \cdot (-1) \cdot \delta_3 = 0.1 \cdot (-1) \cdot 0.0381 = -0.0038$
$\Delta w_{14} = \alpha \cdot x_1 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta w_{24} = \alpha \cdot x_2 \cdot \delta_4 = 0.1 \cdot 1 \cdot (-0.0147) = -0.0015$
$\Delta \theta_4 = \alpha \cdot (-1) \cdot \delta_4 = 0.1 \cdot (-1) \cdot (-0.0147) = 0.0015$

The training process is repeated until the sum of squared errors is less than 0.001.
Ref. Book: A Guide to Intelligent Systems
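
The hand calculation for the first training pattern (x1 = x2 = 1, desired output 0) can be replayed in a few lines. This is a minimal sketch that follows the update formulas of the training algorithm; the function name, dictionary keys and return layout are choices made here for illustration.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def xor_training_step(x1, x2, yd, w, th, alpha=0.1):
    """One forward and one backward pass through the 2-2-1 XOR network.

    w  : weights keyed '13', '14', '23', '24', '35', '45'
    th : thresholds keyed '3', '4', '5'
    Returns outputs, error, error gradients and corrections; nothing is
    updated in place.
    """
    # Forward pass
    y3 = sigmoid(x1 * w['13'] + x2 * w['23'] - th['3'])
    y4 = sigmoid(x1 * w['14'] + x2 * w['24'] - th['4'])
    y5 = sigmoid(y3 * w['35'] + y4 * w['45'] - th['5'])
    e = yd - y5

    # Backward pass: error gradients for the output and hidden neurons
    d5 = y5 * (1 - y5) * e
    d3 = y3 * (1 - y3) * d5 * w['35']
    d4 = y4 * (1 - y4) * d5 * w['45']

    # Corrections; every threshold is fed by a fixed input of -1
    dw = {'35': alpha * y3 * d5, '45': alpha * y4 * d5,
          '13': alpha * x1 * d3, '23': alpha * x2 * d3,
          '14': alpha * x1 * d4, '24': alpha * x2 * d4}
    dth = {'5': alpha * (-1) * d5, '3': alpha * (-1) * d3, '4': alpha * (-1) * d4}
    return {'y': (y3, y4, y5), 'e': e, 'delta': (d3, d4, d5), 'dw': dw, 'dth': dth}

w0 = {'13': 0.5, '14': 0.9, '23': 0.4, '24': 1.0, '35': -1.2, '45': 1.1}
th0 = {'3': 0.8, '4': -0.1, '5': 0.3}
result = xor_training_step(1, 1, 0, w0, th0)

if __name__ == "__main__":
    print([round(v, 4) for v in result['y']])
    print([round(v, 4) for v in result['delta']])
    print({k: round(v, 4) for k, v in result['dth'].items()})
```

Because δ5 is negative and the threshold input is −1, the correction Δθ5 = 0.1 · (−1) · δ5 evaluates to a positive number, about 0.0127.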

Learning curve for operation Exclusive-OR (Matlab file). Ref. Book: A Guide to Intelligent Systems
Figure: "Sum-Squared Network Error for 224 Epochs"; sum-squared error (log scale, 10^-4 to 10^1) against epoch (0 to 200).

The stop criterion of the process is defined by the mean squared error. The algorithm converges when the change in the mean squared error between two successive epochs is sufficiently small, that is: $|E^{\text{current}} - E^{\text{previous}}| \le \varepsilon$, where ε is the precision required for the convergence process.

Final results of three-layer network learning

Inputs      Desired output   Actual output   Error      Sum of squared errors
x1   x2     yd               y5              e
1    1      0                0.0155          −0.0155    0.0010
0    1      1                0.9849           0.0151
1    0      1                0.9849           0.0151
0    0      0                0.0175          −0.0175
Several variations of the backpropagation method have been proposed in order to enhance
the efficiency of its convergence. Among these variations, one can find the method that uses
the momentum parameter, resilient-propagation, and Levenberg-Marquardt methods.
For small values of the learning parameter η, this leads most often to a very slow convergence
rate of the algorithm. Larger learning parameters have been known to lead to unwanted
oscillations in the weight space and may even cause divergence of the algorithm. To avoid
these issues, researchers have devised a modified weight updating algorithm in which the
change of the weight of the upcoming iteration (at time t + 1) is made dependent on the
weight change of the current iteration (at time t).

Momentum

Momentum term: when the current solution (reflected by its weight matrices) is far from the final solution (minimum point of the error function), the variation in the opposite direction of the gradient of the squared error function between two successive iterations will be significant. This implies that the difference between the error matrices of these two iterations will be relevant and, in this case, it is possible to perform a bigger incremental step for the weights in the direction of the minimum of the error function. The momentum term is in charge of this task, since it is responsible for measuring this variation. However, when the current solution is very near to the final solution, the variations on the weight matrices will be small, since the variation of the mean squared error between two successive iterations will be minor and, consequently, the contribution of the momentum term to the convergence process will be slight. From this moment on, all adjustments on the weight matrices are conducted (usually) only by the learning term.

In this formulation, where the learning rate is η, α is the momentum rate and its value lies between 0 and 1.

Generalized delta rule:
$\Delta w_{jk}(p) = \gamma \cdot \Delta w_{jk}(p-1) + \alpha \cdot y_j(p) \cdot \delta_k(p)$
where γ is a positive number (0 ≤ γ < 1) called the momentum constant. Typically, the momentum constant is set to 0.95.
The momentum term uses the previous weight change (and gradient) information to adjust the motion of the current weight. The momentum term accelerates the descent in steady downhill directions and has a stabilizing effect in directions that oscillate in time.
Momentum is a value that is used to help push a network out of a local minimum.
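
A weight update with momentum can be sketched as below. The sketch assumes the generalized delta rule in the form Δw(p) = γ · Δw(p−1) + α · y_j · δ_k, with γ the momentum constant and α the learning rate; the function name and the sample numbers (taken from the first XOR iteration) are used only for illustration.

```python
def momentum_update(w, prev_dw, y_j, delta_k, alpha=0.1, gamma=0.95):
    """Return (new weight, weight correction) for one weight.

    dw(p) = gamma * dw(p-1) + alpha * y_j * delta_k
    ASSUMPTION: this form of the generalized delta rule and the symbol
    names are a reconstruction, not quoted from the slides.
    """
    dw = gamma * prev_dw + alpha * y_j * delta_k
    return w + dw, dw

# Two successive updates of one weight with the same gradient information:
# the second correction is larger because the previous one is carried over.
w35, dw1 = momentum_update(-1.2, 0.0, 0.5250, -0.1274)
w35, dw2 = momentum_update(w35, dw1, 0.5250, -0.1274)

if __name__ == "__main__":
    print(round(dw1, 6), round(dw2, 6), round(w35, 6))
```

When successive corrections share the same sign, the carried-over term enlarges the step (acceleration in steady downhill directions); when they alternate in sign, it partly cancels the new correction (stabilization).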

Adaptive learning rate: (Change steps)

Adaptive learning rates allow the training algorithm to
monitor the performance of the model and automatically
adjust the learning rate for the best performance.
The most basic model of this decreases the learning rate
once the performance of the model reaches a plateau. The
model does so by decreasing the learning rate by a factor
of two or an order of magnitude. However, the learning
rate can be increased again if the performance doesn’t
improve.
Learning with momentum for operation Exclusive-OR (learning rate constant): reduction in epochs.
Figure: sum-squared network error against epoch (panel title "Sum-Squared Network Error for 224 Epochs"; log-scale error axis from 10^-4 to 10^1; epoch axis from 0 to 200), with one curve for learning without momentum and one for learning with momentum.

Artificial Intelligence A Guide to Intelligent Systems (Matlab file)

Learning with adaptive learning rate (XOR problem): training for 103 epochs.
Learning with momentum and adaptive learning rate (XOR problem): training for 85 epochs.
Figure: for each case, sum-squared error against epoch (log scale) and learning rate against epoch.

Learning with adaptive learning rate

To accelerate the convergence and yet avoid the danger of instability, we can apply two heuristics:
Heuristic 1
If the change of the sum of squared errors has the same algebraic sign for several consecutive epochs, then the learning rate parameter, α, should be increased.
Heuristic 2
If the algebraic sign of the change of the sum of squared errors alternates for several consecutive epochs, then the learning rate parameter, α, should be decreased.
Adapting the learning rate requires some changes in the back-propagation algorithm:
If the sum of squared errors at the current epoch exceeds the previous value by more than a predefined ratio (typically 1.04), the learning rate parameter is decreased (typically by multiplying by 0.7) and new weights and thresholds are calculated.
If the error is less than the previous one, the learning rate is increased (typically by multiplying by 1.05).
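
The two adaptation rules can be sketched as a small helper that is called once per epoch. The typical constants 1.04, 0.7 and 1.05 come from the text; the function name and the returned accept flag (which signals that the step should be recomputed with the smaller learning rate) are an interpretation added here.

```python
def adapt_learning_rate(lr, sse, prev_sse, max_ratio=1.04, decrease=0.7, increase=1.05):
    """Return (new learning rate, accept_step).

    accept_step is False when the error grew by more than max_ratio, meaning
    new weights and thresholds should be calculated with the reduced rate.
    ASSUMPTION: the accept flag is this sketch's reading of that sentence.
    """
    if sse > prev_sse * max_ratio:
        return lr * decrease, False
    if sse < prev_sse:
        return lr * increase, True
    return lr, True

if __name__ == "__main__":
    print(adapt_learning_rate(0.1, 1.10, 1.0))  # error grew by more than 4 %
    print(adapt_learning_rate(0.1, 0.90, 1.0))  # error decreased
    print(adapt_learning_rate(0.1, 1.02, 1.0))  # small increase: rate unchanged
```

Between the two thresholds (error slightly larger, but within the allowed ratio) the sketch leaves the learning rate unchanged, which is one reasonable reading of the rules.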

Artificial Intelligence A Guide to Intelligent Systems

MLP for pattern classification: an MLP with one hidden layer can map any pattern classification problem whose elements are within a convex region.
The classification of the samples in Fig. 5.17a would require two neurons in the hidden layer of the MLP, whereas the classification of the samples in Fig. 5.17b, c would require six and four neurons in the hidden layers of the MLP networks. From a geometric point of view, a region is considered convex if, and only if, all the points of any line segment defined between any pair of points from the domain are inside this region. Figure 5.18a presents an illustration of a convex region while Fig. 5.18b shows a non-convex region. Thus, considering that MLP networks with a single hidden layer can classify patterns placed within a convex region, it is possible to deduce that MLP networks with two hidden layers can classify patterns that are within any geometric region (Lui 1990; Lippmann 1987), including non-convex regions such as the one shown in Fig. 5.19.

Fig. 5.17 (a) 2 neurons in 1 hidden layer, (b) 6 neurons in 1 hidden layer, (c) 4 neurons in 1 hidden layer
Fig. 5.18 (a) (b) Illustration of a convex region and a non-convex region
Fig. 5.19 Decision boundaries of a pattern classification problem within a non-convex region
Book: Artificial Neural Networks: A Practical Course

For such a case, the configuration of the MLP network shown in Fig. 5.20 represents a topology that can implement the pattern classification for the problem in Fig. 5.19. In this case, it is possible to consider, for instance, that neurons A, B and C of the first hidden layer are responsible for delimiting the left convex region (triangle), while the neurons D, E and F are related to the right convex region (triangle). Neuron G of the second middle layer is responsible for combining the outputs of neurons A, B and C in order to represent the group that belongs to the left convex region, so that in this condition its output would be equal to 1. In a similar manner, neuron H combines the outputs of neurons D, E and F to represent the right convex region. Neuron Y is responsible for performing the Boolean operation of disjunction (OR gate), because if one of the outputs produced by neurons G or H is equal to 1, its final response y should also be equal to 1.

Pattern classification: in the case of pattern classification problems with more than two classes, there is the need for inserting more neurons in the output layer of the network, because an MLP with a single neuron in its output layer can distinguish just two classes. As an example, an MLP composed of two neurons in its output layer could represent, at most, four classes (Fig.). A network with three neurons in the output layer could classify a total of eight classes. Generalizing this concept, an MLP with m neurons in its output layer would be able to classify, theoretically, up to 2^m classes.
Alternatively, one of the most used methods of codification is the "one of c-class", which consists in associating the output of each neuron directly to the class.
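
The two output codings can be contrasted with a short sketch: a binary code packs up to 2^m classes into m output neurons, while the "one of c-class" code dedicates one output neuron to each class. The function names are hypothetical, and classes are numbered from 0 for convenience.

```python
def binary_code(class_index, n_outputs):
    """Binary target vector (most significant bit first) for m output neurons."""
    if not 0 <= class_index < 2 ** n_outputs:
        raise ValueError("class index does not fit into the given number of outputs")
    return [(class_index >> bit) & 1 for bit in reversed(range(n_outputs))]

def one_of_c(class_index, n_classes):
    """One-of-c target vector: exactly one output neuron is active per class."""
    if not 0 <= class_index < n_classes:
        raise ValueError("class index out of range")
    return [1 if i == class_index else 0 for i in range(n_classes)]

if __name__ == "__main__":
    for c in range(4):
        print(c, binary_code(c, 2), one_of_c(c, 4))
```

Four classes therefore need 2 output neurons with binary coding but 4 with one-of-c coding; the latter is the scheme used when each output is read directly as "this class or not".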

Book: Artificial Neural Networks A

Practical Course
Example: typical applications of MLP: pattern classification, character recognition.
Recognition of digits from 0 to 9. In this application, each digit is represented by a 5 × 9 bit map, as shown in Figure. The number of neurons in the input layer is the number of pixels in the bit map. The bit map in our example consists of 45 pixels, and we need 45 input neurons. The output layer has 10 neurons (one for each digit to be recognised). Where a better resolution is required, at least a 16 × 16 bit map is used.
Neurons in the hidden and output layers use a sigmoid activation function. The momentum constant is set to 0.95.

The number of neurons in the hidden layer affects the accuracy of recognition and the speed of learning.
See the digital_recognition.m and bit_map.m files. Ref. Book: A Guide to Intelligent Systems

Application of feed-forward networks: character recognition

digit1 = [0 0 1 0 0
          0 1 1 0 0
          1 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0
          0 0 1 0 0];

digit2 = [0 1 1 1 0
          1 0 0 0 1
          0 0 0 0 1
          0 0 0 0 1
          0 0 0 1 0
          0 0 1 0 0
          0 1 0 0 0
          1 0 0 0 0
          1 1 1 1 1];

s1 = 12;   % number of neurons in the hidden layer
s2 = 10;   % number of neurons in the output layer
net = newff(minmax(p), [s1 s2], {'logsig' 'purelin'}, 'traingdx');

Book: Artificial Intelligence: A Guide to Intelligent Systems

Input matrix p (45 × 10): each column holds the 45 pixels of one digit bit map; the individual entries are not recoverable from this extract.

Target (output) matrix t (10 × 10), one column per digit:
1 0 0 0 0 0 0 0 0 0
0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 1 0 0 0 0 0 0
0 0 0 0 1 0 0 0 0 0
0 0 0 0 0 1 0 0 0 0
0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 1 0 0
0 0 0 0 0 0 0 0 1 0
0 0 0 0 0 0 0 0 0 1

Book: Artificial Intelligence: A Guide to Intelligent Systems

Input/output data

Digit   Bit-map rows 1 to 9                                      Desired pattern
1       00100 01100 10100 00100 00100 00100 00100 00100 00100   1000000000
2       01110 10001 00001 00001 00010 00100 01000 10000 11111   0100000000
3       …                                                        …
4       …                                                        …
5       …                                                        …
6       …                                                        …
7       …                                                        …
8       …                                                        …
9       …                                                        …
0       …                                                        0000000001

"Noise": the distortion of the input patterns. This distortion can be created, for instance, by adding some small random values chosen from a normal distribution.
Improving the performance: training with noisy examples.
Learning curve of digit recognition with a three-layer NN (effect of the number of neurons in the hidden layer)
Performance evaluation of digit recognition: performance is measured by the sum of squared errors (SSE). We will examine the system's performance with 2, 5, 10 and 20 hidden neurons and compare the results.

Overfitting problem: a state in which an ANN has memorised all the training examples but cannot generalise. Overfitting may occur if the number of hidden layer neurons is too big. To prevent overfitting it is better to choose the smallest number of hidden neurons that still gives adequate performance. In this example 2, 5, 10 and 20 neurons are selected. The results show that there is no significant difference between the networks with 10 and 20 neurons.

Ref. Book: A Guide to Intelligent Systems

Stopping criteria in learning

• Sensible stopping criteria:
– Total mean squared error change:
– Back-prop is considered to have converged when the absolute rate of change in the average squared error per epoch is sufficiently small (in the range [0.01, 0.1]).
– Generalization-based criterion:
– After each epoch the NN is tested for generalization using a different set of examples (validation set). If the generalization performance is adequate, then stop.

• Network topology: the number of layers and of neurons depends on the specific task. In practice this issue is solved by trial and error, so it is usually determined by experimentation.
• Two types of adaptive algorithms can be used (neurons in hidden layer):
– start from a large network and successively remove some neurons and links until network performance degrades.
– begin with a small network and introduce new neurons until performance is satisfactory.
– Too many: the network will memorise the training set and will not generalise well. Too few: risk that the network may not be able to learn the pattern in the training set. It is often necessary to train the network with different learning rates to find the optimum value for the problem under investigation.
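
The two stopping criteria listed above can be sketched as small checks that run after every epoch. The function names, the threshold eps and the target validation error are hypothetical parameters chosen for illustration.

```python
def error_change_converged(mse_history, eps):
    """True when the change in mean squared error between the last two epochs is at most eps."""
    return len(mse_history) >= 2 and abs(mse_history[-1] - mse_history[-2]) <= eps

def generalization_adequate(validation_error, target_error):
    """True when the error measured on the validation set is at or below the target."""
    return validation_error <= target_error

def should_stop(mse_history, validation_error, eps=1e-3, target_error=0.05):
    """Stop as soon as either criterion is met (one possible way to combine them)."""
    return (error_change_converged(mse_history, eps)
            or generalization_adequate(validation_error, target_error))

if __name__ == "__main__":
    print(should_stop([0.50, 0.40], validation_error=0.30))     # keep training
    print(should_stop([0.2000, 0.1995], validation_error=0.30)) # error has flattened
    print(should_stop([0.50, 0.40], validation_error=0.04))     # generalizes well enough
```

Combining the criteria with a logical OR is a design choice of this sketch; requiring both at once would be a stricter alternative.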
Other optimization algorithms
A combination of four hidden neurons can produce a “bump” in the output space of a
two-layer MLP.
• A large number of these “bumps” can approximate any surface arbitrarily well
[Duda, Hart and Stork, 2001]
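
The bump construction can be reproduced numerically. The sketch below is a minimal illustration under stated assumptions: the logistic activation, the slope `k`, the half-width of 1 and the output threshold of 1.5 are hypothetical choices, not values taken from the lecture or from the cited book.

```matlab
% Illustrative sketch: a "bump" built from four sigmoidal hidden neurons.
% Assumptions: slope k, half-width 1 and threshold 1.5 are hypothetical.
sigm = @(v) 1 ./ (1 + exp(-v));            % logistic activation
k = 10;                                    % steepness of every neuron
[x1, x2] = meshgrid(-3:0.1:3, -3:0.1:3);   % grid of two-dimensional inputs
h1 = sigm(k*(x1 + 1));                     % step up   at x1 = -1
h2 = sigm(k*(x1 - 1));                     % step up   at x1 = +1
h3 = sigm(k*(x2 + 1));                     % step up   at x2 = -1
h4 = sigm(k*(x2 - 1));                     % step up   at x2 = +1
s = h1 - h2 + h3 - h4;                     % about 2 in the centre, 1 on the ridges, 0 elsewhere
z = sigm(k*(s - 1.5));                     % output neuron keeps only the centre
surf(x1, x2, z); xlabel('x_1'); ylabel('x_2'); zlabel('z');
title('Bump from four hidden neurons');
```

Each pair of neurons forms a ridge along one input axis; the output neuron responds only where the two ridges overlap.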

Other Learning Methods (MLP):

• Other methods that can be used to find the weights of an MLP include:
– Conjugate gradient method
– Levenberg-Marquardt method (the most used)
– Quasi-Newton method
– Least-squares Kalman filtering method
Levenberg-Marquardt method: the Levenberg-Marquardt algorithm is a second-order gradient
method, based on the least squares method for nonlinear models, which can be incorporated
into the backpropagation algorithm to enhance the efficiency of the training process.
Levenberg-Marquardt uses an approximation of the Hessian matrix (the matrix of second-order
derivatives), built from the Jacobian of the network errors.
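
For reference, the Levenberg-Marquardt weight update is commonly written as below. The symbols are assumptions introduced here and are not notation from the lecture: w is the weight vector, e the vector of network errors, J the Jacobian of e with respect to w, and μ a non-negative damping factor.

```latex
\begin{aligned}
\mathbf{H} &\approx \mathbf{J}^{\top}\mathbf{J}, \qquad \mathbf{g} = \mathbf{J}^{\top}\mathbf{e},\\
\mathbf{w}_{k+1} &= \mathbf{w}_{k} - \left(\mathbf{J}^{\top}\mathbf{J} + \mu\,\mathbf{I}\right)^{-1}\mathbf{J}^{\top}\mathbf{e}.
\end{aligned}
```

When μ is close to zero the step approaches the Gauss-Newton step; when μ is large it approaches gradient descent with a small step size.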

Linear regression or curve fitting: minimum square error

Minimum mean square error: in statistics and signal
processing, a minimum mean square error (MMSE)
estimator is an estimation method which minimizes the
mean square error (MSE), a common measure of
estimator quality, of the fitted values of a dependent
variable.
The MSE is used to check how close estimates or forecasts are
to the actual values: the lower the MSE, the closer the forecast is to
the actual value. The mean squared error (MSE) is perhaps the
simplest and most common loss function.

[Figure: mean squared error]

The LMS algorithm adjusts the weights
and biases of the linear network to minimize
the mean square error (MSE).
The MSE is a measure of the quality of an
estimator or predictor.
The MSE is almost always strictly positive.
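
As a worked illustration, the MSE over N samples can be written as follows. The symbols d_i (desired output) and y_i (model output) are notation assumed here.

```latex
\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(d_i - y_i\right)^2
% Example with N = 3, d = (1, 2, 3), y = (1.1, 1.9, 3.2):
% MSE = \frac{(-0.1)^2 + (0.1)^2 + (-0.2)^2}{3} = \frac{0.06}{3} = 0.02
```

Every term is a square, so the MSE is zero only when all outputs match the desired values exactly, which is why it is almost always strictly positive.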

A linear model: learning the parameters a and b of

y = ax + b
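
A minimal sketch of how the LMS (Widrow-Hoff) rule can learn a and b is given below. The training data (noise-free samples of d = 3x - 1), the learning rate and the number of passes are assumptions made for the illustration.

```matlab
% Illustrative sketch of the LMS (Widrow-Hoff) rule for y = a*x + b.
% Assumptions: noise-free data from d = 3*x - 1; learning rate and number
% of passes are hypothetical.
x = 0:0.1:2;                 % inputs
d = 3*x - 1;                 % desired outputs
a = 0; b = 0;                % initial parameters
lr = 0.05;                   % learning rate
for pass = 1:500             % passes over the training data
    for i = 1:numel(x)
        e = d(i) - (a*x(i) + b);    % error for one sample
        a = a + 2*lr*e*x(i);        % LMS update of the slope
        b = b + 2*lr*e;             % LMS update of the intercept
    end
end
mseFinal = mean((d - (a*x + b)).^2);
fprintf('a = %.4f, b = %.4f, MSE = %.2e\n', a, b, mseFinal);
```

Because the data are consistent with a straight line, the estimates should approach a = 3 and b = -1; with noisy data the parameters would instead fluctuate around the least-squares solution.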
Universal curve fitting (function approximation): as in pattern classification problems,
an MLP with a single hidden layer of logistic activation functions is sufficient for mapping any
nonlinear continuous function defined on a compact (closed and bounded) domain.

Book: Artificial Neural Networks: A Practical Course

[Figure: illustration of the superposition of logistic activation functions for curve fitting]
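
The superposition idea can be sketched in a few lines: the network output is a weighted sum of shifted and scaled logistic functions. All weights and biases below are hypothetical values picked only to make the individual sigmoids visible.

```matlab
% Illustrative sketch: output of a single-hidden-layer MLP as a weighted
% sum of logistic functions. All weights below are hypothetical.
logistic = @(v) 1 ./ (1 + exp(-v));
x  = linspace(-5, 5, 201);           % input samples
w  = [ 2;  2;  2];                   % hidden-layer weights
th = [-6;  0;  6];                   % hidden-layer biases (shift each sigmoid)
v  = [ 1; -2;  1.5];                 % output-layer weights
b0 = 0.2;                            % output bias
H = zeros(numel(w), numel(x));
for j = 1:numel(w)
    H(j, :) = logistic(w(j)*x + th(j));   % output of hidden neuron j
end
y = v' * H + b0;                     % linear output neuron: weighted sum
plot(x, H', '--', x, y, 'k-'); grid on;
xlabel('x'); ylabel('output');
title('Superposition of logistic functions');
```

The dashed curves are the hidden-neuron outputs and the solid curve is their weighted sum; training adjusts `w`, `th`, `v` and `b0` so that the sum follows the target curve.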

Example : Application of MLP for function approximation (cont.)

Output node (Linear)

x = 0:0.01:4;                                   % Data generation
y = (sin(2*pi*x)+1).*exp(-x.^2);                % Function to be approximated
BLF = 'learngd';                                % Learning function
PF = 'mse';                                     % Cost function
net = newff(PR,[S1 S2],{TF1 TF2},BTF,BLF,PF);   % Command for creating the network

[Figure: function to be approximated, output y versus input x for x from 0 to 4]
Example: Application of MLP for function approximation: effect of the hidden layer
(3 and 6 neurons in the hidden layer)

[Figure: two panels, one with 3 and one with 6 hidden neurons, each showing the desired output and the network output versus x for x from 0 to 4]

Weighted summation of all outputs from the first-layer nodes yields the function approximation.
When the number of hidden nodes is too small, the approximation is poor: compare the
performance with 3 and with 6 hidden neurons.

Effect of hidden nodes on function approximation (Example 5.2: Effect of Hidden layer)

To illustrate the effect of the number of hidden neurons on the approximation capabilities of the
MLP, we use here the simple function f(x) = x sin(x).

Given a sufficiently large number of hidden neurons, a two-layer MLP can
approximate any continuous function arbitrarily well.

Book: Soft Computing and Intelligent Systems Design
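
The effect of the hidden-layer size on f(x) = x sin(x) can be explored without any toolbox. The sketch below is a simplification and not back-propagation: the hidden weights are random and kept fixed, and only the linear output weights are fitted by least squares. The input range [0, 10], the hidden-layer sizes and the random seed are assumptions.

```matlab
% Illustrative sketch: training SSE versus number of hidden neurons for
% f(x) = x*sin(x). Assumptions: hidden weights are random and fixed, only
% the output weights are fitted by least squares (not back-propagation).
rng(1);                                   % reproducible random weights
x = linspace(0, 10, 200)';                % training inputs (column vector)
t = x .* sin(x);                          % targets
Hmax = 20;
W = 2*randn(1, Hmax);                     % hidden weights (fixed)
B = -W .* (10*rand(1, Hmax));             % biases centre each tanh inside [0, 10]
hiddenSizes = [2 5 10 20];
sse = zeros(size(hiddenSizes));
for n = 1:numel(hiddenSizes)
    H = hiddenSizes(n);
    Z = x * W(1:H) + repmat(B(1:H), numel(x), 1);   % net inputs, 200-by-H
    Phi = [tanh(Z), ones(numel(x), 1)];             % hidden outputs plus bias column
    v = pinv(Phi) * t;                              % least-squares output weights
    sse(n) = sum((Phi*v - t).^2);                   % sum of squared errors
    fprintf('%2d hidden neurons: SSE = %.4g\n', H, sse(n));
end
```

Because each larger network reuses the hidden neurons of the smaller ones, the training SSE should not increase as neurons are added, apart from numerical effects. A low training SSE says nothing about generalisation, which has to be checked on separate data.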

Example 5.3: Effect of the number of training patterns (samples) on function approximation
The same function was approximated with a network with a fixed number of nodes (five
here), but with a variable number of training patterns.

A higher number of training patterns provided better
results, as shown in Figure (f).

Matlab:New

Book:Soft Computing and Intelligent Systems Design

% Example 5
x = 0:.05:2; y = humps(x);    % sample the built-in humps function on [0, 2]
P = x; T = y;                 % network inputs P and targets T
% Plot the data
plot(P,T,'x')
grid; xlabel('time (s)'); ylabel('output'); title('humps function')
% DESIGN THE NETWORK

% Define network. First try a simple one.
% 'trainlm' selects the Levenberg-Marquardt training method.
% The input range [0 2] matches the data x = 0:.05:2.
net=newff([0 2], [20,1], {'tansig','purelin'},'trainlm');
% Define training parameters
net.trainParam.show = 50;
net.trainParam.lr = 0.05;
net.trainParam.epochs = 300;
net.trainParam.goal = 1e-4;
%Train network
net1 = train(net, P, T);
% Simulate result
a= sim(net1,P);
%Plot result and compare
plot(P,a,P,a-T)
xlabel('time in secs');ylabel('Network output and error');
title('Humps function'); grid
% First try a simple one: feedforward (multilayer perceptron) network
net=newff([0 2], [5,1], {'tansig','purelin'},'traingdm'); % alternative: 'traingd'
% Here newff defines the feedforward network architecture.
% The first argument [0 2] defines the range of the input and initializes the
% network parameters.
% The second argument defines the structure of the network. There are two layers.
% 5 is the number of nodes in the first hidden layer,
% 1 is the number of nodes in the output layer.
% Next the activation functions in the layers are defined.
% In the first hidden layer there are 5 tansig functions.
% In the output layer there is 1 linear function.
% 'traingdm' selects gradient descent with momentum; 'traingd' is the basic gradient method.
% Define learning parameters
net.trainParam.show = 50; % The result is shown at every 50th iteration (epoch)
net.trainParam.lr = 0.05; % Learning rate used in some gradient schemes
net.trainParam.epochs =1000; % Max number of iterations
net.trainParam.goal = 1e-3; % Error tolerance; stopping criterion
%Train network
net1 = train(net, P, T); % Iterates gradient type of loop
% Simulate result
a= sim(net1,P);
% Plot result and compare
plot(P,a, P,a-T); grid;
xlabel('Time (s)'); ylabel('Output of network and error'); title('Humps function');
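
In newer MATLAB releases `newff` is a legacy constructor and `feedforwardnet` is its replacement. The sketch below shows how the same experiment might look with that interface. It assumes a release in which the Neural Network (Deep Learning) Toolbox provides `feedforwardnet`; the parameter values simply mirror the listing above.

```matlab
% Illustrative sketch with the newer toolbox interface (assumption: the
% installed toolbox provides feedforwardnet).
x = 0:.05:2;  y = humps(x);
P = x;  T = y;
net = feedforwardnet(5, 'traingdm');   % 5 tansig hidden neurons, linear output
net.divideFcn = 'dividetrain';         % use all samples for training, as newff did
net.trainParam.lr     = 0.05;          % learning rate
net.trainParam.epochs = 1000;          % maximum number of epochs
net.trainParam.goal   = 1e-3;          % error goal (stopping criterion)
net1 = train(net, P, T);               % train the network
a = net1(P);                           % same role as sim(net1, P)
plot(P, T, 'x', P, a); grid;
xlabel('Time (s)'); ylabel('Target and network output'); title('Humps function');
```

By default `feedforwardnet` splits the data randomly into training, validation and test subsets; the `divideFcn` line switches this off so that the run is comparable with the older listing.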
