
Arab Academy For Science & Technology and

Maritime Transport – Cairo Branch


College of Engineering & Technology
Electronics & Communication Engineering Department

Course: Neural Networks        Course code: CC524

Lecturer: Dr. Waleed Fakhr     Lecture Notes 3

The Multilayer Perceptron


Altering The Perceptron Model
The Problem

How are we to overcome the problem of being unable to solve
linearly inseparable problems with our perceptron? An initial approach
would be to use more than one perceptron, each set up to identify small,
linearly separable sections of the inputs, then combine their outputs into
another perceptron, which would produce a final indication of the class to
which the input belongs. This approach to the XOR problem is shown in
Figure 3.1.

Figure 3.1 Combining perceptrons can solve the


XOR problem: perceptron 1 detects when the
pattern corresponding to (0,1) is present, and the
other detects when (1,0) is there. Combined,
these two facts allow perceptron 3 to classify the
input correctly. They have to be set up correctly
in the first place, however; they cannot learn to
produce this classification.

This seems fine on first examination, but a moment’s thought will
show that this arrangement of perceptrons in layers will be unable to learn.
Each neuron in the structure still takes the weighted sum of its inputs,
thresholds it, and outputs either a one or a zero. For the perceptrons in the
first layer, the inputs come from the actual inputs to the network, while the
perceptrons in the second layer take as their inputs the outputs from the first
layer. This means that the perceptrons in the second layer do not know
which of the real inputs were on or not; they are only aware of input from
the first layer. Since learning corresponds to strengthening the connections
between active inputs and active units (refer to section 3.3), it is impossible
to strengthen the correct parts of the network, since the actual inputs are
effectively masked off from the output units by the intermediate layer. The
two-state neuron, being “on” or “off”, gives us no indication of the scale by
which we need to adjust the weights, and so we cannot make a reasonable
adjustment.

The Solution

The way around the difficulty imposed by using the step function as
the thresholding function is to adjust it slightly and use a slightly different
non-linearity. If we smooth it out, so that it more or less turns on or off as
before, but has a sloping region in the middle that will give us some
information on the inputs, we will be able to determine when we need to
strengthen or weaken the relevant weights. This means that the network
will be able to learn, as required. A couple of possibilities for the new
thresholding function are shown in Figure 3.2.

Name                          Input/Output Relation                 MATLAB Function

Hard Limit                    a = 0,  n < 0                         hardlim
                              a = 1,  n ≥ 0

Symmetrical Hard Limit        a = -1, n < 0                         hardlims
                              a = +1, n ≥ 0

Linear                        a = n                                 purelin

Saturating Linear             a = 0,  n < 0                         satlin
                              a = n,  0 ≤ n ≤ 1
                              a = 1,  n > 1

Symmetric Saturating Linear   a = -1, n < -1                        satlins
                              a = n,  -1 ≤ n ≤ 1
                              a = 1,  n > 1

Log-Sigmoid                   a = 1 / (1 + e^(−n))                  logsig

Hyperbolic Tangent Sigmoid    a = (e^n − e^(−n)) / (e^n + e^(−n))   tansig

Positive Linear               a = 0,  n < 0                         poslin
                              a = n,  n ≥ 0

Competitive                   a = 1,  neuron with max n             compet
                              a = 0,  all other neurons
Figure 3.2: The most used activation functions: (a) hard-limited threshold; (b) linear
threshold: if the input is above a certain threshold, the output becomes
saturated (to a value of 1); there are different variants of this function
depending on the range of the neuronal output values, shown in (b-1) and
(b-2); (c) sigmoid functions: logistic function (c-1); bipolar logistic function
(c-2); (c-3) Gaussian (bell-shaped) function.

In both cases, the value of the output will be practically one if the
weighted sum exceeds the threshold by a lot, and conversely, it will be
practically zero if the weighted sum is much less than the threshold value.
However, when the threshold and the weighted sum are almost the same,
the output from the neuron will have a value somewhere between the two
extremes. This means that the output from the neuron can be related to its
inputs in a more useful and informative way.
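A few of the tabulated functions can be sketched in code using the MATLAB names the notes mention (hardlim, logsig, tansig). The sample input values below are chosen purely for illustration; note how the log-sigmoid behaves like the step function far from the threshold, but reports intermediate values near it.

```python
import numpy as np

def hardlim(n):
    """Step function: 1 if n >= 0, else 0."""
    return np.where(n >= 0, 1.0, 0.0)

def logsig(n):
    """Log-sigmoid: smooth 0..1 curve with a sloping middle region."""
    return 1.0 / (1.0 + np.exp(-n))

def tansig(n):
    """Hyperbolic tangent sigmoid: smooth -1..1 curve."""
    return np.tanh(n)

# Far above or below the threshold, logsig behaves like the step function,
# but near zero it reports how close the weighted sum is to the threshold.
print(hardlim(np.array([-5.0, 0.1, 5.0])))              # [0. 1. 1.]
print(np.round(logsig(np.array([-5.0, 0.0, 5.0])), 3))  # [0.007 0.5   0.993]
```

The sloping middle region of logsig is exactly the "information on the inputs" the text refers to: its output near the threshold tells us by how much to adjust the weights.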

THE NEW MODEL

The adapted perceptron units are arranged in layers, and so the
new model is naturally enough termed the multilayer perceptron.
Our new model has three layers: an input layer, an output layer,
and a layer in between, not connected directly to the input or the
output, and so called the hidden layer. Each unit in the hidden layer
and the output layer is like a perceptron unit, except that the
thresholding function is the sigmoid function and not the step
function as before. The units in the input layer serve to distribute
the values they receive to the next layer, and so do not perform a
weighted sum or threshold.

Figure 3.3: The multilayer perceptron “Our new model”

THE NEW LEARNING RULE
The learning rule for multilayer perceptrons is called the
“generalized delta rule”, or the “backpropagation rule”.

The operation of the network is similar to that of the single-
layer perceptron, in that we show the net a pattern and calculate its
response. Comparison with the desired response enables the
weights to be altered so that the network can produce a more
accurate output next time. The learning rule provides the method
for adjusting the weights in the network.

When we show the untrained network an input pattern, it will
produce an essentially random output. We need to define an error
function that represents the difference between the network’s
current output and the correct output that we want it to produce.
Because we need to know the “correct” pattern, this type of
learning is known as “supervised learning”.

In order to learn successfully we want to make the output of the
net approach the desired output; that is, we want to continually
reduce the value of this error function. This is achieved by adjusting
the weights on the links between the units. The generalized delta
rule does this by calculating the value of the error function for that
particular input, and then back-propagating (hence the name!) the
error from one layer to the previous one. Each unit in the net has its
weights adjusted so that it reduces the value of the error function;
for units actually on the output, their output and the desired output
are known, so adjusting the weights is relatively simple, but for
units in the middle layer, the adjustment is not so obvious.

• Weights from input nodes to hidden nodes are called input-layer
weights.

• Weights from hidden nodes to output nodes are called output-layer
weights.

• At each pattern “p”, the total squared error is:

    E_P = ½ ∑_j (T_j − O_j)²    ────► (1)

“Only for output nodes, since they have targets.”

• The total linear sum at each unit “j” is:

    net_j = ∑_i W_ij · O_i    ────► (2)

For example, net3 = (W13·X1 + W23·X2)
             net7 = (W37·O3 + W47·O4 + W57·O5 + …)

• The output from each unit “j” is the non-linear function
(e.g., sigmoid) acting on the weighted sum; thus,

    O_j = f_j(net_j)    ────► (3)

i.e.
    O4 = f4(net4)
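Equations (2) and (3) amount to a matrix product followed by an element-wise non-linearity. As a sketch, here is a forward pass through a small 2-input, 3-hidden-unit, 1-output network in the spirit of the examples above (the specific weight values are made up purely for illustration):

```python
import numpy as np

def logsig(n):
    return 1.0 / (1.0 + np.exp(-n))

x = np.array([1.0, 0.0])                  # inputs X1, X2
W_hidden = np.array([[0.5, -0.3, 0.8],    # W_ij: input i -> hidden node j
                     [0.2,  0.7, -0.1]])  # (hidden nodes 3, 4, 5)
net_hidden = x @ W_hidden                 # eq (2): weighted sums net3..net5
O_hidden = logsig(net_hidden)             # eq (3): outputs O3..O5

W_out = np.array([0.4, -0.6, 0.9])        # W37, W47, W57
net7 = O_hidden @ W_out                   # net7 = W37*O3 + W47*O4 + W57*O5
O7 = logsig(net7)                         # eq (3) again at the output node
print(round(float(O7), 3))                # 0.649
```

Each `@` product computes one layer's worth of equation (2) for all units at once, which is why the per-unit sums in the notes disappear from the code.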

• The gradient-descent (LMS) algorithm:

The update equation for any weight in the network is:

    W_ij(new) = W_ij(old) − η × ∂E_P/∂W_ij    ────► (4)

• Thus, the issue is to get ∂E_P/∂W_ij for each weight in the
network.

• Using the chain rule, we can write:

    ∂E_P/∂W_ij = (∂E_P/∂net_j) · (∂net_j/∂W_ij)    ────► (5)

But net_j = ∑_i W_ij · O_i (from 2); thus,

    ∂net_j/∂W_ij = O_i    ────► (6)

Thus,

    ∂E_P/∂W_ij = O_i · (∂E_P/∂net_j)    ────► (5′)

i.e., the update error term for any weight depends on the input value going to
that weight (O_i).

• Now we look at the 2nd term of eq. (5′). Let’s call:

    δ_j = − ∂E_P/∂net_j    ────► (7)

    W_ij(new) = W_ij(old) + η × δ_j × O_i    ────► (8)

• We now need to know what δ_j is for each of the units. For example,

    ∆W46 = η × δ6 × O4    (an output-layer weight)

    ∆W14 = η × δ4 × O1    (a hidden-layer weight)

• The δ_j for O/P nodes (concerning O/P-layer weights) is
simpler than the δ_j for hidden nodes (concerning input-layer
weights):

    δ_j = − ∂E/∂net_j = − (∂E/∂O_j) · (∂O_j/∂net_j)    “Chain Rule”    ────► (9)

Since O_j = f_j(net_j),

    ∂O_j/∂net_j = f′_j(net_j)    ────► (10)

Thus,

    δ_j = − f′_j(net_j) × (∂E/∂O_j)    ────► (9′)

The problem is to get the remaining term, ∂E/∂O_j.

• Getting ∂E/∂O_j:

1) If O_j is an O/P unit:

    E = ½ ∑_j (T_j − O_j)²

Thus,

    ∂E/∂O_j = −(T_j − O_j)    ────► (11)

Thus,

    δ_j = f′_j(net_j) · (T_j − O_j)    ────► (12)

for O/P units, and hence, for O/P-layer weights,

    ∆W_ij = η × f′_j(net_j) × (T_j − O_j) × O_i

    (∆W46 = η × f′(net6) × (T6 − O6) × O4)
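For the logistic function, f′(net_j) = O_j(1 − O_j), so eq. (12) needs only the unit's output. A numeric sketch of the ∆W46 example, with illustrative values (not from the notes):

```python
# Output-layer delta of eq (12), specialised to the logistic function,
# for which f'(net_j) = O_j * (1 - O_j). Values chosen for illustration.
O6, T6 = 0.8, 1.0        # actual and target output at node 6
O4, eta = 0.6, 0.5       # hidden-unit output feeding W46, and gain term

delta6 = O6 * (1 - O6) * (T6 - O6)   # f'(net6) * (T6 - O6) = 0.032
dW46 = eta * delta6 * O4             # eq (8) weight change = 0.0096
print(round(delta6, 4), round(dW46, 4))
```

Note that the O(1 − O) factor vanishes as O approaches 0 or 1, which is why learning slows when sigmoid units saturate.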

2) For hidden-layer weights (input-layer weights):

    δ_j = − f′_j(net_j) × ∂E/∂O_j

Consider W14, with O_j = O4. The output O4 reaches all the outputs
through (net6, net7):

    ∂E/∂O4 = (∂E/∂net6) × (∂net6/∂O4) + (∂E/∂net7) × (∂net7/∂O4)

Generally, if O_j is connected to (k) output nodes, then:

    ∂E/∂O_j = ∑_k (∂E/∂net_k) × (∂net_k/∂O_j)

where ∂net_k/∂O_j simplifies to W_jk. Thus,

    ∂E/∂O_j = ∑_k W_jk × (∂E/∂net_k)    ────► (13)
                         (O/P layer)

But from (9),

    ∂E/∂net_j = −δ_j,  or  ∂E/∂net_k = −δ_k

    ∴ ∂E/∂O_j = − ∑_k W_jk × δ_k    ────► (14)

From (12),

    δ_k = f′_k(net_k) · (T_k − O_k)    ────► (15)

Back in (9′), the two minus signs cancel:

    δ_j = f′_j(net_j) × ∑_k W_jk · f′_k(net_k) × (T_k − O_k)

This is the δ for weight W_ij in the (i/p) layer. Thus,

    ∆W_ij = η × [ f′_j(net_j) × ∑_k W_jk · f′_k(net_k) × (T_k − O_k) ] × O_i    ────► (16)

Ex:

    ∆W14 = η × [ f′4(net4) × ( W46 · f′6(net6) · (T6 − O6)
                             + W47 · f′7(net7) · (T7 − O7) ) ] × X1

Multilayer Perceptron Learning Algorithm:

1) Initialise weights and thresholds:
Set all weights and thresholds to small random values.

2) Present input and desired output:
Present input Xp = x0, x1, x2, …, x(n−1) and target output Tp = t0, t1, …, t(m−1),
where n is the number of input nodes and m is the number of
output nodes. Set w0 to be −θ, the bias, and x0 to be always 1.
For pattern association, Xp and Tp represent the patterns to be
associated. For classification, Tp is set to zero except for one
element set to 1 that corresponds to the class that Xp is in.

3) Calculate actual output:
Each layer calculates

    O_pj = f_j ( ∑_{i=0}^{n−1} w_ij · x_i )

and passes that as input to the next layer. The final layer outputs
values O_pj.

4) Adapt weights:
Start from the output layer, and work backwards.

    W_ij(t+1) = W_ij(t) + η · δ_pj · O_pi

W_ij(t) represents the weight from node i to node j at time t, η is a gain
term, and δ_pj is an error term for pattern p on node j.
For output units:

    δ_pj = O_pj (1 − O_pj)(t_pj − O_pj)

For hidden units:

    δ_pj = O_pj (1 − O_pj) ∑_k δ_pk W_jk

where the sum is over the k nodes in the layer above node j.
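Steps 1-4 can be sketched as a short NumPy training loop on the XOR problem itself. The layer sizes, learning rate, epoch count, and random seed below are illustrative choices, not taken from the notes:

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)

# Step 1: small random weights; the extra row of each matrix is the bias
# weight w0 = -theta, fed by a constant input x0 = 1.
W1 = rng.uniform(-0.5, 0.5, (3, 2))   # input (+bias) -> 2 hidden units
W2 = rng.uniform(-0.5, 0.5, (3, 1))   # hidden (+bias) -> 1 output unit
eta = 2.0

def forward(X, W1, W2):
    Xb = np.hstack([X, np.ones((len(X), 1))])   # append x0 = 1
    H = sigmoid(Xb @ W1)                        # step 3: hidden outputs
    Hb = np.hstack([H, np.ones((len(H), 1))])
    O = sigmoid(Hb @ W2)                        # step 3: final outputs O_pj
    return Xb, H, Hb, O

def total_error(O, T):
    return 0.5 * np.sum((T - O) ** 2)           # eq (1), summed over patterns

E_start = total_error(forward(X, W1, W2)[3], T)

for epoch in range(4000):
    Xb, H, Hb, O = forward(X, W1, W2)
    delta_o = O * (1 - O) * (T - O)               # step 4: output deltas
    delta_h = H * (1 - H) * (delta_o @ W2[:2].T)  # step 4: hidden deltas
    W2 += eta * Hb.T @ delta_o                    # output-layer weights first
    W1 += eta * Xb.T @ delta_h                    # then input-layer weights

E_end = total_error(forward(X, W1, W2)[3], T)
print(E_start > E_end)   # the error should have fallen during training
```

The hidden deltas use only the first two rows of W2 (`W2[:2]`), because the bias row does not feed back to any hidden unit. This sketch updates after each full pass over the four patterns; the notes' per-pattern ("online") updates work equally well here.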

THE XOR PROBLEM REVISITED:
In the previous chapter, we saw how the single-layer perceptron was
unable to solve the exclusive-or problem.

The first test of the multilayer perceptron is to see if we can produce a
network that can solve this problem; the two-layer net shown in Figure 4.4
is able to produce the correct output. It has a three-layer structure, with two
input units (as we might expect, since there are two variables in the
problem), one unit in the hidden layer, and one output unit. The connection
weights are shown on the links, and the threshold of each unit is shown
inside the unit. As far as the output unit is concerned, the hidden unit is no
different from either of the input units, and simply provides another input.

Input    Hidden Unit    Output
 00          0             0
 01          0             1
 10          0             1
 11          1             0

Considering the hidden unit, we can see that it is detecting when both the
inputs are on, since this is the only condition under which it turns on. Since
each of the input units detects when its input is on, the output unit is fed
with three items of information: if the left input is on, if the right input is
on, and if both the left and the right inputs are on. Since the output unit
treats the hidden unit as another input unit, the apparent input patterns it
receives are now dissimilar enough for the classification to be learnt.
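Since the figure itself is not reproduced here, the network just described can be sketched with hypothetical hand-set weights: the hidden unit is an AND detector (weights 1, 1, threshold 1.5), and the output unit cancels it against the two direct input connections (weights 1, 1, −2, threshold 0.5):

```python
# Hand-set (hypothetical) weights for the one-hidden-unit XOR network
# described above; step units with the threshold written inside each unit.
def step(n, theta):
    return 1 if n >= theta else 0

def xor_net(x1, x2):
    hidden = step(x1 + x2, 1.5)             # fires only for input 11 (AND)
    return step(x1 + x2 - 2 * hidden, 0.5)  # direct inputs minus the AND

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, xor_net(x1, x2))          # reproduces the table above
```

As the text notes, weights like these must be set up by hand; a trained network would find a working but less tidy set of values.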

The generalized delta rule provides a method for teaching multilayer


perceptron networks, producing the necessary internal representations on
the hidden nodes. It is unlikely that the weights produced by a taught
network would be as simple as those shown above, but the same principles
hold. Figure 4.5 shows another solution to the XOR problem.

Multilayer perceptrons can appear in all shapes and sizes, with the same
learning rule for them all. This means that it is possible to produce different
network topologies to solve the same problem. One of the more interesting
cases is when there is no direct connection from the input to the output.
This and the corresponding XOR solution are shown in figure 4.6.

The network shown in Figure 4.7 will correctly respond to the input
patterns 00 and 10, but fails to produce the correct output for the patterns 01
or 11. The right-hand input turns on both hidden units. These produce a net
input of 0.8 at the output unit, exactly the same as the threshold value.
Since the thresholding function is the sigmoid, this gives an output value of
0.5. This situation is stable and does not alter with further training. This
local minimum occurs infrequently, about 1% of the time, in the XOR
problem.
Another minor problem can occur in training networks with the
generalized delta rule. Since the weight changes are proportional to the
weights themselves, if the system starts off with equal weights then non-
equal weights can never be developed, and so the net cannot settle into the
non-symmetric solution that may be required.

VISUALISING NETWORK BEHAVIOUR:

As we have seen, the network computes an error or energy function,

    E_p = ½ ∑_j (t_pj − O_pj)²

which represents the amount by which the output
of the net differs from the required output. Large differences correspond to
large energies, whilst small differences correspond to small energies. Since
the output of the net is related to the weights between the units and the
input applied, the energy is therefore a function of the weights and inputs to
the network. We can draw a graph of the energy function showing how
varying the weights affects the energy, for a fixed input pattern.

[Figure: the energy surface as a function of the weights]

This energy surface is a rippling landscape of hills and valleys, wells and
mountains, with points of minimum energy corresponding to the wells and
maximum energy found on the peaks. The generalized delta rule aims to
minimize the error function E by adjusting the weights in the network so
that they correspond to those at which the energy surface is lowest. It does
this by a method known as gradient descent, where the energy function is
calculated, and changes are made in the steepest downward direction. This
is guaranteed to find a solution in cases where the energy landscape is
simple. Each possible solution is represented as a hollow, or a basin, in the
landscape. These basins of attraction, as they are known, represent the
values of the weights that produce the correct output from a given input.

To clarify the situation: a multilayer network receives a number of
inputs. These are distributed by a layer of input nodes that do not perform
any summation or thresholding; these input nodes have only one input each,
so it is clear which they are, and obviously pointless for them to sum their
only input. These inputs are then passed along the first layer of adaptive
weights to a layer of perceptron-like units, which do sum and threshold
their inputs. This layer is able to produce classification lines in pattern
space. The output from this layer is then passed to another layer of
perceptron-like units via adaptable weights, and it is the output of this layer
that forms convex hulls in pattern space. A further layer of perceptron-like
units is reached by another set of adaptive weights, and the output of this
layer is able to define any arbitrary shape in pattern space. Counting the
number of active weight layers, or the number of active perceptron layers,
this is a three-layer network. If the inactive set of input units is included, it
can be called a four-layer network. The general trend is to use the former,
since it is more descriptive. This is summarized in the figure.

We can consider classifying patterns in another way. Any given input
pattern must belong to one of the classes that we are considering, and so
there is a mapping from the input to the required class. This mapping can be
viewed as a function that transforms the input pattern into the correct output
class, and we can consider that a network has learnt to perform correctly if
it can carry out this mapping. In fact, any function, no matter how complex,
can be represented by a multilayer perceptron of no more than three layers:
the inputs are fed through an input layer, a middle hidden layer, and an
output layer.

Multilayer Perceptrons as Classifiers:

We have already considered how the multilayer perceptron copes with the
complicated, linearly inseparable XOR problem; now we consider the more
general case. The single-layer perceptron is limited to calculating a single
plane of separation between classes, which is why it fails on problems
such as the XOR which are more complicated. We discussed earlier how a
two-layer device could, in principle, solve the XOR problem.

More than two units can be used in the first layer, which produces a
pattern-space partitioning that is a combination of more than two lines. All
regions produced in this way are known as convex regions or convex hulls.
A convex hull is a region in which any point can be connected to any other
by a straight line that does not cross the boundary of the region.
However, if we add another layer of perceptrons, the units in this layer
will receive as inputs not lines, but convex hulls. The combinations of
these convex regions may intersect, overlap, or be separate from each other,
producing arbitrary shapes.

LEARNING DIFFICULTIES:

The XOR problem demonstrates some of the difficulties associated with
learning in multilayer perceptrons. Occasionally the network settles into a
stable solution that does not provide the correct output. In these cases, the
energy function is in a local minimum.
There are alternative approaches to minimizing these occurrences, which
are outlined below.

• Lowering the gain term (η) and adapting η:

As the gain is decreased, the network weights settle into a minimum-
energy configuration without overshooting the stable position, as the
gradient descent takes smaller downhill steps.
However, the reduction in the gain term will mean that the network will
take longer to converge.

• Addition of internal nodes:

Local minima can be considered to occur when two or more disjoint
classes are categorized as the same. This amounts to a poor internal
representation within the hidden units, and so adding more units to this
layer will allow a better recoding of the input and lessen the occurrence of
these minima. (But this may cause over-training.)

• Momentum term:

The weight changes can be given some “momentum” by introducing an
extra term into the weight adaptation equation. It will produce a large
change in the weight if the changes are currently large, and will decrease as
the changes become less. The momentum version can be written as follows:

    W_ji(t+1) = W_ji(t) + η · δ_pj · O_i + k (W_ji(t) − W_ji(t−1))

where k is the momentum factor, 0 < k < 1.
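The effect of the extra term can be seen in a tiny sketch; η, k, and the gradient values are illustrative, and `prev_dW` remembers the last step, W(t) − W(t−1):

```python
# The momentum update above as code: while the raw gradient step stays
# the same, the applied step grows because it accumulates past steps.
eta, k = 0.5, 0.9
W, prev_dW = 0.2, 0.0

for grad_step in [0.1, 0.1, 0.1]:        # same raw step three times
    dW = eta * grad_step + k * prev_dW   # eta*delta*O plus momentum term
    W += dW
    prev_dW = dW
    print(round(dW, 4))                  # 0.05, then 0.095, then 0.1355
```

Successive steps in a consistent direction reinforce each other, which speeds travel across flat regions of the energy surface, while steps that keep reversing direction largely cancel.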

• Addition of noise:

If random noise is added, this perturbs the gradient descent algorithm
from the line of steepest descent, and often this noise is enough to knock
the system out of a local minimum. This approach has the advantage that it
takes very little extra computation time, and so is not noticeably slower
than the direct gradient descent algorithm.

• Generalization:

Generalization is concerned with how well the network performs on the
problem with respect to both seen and unseen data. It is usually tested on
new data outside the training set. Generalization depends on the network
architecture and size, the learning algorithm, the complexity of the
underlying problem, and the quality and quantity of the training data.
Research has been conducted to answer such questions as:

- How many training instances are required for good generalization?
- What size network gives the best generalization?
- What kind of architecture is best for modeling the underlying problem?
- What learning algorithm can achieve the best generalization?

It is often difficult to answer any of these questions without fixing some
factors. But an important goal is to develop a general learning algorithm
which improves generalization in most or all circumstances. This is
addressed in the next section.
For simplicity, we focus our discussion on learning logic functions of
“d” binary variables. In this problem, there are 2^d different patterns in the
domain, and there are 2^(2^d) possible functions. The network for identifying
the true function should have “d” input units and one output unit. In
general, the larger the network, the larger the set of functions it can form,
and the more likely the true function is in this set. We can think of the use
of training data as a means of rejecting incorrect functions. Hopefully, at
the end of training, there is only one function left, which is the true function
sought. If this is not the case, we wish the function learned to be the
function that best approximates the true function. In this sense, the more
training instances, the more likely the network can identify the true
function. However, this is true only when the network is large enough (but
not too large) to implement the true function.
Hush and Horne (1993) describe approaches to studying generalization
and methods to improve generalization. In one approach,
the average generalization of the network is defined to be the average of the
generalization values of all functions that are consistent with the set of
training instances. The generalization value of a function is the fraction of
the domain (of 2^d instances) on which it agrees with the true function.

Parsimony principle: For n networks which are all capable of
solving a problem, the smallest one gives the best generalization.

• Learning Speed:

Some techniques have been used to accelerate convergence of a gradient
descent technique like backpropagation. Newton’s method uses the
information of the second-order derivatives. Quasi-Newton methods
(“trainlm”) approximate second-order information with first-order
information. Conjugate gradient methods compute a linear combination of
the current gradient vector and the previous search direction (momentum)
for the current search direction. A detailed discussion of this issue is given
by Shanno (1990).
The network training time can also be improved by some tricks, such as
joggling the weights or using slightly noisy data (Caudill 1991; Stubbs
1990). It has been found that if the activation level is restricted to the range
from -1/2 to 1/2, convergence time may be reduced by half compared with
the 0 to 1 range (Stornetta and Huberman 1987). This improvement
proceeds from the fact that a weight coming from a neuron of zero
activation will not be modified. This idea is implemented by changing the
input range to -1/2 to 1/2 and using the following activation function:

    f(a) = 1/(1 + e^(−a)) − 1/2

where a is the argument of the function. An alternative function is the
hyperbolic tangent function (tanh), which lies in the range from -1 to 1
(“tansig”).
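The range-shifted activation is a one-line change; a sketch (assuming the reconstruction of the formula above, i.e. the ordinary logistic shifted down by 1/2):

```python
import math

# Logistic output moved from (0, 1) down to (-1/2, 1/2): a "silent"
# neuron now outputs 0 rather than 0.5, so weights fed by zero-activation
# neurons receive no update (the speed-up the text describes).
def shifted_sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a)) - 0.5

print(shifted_sigmoid(0.0))               # 0.0 at the midpoint
print(round(shifted_sigmoid(10.0), 3))    # approaches +0.5
print(round(shifted_sigmoid(-10.0), 3))   # approaches -0.5
```

`math.tanh` gives the (−1, 1) alternative mentioned in the text, with the same zero-centred property.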

• Stopping Criteria:

The process of adjusting the weights based on the
gradients is repeated until a minimum is reached. In practice, one has to
choose a stopping condition. There are several stopping criteria that can
be considered:

- Based on the error to be minimized: In pattern recognition, one
might consider stopping the procedure once the training data are
correctly classified. When this is not the case, a fixed threshold is
used, so that the procedure is stopped if the error is below it.
However, this criterion does not guarantee generalization to new
data.

- Based on the gradient: The algorithm is terminated when the
gradient is sufficiently small. Note that the gradient will be zero at
a minimum by definition.

- Based on cross-validation: This criterion can be used to monitor
generalization performance during learning and to terminate the
algorithm when there is no more improvement.

The first two criteria are sensitive to the choice of parameters and may
lead to poor results if the parameters are improperly chosen. The cross-
validation criterion does not have this drawback. It can avoid overfitting
the data and can actually improve the generalization performance of the
network. However, cross-validation is much more computationally
intensive and often demands more data.
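The cross-validation criterion is often implemented as "early stopping". A minimal sketch, where the validation-error sequence and the `patience` parameter are made up for illustration:

```python
# Stop when validation error has not improved for `patience` checks,
# and report the epoch at which the best (lowest) error was seen.
def early_stop_epoch(val_errors, patience=3):
    best, best_epoch = float("inf"), 0
    for epoch, err in enumerate(val_errors):
        if err < best:
            best, best_epoch = err, epoch
        elif epoch - best_epoch >= patience:
            return best_epoch        # no improvement for `patience` checks
    return best_epoch

# Validation error falls, then rises as the network starts overfitting:
print(early_stop_epoch([0.9, 0.6, 0.4, 0.35, 0.4, 0.45, 0.5]))  # 3
```

In practice the weights saved at the best epoch are the ones kept, since training past that point only fits noise in the training data.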

• Network Size:

The backpropagation network can approximate an
arbitrary mapping only when the network is sufficiently large. In general, it
is not known what size of network will produce the best result for a
given problem. If the network is too small, it cannot learn to fit the training
data well. On the other hand, if the network is too big, the learning problem
becomes under-constrained and the network is able to learn many solutions
that are consistent with the training data, but most of them are likely to be
poor approximations of the actual model.

- The network should have such a size as to capture the structure
of the data and eventually to model the underlying problem. With
some specific knowledge about the problem structure, one can
sometimes form a good estimate of the proper network size.

- With little prior knowledge, the size of the network must be
determined by trial and error. One approach is to grow the network,
starting with the smallest possible network, until the performance
begins to level off or decline. The network size is increased by
adding more nodes in fixed layers or by adding new layers. The
latter approach is illustrated by cascade-correlation learning.

- Alternatively, we can proceed the other way around. That is,
starting with a large network, we apply a pruning technique to
remove those weights and nodes which have little relevance to the
solution. In this approach, one needs to know how to select a
“large” network to begin with.

- In most cases, a two-layer network (with a single hidden layer) is
adequate. Numerous bounds on the number of hidden nodes in a two-layer
network have been derived. However, most of them assume that
the activation function is a hard-limiting function. Formal analysis
is more difficult in the case of a sigmoid function.

Complexity of Learning:

It has been shown that the problem of
finding a set of weights for a fixed-size network which performs the desired
mapping exactly for given training data is NP-complete¹. That is, we cannot
find optimal weights in polynomial time. So, for a very large problem, it is
unlikely that we can determine the optimal solution in a reasonable amount
of time.
Learning algorithms like backpropagation are gradient descent
techniques which seek only a local minimum. These algorithms usually do
not take exponential time to run. Empirically, the learning time on a serial
machine for backpropagation is about O(N³), where N is the number of
weights in the network (Hinton 1989). The slowness of finding a local
minimum is perhaps due to the characteristics of the error surface being
searched. In the case of the single-layer perceptron, the error surface is a
quadratic bowl with a single minimum.

¹ NP-complete problems refer to a class of problems which lacks a polynomial-time solution: the time taken by the best
method we know for solving such problems grows exponentially with problem size. It has been proven that if a
polynomial-time solution is found for one of the NP-complete problems, the solution can be generalized to the rest of the
problems in this category. The question is, however, whether such faster methods exist.
