MLP Extra Notes
This seems fine on first examination, but a moment’s thought will
show that this arrangement of perceptrons in layers will be unable to learn.
Each neuron in the structure still takes the weighted sum of its inputs,
thresholds it, and outputs either a one or a zero. For the perceptrons in the
first layer, the inputs come from the actual inputs to the network, while the
perceptrons in the second layer take as their inputs the outputs from the first
layer. This means that the perceptrons in the second layer do not know
which of the real inputs were on or not; they are only aware of input from
the first layer. Since learning corresponds to strengthening the connections
between active inputs and active units (refer to section 3.3), it is impossible
to strengthen the correct parts of the network, since the actual inputs are
effectively masked off from the output units by the intermediate layer. The
two-state neuron, being “on” or “off”, gives us no indication of the scale by
which we need to adjust the weights, and so we cannot make a reasonable
adjustment.
The Solution
The way around the difficulty imposed by using the step function as
the thresholding function is to adjust it slightly, and use a slightly different
non-linearity. If we smooth it out, so that it still more or less turns on or off
as before, but has a sloping region in the middle that gives us some
information about the inputs, we will be able to determine when we need to
strengthen or weaken the relevant weights. This means that the network
will be able to learn, as required. A couple of possibilities for the new
thresholding function are shown in Figure 3.2.
Name                         Input/Output Relation                    MATLAB Function
Hard Limit                   a = 0 if n < 0;  a = 1 if n ≥ 0          hardlim
Symmetrical Hard Limit       a = −1 if n < 0;  a = +1 if n ≥ 0        hardlims
Log-Sigmoid                  a = 1 / (1 + e^(−n))                     logsig
Hyperbolic Tangent Sigmoid   a = (e^n − e^(−n)) / (e^n + e^(−n))      tansig
Positive Linear              a = 0 if n < 0;  a = n if n ≥ 0          poslin
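The functions in this table can be sketched directly in Python; a minimal illustration that only borrows the MATLAB names as labels (the toolbox itself is not involved):

```python
import math

# Plain-Python sketches of the activation functions in the table above.

def hardlim(n):  return 1.0 if n >= 0 else 0.0     # hard limit
def hardlims(n): return 1.0 if n >= 0 else -1.0    # symmetrical hard limit
def logsig(n):   return 1.0 / (1.0 + math.exp(-n)) # log-sigmoid
def tansig(n):   return math.tanh(n)               # (e^n - e^-n)/(e^n + e^-n)
def poslin(n):   return n if n >= 0 else 0.0       # positive linear

# Unlike the hard limiters, the log-sigmoid has a useful derivative
# everywhere, which is what makes weight adjustment possible:
def logsig_deriv(n):
    a = logsig(n)
    return a * (1.0 - a)
```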
Figure (3.2): The most used activation functions: (a) hard-limited threshold; (b) linear
threshold: if the input is above a certain threshold, the output saturates
(to a value of 1); there are different variants of this function,
depending on the range of the neuronal output values, shown in (b-1) and
(b-2); (c) sigmoid function: logistic function (c-1); bipolar logistic function
(c-2); Gaussian (bell-shaped) function (c-3).
In both cases, the value of the output will be practically one if the
weighted sum exceeds the threshold by a lot, and conversely, it will be
practically zero if the weighted sum is much less than the threshold value.
However, in the case when the threshold and the weighted sum are almost
the same, the output from the neuron will have a value somewhere between
the two extremes. This means that the output from the neuron is able to be
related to its inputs in a more useful and informative way.
THE NEW MODEL
THE NEW LEARNING RULE
The learning rule for multilayer perceptrons is called the
"generalized delta rule", or the "backpropagation rule".
The operation of the network is similar to that of the single-
layer perceptron, in that we show the net a pattern and calculate its
response. Comparison with the desired response enables the
weights to be altered so that the network can produce a more
accurate output next time. The learning rule provides the method
for adjusting the weights in the network. When we show the
untrained network an input pattern, it will produce an essentially
random output. We need to define an error function that represents
the difference between the network's current output and the correct
output that we want it to produce. Because we need to know the
"correct" pattern, this type of learning is known as "supervised
learning". In order to learn successfully, we want to make the output
of the net approach the desired output; that is, we want to
continually reduce the value of this error function. This is achieved
by adjusting the weights on the links between the units, and the
generalized delta rule does this by calculating the value of the error
function for that particular input, and then back-propagating (hence
the name!) the error from one layer to the previous one. Each unit in
the net has its weights adjusted so that it reduces the value of the
error function; for units actually on the output, their output and the
desired output are known, so adjusting the weights is relatively
simple, but for units in the middle layer, the adjustment is not so
obvious.
Weights from input nodes to hidden nodes are called input-layer
weights.
E_P = ½ · Σ_j (T_j − O_j)²   ────►(1)
"Only for output nodes, since they have targets."
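Equation (1) as a sketch; the target and output vectors here are made-up values for illustration:

```python
# Pattern error E_P = 1/2 * sum_j (T_j - O_j)^2, summed over output nodes.
def pattern_error(targets, outputs):
    return 0.5 * sum((t - o) ** 2 for t, o in zip(targets, outputs))

# e.g. targets [1, 0] with actual outputs [0.8, 0.3]:
# E_P = 0.5 * (0.2**2 + 0.3**2) = 0.065
```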
The total linear sum at each unit j is:
net_j = Σ_i W_ij · O_i   ────►(2)
For example:
net3 = W13·X1 + W23·X2
net7 = W37·O3 + W47·O4 + W57·O5 + …
The unit's output is then:
O_j = f_j(net_j)   ────►(3)
i.e.
O4 = f4(net4)
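Equations (2) and (3) amount to a weighted sum followed by a squashing function. A minimal sketch, with the logistic function as f and arbitrary illustrative weights:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# One unit j: weighted sum of its inputs (eq. 2), then squash (eq. 3).
def unit_output(weights, inputs, f=logsig):
    net = sum(w * o for w, o in zip(weights, inputs))  # eq. (2)
    return f(net)                                      # eq. (3)

# With W13 = W23 = 1 and X = (0, 0), net3 = 0, so O3 = logsig(0) = 0.5
```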
W_ij(new) = W_ij(old) − η × (∂E_P / ∂W_ij)   ────►(4)
for any weight in the network.
Thus, the issue is to get (∂E_P / ∂W_ij) for each weight in the
network.
We can write, by the chain rule:
∂E_P/∂W_ij = (∂E_P/∂net_j) · (∂net_j/∂W_ij)   ────►(5)
Since net_j = Σ_i W_ij · O_i, we have ∂net_j/∂W_ij = O_i. Thus:
∂E_P/∂W_ij = O_i · (∂E_P/∂net_j)   ────►(5′)
i.e., the update error term for any weight depends on the input value going to
that weight (O_i).
Now we look at the second term of eq. (5′). Let us call:
δ_j = − ∂E_P/∂net_j   ────►(7)
so that, from (4) and (5′), ΔW_ij = η × δ_j × O_i. For example:
ΔW46 = η × δ6 × O4   (an output-layer weight)
ΔW14 = η × δ4 × O1   (a hidden-layer weight)
The δ_j for output nodes (concerning output-layer weights) is
simpler than the δ_j for hidden nodes (concerning input-layer
weights):
δ_j = − ∂E/∂net_j = − (∂E/∂O_j) · (∂O_j/∂net_j)   (chain rule)   ────►(9)
Since O_j = f_j(net_j),
∂O_j/∂net_j = f′_j(net_j)   ────►(10)
Thus,
δ_j = − f′_j(net_j) × (∂E/∂O_j)   ────►(9′)
The problem now is to get the term ∂E/∂O_j.
1) If O_j is an output unit:
E = ½ Σ_j (T_j − O_j)²
∂E/∂O_j = − (T_j − O_j)   ────►(11)
Thus,
δ_j = f′_j(net_j) × (T_j − O_j)   ────►(12)
for output units, and hence, for output-layer weights,
ΔW_ij = η × f′_j(net_j) × (T_j − O_j) × O_i
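Equation (12) and the resulting weight update can be sketched with the logistic f, whose derivative is f′(net) = O·(1 − O); the values of η, the target, and the inputs below are illustrative:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Output-unit delta from eq. (12): delta_j = f'(net_j) * (T_j - O_j),
# using the logistic derivative f'(net) = O * (1 - O).
def output_delta(net_j, target):
    o = logsig(net_j)
    return o * (1.0 - o) * (target - o)

# Corresponding output-layer weight change: eta * delta_j * O_i
def weight_update(eta, net_j, target, o_i):
    return eta * output_delta(net_j, target) * o_i
```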
2) For hidden-layer weights (input-layer weights):
δ_j = − f′_j(net_j) × (∂E/∂O_j)
Consider W14, so that O_j = O4. Since O4 feeds units 6 and 7:
∂E/∂O4 = (∂E/∂net6) × (∂net6/∂O4) + (∂E/∂net7) × (∂net7/∂O4)
Generally, if O_j is connected to (k) output nodes, then:
∂E/∂O_j = Σ_k (∂E/∂net_k) × (∂net_k/∂O_j)
where ∂net_k/∂O_j simplifies to W_jk. Thus:
∂E/∂O_j = Σ_k W_jk × (∂E/∂net_k)   ────►(13)
(the sum runs over the output layer). But from (9),
∂E/∂net_j = −δ_j,  or  ∂E/∂net_k = −δ_k
∴ ∂E/∂O_j = − Σ_k W_jk × δ_k   ────►(14)
Substituting back in (9′):
δ_j = − f′_j(net_j) × ( − Σ_k W_jk × δ_k ) = f′_j(net_j) × Σ_k W_jk × δ_k
and, from (12), δ_k = f′_k(net_k) × (T_k − O_k) for output units, so:
δ_j = f′_j(net_j) × Σ_k W_jk × f′_k(net_k) × (T_k − O_k)
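The hidden-unit delta just derived can be sketched the same way; the downstream weights W_jk and deltas δ_k below are arbitrary illustrative values:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Hidden-unit delta: delta_j = f'(net_j) * sum_k W_jk * delta_k,
# again using the logistic derivative f'(net) = O * (1 - O).
def hidden_delta(net_j, w_jk, delta_k):
    o = logsig(net_j)
    return o * (1.0 - o) * sum(w * d for w, d in zip(w_jk, delta_k))
```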
Multilayer Perceptron Learning Algorithm:
1) Initialise weights and thresholds:
Set all weights and thresholds to small random values.
4) Adapt weights:
Start from the output layer, and work backwards.
W_ij(t+1) = W_ij(t) + η · δ_pj · O_pi
W_ij(t) represents the weight from node i to node j at time t, η is a gain
term, and δ_pj is an error term for pattern p on node j.
For output units:
δ_pj = α · O_pj · (1 − O_pj) · (t_pj − O_pj)
For hidden units:
δ_pj = α · O_pj · (1 − O_pj) · Σ_k δ_pk · W_jk
where the sum is over the k nodes in the layer above node j.
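The steps above can be sketched as a small script that trains a 2-2-1 network on XOR. The layer sizes, learning rate, epoch count, random seed, and the use of trainable bias weights are illustrative assumptions, not prescribed by the notes (the gain α is folded into η here):

```python
import math
import random

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

random.seed(0)
ETA = 0.5  # learning rate (illustrative)

# w[i][j]: input i -> hidden j (row 2 holds the hidden biases)
# v[j]:    hidden j -> output   (v[2] is the output bias)
w = [[random.uniform(-0.5, 0.5) for _ in range(2)] for _ in range(3)]
v = [random.uniform(-0.5, 0.5) for _ in range(3)]

def forward(x):
    h = [logsig(w[0][j] * x[0] + w[1][j] * x[1] + w[2][j]) for j in range(2)]
    o = logsig(v[0] * h[0] + v[1] * h[1] + v[2])
    return h, o

def train_step(x, t):
    h, o = forward(x)
    # output delta (logistic derivative, eq. 12)
    d_o = o * (1 - o) * (t - o)
    # hidden deltas: back-propagate d_o through the hidden->output weights
    d_h = [h[j] * (1 - h[j]) * v[j] * d_o for j in range(2)]
    # adapt weights, output layer first, then the hidden layer
    for j in range(2):
        v[j] += ETA * d_o * h[j]
    v[2] += ETA * d_o
    for j in range(2):
        w[0][j] += ETA * d_h[j] * x[0]
        w[1][j] += ETA * d_h[j] * x[1]
        w[2][j] += ETA * d_h[j]
    return 0.5 * (t - o) ** 2

patterns = [((0, 0), 0), ((0, 1), 1), ((1, 0), 1), ((1, 1), 0)]
initial_err = sum(0.5 * (t - forward(x)[1]) ** 2 for x, t in patterns)
for _ in range(5000):
    err = sum(train_step(x, t) for x, t in patterns)
```

With these settings the summed pattern error after training falls well below its initial value; as the notes observe below, a small fraction of initialisations can still settle into a local minimum.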
THE XOR PROBLEM REVISITED:
In the previous chapter, we saw how the single-layer perceptron was
unable to solve the exclusive-or problem.
Considering the hidden unit, we can see that it detects when both the
inputs are on, since this is the only condition under which it turns on. Since
each of the input units detects when its input is on, the output unit is fed
with three items of information: whether the left input is on, whether the
right input is on, and whether both the left and the right inputs are on. Since
the output unit treats the hidden unit as another input unit, the apparent
input patterns it receives are now dissimilar enough for the classification to
be learnt.
Multilayer perceptrons can appear in all shapes and sizes, with the same
learning rule for them all. This means that it is possible to produce different
network topologies to solve the same problem. One of the more interesting
cases is when there is no direct connection from the input to the output.
This and the corresponding XOR solution are shown in figure 4.6.
The network shown in figure 4.7 will correctly respond to the input
patterns 00 and 10, but fails to produce the correct output for the patterns 01
or 11. The right-hand input turns on both hidden units. These produce a net
input of 0.8 at the output unit, exactly the same as the threshold value.
Since the thresholding function is the sigmoid, this gives an output value of
0.5. This situation is stable and does not alter with further training. This
local minimum occurs infrequently, about 1% of the time, in the XOR
problem.
Another minor problem can occur in training networks with the
generalized delta rule. Since the weight changes are proportional to the
weights themselves, if the system starts off with equal weights then non-
equal weights can never be developed, and so the net cannot settle into the
non-symmetric solution that may be required.
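This symmetry problem can be seen numerically. A sketch (all numeric values arbitrary) of one generalized-delta step on a 2-2-1 net whose two hidden units start with identical weights:

```python
import math

def logsig(n):
    return 1.0 / (1.0 + math.exp(-n))

# Both hidden units start with the same weights ...
w_h = [[0.3, 0.3], [0.3, 0.3]]   # input -> hidden, identical columns
w_o = [0.2, 0.2]                 # hidden -> output, identical
x, t = (1.0, 0.0), 1.0

# ... so they produce identical outputs,
h = [logsig(w_h[0][j] * x[0] + w_h[1][j] * x[1]) for j in range(2)]
o = logsig(sum(w_o[j] * h[j] for j in range(2)))

# ... receive identical deltas, and therefore stay identical forever.
d_o = o * (1 - o) * (t - o)
d_h = [h[j] * (1 - h[j]) * w_o[j] * d_o for j in range(2)]
```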
This energy surface is a rippling landscape of hills and valleys, wells and
mountains, with points of minimum energy corresponding to the wells and
maximum energy found on the peaks. The generalized delta rule aims to
minimize the error function E by adjusting the weights in the network so
that they correspond to those at which the energy surface is lowest. It does
this by a method known as gradient descent, where the gradient of the
energy function is calculated, and changes are made in the steepest
downward direction. This is guaranteed to find a solution in cases where
the energy landscape is simple. Each possible solution is represented as a
hollow, or a basin, in the landscape. These basins of attraction, as they are
known, represent the solutions: the values of the weights that produce the
correct output for a given input.
We can consider classifying patterns in another way. Any given input
pattern must belong to one of the classes that we are considering, and so
there is a mapping from the input to the required class. This mapping can be
viewed as a function that transforms the input pattern into the correct output
class, and we can consider that a network has learnt to perform correctly if
it can carry out this mapping. In fact, any function, no matter how complex,
can be represented by a multilayer perceptron of no more than three layers:
the inputs are fed through an input layer, a middle hidden layer, and an
output layer.
More than two units can be used in the first layer, which produces
a pattern-space partitioning that is a combination of more than two lines.
All regions produced in this way are known as convex regions or convex hulls.
A convex hull is a region in which any point can be connected to any other
by a straight line that does not cross the boundary of the region.
However, if we add another layer of perceptrons, the units in this layer
will receive as inputs, not lines, but convex hulls. The combinations of
these convex regions may intersect, overlap, or be separate from each other,
producing arbitrary shapes.
LEARNING DIFFICULTIES:
Momentum term
The weight changes can be given some “momentum” by introducing an
extra term into the weight adaptation equation that will produce a large
change in the weight if the changes are currently large, and will decrease as
the changes become less.
This momentum can be written as follows:
W_ji(t+1) = W_ji(t) + η · δ_pj · O_pi + k · (W_ji(t) − W_ji(t−1))
where k is the momentum factor, 0 < k < 1.
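A one-function sketch of the momentum update; all numeric inputs in the example are illustrative:

```python
# Momentum update: the previous weight change, scaled by k, is carried
# forward on top of the usual generalized-delta step.
def momentum_update(w_now, w_prev, eta, delta, o_i, k):
    change = eta * delta * o_i + k * (w_now - w_prev)
    return w_now + change

# e.g. w(t)=1.0, w(t-1)=0.9, eta=0.5, delta=0.2, O_i=1.0, k=0.9:
# change = 0.1 + 0.9*0.1 = 0.19, so w(t+1) = 1.19
```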
Addition of noise
If random noise is added, this perturbs the gradient descent algorithm
from the line of steepest descent, and often this noise is enough to knock
the system out of a local minimum. This approach has the advantage that it
takes very little extra computation time, and so is not noticeably slower
than the direct gradient descent algorithm.
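Noise injection can be sketched as a small random perturbation on each weight update; the noise scale (0.01) and the other values are arbitrary choices:

```python
import random

random.seed(1)

# Gradient-descent step plus a small uniform random perturbation,
# which can knock the system out of a shallow local minimum.
def noisy_update(w, eta, delta, o_i, noise_scale=0.01):
    return w + eta * delta * o_i + random.uniform(-noise_scale, noise_scale)
```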
Consider learning a Boolean function of d inputs: the domain contains 2^d
instances, and there are 2^(2^d) possible functions over it. The network for identifying
the true function should have “d” input units and one output unit. In
general, the larger the network, the larger the set of functions it can form
and the more likely the true function is in this set. We can think of the use
of training data as a means of rejecting incorrect functions. Hopefully, at
the end of training, there is only one function left which is the true function
sought. If this is not the case, we wish the function learned to be the
function that best approximates the true function. In this sense, the more
training instances, the more likely the network can identify the true
function. However, this is true only when the network is large enough (but
not too large) to implement the true function.
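The counting behind this argument: for d binary inputs the domain has 2^d instances, and each function assigns one of two outputs to each instance, giving 2^(2^d) possible functions. A one-line sketch of these counts:

```python
# Size of the instance space and the number of Boolean functions over it.
def domain_size(d):
    return 2 ** d

def function_count(d):
    return 2 ** (2 ** d)

# d = 2: 4 instances, 16 possible functions (XOR is one of them).
```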
Hush and Horne (1993) describe approaches to the study of generalization
and methods to improve it. In one approach,
the average generalization of the network is defined to be the average of the
generalization values of all functions that are consistent with the set of
training instances. The generalization value of a function is the fraction of
the domain (of 2^d instances).
activation will not be modified. This idea is implemented by changing the
input range to -1/2 to 1/2 and using the following activation function:
f(a) = 1 / (1 + e^(−a)) − 1/2
where a is the argument of the function. An alternative function is the
hyperbolic tangent function (tanh, MATLAB's "tansig"), which lies in the
range from −1 to 1.
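A sketch of the shifted logistic function described above (tanh, its alternative, comes with Python's math module):

```python
import math

# Shifted logistic: output range -1/2 .. 1/2, and zero at a = 0, so a
# unit whose output is zero contributes no weight change.
def shifted_logsig(a):
    return 1.0 / (1.0 + math.exp(-a)) - 0.5

# math.tanh plays the role of the tansig alternative (range -1 .. 1);
# it is also zero at a = 0.
```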
The first two criteria are sensitive to the choice of parameters and may
lead to poor results if the parameters are improperly chosen. The cross-
validation criterion does not have this drawback. It can avoid overfitting
the data and can actually improve the generalization performance of the
network. However, cross-validation is much more computationally
intensive and often demands more data.
Network Size: The backpropagation network can approximate an
arbitrary mapping only when the network is sufficiently large. In general, it
is not known what size of network will produce the best result for a
given problem. If the network is too small, it cannot learn to fit the training
data well. On the other hand, if the network is too big, the learning problem
becomes underconstrained and the network is able to learn many solutions
that are consistent with the training data, but most of them are likely to be
poor approximations of the actual model.
Complexity of Learning: It has been shown that the problem of
finding a set of weights for a fixed-size network which performs the desired
mapping exactly for given training data is NP-complete¹. That is, we cannot
find optimal weights in polynomial time. So, for a very large problem, it is
unlikely that we can determine the optimal solution in a reasonable amount
of time.
Learning algorithms like backpropagation are gradient descent
techniques which seek only a local minimum. These algorithms usually do
not take exponential time to run. Empirically, the learning time on a serial
machine for backpropagation is about O(N³), where N is the number of
weights in the network (Hinton 1989). The slowness of finding a local
minimum is perhaps due to the characteristics of the error surface being
searched. In the case of the single-layer perceptron, the error surface is a
quadratic bowl with a single minimum.
¹ NP-complete problems refer to a class of problems which lack a polynomial-time solution: the time taken by the best
method we know for solving such problems grows exponentially with problem size. It has been proven that if a
polynomial-time solution is found for one of the NP-complete problems, the solution can be generalized to the rest of the
problems in this category. The question, however, is whether such faster methods exist.