
Artificial Neural Networks

Multilayer Perceptrons
Backpropagation



Berrin Yanikoglu
Nov. 2003
Capabilities of Multilayer Perceptrons
Multilayer Perceptron
In multilayer perceptrons there may be one or more hidden layers, so called because they are not observed from the outside.

Multilayer Perceptron
Each layer may have a different number of nodes and a different activation function.
Commonly, the same activation function is used within one layer.
Typically, a sigmoid activation function is used in the hidden units, and sigmoid or linear activation functions are used in the output units, depending on the problem (classification or function approximation).

In feedforward networks, activations are passed only from one layer to the next
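
As a minimal illustration (not from the slides), the following NumPy sketch computes one forward pass of a 1-hidden-layer feedforward network with sigmoid hidden units and a linear output unit; the weight values are arbitrary.

```python
import numpy as np

def sigmoid(n):
    """Logistic sigmoid, f(n) = 1 / (1 + exp(-n))."""
    return 1.0 / (1.0 + np.exp(-n))

def forward(p, W1, b1, W2, b2):
    """Forward pass: sigmoid hidden layer, linear output layer."""
    a1 = sigmoid(W1 @ p + b1)  # hidden activations
    return W2 @ a1 + b2        # linear output (e.g. function approximation)

# Arbitrary 1-2-1 network: 1 input, 2 hidden units, 1 output
W1 = np.array([[0.8], [-0.5]]); b1 = np.array([0.1, 0.3])
W2 = np.array([[1.2, -0.7]]);   b2 = np.array([0.0])
print(forward(np.array([0.5]), W1, b1, W2, b2))
```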
Backpropagation
The capabilities of multilayer NNs were known, but a practical learning algorithm was only introduced by Werbos (1974) and made famous by Rumelhart and McClelland (mid 1980s, the PDP book).
This started massive research in the area.



XOR problem
Learning Boolean functions: a 1/0 output can be seen as a 2-class classification problem.

XOR can be solved by a 1-hidden-layer network.

XOR problem
(Figure: a 2-2-1 network of threshold units; the third component of each weight vector is the bias.)
W¹₁ = [ 1  1  −0.5]
W¹₂ = [−1 −1   1.5]
W²  = [ 1  1  −1.5]
Notice how each hidden node implements a decision boundary (boundary 1 by node 1, boundary 2 by node 2) and the output node combines (ANDs) their results.
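
A quick sketch (assuming 0/1 inputs and hardlimiting, i.e. threshold, units) verifying that the weights above compute XOR:

```python
import numpy as np

hardlim = lambda n: (n >= 0).astype(float)  # threshold (hardlimiting) unit

# Weights from the slide; the last component of each vector is the bias.
W1 = np.array([[ 1.0,  1.0, -0.5],   # node 1: fires for x1 + x2 >= 0.5 (OR-like)
               [-1.0, -1.0,  1.5]])  # node 2: fires for x1 + x2 <= 1.5 (NAND-like)
W2 = np.array([ 1.0,  1.0, -1.5])    # output node: ANDs the two boundaries

for x in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    h = hardlim(W1 @ np.array([*x, 1.0]))   # hidden layer (append 1 for the bias)
    y = hardlim(W2 @ np.append(h, 1.0))     # output layer
    print(x, "->", int(y))                  # prints the XOR truth table
```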
Capabilities (hardlimiting nodes)
Single layer: hyperplane boundaries
1 hidden layer: can form any, possibly unbounded, convex region
2 hidden layers: arbitrarily complex decision regions

Capabilities: Decision Regions
From Lippmann's NN tutorial: when the hardlimiting nonlinearities are replaced with sigmoidal nonlinearities, similar behavior is observed, except that the decision boundaries are smooth curves instead of straight lines.

Capabilities (hardlimiting nodes)
2 hidden layers (see Lippmann, 1987):
The first hidden layer computes regions.
The second hidden layer computes an AND operation (one node for each hypercube; worst case, one per disconnected region), using about 1/3 the number of nodes in the first hidden layer.
The output layer computes an OR operation.

No more than 2 hidden layers are ever required.


Capabilities
Every bounded continuous function can be approximated arbitrarily accurately by 2 layers of weights (1 hidden layer) and sigmoidal units (Cybenko 1989, Hornik et al. 1989).
Discontinuities can be tolerated in theory for most real-life problems; functions without compact support can also be learned under some conditions.

All other functions can be learned by 2-hidden-layer networks (Cybenko 1988), based on Kolmogorov's Theorem.

Function Approximation
(Figures, self study: a discontinuity; a continuous function with a discontinuous first derivative.)
Classification Example (self study)
Elementary Decision Boundaries
First Subnetwork
First Boundary: $a^1_1 = \text{hardlim}([-1 \;\; 0]\,p + 0.5)$
Second Boundary: $a^1_2 = \text{hardlim}([0 \;\; -1]\,p + 0.75)$

Second Subnetwork
Third Boundary: $a^1_3 = \text{hardlim}([1 \;\; 0]\,p - 1.5)$
Fourth Boundary: $a^1_4 = \text{hardlim}([0 \;\; 1]\,p - 0.25)$

Total Network
$$W^1 = \begin{bmatrix} -1 & 0 \\ 0 & -1 \\ 1 & 0 \\ 0 & 1 \end{bmatrix}, \quad b^1 = \begin{bmatrix} 0.5 \\ 0.75 \\ -1.5 \\ -0.25 \end{bmatrix}, \quad W^2 = \begin{bmatrix} 1 & 1 & 0 & 0 \\ 0 & 0 & 1 & 1 \end{bmatrix}, \quad b^2 = \begin{bmatrix} -1.5 \\ -1.5 \end{bmatrix}, \quad W^3 = [1 \;\; 1], \quad b^3 = -0.5$$
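
A sketch checking the total network above. The extraction lost the minus signs, so the sign pattern here is an assumption, chosen so that layer 2 ANDs each pair of boundaries and layer 3 ORs the two subnetworks:

```python
import numpy as np

hardlim = lambda n: (n >= 0).astype(float)

W1 = np.array([[-1, 0], [0, -1], [1, 0], [0, 1]], dtype=float)
b1 = np.array([0.5, 0.75, -1.5, -0.25])
W2 = np.array([[1, 1, 0, 0], [0, 0, 1, 1]], dtype=float)  # AND of boundary pairs
b2 = np.array([-1.5, -1.5])
W3 = np.array([[1, 1]], dtype=float)                      # OR of the subnetworks
b3 = np.array([-0.5])

def classify(p):
    a1 = hardlim(W1 @ p + b1)     # four elementary boundaries
    a2 = hardlim(W2 @ a1 + b2)    # two convex regions (AND)
    return hardlim(W3 @ a2 + b3)  # union of the regions (OR)

print(classify(np.array([0.0, 0.0])))  # inside subnetwork 1's region -> [1.]
print(classify(np.array([2.0, 1.0])))  # inside subnetwork 2's region -> [1.]
print(classify(np.array([1.0, 0.0])))  # outside both -> [0.]
```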
Function Approximation & Network Capabilities
Function Approximation
Neural networks are intrinsically function approximators: we can train a NN to map real-valued vectors to real-valued vectors.

The function approximation capabilities of a simple network, as its parameters (weights and biases) vary, are illustrated on the next slides.

Function Approximation: Example
$$f^1(n) = \frac{1}{1 + e^{-n}}, \qquad f^2(n) = n$$
Nominal parameter values (layer number as superscript):
$$w^1_{1,1} = 10, \quad w^1_{2,1} = 10, \quad b^1_1 = -10, \quad b^1_2 = 10, \quad w^2_{1,1} = 1, \quad w^2_{1,2} = 1, \quad b^2 = 0$$
Nominal Response
(Figure: response of the 1-2-1 network with the nominal parameters, over $-2 \le p \le 2$.)
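
A short sketch (plotting omitted) that evaluates the nominal response, assuming the parameter values reconstructed above:

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

# Nominal parameters of the 1-2-1 network (sigmoid hidden, linear output)
w1 = np.array([10.0, 10.0])    # first-layer weights
b1 = np.array([-10.0, 10.0])   # first-layer biases
w2 = np.array([1.0, 1.0])      # second-layer weights
b2 = 0.0                       # output bias

p = np.linspace(-2, 2, 401)
a1 = sigmoid(np.outer(p, w1) + b1)  # hidden activations, shape (401, 2)
a2 = a1 @ w2 + b2                   # network output for each input
print(a2.min(), a2.max())           # response ranges from about 0 to about 2
```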
Parameter Variations
(Figures: network response over $-2 \le p \le 2$ as individual parameters are varied about their nominal values, e.g. $-1 \le w^2_{1,1} \le 1$, $-1 \le w^2_{1,2} \le 1$, $0 \le b^1_2 \le 20$, $-1 \le b^2 \le 1$.)
What would be the effect of varying the bias of the output neuron?
Performance Learning
Performance learning is a learning paradigm in which we adjust the network parameters (weights and biases) so as to optimize the performance of the network.

We need to define a performance index (e.g. mean square error), then search the parameter space to minimize the performance index with respect to the parameters.
Performance Index Example
Example: the performance index of a linear perceptron, defined as the mean square error over the input samples $x_p$, is
$$E(W) = \frac{1}{N} \sum_p (t_p - o_p)^2, \qquad o_p = W^T x_p$$
Performance surface
(Figure: performance surface $E(W)$ over two weights $w_1$, $w_2$.)
The idea is to find a minimum of the error function E in the space of weights.
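
As an illustrative sketch (the data and the "true" weights are made up), the performance surface can be evaluated on a grid of weight values:

```python
import numpy as np

# MSE of a linear unit o = w1*x1 + w2*x2 over a sample set,
# evaluated on a grid of (w1, w2) to visualize the performance surface.
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, (50, 2))
t = X @ np.array([1.5, -0.5])           # targets from a made-up linear map

w1, w2 = np.meshgrid(np.linspace(-2, 2, 81), np.linspace(-2, 2, 81))
E = np.zeros_like(w1)
for i in range(X.shape[0]):
    E += (t[i] - (w1 * X[i, 0] + w2 * X[i, 1])) ** 2
E /= X.shape[0]
print(E.min())                          # minimum near (w1, w2) = (1.5, -0.5)
```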
Performance Optimization
Iterative minimization techniques:
Define E(·) as the performance index.
Starting with an initial guess W(0), find W(n+1) at each iteration such that E(W(n+1)) < E(W(n)).
Basic Optimization Algorithm
Start with an initial guess $w_0$ and update the guess at each stage, moving along a search direction:
$$w_{k+1} = w_k + \alpha_k p_k \qquad \text{or} \qquad \Delta w_k = (w_{k+1} - w_k) = \alpha_k p_k$$
where $p_k$ is the search direction and $\alpha_k$ the learning rate.
Performance surface
The gradient of the performance surface is a vector (with the dimension of w) that:
points toward the direction of maximum change,
with a magnitude equal to the slope of the tangent of the performance surface.

A ball rolling down the hill will always roll in the direction opposite to the gradient arrow (steepest descent). The slope at the bottom is zero, so the gradient is also zero (that is the reason the ball stops there).
(Figure: performance surface with gradient arrows.)
Performance Optimization
Iterative minimization techniques: Steepest Descent
Successive adjustments to W are in the direction of steepest descent (the direction opposite to the gradient vector):
$$W(n+1) = W(n) - \eta\, g(n), \qquad \text{where } g(n) = \nabla E(W(n))$$
Steepest Descent: Matrix Form
Gradient:
$$\nabla F(x) = \begin{bmatrix} \dfrac{\partial F}{\partial x_1} \\ \vdots \\ \dfrac{\partial F}{\partial x_n} \end{bmatrix}$$
Hessian:
$$\nabla^2 F(x) = \begin{bmatrix} \dfrac{\partial^2 F}{\partial x_1^2} & \cdots & \dfrac{\partial^2 F}{\partial x_1 \partial x_n} \\ \vdots & \ddots & \vdots \\ \dfrac{\partial^2 F}{\partial x_n \partial x_1} & \cdots & \dfrac{\partial^2 F}{\partial x_n^2} \end{bmatrix}$$

Directional Derivatives
The $i$th element of the gradient is the first derivative (slope) of $F(x)$ along the $x_i$ axis: $\partial F(x) / \partial x_i$.

The $(i,i)$ element of the Hessian is the second derivative (curvature) of $F(x)$ along the $x_i$ axis: $\partial^2 F(x) / \partial x_i^2$.

What is the derivative of a function along an arbitrary direction?

Directional Derivatives
The first derivative of $F(x)$ along a vector $p$ is the projection of the gradient onto $p$:
$$\frac{p^T \nabla F(x)}{\lVert p \rVert}$$
Which direction has the greatest slope? The one in which the inner product of the direction vector and the gradient is maximum: when the direction vector is the same as the gradient.
Two simple error surfaces (for 2 weights)
(Figures: surface and contour plots of two error surfaces over $-2 \le x_1, x_2 \le 2$.)

Directional Derivatives
$$F(x) = x_1^2 + 2x_1 x_2 + 2x_2^2$$
Example
$$F(x) = x_1^2 + 2x_1 x_2 + 2x_2^2, \qquad x^* = \begin{bmatrix} 0.5 \\ 0 \end{bmatrix}, \qquad p = \begin{bmatrix} 1 \\ -1 \end{bmatrix}$$
$$\nabla F(x)\Big|_{x = x^*} = \begin{bmatrix} \partial F / \partial x_1 \\ \partial F / \partial x_2 \end{bmatrix}_{x = x^*} = \begin{bmatrix} 2x_1 + 2x_2 \\ 2x_1 + 4x_2 \end{bmatrix}_{x = x^*} = \begin{bmatrix} 1 \\ 1 \end{bmatrix}$$
$$\frac{p^T \nabla F(x)}{\lVert p \rVert} = \frac{\begin{bmatrix} 1 & -1 \end{bmatrix} \begin{bmatrix} 1 \\ 1 \end{bmatrix}}{\sqrt{2}} = \frac{0}{\sqrt{2}} = 0$$
So F has zero slope along the direction p at $x^*$.
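
A numerical check of this example:

```python
import numpy as np

# F(x) = x1^2 + 2*x1*x2 + 2*x2^2
grad = lambda x: np.array([2*x[0] + 2*x[1], 2*x[0] + 4*x[1]])

x_star = np.array([0.5, 0.0])
p = np.array([1.0, -1.0])

g = grad(x_star)                      # -> [1. 1.]
print(g, p @ g / np.linalg.norm(p))   # directional derivative -> 0.0
```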
Performance Optimization: Iterative Techniques Summary
Choose the next step so that the function decreases:
$$F(x_{k+1}) < F(x_k)$$
For small changes in x we can approximate F(x) using the Taylor series expansion:
$$F(x_{k+1}) = F(x_k + \Delta x_k) \approx F(x_k) + g_k^T \Delta x_k, \qquad \text{where } g_k \equiv \nabla F(x)\Big|_{x = x_k}$$
If we want the function to decrease, we must choose $p_k$ such that:
$$g_k^T \Delta x_k = \alpha_k\, g_k^T p_k < 0$$
We can maximize the decrease by choosing:
$$p_k = -g_k, \qquad \text{giving} \qquad x_{k+1} = x_k - \alpha_k g_k$$
Example
$$F(x) = x_1^2 + 2x_1 x_2 + 2x_2^2 + x_1, \qquad x_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix}$$
$$\nabla F(x) = \begin{bmatrix} \partial F / \partial x_1 \\ \partial F / \partial x_2 \end{bmatrix} = \begin{bmatrix} 2x_1 + 2x_2 + 1 \\ 2x_1 + 4x_2 \end{bmatrix}, \qquad g_0 = \nabla F(x)\Big|_{x = x_0} = \begin{bmatrix} 3 \\ 3 \end{bmatrix}$$
With $\alpha = 0.1$:
$$x_1 = x_0 - \alpha g_0 = \begin{bmatrix} 0.5 \\ 0.5 \end{bmatrix} - 0.1 \begin{bmatrix} 3 \\ 3 \end{bmatrix} = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix}$$
$$x_2 = x_1 - \alpha g_1 = \begin{bmatrix} 0.2 \\ 0.2 \end{bmatrix} - 0.1 \begin{bmatrix} 1.8 \\ 1.2 \end{bmatrix} = \begin{bmatrix} 0.02 \\ 0.08 \end{bmatrix}$$
(Figure: contour plot of F with the descent trajectory, over $-2 \le x_1, x_2 \le 2$.)
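
A few lines of NumPy reproduce these iterates:

```python
import numpy as np

# F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1
grad = lambda x: np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x = np.array([0.5, 0.5])
alpha = 0.1
for k in range(3):
    print("x_%d =" % k, x)
    x = x - alpha * grad(x)   # steepest descent step: x_{k+1} = x_k - alpha * g_k
# prints x_0 = [0.5 0.5], x_1 = [0.2 0.2], x_2 = [0.02 0.08]
```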
Steepest Descent
Show that steepest descent satisfies the condition for iterative descent: E(W(n+1)) < E(W(n)).

Using the Taylor series expansion:
$$E(W(n+1)) \approx E(W(n)) - \eta\, g^T(n)\, g(n) = E(W(n)) - \eta \lVert g(n) \rVert^2$$
The result follows since $\eta \lVert g(n) \rVert^2 > 0$ for $\eta > 0$.
Minima and Maxima
(skip and go to next section)

Strong (Local) Minimum: the point $x^*$ is a strong minimum of F(x) if a scalar $\delta > 0$ exists such that $F(x^*) < F(x^* + \Delta x)$ for all $\Delta x$ with $\delta > \lVert \Delta x \rVert > 0$.
Global Minimum: the point $x^*$ is a unique global minimum of F(x) if $F(x^*) < F(x^* + \Delta x)$ for all $\Delta x \ne 0$.
Weak Minimum: the point $x^*$ is a weak minimum of F(x) if it is not a strong minimum, and a scalar $\delta > 0$ exists such that $F(x^*) \le F(x^* + \Delta x)$ for all $\Delta x$ with $\delta > \lVert \Delta x \rVert > 0$.
Scalar Example
$$F(x) = 3x^4 - 7x^2 - \tfrac{1}{2}x + 6$$
(Figure: plot of F(x) over $-2 \le x \le 2$.)
What type of minima are these? The plot shows a strong minimum, a strong maximum, and the global minimum.
Vector Example
$$F(x) = (x_2 - x_1)^4 + 8x_1 x_2 - x_1 + x_2 + 3$$
(Figures: surface and contour plots over $-2 \le x_1, x_2 \le 2$.)
$$F(x) = (x_1^2 - 1.5x_1 x_2 + 2x_2^2)\, x_1^2$$
(Figures: surface and contour plots over $-2 \le x_1, x_2 \le 2$.)
Optimality Conditions
What conditions need to be satisfied at a minimum?
Show, using the Taylor series, that the necessary condition for a minimum point (strong or weak) is:
$$\nabla F(x)\Big|_{x = x^*} = 0$$
Gradient Descent
Delta Rule for Adaline (linear activation)
Backpropagation for MLP
(Figure: a single linear unit with output o and weights $w_0, \dots, w_n$.)
Gradient Descent: another slide explaining the same thing
Gradient: $\nabla E[w] = [\partial E / \partial w_0, \dots, \partial E / \partial w_n]$
Training rule: $\Delta w = -\eta\, \nabla E[w]$, moving from $(w_1, w_2)$ to $(w_1 + \Delta w_1,\, w_2 + \Delta w_2)$.
$$\frac{\partial E}{\partial w_i} = \frac{\partial}{\partial w_i} \sum_p (t_p - o_p)^2 = \sum_p 2(t_p - o_p)\, \frac{\partial}{\partial w_i}(t_p - o_p) = \sum_p 2(t_p - o_p)\, \frac{\partial}{\partial w_i}\Big(t_p - \sum_j w_j x_{j,p}\Big) = -\sum_p 2(t_p - o_p)\, x_{i,p}$$
Stochastic Approximation to Steepest Descent
Instead of updating the weights only after all examples have been observed, we update on every example:
$$\Delta w_i = \eta\, (t - o)\, x_i \qquad \text{(not summing over all the patterns!)}$$
In this case we update the weights incrementally.

Remarks:
When there are multiple local minima, stochastic gradient descent may avoid the problem of getting stuck in a local minimum.
Standard gradient descent needs more computation per step but can be used with a larger step size.
Learning algorithm using the Delta Rule
1. Assign random values to the weight vector.
2. Repeat until the stopping condition is met (the error is small):
   a) Initialize each $\Delta w_i$ to zero.
   b) For each example p, accumulate the update: $\Delta w_i \mathrel{+}= (t_p - o_p)\, x_i$
   c) Update the weights: $w_i = w_i + \eta\, \Delta w_i$
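
A minimal sketch of this algorithm, assuming a linear unit and a small made-up regression task:

```python
import numpy as np

def delta_rule(X, t, eta=0.05, epochs=200):
    """Batch delta rule for a linear unit o = w . x (last input is a bias of 1)."""
    rng = np.random.default_rng(0)
    w = rng.uniform(-0.5, 0.5, X.shape[1])   # 1. random initial weights
    for _ in range(epochs):                  # 2. until the stopping condition
        dw = np.zeros_like(w)                #    a) zero the accumulated update
        for x_p, t_p in zip(X, t):           #    b) accumulate over all patterns
            dw += (t_p - w @ x_p) * x_p
        w += eta * dw                        #    c) apply the update
    return w

# Hypothetical usage: learn t = 2x - 1 from noiseless samples
X = np.column_stack([np.linspace(-1, 1, 20), np.ones(20)])   # [x, bias]
t = 2 * X[:, 0] - 1
print(delta_rule(X, t))   # approaches [2, -1]
```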
Difficulties with Gradient Descent
There are two main difficulties with the gradient descent
method:

1. Convergence to a minimum may take a long time.

2. There is no guarantee we will find the global minimum.



Backpropagation Algorithm
General Activation Function
Chain Rule
$$\frac{d f(n(w))}{dw} = \frac{d f(n)}{dn} \cdot \frac{d n(w)}{dw}$$
Example:
$$f(n) = \cos(n), \qquad n = e^{2w}, \qquad f(n(w)) = \cos(e^{2w})$$
$$\frac{d f(n(w))}{dw} = \frac{d f(n)}{dn} \cdot \frac{d n(w)}{dw} = (-\sin(n))\,(2e^{2w}) = -2e^{2w} \sin(e^{2w})$$
Application to Gradient Calculation
$$\frac{\partial \hat{F}}{\partial w^m_{i,j}} = \frac{\partial \hat{F}}{\partial n^m_i} \cdot \frac{\partial n^m_i}{\partial w^m_{i,j}}, \qquad \frac{\partial \hat{F}}{\partial b^m_i} = \frac{\partial \hat{F}}{\partial n^m_i} \cdot \frac{\partial n^m_i}{\partial b^m_i}$$
Transfer Function Derivatives
Sigmoid:
$$\frac{d}{dn}\left(\frac{1}{1 + e^{-n}}\right) = \frac{e^{-n}}{(1 + e^{-n})^2} = \left(1 - \frac{1}{1 + e^{-n}}\right) \left(\frac{1}{1 + e^{-n}}\right) = (1 - a)\, a$$
Linear:
$$\frac{d}{dn}(n) = 1$$
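
A quick numerical check of the sigmoid derivative identity $f'(n) = (1 - a)\,a$:

```python
import numpy as np

sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))

n = 0.7
a = sigmoid(n)
analytic = (1 - a) * a                                     # f'(n) = (1 - a) a
numeric = (sigmoid(n + 1e-6) - sigmoid(n - 1e-6)) / 2e-6   # central difference
print(analytic, numeric)                                   # agree to ~1e-10
```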
Backpropagation
To calculate the partial derivative of $E_p$ (the error on pattern p) w.r.t. a given weight $w_{ji}$, we have to consider whether this is the weight of an output or a hidden node.

If $w_{ji}$ is an output node weight:
$$\frac{dE_p}{dw_{ji}} = \frac{dE}{do_j}\, \frac{do_j}{dnet_j}\, \frac{dnet_j}{dw_{ji}} = -(t_j - o_j)\, f'(net_j)\, o_i$$
where $net_j = \sum_i w_{ji}\, o_i$, $o_j = f(net_j)$, and $E_p = (t_p - o_p)^2$ (the factor of 2 is absorbed into the learning rate).
Note that $o_i$ is the input to node j.
Backpropagation
If $w_{ji}$ is a hidden node weight:
$$\frac{dE_p}{dw_{ji}} = \frac{dE}{do_j}\, \frac{do_j}{dnet_j}\, \frac{dnet_j}{dw_{ji}} = \frac{dE}{do_j}\, f'(net_j)\, o_i$$
again with $net_j = \sum_i w_{ji}\, o_i$ and $o_j = f(net_j)$.

Note that as j is a hidden node, we do not know its target. Hence $dE/do_j$ can only be calculated through j's contribution, via the weights $w_{kj}$, to the derivative of E w.r.t. $net_k$ at the output nodes:
$$\frac{dE}{do_j} = \sum_k \frac{dE}{dnet_k}\, w_{kj}$$
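
Putting the two cases together, here is a minimal per-pattern backpropagation sketch for a 2-2-1 sigmoid network trained on XOR. This is an illustration, not the slides' code; as above, the factor of 2 from $E_p = (t_p - o_p)^2$ is absorbed into the learning rate.

```python
import numpy as np

def sigmoid(n):
    return 1.0 / (1.0 + np.exp(-n))

def forward(x, W1, W2):
    h = sigmoid(W1 @ np.append(x, 1.0))   # hidden activations (bias appended)
    o = sigmoid(W2 @ np.append(h, 1.0))   # output activation
    return h, o

rng = np.random.default_rng(0)
W1 = rng.uniform(-1, 1, (2, 3))           # hidden weights, last column = bias
W2 = rng.uniform(-1, 1, 3)                # output weights, last entry = bias
X = np.array([[0., 0.], [0., 1.], [1., 0.], [1., 1.]])
T = np.array([0., 1., 1., 0.])
eta = 0.5

for _ in range(20000):                    # per-pattern (stochastic) updates
    for x, t in zip(X, T):
        h, o = forward(x, W1, W2)
        delta_o = (t - o) * o * (1 - o)           # output node: (t - o) f'(net)
        delta_h = delta_o * W2[:2] * h * (1 - h)  # hidden: sum_k delta_k w_kj f'(net_j)
        W2 += eta * delta_o * np.append(h, 1.0)
        W1 += eta * np.outer(delta_h, np.append(x, 1.0))

# Typically converges to outputs near [0, 1, 1, 0]; with only two hidden
# units, some random initializations can stall in a local minimum.
print([round(float(forward(x, W1, W2)[1]), 2) for x in X])
```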
Backpropagation Algorithm: Matrix Format
(skip; go to next section)
Network Complexity
Choice of Architecture
$$g(p) = 1 + \sin\!\left(\frac{i\pi}{4}\, p\right)$$
(Figures: responses of a 1-3-1 network (1 input, 3 hidden, 1 output nodes) for i = 1, 2, 4, 8, over $-2 \le p \le 2$.)
Choice of Network Architecture
$$g(p) = 1 + \sin\!\left(\frac{6\pi}{4}\, p\right)$$
(Figures: approximations by 1-2-1, 1-3-1, 1-4-1, and 1-5-1 networks, over $-2 \le p \le 2$.)
The residual error decreases as O(1/M), where M is the number of hidden units.
Convergence in Time
$$g(p) = 1 + \sin(\pi p)$$
(Figures: network response at successive stages 0-5 of training, over $-2 \le p \le 2$.)
Generalization
Training set: $\{p_1, t_1\}, \{p_2, t_2\}, \dots, \{p_Q, t_Q\}$
$$g(p) = 1 + \sin\!\left(\frac{\pi}{4}\, p\right), \qquad p = -2, -1.6, -1.2, \dots, 1.6, 2$$
(Figures: responses of 1-2-1 and 1-9-1 networks trained on these samples, over $-2 \le p \le 2$.)
Next: Issues and Variations on Backpropagation
