
ULC665: Deep Learning

Background: Artificial Neural Networks

Dr. Ashish Gupta


Agenda

■ Introduction to ANN.
■ Hebb’s Learning
■ Performance Index analysis
■ Gradient Descent / LMS / Delta rule
■ Back-propagation algorithm



Introduction

■ An artificial neural network (ANN) may be defined as an information-processing model that is inspired by the way biological nervous systems, such as the brain, process information.

Figure: Schematic Drawing of Biological Neurons



Biological Inspiration

■ The brain consists of a large number (approx. 10^11) of highly connected elements (about 10^4 connections per element) called neurons.
■ These neurons have three principal components:
• The dendrites are tree-like receptive networks of nerve fibers that
carry electrical signals into the cell body.
• The cell body effectively sums and thresholds these incoming signals.
• The axon is a single long fiber that carries the signal from the cell
body out to other neurons.
■ Synapse: The point of contact between an axon of one cell and a
dendrite of another cell is called a synapse.
■ It is the arrangement of neurons and the strengths of the individual
synapses, determined by a complex chemical process, that establishes
the function of the neural network.



Neuron Model: Single Input Neuron

Figure: Single-Input Neuron

■ Relation to the biological neuron
• The strength of a synapse → The weight w
• The cell body → the summation and the transfer function,
• The neuron output → the signal on the axon.
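
As a minimal sketch (not part of the original slides), the single-input neuron computes a = f(wp + b); the weight, bias, and choice of transfer functions below are illustrative assumptions:

    import numpy as np

    def hardlim(n):            # hard limit: 1 if n >= 0, else 0
        return np.where(n >= 0, 1.0, 0.0)

    def purelin(n):            # linear transfer function
        return n

    def logsig(n):             # log-sigmoid transfer function
        return 1.0 / (1.0 + np.exp(-n))

    def single_input_neuron(p, w, b, f):
        """Single-input neuron: net input n = w*p + b, output a = f(n)."""
        return f(w * p + b)

    # Example with hypothetical values: w = 3, b = -1.5, input p = 1
    for f in (hardlim, purelin, logsig):
        print(f.__name__, single_input_neuron(1.0, 3.0, -1.5, f))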
Transfer Functions (ANN)



Multiple-Input Neuron



Network Architectures: A Layer of Neurons



Multiple layer Network



McCulloch-Pitts Neuron
■ Proposed by McCulloch and Pitts in 1943.
■ The M-P neurons are connected by directed weighted paths.
■ The activation of a M-P neuron is binary.
■ The weights associated with the communication links may be excitatory
(weight is positive) or inhibitory (weight is negative).
■ The threshold (θ) is fixed; the M-P neuron has no particular training algorithm (the weights and threshold are chosen by analysis).



AND function using M-P model

Implement the AND function using a McCulloch-Pitts neuron (take binary data).
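
One valid choice (an illustrative sketch, not necessarily the slide's worked solution) is excitatory weights w1 = w2 = 1 with fixed threshold θ = 2, so the neuron fires only when both binary inputs are 1:

    def mcculloch_pitts(inputs, weights, theta):
        """M-P neuron: fire (output 1) if the weighted sum reaches the threshold."""
        net = sum(w * x for w, x in zip(weights, inputs))
        return 1 if net >= theta else 0

    # AND with binary inputs: w1 = w2 = 1, theta = 2 (one possible choice)
    for x1 in (0, 1):
        for x2 in (0, 1):
            print(x1, x2, mcculloch_pitts((x1, x2), (1, 1), theta=2))
    # Only the input (1, 1) fires, as required for AND.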



ANDNOT function



Decision boundary



Implementation of OR function



Hebb’s learning rule
■ Hebb’s Learning
• “When an axon of cell A is near enough to excite a cell B and repeatedly or persistently takes part in firing it, some growth process or metabolic change takes place in one or both cells such that A’s efficiency, as one of the cells firing B, is increased.”
• First, let’s rephrase the postulate: If two neurons on either side of
a synapse are activated simultaneously, the strength of the synapse
will increase.
■ Associative memory: The task of an associative memory is to learn
pairs of prototype input/output vectors:
{p1 , t1 }, {p2 , t2 }, ..., {pQ , tQ }
• If the network receives an input p = pq then it should produce an
output a = tq , for q = 1, 2..., Q .
• In addition, if the input is changed slightly (i.e., p = pq + δ) then
the output should only be changed slightly (i.e., a = tq + ε).
Supervised Hebb’s learning
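
The slide's derivation is shown graphically; as a sketch, assuming the standard supervised Hebb rule for the linear associator, W = Σ_q t_q p_qᵀ = T Pᵀ:

    import numpy as np

    def hebb_rule(P, T):
        """Supervised Hebb rule for a linear associator: W = T @ P.T,
        where the columns of P are prototype inputs and the columns of T
        are the corresponding targets."""
        return T @ P.T

    # Hypothetical orthonormal prototype patterns (columns) and targets
    P = np.array([[1.0, 0.0],
                  [0.0, 1.0]])
    T = np.array([[1.0, -1.0]])     # desired outputs t1 = 1, t2 = -1

    W = hebb_rule(P, T)
    print(W @ P)                    # recovers T exactly when the p_q are orthonormal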



Using Hebb’s learning rule





Hebb’s learning: Performance evaluation

When the prototype input patterns are not orthogonal, the Hebb rule produces some errors. There are several procedures that can be used to reduce these errors.
Pseudoinverse Rule
■ The task of the linear associator was to produce an output of t_q for an input of p_q. In other words,
W p_q = t_q,  q = 1, 2, ..., Q
or, in matrix form, W P = T, where T = [t1 t2 ... tQ] and P = [p1 p2 ... pQ].
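
When the prototype patterns are not orthogonal, the pseudoinverse rule W = T P⁺ (P⁺ being the Moore-Penrose pseudoinverse) minimizes ‖T − WP‖²; a sketch with hypothetical patterns:

    import numpy as np

    def pseudoinverse_rule(P, T):
        """Pseudoinverse rule: W = T @ pinv(P) minimizes ||T - W P||^2."""
        return T @ np.linalg.pinv(P)

    # Hypothetical non-orthogonal prototype inputs (columns) and targets
    P = np.array([[ 1.0,  1.0],
                  [-1.0,  1.0],
                  [-1.0, -1.0]])
    T = np.array([[-1.0, 1.0]])

    W = pseudoinverse_rule(P, T)
    print(np.round(W @ P, 3))   # reproduces T (the prototypes here are linearly independent)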





Perceptron Architecture



Implement AND function using the perceptron learning rule (initial weights zero)
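
A minimal sketch of this exercise, assuming the standard perceptron rule (hard-limit output, update w ← w + e·p, b ← b + e with e = t − a) on binary AND data; the epoch limit is an illustrative choice:

    import numpy as np

    def hardlim(n):
        return 1 if n >= 0 else 0

    # AND training set (binary inputs and targets)
    patterns = [np.array([0, 0]), np.array([0, 1]), np.array([1, 0]), np.array([1, 1])]
    targets  = [0, 0, 0, 1]

    w = np.zeros(2)          # initial weights zero
    b = 0.0                  # initial bias zero

    for epoch in range(10):                      # a few passes suffice for AND
        errors = 0
        for p, t in zip(patterns, targets):
            a = hardlim(w @ p + b)
            e = t - a                            # perceptron rule: w <- w + e*p, b <- b + e
            if e != 0:
                w = w + e * p
                b = b + e
                errors += 1
        if errors == 0:
            break

    print("weights:", w, "bias:", b)
    for p, t in zip(patterns, targets):
        print(p, "->", hardlim(w @ p + b), "(target", t, ")")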



Performance learning

■ Performance learning is another important class of learning law, in which the network parameters are adjusted to optimize the performance of the network.
■ There are two steps involved in this optimization process.
• Define “performance”: Find a quantitative measure of network
performance, called the performance index, which is small when
the network performs well and large when the network performs
poorly.
• To search the parameter space (adjust the network weights and
biases) in order to reduce the performance index.
Investigate the characteristics of performance surfaces and set some conditions that will guarantee that a surface
does have a minimum point (the optimum we are searching for).



Background: Optimizing performance index

■ The performance index that we want to minimize is represented by F(x), where x is the scalar parameter we are adjusting.
■ Assumption: the performance index is an analytic function, so that all of its derivatives exist.



Vector case
■ The PI will be a function of all of the network parameters (weights and biases), of which there may be a very large number. Therefore, we need to extend the Taylor series expansion to functions of many variables.
■ Consider a function F(x) of n variables, x = [x1, x2, ..., xn]ᵀ.





Directional Derivatives

■ What if we want to know the derivative of the function in an arbitrary direction?
■ We let p be a vector in the direction along which we wish to know the
derivative.
■ This directional derivative can be computed from (∇F(x)ᵀ p) / ‖p‖.
• Example: consider the function F(x) = x1² + 2x2². Find the derivative of the function at the point x∗ = [0.5, 0.5]ᵀ in the direction p = [2, −1]ᵀ.
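
A quick numerical check of this example (a sketch, not the slide's worked solution): ∇F(x) = [2x1, 4x2]ᵀ = [1, 2]ᵀ at x∗, so pᵀ∇F = 2·1 + (−1)·2 = 0 and the directional derivative is zero, i.e., p is orthogonal to the gradient.

    import numpy as np

    grad = lambda x: np.array([2 * x[0], 4 * x[1]])   # gradient of F(x) = x1^2 + 2 x2^2

    x_star = np.array([0.5, 0.5])
    p = np.array([2.0, -1.0])

    directional_derivative = p @ grad(x_star) / np.linalg.norm(p)
    print(directional_derivative)    # 0.0 -> zero slope in this direction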



Directional derivative: Contd...

■ Any direction that is orthogonal to the gradient will have zero slope.
■ Which direction has the greatest slope?
• The maximum slope will occur when the inner product of the
direction vector and the gradient is a maximum.
• This happens when the direction vector is the same as the gradient.



Minima

■ Assume that the optimum point is a minimum of the performance index.



Minima: Examples



Necessary condition for Optimality
■ First-order necessary (but not sufficient) condition for X∗ to be a local minimum point: the gradient must be zero at the minimum point, i.e., ∇F(x)|_{x = X∗} = 0.

Points satisfying this condition are called stationary points.

■ Second-order conditions: Assume X∗ is a stationary point. Using the first-order condition, the Taylor series expansion about X∗ becomes
F(X∗ + ΔX) ≈ F(X∗) + (1/2) ΔXᵀ ∇²F(X∗) ΔX
• With respect to all points in a small neighborhood of X∗, a strong minimum will exist at X∗ if ΔXᵀ ∇²F(X∗) ΔX > 0.
• For this to be true for arbitrary ΔX ≠ 0, the Hessian matrix must be positive definite.
■ A positive definite Hessian matrix is a second-order, sufficient condition
for a strong minimum to exist. It is not a necessary condition.
■ A matrix A is positive definite if zᵀAz > 0 for any vector z ≠ 0.
• If all eigenvalues are positive, then the matrix is positive definite.
■ It is positive semidefinite if zᵀAz ≥ 0 for any vector z.
• If all eigenvalues are nonnegative, then the matrix is positive
semidefinite.
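
A small helper (not from the slides) that classifies a symmetric matrix by its eigenvalues, as described above; the example matrices are hypothetical:

    import numpy as np

    def classify_symmetric(A, tol=1e-10):
        """Classify a symmetric matrix by the signs of its eigenvalues."""
        eig = np.linalg.eigvalsh(A)          # eigenvalues of a symmetric matrix
        if np.all(eig > tol):
            return "positive definite"
        if np.all(eig >= -tol):
            return "positive semidefinite"
        return "indefinite or negative (semi)definite"

    # Hypothetical Hessians
    print(classify_symmetric(np.array([[2.0, 0.0], [0.0, 4.0]])))    # positive definite
    print(classify_symmetric(np.array([[1.0, -1.0], [-1.0, 1.0]])))  # positive semidefinite (eigenvalues 0, 2)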
■ Example: the matrix in this example is positive semidefinite, which is a necessary condition for X∗ to be a strong minimum point.



Quadratic Functions

■ The quadratic function is one of the most important and widely used types of performance index (PI).

■ Objective: To investigate the characteristics of the quadratic function.
• the quadratic function appears in many applications
• many functions can be approximated by quadratic functions in
small neighborhoods, especially near local minimum points.
Gradient and Hessian of Quadratic Function
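
The derivation on this slide is graphical; assuming the general quadratic form F(x) = (1/2) xᵀAx + dᵀx + c, the gradient is ∇F(x) = Ax + d and the Hessian is ∇²F(x) = A. A numerical sanity check with hypothetical A, d, c:

    import numpy as np

    A = np.array([[2.0, 1.0],
                  [1.0, 2.0]])     # symmetric Hessian (hypothetical)
    d = np.array([1.0, -1.0])
    c = 0.5

    F    = lambda x: 0.5 * x @ A @ x + d @ x + c
    grad = lambda x: A @ x + d     # analytic gradient of the quadratic

    x = np.array([0.3, -0.7])
    eps = 1e-6
    numeric = np.array([(F(x + eps * e) - F(x - eps * e)) / (2 * eps) for e in np.eye(2)])
    print(grad(x), numeric)        # the analytic and finite-difference gradients agree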



Eigensystem of the Hessian

■ Objective: To investigate the general shape of the quadratic function.


■ Consider a quadratic function that has a stationary point at the origin and whose value there is zero: F(x) = (1/2) xᵀAx.

Figure: Elliptical Hollow



Contd ...

■ The second derivative of a quadratic function F(x) is given by the symmetric (Hessian) matrix A.
■ Remember: a symmetric A has real eigenvalues, eigenvectors corresponding to distinct eigenvalues that are orthogonal, and the matrix is always diagonalizable.
■ The second derivative in the direction of a vector p corresponds to the curvature of the function in that direction: pᵀAp / ‖p‖².
■ Maximum curvature represents the direction of fastest change and minimum curvature represents the direction of slowest change.



■ Important results
• The maximum second derivative occurs in the direction of the
eigenvector that corresponds to the largest eigenvalue.
• In fact, in each of the eigenvector directions the second derivative equals the corresponding eigenvalue: the eigenvalues are the second derivatives in the directions of the eigenvectors.
■ The following figure illustrates the case where λ1 < λ2:
• The minimum curvature will occur in the direction of Z1 . We will
cross contour lines more slowly in this direction.
• The maximum curvature will occur in the direction Z2 , therefore
we will cross contour lines more quickly in that direction.
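
As a sketch with a hypothetical Hessian, the eigendecomposition below identifies the directions of minimum and maximum curvature; the curvature along each eigenvector equals the corresponding eigenvalue:

    import numpy as np

    A = np.array([[3.0, 1.0],
                  [1.0, 3.0]])                     # hypothetical symmetric Hessian

    eigvals, eigvecs = np.linalg.eigh(A)           # eigenvalues returned in ascending order
    curvature = lambda p: p @ A @ p / (p @ p)      # second derivative in direction p

    print("eigenvalues:", eigvals)                 # [2., 4.]
    print("min-curvature direction:", eigvecs[:, 0], "curvature:", curvature(eigvecs[:, 0]))
    print("max-curvature direction:", eigvecs[:, 1], "curvature:", curvature(eigvecs[:, 1]))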



Example-1

Figure: Since all the eigenvalues are equal, the curvature should be the same in
all directions, and therefore the function should have circular contours.
Example-2

Figure: In this case the maximum curvature is in the direction of the eigenvector associated with the larger eigenvalue, so we should cross contour lines more quickly in that direction.



Example-3



Example-4

Figure: The Hessian matrix is positive semidefinite, and we have a weak minimum along the line x1 = x2.
Summary

■ Assumption (for simplicity): the stationary point of the quadratic function is at the origin and its value there is zero.
■ If c is nonzero, the function is simply increased in magnitude by c at every point; the shape of the contours does not change.
■ When d is nonzero and A is invertible, the shape of the contours is not changed, but the stationary point of the function moves to x∗ = −A⁻¹d.
■ If A is not invertible (has some zero eigenvalues) and d is nonzero, then stationary points may not exist.
Performance learning (Contd...)

■ Recall: in performance learning the network parameters are adjusted to optimize a quantitative performance index, by searching the parameter space (adjusting the weights and biases) so as to reduce that index.
■ Next: develop algorithms to optimize a performance index.


Steepest Descent

■ Objective: To develop algorithms to optimize a performance index F(x).


■ For our purposes the word “optimize” will mean to find the value of x
that minimizes F (x).
■ All of the optimization algorithms we will discuss are iterative.
■ We begin from some initial guess x_0, and then update our guess in stages according to an equation of the form x_{k+1} = x_k + α_k p_k,
■ where the vector p_k represents a search direction, and the positive scalar α_k is the learning rate, which determines the length of the step.



Steepest Descent / Gradient descent Derivation

Figure: Gradient Descent
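
A minimal steepest-descent sketch, assuming the update x_{k+1} = x_k − α∇F(x_k) with a fixed learning rate, applied to the quadratic F(x) = x1² + 2x2² used earlier; the learning rate and iteration count are illustrative:

    import numpy as np

    grad = lambda x: np.array([2 * x[0], 4 * x[1]])   # gradient of F(x) = x1^2 + 2 x2^2

    x = np.array([0.5, 0.5])     # initial guess
    alpha = 0.1                  # learning rate (step size)

    for k in range(50):
        x = x - alpha * grad(x)  # steepest descent: step opposite the gradient

    print(x)                     # converges toward the minimum at the origin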


Steepest Descent: Learning rate hyperparameter
■ The learning rate hyperparameter controls the size of the steps.
■ If it is large, we take large steps and would expect to converge faster.
• However, we might jump across the valley and end up on the other side, possibly even higher up than before.
• Unstable: this can make the algorithm diverge, with larger and larger values, failing to find a good solution.
■ If the learning rate is too small, the algorithm will have to go through many iterations to converge, which will take a long time.
■ Is there some way to predict the maximum allowable learning rate? Only for quadratic functions.
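
For a quadratic performance index with Hessian A, steepest descent with a fixed learning rate is stable only if α < 2/λ_max, where λ_max is the largest eigenvalue of A. A quick sketch with a hypothetical Hessian:

    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 4.0]])                 # hypothetical Hessian of a quadratic PI

    lambda_max = np.max(np.linalg.eigvalsh(A))
    alpha_max = 2.0 / lambda_max
    print("maximum stable learning rate:", alpha_max)   # 0.5 for this A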

Optimum Learning Rate
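
For a quadratic function we can also minimize F exactly along the search direction at each step; assuming the standard result α_k = −(g_kᵀp_k)/(p_kᵀA p_k), which for steepest descent (p_k = −g_k) becomes α_k = (g_kᵀg_k)/(g_kᵀA g_k). A sketch with a hypothetical A:

    import numpy as np

    A = np.array([[2.0, 0.0],
                  [0.0, 4.0]])                      # hypothetical quadratic: F = 0.5 x^T A x
    grad = lambda x: A @ x

    x = np.array([0.5, 0.5])
    for k in range(10):
        g = grad(x)
        alpha = (g @ g) / (g @ A @ g)               # optimal step length along -g
        x = x - alpha * g

    print(x)                                        # close to the minimum at the origin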



ADALINE (ADAptive LInear NEuron) Network

■ The ADALINE network is very similar to the perceptron, except that its transfer function is linear, instead of hard-limiting.
■ Both the ADALINE and the perceptron suffer from the same inherent
limitation: they can only solve linearly separable problems.
■ The LMS algorithm, however, is more powerful than the perceptron
learning rule.
• While the perceptron rule is guaranteed to converge to a solution that correctly categorizes the training patterns, the resulting network can be sensitive to noise, since patterns often lie close to the decision boundaries.
• The LMS algorithm minimizes mean square error, and therefore
tries to move the decision boundaries as far from the training
patterns as possible.

LMS / Delta learning rule

■ Direct solution: solve the normal equations directly, which requires computing the inverse of the input correlation matrix R.
■ If we do not want to calculate the inverse of R, we can use the steepest descent algorithm, approximating the gradient at each step from the current error; this gives the LMS (delta) rule.
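
A minimal sketch of the LMS (delta) rule for an ADALINE, assuming the standard update W ← W + 2α e pᵀ, b ← b + 2α e with e = t − (Wp + b); the training pairs and learning rate below are hypothetical:

    import numpy as np

    # Hypothetical training pairs (inputs p_q and scalar targets t_q)
    patterns = [np.array([1.0, 1.0]), np.array([1.0, -1.0]),
                np.array([-1.0, 1.0]), np.array([-1.0, -1.0])]
    targets  = [1.0, -1.0, 1.0, -1.0]

    w = np.zeros(2)
    b = 0.0
    alpha = 0.05                                  # learning rate

    for epoch in range(100):
        for p, t in zip(patterns, targets):
            a = w @ p + b                         # ADALINE output: linear transfer function
            e = t - a
            w = w + 2 * alpha * e * p             # LMS / delta rule
            b = b + 2 * alpha * e

    print("w:", np.round(w, 3), "b:", round(b, 3))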



Limitations of Single-layer neural networks

■ Single layer neural networks can only solve problems that are linearly
separable.
■ Examples of linearly inseparable problems: e.g., the XOR function.
■ The solution is multi-layer perceptrons / deep feedforward networks, which
• consist of one or more hidden layers of weights,
• allow each layer to have its own (non-linear) transfer function, and
• are trained by the backpropagation algorithm, a generalization of the LMS algorithm.
XOR Problem

■ The XOR function outputs 1 when exactly one of its two binary inputs is 1: (0,0)→0, (0,1)→1, (1,0)→1, (1,1)→0.



Single-layer network

Figure: Linear model is not able to represent the XOR function



Shifting to Multi-layer network

Figure: Multilayer perceptron with one hidden layer



Multi-layer network: Linear activation function? No

■ If every layer used a linear transfer function, the whole network would collapse to a single linear mapping (W2(W1p + b1) + b2 = Wp + b), so it could still represent only linear decision boundaries; the hidden layers must therefore use non-linear transfer functions.



MLP: A Solution to XOR problem
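
One well-known hand-constructed solution (following the ReLU example in Goodfellow, Bengio, and Courville's Deep Learning book; the slide's solution may differ) uses a single hidden layer h = max(0, Wᵀx + c) and a linear output y = wᵀh:

    import numpy as np

    W = np.array([[1.0, 1.0],
                  [1.0, 1.0]])      # hidden-layer weights
    c = np.array([0.0, -1.0])       # hidden-layer biases
    w = np.array([1.0, -2.0])       # output-layer weights

    def mlp_xor(x):
        h = np.maximum(0.0, W.T @ x + c)   # ReLU hidden layer
        return w @ h                       # linear output layer

    for x in ([0, 0], [0, 1], [1, 0], [1, 1]):
        print(x, "->", mlp_xor(np.array(x, dtype=float)))
    # Outputs 0, 1, 1, 0: the hidden layer makes XOR linearly separable.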



Backpropagation Algorithm
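
The algorithm itself appears on the slides as figures; below is a minimal numpy sketch assuming the standard two-layer formulation (sigmoid hidden and output units, mean-squared-error loss, batch gradient descent) trained on XOR; the architecture, learning rate, and epoch count are illustrative:

    import numpy as np

    rng = np.random.default_rng(0)
    sigmoid = lambda n: 1.0 / (1.0 + np.exp(-n))

    # XOR training set: columns of P are the input patterns, T the targets
    P = np.array([[0, 0, 1, 1],
                  [0, 1, 0, 1]], dtype=float)
    T = np.array([[0, 1, 1, 0]], dtype=float)

    # Two-layer network: 2 inputs -> 4 sigmoid hidden units -> 1 sigmoid output
    W1 = rng.normal(size=(4, 2)); b1 = np.zeros((4, 1))
    W2 = rng.normal(size=(1, 4)); b2 = np.zeros((1, 1))
    alpha = 0.5                                    # learning rate

    for epoch in range(20000):
        # Forward pass
        A1 = sigmoid(W1 @ P + b1)
        A2 = sigmoid(W2 @ A1 + b2)

        # Backward pass: propagate the error sensitivities layer by layer
        E  = A2 - T
        d2 = E * A2 * (1 - A2)                     # output-layer sensitivity
        d1 = (W2.T @ d2) * A1 * (1 - A1)           # hidden-layer sensitivity

        # Gradient-descent weight and bias updates
        W2 -= alpha * d2 @ A1.T; b2 -= alpha * d2.sum(axis=1, keepdims=True)
        W1 -= alpha * d1 @ P.T;  b1 -= alpha * d1.sum(axis=1, keepdims=True)

    print(np.round(A2, 2))   # should approach [0, 1, 1, 0] (depends on the random initialization)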



