Neural Network Module 2 Notes

This document provides an overview of supervised learning techniques for neural networks. It discusses perceptron learning and non-separable data sets. It then covers the α-least mean square learning algorithm, the mean squared error surface, and steepest descent search. It also describes the μ-LMS approximation of gradient descent and its application to noise cancellation. The document outlines multi-layer network architecture and the backpropagation learning algorithm. It concludes with practical considerations for implementing the backpropagation algorithm.

Neural Network

Module 2
Supervised Learning
Contents
Perceptron Learning and Non-Separable Sets
α-Least Mean Square Learning
MSE Error Surface
Steepest Descent Search
μ-LMS: Approximate Gradient Descent
Application of LMS to Noise Cancellation
Multi-layered Network Architecture
Backpropagation Learning Algorithm
Practical Considerations in Implementing the BP Algorithm



PERCEPTRON LEARNING AND
NON-SEPARABLE SETS
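The slides for this section are not reproduced here. As a rough illustration of the topic, the following is a minimal sketch of the classical perceptron update rule (weights change only on misclassified patterns); the learning rate, target encoding, data, and stopping criterion are illustrative assumptions, not the slides' notation. On a non-separable set such as XOR the loop never settles, which motivates the least-squares approaches in the sections that follow.

```python
import numpy as np

def perceptron_train(X, d, eta=0.1, max_epochs=100):
    """Classical perceptron learning on augmented inputs (bias folded in).

    X : (Q, n) array of input patterns, d : (Q,) array of targets in {0, 1}.
    Returns the weight vector; on a non-separable set the loop simply
    stops after max_epochs without converging.
    """
    Q, n = X.shape
    Xa = np.hstack([X, np.ones((Q, 1))])      # augment with a constant bias input
    w = np.zeros(n + 1)
    for _ in range(max_epochs):
        errors = 0
        for x, target in zip(Xa, d):
            y = 1.0 if w @ x > 0 else 0.0     # threshold (TLN) signal function
            if y != target:
                w += eta * (target - y) * x   # update only on misclassification
                errors += 1
        if errors == 0:                       # separable case: converged
            break
    return w

# Example: AND is linearly separable, XOR is not.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
print(perceptron_train(X, np.array([0, 0, 0, 1])))  # AND: converges
print(perceptron_train(X, np.array([0, 1, 1, 0])))  # XOR: uses all epochs, never separates
```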


α-LEAST MEAN SQUARE LEARNING



Operational Details of α-LMS



α-LMS Works with Normalized Training
Patterns



Operational Summary of the α-LMS Learning Algorithm
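The slides' exact notation is not reproduced here; the α-LMS (normalized LMS) rule is usually summarized as W(k+1) = W(k) + α e(k) X(k) / ||X(k)||², with e(k) = d(k) - W(k)·X(k). A minimal sketch under that assumption:

```python
import numpy as np

def alpha_lms(X, d, alpha=0.5, epochs=20):
    """alpha-LMS: weight correction proportional to the linear error and to the
    input pattern normalized by its squared length.

    X : (Q, n) training patterns (bias input folded in if desired),
    d : (Q,) desired linear outputs.  alpha is typically chosen in (0, 2).
    """
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - w @ x                 # linear error for this pattern
            w += alpha * e * x / (x @ x)       # update along the normalized pattern
    return w
```

Because the correction is normalized by ||X(k)||², a single update reduces the error on the current pattern by the factor (1 - α), which is why α is restricted to the interval (0, 2).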



MSE ERROR SURFACE AND ITS
GEOMETRY
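The figures for this section are not reproduced. For reference, in the notation commonly used for LMS analysis (not necessarily the slides' symbols), the mean squared error of a linear neuron is a quadratic function of the weight vector:

E(W) = E[d_k^2] - 2 P^T W + W^T R W,   where R = E[X_k X_k^T] and P = E[d_k X_k].

The error surface is therefore a paraboloidal bowl with gradient ∇E(W) = 2(R W - P); when R is positive definite its unique minimum is the Wiener solution W* = R^{-1} P.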



STEEPEST DESCENT SEARCH WITH
EXACT GRADIENT INFORMATION
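The derivations on these slides are not reproduced. As a hedged reference sketch, steepest descent with exact gradient information on the quadratic MSE surface above iterates W(k+1) = W(k) - η ∇E(W(k)) = W(k) + 2η(P - R W(k)); the concrete R, P, and step size below are illustrative assumptions.

```python
import numpy as np

# Steepest descent on the quadratic MSE surface E(W) = E[d^2] - 2 P.T W + W.T R W,
# whose exact gradient is grad E(W) = 2 (R W - P).  Illustrative R and P values.
R = np.array([[2.0, 0.5],
              [0.5, 1.0]])        # input autocorrelation matrix (positive definite)
P = np.array([1.0, 0.5])          # cross-correlation between input and desired output

eta = 0.1                         # learning rate (step size)
W = np.zeros(2)                   # start from the origin
for k in range(200):
    grad = 2.0 * (R @ W - P)      # exact gradient of the MSE surface
    W = W - eta * grad            # steepest descent step

print(W, np.linalg.solve(R, P))   # converges toward the Wiener solution R^{-1} P
```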

μ-LMS: APPROXIMATE GRADIENT DESCENT
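In contrast with steepest descent, μ-LMS replaces the exact gradient by an instantaneous, single-pattern estimate. A minimal sketch, assuming the common form W(k+1) = W(k) + 2μ e(k) X(k) (some texts absorb the factor 2 into μ):

```python
import numpy as np

def mu_lms(X, d, mu=0.01, epochs=50):
    """mu-LMS: stochastic (per-pattern) approximation of gradient descent on the
    MSE surface, using the instantaneous error e = d - w.x in place of the true
    gradient.  The factor of 2 follows one common convention."""
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        for x, target in zip(X, d):
            e = target - w @ x          # instantaneous linear error
            w += 2.0 * mu * e * x       # noisy but unbiased gradient step
    return w
```

Unlike α-LMS, the step is not normalized by the input length, so μ must be chosen small relative to the input power for convergence in the mean.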

μ-LMS Algorithm: Convergence in the
Mean

APPLICATION OF LMS TO NOISE
CANCELLATION
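The figures for this application are not reproduced. As a rough, hedged sketch of the standard adaptive noise-cancelling arrangement usually associated with LMS: the primary input carries signal plus noise, a reference input carries noise correlated with the contaminating noise, an LMS-adapted linear filter predicts the contaminating noise from the reference, and the system output is the primary input minus the filter output. All signal definitions, filter length, and step size below are synthetic illustrations, not values from the slides.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic example: primary = signal + filtered noise, reference = raw noise.
N = 5000
t = np.arange(N)
signal = np.sin(2 * np.pi * t / 50)                   # the signal we want to recover
noise = rng.normal(size=N)                            # reference noise source
corrupting = 0.8 * noise + 0.3 * np.roll(noise, 1)    # noise path seen at the primary input
primary = signal + corrupting

# Adaptive FIR filter driven by the reference input, adapted with mu-LMS.
taps, mu = 4, 0.01
w = np.zeros(taps)
output = np.zeros(N)
for k in range(taps, N):
    x = noise[k - taps + 1:k + 1][::-1]   # most recent reference samples
    y = w @ x                             # filter's estimate of the corrupting noise
    e = primary[k] - y                    # system output: the noise-cancelled signal
    w += 2 * mu * e * x                   # LMS update driven by the output itself
    output[k] = e

print(np.mean((output[1000:] - signal[1000:]) ** 2))  # small residual after adaptation
```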

MULTILAYERED NETWORK
ARCHITECTURE
• The primary application of TLNs is in data classification: they can successfully classify linearly
separable data sets.
• Linear neurons, on the other hand, perform a kind of least-squares fit of a given data set, fitting
linear functions to approximate non-linear ones.
• The computational capabilities of these single-neuron systems are limited by the nature of the
signal function and by the lack of a layered architecture.
• Layering drastically increases the computational power of the system.
• A layered network of linear neurons, however, does not provide any additional computational
capability (see the sketch below).
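The last point can be checked directly: composing any number of linear layers collapses into a single linear map, so a purely linear multilayer network is no more powerful than a single layer of linear neurons. A minimal numerical check, with random weights chosen only for illustration:

```python
import numpy as np

rng = np.random.default_rng(42)

# Two "layers" of linear neurons: y = W2 (W1 x).
W1 = rng.normal(size=(4, 3))     # first layer: 3 inputs -> 4 linear neurons
W2 = rng.normal(size=(2, 4))     # second layer: 4 inputs -> 2 linear neurons

x = rng.normal(size=3)
two_layer = W2 @ (W1 @ x)        # output of the layered linear network
single_layer = (W2 @ W1) @ x     # single linear layer with weight matrix W2 W1

print(np.allclose(two_layer, single_layer))   # True: layering adds nothing here
```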



Generic Architecture of a Feedforward Neural Network



BACKPROPAGATION LEARNING
ALGORITHM
• Notation



∙ Vectors and scalar variables will be subscripted and superscripted, respectively, by the iteration
index k.
∙ The network is homogeneous in the sense that all neurons use similar signal functions.
o For the neurons in the input layer, the signal function is linear:
• S(x) = x
o For the sigmoidal neurons in the hidden and output layers, the signal function is the logistic
function, which maps activations into the interval (0, 1):
• S(x) = 1 / (1 + e^(-x))
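A small helper for these sigmoidal neurons might look as follows; the convenient identity S'(x) = S(x)(1 - S(x)) is what keeps the gradient computations in the later sections compact.

```python
import numpy as np

def sigmoid(x):
    """Logistic signal function S(x) = 1 / (1 + exp(-x)), mapping activations into (0, 1)."""
    return 1.0 / (1.0 + np.exp(-x))

def sigmoid_deriv(s):
    """Derivative of the logistic function expressed in terms of the signal s = S(x)."""
    return s * (1.0 - s)
```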



Squared Error Performance Function

Outline of Learning Procedure



Batch and Pattern Update



Derivation of Backpropagation Algorithm

Computation of Error Gradient
1. Hidden-to-Output Layer Weight Gradient

2. Input-to-Hidden Layer Weight Gradient
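The symbolic derivations on these slides are not reproduced. As a hedged numerical sketch for a network with one sigmoidal hidden layer, sigmoidal outputs, and a squared-error performance function, the hidden-to-output and input-to-hidden weight gradients for a single pattern can be computed as follows. The layer sizes, variable names, and the omission of bias terms are simplifying assumptions, not the slides' notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def bp_gradients(x, d, W_ih, W_ho):
    """Per-pattern error gradients for a one-hidden-layer sigmoidal network.

    x : (n,) input pattern, d : (p,) desired output,
    W_ih : (m, n) input-to-hidden weights, W_ho : (p, m) hidden-to-output weights.
    The error is E = 0.5 * sum((d - o)**2).
    """
    # Forward pass
    z = sigmoid(W_ih @ x)                  # hidden layer signals
    o = sigmoid(W_ho @ z)                  # output layer signals

    # Hidden-to-output layer weight gradient
    delta_o = (o - d) * o * (1.0 - o)      # error term at each output neuron
    grad_ho = np.outer(delta_o, z)         # dE/dW_ho

    # Input-to-hidden layer weight gradient (output errors backpropagated through W_ho)
    delta_h = (W_ho.T @ delta_o) * z * (1.0 - z)
    grad_ih = np.outer(delta_h, x)         # dE/dW_ih
    return grad_ih, grad_ho

# A gradient-descent weight update would then be, for example, W -= eta * grad.
```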

PRACTICAL CONSIDERATIONS IN
IMPLEMENTING THE BP ALGORITHM

Initialization of Network Weights

o It is important to choose the initial weights of the network carefully.
o Sometimes this choice can decide whether or not the network is able to learn the training set function.
o It is common practice to initialize weights to small random values within some interval [-ε, ε], as sketched below.
o Initializing all the weights of the network to the same value can lead to network paralysis, where the
network learns nothing because the weight changes are uniformly zero.
o Further, very small ranges of weight randomization should be avoided in general, since this may lead to very
slow learning in the initial stages.
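A minimal sketch of this random initialization; the layer sizes and the value of ε are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(7)

def init_weights(n_in, n_out, eps=0.5):
    """Initialize a weight matrix to small random values in [-eps, eps],
    avoiding both identical weights (network paralysis) and a range so
    narrow that early learning becomes very slow."""
    return rng.uniform(-eps, eps, size=(n_out, n_in))

W_ih = init_weights(4, 8)    # e.g. 4 inputs, 8 hidden neurons
W_ho = init_weights(8, 2)    # 8 hidden neurons, 2 outputs
```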



o Alternatively, an incorrect choice of weights might lead to network saturation, where weight changes are
almost negligible over a large number of consecutive epochs.
o This may be incorrectly interpreted as a local minimum, because the weights begin to change again only
after a very large number of epochs.
o When neurons saturate, the signal values are close to the extremes 0 or 1, and the signal derivatives are
infinitesimally small.
o Since the weight changes are proportional to the signal derivative, saturated neurons generate weight
changes that are negligibly small.
o This is a major problem: if the neuron outputs are incorrect (such as being close to 1 for a desired
output close to 0), these small weight changes will allow the neuron to escape from the incorrect saturation
only after a very long time.
o Randomization of network weights can help avoid these problems.



Data Pre-processing

∙ Limitations on the performance of a neural network (or, for that matter, of any machine learning technique)
can be partly attributed to the quality of the data employed for training.
∙ Real-world data sets are generally incomplete: they may lack feature values or may have missing features.
∙ Data is also often noisy, containing errors or outliers, and may be inconsistent.
∙ Alternatively, the data set might have many more features than are necessary for solving the classification
problem at hand, especially in application domains such as bioinformatics, where features (such as
microarray gene expression values) can run into the thousands, whereas only a handful of them usually turn
out to be essential.
∙ Further, feature values may have scales that differ from one another by orders of magnitude.
∙ It is therefore necessary to perform some kind of pre-processing on the data prior to using it for training and
testing purposes.



1. Data Cleaning:
∙ If a class label, target value, or a large number of attribute values are missing, it may be appropriate to ignore
the data entry.
∙ For a missing attribute value, one can use the mean of that attribute, averaged either over the entire data set or
over the data belonging to the same class (for a classification problem); see the sketch below.
∙ Alternatively, one might fill in the missing value with the most probable value, as determined by regression,
inference-based tools using a Bayesian formalism, or decision tree induction.
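A minimal sketch of the mean-based fill-in strategies mentioned above, using NumPy only; the array layout (rows are data entries, NaN marks a missing value) is an assumption for illustration.

```python
import numpy as np

def fill_missing_with_mean(X):
    """Replace NaN entries of each attribute (column) with that attribute's mean."""
    X = X.copy()
    col_means = np.nanmean(X, axis=0)
    rows, cols = np.where(np.isnan(X))
    X[rows, cols] = col_means[cols]
    return X

def fill_missing_with_class_mean(X, y):
    """Replace NaN entries with the attribute mean computed over the same class only."""
    X = X.copy()
    for c in np.unique(y):
        mask = (y == c)
        class_means = np.nanmean(X[mask], axis=0)
        rows, cols = np.where(np.isnan(X[mask]))
        X[np.flatnonzero(mask)[rows], cols] = class_means[cols]
    return X
```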



Handling Noise in Data:
∙ The main issue here is to identify the outliers and smooth out the data.
∙ One way is to sort the attribute values, partition them into bins, and then smooth the data using the bin
means, bin medians, or bin boundaries (see the sketch below).
∙ An alternative approach uses clustering, where attribute values are grouped in order to detect and remove outliers.
∙ Regression can be used to smooth the data by fitting it to regression functions.
∙ Inconsistent data (which includes redundancies) usually has to be corrected using domain knowledge or an
expert decision.
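A minimal sketch of smoothing by bin means; the use of equal-frequency bins is an assumption, and equal-width bins would work the same way.

```python
import numpy as np

def smooth_by_bin_means(values, n_bins=3):
    """Sort attribute values, partition them into (roughly) equal-frequency bins,
    and replace each value by the mean of its bin."""
    order = np.argsort(values)
    sorted_vals = values[order]
    bins = np.array_split(sorted_vals, n_bins)
    smoothed_sorted = np.concatenate([np.full(len(b), b.mean()) for b in bins])
    # Put the smoothed values back in the original order of the data.
    smoothed = np.empty_like(smoothed_sorted)
    smoothed[order] = smoothed_sorted
    return smoothed

print(smooth_by_bin_means(np.array([4., 8., 15., 21., 21., 24., 25., 28., 34.])))
# -> bin means 9, 22, 29 replace the values of their respective bins
```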





∙ Desired outputs should lie well within the neuronal signal range.
∙ For example, given a logistic signal function whose range is the interval (0, 1), the desired outputs of the patterns in
the entire training set should lie in an interval [ε, 1 - ε], where ε > 0 is some small number (see the sketch below).
∙ Desired values of exactly 0 or 1 would cause the weights to grow increasingly large, because generating these
limiting output values requires an activation of -∞ or +∞, which the network can approach only by increasing the
magnitudes of its weights.
∙ In addition, the algorithm will obviously not converge if desired outputs lie outside the achievable interval (0, 1).
∙ Alternatively, one can try to adjust the signal function to accommodate the range of the target values,
although this may require some adjustments at the algorithmic level.
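A minimal sketch of rescaling target values into [ε, 1 - ε] for a logistic output unit; the choice of ε is an illustrative assumption.

```python
import numpy as np

def rescale_targets(d, eps=0.05):
    """Linearly map desired outputs from their original range [d_min, d_max]
    into [eps, 1 - eps] so that they are achievable by a logistic output unit."""
    d = np.asarray(d, dtype=float)
    d_min, d_max = d.min(), d.max()
    return eps + (1.0 - 2.0 * eps) * (d - d_min) / (d_max - d_min)

print(rescale_targets([0, 1, 1, 0]))   # -> [0.05, 0.95, 0.95, 0.05]
```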



Reduction of Features:
∙ Feature reduction is of paramount importance when the number of features is very large (as in bioinformatics
problems, where the number of features runs into the thousands) and also when there are redundant features.
∙ Very common approaches to feature set reduction are principal component analysis (PCA) and linear discriminant
analysis (LDA); a minimal PCA sketch follows below.
∙ Feature set reduction can also be built into the learning mechanism of the network, which can automatically
select the more important features while reducing or removing the influence of the less important ones.
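A minimal PCA sketch via the singular value decomposition of the mean-centred data matrix, using NumPy only; the number of retained components is an illustrative assumption.

```python
import numpy as np

def pca_reduce(X, n_components=2):
    """Project the data onto its first n_components principal directions.

    X : (Q, n) data matrix with one sample per row.
    Returns the (Q, n_components) reduced representation.
    """
    Xc = X - X.mean(axis=0)                      # centre each feature
    U, S, Vt = np.linalg.svd(Xc, full_matrices=False)
    return Xc @ Vt[:n_components].T              # coordinates along the top principal axes
```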



Adjusting Learning Rates

∙ If the learning rate is small enough, the algorithm converges to the closest local minimum.
∙ Very small learning rates, however, can lead to long training times.
∙ In addition, if the network learning is non-uniform and the learning is stopped before the network is trained to an error
minimum, some weights will have reached their final 'optimal' values while others may not have. In such a situation, the
network might perform well on some patterns and poorly on others.
∙ If the error function can be approximated by a quadratic, then the following observations can be made (illustrated in the
sketch below):
o An optimal learning rate will reach the error minimum in a single learning step.
o Rates that are lower will take longer to converge to the same solution.
o Rates that are larger, but less than twice the optimal learning rate, will converge to the error minimum, but only after
much oscillation.
o Learning rates larger than twice the optimal value will diverge from the solution.
∙ There are a number of algorithms that attempt to adjust the learning rate somewhat optimally in order to speed up
conventional backpropagation.
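A minimal sketch of the four regimes listed above on a one-dimensional quadratic error E(w) = 0.5·a·(w - w*)², for which the optimal rate is η_opt = 1/a; the quadratic and the rates used are illustrative assumptions.

```python
import numpy as np

a, w_star = 4.0, 1.0                 # quadratic error E(w) = 0.5 * a * (w - w_star)**2
eta_opt = 1.0 / a                    # optimal learning rate for this quadratic

def descend(eta, steps=10, w0=0.0):
    w = w0
    trajectory = [w]
    for _ in range(steps):
        w -= eta * a * (w - w_star)  # gradient step: dE/dw = a * (w - w_star)
        trajectory.append(w)
    return np.array(trajectory)

print(descend(eta_opt)[:3])          # optimal rate: reaches w_star in a single step
print(descend(0.5 * eta_opt)[-1])    # smaller rate: converges, but more slowly
print(descend(1.5 * eta_opt)[:5])    # between eta_opt and 2*eta_opt: oscillating convergence
print(descend(2.5 * eta_opt)[:5])    # more than 2*eta_opt: diverges
```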



Error Criteria for Termination of Learning



Regularization

Weight Decay:
▪ Over-fitted networks with a high degree of curvature are likely to contain weights with unusually large
magnitudes.
▪ Weight decay is a simple regularization technique that penalizes large weight magnitudes by including in
the error function a penalty term that grows with the weight magnitudes.
▪ For example, the sum of the squares of the weights (including the biases) of the entire network can be
multiplied by a decay constant that decides the extent to which the penalty term affects the error function
(see the sketch below).
▪ With this, the learning process tends to favour lower weight magnitudes and thus helps keep the
operating range of the neuronal activations in the linear regime of the sigmoid.
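A minimal sketch of the penalized error and the resulting update rule; the decay constant and the treatment of each weight matrix as a plain array are illustrative assumptions.

```python
import numpy as np

def penalized_error(error, weights, decay=1e-3):
    """Squared-error performance plus a weight-decay penalty:
    E_total = E + decay * (sum of squared weights, biases included)."""
    return error + decay * sum(np.sum(w ** 2) for w in weights)

def decayed_update(w, grad, eta=0.1, decay=1e-3):
    """Gradient step on the penalized error: the extra 2*decay*w term
    shrinks every weight toward zero on each update."""
    return w - eta * (grad + 2.0 * decay * w)
```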





Cross-validation
▪ Cross-validation is a very effective approach when the number of samples in the data set is small and splitting it into the
three subsets discussed above is not feasible.
▪ A common approach is to use leave-one-out (LOO) cross-validation.
▪ In this, Q partitions of the data set are made (Q being the total number of training samples).
▪ In each partition, the network is trained on Q - 1 samples and tested on the single sample that is left out.
▪ This process is repeated so that each sample is used exactly once for testing (see the sketch below).
▪ Another variant, called 10-fold cross-validation, is very commonly used when the number of samples is larger (say,
more than 100).
▪ In this, the data set is partitioned randomly into ten different training-testing splits, for example with 80% of the data
used for training and 20% for testing.
▪ The network is trained on the training subset of one partition and then tested on the test subset of that partition. This is
repeated for all ten partitions.
▪ The final test error is the average of the ten test errors obtained.
▪ Cross-validation can also be used to determine an appropriate network architecture, as discussed in the next subsection.
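A minimal sketch of the leave-one-out loop; `train` and `test_error` stand for whatever training routine and error measure are in use and are placeholders, not functions defined on these slides.

```python
import numpy as np

def leave_one_out_error(X, d, train, test_error):
    """Leave-one-out cross-validation: train on Q-1 samples, test on the one
    left out, repeat so every sample is used exactly once for testing, and
    average the Q test errors.

    train(X_train, d_train) -> model and test_error(model, x, target) -> float
    are user-supplied placeholders.
    """
    Q = len(X)
    errors = []
    for i in range(Q):
        keep = np.arange(Q) != i                 # indices of the Q-1 training samples
        model = train(X[keep], d[keep])
        errors.append(test_error(model, X[i], d[i]))
    return float(np.mean(errors))
```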



Selection of a Network Architecture

∙ Both the generalization and the approximation ability of a feedforward neural network are closely related to the
architecture of the network (which determines the number of weights, or free parameters, in the network) and
to the size of the training set.
∙ It is possible to have a situation where there are too many connections in the network and too few training
examples.
∙ In such a situation, the network might 'memorize' the training examples only too well and fail to
generalize properly, because the number of training examples is insufficient to appropriately pin down all
the connection values in the network.
∙ In such a case, the network becomes over-trained and loses its ability to generalize or interpolate correctly.
∙ The real problem is to find a network architecture that is capable of approximation and generalization
simultaneously.



How Many Hidden Layers are Enough?

∙ Although the backpropagation algorithm can be applied to any number of hidden layers, a three-layered
network can approximate any continuous function.
∙ The problem with multi-layered networks using a single hidden layer is that the neurons tend to interact with
each other globally.
∙ Such interactions can make it difficult to generate approximations of arbitrary accuracy.
∙ For a network with two hidden layers, the curve-fitting process is easier for the neural network.
∙ The reason is that the first hidden layer extracts local features of the function.
∙ This is done in much the same way as binary threshold neurons partition the input space into regions.
∙ Neurons in the first hidden layer learn the local features that characterize specific regions of the input space.
∙ Global features are extracted in the second hidden layer.
∙ Once again, in a way similar to the multi-layered TLN network, neurons in the second hidden layer combine
the outputs of neurons in the first hidden layer, which facilitates the learning of global features (see the sketch below).
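As a structural illustration only, a forward pass through a network with two sigmoidal hidden layers might be organized as below; the layer sizes and random weights are assumptions and nothing here reproduces the slides' notation.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward_two_hidden(x, W1, W2, W3):
    """Forward pass: the first hidden layer responds to local features of the
    input, the second hidden layer combines those responses into more global
    features, and the output layer reads off the final approximation."""
    h1 = sigmoid(W1 @ x)     # first hidden layer
    h2 = sigmoid(W2 @ h1)    # second hidden layer
    return sigmoid(W3 @ h2)  # output layer

rng = np.random.default_rng(0)
W1, W2, W3 = rng.normal(size=(6, 3)), rng.normal(size=(4, 6)), rng.normal(size=(1, 4))
print(forward_two_hidden(rng.normal(size=3), W1, W2, W3))
```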


