UNIT-1 Notes
From the model, we find that the hard limiter input or induced local field of the neuron is
v = w1x1 + w2x2 + ... + wmxm + b
where b is the externally applied bias.
9. The goal of the perceptron is to correctly classify the set of externally
applied inputs x1, x2, ..., xm into one of two classes G1 and G2.
10. The decision rule for classification is : if the output y is +1, assign the
point represented by the inputs x1, x2, ..., xm to class G1; if y is –1,
assign it to class G2.
11. In Fig. 1.8.2, a point (x1, x2) that lies below the boundary line is assigned to
class G2 and a point above the line is assigned to class G1. The decision
boundary is calculated as :
w1x1 + w2x2 + b = 0
12. The synaptic weights w1, w2, ..., wm of the perceptron can be adapted
on an iteration-by-iteration basis.
13. For this adaptation, an error-correction rule known as the perceptron
convergence algorithm is used (a small sketch of this rule follows this list).
14. For a perceptron to function properly, the two classes G1 and G2 must be
linearly separable.
15. Linearly separable means that the patterns or sets of inputs to be classified
must be separable by a straight line (in two dimensions).
16. Generalizing, a set of points in n-dimensional space are linearly separable
if there is a hyperplane of (n – 1) dimensions that separates the sets.
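The error-correction rule mentioned in point 13 can be sketched in a few lines of NumPy. This is a minimal illustration, not the full treatment: the toy two-class dataset, the learning rate and the stopping criterion are all assumed for the example.

```python
import numpy as np

# Toy, linearly separable 2-D dataset (assumed for illustration):
# class G1 is labelled +1, class G2 is labelled -1.
X = np.array([[2.0, 3.0], [3.0, 3.5], [-1.0, -2.0], [-2.0, -1.5]])
d = np.array([+1, +1, -1, -1])

w = np.zeros(2)      # synaptic weights w1, w2
b = 0.0              # bias
eta = 0.1            # learning rate (assumed value)

for epoch in range(100):
    errors = 0
    for x, target in zip(X, d):
        v = np.dot(w, x) + b          # induced local field v = w.x + b
        y = 1 if v >= 0 else -1       # hard limiter output
        if y != target:               # error-correction update
            w += eta * (target - y) * x
            b += eta * (target - y)
            errors += 1
    if errors == 0:                   # every pattern correctly classified
        break

print("weights:", w, "bias:", b)      # decision boundary: w1*x1 + w2*x2 + b = 0
```

For linearly separable classes the loop stops after a finite number of passes, which is what the perceptron convergence theorem guarantees.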
In this type of network, we have only two layers, i.e., the input layer and the output layer, but the
input layer is not counted because no computation is performed in it. The output layer is
formed when different weights are applied to the input nodes and the cumulative effect per node
is taken.
c. After this, the neurons of the output layer collectively compute the output signals.
a. This network is a single-layer network with feedback connections, in which a processing
element's output can be directed back to itself, to other processing elements, or to both.
b. A recurrent neural network (RNN) is a class of artificial neural network in which connections
between nodes form a directed graph along a sequence.
c. This allows it to exhibit dynamic temporal behaviour for a time sequence. Unlike feedforward
neural networks, RNNs can use their internal state (memory) to process sequences of inputs,
as the sketch below illustrates.
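A minimal sketch of this recurrent state update follows. The input and hidden dimensions, the random weights and the tanh non-linearity are assumptions chosen only to show how the state feeds back into itself at every time step.

```python
import numpy as np

rng = np.random.default_rng(0)

# Assumed sizes for illustration: 3-dimensional inputs, 4-dimensional hidden state.
Wx = rng.normal(size=(4, 3))   # input-to-hidden weights
Wh = rng.normal(size=(4, 4))   # hidden-to-hidden (feedback) weights
b = np.zeros(4)

h = np.zeros(4)                                    # internal state (memory)
sequence = [rng.normal(size=3) for _ in range(5)]  # a short input sequence

for x_t in sequence:
    # The previous state h is fed back together with the new input,
    # which is what gives the network its dynamic temporal behaviour.
    h = np.tanh(Wx @ x_t + Wh @ h + b)

print("final hidden state:", h)
```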
Fig. 1.12.1 shows the architectural graph of a multilayer perceptron with two hidden layers and an
output layer.
2. Signal flow through the network progresses in a forward direction, from left to right
and on a layer-by-layer basis.
3. Two kinds of signals are identified in this network :
a. Functional signals : A functional signal is an input signal that propagates forward and
emerges at the output end of the network as an output signal.
b. Error signals : An error signal originates at an output neuron and
propagates backward through the network.
4. Multilayer perceptrons have been applied successfully to solve some
difficult and diverse problems by training them in a supervised manner
with a highly popular algorithm known as the error backpropagation
algorithm.
Characteristics of multilayer perceptron :
1. In this model, each neuron in the network includes a non-linear activation function (the
non-linearity is smooth). The most commonly used non-linear function is the logistic sigmoid, defined by :
yj = 1 / (1 + e^(–vj))
where vj is the induced local field (i.e., the weighted sum of the inputs plus the bias) of
neuron j and yj is the output of neuron j.
2. The network contains hidden neurons that are not part of the input or
output of the network. The hidden layers of neurons enable the network to
learn complex tasks.
3. The network exhibits a high degree of connectivity.
What does a shallow network compute ?
Answer
1. Shallow networks are neural networks with less depth, i.e., a smaller number of hidden layers.
2. These neural networks have one hidden layer and an output layer.
3. Shallow neural network is a term used to describe a neural network that usually has only one
hidden layer, as opposed to a deep neural network, which has several hidden layers.
4. Fig. 1.13.1 below shows a shallow neural network with a single input layer, a single hidden layer
and a single output layer :
5. The neurons present in the hidden layer of our shallow neural network compute the following :
zj[1] = wj[1] . x + bj[1],  aj[1] = σ(zj[1])
The superscript number [i] denotes the layer number and the subscript number j denotes the neuron
number in a particular layer.
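A minimal NumPy sketch of this forward computation for one hidden layer and one output neuron follows. The layer sizes, the random weights and the use of the sigmoid in both layers are assumptions made only for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed sizes for illustration: 3 inputs, 4 hidden neurons, 1 output neuron.
x = rng.normal(size=(3, 1))         # input column vector
W1 = rng.normal(size=(4, 3))        # layer [1] weights
b1 = np.zeros((4, 1))               # layer [1] biases
W2 = rng.normal(size=(1, 4))        # layer [2] (output) weights
b2 = np.zeros((1, 1))               # layer [2] bias

# Hidden layer: z[1]_j = w[1]_j . x + b[1]_j, a[1]_j = sigmoid(z[1]_j)
z1 = W1 @ x + b1
a1 = sigmoid(z1)

# Output layer: z[2] = w[2] . a[1] + b[2], a[2] = sigmoid(z[2])
z2 = W2 @ a1 + b2
a2 = sigmoid(z2)

print("hidden activations:", a1.ravel())
print("network output:", a2.ravel())
```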
Que 1.17. How do tuning parameters affect the backpropagation neural network ?
Answer
Effect of tuning parameters of the backpropagation neural network :
1. Momentum factor :
a. The momentum factor has a significant role in deciding the values
of learning rate that will produce rapid learning.
b. It determines the size of change in weights or biases.
c. If momentum factor is zero, the smoothening is minimum and the
entire weight adjustment comes from the newly calculated change.
d. If momentum factor is one, new adjustment is ignored and previous
one is repeated.
e. Between 0 and 1 is a region where the weight adjustment is smoothened by an amount
proportional to the momentum factor.
f. The momentum factor effectively increases the speed of learning without leading to
oscillations and filters out high-frequency variations of the error surface in the weight space
(a minimal weight-update sketch follows this answer).
2. Learning coefficient :
a. A formula that has been suggested for selecting the learning coefficient is :
η = 1.5 / √(N1² + N2² + ... + Nm²)
where N1 is the number of patterns of type 1 and m is the number of different pattern types.
b. A small value of learning coefficient, less than 0.2, produces slower but stable training.
c. A large value of learning coefficient, i.e., greater than 0.5, changes the weights
drastically, but this may cause the optimum combination of weights to be overshot, resulting in
oscillations about the optimum.
d. The optimum value of the learning rate is about 0.6, which produces fast learning without leading to
oscillations.
3. Sigmoidal gain :
a. If the sigmoidal function is selected, the input-output relationship of the neuron can be set as
O = 1 / (1 + e^(–λ(I – θ)))          ...(1.17.1)
where λ is a scaling factor known as the sigmoidal gain.
b. As the scaling factor λ increases, the input-output characteristic of
the analog neuron approaches that of the two-state neuron, i.e., the
activation function approaches a hard-limiting (step) function.
c. It also affects backpropagation. To get a graded output, as the
sigmoidal gain factor is increased, the learning rate and momentum
factor have to be decreased in order to prevent oscillations.
4. Threshold value :
a. θ in equation (1.17.1) is called the threshold value or the bias or the
noise factor.
b. A neuron fires or generates an output if the weighted sum of the
input exceeds the threshold value.
c. One method is to simply assign a small value to it and not to change
it during training.
d. The other method is to initially choose some random values and
change them during training.
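The momentum-smoothed weight update described in point 1 can be sketched as below. This is a minimal illustration rather than a full BPN training loop: the toy error surface E(w) = 0.5·w², the learning rate and the momentum factor are all assumed values.

```python
# Minimal sketch of a weight update with a momentum factor (alpha).
# Assumed toy error surface: E(w) = 0.5 * w**2, so dE/dw = w.
def grad(w):
    return w

w = 5.0          # initial weight (assumed)
eta = 0.1        # learning rate (assumed)
alpha = 0.9      # momentum factor, between 0 and 1
delta_w = 0.0    # previous weight change

for step in range(50):
    # alpha = 0 -> only the newly calculated change is used;
    # alpha = 1 -> the previous adjustment is simply repeated;
    # in between, the adjustment is smoothed by the momentum term.
    delta_w = alpha * delta_w - eta * grad(w)
    w += delta_w

print("final weight:", w)   # approaches the minimum at w = 0
```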
Que 1.18. Write a short note on gradient descent.
Answer
1. Gradient descent is an optimization technique used in machine learning and
deep learning, and it can be used with all the learning algorithms.
2. A gradient is the slope of a function, i.e., the degree of change of one parameter
with the amount of change in another parameter.
3. Mathematically, it can be described as the partial derivatives of a set of
parameters with respect to its inputs. The larger the gradient, the steeper
the slope.
4. Gradient descent works best on a convex cost function, which has a single global minimum.
5. Gradient descent can be described as an iterative method which is used
to find the values of the parameters of a function that minimize the
cost function as much as possible.
6. The parameters are initially assigned particular values and, from there,
gradient descent is run in an iterative fashion, using calculus, to find the
parameter values that give the minimum possible value of the given cost
function (a minimal sketch follows this list).
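A minimal sketch of this iterative update on an assumed one-parameter cost function follows; the cost J(w) = (w – 3)², the starting value and the learning rate are all assumptions for illustration.

```python
# Gradient descent sketch on an assumed cost J(w) = (w - 3)**2,
# whose minimum is at w = 3.

def cost_gradient(w):
    # dJ/dw for J(w) = (w - 3)**2
    return 2.0 * (w - 3.0)

w = 0.0      # initial parameter value (assumed)
eta = 0.1    # learning rate (assumed)

for step in range(100):
    w = w - eta * cost_gradient(w)   # step against the gradient

print("estimated minimiser:", w)     # approaches 3.0
```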
Que 1.19. Discuss selection of various parameters in Backpropagation Neural Network
(BPN).
Answer
Selection of various parameters in BPN :
1. Number of hidden nodes :
a. The guiding criterion is to select the minimum nodes in the first
and third layer, so that the memory demand for storing the weights
can be kept minimum.
b. The number of separable regions M in the input space is a function
of the number of hidden nodes H in the BPN, and H = M – 1.
c. When the number of hidden nodes is equal to the number of training
patterns, the learning could be fastest.
d. In such cases, the BPN simply remembers the training patterns, losing all
generalization capabilities.
e. Hence, as far as generalization is concerned, the number of hidden
nodes should be small compared to the number of training patterns;
this choice can be guided by the VCdim (Vapnik-Chervonenkis dimension)
from probability theory.
f. We can estimate the number of hidden nodes for a given number of
training patterns from the number of weights, which is equal to
I1 * I2 + I2 * I3, where I1 and I3 denote the input and output
nodes and I2 denotes the hidden nodes.
g. Assume the number of training samples T to be greater than VCdim. If
we accept a 10 : 1 ratio of training samples to weights, then
I1 * I2 + I2 * I3 = T / 10, which gives I2 = T / (10 * (I1 + I3))
(a small worked example follows).
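A small worked calculation of this estimate follows; the layer sizes and the number of training patterns are assumed numbers chosen only to make the arithmetic concrete.

```python
# Worked example (assumed numbers): choose hidden nodes I2 so that the
# number of weights I1*I2 + I2*I3 is about one tenth of the training samples T.
I1, I3 = 10, 2     # input and output nodes (assumed)
T = 1200           # number of training patterns (assumed)

I2 = T / (10 * (I1 + I3))                      # from I1*I2 + I2*I3 = T / 10
print("estimated hidden nodes:", round(I2))    # -> 10
```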
Que 1.20. Explain different types of gradient descent.
Answer
Different types of gradient descent are :
1. Batch gradient descent :
a. This is a type of gradient descent which processes all the training
examples for each iteration of gradient descent.
b. When the number of training examples is large, batch gradient
descent is computationally very expensive, so it is not preferred.
c. Instead, we prefer to use stochastic gradient descent or
mini-batch gradient descent.
2. Stochastic gradient descent :
a. This is a type of gradient descent which processes a single training
example per iteration.
b. Hence, the parameters are being updated even after one iteration
in which only a single example has been processed.
c. Hence, this is faster than batch gradient descent. However, when the
number of training examples is large, processing only one example at a
time adds overhead for the system, as the number of iterations
becomes large.
3. Mini-batch gradient descent :
a. This is a mixture of both stochastic and batch gradient descent.
b. The training set is divided into multiple groups called batches.
c. Each batch has a number of training samples in it.
d. At a time, a single batch is passed through the network, which
computes the loss of every sample in the batch and uses their
average to update the parameters of the neural network (a sketch
contrasting the three schemes follows this list).
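The three schemes differ only in how many samples contribute to each update, as the sketch below shows. The synthetic linear-regression data, the learning rate, the batch sizes and the number of epochs are all assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic 1-D data y = 2x + noise (assumed for illustration).
X = rng.normal(size=(100, 1))
y = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)

def gradient(w, xb, yb):
    # Gradient of the mean squared error 0.5 * mean((x*w - y)**2) w.r.t. w.
    return np.mean((xb[:, 0] * w - yb) * xb[:, 0])

def train(batch_size, eta=0.1, epochs=50):
    w = 0.0
    n = len(X)
    for _ in range(epochs):
        idx = rng.permutation(n)
        for start in range(0, n, batch_size):
            batch = idx[start:start + batch_size]
            w -= eta * gradient(w, X[batch], y[batch])
    return w

print("batch      :", train(batch_size=100))   # all samples per update
print("stochastic :", train(batch_size=1))     # one sample per update
print("mini-batch :", train(batch_size=16))    # a small group per update
```

All three estimates approach the true slope of 2; they differ mainly in how noisy the individual updates are and how much computation each update costs.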
Que 1.21. What are the advantages and disadvantages of
stochastic gradient descent ?
Answer
Advantages of stochastic gradient descent :
1. It is easier to fit into memory due to a single training sample being
processed by the network.
2. It is computationally fast as only one sample is processed at a time.
3. For larger datasets it can converge faster, as it updates the parameters
more frequently.
4. Due to frequent updates, the steps taken towards the minima of the loss
function have oscillations, which can help the search escape local minima
of the loss function (in case the computed position turns out to be a
local minimum).
Disadvantages of stochastic gradient descent :
1. Due to frequent updates the steps taken towards the minima are very
noisy. This can often lead the gradient descent into other directions.
2. Also, due to noisy steps it may take longer to achieve convergence to the
minima of the loss function.
3. Frequent updates are computationally expensive because all resources
are used for processing one training sample at a time.
4. It loses the advantage of vectorized operations as it deals with only a
single example at a time.
Que 1.22. Explain neural networks as universal function
approximation.
Answer
1. Feedforward networks with hidden layers provide a universal approximation
framework.
2. The universal approximation theorem states that a feedforward network
with a linear output layer and at least one hidden layer with any
"squashing" activation function can approximate any Borel measurable
function from one finite-dimensional space to another with any desired
non-zero amount of error, provided that the network is given enough
hidden units.
3. The derivatives of the feedforward network can also approximate the
derivatives of the function.
4. The concept of Borel measurability implies that any continuous function on a closed
and bounded subset of R^n is Borel measurable and therefore may be
approximated by a neural network.
5. The universal approximation theorem means that a large MLP will be
able to represent the desired function.
6. Even if the MLP is able to represent the function, learning can fail for
two different reasons.
a. Optimization algorithm used for training may not be able to find
the value of the parameters that corresponds to the desired function.
b. Training algorithm might choose the wrong function due to
overfitting.
7. Feedforward networks provide a universal system for representing
functions, in the sense that, given a function, there exists a feedforward
network that approximates the function.
8. There is no universal procedure for examining a training set of specific
examples and choosing a function that will generalize to points not in
the training set.
9. The universal approximation theorem says that there exists a network
large enough to achieve any degree of accuracy we desire, but the
theorem does not say how large this network will be.
10. Researchers have provided some bounds on the size of a single-layer network
needed to approximate a broad class of functions; in the worst case, an
exponential number of hidden units may be required.
11. This is easiest to see in the binary case : the number of possible binary
functions on vectors v ∈ {0, 1}^n is 2^(2^n), and selecting one such function
requires 2^n bits, which will in general require O(2^n) degrees of freedom
(a small calculation after this answer illustrates the growth).
12. A feedforward network with a single layer is sufficient to represent any
function, but the layer may be infeasibly large and may fail to learn and
generalize correctly.
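The count quoted in point 11 grows extremely quickly, as this small calculation of 2^(2^n) shows; the range of n is chosen arbitrarily for illustration.

```python
# Number of distinct binary functions on n-bit input vectors is 2**(2**n),
# which is why a worst-case shallow representation may need exponentially
# many hidden units.
for n in range(1, 6):
    print(f"n = {n}: 2^(2^{n}) = {2 ** (2 ** n)}")
```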