Unit-2 DL Cse
Note: There is also a bias unit in a feed-forward neural network in all the layers except the
output layer.
Assumptions:
i = number of neurons in input layer
h = number of neurons in hidden layer
o = number of neurons in output layer
From the diagram, we have i = 3, h = 4 and o = 2. Note that the red colored neuron is the
bias for that layer. Each bias of a layer is connected to all the neurons in the next layer
except the bias of the next layer.
Mathematically:
Number of connections between the first and second layer: 3 × 4 = 12, which is nothing but
the product of i and h.
Number of connections between the second and third layer: 4 × 2 = 8, which is nothing but
the product of h and o.
There are connections between layers via bias as well. Number of connections between the
bias of the first layer and the neurons of the second layer (except bias of the second layer):
1 × 4, which is nothing but h.
Number of connections between the bias of the second layer and the neurons of the third
layer: 1 × 2, which is nothing but o.
Summing up all:
3×4+4×2+1×4+1×2
= 12 + 8 + 4 + 2
= 26
Thus, this feed-forward neural network has 26 connections in all and thus will have 26
trainable parameters.
Rewriting this sum in terms of i, h and o:
3 × 4 + 4 × 2 + 1 × 4 + 1 × 2
= 3 × 4 + 4 × 2 + 4 + 2
= i × h + h × o + h + o
Thus, the total number of parameters in a feed-forward neural network with one hidden
layer is given by:
(i × h + h × o) + h + o
Since this network is small, it was also possible to count the connections in the diagram
to find the total. But what if there are more layers? Let us work through one more scenario
and see whether this formula still works or needs to be extended.
Scenario 1: A feed-forward neural network with three hidden layers. Number of units in the
input, first hidden, second hidden, third hidden and output layers are respectively 3, 5, 6, 4
and 2.
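Assumptions:
i = number of neurons in input layer
h1 = number of neurons in first hidden layer
h2 = number of neurons in second hidden layer
h3 = number of neurons in third hidden layer
o = number of neurons in output layer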
Number of connections between the first and second layer: 3 × 5 = 15, which is nothing but
the product of i and h1.
Number of connections between the second and third layer: 5 × 6 = 30, which is nothing but
the product of h1 and h2.
Number of connections between the third and fourth layer: 6 × 4 = 24, which is nothing but
the product of h2 and h3.
Number of connections between the fourth and fifth layer: 4 × 2 = 8, which is nothing but
the product of h3 and o.
Number of connections between the bias of the first layer and the neurons of the second
layer (except bias of the second layer): 1 × 5 = 5, which is nothing but h1.
Number of connections between the bias of the second layer and the neurons of the third
layer: 1 × 6 = 6, which is nothing but h2.
Number of connections between the bias of the third layer and the neurons of the fourth
layer: 1 × 4 = 4, which is nothing but h3.
Number of connections between the bias of the fourth layer and the neurons of the fifth
layer: 1 × 2 = 2, which is nothing but o.
Summing up all:
3×5+5×6+6×4+4×2+1×5+1×6+1×4+1×2
= 15 + 30 + 24 + 8 + 5 + 6 + 4 + 2
= 94
Thus, this feed-forward neural network has 94 connections in all and thus 94 trainable
parameters.
Rewriting this sum in terms of the layer sizes:
3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 1 × 5 + 1 × 6 + 1 × 4 + 1 × 2
= 3 × 5 + 5 × 6 + 6 × 4 + 4 × 2 + 5 + 6 + 4 + 2
= i × h1 + h1 × h2 + h2 × h3 + h3 × o + h1 + h2 + h3 + o
Thus, the total number of parameters in a feed-forward neural network with three hidden
layers is given by:
(i × h1 + h1 × h2 + h2 × h3 + h3 × o) + h1 + h2 + h3 + o
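Thus, the formula to find the total number of trainable parameters in a feed-forward neural
network with n hidden layers is given by:
(i × h1 + h1 × h2 + … + h(n−1) × hn + hn × o) + (h1 + h2 + … + hn) + o
In other words, add up the products of the sizes of every pair of adjacent layers, then add
one bias per neuron in every layer after the input layer.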
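The counting rule worked out in the two examples above can be sketched in a few lines of Python. This is only an illustration: the function name count_parameters and the list-of-layer-sizes argument are our own choices, and the sketch assumes a fully connected network with one bias per neuron in every layer after the input layer.

```python
def count_parameters(layer_sizes):
    """Count trainable parameters of a fully connected feed-forward network.

    layer_sizes lists the number of neurons per layer, input layer first
    and output layer last, e.g. [3, 4, 2].
    """
    # One weight for every pair of neurons in adjacent layers.
    weights = sum(a * b for a, b in zip(layer_sizes, layer_sizes[1:]))
    # One bias for every neuron that is not in the input layer.
    biases = sum(layer_sizes[1:])
    return weights + biases


one_hidden = count_parameters([3, 4, 2])          # first example
three_hidden = count_parameters([3, 5, 6, 4, 2])  # Scenario 1
print(one_hidden)    # 26
print(three_hidden)  # 94
```

Both results agree with the totals counted by hand above.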
Now, how is the error function used in backpropagation, and how does backpropagation work?
Input values
x1 = 0.05
x2 = 0.10
Initial weights
w1 = 0.15    w5 = 0.40
w2 = 0.20    w6 = 0.45
w3 = 0.25    w7 = 0.50
w4 = 0.30    w8 = 0.55
Bias values
b1 = 0.35    b2 = 0.60
Target values
T1 = 0.01
T2 = 0.99
Now, we first calculate the values of H1 and H2 by a forward pass.
Forward Pass
To find the value of H1, we multiply each input value by its weight and add the bias:
H1=x1×w1+x2×w2+b1
H1=0.05×0.15+0.10×0.20+0.35
H1=0.3775
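Passing H1 through the sigmoid activation gives H1final = 1 / (1 + e^(−H1)) = 0.593269992.
In the same way, H2 = x1×w3 + x2×w4 + b1 = 0.3925 and H2final = 0.596884378.
Now, we calculate the values of y1 and y2 in the same way as we calculated H1 and H2.
To find the value of y1, we multiply the outputs of H1 and H2 (the sigmoid values above)
by the weights and add the bias: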
y1=H1×w5+H2×w6+b2
y1=0.593269992×0.40+0.596884378×0.45+0.60
y1=1.10590597
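Applying the sigmoid to y1 gives y1final = 0.75136507. The same steps for the second output
give y2 = H1×w7 + H2×w8 + b2 = 1.2249214 and y2final = 0.772928465.
Our target values are T1 = 0.01 and T2 = 0.99, so the outputs do not match the targets.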
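The forward pass can be checked with a short Python sketch. It assumes a sigmoid activation on every neuron (this is what reproduces the value 0.593269992 used for y1 above), that w3 and w4 feed H2, that w7 and w8 feed y2, and that b1 and b2 are each shared by both neurons of their layer. The variable names are our own.

```python
import math


def sigmoid(z):
    """Logistic activation, assumed here for every neuron."""
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99

# Hidden layer: weighted sum, then activation.
h1 = x1 * w1 + x2 * w2 + b1            # about 0.3775
h2 = x1 * w3 + x2 * w4 + b1            # about 0.3925
h1_final, h2_final = sigmoid(h1), sigmoid(h2)

# Output layer: the hidden activations are the inputs.
y1 = h1_final * w5 + h2_final * w6 + b2
y2 = h1_final * w7 + h2_final * w8 + b2
y1_final, y2_final = sigmoid(y1), sigmoid(y2)

# Squared-error loss summed over both outputs.
e_total = 0.5 * (t1 - y1_final) ** 2 + 0.5 * (t2 - y2_final) ** 2

print(h1_final, h2_final)   # about 0.593269992 and 0.596884378
print(y1, y1_final)         # about 1.10590597 and 0.75136507
print(e_total)              # about 0.298371109
```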
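Now we find the total error, which measures the difference between the network outputs and
the target outputs. It is calculated as the sum of half the squared differences:
Etotal = ½ (T1 − y1final)² + ½ (T2 − y2final)² = 0.298371109
where y1final and y2final are the outputs of the network after the activation function.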
To update a weight, we calculate the error attributable to that weight from the total error.
The error on weight w is obtained by differentiating the total error with respect to w:
Error_w = ∂Etotal/∂w
The expression for Etotal contains no w5 term, so it cannot be differentiated with respect
to w5 directly. We use the chain rule to split the derivative into terms that can each be
differentiated:
∂Etotal/∂w5 = (∂Etotal/∂y1final) × (∂y1final/∂y1) × (∂y1/∂w5)
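Now we calculate each term one by one to differentiate Etotal with respect to w5. For a
sigmoid output and the squared error, the three terms are:
∂Etotal/∂y1final = y1final − T1
∂y1final/∂y1 = y1final × (1 − y1final)
∂y1/∂w5 = H1final
The second term is obtained by putting the value of e^(−y) in equation (5). Multiplying the
three terms gives ∂Etotal/∂w5, and the weight is updated as w5new = w5 − α × ∂Etotal/∂w5,
where α is the learning rate.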
In the same way, we calculate w6new, w7new and w8new. This gives the following values:
w5new = 0.35891648
w6new = 0.408666186
w7new = 0.511301270
w8new = 0.561370121
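The output-layer updates can be reproduced with the sketch below. Two things are assumptions on our part: the sigmoid activation, and a learning rate of 0.5, which is not stated in these notes but is the value that reproduces the updated weights listed above. The biases are left unchanged, as in the worked example.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
lr = 0.5  # assumed learning rate (not given in the notes)

# Forward pass.
h1_final = sigmoid(x1 * w1 + x2 * w2 + b1)
h2_final = sigmoid(x1 * w3 + x2 * w4 + b1)
y1_final = sigmoid(h1_final * w5 + h2_final * w6 + b2)
y2_final = sigmoid(h1_final * w7 + h2_final * w8 + b2)

# First two chain-rule terms for each output neuron:
# dEtotal/dy_final * dy_final/dy
delta1 = (y1_final - t1) * y1_final * (1.0 - y1_final)
delta2 = (y2_final - t2) * y2_final * (1.0 - y2_final)

# Third term is the hidden activation feeding the weight.
w5_new = w5 - lr * delta1 * h1_final
w6_new = w6 - lr * delta1 * h2_final
w7_new = w7 - lr * delta2 * h1_final
w8_new = w8 - lr * delta2 * h2_final

print(w5_new, w6_new, w7_new, w8_new)
```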
Backward pass at Hidden layer:
Now we backpropagate to the hidden layer and update the weights w1, w2, w3 and w4, just as
we did for w5, w6, w7 and w8.
As before, Etotal contains no w1 term, so it cannot be differentiated with respect to w1
directly. We use the chain rule to split the derivative:
∂Etotal/∂w1 = (∂Etotal/∂H1final) × (∂H1final/∂H1) × (∂H1/∂w1)
Now we calculate each term one by one. Etotal has no H1final term, so we split the first
term again, where E1 and E2 are the error terms of the two outputs:
∂Etotal/∂H1final = ∂E1/∂H1final + ∂E2/∂H1final
E1 and E2 in turn have no y1 and y2 terms, so we split both again:
∂E1/∂H1final = (∂E1/∂y1) × (∂y1/∂H1final)
∂E2/∂H1final = (∂E2/∂y2) × (∂y2/∂H1final)
We split these as
∂E1/∂y1 = (∂E1/∂y1final) × (∂y1final/∂y1)
∂E2/∂y2 = (∂E2/∂y2final) × (∂y2final/∂y2)
Now we find these values by putting values in equations (18) and (19).
We calculate the partial derivative of the total net input to H1 with respect to w1 in the
same way as we did for the output neuron:
∂H1/∂w1 = x1
In the same way, we calculate w2new, w3new and w4new. This gives the following values:
w1new = 0.149780716
w2new = 0.19956143
w3new = 0.24975114
w4new = 0.29950229
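The hidden-layer updates can be checked in the same way. As before, the sigmoid activation and the learning rate of 0.5 are assumptions that reproduce the numbers above. Note that the original values of w5 to w8 are used when propagating the error back, not the updated ones.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


x1, x2 = 0.05, 0.10
w1, w2, w3, w4 = 0.15, 0.20, 0.25, 0.30
w5, w6, w7, w8 = 0.40, 0.45, 0.50, 0.55
b1, b2 = 0.35, 0.60
t1, t2 = 0.01, 0.99
lr = 0.5  # assumed learning rate (not given in the notes)

# Forward pass.
h1_final = sigmoid(x1 * w1 + x2 * w2 + b1)
h2_final = sigmoid(x1 * w3 + x2 * w4 + b1)
y1_final = sigmoid(h1_final * w5 + h2_final * w6 + b2)
y2_final = sigmoid(h1_final * w7 + h2_final * w8 + b2)

# Error signal at each output neuron (dE/dy for that neuron).
delta1 = (y1_final - t1) * y1_final * (1.0 - y1_final)
delta2 = (y2_final - t2) * y2_final * (1.0 - y2_final)

# dEtotal/dH_final sums the contributions from both outputs,
# then is multiplied by the sigmoid derivative of the hidden neuron.
delta_h1 = (delta1 * w5 + delta2 * w7) * h1_final * (1.0 - h1_final)
delta_h2 = (delta1 * w6 + delta2 * w8) * h2_final * (1.0 - h2_final)

# Last chain-rule term is the input feeding each weight.
w1_new = w1 - lr * delta_h1 * x1
w2_new = w2 - lr * delta_h1 * x2
w3_new = w3 - lr * delta_h2 * x1
w4_new = w4 - lr * delta_h2 * x2

print(w1_new, w2_new, w3_new, w4_new)
```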
We have now updated all the weights. When we first fed forward the inputs 0.05 and 0.1, the
error on the network was 0.298371109. After the first round of backpropagation, the total
error is down to 0.291027924. After repeating this process 10,000 times, the total error is
down to 0.0000351085. At this point, the output neurons generate 0.015912196 and
0.984065734, which are close to our target values of 0.01 and 0.99 when we feed forward
0.05 and 0.1.
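Repeating the process is just a loop around the forward and backward passes. The sketch below is a minimal version under the same assumptions as the worked example: sigmoid activations, a learning rate of 0.5 (assumed, not stated in the notes), all eight weights updated together from the same forward pass, and biases kept fixed. The function names are our own.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def forward(w, x, b):
    """Return (H1final, H2final, y1final, y2final) for weights w1..w8."""
    h1 = sigmoid(x[0] * w[0] + x[1] * w[1] + b[0])
    h2 = sigmoid(x[0] * w[2] + x[1] * w[3] + b[0])
    y1 = sigmoid(h1 * w[4] + h2 * w[5] + b[1])
    y2 = sigmoid(h1 * w[6] + h2 * w[7] + b[1])
    return h1, h2, y1, y2


def total_error(outputs, targets):
    return sum(0.5 * (t - y) ** 2 for y, t in zip(outputs, targets))


def train_step(w, x, b, t, lr):
    """One round of backpropagation; returns the updated weight list."""
    h1, h2, y1, y2 = forward(w, x, b)
    d1 = (y1 - t[0]) * y1 * (1.0 - y1)
    d2 = (y2 - t[1]) * y2 * (1.0 - y2)
    dh1 = (d1 * w[4] + d2 * w[6]) * h1 * (1.0 - h1)
    dh2 = (d1 * w[5] + d2 * w[7]) * h2 * (1.0 - h2)
    grads = [dh1 * x[0], dh1 * x[1], dh2 * x[0], dh2 * x[1],
             d1 * h1, d1 * h2, d2 * h1, d2 * h2]
    return [wi - lr * g for wi, g in zip(w, grads)]


x, t, b = (0.05, 0.10), (0.01, 0.99), (0.35, 0.60)
w = [0.15, 0.20, 0.25, 0.30, 0.40, 0.45, 0.50, 0.55]
lr = 0.5  # assumed learning rate

error_start = total_error(forward(w, x, b)[2:], t)
w = train_step(w, x, b, t, lr)
error_after_one = total_error(forward(w, x, b)[2:], t)
for _ in range(9999):  # 10,000 rounds in total
    w = train_step(w, x, b, t, lr)
y1_out, y2_out = forward(w, x, b)[2:]
error_final = total_error((y1_out, y2_out), t)

print(error_start, error_after_one, error_final)
print(y1_out, y2_out)
```

The printed errors and outputs can be compared with the figures quoted above.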
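The technique takes its name from this calculation: it keeps a decaying average of the
squared partial derivatives and takes the square root of that average, i.e., the square root
of the mean squared partial derivatives, or root mean square (RMS). For example, the custom
step size for a parameter may be written as:
cust_step_size(t+1) = step_size / (1e-8 + RMS(s(t+1)))
where step_size is the initial step size, s(t+1) is the decaying average of the squared
partial derivatives for the parameter, RMS(s(t+1)) is its square root, and the small
constant 1e-8 avoids division by zero.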
Once we have the custom step size for the parameter, we can update the parameter using
the custom step size and the partial derivative f'(x(t)).
This process is then repeated for each input variable until a new point in the search space is
created and can be evaluated.
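The RMSProp procedure described above can be sketched as follows. This is a minimal illustration, not a reference implementation: the test function f(x, y) = x² + y², the names rmsprop_minimize, step_size and rho, and the chosen values (0.01, 0.99, 200 iterations) are all assumptions made for the example.

```python
import math


def rmsprop_minimize(grad, x0, step_size=0.01, rho=0.99, n_iter=200, eps=1e-8):
    """Gradient descent with a per-parameter RMSProp step size."""
    x = list(x0)
    s = [0.0] * len(x)  # decaying average of squared partial derivatives
    for _ in range(n_iter):
        g = grad(x)
        for i in range(len(x)):
            # Update the decaying average for this parameter.
            s[i] = rho * s[i] + (1.0 - rho) * g[i] ** 2
            # Custom step size: base step divided by the RMS.
            cust_step_size = step_size / (eps + math.sqrt(s[i]))
            # Move against the partial derivative.
            x[i] -= cust_step_size * g[i]
    return x


def objective(x):
    return x[0] ** 2 + x[1] ** 2


def gradient(x):
    return [2.0 * x[0], 2.0 * x[1]]


start = [1.0, -0.5]
solution = rmsprop_minimize(gradient, start)
print(solution, objective(solution))
```

Each parameter gets its own step size, so a parameter with consistently large partial derivatives takes smaller steps than one with small partial derivatives.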
RMSProp is a very effective extension of gradient descent and is one of the preferred
approaches generally used to fit deep learning neural networks.
Adam:
The Adam optimization algorithm is an extension of stochastic gradient descent. It is used
to update network weights iteratively based on training data. Adam was presented by Diederik
Kingma from OpenAI and Jimmy Ba. The name Adam is derived from adaptive moment
estimation.
Benefits of Adam:
– Straightforward to implement.
– Computationally efficient.
– Low memory requirements.
– Invariant to diagonal rescaling of the gradients.
– Well suited for problems that are large in terms of data and/or parameters.
– Appropriate for non-stationary objectives.
– Appropriate for problems with very noisy and/or sparse gradients.
– Hyper-parameters have an intuitive interpretation and typically require little tuning.
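These notes do not spell out the Adam update itself, so the sketch below uses the commonly published form of the rule: decaying averages of the gradient (first moment) and of the squared gradient (second moment), each corrected for its start-up bias. The test function, the function name adam_minimize and the hyper-parameter values (alpha = 0.1, beta1 = 0.9, beta2 = 0.999) are assumptions made for the example.

```python
import math


def adam_minimize(grad, x0, alpha=0.1, beta1=0.9, beta2=0.999,
                  eps=1e-8, n_iter=500):
    """Minimise a function with the standard Adam update rule."""
    x = list(x0)
    m = [0.0] * len(x)  # first moment: decaying average of gradients
    v = [0.0] * len(x)  # second moment: decaying average of squared gradients
    for t in range(1, n_iter + 1):
        g = grad(x)
        for i in range(len(x)):
            m[i] = beta1 * m[i] + (1.0 - beta1) * g[i]
            v[i] = beta2 * v[i] + (1.0 - beta2) * g[i] ** 2
            # Bias correction: both averages start at zero.
            m_hat = m[i] / (1.0 - beta1 ** t)
            v_hat = v[i] / (1.0 - beta2 ** t)
            x[i] -= alpha * m_hat / (math.sqrt(v_hat) + eps)
    return x


def gradient(x):
    # Gradient of f(x, y) = x**2 + y**2.
    return [2.0 * x[0], 2.0 * x[1]]


adam_solution = adam_minimize(gradient, [1.0, -0.5])
print(adam_solution)
```

The only state kept per parameter is the pair (m, v), which is why the memory requirement is small.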
The main objective of weight initialization is to prevent layer activation outputs from
exploding or vanishing during forward propagation. If either problem occurs, the loss
gradients will be either too large or too small, and the network will take longer to
converge, if it is able to converge at all.
If we initialize the weights correctly, the loss function can be optimized in the least
time; otherwise, converging to a minimum using gradient descent may be impossible.
One of the important things to keep in mind while building a neural network is to
initialize the weight matrices for the connections between layers correctly.
Let us look at the following two initialization scenarios, which can cause issues while
training the model:
Zero initialization:
If we initialize all the weights to 0, the derivative of the loss function is the same for
every weight in W[l], so all weights have the same value in subsequent iterations. This
makes the hidden units symmetric, and the symmetry persists through all n iterations. Thus,
initializing the weights to zero makes the network no better than a linear model. Note that
setting the biases to 0 does not create any problems, because non-zero weights take care of
breaking the symmetry: even if the bias is 0, the values in every neuron will still be
different.
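The symmetry problem can be seen on a toy network. The sketch below assumes a network with 2 inputs, 3 hidden sigmoid units, 1 sigmoid output, a squared-error loss and no biases; the input, target, random seed and function names are arbitrary choices for the illustration.

```python
import math
import random


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def gradients(w_hidden, w_out, x, target):
    """Return (dLoss/dW_hidden, dLoss/dW_out) for one training example."""
    a = [sigmoid(sum(w * xi for w, xi in zip(row, x))) for row in w_hidden]
    y = sigmoid(sum(w * ai for w, ai in zip(w_out, a)))
    delta_out = (y - target) * y * (1.0 - y)
    grad_out = [delta_out * ai for ai in a]
    grad_hidden = [
        [delta_out * w_out[j] * a[j] * (1.0 - a[j]) * xi for xi in x]
        for j in range(len(w_hidden))
    ]
    return grad_hidden, grad_out


x, target = [0.5, -1.0], 1.0

# Zero initialization: every hidden unit receives the same gradient.
zero_hidden = [[0.0, 0.0] for _ in range(3)]
zero_out = [0.0, 0.0, 0.0]
zero_grad_hidden, zero_grad_out = gradients(zero_hidden, zero_out, x, target)

# Random initialization: the gradients differ from unit to unit.
random.seed(0)
rand_hidden = [[random.uniform(-1, 1) for _ in range(2)] for _ in range(3)]
rand_out = [random.uniform(-1, 1) for _ in range(3)]
rand_grad_hidden, rand_grad_out = gradients(rand_hidden, rand_out, x, target)

print(zero_grad_out)   # three identical values
print(rand_grad_out)   # three different values
```

With zero weights the three hidden units start identical and receive identical updates, so they remain copies of each other after every iteration.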
Random initialization:
– This technique tries to address the problems of zero initialization. It prevents neurons
from learning the same features of their inputs, since our goal is to make each neuron learn
a different function of its input, and it gives much better accuracy than zero
initialization.
– In general, it is used to break the symmetry. It is better to assign random non-zero
values to the weights.
– Remember, neural networks are very sensitive and prone to overfitting, as they quickly
memorize the training data.
“What happens if the randomly initialized weights are very high or very low?”
(a) Vanishing gradients:
For many activation functions, abs(dW) gets smaller and smaller as we go backward through
the layers during backpropagation, especially in deep neural networks. In this case, the
weights of the earlier layers are adjusted slowly.
Because the weight updates are minor, convergence is slower.
This makes the optimization of our loss function slow. In the worst case, it may completely
stop the neural network from training further.
More specifically, in the case of the sigmoid and tanh activation functions, if your
weights are very large, the gradient will be vanishingly small, effectively preventing the
weights from changing their value. This is because abs(dW) will increase very slightly or
possibly get smaller and smaller after every iteration.
This is where the ReLU activation function is useful: with ReLU, vanishing gradients are
generally not a problem, as the gradient is 0 for negative (and zero) values of inputs and 1
for positive values of inputs.
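A rough way to see the effect is to multiply the activation derivatives along a chain of layers. The sketch below is a simplification: it assumes one neuron per layer, the same pre-activation value z at every layer, and weights of 1, so that only the activation derivative matters. The function names are our own.

```python
import math


def sigmoid_grad(z):
    s = 1.0 / (1.0 + math.exp(-z))
    return s * (1.0 - s)  # never larger than 0.25


def relu_grad(z):
    return 1.0 if z > 0 else 0.0


def chained_gradient(grad_fn, z, n_layers):
    """Product of activation derivatives across n_layers layers."""
    product = 1.0
    for _ in range(n_layers):
        product *= grad_fn(z)
    return product


# Best case for the sigmoid (z = 0, derivative 0.25) over 10 layers.
sigmoid_best = chained_gradient(sigmoid_grad, 0.0, 10)
# Saturated sigmoid (large input, e.g. caused by large weights).
sigmoid_saturated = chained_gradient(sigmoid_grad, 5.0, 10)
# ReLU with a positive input passes the gradient through unchanged.
relu_positive = chained_gradient(relu_grad, 1.0, 10)

print(sigmoid_best)       # 0.25 ** 10, roughly 9.5e-07
print(sigmoid_saturated)  # far smaller still
print(relu_positive)      # 1.0
```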
(b) Exploding gradients:
This is the exact opposite of the vanishing gradients case discussed above.
Suppose we have weights that are non-negative and large, with small activations A.
When these weights are multiplied together across the layers, they cause a very large
change in the value of the overall gradient (cost). This means that the changes in W, given
by the equation W = W - ⍺ * dW, will be in huge steps, so each downward step grows larger.
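The growth can be illustrated with a very small sketch. It assumes one neuron per layer, the same weight at every layer and an activation derivative of 1 (for example, ReLU with positive inputs), so the gradient reaching the first layer is scaled by the product of the weights. The function name and the values 1.5, 0.5 and 30 layers are arbitrary choices.

```python
def backprop_scale(weight, n_layers):
    """Factor by which a gradient is scaled after passing back n_layers."""
    scale = 1.0
    for _ in range(n_layers):
        scale *= weight
    return scale


exploding = backprop_scale(1.5, 30)  # weights larger than 1
vanishing = backprop_scale(0.5, 30)  # weights smaller than 1

# Effect on one update W = W - alpha * dW with alpha = 0.01 and dW = exploding.
alpha, w = 0.01, 1.5
w_after_update = w - alpha * exploding

print(exploding)        # roughly 1.9e+05
print(vanishing)        # roughly 9.3e-10
print(w_after_update)   # the weight jumps far away from its old value
```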
What is an Eigenvalue?
Mathematically, the eigenvalue is the number by which the eigenvector is multiplied to
produce the same result as multiplying the matrix by that vector, as shown in Equation 1.
Ax = λx    (1)
where A is a square matrix, λ is the eigenvalue and x is the eigenvector.
Details of how to calculate the determinant of a matrix can be found in a linear algebra
textbook.
(A − λI)x = 0    (2)
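For a 2 × 2 matrix, Equation 2 has a non-zero solution x only when the determinant of (A − λI) is zero, which gives the quadratic λ² − trace(A) × λ + det(A) = 0. The sketch below solves that quadratic and checks Equation 1 for an example matrix. The matrix, the eigenvectors (worked out by hand from Equation 2) and the function names are our own choices, and the sketch only handles real eigenvalues.

```python
import math


def eigenvalues_2x2(a):
    """Real eigenvalues of a 2x2 matrix from det(A - lambda*I) = 0."""
    (p, q), (r, s) = a
    trace = p + s
    det = p * s - q * r
    disc = trace ** 2 - 4.0 * det
    if disc < 0:
        raise ValueError("eigenvalues are complex")
    root = math.sqrt(disc)
    return (trace + root) / 2.0, (trace - root) / 2.0


def matvec(a, x):
    """Multiply a 2x2 matrix by a 2-vector."""
    return [a[0][0] * x[0] + a[0][1] * x[1],
            a[1][0] * x[0] + a[1][1] * x[1]]


A = [[4.0, 1.0], [2.0, 3.0]]
lam1, lam2 = eigenvalues_2x2(A)
print(lam1, lam2)  # 5.0 2.0

# Eigenvectors found by solving (A - lambda*I)x = 0 by hand.
v1 = [1.0, 1.0]    # for lambda = 5
v2 = [1.0, -2.0]   # for lambda = 2
print(matvec(A, v1))  # [5.0, 5.0], i.e. 5 * v1
print(matvec(A, v2))  # [2.0, -4.0], i.e. 2 * v2
```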