Back in NN
A neural network minimises its loss function through gradient descent by finding the optimum values of the weights and biases using backpropagation.
The gradient descent algorithm gives us the following parameter update equation:

wˡₖⱼ = wˡₖⱼ − η (∂L/∂wˡₖⱼ)

where k and j are the indices of the weight in the weight matrix and l is the index of the layer.
Given the neural network in the diagram, for the output layer, the following weights
and bias terms will be updated using the gradient descent update equation:
Now, for the hidden layer, the following weights and biases will be updated:
To compute these gradients, you use an algorithm called backpropagation.
As you can see, these formulas involve partial derivatives of the loss function L with respect to the weights and biases. To compute these derivatives, you use the chain rule, and you can observe how the gradient computation at one layer depends on quantities from the other layers.
Now, let’s simplify the neural network given above and represent it in a condensed
format as shown below.
In this case, the loss function is a function of w₁, b₁, w₂ and b₂.
The loss function, the activation function and the cumulative inputs are shown in the
following expressions:
Now, let’s compute the gradient of the loss function with respect to one of the
weights to understand how backpropagation works.
Suppose you want to calculate ∂L/∂w₂, that is, the gradient of the loss function with respect to w₂. Using the chain rule:

∂L/∂w₂ = (∂L/∂h₂)(∂h₂/∂z₂)(∂z₂/∂w₂)

Based on the definition of the loss function, L is a direct function of h₂, h₂ is a function of z₂, and z₂ is a function of w₂.
Now, let’s see how each of the three terms on the RHS of the equation is computed:

L = (1/2)(y − h₂)²  →  ∂L/∂h₂ = ∂/∂h₂ [(1/2)(y − h₂)²] = −(y − h₂)

h₂ = tanh(z₂)  →  ∂h₂/∂z₂ = 1 − tanh²(z₂) = 1 − (h₂)²

z₂ = w₂h₁ + b₂  →  ∂z₂/∂w₂ = h₁
Hence, you get the gradient of the loss function with respect to w₂:

∂L/∂w₂ = [−(y − h₂)][1 − (h₂)²][h₁]
With this, you have completed the computation of the gradient of the loss function
with respect to the weight 𝑤2 for backpropagation.
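To see this chain of derivatives concretely, here is a minimal Python sketch of the condensed network, assuming tanh activations at both layers and the squared-error loss above; the function names and any sample values are illustrative, not taken from the text.

```python
import math

# Condensed network: x -> (w1, b1, tanh) -> h1 -> (w2, b2, tanh) -> h2
# Loss: L = 0.5 * (y - h2)^2
def forward(x, w1, b1, w2, b2):
    h1 = math.tanh(w1 * x + b1)   # hidden activation
    h2 = math.tanh(w2 * h1 + b2)  # output activation
    return h1, h2

def grad_w2(x, y, w1, b1, w2, b2):
    h1, h2 = forward(x, w1, b1, w2, b2)
    dL_dh2 = -(y - h2)       # derivative of 0.5 * (y - h2)^2 w.r.t. h2
    dh2_dz2 = 1 - h2 ** 2    # derivative of tanh, expressed via h2
    dz2_dw2 = h1             # since z2 = w2 * h1 + b2
    return dL_dh2 * dh2_dz2 * dz2_dw2  # chain rule
```

Comparing `grad_w2` with a finite-difference estimate of the loss is a quick sanity check for any hand-derived gradient.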
The housing data set has two inputs, which are the size of the house and the number of
rooms available, and one output, which is the price of the house.
As seen in the computation of the forward pass, we randomly initialise the weights and
biases in the network. Let’s now take the same initialisation and the same input
observation that you used earlier while doing forward propagation.
So, we have:
As previously calculated, the output prediction h²₁ obtained is 0.63, whereas the actual output y is −0.54. Using backpropagation, let’s update the weights and biases such that this difference between the predicted and the actual output gets minimised.
The steps taken to update the weights and biases between the hidden layer and the
output layer are as follows.
First, you will focus on the weights of the output layer. Let’s take the gradient of L with respect to w²₁₁.
You know that:

∂L/∂w²₁₁ = (∂L/∂h²₁)(∂h²₁/∂z²₁)(∂z²₁/∂w²₁₁) (using the chain rule)
1) ∂L/∂h²₁ = ∂/∂h²₁ [(1/2)(y − h²₁)²] = −(y − h²₁)

∂L/∂h²₁ = −(−0.54 − 0.63) = 1.17

2) ∂h²₁/∂z²₁ = 1, as h²₁ = z²₁ (linear activation)

3) ∂z²₁/∂w²₁₁ = ∂/∂w²₁₁ (b²₁ + w²₁₁h¹₁ + w²₁₂h¹₂)

∂z²₁/∂w²₁₁ = h¹₁ = 0.484

Hence, this evaluates to ∂L/∂w²₁₁ = 1.17 × 1 × 0.484 = 0.5663.
Now, using the update rule for gradient descent and considering the learning rate η as 0.2:

w²₁₁(updated) = w²₁₁ − η ∂L/∂w²₁₁ = 0.3 − (0.2 × 0.5663) = 0.1867
Similarly, ∂L/∂w²₁₂ = (∂L/∂h²₁)(∂h²₁/∂z²₁)(∂z²₁/∂w²₁₂).
Since you have already computed the first two derivatives, let’s now compute the third one.

∂z²₁/∂w²₁₂ = ∂/∂w²₁₂ (b²₁ + w²₁₁h¹₁ + w²₁₂h¹₂)

∂z²₁/∂w²₁₂ = h¹₂ = 0.424

Hence, this evaluates to ∂L/∂w²₁₂ = 1.17 × 1 × 0.424 = 0.4961.
Now,

w²₁₂(updated) = w²₁₂ − η ∂L/∂w²₁₂ = 0.2 − (0.2 × 0.4961) = 0.1008.
Finally, for the bias b²₁: you have already computed the first two derivatives, and the third one can be computed as shown below.

∂z²₁/∂b²₁ = ∂/∂b²₁ (b²₁ + w²₁₁h¹₁ + w²₁₂h¹₂)

∂z²₁/∂b²₁ = 1

Hence, this evaluates to ∂L/∂b²₁ = 1.17 × 1 × 1 = 1.17.
Now,

b²₁(updated) = b²₁ − η ∂L/∂b²₁ = 0.4 − (0.2 × 1.17) = 0.166.
So, you have the updated values of the weights and biases of the output layer from a single iteration:

w²₁₁(updated) = 0.1867, w²₁₂(updated) = 0.1008, b²₁(updated) = 0.166
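The three output-layer updates can be cross-checked with a short Python script; the variable names are illustrative, and all numbers come from the worked example above.

```python
# Output-layer gradients for the worked example (linear output activation).
y = -0.54              # actual output
h_out = 0.63           # predicted output h²₁
h1, h2 = 0.484, 0.424  # hidden activations h¹₁ and h¹₂
eta = 0.2              # learning rate

dL_dh = -(y - h_out)         # = 1.17
grad_w11 = dL_dh * 1 * h1    # dh/dz = 1 for a linear activation
grad_w12 = dL_dh * 1 * h2
grad_b1 = dL_dh * 1 * 1

# Gradient descent updates from the initial values w²₁₁=0.3, w²₁₂=0.2, b²₁=0.4
w11_new = 0.3 - eta * grad_w11
w12_new = 0.2 - eta * grad_w12
b1_new = 0.4 - eta * grad_b1
```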
The steps involved in computing the updated weights and biases of the hidden layer are shown in the image below.
Now, let’s start with computing the weights and biases corresponding to the first
neuron of the hidden layer.
Taking the gradient of L with respect to w¹₁₁, you can say that:

∂L/∂w¹₁₁ = (∂L/∂h²₁)(∂h²₁/∂h¹₁)(∂h¹₁/∂w¹₁₁) = (∂L/∂h²₁)(∂h²₁/∂h¹₁)[(∂h¹₁/∂z¹₁)(∂z¹₁/∂w¹₁₁)]

The first term is the same as before:

∂L/∂h²₁ = ∂/∂h²₁ [(1/2)(y − h²₁)²] = −(y − h²₁)

∂L/∂h²₁ = −(−0.54 − 0.63) = 1.17
Now, let’s compute the second, third and fourth derivative terms.

1) ∂h²₁/∂h¹₁ = ∂/∂h¹₁ (b²₁ + w²₁₁h¹₁ + w²₁₂h¹₂) = w²₁₁ = 0.30

2) ∂h¹₁/∂z¹₁ = σ(z¹₁)(1 − σ(z¹₁)) = h¹₁(1 − h¹₁) = 0.484(1 − 0.484)

3) ∂z¹₁/∂w¹₁₁ = ∂/∂w¹₁₁ (b¹₁ + w¹₁₁x₁ + w¹₁₂x₂) = x₁ = −0.32

Hence, this evaluates to ∂L/∂w¹₁₁ = 1.17 × 0.30 × 0.484 × (1 − 0.484) × (−0.32) = −0.028.
Now,

w¹₁₁(updated) = w¹₁₁ − η ∂L/∂w¹₁₁ = 0.2 − 0.2 × (−0.028) = 0.2056.
Similarly, ∂L/∂w¹₁₂ = (∂L/∂h²₁)(∂h²₁/∂h¹₁)[(∂h¹₁/∂z¹₁)(∂z¹₁/∂w¹₁₂)].
Since you have already computed the values of the first three terms, you simply need to calculate the pending derivative term.

∂z¹₁/∂w¹₁₂ = ∂/∂w¹₁₂ (b¹₁ + w¹₁₁x₁ + w¹₁₂x₂) = x₂ = −0.66

Hence, this evaluates to ∂L/∂w¹₁₂ = 1.17 × 0.30 × 0.484 × (1 − 0.484) × (−0.66) = −0.058.
Now,

w¹₁₂(updated) = w¹₁₂ − η ∂L/∂w¹₁₂ = 0.15 − 0.2 × (−0.058) = 0.1616.
Similarly, ∂L/∂b¹₁ = (∂L/∂h²₁)(∂h²₁/∂h¹₁)[(∂h¹₁/∂z¹₁)(∂z¹₁/∂b¹₁)], where ∂z¹₁/∂b¹₁ = 1.

Hence, this evaluates to ∂L/∂b¹₁ = 1.17 × 0.30 × 0.484 × (1 − 0.484) × 1 = 0.088.
Now,

b¹₁(updated) = b¹₁ − η ∂L/∂b¹₁ = 0.1 − 0.2 × (0.088) = 0.0824.
Hence, for the first node, the updated values of the weights and biases, computed using gradient descent with a learning rate η of 0.2, are:

w¹₁₁(updated) = w¹₁₁ − η ∂L/∂w¹₁₁ = 0.2 − 0.2 × (−0.028) = 0.2056

w¹₁₂(updated) = w¹₁₂ − η ∂L/∂w¹₁₂ = 0.15 − 0.2 × (−0.058) = 0.1616

b¹₁(updated) = b¹₁ − η ∂L/∂b¹₁ = 0.1 − 0.2 × (0.088) = 0.0824
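As a cross-check, the first hidden neuron's chain-rule product can be evaluated in Python with the numbers above (variable names are illustrative):

```python
# First hidden neuron: all three gradients share the common factor
# (dL/dh_out) * (dh_out/dh1) * (dh1/dz1); only dz1/dparam differs.
y, h_out = -0.54, 0.63
x1, x2 = -0.32, -0.66   # inputs to the network
h1 = 0.484              # sigmoid activation of hidden neuron 1
w11_out = 0.30          # output-layer weight w²₁₁
eta = 0.2

common = -(y - h_out) * w11_out * h1 * (1 - h1)

grad_w11 = common * x1  # dz1/dw11 = x1
grad_w12 = common * x2  # dz1/dw12 = x2
grad_b1 = common        # dz1/db1 = 1

w11_new = 0.2 - eta * grad_w11
w12_new = 0.15 - eta * grad_w12
b1_new = 0.1 - eta * grad_b1
```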
In the same manner, you calculate the weights and biases corresponding to the second neuron in the hidden layer.
Starting with the derivative of the loss function L with respect to w¹₂₁:

∂L/∂w¹₂₁ = (∂L/∂h²₁)(∂h²₁/∂h¹₂)(∂h¹₂/∂w¹₂₁) = (∂L/∂h²₁)(∂h²₁/∂h¹₂)[(∂h¹₂/∂z¹₂)(∂z¹₂/∂w¹₂₁)]

1) ∂h²₁/∂h¹₂ = ∂/∂h¹₂ (b²₁ + w²₁₁h¹₁ + w²₁₂h¹₂) = w²₁₂ = 0.20

2) ∂h¹₂/∂z¹₂ = σ(z¹₂)(1 − σ(z¹₂)) = h¹₂(1 − h¹₂) = 0.424(1 − 0.424)

3) ∂z¹₂/∂w¹₂₁ = ∂/∂w¹₂₁ (b¹₂ + w¹₂₁x₁ + w¹₂₂x₂) = x₁ = −0.32
Also, for w¹₂₂ and b¹₂, the first three terms will remain the same; only the last term will change. Hence, you will compute only the last term.

∂z¹₂/∂w¹₂₂ = x₂ = −0.66

∂z¹₂/∂b¹₂ = 1
Hence, for the second node:

∂L/∂w¹₂₁ = (∂L/∂h²₁)(∂h²₁/∂h¹₂)[(∂h¹₂/∂z¹₂)(∂z¹₂/∂w¹₂₁)] = 1.17 × 0.20 × 0.424 × (1 − 0.424) × (−0.32) = −0.018

∂L/∂w¹₂₂ = (∂L/∂h²₁)(∂h²₁/∂h¹₂)[(∂h¹₂/∂z¹₂)(∂z¹₂/∂w¹₂₂)] = 1.17 × 0.20 × 0.424 × (1 − 0.424) × (−0.66) = −0.038

∂L/∂b¹₂ = (∂L/∂h²₁)(∂h²₁/∂h¹₂)[(∂h¹₂/∂z¹₂)(∂z¹₂/∂b¹₂)] = 1.17 × 0.20 × 0.424 × (1 − 0.424) × 1 = 0.057
Now, computing the updated values of the weights and biases using gradient descent and a learning rate η of 0.2:

w¹₂₁(updated) = w¹₂₁ − η ∂L/∂w¹₂₁ = 0.5 − 0.2 × (−0.018) = 0.5036

w¹₂₂(updated) = w¹₂₂ − η ∂L/∂w¹₂₂ = 0.6 − 0.2 × (−0.038) = 0.6076

b¹₂(updated) = b¹₂ − η ∂L/∂b¹₂ = 0.25 − 0.2 × (0.057) = 0.2386
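The same check for the second hidden neuron, again with illustrative variable names:

```python
# Second hidden neuron: the common factor now uses w²₁₂ and h¹₂.
y, h_out = -0.54, 0.63
x1, x2 = -0.32, -0.66
h2 = 0.424              # sigmoid activation of hidden neuron 2
w12_out = 0.20          # output-layer weight w²₁₂
eta = 0.2

common = -(y - h_out) * w12_out * h2 * (1 - h2)

w21_new = 0.5 - eta * (common * x1)   # dz2/dw21 = x1
w22_new = 0.6 - eta * (common * x2)   # dz2/dw22 = x2
b2_new = 0.25 - eta * common          # dz2/db2 = 1
```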
Given below are the new values for weights and biases after one step of gradient
descent for the hidden and the output layers.
Now, let’s perform another forward pass and check if performing backpropagation
and updating the weights and biases once has helped in reducing the loss.
You can see that the loss computed with the updated weights and biases is lower than before, which is what we want. By repeatedly performing backpropagation to move the weights and biases towards their optimum values, you can continue reducing the loss. This will eventually give you a predicted output that is as close as possible to the actual expected output. This is how a neural network learns using backpropagation.
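That claim can be verified numerically: the sketch below runs the forward pass with the original and the updated parameters and compares the two losses. The network layout (two sigmoid hidden neurons feeding one linear output) and all numbers come from the worked example; the function names are illustrative.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(x, params):
    """Forward pass: two sigmoid hidden neurons feeding one linear output."""
    (w11, w12, b1), (w21, w22, b2), (v1, v2, c) = params
    h1 = sigmoid(b1 + w11 * x[0] + w12 * x[1])
    h2 = sigmoid(b2 + w21 * x[0] + w22 * x[1])
    return c + v1 * h1 + v2 * h2

x, y = (-0.32, -0.66), -0.54

# (hidden neuron 1), (hidden neuron 2), (output neuron), before and after
# the single backpropagation step worked through above.
before = ((0.2, 0.15, 0.1), (0.5, 0.6, 0.25), (0.3, 0.2, 0.4))
after = ((0.2056, 0.1616, 0.0824), (0.5036, 0.6076, 0.2386),
         (0.1867, 0.1008, 0.166))

loss_before = 0.5 * (y - predict(x, before)) ** 2   # prediction ≈ 0.63
loss_after = 0.5 * (y - predict(x, after)) ** 2     # smaller loss
```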
Point 2: For each layer, compute the cumulative input and apply the non-linear
activation function on each neuron of each layer to get the prediction.
Point 4: Assess the performance of the neural network through a loss function, for
example, a cross-entropy loss function for classification and RMSE for regression.
Point 6: Compute the derivative of the loss function with respect to the weights in the
output layer.
Point 7: From the last layer to the first layer, for each layer, compute the gradient of
the loss function with respect to the weights at each layer and all the intermediate
gradients.
Updating the Model Parameters Using an Optimisation Algorithm such as Gradient Descent
Point 8: Once all the gradients of the loss with respect to the weights and biases are
obtained, use the gradient descent update equation to update the values of the
weights and biases.
Point 9: Repeat the process for a specified number of iterations or until the
predictions made by the model are acceptable.
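Points 2–9 can be stitched together into a single training loop. Below is a hypothetical minimal sketch for the same two-hidden-neuron network (sigmoid hidden units, linear output, squared-error loss); all function and variable names are illustrative, and a real implementation would vectorise the parameters as matrices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train(x, y, params, eta=0.2, iters=100):
    """Repeat forward pass, loss, backpropagation and update (Points 2-9)."""
    w11, w12, b1, w21, w22, b2, v1, v2, c = params
    loss = None
    for _ in range(iters):
        # Point 2: forward pass
        h1 = sigmoid(b1 + w11 * x[0] + w12 * x[1])
        h2 = sigmoid(b2 + w21 * x[0] + w22 * x[1])
        pred = c + v1 * h1 + v2 * h2
        # Point 4: squared-error loss
        loss = 0.5 * (y - pred) ** 2
        # Points 6-7: gradients via the chain rule, output layer first
        dL = -(y - pred)
        g_v1, g_v2, g_c = dL * h1, dL * h2, dL
        g1 = dL * v1 * h1 * (1 - h1)   # common factor for hidden neuron 1
        g2 = dL * v2 * h2 * (1 - h2)   # common factor for hidden neuron 2
        # Point 8: gradient descent updates
        v1, v2, c = v1 - eta * g_v1, v2 - eta * g_v2, c - eta * g_c
        w11, w12, b1 = w11 - eta * g1 * x[0], w12 - eta * g1 * x[1], b1 - eta * g1
        w21, w22, b2 = w21 - eta * g2 * x[0], w22 - eta * g2 * x[1], b2 - eta * g2
    return (w11, w12, b1, w21, w22, b2, v1, v2, c), loss
```

Running this with the initial values from the worked example drives the loss towards zero over a few hundred iterations (Point 9).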