Backpropagation Algorithm
The backpropagation algorithm consists of two phases:
1. The forward pass, where our inputs are passed through the network and output
predictions are obtained (also known as the propagation phase).
2. The backward pass, where we compute the gradient of the loss function at the
final layer (i.e., the predictions layer) of the network and then use this gradient to
recursively apply the chain rule and update the weights in our network (also known
as the weight update phase).
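As a toy illustration of the two phases (with made-up numbers and only a single sigmoid neuron, rather than the multi-layer network built later in these slides), one training step looks like this:

import numpy as np

# Toy single-neuron example of the two phases (assumed example values).
x = np.array([0.0, 1.0, 1.0])      # input feature vector
y = 1.0                            # target output
W = np.array([0.5, -0.3, 0.8])     # current weights
alpha = 0.1                        # learning rate

# Phase 1: forward pass -- propagate the input and obtain a prediction
net = W.dot(x)
pred = 1.0 / (1.0 + np.exp(-net))  # sigmoid activation

# Phase 2: backward pass -- gradient of the squared loss 0.5*(pred - y)^2
# with respect to W, obtained via the chain rule
grad_W = (pred - y) * pred * (1 - pred) * x

# Weight update phase: take a small step against the gradient
W = W - alpha * grad_W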
Search for the related information of the following concepts:
1. Forward pass
2. Backward pass
3. Gradient
4. Loss function
5. Chain rule
The Forward Pass
Gradient
Think of the loss landscape as the surface of a bowl: it is a plot of the loss
function. The difference between the loss landscape and a cereal bowl is
that the bowl only exists in three dimensions, while the loss landscape can
exist in many dimensions, perhaps tens, hundreds, or thousands of
dimensions.
Each position along the surface of the bowl corresponds to a particular loss
value given a set of parameters W (weight matrix) and b (bias vector). Our
goal is to try different values of W and b, evaluate their loss, and then take a
step towards more optimal values that (ideally) have lower loss.
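That step towards more optimal values is the familiar gradient descent update. A minimal sketch, assuming the gradients dW and db of the loss with respect to W and b have already been computed:

def gradient_descent_step(W, b, dW, db, alpha=0.01):
    # move the parameters a small distance downhill on the loss landscape;
    # alpha (the learning rate) controls the size of the step
    W = W - alpha * dW
    b = b - alpha * db
    return W, b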
Loss function
Backpropagation Algorithm
We present the feature vector (0, 1, 1) (and target output value 1) to the
network. Here we can see that 0, 1, and 1 have been assigned to the three
input nodes in the network.
Applying the step function with net = 0.506, we see that our network
predicts 1, which is, in fact, the correct class label. However, our
network is not very confident in this class label: the predicted value
0.506 is very close to the step function's threshold. Ideally, this prediction
should be closer to 0.98-0.99, implying that our network has truly
learned the underlying pattern in the dataset. In order for our
network to actually “learn”, we need to apply the backward pass.
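As a quick illustration of that final thresholding (assuming the usual threshold of 0.5, which is consistent with 0.506 being described as very close to it):

def step(x):
    # step function: predict class 1 if the activation exceeds 0.5, else 0
    return 1 if x > 0.5 else 0

net = 0.506
print(step(net))   # 1 -- the correct label, but only barely above the threshold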
Line 5 then defines the constructor of our NeuralNetwork class. The constructor requires
a single argument, followed by a second optional one:
• layers: A list of integers representing the actual architecture of the feedforward
network. For example, a value of [2, 2, 1] implies that our input layer has two
nodes, our hidden layer has two nodes, and our final output layer has one node.
• alpha: Here we can specify the learning rate of our neural network. This value is
applied during the weight update phase.
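The code itself appears only as an image in the slides; a minimal sketch of such a constructor, consistent with the description above, might look like the following (the weight matrices appended to self.W are filled in on the next slides):

class NeuralNetwork:
    def __init__(self, layers, alpha=0.1):
        # store the list of weight matrices, the network architecture,
        # and the learning rate used during the weight update phase
        self.W = []
        self.layers = layers
        self.alpha = alpha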
On Line 14 we start looping over the number of layers in the network (i.e.,
len(layers)), but we stop before the final two layers.
Each layer in the network is randomly initialized by constructing an MxN
weight matrix with values sampled from a standard normal distribution (Line
18). The matrix is MxN since we wish to connect every node in the current layer
to every node in the next layer.
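Expressed as a standalone helper (an assumption mirroring the loop just described; the +1 entries account for the bias trick discussed on the next slide):

import numpy as np

def init_hidden_weights(layers):
    W = []
    # loop over the layers, stopping before the final two
    for i in range(0, len(layers) - 2):
        # MxN matrix connecting every node in the current layer (plus bias)
        # to every node in the next layer (plus bias), drawn from a
        # standard normal distribution
        w = np.random.randn(layers[i] + 1, layers[i + 1] + 1)
        # scale the weights; dividing by the square root of the node count
        # is one common normalization choice
        W.append(w / np.sqrt(layers[i]))
    return W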
The final code block of the constructor handles the special case where the
input connections need a bias term, but the output does not:
Again, these weight values are randomly sampled and then normalized.
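A sketch of that special case (again an assumption based on the description: the input side of the last weight matrix gets the extra bias entry, the output side does not):

import numpy as np

def init_output_weights(layers):
    # connections into the final layer: bias term on the input side only
    w = np.random.randn(layers[-2] + 1, layers[-1])
    return w / np.sqrt(layers[-2])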
Given a layers value of (2, 2, 1), the output of calling this function will be:
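The output itself is shown only as an image; a __repr__ method along the following lines, added to the class sketched earlier, would print something like NeuralNetwork: 2-2-1 for that architecture:

    def __repr__(self):
        # build a string such as "NeuralNetwork: 2-2-1" describing the
        # architecture stored in self.layers
        return "NeuralNetwork: {}".format(
            "-".join(str(l) for l in self.layers))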
We also define the derivative of the sigmoid, which we'll use during the
backward pass:
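A sketch of the two activation helpers as methods of the class (assuming numpy is imported as np; the derivative form assumes x has already been passed through the sigmoid, a common convention in this style of implementation):

    def sigmoid(self, x):
        # standard logistic activation used for each layer's output
        return 1.0 / (1.0 + np.exp(-x))

    def sigmoid_deriv(self, x):
        # derivative of the sigmoid, assuming x = sigmoid(net) already
        return x * (1 - x)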
We’ll draw inspiration from the scikit-learn library and define a function
named fit, which will be responsible for actually training our NeuralNetwork.
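A sketch of what such a fit method might look like (assuming numpy is imported as np, the bias trick of appending a column of ones to X, and a per-sample helper fit_partial defined on the slides that follow but not reproduced as text here):

    def fit(self, X, y, epochs=1000):
        # bias trick: append a column of 1's so the bias can be treated as
        # a trainable weight inside the weight matrices
        X = np.c_[X, np.ones((X.shape[0]))]

        # loop over the desired number of epochs, training the network on
        # each individual data point in turn
        for epoch in np.arange(0, epochs):
            for (x, target) in zip(X, y):
                self.fit_partial(x, target)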
The final entry in A (the list of layer activations built during the forward pass) is thus the output of the last layer in our network.
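The surrounding code appears only as images; a sketch of the forward-pass bookkeeping being described (inside a per-sample training helper, using the sigmoid method from earlier) is:

        # A collects the activation of every layer, starting with the input
        A = [np.atleast_2d(x)]

        for layer in np.arange(0, len(self.W)):
            # net input to the current layer, then its sigmoid activation
            net = A[layer].dot(self.W[layer])
            A.append(self.sigmoid(net))

        # A[-1] is the output (prediction) of the last layer of the network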
We also perform min/max normalization, scaling the pixel intensities of each digit image to the range [0, 1] (Line 14).
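A sketch of that loading and scaling step, assuming the scikit-learn digits dataset (which matches the 64-input, ten-class architecture used below):

from sklearn import datasets

# load the 8x8 digit images (64 features per sample) and scale the pixel
# intensities to the range [0, 1] via min/max normalization
digits = datasets.load_digits()
data = digits.data.astype("float")
data = (data - data.min()) / (data.max() - data.min())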
Next, let’s construct a training and testing split, using 75% of the data for training
and 25% for evaluation.
We’ll also encode our class label integers as vectors, a process called one-hot encoding that
we will discuss in detail later in this chapter.
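A sketch of the split and the one-hot encoding, using scikit-learn's train_test_split and LabelBinarizer (reasonable choices for the steps described, continuing from the snippet above):

from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelBinarizer

# 75% of the data for training, 25% held out for evaluation
(trainX, testX, trainY, testY) = train_test_split(
    data, digits.target, test_size=0.25)

# one-hot encode the integer class labels 0-9 as 10-dimensional vectors
trainY = LabelBinarizer().fit_transform(trainY)
testY = LabelBinarizer().fit_transform(testY)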
Here we can see that we are training a NeuralNetwork with a 64-32-16-10 architecture.
The output layer has ten nodes because there are ten possible output classes
for the digits 0-9. We then allow our network to train for 1,000 epochs.
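A sketch of that training call, assuming the NeuralNetwork class sketched earlier and the variables from the preceding snippets:

# define the 64-32-16-10 architecture and train for 1,000 epochs
nn = NeuralNetwork([trainX.shape[1], 32, 16, 10])
print("[INFO] {}".format(nn))
nn.fit(trainX, trainY, epochs=1000)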
Once our network has been trained, we can evaluate it on the testing set:
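A sketch of that evaluation, assuming the network exposes a predict method (shown on the slides but not reproduced as text here) and using scikit-learn's classification_report:

from sklearn.metrics import classification_report

# the network outputs one score per class; the argmax along each row gives
# the predicted digit, which we compare against the true labels
predictions = nn.predict(testX)
predictions = predictions.argmax(axis=1)
print(classification_report(testY.argmax(axis=1), predictions))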