
Neural Networks: The Backpropagation Algorithm

Annette Lopez Davila


Math 400, College of William and Mary
Professor Chi-Kwong Li

Abstract
This paper illustrates how basic theories of linear algebra and calculus can be combined with
computer programming methods to create neural networks.
Keywords: Gradient Descent, Backpropagation, Chain Rule, Automatic Differentiation,
Activation and Loss Functions

1 Introduction
As computers advanced in the 1950s, researchers attempted to simulate biologically inspired
models that could recognize binary patterns. This led to the birth of machine learning, an
application of computer science and mathematics in which systems have the ability to “learn” by
improving their performance. Neural networks are algorithms that can learn patterns and find
connections in data for classification, clustering, and prediction problems. Data including
images, sounds, text, and time series are translated numerically into tensors, thus allowing the
system to perform mathematical analysis.
In this paper, we will be exploring fundamental mathematical concepts behind neural networks
including reverse mode automatic differentiation, the gradient descent algorithm, and
optimization functions.

2 Neural Network Architecture


In order to understand neural networks, we must first examine the smallest unit in the system: the neuron. A neuron is a unit which holds a number; it is a mathematical function that collects information. These neurons are connected to each other in layers, and each is assigned an activation value; the higher the activation value, the more strongly the neuron fires. Each activation number is multiplied by a corresponding weight, which describes the connection strength from node to node. A neural network has an architecture of input nodes, output nodes, and hidden layers. For each node in the next layer, the weighted sum is computed:
z_i = w_1 a_1 + w_2 a_2 + ⋯ + w_n a_n,
where i = 1, …, (number of neurons in the hidden layer) and n = number of activation numbers
A bias term is added to the weighted inputs so that the output can be meaningfully activated.

z_i = w_1 a_1 + w_2 a_2 + ⋯ + w_n a_n + b
A neural network's hidden layers have multiple nodes. For the first node in the hidden layer, we multiplied the corresponding weights and bias against the activation numbers. This must be repeated for every node in the hidden layer. The equation above can be consolidated into vector form:

z⃗ = w a⃗ + b⃗

Each row of the matrix w holds the weights feeding one node of the hidden layer, while each column holds the weights corresponding to a particular activation number.
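To make the vector form concrete, the short NumPy sketch below computes the weighted sums for one hidden layer; the layer sizes and numerical values are illustrative assumptions rather than values from this paper.

import numpy as np

# Illustrative sizes: 3 input activations feeding a hidden layer of 2 nodes.
a = np.array([0.5, 0.1, 0.9])        # activation numbers from the input layer
w = np.array([[0.2, -0.4, 0.7],      # row 0: weights into hidden node 0
              [0.6,  0.1, -0.3]])    # row 1: weights into hidden node 1
b = np.array([0.1, -0.2])            # one bias per hidden node

z = w @ a + b                        # z_i = w_i1*a_1 + ... + w_in*a_n + b_i
print(z)                             # weighted sums for the two hidden nodes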

3 The Activation Function


The function z_i is linear in nature; thus, a nonlinear activation function is applied to allow more complex behavior. Commonly used activation functions include sigmoid functions, piecewise linear functions, Gaussian functions, hyperbolic tangent functions, threshold functions, and ReLU functions.

Function Name          Function
Sigmoid/Logistic       f(x) = 1 / (1 + e^(-βx))
Piecewise Linear       f(x) = 0 if x ≤ x_min;  mx + b if x_min < x < x_max;  1 if x ≥ x_max
Gaussian               f(x) = (1 / (σ√(2π))) e^(-(x - μ)^2 / (2σ^2))
Threshold/Unit Step    f(x) = 0 if x < 0;  1 if x ≥ 0
ReLU                   f(x) = max(0, x)
Tanh                   f(x) = tanh(x)
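As a rough illustration, the activation functions in the table can be written directly in Python; the parameter values β, μ, and σ below are arbitrary defaults chosen for the example.

import numpy as np

def sigmoid(x, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def gaussian(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))

def unit_step(x):
    return np.where(x >= 0, 1.0, 0.0)

def relu(x):
    return np.maximum(0.0, x)

x = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
for name, y in [("sigmoid", sigmoid(x)), ("gaussian", gaussian(x)),
                ("unit step", unit_step(x)), ("ReLU", relu(x)), ("tanh", np.tanh(x))]:
    print(name, y)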

Activation function choice depends on the range needed for the data, the error, and the speed. Without an activation function, the neural network behaves like a linear regression model. The need for an activation function comes from the definition of linear functions and transformations. Previously, we discussed the linear algebra from the input step to the hidden layer; its result resolves to a matrix of weighted sums. In order to calculate an output, the weighted-sums matrix becomes the "new" activation layer, and these activation numbers have their own sets of weights and biases. When we substitute the first layer's output into the second, we see that a composition of two linear functions is itself a linear function. Hence, a nonlinear activation function is needed.
Proof: Composition of Linear Functions

z_1 = w_1 a + b_1

z_2 = w_2 z_1 + b_2

z_2 = w_2 (w_1 a + b_1) + b_2

z_2 = [w_2 w_1] a + [w_2 b_1 + b_2]

If W = [w_2 w_1] and B = [w_2 b_1 + b_2], then z_2 = W a + B, which is also a linear function.

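As a quick numerical check of the proof, the sketch below (with small random matrices chosen only for illustration) shows that applying two linear layers in sequence gives exactly the same output as the single collapsed layer Wa + B.

import numpy as np

rng = np.random.default_rng(0)
a = rng.normal(size=3)                                  # input activations
w1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # first linear layer
w2, b2 = rng.normal(size=(2, 4)), rng.normal(size=2)    # second linear layer

z2_composed = w2 @ (w1 @ a + b1) + b2                   # z_2 = w_2(w_1 a + b_1) + b_2
W, B = w2 @ w1, w2 @ b1 + b2                            # collapsed single layer
z2_collapsed = W @ a + B

print(np.allclose(z2_composed, z2_collapsed))           # True: still a linear function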
With the activation function, the new weighted sum becomes:

h_i = σ(z_i) = σ(w_1 a_1 + w_2 a_2 + ⋯ + w_n a_n + b)

h⃗_1 = σ(w a⃗ + b⃗)
4 The Cost/Loss Function
A neural network may have thousands of parameters, and some combinations of weights and biases will produce better output for the model than others. For example, in a binary classification problem, the algorithm classifies each input as one of two labels; the output node with the highest activation number determines how the input is classified. An image, say, can be determined to be a cat or a dog, where the feature "cat" is given the label 0 and "dog" the label 1. Different weights and biases will produce different output. How can we determine which combination of parameters will be most accurate?
In order to measure error, a loss function is necessary. The loss function tells the machine how far away the combination of weights and biases is from the optimal solution. There are many loss functions that can be used in neural networks; Mean Squared Error and Cross Entropy Loss are two of the most common.
MSE Cost = Σ 0.5 (y − ŷ)^2

Cross-Entropy Cost = −Σ ( y log(ŷ) + (1 − y) log(1 − ŷ) )
The loss function contains every weight and bias in the neural network. That can be a very big function!

C(w_1, w_2, … , w_h, b_1, … , b_i)

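A minimal sketch of the two cost functions, assuming y holds the true 0/1 labels and y_hat holds the network's outputs for a small batch:

import numpy as np

def mse_cost(y, y_hat):
    # Sum of 0.5 * (y - y_hat)^2 over the outputs
    return np.sum(0.5 * (y - y_hat) ** 2)

def cross_entropy_cost(y, y_hat, eps=1e-12):
    # -sum( y*log(y_hat) + (1 - y)*log(1 - y_hat) ); eps guards against log(0)
    y_hat = np.clip(y_hat, eps, 1 - eps)
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))

y = np.array([0.0, 1.0, 1.0])          # true labels (e.g., cat = 0, dog = 1)
y_hat = np.array([0.1, 0.8, 0.6])      # network outputs
print(mse_cost(y, y_hat), cross_entropy_cost(y, y_hat))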
5 The Backpropagation Algorithm


The objective of machine learning involves the optimization of the chosen loss function. With
every epoch, the machine “learns” by adapting the weights and biases to minimize the loss.
Optimization theory centers itself on calculus. For neural networks in particular, reverse-mode
automatic differentiation serves a core role.
In order to minimize the cost function, one must determine which weights and biases to adjust. Computing the gradient with respect to the parameters helps us do just that: by definition, the gradient is the vector of partial derivatives of C(w_1, w_2, … , w_h, b_1, … , b_i). As we recall, derivatives measure the change of a function's output with respect to its input. The gradient of the cost function points in the direction in which C increases most quickly, so stepping against the gradient decreases the cost most quickly; repeatedly taking such steps is known as gradient descent. With each epoch, the machine converges towards a local minimum. Automatic differentiation combines the chain rule with massive computational power in order to derive the gradient of a potentially massive, complex model. Applied in reverse mode, this algorithm is better known as backpropagation. Backpropagation is carried out recursively through every layer of the neural network.
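The gradient descent step itself is simple once the gradient is available. The sketch below assumes a caller supplies a function grad_C that returns the partial derivatives for each parameter, along with a hand-picked learning rate; it is a schematic of the update rule, not a full training routine.

import numpy as np

def gradient_descent(params, grad_C, learning_rate=0.1, epochs=100):
    # params : dict of parameter arrays (weights and biases)
    # grad_C : function mapping params -> dict of partial derivatives
    for _ in range(epochs):
        grads = grad_C(params)
        for name in params:
            params[name] = params[name] - learning_rate * grads[name]
    return params

# Toy example (assumed): minimize C(w) = (w - 3)^2, whose derivative is 2(w - 3).
params = {"w": np.array(0.0)}
result = gradient_descent(params, lambda p: {"w": 2 * (p["w"] - 3)})
print(result["w"])                       # converges towards 3, the minimizer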
In order to understand the basic workings of backpropagation, let us look at the simplest example
of a neural network: a network with only one node per layer.

We have derived the equations for cost, weighted sum, and activated weighted sum:

z^L = w^L a^(L-1) + b^L

a^L = σ(z^L)

C = (a^L − y)^2  [1]

[1] The cost function is simplified for proof of concept.
We can determine how sensitive the cost function is to changes in a single weight. Beginning from the output, we can apply the chain rule to every activation layer. For a weight between the hidden layer and output layer, our derivative is:

∂C_k/∂w^L = ∂z^L/∂w^L · ∂a^L/∂z^L · ∂C_k/∂a^L

With the definition of the functions, we can easily solve for the partial derivatives:

∂C_k/∂a^L = 2(a^L − y)

∂a^L/∂z^L = σ′(z^L)

∂z^L/∂w^L = a^(L-1)

∂C_k/∂w^L = a^(L-1) σ′(z^L) 2(a^L − y)

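For the one-node-per-layer network, these three partial derivatives can be multiplied directly in code. The sketch below uses the sigmoid from Section 3 as σ and assumed example values for w^L, b^L, a^(L-1), and y; taking ∂z^L/∂b^L = 1 in place of a^(L-1) gives the bias derivative discussed further down.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

# Assumed example values for the output layer of a one-node-per-layer network.
a_prev, w_L, b_L, y = 0.8, 0.5, 0.1, 1.0

z_L = w_L * a_prev + b_L                 # weighted sum z^L
a_L = sigmoid(z_L)                       # activation a^L

dC_da = 2 * (a_L - y)                    # dC_k/da^L
da_dz = sigmoid_prime(z_L)               # da^L/dz^L
dz_dw = a_prev                           # dz^L/dw^L

dC_dw = dz_dw * da_dz * dC_da            # chain rule: dC_k/dw^L
dC_db = 1.0 * da_dz * dC_da              # dz^L/db^L = 1, so dC_k/db^L
print(dC_dw, dC_db)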
This method is iterated through every weight, activation number, and bias in the system. Previously, we calculated the derivative of the cost C_k for one particular training example. To account for every training example, the average of these derivatives is taken:

∂C/∂w^L = (1/n) Σ_{k=0}^{n-1} ∂C_k/∂w^L

Similarly, we can calculate the sensitivity of the cost function to a single bias between the hidden layer and the output layer, along with the derivative averaged over every training example:

∂C_k/∂b^L = ∂z^L/∂b^L · ∂a^L/∂z^L · ∂C_k/∂a^L = σ′(z^L) 2(a^L − y)

∂C/∂b^L = (1/n) Σ_{k=0}^{n-1} ∂C_k/∂b^L

What happens when we go beyond the output layer and the preceding hidden layer? The chain rule is applied once more, and the derivative changes according to its partials. For example, the derivative below gives the sensitivity of the cost function to an activation number in the previous layer.

∂C_k/∂a^(L-1) = ∂z^L/∂a^(L-1) · ∂a^L/∂z^L · ∂C_k/∂a^L = w^L σ′(z^L) 2(a^L − y)

Neural Networks tend to have several thousand inputs, outputs, and nodes; the above equations
seem highly oversimplified. Although adding complexity changes the formulas slightly, the
concepts remain the same, as seen below:

C_m = Σ_{j=0}^{n_L - 1} (a_j^L − y_j)^2

a_j^L = σ(z_j^L)

z_j^L = ⋯ + w_jk^L a_k^(L-1) + ⋯

∂C_m/∂w_jk^L = ∂z_j^L/∂w_jk^L · ∂a_j^L/∂z_j^L · ∂C_m/∂a_j^L

∂C_m/∂a_k^(L-1) = Σ_{j=0}^{n_L - 1} ∂z_j^L/∂a_k^(L-1) · ∂a_j^L/∂z_j^L · ∂C_m/∂a_j^L

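As an illustration of these multi-node formulas, the sketch below computes the output-layer gradients for a small assumed layer with NumPy, using the sigmoid as σ; the sizes and values are made up for the example.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed example: 3 activations in layer L-1 feeding 2 output nodes.
a_prev = np.array([0.2, 0.7, 0.5])                 # a_k^(L-1)
W = np.array([[0.1, -0.3, 0.5],
              [0.4,  0.2, -0.6]])                  # w_jk^L
b = np.array([0.05, -0.1])
y = np.array([1.0, 0.0])                           # targets y_j

z = W @ a_prev + b                                 # z_j^L
a = sigmoid(z)                                     # a_j^L

dC_da = 2 * (a - y)                                # dC_m/da_j^L
da_dz = a * (1 - a)                                # sigma'(z_j^L)
delta = dC_da * da_dz                              # combined per-node term

dC_dW = np.outer(delta, a_prev)                    # dC_m/dw_jk^L
dC_da_prev = W.T @ delta                           # dC_m/da_k^(L-1), summed over j
print(dC_dW, dC_da_prev, sep="\n")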
By calculating every derivative of each weight and bias, the gradient vector can be found. Although one could try to compute the gradient of a neural network by hand, the vector usually lives in far too many dimensions for us to work with directly. Thus, with computational help, our neural network can perform such intricate calculations and repeat them hundreds, if not thousands, of times until the minimum is reached.

∇C = [ ∂C/∂w^1,  ∂C/∂b^1,  … ,  ∂C/∂w^L,  ∂C/∂b^L ]^T
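To tie the pieces together, the sketch below trains the one-node-per-layer network from Section 5 with gradient descent, averaging the derivatives over a handful of assumed training pairs (a^0, y); the data, learning rate, and epoch count are illustrative choices.

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Assumed toy training data for a one-node-per-layer network: inputs a0, targets y.
a0 = np.array([0.1, 0.4, 0.7, 0.9])
y = np.array([0.0, 0.0, 1.0, 1.0])

w, b = 0.5, 0.0                          # single weight and bias
lr = 1.0                                 # learning rate

for epoch in range(2000):
    z = w * a0 + b                       # weighted sums for every example
    a = sigmoid(z)                       # activations a^L
    dC_da = 2 * (a - y)                  # dC_k/da^L for each example k
    da_dz = a * (1 - a)                  # sigma'(z^L)
    dC_dw = np.mean(a0 * da_dz * dC_da)  # averaged over the n examples
    dC_db = np.mean(1.0 * da_dz * dC_da)
    w -= lr * dC_dw                      # gradient descent step
    b -= lr * dC_db

print(w, b, sigmoid(w * a0 + b))         # outputs move towards the targets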

6 Applications and Further Research


Automatic differentiation has many applications beyond machine learning, such as Data Assimilation, Design Optimization, Numerical Methods, and Sensitivity Analysis. It is efficient, stable, and precise, and is generally a better choice than other types of computer-based differentiation, such as numerical or symbolic differentiation. Backpropagation has been called into question recently because it does not learn continuously: our brains learn continuously and do not forget old information when we learn something new. Because of this, backpropagation may be sidelined in Machine Learning in the future.
Applications of Neural Networks trained with Backpropagation vary greatly. Such applications include sonar target recognition, text recognition, network-controlled steering of cars, face recognition software, remote sensing, and robotics.

