Annette Paper
Abstract
This paper illustrates how basic theories of linear algebra and calculus can be combined with
computer programming methods to create neural networks.
Keywords: Gradient Descent, Backpropagation, Chain Rule, Automatic Differentiation,
Activation and Loss Functions
1 Introduction
As computers advanced in the 1950s, researchers attempted to simulate biologically inspired
models that could recognize binary patterns. This led to the birth of machine learning, an
application of computer science and mathematics in which systems have the ability to “learn” by
improving their performance. Neural networks are algorithms that can learn patterns and find
connections in data for classification, clustering, and prediction problems. Data including
images, sounds, text, and time series are translated numerically into tensors, thus allowing the
system to perform mathematical analysis.
In this paper, we explore fundamental mathematical concepts behind neural networks,
including reverse-mode automatic differentiation, the gradient descent algorithm, and
optimization functions.
$$ z_i = w_1 a_1 + w_2 a_2 + \cdots + w_n a_n + b $$
A neural network's hidden layers have multiple nodes. For the first node in the hidden layer, we multiply the corresponding weights by the previous layer's activation numbers and add the bias. This must be repeated for every node in the hidden layer. The above equation can be consolidated into vector form to express this:

$$ \vec{z} = W \vec{a} + \vec{b} $$
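To make this concrete, the following is a minimal sketch (using NumPy, with arbitrary placeholder sizes and random values not taken from the paper) of how one hidden layer's weighted sums can be computed as a single matrix-vector product:

```python
# Minimal sketch: the weighted sums of one hidden layer as z = W a + b.
# Sizes and values are arbitrary placeholders.
import numpy as np

rng = np.random.default_rng(0)

a = rng.random(4)        # activation numbers of the previous layer (4 inputs)
W = rng.random((3, 4))   # one row of weights per node in the hidden layer (3 nodes)
b = rng.random(3)        # one bias per hidden node

z = W @ a + b            # every z_i = w_1 a_1 + ... + w_n a_n + b_i at once
print(z)
```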
Sigmoid/Logistic: $f(x) = \dfrac{1}{1 + e^{-\beta x}}$

Piecewise Linear: $f(x) = \begin{cases} 0 & \text{if } x \le x_{\min} \\ mx + b & \text{if } x_{\min} < x < x_{\max} \\ 1 & \text{if } x \ge x_{\max} \end{cases}$

Gaussian: $f(x) = \dfrac{1}{\sqrt{2\pi}\,\sigma}\, e^{-\frac{(x - \mu)^2}{2\sigma^2}}$

Threshold/Unit Step: $f(x) = \begin{cases} 0 & \text{if } x < 0 \\ 1 & \text{if } x \ge 0 \end{cases}$

ReLU: $f(x) = \max(0, x)$

Tanh: $f(x) = \tanh(x)$
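As an illustration, the activation functions listed above can be implemented as follows. This is a sketch using NumPy; the parameter names (beta, m, b, x_min, x_max, mu, sigma) simply mirror the symbols in the table, and the default values are arbitrary placeholders:

```python
# Sketch implementations of the activation functions listed above.
import numpy as np

def sigmoid(x, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * x))

def piecewise_linear(x, m=1.0, b=0.5, x_min=-0.5, x_max=0.5):
    return np.where(x <= x_min, 0.0, np.where(x >= x_max, 1.0, m * x + b))

def gaussian(x, mu=0.0, sigma=1.0):
    return np.exp(-(x - mu) ** 2 / (2 * sigma ** 2)) / (np.sqrt(2 * np.pi) * sigma)

def unit_step(x):
    return np.where(x < 0, 0.0, 1.0)

def relu(x):
    return np.maximum(0.0, x)

def tanh(x):
    return np.tanh(x)
```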
The choice of activation function depends on the range required for the data, the error, and the speed of computation. Without an activation function, a neural network behaves like a linear regression model. The need for an activation function comes from the definition of linear functions and transformations. Previously, we discussed the linear algebra that takes us from the input layer to the hidden layer: the result is a vector of weighted sums. To calculate an output, this vector of weighted sums becomes the "new" activation layer, and these activation numbers have their own set of weights and biases. When we substitute the weighted sums in as the next layer's activations, we see that the composition of two linear functions is itself a linear function. Hence, a nonlinear activation function is needed.
Proof: Composition of Linear Functions
$$ \vec{z}_1 = W_1 \vec{a} + \vec{b}_1 $$

$$ \vec{z}_2 = W_2 \vec{z}_1 + \vec{b}_2 $$

$$ \vec{z}_2 = W_2 (W_1 \vec{a} + \vec{b}_1) + \vec{b}_2 $$

$$ \vec{z}_2 = [W_2 W_1]\, \vec{a} + [W_2 \vec{b}_1 + \vec{b}_2] $$
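The proof can also be checked numerically. The sketch below (arbitrary random matrices, NumPy assumed) confirms that two stacked linear layers collapse into one linear layer, which is exactly why a nonlinear activation is needed between them:

```python
# Numerical check of the composition proof:
# W2(W1 a + b1) + b2 equals (W2 W1) a + (W2 b1 + b2).
import numpy as np

rng = np.random.default_rng(1)
a = rng.random(4)
W1, b1 = rng.random((3, 4)), rng.random(3)
W2, b2 = rng.random((2, 3)), rng.random(2)

z2_composed = W2 @ (W1 @ a + b1) + b2        # two linear layers applied in sequence
z2_single = (W2 @ W1) @ a + (W2 @ b1 + b2)   # one equivalent linear layer

print(np.allclose(z2_composed, z2_single))   # True
```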
There are many loss functions that can be used in neural networks; Mean Squared Error and Cross Entropy Loss are two of the most common.

$$ \text{MSE Cost} = \sum 0.5\,(y - \hat{y})^2 $$

$$ \text{Cross Entropy Cost} = -\sum \left( y \log(\hat{y}) + (1 - y)\log(1 - \hat{y}) \right) $$
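A sketch of these two cost functions in NumPy is shown below; here y is the target, y_hat is the network's prediction, and the sums run over the output nodes (or over a batch of examples):

```python
# Sketch of the two cost functions above; y is the target, y_hat the prediction.
import numpy as np

def mse_cost(y, y_hat):
    return np.sum(0.5 * (y - y_hat) ** 2)

def cross_entropy_cost(y, y_hat):
    # assumes y_hat lies strictly between 0 and 1, e.g. a sigmoid output
    return -np.sum(y * np.log(y_hat) + (1 - y) * np.log(1 - y_hat))
```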
The loss function contains every weight and bias in the neural network. That can be a very big
function!
$$ C(w_1, w_2, \ldots, w_h, b_1, \ldots, b_i) $$
We have derived the equations for the weighted sum, the activated weighted sum, and the cost:

$$ z^L = w^L a^{L-1} + b^L $$

$$ a^L = \sigma(z^L) $$

$$ C = (a^L - y)^2 $$

(The cost function is simplified for proof of concept.)
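For a single output node, these three equations amount to a few lines of code. In the sketch below the numbers are arbitrary placeholders and sigma is the sigmoid activation:

```python
# Sketch of the forward pass for one output node and the simplified cost.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))   # sigmoid activation

a_prev = 0.6          # a^{L-1}: placeholder activation from the previous layer
w_L, b_L = 1.5, -0.3  # placeholder weight and bias
y = 1.0               # placeholder target output

z_L = w_L * a_prev + b_L   # z^L = w^L a^{L-1} + b^L
a_L = sigma(z_L)           # a^L = sigma(z^L)
C = (a_L - y) ** 2         # simplified cost
print(z_L, a_L, C)
```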
We can determine how sensitive the cost function is to changes in a single weight. Beginning
from the output, we can apply the chain rule to every activation layer. For a weight between the
hidden layer and output layer, our derivative is:
$$ \frac{\partial C_k}{\partial w^L} = \frac{\partial z^L}{\partial w^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_k}{\partial a^L} $$
With the definition of the functions, we can easily solve for the partial derivatives:
$$ \frac{\partial C_k}{\partial a^L} = 2(a^L - y) $$

$$ \frac{\partial a^L}{\partial z^L} = \sigma'(z^L) $$

$$ \frac{\partial z^L}{\partial w^L} = a^{L-1} $$

$$ \frac{\partial C_k}{\partial w^L} = a^{L-1}\, \sigma'(z^L)\, 2(a^L - y) $$
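This chain-rule expression can be verified numerically. The sketch below (placeholder values, sigmoid activation) compares it against a finite-difference estimate of the same derivative:

```python
# Sketch: dC/dw^L = a^{L-1} * sigma'(z^L) * 2(a^L - y), checked by finite differences.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

a_prev, w_L, b_L, y = 0.6, 1.5, -0.3, 1.0   # arbitrary placeholder values

def cost(w):
    return (sigma(w * a_prev + b_L) - y) ** 2

z_L = w_L * a_prev + b_L
a_L = sigma(z_L)

dC_dw = a_prev * sigma_prime(z_L) * 2 * (a_L - y)   # chain rule result

eps = 1e-6
numerical = (cost(w_L + eps) - cost(w_L - eps)) / (2 * eps)
print(dC_dw, numerical)   # the two values agree closely
```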
This method is iterated through every weight, activation number, and bias in the system. Above, we calculated the derivative of the cost of one particular training example with respect to a single weight. To account for every training example, the average of these derivatives is taken:
$$ \frac{\partial C}{\partial w^L} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial w^L} $$
Similarly, we can calculate the sensitivity of the cost function with respect to a single bias between the hidden layer and the output layer, along with the corresponding average over the training examples:
$$ \frac{\partial C_k}{\partial b^L} = \frac{\partial z^L}{\partial b^L} \frac{\partial a^L}{\partial z^L} \frac{\partial C_k}{\partial a^L} = \sigma'(z^L)\, 2(a^L - y) $$

$$ \frac{\partial C}{\partial b^L} = \frac{1}{n} \sum_{k=0}^{n-1} \frac{\partial C_k}{\partial b^L} $$
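A sketch of both averaged derivatives is shown below, where each entry of the arrays corresponds to one training example k (the inputs and targets are arbitrary placeholders):

```python
# Sketch: averaging dC_k/dw^L and dC_k/db^L over n = 4 training examples.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

w_L, b_L = 1.5, -0.3
a_prev = np.array([0.6, 0.1, 0.9, 0.4])   # a^{L-1} for each example (placeholders)
y = np.array([1.0, 0.0, 1.0, 0.0])        # target for each example (placeholders)

z_L = w_L * a_prev + b_L
a_L = sigma(z_L)

common = sigma_prime(z_L) * 2 * (a_L - y)   # sigma'(z^L) * 2(a^L - y), per example
dC_dw = np.mean(a_prev * common)            # (1/n) sum_k dC_k/dw^L
dC_db = np.mean(common)                     # (1/n) sum_k dC_k/db^L, since dz^L/db^L = 1
print(dC_dw, dC_db)
```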
What happens when we go beyond the output layer and the preceding hidden layer? The chain rule is applied once more, and the derivative changes according to its partials. For example, the derivative below gives the partial of the cost function with respect to an activation number in the preceding layer.
$$ \frac{\partial C_k}{\partial a^{L-1}} = \frac{\partial z^L}{\partial a^{L-1}} \frac{\partial a^L}{\partial z^L} \frac{\partial C_k}{\partial a^L} = w^L\, \sigma'(z^L)\, 2(a^L - y) $$
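The same placeholder example extends to this derivative; it is the quantity that lets the chain rule continue backward into earlier layers:

```python
# Sketch: dC_k/da^{L-1} = w^L * sigma'(z^L) * 2(a^L - y) for one example.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

def sigma_prime(z):
    s = sigma(z)
    return s * (1.0 - s)

a_prev, w_L, b_L, y = 0.6, 1.5, -0.3, 1.0   # arbitrary placeholder values

z_L = w_L * a_prev + b_L
a_L = sigma(z_L)

dC_da_prev = w_L * sigma_prime(z_L) * 2 * (a_L - y)
print(dC_da_prev)
```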
Neural networks tend to have several thousand inputs, outputs, and nodes, so the above equations are highly simplified. Although adding complexity changes the formulas slightly, the concepts remain the same, as seen below:
$$ C_m = \sum_{j=0}^{n_L - 1} \left( a_j^L - y_j \right)^2 $$

$$ a_j^L = \sigma(z_j^L) $$

$$ z_j^L = \cdots + w_{jk}^L\, a_k^{L-1} + \cdots $$
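A vectorized sketch of these generalized equations is shown below, with three output nodes and four activations in the preceding layer (the sizes and values are arbitrary placeholders):

```python
# Sketch of the generalized forward pass and cost for several output nodes j.
import numpy as np

def sigma(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(2)
a_prev = rng.random(4)          # a_k^{L-1}: activations of the previous layer
W_L = rng.random((3, 4))        # w_jk^L: one row of weights per output node
b_L = rng.random(3)
y = np.array([1.0, 0.0, 0.0])   # one target per output node (placeholder)

z_L = W_L @ a_prev + b_L        # z_j^L = sum_k w_jk^L a_k^{L-1} + b_j^L
a_L = sigma(z_L)                # a_j^L = sigma(z_j^L)
C_m = np.sum((a_L - y) ** 2)    # C_m = sum_j (a_j^L - y_j)^2
print(C_m)
```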
By calculating the derivative with respect to every weight and bias, the gradient vector can be found. Although one could try to compute the gradient of a neural network by hand, the vector usually has far too many dimensions for us to handle. Thus, with computational help, our neural network can perform these intricate calculations and repeat them hundreds, if not thousands, of times until the minimum is reached.
$$ \nabla C = \begin{bmatrix} \dfrac{\partial C}{\partial w^1} \\[4pt] \dfrac{\partial C}{\partial b^1} \\[4pt] \vdots \\[4pt] \dfrac{\partial C}{\partial w^L} \\[4pt] \dfrac{\partial C}{\partial b^L} \end{bmatrix} $$
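Once the gradient vector is assembled, gradient descent nudges every weight and bias a small step against its own partial derivative. The sketch below shows one such update; the parameter values, gradient entries, and learning rate are all placeholders:

```python
# Sketch of one gradient descent update on a flattened parameter vector.
import numpy as np

params = np.array([1.5, -0.3, 0.8, 0.2])         # e.g. [w^1, b^1, ..., w^L, b^L] (placeholders)
gradient = np.array([0.04, -0.12, 0.30, 0.07])   # matching entries of grad C (placeholders)

learning_rate = 0.1                              # placeholder step size
params = params - learning_rate * gradient       # step toward a minimum of C
print(params)
```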