An Introduction To Mathematics Behind Neural Networks

December 23rd 2019

Today, with open source machine learning libraries such as TensorFlow, Keras or PyTorch, we can create a neural network, even one with high structural complexity, with just a few lines of code. Having said that, the math behind neural networks is still a mystery to some of us, and knowing it can help us understand what's happening inside a neural network. It is also helpful in architecture selection, fine-tuning of deep learning models, hyperparameter tuning and optimization.
Introduction

I ignored understanding the math behind neural networks and Deep Learning for a long time, as I didn't have a good knowledge of algebra or differential calculus. A few days ago, I decided to start from scratch and derive the methodology and math behind neural networks and Deep Learning, to know how and why they work. I also decided to write this article, which would be useful to people like me, who find it difficult to understand these concepts.
Perceptrons

Perceptrons — invented by Frank Rosenblatt in 1957 — are the simplest neural networks: they consist of n inputs, only one neuron and one output, where n is the number of features in our dataset. The process of passing data through the neural network is known as forward propagation, and the forward propagation carried out in a Perceptron is explained in the following three steps.
Step 1 : For each input, multiply the input value xᵢ by its weight wᵢ and sum all the multiplied values. Weights — represent the strength of the connection between neurons and decide how much influence the given input will have on the neuron's output. If the weight w₁ has a higher value than the weight w₂, then the input x₁ will have a higher influence on the output than x₂.

The row vectors of the inputs and weights are x = [x₁, x₂, … , xₙ] and w = [w₁, w₂, … , wₙ] respectively, and their dot product is given by

x · w = x₁w₁ + x₂w₂ + … + xₙwₙ

Hence, the summation is equal to the dot product of the vectors x and w.
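The summation step can be checked with a quick dot product (the values here are purely illustrative):

```python
import numpy as np

x = np.array([1.0, 2.0, 3.0])    # inputs x1..x3
w = np.array([0.2, 0.3, -0.1])   # weights w1..w3

# Dot product: x1*w1 + x2*w2 + x3*w3
s = np.dot(x, w)  # 0.2 + 0.6 - 0.3 = 0.5
```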

Step 2: Add the bias b to the summation of the multiplied values, and let's call the result z:

z = x · w + b

Bias — also known as the offset — is necessary in most cases to move the entire activation function to the left or right, to generate the required output values.

Step 3 : Pass the value of z to a non-linear activation function. Activation functions — are used to introduce non-linearity into the output of the neurons; without them, the neural network would just be a linear function. Moreover, they have a significant impact on the learning speed of the neural network. Perceptrons use the binary step function as their activation function. However, we shall use the Sigmoid — also known as the logistic function — as our activation function:

σ(z) = 1 / (1 + e⁻ᶻ)

ŷ = σ(z) = σ(x · w + b)

where σ denotes the Sigmoid activation function, and the output we get after forward propagation is known as the predicted value ŷ.
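The three steps above can be sketched in a few lines of Python — a minimal illustration with variable names of my own choosing, not any library's API:

```python
import numpy as np

def sigmoid(z):
    # Sigmoid activation: squashes z into the range (0, 1)
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, w, b):
    # Steps 1 and 2: weighted sum of the inputs, plus the bias
    z = np.dot(x, w) + b
    # Step 3: non-linear activation gives the predicted value y-hat
    return sigmoid(z)

x = np.array([1.0, 2.0])    # two input features
w = np.array([0.5, -0.25])  # one weight per input
b = 0.1                     # bias

y_hat = forward(x, w, b)    # a value between 0 and 1
```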
Learning Algorithm

The learning algorithm consists of two parts — Backpropagation and Optimization.

Backpropagation : Backpropagation, short for backward propagation of errors, refers to the algorithm for computing the gradient of the loss function with respect to the weights. However, the term is often used to refer to the entire learning algorithm. The backpropagation carried out in a Perceptron is explained in the following two steps.
Step 1 : To estimate how far we are from the desired solution, a loss function is used. Generally, Mean Squared Error is chosen as the loss function for regression problems and cross entropy for classification problems. Let's take a regression problem and let its loss function be Mean Squared Error, which squares the difference between the actual value (yᵢ) and the predicted value (ŷᵢ):

L = (yᵢ − ŷᵢ)²

The loss function is calculated over the entire training dataset, and its average is called the Cost function C:

C = (1/n) Σᵢ (yᵢ − ŷᵢ)²
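As a quick sketch, the cost can be computed in one line (array names are my own):

```python
import numpy as np

def cost(y, y_hat):
    # Mean Squared Error cost: the average squared difference
    # between the actual and predicted values over the dataset
    return np.mean((y - y_hat) ** 2)

y = np.array([1.0, 0.0, 1.0])      # actual values
y_hat = np.array([0.9, 0.2, 0.8])  # predicted values

c = cost(y, y_hat)  # (0.01 + 0.04 + 0.04) / 3 = 0.03
```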

Step 2 : In order to find the best weights and bias for our Perceptron, we need to know how the cost function changes in relation to the weights and bias. This is done with the help of gradients (rates of change) — how one quantity changes in relation to another. In our case, we need to find the gradient of the cost function with respect to the weights and bias.

Let's calculate the gradient of the cost function C with respect to the weight wᵢ using partial derivatives. Since the cost function is not directly related to the weight wᵢ, let's use the chain rule:

∂C/∂wᵢ = (∂C/∂ŷ) · (∂ŷ/∂z) · (∂z/∂wᵢ)

Now we need to find the following three gradients: ∂C/∂ŷ, ∂ŷ/∂z and ∂z/∂wᵢ.

Let's start with the gradient of the Cost function (C) with respect to the predicted value (ŷ). Let y = [y₁, y₂, … , yₙ] and ŷ = [ŷ₁, ŷ₂, … , ŷₙ] be the row vectors of the actual and predicted values. Since C = (1/n) Σᵢ (yᵢ − ŷᵢ)², the gradient simplifies to

∂C/∂ŷ = (2/n) (ŷ − y)

Now let’s find the the gradient of the predicted value with
respect to the z. This will be a bit lengthy.

The gradient of z with respect to the weight wᵢ is

Therefore we get,

What about the bias? — The bias is theoretically considered to have an input of constant value 1, so ∂z/∂b = 1. Hence,

∂C/∂b = (2/n) (ŷ − y) · ŷ(1 − ŷ)
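These gradients can be sketched in code for a dataset of n examples (a minimal illustration with my own variable names; the `@` operator is numpy matrix multiplication):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def gradients(X, y, w, b):
    # Forward pass: z = X.w + b, y_hat = sigmoid(z)
    y_hat = sigmoid(X @ w + b)
    n = len(y)
    # Chain rule, applied elementwise:
    # dC/dz = dC/dy_hat * dy_hat/dz = (2/n)(y_hat - y) * y_hat(1 - y_hat)
    dz = (2.0 / n) * (y_hat - y) * y_hat * (1.0 - y_hat)
    # dz/dw_i = x_i, so the weight gradient sums dz over the examples
    dw = X.T @ dz
    # The bias behaves like a weight on a constant input of 1
    db = np.sum(dz)
    return dw, db
```

A useful sanity check for any hand-derived gradient is to compare it against a finite-difference approximation of the cost.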
Optimization : Optimization is the selection of the best element from some set of available alternatives, which in our case is the selection of the best weights and bias of the Perceptron. Let's choose gradient descent as our optimization algorithm, which changes the weights and bias proportionally to the negative of the gradient of the Cost function with respect to the corresponding weight or bias. The learning rate (α) is a hyperparameter used to control how much the weights and bias are changed.

The weights and bias are updated as follows, and backpropagation and gradient descent are repeated until convergence:

wᵢ = wᵢ − α · ∂C/∂wᵢ
b = b − α · ∂C/∂b
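Putting forward propagation, backpropagation and gradient descent together, a complete training loop for the Perceptron might look like this — a sketch under my own naming and a toy OR dataset, not a reference implementation:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def train(X, y, alpha=1.0, epochs=5000):
    # Start from zero weights and bias
    w = np.zeros(X.shape[1])
    b = 0.0
    n = len(y)
    for _ in range(epochs):
        # Forward propagation
        y_hat = sigmoid(X @ w + b)
        # Backpropagation: gradients of the MSE cost
        dz = (2.0 / n) * (y_hat - y) * y_hat * (1.0 - y_hat)
        dw, db = X.T @ dz, np.sum(dz)
        # Gradient descent update: step against the gradient
        w -= alpha * dw
        b -= alpha * db
    return w, b

# Toy dataset: logical OR of two binary inputs
X = np.array([[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0, 1.0])
w, b = train(X, y)
```

After training, thresholding the predicted values at 0.5 recovers the OR truth table.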

Conclusion

I hope that you've found this article useful and understood the math behind Neural Networks and Deep Learning. I have explained the working of a single neuron in this article; however, these basic concepts are applicable to all kinds of Neural Networks with some modifications. If you have any questions or if you found a mistake, please let me know in the comments.
