Neural Networks

The document provides an overview of artificial neurons, their structure, and the training process of neural networks, particularly focusing on the Mean Squared Error (MSE) loss function and gradient descent optimization. It describes the architecture of Multi-Layer Perceptrons (MLP), the role of activation functions, and the backpropagation algorithm used for training. Additionally, it includes a practical implementation example using the PyTorch library to demonstrate forward and backward passes in a neural network.

Uploaded by

ayaazouz1997

Biological and Artificial Neuron

Artificial neuron related names


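$$w = \begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix}, \quad x = \begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{pmatrix}, \quad z = w^T x + b$$

Equivalently, with the bias absorbed into the weight vector and a constant 1 prepended to the input:

$$w = \begin{pmatrix} b \\ w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix}, \quad x = \begin{pmatrix} 1 \\ x_1 \\ x_2 \\ \vdots \\ x_D \end{pmatrix}, \quad z = w^T x$$

$$y = f(z) = f(w^T x)$$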
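The two forms of the neuron equation above can be checked against each other numerically. The following is a minimal sketch, assuming tanh as the activation f; the function names and the numeric values are illustrative and do not come from the original notes.

```python
import math

def neuron(w, x, b, f=math.tanh):
    """Return f(w^T x + b) for a weight list w, input list x and bias b."""
    z = sum(wi * xi for wi, xi in zip(w, x)) + b
    return f(z)

def neuron_absorbed_bias(w_ext, x, f=math.tanh):
    """Same neuron, with the bias stored as w_ext[0] and a constant 1 prepended to x."""
    x_ext = [1.0] + list(x)
    z = sum(wi * xi for wi, xi in zip(w_ext, x_ext))
    return f(z)

# Illustrative values (assumptions, not taken from the notes).
w = [0.5, -1.0, 2.0]
x = [1.0, 2.0, 0.5]
b = 0.25

y_explicit = neuron(w, x, b)                    # z = w^T x + b
y_absorbed = neuron_absorbed_bias([b] + w, x)   # z = w^T x with the extended vectors
```

Both calls evaluate the same weighted sum, so they return the same output; the second form is convenient because the bias is then treated as one more weight.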

Perceptron
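Sample data consists of n observed pairs:
$(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)$, $i = 1, \dots, n$.
$x_i$ is the input vector, $y_i$ is the label.

$b, w_1, w_2, \dots, w_D$ are the unknowns.

$$f(w) = \frac{1}{n} \sum_{i=1}^{n} [\varepsilon_i]^2 = \frac{1}{n} \sum_{i=1}^{n} \left[ y_i - f(w^T x_i) \right]^2$$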

Mean Squared Error - MSE.


In the context of neural networks, this quantity is used as the MSE loss.
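As a small illustration of the MSE formula, the sketch below evaluates it for a single neuron on a toy dataset. The identity activation, the function name `mse` and the data values are assumptions chosen for the example.

```python
def mse(w, b, xs, ys, f=lambda z: z):
    """Mean squared error of a single neuron y = f(w^T x + b) over a dataset."""
    total = 0.0
    for x, y in zip(xs, ys):
        prediction = f(sum(wi * xi for wi, xi in zip(w, x)) + b)
        total += (y - prediction) ** 2  # squared residual for this sample
    return total / len(xs)

# Toy data generated by y = 2*x1 + 1*x2 (illustrative only).
xs = [[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]]
ys = [1.0, 2.0, 3.0]

loss_perfect = mse([2.0, 1.0], 0.0, xs, ys)  # weights that reproduce the labels
loss_zero = mse([0.0, 0.0], 0.0, xs, ys)     # all-zero weights predict 0 everywhere
```

With the weights that generated the data the loss is zero; with all-zero weights it is the mean of the squared labels, (1 + 4 + 9) / 3.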

In a neural network, the weights are usually found using an optimization algorithm that
minimizes a chosen loss function. In the case of Mean Squared Error (MSE) loss, the
objective is to minimize the average squared difference between the predicted output and
the true output for a given set of input data.
Here's a general overview of how the weights are updated in a neural network using MSE
loss:
1. Initialize the weights of the neural network randomly.
2. Forward pass: Feed the input data through the neural network and obtain the
predicted output.
3. Compute the MSE loss between the predicted output and the true output.
4. Backward pass: Calculate the gradient of the loss with respect to each weight in the
network using backpropagation.
5. Use an optimization algorithm such as Stochastic Gradient Descent (SGD) or
Adam to update the weights in the direction that reduces the loss (opposite to
the gradient). The size of the weight update is determined by the learning rate
hyperparameter.
6. Repeat steps 2-5 for multiple epochs (passes through the entire dataset) until the loss
converges to a minimum, stops improving significantly, or another stopping
condition is satisfied.
During the training process, the weights are adjusted in a way that minimizes the MSE
loss between the predicted output and the true output. This is done by iteratively updating
the weights in the direction of steepest descent of the loss function until convergence.
Once the weights have converged to a minimum, the neural network can be used to make
predictions on new unseen data.
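The six steps above can be sketched for the simplest possible case, a single linear neuron. This is an illustration under stated assumptions rather than a general implementation: it uses plain full-batch gradient descent instead of SGD or Adam, computes the MSE gradient analytically (which for one linear neuron is what backpropagation would return), and stops after a fixed number of epochs. The function name, data and hyperparameters are hypothetical.

```python
import random

def train_linear_neuron(xs, ys, lr=0.1, epochs=500, seed=0):
    rng = random.Random(seed)
    n, d = len(xs), len(xs[0])
    # Step 1: initialize the weights randomly.
    w = [rng.uniform(-0.5, 0.5) for _ in range(d)]
    b = 0.0
    history = []
    for _ in range(epochs):  # Step 6: repeat for a fixed number of epochs
        # Step 2: forward pass.
        preds = [sum(wi * xi for wi, xi in zip(w, x)) + b for x in xs]
        # Step 3: MSE loss.
        errors = [p - y for p, y in zip(preds, ys)]
        history.append(sum(e * e for e in errors) / n)
        # Step 4: gradient of the MSE with respect to each weight and the bias.
        grad_w = [2.0 / n * sum(e * x[j] for e, x in zip(errors, xs)) for j in range(d)]
        grad_b = 2.0 / n * sum(errors)
        # Step 5: move against the gradient, scaled by the learning rate.
        w = [wi - lr * g for wi, g in zip(w, grad_w)]
        b -= lr * grad_b
    return w, b, history

# Toy data from y = 2x + 1 (illustrative only).
xs = [[0.0], [1.0], [2.0], [3.0]]
ys = [1.0, 3.0, 5.0, 7.0]
w, b, history = train_linear_neuron(xs, ys)
```

On this toy data the weight approaches 2 and the bias approaches 1, and the recorded loss history decreases toward zero.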
Artificial Neural Networks (ANN)

Multi-Layer Perceptrons (MLP)

Feed Forward Network


An N-layer neural network with inputs, hidden layers of K neurons each, and one output
layer.
There are connections (synapses) between neurons across layers, but not within a layer.
The layer count N excludes the input layer.
A single-layer NN has no hidden layers: the input is mapped directly onto the output
(e.g. SVM, logistic regression).
Output layer neurons most commonly do not have an activation function; the output
layer represents the class scores.

$y^{(1)} = f(W^{(1)} x)$ - outputs of the first hidden layer

$y^{(2)} = f(W^{(2)} y^{(1)})$ - outputs of the second hidden layer

$y = f(W^{(3)} y^{(2)})$ - output (one output neuron, scalar output)

$$y = f\left(W^{(3)} f\left(W^{(2)} f\left(W^{(1)} x\right)\right)\right)$$
The input is a [3x1] vector.
The weights W(1), of size [4x3], hold the connections of the first hidden layer, and its biases are in
the vector b1, of size [4x1].
Each neuron has its weights in one row of W(1).
A matrix-vector multiplication evaluates the activations of all neurons in that layer.
W(2) is a [4x4] matrix of connections for the second hidden layer, and W(3) is a [1x4] matrix for the last (output) layer.
The full forward pass is simply three matrix multiplications interleaved with applications of the
activation function.
The size of a network is measured by its number of parameters and its number of layers.
This network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
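The layer sizes and the parameter count can be reproduced with a short sketch of the 3-4-4-1 network. The helper names, the random initialization, the tanh hidden activation and the identity output activation are assumptions made for this example; the notes do not fix any of them.

```python
import math
import random

def make_layer(n_in, n_out, rng):
    """Return (W, b): W has one row of n_in weights per neuron, b one bias per neuron."""
    W = [[rng.uniform(-1.0, 1.0) for _ in range(n_in)] for _ in range(n_out)]
    b = [0.0] * n_out
    return W, b

def layer_forward(W, b, x, f):
    """Apply f(Wx + b) element-wise; each row of W belongs to one neuron."""
    return [f(sum(wij * xj for wij, xj in zip(row, x)) + bi) for row, bi in zip(W, b)]

rng = random.Random(0)
sizes = [3, 4, 4, 1]  # input, two hidden layers, output
layers = [make_layer(n_in, n_out, rng) for n_in, n_out in zip(sizes, sizes[1:])]

def forward(x):
    for i, (W, b) in enumerate(layers):
        is_output = i == len(layers) - 1
        # Assumption: tanh in hidden layers, no activation on the output layer.
        x = layer_forward(W, b, x, (lambda z: z) if is_output else math.tanh)
    return x

n_weights = sum(len(W) * len(W[0]) for W, _ in layers)  # 12 + 16 + 4
n_biases = sum(len(b) for _, b in layers)               # 4 + 4 + 1
output = forward([1.0, 2.0, 3.0])                       # list with one scalar
```

Counting the entries of the three weight matrices and bias vectors gives 32 weights and 9 biases, matching the total of 41 learnable parameters.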
Activation Functions
An activation function in a neural network defines how the weighted sum of the input is
transformed into an output from a node or nodes in a layer of the network.
It decides whether a neuron should be activated or not, that is, whether the neuron's
input is important for the prediction.
Sometimes the activation function is called a “transfer function.” If the output range of
the activation function is limited, then it may be called a “squashing function.” Many
activation functions are nonlinear and may be referred to as the “nonlinearity” in the
layer or the network design.
The choice of activation function has a large impact on the capability and performance of
the neural network, and different activation functions may be used in different parts of
the model.
Technically, the activation function is used within or after the internal processing of each
node in the network, although networks are designed to use the same activation function
for all nodes in a layer.
A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another layer,
and output layers that make a prediction.
All hidden layers typically use the same activation function. The output layer will
typically use a different activation function from the hidden layers and is dependent upon
the type of prediction required by the model.
Activation functions are also typically differentiable, meaning the first-order derivative
can be calculated for a given input value. This is required given that neural networks are
typically trained using the backpropagation of error algorithm that requires the derivative
of prediction error in order to update the weights of the model.
There are many different types of activation functions used in neural networks, although
only a small number of them are used in practice for hidden and output layers.
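To make the differentiability requirement concrete, here is a sketch of three widely used activation functions together with their first derivatives. The notes do not name specific functions at this point, so the choice of sigmoid, tanh and ReLU is an assumption made for illustration.

```python
import math

def sigmoid(z):
    """Squashes any real input into the range (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    s = sigmoid(z)
    return s * (1.0 - s)

def tanh_prime(z):
    """Derivative of math.tanh, which squashes inputs into (-1, 1)."""
    return 1.0 - math.tanh(z) ** 2

def relu(z):
    return max(0.0, z)

def relu_prime(z):
    # ReLU is not differentiable at z = 0; returning 0 there is a common convention.
    return 1.0 if z > 0 else 0.0
```

Sigmoid and tanh have a limited output range, so they are "squashing functions" in the sense used above, while ReLU is unbounded for positive inputs.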

Importance of activation functions


$y = f\left(W^{(3)} f\left(W^{(2)} f\left(W^{(1)} x\right)\right)\right)$ - with activation functions

$y = W^{(3)} W^{(2)} W^{(1)} x$ - without activation functions

$y = \left(W^{(3)} W^{(2)} W^{(1)}\right) x$ - perceptron equation
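The collapse of a network without activation functions into a single linear map can be verified numerically. In the sketch below the matrix entries are arbitrary illustrative values with the shapes [4x3], [4x4] and [1x4]; the helper names are hypothetical.

```python
def matmul(A, B):
    """Matrix product of two lists-of-rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def matvec(A, x):
    """Matrix-vector product."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Arbitrary illustrative weights (assumptions, not from the notes).
W1 = [[1.0, 2.0, 0.0], [0.0, 1.0, -1.0], [2.0, 0.0, 1.0], [1.0, 1.0, 1.0]]  # 4x3
W2 = [[1.0, 0.0, 1.0, 0.0], [0.0, 2.0, 0.0, 1.0],
      [1.0, 1.0, 0.0, 0.0], [0.0, 0.0, 1.0, 1.0]]                            # 4x4
W3 = [[1.0, -1.0, 2.0, 0.5]]                                                 # 1x4
x = [1.0, 2.0, 3.0]

# Three layers applied one after another, with no activation in between.
layer_by_layer = matvec(W3, matvec(W2, matvec(W1, x)))

# The same computation with the three matrices pre-multiplied into one 1x3 matrix.
W_combined = matmul(W3, matmul(W2, W1))
single_layer = matvec(W_combined, x)
```

Both routes give the same output, so without a nonlinearity the three layers have no more expressive power than one layer with the weight matrix `W_combined`.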


Derivatives
NEURAL NETWORK TRAINING

For example, suppose we have a neural network with 2 trainable parameters. We can
express the loss of the neural network as L(w0, w1, Xtrain, Ytrain), where w0, w1 are the
weights of the network (trainable parameters) and Xtrain, Ytrain are the training data.
Because we focus on the trainable parameters, from here on we write the loss without
the training data, as L(w0, w1).
Training is the minimization of the loss with respect to the trainable parameters.
Gradient Descent

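Gradient of L(w0, w1):

$$\nabla L(w_0, w_1) = \left( \frac{\partial L}{\partial w_0}, \frac{\partial L}{\partial w_1} \right)$$

$$\frac{\partial f}{\partial w_0} = \lim_{h \to 0} \frac{f(w_0 + h, w_1) - f(w_0, w_1)}{h} \approx \frac{f(w_0 + h, w_1) - f(w_0, w_1)}{h}$$

$$\frac{\partial f}{\partial w_1} = \lim_{h \to 0} \frac{f(w_0, w_1 + h) - f(w_0, w_1)}{h} \approx \frac{f(w_0, w_1 + h) - f(w_0, w_1)}{h}$$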

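$$w_0^{(1)} = w_0^{(0)} - \eta \left. \frac{\partial L}{\partial w_0} \right|_{w_0^{(0)},\, w_1^{(0)}}$$

$$w_1^{(1)} = w_1^{(0)} - \eta \left. \frac{\partial L}{\partial w_1} \right|_{w_0^{(0)},\, w_1^{(0)}}$$

$\eta$ - learning rate

$$w^{(1)} = w^{(0)} - \eta \left. \nabla L(w) \right|_{w^{(0)}} \quad \text{(first iteration)}$$

$$w^{(i+1)} = w^{(i)} - \eta \left. \nabla L(w) \right|_{w^{(i)}} \quad \text{(iteration } i+1\text{)}$$

Stop when L changes only slightly between iterations or when the maximum number of
iterations has been executed.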
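The update rule and the stopping criteria can be sketched for a two-parameter loss, with the partial derivatives approximated by the forward-difference quotient. The quadratic loss, the step size `h`, the learning rate and the tolerance are all hypothetical choices for this illustration.

```python
def loss(w0, w1):
    # Hypothetical loss with its minimum at (w0, w1) = (3, -1).
    return (w0 - 3.0) ** 2 + 2.0 * (w1 + 1.0) ** 2

def numerical_gradient(f, w0, w1, h=1e-6):
    """Forward-difference approximation of (df/dw0, df/dw1)."""
    base = f(w0, w1)
    return (f(w0 + h, w1) - base) / h, (f(w0, w1 + h) - base) / h

def gradient_descent(f, w0, w1, eta=0.1, max_iters=1000, tol=1e-12):
    previous = f(w0, w1)
    for iteration in range(1, max_iters + 1):
        g0, g1 = numerical_gradient(f, w0, w1)
        # w^(i+1) = w^(i) - eta * grad L(w^(i))
        w0, w1 = w0 - eta * g0, w1 - eta * g1
        current = f(w0, w1)
        # Stop when the loss changes only slightly.
        if abs(previous - current) < tol:
            break
        previous = current
    return w0, w1, iteration

w0, w1, iterations = gradient_descent(loss, 0.0, 0.0)
```

Starting from (0, 0), the iterates approach (3, -1) and the loop ends on the small-change criterion well before the iteration limit.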

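When we have D trainable parameters, the gradient is:

$$\nabla f(w_0, w_1, \dots, w_i, \dots, w_D) = \left( \frac{\partial f}{\partial w_0}, \frac{\partial f}{\partial w_1}, \dots, \frac{\partial f}{\partial w_i}, \dots, \frac{\partial f}{\partial w_D} \right)$$

$$\frac{\partial f}{\partial w_i} = \lim_{h \to 0} \frac{f(w_i + h) - f(w_i)}{h} \approx \frac{f(w_i + h) - f(w_i)}{h}$$

Here $f(w_i + h)$ means that only $w_i$ is shifted by $h$ while all other parameters stay fixed.

The number of operations is $O(D^2)$, because we need to calculate D partial derivatives, and
for every derivative the number of operations is proportional to the number of parameters.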
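The quadratic cost can be seen by counting function evaluations in a finite-difference gradient. The sketch below uses a hypothetical loss whose evaluation touches every parameter once; the counter and all names are illustrative.

```python
def numerical_gradient(f, w, h=1e-6):
    """Forward-difference estimate of the gradient of f at the parameter list w."""
    base = f(w)  # one evaluation at the unshifted point
    grad = []
    for i in range(len(w)):
        shifted = list(w)
        shifted[i] += h  # shift only parameter i
        grad.append((f(shifted) - base) / h)
    return grad

counter = {"calls": 0}

def counted_loss(w):
    """Hypothetical loss; each call costs work proportional to len(w)."""
    counter["calls"] += 1
    return sum((i + 1) * wi ** 2 for i, wi in enumerate(w))

w = [1.0, -2.0, 0.5, 3.0]
grad = numerical_gradient(counted_loss, w)
exact = [2 * (i + 1) * wi for i, wi in enumerate(w)]  # analytic gradient for comparison
n_calls = counter["calls"]
```

For these 4 parameters the loss is evaluated 5 times (once per parameter plus the base point), and each evaluation loops over all parameters, which is where the quadratic growth comes from.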


Backpropagation

The training of Neural Networks (NN) with gradient-based optimization algorithms
is organized in two major steps:
Forward Propagation - here we calculate the output of the NN given the inputs.
Backward Propagation - here we calculate the gradients of the output with respect to
the weights.
The first step is usually straightforward to understand and to calculate. The general idea
behind the second step is also clear: we need the gradients to know in which direction to
step in the gradient descent optimization algorithm.
Although backpropagation is not a new idea (it was developed in the 1970s), the question
of how these gradients are calculated gives some people a hard time. One has to
reach for some calculus, especially partial derivatives and the chain rule, to fully
understand how backpropagation works.
Originally, backpropagation was developed to differentiate complex nested functions.
However, it became highly popular thanks to the machine learning community and is
now the cornerstone of Neural Networks.
We'll go to the roots and solve an example problem step by step by hand, then
implement it in Python using PyTorch, and finally compare both results to
make sure everything works.
Computational Graph
Let's assume we want to perform the following set of operations to get our result r:

$$u = x^2, \quad v = u + y, \quad w = z v, \quad r = w^2$$

If you substitute all individual node equations into the final one, you'll find that we're
evaluating the following expression. More specifically, we want to calculate its value and its
partial derivatives, so in this use case it is a purely mathematical task.

$$r = z^2 (x^2 + y)^2$$
Forward Pass
To make this concept more tangible let’s take some numbers for our calculation. For
example:
x=1
y=2
z=4
Here we simply substitute our inputs into the equations. The results of the individual node
steps are u = 1, v = 3 and w = 12, and the final output is r = 144.
Backward Pass
Now it's time to perform backpropagation, also known by the fancier names
"backward propagation of errors" or "reverse mode of automatic differentiation".
To calculate the gradients with respect to each of the 3 variables, we have to calculate the
partial derivatives at each node in the graph (local gradients). Below we show how to do it for
the last two nodes/steps.
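$$\frac{\partial r}{\partial w} = \frac{\partial w^2}{\partial w} = 2w$$

$$\frac{\partial w}{\partial z} = \frac{\partial (zv)}{\partial z} = v$$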

After the local gradients have been calculated, each edge of the computational graph
carries its local gradient.
Now, to calculate the final gradients, we have to use the chain rule. In
practice this means we have to multiply all partial derivatives along the path from the
output to the variable of interest. For example, $\frac{\partial r}{\partial z} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial z} = 2w \cdot v = 2 \cdot 12 \cdot 3 = 72$.
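The chain-rule products for all three inputs can be written out explicitly. The following sketch uses the node names u, v, w, r from the PyTorch example in these notes and the inputs x = 1, y = 2, z = 4; the names of the gradient variables are chosen for this illustration.

```python
x, y, z = 1.0, 2.0, 4.0

# Forward pass through the graph.
u = x ** 2   # u = x^2
v = u + y    # v = u + y
w = z * v    # w = z * v
r = w ** 2   # r = w^2

# Local gradients at each node.
dr_dw = 2 * w   # d(w^2)/dw
dw_dz = v       # d(z*v)/dz
dw_dv = z       # d(z*v)/dv
dv_du = 1.0     # d(u+y)/du
dv_dy = 1.0     # d(u+y)/dy
du_dx = 2 * x   # d(x^2)/dx

# Chain rule: multiply the local gradients along the path from r to each input.
dr_dz = dr_dw * dw_dz
dr_dy = dr_dw * dw_dv * dv_dy
dr_dx = dr_dw * dw_dv * dv_du * du_dx
```

With these inputs the products evaluate to dr/dx = 192, dr/dy = 96 and dr/dz = 72.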

Now we can use these gradients for whatever we want, e.g. optimization with gradient
descent (SGD, Adam, etc.).

Implementation in PyTorch

There are numerous neural network frameworks in various languages in which you can
implement such computations and have the computer calculate the gradients for you.
Below, we'll demonstrate how to use the Python library PyTorch to solve our example
task.

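import torch

# Inputs as scalar tensors; requires_grad=True makes PyTorch track their gradients
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)

# forward pass: one line per node of the computational graph
u = x**2
v = u + y
w = z * v
r = w**2
print('r=', r.item())

# backward pass: fills x.grad, y.grad and z.grad with dr/dx, dr/dy and dr/dz
r.backward()
print('dr/dx = ', x.grad.item())
print('dr/dy = ', y.grad.item())
print('dr/dz = ', z.grad.item())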

The output of this code is:

r= 144.0
dr/dx =  192.0
dr/dy =  96.0
dr/dz =  72.0

Results from PyTorch are identical to the ones we calculated by hand.
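As an independent cross-check that needs no framework, the same gradients can be approximated with central finite differences. This is a sketch for verification only; the helper name and the step size `h` are assumptions, and the approximation is accurate only up to a small numerical error.

```python
def r(x, y, z):
    """The expression from the example: r = z^2 * (x^2 + y)^2."""
    return (z * (x ** 2 + y)) ** 2

def central_difference(f, args, index, h=1e-5):
    """Approximate the partial derivative of f with respect to args[index]."""
    up = list(args)
    up[index] += h
    down = list(args)
    down[index] -= h
    return (f(*up) - f(*down)) / (2 * h)

point = [1.0, 2.0, 4.0]  # x, y, z
dr_dx, dr_dy, dr_dz = (central_difference(r, point, i) for i in range(3))
```

The three estimates should land very close to the backpropagated gradients, which makes this a useful sanity check when a hand derivation and a framework disagree.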

Some notes:
● In PyTorch everything is a tensor, even if it contains only a single value.

● In PyTorch, when you create a variable that is subject to gradient-based
optimization, you have to pass the argument requires_grad=True. Otherwise, it will
be treated as a fixed input.

● With this implementation, all backpropagation calculations are performed simply by
calling the method r.backward().
