Neural Networks
For a single neuron with input vector $x \in \mathbb{R}^D$, weight vector $w$ and bias $b$ (the bias can be absorbed into $w$ by appending a constant input 1 to $x$):

$w=\begin{pmatrix} w_1 \\ w_2 \\ \vdots \\ w_D \end{pmatrix},\quad x=\begin{pmatrix} x_1 \\ x_2 \\ \vdots \\ x_D \end{pmatrix},\quad z=w^{T}x,\quad y=f(z)=f(w^{T}x)$
Perceptron
Sample data consists of n observed pairs:
$(x_1, y_1), \dots, (x_i, y_i), \dots, (x_n, y_n)$, $i = 1, \dots, n$,
where $x_i$ is the input vector and $y_i$ is the label.
$L(w)=\frac{1}{n}\sum_{i=1}^{n}\varepsilon_i^{2}=\frac{1}{n}\sum_{i=1}^{n}\left[y_i-f\left(w^{T}x_i\right)\right]^{2}$
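To make the notation concrete, here is a minimal NumPy sketch that evaluates this loss on a small synthetic dataset (the data values, the sigmoid activation and the weight vector are assumptions chosen purely for illustration):

import numpy as np

def f(z):
    # sigmoid activation (an assumed choice; any activation could be used)
    return 1.0 / (1.0 + np.exp(-z))

# synthetic data: n = 4 samples, D = 3 features (illustrative values only)
X = np.array([[0.5, 1.0, -0.3],
              [1.2, -0.7, 0.8],
              [-0.4, 0.1, 0.9],
              [0.3, 0.3, -1.1]])
y = np.array([1.0, 0.0, 1.0, 0.0])
w = np.array([0.1, -0.2, 0.05])          # current weight vector

# L(w) = (1/n) * sum_i [y_i - f(w^T x_i)]^2
predictions = f(X @ w)
loss = np.mean((y - predictions) ** 2)
print('MSE loss:', loss)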
In a neural network, the weights are usually found using an optimization algorithm that
minimizes a chosen loss function. In the case of Mean Squared Error (MSE) loss, the
objective is to minimize the average squared difference between the predicted output and
the true output for a given set of input data.
Here's a general overview of how the weights are updated in a neural network using MSE
loss:
1. Initialize the weights of the neural network randomly.
2. Forward pass: Feed the input data through the neural network and obtain the
predicted output.
3. Compute the MSE loss between the predicted output and the true output.
4. Backward pass: Calculate the gradient of the loss with respect to each weight in the
network using backpropagation.
5. Use an optimization algorithm such as Stochastic Gradient Descent (SGD) or
Adam to update the weights in the direction that reduces the loss (opposite to the
gradient). The size of each weight update is controlled by the learning rate
hyperparameter.
6. Repeat steps 2-5 for multiple epochs (passes through the entire dataset) until the loss
converges to a minimum, stops improving significantly, or some other stopping
condition is satisfied.
During the training process, the weights are adjusted in a way that minimizes the MSE
loss between the predicted output and the true output. This is done by iteratively updating
the weights in the direction of steepest descent of the loss function until convergence.
Once the weights have converged to a minimum, the neural network can be used to make
predictions on new unseen data.
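As an illustration, here is a minimal PyTorch sketch of this training loop for a tiny fully connected network trained with an MSE loss and SGD (the architecture, the synthetic data and the hyperparameter values are assumptions made only for the example):

import torch
import torch.nn as nn

# synthetic regression data: 100 samples, 3 input features (illustrative only)
X = torch.randn(100, 3)
y = torch.randn(100, 1)

# 1. a small network; its weights are initialized randomly by PyTorch
model = nn.Sequential(nn.Linear(3, 4), nn.Sigmoid(), nn.Linear(4, 1))
loss_fn = nn.MSELoss()
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)   # learning rate hyperparameter

for epoch in range(100):            # 6. repeat for multiple epochs
    optimizer.zero_grad()
    y_pred = model(X)               # 2. forward pass
    loss = loss_fn(y_pred, y)       # 3. MSE loss
    loss.backward()                 # 4. backward pass (backpropagation)
    optimizer.step()                # 5. weight update opposite to the gradient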
Artificial Neural Networks (ANN)
$y=f\left(W^{(3)}\,f\left(W^{(2)}\,f\left(W^{(1)}x\right)\right)\right)$
The input is a [3x1] vector.
The weights W(1) form a [4x3] matrix holding the connections of the first hidden layer, and the
biases are in the vector b(1), of size [4x1].
A single neuron has its weights in one row of W(1).
A matrix-vector multiplication evaluates the activations of all neurons in that layer.
W(2) is a [4x4] matrix with the connections of the second hidden layer, and W(3) a [1x4] matrix for the last (output) layer.
The full forward pass is simply three matrix multiplications and applications of the
activation functions.
The size of the network is usually described by the number of parameters and the number of layers.
This network has 4 + 4 + 1 = 9 neurons, [3 x 4] + [4 x 4] + [4 x 1] = 12 + 16 + 4 = 32
weights and 4 + 4 + 1 = 9 biases, for a total of 41 learnable parameters.
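A minimal NumPy sketch of this forward pass, with randomly initialized weights and a sigmoid activation (both assumed only for illustration), could look like this:

import numpy as np

def f(z):
    # sigmoid activation (assumed choice)
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = rng.standard_normal((3, 1))           # input: [3x1] vector

W1, b1 = rng.standard_normal((4, 3)), rng.standard_normal((4, 1))   # first hidden layer
W2, b2 = rng.standard_normal((4, 4)), rng.standard_normal((4, 1))   # second hidden layer
W3, b3 = rng.standard_normal((1, 4)), rng.standard_normal((1, 1))   # output layer

# full forward pass: three matrix multiplications plus activations
h1 = f(W1 @ x + b1)                       # [4x1]
h2 = f(W2 @ h1 + b2)                      # [4x1]
y = f(W3 @ h2 + b3)                       # [1x1]

# parameter count: 12 + 16 + 4 = 32 weights and 4 + 4 + 1 = 9 biases -> 41
n_params = sum(W.size + b.size for W, b in [(W1, b1), (W2, b2), (W3, b3)])
print(y.item(), n_params)                 # n_params == 41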
Activation Functions
An activation function in a neural network defines how the weighted sum of the input is
transformed into an output from a node or nodes in a layer of the network.
It decides whether a neuron should be activated or not, i.e. whether the neuron's input is
important or not in the process of prediction.
Sometimes the activation function is called a “transfer function.” If the output range of
the activation function is limited, then it may be called a “squashing function.” Many
activation functions are nonlinear and may be referred to as the “nonlinearity” in the
layer or the network design.
The choice of activation function has a large impact on the capability and performance of
the neural network, and different activation functions may be used in different parts of
the model.
Technically, the activation function is used within or after the internal processing of each
node in the network, although networks are designed to use the same activation function
for all nodes in a layer.
A network may have three types of layers: input layers that take raw input from the
domain, hidden layers that take input from another layer and pass output to another layer,
and output layers that make a prediction.
All hidden layers typically use the same activation function. The output layer will
typically use a different activation function from the hidden layers and is dependent upon
the type of prediction required by the model.
Activation functions are also typically differentiable, meaning the first-order derivative
can be calculated for a given input value. This is required given that neural networks are
typically trained using the backpropagation of error algorithm that requires the derivative
of prediction error in order to update the weights of the model.
There are many different types of activation functions used in neural networks, although
perhaps only a small number of them are used in practice for hidden and output layers.
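For reference, here is a short NumPy sketch of three commonly used activation functions (these particular functions are standard choices, not ones prescribed by this text):

import numpy as np

def sigmoid(z):
    # squashes its input into (0, 1); often used for binary-classification outputs
    return 1.0 / (1.0 + np.exp(-z))

def tanh(z):
    # squashes its input into (-1, 1)
    return np.tanh(z)

def relu(z):
    # rectified linear unit; a common default for hidden layers
    return np.maximum(0.0, z)

z = np.array([-2.0, -0.5, 0.0, 0.5, 2.0])
print(sigmoid(z), tanh(z), relu(z), sep='\n')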
$\frac{\partial f}{\partial w_1}=\lim_{h\to 0}\frac{f(w_0,\,w_1+h)-f(w_0,\,w_1)}{h}\approx\frac{f(w_0,\,w_1+h)-f(w_0,\,w_1)}{h}$
$w_1^{(1)}=w_1^{(0)}-\eta\left.\frac{\partial L}{\partial w_1}\right|_{w_0^{(0)},\,w_1^{(0)}}$

$\eta$ – learning rate
$w^{(1)}=w^{(0)}-\eta\,\nabla L(w)\big|_{w^{(0)}}$ – first iteration

$w^{(i+1)}=w^{(i)}-\eta\,\nabla L(w)\big|_{w^{(i)}}$ – (i+1)-th iteration
Stop when L changes only slightly between iterations or when the maximum number of
iterations has been reached.
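A minimal sketch of this iteration on a simple quadratic loss (the loss function, starting point, learning rate and stopping tolerance are all assumptions chosen for illustration):

import numpy as np

def L(w):
    # an assumed toy loss with its minimum at w = (1, -2)
    return (w[0] - 1.0) ** 2 + (w[1] + 2.0) ** 2

def grad_L(w):
    # analytic gradient of the toy loss
    return np.array([2.0 * (w[0] - 1.0), 2.0 * (w[1] + 2.0)])

w = np.array([5.0, 5.0])    # w^(0): initial guess
eta = 0.1                   # learning rate
for i in range(1000):       # maximal number of iterations
    w_new = w - eta * grad_L(w)         # w^(i+1) = w^(i) - eta * grad L(w^(i))
    if abs(L(w_new) - L(w)) < 1e-10:    # stop when L changes only slightly
        w = w_new
        break
    w = w_new
print(w)                    # close to (1, -2)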
$\frac{\partial f}{\partial w_i}=\lim_{h\to 0}\frac{f(w_i+h)-f(w_i)}{h}\approx\frac{f(w_i+h)-f(w_i)}{h}$
The number of operations is $O(D^2)$, because we need to calculate D partial derivatives, and
each finite-difference approximation requires an extra evaluation of f, which itself costs $O(D)$.
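As a sketch, this numerical (finite-difference) gradient can be computed as below; note that it needs one extra evaluation of f per coordinate (the example function and the step size h are assumptions):

import numpy as np

def numerical_gradient(f, w, h=1e-5):
    # approximate each partial derivative with (f(w + h*e_i) - f(w)) / h
    grad = np.zeros_like(w)
    f_w = f(w)                        # one baseline evaluation
    for i in range(len(w)):           # D extra evaluations, one per coordinate
        w_step = w.copy()
        w_step[i] += h
        grad[i] = (f(w_step) - f_w) / h
    return grad

# toy example: f(w) = sum(w^2), whose exact gradient is 2*w
f = lambda w: np.sum(w ** 2)
w = np.array([1.0, -2.0, 3.0])
print(numerical_gradient(f, w))       # approximately [2, -4, 6]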
If you substitute all the individual node equations into the final one, you'll find that we are
solving the following equation. More specifically, we want to calculate its value and its
partial derivatives, so in this use case it is a purely mathematical task.

$r=z^{2}\left(x^{2}+y\right)^{2}$
Forward Pass
To make this concept more tangible let’s take some numbers for our calculation. For
example:
x=1
y=2
z=4
Here we simply substitute our inputs into the equations. The results of the individual node
steps are shown below. The final output is r=144.
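Breaking the computation into the individual node steps (using the same intermediate variables u, v, w that appear in the PyTorch code later on):

$u=x^{2}=1,\qquad v=u+y=3,\qquad w=z\cdot v=12,\qquad r=w^{2}=144$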
Backward Pass
Now it's time to perform backpropagation, also known under the fancier names "backward
propagation of errors" or "reverse-mode automatic differentiation".
To calculate the gradients with respect to each of the 3 variables, we have to calculate partial
derivatives at each node in the graph (local gradients). Below we show how to do it for
the last two nodes/steps.
$\frac{\partial r}{\partial w}=\frac{\partial w^{2}}{\partial w}=2w$

$\frac{\partial w}{\partial z}=\frac{\partial (zv)}{\partial z}=v$
After completing the calculation of the local gradients at every node, we can walk the
computation graph backwards.
Now, to calculate the final gradients we have to use the chain rule. In
practice this means we have to multiply all partial derivatives along the path from the
output to the variable of interest:
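Carrying this out for our example (the numerical values follow directly from the forward pass above and match what PyTorch reports below):

$\frac{\partial r}{\partial z}=\frac{\partial r}{\partial w}\cdot\frac{\partial w}{\partial z}=2w\cdot v=24\cdot 3=72$

$\frac{\partial r}{\partial y}=\frac{\partial r}{\partial w}\cdot\frac{\partial w}{\partial v}\cdot\frac{\partial v}{\partial y}=2w\cdot z\cdot 1=96$

$\frac{\partial r}{\partial x}=\frac{\partial r}{\partial w}\cdot\frac{\partial w}{\partial v}\cdot\frac{\partial v}{\partial u}\cdot\frac{\partial u}{\partial x}=2w\cdot z\cdot 1\cdot 2x=192$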
Now we can use these gradients for whatever we want, e.g. optimization with gradient
descent (SGD, Adam, etc.).
Implementation in PyTorch
There are numerous neural network frameworks in various languages in which you can
implement such computations and have the computer calculate the gradients for you.
Below, we'll demonstrate how to use the Python PyTorch library to solve our example
task.
import torch
x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)
# forward pass through the computation graph
u = x**2     # u = 1
v = u + y    # v = 3
w = z * v    # w = 12
r = w**2     # r = 144
print('r=', r.item())
# backward pass: fills in .grad for every tensor created with requires_grad=True
r.backward()
print('dr/dx = ', x.grad.item())   # 192.0
print('dr/dy = ', y.grad.item())   # 96.0
print('dr/dz = ', z.grad.item())   # 72.0
Some notes:
● In PyTorch everything is a tensor — even if it contains only a single value.