Lecture 5

NEURAL NETWORK TRAINING

For example, suppose we have a neural network with 2 trainable
parameters. We can express the loss of the neural network as
L(w0, w1, Xtrain, Ytrain), where w0 and w1 are the weights of the neural
network (trainable parameters) and Xtrain, Ytrain are the training data.
Because we will focus on the trainable parameters, from now on we will
write the function without the training data: L(w0, w1).
Training is the minimization of the loss with respect to the trainable parameters.
Gradient Descent

Gradient of L(w0, w1):

$$\nabla L(w_0, w_1) = \left( \frac{\partial L}{\partial w_0},\; \frac{\partial L}{\partial w_1} \right)$$

Gradient descent update, where $\eta$ is the learning rate:

$$\mathbf{w}^{(1)} = \mathbf{w}^{(0)} - \eta\, \nabla L\!\left(\mathbf{w}^{(0)}\right) \quad \text{(first iteration)}$$

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right) \quad \text{($i{+}1$ iteration)}$$

Stop when L changes only slightly between iterations or when the maximal number of iterations has been reached.
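As a minimal sketch of this loop in Python (the quadratic loss, learning rate, and tolerance below are illustrative assumptions, not from the lecture):

import numpy as np

def L(w):
    # toy loss with minimum at w = (3, -1), chosen only for illustration
    return (w[0] - 3.0)**2 + (w[1] + 1.0)**2

def grad_L(w):
    # analytic gradient of the toy loss
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.zeros(2)        # w^(0), initial weights
eta = 0.1              # learning rate
for i in range(1000):  # maximal iterations number
    w_new = w - eta * grad_L(w)       # w^(i+1) = w^(i) - eta * grad L(w^(i))
    if abs(L(w_new) - L(w)) < 1e-12:  # stop when L has small changes
        w = w_new
        break
    w = w_new
print(w)  # approaches the minimum at (3, -1)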

When we have D trainable parameters, the gradient is

$$\nabla L = \left( \frac{\partial L}{\partial w_0},\; \ldots,\; \frac{\partial L}{\partial w_{D-1}} \right)$$

The number of operations is $O(D^2)$, because for $\nabla L$ we need to calculate D partial derivatives, and for every derivative the number of operations is proportional to the number of parameters.
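To see where the $O(D^2)$ count comes from, here is a hedged sketch of naive numerical differentiation (the dot-product loss is an assumed stand-in): each of the D partial derivatives needs an extra evaluation of the loss, and each loss evaluation itself touches all D parameters.

import numpy as np

def loss(w, x):
    # stand-in loss: one evaluation is O(D), it touches every parameter
    return (w @ x) ** 2

def numeric_gradient(w, x, h=1e-6):
    # D partial derivatives, each via one extra O(D) loss evaluation:
    # D derivatives * O(D) per evaluation = O(D^2) operations
    base = loss(w, x)
    g = np.zeros_like(w)
    for d in range(len(w)):
        w_shift = w.copy()
        w_shift[d] += h
        g[d] = (loss(w_shift, x) - base) / h
    return g

D = 4
w = np.ones(D)
x = np.arange(1.0, D + 1.0)
print(numeric_gradient(w, x))  # close to the analytic 2 * (w @ x) * x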
Backpropagation
The training of Neural Networks (NN) with gradient-based optimization algorithms is
organized in two major steps:
Forward Propagation - here we calculate the output of the NN given the inputs
Backward Propagation - here we calculate the gradients of the output with respect to the
weights.
The first step is usually straightforward to understand and to calculate. The general idea
behind the second step is also clear: we need gradients to know the direction in which to make
steps in the gradient descent optimization algorithm.
Although backpropagation is not a new idea (it was developed in the 1970s), answering the
question of "how" these gradients are calculated gives some people a hard time. One has to
reach for some calculus, especially partial derivatives and the chain rule, to fully
understand the working principles of backpropagation.
Originally backpropagation was developed to differentiate complex nested functions.
However, it became highly popular thanks to the machine learning community and is now
the cornerstone of Neural Networks.
We'll go to the roots and solve an exemplary problem step by step by hand, then we'll
implement it in Python using PyTorch, and finally we're going to compare both results to
make sure everything works fine.
Computational Graph
Let's assume we want to perform the following set of operations to get our result r:

$$u = x^2, \qquad v = u + y, \qquad w = z \cdot v, \qquad r = w^2$$

If you substitute all the individual node equations into the final one, you'll find that we're
solving the following equation:

$$r = \left(z\,(x^2 + y)\right)^2$$

Being more specific, we want to calculate its value and its partial derivatives
$\partial r / \partial x$, $\partial r / \partial y$, $\partial r / \partial z$. So in this use case it is a pure mathematical task.
Forward Pass
To make this concept more tangible let’s take some numbers for our calculation. For
example:
x=1
y=2
z=4
Here we simply substitute our inputs into the equations. The results of the individual node steps are:

$$u = x^2 = 1, \qquad v = u + y = 3, \qquad w = z \cdot v = 12, \qquad r = w^2 = 144$$

The final output is r = 144.
Backward Pass
Now it's time to perform backpropagation, also known under the fancier name
"backward propagation of errors" or even "reverse mode of automatic differentiation".
To calculate the gradients with respect to each of the 3 variables, we have to calculate the partial
derivatives at each node in the graph (local gradients). Below we show how to do it for
the last two nodes/steps.
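For the last two nodes, $r = w^2$ and $w = z \cdot v$, the local gradients follow directly from the definitions and the forward-pass values:

$$\frac{\partial r}{\partial w} = 2w = 24, \qquad \frac{\partial w}{\partial v} = z = 4, \qquad \frac{\partial w}{\partial z} = v = 3$$

The remaining local gradients are $\partial v / \partial u = 1$, $\partial v / \partial y = 1$ and $\partial u / \partial x = 2x = 2$.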

After completing the calculations of the local gradients, every edge of the computational graph carries its local gradient.
Now, to calculate the final gradients we have to use the chain rule. In
practice this means we have to multiply all partial derivatives along the path from the
output to the variable of interest:
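$$\frac{\partial r}{\partial x} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial x} = 24 \cdot 4 \cdot 1 \cdot 2 = 192$$

$$\frac{\partial r}{\partial y} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial v} \cdot \frac{\partial v}{\partial y} = 24 \cdot 4 \cdot 1 = 96$$

$$\frac{\partial r}{\partial z} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial z} = 24 \cdot 3 = 72$$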

Now we can use these gradients for whatever we want, e.g. optimization with gradient
descent (SGD, Adam, etc.).
Implementation in PyTorch

There are numerous Neural Network frameworks in various languages in which you can
implement such computations and make a computer calculate the gradients for you. Below,
we'll demonstrate how to use the Python PyTorch library to solve our exemplary task.

import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)
# forward pass
u = x**2
v = u+y
w = z*v
r = w**2
print('r=', r.item())
# backward pass
r.backward()
print('dr/dx = ', x.grad.item())
print('dr/dy = ', y.grad.item())
print('dr/dz = ', z.grad.item())
The output of this code is:

r= 144.0
dr/dx =  192.0
dr/dy =  96.0
dr/dz =  72.0

The results from PyTorch are identical to the ones we calculated by hand.

Some notes:
- In PyTorch everything is a tensor, even if it contains only a single value.
- In PyTorch, when you specify a variable that is subject to gradient-based
optimization, you have to pass the argument requires_grad=True. Otherwise, it will be
treated as a fixed input.
- With this implementation, all backpropagation calculations are performed simply by
calling the method r.backward() (see the sketch below).
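A short sketch of the last two notes (the variable names here are illustrative): a tensor created without requires_grad=True is treated as a fixed input and receives no gradient.

import torch

a = torch.tensor(1.0, requires_grad=True)  # trainable: gradient is tracked
b = torch.tensor(2.0)                      # fixed input: no gradient tracked

out = (a * b) ** 2
out.backward()                             # backpropagation in one call

print(a.grad)  # tensor(8.) -> d(out)/da = 2 * a * b**2 = 8
print(b.grad)  # None -> b was treated as a fixed input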
OPTIMIZERS
What is an optimizer?
Optimizers are algorithms or methods used to minimize an error function (loss function) or
to maximize the efficiency of production. Optimizers are mathematical functions that
depend on the model's learnable parameters, i.e. weights and biases. Optimizers define
how to change the weights and learning rate of a neural network to reduce the losses.
Let's learn about different types of optimizers and how exactly they work to minimize the
loss function.
An optimizer is necessary to train a neural network.
Vanilla Gradient Descent
Gradient descent is an optimization algorithm based on a convex function; it tweaks its
parameters iteratively to minimize a given function to its local minimum. Gradient
descent iteratively reduces a loss function by moving in the direction opposite to that of
steepest ascent. It relies on the derivatives of the loss function to find minima. It
uses the data of the entire training set to calculate the gradient of the cost function with
respect to the parameters, which requires a large amount of memory and slows down the process.

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$
Advantages of Gradient Descent
1. Easy to understand.
2. Easy to implement.
Disadvantages of Gradient Descent
1. Because this method calculates the gradient for the entire data set in one update, the
calculation is very slow.
2. It requires large memory and it is computationally expensive.
Stochastic Gradient Descent
It is a variant of Gradient Descent. It updates the model parameters one example at a time. If
the dataset has 10K examples, SGD will update the model parameters 10K times (once per example).

Advantages of Stochastic Gradient Descent
1. Frequent updates of model parameters.
2. Requires less memory.
3. Allows the use of large data sets, as it has to process only one example at a time.
Disadvantages of Stochastic Gradient Descent
1. The frequent updates can also result in noisy gradients, which may cause the error to increase
instead of decreasing.
2. High variance.
3. Frequent updates are computationally expensive.
Mini-Batch Gradient Descent
It is a combination of the concepts of SGD and batch gradient descent. It simply splits the
training dataset into small batches and performs an update for each of those batches. This
creates a balance between the robustness of stochastic gradient descent and the efficiency
of batch gradient descent. It can reduce the variance of the parameter updates, and
the convergence is more stable. It splits the data set into batches of between 50 and 256
examples, chosen at random.
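A hedged numpy sketch of mini-batch gradient descent on a toy linear least-squares problem (the dataset, model, and batch size are assumptions for illustration). Setting batch_size to the full dataset size gives batch gradient descent, and batch_size = 1 gives SGD.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 64               # batch size within the 50-256 range
for epoch in range(20):
    idx = rng.permutation(len(X))       # random batches each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the batch
        w -= eta * grad                 # one update per mini-batch
print(w)  # close to true_w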

Advantages of Mini Batch Gradient Descent:
1. It leads to more stable convergence.
2. More efficient gradient calculations.
3. Requires less memory.
Disadvantages of Mini Batch Gradient Descent:
1. Mini-batch gradient descent does not guarantee good convergence.
2. If the learning rate is too small, the convergence rate will be slow. If it is too large, the
loss function will oscillate around or even diverge from the minimum value.
SGD with Momentum
SGD with Momentum is a stochastic optimization method that adds a momentum term to
regular stochastic gradient descent. Momentum simulates the inertia of a moving object:
the direction of the previous update is retained to a certain extent during
the current update, while the current gradient is used to fine-tune the final update direction.
In this way, stability is increased to a certain extent, learning is faster, and there is also
some ability to escape local optima.

SGD:

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

SGD with Momentum:

$$\mathbf{v}^{(i+1)} = \gamma\, \mathbf{v}^{(i)} + \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right), \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \mathbf{v}^{(i+1)}$$

where $\gamma$ is the momentum coefficient.
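A minimal sketch of the momentum update from the formulas above, on an assumed toy loss L(w) = w0^2 + 10*w1^2 (the learning rate and momentum values are typical choices, not from the lecture):

import numpy as np

def grad_L(w):
    # gradient of the toy loss L(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros_like(w)      # accumulated update direction
eta, gamma = 0.05, 0.9    # learning rate and momentum
for i in range(300):
    v = gamma * v + eta * grad_L(w)  # v^(i+1) = gamma * v^(i) + eta * grad L(w^(i))
    w = w - v                        # w^(i+1) = w^(i) - v^(i+1)
print(w)  # converges to the minimum at (0, 0)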
Advantages of SGD with momentum:
1. Momentum helps to reduce the noise.
2. An exponentially weighted average is used to smoothen the curve.
Disadvantage of SGD with momentum:
1. An extra hyperparameter is added.
AdaGrad (Adaptive Gradient Descent)
In all the algorithms discussed previously, the learning rate remains constant. The
intuition behind AdaGrad is to use a different learning rate for each parameter, in every
neuron of every hidden layer, adapted across iterations.

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta}{\sqrt{G^{(i)} + \varepsilon}}\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

where

$$G^{(i)} = \sum_{k=0}^{i} \left(\nabla L\!\left(\mathbf{w}^{(k)}\right)\right)^2$$

is the element-wise running sum of squared gradients and $\varepsilon$ is a small constant that prevents division by zero.
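A sketch of the AdaGrad update with a toy gradient (the learning rate and eps are assumed values): each parameter accumulates its own sum of squared gradients, so each gets its own effective learning rate.

import numpy as np

def grad_L(w):
    return 2.0 * w   # gradient of the toy loss L(w) = ||w||^2

w = np.array([5.0, 0.5])
G = np.zeros_like(w)     # per-parameter running sum of squared gradients
eta, eps = 0.5, 1e-8
for i in range(100):
    g = grad_L(w)
    G += g ** 2                       # grows monotonically
    w -= eta / np.sqrt(G + eps) * g   # per-parameter effective learning rate
print(w)  # each coordinate shrinks at its own adaptive rate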

Advantages of AdaGrad:
1. The learning rate changes adaptively with iterations.
2. It is able to train on sparse data as well.
Disadvantage of AdaGrad:
1. If the neural network is deep, the learning rate becomes a very small number, which
causes the dead neuron problem.
RMS-Prop (Root Mean Square Propagation)
RMS-Prop is a special version of AdaGrad in which the learning rate is scaled by an exponential
moving average of the squared gradients instead of the cumulative sum of squared gradients. RMS-Prop
basically combines momentum with AdaGrad.

$$E[g^2]^{(i)} = \gamma\, E[g^2]^{(i-1)} + (1-\gamma)\left(\nabla L\!\left(\mathbf{w}^{(i)}\right)\right)^2, \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta}{\sqrt{E[g^2]^{(i)} + \varepsilon}}\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

Advantages of RMS-Prop
1. In RMS-Prop the learning rate gets adjusted automatically, and it chooses a different learning
rate for each parameter.
Disadvantages of RMS-Prop
1. Slow learning.
AdaDelta
AdaDelta is an extension of AdaGrad; it tries to remove AdaGrad's aggressive,
monotonically decreasing learning rate and so eliminate the decaying learning rate problem. In
AdaDelta we do not need to set a default learning rate, as we take the ratio of the running
average over the previous time steps to the current gradient.

Advantages of AdaDelta
1. The main advantage of AdaDelta is that we do not need to set a default learning rate.
Disadvantages of AdaDelta
1. Computationally expensive.
Adam (Adaptive Moment Estimation)
It is a method that computes adaptive learning rates for each parameter. It stores both the
decaying average of the past gradients, similar to momentum, and the decaying
average of the past squared gradients, similar to RMS-Prop and AdaDelta. Thus, it
combines the advantages of both methods.

$$\mathbf{m}^{(i)} = \beta_1\, \mathbf{m}^{(i-1)} + (1-\beta_1)\, \nabla L\!\left(\mathbf{w}^{(i)}\right), \qquad \mathbf{v}^{(i)} = \beta_2\, \mathbf{v}^{(i-1)} + (1-\beta_2)\left(\nabla L\!\left(\mathbf{w}^{(i)}\right)\right)^2$$

$$\hat{\mathbf{m}} = \frac{\mathbf{m}^{(i)}}{1-\beta_1^{\,i}}, \qquad \hat{\mathbf{v}} = \frac{\mathbf{v}^{(i)}}{1-\beta_2^{\,i}}, \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta\, \hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \varepsilon}$$
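A plain-Python sketch of the Adam update following the standard formulation (the beta values below are the common defaults, an assumption here, and the toy gradient is for illustration):

import numpy as np

def grad_L(w):
    return 2.0 * w   # gradient of the toy loss L(w) = ||w||^2

w = np.array([5.0, -3.0])
m = np.zeros_like(w)   # decaying average of past gradients (momentum-like)
v = np.zeros_like(w)   # decaying average of past squared gradients (RMS-Prop-like)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for i in range(1, 501):
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** i)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** i)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(w)  # close to the minimum at (0, 0)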

Advantages of Adam
1. Easy to implement.
2. Computationally efficient.
3. Little memory requirements.
Linear regression with PyTorch
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

x_np = np.array([[10.0], [9.0], [3.0], [2.0]], dtype=np.float32)
y_np = np.array([[90.0], [80.0], [50.0], [30.0]], dtype=np.float32)

x_tensor = torch.from_numpy(x_np)
y_tensor = torch.from_numpy(y_np)

plt.figure(1)
plt.plot(x_np, y_np, '*')

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)  # one input feature, one output

    def forward(self, x):
        x = self.linear(x)
        return x

model = LinearRegression()

criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_tensor)
    # Compute Loss
    loss = criterion(y_pred, y_tensor)
    # Backward pass
    loss.backward()
    optimizer.step()

new_x_np = np.arange(0, 10.05, 0.1, dtype=np.float32)[:, None]
new_x_tensor = torch.from_numpy(new_x_np)
model.eval()
with torch.no_grad():
    y_pred_tensor = model(new_x_tensor)
y_pred = y_pred_tensor.numpy()
plt.figure(1)
plt.plot(new_x_np, y_pred, 'r-')

# least-squares fit with numpy, for comparison with the trained model
p = np.polyfit(x_np.flatten(), y_np.flatten(), 1)
y_pred2 = np.polyval(p, new_x_np)
plt.figure(1)
plt.plot(new_x_np, y_pred2, 'g-')
plt.show()  # display the figure when run as a script

PyTorch module torch.nn

Neural network layers and other graph building blocks:
https://pytorch.org/docs/stable/nn.html
PyTorch module torch.nn.functional – PyTorch functions

Activation functions (names ending with an underscore, e.g. relu_, are the in-place variants):

threshold threshold_
relu relu_
hardtanh hardtanh_
hardswish
relu6
elu elu_
selu
celu
leaky_relu leaky_relu_
prelu
rrelu rrelu_
glu
gelu
logsigmoid
hardshrink
tanhshrink
softsign
softplus
softmin
softmax
softshrink
gumbel_softmax
log_softmax
tanh
sigmoid
hardsigmoid
MNIST dataset

Images of size 28 x 28, or 784 values per image
Already split into train and test subsets
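A minimal sketch of loading MNIST, assuming the torchvision package is available (its datasets.MNIST class downloads the standard train/test split):

import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # 28 x 28 image -> tensor of shape (1, 28, 28)

train_set = datasets.MNIST('data', train=True, download=True, transform=to_tensor)
test_set = datasets.MNIST('data', train=False, download=True, transform=to_tensor)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)                           # torch.Size([64, 1, 28, 28])
print(images.view(images.size(0), -1).shape)  # flattened: torch.Size([64, 784])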
Cross entropy loss
Cross Entropy Loss/Negative Log Likelihood

This is the most common setting for classification problems. Cross-entropy loss increases
as the predicted probability diverges from the actual label.
Mathematical formulation (binary case):

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(p_i) + (1-y_i) \log(1-p_i) \,\right]$$

where $y_i$ is the actual label and $p_i$ is the predicted probability of class 1.
Notice that when the actual label is 1 (yi = 1), the second half of the function disappears, whereas
when the actual label is 0 (yi = 0), the first half is dropped. In short, we are just taking the
log of the predicted probability for the ground-truth class. An important aspect of
this is that cross-entropy loss heavily penalizes the predictions that are confident but
wrong.
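A small numeric check of the last point (the probabilities are made up): a confident wrong prediction costs far more than a mildly wrong one.

import numpy as np

def bce(y, p):
    # binary cross entropy for a single example with label y and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# ground-truth label is 1
print(bce(1, 0.9))   # ~0.105: confident and correct, small loss
print(bce(1, 0.4))   # ~0.916: mildly wrong
print(bce(1, 0.01))  # ~4.605: confident but wrong, heavily penalized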
Logistic Regression
Binary classification using neural networks and Cross Entropy Loss.

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

x_np = np.array([[10.0], [9.0], [3.0], [2.0]], dtype=np.float32)
y_np = np.array([[1], [1], [0], [0]], dtype=np.int64).flatten()

x_tensor = torch.from_numpy(x_np)
y_tensor = torch.from_numpy(y_np)

plt.figure(1)
plt.plot(x_np, y_np, '*')

class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(1, 2)  # two outputs: one logit per class

    def forward(self, x):
        x = self.linear(x)
        return x

model = LogisticRegression()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_tensor)
    # Compute Loss
    loss = criterion(y_pred, y_tensor)
    # Backward pass
    loss.backward()
    optimizer.step()

new_x_np = np.arange(0, 10.05, 0.1, dtype=np.float32)[:, None]
new_x_tensor = torch.from_numpy(new_x_np)
model.eval()
with torch.no_grad():
    y_pred_tensor = model(new_x_tensor)
y_pred = y_pred_tensor.numpy()
idx = np.argmax(y_pred, axis=1)  # predicted class for each input
plt.figure(1)
plt.plot(new_x_np, y_pred[:, 1], 'r-')  # unnormalized logit for class 1
plt.plot(new_x_np, idx, 'g-')           # predicted class labels
plt.show()  # display the figure when run as a script
