Lecture 5

NEURAL NETWORK TRAINING

For example, suppose we have a neural network with 2 trainable
parameters. We can express the loss of the neural network as
L(w0, w1, Xtrain, Ytrain), where w0 and w1 are the weights of the neural
network (trainable parameters) and Xtrain, Ytrain are the training data.
Because we will focus on the trainable parameters, from now on we will
write the function without the training data: L(w0, w1).
Training is the minimization of the loss with respect to the trainable parameters.
Gradient Descent

Gradient of L(w0, w1):

$$\nabla L(w_0, w_1) = \left( \frac{\partial L}{\partial w_0},\; \frac{\partial L}{\partial w_1} \right)$$

Gradient descent update, where $\eta$ is the learning rate:

$$\mathbf{w}^{(1)} = \mathbf{w}^{(0)} - \eta\, \nabla L\!\left(\mathbf{w}^{(0)}\right) \quad \text{(first iteration)}$$

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right) \quad \text{($i{+}1$ iteration)}$$

Stop when L changes only slightly between iterations or when the maximal number of iterations has been reached.
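As a minimal sketch of this loop in Python (the quadratic loss, learning rate, and tolerance below are illustrative assumptions, not from the lecture):

import numpy as np

def L(w):
    # toy loss with minimum at w = (3, -1), chosen only for illustration
    return (w[0] - 3.0)**2 + (w[1] + 1.0)**2

def grad_L(w):
    # analytic gradient of the toy loss
    return np.array([2.0 * (w[0] - 3.0), 2.0 * (w[1] + 1.0)])

w = np.zeros(2)        # w^(0), initial weights
eta = 0.1              # learning rate
for i in range(1000):  # maximal iterations number
    w_new = w - eta * grad_L(w)       # w^(i+1) = w^(i) - eta * grad L(w^(i))
    if abs(L(w_new) - L(w)) < 1e-12:  # stop when L has small changes
        w = w_new
        break
    w = w_new
print(w)  # approaches the minimum at (3, -1)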

When we have D trainable parameters, the gradient is

$$\nabla L = \left( \frac{\partial L}{\partial w_0},\; \ldots,\; \frac{\partial L}{\partial w_{D-1}} \right)$$

The number of operations is $O(D^2)$, because for $\nabla L$ we need to calculate D partial derivatives, and for every derivative the number of operations is proportional to the number of parameters.
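To see where the $O(D^2)$ count comes from, here is a hedged sketch of naive numerical differentiation (the dot-product loss is an assumed stand-in): each of the D partial derivatives needs an extra evaluation of the loss, and each loss evaluation itself touches all D parameters.

import numpy as np

def loss(w, x):
    # stand-in loss: one evaluation is O(D), it touches every parameter
    return (w @ x) ** 2

def numeric_gradient(w, x, h=1e-6):
    # D partial derivatives, each via one extra O(D) loss evaluation:
    # D derivatives * O(D) per evaluation = O(D^2) operations
    base = loss(w, x)
    g = np.zeros_like(w)
    for d in range(len(w)):
        w_shift = w.copy()
        w_shift[d] += h
        g[d] = (loss(w_shift, x) - base) / h
    return g

D = 4
w = np.ones(D)
x = np.arange(1.0, D + 1.0)
print(numeric_gradient(w, x))  # close to the analytic 2 * (w @ x) * x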
Backpropagation
The training of Neural Networks (NN) with gradient-based optimization algorithms is
organized in two major steps:
Forward Propagation - here we calculate the output of the NN given the inputs
Backward Propagation - here we calculate the gradients of the output with respect to the
weights.
The first step is usually straightforward to understand and to calculate. The general idea
behind the second step is also clear: we need gradients to know the direction in which to make
steps in the gradient descent optimization algorithm.
Although backpropagation is not a new idea (it was developed in the 1970s), answering the
question of "how" these gradients are calculated gives some people a hard time. One has to
reach for some calculus, especially partial derivatives and the chain rule, to fully
understand the working principles of backpropagation.
Originally backpropagation was developed to differentiate complex nested functions.
However, it became highly popular thanks to the machine learning community and is now
the cornerstone of Neural Networks.
We'll go to the roots and solve an exemplary problem step by step by hand, then we'll
implement it in Python using PyTorch, and finally we're going to compare both results to
make sure everything works fine.
Computational Graph
Let's assume we want to perform the following set of operations to get our result r:

$$u = x^2, \qquad v = u + y, \qquad w = z \cdot v, \qquad r = w^2$$

If you substitute all the individual node equations into the final one, you'll find that we're
solving the following equation:

$$r = \left(z\,(x^2 + y)\right)^2$$

Being more specific, we want to calculate its value and its partial derivatives
$\partial r / \partial x$, $\partial r / \partial y$, $\partial r / \partial z$. So in this use case it is a pure mathematical task.
Forward Pass
To make this concept more tangible let’s take some numbers for our calculation. For
example:
x=1
y=2
z=4
Here we simply substitute our inputs into the equations. The results of the individual node steps are:

$$u = x^2 = 1, \qquad v = u + y = 3, \qquad w = z \cdot v = 12, \qquad r = w^2 = 144$$

The final output is r = 144.
Backward Pass
Now it's time to perform backpropagation, also known under the fancier name
"backward propagation of errors" or even "reverse mode of automatic differentiation".
To calculate the gradients with respect to each of the 3 variables, we have to calculate the partial
derivatives at each node in the graph (local gradients). Below we show how to do it for
the last two nodes/steps.
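For the last two nodes, $r = w^2$ and $w = z \cdot v$, the local gradients follow directly from the definitions and the forward-pass values:

$$\frac{\partial r}{\partial w} = 2w = 24, \qquad \frac{\partial w}{\partial v} = z = 4, \qquad \frac{\partial w}{\partial z} = v = 3$$

The remaining local gradients are $\partial v / \partial u = 1$, $\partial v / \partial y = 1$ and $\partial u / \partial x = 2x = 2$.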

After completing the calculations of the local gradients, every edge of the computational graph carries its local gradient.
Now, to calculate the final gradients we have to use the chain rule. In
practice this means we have to multiply all partial derivatives along the path from the
output to the variable of interest:
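$$\frac{\partial r}{\partial x} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial v} \cdot \frac{\partial v}{\partial u} \cdot \frac{\partial u}{\partial x} = 24 \cdot 4 \cdot 1 \cdot 2 = 192$$

$$\frac{\partial r}{\partial y} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial v} \cdot \frac{\partial v}{\partial y} = 24 \cdot 4 \cdot 1 = 96$$

$$\frac{\partial r}{\partial z} = \frac{\partial r}{\partial w} \cdot \frac{\partial w}{\partial z} = 24 \cdot 3 = 72$$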

Now we can use these gradients for whatever we want, e.g. optimization with gradient
descent (SGD, Adam, etc.).
Implementation in PyTorch

There are numerous Neural Network frameworks in various languages in which you can
implement such computations and make a computer calculate the gradients for you. Below,
we'll demonstrate how to use the Python PyTorch library to solve our exemplary task.

import torch

x = torch.tensor(1.0, requires_grad=True)
y = torch.tensor(2.0, requires_grad=True)
z = torch.tensor(4.0, requires_grad=True)
# forward pass
u = x**2
v = u+y
w = z*v
r = w**2
print('r=', r.item())
# backward pass
r.backward()
print('dr/dx = ', x.grad.item())
print('dr/dy = ', y.grad.item())
print('dr/dz = ', z.grad.item())
The output of this code is:

r= 144.0
dr/dx =  192.0
dr/dy =  96.0
dr/dz =  72.0

The results from PyTorch are identical to the ones we calculated by hand.

Some notes:
- In PyTorch everything is a tensor, even if it contains only a single value.
- In PyTorch, when you specify a variable that is subject to gradient-based
optimization, you have to pass the argument requires_grad=True. Otherwise, it will be
treated as a fixed input.
- With this implementation, all backpropagation calculations are performed simply by
calling the method r.backward() (see the sketch below).
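A short sketch of the last two notes (the variable names here are illustrative): a tensor created without requires_grad=True is treated as a fixed input and receives no gradient.

import torch

a = torch.tensor(1.0, requires_grad=True)  # trainable: gradient is tracked
b = torch.tensor(2.0)                      # fixed input: no gradient tracked

out = (a * b) ** 2
out.backward()                             # backpropagation in one call

print(a.grad)  # tensor(8.) -> d(out)/da = 2 * a * b**2 = 8
print(b.grad)  # None -> b was treated as a fixed input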
OPTIMIZERS
What is an optimizer?
Optimizers are algorithms or methods used to minimize an error function (loss function) or
to maximize the efficiency of production. Optimizers are mathematical functions that
depend on the model's learnable parameters, i.e. weights and biases. Optimizers define
how to change the weights and learning rate of a neural network to reduce the losses.
Let's learn about different types of optimizers and how exactly they work to minimize the
loss function.
An optimizer is necessary to train a neural network.
Vanilla Gradient Descent
Gradient descent is an optimization algorithm based on a convex function; it tweaks its
parameters iteratively to minimize a given function to its local minimum. Gradient
descent iteratively reduces a loss function by moving in the direction opposite to that of
steepest ascent. It relies on the derivatives of the loss function to find minima. It
uses the data of the entire training set to calculate the gradient of the cost function with
respect to the parameters, which requires a large amount of memory and slows down the process.

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$
Advantages of Gradient Descent
1. Easy to understand.
2. Easy to implement.
Disadvantages of Gradient Descent
1. Because this method calculates the gradient for the entire data set in one update, the
calculation is very slow.
2. It requires large memory and it is computationally expensive.
Stochastic Gradient Descent
It is a variant of Gradient Descent. It updates the model parameters one example at a time. If
the dataset has 10K examples, SGD will update the model parameters 10K times (once per example).

Advantages of Stochastic Gradient Descent
1. Frequent updates of model parameters.
2. Requires less memory.
3. Allows the use of large data sets, as it has to process only one example at a time.
Disadvantages of Stochastic Gradient Descent
1. The frequent updates can also result in noisy gradients, which may cause the error to increase
instead of decreasing.
2. High variance.
3. Frequent updates are computationally expensive.
Mini-Batch Gradient Descent
It is a combination of the concepts of SGD and batch gradient descent. It simply splits the
training dataset into small batches and performs an update for each of those batches. This
creates a balance between the robustness of stochastic gradient descent and the efficiency
of batch gradient descent. It can reduce the variance of the parameter updates, and
the convergence is more stable. It splits the data set into batches of between 50 and 256
examples, chosen at random.
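A hedged numpy sketch of mini-batch gradient descent on a toy linear least-squares problem (the dataset, model, and batch size are assumptions for illustration). Setting batch_size to the full dataset size gives batch gradient descent, and batch_size = 1 gives SGD.

import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))          # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)

w = np.zeros(3)
eta, batch_size = 0.1, 64               # batch size within the 50-256 range
for epoch in range(20):
    idx = rng.permutation(len(X))       # random batches each epoch
    for start in range(0, len(X), batch_size):
        batch = idx[start:start + batch_size]
        Xb, yb = X[batch], y[batch]
        grad = 2.0 * Xb.T @ (Xb @ w - yb) / len(batch)  # MSE gradient on the batch
        w -= eta * grad                 # one update per mini-batch
print(w)  # close to true_w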

Advantages of Mini Batch Gradient Descent:
1. It leads to more stable convergence.
2. More efficient gradient calculations.
3. Requires less memory.
Disadvantages of Mini Batch Gradient Descent:
1. Mini-batch gradient descent does not guarantee good convergence.
2. If the learning rate is too small, the convergence rate will be slow. If it is too large, the
loss function will oscillate around or even diverge from the minimum value.
SGD with Momentum
SGD with Momentum is a stochastic optimization method that adds a momentum term to
regular stochastic gradient descent. Momentum simulates the inertia of a moving object:
the direction of the previous update is retained to a certain extent during
the current update, while the current gradient is used to fine-tune the final update direction.
In this way, stability is increased to a certain extent, learning is faster, and there is also
some ability to escape local optima.

SGD:

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

SGD with Momentum:

$$\mathbf{v}^{(i+1)} = \gamma\, \mathbf{v}^{(i)} + \eta\, \nabla L\!\left(\mathbf{w}^{(i)}\right), \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \mathbf{v}^{(i+1)}$$

where $\gamma$ is the momentum coefficient.
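A minimal sketch of the momentum update from the formulas above, on an assumed toy loss L(w) = w0^2 + 10*w1^2 (the learning rate and momentum values are typical choices, not from the lecture):

import numpy as np

def grad_L(w):
    # gradient of the toy loss L(w) = w[0]**2 + 10 * w[1]**2
    return np.array([2.0 * w[0], 20.0 * w[1]])

w = np.array([5.0, 5.0])
v = np.zeros_like(w)      # accumulated update direction
eta, gamma = 0.05, 0.9    # learning rate and momentum
for i in range(300):
    v = gamma * v + eta * grad_L(w)  # v^(i+1) = gamma * v^(i) + eta * grad L(w^(i))
    w = w - v                        # w^(i+1) = w^(i) - v^(i+1)
print(w)  # converges to the minimum at (0, 0)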
Advantages of SGD with momentum:
1. Momentum helps to reduce the noise.
2. An exponentially weighted average is used to smoothen the curve.
Disadvantage of SGD with momentum:
1. An extra hyperparameter is added.
AdaGrad (Adaptive Gradient Descent)
In all the algorithms discussed previously, the learning rate remains constant. The
intuition behind AdaGrad is to use a different learning rate for each parameter, in every
neuron of every hidden layer, adapted across iterations.

$$\mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta}{\sqrt{G^{(i)} + \varepsilon}}\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

where

$$G^{(i)} = \sum_{k=0}^{i} \left(\nabla L\!\left(\mathbf{w}^{(k)}\right)\right)^2$$

is the element-wise running sum of squared gradients and $\varepsilon$ is a small constant that prevents division by zero.
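A sketch of the AdaGrad update with a toy gradient (the learning rate and eps are assumed values): each parameter accumulates its own sum of squared gradients, so each gets its own effective learning rate.

import numpy as np

def grad_L(w):
    return 2.0 * w   # gradient of the toy loss L(w) = ||w||^2

w = np.array([5.0, 0.5])
G = np.zeros_like(w)     # per-parameter running sum of squared gradients
eta, eps = 0.5, 1e-8
for i in range(100):
    g = grad_L(w)
    G += g ** 2                       # grows monotonically
    w -= eta / np.sqrt(G + eps) * g   # per-parameter effective learning rate
print(w)  # each coordinate shrinks at its own adaptive rate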

Advantages of AdaGrad:
1. The learning rate changes adaptively with iterations.
2. It is able to train on sparse data as well.
Disadvantage of AdaGrad:
1. If the neural network is deep, the learning rate becomes a very small number, which
causes the dead neuron problem.
RMS-Prop (Root Mean Square Propagation)
RMS-Prop is a special version of AdaGrad in which the learning rate is scaled by an exponential
moving average of the squared gradients instead of the cumulative sum of squared gradients. RMS-Prop
basically combines momentum with AdaGrad.

$$E[g^2]^{(i)} = \gamma\, E[g^2]^{(i-1)} + (1-\gamma)\left(\nabla L\!\left(\mathbf{w}^{(i)}\right)\right)^2, \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta}{\sqrt{E[g^2]^{(i)} + \varepsilon}}\, \nabla L\!\left(\mathbf{w}^{(i)}\right)$$

Advantages of RMS-Prop
1. In RMS-Prop the learning rate gets adjusted automatically, and it chooses a different learning
rate for each parameter.
Disadvantages of RMS-Prop
1. Slow learning.
AdaDelta
AdaDelta is an extension of AdaGrad; it tries to remove AdaGrad's aggressive,
monotonically decreasing learning rate and so eliminate the decaying learning rate problem. In
AdaDelta we do not need to set a default learning rate, as we take the ratio of the running
average over the previous time steps to the current gradient.

Advantages of AdaDelta
1. The main advantage of AdaDelta is that we do not need to set a default learning rate.
Disadvantages of AdaDelta
1. Computationally expensive.
Adam (Adaptive Moment Estimation)
It is a method that computes adaptive learning rates for each parameter. It stores both the
decaying average of the past gradients, similar to momentum, and the decaying
average of the past squared gradients, similar to RMS-Prop and AdaDelta. Thus, it
combines the advantages of both methods.

$$\mathbf{m}^{(i)} = \beta_1\, \mathbf{m}^{(i-1)} + (1-\beta_1)\, \nabla L\!\left(\mathbf{w}^{(i)}\right), \qquad \mathbf{v}^{(i)} = \beta_2\, \mathbf{v}^{(i-1)} + (1-\beta_2)\left(\nabla L\!\left(\mathbf{w}^{(i)}\right)\right)^2$$

$$\hat{\mathbf{m}} = \frac{\mathbf{m}^{(i)}}{1-\beta_1^{\,i}}, \qquad \hat{\mathbf{v}} = \frac{\mathbf{v}^{(i)}}{1-\beta_2^{\,i}}, \qquad \mathbf{w}^{(i+1)} = \mathbf{w}^{(i)} - \frac{\eta\, \hat{\mathbf{m}}}{\sqrt{\hat{\mathbf{v}}} + \varepsilon}$$
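A plain-Python sketch of the Adam update following the standard formulation (the beta values below are the common defaults, an assumption here, and the toy gradient is for illustration):

import numpy as np

def grad_L(w):
    return 2.0 * w   # gradient of the toy loss L(w) = ||w||^2

w = np.array([5.0, -3.0])
m = np.zeros_like(w)   # decaying average of past gradients (momentum-like)
v = np.zeros_like(w)   # decaying average of past squared gradients (RMS-Prop-like)
eta, beta1, beta2, eps = 0.1, 0.9, 0.999, 1e-8

for i in range(1, 501):
    g = grad_L(w)
    m = beta1 * m + (1 - beta1) * g
    v = beta2 * v + (1 - beta2) * g ** 2
    m_hat = m / (1 - beta1 ** i)   # bias correction for the zero initialization
    v_hat = v / (1 - beta2 ** i)
    w -= eta * m_hat / (np.sqrt(v_hat) + eps)
print(w)  # close to the minimum at (0, 0)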

Advantages of Adam
1. Easy to implement.
2. Computationally efficient.
3. Little memory requirements.
Linear regression with PyTorch
import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

x_np = np.array([[10.0], [9.0], [3.0], [2.0]], dtype=np.float32)
y_np = np.array([[90.0], [80.0], [50.0], [30.0]], dtype=np.float32)

x_tensor = torch.from_numpy(x_np)
y_tensor = torch.from_numpy(y_np)

plt.figure(1)
plt.plot(x_np, y_np, '*')

class LinearRegression(nn.Module):
    def __init__(self):
        super(LinearRegression, self).__init__()
        self.linear = nn.Linear(1, 1)  # one input feature, one output

    def forward(self, x):
        x = self.linear(x)
        return x

model = LinearRegression()

criterion = nn.MSELoss()

optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

for epoch in range(50):
    model.train()
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_tensor)
    # Compute Loss
    loss = criterion(y_pred, y_tensor)
    # Backward pass
    loss.backward()
    optimizer.step()

new_x_np = np.arange(0, 10.05, 0.1, dtype=np.float32)[:, None]
new_x_tensor = torch.from_numpy(new_x_np)
model.eval()
with torch.no_grad():
    y_pred_tensor = model(new_x_tensor)
y_pred = y_pred_tensor.numpy()
plt.figure(1)
plt.plot(new_x_np, y_pred, 'r-')

# least-squares fit with numpy, for comparison with the trained model
p = np.polyfit(x_np.flatten(), y_np.flatten(), 1)
y_pred2 = np.polyval(p, new_x_np)
plt.figure(1)
plt.plot(new_x_np, y_pred2, 'g-')
plt.show()  # display the figure when run as a script

PyTorch module torch.nn

Neural network layers and other graph building blocks:
https://pytorch.org/docs/stable/nn.html
PyTorch module torch.nn.functional – PyTorch functions

Activation functions (names ending with an underscore, e.g. relu_, are the in-place variants):

threshold threshold_
relu relu_
hardtanh hardtanh_
hardswish
relu6
elu elu_
selu
celu
leaky_relu leaky_relu_
prelu
rrelu rrelu_
glu
gelu
logsigmoid
hardshrink
tanhshrink
softsign
softplus
softmin
softmax
softshrink
gumbel_softmax
log_softmax
tanh
sigmoid
hardsigmoid
MNIST dataset

Images of size 28 x 28, or 784 values per image
Already split into train and test subsets
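A minimal sketch of loading MNIST, assuming the torchvision package is available (its datasets.MNIST class downloads the standard train/test split):

import torch
from torchvision import datasets, transforms

to_tensor = transforms.ToTensor()  # 28 x 28 image -> tensor of shape (1, 28, 28)

train_set = datasets.MNIST('data', train=True, download=True, transform=to_tensor)
test_set = datasets.MNIST('data', train=False, download=True, transform=to_tensor)

train_loader = torch.utils.data.DataLoader(train_set, batch_size=64, shuffle=True)

images, labels = next(iter(train_loader))
print(images.shape)                           # torch.Size([64, 1, 28, 28])
print(images.view(images.size(0), -1).shape)  # flattened: torch.Size([64, 784])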
Cross entropy loss
Cross Entropy Loss/Negative Log Likelihood

This is the most common setting for classification problems. Cross-entropy loss increases
as the predicted probability diverges from the actual label.
Mathematical formulation (binary case):

$$L = -\frac{1}{N} \sum_{i=1}^{N} \left[\, y_i \log(p_i) + (1-y_i) \log(1-p_i) \,\right]$$

where $y_i$ is the actual label and $p_i$ is the predicted probability of class 1.
Notice that when the actual label is 1 (yi = 1), the second half of the function disappears, whereas
when the actual label is 0 (yi = 0), the first half is dropped. In short, we are just taking the
log of the predicted probability for the ground-truth class. An important aspect of
this is that cross-entropy loss heavily penalizes the predictions that are confident but
wrong.
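A small numeric check of the last point (the probabilities are made up): a confident wrong prediction costs far more than a mildly wrong one.

import numpy as np

def bce(y, p):
    # binary cross entropy for a single example with label y and predicted probability p
    return -(y * np.log(p) + (1 - y) * np.log(1 - p))

# ground-truth label is 1
print(bce(1, 0.9))   # ~0.105: confident and correct, small loss
print(bce(1, 0.4))   # ~0.916: mildly wrong
print(bce(1, 0.01))  # ~4.605: confident but wrong, heavily penalized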
Logistic Regression
Binary classification using neural networks and Cross Entropy Loss.

import numpy as np
import matplotlib.pyplot as plt
import torch
import torch.nn as nn
import torch.nn.functional as F

x_np = np.array([[10.0], [9.0], [3.0], [2.0]], dtype=np.float32)
y_np = np.array([[1], [1], [0], [0]], dtype=np.int64).flatten()

x_tensor = torch.from_numpy(x_np)
y_tensor = torch.from_numpy(y_np)

plt.figure(1)
plt.plot(x_np, y_np, '*')

class LogisticRegression(nn.Module):
    def __init__(self):
        super(LogisticRegression, self).__init__()
        self.linear = nn.Linear(1, 2)  # two outputs: one logit per class

    def forward(self, x):
        x = self.linear(x)
        return x

model = LogisticRegression()

criterion = torch.nn.CrossEntropyLoss()

optimizer = torch.optim.Adam(model.parameters(), lr=0.01)

for epoch in range(500):
    model.train()
    optimizer.zero_grad()
    # Forward pass
    y_pred = model(x_tensor)
    # Compute Loss
    loss = criterion(y_pred, y_tensor)
    # Backward pass
    loss.backward()
    optimizer.step()

new_x_np = np.arange(0, 10.05, 0.1, dtype=np.float32)[:, None]
new_x_tensor = torch.from_numpy(new_x_np)
model.eval()
with torch.no_grad():
    y_pred_tensor = model(new_x_tensor)
y_pred = y_pred_tensor.numpy()
idx = np.argmax(y_pred, axis=1)  # predicted class for each input
plt.figure(1)
plt.plot(new_x_np, y_pred[:, 1], 'r-')  # unnormalized logit for class 1
plt.plot(new_x_np, idx, 'g-')           # predicted class labels
plt.show()  # display the figure when run as a script
