Module 4
Data Analytics
Topics in Module-4: Optimization
• Gradient descent
• Momentum
• Adagrad
• RMSprop
• Adam
• AMSGrad
Topics in Module-4: Optimization
Animation of 5 gradient descent methods on a surface: gradient descent (cyan), momentum (magenta), AdaGrad (white), RMSProp (green), Adam (blue). The left well is the global minimum; the right well is a local minimum.
Source: https://towardsdatascience.com/a-visual-explanation-of-gradient-descent-methods-momentum-adagrad-rmsprop-adam-f898b102325c
Optimization?
• Three components of an optimization problem: the objective function (to be minimized or maximized), the decision variables, and the constraints.
• Optimization is considered one of the three pillars of data science; linear algebra and statistics are the other two.
Module-4: Introduction to Optimization
Terminologies in Optimization
Working of Optimization: packing a lunchbox
• Goal: Pack the lunchbox with a variety of tasty and nutritious items so that you have a satisfying meal at lunchtime.
• Constraints: There are constraints, or limitations, to consider. For
example, the lunchbox has a fixed size, and you may want to include a
variety of items like a sandwich, fruits, vegetables, and a drink.
• Optimization: Optimization involves figuring out the best way to
arrange and pack these items to make the most of the limited space.
You might consider the size and shape of each item, how they fit
together, and how to use the available space efficiently.
Working of Optimization: packing a lunchbox
• Trade-offs: Sometimes, you might have to make trade-offs. For
instance, if you want to include a larger sandwich, you may need to
sacrifice space for other items.
• Optimization involves finding the right balance based on your
priorities.
• Outcome: The optimized lunchbox is the one that allows you to fit the
most satisfying and nutritious combination of items within the given
constraints.
• In this example, the lunchbox represents a problem or a task, and
optimization is the process of arranging and selecting items to achieve
the best outcome within the given limits.
• The concept of optimization applies to many real-world scenarios, from organizing your room to planning a schedule, or even solving more complex problems in fields like mathematics, engineering, and computer science.
Optimization – Scenario 2
• Requirement: Minimizing Cost
• Problem: A pizza delivery business wants to minimize the cost of delivering pizzas to different locations in a city.
• Objective: Minimize the total cost of delivering pizzas.
• Constraints:
• Each delivery has a fixed cost associated with it.
• There is a maximum distance a delivery person can travel in a given time.
• Each delivery has a time window during which it must be completed.
• Optimization Steps:
• Identify Costs: Understand the cost associated with each delivery, including
travel time, fuel, and other expenses.
• Define Constraints: Consider the limitations, such as the maximum distance and
time window for each delivery.
• Optimize Routes: Use optimization algorithms to find the most efficient routes
for the delivery persons, minimizing the total cost while adhering to constraints.
• Outcome: The optimized solution would provide the most cost-effective
way to deliver pizzas, ensuring that deliveries are made within the
specified time windows and without exceeding the maximum distance.
Optimization – Scenario 3
• Requirement: Maximizing Benefit
• Problem: A farmer with a limited amount of land wants to maximize the crop yield to get the highest profit.
• Objective: Maximize the total crop yield.
• Constraints:
• Limited land area available for cultivation.
• Each crop requires specific conditions (e.g., sunlight, water) and has a growth
period.
• Optimization Steps:
• Understand Crop Characteristics: Know the growth requirements and yield
potential of different crops.
• Consider Land Constraints: Take into account the limited land area available
for cultivation.
• Optimize Crop Selection: Use optimization techniques to choose the
combination of crops and their arrangement that maximizes the total yield
within the available land and time constraints.
• Outcome: The optimized solution would provide the farmer with the
most profitable combination of crops to plant, considering the
available land and the specific requirements of each crop.
Module-4: Introduction to Optimization
What is Optimization?
• Optimization means choosing the best element from a set of available alternatives: solving problems in which one seeks to minimize or maximize a real-valued function.
• In machine learning, optimization is the iterative process of training a model so that a cost function is driven to its minimum (or maximum) value.
Module-4: Introduction to Optimization
Effect of learning rate on Optimization
• The learning rate (α) is one such hyper-parameter: it defines the size of the adjustment made to the weights of our network with respect to the loss gradient.
• It determines how fast or slow we move towards the optimal weights.
• If the learning rate is very large, we may skip over the optimal solution.
• If it is too small, we will need too many iterations to converge to the best values, so using a good learning rate is crucial (a small sketch follows).
• Learning rates of 0.01 and 0.011, by contrast, are unlikely to yield vastly different results.
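Below is a minimal sketch (illustrative; the cost J(θ) = θ², the starting point, and the learning rates are assumptions, not from the slides) showing how the learning rate affects convergence:

# Effect of the learning rate on gradient descent for J(theta) = theta^2,
# whose gradient is dJ/dtheta = 2*theta.

def gradient_descent(lr, theta=5.0, steps=20):
    for _ in range(steps):
        grad = 2 * theta            # dJ/dtheta
        theta = theta - lr * grad   # update: theta <- theta - lr * grad
    return theta

for lr in (0.01, 0.4, 0.99, 1.1):
    print(f"learning rate {lr}: theta after 20 steps = {gradient_descent(lr):.4f}")
# 0.01 converges slowly, 0.4 converges quickly, 0.99 oscillates around the
# minimum while shrinking slowly, and 1.1 overshoots more each step and diverges.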
Module-4: Introduction to Optimization
Basic Optimization Algorithm and Example

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1

Iterative update: x_{k+1} = x_k + α_k p_k, or equivalently Δx_k = x_{k+1} − x_k = α_k p_k,
where p_k is the search direction and α_k is the learning rate. For steepest descent, p_k = −g_k = −∇F(x_k).

Starting point x_0 = [0.5, 0.5]^T, learning rate α = 0.1.

Gradient: ∇F(x) = [∂F/∂x_1, ∂F/∂x_2]^T = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T,
so g_0 = ∇F(x)|_{x=x_0} = [3, 3]^T.

First two iterations:
x_1 = x_0 − α g_0 = [0.5, 0.5]^T − 0.1·[3, 3]^T = [0.2, 0.2]^T
x_2 = x_1 − α g_1 = [0.2, 0.2]^T − 0.1·[1.8, 1.2]^T = [0.02, 0.08]^T

[Figure: contour plot of F(x) showing the step α_k p_k from x_k to x_{k+1}.]
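A short NumPy sketch (illustrative) reproducing the two steepest-descent iterations above:

import numpy as np

# Steepest descent on F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1 (the slide's example).
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

alpha = 0.1
x = np.array([0.5, 0.5])
for k in range(2):
    g = grad_F(x)
    x = x - alpha * g                  # x_{k+1} = x_k - alpha * g_k  (p_k = -g_k)
    print(f"x_{k + 1} = {x}")
# Prints x_1 = [0.2 0.2] and x_2 = [0.02 0.08], matching the worked example.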
Module-4: Introduction to Optimization
For a quadratic function the gradient is linear in x: ∇F(x) = A x + d.

The steepest-descent update then becomes
x_{k+1} = x_k − α g_k = x_k − α (A x_k + d), i.e. x_{k+1} = [I − αA] x_k − α d.

Stability is determined by the eigenvalues of the matrix [I − αA]: if z_i is an eigenvector of A with eigenvalue λ_i, then
[I − αA] z_i = z_i − α A z_i = z_i − α λ_i z_i = (1 − α λ_i) z_i.

Stability requirement: |1 − α λ_i| < 1 for all i, i.e. α < 2/λ_i, so α < 2/λ_max.
Module-4: Introduction to Optimization
For the example function, the Hessian is A = [[2, 2], [2, 4]] with λ_max ≈ 5.24, so the stability limit is

α < 2/λ_max = 2/5.24 ≈ 0.38

α = 0.37 (just below the limit: the iteration converges)    α = 0.39 (just above the limit: the iteration diverges)

[Figure: two contour plots over x_1, x_2 ∈ [−2, 2] showing the trajectory converging for α = 0.37 and diverging for α = 0.39.]
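A brief NumPy sketch (illustrative) checking this stability bound numerically: with λ_max ≈ 5.24, α = 0.37 converges while α = 0.39 diverges:

import numpy as np

# For the quadratic F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1, the gradient is A x + d.
A = np.array([[2.0, 2.0], [2.0, 4.0]])
d = np.array([1.0, 0.0])

print("stability limit 2/lambda_max =", 2 / np.linalg.eigvalsh(A).max())   # ~0.38

for alpha in (0.37, 0.39):
    x = np.array([0.5, 0.5])
    for _ in range(100):
        x = x - alpha * (A @ x + d)        # x_{k+1} = [I - alpha*A] x_k - alpha*d
    print(f"alpha = {alpha}: x after 100 steps = {x.round(3)}")
# alpha = 0.37 ends near the minimum (-1, 0.5); alpha = 0.39 keeps growing.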
Module-4 Topic-1: Gradient descent
Gradient descent optimization
The gradient descent update rule is
θ = θ − α · ∂J(θ)/∂θ
where θ is the parameter we wish to update, ∂J/∂θ is the partial derivative telling us the rate of change of the error with respect to θ, and α is the learning rate. J represents the cost function, and there are multiple ways to calculate this cost; depending on how the cost is computed, we get different variants of gradient descent.
Module-4 Topic-1: Gradient descent
Gradient descent optimization
Setting the gradient of the quadratic approximation to zero gives
g_k + A_k Δx_k = 0  ⇒  Δx_k = −A_k^{−1} g_k,
so the update (Newton's method) is
x_{k+1} = x_k − A_k^{−1} g_k.
Module-4 Topic-1: Gradient descent
Example

F(x) = x_1^2 + 2 x_1 x_2 + 2 x_2^2 + x_1,   x_0 = [0.5, 0.5]^T

∇F(x) = [2x_1 + 2x_2 + 1, 2x_1 + 4x_2]^T,   g_0 = ∇F(x)|_{x=x_0} = [3, 3]^T

Hessian: A = [[2, 2], [2, 4]]

x_1 = x_0 − A^{−1} g_0 = [0.5, 0.5]^T − [[1, −0.5], [−0.5, 0.5]]·[3, 3]^T = [0.5, 0.5]^T − [1.5, 0]^T = [−1, 0.5]^T

[Figure: contour plot over x_1, x_2 ∈ [−2, 2] showing the single step from x_0 to x_1 = (−1, 0.5).]
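An illustrative NumPy sketch of this step (x_{k+1} = x_k − A^{−1} g_k): because the example cost is quadratic, a single step lands exactly on the minimum:

import numpy as np

# One Newton-style step x_1 = x_0 - A^{-1} g_0 on the slide's quadratic, where A is
# the (constant) Hessian of F and g_0 the gradient at x_0.
A = np.array([[2.0, 2.0], [2.0, 4.0]])               # Hessian of F
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

x0 = np.array([0.5, 0.5])
x1 = x0 - np.linalg.solve(A, grad_F(x0))             # solve A * step = g_0 instead of inverting A
print(x1)                                            # [-1.   0.5], the minimum, reached in one step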
Module-4- Topic-2 Variants of gradient descent
Variants of gradient descent
1. Batch gradient descent
• Batch gradient descent, also called vanilla gradient descent, is the simplest variant of gradient descent.
• In batch gradient descent, the entire training dataset is used to compute the gradients of the cost
function with respect to the model parameters in each iteration.
• This can be computationally expensive for large datasets, but it guarantees convergence to a local
minimum of the cost function.
Module-4- Topic-2 Variants of gradient descent
Variants of gradient descent
2. Stochastic Gradient Descent (SGD)
• In stochastic gradient descent, the gradients are computed on a single randomly selected training example in each iteration.
• Each update is very cheap but noisy, which is why mini-batch gradient descent (next) is often preferred as a compromise.
Module-4- Topic-2 Variants of gradient descent
Variants of gradient descent
3. Mini-batch Gradient Descent
• Mini-batch gradient descent is a compromise between batch gradient
descent and stochastic gradient descent.
• In mini-batch gradient descent, the gradients are computed on a small
random subset of the training dataset, typically between 10 and 1000
examples, called a mini-batch.
• This reduces the computational cost of the algorithm compared to batch gradient descent, while also
reducing the variance of the updates compared to SGD.
• Mini-batch gradient descent is widely used in deep learning because it strikes a good balance between convergence speed and stability (a sketch comparing the three variants follows).
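A compact sketch (illustrative; the data, model, learning rate, and batch sizes are assumptions, not from the slides) showing how batch, stochastic, and mini-batch gradient descent differ only in how many examples are used per update, here for simple linear regression:

import numpy as np

# Compare batch, mini-batch, and stochastic gradient descent on linear regression.
rng = np.random.default_rng(0)
X = rng.normal(size=(1000, 3))                      # toy inputs
true_w = np.array([1.0, -2.0, 0.5])
y = X @ true_w + 0.1 * rng.normal(size=1000)        # targets with a little noise

def gradient_descent(batch_size, lr=0.05, epochs=100):
    """batch_size = len(X): batch GD; = 1: stochastic GD; in between: mini-batch GD."""
    w = np.zeros(3)
    n = len(X)
    for _ in range(epochs):
        order = rng.permutation(n)                  # shuffle each epoch
        for start in range(0, n, batch_size):
            b = order[start:start + batch_size]
            err = X[b] @ w - y[b]
            grad = X[b].T @ err / len(b)            # gradient of the mean squared error
            w -= lr * grad
    return w

for bs in (1000, 32, 1):                            # batch, mini-batch, stochastic
    print(f"batch_size={bs}: w = {gradient_descent(bs).round(2)}")
# All three should land near the true weights [1, -2, 0.5]; they differ in the
# cost per update and in how noisy the individual updates are.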
Module-4- Topic-2 Variants of gradient descent
Variants of gradient descent
4. Nesterov Accelerated Gradient (NAG)
• Nesterov accelerated gradient is an extension of momentum gradient descent that takes into account
the future gradient values when computing the momentum term.
• This helps to reduce overshooting and can lead to faster convergence than momentum gradient descent (a sketch of the look-ahead update follows).
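An illustrative sketch of the Nesterov update: the gradient is evaluated at the look-ahead point x − βv rather than at x (β, the learning rate, and the reused quadratic cost are assumptions):

import numpy as np

# Nesterov accelerated gradient: evaluate the gradient at the "look-ahead" point
# (x - beta * v) instead of at x, then apply a momentum-style update.
def grad_F(x):                           # gradient of the earlier quadratic example
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, beta = 0.02, 0.9
x = np.array([0.5, 0.5])
v = np.zeros(2)
for _ in range(200):
    g_lookahead = grad_F(x - beta * v)   # anticipatory gradient
    v = beta * v + lr * g_lookahead
    x = x - v
print(x)                                 # approaches the minimum at [-1, 0.5]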
Module-4- Topic-3 Momentum
Variants of gradient descent
5. Momentum Gradient Descent
• Momentum gradient descent is a variant of gradient descent
that adds a momentum term to the update rule.
• The momentum term accumulates the gradient values over
time and dampens the oscillations in the cost function, leading
to faster convergence. This is particularly useful in cases where
the cost function has a lot of noise or curvature, which can
cause traditional gradient descent to get stuck in local minima (a sketch of the update rule follows).
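A minimal sketch of the momentum update (illustrative; the momentum coefficient β = 0.9, the learning rate, and the quadratic cost reused from the earlier example are assumptions):

import numpy as np

# Momentum gradient descent: accumulate an exponentially weighted sum of past
# gradients in v, then step along v. (One common formulation; some texts fold
# the learning rate into v instead.)
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, beta = 0.02, 0.9
x = np.array([0.5, 0.5])
v = np.zeros(2)
for _ in range(200):
    v = beta * v + grad_F(x)        # momentum term dampens oscillations
    x = x - lr * v
print(x)                            # approaches the minimum at [-1, 0.5]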
Module-4- Topic-4 Adagrad
Variants of gradient descent
6. Adagrad
• Adagrad is a variant of gradient descent that adapts the
learning rate for each parameter based on its historical
gradient values.
• Parameters with large gradients have their learning rates
reduced, while parameters with small gradients have
their learning rates increased. This helps to normalize
the updates and can be useful in cases where the cost
function has a lot of curvature or different scales of
gradients (a sketch of the update follows).
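An illustrative Adagrad sketch (the learning rate and ε are assumed values; the quadratic cost is reused from the earlier example):

import numpy as np

# Adagrad: accumulate the squared gradients per parameter and divide each step
# by the square root of that history, so parameters with large past gradients get smaller steps.
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, eps = 0.5, 1e-8
x = np.array([0.5, 0.5])
G = np.zeros(2)                          # running sum of squared gradients
for _ in range(500):
    g = grad_F(x)
    G += g * g
    x = x - lr * g / (np.sqrt(G) + eps)  # per-parameter adaptive learning rate
print(x)                                 # approaches [-1, 0.5]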
Module-4- Topic-5 RMSProp
Variants of gradient descent
7. RMSProp
• RMSProp is a variant of gradient descent that
also adapts the learning rate for each parameter,
but instead of using the historical gradient values, it
uses a moving average of the squared gradient
values. This helps to reduce the learning rate for
parameters that have large squared gradient values,
which can cause the algorithm to oscillate or diverge (a sketch of the update follows).
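An illustrative RMSProp sketch (the decay ρ = 0.9, ε, and the learning rate are assumed values):

import numpy as np

# RMSProp: replace Adagrad's ever-growing sum with an exponentially decaying
# moving average of squared gradients.
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, rho, eps = 0.01, 0.9, 1e-8
x = np.array([0.5, 0.5])
s = np.zeros(2)                          # moving average of squared gradients
for _ in range(1000):
    g = grad_F(x)
    s = rho * s + (1 - rho) * g * g
    x = x - lr * g / (np.sqrt(s) + eps)
print(x)   # hovers close to the minimum at [-1, 0.5]; with a fixed learning rate
           # RMSProp circles the optimum rather than settling on it exactly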
Module-4- Topic-6 Adam
Variants of gradient descent
8. Adam
• Adam (Adaptive Moment Estimation) is a variant of gradient descent that combines the ideas of Adagrad and RMSProp.
• It adapts the learning rate for each parameter based on the historical gradient values and also uses a
moving average of the gradient values to compute the momentum term.
• Adam is one of the most widely used optimization algorithms in deep learning because it is efficient, stable, and robust to different types of cost functions and datasets (a sketch of the update follows).
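An illustrative Adam sketch using commonly quoted default hyper-parameters (β1 = 0.9, β2 = 0.999, ε = 1e-8; these and the learning rate are assumptions, not from the slides):

import numpy as np

# Adam: a momentum-like first-moment estimate m and an RMSProp-like second-moment
# estimate v, both bias-corrected, drive a per-parameter adaptive step.
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, b1, b2, eps = 0.02, 0.9, 0.999, 1e-8
x = np.array([0.5, 0.5])
m = np.zeros(2)                        # moving average of gradients
v = np.zeros(2)                        # moving average of squared gradients
for t in range(1, 2001):
    g = grad_F(x)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias correction for the early iterations
    v_hat = v / (1 - b2 ** t)
    x = x - lr * m_hat / (np.sqrt(v_hat) + eps)
print(x)                               # ends up close to the minimum at [-1, 0.5]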
Module-4- Topic-7 AMSGrad
Variants of gradient descent
9. AMSGrad
• AMSGrad is an extension of the Adam version of gradient descent that attempts to improve the convergence properties of the algorithm by avoiding large, abrupt changes in the effective learning rate for each parameter (a sketch of the update follows).
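An illustrative AMSGrad sketch: the only change from the Adam sketch above is that the second-moment estimate is replaced by the running maximum of all past estimates, so the effective learning rate for each parameter can only shrink (hyper-parameters are assumed defaults):

import numpy as np

# AMSGrad: like Adam, but normalise by the elementwise maximum of all past
# second-moment estimates, preventing sudden increases in the effective step size.
def grad_F(x):
    return np.array([2*x[0] + 2*x[1] + 1, 2*x[0] + 4*x[1]])

lr, b1, b2, eps = 0.02, 0.9, 0.999, 1e-8
x = np.array([0.5, 0.5])
m = np.zeros(2)
v = np.zeros(2)
v_max = np.zeros(2)                    # running maximum of the second-moment estimate
for _ in range(2000):
    g = grad_F(x)
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    v_max = np.maximum(v_max, v)       # the key difference from Adam
    x = x - lr * m / (np.sqrt(v_max) + eps)
print(x)                               # converges to the minimum at [-1, 0.5]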
Module-4- Summary
• Introduction to Optimization
• Gradient Descent: a first-order, iterative optimization algorithm
• Variants of Gradient Descent: batch gradient descent, mini-batch gradient descent and stochastic gradient descent
• Momentum Optimizer: accelerates stochastic gradient descent in the relevant direction; NAG uses the momentum term for an anticipatory update
• Adagrad: adaptively scales the learning rate for each dimension
• Adadelta: the sum of gradients is recursively defined as a decaying average of past gradients
• RMSProp: same first update rule as Adadelta
• Adam: a combination of RMSProp and momentum
• AMSGrad: considers the maximum of past squared gradients