EDA Lecture Module 4
Email: [email protected]
Handphone No.: +91-9944226963
Module 4: Optimization
Introduction to Optimization
Gradient Descent
Variants of Gradient Descent
Momentum Optimizer
Nesterov Accelerated Gradient
Adagrad
Adadelta
RMSProp
Adam
AMSGrad
Introduction to Optimization
Optimization is the process of maximizing or minimizing a real
function by systematically choosing input values from an allowed set
of values and computing the value of the function.
It refers to the use of specific methods to determine the best solution from all feasible solutions, for example, finding the best functional representation of data or the best hyperplane to classify data.
The three components of an optimization problem are: the objective function (to be minimized or maximized), the decision variables and the constraints.
Based on the type of objective function, constraints and decision variables, several types of optimization problems exist. An optimization problem can be linear or non-linear, convex or non-convex, iterative or non-iterative, etc.
Optimization is considered one of the three pillars of data science; linear algebra and statistics are the other two.
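As a concrete illustration of these three components, the following toy problem (chosen here purely for illustration, not taken from the lecture) minimizes an objective over two decision variables subject to a single constraint using scipy.optimize:

import numpy as np
from scipy.optimize import minimize

# Objective function (to be minimized)
def objective(x):
    return (x[0] - 1.0) ** 2 + (x[1] - 2.0) ** 2

# Constraint: x1 + x2 <= 2  (scipy expects g(x) >= 0, so pass 2 - x1 - x2)
constraints = [{"type": "ineq", "fun": lambda x: 2.0 - x[0] - x[1]}]

# Decision variables: x = (x1, x2), starting from an initial guess
x0 = np.array([0.0, 0.0])
result = minimize(objective, x0, constraints=constraints)
print(result.x)   # constrained minimizer, approximately (0.5, 1.5)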
Consider the following optimization problem (the standard formulation of the maximal margin classifier), which attempts to find the maximal margin hyperplane with margin M:

maximize M over α0, α1, ..., αp                                            (1)
subject to  Σ_{j=1}^{p} αj² = 1                                            (2)
            yi (α0 + α1 xi1 + α2 xi2 + ... + αp xip) ≥ M for all i = 1, ..., n   (3)

Here (xi1, ..., xip) and yi ∈ {−1, +1} denote the i-th training observation and its class label. Equation (1) is the objective function, equations (2) and (3) are the constraints, and α0, α1, ..., αp are the decision variables.
In general, an objective function is denoted as f(·), and the minimizer of f(·) is the same as the maximizer of −f(·).
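This relationship can be checked numerically with a small sketch (the one-dimensional function g below is chosen here for illustration and is not from the lecture): to maximize g with a routine that only minimizes, one simply minimizes −g.

import numpy as np
from scipy.optimize import minimize

# Illustrative function g with its maximum at x = 3
g = lambda x: -(x[0] - 3.0) ** 2 + 1.0

# scipy.optimize.minimize only minimizes, so the maximizer of g is found
# by minimizing -g: the minimizer of -g coincides with the maximizer of g.
result = minimize(lambda x: -g(x), x0=np.array([0.0]))
print(result.x)       # approximately [3.], the maximizer of g
print(g(result.x))    # approximately 1.0, the maximum value of g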
Gradient Descent
Gradient Descent is the most common optimization algorithm in
machine learning and deep learning.
It is a first-order, iterative optimization algorithm which takes into account only the first derivative when updating the parameters.
Each iteration involves two steps: (i) finding the (locally) steepest direction according to the first derivative of the objective function; and (ii) finding the best point along that direction. The parameters are updated in the direction opposite to the gradient of the objective function.
The learning rate α determines the convergence (i.e., the number of iterations required to reach the local minimum). It should be neither too small nor too large: a very small α leads to very slow convergence, while a very large α leads to oscillations around the minimum or may even cause divergence.
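The effect of the learning rate can be illustrated on a simple one-dimensional quadratic. The sketch below uses f(x) = x² and three illustrative learning rates; neither the function nor the values are taken from the lecture.

# Gradient descent on f(x) = x**2 (gradient 2*x) for different learning rates.
def run_gd(alpha, x0=1.0, iters=20):
    x = x0
    for _ in range(iters):
        x = x - alpha * 2 * x   # update in the direction opposite to the gradient
    return x

print(run_gd(0.01))   # too small: after 20 iterations x is still far from the minimum at 0
print(run_gd(0.5))    # well chosen: reaches the minimum of this quadratic immediately
print(run_gd(1.1))    # too large: the iterates oscillate in sign and diverge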
In each iteration k, the parameters are updated as

Xk = Xk−1 − αGk−1

where Gk−1 is the gradient of the objective function f evaluated at Xk−1 and α is the learning rate.
In this case,

f′(X) = [1 + 4x1 + 2x2,  2x1 + 6x2]^T.

In the first iteration, the direction G0 and the best point X1 are estimated as follows:

G0 = f′(X0) = [4, 4]^T  and  X1 = X0 − αG0 = [0.1, 0.1]^T.
Question 4.1
Continue the gradient descent iterations, taking the stopping criterion to be that the absolute difference between the function values in successive iterations is less than 0.005. Your answer should show the search direction and the value of the function in each iteration.
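A minimal sketch of this iteration is given below. It assumes the objective f(x1, x2) = x1 + 2x1² + 2x1x2 + 3x2² (whose gradient matches the one shown above), the starting point X0 = (0.5, 0.5) and the learning rate α = 0.1 implied by the first iteration; these values are inferred here and are not stated explicitly in the question.

import numpy as np

# Assumed objective and gradient (inferred from the gradient shown above)
def f(x):
    x1, x2 = x
    return x1 + 2 * x1**2 + 2 * x1 * x2 + 3 * x2**2

def grad_f(x):
    x1, x2 = x
    return np.array([1 + 4 * x1 + 2 * x2, 2 * x1 + 6 * x2])

alpha = 0.1                    # learning rate (inferred)
x = np.array([0.5, 0.5])       # starting point X0 (inferred)
prev_val = f(x)

for k in range(1, 100):
    g = grad_f(x)              # the search direction is -g
    x = x - alpha * g          # gradient descent update
    val = f(x)
    print(f"iteration {k}: direction {-g}, f(X{k}) = {val:.4f}")
    if abs(val - prev_val) < 0.005:   # stopping criterion from Question 4.1
        break
    prev_val = val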
Momentum Optimizer
RMSProp
Adam
Adaptive Moment Estimation (Adam) combines RMSProp and Momentum.
It incorporates the momentum term (i.e., the exponentially weighted first moment of the gradient) into the RMSProp update as follows:
wk = wk−1 − α · m̂k−1 / (√v̂k−1 + ε)
where m̂k−1 and v̂k−1 are bias-corrected versions of mk−1 (the first moment) and vk−1 (the second moment) respectively, and ε is a small constant that prevents division by zero. The first and second moments are exponentially weighted averages of the gradient and of the squared gradient:

mk = β1 mk−1 + (1 − β1) Gk
vk = β2 vk−1 + (1 − β2) Gk²

with bias corrections m̂k = mk / (1 − β1^k) and v̂k = vk / (1 − β2^k), where Gk is the gradient at iteration k and β1, β2 ∈ [0, 1) are the decay rates of the moment estimates.
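The Adam recursion above can be sketched in a few lines of NumPy. This is a minimal illustration of the standard update, not a reference implementation from the lecture; the test function and the values β1 = 0.9, β2 = 0.999, ε = 1e−8 and α = 0.05 are assumptions chosen for the example.

import numpy as np

def adam_step(w, grad, m, v, k, alpha=0.001, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update; returns the new parameters and updated moment estimates."""
    m = beta1 * m + (1 - beta1) * grad        # first moment (momentum term)
    v = beta2 * v + (1 - beta2) * grad**2     # second moment (RMSProp term)
    m_hat = m / (1 - beta1**k)                # bias correction of the first moment
    v_hat = v / (1 - beta2**k)                # bias correction of the second moment
    w = w - alpha * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

# Example: minimize f(w) = w1**2 + 3 * w2**2, whose gradient is (2*w1, 6*w2)
w = np.array([1.0, 1.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
for k in range(1, 201):
    grad = np.array([2 * w[0], 6 * w[1]])
    w, m, v = adam_step(w, grad, m, v, k, alpha=0.05)
print(w)   # approaches the minimizer (0, 0)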
AMSGrad