
Machine Learning

Optimization and Gradient Descent Algorithm


Lecture – 3 to 4

Instructor: Qamar Askari


Lecture Headlines
• What is optimization?
• Optimization types and algorithms
• Gradient descent intuition
• Univariate and bivariate GD with examples
• Gradient Descent Algorithm and implementation
• Problem requirements for GD
• Challenges for GD
What is optimization?

• TSP: minimize the total distance travelled
• 3SAT: maximize the number of satisfied clauses
• n-Queen: minimize the number of attacking queens
What is optimization?

Optimization has thousands of applications in every field, including biology, chemistry, electronics, gaming, machine learning, and music.

For example
• Timetabling
• Feature selection
• Wireless sensor network optimization
• Vehicle routing
• Clustering
• Game strategy planning
• Circuit designing
• Bioinformatics
• Watermarking
• …
Mathematical Formulation of Optimization Problems
• Optimization can be in terms of minimization or maximization
• Optimization problems can be formulated mathematically
• For example:

f(x, y) = −(x² + y²) + 4

find x & y to maximize f   (optimal point: (0, 0))
Another example

f(x) = A·n + Σᵢ₌₁ⁿ [xᵢ² − A·cos(2π·xᵢ)], where A = 10

find x to minimize f   (optimal point: (0, 0))

Rastrigin function for n = 2
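A quick sketch of evaluating this function in Python (using NumPy; the test points are arbitrary):

import numpy as np

def rastrigin(x, A=10):
    # Rastrigin function for an n-dimensional point x
    x = np.asarray(x, dtype=float)
    return A * x.size + np.sum(x**2 - A * np.cos(2 * np.pi * x))

print(rastrigin([0.0, 0.0]))   # 0.0 -- the global minimum for n = 2
print(rastrigin([1.0, 2.0]))   # 5.0 -- larger, away from the optimum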


Optimization Algorithms/Techniques
• Greedy Algorithms
• Dynamic Programming
• Mathematical optimization algorithms
• Linear programming, quadratic programming, convex programming, etc.
• Approximation-based algorithms
• Nature-inspired algorithms
Gradient Descent Algorithm
Gradient Descent Algorithm
• A commonly used algorithm in ML and DL
• It was proposed long before the modern era of computers (by Cauchy, in 1847)
Intuition
• Suppose you are standing on some hilly terrain, blindfolded, and you need to get as low as possible.

What will you do?


Step 1: Figure out which way is downhill (the slope)
Step 2: Take a step in that direction

Keep repeating these two steps


Intuition
Step 1: Figure out which way is downhill – find the derivative df(x)/dx
Step 2: Take a step in that direction – subtract the derivative from x

Keep repeating these two steps

[Figure: height plotted against position x]

Note:
We need some mathematical formula f(x) to calculate the height from the current position x.
Example
• Assume that the height can be computed from x using the following function:
Height = f(x) = 2x – 3
Example
• Assume that the height can be computed from x using the following function:
Height = f(x) = 2x – 3

Step 1: df/dx = 2
Step 2: x = x – 2 = 6 – 2 = 4   (starting from x = 6)
Example
• Assume that the height can be computed from x using the following function:
Height = f(x) = 2x – 3

Step 1: df/dx = 2
Step 2: x = x – 2 = 4 – 2 = 2
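A minimal sketch of these repeated updates (starting from x = 6 as in the steps above; note that a linear function has no minimum, so the raw update just keeps stepping):

x = 6
for i in range(3):
    x = x - 2     # df/dx = 2, so each raw update subtracts 2
    print(x)      # prints 4, 2, 0 -- the first two match the hand-worked steps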
Example – If slope is negative
• Assume that the height can be computed from x using the following function:
Height = f(x) = -2x + 10
Example – If slope is negative
• Assume that the height can be computed from x using the following function:
Height = f(x) = -2x + 10

Step 1: df/dx = -2
Step 2: x = x – (-2) = 2 + 2 = 4   (starting from x = 2)
Example – If the function is non-linear
• Assume that the height can be computed from x using the following function:
Height = f(x) = x² – 2x + 1

Step 1: df/dx = 2x - 2
Step 2: x = x – (2x - 2) = 4 - 6 = -2   (starting from x = 4)
Example – If the function is non-linear
• Assume that the height can be computed from x using the following function:
Height = f(x) = x² – 2x + 1

Step 1: df/dx = 2x - 2
Step 2: x = x – (2x - 2) = 4 - 6 = -2

What will happen if one more iteration is executed?
It will move back to location 4: the raw update oscillates between 4 and -2 and never reaches the minimum at x = 1.
Example – If the function is non-linear
• Solution: we can control the step size by multiplying the derivative by a small fraction before subtracting it from x.
Height = f(x) = x² – 2x + 1

Step 1: df/dx = 2x - 2
Step 2: x = x – 0.1 * (2x - 2) = 4 – 0.6 = 3.4
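A short sketch contrasting the raw update (which oscillates between 4 and -2) with the damped update (which converges toward the minimum at x = 1):

# f(x) = x**2 - 2x + 1, with derivative df/dx = 2x - 2
def grad(x):
    return 2 * x - 2

x = 4.0
for _ in range(4):
    x = x - grad(x)          # raw update: oscillates 4 -> -2 -> 4 -> -2 ...
print("without step size:", x)

x = 4.0
for _ in range(50):
    x = x - 0.1 * grad(x)    # damped update: converges toward x = 1
print("with step size 0.1:", x)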
Finalized Gradient Descent Rule

x_new = x_old – α · df/dx

where α controls the step size of the algorithm
Little about α
• The smaller α is, the longer GD takes to converge – it may hit the maximum number of iterations before reaching the optimum point
• If α is too big, the algorithm may not converge to the optimal point (it jumps around) or may even diverge completely
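A small sketch of both failure modes on f(x) = x² – 2x + 1; the specific α values are illustrative assumptions:

def step(x, alpha, iters=20):
    for _ in range(iters):
        x = x - alpha * (2 * x - 2)   # df/dx = 2x - 2
    return x

print(step(4.0, 0.001))  # too small: after 20 steps, still far from x = 1
print(step(4.0, 0.1))    # reasonable: close to the optimum x = 1
print(step(4.0, 1.1))    # too big: |x - 1| grows every step -- diverges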
Gradient Descent for multi-variate functions
• What if the height is a multi-variate function?

Height = f(x, y) = 0.5x² + 0.5y² + 1

• We’ll compute partial derivatives w.r.t. each variable

Updating x (starting from x = 5, with α = 0.2):
Step 1: ∂f/∂x = 2 · 0.5x = x
Step 2: x = x – 0.2 · x = 5 – 1 = 4

Updating y (starting from y = 4):
Step 1: ∂f/∂y = 2 · 0.5y = y
Step 2: y = y – 0.2 · y = 4 – 0.8 = 3.2
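A minimal sketch of these updates run in a loop (the starting point (5, 4) and α = 0.2 come from the worked step above):

# Gradient descent on f(x, y) = 0.5*x**2 + 0.5*y**2 + 1.
# Partial derivatives: df/dx = x, df/dy = y.
x, y = 5.0, 4.0
alpha = 0.2

for _ in range(30):
    dx, dy = x, y          # gradient at the current point
    x = x - alpha * dx     # first iteration: 5 -> 4
    y = y - alpha * dy     # first iteration: 4 -> 3.2

print(x, y)   # both approach 0; the minimum of f is at (0, 0) with value 1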
Demo: Gradient Descent for multi-variate functions
Gradient Descent Algorithm – Steps
1. Choose/randomly initialize a starting point
2. Calculate the gradient at the current point
3. Make a scaled step in the opposite direction to the gradient if minimizing (for maximization, the step is taken in the same direction as the gradient)
4. Repeat steps 2 and 3 until one of the stopping criteria is met:
• maximum number of iterations reached
• step size is smaller than the tolerance
Python code
• A full implementation of the GD algorithm in Python is provided separately; the function used there as an example is shown in a figure (not reproduced here).
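Since the separate implementation isn’t reproduced in this document, here is a minimal sketch of the steps above; f(x) = x² – 2x + 1 from the earlier slides stands in for the missing example function (an assumption):

import random

def gradient_descent(grad, alpha=0.1, max_iters=1000, tol=1e-8):
    x = random.uniform(-10, 10)      # step 1: random starting point
    for _ in range(max_iters):       # stop: maximum iterations reached
        step = alpha * grad(x)       # step 2: gradient at current point, scaled
        x = x - step                 # step 3: move against the gradient
        if abs(step) < tol:          # stop: step smaller than the tolerance
            break
    return x

# f(x) = x**2 - 2x + 1 has df/dx = 2x - 2 and its minimum at x = 1
print(gradient_descent(lambda x: 2 * x - 2))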
Problem requirements
• Gradient Descent works best if the problem/function is:
• Differentiable
• Convex
Problem requirements – Differentiability

[Figure: a few examples of differentiable functions]
Problem requirements – Differentiability

[Figure: a few examples of non-differentiable functions]
Problem requirements – Convexity
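The convexity illustration from the slide isn’t reproduced here; for reference, the standard definition (a general fact, not taken from the slide):

% f is convex if, for all x, y in its domain and all \lambda \in [0, 1]:
f\bigl(\lambda x + (1 - \lambda)\, y\bigr) \le \lambda f(x) + (1 - \lambda) f(y)
% Geometrically: the chord between any two points on the graph lies on or
% above the graph, so any local minimum GD finds is also the global minimum.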
Challenges for Gradient Descent:
Local optima and saddle points
Challenges for Gradient Descent:
Dependence on starting point
Challenges for Gradient Descent:
α / learning rate tuning
Applications of Gradient Descent Algorithm
• Applicability to Machine Learning problems
• Error/loss/cost minimization in many algorithms such as ANN, Linear Regression, Logistic Regression, etc.
Case study: Positioning Class Invigilator
• To be discussed on the board, if time permits
Gradient Descent variations
• Stochastic Gradient Descent
• Batch Gradient Descent
