
Linear Algebra for AI and ML

Lecture 11

Prabhat Kumar Mishra


Gradient descent
• An iterative algorithm
• Guaranteed to succeed for convex loss functions
• How to shape the loss function so that gradient descent can succeed
Basics of optimization
• What is the general form of an optimization problem?
• What is the meaning of a local and a global minimizer?
• Convex versus non-convex functions:
• A function f is convex if for every t ∈ (0,1) the following holds
• f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) for every x, y
• The inequality remains trivially true even if x or y lies outside the domain of f, by setting f to +∞ there (a small numerical check of the definition follows below)
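A quick numerical illustration of this definition, as a minimal sketch: the choice f(x) = x² and the random sample points are assumptions made for the example, not part of the lecture.

```python
import numpy as np

# Check the convexity inequality for f(x) = x^2 on random points:
# f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y)
def f(z):
    return z ** 2

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    t = rng.uniform(0.0, 1.0)
    lhs = f(t * x + (1 - t) * y)      # value of f at a point between x and y
    rhs = t * f(x) + (1 - t) * f(y)   # corresponding point on the chord
    assert lhs <= rhs + 1e-9
print("convexity inequality holds on all sampled points")
```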
Gradient descent
• We want to minimize a function Φ
• Start with a guess w0 and find w1 in such a way that
• Φ(w1) < Φ(w0)
• Find a direction d such that
• Φ(w) is decreasing when w changes in the direction d

This characterization of descent directions allows us to provide conditions as to when w minimizes F.

Proposition 4. The point w* is a local minimizer only if ∇F(w*) = 0.

Why is this true? Well, −∇F(w*) is always a descent direction if it is not zero. If w* is a local minimum, there can be no descent directions. Therefore, the gradient must vanish.

Gradient descent uses the fact that the negative gradient is always a descent direction to construct an algorithm: repeatedly compute the gradient and take a step in the opposite direction to minimize F.
Gradient Descent
• Start from an initial point w0 ∈ ℝᵈ.
• At each step t = 0, 1, 2, . . .:
– Choose a step size αt > 0
– Set wt+1 = wt − αt ∇F(wt)
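A minimal sketch of this update rule in Python. The quadratic F(w) = ½∥Aw − b∥², the matrix A, the vector b, the constant step size, and the iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Gradient descent on F(w) = 0.5 * ||A w - b||^2, whose gradient is A^T (A w - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_F(w):
    return A.T @ (A @ w - b)

w = np.zeros(2)        # initial point w0
alpha = 0.05           # constant step size alpha_t
for t in range(500):
    w = w - alpha * grad_F(w)        # w_{t+1} = w_t - alpha_t * grad F(w_t)

print(w, np.linalg.solve(A, b))      # iterate approaches the minimizer A^{-1} b
```

Here F is convex and A is invertible, so the iterates converge to the unique minimizer A⁻¹b for a sufficiently small constant step size.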
Descent direction
• A vector d is a descent direction for f at w0 if
• f(w0 + td) < f(w0) for all sufficiently small t > 0
• For continuously differentiable f, if d is a descent direction, then
• d⊤∇f(w0) < 0
• Argue with the help of Taylor’s theorem (use the remainder)

f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)


Taylor’s theorem
• If f : ℝ → ℝ has n + 1 derivatives in an open interval I around a, then for x ∈ I
• f(x) = f(a) + f′(a)(x − a) + … + Rn(x)
• where Rn(x) = f^(n+1)(c)(x − a)^(n+1) / (n + 1)! for some c between a and x
• For n = 0
• f(x) = f(a) + R0(x)
• R0(x) = f′(c)(x − a)

So in our case: f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)
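A small numerical sketch of the remainder formula, using f(x) = eˣ around a = 0; the function, expansion point, and evaluation point are illustrative assumptions. Since every derivative of eˣ equals eᶜ ≤ eˣ on [0, x], the remainder is bounded by eˣ |x − a|^(n+1) / (n + 1)!.

```python
import math

# Taylor's theorem for f(x) = exp(x) at a = 0: f(x) = sum_{k<=n} x^k/k! + R_n(x),
# with |R_n(x)| <= exp(x) * |x - a|^(n+1) / (n+1)! on [0, x].
f = math.exp
a, x = 0.0, 0.5

for n in range(5):
    taylor = sum(x ** k / math.factorial(k) for k in range(n + 1))  # degree-n polynomial
    remainder = f(x) - taylor                                       # actual R_n(x)
    bound = math.exp(max(a, x)) * abs(x - a) ** (n + 1) / math.factorial(n + 1)
    print(n, remainder, bound, abs(remainder) <= bound)
```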


Descent direction
• A vector d is a descent direction for f at w0 if
• f(w0 + td) < f(w0) for all sufficiently small t > 0
• For continuously differentiable f, if d is a descent direction, then
• d⊤∇f(w0) < 0
• Taylor’s theorem gives f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)
• Since f(w0 + td) < f(w0) for small t > 0
• 0 > f(w0 + td) − f(w0) = ∇f(w0 + t̄d)⊤(td)
• Dividing by t > 0 gives ∇f(w0 + t̄d)⊤d < 0, and by the continuity of ∇f, letting t → 0 we get
• d⊤∇f(w0) < 0
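A numerical check of this argument for the particular choice d = −∇f(w0); the function f and the point w0 below are assumptions made for the example.

```python
import numpy as np

# f(w) = log(1 + exp(w[0])) + w[1]^2, a smooth function chosen only for illustration.
def f(w):
    return np.log1p(np.exp(w[0])) + w[1] ** 2

def grad_f(w):
    return np.array([1.0 / (1.0 + np.exp(-w[0])), 2.0 * w[1]])

w0 = np.array([1.0, -2.0])
d = -grad_f(w0)                      # candidate descent direction

print(d @ grad_f(w0))                # negative: d^T grad f(w0) < 0
for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(w0 + t * d) < f(w0))  # f decreases along d for small t > 0
```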
Proposition
• The point w* is a local minimizer of f only if
• ∇f(w*) = 0
• If not true, d = −∇f(w*) will be a descent direction (d⊤∇f(w*) = −∥∇f(w*)∥²)
Proposition
• Let f : ℝᵈ → ℝ be convex and differentiable
• x* is a global minimizer if and only if ∇f(x*) = 0
• That is, f(x*) ≤ f(x) ∀x ⟺ ∇f(x*) = 0
For some arbitrary x and t ∈ [0,1], convexity gives

f(x* + t(x − x*)) = f((1 − t)x* + tx) ≤ (1 − t)f(x*) + tf(x),

so f(x* + t(x − x*)) − (1 − t)f(x*) ≤ tf(x).

But by Taylor’s theorem,

f(x* + t(x − x*)) = f(x*) + t(∇f(x* + t̄(x − x*)))⊤(x − x*), where t̄ ∈ (0,t).

Therefore, t(f(x*) + (∇f(x* + t̄(x − x*)))⊤(x − x*)) ≤ tf(x).

Dividing by t and taking the limit t → 0, we have f(x*) + (∇f(x*))⊤(x − x*) ≤ f(x) for all x, since x was arbitrary.

Hence ∇f(x*) = 0 ⟹ f(x*) ≤ f(x) ∀x. Conversely, if ∇f(x*) ≠ 0, then −∇f(x*) is a descent direction (previous proposition), so f(x) < f(x*) for some x, contradicting minimality.
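A small sketch of the first direction of the proposition: for an illustrative convex function (an assumption for the example, not from the lecture), the point where the gradient vanishes is verified to sit below the function value at many sampled points.

```python
import numpy as np

# Convex f(x) = x[0]^2 + x[1]^2 + exp(x[0]); its gradient vanishes at a unique point,
# which the proposition says must be the global minimizer.
def f(x):
    return x[0] ** 2 + x[1] ** 2 + np.exp(x[0])

def grad_f(x):
    return np.array([2.0 * x[0] + np.exp(x[0]), 2.0 * x[1]])

# Solve grad_f = 0 by bisection on the first coordinate (2*x0 + exp(x0) = 0).
lo, hi = -1.0, 0.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if 2 * mid + np.exp(mid) < 0 else (lo, mid)
x_star = np.array([0.5 * (lo + hi), 0.0])

print(grad_f(x_star))                                   # approximately zero
rng = np.random.default_rng(0)
samples = rng.uniform(-5, 5, size=(1000, 2))
print(all(f(x_star) <= f(x) + 1e-9 for x in samples))   # f(x*) <= f(x) at every sample
```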

Properties of convex functions
• All norms are convex
• If f is convex and α ≥ 0, then αf is convex
• If f and g are convex, then
• f + g is convex
• h(x) = max{f(x), g(x)} is convex
• If f is convex, then h(x) = f(Ax + b) is convex
Loss function
• A loss function of the form J(x) = 1{ŷ(x)⊤y(x) < 0} (the 0-1 loss) is not suitable for gradient descent: its gradient is zero wherever it is defined, so it gives no descent direction (see the sketch below)
• Losses shaped to work with gradient descent:
• Support vector machine (hinge loss)
• Logistic regression (logistic loss)
• Least squares classification (squared loss)
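A sketch contrasting the 0-1 loss above with a smooth convex surrogate (the logistic loss) on a single example with a linear score; the data point, label, and weights are illustrative assumptions.

```python
import numpy as np

# Binary example with label y in {-1, +1} and linear score w . x.
x = np.array([1.0, 2.0])
y = -1.0
w = np.array([0.5, 0.5])
margin = y * (w @ x)                 # negative here: the example is misclassified

# 0-1 loss: 1 if misclassified, else 0. Its gradient w.r.t. w is zero wherever defined,
# so gradient descent receives no signal from it.
zero_one_loss = float(margin < 0)
zero_one_grad = np.zeros_like(w)

# Logistic loss log(1 + exp(-margin)): smooth and convex in w, with a useful gradient.
logistic_loss = np.log1p(np.exp(-margin))
logistic_grad = -y * x / (1.0 + np.exp(margin))

print(zero_one_loss, zero_one_grad)
print(logistic_loss, logistic_grad)
```

The 0-1 loss reports the error but its gradient carries no information, while the logistic loss supplies a nonzero gradient that gradient descent can follow.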
Two statements
• f: differentiable and convex
• For any u, v we have
• f(u) ≥ f(v) + ∇f(v)⊤(u − v)
• This condition also means that the first-order approximation (or linear approximation) of a convex function is always an under-estimator
• The first definition, f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y), means that the function evaluated at any point between x and y stays below the line joining f(x) and f(y)
• You can understand the difference between the above two statements by making a graph of f(x) = x² (see the sketch below)
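A minimal sketch of that comparison for f(x) = x²; the evaluation points are arbitrary illustrative choices. The first statement says the tangent line at v never exceeds f, while the definition says the chord between x and y never lies below f.

```python
import numpy as np

def f(z):
    return z ** 2

def df(z):
    return 2 * z

# Statement 1: the tangent at v is an under-estimator, f(u) >= f(v) + f'(v)(u - v).
u = np.linspace(-3.0, 3.0, 13)
v = 1.0
print(np.all(f(u) >= f(v) + df(v) * (u - v) - 1e-12))

# Statement 2 (definition): the graph lies below the chord between (x, f(x)) and (y, f(y)).
x, y = -2.0, 2.5
t = np.linspace(0.0, 1.0, 11)
print(np.all(f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12))
```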
