
Linear Algebra for AI and ML

Lecture 11

Prabhat Kumar Mishra


Gradient descent
• An iterative algorithm
• Guaranteed to succeed for convex loss functions
• How to shape the loss function so that gradient descent can succeed
Basics of optimization
• What is the general form of an optimization problem?
• What is the meaning of a local and a global minimizer?
• Convex versus non-convex functions:
• A function f is convex if for every t ∈ (0,1) the following holds
• f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y) for every x, y
• The inequality remains trivially true even if x or y lies outside the domain of f, by setting f to +∞ there (a small numerical check of the definition follows below)
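A quick numerical illustration of this definition, as a minimal sketch: the choice f(x) = x² and the random sample points are assumptions made for the example, not part of the lecture.

```python
import numpy as np

# Check the convexity inequality for f(x) = x^2 on random points:
# f(t*x + (1-t)*y) <= t*f(x) + (1-t)*f(y)
def f(z):
    return z ** 2

rng = np.random.default_rng(0)
for _ in range(1000):
    x, y = rng.uniform(-10.0, 10.0, size=2)
    t = rng.uniform(0.0, 1.0)
    lhs = f(t * x + (1 - t) * y)      # value of f at a point between x and y
    rhs = t * f(x) + (1 - t) * f(y)   # corresponding point on the chord
    assert lhs <= rhs + 1e-9
print("convexity inequality holds on all sampled points")
```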
Gradient descent
• We want to minimize a function Φ
• Start with a guess w0 and find w1 in such a way that
• Φ(w1) < Φ(w0)
• Find a direction d such that
• Φ(w) is decreasing when w changes in the direction d

This characterization of descent directions allows us to provide conditions as to when w minimizes F.

Proposition 4. The point w* is a local minimizer only if ∇F(w*) = 0.

Why is this true? Well, −∇F(w*) is always a descent direction if it is not zero. If w* is a local minimum, there can be no descent directions. Therefore, the gradient must vanish.

Gradient descent uses the fact that the negative gradient is always a descent direction to construct an algorithm: repeatedly compute the gradient and take a step in the opposite direction to minimize F.
Gradient Descent
• Start from an initial point w0 ∈ ℝᵈ.
• At each step t = 0, 1, 2, . . .:
– Choose a step size αt > 0
– Set wt+1 = wt − αt ∇F(wt)
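A minimal sketch of this update rule in Python. The quadratic F(w) = ½∥Aw − b∥², the matrix A, the vector b, the constant step size, and the iteration count are illustrative assumptions, not values from the lecture.

```python
import numpy as np

# Gradient descent on F(w) = 0.5 * ||A w - b||^2, whose gradient is A^T (A w - b).
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])

def grad_F(w):
    return A.T @ (A @ w - b)

w = np.zeros(2)        # initial point w0
alpha = 0.05           # constant step size alpha_t
for t in range(500):
    w = w - alpha * grad_F(w)        # w_{t+1} = w_t - alpha_t * grad F(w_t)

print(w, np.linalg.solve(A, b))      # iterate approaches the minimizer A^{-1} b
```

Here F is convex and A is invertible, so the iterates converge to the unique minimizer A⁻¹b for a sufficiently small constant step size.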
Descent direction
• A vector d is a descent direction for f at w0 if
• f(w0 + td) < f(w0) for all sufficiently small t > 0
• For continuously differentiable f, if d is a descent direction, then
• d⊤∇f(w0) < 0
• Argue with the help of Taylor’s theorem (use the remainder)

f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)


Taylor’s theorem
• If f : ℝ → ℝ has n + 1 derivatives in an open interval I around a, then for x ∈ I
• f(x) = f(a) + f′(a)(x − a) + … + Rn(x)
• where Rn(x) = f^(n+1)(c)(x − a)^(n+1) / (n + 1)! for some c between a and x
• For n = 0
• f(x) = f(a) + R0(x)
• R0(x) = f′(c)(x − a)

So in our case: f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)
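A small numerical sketch of the remainder formula, using f(x) = eˣ around a = 0; the function, expansion point, and evaluation point are illustrative assumptions. Since every derivative of eˣ equals eᶜ ≤ eˣ on [0, x], the remainder is bounded by eˣ |x − a|^(n+1) / (n + 1)!.

```python
import math

# Taylor's theorem for f(x) = exp(x) at a = 0: f(x) = sum_{k<=n} x^k/k! + R_n(x),
# with |R_n(x)| <= exp(x) * |x - a|^(n+1) / (n+1)! on [0, x].
f = math.exp
a, x = 0.0, 0.5

for n in range(5):
    taylor = sum(x ** k / math.factorial(k) for k in range(n + 1))  # degree-n polynomial
    remainder = f(x) - taylor                                       # actual R_n(x)
    bound = math.exp(max(a, x)) * abs(x - a) ** (n + 1) / math.factorial(n + 1)
    print(n, remainder, bound, abs(remainder) <= bound)
```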


Descent direction
• A vector d is a descent direction for f at w0 if
• f(w0 + td) < f(w0) for all sufficiently small t > 0
• For continuously differentiable f, if d is a descent direction, then
• d⊤∇f(w0) < 0
• Taylor’s theorem gives f(w0 + td) = f(w0) + ∇f(w0 + t̄d)⊤(td) for some t̄ ∈ (0,t)
• Since f(w0 + td) < f(w0) for small t > 0
• 0 > f(w0 + td) − f(w0) = ∇f(w0 + t̄d)⊤(td)
• Dividing by t > 0 gives ∇f(w0 + t̄d)⊤d < 0, and by the continuity of ∇f, letting t → 0 we get
• d⊤∇f(w0) < 0
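A numerical check of this argument for the particular choice d = −∇f(w0); the function f and the point w0 below are assumptions made for the example.

```python
import numpy as np

# f(w) = log(1 + exp(w[0])) + w[1]^2, a smooth function chosen only for illustration.
def f(w):
    return np.log1p(np.exp(w[0])) + w[1] ** 2

def grad_f(w):
    return np.array([1.0 / (1.0 + np.exp(-w[0])), 2.0 * w[1]])

w0 = np.array([1.0, -2.0])
d = -grad_f(w0)                      # candidate descent direction

print(d @ grad_f(w0))                # negative: d^T grad f(w0) < 0
for t in [1e-1, 1e-2, 1e-3]:
    print(t, f(w0 + t * d) < f(w0))  # f decreases along d for small t > 0
```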
Proposition
• The point w* is a local minimizer of f only if
• ∇f(w*) = 0
• If not true, d = −∇f(w*) will be a descent direction (d⊤∇f(w*) = −∥∇f(w*)∥²)
Proposition
• Let f : ℝᵈ → ℝ be convex and differentiable
• x* is a global minimizer if and only if ∇f(x*) = 0
• That is, f(x*) ≤ f(x) ∀x ⟺ ∇f(x*) = 0
For some arbitrary x and t ∈ [0,1], convexity gives

f(x* + t(x − x*)) = f((1 − t)x* + tx) ≤ (1 − t)f(x*) + tf(x),

so f(x* + t(x − x*)) − (1 − t)f(x*) ≤ tf(x).

But by Taylor’s theorem,

f(x* + t(x − x*)) = f(x*) + t(∇f(x* + t̄(x − x*)))⊤(x − x*), where t̄ ∈ (0,t).

Therefore, t(f(x*) + (∇f(x* + t̄(x − x*)))⊤(x − x*)) ≤ tf(x).

Dividing by t and taking the limit t → 0, we have f(x*) + (∇f(x*))⊤(x − x*) ≤ f(x) for all x, since x was arbitrary.

Hence ∇f(x*) = 0 ⟹ f(x*) ≤ f(x) ∀x. Conversely, if ∇f(x*) ≠ 0, then −∇f(x*) is a descent direction (previous proposition), so f(x) < f(x*) for some x, contradicting minimality.
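A small sketch of the first direction of the proposition: for an illustrative convex function (an assumption for the example, not from the lecture), the point where the gradient vanishes is verified to sit below the function value at many sampled points.

```python
import numpy as np

# Convex f(x) = x[0]^2 + x[1]^2 + exp(x[0]); its gradient vanishes at a unique point,
# which the proposition says must be the global minimizer.
def f(x):
    return x[0] ** 2 + x[1] ** 2 + np.exp(x[0])

def grad_f(x):
    return np.array([2.0 * x[0] + np.exp(x[0]), 2.0 * x[1]])

# Solve grad_f = 0 by bisection on the first coordinate (2*x0 + exp(x0) = 0).
lo, hi = -1.0, 0.0
for _ in range(60):
    mid = 0.5 * (lo + hi)
    lo, hi = (mid, hi) if 2 * mid + np.exp(mid) < 0 else (lo, mid)
x_star = np.array([0.5 * (lo + hi), 0.0])

print(grad_f(x_star))                                   # approximately zero
rng = np.random.default_rng(0)
samples = rng.uniform(-5, 5, size=(1000, 2))
print(all(f(x_star) <= f(x) + 1e-9 for x in samples))   # f(x*) <= f(x) at every sample
```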

Properties of convex functions
• All norms are convex
• If f is convex and α ≥ 0, then αf is convex
• If f and g are convex, then
• f + g is convex
• h(x) = max{f(x), g(x)} is convex
• If f is convex, then h(x) = f(Ax + b) is convex
Loss function
• A loss function of the form J(x) = 1{ŷ(x)⊤y(x) < 0} (the 0-1 loss) is not suitable for gradient descent: its gradient is zero wherever it is defined, so it gives no descent direction (see the sketch below)
• Losses shaped to work with gradient descent:
• Support vector machine (hinge loss)
• Logistic regression (logistic loss)
• Least squares classification (squared loss)
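A sketch contrasting the 0-1 loss above with a smooth convex surrogate (the logistic loss) on a single example with a linear score; the data point, label, and weights are illustrative assumptions.

```python
import numpy as np

# Binary example with label y in {-1, +1} and linear score w . x.
x = np.array([1.0, 2.0])
y = -1.0
w = np.array([0.5, 0.5])
margin = y * (w @ x)                 # negative here: the example is misclassified

# 0-1 loss: 1 if misclassified, else 0. Its gradient w.r.t. w is zero wherever defined,
# so gradient descent receives no signal from it.
zero_one_loss = float(margin < 0)
zero_one_grad = np.zeros_like(w)

# Logistic loss log(1 + exp(-margin)): smooth and convex in w, with a useful gradient.
logistic_loss = np.log1p(np.exp(-margin))
logistic_grad = -y * x / (1.0 + np.exp(margin))

print(zero_one_loss, zero_one_grad)
print(logistic_loss, logistic_grad)
```

The 0-1 loss reports the error but its gradient carries no information, while the logistic loss supplies a nonzero gradient that gradient descent can follow.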
Two statements
• f: differentiable and convex
• For any u, v we have
• f(u) ≥ f(v) + ∇f(v)⊤(u − v)
• This condition also means that the first-order approximation (or linear approximation) of a convex function is always an under-estimator
• The first definition, f(tx + (1 − t)y) ≤ tf(x) + (1 − t)f(y), means that the function evaluated at any point between x and y stays below the line joining f(x) and f(y)
• You can understand the difference between the above two statements by making a graph of f(x) = x² (see the sketch below)
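A minimal sketch of that comparison for f(x) = x²; the evaluation points are arbitrary illustrative choices. The first statement says the tangent line at v never exceeds f, while the definition says the chord between x and y never lies below f.

```python
import numpy as np

def f(z):
    return z ** 2

def df(z):
    return 2 * z

# Statement 1: the tangent at v is an under-estimator, f(u) >= f(v) + f'(v)(u - v).
u = np.linspace(-3.0, 3.0, 13)
v = 1.0
print(np.all(f(u) >= f(v) + df(v) * (u - v) - 1e-12))

# Statement 2 (definition): the graph lies below the chord between (x, f(x)) and (y, f(y)).
x, y = -2.0, 2.5
t = np.linspace(0.0, 1.0, 11)
print(np.all(f(t * x + (1 - t) * y) <= t * f(x) + (1 - t) * f(y) + 1e-12))
```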
