CS115 Optimization
Ngoc-Hoang Luong
The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html
Table of Contents
1 Introduction
2 Matrix calculus
4 Optimality conditions
7 First-order methods
A point θ* is said to be a strict local minimum if its cost is strictly lower than those of neighboring points:
∃δ > 0, ∀θ ∈ Θ, θ ≠ θ*: ∥θ − θ*∥ < δ ⇒ L(θ*) < L(θ)
f′(x) = lim_{h→0} (f(x + h) − f(x)) / h   (forward difference)
      = lim_{h→0} (f(x + h/2) − f(x − h/2)) / h   (central difference)
      = lim_{h→0} (f(x) − f(x − h)) / h   (backward difference)
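As an illustration (not from the slides), a minimal Python sketch comparing the three difference quotients; the test function f(x) = x³ and the step size h are arbitrary choices.

```python
# Finite-difference approximations of f'(x); f(x) = x**3 is an arbitrary test
# function whose exact derivative is 3x^2, so f'(2) = 12.
def forward_diff(f, x, h=1e-5):
    return (f(x + h) - f(x)) / h

def central_diff(f, x, h=1e-5):
    return (f(x + h / 2) - f(x - h / 2)) / h

def backward_diff(f, x, h=1e-5):
    return (f(x) - f(x - h)) / h

f = lambda x: x**3
print(forward_diff(f, 2.0), central_diff(f, 2.0), backward_diff(f, 2.0))
# All three are close to 12; the central difference is the most accurate.
```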
Example:
Dv f(x) = lim_{h→0} (f(x + hv) − f(x)) / h
We can approximate this numerically using 2 function calls to f ,
regardless of n.
By contrast, a numerical approximation to the standard gradient
vector takes n + 1 calls (or 2n if using central differences).
The directional derivative along v is the scalar product of the
gradient g and the vector v:
Dv f (x) = ∇f (x) · v
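To illustrate the call-count claim, a hedged sketch assuming a generic f : Rⁿ → R; the quadratic f(x) = xᵀx used below is an arbitrary example.

```python
import numpy as np

def directional_derivative(f, x, v, h=1e-5):
    # Central-difference estimate of D_v f(x): 2 calls to f, regardless of n.
    return (f(x + h * v) - f(x - h * v)) / (2 * h)

def gradient_fd(f, x, h=1e-5):
    # Central-difference estimate of the full gradient: 2n calls to f.
    g = np.zeros_like(x)
    for i in range(len(x)):
        e = np.zeros_like(x)
        e[i] = 1.0
        g[i] = (f(x + h * e) - f(x - h * e)) / (2 * h)
    return g

f = lambda x: x @ x                      # arbitrary test function, ∇f(x) = 2x
x = np.array([1.0, 2.0, 3.0])
v = np.array([0.0, 1.0, 0.0])            # direction of interest
print(directional_derivative(f, x, v))   # ≈ 4.0
print(gradient_fd(f, x) @ v)             # same value via D_v f(x) = ∇f(x) · v
```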
Du f(3, 2) = ∇f(3, 2) · u
           = (12e₁ + 9e₂) · (u₁e₁ + u₂e₂)
           = 12u₁ + 9u₂
Example (cont.)
The unit vector in the direction of vector (2, 1) is:
u = (2, 1)/√5 = (2/√5, 1/√5)
The directional derivative of f at (3, 2) in the direction of (2, 1) is:
Du f(3, 2) = 12·(2/√5) + 9·(1/√5) = 33/√5 ≈ 14.76
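A quick numerical check of this arithmetic, using only the gradient ∇f(3, 2) = (12, 9) and the direction (2, 1) given in the example:

```python
import numpy as np

grad = np.array([12.0, 9.0])              # ∇f(3, 2) from the example
u = np.array([2.0, 1.0]) / np.sqrt(5.0)   # unit vector in the direction (2, 1)
print(grad @ u)                           # ≈ 14.758
print(33 / np.sqrt(5.0))                  # 33/√5, the value derived above
```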
Questions:
At a point a, in which direction u is the directional derivative
Du f (a) maximal?
What is the directional derivative in that direction Du f (a) =?
The relationship between the gradient and the directional derivative:
Du f (a) = ∇f (a) · u
= ∥∇f (a)∥∥u∥ cos θ [θ is the angle between u and the gradient.]
= ∥∇f (a)∥ cos θ [u is a unit vector.]
The maximal value of Du f(a) occurs when u and ∇f(a) point in the same direction (i.e., θ = 0).
In that direction, cos θ = 1, so the directional derivative equals the norm of the gradient: Du f(a) = ∥∇f(a)∥.
g(α) = f(x + αd)
g′(α) = dᵀ∇f(x + αd)
g″(α) = dᵀ∇²f(x + αd)d
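A minimal numerical check of these identities; the quadratic f, the point x, and the direction d below are arbitrary illustrative choices.

```python
import numpy as np

A = np.array([[3.0, 1.0], [1.0, 2.0]])   # arbitrary symmetric matrix
f      = lambda x: 0.5 * x @ A @ x       # f(x) = 0.5 x^T A x
grad_f = lambda x: A @ x                 # ∇f(x) = A x
hess_f = lambda x: A                     # ∇²f(x) = A

x = np.array([1.0, -1.0])
d = np.array([0.5, 2.0])
g = lambda a: f(x + a * d)               # 1-d slice of f along direction d

a, eps = 0.3, 1e-4
g1 = (g(a + eps) - g(a - eps)) / (2 * eps)           # numerical g'(α)
g2 = (g(a + eps) - 2 * g(a) + g(a - eps)) / eps**2   # numerical g''(α)
print(g1, d @ grad_f(x + a * d))          # matches g'(α)  = dᵀ∇f(x + αd)
print(g2, d @ hess_f(x + a * d) @ d)      # matches g''(α) = dᵀ∇²f(x + αd)d
```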
Interpretation
Ax = λx
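As an illustration only (the matrix below is arbitrary), numpy can be used to verify the eigenvalue equation:

```python
import numpy as np

A = np.array([[2.0, 1.0], [1.0, 3.0]])   # arbitrary symmetric matrix
lam, V = np.linalg.eigh(A)               # eigenvalues and orthonormal eigenvectors
for i in range(len(lam)):
    x = V[:, i]
    print(np.allclose(A @ x, lam[i] * x))   # checks A x = λ x for each pair
```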
∇²f(1, 0) = ∇²f(−1, 0) = 4 [2 0; 0 −1], which is an indefinite matrix.
Hence (1, 0) and (−1, 0) are saddle points.
∇²f(0, 1) = ∇²f(0, −1) = 4 [0 0; 0 4], which is positive semidefinite.
The fact that the Hessian matrices of f at (0, 1) and (0, −1) are positive semidefinite is not enough to conclude that these are local minimum points; they might be saddle points.
However, in this case, since f(0, 1) = f(0, −1) = 0 and the function is lower bounded by zero, (0, 1) and (0, −1) are global minimum points.
Because there are two global minimum points, they are nonstrict global minima, but they are strict local minimum points, since each has a neighborhood in which it is the unique minimizer.
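A small sketch confirming the definiteness claims by looking at the eigenvalues of the two Hessians quoted above:

```python
import numpy as np

H_saddle = 4 * np.array([[2.0, 0.0], [0.0, -1.0]])   # ∇²f(±1, 0)
H_min    = 4 * np.array([[0.0, 0.0], [0.0,  4.0]])   # ∇²f(0, ±1)

print(np.linalg.eigvalsh(H_saddle))   # [-4.  8.]: mixed signs -> indefinite -> saddle
print(np.linalg.eigvalsh(H_min))      # [ 0. 16.]: nonnegative -> positive semidefinite
```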
The task of finding any point in the feasible set (regardless of its cost) is called a feasibility problem.
If we draw a line segment from x to x′, all points on the segment lie inside the set.
Theorem
Suppose f : Rn → R is twice differentiable over its domain. Then f is
convex iff H = ∇2 f (x) is positive semi-definite for all x ∈ dom(f ).
Furthermore, f is strictly convex if H is positive definite.
f(x) = xᵀAx
For this quadratic, ∇²f(x) = A + Aᵀ, so by the theorem f is convex iff A + Aᵀ is positive semidefinite, and strictly convex if A + Aᵀ is positive definite.
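As a sketch of the criterion on this quadratic (the matrix A below is an arbitrary illustration): since ∇²f(x) = A + Aᵀ, convexity can be read off its eigenvalues.

```python
import numpy as np

A = np.array([[2.0, 1.0], [0.0, 1.0]])   # arbitrary, not necessarily symmetric
H = A + A.T                              # Hessian of f(x) = x^T A x
eigs = np.linalg.eigvalsh(H)
print(eigs)                              # here: all eigenvalues > 0
print("convex" if np.all(eigs >= 0) else "not convex")
```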
θₜ₊₁ = θₜ + ρₜ dₜ
The sequence of step sizes {ρₜ} is called the learning rate schedule.
The simplest method is to use a constant step size, ρₜ = ρ.
However, if it is too large, the method may fail to converge; if it is too small, the method will converge, but very slowly.
Example:
L(θ) = 0.5(θ₁² − θ₂)² + 0.5(θ₁ − 1)²
Pick our descent direction dₜ = −gₜ. Consider ρₜ = 0.1 vs ρₜ = 0.6:
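A minimal sketch of this comparison; the starting point and the number of iterations below are arbitrary choices, not taken from the slides.

```python
import numpy as np

def L(t):
    return 0.5 * (t[0]**2 - t[1])**2 + 0.5 * (t[0] - 1)**2

def grad_L(t):
    return np.array([2 * t[0] * (t[0]**2 - t[1]) + (t[0] - 1),
                     -(t[0]**2 - t[1])])

for rho in (0.1, 0.6):                       # the two constant step sizes above
    theta = np.array([0.0, 0.0])             # arbitrary starting point
    for _ in range(50):
        theta = theta - rho * grad_L(theta)  # θ_{t+1} = θ_t − ρ g_t
    print(rho, theta, L(theta))
# ρ = 0.1 steadily (if slowly) reduces the loss toward the minimum at (1, 1);
# ρ = 0.6 oscillates and fails to converge.
```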
The optimal step size can be found as the value that maximally decreases the objective along the chosen direction, by solving the 1d minimization problem (exact line search):
ρₜ = argmin_{ρ>0} L(θₜ + ρ dₜ)
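A hedged sketch of one such line-search step on the example loss above, using scipy.optimize.minimize_scalar to solve the 1-d problem; the search interval (0, 2) is an arbitrary choice.

```python
import numpy as np
from scipy.optimize import minimize_scalar

def L(t):
    return 0.5 * (t[0]**2 - t[1])**2 + 0.5 * (t[0] - 1)**2

def grad_L(t):
    return np.array([2 * t[0] * (t[0]**2 - t[1]) + (t[0] - 1),
                     -(t[0]**2 - t[1])])

theta = np.array([0.0, 0.0])
d = -grad_L(theta)                            # steepest-descent direction
# 1-d problem: rho_t = argmin_{rho > 0} L(theta + rho * d)
res = minimize_scalar(lambda rho: L(theta + rho * d),
                      bounds=(0.0, 2.0), method="bounded")
print(res.x)                                  # chosen step size
print(L(theta), L(theta + res.x * d))         # loss before and after the step
```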