
Optimization

Ngoc-Hoang Luong

University of Information Technology (UIT)


Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for CS, Fall 2023



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html

Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 First-order methods





Introduction
The core problem in ML is parameter estimation (model fitting).
We need to solve an optimization problem: i.e., trying to find the values for a set of variables θ ∈ Θ that minimize a scalar-valued loss function or cost function L : Θ → R:

    θ* = argmin_{θ∈Θ} L(θ)    (1)

The parameter space is given by Θ ⊆ R^D, where D is the number of variables being optimized.
We focus on continuous optimization.
To maximize a score function or reward function R(θ), we can minimize L(θ) = −R(θ).
The term objective function refers to a function we want to maximize or minimize.
An algorithm that finds an optimum of an objective function is called a solver.

Local versus global optimization
A point that satisfies Equation 1 is called a global optimum. Finding such a point is called global optimization.
In general, finding global optima is computationally intractable. We will try to find a local optimum instead.
For continuous problems, a local optimum is a point θ* which has lower (or equal) cost than "nearby" points:

    ∃δ > 0, ∀θ ∈ Θ with ∥θ − θ*∥ < δ : L(θ*) ≤ L(θ)    (2)

[Figure: a 1-D function with a local minimum and the global minimum marked.]

Local versus global optimization
A local minimum could be surrounded by other local minima with the same objective value; this is known as a flat local minimum.
A point is said to be a strict local minimum if its cost is strictly lower than those of neighboring points:

    ∃δ > 0, ∀θ ∈ Θ, θ ≠ θ* with ∥θ − θ*∥ < δ : L(θ*) < L(θ)    (3)

We can define a (strict) local maximum analogously.

Derivatives
The topic of calculus concerns computing "rates of change" of functions as we vary their inputs.
Consider a scalar-argument function f : R → R. Its derivative at a point x is the quantity

    f′(x) ≜ lim_{h→0} [f(x + h) − f(x)] / h

assuming the limit exists.
This measures how quickly the output changes when we move a small distance in the input space away from x (i.e., the "rate of change").


Derivatives

f′(x) can be seen as the slope of the tangent line to f at the point x:

    f(x + ∆x) ≈ f(x) + f′(x)∆x

for small ∆x.


Derivatives

We can compute a finite difference approximation to the derivative by using a finite step size h:

    f′(x) = lim_{h→0} [f(x + h) − f(x)] / h            (forward difference)
          = lim_{h→0} [f(x + h/2) − f(x − h/2)] / h    (central difference)
          = lim_{h→0} [f(x) − f(x − h)] / h            (backward difference)

The smaller the step size h, the better the estimate.
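
A minimal NumPy sketch (an illustrative addition, not from the slides) comparing the three finite-difference formulas against the true derivative; the test function sin(x), the point x = 1, and the step size are arbitrary choices:

    import numpy as np

    def forward_diff(f, x, h=1e-5):
        return (f(x + h) - f(x)) / h              # forward difference

    def central_diff(f, x, h=1e-5):
        return (f(x + h / 2) - f(x - h / 2)) / h  # central difference

    def backward_diff(f, x, h=1e-5):
        return (f(x) - f(x - h)) / h              # backward difference

    x = 1.0
    print(forward_diff(np.sin, x), central_diff(np.sin, x), backward_diff(np.sin, x))
    print(np.cos(x))   # true derivative of sin at x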


Derivatives
We can think of differentiation as an operator that maps functions to functions, D(f) = f′.
f′(x) computes the derivative at x (assuming the derivative exists at that point).
The prime symbol f′ used to denote the derivative is Lagrange notation.
The second derivative function, which measures how quickly the gradient is changing, is denoted by f′′.
The n'th derivative function is denoted f^(n).
We can use Leibniz notation if we denote the function by y = f(x); its derivative is then written dy/dx or (d/dx) f(x).
To denote the evaluation of the derivative at a point a, we write df/dx |_{x=a}.


Gradients
We extend the notion of derivatives to handle vector-argument functions, f : R^n → R, by defining the partial derivative of f with respect to xi to be

    ∂f/∂xi = lim_{h→0} [f(x + h ei) − f(x)] / h

where ei is the i'th unit vector, ei = (0, . . . , 1, . . . , 0), with the i'th element equal to 1 and all the other elements equal to 0.
The gradient of f at a point x is the vector of its partial derivatives:

    g = ∂f/∂x = ∇f = (∂f/∂x1, . . . , ∂f/∂xn)ᵀ = (∂f/∂x1) e1 + . . . + (∂f/∂xn) en

To emphasize the point at which the gradient is evaluated, we write

    g(x*) ≜ ∂f/∂x |_{x*}
Gradients

Example:

    f(x1, x2) = x1² + x1 x2 + 3x2²

    ∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2)ᵀ = (2x1 + x2, x1 + 6x2)ᵀ

The nabla operator ∇ maps a function f : R^n → R to another function g : R^n → R^n.
Since g() is a vector-valued function, it is known as a vector field.
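
A small NumPy sketch (an illustrative addition, not from the slides) that checks this analytic gradient against central finite differences; the evaluation point is arbitrary:

    import numpy as np

    def f(x):
        # f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
        return x[0]**2 + x[0]*x[1] + 3*x[1]**2

    def grad_f(x):
        # analytic gradient: (2*x1 + x2, x1 + 6*x2)
        return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

    def numerical_grad(f, x, h=1e-6):
        # central differences along each unit vector e_i
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = 1.0
            g[i] = (f(x + h*e) - f(x - h*e)) / (2*h)
        return g

    x = np.array([1.0, 2.0])
    print(grad_f(x))             # [ 4. 13.]
    print(numerical_grad(f, x))  # approximately [ 4. 13.]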


Directional derivative

The directional derivative measures how much the function f : R^n → R changes along a direction v in space:

    Dv f(x) = lim_{h→0} [f(x + hv) − f(x)] / h

We can approximate this numerically using 2 function calls to f, regardless of n.
By contrast, a numerical approximation to the standard gradient vector takes n + 1 calls (or 2n if using central differences).
The directional derivative along v is the scalar product of the gradient g and the vector v:

    Dv f(x) = ∇f(x) · v


Directional derivative
Example: Let f(x, y) = x²y. Find the derivative of f in the direction (1,2) at the point (3,2).
The gradient ∇f(x, y) is:

    ∇f(x, y) = (∂f/∂x, ∂f/∂y)ᵀ = (2xy, x²)ᵀ

    ∇f(3, 2) = (12, 9)ᵀ = 12e1 + 9e2

Let u = u1 e1 + u2 e2 be a unit vector. The derivative of f in the direction of u at (3,2) is:

    Du f(3, 2) = ∇f(3, 2) · u
               = (12e1 + 9e2) · (u1 e1 + u2 e2)
               = 12u1 + 9u2


Directional derivative

Example (cont.)
The unit vector in the direction of the vector (1,2) is:

    u = (1, 2)/∥(1, 2)∥ = (1, 2)/√(1² + 2²) = (1, 2)/√5 = (1/√5, 2/√5)

The directional derivative at (3,2) in the direction of (1,2) is:

    Du f(3, 2) = 12u1 + 9u2 = 12/√5 + 18/√5 = 30/√5

We normalize the vector (1,2) so that the directional derivative is independent of its magnitude and depends only on its direction.
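
A short NumPy sketch (an illustrative addition, not from the slides) verifying this worked example both by the finite-difference definition and by the gradient dot product:

    import numpy as np

    def f(p):
        # f(x, y) = x^2 * y
        return p[0]**2 * p[1]

    a = np.array([3.0, 2.0])
    v = np.array([1.0, 2.0])
    u = v / np.linalg.norm(v)                    # unit vector in the direction (1, 2)

    grad_a = np.array([2*a[0]*a[1], a[0]**2])    # analytic gradient (2xy, x^2) at (3, 2)

    h = 1e-6
    finite_diff = (f(a + h*u) - f(a)) / h        # definition of D_u f(a)
    dot_product = grad_a @ u                     # ∇f(a) · u

    print(finite_diff, dot_product, 30/np.sqrt(5))   # all approximately 13.416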


Directional derivative

Example 2: Let f(x, y) = x²y. Find the derivative of f in the direction of (2,1) at the point (3,2).
The unit vector in the direction of (2,1) is:

    u = (2, 1)/√5 = (2/√5, 1/√5)

The directional derivative of f at (3,2) in the direction of (2,1) is:

    Du f(3, 2) = 12u1 + 9u2 = 24/√5 + 9/√5 = 33/√5


Directional derivative

Questions:
At a point a, in which direction u is the directional derivative Du f(a) maximal?
What is the directional derivative in that direction, Du f(a) = ?
The relationship between the gradient and the directional derivative:

    Du f(a) = ∇f(a) · u
            = ∥∇f(a)∥ ∥u∥ cos θ    [θ is the angle between u and the gradient.]
            = ∥∇f(a)∥ cos θ        [u is a unit vector.]

The maximal value of Du f(a) occurs when u and ∇f(a) point in the same direction (i.e., θ = 0).


Directional derivative

    Du f(a) = ∇f(a) · u
            = ∥∇f(a)∥ ∥u∥ cos θ    [θ is the angle between u and the gradient.]
            = ∥∇f(a)∥ cos θ        [u is a unit vector.]

When θ = 0, the directional derivative Du f(a) = ∥∇f(a)∥.
When θ = π, the directional derivative Du f(a) = −∥∇f(a)∥.
For what value of θ is Du f(a) = 0?


Jacobian

Consider a function that maps a vector to another vector, f : R^n → R^m. The Jacobian matrix of this function is an m × n matrix of partial derivatives:

    J_f(x) = ∂f/∂xᵀ ≜ [ ∂f1/∂x1  · · ·  ∂f1/∂xn ]   [ ∇f1(x)ᵀ ]
                      [    ⋮       ⋱       ⋮    ] = [    ⋮    ]
                      [ ∂fm/∂x1  · · ·  ∂fm/∂xn ]   [ ∇fm(x)ᵀ ]

We lay out the results in the same orientation as the output f. This is called the numerator layout of the Jacobian formulation.
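
A brief NumPy sketch (an illustrative addition, not from the slides) that builds a finite-difference Jacobian in numerator layout for a hypothetical map f : R² → R³:

    import numpy as np

    def f(x):
        # a hypothetical map from R^2 to R^3
        return np.array([x[0]*x[1], np.sin(x[0]), x[0] + 3*x[1]])

    def jacobian_fd(f, x, h=1e-6):
        # m x n matrix: row i holds the gradient of the i'th output (numerator layout)
        fx = f(x)
        J = np.zeros((len(fx), len(x)))
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = 1.0
            J[:, j] = (f(x + h*e) - fx) / h   # forward difference in input j
        return J

    x = np.array([1.0, 2.0])
    print(jacobian_fd(f, x))
    # analytic Jacobian: [[x2, x1], [cos(x1), 0], [1, 3]] = [[2, 1], [0.54, 0], [1, 3]]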


Hessian

For a function f : R^n → R that is twice differentiable, the Hessian matrix is the (symmetric) n × n matrix of second partial derivatives:

    H_f = ∂²f/∂x² = ∇²f = [ ∂²f/∂x1²     · · ·  ∂²f/∂x1∂xn ]
                          [     ⋮          ⋱        ⋮      ]
                          [ ∂²f/∂xn∂x1   · · ·  ∂²f/∂xn²   ]

The Hessian is the Jacobian of the gradient.


Hessian
Example: Find the Hessian of f(x, y) = x²y + y²x at the point (1,1).
First, compute the gradient (i.e., first-order partial derivatives):

    ∇f(x, y) = (∂f/∂x, ∂f/∂y)ᵀ = (2xy + y², x² + 2yx)ᵀ

Second, compute the Hessian (i.e., second-order partial derivatives):

    H_f(x, y) = [ ∂²f/∂x²    ∂²f/∂x∂y ] = [ 2y        2x + 2y ]
                [ ∂²f/∂y∂x   ∂²f/∂y²  ]   [ 2x + 2y   2x      ]

Finally, evaluate the Hessian matrix at the point (1,1):

    H_f(1, 1) = [ 2  4 ]
                [ 4  2 ]
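
A minimal NumPy sketch (an illustrative addition, not from the slides) that cross-checks this Hessian by taking central finite differences of the analytic gradient:

    import numpy as np

    def grad(p):
        # analytic gradient of f(x, y) = x^2*y + y^2*x
        x, y = p
        return np.array([2*x*y + y**2, x**2 + 2*y*x])

    def hessian_fd(grad, p, h=1e-5):
        # the Hessian is the Jacobian of the gradient; estimate it column by column
        n = len(p)
        H = np.zeros((n, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = 1.0
            H[:, j] = (grad(p + h*e) - grad(p - h*e)) / (2*h)
        return H

    p = np.array([1.0, 1.0])
    print(hessian_fd(grad, p))   # approximately [[2, 4], [4, 2]], matching the analytic Hessian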


Geometric meaning

If we follow the direction d from x, we can define a uni-dimensional


function g(α):

g(α) = f (x + αd)
g ′ (α) = dT ∇f (x + αd)
g ′′ (α) = dT ∇2 f (x + αd)d

Interpretation

g ′ (0) = dT ∇f (x) [directional derivative]


g ′′ (0) = dT ∇2 f (x)d [directional curvature]

If g ′′ (0) is non-negative with a certain d: f is convex in direction d.


If g ′′ (0) is non-negative for all d: ∇2 f (x) is positive semidefinite → f
is convex at x.
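
A short NumPy sketch (an illustrative addition, not from the slides) checking g′(0) = dᵀ∇f(x) and g″(0) = dᵀ∇²f(x)d against finite differences of g(α) = f(x + αd), reusing the earlier example f(x1, x2) = x1² + x1x2 + 3x2²; the point x and direction d are arbitrary:

    import numpy as np

    def f(p):
        return p[0]**2 + p[0]*p[1] + 3*p[1]**2

    x = np.array([1.0, 2.0])
    d = np.array([1.0, -1.0])                         # an arbitrary direction

    grad = np.array([2*x[0] + x[1], x[0] + 6*x[1]])   # ∇f(x)
    hess = np.array([[2.0, 1.0], [1.0, 6.0]])         # ∇²f(x) (constant for this f)

    def g(alpha):
        return f(x + alpha*d)

    h = 1e-4
    g1_fd = (g(h) - g(-h)) / (2*h)            # approximates g'(0), the directional derivative
    g2_fd = (g(h) - 2*g(0) + g(-h)) / h**2    # approximates g''(0), the directional curvature

    print(g1_fd, d @ grad)        # both approximately -9.0
    print(g2_fd, d @ hess @ d)    # both approximately 6.0
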
Definitions
We say that a symmetric n × n matrix A is:
positive semidefinite (A ⪰ 0) if xᵀAx ≥ 0 for all x,
positive definite (A ≻ 0) if xᵀAx > 0 for all x ≠ 0,
negative semidefinite (A ⪯ 0) if xᵀAx ≤ 0 for all x,
negative definite (A ≺ 0) if xᵀAx < 0 for all x ≠ 0,
indefinite if none of the above apply.
The expression xᵀAx is a function of x called the quadratic form associated to A. (It is made up of terms like xi² and xi xj.)
We make these definitions for a symmetric matrix A, i.e., Aᵀ = A.
Hessian matrices are symmetric.


Diagonal matrices

For a diagonal matrix

    D = [ d1  0   · · ·  0  ]
        [ 0   d2  · · ·  0  ]
        [ ⋮   ⋮     ⋱    ⋮  ]
        [ 0   0   · · ·  dn ]

the quadratic form is just

    xᵀDx = d1 x1² + d2 x2² + . . . + dn xn²

Diagonal matrices
If d1, . . . , dn are all nonnegative, then d1x1² + d2x2² + . . . + dnxn² must be nonnegative for any x, so D ⪰ 0: D is positive semidefinite.
If d1, . . . , dn are all positive, then d1x1² + d2x2² + . . . + dnxn² can only be 0 if x = 0, so D ≻ 0: D is positive definite.
If d1, . . . , dn ≤ 0, then D ⪯ 0, and if d1, . . . , dn < 0, then D ≺ 0.
D is indefinite if the signs of d1, . . . , dn are mixed.
Example: Consider the function f(x, y) = x² + 2y².
The gradient is ∇f(x, y) = (2x, 4y).
The Hessian matrix of f is:

    H_f(x, y) = [ 2  0 ]
                [ 0  4 ]

For an arbitrary x ∈ R², we have

    xᵀ [ 2  0 ] x = 2x1² + 4x2² > 0  for all x ≠ 0.
       [ 0  4 ]

So H_f(x, y) ≻ 0 for all (x, y) ∈ R²: H_f(x, y) is positive definite.

Positive definiteness and eigenvalues

For an n × n matrix A, if a nonzero vector x ∈ Rn satisfies

Ax = λx

for some scalar λ ∈ R, we call λ an eigenvalue of A and x its


associated eigenvector.
If A is an n × n symmetric matrix, then it can be factored as

    A = QᵀΛQ,  where  Λ = [ λ1  0   · · ·  0  ]
                          [ 0   λ2  · · ·  0  ]
                          [ ⋮   ⋮     ⋱    ⋮  ]
                          [ 0   0   · · ·  λn ]

λ1, . . . , λn are the eigenvalues of A, and the rows of Q (i.e., the columns of Qᵀ) are the corresponding eigenvectors.


Positive definiteness and eigenvalues
Applying this to the quadratic form xᵀAx, we get

    xᵀAx = xᵀQᵀΛQx = (Qx)ᵀΛ(Qx)

If we substitute y = Qx (converting to a different basis), the quadratic form becomes diagonal:

    xᵀAx = yᵀΛy = λ1y1² + λ2y2² + . . . + λnyn²

We can classify the matrix A by looking at the eigenvalues of A:
A ⪰ 0 if λ1, λ2, . . . , λn ≥ 0
A ≻ 0 if λ1, λ2, . . . , λn > 0
A ⪯ 0 if λ1, λ2, . . . , λn ≤ 0
A ≺ 0 if λ1, λ2, . . . , λn < 0
A is indefinite if it has both positive and negative eigenvalues.
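
A small NumPy sketch (an illustrative addition, not from the slides) that classifies a symmetric matrix by the signs of its eigenvalues; the tolerance is an arbitrary choice:

    import numpy as np

    def classify(A, tol=1e-10):
        # assumes A is symmetric; eigvalsh returns its (real) eigenvalues
        eig = np.linalg.eigvalsh(A)
        if np.all(eig > tol):
            return "positive definite"
        if np.all(eig >= -tol):
            return "positive semidefinite"
        if np.all(eig < -tol):
            return "negative definite"
        if np.all(eig <= tol):
            return "negative semidefinite"
        return "indefinite"

    print(classify(np.array([[2.0, 0.0], [0.0, 4.0]])))    # positive definite
    print(classify(np.array([[2.0, 0.0], [0.0, -1.0]])))   # indefinite
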
Optimality conditions for local vs global optima

For continuous, twice differentiable functions, we can characterize the


points which correspond to local optima.
Let g(θ) = ∇L(θ) be the gradient vector, and H(θ) = ∇2 L(θ) be
the Hessian matrix.
Consider a point θ ∗ ∈ RD , and let g ∗ = g(θ)|θ∗ be the gradient at
that point, and H ∗ = H(θ)|θ∗ be the corresponding Hessian.
Necessary conditions: If θ ∗ is a local minimum, then we must have
g ∗ = 0 (i.e., θ ∗ must be a stationary point), and H ∗ must be
positive semi-definite.
Sufficient conditions: If g∗ = 0 and H∗ is positive definite, then θ∗ is a local minimum.


Optimality conditions for local vs global optima
1 Necessary conditions: If θ ∗ is a local minimum, then we must have
g ∗ = 0 (i.e., θ ∗ must be a stationary point), and H ∗ must be
positive semi-definite.
Suppose we were at a point θ ∗ at which the gradient is non-zero.
At such a point, we could decrease the function by following the
negative gradient a small distance, so this would not be optimal.
So the gradient must be zero.
2 Sufficient conditions: If g∗ = 0 and H∗ is positive definite, then θ∗ is a local minimum.
Why is a zero gradient alone not sufficient?
The stationary point could be a local minimum, a local maximum, or a saddle point.


Global optimizers

We classify a stationary point of a function f : Rn → R as a global


minimizer if the Hessian matrix of f is positive semidefinite
everywhere,
and as a global maximizer if the Hessian matrix is negative
semidefinite everywhere.
If the Hessian matrix is positive definite or negative definite, the minimizer or maximizer (respectively) is strict.


Example
Let f(x1, x2) = (x1² + x2² − 1)² + (x2² − 1)².
The gradient is

    ∇f(x) = 4 [ (x1² + x2² − 1)x1                ]
              [ (x1² + x2² − 1)x2 + (x2² − 1)x2  ]

The stationary points are (0,0), (1,0), (-1,0), (0,1), (0,-1).
The Hessian is

    ∇²f(x) = 4 [ 3x1² + x2² − 1    2x1x2           ]
               [ 2x1x2             x1² + 6x2² − 2  ]

Since

    ∇²f(0, 0) = 4 [ −1   0 ] ≺ 0,
                  [  0  −2 ]

it follows that (0,0) is a strict local maximum point.
By the fact that f(x1, 0) = (x1² − 1)² + 1 → ∞ as x1 → ∞, the function is not bounded above, and thus (0,0) is not a global maximum point.


Example

    ∇²f(1, 0) = ∇²f(−1, 0) = 4 [ 2   0 ]
                               [ 0  −1 ]

which is an indefinite matrix. Hence (1,0) and (-1,0) are saddle points.

    ∇²f(0, 1) = ∇²f(0, −1) = 4 [ 0  0 ]
                               [ 0  4 ]

which is positive semidefinite.
The fact that the Hessian matrices of f at (0,1) and (0,-1) are positive semidefinite is not enough to conclude that these are local minimum points; they might be saddle points.
However, in this case, since f(0, 1) = f(0, −1) = 0 and the function is lower bounded by zero, (0,1) and (0,-1) are global minimum points.
Because there are two global minimum points, they are nonstrict global minima, but they are strict local minimum points, since each has a neighborhood in which it is the unique minimizer.
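
A NumPy sketch (an illustrative addition, not from the slides) that evaluates the Hessian of this example at each stationary point and reports its eigenvalues:

    import numpy as np

    def hessian(x1, x2):
        # Hessian of f(x1, x2) = (x1^2 + x2^2 - 1)^2 + (x2^2 - 1)^2
        return 4.0 * np.array([[3*x1**2 + x2**2 - 1, 2*x1*x2],
                               [2*x1*x2,             x1**2 + 6*x2**2 - 2]])

    for pt in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        print(pt, np.linalg.eigvalsh(hessian(*pt)))
    # (0, 0):  both eigenvalues negative -> strict local maximum
    # (+-1, 0): eigenvalues of mixed sign -> saddle points
    # (0, +-1): eigenvalues 0 and 16      -> inconclusive from the Hessian alone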




Constrained vs unconstrained optimization
In unconstrained optimization, we find any value in the parameter
space Θ that minimizes the loss.
We can also have a set of constraints C on the allowable values.
We partition the set of constraints C into:
Inequality constraints: gj (θ) ≤ 0 for j ∈ I.
Equality constraints: hk (θ) = 0 for k ∈ E.
The feasible set is the subset of the parameter space that satisfies the constraints:

    C = {θ : gj(θ) ≤ 0 for j ∈ I, hk(θ) = 0 for k ∈ E} ⊆ R^D

Our constrained optimization problem is

    θ̂ = argmin_{θ∈C} L(θ)

If C = R^D, it is called unconstrained optimization.

Constrained vs unconstrained optimization

Constraints can change the number of optima of a function.


A function that was unbounded (no well-defined global maximum or
minimum) can acquire multiple maxima or minima when we add
constraints.

The task of finding any point in the feasible set (regardless of its cost) is called the feasibility problem.


Convex sets
In convex optimization, the objective is a convex function defined
over a convex set.
In such problems, every local minimum is also a global minimum.
Many models are designed so that their training objectives are convex.
We say S is a convex set if, for any x, x′ ∈ S, we have

    λx + (1 − λ)x′ ∈ S,  ∀λ ∈ [0, 1]

That is, if we draw a line segment from x to x′, all points on the segment lie inside the set.


Convex functions

f is a convex function if its epigraph (the set of points above the


function) defines a convex set.



Convex functions
f (x) is called a convex function if it is defined on a convex set, and
if, for any x, y ∈ S, and for any 0 ≤ λ ≤ 1, we have:

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)

A function is strictly convex if the inequality is strict.


A function is concave if −f (x) is convex.
A function can be neither convex nor concave.
Some examples of 1d convex functions: x², e^{ax}, −log(x), x^a (a > 1, x > 0), |x|^a (a ≥ 1), x log x (x > 0).

Convex functions

Theorem
Suppose f : Rn → R is twice differentiable over its domain. Then f is
convex iff H = ∇2 f (x) is positive semi-definite for all x ∈ dom(f ).
Furthermore, f is strictly convex if H is positive definite.

For example, consider the quadratic form

f (x) = xT Ax

This is convex if A is positive semi-definite.


This is strictly convex if A is positive definite.
It is neither convex nor concave if A has eigenvalues of mixed sign.
Intuitively, a convex function is shaped like a bowl.



Convex functions

The quadratic form f(x) = xᵀAx in 2d (figure panels):
(a) A is positive definite, so f is strictly convex.
(b) A is negative definite, so f is strictly concave.
(c) A is positive semi-definite, but singular, so f is convex.
(d) A is indefinite, so f is neither convex nor concave.

First-order methods
We consider iterative optimization methods that leverage first-order derivatives of the objective function.
They compute which directions point "downhill", but ignore curvature information.
All these algorithms require the user to specify a starting point θ0.
At each iteration t, an update is performed:

    θ_{t+1} = θ_t + ρ_t d_t

where ρ_t is the step size or learning rate, and d_t is a descent direction, e.g., the negative of the gradient, given by g_t = ∇_θ L(θ)|_{θ_t}.
The update steps are continued until a stationary point is reached, where the gradient is zero.
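
A minimal gradient-descent sketch in NumPy (an illustrative addition, not from the slides); the quadratic objective, starting point, constant learning rate, and stopping rule are arbitrary choices:

    import numpy as np

    def loss(theta):
        # an illustrative convex quadratic: L(theta) = theta1^2 + 3*theta2^2
        return theta[0]**2 + 3*theta[1]**2

    def grad(theta):
        return np.array([2*theta[0], 6*theta[1]])

    theta = np.array([2.0, -1.0])   # starting point theta_0
    rho = 0.1                       # constant step size (learning rate)

    for t in range(100):
        g = grad(theta)
        if np.linalg.norm(g) < 1e-8:      # stop near a stationary point
            break
        theta = theta - rho * g           # theta_{t+1} = theta_t + rho_t * d_t with d_t = -g_t

    print(theta, loss(theta))   # approaches the minimizer (0, 0)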


Descent direction

A direction d is a descent direction if there is a small enough (but nonzero) amount ρ by which we can move in direction d and be guaranteed to decrease the function value.
That is, we require that there exists a ρmax > 0 such that

    L(θ + ρd) < L(θ)

for all 0 < ρ < ρmax.
The gradient at the current iterate,

    g_t ≜ ∇L(θ)|_{θ_t} = ∇L(θ_t) = g(θ_t),

points in the direction of maximal increase in L, so the negative gradient is a descent direction.


Descent direction

Any direction d is also a descent direction if it makes an angle of less than 90 degrees with −g_t, i.e., if

    dᵀg_t = ∥d∥ ∥g_t∥ cos(θ) < 0

where θ is the angle between d and g_t.
The best choice would be to pick d_t = −g_t.
This is the direction of steepest descent.


Step size (learning rate)

The sequence of step sizes {ρ_t} is called the learning rate schedule.
The simplest method is to use a constant step size, ρ_t = ρ.
However, if it is too large, the method may fail to converge. If it is too small, the method will converge, but very slowly.
Example:

    L(θ) = 0.5(θ1² − θ2)² + 0.5(θ1 − 1)²

Pick the descent direction d_t = −g_t. Consider ρ_t = 0.1 vs ρ_t = 0.6:
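
A NumPy sketch (an illustrative addition, not from the slides) running gradient descent on this example with the two constant step sizes; the starting point and iteration count are arbitrary choices:

    import numpy as np

    def loss(t):
        return 0.5*(t[0]**2 - t[1])**2 + 0.5*(t[0] - 1)**2

    def grad(t):
        # dL/dtheta1 = 2*theta1*(theta1^2 - theta2) + (theta1 - 1),  dL/dtheta2 = -(theta1^2 - theta2)
        return np.array([2*t[0]*(t[0]**2 - t[1]) + (t[0] - 1),
                         -(t[0]**2 - t[1])])

    for rho in (0.1, 0.6):
        theta = np.array([0.0, 0.0])
        for _ in range(100):
            theta = theta - rho * grad(theta)
        print(rho, theta, loss(theta))
    # rho = 0.1 converges slowly toward the minimum at (1, 1);
    # rho = 0.6 fails to converge (the iterates oscillate and can blow up).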


Line search

The optimal step size can be found by finding the value that maximally decreases the objective along the chosen direction, by solving the 1d minimization problem

    ρ_t = argmin_{ρ>0} ϕ_t(ρ) = argmin_{ρ>0} L(θ_t + ρd_t)

This is line search: we are searching along the line defined by d_t.
For fixed θ_t and d_t, the argument θ_t + ρd_t is an affine function of ρ, so ϕ_t(ρ) = L(θ_t + ρd_t) is the composition of the loss with an affine function.
If the loss is convex, this subproblem is also convex.


Line search
Example: consider the quadratic loss

    L(θ) = ½ θᵀAθ + bᵀθ + c

Computing the derivative of ϕ(ρ) = L(θ + ρd) gives

    dϕ(ρ)/dρ = d/dρ [ ½ (θ + ρd)ᵀA(θ + ρd) + bᵀ(θ + ρd) + c ]
             = dᵀA(θ + ρd) + dᵀb
             = dᵀ(Aθ + b) + ρ dᵀAd

Solving dϕ(ρ)/dρ = 0 gives

    ρ = − dᵀ(Aθ + b) / (dᵀAd)

This is exact line search. There are several methods, such as the Armijo backtracking method, that try to ensure a reduction in the objective function without spending too much time trying to solve this subproblem precisely.
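
A NumPy sketch (an illustrative addition, not from the slides) of one exact line-search step for a quadratic loss, using the closed-form ρ derived above; the matrix A, vector b, and starting point are arbitrary choices:

    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([-1.0, 0.5])
    c = 0.0

    def loss(theta):
        return 0.5 * theta @ A @ theta + b @ theta + c

    theta = np.array([2.0, -2.0])
    g = A @ theta + b                            # gradient of the quadratic at theta
    d = -g                                       # steepest-descent direction

    rho = -(d @ (A @ theta + b)) / (d @ A @ d)   # exact line-search step size

    print(loss(theta), loss(theta + rho * d))    # the loss decreases after the step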
