
Optimization

Ngoc-Hoang Luong

University of Information Technology (UIT)


Vietnam National University - Ho Chi Minh City (VNU-HCM)

Math for CS, Fall 2023



References

The contents of this document are taken mainly from the following source:
Kevin P. Murphy. Probabilistic Machine Learning: An Introduction. https://probml.github.io/pml-book/book1.html

Table of Contents

1 Introduction

2 Matrix calculus

3 Positive definite matrices

4 Optimality conditions

5 Constrained vs unconstrained optimization

6 Convex vs nonconvex optimization

7 First-order methods





Introduction
The core problem in ML is parameter estimation (model fitting).
We need to solve an optimization problem: i.e., trying to find the values for a set of variables θ ∈ Θ that minimize a scalar-valued loss function or cost function L : Θ → R:

    θ* = argmin_{θ∈Θ} L(θ)    (1)

The parameter space is given by Θ ⊆ R^D, where D is the number of variables being optimized.
We focus on continuous optimization.
To maximize a score function or reward function R(θ), we can minimize L(θ) = −R(θ).
The term objective function refers to a function we want to maximize or minimize.
An algorithm that finds an optimum of an objective function is called a solver.

Local versus global optimization
A point that satisfies Equation 1 is called a global optimum. Finding such a point is called global optimization.
In general, finding global optima is computationally intractable. We will try to find a local optimum instead.
For continuous problems, a local optimum is a point θ* which has lower (or equal) cost than "nearby" points:

    ∃δ > 0, ∀θ ∈ Θ with ∥θ − θ*∥ < δ : L(θ*) ≤ L(θ)    (2)

[Figure: a 1-D function with a local minimum and the global minimum marked.]

Local versus global optimization
A local minimum could be surrounded by other local minima with the same objective value; this is known as a flat local minimum.
A point is said to be a strict local minimum if its cost is strictly lower than those of neighboring points:

    ∃δ > 0, ∀θ ∈ Θ, θ ≠ θ* with ∥θ − θ*∥ < δ : L(θ*) < L(θ)    (3)

We can define a (strict) local maximum analogously.

Derivatives
The topic of calculus concerns computing "rates of change" of functions as we vary their inputs.
Consider a scalar-argument function f : R → R. Its derivative at a point x is the quantity

    f′(x) ≜ lim_{h→0} [f(x + h) − f(x)] / h

assuming the limit exists.
This measures how quickly the output changes when we move a small distance in the input space away from x (i.e., the "rate of change").


Derivatives

f′(x) can be seen as the slope of the tangent line to f at the point x:

    f(x + ∆x) ≈ f(x) + f′(x)∆x

for small ∆x.


Derivatives

We can compute a finite difference approximation to the derivative by using a finite step size h:

    f′(x) = lim_{h→0} [f(x + h) − f(x)] / h            (forward difference)
          = lim_{h→0} [f(x + h/2) − f(x − h/2)] / h    (central difference)
          = lim_{h→0} [f(x) − f(x − h)] / h            (backward difference)

The smaller the step size h, the better the estimate.
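
A minimal NumPy sketch (an illustrative addition, not from the slides) comparing the three finite-difference formulas against the true derivative; the test function sin(x), the point x = 1, and the step size are arbitrary choices:

    import numpy as np

    def forward_diff(f, x, h=1e-5):
        return (f(x + h) - f(x)) / h              # forward difference

    def central_diff(f, x, h=1e-5):
        return (f(x + h / 2) - f(x - h / 2)) / h  # central difference

    def backward_diff(f, x, h=1e-5):
        return (f(x) - f(x - h)) / h              # backward difference

    x = 1.0
    print(forward_diff(np.sin, x), central_diff(np.sin, x), backward_diff(np.sin, x))
    print(np.cos(x))   # true derivative of sin at x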


Derivatives
We can think of differentiation as an operator that maps functions to functions, D(f) = f′.
f′(x) computes the derivative at x (assuming the derivative exists at that point).
The prime symbol f′ used to denote the derivative is Lagrange notation.
The second derivative function, which measures how quickly the gradient is changing, is denoted by f′′.
The n'th derivative function is denoted f^(n).
We can use Leibniz notation if we denote the function by y = f(x); its derivative is then written dy/dx or (d/dx) f(x).
To denote the evaluation of the derivative at a point a, we write df/dx |_{x=a}.


Gradients
We extend the notion of derivatives to handle vector-argument functions, f : R^n → R, by defining the partial derivative of f with respect to xi to be

    ∂f/∂xi = lim_{h→0} [f(x + h ei) − f(x)] / h

where ei is the i'th unit vector, ei = (0, . . . , 1, . . . , 0), with the i'th element equal to 1 and all the other elements equal to 0.
The gradient of f at a point x is the vector of its partial derivatives:

    g = ∂f/∂x = ∇f = (∂f/∂x1, . . . , ∂f/∂xn)ᵀ = (∂f/∂x1) e1 + . . . + (∂f/∂xn) en

To emphasize the point at which the gradient is evaluated, we write

    g(x*) ≜ ∂f/∂x |_{x*}
Gradients

Example:

    f(x1, x2) = x1² + x1 x2 + 3x2²

    ∇f(x1, x2) = (∂f/∂x1, ∂f/∂x2)ᵀ = (2x1 + x2, x1 + 6x2)ᵀ

The nabla operator ∇ maps a function f : R^n → R to another function g : R^n → R^n.
Since g() is a vector-valued function, it is known as a vector field.
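
A small NumPy sketch (an illustrative addition, not from the slides) that checks this analytic gradient against central finite differences; the evaluation point is arbitrary:

    import numpy as np

    def f(x):
        # f(x1, x2) = x1^2 + x1*x2 + 3*x2^2
        return x[0]**2 + x[0]*x[1] + 3*x[1]**2

    def grad_f(x):
        # analytic gradient: (2*x1 + x2, x1 + 6*x2)
        return np.array([2*x[0] + x[1], x[0] + 6*x[1]])

    def numerical_grad(f, x, h=1e-6):
        # central differences along each unit vector e_i
        g = np.zeros_like(x)
        for i in range(len(x)):
            e = np.zeros_like(x)
            e[i] = 1.0
            g[i] = (f(x + h*e) - f(x - h*e)) / (2*h)
        return g

    x = np.array([1.0, 2.0])
    print(grad_f(x))             # [ 4. 13.]
    print(numerical_grad(f, x))  # approximately [ 4. 13.]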


Directional derivative

The directional derivative measures how much the function f : R^n → R changes along a direction v in space:

    Dv f(x) = lim_{h→0} [f(x + hv) − f(x)] / h

We can approximate this numerically using 2 function calls to f, regardless of n.
By contrast, a numerical approximation to the standard gradient vector takes n + 1 calls (or 2n if using central differences).
The directional derivative along v is the scalar product of the gradient g and the vector v:

    Dv f(x) = ∇f(x) · v


Directional derivative
Example: Let f(x, y) = x²y. Find the derivative of f in the direction (1,2) at the point (3,2).
The gradient ∇f(x, y) is:

    ∇f(x, y) = (∂f/∂x, ∂f/∂y)ᵀ = (2xy, x²)ᵀ

    ∇f(3, 2) = (12, 9)ᵀ = 12e1 + 9e2

Let u = u1 e1 + u2 e2 be a unit vector. The derivative of f in the direction of u at (3,2) is:

    Du f(3, 2) = ∇f(3, 2) · u
               = (12e1 + 9e2) · (u1 e1 + u2 e2)
               = 12u1 + 9u2


Directional derivative

Example (cont.)
The unit vector in the direction of the vector (1,2) is:

    u = (1, 2)/∥(1, 2)∥ = (1, 2)/√(1² + 2²) = (1, 2)/√5 = (1/√5, 2/√5)

The directional derivative at (3,2) in the direction of (1,2) is:

    Du f(3, 2) = 12u1 + 9u2 = 12/√5 + 18/√5 = 30/√5

We normalize the vector (1,2) so that the directional derivative is independent of its magnitude and depends only on its direction.
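
A short NumPy sketch (an illustrative addition, not from the slides) verifying this worked example both by the finite-difference definition and by the gradient dot product:

    import numpy as np

    def f(p):
        # f(x, y) = x^2 * y
        return p[0]**2 * p[1]

    a = np.array([3.0, 2.0])
    v = np.array([1.0, 2.0])
    u = v / np.linalg.norm(v)                    # unit vector in the direction (1, 2)

    grad_a = np.array([2*a[0]*a[1], a[0]**2])    # analytic gradient (2xy, x^2) at (3, 2)

    h = 1e-6
    finite_diff = (f(a + h*u) - f(a)) / h        # definition of D_u f(a)
    dot_product = grad_a @ u                     # ∇f(a) · u

    print(finite_diff, dot_product, 30/np.sqrt(5))   # all approximately 13.416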


Directional derivative

Example 2: Let f(x, y) = x²y. Find the derivative of f in the direction of (2,1) at the point (3,2).
The unit vector in the direction of (2,1) is:

    u = (2, 1)/√5 = (2/√5, 1/√5)

The directional derivative of f at (3,2) in the direction of (2,1) is:

    Du f(3, 2) = 12u1 + 9u2 = 24/√5 + 9/√5 = 33/√5


Directional derivative

Questions:
At a point a, in which direction u is the directional derivative Du f(a) maximal?
What is the directional derivative in that direction, Du f(a) = ?
The relationship between the gradient and the directional derivative:

    Du f(a) = ∇f(a) · u
            = ∥∇f(a)∥ ∥u∥ cos θ    [θ is the angle between u and the gradient.]
            = ∥∇f(a)∥ cos θ        [u is a unit vector.]

The maximal value of Du f(a) occurs when u and ∇f(a) point in the same direction (i.e., θ = 0).


Directional derivative

    Du f(a) = ∇f(a) · u
            = ∥∇f(a)∥ ∥u∥ cos θ    [θ is the angle between u and the gradient.]
            = ∥∇f(a)∥ cos θ        [u is a unit vector.]

When θ = 0, the directional derivative Du f(a) = ∥∇f(a)∥.
When θ = π, the directional derivative Du f(a) = −∥∇f(a)∥.
For what value of θ is Du f(a) = 0?


Jacobian

Consider a function that maps a vector to another vector, f : R^n → R^m. The Jacobian matrix of this function is an m × n matrix of partial derivatives:

    J_f(x) = ∂f/∂xᵀ ≜ [ ∂f1/∂x1  · · ·  ∂f1/∂xn ]   [ ∇f1(x)ᵀ ]
                      [    ⋮       ⋱       ⋮    ] = [    ⋮    ]
                      [ ∂fm/∂x1  · · ·  ∂fm/∂xn ]   [ ∇fm(x)ᵀ ]

We lay out the results in the same orientation as the output f. This is called the numerator layout of the Jacobian formulation.
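
A brief NumPy sketch (an illustrative addition, not from the slides) that builds a finite-difference Jacobian in numerator layout for a hypothetical map f : R² → R³:

    import numpy as np

    def f(x):
        # a hypothetical map from R^2 to R^3
        return np.array([x[0]*x[1], np.sin(x[0]), x[0] + 3*x[1]])

    def jacobian_fd(f, x, h=1e-6):
        # m x n matrix: row i holds the gradient of the i'th output (numerator layout)
        fx = f(x)
        J = np.zeros((len(fx), len(x)))
        for j in range(len(x)):
            e = np.zeros_like(x)
            e[j] = 1.0
            J[:, j] = (f(x + h*e) - fx) / h   # forward difference in input j
        return J

    x = np.array([1.0, 2.0])
    print(jacobian_fd(f, x))
    # analytic Jacobian: [[x2, x1], [cos(x1), 0], [1, 3]] = [[2, 1], [0.54, 0], [1, 3]]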


Hessian

For a function f : R^n → R that is twice differentiable, the Hessian matrix is the (symmetric) n × n matrix of second partial derivatives:

    H_f = ∂²f/∂x² = ∇²f = [ ∂²f/∂x1²     · · ·  ∂²f/∂x1∂xn ]
                          [     ⋮          ⋱        ⋮      ]
                          [ ∂²f/∂xn∂x1   · · ·  ∂²f/∂xn²   ]

The Hessian is the Jacobian of the gradient.


Hessian
Example: Find the Hessian of f(x, y) = x²y + y²x at the point (1,1).
First, compute the gradient (i.e., first-order partial derivatives):

    ∇f(x, y) = (∂f/∂x, ∂f/∂y)ᵀ = (2xy + y², x² + 2yx)ᵀ

Second, compute the Hessian (i.e., second-order partial derivatives):

    H_f(x, y) = [ ∂²f/∂x²    ∂²f/∂x∂y ] = [ 2y        2x + 2y ]
                [ ∂²f/∂y∂x   ∂²f/∂y²  ]   [ 2x + 2y   2x      ]

Finally, evaluate the Hessian matrix at the point (1,1):

    H_f(1, 1) = [ 2  4 ]
                [ 4  2 ]
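
A minimal NumPy sketch (an illustrative addition, not from the slides) that cross-checks this Hessian by taking central finite differences of the analytic gradient:

    import numpy as np

    def grad(p):
        # analytic gradient of f(x, y) = x^2*y + y^2*x
        x, y = p
        return np.array([2*x*y + y**2, x**2 + 2*y*x])

    def hessian_fd(grad, p, h=1e-5):
        # the Hessian is the Jacobian of the gradient; estimate it column by column
        n = len(p)
        H = np.zeros((n, n))
        for j in range(n):
            e = np.zeros(n)
            e[j] = 1.0
            H[:, j] = (grad(p + h*e) - grad(p - h*e)) / (2*h)
        return H

    p = np.array([1.0, 1.0])
    print(hessian_fd(grad, p))   # approximately [[2, 4], [4, 2]], matching the analytic Hessian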


Geometric meaning

If we follow the direction d from x, we can define a uni-dimensional


function g(α):

g(α) = f (x + αd)
g ′ (α) = dT ∇f (x + αd)
g ′′ (α) = dT ∇2 f (x + αd)d

Interpretation

g ′ (0) = dT ∇f (x) [directional derivative]


g ′′ (0) = dT ∇2 f (x)d [directional curvature]

If g ′′ (0) is non-negative with a certain d: f is convex in direction d.


If g ′′ (0) is non-negative for all d: ∇2 f (x) is positive semidefinite → f
is convex at x.
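
A short NumPy sketch (an illustrative addition, not from the slides) checking g′(0) = dᵀ∇f(x) and g″(0) = dᵀ∇²f(x)d against finite differences of g(α) = f(x + αd), reusing the earlier example f(x1, x2) = x1² + x1x2 + 3x2²; the point x and direction d are arbitrary:

    import numpy as np

    def f(p):
        return p[0]**2 + p[0]*p[1] + 3*p[1]**2

    x = np.array([1.0, 2.0])
    d = np.array([1.0, -1.0])                         # an arbitrary direction

    grad = np.array([2*x[0] + x[1], x[0] + 6*x[1]])   # ∇f(x)
    hess = np.array([[2.0, 1.0], [1.0, 6.0]])         # ∇²f(x) (constant for this f)

    def g(alpha):
        return f(x + alpha*d)

    h = 1e-4
    g1_fd = (g(h) - g(-h)) / (2*h)            # approximates g'(0), the directional derivative
    g2_fd = (g(h) - 2*g(0) + g(-h)) / h**2    # approximates g''(0), the directional curvature

    print(g1_fd, d @ grad)        # both approximately -9.0
    print(g2_fd, d @ hess @ d)    # both approximately 6.0
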
Definitions
We say that a symmetric n × n matrix A is:
positive semidefinite (A ⪰ 0) if xᵀAx ≥ 0 for all x,
positive definite (A ≻ 0) if xᵀAx > 0 for all x ≠ 0,
negative semidefinite (A ⪯ 0) if xᵀAx ≤ 0 for all x,
negative definite (A ≺ 0) if xᵀAx < 0 for all x ≠ 0,
indefinite if none of the above apply.
The expression xᵀAx is a function of x called the quadratic form associated to A. (It is made up of terms like xi² and xi xj.)
We make these definitions for a symmetric matrix A, i.e., Aᵀ = A.
Hessian matrices are symmetric.


Diagonal matrices

For a diagonal matrix

    D = [ d1  0   · · ·  0  ]
        [ 0   d2  · · ·  0  ]
        [ ⋮   ⋮     ⋱    ⋮  ]
        [ 0   0   · · ·  dn ]

the quadratic form is just

    xᵀDx = d1 x1² + d2 x2² + . . . + dn xn²

Diagonal matrices
If d1, . . . , dn are all nonnegative, then d1x1² + d2x2² + . . . + dnxn² must be nonnegative for any x, so D ⪰ 0: D is positive semidefinite.
If d1, . . . , dn are all positive, then d1x1² + d2x2² + . . . + dnxn² can only be 0 if x = 0, so D ≻ 0: D is positive definite.
If d1, . . . , dn ≤ 0, then D ⪯ 0, and if d1, . . . , dn < 0, then D ≺ 0.
D is indefinite if the signs of d1, . . . , dn are mixed.
Example: Consider the function f(x, y) = x² + 2y².
The gradient is ∇f(x, y) = (2x, 4y).
The Hessian matrix of f is:

    H_f(x, y) = [ 2  0 ]
                [ 0  4 ]

For an arbitrary x ∈ R², we have

    xᵀ [ 2  0 ] x = 2x1² + 4x2² > 0  for all x ≠ 0.
       [ 0  4 ]

So H_f(x, y) ≻ 0 for all (x, y) ∈ R²: H_f(x, y) is positive definite.

Positive definiteness and eigenvalues

For an n × n matrix A, if a nonzero vector x ∈ Rn satisfies

Ax = λx

for some scalar λ ∈ R, we call λ an eigenvalue of A and x its


associated eigenvector.
If A is an n × n symmetric matrix, then it can be factored as

    A = QᵀΛQ,  where  Λ = [ λ1  0   · · ·  0  ]
                          [ 0   λ2  · · ·  0  ]
                          [ ⋮   ⋮     ⋱    ⋮  ]
                          [ 0   0   · · ·  λn ]

λ1, . . . , λn are the eigenvalues of A, and the rows of Q (i.e., the columns of Qᵀ) are the corresponding eigenvectors.


Positive definiteness and eigenvalues
Applying this to the quadratic form xᵀAx, we get

    xᵀAx = xᵀQᵀΛQx = (Qx)ᵀΛ(Qx)

If we substitute y = Qx (converting to a different basis), the quadratic form becomes diagonal:

    xᵀAx = yᵀΛy = λ1y1² + λ2y2² + . . . + λnyn²

We can classify the matrix A by looking at the eigenvalues of A:
A ⪰ 0 if λ1, λ2, . . . , λn ≥ 0
A ≻ 0 if λ1, λ2, . . . , λn > 0
A ⪯ 0 if λ1, λ2, . . . , λn ≤ 0
A ≺ 0 if λ1, λ2, . . . , λn < 0
A is indefinite if it has both positive and negative eigenvalues.
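
A small NumPy sketch (an illustrative addition, not from the slides) that classifies a symmetric matrix by the signs of its eigenvalues; the tolerance is an arbitrary choice:

    import numpy as np

    def classify(A, tol=1e-10):
        # assumes A is symmetric; eigvalsh returns its (real) eigenvalues
        eig = np.linalg.eigvalsh(A)
        if np.all(eig > tol):
            return "positive definite"
        if np.all(eig >= -tol):
            return "positive semidefinite"
        if np.all(eig < -tol):
            return "negative definite"
        if np.all(eig <= tol):
            return "negative semidefinite"
        return "indefinite"

    print(classify(np.array([[2.0, 0.0], [0.0, 4.0]])))    # positive definite
    print(classify(np.array([[2.0, 0.0], [0.0, -1.0]])))   # indefinite
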
Optimality conditions for local vs global optima

For continuous, twice differentiable functions, we can characterize the


points which correspond to local optima.
Let g(θ) = ∇L(θ) be the gradient vector, and H(θ) = ∇2 L(θ) be
the Hessian matrix.
Consider a point θ ∗ ∈ RD , and let g ∗ = g(θ)|θ∗ be the gradient at
that point, and H ∗ = H(θ)|θ∗ be the corresponding Hessian.
Necessary conditions: If θ ∗ is a local minimum, then we must have
g ∗ = 0 (i.e., θ ∗ must be a stationary point), and H ∗ must be
positive semi-definite.
Sufficient conditions: If g∗ = 0 and H∗ is positive definite, then θ∗ is a local minimum.


Optimality conditions for local vs global optima
1 Necessary conditions: If θ ∗ is a local minimum, then we must have
g ∗ = 0 (i.e., θ ∗ must be a stationary point), and H ∗ must be
positive semi-definite.
Suppose we were at a point θ ∗ at which the gradient is non-zero.
At such a point, we could decrease the function by following the
negative gradient a small distance, so this would not be optimal.
So the gradient must be zero.
2 Sufficient conditions: If g∗ = 0 and H∗ is positive definite, then θ∗ is a local minimum.
Why is a zero gradient alone not sufficient?
The stationary point could be a local minimum, a local maximum, or a saddle point.


Global optimizers

We classify a stationary point of a function f : Rn → R as a global


minimizer if the Hessian matrix of f is positive semidefinite
everywhere,
and as a global maximizer if the Hessian matrix is negative
semidefinite everywhere.
If the Hessian matrix is positive definite or negative definite, the minimizer or maximizer (respectively) is strict.


Example
Let f(x1, x2) = (x1² + x2² − 1)² + (x2² − 1)².
The gradient is

    ∇f(x) = 4 [ (x1² + x2² − 1)x1                ]
              [ (x1² + x2² − 1)x2 + (x2² − 1)x2  ]

The stationary points are (0,0), (1,0), (-1,0), (0,1), (0,-1).
The Hessian is

    ∇²f(x) = 4 [ 3x1² + x2² − 1    2x1x2           ]
               [ 2x1x2             x1² + 6x2² − 2  ]

Since

    ∇²f(0, 0) = 4 [ −1   0 ] ≺ 0,
                  [  0  −2 ]

it follows that (0,0) is a strict local maximum point.
By the fact that f(x1, 0) = (x1² − 1)² + 1 → ∞ as x1 → ∞, the function is not bounded above, and thus (0,0) is not a global maximum point.


Example

    ∇²f(1, 0) = ∇²f(−1, 0) = 4 [ 2   0 ]
                               [ 0  −1 ]

which is an indefinite matrix. Hence (1,0) and (-1,0) are saddle points.

    ∇²f(0, 1) = ∇²f(0, −1) = 4 [ 0  0 ]
                               [ 0  4 ]

which is positive semidefinite.
The fact that the Hessian matrices of f at (0,1) and (0,-1) are positive semidefinite is not enough to conclude that these are local minimum points; they might be saddle points.
However, in this case, since f(0, 1) = f(0, −1) = 0 and the function is lower bounded by zero, (0,1) and (0,-1) are global minimum points.
Because there are two global minimum points, they are nonstrict global minima, but they are strict local minimum points, since each has a neighborhood in which it is the unique minimizer.
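
A NumPy sketch (an illustrative addition, not from the slides) that evaluates the Hessian of this example at each stationary point and reports its eigenvalues:

    import numpy as np

    def hessian(x1, x2):
        # Hessian of f(x1, x2) = (x1^2 + x2^2 - 1)^2 + (x2^2 - 1)^2
        return 4.0 * np.array([[3*x1**2 + x2**2 - 1, 2*x1*x2],
                               [2*x1*x2,             x1**2 + 6*x2**2 - 2]])

    for pt in [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1)]:
        print(pt, np.linalg.eigvalsh(hessian(*pt)))
    # (0, 0):  both eigenvalues negative -> strict local maximum
    # (+-1, 0): eigenvalues of mixed sign -> saddle points
    # (0, +-1): eigenvalues 0 and 16      -> inconclusive from the Hessian alone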




Constrained vs unconstrained optimization
In unconstrained optimization, we find any value in the parameter
space Θ that minimizes the loss.
We can also have a set of constraints C on the allowable values.
We partition the set of constraints C into:
Inequality constraints: gj (θ) ≤ 0 for j ∈ I.
Equality constraints: hk (θ) = 0 for k ∈ E.
The feasible set is the subset of the parameter space that satisfies the constraints:

    C = {θ : gj(θ) ≤ 0 for j ∈ I, hk(θ) = 0 for k ∈ E} ⊆ R^D

Our constrained optimization problem is

    θ̂ = argmin_{θ∈C} L(θ)

If C = R^D, it is called unconstrained optimization.

Constrained vs unconstrained optimization

Constraints can change the number of optima of a function.


A function that was unbounded (no well-defined global maximum or
minimum) can acquire multiple maxima or minima when we add
constraints.

The task of finding any point in the feasible set (regardless of its cost) is called the feasibility problem.


Convex sets
In convex optimization, the objective is a convex function defined
over a convex set.
In such problems, every local minimum is also a global minimum.
Many models are designed so that their training objectives are convex.
We say S is a convex set if, for any x, x′ ∈ S, we have

    λx + (1 − λ)x′ ∈ S,  ∀λ ∈ [0, 1]

That is, if we draw a line segment from x to x′, all points on the segment lie inside the set.


Convex functions

f is a convex function if its epigraph (the set of points above the


function) defines a convex set.



Convex functions
f (x) is called a convex function if it is defined on a convex set, and
if, for any x, y ∈ S, and for any 0 ≤ λ ≤ 1, we have:

f (λx + (1 − λ)y) ≤ λf (x) + (1 − λ)f (y)

A function is strictly convex if the inequality is strict.


A function is concave if −f (x) is convex.
A function can be neither convex nor concave.
Some examples of 1d convex functions: x², e^{ax}, −log(x), x^a (a > 1, x > 0), |x|^a (a ≥ 1), x log x (x > 0).

Convex functions

Theorem
Suppose f : Rn → R is twice differentiable over its domain. Then f is
convex iff H = ∇2 f (x) is positive semi-definite for all x ∈ dom(f ).
Furthermore, f is strictly convex if H is positive definite.

For example, consider the quadratic form

f (x) = xT Ax

This is convex if A is positive semi-definite.


This is strictly convex if A is positive definite.
It is neither convex nor concave if A has eigenvalues of mixed sign.
Intuitively, a convex function is shaped like a bowl.



Convex functions

The quadratic form f(x) = xᵀAx in 2d (figure panels):
(a) A is positive definite, so f is strictly convex.
(b) A is negative definite, so f is strictly concave.
(c) A is positive semi-definite, but singular, so f is convex.
(d) A is indefinite, so f is neither convex nor concave.

First-order methods
We consider iterative optimization methods that leverage first-order derivatives of the objective function.
They compute which directions point "downhill", but ignore curvature information.
All these algorithms require the user to specify a starting point θ0.
At each iteration t, an update is performed:

    θ_{t+1} = θ_t + ρ_t d_t

where ρ_t is the step size or learning rate, and d_t is a descent direction, e.g., the negative of the gradient, given by g_t = ∇_θ L(θ)|_{θ_t}.
The update steps are continued until a stationary point is reached, where the gradient is zero.
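
A minimal gradient-descent sketch in NumPy (an illustrative addition, not from the slides); the quadratic objective, starting point, constant learning rate, and stopping rule are arbitrary choices:

    import numpy as np

    def loss(theta):
        # an illustrative convex quadratic: L(theta) = theta1^2 + 3*theta2^2
        return theta[0]**2 + 3*theta[1]**2

    def grad(theta):
        return np.array([2*theta[0], 6*theta[1]])

    theta = np.array([2.0, -1.0])   # starting point theta_0
    rho = 0.1                       # constant step size (learning rate)

    for t in range(100):
        g = grad(theta)
        if np.linalg.norm(g) < 1e-8:      # stop near a stationary point
            break
        theta = theta - rho * g           # theta_{t+1} = theta_t + rho_t * d_t with d_t = -g_t

    print(theta, loss(theta))   # approaches the minimizer (0, 0)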


Descent direction

A direction d is a descent direction if there is a small enough (but nonzero) amount ρ by which we can move in direction d and be guaranteed to decrease the function value.
That is, we require that there exists a ρmax > 0 such that

    L(θ + ρd) < L(θ)

for all 0 < ρ < ρmax.
The gradient at the current iterate,

    g_t ≜ ∇L(θ)|_{θ_t} = ∇L(θ_t) = g(θ_t),

points in the direction of maximal increase in L, so the negative gradient is a descent direction.


Descent direction

Any direction d is also a descent direction if it makes an angle of less than 90 degrees with −g_t, i.e., if

    dᵀg_t = ∥d∥ ∥g_t∥ cos(θ) < 0

where θ is the angle between d and g_t.
The best choice would be to pick d_t = −g_t.
This is the direction of steepest descent.


Step size (learning rate)

The sequence of step sizes {ρ_t} is called the learning rate schedule.
The simplest method is to use a constant step size, ρ_t = ρ.
However, if it is too large, the method may fail to converge. If it is too small, the method will converge, but very slowly.
Example:

    L(θ) = 0.5(θ1² − θ2)² + 0.5(θ1 − 1)²

Pick the descent direction d_t = −g_t. Consider ρ_t = 0.1 vs ρ_t = 0.6:
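
A NumPy sketch (an illustrative addition, not from the slides) running gradient descent on this example with the two constant step sizes; the starting point and iteration count are arbitrary choices:

    import numpy as np

    def loss(t):
        return 0.5*(t[0]**2 - t[1])**2 + 0.5*(t[0] - 1)**2

    def grad(t):
        # dL/dtheta1 = 2*theta1*(theta1^2 - theta2) + (theta1 - 1),  dL/dtheta2 = -(theta1^2 - theta2)
        return np.array([2*t[0]*(t[0]**2 - t[1]) + (t[0] - 1),
                         -(t[0]**2 - t[1])])

    for rho in (0.1, 0.6):
        theta = np.array([0.0, 0.0])
        for _ in range(100):
            theta = theta - rho * grad(theta)
        print(rho, theta, loss(theta))
    # rho = 0.1 converges slowly toward the minimum at (1, 1);
    # rho = 0.6 fails to converge (the iterates oscillate and can blow up).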


Line search

The optimal step size can be found by finding the value that maximally decreases the objective along the chosen direction, by solving the 1d minimization problem

    ρ_t = argmin_{ρ>0} ϕ_t(ρ) = argmin_{ρ>0} L(θ_t + ρd_t)

This is line search: we are searching along the line defined by d_t.
For fixed θ_t and d_t, the argument θ_t + ρd_t is an affine function of ρ, so ϕ_t(ρ) = L(θ_t + ρd_t) is the composition of the loss with an affine function.
If the loss is convex, this subproblem is also convex.


Line search
Example: consider the quadratic loss

    L(θ) = ½ θᵀAθ + bᵀθ + c

Computing the derivative of ϕ(ρ) = L(θ + ρd) gives

    dϕ(ρ)/dρ = d/dρ [ ½ (θ + ρd)ᵀA(θ + ρd) + bᵀ(θ + ρd) + c ]
             = dᵀA(θ + ρd) + dᵀb
             = dᵀ(Aθ + b) + ρ dᵀAd

Solving dϕ(ρ)/dρ = 0 gives

    ρ = − dᵀ(Aθ + b) / (dᵀAd)

This is exact line search. There are several methods, such as the Armijo backtracking method, that try to ensure a reduction in the objective function without spending too much time trying to solve this subproblem precisely.
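
A NumPy sketch (an illustrative addition, not from the slides) of one exact line-search step for a quadratic loss, using the closed-form ρ derived above; the matrix A, vector b, and starting point are arbitrary choices:

    import numpy as np

    A = np.array([[3.0, 1.0], [1.0, 2.0]])   # symmetric positive definite
    b = np.array([-1.0, 0.5])
    c = 0.0

    def loss(theta):
        return 0.5 * theta @ A @ theta + b @ theta + c

    theta = np.array([2.0, -2.0])
    g = A @ theta + b                            # gradient of the quadratic at theta
    d = -g                                       # steepest-descent direction

    rho = -(d @ (A @ theta + b)) / (d @ A @ d)   # exact line-search step size

    print(loss(theta), loss(theta + rho * d))    # the loss decreases after the step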
