ELE704 - Lecture Notes - I - 03-04-2024
Overview
Convex Sets and Functions
Unconstrained Minimization
Descent Methods
Cenk Toker
Hacettepe University
2023-24 Spring
Textbooks
Course Content
Notation
The symbols Z++ and R++ denote the sets of positive (not including
zero) integer and real numbers, respectively. Similarly, Z+ and R+ denote
the sets of nonnegative integer and real numbers, respectively.
The symbols ∀,∃, : , ∧, ∨, =⇒, ⇐⇒, ⊂ and ∈ denote the terms “for all”,
“there exists”, “such that”, “and”, “or”, “if . . . then”, “if and only if (iff)”,
“subset of” and “element of”, respectively.
where i = 1, 2, · · · , N .
n−η = [ n1 − K1 n2 − K2 · · · nN − KN ]T
n<η ≡ n1 < K1 , n2 < K2 , · · · , nN < KN
where n = [ n1 n2 · · · nN ]T and η = [ K1 K2 · · · KN ]T .
where i = 1, 2, · · · , M and j = 1, 2, · · · , N .
The quantity AT denotes the transpose of A.
The quantities A−1 and A−T denote the inverse and the
inverse transpose of A, respectively.
An N -length column vector a is essentially an N ×1 matrix
and similarly the row vector aT is also a 1×N matrix.
(AB)T = BT AT
The notation
diag [ d1 d2 · · · dN ]
stands for an N × N diagonal matrix D with the main
diagonal elements Di,i = di , i.e.
D = [ d1   0    · · ·   0      0  ]
    [ 0    d2   · · ·   0      0  ]
    [ ⋮          ⋱              ⋮  ]
    [ 0    · · ·  0     dN−1   0  ]
    [ 0    0    · · ·   0      dN ]
where d = [ d1 d2 · · · dN ]T .
AA−1 = A−1 A = I
a · b = aT b
= a1 b1 + a2 b2 + · · · + aN bN
where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T .
Note that the inner product produces a scalar value and is commutative, i.e.,
a·b=b·a
so,
a T b = bT a
Inner product is sometimes denoted with the ⟨ , ⟩ operator, i.e.,
⟨a, b⟩ = a · b = aT b
The outer product, on the other hand, is not commutative, i.e.,
a ⊗ b ≠ b ⊗ a
so
abT ≠ baT
f (x) ≡ f (x1 , x2 , · · · , xN )
f (x) : RN → R
dom f
For instance, the domain of cosine is the set of all real numbers (the
fundamental domain is [0, 2π), all other real values fold onto this region
with mod 2π), while the domain of the square root consists only of
numbers greater than or equal to 0 (ignoring complex numbers in both
cases), i.e.,
dom cos = R
dom sqrt = R+
For a function whose domain is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the domain
is represented on the x-axis (y-axis gives the range, next slide).
Range: The range of a function refers to the image of the function. The
codomain is a set containing the function’s outputs, whereas the image is
the part of the codomain which consists only of the function’s outputs.
For a function whose range is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the range is
represented on the y-axis.
f (x) : R → RN
A = { x | P (x) }
For instance, the negative real numbers do not have a greatest element,
and their supremum is 0 (which is not a negative real number).
Examples:
sup {1, 2, 3} = 3
sup {x ∈ R : 0 < x < 1} = sup {x ∈ R : 0 ≤ x ≤ 1} = 1
For instance, the positive real numbers do not have a least element, and
their infimum is 0, which is not a positive real number.
Examples:
inf {1, 2, 3} = 1
inf {x ∈ R : 0 < x < 1} = inf {x ∈ R : 0 ≤ x ≤ 1} = 0
Note: If not stated otherwise, ∥x∥ will denote the Euclidean norm, ∥x∥2 ,
i.e., L2 -norm.
Unit Ball: Let ∥ · ∥ denote any norm on RN , then the unit ball is the set
of all vectors with norm less than or equal to one, i.e.,
B = {x : ∥x∥ ≤ 1} .
Figure: two-dimensional unit ball of the L∞-norm, i.e., B∞ = { x ∈ R2 : ∥x∥∞ ≤ 1 }.
Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values
∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup_y { yT x : ∥y∥ ≤ 1 }
xT y ≤ ∥x∥ · ∥y∥∗
The dual to the dual norm above is the original norm, i.e.,
∥x∥∗∗ = ∥x∥
- The norm dual to the Euclidean norm is itself. This comes directly from
the Cauchy-Schwarz inequality.
∥x∥2∗ = ∥x∥2
- The norm dual to the L∞-norm is the L1-norm, and vice versa. More generally, the dual of the Lp-norm is the Lq-norm,
∥x∥p∗ = ∥x∥q
where 1/p + 1/q = 1, i.e., q = p/(p − 1).
Av − λv = 0
(A − λI)v = 0
Note that the condition number κ(·) of a symmetric positive-definite matrix is given by the ratio of its largest and smallest eigenvalues, e.g.,
κ(H(x)) = max_i λi / min_i λi
If A is positive-semidefinite, it is denoted by
A⪰0
and has nonnegative eigenvalues.
If A is positive-definite, it is denoted by
A≻0
and has positive eigenvalues.
For any real matrix B, the matrix BT B is positive-semidefinite, and rank(B) = rank(BT B).
If xT Ax ≤ 0 for all x, then A is negative-semidefinite, denoted by
A ⪯ 0
and has nonpositive eigenvalues. If xT Ax < 0 for all x ≠ 0, then A is negative-definite, denoted by
A ≺ 0
and has negative eigenvalues.
The notation ∂/∂x stands for the multidimensional partial derivative operator in the form of a column vector, i.e.,
∂/∂x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T
and the gradient (del) operator is defined accordingly,
∇ = ∇x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T .
Consider the following example,
∇f(x) = ∇x f(x) = ∂f(x)/∂x = [ ∂f(x)/∂x1   ∂f(x)/∂x2   · · ·   ∂f(x)/∂xN ]T .
Note that the ∇ operator is not commutative, i.e.,
∇f (x) ̸= f (x)∇
where d = [ d1 d2 · · · dN ]T .
For a vector function f(x), (∇f T(x))T gives the Jacobian matrix

Jf(x) = [ ∂fi(x)/∂xj ]  (M × N)
      = [ ∂f1(x)/∂x1    ∂f1(x)/∂x2    · · ·    ∂f1(x)/∂xN ]
        [ ∂f2(x)/∂x1    ∂f2(x)/∂x2    · · ·    ∂f2(x)/∂xN ]
        [      ⋮              ⋮          ⋱           ⋮     ]
        [ ∂fM(x)/∂x1    ∂fM(x)/∂x2    · · ·    ∂fM(x)/∂xN ]
x∗ = argmin_x f(x) = argmin_{x∈RN} f(x)
respectively, where k = 1, 2, · · · K.
Geometric Definitions
Hyperplane
{ x | aT x = b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
i.e., the solution set of aT x = b.
Halfspace
{ x | aT x ≥ b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
or
{ x | aT x ≤ b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
each constitutes a halfspace.
Polyhedra
- The solution set of finitely many linear inequalities and equalities constitutes a polyhedron,
{ x | Ax ≤ b, Cx = d;  A ∈ RM×N , C ∈ RP×N , b ∈ RM , d ∈ RP , x ∈ RN }
Example 1:
- Find the solution of
Example 2:
- Consider the pyramid in the following figure. The length of
each side is 1 unit. Find the halfspaces defining this volume.
- Solution:
[ 0        1          0      ] x ≥ 0
[ 1/√2    −1/(2√3)   −1/√6   ] x ≥ 0
[ 0       −1/(2√3)    √(2/3) ] x ≥ 0
[ −1/√2   −1/(2√3)   −1/√6   ] x ≥ −1/√2
where x = [ x1 x2 x3 ]T.
Change of coordinates:
- Let y = P1/2 x, then the ellipsoid becomes { y | (y − yc)T (y − yc) ≤ r }, i.e., a ball with respect to the Euclidean norm with center yc and radius r.
Cone
Example 3:
- Find the region x1 + 2x2 ≥ 4, i.e., [ 1  2 ] [ x1  x2 ]T ≥ 4, and x1, x2 ≥ 0
- Solution:
- Example 7:
min x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
- Example 8:
min 3x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
- Example 9:
min x1 − x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
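The small LPs in Examples 7-9 can be checked numerically. Below is a minimal sketch for Example 7, assuming SciPy is available (linprog only accepts "≤" constraints, so the "≥" rows are negated):

    # Example 7: min x1 + x2  s.t.  x1 + 2*x2 >= 6,  2*x1 + x2 >= 6,  x1, x2 >= 0
    from scipy.optimize import linprog

    c = [1, 1]                        # objective coefficients
    A_ub = [[-1, -2], [-2, -1]]       # ">=" constraints negated into "<=" form
    b_ub = [-6, -6]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)             # optimum x* = (2, 2), objective value 4

Changing c to [3, 1] (Example 8) or [1, -1] (Example 9) reproduces the other cases; for Example 9 the solver should report the problem as unbounded.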
Overview
- Unconstrained Optimization
- Constrained Optimization
Unconstrained Optimization
min f (x)
x∈X
Constrained Optimization
min f (x)
x∈X
subject to g(x) ≤ 0
h(x) = 0
- where x = [ x1 x2 · · · xN ]T , X ⊂ RN ,
g(x) = [ g1 (x) g2 (x) · · · gM (x) ]T are the inequality
constraints, h(x) = [ h1 (x) h2 (x) · · · hL (x) ]T are the
equality constraints, and x∗ is a feasible solution (or point) iff g(x∗) ≤ 0 and h(x∗) = 0.
- Examples:
- Computer networks - e.g., optimum routing problem
- Production planning in a factory
- Resource allocation
- Computer aided design (CAD) - e.g., shortest paths in a
PCB
- Travelling salesman problem
Formulation:
- Variable: amount of food xi for i-th food. Thus, variables can
be represented by the vector x
x = [ x1  x2  · · ·  xi  · · ·  xN ]T
where i = 1, 2, · · · , N .
Amount of food cannot be negative, i.e., xi ≥ 0, so
x≥0
- Cost:
f (x) = c1 x1 + c2 x2 + · · · + ci xi + · · · + cN xN
= cT x
- Constraints:
A1,1 x1 + A1,2 x2 + · · · + A1,i xi + · · · + A1,N xN ≥ b1
A2,1 x1 + A2,2 x2 + · · · + A2,i xi + · · · + A2,N xN ≥ b2
⋮
Aj,1 x1 + Aj,2 x2 + · · · + Aj,i xi + · · · + Aj,N xN ≥ bj
⋮
AM,1 x1 + AM,2 x2 + · · · + AM,i xi + · · · + AM,N xN ≥ bM
or, in matrix form, Ax ≥ b.
min_x   cT x
s.t.    b − Ax ≤ 0
        −x ≤ 0
Basic Definitions
f (x∗ ) ≤ f (y), ∀y ∈ F, y ̸= x∗
f ′(x0) = lim_{(x−x0)→0} [f(x) − f(x0)] / (x − x0) = lim_{Δx→0} Δf / Δx
Δf = ∇T f(x0) Δx + (1/2) ΔxT H(x0) Δx + β(x0, Δx),  where lim_{Δx→0} β(x0, Δx) = 0
Example 11:
- Let f(x) = 3x1²x2³ + x2²x3³ as in the previous example, then

H(x) = ∇²f(x) = [ 6x2³        18x1x2²           0       ]
                [ 18x1x2²     18x1²x2 + 2x3³    6x2x3²  ]
                [ 0           6x2x3²            6x2²x3  ]
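The gradient and Hessian above can be verified symbolically; a short sketch assuming SymPy is installed:

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3')
    f = 3*x1**2*x2**3 + x2**2*x3**3
    grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2, x3)])   # gradient as a column vector
    H = sp.hessian(f, (x1, x2, x3))                           # 3x3 Hessian matrix
    print(grad)
    print(H)                                                  # matches the matrix above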
An N × N matrix M is called
- positive definite (M ≻ 0), if xT Mx > 0, ∀x ∈ RN , x ̸= 0,
or, all eigenvalues of M are positive
Example 12:
-
M = [ 2   0 ]  ≻ 0, positive definite
    [ 0   3 ]

M = [ 8   −1 ]  ≻ 0, positive definite
    [ −1   1 ]
Check xT Mx or check eigenvalues.
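Either check can be automated; a small sketch with NumPy (eigvalsh assumes the matrix is symmetric):

    import numpy as np

    def is_positive_definite(M):
        # All eigenvalues of the symmetric matrix M must be strictly positive.
        return bool(np.all(np.linalg.eigvalsh(M) > 0))

    print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 3.0]])))    # True
    print(is_positive_definite(np.array([[8.0, -1.0], [-1.0, 1.0]])))  # True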
- Hint: Recall that for x ∈ R, f (x) = ax2 ⇒ f ′′ (x) = 2a. Thus the
function f (x) is convex if a > 0 or concave if a < 0.
Problem:
min f (x)
x∈X
then
f(x∗ + λd) − f(x∗) < 0, ∀λ > 0, λ → 0
⇒ x∗ is not a local minimum. Contradiction!
- Proof:
[f(x∗ + λd) − f(x∗)] / λ² = (1/2) dT H(x∗) d + residual (→ 0 as λ → 0),  where dT H(x∗) d > 0
then
f (x∗ + λd) − f (x∗ ) > 0 ∀λ > 0, λ → 0
⇒ x∗ is a local minimum.
- Example 13:
H(x0) = [ 1   1 ]  ≻ 0    and    ∇f(x0) = 0   ⇒
        [ 1   4 ]
Point x0 satisfies the sufficient conditions, point x0 is a local minimum.
- ∇f(x) = [ 3x1²   2x2 ]T    and    H(x) = [ 6x1   0 ]
                                            [ 0     2 ]
- ∇f(x) = 0 at x0 = [ 0   0 ]T , but H(x0) = [ 0   0 ]
                                              [ 0   2 ]
  is only positive semidefinite, so x0 may or may not be a local minimum. Note that
  f(x0) = f(0, 0) = 0.
Convex Sets
Left: The hexagon, which includes its boundary (shown darker), is convex.
Middle: The kidney shaped set is not convex, since the line segment between
the two points in the set shown as dots is not contained in the set. Right: The
square contains some boundary points but not others, and is not convex.
- Convex Combination: For x1, x2, x3, . . . , xk and θi ≥ 0 with Σ_{i=1}^{k} θi = 1, if
x = θ1x1 + θ2x2 + · · · + θkxk ∈ C, then C is a convex set.
- For example, the convex hulls of two sets in R2 are given below.
Left: The convex hull of a set of fifteen points (shown as dots) is the
pentagon (shown shaded). Right: The convex hull of the kidney shaped
set in the middle of the convex set is the whole shaded set.
dom f = C ⊆ RN
∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.
dom f = C ⊆ RN
∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.
dom f = C ⊆ RN
with
dom f : convex
satisfies
f (θx1 + (1 − θ)x2 ) = θf (x1 ) + (1 − θ)f (x2 )
Hence it can be considered as convex or concave depending on the
problem.
Examples on R: (scalar)
- Convex:
- affine: ax + b, on R, for any a, b ∈ R
- exponential: eax , on R, for any a ∈ R
- powers: xα , on R++ , for α ≥ 1 or α ≤ 0
- powers of absolute value: |x|α , on R, for α ≥ 1
- negative entropy: x log x, on R++
- Concave:
- affine: ax + b, on R, for any a, b ∈ R
- powers: xα , on R++ , for 0 ≤ α ≤ 1
- logarithm: log x, on R++
Examples on RN : (vectors)
- All norms (i.e. Lp -norm) are convex
∥x∥p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}    for p ≥ 1.
f (x) = aT x + b
Further Examples:
- Quadratic function
f(x) = (1/2) xT Px + qT x + r
∇f (x) = Px + q
H(x) = P
f (x) is convex if P ⪰ 0.
for all x.
- Geometric mean
f(x) = ( Π_{i=1}^{N} xi )^{1/N}
is concave on RN++.
- Nonnegative multiple
αf (x), for α ≥ 0
- Pointwise maximum
f(x) = max {f1(x), f2(x), . . . , fi(x), . . . , fM(x)}
is convex if the fi(x) are convex, e.g.,
f(x) = max_{i=1,...,M} ( aiT x + bi )
- Pointwise supremum
g(x) = sup_{y∈A} f(x, y)
f(x) = sup_{y∈C} ∥x − y∥
- Pointwise infimum
g(x) = inf_{y∈C} f(x, y)
d(x, S) = inf_{y∈S} ∥x − y∥
f (x) = h(g(x))
- You may find resources about taking the derivative of expressions with
matrix/vectors at
http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html
- Corollary:
Optimality Conditions
∇f (x∗ ) = Qx∗ + c = 0
- Ex:
f (x) = xT x = ∥x∥22
f (x) = (x − a)T (x − a) = ∥x − a∥22
f (x) = (x − a)T P(x − a) = ∥x − a∥2P , where P is SPD.
Unconstrained Minimization
- The aim is
min f (x)
- Necessary conditions:
∇f (x∗ ) = Qx∗ − b = 0
H(x∗ ) = Q ⪰ 0 (PSD)
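When Q ≻ 0 these conditions are also sufficient, and x∗ is obtained by solving the linear system Qx∗ = b. A small numerical sketch (the matrix Q and vector b below are hypothetical, for illustration only):

    import numpy as np

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite (hypothetical)
    b = np.array([1.0, 2.0])
    x_star = np.linalg.solve(Q, b)           # solves the stationarity condition Q x* = b
    print(x_star, Q @ x_star - b)            # gradient Q x* - b should be ~ 0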
Example 2: Consider
min_{x1,x2 ∈ R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1
- Here, let us first express the above equation in the quadratic program
form with
Q = [ α    γ ]        b = [ 1 ]
    [ −γ   β ]            [ 0 ]
where γ ∈ R, for simplicity we can take γ = 0. So,
- If α > 0 and β > 0, i.e. Q ≻ 0, then x∗ = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0, i.e. Q ⪰ 0, there are an infinite number of solutions, (1/α, y), y ∈ R.
- If α = 0 and β > 0, i.e. Q ⪰ 0, there is no solution.
- If α < 0 and β > 0 (or α > 0 and β < 0), i.e. Q is indefinite, there is no solution.
Figure: surface plots of f(x1, x2) over x1, x2 ∈ [−10, 10] illustrating the four cases above.
- Two possibilities:
- {f (x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f (x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided the minimum is attained at a finite point (∥x∗∥ < ∞).
f (x(k) ) → p∗
Topography (https://www.rei.com/learn/expert-advice/topo-maps-how-to-use.html)
Descent Methods
Motivation
∇T f (x) d < 0
except x(k) = x∗ .
where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration
k, and d(k) ∈ RN is the step or search direction.
- How to find optimum α(k) ? Line Search Algorithm
- How to find optimum d(k) ? Depends on the descent algorithm,
e.g. d = −∇f (x(k) ).
Note that, here the descent direction is d(k) = −H−1 (x(k) )∇f (x(k) ).
Line Search
h′(α(k)) = ∂h(α(k))/∂α = 0
h′(α) = ∂h(α)/∂α = ∇T f(x(k) + αd(k)) d(k)    (using the chain rule)
Therefore, since d is a descent direction (i.e., ∇T f(x(k)) d(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotone increasing function of α because h(α) is convex. Hence, search for h′(α(k)) = 0.
Choice of stepsize:
- Constant stepsize
α(k) = c : constant
- Diminishing stepsize
α(k) → 0,  while satisfying Σ_{k=0}^{∞} α(k) = ∞.
α0 = α(k) = − [∇T f(x(k)) d(k)] / [d(k)T H(x(k)) d(k)]
Bisection Algorithm:
- Assume h(α) is convex, then h′ (α) is a monotonically increasing
function. Suppose that we know a value α̂ such that h′ (α̂) > 0.
Algorithm:
- Set k = 0, αℓ = 0 and αu = α̂
- Set α̃ = (αℓ + αu)/2 and calculate h′(α̃)
- If h′ (α̃) > 0 ⇒ αu = α̃ and k = k + 1
- If h′ (α̃) < 0 ⇒ αℓ = α̃ and k = k + 1
- If h′ (α̃) = 0 ⇒ stop.
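A minimal sketch of this bisection search on h′(α), assuming h(α) = f(x + αd) is convex and a value α̂ with h′(α̂) > 0 is known (the quadratic test function at the end is hypothetical):

    def bisection_line_search(h_prime, alpha_hat, tol=1e-8, max_iter=100):
        # h' is monotone increasing on [0, alpha_hat]; halve the interval until h'(alpha) ~ 0.
        a_lo, a_hi = 0.0, alpha_hat
        a_mid = 0.5 * (a_lo + a_hi)
        for _ in range(max_iter):
            a_mid = 0.5 * (a_lo + a_hi)
            g = h_prime(a_mid)
            if abs(g) < tol:
                break
            if g > 0:
                a_hi = a_mid          # zero of h' lies to the left
            else:
                a_lo = a_mid          # zero of h' lies to the right
        return a_mid

    # Usage: h(a) = (3 - a)^2, so h'(a) = 2(a - 3) and the minimizer is a = 3
    print(bisection_line_search(lambda a: 2.0 * (a - 3.0), alpha_hat=10.0))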
holds for α ∈ [0, α0 ]. Then, the line search stops with a step length α
i. α = 1 if α0 ≥ 1
ii. α ∈ [βα0 , α0 ].
In other words, the step length obtained by backtracking line search
satisfies
α ≥ min {1, βα0 } .
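A sketch of a standard backtracking line search with the sufficient-decrease (Armijo) condition; the parameter names γ (slope fraction) and β (shrink factor) follow the notes, while the default values and test function are assumptions:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, d, gamma=0.3, beta=0.5):
        # Assumes d is a descent direction, i.e. grad_f(x) @ d < 0.
        alpha = 1.0
        fx = f(x)
        slope = grad_f(x) @ d                    # directional derivative, negative
        while f(x + alpha * d) > fx + gamma * alpha * slope:
            alpha *= beta                        # backtrack until sufficient decrease
        return alpha

    # Usage on f(x) = ||x||^2 with the negative gradient as search direction:
    f = lambda x: float(x @ x)
    grad_f = lambda x: 2.0 * x
    x0 = np.array([1.0, 2.0])
    print(backtracking_line_search(f, grad_f, x0, -grad_f(x0)))   # returns 0.5 here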
Convergence
Definition: Let ∥ · ∥ be a norm on RN. Let {x(k)}_{k=0}^{∞} be a sequence of vectors in RN. Then, the sequence {x(k)}_{k=0}^{∞} is said to converge to a limit x∗ if
lim_{k→∞} x(k) = x∗
and x∗ is called the limit of the sequence {x(k)}_{k=0}^{∞}.
- Nε may depend on ε
- For a distance ε, after Nε iterations, all the subsequent iterations
are within this distance ε to x∗ .
- This definition does not characterize how fast the convergence is (i.e.
rate of convergence).
Rate of Convergence
Definition: Let ∥ · ∥ be a norm on RN. A sequence {x(k)}_{k=0}^{∞} that converges to x∗ ∈ RN is said to converge at rate R ∈ R++ and with rate constant δ ∈ R++ if
lim_{k→∞} ∥x(k+1) − x∗∥ / ∥x(k) − x∗∥^R = δ
- If R = 1, 0 < δ < 1, then rate is linear
- If 1 < R < 2, 0 < δ < ∞, then rate is called super-linear
- If R = 2, 0 < δ < ∞, then rate is called quadratic
Example 5: The sequence {a^k}_{k=0}^{∞}, 0 < a < 1, converges to 0.
lim_{k→∞} ∥a^(k+1) − 0∥ / ∥a^k − 0∥^1 = a   ⇒   R = 1, δ = a
Example 6: The sequence {a^(2^k)}_{k=0}^{∞}, 0 < a < 1, converges to 0.
lim_{k→∞} ∥a^(2^(k+1)) − 0∥ / ∥a^(2^k) − 0∥² = 1   ⇒   R = 2, δ = 1
- repeat
- Direction: d(k) = −∇f (x(k) )
- Line search: Choose step size α(k) via a line search algorithm
- Update: x(k+1) = x(k) + α(k) d(k)
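A compact sketch of the whole loop, combining the descent direction above with a backtracking line search (the test function and tolerances are hypothetical):

    import numpy as np

    def gradient_descent(f, grad_f, x0, tol=1e-6, max_iter=1000, gamma=0.3, beta=0.5):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:          # stopping criterion: small gradient
                break
            d = -g                                # descent direction
            alpha = 1.0
            while f(x + alpha * d) > f(x) + gamma * alpha * (g @ d):
                alpha *= beta                     # backtracking line search
            x = x + alpha * d                     # update
        return x

    # Usage: minimize f(x) = (x1 - 1)^2 + 4*(x2 + 2)^2, whose minimizer is (1, -2)
    f = lambda x: (x[0] - 1.0)**2 + 4.0 * (x[1] + 2.0)**2
    grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])
    print(gradient_descent(f, grad_f, [0.0, 0.0]))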
Convergence Analysis
- mI ⪯ H(x), i.e.,
(H(x) − mI) ⪰ 0
yT H(x)y ≥ m∥y∥2 , ∀y ∈ RN
- H(x) ⪯ M I, i.e.,
(M I − H(x)) ⪰ 0
yT H(x)y ≤ M ∥y∥2 , ∀y ∈ RN
with ∀x ∈ dom f (x).
- When y = x∗,
f(x∗) = p∗ ≥ f(x) − (1/(2m)) ∥∇f(x)∥²
- A stopping criterion:
f(x) − p∗ ≤ (1/(2m)) ∥∇f(x)∥²
- For any x, y ∈ dom f(x), using similar derivations as for the lower bound, we arrive at an upper bound
f(y) ≤ f(x) + ∇T f(x)(y − x) + (M/2) ∥y − x∥²
- Then, for y = x∗,
f(x∗) = p∗ ≤ f(x) − (1/(2M)) ∥∇f(x)∥²
- Hence,
(1/(2M)) ∥∇f(x)∥² ≤ f(x) − p∗
- For the exact line search, let us use second order approximation for
f (x(k+1) ):
- However, let us use the upper bound of the second order approximation
for convergence analysis
f(x(k+1)) ≤ f(x(k)) − α ∥∇f(x(k))∥² + (M α²/2) ∥∇f(x(k))∥²
- Find α0′ such that upper bound of f (x(k) − α∇f (x(k) )) is minimized over
α.
f(x(k+1)) − p∗ ≤ c^k (f(x(0)) − p∗)
- well-conditioned Hessian, m/M → 1 ⇒ denominator is large (K gets smaller).
- ill-conditioned Hessian, m/M → 0 ⇒ denominator is small (K gets larger).
- Similar to the analysis for exact line search, subtract p∗ from both sides:
f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − γ min{1, β/M} ∥∇f(x(k))∥²
- Finally,
[f(x(k+1)) − p∗] / [f(x(k)) − p∗] ≤ 1 − 2mγ min{1, β/M} = c < 1
Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).
Observations:
- The gradient descent algorithm is simple.
- The main advantage of the gradient method is its simplicity. Its main
disadvantage is that its convergence rate depends so critically on the
condition number of the Hessian or sublevel sets.
Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values
∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup_y { yT x : ∥y∥ ≤ 1 }
xT y ≤ ∥x∥ · ∥y∥∗
- The norm dual to the Euclidean norm is itself. This comes directly from the Cauchy-Schwarz inequality.
∥x∥2∗ = ∥x∥2
- The norm dual to the L∞-norm is the L1-norm, and vice versa. More generally, the dual of the Lp-norm is the Lq-norm,
∥x∥p∗ = ∥x∥q
where 1/p + 1/q = 1, i.e., q = p/(p − 1).
where Q = P−1 .
- repeat
- Compute the steepest descent direction dsd(k)
- Line search: Choose step size α(k) via a line search algorithm
- Update: x(k+1) = x(k) + α(k) dsd(k)
- Thus, x∗ = P−1/2 y∗ .
- Let i be any index for which ∥∇f (x0 )∥∞ = max |(∇f (x0 ))i |. Then a
normalized steepest descent direction dnsd for the L1 -norm is given by
dnsd = − sign( ∂f(x0)/∂xi ) ei
where ei is the i-th standard basis vector (i.e. the coordinate axis
direction) with the steepest gradient. For example, in the figure above we
have dnsd = e1 .
Choice of norm:
- Choice of norm can dramatically affect the convergence
Convergence Analysis
- (Using backtracking line search) It can be shown that any norm can be
bounded in terms of Euclidean norm with a constant η ∈ (0, 1]
∥x∥∗ ≥ η∥x∥2
f(x(k) + α dsd) ≤ f(x(k)) − (η²/(2M)) ∥∇f(x(k))∥∗²  ≤  f(x(k)) + (γη²/M) ∇T f(x(k)) dsd
- Since γ < 0.5 and −∥∇f(x)∥∗² = ∇T f(x) dsd, backtracking line search will return α ≥ min{1, βη²/M}, then
f(x(k) + α dsd) ≤ f(x(k)) − γ min{1, βη²/M} ∥∇f(x(k))∥∗²
              ≤ f(x(k)) − γη² min{1, βη²/M} ∥∇f(x(k))∥2²
[f(x(k+1)) − p∗] / [f(x(k)) − p∗] ≤ 1 − 2mγη² min{1, βη²/M} = c < 1
- Linear convergence
f(x(k)) − p∗ ≤ c^k ( f(x(0)) − p∗ )
Conjugate Directions
dT1 Qd2 = 0
α0 d0 + α1 d1 + · · · + αk dk = 0
- But dTi Qdi > 0 (Q: PD), then αi = 0. Repeat for all αi .
- αi can be found from the known vector b and matrix Q once di are
found.
- with
α(k) = − [dkT g(k)] / [dkT Q dk]
and g(k) is the gradient at x(k).
- We will show that at each step x(k) minimizes the objective over the
k-dimensional linear variety x(0) + B(k) .
Theorem (Expanding Subspace Theorem): Let {di}_{i=0}^{N−1} be non-zero, Q-orthogonal vectors in RN.
Proof: Since x(k) ∈ x(0) + B(k) , i.e., B(k) contains the line
x = x(k−1) − αdk−1 , it is enough to show that x(k) minimizes f (x) over
x(0) + B(k)
- Since we assume that f (x) is strictly convex, the above condition holds
when g(k) is orthogonal to B(k) , i.e., the gradient of f (x) at x(k) is
orthogonal to B(k) .
- From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that
dTi g(k) = 0
for i < k.
- repeat
- α(k) = − [d(k)T g(k)] / [d(k)T Q d(k)]
- until k = N.
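A minimal sketch of this quadratic-case CG iteration (NumPy; the matrix Q and vector b are hypothetical, and the conjugate directions are generated with the Fletcher-Reeves update):

    import numpy as np

    def conjugate_gradient(Q, b, x0=None, tol=1e-10):
        # Minimizes 0.5*x'Qx - b'x for symmetric positive definite Q.
        n = b.shape[0]
        x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
        g = Q @ x - b                              # gradient g(k) = Qx - b
        d = -g                                     # first direction: steepest descent
        for _ in range(n):                         # at most N steps in exact arithmetic
            if np.linalg.norm(g) < tol:
                break
            alpha = -(d @ g) / (d @ (Q @ d))       # exact step length along d
            x = x + alpha * d
            g_new = Q @ x - b
            beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient
            d = -g_new + beta * d                  # next Q-conjugate direction
            g = g_new
        return x

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])         # hypothetical SPD matrix
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))   # the two should agree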
- Algorithm terminates in at most N steps with the exact solution (for the
quadratic case)
- The process makes uniform progress toward the solution at every step.
Important for the nonquadratic case.
- Solution is given by
CG Summary
- In theory (with exact arithmetic) converges to solution in N steps
- The bad news: due to numerical round-off errors, can take more
than N steps (or fail to converge)
- The good news: with luck (i.e., good spectrum of Q), can get good
approximate solution in ≪ N steps
- Expanding,
f(x) ≅ (1/2) xT H(x0) x + (∇T f(x0) − x0T H(x0)) x + [ f(x0) + (1/2) x0T H(x0) x0 − ∇T f(x0) x0 ]
where the bracketed term is independent of x, i.e., constant.
- Thus,
min f(x) ≡ min (1/2) xT H(x0) x + (∇T f(x0) − x0T H(x0)) x
         ≡ min (1/2) xT Q x − bT x
- Here,
Q = H(x0)
bT = −∇T f(x0) + x0T H(x0)
g(k) = Q x(k) − b
     = H(x0) x0 + ∇f(x0) − H(x0) x0    (substituting x(k) = x0)
     = ∇f(x0)
- repeat
- α(k) = − [d(k)T g(k)] / [d(k)T H(x(k)) d(k)]
Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A
complicated function with many local minima and maxima. (b) Convergence path of
Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section
of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.
- Then
x(k+1) = x(k) + ∆xnt
∇f (x∗ ) = 0
- The norm of the Newton step in the quadratic norm defined by H(x) is
called the Newton decrement
λ(x) = ∥∆xnt∥_H(x) = ( ∆xntT H(x) ∆xnt )^(1/2)
where
f̂(x + ∆xnt) = f(x) + ∇T f(x) ∆xnt + (1/2) ∆xntT H(x) ∆xnt
i.e., the second-order quadratic approximation of f(x) at x.
then
(1/2) ∇T f(x) H−1(x) ∇f(x) = (1/2) λ²(x)
- So, if λ²(x)/2 < ϵ, the algorithm can be terminated for some small ϵ.
- With the substitution of ∆xnt = −H−1 (x)∇f (x), the Newton decrement
can also be written as
λ(x(k)) = ( ∇T f(x(k)) H−1(x(k)) ∇f(x(k)) )^(1/2)
Newton’s Method
- Given a starting point x(0) ∈ dom f (x) and some small tolerance ϵ > 0
- repeat
- Compute the Newton step and Newton decrement
- The stepsize α(k) (i.e., line search) is required for the non-quadratic
initial parts of the algorithm. Otherwise, algorithm may not converge due
to large higher-order residuals.
- If we start with α(k) = 1 and keep it the same, then the algorithm is
called the pure Newton’s method.
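A minimal sketch of the pure Newton iteration with the Newton-decrement stopping rule (the test function is hypothetical; in the damped phase a line search on α would be added):

    import numpy as np

    def pure_newton(grad_f, hess_f, x0, eps=1e-8, max_iter=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            H = hess_f(x)
            dx_nt = -np.linalg.solve(H, g)    # Newton step without forming H^{-1} explicitly
            lam2 = -(g @ dx_nt)               # lambda^2(x) = grad' H^{-1} grad
            if lam2 / 2.0 <= eps:             # stopping criterion lambda^2/2 < eps
                break
            x = x + dx_nt                     # pure Newton update (alpha = 1)
        return x

    # Usage on the convex test function f(x) = x1^4 + x2^2
    grad_f = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
    hess_f = lambda x: np.array([[12.0 * x[0]**2, 0.0], [0.0, 2.0]])
    print(pure_newton(grad_f, hess_f, [1.0, 1.0]))   # approaches (0, 0)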
- So, use (aI + H(x))−1 instead of H−1(x); this is also known as the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.
then
∇f˜(y) = TT ∇f (x)
H̃(y) = TT H(x)T
i.e.,
x + ∆xnt = T (y + ∆ynt ), ∀x
λ̃(y) = ( ∇T f(x) T (TT H(x) T)−1 TT ∇f(x) )^(1/2)
      = ( ∇T f(x) H−1(x) ∇f(x) )^(1/2)
      = λ(x)
Convergence Analysis
Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that
- Damped Newton Phase
∥∇f (x)∥2 ≥ η
- α < 1 gives better solutions, so most iterations will require line
search, e.g. backtracking.
- As k increases, function value decreases by at least σ, but not
necessarily quadratic.
- This phase ends after at most (f(x(0)) − p∗)/σ iterations
(f(x(0)) − p∗)/σ + log2 log2 (ϵ0/ϵ)
- σ and ϵ0 are dependent on m, L and x(0) .
NA Summary
- Newton’s method scales well with problem size. Ignoring the computation
of the Hessian, its performance on problems in R10000 is similar to its
performance on problems in R10 , with only a modest increase in the
number of steps required.
- For relatively large scale problems, i.e. N is large, calculating the inverse
of the Hessian at each iteration can be costly. So, we may use, some
approximations of the Hessian
Hybrid GD + NA
- We know that the first (damped) phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after satisfying some conditions, switch to the NA.
- Hybrid Algorithm
- Start at x(0) ∈ dom f (x)
- repeat
- run GD (i.e., S(x(k) ) = I)
- until stopping criterion is satisfied
- Start at the final point of GD
- repeat
- run NA with exact H(x) (i.e., S(x(k) ) = H−1 (x(k) ))
- until stopping criterion is satisfied
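A compact sketch of this hybrid scheme (the switch criterion on the gradient norm and the test function are assumptions, not from the notes):

    import numpy as np

    def hybrid_minimize(f, grad_f, hess_f, x0, switch_tol=1e-1, tol=1e-8):
        x = np.asarray(x0, dtype=float)
        # Phase 1: gradient descent with backtracking, until the gradient is "small enough"
        while np.linalg.norm(grad_f(x)) > switch_tol:
            g = grad_f(x)
            alpha, d = 1.0, -g
            while f(x + alpha * d) > f(x) + 0.3 * alpha * (g @ d):
                alpha *= 0.5
            x = x + alpha * d
        # Phase 2: Newton iterations, until the Newton decrement is small
        while True:
            g = grad_f(x)
            dx_nt = -np.linalg.solve(hess_f(x), g)
            if -(g @ dx_nt) / 2.0 <= tol:
                return x
            x = x + dx_nt

    # Usage on f(x) = x1^2 + 2*x2^2 (hypothetical)
    f = lambda x: x[0]**2 + 2.0 * x[1]**2
    grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
    hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
    print(hybrid_minimize(f, grad_f, hess_f, [3.0, -2.0]))   # approaches (0, 0)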