
Introduction

Overview
Convex Sets and Functions
Unconstrained Minimization
Descent Methods

ELE 704 Optimization

Cenk Toker

Hacettepe University

2023-24 Spring


Textbooks

There is no specific textbook. Lecture notes will be a composition


of the references below:
- Luenberger, Linear and Nonlinear Programming, Kluwer, 2002.
- Boyd and Vandenberghe, Convex Optimization, Cambridge,
2004.
- S. S. Rao, Engineering Optimization: Theory and Practice, 4th
Edition, Wiley, 2009.
- Baldick, Applied Optimization, Cambridge, 2006.
- Freund, Lecture Notes, MIT.
- Bertsekas, Lecture Notes, MIT.
- Bertsekas, Nonlinear Programming, Athena Scientific, 1999.


Course Content

- Notation and Background


- Unconstrained Optimization: Steepest Descent
- Unconstrained Optimization: Conjugate Gradient
- Unconstrained Optimization: Newton’s Method
- Constrained Optimization: Optimality and Duality
- Constrained Optimization: Gradient Projection
- Constrained Optimization: Modified Newton’s Method
- Constrained Optimization: Penalty & Barrier Methods
- Constrained Optimization: Interior Point Method


Notation

First, we are going to introduce the notation used in this lecture.


The symbols Z and R denote the sets of integers and real numbers,
respectively.

The symbols Z++ and R++ denote the sets of positive (not including
zero) integer and real numbers, respectively. Similarly, Z+ and R+ denote
the sets of nonnegative integer and real numbers, respectively.

The symbols ∀,∃, : , ∧, ∨, =⇒, ⇐⇒, ⊂ and ∈ denote the terms “for all”,
“there exists”, “such that”, “and”, “or”, “if . . . then”, “if and only if (iff)”,
“subset of” and “element of”, respectively.

Functions of a continuous variable indicated with round brackets, for


instance, f (t) where t ∈ R.

Functions are always assumed to be real-valued unless explicitly stated


otherwise.


Boldface type or a bar over the letter is used to denote matrix and vector
quantities. Small letters will be used for column vectors, e.g. a or ā, and
capital letters will be used for matrices, e.g. A or Ā. In this context aT or
āT denotes a row vector. The elements of a vector are denoted by a subscript
starting from 1. See below for some equivalent ways of writing the same
column vector,

a = [ a1
      a2
      ..
      aN ]
  = [ a1 a2 · · · aN ]T
  = [ai ]N×1

where i = 1, 2, · · · , N .

Arithmetic, sign and equality (or inequality) operators for vectors and
matrices apply to their elements directly,

n − η = [ n1 − K1   n2 − K2   · · ·   nN − KN ]T
n < η ≡ n1 < K1 , n2 < K2 , · · · , nN < KN

where n = [ n1 n2 · · · nN ]T and η = [ K1 K2 · · · KN ]T .


Scalar values apply to every element of vectors. Here are some explanatory
examples

K a   = [ Ka1   Ka2   · · ·   KaN ]T
a + K = [ a1 + K   a2 + K   · · ·   aN + K ]T
a < K ≡ a1 < K, a2 < K, · · · , aN < K
a = K ≡ a1 = K, a2 = K, · · · , aN = K

where a = [ a1 a2 · · · aN ]T and K is a scalar value.
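These scalar and elementwise conventions correspond directly to array broadcasting in numerical software. A minimal NumPy sketch (an illustration added here, not part of the original notes; the values are arbitrary):

import numpy as np

a = np.array([1.0, 2.0, 3.0])    # a = [ a1 a2 a3 ]T
K = 2.0                          # a scalar

print(K * a)          # [ K a1  K a2  K a3 ]   -> [2. 4. 6.]
print(a + K)          # [ a1+K  a2+K  a3+K ]   -> [3. 4. 5.]
print(a < K)          # elementwise comparison -> [ True False False]
print(np.all(a < K))  # "a < K" as a statement about the whole vector -> False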


The symbols ZN and RN denote the sets of N -dimensional


integer vectors, and N -dimensional real number vectors,
respectively. In other words ZN and RN denote N -dimensional
(column) vector spaces where vector elements are integers and
real numbers, respectively, for instance
c ∈ RN  ≡  c = [ c1 c2 · · · cN ]T and c1 ∈ R, c2 ∈ R, · · · , cN ∈ R


An M ×N matrix A can be stated in two forms


 
A = [ A1,1   A1,2   · · ·   A1,N
      A2,1   A2,2   · · ·   A2,N
       ..     ..    · · ·    ..
      AM,1   AM,2   · · ·   AM,N ]

A = [Ai,j ]M×N

where i = 1, 2, · · · , M and j = 1, 2, · · · , N .
The quantity AT denotes the transpose of A.
The quantities A−1 and A−T denote the inverse and the
inverse transpose of A, respectively.
An N -length column vector a is essentially an N ×1 matrix
and similarly the row vector aT is also a 1×N matrix.

Matrix multiplication is defined only between an M ×P matrix


A and P ×N matrix B to produce an M ×N matrix C:
C = AB
where
Ci,j = Σ_{k=1}^{P} Ai,k Bk,j
with i = 1, 2, · · · , M and j = 1, 2, · · · , N .
Matrix multiplication is not commutative
AB ̸= BA
Matrix multiplication is associative
ABC = (AB)C = A(BC)
Matrix multiplication is distributive
A(B + C) = AB + AC
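These properties are easy to check numerically; a small NumPy sketch (illustrative only, with arbitrary random matrices):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((3, 3))
B = rng.standard_normal((3, 3))
C = rng.standard_normal((3, 3))

print(np.allclose(A @ B, B @ A))                # False: not commutative in general
print(np.allclose((A @ B) @ C, A @ (B @ C)))    # True:  associative
print(np.allclose(A @ (B + C), A @ B + A @ C))  # True:  distributive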

Transpose of a matrix product is equal to the product of the


transposed matrices in reverse order, e.g.,

(AB)T = BT AT

The trace of an N ×N square matrix A is defined to be the


sum of the elements on the main diagonal (the diagonal from
the upper left to the lower right) of A, i.e.,
tr(A) = A1,1 + A2,2 + · · · + AN,N = Σ_{i=1}^{N} Ai,i


The notation
diag [ d1 d2 · · · dN ]
stands for an N × N diagonal matrix D with the main
diagonal elements Di,i = di , i.e.
 
D = [ d1   0    · · ·   0      0
      0    d2   0      · · ·   0
      ..   ..   ..     ..      ..
      0   · · ·  0     dN−1    0
      0    0   · · ·    0      dN ]

The shorthand notations diag d and diag dT can also be used, i.e.

diag d ≡ diag dT ≡ diag [ d1 d2 · · · dN ]

where d = [ d1 d2 · · · dN ]T .

The symbol IN denotes the N × N identity matrix, i.e.

IN = diag [ 1 1 · · · 1 ] (N ones) = diag [ 1 ]1×N .

Sometimes the subscript N is omitted and simply I is used to


denote an identity matrix.
Note that, for an N × N nonsingular (i.e., invertible) matrix A

AA−1 = A−1 A = I


The · operator denotes the inner product (or dot product)


operator for two vectors of the same length as defined below

a · b = aT b
= a1 b1 + a2 b2 + · · · + aN bN

where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T .
Note that, the inner product produces a scalar value and it is
commutative, i.e.
a·b=b·a
so,
a T b = bT a
Inner product is sometimes denoted with the ⟨ , ⟩ operator, i.e.,

⟨a, b⟩ = a · b = aT b


The ⊗ operator denotes the outer product (or tensor


product) operator for two vectors of the same length as
defined below
a ⊗ b = abT
where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T .
Note that, the outer product produces an N × N square
matrix and it is not commutative, i.e.

a ⊗ b ̸= b ⊗ a

so,
abT ̸= baT
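A quick NumPy illustration of the inner and outer products (a sketch with arbitrary vectors):

import numpy as np

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 5.0, 6.0])

print(a @ b)                   # inner product aT b = 32.0, a scalar
print(a @ b == b @ a)          # inner product is commutative -> True
print(np.outer(a, b))          # outer product a bT, a 3x3 matrix
print(np.allclose(np.outer(a, b), np.outer(b, a)))  # False: not commutative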


Functions are also defined in terms of their input and output sets, i.e.
f :A→B
means "f is a B-valued function of an A-valued variable"
where A ⊆ R and B ⊆ R.
Vector variables (or arguments) will be denoted by small
letters using boldface type, that is

f (x) ≡ f (x1 , x2 , · · · , xN )

where x = [ x1 x2 · · · xN ]T . Assuming x ∈ RN and f (x)


is a real-valued function, then

f (x) : RN → R


Domain: The domain of a function is the set of "input" or argument


values for which the function is defined. That is, the function provides an
"output" or value for each member of the domain.

The domain of a function is denoted by

dom f

For instance, the domain of cosine is the set of all real numbers (the
fundamental domain is [0, 2π), all other real values fold onto this region
with mod 2π), while the domain of the square root consists only of
numbers greater than or equal to 0 (ignoring complex numbers in both
cases), i.e.,

dom cos = R
dom sqrt = R+

For a function whose domain is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the domain
is represented on the x-axis (y-axis gives the range, next slide).


Range: The range of a function refers to the image of the function. The
codomain is a set containing the function’s outputs, whereas the image is
the part of the codomain which consists only of the function’s outputs.

For example, the function f (x) = x2 is often described as a function from


the real numbers to the real numbers, meaning that the codomain is R,
but its image (i.e., range) is the set of non-negative real numbers, i.e.,
R+ .

For a function whose range is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the range is
represented on the y-axis.


Function vectors and function matrices are also denoted in a similar
fashion, e.g., f (x) and F(x), respectively. For example, a
function vector f (x) of size N can be stated as

f (x) = [ f1 (x) f2 (x) · · · fN (x) ]T .

Assuming x ∈ R and the fi (x) are real-valued functions, then

f (x) : R → RN

Sets are denoted by { }. An example set is defined below

A = { x | P (x) }

meaning "A is a set of x such that P (x) is true". Here, the


letter x can be replaced by other symbols. Another example
would be
B = { y | y is a prime number }

Supremum: The supremum (sup) of a subset S of a totally or partially


ordered set T is the least element of T that is greater than or equal to all
elements of S. Consequently, the supremum is also referred to as the
least upper bound (LUB). Plural of supremum is suprema.

If S contains a greatest element, then that element is the supremum.


Otherwise, the supremum does not belong to S (or does not exist, e.g.
∞).

For instance, the negative real numbers do not have a greatest element,
and their supremum is 0 (which is not a negative real number).

Examples:

sup {1, 2, 3} = 3
sup {x ∈ R : 0 < x < 1} = sup {x ∈ R : 0 ≤ x ≤ 1} = 1

One basic property of the supremum is

sup {f (t) + g(t) : t ∈ A} ≤ sup {f (t) : t ∈ A} + sup {g(t) : t ∈ A}

for any functions f and g.



Infimum: The infimum (inf) of a subset S of a partially ordered set T is


the greatest element of T that is less than or equal to all elements of S.
Consequently the term greatest lower bound (GLB) is also commonly
used. Plural of infimum is infima.

If the infimum exists, it is unique. If S contains a least element, then that


element is the infimum; otherwise, the infimum does not belong to S (or
does not exist, e.g. −∞).

For instance, the positive real numbers do not have a least element, and
their infimum is 0, which is not a positive real number.

Examples:

inf {1, 2, 3} = 1
inf {x ∈ R : 0 < x < 1} = inf {x ∈ R : 0 ≤ x ≤ 1} = 0


Norm: A norm is a way of measuring the length or strength of a vector.


The general form of the norm is called Lp -norm given by
∥x∥p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}    for p ≥ 1.

The value of p is typically 1 or 2 or ∞.

L1 -norm is the Taxicab norm or Manhattan norm:

∥x∥1 = Σ_{i=1}^{N} |xi|

L2 -norm is the Euclidean norm:

∥x∥2 = ( Σ_{i=1}^{N} |xi|^2 )^{1/2}

L∞ -norm is the maximum norm or infinity norm:

∥x∥∞ = max_i |xi|

Note: If not stated otherwise, ∥x∥ will denote the Euclidean norm, ∥x∥2 ,
i.e., L2 -norm.
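All three norms are available through numpy.linalg.norm; a small sketch (vector chosen arbitrarily):

import numpy as np

x = np.array([3.0, -4.0, 1.0])

print(np.linalg.norm(x, 1))       # L1  norm: |3| + |-4| + |1| = 8.0
print(np.linalg.norm(x, 2))       # L2  norm: sqrt(9 + 16 + 1) = 5.099...
print(np.linalg.norm(x))          # the default is the Euclidean (L2) norm
print(np.linalg.norm(x, np.inf))  # Linf norm: max |xi| = 4.0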

Unit Ball: Let ∥ · ∥ denote any norm on RN , then the unit ball is the set
of all vectors with norm less than or equal to one, i.e.,

B = {x : ∥x∥ ≤ 1} .

Then, B is called the unit ball of the norm ∥ · ∥.

The unit ball satisfies the following properties:


- B is symmetric about the origin, i.e., x ∈ B if and only if −x ∈ B.
- B is convex.
- B is closed, bounded, and has nonempty interior.

Two-dimensional unit ball of L1 -norm, i.e., B1 = { x ∈ R2 : ∥x∥1 ≤ 1 }.





Two-dimensional unit ball of L2 -norm, i.e., B2 = { x ∈ R2 : ∥x∥2 ≤ 1 }.

Two-dimensional unit ball of L∞ -norm, i.e., B∞ = { x ∈ R2 : ∥x∥∞ ≤ 1 }.


Two-dimensional unit balls of L1 , L2 and L∞ norms shown together.


Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values

∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup { yT x : ∥y∥ ≤ 1 }

The above definition also corresponds to a norm: it is convex, as it is the


pointwise maximum of convex (in fact, linear) functions y → xT y; it is
homogeneous of degree 1, that is, ∥αx∥∗ = α∥x∥∗ for every x in RN and
α ≥ 0.

By definition of the dual norm,

xT y ≤ ∥x∥ · ∥y∥∗

This can be seen as a generalized version of the Cauchy-Schwarz


inequality, which corresponds to the Euclidean norm.

The dual to the dual norm above is the original norm, i.e.,

∥x∥∗∗ = ∥x∥


- The norm dual to the Euclidean norm is itself. This comes directly from
the Cauchy-Schwarz inequality.

∥x∥2∗ = ∥x∥2

- The norm dual to the L∞ -norm is the L1 -norm, or vice versa.

∥x∥∞∗ = ∥x∥1 and ∥x∥1∗ = ∥x∥∞

- More generally, the dual of the Lp -norm is the Lq -norm

∥x∥p∗ = ∥x∥q

where q = p / (p − 1), i.e., 1/p + 1/q = 1.


An eigenvector of an N × N square matrix A is a non-zero vector v


that, when multiplied by A, yields the original vector multiplied by a
single number λ, i.e.,
Av = λv
The number λ is called the eigenvalue of A corresponding to v.

Thus, in order to find the eigenvalues of A, we solve the above equation


for λ

Av − λv = 0
(A − λI)v = 0

where I is the N × N identity matrix. It is a fundamental result of linear


algebra that an equation Mv = 0 has a non-zero solution v if and only if
the determinant of the matrix M is zero, i.e. det(M) = 0. It follows that
the eigenvalues of A are precisely the values of λ that satisfy the
characteristic equation
det (A − λI) = 0


Note that the condition number, κ(·), of a matrix is given by the ratio of the
largest and the smallest eigenvalue, e.g., for the Hessian H(x),

κ(H(x)) = max_i λi / min_i λi

If the condition number is close to one, the matrix is well-conditioned


which means its inverse can be computed with good accuracy. If the
condition number is large, then the matrix is said to be ill-conditioned.
Practically, such a matrix is almost singular, and the computation of its
inverse, or solution of a linear system of equations is prone to large
numerical errors. A matrix that is not invertible has the condition number
equal to infinity.
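Eigenvalues and the condition number can be checked numerically; a sketch with an arbitrary symmetric 2 × 2 matrix:

import numpy as np

A = np.array([[8.0, -1.0],
              [-1.0, 1.0]])              # symmetric matrix

lam, V = np.linalg.eig(A)                # eigenvalues and eigenvectors
print(lam)                               # the roots of det(A - lambda I) = 0
print(np.allclose(A @ V[:, 0], lam[0] * V[:, 0]))   # A v = lambda v -> True

print(max(abs(lam)) / min(abs(lam)))     # ratio of largest to smallest eigenvalue
print(np.linalg.cond(A, 2))              # the same value for a symmetric matrix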


An N × N symmetric matrix A is called positive-semidefinite if


xT Ax ≥ 0
for all x ∈ RN , and is called positive-definite if
xT Ax > 0
for all nonzero x ∈ RN .

If A is positive-semidefinite, it is denoted by
A⪰0
and has nonnegative eigenvalues.

If A is positive-definite, it is denoted by
A≻0
and has positive eigenvalues.
For any real matrix B, the matrix BT B is positive-semidefinite, and
rank (B) = rank (BT B).
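A minimal sketch checking definiteness through the eigenvalues, together with the BT B property (matrices chosen arbitrarily):

import numpy as np

A = np.array([[2.0, 0.0],
              [0.0, 3.0]])
print(np.linalg.eigvalsh(A))                    # [2. 3.], all > 0 => A is positive definite

rng = np.random.default_rng(0)
B = rng.standard_normal((4, 3))                 # any real matrix
G = B.T @ B
print(np.all(np.linalg.eigvalsh(G) >= -1e-12))  # eigenvalues >= 0 (up to round-off) => PSD
print(np.linalg.matrix_rank(B) == np.linalg.matrix_rank(G))   # rank(B) = rank(BT B) -> True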

An N × N symmetric matrix A is called negative-semidefinite if

xT Ax ≤ 0

for all x ∈ RN , and is called negative-definite if

xT Ax < 0

for all nonzero x ∈ RN .

If A is negative-semidefinite, it is denoted by

A⪯0

and has nonpositive eigenvalues.

If A is negative-definite, it is denoted by

A≺0

and has negative eigenvalues.


Quadratic norm: A generalized quadratic norm of x is defined by

∥x∥P = (xT Px)^{1/2} = ∥P1/2 x∥2 = ∥Mx∥2

where P = MT M is an N × N symmetric positive definite (SPD) matrix.

When P = I, the quadratic norm is equal to the Euclidean norm.

The dual of the quadratic norm is given by

∥x∥P∗ = ∥x∥Q = (xT Qx)^{1/2}

where Q = P−1 , i.e.,

∥x∥P∗ = (xT P−1 x)^{1/2}
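A small numeric check of the quadratic-norm identities (a sketch; P is an arbitrary SPD matrix built as MT M):

import numpy as np
from scipy.linalg import sqrtm

rng = np.random.default_rng(1)
M = rng.standard_normal((3, 3)) + 3.0 * np.eye(3)   # a well-conditioned square matrix
P = M.T @ M                                         # SPD by construction
x = rng.standard_normal(3)

print(np.sqrt(x @ P @ x))                     # (xT P x)^(1/2)
print(np.linalg.norm(np.real(sqrtm(P)) @ x))  # ||P^(1/2) x||_2, the same value
print(np.linalg.norm(M @ x))                  # ||M x||_2,       the same value
print(np.sqrt(x @ np.linalg.solve(P, x)))     # dual quadratic norm (xT P^-1 x)^(1/2)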



The notation ∂/∂x stands for the multidimensional partial derivative
operator in the form of a column vector, i.e.

∂/∂x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T .

Consider the following example

∂f (x)/∂x = [ ∂f (x)/∂x1   ∂f (x)/∂x2   · · ·   ∂f (x)/∂xN ]T .


The ∇x denotes the multidimensional gradient (or del)


operator in the form of a column vector, similar to the
multidimensional partial derivative operator. In general we are
going to omit the subscript x, i.e.

∇ = ∇x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T .

Consider the following example

∇f (x) = ∇x f (x) = ∂f (x)/∂x
       = [ ∂f (x)/∂x1   ∂f (x)/∂x2   · · ·   ∂f (x)/∂xN ]T .

Note that the ∇ operator is not commutative, i.e.,

∇f (x) ̸= f (x)∇

The directional derivative in the direction d is given by

(d · ∇)f (x) = d · ∇f (x) = ∇f (x) · d
             = dT ∇f (x) = ∇T f (x) d
             = d1 ∂f (x)/∂x1 + d2 ∂f (x)/∂x2 + · · · + dN ∂f (x)/∂xN

where d = [ d1 d2 · · · dN ]T .
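The directional derivative can be checked with a finite difference; a small sketch (the test function, point and direction below are arbitrary illustrative choices):

import numpy as np

def f(x):
    return x[0]**2 + 3.0 * x[0] * x[1] + x[1]**2        # an example scalar function

def grad_f(x):
    return np.array([2.0 * x[0] + 3.0 * x[1],
                     3.0 * x[0] + 2.0 * x[1]])           # its gradient, computed by hand

x0 = np.array([1.0, -2.0])
d = np.array([0.6, 0.8])

analytic = d @ grad_f(x0)                                # dT grad f(x0) = -3.2
eps = 1e-6
numeric = (f(x0 + eps * d) - f(x0)) / eps                # one-sided finite difference
print(analytic, numeric)                                 # agree to roughly 1e-5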


The ∇2 denotes the Hessian operator (∇∇T ) or Hessian matrix (symmetric),
not the Laplace operator ∇ · ∇ (or ∇T ∇).

∇2 = ∇ ⊗ ∇ = ∇∇T
   = [ ∂²/∂x1∂x1    ∂²/∂x1∂x2    · · ·   ∂²/∂x1∂xN
       ∂²/∂x2∂x1    ∂²/∂x2∂x2    · · ·   ∂²/∂x2∂xN
           ..           ..        ..         ..
       ∂²/∂xN∂x1    ∂²/∂xN∂x2    · · ·   ∂²/∂xN∂xN ]

H(x) = ∇2 f (x) = ∇∇T f (x)
     = [ ∂²f (x)/∂x1∂x1    ∂²f (x)/∂x1∂x2    · · ·   ∂²f (x)/∂x1∂xN
         ∂²f (x)/∂x2∂x1    ∂²f (x)/∂x2∂x2    · · ·   ∂²f (x)/∂x2∂xN
               ..                ..            ..          ..
         ∂²f (x)/∂xN∂x1    ∂²f (x)/∂xN∂x2    · · ·   ∂²f (x)/∂xN∂xN ]

For a vector function f (x), (∇f T (x))T gives the Jacobian matrix

Jf (x) = [ ∂f (x)/∂xi ]M×N
       = [ ∂f1 (x)/∂x1    ∂f1 (x)/∂x2    · · ·   ∂f1 (x)/∂xN
           ∂f2 (x)/∂x1    ∂f2 (x)/∂x2    · · ·   ∂f2 (x)/∂xN
                ..             ..          ..         ..
           ∂fM (x)/∂x1    ∂fM (x)/∂x2    · · ·   ∂fM (x)/∂xN ]

where f (x) = [ f1 (x) f2 (x) · · · fM (x) ]T and x = [ x1 x2 · · · xN ]T .

Here,

Jf (x) = (∇f T (x))T
JTf (x) = ∇f T (x)

Sometimes ∇f (x) is also used to denote the Jacobian matrix.



The directional gradient in the direction d for a vector


function f (x) would be given by
(d · ∇)f (x) = (dT ∇f T (x))T = (∇f T (x))T d
             = (dT JTf (x))T = Jf (x) d

             = [ d1 ∂f1 (x)/∂x1 + d2 ∂f1 (x)/∂x2 + · · · + dN ∂f1 (x)/∂xN
                 d1 ∂f2 (x)/∂x1 + d2 ∂f2 (x)/∂x2 + · · · + dN ∂f2 (x)/∂xN
                                        ..
                 d1 ∂fN (x)/∂x1 + d2 ∂fN (x)/∂x2 + · · · + dN ∂fN (x)/∂xN ]

where f (x) = [ f1 (x) f2 (x) · · · fN (x) ]T and


x = [ x1 x2 · · · xN ]T .


The Taylor series of a real or complex-valued function f (x) that is


infinitely differentiable in a neighborhood of a point x0 is the power series
f (x) = f (x0 ) + (f ′ (x0 )/1!)(x − x0 ) + (f ′′ (x0 )/2!)(x − x0 )² + (f ′′′ (x0 )/3!)(x − x0 )³ + · · ·

f (x0 + ∆x) = f (x0 ) + (f ′ (x0 )/1!)∆x + (f ′′ (x0 )/2!)(∆x)² + (f ′′′ (x0 )/3!)(∆x)³ + · · ·

∆f = f ′ (x0 )∆x + (1/2) f ′′ (x0 )(∆x)² + (1/6) f ′′′ (x0 )(∆x)³ + · · ·
where n! denotes the factorial of n, ∆x = x − x0 , ∆f = f (x) − f (x0 ),
and f ′ (x0 ), f ′′ (x0 ), f ′′′ (x0 ), . . . denotes the first, second, third, . . .
derivatives of f (x) evaluated at the point x0 , respectively.
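A quick numeric illustration of a truncated Taylor series (a sketch; f (x) = e^x is an arbitrary choice, so that all derivatives equal e^x):

import numpy as np

f = np.exp                       # f, f', f'', f''' are all exp for this choice
x0, dx = 1.0, 0.1

taylor3 = f(x0) * (1 + dx + dx**2 / 2 + dx**3 / 6)
print(f(x0 + dx))                # exact value            ~ 3.0041660
print(taylor3)                   # third-order expansion  ~ 3.0041545 (error ~ dx^4 / 4!)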


The min_{x∈RN} (or min_x ) operator returns the minimum value (i.e.
global minimum) of the f (x) function; similarly, the argmin_{x∈RN} (or
argmin_x ) operator returns the argument value x∗ that results in the
minimum value of the f (x) function, i.e.,

f (x∗ ) = min_x f (x) = min_{x∈RN} f (x)

x∗ = argmin_x f (x) = argmin_{x∈RN} f (x)


Similarly, min_{x∈L} gives the local minimum where the argument values are
restricted to the subdomain L, where L ⊂ RN , i.e.,

f (x∗ ) = min_{x∈L} f (x)

x∗ = argmin_{x∈L} f (x)

gives the local minimum.


The superscript (k) denotes the iteration level, for example, a


local minimum and its argument values at k-th iteration would
be defined as

f (x(k) ) = min_{x∈L(k)} f (x)

x(k) = argmin_{x∈L(k)} f (x)

respectively, where k = 1, 2, · · · K.


Geometric Definitions

Geometries that will be frequently used in our course:


- Hyperplane
- Halfspace
- Polyhedra
- Euclidean Balls and Ellipsoids
- Cones


Hyperplane

- Each equality constraint defines a different hyperplane. So, the


set of points

{ x | aT x = b; a, x ∈ RN , b ∈ R, a ̸= 0 }

constitutes a hyperplane; the vector a is normal to the hyperplane and the
points x lie on the hyperplane. Note that hyperplanes are
affine, i.e., preserve collinearity (all points lying on a line
initially still lie on a line after transformation).

aT x = b

- Hyperplane describes a plane in R3 (i.e., 3D), (as shown in


the figure above) and describes a line in R2 (i.e., 2D), (as
shown on the next two figures).

Halfspace

- Each inequality constraint defines a different halfspace. So,


the set of points

{ x | aT x ≥ b; a, x ∈ RN , b ∈ R, a ̸= 0 }

or

{ x | aT x ≤ b; a, x ∈ RN , b ∈ R, a ̸= 0 }

constitutes a halfspace.

The half-space { x | 2x1 + x2 ≤ 0 }
The half-space { x | 2x1 + x2 ≤ 3 }



Polyhedra

- The solution set of finitely many linear inequalities & equalities constitutes
a polyhedron,

{ x | Ax ≤ b, Cx = d; A ∈ RM×N , C ∈ RP×N , b ∈ RM , d ∈ RP , x ∈ RN }

In other words, a polyhedron is the intersection of a finite number of
halfspaces and hyperplanes


Example 1:
- Find the solution of

max x1 + 2x2 + 3x3


s.t. x1 + x2 + x3 = 1
x1 , x2 , x3 ≥ 0
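A numerical solution sketch for this example using scipy.optimize.linprog (not part of the original notes; linprog minimizes, so the objective is negated):

import numpy as np
from scipy.optimize import linprog

# maximize x1 + 2 x2 + 3 x3  ==  minimize -(x1 + 2 x2 + 3 x3)
c = np.array([-1.0, -2.0, -3.0])
res = linprog(c,
              A_eq=np.array([[1.0, 1.0, 1.0]]), b_eq=np.array([1.0]),
              bounds=[(0, None)] * 3)
print(res.x, -res.fun)    # optimum at x = [0, 0, 1] with value 3: all weight on x3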


Example 2:
- Consider the pyramid in the following figure. The length of
each side is 1 unit. Find the halfspaces defining this volume.

- Solution:

[ 0        1           0       ] x ≥ 0
[ 1/√2    −1/(2√3)    −1/√6    ] x ≥ 0
[ 0       −1/(2√3)     √(2/3)  ] x ≥ 0
[ −1/√2   −1/(2√3)    −1/√6    ] x ≥ −1/√2

where x = [ x1 x2 x3 ]T .


Euclidean balls & ellipsoids

- The set of points B(xc , r) = {x | ∥x − xc ∥ ≤ r} = {xc + ru | ∥u∥ ≤ 1}


form a ball with respect to the Euclidean norm with center xc and radius r.

- The set of points BP (xc , r) = { x | (x − xc )T P (x − xc ) ≤ r } with




respect to the quadratic norm with symmetric positive definite (SPD)


matrix P form an ellipsoid. The axes of the ellipsoid are related with the
eigenvalues of P.


Change of coordinates:
- Let y = P1/2 x, then the ellipsoid becomes { y | (y − yc )T (y − yc ) ≤ r },
i.e., a ball with respect to the Euclidean norm with center yc and radius r.


Cone

- The set of points { x | x = θ1 x1 + θ2 x2 ; x1 , x2 ∈ RN , θ1 , θ2 ≥ 0 } forms a
cone.


Example 3:
- Find region x1 + 2 x2 ≥ 4, i.e. [ 1  2 ] [ x1  x2 ]T ≥ 4, and x1 , x2 ≥ 0

- Solution:

- Example 4: Find region

[ 1  2 ] [ x1 ]   ≥   [ 6 ] ,      [ x1 ]   ≥   [ 0 ]
[ 2  1 ] [ x2 ]       [ 6 ]        [ x2 ]       [ 0 ]
- Solution:


- Example 5: Find region x1 + 2 x2 + 3 x3 = 1, and x1 , x2 , x3 ≥ 0

- Solution:

- Example 6: Find region 1 ≤ x1 ≤ 2, 0 ≤ x2 ≤ 1 and 4 ≤ x3 ≤ 5

- Solution:


- Example 7:

min x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0

- Solution:
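The slides solve these small linear programs graphically; a numerical check with scipy.optimize.linprog (a sketch, added for illustration) looks like this:

from scipy.optimize import linprog

# minimize x1 + x2  s.t.  x1 + 2 x2 >= 6,  2 x1 + x2 >= 6,  x1, x2 >= 0
# linprog expects "<=" constraints, so both inequalities are multiplied by -1
res = linprog(c=[1.0, 1.0],
              A_ub=[[-1.0, -2.0], [-2.0, -1.0]],
              b_ub=[-6.0, -6.0],
              bounds=[(0, None), (0, None)])
print(res.x, res.fun)      # optimum at x = [2, 2] with objective value 4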


- Example 8:

min 3x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0

- Solution:


- Example 9:

min x1 − x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0

- Solution:


Overview

- Unconstrained Optimization
- Constrained Optimization


Unconstrained Optimization

- Unconstrained optimization is to find the point which minimizes the cost
function f (x) without being subject to any constraints.
- In other words,

min f (x)
x∈X

where x = [ x1 x2 · · · xN ]T , X ⊂ RN and x∗ is a feasible


solution (or point) with

f (x∗ ) = min f (x)


x∈X


Constrained Optimization

- Constrained optimization is to find the point which minimizes the cost
function f (x) subject to some equality and/or inequality constraints.
- In other words,

min f (x)
x∈X
subject to g(x) ≤ 0
h(x) = 0

- where x = [ x1 x2 · · · xN ]T , X ⊂ RN ,
g(x) = [ g1 (x) g2 (x) · · · gM (x) ]T are the inequality
constraints, h(x) = [ h1 (x) h2 (x) · · · hL (x) ]T are the
equality constraints and x∗ is a feasible solution (or point)
iff g ≤ 0 and h = 0 with

f (x∗ ) = min f (x)


x∈X


- Question: What if we want to maximize f (x)?

- max_x f (x) = −min_x (−f (x)), i.e., maximizing f (x) is equivalent to
minimizing −f (x).

- Examples:
- Computer networks - e.g., optimum routing problem
- Production planning in a factory
- Resource allocation
- Computer aided design (CAD) - e.g., shortest paths in a
PCB
- Travelling salesman problem


- A Diet Problem: Find the "most economical" diet that


satisfies the minimum nutrition requirements for good health.
- N different foods
- price of i-th food is ci
- M basic nutritional ingredients
- an individual must take "at least" bj units of j-th
nutrient per day
- each unit of i-th food contains Aj,i units of the j-th
nutrient


Formulation:
- Variable: amount of food xi for i-th food. Thus, variables can
be represented by the vector x

x = [ x1 x2 · · · xi · · · xN ]T

where i = 1, 2, · · · , N .
Amount of food cannot be negative, i.e., xi ≥ 0, so

x≥0

- Cost:

f (x) = c1 x1 + c2 x2 + · · · + ci xi + · · · + cN xN
= cT x

- Constraints:

A1,1 x1 + A1,2 x2 + · · · + A1,i xi + · · · + A1,N xN ≥ b1
A2,1 x1 + A2,2 x2 + · · · + A2,i xi + · · · + A2,N xN ≥ b2
        ..                 ..                 ..
Aj,1 x1 + Aj,2 x2 + · · · + Aj,i xi + · · · + Aj,N xN ≥ bj
        ..                 ..                 ..
AM,1 x1 + AM,2 x2 + · · · + AM,i xi + · · · + AM,N xN ≥ bM

or, in matrix form, Ax ≥ b


- Thus, the optimization formulation for this diet problem is


formed as

min cT x
x
s.t. b − Ax ≤ 0
−x≤0
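With concrete numbers for c, A and b this is exactly the form accepted by a linear-programming solver. A sketch with a small hypothetical instance (all numbers invented for illustration):

import numpy as np
from scipy.optimize import linprog

c = np.array([2.0, 3.0, 1.5])            # price of each food (hypothetical)
A = np.array([[1.0, 2.0, 0.5],           # A[j, i]: units of nutrient j in food i (hypothetical)
              [2.0, 1.0, 1.0]])
b = np.array([8.0, 10.0])                # minimum daily amount of each nutrient

# min cT x  s.t.  b - A x <= 0  (i.e. A x >= b)  and  -x <= 0
res = linprog(c, A_ub=-A, b_ub=-b, bounds=[(0, None)] * 3)
print(res.x, res.fun)                    # cheapest feasible diet and its cost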


Basic Definitions

Definition: Ball centered at point x0 with radius ε


B(x0 , ε) = {x | ∥x − x0 ∥ < ε}
Local/global, strict/non-strict, minima/maxima
Consider the optimization problem
min f (x)
x

s.t. x ∈ F (including constraints)



- Definition: x∗ ∈ F is a local minimum if

∃ε > 0 ⇒ f (x∗ ) ≤ f (y), ∀y ∈ B(x∗ , ε) ∩ F, y ̸= x∗

- Definition: x∗ ∈ F is a global minimum if

f (x∗ ) ≤ f (y), ∀y ∈ F, y ̸= x∗

- Definition: x∗ ∈ F is a strict local minimum if

∃ε > 0 ⇒ f (x∗ ) < f (y), ∀y ∈ B(x∗ , ε) ∩ F, y ̸= x∗

- Definition: x∗ ∈ F is a strict global minimum if

f (x∗ ) < f (y), ∀y ∈ F, y ̸= x∗

- Definition: For strict/non-strict and local/global maximum, change ≤


and < to ≥ and >, respectively, in the above definitions.


Gradient and Hessian

- Let f (x) : X → R, X ⊂ RN , f (x) is differentiable at x0 ∈ X


- if
∃ ∇f (x0 ), ∀x ∈ X
where ∇f (x0 ) is the gradient of f (x) at point x0 .

- In the neighbourhood of a point x0 , first order approximation of f (x) can


be given as

f (x) = f (x0 ) + ∇T f (x0 )(x − x0 ) + α(x0 , x − x0 )

where lim_{(x−x0)→0} α(x0 , x − x0 ) = 0. In other words,

∆f = ∇T f (x0 )∆x + α(x0 , ∆x)

where ∆f = f (x) − f (x0 ), ∆x = x − x0 and lim_{∆x→0} α(x0 , ∆x) = 0.

- f (x) is differentiable on X if f (x) is differentiable on ∀x ∈ X



- Example 10: Let f (x) = 3 x1² x2³ + x2² x3³ , then

∇f (x) = [ 6 x1 x2³     9 x1² x2² + 2 x2 x3³     3 x2² x3² ]T

- Remember: In the neighbourhood of a point x0 , a function f (x) can be


approximated by a second order Taylor series expansion
f (x) = f (x0 ) + f ′ (x0 )(x − x0 ) + (1/2) f ′′ (x0 )(x − x0 )² + residual

where the first order derivative f ′ (x) of f (x) around x0 is given by

f ′ (x0 ) = lim_{(x−x0)→0} ( f (x) − f (x0 ) ) / ( x − x0 ) = lim_{∆x→0} ∆f / ∆x

where ∆x = x − x0 , ∆f = f (x) − f (x0 ). In terms of the differences,

∆f = f ′ (x0 )∆x + (1/2) f ′′ (x0 )(∆x)² + residual
2


- Second order Taylor series expansion in the neighbourhood of a point x0


is given by
f (x) = f (x0 ) + ∇T f (x0 )(x − x0 ) + (1/2)(x − x0 )T ∇2 f (x0 )(x − x0 ) + residual

where ∇2 f (x0 ) = H(x0 ) is called the Hessian (matrix) of f (x) at point x0 given by

H(x) = ∇2 f (x) = ∇∇T f (x) = [ ∂²f (x) / ∂xi ∂xj ]N×N

- Definition: The directional derivative of f (x) at x0 in the direction d is


defined by
∇T f (x0 ) d = lim_{λ→0} ( f (x0 + λd) − f (x0 ) ) / λ


- f (x) is twice differentiable at x0 ∈ X,


- if
∃ ∇f (x0 ) and ∃ H(x0 ), ∀x ∈ X
where H(x0 ) is an N × N symmetric matrix representing the
Hessian of f (x) at x = x0 .

- In the neighbourhood of a point x0 , second order approximation of f (x)


can be given as
f (x) = f (x0 ) + ∇T f (x0 )(x − x0 ) + (1/2)(x − x0 )T H(x0 )(x − x0 ) + β(x0 , x − x0 )

where lim_{(x−x0)→0} β(x0 , x − x0 ) = 0. Similarly,

∆f = ∇T f (x0 )∆x + (1/2) ∆xT H(x0 )∆x + β(x0 , ∆x)

where lim_{∆x→0} β(x0 , ∆x) = 0

- f (x) is twice differentiable on X if f (x) is twice differentiable on ∀x ∈ X



Example 11:
- Let f (x) = 3 x1² x2³ + x2² x3³ as in the previous example,

H(x) = ∇2 f (x) = [ 6 x2³         18 x1 x2²              0
                    18 x1 x2²     18 x1² x2 + 2 x3³      6 x2 x3²
                    0             6 x2 x3²               6 x2² x3 ]
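The gradient of Example 10 and the Hessian above can be verified against finite differences; a short sketch (the test point is arbitrary):

import numpy as np

def f(x):
    x1, x2, x3 = x
    return 3 * x1**2 * x2**3 + x2**2 * x3**3

def grad(x):
    x1, x2, x3 = x
    return np.array([6 * x1 * x2**3,
                     9 * x1**2 * x2**2 + 2 * x2 * x3**3,
                     3 * x2**2 * x3**2])

x0 = np.array([1.0, 2.0, 3.0])
eps = 1e-6
num_grad = np.array([(f(x0 + eps * e) - f(x0 - eps * e)) / (2 * eps) for e in np.eye(3)])
print(np.allclose(num_grad, grad(x0), rtol=1e-5))     # True

# each Hessian column is a finite difference of the gradient
num_hess = np.column_stack([(grad(x0 + eps * e) - grad(x0 - eps * e)) / (2 * eps)
                            for e in np.eye(3)])
print(np.round(num_hess, 2))                          # matches H(x) at x0 and is symmetric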


Positive Semidefinite & Positive Definite Matrices

An N × N matrix M is called
- positive definite (M ≻ 0), if xT Mx > 0, ∀x ∈ RN , x ̸= 0,
or, all eigenvalues of M are positive

- positive semidefinite (M ⪰ 0), if xT Mx ≥ 0, ∀x ∈ RN , x ̸= 0,


or, all eigenvalues of M are nonnegative

- negative definite (M ≺ 0), if xT Mx < 0, ∀x ∈ RN , x ̸= 0,


or, all eigenvalues of M are negative

- negative semidefinite (M ⪯ 0), if xT Mx ≤ 0, ∀x ∈ RN , x ̸= 0,


or, all eigenvalues of M are nonpositive

- indefinite, if ∃x, y ∈ RN , xT Mx > 0, yT My < 0,


or, some eigenvalues of M are positive and some are negative


Example 12:
- M = [ 2  0
        0  3 ]  ≻ 0, positive definite

  M = [ 8  −1
        −1  1 ]  ≻ 0, positive definite
Check xT Mx or check eigenvalues.

- Hint: Recall that for x ∈ R, f (x) = ax2 ⇒ f ′′ (x) = 2a. Thus the
function f (x) is convex if a > 0 or concave if a < 0.

Thus, positive/negative definiteness of the Hessian is related to convexity.


Optimality Conditions for Unconstrained Problems

Problem:
min f (x)
x∈X

- Theorem (sufficient condition): x∗ is a strict local minimum if

∇f (x∗ ) = 0 and H(x∗ ) ≻ 0

- Theorem (necessary condition): if x∗ is a local minimum, then

∇f (x∗ ) = 0 and H(x∗ ) ⪰ 0

- Definition: d is a descent direction of f (x) at point x0 if

f (x0 +εd) < f (x0 ), ∀ε > 0 and ε is sufficiently small (ε → 0)


- Theorem: Assume that f (x) is differentiable at x0


∃d : ∇T f (x0 )d < 0 ⇒ ∀λ > 0 (λ → 0), f (x0 + λd) < f (x0 )

Hence, d is a descent direction of f (x) at x = x0 .

- Proof: We know that (from the definition of the gradient)

f (x0 + λd) = f (x0 ) + ∇T f (x0 )(λd) + λ∥d∥α(x0 , λd)

where lim_{λ→0} α(x0 , λd) = 0. Then

( f (x0 + λd) − f (x0 ) ) / λ = ∇T f (x0 )d + residual (→ 0 as λ → 0)

where ∇T f (x0 )d < 0 (given). Hence f (x0 + λd) − f (x0 ) < 0 as λ → 0 (λ > 0)

- Corollary: (First order necessary optimality condition) If x∗ is a local


minimum then ∇f (x∗ ) = 0.

- Proof: If ∇f (x∗ ) ̸= 0, then d = −∇f (x∗ ) would be a descent direction


and x∗ would not be a local minimum (at x∗ , there would still be room
for decrease in f (x))

- Theorem: (Second order necessary optimality condition) Suppose f (x)
is twice differentiable at x∗ . If x∗ is a local minimum, then H(x∗ ) ⪰ 0

- Proof: We know that ∇f (x∗ ) = 0. Now, suppose that H(x∗ ) is not positive
semidefinite, i.e.,

∃d : dT H(x∗ )d < 0

f (x∗ + λd) = f (x∗ ) + λ∇T f (x∗ ) d + (1/2) λ² dT H(x∗ )d + residual
            = f (x∗ ) + (1/2) λ² dT H(x∗ )d + residual     (since ∇f (x∗ ) = 0)

If we rearrange

( f (x∗ + λd) − f (x∗ ) ) / λ² = (1/2) dT H(x∗ )d + residual (→ 0 as λ → 0)

where dT H(x∗ )d < 0, then

f (x∗ + λd) − f (x∗ ) < 0, ∀λ > 0, λ → 0

⇒ x∗ is not a local minimum. Contradiction!

- Note: If ∇f (x0 ) = 0 and H(x0 ) is positive semidefinite, i.e., H(x0 ) ⪰ 0,


point x0 may not be a (local) minimum.

- Theorem: (Sufficient condition for local optimality) If ∇f (x0 ) = 0 and


H(x0 ) is positive definite, i.e., H(x0 ) ≻ 0, then x = x0 is a strict local
minimum.

- Proof:
( f (x∗ + λd) − f (x∗ ) ) / λ² = (1/2) dT H(x∗ )d + residual (→ 0 as λ → 0)

where dT H(x∗ )d > 0, then

f (x∗ + λd) − f (x∗ ) > 0, ∀λ > 0, λ → 0

⇒ x∗ is a strict local minimum.

- Semi-definiteness does not guarantee minimum or maximum (e.g. it can


be a saddle point, as shown below)

- If ∇f (x0 ) = 0 and H(x0 ) is positive (negative) definite, then x0 is a


strict local minimum (maximum).

- Example 13:

H(x0 ) = [ 1  1
           1  4 ]  ≻ 0    and    ∇f (x0 ) = 0

Point x0 satisfies the sufficient conditions, so point x0 is a strict local minimum.

- Example 14: Let f (x) = x1³ + x2² , then

- ∇f (x) = [ 3x1²   2x2 ]T   and   H(x) = [ 6x1  0
                                             0    2 ]

- Point ∇f (x) = 0 at x0 = [ 0  0 ]T , but H(x0 ) = [ 0  0
                                                      0  2 ] is only positive
semidefinite, so x0 may or may not be a local minimum. Note that
f (x0 ) = f (0, 0) = 0.

f (−ε, 0) = −ε³ < 0 = f (x0 ), ∀ε > 0 ⇒ x0 is not a local minimum.


- Example 15: Let f (x) = x1⁴ + x2² , then

- ∇f (x) = [ 4x1³   2x2 ]T   and   H(x) = [ 12x1²  0
                                             0      2 ]

- Point ∇f (x) = 0 at x0 = [ 0  0 ]T and H(x0 ) = [ 0  0
                                                    0  2 ] . H(x0 ) is positive
semidefinite, x0 may or may not be a local minimum. Note that
f (x0 ) = f (0, 0) = 0.

∀x, f (x) ≥ 0 = f (x0 ) ⇒ x0 is a local minimum.

- Example 16: min f (x) = min log ( Σ_{i=1}^{N} exp ( aiT x + bi ) )

- First order optimality condition ∇f (x) = 0.

∇f (x) = Σ_{i=1}^{N} ai exp ( aiT x + bi ) · ( 1 / Σ_{i=1}^{N} exp ( aiT x + bi ) ) = 0

But there is no analytical solution, so solution can be obtained


numerically by an iterative algorithm.
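A minimal fixed-step gradient-descent sketch for this objective (the data ai , bi , the step size and the iteration count are arbitrary illustrative choices; descent methods are treated properly later in the course):

import numpy as np

A = np.array([[1.0, 0.0], [-1.0, 0.0],      # rows are ai^T (hypothetical data chosen so
              [0.0, 1.0], [0.0, -1.0]])     # that the minimum exists)
b = np.array([0.5, -0.5, 1.0, 0.0])

def f(x):
    return np.log(np.sum(np.exp(A @ x + b)))

def grad(x):
    w = np.exp(A @ x + b)
    return A.T @ (w / w.sum())              # sum_i ai exp(ai^T x + bi) / sum_i exp(ai^T x + bi)

x = np.zeros(2)
for _ in range(500):
    x = x - 0.1 * grad(x)                   # fixed step size, no line search

print(x, np.linalg.norm(grad(x)))           # x ~ [-0.5, -0.5] for this data, gradient ~ 0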

Convex Sets

- Definition: A set C ⊆ RN is said to be convex if


x1 , x2 ∈ C and 0 ≤ θ ≤ 1 ⇒ θx1 + (1 − θ)x2 ∈ C
- θx1 + (1 − θ)x2 defines a line segment between points x1 and x2 .

Some simple convex and nonconvex sets are shown below.

Left: The hexagon, which includes its boundary (shown darker), is convex.
Middle: The kidney shaped set is not convex, since the line segment between
the two points in the set shown as dots is not contained in the set. Right: The
square contains some boundary points but not others, and is not convex.

- Convex Combination: for points x1 , x2 , x3 , . . . , xk ∈ C and θi ≥ 0 with
Σ_{i=1}^{k} θi = 1: if every such x = θ1 x1 + θ2 x2 + · · · + θk xk ∈ C, then C is a convex set.

- Convex Hull: Convex hull is a set of all convex combinations of the


points in S.

- For example, the convex hulls of two sets in R2 are given below.

Left: The convex hull of a set of fifteen points (shown as dots) is the
pentagon (shown shaded). Right: The convex hull of the kidney shaped
set in the middle of the convex set is the whole shaded set.


Convex and Concave Functions

- Definition: A function f : C → R is convex if

dom f = C ⊆ RN

is a convex set and satisfies Jensen’s inequality given below

f (θx1 + (1 − θ)x2 ) ≤ θf (x1 ) + (1 − θ)f (x2 )

∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.


- Definition: A function f : C → R is concave if

dom f = C ⊆ RN

is a convex set and

f (θx1 + (1 − θ)x2 ) ≥ θf (x1 ) + (1 − θ)f (x2 )

∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.

- Note: If f is convex (concave), then −f is concave (convex).

- Note: If f is strictly convex or strictly concave, then the equality signs in


the inequalities are removed.

- For example, f is strictly convex if

dom f = C ⊆ RN

is a convex set and

f (θx1 + (1 − θ)x2 ) < θf (x1 ) + (1 − θ)f (x2 )

∀x1 , x2 ∈ C and 0 < θ < 1.



- Note: An affine function

f (x) = aT x + b (or f (x) = Ax + b)

with
dom f : convex
satisfies
f (θx1 + (1 − θ)x2 ) = θf (x1 ) + (1 − θ)f (x2 )
Hence it can be considered as convex or concave depending on the
problem.


Examples on R: (scalar)
- Convex:
- affine: ax + b, on R, for any a, b ∈ R
- exponential: eax , on R, for any a ∈ R
- powers: xα , on R++ , for α ≥ 1 or α ≤ 0
- powers of absolute value: |x|α , on R, for α ≥ 1
- negative entropy: x log x, on R++

- Concave:
- affine: ax + b, on R, for any a, b ∈ R
- powers: xα , on R++ , for 0 ≤ α ≤ 1
- logarithm: log x, on R++


Examples on RN : (vectors)
- All norms (i.e. Lp -norm) are convex

∥x∥p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}    for p ≥ 1.

- Affine functions are convex and concave depending on the problem

f (x) = aT x + b


First Order Condition for Convexity

- Theorem: If f (x) is differentiable (i.e. ∇f (x) exists ∀x ∈ dom f (x)) and


dom f (x) is convex, then f (x) is convex iff

f (x) ≥ f (x0 ) + ∇T f (x0 )(x − x0 ), ∀x, x0 ∈ dom f (x)

- As shown in the figure below

first-order approximation of a convex function f (x) is a global


underestimator.


Second Order Conditions for Convexity

- Theorem: If f (x) is twice differentiable (i.e. H(x) exists ∀x ∈ dom f (x))


and dom f (x) is convex, then f (x) is convex iff

H(x) ⪰ 0, ∀x ∈ dom f (x)

- Theorem: If S is a convex set, f (x) : S → R is a convex function with


local minimum x∗ , then x∗ is a global minimum of f (x) over S.

- If f (x) is (strictly) convex, a local minimum is the (unique) global


minimum.

- If f (x) is (strictly) concave, a local maximum is the (unique) global


maximum.

- If f (x) is convex, then the global optimality condition that is both


necessary and sufficient is

- Theorem: Let f (x) : X → R be convex and differentiable on X. Then x0
is a global minimum iff ∇f (x0 ) = 0.

- Example 17: f (x) = (1/2) x1² + x1 x2 + 2x2² − 4x1 − 4x2 − x2³ with
dom f (x) = {(x1 , x2 ) | x2 < 0}

- ∇f (x) = [ x1 + x2 − 4    x1 + 4x2 − 4 − 3x2² ]T   and   H(x) = [ 1  1
                                                                    1  4 − 6x2 ] .

- H(x) ⪰ 0 on dom f (x), thus f (x) is convex.


Further Examples:
- Quadratic function
f (x) = (1/2) xT Px + qT x + r
∇f (x) = Px + q
H(x) = P

f (x) is convex if P ⪰ 0.

- Least-squares objective function

f (x) = ∥Ax − b∥2²
       = (Ax − b)T (Ax − b)
       = xT AT Ax − 2bT Ax + bT b
∇f (x) = 2AT (Ax − b)
H(x) = 2AT A

f (x) is convex for any A (Here f (x) is a quadratic function with


P = 2AT A, q = −2AT b and r = bT b).
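Setting the gradient to zero gives the normal equations AT A x = AT b; a small sketch comparing that solution with NumPy's least-squares routine (data are arbitrary):

import numpy as np

rng = np.random.default_rng(0)
A = rng.standard_normal((6, 3))                    # over-determined: 6 equations, 3 unknowns
b = rng.standard_normal(6)

x_normal = np.linalg.solve(A.T @ A, A.T @ b)       # solve the normal equations
x_lstsq, *_ = np.linalg.lstsq(A, b, rcond=None)    # library least-squares solver
print(np.allclose(x_normal, x_lstsq))              # True
print(np.linalg.norm(2 * A.T @ (A @ x_normal - b)))   # gradient at the solution, ~ 0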

- Log-sum-exp: The function


f (x) = log Σ_{i=1}^{N} e^{xi}

is convex over RN . This function can be interpreted as a differentiable (in
fact, analytic) approximation of the max function, since

max_i xi ≤ f (x) ≤ max_i xi + log N

for all x.

- Geometric mean

f (x) = ( Π_{i=1}^{N} xi )^{1/N}

is concave on RN++ .


Operations that Preserve Convexity

- Nonnegative multiple
αf (x), for α ≥ 0

- Sum (including infinite sums and integrals)


f1 (x) + f2 (x)

- Compositions with affine function


f (Ax + b)
are convex if f (x) is convex.

- Ex: Log barrier


f (x) = − Σ_{i=1}^{M} log ( bi − aiT x ) ,    dom f (x) = { x | aiT x < bi , i = 1, . . . , M }

- and norm of an affine function


f (x) = ∥Ax + b∥

- Pointwise maximum
f (x) = max {f1 (x), f2 (x), . . . , fi (x), . . . , fM (x)}
is convex if the fi (x) are convex.

- Ex: Piecewise-linear function

f (x) = max_{i=1,...,M} ( aiT x + bi )

- Pointwise supremum
g(x) = sup f (x, y), ∀y ∈ A
y∈A

is convex where f (x, y) is convex in x for ∀y ∈ A.

- Ex: Distance to the farthest point in a set C (convex set)

f (x) = sup ∥x − y∥
y∈C


- Pointwise infimum
g(x) = inf f (x, y)
y∈C

is convex where f (x, y) is convex in (x, y) and C is a convex set.

- Ex: Distance to a set S (convex)

d(x, S) = inf ∥x − y∥
y∈S

- Composition with scalar functions

f (x) = h(g(x))

with g : RN → R and h : R → R is convex if g is convex, h is convex and


h is nondecreasing, or g is concave, h is convex and h is nonincreasing.

- Ex: eg(x) is convex if g is convex. Similarly,


1
g(x)
is convex if g is concave and positive.


Quadratic Functions, Forms and Optimization

- Definition: A quadratic function has the form (f : RN → R)


f (x) = (1/2) xT Qx + cT x + r
where Q ∈ RN ×N , x, c ∈ RN , r ∈ R and Q is a symmetric matrix.

- Quadratic optimization problem (Quadratic Program)


min (1/2) xT Qx + cT x + r
s.t. x ∈ RN

- Ex: Least-squares problem (Approximation of an over-determined linear


system Ax = b where A is an M × N matrix and M > N , i.e., the
number of equations is larger than the number of variables)

min ∥Ax − b∥22 = xT AT Ax − 2bT Ax + bT b


s.t. x ∈ RN
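A minimal numerical illustration of this least-squares problem (a sketch; A and b are randomly generated and the variable names are not part of the notes): the minimizer satisfies the normal equations A^T A x = A^T b.

import numpy as np

rng = np.random.default_rng(0)
M, N = 20, 5                      # over-determined: more equations than unknowns
A = rng.standard_normal((M, N))
b = rng.standard_normal(M)

# minimizer of ||Ax - b||_2^2 satisfies the normal equations A^T A x = A^T b
x_star = np.linalg.solve(A.T @ A, A.T @ b)
print(np.allclose(A.T @ (A @ x_star - b), 0.0, atol=1e-10))  # gradient 2A^T(Ax - b) vanishes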


- Property: Assuming (f (x) : RN → R) is twice differentiable at x = x0 ,


f (x) can be approximated by a quadratic function in the neighbourhood
of x0 (a very useful property for the Newton’s method).
min f (x) ∼= f (x0 ) + ∇T f (x0 )(x − x0 ) + (1/2)(x − x0 )T H(x0 )(x − x0 )
s.t. x ∈ RN

is a quadratic optimization problem.

- Solution of Quadratic Problem (QP)


f (x) = (1/2) xT Qx + cT x + r
∇f (x) = Qx + c
H(x) = Q

- You may find resources about taking the derivative of expressions with
matrix/vectors at
http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html
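As a small illustration of these formulas (an informal sketch; the particular Q, c, r below are arbitrary), the unconstrained minimizer of the quadratic solves ∇f (x) = Qx + c = 0, i.e. x∗ = −Q−1 c:

import numpy as np

Q = np.array([[4.0, 1.0],
              [1.0, 3.0]])        # symmetric positive definite (example choice)
c = np.array([-1.0, 2.0])
r = 0.5

f    = lambda x: 0.5 * x @ Q @ x + c @ x + r
grad = lambda x: Q @ x + c        # gradient of the quadratic
H    = Q                          # Hessian is constant

x_star = np.linalg.solve(Q, -c)   # stationary point: Qx + c = 0
print(grad(x_star))               # ~[0, 0]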

- Theorem: The function f (x) = 21 xT Qx + cT x + r is convex iff Q ⪰ 0.

- Proof: Apply Jensen's inequality

- Corollary:

- f (x) is strictly convex iff Q ≻ 0 or convex iff Q ⪰ 0

- f (x) is strictly concave iff Q ≺ 0 or concave iff Q ⪯ 0

- f (x) is neither convex nor concave iff Q is indefinite.


Optimality Conditions

- Theorem: Suppose Q is a symmetric positive semidefinite (SPSD) matrix.
Then f (x) = (1/2) xT Qx + cT x + r has its minimum at x∗ iff x∗ satisfies

∇f (x∗ ) = Qx∗ + c = 0

- Proof: Express f (x) as f (x) = f (x∗ + (x − x∗ )) and show that
f (x) ≥ f (x∗ ) ∀x ∈ RN by using Qx∗ + c = 0.

- Ex:

f (x) = xT x = ∥x∥22
f (x) = (x − a)T (x − a) = ∥x − a∥22
f (x) = (x − a)T P(x − a) = ∥x − a∥2P , where P is SPD.


Characteristics of Symmetric Matrices

- Definition: A matrix is an orthonormal matrix if MT = M−1

- Corollary: If M is orthonormal and y = Mx

∥y∥_2^2 = yT y = xT MT M x = ∥x∥_2^2   ⇒   ∥y∥2 = ∥x∥2   (since MT M = I)
- Recall: Mx = λx, λ ∈ R is an eigenvalue of M, x ∈ RN and x ̸= 0 is a
corresponding eigenvector. How many eigenvalues?

- Recall: A matrix Q is called symmetric if Q = QT .

- Proposition: If Q is a real symmetric matrix, then all of its eigenvalues


are real.

- Proposition: If Q is a real symmetric matrix, then its eigenvectors


corresponding to different eigenvalues are orthogonal.

- Proposition: If Q is a symmetric matrix with rank N , then it has N


distinct eigenvectors which constitute an orthonormal basis for RN .

- Proposition: If Q is SPSD, its eigenvalues are nonnegative.

- Proposition: If Q is an N × N square matrix, then the trace and determinant
of Q are equal to the sum and product of its eigenvalues, respectively:

tr(Q) = Σ_{i=1}^{N} Q_{i,i} = Σ_{i=1}^{N} λ_i ,        det(Q) = Π_{i=1}^{N} λ_i

- Proposition: (Eigendecomposition) If Q is a symmetric matrix, then


Q = RDRT . Columns of the orthonormal matrix R are the eigenvectors
of Q, D is a diagonal matrix with eigenvalues of Q on the main diagonal.

- Proposition: If Q is SPSD, then Q = MT M and M = D1/2 RT is a


square root of Q.

- Proposition: If Q is SPSD and xT Qx = 0, then Qx = 0.


- Proposition: If a symmetric matrix Q is positive definite (i.e. Q ≻ 0),


then Q is nonsingular (i.e. its inverse exists) as det Q > 0.

- Proposition: If Q is positive definite (i.e. Q ≻ 0), then any principal


sub-matrix of Q is also positive definite.

- Proposition: If Q is positive semi-definite (i.e. Q ⪰ 0), then any principal


sub-matrix of Q is also positive semi-definite.
 
- Proposition: If Q is a symmetric matrix and Q ≻ 0 and M = [ Q  c ; cT  b ],
then M ≻ 0 iff b > cT Q−1 c.


Unconstrained Minimization

- The aim is

min f (x)

where f (x) : RN → R is twice differentiable.

- The problem is solvable, i.e. a finite optimal point x∗ exists.

- The optimal value (finite) is given by

p∗ = inf_x f (x) = f (x∗ )   (> −∞)


Example 1: Quadratic program


1 T
min f (x) = x Qx − bT x + c
x∈RN 2

where Q : RN ×N is symmetric, b ∈ RN and c ∈ R.

- Necessary conditions:

∇f (x∗ ) = Qx∗ − b = 0
H(x∗ ) = Q ⪰ 0 (PSD)

- Q ≺ 0 ⇒ f (x) has no local minimum.
- Q ≻ 0 ⇒ x∗ = Q−1 b is the unique global minimum.
- Q ⪰ 0 but singular ⇒ either no solution or infinitely many solutions
(a small numerical illustration of the Q ≻ 0 case is sketched below).
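A minimal sketch of the Q ≻ 0 case (the particular Q and b below are arbitrary choices, not from the notes):

import numpy as np

Q = np.array([[2.0, 0.5],
              [0.5, 1.0]])                 # Q > 0 (positive definite)
b = np.array([1.0, -1.0])
c = 0.0

x_star = np.linalg.solve(Q, b)             # unique global minimum x* = Q^{-1} b
grad   = Q @ x_star - b                    # should vanish at x*
print(x_star, np.allclose(grad, 0.0))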


Example 2: Consider
1
min f (x1 , x2 ) = (αx21 + βx22 − x1 )
x1 ,x2 ∈R 2

- Here, let us first express the above equation in the quadratic program
form with

Q = [ α  γ ; γ  β ],    b = [ 1 ; 0 ]

where γ ∈ R; for simplicity we can take γ = 0 (so that Q is symmetric and diagonal). So,
- If α > 0 and β > 0, i.e. Q ≻ 0: x∗ = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0, i.e. Q ⪰ 0: infinite number of solutions, x∗ = (1/α, y), y ∈ R.
- If α = 0 and β > 0, i.e. Q ⪰ 0: no solution.
- If α < 0 and β > 0 (or α > 0 and β < 0), i.e. Q is indefinite: no solution.

[Surface plots of f (x1 , x2 ) over (x1 , x2 ) for the four cases: α > 0, β > 0;  α > 0, β = 0;  α = 0, β > 0;  α > 0, β < 0.]

- Two possibilities:
- {f (x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f (x) : x ∈ X} is bounded below ⇒ a global minimum exists, if
∥x∥ ≠ ∞.

- Then, unconstrained minimization methods


- produce sequence of points x(k) ∈ dom f (x) for k = 0, 1, . . . with

f (x(k) ) → p∗

- can be interpreted as iterative methods for solving the optimality


condition
∇f (x∗ ) = 0


Topography

(https://www.rei.com/learn/expert-advice/topo-maps-how-to-
use.html)

Descent Methods


Motivation

- If ∇f (x) ̸= 0, there is an interval (0, δ) of stepsizes such that


f (x − α∇f (x)) < f (x) ∀α ∈ (0, δ)
- If d makes an angle with ∇f (x) that is greater than 90◦ , i.e.,
∇T f (x) d < 0
∃ an interval (0, δ) of stepsizes such that
f (x + αd) < f (x) ∀α ∈ (0, δ)

Definition: The descent direction d is selected such that

∇T f (x) d < 0

Proposition: For a descent method

f (x(k+1) ) < f (x(k) )

except x(k) = x∗ .

Definition: Minimizing sequence is defined as

x(k+1) = x(k) + α(k) d(k)

where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration
k, and d(k) ∈ RN is the step or search direction.
- How to find optimum α(k) ? Line Search Algorithm
- How to find optimum d(k) ? Depends on the descent algorithm,
e.g. d = −∇f (x(k) ).


General Descent Method

- Given a starting point x(0) ∈ dom f (x)


- repeat
- Determine a descent direction d(k) ,
- Line search: Choose a stepsize α(k) > 0,
- Update: x(k+1) = x(k) + α(k) d(k) ,
- until stopping criterion is satisfied.

Example 3: Simplest method: Gradient Descent

x(k+1) = x(k) − α(k) ∇f (x(k) ), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −∇f (x(k) ).
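The general descent method above can be written as a generic loop; the following is an illustrative Python skeleton (not a reference implementation from the course) in which the direction rule and the step-size rule are supplied as functions, so that gradient descent corresponds to direction_fn = lambda x: -grad(x):

import numpy as np

def descent(f, grad, x0, direction_fn, stepsize_fn, tol=1e-6, max_iter=1000):
    """Generic descent loop: x <- x + alpha * d until the gradient is small."""
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < tol:          # stopping criterion
            break
        d = direction_fn(x)                  # descent direction, grad(x)^T d < 0
        alpha = stepsize_fn(f, grad, x, d)   # line search
        x = x + alpha * d                    # update
    return x

# Gradient descent with a constant step size on f(x) = ||x||^2 / 2 (arbitrary test function)
f    = lambda x: 0.5 * x @ x
grad = lambda x: x
x_min = descent(f, grad, np.array([3.0, -4.0]),
                direction_fn=lambda x: -grad(x),
                stepsize_fn=lambda f, g, x, d: 0.5)
print(x_min)   # close to the origin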


Example 4: Most sophisticated method: Newton’s Method

x(k+1) = x(k) − α(k) H−1 (x(k) )∇f (x(k) ), k = 0, 1, . . .

Note that, here the descent direction is d(k) = −H−1 (x(k) )∇f (x(k) ).


Line Search

- Suppose f (x) is a continuously differentiable convex function and we


want to find
α(k) = argmin f (x(k) + αd(k) )
α
(k)
for a given descent direction d . Now, let

h(α) = f (x(k) + αd(k) )

where h(α) : R → R is a "convex function" in the scalar variable α then


the problem becomes
α(k) = argmin h(α)
α

Then, as h(α) is convex, it has a minimum at

∂h(α(k) )
h′ (α(k) ) = =0
∂α


where h′ (α) is given by

h′ (α) = ∂h(α)/∂α = ∇T f (x(k) + αd(k) ) d(k)   (using the chain rule)

Therefore, since d(k) is a descent direction (i.e., ∇T f (x(k) ) d(k) < 0), we
have h′ (0) < 0. Also, h′ (α) is a monotone increasing function of α
because h(α) is convex. Hence, search for the α(k) satisfying h′ (α(k) ) = 0.

Choice of stepsize:
- Constant stepsize
α(k) = c : constant

- Diminishing stepsize
α(k) → 0

while satisfying   Σ_{k=0}^{∞} α(k) = ∞.

- Exact line search (analytic)

α(k) = argmin f (x(k) + αd(k) )


α


Exact line search: (for quadratic programs)


- If f (x) is a quadratic function, then h(α) is also a quadratic function, i.e.,

h(α) = f (x(k) + αd(k) )
     = f (x(k) ) + α ∇T f (x(k) ) d(k) + (α2 /2) d(k)T H(x(k) ) d(k)

- Exact line search solution α0 which minimizes the quadratic equation
above, i.e., ∂h(α0 )/∂α = 0, is given by

α0 = α(k) = − ∇T f (x(k) ) d(k) / ( d(k)T H(x(k) ) d(k) )

- If f (x) is a higher order function, then second order Taylor series


approximation can be used for the exact line search algorithm (which also
gives an approximate solution).
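For a quadratic objective, the exact step has the closed form just derived; a minimal sketch (the Q, b, starting point and direction below are arbitrary):

import numpy as np

def exact_step(Q, b, x, d):
    """Exact line search for f(x) = 0.5 x^T Q x - b^T x along direction d."""
    g = Q @ x - b                        # gradient at x
    return -(g @ d) / (d @ Q @ d)        # alpha = - grad^T d / (d^T Q d)

Q = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
x = np.zeros(2)
d = b - Q @ x                            # negative gradient direction
alpha = exact_step(Q, b, x, d)
x_new = x + alpha * d
# h'(alpha) = grad(x_new)^T d should be ~0 at the exact minimizer along d
print((Q @ x_new - b) @ d)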


Bisection Algorithm:
- Assume h(α) is convex, then h′ (α) is a monotonically increasing
function. Suppose that we know a value α̂ such that h′ (α̂) > 0.

- Since h′ (0) < 0, α̃ = (0 + α̂)/2 is the next test point

- If h′ (α̃) = 0, α(k) = α̃ is found (very difficult to achieve)


- If h′ (α̃) > 0 narrow down the search interval to (0, α̃)
- If h′ (α̃) < 0 narrow down the search interval to (α̃, α̂)


Algorithm:
- Set k = 0, αℓ = 0 and αu = α̂
- Set α̃ = (αℓ + αu )/2 and calculate h′ (α̃)
- If h′ (α̃) > 0 ⇒ αu = α̃, k = k + 1, go to the previous step
- If h′ (α̃) < 0 ⇒ αℓ = α̃, k = k + 1, go to the previous step
- If h′ (α̃) = 0 ⇒ stop.
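A compact sketch of this bisection line search (it assumes h′(0) < 0 and that a bracketing α̂ with h′(α̂) > 0 has already been found; the tolerance and test function below are arbitrary):

def bisection_line_search(dh, alpha_hat, tol=1e-8, max_iter=100):
    """Find alpha with dh(alpha) ~ 0, given dh(0) < 0 and dh(alpha_hat) > 0."""
    lo, hi = 0.0, alpha_hat
    for _ in range(max_iter):
        mid = 0.5 * (lo + hi)
        slope = dh(mid)
        if abs(slope) < tol:          # stopping criterion on |h'(alpha)|
            break
        if slope > 0.0:
            hi = mid                  # minimizer lies in (lo, mid)
        else:
            lo = mid                  # minimizer lies in (mid, hi)
    return 0.5 * (lo + hi)

# Example: h(a) = (a - 0.3)^2, so h'(a) = 2(a - 0.3), minimized at a = 0.3
print(bisection_line_search(lambda a: 2.0 * (a - 0.3), alpha_hat=1.0))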


Proposition: After every iteration, the current interval [αℓ , αu ] contains
α∗ , where h′ (α∗ ) = 0.

Proposition: At the k-th iteration, the length of the current interval is

L = (1/2)^k α̂

Proposition: A value of α such that |α − α∗ | < ε can be found in at most

⌈ log2 ( α̂ / ε ) ⌉

steps.

- How to find α̂ such that h′ (α̂) > 0?


- Make an initial guess of α̂
- If h′ (α̂) < 0 ⇒ α̂ = 2α̂, go to step 2
- Stop.


- Stopping criterion for the Bisection Algorithm: h′ (α̃) → 0 as k → ∞,


may not converge quickly.

- Some relevant stopping criteria:


- Stop after k = K iterations (K : user defined)
- Stop when |αu − αℓ | ≤ ε (ε : user defined)
- Stop when h′ (α̃) ≤ ϵ (ϵ : user defined)

- In general, 3rd criterion is the best.


Backtracking line search

For small enough α:

f (x0 + αd) ≃ f (x0 ) + α∇T f (x0 )d < f (x0 ) + γα∇T f (x0 )d

where 0 < γ < 0.5 as ∇T f (x0 ) d < 0.


Algorithm: Backtracking line search

- Given a descent direction d for f (x) at x0 ∈ dom f (x)


- α=1
- while f (x0 + αd) > f (x0 ) + γα∇T f (x0 )d
α = βα
- end

where 0 < γ < 0.5 and 0 < β < 1

- At each iteration step size α is reduced by β (β ≃ 0.1 : coarse search,


β ≃ 0.8 : fine search).

- γ can be interpreted as the fraction of the decrease in f (x) predicted by


linear extrapolation (γ = 0.01 ↔ 0.3 (typical) meaning that we accept a
decrease in f (x) between 1% and 30%).
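A minimal sketch of this backtracking rule (the parameter names γ, β follow the slide; f, grad, x0 and d are whatever the outer descent loop provides, and the test function below is an arbitrary choice):

import numpy as np

def backtracking(f, grad, x0, d, gamma=0.1, beta=0.5):
    """Shrink alpha by beta until the sufficient-decrease condition holds."""
    alpha = 1.0
    slope = grad(x0) @ d                 # = grad^T d < 0 for a descent direction
    while f(x0 + alpha * d) > f(x0) + gamma * alpha * slope:
        alpha *= beta
    return alpha

# Example on f(x) = 0.5 ||x||^2 with the negative-gradient direction
f    = lambda x: 0.5 * x @ x
grad = lambda x: x
x0   = np.array([2.0, -1.0])
print(backtracking(f, grad, x0, -grad(x0)))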


- The backtracking exit inequality

f (x0 + αd) ≤ f (x0 ) + γα∇T f (x0 )d

holds for α ∈ [0, α0 ]. Then, the line search stops with a step length α
i. α = 1 if α0 ≥ 1
ii. α ∈ [βα0 , α0 ].
In other words, the step length obtained by backtracking line search
satisfies
α ≥ min {1, βα0 } .


Convergence
Definition: Let ∥ · ∥ be a norm on RN . Let {x(k) }∞_{k=0} be a sequence of
vectors in RN . Then, the sequence {x(k) }∞_{k=0} is said to converge to a
limit x∗ if

∀ε > 0, ∃Nε ∈ Z+ : (k ∈ Z+ and k ≥ Nε ) ⇒ (∥x(k) − x∗ ∥ < ε)

If the sequence {x(k) }∞_{k=0} converges to x∗ then we write

lim_{k→∞} x(k) = x∗

and call x∗ as the limit of the sequence {x(k) }∞_{k=0} .
- Nε may depend on ε
- For a distance ε, after Nε iterations, all the subsequent iterations
are within this distance ε to x∗ .

- This definition does not characterize how fast the convergence is (i.e.
rate of convergence).

Rate of Convergence
Definition: Let ∥ · ∥ be a norm on RN . A sequence {x(k) }∞_{k=0} that
converges to x∗ ∈ RN is said to converge at rate R ∈ R++ and with rate
constant δ ∈ R++ if

lim_{k→∞} ∥x(k+1) − x∗ ∥ / ∥x(k) − x∗ ∥^R = δ
- If R = 1, 0 < δ < 1, then rate is linear
- If 1 < R < 2, 0 < δ < ∞, then rate is called super-linear
- If R = 2, 0 < δ < ∞, then rate is called quadratic

- The rate of convergence R is sometimes called asymptotic convergence


rate. It may not apply for the first iterations, but applies asymptotically
as k → ∞.


Example 5: The sequence {a^k }∞_{k=0} , 0 < a < 1 converges to 0.

lim_{k→∞} ∥a^{k+1} − 0∥ / ∥a^k − 0∥^1 = a   ⇒   R = 1, δ = a

Example 6: The sequence {a^{2^k} }∞_{k=0} , 0 < a < 1 converges to 0.

lim_{k→∞} ∥a^{2^{k+1}} − 0∥ / ∥a^{2^k} − 0∥^2 = 1   ⇒   R = 2, δ = 1
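These two examples can be checked numerically; a small sketch printing the error ratios that appear in the definition of the rate (the value of a and the number of terms are arbitrary):

a = 0.5
lin  = [a**k      for k in range(1, 8)]   # Example 5: linear,    R = 1, delta = a
quad = [a**(2**k) for k in range(1, 6)]   # Example 6: quadratic, R = 2, delta = 1

for k in range(len(lin) - 1):
    print("linear ratio   :", lin[k + 1] / lin[k])          # -> a
for k in range(len(quad) - 1):
    print("quadratic ratio:", quad[k + 1] / quad[k] ** 2)   # -> 1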


Gradient Descent (GD) Method


Gradient Descent Method

- First order Taylor series expansion at x0 gives us


f (x0 + αd) ≈ f (x0 ) + α∇T f (x0 )d.
This approximation is valid for α∥d∥ → 0.
- We want to choose d so that ∇T f (x0 )d is as small as (as negative as)
possible for maximum descent.
- If we normalize d, i.e. ∥d∥ = 1, then normalized direction d̃
∇f (x0 )
d̃ = −
∥∇f (x0 )∥
makes the smallest inner product with ∇f (x0 ).
- Then, the unnormalized direction
d = −∇f (x0 )
is called the direction of gradient descent (GD) at the point of x0 .
- d is a descent direction as long as ∇f (x0 ) ̸= 0.

Algorithm: Gradient Descent Algorithm

- Given a starting point x(0) ∈ dom f (x)

- repeat
- Direction: d(k) = −∇f (x(k) )
- Line search: Choose step size α(k) via a line search algorithm
- Update: x(k+1) = x(k) + α(k) d(k)

- until stopping criterion is satisfied

- A typical stopping criterion is ∥∇f (x)∥ < ε, ε → 0 (small)
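Putting the pieces together, a sketch of the gradient descent algorithm with backtracking line search and the ∥∇f (x)∥ < ε stopping criterion (the quadratic test problem below is an arbitrary choice):

import numpy as np

def gradient_descent(f, grad, x0, eps=1e-6, gamma=0.1, beta=0.5, max_iter=500):
    x = np.asarray(x0, dtype=float)
    for k in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) < eps:              # stopping criterion ||grad f|| < eps
            break
        d = -g                                   # descent direction
        alpha = 1.0
        while f(x + alpha * d) > f(x) + gamma * alpha * (g @ d):
            alpha *= beta                        # backtracking line search
        x = x + alpha * d
    return x, k

Q = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])
f    = lambda x: 0.5 * x @ Q @ x - b @ x
grad = lambda x: Q @ x - b
print(gradient_descent(f, grad, np.zeros(2)))    # converges towards x* = (2, -2)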


Convergence Analysis

- The Hessian matrix H(x) is bounded as

- mI ⪯ H(x), i.e.,

(H(x) − mI) ⪰ 0
yT H(x)y ≥ m∥y∥2 , ∀y ∈ RN

- H(x) ⪯ M I, i.e.,

(M I − H(x)) ⪰ 0
yT H(x)y ≤ M ∥y∥2 , ∀y ∈ RN
with ∀x ∈ dom f (x).


- Note that, condition number of a matrix is given by the ratio of the


largest and the smallest eigenvalue, e.g.,
max λi M
κ(H(x)) = =
min λi m
If the condition number is close to one, the matrix is well-conditioned
which means its inverse can be computed with good accuracy. If the
condition number is large, then the matrix is said to be ill-conditioned.


Lower Bound: mI ⪯ H(x)


- For x, y ∈ dom f (x)
1
f (y) = f (x) + ∇T f (x)(y − x) + (y − x)T H(z)(y − x)
2
for some z on the line segment [x, y] where H(z) ⪰ mI. Thus,
m
f (y) ≥ f (x) + ∇T f (x)(y − x) + ∥y − x∥2
2
- If m = 0, then the inequality characterizes convexity.
- If m > 0, then we have a better lower bound for f (y)
- Right-hand side is convex in y. Minimum is achieved at
1
y0 = x − ∇f (x)
m
- Then,
m
f (y) ≥ f (x) + ∇T f (x)(y0 − x) + ∥y0 − x∥2
2
1
≥ f (x) − ∥∇f (x)∥2
2m
∀y ∈ dom f .

- When y = x∗
1
f (x∗ ) = p∗ ≥ f (x) − ∥∇f (x)∥2
2m

- A stopping criterion

f (x) − p∗ ≤ (1/2m) ∥∇f (x)∥2


Upper Bound: H(x) ⪯ M I

- For any x, y ∈ dom f (x), using similar derivations as the lower bound, we
arrive at an upper bound
M
f (y) ≤ f (x) + ∇T f (x)(y0 − x) + ∥y0 − x∥2
2
- Then for y = x∗
1
f (x∗ ) = p∗ ≤ f (x) − ∥∇f (x)∥2
2M
- Hence,
1
2M
∥∇f (x)∥2 ≤ f (x) − p∗


Convergence of GD using exact line search

- For the exact line search, let us use second order approximation for
f (x(k+1) ):

f (x(k+1) ) = f (x(k) − α∇f (x(k) ))
           ∼= f (x(k) ) − α ∇T f (x(k) )∇f (x(k) ) + (α2 /2) ∇T f (x(k) ) H(x(k) ) ∇f (x(k) )

where ∇T f (x(k) )∇f (x(k) ) = ∥∇f (x(k) )∥2 and H(x(k) ) ⪯ M I.
This criterion is quadratic in α.

- Normally, exact line search solution α0 which minimizes the quadratic
equation above is given by

α0 = ∇T f (x(k) )∇f (x(k) ) / ( ∇T f (x(k) ) H(x(k) ) ∇f (x(k) ) )


- However, let us use the upper bound of the second order approximation
for convergence analysis

M α2
f (x(k+1) ) ≤ f (x(k) ) − α∥∇f (x(k) )∥2 + ∥∇f (x(k) )∥2
2

- Find α0′ such that upper bound of f (x(k) − α∇f (x(k) )) is minimized over
α.

- Upper bound equation (i.e., right-hand side equation) is quadratic in α,


hence minimized for

α0′ = 1/M

- with the minimum value

f (x(k) ) − (1/2M ) ∥∇f (x(k) )∥2

- Then, for α0′

f (x(k+1) ) ≤ f (x(k) ) − (1/2M ) ∥∇f (x(k) )∥2

- Subtract p∗ for both sides


1
f (x(k+1) ) − p∗ ≤ f (x(k) ) − p∗ − ∥∇f (x(k) )∥2
2M
- We know that
1
f (x(k) )−p∗ ≤ ∥∇f (x(k) )∥2 ⇒ ∥∇f (x(k) )∥2 ≥ 2m(f (x(k) )−p∗ )
2m
- Then substituting this result to the above inequality
m
f (x(k+1) ) − p∗ ≤ (f (x(k) ) − p∗ ) − (f (x(k) ) − p∗ )
M
m
≤ (1 − )(f (x(k) ) − p∗ )
M
- or
f (x(k+1) ) − p∗ m m
≤ (1 − )=c≤1 ( ≤ 1)
f (x(k) ) − p∗ M M


- Rate of convergence is unity (i.e., R = 1) ⇒ linear convergence


- Upper limit of the rate constant is c = 1 − m/M

- Number of steps? Apply the above inequality recursively

f (x(k+1) ) − p∗ ≤ ck (f (x(0) ) − p∗ )

- i.e., f (x(k+1) ) → p∗ as k → ∞, since 0 ≤ c < 1. Thus, convergence is


guaranteed.
- If m = M ⇒ c = 0, then convergence occurs in one iteration.
- If m ≪ M ⇒ c → 1, the slow convergence.


- (f (x(k+1) ) − p∗ ) ≤ ε is achieved after at most

K = log ( [f (x(0) ) − p∗ ] / ε ) / log (1/c)

iterations
- Numerator is small when initial point is close to x∗ (K gets
smaller).

- Numerator increases as accuracy increases (i.e., ε decreases) (K


gets larger).
- Denominator decreases linearly with m/M (reciprocal of the condition
number) as c = (1 − m/M ), i.e., log(1/c) = − log(1 − m/M ) ≃ m/M (using
log(x) = log(x0 ) + (1/x0 )(x − x0 ) − (1/2x0^2 )(x − x0 )2 + · · · with x0 = 1).

- well-conditioned Hessian, m/M → 1 ⇒ denominator is large (K
gets smaller).
- ill-conditioned Hessian, m/M → 0 ⇒ denominator is small (K
gets larger).

Convergence of GD using backtracking line search

- Backtracking exit condition

f (x(k) − α∇f (x(k) )) ≤ f (x(k) ) − γα∥∇f (x(k) )∥2


is satisfied when α ∈ [βα0 , α0 ] where α0 ≤ 1/M .
- Backtracking line search terminates either if α = 1 or α ≥ β/M , which gives
a lower bound on the decrease
- f (x(k+1) ) ≤ f (x(k) ) − γ ∥∇f (x(k) )∥2          if α = 1
- f (x(k+1) ) ≤ f (x(k) ) − (βγ/M ) ∥∇f (x(k) )∥2    if α ≥ β/M


- If we put these inequalities (1 & 2) together


 
βγ
f (x(k+1) ) ≤ f (x(k) ) − min γ, ∥∇f (x(k) )∥2
M

- Similar to the analysis exact line search, subtract p∗ from both sides
 
β
f (x(k+1) ) − p∗ ≤ f (x(k) ) − p∗ − γ min 1, ∥∇f (x(k) )∥2
M

- But, we know that ∥∇f (x(k) )∥2 ≥ 2m(f (x(k) ) − p∗ ), then


  
β 
f (x(k+1) ) − p∗ ≤ 1 − 2mγ min 1, f (x(k) ) − p∗
M

- Finally

f (x(k+1) ) − p∗
  
β
≤ 1 − 2mγ min 1, =c<1
f (x(k) ) − p∗ M


- Rate of convergence is unity (i.e., R = 1) ⇒ linear convergence

- Rate is constant c < 1


 
f (x(k+1) ) − p∗ ≤ ck f (x(0) ) − p∗

Thus, k → ∞ ⇒ ck → 0, so convergence is guaranteed.


Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).

Example 7: (quadratic problem in R2 ) Replace γ, α and t with σ, γ and α.


Example 8: (nonquadratic problem in R2 ) Replace α and t with γ and α.


Example 9: (problem in R100 ) Replace α and t with γ and α.


Example 10: (Condition number) Replace γ, α and t with σ, γ and α.


Observations:
- The gradient descent algorithm is simple.

- The gradient descent method often exhibits approximately linear


convergence.

- The choice of backtracking parameters γ and β has a noticeable but not


dramatic effect on the convergence. Exact line search sometimes improves
the convergence of the gradient method, but the effect is not large (and
probably not worth the trouble of implementing the exact line search).

- The convergence rate depends greatly on the condition number of the


Hessian, or the sublevel sets. Convergence can be very slow, even for
problems that are moderately well-conditioned (say, with condition
number in the 100s). When the condition number is larger (say, 1000 or
more) the gradient method is so slow that it is useless in practice.

- The main advantage of the gradient method is its simplicity. Its main
disadvantage is that its convergence rate depends so critically on the
condition number of the Hessian or sublevel sets.


Steepest Descent (SD) Method


Steepest Descent (SD) Method

Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values
∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup_y { yT x : ∥y∥ ≤ 1 }

- The above definition also corresponds to a norm: it is convex, as it is the


pointwise maximum of convex (in fact, linear) functions y → xT y; it is
homogeneous of degree 1, that is, ∥αx∥∗ = α∥x∥∗ for every x in RN and
α ≥ 0.

- By definition of the dual norm,

xT y ≤ ∥x∥ · ∥y∥∗

This can be seen as a generalized version of the Cauchy-Schwartz


inequality, which corresponds to the Euclidean norm.

- The dual to the dual norm above is the original norm.



- The norm dual to the Euclidean norm is itself. This comes directly from
the Cauchy-Schwartz inequality.

∥x∥2∗ = ∥x∥2

- The norm dual to the the L∞ -norm is the L1 -norm, or vice versa.

∥x∥∞∗ = ∥x∥1 and ∥x∥1∗ = ∥x∥∞

- More generally, the dual of the Lp -norm is the Lq -norm

∥x∥p∗ = ∥x∥q

where 1/p + 1/q = 1, i.e., q = p/(p − 1).


Quadratic norm: A generalized quadratic norm of x is defined by


 1/2
∥x∥P = xT Px = ∥P1/2 x∥2 = ∥Mx∥2

where P = MT M is an N × N symmetric positive definite (SPD) matrix.

- When P = I then, quadratic norm is equal to the Euclidean norm.

- The dual of the quadratic norm is given by


 1/2
∥x∥P∗ = ∥x∥Q = xT P−1 x

where Q = P−1 .


Steepest Descent Method

- The first-order Taylor series approximation of f (x(k) + αd) around x(k) is

f (x(k) + αd) ≈ f (x(k) ) + α∇T f (x(k) )d.


This approximation is valid for α∥d∥2 → 0.

- We want to choose d so that ∇T f (x0 )d is as small as (as negative as)


possible for maximum descent.

- First normalize d to obtain the normalized steepest descent direction (nsd) dnsd

dnsd = argmin_d { ∇T f (x(k) ) d : ∥d∥ = 1 }
where ∥ · ∥ is any norm and ∥ · ∥∗ is its dual norm on RN . Choice of norm
is very important.

- It is also convenient to consider the unnormalized steepest descent


direction (sd)
dsd = ∥∇f (x)∥∗ dnsd
where ∥ · ∥∗ is the dual norm of ∥ · ∥.

- Then, for the steepest descent step, we have

∇T f (x) dsd = ∥∇f (x)∥∗ ∇T f (x) dnsd = −∥∇f (x)∥2∗

since ∇T f (x) dnsd = −∥∇f (x)∥∗   (recall that ∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 }).

- Hence,

f (x(k) + αdsd ) ≈ f (x(k) ) + α ∇T f (x(k) ) dsd = f (x(k) ) − α ∥∇f (x(k) )∥2∗

Algorithm: Steepest Descent Algorithm

- Given a starting point x(0) ∈ dom f (x)

- repeat
- Compute the steepest descent direction d(k) sd
- Line search: Choose step size α(k) via a line search algorithm
(k)
- Update: x(k+1) = x(k) + α(k) dsd

- until stopping criterion is satisfied



Steepest Descent for different norms:

Euclidean norm: As ∥ · ∥2∗ = ∥ · ∥2 and having x0 = x(k) , the steepest


descent direction is the negative gradient, i.e.

dsd = −∇f (x0 )

- For Euclidean norm, steepest descent algorithm is the same as the


gradient descent algorithm.


Quadratic norm: For a quadratic norm ∥ · ∥P and having x0 = x(k) , the


normalized descent direction is given by
dnsd = −P−1 ∇f (x0 ) / ∥∇f (x0 )∥P∗ = −P−1 ∇f (x0 ) / ( ∇T f (x0 ) P−1 ∇f (x0 ) )^{1/2}

(Recall that ∥x∥P∗ = ∥x∥P−1 )

- As ∥∇f (x)∥P∗ = ∥P−1/2 ∇f (x)∥2 , then

dsd = −P−1 ∇f (x0 )
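A small sketch computing this steepest descent direction for a given SPD matrix P (the P and gradient vector below are arbitrary examples), and checking that P = I recovers the negative gradient:

import numpy as np

def sd_direction_quadratic_norm(grad_x, P):
    """Unnormalized steepest descent direction for the norm ||x||_P: d_sd = -P^{-1} grad."""
    return -np.linalg.solve(P, grad_x)

g = np.array([1.0, -2.0])
P = np.array([[2.0, 0.3],
              [0.3, 1.0]])                 # symmetric positive definite
d = sd_direction_quadratic_norm(g, P)
print(d, g @ d < 0)                        # descent direction: grad^T d < 0
print(sd_direction_quadratic_norm(g, np.eye(2)))   # P = I recovers -grad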


Change of coordinates: Let y = P 1/2 x, then ∥x∥P = ∥y∥2 . Using this


change of coordinates, we can solve the original problem of minimizing
f (x) by solving the equivalent problem of minimizing the function
f˜(y) : RN → R, given by

f˜(y) = f (P−1/2 y) = f (x)

- Apply the gradient descent method to f˜(y). The descent direction at y0


(x0 = P−1/2 y0 for the original problem) is

dy = −∇f˜(y0 ) = −P−1/2 ∇f (P−1/2 y0 ) = −P−1/2 ∇f (x0 )

- Then the descent direction for the original problem becomes

dx = P−1/2 dy = −P−1 ∇f (x0 )

- Thus, x∗ = P−1/2 y∗ .

- The steepest descent method in the quadratic norm ∥ · ∥P is equivalent to


the gradient descent method applied to the problem after the coordinate
transformation
y = P1/2 x

L1 -norm: For an L1 -norm ∥ · ∥1 and having x0 = x(k) , the normalized


descent direction is given by
dnsd = argmin_d { ∇T f (x0 ) d : ∥d∥1 = 1 }

- Let i be any index for which ∥∇f (x0 )∥∞ = max_i |(∇f (x0 ))i |. Then a
normalized steepest descent direction dnsd for the L1 -norm is given by

dnsd = − sign ( ∂f (x0 )/∂xi ) ei

where ei is the i-th standard basis vector (i.e. the coordinate axis
direction along which the gradient component is largest in magnitude); for
example, dnsd = ±e1 when the first component of the gradient dominates.

- Then, the unnormalized steepest descent direction is given by


∂f (x0 )
dsd = dnsd ∥∇f (x0 )∥∞ = − ei
∂xi

- The steepest descent algorithm in the L1 -norm has a very natural


interpretation:
- At each iteration we select a component of ∇f (x0 ) with maximum
absolute value, and then decrease or increase the corresponding
component of x0 , according to the sign of (∇f (x0 ))i .
- The algorithm is sometimes called a coordinate-descent algorithm,
since only one component of the variable x(k) is updated at each
iteration.
- This can greatly simplify, or even trivialize, the line search.
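A sketch of this coordinate-descent style rule: pick the gradient component with the largest magnitude and move only along that coordinate (the example gradient below is arbitrary):

import numpy as np

def l1_sd_direction(grad_x):
    """Unnormalized L1 steepest descent direction: -(df/dx_i) e_i for the largest |component|."""
    i = np.argmax(np.abs(grad_x))
    d = np.zeros_like(grad_x)
    d[i] = -grad_x[i]
    return d

g = np.array([0.2, -3.0, 1.5])
print(l1_sd_direction(g))        # only the second coordinate is updated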


Choice of norm:
- Choice of norm can dramatically affect the convergence

- Condition number of the Hessian should be close to unity for fast


convergence

- Consider quadratic norm with respect to SPD matrix P. Performing the


change of coordinates
y = P1/2 x
can change the condition number.
- If an approximation of the Hessian at the optimal point, H(x∗ ), is
known, then setting P ∼= H(x∗ ) will yield

P−1/2 H(x∗ ) P−1/2 ∼= I

resulting in a very low condition number.
- If P is chosen correctly, the ellipsoid E = { x : xT Px ≤ 1 }
approximates the shape of the cost surface (the sublevel sets) around the point x.


- A correct P will greatly improve the convergence whereas the wrong
choice of P will result in very poor convergence.

Convergence Analysis

- (Using backtracking line search) It can be shown that any norm can be
bounded in terms of Euclidean norm with a constant η ∈ (0, 1]
∥x∥∗ ≥ η∥x∥2

- Assuming strongly convex f (x) and using H(x) ≺ M I


M α2
f (x(k) + αdsd ) ≤ f (x(k) ) + α∇T f (x(k) )dsd + ∥dsd ∥22
2
M α2
≤ f (x(k) ) + α∇T f (x(k) )dsd + ∥dsd ∥2∗
2η 2
M
≤ f (x(k) ) − α∥∇f (x(k) )∥2∗ + α2 2 ∥∇f (x(k) )∥2∗

- Right hand side of the inequality is a quadratic function of α and has a
minimum at α = η2 /M . Then,

f (x(k) + αdsd ) ≤ f (x(k) ) − (η2 /2M ) ∥∇f (x(k) )∥2∗ ≤ f (x(k) ) + (γη2 /M ) ∇T f (x(k) ) dsd

- Since γ < 0.5 and −∥∇f (x)∥2∗ = ∇T f (x) dsd , backtracking line search
will return α ≥ min { 1, βη2 /M } , then

βη 2
 
f (x(k) + αdsd ) ≤ f (x(k) ) − γ min 1, ∥∇f (x(k) )∥2∗
M
βη 2
 
≤ f (x(k) ) − γη 2 min 1, ∥∇f (x(k) )∥22
M

- Subtracting p∗ from both sides and using


∥∇f (x(k) )∥2 ≥ 2m(f (x(k) ) − p∗ ), we have

f (x(k+1) ) − p∗ βη 2
 
2
≤ 1 − 2mγη min 1, =c<1
f (x(k) ) − p∗ M

- Linear convergence
 
f (x(k) ) − p∗ ≤ ck f (x(0) ) − p∗

as k → ∞, ck → 0. So, convergence is guaranteed,


Example 11: A steepest descent example with L1 -norm.


Example 12: Consider the nonquadratic problem in R2 given in Example 8


(replace α and t with γ and α).


- When P = I, i.e., gradient descent


Conjugate Gradient (CG) Method


Conjugate Gradient Method

- Can overcome the slow convergence of Gradient Descent algorithm

- Computational complexity is lower than Newton’s Method.

- Can be very effective in dealing with general objective functions.

- We will first investigate the quadratic problem


1
min xT Qx − bT x
2
where Q is SPD, and then extend the solution to the general case by
approximation.


Conjugate Directions

Definition: Given a symmetric matrix Q, two vectors d1 and d2 are said


to be Q-orthogonal or conjugate with respect to Q if

dT1 Qd2 = 0

- Although it is not required, we will assume that Q is SPD.

- If Q = I, then the above definition becomes the definition of


orthogonality.

- A finite set of non-zero vectors d0 , d1 , . . . , dk is said to be a


Q-orthogonal set if
dTi Qdj = 0, ∀i, j : i ̸= j


Proposition: If Q is SPD and the set of non-zero vectors d0 , d1 , . . . , dk
are Q-orthogonal, then these vectors are linearly independent.

Proof: Assume linear dependency and suppose ∃αi , i = 0, 1, . . . , k :

α0 d0 + α1 d1 + · · · + αk dk = 0

- Multiplying with dTi Q yields

α0 dTi Qd0 + α1 dTi Qd1 + · · · + αi dTi Qdi + · · · + αk dTi Qdk = 0

where every term except αi dTi Qdi vanishes by Q-orthogonality.

- But dTi Qdi > 0 (Q: PD), then αi = 0. Repeat for all αi .

Quadratic Problem:
1
min xT Qx − bT x
2
- If Q is N × N PD matrix, then we have unique solution
Qx∗ = b
- Let d0 , d1 , . . . , dN −1 be non-zero Q-orthogonal vectors corresponding to
the N × N SPD matrix Q. They are linearly independent. Then the
optimum solution is given by
x∗ = α0 d0 + α1 d1 + · · · + αN −1 dN −1
- We can find the value of the coefficients αi by multiplying the above
equation with dTi Q:

dTi Q x∗ = αi dTi Q di   ⇒   αi = dTi b / ( dTi Q di )   (using Qx∗ = b)

- Finally the optimum solution is given by,

x∗ = Σ_{i=0}^{N−1} ( dTi b / ( dTi Q di ) ) di


- αi can be found from the known vector b and matrix Q once di are
found.

- The expansion of x∗ is a result of an iterative process of N steps where


at the i-th step αi di is added.
Conjugate Direction Theorem: Let {di }_{i=0}^{N−1} be a set of non-zero
Q-orthogonal vectors. For any x(0) ∈ dom f (x), the sequence {x(k) }_{k=0}^{N}
generated according to

x(k+1) = x(k) + α(k) dk ,   k ≥ 0

- with

α(k) = − dTk g(k) / ( dTk Q dk )

and g(k) is the gradient at x(k)

g(k) = ∇f (x(k) ) = Qx(k) − b

- converges to the unique solution x∗ of Qx∗ = b after N steps, i.e.
x(N ) = x∗ .

Proof: Since dk are linearly independent, we can write


x∗ − x(0) = α(0) d0 + α(1) d1 + · · · + α(N −1) dN −1

- for some α(k) . we can find α(k) by


 
dTk Q x∗ − x(0)
α(k) = (1)
dTk Qdk

- Now, the iterative steps from x(0) to x(k)


x(k) − x(0) = α(0) d0 + α(1) d1 + · · · + α(k−1) dk−1

- and due to Q-orthogonality


 
dTk Q x(k) − x(0) = 0 (2)

- Using (1) and (2) we arrive at


 
dTk Q x∗ − x(k) dT g(k)
α(k) = T
= − Tk
dk Qdk dk Qdk


Descent Properties of the Conjugate Gradient Method


- We define B(k) which is spanned by {d0 , d1 , . . . , dk−1 } as the subspace
of RN , i.e.,
B(k) = span {d0 , d1 , . . . , dk−1 } ⊆ RN

- We will show that at each step x(k) minimizes the objective over the
k-dimensional linear variety x(0) + B(k) .
Theorem: (Expanding Subspace Theorem) Let {di }_{i=0}^{N−1} be non-zero,
Q-orthogonal vectors in RN .

- For any x(0) ∈ RN , the sequence

x(k+1) = x(k) + α(k) dk ,    α(k) = − dTk g(k) / ( dTk Q dk )

minimizes f (x) = (1/2) xT Qx − bT x on the line

x = x(k−1) − αdk−1 ,   −∞ < α < ∞

and on x(0) + B(k) .

Proof: Since x(k) ∈ x(0) + B(k) , i.e., B(k) contains the line
x = x(k−1) − αdk−1 , it is enough to show that x(k) minimizes f (x) over
x(0) + B(k)

- Since we assume that f (x) is strictly convex, the above condition holds
when g(k) is orthogonal to B(k) , i.e., the gradient of f (x) at x(k) is
orthogonal to B(k) .


- Proof of g(k) ⊥ B(k) is by induction

- Let for k = 0, B(0) = {} (empty set), then g(k) ⊥ B(0) is true.

- Now assume that g(k) ⊥ B(k) , show that g(k+1) ⊥ B(k+1)

- From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that

g(k+1) = g(k) + αk Qdk

- Hence, by the definition of αk

dTk g(k+1) = dTk g(k) + αk dTk Qdk = 0

- Also, for i < k

dTi g(k+1) = dTi g(k) + αk dTi Q dk = 0

where the first term vanishes by induction and the second term is zero by Q-orthogonality.


Corollary: The gradients g(k) , k = 0, 1, . . . , N satisfy

dTi g(k) = 0

for i < k.

- Expanding subspace, at every iteration dk increases the dimensionality of


B. Since x(k) minimizes f (x) over x(0) + B(k) , x(N ) is the overall
minimum of f (x).


The Conjugate Gradient Method

- The conjugate gradient method is the conjugate direction method obtained by selecting the successive direction vectors as a conjugate version of the successive gradients obtained as the method progresses.

Conjugate Gradient Algorithm:

- Start at any x(0) ∈ RN and define d(0) = −g(0) = b − Qx(0)

- repeat
  - $\alpha^{(k)} = -\dfrac{d^{(k)T} g^{(k)}}{d^{(k)T} Q\, d^{(k)}}$
  - $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
  - $g^{(k+1)} = Q x^{(k+1)} - b$
  - $\beta^{(k)} = \dfrac{g^{(k+1)T} Q\, d^{(k)}}{d^{(k)T} Q\, d^{(k)}}$
  - $d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$

- until k = N .
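
- As an illustration, a minimal NumPy sketch of the listing above (the function name, the tolerance and the random test problem are choices of this sketch, not part of the notes):

```python
import numpy as np

def conjugate_gradient(Q, b, x0):
    """CG for f(x) = 0.5 x^T Q x - b^T x with Q symmetric positive definite."""
    N = len(b)
    x = x0.astype(float)
    g = Q @ x - b                        # g^(0) = Q x^(0) - b
    d = -g                               # d^(0) = -g^(0)
    for k in range(N):                   # at most N steps in the quadratic case
        Qd = Q @ d                       # reused in both alpha^(k) and beta^(k)
        alpha = -(d @ g) / (d @ Qd)
        x = x + alpha * d                # x^(k+1)
        g = Q @ x - b                    # g^(k+1)
        if np.linalg.norm(g) < 1e-12:    # solution reached before N steps
            break
        beta = (g @ Qd) / (d @ Qd)
        d = -g + beta * d                # d^(k+1)
    return x

# small random SPD test problem
rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Q = A @ A.T + 5 * np.eye(5)
b = rng.standard_normal(5)
x = conjugate_gradient(Q, b, np.zeros(5))
print(np.allclose(Q @ x, b))             # True: x solves Qx = b
```

  Note that `Q @ d` is formed once per iteration and reused in both α(k) and β(k), which is why the per-step cost is only slightly higher than that of gradient descent.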

- Algorithm terminates in at most N steps with the exact solution (for the
quadratic case)

- The gradient is always linearly independent of all previous direction vectors, i.e., g(k) ⊥ B(k), where B(k) = span {d0, d1, . . . , dk−1}

- If solution is reached before N steps, the gradient is zero

- Very simple formula, computational complexity is slightly higher than


gradient descent algorithm

- The process makes uniform progress toward the solution at every step.
Important for the nonquadratic case.


Example 13: Consider the quadratic problem


  $$\min_x \; \frac{1}{2}\, x^T Q x - b^T x$$

  where

  $$Q = \begin{bmatrix} 3 & 2 \\ 2 & 6 \end{bmatrix} \quad \text{and} \quad b = \begin{bmatrix} 2 \\ -8 \end{bmatrix}.$$

- The solution is given by $x^* = Q^{-1} b = \begin{bmatrix} 2 \\ -2 \end{bmatrix}$.
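
- A quick numerical check of this example (a sketch; the starting point x^(0) = 0 is an arbitrary choice):

```python
import numpy as np

Q = np.array([[3.0, 2.0], [2.0, 6.0]])
b = np.array([2.0, -8.0])

print(np.linalg.solve(Q, b))       # direct solution: [ 2. -2.]

# two conjugate gradient steps starting from x^(0) = 0
x = np.zeros(2)
g = Q @ x - b
d = -g
for k in range(2):
    Qd = Q @ d
    alpha = -(d @ g) / (d @ Qd)
    x = x + alpha * d
    g = Q @ x - b
    beta = (g @ Qd) / (d @ Qd)
    d = -g + beta * d
print(x)                           # CG reaches [ 2. -2.] in N = 2 steps
```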


CG Summary
- In theory (with exact arithmetic) converges to solution in N steps
- The bad news: due to numerical round-off errors, can take more
than N steps (or fail to converge)
- The good news: with luck (i.e., good spectrum of Q), can get good
approximate solution in ≪ N steps

- Compared to direct (factor-solve) methods, CG is less reliable, data


dependent; often requires good (problem-dependent) preconditioner

- But, when it works, can solve extremely large systems


Extension to Nonquadratic Problems


- Idea is simple. We have two loops
- Outer loop approximates the problem with a quadratic one
- Inner loop runs conjugate gradient method (CGM) for the
approximation

- i.e., for the neighbourhood of point x0

  $$f(x) \cong \underbrace{f(x_0) + \nabla^T f(x_0)(x - x_0) + \frac{1}{2}(x - x_0)^T H(x_0)(x - x_0)}_{\text{quadratic function}} + \underbrace{\text{residual}}_{\to 0}$$

- Expanding

  $$f(x) \cong \frac{1}{2}\, x^T H(x_0)\, x + \left(\nabla^T f(x_0) - x_0^T H(x_0)\right) x + \underbrace{f(x_0) + \frac{1}{2}\, x_0^T H(x_0)\, x_0 - \nabla^T f(x_0)\, x_0}_{\text{independent of } x \text{, i.e., constant}}$$

- Thus,

  $$\min_x f(x) \equiv \min_x\; \frac{1}{2}\, x^T H(x_0)\, x + \left(\nabla^T f(x_0) - x_0^T H(x_0)\right) x \equiv \min_x\; \frac{1}{2}\, x^T Q x - b^T x$$

- Here,

Q = H(x0 )
bT = −∇T f (x0 ) + xT0 H(x0 )

- The gradient g(k) evaluated at x(k) = x0 is

  $$g^{(k)} = Q x^{(k)} - b = H(x_0)\, x_0 + \nabla f(x_0) - H(x_0)\, x_0 = \nabla f(x_0)$$


Nonquadratic Conjugate Gradient Algorithm:

- Starting at any x(0) ∈ RN , compute g(0) = ∇f (x(0) ) and set


d(0) = −g(0)

- repeat
  - repeat
    - $\alpha^{(k)} = -\dfrac{d^{(k)T} g^{(k)}}{d^{(k)T} H(x^{(k)})\, d^{(k)}}$
    - $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
    - $g^{(k+1)} = \nabla f(x^{(k+1)})$
    - $\beta^{(k)} = \dfrac{g^{(k+1)T} H(x^{(k)})\, d^{(k)}}{d^{(k)T} H(x^{(k)})\, d^{(k)}}$
    - $d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$


- until k = N .
- new starting point is x(0) = x(N ) , g(0) = ∇f (x(N ) ) and
d(0) = −g(0) .

- until stopping criterion is satisfied



- No line search is required.

- H(x(k) ) must be evaluated at each point, can be impractical.

- Algorithm may not be globally convergent.

- Involvement of H(x(k) ) can be avoided by employing a line search


algorithm for α(k) and slightly modifying β (k)


Nonquadratic Conjugate Gradient Algorithm with Line-search:


- Starting at any x(0) ∈ RN , compute g(0) = ∇f (x(0) ) and set
d(0) = −g(0)
- repeat
- repeat
    - Line search: $\alpha^{(k)} = \operatorname{argmin}_{\alpha} f(x^{(k)} + \alpha d^{(k)})$
    - Update: $x^{(k+1)} = x^{(k)} + \alpha^{(k)} d^{(k)}$
    - Gradient: $g^{(k+1)} = \nabla f(x^{(k+1)})$
    - Use
      - Fletcher-Reeves method: $\beta^{(k)} = \dfrac{g^{(k+1)T} g^{(k+1)}}{g^{(k)T} g^{(k)}}$, or
      - Polak-Ribiere method: $\beta^{(k)} = \dfrac{\left(g^{(k+1)} - g^{(k)}\right)^T g^{(k+1)}}{g^{(k)T} g^{(k)}}$
    - $d^{(k+1)} = -g^{(k+1)} + \beta^{(k)} d^{(k)}$


- until k = N .
- new starting point is x(0) = x(N ) , g(0) = ∇f (x(N ) ) and
d(0) = −g(0) .
- until stopping criterion is satisfied
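
- A minimal sketch of this line-search variant (the backtracking search standing in for the exact argmin, the safeguard reset, and the test function are assumptions of this sketch, not part of the notes):

```python
import numpy as np

def nonlinear_cg(f, grad, x0, variant="PR", tol=1e-8, max_outer=50):
    """Nonquadratic CG, restarted every N inner steps.
    A backtracking search stands in for the exact argmin over alpha."""
    x = x0.astype(float)
    N = len(x0)
    for _ in range(max_outer):
        g = grad(x)
        if np.linalg.norm(g) < tol:               # stopping criterion
            break
        d = -g                                     # restart: steepest descent direction
        for _ in range(N):
            slope = g @ d
            if slope >= 0:                         # safeguard: keep d a descent direction
                d, slope = -g, -(g @ g)
            alpha, fx = 1.0, f(x)
            while f(x + alpha * d) > fx + 1e-4 * alpha * slope and alpha > 1e-12:
                alpha *= 0.5                       # backtracking line search
            x_new = x + alpha * d
            g_new = grad(x_new)
            if variant == "FR":                    # Fletcher-Reeves
                beta = (g_new @ g_new) / (g @ g)
            else:                                  # Polak-Ribiere
                beta = ((g_new - g) @ g_new) / (g @ g)
            d = -g_new + beta * d
            x, g = x_new, g_new
    return x

# illustrative smooth convex test function (not from the notes)
f = lambda x: (x[0] - 1)**4 + (x[0] - 2 * x[1])**2
grad = lambda x: np.array([4 * (x[0] - 1)**3 + 2 * (x[0] - 2 * x[1]),
                           -4 * (x[0] - 2 * x[1])])
print(nonlinear_cg(f, grad, np.array([3.0, -2.0])))   # approaches [1, 0.5]
```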

- Polak-Ribiere method can be superior to the Fletcher-Reeves method.

- Global convergence of the line search methods is established by noting


that a gradient descent step is taken every N steps and serves as a spacer
step. Since the other steps do not increase the objective, and in fact
hopefully they decrease it, global convergence is guaranteed.


Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A
complicated function with many local minima and maxima. (b) Convergence path of
Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section
of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.


Newton’s Method (NA)



The Newton Step

- In Newton’s Method, local quadratic approximations of f (x) are utilized.


Starting with the second-order Taylor’s approximation around x(k) ,
  $$f(x^{(k+1)}) = \underbrace{f(x^{(k)}) + \nabla^T f(x^{(k)})\,\Delta x + \frac{1}{2}\,\Delta x^T H(x^{(k)})\,\Delta x}_{\hat{f}(x^{(k+1)})} + \text{residual}$$

  where $\Delta x = x^{(k+1)} - x^{(k)}$; find $\Delta x = \Delta x_{nt}$ such that $\hat{f}(x^{(k+1)})$ is minimized.

- The optimum step of the quadratic approximation (found by solving $\partial \hat{f}(x^{(k+1)})/\partial \Delta x = 0$)

  $$\Delta x_{nt} = -H^{-1}(x^{(k)})\,\nabla f(x^{(k)})$$

is called the Newton step, which is a descent direction, i.e.,

∇T f (x(k) )∆xnt = −∇T f (x(k) )H−1 (x(k) )∇f (x(k) ) < 0


- Then
x(k+1) = x(k) + ∆xnt


Interpretation of the Newton Step


1. Minimizer of second-order approximation

- As given above, ∆x minimizes fˆ(x(k+1)), i.e., the quadratic approximation of f(x) in the neighbourhood of x(k).

- If f(x) is quadratic, then x(0) + ∆x is the exact minimizer of f(x) and the algorithm terminates in a single step with the exact answer.

- If f (x) is nearly quadratic, then x + ∆x is a very good estimate of the


minimizer of f (x), x∗ .

- For twice differentiable f (x), quadratic approximation is very accurate in


the neighbourhood of x∗ , i.e., when x is very close to x∗ , the point
x + ∆x is a very good estimate of x∗ .


2. Steepest Descent Direction in Hessian Norm

- The Newton step is the steepest descent direction at x(k) in the quadratic norm defined by the Hessian, i.e.,

  $$\|v\|_{H(x^{(k)})} = \left(v^T H(x^{(k)})\, v\right)^{1/2}$$

- In the steepest descent method, the quadratic norm ∥ · ∥P can


significantly increase speed of convergence, by decreasing the condition
number. In the neighbourhood of x∗ , P = H(x∗ ) is a very good choice.
- In Newton’s method when x is near x∗ , we have H(x) ≃ H(x∗ ).


3. Solution of Linearized Optimality Condition

- First-order optimality condition

∇f (x∗ ) = 0

- near x∗ (using first order Taylor’s approximation for ∇f (x + ∆x))

∇f (x + ∆x) ≃ ∇f (x) + H(x)∆x = 0

- with the solution


∆xnt = −H−1 (x)∇f (x)


The Newton Decrement

- The norm of the Newton step in the quadratic norm defined by H(x) is
called the Newton decrement
  $$\lambda(x) = \|\Delta x_{nt}\|_{H(x)} = \left(\Delta x_{nt}^T H(x)\,\Delta x_{nt}\right)^{1/2}$$

- It can be used as a stopping criterion since it is an estimate of f(x) − p∗, i.e.,

  $$f(x) - \inf_y \hat{f}(y) = f(x) - \hat{f}(x + \Delta x_{nt}) = \frac{1}{2}\,\lambda^2(x)$$

  where

  $$\hat{f}(x + \Delta x_{nt}) = f(x) + \nabla^T f(x)\,\Delta x_{nt} + \frac{1}{2}\,\Delta x_{nt}^T H(x)\,\Delta x_{nt}$$

  i.e., the second-order quadratic approximation of f(x) at x.


- Substitute $\hat{f}(x + \Delta x_{nt})$ into $f(x) - \inf_y \hat{f}(y)$ and let

  $$\Delta x_{nt} = -H^{-1}(x)\,\nabla f(x)$$

  then

  $$\frac{1}{2}\,\nabla^T f(x)\,H^{-1}(x)\,\nabla f(x) = \frac{1}{2}\,\lambda^2(x)$$

- So, if $\lambda^2(x)/2 < \epsilon$, then the algorithm can be terminated for some small ϵ.

- With the substitution of $\Delta x_{nt} = -H^{-1}(x)\,\nabla f(x)$, the Newton decrement can also be written as

  $$\lambda(x^{(k)}) = \left(\nabla^T f(x^{(k)})\,H^{-1}(x^{(k)})\,\nabla f(x^{(k)})\right)^{1/2}$$
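
- As a quick numerical sanity check that the two expressions for λ agree (the positive definite H and the gradient vector below are arbitrary):

```python
import numpy as np

rng = np.random.default_rng(1)
A = rng.standard_normal((4, 4))
H = A @ A.T + np.eye(4)                     # stands in for a positive definite Hessian
g = rng.standard_normal(4)                  # stands in for a gradient vector

dx_nt = -np.linalg.solve(H, g)              # Newton step -H^{-1} g
lam1 = np.sqrt(dx_nt @ H @ dx_nt)           # ||dx_nt||_H
lam2 = np.sqrt(g @ np.linalg.solve(H, g))   # (g^T H^{-1} g)^{1/2}
print(np.isclose(lam1, lam2))               # True
```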


Newton’s Method

- Given a starting point x(0) ∈ dom f (x) and some small tolerance ϵ > 0

- repeat
  - Compute the Newton step and Newton decrement

    $$\Delta x^{(k)} = -H^{-1}(x^{(k)})\,\nabla f(x^{(k)})$$

    $$\lambda(x^{(k)}) = \left(\nabla^T f(x^{(k)})\,H^{-1}(x^{(k)})\,\nabla f(x^{(k)})\right)^{1/2}$$

  - Stopping criterion: quit if $\lambda^2(x^{(k)})/2 \le \epsilon$.


- Line search: Choose a stepsize α(k) > 0, e.g. by backtracking line
search.
- Update: x(k+1) = x(k) + α(k) ∆x(k) .
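
- A minimal NumPy sketch of this algorithm (the backtracking parameters and the convex test function are illustrative assumptions, not part of the notes):

```python
import numpy as np

def newton_method(f, grad, hess, x0, eps=1e-10, max_iter=100):
    x = x0.astype(float)
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)          # Newton step
        lam2 = -g @ dx                       # lambda^2 = g^T H^{-1} g
        if lam2 / 2 <= eps:                  # stopping criterion
            break
        alpha = 1.0                          # backtracking line search
        while f(x + alpha * dx) > f(x) + 0.25 * alpha * (g @ dx):
            alpha *= 0.5
        x = x + alpha * dx
    return x

# illustrative smooth, strongly convex test problem
f    = lambda x: np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + x[0]**2 + x[1]**2
grad = lambda x: np.array([np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + 2 * x[0],
                           np.exp(x[0] + x[1]) - np.exp(x[0] - x[1]) + 2 * x[1]])
hess = lambda x: np.array([[np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + 2,
                            np.exp(x[0] + x[1]) - np.exp(x[0] - x[1])],
                           [np.exp(x[0] + x[1]) - np.exp(x[0] - x[1]),
                            np.exp(x[0] + x[1]) + np.exp(x[0] - x[1]) + 2]])
print(newton_method(f, grad, hess, np.array([2.0, 1.0])))
```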


- The stepsize α(k) (i.e., line search) is required for the non-quadratic initial parts of the algorithm. Otherwise, the algorithm may not converge due to large higher-order residuals.

- As x(k) gets closer to x∗, f(x) can be better approximated by the second-order expansion. Hence, the stepsize α(k) is no longer required; the line search algorithm will automatically set α(k) = 1.

- If we start with α(k) = 1 and keep it the same, then the algorithm is
called the pure Newton’s method.

- For an arbitrary f (x), there are two regions of convergence.


- damped Newton phase, when x is far from x∗
- quadratically convergent phase, when x gets closer to x∗

- If we let H(x) = I, the algorithm reduces to gradient descent (GD)

x(k+1) = x(k) − α(k) ∇f (x(k) )


- If H(x) is not positive definite, Newton’s method may fail to converge.

- So, use (aI + H(x))−1 instead of H−1(x), also known as (a.k.a.) the Marquardt method. There always exists an a which makes the matrix (aI + H(x)) positive definite.

- a is a trade-off between GD and NA


- a → ∞ ⇒ Gradient Descent (GD)
- a → 0 ⇒ Newton’s Method (NA)
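
- A sketch of this regularization; choosing a by increasing it until a Cholesky factorization of (aI + H(x)) succeeds is one common heuristic (the schedule and the test matrix below are arbitrary choices of this sketch):

```python
import numpy as np

def regularized_newton_step(H, g, a0=1e-4):
    """Return -(aI + H)^{-1} g with the smallest a from a geometric schedule
    that makes aI + H positive definite."""
    N = len(g)
    a = 0.0
    while True:
        try:
            np.linalg.cholesky(H + a * np.eye(N))   # raises if not positive definite
            break
        except np.linalg.LinAlgError:
            a = a0 if a == 0.0 else 10 * a          # increase a and retry
    # solve (aI + H) dx = -g (the Cholesky factor could be reused here)
    dx = -np.linalg.solve(H + a * np.eye(N), g)
    return dx, a

# indefinite Hessian example: the pure Newton step would not be a descent direction
H = np.array([[1.0, 0.0], [0.0, -2.0]])
g = np.array([1.0, 1.0])
dx, a = regularized_newton_step(H, g)
print(a, g @ dx < 0)   # chosen a, and the regularized step is a descent direction
```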


- Newton step and decrement are independent of affine transformations


(i.e., linear coordinate transformations), i.e., for non-singular T ∈ RN ×N

x = Ty and f˜(y) = f (Ty)

then

∇f˜(y) = TT ∇f (x)
H̃(y) = TT H(x)T

- So, the Newton step will be

  $$\begin{aligned}
  \Delta y_{nt} &= -\tilde{H}^{-1}(y)\,\nabla \tilde{f}(y) \\
                &= -\left(T^T H(x)\, T\right)^{-1} \left(T^T \nabla f(x)\right) \\
                &= -T^{-1} H^{-1}(x)\,\nabla f(x) \\
                &= T^{-1} \Delta x_{nt}
  \end{aligned}$$

  i.e.,

  $$x + \Delta x_{nt} = T\left(y + \Delta y_{nt}\right), \quad \forall x$$

- Similarly, the Newton decrement will be


  $$\begin{aligned}
  \lambda(y) &= \left(\nabla^T \tilde{f}(y)\,\tilde{H}^{-1}(y)\,\nabla \tilde{f}(y)\right)^{1/2} \\
             &= \left(\nabla^T f(x)\, T \left(T^T H(x)\, T\right)^{-1} T^T \nabla f(x)\right)^{1/2} \\
             &= \left(\nabla^T f(x)\, H^{-1}(x)\,\nabla f(x)\right)^{1/2} \\
             &= \lambda(x)
  \end{aligned}$$

- Thus, Newton’s Method is independent of affine transformations (i.e.,


linear coordinate transformations).


Convergence Analysis

Read Boyd Sect. 9.5.3.


- Assume a strongly convex f(x) with mI ⪯ H(x) for some constant m > 0 and ∀x ∈ dom f(x), and assume H(x) is Lipschitz continuous on dom f(x), i.e.,

  $$\|H(x) - H(y)\|_2 \le L\,\|x - y\|_2$$

  for a constant L > 0. This inequality imposes a bound on the third derivative of f(x).

- If L is small, f(x) is closer to a quadratic function. If L is large, f(x) is far from a quadratic function. If L = 0, then f(x) is quadratic.

- Thus, L measures how well f (x) can be approximated by a quadratic


function.

- Newton’s Method will perform well for small L.


 
Convergence: There exist constants $\eta \in \left(0, \frac{m^2}{L}\right)$ and $\sigma > 0$ such that
- Damped Newton Phase: $\|\nabla f(x)\|_2 \ge \eta$
  - α < 1 gives better solutions, so most iterations will require line search, e.g. backtracking.
  - As k increases, the function value decreases by at least σ, but not necessarily quadratically.
  - This phase ends after at most $\dfrac{f(x^{(0)}) - p^*}{\sigma}$ iterations.

- Quadratically Convergent Phase: $\|\nabla f(x)\|_2 < \eta$
  - All iterations use α = 1 (i.e., the quadratic approximation fits very well).
  - $\dfrac{\|\nabla f(x^{(k+1)})\|_2}{\|\nabla f(x^{(k)})\|_2^2} \le \dfrac{L}{2m^2}$, i.e., quadratic convergence.


- For small ϵ > 0, f(x) − p∗ < ϵ is achieved after at most

  $$\log_2 \log_2 \frac{\epsilon_0}{\epsilon}$$

  iterations of the quadratically convergent phase, where $\epsilon_0 = \dfrac{2m^3}{L^2}$. This is typically 5-6 iterations.

- The total number of iterations is bounded above by

  $$\frac{f(x^{(0)}) - p^*}{\sigma} + \log_2 \log_2 \frac{\epsilon_0}{\epsilon}$$

- σ and ϵ0 are dependent on m, L and x(0).
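
- For instance (the ratio ϵ0/ϵ = 10^20 is an arbitrary illustrative choice, not a value from the notes), reducing the suboptimality by twenty orders of magnitude costs only

  $$\log_2 \log_2 10^{20} = \log_2\!\left(20 \log_2 10\right) \approx \log_2 66.4 \approx 6$$

  iterations of the quadratically convergent phase, which is where the "typically 5-6 iterations" figure comes from.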


NA Summary

- Convergence of Newton’s method is rapid in general, and quadratic near


x∗ . Once the quadratic convergence phase is reached, at most six or so
iterations are required to produce a solution of very high accuracy.

- Newton’s method is affine invariant. It is insensitive to the choice of


coordinates, or the condition number of the sublevel sets of the objective.

- Newton’s method scales well with problem size. Ignoring the computation of the Hessian, its performance on problems in R^10000 is similar to its performance on problems in R^10, with only a modest increase in the number of steps required.

- The good performance of Newton’s method is not dependent on the


choice of algorithm parameters. In contrast, the choice of norm for
steepest descent plays a critical role in its performance.


- The main disadvantage of Newton’s method is the cost of forming and


storing the Hessian, and the cost of computing the Newton step, which
requires solving a set of linear equations.

- Alternatives to Newton's method are provided by a family of algorithms for unconstrained optimization called quasi-Newton methods. These methods require less computational effort to form the search direction, but they share some of the strong advantages of Newton's method, such as rapid convergence near x∗.


Example 15: Consider the nonquadratic problem in R^2 given in Example 8 and Example 12 (replace α and t with γ and α).


Example 16: Consider the nonquadratic problem in R^100 given in Example 9 (replace α and t with γ and α).


Example 17: (problem in R^10000) Replace α and t with γ and α.


Approximation of the Hessian

- For relatively large scale problems, i.e., when N is large, calculating the inverse of the Hessian at each iteration can be costly. So, we may use some approximation of the Hessian

S(x) = Ĥ−1 (x) → H−1 (x)


x(k+1) = x(k) − α(k) S(x(k) )∇f (x(k) )

Hybrid GD + NA

- We know that the first phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after some conditions are satisfied, we can switch to the NA.

- Newton’s Algorithm may not converge for highly non-quadratic functions


unless x is close to x∗ .

- The hybrid method (given below) also guarantees global convergence.

- Hybrid Algorithm
- Start at x(0) ∈ dom f (x)
- repeat
- run GD (i.e., S(x(k) ) = I)
- until stopping criterion is satisfied
- Start at the final point of GD
- repeat
- run NA with exact H(x) (i.e., S(x(k) ) = H−1 (x(k) ))
- until stopping criterion is satisfied
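
- A sketch of the hybrid scheme (the gradient-norm switching test and all numerical constants are assumptions of this sketch; `f`, `grad` and `hess` are callables as in the earlier Newton sketch):

```python
import numpy as np

def hybrid_gd_newton(f, grad, hess, x0, switch_tol=1e-1, tol=1e-10, max_iter=500):
    x = x0.astype(float)

    # Phase 1: gradient descent with backtracking (S = I), cheap iterations
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= switch_tol:      # crude switching condition
            break
        alpha = 1.0
        while f(x - alpha * g) > f(x) - 1e-4 * alpha * (g @ g):
            alpha *= 0.5
        x = x - alpha * g

    # Phase 2: Newton iterations with the exact Hessian (S = H^{-1})
    for _ in range(max_iter):
        g, H = grad(x), hess(x)
        dx = -np.linalg.solve(H, g)
        if -(g @ dx) / 2 <= tol:                 # Newton decrement stopping criterion
            break
        x = x + dx                               # full step; x is already near x*
    return x
```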


The Chord Method

- If f (x) is close to a quadratic function, we may use S(x(k) ) = H−1 (x(0) )


throughout the iterations, i.e.,

∆x(k) = −H−1 (x(0) )∇f (x(k) )


x(k+1) = x(k) + ∆x(k)

- This is also the same as the SD algorithm with P = H(x(0) ) and


α(k) = 1.
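
- A sketch of the Chord method (inverting H(x^(0)) once and reusing it; in practice one would store a factorization of H(x^(0)) rather than the explicit inverse):

```python
import numpy as np

def chord_method(grad, hess, x0, tol=1e-10, max_iter=200):
    x = x0.astype(float)
    H0_inv = np.linalg.inv(hess(x0))     # Hessian inverted once, at x^(0), and reused
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - H0_inv @ g               # dx^(k) = -H^{-1}(x^(0)) grad f(x^(k))
    return x
```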


The Shamanski Method

- Updating the Hessian every N iterations may give better performance, i.e., $S(x^{(k)}) = H^{-1}\!\left(x^{(\lfloor k/N \rfloor N)}\right)$

  $$\Delta x^{(k)} = -H^{-1}\!\left(x^{(\lfloor k/N \rfloor N)}\right) \nabla f(x^{(k)})$$

  $$x^{(k+1)} = x^{(k)} + \Delta x^{(k)}$$

- This is a trade-off between the Chord method (N ← ∞) and the full NA


(N ← 1).
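
- A sketch of the Shamanski variant (the refresh interval is a free parameter of this sketch; a very large interval recovers the Chord method and an interval of 1 recovers the full NA):

```python
import numpy as np

def shamanski_method(grad, hess, x0, refresh=5, tol=1e-10, max_iter=200):
    x = x0.astype(float)
    for k in range(max_iter):
        if k % refresh == 0:                 # re-evaluate the Hessian every `refresh` steps,
            H_inv = np.linalg.inv(hess(x))   # i.e., at x^(floor(k/N) N)
        g = grad(x)
        if np.linalg.norm(g) <= tol:
            break
        x = x - H_inv @ g
    return x
```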


Approximating Particular Terms

- Inversion of sparse matrices can be easier, i.e., when many entries of


H(x) are zero
- If some entries of H(x) are small or below a small threshold, then
set them to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes sparse.
  - In the extreme case, when the Hessian is strongly diagonally dominant, set the off-diagonal terms to zero, obtaining Ĥ(x). Thus, Ĥ(x) becomes diagonal, which is very easy to invert.
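
- A sketch of these two approximations (the threshold value and the test matrix are arbitrary choices of this sketch):

```python
import numpy as np

def thresholded_hessian(H, threshold=1e-3):
    """Zero out small off-diagonal entries so that H_hat becomes (nearly) sparse."""
    H_hat = H.copy()
    mask = np.abs(H_hat) < threshold
    np.fill_diagonal(mask, False)        # never drop the diagonal
    H_hat[mask] = 0.0
    return H_hat

def diagonal_hessian_step(H, g):
    """Extreme case: keep only the diagonal, so the inverse is elementwise."""
    return -g / np.diag(H)               # dx = -diag(H)^{-1} g

H = np.array([[4.0, 0.001, 0.0002],
              [0.001, 5.0, 0.0005],
              [0.0002, 0.0005, 6.0]])
g = np.array([1.0, -2.0, 3.0])
print(thresholded_hessian(H))            # off-diagonal entries below 1e-3 removed
print(diagonal_hessian_step(H, g))       # [-0.25, 0.4, -0.5]
```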

- There are also other advanced quasi-Newton (modified Newton)


algorithms developed to approximate the inverse of the Hessian, e.g.
Broyden and Davidon-Fletcher-Powell (DFP) methods.
