ELE704 - Lecture Notes - I - 03-04-2024
Overview
Convex Sets and Functions
Unconstrained Minimization
Descent Methods
Cenk Toker
Hacettepe University
2023-24 Spring
Textbooks
Course Content
Notation
The symbols Z++ and R++ denote the sets of positive (not including
zero) integer and real numbers, respectively. Similarly, Z+ and R+ denote
the sets of nonnegative integer and real numbers, respectively.
The symbols ∀,∃, : , ∧, ∨, =⇒, ⇐⇒, ⊂ and ∈ denote the terms “for all”,
“there exists”, “such that”, “and”, “or”, “if . . . then”, “if and only if (iff)”,
“subset of” and “element of”, respectively.
where i = 1, 2, · · · , N .
n−η = [ n1 − K1 n2 − K2 · · · nN − KN ]T
n<η ≡ n1 < K1 , n2 < K2 , · · · , nN < KN
where n = [ n1 n2 · · · nN ]T and η = [ K1 K2 · · · KN ]T .
where i = 1, 2, · · · , M and j = 1, 2, · · · , N .
The quantity AT denotes the transpose of A.
The quantities A−1 and A−T denote the inverse and the
inverse transpose of A, respectively.
An N -length column vector a is essentially an N ×1 matrix
and similarly the row vector aT is also a 1×N matrix.
(AB)T = BT AT
The notation
diag [ d1 d2 · · · dN ]
stands for an N × N diagonal matrix D with the main
diagonal elements Di,i = di , i.e.
D = [ d1   0    · · ·   0      0  ]
    [ 0    d2   · · ·   0      0  ]
    [ ⋮          ⋱              ⋮  ]
    [ 0    · · ·  0     dN−1   0  ]
    [ 0    0    · · ·   0      dN ]
where d = [ d1 d2 · · · dN ]T .
AA−1 = A−1 A = I
a · b = aT b
= a1 b1 + a2 b2 + · · · + aN bN
where a = [ a1 a2 · · · aN ]T and b = [ b1 b2 · · · bN ]T .
Note that the inner product produces a scalar value and is commutative, i.e.,
a·b=b·a
so,
a T b = bT a
Inner product is sometimes denoted with the ⟨ , ⟩ operator, i.e.,
⟨a, b⟩ = a · b = aT b
The outer product, on the other hand, is not commutative, i.e.,
a ⊗ b ≠ b ⊗ a
so
abT ≠ baT
f (x) ≡ f (x1 , x2 , · · · , xN )
f (x) : RN → R
dom f
For instance, the domain of cosine is the set of all real numbers (the
fundamental domain is [0, 2π), all other real values fold onto this region
with mod 2π), while the domain of the square root consists only of
numbers greater than or equal to 0 (ignoring complex numbers in both
cases), i.e.,
dom cos = R
dom sqrt = R+
For a function whose domain is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the domain
is represented on the x-axis (y-axis gives the range, next slide).
Range: The range of a function refers to the image of the function. The
codomain is a set containing the function’s outputs, whereas the image is
the part of the codomain which consists only of the function’s outputs.
For a function whose range is a subset of the real numbers, when the
function is represented in an xy Cartesian coordinate system, the range is
represented on the y-axis.
f (x) : R → RN
A = { x | P (x) }
For instance, the negative real numbers do not have a greatest element,
and their supremum is 0 (which is not a negative real number).
Examples:
sup {1, 2, 3} = 3
sup {x ∈ R : 0 < x < 1} = sup {x ∈ R : 0 ≤ x ≤ 1} = 1
For instance, the positive real numbers do not have a least element, and
their infimum is 0, which is not a positive real number.
Examples:
inf {1, 2, 3} = 1
inf {x ∈ R : 0 < x < 1} = inf {x ∈ R : 0 ≤ x ≤ 1} = 0
Note: If not stated otherwise, ∥x∥ will denote the Euclidean norm, ∥x∥2 ,
i.e., L2 -norm.
Unit Ball: Let ∥ · ∥ denote any norm on RN , then the unit ball is the set
of all vectors with norm less than or equal to one, i.e.,
B = {x : ∥x∥ ≤ 1} .
Figure: two-dimensional unit ball of the L∞-norm, i.e., B∞ = { x ∈ R2 : ∥x∥∞ ≤ 1 }.
Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values
∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup_y { yT x : ∥y∥ ≤ 1 }
xT y ≤ ∥x∥ · ∥y∥∗
The dual to the dual norm above is the original norm, i.e.,
∥x∥∗∗ = ∥x∥
- The norm dual to the Euclidean norm is itself. This comes directly from
the Cauchy-Schwarz inequality.
∥x∥2∗ = ∥x∥2
- The norm dual to the L∞-norm is the L1-norm, and vice versa. More generally, the dual of the Lp-norm is the Lq-norm,
∥x∥p∗ = ∥x∥q
where 1/p + 1/q = 1, i.e., q = p/(p − 1).
Av − λv = 0
(A − λI)v = 0
Note that the condition number κ(·) of a symmetric positive-definite matrix is given by the ratio of its largest and smallest eigenvalues, e.g.,
κ(H(x)) = max_i λi / min_i λi
If A is positive-semidefinite, it is denoted by
A⪰0
and has nonnegative eigenvalues.
If A is positive-definite, it is denoted by
A≻0
and has positive eigenvalues.
For any real matrix B, the matrix BT B is positive-semidefinite, and rank(B) = rank(BT B).
If xT Ax ≤ 0 for all x, then A is negative-semidefinite, denoted by
A ⪯ 0
and has nonpositive eigenvalues. If xT Ax < 0 for all x ≠ 0, then A is negative-definite, denoted by
A ≺ 0
and has negative eigenvalues.
The notation ∂/∂x stands for the multidimensional partial derivative operator in the form of a column vector, i.e.,
∂/∂x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T
and the gradient (del) operator is defined accordingly,
∇ = ∇x = [ ∂/∂x1   ∂/∂x2   · · ·   ∂/∂xN ]T .
Consider the following example,
∇f(x) = ∇x f(x) = ∂f(x)/∂x = [ ∂f(x)/∂x1   ∂f(x)/∂x2   · · ·   ∂f(x)/∂xN ]T .
Note that the ∇ operator is not commutative, i.e.,
∇f (x) ̸= f (x)∇
where d = [ d1 d2 · · · dN ]T .
For a vector function f(x), (∇f T(x))T gives the Jacobian matrix

Jf(x) = [ ∂fi(x)/∂xj ]  (M × N)
      = [ ∂f1(x)/∂x1    ∂f1(x)/∂x2    · · ·    ∂f1(x)/∂xN ]
        [ ∂f2(x)/∂x1    ∂f2(x)/∂x2    · · ·    ∂f2(x)/∂xN ]
        [      ⋮              ⋮          ⋱           ⋮     ]
        [ ∂fM(x)/∂x1    ∂fM(x)/∂x2    · · ·    ∂fM(x)/∂xN ]
x∗ = argmin_x f(x) = argmin_{x∈RN} f(x)
respectively, where k = 1, 2, · · · K.
Geometric Definitions
Hyperplane
{ x | aT x = b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
i.e., the solution set of aT x = b.
Halfspace
{ x | aT x ≥ b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
or
{ x | aT x ≤ b;  a, x ∈ RN , b ∈ R, a ≠ 0 }
each constitutes a halfspace.
Polyhedra
- The solution set of finitely many linear inequalities and equalities constitutes a polyhedron,
{ x | Ax ≤ b, Cx = d;  A ∈ RM×N , C ∈ RP×N , b ∈ RM , d ∈ RP , x ∈ RN }
Example 1:
- Find the solution of
Example 2:
- Consider the pyramid in the following figure. The length of
each side is 1 unit. Find the halfspaces defining this volume.
- Solution:
[ 0        1          0      ] x ≥ 0
[ 1/√2    −1/(2√3)   −1/√6   ] x ≥ 0
[ 0       −1/(2√3)    √(2/3) ] x ≥ 0
[ −1/√2   −1/(2√3)   −1/√6   ] x ≥ −1/√2
where x = [ x1 x2 x3 ]T.
Change of coordinates:
- Let y = P1/2 x, then the ellipsoid becomes { y | (y − yc)T (y − yc) ≤ r }, i.e., a ball with respect to the Euclidean norm with center yc and radius r.
Cone
Example 3:
- Find the region x1 + 2x2 ≥ 4, i.e., [ 1  2 ] [ x1  x2 ]T ≥ 4, and x1, x2 ≥ 0
- Solution:
- Example 7:
min x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
- Example 8:
min 3x1 + x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
- Example 9:
min x1 − x2
s.t. x1 + 2x2 ≥ 6
2x1 + x2 ≥ 6
x1 , x2 ≥ 0
- Solution:
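The small LPs in Examples 7-9 can be checked numerically. Below is a minimal sketch for Example 7, assuming SciPy is available (linprog only accepts "≤" constraints, so the "≥" rows are negated):

    # Example 7: min x1 + x2  s.t.  x1 + 2*x2 >= 6,  2*x1 + x2 >= 6,  x1, x2 >= 0
    from scipy.optimize import linprog

    c = [1, 1]                        # objective coefficients
    A_ub = [[-1, -2], [-2, -1]]       # ">=" constraints negated into "<=" form
    b_ub = [-6, -6]
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=[(0, None), (0, None)])
    print(res.x, res.fun)             # optimum x* = (2, 2), objective value 4

Changing c to [3, 1] (Example 8) or [1, -1] (Example 9) reproduces the other cases; for Example 9 the solver should report the problem as unbounded.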
Overview
- Unconstrained Optimization
- Constrained Optimization
Unconstrained Optimization
min f (x)
x∈X
Constrained Optimization
min f (x)
x∈X
subject to g(x) ≤ 0
h(x) = 0
- where x = [ x1 x2 · · · xN ]T , X ⊂ RN ,
g(x) = [ g1 (x) g2 (x) · · · gM (x) ]T are the inequality
constraints, h(x) = [ h1 (x) h2 (x) · · · hL (x) ]T are the
equality constraints, and x∗ is a feasible solution (or point) iff g(x∗) ≤ 0 and h(x∗) = 0.
- Examples:
- Computer networks - e.g., optimum routing problem
- Production planning in a factory
- Resource allocation
- Computer aided design (CAD) - e.g., shortest paths in a
PCB
- Travelling salesman problem
Formulation:
- Variable: amount of food xi for i-th food. Thus, variables can
be represented by the vector x
x = [ x1  x2  · · ·  xi  · · ·  xN ]T
where i = 1, 2, · · · , N .
Amount of food cannot be negative, i.e., xi ≥ 0, so
x≥0
- Cost:
f (x) = c1 x1 + c2 x2 + · · · + ci xi + · · · + cN xN
= cT x
- Constraints:
A1,1 x1 + A1,2 x2 + · · · + A1,i xi + · · · + A1,N xN ≥ b1
A2,1 x1 + A2,2 x2 + · · · + A2,i xi + · · · + A2,N xN ≥ b2
⋮
Aj,1 x1 + Aj,2 x2 + · · · + Aj,i xi + · · · + Aj,N xN ≥ bj
⋮
AM,1 x1 + AM,2 x2 + · · · + AM,i xi + · · · + AM,N xN ≥ bM
or, in matrix form, Ax ≥ b.
min_x   cT x
s.t.    b − Ax ≤ 0
        −x ≤ 0
Basic Definitions
f (x∗ ) ≤ f (y), ∀y ∈ F, y ̸= x∗
f ′(x0) = lim_{(x−x0)→0} [f(x) − f(x0)] / (x − x0) = lim_{Δx→0} Δf / Δx
Δf = ∇T f(x0) Δx + (1/2) ΔxT H(x0) Δx + β(x0, Δx),  where lim_{Δx→0} β(x0, Δx) = 0
Example 11:
- Let f(x) = 3x1²x2³ + x2²x3³ as in the previous example, then

H(x) = ∇²f(x) = [ 6x2³        18x1x2²           0       ]
                [ 18x1x2²     18x1²x2 + 2x3³    6x2x3²  ]
                [ 0           6x2x3²            6x2²x3  ]
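The gradient and Hessian above can be verified symbolically; a short sketch assuming SymPy is installed:

    import sympy as sp

    x1, x2, x3 = sp.symbols('x1 x2 x3')
    f = 3*x1**2*x2**3 + x2**2*x3**3
    grad = sp.Matrix([sp.diff(f, v) for v in (x1, x2, x3)])   # gradient as a column vector
    H = sp.hessian(f, (x1, x2, x3))                           # 3x3 Hessian matrix
    print(grad)
    print(H)                                                  # matches the matrix above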
An N × N matrix M is called
- positive definite (M ≻ 0), if xT Mx > 0, ∀x ∈ RN , x ̸= 0,
or, all eigenvalues of M are positive
Example 12:
-
M = [ 2   0 ]  ≻ 0, positive definite
    [ 0   3 ]

M = [ 8   −1 ]  ≻ 0, positive definite
    [ −1   1 ]
Check xT Mx or check eigenvalues.
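Either check can be automated; a small sketch with NumPy (eigvalsh assumes the matrix is symmetric):

    import numpy as np

    def is_positive_definite(M):
        # All eigenvalues of the symmetric matrix M must be strictly positive.
        return bool(np.all(np.linalg.eigvalsh(M) > 0))

    print(is_positive_definite(np.array([[2.0, 0.0], [0.0, 3.0]])))    # True
    print(is_positive_definite(np.array([[8.0, -1.0], [-1.0, 1.0]])))  # True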
- Hint: Recall that for x ∈ R, f (x) = ax2 ⇒ f ′′ (x) = 2a. Thus the
function f (x) is convex if a > 0 or concave if a < 0.
Problem:
min f (x)
x∈X
then
f(x∗ + λd) − f(x∗) < 0, ∀λ > 0, λ → 0
⇒ x∗ is not a local minimum. Contradiction!
- Proof:
[f(x∗ + λd) − f(x∗)] / λ² = (1/2) dT H(x∗) d + residual (→ 0 as λ → 0),  where dT H(x∗) d > 0
then
f (x∗ + λd) − f (x∗ ) > 0 ∀λ > 0, λ → 0
⇒ x∗ is a local minimum.
- Example 13:
H(x0) = [ 1   1 ]  ≻ 0    and    ∇f(x0) = 0   ⇒
        [ 1   4 ]
Point x0 satisfies the sufficient conditions, point x0 is a local minimum.
- ∇f(x) = [ 3x1²   2x2 ]T    and    H(x) = [ 6x1   0 ]
                                            [ 0     2 ]
- ∇f(x) = 0 at x0 = [ 0   0 ]T , but H(x0) = [ 0   0 ]
                                              [ 0   2 ]
  is only positive semidefinite, so x0 may or may not be a local minimum. Note that
  f(x0) = f(0, 0) = 0.
Convex Sets
Left: The hexagon, which includes its boundary (shown darker), is convex.
Middle: The kidney shaped set is not convex, since the line segment between
the two points in the set shown as dots is not contained in the set. Right: The
square contains some boundary points but not others, and is not convex.
- Convex Combination: For x1, x2, x3, . . . , xk and θi ≥ 0 with Σ_{i=1}^{k} θi = 1, if
x = θ1x1 + θ2x2 + · · · + θkxk ∈ C, then C is a convex set.
- For example, the convex hulls of two sets in R2 are given below.
Left: The convex hull of a set of fifteen points (shown as dots) is the
pentagon (shown shaded). Right: The convex hull of the kidney shaped
set in the middle of the convex set is the whole shaded set.
dom f = C ⊆ RN
∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.
dom f = C ⊆ RN
∀x1 , x2 ∈ C and 0 ≤ θ ≤ 1.
dom f = C ⊆ RN
with
dom f : convex
satisfies
f (θx1 + (1 − θ)x2 ) = θf (x1 ) + (1 − θ)f (x2 )
Hence it can be considered as convex or concave depending on the
problem.
Examples on R: (scalar)
- Convex:
- affine: ax + b, on R, for any a, b ∈ R
- exponential: eax , on R, for any a ∈ R
- powers: xα , on R++ , for α ≥ 1 or α ≤ 0
- powers of absolute value: |x|α , on R, for α ≥ 1
- negative entropy: x log x, on R++
- Concave:
- affine: ax + b, on R, for any a, b ∈ R
- powers: xα , on R++ , for 0 ≤ α ≤ 1
- logarithm: log x, on R++
Examples on RN : (vectors)
- All norms (i.e. Lp -norm) are convex
∥x∥p = ( Σ_{i=1}^{N} |xi|^p )^{1/p}    for p ≥ 1.
f (x) = aT x + b
Further Examples:
- Quadratic function
f(x) = (1/2) xT Px + qT x + r
∇f (x) = Px + q
H(x) = P
f (x) is convex if P ⪰ 0.
for all x.
- Geometric mean
f(x) = ( Π_{i=1}^{N} xi )^{1/N}
is concave on RN++.
- Nonnegative multiple
αf (x), for α ≥ 0
- Pointwise maximum
f(x) = max {f1(x), f2(x), . . . , fi(x), . . . , fM(x)}
is convex if the fi(x) are convex, e.g.,
f(x) = max_{i=1,...,M} ( aiT x + bi )
- Pointwise supremum
g(x) = sup_{y∈A} f(x, y)
f(x) = sup_{y∈C} ∥x − y∥
- Pointwise infimum
g(x) = inf_{y∈C} f(x, y)
d(x, S) = inf_{y∈S} ∥x − y∥
f (x) = h(g(x))
- You may find resources about taking the derivative of expressions with
matrix/vectors at
http://www.ee.ic.ac.uk/hp/staff/dmb/matrix/intro.html
- Corollary:
Optimality Conditions
∇f (x∗ ) = Qx∗ + c = 0
- Ex:
f (x) = xT x = ∥x∥22
f (x) = (x − a)T (x − a) = ∥x − a∥22
f (x) = (x − a)T P(x − a) = ∥x − a∥2P , where P is SPD.
Unconstrained Minimization
- The aim is
min f (x)
- Necessary conditions:
∇f (x∗ ) = Qx∗ − b = 0
H(x∗ ) = Q ⪰ 0 (PSD)
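When Q ≻ 0 these conditions are also sufficient, and x∗ is obtained by solving the linear system Qx∗ = b. A small numerical sketch (the matrix Q and vector b below are hypothetical, for illustration only):

    import numpy as np

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])   # symmetric positive definite (hypothetical)
    b = np.array([1.0, 2.0])
    x_star = np.linalg.solve(Q, b)           # solves the stationarity condition Q x* = b
    print(x_star, Q @ x_star - b)            # gradient Q x* - b should be ~ 0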
Example 2: Consider
min_{x1,x2 ∈ R} f(x1, x2) = (1/2)(αx1² + βx2²) − x1
- Here, let us first express the above equation in the quadratic program
form with
Q = [ α    γ ]        b = [ 1 ]
    [ −γ   β ]            [ 0 ]
where γ ∈ R, for simplicity we can take γ = 0. So,
- If α > 0 and β > 0, i.e. Q ≻ 0, then x∗ = (1/α, 0) is the unique global minimum.
- If α > 0 and β = 0, i.e. Q ⪰ 0, there are an infinite number of solutions, (1/α, y), y ∈ R.
- If α = 0 and β > 0, i.e. Q ⪰ 0, there is no solution.
- If α < 0 and β > 0 (or α > 0 and β < 0), i.e. Q is indefinite, there is no solution.
Figure: surface plots of f(x1, x2) over x1, x2 ∈ [−10, 10] illustrating the four cases above.
- Two possibilities:
- {f (x) : x ∈ X} is unbounded below ⇒ no optimal solution.
- {f (x) : x ∈ X} is bounded below ⇒ a global minimum exists, provided the minimum is attained at a finite point (∥x∗∥ < ∞).
f (x(k) ) → p∗
Topography (https://www.rei.com/learn/expert-advice/topo-maps-how-to-use.html)
Descent Methods
Motivation
∇T f (x) d < 0
except x(k) = x∗ .
where the scalar α(k) ∈ (0, δ) is the stepsize (or step length) at iteration
k, and d(k) ∈ RN is the step or search direction.
- How to find optimum α(k) ? Line Search Algorithm
- How to find optimum d(k) ? Depends on the descent algorithm,
e.g. d = −∇f (x(k) ).
Note that, here the descent direction is d(k) = −H−1 (x(k) )∇f (x(k) ).
Line Search
h′(α(k)) = ∂h(α(k))/∂α = 0
h′(α) = ∂h(α)/∂α = ∇T f(x(k) + αd(k)) d(k)    (using the chain rule)
Therefore, since d is a descent direction (i.e., ∇T f(x(k)) d(k) < 0), we have h′(0) < 0. Also, h′(α) is a monotone increasing function of α because h(α) is convex. Hence, search for h′(α(k)) = 0.
Choice of stepsize:
- Constant stepsize
α(k) = c : constant
- Diminishing stepsize
α(k) → 0,  while satisfying Σ_{k=0}^{∞} α(k) = ∞.
α0 = α(k) = − [∇T f(x(k)) d(k)] / [d(k)T H(x(k)) d(k)]
Bisection Algorithm:
- Assume h(α) is convex, then h′ (α) is a monotonically increasing
function. Suppose that we know a value α̂ such that h′ (α̂) > 0.
Algorithm:
- Set k = 0, αℓ = 0 and αu = α̂
- Set α̃ = (αℓ + αu)/2 and calculate h′(α̃)
- If h′ (α̃) > 0 ⇒ αu = α̃ and k = k + 1
- If h′ (α̃) < 0 ⇒ αℓ = α̃ and k = k + 1
- If h′ (α̃) = 0 ⇒ stop.
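A minimal sketch of this bisection search on h′(α), assuming h(α) = f(x + αd) is convex and a value α̂ with h′(α̂) > 0 is known (the quadratic test function at the end is hypothetical):

    def bisection_line_search(h_prime, alpha_hat, tol=1e-8, max_iter=100):
        # h' is monotone increasing on [0, alpha_hat]; halve the interval until h'(alpha) ~ 0.
        a_lo, a_hi = 0.0, alpha_hat
        a_mid = 0.5 * (a_lo + a_hi)
        for _ in range(max_iter):
            a_mid = 0.5 * (a_lo + a_hi)
            g = h_prime(a_mid)
            if abs(g) < tol:
                break
            if g > 0:
                a_hi = a_mid          # zero of h' lies to the left
            else:
                a_lo = a_mid          # zero of h' lies to the right
        return a_mid

    # Usage: h(a) = (3 - a)^2, so h'(a) = 2(a - 3) and the minimizer is a = 3
    print(bisection_line_search(lambda a: 2.0 * (a - 3.0), alpha_hat=10.0))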
holds for α ∈ [0, α0 ]. Then, the line search stops with a step length α
i. α = 1 if α0 ≥ 1
ii. α ∈ [βα0 , α0 ].
In other words, the step length obtained by backtracking line search
satisfies
α ≥ min {1, βα0 } .
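A sketch of a standard backtracking line search with the sufficient-decrease (Armijo) condition; the parameter names γ (slope fraction) and β (shrink factor) follow the notes, while the default values and test function are assumptions:

    import numpy as np

    def backtracking_line_search(f, grad_f, x, d, gamma=0.3, beta=0.5):
        # Assumes d is a descent direction, i.e. grad_f(x) @ d < 0.
        alpha = 1.0
        fx = f(x)
        slope = grad_f(x) @ d                    # directional derivative, negative
        while f(x + alpha * d) > fx + gamma * alpha * slope:
            alpha *= beta                        # backtrack until sufficient decrease
        return alpha

    # Usage on f(x) = ||x||^2 with the negative gradient as search direction:
    f = lambda x: float(x @ x)
    grad_f = lambda x: 2.0 * x
    x0 = np.array([1.0, 2.0])
    print(backtracking_line_search(f, grad_f, x0, -grad_f(x0)))   # returns 0.5 here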
Convergence
Definition: Let ∥ · ∥ be a norm on RN. Let {x(k)}_{k=0}^{∞} be a sequence of vectors in RN. Then, the sequence {x(k)}_{k=0}^{∞} is said to converge to a limit x∗ if
lim_{k→∞} x(k) = x∗
and x∗ is called the limit of the sequence {x(k)}_{k=0}^{∞}.
- Nε may depend on ε
- For a distance ε, after Nε iterations, all the subsequent iterations
are within this distance ε to x∗ .
- This definition does not characterize how fast the convergence is (i.e.
rate of convergence).
Rate of Convergence
Definition: Let ∥ · ∥ be a norm on RN. A sequence {x(k)}_{k=0}^{∞} that converges to x∗ ∈ RN is said to converge at rate R ∈ R++ and with rate constant δ ∈ R++ if
lim_{k→∞} ∥x(k+1) − x∗∥ / ∥x(k) − x∗∥^R = δ
- If R = 1, 0 < δ < 1, then rate is linear
- If 1 < R < 2, 0 < δ < ∞, then rate is called super-linear
- If R = 2, 0 < δ < ∞, then rate is called quadratic
Example 5: The sequence {a^k}_{k=0}^{∞}, 0 < a < 1, converges to 0.
lim_{k→∞} ∥a^(k+1) − 0∥ / ∥a^k − 0∥^1 = a   ⇒   R = 1, δ = a
Example 6: The sequence {a^(2^k)}_{k=0}^{∞}, 0 < a < 1, converges to 0.
lim_{k→∞} ∥a^(2^(k+1)) − 0∥ / ∥a^(2^k) − 0∥² = 1   ⇒   R = 2, δ = 1
- repeat
- Direction: d(k) = −∇f (x(k) )
- Line search: Choose step size α(k) via a line search algorithm
- Update: x(k+1) = x(k) + α(k) d(k)
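A compact sketch of the whole loop, combining the descent direction above with a backtracking line search (the test function and tolerances are hypothetical):

    import numpy as np

    def gradient_descent(f, grad_f, x0, tol=1e-6, max_iter=1000, gamma=0.3, beta=0.5):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            if np.linalg.norm(g) < tol:          # stopping criterion: small gradient
                break
            d = -g                                # descent direction
            alpha = 1.0
            while f(x + alpha * d) > f(x) + gamma * alpha * (g @ d):
                alpha *= beta                     # backtracking line search
            x = x + alpha * d                     # update
        return x

    # Usage: minimize f(x) = (x1 - 1)^2 + 4*(x2 + 2)^2, whose minimizer is (1, -2)
    f = lambda x: (x[0] - 1.0)**2 + 4.0 * (x[1] + 2.0)**2
    grad_f = lambda x: np.array([2.0 * (x[0] - 1.0), 8.0 * (x[1] + 2.0)])
    print(gradient_descent(f, grad_f, [0.0, 0.0]))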
Convergence Analysis
- mI ⪯ H(x), i.e.,
(H(x) − mI) ⪰ 0
yT H(x)y ≥ m∥y∥2 , ∀y ∈ RN
- H(x) ⪯ M I, i.e.,
(M I − H(x)) ⪰ 0
yT H(x)y ≤ M ∥y∥2 , ∀y ∈ RN
with ∀x ∈ dom f (x).
- When y = x∗,
f(x∗) = p∗ ≥ f(x) − (1/(2m)) ∥∇f(x)∥²
- A stopping criterion:
f(x) − p∗ ≤ (1/(2m)) ∥∇f(x)∥²
- For any x, y ∈ dom f(x), using similar derivations as for the lower bound, we arrive at an upper bound
f(y) ≤ f(x) + ∇T f(x)(y − x) + (M/2) ∥y − x∥²
- Then, for y = x∗,
f(x∗) = p∗ ≤ f(x) − (1/(2M)) ∥∇f(x)∥²
- Hence,
(1/(2M)) ∥∇f(x)∥² ≤ f(x) − p∗
- For the exact line search, let us use second order approximation for
f (x(k+1) ):
- However, let us use the upper bound of the second order approximation
for convergence analysis
f(x(k+1)) ≤ f(x(k)) − α ∥∇f(x(k))∥² + (M α²/2) ∥∇f(x(k))∥²
- Find α0′ such that upper bound of f (x(k) − α∇f (x(k) )) is minimized over
α.
f(x(k+1)) − p∗ ≤ c^k (f(x(0)) − p∗)
- well-conditioned Hessian, m/M → 1 ⇒ denominator is large (K gets smaller).
- ill-conditioned Hessian, m/M → 0 ⇒ denominator is small (K gets larger).
- Similar to the analysis for exact line search, subtract p∗ from both sides:
f(x(k+1)) − p∗ ≤ f(x(k)) − p∗ − γ min{1, β/M} ∥∇f(x(k))∥²
- Finally,
[f(x(k+1)) − p∗] / [f(x(k)) − p∗] ≤ 1 − 2mγ min{1, β/M} = c < 1
Note: Examples 7, 8, 9 and 10 are taken from Convex Optimization (Boyd and Vandenberghe) (Ch. 9).
Observations:
- The gradient descent algorithm is simple.
- The main advantage of the gradient method is its simplicity. Its main
disadvantage is that its convergence rate depends so critically on the
condition number of the Hessian or sublevel sets.
Dual Norm: Let ∥ · ∥ denote any norm on RN , then the dual norm,
denoted by ∥ · ∥∗ , is the function from RN to R with values
∥x∥∗ = max_y { yT x : ∥y∥ ≤ 1 } = sup_y { yT x : ∥y∥ ≤ 1 }
xT y ≤ ∥x∥ · ∥y∥∗
- The norm dual to the Euclidean norm is itself. This comes directly from the Cauchy-Schwarz inequality.
∥x∥2∗ = ∥x∥2
- The norm dual to the L∞-norm is the L1-norm, and vice versa. More generally, the dual of the Lp-norm is the Lq-norm,
∥x∥p∗ = ∥x∥q
where 1/p + 1/q = 1, i.e., q = p/(p − 1).
where Q = P−1 .
- repeat
- Compute the steepest descent direction dsd(k)
- Line search: Choose step size α(k) via a line search algorithm
- Update: x(k+1) = x(k) + α(k) dsd(k)
- Thus, x∗ = P−1/2 y∗ .
- Let i be any index for which ∥∇f (x0 )∥∞ = max |(∇f (x0 ))i |. Then a
normalized steepest descent direction dnsd for the L1 -norm is given by
dnsd = − sign( ∂f(x0)/∂xi ) ei
where ei is the i-th standard basis vector (i.e. the coordinate axis
direction) with the steepest gradient. For example, in the figure above we
have dnsd = e1 .
Choice of norm:
- Choice of norm can dramatically affect the convergence
Convergence Analysis
- (Using backtracking line search) It can be shown that any norm can be
bounded in terms of Euclidean norm with a constant η ∈ (0, 1]
∥x∥∗ ≥ η∥x∥2
f(x(k) + α dsd) ≤ f(x(k)) − (η²/(2M)) ∥∇f(x(k))∥∗²  ≤  f(x(k)) + (γη²/M) ∇T f(x(k)) dsd
- Since γ < 0.5 and −∥∇f(x)∥∗² = ∇T f(x) dsd, backtracking line search will return α ≥ min{1, βη²/M}, then
f(x(k) + α dsd) ≤ f(x(k)) − γ min{1, βη²/M} ∥∇f(x(k))∥∗²
              ≤ f(x(k)) − γη² min{1, βη²/M} ∥∇f(x(k))∥2²
[f(x(k+1)) − p∗] / [f(x(k)) − p∗] ≤ 1 − 2mγη² min{1, βη²/M} = c < 1
- Linear convergence
f(x(k)) − p∗ ≤ c^k ( f(x(0)) − p∗ )
Conjugate Directions
dT1 Qd2 = 0
α0 d0 + α1 d1 + · · · + αk dk = 0
- But dTi Qdi > 0 (Q: PD), then αi = 0. Repeat for all αi .
- αi can be found from the known vector b and matrix Q once di are
found.
- with
α(k) = − [dkT g(k)] / [dkT Q dk]
and g(k) is the gradient at x(k).
- We will show that at each step x(k) minimizes the objective over the
k-dimensional linear variety x(0) + B(k) .
Theorem (Expanding Subspace Theorem): Let {di}_{i=0}^{N−1} be non-zero, Q-orthogonal vectors in RN.
Proof: Since x(k) ∈ x(0) + B(k) , i.e., B(k) contains the line
x = x(k−1) − αdk−1 , it is enough to show that x(k) minimizes f (x) over
x(0) + B(k)
- Since we assume that f (x) is strictly convex, the above condition holds
when g(k) is orthogonal to B(k) , i.e., the gradient of f (x) at x(k) is
orthogonal to B(k) .
- From the definition of g(k) (g(k) = Qx(k) − b), it can be shown that
dTi g(k) = 0
for i < k.
- repeat
- α(k) = − [d(k)T g(k)] / [d(k)T Q d(k)]
- until k = N.
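A minimal sketch of this quadratic-case CG iteration (NumPy; the matrix Q and vector b are hypothetical, and the conjugate directions are generated with the Fletcher-Reeves update):

    import numpy as np

    def conjugate_gradient(Q, b, x0=None, tol=1e-10):
        # Minimizes 0.5*x'Qx - b'x for symmetric positive definite Q.
        n = b.shape[0]
        x = np.zeros(n) if x0 is None else np.asarray(x0, dtype=float)
        g = Q @ x - b                              # gradient g(k) = Qx - b
        d = -g                                     # first direction: steepest descent
        for _ in range(n):                         # at most N steps in exact arithmetic
            if np.linalg.norm(g) < tol:
                break
            alpha = -(d @ g) / (d @ (Q @ d))       # exact step length along d
            x = x + alpha * d
            g_new = Q @ x - b
            beta = (g_new @ g_new) / (g @ g)       # Fletcher-Reeves coefficient
            d = -g_new + beta * d                  # next Q-conjugate direction
            g = g_new
        return x

    Q = np.array([[4.0, 1.0], [1.0, 3.0]])         # hypothetical SPD matrix
    b = np.array([1.0, 2.0])
    print(conjugate_gradient(Q, b), np.linalg.solve(Q, b))   # the two should agree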
- Algorithm terminates in at most N steps with the exact solution (for the
quadratic case)
- The process makes uniform progress toward the solution at every step.
Important for the nonquadratic case.
- Solution is given by
CG Summary
- In theory (with exact arithmetic) converges to solution in N steps
- The bad news: due to numerical round-off errors, can take more
than N steps (or fail to converge)
- The good news: with luck (i.e., good spectrum of Q), can get good
approximate solution in ≪ N steps
- Expanding,
f(x) ≅ (1/2) xT H(x0) x + (∇T f(x0) − x0T H(x0)) x + [ f(x0) + (1/2) x0T H(x0) x0 − ∇T f(x0) x0 ]
where the bracketed term is independent of x, i.e., constant.
- Thus,
min f(x) ≡ min (1/2) xT H(x0) x + (∇T f(x0) − x0T H(x0)) x
         ≡ min (1/2) xT Q x − bT x
- Here,
Q = H(x0)
bT = −∇T f(x0) + x0T H(x0)
g(k) = Q x(k) − b
     = H(x0) x0 + ∇f(x0) − H(x0) x0    (substituting x(k) = x0)
     = ∇f(x0)
- repeat
- α(k) = − [d(k)T g(k)] / [d(k)T H(x(k)) d(k)]
Example 14: Convergence example of the nonlinear Conjugate Gradient Method: (a) A
complicated function with many local minima and maxima. (b) Convergence path of
Fletcher-Reeves CG. Unlike linear CG, convergence does not occur in two steps. (c) Cross-section
of the surface corresponding to the first line search. (d) Convergence path of Polak-Ribiere CG.
- Then
x(k+1) = x(k) + ∆xnt
∇f (x∗ ) = 0
- The norm of the Newton step in the quadratic norm defined by H(x) is
called the Newton decrement
λ(x) = ∥∆xnt∥_H(x) = ( ∆xntT H(x) ∆xnt )^(1/2)
where
f̂(x + ∆xnt) = f(x) + ∇T f(x) ∆xnt + (1/2) ∆xntT H(x) ∆xnt
i.e., the second-order quadratic approximation of f(x) at x.
then
(1/2) ∇T f(x) H−1(x) ∇f(x) = (1/2) λ²(x)
- So, if λ²(x)/2 < ϵ, the algorithm can be terminated for some small ϵ.
- With the substitution of ∆xnt = −H−1 (x)∇f (x), the Newton decrement
can also be written as
λ(x(k)) = ( ∇T f(x(k)) H−1(x(k)) ∇f(x(k)) )^(1/2)
Newton’s Method
- Given a starting point x(0) ∈ dom f (x) and some small tolerance ϵ > 0
- repeat
- Compute the Newton step and Newton decrement
- The stepsize α(k) (i.e., line search) is required for the non-quadratic
initial parts of the algorithm. Otherwise, algorithm may not converge due
to large higher-order residuals.
- If we start with α(k) = 1 and keep it the same, then the algorithm is
called the pure Newton’s method.
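A minimal sketch of the pure Newton iteration with the Newton-decrement stopping rule (the test function is hypothetical; in the damped phase a line search on α would be added):

    import numpy as np

    def pure_newton(grad_f, hess_f, x0, eps=1e-8, max_iter=50):
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            g = grad_f(x)
            H = hess_f(x)
            dx_nt = -np.linalg.solve(H, g)    # Newton step without forming H^{-1} explicitly
            lam2 = -(g @ dx_nt)               # lambda^2(x) = grad' H^{-1} grad
            if lam2 / 2.0 <= eps:             # stopping criterion lambda^2/2 < eps
                break
            x = x + dx_nt                     # pure Newton update (alpha = 1)
        return x

    # Usage on the convex test function f(x) = x1^4 + x2^2
    grad_f = lambda x: np.array([4.0 * x[0]**3, 2.0 * x[1]])
    hess_f = lambda x: np.array([[12.0 * x[0]**2, 0.0], [0.0, 2.0]])
    print(pure_newton(grad_f, hess_f, [1.0, 1.0]))   # approaches (0, 0)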
- So, use (aI + H(x))−1 instead of H−1(x); this is also known as the Marquardt method. There always exists an a which will make the matrix (aI + H(x)) positive definite.
then
∇f˜(y) = TT ∇f (x)
H̃(y) = TT H(x)T
i.e.,
x + ∆xnt = T (y + ∆ynt ), ∀x
λ̃(y) = ( ∇T f(x) T (TT H(x) T)−1 TT ∇f(x) )^(1/2)
      = ( ∇T f(x) H−1(x) ∇f(x) )^(1/2)
      = λ(x)
Convergence Analysis
Convergence: There exist constants η ∈ (0, m²/L) and σ > 0 such that
- Damped Newton Phase
∥∇f (x)∥2 ≥ η
- α < 1 gives better solutions, so most iterations will require line
search, e.g. backtracking.
- As k increases, function value decreases by at least σ, but not
necessarily quadratic.
- This phase ends after at most (f(x(0)) − p∗)/σ iterations
(f(x(0)) − p∗)/σ + log2 log2 (ϵ0/ϵ)
- σ and ϵ0 are dependent on m, L and x(0) .
NA Summary
- Newton’s method scales well with problem size. Ignoring the computation
of the Hessian, its performance on problems in R10000 is similar to its
performance on problems in R10 , with only a modest increase in the
number of steps required.
- For relatively large scale problems, i.e. N is large, calculating the inverse
of the Hessian at each iteration can be costly. So, we may use, some
approximations of the Hessian
Hybrid GD + NA
- We know that the first (damped) phase of Newton's Algorithm (NA) is not very fast. Therefore, we can first run GD, which has considerably lower complexity, and after satisfying some conditions, switch to the NA.
- Hybrid Algorithm
- Start at x(0) ∈ dom f (x)
- repeat
- run GD (i.e., S(x(k) ) = I)
- until stopping criterion is satisfied
- Start at the final point of GD
- repeat
- run NA with exact H(x) (i.e., S(x(k) ) = H−1 (x(k) ))
- until stopping criterion is satisfied
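A compact sketch of this hybrid scheme (the switch criterion on the gradient norm and the test function are assumptions, not from the notes):

    import numpy as np

    def hybrid_minimize(f, grad_f, hess_f, x0, switch_tol=1e-1, tol=1e-8):
        x = np.asarray(x0, dtype=float)
        # Phase 1: gradient descent with backtracking, until the gradient is "small enough"
        while np.linalg.norm(grad_f(x)) > switch_tol:
            g = grad_f(x)
            alpha, d = 1.0, -g
            while f(x + alpha * d) > f(x) + 0.3 * alpha * (g @ d):
                alpha *= 0.5
            x = x + alpha * d
        # Phase 2: Newton iterations, until the Newton decrement is small
        while True:
            g = grad_f(x)
            dx_nt = -np.linalg.solve(hess_f(x), g)
            if -(g @ dx_nt) / 2.0 <= tol:
                return x
            x = x + dx_nt

    # Usage on f(x) = x1^2 + 2*x2^2 (hypothetical)
    f = lambda x: x[0]**2 + 2.0 * x[1]**2
    grad_f = lambda x: np.array([2.0 * x[0], 4.0 * x[1]])
    hess_f = lambda x: np.array([[2.0, 0.0], [0.0, 4.0]])
    print(hybrid_minimize(f, grad_f, hess_f, [3.0, -2.0]))   # approaches (0, 0)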