
Projected Gradient Algorithm

Andersen Ang
ECS, Uni. Southampton, UK
[email protected]
Homepage: angms.science

First draft: August 2, 2017
Version: July 13, 2023

Content
▶ Unconstrained vs constrained problem
▶ Problem setup
▶ Understanding the geometry of projection
▶ PGD is a special case of proximal gradient
▶ Theorem 1. An inequality of PGD with constant stepsize α:

      f( (1/(K+1)) Σ_{k=0}^K xk ) − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂²

▶ Theorem 2. PGD converges ergodically at rate O(1/√k) on Lipschitz functions:

      f(x̄K) − f* ≤ L∥x0 − x*∥ / √(K+1)
Unconstrained minimization:   min_{x∈Rⁿ} f(x).
▶ All x ∈ Rⁿ are feasible.
▶ Any x ∈ Rⁿ can be a solution.

Constrained minimization:   min_{x∈Q} f(x).
▶ Not all x ∈ Rⁿ are feasible.
▶ Not all x ∈ Rⁿ can be a solution: the solution has to be inside the set Q.
▶ An example:

      min_{x∈Rⁿ} ∥Ax − b∥₂²  s.t.  ∥x∥₂ ≤ 1

  can be expressed as

      min_{∥x∥₂≤1} ∥Ax − b∥₂².

  Here Q := {v ∈ Rⁿ : ∥v∥₂ ≤ 1} is known as the unit ℓ2 ball.

2 / 21
Approaches for solving constrained minimization problems

▶ Duality / Lagrangian approach


▶ Not our focus here.
▶ Although the Lagrange multiplier approach is usually taught in standard calculus classes, the standard
explanation (that the gradient of the objective has to be anti-parallel to the gradient of the constraint) is not
intuitive, and it is not the deep reason why the method works.
▶ It requires a deep understanding of convex conjugates, constraint qualifications and duality to appreciate the
Lagrangian approach, which is outside the scope here.

▶ First-order method / gradient-based method


▶ Simple.
▶ Our focus.

▶ Second-order method, Zero-order method, Higher-order method


▶ Not our focus here.

3 / 21
Solving the unconstrained problem min_{x∈Rⁿ} f(x) by gradient descent

▶ Gradient descent (GD) is a simple and intuitive way to solve the unconstrained optimization problem min_{x∈Rⁿ} f(x).

▶ Starting from an initial point x0 ∈ Rⁿ, GD iterates the following until a stopping condition is met:

      xk+1 = xk − αk ∇f(xk),

  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): the gradient stepsize

▶ Question: how about the constrained problem? Is it possible to adapt GD to the constrained setting?
  Answer: yes, and the key is the Euclidean projection operator proj : Rⁿ ⇒ Rⁿ.

▶ Remark
▶ We assume f is differentiable (i.e., ∇f exists).
▶ If f is not differentiable, we can replace gradient by subgradient, and we get the so-called subgradient method.
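The GD iteration above can be sketched in a few lines of Python (an illustrative sketch: the least-squares objective, the matrix A, the vector b, the stepsize, and the fixed iteration budget are all assumptions for the example, not from the slides):

```python
import numpy as np

# Gradient descent on f(x) = 0.5 * ||A x - b||^2 (example objective);
# its gradient is grad f(x) = A^T (A x - b).
A = np.array([[2.0, 0.0],
              [0.0, 1.0]])
b = np.array([1.0, 1.0])

def grad_f(x):
    return A.T @ (A @ x - b)

x = np.zeros(2)          # initial point x0
alpha = 0.1              # constant stepsize alpha_k = 0.1
for k in range(500):     # stopping condition: a fixed iteration budget
    x = x - alpha * grad_f(x)

# x approaches the unconstrained minimizer x* solving A x* = b, i.e. [0.5, 1.0]
```

Any stopping condition works in place of the fixed budget, e.g. stopping once ∥∇f(xk)∥ falls below a tolerance.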

4 / 21
Problem setup of constrained problem

      min_{x∈Q} f(x).

▶ We focus on the Euclidean space Rn

▶ f : Rn → R is the objective / cost function


▶ f is assumed to be continuously differentiable (f ∈ C¹), i.e., ∇f(x) exists for all x
▶ we assume f is globally L-Lipschitz: |f(x) − f(y)| ≤ L∥x − y∥
▶ we do not assume ∇f is globally Lipschitz, i.e., ∥∇f(x) − ∇f(y)∥ ≤ L∥x − y∥ is not assumed here

▶ ∅ ̸= Q ⊂ Rn is convex and compact


▶ The constraint is represented by a set Q
▶ Q ⊂ Rⁿ means Q is a subset of Rⁿ, the domain of f
▶ Q ≠ ∅ means Q is not an empty set (if Q is empty there is nothing useful to discuss)
▶ Q is a convex set: x ∈ Q, y ∈ Q, λ ∈ (0, 1) =⇒ λx + (1 − λ)y ∈ Q
▶ Q is compact (compact = bounded + closed)

▶ For details on convexity and Lipschitz continuity, see here.

5 / 21
Solving the constrained problem by projected gradient descent
▶ Projected gradient descent (PGD) = GD + projection

▶ Starting from an initial point x0 ∈ Q, PGD iterates the following equation until a stopping condition is met:

      xk+1 = PQ( xk − αk ∇f(xk) ),

  where
  ▶ k ∈ N: the current iteration counter
  ▶ k + 1 ∈ N: the next iteration counter
  ▶ xk: the current variable
  ▶ xk+1: the next variable
  ▶ ∇f: the gradient of f with respect to x
  ▶ ∇f(xk): the gradient ∇f evaluated at the current variable xk
  ▶ αk ∈ (0, +∞): the gradient stepsize
  ▶ PQ: shorthand for projQ

▶ projQ( · ) is called the Euclidean projection operator, and it is itself an optimization problem:

      PQ(x0) = projQ(x0) = argmin_{x∈Q} ∥x − x0∥₂.   (∗)

  i.e., given a point x0, PQ finds a point x ∈ Q which is "closest" to x0.
  ▶ The measure of "closeness" here is the Euclidean distance ∥x − x0∥₂.

▶ (∗) is equivalent to

      argmin_{x∈Q} ½∥x − x0∥₂²,

  where we square the cost so that the objective becomes differentiable.
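A minimal PGD sketch for the earlier example min ∥Ax − b∥₂² s.t. ∥x∥₂ ≤ 1, where the projection onto the unit ℓ2 ball has the closed form PQ(x) = x / max(1, ∥x∥₂); the particular A, b, stepsize, and iteration count below are assumptions for the illustration:

```python
import numpy as np

def proj_unit_ball(x):
    # Euclidean projection onto Q = {v : ||v||_2 <= 1},
    # closed form: x / max(1, ||x||_2)
    return x / max(1.0, np.linalg.norm(x))

A = np.array([[3.0, 0.0],
              [0.0, 1.0]])
b = np.array([3.0, 2.0])   # the unconstrained minimizer [1, 2] lies outside Q

def grad_f(x):
    return 2.0 * A.T @ (A @ x - b)   # gradient of ||A x - b||_2^2

x = np.zeros(2)                      # initial point x0 in Q
alpha = 0.05                         # constant stepsize
for k in range(2000):
    y = x - alpha * grad_f(x)        # gradient update
    x = proj_unit_ball(y)            # project back onto Q

# since the unconstrained minimizer is infeasible, the PGD iterate
# settles on the boundary of Q, where ||x||_2 = 1
```

The only change from plain GD is the one-line projection after each gradient update.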
6 / 21
Comparing PGD to GD

GD
1. Pick an initial point x0 ∈ Rⁿ
2. Loop until a stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: xk+1 = xk − αk ∇f(xk)

PGD
1. Pick an initial point x0 ∈ Q
2. Loop until a stopping condition is met:
   2.1 Descent direction: compute −∇f(xk)
   2.2 Stepsize: pick an αk
   2.3 Update: yk+1 = xk − αk ∇f(xk)
   2.4 Projection: xk+1 = argmin_{x∈Q} ½∥x − yk+1∥₂²

▶ PGD = GD + projection.
  ▶ If the point xk − αk ∇f(xk) after the gradient update is leaving the set Q, project it back.
  ▶ If the point xk − αk ∇f(xk) after the gradient update is within the set Q, keep the point and do nothing.

▶ Projection PQ( · ) : Rⁿ ⇒ Rⁿ
  ▶ It is a mapping from Rⁿ to Rⁿ, i.e., a point-to-point mapping.
  ▶ In general, for a nonconvex set Q, such a mapping is possibly non-unique (this is the ⇒).
  ▶ PQ( · ) is an optimization problem:

        PQ(x0) := argmin_{x∈Q} ½∥x − x0∥₂².   (⋆)

    If Q is a convex compact set, this optimization problem has a unique solution, and we have PQ( · ) : Rⁿ → Rⁿ.

▶ PGD is economical if (⋆) is easy to solve, i.e., has a closed-form expression that is cheap to compute.
▶ PGD is possibly not economical if Q is nonconvex, (⋆) has no closed-form expression, or (⋆) is expensive to compute.
7 / 21
Understanding the geometry of projection ... (1/4)
Consider a convex set Q ⊂ Rⁿ and a point x0 ∈ Rⁿ.

Case 1. x0 ∈ Q.
▶ As x0 ∈ Q, the closest point to x0 in Q is x0 itself.
▶ The distance from a point to itself is zero.
▶ Mathematically: ∥x − x0∥₂ = 0 gives x = x0.
▶ This is the trivial case and therefore not interesting.

Case 2. x0 ∉ Q.
▶ Now x0 is outside Q.
▶ We need to find a point x such that
  ▶ x ∈ Q
  ▶ ∥x − x0∥₂ is smallest
▶ This is the case that is interesting.

(Figure: the set Q with x0 inside it in Case 1 and outside it in Case 2.)
8 / 21
Understanding the geometry of projection ... (2/4)
▶ The circles are ℓ2-norm balls centered at x0 with different radii.

▶ Points on the same circle are equidistant to x0 (with a different ℓ2 distance on each circle).

▶ Note that some points on the blue circle are inside Q; those are feasible points.

(Figure: the set Q, the point x0 outside Q, and concentric ℓ2 balls around x0.)

9 / 21
Understanding the geometry of projection ... (3/4)
▶ The point inside Q which is closest to x0 is the point where the ℓ2-norm ball "touches" Q.

▶ In this example, the blue point y is the solution to

      PQ(x0) = argmin_{x∈Q} ½∥x − x0∥₂².

(Figure: the set Q, the point x0 outside Q, and the projection y on the boundary of Q.)

▶ In fact, such a point is always located on the boundary of Q when x0 ∉ Q.
  That is, mathematically, if x0 ∉ Q, then

      argmin_{x∈Q} ½∥x − x0∥₂² ∈ bdry Q.

10 / 21
Understanding the geometry of projection ... (4/4)
Note that the projection is orthogonal: the blue point y always lies on a straight line that is tangent to both the
norm ball and Q.

(Figure: the tangent line at y, with x0 outside Q.)

The normal to the tangent is exactly x0 − y = x0 − projQ(x0).
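For the unit ℓ2 ball this orthogonality can be checked directly: if x0 is outside the ball, projQ(x0) = x0/∥x0∥₂ lies on the boundary, and the normal x0 − projQ(x0) is parallel to projQ(x0), which is the outward normal direction at that boundary point (a small sketch; the specific test point is an assumption):

```python
import numpy as np

def proj_unit_ball(x):
    # closed-form projection onto the unit l2 ball
    return x / max(1.0, np.linalg.norm(x))

x0 = np.array([2.0, 1.0])        # a point outside the unit ball
y = proj_unit_ball(x0)

# y is on the boundary of Q ...
assert np.isclose(np.linalg.norm(y), 1.0)

# ... and x0 - y is parallel to the outward normal at y (which, for the
# l2 ball, is y itself): the 2-D cross product vanishes
cross = (x0 - y)[0] * y[1] - (x0 - y)[1] * y[0]
assert abs(cross) < 1e-12
```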

11 / 21
Property of projection: Bourbaki-Cheney-Goldstein inequality

See the details here

12 / 21
PGD is a special case of proximal gradient

▶ The indicator function ιQ(x) of a set Q is defined as follows:

      ιQ(x) = 0 if x ∈ Q,   +∞ if x ∉ Q.

▶ With the indicator function, the constrained problem has two equivalent expressions:

      min_{x∈Q} f(x)  ≡  min_x f(x) + ιQ(x).

▶ Proximal gradient is a method to solve an optimization problem whose objective is the sum of a differentiable and a
  non-differentiable function:

      min_x f(x) + g(x),

  where g is non-differentiable.

▶ PGD is in fact the special case of proximal gradient where g(x) is the indicator function ιQ(x). See here for more
  about proximal gradient.
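This connection can be verified numerically: the proximal operator of g = ιQ, prox_g(v) = argmin_x { ιQ(x) + ½∥x − v∥₂² }, is exactly the projection onto Q, since any infeasible x incurs infinite cost. A sketch with Q the unit ℓ2 ball (the set and the test point are assumptions for the illustration):

```python
import numpy as np

def proj_unit_ball(v):
    # closed-form Euclidean projection onto Q = {x : ||x||_2 <= 1}
    return v / max(1.0, np.linalg.norm(v))

def prox_indicator(v):
    # prox of g = iota_Q: the argmin of iota_Q(x) + 0.5*||x - v||^2 must
    # be feasible (infinite cost otherwise), hence it is the closest
    # feasible point, i.e. the projection onto Q.
    return proj_unit_ball(v)

v = np.array([3.0, 4.0])             # ||v||_2 = 5, so v is outside Q
p = prox_indicator(v)                # = v / 5 = [0.6, 0.8]

# sanity check: p beats random feasible points on 0.5*||x - v||^2
rng = np.random.default_rng(1)
for _ in range(1000):
    z = proj_unit_ball(rng.standard_normal(2))   # a random point in Q
    assert 0.5 * np.sum((p - v) ** 2) <= 0.5 * np.sum((z - v) ** 2) + 1e-12
```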

13 / 21
On PGD ergodic convergence rate
▶ Theorem 1. If f is convex, PGD with constant stepsize α satisfies

      f( (1/(K+1)) Σ_{k=0}^K xk ) − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂²

  where
  ▶ x* is the (global) minimizer
  ▶ f* := f(x*) is the optimal cost value
  ▶ α is the constant stepsize
  ▶ K is the total number of iterations performed

▶ Interpretation:
  ▶ the term (1/(K+1)) Σ_{k=0}^K xk is the "average" of the sequence xk after K iterations
  ▶ denote (1/(K+1)) Σ_{k=0}^K xk as x̄
  ▶ denote f(x̄) as f̄
  Then the theorem reads:

      f̄ − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + something positive.

  Hence the convergence rate is like O(1/K).

▶ The term (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂² converges to zero
  ▶ as long as Σ_{k=0}^K ∥∇f(xk)∥₂² is not diverging to infinity, or
  ▶ the growth of Σ_{k=0}^K ∥∇f(xk)∥₂² is slower than K.
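Theorem 1 can be sanity-checked numerically on a toy problem where everything is known in closed form: f(x) = ½∥x∥₂² with Q the unit ℓ2 ball, so x* = 0 ∈ Q and f* = 0 (the dimension, stepsize, and iteration count are assumptions for this experiment):

```python
import numpy as np

def proj_unit_ball(x):
    return x / max(1.0, np.linalg.norm(x))

def f(x): return 0.5 * (x @ x)       # convex, minimized at x* = 0 in Q
def grad(x): return x                # gradient of f

rng = np.random.default_rng(0)
x = proj_unit_ball(rng.standard_normal(5))   # x0 in Q
x0 = x.copy()
alpha, K = 0.1, 50

xs = [x.copy()]
g2 = [grad(x) @ grad(x)]
for k in range(K):                   # produce x1, ..., xK
    x = proj_unit_ball(x - alpha * grad(x))
    xs.append(x.copy())
    g2.append(grad(x) @ grad(x))

x_bar = np.mean(xs, axis=0)          # ergodic average of x0, ..., xK
lhs = f(x_bar) - 0.0                 # f(x_bar) - f*
rhs = (x0 @ x0) / (2 * alpha * (K + 1)) + alpha * sum(g2) / (2 * (K + 1))
assert lhs <= rhs                    # the inequality of Theorem 1 holds
```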

14 / 21
What is ergodic convergence?

▶ Ergodic convergence = “The centroid of a point cloud moving towards the limit point”

▶ Sequence convergence: each of x1, x2, ..., xk gets closer and closer to x*

▶ Ergodic convergence: the average of x1, x2, ..., xk converges to x*

  ▶ which doesn't imply each of x1, x2, ..., xk is getting closer and closer to x*
  ▶ some of them can be moving away from x*, as long as the centroid is getting closer

15 / 21
Proof of theorem 1 ... (1/3)

As f is convex,

      f(z) ≥ f(x) + ⟨∇f(x), z − x⟩
⇐⇒  f(x) − f(z) ≤ ⟨∇f(x), x − z⟩
=⇒  f(xk) − f* ≤ ⟨∇f(xk), xk − x*⟩        (x = xk, z = x*, f(x*) = f*)

By PGD, yk+1 = xk − αk ∇f(xk), so xk − yk+1 = αk ∇f(xk). With constant stepsize αk = α,

      f(xk) − f* ≤ ⟨ (xk − yk+1)/α , xk − x* ⟩ = (1/α) ⟨xk − yk+1, xk − x*⟩.

A not-so-trivial trick:

      (a − b)(a − c) = a² − ac − ab + bc
                     = (2a² − 2ac − 2ab + 2bc)/2
                     = ((a² − 2ac + c²) + (a² − 2ab + b²) − (b² − 2bc + c²))/2
                     = ((a − c)² + (a − b)² − (b − c)²)/2.

In inner-product form, with a = xk, b = yk+1, c = x*:

      ⟨xk − yk+1, xk − x*⟩ = ( ∥xk − x*∥₂² + ∥xk − yk+1∥₂² − ∥yk+1 − x*∥₂² ) / 2.

Combining the two boxes,

      f(xk) − f* ≤ ( ∥xk − x*∥₂² + ∥xk − yk+1∥₂² − ∥yk+1 − x*∥₂² ) / (2α).

Since xk − yk+1 = α∇f(xk), we have ∥xk − yk+1∥₂² = ∥α∇f(xk)∥₂² = α²∥∇f(xk)∥₂², so

      f(xk) − f* ≤ ( ∥xk − x*∥₂² − ∥yk+1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(xk)∥₂².
16 / 21
Proof of theorem 1 ... (2/3)

Now we have

      f(xk) − f* ≤ ( ∥xk − x*∥₂² − ∥yk+1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(xk)∥₂².

Next we make use of the fact that the projection is non-expansive:

      ∥yk+1 − x*∥₂² ≥ ∥xk+1 − x*∥₂²,   where xk+1 = projQ(yk+1).

▶ This is known as "the projection operator is non-expansive".
▶ The post-projection distance is at most the pre-projection distance.
▶ This follows from the Bourbaki-Cheney-Goldstein inequality; details here.

Pictorially: xk is the current variable, yk+1 is the gradient-updated point, xk+1 = PQ(yk+1) is the projected point,
and x* ∈ Q.

Focus on the term ∥xk − x*∥₂² − ∥yk+1 − x*∥₂². We wish to replace ∥yk+1 − x*∥₂² by ∥xk+1 − x*∥₂², using the fact
that the projection operator is non-expansive. Hence −∥yk+1 − x*∥₂² ≤ −∥xk+1 − x*∥₂² and

      f(xk) − f* ≤ ( ∥xk − x*∥₂² − ∥yk+1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(xk)∥₂²
                 ≤ ( ∥xk − x*∥₂² − ∥xk+1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(xk)∥₂².

It forms a telescoping series!
17 / 21
Proof of theorem 1 ... (3/3)

Write the inequality for each k:

k = 0:  f(x0) − f* ≤ ( ∥x0 − x*∥₂² − ∥x1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(x0)∥₂²
k = 1:  f(x1) − f* ≤ ( ∥x1 − x*∥₂² − ∥x2 − x*∥₂² ) / (2α) + (α/2) ∥∇f(x1)∥₂²
        ...
k = K:  f(xK) − f* ≤ ( ∥xK − x*∥₂² − ∥xK+1 − x*∥₂² ) / (2α) + (α/2) ∥∇f(xK)∥₂²

Summing all of them, the middle terms telescope:

      Σ_{k=0}^K ( f(xk) − f* ) ≤ ( ∥x0 − x*∥₂² − ∥xK+1 − x*∥₂² ) / (2α) + (α/2) Σ_{k=0}^K ∥∇f(xk)∥₂².

As 0 ≤ (1/(2α)) ∥xK+1 − x*∥₂²,

      Σ_{k=0}^K ( f(xk) − f* ) ≤ ∥x0 − x*∥₂² / (2α) + (α/2) Σ_{k=0}^K ∥∇f(xk)∥₂².

Divide the whole inequality by K + 1:

      (1/(K+1)) Σ_{k=0}^K f(xk) − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂².

On the left-hand side, as f is convex, Jensen's inequality gives

      f( (1/(K+1)) Σ_{k=0}^K xk ) ≤ (1/(K+1)) Σ_{k=0}^K f(xk).

Therefore

      f( (1/(K+1)) Σ_{k=0}^K xk ) − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂².   ∎
18 / 21
 
PGD converges ergodically at rate O(1/√k) on Lipschitz functions

Theorem 2. If f is L-Lipschitz, then for the point x̄K = (1/(K+1)) Σ_{k=0}^K xk and constant stepsize
α = ∥x0 − x*∥ / (L√(K+1)), we have

      f(x̄K) − f* ≤ L∥x0 − x*∥ / √(K+1).

Proof
▶ f is L-Lipschitz means ∇f is bounded: ∥∇f∥ ≤ L, where L is the Lipschitz constant.
▶ Put x̄K, α, and ∥∇f∥ ≤ L into theorem 1.

Remarks
▶ In the stepsize α, note that it is K (the total number of steps), not k (the current iteration number).
▶ α requires knowing x*, so this theorem is practically useless: knowing x* already solves the problem.
▶ Although we do not know x* in general, the theorem tells us that the ergodic convergence speed of PGD is O(1/√k).
19 / 21
Discussion
In the convergence analysis of GD:
1. f is convex and β-smooth (the gradient is β-Lipschitz)
2. Convergence rate O(1/k).
3. The convergence rate is not ergodic

In the convergence analysis of PGD:
1. f is convex and L-Lipschitz (the gradient is bounded above)
2. Convergence rate O(1/√k).
3. The convergence rate is ergodic: it works on x̄K

If f is convex and β-smooth, the convergence of PGD is the same as that of GD.
▶ The theoretical convergence rate of PGD on convex and β-smooth f is also O(1/k).
▶ However, in practice it depends on the complexity of the projection: some Q are difficult to project onto.

As PGD is a special case of the proximal gradient method, it is better to study the proximal gradient method. For
example here, here and here.
20 / 21
Last page - summary

▶ PGD = GD + projection

▶ PGD with constant stepsize α:

      f( (1/(K+1)) Σ_{k=0}^K xk ) − f* ≤ ∥x0 − x*∥₂² / (2α(K+1)) + (α/(2(K+1))) Σ_{k=0}^K ∥∇f(xk)∥₂²

▶ If f is Lipschitz (bounded gradient), then for the point x̄K = (1/(K+1)) Σ_{k=0}^K xk and constant stepsize
  α = ∥x0 − x*∥ / (L√(K+1)),

      f(x̄K) − f* ≤ L∥x0 − x*∥ / √(K+1).
End of document

21 / 21
