

6. Numerical optimization for inverse problems

Contents
6.1. Smooth optimization
6.2. Convex optimization
6.3. Exercises
6.4. Assignments

In this chapter we treat numerical algorithms for solving optimization problems over R^n. Throughout we will assume that the objective
J(u) = D(u) + R(u) satisfies the conditions for a unique minimizer to exist. We distinguish between two important classes of problems:
smooth problems and convex problems.

6.1. Smooth optimization


For smooth problems, we assume we have access to as many derivatives of J as we need. As before, we denote the first derivative (or
gradient) by J′ : R^n → R^n. We denote the second derivative (or Hessian) by J″ : R^n → R^{n×n}. We will additionally assume that the
Hessian is globally bounded, i.e. there exists a constant L < ∞ such that −L·I ⪯ J″(u) ⪯ L·I for all u ∈ R^n. Note that this implies
that J′ is Lipschitz continuous with constant L: ∥J′(u) − J′(v)∥_2 ≤ L∥u − v∥_2.

For a comprehensive treatment of this topic (and many more), we recommend the seminal book Numerical Optimization by Jorge Nocedal
and Stephen Wright [Nocedal and Wright, 2006].

Before discussing optimization methods, we first introduce the optimality conditions.

 Definition: Optimality conditions

Given a smooth functional J : R^n → R, a point u∗ ∈ R^n is a local minimizer only if it satisfies the first and second order optimality
conditions

J ′ (u ∗ ) = 0, J ′′ (u ∗ ) ⪰ 0.

If J ′′ (u ∗ ) ≻ 0 we call u ∗ a strict local minimizer.

6.1.1. Gradient descent


The steepest descent method finds a minimizer through the fixed-point iteration

u_{k+1} = (I − λJ′)(u_k) = u_k − λJ′(u_k),

where λ > 0 is the step size. The following theorem states that this iteration yields a stationary point of J, regardless of the initial iterate,
provided that we pick λ small enough.


 Theorem: Global convergence of steepest descent

Let J : R^n → R be a smooth, Lipschitz-continuous functional. The fixed point iteration

u k+1 = (I − λJ ′ )(u k ), (6.1)

with λ ∈ (0, (2L) −1 ) produces iterates u k for which

min_{k∈{0,1,…,n−1}} ∥J′(u_k)∥_2^2 ≤ (J(u_0) − J∗)/(C n),

with C = λ(1 − λL/2) and J∗ = min_u J(u). This implies that ∥J′(u_k)∥_2 → 0 as k → ∞. To guarantee
min_{k∈{0,1,…,n−1}} ∥J′(u_k)∥_2^2 ≤ ϵ we thus need O(1/ϵ) iterations.

 Proof

Stronger statements about the rate of convergence can be made by making additional assumptions on J (such as (strong) convexity), but
this is left as an exercise.
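
To make the iteration concrete, here is a minimal NumPy sketch of gradient descent with a fixed step size; the quadratic test objective and the stopping tolerance are illustrative choices, not part of the lecture notes.

import numpy as np

def gradient_descent(Jp, u0, lmbda, maxiter=1000, tol=1e-6):
    """Fixed-point iteration u_{k+1} = u_k - lmbda*J'(u_k)."""
    u = u0.copy()
    for k in range(maxiter):
        g = Jp(u)
        if np.linalg.norm(g) <= tol:   # stop once the gradient is small
            break
        u = u - lmbda*g
    return u

# illustrative example: J(u) = 0.5*||A u - b||^2, so J'(u) = A^T(A u - b) and L = ||A^T A||_2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
Jp = lambda u: A.T @ (A @ u - b)
L = np.linalg.norm(A.T @ A, 2)
u_min = gradient_descent(Jp, np.zeros(2), lmbda=1/(2*L))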

6.1.2. Line search


While the previous results are nice in theory, we usually do not have access to the Lipschitz constant L. Moreover, the global bound on the
step size provided by the Lipschitz constant may be pessimistic for a particular starting point. This could lead us to pick a very small step
size, yielding slow convergence in practice. A popular way of choosing a step size adaptively is a line search strategy. To introduce these, we
slightly broaden the scope and consider the iteration

u k+1 = u k + λ k d k ,

where d k is a descent direction satisfying ⟨d k , J ′ (u k )⟩ < 0. Obviously, d k = −J ′ (u k ) is a descent direction, but other choices may be
beneficial in practice. In particular, we can choose d k = −BJ ′ (u k ) for any positive-definite matrix B to obtain a descent direction. How to
choose such a matrix will be discussed in the next section.

Two important line search methods are discussed below.

 Definition: Backtracking line search

In order to ensure sufficient progress of the iterations, we can choose a steplength that guarantees sufficient descent:

J(u k + λd k ) ≤ J(u k ) + c 1 λ⟨d k , J ′ (u k )⟩, (6.2)

with c 1 ∈ (0, 1) a small constant (typically c 1 = 10 −4 ). Existence of a λ satisfying these conditions is guaranteed by the regularity
of J . We can find a suitable λ by backtracking:

def backtracking(J, Jp, u, d, lmbda, rho=0.5, c1=1e-4):
    """
    Backtracking line search to find a step size satisfying
    J(u + lmbda*d) <= J(u) + c1*lmbda*J'(u)^T d

    Input:
        J      - function object returning the value of J at a given input vector
        Jp     - function object returning the gradient of J at a given input vector
        u      - current iterate as array of length n
        d      - descent direction as array of length n
        lmbda  - initial step size
        rho,c1 - backtracking parameters, default (0.5,1e-4)

    Output:
        lmbda - step size satisfying the sufficient decrease condition
    """
    Ju = J(u)               # objective value at the current iterate (computed once)
    slope = Jp(u).dot(d)    # directional derivative <J'(u), d>
    while J(u + lmbda*d) > Ju + c1*lmbda*slope:
        lmbda *= rho
    return lmbda


 Definition: Wolfe linesearch

A possible disadvantage of the backtracking linesearch introduced earlier is that it may end up choosing very small stepsizes. To
obtain a stepsize that yields a new iterate at which the slope of J is not too large, we introduce the following condition

|⟨J ′ (u k + λd k ), d k ⟩| ≤ c 2 |⟨J ′ (u k ), d k ⟩|, (6.3)

where c_2 is a constant satisfying 0 < c_1 < c_2 < 1. The conditions (6.2) and (6.3) are referred to as the strong Wolfe
conditions. Existence of a stepsize satisfying these conditions is again guaranteed by the regularity of J (cf. [Nocedal and Wright,
2006], lemma 3.1). Finding such a λ is a little more involved than the backtracking procedure outlined above (cf. [Nocedal and
Wright, 2006], algorithm 3.5). Luckily, the SciPy library provides an implementation of this algorithm (cf.
scipy.optimize.line_search).
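
A small usage sketch of scipy.optimize.line_search; the quadratic objective below is a made-up illustration, not taken from the notes.

import numpy as np
from scipy.optimize import line_search

# hypothetical smooth objective J(u) = 0.5*||u||^2 with gradient J'(u) = u
J  = lambda u: 0.5*np.dot(u, u)
Jp = lambda u: u

u = np.array([2.0, -3.0])
d = -Jp(u)                      # steepest-descent direction

# line_search returns (lmbda, n_J_evals, n_Jp_evals, J_new, J_old, slope_new);
# lmbda is None if no step satisfying the strong Wolfe conditions was found
lmbda, *_ = line_search(J, Jp, u, d, c1=1e-4, c2=0.9)
if lmbda is not None:
    u_new = u + lmbda*d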

6.1.3. Second order methods


A well-known method for root finding is Newton’s method, which finds a root for which J ′ (u) = 0 via the fixed point iteration

u k+1 = u k − J ′′ (u k ) −1 J ′ (u k ). (6.4)

We can interpret this method as finding the new iterate u k+1 as the (unique) minimizer of the quadratic approximation of J around u k :

J(u) ≈ J(u_k) + ⟨J′(u_k), u − u_k⟩ + ½⟨u − u_k, J″(u_k)(u − u_k)⟩.

 Theorem: Convergence of Newton’s method

Let J be a smooth functional and u ∗ be a (local) minimizer. For any u 0 sufficiently close to u ∗ , the iteration (6.4) converges
quadratically to u ∗ , i.e.,

∥u k+1 − u ∗ ∥ 2 ≤ M∥u k − u ∗ ∥ 22 ,

with M = 2∥J ′′′ (u ∗ )∥ 2 ∥J ′′ (u ∗ ) −1 ∥ 2 .

 Proof

In practice, the Hessian may not be invertible everywhere and we may not have an initial iterate sufficiently close to a minimizer to ensure
convergence. Practical applications therefore include a line search and a safeguard against non-invertible Hessians.
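
A sketch of such a safeguarded Newton iteration, reusing the backtracking routine defined above; the small diagonal shift used when the Hessian solve fails and the gradient fallback are our own illustrative safeguards.

import numpy as np

def newton(J, Jp, Jpp, u0, maxiter=50, tol=1e-8):
    """Newton's method (6.4) with a backtracking line search and simple safeguards."""
    u = u0.copy()
    for k in range(maxiter):
        g = Jp(u)
        if np.linalg.norm(g) <= tol:
            break
        H = Jpp(u)
        try:
            d = -np.linalg.solve(H, g)                            # Newton direction
        except np.linalg.LinAlgError:
            d = -np.linalg.solve(H + 1e-6*np.eye(len(u)), g)      # regularised fallback
        if g.dot(d) >= 0:      # not a descent direction: fall back to steepest descent
            d = -g
        lmbda = backtracking(J, Jp, u, d, 1.0)                    # line search from above
        u = u + lmbda*d
    return u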

In some applications, it may be difficult to compute and invert the Hessian. This problem is addressed by so-called quasi-Newton methods
which approximate the Hessian. The basis for such approximations is the secant relation

H k (u k+1 − u k ) = (J ′ (u k+1 ) − J ′ (u k )),

which is satisfied by the true Hessian J ′′ at a point η k = u k + t(u k+1 − u k ) for some t ∈ (0, 1). Obviously, we cannot hope to solve for
H k ∈ R n×n from just these n equations. We can, however, impose some structural assumptions on the Hessian. Assuming a simple
diagonal structure H k = h k I yields h k = ⟨J ′ (u k+1 ) − J ′ (u k ), u k+1 − u k ⟩/∥u k+1 − u k ∥ 22 . In fact, even gradient-descent can be
interpreted in this manner by approximating J ′′ (u k ) ≈ L ⋅ I .

An often-used approximation is the Broyden–Fletcher–Goldfarb–Shanno (BFGS) approximation, which keeps track of the steps
s_k = u_{k+1} − u_k and gradients y_k = J′(u_{k+1}) − J′(u_k) to recursively construct an approximation of the inverse of the Hessian as

B_{k+1} = (I − ρ_k s_k y_k^T) B_k (I − ρ_k y_k s_k^T) + ρ_k s_k s_k^T,

with ρ_k = (⟨s_k, y_k⟩)^{−1} and B_0 chosen appropriately (e.g., B_0 = L^{−1}·I). It can be shown that this approximation is sufficiently accurate to
yield superlinear convergence when using a Wolfe line search.
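
In code, a single BFGS update of the inverse-Hessian approximation could look as follows (a sketch; skipping the update when ⟨s_k, y_k⟩ is tiny is a common practical safeguard, not something the notes prescribe).

import numpy as np

def bfgs_update(B, s, y, eps=1e-12):
    """One BFGS update of the inverse-Hessian approximation B, using the step
    s = u_{k+1} - u_k and gradient difference y = J'(u_{k+1}) - J'(u_k)."""
    sy = s.dot(y)
    if sy <= eps:              # unreliable curvature information: keep the old B
        return B
    rho = 1.0/sy
    I = np.eye(len(s))
    return (I - rho*np.outer(s, y)) @ B @ (I - rho*np.outer(y, s)) + rho*np.outer(s, s)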

There are many practical aspects to implementing such methods. For example, what do we do when the approximated Hessian becomes
(almost) singular? Discussing these issues is beyond the scope of these lecture notes and we refer to [Nocedal and Wright, 2006], chapter 6
for more details. The SciPy library provides implementations of various optimization methods.
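
For instance, a quasi-Newton solve with SciPy's built-in BFGS implementation might look like this; the quadratic test problem is purely illustrative.

import numpy as np
from scipy.optimize import minimize

# illustrative smooth objective J(u) = 0.5*||A u - b||^2
A = np.array([[3.0, 1.0], [1.0, 2.0]])
b = np.array([1.0, -1.0])
J  = lambda u: 0.5*np.linalg.norm(A @ u - b)**2
Jp = lambda u: A.T @ (A @ u - b)

result = minimize(J, np.zeros(2), jac=Jp, method='BFGS')
print(result.x, result.nit)    # minimizer and number of iterations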


6.2. Convex optimization


In this section, we consider finding a minimizer of a convex functional J : R n → R ∞ . Note that we allow the functionals to take values on
the extended real line. We accordingly define the domain of J as dom(J) = {u ∈ R n | J(u) < ∞}.

To deal with convex functionals that are not smooth, we first generalize the notion of a derivative.

 Definition: subgradient

Given a convex functional J , we call g ∈ R n a subgradient of J at u if

J(v) ≥ J(u) + ⟨g, v − u⟩ ∀v ∈ R n .

This definition is reminiscent of the Taylor expansion, and we can indeed easily check that it holds for smooth convex functionals
with g = J′(u). For non-smooth functionals there may be multiple vectors g satisfying the inequality. We call the set of all such
vectors the subdifferential, which we denote by ∂J(u). We will generally denote an arbitrary element of ∂J(u) by J′(u).

 Example: Subdifferentials of some functions

Let

J 1 (u) = |u|,
J 2 (u) = δ [0,1] (u),
J 3 (u) = max{u, 0}.

All these functions are convex and exhibit a discontinuity in the derivative at u = 0. The subdifferentials at u = 0 are given by

∂J 1 (0) = [−1, 1]
∂J 2 (0) = (−∞, 0]
∂J 3 (0) = [0, 1]

Fig. 6.1 Examples of several convex functions and an element of their subdifferential at u = 0.


Some useful calculus rules for subgradients are listed below.

 Theorem: Computing subgradients

Let J_i : R^n → R_∞ be proper convex functionals and let A ∈ R^{n×n}, b ∈ R^n. We then have the following useful rules:

Summation: Let J = J_1 + J_2, then J_1′(u) + J_2′(u) ∈ ∂J(u) for u in the interior of dom(J).
Affine transformation: Let J(u) = J_1(Au + b), then A^T J_1′(Au + b) ∈ ∂J(u) for u, Au + b in the interior of dom(J).

An overview of other useful relations can be found in e.g., [Beck, 2017] section 3.8.

With this we can now formulate optimality conditions for convex optimization.

 Definition: Optimality conditions for convex optimization

Let J : R^n → R_∞ be a proper convex functional. A point u∗ ∈ R^n is a minimizer iff

0 ∈ ∂J(u∗).

 Example: Computing the median

The median u of a set of numbers (f 1 , f 2 , … , f n ) (with f i < f i+1 ) is a minimizer of

J(u) = ∑_{i=1}^n |u − f_i|.

Introducing J_i(u) = |u − f_i| we have

J_i′(u) =
    −1         u < f_i
    [−1, 1]    u = f_i
    1          u > f_i,

with which we can compute J′(u) using the sum rule:

J′(u) =
    −n                       u < f_1
    2i − n                   u ∈ (f_i, f_{i+1})
    2i − 1 − n + [−1, 1]     u = f_i
    n                        u > f_n.

To find a u for which 0 ∈ J ′ (u) we need to consider the middle two cases. If n is even, we can find an i such that 2i = n and get
that for all u ∈ [f n/2 , f n/2+1 ] we have 0 ∈ J ′ (u). When n is odd, we have optimality only for u = f (n+1)/2 .


Fig. 6.2 Subgradient of J for f = (1, 2, 3, 4) and f = (1, 2, 3, 4, 5).


6.2.1. Subgradient descent


A natural extension of the gradient-descent method for smooth problems is the subgradient descent method:

u k+1 = u k − λ k J ′ (u k ), J ′ (u k ) ∈ ∂J(u k ), (6.5)

where λ k denote the step sizes.

 Theorem: Convergence of subgradient descent

Let J : R^n → R be a convex, L-Lipschitz-continuous function. The iteration (6.5) produces iterates for which

min_{k∈{0,1,…,n−1}} J(u_k) − J(u∗) ≤ (∥u_0 − u∗∥_2^2 + L^2 ∑_{k=0}^{n−1} λ_k^2) / (2 ∑_{k=0}^{n−1} λ_k).

Thus, J(u_k) → J(u∗) as k → ∞ when the step sizes satisfy

∑_{k=0}^∞ λ_k = ∞,   ∑_{k=0}^∞ λ_k^2 < ∞.

 Proof

 Remark: Convergence rate for a fixed stepsize

If we choose λ_k = λ, we get

min_{k∈{0,1,…,n−1}} J(u_k) − J(u∗) ≤ (∥u_0 − u∗∥_2^2 + L^2 λ^2 n) / (2λn).

We can guarantee that min_{k∈{0,1,…,n−1}} J(u_k) − J(u∗) ≤ ϵ by picking stepsize λ = ϵ/L^2 and doing n = (∥u_0 − u∗∥_2 L/ϵ)^2
iterations. However, for smooth convex functions we can derive a stronger result that gradient-descent requires only O(1/ϵ) iterations
(use exercise 6.3.2, the Lipschitz property, and the subgradient inequality). For smooth, strongly convex functionals we can
strengthen the result even further and show that we only need O(log 1/ϵ) iterations (see exercise 6.3.1). The proofs are left as an
exercise.
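
A minimal sketch of iteration (6.5), applied here to the median objective from the earlier example; the diminishing step sizes λ_k = λ_0/(k+1) and the test data are our own illustrative choices.

import numpy as np

def subgradient_descent(Jp, u0, lmbda0, maxiter=1000):
    """Subgradient descent u_{k+1} = u_k - lmbda_k*J'(u_k) with lmbda_k = lmbda0/(k+1)."""
    u = u0
    for k in range(maxiter):
        u = u - (lmbda0/(k + 1))*Jp(u)
    return u

# example: J(u) = sum_i |u - f_i|; sum_i sign(u - f_i) is a subgradient
f = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
Jp = lambda u: np.sum(np.sign(u - f))
u = subgradient_descent(Jp, 0.0, lmbda0=1.0)    # approaches the median, 3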

6.2.2. Proximal gradient methods


While the subgradient descent method is easily implemented, it does not fully exploit the structure of the objective. In particular, we can
often split the objective into a smooth and a convex part. For the discussion we will assume for the moment that

J(u) = D(u) + R(u),

where D is smooth and R is convex. We are then looking for a point u ∗ for which

D ′ (u ∗ ) ∈ −∂R(u ∗ ). (6.6)

Finding such a point can be done (again!) by a fixed-point iteration

u k+1 = (I + λ∂R) −1 (I − λD ′ )(u k ),

where u = (I + λ∂R) −1 (v) yields a point u for which λ −1 (v − u) ∈ ∂R(u). We can easily show that a fixed point of this iteration indeed
solves the differential inclusion problem (6.6). Assuming a fixed point u ∗ , we have

u ∗ = (I + λ∂R) −1 (I − λD ′ )(u ∗ ),

using the definition of (I + λ∂R)^{−1} this yields

λ^{−1}(u∗ − λD′(u∗) − u∗) ∈ ∂R(u∗),


which indeed confirms that −D ′ (u ∗ ) ∈ ∂R(u ∗ ).

 Definition: Proximal operator

The operator (I + λ∂R) −1 is called the proximal operator of λR, whose action on input v is implicitly defined as solving

min_u ½∥u − v∥_2^2 + λR(u).

We usually denote this operator by prox λR (v).

With this, the proximal gradient method for solving (6.6) is given by

u k+1 = prox λR (u k − λD ′ (u k )). (6.7)

 Theorem: Convergence of the proximal point iteration

Let J = D + R be a functional with D smooth and R convex. Denote the Lipschitz constant of D ′ by L D . The iterates produced
by (6.7) with a fixed stepsize λ = 1/L D converge to a fixed point, u ∗ , of (6.7).

If, in addition, D is convex, the iterates converge sublinearly to a minimizer u∗:

J(u_k) − J∗ ≤ L_D ∥u∗ − u_0∥_2^2 / (2k).

If D is μ-strongly convex, the iteration converges linearly to a minimizer u ∗ :

∥u k+1 − u ∗ ∥ 22 ≤ (1 − μ/L D )∥u k − u ∗ ∥ 22 .

 Proof

When compared to the subgradient method, we may expect better performance from the proximal gradient method when D is strongly
convex and R is convex. Even if J is smooth, the proximal gradient method may be favorable, as the convergence constants depend on the
Lipschitz constant of D only, not on that of J. All this comes at the cost of solving a minimization problem at each iteration, so these methods are
usually only applied when a closed-form expression for the proximal operator exists.
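
A generic sketch of iteration (6.7); the user-supplied prox function and the fixed iteration count are our own choices. The soft-thresholding operator derived in the example below can be plugged in as prox.

import numpy as np

def proximal_gradient(Dp, prox, u0, lmbda, maxiter=500):
    """Proximal gradient iteration u_{k+1} = prox_{lmbda*R}(u_k - lmbda*D'(u_k))."""
    u = u0.copy()
    for k in range(maxiter):
        u = prox(u - lmbda*Dp(u), lmbda)
    return u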

 Example: one-norm

The proximal operator for the ℓ_1 norm solves

min_u ½∥u − v∥_2^2 + λ∥u∥_1.

The solution obeys u − v ∈ −λ∂∥u∥_1, which yields (elementwise)

u_i − v_i ∈
    {−λ}       u_i > 0
    [−λ, λ]    u_i = 0
    {λ}        u_i < 0.

This condition is fulfilled by setting

u_i =
    v_i − λ    v_i > λ
    0          |v_i| ≤ λ
    v_i + λ    v_i < −λ.
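
In code, this soft-thresholding operation can be written compactly (a NumPy sketch, applied elementwise):

import numpy as np

def prox_l1(v, lmbda):
    """Proximal operator of lmbda*||.||_1: soft thresholding."""
    return np.sign(v)*np.maximum(np.abs(v) - lmbda, 0.0)

# usage with the proximal gradient sketch above, e.g. for
# min_u 0.5*||u - f||_2^2 + lmbda*||u||_1, where D'(u) = u - f:
# u = proximal_gradient(lambda u: u - f, prox_l1, np.zeros_like(f), lmbda=0.5)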


 Example: box constraints

The proximal operator of the indicator function δ_{[a,b]^n} solves

min_{u∈[a,b]^n} ½∥u − v∥_2^2.

The solution is given by

u_i =
    a      v_i < a
    v_i    v_i ∈ [a, b]
    b      v_i > b.

Thus, u is the orthogonal projection of v onto [a, b]^n.
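
With NumPy this projection is a one-liner (sketch):

import numpy as np

def prox_box(v, a, b):
    """Proximal operator of the indicator of [a,b]^n: orthogonal projection onto the box."""
    return np.clip(v, a, b)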

6.2.3. Splitting methods


The proximal point methods require that the proximal operator for R can be evaluated efficiently. In many practical applications this is not
the case, however. Instead, we may have a regularizer of the form R(Au) for some linear operator A. Even when R(⋅) admits an efficient
proximal operator, R(A⋅) will, in general, not. In this section we discuss a class of methods that allow us to shift the operator A to the other
part of the objective. As a model problem we will consider solving

min_{u∈R^n} D(u) + R(Au),    (6.8)

with D smooth and convex, R(⋅) convex and A ∈ R^{m×n} a linear map. The basic idea is to introduce an auxiliary variable v and re-formulate
the variational problem as

min_{u∈R^n, v∈R^m} D(u) + R(v)  s.t.  Au = v.    (6.9)

To solve such constrained optimization problems we employ the method of Lagrange multipliers, which defines the Lagrangian

Λ(u, v, ν) = D(u) + R(v) + ⟨ν, Au − v⟩,

where ν ∈ R^m are called the Lagrange multipliers. The solution to (6.8) is a saddle point of Λ and can thus be obtained by solving

min_{u,v} max_ν Λ(u, v, ν).

The equivalence between (6.8) and (6.9) is established in the following theorem.

 Theorem: Saddle point theorem

Let (u∗, v∗) be a solution to (6.9), then there exists a ν∗ ∈ R^m such that (u∗, v∗, ν∗) is a saddle point of Λ and vice versa.

 Proof

Another important concept related to the Lagrangian is the dual problem.

 Definition: Dual problem

The dual problem related to (6.9) is

max_ν min_{u,v} Λ(u, v, ν).    (6.10)

For convex problems, the primal and dual problems are equivalent, giving us freedom when designing algorithms.


 Theorem: Strong duality

The primal (6.9) and dual (6.10) are equivalent in the sense that

min_{u,v} max_ν Λ(u, v, ν) = max_ν min_{u,v} Λ(u, v, ν).

 Proof

 Example: TV-denoising

The TV-denoising problem can be expressed as

min_{u∈R^n} ½∥u − f^δ∥_2^2 + λ∥Du∥_1,

with D ∈ R^{m×n} a discretisation of the first derivative. We can express the corresponding dual problem as

max_ν (min_u ½∥u − f^δ∥_2^2 + ⟨ν, Du⟩) + (min_v λ∥v∥_1 − ⟨ν, v⟩).

The first term is minimised by setting u = f^δ − D^*ν. The second term is a bit trickier. First, we note that λ∥v∥_1 − ⟨ν, v⟩ is not
bounded from below when ∥ν∥_∞ > λ. Furthermore, for ∥ν∥_∞ ≤ λ it attains a minimum for v = 0.

This leads to

max_ν −½∥D^*ν∥_2^2 + ⟨D^*ν, f^δ⟩ − δ_{∥⋅∥_∞≤λ}(ν),

which is a constrained quadratic program. Since the first part is smooth and the proximal operator for the constraint ∥ν∥_∞ ≤ λ is
easy, we can employ a proximal gradient method to solve the dual problem. Having solved it, we can retrieve the primal variable
via the relation u = f^δ − D^*ν.

The strategy illustrated in the previous example is an instance of a more general approach to solving problems of the form (6.8).

 Dual-based proximal gradient

We start from the dual problem (6.10):

max_ν (min_u (D(u) + ⟨Au, ν⟩)) + (min_v (R(v) − ⟨ν, v⟩)).

In this expression we recognise the convex conjugates of D and R. With this, we re-write the problem as

min_ν D^*(−A^Tν) + R^*(ν).

Thus, we have moved the linear map to the other side. We can now apply the proximal gradient method provided that:

- we have a closed-form expression for the convex conjugates of D and R;
- R^* has a proximal operator that is easily evaluated.

For many simple functions, we do have such closed-form expressions of their convex conjugates. Moreover, to compute the
proximal operator, we can use Moreau's identity: prox_R(u) + prox_{R^*}(u) = u.
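
As an illustration, Moreau's identity lets us evaluate prox_{R^*} whenever prox_R is available; the ℓ_1-norm test below (whose conjugate is the indicator of the unit ∞-norm ball) is our own example.

import numpy as np

def prox_conjugate(prox_R, u):
    """Prox of the convex conjugate R* via Moreau's identity: prox_{R*}(u) = u - prox_R(u)."""
    return u - prox_R(u)

# example: R = ||.||_1, so prox_{R*} should project onto {v : ||v||_inf <= 1}
prox_l1 = lambda v: np.sign(v)*np.maximum(np.abs(v) - 1.0, 0.0)
v = np.array([2.0, -0.5, 0.3])
print(prox_conjugate(prox_l1, v))    # -> [ 1.  -0.5  0.3]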

It may not always be feasible to formulate the dual problem explicitly as in the previous example. In such cases we would rather solve (6.10)
directly. A popular way of doing this is the Alternating Direction Method of Multipliers (ADMM).


 Alternating Direction Method of Multipliers (ADMM)

We augment the Lagrangian by adding a quadratic term:


Λ_ρ(u, v, ν) = D(u) + R(v) + ⟨ν, Au − v⟩ + (ρ/2)∥Au − v∥_2^2.

We then find the solution by updating the variables in an alternating fashion

u_{k+1} = argmin_u Λ_ρ(u, v_k, ν_k),
v_{k+1} = argmin_v Λ_ρ(u_{k+1}, v, ν_k),
ν_{k+1} = ν_k + ρ(Au_{k+1} − v_{k+1}).

Efficient implementations of this method rely on the proximal operators of D and R.

 Example: TV-denoising

Consider the TV-denoising problem from the previous example. The ADMM method finds a solution via

u_{k+1} = (I + ρD^*D)^{−1}(f^δ + D^*(ρv_k − ν_k)),
v_{k+1} = prox_{(λ/ρ)∥⋅∥_1}(Du_{k+1} + ρ^{−1}ν_k),
ν_{k+1} = ν_k + ρ(Du_{k+1} − v_{k+1}).
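
A direct translation of these updates into NumPy might look as follows; solving the u-update with np.linalg.solve and using a fixed number of iterations are our own illustrative choices, and f_delta, D, lmbda and rho are assumed to be given.

import numpy as np

def admm_tv(f_delta, D, lmbda, rho, maxiter=200):
    """ADMM for min_u 0.5*||u - f_delta||_2^2 + lmbda*||D u||_1 using the updates above."""
    m, n = D.shape
    u = f_delta.copy()
    v = np.zeros(m)
    nu = np.zeros(m)
    M = np.eye(n) + rho*(D.T @ D)                 # system matrix of the u-update
    for k in range(maxiter):
        u = np.linalg.solve(M, f_delta + D.T @ (rho*v - nu))
        w = D @ u + nu/rho
        v = np.sign(w)*np.maximum(np.abs(w) - lmbda/rho, 0.0)   # prox of (lmbda/rho)*||.||_1
        nu = nu + rho*(D @ u - v)
    return u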

We cannot do justice to the breadth and depth of the topics of smooth and convex optimization in one chapter. Rather, we hope that this
chapter serves as a starting point for further study in one of these areas for some, and provides useful recipes for others.

6.3. Exercises

6.3.1. Steepest descent for strongly convex functionals


Consider the following fixed point iteration for minimizing a given function J : R n → R

u (k+1) = u (k) − αJ ′ (u (k) ),

where J is twice continuously differentiable and strictly convex:

μI ⪯ J ′′ (u) ⪯ LI,

with 0 < μ < L < ∞.

Show that the fixed point iteration converges linearly, i.e., ∥u (k+1) − u ∗ ∥ ≤ ρ∥u (k) − u ∗ ∥ with ρ < 1, for 0 < α < 2/L.

 Answer

Determine the value of α for which the iteration converges fastest.

 Answer

6.3.2. Steepest descent for convex functions


Let J : R^n → R be convex and Lipschitz-smooth. Show that the basic steepest-descent iteration with step size λ = 1/L produces iterates for
which


J(u_k) − J(u∗) ≤ L∥u_0 − u∗∥_2^2 / (2k).

The key is to use that

J(v) ≤ J(u) + ⟨J′(u), v − u⟩ + (L/2)∥u − v∥_2^2.

 Answer

6.3.3. Rosenbrock
We are going to test various optimization methods on the Rosenbrock function

f(x, y) = (a − x) 2 + b(y − x 2 ) 2 ,

with a = 1 and b = 100. The function has a global minimum at (a, a 2 ).

Write a function to compute the Rosenbrock function, its gradient and the Hessian for given input (x, y). Visualize the function on
[−3, 3] 2 and indicate the neighborhood around the minimum where f is convex.
Implement the method from exercise 6.3.1 and test convergence from various initial points. Does the method always converge? How
small do you need to pick α? How fast?
Implement a linesearch strategy to ensure that α_k satisfies the Wolfe conditions. Does α vary a lot?

 Answer

6.3.4. Subdifferentials
Compute the subdifferentials of the following functionals J : R n → R + :

The Euclidean norm J(u) = ∥u∥ 2 .


The elastic net J(u) = α∥u∥ 1 + β∥u∥ 22
The weighted ℓ 1 -norm J(u) = ∥Du∥ 1 , with D ∈ R m×n for m < n a full-rank matrix.

 Answer

6.3.5. Dual problems


Derive the dual problems for the following optimization problems

min u ∥u − f δ ∥ 1 + λ∥u∥ 22 .
min u 12 ∥u − f δ ∥ 22 + λ∥u∥ p , p ∈ N >0 .
min u∈[−1,1]n 12 ∥u − f δ ∥ 22 .


 Answer

6.3.6. TV-denoising
In this exercise we consider a one-dimensional TV-denoising problem

min_{u∈R^n} ½∥u − f^δ∥_2^2 + λ∥Du∥_1,

with D a first-order finite difference discretization of the first derivative.

Show that the problem is equivalent (in terms of solutions) to solving

min_ν ½∥D^*ν − f^δ∥_2^2  s.t.  ∥ν∥_∞ ≤ λ.

Implement a proximal-gradient method for solving the dual problem.


Implement an ADMM method for solving the (primal) denoising problem.
Test and compare both methods on a noisy signal. Example code is given below.

import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300

# grid \Omega = [0,1]


n = 100
h = 1/(n-1)
x = np.linspace(0,1,n)

# parameters
sigma = 1e-1

# make data
u = np.heaviside(x - 0.2,0)
f_delta = u + sigma*np.random.randn(n)

# FD differentiation matrix
D = (np.diag(np.ones(n-1),1) - np.diag(np.ones(n),0))/h

# plot
plt.plot(x,u,x,f_delta)
plt.xlabel(r'$x$')
plt.show()


 Answer


6.4. Assignments

6.4.1. Spline regularisation


The aim is to solve the following variational problem

min_u ½∥Ku − f^δ∥_2^2 + α∥Lu∥_1,

where K is a given forward operator (matrix) and L is a discretization of the second derivative operator.

1. Design and implement a method for solving this variational problem; you can be creative here – multiple answers are possible
2. Compare your method with the basic subgradient-descent method implemented below
3. (bonus) Find a suitable value for α using the discrepancy principle

Some code to get you started is shown below.

# import libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300

# forward operator
def getK(n):
    h = 1/n
    x = np.linspace(h/2,1-h/2,n)
    xx,yy = np.meshgrid(x,x)
    K = h/(1 + (xx - yy)**2)**(3/2)

    return K,x

# define regularization operator
def getL(n):
    h = 1/n
    L = (np.diag(np.ones(n-1),-1) - 2*np.diag(np.ones(n),0) + np.diag(np.ones(n-1),1))/h**2
    return L

# define grid and operators


n = 100
delta = 1e-2
K,x = getK(n)
L = getL(n)

# true solution and corresponding data


u = np.minimum(0.5 - np.abs(0.5-x),0.3 + 0*x)
f = K@u

# noisy data
noise = np.random.randn(n)
f_delta = f + delta*noise

# plot
plt.plot(x,u,x,f,x,f_delta)
plt.xlabel(r'$x$')
plt.show()


By Tristan van Leeuwen and Christoph Brune (CC BY-NC 4.0)

