Numerical optimization for inverse problems — 10 Lectures on Inverse Problems and Imaging
Contents
6.1. Smooth optimization
6.2. Convex optimization
6.3. Exercises
6.4. Assignments
In this chapter we treat numerical algorithms for solving optimization problems over $\mathbb{R}^n$. Throughout, we assume that the objective $J(u) = D(u) + R(u)$ satisfies the conditions for a unique minimizer to exist. We distinguish between two important classes of problems: smooth problems and convex problems.
For a comprehensive treatment of this topic (and many more), we recommend the seminal book Numerical Optimization by Stephen Wright
and Jorge Nocedal [Nocedal and Wright, 2006].
6.1. Smooth optimization

Given a smooth functional $J : \mathbb{R}^n \rightarrow \mathbb{R}$, a point $u_* \in \mathbb{R}^n$ is a local minimizer only if it satisfies the first- and second-order necessary optimality conditions

$$J'(u_*) = 0, \quad J''(u_*) \succeq 0.$$
A basic approach to finding such a point is the fixed-point (gradient-descent) iteration

$$u_{k+1} = (I - \lambda J')(u_k) = u_k - \lambda J'(u_k),$$

where $\lambda > 0$ is the step size. The following theorem states that this iteration yields a stationary point of $J$, regardless of the initial iterate, provided that we pick $\lambda$ small enough.
$$\min_{k\in\{0,1,\ldots,n-1\}} \|J'(u_k)\|_2^2 \le \frac{J(u_0) - J_*}{Cn},$$

for a constant $C > 0$ that depends on the step size $\lambda$ and the Lipschitz constant of $J'$.
Proof
Stronger statements about the rate of convergence can be made by making additional assumptions on J (such as (strong) convexity), but
this is left as an exercise.
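As an illustration, a minimal sketch of this fixed-point iteration with a fixed step size could look as follows; J and Jp are user-supplied functions returning the value and the gradient of the objective, and the stopping tolerance is an arbitrary choice:

import numpy as np

def gradient_descent(J, Jp, u0, lmbda, maxiter=100, tol=1e-6):
    """Minimal fixed-step gradient descent: u_{k+1} = u_k - lmbda*J'(u_k)."""
    u = np.asarray(u0, dtype=float)
    for k in range(maxiter):
        g = Jp(u)
        if np.linalg.norm(g) <= tol:   # stop when the gradient is small
            break
        u = u - lmbda*g
    return u

# example usage on the quadratic J(u) = 0.5*||u||^2
u_min = gradient_descent(lambda u: 0.5*u.dot(u), lambda u: u, np.ones(5), lmbda=0.5)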
More generally, we can consider iterations of the form

$$u_{k+1} = u_k + \lambda_k d_k,$$

where $d_k$ is a descent direction satisfying $\langle d_k, J'(u_k)\rangle < 0$. Obviously, $d_k = -J'(u_k)$ is a descent direction, but other choices may be beneficial in practice. In particular, we can choose $d_k = -BJ'(u_k)$ for any positive-definite matrix $B$ to obtain a descent direction. How to choose such a matrix will be discussed in the next section.
In order to ensure sufficient progress of the iterations, we can choose a step length that guarantees sufficient descent:

$$J(u_k + \lambda d_k) \le J(u_k) + c_1 \lambda \langle J'(u_k), d_k\rangle, \qquad (6.2)$$

with $c_1 \in (0, 1)$ a small constant (typically $c_1 = 10^{-4}$). Existence of a $\lambda$ satisfying this condition is guaranteed by the regularity of $J$. We can find a suitable $\lambda$ by backtracking:
def backtracking(J,Jp,u,d,lmbda,rho=0.5,c1=1e-4):
    """
    Backtracking line search to find a step size satisfying
    J(u + lmbda*d) <= J(u) + lmbda*c1*Jp(u)^T d

    Input:
        J      - Function object returning the value of J at a given input vector
        Jp     - Function object returning the gradient of J at a given input vector
        u      - current iterate as array of length n
        d      - descent direction as array of length n
        lmbda  - initial step size
        rho,c1 - backtracking parameters, default (0.5,1e-4)

    Output:
        lmbda  - step size satisfying the sufficient decrease condition
    """
    while J(u + lmbda*d) > J(u) + c1*lmbda*Jp(u).dot(d):
        lmbda *= rho
    return lmbda
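For example, a single gradient-descent step with backtracking, with J, Jp and a current iterate u as in the docstring above, reads:

# one gradient-descent step with a backtracked step size
d = -Jp(u)
lmbda = backtracking(J, Jp, u, d, lmbda=1.0)
u = u + lmbda*d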
A possible disadvantage of the backtracking line search introduced earlier is that it may end up choosing very small step sizes. To obtain a step size that yields a new iterate at which the slope of $J$ is not too large, we introduce the following condition

$$|\langle J'(u_k + \lambda d_k), d_k\rangle| \le c_2 |\langle J'(u_k), d_k\rangle|, \qquad (6.3)$$

where $c_2$ is a constant satisfying $0 < c_1 < c_2 < 1$. The conditions (6.2) and (6.3) are referred to as the strong Wolfe conditions. Existence of a step size satisfying these conditions is again guaranteed by the regularity of $J$ (cf. [Nocedal and Wright, 2006], lemma 3.1). Finding such a $\lambda$ is a little more involved than the backtracking procedure outlined above (cf. [Nocedal and Wright, 2006], algorithm 3.5). Luckily, the SciPy library provides an implementation of this algorithm (cf. scipy.optimize.line_search).
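A typical call, with J, Jp, a current iterate u and a descent direction d as before, could look like this; note that the routine may return None when it fails to find a suitable step:

from scipy.optimize import line_search

# find a step size satisfying the strong Wolfe conditions along direction d
alpha, *_ = line_search(J, Jp, u, d, c1=1e-4, c2=0.9)
if alpha is None:                                # fall back to simple backtracking
    alpha = backtracking(J, Jp, u, d, lmbda=1.0)
u = u + alpha*d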
Using the inverse of the Hessian $J''(u_k)$ for the matrix $B$ introduced above leads to Newton's method

$$u_{k+1} = u_k - J''(u_k)^{-1}J'(u_k). \qquad (6.4)$$

We can interpret this method as finding the new iterate $u_{k+1}$ as the (unique) minimizer of the quadratic approximation of $J$ around $u_k$:

$$J(u) \approx J(u_k) + \langle J'(u_k), u - u_k\rangle + \tfrac{1}{2}\langle u - u_k, J''(u_k)(u - u_k)\rangle.$$
Let $J$ be a smooth functional and $u_*$ be a (local) minimizer. For any $u_0$ sufficiently close to $u_*$, the iteration (6.4) converges quadratically to $u_*$, i.e.,

$$\|u_{k+1} - u_*\|_2 \le M\|u_k - u_*\|_2^2,$$

for some constant $M > 0$.
Proof
In practice, the Hessian may not be invertible everywhere and we may not have an initial iterate sufficiently close to a minimizer to ensure
convergence. Practical applications therefore include a line search and a safeguard against non-invertible Hessians.
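A sketch of such a safeguarded Newton method is given below, reusing the backtracking routine from above and regularising the Hessian when the Newton system cannot be solved; the regularisation parameter gamma is an arbitrary choice:

import numpy as np

def newton(J, Jp, Jpp, u0, maxiter=20, tol=1e-8, gamma=1e-6):
    """Newton's method with backtracking line search and a simple Hessian safeguard."""
    u = np.asarray(u0, dtype=float)
    for k in range(maxiter):
        g = Jp(u)
        if np.linalg.norm(g) <= tol:
            break
        H = Jpp(u)
        try:
            d = -np.linalg.solve(H, g)                          # Newton direction
        except np.linalg.LinAlgError:
            d = -np.linalg.solve(H + gamma*np.eye(len(u)), g)   # regularise a singular Hessian
        if g.dot(d) >= 0:                                       # not a descent direction; fall back to gradient
            d = -g
        lmbda = backtracking(J, Jp, u, d, lmbda=1.0)
        u = u + lmbda*d
    return u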
In some applications, it may be difficult to compute and invert the Hessian. This problem is addressed by so-called quasi-Newton methods, which approximate the Hessian. The basis for such approximations is the secant relation

$$H_k(u_{k+1} - u_k) = J'(u_{k+1}) - J'(u_k),$$

which is satisfied by the true Hessian $J''$ at a point $\eta_k = u_k + t(u_{k+1} - u_k)$ for some $t \in (0,1)$. Obviously, we cannot hope to solve for $H_k \in \mathbb{R}^{n\times n}$ from just these $n$ equations. We can, however, impose some structural assumptions on the Hessian. Assuming a simple diagonal structure $H_k = h_k I$ yields

$$h_k = \frac{\langle J'(u_{k+1}) - J'(u_k), u_{k+1} - u_k\rangle}{\|u_{k+1} - u_k\|_2^2}.$$

In fact, even gradient descent can be interpreted in this manner by approximating $J''(u_k) \approx L \cdot I$.
An often-used approximation is the Broyden-Fletcher-Goldfarb-Shanno (BFGS) approximation, which keeps track of the steps $s_k = u_{k+1} - u_k$ and gradients $y_k = J'(u_{k+1}) - J'(u_k)$ to recursively construct an approximation of the inverse of the Hessian as

$$B_{k+1} = (I - \rho_k s_k y_k^T)B_k(I - \rho_k y_k s_k^T) + \rho_k s_k s_k^T,$$

with $\rho_k = \langle s_k, y_k\rangle^{-1}$ and $B_0$ chosen appropriately (e.g., $B_0 = L^{-1}\cdot I$). It can be shown that this approximation is sufficiently accurate to yield superlinear convergence when using a Wolfe line search.
There are many practical aspects to implementing such methods. For example, what do we do when the approximated Hessian becomes (almost) singular? Discussing these issues is beyond the scope of these lecture notes and we refer to [Nocedal and Wright, 2006], chapter 6, for more details. The SciPy library provides an implementation of various optimization methods (cf. scipy.optimize.minimize).
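For example, a quasi-Newton solve with SciPy only takes a few lines; J, Jp and an initial guess u0 are assumed to be defined:

from scipy.optimize import minimize

# BFGS with SciPy's built-in strong-Wolfe line search
result = minimize(J, u0, jac=Jp, method='BFGS', options={'gtol': 1e-6})
u_min = result.x

# limited-memory variant (L-BFGS), useful when n is large
result = minimize(J, u0, jac=Jp, method='L-BFGS-B')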
6.2. Convex optimization

To deal with convex functionals that are not smooth, we first generalize the notion of a derivative.

Definition: subgradient

A vector $g \in \mathbb{R}^n$ is called a subgradient of a convex functional $J$ at $u$ if

$$J(v) \ge J(u) + \langle g, v - u\rangle \quad \text{for all } v \in \mathbb{R}^n.$$

This definition is reminiscent of the Taylor expansion and we can indeed easily check that it holds for convex smooth functionals for $g = J'(u)$. For non-smooth functionals there may be multiple vectors $g$ satisfying the inequality. We call the set of all such vectors the subdifferential, which we will denote as $\partial J(u)$. We will generally denote an arbitrary element of $\partial J(u)$ by $J'(u)$.
Let

$$J_1(u) = |u|, \quad J_2(u) = \delta_{[0,1]}(u), \quad J_3(u) = \max\{u, 0\}.$$

All these functions are convex and exhibit a discontinuity in the derivative at $u = 0$. The subdifferentials at $u = 0$ are given by

$$\partial J_1(0) = [-1, 1], \quad \partial J_2(0) = (-\infty, 0], \quad \partial J_3(0) = [0, 1].$$
Let $J_i : \mathbb{R}^n \rightarrow \mathbb{R}_\infty$ be proper convex functionals and let $A \in \mathbb{R}^{n\times n}$, $b \in \mathbb{R}^n$. We then have the following useful rules:

Summation: let $J = J_1 + J_2$, then $J_1'(u) + J_2'(u) \in \partial J(u)$ for $u$ in the interior of $\mathrm{dom}(J)$.

Affine transformation: let $J(u) = J_1(Au + b)$, then $A^T J_1'(Au + b) \in \partial J(u)$ for $u$, $Au + b$ in the interior of $\mathrm{dom}(J)$, $\mathrm{dom}(J_1)$, respectively.
An overview of other useful relations can be found in e.g., [Beck, 2017] section 3.8.
With this we can now formulate the optimality condition for convex optimization: a point $u_*$ is a minimizer of a proper convex functional $J$ iff

$$0 \in \partial J(u_*).$$
As an example, consider the functional

$$J(u) = \sum_{i=1}^n |u - f_i|.$$
Introducing $J_i(u) = |u - f_i|$, we have

$$J_i'(u) = \begin{cases} -1 & u < f_i \\ [-1, 1] & u = f_i \\ 1 & u > f_i \end{cases},$$

and hence, assuming the $f_i$ are sorted in increasing order,

$$J'(u) = \begin{cases} -n & u < f_1 \\ 2i - n & u \in (f_i, f_{i+1}) \\ 2i - 1 - n + [-1, 1] & u = f_i \\ n & u > f_n \end{cases}.$$

To find a $u$ for which $0 \in J'(u)$ we need to consider the middle two cases. If $n$ is even, we can find an $i$ such that $2i = n$ and get that for all $u \in [f_{n/2}, f_{n/2+1}]$ we have $0 \in J'(u)$. When $n$ is odd, we have optimality only for $u = f_{(n+1)/2}$.
Fig. 6.2 Subgradient of $J$ for $f = (1, 2, 3, 4)$ and $f = (1, 2, 3, 4, 5)$.
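The conclusion that the minimizer is the median of the data is easily checked numerically; a small example with hypothetical data f:

import numpy as np

f = np.array([1., 2., 3., 4., 5.])
J = lambda u: np.sum(np.abs(u - f))

# evaluate J on a fine grid and compare the minimizer with the median
ugrid = np.linspace(0, 6, 6001)
u_opt = ugrid[np.argmin([J(u) for u in ugrid])]
print(u_opt, np.median(f))   # both should be (close to) 3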
The subgradient method generalizes gradient descent to non-smooth convex functionals by simply using an arbitrary subgradient:

$$u_{k+1} = u_k - \lambda_k J'(u_k), \quad J'(u_k) \in \partial J(u_k). \qquad (6.5)$$

Let $J : \mathbb{R}^n \rightarrow \mathbb{R}$ be a convex, $L$-Lipschitz-continuous function. The iteration (6.5) produces iterates for which

$$\min_{k\in\{0,1,\ldots,n-1\}} J(u_k) - J(u_*) \le \frac{\|u_0 - u_*\|_2^2 + L^2\sum_{k=0}^{n-1}\lambda_k^2}{2\sum_{k=0}^{n-1}\lambda_k}.$$
In particular, the right-hand side tends to zero as $n \rightarrow \infty$ when the step sizes satisfy

$$\sum_{k=0}^{\infty} \lambda_k = \infty, \quad \sum_{k=0}^{\infty} \lambda_k^2 < \infty.$$
Proof
If we choose $\lambda_k = \lambda$, we get

$$\min_{k\in\{0,1,\ldots,n-1\}} J(u_k) - J(u_*) \le \frac{\|u_0 - u_*\|_2^2 + L^2\lambda^2 n}{2\lambda n}.$$
In particular, we can guarantee that $\min_{k\in\{0,1,\ldots,n-1\}} J(u_k) - J(u_*) \le \epsilon$ by picking step size $\lambda = \epsilon/L^2$ and performing $n = (\|u_0 - u_*\|_2 L/\epsilon)^2$ iterations. However, for smooth convex functions one can derive a stronger result: gradient descent requires only $O(1/\epsilon)$ iterations (use exercise 6.3.2, the Lipschitz property, and the subgradient inequality). For smooth, strongly convex functionals we can strengthen the result even further and show that we only need $O(\log 1/\epsilon)$ iterations (see exercise 6.3.1). The proofs are left as an exercise.
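A minimal sketch of the subgradient method (6.5) is given below; Jsub is a user-supplied function returning an element of $\partial J(u)$, and since the objective values need not decrease monotonically, the best iterate encountered is returned:

import numpy as np

def subgradient_method(J, Jsub, u0, stepsizes):
    """Subgradient method u_{k+1} = u_k - lmbda_k*g_k with g_k in the subdifferential of J at u_k."""
    u = np.asarray(u0, dtype=float)
    u_best, J_best = u.copy(), J(u)
    for lmbda in stepsizes:
        u = u - lmbda*Jsub(u)
        if J(u) < J_best:                 # keep track of the best iterate so far
            u_best, J_best = u.copy(), J(u)
    return u_best

# example: the median problem from above, with stepsizes lmbda_k = 1/(k+1)
f = np.array([1., 2., 3., 4.])
u_best = subgradient_method(lambda u: np.sum(np.abs(u - f)),
                            lambda u: np.sum(np.sign(u - f)),
                            np.array([0.]), [1/(k + 1) for k in range(200)])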
We now consider composite functionals $J(u) = D(u) + R(u)$, where $D$ is smooth and $R$ is convex. We are then looking for a point $u_*$ for which

$$D'(u_*) \in -\partial R(u_*). \qquad (6.6)$$

This suggests the fixed-point iteration

$$u_{k+1} = (I + \lambda\partial R)^{-1}(I - \lambda D')(u_k),$$

where $u = (I + \lambda\partial R)^{-1}(v)$ yields a point $u$ for which $\lambda^{-1}(v - u) \in \partial R(u)$. We can easily show that a fixed point of this iteration indeed solves the differential inclusion problem (6.6). Assuming a fixed point $u_*$, we have

$$u_* = (I + \lambda\partial R)^{-1}(I - \lambda D')(u_*),$$

and hence

$$\lambda^{-1}(u_* - \lambda D'(u_*) - u_*) = -D'(u_*) \in \partial R(u_*),$$

which is exactly (6.6).
The operator $(I + \lambda\partial R)^{-1}$ is called the proximal operator of $\lambda R$; its action on an input $v$ is implicitly defined as solving

$$\min_u \tfrac{1}{2}\|u - v\|_2^2 + \lambda R(u),$$

and we denote it by $\mathrm{prox}_{\lambda R}(v)$.
With this, the proximal gradient method for solving (6.6) is then denoted as

$$u_{k+1} = \mathrm{prox}_{\lambda R}\left(u_k - \lambda D'(u_k)\right). \qquad (6.7)$$

Let $J = D + R$ be a functional with $D$ smooth and $R$ convex. Denote the Lipschitz constant of $D'$ by $L_D$. The iterates produced by (6.7) with a fixed step size $\lambda = 1/L_D$ converge to a fixed point, $u_*$, of (6.7). Moreover,
$$J(u_k) - J_* \le \frac{L_D\|u_* - u_0\|_2^2}{2k}.$$
Proof
When compared to the subgradient method, we may expect better performance from the proximal gradient method when $D$ is strongly convex and $R$ is convex. Even if $J$ is smooth, the proximal gradient method may be favorable, as the convergence constants depend on the Lipschitz constant of $D'$ only, not on that of $J'$. All this comes at the cost of solving a minimization problem at each iteration, so these methods are usually only applied when a closed-form expression for the proximal operator exists.
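A minimal sketch of iteration (6.7) is given below; Dp returns the gradient of the smooth part and prox(v, lmbda) evaluates the proximal operator of lmbda*R:

import numpy as np

def proximal_gradient(Dp, prox, u0, lmbda, maxiter=200):
    """Proximal gradient method: u_{k+1} = prox_{lmbda*R}(u_k - lmbda*D'(u_k))."""
    u = np.asarray(u0, dtype=float)
    for k in range(maxiter):
        u = prox(u - lmbda*Dp(u), lmbda)
    return u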
Example: one-norm
The proximal operator of the one-norm solves

$$\min_u \tfrac{1}{2}\|u - v\|_2^2 + \lambda\|u\|_1.$$

The optimality conditions read (componentwise)

$$u_i - v_i \in \begin{cases} \{-\lambda\} & u_i > 0 \\ [-\lambda, \lambda] & u_i = 0 \\ \{\lambda\} & u_i < 0 \end{cases},$$

from which we find the soft-thresholding operation

$$u_i = \begin{cases} v_i - \lambda & v_i > \lambda \\ 0 & |v_i| \le \lambda \\ v_i + \lambda & v_i < -\lambda \end{cases}.$$
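In code, this soft-thresholding operation is compactly expressed as follows; it can be passed as the prox argument to the proximal gradient sketch above:

import numpy as np

def soft_threshold(v, lmbda):
    """Proximal operator of lmbda*||.||_1: componentwise shrinkage towards zero."""
    return np.sign(v)*np.maximum(np.abs(v) - lmbda, 0)

# example: the components of v shrink to 1.5, 0.0 and 0.2
v = np.array([2.0, -0.3, 0.7])
print(soft_threshold(v, 0.5))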
Similarly, the proximal operator of the indicator function $\delta_{[a,b]^n}$ solves

$$\min_{u\in[a,b]^n} \tfrac{1}{2}\|u - v\|_2^2,$$

whose solution is the componentwise projection

$$u_i = \begin{cases} a & v_i < a \\ v_i & v_i \in [a, b] \\ b & v_i > b \end{cases}.$$
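In code, this projection is simply a componentwise clipping, e.g.:

import numpy as np

def proj_box(v, a, b):
    # projection onto [a, b]^n, i.e. the proximal operator of the indicator of [a, b]^n
    return np.clip(v, a, b)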
We now consider problems of the form

$$\min_{u\in\mathbb{R}^n} D(u) + R(Au), \qquad (6.8)$$

with $D$ smooth and convex, $R(\cdot)$ convex, and $A \in \mathbb{R}^{m\times n}$ a linear map. The basic idea is to introduce an auxiliary variable $v$ and re-formulate the variational problem as

$$\min_{u\in\mathbb{R}^n,\, v\in\mathbb{R}^m} D(u) + R(v) \quad \text{s.t.} \quad Au = v. \qquad (6.9)$$
The equivalence between (6.8) and (6.9) is established in the following theorem
To solve such constrained optimization problems we employ the method of Lagrange multipliers, which defines the Lagrangian

$$\Lambda(u, v, \nu) = D(u) + R(v) + \langle \nu, Au - v\rangle,$$

where $\nu \in \mathbb{R}^m$ are called the Lagrange multipliers. The solution to (6.8) is a saddle point of $\Lambda$ and can thus be obtained by solving

$$\min_{u,v}\max_{\nu} \Lambda(u, v, \nu).$$
Let (u ∗ , v ∗ ) be a solution to (6.8), then there exists a ν ∗ ∈ R m such that (u ∗ , v ∗ , ν ∗ ) is a saddle point of Λ and vice versa.
Proof
Rather than solving this saddle-point problem directly, we can also consider the corresponding dual problem

$$\max_{\nu}\min_{u,v} \Lambda(u, v, \nu). \qquad (6.10)$$
For convex problems, the primal and dual problems are equivalent, giving us freedom when designing algorithms.
The primal (6.9) and dual (6.10) problems are equivalent in the sense that

$$\min_{u,v}\max_{\nu}\Lambda(u,v,\nu) = \max_{\nu}\min_{u,v}\Lambda(u,v,\nu).$$
Proof
Example: TV-denoising
Consider the TV-denoising problem

$$\min_{u\in\mathbb{R}^n} \tfrac{1}{2}\|u - f^\delta\|_2^2 + \lambda\|Du\|_1,$$

with $D \in \mathbb{R}^{m\times n}$ a discretisation of the first derivative. We can express the corresponding dual problem as

$$\max_{\nu}\min_{u,v} \tfrac{1}{2}\|u - f^\delta\|_2^2 + \lambda\|v\|_1 + \langle \nu, Du - v\rangle.$$
The first term is minimised by setting $u = f^\delta - D^*\nu$. The second term is a bit trickier. First, we note that $\lambda\|v\|_1 - \langle \nu, v\rangle$ is not bounded from below when $\|\nu\|_\infty > \lambda$. Furthermore, for $\|\nu\|_\infty \le \lambda$ it attains a minimum for $v = 0$.
This leads to the dual problem

$$\max_{\|\nu\|_\infty \le \lambda} \langle \nu, Df^\delta\rangle - \tfrac{1}{2}\|D^*\nu\|_2^2,$$

which is a constrained quadratic program. Since the first part is smooth and the proximal operator for the constraint $\|\nu\|_\infty \le \lambda$ is easy, we can employ a proximal gradient method to solve the dual problem. Having solved it, we can retrieve the primal variable via the relation $u = f^\delta - D^*\nu$.
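A minimal sketch of this dual approach is given below; up to an additive constant, the dual problem is equivalent to minimising $\tfrac{1}{2}\|D^T\nu - f^\delta\|_2^2$ over the box $\|\nu\|_\infty \le \lambda$, so a projected gradient iteration with clipping as the proximal step can be used (D, f_delta and lmbda are assumed to be given):

import numpy as np

def tv_denoise_dual(f_delta, D, lmbda, maxiter=500):
    """TV-denoising via projected gradient on the dual problem (a sketch)."""
    nu = np.zeros(D.shape[0])
    t = 1.0/np.linalg.norm(D @ D.T, 2)            # step size 1/L with L = ||D D^T||_2
    for k in range(maxiter):
        grad = D @ (D.T @ nu - f_delta)           # gradient of 0.5*||D^T nu - f_delta||^2
        nu = np.clip(nu - t*grad, -lmbda, lmbda)  # project onto the box ||nu||_inf <= lmbda
    return f_delta - D.T @ nu                     # recover the primal variable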
The strategy illustrated in the previous example is an instance of a more general approach to solving problems of the form (6.8). Writing out the dual problem (6.10) gives

$$\max_{\nu}\left(\min_u D(u) + \langle A^T\nu, u\rangle\right) - \left(\max_v \langle \nu, v\rangle - R(v)\right).$$

In this expression we recognise the convex conjugates of $D$ and $R$. With this, we re-write the problem as

$$\max_{\nu} -D^*(-A^T\nu) - R^*(\nu).$$

Thus, we have moved the linear map to the other side. We can now apply the proximal gradient method, provided that:

the convex conjugate $D^*$ is smooth and we can evaluate its gradient;

the proximal operator of $R^*$ can be evaluated efficiently.

For many simple functions, we do have such closed-form expressions of their convex conjugates. Moreover, to compute the proximal operator of $R^*$, we can use Moreau's identity: $\mathrm{prox}_R(u) + \mathrm{prox}_{R^*}(u) = u$.
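For example, with $R = \|\cdot\|_1$ the conjugate $R^*$ is the indicator of the unit $\infty$-norm ball, so Moreau's identity states that soft-thresholding and clipping sum to the identity; a quick check using the soft_threshold function from the one-norm example:

import numpy as np

v = np.array([-2.0, -0.4, 0.1, 3.0])
print(soft_threshold(v, 1.0) + np.clip(v, -1.0, 1.0))   # recovers v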
It may not always be feasible to formulate the dual problem explicitly as in the previous example. In such cases we would rather solve (6.10) directly. A popular way of doing this is the Alternating Direction Method of Multipliers (ADMM).
Example: TV-denoising
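As an illustration, a minimal sketch of how ADMM could be applied to the TV-denoising problem $\min_u \tfrac{1}{2}\|u - f^\delta\|_2^2 + \lambda\|Du\|_1$ is given below (scaled form; the penalty parameter rho and the number of iterations are arbitrary choices):

import numpy as np

def tv_denoise_admm(f_delta, D, lmbda, rho=1.0, maxiter=200):
    """A sketch of ADMM for min_u 0.5*||u - f_delta||^2 + lmbda*||D u||_1 (scaled form)."""
    n, m = D.shape[1], D.shape[0]
    u = f_delta.copy()
    v = np.zeros(m)
    w = np.zeros(m)
    A = np.eye(n) + rho*(D.T @ D)                                 # matrix of the u-update
    for k in range(maxiter):
        u = np.linalg.solve(A, f_delta + rho*D.T @ (v - w))       # u-update: quadratic subproblem
        z = D @ u + w
        v = np.sign(z)*np.maximum(np.abs(z) - lmbda/rho, 0)       # v-update: soft-thresholding
        w = w + D @ u - v                                         # dual (multiplier) update
    return u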
We cannot do justice to the breadth and depth of the topics of smooth and convex optimization in one chapter. Rather, we hope that this chapter serves as a starting point for further study in one of these areas for some, and provides useful recipes for others.
6.3. Exercises
Let $J$ be a twice-differentiable functional whose Hessian satisfies

$$\mu I \preceq J''(u) \preceq LI,$$

for all $u$, with constants $0 < \mu \le L$, and consider the fixed-point iteration $u^{(k+1)} = u^{(k)} - \alpha J'(u^{(k)})$.

Show that the fixed-point iteration converges linearly, i.e., $\|u^{(k+1)} - u_*\| \le \rho\|u^{(k)} - u_*\|$ with $\rho < 1$, for $0 < \alpha < 2/L$.
Answer
Answer
Let $J$ be a convex functional with $L$-Lipschitz-continuous gradient. Show that gradient descent with step size $\lambda = 1/L$ produces iterates for which

$$J(u_k) - J(u_*) \le \frac{L\|u_0 - u_*\|_2^2}{2k}.$$
Answer
6.3.3. Rosenbrock
We are going to test various optimization methods on the Rosenbrock function
$$f(x, y) = (a - x)^2 + b(y - x^2)^2,$$

for given parameters $a$ and $b$.
Write a function to compute the Rosenbrock function, its gradient and the Hessian for given input $(x, y)$ (a possible starting point is sketched below). Visualize the function on $[-3, 3]^2$ and indicate the neighborhood around the minimum where $f$ is convex.
Implement the method from exercise 1 and test convergence from various initial points. Does the method always converge? How small do you need to pick $\alpha$? How fast is the convergence?

Implement a line search strategy to ensure that $\alpha_k$ satisfies the Wolfe conditions. Does $\alpha$ vary a lot?
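As a starting point for the first task, the function, gradient and Hessian could be implemented along the following lines (taking the usual parameter choice a = 1, b = 100 as an assumption):

import numpy as np

def rosenbrock(x, y, a=1.0, b=100.0):
    """Rosenbrock function, its gradient, and its Hessian at (x, y)."""
    f = (a - x)**2 + b*(y - x**2)**2
    g = np.array([-2*(a - x) - 4*b*x*(y - x**2),
                   2*b*(y - x**2)])
    H = np.array([[2 - 4*b*(y - 3*x**2), -4*b*x],
                  [-4*b*x,                2*b  ]])
    return f, g, H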
Answer
6.3.4. Subdifferentials
Compute the subdifferentials of the following functionals $J : \mathbb{R}^n \rightarrow \mathbb{R}_+$:
Answer
$\min_u \|u - f^\delta\|_1 + \lambda\|u\|_2^2.$

$\min_u \tfrac{1}{2}\|u - f^\delta\|_2^2 + \lambda\|u\|_p$, with $p \in \mathbb{N}_{>0}$.

$\min_{u\in[-1,1]^n} \tfrac{1}{2}\|u - f^\delta\|_2^2.$
Answer
6.3.6. TV-denoising
In this exercise we consider a one-dimensional TV-denoising problem
$$\min_{u\in\mathbb{R}^n} \tfrac{1}{2}\|u - f^\delta\|_2^2 + \lambda\|Du\|_1,$$

with $D$ the finite-difference differentiation matrix constructed in the code below.
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300

# parameters (grid values assumed; the notes define these in an earlier cell)
n = 100
h = 1/n
x = np.linspace(h/2, 1-h/2, n)
sigma = 1e-1

# make data
u = np.heaviside(x - 0.2, 0)
f_delta = u + sigma*np.random.randn(n)

# FD differentiation matrix
D = (np.diag(np.ones(n-1),1) - np.diag(np.ones(n),0))/h

# plot
plt.plot(x,u,x,f_delta)
plt.xlabel(r'$x$')
plt.show()
Answer
6.4. Assignments
$$\min_u \tfrac{1}{2}\|Ku - f^\delta\|_2^2 + \alpha\|Lu\|_1,$$
where K is a given forward operator (matrix) and L is a discretization of the second derivative operator.
1. Design and implement a method for solving this variational problem; you can be creative here – multiple answers are possible
2. Compare your method with the basic subgradient-descent method implemented below
3. (bonus) Find a suitable value for α using the discrepancy principle
# import libraries
import numpy as np
import matplotlib.pyplot as plt
import matplotlib as mpl
mpl.rcParams['figure.dpi'] = 300

# forward operator
def getK(n):
    h = 1/n
    x = np.linspace(h/2,1-h/2,n)
    xx,yy = np.meshgrid(x,x)
    K = h/(1 + (xx - yy)**2)**(3/2)
    return K,x

# problem setup (values assumed; the notes define these in an earlier cell)
n = 100
delta = 1e-2
K, x = getK(n)
u = np.heaviside(x - 0.5, 0)   # assumed ground-truth signal
f = K @ u                      # clean data

# noisy data
noise = np.random.randn(n)
f_delta = f + delta*noise

# plot
plt.plot(x,u,x,f,x,f_delta)
plt.xlabel(r'$x$')
plt.show()
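A minimal subgradient-descent baseline for this problem might look as follows; K, f_delta and alpha are as above, and getL is an assumed second-derivative discretisation:

import numpy as np

def getL(n):
    # an assumed finite-difference discretisation of the second derivative
    h = 1/n
    return (np.diag(np.ones(n-1),1) - 2*np.diag(np.ones(n)) + np.diag(np.ones(n-1),-1))/h**2

def subgradient_descent(K, L, f_delta, alpha, stepsize=1e-3, maxiter=1000):
    # basic subgradient descent for 0.5*||K u - f_delta||^2 + alpha*||L u||_1
    u = np.zeros(K.shape[1])
    for k in range(maxiter):
        g = K.T @ (K @ u - f_delta) + alpha*L.T @ np.sign(L @ u)   # a subgradient of the objective
        u = u - stepsize*g
    return u

# example usage with the setup above (alpha and stepsize are arbitrary choices)
u_sub = subgradient_descent(K, getL(n), f_delta, alpha=1e-3)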