
Part 3, Lecture 2: Learned regularization for image reconstruction
From model-based to data-driven approaches

Subhadip Mukherjee
Indian Institute of Technology
Kharagpur, India
Email: [email protected]
Web: sites.google.com/view/subhadip-mukherjee/home
GitHub: github.com/Subhadip-1
This lecture
• Previously, we have seen how to apply optimization methods to
reconstruct an image by minimizing a variational energy function (data
fidelity plus regularizer).
• In this lecture, we will learn how to formulate different data-adaptive
reconstruction approaches and training strategies by drawing inspiration
from the variational framework and various optimization algorithms.
• We will study two important classes of techniques for learning an image
reconstruction operator and two representative methods from each class.
1. Supervised approaches
1.1. Algorithm unrolling
1.2. Bi-level learning
2. Unsupervised and weakly supervised approaches
2.1. Plug-and-play (PnP) denoising algorithms
2.2. Adversarial regularization

• You will get an overview of the pros and cons of these approaches and learn
which approach is suitable in what context. You will learn to implement a
simple unrolling approach in the practical (using odl and pytorch).
Variational image reconstruction: A Bayesian view
• The Bayesian approach of reconstruction views the image x and the
corresponding data y as two random variables related as y = Ax + w.
• In the Bayesian framework, we characterize the posterior distribution
p(x|y) by either summarizing it in a point estimate or by sampling from it.

  Bayes rule:  p(x|y) = p_data(y|x) p_prior(x) / p_y(y)

• The Bayes maximum a-posteriori probability (MAP) estimate is the
  maximizer of the (log-)posterior distribution p(x|y):

  MAP estimate:  x̂_MAP = arg min_x [− log p(x|y)]
                        ≡ arg min_x [− log p_data(y|x) − log p_prior(x)].

• When w is Gaussian and p_prior(x) ∝ exp(−γ R(x)), we have

    x̂_MAP = arg min_x ∥y − Ax∥₂² + λ R(x),

  where λ absorbs γ and the noise variance.

• One might be interested in other statistical estimators (besides MAP).


The inadequacy of model-driven approaches

Recall the model-based variational approach:

  x_λ(y_δ) = arg min_{x∈ℝᵈ}  f(y_δ, A x) + λ R(x),        (1)

where f(y_δ, A x) is the data fidelity and R(x) is the regularizer.

• It is generally difficult to handcraft a regularizer R that models images well
  across different applications.
• Solving the optimization problem (1) above using one of the iterative
algorithms that we studied before could take a few thousand iterations to
converge. This leads to a slow and computationally inefficient
reconstruction, especially for large images.
• Learning the regularizer from training data can lead to a significant
improvement in image quality, as we will see next.
A motivating example

Figure: Ground truth compared with a model-based and a data-driven reconstruction.
Different data-driven image reconstruction techniques

Figure: Categorization of data-driven reconstruction approaches, which are
color-coded based on the strongest type of convergence guarantee they satisfy.

Image source: SM, A. Hauptmann, O. Öktem, M. Pereyra, and C.-B. Schönlieb, “Learned
reconstruction methods with convergence guarantees: A survey of concepts and
applications,” IEEE Signal Processing Magazine 40(1), 164–182.
Data-driven post-processing
• Post-processing approaches use machine learning to remove artifacts from a
  model-driven reconstruction.

    y_δ  —(model-driven)→  x†  —(data-driven)→  x̂

• These approaches are simple to implement, but they work as a black box
with limited interpretability.
• An example in the context of tomographic reconstruction:
  ◦ Create a dataset consisting of (FBP, ground-truth) pairs {(x_i†, x_i)}_{i=1}^n,
    where x_i† = FBP(y_i) are the FBP images.
  ◦ Train a U-net U_θ to remove artifacts from the FBP images (sketched in code below):

      min_θ  Σ_{i=1}^n ∥ x_i − U_θ(x_i†) ∥₂² .

• Cons: Needs more data to generalize well, does not combine imaging
  physics and data in a principled fashion, does not admit a variational
  interpretation, and lacks data consistency (i.e., A x_i† ≈ y_i does not imply
  A U_θ(x_i†) ≈ y_i).
Reference: Jin et al., “Deep convolutional neural network for inverse problems in imaging," IEEE-TIP, 2017.
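A rough pytorch sketch of the U-net training loop above; the `UNet` module and a dataset yielding (FBP, ground-truth) tensor pairs are assumed to exist elsewhere (their names are placeholders, not part of the lecture material):

```python
import torch
from torch.utils.data import DataLoader

def train_postprocessor(unet, dataset, epochs=20, lr=1e-4, batch_size=8):
    """Minimize sum_i ||x_i - U_theta(x_i^dagger)||_2^2 over the U-net parameters.
    `unet` and `dataset` (yielding (fbp, ground_truth) tensor pairs) are placeholders."""
    loader = DataLoader(dataset, batch_size=batch_size, shuffle=True)
    opt = torch.optim.Adam(unet.parameters(), lr=lr)
    for _ in range(epochs):
        for fbp, gt in loader:
            loss = ((unet(fbp) - gt) ** 2).mean()  # squared-error training loss
            opt.zero_grad()
            loss.backward()
            opt.step()
    return unet
```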
Algorithm unrolling: combining imaging physics with machine learning
Algorithm unrolling: how does it work?
Key idea: Build an optimization-inspired architecture of the reconstruction
network. Each iteration in the optimization algorithm forms a layer in the
reconstruction network.

(Proximal) gradient descent
  • Initialize x₀, choose step-size η.
  • Iterate until convergence:
      x_{k+1} = prox_{λR}( x_k − η ∇f(y_δ, A x_k) ).

Learned gradient descent
  • Initialize x₀.
  • Iterate for N ≈ 10 steps:
      x_{k+1} = h_{θ_k}( x_k, ∇f(y_δ, A x_k) ),
    where h_{θ_k} is a CNN with learnable parameters.

Training an unrolled network


Learn θ by using back-propagation on the (supervised) training loss:

    min_θ  (1/n) Σ_{i=1}^n ∥ x_N(θ; y_δ^{(i)}) − x^{(i)} ∥₂² .
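A minimal pytorch sketch of such an unrolled (learned gradient descent) network, assuming the forward operator A is given as a dense matrix acting on flattened images (in the practical, an odl operator wrapped for pytorch would take its place); the CNN architecture and all sizes are illustrative choices:

```python
import torch
import torch.nn as nn

class LearnedGradientDescent(nn.Module):
    """Unrolled gradient descent: each of the N iterations is a small CNN
    h_{theta_k}(x_k, grad f(y, A x_k)). A is a dense matrix here for simplicity."""
    def __init__(self, A, n_iter=10, img_shape=(64, 64)):
        super().__init__()
        self.n_iter, self.img_shape = n_iter, img_shape
        self.register_buffer("A", A)  # forward operator (illustrative)
        self.blocks = nn.ModuleList([
            nn.Sequential(nn.Conv2d(2, 32, 3, padding=1), nn.PReLU(),
                          nn.Conv2d(32, 32, 3, padding=1), nn.PReLU(),
                          nn.Conv2d(32, 1, 3, padding=1))
            for _ in range(n_iter)])

    def forward(self, y):
        H, W = self.img_shape
        x = torch.zeros(y.shape[0], H * W, device=y.device)  # x_0 = 0
        for block in self.blocks:
            residual = x @ self.A.T - y                      # A x - y (batched)
            grad = residual @ self.A                         # gradient of 0.5||A x - y||^2
            inp = torch.stack([x.view(-1, H, W), grad.view(-1, H, W)], dim=1)
            x = x + block(inp).view(-1, H * W)               # x_{k+1} = h_{theta_k}(x_k, grad)
        return x.view(-1, 1, H, W)

# Supervised training (sketch): for (y, x_true) pairs,
#   loss = ((net(y) - x_true) ** 2).mean(); loss.backward(); optimizer.step()
```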
The learned primal-dual (LPD) approach
• The learned gradient network seeks to learn the proximal operator
corresponding to the regularizer, but the data fidelity remains fixed.
• The learned variant of PDHG, on the other hand, offers the flexibility of
learning both the regularizer and the fidelity (by parameterizing the
proximal operators in both primal and dual spaces using two CNNs).

PDHG iterations
  ◦ u_{k+1} = prox_{σ f*}( u_k + σ A x̄_k )
  ◦ x_{k+1} = prox_{τ g}( x_k − τ A⊤ u_{k+1} )
  ◦ x̄_{k+1} = x_{k+1} + θ ( x_{k+1} − x_k )

Learned primal-dual (LPD)
  ◦ u_{k+1} = ϕ_{θ_k}( u_k, y_δ, σ_k A x_k )
  ◦ x_{k+1} = ψ_{γ_k}( x_k, τ_k A⊤ u_{k+1} )
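A compact sketch of the corresponding LPD forward pass; `fwd` and `adj` are assumed callables applying A and A⊤ to batched tensors (e.g., odl operators wrapped for pytorch), and the tiny residual CNNs, without the extra memory channels of the original architecture, are illustrative simplifications:

```python
import torch
import torch.nn as nn

def conv_block(in_ch):
    # tiny CNN used for both the dual (phi) and primal (psi) updates
    return nn.Sequential(nn.Conv2d(in_ch, 32, 3, padding=1), nn.PReLU(),
                         nn.Conv2d(32, 1, 3, padding=1))

class LPD(nn.Module):
    """Sketch of the learned primal-dual network: phi_{theta_k} updates the dual
    variable from (u_k, y, A x_k); psi_{gamma_k} updates the primal variable from
    (x_k, A^T u_{k+1}). `fwd`/`adj` are placeholder operator callables."""
    def __init__(self, fwd, adj, n_iter=10):
        super().__init__()
        self.fwd, self.adj = fwd, adj
        self.dual_nets = nn.ModuleList([conv_block(3) for _ in range(n_iter)])    # phi_{theta_k}
        self.primal_nets = nn.ModuleList([conv_block(2) for _ in range(n_iter)])  # psi_{gamma_k}

    def forward(self, y, x0):
        x, u = x0, torch.zeros_like(y)
        for phi, psi in zip(self.dual_nets, self.primal_nets):
            u = u + phi(torch.cat([u, y, self.fwd(x)], dim=1))  # u_{k+1} = phi(u_k, y, A x_k)
            x = x + psi(torch.cat([x, self.adj(u)], dim=1))     # x_{k+1} = psi(x_k, A^T u_{k+1})
        return x
```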

Pros and cons of LPD (and unrolling in general)


• Pros: Fast and efficient reconstruction, data-efficient (generalizes well
with fewer training examples), parsimonious parameterization.
• Cons: Lack theoretical guarantees (unless parameterized and trained in a
specific fashion), training could be resource-intensive (due to the presence
of A and A⊤ in the architecture).
Schematic of the LPD network

Figure: A schematic diagram of the learned primal-dual (LPD) reconstruction network
for X-ray CT reconstruction.

Image source: Adler and Öktem, "Learned primal-dual reconstruction," IEEE-TMI, 2018.
Some key points about algorithm unrolling

• Unrolling exploits the modular structure of iterative optimization algorithms:
  only the component (proximal operator) related to the image prior (regularizer)
  is learned.
• The choice of the optimization algorithm dictates the architecture of the
reconstruction network.
• A trained unrolled network cannot generally be interpreted as a variational
  minimizer (although its architecture is inspired by an optimization
  algorithm). When trained with a squared-error loss, one can instead interpret
  the unrolled reconstruction as an approximation of the conditional mean of the
  true image x given its noisy measurement y (denoted E[x|y]).
Learned primal-dual (LPD) for low-dose CT

Image source: Adler and Öktem, "Learned primal-dual reconstruction," IEEE-TMI, 2018.
Bi-level learning: Supervised learning of a variational image reconstruction operator
Bi-level learning of regularizer parameters – 1
• In unrolling, the reconstructed image cannot be interpreted as a
minimizer of some variational energy function such as (1).
• Can we have a reconstruction operator that indeed corresponds to a
minimizer of a (learned) variational energy?

Bi-level learning of regularizer

  min_θ  g(θ) := (1/N) Σ_{i=1}^N ∥ x̂^{(i)}(θ; y^{(i)}) − x^{(i)} ∥₂²

  subject to  x̂^{(i)}(θ; y^{(i)}) ∈ arg min_x  f(y^{(i)}, A x) + R_θ(x) =: J_i(x, θ).

• Unlike unrolling, the reconstruction operator x̂^{(i)}(θ; y^{(i)}) in bi-level
  learning corresponds to the minimizer of a variational energy J_i(x, θ).
Reference: Crockett and Fessler, “Bilevel methods for image reconstruction," arXiv:2109.09610v2, 2021.
Bi-level learning of regularizer parameters – 2
For training, we need to differentiate the upper-level loss g(θ) w.r.t. the
learnable parameter θ. Thanks to the implicit function theorem, this can be
obtained as

  ∇g(θ) = (1/N) Σ_{i=1}^N ∂_θ x̂^{(i)}(θ; y^{(i)})⊤ ( x̂^{(i)}(θ; y^{(i)}) − x^{(i)} )

        = −(1/N) Σ_{i=1}^N ∇²_{xθ} J_i(x, θ)⊤ ∇²_{xx} J_i(x, θ)^{−1} ( x − x^{(i)} ),
          evaluated at x = x̂^{(i)}(θ; y^{(i)}).

• Computing the upper-level gradient w.r.t. θ needs derivatives of the lower-level
  loss w.r.t. both x and θ, evaluated at the true lower-level solution x̂^{(i)}(θ; y^{(i)}).
• However, ^x(i) (θ; y(i) ) is typically only approximated using an iterative
optimization algorithm, and therefore all these quantities are only known
approximately, leading to an inaccurate gradient ∇g(θ).
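A sketch of how the implicit gradient above can be computed in pytorch for a single training pair, using double backpropagation for Hessian-vector products and a conjugate-gradient solve; `lower_energy` and the single parameter tensor `theta` (with requires_grad=True) are illustrative assumptions:

```python
import torch

def hypergradient(x_hat, x_true, theta, lower_energy, cg_iters=20):
    """Implicit-function-theorem gradient of g(theta) = 0.5*||x_hat(theta) - x_true||^2
    for one training pair. `lower_energy(x, theta)` is assumed to return the
    lower-level energy J(x, theta); x_hat is an (approximate) minimizer of it."""
    x = x_hat.detach().requires_grad_(True)
    grad_x = torch.autograd.grad(lower_energy(x, theta), x, create_graph=True)[0]

    def hvp(v):  # Hessian-vector product (d^2 J / dx^2) v via double backprop
        return torch.autograd.grad(grad_x, x, grad_outputs=v, retain_graph=True)[0]

    # Conjugate gradient solve: v = (d^2 J / dx^2)^{-1} (x_hat - x_true)
    b = (x_hat - x_true).detach()
    v, r = torch.zeros_like(b), b.clone()
    p, rs_old = r.clone(), b.flatten() @ b.flatten()
    for _ in range(cg_iters):
        Hp = hvp(p)
        alpha = rs_old / (p.flatten() @ Hp.flatten() + 1e-12)
        v, r = v + alpha * p, r - alpha * Hp
        rs_new = r.flatten() @ r.flatten()
        if rs_new < 1e-12:
            break
        p, rs_old = r + (rs_new / rs_old) * p, rs_new

    # grad g(theta) = - (d^2 J / dtheta dx) v, via differentiating <grad_x J, v> w.r.t. theta
    mixed = torch.autograd.grad((grad_x * v).sum(), theta)[0]
    return -mixed
```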
Bi-level learning for image denoising

Figure: Bi-level learning of the regularizer (with two different parameterizations) for
image denoising. Panels: ground truth; noisy input (20.28 dB); fields-of-experts
regularizer (30.01 dB); smoothed TV (29.33 dB). (Source: Salehi et al.,
arXiv:2308.10098, 2023.)
Plug-and-play (PnP) algorithms: Using image denoisers to solve image reconstruction problems
Plug-and-play (PnP) algorithms

• Both unrolling and bi-level learning are supervised, i.e., they need (x^{(i)}, y^{(i)})
  pairs for training, which are difficult to obtain for practical problems. Moreover,
  if the imaging forward operator changes, one needs to retrain the networks.
How PnP methods work
1. PnP methods use an off-the-shelf image denoiser, which is typically
learned (modern PnP methods), but it can be model-driven too (classical
PnP methods).
2. Gaussian denoisers are image priors in disguise.

  Tweedie’s identity:   E[x* | x] − x = σ² ∇ log p_σ(x),

  where E[x* | x] is the (MMSE) denoiser D_σ(x) and ∇ log p_σ(x) is the gradient of
  the log-prior (the score).

• x*: clean image; x: noisy image, x = x* + Gaussian noise with variance σ²; p_σ:
  probability density of x.
Two variants of PnP: RED-PnP and proximal PnP – 1
Regularization by denoising (RED)
Recall the MAP estimation problem given by

  min_{x∈ℝⁿ}  f(y_δ, A x) − λ log p(x).        (2)

Gradient descent for (2) takes the form

  x_{k+1} = x_k − η ∇f(y_δ, A x_k) + η λ ∇ log p(x_k).

For a sufficiently small σ, we can approximate ∇ log p(x) ≈ ∇ log p_σ(x), and
then Tweedie’s identity leads to

  x_{k+1} = x_k − η ∇f(y_δ, A x_k) + (η λ / σ²) ( D_σ(x_k) − x_k ).

The RED-PnP algorithm:  x_{k+1} = x_k − η ∇f(y_δ, A x_k) + η γ ( D_σ(x_k) − x_k ),
with γ absorbing λ/σ².

Note: Although a practical Gaussian denoiser is not always the gradient of an
underlying potential, the RED algorithm works surprisingly well in practice.
Reference: Reehorst and Schniter, “Regularization by denoising: clarifications and new interpretations,”
IEEE-TCI, 2019.
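A minimal sketch of the RED-PnP update above for the quadratic fidelity f(y, Ax) = ½∥Ax − y∥₂²; `fwd`, `adj`, and the pretrained `denoiser` are placeholder callables on image tensors:

```python
import torch

def red_pnp(y, fwd, adj, denoiser, eta=1e-3, gamma=1.0, n_iter=200, x0=None):
    """RED-PnP: x_{k+1} = x_k - eta*grad f(y, A x_k) + eta*gamma*(D_sigma(x_k) - x_k),
    with f(y, Ax) = 0.5*||A x - y||^2. `fwd`/`adj` apply A and A^T; `denoiser` is a
    pretrained Gaussian denoiser (all placeholder callables)."""
    x = adj(y) if x0 is None else x0
    for _ in range(n_iter):
        with torch.no_grad():
            grad_f = adj(fwd(x) - y)  # gradient of the data fidelity
            x = x - eta * grad_f + eta * gamma * (denoiser(x) - x)
    return x
```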
Two variants of PnP: RED-PnP and proximal PnP – 2
Proximal PnP
Recall proximal gradient descent:

xk+1 = proxηg (xk − η ∇f (xk )) .

For widely used regularizers such as g(x) = ∥x∥1 , the proximal operator can
essentially be interpreted as a denoiser.
Idea: Replace the proximal operators in proximal algorithms with
off-the-shelf denoisers.
• Can get the PnP variants of algorithms such as PGD, ADMM, etc. by
replacing the proximal operators with denoisers.
• First train a denoiser D_σ^{(θ)} to remove Gaussian noise:

    min_θ  Σ_{i=1}^n ∥ D_σ^{(θ)}(x_i + σ_i ε_i) − x_i ∥₂² ,
    where σ_i ∼ uniform[0, σ] and ε_i ∼ N(0, I).

• Plug it inside a proximal algorithm (see the sketch below).
• The denoiser is trained independently of the imaging operator.
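A minimal sketch of the resulting PnP iteration for proximal gradient descent with the quadratic fidelity; `fwd`, `adj`, and the pretrained `denoiser` are again placeholder callables defined elsewhere:

```python
import torch

def pnp_pgd(y, fwd, adj, denoiser, eta=1e-3, n_iter=200, x0=None):
    """Proximal PnP with gradient descent: the proximal step prox_{eta g} is
    replaced by a pretrained Gaussian denoiser (placeholder callable)."""
    x = adj(y) if x0 is None else x0
    for _ in range(n_iter):
        with torch.no_grad():
            x = x - eta * adj(fwd(x) - y)  # gradient step on 0.5*||A x - y||^2
            x = denoiser(x)                # "prox" step: plug in the denoiser
    return x
```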
PnP for compressive image recovery

Figure: Compressive image recovery (with 20% subsampling) using PnP vis-à-vis
unrolling (ISTA-Net+). The PnP (AR) method uses a problem-dependent
artifact-removal (AR) operator, while PnP (denoising) uses a pre-trained Gaussian
denoiser.

Image source: Kamilov et al., "Plug-and-play methods for integrating physical and learned models in
computational imaging: Theory, algorithms, and applications," IEEE Signal Processing Magazine, 2023.
Learning a direct regularizer using deep neural networks
Learning an explicit regularizer
• There is a class of techniques that learn a direct regularization function
parameterized by a neural network (denoted as Rθ , where θ represents the
parameters of the network modeling the regularizer).
• The regularizer is generally learned independently of the imaging operator A
  and then plugged into the variational framework, which is then minimized
  for reconstruction.
• We will learn about two specific techniques for learning explicit
regularization functions.
1. Adversarial regularization (AR): Uses ideas from optimal transport.
2. Network Tikhonov (NETT): Uses an encoder-decoder-based approach to learn
a direct regularizer.

• The main advantage of these frameworks is that the training is unsupervised,
  i.e., one does not need (measurement, ground-truth) pairs to train the
  regularizer. This makes these approaches generalizable to different imaging
  operators (at least in principle).
Adversarial regularization (AR)
The key idea is to train a regularizer such that it is small on clean
(ground-truth) images and large on images with artifacts.

L(θ) := Ex∼πx [Rθ (x)] − Ez∼πz [Rθ (z)] subject to Rθ being 1-Lipschitz. (3)

• Here, π_x and π_z denote the distributions of clean and artifact-ridden images.
• Rθ is a (convolutional) neural network with learnable weights and biases.
• The training dataset consists of a set of clean images (x_i)_{i=1}^{N₁} drawn from π_x
  and a set of images (z_i)_{i=1}^{N₂} that can be obtained using a cheap model-based
  approach (e.g., the pseudo-inverse solution z_i = A† y_i).
• The clean images and the artifact-ridden images do not have to correspond
to each other, alleviating the need for strict supervision.
• The 1-Lipschitz condition on the regularizer makes the training problem
well-posed and leads to nice theoretical interpretations of (3) and the
resulting variational problem using concepts from optimal transport.
Reference: S. Lunz, O. Öktem, and C.-B. Schönlieb, “Adversarial regularizers in inverse problems,”
NeurIPS-2018.
Training and reconstruction for AR
Training an AR
• Input: A penalty parameter λgp (to enforce the Lipschitz constraint), initial
value of the network parameter(s) θ(0) .
• for mini-batches m = 1, 2, · · · , do (until convergence):
  ◦ Sample x_j ∼ π_x, z_j ∼ π_z, and ε_j ∼ uniform[0, 1], for 1 ⩽ j ⩽ n_b, where n_b is the
    mini-batch size.
  ◦ Compute x_j^{(ε)} = ε_j x_j + (1 − ε_j) z_j.
  ◦ Compute the training loss L(θ) for the m-th mini-batch:

      L(θ) = (1/n_b) Σ_{j=1}^{n_b} R_θ(x_j) − (1/n_b) Σ_{j=1}^{n_b} R_θ(z_j)
             + λ_gp · (1/n_b) Σ_{j=1}^{n_b} ( ∥∇R_θ(x_j^{(ε)})∥₂ − 1 )².

  ◦ Update θ^{(m)} = Adam-optimizer( θ^{(m−1)}, ∇_θ L(θ^{(m−1)}) ).
• Output: The trained network with parameters θ^{(m)} = θ*.

Reconstruction using an AR
  min_x  ∥ y_δ − A x ∥₂² + λ R_{θ*}(x)
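A sketch of one mini-batch update implementing the training loss above in pytorch; the regularizer network `R`, the optimizer `opt`, and the data batches are assumed to exist elsewhere:

```python
import torch

def ar_training_step(R, opt, x_clean, z_artifact, lambda_gp=10.0):
    """One mini-batch update for an adversarial regularizer R (scalar-output CNN):
    L(theta) = mean R(x) - mean R(z) + lambda_gp * mean (||grad R(x_eps)||_2 - 1)^2,
    with x_eps a random convex combination of clean and artifact-ridden images."""
    eps = torch.rand(x_clean.shape[0], 1, 1, 1, device=x_clean.device)
    x_eps = (eps * x_clean + (1 - eps) * z_artifact).requires_grad_(True)

    # gradient penalty enforcing the (soft) 1-Lipschitz constraint
    grad = torch.autograd.grad(R(x_eps).sum(), x_eps, create_graph=True)[0]
    grad_norm = grad.flatten(1).norm(dim=1)
    loss = (R(x_clean).mean() - R(z_artifact).mean()
            + lambda_gp * ((grad_norm - 1) ** 2).mean())

    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```

Reconstruction then fixes θ* and minimizes ∥y_δ − Ax∥₂² + λ R_{θ*}(x) over x, e.g., with gradient descent driven by autograd.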
Introducing convexity in the regularizer
• If the regularizer is convex in its input, the overall variational problem is a
  convex optimization problem ⇒ efficient solvers and theoretical guarantees.
• Such a regularizer can be constructed by composing simple convex functions.

Figure: Adversarial convex regularizer: The blue rectangles indicate convolutional
layers with non-negative weights, and the orange rectangles represent standard
convolutional layers (with no restrictions on the weights). The gray triangle at the
end denotes an average-pooling operation. The (pointwise) nonlinear activation φ needs
to be convex and monotonically increasing to preserve convexity. This requirement is
already satisfied by activations such as ReLU or leaky-ReLU.

• The architecture above is the so-called input-convex neural network (ICNN),
  first proposed by Amos et al. (ICML-2017); a minimal pytorch sketch follows below.
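A minimal sketch of such an input-convex regularizer; the layer sizes and the use of leaky-ReLU are illustrative assumptions, not the exact architecture from the figure:

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SimpleICNN(nn.Module):
    """Sketch of an input-convex CNN regularizer R_theta(x). Convexity in x holds because
    the skip ("orange") convolutions act directly on x (affine in x), the feature-path
    ("blue") convolutions use non-negative weights, leaky-ReLU is convex and
    non-decreasing, and the final average pooling is a non-negative linear map."""
    def __init__(self, channels=32, n_layers=3):
        super().__init__()
        self.skip = nn.ModuleList(
            [nn.Conv2d(1, channels, 5, padding=2) for _ in range(n_layers)])
        self.nonneg = nn.ModuleList(
            [nn.Conv2d(channels, channels, 5, padding=2, bias=False)
             for _ in range(n_layers - 1)])

    def forward(self, x):
        z = F.leaky_relu(self.skip[0](x))
        for conv_z, conv_x in zip(self.nonneg, self.skip[1:]):
            # clamping enforces non-negative weights on the feature path
            z = F.leaky_relu(F.conv2d(z, conv_z.weight.clamp(min=0), padding=2)
                             + conv_x(x))
        return z.mean(dim=(1, 2, 3))  # global average pooling -> one scalar per image
```

In practice one would also project the feature-path weights onto the non-negative orthant after every optimizer step; clamping them in the forward pass, as above, is simply the shortest way to keep the sketch convex by construction.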
Performance on sparse-view CT reconstruction

Figure: Sparse-view CT reconstruction: Comparison of different reconstruction
methods in terms of the PSNR and SSIM of the reconstructed image. Panels:
ground truth; FBP (21.63 dB, 0.24); TV (29.25 dB, 0.79); LPD (33.62 dB, 0.89);
AR (31.83 dB, 0.84); ACR (30.00 dB, 0.82).
Performance on limited-angle CT reconstruction

Panels: Ground truth; FBP: 21.61 dB, 0.17; TV: 25.74 dB, 0.80; LPD: 29.51 dB, 0.85;
AR: 26.83 dB, 0.71; ACR: 27.98 dB, 0.84.
Figure: Limited-angle CT reconstruction, along with the respective PSNR and SSIM
values. In this case, ACR outperforms TV and AR in terms of reconstruction quality.
LPD produces incorrect structures not present in the ground truth (unacceptable for
clinical applications).

Reference: SM et al., “Learned convex regularizers for inverse problems,” arXiv:2008.02839v2, 2020.
Some observations

• Learning the regularizer works better than handcrafting it.


• Unrolling-based (supervised) methods perform the best if well-trained,
but can lead to undesirable artifacts in the reconstruction for severely
ill-conditioned imaging operators (such as limited-angle CT).
• When the forward operator is less ill-posed (e.g., in sparse-view CT),
introducing convexity in the regularizer reduces its expressive power,
leading to degradation in image quality.
• When the problem is more severely ill-posed (e.g., in limited-angle CT),
the inductive bias of convexity helps achieve better reconstruction.
• More research is needed to bridge the gap between good numerical
performance and interpretability.
Training the regularizer using the NETT approach
• Parameterize the regularizer as R_θ(x) = Σ_i β_i |[ϕ_θ(x)]_i|^q, where β_i > 0 and
  ϕ_θ is a deep neural network.

Figure: Training a NETT regularizer (reference: Li et al., “Solving inverse
problems with deep neural networks,” Inverse Problems, 2020).
• Training data: (image, artifact) pairs {(x_j, r_j)}_{j=1}^{n₁+n₂}. For j = 1, …, n₁,
  r_j = x_j − A† y_j, and for j = n₁ + 1, …, n₁ + n₂, r_j = 0.
• Training loss: J(ν, θ) = Σ_{j=1}^{n₁+n₂} loss( Ψ_ν(ϕ_θ(x_j)), r_j )  ⇒ the regularizer
  takes a small (large) value when the image is clean (artifact-ridden).
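A short sketch of evaluating such a regularizer, assuming a trained encoder network playing the role of ϕ_θ and a positive weight tensor β are given (placeholder names):

```python
import torch

def nett_regularizer(x, encoder, beta, q=2):
    """NETT-style regularizer R_theta(x) = sum_i beta_i * |[phi_theta(x)]_i|^q.
    `encoder` stands in for phi_theta (trained elsewhere); `beta` is a tensor of
    positive weights broadcastable to the encoder output."""
    features = encoder(x)
    return (beta * features.abs() ** q).sum()

# Reconstruction then minimizes f(y, A x) + lambda * nett_regularizer(x, ...) over x,
# e.g., by gradient descent on x using torch autograd.
```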
Summary and key takeaways
• The modularity of the variational framework and proximal algorithms for
minimization opens up the possibility of incorporating imaging physics
within data-driven approaches.
• Modeling the proximal operator with a neural network and learning it in a
data-adaptive fashion enables data-driven regularization, which
significantly outperforms model-driven (analytical/hand-crafted)
regularizers.
• Supervised techniques such as algorithm unrolling generally perform better
  empirically than unsupervised approaches, but they do not easily admit a
  variational interpretation.
• Unsupervised approaches are more flexible in terms of requirements on
the training data.
• Regularizers can be learned implicitly (e.g., using a denoiser) or in a more
direct fashion (e.g., adversarial regularization).
