
Vishal Monga, Yuelong Li, and Yonina C. Eldar

Algorithm Unrolling
Interpretable, efficient deep learning for signal and image processing

Deep neural networks provide unprecedented performance gains in many real-world problems in signal and image processing. Despite these gains, the future development and practical deployment of deep networks are hindered
by their black-box nature, i.e., a lack of interpretability and the
need for very large training sets. An emerging technique called
algorithm unrolling, or unfolding, offers promise in eliminat-
ing these issues by providing a concrete and systematic connec-
tion between iterative algorithms that are widely used in signal
processing and deep neural networks. Unrolling methods were
first proposed to develop fast neural network approximations
for sparse coding. More recently, this direction has attracted
enormous attention, and it is rapidly growing in both theoretic
investigations and practical applications. The increasing popu-
larity of unrolled deep networks is due, in part, to their potential
in developing efficient, high-performance (yet interpretable)
network architectures from reasonably sized training sets.
In this article, we review algorithm unrolling for signal and
image processing. We extensively cover popular techniques for
algorithm unrolling in various domains of signal and image
processing, including imaging, vision and recognition, and
speech processing. By reviewing previous works, we reveal the
connections between iterative algorithms and neural networks
and present recent theoretical results. Finally, we provide a
discussion on the current limitations of unrolling and suggest
possible future research directions.

Introduction
The past decade has witnessed a deep learning revolution. The
availability of large-scale training data sets, which is often facilitated
by Internet content; the accessibility of powerful computational
resources thanks to breakthroughs in microelectronics; and
advances in neural network research, such as the development of
effective network architectures and efficient training algorithms,
have resulted in the unprecedented success of deep learning in
innumerable applications of computer vision, pattern recognition,
and speech processing. For instance, deep learning has provided
significant accuracy gains in image recognition, which is one of the core tasks in computer vision. Groundbreaking performance improvements have been demonstrated via AlexNet [1], and fewer classification errors than human-level performance [2] were reported for the ImageNet data set [3].

Digital Object Identifier 10.1109/MSP.2020.3016905


Date of current version: 24 February 2021

In the realm of signal processing, learning-based approaches provide an interesting algorithmic alternative to traditional model-based analytic methods. In contrast to conventional iterative approaches, where the models and priors are typically designed by analyzing the physical processes and through handcrafting, deep learning approaches attempt to automatically discover model information and incorporate the data by optimizing network parameters that are gleaned from real-world training samples. Modern neural networks generally adopt a hierarchical architecture composed of many layers, include a large number of parameters (possibly millions), and are thus capable of learning complicated mappings that are difficult to design explicitly. When the training data are sufficient, this adaptivity enables deep networks to often overcome model inadequacies, especially when the underlying physical scenario is hard to characterize precisely.

Another advantage of deep networks is that, during inference, processing through the network layers can be executed very fast. Many modern computational platforms are highly optimized toward special operations, such as convolutions, so inference via deep networks is usually quite computationally efficient. In addition, the number of layers in a deep network is typically much smaller than the number of iterations required in an iterative algorithm. Therefore, deep learning methods have emerged to offer desirable computational benefits over state-of-the-art approaches in many areas of signal processing, imaging, and vision. Their popularity has reached new heights through the widespread availability of the supporting software infrastructure required for their implementation.

That said, a vast majority of deep learning techniques are purely data driven, and the underlying structures are difficult to interpret. Previous works largely apply general network architectures (some of them are covered in the "Conventional Neural Networks" section) to different problems and learn certain underlying mappings, such as classification and regression functions, completely through end-to-end training. It is therefore hard to discover what
is learned inside the networks by examining the network
parameters, which are usually of high dimensionality, and
what the roles are of individual parameters. In other words,
generic deep networks are typically difficult to interpret. In
contrast, traditional iterative algorithms are usually highly
interpretable because they are developed via modeling the
physical processes underlying the problem and/or capturing
prior domain knowledge.
Interpretability is, of course, an important concern both
in theory and in practice. It is usually the key to conceptual
understanding and advancing the frontiers of knowledge.
Moreover, in areas such as medical applications and autono-
mous driving, it is crucial to identify the limitations and the
potential failure cases of designed systems, where interpret-
ability plays a fundamental role. Thus, a lack of interpretabil-
ity can be a serious constraint on conventional deep learning
methods, in contrast with model-based techniques that have
iterative algorithmic solutions, which are used widely in sig-
nal processing.
An issue that frequently arises together with interpret-
ability is generalizability. It is well known that the practi-
cal success of deep learning is sometimes overly dependent
on the quantity and quality of the available training data.
In scenarios where abundant, high-quality training samples
are unavailable, such as medical imaging [4], [5] and 3D
reconstruction [6], the performance of deep networks may
degrade significantly, and the results may even be worse
than those from traditional approaches. This phenomenon,
formally called overfitting in machine learning terminology,
is largely due to the employment of generic neural networks
that are associated with a huge number of parameters. Without
explicitly exploiting domain knowledge beyond the time
and/or space invariance, such networks are highly under-
regularized and may severely overfit, even when there are
heavy data augmentations.
To some extent, this problem has recently been addressed via
the design of domain-enriched and prior-information-guided

deep networks [7]–[11]. In such cases, the network architecture is designed to contain special layers that are domain specific, such as the transform layer in [8]. In other instances, the prior structure of the expected output is exploited [8], [9] to develop training robustness. An excellent tutorial article covering these issues is [12]. Despite these achievements, transferring domain knowledge to network parameters can be a nontrivial task, and effective priors may be hard to design. More importantly, the underlying network architectures in these approaches largely remain consistent with conventional neural networks. Therefore, the pursuit of interpretable, generalizable, and high-performance deep architectures for signal processing problems remains a hugely important open challenge.

In the seminal work of Gregor and LeCun [13], a promising technique called algorithm unrolling (or unfolding) was developed that has helped connect iterative algorithms, such as those for sparse coding, to neural network architectures. Following this article, the past few years have seen a surge of efforts that unroll iterative algorithms for many significant problems in signal and image processing; examples include (but are not limited to) compressive sensing (CS) [14], deconvolution [15], and variational techniques for image processing [16]. Figure 1 provides a high-level illustration of this framework. Specifically, each iteration of the algorithm is represented as one layer of the network. Concatenating these layers forms a deep neural network. Passing through the network is equivalent to executing the iterative algorithm a finite number of times. In addition, the algorithm parameters (such as the model parameters and the regularization coefficients) transfer to the network parameters. The network may be trained using backpropagation, resulting in model parameters that are learned from real-world training sets. In this way, the trained network can be naturally interpreted as a parameter-optimized algorithm, effectively overcoming the lack of interpretability in most conventional neural networks.

Traditional iterative algorithms generally entail significantly fewer parameters compared with popular neural networks. Therefore, unrolled networks are highly parameter efficient and require fewer training data. In addition, unrolled networks naturally inherit prior structures and domain knowledge rather than learn that information from intensive training data. Consequently, they tend to generalize better than generic networks, and they can be computationally faster as long as each algorithmic iteration (or the corresponding layer) is not overly expensive.

In this article, we review the foundations of algorithm unrolling. Our goal is to provide readers with guidance on how to utilize unrolling to build efficient and interpretable neural networks for solving signal and image processing problems. After providing a tutorial on how to unroll iterative algorithms into deep networks, we extensively review selected applications of algorithm unrolling in a wide variety of signal and image processing domains. We also review general theoretical studies that shed light on the convergence properties of these networks, although further analysis is an important problem for future research. In addition, we clarify the connections between the general classes of traditional iterative algorithms and deep neural networks established through algorithm unrolling. We contrast algorithm unrolling with alternative approaches and discuss their strengths and limitations. Finally, we discuss open challenges and suggest future research directions.


FIGURE 1. A high-level overview of algorithm unrolling. Given (a) an iterative algorithm, (b) a corresponding deep network can be generated by cascading the algorithm's iterations h. The iteration step h in (a) is executed a number of times, resulting in the network layers h^1, h^2, ... in (b). Each iteration h depends on algorithm parameters θ, which are transferred into network parameters θ^1, θ^2, .... Instead of determining these parameters through cross-validation and analytical derivations, we learn θ^1, θ^2, ... from training data sets through end-to-end training. In this way, the resulting network could achieve better performance than the original iterative algorithm. In addition, the network layers naturally inherit interpretability from the iteration procedure. The learnable parameters are colored in blue.

Generating interpretable networks through algorithm unrolling

We begin by describing algorithm unrolling. To motivate the unrolling approach, we commence with a brief review of conventional neural network architectures in the "Conventional Neural Networks" section. We next discuss the first unrolling technique for sparse coding in the "Unrolling Sparse Coding Algorithms into Deep Networks" section. We elaborate on general forms of unrolling in the "Algorithm Unrolling in General" section.

Conventional neural networks

In early neural network research, the multilayer perceptron (MLP) was a popular choice. This architecture can be motivated either biologically, by mimicking the human recognition system, or algorithmically, by generalizing the perceptron algorithm to multiple layers. A diagram of an MLP is provided in Figure 2(a). The network is constructed through recursive linear and nonlinear operations, which are called layers. The units that those operations act upon are known as neurons, an analogy to the neurons in the human nervous system. The first and last layers are called the input layer and the output layer, respectively.

A salient property of this network is that each neuron is fully connected to every neuron in the previous layer, except for the input layer. The layers are thus commonly referred to as fully connected. Analytically, in the lth layer, the relationship between the neurons x_j^l and x_i^{l+1} is expressed as

x_i^{l+1} = \sigma\Big( \sum_j W_{ij}^{l+1} x_j^l + b_i^{l+1} \Big),   (1)

where W^{l+1} and b^{l+1} are the weights and the biases, respectively, and σ is a nonlinear activation function. We omit drawing activation functions and biases in Figure 2(a) for brevity. Popular choices of activation functions include the logistic function and the hyperbolic tangent function. During recent years, they have been superseded by rectified linear units (ReLUs) [17], defined by

ReLU(x) = \max\{x, 0\}.

The W's and b's are generally trainable parameters that are learned from data sets through training, during which backpropagation [18] is often employed for gradient computation.

Today, MLPs are rarely seen in practical imaging and vision applications. The fully connected nature of MLPs contributes to a rapid increase in the number of parameters, making training difficult. To address this limitation, Fukushima et al. [19] designed a neural network by mimicking the visual nervous system [20]. The neuron connections are restricted to local neighbors, and weights are shared across different spatial locations. The linear operations then become convolutions (or correlations, in a strict sense), and thus the networks employing such localizing structures are generally called convolutional neural networks (CNNs). A representation of a CNN can be seen in Figure 2(b). With a significantly reduced parameter dimensionality, training deeper networks becomes feasible. While CNNs were first applied to digit recognition, their translation invariance is a desirable property in many computer vision tasks. Hence, CNNs have become an extremely popular and indispensable architecture in imaging and vision and outperform traditional approaches by a large margin in many domains. They continue to exhibit the best performance in many applications.

FIGURE 2. Conventional neural network architectures that are popular in signal/image processing and computer vision applications. (a) An MLP, where all the neurons are fully connected. (b) A CNN, where the neurons are sparsely connected and the weights are shared among different neurons. Therefore, the weight matrices W^l, l = 1, 2, ..., L effectively become convolution operators. (c) An RNN, where the inputs x^l, l = 1, 2, ..., L are fed in sequentially and the parameters U, V, and W are shared across different time steps.

In domains such as speech recognition and video processing, where data are obtained sequentially, recurrent neural networks (RNNs) [21] are a popular choice. RNNs explicitly model the data dependence in different time steps in the sequence and scale well to sequences with varying lengths. A depiction of RNNs appears in Figure 2(c). Given the previous
hidden state s^{l-1} and the input variable x^l, the next hidden state s^l is computed as

s^l = \sigma_1( W s^{l-1} + U x^l + b ),

while the output variable o^l is generated by

o^l = \sigma_2( V s^l + b ).

Here U, V, W, and b are trainable network parameters, and σ_1 and σ_2 are activation functions. We again omit the activation functions and biases in Figure 2(c). In contrast to MLPs and CNNs, where the layer operations are applied recursively in a hierarchical representation fashion, RNNs apply the recursive operations as the time step evolves. A distinctive property of RNNs is that the parameters U, V, and W are shared across all the time steps rather than varying from layer to layer. Training RNNs can thus be difficult, as the gradients of the parameters may either explode or vanish.

Unrolling sparse coding algorithms into deep networks

The earliest work in algorithm unrolling dates back to the paper by Gregor et al. on improving the computational efficiency of sparse coding algorithms through end-to-end training [13]. In particular, the authors discussed how to improve the efficiency of the iterative shrinkage and thresholding algorithm (ISTA), which is one of the most popular approaches in sparse coding. The crux of this article is summarized in Figure 3 and detailed in "Learned Iterative Shrinkage and Thresholding Algorithm." Each iteration of the ISTA constitutes one linear operation followed by a nonlinear soft-thresholding operation, which mimics the ReLU activation function. A diagram of one iteration step reveals the ISTA's resemblance to a single network layer. Thus, one can form a deep network by mapping each iteration to a network layer and stacking the layers together, which is equivalent to executing an ISTA iteration multiple times. Because the same parameters are shared across different layers, the resulting network resembles an RNN in terms of architecture. In recent studies [22]–[24], different parameters are employed in various layers, as we discuss in the "Select Theoretical Studies" section.

After unrolling the ISTA into a network, the network is trained using training samples through backpropagation. The learned network is dubbed the LISTA (which stands for "learned ISTA"). It turns out that significant computational benefits can be obtained by learning from real data. For instance, Gregor et al. [13] experimentally verified that a learned network reaches a specific performance level roughly 20 times faster than an accelerated ISTA. Consequently, the sparse coding problem can be efficiently solved by passing through a compact LISTA network.


FIGURE 3. The LISTA. One iteration of the ISTA executes a linear operation and then a nonlinear one and thus can be recast into a network layer; by stacking the layers together, a deep network is formed. The network is subsequently trained using paired inputs and outputs by backpropagation to optimize the parameters W_e, W_t, and λ; μ is a constant parameter that controls the step size of each iteration. The trained network, a LISTA, is computationally more efficient compared with the original ISTA. The trainable parameters in the network are colored in blue. For details, see "Learned Iterative Shrinkage and Thresholding Algorithm." In practice, W_e, W_t, and λ may vary in each layer. (a) An ISTA. (b) A single network layer. (c) An unrolled deep network.

From a theoretical perspective, recent studies [23], [24] have characterized the linear convergence rate of the LISTA and further verified its computational advantages in a rigorous and quantitative manner. A more detailed exposition and discussion of related theoretical studies will be provided in the "Select Theoretical Studies" section. In addition to the ISTA, Gregor et al. discussed unrolling and optimizing another sparse coding method, the coordinate descent (CoD) algorithm [25]. The technique behind, and the implications of, unrolled CoD are largely similar to those of the LISTA.

Algorithm unrolling in general

Although the work by Gregor et al. [13] focused on improving the computational efficiency of sparse coding, the same techniques can be applied to general iterative algorithms. An illustration is given in Figure 4. In general, the algorithm repetitively performs certain analytic operations, which we represent abstractly as the function h. Similar to the LISTA, we unroll the algorithm into a deep network by mapping each iteration into a single network layer and stacking a finite number of layers together. Each iteration step of the algorithm contains parameters, such as the model parameters and the regularization coefficients, which we denote by vectors θ^l, l = 0, ..., L - 1. Through unrolling, the parameters θ^l correspond to those of the deep network, and they can be optimized for real-world scenarios by training the network end to end using real data sets.

While the motivation behind the LISTA was computational savings, the proper use of algorithm unrolling can also lead to dramatically improved performance in practical applications.

Learned Iterative Shrinkage and Thresholding Algorithm

The pursuit of parsimonious representations of signals has been a problem of enduring interest in signal processing. One of the most common quantitative manifestations of this is the well-known sparse coding problem [S1]. Given an input vector y ∈ R^n and an overcomplete dictionary W ∈ R^{n×m} with m > n, sparse coding refers to the pursuit of a sparse representation of y using W. In other words, we seek a sparse code x ∈ R^m such that y ≈ Wx while encouraging as many coefficients as possible in x to be zero (or small in magnitude). A common approach to determine x is to solve the following convex optimization problem:

\min_{x \in \mathbb{R}^m} \frac{1}{2} \| y - Wx \|_2^2 + \lambda \| x \|_1,   (S1)

where λ > 0 is a regularization parameter that controls the sparseness of the solution.

A popular method for solving (S1) is the iterative shrinkage and thresholding algorithm (ISTA) family [28]. In its simplest form, the ISTA performs the following iterations:

x^{l+1} = \mathcal{S}_\lambda \Big\{ \Big( I - \tfrac{1}{\mu} W^T W \Big) x^l + \tfrac{1}{\mu} W^T y \Big\}, \quad l = 0, 1, \ldots,   (S2)

where I ∈ R^{m×m} is the identity matrix, μ is a positive parameter that controls the iteration step size, and S_λ(·) is the soft-thresholding operator defined elementwise as

\mathcal{S}_\lambda(x) = \mathrm{sign}(x) \cdot \max\{ |x| - \lambda, 0 \}.   (S3)

Basically, the ISTA is equivalent to a gradient step on ||y - Wx||_2^2 followed by a projection onto the ℓ1 ball.

As depicted in Figure 3, the iteration (S2) can be recast into a single network layer. This layer comprises a series of analytic operations (matrix–vector multiplication, summation, and soft thresholding), which are of the same nature as those in a neural network. Executing the ISTA L times can be interpreted as cascading L such layers together, which essentially forms an L-layer deep network. In the unrolled network, an implicit substitution of parameters has been made: W_t = I - (1/μ) W^T W and W_e = (1/μ) W^T. While these substitutions generalize the original parametrization and expand the representation power of the unrolled network, recent theoretical studies [24] suggest that they may be inconsequential in an asymptotic sense, as the optimal network parameters asymptotically admit a weight coupling scheme.

The unrolled network is trained using real data sets to optimize the parameters W_t, W_e, and λ. The learned ISTA (LISTA) may achieve higher efficiency compared to the ISTA. It is also useful when W is not exactly known. Training is performed through a sequence of vectors y_1, y_2, ..., y_N ∈ R^n and their corresponding ground-truth sparse codes x*_1, x*_2, ..., x*_N. By feeding each y_n, n = 1, ..., N, into the network, we retrieve its output x̂_n(y_n; W_t, W_e, λ) as the predicted sparse code for y_n. Comparing the output with the ground-truth sparse code x*_n, the network training loss function is formed as

\ell(W_t, W_e, \lambda) = \frac{1}{N} \sum_{n=1}^{N} \big\| \hat{x}_n(y_n; W_t, W_e, \lambda) - x^*_n \big\|_2^2,   (S4)

and the network is trained through loss minimization, using popular gradient-based learning techniques, such as stochastic gradient descent [18], to learn W_t, W_e, and λ. It has been empirically shown that the number of layers L in the (trained) LISTA can be an order of magnitude smaller than the number of iterations required for the ISTA [13] to achieve convergence for a new observed input.

Reference
[S1] Y. C. Eldar and G. Kutyniok, Compressed Sensing: Theory and Applications. Cambridge, U.K.: Cambridge Univ. Press, 2012.
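To make the construction concrete, the following is a minimal sketch of a LISTA-style unrolled network written in PyTorch. The framework choice, the class name LISTA, and the layer-wise untied parameters are illustrative assumptions on our part (the article prescribes no particular implementation); the update mirrors (S2) with learnable W_t, W_e, and λ, and soft thresholding is written out explicitly so that λ remains trainable.

```python
import torch
import torch.nn as nn

class LISTA(nn.Module):
    """Unrolled ISTA: each layer computes x <- S_lambda(W_t x + W_e y), cf. (S2)."""
    def __init__(self, n, m, num_layers=16, tied=False):
        super().__init__()
        self.num_layers = num_layers
        self.tied = tied
        L = 1 if tied else num_layers  # tied = share parameters across layers (RNN-like)
        self.W_e = nn.ParameterList([nn.Parameter(0.01 * torch.randn(m, n)) for _ in range(L)])
        self.W_t = nn.ParameterList([nn.Parameter(0.01 * torch.randn(m, m)) for _ in range(L)])
        self.lam = nn.ParameterList([nn.Parameter(torch.tensor(0.1)) for _ in range(L)])

    @staticmethod
    def soft_threshold(v, lam):
        # Elementwise soft thresholding S_lambda from (S3), kept differentiable in lam.
        return torch.sign(v) * torch.clamp(v.abs() - lam, min=0.0)

    def forward(self, y):
        # y: (batch, n) measurements; returns x: (batch, m) estimated sparse codes.
        x = y.new_zeros(y.shape[0], self.W_e[0].shape[0])
        for l in range(self.num_layers):
            i = 0 if self.tied else l
            x = self.soft_threshold(x @ self.W_t[i].T + y @ self.W_e[i].T, self.lam[i])
        return x

# Training sketch following (S4): given measurements Y and ground-truth codes X_star,
# minimize the mean squared error of the predicted codes.
# model = LISTA(n=128, m=256, num_layers=8)
# opt = torch.optim.Adam(model.parameters(), lr=1e-3)
# loss = ((model(Y) - X_star) ** 2).sum(dim=1).mean(); loss.backward(); opt.step()
```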

For instance, we can employ backpropagation to obtain coefficients of filters [15] and dictionaries [26] that are hard to design analytically or even by handcrafting. In addition, custom modifications may be employed in the unrolled network [14]. As a particular example, in the LISTA (see "Learned Iterative Shrinkage and Thresholding Algorithm"), the matrices W_t and W_e may be learned in each iteration so that they are no longer held fixed throughout the network. Furthermore, their values may vary across different layers rather than being shared. By allowing a slight departure from the original iterative algorithms [13], [27] and extending the representation capacity, the performance of the unrolled networks may be significantly boosted.

Compared with conventional generic neural networks, unrolled networks generally contain significantly fewer parameters, as they encode domain knowledge through unrolling. In addition, their structures are more specifically tailored to target applications. These benefits not only ensure higher efficiency but also provide better generalizability, especially under limited training schemes [14]. More concrete examples will be presented and discussed in the "Unrolling in Signal and Image Processing Problems" section.

Unrolling in signal and image processing problems

Algorithm unrolling has been applied to diverse application areas during the past few years. Table 1 summarizes representative methods and their topics of focus in different domains. Evidently, research in algorithm unrolling is growing and influencing a variety of high-impact, real-world problems and research areas. As discussed in the "Generating Interpretable Networks Through Algorithm Unrolling" section, an essential element of each unrolling approach is the underlying iterative algorithm that the technique starts from, which we also specify in Table 1.


FIGURE 4. The general idea of algorithm unrolling. Starting with an abstract iterative algorithm, we map one iteration (described as the function h parametrized by θ^l, l = 0, ..., L - 1) into a single network layer and stack a finite number of layers together to form a deep network. Feeding the data forward through an L-layer network is equivalent to executing the iteration L times (finite truncation). The parameters θ^l, l = 0, 1, ..., L - 1 are learned from real data sets by training the network end to end to optimize the performance. The parameters can either be shared across different layers or vary from layer to layer. The trainable parameters are colored in blue. (a) An iterative algorithm. (b) An unrolled deep network.
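The general recipe of Figure 4 can be sketched generically as follows, again under the assumed PyTorch setting used for the LISTA example. The names IterationLayer and UnrolledNet and the placeholder update inside forward are hypothetical; a real unrolled model substitutes the analytic iteration of the underlying algorithm (a gradient, proximal, or least-squares step) for the placeholder.

```python
import torch
import torch.nn as nn

class IterationLayer(nn.Module):
    """One network layer corresponding to a single iteration z <- h(z; theta^l)."""
    def __init__(self, dim):
        super().__init__()
        # theta^l: whatever parameters one iteration of the underlying algorithm uses.
        self.theta = nn.Parameter(torch.zeros(dim))

    def forward(self, z, y):
        # Placeholder update; replace with the algorithm's analytic iteration step.
        return torch.relu(z + self.theta) + y

class UnrolledNet(nn.Module):
    """Stack L iteration layers; parameters may be untied (one theta per layer)."""
    def __init__(self, dim, num_layers=10):
        super().__init__()
        self.layers = nn.ModuleList([IterationLayer(dim) for _ in range(num_layers)])

    def forward(self, y, z0=None):
        z = torch.zeros_like(y) if z0 is None else z0
        for layer in self.layers:   # executing the iteration L times (finite truncation)
            z = layer(z, y)
        return z
```

The network is then trained end to end on input–output pairs, exactly as in the LISTA example, so that the θ^l absorb model parameters and regularization coefficients from data.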

Table 1. Recent methods employing algorithm unrolling in practical signal processing and imaging applications.

Reference | Year | Application domain | Topics | Underlying iterative algorithms

Hershey et al. [29] | 2014 | Speech processing | Single-channel source separation | Nonnegative matrix factorization (NMF)
Wang et al. [26] | 2015 | Computational imaging | Image superresolution | Coupled sparse coding with the ISTA
Zheng et al. [30] | 2015 | Vision and recognition | Semantic image segmentation | Conditional random field (CRF) with mean-field (MF) iteration
Schuler et al. [31] | 2016 | Computational imaging | Blind image deblurring | Alternating minimization
Chen et al. [16] | 2017 | Computational imaging | Image denoising, JPEG deblocking | Nonlinear diffusion
Jin et al. [27] | 2017 | Medical imaging | Sparse-view X-ray computed tomography (CT) | ISTA
Liu et al. [32] | 2018 | Vision and recognition | Semantic image segmentation | CRF with MF iteration
Solomon et al. [33] | 2018 | Medical imaging | Clutter suppression | Generalized ISTA for robust principal component analysis (PCA)
Ding et al. [34] | 2018 | Computational imaging | Rain removal | Alternating direction method of multipliers (ADMM)
Wang et al. [35] | 2018 | Speech processing | Source separation | Multiple-input spectrogram inversion
Adler et al. [36] | 2018 | Medical imaging | CT | Primal–dual hybrid gradient
Wu et al. [37] | 2018 | Medical imaging | Lung nodule detection | Primal–dual hybrid gradient
Yang et al. [14] | 2019 | Medical imaging | Magnetic resonance imaging (MRI), compressive imaging | ADMM
Hosseini et al. [38] | 2019 | Medical imaging | MRI | Proximal gradient descent (PGD)
Li et al. [39] | 2019 | Computational imaging | Blind image deblurring | Half-quadratic splitting
Zhang et al. [40] | 2019 | Smart power grids | Power system state estimation and forecasting | Double-loop prox–linear iterations
Zhang et al. [41] | 2019 | Computational imaging | Blind image denoising, JPEG deblocking | Moving-endpoint control problem
Lohit et al. [42] | 2019 | Remote sensing | Multispectral image fusion | Projected gradient descent
Yoffe et al. [43] | 2020 | Medical imaging | Superresolution microscopy | Sparsity-based superresolution microscopy from correlation information [45]

In this section, we discuss a variety of practical applications of algorithm unrolling. Specifically, we cover applications in computational imaging, medical imaging, vision and recognition, and other signal processing topics. We then discuss the enhanced efficiency brought about by algorithm unrolling.

Applications in computational imaging

Computational imaging is a broad area covering a wide range of interesting topics, such as computational photography, hyperspectral imaging, and compressive imaging, to name a few. The key to success in many computational imaging areas frequently hinges on solving an inverse problem. Model-based inversion has long been a popular approach. Examples of model-based methods include parsimonious representations, such as sparse coding and low-rank matrix pursuit; variational methods; and CRFs. The employment of model-based techniques gives rise to many iterative approaches, as closed-form solutions are rarely available. The fertile ground of these iterative algorithms, in turn, provides a solid foundation and offers many opportunities for algorithm unrolling.

Single-image superresolution is an important topic in computational imaging that focuses on improving the spatial resolution of a single degraded image. In addition to offering images of improved visual quality, superresolution also aids diagnosis in medical applications and promises to improve the performance of recognition systems. Compared with naive bicubic interpolation, there exists significant room for performance improvement by exploiting natural image structures, such as learning dictionaries to encode local image structures into sparse codes [47]. A significant research effort has been devoted to such structure-aware approaches. Wang et al. [26] applied the LISTA (which we discussed in the "Unrolling Sparse Coding Algorithms Into Deep Networks" section) to patches extracted from the input image and recombined the predicted high-resolution patches to form an expected high-resolution image.

A depiction of the entire network architecture, named the sparse coding-based network (SCN), is provided in Figure 5. A LISTA subnetwork is plugged into the end-to-end learning system and estimates the sparse codes α of all the image patches. A trainable dictionary D then maps the sparse codes to reconstructed patches, followed by an operation that injects the patches back into the whole image. The trainable parameters of the SCN include those of the LISTA (W_t, W_e, and λ) and the dictionary D. By integrating the patch extraction and recombination layers into the network, the whole system resembles a CNN, since it also performs patch-by-patch processing. The network is trained by pairing low- and high-resolution images and minimizing the mean-square-error (MSE) loss. In addition to higher visual quality and a peak signal-to-noise ratio (PSNR) gain of 0.3–1.6 dB over the state of the art, the network is faster to train and has a lower number of parameters. Figure 6 provides sample results from the SCN and several state-of-the-art techniques.

Another important application focusing on improving the quality of degraded images is blind image deblurring. Given a sharp image blurred by an unknown function, which is usually called the blur kernel or the point spread function, the goal is to jointly estimate both the blur kernel and the underlying sharp image. There is a wide range of approaches to blind image deblurring in the literature. The blur kernel can be of different forms, such as Gaussian, defocusing, and motion. In addition, the blur kernel can be either spatially uniform or nonuniform.

Blind image deblurring is a challenging topic because the blur kernel is generally of a low-pass nature, rendering the problem highly ill posed. Most existing approaches rely on extracting stable features (such as salient edges in natural images) to reliably estimate the blur kernels. The sharp image can be subsequently retrieved based on the estimated kernels. Schuler et al. [31] review many existing algorithms and note that they essentially iterate across three modules: 1) feature extraction, 2) kernel estimation, and 3) image recovery. Therefore, a network can be built by unrolling and concatenating several layers of these modules, as depicted in Figure 7. More specifically, the feature extraction module is represented as a few layers of convolutions and rectifier operations, which basically mimics a small CNN, while both the kernel and the image estimation modules are represented as least-squares operations. To train the network, the blur kernels are simulated by sampling from Gaussian processes, and the blurred images are synthetically created by blurring each sharp image through a 2D discrete convolution.

Recently, Li et al. [15], [39] developed an unrolling approach for blind image deblurring by enforcing sparsity constraints across filtered domains and then unrolling the half-quadratic splitting algorithm for solving the resulting optimization problem. The network is called deep unrolling for blind deblurring (DUBLID) and is detailed in "Deep Unrolling for Blind Deblurring," which reveals that custom modifications are made in the unrolled network to integrate domain knowledge that enhances the deblurring pursuit.


FIGURE 5. The SCN [26] architecture. The patches extracted from the input low-resolution image y are fed into a LISTA subnetwork to estimate the associated sparse codes α, and then high-resolution patches are reconstructed through a linear layer. The predicted high-resolution image x̂ is formed by putting these patches into their corresponding spatial locations. The whole network is trained on low- and high-resolution image pairs by employing a standard stochastic gradient descent algorithm. The high-resolution dictionary D (colored in blue) and the LISTA parameters are trainable from real data sets.
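A rough sketch of such an SCN-style pipeline is given below, reusing the LISTA class from the earlier sketch. The patch size, code dimension, and the use of 1-channel unit-stride unfold/fold operations are simplifying assumptions made here for illustration, not details taken from [26].

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SCNSketch(nn.Module):
    """SCN-style pipeline: extract patches, run a LISTA subnetwork per patch,
    map the codes through a trainable dictionary D, and recombine the patches."""
    def __init__(self, patch=9, code_dim=128, lista_layers=4):
        super().__init__()
        n = patch * patch
        self.patch = patch
        self.lista = LISTA(n=n, m=code_dim, num_layers=lista_layers)  # earlier sketch
        self.D = nn.Parameter(0.1 * torch.randn(n, code_dim))          # trainable dictionary

    def forward(self, y):
        # y: (batch, 1, H, W) bicubic-interpolated low-resolution image.
        b, _, H, W = y.shape
        patches = F.unfold(y, self.patch, stride=1)                     # (b, n, num_patches)
        codes = self.lista(patches.transpose(1, 2).reshape(-1, self.patch ** 2))
        recon = (codes @ self.D.T).reshape(b, -1, self.patch ** 2).transpose(1, 2)
        out = F.fold(recon, (H, W), self.patch, stride=1)
        norm = F.fold(torch.ones_like(recon), (H, W), self.patch, stride=1)
        return out / norm                                               # average overlaps
```

As in the article, the whole pipeline would be trained end to end with an MSE loss between the output and the ground-truth high-resolution image.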

The authors also analytically derive custom backpropagation rules to facilitate network training. Experimentally, the network offers significant performance gains and requires many fewer parameters and less inference time compared with both traditional iterative algorithms and modern neural network approaches. An example of experimental comparisons is provided in Figure 8.

Applications in medical imaging

Medical imaging is a broad area that generally focuses on applying image processing and pattern recognition techniques to aid clinical analysis and disease diagnosis. Interesting topics in medical imaging include MRI, CT imaging, and ultrasound imaging, to name a few. Just like computational imaging, medical imaging is an area enriched with many interesting inverse problems, and model-based approaches, such as sparse coding, play a critical role in solving these difficulties. In practice, the data collection can be quite expensive and painstaking for the patients, and therefore it is difficult to gather abundant samples to train conventional deep networks.

FIGURE 6. Sample experimental results from [26] for visual comparison in single-image superresolution. (a) Ground-truth images. (b) Results from
[45]. (c) Results from [46]. (d) Results from [26]. (b)–(d) include a state-of-the-art iterative algorithm as well as a deep learning technique. Note that
the magnified portions show that the SCN better recovers sharp edges and spatial details.

Interpretability is also an important concern. Therefore, algorithm unrolling has great potential in this context.

In MRI, a fundamental challenge is to recover a signal from a small number of measurements, corresponding to a reduced scanning time. Yang et al. [14] unroll the widely known ADMM algorithm, a popular optimization algorithm for solving CS and related sparsity-constrained estimation problems, into a deep network called ADMM-CSNet. The sparsity-inducing transformations and regularization weights are learned from real data to overcome the network's limited adaptability and enhance its reconstruction performance. Compared with conventional iterative methods, ADMM-CSNet achieves the same reconstruction accuracy while using 10% fewer sampled data, and it speeds up the recovery by approximately 40 times. It exceeds state-of-the-art deep networks by roughly 3 dB PSNR under a 20% sampling rate. Refer to "Alternating Direction Method of Multipliers Compressive Sensing Network" for further details.

Another work on MRI reconstruction, by Hosseini et al. [38], unrolls the well-known proximal gradient descent (PGD) algorithm into a deep network. Motivated by momentum-based acceleration techniques, such as Nesterov's method [51], the authors introduce dense connections into their network to facilitate information flow across nonadjacent layers. Performance improvements over conventional PGD-based methods are shown experimentally. In tomographic reconstruction, Adler et al. [36] unroll the primal–dual hybrid gradient algorithm, a well-known technique for primal–dual nonsmooth optimization. They substitute the primal and dual proximal operators with certain parameterized operators, such as CNNs, and train both the operator parameters and the algorithm parameters in an end-to-end fashion. Their method demonstrates improvements over conventional approaches in recovering low-dose CT images.

FIGURE 7. The architecture in [31]. The network is formed by concatenating multiple stages of essential blind image deblurring modules. Stages 2 and 3
repeat the same operations as stage 1, with different trainable parameters. From a conceptual standpoint, each stage imitates one iteration of a typical
blind image deblurring algorithm. The training data can be formed by synthetically blurring sharp images to obtain their blurred versions.


FIGURE 8. Sample experimental results from [39] for visual comparison of blind image deblurring. (a) Ground-truth images and kernels. (b) A top-performing
iterative algorithm from Perrone et al. [48]. Two state-of-the-art deep learning techniques (c) and (d), from Nah et al. [49] and Tao et al. [50], respectively, are
compared against (e) the DUBLID method.

Deep Unrolling for Blind Deblurring

The spatially invariant blurring process can be represented as a discrete convolution

y = k * x + n,   (S5)

where y is the blurred image, x is the latent sharp image, k is the unknown blur kernel, and n is Gaussian random noise. A popular class of image deblurring algorithms performs total variation minimization, which solves the following optimization problem:

\min_{k, g_1, g_2} \frac{1}{2}\big( \| D_x y - k * g_1 \|_2^2 + \| D_y y - k * g_2 \|_2^2 \big) + \lambda_1 \| g_1 \|_1 + \lambda_2 \| g_2 \|_1 + \frac{\varepsilon}{2} \| k \|_2^2, \quad \text{subject to } \| k \|_1 = 1, \ k \geq 0,   (S6)

where D_x y and D_y y are the partial derivatives of y in the horizontal and vertical directions, respectively, and λ_1, λ_2, and ε are positive regularization coefficients. Upon convergence, the variables g_1 and g_2 are estimates of the sharp image gradients in the x and y directions, respectively. In [15] and [39], (S6) was generalized by realizing that D_x and D_y are computed using linear filters, which can be generalized into a set of C filters {f_i}_{i=1}^C:

\min_{k, \{g_i\}_{i=1}^C} \sum_{i=1}^{C} \Big( \frac{1}{2} \| f_i * y - k * g_i \|_2^2 + \lambda_i \| g_i \|_1 \Big) + \frac{\varepsilon}{2} \| k \|_2^2, \quad \text{subject to } \| k \|_1 = 1, \ k \geq 0.   (S7)

An efficient optimization algorithm for solving (S7) is the half-quadratic splitting algorithm, which alternately minimizes the surrogate problem

\min_{k, \{g_i, z_i\}_{i=1}^C} \sum_{i=1}^{C} \Big( \frac{1}{2} \| f_i * y - k * g_i \|_2^2 + \lambda_i \| z_i \|_1 + \frac{\zeta_i}{2} \| g_i - z_i \|_2^2 \Big) + \frac{\varepsilon}{2} \| k \|_2^2, \quad \text{subject to } \| k \|_1 = 1, \ k \geq 0,   (S8)

sequentially across the variables {g_i}_{i=1}^C, {z_i}_{i=1}^C, and k. Here, ζ_i, i = 1, ..., C are regularization coefficients. A noteworthy fact is that each individual minimization admits an analytical expression, which facilitates casting (S8) into network layers. Specifically, in the lth iteration (l ≥ 0), the following updates are performed:

g_i^{l+1} = \mathcal{F}^{-1}\left\{ \frac{ \overline{\widehat{k^l}} \odot \widehat{f_i^l * y} + \zeta_i^l \, \widehat{z_i^l} }{ \big| \widehat{k^l} \big|^2 + \zeta_i^l } \right\} := M_1\{ f^l * y, z^l; \zeta^l \}, \quad \forall i,
z_i^{l+1} = \mathcal{S}_{\lambda_i^l / \zeta_i^l}\{ g_i^{l+1} \} := M_2\{ g^{l+1}; \beta^l \}, \quad \forall i,
k^{l+1} = \mathcal{N}_1\left( \left[ \mathcal{F}^{-1}\left\{ \frac{ \sum_{i=1}^{C} \overline{\widehat{z_i^{l+1}}} \odot \widehat{f_i^l * y} }{ \sum_{i=1}^{C} \big| \widehat{z_i^{l+1}} \big|^2 + \varepsilon } \right\} \right]_+ \right) := M_3\{ f^l * y, z^{l+1} \},   (S9)

where [·]_+ is the rectified linear unit operator, x̂ denotes the discrete Fourier transform (DFT) of x, F^{-1} indicates the inverse DFT operator, ⊙ refers to elementwise multiplication, S is the soft-thresholding operator defined elementwise in (S3), and the operator N_1(·) normalizes its operand to a unit sum. Here, g^l, f^l * y, and z^l refer to {g_i^l}_{i=1}^C, {f_i^l * y}_{i=1}^C, and {z_i^l}_{i=1}^C stacked together, and β^l collects the thresholds {λ_i^l / ζ_i^l}_{i=1}^C. Note that layer-specific parameters ζ^l, β^l, and f^l are used. The parameter ε > 0 is a fixed constant.

As with most existing unrolling methods, only L iterations are performed. The sharp image is retrieved from g^L and k^L by solving the following linear least-squares problem:

\tilde{x} = \arg\min_x \frac{1}{2} \| y - \tilde{k} * x \|_2^2 + \sum_{i=1}^{C} \frac{\eta_i}{2} \| f_i^L * x - g_i^L \|_2^2
= \mathcal{F}^{-1}\left\{ \frac{ \overline{\widehat{\tilde{k}}} \odot \hat{y} + \sum_{i=1}^{C} \eta_i \, \overline{\widehat{f_i^L}} \odot \widehat{g_i^L} }{ \big| \widehat{\tilde{k}} \big|^2 + \sum_{i=1}^{C} \eta_i \big| \widehat{f_i^L} \big|^2 } \right\} := M_4\{ y, g^L, k^L; \eta, f^L \},   (S10)
where f^L = {f_i^L}_{i=1}^C are the filter coefficients in the Lth layer and η = {η_i}_{i=1}^C are positive regularization coefficients. By unrolling (S9) and (S10) into a deep network, we get L layers of g, z, and k updates followed by one layer of image retrieval. The filter coefficients f_i^l and regularization parameters {λ_i^l, ζ_i^l, η_i} are learned by backpropagation. Note that f_i^L are shared in both (S9) and (S10) and updated jointly. The final network architecture is depicted in Figure S1. Similar to [31], the network is trained using synthetic samples, i.e., by convolving sharp images to obtain blurred versions. The training loss function is the translation-invariant mean-square-error loss, which compensates for the possible spatial shifts of the deblurred images and the blur kernel.

FIGURE S1. Deep unrolling for blind deblurring [15]. The analytical operations M_1, M_2, M_3, and M_4 correspond to casting the analytic expressions in (S9) and (S10) into the network. Trainable parameters are colored in blue. In particular, the parameters f^l, l = 1, ..., L denote the trainable filter coefficients in the lth layer.
While this technique offers merits in reconstruction, the extracted features may not favor detection tasks. Therefore, Wu et al. [37] extend the method by concatenating it with a detection network and applying joint fine-tuning after individually training both networks. Their jointly fine-tuned network outperforms state-of-the-art alternatives.

Another important imaging modality is ultrasound, which has the advantage of being a radiation-free approach. When used for blood flow depiction, one of the challenges is the fact that the tissue reflections tend to be much stronger than those of the blood, leading to strong clutter resulting from the tissue. Thus, an important task is to separate the tissue from the blood. Various filtering methods have been used in this context, such as high-pass filtering and filtering based on the singular value decomposition. Solomon et al. [33] suggest using a robust PCA technique by modeling the received ultrasound movie as a combination of a low-rank and a sparse matrix, where the tissue is low rank and the blood vessels are sparse. They then unroll an ISTA approach to robust PCA into a deep network, which is called convolutional robust PCA (CORONA). As the name suggests, the authors replace matrix multiplications with convolutional layers, effectively converting the network into a CNN-like architecture.


FIGURE 9. Sample experimental results demonstrating the recovery of ultrasound contrast agents (UCAs) from cluttered maximum-intensity projection
(MIP) images [33]. (a) An MIP image of the input movie, composed from 50 frames of simulated UCAs cluttered by tissue. (b) A ground-truth UCA MIP
image. (c) A recovered UCA MIP image via CORONA. (d) A ground-truth tissue MIP image. (e) A recovered MIP tissue image via CORONA. The color bar
is measured in decibels. (Source: [33]; used with permission.)

Alternating Direction Method of Multipliers Compressive Sensing Network

Consider linear measurements y ∈ C^m formed by y ≈ Φx, where Φ ∈ C^{m×n} is a measurement matrix with m < n. Compressive sensing (CS) aims at reconstructing the original signal x ∈ R^n by exploiting its underlying sparse structure in a transform domain [S1]. A generalized CS model can be formulated as the following optimization problem [14]:

\min_{x} \frac{1}{2} \| \Phi x - y \|_2^2 + \sum_{i=1}^{C} \lambda_i \, g( D_i x ),   (S11)

where λ_i are positive regularization coefficients, g(·) is a sparsity-inducing function, and {D_i}_{i=1}^C is a sequence of C operators that effectively perform linear filtering operations. Concretely, D_i can be taken as a wavelet transform, and g can be chosen as the ℓ1 norm. However, for better performance, the method in [14] learns both of them from an unrolled network.

An efficient minimization algorithm for solving (S11) is the alternating direction method of multipliers (ADMM) [S2]. Equation (S11) is first recast into a constrained minimization through variable splitting:

\min_{x, \{z_i\}_{i=1}^C} \frac{1}{2} \| \Phi x - y \|_2^2 + \sum_{i=1}^{C} \lambda_i \, g(z_i), \quad \text{subject to } z_i = D_i x, \ \forall i.   (S12)

The corresponding augmented Lagrangian is then formed as follows:

L_\rho( x, z; \alpha ) = \frac{1}{2} \| \Phi x - y \|_2^2 + \sum_{i=1}^{C} \Big[ \lambda_i \, g(z_i) + \frac{\rho_i}{2} \| D_i x - z_i + \alpha_i \|_2^2 \Big],   (S13)

where {α_i}_{i=1}^C are dual variables and {ρ_i}_{i=1}^C are penalty coefficients. The ADMM then alternately minimizes (S13), followed by a dual variable update, leading to the following iterations:

x^l = \Big( \Phi^H \Phi + \sum_{i=1}^{C} \rho_i D_i^T D_i \Big)^{-1} \Big[ \Phi^H y + \sum_{i=1}^{C} \rho_i D_i^T \big( z_i^{l-1} - \alpha_i^{l-1} \big) \Big] := U_1\{ y, \alpha_i^{l-1}, z_i^{l-1}; \rho_i, D_i \},
z_i^l = \mathcal{P}_g\Big\{ D_i x^l + \alpha_i^{l-1}; \frac{\lambda_i}{\rho_i} \Big\} := U_2\{ \alpha_i^{l-1}, x^l; \lambda_i, \rho_i, D_i \},
\alpha_i^l = \alpha_i^{l-1} + \eta_i \big( D_i x^l - z_i^l \big) := U_3\{ \alpha_i^{l-1}, x^l, z_i^l; \eta_i, D_i \}, \quad \forall i,   (S14)

where η_i are constant parameters and P_g{·; λ} is the proximal mapping for g with parameter λ. The unrolled network can thus be constructed by concatenating these operations and learning the parameters λ_i, ρ_i, η_i, and D_i in each layer. Figure S2 depicts the resulting unrolled network architecture. In [14], the authors discuss several implementation issues, including efficient matrix inversion and the backpropagation rules. The network is trained by minimizing a normalized version of the root-mean-square error.

Reference
[S2] S. Boyd, N. Parikh, E. Chu, B. Peleato, and J. Eckstein, "Distributed optimization and statistical learning via the alternating direction method of multipliers," Found. Trends Mach. Learn., vol. 3, no. 1, pp. 1–122, 2011. doi: 10.1561/2200000016.

FIGURE S2. The ADMM-CSNet [14]. Each stage includes a series of interrelated operations whose analytic forms are given in (S14). The trainable parameters are colored in blue.
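The following is a minimal sketch of a single ADMM-CSNet-style stage under simplifying assumptions: real-valued measurements, dense learnable filtering matrices D_i, and an ℓ1 regularizer whose proximal mapping is soft thresholding. The class name ADMMStage and the parameter shapes are illustrative, not the implementation of [14].

```python
import torch
import torch.nn as nn

class ADMMStage(nn.Module):
    """One unrolled ADMM stage following (S14): x-update, z-update, dual update."""
    def __init__(self, n, num_filters=3):
        super().__init__()
        self.D = nn.ParameterList([nn.Parameter(0.1 * torch.randn(n, n)) for _ in range(num_filters)])
        self.rho = nn.Parameter(torch.ones(num_filters))   # penalty coefficients rho_i
        self.lam = nn.Parameter(0.1 * torch.ones(num_filters))  # regularization weights lambda_i
        self.eta = nn.Parameter(torch.ones(num_filters))   # dual step sizes eta_i

    def forward(self, Phi, y, z, alpha):
        # x-update (U_1): solve the regularized normal equations.
        A = Phi.T @ Phi + sum(r * D.T @ D for r, D in zip(self.rho, self.D))
        b = Phi.T @ y + sum(r * D.T @ (zi - ai) for r, D, zi, ai in zip(self.rho, self.D, z, alpha))
        x = torch.linalg.solve(A, b)
        # z-update (U_2, soft thresholding) and dual update (U_3), per filter.
        z_new, a_new = [], []
        for D, r, lam, eta, ai in zip(self.D, self.rho, self.lam, self.eta, alpha):
            v = D @ x + ai
            zi = torch.sign(v) * torch.clamp(v.abs() - lam / r, min=0.0)
            z_new.append(zi)
            a_new.append(ai + eta * (D @ x - zi))
        return x, z_new, a_new
```

Stacking several such stages and training the per-stage parameters end to end (with a normalized root-mean-square-error loss, as described above) yields the unrolled network.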

Compared with state-of-the-art methods, CORONA demonstrates vastly improved reconstruction quality and has many fewer parameters than the well-known residual neural network (ResNet) [52]. Refer to "Convolutional Robust Principal Component Analysis" for details. LISTA-based methods have also been applied in ultrasound to improve image superresolution (Figure 9) [53].

Applications in vision and recognition

Computer vision is a broad and fast-growing area that has achieved tremendous success in many interesting topics during recent years. A major driving force for its rapid progress is deep learning. For instance, thanks to the availability of large-scale training samples, researchers have surpassed human-level performance on the ImageNet data set in image recognition tasks by employing deep CNNs [2]. Nevertheless, most existing approaches are highly empirical, and a lack of interpretability has become an increasingly serious issue. To overcome this drawback, researchers are paying more attention to algorithm unrolling [30], [32].

One example is semantic image segmentation, which assigns class labels to each pixel in an image. Compared with traditional low-level image segmentation, the technique provides additional information about object categories and thus creates semantically meaningful segmented objects. By performing pixel-level labeling, semantic segmentation can also be regarded as an extension of image recognition. Applications of semantic segmentation include autonomous driving, robot vision, and medical imaging.

Traditionally, the CRF was a popular approach. Recently, deep networks have become the primary tool. Early deep learning approaches are capable of recognizing objects at a high level; however, they are relatively less accurate in delineating objects than CRFs. Zheng et al. [30] unroll the mean-field (MF) iterations of the CRF into an RNN and then concatenate the semantic segmentation network with this RNN to form a deep network. The concatenated network resembles conventional semantic segmentation followed by CRF-based postprocessing, while end-to-end training can be performed across the whole network. Liu et al. [32] follow the same

Convolutional Robust Principal Component Analysis

In ultrasound imaging, a series of pulses is transmitted into the imaged medium, and the pulses' echoes are received in each transducer element. After beamforming and demodulation, a series of movie frames is acquired. Stacking the frames together as column vectors leads to a data matrix $D \in \mathbb{C}^{m \times n}$, which can be modeled as follows:

$$D = H_1 L + H_2 S + N,$$

where $L$ represents the tissue signals, $S$ denotes the echoes returned from the blood signals, $H_1$ and $H_2$ are measurement matrices, and $N$ is the noise matrix. Due to its high spatial–temporal coherence, $L$ is typically a low-rank matrix, while $S$ is generally a sparse matrix since blood vessels usually sparsely populate the imaged medium.

Based on these observations, the echoes $S$ can be estimated through a transformed low-rank and sparse decomposition by solving the following optimization problem:

$$\min_{L,S}\ \frac{1}{2}\big\| D - (H_1 L + H_2 S) \big\|_F^2 + \lambda_1 \| L \|_* + \lambda_2 \| S \|_{1,2}, \quad (S15)$$

where $\|\cdot\|_*$ is the nuclear norm of a matrix that promotes low-rank solutions and $\|\cdot\|_{1,2}$ is the mixed $\ell_{1,2}$ norm, which enforces row sparsity. Equation (S15) can be solved using a generalized version of the iterative shrinkage and thresholding algorithm (ISTA) in the matrix domain by utilizing the proximal mapping corresponding to the nuclear norm and mixed $\ell_{1,2}$ norm. In the $l$th iteration, it executes the following steps:

$$L^{l+1} = \mathcal{T}_{\lambda_1/\mu}\Big\{\Big(I - \tfrac{1}{\mu} H_1^H H_1\Big) L^l - \tfrac{1}{\mu} H_1^H H_2 S^l + \tfrac{1}{\mu} H_1^H D\Big\},$$
$$S^{l+1} = \mathcal{S}^{1,2}_{\lambda_2/\mu}\Big\{\Big(I - \tfrac{1}{\mu} H_2^H H_2\Big) S^l - \tfrac{1}{\mu} H_2^H H_1 L^l + \tfrac{1}{\mu} H_2^H D\Big\},$$

where $\mathcal{T}_{\lambda}\{X\}$ is the singular-value thresholding operator that performs soft thresholding across the singular values of $X$ with threshold $\lambda$, $\mathcal{S}^{1,2}_{\lambda}$ performs rowwise soft thresholding with parameter $\lambda$, and $\mu$ is the step size parameter for the ISTA. Technically, $\mathcal{T}_{\lambda}$ and $\mathcal{S}^{1,2}_{\lambda}$ correspond to the proximal mapping for the nuclear norm and the mixed $\ell_{1,2}$ norm, respectively.

Just like the migration from a multilayer perceptron to a convolutional neural network (CNN), the matrix multiplications can be replaced by convolutions, which gives rise to the following iteration steps:

$$L^{l+1} = \mathcal{T}_{\lambda_1^l}\big\{ P_5^l * L^l + P_3^l * S^l + P_1^l * D \big\}, \quad (S16)$$
$$S^{l+1} = \mathcal{S}^{1,2}_{\lambda_2^l}\big\{ P_6^l * S^l + P_4^l * L^l + P_2^l * D \big\}, \quad (S17)$$

where $*$ is the convolution operator. Here, $P_i^l,\ i = 1, \ldots, 6$ are a series of convolution filters that are learned from the data in the $l$th layer, and $\lambda_1^l, \lambda_2^l$ are thresholding parameters for the $l$th layer. By casting (S16) and (S17) into network layers, a deep network resembling a CNN is formed. The parameters $P_i^l,\ i = 1, 2, \ldots, 6$ and $\{\lambda_1^l, \lambda_2^l\}$ are learned from training data. To train the network, one can first obtain ground-truths $L$ and $S$ from $D$ by executing ISTA-like algorithms up to convergence. Simulated samples can also be added to address a lack of training samples. Mean-square-error losses are imposed on $L$ and $S$, respectively.
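To make the construction above concrete, the following is a minimal sketch of one unrolled layer implementing (S16) and (S17) in NumPy/SciPy. The filter shapes, the real-valued toy data, and the fixed thresholds are illustrative assumptions only; in CORONA the filters and thresholds are learned per layer and the data are complex-valued.

```python
# A minimal NumPy sketch of one unrolled CORONA layer, following (S16) and (S17).
# The filters P1..P6 and thresholds lam1, lam2 stand in for the learned,
# layer-specific parameters; their shapes and values here are illustrative only.
import numpy as np
from scipy.signal import convolve2d

def svt(X, tau):
    """Singular-value thresholding: soft threshold the singular values of X."""
    U, s, Vh = np.linalg.svd(X, full_matrices=False)
    return U @ np.diag(np.maximum(s - tau, 0.0)) @ Vh

def row_soft(X, tau):
    """Mixed l1,2 proximal step: shrink each row of X by its l2 norm."""
    norms = np.linalg.norm(X, axis=1, keepdims=True)
    scale = np.maximum(1.0 - tau / np.maximum(norms, 1e-12), 0.0)
    return X * scale

def corona_layer(L, S, D, P, lam1, lam2):
    """One layer: P is a dict of six small 2D filters (learned in practice)."""
    conv = lambda X, K: convolve2d(X, K, mode="same", boundary="symm")
    L_next = svt(conv(L, P[5]) + conv(S, P[3]) + conv(D, P[1]), lam1)
    S_next = row_soft(conv(S, P[6]) + conv(L, P[4]) + conv(D, P[2]), lam2)
    return L_next, S_next

# Toy usage: random data matrix and 3x3 filters, iterated for a few "layers."
rng = np.random.default_rng(0)
D = rng.standard_normal((64, 32))
L, S = np.zeros_like(D), np.zeros_like(D)
P = {i: rng.standard_normal((3, 3)) * 0.1 for i in range(1, 7)}
for _ in range(3):
    L, S = corona_layer(L, S, D, P, lam1=0.1, lam2=0.1)
```

Stacking several such layers and backpropagating mean-square-error losses on L and S through the thresholding operations recovers the training setup described above.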

Unrolling the Conditional Random Field Into a Recurrent Neural Network
The conditional random field (CRF) is a fundamental model for labeling undirected graphical models. A special case of the CRF, where only pairwise interactions of graph nodes are considered, is the Markov random field. Given a graph $(V, E)$ and a predefined collection of labels $L$, it assigns label $l_p \in L$ to each node $p$ by minimizing the following energy function:

$$E\big(\{l_p\}_{p \in V}\big) = \sum_{p \in V} \phi_p(l_p) + \sum_{(p,q) \in E} \psi_{p,q}(l_p, l_q),$$

where $\phi_p(\cdot)$ and $\psi_{p,q}(\cdot)$ are commonly called unary energy and pairwise energy, respectively. Typically, $\phi_p$ models the preference of assigning $p$ with each label given the observed data, while $\psi_{p,q}$ models the smoothness between $p$ and $q$. In semantic segmentation, $V$ is made up of the image pixels, $E$ is the set of pixel pairs, and $L$ consists of object categories.

In [30], the unary energy $\phi_p$ is chosen as the output of a semantic segmentation network, such as the well-known fully convolutional network (FCN) [S3], while the pairwise energy $\psi_{p,q}(f_p, f_q)$ admits the following special form:

$$\psi(l_p, l_q) = \mu(p, q) \sum_{m=1}^{M} w_m G_m(f_p, f_q),$$

where $\{G_m\}_{m=1}^{M}$ is a collection of Gaussian kernels and $\{w_m\}_{m=1}^{M}$ are the corresponding weights. Here, $f_p$ and $f_q$ are the feature vectors for pixels $p$ and $q$, respectively, and $\mu(\cdot, \cdot)$ models the label compatibility between pixel pairs.

An efficient inference algorithm for energy minimization across fully connected CRFs is the mean-field (MF) iteration [S4], which iteratively executes the following steps:

$$\text{(Message Passing)}: \quad \tilde{Q}_p^m(l) \leftarrow \sum_{q \neq p} G_m(f_p, f_q)\, Q_q(l), \quad (S18)$$
$$\text{(Compatibility Transform)}: \quad \hat{Q}_p(l_p) \leftarrow \sum_{l \in L} \sum_{m} \mu_m(l_p, l)\, w_m\, \tilde{Q}_p^m(l), \quad (S19)$$
$$\text{(Unary Addition)}: \quad Q_p(l_p) \leftarrow \exp\big\{-\phi_p(l_p) - \hat{Q}_p(l_p)\big\},$$
$$\text{(Normalization)}: \quad Q_p(l_p) \leftarrow \frac{Q_p(l_p)}{\sum_{l \in L} Q_p(l)},$$

where $Q_p(l_p)$ can be interpreted as the marginal probability of assigning $p$ with label $l_p$. A noteworthy fact is that each update step resembles common neural network layers. For instance, message passing can be implemented by filtering through Gaussian kernels, which imitates passing through a convolutional layer. The compatibility transform can be implemented through a $1 \times 1$ convolution, while the normalization can be considered as the popular softmax layer. These layers can thus be unrolled to form a recurrent neural network (RNN) known as the CRF–RNN. By concatenating an FCN with the CRF–RNN, a network that can be trained end to end is formed. An illustration is presented in Figure S3.

References
[S3] J. Long, E. Shelhamer, and T. Darrell, “Fully convolutional networks for semantic segmentation,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2015, pp. 3431–3440. doi: 10.1109/CVPR.2015.7298965.
[S4] P. Krähenbühl and V. Koltun, “Efficient inference in fully connected CRFs with Gaussian edge potentials,” in Proc. 24th Int. Conf. Neural Information Processing Systems, 2011, pp. 109–117. doi: 10.5555/2986459.2986472.

[Figure S3 diagram: input image y feeds an FCN (stage 1), whose unary outputs enter the CRF–RNN (stage 2), which repeats message passing, compatibility transform, unary addition, and normalization to produce the predicted label.]
FIGURE S3. The CRF–RNN network [30]. An FCN is concatenated with an RNN, called the CRF–RNN, to form a deep network. This RNN essentially
performs MF iterations and acts like CRF-based postprocessing. The concatenated network can be trained end to end to optimize its performance.
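As a rough illustration of how the MF updates (S18)–(S19) map onto standard network layers, the sketch below implements one iteration with NumPy. A single isotropic Gaussian filter stands in for the full set of bilateral/Gaussian kernels, and the unary scores, compatibility matrix, and kernel width are illustrative assumptions.

```python
# A minimal sketch of one mean-field iteration of a dense CRF, written so that
# each step maps onto a standard network layer as in the CRF-RNN construction.
import numpy as np
from scipy.ndimage import gaussian_filter

def mean_field_step(Q, unary, compat, sigma=3.0):
    """Q, unary: (H, W, K) arrays; compat: (K, K) label-compatibility matrix."""
    # Message passing: filter each label map with a Gaussian kernel
    # (plays the role of a fixed convolutional layer).
    msg = np.stack([gaussian_filter(Q[..., k], sigma) for k in range(Q.shape[-1])],
                   axis=-1)
    # Compatibility transform: a 1x1 convolution across the label dimension.
    pairwise = np.einsum("hwk,kj->hwj", msg, compat)
    # Unary addition followed by softmax normalization (the softmax layer).
    logits = -unary - pairwise
    logits -= logits.max(axis=-1, keepdims=True)
    expQ = np.exp(logits)
    return expQ / expQ.sum(axis=-1, keepdims=True)

# Toy usage: 2 labels on a 32x32 image, unary scores from a hypothetical FCN.
H, W, K = 32, 32, 2
rng = np.random.default_rng(1)
unary = rng.standard_normal((H, W, K))
compat = 1.0 - np.eye(K)          # Potts-style compatibility
Q = np.full((H, W, K), 1.0 / K)   # uniform initialization
for _ in range(5):                # unrolling five iterations = five "layers"
    Q = mean_field_step(Q, unary, compat)
```

In the CRF–RNN, the kernel weights and the compatibility matrix are the trainable parameters of these layers rather than fixed constants.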

direction to construct their segmentation network, called the deep parsing network. In their approach, they adopt a generalized pairwise energy and perform an MF iteration only once for the purpose of efficiency. Refer to “Unrolling the Conditional Random Field Into a Recurrent Neural Network” for further details.

Other signal processing applications
Until now, we have surveyed compelling applications in image processing and computer vision. Algorithm unrolling has also been successfully applied to a variety of other signal processing domains. We next consider speech processing, which is one of the fundamental problems in digital signal processing. Topics in speech processing include recognition, coding, synthesis, and more. Among all problems of interest, source separation stands out as a challenging yet intriguing one. Applications of source separation include speech enhancement and recognition.

For single-channel speech separation, NMF is a widely applied technique. Recently, Hershey et al. [29] unrolled NMF into a deep network, known as deep NMF, as a concrete realization of their abstract unrolling framework. Detailed descriptions are in “Deep (Unrolled) Nonnegative Matrix Factorization.” Deep NMF was evaluated on the task of speech enhancement in reverberated noisy mixtures, using a data set collected from The Wall Street Journal. It was shown to outperform both a conventional deep neural network [29] and the iterative sparse NMF method [54].

Deep (Unrolled) Nonnegative Matrix Factorization


Single-channel source separation refers to the task of decoupling several source signals from their mixture. Suppose we collect a sequence of $T$ mixture frames, where $m_t \in \mathbb{R}_+^F,\ t = 1, 2, \ldots, T$ is the $t$th frame. Given a set of nonnegative basis vectors $\{w_l \in \mathbb{R}_+^F\}_{l=1}^{L}$, we can represent $m_t$ (approximately) by

$$m_t \approx \sum_{l=1}^{L} w_l h_{lt}, \quad (S20)$$

where $h_{lt}$ represents the coefficients that are chosen to be nonnegative. By stacking instances of $m_t$ column by column, we form a nonnegative matrix $M \in \mathbb{R}_+^{F \times T}$ so that (S20) can be expressed in matrix form:

$$M \approx WH, \quad W \geq 0,\ H \geq 0, \quad (S21)$$

where $W$ has $w_l$ as its $l$th column, $\geq 0$ denotes elementwise nonnegativity, and $H = (h_{l,t})$. To remove multiplicative ambiguity, it is commonly assumed that each column of $W$ has a unit $\ell_2$ norm; i.e., occurrences of $w_l$ are unit vectors. The model (S21) is commonly called nonnegative matrix factorization (NMF) [S5] and has found wide applications in signal and image processing. In practice, the nonnegativity constraints prevent the mutual canceling of basis vectors and thus encourage semantically meaningful decompositions, which turns out to be highly beneficial.

Assuming that the phases among different sources are approximately the same, the power or magnitude spectrogram of the mixture can be decomposed as a summation of those from each source. Therefore, after performing NMF, the sources can be separated by selecting basis vectors corresponding to each individual source and recombining the source-specific basis vectors to recover the magnitude spectrograms. In practical implementation, typically, a filtering process similar to classical Wiener filtering is performed for magnitude spectrogram recovery.

To determine $W$ and $H$ from $M$, one may consider solving the following optimization problem [S6]:

$$\hat{W}, \hat{H} = \arg\min_{W \geq 0,\ H \geq 0} D_\beta\big(M \,\|\, WH\big) + \mu \|H\|_1, \quad (S22)$$

where $D_\beta$ is the $\beta$ divergence, which can be considered as a generalization of the well-known Kullback–Leibler divergence, and $\mu$ is a regularization parameter that controls the sparsity of the coefficient matrix $H$. By employing a majorization–minimization scheme, (S22) can be solved by the following multiplicative updates:

$$H^l = H^{l-1} \odot \frac{W^T\big[M \odot (W H^{l-1})^{\beta-2}\big]}{W^T (W H^{l-1})^{\beta-1} + \mu}, \quad (S23)$$
$$W^l = W^{l-1} \odot \frac{\big[M \odot (W^{l-1} H^l)^{\beta-2}\big] {H^l}^T}{(W^{l-1} H^l)^{\beta-1} {H^l}^T}, \quad (S24)$$
$$\text{Normalize } W^l \text{ so that the columns of } W^l \text{ have unit norm and scale } H^l \text{ accordingly}, \quad (S25)$$

for $l = 1, 2, \ldots$. In [29], a slightly different update scheme for $W$ was employed to encourage the discriminative power. We omit discussing it for brevity.

A deep network can be formed by unfolding these iterative updates. In [29], instances of $W^l$ are untied from the update rule (S24) and considered trainable parameters. In other words, only (S23) and (S25) are executed in each layer. Similar to (S22), the $\beta$ divergence, with a different $\beta$ value, was employed in the training loss function. A splitting scheme was also designed to preserve the nonnegativity of $W^l$ during training.

References
[S5] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, pp. 788–791, 1999. doi: 10.1038/44565.
[S6] C. Févotte, N. Bertin, and J. Durrieu, “Nonnegative matrix factorization with the Itakura-Saito divergence: With application to music analysis,” Neural Comput., vol. 21, no. 3, pp. 793–830, Mar. 2009. doi: 10.1162/neco.2008.04-08-771.
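A compact NumPy sketch of the multiplicative updates (S23)–(S25) is given below. In the unrolled deep NMF network of [29], the per-layer $W^l$ would be untied from (S24) and learned from data; here both updates are executed so the snippet runs stand-alone. The choice of $\beta = 1$, the sparsity weight, and the matrix sizes are assumptions for illustration.

```python
# A small NumPy sketch of the multiplicative NMF updates (S23)-(S25).
import numpy as np

def nmf_layer(M, W, H, beta=1.0, mu=0.1, eps=1e-12):
    """One unfolded update of H (S23) and W (S24), then column normalization (S25)."""
    WH = W @ H + eps
    H = H * (W.T @ (M * WH ** (beta - 2))) / (W.T @ WH ** (beta - 1) + mu + eps)
    WH = W @ H + eps
    W = W * ((M * WH ** (beta - 2)) @ H.T) / (WH ** (beta - 1) @ H.T + eps)
    norms = np.linalg.norm(W, axis=0, keepdims=True) + eps   # (S25)
    return W / norms, H * norms.T

# Toy usage: factor a random nonnegative "spectrogram" M with L = 8 basis vectors.
rng = np.random.default_rng(0)
M = rng.random((129, 100))          # F x T magnitude spectrogram
W = rng.random((129, 8)) + 1e-3
H = rng.random((8, 100)) + 1e-3
for _ in range(10):                 # ten unfolded layers/iterations
    W, H = nmf_layer(M, W, H)
```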

Wang et al. [35] propose an end-to-end training approach for speech separation by casting commonly employed forward and inverse short-time Fourier transform (STFT) operations into network layers and concatenating them with an iterative phase reconstruction algorithm, multiple-input spectrogram inversion [55]. In doing so, the loss function acts on reconstructed signals, rather than their STFT magnitudes, and the phase inconsistency can be reduced through training. The trained network exhibits an SNR that is 1 dB higher than state-of-the-art techniques on public data sets.

Monitoring the operating conditions of power grids in real time is a critical task when deploying large-scale contemporary electricity networks. To address the computational complexity issue of conventional power system state estimation methods, Zhang et al. [40] unroll an iterative physics-based prox–linear solver into a deep neural network. They further extend their approach for state forecasting. Numerical experiments on the IEEE 57 and IEEE 118 bus benchmark systems confirm the technique's improved performance over alternative approaches.

Multispectral image fusion is a fundamental problem in remote sensing. Lohit et al. [42] unroll the projected gradient descent algorithm for fusing low-spatial-resolution multispectral aerial images with their associated high-resolution panchromatic counterpart. They also show experimental improvements over several baselines. Finally, unrolling has also been applied to superresolution microscopy [43]. Here, the authors unroll the sparsity-based superresolution correlation microscopy method [43], which performs sparse recovery in the correlation domain.

Enhancing efficiency through unrolling
In addition to interpretability and performance improvements, unrolling can provide significant advantages for practical deployment, including higher computational efficiency and a lower number of parameters, which, in turn, leads to reduced memory footprints and storage requirements. Table 2 summarizes select results from recent unrolling research to illustrate such benefits. For comparison, results for one iterative algorithm and one deep network are included, both selected from representative top-performing methods. Note, further, that for any two methods compared in Table 2, the run times are reported on consistent implementation platforms. More details can be found in the respective works.

Compared to its iterative counterpart, unrolling often dramatically boosts the computational speed. For instance, it was reported in [18] that the LISTA may be 20 times faster than the ISTA after training, DUBLID [39] can be 1,000 times faster than total variation-based deblurring, the ADMM-CSNet [14] can be roughly four times faster than the BM3D-AMP algorithm [56], CORONA [33] is more than 50 times faster than the fast ISTA algorithm, and the prox–linear network proposed by Zhang [40] is more than 500 times faster than the Gauss–Newton algorithm.

Typically, by embedding domain-specific structures into the system, unrolled networks need many fewer parameters than conventional ones that are less specific to particular applications.

Table 2. Selected results for the running time and the parameter count from recent unrolling works and alternative methods.

                     Unrolled deep networks   Traditional iterative algorithms   Conventional deep networks
Reference            Yang et al. [14]         Metzler et al. [59]                Kulkarni et al. [57]
Running time (s)     2.61                     12.59                              2.83
Parameter count      7.8 × 10^4               —                                  3.2 × 10^5
Reference            Solomon et al. [33]      Beck et al. [28]                   He et al. [52]
Running time (s)     5.07                     15.33                              5.36
Parameter count      1.8 × 10^3               —                                  8.3 × 10^3
Reference            Li et al. [39]           Perrone et al. [48]                Kupyn et al. [58]
Running time (s)     1.47                     1,462.9                            10.29
Parameter count      2.3 × 10^4               —                                  1.2 × 10^7

For instance, the number of parameters for DUBLID is more than 100 times lower than for a scale recurrent network [50], while CORONA [33] has an order-of-magnitude-lower number of parameters than the ResNet [52]. Under circumstances where each iteration (layer) can be executed with high efficiency, unrolled networks may be even more efficient than conventional ones. For instance, the ADMM-CSNet [14]

[Figure 10 diagram: an L-layer MLP with states x^0 → x^1 → ... → x^L, shown alongside the abstract algorithm: Input x^0, Output x^L; for l = 0, 1, ..., L − 1: x^{l+1} ← σ(W^{l+1} x^l + b^{l+1}); end for.]

FIGURE 10. An MLP can be interpreted as executing an underlying iterative algorithm with finite iterations and layer-specific parameters.
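The correspondence suggested by Figure 10 can be written in a few lines: truncating the abstract iteration to L steps and allowing layer-specific parameters yields exactly the forward pass of an L-layer MLP. The toy dimensions and random parameters below are illustrative.

```python
# A tiny sketch of the abstract iteration in Figure 10: a fixed linear map
# followed by a pointwise nonlinearity. Truncating to L iterations with
# layer-specific (W_l, b_l) is exactly a forward pass through an L-layer MLP.
import numpy as np

def iterate(x0, weights, biases, sigma=np.tanh):
    x = x0
    for W, b in zip(weights, biases):   # one loop iteration == one MLP layer
        x = sigma(W @ x + b)
    return x

rng = np.random.default_rng(0)
L, dim = 4, 8
Ws = [rng.standard_normal((dim, dim)) / np.sqrt(dim) for _ in range(L)]
bs = [np.zeros(dim) for _ in range(L)]
xL = iterate(rng.standard_normal(dim), Ws, bs)
```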

has proved to be approximately twice as fast as the ReconNet [57], while DUBLID [39] is almost two times faster than the Deblur Generative Adversarial Network system [58].

Conceptual connections and theoretical analysis
In addition to creating efficient and interpretable network architectures that achieve superior performance in practical applications, algorithm unrolling can provide valuable insights from a conceptual standpoint. As detailed in the previous section, solutions to real-world signal processing problems often exploit domain-specific prior knowledge. Inheriting this domain knowledge is of both conceptual and practical importance in deep learning research. To this end, algorithm unrolling can potentially serve as a powerful tool to help establish conceptual connections between prior-information-guided analytical methods and modern neural networks.

In particular, algorithm unrolling may be utilized in the reverse direction: instead of unrolling a particular iterative algorithm into a network, we can interpret a conventional neural network as a certain iterative algorithm to be identified. Figure 10 provides a visual illustration of applying this technique to an MLP. Many traditional iterative algorithms have a fixed pattern in their iteration steps: a linear mapping followed by a nonlinear operation. Therefore, the abstract algorithm in Figure 10 represents a broad class of iterative algorithms, which, in turn, can be identified as deep networks with a structure similar to MLPs. The same technique is applicable to other networks, such as CNNs and RNNs, by replacing the linear operations with convolutions and by adopting shared parameters across different layers.

By interpreting popular network architectures as conventional iterative algorithms, a better understanding of the network behavior and mechanism can be obtained. Furthermore, rigorous theoretical analysis of designed networks may be facilitated once an equivalence is established with a well-understood class of iterative algorithms. Finally, architectural enhancement and performance improvements of neural networks may result from incorporating domain knowledge associated with iterative techniques.

In this section, we explore the close connections between neural networks and typical families of signal processing algorithms, which are clearly revealed by unrolling techniques. Specifically, we review studies that reveal the connections between algorithm unrolling and sparse coding, Kalman filters, differential equations, and statistical inference in the “Connections to Sparse Coding,” “Connections to Kalman Filtering,” “Connections to Differential Equations and Variational Methods,” and “Connections to Statistical Inference and Sampling” sections, in that order. We also review selected theoretical advances that provide formal analysis and rigorous guarantees for unrolling approaches in the “Selected Theoretical Studies” section.

Connections to sparse coding
The earliest work in establishing the connections between neural networks and sparse coding algorithms dates back to Gregor et al. [13], which we comprehensively reviewed in the “Unrolling Sparse Coding Algorithms Into Deep Networks” section. A closely related work in the dictionary learning literature is the task-driven dictionary learning algorithm proposed by Julien et al. [63]. The idea is similar to unrolling: the authors view a sparse coding algorithm as a trainable system whose parameters are the dictionary coefficients. This viewpoint is equivalent to unrolling the sparse coding algorithm into a “network” that has infinite layers and whose output is a limit point of the sparse coding algorithm. The whole system is trained end to end (task driven) using gradient descent, and an analytical formula for the gradient is derived.

Sprechmann et al. [64] propose a framework for training parsimonious models that summarizes several interesting cases through an encoder–decoder network architecture. For example, a sparse coding algorithm, such as the ISTA, can be viewed as an encoder, as it maps the input signal into its sparse code. After obtaining the sparse code, the original signal is recovered through the sparse code and the dictionary. This procedure can be viewed as a decoder. By concatenating the encoder and decoder together, a network is formed that enables unsupervised learning. The authors further extend the model to supervised and discriminative learning.

Dong et al. [46] observe that the forward pass of a CNN basically executes the same operations of sparse coding-based image superresolution [47]. Specifically, the convolution operation performs patch extraction, and the ReLU operation mimics sparse coding. Nonlinear code mapping is performed in the intermediate layers. Finally, reconstruction is obtained via the final convolution layer. To a certain extent, this connection explains why a CNN has tremendous success in single-image superresolution. Jin et al. [27] observe the architectural similarity between the popular U-net [4] and the unfolded ISTA network. Since sparse coding techniques have demonstrated great success in many image reconstruction applications, such as CT reconstruction, this connection helps explain why U-net is a powerful tool in these domains, although the architecture was originally motivated under the context of semantic image segmentation.

Connections to Kalman filtering
Another line of research focuses on acceleration of neural network training by identifying its relationship with the extended Kalman filter (EKF). Singhal and Wu [62] demonstrated that neural network training can be regarded as a nonlinear dynamic system that may be solved by the EKF. Simulation studies show that the EKF converges much more rapidly than standard backpropagation. Puskorius and Feldkamp [60] propose a decoupled version of the EKF for the speedup and apply this technique to the training of recurrent neural networks. More details on establishing the connection between neural network training and the EKF are in “Neural Network Training Using the Extended Kalman Filter.” For a comprehensive review of techniques employing Kalman filters for network training, refer to [61].

Connections to differential equations and variational methods
Differential equations and variational methods are widely applied in numerous signal and image processing problems. Many practical systems of differential equations require
numerical methods for their solution, and various iterative algorithms have been developed. Theories around these techniques are extensive and well grounded, and hence it is interesting to explore the connections between these techniques and modern deep learning methods.

In [16], Chen and Pock adopt the unrolling approach to improve the performance of Perona–Malik anisotropic diffusion [65], a well-known technique for image restoration and edge detection. After generalizing the nonlinear diffusion model to handle nondifferentiability, they unroll the iterative

Neural Network Training Using the Extended Kalman Filter


The Kalman filter is a fundamental technique in signal processing that has a wide range of applications. It obtains the minimum mean-square error (MMSE) estimation of a system state by recursively drawing observed samples and updating the estimate. The extended Kalman filter (EKF) extends to the nonlinear case through iterative linearization. Previous studies [60] have revealed that the EKF can be employed to facilitate neural network training by realizing that neural network training is essentially a parameter estimation problem. More specifically, the training samples may be treated as observations, and, if the MSE loss is chosen, network training essentially performs MMSE estimation that is conditional on observations.

Let $\{(x_1, y_1), (x_2, y_2), \ldots, (x_N, y_N)\}$ be a collection of training pairs. We view the training samples as sequentially observed data following a time order. At time step $k$, when feeding $x_k$ into the neural network with parameters $w$, the network performs a nonlinear mapping $h_k(\cdot\,; w_k)$ and outputs an estimate $\hat{y}_k$ of $y_k$. This process can be formally described as the following nonlinear state-transition model:

$$w_{k+1} = w_k + \omega_k, \quad (S26)$$
$$y_k = h_k(x_k; w_k) + \upsilon_k, \quad (S27)$$

where $\omega_k$ and $\upsilon_k$ are zero-mean white Gaussian noises with covariance $E(\omega_k \omega_l^T) = \delta_{k,l} Q_k$ and $E(\upsilon_k \upsilon_l^T) = \delta_{k,l} R_k$, respectively. Here, $E$ is the expectation operator, and $\delta$ is the Kronecker delta function. In (S26), the noise $\omega$ is artificially added to avoid numerical divergence and poor local minima [60]. For a visual depiction, refer to Figure S4.

The state-transition models (S26) and (S27) are special cases of the state-space model of the EKF, and thus we can apply the EKF technique to sequentially estimate the network parameters $w_k$. To begin with, at $k = 0$, $\hat{w}_0$ and $P_0$ are initialized to certain values. At time step $k$ $(k \geq 0)$, the nonlinear function $h_k$ is linearized as

$$h_k(x_k; w_k) \approx h_k(x_k; \hat{w}_k) + H_k(w_k - \hat{w}_k), \quad (S28)$$

where $H_k = \partial h_k / \partial w_k \,\big|_{w_k = \hat{w}_k}$. For a neural network, $H_k$ is essentially the derivative of its output $\hat{y}_k$ across its parameters $w_k$ and therefore can be computed via backpropagation. The following recursion is then executed:

$$K_k = P_k H_k \big(H_k^T P_k H_k + R_k\big)^{-1},$$
$$\hat{w}_{k+1} = \hat{w}_k + K_k \big(y_k - \hat{y}_k\big),$$
$$P_{k+1} = P_k - K_k H_k^T P_k + Q_k, \quad (S29)$$

where $K_k$ is commonly called the Kalman gain. For details on deriving the update rules (S29), see [61, Ch. 1]. In summary, neural networks can be trained with the EKF by the following steps:
1) Initialize $\hat{w}_0$ and $P_0$.
2) For $k = 0, 1, \ldots$,
   a) feed $x_k$ into the network to obtain the output $\hat{y}_k$
   b) use backpropagation to compute $H_k$ in (S28)
   c) apply the recursion in (S29).
The matrix $P_k$ is the approximate error covariance matrix, which models the correlations between network parameters and thus delivers second-order derivative information, effectively accelerating the training speed. For example, in [62], it was shown that training a multilayer perceptron using the EKF requires an orders-of-magnitude-lower number of epochs than standard backpropagation.

In [61], some variants of the EKF training paradigm are discussed. The neural network represented by $h_k$ can be a recurrent network and trained in a similar fashion, and the noise covariance matrix $R_k$ is scaled to play a role similar to the learning rate adjustment. To reduce the computational complexity, a decoupling scheme is employed, which divides parameters $w_k$ into mutually exclusive groups and turns $P_k$ into a block-diagonal matrix.

[Figure S4 diagram: at each time step k, the training input x_k feeds the neural network h_k(·; w_k), whose output ŷ_k is compared with y_k under the MSE loss, while the parameters evolve as w_{k+1} = w_k + ω_k.]

FIGURE S4. The state-transition model for neural network training. The training data can be viewed as sequentially feeding through the neural network, and the network parameters can be viewed as system states.
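The sketch below walks through steps 1)–2) with NumPy for a tiny scalar-output network. For brevity, the Jacobian $H_k$ is approximated by finite differences instead of backpropagation, and the toy model, noise covariances, and sizes are illustrative assumptions rather than the setups used in [60]–[62].

```python
# A minimal sketch of EKF-based training (S26)-(S29) for a tiny scalar-output
# network. The Jacobian H_k (shape p x m) is obtained by finite differences
# purely for illustration; in practice it comes from backpropagation.
import numpy as np

def net(w, x):
    """Toy one-hidden-layer network; w packs a 4x2 weight matrix and a 4->1 readout."""
    W1 = w[:8].reshape(4, 2)
    w2 = w[8:12]
    return np.array([np.tanh(W1 @ x) @ w2])     # output dimension m = 1

def jacobian_T(w, x, eps=1e-6):
    """Numerical stand-in for backprop: H_k = (d net / d w)^T, shape (p, m)."""
    y0 = net(w, x)
    H = np.zeros((w.size, y0.size))
    for i in range(w.size):
        dw = np.zeros_like(w)
        dw[i] = eps
        H[i] = (net(w + dw, x) - y0) / eps
    return H

def ekf_step(w_hat, P, x, y, Q, R):
    y_hat = net(w_hat, x)
    H = jacobian_T(w_hat, x)
    K = P @ H @ np.linalg.inv(H.T @ P @ H + R)   # Kalman gain
    w_hat = w_hat + K @ (y - y_hat)              # parameter update
    P = P - K @ H.T @ P + Q                      # covariance update
    return w_hat, P

# Toy usage: fit y = sin(x1 + x2) from streaming samples.
rng = np.random.default_rng(0)
p = 12
w_hat, P = 0.1 * rng.standard_normal(p), np.eye(p)
Q, R = 1e-5 * np.eye(p), 1e-2 * np.eye(1)
for _ in range(200):
    x = rng.uniform(-1, 1, size=2)
    y = np.array([np.sin(x.sum())])
    w_hat, P = ekf_step(w_hat, P, x, y, Q, R)
```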

discrete partial differential equation (PDE) solver and optimize the filter coefficients and the regularization parameters through training. The trained system proves to be highly effective in various image reconstruction applications, such as image denoising, single-image superresolution, and JPEG deblocking.

Recently, Chen et al. [66] identified the residual layers inside the well-known ResNet [52] as one iteration of solving a discrete ordinary differential equation (ODE) by employing the explicit Euler method. As the time step decreases and the number of layers increases, the neural network output approximates the solution of the initial value problem represented by the ODE. Based on this finding, they replace the residual layer with an ODE solver and analytically derive associated backpropagation rules that enable supervised learning. In this way, they construct a network of “continuous” depth and achieve a higher parameter efficiency than the conventional ResNet.

The same idea can be applied to other deep learning techniques, such as normalizing flows [67]. Normalizing flows is a framework for generative modeling that essentially performs transformations of random variables belonging to relatively simple distributions to model complex probability distributions. Typically, the random variables are indexed by discrete time steps corresponding to a discrete set of network layers. By applying the continuation technique, the variable transformation becomes continuous in time, and the computation of the expensive log-determinant becomes unnecessary, leading to significant computational savings.

In physics, PDEs are frequently used to capture the dynamics of complex systems. By recasting a generic PDE into a trainable network, we can discover the underlying physical laws through training. Long et al. [68] adopt this principle by approximating differential operators as convolutional kernels and the nonlinear response function as a pointwise neural network. In doing so, the model inherits the predictive power of deep learning systems and the transparency of numerical PDEs. As a recent follow-up, Long et al. [69] impose more constraints on the learnable filters and introduce a symbolic neural network to approximate the unknown response function.

Connections to statistical inference and sampling
Statistical inference is broadly defined as the process of drawing conclusions about populations and scientific truths from data. Popular statistical inference techniques, such as linear and graphical models, Bayesian inference, and support vector machines, have demonstrated tremendous successes and effectiveness in a variety of practical domains. High-impact signal processing and machine learning applications that involve statistical inference include signal classification, image reconstruction, representation learning, and more.

The Markov random field (MRF), as one of the most important graphical models, has been broadly applied in various image reconstruction and labeling tasks. In its underlying graph representation, pixels in an image are generally considered as graph nodes, whereas their interactions are captured by graph edges. Traditionally, for tractability, only local interactions between spatially neighboring pixels are modeled, at the sacrifice of model generality and representation accuracy. Sun and Tappen [70] propose to involve nonlocal neighbors by grouping similar patches. The generalized MRF model is called nonlocal range MRF (NLR–MRF). The inference of the NLR–MRF can be carried out by K steps of gradient descent procedures across the energy function, which can be unrolled into a K-layer deep network. The output of the network, as a function of the model parameters, is then plugged into the (empirical) loss function. In this way, end-to-end training can be performed to optimize the model parameters. In [70], this technique proves its effectiveness in image inpainting and denoising. Specifically, the NLR–MRF demonstrates clear improvements over methods that merely capture local interactions, and it shows on-par performance with state-of-the-art methods.

In a similar spirit, Stoyanov et al. [71] and Domke [72] adopt the message passing algorithm, a dedicated algorithm for inference on graphical models, in the inference stage, as opposed to gradient descent. Truncating the message passing algorithm by executing only finite iterations can be regarded as performing approximate inference, and it has the potential benefit of computational savings. From a conceptual perspective, Domke [73] considers abstract optimization techniques for energy minimization and focuses on a scenario where the optimization algorithm runs a fixed number of steps. Compared with the traditional implicit differentiation approach, this scheme has a computational advantage for large-scale problems because the Hessian matrices need not be computed. For concrete examples, Domke studies and compares gradient descent and heavy-ball and limited-memory Broyden–Fletcher–Goldfarb–Shanno algorithms for image labeling and denoising applications.

The expectation–maximization (EM) algorithm is one of the best known and widely used techniques in statistical inference. An EM algorithm is an iterative method to find maximum likelihood and maximum a posteriori estimates of parameters in statistical models, particularly mixture models. For the unsupervised perceptual grouping of image objects, Greff et al. [74] model the image as a parametrized spatial mixture of K components. They plug in a neural network as a transformer that maps the mixture parameters to probability distributions and hence facilitates spatially varying conditional distributions of image pixels. By employing the EM algorithm and unrolling it into a recurrent network, an end-to-end differentiable clustering procedure, called neural EM (N-EM), is obtained. A dedicated training technique, referred to as RNN–EM, is also developed. After training, N-EM is capable of learning how to group pixels according to constituent objects, promoting localized representation for individual entities.

In the generative model setting, when the underlying data distribution is supported on low-dimensional manifolds (a common phenomenon for natural signals), it has been recognized that entropic metrics induced by maximum likelihood estimation are fundamentally flawed in principle and perform poorly in practice. To overcome this issue, metrics based on optimal transport (OT) have become popular choices when measuring the distances between probability distributions.

As an early example, the Wasserstein distance is used as the loss function in a generative adversarial network (GAN) in the seminal work of Arjovsky et al. [75]. However, such metrics are typically defined in variational forms instead of closed forms, and calculating their derivatives can be problematic. To a large extent, this limitation hinders applications of gradient-based learning techniques.

Recently, algorithm unrolling has become a crucial technique for the efficient minimization of OT-based metrics. In particular, Genevay et al. [76] discretize the OT-based loss by drawing samples and regularize it with an entropy penalty. The approximate loss is typically known as the Sinkhorn loss and can be computed by the Sinkhorn algorithm. Genevay et al. further approximate it by iterating only L steps and unrolling the Sinkhorn algorithm into L network layers. Because each iteration of the Sinkhorn algorithm is differentiable, the entire network can be trained end to end. In a similar spirit, Patrini et al. [77] employ the Wasserstein distance on the latent space of an autoencoder and approximate it by L layers of Sinkhorn iterations. The autoencoder, in combination with these layers, is called the Sinkhorn autoencoder (SAE). Patrini et al. further corroborate the approximation scheme through theoretical

Convergence and Optimality Analysis of the Learned Iterative Shrinkage and Thresholding Algorithm

Although it is shown in [13] that the learned iterative shrinkage and thresholding algorithm (LISTA) achieves an empirically higher efficiency than the ISTA through training, several conceptual issues remain to be addressed. First, the LISTA does not exhibit superior performance over the ISTA, not even under particular scenarios. Second, the convergence rate of the LISTA is unknown. Third, the LISTA actually differs from the ISTA by introducing artificial parameter substitutions, and, finally, the optimal parameters are learned from data, and it is difficult to have a sense of what the parameters look like.

To address these open issues, several recent theoretical studies have been conducted. A common assumption is that there exists a sparse code $x^\star \in \mathbb{R}^m$ that approximately satisfies the linear model $y \approx W x^\star$, where $W \in \mathbb{R}^{n \times m}$ and $m > n$. More specifically, it is commonly assumed that $\|x^\star\|_0 \leq s$ for some positive integer $s$, where $\|\cdot\|_0$ counts the number of nonzero entries. Xin et al. [22] examine a closely related sparse coding problem:

$$\min_x\ \frac{1}{2}\|y - Wx\|_2^2 \quad \text{subject to} \quad \|x\|_0 \leq k, \quad (S30)$$

where $k$ is a predetermined integer to control the sparsity level of $x$. In [22], a network is constructed by unrolling the iterative hard-thresholding (IHT) algorithm, which has a form similar to the ISTA. At layer $l$, the following iteration is performed:

$$x^{l+1} = \mathcal{H}_k\big\{W_t x^l + W_e y\big\}, \quad (S31)$$

where $\mathcal{H}_k$ is the hard-thresholding operator, which keeps the $k$ coefficients of the largest magnitude and zeroes out the rest. Xin et al. [22] proved that, for an IHT-based network to recover $x^\star$, it must be the case that

$$W_t = I - CW, \quad (S32)$$

for some matrix $C$, which implies that the implicit variable substitution $W_t = I - (1/\mu) W^T W$ and $W_e = (1/\mu) W^T$ may not play as big a role as it may seem, as long as the network is properly trained so that it acts as a generic sparse recovery algorithm. Furthermore, Xin et al. showed that, with some modifications, such as using layer-specific parameters, the learned network can recover the sparse code, even when $W$ admits correlated columns, a scenario known to be particularly challenging for traditional iterative sparse coding algorithms.

Chen et al. [24] perform a similar analysis on a LISTA with layer-specific parameters; i.e., in layer $l$, the parameters $(W_t^l, W_e^l, \lambda^l)$ are used instead. Similar to Xin et al. [22], the authors proved that, under certain mild assumptions, whenever the LISTA recovers $x^\star$, the following weight coupling scheme must be asymptotically satisfied:

$$W_t^l - \big(I - W_e^l W\big) \to 0, \quad \text{as } l \to \infty,$$

which shows that the implicit variable substitutions may be inconsequential in an asymptotic sense. Therefore, the authors adopted the coupled parameterization scheme

$$W_t^l = I - W_e^l W,$$

and proved that the resulting network recovers $x^\star$ at a linear rate if the parameters $(W_e^k, \lambda^k)_{k=1}^{\infty}$ are appropriately selected. They further integrate a support selection scheme into the network. The network thus has both weight coupling and support selection structures and is called LISTA-coupling weight support selection (CPSS).

Liu et al. [23] extend the work of Chen et al. [24] by analytically characterizing the optimal weights $W_e^k$ for the LISTA-CPSS. They proved that, under certain regularity conditions, a linear convergence rate can be achieved if $(W_e^k, \lambda^k)_k$ are chosen in a specific form. This implies that a network with analytic parameters can be asymptotically as efficient as the trained version. Although the analytic forms may be nontrivial to compute in practice, their analysis helps to dramatically reduce the number of network parameters.
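For concreteness, the following NumPy sketch runs the unrolled IHT iteration (S31) with the coupled parameterization $W_t = I - W_e W$. The layer matrices are simply initialized to $(1/\mu)W^T$ rather than trained, and the problem sizes and random dictionary are illustrative assumptions.

```python
# A small NumPy sketch of the unrolled IHT layer (S31) with the weight-coupling
# structure W_t = I - W_e W of (S32). In a trained network, each W_e would be a
# learned, layer-specific matrix.
import numpy as np

def hard_threshold(x, k):
    """Keep the k largest-magnitude entries of x and zero out the rest."""
    out = np.zeros_like(x)
    idx = np.argsort(np.abs(x))[-k:]
    out[idx] = x[idx]
    return out

def unrolled_iht(y, W, W_e_list, k):
    """Run one layer per matrix in W_e_list, using the coupling W_t = I - W_e W."""
    m = W.shape[1]
    x = np.zeros(m)
    for W_e in W_e_list:
        W_t = np.eye(m) - W_e @ W
        x = hard_threshold(W_t @ x + W_e @ y, k)
    return x

# Toy usage: recover a 5-sparse code from n = 64 measurements and m = 128 atoms.
rng = np.random.default_rng(0)
n, m, k, L = 64, 128, 5, 10
W = rng.standard_normal((n, m)) / np.sqrt(n)
x_true = np.zeros(m)
x_true[rng.choice(m, k, replace=False)] = rng.standard_normal(k)
y = W @ x_true
mu = np.linalg.norm(W, 2) ** 2                    # step-size constant
W_e_list = [W.T / mu for _ in range(L)]           # untrained, layer-specific copies
x_hat = unrolled_iht(y, W, W_e_list, k)
```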

analysis and experimentally verify the superior efficiency of the Sinkhorn algorithm over the exact Hungarian algorithm. In unsupervised representation learning experiments, the SAE generates samples of a higher quality than other variants of autoencoders, such as the variational autoencoder [78] and the Wasserstein autoencoder [79].

Selected theoretical studies
Although the LISTA successfully achieves a higher efficiency than its iterative counterparts, it does not necessarily recover a more accurate sparse code compared to the iterative algorithms, and a thorough theoretical analysis of its convergence behavior is yet to be developed. Xin et al. [22] study the unrolled iterative hard thresholding (IHT) algorithm, which has been widely applied in ℓ0 norm-constrained estimation problems and resembles the ISTA to a large extent. The unrolled network is capable of recovering sparse signals from dictionaries with coherent columns. Furthermore, the authors analyze the optimality criteria for the network to recover the sparse code and verify that the network can achieve a linear convergence rate under appropriate training.

In a similar fashion, Chen et al. [24] establish a linear convergence guarantee for the unrolled ISTA network. They also derive a weight coupling scheme similar to [22]. As a follow-up, Liu et al. [23] analytically characterize optimal network parameters by imposing mutual incoherence conditions on the network weights. Analytical derivation of the optimal parameters helps reduce the parameter dimensionality to a large extent. Furthermore, the authors demonstrate that a network with analytic parameters can be as effective as a network trained completely from data. For more details on the theoretical studies around LISTA, refer to “Convergence and Optimality Analysis of the Learned Iterative Shrinkage and Thresholding Algorithm.”

Papyan et al. [80] interpret CNNs as executing finite iterations of the multilayer convolutional sparse coding (ML-CSC) algorithm. In other words, a CNN can be viewed as an unrolled ML-CSC algorithm. In this interpretation, the convolution operations naturally emerge out of a convolutional sparse representation, with the commonly used soft-thresholding operation viewed as a symmetrized ReLU. The authors also analyze the ML-CSC problem and offer theoretical guarantees, such as the uniqueness of the multilayer sparse representation, the stability of the solutions under small perturbations, and the effectiveness of ML-CSC in terms of sparse recovery. In a recent follow-up work [81], they further propose dedicated iterative optimization algorithms for solving the ML-CSC problem and demonstrate superior efficiency over other conventional algorithms, such as the ADMM and the fast ISTA, for solving the multilayer bias pursuit problem.

Perspectives and recent trends
We reflect on the remarkable effectiveness of algorithm unrolling in the “Distilling the Power of Algorithm Unrolling” section. Recent trends and current concerns regarding algorithm unrolling are discussed in the “Trends: Expanding the Application Landscape and Addressing Implementation Concerns” section. We contrast algorithm unrolling with alternatives and discuss their relative merits and drawbacks in the “Alternative Approaches” section.

Distilling the power of algorithm unrolling
During recent years, algorithm unrolling has proved to be highly effective in achieving superior performance and a higher efficiency in many practical domains. A question that naturally arises is, Why is it so powerful? Figure 11 provides a high-level illustration of how algorithm unrolling can be advantageous compared with both traditional iterative algorithms and generic neural networks, from a functional approximation perspective. By parameter tuning and customization, a traditional iterative algorithm spans a relatively small subset of the functions of interest and thus has limited representation power. Consequently, it is capable of approximating a given target function reasonably well while still leaving some gaps that undermine its performance in practice. Nevertheless, iterative algorithms generalize relatively well in limited training scenarios. From a statistical learning perspective, iterative algorithms correspond to models with a high bias and a low variance.

[Figure 11 diagram: nested ellipses depicting the sets of functions reachable by an iterative algorithm, an unrolled network, and a generic neural network, with the target function lying closest to the unrolled-network set.]

FIGURE 11. A high-level unified interpretation of algorithm unrolling from a functional approximation perspective. The ellipse shapes (the sets shaded in blue, green, and gray) depict the scope of functions that can be approximated by each category of methods. Compared with iterative algorithms, which have limited representation power and usually underfit the target, unrolled networks usually better approximate the target, thanks to their higher representation power. On the other hand, unrolled networks have less representation power than generic neural networks, but they usually generalize better in practice, hence providing an attractive balance.

Table 3. Feature comparisons among iterative algorithms, generic deep networks, and unrolled networks.

Technique               Performance   Efficiency   Parameter dimensionality   Interpretability   Generalizability
Iterative algorithms    Low           Low          Low                        High               High
Generic deep networks   High          High         High                       Low                Low
Unrolled networks       High          High         Middle                     High               Middle
On the other hand, a generic neural network is capable of more accurately approximating the target function because of its universal approximation capability. Nevertheless, as it typically consists of an enormous number of parameters, it constitutes a large subset in the function space. Therefore, when performing network training, the search space becomes large, and training is a major challenge. The high dimensionality of the parameters also requires abundant training samples, and generalization becomes an issue. Furthermore, the network efficiency may also suffer as the network size increases. Generic neural networks are essentially models with a high variance and a low bias.

In contrast, the unrolled network, by expanding the capacity of iterative algorithms, can approximate the target function more accurately while spanning a relatively small subset in the function space. The reduced size of the search space alleviates the training burden and the requirement for large-scale training data sets. Since iterative algorithms are carefully developed based on domain knowledge and already provide a reasonably accurate approximation of the target function, by extending them and training them from real data, unrolled networks can often obtain a highly accurate approximation of the target functions. As an intermediate state between generic networks and iterative algorithms, unrolled networks typically have a relatively low bias and variance simultaneously. Table 3 summarizes some features of iterative algorithms, generic networks, and unrolled networks.

Trends: Expanding the application landscape and addressing implementation concerns
A continuous trend during recent years is to explore more general underlying iterative algorithms. Earlier unrolling approaches were centered around the ISTA algorithm [13], [27], [26], while, lately, other alternatives have been pursued, such as proximal splitting [64], the ADMM [14], and half-quadratic splitting [15], to name a few. For instance, Metz et al. [82] unroll the ADMM optimizer [83] to stabilize GAN training, while Diamond et al. [84] propose a general framework for unrolled optimization. Consequently, a growing number of unrolling approaches as well as novel unrolled network architectures appears in topical publications.

In addition to expanding the methodology, researchers are broadening the application scenarios of algorithm unrolling. For instance, in communications, Samuel et al. [85] propose a deep network, known as DetNet, based on unrolling the projected gradient descent algorithm for least-squares recovery. In multiple-input, multiple-output detection tasks, DetNet achieves a performance similar to a detector based on semidefinite relaxation, and it is 30 times faster. Further, DetNet exhibits promising performance in handling ill-conditioned channels and is more robust than the approximate message passing-based detector, as it does not require knowledge of the noise variance. More examples of unrolling techniques in communications can be found in [86] and [87].

From an optimal control viewpoint, Li et al. [88] interpret deep neural networks as dynamic systems and recast the network training procedure as solving an optimal control problem. By analyzing the corresponding Pontryagin's maximum principle, the authors devise a novel network training algorithm. Compared with conventional gradient-based methods, the proposed algorithm has a faster initial convergence and is resilient against stalling in flat landscapes. The principles of optimal control have also inspired researchers in the design of real-world imaging systems. In such an approach to image restoration, Zhang et al. [41] argue that different endpoints must be chosen when handling images with various degradation levels. To this end, they introduce a dedicated policy network for predicting an endpoint. The policy network is essentially a convolutional RNN. The estimated endpoint is used to govern the termination of the restoration network. Both networks interplay and are trained under a reinforcement learning framework. The entire model is thus called the dynamically unfolding recurrent restorer (DURR). Experiments on blind image denoising and JPEG deblocking verify that the DURR is capable of delivering higher quality reconstructed images that have sharper details and that it generalizes better when the degradation levels vary or are unseen in the training data sets, compared with its competitors. Furthermore, the DURR has significantly fewer parameters and a higher runtime efficiency.

Unrolled networks can share parameters across all layers, and they can carry over layer-specific parameters. In the former case, the networks are typically more parameter efficient. However, how to effectively train a network is a challenge because unrolled networks essentially resemble RNNs and may similarly suffer from gradient explosion and vanishing problems. In the latter case, the networks slightly deviate from the original iterative algorithm and may not completely inherit the algorithm's theoretical benefits, such as convergence guarantees. However, the networks can have enhanced representation power and adapt to real-world scenarios more accurately. The training may also be much easier compared to RNNs. During recent years, a growing number of unrolling techniques has enabled the parameters to vary from layer to layer.

An interesting concern relates to the deployment of neural networks on resource-constrained platforms, such as digital single-lens reflex cameras and mobile devices. The heavy storage demand renders many top-performing deep networks impractical, while straightforward network compression usually significantly deteriorates the networks' performance. Therefore, in addition to computational efficiency, researchers today are paying increasing attention to parameter efficiency, which further motivates interest in algorithm unrolling.

Finally, there are other factors to be considered when constructing unrolled networks. In particular, many iterative algorithms, when unrolled straightforwardly, may introduce highly nonlinear and/or nonsmooth operations, such as hard thresholding. Therefore, it is usually desirable to design algorithms whose iteration procedures are either smooth or can be well approximated by smooth operators. Another aspect relates to the network depth. Although deeper networks offer higher
representation power, they are generally harder to train in learning methods, including a deep neural network. Ryu et al.
practice [52]. Indeed, techniques such as stacked pretraining [96] prove that, when the denoisers satisfy certain Lipschitz
have been frequently employed in existing algorithm unrolling conditions, replacing the proximal operators with denoisers
approaches to overcome the training difficulty to some extent. leads to convergence for some popular optimization algorithms,
Taking this into account, iterative algorithms that have a faster such as the ADMM and forward–backward splitting. Based
convergence rate and simpler iterative procedures are gener- on this theoretical result, they also developed a technique to
ally considered more often. enforce the Lipschitz conditions when training the denoisers.
This technique has the advantage of inheriting knowledge
Alternative approaches about conventional deep networks, such as network architec-
Besides algorithm unrolling, there are other approaches for tures, training algorithms, initialization schemes, and so forth.
characterizing and enhancing the interpretability of deep net- In addition, in practice, this technique can effectively comple-
works. The initial motivation behind neural networks is to ment the limitations of iterative algorithms. For instance,
model the behavior of the human brain. Traditionally, neural Shlezinger et al. [95] demonstrated that, by replacing part of the
networks are often interpreted from a neurobiological per- Viterbi algorithm with a neural network, full knowledge of the
spective. However, discrepancies between the actual human statistical relationship between channel input and output is no
brain and artificial neural networks have constantly been ob- longer necessary. Therefore, the resulting algorithm achieves a
served. During recent years, there have been other interest- higher robustness and better performance under model imper-
ing works focusing on identifying and quantifying network fections. Nevertheless, the procedures themselves are still
interpretability by analyzing the correlations between neuron approximated abstractly via conventional neural networks.
activations and human perception. One such example is the
emerging technique called network dissection [89], which Conclusions
studies how neurons capture semantic objects in a scene and In this article, we provided an extensive review of algorithm
how state-of-the art networks internally represent high-level unrolling, starting with the LISTA as a basic example. We then
visual concepts. showcased practical applications of unrolling in various real-
Specifically, Zhou et al. [89] analyze neuron activations on world signal and image processing problems. In many applica-
pixel-level annotated data sets and quantify the network tion domains, the unrolled interpretable deep networks offer
interpretability by correlating the neuron activations with state-of-the art performance and achieve a high computational
ground-truth annotations. Bau et al. [90] extend this tech- efficiency. From a conceptual standpoint, algorithm unroll-
nique to GANs. These works complement algorithm unroll- ing also helps reveal the connections between deep learning
ing by offering visual and biological interpretations of deep and other important categories of approaches that are widely
networks. However, they are mainly focused on characterizing applied for solving signal and image processing problems.
the interpretability of existing networks and are less effective Although algorithm unrolling is a promising technique to
at connecting neural networks with traditional iterative algo- build efficient and interpretable neural networks and has al-
rithms and motivating novel network architectures. ready achieved success in many domains, it is still evolving.
Another closely related technique is to employ a conventional deep network as a drop-in replacement of certain procedures in an iterative algorithm. Figure 12 illustrates this technique. The universal approximation theorem [91] justifies the use of neural networks to approximate algorithmic procedures, as long as such procedures can be represented as continuous mappings. For instance, in [59], Metzler et al. observe that one step of the ISTA may be treated as a denoising procedure and hence can be replaced by a denoising CNN. The same approach applies to approximate message passing [92], an extension of the ISTA. Meinhardt et al. [93] replace the proximal operator of the regularization used in many convex energy minimization algorithms with a denoising neural network. In this way, the neural network acts as an implicit regularizer in many inverse problems and, equivalently, as a natural image prior. The denoising neural network can then be employed in different applications, alleviating the need for problem-specific training.

In a similar fashion, in [94], Gupta et al. replace the projection procedure in projected gradient descent with a denoising CNN. Shlezinger et al. [95] substitute the evaluation of the log likelihood in the Viterbi algorithm with dedicated machine learning methods, so that explicit knowledge of the statistical relationship between channel input and output is no longer necessary. Therefore, the resulting algorithm achieves higher robustness and better performance under model imperfections. Nevertheless, the procedures themselves are still approximated abstractly via conventional neural networks.

[Figure 12: (a) a generic iterative algorithm with input x^0 and output x^L that repeats the steps y^l ← f(x^l), an intermediate update producing z^l, and x^(l+1) ← h(z^l) for l = 1, 2, ..., L − 1; (b) the intermediate update is carried out by a conventional neural network.]
FIGURE 12. An alternative approach to algorithm unrolling is to replace one step of the (a) iterative algorithm with (b) an intact conventional neural network.
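For concreteness, the following PyTorch-style sketch mimics this plug-in strategy on a toy linear inverse problem: gradient steps on the data-fidelity term are interleaved with a small CNN that takes the place of ISTA's soft-thresholding (denoising) step, loosely in the spirit of [59] and [93]. The blur operator, denoiser architecture, step size, and iteration count are illustrative assumptions, and the denoiser is left untrained here purely to keep the example self-contained.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class SmallDenoiser(nn.Module):
    """A small CNN standing in for the shrinkage/denoising step of ISTA."""
    def __init__(self, channels=1):
        super().__init__()
        self.body = nn.Sequential(
            nn.Conv2d(channels, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, 32, 3, padding=1), nn.ReLU(),
            nn.Conv2d(32, channels, 3, padding=1),
        )

    def forward(self, x):
        return x - self.body(x)  # residual denoising: predict and remove the noise

def pnp_ista(y, A, At, denoiser, step=0.5, num_iters=20):
    """ISTA-like iterations where a learned denoiser replaces the proximal step."""
    x = At(y)  # crude initialization from the adjoint
    for _ in range(num_iters):
        grad = At(A(x) - y)            # gradient of 0.5 * ||A x - y||^2
        x = denoiser(x - step * grad)  # CNN plays the role of soft thresholding
    return x

if __name__ == "__main__":
    # Toy forward model: a fixed 5x5 box blur; the kernel is symmetric, so the
    # same operator is reused as an (approximate) adjoint.
    kernel = torch.ones(1, 1, 5, 5) / 25.0
    A = lambda v: F.conv2d(v, kernel, padding=2)
    y = A(torch.randn(1, 1, 64, 64))
    with torch.no_grad():
        x_hat = pnp_ista(y, A, A, SmallDenoiser(), step=0.5, num_iters=10)
    print(x_hat.shape)
```

In practice, the denoiser would be trained on a generic denoising task beforehand and then reused across different inverse problems, which is precisely the appeal noted above.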

Conclusions
In this article, we provided an extensive review of algorithm unrolling, starting with the LISTA as a basic example. We then showcased practical applications of unrolling in various real-world signal and image processing problems. In many application domains, the unrolled interpretable deep networks offer state-of-the-art performance and achieve a high computational efficiency. From a conceptual standpoint, algorithm unrolling also helps reveal the connections between deep learning and other important categories of approaches that are widely applied for solving signal and image processing problems. Although algorithm unrolling is a promising technique to build efficient and interpretable neural networks and has already achieved success in many domains, it is still evolving. We conclude this article by discussing limitations and open challenges related to algorithm unrolling and suggest possible directions for future research.

Proper training of unrolled networks
Unrolling techniques provide a powerful principled framework for constructing interpretable and efficient deep networks; nevertheless, the full potential of unrolled networks can be exploited only when the networks are appropriately trained.
Compared to popular conventional networks (CNNs and autoencoders), unrolled networks usually exhibit customized structures. Therefore, existing training schemes may not work well. In addition, unrolled networks sometimes have shared parameters among different layers and thus resemble RNNs, which are well known to be difficult to train [97]. Thus, many existing works apply greedy layer-wise pretraining. The development of well-grounded end-to-end training schemes for unrolled networks continues to be a topic of great interest.
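To make the notion of greedy layer-wise pretraining more tangible, here is a rough, hypothetical sketch that grows a LISTA-style unrolled network one iteration at a time, fitting each truncated network on training pairs before appending the next layer. The layer parameterization, loss, optimizer, and the choice to keep refining earlier layers are all illustrative assumptions rather than a recipe taken from a specific paper.

```python
import torch
import torch.nn as nn

class UnrolledLayer(nn.Module):
    """One unrolled ISTA-like iteration with trainable weights and threshold."""
    def __init__(self, m, n):
        super().__init__()
        self.We = nn.Linear(m, n, bias=False)  # measurement-to-code map
        self.S = nn.Linear(n, n, bias=False)   # code-to-code transition
        self.theta = nn.Parameter(torch.tensor(0.1))

    def forward(self, x, y):
        z = self.We(y) + self.S(x)
        return torch.sign(z) * torch.relu(z.abs() - self.theta)  # soft threshold

def pretrain_layerwise(y, x_true, m, n, num_layers=3, epochs=100, lr=1e-3):
    """Greedily add unrolled layers, fitting the truncated network at each stage."""
    layers = nn.ModuleList()
    for _ in range(num_layers):
        layers.append(UnrolledLayer(m, n))
        opt = torch.optim.Adam(layers.parameters(), lr=lr)
        for _ in range(epochs):
            x = torch.zeros(y.shape[0], n)
            for layer in layers:               # forward through the current truncation
                x = layer(x, y)
            loss = nn.functional.mse_loss(x, x_true)
            opt.zero_grad()
            loss.backward()
            opt.step()
    return layers

if __name__ == "__main__":
    # Toy sparse-coding data: y = A x with sparse nonnegative codes x.
    torch.manual_seed(0)
    m, n, batch = 20, 50, 128
    A = torch.randn(m, n) / m ** 0.5
    x_true = torch.relu(torch.randn(batch, n) - 1.5)
    y = x_true @ A.T
    model = pretrain_layerwise(y, x_true, m, n)
    print(len(model), "unrolled layers pretrained")
```

After such pretraining, the full cascade would typically be fine-tuned end to end, which is exactly where the open questions about well-grounded training schemes arise.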
An issue of paramount importance is how to initialize the network. While there are well-studied methods for initializing conventional networks [2], [98], how to systematically transfer such knowledge to customized unrolled networks remains a challenge. In addition, how to prevent vanishing and exploding gradients during training is another important matter. Developing equivalents to, and counterparts of, established practices, such as batch normalization [44] and residual learning [52], for unrolled networks is a viable research direction.
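On the initialization question, one commonly used option for unrolled sparse-coding networks is to start the trainable parameters at the values prescribed by the underlying iterative algorithm, so that the untrained network already reproduces a truncated ISTA; this mirrors how the LISTA parameters relate to the ISTA quantities [13]. The sketch below computes such ISTA-derived initial values; the dictionary A and the sparsity weight are assumed inputs for the example.

```python
import torch

def ista_initialization(A, lam=0.1):
    """ISTA-derived starting values for the trainable parts of a LISTA-style layer.

    A   : (m, n) dictionary / measurement matrix (assumed example input).
    lam : sparsity weight of the underlying LASSO objective.
    """
    L = float(torch.linalg.matrix_norm(A, ord=2)) ** 2  # Lipschitz constant of the gradient
    We = A.T / L                                        # W_e = (1/L) A^T
    S = torch.eye(A.shape[1]) - (A.T @ A) / L           # S = I - (1/L) A^T A
    theta = lam / L                                     # soft-threshold level = lambda / L
    return We, S, theta

if __name__ == "__main__":
    A = torch.randn(20, 50) / 20 ** 0.5
    We, S, theta = ista_initialization(A)
    print(We.shape, S.shape, theta)
```

Copying these values into the corresponding trainable parameters before end-to-end training gives the network a sensible operating point, although, as noted above, a systematic theory of initialization for general unrolled architectures is still missing.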
Bridging the gap between theory and practice
While substantial progress has been achieved toward understanding network behavior through unrolling, more work needs to be done to thoroughly understand the mechanism. Although the effectiveness of some networks for image reconstruction tasks has been explained to some extent by drawing parallels to sparse coding algorithms, it is still mysterious why state-of-the-art networks perform well on various recognition tasks. Furthermore, unfolding itself is not uniquely defined. For instance, there are multiple ways to choose the underlying iterative algorithms, decide what parameters become trainable and what parameters to fix, and more. A formal study of how these choices affect convergence and generalizability can provide valuable insights and practical guidance. Another interesting direction is to develop a theory that provides guidance for practical applications. For instance, it is interesting to perform analyses that guide practical network design choices, such as the dimensions of parameters, the network depth, and so on. It is particularly interesting to identify factors that have a high impact on network performance.
Improving the generalizability
One of the critical limitations of common deep networks is their lack of generalizability, i.e., severe performance degradations when operating on data sets that are significantly different from those used during training. Compared with neural networks, iterative algorithms usually generalize better, and it is interesting to explore how to maintain this property in unrolled networks. Preliminary investigations have experimentally shown an improved generalization of unrolled networks in a few cases [39], but a formal theoretic understanding remains elusive and is highly desirable. From an impact standpoint, this line of research may provide newer additions to approaches for semisupervised/unsupervised learning and offer practical benefits when training data are limited and when working with resource-constrained platforms.

Authors
Vishal Monga ([email protected]) received his Ph.D. degree in electrical engineering from the University of Texas at Austin. Currently, he is a professor in the School of Electrical Engineering and is a member of the electrical engineering and computer science faculty at Pennsylvania State University, University Park, Pennsylvania, 16802, USA. He is an elected member of the IEEE Image, Video, and Multidimensional Signal Processing Technical Committee and a senior area editor of IEEE Signal Processing Letters, and he has served on the editorial boards of IEEE Transactions on Image Processing, IEEE Signal Processing Letters, and IEEE Transactions on Circuits and Systems for Video Technology. He is a recipient of the U.S. National Science Foundation CAREER Award and a 2016 Joel and Ruth Spira Teaching Excellence Award. His research interests include optimization-based methods with applications in signal and image processing, learning, and computer vision. He is a Senior Member of IEEE.

Yuelong Li ([email protected]) received his Ph.D. degree in electrical engineering from Pennsylvania State University, State College, in 2018. He is currently an applied scientist at Amazon Lab 126, San Jose, California, 94089, USA. His research interests include computational photography and 3D modeling, with a focus on nonlinear programming techniques and, more recently, deep learning techniques. He is a Member of IEEE.

Yonina C. Eldar ([email protected]) received B.Sc. degrees in physics and in electrical engineering, both from Tel-Aviv University, Israel, in 1995 and 1996, respectively, and her Ph.D. degree in electrical engineering and computer science from the Massachusetts Institute of Technology (MIT), Cambridge, in 2002. She is a professor in the Department of Math and Computer Science, Weizmann Institute of Science, Rehovot, 7610001, Israel, where she heads the Center for Biomedical Engineering and Signal Processing. She is also a visiting professor at MIT and the Broad Institute and an adjunct professor at Duke University, Durham, North Carolina, USA. She is the editor-in-chief of Foundations and Trends in Signal Processing, a member of several IEEE technical and award committees, a member of the Israel Academy of Sciences and Humanities, and a European Association for Signal Processing fellow. She has received awards including the IEEE Signal Processing Society Technical Achievement Award, the IEEE/AESS Fred Nathanson Memorial Radar Award, and the IEEE Kiyo Tomiyasu Award. She is a Fellow of IEEE.

References
[1] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet classification with deep convolutional neural networks,” in Proc. 25th Int. Conf. Neural Information Processing Systems, 2012, pp. 1097–1105. doi: 10.5555/2999134.2999257.
[2] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: Surpassing human-level performance on ImageNet classification,” in Proc. IEEE Int. Conf. Computer Vision, Dec. 2015, pp. 1026–1034. doi: 10.1109/ICCV.2015.123.

[3] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A large-scale hierarchical image database,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2009, pp. 248–255. doi: 10.1109/CVPR.2009.5206848.
[4] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Medical Image Computing and Computer Assisted Intervention, 2015, pp. 234–241. doi: 10.1007/978-3-319-24574-4_28.
[5] N. Ibtehaz and M. S. Rahman, “MultiResUNet: Rethinking the U-Net architecture for multimodal biomedical image segmentation,” Neural Netw., vol. 121, pp. 74–87, Jan. 2020. doi: 10.1016/j.neunet.2019.08.025.
[6] G. Nishida, A. Bousseau, and D. G. Aliaga, “Procedural modeling of a building from a single image,” Comput. Graph. Forum, vol. 37, no. 2, pp. 415–429, 2018. doi: 10.1111/cgf.13372.
[7] M. Tofighi, T. Guo, J. K. P. Vanamala, and V. Monga, “Prior information guided regularized deep learning for cell nucleus detection,” IEEE Trans. Med. Imag., vol. 38, no. 9, pp. 2047–2058, Sept. 2019. doi: 10.1109/TMI.2019.2895318.
[8] T. Guo, H. Seyed Mousavi, and V. Monga, “Adaptive transform domain image super-resolution via orthogonally regularized deep networks,” IEEE Trans. Image Process., vol. 28, no. 9, pp. 4685–4700, Sept. 2019. doi: 10.1109/TIP.2019.2913500.
[9] Y. Chen, Y. Tai, X. Liu, C. Shen, and J. Yang, “FSRNet: End-to-end learning face super-resolution with facial priors,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 2492–2501.
[10] M. Garnelo, J. Schwarz, D. Rosenbaum, F. Viola, D. J. Rezende, S. M. A. Eslami, and Y. W. Teh, “Neural processes,” 2018, arXiv:1807.01622.
[11] S. Sun, G. Zhang, J. Shi, and R. Grosse, “Functional variational Bayesian neural networks,” 2019, arXiv:1903.05779.
[12] A. Lucas, M. Iliadis, R. Molina, and A. K. Katsaggelos, “Using deep neural networks for inverse problems in imaging: Beyond analytical methods,” IEEE Signal Process. Mag., vol. 35, no. 1, pp. 20–36, 2018. doi: 10.1109/MSP.2017.2760358.
[13] K. Gregor and Y. LeCun, “Learning fast approximations of sparse coding,” in Proc. Int. Conf. Machine Learning, 2010, pp. 399–406. doi: 10.5555/3104322.3104374.
[14] Y. Yang, J. Sun, H. Li, and Z. Xu, “ADMM-CSNet: A deep learning approach for image compressive sensing,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 3, pp. 521–538, 2020. doi: 10.1109/TPAMI.2018.2883941.
[15] Y. Li, M. Tofighi, V. Monga, and Y. C. Eldar, “An algorithm unrolling approach to deep image deblurring,” in Proc. IEEE Int. Conf. Acoustics, Speech, and Signal Processing, 2019, pp. 7675–7679. doi: 10.1109/ICASSP.2019.8682542.
[16] Y. Chen and T. Pock, “Trainable nonlinear reaction diffusion: A flexible framework for fast and effective image restoration,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 39, no. 6, pp. 1256–1272, 2017. doi: 10.1109/TPAMI.2016.2596743.
[17] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier neural networks,” in Proc. Int. Conf. Artificial Intelligence and Statistics, 2011, pp. 315–323.
[18] Y. A. LeCun, L. Bottou, G. B. Orr, and K.-R. Müller, “Efficient BackProp,” in Neural Networks: Tricks of the Trade (Lecture Notes in Computer Science), Berlin: Springer-Verlag, 2012, pp. 9–48.
[19] K. Fukushima, “Neocognitron: A self-organizing neural network model for a mechanism of pattern recognition unaffected by shift in position,” Biol. Cybern., vol. 36, no. 4, pp. 193–202, Apr. 1980. doi: 10.1007/BF00344251.
[20] D. H. Hubel and T. N. Wiesel, “Receptive fields, binocular interaction and functional architecture in the cat’s visual cortex,” J. Physiol., vol. 160, no. 1, pp. 106–154, 1962. doi: 10.1113/jphysiol.1962.sp006837.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning internal representations by error propagation,” in Parallel Distributed Processing: Explorations in the Microstructure of Cognition, Vol. 1, D. E. Rumelhart, J. L. McClelland, and the PDP Research Group, Eds. Cambridge, MA: MIT Press, 1986, pp. 318–362.
[22] B. Xin, Y. Wang, W. Gao, D. Wipf, and B. Wang, “Maximal sparsity with deep networks?” in Proc. Advances Neural Information Processing Systems, 2016, pp. 4340–4348. doi: 10.5555/3157382.3157583.
[23] J. Liu, X. Chen, Z. Wang, and W. Yin, “ALISTA: Analytic weights are as good as learned weights in LISTA,” in Proc. Int. Conf. Learning Representations, 2019.
[24] X. Chen, J. Liu, Z. Wang, and W. Yin, “Theoretical linear convergence of unfolded ISTA and its practical weights and thresholds,” in Proc. 32nd Int. Conf. Neural Information Processing Systems, 2018, pp. 9079–9089. doi: 10.5555/3327546.3327581.
[25] Y. Li and S. Osher, “Coordinate descent optimization for L1 minimization with application to compressed sensing: A greedy algorithm,” Inverse Probl. Imag., vol. 3, no. 3, pp. 487–503, 2009. doi: 10.3934/ipi.2009.3.487.
[26] Z. Wang, D. Liu, J. Yang, W. Han, and T. Huang, “Deep networks for image super-resolution with sparse prior,” in Proc. IEEE Int. Conf. Computer Vision, 2015, pp. 370–378. doi: 10.1109/ICCV.2015.50.
[27] K. H. Jin, M. T. McCann, E. Froustey, and M. Unser, “Deep convolutional neural network for inverse problems in imaging,” IEEE Trans. Image Process., vol. 26, no. 9, pp. 4509–4522, 2017. doi: 10.1109/TIP.2017.2713099.
[28] A. Beck and M. Teboulle, “A fast iterative shrinkage-thresholding algorithm for linear inverse problems,” SIAM J. Imag. Sci., vol. 2, no. 1, pp. 183–202, 2009. doi: 10.1137/080716542.
[29] J. R. Hershey, J. Le Roux, and F. Weninger, “Deep unfolding: Model-based inspiration of novel deep architectures,” 2014, arXiv:1409.2574.
[30] S. Zheng, S. Jayasumana, B. Romera-Paredes, V. Vineet, Z. Su, D. Du, C. Huang, and P. H. S. Torr, “Conditional random fields as recurrent neural networks,” in Proc. Int. Conf. Computer Vision, Dec. 2015, pp. 1529–1537. doi: 10.1109/ICCV.2015.179.
[31] C. J. Schuler, M. Hirsch, S. Harmeling, and B. Scholkopf, “Learning to deblur,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 7, pp. 1439–1451, July 2016. doi: 10.1109/TPAMI.2015.2481418.
[32] Z. Liu, X. Li, P. Luo, C. C. Loy, and X. Tang, “Deep learning Markov random field for semantic segmentation,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 40, no. 8, pp. 1814–1828, Aug. 2018. doi: 10.1109/TPAMI.2017.2737535.
[33] O. Solomon, R. Cohen, Y. Zhang, Y. Yang, Q. He, J. Luo, R. J. G. van Sloun, and Y. C. Eldar, “Deep unfolded robust PCA with application to clutter suppression in ultrasound,” IEEE Trans. Med. Imag., vol. 39, no. 4, pp. 1051–1063, Apr. 2020. doi: 10.1109/TMI.2019.2941271.
[34] Y. Ding, X. Xue, Z. Wang, Z. Jiang, X. Fan, and Z. Luo, “Domain knowledge driven deep unrolling for rain removal from single image,” in Proc. Int. Conf. Digital Home, 2018, pp. 14–19. doi: 10.1109/ICDH.2018.00010.
[35] Z. Q. Wang, J. L. Roux, D. Wang, and J. R. Hershey, “End-to-end speech separation with unfolded iterative phase reconstruction,” in Proc. Interspeech, 2018, pp. 2708–2712. doi: 10.21437/Interspeech.2018-1629.
[36] J. Adler and O. Öktem, “Learned primal-dual reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1322–1332, 2018. doi: 10.1109/TMI.2018.2799231.
[37] D. Wu, K. Kim, B. Dong, G. E. Fakhri, and Q. Li, “End-to-end lung nodule detection in computed tomography,” in Machine Learning in Medical Imaging (Lecture Notes in Computer Science), Y. Shi, H. I. Suk, and M. Liu, Eds. Cham: Springer-Verlag, 2018, pp. 37–45.
[38] S. A. H. Hosseini, B. Yaman, S. Moeller, M. Hong, and M. Akçakaya, “Dense recurrent neural networks for inverse problems: History-cognizant unrolling of optimization algorithms,” 2019, arXiv:1912.07197.
[39] Y. Li, M. Tofighi, J. Geng, V. Monga, and Y. C. Eldar, “Efficient and interpretable deep blind image deblurring via algorithm unrolling,” IEEE Trans. Comput. Imag., vol. 6, pp. 666–681, Jan. 2020. doi: 10.1109/TCI.2020.2964202.
[40] L. Zhang, G. Wang, and G. B. Giannakis, “Real-time power system state estimation and forecasting via deep unrolled neural networks,” IEEE Trans. Signal Process., vol. 67, no. 15, pp. 4069–4077, Aug. 2019. doi: 10.1109/TSP.2019.2926023.
[41] X. Zhang, Y. Lu, J. Liu, and B. Dong, “Dynamically unfolding recurrent restorer: A moving endpoint control method for image restoration,” in Proc. Int. Conf. Learning Representations, 2019.
[42] S. Lohit, D. Liu, H. Mansour, and P. T. Boufounos, “Unrolled projected gradient descent for multi-spectral image fusion,” in Proc. IEEE Int. Conf. Acoustics, Speech and Signal Processing, May 2019, pp. 7725–7729. doi: 10.1109/ICASSP.2019.8683124.
[43] G. Dardikman-Yoffe and Y. C. Eldar, “Learned SPARCOM: Unfolded deep super-resolution microscopy,” Opt. Express, vol. 28, no. 19, pp. 27,736–27,763, 2020. doi: 10.1364/OE.401925.
[44] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. 32nd Int. Conf. Machine Learning, 2015, pp. 448–456.
[45] R. Timofte, V. De Smet, and L. Van Gool, “A+: Adjusted anchored neighborhood regression for fast super-resolution,” in Proc. Asian Conf. Computer Vision, 2014, pp. 111–126.
[46] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using deep convolutional networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 2, pp. 295–307, Feb. 2016. doi: 10.1109/TPAMI.2015.2439281.
[47] J. Yang, J. Wright, T. S. Huang, and Y. Ma, “Image super-resolution via sparse representation,” IEEE Trans. Image Process., vol. 19, no. 11, pp. 2861–2873, Nov. 2010. doi: 10.1109/TIP.2010.2050625.
[48] D. Perrone and P. Favaro, “A clearer picture of total variation blind deconvolution,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 38, no. 6, pp. 1041–1055, June 2016. doi: 10.1109/TPAMI.2015.2477819.
[49] S. Nah, T. H. Kim, and K. M. Lee, “Deep multi-scale convolutional neural network for dynamic scene deblurring,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2017, pp. 257–265. doi: 10.1109/CVPR.2017.35.

[50] X. Tao, H. Gao, X. Shen, J. Wang, and J. Jia, “Scale-recurrent network for deep image deblurring,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2018, pp. 8174–8182. doi: 10.1109/CVPR.2018.00853.
[51] Y. Nesterov, “Gradient methods for minimizing composite functions,” Math. Program., vol. 140, pp. 125–161, Aug. 2013. doi: 10.1007/s10107-012-0629-5.
[52] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image recognition,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2016, pp. 770–778. doi: 10.1109/CVPR.2016.90.
[53] R. J. G. van Sloun, R. Cohen, and Y. C. Eldar, “Deep learning in ultrasound imaging,” Proc. IEEE, vol. 108, no. 1, pp. 11–29, Jan. 2020. doi: 10.1109/JPROC.2019.2932116.
[54] J. Eggert and E. Korner, “Sparse coding and NMF,” in Proc. IEEE Int. Joint Conf. Neural Networks, vol. 4, July 2004, pp. 2529–2533.
[55] D. Gunawan and D. Sen, “Iterative phase estimation for the synthesis of separated sources from single-channel mixtures,” IEEE Signal Process. Lett., vol. 17, no. 5, pp. 421–424, May 2010. doi: 10.1109/LSP.2010.2042530.
[56] C. A. Metzler, A. Maleki, and R. G. Baraniuk, “From denoising to compressed sensing,” IEEE Trans. Inf. Theory, vol. 62, no. 9, pp. 5117–5144, 2016. doi: 10.1109/TIT.2016.2556683.
[57] K. Kulkarni, S. Lohit, P. Turaga, R. Kerviche, and A. Ashok, “ReconNet: Non-iterative reconstruction of images from compressively sensed measurements,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, 2016, pp. 449–458. doi: 10.1109/CVPR.2016.55.
[58] O. Kupyn, V. Budzan, M. Mykhailych, D. Mishkin, and J. Matas, “DeblurGAN: Blind motion deblurring using conditional adversarial networks,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2018, pp. 8183–8192. doi: 10.1109/CVPR.2018.00854.
[59] C. Metzler, A. Mousavi, and R. Baraniuk, “Learned D-AMP: Principled neural network based compressive image recovery,” in Proc. 31st Int. Conf. Neural Information Processing Systems, 2017, pp. 1772–1783. doi: 10.5555/3294771.3294940.
[60] G. V. Puskorius and L. A. Feldkamp, “Neurocontrol of nonlinear dynamical systems with Kalman filter trained recurrent networks,” IEEE Trans. Neural Netw., vol. 5, no. 2, pp. 279–297, Mar. 1994. doi: 10.1109/72.279191.
[61] S. S. Haykin, Kalman Filtering and Neural Networks. New York: Wiley, 2001.
[62] S. Singhal and L. Wu, “Training multilayer perceptrons with the extended Kalman algorithm,” in Advances Neural Information Processing Systems, D. S. Touretzky, Ed. San Francisco: Morgan-Kaufmann, 1989, pp. 133–140.
[63] J. Mairal, F. Bach, and J. Ponce, “Task-driven dictionary learning,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 34, no. 4, pp. 791–804, Apr. 2012. doi: 10.1109/TPAMI.2011.156.
[64] P. Sprechmann, A. M. Bronstein, and G. Sapiro, “Learning efficient sparse and low rank models,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 37, no. 9, pp. 1821–1833, Sept. 2015. doi: 10.1109/TPAMI.2015.2392779.
[65] P. Perona and J. Malik, “Scale-space and edge detection using anisotropic diffusion,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 12, no. 7, pp. 629–639, July 1990. doi: 10.1109/34.56205.
[66] T. Q. Chen, Y. Rubanova, J. Bettencourt, and D. K. Duvenaud, “Neural ordinary differential equations,” in Proc. 32nd Conf. Neural Information Processing Systems, 2018, pp. 6571–6583.
[67] D. Rezende and S. Mohamed, “Variational inference with normalizing flows,” in Proc. Int. Conf. Machine Learning, June 2015, pp. 1530–1538. doi: 10.5555/3045118.3045281.
[68] Z. Long, Y. Lu, X. Ma, and B. Dong, “PDE-Net: Learning PDEs from data,” in Proc. 35th Int. Conf. Machine Learning, July 2018, pp. 3208–3216.
[69] Z. Long, Y. Lu, and B. Dong, “PDE-Net 2.0: Learning PDEs from data with a numeric-symbolic hybrid deep network,” J. Comput. Phys., vol. 399, p. 108925, Dec. 2019. doi: 10.1016/j.jcp.2019.108925.
[70] J. Sun and M. F. Tappen, “Learning non-local range Markov random field for image restoration,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2011, pp. 2745–2752. doi: 10.1109/CVPR.2011.5995520.
[71] V. Stoyanov, A. Ropson, and J. Eisner, “Empirical risk minimization of graphical model parameters given approximate inference, decoding, and model structure,” in Proc. Int. Conf. Artificial Intelligence and Statistics, June 2011, pp. 725–733.
[72] J. Domke, “Parameter learning with truncated message-passing,” in Proc. IEEE Conf. Computer Vision and Pattern Recognition, June 2011, pp. 2937–2943. doi: 10.1109/CVPR.2011.5995320.
[73] J. Domke, “Generic methods for optimization-based modeling,” in Proc. Artificial Intelligence and Statistics, Mar. 2012, pp. 318–326.
[74] K. Greff, S. van Steenkiste, and J. Schmidhuber, “Neural expectation maximization,” in Proc. 31st Conf. Neural Information Processing Systems, 2017, pp. 6694–6704.
[75] M. Arjovsky, S. Chintala, and L. Bottou, “Wasserstein generative adversarial networks,” in Proc. Int. Conf. Machine Learning, vol. 70, 2017, pp. 214–223.
[76] A. Genevay, G. Peyre, and M. Cuturi, “Learning generative models with Sinkhorn divergences,” in Proc. 21st Int. Conf. Artificial Intelligence and Statistics, Mar. 2018, pp. 1608–1617.
[77] G. Patrini, R. Berg, P. Forré, M. Carioni, S. Bhargav, M. Welling, T. Genewein, and F. Nielsen, “Sinkhorn AutoEncoders,” in Proc. Conf. Uncertainty Artificial Intelligence, July 2019, pp. 733–743.
[78] D. P. Kingma and M. Welling, “Auto-encoding variational Bayes,” in Proc. Int. Conf. Learning Representations, 2014.
[79] I. Tolstikhin, O. Bousquet, S. Gelly, and B. Schoelkopf, “Wasserstein auto-encoders,” in Proc. Int. Conf. Learning Representations, 2018.
[80] V. Papyan, Y. Romano, and M. Elad, “Convolutional neural networks analyzed via convolutional sparse coding,” J. Mach. Learn. Res., vol. 18, pp. 1–52, July 2017.
[81] J. Sulam, A. Aberdam, A. Beck, and M. Elad, “On multi-layer basis pursuit, efficient algorithms and convolutional neural networks,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 42, no. 8, pp. 1968–1980, 2020. doi: 10.1109/TPAMI.2019.2904255.
[82] L. Metz, B. Poole, D. Pfau, and J. Sohl-Dickstein, “Unrolled generative adversarial networks,” in Proc. Int. Conf. Learning Representations, 2017.
[83] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. Int. Conf. Learning Representations, 2015.
[84] S. Diamond, V. Sitzmann, F. Heide, and G. Wetzstein, “Unrolled optimization with deep priors,” 2018, arXiv:1705.08041.
[85] N. Samuel, T. Diskin, and A. Wiesel, “Deep MIMO detection,” in Proc. Int. Workshop Signal Processing Advances Wireless Communications, July 2017, pp. 1–5. doi: 10.1109/SPAWC.2017.8227772.
[86] A. Balatsoukas-Stimming and C. Studer, “Deep unfolding for communications systems: A survey and some new directions,” 2019, arXiv:1906.05774.
[87] N. Farsad, N. Shlezinger, A. J. Goldsmith, and Y. C. Eldar, “Data-driven symbol detection via model-based machine learning,” 2020, arXiv:2002.07806.
[88] Q. Li, L. Chen, C. Tai, and E. Weinan, “Maximum principle based algorithms for deep learning,” J. Mach. Learn. Res., vol. 18, no. 1, pp. 5998–6026, 2017. doi: 10.5555/3122009.3242022.
[89] B. Zhou, D. Bau, A. Oliva, and A. Torralba, “Interpreting deep visual representations via network dissection,” IEEE Trans. Pattern Anal. Mach. Intell., vol. 41, no. 9, pp. 2131–2145, Sept. 2019. doi: 10.1109/TPAMI.2018.2858759.
[90] D. Bau, J. Y. Zhu, H. Strobelt, B. Zhou, J. B. Tenenbaum, W. T. Freeman, and A. Torralba, “GAN dissection: Visualizing and understanding generative adversarial networks,” in Proc. Int. Conf. Learning Representations, 2019.
[91] G. Cybenko, “Approximation by superpositions of a sigmoidal function,” Math. Control Signal Syst., vol. 2, no. 4, pp. 303–314, Dec. 1989. doi: 10.1007/BF02551274.
[92] D. L. Donoho, A. Maleki, and A. Montanari, “Message-passing algorithms for compressed sensing,” Proc. Natl. Acad. Sci., vol. 106, no. 45, pp. 18914–18919, 2009. doi: 10.1073/pnas.0909892106.
[93] T. Meinhardt, M. Moeller, C. Hazirbas, and D. Cremers, “Learning proximal operators: Using denoising networks for regularizing inverse imaging problems,” in Proc. Int. Conf. Computer Vision, Venice, Oct. 2017, pp. 1799–1808. doi: 10.1109/ICCV.2017.198.
[94] H. Gupta, K. H. Jin, H. Q. Nguyen, M. T. McCann, and M. Unser, “CNN-based projected gradient descent for consistent CT image reconstruction,” IEEE Trans. Med. Imag., vol. 37, no. 6, pp. 1440–1453, 2018. doi: 10.1109/TMI.2018.2832656.
[95] N. Shlezinger, N. Farsad, Y. C. Eldar, and A. J. Goldsmith, “ViterbiNet: Symbol detection using a deep learning based Viterbi algorithm,” IEEE Trans. Wirel. Commun., vol. 19, no. 5, pp. 3319–3331, May 2020. doi: 10.1109/TWC.2020.2972352.
[96] E. Ryu, J. Liu, S. Wang, X. Chen, Z. Wang, and W. Yin, “Plug-and-play methods provably converge with properly trained denoisers,” in Proc. Int. Conf. Machine Learning, May 2019, pp. 5546–5557.
[97] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training recurrent neural networks,” in Proc. Int. Conf. Machine Learning, 2013, pp. 1310–1318.
[98] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proc. Int. Conf. Artificial Intelligence and Statistics, Mar. 2010, pp. 249–256.
