Deep Algorithm Unrolling For Blind Image Deblurring
Abstract—Blind image deblurring remains a topic of enduring interest. Learning based approaches, especially those that employ neural networks, have emerged to complement traditional model based methods and in many cases achieve vastly enhanced performance. That said, neural network approaches are generally empirically designed and the underlying structures are difficult to interpret. In recent years, a promising technique called algorithm unrolling has been developed that has helped connect iterative algorithms such as those for sparse coding to neural network architectures. However, such connections have not yet been made for blind image deblurring. In this paper, we propose a neural network architecture based on this idea. We first present an iterative algorithm that may be considered a generalization of the traditional total-variation regularization method in the gradient domain. We then unroll the algorithm to construct a neural network for image deblurring which we refer to as Deep Unrolling for Blind Deblurring (DUBLID). Key algorithm parameters are learned with the help of training images. Our proposed deep network DUBLID achieves significant practical performance gains while enjoying interpretability at the same time. Extensive experimental results show that DUBLID outperforms many state-of-the-art methods and in addition is computationally faster.

Y. Li, M. Tofighi, and V. Monga are with the Department of Electrical Engineering, The Pennsylvania State University, University Park, PA 16802 USA. Emails: [email protected], [email protected], [email protected]. Y. C. Eldar is with the Department of Electrical Engineering, Technion, Israel Institute of Technology, Haifa, Israel. Email: [email protected].

I. INTRODUCTION

BLIND image deblurring refers to the process of recovering a sharp image from its blurred observation without explicitly knowing the blur function. In real world imaging, images frequently suffer from degraded quality as a consequence of blurring artifacts, which blind deblurring algorithms are designed to remove. These artifacts may come from different sources, such as atmospheric turbulence, diffraction, optical defocusing, camera shaking, and more [1]. In the computational imaging literature, motion deblurring is an important topic because camera shakes are common during the photography procedure. In recent years, this topic has attracted growing attention thanks to the popularity of smartphone cameras. On such platforms, the motion deblurring algorithm plays an especially crucial role because effective hardware solutions, such as professional camera stabilizers, are difficult to deploy due to space restrictions.

In this work we focus on motion deblurring in particular because of its practical importance. However, our development does not make assumptions on the blur type and hence may be extended to cases other than motion blur. Motion blurs occur as a consequence of relative movement between the camera and the imaged scene during exposure. Assuming the scene is planar and the camera motion is translational, the image degradation process may be modelled as a discrete convolution [1]:

$$\mathbf{y} = \mathbf{k} \ast \mathbf{x} + \mathbf{n}, \qquad (1)$$

where $\mathbf{y}$ is the observed blurry image, $\mathbf{x}$ is the latent sharp image, $\mathbf{k}$ is the unknown point spread function (blur kernel), and $\mathbf{n}$ is random noise which is often modelled as Gaussian. Blind motion deblurring corresponds to estimating both $\mathbf{k}$ and $\mathbf{x}$ given $\mathbf{y}$; this estimation problem is also commonly called blind deconvolution.
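To make the degradation model (1) concrete, the following minimal Python sketch synthesizes a blurred, noisy observation from a sharp image and a simple linear motion kernel. The kernel construction, image size and noise level are illustrative assumptions, not values from the paper; SciPy's fftconvolve performs the 2-D convolution.

```python
import numpy as np
from scipy.signal import fftconvolve

def linear_motion_kernel(length=15, angle_deg=30.0):
    """Build a simple (hypothetical) linear motion blur kernel with unit sum."""
    k = np.zeros((length, length))
    c = (length - 1) / 2.0
    t = np.deg2rad(angle_deg)
    for r in np.linspace(-c, c, 4 * length):
        i, j = int(round(c + r * np.sin(t))), int(round(c + r * np.cos(t)))
        k[i, j] = 1.0
    return k / k.sum()          # kernel coefficients are non-negative and sum to one

def blur_and_add_noise(x, k, sigma=0.01):
    """Discrete convolution model of Eq. (1): y = k * x + n."""
    y = fftconvolve(x, k, mode="same")
    return y + sigma * np.random.randn(*y.shape)

# toy usage with a random array standing in for a real sharp image
x = np.random.rand(128, 128)
k = linear_motion_kernel()
y = blur_and_add_noise(x, k)
print(y.shape, k.sum())
```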
Related Work: The majority of existing blind motion deblurring methods are based on iterative optimization. Early works can be traced back several decades [2], [3], [4], [1], [5]. These methods are only effective when the blur kernel is relatively small. In the last decade, significant breakthroughs have been made both practically and conceptually. As both the image and the kernel need to be estimated, there are infinitely many pairs of solutions forming the same blurred observation, rendering blind deconvolution an ill-posed problem. A popular remedy is to add regularizations, so that many blind deblurring algorithms essentially reduce to solving regularized inverse problems. A vast majority of these techniques hinge on sparsity-inducing regularizers, either in the gradient domain [6], [7], [8], [9], [10], [11], [12], [13] or in more general sparsifying transformation domains [14], [15], [16], [17]. Variants of such methods may arise indirectly from a statistical estimation perspective, e.g. [18], [19], [20], [21]. From a conceptual perspective, Levin et al. [22] study the limitations and remedies of the commonly employed Maximum a Posteriori (MAP) approach, while Perrone et al. [23] extend their study with a particular focus on Total-Variation (TV) regularization. Despite some performance improvements achieved along these developments, the iterative optimization approaches generally suffer from several limitations. First, their performance depends heavily on appropriate selection of parameter values. Second, handcrafted regularizers play an essential role, and designing versatile regularizers that generalize well to a variety of real datasets can be a challenging task. Finally, hundreds or even thousands of iterations are often required to reach an acceptable performance level, and thus these approaches can be slow in practice.

Complementary to the aforementioned approaches, learning based methods have been developed that determine a non-linear mapping which deblurs the image while adapting parameter choices to an underlying training image set. Principally important in this class are techniques that employ deep neural networks. The history of leveraging neural networks for blind deblurring actually dates back to the last century [24]. In the past few years, there has been a growing trend in applying neural networks to various imaging problems [25], and blind motion deblurring has followed that trend. Xu et al. [26] use large convolution kernels with carefully chosen initializations in a Convolutional Neural Network (CNN); Yan et al. [27] concatenate a classification network with a regression network to deblur images without prior information about the blur kernel type. Chakrabarti et al. [28] work in the frequency domain and employ a neural network to predict Fourier transform coefficients of image patches; Xu et al. [29] employ a CNN for edge enhancement prior to kernel and image estimation. These works often outperform iterative optimization algorithms, especially for linear motion kernels; however, the structures of the networks are often empirically determined and their actual functionality is hard to interpret.

In the seminal work of Gregor et al. [30], a novel technique called algorithm unrolling was proposed. Despite its focus on approximating sparse coding algorithms, it provides a principled framework for expressing traditional iterative algorithms as neural networks, and offers promise in developing interpretable network architectures. Specifically, each iteration step may be represented as one layer of the network, and concatenating these layers forms a deep neural network. Passing through the network is equivalent to executing the iterative algorithm a finite number of times. The network may be trained using back-propagation [31], and the trained network can be naturally interpreted as a parameter-optimized algorithm. An additional benefit is that prior knowledge about the conventional algorithms may be transferred. There has been limited recent exploration of neural network architectures obtained by unrolling iterative algorithms for problems such as super-resolution and clutter/noise suppression [32], [33], [34], [35]. In blind deblurring, Schuler et al. [36] employ neural networks as feature extraction modules towards a trainable deblurring system. However, the network portions are still empirical. Other aspects of deblurring have been investigated, such as spatially varying blurs [37], [38], including some recent neural network approaches [39], [40], [41], [42]. Other algorithms benefit from device measurements [43], [44], [45] or leverage multiple images [46], [47].
Motivations and Contributions: Conventional iterative algorithms have the merit of interpretability, but acceptable performance levels demand much longer execution times compared to modern neural network approaches. Despite previous efforts, the link between the two categories remains largely unexplored for the problem of blind deblurring, and a method that simultaneously enjoys the benefits of both is lacking. In this regard, we make the following contributions:¹

• Deep Unrolling for BLind Deblurring (DUBLID): We propose an interpretable neural network structure called DUBLID. We first present an iterative algorithm that may be considered a generalization of the traditional total-variation regularization method in the gradient domain, and subsequently unroll the algorithm to construct a neural network. Key algorithm parameters are learned with the help of training images using backpropagation, for which we derive analytically simple forms that are amenable to fast implementation.

• Performance and Computational Benefits: Through extensive experimental validation over three benchmark datasets, we verify the superior performance of the proposed DUBLID, both over conventional iterative algorithms and over more recent neural network approaches. Both traditional linear and more recently developed non-linear kernels are used in our experiments. Besides quality gains, we show that DUBLID is computationally simpler. In particular, the carefully designed interpretable layers enable DUBLID to learn with far fewer parameters than state of the art deep learning approaches, hence leading to much faster inference time.

• Reproducibility: To ensure reproducibility, we share our code and datasets that are used to generate all our experimental results freely online.

The rest of the paper is organized as follows. Generalized gradient domain deblurring is reviewed in Section II. We identify the roles of (gradient/feature extraction) filters and other key parameters, which are usually assumed fixed. Based on a half-quadratic optimization procedure for solving the aforementioned gradient domain deblurring, we develop a new unrolling method that realizes the iterative optimization as a neural network in Section III. In particular, we show that the various linear and non-linear operators in the optimization can be cascaded to generate an interpretable deep network, such that the number of layers in the network corresponds to the number of iterations. The fixed filters and parameters are now learnable, and a custom back-propagation procedure is proposed to optimize them based on training images. Experimental results that provide insights into DUBLID as well as comparisons with state of the art methods are reported in Section IV. Section V concludes the paper.

¹ A preliminary 4-page version of this work has been submitted to IEEE ICASSP 2019 [48]. This paper involves substantially more analytical development in the form of: a.) the unrolling mechanism and associated optimization problem for learning parameters, b.) derivation of custom back-propagation rules, c.) handling of color images, and d.) demonstration of computational benefits. Experimentally, we have added a new dataset and several new state of the art methods and scenarios in our comparisons. Finally, ablation studies have been included to better explain the proposed DUBLID and its merits.

II. GENERALIZED BLIND DEBLURRING VIA ITERATIVE MINIMIZATION: A FILTERED DOMAIN REGULARIZATION PERSPECTIVE

A. Blind Deblurring in the Filtered Domain

A common practice for blind motion deblurring is to estimate the kernel in the image gradient domain [18], [7], [8], [9], [19], [10], [23]. Because the gradient operator $\nabla$ commutes with convolution, taking derivatives on both sides of (1) gives

$$\nabla\mathbf{y} = \mathbf{k} \ast \nabla\mathbf{x} + \mathbf{n}', \qquad (2)$$

where $\mathbf{n}' = \nabla\mathbf{n}$ is Gaussian noise. Formulation in the gradient domain, as opposed to the pixel domain, has several desirable strengths: first, the kernel generally acts as a low-pass filter, and low-frequency components of the image are barely informative about the kernel. Intuitively, the kernel may be inferred along edges rather than homogeneous regions in the image. Therefore, a gradient domain approach can lead to improved performance in practice [19], as the gradient operator effectively filters out the uninformative regions. Additionally, from a computational perspective, gradient domain formulations help in better conditioning of the linear system, resulting in more reliable estimation [8].
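As a quick numerical illustration of the commutativity underlying (2) (and, later, (4)), the sketch below checks that filtering the blurred image with a derivative filter equals blurring the filtered image. The Sobel filter and random test data are illustrative assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

# horizontal Sobel filter standing in for a generic derivative/feature filter f_i
f = np.array([[-1.0, 0.0, 1.0],
              [-2.0, 0.0, 2.0],
              [-1.0, 0.0, 1.0]])

x = np.random.rand(64, 64)          # stand-in sharp image
k = np.ones((7, 7)) / 49.0          # stand-in blur kernel (unit sum)

# f * (k * x) versus k * (f * x): convolution commutes, so both equal the
# filtered-domain observation used in Eqs. (2) and (4).
lhs = fftconvolve(fftconvolve(x, k, mode="full"), f, mode="full")
rhs = fftconvolve(fftconvolve(x, f, mode="full"), k, mode="full")
print(np.max(np.abs(lhs - rhs)))    # numerically ~0
```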
The model (2) alone, however, is insufficient for recovering both the image and the kernel; thus regularizers on both are needed. The gradients of natural images are generally sparse, i.e., most of their values are of small magnitude [18], [22]. This fact motivates the development of various sparsity-inducing regularizations on $\nabla\mathbf{x}$. Among them, one of particular interest is the $\ell_1$-norm (often called TV) thanks to its convexity [5], [23]. To regularize the kernel, it is common practice to assume the kernel coefficients are non-negative and of unit sum. Consolidating these facts, blind motion deblurring may be carried out by solving the following optimization problem [23]:

$$\min_{\mathbf{k},\mathbf{g}_1,\mathbf{g}_2}\; \frac{1}{2}\left\|D_x\mathbf{y}-\mathbf{k}\ast\mathbf{g}_1\right\|_2^2 + \frac{1}{2}\left\|D_y\mathbf{y}-\mathbf{k}\ast\mathbf{g}_2\right\|_2^2 + \lambda_1\|\mathbf{g}_1\|_1 + \lambda_2\|\mathbf{g}_2\|_1 + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2,\quad \text{subject to } \|\mathbf{k}\|_1 = 1,\; \mathbf{k}\geq 0, \qquad (3)$$

where $D_x\mathbf{y}$, $D_y\mathbf{y}$ are the partial derivatives of $\mathbf{y}$ in the horizontal and vertical directions, respectively. The notation $\|\cdot\|_p$ denotes the $\ell_p$ vector norm, while $\lambda_1$, $\lambda_2$, $\epsilon$ are positive constant parameters that balance the contributions of each term. The $\geq$ sign is to be interpreted elementwise. The solutions $\mathbf{g}_1$ and $\mathbf{g}_2$ of (3) are estimates of the gradients of the sharp image $\mathbf{x}$, i.e., we may expect $\mathbf{g}_1 \approx D_x\mathbf{x}$ and $\mathbf{g}_2 \approx D_y\mathbf{x}$.

In practice, numerical gradients of images are usually computed using discrete filters, such as the Prewitt and Sobel filters. From this viewpoint, $D_x\mathbf{y}$ and $D_y\mathbf{y}$ may be viewed as filtering $\mathbf{y}$ through two derivative filters of orthogonal directions [49]. Therefore, a straightforward generalization of (3) is to use more than two filters, i.e., to pass $\mathbf{y}$ through a filter bank. This generalization increases the flexibility of (3), and an appropriate choice of the filters can significantly boost performance. In particular, by steering the filters towards more directions than just horizontal and vertical, local features (such as lines, edges and textures) of different orientations are more effectively captured [50], [51], [52]. Moreover, the filters can adapt their shapes to enhance the representation sparsity [53], [54], a desirable property to pursue.

Suppose we have determined a desired collection of $C$ filters $\{\mathbf{f}_i\}_{i=1}^{C}$. By commutativity of convolutions, we have

$$\mathbf{f}_i\ast\mathbf{y} = \mathbf{f}_i\ast\mathbf{k}\ast\mathbf{x} + \mathbf{n}'_i = \mathbf{k}\ast(\mathbf{f}_i\ast\mathbf{x}) + \mathbf{n}'_i,\quad i=1,2,\ldots,C, \qquad (4)$$

where the filtered noises $\mathbf{n}'_i = \mathbf{f}_i\ast\mathbf{n}$ are still Gaussian. To encourage sparsity of the filtered image, we formulate the optimization problem (which may similarly be regarded as a generalization of [23])

$$\min_{\mathbf{k},\{\mathbf{g}_i\}_{i=1}^{C}}\; \sum_{i=1}^{C}\left(\frac{1}{2}\left\|\mathbf{f}_i\ast\mathbf{y}-\mathbf{k}\ast\mathbf{g}_i\right\|_2^2 + \lambda_i\|\mathbf{g}_i\|_1\right) + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2,\quad \text{subject to } \|\mathbf{k}\|_1 = 1,\; \mathbf{k}\geq 0. \qquad (5)$$

B. Efficient Minimization via Half-quadratic Splitting

Problem (5) is non-smooth, so traditional gradient-based optimization algorithms cannot be applied directly. Moreover, to facilitate the subsequent unrolling procedure, the algorithm needs to be simple (to simplify the network structure) and to converge quickly (to reduce the number of layers required). Based on these concerns, we adopt the half-quadratic splitting algorithm [55]. This algorithm is simple but effective, and has been successfully employed in many previous deblurring techniques [56], [13], [16].

The basic idea is to perform variable splitting and then alternating minimization on the penalty function. To this end, we first cast (5) into the following approximation model:

$$\min_{\mathbf{k},\{\mathbf{g}_i,\mathbf{z}_i\}_{i=1}^{C}}\; \sum_{i=1}^{C}\left(\frac{1}{2}\left\|\mathbf{f}_i\ast\mathbf{y}-\mathbf{k}\ast\mathbf{g}_i\right\|_2^2 + \lambda_i\|\mathbf{z}_i\|_1 + \frac{1}{2\zeta_i}\|\mathbf{g}_i-\mathbf{z}_i\|_2^2\right) + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2,\quad \text{subject to } \|\mathbf{k}\|_1 = 1,\; \mathbf{k}\geq 0, \qquad (6)$$

by introducing auxiliary variables $\{\mathbf{z}_i\}_{i=1}^{C}$, where $\zeta_i$, $i=1,\ldots,C$ are regularization parameters. It is well known that as $\zeta_i \to 0$ the sequence of solutions to (6) converges to that of (5) [57].
In a similar manner to [13], we then alternately minimize over $\{\mathbf{g}_i\}_{i=1}^{C}$, $\{\mathbf{z}_i\}_{i=1}^{C}$ and $\mathbf{k}$, and iterate until convergence². Specifically, at the $l$-th iteration, we execute the following minimizations sequentially:

$$\mathbf{g}_i^{l+1} \leftarrow \arg\min_{\mathbf{g}_i}\; \frac{1}{2}\left\|\mathbf{f}_i\ast\mathbf{y}-\mathbf{k}^{l}\ast\mathbf{g}_i\right\|_2^2 + \frac{1}{2\zeta_i}\left\|\mathbf{g}_i-\mathbf{z}_i^{l}\right\|_2^2,\;\forall i,$$
$$\mathbf{z}_i^{l+1} \leftarrow \arg\min_{\mathbf{z}_i}\; \frac{1}{2\zeta_i}\left\|\mathbf{g}_i^{l+1}-\mathbf{z}_i\right\|_2^2 + \lambda_i\|\mathbf{z}_i\|_1,\;\forall i,$$
$$\mathbf{k}^{l+1} \leftarrow \arg\min_{\mathbf{k}}\; \sum_{i=1}^{C}\frac{1}{2}\left\|\mathbf{f}_i\ast\mathbf{y}-\mathbf{k}\ast\mathbf{g}_i^{l+1}\right\|_2^2 + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2,\quad \text{subject to } \|\mathbf{k}\|_1 = 1,\; \mathbf{k}\geq 0. \qquad (7)$$

² In the non-blind deconvolution literature, a formal convergence proof has been given in [55], while for blind deconvolution, empirical convergence has been frequently observed, as shown in [11], [13], etc.

For notational brevity, we will consistently use $i$ to index the filters and $l$ to index the layers (iterations) henceforth. The notations $\{\cdot\}_i$ and $\{\cdot\}_l$ collect every filter and layer component, respectively. As it stands, problem (5) is non-convex over the joint variables $\mathbf{k}$ and $\{\mathbf{g}_i\}_i$, and proper initialization is crucial to obtain good solutions. However, it is difficult to find appropriate initializations that perform well under various practical scenarios. An alternative strategy that has been commonly employed is to use different parameters per iteration [55], [11], [23], [13]. For example, the $\lambda_i$'s are typically chosen as a large value at the beginning and then gradually decreased towards a small constant. In [55] the values of the $\zeta_i$'s decrease as the algorithm proceeds for faster convergence. In numerical analysis and optimization, this strategy is called the continuation method, and its effectiveness is known for solving non-convex problems [58]. By adopting this strategy, we choose different parameters $\{\zeta_i^l,\lambda_i^l\}_{i,l}$ across the iterations. We take this idea one step further by optimizing the filters across iterations as well, i.e., we design filters $\{\mathbf{f}_i^l\}_{i,l}$. Consequently, the alternating minimization scheme in (7) becomes:

$$\mathbf{g}_i^{l+1} \leftarrow \arg\min_{\mathbf{g}_i}\; \frac{1}{2}\left\|\mathbf{f}_i^{l}\ast\mathbf{y}-\mathbf{k}^{l}\ast\mathbf{g}_i\right\|_2^2 + \frac{1}{2\zeta_i^{l}}\left\|\mathbf{g}_i-\mathbf{z}_i^{l}\right\|_2^2,\;\forall i, \qquad (8)$$
$$\mathbf{z}_i^{l+1} \leftarrow \arg\min_{\mathbf{z}_i}\; \frac{1}{2\zeta_i^{l}}\left\|\mathbf{g}_i^{l+1}-\mathbf{z}_i\right\|_2^2 + \lambda_i^{l}\|\mathbf{z}_i\|_1,\;\forall i, \qquad (9)$$
$$\mathbf{k}^{l+1} \leftarrow \arg\min_{\mathbf{k}}\; \sum_{i=1}^{C}\frac{1}{2}\left\|\mathbf{f}_i^{l}\ast\mathbf{y}-\mathbf{k}\ast\mathbf{g}_i^{l+1}\right\|_2^2 + \frac{\epsilon}{2}\|\mathbf{k}\|_2^2,\quad \text{subject to } \|\mathbf{k}\|_1 = 1,\; \mathbf{k}\geq 0. \qquad (10)$$

We summarize the complete algorithm in Algorithm 1, where $\delta$ in Step 1 is the impulse function.

Algorithm 1 Half-quadratic Splitting Algorithm for Blind Deblurring with Continuation
Input: Blurred image $\mathbf{y}$, filter banks $\{\mathbf{f}_i^l\}_{i,l}$, positive constant parameters $\{\zeta_i^l,\lambda_i^l\}_{i,l}$, number of iterations³ $L$, and $\epsilon$.
Output: Estimated kernel $\widetilde{\mathbf{k}}$, estimated feature maps $\{\widetilde{\mathbf{g}}_i\}_{i=1}^{C}$.
1: Initialize $\mathbf{k}^1 \leftarrow \delta$; $\mathbf{z}_i^1 \leftarrow \mathbf{0}$, $i=1,\ldots,C$.
2: for $l = 1$ to $L$ do
3:   for $i = 1$ to $C$ do
4:     $\mathbf{y}_i^l \leftarrow \mathbf{f}_i^l \ast \mathbf{y}$,
5:     $\mathbf{g}_i^{l+1} \leftarrow F^{-1}\!\left\{\dfrac{\zeta_i^l\,\widehat{\mathbf{k}^l}{}^{*}\odot\widehat{\mathbf{y}_i^l}+\widehat{\mathbf{z}_i^l}}{\zeta_i^l\,\big|\widehat{\mathbf{k}^l}\big|^2+1}\right\}$,
6:     $\mathbf{z}_i^{l+1} \leftarrow \mathcal{S}_{\lambda_i^l\zeta_i^l}\!\left(\mathbf{g}_i^{l+1}\right)$,
7:   end for
8:   $\mathbf{k}^{l+\frac13} \leftarrow F^{-1}\!\left\{\dfrac{\sum_{i=1}^{C}\widehat{\mathbf{z}_i^{l+1}}{}^{*}\odot\widehat{\mathbf{y}_i^l}}{\sum_{i=1}^{C}\big|\widehat{\mathbf{z}_i^{l+1}}\big|^2+\epsilon}\right\}$,
9:   $\mathbf{k}^{l+\frac23} \leftarrow \left[\mathbf{k}^{l+\frac13} - \beta^l\log\sum_i\exp\!\big(\mathbf{k}_i^{l+\frac13}\big)\right]_{+}$,
10:  $\mathbf{k}^{l+1} \leftarrow \mathbf{k}^{l+\frac23}\big/\big\|\mathbf{k}^{l+\frac23}\big\|_1$,
11:  $l \leftarrow l+1$.
12: end for

³ $L$ refers to the number of outer iterations, which subsequently becomes the number of network layers in Section III-A. While in traditional iterative algorithms it is commonly determined by certain termination criteria, this approach is difficult to implement for neural networks. Therefore, in this work we choose it through cross-validation, as is done in [30], [32].

Problem (8) can be efficiently solved by making use of the Discrete Fourier Transform (DFT), and its solution is given in Step 5 of Algorithm 1, where $\cdot^{*}$ is the complex conjugation and $\odot$ is the Hadamard (elementwise) product operator. The operations are to be interpreted elementwise when acting on matrices and vectors. We let $\widehat{\cdot}$ denote the DFT and $F^{-1}$ the inverse DFT. The closed-form solution to problem (9) is well known and can be found in Step 6, where $\mathcal{S}_{\lambda}(\cdot)$ is the soft-thresholding operator defined as:

$$\mathcal{S}_{\lambda}(x) = \operatorname{sgn}(x)\cdot\max\{|x|-\lambda,\,0\}.$$

Subproblem (10) is a quadratic programming problem with a simplex constraint. While in principle it may be solved using iterative numerical algorithms, in the blind deblurring literature an approximation scheme is often adopted. The unconstrained quadratic programming problem is solved first (again using the DFT) to obtain a solution; its negative coefficients are then thresholded out, and the result is finally normalized to have unit sum (Steps 8–10 of Algorithm 1). We define $[x]_{+} = \max\{x,0\}$. This function is commonly called the Rectified Linear Unit (ReLU) in neural network terminology [59]. Note that in Step 9 of Algorithm 1, we adopt a common practice [8], [16] of thresholding the kernel coefficients using a positive constant (which is usually set as a constant parameter multiplying the maximum of the kernel coefficients); to avoid the non-smoothness of the maximum operation, we use the log-sum-exp function as a smooth surrogate.

We note that the quality of the outputs and the convergence speed depend crucially on the filters $\{\mathbf{f}_i^l\}_{i,l}$ and parameters $\{\zeta_i^l,\lambda_i^l\}_{i,l}$, which are difficult to infer due to the huge variety of real world data. In traditional settings, they are usually determined by hand crafting or domain knowledge. For example, in [23], [16] the $\{\mathbf{f}_i^l\}_{i,l}$ are taken as Prewitt filters while the $\{\lambda_i^l\}_l$'s are chosen as geometric sequences. Their optimal values thus remain unclear. To optimize the performance, we learn (adapt) the filters and parameters by using training images in a deep learning set-up via back-propagation. A detailed visual comparison between filters commonly used in conventional algorithms and filters learned from real datasets by DUBLID is provided in Section IV-B.
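Two of the non-linearities above map directly onto standard network operations: the soft-thresholding in Step 6 can be written with two ReLUs, $\mathcal{S}_\lambda(x)=[x-\lambda]_+-[-x-\lambda]_+$ (an identity also used in Section III-A), and the log-sum-exp in Step 9 is a smooth stand-in for the maximum. A small numpy check of both, under illustrative values:

```python
import numpy as np

def relu(v):
    return np.maximum(v, 0.0)

def soft_threshold(v, lam):
    """S_lambda(v) = sign(v) * max(|v| - lambda, 0)."""
    return np.sign(v) * np.maximum(np.abs(v) - lam, 0.0)

def soft_threshold_relu(v, lam):
    """Equivalent form built from two ReLUs: [v - lam]_+ - [-v - lam]_+."""
    return relu(v - lam) - relu(-v - lam)

v = np.linspace(-2.0, 2.0, 9)
print(np.allclose(soft_threshold(v, 0.5), soft_threshold_relu(v, 0.5)))   # True

# log-sum-exp as a smooth surrogate of the maximum (the scaling here is
# illustrative; Step 9 uses it to avoid the non-smooth max over kernel values)
k = np.random.rand(31 * 31)
beta = 1e-2
print(k.max(), beta * np.log(np.sum(np.exp(k / beta))))   # close for small beta
```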
After Algorithm 1 converges, we obtain the estimated feature maps $\{\widetilde{\mathbf{g}}_i\}_i$ and the estimated kernel $\widetilde{\mathbf{k}}$. Because of the low-pass nature of $\widetilde{\mathbf{k}}$, using it alone is inadequate to reliably recover the sharp image $\mathbf{x}$, and regularization is needed. We may infer from (4) that, as $\widetilde{\mathbf{k}}$ approximates $\mathbf{k}$, $\widetilde{\mathbf{g}}_i$ should approximate $\mathbf{f}_i\ast\mathbf{x}$. Therefore, we retrieve $\mathbf{x}$ by solving the following optimization problem:

$$\widetilde{\mathbf{x}} \leftarrow \arg\min_{\mathbf{x}}\; \frac{1}{2}\left\|\mathbf{y}-\widetilde{\mathbf{k}}\ast\mathbf{x}\right\|_2^2 + \sum_{i=1}^{C}\frac{\eta_i}{2}\left\|\mathbf{f}_i^{L}\ast\mathbf{x}-\widetilde{\mathbf{g}}_i\right\|_2^2 = F^{-1}\!\left\{\frac{\widehat{\widetilde{\mathbf{k}}}{}^{*}\odot\widehat{\mathbf{y}} + \sum_{i=1}^{C}\eta_i\,\widehat{\mathbf{f}_i^{L}}{}^{*}\odot\widehat{\widetilde{\mathbf{g}}_i}}{\big|\widehat{\widetilde{\mathbf{k}}}\big|^2 + \sum_{i=1}^{C}\eta_i\big|\widehat{\mathbf{f}_i^{L}}\big|^2}\right\}, \qquad (11)$$

where the $\eta_i$'s are positive regularization parameters.

III. ALGORITHM UNROLLING FOR DEEP BLIND IMAGE DEBLURRING (DUBLID)

A. Network Construction via Algorithm Unrolling

Each step of Algorithm 1 is in analytic form and can be implemented using a series of basic functional operations. In particular, Step 5 and Step 8 in Algorithm 1 can be implemented according to the diagrams in Fig. 1a and Fig. 1b, respectively. The soft-thresholding operation in Step 6 may be implemented using two ReLU operations by recognizing that $\mathcal{S}_{\lambda}(x) = [x-\lambda]_{+} - [-x-\lambda]_{+}$. Similarly, (11) may be implemented according to Fig. 1c. Therefore, each iteration of Algorithm 1 can be assembled from these building blocks, and stacking $L$ such iterations yields the unrolled network shown in Fig. 2.

Fig. 1. Block diagram representations of (a) Step 5 in Algorithm 1, (b) Step 8 in Algorithm 1 and (c) Equation (11). After unrolling the iterative algorithm to form a multi-layer network, the diagrammatic representations serve as building blocks that repeat themselves from layer to layer. The parameters ($\zeta$ and $\eta$) are learned from real datasets and colored in blue.

Fig. 2. Algorithm 1 unrolled as a neural network. The parameters that are learned from real datasets are colored in blue.

To reduce the dimensionality of the parameter space, the filters $\{\mathbf{f}_i^l\}_{i,l}$ are realized as cascades of small $3\times 3$ filters $\{\mathbf{w}^l\}_l$ (see Fig. 3). Embedding the above into the network, we obtain the structure depicted in Fig. 3. Note that $\mathbf{y}^l$ can now be obtained more efficiently by filtering $\mathbf{y}^{l+1}$ through $\mathbf{w}^l$. Also note that $\{\mathbf{w}_{ij}^l\}_{i,j=1}^{C}$ are to be learned, as marked in Fig. 3. Experimental justification of cascaded filtering is provided in Fig. 5.
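To make the layer structure concrete, here is a minimal numpy sketch of one unrolled layer, i.e., one pass through Steps 4–10 of Algorithm 1 with FFT-domain updates. The circular boundary handling, the fixed derivative-like filters and the parameter values are simplifying assumptions for illustration, not the trained DUBLID configuration.

```python
import numpy as np

def fft2_same(h, shape):
    """DFT of a small filter zero-padded to a given image shape (circular model)."""
    H = np.zeros(shape)
    H[:h.shape[0], :h.shape[1]] = h
    return np.fft.fft2(H)

def dublid_like_layer(y, filters, k, z, zeta, lam, beta, eps=1e-3):
    """One unrolled layer: g-update (Step 5), z-update (Step 6), kernel update (Steps 8-10)."""
    shape = y.shape
    K = fft2_same(k, shape)
    Y = np.fft.fft2(y)
    g_list, z_new, num, den = [], [], 0.0, eps
    for f, zi, zt, lm in zip(filters, z, zeta, lam):
        Yi = fft2_same(f, shape) * Y                      # Step 4: y_i = f_i * y
        G = (zt * np.conj(K) * Yi + np.fft.fft2(zi)) / (zt * np.abs(K) ** 2 + 1.0)
        g = np.real(np.fft.ifft2(G))                      # Step 5 (closed form via DFT)
        zn = np.sign(g) * np.maximum(np.abs(g) - lm * zt, 0.0)   # Step 6: soft-threshold
        g_list.append(g); z_new.append(zn)
        Zn = np.fft.fft2(zn)
        num = num + np.conj(Zn) * Yi                      # accumulate Step 8 numerator
        den = den + np.abs(Zn) ** 2                       # ... and denominator
    k_raw = np.real(np.fft.ifft2(num / den))              # Step 8: unconstrained kernel
    if beta > 0:
        k_raw = k_raw - beta * np.log(np.sum(np.exp(k_raw)))   # Step 9: smooth-max threshold (numerically naive)
    k_raw = np.maximum(k_raw, 0.0)                        # Step 9: ReLU
    k_new = k_raw / (k_raw.sum() + 1e-12)                 # Step 10: normalize to unit sum
    return g_list, z_new, k_new

# illustrative call with random data and two fixed derivative-like filters
y = np.random.rand(64, 64)
filters = [np.array([[1., -1.]]), np.array([[1.], [-1.]])]
z0 = [np.zeros_like(y), np.zeros_like(y)]
k0 = np.zeros((64, 64)); k0[0, 0] = 1.0                   # impulse initialization (Step 1)
g, z, k = dublid_like_layer(y, filters, k0, z0, zeta=[1.0, 1.0], lam=[0.02, 0.02], beta=0.0)
print(k.shape, float(k.sum()))
```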
B. Training

In a given training set, for each blurred image $\mathbf{y}_t^{\mathrm{train}}$ $(t=1,\ldots,T)$, we let the corresponding sharp image and kernel be $\mathbf{x}_t^{\mathrm{train}}$ and $\mathbf{k}_t^{\mathrm{train}}$, respectively. We do not train the parameter $\epsilon$ in Step 8 of Algorithm 1 because it simply serves as a small constant to avoid division by zero. We re-parametrize $\zeta_i^l$ in Step 6 of Algorithm 1 by letting $b_i^l = \lambda_i^l\zeta_i^l$ and denote $\mathbf{b}^l = (b_i^l)_{i=1}^{C}$, $l=1,\ldots,L$. The network outputs $\widetilde{\mathbf{x}}_t$, $\widetilde{\mathbf{k}}_t$ corresponding to $\mathbf{y}_t^{\mathrm{train}}$ depend on the parameters $\mathbf{w}^l$, $\mathbf{b}^l$, $\boldsymbol{\zeta}^l$, $\beta^l$, $l=1,2,\ldots,L$. In addition, $\widetilde{\mathbf{x}}_t$ depends on $\boldsymbol{\eta}$. We train the network to determine these parameters by solving the following optimization problem:

$$\min_{\{\mathbf{w}^l,\mathbf{b}^l,\boldsymbol{\zeta}^l,\beta^l\}_l,\,\boldsymbol{\eta},\,\{\tau_t\}_t}\;\sum_{t=1}^{T}\mathrm{MSE}\!\left(\widetilde{\mathbf{x}}_t\!\left(\{\mathbf{w}^l,\mathbf{b}^l,\boldsymbol{\zeta}^l,\beta^l\}_l,\boldsymbol{\eta}\right),\,\mathbf{x}_t^{\mathrm{train}}\right) + \kappa\,\mathrm{MSE}\!\left(\widetilde{\mathbf{k}}_t\!\left(\{\mathbf{w}^l,\mathbf{b}^l,\boldsymbol{\zeta}^l,\beta^l\}_l\right),\,\mathbf{T}_{\tau_t}\mathbf{k}_t^{\mathrm{train}}\right), \qquad (12)$$

where MSE denotes the mean squared error, $\kappa$ is a positive weight, and $\mathbf{T}_{\tau_t}$ translates the kernel by $\tau_t$ to account for the shift ambiguity inherent in blind deconvolution. We use the Adam algorithm [61] to accelerate the training speed. The learning rate is set to $1\times 10^{-3}$ initially and decayed by a factor of $0.5$ every $20$ epochs. We terminate training after $160$ epochs. The parameters $\{b_i^l\}_{i,l}$ are initialized to $0.02$, $\{\zeta_i^l\}_{i,l}$ are initialized to $1$, $\{\beta^l\}_l$ are initialized to $0$, and $\{\eta_i\}_i$ are initialized to $20$, respectively. These values are again determined through cross-validation. The upper part (feature extraction portion) of the network in Fig. 3 resembles a CNN with linear activations (identities), and thus we initialize the weights according to [62].

C. Handling Color Images

For color images, the red, green and blue channels $\mathbf{y}_r$, $\mathbf{y}_g$, and $\mathbf{y}_b$ are blurred by the same kernel, and thus the following model holds instead of (1):

$$\mathbf{y}_c = \mathbf{k}\ast\mathbf{x}_c + \mathbf{n}_c,\quad c\in\{r,g,b\}.$$

To be consistent with existing literature, we modify $\mathbf{w}^L$ in Fig. 3 to allow for multi-channel inputs. More specifically, $\mathbf{y}^L$ is produced by the following formula:

$$\mathbf{y}_i^L = \sum_{c\in\{r,g,b\}}\mathbf{w}_{ic}^{L}\ast\mathbf{y}_c,\quad i=1,\ldots,C.$$

It is easy to check that, with $\mathbf{w}^L$ and $\mathbf{y}^L$ being replaced, all the components of the network can be left unchanged except for the module in Fig. 1c. This is because (4) no longer holds and is modified to the following:

$$\sum_{c\in\{r,g,b\}}\mathbf{w}_{ic}\ast\mathbf{y}_c = \mathbf{k}\ast\left(\sum_{c\in\{r,g,b\}}\mathbf{w}_{ic}\ast\mathbf{x}_c\right) + \mathbf{n}'_i,\quad i=1,\ldots,C,$$

where $\mathbf{n}'_i = \sum_{c\in\{r,g,b\}}\mathbf{w}_{ic}\ast\mathbf{n}_c$ represents Gaussian noise. Problem (11) then becomes:

$$\{\widetilde{\mathbf{x}}_r,\widetilde{\mathbf{x}}_g,\widetilde{\mathbf{x}}_b\} \leftarrow \arg\min_{\mathbf{x}_r,\mathbf{x}_g,\mathbf{x}_b}\;\sum_{c\in\{r,g,b\}}\frac{1}{2}\left\|\mathbf{y}_c-\widetilde{\mathbf{k}}\ast\mathbf{x}_c\right\|_2^2 + \sum_{i=1}^{C}\frac{\eta_i}{2}\left\|\sum_{c\in\{r,g,b\}}\mathbf{w}_{ic}\ast\mathbf{x}_c - \widetilde{\mathbf{g}}_i\right\|_2^2,$$

whose solution is given as follows:

$$\widetilde{\mathbf{x}}_r = F^{-1}\!\left\{\frac{m_{rr}\mathbf{b}_r + m_{rg}\mathbf{b}_g + m_{rb}\mathbf{b}_b}{d}\right\},\quad
\widetilde{\mathbf{x}}_g = F^{-1}\!\left\{\frac{m_{rg}^{*}\mathbf{b}_r + m_{gg}\mathbf{b}_g + m_{gb}\mathbf{b}_b}{d}\right\},\quad
\widetilde{\mathbf{x}}_b = F^{-1}\!\left\{\frac{m_{rb}^{*}\mathbf{b}_r + m_{gb}^{*}\mathbf{b}_g + m_{bb}\mathbf{b}_b}{d}\right\},$$

where

$$m_{rr} = c_{gg}c_{bb} - |c_{gb}|^2,\quad m_{rg} = c_{rb}c_{gb}^{*} - c_{bb}c_{rg},\quad m_{rb} = c_{rg}c_{gb} - c_{gg}c_{rb},$$
$$m_{gg} = c_{bb}c_{rr} - |c_{rb}|^2,\quad m_{gb} = c_{rg}^{*}c_{rb} - c_{rr}c_{gb},\quad m_{bb} = c_{rr}c_{gg} - |c_{rg}|^2.$$

Here,

$$c_{cc'} = \sum_{i=1}^{C}\eta_i\,\widehat{\mathbf{w}_{ic}}{}^{*}\odot\widehat{\mathbf{w}_{ic'}} + \big|\widehat{\widetilde{\mathbf{k}}}\big|^2\,\delta_{cc'},\quad c,c'\in\{r,g,b\},$$
$$\mathbf{b}_c = \widehat{\widetilde{\mathbf{k}}}{}^{*}\odot\widehat{\mathbf{y}_c} + \sum_{i=1}^{C}\eta_i\,\widehat{\mathbf{w}_{ic}}{}^{*}\odot\widehat{\widetilde{\mathbf{g}}_i},\quad c\in\{r,g,b\},$$
$$d = \left(c_{gg}c_{rr} - |c_{rg}|^2\right)c_{bb} + 2\Re\{c_{rb}^{*}c_{rg}c_{gb}\} - |c_{gb}|^2 c_{rr} - |c_{rb}|^2 c_{gg},$$

and $\delta_{cc'}$ is the Kronecker delta function. These analytical formulas may be represented using diagrams similar to Fig. 1c and embedded into the network.

IV. EXPERIMENTAL VERIFICATION

A. Experimental Setups

1) Datasets, Training and Test Setup:

• Training for linear kernels: For the images we used the Berkeley Segmentation Data Set 500 (BSDS500) [63], which is a large dataset of 500 natural images that is explicitly divided into disjoint training, validation and test subsets. Here we use 300 images for training by combining the training and validation images.
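For reference, the optimizer schedule stated in Section III-B (Adam with initial learning rate 1e-3, halved every 20 epochs, 160 epochs in total, image and kernel losses combined as in (12)) can be wired up as in the following sketch. This is a minimal sketch assuming a PyTorch implementation; the tiny stand-in network and synthetic data are illustrative placeholders and do not reflect the actual DUBLID architecture or training data.

```python
import torch
import torch.nn as nn

class TinyUnrolledNet(nn.Module):
    """Stand-in for the unrolled network (NOT the actual DUBLID architecture):
    a couple of 3x3 conv layers producing an image estimate and a kernel estimate."""
    def __init__(self, num_filters=16):
        super().__init__()
        self.feat = nn.Conv2d(1, num_filters, 3, padding=1)
        self.img_head = nn.Conv2d(num_filters, 1, 3, padding=1)
        self.ker_head = nn.Sequential(nn.AdaptiveAvgPool2d(31),
                                      nn.Conv2d(num_filters, 1, 3, padding=1))
    def forward(self, y):
        h = self.feat(y)
        return self.img_head(h), self.ker_head(h)

model = TinyUnrolledNet()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-3)                        # Adam, lr = 1e-3
scheduler = torch.optim.lr_scheduler.StepLR(optimizer, step_size=20, gamma=0.5)  # halve every 20 epochs
mse = torch.nn.MSELoss()
kappa = 1.0   # weight on the kernel term of Eq. (12); illustrative value

# synthetic tensors standing in for (blurred, sharp, kernel) training triples
y = torch.rand(4, 1, 64, 64); x_gt = torch.rand(4, 1, 64, 64); k_gt = torch.rand(4, 1, 31, 31)

for epoch in range(160):                                                         # 160 epochs total
    x_hat, k_hat = model(y)
    loss = mse(x_hat, x_gt) + kappa * mse(k_hat, k_gt)
    optimizer.zero_grad(); loss.backward(); optimizer.step()
    scheduler.step()
```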
Fig. 3. DUBLID: using a cascade of small 3 × 3 filters instead of large filters (as compared to the network in Fig. 2) reduces the dimensionality of the parameter space, and the network can be easier to train. Intermediate data (hidden layers) of the trained network are also shown. It can be observed that, as l increases, g^l and y^l evolve in a coarse-to-fine manner. The parameters that will be learned from real datasets are colored in blue.
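The parameter saving in Fig. 3 comes from the fact that a cascade of small filters acts as one larger effective filter. The sketch below convolves a few randomly chosen 3 × 3 kernels and reports the size of the resulting effective filter; the random kernels are purely illustrative.

```python
import numpy as np
from scipy.signal import convolve2d

rng = np.random.default_rng(0)
small_filters = [rng.standard_normal((3, 3)) for _ in range(4)]   # four 3x3 filters in cascade

effective = np.array([[1.0]])
for w in small_filters:
    effective = convolve2d(effective, w, mode="full")   # composing convolutions = convolving kernels

# Four chained 3x3 filters behave like a single 9x9 filter (each stage adds
# 2 pixels per side), yet use 4*9 = 36 coefficients instead of 81.
print(effective.shape)            # (9, 9)
print(4 * 3 * 3, effective.size)  # 36 vs 81
```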
TABLE II
EFFECTS OF DIFFERENT VALUES OF THE NUMBER OF FILTERS C.

Number of filters     8       16      32
PSNR (dB)             26.55   27.30   27.16
RMSE (×10⁻³)          1.99    1.67    1.93

We next study the effects of different values of C in a similar fashion. The network performance over different choices of C is summarized in Table II. It can be seen that the network performance clearly improves as C increases from 8 to 16. However, the performance slightly drops when C increases further, presumably because of overfitting. We thus fix C = 16 henceforth.

To corroborate the network design choices made in Section III-A, we illustrate DUBLID performance for different filter choices. We first verify the significance of learning the filters {w^l}_l (and in turn {f_i^l}_{i,l}) by comparing the performance with a typical choice of analytical filters, the Sobel filters, in Fig. 4. Note that by employing Sobel filters, the network reduces to executing TV-based deblurring for a small number of iterations, which coincides with the number of layers L. For fairness in comparison, the fixed Sobel filter version of DUBLID (called DUBLID-Sobel) is trained exactly as in Section III-B to optimize the other parameters. As Fig. 4 reveals, DUBLID-Sobel is unable to accurately recover the kernel. Indeed, such a phenomenon has been observed and analytically studied by Levin et al. [22], who point out that traditional gradient-based approaches can easily get stuck at a delta solution. To gain further insight, we visualize the learned filters as well as the Sobel filters in Fig. 4a and Fig. 4b. The learned filters demonstrate richer variations than the known analytic (Sobel) filters and are able to capture higher-level image features as l grows. This enables the DUBLID network to better recover the kernel coefficients subsequently. Quantitatively, the PSNR achieved by DUBLID-Sobel for L = 10 and C = 16 on the same training-test set-up is 18.60 dB, which implies that DUBLID achieves a 8.7 dB gain by explicitly optimizing filters in a data-adaptive fashion.

Fig. 4. Comparison of learned filters with analytic Sobel filters: (a) DUBLID learned filters {f_i^1}, {f_i^2}, ..., {f_i^L}. (b) Sobel filters that are commonly employed in traditional iterative blind deblurring algorithms. (c) An example motion blur kernel. (d) Reconstructed kernel using Sobel filters and (e) using learned filters.

Finally, we show the effectiveness of cascaded filtering. To this end, we compare with the alternative scheme of fixing the size of {f_i^l}_{i,l} by restricting {w^l}_l to be of size 1 × 1 whenever l < L. The results are shown in Fig. 5. By employing learnable filters, the network becomes capable of capturing the correct directions of blur kernels, as shown in Fig. 5b. In the absence of cascaded filtering, though, the recovered kernel is still coarse, a limitation that is overcome by using cascaded filtering, as verified in Fig. 5c.

Fig. 5. The effectiveness of cascaded filtering: (a) a sample motion kernel. (b) Reconstructed kernel obtained by fixing all f_i^l's to be of size 3 × 3, which can be implemented by enforcing w_{ij}^l to be of size 1 × 1 whenever l < L. (c) Reconstructed kernel using the cascaded filtering structure in Fig. 3.
in range [0, 1]) individually. Examples of training samples (images and kernels) are shown in Fig. 6. We use 200 images from the test portion of the BSDS500 dataset [70] for evaluation. We randomly choose angles in [0, π] and lengths in [5, 20] to generate 4 test kernels. The images and kernels are convolved to synthesize 800 blurred images. White Gaussian noise (again with standard deviation 0.01) is also added. Note that some of the state of the art methods compared against are only designed to recover the kernels, including [29] and [23]. To get the deblurred image, the non-blind method in [13] is used consistently. The scores are averaged and summarized in Table III. The RMSE values are computed over kernels, and smaller values indicate more accurate recoveries. For all other metrics on images, higher scores generally imply better performance. We do not include results from Chakrabarti et al. [28] here because that method works on grayscale images only. Table III confirms that DUBLID outperforms competing state-of-the-art algorithms by a significant margin.

Fig. 7 shows four example images and kernels for a qualitative comparison. The two top-performing methods, Perrone et al. [23] and Nah et al. [40], are also included as representatives of iterative methods and deep learning methods, respectively. Although [23] can roughly infer the directions of the blur kernels, the recovered coefficients clearly differ from the groundtruth, as evidenced by the spread-out branches. Consequently, destroyed local structures and false colors are clearly observed in the reconstructed images. Nah et al.'s method [40] does not suffer from false colors, yet the recovered images appear blurry. In contrast, DUBLID recovers kernels close to the groundtruth, and produces significantly fewer visually objectionable artifacts in the recovered images.

D. Evaluation on Non-linear Kernels

It has been observed in several previous works [22], [71] that realistic motion kernels often have non-linear shapes due to irregular camera motions, such as those shown in Fig. 8. Therefore, the capability to handle such kernels is crucial for a blind motion deblurring method.

We generate training kernels by interpolating the paths provided by [71] and those created by ourselves: specifically, we record the camera motion trajectories using the Vicon system, and then interpolate the trajectories spatially to create motion kernels. We further augment these kernels by scaling over 4 different scales and rotating over 8 directions. In this way, we build around 30,000 training kernels in total⁴. The blurred images for training are synthesized by randomly picking a kernel and convolving with it. Gaussian noise of standard deviation 0.01 is again added. We use the standard image sets from [22] (comprising 4 images and 8 kernels) and from [12] (comprising 80 images and 8 kernels) as the test sets. The average scores for both datasets are presented in Table IV and Table V, respectively. On both datasets, DUBLID emerges overall as the best method. The method of Chakrabarti et al. [28] performs second best in Table V. In Table IV, Perrone et al. [23] and the recent deep learning method of Xu et al. [29] perform comparably and mildly worse than DUBLID. DUBLID, however, achieves the deblurring at a significantly lower computational cost, as verified in Section IV-E.

⁴ To re-emphasize, all learning based methods use the same training-test configuration for fairness in comparison.

Visual examples are shown in Figs. 9 and 10 for qualitative comparisons. It can be clearly seen that DUBLID is capable of more faithfully recovering the kernels, and hence produces reconstructed images of higher visual quality. In particular, DUBLID preserves local details better, as shown in the zoom boxes of Figs. 9 and 10, while providing sharper images than Nah et al. [40], Chakrabarti et al. [28] and Kupyn et al. [66]. Finally, DUBLID is free of the visually objectionable artifacts observed in Perrone et al. [23] and Xu et al. [29].

E. Computational Comparisons Against State of the Art

Table VI summarizes the execution (inference) times of each method for processing a typical blurred image of resolution 480 × 320 with a blur kernel of size 31 × 31. The number of parameters for DUBLID is estimated as follows: for 3 × 3 filters w_{ij}, there are a total of L = 10 layers and in each layer there are C² = 16 × 16 filters, which contribute 3 × 3 × 16 × 16 × 10 ≈ 2.3 × 10⁴ parameters. Other parameters have negligible dimensions compared with w_{ij} and thus do not contribute significantly.

We include measurements of running time on both CPU and GPU. The − symbol indicates inapplicability. For instance, Chakrabarti et al. [28] and Nah et al. [40] only provide GPU implementations of their work, and likewise Perrone et al.'s iterative method [23] is only implemented on a CPU. Specifically, the two benchmark platforms are: 1.) an Intel Core i7-6900K, 3.20 GHz CPU with 8 GB of RAM, and 2.) an NVIDIA TITAN X GPU. The results in Table VI deliver two messages. First, the deep/neural network based methods are faster than their iterative algorithm counterparts, which is to be expected. Second, amongst the deep neural net methods DUBLID runs significantly faster than the others on both GPU and CPU, largely because it has significantly fewer parameters, as seen in the final row of Table VI. Note that the numbers of parameters for competing deep learning methods are computed based on the descriptions in their respective papers.

V. CONCLUSION

We propose an Algorithm Unrolling approach for Deep Blind image Deblurring (DUBLID). Our approach is based on recasting a generalized TV-regularized algorithm into a neural network, and optimizing its parameters via a custom designed backpropagation procedure. Unlike most existing neural network approaches, our technique has the benefit of interpretability, while sharing the performance benefits of modern neural network approaches. While some existing approaches excel for the case of linear kernels and others for non-linear ones, our method is versatile across a variety of scenarios and kernel choices, as is verified both visually and quantitatively. Further, DUBLID requires far fewer parameters, leading to significant computational benefits over iterative methods as well as competing deep learning techniques.
Fig. 7. Qualitative comparisons on the BSDS500 [70] dataset: (a) Groundtruth, (b) Perrone et al. [23], (c) Nah et al. [40], (d) DUBLID. The blur kernels are placed at the right bottom corner. DUBLID recovers the kernel at higher accuracy and therefore the estimated images are more faithful to the groundtruth.

TABLE III
QUANTITATIVE COMPARISON OVER AN AVERAGE OF 200 IMAGES AND 4 KERNELS. THE BEST SCORES ARE IN BOLD FONTS.

Metrics          DUBLID   Perrone et al. [23]   Nah et al. [40]   Xu et al. [29]   Kupyn et al. [66]
PSNR (dB)        27.30    22.23                 24.82             24.02            23.98
ISNR (dB)        4.45     2.06                  1.92              1.12             1.05
SSIM             0.88     0.76                  0.80              0.78             0.78
RMSE (×10⁻³)     1.67     5.21                  −                 2.40             −

TABLE IV
QUANTITATIVE COMPARISON OVER AN AVERAGE OF 4 IMAGES AND 8 KERNELS FROM [22].

Metrics          DUBLID   Perrone et al. [23]   Nah et al. [40]   Chakrabarti et al. [28]   Xu et al. [29]   Kupyn et al. [66]
PSNR (dB)        27.15    26.79                 24.51             23.21                     26.75            23.98
ISNR (dB)        3.79     3.63                  1.35              0.06                      3.59             0.43
SSIM             0.89     0.89                  0.81              0.81                      0.89             0.80
RMSE (×10⁻³)     3.87     3.83                  −                 4.33                      3.98             −

TABLE V
QUANTITATIVE COMPARISON OVER AN AVERAGE OF 80 IMAGES AND 8 NONLINEAR MOTION KERNELS FROM [12].

Metrics          DUBLID   Perrone et al. [23]   Nah et al. [40]   Chakrabarti et al. [28]   Xu et al. [29]   Kupyn et al. [66]
PSNR (dB)        29.91    29.82                 26.98             29.86                     26.55            25.84
ISNR (dB)        4.11     4.02                  0.86              4.06                      0.43             0.15
SSIM             0.93     0.92                  0.85              0.91                      0.87             0.83
RMSE (×10⁻³)     2.33     2.68                  −                 2.72                      2.79             −
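The tables report standard full-reference metrics. The paper's exact implementations are not reproduced here; the sketch below gives the usual textbook definitions of PSNR, ISNR and kernel RMSE as assumptions (SSIM is typically computed with a library such as scikit-image).

```python
import numpy as np

def psnr(x_hat, x, peak=1.0):
    """Peak signal-to-noise ratio in dB for images scaled to [0, peak]."""
    mse = np.mean((x_hat - x) ** 2)
    return 10.0 * np.log10(peak ** 2 / mse)

def isnr(x_hat, x, y):
    """Improvement in SNR: gain of the restored image over the blurred input."""
    return 10.0 * np.log10(np.sum((y - x) ** 2) / np.sum((x_hat - x) ** 2))

def kernel_rmse(k_hat, k):
    """Root mean squared error between estimated and groundtruth kernels."""
    return np.sqrt(np.mean((k_hat - k) ** 2))

x = np.random.rand(64, 64)
y = x + 0.05 * np.random.randn(64, 64)        # "blurred/noisy" stand-in
x_hat = x + 0.01 * np.random.randn(64, 64)    # "restored" stand-in
print(round(psnr(x_hat, x), 2), round(isnr(x_hat, x, y), 2))
```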
TABLE VI
RUNNING TIME COMPARISONS OVER DIFFERENT METHODS. THE IMAGE SIZE IS 480 × 320 AND THE KERNEL SIZE IS 31 × 31.

                        DUBLID      Chakrabarti et al. [28]   Nah et al. [40]   Perrone et al. [23]   Xu et al. [29]   Kupyn et al. [66]
CPU Time (s)            1.47        −                         −                 1462.90               6.89             10.29
GPU Time (s)            0.05        227.80                    7.32              −                     2.01             0.13
Number of Parameters    2.3 × 10⁴   1.1 × 10⁸                 2.3 × 10⁷         −                     6.0 × 10⁶        1.2 × 10⁷
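The DUBLID parameter count in Table VI can be reproduced directly from the architecture description in Section IV-E (ten layers of 16 × 16 banks of 3 × 3 filters):

```python
# Parameter count estimate for DUBLID as described in Section IV-E:
# 3x3 filters, C*C = 16*16 filters per layer, L = 10 layers.
filter_size, C, L = 3 * 3, 16, 10
print(filter_size * C * C * L)   # 23040, i.e. about 2.3e4 learnable filter coefficients
```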
APPENDIX

Here we develop the back-propagation rules for computing the gradients of DUBLID. We will use $F$ to denote the DFT operator and $F^{*}$ its adjoint operator, and $\mathbf{1}$ is a vector whose entries are all ones. $I$ refers to the identity matrix. The symbol $\mathbb{I}_{\{\cdot\}}$ denotes an indicator vector and $\mathrm{diag}(\cdot)$ embeds the vector into a diagonal matrix. The operators $P_g$ and $P_k$ are projections that restrict the operand to the domain of the image and of the kernel, respectively. Let $\mathcal{L}$ be the cost function defined in (12). We derive its gradients w.r.t. its variables using the chain rule as follows:

$$\frac{\partial \mathcal{L}}{\partial \mathbf{k}^l} = \sum_i \frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{l+1}}\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{k}^l},\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{l+1}}\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{z}_i^{l}} + \frac{\partial \mathcal{L}}{\partial \mathbf{k}^{l}}\frac{\partial \mathbf{k}^{l}}{\partial \mathbf{z}_i^{l}},\qquad
\frac{\partial \mathcal{L}}{\partial \mathbf{y}_i^{l}} = \frac{\partial \mathcal{L}}{\partial \mathbf{z}_i^{l+1}}\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{y}_i^{l}} + \frac{\partial \mathcal{L}}{\partial \mathbf{k}^{l+1}}\frac{\partial \mathbf{k}^{l+1}}{\partial \mathbf{y}_i^{l}} + \frac{\partial \mathcal{L}}{\partial \mathbf{y}_i^{l-1}}\frac{\partial \mathbf{y}_i^{l-1}}{\partial \mathbf{y}_i^{l}}. \qquad (13)$$

The gradients with respect to the learnable filters and thresholds are

$$\nabla_{\mathbf{w}_i^l}\mathcal{L} = \nabla_{\mathbf{w}_i^l}\mathbf{y}_i^l\,\nabla_{\mathbf{y}_i^l}\mathcal{L} = R_{\mathbf{w}_i^l}\,F\,\mathrm{diag}\big(\widehat{\mathbf{y}^{l+1}}\big)^{*}F^{*}\,\nabla_{\mathbf{y}_i^l}\mathcal{L},\qquad
\nabla_{b_i^l}\mathcal{L} = \nabla_{b_i^l}\mathbf{z}_i^{l+1}\,\nabla_{\mathbf{z}_i^{l+1}}\mathcal{L} = \left(\mathbb{I}_{\{\mathbf{g}_i^{l+1}<-b_i^l\}} - \mathbb{I}_{\{\mathbf{g}_i^{l+1}>b_i^l\}}\right)^{\!\top}\nabla_{\mathbf{z}_i^{l+1}}\mathcal{L},$$

where $R_{\mathbf{w}_i^l}$ is the operator that extracts the components lying in the support of $\mathbf{w}_i^l$. Again using the chain rule, and writing $D_i^l = \zeta_i^l\big|\widehat{\mathbf{k}^l}\big|^2+1$ for the denominator in Step 5 and $E^l = \sum_j\big|\widehat{\mathbf{z}_j^{l+1}}\big|^2+\epsilon$ for that in Step 8, the individual factors in (13) are

$$\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{z}_i^{l}} = \mathrm{diag}\Big(\mathbb{I}_{\{|P_g\mathbf{g}_i^{l+1}|>b_i^l\}}\Big)\,F^{*}\,\mathrm{diag}\!\left(\frac{1}{D_i^l}\right)F, \qquad (14)$$

$$\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{y}_i^{l}} = \frac{\partial \mathbf{z}_i^{l+1}}{\partial \widehat{\mathbf{g}_i^{l+1}}}\frac{\partial \widehat{\mathbf{g}_i^{l+1}}}{\partial \widehat{\mathbf{y}_i^{l}}}\frac{\partial \widehat{\mathbf{y}_i^{l}}}{\partial \mathbf{y}_i^{l}} = \mathrm{diag}\Big(\mathbb{I}_{\{|P_g\mathbf{g}_i^{l+1}|>b_i^l\}}\Big)\,F^{*}\,\mathrm{diag}\!\left(\frac{\zeta_i^l\,\widehat{\mathbf{k}^l}{}^{*}}{D_i^l}\right)F, \qquad (15)$$

$$\frac{\partial \mathbf{z}_i^{l+1}}{\partial \mathbf{k}^{l}} = \mathrm{diag}\Big(\mathbb{I}_{\{|P_g\mathbf{g}_i^{l+1}|>b_i^l\}}\Big)\,F^{*}\!\left[\mathrm{diag}\!\left(\frac{\zeta_i^l\big(\widehat{\mathbf{y}_i^{l}}-\widehat{\mathbf{k}^l}\odot\widehat{\mathbf{g}_i^{l+1}}\big)}{D_i^l}\right)F^{*} - \mathrm{diag}\!\left(\frac{\zeta_i^l\,\widehat{\mathbf{k}^l}{}^{*}\odot\widehat{\mathbf{g}_i^{l+1}}}{D_i^l}\right)F\right], \qquad (16)$$

$$\frac{\partial \mathbf{k}^{l+1}}{\partial \mathbf{y}_i^{l}} = \frac{I\,\mathbf{1}^{\top}\mathbf{k}^{l+\frac23} - \mathbf{k}^{l+\frac23}\mathbf{1}^{\top}}{\big(\mathbf{1}^{\top}\mathbf{k}^{l+\frac23}\big)^2}\,\mathrm{diag}\Big(\mathbb{I}_{\{P_k\mathbf{k}^{l+\frac13}>0\}}\Big)\,F^{*}\,\mathrm{diag}\!\left(\frac{\widehat{\mathbf{z}_i^{l+1}}{}^{*}}{E^l}\right)F, \qquad (17)$$
Fig. 9. Qualitative comparisons on the dataset from [22]: (a) Groundtruth, (b) Perrone et al. [23], (c) Nah et al. [40], (d) Chakrabarti et al. [28], (e) Xu et al. [29], (f) Kupyn et al. [66], (g) DUBLID. The blur kernels are placed at the right bottom corner. DUBLID generates fewer artifacts and preserves more details than competing state of the art methods.

Fig. 10. Qualitative comparisons on the dataset from [12]: (a) Groundtruth, (b) Perrone et al. [23], (c) Nah et al. [40], (d) Chakrabarti et al. [28], (e) Xu et al. [29], (f) Kupyn et al. [66], (g) DUBLID. The blur kernels are placed at the right bottom corner.
$$\frac{\partial \mathbf{k}^{l+1}}{\partial \mathbf{z}_i^{l+1}} = \frac{I\,\mathbf{1}^{\top}\mathbf{k}^{l+\frac23} - \mathbf{k}^{l+\frac23}\mathbf{1}^{\top}}{\big(\mathbf{1}^{\top}\mathbf{k}^{l+\frac23}\big)^2}\,\mathrm{diag}\Big(\mathbb{I}_{\{P_k\mathbf{k}^{l+\frac13}>0\}}\Big)\,F^{*}\!\left[\mathrm{diag}\!\left(\frac{\widehat{\mathbf{y}_i^{l}}-\widehat{\mathbf{k}^{l+\frac13}}\odot\widehat{\mathbf{z}_i^{l+1}}}{E^l}\right)F^{*} - \mathrm{diag}\!\left(\frac{\widehat{\mathbf{k}^{l+\frac13}}\odot\widehat{\mathbf{z}_i^{l+1}}{}^{*}}{E^l}\right)F\right], \qquad (18)$$

$$\frac{\partial \mathbf{y}_i^{l-1}}{\partial \mathbf{y}_i^{l}} = \frac{\partial \mathbf{y}_i^{l-1}}{\partial \widehat{\mathbf{y}_i^{l-1}}}\frac{\partial \widehat{\mathbf{y}_i^{l-1}}}{\partial \widehat{\mathbf{y}_i^{l}}}\frac{\partial \widehat{\mathbf{y}_i^{l}}}{\partial \mathbf{y}_i^{l}} = F^{*}\,\mathrm{diag}\big(\widehat{\mathbf{w}_i^{l-1}}\big)\,F. \qquad (19)$$

Plugging (14), (15), (16), (17), (18) and (19) into (13), we obtain the gradients $\nabla_{\mathbf{y}_i^l}\mathcal{L}$, $\nabla_{\mathbf{z}_i^l}\mathcal{L}$ and $\nabla_{\mathbf{k}^l}\mathcal{L}$ at every layer, which, together with $\nabla_{\mathbf{w}_i^l}\mathcal{L}$ and $\nabla_{b_i^l}\mathcal{L}$ above, complete the back-propagation rules used to train DUBLID.
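A quick way to sanity-check factors such as (14) and (15) is a finite-difference test of the corresponding forward step. The sketch below does this for the soft-thresholding stage, whose Jacobian is the indicator mask used throughout the appendix; the test vector and step size are illustrative assumptions.

```python
import numpy as np

def soft_threshold(v, b):
    return np.sign(v) * np.maximum(np.abs(v) - b, 0.0)

def jacobian_mask(v, b):
    """Analytic (diagonal) Jacobian of the soft-threshold: an indicator of |v| > b."""
    return (np.abs(v) > b).astype(float)

rng = np.random.default_rng(1)
v = rng.standard_normal(8)
b, eps = 0.3, 1e-6

analytic = jacobian_mask(v, b)
numeric = (soft_threshold(v + eps, b) - soft_threshold(v - eps, b)) / (2 * eps)
print(np.max(np.abs(analytic - numeric)))   # ~0 away from the kink |v| = b
```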