Efficient Blind Compressed Sensing Using Sparsifying Transforms With Convergence Guarantees and Application To MRI
Abstract. Natural signals and images are well-known to be approximately sparse in transform domains such
as Wavelets and DCT. This property has been heavily exploited in various applications in image
processing and medical imaging. Compressed sensing exploits the sparsity of images or image
patches in a transform domain or synthesis dictionary to reconstruct images from undersampled
measurements. In this work, we focus on blind compressed sensing, where the underlying sparsifying
transform is a priori unknown, and propose a framework to simultaneously reconstruct the underlying
image as well as the sparsifying transform from highly undersampled measurements. The proposed
block coordinate descent type algorithms involve highly efficient optimal updates. Importantly, we
prove that although the proposed blind compressed sensing formulations are highly nonconvex, our
algorithms are globally convergent (i.e., they converge from any initialization) to the set of critical
points of the objectives defining the formulations. These critical points are guaranteed to be at
least partial global and partial local minimizers. The exact point(s) of convergence may depend
on initialization. We illustrate the usefulness of the proposed framework for magnetic resonance
image reconstruction from highly undersampled k-space measurements. As compared to previous
methods involving the synthesis dictionary model, our approach is much faster, while also providing
promising reconstruction quality.
Key words. Sparsifying transforms, Inverse problems, Compressed sensing, Medical imaging, Magnetic resonance imaging, Sparse representation, Dictionary learning.
Footnote 1: The sparsifying transform model enjoys similar advantages over other models, such as the noisy signal analysis dictionary model [70], which is not discussed here for reasons of space.
where we used the substitution $\Psi x = z$, and $(\cdot)^H$ denotes the matrix Hermitian (conjugate transpose) operation. Similar to the synthesis sparse coding problem, Problem (1.2) is also NP-hard. Often, the $\ell_0$ quasi-norm in (1.1) is replaced with its convex relaxation, the $\ell_1$ norm [20], and the following convex problem is solved to reconstruct the image when the CS measurements are noisy [38, 21]:

(1.3)  $\min_{x}\; \nu \|Ax - y\|_2^2 + \|\Psi x\|_1$
In Problem (1.3), the ℓ2 penalty for the measurement fidelity term can also be replaced
with alternative penalties such as a weighted ℓ2 penalty, depending on the physics of the
measurement process and the statistics of the measurement noise.
Recently, CS theory has been applied to imaging techniques such as magnetic resonance imaging (MRI) [38, 39, 10, 76, 34, 58, 62], computed tomography (CT) [11, 13, 35], and positron emission tomography (PET) [78, 43], demonstrating high quality reconstructions from a reduced set of measurements. Such compressive measurements are highly
advantageous in these applications. For example, they help reduce the radiation dosage in
CT, and reduce scan times and improve clinical throughput in MRI. Well-known inverse prob-
lems in image processing such as inpainting (where an image is reconstructed from a subset
of measured pixels) can also be viewed as compressed sensing problems.
1.3. Blind Compressed Sensing. While conventional compressed sensing techniques utilize fixed analytical sparsifying transforms such as wavelets [45], finite differences, and contourlets [18] to reconstruct images, in this work we instead focus on the idea of blind compressed sensing (BCS) [64, 65, 30, 36, 80], where the underlying sparse model is assumed
unknown a priori. The goal of blind compressed sensing is to simultaneously reconstruct the
underlying image(s) as well as the dictionary or transform from highly undersampled mea-
surements. Thus, BCS enables the sparse model to be adaptive to the specific data under
consideration. Recent research has shown that such data-driven adaptation of dictionaries
or transforms is advantageous in many applications [23, 41, 1, 42, 85, 64, 69, 66, 73, 81].
While the adaptation of synthesis dictionaries [52, 26, 2, 83, 75, 40] has been extensively studied, recent work has shown advantages, in terms of computation and application-specific performance, for the adaptation of transform models [70, 73, 81].
In a prior work on BCS [64], we successfully demonstrated the usefulness of dictionary-
based blind compressed sensing for MRI, even in the case when the undersampled measure-
ments corresponding to only a single image are provided. In the latter case, the overlapping patches of the underlying image are assumed to be sparse in a dictionary, and the (unknown) patch-based dictionary, which is typically much smaller in size than the image, is learnt directly from the compressive measurements.
BCS techniques have been demonstrated to provide much better image reconstruction
quality compared to compressed sensing methods that utilize a fixed sparsifying transform
or dictionary [64, 36, 80]. This is not surprising since BCS methods allow for data-specific
adaptation, and data-specific dictionaries typically sparsify the underlying images much better
than analytical ones.
1.4. Contributions.
1.4.1. Highlights. The BCS framework assumes a particular class of sparse models for
the underlying image(s) or image patches. While prior work on BCS primarily focused on the
synthesis dictionary model, in this work, we instead focus on the sparsifying transform model.
We propose novel problem formulations for BCS involving well-conditioned or orthonormal
adaptive square sparsifying transforms. Our framework simultaneously adapts the sparsifying
transform and reconstructs the underlying image(s) from highly undersampled measurements.
We propose efficient block coordinate descent-type algorithms for transform-based BCS. Im-
portantly, we establish that our iterative algorithms are globally convergent (i.e., they converge
from any initialization) to the set of critical points of the proposed highly non-convex BCS
cost functions. These critical points are guaranteed to be at least partial global and par-
tial local minimizers. The exact point(s) of convergence may depend on initialization. Such
convergence guarantees have not been established for prior blind compressed sensing methods.
Note that although we focus on compressed sensing in the discussions and experiments
of this work, the formulations and algorithms proposed by us can also handle the case when
the measurement or sensing matrix A is square (e.g., in signal denoising), or even tall (e.g.,
deconvolution).
1.5. Organization. The rest of this paper is organized as follows. Section 2 describes our
transform learning-based blind compressed sensing formulations and their properties. In Sec-
tion 3, we derive efficient block coordinate descent algorithms for solving the BCS Problems,
and discuss the algorithms’ computational costs. In Section 4, we present novel convergence
guarantees for our algorithms. The proof of convergence is provided in the Appendix. Section 5 presents numerical experiments demonstrating the convergence behavior and performance of the proposed scheme for CS MRI.
Here, $\nu > 0$ is a weight for the measurement fidelity term ($\|Ax - y\|_2^2$), and $P_j \in \mathbb{C}^{n \times p}$ represents the operator that extracts a $\sqrt{n} \times \sqrt{n}$ 2D patch as a vector $P_j x \in \mathbb{C}^n$ from the image x. A total of N overlapping 2D patches are used. The synthesis model allows each patch $P_j x$ to be approximated by a linear combination $D b_j$ of a small number of columns from a dictionary $D \in \mathbb{C}^{n \times K}$, where $b_j \in \mathbb{C}^K$ is sparse. The columns of the learnt dictionary (represented by $d_k$, $1 \leq k \leq K$) in (P0) are additionally constrained to be of unit norm in order to avoid the scaling ambiguity [33]. The dictionary and the image patches are assumed to be much smaller than the image ($n, K \ll p$) in (P0). Problem (P0) thus enforces all the N (a typically large number) overlapping image patches to be sparse in some dictionary D, which can be considered a strong yet flexible prior on the underlying image.
We use $B \in \mathbb{C}^{n \times N}$ to denote the matrix that has the sparse codes $b_j$ of the patches as its columns. Each sparse code is permitted a maximum sparsity level of $s \ll n$ in (P0). Although a single sparsity level is used for all patches in (P0) for simplicity, in practice different sparsity levels may be allowed for different patches (for example, by setting an appropriate error threshold in the sparse coding step of optimization algorithms [64]). For the case of MRI, the sensing matrix A in (P0) is $F_u \in \mathbb{C}^{m \times p}$, the undersampled Fourier encoding matrix [64]. The weight ν in (P0) is set depending on the measurement noise standard deviation σ as $\nu = \theta/\sigma$, where θ is a positive constant [64]. In practice, if the noise level is unknown, it may be estimated.
Problem (P0) is to learn a patch-based synthesis sparsifying dictionary (n, K ≪ p), and
reconstruct the image simultaneously from highly undersampled measurements. As discussed
before, we have previously shown significantly superior image reconstructions for MRI using
(P0), as compared to non-adaptive compressed sensing schemes that solve Problem (1.3).
However, the BCS Problem (P0) is both non-convex and NP-hard. Approximate iterative
algorithms for (P0) (e.g., the DLMRI algorithm [64]) typically solve the synthesis sparse coding
problem repeatedly, which makes them computationally expensive. Moreover, no convergence
guarantees exist for the algorithms that solve (P0).
2.2. Sparsifying Transform-based Blind Compressed Sensing.
2.2.1. Problem Formulations with Sparsity Constraints. In order to overcome some of
the aforementioned drawbacks of synthesis dictionary-based BCS, we propose using the spar-
sifying transform model in this work. Sparsifying transform learning has been shown to be
effective and efficient in applications, while also enjoying good convergence guarantees [73].
Therefore, we use the following transform learning regularizer [70]
(2.3)  $\zeta(x) = \min_{W,B}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + \lambda\, Q(W)$  s.t.  $\|B\|_0 \leq s$

along with the constraint set $S = \{x \in \mathbb{C}^p : \|x\|_2 \leq C\}$ within Problem (2.1) to arrive at the following adaptive sparsifying transform-based BCS problem formulation
(P1)  $\min_{x,W,B}\; \nu \|Ax - y\|_2^2 + \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + \lambda\, Q(W)$  s.t.  $\|B\|_0 \leq s,\; \|x\|_2 \leq C$.
Here, $W \in \mathbb{C}^{n \times n}$ denotes a square sparsifying transform for the patches of the underlying image. The penalty $\|W P_j x - b_j\|_2^2$ denotes the sparsification error (transform domain residual) [70] for the jth patch, with $b_j$ denoting the transform sparse code. The sparsity term $\|B\|_0 = \sum_{j=1}^{N} \|b_j\|_0$ counts the number of non-zeros in the sparse code matrix B. Notice that the sparsity constraint in (P1) is enforced on all the overlapping patches, taken together. This is a way of enabling variable sparsity levels for each specific patch. The constraint $\|x\|_2 \leq C$ with $C > 0$ in (P1) enforces any prior knowledge on the signal energy (or range). For example, if the pixels of the underlying image take intensity values in the range 0-255, then $C = 255\sqrt{p}$ is an appropriate bound. The function $Q(W) : \mathbb{C}^{n \times n} \mapsto \mathbb{R}$ in Problem (P1) denotes a regularizer for the transform, and $\lambda > 0$ is the corresponding weight. Notice that without an additional regularizer, W = 0 is a trivial sparsifier for any patch; therefore, $W = 0$, $b_j = 0\ \forall j$, $x = A^{\dagger} y$ (assuming this x satisfies $\|x\|_2 \leq C$), with $(\cdot)^{\dagger}$ denoting the pseudo-inverse, would trivially minimize the objective (without the regularizer Q(W)) in Problem (P1).
Similar to prior work on transform learning [70, 67], we set $Q(W) \triangleq -\log|\det W| + 0.5\|W\|_F^2$ as the regularizer in the objective to prevent trivial solutions. The $-\log|\det W|$ penalty eliminates degenerate solutions such as those with repeated rows. The $\|W\|_F^2$ penalty helps remove a 'scale ambiguity' in the solution [70], which occurs when the optimal solution
satisfies an exactly sparse representation, i.e., the optimal (x, W, B) in (P1) is such that $W P_j x = b_j\ \forall j$, and $\|B\|_0 \leq s$. In this case, if the $\|W\|_F^2$ penalty were absent in (P1), the optimal (W, B) could be scaled by $\beta \in \mathbb{C}$ with $|\beta| \to \infty$, which causes the objective to decrease without bound.
The $-\log|\det W|$ and $0.5\|W\|_F^2$ penalties together also help control the condition number $\kappa(W)$ and the scaling of the learnt transform. If we were to minimize only the Q(W) regularizer in Problem (P1) with respect to W, then the minimum would be achieved by a W that is unit conditioned and has spectral norm (scaling) of 1 [70], i.e., a unitary or orthonormal transform W. Thus, similar to Corollary 2 in [70], it is easy to show that as $\lambda \to \infty$ in Problem (P1), the optimal sparsifying transform(s) tend to a unitary one. In practice, transforms learnt via (P1) are typically close to unitary even for finite λ. Adaptive well-conditioned transforms (small $\kappa(W) > 1$) have previously been shown to perform better than adaptive (strictly) orthonormal ones in some scenarios in image representation or image denoising [70, 67].
In this work, we set $\lambda = \lambda_0 N$ in (P1), where $\lambda_0 > 0$ is a constant. This setting allows λ to scale with the size of the data (i.e., the total number of patches). In practice, the weight $\lambda_0$ needs to be set according to the expected range (in intensity values) of the underlying image, as well as the desired condition number of the learnt transform. The weight ν in (P1) is set similarly to (P0).
When a unitary sparsifying transform is preferred, the Q(W) regularizer in (P1) (and in (2.3)) could instead be replaced by the constraint $W^H W = I$, where I denotes the identity matrix, yielding the following formulation

(P2)  $\min_{x,W,B}\; \nu \|Ax - y\|_2^2 + \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2$  s.t.  $W^H W = I,\; \|B\|_0 \leq s,\; \|x\|_2 \leq C$.
The unitary sparsifying transform case is special, in that Problem (P2) is also a unitary synthesis dictionary-based blind compressed sensing problem, with $W^H$ denoting the synthesis dictionary. This follows from the identity $\|W P_j x - b_j\|_2 = \|P_j x - W^H b_j\|_2$ for unitary W.
2.2.2. Properties of Transform BCS Formulations - Identifiability and Uniqueness.
The following simple proposition considers an “error-free” scenario and establishes the global
identifiability of the underlying image and sparse model in BCS via solving the proposed
Problems (P1) or (P2).
Proposition 2.1. Let $x \in \mathbb{C}^p$ with $\|x\|_2 \leq C$, and let $y = Ax$ with $A \in \mathbb{C}^{m \times p}$. Suppose that $W \in \mathbb{C}^{n \times n}$ is a unitary transform that sparsifies the collection of patches of x as $\sum_{j=1}^{N} \|W P_j x\|_0 \leq s$. Further, let B denote the matrix that has $W P_j x$ as its columns. Then, (x, W, B) is a global minimizer of both Problems (P1) and (P2), i.e., it is identifiable by solving these problems.
Proof. For the given (x, W, B), the terms $\sum_{j=1}^{N} \|W P_j x - b_j\|_2^2$ and $\|Ax - y\|_2^2$ in (P1)
and (P2) each attain their minimum possible value (lower bound) of zero. Since W is unitary,
the penalty Q(W ) in (P1) is also minimized by the given W . Notice that the constraints in
both (P1) and (P2) are satisfied for the given (x, W, B). Therefore, this triplet is feasible for
both problems and achieves the minimum possible value of the objective in both cases. Thus,
it is a global minimizer of both (P1) and (P2).
Thus, when “error-free” measurements are provided, and the patches of the underlying
image are exactly sparse (as defined by the constraint in (P1)) in some unitary transform,
Proposition 2.1 guarantees that the image as well as the model are jointly identifiable by
solving (i.e., finding global minimizers in) (P1) (or, (P2)).
An interesting topic, which we do not fully pursue here pertains to the condition(s) under
which the underlying image in Proposition 2.1 is the unique minimizer of the proposed BCS
problems. The proposed problems do admit an equivalence class of solutions/minimizers with
respect to the transform W and the set of sparse codes $\{b_j\}_{j=1}^{N}$ (see footnote 2). Given a particular minimizer (x, W, B) of (P1) or (P2), we have that $(x, \Theta W, \Theta B)$ is another equivalent minimizer for all sparsity-preserving unitary matrices $\Theta$, i.e., $\Theta$ such that $\Theta^H \Theta = I$ and $\sum_j \|\Theta b_j\|_0 \leq s$. For
example, Θ can be a row permutation matrix, or a diagonal ±1 sign matrix. Importantly
however, the minimizer with respect to x in (P1) or (P2) is invariant to the modification of
(W, B) by sparsity-preserving unitary matrices Θ, i.e., the optimal x in (P1) or (P2) remains
the same for all such choices of (ΘW, ΘB).
We note that by imposing additional structure on W in our transform-based BCS for-
mulations, one can derive conditions for the uniqueness of the minimizers. Assume a global
minimizer (x, W, B) exists in (P2) satisfying the conditions (error-free scenario) in Proposition
2.1. Then, for example, when the unitary transform W in (P2) is further constrained to be
doubly sparse, i.e., W = SΦ, with S a sparse matrix and Φ a known matrix (e.g., a DCT or Wavelet matrix), then because $W^H = \Phi^H S^H$ is an equivalent (doubly sparse) synthesis dictionary (corresponding to the transform W), the uniqueness conditions (involving the spark condition) proposed in prior work on synthesis dictionary-based BCS [30] (Section V-A of [30]) can
be extended to the transform-based setting here. A detailed analysis and description of such
uniqueness results will be presented elsewhere.
2.2.3. Problem Formulations with Sparsity Penalties. While Problem (P1) involves a
sparsity constraint, an alternative version of Problem (P1) is obtained by replacing the ℓ0
sparsity constraint with an ℓ0 penalty in the objective (and in (2.3)), in which case we have
the following optimization problem
(P3)  $\min_{x,W,B}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 + \lambda\, Q(W) + \eta^2 \|B\|_0$  s.t.  $\|x\|_2 \leq C$,

where $\eta^2$, with $\eta > 0$, denotes the weight for the sparsity penalty.
A version of Problem (P3) (without the $\ell_2$ constraint) has been used very recently in adaptive tomographic reconstruction [54, 55]. However, it is interesting to note that in the absence of the $\|x\|_2 \leq C$ condition, the objective in (P3) is actually non-coercive. To see this, consider W = I and $x_\beta = x_0 + \beta z$, where $x_0$ is a particular solution to y = Ax, $\beta \in \mathbb{R}$, and $z \in N(A)$, with N(A) denoting the null space of A. In this setting, as $\beta \to \infty$ with the jth sparse code in (P3) set to $W P_j x_\beta$, the objective in (P3) remains finite, thereby making it non-coercive. The energy constraint on x in (P3) restricts the set of feasible
images to a compact set, and alleviates potential problems (such as unbounded iterates within a minimization algorithm) due to the non-coercive objective.
Footnote 2: In the remainder of this work, when certain indexed variables are enclosed within braces, we mean the set of those variables over the range of all the indices.
While a single weight $\eta^2$ is used for the sparsity penalty $\|B\|_0 = \sum_{j=1}^{N} \|b_j\|_0$ in (P3), one could also use different weights $\eta_j^2$ for the sparsity penalties $\|b_j\|_0$ corresponding to different patches, if such weights are known or estimated.
Just as Problem (P3) is an alternative to Problem (P1), we can also obtain a corresponding
alternative version (denoted as (P4)) of Problem (P2) by replacing the sparsity constraint with
a penalty. Although in the rest of this paper we consider Problems (P1)-(P3), the proposed algorithms and convergence results in this work easily extend to the case of (P4).
2.3. Extensions. While the proposed sparsifying transform-based BCS problem formu-
lations are for the (extreme) scenario when the CS measurements corresponding to a single
image are provided, these formulations can be easily extended to other scenarios too. For
example, when multiple images (or frames, or slices) have to be jointly reconstructed using a
single adaptive (spatial) sparsifying transform, then the objectives in Problems (P1)-(P3) for
this case are the summation of the corresponding objective functions for each image. In appli-
cations such as dynamic MRI (or for example, compressive video), the proposed formulations
can be extended by considering adaptive spatiotemporal sparsifying transforms of 3D patches
(cf. [80] that extends Problem (P0) in such a way to compressed sensing dynamic MRI).
Similar extensions are also possible for higher-dimensional applications such as 4D imaging.
3. Algorithm and Properties.
3.1. Algorithm. Here, we propose block coordinate descent-type algorithms to solve the
proposed transform-based BCS problem formulations (P1)-(P3). Our algorithms alternate
between solving for the sparse codes {bj } (sparse coding step), transform W (transform update
step), and image x (image update step), with the other variables kept fixed. One could also
alternate a few times between the sparse coding and transform update steps, before performing
one image update step. In the following, we describe the three main steps in detail. We
show that each of the steps has a simple solution that can be computed cheaply in practical
applications such as MRI.
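To make the overall structure concrete, the following Matlab-style sketch outlines the outer loop of the proposed block coordinate descent scheme. It is only a schematic: the helper routines (image_to_patches, transform_update, sparse_code, image_update) are hypothetical placeholders for the steps detailed in Sections 3.1.1-3.1.4, and the parameter names are assumptions of this sketch.

function [x, W, B] = transform_bcs_sketch(y, A, params)
% Schematic block coordinate descent for the proposed BCS formulations.
x = params.x0;  W = params.W0;  B = params.B0;     % initial estimates
for t = 1:params.T                                 % outer iterations
    X = image_to_patches(x, params);               % patches P_j x as columns of X
    for l = 1:params.M_hat                         % inner alternations
        W = transform_update(X, B, params);        % Section 3.1.2
        B = sparse_code(W * X, params);            % Section 3.1.1
    end
    x = image_update(y, A, W, B, params);          % Sections 3.1.3 and 3.1.4
end
end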
3.1.1. Sparse Coding Step. The sparse coding step of our algorithms for Problems (P1)
and (P2) involves the following optimization problem
(3.1)  $\min_{B}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2$  s.t.  $\|B\|_0 \leq s$.
Now, let $Z \in \mathbb{C}^{n \times N}$ be the matrix with the transformed (vectorized) patches $W P_j x$ as its columns. Then, using this notation, Problem (3.1) can be rewritten as follows, where $\|\cdot\|_F$ denotes the standard Frobenius norm:

(3.2)  $\min_{B}\; \|Z - B\|_F^2$  s.t.  $\|B\|_0 \leq s$.

The above problem is to project Z onto the non-convex set $\{B \in \mathbb{C}^{n \times N} : \|B\|_0 \leq s\}$ of matrices that have sparsity $\leq s$, which we call the s-$\ell_0$ ball. The optimal projection $\hat{B}$ is easily computed
by zeroing out all but the s coefficients of largest magnitude in Z. We denote this operation by $\hat{B} = H_s(Z)$, where $H_s(\cdot)$ is the corresponding projection operator. In case there is more than one choice for the s elements of largest magnitude in Z, $H_s(Z)$ is chosen as the projection for which the indices of these s elements are the lowest possible in lexicographical order.
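A minimal Matlab sketch of the projection $H_s(\cdot)$ is given below; the tie-breaking here simply follows Matlab's stable sort, which need not coincide exactly with the lexicographical rule described above.

function Bhat = project_s_l0(Z, s)
% Project Z onto the s-l0 ball: retain the s entries of largest magnitude.
[~, idx] = sort(abs(Z(:)), 'descend');   % stable sort of magnitudes
Bhat = zeros(size(Z));
keep = idx(1:min(s, numel(Z)));
Bhat(keep) = Z(keep);                    % complex values are preserved
end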
In the case of Problem (P3), the sparse coding step involves the following unconstrained (and non-convex) optimization problem:

(3.3)  $\min_{B}\; \|Z - B\|_F^2 + \eta^2 \|B\|_0$.

The optimal solution $\hat{B}$ in this case is obtained as $\hat{B} = \hat{H}_{\eta}(Z)$, with the hard-thresholding operator $\hat{H}_{\eta}(\cdot)$ defined as follows, where the subscript ij indexes matrix entries (i for row and j for column):

(3.4)  $\left(\hat{H}_{\eta}(Z)\right)_{ij} = \begin{cases} 0, & |Z_{ij}| < \eta \\ Z_{ij}, & |Z_{ij}| \geq \eta \end{cases}$

The optimal solution to Problem (3.3) is not unique when the condition $|Z_{ij}| = \eta$ is satisfied for some i, j (cf. page 3 of [73] for a similar scenario and an explanation). The definition in (3.4) chooses one of the multiple optimal solutions in this case.
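The hard-thresholding rule (3.4) is a simple elementwise operation; a Matlab sketch:

function Bhat = hard_threshold(Z, eta)
% Hard thresholding as in (3.4): zero out entries with |Z_ij| < eta.
Bhat = Z .* (abs(Z) >= eta);
end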
3.1.2. Transform Update Step. Here, we solve for W in the proposed formulations, with
the other variables kept fixed. In the case of Problems (P1) and (P3), this involves the
following optimization problem
(3.5)  $\min_{W}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + 0.5\lambda \|W\|_F^2 - \lambda \log|\det W|$
Now, let $X \in \mathbb{C}^{n \times N}$ be the matrix with the vectorized patches $P_j x$ as its columns, and recall that B is the matrix of codes $b_j$. Then, Problem (3.5) becomes

(3.6)  $\min_{W}\; \|W X - B\|_F^2 + 0.5\lambda \|W\|_F^2 - \lambda \log|\det W|$.

An analytical solution for this problem has been recently derived [67, 73], and is stated in the following proposition. It is expressed in terms of an appropriate singular value decomposition (SVD). We let $M^{1/2}$ denote the positive definite square root of a positive definite matrix M.
Proposition 3.1. Given $X \in \mathbb{C}^{n \times N}$, $B \in \mathbb{C}^{n \times N}$, and $\lambda > 0$, factorize $X X^H + 0.5\lambda I$ as $L L^H$, with $L \in \mathbb{C}^{n \times n}$. Further, let $L^{-1} X B^H$ have a full SVD of $V \Sigma R^H$. Then, a global minimizer for the transform update step (3.6) is

(3.7)  $\hat{W} = 0.5 R \left(\Sigma + \left(\Sigma^2 + 2\lambda I\right)^{1/2}\right) V^H L^{-1}$
Proof. See the proof of Proposition 1 of [73], particularly the discussion following that proof.
The factor L in Proposition 3.1 can, for example, be the Cholesky factor in $X X^H + 0.5\lambda I = L L^H$, or the square root obtained from the eigenvalue decomposition (EVD) of $X X^H + 0.5\lambda I$. The closed-form solution (3.7) is nevertheless invariant to the particular choice
of L [73]. Although in practice both the SVD and the square root of non-negative scalars are
computed using iterative methods, we will assume in the convergence analysis in this work,
that the solution (3.7) (as well as later ones that involve such computations) is computed
exactly. In practice, standard numerical methods are guaranteed to quickly provide machine
precision accuracy for the SVD or other (aforementioned) computations.
In the case of Problem (P2), the transform update step involves the following problem:

(3.8)  $\min_{W}\; \|W X - B\|_F^2$  s.t.  $W^H W = I$.

The solution to the above problem can be expressed as follows (see [67], or Proposition 2 of [73]).
Proposition 3.2. Given $X \in \mathbb{C}^{n \times N}$ and $B \in \mathbb{C}^{n \times N}$, let $X B^H$ have a full SVD of $U \Sigma V^H$. Then, a global minimizer in (3.8) is

(3.9)  $\hat{W} = V U^H$
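Both transform updates amount to a few dense operations on small matrices. The following Matlab sketch of Propositions 3.1 and 3.2 uses the Cholesky factor for L (any valid square root gives the same result, as noted above); the function signature is an assumption of this sketch.

function W = transform_update(X, B, lambda, unitary)
% Closed-form transform update step (sketch of Propositions 3.1 and 3.2).
n = size(X, 1);
if unitary                                       % Problem (P2): W^H W = I
    [U, ~, V] = svd(X * B');                     % X B^H = U*Sig*V^H (full SVD)
    W = V * U';                                  % (3.9)
else                                             % Problems (P1) and (P3)
    L = chol(X * X' + 0.5 * lambda * eye(n), 'lower');   % X X^H + 0.5*lambda*I = L L^H
    [V, Sig, R] = svd(L \ (X * B'));             % L^{-1} X B^H = V*Sig*R^H
    W = 0.5 * R * (Sig + sqrtm(Sig * Sig + 2 * lambda * eye(n))) * V' / L;   % (3.7)
end
end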
3.1.3. Image Update Step. Here, we solve for the image x in the proposed formulations, with the other variables kept fixed. This involves the following optimization problem:

(3.10)  $\min_{x}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2$  s.t.  $\|x\|_2 \leq C$.

Problem (3.10) is a least squares problem with an $\ell_2$ (alternatively, squared $\ell_2$) constraint [31]. It can be solved exactly by using the Lagrange multiplier method [31]. The corresponding Lagrangian formulation is

(3.11)  $\min_{x}\; \sum_{j=1}^{N} \|W P_j x - b_j\|_2^2 + \nu \|Ax - y\|_2^2 + \mu \left(\|x\|_2^2 - C^2\right)$
where $\mu \geq 0$ is the Lagrange multiplier. The solution to (3.11) satisfies the following normal equation:

(3.12)  $\left(G + \nu A^H A + \mu I\right) x = \sum_{j=1}^{N} P_j^T W^H b_j + \nu A^H y$

where

(3.13)  $G \triangleq \sum_{j=1}^{N} P_j^T W^H W P_j$
and $(\cdot)^T$ (matrix transpose) is used instead of $(\cdot)^H$ above for real matrices. The solution to (3.12) is unique for any $\mu \geq 0$ because the matrix G is positive-definite. To see why, consider any $z \in \mathbb{C}^p$. Then, we have $z^H G z = \sum_{j=1}^{N} \|W P_j z\|_2^2$, which is strictly positive unless $W P_j z = 0\ \forall j$. Since the W in our algorithm is ensured to be invertible, we have that $W P_j z = 0\ \forall j$ if and only if $P_j z = 0\ \forall j$, which implies (assuming that the set of patches in our formulations covers all pixels in the image) that z = 0. This ensures $G \succ 0$. The unique solution to (3.12) can be found by direct methods (for small-sized problems), or by conjugate gradients (CG).
To solve the original Problem (3.10), the Lagrange multiplier µ in (3.12) must also be chosen optimally. This is done by first computing the EVD of the $p \times p$ matrix $G + \nu A^H A$ as $U \Sigma U^H$. Since $G \succ 0$, we have that $\Sigma \succ 0$. Then, defining $z \triangleq U^H \left(\sum_{j=1}^{N} P_j^T W^H b_j + \nu A^H y\right)$, we have that (3.12) implies $U^H x = (\Sigma + \mu I)^{-1} z$. Therefore,

(3.14)  $\|x\|_2^2 = \left\|U^H x\right\|_2^2 = \sum_{i=1}^{p} \frac{|z_i|^2}{(\Sigma_{ii} + \mu)^2} \triangleq \tilde{f}(\mu)$
where $z_i$ denotes the ith entry of the vector z. If $\tilde{f}(0) \leq C^2$, then $\hat{\mu} = 0$ is the optimal multiplier. Otherwise, the optimal $\hat{\mu} > 0$. In the latter case, since the function $\tilde{f}(\mu)$ in (3.14) is monotone decreasing for $\mu > 0$, and $\tilde{f}(0) > C^2$, there is a unique multiplier $\hat{\mu} > 0$ such that $\tilde{f}(\hat{\mu}) - C^2 = 0$. The optimal $\hat{\mu}$ is found by using the classical Newton's method (or, alternatively, [25]), which in our case is guaranteed to converge to the optimal solution at a quadratic rate. Once the optimal $\hat{\mu}$ is found (to within machine precision), the unique solution to the image update Problem (3.10) is $U(\Sigma + \hat{\mu} I)^{-1} z$, coinciding with the solution to (3.12) with $\mu = \hat{\mu}$.
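For illustration, a Matlab sketch of this EVD-based update is given below. It assumes that U, the eigenvalues sig (a positive column vector, the diagonal of Σ), and the vector z have already been computed as described above; the tolerance and iteration cap are arbitrary choices of this sketch.

% EVD-based image update with the optimal Lagrange multiplier (sketch).
f  = @(mu) sum(abs(z).^2 ./ (sig + mu).^2);      % f~(mu) as in (3.14)
mu = 0;
if f(0) > C^2                                    % otherwise mu = 0 is optimal
    for it = 1:50                                % Newton iterations on f~(mu) - C^2
        fprime = -2 * sum(abs(z).^2 ./ (sig + mu).^3);
        mu = mu - (f(mu) - C^2) / fprime;
        if abs(f(mu) - C^2) < 1e-10 * C^2, break; end
    end
end
x = U * (z ./ (sig + mu));                       % x = U (Sigma + mu*I)^{-1} z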
In practice, when a large value (or loose estimate) of C is used (for example, in our experiments in Section 5), the optimal solution to (3.10) is typically obtained with the setting $\hat{\mu} = 0$ for the multiplier in (3.11). In this case, the unique minimizer of the objective in (3.10) (e.g., obtained with CG) directly satisfies the constraint. Therefore, the additional computations (e.g., EVD) to find the optimal $\hat{\mu}$ can be avoided in this case. Other alternative ways to find the solution to (3.10) when the optimal $\hat{\mu} \neq 0$, without the EVD computation, include using the projected gradient method, or solving (3.11) repeatedly (by CG) for various µ (tuned in steps) until the $\|x\|_2 = C$ condition is satisfied.
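A minimal sketch of one such CG-based solve of (3.12) for a fixed µ is shown below. Here, patches_adj (accumulating $P_j^T v_j$ over all patches) and apply_G_patches (applying G) are hypothetical patch-based routines operating on vectorized images, and A is assumed to be available as an explicit matrix; the tolerance and iteration count are arbitrary.

% Solve (G + nu*A^H*A + mu*I) x = rhs by conjugate gradients (sketch).
rhs  = patches_adj(W' * B, imsize) + nu * (A' * y);   % sum_j P_j^T W^H b_j + nu*A^H y
lhs  = @(xv) apply_G_patches(xv, W) + nu * (A' * (A * xv)) + mu * xv;
xhat = pcg(lhs, rhs, 1e-6, 200);                      % Hermitian positive definite system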
When employing CG (or the projected gradient method), the structure of the various matrices in (3.12) can be exploited to enable efficient computations. First, we show that under certain assumptions, the matrix G in (3.12) is a 2D circulant matrix, i.e., a Block Circulant matrix with Circulant Blocks (abbreviated as a BCCB matrix), enabling efficient computation (via FFTs) of the product Gx (used in CG). Second, in certain applications, the matrix $A^H A$ in (3.12) may have a structure (e.g., sparsity, Toeplitz structure, etc.) that enables efficient computation of $A^H A x$ (used in CG). In such cases, the CG iterations can be performed efficiently.
We now show the matrix G in (3.12) is a BCCB matrix. To do this, we make some
assumptions on how the overlapping 2D patches are selected from the image(s) in our formu-
lations. First, we assume that periodically positioned, overlapping 2D image patches are used.
Furthermore, the patches that overlap the image boundaries are assumed to ‘wrap around’
on the opposite side of the image [64]. Now, defining the patch overlap stride r to be the
distance in pixels between corresponding pixel locations in adjacent image patches, it is clear
that the setting r = 1 results in a maximal set of overlapping patches. When r = 1 (and
assuming patch ‘wrap around’), the following proposition establishes that the matrix G is a
Block Circulant matrix with Circulant Blocks. Let $F \in \mathbb{C}^{p \times p}$ denote the full (2D) Fourier encoding matrix, assumed normalized such that $F^H F = I$.
Proposition 3.3. Let r = 1, and assume that all 'wrap around' image patches are included. Then, the matrix $G = \sum_{j=1}^{N} P_j^T W^H W P_j$ in (3.12) is a BCCB matrix with eigenvalue decomposition $F^H \Gamma F$, with $\Gamma \succ 0$.
Proof. First, note that $W^H W = \sum_{i=1}^{n} e_i e_i^T W^H W$, where $\{e_i\}_{i=1}^{n}$ are the columns of the $n \times n$ identity matrix. Denote the ith row of $W^H W$ by $h_i$. Then, the matrix $e_i e_i^T W^H W$ is all zero except for its ith row, which is equal to $h_i$. Let $G_i \triangleq \sum_{j=1}^{N} P_j^T e_i e_i^T W^H W P_j$. Then, $G = \sum_{i=1}^{n} G_i$.
Consider a vectorized image z ∈ Cp . Because the set of entries of Gi z is simply the set of
inner products between hi and all the (overlapping and wrap around) patches of the image
corresponding to z, it follows that applying Gi on 2D circularly shifted versions of the image
corresponding to z produces correspondingly shifted versions of the output Gi z. Hence, Gi is
an operator corresponding to 2D circular convolution.
Now, it follows that each $G_i$ is a BCCB matrix. Since $G = \sum_{i=1}^{n} G_i$ is a sum of BCCB matrices, it is therefore a BCCB matrix as well. Now, from standard results regarding BCCB matrices, we know that G has an EVD of $F^H \Gamma F$, where Γ is a diagonal matrix of eigenvalues. It was previously established that G is positive-definite. Therefore, $\Gamma \succ 0$ holds.
Proposition 3.3 states that the matrix G in (3.12) is a BCCB matrix that is diagonalized by the Fourier basis. Therefore, G can be applied (via FFTs) to a vector (e.g., in CG) at $O(p \log p)$ cost. This is much lower than the $O(n^2 p)$ cost ($n^2 \gg \log p$ typically) of applying $G = \sum_{j=1}^{N} P_j^T W^H W P_j$ directly using patch-based operations. We now show how this diagonal (BCCB) structure can be exploited to obtain a particularly efficient image update in the case of MRI.
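In Matlab terms, the eigenvalues of the BCCB matrix G can be obtained from one 2D FFT of its first column (computed by applying G to a unit impulse), after which Gx costs only FFTs; cf. step 4(b) of the algorithm pseudocode later in this section. The sketch below assumes the hypothetical routine apply_G_patches from the previous sketch, column-major vectorization consistent with reshape, and a DFT ordering with the DC coefficient at index (1,1), matching fft2.

% Eigenvalues of the BCCB matrix G, and fast application of G (sketch).
e1  = zeros(p, 1);  e1(1) = 1;                      % impulse at the first pixel (vectorized)
a1  = reshape(apply_G_patches(e1, W), ny, nx);      % first column of G, as an ny x nx image
gam = real(fft2(a1));                               % eigenvalues of G (real and positive)
Gx  = ifft2(gam .* fft2(ximg));                     % apply G to an image via FFTs, O(p log p)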
3.1.4. Image Update Step: Case of MRI. In certain scenarios, the optimal x̂ in (3.10) can be found very efficiently. Here, we consider the case of Fourier imaging modalities, or more specifically MRI, where $A = F_u$, the undersampled Fourier encoding matrix. In order to obtain an efficient solution for x̂ in (3.10), we assume that the k-space measurements in MRI are obtained by subsampling on a uniform Cartesian grid. We then show that the optimal multiplier $\hat{\mu}$ and the corresponding optimal x̂ can be computed without any EVD computations. Applying the normalized full Fourier encoding matrix F to both sides of (3.12) (with $A = F_u$ and $F^H F = I$), we obtain

(3.15)  $\left(F G F^H + \nu F F_u^H F_u F^H + \mu I\right) F x = F \sum_{j=1}^{N} P_j^T W^H b_j + \nu F F_u^H y$
All p-dimensional vectors (vectorized images) in (3.15) are in Fourier space, or k-space. The vector $F F_u^H y \in \mathbb{C}^p$ represents the zero-filled (or zero padded) k-space measurements. The matrix $F F_u^H F_u F^H$ is a diagonal matrix of ones and zeros, with the ones at those diagonal entries that correspond to sampled locations in k-space. By Proposition 3.3, for r = 1, the matrix $F G F^H = \Gamma$ is diagonal. Therefore, the matrix pre-multiplying Fx in (3.15) is diagonal and invertible. Denoting the diagonal of Γ by $\gamma \in \mathbb{R}^p$ (an all-positive vector), $S_0 \triangleq F F_u^H y$, and $S \triangleq F \sum_{j=1}^{N} P_j^T W^H b_j$, we have that the solution to (3.15) for fixed µ is

(3.16)  $F x_{\mu}(k_x, k_y) = \begin{cases} \dfrac{S(k_x, k_y)}{\gamma(k_x, k_y) + \mu}, & (k_x, k_y) \notin \Omega \\[2mm] \dfrac{S(k_x, k_y) + \nu S_0(k_x, k_y)}{\gamma(k_x, k_y) + \nu + \mu}, & (k_x, k_y) \in \Omega \end{cases}$

where $(k_x, k_y)$ indexes k-space locations, and Ω represents the subset of k-space that has been sampled. Equation (3.16) provides a closed-form solution to the Lagrangian Problem (3.11) for CS MRI, with $F x_{\mu}(k_x, k_y)$ representing the optimal updated value (for a particular µ) in k-space at location $(k_x, k_y)$.
The function $\tilde{f}(\mu)$ in (3.14) now has a simple form (no EVD needed). Since $\|x\|_2 = \|Fx\|_2$ for the normalized F, (3.16) yields

(3.17)  $\tilde{f}(\mu) = \sum_{(k_x,k_y) \notin \Omega} \frac{|S(k_x,k_y)|^2}{\left(\gamma(k_x,k_y) + \mu\right)^2} + \sum_{(k_x,k_y) \in \Omega} \frac{|S(k_x,k_y) + \nu S_0(k_x,k_y)|^2}{\left(\gamma(k_x,k_y) + \nu + \mu\right)^2}$.

We first check whether $\tilde{f}(0) \leq C^2$, before applying Newton's method to solve $\tilde{f}(\hat{\mu}) = C^2$. The optimal x̂ in (3.10) is obtained via a 2D inverse FFT of the updated $F x_{\hat{\mu}}$ in (3.16).
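The complete MRI image update thus needs only a handful of FFTs; a Matlab sketch follows. Its assumptions: r = 1 with patch wrap-around; column-major vectorization and DC at index (1,1), matching fft2; S0 (the normalized zero-filled k-space $F F_u^H y$), the eigenvalue image gamma of G (see the FFT sketch above), and the logical sampling mask of Ω are precomputed; patches_adj is the hypothetical routine from the earlier sketches; the Newton tolerance and iteration cap are arbitrary.

function x = image_update_mri(W, B, S0, gamma, mask, nu, C, sz)
% Closed-form MRI image update via (3.15)-(3.17) (sketch).
p   = prod(sz);
c   = reshape(patches_adj(W' * B, sz), sz);       % c = sum_j P_j^T W^H b_j (as an image)
S   = fft2(c) / sqrt(p);                          % S = F c (normalized 2D DFT)
num = S + nu * (S0 .* mask);                      % numerators of (3.16)
den = gamma + nu * mask;                          % denominators of (3.16) at mu = 0
f   = @(mu) norm(num ./ (den + mu), 'fro')^2;     % f~(mu) as in (3.17)
mu  = 0;
if f(0) > C^2                                     % constraint active: Newton's method
    for it = 1:50
        fprime = -2 * sum(sum(abs(num).^2 ./ (den + mu).^3));
        mu = mu - (f(mu) - C^2) / fprime;
        if abs(f(mu) - C^2) < 1e-10 * C^2, break; end
    end
end
x = sqrt(p) * ifft2(num ./ (den + mu));           % x = F^H (F x_mu), per (3.16)
end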
The pseudocodes of the overall Algorithms A1, A2, and A3, corresponding to the BCS Problems (P1), (P2), and (P3) respectively, are shown below. An appropriate choice for the initial $(W^0, B^0, x^0)$ in Algorithms A1-A3 would depend on the specific application. For example, $W^0$ could be the $n \times n$ 2D DCT matrix, $x^0 = A^{\dagger} y$ (normalized so that it satisfies $\|x^0\|_2 \leq C$), and $B^0$ can be the minimizer of Problems (P1)-(P3) for these fixed $W^0$ and $x^0$.
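A Matlab sketch of one such initialization is given below. The 2D DCT is built explicitly to avoid toolbox dependencies; zero_fill_recon is a hypothetical helper (e.g., returning the zero-filled reconstruction $F_u^H y$ for MRI), X0 denotes the patch matrix of x0, and project_s_l0 is the routine sketched earlier.

% Sketch of one initialization for Algorithms A1-A3.
m  = sqrt(n);                                      % patch side (e.g., m = 6 for n = 36)
k  = (0:m-1)';  j = 0:m-1;
D1 = sqrt(2/m) * cos(pi * k * (j + 0.5) / m);      % 1D DCT-II basis
D1(1, :) = D1(1, :) / sqrt(2);                     % orthonormal scaling of the DC row
W0 = kron(D1, D1);                                 % n x n 2D DCT sparsifying transform
x0 = zero_fill_recon(y, A);                        % hypothetical: e.g., Fu'*y for MRI
if norm(x0) > C, x0 = x0 * (C / norm(x0)); end     % enforce ||x0||_2 <= C
B0 = project_s_l0(W0 * X0, s);                     % minimizer of (3.1) for fixed W0, x0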
3.2. Computational Cost. Algorithms A1, A2, and A3 involve the steps of sparse coding,
transform update, and image update. We now briefly discuss the computational costs of each
of these steps.
First, in each outer iteration of our Algorithms A1 and A3, the computation of the matrix $X X^H + 0.5\lambda I$, where X has the image patches as its columns, can be done in $O(n^2 N)$ operations, where typically $n \ll N$. The computation of the inverse square root $L^{-1}$ requires only $O(n^3)$ operations.
Pseudocode (Algorithms A1-A3; see footnote 3 for notation):
2) For $l = 1 : \hat{M}$, repeat
   (a) Transform Update Step:
       i. Set $V \Sigma R^H$ as the full SVD of $L^{-1} X (\tilde{B}^{l-1})^H$ for Algorithms A1 and A3, or the full SVD of $X (\tilde{B}^{l-1})^H$ for Algorithm A2.
       ii. $\tilde{W}^l = 0.5 R \left(\Sigma + \left(\Sigma^2 + 2\lambda I\right)^{1/2}\right) V^H L^{-1}$ for Algorithms A1 and A3, or $\tilde{W}^l = R V^H$ for Algorithm A2.
   (b) Sparse Coding Step: $\tilde{B}^l = H_s\left(\tilde{W}^l X\right)$ for Algorithms A1 and A2, or $\tilde{B}^l = \hat{H}_{\eta}\left(\tilde{W}^l X\right)$ for Algorithm A3.
   End
3) Set $W^t = \tilde{W}^{\hat{M}}$ and $B^t = \tilde{B}^{\hat{M}}$. Set $b_j^t$ as the jth column of $B^t$ $\forall j$.
4) Image Update Step:
   (a) For a generic CS scheme, solve (3.12) for $x^t$ with µ = 0 by linear CG. If $\|x^t\|_2 > C$, then
       i. Compute $U \Sigma U^H$ as the EVD of $\sum_{j=1}^{N} P_j^T (W^t)^H W^t P_j + \nu A^H A$.
       ii. Compute $z = U^H \left(\sum_{j=1}^{N} P_j^T (W^t)^H b_j^t + \nu A^H y\right)$.
       iii. Use Newton's method to find $\hat{\mu}$ such that $\tilde{f}(\hat{\mu}) = C^2$ in (3.14).
       iv. Set $x^t = U (\Sigma + \hat{\mu} I)^{-1} z$.
   (b) For MRI, do the following:
       i. Compute the image $c = \sum_{j=1}^{N} P_j^T (W^t)^H b_j^t$. Set $S \leftarrow FFT(c)$.
       ii. Compute $a_1$ as the first column of $\sum_{j=1}^{N} P_j^T (W^t)^H W^t P_j$.
       iii. Set $\gamma \leftarrow \sqrt{p} \times FFT(a_1)$.
       iv. Compute $\tilde{f}(0)$ as per (3.17).
       v. If $\tilde{f}(0) \leq C^2$, set $\hat{\mu} = 0$. Else, use Newton's method to solve $\tilde{f}(\hat{\mu}) = C^2$ for $\hat{\mu}$.
       vi. Update S to be the right-hand side of (3.16) with $\mu = \hat{\mu}$. Set $x^t = IFFT(S)$.
End
Footnote 3: The superscripts t and l denote the main iterates and the iterates in the inner alternations between transform update and sparse coding, respectively. FFT and IFFT denote fast implementations of the normalized 2D DFT and 2D IDFT. For MRI, r = 1, and the encoding matrix F is assumed normalized and arranged so that its first column is the constant DC column. In Step 4(a), although we list the image update method involving an EVD, an alternative scheme mentioned in Section 3.1.3 involves only CG.

The cost of the sparse coding step in our algorithms is dominated by the computation of the
matrix $Z = W X$ in (3.2) (for Algorithms A1 and A2) or (3.3) (for Algorithm A3), and therefore scales as $O(n^2 N)$. Notably, the projection onto the s-$\ell_0$ ball in (3.2) costs only $O(nN \log N)$ operations when employing sorting [70], with $\log N \ll n$ typically. Alternatively, in the case of (3.3), the hard thresholding operation costs only $O(nN)$ comparisons.
The cost of the transform update step of our algorithms is dominated by the computation of the matrix $X B^H$. Since B is sparse, $X B^H$ can be computed with $\alpha n^2 N$ multiply-add operations, where $\alpha < 1$ is the fraction of non-zeros in B.
The cost of the image update step in our algorithms depends on the specific application (i.e., the specific structure of $A^H A$, etc.). Here, we discuss the cost of the image update step described in Section 3.1.4 for the specific case of MRI. We assume r = 1 and that the patches 'wrap around', which implies that N = p (i.e., the number of patches equals the number of image pixels). The computational cost here is dominated by the computation of the term $\sum_{j=1}^{N} P_j^T W^H b_j$ in the normal equation (3.12), which takes $O(n^2 N)$ operations. On the other hand, the various FFT and IFFT operations cost only $O(N \log N)$ operations, where $\log N \ll n^2$ typically. Newton's method to compute the optimal multiplier $\hat{\mu}$ is only used when µ = 0 is non-optimal. In the latter case, Newton's method takes $O(N \tilde{J})$ operations, with $\tilde{J}$ being the number of iterations (typically small, and independent of n) of Newton's method.
Based on the preceding arguments, it is easy to observe that the total cost per (outer) iteration of Algorithms A1-A3 scales (for MRI) as $O(n^2 N \hat{M})$. Now, the recent synthesis dictionary-based BCS method called DLMRI [64] learns a dictionary $D \in \mathbb{C}^{n \times K}$ from CS MRI measurements by solving Problem (P0). For this scheme, the computational cost per outer iteration scales as $O(N K n s \hat{J})$ [70], where $\hat{J}$ is the number of (inner) iterations of dictionary learning (using the K-SVD algorithm [2]), and the other notations are the same as in (P0). Assuming that $K \propto n$ and that the synthesis sparsity $s \propto n$, we have that the cost per iteration of DLMRI scales as $O(n^3 N \hat{J})$. Thus, the per-iteration computational cost of the proposed BCS schemes is much lower (lower in order by a factor of n, assuming $\hat{M} \sim \hat{J}$) than that for synthesis dictionary-based BCS. This gap in computations is amplified for higher-dimensional imaging applications such as 3D or 4D imaging, where the size of the 3D or 4D patches is typically much bigger than in the case of 2D imaging.
As illustrated in our experiments in Section 5, the proposed BCS algorithms converge in a few iterations in practice. Therefore, the per-iteration computational advantages over synthesis dictionary-based BCS also typically translate into a net computational advantage in practice.
Owing to the highly non-convex objectives and constraints involved here, standard results on the convergence of block coordinate descent methods (e.g., [77]) do not apply.
Very recent works on the convergence of block coordinate descent-type algorithms (e.g.,
[82], or the Block Coordinate Variable Metric Forward-Backward algorithm [14]) prove con-
vergence of the iterate sequence (for specific algorithm) to a critical point of the objective.
However, these works make numerous assumptions, some of which can easily be shown to be violated for the proposed formulations (for example, the term $\sum_{j=1}^{N} \|W P_j x - b_j\|_2^2$ in the objectives of our formulations, although differentiable, violates the L-Lipschitzian gradient property described in Assumption 2.1 of [14]).
In fact, in certain simple scenarios, one can easily derive non-convergent iterate sequences for Algorithms A1-A3. Non-convergence mainly arises for the transform or sparse code sequences (rather than the image sequence), owing to the fact that the optimal solutions in the sparse coding or transform update steps may be non-unique. For example, in the trivial case when y = 0, the image x = 0 is the unique minimizer in the proposed BCS problems. Hence, if $x^0 = 0$, then the iterates in Algorithms A1-A3 trivially satisfy $x^t = 0\ \forall t$. Since the patches of $x^t$ are all zero, any unitary matrix is a (non-unique) optimal sparsifier in the transform update step of the algorithms, no matter how many inner alternations are performed between sparse coding and transform update within each outer iteration of the algorithms (the sparse codes are always 0 in this case). Hence, the $W^t$ sequence can be an arbitrary sequence of unitary matrices, which need not converge.
If $z \notin \mathrm{dom}\,\phi$, then $\hat{\partial}\phi(z) \triangleq \emptyset$. The sub-differential of φ at z is defined as

(4.2)  $\partial\phi(z) \triangleq \left\{ \tilde{h} \in \mathbb{R}^q : \exists\, z^k \to z,\; \phi(z^k) \to \phi(z),\; h^k \in \hat{\partial}\phi(z^k) \text{ with } h^k \to \tilde{h} \right\}$.
It is easy to see that (cf. [73] for a similar statement and justification) the unconstrained
minimization problem involving the objective g(W, B, x) is exactly equivalent to the corre-
sponding constrained formulation (P1), in the sense that the minimum objective values as
well as the set of minimizers of the two formulations are identical. The same result also holds
with respect to (P2) and u(W, B, x), or (P3) and v(W, B, x).
Since the functions g, u, and v accept complex-valued (input) arguments, we will compute
all derivatives or sub-differentials (Definition 4.2) of these functions with respect to the (real-
valued) real and imaginary parts of the variables (W , B, x). Note that the functions g, u,
and v are proper (we set the negative log-determinant penalty to be +∞ wherever det W = 0)
and lower semi-continuous. For Algorithms A1-A3, we denote the iterates (outputs) in each outer iteration t by the set $\{W^t, B^t, x^t\}$.
For a matrix H, we let $\rho_j(H)$ denote the magnitude of the jth largest element (magnitude-wise) of the matrix H. For a matrix E, $\|E\|_\infty \triangleq \max_{i,j} |E_{ij}|$. Finally, Re(A) denotes the real part of a scalar or matrix A.
4.3. Main Results. The following theorem provides the convergence result for Algorithm
A1 that solves Problem (P1). We assume that the initial estimates $(W^0, B^0, x^0)$ satisfy all problem constraints.
Theorem 4.3. Let $\{W^t, B^t, x^t\}$ denote the iterate sequence generated by Algorithm A1 with measurements $y \in \mathbb{C}^m$ and initial $(W^0, B^0, x^0)$. Then, the objective sequence $\{g^t\}$ with $g^t \triangleq g(W^t, B^t, x^t)$ is monotone decreasing, and converges to a finite value, say $g^* = g^*(W^0, B^0, x^0)$. Moreover, the iterate sequence $\{W^t, B^t, x^t\}$ is bounded, and all its accumulation points are equivalent in the sense that they achieve the exact same value $g^*$ of the objective. The sequence $\{a^t\}$, with $a^t \triangleq \|x^t - x^{t-1}\|_2$, converges to zero. Finally, every accumulation point (W, B, x) of the iterate sequence is a critical point of the objective g, and is at least a partial global minimizer of g with respect to each of W, B, and x.
Each accumulation point (W, B, x) also satisfies partial local optimality conditions with respect to each of the pairs (W, B) and (B, x). These conditions hold for all $\Delta\tilde{x} \in \mathbb{C}^p$, all sufficiently small $dW \in \mathbb{C}^{n \times n}$ satisfying $\|dW\|_F \leq \epsilon$ for some $\epsilon > 0$ that depends on the specific W, and all $\Delta B \in \mathbb{C}^{n \times N}$ in the union of the following regions R1 and R2, where $X' \in \mathbb{C}^{n \times N}$ is the matrix with $P_j x$, $1 \leq j \leq N$, as its columns.
R1. The half-space $\mathrm{Re}\left\{\mathrm{tr}\left[(W X' - B)\,\Delta B^H\right]\right\} \leq 0$.
The following corollary to Theorem 4.3 also holds, where ‘globally convergent’ refers to
convergence from any initialization.
Corollary 4.5. Algorithm A1 is globally convergent to a subset of the set of critical points of
the non-convex objective g (W, B, x). The subset includes all critical points (W, B, x), that are
at least partial global minimizers of g(W, B, x) with respect to each of W , B, and x, as well
as partial local minimizers of g(W, B, x) with respect to each of the pairs (W, B), and (B, x).
Theorem 4.3 holds for Algorithm A1 irrespective of the number of inner alternations $\hat{M}$ between transform update and sparse coding within each outer algorithm iteration. In practice, we have observed that a larger value of $\hat{M}$ (particularly in the initial algorithm iterations) may make Algorithm A1 insensitive (for example, in terms of the quality of the reconstructed image) to the initial (even badly chosen) values of $W^0$ and $B^0$.
The convergence results for Algorithms A2 or A3 are quite similar to that for Algorithm
A1. The following two Theorems briefly state the results for Algorithms A3 and A2, respec-
tively.
Theorem 4.6. Theorem 4.3 applies to Algorithm A3 and the corresponding objective v(W, B, x) as well, except that the set of perturbations $\Delta B \in \mathbb{C}^{n \times N}$ in Theorem 4.3 is restricted to $\|\Delta B\|_\infty < \eta/2$ for Algorithm A3.
Theorem 4.7. Theorem 4.3, except for the condition (4.9), applies to Algorithm A2 and the corresponding objective u(W, B, x) as well.
Note that owing to Theorems 4.6 and 4.7, results similar to Corollaries 4.4 and 4.5 also
apply for Algorithms A3 and A2, respectively. The proofs of the stated convergence theorems
are provided in Appendix A.
In general, the subset of the set of critical points to which each algorithm converges may
be larger than the set of global minimizers (Section 2.2.2) in the respective problems. The
question of the conditions under which the proposed algorithms converge to the (perhaps
smaller) set of global minimizers in the proposed problems is open, and its investigation is left
for future work.
5. Numerical Experiments.
5.1. Framework. Here, we study the usefulness of the proposed sparsifying transform-based blind compressed sensing framework for the CS MRI application (see footnote 4). The MR data used in these experiments are 512 × 512 complex-valued images shown (only the magnitude is displayed) in Fig. 1(a) and Fig. 2(a). The image in Fig. 1(a) was kindly provided by Prof. Michael Lustig, UC Berkeley, and is one image slice (with rich features) from a multislice data acquisition. The image in Fig. 2(a) is publicly available (see footnote 5). We simulate various undersampling patterns in k-space (see footnote 6), including variable density 2D random sampling (see footnote 7) [76, 64], and Cartesian sampling with (variable density) random phase encodes (1D random).
Footnote 4: We have also proposed another sparsifying transform-based BCS MRI method recently [71]. However, the latter approach involves many more parameters (e.g., error thresholds to determine patch-wise sparsity levels), which may be hard to tune in practice. In contrast, the methods proposed in this work involve only a few parameters that are relatively easy to set.
Footnote 5: It can be downloaded from http://web.stanford.edu/class/ee369c/data/brain.mat. A phase-shifted version of the image was used in the experiments in [72].
Footnote 6: We simulate the k-space of an image x using the command fftshift(fft2(ifftshift(x))) in Matlab. Fig. 2(b) shows the k-space (only the magnitude is displayed) of the reference in Fig. 2(a).
Footnote 7: This sampling scheme is feasible when data corresponding to multiple image slices are jointly acquired, and the frequency encode direction is chosen perpendicular to the image plane.
We employ Problem (P1) and the corresponding Algorithm A1 to reconstruct images from undersampled measurements in the experiments here (see footnote 8). Our reconstruction method is referred to as Transform Learning MRI (TLMRI).
First, we illustrate the empirical convergence behavior of TLMRI. We also compare the
reconstructions provided by the TLMRI method to those provided by the following schemes:
1) the Sparse MRI method of Lustig et al. [38], which utilizes Wavelets and Total Variation
as fixed sparsifying transforms; 2) the DLMRI method [64] that learns adaptive overcomplete
sparsifying dictionaries; 3) the PANO method [63] that exploits the non-local similarities
between image patches (similar to [15]), and employs a 3D transform to sparsify groups
of similar patches; and 4) the PBDWS method [51]. The PBDWS method is a very recent
partially adaptive sparsifying transform based compressed sensing reconstruction method that
uses redundant Wavelets and trained patch-based geometric directions. It has been shown to
perform better than the earlier PBDW method [62].
We simulated the performance of the Sparse MRI, PBDWS, and PANO methods using
the software implementations available from the respective authors’ websites [37, 61, 59].
We used the built-in parameter settings in those implementations, which performed well in
our experiments. Specifically, for the PBDWS method, the shift invariant discrete Wavelet
transform (SIDWT) based reconstructed image is used as the guide (initial) image [51, 61].
We employed the zero-filling reconstruction (produced within the PANO demo code [59]) as
the initial guide image for the PANO method [63, 59].
The implementation of the DLMRI algorithm that solves Problem (P0) is also available online [68]. For DLMRI, image patches of size 6 × 6 (n = 36) are used, as suggested in [64] (see footnote 9), and a four fold overcomplete synthesis dictionary (K = 144) is learnt using 25 iterations of the algorithm. A patch overlap stride of r = 1 is used, and 14400 randomly selected patches (found empirically; see footnote 10) are used during the dictionary learning step (executed for 20 iterations) of the DLMRI algorithm. Mean-subtraction is not performed for the patches prior to the dictionary learning step of DLMRI. A maximum sparsity level (of s = 7 per patch) is employed together with an error threshold (for sparse coding) during the dictionary learning step. The ℓ2 error threshold per patch varies linearly from 0.48 to 0.15 over the DLMRI
iterations. These parameter settings (all other parameters are set as per the indications in the DLMRI-Lab toolbox [68]) were observed to work well for the DLMRI algorithm.

Footnote 8: Problem (P3) has been recently shown to be useful for adaptive tomographic reconstruction [54, 55]. The corresponding Algorithm A3 has the advantage that the sparse coding step involves the cheap hard thresholding operation, rather than the more expensive projection onto the s-ℓ0 ball (used in Algorithms A1 and A2). We have observed that Algorithm A3 also works well for MRI. Problems (P1) and (P2) differ in that (P2) enforces unit conditioning of the learnt transform, whereas (P1) also allows for more general condition numbers. Algorithm A2 (for (P2)) involves slightly cheaper computations than Algorithm A1 (for (P1)). In our experiments for MRI, we observed that well-conditioned transforms learnt via (P1) performed slightly better (in terms of image reconstruction quality) than strictly unitary learnt transforms. Therefore, we show results for (P1) in this work. We did not observe any dramatic difference in performance between the proposed methods in our experiments here. A detailed investigation of scenarios and applications where one of (P1), (P2), or (P3) performs best (in terms of reconstruction quality relative to the others) is left for future work on specific applications. In this work, we have emphasized more the properties of these formulations, and the novel convergence guarantees of the corresponding algorithms.
Footnote 9: The reconstruction quality improves slightly with a larger patch size, but with a substantial increase in runtime.
Footnote 10: Using a larger training size (> 14400) during the dictionary learning step of the algorithm provides negligible improvement in the final image reconstruction quality, while leading to increased runtimes.
The parameters for TLMRI (with Algorithm A1) are set to n = 36, r = 1 (with patch wrap around), ν = 3.81, $\hat{M} = 1$, $\lambda_0 = 0.2$, and $C = 10^5$. The sparsity level $s = 0.055 \times nN$ (this corresponds to an average sparsity level per patch of 0.055 × n, or 5.5% sparsity; see footnote 11), where $N = 512^2$, is used in our experiments except in Section 5.2, where $s = 0.045 \times nN$ is used. The initial transform estimate $W^0$ is the (simple) patch-based 2D DCT [70], and the initial image $x^0$ is set to be the standard zero-filling Fourier reconstruction (see footnote 12). The initial sparse codes are the solution to (3.1) for the given $(W^0, x^0)$. Our TLMRI implementation was coded in Matlab
version R2013a. Note that this implementation has not been optimized for efficiency. A link
to the Matlab implementation is provided at http://www.ifp.illinois.edu/~yoram/. All
simulations in this work were executed in Matlab. All computations were performed with an
Intel Core i5 CPU at 2.5GHz and 4GB memory, employing a 64-bit Windows 7 operating
system.
Similar to prior work [64], we quantify the quality of MR image reconstruction using
the peak-signal-to-noise ratio (PSNR), and high frequency error norm (HFEN) metrics. The
PSNR (expressed in decibels (dB)) is computed as the ratio of the peak intensity value of
some reference image to the root mean square reconstruction error (computed between image
magnitudes) relative to the reference. In MRI, the reference image is typically the image
reconstructed from fully sampled k-space data. The HFEN metric quantifies the quality of
reconstruction of edges or finer features. A rotationally symmetric Laplacian of Gaussian
(LoG) filter is used, whose kernel is of size 15 × 15 pixels, and with a standard deviation of
1.5 pixels [64]. HFEN is computed as the ℓ2 norm of the difference between the LoG filtered
reconstructed and reference magnitude images.
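For reference, a Matlab sketch of these two metrics is shown below (it assumes the Image Processing Toolbox for fspecial and imfilter; recon and ref are the complex-valued reconstructed and reference images).

function [psnr_db, hfen] = recon_metrics(recon, ref)
% PSNR: peak of the reference over the RMSE between image magnitudes (in dB).
err     = abs(recon) - abs(ref);
rmse    = sqrt(mean(err(:).^2));
psnr_db = 20 * log10(max(abs(ref(:))) / rmse);
% HFEN: l2 norm of the difference of LoG-filtered magnitude images
% (15 x 15 rotationally symmetric Laplacian of Gaussian, sigma = 1.5).
h    = fspecial('log', 15, 1.5);
hfen = norm(imfilter(abs(recon), h) - imfilter(abs(ref), h), 'fro');
end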
5.2. Convergence and Learning Behavior. In this experiment, we consider the reference image in Fig. 1(a). We perform five fold undersampling of the k-space of the (peak normalized; see footnote 13) reference. The (variable density [64]) sampling mask is shown in Fig. 1(b). When the TLMRI algorithm is executed using the undersampled data, the objective function converges monotonically and quickly over the iterations, as shown in Fig. 1(e). The changes between successive iterates, $\|x^t - x^{t-1}\|_2$ (Fig. 1(g)), converge towards 0. Such convergence was established by Theorem 4.3, and is indicative (a necessary but not sufficient condition) of convergence of the entire sequence $\{x^t\}$. As far as the performance metrics are concerned, the PSNR metric (Fig. 1(f)) increases over the iterations, and the HFEN metric decreases,
indicating improving reconstruction quality over the algorithm iterations. These metrics also converge quickly.

Footnote 11: The sparsity level s is a regularization parameter in our framework that provides a trade-off between how much aliasing is removed over the algorithm iterations, and how much image information is kept or restored (i.e., not eliminated by the sparsity condition). We determined the sparsity level empirically in the experiments in this work.
Footnote 12: While we used the naive zero-filling Fourier reconstruction in our experiments here for simplicity, one could also use other, better initializations for x, such as the SIDWT based reconstructed image [51], or the reconstructions produced by recent methods (e.g., PBDWS). We have observed empirically that better initializations may lead to faster convergence of TLMRI, and TLMRI typically only improves the image quality compared to the initializations (assuming properly chosen sparsity levels).
Footnote 13: In practice, the data or k-space measurements can always be normalized prior to processing them for image reconstruction. Otherwise, the parameter settings for the algorithms may need to be modified to account for data scaling.
Figure 1: Convergence of TLMRI with 5x undersampling: (a) reference image; (b) sampling mask in k-space; (c) initial zero-filling reconstruction (26.93 dB); (d) TLMRI reconstruction (30.54 dB); (e) objective function (since the regularizer $Q(W) \geq n/2$ [70], we have subtracted out the constant offset of $n\lambda/2$ from the objective values here); (f) PSNR and HFEN; (g) changes between successive iterates ($\|x^t - x^{t-1}\|_2$); and (h) real (top) and imaginary (bottom) parts of the learnt W, with the matrix rows shown as patches.
The initial zero-filling reconstruction (Fig. 1(c)) shows aliasing artifacts that are typical
in the undersampled measurement scenario, and has a PSNR of only 26.93 dB. On the other
hand, the final TLMRI reconstruction (Fig. 1(d)) is much enhanced (by 3.6 dB), with a
PSNR of 30.54 dB. Since Algorithm A1 is guaranteed to converge to the set of critical points
of Problem (P1), the result in Fig. 1(d) suggests that, in practice, the set of critical points
may in fact include images that are close to the true image. Note that our identifiability result
(Proposition 2.1) in Section 2.2 ensured global optimality of the underlying image only in a
noiseless (or error-free) scenario. The learnt transform W (κ(W ) = 1.01) for this example is
shown in Fig. 1(h). This is a complex valued transform. Both the real and imaginary parts
of W display texture or frequency like structures, that sparsify the patches of the MR image.
Our algorithm is thus able to learn this structure and reconstruct the image using only the
undersampled measurements.
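As a side note, the visualization in Fig. 1(h) simply reshapes each row of W into a square patch; a small sketch of this (assuming n = 64, i.e., 8×8 patches, matplotlib for display, and a random orthonormal matrix as a stand-in for a learnt W) is given below.

import numpy as np
import matplotlib.pyplot as plt

def show_transform_rows(W, patch_side=8):
    # Display each row of an n x n transform as a patch_side x patch_side patch.
    fig, axes = plt.subplots(patch_side, patch_side, figsize=(6, 6))
    for row, ax in zip(W, axes.flat):
        ax.imshow(row.reshape(patch_side, patch_side), cmap="gray")
        ax.axis("off")
    plt.show()

W_demo = np.linalg.qr(np.random.default_rng(0).standard_normal((64, 64)))[0]
show_transform_rows(W_demo)   # for a complex W, pass np.real(W) and np.imag(W) separately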
Image   Sampling Scheme   Undersampling   Zero-filling   Sparse MRI   PBDWS   PANO    DLMRI   TLMRI
  2     2D Random         4x              25.3           26.13        31.69   32.80   33.01   33.12
  1     2D Random         5x              26.9           27.84        30.27   30.37   30.49   30.56
  2     2D Random         7x              25.3           26.38        31.10   30.92   31.70   31.94
  2     Cartesian         4x              28.9           29.73        31.67   32.24   32.67   32.78
  2     Cartesian         7x              27.9           28.58        31.11   31.08   30.91   31.24
Table 1: PSNRs (in dB) corresponding to the Zero-filling, Sparse MRI [38], PBDWS [51], PANO [63],
DLMRI [64], and TLMRI reconstructions, for various sampling schemes and undersampling
factors. The best PSNR in each case is achieved by TLMRI.
in Section 5.1). We also use a lower sparsity level (< 0.055 × nN ) during the initial several
algorithm iterations, which leads to faster convergence. We consider the complex-valued ref-
erence images in Fig. 1(a) and Fig. 2(a) that are labeled as Image 1 and Image 2 respectively,
and simulate variable density 2D random or Cartesian undersampling [64] of the k-spaces of
these images. Table 1 lists the reconstruction PSNRs corresponding to the zero-filling, Sparse
MRI, PBDWS, PANO, DLMRI, and TLMRI14 reconstructions for various cases.
The TLMRI algorithm is seen to provide the best PSNRs for the various scenarios in
Table 1 (analogous results were typically observed for the HFEN metric, which is not shown
in the table). Significant improvements (up to 7 dB) are observed over the Sparse MRI
method, which uses fixed sparsifying transforms. Moreover, TLMRI provides up to 1.4
dB improvement in PSNR over the recent (partially adaptive) PBDWS method, and up to
1 dB improvement over the recent non-local patch similarity-based PANO method. Finally,
the TLMRI reconstruction quality is somewhat (up to 0.33 dB) better than DLMRI. This is
despite the latter using a 4-fold overcomplete (i.e., larger or richer) dictionary.
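For reference, the PSNR values reported in Table 1 compare reconstructed and reference image magnitudes; a generic computation is sketched below (the peak value assumed here is the peak magnitude of the reference image, and the exact convention used in the paper may differ).

import numpy as np

def psnr(reference, reconstruction):
    # PSNR (in dB) between the magnitudes of two (possibly complex-valued) images.
    ref = np.abs(reference)
    rec = np.abs(reconstruction)
    mse = np.mean((ref - rec) ** 2)
    return 10.0 * np.log10(ref.max() ** 2 / mse)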
Fig. 2 shows the TLMRI reconstruction (Fig. 2(d)) of Image 2 for the case of 2D random
sampling (sampling mask shown in Fig. 2(c)) and seven fold undersampling. The reconstruc-
tion errors (i.e., the magnitude of the difference between the magnitudes of the reconstructed
and reference images) for several schemes are shown in Figs. 2 (e)-(h). The error map for
TLMRI clearly shows the smallest image distortions. Fig. 3 shows the TLMRI (Fig. 3(c))
and DLMRI (Fig. 3(e)) reconstructions and reconstruction error maps (Figs. 3 (d), (f)) for
Image 2 with Cartesian sampling (sampling mask shown in Fig. 3(b)) and seven fold under-
sampling. TLMRI provides a better reconstruction of image edges and better aliasing removal
than DLMRI in this case.
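The error maps above follow the stated definition; a minimal sketch (illustrative only):

import numpy as np

def error_map(reference, reconstruction):
    # Magnitude of the difference between the magnitudes of the reconstructed and reference images.
    return np.abs(np.abs(reconstruction) - np.abs(reference))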
The average run times of the Sparse MRI, PBDWS, PANO, DLMRI, and TLMRI algo-
rithms in Table 1 are 251 seconds, 797 seconds, 400 seconds, 3273 seconds, and 243 seconds,
respectively. The PBDWS run time includes the time taken for computing the initial SIDWT
based reconstruction or guide image [51] in the PBDWS software package [61]. The TLMRI
Footnote 14: We observed that in Table 1, if we sampled vertical (instead of horizontal) lines for the Cartesian
sampling patterns (by transposing the Cartesian sampling masks used in Table 1), the reconstructed images for
TLMRI had PSNRs (32.52 dB and 31.05 dB respectively, for the 4x and 7x undersampling cases) similar to the
horizontal lines sampling case. However, the learnt sparsifying transforms for the two cases had many dissimilar
rows. This is not surprising since prior work on transform learning [73] has empirically shown that even the
patches of a single image may admit multiple equally good sparsifying transforms, that may not be related by
only row permutations or sign changes.
Figure 2: Results for 2D random sampling and 7x undersampling: (a) Reference image; (b)
k-space of reference; (c) sampling mask in k-space; (d) TLMRI reconstruction (31.94 dB);
(e) magnitude of PBDWS [51] reconstruction error; (f) magnitude of PANO [63] reconstruc-
tion error; (g) magnitude of DLMRI [64] reconstruction error; and (h) magnitude of TLMRI
reconstruction error.
algorithm is thus the fastest one in Table 1, and provides a speedup of about 13.5x over
the synthesis dictionary-based DLMRI, and a speedup of about 3.3x and 1.6x over the PB-
DWS and PANO 15 methods, respectively. Note that the speedups for TLMRI over PBDWS
or PANO were obtained by comparing our unoptimized Matlab implementation of TLMRI
against the MEX (or C) implementations of PBDWS and PANO.
While our results show some (preliminary) potential for the proposed sparsifying
transform-based blind compressed sensing framework (for MRI), a much more detailed inves-
tigation will be presented elsewhere. Combining the proposed scheme with the patch-based
directional Wavelets ideas [62, 51], or non-local patch similarity ideas [63, 48], or extending
our framework to learning overcomplete sparsifying transforms (cf. [81]) could potentially
boost transform-based BCS performance further.
6. Conclusions. In this work, we presented a novel sparsifying transform-based frame-
work for blind compressed sensing. Our formulations exploit the (adaptive) transform domain
sparsity of overlapping image patches in 2D, or voxels in 3D. The proposed formulations are
however highly nonconvex. Our block coordinate descent-type algorithms for solving the pro-
posed problems involve efficient update steps. Importantly, our algorithms are guaranteed to
converge to the critical points of the objectives defining the proposed formulations.
Footnote 15: Another faster version of the PANO method is also publicly available [60]. However, we found
that although this version has an average run time of only 40 seconds in Table 1, it also provides worse
reconstruction PSNRs than [59] in Table 1.
Figure 3: Cartesian sampling with 7 fold undersampling: (a) Initial zero-filling reconstruction
(27.9 dB); (b) sampling mask in k-space; (c) TLMRI reconstruction (31.24 dB); (d) magnitude
of TLMRI reconstruction error; (e) DLMRI reconstruction (30.91 dB); and (f) magnitude of
DLMRI reconstruction error.
These critical points are also guaranteed to be at least partial global and partial local minimizers. Our
numerical examples showed the usefulness of the proposed scheme for magnetic resonance im-
age reconstruction from highly undersampled k-space data. Our approach, while being highly
efficient, also provides promising MR image reconstruction quality. The usefulness of the pro-
posed blind compressed sensing methods in other inverse problems and imaging applications
merits further study. A detailed investigation of the theoretical rate of convergence in our
algorithms is also of potential interest.
Appendix. Convergence Proofs.
A.1. Proof of Theorem 4.3. In this proof, we let H̃s(Z) denote the set of all optimal
projections of Z ∈ C^{n×N} onto the s-ℓ0 ball {B ∈ C^{n×N} : ‖B‖₀ ≤ s}, i.e.,

(A.1)    H̃s(Z) ≜ argmin_{B ∈ C^{n×N} : ‖B‖₀ ≤ s} ‖Z − B‖²_F

Consider the iterate sequence {W^t, B^t, x^t} generated by Algorithm A1 with measure-
ments y ∈ C^m and initial (W^0, B^0, x^0). Assume that the initial (W^0, B^0, x^0) is such that
g(W^0, B^0, x^0) is finite. Throughout this proof, we let X^t be the matrix with P_j x^t (1 ≤ j ≤ N)
as its columns. The various results in Theorem 4.3 are now proved in the following order.
Lemma A.1. For Algorithm A1 with input y ∈ C^m and initial (W^0, B^0, x^0), the sequence of
objective function values {g(W^t, B^t, x^t)} is monotone decreasing, and converges to a finite
value g^* = g^*(W^0, B^0, x^0).
Proof. Algorithm A1 first alternates between the transform update and sparse coding
steps (Step 2 in algorithm pseudocode), with fixed image x. In the transform update step,
we obtain a global minimizer (i.e., (3.7)) with respect to W for Problem (3.5). In the sparse
coding step too, we obtain an exact solution for B in Problem (3.2) as B̂ = Hs(Z). Therefore,
the objective function can only decrease when we alternate between the transform update
and sparse coding steps (similar to the case in [73]). Thus, we have g(W^{t+1}, B^{t+1}, x^t) ≤
g(W^t, B^t, x^t). A similar argument applies to the image update step, in which an exact
minimizer of the objective with respect to x is computed (cf. Section 3.1.3); hence
g(W^{t+1}, B^{t+1}, x^{t+1}) ≤ g(W^{t+1}, B^{t+1}, x^t), and the sequence {g(W^t, B^t, x^t)} is monotone
decreasing. Next, every term in the objective (4.3), other than the λQ(W) regularizer, is
trivially non-negative. Furthermore, the Q(W) regularizer is bounded as Q(W) ≥ n/2 (cf. [70]).
Therefore, the objective g(W, B, x) > 0. Since the sequence {g(W^t, B^t, x^t)} is monotone
decreasing and lower bounded, it converges.
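To illustrate the mechanism behind Lemma A.1 (each block update is an exact minimizer of its subproblem, so the objective cannot increase), the following self-contained sketch alternates the sparse coding step B̂ = Hs(WX) and a closed-form transform update of the form (3.7)/(A.18) on a fixed random patch matrix X, and checks that the objective values are monotone decreasing. This is only a schematic demonstration: the problem sizes and λ are arbitrary, and the image update step of Algorithm A1 is omitted here.

import numpy as np
from scipy.linalg import sqrtm, svd

def sparse_code(Z, s):
    # Project Z onto the s-l0 ball: keep the s largest-magnitude entries (an element of H_s(Z)).
    flat = Z.reshape(-1)
    keep = np.argpartition(np.abs(flat), -s)[-s:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(Z.shape)

def transform_update(X, B, lam):
    # Closed-form minimizer of ||W X - B||_F^2 + lam (0.5 ||W||_F^2 - log|det W|), cf. (3.7) and (A.18).
    n = X.shape[0]
    L = sqrtm(X @ X.conj().T + 0.5 * lam * np.eye(n))   # L = (X X^H + 0.5 lam I)^(1/2)
    Linv = np.linalg.inv(L)
    V, sig, Rh = svd(Linv @ X @ B.conj().T)             # full SVD: V diag(sig) Rh
    mid = np.diag(sig + np.sqrt(sig ** 2 + 2.0 * lam))
    return 0.5 * Rh.conj().T @ mid @ V.conj().T @ Linv

def objective(W, B, X, lam):
    sig = np.linalg.svd(W, compute_uv=False)
    Q = np.sum(0.5 * sig ** 2 - np.log(sig))            # Q(W) = 0.5 ||W||_F^2 - log|det W|
    return np.linalg.norm(W @ X - B, "fro") ** 2 + lam * Q

rng = np.random.default_rng(0)
n, N, s, lam = 16, 400, 640, 0.1                        # arbitrary demo sizes (10% sparsity)
X = rng.standard_normal((n, N))
W = np.eye(n)
B = sparse_code(W @ X, s)
values = [objective(W, B, X, lam)]
for _ in range(20):
    W = transform_update(X, B, lam)                     # exact minimizer over W
    B = sparse_code(W @ X, s)                           # exact minimizer over B
    values.append(objective(W, B, X, lam))
assert all(b <= a + 1e-6 for a, b in zip(values, values[1:]))   # monotone decrease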
Lemma A.2. The iterate sequence {W^t, B^t, x^t} generated by Algorithm A1 is bounded, and
it has at least one accumulation point.
Proof. The existence of a convergent subsequence (and hence, an accumulation point)
for a bounded sequence is a standard result. Therefore, we only prove the boundedness of the
iterate sequence.
Since ‖x^t‖₂ ≤ C ∀ t trivially, we have that the sequence {x^t} is bounded. We now show
the boundedness of {W^t}. The squared ℓ2 norm terms and the barrier functions ψ(B^t) and
χ(x^t) in the objective g^t (4.3) are non-negative. Therefore, we have

(A.2)    λ Q(W^t) ≤ g^t ≤ g^0

where the second inequality follows from Lemma A.1. Now, the function Q(W^t) =
Σ_{i=1}^{n} (0.5 α_i² − log α_i), where α_i (1 ≤ i ≤ n) are the singular values of W^t, is a coercive
function of the (non-negative) singular values, and therefore, it has bounded lower level sets.
Combining this fact with (A.2), we can immediately conclude that ∃ c_0 ∈ R depending on
g^0 and λ, such that ‖W^t‖_F ≤ c_0 ∀ t.
Finally, the boundedness of {B^t} follows from the following arguments. First, for Algo-
rithm A1 (see pseudocode), we have that B^t = Hs(W^t X^{t−1}). Therefore, by the definition of
Hs(·), we have

(A.3)    ‖B^t‖_F = ‖Hs(W^t X^{t−1})‖_F ≤ ‖W^t X^{t−1}‖_F ≤ ‖W^t‖₂ ‖X^{t−1}‖_F

Since, by our previous arguments, W^t and X^{t−1} are both bounded by constants independent
of t, we have that the sequence of sparse code matrices {B^t} is also bounded.
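A quick numerical check (illustrative only; sizes arbitrary) of the two inequalities used in (A.3), namely that the projection Hs(·) is non-expansive in the Frobenius norm and that ‖WX‖_F ≤ ‖W‖₂ ‖X‖_F:

import numpy as np

def hard_keep_s(Z, s):
    # Keep the s largest-magnitude entries of Z (an element of H_s(Z)).
    flat = Z.reshape(-1)
    keep = np.argpartition(np.abs(flat), -s)[-s:]
    out = np.zeros_like(flat)
    out[keep] = flat[keep]
    return out.reshape(Z.shape)

rng = np.random.default_rng(0)
W = rng.standard_normal((16, 16))
X = rng.standard_normal((16, 200))
Z = W @ X
B = hard_keep_s(Z, s=300)
assert np.linalg.norm(B) <= np.linalg.norm(Z)                                  # ||H_s(Z)||_F <= ||Z||_F
assert np.linalg.norm(Z) <= np.linalg.norm(W, 2) * np.linalg.norm(X) + 1e-9    # ||W X||_F <= ||W||_2 ||X||_F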
We now establish some key optimality properties of the accumulation points of the iterate
sequence in Algorithm A1.
Lemma A.3. All the accumulation points of the iterate sequence generated by Algorithm A1
with a given initial (W 0 , B 0 , x0 ) correspond to a common objective value g ∗ . Thus, they are
equivalent in that sense.
Proof. Consider the subsequence {W^{q_t}, B^{q_t}, x^{q_t}} (indexed by q_t) of the iterate sequence,
that converges to the accumulation point (W^*, B^*, x^*). Before proving the lemma, we establish
some simple properties of (W^*, B^*, x^*). First, equation (A.2) implies that −log|det W^{q_t}| ≤
(g^0/λ) for every t. This further implies that |det W^{q_t}| ≥ e^{−g^0/λ} > 0 ∀ t. Therefore, due to the
continuity of the function |det W|, the limit W^* of the subsequence is also non-singular, with
|det W^*| ≥ e^{−g^0/λ}. Second, B^{q_t} = Hs(W^{q_t} X^{q_t−1}), where X^{q_t−1} is the matrix with P_j x^{q_t−1}
(1 ≤ j ≤ N) as its columns. Thus, B^{q_t} trivially satisfies ψ(B^{q_t}) = 0 for every t. Now, {B^{q_t}}
converges to B^*, which makes B^* the limit of a sequence of matrices, each of which has no
more than s non-zeros. Thus, the limit B^* obviously cannot have more than s non-zeros.
Therefore, ‖B^*‖₀ ≤ s, or equivalently ψ(B^*) = 0. Finally, since x^{q_t} satisfies the constraint
‖x^{q_t}‖₂² ≤ C², we have χ(x^{q_t}) = 0 ∀ t. Additionally, since x^{q_t} → x^* as t → ∞, we also have
‖x^*‖₂² = lim_{t→∞} ‖x^{q_t}‖₂² ≤ C². Therefore, χ(x^*) = 0. Now, it is obvious from the above
arguments that

(A.4)    lim_{t→∞} χ(x^{q_t}) = χ(x^*),    lim_{t→∞} ψ(B^{q_t}) = ψ(B^*)

The above result together with the fact (from Lemma A.1) that lim_{t→∞} g(W^{q_t}, B^{q_t}, x^{q_t}) = g^*
implies that g(W^*, B^*, x^*) = g^*.
The following lemma establishes the partial global optimality, with respect to the image, of
every accumulation point.
Lemma A.4. Any accumulation point (W^*, B^*, x^*) of the iterate sequence generated by
Algorithm A1 satisfies

(A.6)    g(W^*, B^*, x^*) ≤ g(W^*, B^*, x)  ∀ x ∈ C^p
Now, by (A.4), we have that lim_{t→∞} χ(x^{q_t}) = χ(x^*). Taking the limit t → ∞ term by term
on both sides of (A.7) for some fixed x ∈ C^p yields the following result

(A.8)    Σ_{j=1}^{N} ‖W^* P_j x^* − b_j^*‖₂² + ν ‖Ax^* − y‖₂² + χ(x^*) ≤ Σ_{j=1}^{N} ‖W^* P_j x − b_j^*‖₂² + ν ‖Ax − y‖₂² + χ(x)
Since the choice of x in (A.8) is arbitrary, (A.8) holds for any x ∈ Cp . Recall that
ψ(B ∗ ) = 0 and Q(W ∗ ) is finite based on the arguments in the proof of Lemma A.3. Therefore,
(A.8) implies that g(W ∗ , B ∗ , x∗ ) ≤ g(W ∗ , B ∗ , x) ∀ x ∈ Cp . This establishes the result (A.6)
of the Lemma.
The following lemma will be used to establish that the change between successive image
iterates, ‖x^t − x^{t−1}‖₂, converges to 0.
Lemma A.5. Consider the subsequence {W^{q_t}, B^{q_t}, x^{q_t}} (indexed by q_t) of the iterate se-
quence, that converges to the accumulation point (W^*, B^*, x^*). Then, the subsequence {x^{q_t−1}}
also converges to x^*.
Proof. First, by Lemma A.3, we have that

(A.9)    g(W^*, B^*, x^*) = g^*

Next, consider a convergent subsequence {x^{q_{n_t}−1}} of (the bounded sequence) {x^{q_t−1}} that
converges to, say, x^{**}. Now, applying the same arguments as in Equation (A.5) (in the proof
of Lemma A.3), but with respect to the (convergent) subsequence {x^{q_{n_t}−1}, W^{q_{n_t}}, B^{q_{n_t}}}, we
have that g(W^*, B^*, x^{**}) = g^*.
Furthermore, it was shown in Section 3.1.3 that if the set of patches in our formulation (P1)
covers all pixels in the image (always true for the case of periodically positioned overlapping
image patches), then the minimization of g(W^*, B^*, x) with respect to x has a unique so-
lution. Therefore, x^* is the unique minimizer in (A.12). Combining this with the fact that
g(W^*, B^*, x^{**}) = g(W^*, B^*, x^*) yields that x^{**} = x^*. Since we worked with an arbitrary
convergent subsequence {x^{q_{n_t}−1}} (of {x^{q_t−1}}) in the above proof, we have that x^* is the limit
of every convergent subsequence of the bounded sequence {x^{q_t−1}}. Hence, {x^{q_t−1}} itself
converges to x^*.
Lemma A.6. For the iterate sequence generated by Algorithm A1, the changes between
successive image iterates a^t ≜ ‖x^t − x^{t−1}‖₂ converge to 0.
Proof. Consider the sequence {a^t} with a^t ≜ ‖x^t − x^{t−1}‖₂. We will show below that
0 is both the limit inferior and limit superior of {a^t}, which means that the sequence
{a^t} itself converges to 0.
Now, consider a convergent subsequence {a^{q_t}} of {a^t}. Since the sequence {W^{q_t}, B^{q_t}, x^{q_t}}
is bounded, there exists a convergent subsequence {W^{q_{n_t}}, B^{q_{n_t}}, x^{q_{n_t}}} converging to, say,
(W^*, B^*, x^*). By Lemma A.5, we then have that {x^{q_{n_t}−1}} also converges to x^*. Thus,
the subsequence {a^{q_{n_t}}} with a^{q_{n_t}} ≜ ‖x^{q_{n_t}} − x^{q_{n_t}−1}‖₂ converges to 0. Since {a^{q_{n_t}}} itself is a
subsequence of a convergent sequence, we must have that {a^{q_t}} converges to the same limit
(i.e., 0). We have thus shown that zero is the limit of any convergent subsequence of {a^t}.
The next property is the partial global optimality of every accumulation point with respect
to the sparse code. In order to establish this property, we need the following result.
Lemma A.7. Consider a bounded matrix sequence {Z^k} with Z^k ∈ C^{n×N}, that converges
to Z^*. Then, every accumulation point of Hs(Z^k) belongs to the set H̃s(Z^*).
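The set-valued nature of H̃s(·) in Lemma A.7 is easy to see numerically: when magnitudes tie, the optimal s-sparse projection is not unique, and sequences converging to the same limit can select different optimal projections. A small (hypothetical 2×2) illustration:

import numpy as np

Zstar = np.array([[1.0, 1.0], [0.2, 0.1]])
s = 1
B1 = np.array([[1.0, 0.0], [0.0, 0.0]])   # keeps the (1,1) entry
B2 = np.array([[0.0, 1.0], [0.0, 0.0]])   # keeps the (1,2) entry
# Both 1-sparse matrices achieve the same optimal projection error, so both lie in H~_s(Zstar):
assert np.isclose(np.linalg.norm(Zstar - B1), np.linalg.norm(Zstar - B2))

# Perturbed sequences Z^k -> Zstar select different projections H_s(Z^k), whose limits are
# the two distinct elements of H~_s(Zstar) above:
for eps, expected in [(+1e-3, B1), (-1e-3, B2)]:
    Zk = Zstar.copy()
    Zk[0, 0] += eps                        # tip the tie one way or the other
    flat = Zk.reshape(-1)
    keep = np.argsort(np.abs(flat))[-s:]
    Bk = np.zeros_like(flat)
    Bk[keep] = flat[keep]
    assert np.allclose(Bk.reshape(2, 2), expected, atol=2e-3)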
Moreover, denoting by X ∗ ∈ Cn×N the matrix whose jth column is Pj x∗ , for 1 ≤ j ≤ N , the
above condition can be equivalently stated as
(A.15) B ∗ ∈ H̃s (W ∗ X ∗ )
(A.18)    W̃^{(1, q_t+1)} = 0.5 R^{q_t} (Σ^{q_t} + ((Σ^{q_t})² + 2λI)^{1/2}) (V^{q_t})^H (L^{q_t})^{−1}
To prove the lemma, we will consider the limit t → ∞ in (A.18). In order to take this limit, we
need the following results. First, due to the continuity of the matrix square root and matrix
inverse functions at positive definite matrices, we have that the following limit holds, where
X^* ∈ C^{n×N} has P_j x^* (1 ≤ j ≤ N) as its columns.

lim_{t→∞} (L^{q_t})^{−1} = lim_{t→∞} (X^{q_t}(X^{q_t})^H + 0.5λI)^{−1/2} = (X^*(X^*)^H + 0.5λI)^{−1/2}

Next, defining L^* ≜ (X^*(X^*)^H + 0.5λI)^{1/2}, we also have that
Applying Lemma A.9 to (A.19), we have that every accumulation point (V^*, Σ^*, R^*) of the
sequence {V^{q_t}, Σ^{q_t}, R^{q_t}} is such that V^* Σ^* (R^*)^H is a full SVD of (L^*)^{−1} X^* (B^*)^H. Now,
consider a convergent subsequence {V^{q_{n_t}}, Σ^{q_{n_t}}, R^{q_{n_t}}} of {V^{q_t}, Σ^{q_t}, R^{q_t}}, with limit (V^*, Σ^*, R^*).
Then, taking the limit t → ∞ in (A.18) along this subsequence, we have

W^{**} ≜ lim_{t→∞} W̃^{(1, q_{n_t}+1)} = 0.5 R^* (Σ^* + ((Σ^*)² + 2λI)^{1/2}) (V^*)^H (L^*)^{−1}

Combining this result with the aforementioned definitions of the square root L^* and the full
SVD V^* Σ^* (R^*)^H, and applying Proposition 3.1, we get
Finally, applying the same arguments as in the proof of Lemma A.3 to the subsequence
{B^{q_{n_t}}, x^{q_{n_t}}, W̃^{(1, q_{n_t}+1)}}, we easily get that g^* = g(W^{**}, B^*, x^*). Since, by Lemma A.3, we
also have that g^* = g(W^*, B^*, x^*), we get g(W^*, B^*, x^*) = g(W^{**}, B^*, x^*), which together
with (A.20) immediately establishes the required result (A.17).
The following lemma establishes that every accumulation point of the iterate sequence
in Algorithm A1 is a critical point of the objective g(W, B, x). All derivatives or sub-
differentials are computed with respect to the real and imaginary parts of the corresponding
variables/vectors/matrices below.
Lemma A.11. Every accumulation point (W^*, B^*, x^*) of the iterate sequence generated by
Algorithm A1 is a critical point of the objective g(W, B, x) satisfying

(A.21)    0 ∈ ∂g(W^*, B^*, x^*)
Now, using the preceding results, we easily have that 0 ∈ ∂g (W ∗ , B ∗ , x∗ ) above. Thus, every
accumulation point in Algorithm A1 is a critical point of the objective.
The following two lemmas establish pairwise partial local optimality of the accumulation
points in Algorithm A1. Here, X ∗ ∈ Cn×N is the matrix with Pj x∗ as its columns.
Lemma A.12. Every accumulation point (W ∗ , B ∗ , x∗ ) of the iterate sequence generated by
Algorithm A1 is a partial minimizer of the objective g(W, B, x) with respect to (W, B), in the
sense of (4.9), for sufficiently small dW ∈ Cn×n , and all ∆B ∈ Cn×N in the union of the
regions R1 and R2 in Theorem 4.3. Furthermore, if kW ∗ X ∗ k0 ≤ s, then the ∆B in (4.9) can
be arbitrary.
Proof. Consider the subsequence {W^{q_t}, B^{q_t}, x^{q_t}} (indexed by q_t) of the iterate sequence,
that converges to the accumulation point (W^*, B^*, x^*). First, by Lemmas A.8 and A.10, we
have

(A.26)    B^* ∈ H̃s(W^* X^*)

(A.27)    2W^* X^* (X^*)^H − 2B^* (X^*)^H + λW^* − λ(W^*)^{−H} = 0

where (A.27) follows from the first order conditions for partial global optimality of W^* in
(A.17). The accumulation point (W^*, B^*, x^*) also satisfies ψ(B^*) = 0 and χ(x^*) = 0.
Now, considering perturbations dW ∈ Cn×n , and ∆B ∈ Cn×N , we have that
In order to prove the condition (4.9) in Theorem 4.3, it suffices to consider sparsity preserving
perturbations ∆B, that is ∆B ∈ Cn×N such that B ∗ + ∆B has sparsity ≤ s. Otherwise
g(W ∗ + dW, B ∗ + ∆B, x∗ ) = +∞ > g(W ∗ , B ∗ , x∗ ) trivially. Therefore, we only consider
sparsity preserving ∆B in the following, for which ψ(B ∗ + ∆B) = 0 in (A.28).
Since the image x∗ is fixed here, we can utilize (A.26) and (A.27), and apply similar
arguments as in the proof of Lemma 9 (equations (43)-(46)) in [73], to simplify the right hand
side in (A.28). The only difference is that the matrix transpose (·)^T operations in [73] are
replaced with Hermitian (·)^H operations here, and the operation ⟨Q, R⟩ involving two matrices
Q and R is defined as
(A.29)    ⟨Q, R⟩ ≜ Re{tr(Q R^H)}

Upon such simplifications, we can conclude [73] that ∃ ǫ′ > 0 depending on W^* such that
whenever ‖dW‖_F < ǫ′, we have
Consider first the case of ∆B in region R1. Then, the term −⟨W^* X^* − B^*, ∆B⟩ above is
trivially non-negative for such ∆B in Theorem 4.3, and therefore, g(W^* + dW, B^* + ∆B, x^*) ≥
g(W^*, B^*, x^*) for ∆B ∈ R1.
Next, consider the case of ∆B in region R2. Then, when ‖W^* X^*‖₀ > s, it is easy to
see that any such sparsity preserving ∆B in region R2 in Theorem 4.3 will have its sup-
port (i.e., non-zero locations) contained in the support of B^* ∈ H̃s(W^* X^*). Therefore,
⟨W^* X^* − B^*, ∆B⟩ = 0 in this case. On the other hand, if ‖W^* X^*‖₀ ≤ s, then by (A.26),
W^* X^* − B^* = 0. Therefore, by these arguments, ⟨W^* X^* − B^*, ∆B⟩ = 0 for any sparsity
preserving ∆B ∈ R2. This result together with (A.30) implies g(W^* + dW, B^* + ∆B, x^*) ≥
g(W^*, B^*, x^*) for any ∆B ∈ R2.
Finally, if ‖W^* X^*‖₀ ≤ s, then since W^* X^* − B^* = 0 in (A.30), the ∆B in (4.9) can be
arbitrary in this case.
Lemma A.13. Every accumulation point (W^*, B^*, x^*) of the iterate sequence generated by
Algorithm A1 is a partial minimizer of the objective g(W, B, x) with respect to (B, x), in the
sense of (4.10), for all ∆x̃ ∈ C^p, and all ∆B ∈ C^{n×N} in the union of the regions R1 and R2
in Theorem 4.3. Furthermore, if ‖W^* X^*‖₀ ≤ s, then the ∆B in (4.10) can be arbitrary.
Proof. Consider the subsequence {W^{q_t}, B^{q_t}, x^{q_t}} (indexed by q_t) of the iterate sequence,
that converges to the accumulation point (W^*, B^*, x^*). It follows from Lemmas A.8 and A.4
that ψ(B^*) = χ(x^*) = 0. Now, considering perturbations ∆B ∈ C^{n×N} whose columns are
denoted as ∆b_j (1 ≤ j ≤ N), and ∆x̃ ∈ C^p, we have that

(A.31)    g(W^*, B^* + ∆B, x^* + ∆x̃) = Σ_{j=1}^{N} ‖W^* P_j x^* − b_j^* + W^* P_j ∆x̃ − ∆b_j‖₂²
                                        + λ Q(W^*) + ν ‖Ax^* − y + A∆x̃‖₂² + ψ(B^* + ∆B) + χ(x^* + ∆x̃)
In order to prove the condition (4.10) in Theorem 4.3, it suffices to consider sparsity preserving
perturbations ∆B, that is, ∆B ∈ C^{n×N} such that B^* + ∆B has sparsity ≤ s. It also suffices
to consider energy preserving perturbations ∆x̃, which are such that ‖x^* + ∆x̃‖₂ ≤ C. For
any other ∆B or ∆x̃, g(W^*, B^* + ∆B, x^* + ∆x̃) = +∞ > g(W^*, B^*, x^*) trivially. Therefore,
we only consider the energy/sparsity preserving perturbations in the following, for which
ψ(B^* + ∆B) = 0 and χ(x^* + ∆x̃) = 0 in (A.31). Now, upon expanding the squared ℓ2 terms
Footnote 17: We include the Re(·) operation in the definition here, which allows for simpler notations in the
rest of the proof. However, the ⟨Q, R⟩ defined in (A.29) is no longer the conventional inner product.
Now, if (the optimal) µ = 0 above, then g(W^*, B^* + ∆B, x^* + ∆x̃) ≥ g(W^*, B^*, x^*) for
arbitrary ∆x̃ ∈ C^p, and ∆B ∈ R1 ∪ R2. The alternative scenario is the case when (the
optimal) µ > 0, which can occur only if ‖x^*‖₂ = C holds. Since we are considering
energy preserving perturbations ∆x̃, we have ‖x^* + ∆x̃‖₂² ≤ C² = ‖x^*‖₂², which implies
−2⟨x^*, ∆x̃⟩ ≥ ‖∆x̃‖₂² ≥ 0. Combining this result with (A.33), we again have (now for the
µ > 0 case) that g(W^*, B^* + ∆B, x^* + ∆x̃) ≥ g(W^*, B^*, x^*) for arbitrary ∆x̃ ∈ C^p, and
∆B ∈ R1 ∪ R2.
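A quick numerical sanity check (with random complex vectors; illustrative only) of the elementary inequality used above, i.e., that ‖x^* + ∆x̃‖₂ ≤ ‖x^*‖₂ implies −2⟨x^*, ∆x̃⟩ ≥ ‖∆x̃‖₂², with ⟨a, b⟩ = Re(a^H b) as in (A.29):

import numpy as np

rng = np.random.default_rng(0)
p = 50
xstar = rng.standard_normal(p) + 1j * rng.standard_normal(p)
for _ in range(1000):
    d = 0.1 * (rng.standard_normal(p) + 1j * rng.standard_normal(p))
    if np.linalg.norm(xstar + d) <= np.linalg.norm(xstar):   # an energy preserving perturbation
        inner = np.real(np.vdot(xstar, d))                   # <x*, dx~> = Re((x*)^H dx~)
        assert -2.0 * inner >= np.linalg.norm(d) ** 2 - 1e-9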
A.2. Proofs of Theorems 4.6 and 4.7. The proofs of Theorems 4.6 and 4.7 are very
similar to that for Theorem 4.3. We only discuss some of the minor differences, as follows.
First, in the case of Theorem 4.6, the main difference in the proof is that the non-negative
barrier function ψ(B) and the operator Hs(·) (in the proof of Theorem 4.3) are replaced by
the non-negative penalty η² Σ_{j=1}^{N} ‖b_j‖₀ and the operator Ĥ^1_η(·), respectively. Moreover, the
mapping H̃s(·) defined in (A.1) for the proof of Theorem 4.3, is replaced by the matrix-to-set
mapping Ĥ_η(·) (for the proof of Theorem 4.6) defined as

(A.34)    (Ĥ_η(Z))_{ij} =  0,            if |Z_ij| < η
                            {Z_ij, 0},    if |Z_ij| = η
                            Z_ij,         if |Z_ij| > η
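For concreteness, a sketch of one single-valued selection from the set-valued mapping in (A.34); the specific choice made at the tie |Z_ij| = η by the operator Ĥ^1_η used with Theorem 4.6 may differ from the one assumed here.

import numpy as np

def hard_threshold(Z, eta):
    # One selection from the set-valued mapping Hhat_eta(Z) in (A.34): zero entries with |Z_ij| < eta,
    # keep entries with |Z_ij| > eta; at the tie |Z_ij| = eta the entry is kept here (either choice is
    # optimal for the eta^2 * l0 penalty, since keeping costs eta^2 and zeroing costs |Z_ij|^2 = eta^2).
    return np.where(np.abs(Z) >= eta, Z, 0)

Z = np.array([[0.3, -1.2], [0.8, 0.05]])
print(hard_threshold(Z, eta=0.5))   # entries with magnitude below 0.5 are zeroed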
By thus replacing the relevant functions and operators, the various steps in the proof of
Theorem 4.3 can be easily extended to the case of Theorem 4.6. As such, Theorem 4.6 mainly
differs from Theorem 4.3 in terms of the definition of the set of allowed (local) perturbations
∆B. In particular, the proofs of partial local optimality of accumulation points in Theorem
4.6 are easily extended from the aforementioned proofs for Lemmas A.12 and A.13, by using
the techniques and inequalities presented in Appendix F of [73].
Finally, the main difference between the proofs of Theorems 4.3 and 4.7 is that the non-
negative λ Q(W ) penalty in the former is replaced by the barrier function ϕ(W ) (that enforces
the unitary property, and keeps W t always bounded) in the latter. Otherwise, the proof
techniques are very similar for the two cases.
REFERENCES
[1] M. Aharon and M. Elad, Sparse and redundant modeling of image content using an image-signature-
dictionary, SIAM Journal on Imaging Sciences, 1 (2008), pp. 228–247.
[2] M. Aharon, M. Elad, and A. Bruckstein, K-SVD: An algorithm for designing overcomplete dictio-
naries for sparse representation, IEEE Transactions on signal processing, 54 (2006), pp. 4311–4322.
[3] H. Attouch, J. Bolte, P. Redont, and A. Soubeyran, Proximal alternating minimization and
projection methods for nonconvex problems: An approach based on the kurdyka-lojasiewicz inequality,
Math. Oper. Res., 35 (2010), pp. 438–457.
[4] Y. Bresler, Spectrum-blind sampling and compressive sensing for continuous-index signals, in 2008
Information Theory and Applications Workshop Conference, 2008, pp. 547–554.
[5] Y. Bresler and P. Feng, Spectrum-blind minimum-rate sampling and reconstruction of 2-D multiband
signals, in Proc. 3rd IEEE Int. Conf. on Image Processing, ICIP’96, sep 1996, pp. 701–704.
[6] Y. Bresler, M. Gastpar, and R. Venkataramani, Image compression on-the-fly by universal sam-
pling in Fourier imaging systems, in Proc. 1999 IEEE Information Theory Workshop on Detection,
Estimation, Classification, and Imaging, feb 1999, p. 48.
[7] A. M. Bruckstein, D. L. Donoho, and M. Elad, From sparse solutions of systems of equations to
sparse modeling of signals and images, SIAM Review, 51 (2009), pp. 34–81.
[8] E. Candès, J. Romberg, and T. Tao, Robust uncertainty principles: exact signal reconstruction from
highly incomplete frequency information, IEEE Trans. Information Theory, 52 (2006), pp. 489–509.
[9] E. Candès and T. Tao, Decoding by linear programming, IEEE Trans. on Information Theory, 51 (2005),
pp. 4203–4215.
[10] R. Chartrand, Fast algorithms for nonconvex compressive sensing: MRI reconstruction from very few
data, in Proc. IEEE International Symposium on Biomedical Imaging (ISBI), 2009, pp. 262–265.
[11] G. H. Chen, J. Tang, and S. Leng, Prior image constrained compressed sensing (PICCS): A method
to accurately reconstruct dynamic CT images from highly undersampled projection data sets, Med.
Phys., 35 (2008), pp. 660–663.
[12] S. S. Chen, D. L. Donoho, and M. A. Saunders, Atomic decomposition by basis pursuit, SIAM J.
Sci. Comput., 20 (1998), pp. 33–61.
[13] K. Choi, J. Wang, L. Zhu, T.-S. Suh, S. Boyd, and L. Xing, Compressed sensing based cone-beam
computed tomography reconstruction with a first-order method, Med. Phys., 37 (2010), pp. 5113–5125.
[14] E. Chouzenoux, J.-C. Pesquet, and A. Repetti, A block coordinate variable metric forward-backward
algorithm, (2014). Preprint: http://www.optimization-online.org/DB_HTML/2013/12/4178.html.
[15] K. Dabov, A. Foi, V. Katkovnik, and K. Egiazarian, Image denoising by sparse 3D transform-
domain collaborative filtering, IEEE Trans. on Image Processing, 16 (2007), pp. 2080–2095.
[16] W. Dai and O. Milenkovic, Subspace pursuit for compressive sensing signal reconstruction, IEEE
Trans. Information Theory, 55 (2009), pp. 2230–2249.
[17] G. Davis, S. Mallat, and M. Avellaneda, Adaptive greedy approximations, Journal of Constructive
Approximation, 13 (1997), pp. 57–98.
[18] M. N. Do and M. Vetterli, The contourlet transform: an efficient directional multiresolution image
representation, IEEE Trans. Image Process., 14 (2005), pp. 2091–2106.
[19] D. Donoho, Compressed sensing, IEEE Trans. Information Theory, 52 (2006), pp. 1289–1306.
[20] D. L. Donoho, For most large underdetermined systems of linear equations the minimal l1-norm solution
is also the sparsest solution, Comm. Pure Appl. Math, 59 (2004), pp. 797–829.
[21] D. L. Donoho, M. Elad, and V. N. Temlyakov, Stable recovery of sparse overcomplete representations
in the presence of noise, IEEE Trans. Inform. Theory, 52 (2006), pp. 6–18.
[22] B. Efron, T. Hastie, I. Johnstone, and R. Tibshirani, Least angle regression, Annals of Statistics,
32 (2004), pp. 407–499.
[23] M. Elad and M. Aharon, Image denoising via sparse and redundant representations over learned
dictionaries, IEEE Trans. Image Process., 15 (2006), pp. 3736–3745.
[24] M. Elad, P. Milanfar, and R. Rubinstein, Analysis versus synthesis in signal priors, Inverse Prob-
lems, 23 (2007), pp. 947–968.
[25] L. Eldèn, Solving quadratically constrained least squares problems using a differential-geometric approach,
BIT Numerical Mathematics, 42 (2002), pp. 323–335.
[26] K. Engan, S.O. Aase, and J.H. Hakon-Husoy, Method of optimal directions for frame design, in Proc.
IEEE International Conference on Acoustics, Speech, and Signal Processing, 1999, pp. 2443–2446.
[27] P. Feng, Universal Spectrum Blind Minimum Rate Sampling and Reconstruction of Multiband Signals,
PhD thesis, University of Illinois at Urbana-Champaign, mar 1997. Yoram Bresler, adviser.
[28] P. Feng and Y. Bresler, Spectrum-blind minimum-rate sampling and reconstruction of multiband sig-
nals, in ICASSP, vol. 3, may 1996, pp. 1689–1692.
[29] M. Gastpar and Y. Bresler, On the necessary density for spectrum-blind nonuniform sampling subject
to quantization, in ICASSP, vol. 1, jun 2000, pp. 348–351.
[30] S. Gleichman and Yonina C. Eldar, Blind compressed sensing, IEEE Transactions on Information
Theory, 57 (2011), pp. 6958–6975.
[31] G. H. Golub and C. F. Van Loan, Matrix Computations, Johns Hopkins University Press, Baltimore,
Maryland, 1996.
[32] I. F. Gorodnitsky, J. George, and B. D. Rao, Neuromagnetic source imaging with FOCUSS: A
recursive weighted minimum norm algorithm, Electrocephalography and Clinical Neurophysiology, 95
(1995), pp. 231–251.
[33] R. Gribonval and K. Schnass, Dictionary identification–sparse matrix-factorization via l1 -
minimization, IEEE Trans. Inform. Theory, 56 (2010), pp. 3523–3539.
[34] Y. Kim, M. S. Nadar, and A. Bilgin, Wavelet-based compressed sensing using gaussian scale mixtures,
in Proc. ISMRM, 2010, p. 4856.
[35] X. Li and S. Luo, A compressed sensing-based iterative algorithm for CT reconstruction and its possible
application to phase contrast imaging, BioMedical Engineering OnLine, 10 (2011), p. 73.
[36] S. G. Lingala and M. Jacob, Blind compressive sensing dynamic MRI, IEEE Transactions on Medical
Imaging, 32 (2013), pp. 1132–1145.
[37] M. Lustig, Michael Lustig home page. http://www.eecs.berkeley.edu/~mlustig/Software.html,
2014. [Online; accessed October, 2014].
[38] M. Lustig, D.L. Donoho, and J.M. Pauly, Sparse MRI: The application of compressed sensing for
rapid MR imaging, Magnetic Resonance in Medicine, 58 (2007), pp. 1182–1195.
[39] M. Lustig, J. M. Santos, D. L. Donoho, and J. M. Pauly, k-t SPARSE: High frame rate dynamic
MRI exploiting spatio-temporal sparsity, in Proc. ISMRM, 2006, p. 2420.
[40] J. Mairal, F. Bach, J. Ponce, and G. Sapiro, Online learning for matrix factorization and sparse
coding, J. Mach. Learn. Res., 11 (2010), pp. 19–60.
[41] J. Mairal, M. Elad, and G. Sapiro, Sparse representation for color image restoration, IEEE Trans.
on Image Processing, 17 (2008), pp. 53–69.
[42] J. Mairal, G. Sapiro, and M. Elad, Learning multiscale sparse representations for image and video
restoration, SIAM Multiscale Modeling and Simulation, 7 (2008), pp. 214–241.
[43] K. Malczewski, PET image reconstruction using compressed sensing, in Signal Processing: Algorithms,
Architectures, Arrangements, and Applications (SPA), 2013, Sept 2013, pp. 176–181.
[44] S. C. Malik, Principles of Real Analysis, New Age International, New Delhi, India, 1982.
[45] S. Mallat, A Wavelet Tour of Signal Processing, Academic Press, San Diego, CA, 1999.
[46] S. G. Mallat and Z. Zhang, Matching pursuits with time-frequency dictionaries, IEEE Transactions
on Signal Processing, 41 (1993), pp. 3397–3415.
[47] M. W. Marcellin, M. J. Gormish, A. Bilgin, and M. P. Boliek, An overview of JPEG-2000, in
Proc. Data Compression Conf., 2000, pp. 523–541.
[48] Y. Mohsin, G. Ongie, and M. Jacob, Iterative shrinkage algorithm for patch-smoothness regularized
medical image recovery, IEEE Transactions on Medical Imaging, (2015).
[49] B. S. Mordukhovich, Variational Analysis and Generalized Differentiation. Vol. I: Basic theory,
Springer-Verlag, Heidelberg, Germany, 2006.
[50] B. K. Natarajan, Sparse approximate solutions to linear systems, SIAM J. Comput., 24 (1995), pp. 227–
234.
[51] B. Ning, X. Qu, D. Guo, C. Hu, and Z. Chen, Magnetic resonance image reconstruction using
trained geometric directions in 2d redundant wavelets domain and non-convex optimization, Magnetic
Resonance Imaging, 31 (2013), pp. 1611–1622.
[52] B. A. Olshausen and D. J. Field, Emergence of simple-cell receptive field properties by learning a
sparse code for natural images, Nature, 381 (1996), pp. 607–609.
[53] Y. Pati, R. Rezaiifar, and P. Krishnaprasad, Orthogonal matching pursuit : recursive function
approximation with applications to wavelet decomposition, in Asilomar Conf. on Signals, Systems and
Comput., 1993, pp. 40–44 vol.1.
[54] L. Pfister, Tomographic reconstruction with adaptive sparsifying transforms, master’s thesis, University
of Illinois at Urbana-Champaign, Aug. 2013.
[55] L. Pfister and Y. Bresler, Model-based iterative tomographic reconstruction with adaptive sparsifying
transforms, in SPIE International Symposium on Electronic Imaging: Computational Imaging XII,
vol. 9020, 2014, pp. 90200H–1–90200H–11.
[56] W. K. Pratt, J. Kane, and H. C. Andrews, Hadamard transform image coding, Proc. IEEE, 57
(1969), pp. 58–68.
[57] K. P. Pruessmann, Encoding and reconstruction in parallel MRI, NMR in Biomedicine, 19 (2006),
pp. 288–299.
[58] C. Qiu, W. Lu, and N. Vaswani, Real-time dynamic MR image reconstruction using kalman filtered
compressed sensing, in Proc. IEEE International Conference on Acoustics, Speech and Signal Pro-
cessing, 2009, pp. 393–396.
[59] X. Qu, PANO Code. http://www.quxiaobo.org/project/CS_MRI_PANO/Demo_PANO_SparseMRI.zip,
2014. [Online; accessed May, 2015].
[60] , PANO Code with multi-core cpu parallel computing.
http://www.quxiaobo.org/project/CS_MRI_PANO/Demo_Parallel_PANO_SparseMRI.zip, 2014.
[Online; accessed April, 2015].
[61] , PBDWS Code. http://www.quxiaobo.org/project/CS_MRI_PBDWS/Demo_PBDWS_SparseMRI.zip,
2014. [Online; accessed September, 2014].
[62] X. Qu, D. Guo, B. Ning, Y. Hou, Y. Lin, S. Cai, and Z. Chen, Undersampled MRI reconstruction
with patch-based directional wavelets, Magnetic Resonance Imaging, 30 (2012), pp. 964–977.
[63] X. Qu, Y. Hou, F. Lam, D. Guo, J. Zhong, and Z. Chen, Magnetic resonance image reconstruction
from undersampled measurements using a patch-based nonlocal operator, Medical Image Analysis, 18
(2014), pp. 843–856.
[64] S. Ravishankar and Y. Bresler, MR image reconstruction from highly undersampled k-space data by
dictionary learning, IEEE Trans. Med. Imag., 30 (2011), pp. 1028–1041.
[65] , Multiscale dictionary learning for MRI, in Proc. ISMRM, 2011, p. 2830.
[66] , Learning doubly sparse transforms for image representation, in IEEE Int. Conf. Image Process.,
2012, pp. 685–688.
[67] , Closed-form solutions within sparsifying transform learning, in IEEE International Conference on
Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 5378–5382.
[68] , DLMRI - Lab: Dictionary learning MRI software.
http://www.ifp.illinois.edu/~yoram/DLMRI-Lab/DLMRI.html, 2013. [Online; accessed Oc-
tober, 2014].
[69] , Learning doubly sparse transforms for images, IEEE Trans. Image Process., 22 (2013), pp. 4598–
4612.
[70] , Learning sparsifying transforms, IEEE Trans. Signal Process., 61 (2013), pp. 1072–1086.
[71] , Sparsifying transform learning for compressed sensing MRI, in Proc. IEEE Int. Symp. Biomed.
Imag., 2013, pp. 17–20.
[72] , Blind compressed sensing using sparsifying transforms, in International Conference on Sampling
Theory and Applications (SampTA), May 2015, pp. 513–517.
[73] , ℓ0 sparsifying transform learning with efficient optimal updates and convergence guarantees, IEEE
Trans. Signal Process., 63 (2015), pp. 2389–2404.
[74] R. T. Rockafellar and Roger J.-B. Wets, Variational Analysis, Springer-Verlag, Heidelberg, Ger-
many, 1998.
[75] K. Skretting and K. Engan, Recursive least squares dictionary learning algorithm, IEEE Transactions
on Signal Processing, 58 (2010), pp. 2121–2130.
[76] J. Trzasko and A. Manduca, Highly undersampled magnetic resonance image reconstruction via ho-
motopic l0 -minimization, IEEE Trans. Med. Imaging, 28 (2009), pp. 106–121.
[77] P. Tseng, Convergence of a block coordinate descent method for nondifferentiable minimization, J. Optim.
Theory Appl., 109 (2001), pp. 475–494.
[78] S. Valiollahzadeh, T. Chang, J. W. Clark, and O. R. Mawlawi, Image recovery in pet scan-
ners with partial detector rings using compressive sensing, in IEEE Nuclear Science Symposium and
Medical Imaging Conference (NSS/MIC), Oct 2012, pp. 3036–3039.
[79] R. Venkataramani and Y. Bresler, Further results on spectrum blind sampling of 2D signals, in Proc.
IEEE Int. Conf. Image Proc., ICIP, vol. 2, Oct. 1998, pp. 752–756.
[80] Y. Wang, Y. Zhou, and L. Ying, Undersampled dynamic magnetic resonance imaging using patch-
based spatiotemporal dictionaries, in 2013 IEEE 10th International Symposium on Biomedical Imaging
(ISBI), April 2013, pp. 294–297.
[81] B. Wen, S. Ravishankar, and Y. Bresler, Structured overcomplete sparsifying transform learning with
convergence guarantees and applications, International Journal of Computer Vision, (2014), pp. 1–31.
[82] Y. Xu and W. Yin, A block coordinate descent method for regularized multiconvex optimization with
applications to nonnegative tensor factorization and completion, SIAM Journal on Imaging Sciences,
6 (2013), pp. 1758–1789.
[83] M. Yaghoobi, T. Blumensath, and M. Davies, Dictionary learning for sparse approximations with
the majorization method, IEEE Transaction on Signal Processing, 57 (2009), pp. 2178–2191.
[84] J. C. Ye, Y. Bresler, and P. Moulin, A self-referencing level-set method for image reconstruction
from sparse Fourier samples, Int. J. Computer Vision, 50 (2002), pp. 253–270.
[85] G. Yu, G. Sapiro, and S. Mallat, Image modeling and enhancement via structured sparse model
selection, in Proc. IEEE International Conference on Image Processing (ICIP), 2010, pp. 1641–1644.