Compressed Sensing
David L. Donoho
Department of Statistics
Stanford University
Abstract
Suppose x is an unknown vector in $\mathbf{R}^m$ (depending on context, a digital image or signal);
we plan to acquire data and then reconstruct. Nominally this ‘should’ require m samples. But
suppose we know a priori that x is compressible by transform coding with a known transform,
and we are allowed to acquire data about x by measuring n general linear functionals – rather
than the usual pixels. If the collection of linear functionals is well-chosen, and we allow for
a degree of reconstruction error, the size of n can be dramatically smaller than the size m
usually considered necessary. Thus, certain natural classes of images with m pixels need
only $n = O(m^{1/4} \log^{5/2}(m))$ nonadaptive nonpixel samples for faithful recovery, as opposed
to the usual m pixel samples.
Our approach is abstract and general. We suppose that the object has a sparse
representation in some orthonormal basis (e.g. wavelet, Fourier) or tight frame (e.g. curvelet,
Gabor), meaning that the coefficients belong to an $\ell_p$ ball for $0 < p \le 1$. This implies
that the N most important coefficients in the expansion allow a reconstruction with $\ell_2$ error
$O(N^{1/2-1/p})$. It is possible to design $n = O(N \log(m))$ nonadaptive measurements which
contain the information necessary to reconstruct any such object with accuracy comparable
to that which would be possible if the N most important coefficients of that object were
directly observable. Moreover, a good approximation to those N important coefficients may
be extracted from the n measurements by solving a convenient linear program, called by the
name Basis Pursuit in the signal processing literature. The nonadaptive measurements have
the character of ‘random’ linear combinations of basis/frame elements.
These results are developed in a theoretical framework based on the theory of optimal re-
covery, the theory of n-widths, and information-based complexity. Our basic results concern
properties of $\ell_p$ balls in high-dimensional Euclidean space in the case $0 < p \le 1$. We estimate
the Gel’fand n-widths of such balls, give a criterion for near-optimal subspaces for Gel’fand
n-widths, show that ‘most’ subspaces are near-optimal, and show that convex optimization
can be used for processing information derived from these near-optimal subspaces.
The techniques for deriving near-optimal subspaces include the use of almost-spherical
sections in Banach space theory.
Key Words and Phrases. Integrated Sensing and Processing. Optimal Recovery. Information-
Based Complexity. Gel’fand n-widths. Adaptive Sampling. Sparse Solution of Linear equations.
Basis Pursuit. Minimum $\ell_1$ norm decomposition. Almost-Spherical Sections of Banach Spaces.
1 Introduction
As our modern technology-driven civilization acquires and exploits ever-increasing amounts of
data, ‘everyone’ now knows that most of the data we acquire ‘can be thrown away’ with almost no
perceptual loss – witness the broad success of lossy compression formats for sounds, images and
specialized technical data. The phenomenon of ubiquitous compressibility raises very natural
questions: why go to so much effort to acquire all the data when most of what we get will be
thrown away? Can’t we just directly measure the part that won’t end up being thrown away?
In this paper we design compressed data acquisition protocols which perform as if it were
possible to directly acquire just the important information about the signals/images – in ef-
fect not acquiring that part of the data that would eventually just be ‘thrown away’ by lossy
compression. Moreover, the protocols are nonadaptive and parallelizable; they do not require
knowledge of the signal/image to be acquired in advance - other than knowledge that the data
will be compressible - and do not attempt any ‘understanding’ of the underlying object to guide
an active or adaptive sensing strategy. The measurements made in the compressed sensing
protocol are holographic – thus, not simple pixel samples – and must be processed nonlinearly.
In specific applications this principle might enable dramatically reduced measurement time,
dramatically reduced sampling rates, or reduced use of Analog-to-Digital converter resources.
Such constraints are actually obeyed on natural classes of signals and images; this is the primary
reason for the success of standard compression tools based on transform coding [10]. To fix ideas,
we mention two simple examples of $\ell_p$ constraint.
• Bounded Variation model for images. Here image brightness is viewed as an underlying
function f (x, y) on the unit square 0 ≤ x, y ≤ 1 which obeys (essentially)
\[ \int_0^1 \int_0^1 |\nabla f| \, dx \, dy \le R. \]
• Bump Algebra model for spectra. Here a spectrum (e.g. mass spectrum or NMR spectrum)
is modelled as digital samples $(f(i/n))$ of an underlying function f on the real line which is
a superposition of so-called spectral lines of varying positions, amplitudes, and linewidths.
Formally,
\[ f(t) = \sum_{i=1}^{\infty} a_i \, g((t - t_i)/s_i). \]
Here the parameters $t_i$ are line locations, $a_i$ are amplitudes/polarities and $s_i$ are linewidths,
and g represents a lineshape, for example the Gaussian, although other profiles could be
considered. We assume the constraint $\sum_i |a_i| \le R$, which in applications represents
an energy or total mass constraint. Again we take a wavelet viewpoint, this time specifically
using smooth wavelets. The data can be represented as a superposition of contributions
from various scales. Let $x^{(j)}$ denote the component of the data at scale j, and let
$(\psi_i^j)$ denote the orthonormal basis of wavelets at scale j, containing $2^j$ elements. The
corresponding coefficients again obey $\|\theta^{(j)}\|_1 \le c \cdot R \cdot 2^{-j/2}$ [33].
While in these two examples, the $\ell_1$ constraint appeared, other $\ell_p$ constraints can appear
naturally as well; see below. For some readers the use of $\ell_p$ norms with p < 1 may seem initially
strange; it is now well-understood that the $\ell_p$ norms with such small p are natural mathematical
measures of sparsity [11, 13]. As p decreases below 1, more and more sparsity is being required.
Also, from this viewpoint, an $\ell_p$ constraint based on p = 2 requires no sparsity at all.
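To make the sparsity interpretation concrete, the following minimal Python sketch compares two vectors of equal $\ell_2$ norm: one with a single nonzero entry, and one with its energy spread evenly over all m coordinates. The helper name `lp_quasinorm` and the choice m = 1024 are illustrative assumptions, not part of the original development.

```python
import math

def lp_quasinorm(theta, p):
    """(sum |theta_i|^p)^(1/p): a norm for p >= 1, a quasi-norm for 0 < p < 1."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

m = 1024                                   # illustrative dimension (assumption)
sparse = [1.0] + [0.0] * (m - 1)           # all energy in one coefficient
flat = [1.0 / math.sqrt(m)] * m            # same l2 norm, spread over all coefficients

for p in (2.0, 1.0, 0.5):
    print(p, lp_quasinorm(sparse, p), lp_quasinorm(flat, p))
```

For the flat vector the value equals $m^{1/p-1/2}$, which grows quickly as p decreases, while the one-sparse vector has value 1 for every p. A fixed bound on the $\ell_p$ quasi-norm with small p therefore excludes spread-out vectors.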
Note that in each of these examples, we also allowed for separating the object of interest
into subbands, each one of which obeys an $\ell_p$ constraint. In practice below, we stick with the
view that the object θ of interest is a coefficient vector obeying the constraint (1.1), which may
mean, from an application viewpoint, that our methods correspond to treating various subbands
separately, as in these examples.
The key implication of the $\ell_p$ constraint is sparsity of the transform coefficients. Indeed, we
have trivially that, if $\theta_N$ denotes the vector θ with everything except the N largest coefficients
set to 0,
\[ \|\theta - \theta_N\|_2 \le \zeta_{2,p} \cdot \|\theta\|_p \cdot (N + 1)^{1/2-1/p}, \quad N = 0, 1, 2, \dots, \qquad (1.2) \]
with a constant $\zeta_{2,p}$ depending only on $p \in (0, 2)$. Thus, for example, to approximate θ with
error ε, we need to keep only the $N \asymp \epsilon^{2p/(p-2)}$ biggest terms in θ.
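The decay in (1.2) can be checked numerically. The sketch below is only an illustration under stated assumptions: the coefficient sequence $\theta_k = k^{-1/p}$, the value p = 0.5, and the helper names are hypothetical, and the constant $\zeta_{2,p}$ is replaced by 1, which an elementary argument (each discarded entry is at most $\|\theta\|_p (N+1)^{-1/p}$) supports for this check.

```python
import math

def lp_quasinorm(theta, p):
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def best_n_term_error(theta, N):
    """l2 norm of what remains after keeping the N largest-magnitude entries."""
    tail = sorted((abs(t) for t in theta), reverse=True)[N:]
    return math.sqrt(sum(t * t for t in tail))

p = 0.5                                              # illustrative, 0 < p <= 1
m = 1000
theta = [k ** (-1.0 / p) for k in range(1, m + 1)]   # hypothetical power-law coefficients
norm_p = lp_quasinorm(theta, p)

for N in (0, 1, 10, 100):
    err = best_n_term_error(theta, N)
    rate = norm_p * (N + 1) ** (0.5 - 1.0 / p)       # right side of (1.2) with constant 1
    print(N, err, rate)
```

The printed error stays below the printed rate for every N, and both shrink polynomially in N, which is the sense in which an $\ell_p$ constraint forces sparsity.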
We are interested in the $\ell_2$ error of reconstruction and in the behavior of optimal information
and optimal algorithms. Hence we consider the minimax $\ell_2$ error as a standard of comparison:
So here, all possible methods of nonadaptive sampling are allowed, and all possible methods of
reconstruction are allowed.
In our application, the class X of objects of interest is the set of objects $x = \sum_i \theta_i \psi_i$ where
$\theta = \theta(x)$ obeys (1.1) for a given p and R. Denote then
Our goal is to evaluate $E_n(X_{p,m}(R))$ and to have practical schemes which come close to attaining
it.
Theorem 1 Let $(n, m_n)$ be a sequence of problem sizes with $n < m_n$, $n \to \infty$, and $m_n \sim A n^{\gamma}$,
$\gamma > 1$, $A > 0$. Then for $0 < p \le 1$ there is $C_p = C_p(A, \gamma) > 0$ so that
We find this surprising in four ways. First, compare (1.3) with (1.2). We see that the forms
are similar, under the calibration $n = N \log(m_n)$. In words: the quality of approximation to x
which could be gotten by using the N biggest transform coefficients can be gotten by using the
$n \approx N \log(m)$ pieces of nonadaptive information provided by $I_n$. The surprise is that we would
not know in advance which transform coefficients are likely to be the important ones in this
approximation, yet the optimal information operator $I_n$ is nonadaptive, depending at most on
the class $X_{p,m}(R)$ and not on the specific object. In some sense this nonadaptive information is
just as powerful as knowing the N best transform coefficients.
This seems even more surprising when we note that for objects $x \in X_{p,m}(R)$, the transform
representation is the optimal one: no other representation can do as well at characterising x
by a few coefficients [11, 12]. Surely then, one imagines, the sampling kernels $\xi_i$ underlying
the optimal information operator must be simply measuring individual transform coefficients?
Actually, no: the information operator is measuring very complex holographic functionals which
in some sense mix together all the coefficients in a big soup. Compare (6.1) below.
Another surprise is that, if we enlarged our class of information operators to allow adaptive
ones, e.g. operators in which certain measurements are made in response to earlier
measurements, we could scarcely do better. Define the minimax error under adaptive information $E_n^{\mathrm{Adapt}}$
allowing adaptive operators
where, for $i \ge 2$, each kernel $\xi_{i,x}$ is allowed to depend on the information $\langle \xi_{j,x}, x \rangle$ gathered at
previous stages $1 \le j < i$. Formally setting
we have
Theorem 2 For $0 < p \le 1$, and $C_p > 0$
So adaptive information is of minimal help – despite the quite natural expectation that an
adaptive method ought to be able to iteratively ‘localize’ and then ‘close in’ on the ‘big
coefficients’.
An additional surprise is that, in the already-interesting case p = 1, Theorems 1 and 2
are easily derivable from known results in OR/IBC and approximation theory! However, the
derivations are indirect; so although they have what seem to the author as fairly important
implications, very little seems known at present about good nonadaptive information operators
or about concrete algorithms matched to them.
Our goal in this paper is to give direct arguments which cover the case 0 < p ≤ 1 of highly
compressible objects, to give direct information about near-optimal information operators and
about concrete, computationally tractable algorithms for using this near-optimal information.
Definition 1.1 The Gel’fand n-width of X with respect to the $\ell^2_m$ norm is defined as
where the infimum is over n-dimensional linear subspaces of $\mathbf{R}^m$, and $V_n^{\perp}$ denotes the
orthocomplement of $V_n$ with respect to the standard Euclidean inner product.
In words, we look for a subspace such that ‘trapping’ x ∈ X in that subspace causes x to be
small. Our interest in Gel’fand n-widths derives from an equivalence between optimal recovery
for nonadaptive information and such n-widths, well-known in the p = 1 case [39], and in the
present setting extending as follows:
Thus the Gel’fand n-widths either exactly or nearly equal the value of optimal information.
Ultimately, the bracketing with constant $2^{1/p-1}$ will be for us just as good as equality, owing
to the unspecified constant factors in (1.3). We will typically only be interested in near-optimal
performance, i.e. in obtaining $E_n$ to within constant factors.
It is relatively rare to see the Gel’fand n-widths studied directly [40]; more commonly one
sees results about the Kolmogorov n-widths:
Definition 1.2 Let $X \subset \mathbf{R}^m$ be a bounded set. The Kolmogorov n-width of X with respect to the
$\ell^2_m$ norm is defined as
\[ d_n(X; \ell^2_m) = \inf_{V_n} \sup_{x \in X} \inf_{y \in V_n} \|x - y\|_2. \]
In particular
\[ d_n(b_{2,m}; \ell^\infty_m) = d^n(b_{1,m}, \ell^2_m). \]
The asymptotic properties of the left-hand side have been determined by Garnaev and Gluskin
[24]. This follows major work by Kashin [28], who developed a slightly weaker version of this
result in the course of determining the Kolmogorov n-widths of Sobolev spaces. See the original
papers, or Pinkus’s book [40] for more details.
\[ d_n(b_{2,m}, \ell^\infty_m) \asymp (n/(1 + \log(m/n)))^{-1/2}. \]
Theorem 1 now follows in the case p = 1 by applying KGG with the duality formula (1.5)
and the equivalence formula (1.4). The case 0 < p < 1 of Theorem 1 does not allow use of
duality and the whole range 0 < p ≤ 1 is approached differently in this paper.
the central algorithm is
\[ \hat{x}^*_n = \mathrm{center}(I_n^{-1}(y_n) \cap X_{p,m}(R)), \]
and it obeys, when the information $I_n$ is optimal,
1.6 Results
Our paper develops two main types of results.
To evaluate the quality of an information operator $I_n$, set
Thus for large n we have a simple description of near-optimal information and a tractable
near-optimal reconstruction algorithm.
1.8 Contents
Section 2 introduces a set of conditions CS1-CS3 for near-optimality of an information operator.
Section 3 considers abstract near-optimal algorithms, and proves Theorems 1-3. Section 4 shows
that solving the convex optimization problem $(L_1)$ gives a near-optimal algorithm whenever
$0 < p \le 1$. Section 5 points out immediate extensions to weak-$\ell_p$ conditions and to tight frames.
Section 6 sketches potential implications in image, signal, and array processing. Section 7 shows
that conditions CS1-CS3 are satisfied for “most” information operators.
Finally, in Section 8 below we note the ongoing work by two groups (Gilbert et al. [25] and
Candès et al. [4, 5]), which, although not written in the n-widths/OR/IBC tradition, implies (as
we explain) closely related results.
2 Information
Consider information operators constructed as follows. With Ψ the orthogonal matrix whose
columns are the basis elements $\psi_i$, and with certain n-by-m matrices Φ obeying conditions
specified below, we construct corresponding information operators $I_n = \Phi\Psi^T$. Everything will
be completely transparent to the orthogonal matrix Ψ and hence we will assume that Ψ is the
identity throughout this section.
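As a small illustration of the construction $I_n = \Phi\Psi^T$, the Python sketch below applies an information operator to a vector in $\mathbf{R}^4$. The Haar basis used for Ψ, the particular 2-by-4 matrix `Phi`, and all function names are assumptions made for the example; in particular `Phi` is not claimed to satisfy the conditions introduced below.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

r = 1.0 / math.sqrt(2.0)
# Columns of Psi: an orthonormal Haar basis of R^4 (illustrative choice).
psi = [
    [0.5, 0.5, 0.5, 0.5],
    [0.5, 0.5, -0.5, -0.5],
    [r, -r, 0.0, 0.0],
    [0.0, 0.0, r, -r],
]
# Hypothetical 2-by-4 matrix Phi; NOT claimed to satisfy CS1-CS3.
Phi = [
    [0.6, -0.8, 0.0, 1.0],
    [0.8, 0.6, 1.0, 0.0],
]

def information(x):
    """y = I_n(x) = Phi Psi^T x: transform coefficients first, then Phi."""
    theta = [dot(p, x) for p in psi]          # theta = Psi^T x
    y = [dot(row, theta) for row in Phi]      # y = Phi theta
    return theta, y

x = [4.0, 2.0, 5.0, 5.0]
theta, y = information(x)
# Reconstruction formula x = Psi theta, valid because Psi is orthogonal.
x_back = [sum(theta[i] * psi[i][k] for i in range(4)) for k in range(4)]
print(theta, y, x_back)
```

Because Ψ is orthogonal, $\|x\|_2 = \|\theta\|_2$ and x is recovered exactly from θ, so any statement about Φ acting on θ transfers to $I_n$ acting on x. This is the sense in which the analysis is transparent to Ψ.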
In view of the relation between Gel’fand n-widths and minimax errors, we may work with
n-widths. We define the width of a set X relative to an operator Φ:
In words, this is the radius of the section of X cut out by nullspace(Φ). In general, the Gel’fand
n-width is the smallest value of w obtainable by choice of Φ:
We will show for all large n and m the existence of n by m matrices Φ where
CS1 The minimal singular value of $\Phi_J$ exceeds $\eta_1 > 0$ uniformly in $|J| < \rho n / \log(m)$.
CS1 demands a certain quantitative degree of linear independence among all small groups of
columns. CS2 says that linear combinations of small groups of columns give vectors that look
much like random noise, at least as far as the comparison of $\ell_1$ and $\ell_2$ norms is concerned. It
will be implied by a geometric fact: every $V_J$ slices through the $\ell^1_m$ ball in such a way that the
resulting convex section is actually close to spherical. CS3 says that for every vector in some
$V_J$, the associated quotient norm $Q_{J^c}$ is never dramatically better than the simple $\ell_1$ norm on
$\mathbf{R}^n$.
It turns out that matrices satisfying these conditions are ubiquitous for large n and m when
we choose the $\eta_i$ and ρ properly. Of course, for any finite n and m, all norms are equivalent and
almost any arbitrary matrix can trivially satisfy these conditions simply by taking $\eta_1$ very small
and $\eta_2, \eta_3$ very large. However, the definition of ‘very small’ and ‘very large’ would have to
depend on n for this trivial argument to work. We claim something deeper is true: it is possible
to choose $\eta_i$ and ρ independent of n and of $m \le A n^{\gamma}$.
Consider the set
\[ \underbrace{S^{n-1} \times \cdots \times S^{n-1}}_{m \text{ terms}} \]
of all n × m matrices having unit-normalized columns. On this set, measure frequency of
occurrence with the natural uniform measure (the product measure, uniform on each factor
$S^{n-1}$).
Theorem 6 Let $(n, m_n)$ be a sequence of problem sizes with $n \to \infty$, $n < m_n$, and $m \sim A n^{\gamma}$,
$A > 0$ and $\gamma \ge 1$. There exist $\eta_i > 0$ and $\rho > 0$ depending only on A and γ so that, for each
$\delta > 0$ the proportion of all n × m matrices Φ satisfying CS1-CS3 with parameters $(\eta_i)$ and ρ
eventually exceeds $1 - \delta$.
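A draw from the uniform product measure on $S^{n-1} \times \cdots \times S^{n-1}$ can be sketched as follows, using the standard fact that a normalized isotropic Gaussian vector is uniformly distributed on the sphere. The function names, the sizes n = 32 and m = 128, and the coherence diagnostic are assumptions for illustration; the sketch does not verify CS1-CS3.

```python
import math
import random

def random_unit_columns(n, m, seed=0):
    """Return m columns, each an independent uniform draw from the sphere S^{n-1}."""
    rng = random.Random(seed)
    columns = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]    # isotropic Gaussian vector
        norm = math.sqrt(sum(v * v for v in g))
        columns.append([v / norm for v in g])          # normalize onto the sphere
    return columns

def max_coherence(columns):
    """Largest |inner product| between distinct columns (a crude diagnostic only)."""
    best = 0.0
    for i in range(len(columns)):
        for j in range(i + 1, len(columns)):
            ip = abs(sum(a * b for a, b in zip(columns[i], columns[j])))
            best = max(best, ip)
    return best

Phi_cols = random_unit_columns(n=32, m=128)
print(max_coherence(Phi_cols))
```

The matrix Φ is stored here as a list of its columns. The coherence value gives only a rough sense of how far small groups of columns are from being linearly dependent; it is not a substitute for the singular-value condition CS1.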
Choose θ so that $0 = \Phi\theta$. Let J denote the indices of the $k = \lfloor \rho n / \log(m) \rfloor$ largest values in
θ. Without loss of generality suppose coordinates are ordered so that J comes first among the
m entries, and partition $\theta = [\theta_J, \theta_{J^c}]$. Clearly
while, because each entry in $\theta_J$ is at least as big as any entry in $\theta_{J^c}$, (1.2) gives
On the other hand, again using $v \in V_J$ and $|J| = k < \rho n / \log(m)$ invoke CS2, getting
\[ \|v\|_1 \ge \eta_2 \cdot \sqrt{n} \cdot \|v\|_2. \]
with $c_1 = \zeta_{1,p} \rho^{1-1/p} / \eta_2 \eta_3$. Recalling $|J| = k < \rho n / \log(m)$, and invoking CS1 we have
3 Algorithms
Given an information operator $I_n$, we must design a reconstruction algorithm $A_n$ which delivers
reconstructions compatible in quality with the estimates for the Gel’fand n-widths. As discussed
in the introduction, the optimal method in the OR/IBC framework is the so-called central
algorithm, which unfortunately, is typically not efficiently computable in our setting. We now
describe an alternate abstract approach, allowing us to prove Theorem 1.
3.1 Feasible-Point Methods
Another general abstract algorithm from the OR/IBC literature is the so-called feasible-point
method, which aims simply to find any reconstruction compatible with the observed information
and constraints.
As in the case of the central algorithm, we consider, for given information $y_n = I_n(x)$ the
collection of all objects $x \in X_{p,m}(R)$ which could have given rise to the information $y_n$:
\[ \hat{X}_{p,R}(y_n) = \{x : y_n = I_n(x), \ x \in X_{p,m}(R)\} \]
In the feasible-point method, we simply select any member of $\hat{X}_{p,R}(y_n)$, by whatever means. A
popular choice is to take an element of least norm, i.e. a solution of the problem
\[ (P_p) \qquad \min_x \|\theta(x)\|_p \quad \text{subject to} \quad y_n = I_n(x), \]
where here $\theta(x) = \Psi^T x$ is the vector of transform coefficients, $\theta \in \ell^p_m$. A nice feature of this
approach is that it is not necessary to know the radius R of the ball $X_{p,m}(R)$; the element of
least norm will always lie inside it.
Calling the solution $\hat{x}_{p,n}$, one can show, adapting standard OR/IBC arguments in [36, 46, 40]
Lemma 3.1
\[ \sup_{X_{p,m}(R)} \|x - \hat{x}_{p,n}\|_2 \le 2 \cdot E_n(X_{p,m}(R)), \quad 0 < p \le 1. \qquad (3.1) \]
3.2 Proof of Theorem 3
Before proceeding, it is convenient to prove Theorem 3. Note that the case p ≥ 1 is well-known
in OR/IBC so we only need to give an argument for p < 1 (though it happens that our argument
works for p = 1 as well). The key point will be to apply the p-triangle inequality
valid for 0 < p < 1; this inequality is well-known in interpolation theory [1] through Peetre and
Sparr’s work, and is easy to verify directly.
Suppose without loss of generality that there is an optimal subspace $V_n$, which is fixed and
given in this proof. As we just saw,
Now
\[ d^n(X_{p,m}(R)) = \mathrm{radius}(\hat{X}_{p,R}(0)) \]
so clearly $E_n \ge d^n$. Now suppose without loss of generality that $x_1$ and $x_{-1}$ attain the radius
bound, i.e. they satisfy $I_n(x_{\pm 1}) = y_n$ and, for $c = \mathrm{center}(\hat{X}_{p,R}(y_n))$ they satisfy
Then define $\delta = (x_1 - x_{-1})/2^{1/p}$. Set $\theta_{\pm 1} = \Psi^T x_{\pm 1}$ and $\xi = \Psi^T \delta$. By the p-triangle inequality
and so
\[ \|\xi\|_p = \|(\theta_1 - \theta_{-1})/2^{1/p}\|_p \le R. \]
Hence $\delta \in X_{p,m}(R)$. However, $I_n(\delta) = I_n((x_{+1} - x_{-1})/2^{1/p}) = 0$, so δ belongs to $X_{p,m}(R) \cap V_n$.
Hence $\|\delta\|_2 \le d^n(X_{p,m}(R))$ and
QED
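The role of the p-triangle inequality in this proof can be checked numerically. The sketch below assumes the standard form $\|a + b\|_p^p \le \|a\|_p^p + \|b\|_p^p$ for $0 < p \le 1$, and tests on random vectors that the rescaled difference $(\theta_1 - \theta_{-1})/2^{1/p}$ of two points of the $\ell_p$ ball of radius R stays in that ball. The values p = 0.5, R = 1, the dimension 8, and the helper names are illustrative assumptions.

```python
import random

def p_power_sum(theta, p):
    """sum |theta_i|^p, i.e. the p-th power of the l_p quasi-norm."""
    return sum(abs(t) ** p for t in theta)

def scale_onto_sphere(theta, p, R):
    """Rescale theta so that its l_p quasi-norm equals R."""
    norm = p_power_sum(theta, p) ** (1.0 / p)
    return [R * t / norm for t in theta]

rng = random.Random(0)
p, R, m = 0.5, 1.0, 8
violations = 0
for _ in range(1000):
    a = scale_onto_sphere([rng.uniform(-1, 1) for _ in range(m)], p, R)
    b = scale_onto_sphere([rng.uniform(-1, 1) for _ in range(m)], p, R)
    diff = [s - t for s, t in zip(a, b)]
    # p-triangle inequality applied to a and -b
    if p_power_sum(diff, p) > p_power_sum(a, p) + p_power_sum(b, p) + 1e-9:
        violations += 1
    # rescaled difference, mirroring delta = (x_1 - x_{-1}) / 2^{1/p}
    xi = [d / 2 ** (1.0 / p) for d in diff]
    if p_power_sum(xi, p) ** (1.0 / p) > R + 1e-9:
        violations += 1
print(violations)
```

The division by $2^{1/p}$ rather than by 2 is what compensates for the failure of the ordinary triangle inequality when p < 1.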
At the same time the combination of Theorem 7 and Theorem 6 shows that
with immediate extensions to $E_n(X_{p,m}(R))$ for all $R \ne 1$, $R > 0$. We conclude that
as was to be proven.
4 Basis Pursuit
The least-norm method of the previous section has two drawbacks. First it requires that one
know p; we prefer an algorithm which works for 0 < p ≤ 1. Second, if p < 1 the least-norm
problem invokes a nonconvex optimization procedure, and would be considered intractable. In
this section we correct both drawbacks.
This can be formulated as a linear programming problem: let A be the n by 2m matrix $[\Phi, -\Phi]$.
The linear program
\[ (LP) \qquad \min_z \mathbf{1}^T z \quad \text{subject to} \quad Az = y_n, \ z \ge 0, \]
has a solution $z^*$, say, a vector in $\mathbf{R}^{2m}$ which can be partitioned as $z^* = [u^* \ v^*]$; then $\theta^* =
u^* - v^*$ solves $(P_1)$. The reconstruction is $\hat{x}_{1,n} = \Psi\theta^*$. This linear program is typically considered
computationally tractable. In fact, this problem has been studied in the signal analysis literature
under the name Basis Pursuit [7]; in that work, very large-scale underdetermined problems,
e.g. with n = 8192 and m = 262144, were solved successfully using interior-point optimization
methods.
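The correspondence between $(P_1)$ and the linear program can be sketched in a few lines of Python. The example does not solve the program; it only checks, for one hypothetical matrix `Phi` and coefficient vector `theta` (both assumptions), that splitting θ into positive and negative parts gives a feasible z whose objective value equals $\|\theta\|_1$.

```python
def matvec(M, v):
    return [sum(a * b for a, b in zip(row, v)) for row in M]

def split_signed(theta):
    """theta = u - v with u, v >= 0 and disjoint supports."""
    u = [max(t, 0.0) for t in theta]
    v = [max(-t, 0.0) for t in theta]
    return u, v

def stack_phi(Phi):
    """A = [Phi, -Phi], an n-by-2m matrix."""
    return [row + [-a for a in row] for row in Phi]

# Hypothetical 2-by-4 example (not from the paper).
Phi = [[1.0, 0.0, 0.5, -0.5],
       [0.0, 1.0, 0.5, 0.5]]
theta = [0.0, -2.0, 3.0, 0.0]

u, v = split_signed(theta)
z = u + v                      # concatenation [u, v], a vector of length 2m
A = stack_phi(Phi)
y = matvec(Phi, theta)

print(matvec(A, z), y)         # A z reproduces y = Phi theta
print(sum(z), sum(abs(t) for t in theta))   # objective 1^T z equals ||theta||_1
```

Conversely, any feasible z = [u, v] yields θ = u − v with Φθ = y and $\|\theta\|_1 \le \mathbf{1}^T z$, which is why minimizing the linear objective recovers a minimizer of the $\ell_1$ norm. In practice one would pass A, y, and the all-ones cost vector to a linear programming solver.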
As far as performance goes, we already know that this procedure is near-optimal in case
p = 1; from (3.2):
Corollary 4.1 Suppose that $I_n$ is an information operator achieving, for some C > 0,
In particular, we have a universal algorithm for dealing with any class $X_{1,m}(R)$ – i.e. any Ψ,
any m, any R. First, apply a near-optimal information operator; second, reconstruct by Basis
Pursuit. The result obeys
for C a constant depending at most on log(m)/ log(n). The inequality can be interpreted as
follows. Fix ε > 0. Suppose the unknown object x is known to be highly compressible, say
obeying the a priori bound $\|\theta\|_1 \le c m^{\alpha}$, α < 1/2. For any such object, rather than making m
measurements, we only need to make $n \sim C_\epsilon \cdot m^{2\alpha} \log(m)$ measurements, and our reconstruction
obeys:
\[ \|x - \hat{x}_{1,n}\|_2 \le \epsilon \cdot \|x\|_2. \]
While the case p = 1 is already significant and interesting, the case 0 < p < 1 is of interest
because it corresponds to data which are more highly compressible and for which our interest
in achieving the performance indicated by Theorem 1 is even greater. Later in this section we
extend the same interpretation of $\hat{x}_{1,n}$ to performance over $X_{p,m}(R)$ throughout p < 1.
where of course $\|\theta\|_0$ is just the number of nonzeros in θ. Again, since the work of Peetre and
Sparr the importance of $\ell_0$ and the relation with $\ell_p$ for 0 < p < 1 is well-understood; see [1] for
more detail.
Ordinarily, solving such a problem involving the $\ell_0$ norm requires combinatorial optimization;
one enumerates all sparse subsets of {1, . . . , m} searching for one which allows a solution Φθ = x.
However, when $(P_0)$ has a sparse solution, $(P_1)$ will find it.
Theorem 8 Suppose that Φ satisfies CS1-CS3 with given positive constants ρ, $(\eta_i)$. There is
a constant $\rho_0 > 0$ depending only on ρ and $(\eta_i)$ and not on n or m so that, if θ has at most
$\rho_0 n / \log(m)$ nonzeros, then $(P_0)$ and $(P_1)$ both have the same unique solution.
In words, although the system of equations is massively underdetermined, $\ell_1$ minimization and
sparse solution coincide when the result is sufficiently sparse.
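A toy instance illustrates the coincidence of sparse solution and $\ell_1$ minimization. The Python sketch below is a brute-force check for n = 2 only, not the interior-point method used for Basis Pursuit: it relies on the standard fact that a linear program attains its optimum at a basic solution, so for a rank-2 system it suffices to examine every pair of columns. The four unit-norm columns and the 1-sparse vector `theta0` are hypothetical choices.

```python
import itertools
import math

def solve_2x2(c1, c2, y):
    """Solve a*c1 + b*c2 = y by Cramer's rule; None if the columns are dependent."""
    det = c1[0] * c2[1] - c2[0] * c1[1]
    if abs(det) < 1e-12:
        return None
    a = (y[0] * c2[1] - c2[0] * y[1]) / det
    b = (c1[0] * y[1] - y[0] * c1[1]) / det
    return a, b

def min_l1_solution(cols, y):
    """Brute-force minimizer of ||theta||_1 subject to Phi theta = y (n = 2 only)."""
    best = None
    m = len(cols)
    for i, j in itertools.combinations(range(m), 2):
        sol = solve_2x2(cols[i], cols[j], y)
        if sol is None:
            continue
        theta = [0.0] * m
        theta[i], theta[j] = sol
        l1 = abs(sol[0]) + abs(sol[1])
        if best is None or l1 < best[0] - 1e-12:
            best = (l1, theta)
    return best

s = 1.0 / math.sqrt(2.0)
cols = [(1.0, 0.0), (0.0, 1.0), (s, s), (s, -s)]   # unit-norm columns of a 2-by-4 Phi
theta0 = [0.0, 0.0, 1.0, 0.0]                      # 1-sparse coefficient vector
y = (s, s)                                         # y = Phi theta0

l1_value, theta_hat = min_l1_solution(cols, y)
print(l1_value, theta_hat)
```

In this instance the minimum-$\ell_1$ solution found by enumeration is the 1-sparse vector itself, even though the system has two equations and four unknowns. The example says nothing about the constants in Theorem 8; it only shows the phenomenon on a small case.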
There is by now an extensive literature exhibiting results on equivalence of $\ell_1$ and $\ell_0$
minimization [14, 20, 47, 48, 21]. In the first literature on this subject, equivalence was found
under conditions involving sparsity constraints allowing $O(n^{1/2})$ nonzeros. While it may seem
surprising that any results of this kind are possible, the sparsity constraint $\|\theta\|_0 = O(n^{1/2})$
is, ultimately, disappointingly small. A major breakthrough was the contribution of Candès,
Romberg, and Tao (2004) which studied the matrices built by taking n rows at random from
an m by m Fourier matrix and gave an order $O(n/\log(n))$ bound, showing that dramatically
weaker sparsity conditions were needed than the $O(n^{1/2})$ results known previously. In [17], it
was shown that for ‘nearly all’ n by m matrices with n < m < An, equivalence held for ≤ ρn
nonzeros, ρ = ρ(A). The above result says effectively that for ‘nearly all’ n by m matrices with
$m \le A n^{\gamma}$, equivalence held up to $O(\rho n/\log(n))$ nonzeros, where ρ = ρ(A, γ).
Our argument, in parallel with [17], shows that the nullspace Φβ = 0 has a very special
structure for Φ obeying the conditions in question. When θ is sparse, the only element in a
given affine subspace θ + nullspace(Φ) which can have small $\ell_1$ norm is θ itself.
To prove Theorem 8, we first need a lemma about the non-sparsity of elements in the
nullspace of Φ. Let J ⊂ {1, . . . , m} and, for a given vector $\beta \in \mathbf{R}^m$, let $1_J\beta$ denote the mutilated
vector with entries $\beta_i 1_J(i)$. Define the concentration
\[ \nu(\Phi, J) = \sup\left\{ \frac{\|1_J\beta\|_1}{\|\beta\|_1} : \Phi\beta = 0 \right\} \]
This measures the fraction of $\ell_1$ norm which can be concentrated on a certain subset for a vector
in the nullspace of Φ. This concentration cannot be large if |J| is small.
Lemma 4.1 Suppose that Φ satisfies CS1-CS3, with constants $\eta_i$ and ρ. There is a constant $\eta_0$
depending on the $\eta_i$ so that if J ⊂ {1, . . . , m} satisfies
\[ |J| \le \rho_1 n / \log(m), \quad \rho_1 \le \rho, \]
then
\[ \nu(\Phi, J) \le \eta_0 \cdot \rho_1^{1/2}. \]
Proof. This is a variation on the argument for Theorem 7. Let β ∈ nullspace(Φ). Assume
without loss of generality that J is the most concentrated subset of cardinality $|J| < \rho n / \log(m)$,
and that the columns of Φ are numbered so that J = {1, . . . , |J|}; partition $\beta = [\beta_J, \beta_{J^c}]$. We
again consider $v = \Phi_J\beta_J$, and have $-v = \Phi_{J^c}\beta_{J^c}$. We again invoke CS2-CS3, getting
\[ \|v\|_2 \le \eta_2\eta_3 \cdot (\sqrt{n}/\sqrt{\log(m/n)}) \cdot \|\theta_{J^c}\|_1 \le c_1 \cdot (n/\log(m))^{1/2-1/p}, \]
We first show that if ν(Φ, J) < 1/2, θ is the only minimizer of $(P_1)$. Suppose that θ′ is a
solution to $(P_1)$, obeying
\[ \|\theta'\|_1 \le \|\theta\|_1. \]
Then θ′ = θ + β where β ∈ nullspace(Φ). We have
Suppose now that ν < 1/2. Then 2ν(Φ, J) − 1 < 0 and we have
\[ \|\beta\|_1 \le 0, \]
i.e. θ = θ′.
Now recall the constant $\eta_0 > 0$ of Lemma 4.1. Define $\rho_0$ so that $\eta_0\sqrt{\rho_0} < 1/2$ and $\rho_0 \le \rho$.
Lemma 4.1 shows that $|J| \le \rho_0 n / \log(m)$ implies $\nu(\Phi, J) \le \eta_0\rho_0^{1/2} < 1/2$. QED.
Theorem 9 Suppose that Φ satisfies CS1-CS3 with constants $\eta_i$ and ρ. There is $C = C(p, (\eta_i), \rho, A, \gamma)$
so that the solution $\hat{\theta}_{1,n}$ to $(P_1)$ obeys
The proof requires an $\ell_1$ stability lemma, showing the stability of $\ell_1$ minimization under
small perturbations as measured in $\ell_1$ norm. For $\ell_2$ and $\ell_\infty$ stability lemmas, see [16, 48, 18];
however, note that those lemmas do not suffice for our needs in this proof.
Lemma 4.2 Let θ be a vector in $\mathbf{R}^m$ and $1_J\theta$ be the corresponding mutilated vector with entries
$\theta_i 1_J(i)$. Suppose that
\[ \|\theta - 1_J\theta\|_1 \le \epsilon, \]
where $\nu(\Phi, J) \le \nu_0 < 1/2$. Consider the instance of $(P_1)$ defined by x = Φθ; the solution $\hat{\theta}_1$ of
this instance of $(P_1)$ obeys
\[ \|\theta - \hat{\theta}_1\|_1 \le \frac{2\epsilon}{1 - 2\nu_0}. \qquad (4.1) \]
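Proof of Lemma. Put for short $\hat{\theta} \equiv \hat{\theta}_1$, and set $\beta = \theta - \hat{\theta} \in$ nullspace(Φ). By definition of ν,
while
\[ \|\beta_{J^c}\|_1 \le \|\theta_{J^c}\|_1 + \|\hat{\theta}_{J^c}\|_1. \]
As $\hat{\theta}$ solves $(P_1)$,
\[ \|\hat{\theta}_J\|_1 + \|\hat{\theta}_{J^c}\|_1 \le \|\theta\|_1, \]
and of course
\[ \|\theta_J\|_1 - \|\hat{\theta}_J\|_1 \le \|1_J(\theta - \hat{\theta})\|_1. \]
Hence
\[ \|\hat{\theta}_{J^c}\|_1 \le \|1_J(\theta - \hat{\theta})\|_1 + \|\theta_{J^c}\|_1. \]
Finally,
\[ \|1_J(\theta - \hat{\theta})\|_1 = \|\beta_J\|_1 \le \nu_0 \cdot \|\beta\|_1 = \nu_0 \cdot \|\theta - \hat{\theta}\|_1. \]
Combining the above, setting $\delta \equiv \|\theta - \hat{\theta}\|_1$ and $\epsilon \equiv \|\theta_{J^c}\|_1$, we get
\[ \delta \le (\nu_0\delta + 2\epsilon)/(1 - \nu_0), \]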
5 Immediate Extensions
Before continuing, we mention two immediate extensions to these results of interest below and
elsewhere.
\[ \|x\|_2 = \|\theta(x)\|_2, \]
and the reconstruction formula $x = \Psi\theta(x)$. In fact, Theorems 7 and 9 only need the Parseval
relation in the proof. Hence the same results hold without change when the relation between x
and θ involves a tight frame. In particular, if Φ is an n × m matrix satisfying CS1-CS3, then
$I_n = \Phi\Psi^T$ defines a near-optimal information operator on $\mathbf{R}^{m'}$, and solution of the optimization
problem
\[ (L_1) \qquad \min_x \|\Psi^T x\|_1 \quad \text{subject to} \quad I_n(x) = y_n, \]
defines a near-optimal reconstruction algorithm $\hat{x}_1$.
\[ |\theta|_{(N)} \le R \cdot N^{-1/p}, \quad N = 1, 2, 3, \dots. \]
Conversely, for a given θ, the smallest R for which these inequalities all hold is defined to be the
norm: $\|\theta\|_{w\ell_p} \equiv R$. The “weak” moniker derives from $\|\theta\|_{w\ell_p} \le \|\theta\|_p$. Weak $\ell_p$ constraints have
the following key property: if $\theta_N$ denotes the vector θ with all except the N largest items set
to zero, then the inequality
is valid for p < 1 and q = 1, 2, with $R = \|\theta\|_{w\ell_p}$. In fact, Theorems 7 and 9 used (5.1) in the
proof, together with (implicitly) $\|\theta\|_p \ge \|\theta\|_{w\ell_p}$. Hence we can state results for spaces $Y_{p,m}(R)$
defined using only weak-$\ell_p$ norms, and the proofs apply without change.
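The weak-$\ell_p$ norm can be computed directly from its definition as the smallest R satisfying $|\theta|_{(N)} \le R \cdot N^{-1/p}$ for all N. The following Python sketch does so for a hypothetical power-law sequence $\theta_k = k^{-1/p}$ with p = 0.5; the sequence, the dimension, and the helper names are assumptions chosen to show how much smaller the weak norm can be than the $\ell_p$ quasi-norm.

```python
def lp_quasinorm(theta, p):
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def weak_lp_norm(theta, p):
    """Smallest R with |theta|_(N) <= R * N^(-1/p) for every rank N."""
    mags = sorted((abs(t) for t in theta), reverse=True)   # decreasing rearrangement
    return max((N ** (1.0 / p)) * a for N, a in enumerate(mags, start=1))

p = 0.5
m = 1000
theta = [k ** (-1.0 / p) for k in range(1, m + 1)]   # hypothetical coefficients
weak = weak_lp_norm(theta, p)
strong = lp_quasinorm(theta, p)
print(weak, strong)
```

For this sequence the weak norm stays at 1 for every m, whereas the $\ell_p$ quasi-norm grows like a power of log(m). Results stated under weak-$\ell_p$ constraints therefore cover a strictly larger class of objects.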
6 Stylized Applications
We sketch three potential applications of the above abstract theory.
\[ \|f - P_{j_1} f\|_2 \le C \cdot \|f\|_B \cdot N^{-1/2}. \]
In the compressed sensing scheme, we need also wavelets $\psi_{j,k} = 2^{j/2}\psi(2^j x - k)$ where ψ is an
oscillating function with mean zero. We pick a coarsest scale $j_0 = j_1/2$. We measure the resumé
coefficients $\beta_{j_0,k}$ – there are $2^{j_0}$ of these – and then let $\theta \equiv (\theta_\ell)_{\ell=1}^m$ denote an enumeration of
the detail wavelet coefficients $((\alpha_{j,k} : 0 \le k < 2^j) : j_0 \le j < j_1)$. The dimension m of θ is
$m = 2^{j_1} - 2^{j_0}$. The norm
\[ \|\theta\|_1 \le \sum_{j_0}^{j_1} \|(\alpha_{j,k})\|_1 \le C \cdot \|f\|_B \cdot \sum_{j_0}^{j_1} 2^{-j} \le cB2^{-j_0}. \]
We take $n = c \cdot 2^{j_0}\log(2^{j_1})$ and apply a near-optimal information operator for this n and m
(described in more detail below). We apply the near-optimal algorithm of $\ell_1$ minimization,
getting the error estimate
\[ \hat{f} = \sum_k \beta_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \hat{\alpha}_{j,k}\psi_{j,k} \]
has error
again with c independent of f ∈ F(B). This is of the same order of magnitude as the error of
linear sampling.
The compressed sensing scheme takes a total of $2^{j_0}$ samples of resumé coefficients and
$n \le c2^{j_0}\log(2^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1 \cdot 2^{j_1/2}$ pieces
of information. It achieves error comparable to classical sampling based on $2^{j_1}$ samples. Thus it
needs dramatically fewer samples for comparable accuracy: roughly speaking, only the square
root of the number of samples of linear sampling.
To achieve this dramatic reduction in sampling, we need an information operator based on
some Φ satisfying CS1-CS3. The underlying measurement kernels will be of the form
\[ \xi_i = \sum_{\ell=1}^{m} \Phi_{i,\ell}\,\phi_\ell, \quad i = 1, \dots, n, \qquad (6.1) \]
$0 \le k_1, k_2 < 2^{j_1}$ indexing position. We use the Haar scaling function $\varphi_{j_1,k} = 2^{j_1} \cdot 1_{\{2^{j_1}x - k \in [0,1]^2\}}$.
We reconstruct by $P_{j_1}f = \sum_k \beta_{j_1,k}\varphi_{j_1,k}$ giving an approximation error
\[ \|f - P_{j_1}f\|_2 \le 4 \cdot B \cdot 2^{-j_1/2}. \]
There are $N = 4^{j_1}$ coefficients $\beta_{j_1,k}$ associated with the unit square, and so the approximation
error obeys:
\[ \|f - P_{j_1}f\|_2 \le c \cdot B \cdot N^{-1/4}. \]
In the compressed sensing scheme, we need also Haar wavelets $\psi^\sigma_{j_1,k} = 2^{j_1}\psi^\sigma(2^{j_1}x - k)$
where $\psi^\sigma$ is an oscillating function with mean zero which is either horizontally-oriented (σ = v),
vertically oriented (σ = h) or diagonally-oriented (σ = d). We pick a ‘coarsest scale’ $j_0 = j_1/2$,
and measure the resumé coefficients $\beta_{j_0,k}$ – there are $4^{j_0}$ of these. Then let $\theta \equiv (\theta_\ell)_{\ell=1}^m$ be the
concatenation of the detail wavelet coefficients $((\alpha^\sigma_{j,k} : 0 \le k_1, k_2 < 2^j, \sigma \in \{h, v, d\}) : j_0 \le j < j_1)$.
The dimension m of θ is $m = 4^{j_1} - 4^{j_0}$. The norm
\[ \|\theta\|_1 \le \sum_{j=j_0}^{j_1} \|(\theta^{(j)})\|_1 \le c(j_1 - j_0)\|f\|_{BV}. \]
We take $n = c \cdot 4^{j_0}\log^2(4^{j_1})$ and apply a near-optimal information operator for this n and m.
We apply the near-optimal algorithm of $\ell_1$ minimization to the resulting information, getting
the error estimate
\[ \|\hat{\theta} - \theta\|_2 \le c\|\theta\|_1 \cdot (n/\log(m))^{-1/2} \le cB \cdot 2^{-j_1}, \]
with c independent of f ∈ F(B). The overall reconstruction
\[ \hat{f} = \sum_k \beta_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \hat{\alpha}_{j,k}\psi_{j,k} \]
has error
again with c independent of f ∈ F(B). This is of the same order of magnitude as the error of
linear sampling.
The compressed sensing scheme takes a total of $4^{j_0}$ samples of resumé coefficients and $n \le
c4^{j_0}\log^2(4^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1^2 \cdot 4^{j_1/2}$ pieces of
measured information. It achieves error comparable to classical sampling with $4^{j_1}$ samples.
Thus just as we have seen in the Bump Algebra case, we need dramatically fewer samples for
comparable accuracy: roughly speaking, only the square root of the number of samples of linear
sampling.
images are cartoons – well-behaved except for discontinuities inside regions with nice curvilinear
boundaries. They might be reasonable models for certain kinds of technical imagery – e.g. in
radiology.
The curvelets tight frame [3] is a collection of smooth frame elements offering a Parseval
relation
\[ \|f\|_2^2 = \sum_\mu |\langle f, \gamma_\mu \rangle|^2 \]
and reconstruction formula
\[ f = \sum_\mu \langle f, \gamma_\mu \rangle \gamma_\mu. \]
The frame elements have a multiscale organization, and frame coefficients $\theta^{(j)}$ grouped by scale
obey the weak $\ell_p$ constraint
\[ \|\theta^{(j)}\|_{w\ell_{2/3}} \le c(B, L), \quad f \in C^{2,2}(B, L); \]
compare [3]. For such objects, classical linear sampling at scale $j_1$ by smooth 2-D scaling
functions gives
\[ \|f - P_{j_1}f\|_2 \le c \cdot B \cdot 2^{-j_1/2}, \quad f \in C^{2,2}(B, L). \]
This is no better than the performance of linear sampling for the BV case, despite the piecewise
$C^2$ character of f; the possible discontinuities in f are responsible for the inability of linear
sampling to improve its performance over $C^{2,2}(B, L)$ compared to BV.
In the compressed sensing scheme, we pick a coarsest scale $j_0 = j_1/4$. We measure the resumé coefficients $\beta_{j_0,k}$ in a smooth wavelet expansion – there are $4^{j_0}$ of these – and then let $\theta \equiv (\theta_\ell)_{\ell=1}^m$ denote a concatenation of the finer-scale curvelet coefficients $(\theta^{(j)} : j_0 \le j < j_1)$. The dimension $m$ of $\theta$ is $m = c(4^{j_1} - 4^{j_0})$, with $c > 1$ due to overcompleteness of curvelets. The norm
\[ \|\theta\|_{w\ell^{2/3}} \le \Big( \sum_{j_0}^{j_1} \|(\theta^{(j)})\|_{w\ell^{2/3}}^{2/3} \Big)^{3/2} \le c (j_1 - j_0)^{3/2}, \]
with $c$ depending on $B$ and $L$. We take $n = c \cdot 4^{j_0} \log^{5/2}(4^{j_1})$ and apply a near-optimal information operator for this $n$ and $m$. We apply the near-optimal algorithm of $\ell^1$ minimization to the resulting information, getting the error estimate
\[ \|\hat\theta - \theta\|_2 \le c \|\theta\|_{2/3} \cdot (n/\log(m))^{-1} \le c' \cdot 2^{-j_1/2}, \]
with $c'$ absolute. The overall reconstruction
\[ \hat f = \sum_k \beta_{j_0,k} \varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \sum_{\mu \in M_j} \hat\theta_\mu^{(j)} \gamma_\mu \]
has error
\[ \|f - \hat f\|_2 \le \|f - P_{j_1} f\|_2 + \|P_{j_1} f - \hat f\|_2 = \|f - P_{j_1} f\|_2 + \|\hat\theta - \theta\|_2 \le c\,2^{-j_1/2}, \]
again with $c$ independent of $f \in C^{2,2}(B, L)$. This is of the same order of magnitude as the error of linear sampling.
The compressed sensing scheme takes a total of $4^{j_0}$ samples of resumé coefficients and $n \le c\,4^{j_0} \log^{5/2}(4^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1^{5/2} \cdot 4^{j_1/4}$ pieces of information. It achieves error comparable to classical sampling based on $4^{j_1}$ samples. Thus, even more so than in the Bump Algebra case, we need dramatically fewer samples for comparable accuracy: roughly speaking, only the fourth root of the number of samples of linear sampling.
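To get a feel for these counts, the following sketch tabulates the total number of compressed sensing measurements against the $4^{j_1}$ samples of linear sampling. It is only an illustration: the constant $c$ is not specified in the text, so the code assumes $c = 1$, and the helper names `cs_sample_count` and `classical_sample_count` are hypothetical. The arguments `scale_ratio` and `log_power` encode $j_0 = j_1/\texttt{scale\_ratio}$ and the power of the logarithm (4 and 5/2 for the curvelet scheme; 2 and 2 reproduce the $j_1^2 \cdot 4^{j_1/2}$ total quoted for the BV scheme).

```python
import math

def cs_sample_count(j1, scale_ratio, log_power, c=1.0):
    """Resume samples 4^{j0} plus n = c * 4^{j0} * log(4^{j1})**log_power detail samples.

    c = 1 is an assumption made for illustration; the text leaves c unspecified.
    """
    j0 = j1 / scale_ratio
    resume = 4.0 ** j0
    log_term = (j1 * math.log(4.0)) ** log_power  # log(4^{j1}) = j1 * log(4)
    return resume + c * resume * log_term

def classical_sample_count(j1):
    """Linear sampling at scale j1 uses 4^{j1} samples."""
    return 4.0 ** j1

for j1 in (4, 8, 12, 16, 20):
    curvelet = cs_sample_count(j1, scale_ratio=4, log_power=2.5)
    bv = cs_sample_count(j1, scale_ratio=2, log_power=2)
    print(j1, classical_sample_count(j1), round(bv), round(curvelet))
```

With $c = 1$ the advantage is asymptotic: at a coarse scale such as $j_1 = 4$ the logarithmic factor outweighs the saving, while at finer scales the compressed sensing count falls far below $4^{j_1}$.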
7 Nearly All Matrices are CS-Matrices
We may reformulate Theorem 6 as follows.
Theorem 10 Let $n, m_n$ be a sequence of problem sizes with $n \to \infty$, $n < m \sim A n^\gamma$, for $A > 0$ and $\gamma > 1$. Let $\Phi = \Phi_{n,m}$ be a matrix with $m$ columns drawn independently and uniformly at random from $S^{n-1}$. Then for some $\eta_i > 0$ and $\rho > 0$, conditions CS1-CS3 hold for $\Phi$ with overwhelming probability for all large $n$.
Indeed, note that the probability measure on $\Phi_{n,m}$ induced by sampling columns iid uniform on $S^{n-1}$ is exactly the natural uniform measure on $\Phi_{n,m}$. Hence Theorem 6 follows immediately from Theorem 10.
In effect matrices Φ satisfying the CS conditions are so ubiquitous that it is reasonable to
generate them by sampling at random from a uniform probability distribution.
The proof of Theorem 10 is conducted over the next three subsections; it proceeds by studying events $\Omega_n^i$, $i = 1, 2, 3$, where $\Omega_n^1 \equiv \{\text{CS1 holds}\}$, etc. It will be shown that for parameters $\eta_i > 0$ and $\rho_i > 0$
\[ P(\Omega_n^i) \to 1, \quad n \to \infty; \]
then defining $\rho = \min_i \rho_i$ and $\Omega_n = \cap_i \Omega_n^i$, we have
\[ P(\Omega_n) \to 1, \quad n \to \infty. \]
Since, when $\Omega_n$ occurs, our random draw has produced a matrix obeying CS1-CS3 with parameters $\eta_i$ and $\rho$, this proves Theorem 10.
Lemma 7.1 Consider sequences of (n, mn ) with n ≤ mn ∼ Anγ . Define the event
P (Ωn,m,ρ,λ ) → 1, n → ∞.
The proof involves three ideas. First, that for a specific subset we get large deviations bounds
on the minimum eigenvalue.
Lemma 7.2 For $J \subset \{1, \dots, m\}$, let $\Omega_{n,J}$ denote the event that the minimum eigenvalue $\lambda_{\min}(\Phi_J^T \Phi_J)$ exceeds $\lambda < 1$. Then for sufficiently small $\rho_1 > 0$ there is $\beta_1 > 0$ so that for all $n > n_0$,
\[ P(\Omega_{n,J}^c) \le \exp(-n\beta_1), \]
uniformly in $|J| \le \rho_1 n$.
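The behaviour described in Lemma 7.2 can be explored numerically. The sketch below draws a matrix with columns iid uniform on $S^{n-1}$ (by normalizing Gaussian vectors) and records the smallest value of $\lambda_{\min}(\Phi_J^T \Phi_J)$ seen over a number of randomly chosen subsets $J$. The sizes $n = 200$, $m = 800$, $|J| = 5$ and the function names are assumptions chosen for illustration; a Monte Carlo check over random subsets is of course no substitute for the large deviations bound itself.

```python
import numpy as np

def random_sphere_columns(n, m, rng):
    """Return an n-by-m matrix whose columns are iid uniform on the unit sphere S^{n-1}."""
    g = rng.standard_normal((n, m))
    # Normalizing a standard Gaussian vector gives a uniformly distributed direction.
    return g / np.linalg.norm(g, axis=0, keepdims=True)

def worst_min_eigenvalue(phi, k, trials, rng):
    """Smallest lambda_min(Phi_J^T Phi_J) observed over `trials` random subsets J with |J| = k."""
    m = phi.shape[1]
    worst = np.inf
    for _ in range(trials):
        cols = rng.choice(m, size=k, replace=False)
        gram = phi[:, cols].T @ phi[:, cols]
        # eigvalsh returns eigenvalues in ascending order, so index 0 is the minimum.
        worst = min(worst, float(np.linalg.eigvalsh(gram)[0]))
    return worst

rng = np.random.default_rng(0)
phi = random_sphere_columns(n=200, m=800, rng=rng)
worst = worst_min_eigenvalue(phi, k=5, trials=200, rng=rng)
print(f"smallest observed minimum eigenvalue: {worst:.3f}")
```

Because the Gram matrix has unit diagonal, its minimum eigenvalue can never exceed 1; the point of the lemma is that for small $|J|/n$ it stays bounded away from 0 with overwhelming probability.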
This was derived in [17] and in [18], using the concentration of measure property of singular
values of random matrices, eg. see Szarek’s work [44, 45].
Second, we note that the event of main interest is representable as:
Lemma 7.3 Suppose we have events $\Omega_{n,J}$ all obeying, for some fixed $\beta > 0$ and $\rho > 0$,
\[ P(\Omega_{n,J}^c) \le \exp(-n\beta), \]
for each $J \subset \{1, \dots, m\}$ with $|J| \le \rho n$. Pick $\rho_1 > 0$ with $\rho_1 < \min(\beta, \rho)$ and $\beta_1 > 0$ with $\beta_1 < \beta - \rho_1$. Then for all sufficiently large $n$,
the last inequality following because each member $J \in \mathcal{J}$ is of cardinality $\le \rho n$, since $\rho_1 n/\log(m) < \rho n$, as soon as $n \ge 3 > e$. Also, of course,
\[ \log \binom{m}{k} \le k \log(m), \]
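Definition 7.1 Let $|J| = k$. We say that $\Phi_J$ offers an $\epsilon$-isometry between $\ell^2(J)$ and $\ell^1_n$ if
\[ (1 - \epsilon) \cdot \|\alpha\|_2 \le \sqrt{\frac{\pi}{2n}} \cdot \|\Phi_J \alpha\|_1 \le (1 + \epsilon) \cdot \|\alpha\|_2, \quad \forall \alpha \in \mathbf{R}^k. \tag{7.1} \]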
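As a rough numerical illustration of (7.1), the sketch below evaluates the normalized ratio $\sqrt{\pi/(2n)}\,\|\Phi_J \alpha\|_1 / \|\alpha\|_2$ for many random directions $\alpha$ and reports its largest deviation from 1. The sizes $n = 2000$, $k = 5$ and the name `isometry_ratio` are assumptions made for the example. Since only finitely many random $\alpha$ are tried, the reported value is merely a lower estimate of the smallest $\epsilon$ for which (7.1) holds.

```python
import numpy as np

def isometry_ratio(phi_J, alpha):
    """Return sqrt(pi / (2n)) * ||Phi_J alpha||_1 / ||alpha||_2 for one vector alpha."""
    n = phi_J.shape[0]
    scale = np.sqrt(np.pi / (2 * n))
    return scale * np.linalg.norm(phi_J @ alpha, 1) / np.linalg.norm(alpha, 2)

rng = np.random.default_rng(1)
n, k = 2000, 5                                        # illustrative sizes
g = rng.standard_normal((n, k))
phi_J = g / np.linalg.norm(g, axis=0, keepdims=True)  # k columns uniform on S^{n-1}

ratios = np.array([isometry_ratio(phi_J, rng.standard_normal(k)) for _ in range(500)])
eps_hat = float(np.max(np.abs(ratios - 1.0)))
print(f"largest observed deviation from 1: {eps_hat:.3f}")
```

The factor $\sqrt{\pi/(2n)}$ is the natural normalization here: each entry of $\Phi_J \alpha$ is approximately $N(0, \|\alpha\|_2^2/n)$, whose mean absolute value is $\sqrt{2/(\pi n)}\,\|\alpha\|_2$, so the $n$ entries sum to about $\sqrt{2n/\pi}\,\|\alpha\|_2$.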
The following Lemma shows that condition CS2 is a generic property of matrices.
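Lemma 7.4 Consider the event $\Omega_n^2$ ($\equiv \Omega_n^2(\epsilon, \rho)$) that every $\Phi_J$ with $|J| \le \rho \cdot n/\log(m)$ offers an $\epsilon$-isometry between $\ell^2(J)$ and $\ell^1_n$. For each $\epsilon > 0$, there is $\rho(\epsilon) > 0$ so that
\[ P(\Omega_n^2) \to 1, \quad n \to \infty. \]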
To prove this, we first need a lemma about individual subsets J proven in [17].
to finish apply the individual Lemma 7.5 together with the combining principle in Lemma 7.3.
CS3 will follow if we can show that for every v ∈ Range(ΦI ), some approximation y to sgn(v)
satisfies |hy, φi i| ≤ 1 for i ∈ J c .
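Lemma 7.6 Simultaneous Sign-Pattern Embedding. Fix $\delta > 0$. Then for $\tau < \delta^2/32$, set
For sufficiently small $\rho_3 > 0$, there is an event $\Omega_n^3 \equiv \Omega_n^3(\rho_3, \delta)$ with $P(\Omega_n^3) \to 1$, as $n \to \infty$. On this event, for every subset $J$ with $|J| < \rho_3 n/\log(m)$, for every sign pattern $\sigma \in \Sigma_J$, there is a vector $y$ ($\equiv y_\sigma$) with
\[ \|y - \epsilon_n \sigma\|_2 \le \epsilon_n \cdot \delta \cdot \|\sigma\|_2, \tag{7.3} \]
and
\[ |\langle \phi_i, y \rangle| \le 1, \quad i \in J^c. \tag{7.4} \]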
In words, a small multiple $\epsilon_n \sigma$ of any sign pattern $\sigma$ almost lives in the dual ball $\{x : |\langle \phi_i, x \rangle| \le 1, i \in J^c\}$.
Before proving this result, we indicate how it gives the property CS3; namely, that $|J| < \rho_3 n/\log(m)$, and $v = -\Phi_{J^c} \beta_{J^c}$ imply
\[ \|\beta_{J^c}\|_1 \ge \eta_3 / \sqrt{\log(m/n)} \cdot \|v\|_1. \]
By the duality theorem for linear programming the value of the primal program
Lemma 7.6 gives us a supply of dual-feasible vectors and hence a lower bound on the dual
program. Take σ = sgn(v); we can find y which is dual-feasible and obeys
picking $\delta$ appropriately and taking into account the spherical sections theorem, for sufficiently large $n$ we have $\delta \|\sigma\|_2 \|v\|_2 \le \frac{1}{4} \|v\|_1$; (7.5) follows with $\eta_3 = 3/4$.
Lemma 7.7 Individual Sign-Pattern Embedding. Let $\sigma \in \{-1, 1\}^n$, let $y_0 = \epsilon_n \sigma$, with $\epsilon_n$, $m_n$, $\tau$, $\delta$ as in the statement of Lemma 7.6. Let $k \ge 0$. Given a collection $(\phi_i : 1 \le i \le m - k)$, there is an iterative algorithm, described below, producing a vector $y$ as output which obeys
\[ \|y - y_0\|_2 \le \delta \cdot \|y_0\|_2. \tag{7.8} \]
Lemma 7.7 will be proven in a section of its own. We now show that it implies Lemma 7.6.
We recall a standard implication of so-called Vapnik-Cervonenkis theory [42]:
\[ \#\Sigma_J \le \binom{n}{0} + \binom{n}{1} + \cdots + \binom{n}{|J|}. \]
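The size of this bound relative to the $2^n$ sign patterns available in principle is easy to tabulate. The short sketch below evaluates the right-hand side with exact integer arithmetic; the function name `sign_pattern_bound` is hypothetical, and the sizes in the loop are chosen only for illustration.

```python
from math import comb

def sign_pattern_bound(n, k):
    """Evaluate C(n, 0) + C(n, 1) + ... + C(n, k), the bound on #Sigma_J when |J| = k."""
    return sum(comb(n, i) for i in range(min(k, n) + 1))

for n, k in ((10, 2), (200, 5), (200, 20)):
    bound = sign_pattern_bound(n, k)
    print(n, k, bound, bound < 2 ** n)
```

When $|J|$ is small compared with $n$ the bound grows only polynomially in $n$, which is what keeps the union over sign patterns manageable in the argument that follows.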
Notice that if $|J| < \rho n/\log(m)$, then
\[ \log(\#\Sigma_J) \le \rho \cdot n + \log(n), \]
while also
\[ \log \#\{J : |J| < \rho n/\log(m), \ J \subset \{1, \dots, m\}\} \le \rho n. \]
Hence, the total number of sign patterns generated by operators $\Phi_J$ obeys
Now $\beta$ furnished by Lemma 7.7 is positive, so pick $\rho_3 < \beta/2$ with $\rho_3 > 0$. Define
where $\Omega_{\sigma,J}$ denotes the instance of the event (called $\Omega_\sigma$ in the statement of Lemma 7.7) generated by a specific $\sigma$, $J$ combination. On the event $\Omega_n^3$, every sign pattern associated with any $\Phi_J$ obeying $|J| < \rho_3 n/\log(m)$ is almost dual feasible. Now
QED.
7.4.2 Analysis Framework
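Also in [17] bounds were developed for two key descriptors of the algorithm trajectory:
\[ \alpha_\ell = \|y_\ell - y_{\ell-1}\|_2, \]
and
\[ |I_\ell| = \#\{i : |\langle \phi_i, y_\ell \rangle| > 1 - 2^{-\ell-1}\}. \]
We adapt the arguments deployed there. We define bounds $\delta_\ell \equiv \delta_{\ell;n}$ and $\nu_\ell \equiv \nu_{\ell;n}$ for $\alpha_\ell$ and $|I_\ell|$, of the form
\[ \delta_{\ell;n} = \|y_0\|_2 \cdot \omega^\ell, \quad \ell = 1, 2, \dots, \]
\[ \nu_{\ell;n} = n \cdot \lambda_0 \cdot \epsilon_n^2 \cdot \omega^{2\ell+2}/4, \quad \ell = 0, 1, 2, \dots; \]
here $\lambda_0 = 1/2$ and $\omega = \min(1/2, \delta/2, \omega_0)$, where $\omega_0 > 0$ will be defined later below. We define sub-events
\[ E_\ell = \{\alpha_j \le \delta_j, \ j = 1, \dots, \ell, \ |I_j| \le \nu_j, \ j = 0, \dots, \ell - 1\}; \]
Now define
\[ \Omega_\sigma = \cap_{\ell=1}^n E_\ell; \]
this event implies, because $\omega \le 1/2$,
\[ \|y - y_0\|_2 \le \Big( \sum_\ell \alpha_\ell^2 \Big)^{1/2} \le \|y_0\|_2 \cdot \omega/(1 - \omega^2)^{1/2} \le \|y_0\|_2 \cdot \delta. \]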
This implies
\[ P(\Omega_\sigma^c) \le 2n \exp\{-\beta n\}, \]
and the Lemma follows. QED
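The geometric-series step behind the bound $\|y - y_0\|_2 \le \|y_0\|_2 \cdot \delta$ can be spelled out as follows. This is a worked check using only the definitions $\delta_{\ell;n} = \|y_0\|_2 \cdot \omega^\ell$ and $\omega \le \min(1/2, \delta/2)$ stated above, not a restatement of the argument in [17].

```latex
\begin{align*}
\sum_{\ell \ge 1} \alpha_\ell^2
  &\le \sum_{\ell \ge 1} \delta_\ell^2
   = \|y_0\|_2^2 \sum_{\ell \ge 1} \omega^{2\ell}
   = \|y_0\|_2^2 \cdot \frac{\omega^2}{1 - \omega^2}, \\
\frac{\omega}{(1 - \omega^2)^{1/2}}
  &\le \frac{\omega}{(3/4)^{1/2}}
   = \frac{2}{\sqrt{3}}\,\omega
   \le \frac{\delta}{\sqrt{3}}
   < \delta .
\end{align*}
```

The first line uses $\alpha_\ell \le \delta_\ell$ on the event $\Omega_\sigma$; the second uses $\omega \le 1/2$ to bound the denominator and $\omega \le \delta/2$ to finish.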
With this choice of $\omega_0$, when the event $E_\ell$ occurs, $\lambda_{\min}(\Phi_{I_\ell}^T \Phi_{I_\ell}) > \lambda_0$. Also on $E_\ell$, $u_j = 2^{-j-1}/\alpha_j > 2^{-j-1}/\delta_j = v_j$ (say) for $j \le \ell$.
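In [17] an analysis framework was developed in which a family $(Z_i^\ell : 1 \le i \le m, 0 \le \ell < n)$ of random variables iid $N(0, \frac{1}{n})$ was introduced, and it was shown that
\[ P\{G_\ell^c \,|\, E_\ell\} \le P\Big\{ \sum_i 1_{\{|Z_i^\ell| > v_\ell\}} > \nu_\ell \Big\}, \]
and
\[ P\{F_{\ell+1}^c \,|\, G_\ell, E_\ell\} \le P\Big\{ 2 \cdot \lambda_0^{-1} \Big[ \nu_\ell + \delta_\ell^2 \sum_i (Z_i^\ell)^2 1_{\{|Z_i^\ell| > v_\ell\}} \Big] > \delta_{\ell+1}^2 \Big\}. \]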
That paper also stated two simple Large Deviations bounds:
and
\[ \frac{1}{m} \log P\Big\{ \sum_{i=1}^{m-k} 1_{\{|Z_i| > t\}} > m\Delta \Big\} \le e^{-t^2/2} - \Delta/4. \]
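\[ \sum_{i=1}^{m-k} Z_i^2 1_{\{|Z_i| > \tau_\ell\}} > m\Delta_\ell, \]
and
\[ \Delta_\ell = \frac{n}{m} (\lambda_0 \delta_{\ell+1}^2/2 - \nu_\ell)/\delta_\ell^2. \]
We therefore have the inequality
\[ \frac{1}{m} \log P\{F_{\ell+1}^c \,|\, G_\ell, E_\ell\} \le e^{-\tau_\ell^2/4} - \Delta_\ell/4. \]
Now
\[ e^{-\tau_\ell^2/4} = e^{-[(16 \log(m/\tau n))/16] \cdot (2\omega)^{-2\ell}} = (\tau n/m)^{(2\omega)^{-2\ell}}, \]
and
\[ \Delta_\ell = \frac{n}{m} (\omega^2/4 - \omega^2/8) = \frac{n}{m}\, \omega^2/8. \]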
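A short substitution shows where the value $\omega^2/8$ comes from and why it is free of $\ell$. The sketch below writes $\epsilon_n$ for the scalar that multiplies $\sigma$ in $y_0$, so that $\|y_0\|_2^2 = n \epsilon_n^2$ because $\sigma \in \{-1, 1\}^n$, and it uses $\lambda_0 = 1/2$ together with the definitions of $\delta_{\ell;n}$ and $\nu_{\ell;n}$ given earlier.

```latex
\begin{align*}
\lambda_0\, \delta_{\ell+1}^2 / 2
  &= \tfrac{1}{4}\, n\, \epsilon_n^2\, \omega^{2\ell+2}, \\
\nu_\ell
  &= n\, \lambda_0\, \epsilon_n^2\, \omega^{2\ell+2} / 4
   = \tfrac{1}{8}\, n\, \epsilon_n^2\, \omega^{2\ell+2}, \\
\Delta_\ell
  &= \frac{n}{m} \cdot
     \frac{\left(\tfrac{1}{4} - \tfrac{1}{8}\right) n\, \epsilon_n^2\, \omega^{2\ell+2}}
          {n\, \epsilon_n^2\, \omega^{2\ell}}
   = \frac{n}{m} \cdot \frac{\omega^2}{8}.
\end{align*}
```

The factor $n \epsilon_n^2 \omega^{2\ell}$ cancels between numerator and denominator, which is exactly why no dependence on $\ell$ survives.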
Since $\omega \le 1/2$, the term of most concern in $(\tau n/m)^{(2\omega)^{-2\ell}}$ is at $\ell = 0$; the other terms are always better. Also $\Delta_\ell$ in fact does not depend on $\ell$. Focusing now on $\ell = 0$, we may write
Recalling that $\omega \le \delta/2$ and putting
\[ \beta \equiv \beta(\tau; \delta) = (\delta/2)^2/8 - \tau, \]
we get $\beta > 0$ for $\tau < \delta^2/32$, and
\[ P\{F_{\ell+1}^c \,|\, G_\ell, E_\ell\} \le \exp(-n\beta). \]
A similar analysis holds for the $G_\ell$’s. QED
8 Conclusion
8.1 Summary
We have described an abstract framework for compressed sensing of objects which can be represented as vectors $x \in \mathbf{R}^m$. We assume the $x$ of interest is a priori compressible so that $\|\Psi^T x\|_p \le R$ for a known basis or frame $\Psi$ and $p < 2$. Starting from an $n$ by $m$ matrix $\Phi$ with $n < m$ satisfying conditions CS1-CS3, and with $\Psi$ the matrix of an orthonormal basis or tight frame underlying $X_{p,m}(R)$, we define the information operator $I_n(x) = \Phi \Psi^T x$. Starting from the $n$ pieces of measured information $y_n = I_n(x)$ we reconstruct $x$ by solving
\[ (L_1) \qquad \min_x \|\Psi^T x\|_1 \quad \text{subject to} \quad y_n = I_n(x). \]
The proposed reconstruction rule uses convex optimization and is computationally tractable.
Also the needed matrices Φ satisfying CS1-CS3 can be constructed by random sampling from a
uniform distribution on the columns of Φ.
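The reconstruction rule $(L_1)$ can be tried out on a small synthetic problem. The sketch below makes several simplifying assumptions that are not part of the text: $\Psi$ is taken to be the identity (so $x = \theta$), the object is exactly sparse rather than merely compressible, the sizes $n = 40$, $m = 100$ and three nonzeros are arbitrary, and the linear program is handed to SciPy's general-purpose `linprog` solver. The function name `basis_pursuit` is our own label.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(phi, y):
    """Solve min ||theta||_1 subject to phi @ theta = y as a linear program.

    Uses the standard split theta = u - v with u, v >= 0, so that the
    objective sum(u) + sum(v) equals ||theta||_1 at the optimum.
    """
    n, m = phi.shape
    cost = np.ones(2 * m)
    a_eq = np.hstack([phi, -phi])
    res = linprog(cost, A_eq=a_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(0)
n, m, k = 40, 100, 3                         # illustrative sizes, not from the text
g = rng.standard_normal((n, m))
phi = g / np.linalg.norm(g, axis=0)          # columns uniform on S^{n-1}

theta = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
theta[support] = rng.standard_normal(k)      # a k-sparse object (Psi = identity assumed)

y = phi @ theta                              # the n pieces of measured information
theta_hat = basis_pursuit(phi, y)
print("reconstruction error:", np.linalg.norm(theta_hat - theta))
```

Splitting $\theta$ into positive and negative parts is one common way to express an $\ell^1$ objective as a linear program; for large problems a dedicated solver would usually be preferable to this generic formulation.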
We give error bounds showing that despite the apparent undersampling (n < m), good ac-
curacy reconstruction is possible for compressible objects, and we explain the near-optimality of
these bounds using ideas from Optimal Recovery and Information-Based Complexity. Examples
are sketched related to imaging and spectroscopy.
Let $0 < p \le 1$. For some $C = C(p, (\eta_i), \rho_1)$ and all $\theta \in b_{p,m}$ the solution $\hat\theta_{1,n}$ of $(P_1)$ obeys the estimate:
\[ \|\hat\theta_{1,n} - \theta\|_2 \le C \cdot (n/\log(m_n))^{1/2 - 1/p}. \]
In short, a different approach might exhibit operators Φ with good widths over `1 balls only,
and low concentration on ‘thin’ sets. Another way to see that the conditions CS1-CS3 can no
doubt be approached differently is to compare the results in [17, 18]; the second paper proves
results which partially overlap those in the first, by using a different technique.
8.3 The Partial Fourier Ensemble
We briefly discuss two recent articles which do not fit in the n-widths tradition followed here,
and so were not easy to cite earlier with due prominence.
First, and closest to our viewpoint, is the breakthrough paper of Candès, Romberg, and Tao [4]. This was discussed in Section 4.2 above; the result of [4] showed that $\ell^1$ minimization can be used to exactly recover sparse sequences from the Fourier transform at $n$ randomly-chosen frequencies, whenever the sequence has fewer than $\rho^* n/\log(n)$ nonzeros, for some $\rho^* > 0$. Second is the article of Gilbert et al. [25], which showed that a different nonlinear reconstruction algorithm can be used to recover an approximation to a vector in $\mathbf{R}^m$ which is nearly as good as the best $N$-term approximation in $\ell^2$ norm, using about $n = O(\log(m) \log(M) N)$ random but nonuniform samples in the frequency domain; here $M$ is (it seems) an upper bound on the norm of $\theta$.
These articles both point to the partial Fourier Ensemble, i.e. the collection of n×m matrices
made by sampling n rows out of the m × m Fourier matrix, as concrete examples of Φ working
within the CS framework; that is, generating near-optimal subspaces for Gel’fand n-widths, and
allowing $\ell^1$ minimization to reconstruct from such information for all $0 < p \le 1$.
Now [4] (in effect) proves that if $m_n \sim A n^\gamma$, then in the partial Fourier ensemble with uniform measure, the maximal concentration condition A1 (8.1) holds with overwhelming probability for large $n$ (for appropriate constants $\eta_1 < 1/2$, $\rho_1 > 0$). On the other hand, the results in [25]
seem to show that condition A2 (8.2) holds in the partial Fourier ensemble with overwhelming
probability for large n, when it is sampled with a certain non-uniform probability measure.
Although the two papers [4, 25] refer to different random ensembles of partial Fourier matrices,
both reinforce the idea that interesting relatively concrete families of operators can be developed
for compressed sensing applications. In fact, Emmanuel Candès has informed us of some recent
results he obtained with Terry Tao [6] indicating that, modulo polylog factors, A2 holds for the
uniformly sampled partial Fourier ensemble. This seems a very significant advance.
Acknowledgements
In spring 2004, Emmanuel Candès told DLD about his ideas for using the partial Fourier ensem-
ble in ‘undersampled imaging’; some of these were published in [4]; see also the presentation [5].
More recently, Candès informed DLD of the results in [6] we referred to above. It is a pleasure
to acknowledge the inspiring nature of these conversations. DLD would also like to thank Anna
Gilbert for telling him about her work [25] finding the B-best Fourier coefficients by nonadaptive
sampling, and to thank Candès for conversations clarifying Gilbert’s work.
References
[1] J. Bergh and J. Löfström (1976) Interpolation Spaces. An Introduction. Springer Verlag.
[2] E. J. Candès and DL Donoho (2000) Curvelets - a surprisingly effective nonadaptive representation for objects with edges. in Curves and Surfaces eds. C. Rabut, A. Cohen, and L.L. Schumaker. Vanderbilt University Press, Nashville TN.
[3] E. J. Candès and DL Donoho (2004) New tight frames of curvelets and optimal represen-
tations of objects with piecewise C 2 singularities. Comm. Pure and Applied Mathematics
LVII 219-266.
[4] E.J. Candès, J. Romberg and T. Tao. (2004) Robust Uncertainty Principles: Exact Signal
Reconstruction from Highly Incomplete Frequency Information. Manuscript.
[6] Candes, EJ. and Tao, T. (2004) Estimates for Fourier Minors, with Applications.
Manuscript.
[7] Chen, S., Donoho, D.L., and Saunders, M.A. (1999) Atomic Decomposition by Basis Pur-
suit. SIAM J. Sci Comp., 20, 1, 33-61.
[9] Cohen, A. DeVore, R., Petrushev P., Xu, H. (1999) Nonlinear Approximation and the space
BV (R2 ). Amer. J. Math. 121, 587-628.
[10] Donoho, DL, Vetterli, M. DeVore, R.A. Daubechies, I.C. (1998) Data Compression and
Harmonic Analysis. IEEE Trans. Information Theory. 44, 2435-2476.
[11] Donoho, DL (1993) Unconditional Bases are optimal bases for data compression and for
statistical estimation. Appl. Computational Harmonic analysis 1 100-115.
[12] Donoho, DL (1996) Unconditional Bases and bit-level compression. Appl. Computational
Harmonic analysis 3 388-392.
[13] Donoho, DL (2001) Sparse components of images and optimal atomic decomposition. Con-
structive Approximation 17 353-382.
[14] Donoho, D.L. and Huo, Xiaoming (2001) Uncertainty Principles and Ideal Atomic Decom-
position. IEEE Trans. Info. Thry. 47 (no.7), Nov. 2001, pp. 2845-62.
[15] Donoho, D.L. and Elad, Michael (2002) Optimally Sparse Representation from Overcomplete Dictionaries via $\ell^1$ norm minimization. Proc. Natl. Acad. Sci. USA March 4, 2003 100 5, 2197-2202.
[16] Donoho, D., Elad, M., and Temlyakov, V. (2004) Stable Recovery of Sparse Overcomplete Representations in the Presence of Noise. Submitted. URL: http://www-stat.stanford.edu/~donoho/Reports/2004.
[17] Donoho, D.L. (2004) For most large underdetermined systems of linear equations, the minimal $\ell^1$ solution is also the sparsest solution. Manuscript. Submitted. URL: http://www-stat.stanford.edu/~donoho/Reports/2004
[18] Donoho, D.L. (2004) For most underdetermined systems of linear equations, the minimal $\ell^1$-norm near-solution approximates the sparsest near-solution. Manuscript. Submitted. URL: http://www-stat.stanford.edu/~donoho/Reports/2004
[19] A. Dvoretsky (1961) Some results on convex bodies and Banach Spaces. Proc. Symp. on
Linear Spaces. Jerusalem, 123-160.
[20] M. Elad and A.M. Bruckstein (2002) A generalized uncertainty principle and sparse repre-
sentations in pairs of bases. IEEE Trans. Info. Thry. 49 2558-2567.
[21] J.J. Fuchs (2002) On sparse representation in arbitrary redundant bases. Manuscript.
[22] T. Figiel, J. Lindenstrauss and V.D. Milman (1977) The dimension of almost-spherical
sections of convex bodies. Acta Math. 139 53-94.
[23] S. Gal, C. Micchelli, Optimal sequential and non-sequential procedures for evaluating a
functional. Appl. Anal. 10 105-120.
[24] Garnaev, A.Y. and Gluskin, E.D. (1984) On widths of the Euclidean Ball. Soviet Mathe-
matics – Doklady 30 (in English) 200-203.
[26] R. Gribonval and M. Nielsen. Sparse Representations in Unions of Bases. To appear IEEE
Trans Info Thry.
[27] G. Golub and C. van Loan.(1989) Matrix Computations. Johns Hopkins: Baltimore.
[28] Boris S. Kashin (1977) Diameters of certain finite-dimensional sets in classes of smooth
functions. Izv. Akad. Nauk SSSR, Ser. Mat. 41 (2) 334-351.
[29] M.A. Kon and E. Novak The adaption problem for approximating linear operators Bull.
Amer. Math. Soc. 23 159-165.
[30] Thomas Kuhn (2001) A lower estimate for entropy numbers. Journal of Approximation
Theory, 110, 120-124.
[31] Michel Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and
Monographs 89. American Mathematical Society 2001.
[34] Melkman, A. A., Micchelli, C. A.; Optimal estimation of linear operators from inaccurate
data. SIAM J. Numer. Anal. 16; 1979; 87–105;
[35] A.A. Melkman (1980) n-widths of octahedra. in Quantitative Approximation eds. R.A.
DeVore and K. Scherer, 209-216, Academic Press.
[36] Micchelli, C.A. and Rivlin, T.J. (1977) A Survey of Optimal Recovery, in Optimal Es-
timation in Approximation Theory, eds. C.A. Micchelli, T.J. Rivlin, Plenum Press, NY,
1-54.
[37] V.D. Milman and G. Schechtman (1986) Asymptotic Theory of Finite-Dimensional Normed
Spaces. Lect. Notes Math. 1200, Springer.
[38] E. Novak (1996) On the power of Adaption. Journal of Complexity 12, 199-237.
[39] Pinkus, A. (1986) n-widths and Optimal Recovery in Approximation Theory, Proceeding
of Symposia in Applied Mathematics, 36, Carl de Boor, Editor. American Mathematical
Society, Providence, RI.
[40] Pinkus, A. (1985) n-widths in Approximation Theory. Springer-Verlag.
[41] G. Pisier (1989) The Volume of Convex Bodies and Banach Space Geometry. Cambridge
University Press.
[42] D. Pollard (1989) Empirical Processes: Theory and Applications. NSF - CBMS Regional
Conference Series in Probability and Statistics, Volume 2, IMS.
[43] Carsten Schutt (1984) Entropy numbers of diagonal operators between symmetric banach
spaces. Journal of Approximation Theory, 40, 121-128.
[44] Szarek, S.J. (1990) Spaces with large distances to $\ell^n_\infty$ and random matrices. Amer. Jour. Math. 112, 819-842.
[46] Traub, J.F., Woźniakowski, H. (1980) A General Theory of Optimal Algorithms, Academic Press, New York.
[47] J.A. Tropp (2003) Greed is Good: Algorithmic Results for Sparse Approximation To appear,
IEEE Trans Info. Thry.
[48] J.A. Tropp (2004) Just Relax: Convex programming methods for Subset Selection and Sparse Approximation. Manuscript.