
IEEE TRANSACTIONS ON INFORMATION THEORY, VOL. 52, NO. 4, APRIL 2006

Compressed Sensing
David L. Donoho, Member, IEEE

Manuscript received September 18, 2004; revised December 15, 2005. The author is with the Department of Statistics, Stanford University, Stanford, CA 94305 USA (e-mail: [email protected]). Communicated by A. Høst-Madsen, Associate Editor for Detection and Estimation. Digital Object Identifier 10.1109/TIT.2006.871582

Abstract—Suppose x is an unknown vector in R^m (a digital image or signal); we plan to measure n general linear functionals of x and then reconstruct. If x is known to be compressible by transform coding with a known transform, and we reconstruct via the nonlinear procedure defined here, the number of measurements n can be dramatically smaller than the size m. Thus, certain natural classes of images with m pixels need only n = O(m^{1/4} log^{5/2}(m)) nonadaptive nonpixel samples for faithful recovery, as opposed to the usual m pixel samples. More specifically, suppose x has a sparse representation in some orthonormal basis (e.g., wavelet, Fourier) or tight frame (e.g., curvelet, Gabor)—so the coefficients belong to an ℓ_p ball for 0 < p ≤ 1. The N most important coefficients in that expansion allow reconstruction with ℓ_2 error O(N^{1/2 − 1/p}). It is possible to design n = O(N log(m)) nonadaptive measurements allowing reconstruction with accuracy comparable to that attainable with direct knowledge of the N most important coefficients. Moreover, a good approximation to those important coefficients is extracted from the n measurements by solving a linear program—Basis Pursuit in signal processing. The nonadaptive measurements have the character of “random” linear combinations of basis/frame elements. Our results use the notions of optimal recovery, of n-widths, and information-based complexity. We estimate the Gel’fand n-widths of ℓ_p balls in high-dimensional Euclidean space in the case 0 < p ≤ 1, and give a criterion identifying near-optimal subspaces for Gel’fand n-widths. We show that “most” subspaces are near-optimal, and show that convex optimization (Basis Pursuit) is a near-optimal way to extract information derived from these near-optimal subspaces.

Index Terms—Adaptive sampling, almost-spherical sections of Banach spaces, Basis Pursuit, eigenvalues of random matrices, Gel’fand n-widths, information-based complexity, integrated sensing and processing, minimum ℓ_1-norm decomposition, optimal recovery, Quotient-of-a-Subspace theorem, sparse solution of linear equations.

I. INTRODUCTION

AS our modern technology-driven civilization acquires and exploits ever-increasing amounts of data, “everyone” now knows that most of the data we acquire “can be thrown away” with almost no perceptual loss—witness the broad success of lossy compression formats for sounds, images, and specialized technical data. The phenomenon of ubiquitous compressibility raises very natural questions: why go to so much effort to acquire all the data when most of what we get will be thrown away? Can we not just directly measure the part that will not end up being thrown away?

In this paper, we design compressed data acquisition protocols which perform as if it were possible to directly acquire just the important information about the signals/images—in effect, not acquiring that part of the data that would eventually just be “thrown away” by lossy compression. Moreover, the protocols are nonadaptive and parallelizable; they do not require knowledge of the signal/image to be acquired in advance—other than knowledge that the data will be compressible—and do not attempt any “understanding” of the underlying object to guide an active or adaptive sensing strategy. The measurements made in the compressed sensing protocol are holographic—thus, not simple pixel samples—and must be processed nonlinearly.

In specific applications, this principle might enable dramatically reduced measurement time, dramatically reduced sampling rates, or reduced use of analog-to-digital converter resources.

A. Transform Compression Background

Our treatment is abstract and general, but depends on one specific assumption which is known to hold in many settings of signal and image processing: the principle of transform sparsity. We suppose that the object of interest is a vector x, which can be a signal or image with m samples or pixels, and that there is an orthonormal basis (ψ_i : i = 1, …, m) for x, which can be, for example, an orthonormal wavelet basis, a Fourier basis, or a local Fourier basis, depending on the application. (As explained later, the extension to tight frames such as curvelet or Gabor frames comes for free.) The object has transform coefficients θ_i = ⟨x, ψ_i⟩, and these are assumed sparse in the sense that, for some 0 < p < 2 and for some R > 0,

    ‖θ‖_p := ( Σ_i |θ_i|^p )^{1/p} ≤ R.    (I.1)

Such constraints are actually obeyed on natural classes of signals and images; this is the primary reason for the success of standard compression tools based on transform coding [1]. To fix ideas, we mention two simple examples of constraint.

• Bounded Variation model for images. Here image brightness is viewed as an underlying function f on the unit square [0, 1]^2, which obeys (essentially) a bound on its total variation,

    ∫∫_{[0,1]^2} |∇f| ≤ R.

The digital data of interest consist of m = n^2 pixel samples of f produced by averaging over 1/n × 1/n pixels. We take a wavelet point of view; the data are seen as a superposition of contributions from various scales. Let x^{(j)} denote the component of the data at scale j, and let (ψ_{j,k} : k) denote the orthonormal basis of wavelets at scale j, containing 3 · 4^j elements. The corresponding coefficients obey ‖θ^{(j)}‖_1 ≤ C, with C depending only on R and the wavelet used.
• Bump Algebra model for spectra. Here a spectrum (e.g., mass spectrum or magnetic resonance spectrum) is modeled as digital samples of an underlying function f on the real line which is a superposition of so-called spectral lines of varying positions, amplitudes, and linewidths. Formally,

    f(t) = Σ_i a_i · g((t − t_i)/s_i).

Here the parameters t_i are line locations, a_i are amplitudes/polarities, and s_i are linewidths, and g represents a lineshape, for example the Gaussian, although other profiles could be considered. We assume the constraint Σ_i |a_i| ≤ R, which in applications represents an energy or total mass constraint. Again we take a wavelet viewpoint, this time specifically using smooth wavelets. The data can be represented as a superposition of contributions from various scales. Let x^{(j)} denote the component of the spectrum at scale j, and let (ψ_{j,k} : k) denote the orthonormal basis of wavelets at scale j, containing 2^j elements. The corresponding coefficients again obey an ℓ_1 bound, ‖θ^{(j)}‖_1 ≤ C [2].

While in these two examples the ℓ_1 constraint appeared, other ℓ_p constraints with p < 1 can appear naturally as well; see below. For some readers, the use of ℓ_p norms with p < 1 may seem initially strange; it is now well understood that the ℓ_p norms with such small p are natural mathematical measures of sparsity [3], [4]. As p decreases below 1, more and more sparsity is being required. Also, from this viewpoint, an ℓ_p constraint based on p = 2 requires no sparsity at all.

Note that in each of these examples, we also allowed for separating the object of interest into subbands, each one of which obeys an ℓ_p constraint. In practice, in the following we stick with the view that the object of interest is a coefficient vector θ obeying the constraint (I.1), which may mean, from an application viewpoint, that our methods correspond to treating various subbands separately, as in these examples.

The key implication of the ℓ_p constraint is sparsity of the transform coefficients. Indeed, we have trivially that, if θ_N denotes the vector θ with everything except the N largest coefficients set to 0, then

    ‖θ − θ_N‖_2 ≤ c_p · ‖θ‖_p · (N + 1)^{1/2 − 1/p},  N = 0, 1, 2, …,    (I.2)

with a constant c_p depending only on p. Thus, for example, to approximate θ with error ε, we need to keep only the N(ε) biggest terms in θ.

B. Optimal Recovery/Information-Based Complexity Background

Our question now becomes: if x is an unknown signal whose transform coefficient vector θ obeys (I.1), can we make a reduced number n ≪ m of measurements which will allow faithful reconstruction of x? Such questions have been discussed (for other types of assumptions about x) under the names of Optimal Recovery [5] and Information-Based Complexity [6]; we now adopt their viewpoint, and partially adopt their notation, without making a special effort to be really orthodox. We use “OR/IBC” as a generic label for work taking place in those fields, admittedly being less than encyclopedic about various scholarly contributions.

We have a class X of possible objects of interest, and are interested in designing an information operator I_n : X → R^n that samples n pieces of information about x, and an algorithm A_n : R^n → R^m that offers an approximate reconstruction of x. Here the information operator takes the form

    I_n(x) = (⟨ξ_1, x⟩, …, ⟨ξ_n, x⟩)

where the ξ_i are sampling kernels, not necessarily sampling pixels or other simple features of the signal; however, they are nonadaptive, i.e., fixed independently of x. The algorithm A_n is an unspecified, possibly nonlinear reconstruction operator.

We are interested in the ℓ_2 error of reconstruction and in the behavior of optimal information and optimal algorithms. Hence, we consider the minimax ℓ_2 error as a standard of comparison

    E_n(X) = inf_{A_n, I_n} sup_{x ∈ X} ‖x − A_n(I_n(x))‖_2.

So here, all possible methods of nonadaptive linear sampling are allowed, and all possible methods of reconstruction are allowed.

In our application, the class X of objects of interest is the set of objects x = Σ_i θ_i ψ_i where θ = θ(x) obeys (I.1) for a given p and R. Denote then

    X_{p,m}(R) = {x : ‖θ(x)‖_p ≤ R}.

Our goal is to evaluate E_n(X_{p,m}(R)) and to have practical schemes which come close to attaining it.

C. Four Surprises

Here is the main quantitative phenomenon of interest for this paper.

Theorem 1: Let (n, m_n) be a sequence of problem sizes with n → ∞, m_n → ∞, and m_n ∼ A·n^γ, γ ≥ 1, A > 0. Then for 0 < p ≤ 1 there is C_p = C_p(A, γ) > 0 so that

    E_n(X_{p,m_n}(R)) ≤ C_p · R · (n / log(m_n))^{1/2 − 1/p}.    (I.3)

We find this surprising in four ways. First, compare (I.3) with (I.2). We see that the forms are similar, under the calibration n ≍ N·log(m_n). In words: the quality of approximation to x which could be gotten by using the N biggest transform coefficients can be gotten by using the n ≍ N·log(m_n) pieces of nonadaptive information provided by I_n. The surprise is that we would not know in advance which transform coefficients are likely to be the important ones in this approximation, yet the optimal information operator I_n is nonadaptive, depending at most on the class X_{p,m_n}(R) and not on the specific object. In some sense this nonadaptive information is just as powerful as knowing the N best transform coefficients.

This seems even more surprising when we note that for objects x ∈ X_{p,m}(R), the transform representation is the optimal one: no other representation can do as well at characterizing x by a few coefficients [3], [7]. Surely then, one imagines, the sampling kernels ξ_i underlying the optimal information
operator must be simply measuring individual transform coef- provided by the wavelet basis; if , this is again symmetric
ficients? Actually, no: the information operator is measuring about the origin and orthosymmetric, while not being convex,
very complex “holographic” functionals which in some sense but still star-shaped.
mix together all the coefficients in a big soup. Compare (VI.1) To develop this geometric viewpoint further, we consider two
below. (Holography is a process where a three–dimensional notions of -width; see [5].
(3-D) image generates by interferometry a two–dimensional
Definition 1.1: The Gel’fand -width of with respect to
(2-D) transform. Each value in the 2-D transform domain is
the norm is defined as
influenced by each part of the whole 3-D object. The 3-D
object can later be reconstructed by interferometry from all
or even a part of the 2-D transform domain. Leaving aside
the specifics of 2-D/3-D and the process of interferometry, we where the infimum is over -dimensional linear subspaces of
perceive an analogy here, in which an object is transformed to a , and denotes the orthocomplement of with respect
compressed domain, and each compressed domain component to the standard Euclidean inner product.
is a combination of all parts of the original object.)
Another surprise is that, if we enlarged our class of informa- In words, we look for a subspace such that “trapping”
tion operators to allow adaptive ones, e.g., operators in which in that subspace causes to be small. Our interest in Gel’fand
certain measurements are made in response to earlier measure- -widths derives from an equivalence between optimal recovery
ments, we could scarcely do better. Define the minimax error for nonadaptive information and such -widths, well known in
under adaptive information allowing adaptive operators the case [5], and in the present setting extending as
follows.
Theorem 3: For and
where, for , each kernel is allowed to depend on the (I.4)
information gathered at previous stages .
Formally setting (I.5)

Thus the Gel’fand -widths either exactly or nearly equal the


value of optimal information. Ultimately, the bracketing with
we have the following. constant will be for us just as good as equality, owing to
the unspecified constant factors in (I.3). We will typically only
Theorem 2: For , there is so that for be interested in near-optimal performance, i.e., in obtaining
to within constant factors.
It is relatively rare to see the Gel’fand -widths studied
directly [8]; more commonly, one sees results about the
So adaptive information is of minimal help—despite the quite Kolmogorov -widths.
natural expectation that an adaptive method ought to be able Definition 1.2: Let be a bounded set. The
iteratively somehow “localize” and then “close in” on the “big Kolmogorov -width of with respect the norm is defined
coefficients.” as
An additional surprise is that, in the already-interesting case
, Theorems 1 and 2 are easily derivable from known results
in OR/IBC and approximation theory! However, the derivations
are indirect; so although they have what seem to the author as where the infimum is over -dimensional linear subspaces
fairly important implications, very little seems known at present of .
about good nonadaptive information operators or about concrete In words, measures the quality of approximation of
algorithms matched to them. possible by -dimensional subspaces .
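For concreteness, the two width notions can be written out explicitly. In the notation of the verbal definitions above (standard forms; cf. [8]) they read

    d^n(X; ℓ_2^m) = inf_{V_n} sup { ‖x‖_2 : x ∈ X ∩ V_n^⊥ },
    d_n(X; ℓ_2^m) = inf_{V_n} sup_{x ∈ X} inf_{y ∈ V_n} ‖x − y‖_2,

with both infima running over n-dimensional linear subspaces V_n of R^m, and with V_n^⊥ the orthocomplement of V_n in the first display.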
Our goal in this paper is to give direct arguments which cover In the case , there is an important duality relationship
the case of highly compressible objects, to give di- between Kolmogorov widths and Gel’fand widths which allows
rect information about near-optimal information operators and us to infer properties of from published results on . To
about concrete, computationally tractable algorithms for using state it, let be defined in the obvious way, based on
this near-optimal information. approximation in rather than norm. Also, for given
and , let and be the standard dual indices
D. Geometry and Widths , . Also, let denote the standard unit
From our viewpoint, the phenomenon described in The- ball of . Then [8]
orem 1 concerns the geometry of high-dimensional convex and
nonconvex “balls.” To see the connection, note that the class (I.6)
is the image, under orthogonal transformation, of In particular
an ball. If this is convex and symmetric about the
origin, as well as being orthosymmetric with respect to the axes
The asymptotic properties of the left-hand side have been de- F. Results
termined by Garnaev and Gluskin [9]. This follows major work Our paper develops two main types of results.
by Kashin [10], who developed a slightly weaker version of this
• Near-Optimal Information. We directly consider the
result in the course of determining the Kolmogorov -widths of
problem of near-optimal subspaces for Gel’fand -widths
Sobolev spaces. See the original papers, or Pinkus’s book [8]
of , and introduce three structural conditions
for more details.
(CS1–CS3) on an -by- matrix which imply that its
Theorem 4: (Kashin, Garnaev, and Gluskin (KGG)) For nullspace is near-optimal. We show that the vast majority
all and of -subspaces of are near-optimal, and random
sampling yields near-optimal information operators with
overwhelmingly high probability.
• Near-Optimal Algorithm. We study a simple nonlinear re-
construction algorithm: simply minimize the norm of
Theorem 1 now follows in the case by applying KGG
the coefficients subject to satisfying the measurements.
with the duality formula (I.6) and the equivalence formula (I.4).
This has been studied in the signal processing literature
The case of Theorem 1 does not allow use of duality
under the name Basis Pursuit; it can be computed by linear
and the whole range is approached differently in this
programming. We show that this method gives near-op-
paper.
timal results for all .
In short, we provide a large supply of near-optimal infor-
E. Mysteries …
mation operators and a near-optimal reconstruction procedure
Because of the indirect manner by which the KGG result im- based on linear programming, which, perhaps unexpectedly,
plies Theorem 1, we really do not learn much about the phenom- works even for the nonconvex case .
enon of interest in this way. The arguments of Kashin, Garnaev, For a taste of the type of result we obtain, consider a specific
and Gluskin show that there exist near-optimal -dimensional information/algorithm combination.
subspaces for the Kolmogorov widths; they arise as the null • CS Information. Let be an matrix generated by
spaces of certain matrices with entries which are known to randomly sampling the columns, with different columns
exist by counting the number of matrices lacking certain prop- independent and identically distributed (i.i.d.) random
erties, the total number of matrices with entries, and com- uniform on . With overwhelming probability for
paring. The interpretability of this approach is limited. large , has properties CS1–CS3 discussed in detail
The implicitness of the information operator is matched in Section II-A below; assume we have achieved such a
by the abstractness of the reconstruction algorithm. Based on favorable draw. Let be the basis matrix with
OR/IBC theory we know that the so-called central algorithm basis vector as the th column. The CS Information
is optimal. This “algorithm” asks us to consider, for given operator is the matrix .
information , the collection of all objects which • -minimization. To reconstruct from CS Information, we
could have given rise to the data solve the convex optimization problem

subject to

Defining now the center of a set In words, we look for the object having coefficients
with smallest norm that is consistent with the informa-
center tion .
To evaluate the quality of an information operator , set
the central algorithm is

center
To evaluate the quality of a combined algorithm/information
and it obeys, when the information is optimal, pair , set

see Section III below. Theorem 5: Let , be a sequence of problem sizes


This abstract viewpoint unfortunately does not translate into obeying , , ; and let be a
a practical approach (at least in the case of the , corresponding sequence of operators deriving from CS matrices
). The set is a section of the ball with underlying parameters and (see Section II below). Let
, and finding the center of this section does not corre- . There exists so that
spond to a standard tractable computational problem. Moreover, is near-optimal:
this assumes we know and , which would typically not be
the case.
for , . Moreover, the algorithm delivering usual the nullspace . We define the width of a set
the solution to is near-optimal: relative to an operator

subject to (II.1)

for , . In words, this is the radius of the section of cut out by


the nullspace . In general, the Gel’fand -width is the
Thus, for large , we have a simple description of near-op- smallest value of obtainable by choice of
timal information and a tractable near-optimal reconstruction
algorithm. is an matrix

G. Potential Applications We will show for all large and the existence of by
matrices where
To see the potential implications, recall first the Bump Al-
gebra model for spectra. In this context, our result says that,
for a spectrometer based on the information operator in The-
orem 5, it is really only necessary to take with dependent at most on and the ratio .
measurements to get an accurate reconstruction of such spectra,
rather than the nominal measurements. However, they must A. Conditions CS1–CS3
then be processed nonlinearly.
Recall the Bounded Variation model for images. In that con- In the following, with let denote a sub-
text, a result paralleling Theorem 5 says that for a specialized matrix of obtained by selecting just the indicated columns of
imaging device based on a near-optimal information operator it . We let denote the range of in . Finally, we consider
is really only necessary to take measure- a family of quotient norms on ; with denoting the
ments to get an accurate reconstruction of images with pixels, norm on vectors indexed by
rather than the nominal measurements.
subject to
The calculations underlying these results will be given below,
along with a result showing that for cartoon-like images (which These describe the minimal -norm representation of achiev-
may model certain kinds of simple natural imagery, like brain able using only specified subsets of columns of .
scans), the number of measurements for an -pixel image is We define three conditions to impose on an matrix ,
only . indexed by strictly positive parameters , , and .
CS1: The minimal singular value of exceeds
H. Contents
uniformly in .
Section II introduces a set of conditions CS1–CS3 for CS2: On each subspace we have the inequality
near-optimality of an information operator. Section III considers
abstract near-optimal algorithms, and proves Theorems 1–3.
Section IV shows that solving the convex optimization problem
gives a near-optimal algorithm whenever . Sec- uniformly in .
tion V points out immediate extensions to weak- conditions CS3: On each subspace
and to tight frames. Section VI sketches potential implications
in image, signal, and array processing. Section VII, building on
work in [11], shows that conditions CS1–CS3 are satisfied for
“most” information operators. uniformly in .
Finally, in Section VIII, we note the ongoing work by two CS1 demands a certain quantitative degree of linear indepen-
groups (Gilbert et al. [12] and Candès et al. [13], [14]), which dence among all small groups of columns. CS2 says that linear
although not written in the -widths/OR/IBC tradition, imply combinations of small groups of columns give vectors that look
(as we explain), closely related results. much like random noise, at least as far as the comparison of
and norms is concerned. It will be implied by a geometric
fact: every slices through the ball in such a way that
II. INFORMATION
the resulting convex section is actually close to spherical. CS3
Consider information operators constructed as follows. With says that for every vector in some , the associated quotient
the orthogonal matrix whose columns are the basis elements norm is never dramatically smaller than the simple norm
, and with certain -by- matrices obeying conditions on .
specified below, we construct corresponding information oper- It turns out that matrices satisfying these conditions are ubiq-
ators . Everything will be completely transparent to uitous for large and when we choose the and properly.
the choice of orthogonal matrix and hence we will assume Of course, for any finite and , all norms are equivalent and
that is the identity throughout this section. almost any arbitrary matrix can trivially satisfy these conditions
In view of the relation between Gel’fand -widths and min- simply by taking very small and , very large. However,
imax errors, we may work with -widths. Let denote as the definition of “very small” and “very large” would have to
depend on for this trivial argument to work. We claim some- A similar argument for approximation gives, in case
thing deeper is true: it is possible to choose and independent
of and of . (II.4)
Consider the set
Now . Hence, with , we have
. As and , we can
invoke CS3, getting
of all matrices having unit-normalized columns. On this
set, measure frequency of occurrence with the natural uniform
measure (the product measure, uniform on each factor ). On the other hand, again using and
Theorem 6: Let be a sequence of problem sizes with , invoke CS2, getting
, , and , , and . There
exist and depending only on and so that, for
each the proportion of all matrices satisfying
CS1–CS3 with parameters and eventually exceeds . Combining these with the above
We will discuss and prove this in Section VII. The proof will
show that the proportion of matrices not satisfying the condition
decays exponentially fast in .
For later use, we will leave the constants and implicit and
speak simply of CS matrices, meaning matrices that satisfy the with . Recalling ,
given conditions with values of parameters of the type described and invoking CS1 we have
by this theorem, namely, with and not depending on and
permitting the above ubiquity.

B. Near-Optimality of CS Matrices In short, with

We now show that the CS conditions imply near-optimality


of widths induced by CS matrices.
Theorem 7: Let be a sequence of problem sizes with
and . Consider a sequence of by
matrices obeying the conditions CS1–CS3 with and The theorem follows with
positive and independent of . Then for each , there is
so that for

III. ALGORITHMS
Proof: Consider the optimization problem
Given an information operator , we must design a recon-
struction algorithm which delivers reconstructions compat-
subject to ible in quality with the estimates for the Gel’fand -widths. As
discussed in the Introduction, the optimal method in the OR/IBC
Our goal is to bound the value of framework is the so-called central algorithm, which unfortu-
nately, is typically not efficiently computable in our setting.
We now describe an alternate abstract approach, allowing us to
prove Theorem 1.
Choose so that . Let denote the indices of the
largest values in . Without loss of generality, A. Feasible-Point Methods
suppose coordinates are ordered so that comes first among
Another general abstract algorithm from the OR/IBC litera-
the entries, and partition . Clearly
ture is the so-called feasible-point method, which aims simply
to find any reconstruction compatible with the observed infor-
(II.2) mation and constraints.
As in the case of the central algorithm, we consider, for given
while, because each entry in is at least as big as any entry in information , the collection of all objects
, (I.2) gives which could have given rise to the information

(II.3)
In the feasible-point method, we simply select any member of with where for given
, by whatever means. One can show, adapting standard and
OR/IBC arguments in [15], [6], [8] the following.
Lemma 3.1: Let where and
is an optimal information operator, and let be any element of
. Then for B. Proof of Theorem 3
(III.1) Before proceeding, it is convenient to prove Theorem 3. Note
that the case is well known in OR/IBC so we only need to
In short, any feasible point is within a factor two of optimal. give an argument for (though it happens that our argument
Proof: We first justify our claims for optimality of the cen- works for as well). The key point will be to apply the
tral algorithm, and then show that a feasible point is near to the -triangle inequality
central algorithm. Let again denote the result of the central
algorithm. Now
valid for ; this inequality is well known in interpola-
radius
tion theory [17] through Peetre and Sparr’s work, and is easy to
verify directly.
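Concretely, the inequality in question reads ‖a + b‖_p^p ≤ ‖a‖_p^p + ‖b‖_p^p for 0 < p ≤ 1; it follows by applying the scalar inequality |u + v|^p ≤ |u|^p + |v|^p coordinatewise.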
Suppose without loss of generality that there is an optimal
Now clearly, in the special case when is only known to lie in subspace , which is fixed and given in this proof. As we just
and is measured, the minimax error is saw
exactly radius . Since this error is achieved by the
radius
central algorithm for each such , the minimax error over all
is achieved by the central algorithm. This minimax error is
Now
radius radius
Now the feasible point obeys ; hence, so clearly . Now suppose without loss of generality that
and attain the radius bound, i.e., they satisfy
radius and, for center they satisfy
But the triangle inequality gives

Then define . Set and


. By the -triangle inequality
hence, as

radius
radius and so

Hence . However,
More generally, if the information operator is only near-
optimal, then the same argument gives

(III.2) so belongs to . Hence,


and
A popular choice of feasible point is to take an element of
least norm, i.e., a solution of the problem

subject to

where here is the vector of transform coefficients, C. Proof of Theorem 1


. A nice feature of this approach is that it is not necessary We are now in a position to prove Theorem 1 of the
to know the radius of the ball ; the element of least Introduction.
norm will always lie inside it. For later use, call the solution First, in the case , we have already explained in the
. By the preceding lemma, this procedure is near-minimax: Introduction that the theorem of Garnaev and Gluskin implies
the result by duality. In the case , we need only to A. The Case


show a lower bound and an upper bound of the same order. In the case , is a convex optimization problem.
For the lower bound, we consider the entropy numbers, de- Written in an equivalent form, with being the optimization
fined as follows. Let be a set and let be the smallest variable, gives
number such that an -net for can be built using a net of
cardinality at most . From Carl’s theorem [18]—see the ex- subject to
position in Pisier’s book [19]—there is a constant so that
the Gel’fand -widths dominate the entropy numbers.
This can be formulated as a linear programming problem: let
be the by matrix . The linear program

LP subject to (IV.1)
Secondly, the entropy numbers obey [20], [21]
has a solution , say, a vector in which can be partitioned
as ; then solves . The recon-
struction is . This linear program is typically consid-
At the same time, the combination of Theorems 7 and 6 shows ered computationally tractable. In fact, this problem has been
that studied in the signal analysis literature under the name Basis
Pursuit [26]; in that work, very large-scale underdetermined
problems—e.g., with and —were solved
successfully using interior-point optimization methods.
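To make the reformulation concrete, here is a minimal Python sketch of Basis Pursuit via the variable split θ = u − v with u, v ≥ 0, solved with an off-the-shelf LP solver (scipy's linprog is used purely for illustration; the dimensions, solver choice, and the function name basis_pursuit are illustrative, with A standing for the measurement matrix in the role of (IV.1)):

    import numpy as np
    from scipy.optimize import linprog

    def basis_pursuit(A, y):
        """Solve min ||theta||_1 subject to A theta = y via the LP
        min 1'z s.t. [A, -A] z = y, z >= 0, with z = [u; v] and theta = u - v."""
        n, m = A.shape
        c = np.ones(2 * m)                        # objective: sum of the entries of u and v
        B = np.hstack([A, -A])                    # equality-constraint matrix [A, -A]
        res = linprog(c, A_eq=B, b_eq=y,
                      bounds=[(0, None)] * (2 * m), method="highs")
        z = res.x
        return z[:m] - z[m:]                      # theta* = u* - v*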
Applying now the Feasible Point method, we have As far as performance goes, we already know that this pro-
cedure is near-optimal in case ; from (III.2) we have the
following.
Corollary 4.1: Suppose that is an information operator
with immediate extensions to for all .
achieving, for some
We conclude that

then the Basis Pursuit algorithm achieves, for


as was to be proven. all

D. Proof of Theorem 2
Now is an opportune time to prove Theorem 2. We note that
in the case of , this is known [22]–[25]. The argument is In particular, we have a universal algorithm for dealing with
the same for , and we simply repeat it. Suppose that any class —i.e., any , any , any . First, apply a
, and consider the adaptively constructed subspace ac- near-optimal information operator; second, reconstruct by Basis
cording to whatever algorithm is in force. When the algorithm Pursuit. The result obeys
terminates, we have an -dimensional information vector and
a subspace consisting of objects which would all give that
information vector. For all objects in , the adaptive informa-
tion therefore turns out the same. Now the minimax error asso- for a constant depending at most on . The
ciated with that information is exactly radius ; inequality can be put to use as follows. Fix . Suppose
but this cannot be smaller than the unknown object is known to be highly compressible,
say obeying the a priori bound , . Let
radius . For any such object, rather than making
measurements, we only need to make
measurements, and our reconstruction obeys
The result follows by comparing with .

IV. BASIS PURSUIT


While the case is already significant and interesting, the
The least-norm method of the previous section has two draw- case is of interest because it corresponds to data
backs. First, it requires that one know ; we prefer an algo- which are more highly compressible, offering more impressive
rithm which works for . Second, if , the performance in Theorem 1, because the exponent
least-norm problem invokes a nonconvex optimization proce- is even stronger than in the case. Later in this section,
dure, and would be considered intractable. In this section, we we extend the same interpretation of to performance over
correct both drawbacks. throughout .
B. Relation Between and Minimization This measures the fraction of norm which can be concen-
trated on a certain subset for a vector in the nullspace of . This
The general OR/IBC theory would suggest that to handle concentration cannot be large if is small.
cases where , we would need to solve the nonconvex
optimization problem , which seems intractable. However, Lemma 4.1: Suppose that satisfies CS1–CS3, with con-
in the current situation at least, a small miracle happens: solving stants and . There is a constant depending on the so
is again near-optimal. To understand this, we first take a that if satisfies
small detour, examining the relation between and the extreme
case of the spaces. Let us define then

subject to
Proof: This is a variation on the argument for Theorem 7.
where is just the number of nonzeros in . Again, since Let . Assume without loss of generality that is the
the work of Peetre and Sparr [16], the importance of and the most concentrated subset of cardinality , and
relation with for is well understood; see [17] for that the columns of are numbered so that ;
more detail. partition . We again consider , and have
Ordinarily, solving such a problem involving the norm re- . We again invoke CS2–CS3, getting
quires combinatorial optimization; one enumerates all sparse
subsets of searching for one which allows a solu-
tion . However, when has a sparse solution,
will find it. We invoke CS1, getting

Theorem 8: Suppose that satisfies CS1–CS3 with given


Now, of course, . Combining all these
positive constants , . There is a constant depending
only on and and not on or so that, if some solution to
has at most nonzeros, then and
both have the same unique solution. The lemma follows, setting .

In words, although the system of equations is massively Proof of Theorem 8: Suppose that and is sup-
underdetermined, minimization and sparse solution coin- ported on a subset .
cide—when the result is sufficiently sparse. We first show that if , is the only minimizer
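A small empirical check of this equivalence phenomenon is easy to run (an illustrative sketch only: the dimensions and sparsity level below are arbitrary, and basis_pursuit refers to the scipy-based ℓ_1 solver sketched earlier in this section):

    import numpy as np

    rng = np.random.default_rng(1)
    n, m, k = 60, 200, 8                      # measurements, ambient dimension, sparsity
    A = rng.standard_normal((n, m))
    A /= np.linalg.norm(A, axis=0)            # unit-normalized columns
    theta0 = np.zeros(m)
    support = rng.choice(m, size=k, replace=False)
    theta0[support] = rng.standard_normal(k)  # a k-sparse coefficient vector
    y = A @ theta0
    theta_hat = basis_pursuit(A, y)           # minimum-l_1 solution consistent with y
    print("max |theta_hat - theta0| =", np.abs(theta_hat - theta0).max())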
There is by now an extensive literature exhibiting results on of . Suppose that is a solution to , obeying
equivalence of and minimization [27]–[34]. In the early
literature on this subject, equivalence was found under condi- Then where . We have
tions involving sparsity constraints allowing nonzeros.
While it may seem surprising that any results of this kind are
Invoking the definition of twice
possible, the sparsity constraint is, ultimately,
disappointingly small. A major breakthrough was the contribu-
tion of Candès, Romberg, and Tao [13] which studied the ma- Now gives and we have
trices built by taking rows at random from an by Fourier
matrix and gave an order bound, showing that i.e., .
dramatically weaker sparsity conditions were needed than the Now recall the constant of Lemma 4.1. Define so
results known previously. In [11], it was shown that that and . Lemma 4.1 shows that
for ’nearly all’ by matrices with , equiv- implies .
alence held for nonzeros, . The above re-
sult says effectively that for ’nearly all’ by matrices with C. Near-Optimality of Basis Pursuit for
, equivalence held up to nonzeros,
We now return to the claimed near-optimality of Basis Pursuit
where .
throughout the range .
Our argument, in parallel with [11], shows that the nullspace
has a very special structure for obeying the conditions Theorem 9: Suppose that satisfies CS1–CS3 with con-
in question. When is sparse, the only element in a given affine stants and . There is so that a solu-
subspace which can have small norm is itself. tion to a problem instance of with obeys
To prove Theorem 8, we first need a lemma about the non-
sparsity of elements in the nullspace of . Let
and, for a given vector , let denote the mutilated The proof requires an stability lemma, showing the sta-
vector with entries . Define the concentration bility of minimization under small perturbations as measured
in norm. For and stability lemmas, see [33]–[35]; how-
ever, note that those lemmas do not suffice for our needs in this
proof.
Lemma 4.2: Let be a vector in and be the corre- V. IMMEDIATE EXTENSIONS


sponding mutilated vector with entries . Suppose that
Before continuing, we mention two immediate extensions to
the results so far; they are of interest below and elsewhere.
where . Consider the instance of
defined by ; the solution of this instance of A. Tight Frames
obeys Our main results so far have been stated in the context of
making an orthonormal basis. In fact, the results hold for tight
(IV.2) frames. These are collections of vectors which, when joined to-
gether as columns in an matrix have
Proof of Lemma 4.2: Put for short , and set . It follows that, if , then we have the
. By definition of Parseval relation

while
and the reconstruction formula . In fact, Theorems
7 and 9 only need the Parseval relation in the proof. Hence, the
As solves same results hold without change when the relation between
and involves a tight frame. In particular, if is an matrix
satisfying CS1–CS3, then defines a near-optimal
and of course information operator on , and solution of the optimization
problem

Hence,

defines a near-optimal reconstruction algorithm .


A referee remarked that there is no need to restrict attention
Finally to tight frames here; if we have a general frame the same results
go through, with constants involving the frame bounds. This is
Combining the above, setting , and , true and potentially very useful, although we will not use it in
we get what follows.

B. Weak Balls
Our main results so far have been stated for spaces, but the
and (IV.2) follows.
proofs hold for weak balls as well . The weak ball
Proof of Theorem 9: We use the same general framework of radius consists of vectors whose decreasing rearrange-
as in Theorem 7. Let where . Let be the ments obey
solution to , and set .
Let as in Lemma 4.1 and set . Let
index the largest amplitude entries in . From
and (II.4) we have Conversely, for a given , the smallest for which these in-
equalities all hold is defined to be the norm: . The
“weak” moniker derives from . Weak con-
and Lemma 4.1 provides straints have the following key property: if denotes a muti-
lated version of the vector with all except the largest items
set to zero, then the inequality
Applying Lemma 4.2
(IV.3) (V.1)

The vector lies in and has . is valid for and , with . In fact,
Hence, Theorems 7 and 9 only needed (V.1) in the proof, together with
(implicitly) . Hence, we can state results for
spaces defined using only weak- norms, and the
proofs apply without change.
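The weak-ℓ_p quantity just defined is easy to compute from the decreasing rearrangement; the short Python function below (an illustrative sketch, with an illustrative function name) returns the smallest R such that the Nth largest magnitude is at most R·N^{−1/p} for every N:

    import numpy as np

    def weak_lp_norm(theta, p):
        """Smallest R with |theta|_(N) <= R * N**(-1/p) for all N, where |theta|_(N)
        is the N-th largest magnitude (the weak-l_p quasi-norm described above)."""
        mags = np.sort(np.abs(theta))[::-1]           # decreasing rearrangement
        N = np.arange(1, len(mags) + 1)
        return float(np.max(mags * N ** (1.0 / p)))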
We conclude by homogeneity that

VI. STYLIZED APPLICATIONS


Combining this with (IV.3),
We sketch three potential applications of the above abstract
theory. In each case, we exhibit that a certain class of functions
has expansion coefficients in a basis or frame that obey a partic- The compressed sensing scheme takes a total of samples
ular or weak embedding, and then apply the above abstract of resumé coefficients and samples associ-
theory. ated with detail coefficients, for a total pieces of
information. It achieves error comparable to classical sampling
A. Bump Algebra based on samples. Thus, it needs dramatically fewer sam-
Consider the class of functions which ples for comparable accuracy: roughly speaking, only the cube
are restrictions to the unit interval of functions belonging to the root of the number of samples of linear sampling.
Bump Algebra [2], with bump norm . This was To achieve this dramatic reduction in sampling, we need an
mentioned in the Introduction, which observed that the wavelet information operator based on some satisfying CS1–CS3. The
coefficients at level obey where de- underlying measurement kernels will be of the form
pends only on the wavelet used. Here and later we use standard
wavelet analysis notations as in [36], [37], [2]. (VI.1)
We consider two ways of approximating functions in . In
the classic linear scheme, we fix a “finest scale” and mea- where the collection is simply an enumeration of the
sure the resumé coefficients where wavelets , with and .
, with a smooth function integrating to .
Think of these as point samples at scale after applying B. Images of Bounded Variation
an antialiasing filter. We reconstruct by
giving an approximation error We consider now the model with images of Bounded Varia-
tion from the Introduction. Let denote the class of func-
tions with domain , having total variation at
most [38], and bounded in absolute value by as
with depending only on the chosen wavelet. There are
well. In the Introduction, it was mentioned that the wavelet co-
coefficients associated with the unit interval, and
efficients at level obey where depends only
so the approximation error obeys
on the wavelet used. It is also true that .
We again consider two ways of approximating functions in .
The classic linear scheme uses a 2-D version of the scheme we
In the compressed sensing scheme, we need also wavelets have already discussed. We again fix a “finest scale” and
where is an oscillating function with measure the resumé coefficients where now
mean zero. We pick a coarsest scale . We measure the is a pair of integers , . indexing
resumé coefficients —there are of these—and then let position. We use the Haar scaling function
denote an enumeration of the detail wavelet coeffi-
cients . The dimension
of is . The norm satisfies
We reconstruct by giving an approxima-
tion error

This establishes the constraint on norm needed for our theory.


We take and apply a near-optimal informa- There are coefficients associated with the unit
tion operator for this and (described in more detail later). square, and so the approximation error obeys
We apply the near-optimal algorithm of minimization, getting
the error estimate
In the compressed sensing scheme, we need also Haar
wavelets where is an oscillating
function with mean zero which is either horizontally oriented
, vertically oriented , or diagonally oriented
with independent of . The overall reconstruction
. We pick a “coarsest scale” , and measure
the resumé coefficients —there are of these. Then let
be the concatenation of the detail wavelet coeffi-
cients .
has error The dimension of is . The norm obeys

This establishes the constraint on norm needed for applying


again with independent of . This is of the same our theory. We take and apply a near-op-
order of magnitude as the error of linear sampling. timal information operator for this and . We apply the near-
optimal algorithm of minimization to the resulting informa- This is no better than the performance of linear sampling for
tion, getting the error estimate the Bounded Variation case, despite the piecewise character
of ; the possible discontinuities in are responsible for the
inability of linear sampling to improve its performance over
with independent of . The overall reconstruction compared to Bounded Variation.
In the compressed sensing scheme, we pick a coarsest scale
. We measure the resumé coefficients in a
smooth wavelet expansion—there are of these—and then
let denote a concatenation of the finer scale
has error curvelet coefficients . The dimension of
is , with due to overcompleteness of
curvelets. The weak “norm” obeys

again with independent of . This is of the same


order of magnitude as the error of linear sampling. with depending on and . This establishes the constraint on
The compressed sensing scheme takes a total of samples weak norm needed for our theory. We take
of resumé coefficients and samples associ-
ated with detail coefficients, for a total pieces
of measured information. It achieves error comparable to clas-
sical sampling with samples. Thus, just as we have seen in and apply a near-optimal information operator for this and .
the Bump Algebra case, we need dramatically fewer samples for We apply the near-optimal algorithm of minimization to the
comparable accuracy: roughly speaking, only the square root of resulting information, getting the error estimate
the number of samples of linear sampling.

C. Piecewise Images With Edges with absolute. The overall reconstruction


We now consider an example where , and we can apply
the extensions to tight frames and to weak- mentioned earlier.
Again in the image processing setting, we use the - model
discussed in Candès and Donoho [39], [40]. Consider the col-
lection of piecewise smooth , with has error
values, first and second partial derivatives bounded by , away
from an exceptional set which is a union of curves having
first and second derivatives in an appropriate parametriza-
tion; the curves have total length . More colorfully, such
images are cartoons—patches of uniform smooth behavior
separated by piecewise-smooth curvilinear boundaries. They again with independent of . This is of the same
might be reasonable models for certain kinds of technical order of magnitude as the error of linear sampling.
imagery—e.g., in radiology. The compressed sensing scheme takes a total of samples
The curvelets tight frame [40] is a collection of smooth frame of resumé coefficients and samples associ-
elements offering a Parseval relation ated with detail coefficients, for a total pieces of
information. It achieves error comparable to classical sampling
based on samples. Thus, even more so than in the Bump Al-
gebra case, we need dramatically fewer samples for comparable
and reconstruction formula accuracy: roughly speaking, only the fourth root of the number
of samples of linear sampling.

VII. NEARLY ALL MATRICES ARE CS MATRICES


The frame elements have a multiscale organization, and frame
coefficients grouped by scale obey the weak constraint We may reformulate Theorem 6 as follows.
Theorem 10: Let , be a sequence of problem sizes with
, , for and . Let
compare [40]. For such objects, classical linear sampling at be a matrix with columns drawn independently and
scale by smooth 2-D scaling functions gives uniformly at random from . Then for some and
, conditions CS1–CS3 hold for with overwhelming
probability for all large .
Indeed, note that the probability measure on induced by The third and final idea is that bounds for individual subsets
sampling columns i.i.d. uniform on is exactly the natural can control simultaneous behavior over all . This is expressed
uniform measure on . Hence, Theorem 6 follows immedi- as follows.
ately from Theorem 10.
Lemma 7.3: Suppose we have events all obeying, for
In effect matrices satisfying the CS conditions are so ubiq-
some fixed and
uitous that it is reasonable to generate them by sampling at
random from a uniform probability distribution.
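Concretely, the random model of Theorem 10 is easy to simulate: a column uniform on the unit sphere S^{n−1} can be drawn by normalizing a standard Gaussian vector. The Python sketch below (illustrative dimensions and subset size, not the constants of the theorem) draws such a matrix and spot-checks the CS1-type behavior of one small column subset:

    import numpy as np

    def random_cs_matrix(n, m, seed=None):
        """n-by-m matrix with columns i.i.d. uniform on the unit sphere S^{n-1},
        obtained by normalizing i.i.d. standard Gaussian columns."""
        rng = np.random.default_rng(seed)
        Phi = rng.standard_normal((n, m))
        return Phi / np.linalg.norm(Phi, axis=0)

    rng = np.random.default_rng(2)
    Phi = random_cs_matrix(200, 2000, seed=2)
    J = rng.choice(2000, size=20, replace=False)           # one random small column subset
    sigma_min = np.linalg.svd(Phi[:, J], compute_uv=False).min()
    print("minimal singular value of Phi_J:", sigma_min)   # stays well away from zero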
The proof of Theorem 10 is conducted over Sections VII-
A–C; it proceeds by studying events , , where for each with . Pick with
CS1 Holds , etc. It will be shown that for parameters and with . Then for all
and sufficiently large

for some
then defining and , we have
with

Since, when occurs, our random draw has produced a ma- Our main goal of this subsection, Lemma 7.1, now follows
trix obeying CS1–CS3 with parameters and , this proves by combining these three ideas.
Theorem 10. The proof actually shows that for some , It remains only to prove Lemma 7.3. Let
, ; the convergence is exponen-
tially fast. with

We note that, by Boole’s inequality


A. Control of Minimal Eigenvalue
The following lemma allows us to choose positive constants
and so that condition CS1 holds with overwhelming
probability. the last inequality following because each member is of
Lemma 7.1: Consider sequences of with cardinality , since , as soon as
. Define the event . Also, of course

Then, for each , for sufficiently small so we get . Taking as given, we get the
desired conclusion.
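The arithmetic behind this combining principle is worth seeing once with numbers (purely illustrative values, not the constants of the lemma): the number of subsets of size at most k ≍ ρn/log m grows roughly like exp(ρn), which an individual failure probability of order exp(−βn) beats whenever β exceeds the counting exponent.

    import math

    n, m = 1000, 100_000
    beta = 0.5                              # assumed per-subset failure exponent: P <= exp(-beta*n)
    rho = 0.25                              # assumed subset-size parameter
    k = int(rho * n / math.log(m))          # largest subset size considered
    n_subsets = sum(math.comb(m, j) for j in range(1, k + 1))
    print("k =", k, " log(#subsets) =", round(math.log(n_subsets), 1), " beta*n =", beta * n)
    print("union bound on failure probability:", n_subsets * math.exp(-beta * n))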

B. Spherical Sections Property


The proof involves three ideas. The first idea is that the event We now show that condition CS2 can be made overwhelm-
of interest for Lemma 7.1 is representable in terms of events ingly likely for large by choice of and sufficiently small
indexed by individual subsets but still positive. Our approach derives from [11], which applied
an important result from Banach space theory, the almost spher-
ical sections phenomenon. This says that slicing the unit ball in a
Our plan is to bound the probability of occurrence of every . Banach space by intersection with an appropriate finite-dimen-
The second idea is that for a specific subset , we get large de- sional linear subspace will result in a slice that is effectively
viations bounds on the minimum eigenvalue; this can be stated spherical [43], [44]. We develop a quantitative refinement of
as follows. this principle for the norm in , showing that, with over-
whelming probability, every operator for
Lemma 7.2: For , let denote the event affords a spherical section of the ball. The basic argument we
that the minimum eigenvalue exceeds . use originates from work of Milman, Kashin, and others [44],
Then there is so that for sufficiently small and [10], [45]; we refine an argument in Pisier [19] and, as in [11],
all draw inferences that may be novel. We conclude that not only do
almost-spherical sections exist, but they are so ubiquitous that
every with will generate them.
uniformly in . Definition 7.1: Let . We say that offers an -isom-
etry between and if
This was derived in [11] and in [35], using the concentration
of measure property of singular values of random matrices, e.g.,
see Szarek’s work [41], [42]. (VII.1)
The following lemma shows that condition CS2 is a generic and


property of matrices.
(VII.6)
Lemma 7.4: Consider the event that every
with offers an -isometry between
and . For each , there is so that In words, a small multiple of any sign pattern almost
lives in the dual ball .
Before proving this result, we indicate how it gives the prop-
erty CS3; namely, that , and
To prove this, we first need a lemma about individual subsets imply
proven in [11].
Lemma 7.5: Fix . Choose so that (VII.7)

(VII.2) Consider the convex optimization problem


and subject to (VII.8)
(VII.3) This can be written as a linear program, by the same sort of con-
struction as given for (IV.1). By the duality theorem for linear
Choose so that programming, the value of the primal program is at least the
value of the dual
subject to
and let denote the difference between the two sides. For
a subset in , let denote the event that Lemma 7.6 gives us a supply of dual-feasible vectors and hence
furnishes an -isometry to . Then as a lower bound on the dual program. Take ; we can
find which is dual-feasible and obeys

Now note that the event of interest for Lemma 7.4 is


picking appropriately and taking into account the spherical
sections theorem, for sufficiently large , we have
to finish, apply the individual Lemma 7.5 together with the com-
; (VII.7) follows with .
bining principle in Lemma 7.3.
1) Proof of Uniform Sign-Pattern Embedding: The proof of
C. Quotient Norm Inequalities Lemma 7.6 follows closely a similar result in [11] that consid-
ered the case . Our idea here is to adapt the
We now show that, for , for sufficiently small argument for the case to the
, nearly all large by matrices have property CS3. Our case, with changes reflecting the different choice of , , and the
argument borrows heavily from [11] which the reader may find sparsity bound . We leave out large parts of the ar-
helpful. We here make no attempt to provide intuition or to com- gument, as they are identical to the corresponding parts in [11].
pare with closely related notions in the local theory of Banach The bulk of our effort goes to produce the following lemma,
spaces (e.g., Milman’s Quotient of a Subspace Theorem [19]). which demonstrates approximate embedding of a single sign
Let be any collection of indices in ; pattern in the dual ball.
is a linear subspace of , and on this subspace a subset of
possible sign patterns can be realized, i.e., sequences of ’s Lemma 7.7: Individual Sign-Pattern Embedding. Let
generated by , let , with , , , as in the statement of
Lemma 7.6. Let . Given a collection
, there is an iterative algorithm, described below, producing a
vector as output which obeys
CS3 will follow if we can show that for every , (VII.9)
some approximation to satisfies
for . Let be i.i.d. uniform on ; there is an event
described below, having probability controlled by
Lemma 7.6: Uniform Sign-Pattern Embedding. Fix .
(VII.10)
Then for , set
(VII.4) for which can be explicitly given in terms of and . On
this event
For sufficiently small , there is an event
with , as . On this event, for each subset (VII.11)
with , for each sign pattern in , there
is a vector with Lemma 7.7 will be proven in a section of its own. We now
(VII.5) show that it implies Lemma 7.6.
We recall a standard implication of so-called Vapnik–Chervonenkis theory [46]

Hence, is definitely dual feasible. The only question is how


Notice that if , then close to it is.
2) Analysis Framework: Also in [11] bounds were devel-
oped for two key descriptors of the algorithm trajectory

while also
and

Hence, the total number of sign patterns generated by operators We adapt the arguments deployed there. We define bounds
obeys and for and , of the form

Now furnished by Lemma 7.7 is positive, so pick


with . Define here and , where will
be defined later. We define subevents

where denotes the instance of the event (called in the


statement of Lemma 7.7) generated by a specific , combi-
nation. On the event , every sign pattern associated with any Now define
obeying is almost dual feasible. Now

this event implies, because

which tends to zero exponentially rapidly.


We will show that, for chosen in conjunction with
D. Proof of Individual Sign-Pattern Embedding
1) An Embedding Algorithm: The companion paper [11] in- (VII.12)
troduced an algorithm to create a dual feasible point starting
This implies
from a nearby almost-feasible point . It worked as follows.
Let be the collection of indices with

and the lemma follows.


and then set 3) Large Deviations: Define the events

so that
where denotes the least-squares projector
. In effect, identify the indices where
exceeds half the forbidden level , and “kill” those
indices. Put
Continue this process, producing , , etc., with stage-de-
pendent thresholds successively closer to . Set

and note that this depends quite weakly on . Recall that the
and, putting , event is defined in terms of and . On the event ,
. Lemma 7.1 implicitly defined a quan-
tity lowerbounding the minimum eigenvalue of
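Before turning to the large deviations estimates, we sketch in code the flavor of the iterative construction described under Section VII-D above: starting from an almost dual-feasible vector, repeatedly identify offending coordinates and remove a least-squares component supported on the corresponding columns. This is a schematic illustration only; the feasibility level, the stage-dependent thresholds, and the stopping rule below are stand-ins chosen for the sketch, not the precise quantities of [11] or of the analysis above.

import numpy as np

def make_dual_feasible(Phi, w0, level=1.0, max_iter=50):
    # Schematic correction loop; "feasible" is taken here to mean that
    # |<phi_i, w>| <= level for every column phi_i of Phi.
    w = w0.copy()
    for j in range(max_iter):
        corr = Phi.T @ w
        t_j = level * (1.0 - 0.5 ** (j + 1))    # thresholds creep up toward level
        I = np.flatnonzero(np.abs(corr) > t_j)  # offending indices
        if I.size == 0:                         # the process terminates
            break
        B = Phi[:, I]
        # "kill" those indices: subtract the least-squares projection of w
        # onto the span of the offending columns, so that B.T @ w becomes 0
        coef, *_ = np.linalg.lstsq(B, w, rcond=None)
        w = w - B @ coef
    return w

With these stand-in choices the first stage flags coordinates exceeding half the feasibility level, echoing the description above, while later stages use thresholds successively closer to the level itself.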

3) Large Deviations: Define the events

and note that this depends quite weakly on . Recall that the event is defined in terms of and . On the event , . Lemma 7.1 implicitly defined a quantity lowerbounding the minimum eigenvalue of every where . Pick so that

. Pick so that

With this choice of , when the event occurs,

Also, on , (say) for .

In [11], an analysis framework was developed in which a family of random variables i.i.d. was introduced, and it was shown that

and

That paper also stated two simple large deviations bounds.

Lemma 7.8: Let be i.i.d. , ,

and

Applying this, we note that the event

stated in terms of variables, is equivalent to an event

stated in terms of standard random variables, where

and

We therefore have for the inequality

Now

Since , the term of most concern in is at ; the other terms are always better. Also in fact does not depend on . Focusing now on , we may write

Recalling that and putting

we get for , and

A similar analysis holds for the ’s.

VIII. CONCLUSION

A. Summary

We have described an abstract framework for compressed sensing of objects which can be represented as vectors . We assume the object of interest is a priori compressible so that for a known basis or frame and . Starting from an by matrix with satisfying conditions CS1–CS3, and with the matrix of an orthonormal basis or tight frame underlying , we define the information operator . Starting from the -tuple of measured information , we reconstruct an approximation to by solving

subject to

The proposed reconstruction rule uses convex optimization and is computationally tractable. Also, the needed matrices satisfying CS1–CS3 can be constructed by random sampling from a uniform distribution on the columns of .

We give error bounds showing that despite the apparent undersampling , good accuracy reconstruction is possible for compressible objects, and we explain the near-optimality of these bounds using ideas from Optimal Recovery and Information-Based Complexity. We even show that the results are stable under small measurement errors in the data ( small). Potential applications are sketched related to imaging and spectroscopy.
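To make the summary concrete, here is a small self-contained sketch of the pipeline in the simplest setting: the sparsifying basis is taken to be the identity (so the object is sparse in the sampling coordinates), and a column-normalized Gaussian matrix stands in for a verified CS1–CS3 ensemble. The sizes, the random seed, and the choice of solver are purely illustrative.

import numpy as np
from scipy.optimize import linprog

rng = np.random.default_rng(0)
m, n, k = 256, 80, 8          # illustrative dimensions and sparsity

# Stand-in for the random measurement ensemble: Gaussian columns
# normalized to unit length (not a certified CS1-CS3 matrix).
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)

theta = np.zeros(m)
theta[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)

y = Phi @ theta               # the n measured linear functionals

# Reconstruct by l1 minimization, posed as a linear program.
c = np.ones(2 * m)
res = linprog(c, A_eq=np.hstack([Phi, -Phi]), b_eq=y,
              bounds=(0, None), method="highs")
theta_hat = res.x[:m] - res.x[m:]

print("relative l2 error:",
      np.linalg.norm(theta_hat - theta) / np.linalg.norm(theta))

With sizes in this range the ℓ1 program typically recovers the sparse vector exactly; for merely compressible vectors one instead observes the graceful error decay promised by the bounds summarized above.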

B. Alternative Formulation

We remark that the CS1–CS3 conditions are not the only way to obtain our results. Our proof of Theorem 9 really shows the following.

Theorem 11: Suppose that an matrix obeys the following conditions, with constants , and and .

A1: The maximal concentration (defined in Section IV-B) obeys

(VIII.1)

A2: The width (defined in Section II) obeys

(VIII.2)

Let . For some and all , the solution of obeys the estimate

In short, a different approach might exhibit operators with good widths over balls only, and low concentration on “thin” sets. Another way to see that the conditions CS1–CS3 can no doubt be approached differently is to compare the results in [11], [35]; the second paper proves results which partially overlap those in the first, by using a different technique.

C. The Partial Fourier Ensemble

We briefly discuss two recent articles which do not fit in the n-widths tradition followed here, and so were not easy to cite earlier with due prominence.

First, and closest to our viewpoint, is the breakthrough paper of Candès, Romberg, and Tao [13]. This was discussed in Section IV-B; the result of [13] showed that ℓ1 minimization can be used to exactly recover sparse sequences from the Fourier transform at randomly chosen frequencies, whenever the sequence has fewer than nonzeros, for some . Second is the article of Gilbert et al. [12], which showed that a different nonlinear reconstruction algorithm can be used to recover approximations to a vector in which is nearly as good as the best -term approximation in norm, using about random but nonuniform samples in the frequency domain; here is (it seems) an upper bound on the norm of .

These papers both point to the partial Fourier ensemble, i.e., the collection of matrices made by sampling n rows out of the m by m Fourier matrix, as concrete examples of working within the CS framework; that is, generating near-optimal subspaces for Gel’fand n-widths, and allowing ℓ1 minimization to reconstruct from such information for all .

Now [13] (in effect) proves that if , then in the partial Fourier ensemble with uniform measure, the maximal concentration condition A1 (VIII.1) holds with overwhelming probability for large (for appropriate constants , ). On the other hand, the results in [12] seem to show that condition A2 (VIII.2) holds in the partial Fourier ensemble with overwhelming probability for large , when it is sampled with a certain nonuniform probability measure. Although the two papers [13], [12] refer to different random ensembles of partial Fourier matrices, both reinforce the idea that interesting relatively concrete families of operators can be developed for compressed sensing applications. In fact, Candès has informed us of some recent results he obtained with Tao [47] indicating that, modulo polylog factors, A2 holds for the uniformly sampled partial Fourier ensemble. This seems a very significant advance.

Note Added in Proof

In the months since the paper was written, several groups have conducted numerical experiments on synthetic and real data for the method described here and related methods. They have explored applicability to important sensor problems, and studied applications issues such as stability in the presence of noise. The reader may wish to consult the forthcoming Special Issue on Sparse Representation of the EURASIP journal Applied Signal Processing, or look for papers presented at a special session in ICASSP 2005, or the workshop on sparse representation held in May 2005 at the University of Maryland Center for Scientific Computing and Applied Mathematics, or the workshop in November 2005 at Spars05, Université de Rennes.

A referee has pointed out that Compressed Sensing is in some respects similar to problems arising in data stream processing, where one wants to learn basic properties [e.g., moments, histogram] of a datastream without storing the stream. In short, one wants to make relatively few measurements while inferring relatively much detail. The notions of “Iceberg queries” in large databases and “heavy hitters” in data streams may provide points of entry into that literature.

ACKNOWLEDGMENT

In spring 2004, Emmanuel Candès told the present author about his ideas for using the partial Fourier ensemble in “undersampled imaging”; some of these were published in [13]; see also the presentation [14]. More recently, Candès informed the present author of the results in [47] referred to above. It is a pleasure to acknowledge the inspiring nature of these conversations. The author would also like to thank Anna Gilbert for telling him about her work [12] finding the B-best Fourier coefficients by nonadaptive sampling, and to thank Emmanuel Candès for conversations clarifying Gilbert’s work. Thanks to the referees for numerous suggestions which helped to clarify the exposition and argumentation. Anna Gilbert offered helpful pointers to the data stream processing literature.

REFERENCES

[1] D. L. Donoho, M. Vetterli, R. A. DeVore, and I. C. Daubechies, “Data compression and harmonic analysis,” IEEE Trans. Inf. Theory, vol. 44, no. 6, pp. 2435–2476, Oct. 1998.
[2] Y. Meyer, Wavelets and Operators. Cambridge, U.K.: Cambridge Univ. Press, 1993.
[3] D. L. Donoho, “Unconditional bases are optimal bases for data compression and for statistical estimation,” Appl. Comput. Harmonic Anal., vol. 1, pp. 100–115, 1993.
[4] ——, “Sparse components of images and optimal atomic decomposition,” Constructive Approx., vol. 17, pp. 353–382, 2001.
[5] A. Pinkus, “n-widths and optimal recovery in approximation theory,” in Proc. Symp. Applied Mathematics, vol. 36, C. de Boor, Ed., Providence, RI, 1986, pp. 51–66.
[6] J. F. Traub and H. Woźniakowski, A General Theory of Optimal Algorithms. New York: Academic, 1980.
[7] D. L. Donoho, “Unconditional bases and bit-level compression,” Appl. Comput. Harmonic Anal., vol. 3, pp. 388–392, 1996.
[8] A. Pinkus, n-Widths in Approximation Theory. New York: Springer-Verlag, 1985.
[9] A. Y. Garnaev and E. D. Gluskin, “On widths of the Euclidean ball” (in English), Sov. Math.–Dokl., vol. 30, pp. 200–203, 1984.
[10] B. S. Kashin, “Diameters of certain finite-dimensional sets in classes of smooth functions,” Izv. Akad. Nauk SSSR, Ser. Mat., vol. 41, no. 2, pp. 334–351, 1977.
[11] D. L. Donoho, “For most large underdetermined systems of linear equations, the minimal ℓ1-norm solution is also the sparsest solution,” Commun. Pure Appl. Math., to be published.
[12] A. C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan, and M. Strauss, “Near-optimal sparse Fourier representations via sampling,” in Proc. 34th ACM Symp. Theory of Computing, Montréal, QC, Canada, May 2002, pp. 152–161.
[13] E. J. Candès, J. Romberg, and T. Tao, “Robust uncertainty principles: Exact signal reconstruction from highly incomplete frequency information,” IEEE Trans. Inf. Theory, to be published.
[14] E. J. Candès, “Robust uncertainty principles and signal recovery,” presented at the 2nd Int. Conf. Computational Harmonic Analysis, Nashville, TN, May 2004.
[15] C. A. Micchelli and T. J. Rivlin, “A survey of optimal recovery,” in Optimal Estimation in Approximation Theory, C. A. Micchelli and T. J. Rivlin, Eds. New York: Plenum, 1977, pp. 1–54.
[16] J. Peetre and G. Sparr, “Interpolation of normed abelian groups,” Ann. Mat. Pura Appl., ser. 4, vol. 92, pp. 217–262, 1972.
[17] J. Bergh and J. Löfström, Interpolation Spaces. An Introduction. Berlin, Germany: Springer-Verlag, 1976.
[18] B. Carl, “Entropy numbers, s-numbers, and eigenvalue problems,” J. Funct. Anal., vol. 41, pp. 290–306, 1981.
[19] G. Pisier, The Volume of Convex Bodies and Banach Space Geometry. Cambridge, U.K.: Cambridge Univ. Press, 1989.
[20] C. Schütt, “Entropy numbers of diagonal operators between symmetric Banach spaces,” J. Approx. Theory, vol. 40, pp. 121–128, 1984.
[21] T. Kühn, “A lower estimate for entropy numbers,” J. Approx. Theory, vol. 110, pp. 120–124, 2001.
[22] S. Gal and C. Micchelli, “Optimal sequential and nonsequential procedures for evaluating a functional,” Appl. Anal., vol. 10, pp. 105–120, 1980.
[23] A. A. Melkman and C. A. Micchelli, “Optimal estimation of linear operators from inaccurate data,” SIAM J. Numer. Anal., vol. 16, pp. 87–105, 1979.
[24] M. A. Kon and E. Novak, “The adaption problem for approximating linear operators,” Bull. Amer. Math. Soc., vol. 23, pp. 159–165, 1990.
[25] E. Novak, “On the power of adaption,” J. Complexity, vol. 12, pp. 199–237, 1996.
[26] S. Chen, D. L. Donoho, and M. A. Saunders, “Atomic decomposition by basis pursuit,” SIAM J. Sci. Comput., vol. 20, no. 1, pp. 33–61, 1999.
[27] D. L. Donoho and X. Huo, “Uncertainty principles and ideal atomic decomposition,” IEEE Trans. Inf. Theory, vol. 47, no. 7, pp. 2845–2862, Nov. 2001.
[28] M. Elad and A. M. Bruckstein, “A generalized uncertainty principle and sparse representations in pairs of bases,” IEEE Trans. Inf. Theory, vol. 48, no. 9, pp. 2558–2567, Sep. 2002.
[29] D. L. Donoho and M. Elad, “Optimally sparse representation from overcomplete dictionaries via ℓ1 norm minimization,” Proc. Natl. Acad. Sci. USA, vol. 100, no. 5, pp. 2197–2202, Mar. 2003.
[30] R. Gribonval and M. Nielsen, “Sparse representations in unions of bases,” IEEE Trans. Inf. Theory, vol. 49, no. 12, pp. 3320–3325, Dec. 2003.
[31] J. J. Fuchs, “On sparse representation in arbitrary redundant bases,” IEEE Trans. Inf. Theory, vol. 50, no. 6, pp. 1341–1344, Jun. 2004.
[32] J. A. Tropp, “Greed is good: Algorithmic results for sparse approximation,” IEEE Trans. Inf. Theory, vol. 50, no. 10, pp. 2231–2242, Oct. 2004.
[33] ——, “Just relax: Convex programming methods for identifying sparse signals in noise,” IEEE Trans. Inf. Theory, vol. 52, no. 3, pp. 1030–1051, Mar. 2006.
[34] D. L. Donoho, M. Elad, and V. Temlyakov, “Stable recovery of sparse overcomplete representations in the presence of noise,” IEEE Trans. Inf. Theory, vol. 52, no. 1, pp. 6–18, Jan. 2006.
[35] D. L. Donoho, “For most underdetermined systems of linear equations, the minimal ℓ1-norm near-solution approximates the sparsest near-solution,” Commun. Pure Appl. Math., to be published.
[36] I. C. Daubechies, Ten Lectures on Wavelets. Philadelphia, PA: SIAM, 1992.
[37] S. Mallat, A Wavelet Tour of Signal Processing. San Diego, CA: Academic, 1998.
[38] A. Cohen, R. DeVore, P. Petrushev, and H. Xu, “Nonlinear approximation and the space BV(R²),” Amer. J. Math., vol. 121, pp. 587–628, 1999.
[39] E. J. Candès and D. L. Donoho, “Curvelets—A surprisingly effective nonadaptive representation for objects with edges,” in Curves and Surfaces, C. Rabut, A. Cohen, and L. L. Schumaker, Eds. Nashville, TN: Vanderbilt Univ. Press, 2000.
[40] ——, “New tight frames of curvelets and optimal representations of objects with piecewise C² singularities,” Commun. Pure Appl. Math., vol. LVII, pp. 219–266, 2004.
[41] S. J. Szarek, “Spaces with large distance to ℓ∞ⁿ and random matrices,” Amer. J. Math., vol. 112, pp. 819–842, 1990.
[42] ——, “Condition numbers of random matrices,” J. Complexity, vol. 7, pp. 131–149, 1991.
[43] A. Dvoretsky, “Some results on convex bodies and Banach spaces,” in Proc. Symp. Linear Spaces, Jerusalem, Israel, 1961, pp. 123–160.
[44] T. Figiel, J. Lindenstrauss, and V. D. Milman, “The dimension of almost-spherical sections of convex bodies,” Acta Math., vol. 139, pp. 53–94, 1977.
[45] V. D. Milman and G. Schechtman, Asymptotic Theory of Finite-Dimensional Normed Spaces (Lecture Notes in Mathematics). Berlin, Germany: Springer-Verlag, 1986, vol. 1200.
[46] D. Pollard, Empirical Processes: Theory and Applications (NSF-CBMS Regional Conference Series in Probability and Statistics, vol. 2). Hayward, CA: Inst. Math. Statist., 1990.
[47] E. J. Candès and T. Tao, “Near-optimal signal recovery from random projections: Universal encoding strategies,” Applied and Computational Mathematics, Calif. Inst. Technol., Tech. Rep., 2004.
