Compressed Sensing

David L. Donoho
Department of Statistics
Stanford University

September 14, 2004

Abstract
Suppose x is an unknown vector in $\mathbb{R}^m$ (depending on context, a digital image or signal);
we plan to acquire data and then reconstruct. Nominally this ‘should’ require m samples. But
suppose we know a priori that x is compressible by transform coding with a known transform,
and we are allowed to acquire data about x by measuring n general linear functionals – rather
than the usual pixels. If the collection of linear functionals is well-chosen, and we allow for
a degree of reconstruction error, the size of n can be dramatically smaller than the size m
usually considered necessary. Thus, certain natural classes of images with m pixels need
only $n = O(m^{1/4} \log^{5/2}(m))$ nonadaptive nonpixel samples for faithful recovery, as opposed
to the usual m pixel samples.
Our approach is abstract and general. We suppose that the object has a sparse
representation in some orthonormal basis (e.g. wavelet, Fourier) or tight frame (e.g. curvelet,
Gabor), meaning that the coefficients belong to an $\ell_p$ ball for $0 < p \le 1$. This implies
that the N most important coefficients in the expansion allow a reconstruction with $\ell_2$ error
$O(N^{1/2-1/p})$. It is possible to design $n = O(N \log(m))$ nonadaptive measurements which
contain the information necessary to reconstruct any such object with accuracy comparable
to that which would be possible if the N most important coefficients of that object were
directly observable. Moreover, a good approximation to those N important coefficients may
be extracted from the n measurements by solving a convenient linear program, called by the
name Basis Pursuit in the signal processing literature. The nonadaptive measurements have
the character of ‘random’ linear combinations of basis/frame elements.
These results are developed in a theoretical framework based on the theory of optimal re-
covery, the theory of n-widths, and information-based complexity. Our basic results concern
properties of $\ell_p$ balls in high-dimensional Euclidean space in the case $0 < p \le 1$. We estimate
the Gel’fand n-widths of such balls, give a criterion for near-optimal subspaces for Gel’fand
n-widths, show that ‘most’ subspaces are near-optimal, and show that convex optimization
can be used for processing information derived from these near-optimal subspaces.
The techniques for deriving near-optimal subspaces include the use of almost-spherical
sections in Banach space theory.

Key Words and Phrases. Integrated Sensing and Processing. Optimal Recovery. Information-
Based Complexity. Gel’fand n-widths. Adaptive Sampling. Sparse Solution of Linear equations.
Basis Pursuit. Minimum $\ell_1$ norm decomposition. Almost-Spherical Sections of Banach Spaces.

1 Introduction
As our modern technology-driven civilization acquires and exploits ever-increasing amounts of
data, ‘everyone’ now knows that most of the data we acquire ‘can be thrown away’ with almost no
perceptual loss – witness the broad success of lossy compression formats for sounds, images and
specialized technical data. The phenomenon of ubiquitous compressibility raises very natural
questions: why go to so much effort to acquire all the data when most of what we get will be
thrown away? Can’t we just directly measure the part that won’t end up being thrown away?
In this paper we design compressed data acquisition protocols which perform as if it were
possible to directly acquire just the important information about the signals/images – in ef-
fect not acquiring that part of the data that would eventually just be ‘thrown away’ by lossy
compression. Moreover, the protocols are nonadaptive and parallelizable; they do not require
knowledge of the signal/image to be acquired in advance - other than knowledge that the data
will be compressible - and do not attempt any ‘understanding’ of the underlying object to guide
an active or adaptive sensing strategy. The measurements made in the compressed sensing
protocol are holographic – thus, not simple pixel samples – and must be processed nonlinearly.
In specific applications this principle might enable dramatically reduced measurement time,
dramatically reduced sampling rates, or reduced use of Analog-to-Digital converter resources.

1.1 Transform Compression Background


Our treatment is abstract and general, but depends on one specific assumption which is known
to hold in many settings of signal and image processing: the principle of transform sparsity. We
suppose that the object of interest is a vector $x \in \mathbb{R}^m$, which can be a signal or image with m
samples or pixels, and that there is an orthonormal basis $(\psi_i : i = 1, \dots, m)$ for $\mathbb{R}^m$ which can
be for example an orthonormal wavelet basis, a Fourier basis, or a local Fourier basis, depending
on the application. (As explained later, the extension to tight frames such as curvelet or Gabor
frames comes for free.) The object has transform coefficients $\theta_i = \langle x, \psi_i \rangle$, and these are assumed
sparse in the sense that, for some $0 < p < 2$ and for some $R > 0$:

$$\|\theta\|_p \equiv \Big( \sum_i |\theta_i|^p \Big)^{1/p} \le R. \qquad (1.1)$$

Such constraints are actually obeyed on natural classes of signals and images; this is the primary
reason for the success of standard compression tools based on transform coding [10]. To fix ideas,
we mention two simple examples of $\ell_p$ constraint.
• Bounded Variation model for images. Here image brightness is viewed as an underlying
function $f(x, y)$ on the unit square $0 \le x, y \le 1$ which obeys (essentially)

$$\int_0^1 \int_0^1 |\nabla f| \, dx \, dy \le R.$$

The digital data of interest consists of $m = n^2$ pixel samples of f produced by averaging
over $1/n \times 1/n$ pixels. We take a wavelet point of view; the data are seen as a superposition
of contributions from various scales. Let $x^{(j)}$ denote the component of the data at scale j,
and let $(\psi_i^j)$ denote the orthonormal basis of wavelets at scale j, containing $3 \cdot 4^j$ elements.
The corresponding coefficients obey $\|\theta^{(j)}\|_1 \le 4R$.

• Bump Algebra model for spectra. Here a spectrum (e.g. mass spectrum or NMR spectrum)
is modelled as digital samples $(f(i/n))$ of an underlying function f on the real line which is
a superposition of so-called spectral lines of varying positions, amplitudes, and linewidths.
Formally,

$$f(t) = \sum_{i=1}^{\infty} a_i \, g((t - t_i)/s_i).$$

Here the parameters $t_i$ are line locations, $a_i$ are amplitudes/polarities and $s_i$ are linewidths,
and g represents a lineshape, for example the Gaussian, although other profiles could be
considered. We assume the constraint $\sum_i |a_i| \le R$, which in applications represents
an energy or total mass constraint. Again we take a wavelet viewpoint, this time specifically
using smooth wavelets. The data can be represented as a superposition of contributions
from various scales. Let $x^{(j)}$ denote the component of the data at scale j, and let
$(\psi_i^j)$ denote the orthonormal basis of wavelets at scale j, containing $2^j$ elements. The
corresponding coefficients again obey $\|\theta^{(j)}\|_1 \le c \cdot R \cdot 2^{-j/2}$ [33].

While in these two examples the $\ell_1$ constraint appeared, other $\ell_p$ constraints can appear
naturally as well; see below. For some readers the use of $\ell_p$ norms with p < 1 may seem initially
strange; it is now well-understood that the $\ell_p$ norms with such small p are natural mathematical
measures of sparsity [11, 13]. As p decreases below 1, more and more sparsity is being required.
Also, from this viewpoint, an $\ell_p$ constraint based on p = 2 requires no sparsity at all.
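As a small numerical illustration of why small p measures sparsity, the sketch below compares two vectors of equal $\ell_2$ norm, one concentrated on a single coefficient and one spread evenly over all coefficients. The helper name `lp_quasinorm` and the length m = 100 are choices made for this illustration only and do not come from the paper.

```python
def lp_quasinorm(theta, p):
    """Return (sum_i |theta_i|^p)^(1/p); this is only a quasi-norm when 0 < p < 1."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

m = 100  # illustrative length, not from the paper

# Two vectors with the same l2 norm (equal to 1):
sparse = [1.0] + [0.0] * (m - 1)   # all energy in one coefficient
spread = [m ** -0.5] * m           # energy spread evenly over m coefficients

for p in (2.0, 1.0, 0.5):
    # For the spread vector the value is m^(1/p - 1/2), which grows as p shrinks;
    # for the sparse vector it stays at 1.
    print(p, lp_quasinorm(sparse, p), lp_quasinorm(spread, p))
```

At p = 2 the two vectors are indistinguishable, while for p = 1 and p = 1/2 a bound on the $\ell_p$ quasi-norm increasingly penalizes the spread vector relative to the sparse one.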
Note that in each of these examples, we also allowed for separating the object of interest
into subbands, each one of which obeys an $\ell_p$ constraint. In practice below, we stick with the
view that the object θ of interest is a coefficient vector obeying the constraint (1.1), which may
mean, from an application viewpoint, that our methods correspond to treating various subbands
separately, as in these examples.
The key implication of the $\ell_p$ constraint is sparsity of the transform coefficients. Indeed, we
have trivially that, if $\theta_N$ denotes the vector θ with everything except the N largest coefficients
set to 0,

$$\|\theta - \theta_N\|_2 \le \zeta_{2,p} \cdot \|\theta\|_p \cdot (N + 1)^{1/2 - 1/p}, \qquad N = 0, 1, 2, \dots, \qquad (1.2)$$

with a constant $\zeta_{2,p}$ depending only on $p \in (0, 2)$. Thus, for example, to approximate θ with
error ε, we need to keep only the $N \asymp \varepsilon^{2p/(p-2)}$ biggest terms in θ.
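The N-term approximation bound (1.2) can be checked numerically. The sketch below is an illustration under stated assumptions: the test vector (power-law decay with random signs), the helper names, and the choice of constant 1 for $\zeta_{2,p}$ are all introduced here. The constant 1 is what the usual rearrangement argument gives (the k-th largest coefficient is at most $\|\theta\|_p k^{-1/p}$); the paper itself leaves $\zeta_{2,p}$ unspecified.

```python
import random

def lp_quasinorm(theta, p):
    """(sum_i |theta_i|^p)^(1/p)."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def tail_l2_error(theta, N):
    """l2 norm of theta after zeroing its N largest-magnitude entries."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    return sum(t * t for t in mags[N:]) ** 0.5

def bound_holds(theta, p, N, zeta=1.0):
    """Check inequality (1.2) for one N, with zeta as the assumed constant."""
    rhs = zeta * lp_quasinorm(theta, p) * (N + 1) ** (0.5 - 1.0 / p)
    return tail_l2_error(theta, N) <= rhs * (1 + 1e-12)  # small float tolerance

rng = random.Random(0)
# Hypothetical test vector: power-law decaying coefficients with random signs.
theta = [rng.choice((-1, 1)) * (i + 1) ** -2.0 for i in range(500)]
rng.shuffle(theta)

all_ok = all(
    bound_holds(theta, p, N)
    for p in (0.5, 1.0, 1.5)
    for N in range(0, 500, 7)
)
print(all_ok)
```

The check covers only one vector and a grid of N; it is meant to make the shape of (1.2) concrete, not to establish it.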

1.2 Optimal Recovery/Information-Based Complexity Background


Our question now becomes: if x is an unknown signal whose transform coefficient vector θ
obeys (1.1), can we make a reduced number $n \ll m$ of measurements which will allow faithful
reconstruction of x? Such questions have been discussed (for other types of assumptions about
x) under the names of Optimal Recovery [39] and Information-Based Complexity [46]; we now
adopt their viewpoint, and partially adopt their notation, without making a special effort to
be really orthodox. We use ‘OR/IBC’ as a generic label for work taking place in those fields,
admittedly being less than encyclopedic about various scholarly contributions.
We have a class X of possible objects of interest, and are interested in designing an
information operator $I_n : X \mapsto \mathbb{R}^n$ that samples n pieces of information about x, and an algorithm
$A_n : \mathbb{R}^n \mapsto \mathbb{R}^m$ that offers an approximate reconstruction of x. Here the information operator
takes the form

$$I_n(x) = (\langle \xi_1, x \rangle, \dots, \langle \xi_n, x \rangle)$$

where the $\xi_i$ are sampling kernels, not necessarily sampling pixels or other simple features of
the signal; however, they are nonadaptive, i.e. fixed independently of x. The algorithm $A_n$ is
an unspecified, possibly nonlinear reconstruction operator.

We are interested in the $\ell_2$ error of reconstruction and in the behavior of optimal information
and optimal algorithms. Hence we consider the minimax $\ell_2$ error as a standard of comparison:

$$E_n(X) = \inf_{A_n, I_n} \sup_{x \in X} \|x - A_n(I_n(x))\|_2.$$

So here, all possible methods of nonadaptive sampling are allowed, and all possible methods of
reconstruction are allowed.
In our application, the class X of objects of interest is the set of objects $x = \sum_i \theta_i \psi_i$ where
$\theta = \theta(x)$ obeys (1.1) for a given p and R. Denote then

$$X_{p,m}(R) = \{x : \|\theta(x)\|_p \le R\}.$$

Our goal is to evaluate $E_n(X_{p,m}(R))$ and to have practical schemes which come close to attaining
it.

1.3 Four Surprises


Here is the main quantitative phenomenon of interest for this article.

Theorem 1 Let $(n, m_n)$ be a sequence of problem sizes with $n < m_n$, $n \to \infty$, and $m_n \sim A n^\gamma$,
$\gamma > 1$, $A > 0$. Then for $0 < p \le 1$ there is $C_p = C_p(A, \gamma) > 0$ so that

$$E_n(X_{p,m}(R)) \le C_p \cdot R \cdot (n/\log(m_n))^{1/2 - 1/p}, \qquad n \to \infty. \qquad (1.3)$$

We find this surprising in four ways. First, compare (1.3) with (1.2). We see that the forms
are similar, under the calibration $n = N \log(m_n)$. In words: the quality of approximation to x
which could be gotten by using the N biggest transform coefficients can be gotten by using the
$n \approx N \log(m)$ pieces of nonadaptive information provided by $I_n$. The surprise is that we would
not know in advance which transform coefficients are likely to be the important ones in this
approximation, yet the optimal information operator $I_n$ is nonadaptive, depending at most on
the class $X_{p,m}(R)$ and not on the specific object. In some sense this nonadaptive information is
just as powerful as knowing the N best transform coefficients.
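To give a feel for the calibration n = N log(m), the following sketch tabulates it for a few values of N. The function name and the sample sizes are hypothetical, the natural logarithm is assumed, and the unspecified constant $C_p$ from Theorem 1 is ignored, so the numbers indicate orders of magnitude only.

```python
import math

def calibrated_measurements(N, m):
    """n = N * log(m): the calibration relating (1.3) to (1.2), constants ignored."""
    return N * math.log(m)

m = 10**6  # illustrative signal length, not from the paper
for N in (10, 100, 1000):
    n = calibrated_measurements(N, m)
    # n/m is the fraction of the nominal m samples that would be taken
    print(f"N={N:5d}  n~{n:9.0f}  n/m={n / m:.4f}")
```

Even for N = 1000 significant coefficients, the calibrated n remains a small fraction of m in this example.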
This seems even more surprising when we note that for objects $x \in X_{p,m}(R)$, the transform
representation is the optimal one: no other representation can do as well at characterising x
by a few coefficients [11, 12]. Surely then, one imagines, the sampling kernels $\xi_i$ underlying
the optimal information operator must be simply measuring individual transform coefficients?
Actually, no: the information operator is measuring very complex holographic functionals which
in some sense mix together all the coefficients in a big soup. Compare (6.1) below.
Another surprise is that, if we enlarged our class of information operators to allow adaptive
ones, e.g. operators in which certain measurements are made in response to earlier
measurements, we could scarcely do better. Define the minimax error under adaptive information $E_n^{\mathrm{Adapt}}$
allowing adaptive operators

$$I_n^A = (\langle \xi_1, x \rangle, \langle \xi_{2,x}, x \rangle, \dots, \langle \xi_{n,x}, x \rangle),$$

where, for $i \ge 2$, each kernel $\xi_{i,x}$ is allowed to depend on the information $\langle \xi_{j,x}, x \rangle$ gathered at
previous stages $1 \le j < i$. Formally setting

$$E_n^{\mathrm{Adapt}}(X) = \inf_{A_n, I_n^A} \sup_{x \in X} \|x - A_n(I_n^A(x))\|_2,$$

we have

Theorem 2 For $0 < p \le 1$, and $C_p > 0$

$$E_n(X_{p,m}(R)) \le 2^{1/p} \cdot E_n^{\mathrm{Adapt}}(X_{p,m}(R)), \qquad R > 0.$$

So adaptive information is of minimal help – despite the quite natural expectation that an
adaptive method ought to be able to iteratively ‘localize’ and then ‘close in’ on the ‘big
coefficients’.
An additional surprise is that, in the already-interesting case p = 1, Theorems 1 and 2
are easily derivable from known results in OR/IBC and approximation theory! However, the
derivations are indirect; so although they have what seem to the author to be fairly important
implications, very little seems known at present about good nonadaptive information operators
or about concrete algorithms matched to them.
Our goal in this paper is to give direct arguments which cover the case 0 < p ≤ 1 of highly
compressible objects, to give direct information about near-optimal information operators and
about concrete, computationally tractable algorithms for using this near-optimal information.

1.4 Geometry and Widths


From our viewpoint, the phenomenon described in Theorem 1 concerns the geometry of
high-dimensional convex and nonconvex ‘balls’. To see the connection, note that the class $X_{p,m}(R)$ is
the image, under orthogonal transformation, of an $\ell_p$ ball. If p = 1 this is convex and symmetric
about the origin, as well as being orthosymmetric with respect to the axes provided by the
wavelet basis; if p < 1, this is again symmetric about the origin and orthosymmetric, while not
being convex, but still starshaped.
To develop this geometric viewpoint further, we consider two notions of n-width; see [39].

Definition 1.1 The Gel’fand n-width of X with respect to the $\ell_2^m$ norm is defined as

$$d^n(X; \ell_2^m) = \inf_{V_n} \sup\{\|x\|_2 : x \in V_n^\perp \cap X\},$$

where the infimum is over n-dimensional linear subspaces of $\mathbb{R}^m$, and $V_n^\perp$ denotes the
ortho-complement of $V_n$ with respect to the standard Euclidean inner product.

In words, we look for a subspace such that ‘trapping’ x ∈ X in that subspace causes x to be
small. Our interest in Gel’fand n-widths derives from an equivalence between optimal recovery
for nonadaptive information and such n-widths, well-known in the p = 1 case [39], and in the
present setting extending as follows:

Theorem 3 For $0 < p \le 1$,

$$d^n(X_{p,m}(R)) \le E_n(X_{p,m}(R)) \le 2^{1/p - 1} \cdot d^n(X_{p,m}(R)), \qquad R > 0. \qquad (1.4)$$

Thus the Gel’fand n-widths either exactly or nearly equal the value of optimal information.
Ultimately, the bracketing with constant $2^{1/p-1}$ will be for us just as good as equality, owing
to the unspecified constant factors in (1.3). We will typically only be interested in near-optimal
performance, i.e. in obtaining $E_n$ to within constant factors.
It is relatively rare to see the Gel’fand n-widths studied directly [40]; more commonly one
sees results about the Kolmogorov n-widths:

Definition 1.2 Let $X \subset \mathbb{R}^m$ be a bounded set. The Kolmogorov n-width of X with respect to the
$\ell_2^m$ norm is defined as

$$d_n(X; \ell_2^m) = \inf_{V_n} \sup_{x \in X} \inf_{y \in V_n} \|x - y\|_2,$$

where the infimum is over n-dimensional linear subspaces of $\mathbb{R}^m$.

In words, $d_n$ measures the quality of approximation of X possible by n-dimensional subspaces
$V_n$.
In the case p = 1, there is an important duality relationship between Kolmogorov widths
and Gel’fand widths which allows us to infer properties of $d^n$ from published results on $d_n$. To
state it, let $d_n(X, \ell_q)$ be defined in the obvious way, based on approximation in $\ell_q$ rather than
$\ell_2$ norm. Also, for given $p \ge 1$ and $q \ge 1$, let $p'$ and $q'$ be the standard dual indices, defined by
$1/p' = 1 - 1/p$ and $1/q' = 1 - 1/q$. Also, let $b_{p,m}$ denote the standard unit ball of $\ell_p^m$. Then [40]

$$d_n(b_{p,m}; \ell_q^m) = d^n(b_{q',m}, \ell_{p'}^m). \qquad (1.5)$$

In particular

$$d_n(b_{2,m}; \ell_\infty^m) = d^n(b_{1,m}, \ell_2^m).$$

The asymptotic properties of the left-hand side have been determined by Garnaev and Gluskin
[24]. This follows major work by Kashin [28], who developed a slightly weaker version of this
result in the course of determining the Kolmogorov n-widths of Sobolev spaces. See the original
papers, or Pinkus’s book [40] for more details.

Theorem 4 (Kashin; Garnaev and Gluskin) For all n and m > n

$$d_n(b_{2,m}, \ell_\infty^m) \asymp (n/(1 + \log(m/n)))^{-1/2}.$$

Theorem 1 now follows in the case p = 1 by applying Theorem 4 (KGG) with the duality formula (1.5)
and the equivalence formula (1.4). The case 0 < p < 1 of Theorem 1 does not allow use of
duality and the whole range 0 < p ≤ 1 is approached differently in this paper.

1.5 Mysteries ...


Because of the indirect manner by which the KGG result implies Theorem 1, we really don’t
learn much about the phenomenon of interest in this way. The arguments of Kashin, Garnaev
and Gluskin show that there exist near-optimal n-dimensional subspaces for the Kolmogorov
widths; they arise as the nullspaces of certain matrices with entries ±1 which are known to exist
by counting the number of matrices lacking certain properties, the total number of matrices
with ±1 entries, and comparing. The interpretability of this approach is limited.
The implicitness of the information operator is matched by the abstractness of the recon-
struction algorithm. Based on OR/IBC theory we know that the so-called central algorithm is
optimal. This “algorithm” asks us to consider, for given information $y_n = I_n(x)$, the collection
of all objects x which could have given rise to the data $y_n$:

$$I_n^{-1}(y_n) = \{x : I_n(x) = y_n\}.$$

Defining now the center of a set S

$$\mathrm{center}(S) = \operatorname{arg\,inf}_c \sup_{x \in S} \|x - c\|_2,$$

the central algorithm is

$$\hat{x}_n^* = \mathrm{center}(I_n^{-1}(y_n) \cap X_{p,m}(R)),$$

and it obeys, when the information $I_n$ is optimal,

$$\sup_{x \in X_{p,m}(R)} \|x - \hat{x}_n^*\|_2 = E_n(X_{p,m}(R));$$

see Section 3 below.


This abstract viewpoint unfortunately does not translate into a practical approach (at least
in the case of the $X_{p,m}(R)$, $0 < p \le 1$). The set $I_n^{-1}(y_n) \cap X_{p,m}(R)$ is a section of the ball
$X_{p,m}(R)$, and finding the center of this section does not correspond to a standard tractable
computational problem. Moreover, this assumes we know p and R, which would typically not be
the case.

1.6 Results
Our paper develops two main types of results.

• Near-Optimal Information. We directly consider the problem of near-optimal subspaces
for Gel’fand n-widths of $X_{p,m}(R)$, and introduce 3 structural conditions (CS1-CS3) on
an n-by-m matrix which imply that its nullspace is near-optimal. We show that the vast
majority of n-subspaces of $\mathbb{R}^m$ are near-optimal, and random sampling yields near-optimal
information operators with overwhelmingly high probability.

• Near-Optimal Algorithm. We study a simple nonlinear reconstruction algorithm: simply
minimize the $\ell_1$ norm of the coefficients θ subject to satisfying the measurements. This
has been studied in the signal processing literature under the name Basis Pursuit; it can
be computed by linear programming. We show that this method gives near-optimal results
for all $0 < p \le 1$.

In short, we provide a large supply of near-optimal information operators and a near-optimal
reconstruction procedure based on linear programming, which, perhaps unexpectedly, works even
for the non-convex case $0 < p < 1$.
For a taste of the type of result we obtain, consider a specific information/algorithm combi-
nation.

• CS Information. Let Φ be an n × m matrix generated by randomly sampling the columns,
with different columns iid random uniform on $S^{n-1}$. With overwhelming probability for
large n, Φ has properties CS1-CS3 detailed below; assume we have achieved such a
favorable draw. Let Ψ be the m × m basis matrix with basis vector $\psi_i$ as the i-th column. The
CS Information operator $I_n^{CS}$ is the n × m matrix $\Phi\Psi^T$.

• $\ell_1$-minimization. To reconstruct from CS Information, we solve the convex optimization
problem

$$(L_1) \qquad \min \|\Psi^T x\|_1 \quad \text{subject to} \quad y_n = I_n^{CS}(x).$$

In words, we look for the object $\hat{x}_1$ having coefficients with smallest $\ell_1$ norm that is
consistent with the information $y_n$.
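The information/algorithm pair just described can be sketched numerically. The code below is a minimal illustration, not the author's implementation: it assumes Ψ is the identity (so x = θ), draws columns uniformly on the sphere by normalizing Gaussian vectors, uses a k-sparse test vector rather than a general member of an $\ell_p$ ball, and solves the $\ell_1$ problem with SciPy's `linprog` through the standard split θ = u − v with u, v ≥ 0. The sizes n, m, k and the function name `basis_pursuit` are hypothetical choices.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, y):
    """Solve min ||theta||_1 subject to Phi @ theta = y as a linear program.

    Uses the split theta = u - v with u, v >= 0, so that the objective
    sum(u) + sum(v) equals ||theta||_1 at an optimum.
    """
    n, m = Phi.shape
    cost = np.ones(2 * m)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None), method="highs")
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m] - res.x[m:]

rng = np.random.default_rng(0)
n, m, k = 40, 100, 5                      # illustrative sizes, not from the paper

# Columns iid uniform on the sphere S^{n-1}: normalize Gaussian vectors.
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)

# Psi is assumed to be the identity here, so the object x equals theta.
theta = np.zeros(m)
support = rng.choice(m, size=k, replace=False)
theta[support] = rng.standard_normal(k)

y = Phi @ theta                           # the n nonadaptive measurements
theta_hat = basis_pursuit(Phi, y)
err = np.linalg.norm(theta - theta_hat)
print(f"l2 reconstruction error: {err:.2e}")
```

For a general orthonormal Ψ one would apply the same routine to the matrix Φ acting on coefficients and then set x = Ψθ. The LP reformulation doubles the number of variables; this is a design choice of the sketch that keeps the problem in the standard form accepted by `linprog`.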

To evaluate the quality of an information operator $I_n$, set

$$E_n(I_n, X) \equiv \inf_{A_n} \sup_{x \in X} \|x - A_n(I_n(x))\|_2.$$

To evaluate the quality of a combined algorithm/information pair $(A_n, I_n)$, set

$$E_n(A_n, I_n, X) \equiv \sup_{x \in X} \|x - A_n(I_n(x))\|_2.$$

Theorem 5 Let $n, m_n$ be a sequence of problem sizes obeying $n < m_n \sim A n^\gamma$, $A > 0$, $\gamma \ge 1$;
and let $I_n^{CS}$ be a corresponding sequence of operators deriving from CS-matrices with underlying
parameters $\eta_i$ and ρ (see Section 2 below). Let $0 < p \le 1$. There exists $C = C(p, (\eta_i), \rho, A, \gamma) > 0$
so that $I_n^{CS}$ is near-optimal:

$$E_n(I_n^{CS}, X_{p,m}(R)) \le C \cdot E_n(X_{p,m}(R)), \qquad R > 0, \; n > n_0.$$

Moreover the algorithm $A_{1,n}$ delivering the solution to $(L_1)$ is near-optimal:

$$E_n(A_{1,n}, I_n^{CS}, X_{p,m}(R)) \le C \cdot E_n(X_{p,m}(R)), \qquad R > 0, \; n > n_0.$$

Thus for large n we have a simple description of near-optimal information and a tractable
near-optimal reconstruction algorithm.

1.7 Potential Applications


To see the potential implications, recall first the Bump Algebra model for spectra. In this
context, our result says that, for a spectrometer based on the information operator $I_n$ in Theorem
5, it is really only necessary to take $n = O(\sqrt{m} \log(m))$ measurements to get an accurate
reconstruction of such spectra, rather than the nominal m measurements. However, they must
then be processed nonlinearly.
Consider the Bounded Variation model for Images. In that context, a result paralleling
Theorem 5 says that for a specialized imaging device based on a near-optimal information
operator it is really only necessary to take $n = O(\sqrt{m} \log(m))$ measurements to get an accurate
reconstruction of images with m pixels, rather than the nominal m measurements.
The calculations underlying these results will be given below, along with a result showing
that for cartoon-like images (which may model certain kinds of simple natural imagery, like
brains) the number of measurements for an m-pixel image is only $O(m^{1/4} \log(m))$.

1.8 Contents
Section 2 introduces a set of conditions CS1-CS3 for near-optimality of an information operator.
Section 3 considers abstract near-optimal algorithms, and proves Theorems 1-3. Section 4 shows
that solving the convex optimization problem $(L_1)$ gives a near-optimal algorithm whenever
$0 < p \le 1$. Section 5 points out immediate extensions to weak-$\ell_p$ conditions and to tight frames.
Section 6 sketches potential implications in image, signal, and array processing. Section 7 shows
that conditions CS1-CS3 are satisfied for “most” information operators.
Finally, in Section 8 below we note the ongoing work by two groups (Gilbert et al. [25] and
Candès et al. [4, 5]), which, although not written in the n-widths/OR/IBC tradition, implies (as
we explain) closely related results.

2 Information
Consider information operators constructed as follows. With Ψ the orthogonal matrix whose
columns are the basis elements $\psi_i$, and with certain n-by-m matrices Φ obeying conditions
specified below, we construct corresponding information operators $I_n = \Phi\Psi^T$. Everything will
be completely transparent to the orthogonal matrix Ψ and hence we will assume that Ψ is the
identity throughout this section.
In view of the relation between Gel’fand n-widths and minimax errors, we may work with
n-widths. We define the width of a set X relative to an operator Φ:

$$w(\Phi, X) \equiv \sup \|x\|_2 \quad \text{subject to} \quad x \in X \cap \mathrm{nullspace}(\Phi). \qquad (2.1)$$

In words, this is the radius of the section of X cut out by nullspace(Φ). In general, the Gel’fand
n-width is the smallest value of w obtainable by choice of Φ:

$$d^n(X) = \inf\{w(\Phi, X) : \Phi \text{ is an } n \times m \text{ matrix}\}.$$

We will show for all large n and m the existence of n by m matrices Φ where

$$w(\Phi, b_{p,m}) \le C \cdot d^n(b_{p,m}),$$

with C dependent at most on p and the ratio log(m)/ log(n).

2.1 Conditions CS1-CS3


In the following, with $J \subset \{1, \dots, m\}$ let $\Phi_J$ denote a submatrix of Φ obtained by selecting
just the indicated columns of Φ. We let $V_J$ denote the range of $\Phi_J$ in $\mathbb{R}^n$. Finally, we consider
a family of quotient norms on $\mathbb{R}^n$; with $\ell_1(J^c)$ denoting the $\ell_1$ norm on vectors indexed by
$\{1, \dots, m\} \setminus J$,

$$Q_{J^c}(v) = \min \|\theta\|_{\ell_1(J^c)} \quad \text{subject to} \quad \Phi_{J^c}\theta = v.$$

These describe the minimal $\ell_1$-norm representation of v achievable using only specified subsets
of columns of Φ.
We describe a set of three conditions to impose on an n × m matrix Φ, indexed by strictly
positive parameters $\eta_i$, $1 \le i \le 3$, and ρ.

CS1 The minimal singular value of $\Phi_J$ exceeds $\eta_1 > 0$ uniformly in $|J| < \rho n/\log(m)$.

CS2 On each subspace $V_J$ we have the inequality

$$\|v\|_1 \ge \eta_2 \cdot \sqrt{n} \cdot \|v\|_2, \qquad \forall v \in V_J,$$

uniformly in $|J| < \rho n/\log(m)$.

CS3 On each subspace $V_J$

$$Q_{J^c}(v) \ge \eta_3 / \sqrt{\log(m/n)} \cdot \|v\|_1, \qquad v \in V_J,$$

uniformly in $|J| < \rho n/\log(m)$.

CS1 demands a certain quantitative degree of linear independence among all small groups of
columns. CS2 says that linear combinations of small groups of columns give vectors that look
much like random noise, at least as far as the comparison of $\ell_1$ and $\ell_2$ norms is concerned. It
will be implied by a geometric fact: every $V_J$ slices through the $\ell_1^m$ ball in such a way that the
resulting convex section is actually close to spherical. CS3 says that for every vector in some
$V_J$, the associated quotient norm $Q_{J^c}$ is never dramatically better than the simple $\ell_1$ norm on
$\mathbb{R}^n$.
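Conditions CS1 and CS2 can be probed empirically on a random draw. The sketch below is only a spot check under stated assumptions: the sizes n and m, the value ρ = 0.5, and the number of trials are hypothetical; only randomly chosen subsets J and randomly chosen vectors v in $V_J$ are examined, whereas the conditions themselves are uniform over all J and all v. It therefore cannot certify CS1 or CS2, and it does not touch CS3.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 200, 800                           # illustrative sizes, not from the paper
rho = 0.5                                 # hypothetical value of the parameter rho
k = int(rho * n / np.log(m))              # subset size with |J| < rho * n / log(m)

# Random matrix with unit-norm columns (normalized Gaussian vectors).
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)

min_sv, min_ratio = np.inf, np.inf
for _ in range(200):                      # random subsets only, not all J
    J = rng.choice(m, size=k, replace=False)
    Phi_J = Phi[:, J]
    # CS1: smallest singular value of Phi_J (singular values come sorted descending)
    min_sv = min(min_sv, np.linalg.svd(Phi_J, compute_uv=False)[-1])
    # CS2: ||v||_1 / (sqrt(n) * ||v||_2) for one random v in V_J = range(Phi_J)
    v = Phi_J @ rng.standard_normal(k)
    min_ratio = min(min_ratio, np.abs(v).sum() / (np.sqrt(n) * np.linalg.norm(v)))

print("smallest singular value seen:", min_sv)
print("smallest l1/(sqrt(n) l2) ratio seen:", min_ratio)
```

The ratio in the CS2 check can never exceed 1 (by the Cauchy-Schwarz inequality), so values bounded away from 0 across trials are what the condition asks for, with $\eta_2$ playing the role of the lower bound.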
It turns out that matrices satisfying these conditions are ubiquitous for large n and m when
we choose the $\eta_i$ and ρ properly. Of course, for any finite n and m, all norms are equivalent and
almost any arbitrary matrix can trivially satisfy these conditions simply by taking $\eta_1$ very small
and $\eta_2$, $\eta_3$ very large. However, the definition of ‘very small’ and ‘very large’ would have to
depend on n for this trivial argument to work. We claim something deeper is true: it is possible
to choose $\eta_i$ and ρ independent of n and of $m \le A n^\gamma$.
Consider the set

$$\underbrace{S^{n-1} \times \cdots \times S^{n-1}}_{m \text{ terms}}$$

of all n × m matrices having unit-normalized columns. On this set, measure frequency of
occurrence with the natural uniform measure (the product measure, uniform on each factor
$S^{n-1}$).

Theorem 6 Let $(n, m_n)$ be a sequence of problem sizes with $n \to \infty$, $n < m_n$, and $m_n \sim A n^\gamma$,
$A > 0$ and $\gamma \ge 1$. There exist $\eta_i > 0$ and $\rho > 0$ depending only on A and γ so that, for each
$\delta > 0$ the proportion of all n × m matrices Φ satisfying CS1-CS3 with parameters $(\eta_i)$ and ρ
eventually exceeds $1 - \delta$.

We will discuss and prove this in Section 7 below.
For later use, we will leave the constants $\eta_i$ and ρ implicit and speak simply of CS
matrices, meaning matrices that satisfy the given conditions with values of parameters of the type
described by this theorem, namely, with $\eta_i$ and ρ not depending on n and permitting the above
ubiquity.

2.2 Near-Optimality of CS-Matrices


We now show that the CS conditions imply near-optimality of widths induced by CS matrices.

Theorem 7 Let $(n, m_n)$ be a sequence of problem sizes with $n \to \infty$ and $m_n \sim A \cdot n^\gamma$. Consider
a sequence of n by $m_n$ matrices $\Phi_{n,m_n}$ obeying the conditions CS1-CS3 with $\eta_i$ and ρ positive
and independent of n. Then for each $p \in (0, 1]$, there is $C = C(p, \eta_1, \eta_2, \eta_3, \rho, A, \gamma)$ so that

$$w(\Phi_{n,m_n}, b_{p,m_n}) \le C \cdot (n/\log(m/n))^{1/2 - 1/p}, \qquad n > n_0.$$

Proof. Consider the optimization problem

$$(Q_p) \qquad \sup \|\theta\|_2 \quad \text{subject to} \quad \Phi\theta = 0, \; \|\theta\|_p \le 1.$$

Our goal is to bound the value of $(Q_p)$:

$$\mathrm{val}(Q_p) \le C \cdot (n/\log(m/n))^{1/2 - 1/p}.$$

Choose θ so that $0 = \Phi\theta$. Let J denote the indices of the $k = \lfloor \rho n/\log(m) \rfloor$ largest values in
θ. Without loss of generality suppose coordinates are ordered so that J comes first among the
m entries, and partition $\theta = [\theta_J, \theta_{J^c}]$. Clearly

$$\|\theta_{J^c}\|_p \le \|\theta\|_p \le 1, \qquad (2.2)$$

while, because each entry in $\theta_J$ is at least as big as any entry in $\theta_{J^c}$, (1.2) gives

$$\|\theta_{J^c}\|_2 \le \zeta_{2,p} \cdot (k + 1)^{1/2 - 1/p}. \qquad (2.3)$$

A similar argument for $\ell_1$ approximation gives, in case p < 1,

$$\|\theta_{J^c}\|_1 \le \zeta_{1,p} \cdot (k + 1)^{1 - 1/p}. \qquad (2.4)$$

Now $0 = \Phi_J\theta_J + \Phi_{J^c}\theta_{J^c}$. Hence, with $v = \Phi_J\theta_J$, we have $-v = \Phi_{J^c}\theta_{J^c}$. As $v \in V_J$ and
$|J| = k < \rho n/\log(m)$, we can invoke CS3, getting

$$\|\theta_{J^c}\|_1 \ge Q_{J^c}(-v) \ge \eta_3 / \sqrt{\log(m/n)} \cdot \|v\|_1.$$

On the other hand, again using $v \in V_J$ and $|J| = k < \rho n/\log(m)$, invoke CS2, getting

$$\|v\|_1 \ge \eta_2 \cdot \sqrt{n} \cdot \|v\|_2.$$

Combining these with the above,

$$\|v\|_2 \le (\eta_2\eta_3)^{-1} \cdot \big(\sqrt{\log(m/n)}/\sqrt{n}\big) \cdot \|\theta_{J^c}\|_1 \le c_1 \cdot (n/\log(m))^{1/2 - 1/p},$$

with $c_1 = \zeta_{1,p}\rho^{1-1/p}/(\eta_2\eta_3)$. Recalling $|J| = k < \rho n/\log(m)$, and invoking CS1 we have

$$\|\theta_J\|_2 \le \|\Phi_J\theta_J\|_2/\eta_1 = \|v\|_2/\eta_1.$$

In short, with $c_2 = c_1/\eta_1$,

$$\|\theta\|_2 \le \|\theta_J\|_2 + \|\theta_{J^c}\|_2 \le c_2 \cdot (n/\log(m))^{1/2 - 1/p} + \zeta_{2,p} \cdot (\rho n/\log(m))^{1/2 - 1/p}.$$

The Theorem follows with $C = \zeta_{1,p}\rho^{1-1/p}/(\eta_1\eta_2\eta_3) + \zeta_{2,p}\rho^{1/2-1/p}$. QED

3 Algorithms
Given an information operator $I_n$, we must design a reconstruction algorithm $A_n$ which delivers
reconstructions compatible in quality with the estimates for the Gel’fand n-widths. As discussed
in the introduction, the optimal method in the OR/IBC framework is the so-called central
algorithm, which, unfortunately, is typically not efficiently computable in our setting. We now
describe an alternate abstract approach, allowing us to prove Theorem 1.

3.1 Feasible-Point Methods
Another general abstract algorithm from the OR/IBC literature is the so-called feasible-point
method, which aims simply to find any reconstruction compatible with the observed information
and constraints.
As in the case of the central algorithm, we consider, for given information $y_n = I_n(x)$, the
collection of all objects $x \in X_{p,m}(R)$ which could have given rise to the information $y_n$:

$$\hat{X}_{p,R}(y_n) = \{x : y_n = I_n(x), \; x \in X_{p,m}(R)\}.$$

In the feasible-point method, we simply select any member of $\hat{X}_{p,R}(y_n)$, by whatever means. A
popular choice is to take an element of least norm, i.e. a solution of the problem

$$(P_p) \qquad \min_x \|\theta(x)\|_p \quad \text{subject to} \quad y_n = I_n(x),$$

where here $\theta(x) = \Psi^T x$ is the vector of transform coefficients, $\theta \in \ell_p^m$. A nice feature of this
approach is that it is not necessary to know the radius R of the ball $X_{p,m}(R)$; the element of
least norm will always lie inside it.
Calling the solution $\hat{x}_{p,n}$, one can show, adapting standard OR/IBC arguments in [36, 46, 40]:

Lemma 3.1

$$\sup_{x \in X_{p,m}(R)} \|x - \hat{x}_{p,n}\|_2 \le 2 \cdot E_n(X_{p,m}(R)), \qquad 0 < p \le 1. \qquad (3.1)$$

In short, the least-norm method is within a factor two of optimal.


Proof. We first justify our claims for optimality of the central algorithm, and then show that
the minimum norm algorithm is near to the central algorithm. Let again $\hat{x}_n^*$ denote the result
of the central algorithm. Now

$$\mathrm{radius}(\hat{X}_{p,R}(y_n)) \equiv \inf_c \sup\{\|x - c\|_2 : x \in \hat{X}_{p,R}(y_n)\} = \sup\{\|x - \hat{x}_n^*\|_2 : x \in \hat{X}_{p,R}(y_n)\}.$$

Now clearly, in the special case when x is only known to lie in $X_{p,m}(R)$ and $y_n$ is measured, the
minimax error is exactly $\mathrm{radius}(\hat{X}_{p,R}(y_n))$. Since this error is achieved by the central algorithm
for each $y_n$, the minimax error over all x is achieved by the central algorithm. This minimax
error is

$$\sup\{\mathrm{radius}(\hat{X}_{p,R}(y_n)) : y_n \in I_n(X_{p,m}(R))\} = E_n(X_{p,m}(R)).$$

Now the least-norm algorithm obeys $\hat{x}_{p,n}(y_n) \in X_{p,m}(R)$; hence

$$\|\hat{x}_{p,n}(y_n) - \hat{x}_n^*(y_n)\|_2 \le \mathrm{radius}(\hat{X}_{p,R}(y_n)).$$

But the triangle inequality gives

$$\|x - \hat{x}_{p,n}\|_2 \le \|x - \hat{x}_n^*(y_n)\|_2 + \|\hat{x}_n^*(y_n) - \hat{x}_{p,n}\|_2;$$

hence, if $x \in \hat{X}_{p,R}(y_n)$,

$$\|x - \hat{x}_{p,n}\|_2 \le 2 \cdot \mathrm{radius}(\hat{X}_{p,R}(y_n)) \le 2 \cdot \sup\{\mathrm{radius}(\hat{X}_{p,R}(y_n)) : y_n \in I_n(X_{p,m}(R))\} = 2 \cdot E_n(X_{p,m}(R)).$$

QED.
More generally, if the information operator $I_n$ is only near-optimal, then the same argument
gives

$$\sup_{x \in X_{p,m}(R)} \|x - \hat{x}_{p,n}\|_2 \le 2 \cdot E_n(I_n, X_{p,m}(R)), \qquad 0 < p \le 1. \qquad (3.2)$$

3.2 Proof of Theorem 3
Before proceeding, it is convenient to prove Theorem 3. Note that the case p ≥ 1 is well-known
in OR/IBC so we only need to give an argument for p < 1 (though it happens that our argument
works for p = 1 as well). The key point will be to apply the p-triangle inequality

$$\|\theta + \theta'\|_p^p \le \|\theta\|_p^p + \|\theta'\|_p^p,$$

valid for 0 < p < 1; this inequality is well-known in interpolation theory [1] through Peetre and
Sparr’s work, and is easy to verify directly.
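
The inequality can also be spot-checked numerically. The following is a minimal sketch, not part of the original argument: the helper names `p_quasinorm_pow` and `check_p_triangle`, the Gaussian test vectors, and the tolerance are all illustrative assumptions.

```python
import random

def p_quasinorm_pow(theta, p):
    """Return ||theta||_p^p = sum_i |theta_i|^p (no p-th root is taken)."""
    return sum(abs(t) ** p for t in theta)

def check_p_triangle(m=50, p=0.5, trials=1000, seed=0):
    """Return True if ||a+b||_p^p <= ||a||_p^p + ||b||_p^p held in every trial."""
    rng = random.Random(seed)
    for _ in range(trials):
        a = [rng.gauss(0.0, 1.0) for _ in range(m)]
        b = [rng.gauss(0.0, 1.0) for _ in range(m)]
        lhs = p_quasinorm_pow([x + y for x, y in zip(a, b)], p)
        rhs = p_quasinorm_pow(a, p) + p_quasinorm_pow(b, p)
        if lhs > rhs + 1e-9:      # small slack for floating-point rounding
            return False
    return True

print(check_p_triangle())
```

The check works coordinate by coordinate, since $|a + b|^p \le |a|^p + |b|^p$ for scalars when $0 < p \le 1$; a random test of this kind illustrates the inequality but of course does not prove it.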
Suppose without loss of generality that there is an optimal subspace Vn , which is fixed and
given in this proof. As we just saw,

\[ E_n(X_{p,m}(R)) = \sup\{\mathrm{radius}(\hat{X}_{p,R}(y_n)) : y_n \in I_n(X_{p,m}(R))\}. \]

Now
\[ d^n(X_{p,m}(R)) = \mathrm{radius}(\hat{X}_{p,R}(0)), \]
so clearly $E_n \ge d^n$. Now suppose without loss of generality that $x_1$ and $x_{-1}$ attain the radius
bound, i.e. they satisfy $I_n(x_{\pm 1}) = y_n$ and, for $c = \mathrm{center}(\hat{X}_{p,R}(y_n))$, they satisfy

\[ E_n(X_{p,m}(R)) = \|x_1 - c\|_2 = \|x_{-1} - c\|_2. \]

Then define $\delta = (x_1 - x_{-1})/2^{1/p}$. Set $\theta_{\pm 1} = \Psi^T x_{\pm 1}$ and $\xi = \Psi^T \delta$. By the p-triangle inequality

\[ \|\theta_1 - \theta_{-1}\|_p^p \le \|\theta_1\|_p^p + \|\theta_{-1}\|_p^p, \]

and so
\[ \|\xi\|_p = \|(\theta_1 - \theta_{-1})/2^{1/p}\|_p \le R. \]
Hence $\delta \in X_{p,m}(R)$. However, $I_n(\delta) = I_n((x_{+1} - x_{-1})/2^{1/p}) = 0$, so $\delta$ belongs to $X_{p,m}(R) \cap V_n$.
Hence $\|\delta\|_2 \le d^n(X_{p,m}(R))$ and

\[ E_n(X_{p,m}(R)) = \|x_1 - x_{-1}\|_2/2 = 2^{1/p-1}\|\delta\|_2 \le 2^{1/p-1} \cdot d^n(X_{p,m}(R)). \]

QED
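
The two scaling steps in this proof can be written out explicitly. The following is only a restatement, using the p-triangle inequality, the constraints $\|\theta_{\pm 1}\|_p \le R$, and the definition of $\delta$; no new assumptions are introduced.

```latex
\begin{align*}
\|\xi\|_p^p &= \frac{\|\theta_1 - \theta_{-1}\|_p^p}{2}
  \le \frac{\|\theta_1\|_p^p + \|\theta_{-1}\|_p^p}{2}
  \le \frac{R^p + R^p}{2} = R^p, \\
\frac{\|x_1 - x_{-1}\|_2}{2} &= \frac{2^{1/p}\,\|\delta\|_2}{2} = 2^{1/p-1}\,\|\delta\|_2 .
\end{align*}
```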

3.3 Proof of Theorem 1


We are now in a position to prove Theorem 1 of the introduction.
First, in the case p = 1, we have already explained in the introduction that the theorem of
Garnaev and Gluskin implies the result by duality. In the case 0 < p < 1, we need only to show
a lower bound and an upper bound of the same order.
For the lower bound, we consider the entropy numbers, defined as follows. Let X be a set
and let $e_n(X, \ell^2)$ be the smallest number $\epsilon$ such that an $\epsilon$-net for X can be built using a net of
cardinality at most $2^n$. From Carl's Theorem (see the exposition in Pisier's book) there is a
constant c > 0 so that the Gel'fand n-widths dominate the entropy numbers:

\[ d^n(b_{p,m}) \ge c\, e_n(b_{p,m}). \]

Secondly, the entropy numbers obey [43, 30]

\[ e_n(b_{p,m}) \asymp (n/\log(m/n))^{1/2-1/p}. \]

At the same time the combination of Theorem 7 and Theorem 6 shows that

\[ d^n(b_{p,m}) \le c\,(n/\log(m))^{1/2-1/p}. \]

Applying now the Feasible-Point method, we have

\[ E_n(X_{p,m}(1)) \le 2 d^n(b_{p,m}), \]

with immediate extensions to $E_n(X_{p,m}(R))$ for all $R \ne 1$, $R > 0$. We conclude that

\[ E_n(b_{p,m}) \asymp (n/\log(m/n))^{1/2-1/p}, \]

as was to be proven.

3.4 Proof of Theorem 2


Now is an opportune time to prove Theorem 2. We note that in the case of 1 ≤ p, this is known
[38]. The argument is the same for 0 < p < 1, and we simply repeat it. Suppose that x = 0,
and consider the adaptively-constructed subspace according to whatever algorithm is in force.
When the algorithm terminates we have an n-dimensional information vector 0 and a subspace
$V_n^0$ consisting of objects x which would all give that information vector. For all objects in $V_n^0$
the adaptive information therefore turns out the same. Now the minimax error associated with
that information is exactly $\mathrm{radius}(V_n^0 \cap X_{p,m}(R))$; but this cannot be smaller than

\[ \inf_{V_n} \mathrm{radius}(V_n \cap X_{p,m}(R)) = d^n(X_{p,m}(R)). \]

The result follows by comparing En (Xp,m (R)) with dn .

4 Basis Pursuit
The least-norm method of the previous section has two drawbacks. First it requires that one
know p; we prefer an algorithm which works for 0 < p ≤ 1. Second, if p < 1 the least-norm
problem invokes a nonconvex optimization procedure, and would be considered intractable. In
this section we correct both drawbacks.

4.1 The Case p = 1


In the case p = 1, (P1 ) is a convex optimization problem. Write it out in an equivalent form,
with θ being the optimization variable:

\[ (P_1) \qquad \min_\theta \|\theta\|_1 \quad \text{subject to} \quad \Phi\theta = y_n. \]

This can be formulated as a linear programming problem: let A be the n by 2m matrix $[\Phi \;\; -\Phi]$.
The linear program
\[ (LP) \qquad \min_z 1^T z \quad \text{subject to} \quad Az = y_n, \; z \ge 0 \]
has a solution $z^*$, say, a vector in $R^{2m}$ which can be partitioned as $z^* = [u^* \; v^*]$; then
$\theta^* = u^* - v^*$ solves $(P_1)$. The reconstruction is $\hat{x}_{1,n} = \Psi\theta^*$. This linear program is typically considered
computationally tractable. In fact, this problem has been studied in the signal analysis literature
under the name Basis Pursuit [7]; in that work, very large-scale underdetermined problems -

e.g. with n = 8192 and m = 262144 - were solved successfully using interior-point optimization
methods.
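
The splitting of θ into positive and negative parts that underlies (LP) can be illustrated on a toy instance. This is a minimal sketch under stated assumptions: the 2 by 3 matrix `Phi`, the vector `theta`, and all helper names are hypothetical, and no solver is invoked. It only builds the LP data and checks that a given θ maps to a feasible z whose cost equals the $\ell^1$ norm of θ.

```python
def split_lp_data(Phi):
    """Build A = [Phi, -Phi] (n x 2m) and the all-ones cost vector (length 2m)."""
    A = [list(row) + [-a for a in row] for row in Phi]
    cost = [1.0] * (2 * len(Phi[0]))
    return A, cost

def theta_to_z(theta):
    """Map theta to z = [u, v] with u = max(theta, 0) and v = max(-theta, 0)."""
    u = [max(t, 0.0) for t in theta]
    v = [max(-t, 0.0) for t in theta]
    return u + v

def z_to_theta(z):
    """Recover theta = u - v from z = [u, v]."""
    m = len(z) // 2
    return [z[i] - z[m + i] for i in range(m)]

def matvec(A, x):
    """Plain matrix-vector product for a list-of-rows matrix."""
    return [sum(a * xi for a, xi in zip(row, x)) for row in A]

# Hypothetical toy data (n = 2 measurements, m = 3 coefficients).
Phi = [[1.0, 2.0, -1.0], [0.0, 1.0, 3.0]]
theta = [2.0, 0.0, -1.5]

A, cost = split_lp_data(Phi)
z = theta_to_z(theta)
y_n = matvec(Phi, theta)

print(matvec(A, z) == y_n)                      # equality constraint A z = y_n
print(all(zi >= 0.0 for zi in z))               # nonnegativity z >= 0
print(sum(c * zi for c, zi in zip(cost, z)))    # cost 1^T z
print(sum(abs(t) for t in theta))               # l1 norm of theta
```

In practice the triple (`cost`, `A`, `y_n`) would be handed to a general-purpose LP solver with nonnegativity bounds on z; which solver is used is left open here.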
As far as performance goes, we already know that this procedure is near-optimal in case
p = 1; from (3.2):

Corollary 4.1 Suppose that $I_n$ is an information operator achieving, for some C > 0,

\[ E_n(I_n, X_{1,m}(1)) \le C \cdot E_n(X_{1,m}(1)); \]

then the Basis Pursuit algorithm $A_{1,n}(y_n) = \hat{x}_{1,n}$ achieves

\[ E_n(I_n, A_{1,n}, X_{1,m}(R)) \le 2C \cdot E_n(X_{1,m}(R)), \qquad R > 0. \]

In particular, we have a universal algorithm for dealing with any class X1,m (R) – i.e. any Ψ,
any m, any R. First, apply a near-optimal information operator; second, reconstruct by Basis
Pursuit. The result obeys

\[ \|x - \hat{x}_{1,n}\|_2 \le C \cdot \|\theta\|_1 \cdot (n/\log m)^{-1/2}, \]

for C a constant depending at most on log(m)/log(n). The inequality can be interpreted as
follows. Fix $\epsilon > 0$. Suppose the unknown object x is known to be highly compressible, say
obeying the a priori bound $\|\theta\|_1 \le c m^\alpha$, $\alpha < 1/2$. For any such object, rather than making m
measurements, we only need to make $n \sim C_\epsilon \cdot m^{2\alpha} \log(m)$ measurements, and our reconstruction
obeys:
\[ \|x - \hat{x}_{1,n}\|_2 \le \epsilon \cdot \|x\|_2. \]
While the case p = 1 is already significant and interesting, the case 0 < p < 1 is of interest
because it corresponds to data which are more highly compressible and for which our interest
in achieving the performance indicated by Theorem 1 is even greater. Later in this section we
extend the same interpretation of x̂1,n to performance over Xp,m (R) throughout p < 1.

4.2 Relation between `1 and `0 minimization


The general OR/IBC theory would suggest that to handle cases where 0 < p < 1, we would
need to solve the nonconvex optimization problem (Pp ), which seems intractable. However, in
the current situation at least, a small miracle happens: solving (P1 ) is again near-optimal. To
understand this, we first take a small detour, examining the relation between `1 and the extreme
case p → 0 of the `p spaces. Let’s define

\[ (P_0) \qquad \min \|\theta\|_0 \quad \text{subject to} \quad \Phi\theta = x, \]

where of course kθk0 is just the number of nonzeros in θ. Again, since the work of Peetre and
Sparr the importance of `0 and the relation with `p for 0 < p < 1 is well-understood; see [1] for
more detail.
Ordinarily, solving such a problem involving the `0 norm requires combinatorial optimization;
one enumerates all sparse subsets of {1, . . . , m} searching for one which allows a solution Φθ = x.
However, when (P0 ) has a sparse solution, (P1 ) will find it.

Theorem 8 Suppose that Φ satisfies CS1-CS3 with given positive constants ρ, (ηi ). There is
a constant ρ0 > 0 depending only on ρ and (ηi ) and not on n or m so that, if θ has at most
ρ0 n/ log(m) nonzeros, then (P0 ) and (P1 ) both have the same unique solution.

In words, although the system of equations is massively underdetermined, $\ell^1$ minimization and
sparse solution coincide when the result is sufficiently sparse.
There is by now an extensive literature exhibiting results on equivalence of $\ell^1$ and $\ell^0$
minimization [14, 20, 47, 48, 21]. In the first literature on this subject, equivalence was found
under conditions involving sparsity constraints allowing $O(n^{1/2})$ nonzeros. While it may seem
surprising that any results of this kind are possible, the sparsity constraint $\|\theta\|_0 = O(n^{1/2})$
is, ultimately, disappointingly small. A major breakthrough was the contribution of Candès,
Romberg, and Tao (2004) which studied the matrices built by taking n rows at random from
an m by m Fourier matrix and gave an order $O(n/\log(n))$ bound, showing that dramatically
weaker sparsity conditions were needed than the $O(n^{1/2})$ results known previously. In [17], it
was shown that for 'nearly all' n by m matrices with n < m < An, equivalence held for ≤ ρn
nonzeros, ρ = ρ(A). The above result says effectively that for 'nearly all' n by m matrices with
$m \le An^\gamma$, equivalence held up to $O(\rho n/\log(n))$ nonzeros, where ρ = ρ(A, γ).
Our argument, in parallel with [17], shows that the nullspace Φβ = 0 has a very special
structure for Φ obeying the conditions in question. When θ is sparse, the only element in a
given affine subspace θ + nullspace(Φ) which can have small `1 norm is θ itself.
To prove Theorem 8, we first need a lemma about the non-sparsity of elements in the
nullspace of Φ. Let $J \subset \{1, \dots, m\}$ and, for a given vector $\beta \in R^m$, let $1_J\beta$ denote the mutilated
vector with entries $\beta_i 1_J(i)$. Define the concentration

\[ \nu(\Phi, J) = \sup\left\{ \frac{\|1_J\beta\|_1}{\|\beta\|_1} : \Phi\beta = 0 \right\}. \]

This measures the fraction of $\ell^1$ norm which can be concentrated on a certain subset for a vector
in the nullspace of Φ. This concentration cannot be large if |J| is small.

Lemma 4.1 Suppose that Φ satisfies CS1-CS3, with constants ηi and ρ. There is a constant η0
depending on the ηi so that if J ⊂ {1, . . . , m} satisfies

\[ |J| \le \rho_1 n/\log(m), \qquad \rho_1 \le \rho, \]

then
\[ \nu(\Phi, J) \le \eta_0 \cdot \rho_1^{1/2}. \]

Proof. This is a variation on the argument for Theorem 7. Let β ∈ nullspace(Φ). Assume
without loss of generality that J is the most concentrated subset of cardinality |J| < ρn/ log(m),
and that the columns of Φ are numbered so that J = {1, . . . , |J|}; partition β = [βJ , βJ c ]. We
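again consider $v = \Phi_J\beta_J$, and have $-v = \Phi_{J^c}\beta_{J^c}$. We again invoke CS2-CS3, getting
\[ \|v\|_2 \le \eta_2\eta_3 \cdot (\sqrt{n}/\sqrt{\log(m/n)}) \cdot \|\theta_{J^c}\|_1 \le c_1 \cdot (n/\log(m))^{1/2-1/p}. \]
We invoke CS1, getting
\[ \|\beta_J\|_2 \le \|\Phi_J\beta_J\|_2/\eta_1. \]
Now of course $\|\beta_J\|_1 \le \sqrt{|J|} \cdot \|\beta_J\|_2$. Combining all these,
\[ \|\beta_J\|_1 \le \sqrt{|J|} \cdot \eta_1^{-1}\eta_2\eta_3 \cdot \sqrt{\frac{\log(m)}{n}} \cdot \|\beta\|_1. \]
The lemma follows, setting $\eta_0 = \eta_2\eta_3/\eta_1$. QED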
Proof of Theorem 8. Suppose that x = Φθ and θ is supported on a subset J ⊂ {1, . . . , m}.

We first show that if ν(Φ, J) < 1/2, θ is the only minimizer of $(P_1)$. Suppose that θ' is a
solution to $(P_1)$, obeying
\[ \|\theta'\|_1 \le \|\theta\|_1. \]
Then θ' = θ + β where β ∈ nullspace(Φ). We have

\[ 0 \le \|\theta\|_1 - \|\theta'\|_1 \le \|\beta_J\|_1 - \|\beta_{J^c}\|_1. \]

Invoking the definition of ν(Φ, J) twice,

\[ \|\beta_J\|_1 - \|\beta_{J^c}\|_1 \le (\nu(\Phi, J) - (1 - \nu(\Phi, J)))\,\|\beta\|_1. \]

Suppose now that ν < 1/2. Then 2ν(Φ, J) − 1 < 0 and we have

\[ \|\beta\|_1 \le 0, \]

i.e. θ = θ'.

Now recall the constant $\eta_0 > 0$ of Lemma 4.1. Define $\rho_0$ so that $\eta_0\rho_0^{1/2} < 1/2$ and $\rho_0 \le \rho$.
Lemma 4.1 shows that $|J| \le \rho_0 n/\log(m)$ implies $\nu(\Phi, J) \le \eta_0\rho_0^{1/2} < 1/2$. QED.
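
The first half of the argument, that concentration below 1/2 forces the sparse vector to be the unique $\ell^1$ minimizer, can be seen on a toy system. This is a minimal sketch and every specific in it is a hypothetical choice: a matrix with rows (1, −1, 0) and (0, 1, −1), whose nullspace is the line spanned by β = (1, 1, 1), a one-sparse θ, and a grid scan over the line of solutions in place of a real optimizer.

```python
def l1(v):
    """l1 norm of a list of floats."""
    return sum(abs(x) for x in v)

def concentration(beta, J):
    """Fraction of the l1 norm of beta carried by the index set J."""
    return sum(abs(beta[i]) for i in J) / l1(beta)

# Hypothetical toy system: Phi = [[1, -1, 0], [0, 1, -1]] has a
# one-dimensional nullspace spanned by beta = (1, 1, 1).
beta = [1.0, 1.0, 1.0]
theta = [1.0, 0.0, 0.0]      # sparse solution supported on J = {0}
J = [0]

# The ratio is scale invariant, so on a one-dimensional nullspace the
# supremum defining nu(Phi, J) is attained at beta itself.
nu = concentration(beta, J)

# Every solution of Phi*theta' = Phi*theta has the form theta + t*beta.
ts = [k / 100.0 for k in range(-300, 301)]
costs = [l1([th + t * b for th, b in zip(theta, beta)]) for t in ts]
t_best = ts[costs.index(min(costs))]

print(nu, t_best, min(costs))
```

Here the cost along the solution line is |1 + t| + 2|t|, so the scan returns t = 0: the sparse θ itself. Theorem 8 asserts the same behaviour, with high probability, for every sufficiently sparse support in the random matrices considered in this paper.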

4.3 Near-Optimality of BP for 0 < p < 1


We now return to the claimed near-optimality of BP throughout the range 0 < p < 1.

Theorem 9 Suppose that Φ satisfies CS1-CS3 with constants $\eta_i$ and ρ. There is $C = C(p, (\eta_i), \rho, A, \gamma)$
so that the solution $\hat\theta_{1,n}$ to $(P_1)$ obeys

\[ \|\theta - \hat\theta_{1,n}\|_2 \le C_p \cdot \|\theta\|_p \cdot (n/\log m)^{1/2-1/p}. \]

The proof requires an `1 stability lemma, showing the stability of `1 minimization under
small perturbations as measured in `1 norm. For `2 and `∞ stability lemmas, see [16, 48, 18];
however, note that those lemmas do not suffice for our needs in this proof.

Lemma 4.2 Let θ be a vector in $R^m$ and $1_J\theta$ be the corresponding mutilated vector with entries
$\theta_i 1_J(i)$. Suppose that
\[ \|\theta - 1_J\theta\|_1 \le \epsilon, \]
where $\nu(\Phi, J) \le \nu_0 < 1/2$. Consider the instance of $(P_1)$ defined by x = Φθ; the solution $\hat\theta_1$ of
this instance of $(P_1)$ obeys
\[ \|\theta - \hat\theta_1\|_1 \le \frac{2\epsilon}{1 - 2\nu_0}. \tag{4.1} \]

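Proof of Lemma. Put for short $\hat\theta \equiv \hat\theta_1$, and set $\beta = \theta - \hat\theta \in$ nullspace(Φ). By definition of ν,

\[ \|\theta - \hat\theta\|_1 = \|\beta\|_1 \le \|\beta_{J^c}\|_1/(1 - \nu_0), \]

while
\[ \|\beta_{J^c}\|_1 \le \|\theta_{J^c}\|_1 + \|\hat\theta_{J^c}\|_1. \]
As $\hat\theta$ solves $(P_1)$,
\[ \|\hat\theta_J\|_1 + \|\hat\theta_{J^c}\|_1 \le \|\theta\|_1, \]
and of course
\[ \|\theta_J\|_1 - \|\hat\theta_J\|_1 \le \|1_J(\theta - \hat\theta)\|_1. \]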

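Hence
\[ \|\hat\theta_{J^c}\|_1 \le \|1_J(\theta - \hat\theta)\|_1 + \|\theta_{J^c}\|_1. \]
Finally,
\[ \|1_J(\theta - \hat\theta)\|_1 = \|\beta_J\|_1 \le \nu_0 \cdot \|\beta\|_1 = \nu_0 \cdot \|\theta - \hat\theta\|_1. \]
Combining the above, setting $\delta \equiv \|\theta - \hat\theta\|_1$ and $\epsilon \equiv \|\theta_{J^c}\|_1$, we get

\[ \delta \le (\nu_0\delta + 2\epsilon)/(1 - \nu_0), \]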

and (4.1) follows. QED
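
For completeness, the final rearrangement can be spelled out. With $\delta \equiv \|\theta - \hat\theta\|_1$ and $\epsilon \equiv \|\theta_{J^c}\|_1$ as in the proof, and using only $\nu_0 < 1/2$ (so that both $1 - \nu_0$ and $1 - 2\nu_0$ are positive):

```latex
\begin{align*}
\delta \le \frac{\nu_0\delta + 2\epsilon}{1 - \nu_0}
&\;\Longrightarrow\; (1 - \nu_0)\,\delta \le \nu_0\,\delta + 2\epsilon \\
&\;\Longrightarrow\; (1 - 2\nu_0)\,\delta \le 2\epsilon \\
&\;\Longrightarrow\; \delta \le \frac{2\epsilon}{1 - 2\nu_0}.
\end{align*}
```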


Proof of Theorem 9 We use the same general framework as Theorem 7. Let x = Φθ where
kθkp ≤ R. Let θ̂ be the solution to (P1 ), and set β = θ̂ − θ ∈ nullspace(Φ).
Let $\eta_0$ be as in Lemma 4.1 and set $\rho_0 = \min(\rho, (4\eta_0)^{-2})$. Let J index the $\rho_0 n/\log(m)$
largest-amplitude entries in θ. From $\|\theta\|_p \le R$ and (2.4) we have

\[ \|\theta - 1_J\theta\|_1 \le \xi_{1,p} \cdot R \cdot |J|^{1-1/p}, \]

and Lemma 4.1 provides
\[ \nu(\Phi, J) \le \eta_0\rho_0^{1/2} \le 1/4. \]
Applying Lemma 4.2,
\[ \|\beta\|_1 \le c \cdot R \cdot |J|^{1-1/p}. \tag{4.2} \]
The vector $\delta = \beta/\|\beta\|_1$ lies in nullspace(Φ) and has $\|\delta\|_1 = 1$. Hence

\[ \|\delta\|_2 \le c \cdot (n/\log(m))^{-1/2}. \]

We conclude by homogeneity that

\[ \|\beta\|_2 \le c \cdot \|\beta\|_1 \cdot (n/\log(m))^{-1/2}. \]

Combining this with (4.2),
\[ \|\beta\|_2 \le c \cdot R \cdot (n/\log(m))^{1/2-1/p}. \]
QED
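
The bookkeeping of exponents in the last step can be made explicit. The following sketch assumes, as in the proof, that $|J| = \rho_0 n/\log(m)$, and lets c and c' stand for constants that do not depend on n or m:

```latex
\begin{align*}
\|\beta\|_2
&\le c \cdot \|\beta\|_1 \cdot (n/\log(m))^{-1/2} \\
&\le c' \cdot R \cdot \big(\rho_0\, n/\log(m)\big)^{1-1/p} \cdot (n/\log(m))^{-1/2} \\
&= c'\,\rho_0^{\,1-1/p} \cdot R \cdot (n/\log(m))^{1/2-1/p}.
\end{align*}
```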

5 Immediate Extensions
Before continuing, we mention two immediate extensions to these results of interest below and
elsewhere.

5.1 Tight Frames


Our main results so far have been stated in the context of (φi ) making an orthonormal basis.
In fact the results hold for tight frames. These are collections of vectors which, when joined
together as columns in an $m \times m'$ matrix Ψ ($m < m'$) have $\Psi\Psi^T = I_m$. It follows that, if
$\theta(x) = \Psi^T x$, then we have the Parseval relation

\[ \|x\|_2 = \|\theta(x)\|_2, \]

and the reconstruction formula x = Ψθ(x). In fact, Theorems 7 and 9 only need the Parseval
relation in the proof. Hence the same results hold without change when the relation between x

and θ involves a tight frame. In particular, if Φ is an $n \times m'$ matrix satisfying CS1-CS3, then
$I_n = \Phi\Psi^T$ defines a near-optimal information operator on $R^m$, and solution of the optimization
problem
\[ (L_1) \qquad \min_x \|\Psi^T x\|_1 \quad \text{subject to} \quad I_n(x) = y_n, \]
defines a near-optimal reconstruction algorithm x̂1 .
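
A small concrete tight frame may help fix ideas. The sketch below is illustrative only: it uses the three-vector "Mercedes-Benz" frame in the plane (so m = 2 and m' = 3), which is a standard textbook example and not one used in this paper, and checks the Parseval relation and the reconstruction formula numerically.

```python
import math

# Hypothetical example: three unit vectors at 120-degree spacing, scaled by
# sqrt(2/3) so that Psi Psi^T = I_2. Each tuple is one column of Psi.
angles = [math.pi / 2 + 2 * math.pi * k / 3 for k in range(3)]
scale = math.sqrt(2.0 / 3.0)
frame = [(scale * math.cos(a), scale * math.sin(a)) for a in angles]

def analysis(x):
    """theta(x) = Psi^T x: inner products of x with the frame vectors."""
    return [g[0] * x[0] + g[1] * x[1] for g in frame]

def synthesis(theta):
    """x = Psi theta: linear combination of the frame vectors."""
    return [sum(t * g[i] for t, g in zip(theta, frame)) for i in range(2)]

x = [0.3, -1.7]
theta = analysis(x)
x_back = synthesis(theta)

norm_x = math.hypot(x[0], x[1])
norm_theta = math.sqrt(sum(t * t for t in theta))
print(norm_x, norm_theta)   # Parseval: the two norms agree
print(x_back)               # reconstruction returns x up to rounding
```

Although θ has three entries and x only two, the norms agree and x is recovered exactly; this is the only property of Ψ that the proofs of Theorems 7 and 9 rely on.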

5.2 Weak `p Balls


Our main results so far have been stated for $\ell^p$ spaces, but the proofs hold for weak $\ell^p$ balls as
well (p < 1). The weak $\ell^p$ ball of radius R consists of vectors θ whose decreasing rearrangements
$|\theta|_{(1)} \ge |\theta|_{(2)} \ge |\theta|_{(3)} \ge \dots$ obey

\[ |\theta|_{(N)} \le R \cdot N^{-1/p}, \qquad N = 1, 2, 3, \dots. \]

Conversely, for a given θ, the smallest R for which these inequalities all hold is defined to be the
norm: $\|\theta\|_{w\ell^p} \equiv R$. The "weak" moniker derives from $\|\theta\|_{w\ell^p} \le \|\theta\|_p$. Weak $\ell^p$ constraints have
the following key property: if $\theta_N$ denotes the vector θ with all except the N largest items set
to zero, then the inequality

\[ \|\theta - \theta_N\|_q \le \zeta_{q,p} \cdot R \cdot (N + 1)^{1/q-1/p} \tag{5.1} \]

is valid for p < 1 and q = 1, 2, with $R = \|\theta\|_{w\ell^p}$. In fact, Theorems 7 and 9 used (5.1) in the
proof, together with (implicitly) $\|\theta\|_p \ge \|\theta\|_{w\ell^p}$. Hence we can state results for spaces $Y_{p,m}(R)$
defined using only weak-$\ell^p$ norms, and the proofs apply without change.
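
The definition of the weak-$\ell^p$ norm translates directly into a short computation. The following is a minimal sketch; the function names and the randomly generated, rapidly decaying test vector are assumptions made for illustration.

```python
import random

def lp_norm(theta, p):
    """(sum_i |theta_i|^p)^(1/p); a quasi-norm when p < 1."""
    return sum(abs(t) ** p for t in theta) ** (1.0 / p)

def weak_lp_norm(theta, p):
    """Smallest R with |theta|_(N) <= R * N**(-1/p) for every N."""
    mags = sorted((abs(t) for t in theta), reverse=True)
    return max(mag * N ** (1.0 / p) for N, mag in enumerate(mags, start=1))

rng = random.Random(1)
p = 0.5
# Hypothetical test vector with decaying entries.
theta = [rng.gauss(0.0, 1.0) / (k + 1) ** 2 for k in range(200)]

R_weak = weak_lp_norm(theta, p)
R_strong = lp_norm(theta, p)
print(R_weak, R_strong)   # the weak norm never exceeds the l^p norm
```

The comparison holds because $N \cdot |\theta|_{(N)}^p$ is at most the sum of the N largest values of $|\theta_i|^p$, which is at most $\|\theta\|_p^p$.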

6 Stylized Applications
We sketch 3 potential applications of the above abstract theory.

6.1 Bump Algebra


Consider the class F(B) of functions f (t), t ∈ [0, 1] which are restrictions to the unit interval
of functions belonging to the Bump Algebra B [33], with bump norm $\|f\|_B \le B$. As
mentioned in the introduction, the wavelet coefficients at level j
obey $\|\theta^{(j)}\|_1 \le C \cdot B \cdot 2^{-j}$ where C depends only on the wavelet used. Here and below we use
standard wavelet analysis notations as in [8, 32, 33].
We consider two ways of approximating functions f in F(B). In the classic linear scheme, we fix a
'finest scale' $j_1$ and measure the resumé coefficients $\beta_{j_1,k} = \langle f, \varphi_{j_1,k}\rangle$ where $\varphi_{j,k} = 2^{j/2}\varphi(2^j t - k)$,
with φ a smooth function integrating to 1. Think of these as point samples at scale $2^{-j_1}$ after
applying an anti-aliasing filter. We reconstruct by $P_{j_1}f = \sum_k \beta_{j_1,k}\varphi_{j_1,k}$ giving an approximation
error
\[ \|f - P_{j_1}f\|_2 \le C \cdot \|f\|_B \cdot 2^{-j_1/2}, \]
with C depending only on the chosen wavelet. There are $N = 2^{j_1}$ coefficients $(\beta_{j_1,k})_k$ associated
with the unit interval, and so the approximation error obeys:

\[ \|f - P_{j_1}f\|_2 \le C \cdot \|f\|_B \cdot N^{-1/2}. \]

In the compressed sensing scheme, we need also wavelets $\psi_{j,k} = 2^{j/2}\psi(2^j x - k)$ where ψ is an
oscillating function with mean zero. We pick a coarsest scale $j_0 = j_1/2$. We measure the resumé

coefficients $\beta_{j_0,k}$ (there are $2^{j_0}$ of these) and then let $\theta \equiv (\theta_\ell)_{\ell=1}^m$ denote an enumeration of
the detail wavelet coefficients $((\alpha_{j,k} : 0 \le k < 2^j) : j_0 \le j < j_1)$. The dimension m of θ is
$m = 2^{j_1} - 2^{j_0}$. The norm
\[ \|\theta\|_1 \le \sum_{j=j_0}^{j_1} \|(\alpha_{j,k})\|_1 \le C \cdot \|f\|_B \cdot \sum_{j=j_0}^{j_1} 2^{-j} \le cB2^{-j_0}. \]

We take $n = c \cdot 2^{j_0}\log(2^{j_1})$ and apply a near-optimal information operator for this n and m
(described in more detail below). We apply the near-optimal algorithm of $\ell^1$ minimization,
getting the error estimate

\[ \|\hat\theta - \theta\|_2 \le c\|\theta\|_1 \cdot (n/\log(m))^{-1/2} \le c2^{-2j_0} \le c2^{-j_1}, \]

with c independent of f ∈ F(B). The overall reconstruction

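\[ \hat{f} = \sum_k \beta_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \hat\alpha_{j,k}\psi_{j,k} \]

has error

\[ \|f - \hat{f}\|_2 \le \|f - P_{j_1}f\|_2 + \|P_{j_1}f - \hat{f}\|_2 = \|f - P_{j_1}f\|_2 + \|\hat\theta - \theta\|_2 \le c2^{-j_1/2}, \]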

again with c independent of f ∈ F(B). This is of the same order of magnitude as the error of
linear sampling.
The compressed sensing scheme takes a total of $2^{j_0}$ samples of resumé coefficients and
$n \le c2^{j_0}\log(2^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1 \cdot 2^{j_1/2}$ pieces
of information. It achieves error comparable to classical sampling based on $2^{j_1}$ samples. Thus it
needs dramatically fewer samples for comparable accuracy: roughly speaking, only the square
root of the number of samples of linear sampling.
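
The sample counts just described are easy to tabulate. The sketch below is illustrative only: the constant c in $n = c \cdot 2^{j_0}\log(2^{j_1})$ is not specified in the text, so it is set to 1 here as a placeholder, the logarithm is taken to be natural, and $j_1$ is assumed even so that $j_0 = j_1/2$ is an integer.

```python
import math

def sample_counts(j1, c=1.0):
    """Return (classical, compressed) sample counts at finest scale j1.

    Assumes j1 is even; c is a placeholder for the unspecified constant
    in n = c * 2^j0 * log(2^j1).
    """
    j0 = j1 // 2
    classical = 2 ** j1                                   # linear sampling
    resume = 2 ** j0                                      # coarse-scale coefficients
    detail = math.ceil(c * 2 ** j0 * math.log(2 ** j1))   # n measurements
    return classical, resume + detail

for j1 in (10, 16, 20):
    classical, cs = sample_counts(j1)
    # Last column: compressed count relative to sqrt(classical) = 2^(j1/2).
    print(j1, classical, cs, round(cs / math.sqrt(classical), 2))
```

The last column grows only linearly in $j_1$, which is the logarithmic factor separating the compressed sensing count from the square root of the classical count.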
To achieve this dramatic reduction in sampling, we need an information operator based on
some Φ satisfying CS1-CS3. The underlying measurement kernels will be of the form
\[ \xi_i = \sum_{\ell=1}^m \Phi_{i,\ell}\phi_\ell, \qquad i = 1, \dots, n, \tag{6.1} \]
where the collection $(\phi_\ell)_{\ell=1}^m$ is simply an enumeration of the wavelets $\psi_{j,k}$, with $j_0 \le j < j_1$ and
$0 \le k < 2^j$.

6.2 Images of Bounded Variation


We consider now the model with images of Bounded Variation from the introduction. Let F(B)
denote the class of functions f(x), $x \in [0, 1]^2$, having total variation at most B
[9], and bounded in absolute value by $\|f\|_\infty \le B$ as well. In the introduction, it was mentioned
that the wavelet coefficients at level j obey $\|\theta^{(j)}\|_1 \le C \cdot B$ where C depends only on the wavelet
used. It is also true that $\|\theta^{(j)}\|_\infty \le C \cdot B \cdot 2^{-j}$.
We again consider two ways of approximating functions f in F(B). The classic linear scheme uses a
two-dimensional version of the scheme we have already discussed. We again fix a 'finest scale' $j_1$
and measure the resumé coefficients $\beta_{j_1,k} = \langle f, \varphi_{j_1,k}\rangle$ where now $k = (k_1, k_2)$ is a pair of integers

$0 \le k_1, k_2 < 2^{j_1}$ indexing position. We use the Haar scaling function $\varphi_{j_1,k} = 2^{j_1} \cdot 1_{\{2^{j_1}x - k \in [0,1]^2\}}$.
We reconstruct by $P_{j_1}f = \sum_k \beta_{j_1,k}\varphi_{j_1,k}$ giving an approximation error

\[ \|f - P_{j_1}f\|_2 \le 4 \cdot B \cdot 2^{-j_1/2}. \]

There are $N = 4^{j_1}$ coefficients $\beta_{j_1,k}$ associated with the unit square, and so the approximation
error obeys:
\[ \|f - P_{j_1}f\|_2 \le c \cdot B \cdot N^{-1/4}. \]
In the compressed sensing scheme, we need also Haar wavelets $\psi^\sigma_{j_1,k} = 2^{j_1}\psi^\sigma(2^{j_1}x - k)$
where $\psi^\sigma$ is an oscillating function with mean zero which is either horizontally-oriented (σ = h),
vertically oriented (σ = v) or diagonally-oriented (σ = d). We pick a 'coarsest scale' $j_0 = j_1/2$,
and measure the resumé coefficients $\beta_{j_0,k}$ (there are $4^{j_0}$ of these). Then let $\theta \equiv (\theta_\ell)_{\ell=1}^m$ be the
concatenation of the detail wavelet coefficients $((\alpha^\sigma_{j,k} : 0 \le k_1, k_2 < 2^j, \sigma \in \{h, v, d\}) : j_0 \le j < j_1)$.
The dimension m of θ is $m = 4^{j_1} - 4^{j_0}$. The norm
\[ \|\theta\|_1 \le \sum_{j=j_0}^{j_1} \|(\theta^{(j)})\|_1 \le c(j_1 - j_0)\|f\|_{BV}. \]

We take $n = c \cdot 4^{j_0}\log^2(4^{j_1})$ and apply a near-optimal information operator for this n and m.
We apply the near-optimal algorithm of $\ell^1$ minimization to the resulting information, getting
the error estimate
\[ \|\hat\theta - \theta\|_2 \le c\|\theta\|_1 \cdot (n/\log(m))^{-1/2} \le cB \cdot 2^{-j_1}, \]
with c independent of f ∈ F(B). The overall reconstruction

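\[ \hat{f} = \sum_k \beta_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \hat\alpha_{j,k}\psi_{j,k} \]

has error

\[ \|f - \hat{f}\|_2 \le \|f - P_{j_1}f\|_2 + \|P_{j_1}f - \hat{f}\|_2 = \|f - P_{j_1}f\|_2 + \|\hat\theta - \theta\|_2 \le c2^{-j_1/2}, \]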

again with c independent of f ∈ F(B). This is of the same order of magnitude as the error of
linear sampling.
The compressed sensing scheme takes a total of $4^{j_0}$ samples of resumé coefficients and
$n \le c4^{j_0}\log^2(4^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1^2 \cdot 4^{j_1/2}$ pieces of
measured information. It achieves error comparable to classical sampling with $4^{j_1}$ samples.
Thus just as we have seen in the Bump Algebra case, we need dramatically fewer samples for
comparable accuracy: roughly speaking, only the square root of the number of samples of linear
sampling.

6.3 Piecewise $C^2$ Images with $C^2$ Edges


We now consider an example where p < 1, and we can apply the extensions to tight frames
and to weak-$\ell^p$ mentioned earlier. Again in the image processing setting, we use the $C^2$-$C^2$
model discussed in Candès and Donoho [2, 3]. Consider the collection $C^{2,2}(B, L)$ of piecewise
smooth f(x), $x \in [0, 1]^2$, with values, first and second partial derivatives bounded by B, away
from an exceptional set Γ which is a union of $C^2$ curves having first and second derivatives in
an appropriate parametrization ≤ B; the curves have total length ≤ L. More colorfully, such

images are cartoons – well-behaved except for discontinuities inside regions with nice curvilinear
boundaries. They might be reasonable models for certain kinds of technical imagery – eg in
radiology.
The curvelets tight frame [3] is a collection of smooth frame elements offering a Parseval
relation
\[ \|f\|_2^2 = \sum_\mu |\langle f, \gamma_\mu\rangle|^2 \]
and reconstruction formula
\[ f = \sum_\mu \langle f, \gamma_\mu\rangle\gamma_\mu. \]

The frame elements have a multiscale organization, and frame coefficients $\theta^{(j)}$ grouped by scale
obey the weak $\ell^p$ constraint
\[ \|\theta^{(j)}\|_{w\ell^{2/3}} \le c(B, L), \qquad f \in C^{2,2}(B, L); \]
compare [3]. For such objects, classical linear sampling at scale $j_1$ by smooth 2-D scaling
functions gives
\[ \|f - P_{j_1}f\|_2 \le c \cdot B \cdot 2^{-j_1/2}, \qquad f \in C^{2,2}(B, L). \]
This is no better than the performance of linear sampling for the BV case, despite the piecewise
$C^2$ character of f; the possible discontinuities in f are responsible for the inability of linear
sampling to improve its performance over $C^{2,2}(B, L)$ compared to BV.
In the compressed sensing scheme, we pick a coarsest scale $j_0 = j_1/4$. We measure the
resumé coefficients $\beta_{j_0,k}$ in a smooth wavelet expansion (there are $4^{j_0}$ of these) and then let
$\theta \equiv (\theta_\ell)_{\ell=1}^m$ denote a concatenation of the finer-scale curvelet coefficients $(\theta^{(j)} : j_0 \le j < j_1)$.
The dimension m of θ is $m = c(4^{j_1} - 4^{j_0})$, with c > 1 due to overcompleteness of curvelets. The
norm
\[ \|\theta\|_{w\ell^{2/3}} \le \Big(\sum_{j=j_0}^{j_1} \|(\theta^{(j)})\|_{w\ell^{2/3}}^{2/3}\Big)^{3/2} \le c(j_1 - j_0)^{3/2}, \]
with c depending on B and L. We take $n = c \cdot 4^{j_0}\log^{5/2}(4^{j_1})$ and apply a near-optimal information
operator for this n and m. We apply the near-optimal algorithm of $\ell^1$ minimization to the
resulting information, getting the error estimate
\[ \|\hat\theta - \theta\|_2 \le c\|\theta\|_{2/3} \cdot (n/\log(m))^{-1} \le c' \cdot 2^{-j_1/2}, \]
with c' absolute. The overall reconstruction
\[ \hat{f} = \sum_k \beta_{j_0,k}\varphi_{j_0,k} + \sum_{j=j_0}^{j_1-1} \sum_{\mu \in M_j} \hat\theta^{(j)}_\mu\gamma_\mu \]

has error
\[ \|f - \hat{f}\|_2 \le \|f - P_{j_1}f\|_2 + \|P_{j_1}f - \hat{f}\|_2 = \|f - P_{j_1}f\|_2 + \|\hat\theta - \theta\|_2 \le c2^{-j_1/2}, \]
again with c independent of f ∈ C 2,2 (B, L). This is of the same order of magnitude as the error
of linear sampling.
The compressed sensing scheme takes a total of $4^{j_0}$ samples of resumé coefficients and
$n \le c4^{j_0}\log^{5/2}(4^{j_1})$ samples associated with detail coefficients, for a total $\le c \cdot j_1^{5/2} \cdot 4^{j_1/4}$ pieces of
information. It achieves error comparable to classical sampling based on $4^{j_1}$ samples. Thus, even
more so than in the Bump Algebra case, we need dramatically fewer samples for comparable
accuracy: roughly speaking, only the fourth root of the number of samples of linear sampling.

7 Nearly All Matrices are CS-Matrices
We may reformulate Theorem 6 as follows.

Theorem 10 Let n, $m_n$ be a sequence of problem sizes with n → ∞, $n < m \sim An^\gamma$, for A > 0
and γ > 1. Let $\Phi = \Phi_{n,m}$ be a matrix with m columns drawn independently and uniformly at
random from $S^{n-1}$. Then for some $\eta_i > 0$ and ρ > 0, conditions CS1-CS3 hold for Φ with
overwhelming probability for all large n.

Indeed, note that the probability measure on $\Phi_{n,m}$ induced by sampling columns iid uniform
on $S^{n-1}$ is exactly the natural uniform measure on $\Phi_{n,m}$. Hence Theorem 6 follows immediately
from Theorem 10.
In effect matrices Φ satisfying the CS conditions are so ubiquitous that it is reasonable to
generate them by sampling at random from a uniform probability distribution.
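
Sampling such a matrix is straightforward. The sketch below relies on the standard fact that a normalized vector of iid Gaussians is uniformly distributed on the sphere; the function name `random_cs_matrix`, the column-list representation, and the sizes used are illustrative assumptions.

```python
import math
import random

def random_cs_matrix(n, m, seed=0):
    """Return m columns, each drawn uniformly from the unit sphere S^{n-1}.

    Each column is an iid Gaussian vector divided by its Euclidean norm.
    """
    rng = random.Random(seed)
    columns = []
    for _ in range(m):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in g))
        columns.append([x / norm for x in g])
    return columns

cols = random_cs_matrix(n=20, m=100)
print(len(cols), len(cols[0]))
print(max(abs(math.sqrt(sum(x * x for x in c)) - 1.0) for c in cols))
```

Nothing here verifies CS1-CS3 for the sampled matrix; Theorem 10 only asserts that the conditions hold with overwhelming probability as n grows.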
The proof of Theorem 10 is conducted over the next three subsections; it proceeds by studying
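events $\Omega_n^i$, i = 1, 2, 3, where $\Omega_n^1 \equiv$ { CS1 Holds }, etc. It will be shown that for parameters
$\eta_i > 0$ and $\rho_i > 0$
\[ P(\Omega_n^i) \to 1, \qquad n \to \infty; \]
then defining $\rho = \min_i \rho_i$ and $\Omega_n = \cap_i \Omega_n^i$, we have

\[ P(\Omega_n) \to 1, \qquad n \to \infty. \]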

Since, when Ωn occurs, our random draw has produced a matrix obeying CS1-CS3 with param-
eters ηi and ρ, this proves Theorem 10.

7.1 Control of Minimal Eigenvalue


The following lemma allows us to choose positive constants ρ1 and η1 so that condition CS1
holds with overwhelming probability.

Lemma 7.1 Consider sequences of $(n, m_n)$ with $n \le m_n \sim An^\gamma$. Define the event

\[ \Omega_{n,m,\rho,\lambda} = \{\lambda_{\min}(\Phi_J^T\Phi_J) \ge \lambda, \;\; \forall |J| < \rho \cdot n/\log(m)\}. \]

Then, for each λ < 1, for sufficiently small ρ > 0

\[ P(\Omega_{n,m,\rho,\lambda}) \to 1, \qquad n \to \infty. \]

The proof involves three ideas. First, that for a specific subset we get large deviations bounds
on the minimum eigenvalue.

Lemma 7.2 For $J \subset \{1, \dots, m\}$, let $\Omega_{n,J}$ denote the event that the minimum eigenvalue $\lambda_{\min}(\Phi_J^T\Phi_J)$
exceeds λ < 1. Then for sufficiently small $\rho_1 > 0$ there is $\beta_1 > 0$ so that for all $n > n_0$,

\[ P(\Omega^c_{n,J}) \le \exp(-n\beta_1), \]

uniformly in $|J| \le \rho_1 n$.

This was derived in [17] and in [18], using the concentration of measure property of singular
values of random matrices, eg. see Szarek’s work [44, 45].
Second, we note that the event of main interest is representable as:

\[ \Omega_{n,m,\rho,\lambda} = \cap_{|J| \le \rho n/\log(m)} \Omega_{n,J}. \]

Thus we need to estimate the probability of occurrence of every Ωn,J simultaneously.


Third, we can combine the individual bounds to control simultaneous behavior:

Lemma 7.3 Suppose we have events $\Omega_{n,J}$ all obeying, for some fixed β > 0 and ρ > 0,

\[ P(\Omega^c_{n,J}) \le \exp(-n\beta), \]

for each $J \subset \{1, \dots, m\}$ with $|J| \le \rho n$. Pick $\rho_1 > 0$ with $\rho_1 < \min(\beta, \rho)$ and $\beta_1 > 0$ with
$\beta_1 < \beta - \rho_1$. Then for all sufficiently large n,

\[ P\{\Omega^c_{n,J} \text{ for some } J \subset \{1, \dots, m\} \text{ with } |J| \le \rho_1 n/\log(m)\} \le \exp(-\beta_1 n). \]

Lemma 7.1 follows from this immediately.


To prove Lemma 7.3, let $\mathcal{J} = \{J \subset \{1, \dots, m\} \text{ with } |J| \le \rho_1 n/\log(m)\}$. We note that by
Boole's inequality,
\[ P(\cup_{J \in \mathcal{J}} \Omega^c_{n,J}) \le \sum_{J \in \mathcal{J}} P(\Omega^c_{n,J}) \le \#\mathcal{J} \cdot \exp(-\beta n), \]
the last inequality following because each member $J \in \mathcal{J}$ is of cardinality ≤ ρn, since $\rho_1 n/\log(m) < \rho n$
as soon as n ≥ 3 > e. Also, of course,
\[ \log\binom{m}{k} \le k\log(m), \]
so we get $\log(\#\mathcal{J}) \le \rho_1 n$. Taking $\beta_1$ as given, we get the desired conclusion. QED
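
The counting step behind this union bound can be checked directly for moderate sizes. The following is a sketch under assumed values of n, m and $\rho_1$ (chosen only for illustration); it counts the subsets exactly with `math.comb` and compares the logarithm of the count with $\rho_1 n$.

```python
import math

def log_num_subsets(m, k_max):
    """log of the number of subsets J of {1..m} with |J| <= k_max."""
    return math.log(sum(math.comb(m, k) for k in range(k_max + 1)))

# Hypothetical problem sizes.
n, m, rho1 = 400, 5000, 0.1
k_max = int(rho1 * n / math.log(m))      # largest admissible |J|

print(k_max)
print(log_num_subsets(m, k_max))         # log(#J)
print(rho1 * n)                          # the bound rho_1 * n
```

Because the number of subsets grows only like $\exp(\rho_1 n)$ while each bad event has probability at most $\exp(-\beta n)$, choosing $\rho_1 < \beta$ leaves an exponentially small total.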

7.2 Spherical Sections Property


We now show that condition CS2 can be made overwhelmingly likely by choice of η2 and ρ2
sufficiently small but still positive. Our approach derives from [17], which applied an important
result from Banach space theory, the almost spherical sections phenomenon. This says that
slicing the unit ball in a Banach space by intersection with an appropriate finite-dimensional
linear subspace will result in a slice that is effectively spherical. We develop a quantitative
refinement of this principle for the $\ell^1$ norm in $R^n$, showing that, with overwhelming probability,
every operator $\Phi_J$ for $|J| < \rho n/\log(m)$ affords a spherical section of the $\ell^1_n$ ball. The basic
argument we use originates from work of Milman, Kashin and others [22, 28, 37]; we refine
an argument in Pisier [41] and, as in [17] draw inferences that may be novel. We conclude
that not only do almost-spherical sections exist, but they are so ubiquitous that every ΦJ with
|J| < ρn/ log(m) will generate them.

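Definition 7.1 Let |J| = k. We say that $\Phi_J$ offers an ε-isometry between $\ell^2(J)$ and $\ell^1_n$ if
\[ (1 - \epsilon) \cdot \|\alpha\|_2 \le \sqrt{\frac{\pi}{2n}} \cdot \|\Phi_J\alpha\|_1 \le (1 + \epsilon) \cdot \|\alpha\|_2, \qquad \forall \alpha \in R^k. \tag{7.1} \]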
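
The normalization $\sqrt{\pi/(2n)}$ in (7.1) can be sanity-checked by simulation. The sketch below is illustrative and checks only a single random coefficient vector α rather than all of them; the sizes n = 2000 and k = 5, the seed, and the function name are assumptions.

```python
import math
import random

def isometry_ratio(n=2000, k=5, seed=0):
    """Return sqrt(pi/(2n)) * ||Phi_J a||_1 / ||a||_2 for one random Phi_J and a."""
    rng = random.Random(seed)
    # k columns drawn uniformly from S^{n-1} via normalized Gaussian vectors.
    cols = []
    for _ in range(k):
        g = [rng.gauss(0.0, 1.0) for _ in range(n)]
        norm = math.sqrt(sum(x * x for x in g))
        cols.append([x / norm for x in g])
    a = [rng.gauss(0.0, 1.0) for _ in range(k)]
    # v = Phi_J a, computed entry by entry.
    v = [sum(a[j] * cols[j][i] for j in range(k)) for i in range(n)]
    l1_v = sum(abs(x) for x in v)
    l2_a = math.sqrt(sum(x * x for x in a))
    return math.sqrt(math.pi / (2 * n)) * l1_v / l2_a

print(isometry_ratio())   # expected to be close to 1 when n is large
```

The ratio is close to 1 because each entry of $\Phi_J\alpha$ behaves roughly like a Gaussian with variance $\|\alpha\|_2^2/n$, whose mean absolute value is $\|\alpha\|_2\sqrt{2/(\pi n)}$.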

The following Lemma shows that condition CS2 is a generic property of matrices.

Lemma 7.4 Consider the event $\Omega_n^2$ ($\equiv \Omega_n^2(\epsilon, \rho)$) that every $\Phi_J$ with $|J| \le \rho \cdot n/\log(m)$ offers an
ε-isometry between $\ell^2(J)$ and $\ell^1_n$. For each ε > 0, there is ρ(ε) > 0 so that

\[ P(\Omega_n^2) \to 1, \qquad n \to \infty. \]

To prove this, we first need a lemma about individual subsets J proven in [17].

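Lemma 7.5 Fix ε > 0. Choose δ so that

\[ (1 - 3\delta)(1 - \delta)^{-1} \ge (1 - \epsilon)^{1/2} \quad \text{and} \quad (1 + \delta)(1 - \delta)^{-1} \le (1 + \epsilon)^{1/2}. \tag{7.2} \]

Choose $\rho_1 = \rho_1(\epsilon)$ so that
\[ \rho_1 \cdot (1 + 2/\delta) < \delta^2\,\frac{2}{\pi}, \]
and let β(ε) denote the difference between the two sides. For a subset J in {1, . . . , m} let $\Omega_{n,J}$
denote the event that $\Phi_J$ furnishes an ε-isometry to $\ell^1_n$. Then as n → ∞,

\[ \max_{|J| \le \rho_1 n} P(\Omega^c_{n,J}) \le 2\exp(-\beta(\epsilon)\,n\,(1 + o(1))). \]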

Now note that the event of interest for Lemma 7.4 is

\[ \Omega_n^2 = \cap_{|J| \le \rho n/\log(m)} \Omega_{n,J}; \]

to finish apply the individual Lemma 7.5 together with the combining principle in Lemma 7.3.

7.3 Quotient Norm Inequalities


We now show that, for η3 = 3/4, for sufficiently small ρ3 > 0, nearly all matrices have property
CS3.
Let J be any collection of indices in {1, . . . , m}; Range($\Phi_J$) is a linear subspace of $R^n$, and
on this subspace a subset $\Sigma_J$ of possible sign patterns can be realized, i.e. sequences of ±1's
generated by
\[ \sigma(k) = \mathrm{sgn}\Big\{\sum_{i \in J} \alpha_i\phi_i(k)\Big\}, \qquad 1 \le k \le n. \]

CS3 will follow if we can show that for every v ∈ Range($\Phi_J$), some approximation y to sgn(v)
satisfies $|\langle y, \phi_i\rangle| \le 1$ for $i \in J^c$.

Lemma 7.6 Simultaneous Sign-Pattern Embedding. Fix δ > 0. Then for $\tau < \delta^2/32$, set

\[ \epsilon_n = (\log(m_n/(\tau n)))^{-1/2}/4. \]

For sufficiently small $\rho_3 > 0$, there is an event $\Omega_n^3 \equiv \Omega_n^3(\rho_3, \delta)$ with $P(\Omega_n^3) \to 1$, as n → ∞. On
this event, for every subset J with $|J| < \rho_3 n/\log(m)$, for every sign pattern in $\sigma \in \Sigma_J$, there is
a vector y ($\equiv y_\sigma$) with
\[ \|y - \epsilon_n\sigma\|_2 \le \epsilon_n \cdot \delta \cdot \|\sigma\|_2, \tag{7.3} \]
and
\[ |\langle\phi_i, y\rangle| \le 1, \qquad i \in J^c. \tag{7.4} \]

In words, a small multiple $\epsilon_n\sigma$ of any sign pattern σ almost lives in the dual ball
$\{x : |\langle\phi_i, x\rangle| \le 1, i \in J^c\}$.
Before proving this result, we indicate how it gives the property CS3; namely, that
$|J| < \rho_3 n/\log(m)$ and $v = -\Phi_{J^c}\beta_{J^c}$ imply
\[ \|\beta_{J^c}\|_1 \ge \eta_3/\sqrt{\log(m/n)} \cdot \|v\|_1. \]

By the duality theorem for linear programming the value of the primal program

\[ \min \|\beta_{J^c}\|_1 \quad \text{subject to} \quad \Phi_{J^c}\beta_{J^c} = -v \tag{7.5} \]

is at least the value of the dual

\[ \max \langle v, y\rangle \quad \text{subject to} \quad |\langle\phi_i, y\rangle| \le 1, \; i \in J^c. \]

Lemma 7.6 gives us a supply of dual-feasible vectors and hence a lower bound on the dual
program. Take σ = sgn(v); we can find y which is dual-feasible and obeys

\[ \langle v, y\rangle \ge \langle v, \epsilon_n\sigma\rangle - \|y - \epsilon_n\sigma\|_2\|v\|_2 \ge \epsilon_n\|v\|_1 - \epsilon_n\delta\|\sigma\|_2\|v\|_2; \]

picking δ appropriately and taking into account the spherical sections theorem, for sufficiently
large n we have $\delta\|\sigma\|_2\|v\|_2 \le \frac{1}{4}\|v\|_1$; (7.5) follows with $\eta_3 = 3/4$.

7.3.1 Proof of Simultaneous Sign-Pattern Embedding


The proof of Lemma 7.6 follows closely a similar result in [17] that considered the case n < m <
An. Our idea here is to adapt the argument for the n < m < An case to the $n < m \sim An^\gamma$
case, with changes reflecting the different choice of ε, δ, and the sparsity bound ρn/log(m). We
leave out large parts of the argument, as they are identical to the corresponding parts in [17].
The bulk of our effort goes to produce the following lemma, which demonstrates approximate
embedding of a single sign pattern in the dual ball.

Lemma 7.7 Individual Sign-Pattern Embedding. Let $\sigma \in \{-1, 1\}^n$, let $y_0 = \epsilon_n\sigma$, with $\epsilon_n$,
$m_n$, τ, δ as in the statement of Lemma 7.6. Let k ≥ 0. Given a collection $(\phi_i : 1 \le i \le m - k)$,
there is an iterative algorithm, described below, producing a vector y as output which obeys

\[ |\langle\phi_i, y\rangle| \le 1, \qquad i = 1, \dots, m - k. \tag{7.6} \]

Let $(\phi_i)_{i=1}^{m-k}$ be iid uniform on $S^{n-1}$; there is an event $\Omega_\sigma$ described below, having probability
controlled by
\[ \mathrm{Prob}(\Omega_\sigma^c) \le 2n\exp\{-n\beta\}, \tag{7.7} \]
for β > 0 which can be explicitly given in terms of τ and δ. On this event,

\[ \|y - y_0\|_2 \le \delta \cdot \|y_0\|_2. \tag{7.8} \]

Lemma 7.7 will be proven in a section of its own. We now show that it implies Lemma 7.6.
We recall a standard implication of so-called Vapnik-Cervonenkis theory [42]:
     
\[ \#\Sigma_J \le \binom{n}{0} + \binom{n}{1} + \dots + \binom{n}{|J|}. \]

Notice that if |J| < ρn/ log(m), then

log(#ΣJ ) ≤ ρ · n + log(n),

while also
log #{J : |J| < ρn/ log(m), J ⊂ {1, . . . , m}} ≤ ρn.
Hence, the total number of sign patterns generated by operators ΦJ obeys

log #{σ : σ ∈ ΣJ , |J| < ρn/ log(m)} ≤ n · 2ρ + log(n).

Now β furnished by Lemma 7.7 is positive, so pick ρ3 < β/2 with ρ3 > 0. Define

\[ \Omega_n^3 = \cap_{|J| < \rho_3 n/\log(m)} \cap_{\sigma \in \Sigma_J} \Omega_{\sigma,J}, \]

where $\Omega_{\sigma,J}$ denotes the instance of the event (called $\Omega_\sigma$ in the statement of Lemma 7.7) generated
by a specific σ, J combination. On the event $\Omega_n^3$, every sign pattern associated with any $\Phi_J$
obeying $|J| < \rho_3 n/\log(m)$ is almost dual feasible. Now

\[ P((\Omega_n^3)^c) \le \#\{\sigma : \sigma \in \Sigma_J, |J| < \rho_3 n/\log(m)\} \cdot \exp(-n\beta) \le \exp\{-n(\beta - 2\rho_3) + \log(n)\} \to 0, \qquad n \to \infty. \]

QED.

7.4 Proof of Individual Sign-Pattern Embedding


7.4.1 An Embedding Algorithm
The companion paper [17] introduced an algorithm to create a dual feasible point y starting
from a nearby almost-feasible point y0 . It worked as follows.
Let $I_0$ be the collection of indices $1 \le i \le m$ with

$$|\langle \phi_i, y_0 \rangle| > 1/2,$$

and then set

$$y_1 = y_0 - P_{I_0} y_0,$$

where $P_{I_0}$ denotes the least-squares projector $\Phi_{I_0} (\Phi_{I_0}^T \Phi_{I_0})^{-1} \Phi_{I_0}^T$.
In effect, identify the indices where $y_0$ exceeds half the forbidden level
$|\langle \phi_i, y_0 \rangle| > 1$, and "kill" those indices.
Continue this process, producing $y_2$, $y_3$, etc., with stage-dependent thresholds
$t_\ell \equiv 1 - 2^{-\ell-1}$ successively closer to 1. Set

$$I_\ell = \{i : |\langle \phi_i, y_\ell \rangle| > t_\ell\},$$

and, putting $J_\ell \equiv I_0 \cup \cdots \cup I_\ell$,

$$y_{\ell+1} = y_0 - P_{J_\ell} y_0.$$

If $I_\ell$ is empty, then the process terminates, and set $y = y_\ell$. Termination must occur at
stage $\ell^* \le n$. At termination,

$$|\langle \phi_i, y \rangle| \le 1 - 2^{-\ell^*-1}, \quad i = 1, \dots, m.$$
Hence y is definitely dual feasible. The only question is how close to y0 it is.
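
The iteration just described can be sketched in a few lines of NumPy. This is a minimal illustration under stated assumptions, not the implementation of [17]: the function name `embed_dual_feasible`, the use of `numpy.linalg.lstsq` to apply the least-squares projector, and the problem sizes and value of `eps` are all choices made here for demonstration.

```python
import numpy as np

def embed_dual_feasible(Phi, y0):
    """Iteratively project y0 until |<phi_i, y>| <= t_l < 1 for every column phi_i.

    Phi : (n, m) array whose columns phi_i have unit norm.
    y0  : (n,) almost-feasible starting point.
    Returns (y, J, stage): the output point, a boolean mask of "killed" indices,
    and the stage at which the process stopped.
    """
    n, m = Phi.shape
    y = y0.copy()
    J = np.zeros(m, dtype=bool)               # J_l = I_0 u ... u I_l
    for stage in range(n + 1):
        t = 1.0 - 2.0 ** (-stage - 1)         # threshold t_l = 1 - 2^{-l-1}
        I = np.abs(Phi.T @ y) > t             # I_l: indices above the threshold
        if not I.any():                       # I_l empty: terminate with y = y_l
            return y, J, stage
        J |= I
        Phi_J = Phi[:, J]
        # Least-squares projection P_J y0, computed without forming the inverse
        coef, *_ = np.linalg.lstsq(Phi_J, y0, rcond=None)
        y = y0 - Phi_J @ coef                 # y_{l+1} = y0 - P_{J_l} y0
    raise RuntimeError("no termination within n + 1 stages")

# Illustrative sizes only (assumed for the demo, not taken from the paper).
rng = np.random.default_rng(0)
n, m, eps = 50, 200, 0.25
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)            # columns uniform on the sphere
sigma = rng.choice([-1.0, 1.0], size=n)
y0 = eps * sigma
y, J, stage = embed_dual_feasible(Phi, y0)
print(stage, int(J.sum()), np.abs(Phi.T @ y).max(),
      np.linalg.norm(y - y0) / np.linalg.norm(y0))
```

The last printed quantity is the relative distance between y and y0, which is the quantity the remainder of the proof sets out to control.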

7.4.2 Analysis Framework
Also in [17] bounds were developed for two key descriptors of the algorithm trajectory:

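$$\alpha_\ell = \|y_\ell - y_{\ell-1}\|_2,$$

and

$$|I_\ell| = \#\{i : |\langle \phi_i, y_\ell \rangle| > 1 - 2^{-\ell-1}\}.$$

We adapt the arguments deployed there. We define bounds $\delta_\ell \equiv \delta_{\ell;n}$ and
$\nu_\ell \equiv \nu_{\ell;n}$ for $\alpha_\ell$ and $|I_\ell|$, of the form

$$\delta_{\ell;n} = \|y_0\|_2 \cdot \omega^\ell, \quad \ell = 1, 2, \dots,$$
$$\nu_{\ell;n} = n \cdot \lambda_0 \cdot \epsilon_n^2 \cdot \omega^{2\ell+2}/4, \quad \ell = 0, 1, 2, \dots;$$

here $\lambda_0 = 1/2$ and $\omega = \min(1/2, \delta/2, \omega_0)$, where $\omega_0 > 0$ will be defined
later below. We define sub-events

$$E_\ell = \{\alpha_j \le \delta_j,\ j = 1, \dots, \ell,\ |I_j| \le \nu_j,\ j = 0, \dots, \ell - 1\};$$

Now define

$$\Omega_\sigma = \bigcap_{\ell=1}^{n} E_\ell;$$

this event implies, because $\omega \le 1/2$,

$$\|y - y_0\|_2 \le \Big(\sum_\ell \alpha_\ell^2\Big)^{1/2} \le \|y_0\|_2 \cdot \omega/(1 - \omega^2)^{1/2} \le \|y_0\|_2 \cdot \delta.$$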
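
The two right-hand inequalities in this chain can be spelled out as follows. This is a short worked step added for the reader, using only the bound $\alpha_\ell \le \delta_{\ell;n} = \|y_0\|_2 \omega^\ell$ that holds on $\Omega_\sigma$ and the definition $\omega = \min(1/2, \delta/2, \omega_0)$.

```latex
\begin{align*}
\sum_{\ell \ge 1} \alpha_\ell^2
  &\le \|y_0\|_2^2 \sum_{\ell \ge 1} \omega^{2\ell}
   = \|y_0\|_2^2 \cdot \frac{\omega^2}{1 - \omega^2}, \\
\frac{\omega}{(1 - \omega^2)^{1/2}}
  &\le \frac{2}{\sqrt{3}}\,\omega < 2\omega \le \delta
  \qquad (\text{since } \omega \le 1/2 \text{ and } \omega \le \delta/2).
\end{align*}
```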

We will show that, for $\beta > 0$ chosen in conjunction with $\tau > 0$,

$$P(E_{\ell+1}^c \mid E_\ell) \le 2 \exp\{-\beta n\}. \tag{7.9}$$

This implies

$$P(\Omega_\sigma^c) \le 2n \exp\{-\beta n\},$$
and the Lemma follows. QED

7.4.3 Large Deviations


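Define the events

$$F_\ell = \{\alpha_\ell \le \delta_{\ell;n}\}, \quad G_\ell = \{|I_\ell| \le \nu_{\ell;n}\},$$

so that

$$E_{\ell+1} = F_{\ell+1} \cap G_\ell \cap E_\ell.$$

Put

$$\rho_0(\tau, \omega; n) = (128)^{-1} \cdot \frac{\log(An^\gamma)}{\log(An^\gamma/\tau n)} \cdot \frac{\omega^2}{1 - \omega^2};$$

and note that this depends quite weakly on $n$. Recall that the event $E_\ell$ is defined in terms
of $\omega$ and $\tau$. On the event $E_\ell$, $|J_\ell| \le \rho_0 n/\log(m)$. Lemma 7.1 implicitly
defined a quantity $\lambda_1(\rho, A, \gamma)$ lowerbounding the minimum eigenvalue of every
$\Phi_J^T \Phi_J$ where $|J| \le \rho n/\log(m)$. Pick $\rho_{1/2} > 0$ so that
$\lambda_1(\rho_{1/2}, A, \gamma) > 1/2$. Pick $\omega_0$ so that

$$\rho_0(\tau, \omega_0; n) < \rho_{1/2}, \quad n > n_0.$$

With this choice of $\omega_0$, when the event $E_\ell$ occurs,
$\lambda_{\min}(\Phi_{I_\ell}^T \Phi_{I_\ell}) > \lambda_0$. Also on $E_\ell$,
$u_j = 2^{-j-1}/\alpha_j > 2^{-j-1}/\delta_j = v_j$ (say) for $j \le \ell$.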

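In [17] an analysis framework was developed in which a family
$(Z_i^\ell : 1 \le i \le m,\ 0 \le \ell < n)$ of random variables iid $N(0, \frac{1}{n})$ was
introduced, and it was shown that

$$P\{G_\ell^c \mid E_\ell\} \le P\Big\{\sum_i 1_{\{|Z_i^\ell| > v_\ell\}} > \nu_\ell\Big\},$$

and

$$P\{F_{\ell+1}^c \mid G_\ell, E_\ell\} \le P\Big\{2 \cdot \lambda_0^{-1} \Big[\nu_\ell + \delta_\ell^2 \Big(\sum_i (Z_i^\ell)^2 1_{\{|Z_i^\ell| > v_\ell\}}\Big)\Big] > \delta_{\ell+1}^2\Big\}.$$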
That paper also stated two simple Large Deviations bounds:

Lemma 7.8 Let $Z_i$ be iid $N(0, 1)$, $k \ge 0$, $t > 2$.

$$\frac{1}{m} \log P\Big\{\sum_{i=1}^{m-k} Z_i^2 1_{\{|Z_i| > t\}} > m\Delta\Big\} \le e^{-t^2/4} - \Delta/4,$$

and

$$\frac{1}{m} \log P\Big\{\sum_{i=1}^{m-k} 1_{\{|Z_i| > t\}} > m\Delta\Big\} \le e^{-t^2/2} - \Delta/4.$$

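Applying this, we note that the event

$$\Big\{2 \cdot \lambda_0^{-1} \Big[\nu_\ell + \delta_\ell^2 \Big(\sum_i (Z_i^\ell)^2 1_{\{|Z_i^\ell| > v_\ell\}}\Big)\Big] > \delta_{\ell+1}^2\Big\},$$

stated in terms of $N(0, \frac{1}{n})$ variables, is equivalent to an event

$$\Big\{\sum_{i=1}^{m-k} Z_i^2 1_{\{|Z_i| > \tau_\ell\}} > m\Delta_\ell\Big\},$$

stated in terms of standard $N(0, 1)$ random variables, where

$$\tau_\ell^2 = n \cdot v_\ell^2 = \epsilon_n^{-2} (2\omega)^{-2\ell}/4,$$

and

$$\Delta_\ell = \frac{n}{m} (\lambda_0 \delta_{\ell+1}^2/2 - \nu_\ell)/\delta_\ell^2.$$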
m
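We therefore have the inequality

$$\frac{1}{m} \log P\{F_{\ell+1}^c \mid G_\ell, E_\ell\} \le e^{-\tau_\ell^2/4} - \Delta_\ell/4.$$

Now

$$e^{-\tau_\ell^2/4} = e^{-[(16 \log(m/\tau n))/16] \cdot (2\omega)^{-2\ell}} = (\tau n/m)^{(2\omega)^{-2\ell}},$$

and

$$\Delta_\ell = \frac{n}{m} (\omega^2/4 - \omega^2/8) = \frac{n}{m}\, \omega^2/8.$$

Since $\omega \le 1/2$, the term of most concern in $(\tau n/m)^{(2\omega)^{-2\ell}}$ is at $\ell = 0$; the
other terms are always better. Also $\Delta_\ell$ in fact does not depend on $\ell$. Focusing now on
$\ell = 0$, we may write

$$\log P\{F_1^c \mid G_0\} \le m(\tau \cdot n/m - n/m \cdot \omega^2/8) = n(\tau - \omega^2/8).$$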

Recalling that $\omega \le \delta/2$ and putting

$$\beta \equiv \beta(\tau; \delta) = (\delta/2)^2/8 - \tau,$$

we get $\beta > 0$ for $\tau < \delta^2/32$, and

$$P\{F_{\ell+1}^c \mid G_\ell, E_\ell\} \le \exp(-n\beta).$$

A similar analysis holds for the $G_\ell$'s. QED
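
To see how quickly the resulting probability bound decays, the exponent and the bound (7.7) can be evaluated numerically. The snippet below is a sketch only; the function names `beta_exponent` and `failure_bound` and the sample values of `n`, `tau` and `delta` are assumptions introduced here for illustration.

```python
import math

def beta_exponent(tau: float, delta: float) -> float:
    """beta(tau; delta) = (delta/2)^2 / 8 - tau, as defined in the text."""
    return (delta / 2.0) ** 2 / 8.0 - tau

def failure_bound(n: int, tau: float, delta: float) -> float:
    """Evaluate the bound 2 n exp(-n beta) of (7.7); requires tau < delta^2 / 32."""
    b = beta_exponent(tau, delta)
    if b <= 0:
        raise ValueError("need tau < delta**2 / 32 so that beta > 0")
    return 2.0 * n * math.exp(-n * b)

# Illustrative values only.
for n in (1_000, 10_000):
    print(n, beta_exponent(0.01, 1.0), failure_bound(n, 0.01, 1.0))
```

Because the exponent is linear in n while the prefactor is only linear in n, the bound is dominated by the exponential once n is a modest multiple of 1/beta.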

8 Conclusion
8.1 Summary
We have described an abstract framework for compressed sensing of objects which can be
represented as vectors $x \in \mathbf{R}^m$. We assume the $x$ of interest is a priori compressible,
so that $\|\Psi^T x\|_p \le R$ for a known basis or frame $\Psi$ and $p < 2$. Starting from an $n$ by $m$
matrix $\Phi$ with $n < m$ satisfying conditions CS1-CS3, and with $\Psi$ the matrix of an orthonormal
basis or tight frame underlying $X_{p,m}(R)$, we define the information operator
$I_n(x) = \Phi \Psi^T x$. Starting from the $n$ pieces of measured information $y_n = I_n(x)$ we
reconstruct $x$ by solving

$$(L_1) \qquad \min_x \|\Psi^T x\|_1 \quad \text{subject to} \quad y_n = I_n(x).$$

The proposed reconstruction rule uses convex optimization and is computationally tractable.
Also the needed matrices $\Phi$ satisfying CS1-CS3 can be constructed by random sampling, with
the columns of $\Phi$ drawn iid from the uniform distribution on the sphere $S^{n-1}$.
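
As a concrete illustration of the reconstruction rule, $(L_1)$ can be posed as a linear program by splitting the coefficient vector into positive and negative parts. The sketch below is not the paper's code: it assumes $\Psi$ is the identity (so $\theta = x$), uses SciPy's `linprog` as the solver, and the function name `basis_pursuit` and the problem sizes are hypothetical choices for the demo.

```python
import numpy as np
from scipy.optimize import linprog

def basis_pursuit(Phi, y):
    """Solve min ||theta||_1 subject to Phi theta = y as a linear program.

    Write theta = u - v with u, v >= 0; then ||theta||_1 = sum(u) + sum(v)
    at the optimum, and the constraint becomes [Phi, -Phi] [u; v] = y.
    """
    n, m = Phi.shape
    cost = np.ones(2 * m)
    A_eq = np.hstack([Phi, -Phi])
    res = linprog(cost, A_eq=A_eq, b_eq=y, bounds=(0, None))
    if not res.success:
        raise RuntimeError(res.message)
    return res.x[:m] - res.x[m:]

# Demo with assumed sizes: a 3-sparse vector in R^60 observed through 30 measurements.
rng = np.random.default_rng(1)
n, m, k = 30, 60, 3
Phi = rng.standard_normal((n, m))
Phi /= np.linalg.norm(Phi, axis=0)            # columns uniform on the sphere
theta = np.zeros(m)
theta[rng.choice(m, size=k, replace=False)] = rng.standard_normal(k)
y = Phi @ theta
theta_hat = basis_pursuit(Phi, y)
print(np.linalg.norm(theta_hat - theta))      # reconstruction error in l2 norm
```

For a general orthonormal $\Psi$ one would solve for $\theta = \Psi^T x$ in the same way and then set $x = \Psi \theta$.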
We give error bounds showing that despite the apparent undersampling (n < m), good ac-
curacy reconstruction is possible for compressible objects, and we explain the near-optimality of
these bounds using ideas from Optimal Recovery and Information-Based Complexity. Examples
are sketched related to imaging and spectroscopy.

8.2 Alternative Formulation


We remark that the CS1-CS3 conditions are not the only way to obtain our results. Our proof
of Theorem 9 really shows the following:
Theorem 11 Suppose that an $n \times m$ matrix $\Phi$ obeys the following conditions, with constants
$\rho_1 > 0$, $\eta_1 < 1/2$ and $\eta_2 < \infty$:
A1 The maximal concentration $\nu(\Phi, J)$ (defined in Section 4.2) obeys

$$\nu(\Phi, J) < \eta_1, \quad |J| < \rho_1 n/\log(m). \tag{8.1}$$

A2 The width $w(\Phi, b_{1,m_n})$ (defined in Section 2) obeys

$$w(\Phi, b_{1,m_n}) \le \eta_2 \cdot (n/\log(m_n))^{-1/2}. \tag{8.2}$$

Let $0 < p \le 1$. For some $C = C(p, (\eta_i), \rho_1)$ and all $\theta \in b_{p,m}$ the solution
$\hat\theta_{1,n}$ of $(P_1)$ obeys the estimate:

$$\|\hat\theta_{1,n} - \theta\|_2 \le C \cdot (n/\log(m_n))^{1/2-1/p}.$$

In short, a different approach might exhibit operators $\Phi$ with good widths over $\ell_1$ balls only,
and low concentration on 'thin' sets. Another way to see that the conditions CS1-CS3 can no
doubt be approached differently is to compare the results in [17, 18]; the second paper proves
results which partially overlap those in the first, by using a different technique.

8.3 The Partial Fourier Ensemble
We briefly discuss two recent articles which do not fit in the n-widths tradition followed here,
and so were not easy to cite earlier with due prominence.
First, and closest to our viewpoint, is the breakthrough paper of Candès, Romberg, and
Tao [4]. This was discussed in Section 4.2 above; the result of [4] showed that $\ell_1$ minimization
can be used to exactly recover sparse sequences from the Fourier transform at $n$ randomly-
chosen frequencies, whenever the sequence has fewer than $\rho^* n/\log(n)$ nonzeros, for some
$\rho^* > 0$. Second is the article of Gilbert et al. [25], which showed that a different nonlinear
reconstruction algorithm can be used to recover approximations to a vector in $\mathbf{R}^m$ which
are nearly as good as the best $N$-term approximation in $\ell_2$ norm, using about
$n = O(\log(m) \log(M) N)$ random but nonuniform samples in the frequency domain; here $M$ is
(it seems) an upper bound on the norm of $\theta$.
These articles both point to the partial Fourier Ensemble, i.e. the collection of $n \times m$ matrices
made by sampling $n$ rows out of the $m \times m$ Fourier matrix, as concrete examples of $\Phi$ working
within the CS framework; that is, generating near-optimal subspaces for Gel'fand $n$-widths, and
allowing $\ell_1$ minimization to reconstruct from such information for all $0 < p \le 1$.
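
A member of the partial Fourier ensemble is easy to construct explicitly. The sketch below assumes uniform sampling of rows without replacement and a unitary normalization of the Fourier matrix; the function name `partial_fourier` and the sizes are illustrative assumptions, and the non-uniform sampling scheme of [25] is not reproduced here.

```python
import numpy as np

def partial_fourier(m, n, rng):
    """Return an n x m matrix made of n distinct rows of the unitary m x m DFT matrix."""
    rows = np.sort(rng.choice(m, size=n, replace=False))   # sampled frequencies
    cols = np.arange(m)
    F = np.exp(-2j * np.pi * np.outer(rows, cols) / m) / np.sqrt(m)
    return F, rows

rng = np.random.default_rng(0)
Phi, rows = partial_fourier(m=64, n=16, rng=rng)
print(Phi.shape, rows)
```

Since the rows come from a unitary matrix they are orthonormal, so the sampled operator satisfies $\Phi \Phi^* = I_n$.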
Now [4] (in effect) proves that if $m_n \sim An^\gamma$, then in the partial Fourier ensemble with uniform
measure, the maximal concentration condition A1 (8.1) holds with overwhelming probability for
large n (for appropriate constants η1 < 1/2, ρ1 > 0). On the other hand, the results in [25]
seem to show that condition A2 (8.2) holds in the partial Fourier ensemble with overwhelming
probability for large n, when it is sampled with a certain non-uniform probability measure.
Although the two papers [4, 25] refer to different random ensembles of partial Fourier matrices,
both reinforce the idea that interesting relatively concrete families of operators can be developed
for compressed sensing applications. In fact, Emmanuel Candès has informed us of some recent
results he obtained with Terry Tao [6] indicating that, modulo polylog factors, A2 holds for the
uniformly sampled partial Fourier ensemble. This seems a very significant advance.

Acknowledgements
In spring 2004, Emmanuel Candès told DLD about his ideas for using the partial Fourier ensem-
ble in ‘undersampled imaging’; some of these were published in [4]; see also the presentation [5].
More recently, Candès informed DLD of the results in [6] we referred to above. It is a pleasure
to acknowledge the inspiring nature of these conversations. DLD would also like to thank Anna
Gilbert for telling him about her work [25] finding the B-best Fourier coefficients by nonadaptive
sampling, and to thank Candès for conversations clarifying Gilbert’s work.

References
[1] J. Bergh and J. Löfström (1976) Interpolation Spaces. An Introduction. Springer Verlag.

[2] E. J. Candès and DL Donoho (2000) Curvelets - a surprisingly effective nonadaptive
representation for objects with edges. in Curves and Surfaces eds. C. Rabut, A. Cohen, and
L.L. Schumaker. Vanderbilt University Press, Nashville TN.

[3] E. J. Candès and DL Donoho (2004) New tight frames of curvelets and optimal
representations of objects with piecewise $C^2$ singularities. Comm. Pure and Applied Mathematics
LVII 219-266.

[4] E.J. Candès, J. Romberg and T. Tao. (2004) Robust Uncertainty Principles: Exact Signal
Reconstruction from Highly Incomplete Frequency Information. Manuscript.

[5] Candes, EJ. (2004) Presentation at Second International Conference on Computational
Harmonic Analysis, Nashville Tenn. May 2004.

[6] Candes, EJ. and Tao, T. (2004) Estimates for Fourier Minors, with Applications.
Manuscript.

[7] Chen, S., Donoho, D.L., and Saunders, M.A. (1999) Atomic Decomposition by Basis Pur-
suit. SIAM J. Sci Comp., 20, 1, 33-61.

[8] Ingrid Daubechies (1992) Ten Lectures on Wavelets. SIAM.

[9] Cohen, A. DeVore, R., Petrushev P., Xu, H. (1999) Nonlinear Approximation and the space
$BV(\mathbf{R}^2)$. Amer. J. Math. 121, 587-628.

[10] Donoho, DL, Vetterli, M. DeVore, R.A. Daubechies, I.C. (1998) Data Compression and
Harmonic Analysis. IEEE Trans. Information Theory. 44, 2435-2476.

[11] Donoho, DL (1993) Unconditional Bases are optimal bases for data compression and for
statistical estimation. Appl. Computational Harmonic analysis 1 100-115.

[12] Donoho, DL (1996) Unconditional Bases and bit-level compression. Appl. Computational
Harmonic analysis 3 388-392.

[13] Donoho, DL (2001) Sparse components of images and optimal atomic decomposition. Con-
structive Approximation 17 353-382.

[14] Donoho, D.L. and Huo, Xiaoming (2001) Uncertainty Principles and Ideal Atomic Decom-
position. IEEE Trans. Info. Thry. 47 (no.7), Nov. 2001, pp. 2845-62.

[15] Donoho, D.L. and Elad, Michael (2002) Optimally Sparse Representation from
Overcomplete Dictionaries via $\ell_1$ norm minimization. Proc. Natl. Acad. Sci. USA March 4,
2003 100 5, 2197-2202.

[16] Donoho, D., Elad, M., and Temlyakov, V. (2004) Stable Recovery of Sparse
Overcomplete Representations in the Presence of Noise. Submitted.
URL: http://www-stat.stanford.edu/~donoho/Reports/2004.

[17] Donoho, D.L. (2004) For most large underdetermined systems of linear equations, the
minimal $\ell_1$ solution is also the sparsest solution. Manuscript. Submitted.
URL: http://www-stat.stanford.edu/~donoho/Reports/2004

[18] Donoho, D.L. (2004) For most underdetermined systems of linear equations, the minimal
$\ell_1$-norm near-solution approximates the sparsest near-solution. Manuscript. Submitted.
URL: http://www-stat.stanford.edu/~donoho/Reports/2004

[19] A. Dvoretsky (1961) Some results on convex bodies and Banach Spaces. Proc. Symp. on
Linear Spaces. Jerusalem, 123-160.

[20] M. Elad and A.M. Bruckstein (2002) A generalized uncertainty principle and sparse repre-
sentations in pairs of bases. IEEE Trans. Info. Thry. 49 2558-2567.

[21] J.J. Fuchs (2002) On sparse representation in arbitrary redundant bases. Manuscript.

[22] T. Figiel, J. Lindenstrauss and V.D. Milman (1977) The dimension of almost-spherical
sections of convex bodies. Acta Math. 139 53-94.

[23] S. Gal, C. Micchelli, Optimal sequential and non-sequential procedures for evaluating a
functional. Appl. Anal. 10 105-120.

[24] Garnaev, A.Y. and Gluskin, E.D. (1984) On widths of the Euclidean Ball. Soviet Mathe-
matics – Doklady 30 (in English) 200-203.

[25] A. C. Gilbert, S. Guha, P. Indyk, S. Muthukrishnan and M. Strauss, (2002) Near-optimal
sparse Fourier representations via sampling, in Proc 34th ACM symposium on Theory of
Computing, pp. 152–161, ACM Press.

[26] R. Gribonval and M. Nielsen. Sparse Representations in Unions of Bases. To appear IEEE
Trans Info Thry.

[27] G. Golub and C. van Loan.(1989) Matrix Computations. Johns Hopkins: Baltimore.

[28] Boris S. Kashin (1977) Diameters of certain finite-dimensional sets in classes of smooth
functions. Izv. Akad. Nauk SSSR, Ser. Mat. 41 (2) 334-351.

[29] M.A. Kon and E. Novak The adaption problem for approximating linear operators Bull.
Amer. Math. Soc. 23 159-165.

[30] Thomas Kuhn (2001) A lower estimate for entropy numbers. Journal of Approximation
Theory, 110, 120-124.

[31] Michel Ledoux. The Concentration of Measure Phenomenon. Mathematical Surveys and
Monographs 89. American Mathematical Society 2001.

[32] S. Mallat(1998). A Wavelet Tour of Signal Processing. Academic Press.

[33] Y. Meyer. (1993) Wavelets and Operators. Cambridge University Press.

[34] Melkman, A. A., Micchelli, C. A.; Optimal estimation of linear operators from inaccurate
data. SIAM J. Numer. Anal. 16; 1979; 87–105;

[35] A.A. Melkman (1980) n-widths of octahedra. in Quantitative Approximation eds. R.A.
DeVore and K. Scherer, 209-216, Academic Press.

[36] Micchelli, C.A. and Rivlin, T.J. (1977) A Survey of Optimal Recovery, in Optimal Es-
timation in Approximation Theory, eds. C.A. Micchelli, T.J. Rivlin, Plenum Press, NY,
1-54.

[37] V.D. Milman and G. Schechtman (1986) Asymptotic Theory of Finite-Dimensional Normed
Spaces. Lect. Notes Math. 1200, Springer.

[38] E. Novak (1996) On the power of Adaption. Journal of Complexity 12, 199-237.

[39] Pinkus, A. (1986) n-widths and Optimal Recovery in Approximation Theory, Proceeding
of Symposia in Applied Mathematics, 36, Carl de Boor, Editor. American Mathematical
Society, Providence, RI.

[40] Pinkus, A. (1985) n-widths in Approximation Theory. Springer-Verlag.

[41] G. Pisier (1989) The Volume of Convex Bodies and Banach Space Geometry. Cambridge
University Press.

[42] D. Pollard (1989) Empirical Processes: Theory and Applications. NSF - CBMS Regional
Conference Series in Probability and Statistics, Volume 2, IMS.

[43] Carsten Schutt (1984) Entropy numbers of diagonal operators between symmetric Banach
spaces. Journal of Approximation Theory, 40, 121-128.

[44] Szarek, S.J. (1990) Spaces with large distances to $\ell_\infty^n$ and random matrices. Amer. Jour.
Math. 112, 819-842.

[45] Szarek, S.J.(1991) Condition Numbers of Random Matrices.

[46] Traub, J.F., Woźniakowski, H. (1980) A General Theory of Optimal Algorithms, Academic
Press, New York.

[47] J.A. Tropp (2003) Greed is Good: Algorithmic Results for Sparse Approximation To appear,
IEEE Trans Info. Thry.

[48] J.A. Tropp (2004) Just Relax: Convex programming methods for Subset Selection and
Sparse Approximation. Manuscript.
