Optimized Projections for Compressed Sensing
Michael Elad
The Department of Computer Science
The Technion – Israel Institute of Technology
Haifa 32000 Israel
Email: [email protected].
Abstract
Compressed-Sensing (CS) offers a joint compression and sensing process, based on the existence of a sparse representation of the treated signal and a set of projected measurements. Work on CS thus far typically assumes that the projections are drawn at random. In this paper we consider the optimization of these projections. As such a direct optimization is prohibitive, we target an average measure of the mutual-coherence of the effective dictionary, and demonstrate that this leads to better CS reconstruction performance. Both the Basis-Pursuit and the Orthogonal-Matching-Pursuit are shown to benefit from the newly designed projections, with a reduction of the error-rate by a factor of 10 and beyond.
1 Introduction
Consider a family of signals {xj}j ⊂ ℝⁿ, known to have sparse representations over a fixed dictionary D ∈ ℝ^{n×k} (typically redundant, k > n), i.e.,

$$x_j = D\alpha_j, \qquad (1)$$

with ‖αj‖₀ ≤ T ≪ n for all j. The ℓ₀-norm used here simply counts the number of non-zeros in αj.
Compressed-Sensing (CS) offers a joint sensing and compression process for such signals [1, 2, 3, 4, 5, 6, 7]. Using a projection matrix P ∈ ℝ^{p×n} with T < p ≪ n, CS suggests sensing each signal through the measurements

$$y_j = P x_j. \qquad (2)$$
The original signal xj can be reconstructed from yj by exploiting the sparsity of its representation – i.e., among all possible α satisfying yj = PDα we seek the sparsest. If this representation coincides with αj, we get a perfect reconstruction of the signal via x̂j = Dα̂j. Thus, reconstruction requires solving

$$\hat{\alpha}_j = \arg\min_{\alpha} \|\alpha\|_0 \ \ \text{subject to} \ \ y_j = PD\alpha, \qquad (3)$$

which is known to be NP-hard even for moderate sizes of the linear system in the constraint [8, 9]. Approximation techniques, known as pursuit algorithms, are deployed instead, and are proven to lead to the true result for very sparse solutions [10, 11, 12].
Work on CS thus far assumes that P is drawn at random, which simplifies its theoretical analysis, and also facilitates a simple implementation [1, 2, 3, 4, 5, 6, 7]. In this paper we show that by optimizing the choice of P such that it leads to a better coherence of the effective dictionary PD, improved CS reconstruction performance is obtained for both the Basis-Pursuit (BP) [10] and the Orthogonal-Matching-Pursuit (OMP) algorithms [11, 12].
In the next section we provide the intuition behind CS, along with a statement of the main results in the literature regarding its expected performance, which are related to this work. Section 3 concentrates on a proposed iterative method for improving the projections based on the mutual-coherence (as will be defined shortly) of the overall new dictionary. We demonstrate experimental results in Section 4 and show the performance gain obtained with the optimized projections. As this work is the first to consider the design of the projections, and as it approaches this problem indirectly by improving the mutual-coherence, there is clearly room for future work and improvements. Ideas on how to further extend this work are discussed in Section 5, which concludes the paper.
2 Compressed-Sensing: Intuition and Main Results

We have described above the core idea behind Compressed-Sensing. The first question one must ask is – why will it work at all? In order to answer this question, we need to recall the definition of the mutual-coherence of a dictionary, the maximal absolute and normalized inner product between different columns in D. Put formally, this reads
$$\mu\{D\} = \max_{1 \le i,j \le k,\; i \ne j} \frac{\left| d_i^T d_j \right|}{\|d_i\| \cdot \|d_j\|}. \qquad (4)$$
The mutual-coherence provides a measure of the worst similarity between the dictionary columns, a value that exposes the dictionary's vulnerability, as two such closely related columns may confuse any pursuit technique. An alternative way to view this measure is through the Gram matrix G = D̃ᵀD̃, computed using the dictionary after normalizing each of its columns. The off-diagonal entries in G are the inner products that appear in Equation (4), and the mutual-coherence is the largest of them in absolute value.
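To make the definition concrete, the following sketch (in Python with NumPy; our illustration, not part of the original derivation) computes µ{D} exactly through this normalized Gram matrix:

```python
import numpy as np

def mutual_coherence(D):
    """mu{D} of Equation (4): the largest absolute normalized
    inner product between distinct columns of D."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)  # normalize each column
    G = Dn.T @ Dn                                      # Gram matrix of the normalized dictionary
    np.fill_diagonal(G, 0.0)                           # discard the trivial diagonal (self-products)
    return np.abs(G).max()

# Example on a random 20x40 dictionary:
rng = np.random.default_rng(0)
D = rng.standard_normal((20, 40))
print(mutual_coherence(D))   # some value in (0, 1)
```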
Suppose that the signal x0 has been constructed by x0 = Dα0 with a sparse representation satisfying

$$\|\alpha_0\|_0 < \frac{1}{2}\left(1 + \frac{1}{\mu\{D\}}\right). \qquad (5)$$

(For example, µ{D} = 0.1 covers all representations with at most 5 non-zeros.) Then the following are guaranteed [13, 14, 15]:

1. The vector α0 is necessarily the sparsest one to describe x0, i.e. it is the solution of

$$\min_{\alpha} \|\alpha\|_0 \ \ \text{subject to} \ \ x_0 = D\alpha. \qquad (6)$$

2. The BP is guaranteed to find α0. The BP replaces the ℓ₀-norm in (6) by an ℓ₁-norm, solving the convex problem

$$\min_{\alpha} \|\alpha\|_1 \ \ \text{subject to} \ \ x_0 = D\alpha. \qquad (7)$$

3. The OMP for approximating α0 is also guaranteed to succeed. The OMP is a greedy and sequential method that accumulates the non-zeros in α0 one at a time, while attempting to obtain the fastest decrease of the residual error ‖x0 − Dα‖.
Based on the above, suppose that the projection matrix P has been chosen and we are to solve

$$\min_{\alpha} \|\alpha\|_0 \ \ \text{subject to} \ \ y_0 = PD\alpha, \qquad (8)$$

or its BP and OMP approximations. If α0 satisfies the condition (5) with µ{D} replaced by µ{PD}, then necessarily the original α0 is the solution of the problem posed in (8), and both pursuit techniques are guaranteed to recover it.
The above implies that if P is designed such that µ{PD} is as small as possible, this allows a wider set of candidate signals to reside under the umbrella of successful CS behavior. While this conclusion is true from a worst-case stand-point, it turns out that the mutual-coherence as defined above does not do justice to the actual behavior of sparse representations and pursuit algorithms. It is far more informative to characterize the average behavior, and allow a small fraction of signals with the same representation's cardinality to fail, than to insist on a guarantee for the very worst case. Indeed, in practice, values of ‖α0‖₀ substantially beyond the above bound are still leading to successful CS.
How many projections p are required for successful reconstruction? Assuming that the cardinality of the representation, T, is known, the measurements in (2) pose p equations with 2T unknowns (the indices of the non-zeros and their coefficients). Recent work has established that indeed, for a high success-rate of CS, it is enough to use O{T} projections (up to logarithmic factors). These results are typically accompanied by an assumption about the specific dictionary structure, the use of random projections, and considering an asymptotic case where the problem dimensions grow to infinity.
If we address this very question of the required number of projections from the point of view of the value of µ{PD}, we are likely to find that O{n} measurements are needed, losing all the compressibility potential in CS. Again we find that replacing the measure µ{PD} with a parallel one that considers average absolute inner-products may do more justice to the actual CS behavior.
3 Optimizing the Projection Matrix

In this section we shall consider a different mutual-coherence, which reflects average behavior. Denote by g_ij the entries of the Gram matrix G = D̃ᵀD̃ of the column-normalized dictionary. We define µt{D} as the average of all absolute and normalized inner products between different columns in D (the off-diagonal |g_ij|) that are no smaller than t. Put formally,

$$\mu_t\{D\} = \frac{\sum_{i \ne j} \mathbf{1}_{\{|g_{ij}| \ge t\}} \cdot |g_{ij}|}{\sum_{i \ne j} \mathbf{1}_{\{|g_{ij}| \ge t\}}}, \qquad (9)$$

where the indicator equals 1 when |g_ij| ≥ t and 0 otherwise. As the value of t grows, µt{D} grows as well and approaches µ{D} from below. Also, it is obvious from the definition that µt{D} ≥ t. In the optimization procedure we are about to describe, we will target this value and minimize it iteratively.
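As a minimal illustration (ours; it assumes at least one off-diagonal entry reaches t), µt{D} can be computed directly from the definition:

```python
import numpy as np

def averaged_coherence(D, t):
    """mu_t{D} of Equation (9): the average of the absolute normalized
    inner products between distinct columns of D that are at least t.
    Assumes at least one off-diagonal entry reaches t."""
    Dn = D / np.linalg.norm(D, axis=0, keepdims=True)
    G = np.abs(Dn.T @ Dn)
    off = G[~np.eye(G.shape[1], dtype=bool)]  # all off-diagonal |g_ij|
    return off[off >= t].mean()
```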
Note that a different and more direct approach towards the design of the projection matrix would be its learning based on signal examples and tests involving the pursuit algorithm deployed. We believe that such a method is likely to lead to better performance compared to the method described here. Nevertheless, such a direct scheme is also expected to be far more complex and involved, and thus its replacement with the optimization of the average mutual-coherence proposed here is a natural first step.
Put very simply, our goal is to minimize µt{PD} with respect to P, assuming that the dictionary D and the parameter t are known and fixed. Since µt{PD} is defined via the entries of the Gram matrix of the effective dictionary, we adopt an iterative algorithm that passes from- and to- the Gram matrix in every iteration. This algorithm is inspired by a similar approach adopted in [16] for the design of Grassmannian frames that minimize the mutual-coherence.

A slightly different mode of operation of the above algorithm can be proposed, where t varies from one iteration to another, by addressing at all times a constant fraction of the entries in the Gram matrix. For example, the value t can be updated at each iteration such that it targets the top 20% of the inner-products. We shall denote the average mutual-coherence of the top t% by µt%{PD}, and, as we shall see in the next section, it is this measure that we will work with. The algorithm for optimizing P with the above two options is described in Figure 1.

[Figure 1: The proposed algorithm for optimizing the projection matrix P. Objective: minimize µt{PD} with respect to P.]
In this algorithm we start with a random set of p projections stored in the matrix P. As our main objective is the reduction of the inner-products that are above t in absolute value (assuming the first mode of operation), the Gram matrix of the normalized effective dictionary is computed, and these values are "shrunk" by multiplying them by 0 < γ < 1. To smooth the transition, entries in G with magnitude below t but above γt are "shrunk" by a smaller amount, using the function

$$y = \begin{cases} \gamma x & |x| \ge t \\ \gamma t \cdot \mathrm{sign}(x) & t > |x| \ge \gamma t \\ x & \gamma t > |x| \end{cases}. \qquad (11)$$
This function is described graphically for t = 0.5 and γ = 0.6 in Figure 2. For convenience, the functions y = x and y = γx are also shown. As can be seen, the proposed function is continuous.
[Figure 2: The shrink operation employed in the algorithm for t = 0.5 and γ = 0.6 (input value vs. output value).]
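A direct implementation of the shrink function (11) is straightforward; the sketch below (ours) applies it entrywise, and the printed values reproduce the three regimes illustrated in Figure 2:

```python
import numpy as np

def shrink(x, t, gamma):
    """The shrink of Equation (11), applied entrywise:
    scale by gamma above t, clip to gamma*t in the middle band,
    and leave small values untouched."""
    a = np.abs(x)
    return np.where(a >= t, gamma * x,
           np.where(a >= gamma * t, gamma * t * np.sign(x), x))

# The setting of Figure 2: t = 0.5, gamma = 0.6 (so gamma*t = 0.3)
print(shrink(np.array([0.8, 0.4, 0.2]), 0.5, 0.6))  # [0.48 0.3  0.2 ]
```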
The above shrinking operation causes the resulting Gram matrix to become full-rank in the general case. Thus, the next steps mend this by forcing a rank of p and finding the matrix P that best describes the square-root of the obtained Gram matrix. In summary, steps 1-4 in the algorithm address the objective of the process – the reduction of µt{PD} – while steps 5-7 are responsible for the feasibility of the proposed new Gram matrix and the extraction of the emerging projection matrix.
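The following sketch gives one possible reading of the overall iteration, under our own implementation choices (the unit diagonal of G is kept untouched during the shrink, the rank reduction uses a symmetric eigendecomposition, and the projection matrix is recovered with a pseudo-inverse of D); it reuses the shrink function from the previous sketch:

```python
import numpy as np

def optimize_projections(D, P, t, gamma, n_iter=50):
    """Iterative optimization of P in the spirit of Figure 1.
    Steps 1-4: compute the Gram matrix of the normalized effective
    dictionary and shrink its off-diagonal entries (Eq. (11)).
    Steps 5-7: force rank p, take a square root, and map it back
    to a p x n projection matrix."""
    p = P.shape[0]
    for _ in range(n_iter):
        E = P @ D                                      # effective dictionary, p x k
        E /= np.linalg.norm(E, axis=0, keepdims=True)  # normalize columns
        G = E.T @ E
        off = ~np.eye(G.shape[0], dtype=bool)
        G[off] = shrink(G[off], t, gamma)              # shrink() from the previous sketch
        w, V = np.linalg.eigh(G)                       # eigendecomposition of the new Gram
        top = np.argsort(w)[::-1][:p]                  # keep the p largest eigenvalues
        S = np.sqrt(np.clip(w[top], 0, None))[:, None] * V[:, top].T  # p x k square root
        P = S @ np.linalg.pinv(D)                      # best p x n matrix with P D close to S
    return P
```

With D of size 200 × 400 and an initial random P0 of size 30 × 200, a call such as optimize_projections(D, P0, t=0.2, gamma=0.95, n_iter=1000) mimics the setting of the convergence experiment reported in Figure 3 below.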
Regarding convergence properties, not much can be said in general. The overall problem is far from being convex, and convergence is guaranteed only if γ is chosen very close to 1. However, as we show next, in practice one can choose γ = 0.5 and still get convergence, and in fact an accelerated one, compared to the use of higher values. Since the objective function can be evaluated after every iteration with almost no additional cost, this could be used for an automatic stopping of the algorithm (in case an ascent is detected), and even for an adaptive setting of its parameters.

We demonstrate the convergence behavior with the results in Figure 3. Considering a random dictionary (every entry drawn from an iid zero-mean, unit-variance Gaussian distribution) of size 200 × 400, we seek the best projection matrix containing 30 projections, such that µt{PD} is minimized for t = 0.2. The initialization is a random matrix P0 of size 30 × 200, built the same way as the dictionary. We use several values of γ, from 0.55 to 0.95. In all cases we obtain convergence, and it is faster as γ is smaller. The value of µt{PD} is by definition above t, but as can be seen, it approaches this lower bound as the iterations proceed.

Figure 4 presents the histogram of the absolute off-diagonal entries of the Gram matrix of the normalized effective dictionary, before the optimization and after 50 iterations (using γ = 0.5 and t = 0.2). As can be seen, there is a marked shift towards the origin of the histogram after optimization, with an emphasis on the right tail which represents the higher values. A similar effect is seen also in Figure 5, which presents similar histograms, this time working with t% = 40%. Thus, in this run we target at every iteration the minimization of the average of the top 40% of the absolute inner-products.
[Figure 3: The value of µt{Pk D} as a function of the iteration, for t = 0.2 and various values of γ (0.55 to 0.95).]

[Figure 4: The histogram of the absolute off-diagonal entries of G with the original projection matrix (top) and after 50 iterations (bottom), using γ = 0.5 and t = 0.2.]

[Figure 5: The histogram of the absolute off-diagonal entries of G with the original projection matrix (top) and after 50 iterations (bottom), this time optimizing µt% with t% = 40%.]
4 Experimental Results

It is now time to assess how the optimized projections perform in the compressed-sensing setting. We should remind the reader that in this work we assume that by optimizing µt{PD} w.r.t. P, one obtains more informative projections, which in turn lead to better CS reconstruction. This chain of reasoning is by no means proven, and here we limit our study to an empirical one. The proposed test includes the following steps:
Stage 1 - Generate Data: Choose a dictionary D ∈ ℝ^{n×k}, and synthesize N test signals {xj}_{j=1}^N by generating N sparse vectors {αj}_{j=1}^N of length k each, and computing
∀j, xj = Dαj. All representations are to be built using the same low cardinality ‖αj‖₀ = T.
Stage 2 - Projection: Generate a random projection matrix P, and apply it to the signals, obtaining ∀j, yj = Pxj. Compute the effective dictionary PD to be used by the pursuit algorithms.

Stage 3 - Performance Tests: Apply the BP and the OMP to reconstruct the signals, computing approximations α̂j from yj and testing the error ‖xj − Dα̂j‖₂. Measure the average error-rate – a reconstruction is counted as a failure if this error is non-negligible.

Stage 4 - Optimized Projections: Optimize the projection matrix with the proposed algorithm, repeat the tests with the BP and the OMP as described above, and see how the newly designed projections affect the reconstruction results.
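To make the staged test concrete, here is a minimal sketch of Stages 1-3 with the OMP (our illustration; the failure threshold tol and the plain OMP implementation are simplifications of the actual test harness):

```python
import numpy as np

def omp(A, y, T):
    """Plain OMP: greedily select T columns of A (assumed normalized)
    and least-squares fit y on the selected support."""
    r, support = y.copy(), []
    for _ in range(T):
        support.append(int(np.argmax(np.abs(A.T @ r))))
        coef, *_ = np.linalg.lstsq(A[:, support], y, rcond=None)
        r = y - A[:, support] @ coef
    x = np.zeros(A.shape[1])
    x[support] = coef
    return x

def omp_error_rate(D, P, N=1000, T=4, tol=1e-4, seed=0):
    """Stages 1-3: synthesize N T-sparse signals, project them by P,
    reconstruct with OMP over the effective dictionary PD, and
    return the fraction of failed reconstructions."""
    rng = np.random.default_rng(seed)
    n, k = D.shape
    E = P @ D
    norms = np.linalg.norm(E, axis=0)
    fails = 0
    for _ in range(N):
        alpha = np.zeros(k)
        idx = rng.choice(k, size=T, replace=False)   # random support
        alpha[idx] = rng.standard_normal(T)          # iid N(0,1) coefficients
        x = D @ alpha                                # Stage 1: test signal
        a_hat = omp(E / norms, P @ x, T) / norms     # Stages 2-3: project and reconstruct
        fails += np.linalg.norm(x - D @ a_hat) > tol * np.linalg.norm(x)
    return fails / N
```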
We have followed the above stages in the following two experiments. The first experiment studies the performance of CS before and after the optimization of the projections, with BP and OMP, and for varying numbers of measurements. The second one studies the effect of the representation cardinality T on the performance.
In the first experiment we used a random dictionary of size 80 × 120 (other options, such as a redundant DCT dictionary, were tested too, and found to lead to qualitatively the same results, and are thus omitted). This size was chosen as it enables the CS performance evaluation in reasonable time. We generated N = 100,000 sparse vectors of length k = 120 with T = 4 non-zeros in each. The non-zero locations were chosen at random and populated with iid zero-mean and unit-variance Gaussian values. These sparse vectors were used to create the test signals. Projection matrices of size m × 80 were then applied, with varying values of m in the range 16 to 40. The relative error rate was evaluated as a function of m for both the BP and the OMP, before and after the projection optimization. The projection optimization (for every value of m) was done using up to 1,000 iterations (stopping early in case of an increase in the value of µt%), with γ = 0.95 and a varying t targeting the top t% = 20% of the inner-products. The results are shown in Figure 6.
Each point in the shown graphs represents an average performance, accumulated over N = 100,000 examples. In cases where more than 300 errors were accumulated before all examples were exhausted, the test was stopped and the average so far was used instead. This was done in order to reduce the overall test run-time. Another substantial speed-up was obtained by replacing the BP direct test (which requires a linear programming solver) with a much faster test that verifies the optimality of the true solution directly, as described in the Appendix.
As can be seen and as expected, the results of both pursuit techniques improve as m increases. In this test the BP performs much better than the OMP. The optimized projections indeed lead to improved performance for both algorithms. For some values of m there is nearly a 10:1 improvement factor for the BP and more than a 100:1 improvement for the OMP. Indeed, the OMP with the optimized projections leads to better results than the BP with the random ones.

[Figure 6: The relative number of errors for the BP and the OMP as a function of m, with random projections and optimized projections. Note: a vanishing graph implies a zero error count.]
The second experiment is similar to the first one, this time fixing m = 25 and varying T in the range 1 to 7. The results are shown in Figure 7. As expected, as T grows, performance deteriorates. However, the optimized projections are consistent in their improved performance.
We should emphasize that the presented results do not include a thorough optimization of the parameters γ and t, and the relation between µt and the CS performance still remains obscure at this stage. Also, our experiments concentrated on one specific choice of dictionary size that enables reasonable run-time simulation, and this has an impact on the relatively weak absolute performance CS shows. Other experiments we have done with much larger dictionaries show the same improvement as above, but require too long a run-time for gathering fair statistics, and were thus avoided. Still, the point this paper makes – that better projections are within reach and can thereby improve CS performance – is clearly demonstrated.

[Figure 7: The relative number of errors as a function of T, the cardinality of the input signals, for the BP and the OMP with random and optimized projections (m = 25).]
5 Conclusions
Compressed-Sensing builds on recent results that state that signals can be compressed and sensed at the same time. This is based on the structural assumption such signals are satisfying – having a sparse and redundant representation over a known dictionary. A main ingredient in the CS idea is the use of linear projections that mix the signal. This operation has been traditionally chosen as a random matrix. This work aims to show that better choices of such mixtures are within reach. The projections can be designed such that the average mutual-coherence of the effective dictionary becomes favorable. We have defined this property, shown how to design a projection operator based on it, and demonstrated how it indeed improves the CS reconstruction performance.
The idea of optimizing the projections is appealing and should be further studied. Here are several directions for future work:

• How can the proposed optimization algorithm be performed or approximated for very large dictionaries and signal dimensions?

• Optimizing the projections can be done alternatively using a direct method that learns from signal examples and the specific pursuit algorithm deployed, as mentioned in Section 3. Further work is required to explore this option, and show how effective it is compared to the indirect approach taken here.

• A theoretical analysis is needed that ties the average mutual-coherence, as presented here, to the CS performance, so as to give better justification for the proposed work. Perhaps there is yet another simple measure of the effective dictionary PD that predicts the CS performance even better.
Acknowledgement
The author would like to thank Dr. Michael Zibulevsky for helpful discussions and his
fruitful ideas on how to speed-up the tests carried out in this work.
Appendix

The problem we face is the following: We generate a sparse vector α0 and compute from it the measurement y0 = PDα0. Testing the BP requires solving

$$\min_{\alpha} \|\alpha\|_1 \ \ \text{subject to} \ \ y_0 = PD\alpha, \qquad (A-1)$$

and checking whether α̂ = α0. The problem in such a direct approach is the need to deploy a linear programming solver per each test, and as we are interested in many thousands of such tests, the accumulated run-time becomes prohibitive.
Since we are dealing here with a synthetic test, where the desired solution is a-priori known, we can replace the direct solution of (A-1) with a much more moderate test: considering α0 and checking whether it is indeed its global minimizer. In order to do so, we consider the necessary first-order KKT conditions, as emerging from the Lagrangian of (A-1),

$$L(\alpha, \lambda) = \|\alpha\|_1 + \lambda^T \left( y_0 - PD\alpha \right), \qquad (A-2)$$

with λ serving as the Lagrange multipliers. Taking its derivative with respect to α, and using the fact that the derivative of the absolute value at zero leads to the feasible interval [−1, 1] (considering the sub-gradients), we obtain

$$\left[ D^T P^T \lambda \right]_j = \begin{cases} +1 & \alpha_0(j) > 0 \\ -1 & \alpha_0(j) < 0 \\ u_j & \alpha_0(j) = 0 \end{cases} \qquad \text{with } u_j \in [-1, 1]. \qquad (A-3)$$
Thus, if we find a feasible solution λ to this system, we can guarantee that α0 is the solution of (A-1) and thus the BP is expected to succeed. If we cannot find a solution, we suspect that BP fails. Declaring a failure in such a case is definitely possible, but leads to an upper-bound on the true number of errors, as our numerical scheme for solving Equation (A-3) may fail in spite of the BP success. Assuming that the expected number of such suspected failures is substantially smaller compared to N (as is indeed the case in our simulations), we can directly try to solve (A-1) for these few cases, and see whether failure indeed takes place.
As for the solution of (A-3), this can be achieved in various ways. We separate the rows of this system into two groups: the equality constraints A1λ = b, corresponding to the non-zero entries of α0 (with b holding the appropriate ±1 values), and the inequality constraints |A2λ| ≤ 1, corresponding to the zero entries. A feasible λ is then sought by minimizing a weighted least-squares penalty of the form ‖A1λ − b‖² + β‖W A2λ‖², with W a diagonal weight matrix. Starting with a very small β, the first constraint is satisfied while the second might be violated. Iterating and increasing the value of β, the first term remains zero while the violations of the second are gradually suppressed. In addition, a re-weighting step can be proposed, where the extreme entries in the vector |A2λ| that are above 1 are treated by increasing their weight in W. A finite number of iterations of such an algorithm (50 iterations) was used, and shown to be 1-2 orders of magnitude faster than the full LP solver.
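A possible implementation of this test, under our own choices for the penalty form, the β schedule, and the weight update (the description above only names the ingredients A1, A2, W and β), could look as follows:

```python
import numpy as np

def bp_success_test(D, P, alpha0, n_iter=50, eps=1e-6):
    """Search for Lagrange multipliers satisfying (A-3):
    A1 @ lam = sign(alpha0) on the support of alpha0, and
    |A2 @ lam| <= 1 off the support. Returns True if a feasible
    lam is found (BP succeeds); False flags a *suspected* failure."""
    M = (P @ D).T                          # rows of D^T P^T, shape k x p
    S = alpha0 != 0
    A1, b = M[S], np.sign(alpha0[S])       # equality rows (support)
    A2 = M[~S]                             # inequality rows (off-support)
    w = np.ones(A2.shape[0])
    beta = 1e-6
    for _ in range(n_iter):
        # minimize ||A1 lam - b||^2 + beta * sum_i w_i (A2 lam)_i^2
        H = A1.T @ A1 + beta * A2.T @ (w[:, None] * A2)
        lam = np.linalg.lstsq(H, A1.T @ b, rcond=None)[0]
        v = np.abs(A2 @ lam)
        if np.allclose(A1 @ lam, b, atol=eps) and v.max() <= 1 + eps:
            return True                    # feasible multipliers found
        w[v > 1] *= 2.0                    # re-weight the violated rows
        beta *= 1.5                        # tighten the inequality penalty
    return False
```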
References
[1] Candès, E.J., Romberg, J. and Tao, T. (2006) Robust uncertainty principles: exact signal reconstruction from highly incomplete frequency information, IEEE Trans. on Inf. Theory, Vol. 52, pp. 489–509, February.

[2] Candès, E.J. and Romberg, J. (2005) Quantitative robust uncertainty principles and optimally sparse decompositions, Foundations of Computational Mathematics.

[3] Candès, E.J. and Tao, T. (2006) Near optimal signal recovery from random projections: universal encoding strategies?, IEEE Trans. on Inf. Theory, Vol. 52, pp. 5406–5425, December.

[4] Donoho, D.L. (2006) Compressed sensing, IEEE Trans. on Inf. Theory, Vol. 52, pp. 1289–1306, April.

[5] Tsaig, Y. and Donoho, D.L. (2006) Extensions of compressed sensing, Signal Processing, Vol. 86, pp. 549–571, March.

[6] Tropp, J.A. and Gilbert, A.C. (2006) Signal recovery from partial information via orthogonal matching pursuit.

[7] Tropp, J.A., Wakin, M.B., Duarte, M.F., Baron, D., and Baraniuk, R.G. (2006) Random filters for compressive sampling and reconstruction, Proc. IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), May.

[8] Natarajan, B.K. (1995) Sparse approximate solutions to linear systems, SIAM J. Comput., 24:227–234.

[9] Davis, G., Mallat, S., and Avellaneda, M. (1997) Greedy adaptive approximation, Constructive Approximation, 13:57–98.

[10] Chen, S.S., Donoho, D.L. and Saunders, M.A. (2001) Atomic decomposition by basis pursuit, SIAM Review, Vol. 43, pp. 129–159.

[11] Mallat, S. and Zhang, Z. (1993) Matching pursuit in a time-frequency dictionary, IEEE Trans. on Signal Processing, Vol. 41, pp. 3397–3415, December.

[12] Pati, Y.C., Rezaiifar, R., and Krishnaprasad, P.S. (1993) Orthogonal matching pursuit: recursive function approximation with applications to wavelet decomposition, Proc. 27th Asilomar Conference on Signals, Systems and Computers.

[13] Donoho, D.L. and Elad, M. (2002) Optimally sparse representation in general (non-orthogonal) dictionaries via ℓ1 minimization, Proc. Natl. Acad. Sci. USA, Vol. 100, pp. 2197–2202.

[14] Gribonval, R. and Nielsen, M. (2004) Sparse representations in unions of bases, IEEE Trans. on Inf. Theory.

[15] Tropp, J.A. (2004) Greed is good: Algorithmic results for sparse approximation, IEEE Trans. on Inf. Theory, Vol. 50, pp. 2231–2242, October.

[16] Dhillon, I.S., Heath, R.W. Jr. and Strohmer, T. (2005) Designing structured tight frames via alternating projection, IEEE Trans. on Inf. Theory, Vol. 51, pp. 188–209, January.