Signal Processing: Bob L. Sturm, Laurent Daudet
Signal Processing
journal homepage: www.elsevier.com/locate/sigpro
Department of Architecture, Design and Media Technology, Aalborg University Copenhagen, Lautrupvang 15, 2750 Ballerup, Denmark
Institut Langevin (LOA), Université Paris Diderot Paris 7, UMR 7587, 10, rue Vauquelin, 75231 Paris, France
Article info
abstract
Article history:
Received 5 January 2010
Received in revised form 15 February 2011
Accepted 2 March 2011
Available online 10 March 2011
Keywords:
Multiscale decomposition
Sparse approximation
Time-frequency dictionary
Audio similarity
1. Introduction
Sparse approximation is essentially the modeling of data with few terms from a large and
typically overcomplete set of atoms, called a dictionary [24]. Consider an $x \in \mathbb{R}^K$,
and a dictionary $D$ composed of $N$ unit-norm atoms in the same space, expressed in matrix
form as $D \in \mathbb{R}^{K \times N}$, where $N \gg K$. A pursuit is an algorithm that
decomposes $x$ in terms of $D$ such that $\|x - Ds\|^2 \le \epsilon$ for some error
$\epsilon \ge 0$. (In this paper, we work in a Hilbert space unless otherwise noted.) When $D$
is overcomplete, $D$ has full row rank and there exists an infinite number of solutions to
choose from, even for $\epsilon = 0$. Sparse approximation aims to find a solution $s$ that is
mostly zeros for $\epsilon$ small. In that case, we say that $x$ is sparse in $D$.
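As an illustration of what a pursuit does, the following is a minimal sketch of plain matching pursuit (MP) on a random dictionary. It is a generic textbook version, not the authors' implementation; the names `matching_pursuit`, `eps` and `max_iter`, and the random test dictionary, are assumptions made for this example.

```python
import numpy as np

def matching_pursuit(x, D, eps=1e-6, max_iter=1000):
    """Greedy MP sketch: build s so that ||x - D s||^2 <= eps when possible.

    D is assumed to have unit-norm columns (atoms).
    Returns the coefficient vector s and the final residual r = x - D s.
    """
    s = np.zeros(D.shape[1])
    r = x.astype(float).copy()
    for _ in range(max_iter):
        if r @ r <= eps:            # stop once the residual energy is small enough
            break
        corr = D.T @ r              # correlation of every atom with the residual
        k = int(np.argmax(np.abs(corr)))
        s[k] += corr[k]             # accumulate the weight of the selected atom
        r -= corr[k] * D[:, k]      # remove its contribution from the residual
    return s, r

# Hypothetical test data: K = 16 dimensions, N = 64 atoms (overcomplete).
rng = np.random.default_rng(0)
K, N = 16, 64
D = rng.standard_normal((K, N))
D /= np.linalg.norm(D, axis=0)      # unit-norm atoms
x = 2.0 * D[:, 3] - 1.5 * D[:, 40]  # a signal built from two atoms
s, r = matching_pursuit(x, D, eps=1e-8)
```

The invariant `x = D @ s + r` holds after every iteration, so the loop can be stopped at any model order and still give a valid approximation with a known residual.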
$$|[a]_m| \le C m^{-\gamma}, \quad m = 1, 2, \ldots, n_i \qquad (4)$$
for $n_i$ arbitrarily large, with $C > 0$, and where $[a]_m$ is the $m$th
element of the column vector $a$. This can be seen in the
magnitude representation weights in Fig. 1, which are
weights of sparse representations of piano notes, described
in Section 4.1. With MP and a complete dictionary, we are
guaranteed $\gamma > 0$ because $\|r(n+1)\|^2 < \|r(n)\|^2$ for all $n$ [24].
Consider the Euclidean distance between two signals
of the same dimension, which is the cosine distance for
unit-norm signals. Thus, with respect to this distance, the
$y_i \in Y$ nearest to $x_q$ is given by solving
$$\min_{i \in I} \|y_i - x_q\| \iff \max_{i \in I} \langle x_q, y_i \rangle$$
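The equivalence between the two searches follows from $\|y_i - x_q\|^2 = 2 - 2\langle x_q, y_i\rangle$ for unit-norm signals. A small numerical check, using hypothetical random data (`Y`, `xq` and the sizes are assumptions for this example):

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical database of 50 unit-norm signals of dimension 128, plus one query.
Y = rng.standard_normal((50, 128))
Y /= np.linalg.norm(Y, axis=1, keepdims=True)
xq = rng.standard_normal(128)
xq /= np.linalg.norm(xq)

sq_dist = np.sum((Y - xq) ** 2, axis=1)   # squared Euclidean distances
corr = Y @ xq                             # inner products <x_q, y_i>

nearest_by_distance = int(np.argmin(sq_dist))
nearest_by_correlation = int(np.argmax(corr))
```

Both indices agree, which is why the rest of the development can work with inner products only.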
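express (5) as
$$\max_{i \in I} a_i^T(n_i) H_i^T(n_i) H_q(n_q) a_q(n_q) = \max_{i \in I} \sum_{m=1}^{n_i} \sum_{l=1}^{n_q} [A_{iq} \circ G_{iq}]_{ml} \qquad (7)$$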
where $[B \circ C]_{ml} = [B]_{ml} [C]_{ml}$ is the Hadamard, or entry-wise,
product, $[B]_{ml}$ is the element of $B$ in the $m$th row of the $l$th
column, $G_{iq} \triangleq H_i^T(n_i) H_q(n_q)$ is an $n_i \times n_q$ matrix with elements from the Gramian of the dictionary, i.e., $G \triangleq D^T D$,
and finally we define the outer product of the weights
$$A_{iq} \triangleq a_i(n_i) a_q^T(n_q).$$
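To make the role of the Hadamard product concrete, here is a small sketch that builds two signals exactly from a few atoms (so there is no residual) and checks that their inner product equals the sum of all entries of $A_{iq} \circ G_{iq}$. The dictionary, atom indices and weights are hypothetical random values chosen for the example.

```python
import numpy as np

rng = np.random.default_rng(2)

# Hypothetical unit-norm dictionary and its Gramian G = D^T D.
K, N = 32, 128
D = rng.standard_normal((K, N))
D /= np.linalg.norm(D, axis=0)
G = D.T @ D

# Two sparse models: atom indices and weights (n_i = 5, n_q = 4).
idx_i = rng.choice(N, size=5, replace=False)
idx_q = rng.choice(N, size=4, replace=False)
a_i = rng.standard_normal(5)
a_q = rng.standard_normal(4)

y_i = D[:, idx_i] @ a_i          # y_i = H_i a_i
x_q = D[:, idx_q] @ a_q          # x_q = H_q a_q

G_iq = G[np.ix_(idx_i, idx_q)]   # n_i x n_q block of the Gramian, H_i^T H_q
A_iq = np.outer(a_i, a_q)        # outer product of the weights

sparse_domain = np.sum(A_iq * G_iq)   # sum of the Hadamard product
time_domain = x_q @ y_i
```

Only entries of the Gramian are needed, so the time-domain signals never have to be synthesized if the Gramian can be looked up or computed in closed form.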
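$$S_{iq}(M) \triangleq S_{iq}(M-1) + \sum_{m=1}^{M} [A_{iq} \circ G_{iq}]_{m(M-m+1)} \qquad (9)$$
for $M = 2, 3, \ldots, \min(n_i, n_q)$, and setting $S_{iq}(1) = [A_{iq} \circ G_{iq}]_{11}$.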
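The recursion can be sketched as follows, under the assumption that step $M$ adds the $M$ entries on the $M$th anti-diagonal of $A_{iq} \circ G_{iq}$, which is consistent with $P(M) = M(M+1)/2$ atom pairs having been considered after $M$ steps (for example, $P(10) = 55$). The function names and the random test matrices are hypothetical.

```python
import numpy as np

def recursive_correlation(A, G, M_max):
    """Return [S(1), ..., S(M_max)], adding one anti-diagonal of A o G per step.

    Assumes M_max <= min(A.shape) and that the weights defining A are ordered
    by decreasing magnitude, so that early anti-diagonals carry most energy.
    """
    B = A * G                                  # Hadamard product
    S = [B[0, 0]]                              # S(1) = [A o G]_11
    for M in range(2, M_max + 1):
        # 1-based pairs (m, M-m+1), m = 1..M, become 0-based (m, M-1-m).
        S.append(S[-1] + sum(B[m, M - 1 - m] for m in range(M)))
    return np.array(S)

def pairs_considered(M):
    """Number of atom pairs P(M) used by S(M)."""
    return M * (M + 1) // 2

# Hypothetical example with decaying weights and a random Gramian block.
rng = np.random.default_rng(3)
n = 6
a_i = 1.0 / np.arange(1, n + 1) ** 0.6
a_q = 0.8 / np.arange(1, n + 1) ** 0.6
A = np.outer(a_i, a_q)
G = rng.uniform(-1.0, 1.0, size=(n, n))
S = recursive_correlation(A, G, M_max=n)
```

After step $M$ the sum covers exactly the upper-left triangle of the matrix, so each further step refines the estimate with pairs of progressively smaller weights.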
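With this we can express the argument of (7) as
$$\langle x_q, y_i \rangle = S_{iq}(M) + R(M)$$
$$P(M) \triangleq \sum_{m=1}^{M} m = M(M+1)/2$$
$$R(M) \le C^2 \left( \|c_M\|_1 + \|d\|_1 \right) \qquad (14)$$
with probability $0 \le p \le 1$
$$U_{iq}(M) \triangleq S_{iq}(M) + \tilde{R}(M) \qquad (13)$$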
where the $n_t$ weights $a_j \in \{[a_i(n_i)]_m : [H_i^T(n_i) P_t^T P_t H_i(n_i)]_{ml} \neq 0,\ 1 \le m, l \le n_i\}$
are those associated with atoms having
support in $[t, t+K)$, and $w_j$ we define to weigh the
contribution of $a_j$ to the localized energy estimate. We
set $\xi_t = 0$ if $\sum_{j=1}^{n_t} a_j^2 = 0$.
If all atoms contributing to the subsequence have their
entire support in $[t, t+K)$, and are orthonormal, then we
can set each $w_j = 1$. This does not hold for subsequences
of a signal decomposed using an overcomplete dictionary,
$$w_j = \begin{cases}
1, & u_j \ge t,\ u_j + s_j \le t + K \\
(K/s_j)^2, & u_j < t,\ u_j + s_j \ge t + K \\
(u_j + s_j - t)^2 / s_j^2, & u_j < t,\ t < u_j + s_j \le t + K \\
(t + K - u_j)^2 / s_j^2, & t \le u_j < t + K,\ u_j + s_j > t + K
\end{cases} \qquad (25)$$
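The piecewise weighting in (25) can be sketched as a small function. This example assumes that $u_j$ is the starting sample of atom $j$ and $s_j$ its scale (duration in samples), which the surviving text does not state explicitly; the function name and the zero returned for atoms with no overlap are also assumptions.

```python
def localization_weight(u, s, t, K):
    """Weight w_j of an atom starting at u with scale s, for the window [t, t+K).

    Each case squares the fraction of the atom's support lying in the window.
    """
    end = u + s
    if u >= t and end <= t + K:        # atom entirely inside the window
        return 1.0
    if u < t and end >= t + K:         # atom covers the whole window
        return (K / s) ** 2
    if u < t and t < end <= t + K:     # atom overlaps the window start
        return ((end - t) / s) ** 2
    if t <= u < t + K and end > t + K: # atom overlaps the window end
        return ((t + K - u) / s) ** 2
    return 0.0                         # assumption: no overlap, no contribution

# Hypothetical window of K = 100 samples starting at t = 0.
w_inside = localization_weight(u=10, s=5, t=0, K=100)
w_cover = localization_weight(u=-50, s=200, t=0, K=100)
w_left = localization_weight(u=-10, s=20, t=0, K=100)
w_right = localization_weight(u=90, s=20, t=0, K=100)
```

Setting every weight to 1 instead would overestimate the localized energy whenever long atoms only partly overlap the window.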
$$\sum_{j=1}^{n_t} w_j a_j^2 \quad \ldots \quad \|P_t r_i(n_i)\|^2 \qquad (26)$$
Fig. 4. Short-term energy ratios, $\log_{10}\left(\sum_{j=1}^{n_t} w_j a_j^2 / \|P_t y_i\|^2\right)$, over 1 s windows (hopped 125 ms) for MP decompositions using 8xMDCT [31] to a global
residual energy 30 dB below the initial signal energy. Arrow points to line (top, gray) using weighting $w_j = 1$. The other line (bottom, black) uses (25).
Data in (a) are described in Section 4.2; data in (b) are described in Section 4.3. (a) Six speech signals (0-20 s), music excerpt (21-23 s), realization of GWN
(24-27 s) and (b) Music Orchestra.
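$$\xi_t \sum_{m=1}^{n_t} \sum_{l=1}^{n_q} [A_{iq}(t) \circ G_{iq}(t)]_{ml} \qquad (27)$$
where
$$S_{iq}(t, M) \triangleq S_{iq}(t, M-1) + \xi_t \sum_{m=1}^{M} [A_{iq}(t) \circ G_{iq}(t)]_{m(M-m+1)} \qquad (28)$$
for $M = 2, 3, \ldots, \min(n_i, n_q)$, with $S_{iq}(t, 1) = \xi_t [A_{iq}(t) \circ G_{iq}(t)]_{11}$.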
Δu (samples/ms)    Δf (Hz)
32/0.7             43.1
64/2               43.1
128/3              43.1
256/6              43.1
512/12             21.5
1024/23            10.8
2048/46            5.4
4096/93            2.7
8192/186           1.3
Fig. 5. Estimated remainder (assuming unit-norm signals) as a function of the number of atom pairs P(M) already considered for dataset used to produce
Fig. 1. Gray: bound in (15). Black, numbered: bound in (16) for several labeled p (probability that remainder does not exceed bound) with n = 2375
(number of elements in each sparse model), and (C, γ) = (0.4785, 0.5).
Fig. 6. $|S_{iq}(10)|$ (9) for two sets of recorded piano notes. (a) and (b): Set A compared with itself in time and sparse domains (M = 10). (c) and (d): Set B
(rows) compared with set A (columns) in time and sparse domains (M = 10). Elements on main diagonals in (a) and (b) are scaled by 0.25 to increase
contrast of other elements. (a) A: Time-domain magnitude correlations, (b) A: Sparse-domain approximations of magnitude correlations, (c) B-A: Time-domain magnitude correlations and (d) B-A: Sparse-domain approximations of magnitude correlations.
Fig. 7. Black: $|S_{iq}(M)|$ (9) as a function of the number of atom pairs considered for the set of piano notes in A with a signal from either (a) A (note A3)
or (b) B (note D5 approximately). Gray: for each $S_{iq}(M)$, magnitudes of $L_{iq}(M)$ (12) and $U_{iq}(M)$ (13) using bound in (16) with p = 0.2 (probability that
remainder does not exceed bound), and n = 100 (number of elements in each sparse model). Largest magnitude correlations are labeled. Note differences
in axes. (a) Signal A3 from A with (C, γ) = (0.78, 0.60) and (b) Signal D5 from B with (C, γ) = (1.17, 0.70).
Fig. 8. $|S_{iq}(10)|$ (9) for two sets of recorded piano notes in a sparse
domain using only the atoms with duration at least 186 ms. Compare
with Fig. 6(d).
correlations between the notes much more clearly compared with Fig. 6(d). Removing the short-term phenomena improves tonal-level comparisons between the
signals because non-overlapping yet energetic short
atoms are replaced by atoms representative of the note
harmonics.
Fig. 9. Log spectrograms (frequency in kHz against time in s) of the query signals with which we search. Top: query of male saying "cheese". Middle: query distorted with additive white
Gaussian noise (AWGN) with SNR = 10 dB. Bottom: query distorted with interfering crow sound with SNR = 5 dB.
see its position even when using a single atom pair for
each 100 ms partition of the time-domain. We see the
same behavior in Fig. 10(b), (c) for the two distorted
signals, but in the case where the crow sound interferes
we find the query for $M \ge 2$, or with at least three atom
pair comparisons. The first atom of the decomposed query
with the crow is modeling the crow and not the content of
interest, and so we must increase the order of the model
to find the location of $x_q$. As we increase the number of
pairs considered we also find other segments that point in
the same direction as $x_q$. Table 2 gives the times and
content of the ten largest values in $|S_{iq}(t,10)|$. For the clean
and AWGN distorted $x_q$, "cheese" appears five of the six
times it exists in the original signal. Curiously, these same
five instances are the five largest magnitude correlations
when $x_q$ has the crow interference.
We perform the same test as above but using a much
longer speech signal (about 21 minutes in length)
excerpted from a book-on-CD, "The Old Man and the
Sea" by Ernest Hemingway, read aloud by a single person.
From this signal we create several queries $x_q$, from words
Fig. 10. $|S_{iq}(t,M)|$ (28) as a function of time and the number of atoms M (labeled at right) considered from each representation for each localized sparse
approximation. Localized magnitude correlation of each signal with query is shown by the thin black line in front of the gray time-domain signal at rear.
(a) Clean signal, (b) Signal with AWGN at 10 dB energy and (c) Signal with crow signal at 5 dB energy.
Table 2
Times, values and signal content for first 10 peaks in $|S_{iq}(t,10)|$ (P(10) = 55) in Figs. 10(a)-(c). Highest-rated distances in each (bold) point to the origin of
signal.

      Clean signal                   Signal + WGN                   Signal + Crow
 #    t (s)  |S_iq|  Content         t (s)  |S_iq|  Content         t (s)  |S_iq|  Content
 1    10.0   0.798   cheese          10.0   0.236   cheese          10.0   0.409   cheese
 2    13.6   0.199   cheese          13.6   0.080   cheese          13.6   0.060   cheese
 3    11.3   0.153   -ives is-       15.1   0.051   delicious       16.9   0.030   cheese
 4    16.9   0.149   cheese          11.3   0.045   -ives is-       6.9    0.012   cheese
 5    15.1   0.141   delicious       16.9   0.042   cheese          1.3    0.011   cheese
 6    18.3   0.076   delicious       18.3   0.028   delicious       18.3   0.010   delicious
 7    1.3    0.057   cheese          8.1    0.014   delicious       13.2   0.010   cottage
 8    8.1    0.035   delicious       12.0   0.012   -licious        15.1   0.009   delicious
 9    2.4    0.026   delicious       5.2    0.011   delicious       16.0   0.004   cott-
 10   6.9    0.024   cheese          6.8    0.010   cheese          22.8   0.003   WGN
Fig. 11. Polyphonic orchestral query: sonogram (top) and time-frequency tiles (bottom) of 50-order sparse approximation. Axes: frequency (kHz) against time (s).
Fig. 12. $|S_{iq}(t,M)|$ (28) for three values of M for the query and two different signals. Arrows mark the appearances of the A theme, and their appearance
number. Magnitude correlation of query with localized and normalized signal is shown by the solid gray area in front of the black time-domain signal at
rear. (a) Query compared with original signal and (b) Query compared with different interpretation.
Acknowledgments
B.L. Sturm performed part of this research as a Chateaubriand Post-doctoral Fellow (N. 634146B) at the
Institut Jean Le Rond d'Alembert, Équipe Lutheries, Acoustique, Musique, Université Pierre et Marie Curie, Paris 6,
France; as well as at the Department of Architecture,
Design and Media Technology, at Aalborg University
Copenhagen, Denmark. L. Daudet acknowledges partial
support from the French Agence Nationale de la
Recherche under contract ANR-06-JCJC-0027-01 DESAM.
The authors thank the anonymous reviewers for their
very detailed and helpful comments.
Appendix A. Proof of remainder bounds
setting $S_{iq}(1) = [A_{iq} \circ G_{iq}]_{11}$. We do not yet know the extent to
which this approach can ameliorate the problems with the
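$$\|c_M\|_1 + \|d\|_1 = \sum_{m=M+1}^{n} \sum_{l=1}^{m} [l(m-l+1)]^{-\gamma} + \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} [l(n-m+1)]^{-\gamma} \qquad (A.1)$$
$$\|c_M\|^2 = \sum_{m=M+1}^{n} \sum_{l=1}^{m} \frac{1}{[l(m-l+1)]^{2\gamma}} \qquad (A.8)$$
$$\|d\|^2 = \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} \frac{1}{[l(n-m+1)]^{2\gamma}} \qquad (A.9)$$
To find $\|d^{0.5}\|^2$ we use the partial sum of the harmonic series,
$$\sum_{m=1}^{n} \frac{1}{m} = \ln n + \gamma_E + \frac{1}{2n} - \frac{1}{12n^2} + \frac{1}{120n^4} + O(n^{-6}) \qquad (A.10)$$
so that
$$\|d^{0.5}\|^2 = \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} \frac{1}{l(n-m+1)}
= \sum_{m=1}^{n-1} \frac{1}{n-m+1} \left[ \sum_{l=1}^{n} \frac{1}{l} - \sum_{l=1}^{m} \frac{1}{l} \right]
\approx \sum_{m=1}^{n-1} \frac{1}{n-m+1} \left[ \ln(n/m) - \frac{n-m}{2nm} + \frac{n^2-m^2}{12n^2m^2} \right]. \qquad (A.11)$$
To find $\|c_M^{0.5}\|^2$ we first use partial fractions and then the
partial sum of the harmonic series:
$$\|c_M^{0.5}\|^2 = \sum_{m=M+1}^{n} \sum_{l=1}^{m} \frac{1}{l(m-l+1)}
= \sum_{m=M+1}^{n} \frac{1}{m+1} \sum_{l=1}^{m} \left[ \frac{1}{l} + \frac{1}{m-l+1} \right]
\approx \sum_{m=M+1}^{n} \frac{1}{m+1} \left[ \ln m + \gamma_E + \frac{1}{2m} - \frac{1}{12m^2} + \sum_{l=1}^{m} \frac{1}{l} \right]
\approx 2 \sum_{m=M+1}^{n} \frac{1}{m+1} \left[ \ln m + \gamma_E + \frac{1}{2m} - \frac{1}{12m^2} \right]. \qquad (A.12)$$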
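The two ingredients of this derivation, the partial-fraction identity $1/[l(m-l+1)] = \frac{1}{m+1}[1/l + 1/(m-l+1)]$ and the truncated expansion of the harmonic partial sum, can be checked numerically. The sketch below compares the exact double sum for $\|c_M^{0.5}\|^2$ with the closed-form approximation; the function names and the values of M and n are arbitrary choices for this example.

```python
import math

EULER_GAMMA = 0.5772156649015329  # Euler-Mascheroni constant gamma_E

def harmonic(m):
    """Exact partial sum of the harmonic series, H_m."""
    return sum(1.0 / l for l in range(1, m + 1))

def harmonic_approx(m):
    """Truncated expansion: ln m + gamma_E + 1/(2m) - 1/(12 m^2)."""
    return math.log(m) + EULER_GAMMA + 1.0 / (2 * m) - 1.0 / (12 * m * m)

def c_half_sq_exact(M, n):
    """Double sum over m = M+1..n, l = 1..m of 1 / (l (m - l + 1))."""
    return sum(1.0 / (l * (m - l + 1))
               for m in range(M + 1, n + 1)
               for l in range(1, m + 1))

def c_half_sq_approx(M, n):
    """Single sum 2 * sum_m harmonic_approx(m) / (m + 1)."""
    return 2.0 * sum(harmonic_approx(m) / (m + 1) for m in range(M + 1, n + 1))

exact = c_half_sq_exact(10, 200)
approx = c_half_sq_approx(10, 200)
```

The approximation replaces an $O(n^2)$ double sum by an $O(n)$ single sum, with an error that shrinks quickly as M grows because the neglected term of the expansion is of order $m^{-4}$.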
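subject to $C z^{\gamma} \succeq a$ $\qquad$ (B.1)
where we define $z^{\gamma} \triangleq [1, 1/2^{\gamma}, \ldots, 1/n^{\gamma}]^T$, and add a multiple of $C$ in order to keep it from getting too large since the
bounds (14)-(16) are all proportional to it. The constraint
is added to ensure all elements of the difference $C z^{\gamma} - a_i$
are positive such that (4) is true.
To remove the $\gamma$ component from the exponent, and
since all of the elements of $z$ and $a$ are positive and nonzero, we can instead solve the problem
$$\min_{C, \gamma} \|\ln C \, \mathbf{1} + \gamma \ln z - \ln a\|^2 + \lambda \ln C
= \min_{C, \gamma} \; n \ln^2 C + \gamma^2 \|\ln z\|^2 + \|\ln a\|^2 + 2\gamma \ln z^T (\ln C \, \mathbf{1} - \ln a) - 2 \ln C \ln a^T \mathbf{1} + \lambda \ln C \qquad (B.2)$$
subject to $C z^{\gamma} \succeq a$.
$$\gamma_o = \frac{\ln z^T (\ln a - \ln C \, \mathbf{1})}{\|\ln z\|^2} \qquad (B.3)$$
$$C_o = \exp \left[ \lambda + \frac{1}{n} \sum_{i=1}^{n} [\ln a - \gamma \ln z]_i \right] \qquad (B.4)$$
The factor $\lambda$ in effect controls the step size for convergence. A typical value we use is $\lambda = \pm 0.03$ based on
experiments (the sign of which depends on whether the objective function decreases with decreasing $C$).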
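For intuition about what the parameters $(C, \gamma)$ represent, the following sketch fits a bound of the form $C m^{-\gamma}$ to a set of sorted weight magnitudes. It is a simplification and not the procedure above: it uses an ordinary least-squares fit in the log domain to obtain $\gamma$, and then chooses the smallest $C$ that satisfies the constraint, instead of iterating with the step-size factor $\lambda$. The function name and the synthetic weights are assumptions.

```python
import numpy as np

def fit_decay_bound(a):
    """Fit (C, gamma) so that C * m**(-gamma) bounds the sorted magnitudes of a.

    Simplified sketch: least squares on ln|a_m| = ln C - gamma * ln m gives
    gamma; C is then raised to the smallest value meeting the constraint.
    Assumes all entries of a are nonzero.
    """
    mags = np.sort(np.abs(a))[::-1]               # decreasing magnitudes
    m = np.arange(1, mags.size + 1)
    X = np.column_stack([np.ones(mags.size), -np.log(m)])
    coef, *_ = np.linalg.lstsq(X, np.log(mags), rcond=None)
    gamma = coef[1]
    # Smallest C with C * m**(-gamma) >= mags for every m.
    C = np.exp(np.max(np.log(mags) + gamma * np.log(m)))
    return C, gamma

# Hypothetical weights: exact power law, and the same law perturbed by noise.
m = np.arange(1, 101)
clean = 0.8 * m ** (-0.6)
rng = np.random.default_rng(4)
noisy = clean * np.exp(0.1 * rng.standard_normal(m.size))

C_clean, gamma_clean = fit_decay_bound(clean)
C_noisy, gamma_noisy = fit_decay_bound(noisy)
```

Because every remainder bound is proportional to a power of $C$, a fit that keeps $C$ as small as the constraint allows gives the tightest bounds for a given $\gamma$.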
References
[1] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in
sequence databases, in: Proceedings of the International Conference of Foundations of Data Organization and Algorithms, Chicago,
IL, October 1993, pp. 69-84.
[2] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing of overcomplete dictionaries for sparse representation, IEEE
Transactions on Signal Processing 54 (11) (2006) 4311-4322.
[3] M. Casey, C. Rhodes, M. Slaney, Analysis of minimum distances in
high-dimensional musical spaces, IEEE Transactions on Audio,
Speech and Language Processing 16 (5) (2008) 1015-1028.
[4] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, M. Slaney,
Content-based music information retrieval: current directions and
future challenges, Proceedings of the IEEE 96 (4) (2008) 668-696.
[5] K. Chang, J.-S.R. Jang, C.S. Iliopoulos, Music genre classification via
compressive sampling, in: Proceedings of the International Society
for Music Information Retrieval, Amsterdam, The Netherlands,
August 2010, pp. 387-392.
[6] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by
basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1998)
33-61.
[7] S. Chu, S. Narayanan, C.-C.J. Kuo, Environmental sound recognition
with time-frequency audio features, IEEE Transactions on Audio,
Speech and Language Processing 17 (6) (2009) 1142-1158.
[8] C. Cotton, D.P.W. Ellis, Finding similar acoustic events using
matching pursuit and locality-sensitive hashing, in: Proceedings
of the IEEE Workshop on Applications of Signal Processing to Audio
and Acoustics, Mohonk, NY, October 2009, pp. 125-128.
[9] L. Daudet, Sparse and structured decompositions of signals with the
molecular matching pursuit, IEEE Transactions on Audio, Speech
and Language Processing 14 (5) (2006) 1808-1816.
[10] D.P.W. Ellis, G.E. Poliner, Identifying cover songs with chroma
features and dynamic programming beat tracking, in: Proceedings
of the International Conference on Acoustics, Speech and Signal
Processing, Honolulu, Hawaii, April 2007, pp. 1429-1432.
[11] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence
matching in time-series databases, in: Proceedings of the ACM
SIGMOD International Conference on Management of Data,
Minneapolis, MN, 1994, pp. 419-429.
[12] J. Gemmeke, L. ten Bosch, L. Boves, B. Cranen, Using sparse
representations for exemplar based continuous digit recognition,
in: Proceedings of the European Signal Processing Conference,
Glasgow, Scotland, August 2009, pp. 1755-1759.
[26] M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio
features for audio matching, in: IEEE Workshop on Applications of
Signal Processing to Audio and Acoustics, New Paltz, NY, October
2005, pp. 275-278.
[27] Y. Panagakis, C. Kotropoulos, G.R. Arce, Music genre classification
via sparse representations of auditory temporal modulations, in:
Proceedings of the European Signal Processing Conference,
Glasgow, Scotland, August 2009, pp. 1-5.
[28] Y. Pati, R. Rezaiifar, P. Krishnaprasad, Orthogonal matching
pursuit: Recursive function approximation with applications to
wavelet decomposition, in: Proceedings of the Asilomar Conference
on Signals, Systems, and Computers, vol. 1, Pacific Grove, CA,
November 1993, pp. 40-44.
[29] T.V. Pham, A. Smeulders, Sparse representation for coarse and fine
object recognition, IEEE Transactions on Pattern Analysis and
Machine Intelligence 28 (4) (2006) 555-567.
[30] D. Rafiei, A. Mendelzon, Efficient retrieval of similar time sequences
using DFT, in: Proceedings of the International Conference of
Foundations of Data Organization and Algorithms, Kobe, Japan,
November 1998, pp. 249-257.
[31] E. Ravelli, G. Richard, L. Daudet, Union of MDCT bases for audio
coding, IEEE Transactions on Audio, Speech and Language Processing 16 (8) (2008) 1361-1372.
[32] E. Ravelli, G. Richard, L. Daudet, Audio signal representations for
indexing in the transform domain, IEEE Transactions on Audio,
Speech and Language Processing 18 (3) (2010) 434-446.
[33] L. Rebollo-Neira, D. Lowe, Optimized orthogonal matching pursuit
approach, IEEE Signal Processing Letters 9 (4) (2002) 137-140.
[34] S. Scholler, H. Purwins, Sparse coding for drum sound classification
and its use as a similarity measure, in: Proceedings of the International Workshop on Machine Learning and Music, ACM Multimedia,
Firenze, Italy, October 2010.
[35] J. Serrà, E. Gómez, P. Herrera, X. Serra, Chroma binary similarity and
local alignment applied to cover song identification, IEEE Transactions on Audio, Speech and Language Processing 16 (August 2008)
1138-1151.