Signal Processing: Bob L. Sturm, Laurent Daudet


Signal Processing 91 (2011) 2836-2851

Contents lists available at ScienceDirect

Signal Processing
journal homepage: www.elsevier.com/locate/sigpro

Recursive nearest neighbor search in a sparse and multiscale domain for comparing audio signals
Bob L. Sturm a,*,#, Laurent Daudet b,##
a Department of Architecture, Design and Media Technology, Aalborg University Copenhagen, Lautrupvang 15, 2750 Ballerup, Denmark
b Institut Langevin (LOA), Université Paris Diderot - Paris 7, UMR 7587, 10, rue Vauquelin, 75231 Paris, France

Article info

Article history:
Received 5 January 2010
Received in revised form 15 February 2011
Accepted 2 March 2011
Available online 10 March 2011

Keywords:
Multiscale decomposition
Sparse approximation
Time-frequency dictionary
Audio similarity

Abstract

We investigate recursive nearest neighbor search in a sparse domain at the scale of audio signals. Essentially, to approximate the cosine distance between the signals we make pairwise comparisons between the elements of localized sparse models built from large and redundant multiscale dictionaries of time-frequency atoms. Theoretically, error bounds on these approximations provide efficient means for quickly reducing the search space to the nearest neighborhood of a given data; but we demonstrate here that the best bound defined thus far involving a probabilistic assumption does not provide a practical approach for comparing audio signals with respect to this distance measure. Our experiments show, however, that regardless of these non-discriminative bounds, we only need to make a few atom pair comparisons to reveal, e.g., the origin of an excerpted signal, or melodies with similar time-frequency structures.
© 2011 Elsevier B.V. All rights reserved.


1. Introduction
Sparse approximation is essentially the modeling of data with few terms from a large and typically overcomplete set of atoms, called a dictionary [24]. Consider an $x \in \mathbb{R}^K$, and a dictionary $D$ composed of $N$ unit-norm atoms in the same space, expressed in matrix form as $D \in \mathbb{R}^{K \times N}$, where $N \gg K$. A pursuit is an algorithm that decomposes $x$ in terms of $D$ such that $\|x - Ds\|^2 \le \epsilon$ for some error $\epsilon \ge 0$. (In this paper, we work in a Hilbert space unless otherwise noted.) When $D$ is overcomplete, $D$ has full row rank and there exists an infinite number of solutions to choose from, even for $\epsilon = 0$. Sparse approximation aims to find a solution $s$ that is mostly zeros for $\epsilon$ small. In that case, we say that $x$ is sparse in $D$.

* Corresponding author.
E-mail addresses: [email protected] (B.L. Sturm), [email protected] (L. Daudet).
# EURASIP # 7255.
## EURASIP # 2298.
0165-1684/$ - see front matter © 2011 Elsevier B.V. All rights reserved.
doi:10.1016/j.sigpro.2011.03.002

Matching pursuit (MP) is an iterative descent sparse approximation method based on greedy atom selection [17,24]. We express the nth-order model of the signal $x = H(n)a(n) + r(n)$, where $a(n)$ is a length-$n$ vector of weights, $H(n)$ are the $n$ corresponding columns of $D$, and $r(n)$ is the residual. MP augments the nth-order representation, $\mathcal{X}_n = \{H(n), a(n), r(n)\}$, according to

$$\mathcal{X}_{n+1} = \left\{ \begin{array}{l} H(n+1) = [H(n) \,|\, h_n] \\ a(n+1) = [a^T(n), \langle r(n), h_n \rangle]^T \\ r(n+1) = x - H(n+1)a(n+1) \end{array} \right\} \qquad (1)$$

using the atom selection criterion

$$h_n = \arg\min_{d \in D} \|r(n) - \langle r(n), d \rangle d\|^2 = \arg\max_{d \in D} |\langle r(n), d \rangle| \qquad (2)$$

where $\|d\| = 1$ is implicit. The inner product here is defined $\langle x, y \rangle \triangleq y^T x$. This criterion guarantees $\|r(n+1)\|^2 \le \|r(n)\|^2$
[24]. Other sparse approximation methods include orthogonal MP [28], orthogonal least squares (OLS) [41,33],
molecular methods [9,38,19], cyclic MP and OLS [36],
and minimizing error jointly with a relaxed sparsity
measure [6]. These approaches have higher computational complexities than MP, but can produce data models that are more sparse.
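The MP update (1) and selection rule (2) can be sketched in a few lines. The following is a minimal illustration only, assuming a tiny dictionary stored as a list of unit-norm Python lists and an exhaustive search over atoms; the names `dot`, `matching_pursuit`, and the toy dictionary are hypothetical and not taken from the paper. Practical implementations over large time-frequency dictionaries rely on fast transforms instead of explicit inner products.

```python
import math

def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def matching_pursuit(x, atoms, max_atoms, tol=1e-12):
    """Greedy MP of x over unit-norm `atoms` (a list of lists).

    Returns (indices, weights, residual): the selected atom indices,
    their weights a(n), and the final residual r(n).
    """
    residual = list(x)
    indices, weights = [], []
    for _ in range(max_atoms):
        # Selection criterion (2): maximize |<r(n), d>| over the dictionary.
        corrs = [dot(residual, d) for d in atoms]
        best = max(range(len(atoms)), key=lambda k: abs(corrs[k]))
        w = corrs[best]
        if abs(w) <= tol:
            break  # nothing left that correlates with the dictionary
        indices.append(best)
        weights.append(w)
        # Update (1): r(n+1) = x - H(n+1)a(n+1) = r(n) - <r(n), h_n> h_n.
        residual = [r - w * d for r, d in zip(residual, atoms[best])]
    return indices, weights, residual

# Toy overcomplete dictionary in R^2 (N = 3 atoms, K = 2); illustrative only.
s = 1.0 / math.sqrt(2.0)
atoms = [[1.0, 0.0], [0.0, 1.0], [s, s]]
x = [3.0, 1.0]
idx, w, r = matching_pursuit(x, atoms, max_atoms=5)
```

Because each step removes the projection of the residual onto one unit-norm atom, the energy of `x` equals the sum of the squared weights plus the residual energy, which is the energy conservation property of MP used later in the paper.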
Sparse approximation is data-adaptive and can produce parametric and multiscale models having features that function more like mid-level objects than low-level projections onto sets of vectors [9,19,38,8,22,27,32,34,43]. These aspects make sparse approximation a compelling complement to state-of-the-art approaches to, and applications of, comparing audio signals based upon, e.g., monoresolution cepstral and redundant time-frequency representations, such as fingerprinting [42], cover song identification [26,10,3,35], content segmentation, indexing, search and retrieval [16,4], and artist or genre classification [39].
In the literature we find some existing approaches to working with audio signals in a sparse domain. Features built from sparse approximations can provide competitive descriptors for music information retrieval tasks, such as beat tracking, chord recognition, and genre classification [40,32]. Sparse representation classifiers have been applied to music genre recognition [27,5], and robust speech recognition [12]. Parameters of sparse models can be compared using histograms to find similar sounds in acoustic environment recordings [7,8], or atoms can be learned to compare and group percussion sounds [34]. Biologically inspired sparse codings of correlograms of sounds can be used to learn associations between descriptive high-level keywords and audio features such that new sounds can be automatically categorized, and large collections of sounds can be queried in meaningful ways [22]. Outside the realm of audio signals, sparse approximation has been applied to face recognition [44], object recognition [29], and landmine detection [25].
In this paper, we discuss the comparison of audio signals in a sparse domain, but not specifically for fingerprinting or efficient audio indexing and search, two tasks that have been convincingly solved [42,13,16,18]. We explore the possibilities and effectiveness of comparing, atom-by-atom, audio signals modeled using sparse approximation and large overcomplete time-frequency dictionaries. Our contributions are threefold: (1) we generalize the recursive nearest-neighbor search algorithm to comparing subsequences [14,15]; (2) we show that though sparse models of audio signals can be compared by considering pairs of atoms, the best bound so far derived [14,15] does not make a practical procedure; and (3) we show experimentally that the hierarchic comparison of audio signals in a sparse domain still provides intriguing and informative results. Overall, our work here shows that a sparse domain can facilitate comparisons of audio signals in hierarchical ways through comparing individual elements of each sparse data model organized roughly in order of importance.
In the next two sections, we discuss and elaborate
upon a recursive method of nearest neighbor search in a
sparse domain [14,15]. We extend this method to comparing subsequences, and examine the practicality of
probabilistic bounds on the distances between neighbors.
In the fourth section, we describe several experiments
in which we compare a variety of audio signals through their sparse models. We conclude with a discussion about the results and several future directions.
2. Nearest neighbor search by recursion in a sparse domain
Consider a set of signals

$$\mathcal{Y} \triangleq \{ y_i \in \mathbb{R}^K : \|y_i\| = 1 \}_{i \in \mathcal{I}} \qquad (3)$$

where $\mathcal{I} = \{1, 2, \ldots\}$ indexes this set, and a query signal $x_q \in \mathbb{R}^K$, $\|x_q\| = 1$. Assume that we have generated sparse approximations for all of these signals $\hat{\mathcal{Y}} \triangleq \{ \{H_i(n_i), a_i(n_i), r_i(n_i)\} : y_i = H_i(n_i) a_i(n_i) + r_i(n_i) \}_{i \in \mathcal{I}}$ using a dictionary $D$ that spans the space $\mathbb{R}^K$, and giving the $n_q$-order representation $\{H_q(n_q), a_q(n_q), r_q(n_q)\}$ for $x_q$. Since $D$ spans $\mathbb{R}^K$, $D$ is complete, and any signal in $\mathbb{R}^K$ is compressible in $D$, meaning that we can order the representation weights in $a_i(n_i)$ or $a_q(n_q)$ in terms of decreasing magnitude, i.e.,

$$0 < |[a_i(n_i)]_{m+1}| \le |[a_i(n_i)]_m| \le C m^{-\gamma}, \quad m = 1, 2, \ldots, n_i - 1 \qquad (4)$$

for $n_i$ arbitrarily large, with $C > 0$, and where $[a]_m$ is the mth element of the column vector $a$. This can be seen in the magnitude representation weights in Fig. 1, which are weights of sparse representations of piano notes, described in Section 4.1. With MP and a complete dictionary, we are guaranteed $\gamma > 0$ because $\|r(n+1)\|^2 < \|r(n)\|^2$ for all $n$ [24].
Consider the Euclidean distance between two signals of the same dimension, which for unit-norm signals is a monotonic function of the cosine distance, since $\|y_i - x_q\|^2 = 2 - 2\langle x_q, y_i \rangle$. Thus, with respect to this distance, the $y_i \in \mathcal{Y}$ nearest to $x_q$ is given by solving

$$\min_{i \in \mathcal{I}} \|y_i - x_q\| \;\equiv\; \max_{i \in \mathcal{I}} \langle x_q, y_i \rangle \qquad (5)$$

We can express this inner product in terms of sparse representations

$$\langle x_q, y_i \rangle = \langle H_q(n_q) a_q(n_q) + r_q(n_q), \; H_i(n_i) a_i(n_i) + r_i(n_i) \rangle = a_i^T(n_i) H_i^T(n_i) H_q(n_q) a_q(n_q) + O[r_q, r_i] \qquad (6)$$

With a complete dictionary we can make $O[r_q, r_i]$ negligible by choosing $\epsilon$ arbitrarily small, so we can express (5) as

$$\max_{i \in \mathcal{I}} \langle x_q, y_i \rangle \approx \max_{i \in \mathcal{I}} a_i^T(n_i) G_{iq} a_q(n_q) = \max_{i \in \mathcal{I}} \sum_{m=1}^{n_i} \sum_{l=1}^{n_q} [A_{iq} \bullet G_{iq}]_{ml} \qquad (7)$$

where $[B \bullet C]_{ml} = [B]_{ml} [C]_{ml}$ is the Hadamard, or entrywise, product, $[B]_{ml}$ is the element of $B$ in the mth row of the lth column, $G_{iq} \triangleq H_i^T(n_i) H_q(n_q)$ is an $n_i \times n_q$ matrix with elements from the Gramian of the dictionary, i.e., $G \triangleq D^T D$, and finally we define the outer product of the weights

$$A_{iq} \triangleq a_i(n_i) a_q^T(n_q) \qquad (8)$$

Fig. 1. Gray lines show decays of representation weight magnitudes as a function of approximation order k for several decompositions of unit-norm signals (4-s recordings of single piano notes described in Section 4.1). Thick black line shows a global compressibility bound with its parameters.
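As a small check of how (7) and (8) rewrite the inner product, the sketch below builds two signals exactly from a few atoms (so both residuals are zero, an assumption made only for this illustration) and compares the time-domain inner product with the double sum over the Hadamard product. All names and numbers here are hypothetical.

```python
def dot(u, v):
    """Inner product of two equal-length sequences."""
    return sum(a * b for a, b in zip(u, v))

def synthesize(atoms, weights):
    """Return H a, the weighted sum of the given atoms (lists of length K)."""
    K = len(atoms[0])
    return [sum(w * h[k] for w, h in zip(weights, atoms)) for k in range(K)]

def hadamard_sum(atoms_i, a_i, atoms_q, a_q):
    """Sum over m, l of [A_iq . G_iq]_ml with A_iq = a_i a_q^T as in (8)
    and G_iq = H_i^T H_q, i.e. the right-hand side of (7)."""
    total = 0.0
    for h_m, a_m in zip(atoms_i, a_i):
        for h_l, a_l in zip(atoms_q, a_q):
            total += (a_m * a_l) * dot(h_m, h_l)
    return total

# Two toy models in R^3 with unit-norm, non-orthogonal atoms (illustrative).
H_i, a_i = [[1.0, 0.0, 0.0], [0.6, 0.8, 0.0]], [0.5, -0.2]
H_q, a_q = [[0.0, 1.0, 0.0], [0.6, 0.0, 0.8]], [0.3, 0.7]

y_i = synthesize(H_i, a_i)
x_q = synthesize(H_q, a_q)
direct = dot(x_q, y_i)                          # time-domain inner product
sparse = hadamard_sum(H_i, a_i, H_q, a_q)       # atom-pair form of (7)
```

With zero residuals the two values agree up to rounding; with nonzero residuals they differ by the $O[r_q, r_i]$ term in (6).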

2.1. Recursive search limited by bounds
Since we expect the decay of the magnitude of elements in $A_{iq} \bullet G_{iq}$ to be fastest in diagonal directions by (4), we define a recursive sum along the $M$ anti-diagonals starting at the top left:

$$S_{iq}(M) \triangleq S_{iq}(M-1) + \sum_{m=1}^{M} [A_{iq} \bullet G_{iq}]_{m(M-m+1)} \qquad (9)$$

for $M = 2, 3, \ldots, \min(n_i, n_q)$, and setting $S_{iq}(1) = [A_{iq} \bullet G_{iq}]_{11}$. With this we can express the argument of (7) as

$$\langle x_q, y_i \rangle \approx \sum_{m=1}^{n_i} \sum_{l=1}^{n_q} [A_{iq} \bullet G_{iq}]_{ml} = S_{iq}(M) + R(M) \qquad (10)$$

where at step $M$, we are comparing $M$ additional pairs of atoms to those considered in the previous steps. $R(M)$ is a remainder that we will bound. The total number of atom pairs contributing to $S_{iq}(M)$ (9) is

$$P(M) \triangleq \sum_{m=1}^{M} m = M(M+1)/2 \qquad (11)$$

The approach taken by Jost et al. [14,15] to find the nearest neighbors of $x_q$ in $\mathcal{Y}$ bounds the remainder $R(M)$ by compressibility (4). Assuming we have a positive upper bound on the remainder, i.e., $R(M) \le \tilde{R}(M)$, we know lower and upper bounds on the cosine distance $L_{iq}(M) \le \langle x_q, y_i \rangle \le U_{iq}(M)$, where

$$L_{iq}(M) \triangleq S_{iq}(M) - \tilde{R}(M) \qquad (12)$$

$$U_{iq}(M) \triangleq S_{iq}(M) + \tilde{R}(M) \qquad (13)$$

Finding elements of $\mathcal{Y}$ close to $x_q$ with respect to (5) can be done recursively over the approximation order $M$. For a given $M$, we find $\{S_{iq}(M)\}_{i \in \mathcal{I}}$, compute the remainder $\tilde{R}(M)$, and eliminate signals that are not sufficiently close to $x_q$ with respect to their cosine distance by comparing the bounds. This approach is similar to hierarchical ones, e.g., [21], where the features become more discriminable as the search process runs. (Also note that compressibility is similar to the argument made in justifying the truncation of Fourier series in early work on similarity search [1,11,30], i.e., that power spectral densities of many time-series decay like $O(|f|^{-b})$ with $b > 1$.)
Starting with $M = 1$, we compute the sets $\{L_{iq}(1)\}_{i \in \mathcal{I}}$ and $\{U_{iq}(1)\}_{i \in \mathcal{I}}$, that is, the first-order upper and lower bounds of the set of distances of $x_q$ from all signals in $\mathcal{Y}$. Then we find the index of the largest lower bound $i_{\max} = \arg\max_{i \in \mathcal{I}} L_{iq}(1)$, and reduce the search space to $\mathcal{I}_1 \triangleq \{ i \in \mathcal{I} : U_{iq}(1) \ge L_{i_{\max} q}(1) \}$, since all other data have an upper bound on their inner product with $x_q$ that is smaller than the greatest lower bound in the set. For the next step, we compute the sets $\{L_{iq}(2)\}_{i \in \mathcal{I}_1}$ and $\{U_{iq}(2)\}_{i \in \mathcal{I}_1}$, find the index of the maximum $i_{\max} = \arg\max_{i \in \mathcal{I}_1} L_{iq}(2)$, and construct the reduced set $\mathcal{I}_2 \triangleq \{ i \in \mathcal{I}_1 : U_{iq}(2) \ge L_{i_{\max} q}(2) \}$. Continuing in this way, we find the elements of $\mathcal{Y}$ closest to $x_q$ at each $M$ with respect to the cosine distance by recursing into the sparse approximations of the signals.

2.2. Bounding the remainder
To reduce the search space quickly we desire that (12) and (13) converge quickly to the neighborhood of $\langle x_q, y_i \rangle$, or in other words, that the bounds on the remainder quickly become discriminative. Jost et al. [14,15] derive three different bounds on $R(M)$. From the weakest to the strongest, these are:

1. $[G_{iq}]_{ml} = 1$ (worst case scenario, but impossible for $n > 1$)

$$R(M) \le C^2 \left( \|c_M^\gamma\|_1 + \|d^\gamma\|_1 \right) \qquad (14)$$

2. $[G_{iq}]_{ml} \sim$ iid Bernoulli(0.5), $\Omega = \{-1, 1\}$ (impossible for $n > 1$)

$$R(M) \le C^2 \sqrt{\ln 4} \left( \|c_M^\gamma\|_2^2 + \|d^\gamma\|_2^2 \right)^{1/2} \qquad (15)$$

3. $[G_{iq}]_{ml} \sim$ iid Uniform, $\Omega = [-1, 1]$,

$$R(M) \le C^2 \sqrt{2/3} \, \mathrm{Erf}^{-1}(p) \left( \|c_M^\gamma\|_2^2 + \|d^\gamma\|_2^2 \right)^{1/2} \qquad (16)$$

with probability $0 \le p \le 1$

where we define the following vectors for $n \triangleq \min(n_i, n_q)$ and $M = 2, \ldots, n$:

$$c_M^\gamma \triangleq \{ [l(m-l+1)]^{-\gamma} : m = M+1, \ldots, n; \; l = 1, \ldots, m \} \qquad (17)$$

$$d^\gamma \triangleq \{ [l(n-m+1)]^{-\gamma} : m = 1, \ldots, n-1; \; l = m+1, \ldots, n \} \qquad (18)$$

Appendix A gives derivations of these bounds, as well as the efficient computation of (16) for the special case of $\gamma = 0.5$. The parameters $(C, \gamma)$ describe the compressibility of the signals in the dictionary (4). The bounds of (15) and (16) are much more discriminative than (14) because they involve an $\ell_2$-norm at the price of uncertainty in the bound. The bound in (16) is attractive because we can tune it with the parameter $p$, which is the probability that the remainder will not exceed the bound. Fig. 2 shows bounds based on (16) for several pairs of compressibility parameters for the dataset used to produce Fig. 1.
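The recursion of Section 2.1 and the bound (16) can be sketched as follows. This is an illustration under stated assumptions, not the authors' implementation: the local Gramian is passed as a nested list `G`, the inverse error function is obtained from the standard normal quantile in Python's `statistics` module (valid for 0 <= p < 1), and the function names are hypothetical. The efficient evaluation of (16) described in Appendix A is not reproduced; the norms are summed directly.

```python
import math
from statistics import NormalDist

def pairs_considered(M):
    """P(M) in (11): atom pairs on the first M anti-diagonals."""
    return M * (M + 1) // 2

def antidiagonal_sums(a_i, a_q, G):
    """Partial sums S_iq(1), ..., S_iq(n) of (9), with n = min(n_i, n_q).
    G[m][l] holds the inner product of atom m of y_i and atom l of x_q."""
    n = min(len(a_i), len(a_q))
    sums, running = [], 0.0
    for M in range(1, n + 1):
        # 1-based entries (m, M - m + 1) of the Hadamard product A_iq . G_iq.
        running += sum(a_i[m - 1] * a_q[M - m] * G[m - 1][M - m]
                       for m in range(1, M + 1))
        sums.append(running)
    return sums

def erfinv(p):
    """Inverse error function for 0 <= p < 1 via the normal quantile."""
    return NormalDist().inv_cdf((p + 1.0) / 2.0) / math.sqrt(2.0)

def remainder_bound(M, n, C, gamma, p):
    """Bound (16), summing the squared entries of c_M (17) and d (18)."""
    c_sq = sum((l * (m - l + 1)) ** (-2.0 * gamma)
               for m in range(M + 1, n + 1) for l in range(1, m + 1))
    d_sq = sum((l * (n - m + 1)) ** (-2.0 * gamma)
               for m in range(1, n) for l in range(m + 1, n + 1))
    return C ** 2 * math.sqrt(2.0 / 3.0) * erfinv(p) * math.sqrt(c_sq + d_sq)

def prune(S, bound):
    """Keep indices i whose upper bound (13) is at least the largest
    lower bound (12); S maps each index i to S_iq(M)."""
    best_lower = max(s - bound for s in S.values())
    return {i for i, s in S.items() if s + bound >= best_lower}

# Illustrative values only.
sums = antidiagonal_sums([1.0, 0.5], [1.0, 0.5], [[1.0, 0.2], [0.3, 1.0]])
kept = prune({1: 0.9, 2: 0.5, 3: 0.1}, bound=0.3)
```

In a full search, `prune` would be called once per step $M$ with the bound `remainder_bound(M, n, C, gamma, p)`, and only the surviving indices would have their sums extended to step $M + 1$.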
2.3. Estimating the compressibility parameters
The bounds (14)-(16), and consequently the number of atom pairs we must consider before the bounds become discriminable, depend on the compressibility parameters, $(C, \gamma)$, which themselves depend in a complex way on the signal, the dictionary, and the method of sparse approximation. Fig. 3 shows the error surface, feasible set, and the optimal parameters for the dataset used to produce Fig. 1. We describe our parameter estimation procedure in Appendix B. The resulting bound is shown in black in Fig. 1. These compressibility parameters also agree with those seen in Fig. 2.

Fig. 2. Estimated remainder (assuming unit-norm signals) using bound in (16) with p = 0.2 (probability that remainder does not exceed bound) and n = 100 (number of elements in each sparse model) as a function of the number of atom pairs already considered for several pairs of compressibility parameters $(C, \gamma)$ estimated from the dataset used to produce Fig. 1. [Axes: atom pairs P(M) versus estimated remainder $\tilde{R}(M)$. Curve labels $(C, \gamma)$: (7.53, 1.10), (4.75, 1.00), (3.00, 0.90), (1.89, 0.80), (1.19, 0.70), (0.78, 0.60), (0.72, 0.55), (0.69, 0.50).]

Fig. 3. Error surface as a function of the compressibility parameters for the dataset used to produce Fig. 1, with the feasible set shaded at top left, and optimal parameters marked by a circle.

3. Subsequence nearest neighbor search in a localized sparse domain
The recursive nearest neighbor search so far described has the obvious limitation that it cannot be applied to comparing subsequences of large data vectors, as is natural for comparing audio signals. Thus, we must adapt its structure to work for comparing subsequences in a set of data

$$\mathcal{Y} \triangleq \{ y_i \in \mathbb{R}^{N_i} : N_i \ge K \}_{i \in \mathcal{I}} \qquad (19)$$

(note that now we do not restrict the norms of these signals). We can create from the elements of $\mathcal{Y}$ a new set of all subsequences having the same length as a K-dimensional query $x_q$ ($K < N_i$):

$$\mathcal{Y}_K \triangleq \{ P_t y_i / \|P_t y_i\| : t \in \mathcal{T}_i = \{1, 2, \ldots, N_i - K + 1\}, \; y_i \in \mathcal{Y} \} \qquad (20)$$

where $P_t$ extracts a K-length subsequence in $y_i$ starting at time-index $t$ (it is an identity matrix of size $K$ starting at column $t$ in a $K \times N_i$ matrix of zeros). The set $\mathcal{T}_i$ are times at which we create length-K subsequences from $y_i$. If we decompose each of these by sparse approximation, then we can use the framework in the previous section. However, sparse approximation is an expensive operation that we want to do only once for the entire signal, and independent of the length of $x_q$.
To address this problem, we instead approximate each element in $\mathcal{Y}_K$ by building local sparse representations from the global sparse approximations of each $y_i$, and then calculating their distance to $x_q$ using the framework in the previous section. From here on we consider only the K-length subsequences of a single element $y_i \in \mathcal{Y}$ without loss of generality (i.e., all other elements of $\mathcal{Y}$ can be included as subsequences). Toward this end, consider that we have decomposed the $N_i$-length signal $y_i$ using a complete dictionary to produce the representation $\{H_i(n_i), a_i(n_i), r_i(n_i)\}$. From this we construct the local sparse representations of $y_i$:

$$\hat{\mathcal{Y}}_K \triangleq \{ \{P_t H_i(n_i), \xi_t a_i(n_i), P_t r_i(n_i)\} : t \in \mathcal{T}_i \} \qquad (21)$$

where the time partition $\mathcal{T}_i$ is the set of all times at which we extract a K-length subsequence from $y_i$, and $\xi_t$ is set such that $\|\xi_t P_t y_i\| = 1$, i.e., each length-K subsequence is unit-norm. For each K-dimensional subsequence, (7) now becomes

$$\max_{t \in \mathcal{T}_i} \langle x_q, P_t y_i \rangle = \max_{t \in \mathcal{T}_i} \langle H_q(n_q) a_q(n_q), \xi_t P_t H_i(n_i) a_i(n_i) \rangle + O[r_q, r_i] \approx \max_{t \in \mathcal{T}_i} \xi_t a_i^T(n_i) H_i^T(n_i) P_t^T H_q(n_q) a_q(n_q) = \max_{t \in \mathcal{T}_i} \xi_t \sum_{m=1}^{n_i} \sum_{l=1}^{n_q} [A_{iq} \bullet G_{iq}(t)]_{ml} \qquad (22)$$

where $A_{iq}$ is defined in (8), we define the time-localized Gramian

$$G_{iq}(t) \triangleq H_i^T(n_i) P_t^T H_q(n_q) \qquad (23)$$

and we have excluded the terms involving the residuals because we can make them arbitrarily small.

3.1. Estimating the localized energy
The only thing left to do is to find an expression for $\xi_t$ so that each subsequence is comparable with the others with respect to the cosine distance. We assume that the localized energy can be approximated from the local sparse representation in the following way assuming $\|P_t y_i\| > 0$

$$\xi_t^{-1} = \|P_t y_i\| \approx \sqrt{ a_i^T(n_i) H_i^T(n_i) P_t^T P_t H_i(n_i) a_i(n_i) } \approx \sqrt{ \sum_{j=1}^{n_t} w_j a_j^2 } \qquad (24)$$

where the $n_t$ weights $a_j \in \{ [a_i(n_i)]_m : [H_i^T(n_i) P_t^T P_t H_i(n_i)]_{ml} \ne 0, \; 1 \le m, l \le n_i \}$ are those associated with atoms having support in $[t, t+K)$, and $w_j$ we define to weigh the contribution of $a_j^2$ to the localized energy estimate. We set $\xi_t = 0$ if $\sum_{j=1}^{n_t} a_j^2 = 0$.
If all atoms contributing to the subsequence have their entire support in $[t, t+K)$, and are orthonormal, then we can set each $w_j = 1$. This does not hold for subsequences of a signal decomposed using an overcomplete dictionary, as shown by Fig. 4. For much of the time we see $\sum_{j=1}^{n_t} a_j^2 \ge \|P_t y_i\|^2$, which means our localized estimate of the segment energy is greater than its real value. This will make $\xi_t$ and consequently (22) smaller.
Instead, we make a more reasonable estimate of $\|P_t y_i\|$ by accounting for the fact that atoms can have support outside $[t, t+K)$. For instance, if an atom has some fraction of support in the subsequence we multiply its weight by that fraction. We thus weigh the contribution of the jth atom to the subsequence norm using

$$w_j = \begin{cases} 1, & u_j \ge t, \; u_j + s_j \le t + K \\ (K/s_j)^2, & u_j < t, \; u_j + s_j \ge t + K \\ (u_j + s_j - t)^2 / s_j^2, & u_j < t, \; t < u_j + s_j \le t + K \\ (t + K - u_j)^2 / s_j^2, & t \le u_j < t + K, \; u_j + s_j > t + K \end{cases} \qquad (25)$$

where $u_j$ and $s_j$ are the position and scale, respectively, of the atom associated with the weight $a_j$. In other words, if an atom is completely in $[t, t+K)$, it contributes all of its energy to the approximation; otherwise, it contributes only a fraction based on how its support intersects $[t, t+K)$. With this we are now slightly underestimating the localized energies, as seen in Fig. 4. In both of these cases for $\{w_j\}$, however, we can assume by the energy conservation of MP [24] that as the subsequence length becomes larger our error in estimating the subsequence energy goes to zero, i.e.,

$$\lim_{K \to N_i} \|P_t y_i\|^2 - \sum_{j=1}^{n_t} w_j a_j^2 = \|P_t r_i(n_i)\|^2 \qquad (26)$$

With a complete dictionary, we can make the right hand side zero. Significant departures from the energy estimate of subsequences can be due to the interactions between atoms [37].
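The weighting (25) and the resulting scale factor in (24) can be sketched as below. This is a hedged illustration: atoms are represented as hypothetical `(weight, position, scale)` triples, atoms with no overlap with the window are given zero weight (the paper simply excludes them from the $n_t$ local atoms), and the scale factor is set to zero when the weighted energy vanishes, which is a slight variation on the condition stated in the text.

```python
import math

def atom_weight(u, s, t, K):
    """Weight w_j of (25) for an atom at position u with scale s,
    relative to the analysis window [t, t + K)."""
    end = u + s
    if end <= t or u >= t + K:
        return 0.0                      # no overlap (assumption: excluded)
    if u >= t and end <= t + K:
        return 1.0                      # atom entirely inside the window
    if u < t and end >= t + K:
        return (K / s) ** 2             # atom covers the whole window
    if u < t:
        return ((end - t) / s) ** 2     # atom overlaps the window start
    return ((t + K - u) / s) ** 2       # atom overlaps the window end

def localized_scale(atoms, t, K):
    """xi_t from (24): reciprocal of the estimated subsequence norm.
    `atoms` is a list of (weight, position, scale) triples."""
    energy = sum(atom_weight(u, s, t, K) * a * a for a, u, s in atoms)
    return 1.0 / math.sqrt(energy) if energy > 0.0 else 0.0

# Illustrative window of K = 100 samples starting at t = 0.
xi = localized_scale([(2.0, 10, 5), (1.0, 95, 10)], t=0, K=100)
```

Setting every weight to 1 instead reproduces the first estimate discussed above, which tends to overestimate the local energy when long atoms extend beyond the window.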

Fig. 4. Short-term energy ratios, $\log_{10} \left( \sum_{j=1}^{n_t} w_j a_j^2 / \|P_t y_i\|^2 \right)$, over 1 s windows (hopped 125 ms) for MP decompositions using 8xMDCT [31] to a global residual energy 30 dB below the initial signal energy. Arrow points to line (top, gray) using weighting $w_j = 1$. The other line (bottom, black) uses (25). Data in (a) are described in Section 4.2; data in (b) are described in Section 4.3. (a) Six speech signals (0-20 s), music excerpt (21-23 s), realization of GWN (24-27 s) and (b) music orchestra.

3.2. Recursive subsequence search limited by bounds
Now, similar to (9) and (10), we can say,

$$\langle x_q, P_t y_i \rangle \approx \xi_t \sum_{m=1}^{n_t} \sum_{l=1}^{n_q} [A_{iq}(t) \bullet G_{iq}(t)]_{ml} = S_{iq}(t, M) + R(t, M) \qquad (27)$$

where for $M = 2, 3, \ldots, \min(n_i, n_q)$, and with $S_{iq}(t, 1) = \xi_t [A_{iq}(t) \bullet G_{iq}(t)]_{11}$,

$$S_{iq}(t, M) \triangleq S_{iq}(t, M-1) + \xi_t \sum_{m=1}^{M} [A_{iq}(t) \bullet G_{iq}(t)]_{m(M-m+1)} \qquad (28)$$

The problem of finding the subsequence closest to $x_q$ with respect to the cosine distance can now be done iteratively over $M$ by bounding each remainder $R(t, M)$ using (14), (15), or (16), and the method presented in Section 2.1. Furthermore, we can compare only a subset of all possible subsequences using a coarse time partition $\mathcal{T}_i$.

3.3. Practicality of the bounds for audio signals
The experiments by Jost et al. [14,15] use small images (128 square) and orthogonal wavelet decompositions, which do not translate to audio signals decomposed over redundant time-frequency dictionaries. Jost et al. [14,15] do not state the compressibility parameters they use, but for the high-dimensional audio signals with which we work in this paper it is not unusual to have $\gamma \approx 0.5$ when using MP and highly overcomplete dictionaries. We find that decomposing 4 s segments of music signals (single channel, 44.1 kHz sample rate, representation weights shown in Fig. 1) using the dictionary in Table 1 requires on average 2375 atoms to reduce the residual energy 20 dB below the initial signal energy. Thus, for the bound (16) using $n = 2375$ atoms, and with the parameters $(C, \gamma) = (0.4785, 0.5)$ (in the feasible set), Fig. 5 clearly shows that in order to have any discriminable bound (say $\pm 0.2$ for unit-norm signals) we must either select a low value for $p$, in which case we are assuming the first atom comparison is approximately the cosine distance, or we must make over a million pairwise comparisons. This is not practical for signals of large dimension, and dictionaries containing billions of time-frequency atoms. There is no possibility of tabulating the dictionary Gramian for quick lookup of atom pair correlations; and the cost of looking up atoms in million-atom decompositions is expensive as well. It is clear then that the tightest bound given in (16) is not practical for efficiently discriminating distances between audio signals with respect to their cosine distance (5) decomposed by MP and time-frequency dictionaries.

Table 1
Time-frequency dictionary parameters (44.1 kHz sampling rate): atom scale s, time resolution Δu, and frequency resolution Δf. Finer frequency resolution for small-scale atoms is achieved with interpolation by zero-padding.

s (samples/ms)    Δu (samples/ms)    Δf (Hz)
128/3             32/0.7             43.1
256/6             64/2               43.1
512/12            128/3              43.1
1024/23           256/6              43.1
2048/46           512/12             21.5
4096/93           1024/23            10.8
8192/186          2048/46            5.4
16,384/372        4096/93            2.7
32,768/743        8192/186           1.3
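To put the figure of a million pairwise comparisons in Section 3.3 in terms of the recursion depth, a short worked calculation using only the pair count (11) may help. It assumes, for illustration, that both models hold the quoted average of $n = 2375$ atoms.

```latex
P(M) = \frac{M(M+1)}{2} \ge 10^{6}
\;\Longrightarrow\; M \ge 1414,
\qquad P(1413) = 998{,}991, \quad P(1414) = 1{,}000{,}405,
\qquad n^{2} = 2375^{2} = 5{,}640{,}625 .
```

Under that assumption, a million atom pairs corresponds to recursing through roughly the first 1414 anti-diagonals, i.e. well over half of the 2375 atoms in each model, out of about 5.6 million possible pairs.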

Fig. 5. Estimated remainder (assuming unit-norm signals) as a function of the number of atom pairs already considered for dataset used to produce Fig. 1. Gray: bound in (15). Black, numbered: bound in (16) for several labeled p (probability that remainder does not exceed bound) with n = 2375 (number of elements in each sparse model), and $(C, \gamma) = (0.4785, 0.5)$.

4. Experiments in comparing audio signals in a sparse domain
Though approximate nearest neighbor subsequence search of sparsely approximated audio signals with the bound (16) is impractical, we have found that approximating the cosine distance in a sparse domain has some intriguing behaviors. We now present several experiments where we compare different types of audio data in a sparse domain under a variety of conditions. All signals are single channel, and have a sampling rate of 44.1 kHz. We decompose each by MP [17] using either the dictionary in Table 1, or the 8xMDCT dictionary [31].
4.1. Experiment 1: comparing piano notes
In this experiment, we look at how well low-order sparse approximations of sampled piano notes embody their harmonic characteristics by comparing them using the methods presented in Section 2. The data in set A are 68 notes (chromatically spanning A0 to G#6) on a real and somewhat in-tune piano; and the data in set B are 39 notes (roughly a C major scale C0 to D6) on a real and very out-of-tune piano with very poor recording conditions. We truncate all signals to have a dimension of 176,400 (4 s), and decompose each by MP [17] over a redundant dictionary of time-frequency Gabor atoms, the parameters of which are summarized in Table 1. We stop each decomposition once its residual energy drops 40 dB below the initial energy. We normalize the weights of each model by the square root energy of the respective signal. We do not align the time-domain signals such that the note onsets occur at the same time. Fig. 1 shows the ordered decays of the weights in the sparse models of data set A.
Fig. 6(a) shows the magnitude correlations between all pairs of signals in set A evaluated in the time-domain. The overtone series is clear as diagonal lines offset at 12, 19, 24, and 28 semitones from the main diagonal. Fig. 6(b) shows the approximated magnitude correlations (9) using only M = 10 atoms from each signal approximation (thus P(10) = 55 atom pairs). Even though the mean number of atoms in this set of models is about 7000 we still see portions of the overtone series. The diagonal in Fig. 6(b) does not have a uniform color because low notes have longer sustain times than high notes, and the sparse models thus have more time-frequency atoms with greater energies spread over the support of the signal.
Fig. 6(c) shows the magnitude correlations between sets B and A evaluated in the time-domain; and Fig. 6(d) shows the magnitude correlations (9) using only M = 10 atoms from each model. In a sparse domain, we can more clearly see the relationships between the two sets because the first 10 terms of each model are most likely related to the stable harmonics of the notes, and not to the noise. We can see a diatonic scale starting around MIDI number 36, as well as the fact that the pitch scale in data set B lies somewhere in-between the semitones in data set A.

Fig. 6. $|S_{iq}(10)|$ (9) for two sets of recorded piano notes. (a) and (b): Set A compared with itself in time and sparse domains (M = 10). (c) and (d): Set B (rows) compared with set A (columns) in time and sparse domains (M = 10). Elements on main diagonals in (a) and (b) are scaled by 0.25 to increase contrast of other elements. (a) A: Time-domain magnitude correlations, (b) A: Sparse-domain approximations of magnitude correlations, (c) B-A: Time-domain magnitude correlations and (d) B-A: Sparse-domain approximations of magnitude correlations.
Fig. 7(a) shows the approximate magnitude correlations $|S_{iq}(M)|$ (9), as well as the upper and lower bounds on the remainder using the tightest bound (16) with p = 0.2, and n = 100, for the signal A3 from set A and the rest of the set. Here we can see that the lower bound for the largest magnitude correlation exceeds the upper bound of all the rest after comparing only M = 19 atoms from each decomposition. All but five of the signals can be excluded from the search after M = 6. The four other signals having the largest approximate magnitude correlation are labeled, and are harmonically related to the signal through its overtone series. With a signal selected from set B and compared to set A, Fig. 7(b) shows that we must compare many more atoms between the models until the bounds have any discriminability. After $P(M) \approx 1500$ atom comparisons we can see that the largest magnitudes $|S_{iq}(M)|$ (9) are roughly harmonically related to the detuned D5 from set B.

Fig. 7. Black: $|S_{iq}(M)|$ (9) as a function of the number of atom pairs considered for the set of piano notes in A with a signal from either (a) A (note A3) or (b) B (note D5 approximately). Gray: for each $S_{iq}(M)$, magnitudes of $L_{iq}(M)$ (12) and $U_{iq}(M)$ (13) using bound in (16) with p = 0.2 (probability that remainder does not exceed bound), and n = 100 (number of elements in each sparse model). Largest magnitude correlations are labeled. Note differences in axes. (a) Signal A3 from A with $(C, \gamma) = (0.78, 0.60)$ and (b) Signal D5 from B with $(C, \gamma) = (1.17, 0.70)$.

As a final part of this experiment, we look at the effects of comparing atoms with parameters within some subset. As done in Fig. 6(d), we compare the sparse approximations of two different sets of piano notes, but here we only consider those atoms that have scales greater than 186 ms. This in effect means that we look for signals that share the same long-term time-frequency behaviors. The resulting $|S_{iq}(10)|$ (9) is shown in Fig. 8. We see correlations between the notes much more clearly compared with Fig. 6(d). Removing the short-term phenomena improves tonal-level comparisons between the signals because non-overlapping yet energetic short atoms are replaced by atoms representative of the note harmonics.

Fig. 8. $|S_{iq}(10)|$ (9) for two sets of recorded piano notes in a sparse domain using only the atoms with duration at least 186 ms. Compare with Fig. 6(d).

4.2. Experiment 2: comparing speech signals


In this experiment, we test how efciently using (28)
we can nd in a speech signal the time from which we
extract some xq . We also test how distortion in the query
affects these results. We make a signal by combining six
segments of speech, a short music segment, and white
noise, shown in Fig. 4(a). The six speech segments are the
same phrase spoken by three females and three males:
Cottage cheese with chives is delicious. We extract from
one of these speech signals the word cheese, to create
xq with duration of 603 ms, shown at top in Fig. 9. We
decompose this signal using MP and the 8xMDCT dictionary [31].
We distort the query in two ways: with additive WGN
(AWGN), and with an interfering sound having a high
correlation with the dictionary. In the rst case, shown in
the middle in Fig. 9, the signal xq 0 axq n=Jaxq nJ is
the original xq distorted by a unit-norm AWGN signal n.
We set a 0:3162 such that 10log10 Jaxq J2 =JnJ2
20log10 jaj 10 dB. For this signal, we nd the following statistics from 10,000 realizations of the AWGN
signal: Ej/xq ,nSj  1  105 , Varj/xq ,nSj  4  106 .
We also nd the following statistics for the 8xMDCT

Frequency (kHz)

8
6
4
2

Frequency (kHz)

8
6
4
2

Frequency (kHz)

8
6
4
2
0

0.1

0.2

0.3

0.4

0.5

Time (s)
Fig. 9. Log spectrograms of the query signals with which we search. Top: query of male saying cheese. Middle: query distorted with additive white
Gaussian noise (AWGN) with SNR  10 dB. Bottom: query distorted with interfering crow sound with SNR  5 dB.

B.L. Sturm, L. Daudet / Signal Processing 91 (2011) 28362851

dictionary: Emaxd2D j/n,dSj  5  104 , Varmaxd2D


j/n,dSj  2  105 . Thus, the noise signal is not well
correlated either with the original signal or the 8xMDCT
dictionary. In the second case, shown at the bottom of
Fig. 9, we distort the signal by adding the sound of a crow
c so that xq 0 axq c=Jaxq cJ2 with JcJ 1. Here, we
set a 0:5623 given by 20log10 jaj 5 dB. For this
interfering signal, we nd that j/xq ,cSj  2  103 , but
maxd2D j/c,dSj  0:21, which is higher than maxd2D
j/xq ,dSj  0:17. In this case, unlike for the AWGN interference, it is likely that the sparse approximation of the
signal with the crow interference will have atoms in its
low-order model due to the crow and not the speech. We
do not expect the AWGN interference to be a part of the
signal model created by MP until much later iterations.
Fig. 10 shows $|S_{iq}(t, M)|$ (28) aligned with the original signal for four values of $M$ using the sparse approximations of the clean and distorted signals. We plot at the rear of these figures the localized magnitude time-domain correlation of the windowed and normalized signal with the query $x_q$. In Fig. 10(a), using the clean $x_q$, we clearly see its position even when using a single atom pair for each 100 ms partition of the time-domain. We see the same behavior in Fig. 10(b) and (c) for the two distorted signals, but in the case where the crow sound interferes we find the query only for $M \ge 2$, that is, with at least three atom pair comparisons. The first atom of the decomposed query with the crow is modeling the crow and not the content of interest, and so we must increase the order of the model to find the location of $x_q$. As we increase the number of pairs considered we also find other segments that point in the same direction as $x_q$. Table 2 gives the times and content of the ten largest values in $|S_{iq}(t, 10)|$. For the clean and AWGN-distorted $x_q$, "cheese" appears five of the six times it exists in the original signal. Curiously, these same five instances are the five largest magnitude correlations when $x_q$ has the crow interference.
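The localized magnitude correlation used as the reference curve can be sketched directly in the time domain. This is a minimal sketch, assuming NumPy and a rectangular window of the same length as the query; the function name, hop size, and synthetic data are illustrative assumptions only.

```python
import numpy as np

def localized_correlation(y, x_q, hop):
    """|<x_q, y_t>| for unit-norm segments y_t of len(x_q), taken every `hop` samples."""
    x = x_q / np.linalg.norm(x_q)
    n = x.size
    starts = np.arange(0, y.size - n + 1, hop)
    corr = np.zeros(starts.size)
    for k, s in enumerate(starts):
        segment = y[s:s + n]
        norm = np.linalg.norm(segment)
        if norm > 0.0:                      # skip silent segments
            corr[k] = abs(segment @ x) / norm
    return starts, corr

rng = np.random.default_rng(1)
y = 0.1 * rng.standard_normal(8000)         # background signal
x_q = rng.standard_normal(400)              # stand-in query
y[3000:3400] += x_q                          # embed the query at sample 3000

starts, corr = localized_correlation(y, x_q, hop=100)
best_start = int(starts[np.argmax(corr)])
```

Because both the query and each segment are normalized, every value is a magnitude cosine similarity between 0 and 1, and the largest one falls at the embedding position.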
We perform the same test as above but using a much longer speech signal (about 21 minutes in length) excerpted from a book-on-CD, The Old Man and the Sea by Ernest Hemingway, read aloud by a single person. From this signal we create several queries $x_q$, from words to sentences to an entire paragraph of duration 35 s. We decompose each signal over the dictionary in Table 1 until the global residual energy is 20 dB below the initial energy. The approximation of the entire 21 min signal has 1,004,001 atoms selected from a dictionary containing 2,194,730,297 atoms.

Fig. 10. $|S_{iq}(t, M)|$ (28) as a function of time and the number of atoms $M$ (labeled at right) considered from each representation for each localized sparse approximation. Localized magnitude correlation of each signal with the query is shown by the thin black line in front of the gray time-domain signal at rear. (a) Clean signal, (b) signal with AWGN at 10 dB energy and (c) signal with crow signal at 5 dB energy.

Table 2
Times, values and signal content for the first 10 peaks in $|S_{iq}(t, 10)|$ ($P(10) = 55$) in Figs. 10(a)-(c). The highest-rated value in each (rank 1) points to the origin of the signal.

      Clean signal                Signal + WGN                Signal + crow
 #    t (s)  |S_iq|  Content     t (s)  |S_iq|  Content     t (s)  |S_iq|  Content
 1    10.0   0.798   cheese      10.0   0.236   cheese      10.0   0.409   cheese
 2    13.6   0.199   cheese      13.6   0.080   cheese      13.6   0.060   cheese
 3    11.3   0.153   -ives is-   15.1   0.051   delicious   16.9   0.030   cheese
 4    16.9   0.149   cheese      11.3   0.045   -ives is-    6.9   0.012   cheese
 5    15.1   0.141   delicious   16.9   0.042   cheese       1.3   0.011   cheese
 6    18.3   0.076   delicious   18.3   0.028   delicious   18.3   0.010   delicious
 7     1.3   0.057   cheese       8.1   0.014   delicious   13.2   0.010   cottage
 8     8.1   0.035   delicious   12.0   0.012   -licious    15.1   0.009   delicious
 9     2.4   0.026   delicious    5.2   0.011   delicious   16.0   0.004   cott-
10     6.9   0.024   cheese       6.8   0.010   cheese      22.8   0.003   WGN
One $x_q$ we extract from the signal is the spoken phrase, "the old man said" (861 ms in length). This phrase appears 26 times in the long excerpt. We evaluate $|S_{iq}(t, M)|$ (28) every 116 ms, and find the time at which $x_q$ originally appears using only $M = 1$ atom pair comparison for each time partition. The next highest ranked positions have values of 75% and 67% that of the largest $|S_{iq}(t, 1)|$. When $M = 50$, the second and third largest values of $|S_{iq}(t, 50)|$ drop to 62% and 61% that of the largest value. In the top 30 ranked subsequences for $M = 5$ we find only one of the other 25 appearances of "the old man said" (rank 26); but we also find "the old man agreed" (rank 11), and "the old man carried" (rank 16). All other results have minimal content similarity to the signal, but have time-frequency overlap in parts of the atoms of each model.
We perform the same test with a sentence extracted from the signal, "They were as old as erosions in a fishless desert" (2.87 s), which only appears once. No matter the $M \in [1, 50]$ we use, the origin of the excerpt remains at a rank of 6 with a value $|S_{iq}(t, 50)|$ at 67.5% that of the highest ranked subsequence. We find that if we shift the time partition forward by 11.6 ms its ranking jumps to first, with the second ranked subsequence at 73%. We observe a similar effect for a query consisting of an entire paragraph (35 s). We find its origin by comparing $M = 2$ or more atoms from each model using a time partition of 116 ms. This result, however, disappears when we evaluate $|S_{iq}(t, M)|$ using a coarser time partition of 250 ms.
4.3. Experiment 3: comparing music signals
While the previous experiment deals with single-channel speech signals, in this experiment we make comparisons between polyphonic musical signals excerpted from a commercial recording of the fourth movement of Symphonie Fantastique by H. Berlioz. For the query, we use a 10.31 s segment of the third appearance of the A theme of the movement (bars 33–39, located around 13–22 s in Fig. 4(b)). Fig. 11 shows the sonogram and time-frequency tiles of the model of $x_q$ using the 50 atoms with the largest magnitude weights selected from the 8xMDCT dictionary [31]. We add no interfering signals as we do in the previous experiment.
Fig. 12(a) shows $|S_{iq}(t, M)|$ (28) over the first minute of the original signal, for three values of $M$, including $M = 50$, the time-frequency representation of which is shown at bottom of Fig. 11. For $|S_{iq}(t, 50)|$ we can see a strong spike located around 13 s corresponding with the query, but we also see spikes at about 2 s and around 43 s. The former set of spikes corresponds with the second appearance of the A theme, when only low bowed strings are playing the theme in G-minor. This is quite similar to the instrumentation of the query: low bowed strings and a legato bassoon in counterpoint in E♭-major. The latter set of spikes is around the end of the fifth appearance of the theme, which is played in G-minor on low pizzicato strings with a staccato bassoon. For $M = 10$, we see a conspicuous spike at the time of the fifth appearance around 34 s, as well as of the fourth appearance around 24 s, where the theme is played in E♭-major like the query. Finally, we test how the sparse approximation of this query compares with subsequences from a different recording of this movement, which is also in a different tempo. Fig. 12(b) shows $|S_{iq}(t, M)|$ (28) for three different values of $M$. We see high similarity with the first and second appearances of the main theme, but not the third, which is what the query contains.
4.4. Discussion
There is no reason to believe that a robust and accurate speech or melody recognition system can be created by comparing only the first few elements of greedy decompositions in time-frequency dictionaries. What appears to be occurring for the short signals, both "cheese" and "the old man said," is that the first few elements of their sparse and atomic decomposition create a prosodic representation that is comparable to others at the atomic level. For the longer signals, such as sentences, paragraphs, and an orchestral theme, a few atoms cannot adequately embody the prosody, but we still see that by only making a few comparisons we are able to locate the excerpted signal, as long as the time partition is fine enough. This is due to the atoms of the models acting in some sense as a time-frequency fingerprint, an example of which we see in Fig. 11. Through the cosine distance, the relative time and frequency locations of the atoms in the query and subsequence are being compared, weighted by their energies. Subsequences that share atoms in similar configurations will be gauged closer to $x_q$ than those that do not.

Fig. 11. Polyphonic orchestral query: sonogram (top) and time-frequency tiles (bottom) of the 50-order sparse approximation. Axes: frequency (kHz) versus time (s).

Fig. 12. $|S_{iq}(t, M)|$ (28) for three values of $M$ for the query and two different signals. Arrows mark the appearances of the A theme, and their appearance number. Magnitude correlation of the query with the localized and normalized signal is shown by the solid gray area in front of the black time-domain signal at rear. (a) Query compared with the original signal and (b) query compared with a different interpretation.
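The way a few atom pair comparisons build up an inner product can be sketched as follows. This is a minimal sketch, assuming NumPy, weights sorted by decreasing magnitude, and an anti-diagonal summation order inferred from the count $P(M) = M(M+1)/2$ (so that $P(2) = 3$ and $P(10) = 55$); the function names and the random atoms are illustrative assumptions, not the dictionaries used in the experiments.

```python
import numpy as np

def pairs_compared(M):
    """Number of atom pairs on the first M anti-diagonals."""
    return M * (M + 1) // 2

def antidiagonal_partial_sums(a_i, a_q, gram):
    """Cumulative sums of weight products times atom inner products, one anti-diagonal at a time."""
    prod = np.outer(a_i, a_q) * gram          # element-wise product of weights and Gramian
    n = min(prod.shape)
    sums = np.empty(n)
    total = 0.0
    for M in range(1, n + 1):
        # anti-diagonal M holds the pairs (row, col) with row + col == M - 1
        total += sum(prod[M - 1 - l, l] for l in range(M))
        sums[M - 1] = total
    return sums, float(prod.sum())

rng = np.random.default_rng(2)
N, n = 64, 8
H = rng.standard_normal((N, n)); H /= np.linalg.norm(H, axis=0)   # atoms of one model
D = rng.standard_normal((N, n)); D /= np.linalg.norm(D, axis=0)   # atoms of the other
decay = np.arange(1, n + 1) ** -1.0                                # decaying magnitudes
a_i = decay * rng.choice([-1.0, 1.0], n)
a_q = decay * rng.choice([-1.0, 1.0], n)

y, x = H @ a_i, D @ a_q
sums, full = antidiagonal_partial_sums(a_i, a_q, H.T @ D)
```

Summing every entry of the weighted Gramian recovers the exact inner product of the two modeled signals; the partial sums stop after the upper-left triangle and leave the rest as a remainder.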
By using the cosine distance it is not unexpected that (28) will be extremely sensitive to a partitioning of the time-domain. This comes directly from the definition of the time-localized Gramian (23), as well as the use of a dictionary that is not translation invariant. There is no need to partition the time axis when using a parameterized dictionary if we assume that some of the atoms in the model of $x_q$ will have parameters that are nearly the same as some of those in the relevant localized sparse representations. In such a scenario, we can search a sparse representation for the times at which atoms exist that are similar in scale and modulation frequency to those modeling $x_q$. Then we can limit our search to those particular times without considering any uniform and arbitrary partition of the time-domain. With non-linear greedy decomposition methods such as MP and time-variant dictionaries, however, such an assumption cannot be guaranteed; but its limits are not yet well-known.
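The idea of limiting the search to times at which similar atoms exist can be illustrated with a small sketch. Everything here is hypothetical: the `Atom` record, its fields, the frequency tolerance, and the toy data are assumptions made for illustration and do not describe an implementation from this work.

```python
from typing import NamedTuple

class Atom(NamedTuple):
    scale: int       # window length in samples (hypothetical field)
    freq_hz: float   # modulation frequency (hypothetical field)
    time_s: float    # position in the signal (hypothetical field)
    weight: float

def candidate_offsets(query_atoms, signal_atoms, freq_tol_hz=20.0):
    """Time offsets at which a signal atom matches a query atom in scale and, nearly, in frequency."""
    offsets = set()
    for q in query_atoms:
        for s in signal_atoms:
            if s.scale == q.scale and abs(s.freq_hz - q.freq_hz) <= freq_tol_hz:
                offsets.add(round(s.time_s - q.time_s, 3))
    return sorted(offsets)

query = [Atom(1024, 440.0, 0.10, 0.9), Atom(256, 2000.0, 0.25, 0.4)]
signal = [
    Atom(1024, 445.0, 5.10, 0.8),   # matches the first query atom
    Atom(256, 1990.0, 5.25, 0.3),   # matches the second query atom
    Atom(1024, 3000.0, 7.00, 0.5),  # same scale, frequency too far
    Atom(512, 440.0, 2.00, 0.6),    # same frequency, different scale
]
offsets = candidate_offsets(query, signal)
```

Only the offsets returned would then be examined further, instead of every position of a uniform time partition.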
5. Conclusion
In this paper, we have extended and investigated the applicability of a method of recursive nearest neighbor search [14,15] for comparing audio signals using pairwise comparisons of model elements in a sparse domain. The multiscale descriptions offered by sparse approximation over time-frequency dictionaries are especially attractive for such tasks because they provide flexibility in making comparisons between data, not to mention a capacity to deal with noise. After extending this method to the task of comparing subsequences of audio signals, we find that the strongest bound known for the remainder is too weak to quickly and efficiently reduce the search space. Our experiments show, however, that by comparing elements of sparse models we can judge with relatively few comparisons whether signals share the same time-frequency structures, and to what degrees, although this can be quite sensitive to the time-domain partitioning. We also see that we can approach such comparisons hierarchically, starting from the most energetic content to the least, or starting from the longest scale phenomenon to the shortest.
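One way to read "from the most energetic content to the least" is to accumulate atom pair products in order of decreasing weight-product magnitude. The following is a minimal sketch of that ordering, assuming NumPy; the function name and the toy weights and Gramian are illustrative assumptions.

```python
import numpy as np

def ordered_recursive_sum(a_i, a_q, gram):
    """Accumulate weighted Gramian entries, largest weight products first."""
    A = np.outer(a_i, a_q)                                     # outer product of the weights
    order = np.argsort(-np.abs(A), axis=None, kind="stable")   # flattened index order
    terms = (A * gram).ravel()[order]
    return order, np.cumsum(terms)                             # running sums S(1), S(2), ...

a_i = np.array([1.0, -0.5, 0.25])
a_q = np.array([0.8, 0.4])
gram = np.array([[0.9, 0.0],
                 [0.0, 0.7],
                 [0.1, 0.0]])                                  # toy atom inner products

order, running = ordered_recursive_sum(a_i, a_q, gram)
magnitudes = np.abs(np.outer(a_i, a_q)).ravel()[order]
```

The last running sum equals the full double sum, so the ordering only changes how quickly the partial sums approach it.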
We are continuing this research in multiple directions. First, since we know that the inner product matrix $G_{iq}(t)$ (23) will be very sparse for all $t$ in time-frequency dictionaries, this motivates designing a tighter bound based on a Laplacian distribution of elements in $G_{iq}(t)$ with a large probability mass exactly at zero. This bound would be much more realistic than that provided by assuming the elements of the Gramian are uniformly distributed (16). Another part of the problem is of course that the sums in (9) and (28) are not such that at step $M$ the $P(M)$ largest magnitude values of $A_{iq} \circ G_{iq}$ are actually being summed. By our assumption in (4), we know that the decay of the magnitudes of the elements in $A_{iq}$ will be quickest in diagonal directions, but dependent upon the element position in the matrix. These diagonal
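directions are simply given by

$$\begin{bmatrix} \partial/\partial m \\ \partial/\partial l \end{bmatrix} m^{-\gamma_i} l^{-\gamma_q} = -\begin{bmatrix} \gamma_i/m \\ \gamma_q/l \end{bmatrix} m^{-\gamma_i} l^{-\gamma_q} \qquad (29)$$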

where we now recognize that the weights of two different representations can decay at different rates. With this, we can make an ordered set of index pairs by

$$\Lambda = \{(m, l)_\lambda : |[A_{iq}]_{\lambda}| \ge |[A_{iq}]_{\lambda+1}|\}_{\lambda = 1, 2, \ldots, n_i n_q} \qquad (30)$$

and define a recursive sum for $1 < m \le n_i n_q$

$$S_{iq}(m) \triangleq S_{iq}(m-1) + [A_{iq} \circ G_{iq}]_{\Lambda(m)} \qquad (31)$$

setting $S_{iq}(1) = [A_{iq} \circ G_{iq}]_{11}$. We do not yet know the extent to which this approach can ameliorate the problems with the non-discriminating bound (16), as we have yet to design an efficient way to generate a satisfactory $\Lambda$, and estimate the bounds of the corresponding remainder, whether it is like that in (16), or another that uses the fact that $G_{iq}(t)$ will be very sparse, even when $x_q = y_i$. We think that using a stronger bound and this indexing order will significantly reduce the number of pairwise comparisons that must be made before determining a subsequence is not close enough with respect to the cosine distance. Furthermore, we can make the elements of $A_{iq}$ decay faster, and thus increase $\gamma$, by using other sparse approximation approaches, such as OMP [28,23] or CMP [36]. And we cannot forget the implications of choosing a particular dictionary. In this work, we have used two different parametric dictionaries, one of which is designed for audio signal coding [31]. Another interesting research direction is to use dictionaries better suited for content description than coding, such as content-adapted dictionaries [20,2,19].
Finally, and specifically with regard to the problem of similarity search in audio signals, the cosine distance between time-domain samples makes little sense because it is too sensitive to signal waveforms whereas human perception is not. Instead, many other possibilities exist for comparing sparse approximations, such as comparing low-level histograms of atom parameters [7,34]; comparing mid-level structures such as harmonics [9,38,8]; and comparing high-level patterns of short atoms representing rhythm [32]. There also exists the matching pursuit dissimilarity measure [25], where the atoms of one sparse model are used to decompose another signal, and vice versa, to see how well they model each other. We are exploring these various possibilities with regard to gauging more generally similarity in audio signals at multiple levels of specificity within a sparse domain.

Acknowledgments
B.L. Sturm performed part of this research as a Chateaubriand Post-doctoral Fellow (N. 634146B) at the Institut Jean Le Rond d'Alembert, Équipe Lutheries, Acoustique, Musique, Université Pierre et Marie Curie, Paris 6, France; as well as at the Department of Architecture, Design and Media Technology, at Aalborg University Copenhagen, Denmark. L. Daudet acknowledges partial support from the French Agence Nationale de la Recherche under contract ANR-06-JCJC-0027-01 DESAM. The authors thank the anonymous reviewers for their very detailed and helpful comments.

Appendix A. Proof of remainder bounds
To show (14), we can bound $R(M)$ loosely by assuming the worst case scenario of $[G_{iq}]_{ml} = 1$ for all its elements. Knowing that $R(M)$ is the sum of the elements of the matrix $A_{iq} \circ G_{iq}$ except for the first $P(M)$ values, and assuming (4), we can say

$$C^{-2} R(M) \le \sum_{m=M+1}^{n} \sum_{l=1}^{m} [l(m-l+1)]^{-\gamma} + \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} [l(n-m+1)]^{-\gamma} = \|c_M^\gamma\|_1 + \|d^\gamma\|_1 \qquad (A.1)$$

where $c_M^\gamma$ and $d^\gamma$ are defined in (17) and (18). This worst case scenario is not possible using MP because of its update rule (1).
We can find the tighter bound in (15) by assuming the distribution of signs of the elements of $G_{iq}$ is Bernoulli equiprobable, i.e., $P\{[G_{iq}]_{ml} = 1\} = P\{[G_{iq}]_{ml} = -1\} = 0.5$. Thus, defining a random variable $b_i$ taking values in $\{-1, 1\}$, and its probability mass function $f_B(b_i) = 0.5\,\delta(b_i + 1) + 0.5\,\delta(b_i - 1)$ using the Dirac function, $\delta(x)$, we create a random vector $b$ with $n^2 - P(M)$ elements independently drawn from this distribution. Placing this into the double sums of (A.1) provides the bound

$$C^{-2} R(M) \le \left| b^T \begin{bmatrix} c_M^\gamma \\ d^\gamma \end{bmatrix} \right| \le \|c_M^\gamma\|_1 + \|d^\gamma\|_1 \qquad (A.2)$$

This weighted Rademacher sequence has the property that [14]

$$P\{|b^T s| > R\} \le 2\exp(-R^2/2\|s\|_2^2), \quad R > 0 \qquad (A.3)$$

which becomes $P\{|b^T s| \le R\} \ge \max\{0, 1 - 2\exp(-R^2/2\|s\|_2^2)\}$ by the axioms of probability. With this we can find an $R$ such that $P\{|b^T s| \le R\}$ will be greater than or equal to some probability $0 \le p \le 1$, i.e.,

$$R(p) = \left(\|c_M^\gamma\|_2^2 + \|d^\gamma\|_2^2\right)^{1/2} \left(2\ln\frac{2}{1-p}\right)^{1/2} \qquad (A.4)$$

This value can be minimized by choosing $p = 0$, for which we arrive at the residual upper bound in (15). Note that even though we have set $p = 0$, we still have an unrealistically loose bound by the impossibility of MP of choosing two sets of atoms for which all entries of their Gramian $G_{iq}$ are in $\{-1, 1\}$.
Finally, to show (16), we can model the elements of the Gramian as random variables $u_i$ taking values in $[-1, 1]$, independently and identically distributed uniformly

$$f_U(u_i) = \begin{cases} 0.5, & -1 \le u_i \le 1 \\ 0, & \text{else} \end{cases} \qquad (A.5)$$

Substituting this into (14) gives a weighted sum of random variables satisfying

$$C^{-2} R(M) \le \left| u^T \begin{bmatrix} c_M^\gamma \\ d^\gamma \end{bmatrix} \right| \le \|c_M^\gamma\|_1 + \|d^\gamma\|_1 \qquad (A.6)$$

where $u$ is the random vector. For large $M$, this sum has the asymptotic property [14,15]:

$$P\{|u^T s| < R\} = \mathrm{Erf}\left(\sqrt{\frac{3R^2}{2\|s\|_2^2}}\right) \qquad (A.7)$$

Setting this equal to $0 \le p \le 1$ and solving for $R$ produces the upper bound (16). We can reach the upper bound (15) if we set $p = 0.9586$, but note that (16) can be made zero. This bound can still be extremely loose because the Gramian of two models in time-frequency dictionaries will be highly sparse.
Computing the $\ell_2$-norm in these expressions, however, leads to evaluating the double sums

$$\|c_M^\gamma\|_2^2 = \sum_{m=M+1}^{n} \sum_{l=1}^{m} [l(m-l+1)]^{-2\gamma} \qquad (A.8)$$

$$\|d^\gamma\|_2^2 = \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} [l(n-m+1)]^{-2\gamma} \qquad (A.9)$$

which can be prohibitive for large $n$. The dimensionality of $c_M^\gamma$ is $n(n+1)/2 - P(M)$, and of $d^\gamma$ is $n(n-1)/2$. We approximate these values in the following way for $\gamma = 0.5$, using the partial sum of the harmonic series

$$\sum_{m=1}^{n} \frac{1}{m} = \ln n + \gamma_E + \frac{1}{2n} - \frac{1}{12n^2} + \frac{1}{120n^4} + O(n^{-6}) \qquad (A.10)$$

where $\gamma_E \approx 0.5772$ is the Euler-Mascheroni constant. To find $\|d^{0.5}\|_2^2$ we write

$$\|d^{0.5}\|_2^2 = \sum_{m=1}^{n-1} \sum_{l=m+1}^{n} \frac{1}{l(n-m+1)} = \sum_{m=1}^{n-1} \frac{1}{n-m+1} \left[ \sum_{l=1}^{n} \frac{1}{l} - \sum_{l=1}^{m} \frac{1}{l} \right] \approx \sum_{m=1}^{n-1} \frac{1}{n-m+1} \left[ \ln(n/m) - \frac{n-m}{2nm} + \frac{n^2 - m^2}{12 n^2 m^2} \right]. \qquad (A.11)$$

To find $\|c_M^{0.5}\|_2^2$ we first use partial fractions and then the partial sum of the harmonic series:

$$\|c_M^{0.5}\|_2^2 = \sum_{m=M+1}^{n} \sum_{l=1}^{m} \frac{1}{l(m-l+1)} = \sum_{m=M+1}^{n} \frac{1}{m+1} \sum_{l=1}^{m} \left[ \frac{1}{l} + \frac{1}{m-l+1} \right]$$
$$\approx \sum_{m=M+1}^{n} \frac{1}{m+1} \left[ \ln m + \gamma_E + \frac{1}{2m} - \frac{1}{12m^2} + \sum_{l=1}^{m} \frac{1}{l} \right] \approx \sum_{m=M+1}^{n} \frac{2}{m+1} \left[ \ln m + \gamma_E + \frac{1}{2m} - \frac{1}{12m^2} \right]. \qquad (A.12)$$

With these expressions we can avoid double sums in calculating the bounds.
Appendix B. Estimating the compressibility parameters


We estimate the compressibility parameters $(C, \gamma)$ for all signals from the entire set of representation weights. Since by (4) the parameters $(C, \gamma)$ bound from above the decay of all the ordered weights, only the largest magnitude weights matter for their estimation. Thus, we define a vector, $a$, of the largest $n$ magnitude weights from each row in the set $\{\{a_i(n_i)\}_{i \in I}, a_q(n_q)\}$, which is equivalent to taking the largest weights at each approximation order. Good compressibility parameters can be given by

$$\min_{C, \gamma} \|C z^\gamma - a\|_2 + \lambda C \quad \text{subject to } C z^\gamma \succeq a \qquad (B.1)$$

where we define $z^\gamma \triangleq [1, 1/2^\gamma, \ldots, 1/n^\gamma]^T$, and add a multiple of $C$ in order to keep it from getting too large since the bounds (14)–(16) are all proportional to it. The constraint is added to ensure all elements of the difference $C z^\gamma - a$ are positive such that (4) is true.
To remove the $\gamma$ component from the exponent, and since all of the elements of $z$ and $a$ are positive and non-zero, we can instead solve the problem

$$\min_{C, \gamma} \|\ln C\,\mathbf{1} + \gamma \ln z - \ln a\|^2 + \lambda \ln C$$
$$= \min_{C, \gamma}\; (\ln C)^2 n + \gamma^2 \|\ln z\|^2 + \|\ln a\|^2 + \lambda \ln C + 2\gamma \ln z^T (\ln C\,\mathbf{1} - \ln a) - 2 \ln C \ln a^T \mathbf{1} \qquad (B.2)$$

subject to the constraint $C z^\gamma \succeq a$. Taking the partial derivative of this with respect to $\gamma$ and $C$, we find

$$\gamma_o = \frac{\ln z^T (\ln a - \ln C\,\mathbf{1})}{\|\ln z\|^2} \qquad (B.3)$$

$$C_o = \exp\left[ -\lambda + \frac{1}{n} \sum_{i=1}^{n} [\ln a - \gamma \ln z]_i \right] \qquad (B.4)$$

Starting with some initial value of $C$ then, we use the following iterative method:
1. solve for $\gamma$ given a $C$ in (B.3);
2. find the new $C$ in (B.4) using this $\gamma$;
3. set $C' = \exp(\max(\ln a - \gamma_o \ln z))$ and evaluate the error $\|C' z^{\gamma_o} - a\|^2$;
4. repeat until the error begins to increase.

The factor $\lambda$ in effect controls the step size for convergence. A typical value we use is $\lambda = \pm 0.03$ based on experiments (the sign of which depends on whether the objective function decreases with decreasing $C$).
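The four steps above can be sketched as a short loop. This is a minimal sketch, assuming NumPy, a weight vector with at least two non-zero entries, $z = [1, 1/2, \ldots, 1/n]^T$, and that $\lambda$ enters the update of $C$ as a subtracted constant; the function name, the initial value of $C$, and the synthetic power-law weights are illustrative assumptions.

```python
import numpy as np

def estimate_compressibility(weights, lam=0.03, c_init=None, max_iter=200):
    """Iterate the gamma and C updates; stop when the fitting error begins to increase."""
    a = np.abs(np.asarray(weights, dtype=float))   # needs at least two non-zero weights
    n = a.size
    ln_a = np.log(a)
    ln_z = -np.log(np.arange(1, n + 1))            # assumed z = [1, 1/2, ..., 1/n]
    ln_c = ln_a.max() if c_init is None else np.log(c_init)
    best = None
    for _ in range(max_iter):
        gamma = ln_z @ (ln_a - ln_c) / (ln_z @ ln_z)       # step 1: update gamma
        ln_c = -lam + np.mean(ln_a - gamma * ln_z)         # step 2: update C (assumed sign of lam)
        ln_c_feasible = np.max(ln_a - gamma * ln_z)        # step 3: smallest C' with C' z^gamma >= a
        err = np.linalg.norm(np.exp(ln_c_feasible + gamma * ln_z) - a) ** 2
        if best is not None and err >= best[2]:
            break                                          # step 4: error began to increase
        best = (float(np.exp(ln_c_feasible)), float(gamma), float(err))
    return best

m = np.arange(1, 51)
a = 2.0 * m ** -0.7                                        # synthetic weights C m^(-gamma)
C_hat, gamma_hat, err = estimate_compressibility(a)
```

On exact power-law weights the first pass already recovers the generating parameters; on real weights the loop is only a heuristic whose behavior depends on the initial $C$ and on $\lambda$.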
References
[1] R. Agrawal, C. Faloutsos, A. Swami, Efficient similarity search in sequence databases, in: Proceedings of the International Conference of Foundations of Data Organization and Algorithms, Chicago, IL, October 1993, pp. 69–84.
[2] M. Aharon, M. Elad, A. Bruckstein, K-SVD: an algorithm for designing overcomplete dictionaries for sparse representation, IEEE Transactions on Signal Processing 54 (11) (2006) 4311–4322.
[3] M. Casey, C. Rhodes, M. Slaney, Analysis of minimum distances in high-dimensional musical spaces, IEEE Transactions on Audio, Speech and Language Processing 16 (5) (2008) 1015–1028.
[4] M. Casey, R. Veltkamp, M. Goto, M. Leman, C. Rhodes, M. Slaney, Content-based music information retrieval: current directions and future challenges, Proceedings of the IEEE 96 (4) (2008) 668–696.
[5] K. Chang, J.-S.R. Jang, C.S. Iliopoulos, Music genre classification via compressive sampling, in: Proceedings of the International Society for Music Information Retrieval, Amsterdam, The Netherlands, August 2010, pp. 387–392.
[6] S.S. Chen, D.L. Donoho, M.A. Saunders, Atomic decomposition by basis pursuit, SIAM Journal on Scientific Computing 20 (1) (1998) 33–61.
[7] S. Chu, S. Narayanan, C.-C.J. Kuo, Environmental sound recognition with time-frequency audio features, IEEE Transactions on Audio, Speech and Language Processing 17 (6) (2009) 1142–1158.
[8] C. Cotton, D.P.W. Ellis, Finding similar acoustic events using matching pursuit and locality-sensitive hashing, in: Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Mohonk, NY, October 2009, pp. 125–128.
[9] L. Daudet, Sparse and structured decompositions of signals with the molecular matching pursuit, IEEE Transactions on Audio, Speech and Language Processing 14 (5) (2006) 1808–1816.
[10] D.P.W. Ellis, G.E. Poliner, Identifying cover songs with chroma features and dynamic programming beat tracking, in: Proceedings of the International Conference on Acoustics, Speech and Signal Processing, Honolulu, Hawaii, April 2007, pp. 1429–1432.
[11] C. Faloutsos, M. Ranganathan, Y. Manolopoulos, Fast subsequence matching in time-series databases, in: Proceedings of the ACM SIGMOD International Conference on Management of Data, Minneapolis, MN, 1994, pp. 419–429.
[12] J. Gemmeke, L. ten Bosch, L. Boves, B. Cranen, Using sparse representations for exemplar based continuous digit recognition, in: Proceedings of the European Signal Processing Conference, Glasgow, Scotland, August 2009, pp. 1755–1759.
[13] J. Haitsma, T. Kalker, A highly robust audio fingerprinting system with an efficient search strategy, Journal of New Music Research 32 (2) (2003) 211–221.
[14] P. Jost, Algorithmic aspects of sparse approximations, Ph.D. Thesis, École Polytechnique Fédérale de Lausanne, Lausanne, Switzerland, June 2007.
[15] P. Jost, P. Vandergheynst, On finding approximate nearest neighbours in a set of compressible signals, in: Proceedings of the European Signal Processing Conference, Lausanne, Switzerland, August 2008, pp. 1–5.
[16] A. Kimura, K. Kashino, T. Kurozumi, H. Murase, A quick search method for audio signals based on piecewise linear representation of feature trajectories, IEEE Transactions on Audio, Speech and Language Processing 16 (2) (2008) 396–407.
[17] S. Krstulovic, R. Gribonval, MPTK: Matching pursuit made tractable, in: Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing, vol. 3, Toulouse, France, April 2006, pp. 496–499.
[18] F. Kurth, M. Müller, Efficient index-based audio matching, IEEE Transactions on Audio, Speech and Language Processing 16 (2) (2008) 382–395.
[19] P. Leveau, E. Vincent, G. Richard, L. Daudet, Instrument-specific harmonic atoms for mid-level music representation, IEEE Transactions on Audio, Speech and Language Processing 16 (1) (2008) 116–128.
[20] M.S. Lewicki, T.J. Sejnowski, Learning overcomplete representations, Neural Computation 12 (2000) 337–365.
[21] C.-S. Li, P.S. Yu, V. Castelli, Hierarchyscan: A hierarchical similarity search algorithm for databases of long sequences, in: Proceedings of the International Conference on Data Engineering, New Orleans, LA, February 1996, pp. 546–553.
[22] R.F. Lyon, M. Rehn, S. Bengio, T.C. Walters, G. Chechik, Sound retrieval and ranking using sparse auditory representations, Neural Computation 22 (9) (2010) 2390–2416.
[23] B. Mailhé, R. Gribonval, P. Vandergheynst, F. Bimbot, Fast orthogonal sparse approximation algorithms over local dictionaries, Signal Processing, this issue.
[24] S. Mallat, A Wavelet Tour of Signal Processing: The Sparse Way, third ed., Academic Press, Elsevier, Amsterdam, 2009.
[25] R. Mazhar, P.D. Gader, J.N. Nilson, Matching pursuits dissimilarity measure for shape-based comparison and classification of high-dimensional data, IEEE Transactions on Fuzzy Systems 17 (5) (2009) 1175–1188.
[26] M. Müller, F. Kurth, M. Clausen, Chroma-based statistical audio features for audio matching, in: IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, New Paltz, NY, October 2005, pp. 275–278.
[27] Y. Panagakis, C. Kotropoulos, G.R. Arce, Music genre classification via sparse representations of auditory temporal modulations, in: Proceedings of the European Signal Processing Conference, Glasgow, Scotland, August 2009, pp. 1–5.
[28] Y. Pati, R. Rezaiifar, P. Krishnaprasad, Orthogonal matching pursuit: Recursive function approximation with applications to wavelet decomposition, in: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, vol. 1, Pacific Grove, CA, November 1993, pp. 40–44.
[29] T.V. Pham, A. Smeulders, Sparse representation for coarse and fine object recognition, IEEE Transactions on Pattern Analysis and Machine Intelligence 28 (4) (2006) 555–567.
[30] D. Rafiei, A. Mendelzon, Efficient retrieval of similar time sequences using DFT, in: Proceedings of the International Conference of Foundations of Data Organization and Algorithms, Kobe, Japan, November 1998, pp. 249–257.
[31] E. Ravelli, G. Richard, L. Daudet, Union of MDCT bases for audio coding, IEEE Transactions on Audio, Speech and Language Processing 16 (8) (2008) 1361–1372.
[32] E. Ravelli, G. Richard, L. Daudet, Audio signal representations for indexing in the transform domain, IEEE Transactions on Audio, Speech and Language Processing 18 (3) (2010) 434–446.
[33] L. Rebollo-Neira, D. Lowe, Optimized orthogonal matching pursuit approach, IEEE Signal Processing Letters 9 (4) (2002) 137–140.
[34] S. Scholler, H. Purwins, Sparse coding for drum sound classification and its use as a similarity measure, in: Proceedings of the International Workshop on Machine Learning Music ACM Multimedia, Firenze, Italy, October 2010.
[35] J. Serrà, E. Gómez, P. Herrera, X. Serra, Chroma binary similarity and local alignment applied to cover song identification, IEEE Transactions on Audio, Speech and Language Processing 16 (August 2008) 1138–1151.
[36] B.L. Sturm, M. Christensen, Cyclic matching pursuit with multiscale time-frequency dictionaries, in: Proceedings of the Asilomar Conference on Signals, Systems, and Computers, Pacific Grove, CA, November 2010.
[37] B.L. Sturm, J.J. Shynk, Sparse approximation and the pursuit of meaningful signal models with interference adaptation, IEEE Transactions on Audio, Speech and Language Processing 18 (3) (2010) 461–472.
[38] B.L. Sturm, J.J. Shynk, A. McLeran, C. Roads, L. Daudet, A comparison of molecular approaches for generating sparse and structured multiresolution representations of audio and music signals, in: Proceedings of Acoustics, Paris, France, June 2008, pp. 5775–5780.
[39] G. Tzanetakis, P. Cook, Musical genre classification of audio signals, IEEE Transactions on Speech and Audio Processing 10 (5) (2002) 293–302.
[40] K. Umapathy, S. Krishnan, S. Jimaa, Multigroup classification of audio signals using time-frequency parameters, IEEE Transactions on Multimedia 7 (2) (2005) 308–315.
[41] P. Vincent, Y. Bengio, Kernel matching pursuit, Machine Learning 48 (1) (2002) 165–187.
[42] A. Wang, An industrial strength audio search algorithm, in: Proceedings of the International Society on Music Information Retrieval, Baltimore, Maryland, USA, October 2003, pp. 1–4.
[43] J. Wright, Y. Ma, J. Mairal, G. Sapiro, T. Huang, S. Yan, Sparse representation for computer vision and pattern recognition, Proceedings of the IEEE 98 (6) (2009) 1031–1044.
[44] J. Wright, A.Y. Yang, A. Ganesh, S.S. Sastry, Y. Ma, Robust face recognition via sparse representation, IEEE Transactions on Pattern Analysis and Machine Intelligence 31 (2) (2009) 210–227.
