Nieuwenhuizen - The Study and Implementation of Shazam's Audio Fingerprinting Algorithm For Advertisement Identification
II. DIFFERENT AUDIO FINGERPRINTING TECHNIQUES

According to P.J.O. Doets, M. Menor Gisbert and R.L. Lagendijk, audio fingerprinting systems can be categorized into three groups [4]:

Group 1: Systems that use features based on multiple subbands, namely Philips' Robust Hash algorithm, reported to be very robust against distortions [1]. Philips uses Haitsma & Kalker's algorithm.

Group 2: Systems that use features based on a single band such as the spectral domain, namely Avery Wang's Shazam and Fraunhofer's AudioID algorithms.

Group 3: Systems using a combination of subbands or frames, which are optimized through training, namely Microsoft's Robust Audio Recognition Engine (RARE), which uses Hidden Markov Models (HMMs) [5].

For this paper we are only interested in Group 2, as the commonly known algorithm, Avery Wang's Shazam, which falls in this group, was chosen for this study.

Avery Wang's Shazam algorithm uses the energy peaks in the frame to form spectral pair landmarks. Spectral peaks were chosen for their robustness against noise and approximate linear superposability [2]. The local maxima within a defined section are grouped into pairs [8].

A. Properties investigated

In this study the following was observed.

Decreasing the frame size from the common value of 512 to 256 or 128 will increase the accuracy but decrease speed, as there will be more frames, meaning more peaks.

Defining more peaks in a frame (normally 5) would also result in better accuracy but decreased speed. Overlapping regions will increase robustness but decrease speed.

Depending on the specific application, these parameters can be tweaked accordingly. The generic code is based on Shazam's concept, so logically it is optimized for use in cellphone applications (which are typically subjected to noise, and where robustness and accuracy matter more than speed).

For an application which is not subjected to noise and which requires real-time analysis, the frame size would be increased, the number of peaks per frame decreased, and overlapping disabled.
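
The trade-offs between frame size, overlap and peak pairing can be made concrete with a small sketch. It is an illustration under stated assumptions, not the code used in this study: the function names, the fan-out limit and the width of the pairing zone are hypothetical, and peaks are supplied directly as (frame index, frequency bin) tuples rather than extracted from a spectrogram.

```python
from typing import List, Tuple

Peak = Tuple[int, int]                # (frame index, frequency bin)
Landmark = Tuple[int, int, int, int]  # (anchor frame, f1, f2, frame delta)


def frame_count(num_samples: int, frame_size: int, overlap: float = 0.0) -> int:
    """Number of full analysis frames in a signal of num_samples samples.

    overlap is the fraction of a frame shared with the next frame (0 = none).
    """
    if not 0.0 <= overlap < 1.0:
        raise ValueError("overlap must be in [0, 1)")
    if num_samples < frame_size:
        return 0
    hop = max(1, int(frame_size * (1.0 - overlap)))
    return 1 + (num_samples - frame_size) // hop


def pair_peaks(peaks: List[Peak], max_dt: int = 3, fan_out: int = 3) -> List[Landmark]:
    """Group peaks into landmark pairs.

    Assumptions (hypothetical, for illustration only): each anchor peak is
    paired with at most fan_out later peaks that lie 1..max_dt frames ahead;
    peaks in the same frame as the anchor are not paired with it.
    """
    ordered = sorted(peaks)
    landmarks: List[Landmark] = []
    for i, (t1, f1) in enumerate(ordered):
        paired = 0
        for t2, f2 in ordered[i + 1:]:
            dt = t2 - t1
            if dt > max_dt:
                break  # peaks are sorted by frame, so no later peak can qualify
            if dt == 0:
                continue
            landmarks.append((t1, f1, f2, dt))
            paired += 1
            if paired == fan_out:
                break
    return landmarks


if __name__ == "__main__":
    one_second = 44100  # samples at 44.1 kHz
    for size in (512, 256, 128):
        print(size, frame_count(one_second, size), frame_count(one_second, size, 0.5))
    print(pair_peaks([(0, 10), (1, 20), (2, 30), (10, 40)]))
```

For one second of audio at 44.1 kHz, halving the frame size from 512 to 256 doubles the number of frames in this sketch (86 versus 172), and a 50% overlap at a frame size of 512 has a similar effect (171 frames). More frames and more peaks per frame both mean more landmarks to hash and compare, which is the source of the speed penalty noted above.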
IV. ADVERTISEMENT IDENTIFICATION

After implementing and analyzing the algorithm, a clear next step would be practical application. As the adapted Shazam code was a slightly better candidate with a more practical hash table, it was chosen for the practical application.

A sample containing 234 radio advertisements with a total duration of 1 hour and 56 minutes was used, with the advertisements ranging from 9 to 44 seconds.

A recording of radio audio with a length of 8 hours, 29 minutes and 59 seconds was used and divided into 5 minute segments. The segments were renamed to the following format:

Hour_minute_second.wav
or
Hour_minute_second.mp3

This allowed the application to identify the exact time that the advertisements were played on the radio.

Even with the algorithm analyzing a sample every 3 seconds, the process was too time consuming. A faster solution was analyzing a 3 second audio segment at 15 second intervals.

V. RESULTS

Shazam's results were obtained on an Intel® Core2 Duo™ processor running at 2.1 GHz with access to 4 GB of memory. Philips' results were derived from Haitsma & Kalker's paper [1].

All files, before processing, were sampled at 44.1 kHz in stereo at 16 bits per sample, in accordance with Haitsma & Kalker's paper [1]. This ensures an accurate comparison between algorithms.

The search time for a 5 minute sample of unknown audio is determined by the database size as shown in Figure 3.

Figure 3: Shazam vs. Philips

The adapted Shazam algorithm found all 113 advertisements in the radio audio provided, while Haitsma & Kalker's algorithm missed two. Neither of them produced a false positive at their respective thresholds. It should be noted that the radio audio is virtually noise free and uncompressed.

The largest false positive found when using the sample in wav format was 8 landmarks and the largest positive was 112, while the mp3 sample at 128 kbps found 5 and 40 respectively. This supports the decision to use a threshold of 19 and 9 for the wav and mp3 formats respectively and this corresponds with Dan Ellis' findings [6] suggesting the threshold for mp3 is 9 landmarks.

The unique feature which the Shazam algorithm exhibits is its ability to group spectral peaks (Shazam's audio fingerprint), which is of particular use in advertisement identification, as this allows multiple fingerprints to be detected or excluded on the same piece of audio.
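
The segment naming, the scan schedule and the landmark thresholds described in Sections IV and V can be combined as in the following sketch. It is a hedged illustration and not the implementation used in this study: all function names are hypothetical, the landmark count for each window is assumed to come from a matcher that is not shown, the file name is assumed to encode the time at which the segment starts, and a count is assumed to qualify when it reaches the threshold rather than strictly exceeding it.

```python
from pathlib import Path
from typing import Iterator

# Landmark thresholds per format, as reported in the text.
THRESHOLDS = {"wav": 19, "mp3": 9}


def segment_start_seconds(filename: str) -> int:
    """Parse 'Hour_minute_second.ext' into seconds (assumed segment start time)."""
    hour, minute, second = (int(part) for part in Path(filename).stem.split("_"))
    return hour * 3600 + minute * 60 + second


def scan_offsets(segment_length: int = 300, window: int = 3, step: int = 15) -> Iterator[int]:
    """Start offsets (in seconds) of 3 second windows taken every 15 seconds."""
    offset = 0
    while offset + window <= segment_length:
        yield offset
        offset += step


def is_match(landmark_count: int, filename: str) -> bool:
    """Apply the format-specific threshold (assumption: count >= threshold)."""
    extension = Path(filename).suffix.lstrip(".").lower()
    return landmark_count >= THRESHOLDS[extension]


def broadcast_time(filename: str, offset_seconds: int) -> str:
    """Time at which a window inside the segment was broadcast, as HH:MM:SS."""
    total = segment_start_seconds(filename) + offset_seconds
    hours, remainder = divmod(total, 3600)
    minutes, seconds = divmod(remainder, 60)
    return f"{hours:02d}:{minutes:02d}:{seconds:02d}"


if __name__ == "__main__":
    name = "08_25_00.wav"  # hypothetical segment file name
    print(len(list(scan_offsets())), "windows per 5 minute segment")
    print(is_match(112, name), is_match(8, name))
    print(broadcast_time(name, 45))
```

With a 15 second step, a 5 minute segment is probed at 20 offsets instead of the 100 that a 3 second step would require, which reflects the speed-up described in Section IV. The reported extremes (8 and 112 landmarks for wav, 5 and 40 for mp3) fall on opposite sides of the respective thresholds.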
B. Future work
The following additional work can be done in the future:
Different algorithms should be compared to each other
with respect to audio and advertisement identification,
including robustness.
More application uses for audio fingerprinting should be
investigated, e.g. gunshots, engine noise etc.
The prospects of using audio fingerprinting algorithms in
cases where advertisements are read should be investigated.
The possibility of video identification using audio
fingerprinting techniques (video fingerprinting), with the use
of Avery Wang’s Shazam algorithm should be explored.
The algorithm’s use in voice identification using specific
thresholds should be investigated.
The implementation of the algorithm should be examined for increased speed
and larger database sizes.
VIII. BIBLIOGRAPHY
[1] Haitsma, Jaap and Kalker, Antonius, "A Highly Robust Audio
Fingerprinting System." International Symposium on Music
Information Retrieval (ISMIR), Eindhoven : s.n., 2002, pp. 107-115.
[2] Wang, Avery Li-Chun, "An Industrial-Strength Audio Search
Algorithm." ISMIR, London : Shazam Entertainment, Ltd., 2003.
[3] Hatch, Wes, "A Quick Review of Audio Fingerprinting." March
2003.
[4] Doets, P.J.O, Gisbert, M. Menor and Lagendijk, R.L., "On the
comparison of audio fingerprints for extracting quality parameters of
compressed audio." [ed.] Edward J. Delp III and Ping Wah Wong.