Nieuwenhuizen - The Study and Implementation of Shazam's Audio Fingerprinting Algorithm For Advertisement Identification

In this paper, the audio fingerprinting algorithm of Avery Wang’s is implemented and studied in terms of accuracy, speed, versatility and scalability


The Study and Implementation of Shazam’s Audio Fingerprinting Algorithm for Advertisement Identification

Heinrich A. van Nieuwenhuizen, Willie C. Venter and Leenta M.J. Grobler
School of Electrical, Electronic and Computer Engineering
North-West University, Potchefstroom Campus, South Africa
Email: {20252188; willie.venter; leenta.grobler}@nwu.ac.za
Telephone: (027) 18 299-1961 Fax: (027) 18 299-1977
Abstract- The recognition of a person by his or her fingerprint is not a new concept, but the recognition of a piece of audio by a sample of it, also known as its audio fingerprint, is. Different research groups have delivered working implementations of audio fingerprinting for music, but not for advertisement identification. A fair judgment can therefore not be made on whether the available algorithms are suitable for advertisement identification. In this paper, Avery Wang’s audio fingerprinting algorithm is implemented and studied in terms of accuracy, speed, versatility and scalability.

Keywords: Audio Fingerprinting; Automatic Music Recognition; Content-based Audio Identification; Perceptual Hashing; Robust Matching

I. INTRODUCTION

Fingerprinting systems are not a new concept; they have been around for more than a hundred years. In 1893 Sir Francis Galton was the first to “prove” that no two fingerprints of human beings are alike [1]. This notion was taken further by using any unique feature to identify an object; this includes the iris and even ears. People also realized the potential of constructing fingerprints of audio signals to identify and compare them; a principle known as audio fingerprinting.

A question which can be raised is: why audio fingerprinting? Is there not an easier way, for example cross-correlation? The answer is not quite that simple. A simple mathematical comparison would probably work on two normalized, identical WAVE files. However, when the same song is heard on the radio and on a CD, the two versions might sound similar, but they are not the same mathematically, especially when noise is added or adjustments are made to the audio. A reliable, robust and accurate solution is to extract unique features using audio fingerprinting.

There are several applications for audio fingerprinting algorithms. According to Wes Hatch [3], the biggest beneficiary would be the broadcast monitoring industry. Other applications would be playlist generation, royalty collection, program verification and advertisement verification.

Hatch’s research inspired the advertisement identification technique for this paper.

Audio fingerprints are generated from short audio segments, usually between 3 and 30 seconds in length (depending on the algorithm). This audio fingerprint is compared to a database of known audio fingerprints to identify the original audio source, as seen in Figure 1.

The audio segments do not necessarily have to be of high quality to produce a match. Distortions and interference in the original signal make matching of the fingerprints less reliable, but to a certain extent the audio will still be recognizable. The distortions and interference can be compared to a smudged or partial human fingerprint.

[Figure 1: block diagram. Block labels: Recordings’ collection, Fingerprint extraction, Recordings’ Identifications, Meta Data, DB, Unlabeled recording, Fingerprint extraction, Match, Recording ID.]

Figure 1 : Content-based audio identification framework [7]

When you hear an advertisement on the radio, you take in the information and continue to listen to the next song or advertisement. You do not actually notice whether the advertisement has been clipped (shortened) or played faster than the initial recording.

This leaves the advertiser with a dilemma: even if they were to appoint personnel to verify that their advertisement was played correctly, human error is always possible. It is also too time consuming for a human to sit, listen and verify each advertisement [3].

Identifying these advertisements with audio fingerprinting is an exciting new solution. By analyzing the audio of a radio signal, all the advertisements can be identified.

There is however the question of accuracy, reliability and the speed at which the algorithm detects the advertisements. Another problem is that the algorithms were designed for music purposes.
It is shown in Figure 2 that voice does not have the same frequency response as music, which makes identifying advertisements (which typically consist of 80% voice data) quite difficult.

Figure 2 : Typical Frequency Response [9]

In this paper a brief description of audio fingerprinting and an adapted implementation of Avery Wang’s Shazam algorithm are discussed.

In section II, the three main groups of audio fingerprinting techniques are briefly reviewed, after which the operation of Avery Wang’s Shazam algorithm [2] is discussed in section III. After evaluating the performance of the algorithm, we describe the practical implementation of advertisement identification in section IV. In section V the results are published, while the validation and verification are presented in section VI. The paper ends with a conclusion and potential future work in section VII.

II. DIFFERENT AUDIO FINGERPRINTING TECHNIQUES

According to P.J.O. Doets, M. Menor Gisbert and R.L. Lagendijk [4], there are three groups into which audio fingerprinting systems can be categorized:

Group 1: Systems that use features based on multiple subbands, namely Philips’ Robust Hash algorithm, reported to be very robust against distortions [1]. Philips uses Haitsma & Kalker’s algorithm.

Group 2: Systems that use features based on a single band such as the spectral domain, namely Avery Wang’s Shazam and Fraunhofer’s AudioID algorithms.

Group 3: Systems using a combination of subbands or frames, which are optimized through training, namely Microsoft’s Robust Audio Recognition Engine (RARE), which uses Hidden Markov Models (HMMs) [5].

For this paper we are only interested in Group 2, as the well-known algorithm chosen for this study, Avery Wang’s Shazam, falls in this group.

III. OPERATION

All audio data used in the advertisement operation was sampled from radio advertisements heard on an everyday radio station. Both Avery Wang’s Shazam [2] and Haitsma & Kalker’s [1] algorithms were applied in the following scenario, but Avery Wang’s algorithm was deemed more suitable for today’s radio advertisements and was therefore studied in greater detail.

Avery Wang claims that for a database of 20 000 tracks, implemented on a PC, the search time is between 5 and 500 milliseconds [2]. As the original code is not available, an adapted implementation for MATLAB™ was produced by Dan Ellis [6]. Robert Macrae of C4DM, Queen Mary University of London, altered the code for use in the Windows environment, and the authors in turn altered and implemented the code for use in advertisement identification. The code was reproduced in VB.NET.

The proposed algorithm makes use of a spectrogram. The spectrogram is the squared magnitude of the STFT (Short-time Fourier Transform).

$$\mathrm{spectrogram}(t, \omega) = \left| \mathrm{STFT}(t, \omega) \right|^{2} \qquad (1)$$

Usually the spectrogram is divided into small fragments (typically 512 points) which are called windows or frames.

This is the shared basis of Group 2. The differences between the fingerprint algorithms in the group typically involve how much the frames overlap, how the fingerprint is defined in the frame, and how the fingerprints are stored and searched.

Avery Wang’s Shazam algorithm uses the energy peaks in the frame to form spectral pair landmarks. Spectral peaks were chosen for their robustness against noise and approximate linear superposability [2]. The local maxima within a defined section are grouped into pairs [8].

A. Properties investigated

In this study the following was observed.

Decreasing the frame size from the common value of 512 to 256 or 128 will increase the accuracy but decrease speed, as there will be more frames, meaning more peaks.

Defining more peaks in a frame (normally 5) would also result in better accuracy but decreased speed. Overlapping regions will increase robustness but decrease speed.

Depending on the specific application, these parameters can be tweaked accordingly. The generic code is based on Shazam’s concept, so logically it is optimized for use in cellphone applications (which are typically subjected to noise and where speed is not a huge concern, but robustness and accuracy are).

For an application which is not subjected to noise and which requires real-time analysis, the frame size would be increased, the number of peaks per frame would be decreased and there would be no overlapping.
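
The landmark extraction described in this section can be illustrated with a minimal Python sketch. It is not the authors' VB.NET code, nor Dan Ellis' MATLAB code. The function names, the Hann window, the hop size and the target-zone parameters (max_dt, max_df, fan_out) are assumptions made for illustration only. The frame size of 512 and the 5 peaks per frame follow the values quoted above.

```python
import numpy as np

def spectrogram(signal, frame_size=512, hop=256):
    """Squared magnitude of the STFT (Equation 1), one row per frame.

    hop < frame_size means overlapping frames; hop == frame_size means none.
    The Hann window and the hop value are illustrative assumptions.
    """
    window = np.hanning(frame_size)
    n_frames = 1 + (len(signal) - frame_size) // hop
    frames = np.stack([signal[i * hop: i * hop + frame_size] * window
                       for i in range(n_frames)])
    return np.abs(np.fft.rfft(frames, axis=1)) ** 2

def frame_peaks(spec, peaks_per_frame=5):
    """Return (frame, bin) for the strongest local maxima in each frame."""
    peaks = []
    for t, row in enumerate(spec):
        # A bin is a local maximum if it exceeds both of its neighbours.
        is_max = (row[1:-1] > row[:-2]) & (row[1:-1] > row[2:])
        candidates = np.flatnonzero(is_max) + 1
        order = np.argsort(row[candidates])[::-1][:peaks_per_frame]
        peaks.extend((t, int(f)) for f in sorted(candidates[order]))
    return peaks

def landmarks(peaks, max_dt=32, max_df=31, fan_out=3):
    """Pair each peak with up to fan_out later peaks in a target zone.

    Each landmark is (t1, f1, f2, dt). The zone limits are hypothetical.
    """
    pairs = []
    for i, (t1, f1) in enumerate(peaks):
        paired = 0
        for t2, f2 in peaks[i + 1:]:
            if t2 - t1 > max_dt:
                break  # peaks are ordered by frame, so no later peak fits
            if t2 > t1 and abs(f2 - f1) <= max_df:
                pairs.append((t1, f1, f2, t2 - t1))
                paired += 1
                if paired == fan_out:
                    break
    return pairs
```

In this sketch, halving frame_size doubles the number of frames for the same audio, and raising peaks_per_frame increases the number of landmarks, which mirrors the accuracy versus speed trade-off observed above.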
IV. ADVERTISEMENT IDENTIFICATION

After scientifically implementing and analyzing the algorithm, a clear next step would be practical application. As the adapted Shazam code was a slightly better candidate with a more practical hash table, it was chosen for the practical application.

A sample containing 234 radio advertisements with total time amounting to 1 hour and 56 minutes was used, with the advertisements ranging from 9 to 44 seconds.

A sample with a length of 8 hours, 29 minutes and 59 seconds containing radio data was used and divided into 5 minute segments. They were renamed to the following format:

Hour_minute_second.wav
or
Hour_minute_second.mp3

This allowed the application to identify the exact time that the advertisements were played on the radio.

Even with the algorithm analyzing a sample every 3 seconds, the process was too time consuming. A faster solution was analyzing a 3 second audio segment at 15 second intervals.

V. RESULTS

Shazam’s results were obtained on an Intel® Core2 Duo™ processor running at 2.1 GHz with access to 4 GB of memory. Philips’ results were derived from Haitsma & Kalker’s paper [1].

All files, before processing, were sampled at 44.1 kHz in stereo at 16 bits per sample, in accordance with Haitsma & Kalker’s paper [1]. This ensures accurate comparison between algorithms.

Table 1 : Advertisement results

Processing                        Shazam        Philips
Adding to database                5 min 25 s    20 min
Searching 3 second sample         116 ms        15 ms
Search through entire sample      30 min 56 s   14 min 20 s
Number of advertisements found    113           111
Times faster than real-time       16            36*
*see explanation below

It can be seen in Table 1 that Haitsma & Kalker’s algorithm might have the advantage for this database size, but the algorithm displays a linear time increase with respect to database size, whereas Avery Wang’s algorithm remains relatively constant.

When 300 minutes worth of audio (approximately 60 songs) is inserted into the database, Avery Wang’s algorithm searches through a 5 minute unknown audio segment in 19.8 seconds and Haitsma & Kalker’s in 20.696 seconds.

The search time for a 5 minute sample of unknown audio is determined by the database size as shown in Figure 3.

Figure 3 : Shazam vs. Philips

The adapted Shazam algorithm found all 113 advertisements in the radio audio provided, while Haitsma & Kalker’s algorithm missed two. Neither of them produced a false positive at its chosen threshold. It should be noted that the radio audio is virtually noise free and uncompressed.

The largest false positive found when using the sample in wav format was 8 landmarks and the largest positive was 112, while the mp3 sample at 128 kbps gave 5 and 40 respectively. This supports the decision to use thresholds of 19 and 9 for the wav and mp3 formats respectively, and this corresponds with Dan Ellis’ findings [6] suggesting that the threshold for mp3 is 9 landmarks.

The unique feature which the Shazam algorithm exhibits is its ability to group spectral peaks (Shazam’s audio fingerprint), which is of particular use in advertisement identification, as this allows multiple fingerprints to be detected or excluded on the same piece of audio.

This is very helpful in producing an identification match even when a radio presenter talks whilst the advertisement is playing.

VI. VALIDATION AND VERIFICATION

To verify the authenticity of the adapted Shazam algorithm, its search time was compared to that reported in Avery Wang’s paper [2].

It was not exactly the same, but this can be attributed to many different reasons, including the fact that the code was not entirely identical and the same computer was not used. The adapted code was also compared to Dan Ellis’ MATLAB code, which is also derived from Shazam’s algorithm, and the results were very similar.

As Avery Wang’s algorithm is not publicly available, the verification is only approximate, but as much as possible was derived from theory and the results show that this is a functional algorithm. Even if the adapted algorithm does not entirely match Avery Wang’s algorithm, it can still be indisputably classified as a Group 2 algorithm.
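
As a rough illustration of the scanning procedure in section IV and the threshold decision in section V, the following Python sketch probes one 5 minute segment with a 3 second sample every 15 seconds, converts a hit back to a broadcast time using the Hour_minute_second file name, and applies the landmark thresholds of 19 (wav) and 9 (mp3). All function names are hypothetical. Treating a count equal to the threshold as a match is an assumption, as is the exact numeric form of the file name.

```python
from datetime import timedelta

# Landmark thresholds reported in section V.
THRESHOLDS = {"wav": 19, "mp3": 9}

def parse_segment_start(filename):
    """Start time encoded in a name such as '2_35_0.wav' (Hour_minute_second)."""
    stem = filename.rsplit(".", 1)[0]
    hours, minutes, seconds = (int(part) for part in stem.split("_"))
    return timedelta(hours=hours, minutes=minutes, seconds=seconds)

def probe_offsets(segment_length_s=300, probe_s=3, step_s=15):
    """Offsets in seconds of the short probes taken from one segment."""
    return list(range(0, segment_length_s - probe_s + 1, step_s))

def airtime(filename, offset_s):
    """Broadcast time of a probe: segment start plus offset in the segment."""
    return parse_segment_start(filename) + timedelta(seconds=offset_s)

def is_match(landmark_count, audio_format):
    """Accept a candidate when its matching landmarks reach the threshold.

    Using >= rather than > is an assumption of this sketch.
    """
    return landmark_count >= THRESHOLDS[audio_format]
```

With these defaults a 5 minute segment yields 20 probes instead of the 100 needed when a sample is analyzed every 3 seconds, which is the source of the speed-up described in section IV.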
VII. CONCLUSION AND FUTURE WORK

A. Conclusion

The following was observed in the study.

The algorithm was implemented and, to verify its accuracy, it was compared to that of Avery Wang’s “An Industrial-Strength Audio Search Algorithm” [2] and Dan Ellis’ “Robust Landmark-Based Audio Fingerprinting” [6]. It was verified that the algorithms are correctly implemented.

The adapted code based on Wang’s algorithm is a clear option for advertisement identification. It was seen that decreasing the algorithm’s frame size will decrease speed but increase accuracy.

Defining more peaks in the Shazam algorithm’s frame (normally 5) would also result in better accuracy, but decreased speed. Increasing the overlapping regions in the algorithm will increase robustness but decrease speed.

The nearest false positive found with mp3 compression at 128 kbps averages 5 or 6, which strengthens James P. Ogle and Daniel P.W. Ellis’ theory that 9 peaks [8] are needed for a match. It was found that the nearest false positive when using the wav format was 8 and, as all matches had 26 peaks or higher, a safe threshold of 19 peaks was determined.

Overall, impressive results for the algorithm were obtained, analyzing radio signals 16 times faster than real-time, which in turn allows more data to be analyzed using larger databases. This will make the algorithm more lucrative for advertisement companies.

B. Future work

The following additional work can be done in the future:

Different algorithms should be compared to each other with respect to audio and advertisement identification, including robustness.

More application uses for audio fingerprinting should be investigated, e.g. gunshots, engine noise etc.

The prospects of using audio fingerprinting algorithms in cases where advertisements are read should be investigated.

The possibility of video identification using audio fingerprinting techniques (video fingerprinting), with the use of Avery Wang’s Shazam algorithm, should be explored.

The algorithm’s use in voice identification using specific thresholds should be investigated.

The coding of the algorithm should be examined with a view to increased speed and larger database sizes.

This work was completed at the Telkom Centre of Excellence at the NWU, and is funded by the HTBO THRIP project.

Heinrich van Nieuwenhuizen received his B.Eng degree in 2009 and is currently pursuing his M.Eng at the North-West University, Potchefstroom campus. His research interests include software design, audio fingerprinting and implementation and comparison of audio fingerprinting algorithms for industrial use.

VIII. BIBLIOGRAPHY

[1] Haitsma, Jaap and Kalker, Antonius, "A Highly Robust Audio Fingerprinting System." International Symposium on Music Information Retrieval (ISMIR), Eindhoven : s.n., 2002, pp. 107-115.
[2] Wang, Avery Li-Chun., "An Industrial-Strength Audio Search Algorithm." ISMIR, London : Shazam Entertainment, Ltd., 2003.
[3] Hatch, Wes., "A Quick Review of Audio Fingerprinting." March 2003.
[4] Doets, P.J.O, Gisbert, M. Menor and Lagendijk, R.L., "On the comparison of audio fingerprints for extracting quality parameters of compressed audio." [ed.] Edward J. Delp III and Ping Wah Wong. Delft : Security, Steganography, and Watermarking of Multimedia Contents VIII, 2006, Vol. 6072.
[5] Burges, Christopher J.C, Platt, John C and Jana, Soumya., "Distortion Discriminant Analysis for Audio." s.l. : IEEE TRANSACTIONS ON SPEECH AND AUDIO PROCESSING, ZZ, Issue Y, Vol. XX.
[6] Ellis, Daniel P.W., "Robust Landmark-Based Audio Fingerprinting." [Online] 2009. http://labrosa.ee.columbia.edu/matlab/fingerprint/.
[7] Cano, Pedro, et al., "Audio Fingerprinting: Concepts And Applications." Studies in Computational Intelligence (SCI), 2005, Issue 2, pp. 233-245.
[8] Ogle, James P. and Ellis, Daniel P.W., "Fingerprinting to identify repeated sound events in long-duration personal audio recordings." New York : s.n., 2007.
[9] http://www.coutant.org/shure300/index.html
