
How Does Shazam Work Coding Geek

The article explains how Shazam works through audio fingerprinting, detailing the basics of sound, music theory, and signal processing. It discusses the characteristics of sound waves, musical notes, and how digital music is created from analog signals. The author provides a comprehensive overview of the technical aspects involved in recognizing music through Shazam's algorithm.

Uploaded by

sajidayawar786

How does Shazam work - Coding Geek http://coding-geek.com/how-shazam-works/

A blog about IT, programming and Java

by Christophe | updated: August 6, 2015 | posted: May 23, 2015

Have you ever wondered how Shazam works? I asked myself this question a few years ago and read a research article written by Avery Li-Chun Wang, the cofounder of Shazam, to understand the magic behind Shazam. The quick answer is audio fingerprinting, which leads to another question: what is audio fingerprinting?

When I was a student, I never took a course in signal processing. To really understand Shazam (and not just have a vague idea) I had to start with the basics. This article is a summary of the research I did to understand Shazam.

I’ll start with the basics of music theory, present some signal processing concepts and end with the mechanisms behind Shazam. You don’t need any prior knowledge to read this article, but since it involves computer science and mathematics it’s better to have a good scientific background (especially for the last parts). If you already know what the words “octaves”, “frequencies”, “sampling” and “spectral leakage” mean, you can skip the first parts.

Since it’s a long and technical article (11k words), feel free to read each part at different times.

A sound is a vibration that propagates through air (or water) and can be decoded by ears. For example, when you listen to your mp3 player, the earphones produce vibrations that propagate through air until they reach your ears. Light is also a vibration, but you can’t hear it because your ears can’t decode it (but your eyes can).

A vibration can be modeled by sinusoidal waveforms. In this chapter, we’ll see how music can be physically/technically described.

Pure tones vs real sounds

A pure tone is a tone with a sinusoidal waveform. A sine wave is characterized by:

Its frequency: the number of cycles per second. Its unit is the Hertz (Hz), for example 100 Hz = 100 cycles per second.

Its amplitude (related to loudness for sounds): the size of each cycle.

Those characteristics are decoded by the human ear to form a sound. Humans can hear pure tones from 20 Hz to 20 000 Hz (for the best ears) and this range shrinks with age. By comparison, the light you see is composed of sine waves from 4*10^14 Hz to 7.9*10^14 Hz.

You can check the range of your ears with YouTube videos like this one that plays all the pure tones from 20 Hz to 20 kHz; in my case I can’t hear anything above 15 kHz.

The human perception of loudness depends on the frequency of the pure tone. For instance, a pure tone at amplitude 10 and frequency 30 Hz will sound quieter than a pure tone at amplitude 10 and frequency 1000 Hz. Human ears follow a psychoacoustic model; you can check this article on Wikipedia for more information.

Note: This fun fact will have consequences at the end of the article.

pure sinewave at 20 Hz

In this figure, you can see the representation of a pure sine wave of frequency 20 Hz and amplitude 1.

Pure tones don’t exist naturally, but every sound in the world is the sum of multiple pure tones at different amplitudes.

composition of sinewaves

In this figure, you can see the representation of a more realistic sound, which is the composition of multiple sine waves:

a pure sine wave of frequency 20 Hz and amplitude 1

a pure sine wave of frequency 40 Hz and amplitude 2

a pure sine wave of frequency 80 Hz and amplitude 1.5

a pure sine wave of frequency 160 Hz and amplitude 1

A real sound can be composed of thousands of pure tones.
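The composition of sine waves described above can be sketched in a few lines of Python. This is only an illustration: the function names and the 1000 samples-per-second rate are arbitrary choices for the example, while the four (frequency, amplitude) pairs come from the figure.

```python
import math

# (frequency in Hz, amplitude) pairs taken from the figure described above
PARTIALS = [(20, 1.0), (40, 2.0), (80, 1.5), (160, 1.0)]

def composite_sample(t, partials=PARTIALS):
    """Value of the composite sound at time t (in seconds)."""
    return sum(a * math.sin(2 * math.pi * f * t) for f, a in partials)

def composite_signal(duration=0.1, sample_rate=1000, partials=PARTIALS):
    """Evaluate the composite sound at regular steps (sample_rate is arbitrary)."""
    n = round(duration * sample_rate)
    return [composite_sample(i / sample_rate, partials) for i in range(n)]

signal = composite_signal()
print(len(signal), max(signal))
```

Adding or removing pairs in `PARTIALS` is enough to build other composite sounds.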

Musical Notes

A music score is a set of notes played at given moments. Those notes also have a duration and a loudness.

The notes are divided into octaves. In most Western countries, an octave spans 7 named notes (A, B, C, D, E, F, G in most English-speaking countries and Do, Re, Mi, Fa, Sol, La, Si in most Latin countries), the 8th note being the first note of the next octave, with the following property:

The frequency of a note in an octave doubles in the next octave. For example, the frequency of A4 (A in the 4th octave) at 440 Hz equals 2 times the frequency of A3 (A in the 3rd octave) at 220 Hz and 4 times the frequency of A2 (A in the 2nd octave) at 110 Hz.

Many instruments provide more than these 7 notes per octave; the interval between two adjacent notes is called a semitone or half step.

For the 4th octave (or 3rd octave in Latin countries), the notes have the following frequencies:

C4 (or Do3) = 261.63 Hz

D4 (or Re3) = 293.66 Hz

E4 (or Mi3) = 329.63 Hz

F4 (or Fa3) = 349.23 Hz

G4 (or Sol3) = 392 Hz

A4 (or La3) = 440 Hz

B4 (or Si3) = 493.88 Hz
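The frequencies listed above can be reproduced from the 440 Hz reference. The sketch below assumes twelve-tone equal temperament (each semitone multiplies the frequency by 2^(1/12)), which the article does not state explicitly; `note_frequency` and the offset table are names made up for the example.

```python
A4_HZ = 440.0  # reference pitch given in the article

# Semitone offsets from A within the same octave (equal temperament assumed)
SEMITONES_FROM_A = {"C": -9, "D": -7, "E": -5, "F": -4, "G": -2, "A": 0, "B": 2}

def note_frequency(name, octave):
    """Frequency in Hz of a natural note, e.g. note_frequency("C", 4)."""
    semitones = SEMITONES_FROM_A[name] + 12 * (octave - 4)
    return A4_HZ * 2 ** (semitones / 12)

for name in "CDEFGAB":
    print(f"{name}4 = {note_frequency(name, 4):.2f} Hz")
```

Moving one octave up or down changes the exponent by exactly 1, which is the doubling property described earlier.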

Though it might seem odd, the frequency sensitivity of ears is logarithmic. It means that:

between 32.70 Hz and 61.74 Hz (the 1st octave)

or between 261.63 Hz and 493.88 Hz (the 4th octave)

or between 2 093 Hz and 3 951.07 Hz (the 7th octave)

human ears will be able to detect the same number of notes.

FYI, the A4/La3 at 440 Hz is a standard reference for the calibration of acoustic equipment and musical instruments.

Timbre

The same note doesn’t sound exactly the same if it’s played by a guitar, a piano, a violin or a human singer. The reason is that each instrument has its own timbre for a given note.

For each instrument, the sound produced is a multitude of frequencies that sounds like a given note (the scientific term for a musical note is pitch). This sound has a fundamental frequency (the lowest frequency) and multiple overtones (any frequency higher than the fundamental).

Most instruments produce (close to) harmonic sounds. For those instruments, the overtones are integer multiples of the fundamental frequency and are called harmonics. For example, the composition of pure tones A2 (fundamental), A4 and A6 is harmonic whereas the composition of pure tones A2, B3 and F5 is inharmonic.

Many percussion instruments (like cymbals or drums) create inharmonic sounds.

Note: The pitch (the musical note perceived) might not be present in the sound played by an instrument. For example, if an instrument plays a sound with pure tones A4, A6 and A8, the human brain will interpret the resulting sound as an A2 note. This note/pitch will be an A2 whereas the lowest frequency in the sound is A4 (this fact is called the missing fundamental).

Spectrogram

A song is played by multiple instruments and singers. All those instruments produce a combination of sine waves at multiple frequencies, and the overall result is an even bigger combination of sine waves.

It is possible to see music with a spectrogram. Most of the time, a spectrogram is a 3-dimensional graph where:

on the horizontal (X) axis, you have the time,

on the vertical (Y) axis, you have the frequency of the pure tone,

the third dimension is described by a color and represents the amplitude of a frequency at a certain time.

For example, here is the sound of a piano playing a C4 note (whose fundamental frequency is 261.63 Hz):

And here is the associated spectrogram:

The color represents the amplitude in dB (we’ll see in a later chapter what it means).

As I told you in the previous chapter, though the note played is a C4, there are other frequencies than 261 Hz in this recording: the overtones. What’s interesting is that the other frequencies are multiples of the first one: the piano is an example of a harmonic instrument.

Another interesting fact is that the intensity of the frequencies changes over time. It’s another particularity of an instrument that makes it unique. If you take the same artist but replace the piano, the evolution of the frequencies won’t behave the same and the resulting sound will be slightly different, because each artist/instrument has its own style. Technically speaking, these evolutions of frequencies modify the envelope of the sound signal (which is a part of the timbre).

To give you a first idea of Shazam’s music fingerprinting algorithm, you can see in this spectrogram that some frequencies (the lowest ones) are more important than others. What if we kept just the strongest ones?

Unless you’re a vinyl lover, when you listen to music you’re using a digital file (mp3, Apple Lossless, ogg, audio CD, whatever). But when artists produce music, it is analog (not represented by bits). The music is digitized in order to be stored and played by electronic devices (like computers, phones, mp3 players, CD players…). In this part we’ll see how to go from an analog sound to a digital one. Knowing how digital music is made will help us analyze and manipulate it in the next parts.

Sampling

Analog signals are continuous signals, which means that if you take one second of an analog signal, you can divide this second into [put the greatest number you can think of, and I hope it’s a big one!] parts that last a fraction of a second. In the digital world, you can’t afford to store an infinite amount of information. You need a minimum unit, for example 1 millisecond. During this unit of time the sound cannot change, so this unit needs to be short enough for the digital song to sound like the analog one and long enough to limit the space needed to store the music.

For example, think about your favorite music. Now think about it with the sound changing only every 2 seconds: it sounds like nothing. Technically speaking, the sound is aliased. To be sure that your song sounds great, you can choose a very small unit like a nanosecond (10^-9 second). This time your music sounds great but you don’t have enough disk space to store it, too bad.

This problem is called sampling.

The standard unit of time in digital music is 44 100 units (or samples) per second. But where does this 44.1 kHz come from? Well, some dude thought it would be good to put 44 100 units per second and that’s all… I’m kidding of course.

In the first chapter I told you that humans can hear sounds from 20 Hz to 20 kHz. A theorem from Nyquist and Shannon states that if you want to digitize a signal from 0 Hz to 20 kHz you need at least 40 000 samples per second. The main idea is that a sine wave signal at a frequency F needs at least 2 points per cycle to be identified. If your sampling frequency is at least twice the frequency of your signal, you’ll end up with at least 2 points per cycle of the original signal.

Let’s try to understand with a picture. Look at this example of good sampling:

In this figure, a sound at 20 Hz is digitized using a 40 Hz sampling rate:

the blue curve represents the sound at 20 Hz,

the red crosses represent the sampled sound, which means I marked the blue curve with a red cross every 1/40 second,

the green line is an interpolation of the sampled sound.

Though it has neither the same shape nor the same amplitude, the frequency of the sampled signal remains the same.

And here is an example of bad sampling:

In this figure, a sound at 20 Hz is digitized with a 30 Hz sampling rate. This time the frequency of the sampled signal is not the same as the original signal: it’s only 10 Hz. If you look carefully, you can see that one cycle in the sampled signal represents two cycles in the original signal. This case is called undersampling.

This case also shows something else: if you want to digitize a signal between 0 Hz and 20 kHz, you need to remove the frequencies above 20 kHz from the signal before sampling. Otherwise those frequencies will be transformed into frequencies between 0 Hz and 20 kHz and therefore add unwanted sounds (this is called aliasing).
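The bad-sampling case can be checked numerically. The sketch below samples a 20 Hz sine at 30 Hz and compares the result with a 10 Hz sine sampled at the same rate; `sample_sine` and `alias_frequency` are illustrative names, and the folding rule in `alias_frequency` is the standard one for pure tones rather than something taken from the article.

```python
import math

def sample_sine(freq_hz, sample_rate_hz, n_samples):
    """Samples of a unit-amplitude sine wave taken at sample_rate_hz."""
    return [math.sin(2 * math.pi * freq_hz * n / sample_rate_hz) for n in range(n_samples)]

def alias_frequency(freq_hz, sample_rate_hz):
    """Apparent frequency of a pure tone after sampling (folded into 0..rate/2)."""
    folded = freq_hz % sample_rate_hz
    return min(folded, sample_rate_hz - folded)

original = sample_sine(20, 30, 12)
alias = sample_sine(10, 30, 12)
# The 20 Hz samples are the 10 Hz samples with the sign flipped,
# so once sampled the two tones cannot be told apart by frequency.
same = all(abs(a + b) < 1e-9 for a, b in zip(original, alias))
print(alias_frequency(20, 30), alias_frequency(20, 40), same)
```

With a 44 100 Hz sampling rate, the same rule says that a 25 kHz tone left in the signal would show up at 19.1 kHz, which is why those frequencies have to be filtered out first.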

To sum up, if you want a good conversion of music from analog to digital, you have to record the analog music at least 40 000 times per second. Hi-fi corporations (like Sony) chose 44.1 kHz during the 80s because it was above 40 000 Hz and compatible with the NTSC and PAL video standards. Other standards exist for audio, like 48 kHz (Blu-ray), 96 kHz or 192 kHz, but if you’re neither a professional nor an audiophile you’re likely to listen to 44.1 kHz music.

Note 1: The Nyquist-Shannon theorem is broader than what I said; you can check Wikipedia if you want to know more about it.

Note 2: The sampling rate needs to be strictly greater than 2 times the frequency of the signal to digitize because, in the worst-case scenario (sampling exactly at the zero crossings), you could end up with a constant digitized signal.

Quantization

We saw how to digitize the frequencies of analog music, but what about the loudness of music? Loudness is a relative measure: for the same loudness inside the signal, if you turn up your speakers the sound will be louder. The loudness measures the variation between the lowest and the highest level of sound inside a song.

The same problem appears with loudness: how do you go from a continuous world (with an infinite variation of volume) to a discrete one?

Imagine your favorite music with only 4 states of loudness: no sound, low sound, high sound and full power. Even the best song in the world becomes unbearable. What you’ve just imagined was a 4-level quantization.

Here is an example of a low quantization of an audio signal:

This figure presents an 8-level quantization. As you can see, the resulting sound (in red) is heavily altered. The difference between the real sound and the quantized one is called quantization error or quantization noise. This 8-level quantization is also called a 3-bit quantization because you only need 3 bits to implement the 8 different levels (8 = 2^3).

Here is the same signal with a 64-level quantization (or 6-bit quantization):

Though the resulting sound is still altered, it looks (and sounds) more like the original sound.

Thankfully, humans don’t have extra-sensitive ears. The standard quantization is coded on 16 bits, which means 65 536 levels. With a 16-bit quantization, the quantization noise is low enough for human ears.
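To get a feel for how the quantization error shrinks with the number of bits, here is a minimal sketch. It assumes amplitudes normalized to the range -1.0 to 1.0 and evenly spaced levels, which is a simplification chosen for the example; `quantize` and `max_quantization_error` are hypothetical helper names.

```python
import math

def quantize(value, bits):
    """Map a value in [-1.0, 1.0] to the nearest of 2**bits evenly spaced levels."""
    levels = 2 ** bits
    step = 2.0 / (levels - 1)
    index = round((value + 1.0) / step)  # level number, 0 .. levels - 1
    return -1.0 + index * step

def max_quantization_error(bits, n_points=1000):
    """Largest error observed over one cycle of a full-scale sine wave."""
    worst = 0.0
    for i in range(n_points):
        v = math.sin(2 * math.pi * i / n_points)
        worst = max(worst, abs(v - quantize(v, bits)))
    return worst

for bits in (3, 6, 16):
    print(bits, 2 ** bits, max_quantization_error(bits))
```

The error is bounded by half the distance between two levels, so each extra bit roughly halves it.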

Note: In the studio, the quantization used by professionals is 24 bits, which means there are 2^24 (about 16 million) possible variations of loudness between the lowest point of the track and the highest.

Note 2: I made some approximations in my examples concerning the number of quantization levels.

Pulse-Code Modulation

PCM, or Pulse-Code Modulation, is a standard that represents digital signals. It is used by compact discs and most electronic devices. For example, when you listen to an mp3 file on your computer/phone/tablet, the mp3 is automatically transformed into a PCM signal and then sent to your headphones.

A PCM stream is a stream of organized bits. It can be composed of multiple channels. For example, stereo music has 2 channels.

In a stream, the amplitude of the signal is divided into samples. The number of samples per second corresponds to the sampling rate of the music. For instance, music sampled at 44.1 kHz will have 44 100 samples per second. Each sample gives the (quantized) amplitude of the sound for the corresponding fraction of a second.

There are multiple PCM formats, but the most used one in audio is the (linear) PCM 44.1 kHz, 16-bit depth stereo format. This format has 44 100 samples for each second of music. Each sample takes 4 bytes:

2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the left speaker

2 bytes (16 bits) for the intensity (from -32,768 to 32,767) of the right speaker

In a PCM 44.1 kHz 16-bit depth stereo format, you have 44 100 samples like this one for every second of music.
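The 4-byte sample layout can be illustrated with Python’s `struct` module. The sketch assumes little-endian byte order with the left channel stored before the right one (the layout used by WAV files); the article does not specify either, and the helper names are made up for the example.

```python
import struct

# Assumed layout: little-endian, two signed 16-bit integers (left, then right)
FRAME = struct.Struct("<hh")

def pack_frames(frames):
    """Turn a list of (left, right) pairs into raw PCM bytes."""
    return b"".join(FRAME.pack(left, right) for left, right in frames)

def unpack_frames(data):
    """Turn raw PCM bytes back into a list of (left, right) pairs."""
    return [FRAME.unpack_from(data, offset) for offset in range(0, len(data), FRAME.size)]

def bytes_per_second(sample_rate=44100, channels=2, bytes_per_sample=2):
    """Storage needed for one second of uncompressed PCM audio."""
    return sample_rate * channels * bytes_per_sample

raw = pack_frames([(0, 0), (32767, -32768), (-1, 1)])
print(len(raw), unpack_frames(raw), bytes_per_second())
```

At 4 bytes per sample and 44 100 samples per second, one second of this format takes 176 400 bytes.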

You now know how to go from an analog sound to a digital one. But how can you get the frequencies inside a digital signal? This part is very important since the Shazam fingerprinting algorithm works only with frequencies.

For analog (and therefore continuous) signals, there is a transformation called the continuous Fourier transform. This function transforms a function of time into a function of frequencies. In other words, if you apply the Fourier transform to a sound, it will give you the frequencies (and their intensities) inside this sound.

But there are 2 problems:

We are dealing with digital sounds and therefore finite (non-continuous) sounds.

To have a better knowledge of the frequencies inside a piece of music, we need to apply the Fourier transform to small parts of the full-length audio signal, like 0.1-second parts, so that we know the frequencies for each 0.1-second part of an audio track.

Thankfully, there is another mathematical function, the Discrete Fourier Transform (DFT), that works with some limitations.

Note: The Fourier transform must be applied to only one channel, which means that if you have a stereo song you need to transform it into a mono song.

Discrete Fourier Transform

The DFT (Discrete Fourier Transform) applies to discrete signals and gives a discrete spectrum (the frequencies inside the signal).

Here is the magic formula to transform a digital signal into frequencies (don’t run away, I’ll explain it):

$$X(n) = \sum_{k=0}^{N-1} x(k)\, e^{-j \frac{2\pi k n}{N}}$$

In this formula:

N is the size of the window: the number of samples that compose the signal (we’ll talk a lot about windows in the next part).

X(n) represents the nth bin of frequencies

x(k) is the kth sample of the audio signal

For example, for an audio signal with a 4096-sample window this formula must be applied 4096 times:

1 time for n = 0 to compute the 0th bin of frequencies

1 time for n = 1 to compute the 1st bin of frequencies

1 time for n = 2 to compute the 2nd bin of frequencies
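Applying the formula once per bin translates almost directly into code. The sketch below is a naive implementation of the standard DFT definition, written for readability rather than speed (real programs use an FFT); the 64-sample toy signal is an arbitrary choice for the example.

```python
import cmath
import math

def dft(samples):
    """Naive O(N^2) DFT: X(n) = sum over k of x(k) * exp(-2j*pi*k*n/N)."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples) for k, x in enumerate(samples))
        for n in range(n_samples)
    ]

# Toy input: 64 samples containing exactly 4 cycles of a sine wave
N = 64
signal = [math.sin(2 * math.pi * 4 * k / N) for k in range(N)]
magnitudes = [abs(value) for value in dft(signal)]
strongest_bin = max(range(N // 2), key=lambda n: magnitudes[n])
print(strongest_bin, round(magnitudes[strongest_bin], 3))
```

Because the sine wave completes exactly 4 cycles in the window, its energy lands in bin 4 and the other bins of the lower half stay close to zero.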

As you might have noticed, I spoke about bins of frequencies and not frequencies. The reason is that the DFT gives a discrete spectrum. A bin of frequencies is the smallest unit of frequency the DFT can compute. The size of the bin (called spectral/spectrum resolution or frequency resolution) equals the sampling rate of the signal divided by the size of the window (N). In our example, with a 4096-sample window and a standard audio sampling rate of 44.1 kHz, the frequency resolution is 10.77 Hz (except for the 0th bin, which is special):

the 0th bin represents the frequencies between 0 Hz and 5.38 Hz

the 1st bin represents the frequencies between 5.38 Hz and 16.15 Hz

the 2nd bin represents the frequencies between 16.15 Hz and 26.92 Hz

the 3rd bin represents the frequencies between 26.92 Hz and 37.68 Hz

That means that the DFT can’t separate 2 frequencies that are closer than 10.77 Hz. For example, notes at 27 Hz, 32 Hz and 37 Hz end up in the same bin. If the note at 37 Hz is very powerful, you’ll just know that the 3rd bin is powerful. This is problematic for separating notes in the lowest octaves. For example:

an A1 (or La -1) is at 55 Hz whereas an A#1 is at 58.27 Hz (B1, or Si -1, is at 61.74 Hz) and a G1 (or Sol -1) is at 49 Hz.

the first note of a standard 88-key piano is an A0 at 27.5 Hz, followed by an A#0 at 29.14 Hz.
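To see which notes collide, you can compute the bin of each frequency. This is a small sketch assuming the 44.1 kHz / 4096-sample setup used in this part, with each bin centered on a multiple of the resolution; `bin_index` and `bin_range` are hypothetical helper names.

```python
SAMPLE_RATE = 44100
WINDOW_SIZE = 4096
RESOLUTION = SAMPLE_RATE / WINDOW_SIZE  # about 10.77 Hz per bin

def bin_index(freq_hz):
    """Index of the bin whose center is closest to freq_hz."""
    return round(freq_hz / RESOLUTION)

def bin_range(index):
    """(low, high) frequency bounds of a bin, in Hz."""
    low = max(0.0, (index - 0.5) * RESOLUTION)
    return low, (index + 0.5) * RESOLUTION

for name, freq in [("A0", 27.5), ("A#0", 29.14), ("G1", 49.0), ("A1", 55.0), ("A#1", 58.27)]:
    index = bin_index(freq)
    print(name, freq, index, bin_range(index))
```

With these settings the two lowest piano notes share one bin, and G1, A1 and A#1 share another, so the DFT output alone cannot tell them apart.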

You can improve the frequency resolution by increasing the window size, but that means losing track of fast frequency/note changes inside the music:

An audio signal has a sampling rate of 44.1 kHz.

Increasing the window means taking more samples and therefore increasing the time covered by the window.

With 4096 samples, the window duration is about 0.1 sec and the frequency resolution is 10.7 Hz: you can detect a change every 0.1 sec.

With 16384 samples, the window duration is 0.37 sec and the frequency resolution is 2.7 Hz: you can detect a change every 0.37 sec.
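The trade-off boils down to two divisions, which the following sketch makes explicit for a few window sizes (the function name is illustrative):

```python
SAMPLE_RATE = 44100  # Hz

def window_tradeoff(window_size, sample_rate=SAMPLE_RATE):
    """Return (duration in seconds, frequency resolution in Hz) for a window size."""
    return window_size / sample_rate, sample_rate / window_size

for size in (1024, 4096, 16384):
    duration, resolution = window_tradeoff(size)
    print(f"{size:6d} samples: {duration * 1000:6.1f} ms, {resolution:5.2f} Hz per bin")
```

The product of the two values is always 1: whatever you gain in frequency resolution, you lose in time resolution.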

Another particularity of an audio signal is that we only need half the bins computed by the DFT. In the previous example, the bin size is 10.77 Hz, which means that the 2047th bin represents the frequencies from 22 033.9 Hz to 22 044.6 Hz, just below half the sampling rate (22 050 Hz). Because the audio samples are real numbers, the upper half of the bins mirrors the lower half:

The 4095th bin gives the same information as the 1st bin

The 4094th bin gives the same information as the 2nd bin

The (4096 - X)th bin gives the same information as the Xth bin
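The redundancy of the upper bins can be verified on any real-valued signal: for N samples, bin N - k is the complex conjugate of bin k, so both have the same magnitude. The sketch below checks this with a naive DFT on 32 random samples; the seed and the signal length are arbitrary choices for the example.

```python
import cmath
import random

def dft(samples):
    """Naive DFT, good enough for a small demonstration."""
    n_samples = len(samples)
    return [
        sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples) for k, x in enumerate(samples))
        for n in range(n_samples)
    ]

random.seed(0)  # arbitrary seed so the run is repeatable
N = 32
signal = [random.uniform(-1.0, 1.0) for _ in range(N)]  # any real-valued signal works
spectrum = dft(signal)

# Bin N - k is the complex conjugate of bin k, so the magnitudes are equal
mirrored = all(abs(spectrum[N - k] - spectrum[k].conjugate()) < 1e-9 for k in range(1, N // 2))
print(mirrored)
```

This is why only the bins up to half the sampling rate are kept when analyzing audio.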

If you want to know why the bin resolution equals “the sampling rate” divided by “the size of the window” or why this formula is so bizarre, you can read a 5-part article on the Fourier Transform on this very good website (especially part 4 and part 5), which is the best article for beginners that I have read (and I have read a lot of articles on the matter).

Window functions

If you want to get the frequencies of a one-second sound for each 0.1-second part, you have to apply the Fourier Transform to the first 0.1-second part, then to the second 0.1-second part, then to the third 0.1-second part…

The problem

By doing so, you are implicitly applying a (rectangular) window function:

For the first 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0 and 0.1 second, and 0 for the rest

For the second 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0.1 and 0.2 second, and 0 for the rest

For the third 0.1 second, you are applying the Fourier transform to the full one-second signal multiplied by a function that equals 1 between 0.2 and 0.3 second, and 0 for the rest

Here is a visual example of the window function to apply to a digital (sampled) audio signal to get the first 0.01-second part:

In this figure, to get the frequencies for the first 0.01-second part, you need to multiply the sampled audio signal (in blue) by the window function (in green).

In this figure, to get the frequencies for the second 0.01-second part, you need to multiply the sampled audio signal (in blue) by the window function (in green).

By “windowing” the audio signal, you multiply your signal audio(t) by a window function window(t). This window function produces spectral leakage. Spectral leakage is the appearance of new frequencies that don’t exist inside the audio signal. The power of the real frequencies leaks to other frequencies.

Here is an informal (and very light) mathematical explanation. Let’s assume you want a part of the full audio signal. You multiply the audio signal by a window function that lets the sound through only for the part you want:

part_of_audio(t) = full_audio(t) . window(t)

When you try to get the frequencies of the part of audio, you apply the Fourier transform to the signal:

Fourier(part_of_audio(t)) = Fourier(full_audio(t) . window(t))

According to the convolution theorem (* represents the convolution operator and . the multiplication operator):

Fourier(full_audio(t) . window(t)) = Fourier(full_audio(t)) * Fourier(window(t))

-> Fourier(part_of_audio(t)) = Fourier(full_audio(t)) * Fourier(window(t))

-> The frequencies of part_of_audio(t) depend on the window() function used.

I won’t go deeper because it requires advanced mathematics. If you want to know more, look at this link: on page 29, the chapter “the truncate effects” presents the mathematical effect of applying a rectangular window to a signal. What you need to keep in mind is that cutting an audio signal into small parts to analyze the frequencies of each part produces spectral leakage.

Different types of windows

You can’t avoid spectral leakage but you can control how the leakage will behave by choosing the right window function: instead of using a rectangular window function, you can choose a triangular window, a Parzen window, a Blackman window, a Hamming window…
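To see how the choice of window changes the leakage, here is a rough experiment. It uses the usual textbook coefficients for the Hamming and Blackman windows (0.54/0.46 and 0.42/0.5/0.08), a 128-sample window and a tone that falls halfway between two bins; all of these are assumptions chosen for the illustration, and `leaked_fraction` is a made-up measure, not something from the Shazam paper.

```python
import cmath
import math

def rectangular(n_samples):
    return [1.0] * n_samples

def hamming(n_samples):
    return [0.54 - 0.46 * math.cos(2 * math.pi * k / (n_samples - 1)) for k in range(n_samples)]

def blackman(n_samples):
    return [
        0.42 - 0.5 * math.cos(2 * math.pi * k / (n_samples - 1))
        + 0.08 * math.cos(4 * math.pi * k / (n_samples - 1))
        for k in range(n_samples)
    ]

def power_spectrum(samples):
    """Power of the first N/2 bins, computed with a naive DFT."""
    n_samples = len(samples)
    return [
        abs(sum(x * cmath.exp(-2j * cmath.pi * k * n / n_samples) for k, x in enumerate(samples))) ** 2
        for n in range(n_samples // 2)
    ]

def leaked_fraction(window, n_samples=128, cycles=10.5, guard=4):
    """Share of the power that lands more than `guard` bins away from the peak."""
    tone = [math.sin(2 * math.pi * cycles * k / n_samples) for k in range(n_samples)]
    power = power_spectrum([s * w for s, w in zip(tone, window(n_samples))])
    peak = max(range(len(power)), key=lambda n: power[n])
    far = sum(p for n, p in enumerate(power) if abs(n - peak) > guard)
    return far / sum(power)

for window in (rectangular, hamming, blackman):
    print(window.__name__, leaked_fraction(window))
```

In this setup the rectangular window sends a noticeably larger share of the power far away from the peak than the two tapered windows, at the cost of nothing near the peak; the tapered windows do the opposite and widen the peak itself.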

The rectangular window is the easiest window to use (because you just have to “cut” the audio signal into small parts), but for analyzing the most important frequencies in a signal it might not be the best type of window. Let’s have a look at 3 types of windows: rectangular, Hamming and Blackman. In order to analyze the effect of the 3 windows, we will use the following audio signal, composed of:

A frequency of 40 Hz with an amplitude of 2

A frequency of 160 Hz with an amplitude of 0.5

A frequency of 320 Hz with an amplitude of 8

A frequency of 640 Hz with an amplitude of 1

A frequency of 1000 Hz with an amplitude of 1

A frequency of 1225 Hz with an amplitude of 0.25

A frequency of 1400 Hz with an amplitude of 0.125

A frequency of 2000 Hz with an amplitude of 0.125

A frequency of 2500 Hz with an amplitude of 1.5

In a perfect world, the Fourier transform of this signal should give us the following spectrum:

This figure shows a spectrum with only 9 vertical lines (at 40 Hz, 160 Hz, 320 Hz, 640 Hz, 1000 Hz, 1225 Hz, 1400 Hz, 2000 Hz and 2500 Hz). The y axis gives the amplitude in decibels (dB), which means the scale is logarithmic. With this scale, a sound at 60 dB is 100 times more powerful than a sound at 40 dB and 10 000 times more powerful than a sound at 20 dB. To give you an idea, when you speak in a quiet room, the sound you produce is 20-30 dB higher (at 1 m from you) than the sound of the room.
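The decibel figures above follow from the definition of the scale for power: a difference of 10 dB is a factor of 10. A minimal sketch, with function names chosen for the example:

```python
import math

def power_ratio_to_db(ratio):
    """Convert a power ratio to a difference in decibels."""
    return 10 * math.log10(ratio)

def db_to_power_ratio(db):
    """Convert a difference in decibels to a power ratio."""
    return 10 ** (db / 10)

print(db_to_power_ratio(60 - 40), db_to_power_ratio(60 - 20), power_ratio_to_db(2))
```

A 20 dB gap is a factor of 100 in power and a 40 dB gap a factor of 10 000, matching the numbers in the text.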

In order to plot this “perfect” spectrum, I applied the Fourier Transform with a very long window: a 10-second window. Using a very long window reduces spectral leakage, but 10 seconds is too long because in a real song the sound changes much faster. To give you an idea of how fast music changes:

here is a video with 1 change (or beat) per second; it sounds slow but it’s a common rhythm for classical music.

here is a video with 2.7 changes per second; it sounds much faster but this rhythm is common for electro music.

here is a video with 8.3 changes per second; it’s a very (very) fast rhythm but possible for small parts of songs.

In order to capture those fast changes, you need to “cut” the sound into very small parts using window functions. Imagine you want to analyze the frequencies of a sound every 1/3 second.

In this figure, you can multiply the audio signal by one of the 3 window types to get the part of the signal between 0.333 sec and 0.666 sec. As I said, using a rectangular window is like cutting the signal between 0.333 sec and 0.666 sec, whereas with the Hamming or the Blackman windows you need to multiply the signal by the window signal.

Now, here is the spectrum of the previous audio signal with a 4096-sample window:

The signal is sampled at 44100 Hz, so a 4096-sample window represents a 93-millisecond part (4096/44100) and a frequency resolution of 10.7 Hz.

This figure shows that all windows modify the real spectrum of
the sound. We clearly see that a part of the power of the real
frequencies is spread to their neighbours. The spectrum
from the rectangular window is the worst since its spectrum
leakage is much higher than with the 2 others. It’s especially true
between 40 and 160 Hz. The Blackman window gives the
closest spectrum to the real spectrum.

Here is the same example with a Fourier Transform of a
1024-sample window:

The signal is sampled at 44100 Hz so a 1024-sample window
represents a 23-millisecond part (1024/44100) and a frequency
resolution of 43 Hz.
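
The two numbers quoted for each window come from the same pair of divisions. A minimal sketch (window_stats is a name I made up):

```python
def window_stats(sample_rate_hz: float, window_size: int):
    """Return (duration in seconds, frequency resolution in Hz) of a window."""
    duration_s = window_size / sample_rate_hz
    resolution_hz = sample_rate_hz / window_size
    return duration_s, resolution_hz


long_window = window_stats(44100, 4096)   # about 0.093 s and 10.8 Hz
short_window = window_stats(44100, 1024)  # about 0.023 s and 43 Hz
```

The two values are inverses of each other: a shorter window follows the music better in time but gives wider frequency bins.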

This time the rectangular window gives the best spectrum. With
the 3 windows the 160 Hz frequency is hidden by the spectrum
leakage produced by the 40 Hz and 320 Hz frequencies. The
Blackman window gives the worst result with a 1225 Hz
frequency close to invisible.

Comparing both figures shows that the spectrum leakage
increases (for all the window functions) as the frequency
resolution gets coarser (from 10.7 Hz to 43 Hz per bin). The
fingerprint algorithm used by Shazam looks for the loudest
frequencies inside an audio track. Because of spectrum leakage,
we can’t just take the X highest frequencies. In the last example,
the 3 loudest frequencies are approximately 320 Hz, 277 Hz
(320-43) and 363 Hz (320+43) whereas only the 320 Hz
frequency exists.


Which window is the best?

There are no “best” or “worst” windows. Each window has its
specificities and depending on the problem you might want to
use a certain type.

A rectangular window has excellent resolution characteristics
for sinusoids of comparable strength, but it is a poor choice for
sinusoids of disparate amplitudes (which is the case inside a
song because the musical notes don’t have the same loudness).

Windows like Blackman are better at preventing the case
where the spectrum leakage of strong frequencies hides weak
frequencies. But these windows deal badly with noise, since
noise will hide more frequencies than with a rectangular window.
This is problematic for an algorithm like Shazam that needs to
handle noise (for instance when you Shazam a song in a bar or
outdoors, there is a lot of noise).

A Hamming window is between these two extremes and is (in
my opinion) a better choice for an algorithm like Shazam.

Here are some useful links to go deeper on window functions
and spectrum leakage:

http://en.wikipedia.org/wiki/Spectral_leakage

http://en.wikipedia.org/wiki/Window_function

http://web.mit.edu/xiphmont/Public/windows.pdf

Fast Fourier Transform and time complexity

the problem

If you look again at the DFT formula (don’t worry, it’s the last
time you see it), you can see that to compute one bin you need
to do N additions and N multiplications (where N is the size of
the window). Getting the N bins requires 2 * N^2 operations,
which is a lot.
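
To see where the N^2 comes from, here is a minimal sketch of a direct DFT written from the standard textbook definition (naive_dft is my own name for it). Each of the N bins needs a sum over the N samples, hence the two nested loops.

```python
import cmath


def naive_dft(x):
    """Direct DFT: N bins, each computed from all N samples (O(N^2))."""
    n = len(x)
    bins = []
    for k in range(n):          # one pass per bin
        total = 0j
        for t in range(n):      # N multiplications and N additions per bin
            total += x[t] * cmath.exp(-2j * cmath.pi * k * t / n)
        bins.append(total)
    return bins
```

It is only meant to be read: on a 4096-sample window the inner line runs about 16.8 million times.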

For example, let’s assume you have a three-minute song at
44.1 kHz and you compute the spectrogram of the song with a
4096-sample window. You’ll have to compute 10.7
(44100/4096) DFTs per second so 1938 DFTs for the full song.
Each DFT needs 3.35*10^7 operations (2*4096^2). To get the
spectrogram of the song you need to do 6.5*10^10 operations.

Let’s assume you have a music collection of 1000
three-minute-long songs: you’ll need 6.5*10^13 operations to get
the spectrograms of your songs. Even with a good processor, it
would take days/months to get the result.

Thankfully, there are faster implementations of the DFT called
FFT (Fast Fourier Transforms). Some implementations require
just 1.5*N*log(N) operations (with a base-2 logarithm). For the
same music collection, using the FFT instead of the DFT requires
roughly 450 times fewer operations (1.43*10^11) and it would
take minutes/hours to get the result.
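
The operation counts above can be reproduced with a few lines. This is only a sketch of the arithmetic: spectrogram_ops is a made-up helper, it assumes non-overlapping windows and a base-2 logarithm in the 1.5*N*log(N) estimate.

```python
import math


def spectrogram_ops(song_seconds, sample_rate, window, n_songs=1, use_fft=False):
    """Rough operation count to compute the spectrograms of n_songs songs."""
    windows_per_song = song_seconds * sample_rate / window
    if use_fft:
        ops_per_window = 1.5 * window * math.log2(window)  # FFT estimate
    else:
        ops_per_window = 2 * window ** 2                   # direct DFT
    return n_songs * windows_per_song * ops_per_window


dft_cost = spectrogram_ops(180, 44100, 4096, n_songs=1000)                # about 6.5e13
fft_cost = spectrogram_ops(180, 44100, 4096, n_songs=1000, use_fft=True)  # about 1.43e11
```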

This example shows another tradeoff: though increasing the
size of the window improves the frequency resolution, it also
increases the computation time. For the same music collection,
if you compute the spectrogram using a 512-sample window
(frequency resolution of 86 Hz), you get the result with the FFT
in 1.07*10^11 operations, approximately 25% fewer than
with a 4096-sample window (frequency resolution of 10.77 Hz).

This time complexity is important since when you shazam a
sound, your phone needs to compute the spectrogram of the
recorded audio and a mobile processor is less powerful than a
desktop processor.

downsampling

Thankfully, there is a trick to keep the frequency resolution and
reduce the window size at the same time: it’s called
downsampling. Let’s take a standard song at 44100 Hz: if you
resample it at 11025 Hz (44100/4) you will get the same
frequency resolution whether you do an FFT on the 44.1 kHz song
with a 4096-sample window or an FFT on the 11 kHz resampled
song with a 1024-sample window. The only difference is that the
resampled song will only have frequencies from 0 to 5 kHz. But
the most important part of a song is between 0 and 5 kHz. In
fact most of you won’t hear a big difference between music at
11 kHz and music at 44.1 kHz. So, the most important
frequencies are still in the resampled song, which is what
matters for an algorithm like Shazam.

Downsampling a 44.1 kHz song to an 11.025 kHz one is not very
difficult: a simple way to do it is to take the samples by groups
of 4 and to transform each group into just one sample by taking
the average of the 4 samples. The only tricky part is that before
downsampling a signal, you need to filter the higher
frequencies in the sound to avoid aliasing (remember the
Nyquist-Shannon theorem). This can be done by using a digital
low-pass filter.
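
Here is a minimal sketch of the “average by groups of 4” idea. The function name is mine, leftover samples at the end are simply dropped, and the sketch does not include the low-pass filter: averaging attenuates high frequencies only crudely, so a real implementation should filter first.

```python
def downsample_by_averaging(samples, factor=4):
    """Replace each group of `factor` samples by its average.

    Assumes the signal was already low-pass filtered to avoid aliasing.
    """
    usable = len(samples) - len(samples) % factor  # drop the incomplete last group
    return [sum(samples[i:i + factor]) / factor
            for i in range(0, usable, factor)]
```

Applied to one second of 44.1 kHz audio (44100 samples) with factor=4, it returns 11025 samples.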

FFT

But let’s go back to the FFT. The simplest implementation of the
FFT is the radix-2 Cooley–Tukey algorithm, which is a divide and
conquer algorithm. The idea is that instead of directly
computing the Fourier Transform on the N-sample window, the
algorithm:

divides the N-sample window into 2 N/2-sample windows

computes (recursively) the FFT for the 2 N/2-sample
windows

computes efficiently the FFT for the N-sample window
from the 2 previous FFTs

The last part only costs N operations using a mathematical trick
on the roots of unity (the exponential terms).

Here is a readable version of the FFT (written in Python) that I
found on Wikipedia:

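from cmath import exp, pi


def fft(x):
    # Recursive radix-2 FFT: len(x) must be a power of 2
    N = len(x)
    if N == 1:
        return x

    even = fft([x[k] for k in range(0, N, 2)])  # FFT of the even-indexed samples
    odd = fft([x[k] for k in range(1, N, 2)])   # FFT of the odd-indexed samples

    M = N // 2
    l = [even[k] + exp(-2j * pi * k / N) * odd[k] for k in range(M)]
    r = [even[k] - exp(-2j * pi * k / N) * odd[k] for k in range(M)]

    return l + r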

For more information on the FFT, you can check this article on
Wikipedia.

We’ve seen a lot of stuff during the previous parts. Now, we’ll
put everything together to explain how Shazam quickly
identifies songs (at last!). I’ll first give you a global overview of
Shazam, then I’ll focus on the generation of the fingerprints and
I’ll finish with the efficient audio search mechanism.

Note: From now on, I assume that you read the parts on
musical notes, FFT and window functions. I’ll sometimes use
the words “frequency”, “bin”, “note” or the full expression “bin
of frequencies” but it’s the same concept since we’re dealing
with digital audio signals.

Global overview

An audio fingerprint is a digital summary that can be used to
identify an audio sample or quickly locate similar items in an
audio database. For example, when you’re humming a song to
someone, you’re creating a fingerprint because you’re
extracting from the music what you think is essential (and if
you’re a good singer, the person will recognize the song).

Before going deeper, here is a simplified architecture of what
Shazam might be. I don’t work at Shazam so it’s only a guess
(from the 2003 paper of the co-founder of Shazam):

On the server side:

Shazam precomputes fingerprints from a very big database
of music tracks.

All those fingerprints are put in a fingerprint database
which is updated whenever a new song is added to the
song database.

On the client side:

when a user uses the Shazam app, the app first records
the current music with the phone microphone

the phone applies the same fingerprinting algorithm
as Shazam on the record

the phone sends the fingerprint to Shazam

Shazam checks if this fingerprint matches one of its
fingerprints

If no, it informs the user that the music can’t be found

If yes, it looks for the metadata associated with the
fingerprints (name of the song, iTunes URL, Amazon URL
…) and gives it back to the user.

The key points of Shazam are:

being noise/fault tolerant:

because the music recorded by a phone in a
bar/outdoors has a bad quality,

because of the artifacts due to window functions,

because of the cheap microphone inside a phone that
produces noise/distortion

because of many physical phenomena I’m not aware of

fingerprints need to be time invariant: the fingerprint of a
full song must be able to match with just a 10-second
record of the song

fingerprint matching needs to be fast: who wants to wait
minutes/hours to get an answer from Shazam?

having few false positives: who wants to get an answer
that doesn’t correspond to the right song?

Spectrogram filtering

Audio fingerprints differ from standard computer fingerprints
like SHA or MD5 because two different files (in terms of bits)
that contain the same music must have the same audio
fingerprint. For example a song in a 256kbit AAC format
(iTunes) must give the same fingerprint as the same song in a
256kbit MP3 format (Amazon) or in a 128kbit WMA format
(Microsoft). To solve this problem, audio fingerprinting
algorithms use the spectrogram of audio signals to
extract fingerprints.

Getting our spectrogram

I told you before that to get the spectrogram of a digital sound
you need to apply an FFT. For a fingerprinting algorithm we need
a good frequency resolution (like 10.7 Hz) to reduce spectrum
leakage and have a good idea of the most important notes
played inside the song. At the same time, we need to reduce
the computation time as far as possible and therefore use the
lowest possible window size. In the research paper from
Shazam, they don’t explain how they get the spectrogram but
here is a possible solution:

On the server side (Shazam), the 44.1 kHz sampled sound (from
CD, MP3 or whatever sound format) needs to pass from stereo
to mono. We can do that by taking the average of the left
speaker and the right one. Before downsampling, we need to
filter the frequencies above 5 kHz to avoid aliasing. Then, the
sound can be downsampled at 11.025 kHz.

On the client side (phone), the sampling rate of the microphone
that records the sound needs to be at 11.025 kHz.

Then, in both cases we need to apply a window function to the
signal (like a Hamming 1024-sample window, read the chapter
on window functions to see why) and apply the FFT for every
1024 samples. By doing so, each FFT analyses 0.1 second of
music. This gives us a spectrogram:

from 0 Hz to 5000 Hz

with a bin size of 10.7 Hz,

512 possible frequencies

and a unit of time of 0.1 second.
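
The figures in this list follow directly from the sampling rate and the window size. Here is a minimal sketch that derives them; spectrogram_shape and its dictionary keys are names I chose, and it assumes non-overlapping windows as described above.

```python
def spectrogram_shape(n_samples, sample_rate=11025, window=1024):
    """Derive the dimensions of a spectrogram built from non-overlapping windows."""
    return {
        "frames": n_samples // window,     # number of FFTs (time axis)
        "bins": window // 2,               # usable frequency bins per FFT
        "bin_hz": sample_rate / window,    # width of one bin
        "frame_s": window / sample_rate,   # duration covered by one FFT
        "max_hz": sample_rate / 2,         # Nyquist limit of the resampled signal
    }


ten_second_record = spectrogram_shape(10 * 11025)
```

The Nyquist limit is slightly above 5000 Hz (5512.5 Hz); the content between 5 kHz and that limit has been removed by the low-pass filter.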

Filtering

At this stage we have the spectrogram of the song. Since
Shazam needs to be noise tolerant, only the loudest notes
are kept. But you can’t just keep the X most powerful
frequencies every 0.1 second. Here are some reasons:

In the beginning of the article I spoke about
psychoacoustic models. Human ears have more difficulty
hearing a low sound (<500 Hz) than a mid sound (500 Hz-
2000 Hz) or a high sound (>2000 Hz). As a result, the low
sounds of many “raw” songs are artificially increased
before being released. If you only take the most powerful
frequencies you’ll end up with only the low ones, and if 2
songs have the same drum part, they might have a
very close filtered spectrogram even though there are flutes
in the first song and guitars in the second.

We saw in the chapter on window functions that if you
have a very powerful frequency, other powerful frequencies
close to this one will appear on the spectrum whereas
they don’t exist (because of spectrum leakage). You
must be able to take only the real one.

Here is a simple way to keep only strong frequencies while
reducing the previous problems:

step 1 – For each FFT result, you put the 512 bins inside 6
logarithmic bands:

the very low sound band (from bin 0 to 10)

the low sound band (from bin 10 to 20)

the low-mid sound band (from bin 20 to 40)

the mid sound band (from bin 40 to 80)

the mid-high sound band (from bin 80 to 160)

the high sound band (from bin 160 to 511)

step 2 – For each band you keep the strongest bin of
frequencies.

step 3 – You then compute the average value of these 6
powerful bins.

step 4 – You keep the bins (from the 6 ones) that are above this
mean (multiplied by a coefficient).

Step 4 is very important because you might have:

an a cappella song involving soprano singers with only
mid or mid-high frequencies

a jazz/rap song with only low and low-mid frequencies

…

And you don’t want to keep a weak frequency in a band just
because this frequency is the strongest of its band.
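
The four steps can be sketched as follows for a single FFT result. This is only an illustration under my own assumptions: the input is a list of 512 magnitudes, the upper bound of each band is treated as exclusive (the ranges above share their boundary bins), and BANDS, filter_frame and coefficient are hypothetical names.

```python
# step 1: the 6 logarithmic bands, as (first bin, last bin + 1)
BANDS = [(0, 10), (10, 20), (20, 40), (40, 80), (80, 160), (160, 512)]


def filter_frame(magnitudes, coefficient=1.0):
    """Return the bin indices kept for one 512-bin FFT result."""
    # step 2: strongest bin of each band
    strongest = [max(range(lo, hi), key=lambda b: magnitudes[b])
                 for lo, hi in BANDS]
    # step 3: average value of these 6 bins
    mean = sum(magnitudes[b] for b in strongest) / len(strongest)
    # step 4: keep only the bins above the (weighted) mean
    return [b for b in strongest if magnitudes[b] > mean * coefficient]
```

A frame with energy only in two bands keeps two bins at most, and a silent frame keeps none, which is the behaviour step 4 is meant to guarantee.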

But this algorithm has a limitation. In most songs some parts
are very weak (like the beginning or the end of a song). If you
analyze these parts you’ll end up with false strong frequencies
because the mean value (computed at step 3) of these parts is
very low. To avoid that, instead of taking the mean of the 6
powerful bins of the current FFT (that represents only 0.1 sec
of the song) you could take the mean of the most powerful bins
of the full song.

To summarize, by applying this algorithm we’re filtering the
spectrogram of the song to keep the peaks of energy in the
spectrum that represent the loudest notes. To give you a visual
idea of what this filtering is, here is a real spectrogram of a
14-second song.

This figure is from the Shazam research article. In this
spectrogram, you can see that some frequencies are more
powerful than others. If you apply the previous algorithm on the
spectrogram here is what you’ll get:

This figure (still from the Shazam research article) is a filtered
spectrogram. Only the strongest frequencies from the previous
figure are kept. Some parts of the song have no frequency (for
example between 4 and 4.5 seconds).

The number of frequencies in the filtered spectrogram depends
on the coefficient used with the mean during step 4. It also
depends on the number of bands you use (we used 6 bands but
we could have used another number).

At this stage, the intensity of the frequencies is useless.
Therefore, this spectrogram can be modeled as a 2-column table
where

the first column represents the frequency inside the
spectrogram (the Y axis)

the second column represents the time when the
frequency occurred during the song (the X axis)

This filtered spectrogram is not the final fingerprint but it’s a
huge part of it. Read the next chapter to know more.

Note: I gave you a simple algorithm to filter the spectrogram. A
better approach could be to use a logarithmic sliding window
and to keep only the most powerful frequencies above the
mean + the standard deviation (multiplied by a coefficient) of a
moving part of the song. I used this approach when I did my
own Shazam prototype but it’s more difficult to explain (and
I’m not even sure that what I did was correct…).


Storing Fingerprints

We’ve just ended up with a filtered spectrogram of a song. How
can we store and use it in an efficient way? This part is where
the power of Shazam lies. To understand the problem, I’ll
present a simple approach where I search for a song by using
directly the filtered spectrograms.

Simple search approach

Pre-step: I precompute a database of filtered spectrograms for
all the songs in my computer

Step 1: I record a 10-second part of a song from TV on my
computer

Step 2: I compute the filtered spectrogram of this record

Step 3: I compare this “small” spectrogram with the “full”
spectrogram of each song. How can I compare a 10-second
spectrogram with a spectrogram of a 180-second song? Instead
of losing myself in a bad explanation, here is a visual
explanation of what I need to do.

Visually speaking, I need to superpose the small spectrogram
everywhere inside the spectrogram of the full song to check if
the small spectrogram matches with a part of the full one.

And I need to do this for each song until I find a perfect match.

In this example, there is a perfect match between the record
and the end of the song. If it’s not the case, I have to compare
the record with another song and so on until I find a perfect
match. If I don’t find a perfect match I can choose the closest
match I found (in all the songs) if the matching rate is above a
threshold. For instance, if the best match I found gives me a
90% similarity between the record and a part of a song, I
can assume it’s the right song because the 10% of
non-similarity is certainly due to external noise.

Though it works well, this simple approach requires a lot of
computation time. It needs to compute all the possibilities of
matching between the 10-second record and each song in the
collection. Let’s assume that on average music contains 3 peak
frequencies per 0.1 second. Therefore, the filtered
spectrogram of the 10-second record has 300 time-frequency
points. In the worst case scenario, you’ll need 300 * 300 * 30 *
S operations to find the right song, where S is the number of
seconds of music in your collection. If like me you have 30k
songs (7 * 10^6 seconds of music) it might take a long time,
and it’s harder for Shazam with its 40-million-song collection
(it’s a guess, I couldn’t find the current size of Shazam’s
collection).

So, how does Shazam do it efficiently?

Target zones

Instead of comparing each point one by one, the idea is to look
for multiple points at the same time. In the Shazam paper, this
group of points is called a target zone. The paper from Shazam
doesn’t explain how to generate these target zones but here is
a possibility. For the sake of comprehension I’ll fix the size of
the target zone at 5 frequency-time points.

In order to be sure that both the record and the full song will
generate the same target zones, you need an order relation
between the time-frequency points in a filtered spectrogram.
Here is one:

If two time-frequency points have the same time, the
time-frequency point with the lowest frequency is before
the other one.

If a time-frequency point has a lower time than
another point then it is before.
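
In code, this order relation is simply a sort on time first and frequency second. A minimal sketch, assuming each point is stored as a (time, frequency) tuple (a representation I chose for the example):

```python
def order_points(points):
    """Sort (time, frequency) points: earlier time first, then lower frequency."""
    return sorted(points, key=lambda p: (p[0], p[1]))


labeled = list(enumerate(order_points([(2, 10), (1, 30), (1, 20), (0, 40)])))
```

The index given by enumerate plays the role of the label of each point.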

Here is what you get if you apply this order on the simplified
spectrogram we saw before:

In this figure I labeled all the time-frequency points using this
order relation. For example:

The point 0 is before any other point in the spectrogram.

The point 2 is after points 0 and 1 but before all the others.

Now that the spectrograms can be inner-ordered, we can
create the same target zones on different spectrograms with the
following rule: “To generate target zones in a spectrogram, you
need for each time-frequency point to create a group composed
of this point and the 4 points after it”. We’ll end up with
approximately the same number of target zones as the number
of points. This generation is the same for the songs or the
record.
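
The rule can be sketched as a sliding window over the ordered points. The function name and the choice to skip the last points (which don’t have 4 points after them) are my own assumptions.

```python
def target_zones(ordered_points, zone_size=5):
    """One zone per point: the point itself and the (zone_size - 1) points after it.

    Points too close to the end to form a full zone are skipped.
    """
    last_start = len(ordered_points) - zone_size
    return [ordered_points[i:i + zone_size] for i in range(last_start + 1)]
```

With 300 points and a zone size of 5 this gives 296 zones, hence “approximately the same number of target zones as the number of points”.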

In this simplified spectrogram, you can see the different target
zones generated by the previous algorithm. Since the target
size is 5, most of the points belong to 5 target zones (except
the points at the beginning and the end of the spectrogram).

Note: I didn’t understand at first why for the record we needed
to compute that many target zones. We could generate target
zones with a rule like “for each point whose label is a multiple
of 5 you need to create a group composed of this frequency and
the 4 frequencies after it”. With this rule, the number of target
zones would be divided by 5 and so would the search time
(explained in the next part). The only reason I found is that
computing all the possible zones on both the record and the
song greatly increases the noise robustness.


Address generation

We now have multiple target zones, what do we do next? We
create for each point an address based on those target zones.
In order to create those addresses, we also need an anchor
point per target zone. Again, the paper doesn’t explain how to
do it. I propose this anchor point to be the 3rd point before the
target zone. The anchor can be anywhere as long as the way it
is generated is reproducible (which it is thanks to our order
relation).

In this picture I plotted 2 target zones with their anchor points.
Let’s focus on the purple target zone. The address formula
proposed by Shazam is the following one:

[“frequency of the anchor”;“frequency of the point”;“delta
time between the anchor and the point”].

For the purple target zone:

the address of point 6 is [“frequency of point 3”;“frequency of
point 6”;“delta_time between point 3 & point 6”] so
concretely [10;30;1],

the address of point 7 is [10;30;2].

Both points also appear in the brown target zone; their
addresses with this target zone are [10;30;2] for point 6 and
[10;30;3] for point 7.
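
Here is a minimal sketch of the address generation, using the anchor rule proposed above (the 3rd point before the target zone). Everything in it is an assumption of mine for the sake of the example: points are (time, frequency) tuples already sorted with the order relation, zones that have no point 3 places before them are skipped, and the names are made up.

```python
def addresses(ordered_points, zone_size=5, anchor_offset=3):
    """Return a list of (address, anchor_time) pairs.

    address = (anchor frequency, point frequency, point time - anchor time)
    """
    result = []
    last_start = len(ordered_points) - zone_size
    for start in range(anchor_offset, last_start + 1):
        anchor_time, anchor_freq = ordered_points[start - anchor_offset]
        for time, freq in ordered_points[start:start + zone_size]:
            result.append(((anchor_freq, freq, time - anchor_time), anchor_time))
    return result
```

On the server side the anchor time would be stored together with the song id; on the client side it is the time of the anchor in the record.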


I spoke about addresses, right? That means that those
addresses are linked to something. In the case of the full songs
(so only on the server side), those addresses are linked to the
following couple [“absolute time of the anchor in the song”;“Id
of the song”]. In our simple example with the 2 previous points
we have the following result:

[ 10; 30; 1] –> [ 2; 1]

[ 10; 30; 2] –> [ 2; 1]

[ 10; 30; 2] –> [ 1; 1]

[ 10; 30; 3] –> [ 1; 1]

If you apply the same logic for all the points of all the target
zones of all the song spectrograms, you’ll end up with a very
big table with 2 columns:

the addresses

the couples (“time of anchor”; “song Id”).

This table is the fingerprint database of Shazam. If on
average a song contains 30 peak frequencies per second and
the size of the target zone is 5, the size of this table is 5 * 30
* S where S is the number of seconds of the music collection.

If you remember, we used an FFT with 1024 samples, which
means that there are only 512 possible frequency values. Those
frequencies can be coded in 9 bits (2^9 = 512). Assuming that
the delta time is in milliseconds, it will never be over 16
seconds because it would imply a song with a 16-second part
without music (or very low sound). So, the delta time can be
coded in 14 bits (2^14 = 16384). The address can be coded
in a 32-bit integer:

9 bits for the “frequency of the anchor”

9 bits for the “frequency of the point”

14 bits for the “delta time between the anchor and the
point”

Using the same logic, the couple (“time of anchor”; “song
Id”) can be coded in a 64-bit integer (32 bits for each part).

The fingerprint table can be implemented as a simple array of
lists of 64-bit integers where:

the index of the array is the 32-bit integer address

the list of 64-bit integers is all the couples for this address.

In other words, we transformed the fingerprint table into an
inverted look-up that allows search operations in O(1) (i.e.
very effective search time).
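
To make the bit layout concrete, here is a minimal sketch that packs an address into 32 bits and a couple into 64 bits. The field order inside the integers is my own choice, and I use a Python dictionary in place of the 2^32-entry array, which keeps the O(1) look-up without allocating the empty slots.

```python
from collections import defaultdict


def pack_address(anchor_freq, point_freq, delta_ms):
    """9 bits + 9 bits + 14 bits = one 32-bit address."""
    if not (0 <= anchor_freq < 512 and 0 <= point_freq < 512 and 0 <= delta_ms < 16384):
        raise ValueError("field out of range")
    return (anchor_freq << 23) | (point_freq << 14) | delta_ms


def unpack_address(address):
    return address >> 23, (address >> 14) & 0x1FF, address & 0x3FFF


def pack_couple(anchor_time, song_id):
    """32 bits for the anchor time + 32 bits for the song id."""
    return (anchor_time << 32) | song_id


def unpack_couple(couple):
    return couple >> 32, couple & 0xFFFFFFFF


# inverted look-up: address -> list of couples
fingerprint_db = defaultdict(list)
fingerprint_db[pack_address(10, 30, 1)].append(pack_couple(2, 1))
```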

Note: You may have noticed that I didn’t choose the anchor
point inside the target zone (I could have chosen the first point
of the target zone for example). If I had, it would have
generated a lot of addresses like [frequency anchor;frequency
anchor;0] and therefore too many couples (“time of anchor”;
“song Id”) would have an address like [Y;Y;0] where Y is the
frequency (between 0 and 511). In other words, the look-up
would have been skewed.

Searching And Scoring the fingerprints

We now have a great data structure on the server side, how
can we use it? It’s my last question, I promise!

Search

To perform a search, the fingerprinting step is performed on the
recorded sound file to generate an address/value structure
slightly different on the value side:

[“frequency of the anchor”;“frequency of the point”;“delta
time between the anchor and the point”] -> [“absolute time of
the anchor in the record”].

This data is then sent to the server side (Shazam). Let’s take
the same assumption as before (300 time-frequency points in
the filtered spectrogram of the 10-second record and a target
zone size of 5 points): it means there are approximately
1500 pieces of data sent to Shazam.

Each address from the record is used to search in the
fingerprint database for the associated couples [“absolute time
of the anchor in the song”;“Id of the song”]. In terms of time
complexity, assuming that the fingerprint database is
in-memory, the cost of the search is proportional to the number
of addresses sent to Shazam (1500 in our case). This search
returns a big amount of couples; let’s say for the rest of the
article it returns M couples.
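
The search itself is then one look-up per address of the record. A minimal sketch, assuming the fingerprint database is a dictionary from address to list of couples; for readability the addresses and couples are plain tuples here instead of packed integers, and the sample database reuses the 4 lines of the small example given earlier.

```python
def search(fingerprint_db, record_addresses):
    """Collect the couples of every record address found in the database."""
    results = []
    for address in record_addresses:      # one O(1) look-up per address
        results.extend(fingerprint_db.get(address, []))
    return results


# the small example table: address -> [(anchor time in song, song id), ...]
example_db = {
    (10, 30, 1): [(2, 1)],
    (10, 30, 2): [(2, 1), (1, 1)],
    (10, 30, 3): [(1, 1)],
}
couples = search(example_db, [(10, 30, 2), (99, 99, 9)])
```

The address [10;30;2] returns 2 couples, and an address that is not in the database returns nothing.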

Though M is huge, it’s way lower than the number of notes
(time-frequency points) of all the songs. The real power of
this search is that instead of looking if one note exists
in a song, we’re looking if 2 notes separated by
delta_time seconds exist in the song. At the end of this part
we’ll talk more about time complexity.

Result filtering

Though it is not mentioned in the Shazam paper, I think the
next thing to do is to filter the M results of the search by
keeping only the couples of the songs that have a minimum
number of target zones in common with the record.

For example, let’s suppose our search has returned:

100 couples from song 1 which has 0 target zones in
common with the record

10 couples from song 2 which has 0 target zones in common
with the record

50 couples from song 5 which has 0 target zones in common
with the record

70 couples from song 8 which has 0 target zones in common
with the record

83 couples from song 10 which has 30 target zones in
common with the record

210 couples from song 17 which has 100 target zones in
common with the record

4400 couples from song 13 which has 280 target zones in
common with the record

3500 couples from song 25 which has 400 target zones in
common with the record

Our 10-second record has (approximately) 300 target zones. In
the best case scenario:

song 1 and the record will have a 0% matching ratio

song 2 and the record will have a 0% matching ratio

song 5 and the record will have a 0% matching ratio

song 8 and the record will have a 0% matching ratio

song 10 and the record will have a 10% matching ratio

song 17 and the record will have a 33% matching ratio

song 13 and the record will have a 93.3% matching ratio

song 25 and the record will have a 100% matching ratio

We’ll only keep the couples of songs 13 and 25 from the result.
Although songs 1, 2, 5 and 8 have multiple couples in common
with the record, none of them form at least one target zone (of 5
points) in common with the record. This step can filter out a lot
of false results because the fingerprint database of Shazam has a
lot of couples for the same address and you can easily end up
with couples at the same address that don’t belong to the same
target zone. If you don’t understand why, look at the last picture
of the previous part: the [10;30;2] address is used by 2
time-frequency points that don’t belong to the same target
zone. If the record also has a [10;30;2] address, (at least) one
of the 2 couples in the result will be filtered out in this step.

This step can be done in O(M) with the help of a hash table whose key is the couple (songID; absolute time of the anchor in the song) and whose value is the number of times it appears in the result:

We iterate through the M results and count (in the hash table) the number of times a couple is present

We remove all the couples (i.e. the keys of the hash table) that appear less than 4 times (in other words we remove all the points that don’t form a target zone)*

We count the number X of times the song ID is part of a key in the hash table (i.e. we count the number of complete target zones in the song. Since the couples come from the search, those target zones are also in the record)

We only keep the results whose count X is above 300 * coeff (300 is the number of target zones of the record and we reduce this number with a coeff because of the noise).

We put the remaining results in a new hash table whose index is the songId (this hashmap will be useful for the next step)

* The idea is to look for the target zone created by an anchor point in a song. This anchor point can be defined by the id of the song it belongs to and the absolute time at which it occurs. I made an approximation because in a song, you can have multiple anchor points at the same time. Since we’re dealing with a filtered spectrogram, you won’t have a lot of anchor points at the same time. But the key [songID; absolute time of the anchor in the song] will gather all the target zones created by these anchor points.

Note: I used 2 hash tables in this algorithm. If you don’t know how a hash table works, just see it as a very efficient way to store and get data. If you want to know more, you can read my article on the HashMap in Java, which is just an efficient hash table.
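The filtering step can be sketched in a few lines of Python. This is only an illustration under my own assumptions: each search result is a tuple (address, song id, absolute time of the anchor in the song), the names `filter_results`, `min_couples` and `coeff` are hypothetical, and the default value of `coeff` is arbitrary since the right value depends on the noise.

```python
from collections import Counter, defaultdict


def filter_results(results, record_zone_count=300, coeff=0.5, min_couples=4):
    """Keep only the couples of songs that share enough target zones with the record.

    results: list of (address, song_id, anchor_time) tuples returned by the search.
    Returns a dict song_id -> list of (address, anchor_time).
    """
    # First hash table: how many times each (song_id, anchor_time) key appears.
    couple_counts = Counter((song_id, t) for _, song_id, t in results)

    # Drop the keys that appear fewer than min_couples times:
    # they don't form a target zone.
    zones = {key for key, n in couple_counts.items() if n >= min_couples}

    # X = number of remaining keys per song, i.e. number of target zones.
    zones_per_song = Counter(song_id for song_id, _ in zones)

    # Second hash table, indexed by song id, for the songs above the threshold.
    threshold = record_zone_count * coeff
    kept = defaultdict(list)
    for address, song_id, t in results:
        if (song_id, t) in zones and zones_per_song[song_id] > threshold:
            kept[song_id].append((address, t))
    return dict(kept)
```

Both passes over the results are linear, so the whole step stays in O(M) as long as the hash table lookups are constant time on average.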

Time coherency

At this stage we only have songs that are really close to the record. But we still need to verify the time coherency between the notes of the record and these songs. Let’s see why:

In this figure, we have 2 target zones that belong to 2 different songs. If we didn’t look for time coherency, those target zones would increase the matching score between the 2 songs, whereas they don’t sound alike since the notes in those target zones are not played in the same order.

This last step is about time ordering. The idea is:

to compute for each remaining song the notes and their absolute time position in the song.

to do the same for the record, which gives us the notes and their absolute time position in the record.

if the notes in the song and those in the record are time coherent, we should find a relation like this one: “absolute time of the note in the song = absolute time of the note in record + delta”, where delta is the starting time of the part of the song that matches with the record.

for each song, we need to find the delta that maximizes the number of notes that respect this time relation

Then we choose the song that has the maximum number of time coherent notes with the record

Now that you get the idea let’s see how to do it technically. At this stage we have for the record a list of address/value:

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] -> [“absolute time of the anchor in the record”].

And we have for each song a list of address/value (stored in the hash table of the previous step):

[“frequency of the anchor”; “frequency of the point”; “delta time between the anchor and the point”] -> [“absolute time of the anchor in the song”; “Id of the song”].

The following process needs to be done for all the remaining songs:

For each address in the record, we get the associated value of the song and we compute delta = “absolute time of the anchor in the song” - “absolute time of the anchor in the record” and put the delta in a “list of delta”.

It is possible that the address in the record is associated with multiple values in the song (i.e. multiple points in different target zones of the song); in this case we compute the delta for each associated value and we put the deltas in the “list of delta”

For each different value of delta in the “list of delta” we count its number of occurrences (in other words, we count for each delta the number of notes that respect the rule “absolute time of the note in the song = absolute time of the note in record + delta”)

We keep the greatest value (which gives us the maximum number of notes that are time coherent between the record and the song)

From all the songs, we keep the song with the maximum number of time coherent notes. If this coherency is above “the number of notes in the record” * “a coefficient” then this song is the right one.
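Here is a minimal Python sketch of this time coherency check. The data layout is my assumption: the record is a list of (address, anchor time) pairs and each song is a dict mapping an address to the list of anchor times where it occurs. The delta is computed as song time minus record time so that it matches the relation “absolute time of the note in the song = absolute time of the note in record + delta”; the opposite sign would give the same counts. Times are assumed to be integers (for example milliseconds) so that equal deltas compare equal. The names `best_delta`, `pick_song` and the default `coeff` are hypothetical.

```python
from collections import Counter


def best_delta(record, song):
    """Return (delta, count) for the delta shared by the most notes, or (None, 0).

    record: list of (address, anchor_time_in_record).
    song:   dict address -> list of anchor times in the song.
    """
    deltas = Counter()
    for address, record_time in record:
        # One address of the record may match several points of the song.
        for song_time in song.get(address, ()):
            deltas[song_time - record_time] += 1
    if not deltas:
        return None, 0
    return deltas.most_common(1)[0]


def pick_song(record, songs, coeff=0.5):
    """Return the id of the most time coherent song, or None if below the threshold."""
    best_id, best_count = None, 0
    for song_id, song in songs.items():
        _, count = best_delta(record, song)
        if count > best_count:
            best_id, best_count = song_id, count
    return best_id if best_count > len(record) * coeff else None
```

Counting the deltas in a hash table (here a `Counter`) avoids comparing every delta with every other one.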

We just have to look for the metadata of the song (“artist name”, “song name”, “iTunes URL”, “Amazon URL”, …) with the song ID and give the result back to the user.

Let’s talk about complexity!

This search is really more complicated than the simple one we first saw, so let’s see if it is worth it. The enhanced search is a step by step approach that reduces the complexity at each step.

For the sake of comprehension, I’ll recall all the assumptions (or choices) I made and make new ones to simplify the problem:

We have 512 possible frequencies

On average a song contains 30 peak frequencies per second

Therefore the 10-sec record contains 300 time-frequency points

S is the number of seconds of music of all the songs

The size of the target zone is 5 notes

(new) I assume that the delta time between a point and its anchor is either 0 or 10 msec

(new) I assume the generation of addresses is uniformly distributed, which means there is the same amount of couples for any address [X,Y,T] where X and Y are one of the 512 frequencies and T is either 0 or 10 msec

The first step, the search, only requires 5 * 300 unitary searches.

The size of the result M is the sum of the results of the 5 * 300 unitary searches:

M = (5 * 300) * (S * 30 * 5 * 300) / (512 * 512 * 2)

The second step, the result filtering, can be done in M operations. At the end of this step there are N notes distributed in Z songs. Without a statistical analysis of the music collection, it’s impossible to get the values of N and Z. I feel N is much lower than M and Z represents only a few songs, even for a 40-million song database like Shazam.

The last step is the analysis of the time coherency of the Z songs. We’ll assume that each song has approximately the same amount of notes: N/Z. In the worst case scenario (a record that comes from a song that contains only one note played continuously), the complexity of one analysis is (5 * 300) * (N/Z).

The cost of the Z songs is 5 * 300 * N.

Since N << M, the cost is dominated by the first two steps, so the real cost of this search is M = (300 * 300 * 30 * S) * (5 * 5) / (512 * 512 * 2)

If you remember, the cost of the simple search was: 300 * 300 * 30 * S.

This new search is about 20,000 times faster.

Note: The real complexity depends on the distribution of frequencies inside the songs of the collection but this simple calculation gives us a good idea of the real one.
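To check the order of magnitude, here is the arithmetic as a small Python sketch. It takes the two cost formulas above as given; the function names and the value chosen for S are mine, and S cancels out in the ratio anyway.

```python
def enhanced_search_cost(total_seconds):
    """Size M of the result set, using the formula given above."""
    unitary_searches = 5 * 300
    couples_term = total_seconds * 30 * 5 * 300
    possible_addresses = 512 * 512 * 2
    return unitary_searches * couples_term / possible_addresses


def simple_search_cost(total_seconds):
    """Cost of the simple search: 300 * 300 * 30 * S."""
    return 300 * 300 * 30 * total_seconds


S = 1_000_000  # hypothetical number of seconds of music in the collection
speedup = simple_search_cost(S) / enhanced_search_cost(S)
print(round(speedup))  # prints 20972
```

The ratio simplifies to (512 * 512 * 2) / (5 * 5), which is roughly 20,000 whatever the value of S.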

Improvements

The Shazam paper is from 2003 which means the associated research is even older. In 2003, 64-bit processors were released to the mainstream market. Instead of using one anchor point per target zone like the paper proposes (because of the limited size of a 32-bit integer), you could use 3 anchor points (for example the 3 points just before the target zone) and store the address of a point in the target zone in a 64-bit integer. This would dramatically improve the search time. Indeed, the search would be to find 4 notes in a song separated by delta_time1, delta_time2 and delta_time3 seconds, which means the number of results M would be very (very) much lower than the one we just computed.
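To make the idea concrete, here is one possible way to pack such an address, as a Python sketch. The bit layout is purely hypothetical (it is neither the paper’s nor Shazam’s): 9 bits for each of the 4 frequencies (512 possible values) and 9 bits for each of the 3 quantized delta times, which gives 63 bits in total. The name `pack_address` is mine.

```python
FREQ_BITS = 9   # 512 possible frequencies, as assumed in this article
DELTA_BITS = 9  # hypothetical width for each quantized delta time


def pack_address(anchor_freqs, point_freq, deltas):
    """Pack 3 anchor frequencies, 1 point frequency and 3 delta times into one integer."""
    if len(anchor_freqs) != 3 or len(deltas) != 3:
        raise ValueError("expected 3 anchor frequencies and 3 delta times")
    address = 0
    for freq in (*anchor_freqs, point_freq):
        if not 0 <= freq < (1 << FREQ_BITS):
            raise ValueError("frequency out of range")
        address = (address << FREQ_BITS) | freq
    for delta in deltas:
        if not 0 <= delta < (1 << DELTA_BITS):
            raise ValueError("delta time out of range")
        address = (address << DELTA_BITS) | delta
    return address  # always fits in 63 bits with the widths above
```

With this layout two addresses are equal only if all 4 frequencies and all 3 delta times match, which is why far fewer couples would share the same address.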

A great advantage of this fingerprint search is its high scalability:

Instead of having 1 fingerprint database you can have D databases, each of them containing 1/D of the full song collection

You can search at the same time for the closest song of the record in the D databases

Then you choose the closest song from the D songs

The whole process is D times faster.
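A rough Python sketch of this split-and-merge idea follows. It is heavily simplified and built on my own assumptions: each database is a dict mapping an address to a list of song ids, the score of a song is just its number of matching couples (without the filtering and time coherency steps), and the names `search_shard` and `search_all` are hypothetical. Threads stand in for what would really be D separate machines.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor


def search_shard(shard, record_addresses):
    """Return (song_id, score) for the best song of one database, or (None, 0)."""
    hits = Counter()
    for address in record_addresses:
        hits.update(shard.get(address, ()))
    return hits.most_common(1)[0] if hits else (None, 0)


def search_all(shards, record_addresses):
    """Query the D databases at the same time and keep the best candidate."""
    with ThreadPoolExecutor(max_workers=len(shards)) as pool:
        candidates = list(pool.map(lambda s: search_shard(s, record_addresses), shards))
    return max(candidates, key=lambda candidate: candidate[1])
```

For example, `search_all([{"a": [1, 1], "b": [1]}, {"a": [2]}], ["a", "b"])` returns `(1, 3)`: song 1 from the first database wins with 3 matching couples.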


Tradeoffs

Another good discussion is the noise robustness of this algorithm. I could easily add 2k words just for this subject but after 11k words I think it’s better not to speak about it… or just a few words.

If you read carefully, you noticed that I used a lot of thresholds, coefficients and fixed values (like the sampling rate, the duration of a record, …). I also chose/made many algorithms (to filter a spectrogram, to generate a spectrogram, …). They all have an impact on the noise resistance and the time complexity. The real challenge is to find the right values and algorithms that give the best trade-off between:

The noise resistance

The time complexity

The precision (reducing the number of false positive results)

I hope you now understand how Shazam works. It took me a lot of time to understand the different subjects of this article and I still don’t master them. This article won’t make you an expert but I hope you have a very good picture of the processes behind Shazam. Keep in mind that Shazam is just one possible audio fingerprinting implementation.

You should be able to code your own Shazam. You can look at this very good article that focuses more on how to code a simplified Shazam in Java than on the concepts behind it. The same author made a presentation at a Java conference and the slides are available here. You can also check this link for a MatLab/Octave implementation of Shazam. And of course, you can read by yourself the paper from Shazam co-founder Avery Li-Chun Wang by clicking right here.

The world of music computing is a very interesting field with tricky algorithms that you use every day without knowing it. Though Shazam is not easy to understand, it’s easier than:

query by humming: for example SoundHound, a competitor of Shazam, allows you to hum/sing the song you’re looking for

speech recognition and speech synthesis: implemented by Skype, Apple “Siri” and Android “Ok Google”

music similarity: which is the ability to find that 2 songs are similar. It’s used by Echonest, a start-up recently acquired by Spotify

If you’re interested, there is an annual contest between researchers on those topics and the algorithms of each participant are available. Here is the link to the MIREX contest.

I spent approximately 200 hours during the last 3 years to understand the signal processing concepts and the mathematics behind them, to make my own Shazam prototype, to fully understand Wang’s paper and to imagine the processes the paper doesn’t explain. I wrote this article because I have never found an article that really explains Shazam and I wished I could have found one when I began this side project in 2012. I hope I didn’t write too many technical mistakes. The only thing I’m sure of is that despite my efforts there are many grammar and spelling mistakes (alas!). Tell me what you think of this article in the comments.


