Neural Information Processing PDF
ISBN 979-10-90821-04-0
http://sabiod.univ-tln.fr
In: proc. of int. symp. Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS, Nevada, dec. 2013, Ed. Glotin H. et al.
Contents
Video supports: Most of the talks are available in video at http://sabiod.univ-tln.fr/nips4b/
Authors list
Chapter 1 Introduction
1.1 Objectives
1.2 Overview of the Bird challenge
1.3 Whale song clustering challenge
1.4 Neurosonar analysis
1.5 Acknowledgements
LeCun Y.
Janvier M., Horaud R., Girin L., Berthommier F., Boe L., Kemp C., Rey A., Legou T.
Massaron L.
7.6 A novel approach based on ensemble learning to NIPS4B challenge
Chen W., Zhao G., Li X.
8.5 Automatic analysis of a whale song
Potamitis L., Ntalampiras S.
Annex: Schedule
Chapter 1
Introduction
This book presents the content of the 1st workshop on Neural Information Processing Scaled for Big
Bioacoustics Data [NIPS4B], which took place at Lake Tahoe, Nevada, in December 2013, during the NIPS
international conference. The 40 attendees provided further insights into the analysis of large-scale
bioacoustic data and the modeling of animal sounds, not only from a neural perspective, but also by strongly
reinforcing the need to approach these unique signals within the machine learning community.
As a result, the bioacoustics community and the mainstream NIPS community met, leading to
new collaborations. The communications ranged from the complexity of bioacoustics to scaled analyses,
from understanding and monitoring bird song ontogeny to cricket auditory neural functions, and from the use of
sparse architectures for whale sound classification to range estimation and bat tracking.
Although, in recent years, the majority of the existing applications lend themselves to advanced
acoustic signal processing methodologies, our efforts are successfully integrating robust processing and
machine learning algorithms for scaled analysis of these abundant recordings. Major issues such as data
repositories and the need for standardization within the bioacoustics field were discussed and addressed.
We exchanged ideas on how to proceed in understanding bioacoustics to provide methods for
biodiversity indexing, and to open a novel paradigm toward a Bioacoustic Turing Test: one might model
animal communication before tackling the original Turing test for human beings.
Scaled bioacoustic data science is a novel challenge for artificial intelligence that requires
new methods. For example, Minke whales, observed all around the planet, have been recorded by
Kindermann's acoustic observatory at the ice shelf around Antarctica for 8 years. Big data scientists
are today invited to look into that data using advanced methods to extract genuinely new knowledge about this
important species.
Similarly, large cabled submarine acoustic observatory deployments permit data to be acquired
continuously, over long time periods. For example, the Neptune observatory in Canada and the Antares and Nemo
neutrino observatories in the Mediterranean sea pose
'big data' challenges to scientists. Automated analysis, including the classification of acoustic signals,
event detection, data mining and machine learning to discover relationships among data streams, are techniques
that promise to aid scientists in making discoveries in an otherwise overwhelming quantity of acoustic data, as
presented in this book.
Minke whale Fourier time-frequency representation, on 5 minutes scale (left) versus on two years scale (right),
showing season effect and global frequency shift [from L. Kindermann 2013, in this book].
1.1 Objectives
Bioacoustic data science aims at analyzing and modeling animal sounds for neuroethology / biodiversity
assessment. However, given the complexity of the collected data along with the different taxonomies of the
different species and their environmental contexts, it requires original approaches. In recent years, the field
of bioacoustics has received increasing attention due to its diverse potential benefits to science and society,
and is steadily required by regulatory agencies as a tool for timely monitoring and mitigation of
environmental impacts from human activities. The increased expectations from bioacoustic research have
been coincident with a dramatic increase in the spatial, temporal and spectral scales of acoustic data
collection efforts. One of the most promising strategies concerns neural information processing and
advanced machine learning.
The features and biological significance of animal sounds, while constrained by the physics of sound
production and propagation, have evolved through the processes of natural selection. Additional insights
have been gained through analysis and modeling of animal sounds as related to critical life
functions (e.g. communicating, mating, migrating, navigating), social context, and individual, species
and population identification. These observations have led to both quantitative and qualitative advances,
for example MRI for monitoring bird song ontogeny. These have yielded new paradigms, such as the processes
that underlie song learning and their modeling. Although the majority of the existing applications lend
themselves to widely used, advanced acoustic signal processing methodologies, the field has yet to
successfully integrate robust signal processing and machine learning algorithms, applied for example to bird,
insect, or whale song identification, source localisation, or (neural) modeling of the biosonar of bats or
dolphins.
This NIPS4B workshop has helped to introduce and solidify an innovative computational framework in the
field of bioacoustics by focusing on the principles of neural information processing in an inherently
hierarchical manner. State-of-the-art machine learning algorithms have been explored in order to draw
physiological parallels within bioacoustics, while an applicative framework has addressed classification tasks.
For example, new sparse feature representations have been pursued by using both shallow and deep
architectures in order to model the underlying highly complex data distribution. Cost creation and
hyperparameter optimization in architectures such as Deep Belief Networks (DBN), Sparse Auto Encoders (SAE),
Convolutional Networks (ConvNet) and scattering transforms have provided insights into the analysis of these
complex signals. Any interesting new learning technique for this type of bioacoustic signal is very welcome.
NIPS4B has encouraged interdisciplinary scientific exchanges and fostered collaborations among the
workshop participants on bioacoustic signal analysis and the understanding of the auditory process. NIPS4B
aims at bringing together experts from the machine learning and computational auditory scene analysis fields
with experts in the field of animal acoustic communication systems to promote, discuss and explore the use
of machine learning techniques in bioacoustics for signal separation, classification, localisation, etc. It has
involved researchers in modeling of the auditory cortex, neurophysiological processes in perception and
learning, machine listening, signal processing, and computer science to discuss these complementary
perspectives on bioacoustics.
1 http://www.youtube.com/watch?v=0Szo3gdiTRk
2 http://glotin.univ-tln.fr/oncet/
Scaled bioacoustics is a new challenge that requires new methods, which are discussed in this book. For
example, Antarctic Minke whales are today observed all around the planet. Long-term recordings from
Kindermann's acoustic observatory at the ice shelf show their acoustic emissions around Antarctica over
years (see figure below). Tens of thousands of hours of this sound have been recorded during the last 8
years. Big data scientists are today invited to look into that data using advanced methods to extract
genuinely new knowledge about this important species.
Large cabled submarine acoustic observatory deployments permit data to be acquired continuously,
over long time periods. For example, those currently running include the Neptune observatory in Canada (see
M.H.'s talk in this book) and the Antares and Nemo neutrino observatories in the Mediterranean sea (see H.G.'s talk).
This capability presents a big data challenge to the scientists using and accessing the data. Automated
analysis, including the classification of acoustic signals, event detection, data mining and machine learning
to discover relationships among data streams, are techniques that promise to aid scientists in making
discoveries in an otherwise overwhelming quantity of acoustic data, as presented in this book.
1.2 Overview of the Bird Challenge
Challenge 1: Bird Song Classification / Kaggle web site now available
This Bird NIPS4B competition asks participants to identify which of 87 sound classes of birds and their
ecosystem are present in 1000 continuous wild recordings (from different places in Provence, France;
nearly 2 hours of recordings, sampling frequency = 44.1 kHz, SM2 system). The data is provided by
the BIOTOPE [1] society (which has the largest collection of wild recordings of birds in Europe). The training set
matches the test set conditions.
This challenge is a more complex task than our previous one, the ICML4B challenge [2], in which 77 teams
participated (see the proceedings at sabiod.org [3]).
This enhanced challenge opens on the 2nd of October. The metric is the Area Under the Curve, as for our
previous [4] challenge.
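The ranking metric can be illustrated with a short sketch. The function below computes the area under the ROC curve for a single class with the rank-comparison (Mann-Whitney) formulation. The function name and the toy labels and scores are illustrative assumptions, and the way the challenge platform averaged the AUC over the 87 classes is not specified here.

```python
def auc(labels, scores):
    """Area under the ROC curve for one class.

    labels: 0/1 ground truth per recording (hypothetical toy data below).
    scores: predicted confidence per recording.
    Equals the probability that a random positive recording is scored
    higher than a random negative one, counting ties as one half.
    """
    pos = [s for y, s in zip(labels, scores) if y == 1]
    neg = [s for y, s in zip(labels, scores) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = 0.0
    for p in pos:
        for n in neg:
            if p > n:
                wins += 1.0
            elif p == n:
                wins += 0.5
    return wins / (len(pos) * len(neg))


# Toy example: 3 recordings contain the species, 3 do not.
y_true = [1, 0, 1, 0, 1, 0]
y_score = [0.9, 0.2, 0.6, 0.7, 0.8, 0.1]
print(auc(y_true, y_score))
```

Because the AUC depends only on the ordering of the scores, a model does not need calibrated probabilities to rank well on this metric.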
1/ SOUND FILES: WHOLE WAV FILES, TRAIN and TEST, 138 MB [5]
1 http://www.biotope.fr/
2 http://www.kaggle.com/c/the-icml-2013-bird-challenge/
3 http://sabiod.univ-tln.fr/ICML4B2013_proceedings.pdf
4 http://www.kaggle.com/c/the-icml-2013-bird-challenge/
5 http://sabiod.univ-tln.fr/nips4b/media/birds/NIPS4B_BIRD_CHALLENGE_TRAIN_TEST_WAV.tar.gz
6 http://sabiod.univ-tln.fr/nips4b/media/birds/NIPS4B_BIRD_CHALLENGE_TRAIN_TEST_MFCC.tar.gz
2/ SUGGESTED FEATURES: we provide baseline features for these train and test .wav files, computing
MFCC optimized for bird sound representation, as distributed in the ICML4B 2013 bird challenge: MEL
FILTER CEPSTRUM COEFFICIENTS (MFCC) of WHOLE TRAIN and TEST FILES (158 MB) [1]. The format
is a 17xN matrix: 17 cepstral coefficients x N frames, frame size 11.6 ms, frame shift 3.9 ms, one line per
frame. You may compute their speed and acceleration by simple line differences. These suggested features
minimize the signal reconstruction error on average over bird species. The script which produced these MFCC
is: MFCC SCRIPT for BIRD SOUND REPRESENTATION (please cite if you use these features).
3/ LABELS: Here are the tables of the 87 classes to learn (.csv, xls, html) [2].
** This archive also includes the TRAINING LABELS of the 687 train files (.csv, xls, html) [3].
For some species we distinguish the song from the call (and from the drum). We also include some species living
alongside these birds: 7 insects and an amphibian. Each of the 87 classes in this table is to be predicted in
the 1000 test files. Some training files are empty (background noise only, called the 'empty class') to help tune your
model; this class is not to be predicted. The training set contains 687 files. Each species is represented by
nearly 10 training files (within various contexts / with other species).
4/ EXAMPLES: The test set is composed of 1000 files. All the species in the test set are in the training set.
We give here two samples, each containing two species. First sample: Sylvia cantillans (which is singing) and Sylvia
melanocephala (which is calling) [4].
Second sample: Sylvia cantillans (which is also singing) and Petronia petronia (which is calling) [5].
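The speed and acceleration features mentioned with the suggested MFCC (item 2/) can be sketched as plain differences between consecutive lines. This is a minimal illustration assuming the matrix has been loaded with one row per frame; the function name and the toy values are hypothetical, and the real files hold 17 coefficients per frame.

```python
def line_differences(frames):
    """First-order difference between consecutive frames (rows).

    frames: list of rows, one row of cepstral coefficients per frame.
    Returns len(frames) - 1 rows.
    """
    return [
        [b - a for a, b in zip(prev, curr)]
        for prev, curr in zip(frames, frames[1:])
    ]


# Hypothetical toy input: 4 frames x 3 coefficients
# (the distributed files hold 17 coefficients per frame).
mfcc = [
    [1.0, 2.0, 3.0],
    [2.0, 2.0, 5.0],
    [4.0, 1.0, 8.0],
    [7.0, 1.0, 12.0],
]
speed = line_differences(mfcc)          # change per frame shift (3.9 ms)
acceleration = line_differences(speed)  # change of the speed per frame shift
print(speed)
print(acceleration)
```

Each differencing step shortens the sequence by one frame, so the static, speed and acceleration rows need to be aligned (for example by dropping the first frames) before they are stacked into one feature vector.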
We give in Fig 1 the winning score table, which shows the scores species by species on a dev set
selected by Lasseck; see details in his paper in this book.
1 http://sabiod.univ-tln.fr/nips4b/media/birds/NIPS4B_BIRD_CHALLENGE_TRAIN_TEST_MFCC.tar.gz
2/3 http://sabiod.univ-tln.fr/nips4b/media/birds/NIPS4B_BIRD_CHALLENGE_TRAIN_LABELS.tar
4 http://sabiod.univ-tln.fr/nips4b/media/birds/nips4b2013_birds_file_0001.wav
5 http://sabiod.univ-tln.fr/nips4b/media/birds/nips4b2013_birds_file_0002.wav
Fig 1 : The list of the species of the challenge, with their AUC score from the winning solution,
on a dev set of the author Lasseck (see chap. in this book)
The overall Kaggle public leaderboard is given in Fig 2 below, to be compared to the private
score, which was computed on a blind test (on unseen data).
Fig 2 : Top of the public leaderboard (test on unseen data during the whole challenge)
Looking at the distribution of the scores (number of trials, AUC), we see that the
best models did not use many runs. The comparison with the private
leaderboard shows that the models generalized well (see Fig 3).
Fig 3 : Top of the private leaderboard (last test on the selected runs by the challengers)
Organizers:
Pr. H. Glotin [1] - Institut Universitaire de France [2], FR
Email: [email protected]
O. Dufour [5] - CNRS LSIS, FR
Dr. Y. Bas - BIOTOPE [6], FR
1 http://glotin.univ-tln.fr/
2 http://iuf.amue.fr/iuf/presentation
3 http://www.lsis.org/
4 http://www.univ-tln.fr/
5 http://dyni.univ-tln.fr/~odufour/
6 http://www.biotope.fr/
1.3 Whale song clustering challenge
Figure: Spectrum of around 20 seconds of the given Humpback Whale song (from about 5'40 to 6').
Ordinate from 0 to 22.05 kHz, over 512 bins (FFT on 1024 bins), frame shift of 10 ms.
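The parameters in this caption can be turned into a small spectrogram sketch. It assumes a 44.1 kHz sampling rate (implied by the 22.05 kHz upper frequency) and a Hann window, which the caption does not state; the function names and the synthetic tone standing in for the recording are illustrative only.

```python
import cmath
import math

FS = 44100               # Hz, assumed from the 22.05 kHz upper frequency
N_FFT = 1024             # FFT length given in the caption
HOP = round(0.010 * FS)  # 10 ms frame shift, i.e. 441 samples


def fft(x):
    """Recursive radix-2 FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + n // 2] = even[k] - t
    return out


def spectrogram(signal):
    """Magnitude spectrogram: one list of N_FFT // 2 bins per frame."""
    # Hann window (an assumption, the caption does not name the window)
    window = [0.5 - 0.5 * math.cos(2 * math.pi * i / N_FFT) for i in range(N_FFT)]
    frames = []
    for start in range(0, len(signal) - N_FFT + 1, HOP):
        chunk = [signal[start + i] * window[i] for i in range(N_FFT)]
        frames.append([abs(c) for c in fft(chunk)[: N_FFT // 2]])
    return frames


# Synthetic stand-in for the recording: 0.1 s of a 2 kHz tone.
tone = [math.sin(2 * math.pi * 2000 * n / FS) for n in range(FS // 10)]
spec = spectrogram(tone)
peak_bin = max(range(N_FFT // 2), key=lambda k: spec[0][k])
print(len(spec), len(spec[0]), peak_bin * FS / N_FFT)
```

With these settings each of the 512 bins covers about 43 Hz (FS / N_FFT), and consecutive frames overlap strongly because the 441-sample shift is well below the 1024-sample frame length.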
We also give the usual Mel Filter Cepstrum Coefficients of this wav file (Octave / Matlab v6 format) [2]. The
parameters used to extract these MFCC are given here [3].
For this challenge, you may propose any efficient representation of this song that helps to study its structure
and to discover and index its song units. You can find an interesting preliminary approach in: Pace, F., Benard, F.,
Glotin, H., Adam, O., and White, P. (2010) Subunit definition for humpback whale call classification, int.
journal Applied Acoustics, Elsevier, 71(11) [4].
The workshop allows discussion of the proposed representations (clustering, indexing, sequence
modeling, etc.). Your representation of this song file shall be sent to [email protected] in a usual
format (.xml, .csv, .mat, ...). The size (bytes) of your representation and its quality (MSE on the
reconstructed signal of interest) are used to rank it.
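The two ranking criteria, size in bytes and mean squared error of the reconstruction, can be illustrated on a toy case. The sketch below uses a deliberately simple, hypothetical representation (one byte per sample by uniform quantization) and a synthetic signal; it is not the organizers' evaluation script, and how the "signal of interest" was selected is not specified here.

```python
import math

LEVELS = 256                 # one byte per sample
STEP = 2.0 / (LEVELS - 1)    # quantization step for samples in [-1, 1]


def encode(signal):
    """Hypothetical representation: uniform quantization, one byte per sample."""
    return bytes(
        min(LEVELS - 1, max(0, round((s + 1.0) / STEP))) for s in signal
    )


def decode(payload):
    """Reconstruct approximate samples from the byte representation."""
    return [b * STEP - 1.0 for b in payload]


def mse(reference, reconstruction):
    """Mean squared error between the reference and its reconstruction."""
    return sum((r - x) ** 2 for r, x in zip(reference, reconstruction)) / len(reference)


# Synthetic stand-in for the song: 1000 samples of a slow sinusoid.
signal = [math.sin(2 * math.pi * 5 * n / 1000) for n in range(1000)]
payload = encode(signal)
size_bytes = len(payload)
error = mse(signal, decode(payload))
print(size_bytes, error)
```

A more compact representation (for example a sequence of song-unit indices) lowers the size but usually raises the reconstruction error, so the two criteria have to be traded off against each other.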
Organizers:
Yann Doh (UTLN), Joseph Razik (UTLN) and Hervé Glotin (UTLN & IUF)
We thank Darewin for the recording.
1 http://sabiod.univ-tln.fr/nips4b/media/NIPS4B_Humpback_Darewin_LaReunion_Jul_03_2013-001_26min.wav
2 http://sabiod.univ-tln.fr/nips4b/media/NIPS4B_Humpback_Darewin_LaReunion_Jul_03_2013-001_26min_1024_CORRECTED.mat
3 http://sabiod.univ-tln.fr/nips4b/media/NIPS4BparametersMFCChumpbacksongsample.txt
4 http://sabiod.univ-tln.fr/nips4b/media/Pace_etal_APAC2010.pdf
1.4 Neurosonar analysis
file a [2] (5 MB)
file b [3] (11 MB).
The next two files contain sonar of another odontocete species, the biggest one, i.e. Physeter macrocephalus (15
meters, 40 T): file c [4] (28 MB), and file d [5] (high signal-to-noise ratio, recorded at Toulon in 2012-DECAV
SABIOD, FS=48 kHz, 55 MB).
A good 25 minutes of one Physeter were recorded on 5 channels in the Bahamas by NATO, and we have
precisely computed the 4D positions of this whale [Glotin 2008] [6], with an animation on YouTube [7]. The
whole recordings (25 minutes x 5 channels at 48 kHz), the positions and the references are in this archive [8] (500
MB) (here is one sample of 5 min [9]). You can find a sparse coding representation of these clicks in [Paris
2013] [10].
1 http://sabiod.univ-tln.fr/nips4b/media/NIPS4B_Humpback_Darewin_LaReunion_Jul_03_2013-001_26min_1024_CORRECTED.mat
2 http://sabiod.univ-tln.fr/nips4b/media/DECAV_20110607_073535_v2_raccourcie_bateauenfond_v2.wav
3 http://sabiod.univ-tln.fr/nips4b/media/DECAV_20120916_174818_v2_raccourcie_propre2.wav
4 http://sabiod.univ-tln.fr/nips4b/media/DECAV_20121006_171343_v2_dauphin_cachalot_assez_propre_org.wav
5 http://sis.univ-tln.fr/~glotin/DECAV_20120917_135935.wav
6 http://sis.univ-tln.fr/~glotin/NIPS4B_MATERIAL/DATA_WAV_POSITIONS/BAHAMAS/GLOTIN_etal_Whale_Cocktail_Party_Int_JOURN_CANADIAN_ACOUSTICS_spring2008.pdf
7 http://www.youtube.com/watch?v=0Szo3gdiTRk
8 http://sis.univ-tln.fr/~glotin//NIPS4B_MATERIAL/DATA_WAV_POSITIONS/BAHAMAS_Physeter_4channels.tar.gz
9 http://sis.univ-tln.fr/~glotin/NIPS4B_MATERIAL/DATA_WAV_POSITIONS/BAHAMAS/HYDRO10/10S_ch5_10-15.wav
10 http://arxiv.org/pdf/1306.3058v1
Fig: The sonar sample given below is from Nicky, here with her calf, recorded at Shark Bay [1], Australia (credit:
Giraudet 2013).
This Tursiops sonar sample [2] is from the wild dolphin called Nicky, 37 years old, who visits Monkey Mia Bay
nearly daily (sampling frequency 96 kHz, 32 bits, recorded with a CR55 hydrophone from Cetacean Research). Here is
its time-amplitude representation:
Fig: The time-amplitude representation of this short sample of Nicky's sonar (0.7 s).
Here we give a longer sequence of Nicky [3] (same FS=96 kHz, 32 bits, 19 MB), recorded at 2 m from her nose.
You may use Audacity [4] or GNU Octave [5] to read it.
1 http://www.monkeymiadolphins.org/
2 http://sabiod.univ-tln.fr/nips4b/media/NIPS4B_sonar_S1.wav
3 http://sabiod.univ-tln.fr/nips4b/media/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_32bits_after19min_nips4bfile_e.wav
4 http://audacity.sourceforge.net/
5 http://www.gnu.org/software/octave/
Fig: Longer time-amplitude representation of Nicky's sonar (100 s): same file e [1].
In [Ryabov 2011] [2] it is shown that Tursiops dolphins produce packs of coherent and non-coherent
broadband pulses. The waveform and spectrum of coherent pulses are invariable within a pack (see fig.
below), but vary considerably from pack to pack. The waveform of each non-coherent pulse varies from
pulse to pulse within each pack; therefore their spectra also vary from pulse to pulse and have many
extrema. According to the author, it is very likely that the non-coherent pulses play the part of phonemes of a
dolphin spoken language and of the probing signals of the dolphin's non-coherent sonar. Efficient feature extraction and
classification on sonar sequences are required for such studies.
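As one possible starting point for such features, the similarity of successive pulse waveforms within a pack can be measured with a normalized correlation. This is only a sketch under strong assumptions: it is not the method of [Ryabov 2011], the pulses are assumed to be already detected, time-aligned and of equal length, and the function names and the synthetic damped-sinusoid pulses are hypothetical.

```python
import math


def normalized_correlation(a, b):
    """Zero-lag normalized correlation of two equal-length pulses, in [-1, 1]."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a) * sum(y * y for y in b))
    return dot / norm if norm else 0.0


def pack_coherence(pulses):
    """Mean correlation between successive pulses of one pack."""
    pairs = list(zip(pulses, pulses[1:]))
    return sum(normalized_correlation(a, b) for a, b in pairs) / len(pairs)


def synthetic_pulse(cycles, length=64):
    """Damped sinusoid used as a stand-in for a recorded click."""
    return [
        math.exp(-i / 16) * math.sin(2 * math.pi * cycles * i / length)
        for i in range(length)
    ]


# Pack whose pulses share one waveform versus pack whose waveform changes.
stable_pack = [synthetic_pulse(4), synthetic_pulse(4), synthetic_pulse(4)]
varying_pack = [synthetic_pulse(4), synthetic_pulse(9), synthetic_pulse(15)]
print(pack_coherence(stable_pack), pack_coherence(varying_pack))
```

A value close to 1 indicates pulses whose waveform repeats within the pack, while lower values indicate pulse-to-pulse variation; on real recordings the pulses would first need to be aligned, for example at their envelope peak.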
1 http://sabiod.univ-tln.fr/nips4b/media/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_32bits_after19min_nips4bfile_e.wav
2 http://www.scirp.org/journal/PaperDownload.aspx?paperID=7397
We also give recordings of the Amazon river dolphin (Inia) made by David E. Bonnett ([email protected]) [1]: Inia,
2007, 96 kHz FS, 600 MB [2]; Inia, 2009, 96 kHz FS, 300 MB [3]; and Inia, 2011, 96 kHz FS, 300 MB [4].
We also give some recordings of Indian River bottlenose dolphins, made by M. Trone (DRS):
Indian River dolphin, 2013a, 96 kHz FS, 358 MB [5], and Indian River dolphin, 2013b, 500 kHz FS, 1 GB [6].
From these recordings, you may try to learn features that may correlate with some morphological differences;
here are some points of interest in the river files [7].
A similar paradigm applies to bats' sonar: [Kno 2012] shows that although bat echolocation is primarily used for
orientation and foraging, it also holds great potential for social communication.
1 http://sis.univ-tln.fr/~glotin/DRS_and_Microtrack_Recording_System_Descriptions.pdf
2 http://sabiod.univ-tln.fr/nips4b/media/Amazon_2007_96kHz.zip
3 http://sabiod.univ-tln.fr/nips4b/media/Amazon_2009_96kHz.zip
4 http://sabiod.univ-tln.fr/nips4b/media/Amazon_2011_96kHz.zip
5 http://sabiod.univ-tln.fr/nips4b/media/Indian_River_Lagoon_2013_96kHz.zip
6 http://sabiod.univ-tln.fr/nips4b/media/Indian_River_Lagoon_2013_500kHz.zip
7 http://sis.univ-tln.fr/~glotin/nips4b_dolphin_river_point_of_interests.pdf
Myopterus bat's sonar sample (credit Cyberio) [1] (sampling frequency = 250 kHz, duration = 4 seconds,
2 MB)
Fig: The time-amplitude representation of this bat sonar sample. The nearest point of approach (NPA) to the
microphone corresponds to the highest amplitude (near sample #400K). Before the NPA, the bat flies
towards the microphone; after the NPA, the bat emits in the opposite direction.
The communicative function of echolocation calls is still largely unstudied, especially in the wild. The vocal
signatures encoding social information in echolocation calls have not been well studied up to now. The
authors of [Kno 2012] found pronounced vocal signatures encoding sex and individual identity: free-living males
discriminate approaching male and female conspecifics solely based on their echolocation calls. Males
always produced aggressive vocalizations when hearing male echolocation calls and courtship vocalizations
when hearing female echolocation calls; hence, they responded with complex social vocalizations in the
appropriate social context. Advanced statistics may reveal other dependencies in biosonar sequences.
1 http://sabiod.univ-tln.fr/nips4b/media/MyopterusECORx_091612_223528COMP1.wav
Organizers:
Marie Trone (Valencia Coll., USA) and Hervé Glotin (UTLN & IUF)
We thank Cyberio SA for the bat recordings, and MASTODONS SABIOD for
its support for the Tursiops recordings
References NeuroSonar:
[Ryabov 2011] Ryabov (2011) Some Aspects of Analysis of Dolphins' Acoustical Signals, Open Journal of Acoustics, 1, 41-54, doi:10.4236/oja.2011.12006, published online (http://www.SciRP.org/journal/oja)
[Kno 2012] Knörnschild M., Jung K., Nagy M., Metz M., Kalko E.K.V. (2012) Bat echolocation calls facilitate social communication, Proceedings of the Royal Society of London B, 279: 4827-4835
[Glotin 2008] Glotin H., Caudal F., Giraudet P. (2008) 'Whale cocktail party: real-time multiple tracking and signal analyses', int. Journal Canadian Acoustics, V.36(1), ISSN 0711-6659, DEMO @ sabiod.org
[Paris 2013] Paris S. et al. (2013) Physeter catodon localization by sparse coding, ICML for Bioacoustics workshop
[Benard 2010] Benard F., Giraudet P., Glotin H. (2010) 'Whale 3D monitoring using astrophysic NEMO ONDE 2m wide platform with state optimal filtering by Rao-Blackwell Monte Carlo data association', int. jour. App. Acoustics, (71)
1.5 Acknowledgements
We thank the members of the organizing committee, H. Glotin, T. Artières, R. Balestriero, Y. Doh and M.
Bartcus, for their continuous effort for NIPS4B.
Chapter 2
Natural Neural Bioacoustic Learning
Tchernichovski O.
2.2 Neuroethology of hearing in crickets: embedded neural process to avoid bat
Pollack G.
Human language, as well as birdsong, relies on the ability to imitate vocal sounds and
arrange them in new sequences. During developmental song learning, the songbird brain
produces highly variable song patterns, which allow vocal exploration to guide learning.
Tracking song development continuously shows that exploratory variability is regulated on
fine time scales, such that each song element becomes less variable independently when
approaching the target (adult tutor) song. Therefore, multiple localized reinforcement-learning
processes can explain how the bird learns to match specific song elements.
However, we found that vocal exploration alone cannot explain how birds learn to match
vocal combinatorial sequences. Combining an experimental approach in zebra finches with
an analysis of natural development of vocal transitions in Bengalese finches and pre-lingual
human infants, we found a common, stepwise pattern of acquiring vocal transitions across
species. Results point to a common generative process that is conserved across species,
suggesting that the long-noted gap between perceptual versus motor combinatorial
capabilities in human infants may arise partly from the challenges in constructing new
pairwise vocal transitions. Therefore, learning vocal sequences is likely to be constrained by a
neuronal growth process, perhaps of establishing connections between representations of
song gestures.
Many behavioral studies on crickets have identified the relationships between signal
structure and behavioral effectiveness, and the neural basis for sound reception and analysis.
We will present behavioral studies on signal recognition: the relationships between stimulus
structure and behavioral effectiveness; the roles of sound frequency and stimulus temporal
structure; and positive and negative phonotaxis to cricket-like and bat-like signals, respectively.
We will then cover early auditory processing: separate channels for processing mate-attraction
signals and predator-derived signals (ultrasound), and the temporal response properties of
receptor neurons and first-order interneurons. Finally, we discuss descending brain neurons,
which convey the results of processing in the brain to motor centers that control behavior.
Gerald Pollack
Dept. of Biology, McGill University
Huber & Thorson (1985) Scientific American 253:60-68
AN2 produces bursts: brief episodes of high-rate firing.
Bursts accurately detect conspicuous increases in amplitude.
[Figure panels: AN2; Receptor]
Frequency-specific temporal coding is apparent at the level of subthreshold variations in
membrane potential.
[Figure panels: Pre, R1, R2, R3, Post, Composite. Annotations: Ca-activated K current? 4-AP]
Summary
* Low redundancy of responses of low-frequency receptors
* AN2 bursts eliminated by contralateral inhibition
Chapter 3
Representation for Bioacoustics
3.1 Dynamic time warping and Gaussian process multinomial probit regression for
bat call identification
Stathopoulos V., Zamora-Gutierrez V., Jones K., Girolami M.
Veronica Zamora-Gutierrez
Department of Zoology
University of Cambridge Cambridge, CB2 3E
Kate Jones
Centre for Biodiversity and Environment Research
Dept. Genetics, Evolution and Environment
University College London
London, WC1E 6BT
Mark A. Girolami
Department of Statistical Science
University College London
London, WC1E 6BT
Abstract
We study the problem of identifying bat species from echolocation calls in order
to build automated bioacoustic monitoring algorithms. We employ the Dynamic
Time Warping algorithm, which has been successfully applied to bird flight call
identification, and show that classification performance is superior to that of the
hand-crafted call shape parameters used in previous research. This highlights that
generic bioacoustic software with good classification rates can be constructed with
little domain knowledge. We conduct a study with field data of 21 bat species from
north and central Mexico using a multinomial probit regression model with a Gaussian
process prior and a full EP approximation of the posterior of latent function values.
Results indicate high classification accuracy across almost all classes, while the
misclassification rate across families of species is low, highlighting the common
evolutionary path of echolocation in bats.
Introduction
In many tropical ecosystems, bats are keystone species as they act as important pollinators, seed
dispersal agents and regulators of insect populations [1]. In spite of their importance, most bat
population studies in the tropics have been short term, and the lack of long term bat monitoring
programs is a result of their inherent difficulty. Bats produce unique sounds at frequencies that
usually do not overlap with other species and most bat species have evolved species-specific
echolocation calls [2, 3, 4]. However, their calls also show great intraspecific variation and
flexibility caused by habitat, geography, sex, age, etc., and in other cases there is a great overlap
of call structures between species, which makes species identification complicated [5, 6, 7].
Developing automatic identification tools would therefore assist in creating long term acoustic
monitoring programs for biodiversity.
This work is a first step in this direction. Our aim here is not to do an exhaustive comparison
of methods but to show that, using state of the art algorithms from the Machine Learning literature
and with no significant tuning or heavily engineered feature extraction methods, good identification
rates can be achieved.
In this study we use data of 21 species collected in North and Central Mexico and treat bat call
identification as a supervised classification problem where a representative set of bat calls is used
to train a classification model which is then applied to classify novel instances of bat calls. We
employ a multinomial probit regression model with a Gaussian process prior [8], which can achieve
good generalization with moderate to low numbers of training data. We also utilize a
kernel representation of the data that directly compares the calls' spectrograms and thus requires
only minor tuning.
Methodology
We approach bat call identification as a classification problem where the class response variables
$y_n \in \{1, \dots, C\}$ indicate the species id for the $n$th call in the library and
$x_n \in \mathbb{R}^D$ is a $D$-dimensional vector representation of the call, e.g. features
extracted from the call's spectrogram. Species ids from all calls in the library are collected in a
vector $y = [y_1, \dots, y_N]$ and all call vector representations are collected in the matrix
$X = [x_1, \dots, x_N]^T$ of size $N \times D$. In Section 2.1 we will define a probabilistic model
for the conditional probability $p(y \mid X, \theta)$ where $\theta$ denotes a vector of unknown
model parameters with an associated prior distribution $p(\theta)$. The id for a new call, $y_*$,
with vector representation $x_*$ is obtained by the class with highest probability from
$p(y_* \mid x_*, X, y, \hat{\theta})$ where parameter estimates are obtained by maximizing the
posterior distribution, i.e. $\hat{\theta} = \arg\max_{\theta} p(\theta \mid X, y)$.
2.1
The probabilistic model assumes a latent function $f : \mathbb{R}^D \to \mathbb{R}^C$ with latent
values $f(x_n) = f_n = [f_n^1, f_n^2, \dots, f_n^C]^T$ which, when transformed by a sigmoid-like
function, give the class probabilities $p(y_n \mid f_n)$. Here we use the multinomial probit
function, Equation (1), which is convenient for deriving the EP approximation and Gibbs
sampling [18, 8]:
$$p(y_n \mid f_n) = \int N(u_n \mid 0, 1) \prod_{j=1, j \neq y_n}^{C} \Phi(u_n + f_n^{y_n} - f_n^{j}) \, du_n \tag{1}$$
where $\Phi$ denotes the standard normal cumulative distribution function.
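To make Equation (1) concrete, the following is a minimal sketch that evaluates the multinomial probit likelihood by one-dimensional numerical integration over the auxiliary variable u_n. It is an illustration only: the function names, the integration range and step count, the toy latent values and the 0-based class indices are all assumptions, not part of the original work.

```python
import math


def std_normal_pdf(u):
    """Density of N(u | 0, 1)."""
    return math.exp(-0.5 * u * u) / math.sqrt(2.0 * math.pi)


def std_normal_cdf(u):
    """Standard normal CDF, written Phi in Equation (1)."""
    return 0.5 * (1.0 + math.erf(u / math.sqrt(2.0)))


def multinomial_probit(f, y, lo=-8.0, hi=8.0, steps=4000):
    """Approximate p(y | f) from Equation (1) with the trapezoidal rule.

    f is the list of latent values for one call (one per class) and
    y is the 0-based index of the class being scored (an assumption;
    the text numbers classes from 1).
    """
    h = (hi - lo) / steps
    total = 0.0
    for k in range(steps + 1):
        u = lo + k * h
        term = std_normal_pdf(u)
        for j, fj in enumerate(f):
            if j != y:
                term *= std_normal_cdf(u + f[y] - fj)
        weight = 0.5 if k in (0, steps) else 1.0
        total += weight * term
    return total * h


# Hypothetical latent values for a call and C = 3 classes.
f_n = [0.3, -1.2, 0.8]
probs = [multinomial_probit(f_n, c) for c in range(len(f_n))]
```

The class probabilities obtained this way sum to one, and the class with the largest latent value receives the largest probability.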
For the latent function values we assume independent zero-mean Gaussian process priors for
each class similar to [14]. Collecting latent function values for all calls and classes in
$f = [f_1^1, \dots, f_N^1, f_1^2, \dots, f_N^2, \dots, f_1^C, \dots, f_N^C]^T$, the GP prior is
$p(f \mid X, \theta) = N(f \mid 0, K(\theta))$ where $K(\theta)$ is a $CN \times CN$ block
covariance matrix with block matrices $K^1(\theta), \dots, K^C(\theta)$, each of size
$N \times N$, on its diagonal. Elements $K^c_{i,j}$ define the prior covariance between the latent
function values $f_i^c, f_j^c$ governed by a covariance function $k(x_i, x_j \mid \theta)$ with
unknown parameters $\theta$.
Optimising the unknown kernel parameters involves computing and maximising the posterior
$$p(\theta \mid X, y) \propto p(\theta) \int p(y \mid f) \, p(f \mid X, \theta) \, df. \tag{2}$$
Making predictions for a new call, $y_*, x_*$, involves two steps. First computing the
distribution of the latent function values for the new call
$$p(f_* \mid x_*, X, y, \hat{\theta}) = \int p(f_* \mid f, x_*, X, \hat{\theta}) \, p(f \mid X, y, \hat{\theta}) \, df \tag{3}$$
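and then computing the class probabilities using the multinomial probit function
$$p(y_* \mid x_*, X, y, \hat{\theta}) = \int p(y_* \mid f_*) \, p(f_* \mid x_*, X, y, \hat{\theta}) \, df_*. \tag{4}$$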
Full EP approximation
Unfortunately exact inference is not possible and we have to either resort to numerical estimation
through Markov Chain Monte Carlo or use approximate methods. Due to the large number of classes
(21 species in our data), in this work we consider the latter approach and use Expectation
Propagation (EP) [19] to approximate the posterior of the latent function values
$p(f \mid X, y, \theta)$ in Equations (2) and (3), while for computing the integral in (4) we can
again use the EP algorithm.
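The EP method approximates the posterior using
$$q_{EP}(f \mid X, y, \theta) = \frac{1}{Z_{EP}} \, p(f \mid X, \theta) \prod_{n=1}^{N} \tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n)$$
where $\tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n) = \tilde{Z}_n N(f_n \mid \tilde{\mu}_n, \tilde{\Sigma}_n)$
are local likelihood approximate terms with parameters $\tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n$.
The approximation parameters are updated by first computing the cavity distribution
$q_{-n}(f_n) = q_{EP}(f_n \mid X, y, \theta) \, \tilde{t}_n(f_n \mid \tilde{Z}_n, \tilde{\mu}_n, \tilde{\Sigma}_n)^{-1}$
and then matching them with the moments of the tilted distribution
$$\hat{q}(f_n) = \hat{Z}_n^{-1} \, q_{-n}(f_n) \, p(y_n \mid f_n). \tag{5}$$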
Unlike the binary probit case, where the tilted distribution (5) is univariate and thus its moments
are easy to compute, the tilted distribution for the multinomial probit model is C-dimensional.
Previous work on EP approximations for the multinomial probit model [16] further approximated the
moments of the tilted distribution using the Laplace approximation, which assumes that the
distributions can be closely approximated by a multivariate normal.
In this work we show that a full EP algorithm can be derived by augmenting the latent function
values $f$ with the auxiliary variables $u_n$ from Equation (1) and permuting both the augmented
variables and the covariance matrix $K(\theta)$. This results in the same algorithm as the nested EP
approximation presented by [17]; however, this presentation clearly shows why a single iteration of
the inner EP for the tilted distributions, using the moments estimated from the previous iteration
of the outer EP, is enough for the algorithm to converge.
We introduce the new variables $w$ which are formed by augmenting $f$ with $u_n$ and permuting
such that $w = [f_1^1, \dots, f_1^C, u_1, f_2^1, \dots, f_2^C, u_2, \dots, f_N^1, \dots, f_N^C, u_N]^T$.
Similarly we augment the covariance matrix $K(\theta)$ and permute accordingly such that the new
covariance matrix $V(\theta)$ is a $(C+1)N \times (C+1)N$ block matrix with blocks
$V(\theta)_{i,j} = \mathrm{diag}([K^1_{i,j}, \dots, K^C_{i,j}, 1])$, $i, j \in \{1, \dots, N\}$,
of size $(C+1) \times (C+1)$. Now we can write the posterior for $w$ as
$$p(w \mid X, y, \theta) \propto N(w \mid 0, V) \prod_{n=1}^{N} \prod_{j=1, j \neq y_n}^{C} \Phi(w_n^T b_{n,j}) \tag{6}$$
where $w_n = [f_n^1, \dots, f_n^C, u_n]^T$ and $b_{n,j} = [(e_{y_n} - e_j), 1]^T$ with $e_j$ a
C-dimensional vector of zeros and the $j$th element set to 1.
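As a short worked check (added here for illustration, using only the definitions of $w_n$ and $b_{n,j}$ given above), the inner product $w_n^T b_{n,j}$ recovers exactly the argument of $\Phi$ in Equation (1). The concrete case $C = 3$, $y_n = 1$, $j = 3$ is a hypothetical example.

```latex
w_n^T b_{n,j}
  = \sum_{c=1}^{C} f_n^{c} \, (e_{y_n} - e_j)_c + u_n \cdot 1
  = f_n^{y_n} - f_n^{j} + u_n .

% Example with C = 3, y_n = 1, j = 3:
b_{n,3} = [1, \; 0, \; -1, \; 1]^T ,
\qquad
w_n^T b_{n,3} = f_n^{1} - f_n^{3} + u_n .
```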
The EP approximate posterior for $w$ follows as
$$q_{ep}(w) = Z_{ep}^{-1} \, N(w \mid 0, V) \prod_{n=1}^{N} \prod_{j=1, j \neq y_n}^{C} \tilde{t}_{n,j}(w_n^T b_{n,j}) \tag{7}$$
where $\tilde{t}_{n,j}(w_n^T b_{n,j}) = \tilde{Z}_{n,j}^{-1} N(w_n^T b_{n,j} \mid \tilde{\mu}_{n,j}, \tilde{\sigma}^2_{n,j})$
are the local approximate terms with parameters $\tilde{Z}_{n,j}, \tilde{\mu}_{n,j}, \tilde{\sigma}^2_{n,j}$.
This corresponds to an approximate posterior with $N(C-1)$ local approximation terms which have to
be updated by matching their moments with the corresponding tilted distributions
$$\hat{q}(w_n^T b_{n,j}) = \hat{Z}_{n,j}^{-1} \, q_{-n,j}(w_n^T b_{n,j}) \, \Phi(w_n^T b_{n,j}) \tag{8}$$
where $q_{-n,j}(w_n^T b_{n,j}) = q_{ep}(w_n^T b_{n,j}) \, \tilde{t}_{n,j}(w_n^T b_{n,j})^{-1}$ are the
cavity distributions. Calculating the moments for the tilted distribution can now be done
analytically as Equation (8) resembles the tilted distribution of the probit model [14, 17].
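The analytic moment computation mentioned above can be sketched as follows for a single univariate term. The closed-form expressions are the standard ones for a Gaussian cavity multiplied by a probit factor, as given for binary probit classification in [14]. The function names, the cavity parameters and the numerical cross-check are illustrative assumptions, not the authors' code.

```python
import math


def normal_pdf(z):
    return math.exp(-0.5 * z * z) / math.sqrt(2.0 * math.pi)


def normal_cdf(z):
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))


def probit_tilted_moments(mu, var):
    """Moments of q(z) proportional to N(z | mu, var) * Phi(z).

    mu and var are the mean and variance of the (hypothetical) cavity
    distribution. Returns the normaliser, mean and variance.
    """
    s = math.sqrt(1.0 + var)
    t = mu / s
    ratio = normal_pdf(t) / normal_cdf(t)
    z_hat = normal_cdf(t)
    mean = mu + var * ratio / s
    variance = var - var * var * ratio * (t + ratio) / (1.0 + var)
    return z_hat, mean, variance


def numeric_moments(mu, var, half_width=10.0, steps=20000):
    """Same moments by brute-force integration, used only as a check."""
    sd = math.sqrt(var)
    lo = mu - half_width * sd
    h = 2.0 * half_width * sd / steps
    m0 = m1 = m2 = 0.0
    for k in range(steps + 1):
        z = lo + k * h
        g = normal_pdf((z - mu) / sd) / sd * normal_cdf(z) * h
        m0 += g
        m1 += z * g
        m2 += z * z * g
    mean = m1 / m0
    return m0, mean, m2 / m0 - mean * mean


analytic = probit_tilted_moments(0.5, 2.0)
numeric = numeric_moments(0.5, 2.0)
```

In an EP sweep these three quantities would be used to update the parameters of the corresponding local term; that update step is omitted here.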
2.3
Spectrogram Features
The vector representation $x_n$ for each call is constructed by extracting call shape parameters
from the call's spectrogram similar to [9]. The spectrogram of a call is calculated by using a
Hamming window of size 256 with 95% overlap and an FFT length of 512. The frequency range of the
spectrogram is thresholded by removing frequencies below 5 kHz and above 210 kHz.
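The spectrogram settings described above (Hamming window of 256 samples, 95% overlap, FFT length 512, band limits of 5 kHz and 210 kHz) can be sketched with NumPy as below. This is a minimal illustration rather than the authors' implementation: the function name, the synthetic chirp and the rounding of the hop size to 13 samples (about 94.9% overlap) are assumptions; the 500 kHz sampling rate is taken from the recording description in the text.

```python
import numpy as np


def call_spectrogram(signal, fs=500_000, win=256, overlap=0.95, nfft=512,
                     f_lo=5_000.0, f_hi=210_000.0):
    """Complex short-time Fourier transform restricted to [f_lo, f_hi].

    Returns (freqs, spec) where spec has one row per retained frequency
    bin and one column per analysis window.
    """
    hop = max(1, int(round(win * (1.0 - overlap))))  # 13 samples here
    window = np.hamming(win)
    n_frames = 1 + (len(signal) - win) // hop
    frames = np.stack(
        [signal[i * hop:i * hop + win] * window for i in range(n_frames)],
        axis=1,
    )
    spec = np.fft.rfft(frames, n=nfft, axis=0)  # zero-pads 256 -> 512
    freqs = np.fft.rfftfreq(nfft, d=1.0 / fs)
    keep = (freqs >= f_lo) & (freqs <= f_hi)
    return freqs[keep], spec[keep, :]


# Synthetic 2 ms downward sweep from 80 kHz to 40 kHz (toy stand-in for a call).
fs = 500_000
t = np.arange(1000) / fs
f_start, f_end = 80_000.0, 40_000.0
sweep_rate = (f_end - f_start) / t[-1]
chirp = np.sin(2.0 * np.pi * (f_start * t + 0.5 * sweep_rate * t ** 2))
freqs, spec = call_spectrogram(chirp, fs=fs)
```

The magnitude `np.abs(spec)` is what would typically be displayed or used for extracting call shape parameters.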
Although extracting call shape parameters from the spectrogram of a call captures some of the
call's characteristics and shape, there is still a lot of information that is discarded, e.g.
harmonics. An alternative to characterising a call using predefined parameters is to directly
utilise its spectrogram. However, due to the differences in call duration, the spectrograms would
need to be normalised to the same length using some form of interpolation. In this work we borrow
ideas from speech recognition [11] and previous work on bird call classification [13] and employ
the Dynamic Time Warping (DTW) kernel to directly compare two calls' spectrograms.
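Given two calls $i, j$ from the library and their spectrograms $S^i, S^j$, where
$S^i \in \mathbb{C}^{F \times W}$ with $F$ being the number of frequency bands and $W$ the number
of windows, the dissimilarity matrix $D^{i,j} \in \mathbb{R}^{W \times W}$ is constructed such that
$$D^{i,j}(w, v) = 1 - \frac{S^i(:, w)^T S^j(:, v)}{\sqrt{S^i(:, w)^T S^i(:, w) \; S^j(:, v)^T S^j(:, v)}} \tag{9}$$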
DTW uses the dissimilarity matrix to stretch or compress spectrogram $S^i$ over time in order
to match $S^j$ by calculating the optimal warping path with the smallest alignment cost, $c_{i,j}$,
using dynamic programming. For each call we construct a vector representation $x_n$ by computing
the optimal warping paths with all $N$ calls from the library and concatenating the alignment costs
such that $x_n = [c_{n,1}, \dots, c_{n,N}]$. We then use the squared exponential covariance function
for the covariance matrix of the GP classifier. Figure 2 shows the optimal alignment scores for the
training data used in this study.
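The DTW representation described above can be sketched in a few lines. The example below is a simplified illustration under stated assumptions: spectrogram columns are given as real-valued magnitude vectors, the frame dissimilarity is one minus the cosine similarity, the warping path may step horizontally, vertically or diagonally, and the accumulated cost is not normalised by path length. The function names and toy calls are hypothetical.

```python
import math


def cosine_dissimilarity(a, b):
    """One minus the cosine similarity of two spectrogram columns."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return 1.0 - dot / (norm_a * norm_b)


def dtw_cost(frames_i, frames_j):
    """Smallest accumulated alignment cost between two calls.

    Each call is a list of per-window magnitude vectors.
    """
    n, m = len(frames_i), len(frames_j)
    inf = float("inf")
    acc = [[inf] * (m + 1) for _ in range(n + 1)]
    acc[0][0] = 0.0
    for w in range(1, n + 1):
        for v in range(1, m + 1):
            d = cosine_dissimilarity(frames_i[w - 1], frames_j[v - 1])
            acc[w][v] = d + min(acc[w - 1][v], acc[w][v - 1], acc[w - 1][v - 1])
    return acc[n][m]


def squared_exponential(x, z, magnitude=1.0, length_scale=1.0):
    """Squared exponential covariance between two feature vectors."""
    sq_dist = sum((a - b) ** 2 for a, b in zip(x, z))
    return magnitude * math.exp(-0.5 * sq_dist / length_scale ** 2)


# Toy library: call_b is a time-stretched copy of call_a, call_c differs.
call_a = [[1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
call_b = [[1.0, 0.0, 0.0], [1.0, 0.0, 0.0], [0.5, 0.5, 0.0], [0.0, 1.0, 0.0]]
call_c = [[0.0, 0.0, 1.0], [0.0, 0.2, 1.0], [0.0, 0.0, 1.0]]
library = [call_a, call_b, call_c]

# x_n = [c_{n,1}, ..., c_{n,N}]: alignment costs against every library call.
features = [[dtw_cost(p, q) for q in library] for p in library]
gram = [[squared_exponential(x, z) for z in features] for x in features]
```

Because the time-stretched copy aligns with zero cost, the first two calls end up with nearly identical feature vectors and hence a covariance close to the kernel magnitude.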
2.5
Multiple Kernel GP
GP classifiers allow for integrating information from different sources or different representations of
the data by combining covariance functions. Although both representations discussed in the previous
sections are extracted from a call's spectrogram, some of the call parameters used in Section 2.3
involve non-linear and complex transformations of the spectrograms by utilising prior knowledge of
bat call shapes. Since such knowledge is important for bat call identification and is not present in the
DTW representation, we combine both kernels by a weighted sum and treat the weights as unknown
parameters. The kernel weights are jointly optimized along with the individual kernel parameters by
maximizing the marginal likelihood.

Table 1: Summary of the dataset.

 #  Species                       Samples   Calls
    Family: Emballonuridae
 1  Balantiopteryx plicata             16     384
    Family: Molossidae
 2  Nyctinomops femorosaccus           16     311
 3  Tadarida brasiliensis              49     580
    Family: Mormoopidae
 4  Mormoops megalophylla              10     135
 5  Pteronotus davyi                    8     106
 6  Pteronotus parnellii               23     313
 7  Pteronotus personatus               7      51
    Family: Phyllostomidae
 8  Artibeus jamaicensis               11      82
 9  Desmodus rotundus                   6      38
10  Leptonycteris yerbabuenae          26     392
11  Macrotus californicus               6      53
12  Sturnira ludovici                  12      71
    Family: Vespertilionidae
13  Antrozous pallidus                 58    1937
14  Eptesicus fuscus                   74    1589
15  Idionycteris phyllotis              6     177
16  Lasiurus blossevillii              10      90
17  Lasiurus cinereus                   5      42
18  Lasiurus xanthinus                  8     204
19  Myotis volans                       8     140
20  Myotis yumanensis                   5      89
21  Pipistrellus hesperus              85    2445
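The weighted kernel sum can be illustrated with a small sketch. It assumes two precomputed covariance (Gram) matrices over the same calls and a single weight for the DTW kernel, with the call shape kernel receiving the remainder so that the weights are non-negative and sum to one. The function name and the toy matrices are hypothetical; the joint optimisation of weights and kernel parameters is not shown.

```python
def combine_kernels(k_dtw, k_shape, w_dtw):
    """Convex combination w_dtw * K_dtw + (1 - w_dtw) * K_shape."""
    if not 0.0 <= w_dtw <= 1.0:
        raise ValueError("w_dtw must lie in [0, 1]")
    w_shape = 1.0 - w_dtw
    return [
        [w_dtw * a + w_shape * b for a, b in zip(row_dtw, row_shape)]
        for row_dtw, row_shape in zip(k_dtw, k_shape)
    ]


# Toy 2 x 2 Gram matrices for two calls under each representation.
k_dtw = [[1.0, 0.2], [0.2, 1.0]]
k_shape = [[1.0, 0.6], [0.6, 1.0]]
k_combined = combine_kernels(k_dtw, k_shape, w_dtw=0.8)
```

A non-negative weighted sum of valid covariance matrices is itself a valid covariance matrix, which is why the combination can be plugged into the same GP classifier.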
3
3.1
Bat echolocation calls were recorded across North and Central Mexico. Live-trapped bats were
measured and identified to species level using field keys [20, 21], and bat taxonomy followed [22].
We constructed an echolocation call library by recording the calls of captured individuals using two
different techniques: 1) bats were recorded while released from the hand about 6 to 10 m from the
bat detector in open areas and away from vegetation, 2) bats were tied to a zip-line and recorded
while flying along the zip-line flight path. The bat detector was set to manually record calls in
real time, full spectrum at 500 kHz. Each recording consists of multiple calls from a single
individual bat.
In total our dataset consists of 21 species, 449 individual bats and 8429 calls. Table 1 gives a
summary of the dataset. Care must be taken when splitting the data into training and test sets during
cross-validation in order to ensure that calls from the same individual bat recording are not in both
sets. For that reason we split our dataset using recordings instead of calls. For species with fewer
than 100 recordings we include as many calls as possible up to a maximum of 100 calls per species.
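Splitting by recording rather than by call can be sketched as follows. This is a minimal illustration with hypothetical names and toy data: recordings are shuffled and dealt round-robin into folds, so every call from a recording lands in the same fold. It does not stratify by species, which a real experiment would probably also want.

```python
import random


def recording_folds(recording_ids, n_folds=5, seed=0):
    """Yield (train_idx, test_idx) pairs that keep recordings intact.

    recording_ids[i] is the recording that call i was extracted from.
    """
    unique = sorted(set(recording_ids))
    rng = random.Random(seed)
    rng.shuffle(unique)
    fold_of = {rec: k % n_folds for k, rec in enumerate(unique)}
    for fold in range(n_folds):
        test = [i for i, rec in enumerate(recording_ids) if fold_of[rec] == fold]
        train = [i for i, rec in enumerate(recording_ids) if fold_of[rec] != fold]
        yield train, test


# Toy data: 10 recordings with 3 calls each (30 calls in total).
recording_ids = [f"rec{r:02d}" for r in range(10) for _ in range(3)]
folds = list(recording_folds(recording_ids, n_folds=5))
```

Each call index appears in exactly one test fold, and no recording contributes calls to both sides of any split.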
3.2
Experiments
We compare the classification accuracy of the multinomial probit regression with Gaussian process
prior classifier using the three representations discussed in Sections 2.3-2.5. The values of the call
shape parameters are normalised to have zero mean and unit standard deviation by subtracting the
mean and dividing by the standard deviation of the call shape parameters in the training set. For
the 33 covariance function parameters, $\sigma^2$ and $l_1, \dots, l_{32}$, we use independent
Gamma priors with shape parameter 1.5 and scale parameter 10. For the DTW representation each call
vector of optimal alignment costs is normalised to unit length and independent Gamma(1.5, 10) priors
are used for the magnitude and length-scale covariance function parameters. The weights for the
linear combination of the DTW and call shape kernel functions are restricted to be positive and sum
to 1, and a flat Dirichlet prior is used.
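The preprocessing and prior described above can be sketched as below. The helper names and toy numbers are hypothetical, and the use of the population standard deviation (dividing by n rather than n - 1) is an assumption, since the text does not say which estimator was used.

```python
import math


def standardise(train_rows, rows):
    """z-score each column of rows using training-set mean and std."""
    n = len(train_rows)
    dims = len(train_rows[0])
    means = [sum(r[d] for r in train_rows) / n for d in range(dims)]
    stds = [
        math.sqrt(sum((r[d] - means[d]) ** 2 for r in train_rows) / n)
        for d in range(dims)
    ]
    return [[(r[d] - means[d]) / stds[d] for d in range(dims)] for r in rows]


def unit_length(vector):
    """Scale a vector of alignment costs to Euclidean norm 1."""
    norm = math.sqrt(sum(v * v for v in vector))
    return [v / norm for v in vector]


def gamma_log_pdf(x, shape=1.5, scale=10.0):
    """Log density of a Gamma(shape, scale) prior at x > 0."""
    return ((shape - 1.0) * math.log(x) - x / scale
            - math.lgamma(shape) - shape * math.log(scale))


# Toy call shape parameters (two parameters per call).
train = [[20.0, 3.0], [40.0, 5.0], [60.0, 10.0]]
test = [[50.0, 4.0]]
train_z = standardise(train, train)
test_z = standardise(train, test)   # test data reuse the training statistics
costs = unit_length([3.0, 4.0, 12.0])
log_prior = gamma_log_pdf(2.0)
```

With shape 1.5 and scale 10 the prior density peaks at 5 and has a long right tail, so it only weakly constrains the magnitude and length-scale parameters.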
3.3
Results
Table 2 compares the misclassification rate of the three methods. Results are averages of a 5-fold
cross validation. We can see that the DTW representation is better at characterising
the species variations, achieving a better classification accuracy. However, results can be improved
by also considering information from the call shape parameters. Moreover, the optimised weights
for the kernel combination significantly favor the DTW covariance function with a weight of 0.8,
in contrast to the call shape parameters with weight 0.2. If we fix the weight parameters to
equal values we obtain a classification error rate of 0.22 ± 0.031, highlighting the importance of
the DTW kernel matrix.
The independent length scales also allow us to interpret the discriminatory power of the call shape
parameters. In our experiments the frequency at the center of the duration of a call, the characteristic
call frequency (determined by finding the point in the final 40% of the call having the lowest slope
or exhibiting the end of the main trend of the body of the call), as well as the start and end frequencies
of the call have consistently obtained a small length-scale parameter value, indicating their importance
in species discrimination. This coincides with expert knowledge on bat call shapes, where these call
shape parameters are extensively used for identifying species.
Table 2: Classification results, smaller values are better.

Method                          Error rate    Std.
Call shape parameters                 0.24   0.052
DTW                                   0.21   0.026
DTW + call shape parameters           0.20   0.037
[Figure 3: confusion matrix for classes 1 to 21; axes: Output Class, Target Class; overall accuracy 85.0%]
Figure 3: Confusion matrix of the best classification. Classes are in the same order and grouped as
in Table 1.
Figure 3 shows the confusion matrix from the best classification result, with a 15%
misclassification rate. There is an overall high accuracy for all classes with the exception of
species Lasiurus xanthinus, class 18, which is often misclassified as Antrozous pallidus, class 13;
this needs to be investigated further. In contrast, the very similar call shapes of the Myotis
species are easily discriminated. Finally, misclassification rates are higher among species within
the same family than between species from different families, indicating a common evolutionary
path of bat echolocation.
Previous works highlight the difficulty of discriminating species from the Phyllostomidae family,
while others found Myotis species hard to classify as well. The high accuracy obtained in this
study in separating species in the Phyllostomidae family from other families, and its ability to
discriminate between Myotis species, sets the ground for further development of an automatic
identification tool for Mexican bats. Although only a small set of Mexican bat species was used in
this study, it shows promising applications to a bigger set of species. Despite these limitations,
the development of a national call library of full-spectrum calls together with the echolocation
classification tool will set the foundations to establish a long-term National Bat Acoustic
Monitoring Program. This is a feasible alternative for developing countries to create biodiversity
monitoring programs and develop volunteer networks, because acoustic programs are easier and less
costly to implement at broad scales and over the long term compared to other monitoring techniques.
References
[1] G. Jones, D. Jacobs, T. Kunz, M. Willig, and P. Racey. Carpe noctem: the importance of bats
as bioindicators. Endangered Species Research, 8:93-115, 2009.
[2] M. B. Fenton and G. P. Bell. Recognition of species of insectivorous bats by their
echolocation calls. Journal of Mammalogy, 62(2):233-242, 1981.
[3] G. Jones and E. Teeling. The evolution of echolocation in bats. Trends in Ecology and
Evolution, 21:149-156, 2006.
[4] I. Ahlén and H. Baagøe. Use of ultrasound detectors for bat studies in Europe: experiences
from field identification, surveys, and monitoring. Acta Chiropterologica, 1:137-150, 1999.
[5] M. K. Obrist. Flexible bat echolocation: the influence of individual, habitat and conspecifics on
sonar signal design. Behavioral Ecology and Sociobiology, 36:207-219, 1995.
[6] K. Murray, E. Britzke, and L. Robbins. Variation in search phase calls of bats. Journal of
Mammalogy, 82:728-737, 2001.
[7] H. U. Schnitzler, C. F. Moss, and A. Denzinger. From spatial orientation to food acquisition in
echolocating bats. Trends in Ecology and Evolution, 18:386-394, 2003.
[8] Mark Girolami and Simon Rogers. Variational Bayesian multinomial probit regression with
Gaussian process priors. Neural Computation, 18(8):1790-1817, 2006.
[9] Charlotte L. Walters, Robin Freeman, Alanna Collen, Christian Dietz, M. Brock Fenton,
Gareth Jones, Martin K. Obrist, Sébastien J. Puechmaille, Thomas Sattler, Björn M. Siemers,
Stuart Parsons, and Kate E. Jones. A continental-scale tool for acoustic identification of
European bats. Journal of Applied Ecology, 49(5):1064-1074, 2012.
[10] S. Parsons and G. Jones. Acoustic identification of twelve species of echolocating bat by
discriminant function analysis and artificial neural networks. Journal of Experimental Biology,
203:2641-2656, 2000.
[11] Hiroaki Sakoe and Seibi Chiba. Dynamic programming algorithm optimization for spoken
word recognition. IEEE Transactions on Acoustics, Speech and Signal Processing, 26:43-49,
1978.
[12] Hansheng Lei and Bingyu Sun. A study on the dynamic time warping in kernel machines. In
Signal-Image Technologies and Internet-Based System, 2007. SITIS '07. Third International
IEEE Conference on, pages 839-845, 2007.
[13] Theodoros Damoulas, Samuel Henry, Andrew Farnsworth, Michael Lanzone, and Carla
Gomes. Bayesian classification of flight calls with a novel dynamic time warping kernel. In
Proceedings of the 2010 Ninth International Conference on Machine Learning and
Applications, ICMLA '10, pages 424-429, Washington, DC, USA, 2010. IEEE Computer Society.
[14] Carl Edward Rasmussen and Christopher K. I. Williams. Gaussian Processes for Machine
Learning (Adaptive Computation and Machine Learning). The MIT Press, 2005.
[15] Malte Kuss and Carl Edward Rasmussen. Assessing approximate inference for binary Gaussian
process classification. Journal of Machine Learning Research, pages 1679-1704, 2005.
[16] Mark Girolami and Mingjun Zhong. Data integration for classification problems employing
Gaussian process priors. In B. Schölkopf, J. Platt, and T. Hoffman, editors, Advances in Neural
Information Processing Systems 19, pages 465-472. MIT Press, Cambridge, MA, 2007.
[17] Jaakko Riihimäki, Pasi Jylänki, and Aki Vehtari. Nested expectation propagation for Gaussian
process classification with a multinomial probit likelihood. Journal of Machine Learning
Research, 14:75-109, 2013.
[18] Matthias Seeger, Neil D. Lawrence, and Ralf Herbrich. Efficient nonparametric Bayesian
modelling with sparse Gaussian process approximations. Technical report, Max Planck Institute
for Biological Cybernetics, Tübingen, Germany, 2006.
[19] Thomas Minka. Expectation propagation for approximate Bayesian inference. In Proceedings
of the Seventeenth Annual Conference on Uncertainty in Artificial Intelligence
(UAI-01), pages 362-369, San Francisco, CA, 2001. Morgan Kaufmann.
[20] R. A. Medellín, H. Arita, and O. Sánchez. Identificación de los murciélagos de México: Clave
de campo. Publicaciones Especiales. Asociación Mexicana de Mastozoología A. C., México,
DF, 2008.
[21] G. Ceballos and G. Oliva. Los mamíferos silvestres de México. CONABIO UNAM Fondo de
Cultura Económica, México, DF, 2005.
[22] N. B. Simmons. Order Chiroptera. In Wilson, D.E., Reeder, D.M. (eds.), Mammal Species of
the World: A Taxonomic and Geographic Reference, third ed., pages 312-529. Johns Hopkins
University Press, 2008.
Humpback whale song relies on the ability of these whales to copy and
recombine vocal sounds and to arrange them in new sequences, at many
tropical sites all over the planet. We present the advantages of sparse coding for
representing these song sequences in order to track their structure and evolution, and
to automatically recognize the area from which a song was emitted. This
representation may also help to understand the learning processes that can explain
how whales build new songs. Demonstrations are conducted on real recordings
(we thank C. Clark and O. Lammers for sharing some of their samples).
X. C. Halkias, S. Paris, H. Glotin
LSIS/DYNI
12/10/2013
Outline
1 Background: Mysticete Species Recognition/Classification
2 Methodologies
3 Experimental Results
4 Conclusions
Overview
Goal
Automatic classification of the vocalizations of 5 different mysticete species.
1 On-line analysis:
Real-time monitoring: Navigation/Preservation (endangered species)/Migration patterns
Mysticete sounds I
Species                ROI Duration   Fs       Freq. Range
Southern Right Whale   130 min        8 kHz    50-1500 Hz
Humpback Whale         55 min         4 kHz    150-1700 Hz
Bowhead Whale          84 min         4 kHz    110-800 Hz
Blue Whale             378 min        100 Hz   15-19 Hz
Fin Whale              162 min        100 Hz   15-30 Hz
Mysticete sounds II
Figure: Range of: Southern Right, Humpback, Bowhead, Blue, Fin whale
Figure: Ground truth distributions for mysticete species
Existing methodologies
Region of Interest (ROI) detection
Feature extraction
Classification
Single-species: Boosting of true positive rates (TPR) for single-species detection / 2-class problem: species vs. noise
Multiple-species: Multiple-species recognition in noisy environment / multi-class problem
Ground truth limitations (experts), supervised approaches (SVM, ANN, RFT, HMM etc.)
Our approach
Architecture
Classification
p(x, h) = \frac{e^{-E(x, h)}}{Z}
p(x | h) = \prod_i p(x_i | h)
p(h | x) = \prod_j p(h_j | x)
p(x_i = 1 | h) = \sigma(b_i + \sum_j h_j w_{ij})
p(h_j = 1 | x) = \sigma(a_j + \sum_i x_i w_{ij})
E(x, h) = -\sum_{i=1}^{I} \sum_{j=1}^{J} x_i h_j w_{ij} - \sum_{i=1}^{I} b_i x_i - \sum_{j=1}^{J} a_j h_j
E(x, h) = \sum_{i=1}^{I} \frac{(x_i - b_i)^2}{2\sigma_i^2} - \sum_{j=1}^{J} a_j h_j - \dots
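Architecture
Equations
Sparsity: Weight decay
J(W, b; x) = \frac{1}{2} \| h_{W,b}(x) - x \|^2 + \frac{\lambda}{2} \| W \|_2^2
\Delta W_{ij}^{(l)} = \frac{1}{N} \left[ \sum_{n=1}^{N} \nabla_{W^{(l)}} J(W^{(l)}, b^{(l)}; x^{(n)}) \right] + \lambda W_{ij}^{(l)}
h_\theta(x) = \frac{1}{\sum_{j=1}^{k} e^{\theta_j^T x}} \left[ e^{\theta_1^T x}, e^{\theta_2^T x}, \dots, e^{\theta_k^T x} \right]^T
J(\theta) = -\frac{1}{N} \left[ \sum_{n=1}^{N} \sum_{j=1}^{k} 1\{y^{(n)} = j\} \log \frac{e^{\theta_j^T x^{(n)}}}{\sum_{l=1}^{k} e^{\theta_l^T x^{(n)}}} \right] + \frac{\lambda}{2} \sum_{i,j} \theta_{ij}^2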
Overview
Extract 5000 random patches from ROI and 5000 from noise
Normalize patches: zero mean, unit variance
Feature vector: 1600x1 scaled and normalized patch
http://www.mobysound.org/
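The patch preparation listed above could look roughly like the following sketch. It assumes a spectrogram stored as a 2-D NumPy array and square 40x40 patches (40 x 40 = 1600 is one way to obtain a 1600x1 vector; the slides do not state the patch shape). The function names and the synthetic spectrogram are illustrative and are not the authors' code.

```python
import numpy as np

def extract_patches(spectrogram, n_patches=5000, patch_size=40, seed=0):
    """Sample random square patches and flatten each into a feature vector."""
    rng = np.random.default_rng(seed)
    n_freq, n_time = spectrogram.shape
    # Top-left corners chosen uniformly so that every patch fits inside the array
    rows = rng.integers(0, n_freq - patch_size + 1, size=n_patches)
    cols = rng.integers(0, n_time - patch_size + 1, size=n_patches)
    return np.stack([
        spectrogram[r:r + patch_size, c:c + patch_size].ravel()
        for r, c in zip(rows, cols)
    ])  # shape (n_patches, patch_size**2)

def normalize_patches(patches, eps=1e-8):
    """Give every patch zero mean and unit variance."""
    mean = patches.mean(axis=1, keepdims=True)
    std = patches.std(axis=1, keepdims=True)
    return (patches - mean) / (std + eps)

# Synthetic stand-in for the spectrogram of a region of interest (assumption)
spec = np.random.default_rng(1).random((128, 2000))
features = normalize_patches(extract_patches(spec))
print(features.shape)
```

The same two functions would be applied to the noise segments to obtain the second set of 5000 patches.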
Blue/Fin - SAE
Blue/Fin - RBM
Table: Confusion matrix for Blue and Fin whale using the SAE and RBM architectures
Noise   2.52/2.71   3.20/2.92   94.28/94.37
Table: Confusion matrix for Blue whale, Fin whale and noise using the SAE and RBM architectures
Blue/Fin
Model   Classification Accuracy
SAE     95.40%
RBM     97.50%
Blue/Fin/Noise
Model   Classification Accuracy
SAE     92.88%
RBM     93.26%
Bowhead/Humpback/South.R SAE
Bowhead/Humpback/South.R RBM
Bowhead     4.74/7.69   15.95/14.88   79.31/77.43
Pr. Value   S. Right    Humpback      Bowhead      Noise
Noise       3.02/2.62   4.72/4.95     4.37/4.47    87.89/87.95
Bowhead/Humpback/Southern Right
Model   Classification Accuracy
SAE     74.75%
RBM     73.30%
Bowhead/Humpback/Southern Right/Noise
Model   Classification Accuracy
SAE     63.73%
RBM     63.16%
Table: Classification accuracies for species within the same frequency
range with and without a noise class using the SAE and RBM
architectures
All species
Bowhead/Humpback/Southern Right/Blue/Fin
Model   Classification Accuracy
SAE     79.54%
RBM     80.68%
Bowhead/Humpback/Southern Right/Blue/Fin/Noise
Model   Classification Accuracy
SAE     69.40%
RBM     68.90%
Parameter tuning
Model             Parameter         Value
SAE/RBM/Softmax   Weight-decay      0.003
SAE               Sparsity          0.05
SAE               Sparsity weight   3
SAE/RBM           Hidden Units      200
SAE/Softmax       GD Iterations     300
RBM               CG Iterations     500
Important variables
Patch size
Hidden units
Feature space dimensionality
Closing thoughts
Semi-supervised approach
RBM/SAE unsupervised methodology for feature extraction
Sparse constraints for discriminative features
Features identify salient call structures
Extension for both detection and classification
References
S. David, N. Mesgarani, and S. Shamma, Estimating sparse spectro-temporal receptive fields with natural stimuli, Network: Computation in Neural Systems, volume 18, pages 191-212, 2007.
M. A. Roch, M. S. Soldevilla, R. Hoenigman, S. M. Wiggins, and J. H. Hildebrand, Comparison of machine learning techniques for the classification of echolocation clicks from three species of odontocetes, Journal of Canadian Acoustics, volume 36, pages 41-47, 2008.
D. K. Mellinger, A comparison of methods for detecting right whale calls, Journal of Canadian Acoustics, volume 32, pages 55-65, 2004.
M. Ranzato, C. Poultney, S. Chopra, and Y. LeCun, Efficient learning of sparse representations with an energy-based model, in Advances in Neural Information Processing Systems 19, edited by B. Schölkopf, J. Platt, and T. Hoffman, pages 1137-1144, MIT Press, Cambridge, MA, 2007.
Thank you
QUESTIONS?
X. C. Halkias, S. Paris, H. Glotin
LSIS/DYNI
NIPS4B, 10 December 2013, Lake Tahoe, NV, USA
Chapter 4
Advanced ANN
4.1 ConvNets & DNN for bioacoustic
LeCun Y.
Intelligent perceptual tasks such as audition require the construction of good internal
representations. Theoretical and empirical evidence suggests that the perceptual world is best
represented by a multi-stage hierarchy in which features in successive stages are
increasingly global, invariant, and abstract. An important challenge for Machine Learning is
to devise "deep learning" methods for multi-stage architectures that can automatically learn
good feature hierarchies from labeled and unlabeled data. A class of such methods that
combine unsupervised sparse coding and supervised refinement will be described. We
demonstrate the use of these deep learning methods to train convolutional networks
(ConvNets). ConvNets are biologically-inspired architectures consisting of multiple stages
of filter banks, interspersed with non-linear operations, and spatial pooling operations,
analogous to the simple cells and complex cells in the mammalian auditory cortex. A
number of applications will be shown.
Achim Lewandowski
Austrian Research Institute
for Artificial Intelligence
Vienna, Austria
[email protected]
Abstract
Typically, machine learning methods attempt to construct from some limited
amount of data a more general model which extends the range of application beyond the available examples. Many methods specifically attempt to be purely
data driven, assuming that everything is contained in the data. On the other
hand, there often exists additional abstract knowledge about the system to be
modeled, but there is no obvious way to combine these two domains. We
propose the calculus of functional equations as an appropriate language to describe many relations in a way that is more general than a typical parameterized
model, but allows one to be more specific about the setting than a universal
approximation scheme like neural networks. Symmetries, conservation laws,
and concepts like determinism can be expressed this way. Many of these functional equations can be translated into specific network structures and topologies, which constrain the possible input-output relations of the network to
the solution space of the equations. As a result, less data is needed for
training, and more general results may be derived from the
model. As an example, a natural method for inter- or extrapolation of time series
is derived, which does not use any fixed interpolation scheme but is automatically constructed from the knowledge/assumption that the data series is generated
by an underlying deterministic dynamical system.
1 Introduction
To interpolate data which is sampled in finite, discrete time steps into a continuous signal, e.g. for
resampling, normally a model has to be introduced for this purpose, like linear interpolation,
splines, etc. In this paper we attempt to derive a natural method of interpolation, where the correct
model is derived from the data itself, using some general assumptions about the underlying process.
Applying the formalism of generalized iteration, iteration semigroups and iterative roots from the
mathematical branch of functional equations, we attempt to characterize a method to determine whether
such a natural interpolation exists for a given time series and give a method for its calculation: a
formal one for linear autoregressive time series and a neural network approximation for the general
nonlinear case.
Let $x_t$ be an autoregressive time series: $x_t = f(x_{t-1}, x_{t-2}, \dots, x_{t-n}) + \epsilon_t$. We will not deal here
with finding $f$, i.e. predicting the time series; instead we assume $f$ is already known or already
approximately derived from the given data. We will attempt to embed the discrete series $x_t$,
$t = 0, 1, 2, \dots$, into a continuous function $x(t)$, $t \in \mathbb{R}^+$. To clarify the idea we first present the method
for the case that the time series is generated totally deterministically ($\epsilon_t = 0$) by an underlying autonomous dynamical system. Later we will consider the influences of additional external
inputs and noise.
The time evolution of any autonomous dynamical system is represented by a solution of the translation equation [1],
$\Phi(x_0, t_1 + t_2) = \Phi(\Phi(x_0, t_1), t_2)$ (1)
where $x_0$ is a state vector representing an initial condition and $t_1$, $t_2$ are arbitrary time intervals.
For continuous time dynamical systems this equation holds for every positive $t$. If we assume that
the given time series is a discrete time sampling of an underlying continuously evolving signal, we
have to solve (1) under the conditions $\Phi(x, 0) = x$ and $\Phi(x, 1) = f(x)$, where $f$ is the discrete
time mapping represented by the data. (Without loss of generality we can assume the sampling rate
of the discrete time data to be one, which results in a simple and intuitive formalism.)
To double the sampling rate, for example, (1) becomes $f(x) = \Phi(\Phi(x, \tfrac{1}{2}), \tfrac{1}{2})$.
Substituting $\varphi(x) = \Phi(x, \tfrac{1}{2})$ we get $\varphi(\varphi(x)) = f(x)$, the functional equation for the iterative root
of the mapping $f$ [3].
For an autoregressive series of order $n$, we define a mapping $F$ from the state vector $\mathbf{x}_{t-1} = (x_{t-1}, x_{t-2}, \dots, x_{t-n})$
to $\mathbf{x}_t = (x_t, x_{t-1}, \dots, x_{t-n+1})$
with $x_t = f(x_{t-1}, x_{t-2}, \dots, x_{t-n})$. Except for the first element this is a trivial time shift operation:
each element of $\mathbf{x}$ is just replaced by its successor. But because $F$ is now a self-mapping within an $n$-dimensional space, the time development can be calculated by iterating $F$, and we can try to find
the generalized iteration with non-integer iteration counts to obtain a time-continuous embedding $F^t$,
the continuous iteration semigroup of $F$, and extract a function $x(t)$ from it [2].
2 Linear Case
The idea is best demonstrated for the linear case, where its application simplifies and unifies several problems. For a linear autoregressive time series AR(n) model with $x_t = \sum_{k=1}^{n} a_k x_{t-k}$, the
mapping $F$ can be written as a square matrix
$$F = \begin{pmatrix} a_1 & a_2 & \cdots & a_n \\ 1 & 0 & \cdots & 0 \\ 0 & 1 & \cdots & 0 \\ \vdots & & \ddots & \vdots \\ 0 & \cdots & 1 & 0 \end{pmatrix}$$
with the coefficients $a_k$ in the first row and the lower subdiagonal filled with ones. Then we can
compute $\mathbf{x}_t = F \mathbf{x}_{t-1}$, and the discrete time evolution of the system can be calculated using
the matrix powers $\mathbf{x}_{t+n} = F^n \mathbf{x}_t$.
This autoregressive system is called linearly embeddable if the matrix power $F^t$ also exists for all
real $t \in \mathbb{R}^+$. This is the case if $F$ can be decomposed into $F = S A S^{-1}$ with $A$ being a diagonal
matrix consisting of the eigenvalues $\lambda_i$ of $F$ and $S$ being an invertible square matrix whose columns are the eigenvectors of $F$. Additionally, all $\lambda_i$ must be non-negative to have a linear and real
embedding; otherwise we get a complex embedding. Then $F^t = S A^t S^{-1}$
with $A^t = \mathrm{diag}(\lambda_1^t, \dots, \lambda_n^t)$.
Now we have a continuous function $\mathbf{x}(t) = F^t \mathbf{x}_0$, and the interpolation of the original time series
$x_t$ consists of the first element of $\mathbf{x}(t)$.
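In case there is also a constant term, i.e. the mean is not zero, $x_t = \sum_{k=1}^{n} a_k x_{t-k} + b$, we just have to
append a constant element 1 to the state vector $\mathbf{x}$ and extend $F$ by one row and one column:
$$F = \begin{pmatrix} a_1 & a_2 & \cdots & a_n & b \\ 1 & 0 & \cdots & 0 & 0 \\ \vdots & \ddots & & \vdots & \vdots \\ 0 & \cdots & 1 & 0 & 0 \\ 0 & \cdots & 0 & 0 & 1 \end{pmatrix}.$$
A special case is $F^{1/2}$, the square root of a matrix, which solves the matrix equation
$F^{1/2} F^{1/2} = F$. It resembles the iterative root of linear functions and corresponds to a doubling
of the sampling rate.
A few lines of Maple code can automate this procedure for both symbolic and numeric expressions.
A sample worksheet is available at the author's web page.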
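The numeric side of this procedure can also be sketched in Python with NumPy instead of Maple. This is an illustrative sketch, not the author's worksheet: the function names are hypothetical, the state vector always carries a trailing constant 1 (so a zero constant term is handled by the same code), and F is assumed to be diagonalizable.

```python
import numpy as np

def companion_matrix(coeffs, const=0.0):
    """Build F for x_t = sum_k a_k x_{t-k} + const; the state gets a trailing 1."""
    n = len(coeffs)
    F = np.zeros((n + 1, n + 1))
    F[0, :n] = coeffs               # AR coefficients in the first row
    F[0, n] = const                 # constant term multiplies the trailing 1
    F[1:n, :n - 1] = np.eye(n - 1)  # time shift: ones on the lower subdiagonal
    F[n, n] = 1.0                   # keeps the trailing 1 unchanged
    return F

def fractional_power(F, t):
    """F^t = S A^t S^{-1}; the result is complex if an eigenvalue is negative."""
    eigvals, S = np.linalg.eig(F)
    At = np.diag(eigvals.astype(complex) ** t)
    return S @ At @ np.linalg.inv(S)

def interpolate(coeffs, x_init, t, const=0.0):
    """First component of F^t x_0, with x_init = (x_0, x_{-1}, ...)."""
    F = companion_matrix(coeffs, const)
    state = np.append(np.asarray(x_init, dtype=float), 1.0)
    return (fractional_power(F, t) @ state)[0]

# Example: x_t = 2 x_{t-1} with x_0 = 3, evaluated halfway between two samples
value = interpolate([2.0], [3.0], 0.5)
print(value.real)  # close to 3 * sqrt(2)
```

Calling `fractional_power(F, 0.5)` gives the matrix square root discussed above, so multiplying the result by itself should reproduce F up to rounding.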
3 Examples
We will now provide some simple examples to demonstrate this formalism.
3.1 One dimensional linear case
The time series given by some $x_0$ and $x_t = 2x_{t-1}$ simply doubles every time step. The natural
interpolation we immediately get by applying the former formalism in the trivial one-dimensional
case is $x(t) = 2^t x_0$, which is of course exactly what we expect: exponential growth. But a small
change makes the problem much more difficult: if $x_t = a x_{t-1} + b$ we expect a mixture of constant
and exponential growth, but what is the exact continuous law?
With $F = \begin{pmatrix} a & b \\ 0 & 1 \end{pmatrix}$ acting on the state vector $(x_t, 1)^T$ we get the
decomposition
$$\mathbf{x}(t) = F^t \mathbf{x}_0 = S A^t S^{-1} \mathbf{x}_0 = \begin{pmatrix} 1 & 1 \\ \frac{1-a}{b} & 0 \end{pmatrix} \begin{pmatrix} 1 & 0 \\ 0 & a^t \end{pmatrix} \begin{pmatrix} 0 & \frac{b}{1-a} \\ 1 & \frac{b}{a-1} \end{pmatrix} \begin{pmatrix} x_0 \\ 1 \end{pmatrix}$$
and for the first component $x(t) = a^t x_0 + b\,\frac{a^t - 1}{a - 1}$ (which equals $a x_0 + b$ for $t = 1$).
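As a quick numerical check of this closed form, the sketch below compares it with plain iteration at integer times and verifies the translation equation (1) for two half steps. The function names and the parameter values are illustrative assumptions; the formula requires a > 0 and a != 1.

```python
def ar1_continuous(a, b, x0, t):
    """Closed-form interpolation of x_t = a*x_{t-1} + b (needs a > 0, a != 1)."""
    return a ** t * x0 + b * (a ** t - 1) / (a - 1)

def ar1_discrete(a, b, x0, steps):
    """Plain iteration of the recurrence for an integer number of steps."""
    x = x0
    for _ in range(steps):
        x = a * x + b
    return x

a, b, x0 = 1.5, 2.0, 1.0  # arbitrary example values

# At integer times the continuous law reproduces the discrete series
for k in range(5):
    assert abs(ar1_continuous(a, b, x0, k) - ar1_discrete(a, b, x0, k)) < 1e-9

# Translation equation: two half steps equal one full step
half_step = ar1_continuous(a, b, x0, 0.5)
full_step = ar1_continuous(a, b, half_step, 0.5)
print(full_step, ar1_discrete(a, b, x0, 1))
```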
We do not have to consider stability or stationarity of the AR(1) model here, but note that to
obtain a completely real-valued function $x(t)$, $a$ has to be positive. Later we will discuss the
meaning of such cases with complex embeddings; in short, it means that there is no one-dimensional continuous time dynamical system that can generate such a time series. In the linear case this
should be clear because a negative $a$ implies oscillatory behavior of $x_t$. This means that an initial
condition $x_0$ alone is not enough to determine the continuation of the trajectory: it could be on the rising or the falling slope. The underlying dynamical system needs one more hidden dimension to
allow an embedding. The other dimension can be represented by the imaginary part of $x(t)$, which
vanishes at all integer times $t$. Taking only the real part still results in a valid interpolation of
the given series, the observable of the system.
This is such an embedding of the AR(2) process $x_t = \tfrac{1}{2} x_{t-1} + \tfrac{1}{2} x_{t-2} + \tfrac{1}{3}$ with $x_0 = x_1 = 1$.
Circles mark the time series $x_t$; the left graph shows the real part of $x(t)$, our natural interpolation,
and the imaginary part is on the right.
For the Fibonacci series $x_t = x_{t-1} + x_{t-2}$ with $x_0 = 0$ and $x_1 = 1$, the decomposition is
$$\mathbf{x}_{t+1} = F^t \mathbf{x}_1 = S A^t S^{-1} \mathbf{x}_1 = \begin{pmatrix} \frac{1+\sqrt{5}}{2} & \frac{1-\sqrt{5}}{2} \\ 1 & 1 \end{pmatrix} \begin{pmatrix} \left(\frac{1+\sqrt{5}}{2}\right)^t & 0 \\ 0 & \left(\frac{1-\sqrt{5}}{2}\right)^t \end{pmatrix} \begin{pmatrix} \frac{1}{\sqrt{5}} & \frac{1}{2} - \frac{1}{2\sqrt{5}} \\ -\frac{1}{\sqrt{5}} & \frac{1}{2} + \frac{1}{2\sqrt{5}} \end{pmatrix} \begin{pmatrix} 1 \\ 0 \end{pmatrix}$$
which turns out to evaluate exactly to Binet's famous formula for the Fibonacci series,
$x_t = \frac{1}{\sqrt{5}} \left[ \left(\frac{1+\sqrt{5}}{2}\right)^t - \left(\frac{1-\sqrt{5}}{2}\right)^t \right]$ [5].
Because the second eigenvalue is negative, a real linear continuous time embedding does not exist
and $x(t)$ takes complex values for non-integer $t$. Figure 2 shows the real and imaginary parts of $x(t)$.
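A short sketch makes the complex embedding concrete: Binet's formula is evaluated at real t using the principal branch of the complex power. The function name is illustrative and the choice of branch is an assumption; the text above does not state which branch is used.

```python
SQRT5 = 5 ** 0.5
PHI = (1 + SQRT5) / 2  # positive eigenvalue
PSI = (1 - SQRT5) / 2  # negative eigenvalue, source of the imaginary part

def fib_continuous(t):
    """Binet's formula at real t; complex-valued for non-integer t."""
    return (PHI ** t - complex(PSI) ** t) / SQRT5

# Integer times reproduce the Fibonacci numbers with vanishing imaginary part
for k in range(8):
    print(k, round(fib_continuous(k).real))

# Between the samples the imaginary part is non-zero
print(fib_continuous(0.5))
```

The recurrence still holds between the samples, for example `fib_continuous(2.5)` equals `fib_continuous(1.5) + fib_continuous(0.5)` up to rounding, because each eigenvalue term satisfies it separately.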
$f^t(x_0) = a^{\log_a x_0 + t} = x_0\, a^t$.
However, this analytical method is limited to a small selection of functions, and it can be shown that
embeddings exist for a much wider range of mappings which cannot yet be calculated analytically. Furthermore, the theory so far has been developed mainly for real- or complex-valued functions;
solving Abel- or Schröder-type functional equations in higher dimensions is currently beyond reach for the general
case.
But simple neural networks can be used to find precise approximations for those embeddings [7,8].
The basic idea is to use an MLP with a special topology which approximates $f(x)$. To compute the
fractional iterate $f^{m/n}$, we use a network that consists of $n$ subnetworks in a row with pairwise
identical weight matrices. A special training algorithm makes it possible to perform the function
approximation with the whole network while keeping the subnets identical at the same time [9]. The
fractional iterate of the function can be read out after the $m$-th subnet.
Figure: network of subnetworks with outputs $f^{1/n}(x)$, $f^{m/n}(x)$ and $f(x)$.
Figure: Predator and Prey time series.
The given method provides a natural way to estimate not only the values over a year, but also to
extrapolate arbitrarily smooth into the future.
Acknowledgements
Part of the work was conducted at the RIKEN Brain Science Institute, Wako-shi, Japan.
The Austrian Research Institute for Artificial Intelligence is supported by the Austrian Federal
Ministry of Education Science and Culture.
References
1. G. Targonski: Topics in Iteration Theory. Vandenhoeck und Ruprecht, Göttingen (1981)
2. M.C. Zdun: Continuous iteration semigroups. Boll. Un. Mat. Ital. 14 A (1977) 65-70
3. M. Kuczma, B. Choczewski & R. Ger: Iterative Functional Equations. Cambridge University Press, Cambridge (1990)
4. K. Baron & W. Jarczyk: Recent results on functional equations in a single variable, perspectives and open problems. Aequationes Math. 61 (2001) 1-48
5. R.L. Graham, D.E. Knuth & O. Patashnik: Concrete Mathematics. Addison-Wesley, Massachusetts (1994)
6. R.E. Rice, B. Schweizer & A. Sklar: When is f(f(z)) = az^2+bz+c for all complex z? Amer. Math. Monthly 87 (1980) 252-263
7. L. Kindermann: Computing Iterative Roots with Neural Networks. Proc. Fifth Conf. Neural Information Processing, ICONIP (1998) Vol. 2:713-715
8. E. Castillo, A. Cobo, J.M. Gutiérrez & R.E. Pruneda: Functional Networks with Applications. A Neural-Based Paradigm. Kluwer Academic Publishers, Boston/Dordrecht/London (1999)
9. L. Kindermann & A. Lewandowski: A Comparison of Different Neural Methods for Solving Iterative Roots. Proc. Seventh Int'l Conf. on Neural Information Processing, ICONIP, Taejon (2000) 565-569
Chapter 5
Learning to Track by Passive Acoustics
5.1 Mono-channel spectral attenuation modeled by hierarchical neural net estimates hydrophone-whale distance
Doh Y., Glotin H., Razik J., Paris S.
5.3 Range-depth tracking of multiple sperm whales over large distances using a two element vertical array and rhythmic properties of click-trains
Mathias D., Thode A., Straley J., Andrews R., Le Bot O., Gervaise C., Mars J.
Doh Y.
DYNI team
Aix-Marseille Université, CNRS, ENSAM, LSIS, UMR 7296, 13284 Marseille, France
Université de Toulon, CNRS, LSIS, UMR 7296, 83957 La Garde, France
[email protected]
Glotin H.
IUF, Institut Universitaire de France
103 Bd. Saint-Michel, 75005 PARIS - France
DYNI team
Aix-Marseille Université, CNRS, ENSAM, LSIS, UMR 7296, 13284 Marseille, France
Université de Toulon, CNRS, LSIS, UMR 7296, 83957 La Garde, France
[email protected]
Razik J.
DYNI team
Aix-Marseille Université, CNRS, ENSAM, LSIS, UMR 7296, 13284 Marseille, France
Université de Toulon, CNRS, LSIS, UMR 7296, 83957 La Garde, France
[email protected]
Paris S.
DYNI team
Aix-Marseille Université, CNRS, ENSAM, LSIS, UMR 7296, 13284 Marseille, France
Université de Toulon, CNRS, LSIS, UMR 7296, 83957 La Garde, France
[email protected]
Abstract
We aim to enable whale monitoring and anti-collision systems using a single hydrophone. To this end, we propose a new model to estimate the range from wideband
signals such as clicks emitted by odontocetes. We demonstrate that it is possible
to link the intrinsic distortion of the signal with the length of the acoustic path.
We provide different models to establish the relationship between the signal energy and the propagation distance. We deal with different energy scales: the global
received energy of the signal $E_0$, the frequency band energy and the frequency bin
energy. We then demonstrate that an intermediate prediction of the whale orientation
enhances the distance estimation, yielding a relative error of only 6 %.
1 Introduction
Passive acoustics is one of the best ways to improve knowledge of sound-emitting marine mammals
through various tasks: detection, classification, localization and density estimation. 3D whale
localization is mostly achieved using hydrophone arrays. Although these methods have been employed successfully for this task on mysticetes [6] and odontocetes [1, 3] with a high level of accuracy, they require heavy and expensive hardware. Despite the information loss,
a single, light and cheap hydrophone device that is quick to deploy could provide the necessary data for
certain applications, such as mobile listening points and anti-collision systems, where a simple
range estimate is sufficient. Therefore, in this paper, we focus on single-hydrophone
methods. Theoretically, it is possible to use the virtual hydrophone framework and the acoustic
properties of the water column to estimate the position of the whale [4, 7, 5]. This technique
requires the acquisition of the direct acoustic path, the bottom reflection and the surface reflection. In practice, however, we often observe only a subset of the needed information. These
constraints led us to develop a new model of range estimation applied to wideband signals
such as clicks emitted by odontocetes. Specifically, this paper applies the proposed model to sperm
whale recordings. In order to focus on range estimation in this first approach, and to avoid source
separation problems, we chose to deal with single-animal recordings.
It is well known that sound attenuation depends on frequency and propagation distance [8, 9]. The
attenuation reduces the total energy and distorts the frequency content of the emitted signal, a
distortion that we can link to the distance between source and receiver. In 2002, Madsen et al.
reported the variation of the centroid with distance and suggested a low-pass effect caused by
acoustic propagation [18].
In this paper, we try to express the relationship between signal energy and propagation distance
through an empirical model based on a neural network and through the theoretical Inter-Frequency
Attenuation (IFA) model [10, 11].
Real data and their associated ground truth [3, 12, 13, 14] allow us to develop this model by
optimizing a few important parameters. An independent partition of the data is then used to test
the ability and limits of the proposed estimator.
2 Motivation
2.1 The relationship between the received signal and loss by transmission
Our goal is to extract information about the propagation distance from the signal observed on
a single hydrophone. The relationship between Transmission Loss (TL) and distance, as provided
by the passive sonar equation, is not well adapted to bioacoustic signal applications. Solving this
equation depends mainly on the signal's power at the origin, the Source Level (SL). Because it emits
with a variable sound level, a sperm whale is not a constant acoustic energy source. Moreover, we
must also take into account other variability factors such as the animal's size, the Inter-Click
Interval (ICI) and Inter-Pulse Interval (IPI), the diving depth [18] or the directionality of the
animal relative to the hydrophone position [2, 17, 20, 21].
First we introduce the expression of the received energy E at distance r from the source, as a
function of the energy source level ESL and TL [15, 7], for a given frequency, in dB. We consider the
simple framework of an omnidirectional spherical source:
E(r, f) = E_{SL}(f) - TL(r, f),   (1)
TL(r, f) = 20 \log_{10}(r) + \alpha(f)\, r,   (2)
where r is the propagation distance (in m), f is the frequency (in Hz) and \alpha is the frequency
attenuation coefficient (in dB.m^{-1}). The first term of Eq. (2) is due to loss by geometric
divergence of a spherical wave, while the second term represents the frequency attenuation caused by
interactions between the wave and the medium.
As a first approximation, we can assume that the loss by divergence dominates the frequency
attenuation and that TL does not depend on frequency, which allows us to consider the total energy
E0 of the signal.
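As a rough numerical illustration of the transmission loss described above (spherical divergence plus frequency-dependent absorption), the following Python sketch evaluates TL for a few frequencies. It assumes the standard form TL = 20 log10(r) + alpha(f) r and uses Thorp's empirical absorption formula [9] as a stand-in for alpha(f). The function names are hypothetical, and the attenuation model actually used by the authors may differ.

```python
import math


def thorp_alpha_db_per_m(f_hz):
    """Thorp's empirical absorption coefficient, converted to dB/m.

    Assumption: used here only as a stand-in for alpha(f).
    """
    f_khz = f_hz / 1000.0
    f2 = f_khz * f_khz
    alpha_db_per_km = (0.11 * f2 / (1.0 + f2)
                       + 44.0 * f2 / (4100.0 + f2)
                       + 2.75e-4 * f2
                       + 0.003)
    return alpha_db_per_km / 1000.0


def transmission_loss_db(r_m, f_hz):
    """Spherical divergence term plus frequency attenuation term, in dB."""
    divergence = 20.0 * math.log10(r_m)
    absorption = thorp_alpha_db_per_m(f_hz) * r_m
    return divergence + absorption


def received_level_db(esl_db, r_m, f_hz):
    """Received energy level: source level minus transmission loss."""
    return esl_db - transmission_loss_db(r_m, f_hz)


if __name__ == "__main__":
    # At a few kilometres the divergence term is much larger than the
    # absorption term, but only the latter changes with frequency.
    for f_hz in (2000.0, 10000.0, 20000.0):
        print(f_hz, round(transmission_loss_db(4000.0, f_hz), 2))
```

With these assumed coefficients, the divergence term is identical for all frequencies at a given range, so any difference between the printed values comes from the absorption term alone.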
The problem r = F(E0) remains unsolved without a theoretical or statistical model of the energy
source level. This function can be approximated and learned by a neural network [24], in particular
an MLP, since MLPs are universal function approximators; this is described in a later section.
It will be the basic model, LER (for Loss Estimation Regression). But this function is empirical and
very dependent on the data used to learn the model. This motivated us to find a theoretical
relationship based only on the frequency attenuation, which could work without any knowledge of
the total SL. Such a model requires considering not only the total energy but also the detailed
frequency composition.
2.2
r = h(|X(f; r)|; o; a; el; W).   (4)
2.3
The proposed Theoretical Inter Frequency Attenuation (IFAT) model [10] aims to extract information
about the source distance by taking advantage of the energy ratio between two frequency bands
of the emitted signal. The derivation of the attenuation laws allowed us to establish the following
relationship:
r(B_1, B_2) = \frac{10 \log_{10}(E_1 / E_2)}{\int_{F'_1}^{F'_2} \alpha(f)\, df - \int_{F_1}^{F_2} \alpha(f)\, df}.   (5)
In this expression, r is the acoustic propagation distance, B_1 = [F_1, F_2] and B_2 = [F'_1, F'_2]
are the frequency bands involved, and E_1 and E_2 are the energies of bands 1 and 2. In this
expression, r does not depend on the loss by divergence or on the energy at the origin; it depends
only on frequency attenuation.
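The idea of an inter-frequency range estimate can be sketched numerically as follows. This is a simplified illustration and not the authors' exact expression: it assumes the same source level in both bands (a flat source spectrum), works with band levels in dB, replaces the integrals by band-averaged attenuation, and uses Thorp's formula [9] as a stand-in for the attenuation coefficient. All function names and band limits are hypothetical.

```python
import math


def thorp_alpha_db_per_m(f_hz):
    """Thorp's empirical absorption (assumed stand-in for alpha), in dB/m."""
    f2 = (f_hz / 1000.0) ** 2
    return (0.11 * f2 / (1.0 + f2) + 44.0 * f2 / (4100.0 + f2)
            + 2.75e-4 * f2 + 0.003) / 1000.0


def band_mean_alpha(f_lo_hz, f_hi_hz, n=200):
    """Average attenuation coefficient over a band, sampled on n points."""
    step = (f_hi_hz - f_lo_hz) / (n - 1)
    total = sum(thorp_alpha_db_per_m(f_lo_hz + k * step) for k in range(n))
    return total / n


def simulate_band_level_db(esl_db, r_m, band):
    """Received band level: source level minus divergence and attenuation."""
    return esl_db - 20.0 * math.log10(r_m) - band_mean_alpha(*band) * r_m


def ifa_range_m(e1_db, e2_db, band1, band2):
    """Range from the level difference between two bands.

    The divergence term and the (assumed common) source level cancel in the
    difference, leaving only the attenuation contrast between the bands.
    """
    a1 = band_mean_alpha(*band1)
    a2 = band_mean_alpha(*band2)
    return (e1_db - e2_db) / (a2 - a1)


if __name__ == "__main__":
    band1 = (2000.0, 6000.0)     # hypothetical low band, in Hz
    band2 = (10000.0, 14000.0)   # hypothetical high band, in Hz
    r_true = 4000.0
    e1 = simulate_band_level_db(190.0, r_true, band1)
    e2 = simulate_band_level_db(190.0, r_true, band2)
    print(ifa_range_m(e1, e2, band1, band2))
```

In this noiseless simulation the estimate matches the simulated range up to rounding. With real clicks the two bands do not share the same source level and the emission is directional, so the result would be biased.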
Material
In this section, we present our dataset, which is extracted from the Bahamas dataset distributed by
AUTEC at the second DCL workshop in Monaco in 2005. It consists of five hydrophones deployed off
the Bahamas Island, and a total of 25 minutes of recording of one sperm whale at a sample rate
of 48 kHz. The trajectory computed by LSIS/DYNI (Fig. 2) [3, 14] is similar to those obtained
with different methods by the scientific community [22, 23], and it is considered as the ground
truth.
[Figure 2 plot: sperm whale track and hydrophone positions (H8, H9, H10, H11); axes: x in km, y in km.]
Figure 2: The 2D trajectory (in the x-y plane) of the single sperm whale observed during 25
min (LSIS/DYNI [3]) and the corresponding hydrophone positions. The whale heads south-east.
Supplemental material with the animated 3D tracking of this whale is available at
http://glotin.univ-tln.fr/oncet and http://www.youtube.com/watch?v=0Szo3gdiTRk. The
supplemental material also includes the file containing the [x, y, z, t] whale positions, and the
[az, el, offaxis, t] files for each hydrophone H8 to H11.
Hydrophone   Depth     Mean off-axis   Mean azimuth   Mean elevation   Mean distance   Distance std deviation
H11          -1522 m   36 degrees      29 degrees     -14 degrees      3937 m          452 m
H10          -1361 m   60 degrees      56 degrees     -15 degrees      2900 m          242 m
H9           -1553 m   74 degrees      73 degrees     -14 degrees      4150 m          234 m
H8           -1556 m   125 degrees     129 degrees    -13 degrees      4716 m          283 m
Considering the four hydrophones H8, H9, H10 and H11, the distance between source and
receiver ranges from 2500 m to 5500 m. The precise angles associated with the ground truth
trajectory have been calculated. The mean values are given in Tab. 1.
We then divide the data into 2 partitions. Partition 1 is used for the development and parameter
optimization step. Partition 2 is dedicated to estimation and prediction.
Results
4.1
4.1.1
The training step is performed on partition 1 of the data. The temporal order of the data is not
taken into account when the data are presented as input to the MLP. During the development stage,
we optimized only one parameter with respect to the quality of prediction: the number of training
iterations (early stopping).
Predictions are generated from partition 2 of the data, which we assume to be independent of the
training data. We were surprised to observe that the MLP learned the relationship between spectrum
and propagation distance quite quickly: 300 iterations are sufficient to obtain satisfactory
predictions. In order to avoid over-fitting and to extract the most general predictor, we keep the
number of iterations low.
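Choosing the number of training iterations on held-out data, as described above, can be sketched as follows. This is a minimal illustration under stated assumptions: `select_num_iterations` and `ToyModel` are hypothetical names, and the toy model only mimics a validation error that decreases and then increases again. It is not the MLP used in this paper.

```python
def select_num_iterations(train_step, validation_error, max_iterations):
    """Run training and keep the iteration count with the lowest held-out error."""
    best_iter, best_err = 0, float("inf")
    for iteration in range(1, max_iterations + 1):
        train_step()                      # one training iteration
        err = validation_error()          # error on held-out development data
        if err < best_err:
            best_iter, best_err = iteration, err
    return best_iter, best_err


class ToyModel:
    """Toy stand-in for a learner: its held-out error is lowest after 300 steps."""

    def __init__(self):
        self.steps = 0
        self.w = 0.0

    def train_step(self):
        self.steps += 1
        self.w = 0.01 * self.steps        # parameter drifts with each iteration

    def validation_error(self):
        return (self.w - 3.0) ** 2        # grows again past the optimum


if __name__ == "__main__":
    model = ToyModel()
    n_iter, err = select_num_iterations(model.train_step,
                                        model.validation_error, 1000)
    print(n_iter, err)
```

The helper only records which iteration count was best. In practice one would retrain up to that count, or keep a copy of the best weights.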
4.1.2 Distance prediction
In this section, we propose a temporal MLP prediction following LER, IFAR and IFARH on all
hydrophones, in the same data subset as in section IV.A.2 for the IFAT estimator. It has been
computed using the MLP for regression on the spectrum.
In Fig. 3 we can compare the fidelity of the predictions to the ground truth (results are summarized
in Tab. 3). Firstly, the prediction given by IFAR and IFARH is better than the LER one. It
demonstrates the usefulness of considering the relations between spectrum bins and not only the
global energy of the signal. LER shows some transitions which lead to dips that the other models
seem to control. However, the bias seems to increase in the sections where the azimuth is the lowest
(20 degrees). This behavior can be explained by the lack of an on-axis configuration during the
training session. This assumption means that the regression law differs between on-axis and off-axis
configurations. It may also be caused by the different frequency structure of the pulses (P or PJ)
in a click depending on the receiver position [16].
Then, the IFARH mean error is similar to that of IFAR. However, the dispersion of the predicted
distances seems to be improved by the use of an intermediate MLP predicting the position angles. The
results of angle estimation are not presented directly in this paper. We noted that azimuth
prediction was more precise than elevation prediction, which could suggest that the spectrum shape
is more dependent on azimuth.
4.2
[Figure 3 plots: four panels (H11, H10, H8, H9). Top of each panel: radial distance in m versus time in s for GT, LER, IFAR and IFARH. Bottom of each panel: azimuth and elevation angles versus time in s.]
Figure 3: Temporal prediction of radial distance compared to ground truth using model LER, IFAR,
IFARH.
[Figure 4 plots: top, radial distance in m versus time in s for N=128 and N=256; bottom, azimuth and elevation angles versus time in s.]
Figure 4: An example of final IFAT temporal estimation of radial distance compared to the ground
truth radial distance and azimuth evolution on H11. Fixed parameters: Nbest = 4, FP128 = 8.5 kHz and
FP256 = 9 kHz.
In Fig. 4 we see that both curves have almost the same dispersion. Using N = 128 samples, the
distances are overestimated, while they are underestimated using N = 256 samples.
The significant dispersion of our estimation cannot be due to the range variation. It can be related
to the animal's off-axis variation, meaning that IFAT does not cancel the effects of the animal's
directionality. Most of the errors of IFAT using N = 128 samples seem to accumulate in sections where
the azimuth is high (> 20 deg). When the animal is supposed to be on-axis, we observe that the
estimate converges to the ground truth.
On the other hand, for N = 256 samples, the lowest error seems to coincide with high azimuth, which
results in a better estimate in strongly off-axis configurations. Thus the average of the N = 128
and N = 256 estimates is computed to provide a uniform behavior over the possible azimuths.
Finally, we also observe that the radial distance is decreasing, which is in agreement with the
heading of the whale in the ground truth. IFAT correctly indicates that the whale is traveling
towards the hydrophone.
Discussion
According to the results (Tables 2 and 3), IFAT errors tend to vary between the different
hydrophones, but the IFAT model and the neural network lead to common conclusions. The performance
of IFAR and IFARH confirms the usefulness of considering the relations between spectrum bins and not
only the total energy, which is more sensitive to natural and voluntary variations of the Source
Level.
The IFAT, IFAR, IFARH and LER models behave differently with respect to the azimuth. With
the IFAT model, the performance of the estimator seems to be more affected by the azimuth
configuration. Since we developed the model without considering the animal's directionality, a
future step may be the inclusion of this variable in the estimator expression.
The MLP learns the relationship between spectrum and propagation distance in this data set
in only 300 iterations. We demonstrated that the IFAT model could estimate distance with about 15
% of mean relative error, while the MLP-based IFAR or IFARH reduces it to 6 %.
As IFAT produces local range estimates, a particle filtering process [19] could be efficiently added
after IFAT in order to produce more reliable and complex estimates.
In the case of multiple emitting whales and a single-hydrophone recording, IFA would play an
important role in clustering the clicks and thus estimating the number of emitting whales. Moreover,
in the case of multiple hydrophones, IFA can help in the localisation of each whale. The IFAR and
H11     21    20      17    19
H10     16    21      16    17
H9      14    9       11    11
H8      41    16      28    28
MEAN    23    16.5    18    -
Table 2: Absolute mean relative error of estimated distance in % for all hydrophones with IFAT
theoretical model estimators
Hydrophone   LER mean      IFAT mean    IFAR mean    IFARH mean
H11          14 ± 6        20 ± 10      4 ± 2        4 ± 2.5
H10          20 ± 10       21 ± 13      14 ± 4       11 ± 3
H9           8 ± 9         9 ± 7        3 ± 2        4 ± 3.5
H8           15 ± 5        16 ± 14.0    4 ± 2        4 ± 2
MEAN         14.2 ± 7.5    16.5 ± 11    6.2 ± 2.5    5.75 ± 2.75
Table 3: Absolute mean relative error of the distance estimated with the MLP (IFAR/IFARH) and
standard deviation (in %)
IFARH models trained on a data set can be applied to another recording set for similar species and
hydrophones. One can also model and run IFA for other species using biosonar, such as bats.
[Figure 5 plot: mean error in % for each hydrophone and averaged over all the hydrophones, for LER, IFAR, IFARH(o,a,el), IFARH(o), IFARH(a,el), IFARH(a), IFARH(el) and IFAT.]
Figure 5: Mean relative error between predictions and ground truth. Different IFARH model versions
have been tested: IFARH(o,a,el), IFARH(a,el), IFARH(o), IFARH(a), IFARH(el).
Acknowledgments
The authors gratefully acknowledge the contribution of the Provence-Alpes-Côte d'Azur region,
Universite Toulon Var, CESIGMA company and the anonymous referees.
References
[1] E.-M. Nosal and L. Frazer, Sperm whale three-dimensional track, swim orientation, beam pattern, and click levels observed on bottom-mounted hydrophones, The Journal of the Acoustical Society of America 122(4), 1969-1978 (2007).
[2] B. Mohl, M. Wahlberg and P. -T Frazer, The monopulsed nature of sperm whale clicks, The
Journal of the Acoustical Society of America 114(2), 1143-1154 (2003).
[3] P. Giraudet and H. Glotin, Real-time 3D tracking of whales by echo-robust precise TDOA
estimates with a widely-spaced hydrophone array, Applied Acoustics 67, 1106-1117 (2006).
[4] N. Josso, Characterizing the underwater environment using moving sources by opportunity,
Ph.D. thesis, Universite de Grenoble (2010).
[5] X. Mouy and D. Hannay, Tracking of Pacific walruses in the Chukchi Sea using a single
hydrophone, The Journal of the Acoustical Society of America 131(2), 1349-1358 (2012).
[6] S. W. Martin, T. Norris, E. M. Nosal, D. K. Mellinger, R. P. Morrissey and S. Jarvis, Automatic
localization of individual Hawaiian minke whales from boing vocalization, The Journal of the
Acoustical Society of America 129, 2506 (2011).
[7] W. Au and M. Hastings, Principles of marine bioacoustics (Springer Science + Business Media,
LLC, New York), 680p (2008).
[8] C. Leroy, Sound attenuation between 200 and 10000 cps measured along single paths, Technical Report 43, Saclant ASW Research Center (1965).
[9] W. Thorp, Analytic description of the low frequency attenuation coefficient, The Journal of
the Acoustical Society of America 42(1), 270 (1967).
[10] H. Glotin, Y. Doh, R. Abeille, and A. Monnin, Physeter distance estimation using sub-band
leroy transmission loss model, in 5th International Workshop on Detection, Classification,
Localization and Density Estimation of Marine Mammals using Passive Acoustics, 49 (2011).
[11] Y. Doh, A model of Inter Frequency Attenuation (IFA) in order to estimate the distance source
receiver, Master's thesis, Universites d'Aix Marseille/Centrale Marseille (2011).
[12] F. Benard and H. Glotin, Whales localization using a large array: performance relative to
cramer-rao bounds and confidence regions, in e-Business and Telecommunications, 294-306
(Springer-Verlag, Berlin Heidelberg) (2009).
[13] H. Glotin, F. Benard, and P. Giraudet, Whales cocktail party: a real-time tracking of multiple
whales, International Journal Canadian Acoustics 36, 139-145 (2008).
[14] F. Benard, H. Glotin, and P. Giraudet, Highly defined whale group tracking by passive acoustic
stochastic matched filter, in Advances in Sound Localization, chapter 28, 527-544 (InTech,
Rijeka, Croatia) (2011).
[15] C. Viala, Real time inversion in geoacoustic of wideband signals in deep water, Ph.D. thesis,
Universite de Toulon et du Var (2007).
[16] C. Laplanche, Studies by passive acoustics of the hunting behaviour of the sperm whale,
Ph.D. thesis, Universite Paris XII Val-de-Marne (2005).
[17] B. Mohl, M. Wahlberg, P. Madsen, LA. Miller and A. Surlykke, Sperm whale clicks, directionality and source level revisited, J. Acoust. Soc. Am 107, 638-648 (2000).
[18] P. Madsen, R. Payne, N. U. Kristiansen, M. Walhberg, I. Kerr and B. Mohl, Sperm whale
sound production studied with ultrasound time/depth-recording tags, The Journal of Experimental Biology 205, 1899-1906 (2002).
[19] M. Sanjeev Arulampalam, S. Maskell and N. Gordon, A tutorial on particle filters for online
nonlinear/non-Gaussian Bayesian tracking, IEEE Transactions on Signal Processing 50, 174-188 (2002).
[20] W. M. X. Zimmer, P. L. Tyack, M. P. Johnson and P. T. Madsen, Three-dimensional beam
pattern of regular sperm whale clicks confirms bent-horn hypothesis, J. Acoust. Soc. Am
117(3), 1473-1485 (2004).
[21] W. M. X. Zimmer, P. T. Madsen, V. Teloni, M. P. Johnson and P. L. Tyack, Off-axis effects on
the multipulse structure of sperm whale usual clicks with implication for sound production,
J. Acoust. Soc. Am 118(5), 3337-3345 (2005).
[22] R. Morrissey, J. Ward, N. DiMarzio, S. Jarvis, and D. Moretti, Passive acoustic detection and
localization of sperm whales (physeter macrocephalus) in the tongue of the ocean, Applied
Acoustics 67, 1091-1105 (2006).
[23] E.-M. Nosal and L. Frazer, Track of a sperm whale from delays between direct and surface-reflected clicks, Applied Acoustics 67, 1187-1201 (2006).
[24] R. O. Duda, P. E. Hart, and D. G. Stork, Pattern Classification (Wiley-Interscience, New York),
654p (2000).
1 Basic motivations
Most efficient cetacean localisation systems are based on
the Time Delay Of Arrival (TDOA) technique [NF06, BG09].
Long-baseline hydrophone arrays offer precise localization, but they represent a fixed,
centralized and expensive solution.
Dictionary learning
6 Experimental results
References
Delphine Mathias
Gipsa-Lab, Image-Signal Department
Grenoble-INP, France
Aaron Thode
Marine Physical Laboratory
Scripps Institution of Oceanography
University of California San Diego, USA
Jan Straley
University of Alaska Southeast,
Sitka, Alaska, USA
Russ Andrews
School of Fisheries and Ocean Sciences
University of Alaska Fairbanks, USA
Olivier Le Bot
Gipsa-Lab, Image-Signal Department
Grenoble-INP, France
Cedric Gervaise
Gipsa-Lab, Image-Signal Department
Grenoble-INP, France
Jerome Mars
Gipsa-Lab, Image-Signal Department
Grenoble-INP, France
Abstract
Sperm whales (Physeter macrocephalus) have followed fishing vessels off the
Alaskan coast for decades in order to remove sablefish from longlines (depredation).
The Southeast Alaska Sperm Whale Avoidance Project (SEASWAP) has
found that whales respond to distinctive acoustic cues made by hauling fishing
vessels, as well as to marker buoys on the surface. Between 15 and 17 August 2010 a
simple two-element vertical array was deployed off the continental slope of Southeast
Alaska in 1200 m water depth. The array was attached to a longline fishing
buoyline at 300 m depth, close to the sound-speed minimum of the deep-water
profile. The buoyline also served as a depredation decoy, attracting seven sperm
whales to the area. One animal was tagged with both a LIMPET dive depth-transmitting
satellite tag and a bioacoustic B-probe tag. Both tag datasets were used
as an independent check of a passive acoustic scheme for tracking the whale in
depth and range, which exploited the elevation angles and relative arrival times
of multiple ray paths recorded on the array. The localization approach does not
require knowledge of the local bottom bathymetry. Numerical propagation models
yielded accurate locations up to at least 35 km range at Beaufort sea state 3.
Ongoing work includes combining the arrival angle information with an algorithm
developed by Le Bot et al. [1] that uses the rhythmic properties of odontocete
click trains to separate interleaved click trains. This approach will improve our
localization capabilities in the presence of multiple sperm whales. In order to achieve
better separation of interleaved click trains, it is possible to use machine learning
based algorithms. This new concept is based on finding useful information hidden
in a large database. This useful information can then be represented by a sparse
subspace. The first step of the approach is to extract informative features with a
new detector proposed by Dadouchi et al. [2]. Once the dictionary of features is
learned, any signal of the considered dataset can be approximated sparsely. By
reducing the dimension of the space, the sparse representation has the advantage of
providing an optimal representation of the data. [Work supported by the North
Pacific Research Board, the Alaska SeaLife Center, ONR, NOAA and ANR-12-ASTR-0021-03 MER CALME]
Introduction
In recent years, passive acoustic methods have become increasingly widespread for the
general assessment of marine environments [3]-[5] and for expanding knowledge about marine mammal
vocalization repertoires, distribution and habitat characterization. In the past decade,
considerable efforts have been made for this purpose using a combination of ocean science, signal
processing, statistics and computational (algorithmic) science [6].
This paper is concerned with the localization and tracking of sperm whales using a vertical array
comprising only two hydrophones. Indeed, passive acoustic monitoring has become an important
tool to study sperm whale behavior in the Gulf of Alaska and their interaction with longlining
fisheries [7]-[8]. Each click event generated by a sperm whale can arrive on a hydrophone via
multiple ray paths. In this paper, the ray path that arrives first on a hydrophone will be called the
primary path, and other ray path arrivals that arise from the same click event are called secondary
paths, or multipath.
Most methods developed for localizing marine mammals use wide-baseline hydrophone arrays and
the time-difference-of-arrival (TDOA) of a sound on pairs of hydrophones [9]-[11]. Methods are
often based on ray-trace acoustic propagation modeling and exploit multipath arrival information
from recorded sperm whale clicks. The algorithm compares the arrival pattern from a sperm whale
click to range and depth dependent modeled arrival patterns in order to estimate whale location. The
technique can account for waveguide propagation physics like interaction with range-dependent
bathymetry and ray refraction. Tiemann et al. [12] managed to track a sperm whale in three dimensions using only one acoustic sensor and a model of the azimuthally dependent bathymetry.
When multiple whales are simultaneously clicking, the biggest challenge is to arrange clicks into
separate click-trains corresponding to individual whales, and then classify clicks as primary paths
and multipaths. In the past decade several authors proposed algorithms for separating multipaths
from the primary click-train, either on single hydrophones [13], two-hydrophone arrays [14] or
wide-baseline acoustic arrays [15]. These algorithms exploit the slowly varying multipath structure of individual whales or the slowly varying features of clicks within a train (such as waveform,
power). Recently, a few papers discussed how sparse coding can be an effective technique for solving the multiple-marine mammal tracking problem [16]-[18]. Sparse coding seems to be a promising
alternative to usual time-frequency feature analysis.
To our knowledge the long-range tracking of multiple whales on a single deployment has not been
performed yet. For many applications, the deployment of several hydrophones is impractical and
too expensive. Here we discuss how a two-element vertical array was used to track the range of multiple
whales up to a 35 km range over a 3-day period and how the method could be automated using
rhythmic properties of click-trains and sparse coding.
A two-element vertical array deployed at the sound speed minimum was used to track sperm whales
in the Gulf of Alaska between 15 and 17 August 2010. The vertical arrival angles and relative
arrival times of multiple refracted and surface-reflected ray paths contain enough information for
range-depth tracking without knowledge of the bottom bathymetry. A ray-tracing program was used
to model the acoustic travel times from each candidate source location, using a measured sound
speed profile. By comparing modeled and measured time lags and vertical angles, an ambiguity
surface was created, displaying the best-fit whale position. A tagged sperm whale was tracked up to
35 km range under Beaufort 3 conditions, using satellite tag data to independently verify tracking
estimates. The technique also made it possible to measure the drift of multiple whales away from
the vertical array. The method and results are described in detail in Mathias et al. [8].
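The ambiguity-surface idea described above can be sketched in Python as follows. To stay short, this illustration replaces the ray-tracing program and measured sound speed profile with a strong simplifying assumption: a constant sound speed with straight-line direct and surface-reflected paths, which would not hold at ranges of tens of kilometres. The function names, grid and error scales are hypothetical.

```python
import math

SOUND_SPEED = 1500.0  # m/s, constant sound speed (simplifying assumption)


def predict(range_m, source_depth_m, receiver_depth_m):
    """Model the surface-minus-direct time lag and the direct-path elevation angle."""
    direct = math.hypot(range_m, source_depth_m - receiver_depth_m)
    # Surface reflection modeled with an image source above the sea surface.
    surface = math.hypot(range_m, source_depth_m + receiver_depth_m)
    lag_s = (surface - direct) / SOUND_SPEED
    angle_deg = math.degrees(
        math.atan2(source_depth_m - receiver_depth_m, range_m))
    return lag_s, angle_deg


def ambiguity_surface(meas_lag_s, meas_angle_deg, ranges_m, depths_m,
                      receiver_depth_m, lag_sigma_s=1e-3, angle_sigma_deg=1.0):
    """Normalized squared misfit for every candidate (range, depth) position."""
    surface = {}
    for r in ranges_m:
        for z in depths_m:
            lag, angle = predict(r, z, receiver_depth_m)
            surface[(r, z)] = (((lag - meas_lag_s) / lag_sigma_s) ** 2
                               + ((angle - meas_angle_deg) / angle_sigma_deg) ** 2)
    return surface


def best_fit(surface):
    """Candidate position with the smallest misfit."""
    return min(surface, key=surface.get)


if __name__ == "__main__":
    ranges = [1000.0 + 500.0 * i for i in range(19)]   # 1 km to 10 km
    depths = [100.0 * j for j in range(1, 11)]         # 100 m to 1000 m
    lag, angle = predict(5000.0, 600.0, 300.0)         # simulated measurement
    surf = ambiguity_surface(lag, angle, ranges, depths, 300.0)
    print(best_fit(surf))
```

With real data, each additional ray path contributes further lag and angle terms to the misfit, which sharpens the minimum of the surface.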
However, we were not able to automate the tracking process in the presence of as many as six whales
vocalizing simultaneously. The techniques described above for separating click-trains, such as
cross-correlation or rhythmic analysis, failed in our case because of the high number of multipaths
received at the hydrophones, produced by whales at various ranges.
[Figure 1 plots: two spectrograms, frequency in kHz versus relative time in s, with annotations for the primary path, the secondary (surface-reflected) path and other whale arrivals.]
Figure 1: Spectrogram of multipaths produced by two sperm whales on 15 August 2010 at 19:01:30
and recorded on vertical array top and bottom hydrophones
Localizing a sperm whale using a two-element vertical array requires measuring the relative arrival
times of at least two ray paths (the primary path and a multipath) on both hydrophones [8].
Therefore, we need to de-interleave click-trains and associate a primary path click-train with a
multipath click-train. Two approaches seem promising for performing this task when three or more
whales are vocalizing simultaneously. The first approach takes advantage of the two-hydrophone array
configuration and uses the arrival angle as an additional source of information for grouping
sperm whale clicks into trains associated with a given whale and propagation path. Therefore,
ongoing work includes combining the arrival angle information with an algorithm developed by
Le Bot et al. [1] that uses the rhythmic properties of odontocete click-trains to separate interleaved
click-trains. The algorithm only uses the time of arrival of each click and a complex autocorrelation
function to compute a histogram that exhibits peaks at the inter-click intervals (ICI) corresponding
to the interleaved click trains, while suppressing harmonics due to ICI multiples. This complex
autocorrelation is calculated in a window sliding along the click train, leading to a time-ICI
representation, which is thresholded to detect the different interleaved click trains. This
sequential search could use some complementary features such as the click arrival angle, its level
or its frequency content.
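The sketch below illustrates how the rhythm of interleaved trains shows up in click arrival times. It deliberately uses a plain time-difference histogram as a simpler stand-in, not the complex autocorrelation of Le Bot et al. [1]. As a consequence it does not suppress ICI multiples, so the search is restricted to lags shorter than twice the smallest ICI. The function names and the synthetic trains are hypothetical.

```python
from collections import Counter


def ici_histogram(click_times_ms, max_lag_ms, bin_ms):
    """Histogram of positive time differences between all click pairs."""
    times = sorted(click_times_ms)
    hist = Counter()
    for i, t_i in enumerate(times):
        for t_j in times[i + 1:]:
            lag = t_j - t_i
            if lag > max_lag_ms:
                break                      # later clicks are even further away
            if lag > 0:
                hist[(lag // bin_ms) * bin_ms] += 1   # lower edge of the bin
    return hist


def dominant_icis(hist, n_trains):
    """Lower bin edges of the n_trains most populated bins, in increasing order."""
    top = sorted(hist.items(), key=lambda kv: kv[1], reverse=True)[:n_trains]
    return sorted(lag for lag, _ in top)


if __name__ == "__main__":
    # Two synthetic interleaved trains (times in ms): ICI of 500 ms and 700 ms.
    train_a = [500 * k for k in range(20)]
    train_b = [130 + 700 * j for j in range(20)]
    hist = ici_histogram(train_a + train_b, max_lag_ms=900, bin_ms=20)
    print(dominant_icis(hist, 2))
```

In this synthetic case the differences between clicks of the same train pile up in one bin each, while differences between clicks of different trains are spread over many bins. Real click trains have slowly varying ICIs and missed detections, which is why a sliding window and a more robust statistic are needed.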
The second approach is based on sparse coding [19]-[20] and recent publications on the application
of this technique to marine mammal sounds [16]-[18]. We propose that a sparse transform of the
clicks in the time-frequency domain can help determine the stable components between multipaths
belonging to an individual and a given propagation path. In order to reduce the signal dimension
for more efficient computation, sets of Mel Frequency Cepstral Coefficients (MFCC) can be computed
and a dictionary of features can be generated. Any click detected on the hydrophone can
therefore be represented in this space of reduced dimension. The similarity between projected
clicks can be computed using, for example, the cosine similarity measure. Glotin et al. 2013 showed
that this technique worked for tracking the sounds produced by the same minke whale during 30
minutes. It is also possible to work directly on the spectrogram to select areas of interest. A
specific algorithm has been developed by Dadouchi et al. [2] to detect clicks and whistles. Based on
a two-stage methodology, this algorithm estimates the instantaneous frequency law of non-linear
frequency modulations under several constraints (high resolution estimation, ability to cope with
multiple overlapping and/or close signals in the time-frequency plane). The first step of the
methodology is applied to the squared modulus of any linear time-frequency representation, and aims
at detecting the time-frequency support of the signals of interest under probabilistic models. A
Chi-squared model is used to detect the time-frequency bins hosting signal, and a Poisson model
to gather the detected bins into regions of interest (RoIs). Once the RoIs are detected, a high
resolution estimator using local polynomial frequency law estimation and phase continuity criteria
is used to link the local approximations into a whole estimate of the instantaneous frequency law.
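The cosine similarity between projected clicks mentioned above can be sketched as follows. The feature vectors stand for hypothetical sparse codes or MFCC projections of clicks. The function names and the example values are illustrative assumptions, not data from this study.

```python
import math


def cosine_similarity(u, v):
    """Cosine of the angle between two feature vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    if norm_u == 0.0 or norm_v == 0.0:
        return 0.0                         # convention for an all-zero vector
    return dot / (norm_u * norm_v)


def most_similar(query, candidates):
    """Index of the candidate whose direction is closest to the query."""
    return max(range(len(candidates)),
               key=lambda i: cosine_similarity(query, candidates[i]))


if __name__ == "__main__":
    # Hypothetical low-dimensional codes of three previously detected clicks.
    previous = [[0.0, 1.0, 0.2], [2.0, 0.1, 0.0], [-1.0, 0.0, 0.5]]
    new_click = [1.0, 0.1, 0.0]
    print(most_similar(new_click, previous))
```

Because the measure ignores vector length, two clicks with the same spectral shape but different received levels are considered similar, which suits clicks received from a whale whose level varies.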
Acknowledgments
This work was supported by the North Pacific Research Board, the Alaska SeaLife Center, ONR,
NOAA and ANR-12-ASTR-0021-03 MER CALME. We thank the Scaled Acoustic Biodiversity
SABIOD MASTODONS CNRS project for providing travel funds.
References
[1] O. Le Bot, J. Bonnel, J. Mars and C. Gervaise (2013). Odontocete click train deinterleaving using
a single hydrophone and rhythm analysis, Proceedings of Meetings on Acoustics (19), 010019.
[2] F. Dadouchi, C. Gervaise, C. Iona, J. Huillery and J.I. Mars (2013). Automated segmentation
of linear time-frequency representation of marine mammal sounds, The Journal of the Acoustical
Society of America, 134(3), 2546-2555.
[3] M.O. Lammers, R.E. Brainard, W.W.L. Au, T.A. Mooney and K. Wong (2008). An Ecological
Acoustic Recorder (EAR) for long-term monitoring of biological and anthropogenic sounds on coral
reefs and other marine habitats, The Journal of the Acoustical Society of America 123, 1720-1728.
[4] M.N. Anagnostou, J.A. Nystuen, E.N. Anagnostou, A. Papadopoulos and V. Lykousis (2011).
Passive aquatic listener (PAL): An adoptive underwater acoustic recording system for the marine
environment, Nuclear Instruments and Methods in Physics Research Section A: Accelerators,
Spectrometers, Detectors and Associated Equipment.
[5] M. André, M. Van Der Schaar, S. Zaugg, L. Houegnigan, A.M. Sánchez and J.V. Castell (2011).
Listening to the Deep: Live monitoring of ocean noise and cetacean acoustic signals, Marine Pollution Bulletin, 63(1), 18-26.
[6] DCL book (2013). Detection, Classification and Localization of Marine Mammals using passive
acoustics, 2003-2013: 10 years of international research, edited by O. Adam and F. Samaran.
[7] D. Mathias, A.M. Thode, J. Straley, J. Calambokidis, G.S. Schorr and K. Folkert (2012).
Acoustic and diving behavior of sperm whales (Physeter macrocephalus) during natural and depredation foraging in the Gulf of Alaska, The Journal of the Acoustical Society of America 132(1).
[8] D. Mathias, A.M. Thode, J. Straley and R.D. Andrews (2013). Acoustic tracking of sperm
whales in the Gulf of Alaska using a two-element vertical array and tags, The Journal of the Acoustical Society of America 134(3).
[9] E.M. Nosal and L.N. Frazer (2006). Track of a sperm whale from delays between direct and
surface reflected clicks, Applied Acoustics 67, 1187-1201.
[10] C.O. Tiemann, M.B. Porter and L.N. Frazer (2004). Localization of marine mammals near
Hawaii using an acoustic propagation model, Journal of the Acoustical Society of America 115(6),
2834-2843.
[11] H. Glotin, F. Caudal and P. Giraudet (2008). Whale cocktail party: real-time multiple tracking
and signal analyses. Canadian Acoustics, 36(1), 139-145.
[12] C.O. Tiemann, A. Thode, J. Straley, K. Folkert and V. O'Connell (2006). Three-dimensional
localization of sperm whales using a single hydrophone, Journal of the Acoustical Society of America 120(4), 2355-2365.
[13] P.M. Baggenstoss (2011). Separation of sperm whale click-trains for multipath rejection, The
Journal of the Acoustical Society of America 129(6), 3598-3609.
[14] R. Bahl, T. Ura, T. Fukuchi, Towards identification of sperm whales from their vocalizations,
The Journal of Scientific and Industrial Research 54, 409-413 (2002).
[15] E.M. Nosal (2013). Methods for tracking multiple marine mammals with wide-baseline passive acoustic arrays, The Journal of the Acoustical Society of America, 134, 2383.
[16] Y. Doh, J. Razik, S. Paris, O. Adam and H. Glotin (2013). Decomposition et analyse par
codage parcimonieux des chants de cetaces, Traitement du Signal, in press.
[17] H. Glotin, J. Razik, S. Paris and X. Halkias (2013). Sparse coding for large scale bioacoustic
similarity function improved by multiscale scattering, Proceedings of Meetings on Acoustics Vol.19.
[18] S. Paris, Y. Doh, H. Glotin, X. Halkias and J. Razik (2013). Physeter catodon localization by
sparse coding, ICML 2013 conference, 6pp.
[19] Q. Barthelemy, C. Gouy-Pailler, Y. Isaac, A. Souloumiac, A. Larue and J. I. Mars (2013).
Multivariate temporal dictionary learning for EEG, Journal of Neurosciences Methods, in press.
[20] Q. Barthelemy, A. Larue, A. Mayoue, D. Mercier and J.I. Mars (2012). Shift and 2D Rotation
Invariant Sparse coding for multivariate signals, IEEE Trans. Signal Processing 60(4), 1597-1611.
Ales Mishchenko
Pascale Giraudet
LSIS-DYNI
Hervé Glotin
dpt biology
LSIS-DYNI
Abstract
This paper presents a method for tracking unknown calling and moving animals, such as bats or whales, using an arbitrary system of microphones organized in a 3D structure. The time differences of sound arrival (TDOAs) result in a set of distances to pairs of microphones. However, in the presence of echoes, propagation speed non-uniformities, and background noises, the [...] of TDOA data used in 3D trajectory reconstruction. Our [...]
1 Introduction
The location of an acoustic source using time differences of sound arrival (TDOAs) at multiple microphones has many military, bioacoustic and surveillance applications. Tracking of wildlife movements has been widely studied since the 1960s [1], [2]. For the majority of these studies, an operator supervises the received signal strength while changing the orientation of a directional receiving antenna. The problem of localizing an acoustic source from Time-Differences-of-Arrival (TDOAs) has recently received a lot of scientific attention, with a number of [...]. However, [...] of these methods assume ideal conditions, such as a known and constant propagation speed, an assumption that is reliable only under controlled conditions where the air temperature can be monitored. The effects of a wrongly assumed propagation speed are surveyed in [8]. One of the most successful approaches to overcoming the problem of variable propagation speed (and the subsequent non-robustness of tracking), echoes, etc., in the field of bioacoustics is [6].
2
Due to the difference in sounds emitted by different species, as well as the difference in [...] animal call structure to detect and process the calls of a particular animal. In accordance with this, we perform the optimization of parameters for adaptation of the TDOA calculation to a particular animal.
First, we adapt the resulting algorithm of TDOA calculation to a variety of bat recordings, including optimisation of the time-step, the cross-correlation window, the number of maximums and the percentage of TDOAs to filter with the coherence condition formulas. Due to the difference in sounds emitted by different species, as well as the difference in antenna geometry and media of sound propagation, it is necessary to adapt the resulting algorithm of TDOA calculation to a variety of bat recordings. For each recording we automatically optimize the time-steps, the cross-correlation windows, the number of maximums and the percentage of TDOAs to filter with the coherence condition formulas. In this 1st step, our algorithm subdivides these segments into subsegments containing, on average, a single distinct animal sound (a click in the case of whales and bats). This subdivision facilitates the
detection of the given animal and the cross-correlations between microphones in order to find
the TDOAs between them.
Second, for each pair of sensors, the arriving signals are cross-correlated (between pairs of different microphone recordings), followed by extraction of the maximum correlation values for the TDOA calculation, as well as calculation of the TDOA errors, described in the section below. Segments of the multi-microphone recording are used for this TDOA calculation (in our case, recordings from the microphones with a length of 6 seconds). Figure 1 shows the typical antenna geometries used for TDOA calculations.
Figure 1: The antenna geometries used for TDOA calculations: bat recordings (left) and
whale recordings (right).
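The cross-correlation step with several retained maxima can be sketched as follows. This is a minimal illustration under stated assumptions, not the patented method: the function names, the toy signals, the number of retained maxima and the sampling rate are all hypothetical. A positive lag means that the sound reaches the second microphone later than the first.

```python
def cross_correlation(x, y, max_lag):
    """Return {lag: sum_i x[i] * y[i + lag]} for lag in [-max_lag, max_lag]."""
    corr = {}
    for lag in range(-max_lag, max_lag + 1):
        total = 0.0
        for i, xi in enumerate(x):
            j = i + lag
            if 0 <= j < len(y):
                total += xi * y[j]
        corr[lag] = total
    return corr

def top_k_lags(corr, k):
    """Return up to k lags located at interior local maxima, strongest first."""
    lags = sorted(corr)
    peaks = [
        lag
        for prev, lag, nxt in zip(lags, lags[1:], lags[2:])
        if corr[lag] > corr[prev] and corr[lag] >= corr[nxt]
    ]
    peaks.sort(key=lambda lag: corr[lag], reverse=True)
    return peaks[:k]

# Toy example: one click at microphone A, a direct arrival and a weaker echo at B.
mic_a = [0.0] * 40
mic_b = [0.0] * 40
mic_a[10] = 1.0
mic_b[15] = 1.0  # direct path, 5 samples later
mic_b[22] = 0.6  # echo, 12 samples later

corr = cross_correlation(mic_a, mic_b, max_lag=20)
candidate_lags = top_k_lags(corr, k=4)
sample_rate_hz = 250_000  # hypothetical sampling rate
candidate_tdoas_s = [lag / sample_rate_hz for lag in candidate_lags]
```

Keeping several candidate lags per microphone pair, rather than only the strongest one, leaves the true delay available even when an echo produces the highest correlation peak.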
In this 2nd step, our algorithm refines these TDOAs according to the formulas described in the section [...]. The overall TDOA-calculation error increases or decreases depending on the amount of the shifts of the detection vectors in time. Among all the possible shifts, the best shifts of the detection vectors, i.e. those providing the minimum overall TDOA-calculation error, are used to find the most probable TDOAs, as described in detail in the following section 2.
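The refinement formulas themselves are not reproduced in this text. As a hedged illustration only, one standard consistency property of TDOAs can play the role of a coherence condition: for three microphones A, B and C and a single source, t_AB + t_BC = t_AC, where t_XY is the arrival time at Y minus the arrival time at X. The sketch below picks, among several candidate maxima per pair, the combination with the smallest violation of this identity. The criterion actually used by the authors may differ; all names and numbers are hypothetical.

```python
from itertools import product

def closure_error(t_ab, t_bc, t_ac):
    """|t_ab + t_bc - t_ac|, which is zero for TDOAs produced by one source."""
    return abs(t_ab + t_bc - t_ac)

def most_coherent(cands_ab, cands_bc, cands_ac):
    """Choose one candidate per microphone pair with the smallest closure error."""
    best = min(product(cands_ab, cands_bc, cands_ac), key=lambda t: closure_error(*t))
    return best, closure_error(*best)

# Candidate lags in samples, strongest correlation maximum first.
# For pair A-B the strongest maximum (12) is an echo; the true lag (5) comes second.
cands_ab = [12, 5]
cands_bc = [-3, 7]
cands_ac = [2, 15]

baseline = (cands_ab[0], cands_bc[0], cands_ac[0])  # single-maximum choice
refined, err = most_coherent(cands_ab, cands_bc, cands_ac)
```

In this toy case the single-maximum choice violates the identity by 7 samples, while the refined choice satisfies it exactly.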
Figure 4: The TDOA calculation errors for the flying-bat recording ECOR2116 at 10 different
moments in time.
Figure 5 shows the error distribution as a joint histogram for the state of the art (using 1 TDOA correlation maximum) and our method, operating with 1-6 TDOA correlation maximums. The non-symmetry of
state of
Figure 6:
Figure 7:
As can be seen, our method is able to fill the gaps in the TDOA trajectory left by the state-of-the-art TDOA calculation methods. The advantage of our patented method (used with just 4 maximums) over the baseline (1 maximum) is [...] trajectories, resulting in the smoothness and robustness necessary for the distance-to-microphones trajectory reconstruction. For the trajectory reconstruction we use the Levenberg-Marquardt (LM) solver, and the following Figure 8 shows the difference in the resulting trajectories.
Figure 8: Comparison of reconstructed trajectories of the flying bat, resulting from the state-of-the-art single-maximum TDOA calculation method (left) and from our multiple-maximum TDOA calculation method (right).
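The solver settings used for the reconstruction are not given in the text. The following minimal sketch shows how a single 3D position could be recovered from TDOAs with a basic Levenberg-Marquardt iteration. The microphone layout, the propagation speed, the starting point and every function name are hypothetical, and the residual is the usual range-difference formulation relative to the first microphone, which is an assumption rather than the authors' exact cost function.

```python
import math

def solve_linear(a, b):
    """Solve a small dense system a x = b by Gaussian elimination with pivoting."""
    n = len(b)
    m = [list(row) + [rhs] for row, rhs in zip(a, b)]
    for c in range(n):
        p = max(range(c, n), key=lambda r: abs(m[r][c]))
        m[c], m[p] = m[p], m[c]
        for r in range(c + 1, n):
            f = m[r][c] / m[c][c]
            for k in range(c, n + 1):
                m[r][k] -= f * m[c][k]
    x = [0.0] * n
    for r in reversed(range(n)):
        x[r] = (m[r][n] - sum(m[r][k] * x[k] for k in range(r + 1, n))) / m[r][r]
    return x

def residuals(pos, mics, range_diffs):
    """(|pos - mic_i| - |pos - mic_0|) minus the measured range difference, i >= 1."""
    d0 = math.dist(pos, mics[0])
    return [math.dist(pos, m) - d0 - rd for m, rd in zip(mics[1:], range_diffs)]

def jacobian(pos, mics):
    """Derivative of each residual with respect to the three coordinates of pos."""
    d0 = math.dist(pos, mics[0])
    u0 = [(p - q) / d0 for p, q in zip(pos, mics[0])]
    rows = []
    for m in mics[1:]:
        d = math.dist(pos, m)
        rows.append([(p - q) / d - u for p, q, u in zip(pos, m, u0)])
    return rows

def levenberg_marquardt(start, mics, range_diffs, iterations=30, damping=1e-3):
    pos = list(start)
    cost = sum(r * r for r in residuals(pos, mics, range_diffs))
    for _ in range(iterations):
        r = residuals(pos, mics, range_diffs)
        j = jacobian(pos, mics)
        jtj = [[sum(row[a] * row[b] for row in j) for b in range(3)] for a in range(3)]
        jtr = [sum(row[a] * ri for row, ri in zip(j, r)) for a in range(3)]
        for a in range(3):
            jtj[a][a] += damping  # Levenberg damping term
        step = solve_linear(jtj, [-g for g in jtr])
        trial = [p + s for p, s in zip(pos, step)]
        trial_cost = sum(x * x for x in residuals(trial, mics, range_diffs))
        if trial_cost < cost:  # accept the step and reduce the damping
            pos, cost, damping = trial, trial_cost, damping / 10.0
        else:  # reject the step and damp more strongly
            damping *= 10.0
    return pos, cost

SPEED = 343.0  # assumed constant propagation speed in m/s (hypothetical)
mics = [(0, 0, 0), (4, 0, 0), (0, 4, 0), (0, 0, 4), (4, 4, 4)]
true_pos = (1.0, 2.0, 1.5)
tdoas = [(math.dist(true_pos, m) - math.dist(true_pos, mics[0])) / SPEED for m in mics[1:]]
range_diffs = [t * SPEED for t in tdoas]

# Start from a nearby point, for example the previous point of the trajectory.
estimate, final_cost = levenberg_marquardt((1.2, 1.9, 1.6), mics, range_diffs)
```

Starting each solve from the previous trajectory point keeps the iteration in the region where it converges quickly, which is one plausible reason why cleaner TDOA inputs reduce the solving time.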
We have shown that our patented multi-maximum TDOA calculation and refinement method has an
advantage over state-of-the-art methods.
It provides a higher number and accuracy of reconstructed trajectory points, and requires less
calculation time for the LM solving thanks to the enhanced inputs.
Our future plans include the construction of a neural network for the detection and classification of animal calls in large amounts of recorded data. As shown in [9-11], this network can be built as a convolutional neural network (CNN) that receives the microphone recordings for detection and classification. CNNs have been successfully applied not only to images but also to time-domain signals, and have been adapted to process and detect many specific time-series patterns [10,11].
Acknowledgments
This work was done with the support of SABIOD project and SATT sud-est company. We
also thank Cyberio company for bat recordings.
References
[1] C.D. LeMunyan, W. White, E. Nybert, Design of a miniature radio transmitter for use in animal
studies, J. Wildl. Mgmt., vol. 23(1), pp. 107-110, 1959.
[2] W.W. Cochran, R.D. Lord, A radio tracking system for wild animals, J. Wildl. Mgmt., vol. 27(1),
pp. 9-24, 1963.
[3] P. Stoica and J. Li. Lecture notes - source localization from range-difference measurements. IEEE Signal
Processing Magazine, 23(6):63-66, November 2006.
[4] Giraudet P., Glotin H., Real-time 3D tracking of whales by echo-robust precise TDOA estimates
with a widely-spaced hydrophone array. Int. Jour. Applied Acoustics, Elsevier Ed., Vol. 67, Issues 11-12, pp 1106-1117, Nov. 2006.
[5] Glotin H., Caudal F., Giraudet P., Whales cocktail party: a real-time tracking of multiple whales, in
Internat Journal Canadian Acoustics, V 36, p139-145, 2008, ISSN 0711-6659.
[6] Glotin H., Giraudet P., Caudal F., Patent, Real time multiple whale tracking by passive acoustics,
2007. no 07/06162, Europe, extension 2009 USA.
[7] Bénard F., Glotin H., Giraudet P., Whale 3D monitoring using astrophysic NEMO ONDE two
meters wide platform with state optimal filtering by Rao-Blackwell Monte Carlo data association, in
Journal of Applied Acoustics, Vol. 71 (2010), pp. 994-999.
[8] P. Annibale and R. Rabenstein. Accuracy of time-difference-of-arrival based source localization
algorithms under temperature variations. In Proc. of 4th Int. Symposium on Communications, Control
and Signal Processing (ISCCSP), Limassol, Cyprus. IEEE, 2010.
[9] O. Abdel-Hamid, L. Deng, and D. Yu. Exploring convolutional neural network structures and [...]
[...] H. Cecotti, A. Graser. Convolutional Neural Networks for P300 Detection with Application to Brain-Computer Interfaces, IEEE Transactions on Pattern Analysis and Machine Intelligence, Volume 33, Issue 3.
Julie E. Elie and Frédéric E. Theunissen. UC Berkeley, Dept of Psychology and Helen Wills Neuroscience Institute.
[...] performance is significantly higher than what one could obtain from either of the two other sound representations tested with the Random Forest algorithm (MFCC, 53.1% accuracy; MPS, 63.5% accuracy). Besides, the Random Forest yielded better classification performance than LDA, irrespective of the feature space used. In conclusion, the data-driven algorithm using the DFA on the spectrogram showed superior results, and we propose that it could be used both for investigating behavioral and neural mechanisms of sound discrimination and for the vocalization-based identification of species in ecological or environmental studies.
Motivation
Classical approaches in bio-acoustics and sound analyses ~ Using simple (ad hoc) features.
Hyena Vocalizations
Perception and Neural Representation.
Individual Signature in the Hyena giggle sounds.
33% Correct
Mathevon et al, BMC Ecology 2010
Zebra Finch Complete Repertoire is Complex.
[Figure: spectrograms of zebra finch call types. Panel labels: Distance call, Nest, Long Tonal call, Begging, Tet, Song, Aggressive, Distress call, and a truncated "...uck" call label. Vertical axis: Frequency (kHz); time scale bar: 500 ms.]
We need an automatic (unsupervised) feature extractor
Auditory System of Song Birds
Functional Clusters
We would like to relate the features to neural representation
Woolley SM et al, J. Neurosci 2009
A UNIQUE DATA BASE OF THE COMPLETE VOCAL REPERTOIRE OF THE ZEBRA FINCH.
The Zebra Finch Vocal Repertoire: A quick tour.
Categorization of WHAT from the behavioral context
Needy: Chick Calls
Begging
Long Tonal
Affiliative: Social Contact
Whine
Nest
Tet
Distance
Song
Non-Affiliative: Alarm, Distress, Aggression
Tuck
Thuck
Distress
Aggressive
Julie Elie
Chicks (Juveniles): Begging and Long Tonal Call
[Figure: oscillograms and spectrograms of a Long Tonal Call and a Begging call. Axes: time (ms) and Frequency (ticks from 0 to 8000).]
Affiliative Calls
Making and Preserving Social Bonds
[Figure: oscillograms and spectrograms of affiliative calls. Panel labels: Tet call, Whine calls, Male Song, Distance Male, Distance Female, Nest calls, Duo, Dance. Axes: time (ms) and Frequency (ticks from 0 to 8000).]
Alarm Calls, Distress Calls, Aggressive Calls
[Figure: oscillograms and spectrograms of non-affiliative calls. Panel labels: Tuck call, Thuck call, Distress call, Aggressive. Axes: time (ms) and Frequency (ticks from 0 to 8000).]
CLASSIFYING VOCALIZATION TYPES.
Three Feature Spaces
Spectrogram + PCA
Mel Frequency Cepstral Coefficients
Modulation Power Spectrum
Two Classifiers
Fisher Linear Discriminant Analysis
Random Forest
Fisher Discriminant Analysis
Relax the isotropic assumption and get quadratic decision boundaries
Tree Classifiers
Random Forest = A forest of tree classifiers
Random Forest
Data
Result from Random Forest
Logistic Regression
Spectrogram + PCA based features
Oscillogram
Short Time Fourier Transform (Fband 50 Hz, Tfsamp 1 kHz)
Spectrogram (figure "Spectrogram for PCA"; axes: Time (s), Frequency (kHz))
Dimensions: 231 (f) * 201 (t) = 46,431
PCA
Regularization
50 first components
PCA X Regularization
Dimensions: 50 first components
(Spectrogram+PCA) and DFA
Training data set (6/7 of the data)
DFA
Bootstrap 1000 (cross-validation)
Correct classification:
66.8% (All Calls)
62.5% (Avg per Type)
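The slides report two figures per experiment without defining them. A plausible reading, stated here as an assumption, is that "All Calls" is the fraction of all test calls classified correctly and "Avg per Type" is the mean of the per-type correct rates. The short sketch below (hypothetical labels and function names) shows how the two can differ when call types are unevenly represented.

```python
from collections import Counter

def overall_accuracy(true_labels, predicted):
    """Fraction of all calls that were classified correctly."""
    correct = sum(t == p for t, p in zip(true_labels, predicted))
    return correct / len(true_labels)

def mean_per_type_accuracy(true_labels, predicted):
    """Average over call types of the fraction classified correctly within each type."""
    totals = Counter(true_labels)
    hits = Counter(t for t, p in zip(true_labels, predicted) if t == p)
    return sum(hits[c] / totals[c] for c in totals) / len(totals)

# Hypothetical test set: 8 Tet calls and 2 Whine calls.
true_labels = ["Tet"] * 8 + ["Whine"] * 2
predicted = ["Tet"] * 8 + ["Tet", "Whine"]

all_calls = overall_accuracy(true_labels, predicted)
avg_per_type = mean_per_type_accuracy(true_labels, predicted)
```

Under this reading, a classifier that favours frequent call types scores higher on "All Calls" than on "Avg per Type", which matches the direction of the gap in every result reported here.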
Spectrogram+PCA and RF
Tree Classifier + random forest
Correct classification:
83.2% (All Calls)
70.4% (Avg per Type)
Mel Frequency Cepstrum Coefficients
Oscillogram
Short Time Fourier Transform (Fband 50 Hz, Tfsamp 1 kHz)
Mel Filter Bank
Dimensions: 13 C * 18 T
Log + Discrete Cosine Transform
Dimensions: 13 C * 18 T
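The last two stages of this pipeline (log, then discrete cosine transform) can be sketched for a single time frame. The mel formula below is one common convention and is an assumption, as are the function names and the toy filter-bank energies; the filter bank itself is not implemented.

```python
import math

def hz_to_mel(f_hz):
    """One common mel-scale convention (an assumption; other variants exist)."""
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)

def dct_ii(values):
    """Unnormalised DCT-II: X[k] = sum_n x[n] * cos(pi / N * (n + 1/2) * k)."""
    n = len(values)
    return [
        sum(v * math.cos(math.pi / n * (i + 0.5) * k) for i, v in enumerate(values))
        for k in range(n)
    ]

def cepstral_coefficients(filterbank_energies, n_coeffs=13):
    """Log of the mel filter-bank energies followed by a DCT, keeping n_coeffs terms."""
    log_energies = [math.log(e) for e in filterbank_energies]
    return dct_ii(log_energies)[:n_coeffs]

# A spectrally flat frame: every filter of a hypothetical 20-filter bank has energy 2.0.
flat_frame = [2.0] * 20
coeffs = cepstral_coefficients(flat_frame)
```

For a flat frame only the first coefficient is non-zero, which illustrates that the higher coefficients describe the shape of the spectral envelope rather than its overall level. With 13 coefficients per frame and 18 frames, each call is described by 13 * 18 = 234 values.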
MFCC + DFA
Training data set (6/7 of the data)
DFA
Bootstrap 1000 (cross-validation)
Correct classification:
50.9% (All Calls)
40.9% (Avg per Type)
MFCC + RF
Tree Classifier + random forest
Correct classification:
70.6% (All Calls)
53.1% (Avg per Type)
Modulation Power Spectrum (MPS)
Oscillogram
Short Time Fourier Transform (Fband 50 Hz, Tfsamp 1 kHz)
Spectrogram (figure "Spectrogram for PCA"; axes: Time (s), Frequency (kHz))
Dimensions: 231 (f) * 201 (t) = 46,431
log
2D FT + Downsampling
Modulation Power Spectrum
13 (FM) * 18 (TM)
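The "log, then 2D FT" stage can be sketched with a naive two-dimensional discrete Fourier transform on a tiny toy spectrogram. The function name and the toy data are hypothetical, and the downsampling to 13 (FM) * 18 (TM) is not reproduced.

```python
import cmath
import math

def modulation_power_spectrum(spectrogram):
    """|2D DFT|^2 of the log spectrogram (rows: frequency, columns: time)."""
    log_s = [[math.log(v) for v in row] for row in spectrogram]
    n_f, n_t = len(log_s), len(log_s[0])
    mps = []
    for k_f in range(n_f):
        row = []
        for k_t in range(n_t):
            acc = 0j
            for f in range(n_f):
                for t in range(n_t):
                    phase = -2j * math.pi * (k_f * f / n_f + k_t * t / n_t)
                    acc += log_s[f][t] * cmath.exp(phase)
            row.append(abs(acc) ** 2)
        mps.append(row)
    return mps

# Toy spectrogram: 4 frequency bands x 16 frames, amplitude-modulated in time
# with 2 cycles over the 16 frames, identical in every band.
toy = [[math.exp(math.cos(2 * math.pi * 2 * t / 16)) for t in range(16)] for _ in range(4)]
mps = modulation_power_spectrum(toy)
```

The energy of this toy example concentrates at spectral-modulation index 0 and temporal-modulation index 2 (and its mirror image, index 14), i.e. the MPS reports how fast the spectrogram fluctuates along time and along frequency.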
MPS + DFA
Modulation Power Spectrum
Training data set (6/7 of the data)
DFA
Bootstrap 1000 (cross-validation)
Correct classification:
50.5% (All Calls)
41.3% (Avg per Type)
MPS + RF
Modulation Power Spectrum
Tree Classifier + random forest
Correct classification:
78.7% (All Calls)
63.5% (Avg per Type)
Summary
[Figure: Percent Correct versus Feature Space]
Conclusions
PCA on spectrogram yielded the best feature space (among 3):
Higher classification performance
Ease of interpretation
Room for improvement (Sparse, ICA)
MPS might be better than MFCC.
Random Forest is an efficient classifier.
Zebra finches have a large repertoire of calls that can be categorized based on behavioral context and acoustics.
Chapter 6
Non Human Speech Processing
6.1 Gabor Scalogram Reveals Formants in High-Frequency Dolphin Clicks
Trone M., Balestriero R., Glotin H.
6.2 Supervised classification of baboon vocalizations
Janvier M., Horaud R., Girin L., Berthommier F., Boe L., Kemp C., Rey A., Legou T.
6.3 Software Tools for analyzing mice vocalizations with applications to preclinical models of human disease
Shokoohi-Yekta M., Zakaria J., Rotschafer S., Mirebrahim H., Razak K., Keogh E.
Randall Balestriero
Université de Toulon
av. de l'Université
La Garde, France
Hervé Glotin
Institut Universitaire de France
Bd St Michel, Paris
& Université de Toulon
Abstract
1. Introduction
Cetacean acoustics research is currently expanding due to recent advances in available technology,
in conjunction with decreasing costs of equipment. The capability to record the higher frequencies
associated with click trains of many cetaceans of the suborder Odontoceti permits more complete
acoustic assessments. However, the copious amount of digitally recorded data produced requires
an interdisciplinary approach to develop innovative algorithms and procedures to process,
reduce and analyze the resultant complex data sets.
Detailed information concerning Odontoceti click acoustics has been derived from studying
animals in human care. The ability to finely manipulate variables under controlled conditions has
yielded information regarding the timing, frequency and amplitude of clicks. On-axis click signals
occur when the receiving animal or hydrophone is positioned directly in front of the signaling
dolphin. Off-axis clicks are characterized by lower frequencies and amplitudes when compared to
analogous on-axis clicks [R1]. Furthermore, off-axis click trains emitted by multiple animals can
produce interference resulting in decreased signal amplitude [R1]. Finally, it has been suggested
that the morphology of the cetacean head may cause the signal frequency to decrease as a function
of the angle from the beam axis, functioning like a low-pass filter [R2]. However, higher
frequency signals have narrower beam patterns than low frequency signals. As a result, high
frequency signals are more resistant to signal distortion due to off-axis propagation [R2].
Traditionally the short time Fourier transform (STFT) has been used to analyze the time,
frequency and amplitude parameters of an acoustic signal. However, the STFT employs a sliding
window that is associated with a time/frequency trade-off, such that shorter STFT windows
portray better time resolutions but poorer frequency resolutions in the resulting spectrograms, and
vice versa [R2]. Similarly, scalogram analyses also produce time, frequency and amplitude
spectrograms. However, these algorithms use wavelet transforms, and consequently improve
knowledge of where frequency components are occurring in time [R3]-[R5]. This capacity is
valuable when analyzing click trains that occur in short bursts. Furthermore, scalogram analyses
are able to detect more clicks than the STFT when the signal to noise ratio is low [R4]-[R5].
Finally, the relatively simple scalogram computations facilitate the prompt acquisition of results
[R3]. Thus, scalogram analyses may be beneficial to many real-world applications, such as
acoustically detecting multiple sperm whales using a single hydrophone [R3].
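The STFT trade-off mentioned above follows directly from the window length: a window of N samples at sampling rate fs lasts N / fs seconds and yields frequency bins spaced fs / N apart, so the product of the two is always 1. The short sketch below tabulates this for a few window lengths; the function name and the chosen window lengths are hypothetical, and the sampling rate is an illustrative value.

```python
def stft_resolution(sample_rate_hz, window_samples):
    """Return (window duration in seconds, frequency-bin spacing in Hz)."""
    return window_samples / sample_rate_hz, sample_rate_hz / window_samples

SAMPLE_RATE_HZ = 500_000  # illustrative sampling rate
for n in (256, 1024, 4096):
    duration_s, bin_hz = stft_resolution(SAMPLE_RATE_HZ, n)
    print(f"{n:5d} samples: {duration_s * 1e3:.3f} ms window, {bin_hz:.1f} Hz per bin")
```

Shortening the window to localise a click in time therefore widens the frequency bins by the same factor, which is the limitation that a multi-resolution analysis is meant to relax.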
Narrow-band click trains characterized by regular, incremental changes in energy, frequency and
inter-pulse intervals are known as coherent pulses [R6]. Non-coherent pulses occur when the
consecutive, broad-band pulsed sounds within a click train are characterized by wide fluctuations
in energy, frequency and inter-pulse intervals [R6]. Dolphins residing in nondescript pools under
human care have been recorded to alternately produce these non-coherent pulses [R6]. The
function of these irregular, non-coherent pulses is not currently known. Ryabov [R6] has
suggested that these pulses may function as phonemes, which are the smallest acoustic units that
constitute a language.
Collectively, human languages consist of approximately 371 phonemes characterized by
frequencies that fall between 20 and 20,000 Hz [R6]. Phonemes are depicted on spectrograms as
formants and scalogram analyses have been used to detect formants in human speech [R7].
However, cetaceans possess far more extensive hearing ranges and acoustic repertoires than
humans. Indeed, bottlenose dolphins (Tursiops truncatus) produce sounds that range from 200
to over 500,000 Hz, leading one researcher to propose that these animals are capable of producing
at least 3,000 phonemes consisting of non-coherent pulses [R6].
Studying free-ranging cetaceans requires the ability to utilize parameters that are not impacted by
being recorded off-axis and by interference from multiple animals producing click trains in the
same time interval. High-frequency acoustic signals are more resistant to distortion resulting
from off-axis propagation [R2]. Thus, scalogram analysis may be a more appropriate tool in
cetacean acoustic investigations due to its superiority in determining where frequency components
are occurring in time [R3]-[R5]. Furthermore, scalogram analyses have been used to detect
formants in human speech [R7]. Thus, we propose to analyze bottlenose dolphin click trains using
both STFT and scalogram analyses to compare the efficiency of these two algorithms.
Subsequently, we examine the resulting spectrograms for possible formants.
2. Material
Acoustic signals from two species of bottlenose dolphins (Tursiops sp.) were collected in their
natural environments.
2.1 Indian River Lagoon, Florida, Tursiops truncatus
A group of three bottlenose dolphins was recorded in July of 2013, in the Indian River Lagoon,
Florida. A Cetacean Research Technology CR-3 Hydrophone, a Reson EC 6061 preamp, and an
IOTech Personal Daq/3000 Series digital acquisition system were used to obtain these recordings.
A 500 kHz sampling rate at 16 bit resolution was employed during recordings yielding usable data
up to 250 kHz. DaqView Data Acquisition software was used to convert and store data in the .wav
format. These dolphins were attracted to the vicinity of the recording vessel by the fishing
activities of the boat's captain. This .wav file is available on the NIPS4B website at
http://sabiod.univtln.fr/nips4b/media/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_florida_TRONE.wav
and is used under the copyright given on the NIPS4B website.
2.2 Western Australia, Tursiops aduncus
A 37-year-old female Tursiops aduncus dolphin was recorded in Monkey Mia-Shark Bay, Western
Australia in August 2013 during a mission of the SABIOD project, using a Cetacean Research
Technology CR 55 hydrophone and a TASCAM audio digital recorder. The recordings were made
with a 96 kHz sampling rate and 24 bit resolution. The recording is available on the NIPS4B
website under the NeuroSonar session:
http://sabiod.univtln.fr/nips4b/media/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_32bits_after19min_nips4bfile_e.wav
3. Methods
3.1 Short-time Fourier transform (STFT)
The T. truncatus and T. aduncus audio recordings were divided into 0.4 second and 0.7 second
segments respectively. Subsequently, 168 T. truncatus segments and 149 T. aduncus segments
were analyzed using STFT, with half-overlapping windows of 512 ms durations. The resultant
spectrograms show the local power spectrum of the signal over time.
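A minimal sketch of an STFT with half-overlapping windows is given below. The text states a window of 512 ms, which is longer than the 0.4 s segments, so the sketch assumes that the window length is a number of samples; this is an assumption, as are the Hann taper, the function names and the toy signal. A naive DFT is used to keep the example dependency-free.

```python
import cmath
import math

def n_frames(n_samples, window):
    """Number of half-overlapping windows that fit entirely in the signal."""
    return (n_samples - window) // (window // 2) + 1

def frames_half_overlap(signal, window):
    hop = window // 2
    return [signal[s:s + window] for s in range(0, len(signal) - window + 1, hop)]

def power_spectrum(frame):
    """Hann-windowed power spectrum for bins 0 .. N/2 (naive DFT)."""
    n = len(frame)
    tapered = [v * (0.5 - 0.5 * math.cos(2 * math.pi * i / n)) for i, v in enumerate(frame)]
    return [
        abs(sum(tapered[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))) ** 2
        for k in range(n // 2 + 1)
    ]

def spectrogram(signal, window):
    """Local power spectrum over time: one row per half-overlapping frame."""
    return [power_spectrum(f) for f in frames_half_overlap(signal, window)]

# Toy signal: 128 samples of a tone that falls exactly on bin 8 of a 32-sample window.
tone = [math.sin(2 * math.pi * 8 * i / 32) for i in range(128)]
spec = spectrogram(tone, window=32)
peak_bins = [max(range(len(p)), key=p.__getitem__) for p in spec]
```

Under the same assumption, a 0.4 s segment sampled at 500 kHz (200,000 samples) would give `n_frames(200_000, 512)` frames per spectrogram.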
3.2 Gabor scalogram transform
The same files that were analyzed using the STFT were also analyzed using a Gabor scalogram
transform with the ScatNet toolkit [R8] to produce scalograms. The T coefficient was set to 32,
the Q coefficient was set to 16, and the J coefficient was set to 80. Again, the resultant scalograms
show the local power spectrum of the signal over time.
It should be noted that, unlike in STFT spectrograms, the center frequencies of the wavelets are
quantized in a geometric progression. Hence, the Y-axis of the scalogram is naturally
logarithmic. Additionally, the temporal durations of the corresponding spectrograms and
scalograms are the same. However, the scalogram X-axis is similarly transformed by the
ScatNet toolkit, and in actuality reflects the time scale portrayed in the spectrogram [R8].
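The geometric progression of center frequencies can be made concrete with a small sketch. It assumes that Q is the number of wavelets per octave and J the total number of wavelets, and that the highest center frequency is 250 kHz; these are illustrative assumptions, and the exact parameterization used inside ScatNet is not reproduced here.

```python
def center_frequencies(f_max_hz, q, j):
    """j center frequencies, q per octave, in geometric progression below f_max_hz."""
    return [f_max_hz * 2.0 ** (-k / q) for k in range(j)]

# Hypothetical setting: Q = 16 wavelets per octave, J = 80 wavelets, top frequency 250 kHz.
freqs = center_frequencies(250_000.0, q=16, j=80)
ratios = [freqs[k + 1] / freqs[k] for k in range(len(freqs) - 1)]
```

Because consecutive center frequencies differ by a constant ratio rather than a constant step, plotting one row per wavelet produces a logarithmic frequency axis: with these values every 16 rows span one octave, and the 80 rows cover about five octaves.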
3.3 Formant detection
Following procedures outlined by Jemma et al. [R7], formants were identified in the spectrograms
and scalograms. Within each click, frequencies of highest amplitude were identified by visual
inspection. Consecutive clicks that contained regions of higher acoustic energies at approximately
the same frequency were identified as formants. Each audio .wav file was analyzed using the
STFT and the Gabor scalogram transform, allowing a comparison of these two representations for
formant tracking.
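The identification described above was done by visual inspection. As a hedged illustration of the same linking rule, the sketch below automates a simplified version: it takes the frequency of highest energy in each click and groups consecutive clicks whose peaks lie at approximately the same frequency. The tolerance, the function names and the toy values are hypothetical, and only the single strongest band per click is tracked, whereas the analysis in the text considers several bands.

```python
def peak_frequency(spectrum, freqs_hz):
    """Frequency of the highest-energy bin of one click."""
    i = max(range(len(spectrum)), key=lambda k: spectrum[k])
    return freqs_hz[i]

def link_formants(click_peaks_hz, rel_tol=0.05):
    """Group indices of consecutive clicks whose peak frequencies agree within rel_tol."""
    groups, current = [], [0]
    for i in range(1, len(click_peaks_hz)):
        prev, cur = click_peaks_hz[i - 1], click_peaks_hz[i]
        if abs(cur - prev) <= rel_tol * prev:
            current.append(i)
        else:
            if len(current) > 1:
                groups.append(current)
            current = [i]
    if len(current) > 1:
        groups.append(current)
    return groups

# Hypothetical peak frequencies (Hz) of six consecutive clicks.
peaks = [40_000, 40_500, 41_000, 90_000, 60_000, 61_000]
formant_groups = link_formants(peaks)
```

In this toy sequence the fourth click has an energy maximum that cannot be connected to either neighbour, so it is left out of both groups.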
4. Results
The STFT spectrograms did not clearly portray the regions of local energy maxima within each
click. Instead, the energy appeared to be equally distributed among the various frequencies that
demarcated the click. In contrast, the Gabor scalogram layer 1 displayed distinct bands of local
energy maxima with respect to frequency. For ease of comparison, Figure 1 depicts the Gabor
scalogram layers 1, 2 and 3 of the scattering decomposition, as well as the STFT spectrogram from
a segment of the Florida T. truncatus recordings.
We continued our exploration of dolphin acoustics by focusing our attention on layer 1 of each
scalogram. Bands of high amplitude sharing the same frequency on adjacent clicks were
connected with red lines (see Figure 2). Following human speech terminology [R7], we identified
these clusters of local energy maxima as dolphin formants.
Moreover, we labeled the formants depicted in Figure 2 as phoneme units A, B, C and D. Phoneme
B seemed to be composed of phoneme A plus an additional formant. Similarly, phoneme D
appeared to be composed of phoneme C plus an additional formant. Furthermore, phoneme C
seemed to be a shift upward of phoneme B. Additionally, three individual clicks contained local
energy maxima, but could not be connected to either adjacent click. These clicks occurred at 200,
270 and 300 ms.
The original scalograms and spectrogram for this file are available at the following URL:
http://sabiod.univtln.fr/pimc/TURSIOPS_FLORIDA_norformants/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_
florida_TRONE_J80_Q16_T32/part6_983041_windowsnb2_T32_Q16_J80.png
The following Gabor scalograms derived from the T. truncatus audio file display distinct formants.
These formants cannot be ascertained from inspection of the corresponding STFT spectrograms,
which can be accessed at the following URLs for comparison:
http://sabiod.univtln.fr/pimc/TURSIOPS_FLORIDA_norformants/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_florida_TRONE_J80_Q
16_T32/part9_1572865_windowsnb2_T32_Q16_J80.png
http://sabiod.univtln.fr/pimc/TURSIOPS_FLORIDA_norformants/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_florida_TRONE_J80_Q
16_T32/part65_12582913_windowsnb2_T32_Q16_J80.png
http://sabiod.univtln.fr/pimc/TURSIOPS_FLORIDA_norformants/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_florida_TRONE_J80_Q
16_T32/part74_14352385_windowsnb2_T32_Q16_J80.png
Finally, a directory with links to the 168 corresponding scalograms/spectrogram spectra derived
from our T. truncatus audio recordings, each 0.4 seconds in duration, can be accessed at the
following URL:
http://sabiod.univtln.fr/pimc/TURSIOPS_FLORIDA_norformants/NIPS2_TURSIOPS_20_2013_mosquito_lagoon_florida_T
RONE_J80_Q_16_T32/
Figure 1: The scalograms (layers 1, 2 and 3) of the scattering representation of the Gabor scalogram
and the STFT spectrogram, from a segment of the Florida T. truncatus file, 500 kHz sampling rate,
0.4 seconds, coefficients T=32, Q=16, J=80. Horizontal axis: time (seconds).
Figure 2: Scalogram layer 1 from the previous figure displaying the joint formant nodes. Bands of
high amplitude sharing the same frequency on adjacent clicks have been connected with red lines.
Phoneme units have been labeled as A, B, C and D. Total duration 0.4 seconds (Fs : 500 kHz).
Axes: frequency (kHz) versus time (ms).
In addition, scalograms and spectrograms derived from the audio recordings of T. aduncus in
Western Australia are also available. Each of the 149 segments depicts 0.7 seconds of audio
recordings. These corresponding scalograms/spectrogram spectra can be accessed from a file
directory using the following URL:
http://sabiod.univtln.fr/pimc/NICKY_norformants/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2
013_SABIOD_96kHz_J80_Q16_T32/
Figure 3 depicts the scalograms and spectrogram derived from the first T. aduncus file listed
below. Similar to the previous example, the first layer of the scalogram highlights the frequencies
containing the most energy per click, whereas the energy appears to be evenly distributed among
the various frequencies on the spectrogram. Moreover, when the energy maxima are connected
with red lines, two phoneme units appear, which we have labeled as D1, D2 and E. D1 contains two
clicks, while the two phoneme units labeled D2 consist of a single click each. Eight clicks make up
phoneme unit E.
These clicks, which were recorded with a 96 kHz sampling rate, are characterized by a frequency
bandwidth of approximately 35 kHz. All of these clicks portray their highest energies at or very
near the 48 kHz ceiling of the spectrogram. This is in contrast to the clicks sampled at 500 kHz,
which are typified by a frequency bandwidth of approximately 160 kHz (see Figures 1 and 3).
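The 48 kHz ceiling mentioned above follows from the Nyquist limit: a recording sampled at Fs can only represent frequencies up to Fs/2. A minimal worked example for the two sampling rates used in this study (the helper name `nyquist_khz` is ours):

```python
def nyquist_khz(sampling_rate_khz):
    """Highest frequency (kHz) representable at the given sampling rate."""
    return sampling_rate_khz / 2.0


for fs in (96.0, 500.0):
    print(f"Fs = {fs:.0f} kHz -> spectrogram ceiling = {nyquist_khz(fs):.0f} kHz")
```

Clicks whose energy peaks at or near this ceiling may therefore carry energy above it that the recording cannot show.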
We have posted four Gabor scalograms derived from the T. aduncus audio recordings that display
distinct formants. These formants cannot be ascertained from inspection of the corresponding
STFT spectrograms, which can be accessed at the following URLs for comparison:
http://sabiod.univtln.fr/pimc/NICKY_norformants/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_J8
0_Q16_T32/part115_7471105_windowsnb1_T32_Q16_J80.png
http://sabiod.univtln.fr/pimc/NICKY_norformants/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_J8
0_Q16_T32/part5_262145_windowsnb1_T32_Q16_J80.png
http://sabiod.univtln.fr/pimc/NICKY_norformants/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_J8
0_Q16_T32/part73_4718593_windowsnb1_T32_Q16_J80.png
http://sabiod.univtln.fr/pimc/NICKY_norformants/Tursiops_truncatus_Nicky_SHARKD_0002S34D12_day3_aug2013_SABIOD_96kHz_J8
0_Q16_T32/part131_8519681_windowsnb1_T32_Q16_J80.png
Figure 3: The scalogram (layers 1, 2 and 3) of the scattering representation of the Gabor scalogram and
the STFT spectrogram, from a segment of the Western Australia T. aduncus file, 96 kHz sampling rate,
0.7 seconds, coefficients T=32, Q=16, J=80. Axes: frequency (Hz) versus time (seconds).
Figure 4: Scalogram layer 1 from the previous figure displaying the linked formant nodes. Bands
of high amplitude sharing the same frequency on adjacent clicks have been connected with red
lines. Phoneme units have been labeled as D1, D2 and E. Total duration 0.7 seconds (Fs : 96kHz).
Indeed, other researchers have suggested that cetaceans utilize pulsed signals for communication
as well as navigation. For example, sperm whale clicks have been associated with various social
contexts, including reunions, separations, contact calls, and in response to unusual underwater
sounds [R10]. Additionally, pulsed signals have been associated with agonistic interactions,
aggressive behavior, discipline, and excitement among Atlantic spotted dolphins (Stenella
frontalis) and bottlenose dolphins (T. truncatus) [R11]. Finally, harbor porpoises (Phocoena
phocoena) utilize stereotyped, narrow-band high frequency clicks among conspecifics during
aggressive interactions [R12].
Currently, the significance of the local energy maxima identified in our scalogram analyses
remains speculative. Although we are suggesting that they may function as communicative
phonemes, many researchers will be skeptical. Indeed, some proclaim that humans will not be able
to fully appreciate cetacean communication due to the difficulty and improbability of identifying
the basic, communicative unit, or phoneme [R13].
Other critics might claim that these formants would only be valid if the clicks were recorded on-axis. However, high frequency signals are more resistant than low-frequency signals to the
amplitude and frequency weakening caused by off-axis propagation [R2]. Such distortion has been
demonstrated for 115 kHz beluga whale (Delphinapterus leucas) clicks and for 122 kHz
bottlenose dolphin clicks [R14]. Nevertheless, our data demonstrate bottlenose dolphins produce
clicks of even higher frequencies. Indeed, Figure 1 displays clicks with energy in the 200 kHz
range. Furthermore, some of the clicks that we have recorded demonstrate high amplitudes at 250
kHz, suggesting a ceiling effect due to the limitations of our recording system (Figure 5). Finally,
bottlenose dolphins have been reported to produce clicks ranging between 400 and 500 kHz
[R15]-[R16]. To our knowledge, the degree to which these high frequency signals degrade due to
off-axis propagation has yet to be determined.
Figure 5. An FFT spectrogram of a bottlenose dolphin click that was recorded with a 500 kHz
sampling rate, suggesting that information exists above 250 kHz. The click was analyzed with
Raven Pro 1.5 software. Axes: frequency (kHz) versus time (seconds).
Furthermore, cetaceans live in a three-dimensional world, where most clicks are likely
received off-axis by conspecifics. Given that high frequency pulsed signals have been associated
with certain social situations, it is highly probable that off-axis degradation does not significantly
impede conspecific interpretation of these acoustic signals. Indeed, cetaceans may have evolved to
produce and perceive high frequency signals precisely because these signals are resistant to off-axis distortion, despite the fact that high frequency signals attenuate more readily than low
frequency signals.
In conclusion, we suggest that Gabor scalogram transforms outperform STFT analyses of cetacean
acoustics. Scalogram analyses are not subject to the time/frequency distortion trade-off that is
characteristic of the STFT. Thus, parameters that can be reliably ascertained through scalograms
include formant frequency; frequency bandwidth of the entire click that contains the formant;
quantity of clicks; and inter-click intervals. Exploring patterns associated with these parameters
may expand our understanding of cetacean communication and evolution, possibly facilitating
conservation efforts.
Future studies should utilize recording equipment capable of recording up to 1 MHz in order to
fully document the acoustic repertoire of cetaceans. Pattern recognition algorithms should be
developed to automate the tedious task of identifying formants within audio recordings containing
thousands of clicks. Furthermore, algorithms that are capable of ascertaining formant frequencies,
frequency bandwidths of the entire click that contains the formant, the quantity of formants, and
inter-click-intervals are essential in order to discern the possible functions of dolphin formants.
Once patterns have been identified, playback studies could be utilized to determine the role of
such patterns.
Acknowledgments
We thank David E. Bonnett for advice, access to the 500 kHz recording equipment and assistance
in acquiring the T. truncatus audio files. We thank Vincent Lostanlen, PhD student in the DATA team
of the DIENS laboratory, ENS Paris, for his collaboration. We thank the Big Data MASTODONS MI
CNRS project for its support with the SABIOD research program (http://sabiod.org).
References
[R1] Au, W.L. (1993). The Sonar of Dolphins. Springer-Verlag, New York
[R2] Au, W.L. & Hastings M.C. (2008). Principles of Marine Bioacoustics. Springer, New York
[R3] Lelandais, F. & Glotin, H. (2008). Mallat's Matching Pursuit of sperm whale clicks in real-time using
Daubechies 15 wavelets. In New Trends for Environmental Monitoring Using Passive Systems. IEEE conf.
Passive DOI:10.1109/PASSIVE.2008.4786977
[R4] Adam, O. (2006). Advantages of the Hilbert Huang transform for marine mammals signals analysis.
Journal of the Acoustical Society of America, 120(5), 2965-2973
[R5] Lopatka, M., Adam, O., Laplanche, C., Zarzycki, J. & Motsch, J.F. (2005). An attractive alternative for
sperm whale click detection using the wavelet transform in comparison to the Fourier spectrogram. Aquatic
Mammals, 31(4), 463-467. DOI: 10.1578/AM.31.4.2005.463
[R6] Ryabov, V. (2011). Some Aspects of Analysis of Dolphins Acoustical Signals. Open Journal of
Acoustics, 1, 41-54. DOI:10.4236/oja.2011.12006 Published Online (http://www.SciRP.org/journal/oja)
[R7] Jemma, I., Ouni, K., Laprie, Y., Ouni, S. & Haton, J.P. (2013). A new automatic formant tracking
approach based on scalogram maxima detection using complex wavelets. Int. Conf. on Control, Engineering
& Information Technology, Sousse, Tunisia
[R8] Mallat et al. (2013). The ScatNet Toolkit, http://www.di.ens.fr/data/software/
[R9] Mallat S. (2013). Personal communication
[R10] Watkins, W.A. and Schevill, W.E. (1977). Sperm whale codas. Journal of the Acoustical Society of
America, 62, 1485-1490
[R11] Herzing, D.L. (1996). Vocalizations & associated underwater behavior of free-ranging Atlantic spotted
dolphins, Stenella frontalis & bottlenose dolphins, Tursiops truncatus. Aquatic Mammals, 22(2), 61-79
[R12] Clausen, K.T., Wahlberg, M., Beedholm, K., Deruiter, S. and Madsen, P.T. (2011). Click
communication in harbor porpoises Phocoena phocoena. Bioacoustics: The International Journal of Animal
Sound and its Recording, 20(1), 1-28. DOI:10.1080/09524622.2011.9753630
[R13] Kuczaj, S. (2013). Why we will never be able to speak with dolphins. Public presentation at Eckerd
College, St. Petersburg
[R14] Au, W.W.L., Penner, R.H., and Turl, C.W. (1987). Propagation of beluga echolocation signals.
Journal of the Acoustical Society of America, 83, 807-813.
[R15] Lemerande, T.J. (2002). Transmitting beam patterns of the Atlantic bottlenose dolphin (Tursiops
truncatus): Investigations in the existence & use of higher frequency components found in echolocation
signals. Masters Thesis. Naval Postgraduate School, Monterey, CA. 148 p.
[R16] Toland, R.W. (1998). High frequency components in bottlenose dolphin echolocation signals. Masters
Thesis. Naval Postgraduate School, Monterey, CA. 83 p.
Thierry Legou
Laboratoire Parole et Langage
and Brain and Language Research Institute
Aix-Marseille University, CNRS
Marseille, France
Abstract
This paper addresses automatic classification of baboon vocalizations. We considered six classes of sounds emitted by Papio papio baboons, and report the results
of supervised classification carried out with different signal representations (audio
features), classifiers, combinations and settings. Results show that up to 94.1% of
correct recognition of pre-segmented elementary segments of vocalizations can be
obtained using Mel-Frequency Cepstral Coefficients representation and Support
Vector Machines classifiers. Results for other configurations are also presented
and discussed, and a possible extension to the Sound-spotting problem, i.e. online joint detection and classification of a vocalization from a continuous audio
stream is illustrated and discussed.
Introduction
M. Janvier is funded by the Direction Générale de l'Armement (DGA), part of the French Ministry
of Defence.
In this paper we consider six categories of baboon vocalizations. We report the results obtained with
the use of different audio signal representations and supervised classification methods to characterize and recognize these vocalizations. To this end, we tested different spectral features computed
based on the usual short-term sliding window approach, e.g., Mel Frequency Cepstral Coefficients
(MFCC). We propose to introduce a sparse subset of coefficients characterizing the harmonicity of
the vocalizations, since, as opposed to (human) speech, the range of the fundamental frequency is
quite different across the baboon sound categories. As for the classifiers, we used hidden Markov
models (HMMs) [4] to model the dynamic evolution of the spectral patterns within each sound category. We also tested k-Nearest Neighbors (KNN) classifiers, Gaussian Mixture Models (GMM)
and Support Vector Machines (SVM) [5], [6] with different configurations and appropriate preprocessing of the data (especially for time alignment of feature vector sequences). Note that most
of the presented experiments concern isolated sounds that were manually pre-segmented, but we
also discuss and illustrate the feasibility of the extension of our system(s) to the Sound-spotting
problem, i.e. online joint automatic detection (i.e. segmentation) and classification of vocalizations
from a continuous audio stream.
The paper is organized as follows: Section 2 describes the data that were used for this study; Sections 3 and 4 present respectively the different features and classifiers that were used; Experimental
results are presented in Section 5 and conclusions are drawn in Section 6.
Data
We recorded the vocal behavior of Papio papio Guinea baboons housed at the Rousset-sur-Arc
CNRS primate center, France. The vocalizations of sixteen baboons (13 females, 3 males; aged
between 2 and 27 years at the start of recording) were considered for this study. Fourteen of the
baboons were housed as part of a larger group in a 25 × 30 m outdoor enclosure connected by wire
tunnels to indoor housing (6 × 4 m) used at night. The other baboons were housed separately in
4.7 × 6.4 m outdoor enclosures connected to indoor housing (2 × 4 m). All groups had visual
and auditory contact with each other. The monkeys could be identified by their individual physical
characteristics and by number tags on a chain around their neck. Feeding (fruits, vegetables and monkey chow) occurred once daily at 5 PM; water was provided ad libitum. See [7] for a more
detailed description of the research facilities at the Rousset-sur-Arc CNRS primate center. We used
opportunistic sampling techniques to record spontaneous vocalizations produced in response to social events and to stimuli occurring naturally within the baboons' environment. The presence of
the recorders and their equipment did not disturb the baboons' natural daily activities.
Recording took place between 8:00 and 21:00 (except 17:00-18:00 due to the baboons being fed at
this time) between September 2012 and June 2013. Recording was conducted at a distance from
the baboons of < 2 m to 20 m, with the greater distances suitable only for the long-distance vocalizations. A digital Zoom Handy Recorder H4n (Zoom, Japan: 44.1 kHz sampling frequency, 16-bit
resolution, mono) with a Me66 Sennheiser directional microphone (Sennheiser Electronic KG, Germany; with windscreen) was used to record the vocalizations. This is a supercardioid microphone
with a high sensitivity (50 mV/Pa ± 2.5 dB) and a wide (40 Hz-20000 Hz) and flat (± 2.5 dB) frequency response. As the vocalizations were recorded outdoors, environmental sounds at different
noise levels may have interfered with the sounds at the focus of the recordings.
From continuous audio streams, individual homogeneous sequences of vocalizations (i.e. a series
of sounds of the same class) were first manually extracted by an expert for analysis. Those sequences
were further manually segmented into elementary sounds that were labelled to be submitted to our
classifiers. Six vocalization types were considered in the present study: barks, grunts, copulation
grunts (denoted Cops throughout the rest of the paper for concision), screams, wahoos, and yaks.
In total, the number of sounds per class was: 110 barks, 130 copulation grunts, 384 grunts,
119 screams, 64 wahoos, and 336 yaks. Original sequences were used to illustrate the feasibility of
the Sound-spotting task (see Sections 4.4 and 5.4).
Features
This Section presents the audio features used in this study. Although we consider here vocalization
elements, i.e. elementary sounds that can be part of a series of longer vocalizations, and that have
been previously segmented, those elementary sounds can be of variable length. Moreover, they can
be more or less stationary (and in general, they are rather non stationary). Therefore, from these
elementary sounds, we first extracted time sequences of feature vectors computed using a short-term sliding window (for instance, a 30 ms Hamming window with 50% overlap). This approach
is familiar in speech processing, as well as in audio processing in general (e.g. for the analysis of
domestic or environmental sounds), and we draw on those fields. Also, the features that we use
have been extensively presented in the related literature [8, 9], and, thus, we present them only briefly.
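The short-term sliding window analysis can be sketched as follows. This is a minimal pure-Python illustration, assuming the 30 ms Hamming window with 50% overlap given as an example above; the helper names `hamming` and `frame_signal` are ours, and the study itself used the YAAFE toolbox rather than this code.

```python
import math


def hamming(n):
    """Hamming window of length n (n >= 2)."""
    return [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]


def frame_signal(samples, fs, win_s=0.030, overlap=0.5):
    """Cut a sound into overlapping, Hamming-weighted frames."""
    win_len = int(round(win_s * fs))
    hop = max(1, int(round(win_len * (1.0 - overlap))))
    window = hamming(win_len)
    frames = []
    # Only complete windows are kept; a trailing partial window is dropped.
    for start in range(0, len(samples) - win_len + 1, hop):
        chunk = samples[start:start + win_len]
        frames.append([s * w for s, w in zip(chunk, window)])
    return frames


# At 44.1 kHz a 30 ms window spans 1323 samples.
print(len(frame_signal([0.0] * 44100, 44100.0)[0]))
```

Because sounds have different durations, the number of frames varies from one sound to the next, which is why the feature sequences need the time normalization discussed later in this section.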
Mel-Frequency Cepstral Coefficients: MFCCs [10] are cepstral coefficients that represent the
envelope of the short-term spectrum on a perceptive mel-frequency scale. Those coefficients are
computed as the discrete cosine transform (DCT) of the logarithm of FFT power coefficients passed
through a mel-filter bank (e.g. 40 log-spaced bands in the range 300Hz-10kHz; the bandwidth and
number of bands can vary; see Section 5). The first coefficient was omitted since it represents the
absolute energy of the signal frame and not the spectral shape, and the 1st and 2nd derivatives are
added optionally (depending on the experiment).
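The MFCC computation chain (power spectrum, mel filter bank, logarithm, DCT, first coefficient dropped) can be sketched for a single frame as below. This is an illustration only, not the YAAFE implementation used in the study: the mel formula 2595 log10(1 + f/700), the triangular filter shape, the unnormalised DCT-II and the naive DFT (chosen to keep the example dependency-free) are all assumptions on our part.

```python
import cmath
import math


def hz_to_mel(f_hz):
    return 2595.0 * math.log10(1.0 + f_hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def power_spectrum(frame):
    """Hamming-windowed power spectrum, bins 0..n/2 (naive DFT for clarity)."""
    n = len(frame)
    win = [0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)) for i in range(n)]
    x = [s * w for s, w in zip(frame, win)]
    return [abs(sum(x[i] * cmath.exp(-2j * math.pi * k * i / n) for i in range(n))) ** 2
            for k in range(n // 2 + 1)]


def mel_band_energies(power, fs, n_fft, n_bands, f_lo, f_hi):
    """Energies of triangular filters equally spaced on the mel scale."""
    lo_mel, hi_mel = hz_to_mel(f_lo), hz_to_mel(f_hi)
    edges = [mel_to_hz(lo_mel + i * (hi_mel - lo_mel) / (n_bands + 1))
             for i in range(n_bands + 2)]
    energies = []
    for b in range(1, n_bands + 1):
        left, centre, right = edges[b - 1], edges[b], edges[b + 1]
        total = 0.0
        for k, p in enumerate(power):
            f = k * fs / n_fft
            if left < f <= centre:
                total += p * (f - left) / (centre - left)
            elif centre < f < right:
                total += p * (right - f) / (right - centre)
        energies.append(total)
    return energies


def mfcc(frame, fs, n_bands=40, n_coeffs=20, f_lo=300.0, f_hi=10000.0):
    """MFCCs of one frame, with coefficient 0 (overall energy) removed."""
    power = power_spectrum(frame)
    energies = mel_band_energies(power, fs, len(frame), n_bands, f_lo, f_hi)
    log_e = [math.log(e + 1e-12) for e in energies]  # small floor avoids log(0)
    cepstrum = [sum(v * math.cos(math.pi * k * (m + 0.5) / n_bands)
                    for m, v in enumerate(log_e))
                for k in range(n_coeffs)]
    return cepstrum[1:]
```

Dropping coefficient 0 makes the remaining coefficients insensitive to the overall gain of the frame, which is the reason given above for omitting it.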
Average Spectral Features: We tested a series of features that represent average properties of the
Short-Term Fourier Transform (STFT) spectrum. The Spectral Roll-off is the cut-off frequency
below which 99% of the spectral energy is contained. The Spectral Moments characterize the overall
shape of the spectrum using n-th order moments of the frequency bins weighted by spectral magnitude. We
tested the first four moments. The Spectral Slope / Decrease represents the overall rate of decrease
of the spectral amplitude. The Spectral Flatness of the magnitude spectrum is given by the ratio
of its geometric mean to its arithmetic mean. Finally, the Spectral Flux / Correlation measures the
average variation between two consecutive spectra.
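A few of these average spectral features can be sketched directly from a magnitude spectrum. The definitions below follow common conventions (flatness as geometric over arithmetic mean, flux as a Euclidean distance) and are assumptions for illustration; the exact formulas used by YAAFE in the study may differ in normalization.

```python
import math


def spectral_rolloff(mag, freqs, fraction=0.99):
    """Lowest frequency below which `fraction` of the spectral energy lies."""
    energy = [m * m for m in mag]
    target = fraction * sum(energy)
    running = 0.0
    for f, e in zip(freqs, energy):
        running += e
        if running >= target:
            return f
    return freqs[-1]


def centroid_and_spread(mag, freqs):
    """First two spectral moments: magnitude-weighted mean and std of frequency."""
    total = sum(mag)
    centroid = sum(f * m for f, m in zip(freqs, mag)) / total
    variance = sum((f - centroid) ** 2 * m for f, m in zip(freqs, mag)) / total
    return centroid, math.sqrt(variance)


def spectral_flatness(mag):
    """Geometric mean over arithmetic mean (close to 1 for a flat spectrum)."""
    n = len(mag)
    geometric = math.exp(sum(math.log(m + 1e-12) for m in mag) / n)
    return geometric / (sum(mag) / n)


def spectral_flux(prev_mag, mag):
    """Euclidean distance between two consecutive magnitude spectra."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(prev_mag, mag)))


flat = [1.0, 1.0, 1.0, 1.0]
bins_hz = [0.0, 100.0, 200.0, 300.0]
print(spectral_rolloff(flat, bins_hz), centroid_and_spread(flat, bins_hz)[0])
```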
F0 and Harmonicity Index: The above-mentioned MFCCs (resp. the ASF) are coefficients that
characterize the spectral envelope (resp. the global shape of the spectrum) on a perceptive (resp.
linear) frequency scale. MFCCs are widely used in Automatic Speech Recognition (ASR) systems
[10] since the spectral envelope characterizes the different speech sounds through the effect of the
speaker's vocal tract, while discarding the dependence of the speech sound on the fundamental frequency
F0. This is a desirable property for ASR, in order to limit speech variability across speakers and
utterances. In contrast, in the present context of baboon vocalizations, we think that the F0 range can
be a discriminative feature since it varies considerably between some of the considered classes. Therefore,
we propose to test the F0 value (also extracted on a short-term basis) as an audio feature. We also
tested the harmonicity index, which is the ratio between the second maximum of the signal (short-term) autocorrelation function (which is also used to detect F0) and the maximum, which is obtained
at lag zero. The harmonicity index provides a simple confidence measure of the F0 value.
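The F0 and harmonicity index estimates can be sketched from the short-term autocorrelation as follows. This is not the authors' Matlab implementation: the F0 search range (50 Hz to 2 kHz) and the choice of the highest local maximum inside that range as the "second maximum" are assumptions made for the example.

```python
import math


def autocorrelation(frame, max_lag):
    """Unnormalised short-term autocorrelation for lags 0..max_lag."""
    n = len(frame)
    return [sum(frame[i] * frame[i + lag] for i in range(n - lag))
            for lag in range(max_lag + 1)]


def f0_and_harmonicity(frame, fs, f0_min=50.0, f0_max=2000.0):
    """Return (F0 in Hz, harmonicity index), or (None, 0.0) if no peak is found."""
    if len(frame) < 3:
        return None, 0.0
    min_lag = max(1, int(fs / f0_max))
    max_lag = min(len(frame) - 2, int(fs / f0_min))
    r = autocorrelation(frame, max_lag + 1)
    if r[0] <= 0.0:          # silent frame
        return None, 0.0
    # Candidate pitch lags are local maxima of the autocorrelation.
    peaks = [lag for lag in range(min_lag, max_lag + 1)
             if r[lag] > r[lag - 1] and r[lag] >= r[lag + 1]]
    if not peaks:
        return None, 0.0
    best = max(peaks, key=lambda lag: r[lag])
    # Harmonicity index: secondary maximum divided by the lag-zero maximum.
    return fs / best, r[best] / r[0]


tone = [math.sin(2.0 * math.pi * 100.0 * i / 8000.0) for i in range(800)]
print(f0_and_harmonicity(tone, 8000.0))
```

For a clean periodic signal the index approaches 1 as the frame gets longer relative to the pitch period; values near 0 indicate that the F0 estimate is unreliable.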
Feature post-processing: The successive feature vectors of a sound can be further processed to
produce different final features, which will feed the classifiers. In particular, the feature vector
sequences are generally of different lengths, whereas some of the tested classifiers (KNN, GMMs
and SVMs; see Section 4) are designed to process fixed-size vectors (or fixed-size sequences of
vectors reorganized as vectors). Therefore, it is necessary to address the problem of time normalization.
In the present study, we consider two simple forms of time normalization. The first one consists of
averaging the vectors in the time dimension over the entire acoustic event. Therefore, the feature
vector sequence is replaced with a single mean feature vector (the standard deviation can also be
used). The second form consists of interpolating the feature vector sequence to the class average
duration, using basic (e.g. spline) interpolation/resampling techniques. Note that the GMMs-T and
HMM classifiers are fed directly with the original feature vector sequence and do not need time
normalization (HMMs are specifically designed to model dynamic sound representations). Note
finally that the final representation may consist of the (row-wise) concatenation of different features.
This is a particular case of information fusion for classification (see Section 4.3).
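The two forms of time normalization can be sketched as below. Note the swap: the text mentions spline interpolation, whereas this example uses plain linear interpolation to stay short; the function names are ours.

```python
def mean_vector(sequence):
    """Replace a (T x D) feature sequence by its D-dimensional time average."""
    t = len(sequence)
    return [sum(column) / t for column in zip(*sequence)]


def resample_sequence(sequence, target_len):
    """Linearly interpolate a (T x D) sequence to target_len frames."""
    t = len(sequence)
    if t == 1 or target_len == 1:
        return [list(sequence[0]) for _ in range(target_len)]
    out = []
    for j in range(target_len):
        pos = j * (t - 1) / (target_len - 1)   # fractional source index
        lo = int(pos)
        hi = min(lo + 1, t - 1)
        frac = pos - lo
        out.append([(1.0 - frac) * a + frac * b
                    for a, b in zip(sequence[lo], sequence[hi])])
    return out


seq = [[0.0, 10.0], [2.0, 20.0], [4.0, 30.0]]
print(mean_vector(seq))
fixed = resample_sequence(seq, 5)
# Row-wise concatenation turns the fixed-length sequence into a single vector.
print([v for frame in fixed for v in frame])
```

Averaging discards the temporal order of the frames, while resampling preserves it at the cost of a larger input dimension (target length times feature dimension).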
Implementation: The MFCC and ASF features were computed with the Python/C++ toolbox
YAAFE [9]. The F0 and harmonicity index analysis was conducted using our own Matlab
implementation.
4 Classifiers
4.1 Definition
fixed or varying with the sound, depending on the feature used. Given a feature vector (or sequence
of feature vectors) $x \in \mathcal{X}$, $g(x; c)$ is the score of classifying $x$ as $c$. A new unlabeled observation
$x \in \mathcal{X}$ is classified as: $\hat{c}(x) = \arg\max_{c \in \mathcal{C}} g(x; c)$. $X$ will denote the training set, i.e. a set of
feature vectors $X = \{x_n\}_{n=1}^{N}$ whose class is known, used to train the classifiers.
4.2 Four Classifiers
In this section, we present the four types of classifiers that were used in the present study. As these
classifiers are commonly used in speech and audio processing and in the Signal Processing / Machine
Learning communities, we present them very briefly, with links to the related literature.
k-nearest neighbors (KNN): The KNN classifier first finds the subset $S_k(x) \subset X$ containing the $k$
closest points to a given vector $x$. $g_{\mathrm{KNN}}(x; c)$ is then the number of feature vectors among $S_k(x)$
that belong to the class $c$.
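The KNN score and the arg-max decision rule can be written in a few lines. This sketch assumes a Euclidean distance, which the text does not specify, and uses toy two-dimensional vectors in place of real features.

```python
import math
from collections import Counter


def knn_scores(x, train_vectors, train_labels, k=5):
    """g_KNN(x; c): how many of the k nearest training vectors carry label c."""
    order = sorted(range(len(train_vectors)),
                   key=lambda n: math.dist(x, train_vectors[n]))
    return Counter(train_labels[n] for n in order[:k])


def classify(x, train_vectors, train_labels, k=5):
    """Pick the class with the highest score (ties broken arbitrarily)."""
    scores = knn_scores(x, train_vectors, train_labels, k)
    return max(scores, key=scores.get)


train = [[0.0, 0.0], [0.0, 1.0], [1.0, 0.0], [5.0, 5.0], [5.0, 6.0]]
labels = ["grunt", "grunt", "grunt", "bark", "bark"]
print(classify([0.2, 0.2], train, labels, k=3))
```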
Support Vector Machines (SVMs): SVMs are a discriminative binary classification method (see
[5] for a detailed description), which has already been used in sound recognition, e.g. [6, 11]. SVMs
provide a discriminative function $h(x)$, learnt from a set of positive examples and a set of negative
examples. The points satisfying $h(x) = 0$ form a hyperplane in the space induced by a chosen
kernel function $k(\cdot, \cdot)$. $h(x) > 0$ means that $x$ should be classified as positive and $h(x) < 0$ as
negative. The multi-class task uses a one-versus-rest strategy. Also, we tested four different kernels
(linear, radial basis, polynomial and sigmoid).
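For reference, the four kernel families can be written out explicitly, together with the one-versus-rest decision. The parameterization (gamma, coef0, degree) follows the form commonly found in SVM packages and is an assumption here; the decision functions in the usage example are hypothetical stand-ins for trained per-class SVMs.

```python
import math


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def linear_kernel(u, v):
    return dot(u, v)


def rbf_kernel(u, v, gamma=1.0):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def polynomial_kernel(u, v, gamma=1.0, coef0=0.0, degree=3):
    return (gamma * dot(u, v) + coef0) ** degree


def sigmoid_kernel(u, v, gamma=1.0, coef0=0.0):
    return math.tanh(gamma * dot(u, v) + coef0)


def one_versus_rest(x, decision_functions):
    """decision_functions maps each class c to its binary function h_c.

    The predicted class is the one whose 'c versus the rest' function
    returns the largest value for x.
    """
    return max(decision_functions, key=lambda c: decision_functions[c](x))


# Hypothetical linear decision functions for two classes.
h = {"bark": lambda x: dot([1.0, -1.0], x), "yak": lambda x: dot([-1.0, 1.0], x)}
print(one_versus_rest([2.0, 0.5], h))
```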
Gaussian Mixture Models (GMM): A GMM is a probabilistic generative model widely used
in classification tasks [5]. Here, we use one GMM per sound class, which is a weighted sum
of $M$ Gaussian components. The parameter set $\theta_c$ is composed of $M$ weights, mean vectors
and covariance matrices. We thus train $C$ sets of parameters using the well-known Expectation-Maximization (EM) algorithm. The mapping $g$ corresponds to the likelihood of the observed data
given the model parameters. GMMs can be applied directly on the mean feature vector (in such case,
we simply denote this configuration with GMMs). Alternatively, for a sequence of feature vectors
$x = [x^1, \dots, x^T]$, which are assumed to be independent, we calculate:
$g_{\mathrm{GMM}}(x; c) = p(x \mid \theta_c) = \prod_{t=1}^{T} p(x^t \mid \theta_c)$.
We denote this configuration by GMMs-T.
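Under the independence assumption, the sequence score is a product of per-frame likelihoods, which is computed in practice as a sum of log-likelihoods. The sketch below assumes diagonal covariance matrices to stay short (the text refers to general covariance matrices) and takes the model parameters as given rather than estimating them with EM.

```python
import math


def log_gaussian_diag(x, mean, var):
    """Log density of a Gaussian with diagonal covariance."""
    return -0.5 * sum(math.log(2.0 * math.pi * v) + (a - m) ** 2 / v
                      for a, m, v in zip(x, mean, var))


def log_gmm(x, weights, means, variances):
    """Log-likelihood of one frame under a GMM (log-sum-exp for stability)."""
    terms = [math.log(w) + log_gaussian_diag(x, m, v)
             for w, m, v in zip(weights, means, variances)]
    top = max(terms)
    return top + math.log(sum(math.exp(t - top) for t in terms))


def log_gmm_sequence(frames, weights, means, variances):
    """GMMs-T score: sum of frame log-likelihoods (log of the product over t)."""
    return sum(log_gmm(x, weights, means, variances) for x in frames)


frames = [[0.1], [-0.2], [0.05]]
print(log_gmm_sequence(frames, [0.5, 0.5], [[0.0], [0.3]], [[1.0], [1.0]]))
```

Classification then evaluates this score with the parameters of each class and keeps the class with the highest value.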
Hidden Markov Models (HMM): HMMs also belong to the family of generative models [5, 10].
In an HMM, the observations depend on a hidden discrete random variable called state, taking S
values. The state sequence is assumed to be a first-order left-to-right Markovian process and the
emission probability is a GMM. Thus, the model consists of the parameters of the GMMs and the
parameters modeling the Markovian dynamics. All are learnt using the EM algorithm. The function
$g$ is also the likelihood of the observations given the model: $g_{\mathrm{HMM}}(x; c) = p(x \mid \theta_c)$.
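The HMM likelihood is usually evaluated with the forward algorithm. The following log-domain sketch takes the per-state emission log-probabilities as input (in the study these would come from the state GMMs) and is an illustration, not the PMTK3 code; a left-to-right topology is expressed by setting forbidden transitions to `-math.inf`.

```python
import math


def logsumexp(values):
    top = max(values)
    if top == -math.inf:
        return -math.inf
    return top + math.log(sum(math.exp(v - top) for v in values))


def hmm_log_likelihood(log_emissions, log_initial, log_transitions):
    """Forward algorithm in the log domain.

    log_emissions[t][s]   : log p(x^t | state s)
    log_initial[s]        : log probability of starting in state s
    log_transitions[r][s] : log probability of moving from state r to state s
    """
    n_states = len(log_initial)
    alpha = [log_initial[s] + log_emissions[0][s] for s in range(n_states)]
    for frame in log_emissions[1:]:
        alpha = [logsumexp([alpha[r] + log_transitions[r][s]
                            for r in range(n_states)]) + frame[s]
                 for s in range(n_states)]
    return logsumexp(alpha)


# Two-state left-to-right model: state 1 can never return to state 0.
initial = [0.0, -math.inf]
transitions = [[math.log(0.5), math.log(0.5)], [-math.inf, 0.0]]
emissions = [[math.log(0.5), math.log(0.25)]] * 2
print(math.exp(hmm_log_likelihood(emissions, initial, transitions)))
```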
Implementations: We used the standard Matlab KNN and GMMs algorithms. The HMMs are from
the PMTK3 library [12]. The SVMs are implemented using libSVM [13].
4.3 Information Fusion
In Section 3, we have seen that several kinds of features can be extracted from the baboon vocalizations to describe their spectro-temporal characteristics in order to be used in a supervised
classification scheme. This naturally raises the question of combining those features into a multimodal/multichannel classifier that would exploit all the information efficiently, a
problem sometimes referred to as sensor fusion. This makes particular sense in the present study,
since we postulated in Section 3 that, as opposed to ASR, the F0 information is expected to provide
significant information about sound class. It is therefore necessary to test whether this information can be
used in a complementary way to the spectral envelope (for instance MFCCs) information.
The usual, and simplest, approach, known as early integration, consists in the (row-wise) concatenation of the different features (or feature vectors) into a single vector (whose dimension is equal
to the sum of the dimensions of the original feature vectors), possibly integrating some cross-modal
normalization processes. This new representation can then be used directly with the different classifiers presented above. In contrast, late integration performs the fusion of the features at the decision
level of separate classifiers [14]. Thus, a different classifier (of same or different type) can be used
on each feature vector and then the outputs (crisp decision, confidence score, log-likelihood values etc.) of these classifiers are merged using a higher level process. Finally, we can consider an
intermediary common space for fusion which is neither the input space nor the output space, leading to a type of mid-level integration. In particular, in the field of kernel-based classifiers (such as
SVMs), a new state-of-the-art fusion strategy has emerged called Multiple Kernel Learning [15]. In
this approach, the fusion is made inside the classifier: the kernel of the classifier is computed as
a combination of multiple kernels, for instance, one kernel for each feature. One advantage is the
ability to choose one type of kernel and its parameters according to the features. In Section 5.3, we
will test this strategy for the integration of MFCCs and F0 features in the present task of baboon
vocalization classification.
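The idea of computing the classifier kernel as a combination of per-feature kernels can be sketched as follows. In Multiple Kernel Learning proper, the combination weights are learnt together with the classifier; here they are fixed, hypothetical values, as are the feature names and the gamma settings.

```python
import math


def rbf(u, v, gamma):
    return math.exp(-gamma * sum((a - b) ** 2 for a, b in zip(u, v)))


def combined_kernel(sample_a, sample_b, weights, gammas):
    """Weighted sum of one RBF kernel per feature type.

    Each sample is a dict mapping a feature name to its vector, so every
    feature keeps its own kernel and its own kernel parameter.
    """
    return sum(weights[name] * rbf(sample_a[name], sample_b[name], gammas[name])
               for name in weights)


weights = {"mfcc": 0.7, "f0": 0.3}      # hypothetical, not learnt
gammas = {"mfcc": 0.1, "f0": 1.0}       # hypothetical kernel widths
a = {"mfcc": [0.0, 0.0], "f0": [100.0]}
b = {"mfcc": [0.0, 0.0], "f0": [150.0]}
print(combined_kernel(a, b, weights, gammas))
```

A non-negative weighted sum of valid kernels is itself a valid kernel, so the combined function can be passed to any kernel-based classifier.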
4.4
The above techniques are all applied on elementary sounds manually extracted from vocalization
sequences. In practice, it is desirable to have a system that is able to automatically perform both
detection (i.e. segmentation of a series of vocalizations into elementary sounds) and classification
of the detected elementary sounds from the continuous audio stream. This task can be referred
to as Sound-spotting, in reference to the Word-spotting task in ASR, which is the detection
of keywords in continuous speech signals. A naive but efficient strategy consists of applying any
of the previous classifiers (that have been tuned on a training corpus of elementary sounds) on a
sliding window and declaring a detection if some criterion (e.g. a likelihood function), provided
by the classifier, exceeds a given threshold. Temporal integration is necessary to make this joint
detection/classification robust, and this can be done at the criterion level (e.g. by averaging framewise likelihoods) or at the feature level (e.g. by varying the sliding window length) (see footnote 1). In the present
paper, we did not conduct a deeper investigation of the Sound-spotting problem, but in Section 5.3,
we present some elements which illustrate the feasibility of this task using the proposed classifiers.
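The sliding-window strategy with temporal integration at the criterion level can be sketched as below for a single sound class. The per-frame scores stand for any classifier criterion (for example a log-likelihood ratio); the window length and threshold are hypothetical settings that would need tuning on real data.

```python
def spot_sounds(frame_scores, window=5, threshold=0.0):
    """Return (first_frame, last_frame) spans where the moving average of the
    per-frame classifier score exceeds the threshold."""
    detections = []
    start = None
    for t in range(len(frame_scores) - window + 1):
        average = sum(frame_scores[t:t + window]) / window
        if average > threshold and start is None:
            start = t                       # a detection begins
        elif average <= threshold and start is not None:
            # The last window above threshold started at t - 1.
            detections.append((start, t + window - 2))
            start = None
    if start is not None:
        detections.append((start, len(frame_scores) - 1))
    return detections


scores = [-1.0] * 5 + [2.0] * 6 + [-1.0] * 5
print(spot_sounds(scores, window=3))
```

Averaging over the window suppresses isolated high-scoring frames, at the price of slightly blurred segment boundaries: in the example the sound occupies frames 5 to 10 but is reported as frames 4 to 11.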
5 Experiments
5.1 Setup
Given the database described in Section 2, different combinations of features, post-processing and
classifiers have been tested. We performed 5-fold cross-validation tests, and used the accuracy score as a
metric of the performance in order to be able to statistically compare the different configurations. For
each experiment reported in the next section, only the best configuration of parameters (using grid
search and cross-validation) has been retained due to the large number of parameters involved. For
the features, MFCCs reached their best results using 20 coefficients (with the first one omitted), with
the first- and second-order derivatives, on a 10 Hz-10000 Hz bandwidth. As for the classifiers,
SVMs have shown the best results using linear kernels and radial basis kernels with a regularization
parameter equal to 0.1 and the one-versus-rest strategy. HMMs have been tested with 3 to 8 states
and 5 to 10 components per state. Best results with GMM-based methods needed between 5 and 10
components in the mixture.
5.2
We first present the results obtained separately with the different feature sets, i.e. either MFCCs
or ASF or F0 +harmonicity index. The accuracy scores are given in Table 1 for a selected set of
configurations, and confusion matrices are given in Table 2 and Table 3 for a subset of those configurations.
The best performance is obtained with SVMs (with a radial basis kernel) applied on averaged
MFCC coefficients, with an accuracy score of 94.1%. This is a very good result, even for the
limited number of classes of the present problem, since there is no a priori reason to think that a
vast majority of the elementary sounds of the six classes can be clearly discriminated: this is
actually a major outcome of the present study. The confusion matrix for this configuration (Table 2b)
is well balanced, with no major class confusion. The best per-class result is obtained for Barks (97.3%
accuracy) and the worst for Cops, with 83.8% accuracy and 13.8% of
confusion with Grunts. It is important to note that SVMs are here applied on an averaged MFCC
1
This is reminiscent of the early vs late integration problem discussed in Section 4.3, but considering
here temporal fusion and not feature fusion.
vector (to represent the whole sound). Hence, the time structure of the spectral vector sequence
does not seem to be very important, at least not as important as in speech (even if we compare
with a short word recognition task). This is confirmed by the score of the SVMs applied on time-interpolated MFCC vectors, which at 90.5% is a bit lower than with averaged MFCC vectors. And
this is more strongly confirmed by the scores obtained with the HMMs applied on the original
MFCC vector sequences (see Table 2a): the accuracy score is here only 80.8%, which is quite
disappointing. The confusion matrix exhibits notable confusions from Cops to Barks and to Grunts,
and from Grunts to Cops (but not from Barks to Cops), and also from Yaks to Screams, which is
surprising. This not only suggests that the vector sequence carries relatively little additional information
compared to the vector mean for the task at hand, but it also suggests that HMMs
are not an appropriate tool for modeling this type of sound. The latter makes sense since
it is not clear so far whether there exists a phonological structure in baboon vocalizations that could
be efficiently exploited by the state-space modeling of HMMs2. Finally, GMMs (92.7% accuracy;
Table 2d) and KNN (92.4% accuracy; Table 2c), both applied on averaged vectors, are a bit below
SVMs, confirming that most of the discriminative information is contained in the average vector,
and that good recognition scores can be obtained with relatively basic classifiers. KNN applied
on interpolated MFCC vectors is at 93.1% accuracy3, and we did not test GMMs on interpolated
MFCCs to avoid the curse of dimensionality, which is typical for this model.
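The per-class figures quoted above for the SVMs on averaged MFCCs can be recomputed from the confusion matrix. The sketch below uses the counts of Table 2b, read with rows as actual classes and columns as predicted classes; this orientation is an assumption, chosen because it reproduces the quoted percentages. The function names are illustrative only.

```python
CLASSES = ["barks", "cops", "grunts", "screams", "wahoos", "yaks"]

# Counts of Table 2b (SVMs, averaged MFCCs).
# Assumed orientation: rows = actual class, columns = predicted class.
SVM_MFCC = [
    [107,   0,   1,   0,  1,   1],   # barks
    [  0, 109,  18,   1,  0,   2],   # cops
    [  1,   6, 369,   0,  1,   7],   # grunts
    [  0,   0,   0, 114,  1,   4],   # screams
    [  6,   0,   0,   0, 58,   0],   # wahoos
    [  0,   5,  11,   1,  0, 319],   # yaks
]


def overall_accuracy(matrix):
    """Fraction of all sounds that fall on the diagonal."""
    correct = sum(matrix[i][i] for i in range(len(matrix)))
    total = sum(sum(row) for row in matrix)
    return correct / total


def rate(matrix, actual, predicted):
    """Fraction of the sounds of class `actual` labeled as `predicted`."""
    i, j = CLASSES.index(actual), CLASSES.index(predicted)
    return matrix[i][j] / sum(matrix[i])


accuracy_pct = round(100 * overall_accuracy(SVM_MFCC), 1)
barks_pct = round(100 * rate(SVM_MFCC, "barks", "barks"), 1)
cops_pct = round(100 * rate(SVM_MFCC, "cops", "cops"), 1)
cops_as_grunts_pct = round(100 * rate(SVM_MFCC, "cops", "grunts"), 1)
```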
The scores obtained with ASF features are very disappointing. Many different combinations of ASF
features were tested (with the different classifiers), and the best accuracy score is 73.2%, obtained
with SVMs on average ASF vectors (hence we only report this configuration in Table 1). Moreover,
when using the concatenation of MFCC and ASF features (i.e. basic early fusion at the feature level,
see Section 4.3), the scores do not improve significantly compared to using only the MFCCs; they
even decrease in some configurations (this is the case for SVMs, see Table 1). Therefore, the ASF
do not complement the MFCC information, which was predictable (they provide information on the
global shape of the spectrum with generally less resolution than MFCCs, provided that the cepstral
model order is sufficiently large). We therefore did not further consider the ASF features.
Generally, the results obtained with F0 alone or F0 concatenated with the harmonicity index are
remarkable, given that this is quite rudimentary information. Here, the best results are obtained with
the SVMs applied on interpolated F0 vectors, which reach 71.0%. GMM-T comes a very close
second with 70.9% accuracy. Both exploit temporal information (from the interpolated or original vector
sequence), but the accuracy score of the SVMs applied on the average F0 vector is also very close,
at 69.6%. However, the confusion matrices for the latter two configurations differ
significantly: the matrix for GMM-T (Table 3a) is more balanced, whereas the matrix for SVMs
(Table 3b) shows that the Grunts and Yaks have better results, while the Wahoos are totally confused
(mainly with Barks and Cops), which is surprising. This can be explained partly by the fact that
Wahoos have some prosody which is reduced by the averaging process. Note that the SVM scores
are biased by the fact that the best classification is obtained for the two largest
classes (Grunts and Yaks), and only 3 classes out of 6 can actually be regarded as correctly
classified. In contrast, the better balanced GMM-T matrix exhibits 5 classes out of 6 being
fairly well classified. GMMs (68.1% accuracy; confusion matrix in Table 3d) and KNN (65.4%
accuracy; confusion matrix in Table 3c), both applied on average F0 features, are a bit below the
other classifiers using F0 as a feature, but not by much. KNN applied on interpolated F0 vectors
is at 69.8% accuracy. Therefore, here also, the different classifiers for fixed-size features in
both average and interpolated configurations are quite close to each other. Altogether, those results
show that basic information about harmonicity (say F0 range + harmonicity confidence) is enough
to provide an honorable classification of 6-class baboon vocalizations. Note that HMMs are, again,
disappointing, with only 45.3% of correct classification.
5.3
As announced in Section 4.3, we report the results obtained with the mid-level integration of MFCC
and F0 features, using fusion of SVMs kernels. As an example, Table 4 shows the results of a Multi2
However, the GMM-T score is also disappointing (78.5% accuracy), possibly pointing to a problem with the
use of the original MFCC sequence; so far we cannot clearly explain this result.
3
Hence, KNN with interpolated MFCCs is a bit better than KNN with averaged MFCCs, whereas SVMs
with interpolated MFCCs is a bit lower than SVMs with averaged MFCCs. Altogether, the scores with KNN,
SVMs and GMMs applied on either averaged or interpolated MFCCs are quite close to each other.
Features        Classifier   Representation   Accuracy
MFCCs           KNN          Averaging        92.4% ± 2.9%
MFCCs           SVMs         Averaging        94.1% ± 1.2%
MFCCs           GMMs         Averaging        92.7% ± 1.8%
MFCCs           KNN          Interpolation    93.1% ± 3.0%
MFCCs           SVMs         Interpolation    90.5% ± 2.9%
MFCCs           GMMs-T       Sequencing       78.5% ± 4.8%
MFCCs           HMMs         Sequencing       80.8% ± 3.9%
ASF             SVMs         Averaging        73.2% ± 2.3%
MFCCs & ASF     SVMs         Averaging        92.4% ± 2.7%
F0              KNN          Averaging        65.4% ± 6.9%
F0              SVMs         Averaging        69.6% ± 2.7%
F0              GMMs         Averaging        68.1% ± 7.4%
F0              KNN          Interpolation    69.8% ± 4.6%
F0              SVMs         Interpolation    71.0% ± 2.3%
F0              GMMs-T       Sequencing       70.9% ± 4.2%
F0              HMMs         Sequencing       45.3% ± 7.3%
Table 1: Accuracy score for different combinations of audio features, post-processing, and classifiers. Sequencing refers to using the original sequence of vectors.

In the matrices below, rows are actual classes and columns are predicted classes.

(a) HMMs
          barks   cops  grunts  screams  wahoos   yaks
barks       102      0       0        1       7      0
cops         13     89      12        7       7      2
grunts        3     42     312       12      11      4
screams       1      0       0      114       0      4
wahoos        9      0       0        0      55      0
yaks          7      9       8       48      12    252

(b) SVMs
          barks   cops  grunts  screams  wahoos   yaks
barks       107      0       1        0       1      1
cops          0    109      18        1       0      2
grunts        1      6     369        0       1      7
screams       0      0       0      114       1      4
wahoos        6      0       0        0      58      0
yaks          0      5      11        1       0    319

(c) KNN
          barks   cops  grunts  screams  wahoos   yaks
barks       106      0       0        0       2      2
cops          3    104      17        1       3      2
grunts        1      9     367        0       0      7
screams       0      0       0      109       0     10
wahoos        6      0       0        0      58      0
yaks          0      1       2       20       1    312

(d) GMMs
          barks   cops  grunts  screams  wahoos   yaks
barks       105      0       0        3       0      2
cops          0    115       8        1       0      6
grunts        0     19     350        2       0     13
screams       0      0       1      112       0      6
wahoos        5      0       0        1      57      1
yaks          1      2       9        3       0    321

Table 2: Confusion matrix for the baboon vocalization recognition systems using average Mel-frequency cepstral coefficients (MFCCs) as features for SVMs, GMMs and KNN, and using original
sequence of MFCCs for HMMs.
In the matrices below, rows are actual classes and columns are predicted classes.

(a) GMM-T
          barks   cops  grunts  screams  wahoos   yaks
barks        85      0       1        0      18      6
cops         17     29      44        1      36      3
grunts        9     26     326        0      17      6
screams       1      0       0       90       1     27
wahoos       14      7       0        0      43      0
yaks         37      7       7       32      16    237

(b) SVMs
          barks   cops  grunts  screams  wahoos   yaks
barks        83     12       1        0       0     14
cops         32     19      72        0       0      7
grunts       13      1     363        0       0      7
screams       2      0       0       67       0     50
wahoos       29     24       8        0       0      3
yaks         43      3       9       18       0    263

(c) KNN
          barks   cops  grunts  screams  wahoos   yaks
barks        65     13       3        0       7     22
cops         30     30      53        0      10      7
grunts        9     36     328        0       3      8
screams       1      0       1       69       0     48
wahoos       28      9       4        0      14      9
yaks         34     11      12       30       7    242

(d) GMMs
          barks   cops  grunts  screams  wahoos   yaks
barks        74      0       1        0      29      6
cops         21     18      50        0      35      6
grunts        6     20     338        0      12      8
screams       2      0       0       94       0     22
wahoos        9      3       3        0      48      1
yaks         45      1       8       56      20    206

Table 3: Confusion matrix for the baboon vocalization recognition systems using average F0 (fundamental frequency) as feature for SVMs, GMMs and KNN, and using original sequence of F0 for
GMM-T.
ple Kernel Learning experiment, in which a linear kernel has been trained on MFCC features, while
another linear kernel has been trained on F0 features, and the combination of those kernels has been
computed and used in a third SVM. It can be seen that this configuration does not outperform the
SVM which uses only MFCCs as features: the accuracy scores are 88.1% ± 2.9% for the former vs
91.2% ± 3.3% for the latter4. None of the other tested configurations of kernels and hyperparameters has shown a significant improvement. One conclusion of this experiment is that, although the
F0 (and harmonicity index) feature taken separately carries significant information which is exploitable
for the automatic recognition of baboon vocalizations, this feature was not shown in our experiments
to be complementary to the MFCC features for this task. On the contrary, the combination of F0
and MFCCs has so far only led to a slight decrease of the scores obtained with MFCCs alone, which
is a bit disappointing. Of course, this is also because the MFCC representation initially led to impressive
scores. Further investigation of the characterization of those features for baboon vocalizations
is necessary to precisely describe the redundancy between them and confirm the apparent absence of
complementarity which has been observed in our experiments.
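To make the kernel-fusion idea concrete, here is a minimal sketch that builds one linear Gram matrix per feature set and mixes them with a fixed weight. The text does not state how the two kernels were combined in the experiments, so the convex weighting, the function names and the toy feature values are all assumptions made for this example.

```python
def linear_kernel(X):
    """Gram matrix of a linear kernel: K[i][j] = <x_i, x_j>."""
    return [[sum(a * b for a, b in zip(xi, xj)) for xj in X] for xi in X]


def combine_kernels(K1, K2, weight):
    """Convex combination weight * K1 + (1 - weight) * K2 (assumed scheme)."""
    return [
        [weight * a + (1.0 - weight) * b for a, b in zip(row1, row2)]
        for row1, row2 in zip(K1, K2)
    ]


# Two toy sounds, each described by a hypothetical 2-dimensional MFCC
# summary and a hypothetical 1-dimensional F0 summary.
mfcc_features = [[1.0, 0.0], [0.0, 2.0]]
f0_features = [[3.0], [1.0]]

K_mfcc = linear_kernel(mfcc_features)
K_f0 = linear_kernel(f0_features)
K_combined = combine_kernels(K_mfcc, K_f0, weight=0.5)
```

The combined Gram matrix would then be given to an SVM implementation that accepts a precomputed kernel. A sum of valid kernels with non-negative weights is itself a valid kernel, which is why such a combination can be used directly.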
5.4 Feasibility of Sound-Spotting
In this subsection, we illustrate the feasibility of the Sound-spotting task described in Section 4.4
by applying the SVMs of Section 4.2 on an example of an original (i.e. unsegmented) sequence. The
SVMs were fed with MFCC vectors on a frame-by-frame basis (i.e. one average vector at a time,
each corresponding to a 200 ms frame of signal, with a 10 ms hop size). For each frame and class c, we
retrieved p(c|x), the posterior probability of the frame being part of a vocalization of class c given
the input MFCC vector x, which is the criterion used by the SVMs for classification [16]. Fig. 1
shows the results of this analysis. The top subfigure shows an excerpt of a vocalization waveform
with the corresponding class boundaries and labels, which were manually annotated. The other
subfigures plot the values of p(c|x) for the Barks, Grunts, Screams and Yaks, respectively (from
top to bottom; probabilities for Cops and Wahoos are not displayed for clarity). It is evident that
the probability contours match the actual classes quite well, i.e. globally, the probability values are
high when the corresponding class is emitted, and low when another class or background noise is
4
This latter score is different (a bit lower) than the SVMs/MFCCs score of Table 1 because a radial basis
kernel was used in the SVMs of Section 5.2.
barks
barks
cops
grunts
screams
wahous
yaks
cops
grunts
screams
60 107 60 13 0 2
6
1
0
0
0
0
26 1
7 13 92 82 81 31 36 0
1
1
0
1
11 1
2 3 7 4 363 369 371 0
1
1
0 1 0 1
0
0
0 94 109 96
23 9
7 20 0 0 17
0
0
0
0
0
32 1
0 4 3 5 10 14 14 18 8 14
wahous
yaks
0 0 2 31
2
2
0 0 1 10
5
3
0 1 0
7
6
6
0 0 0 54
9
22
0 55 55 4
0
2
0 0 4 272 310 299
Table 4: Confusion matrix for one instance of Multiple-Kernel SVMs combining MFCC and F0
features. For each cell, the three numbers from left to right correspond to the result of
classification for: (1) SVMs with a linear kernel on F0, (2) SVMs with a linear kernel on MFCCs,
(3) SVMs with a combination of the two preceding kernels.
emitted. For this example, a very simple detection strategy based on thresholding can be applied:
class c is detected when p(c|x) > 0.5 (the probabilities for the different classes sum up to 1, hence
only one class at a time can be detected). Merging the successive frames associated with the same
class leads to the detected boundaries represented in the top subfigure of Fig. 1, with a background
color corresponding to the probability contours. The detection is fairly good but not perfect: for
example, background noise is confused with Grunts at approx. 6 s, and the boundaries between
Yaks and Screams are not easy to define (nor is it easy for the human listener in this example, and
manual labeling may actually be inaccurate). Moreover, many sequences are not so clear. However,
more refined strategies for time integration of frame-wise information, such as the ones mentioned
in Section 4.4, are expected to fix these problems and be more robust in general. Part of our future
work is to explore such strategies and derive an efficient and robust Sound-spotting algorithm for the
present problem of baboon vocalization recognition.
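The thresholding and frame-merging strategy described above can be sketched as follows. This is only an illustration under stated assumptions: the function name, the three-class setup and the posterior values are hypothetical, and segment boundaries are returned as frame indices (with the 10 ms hop size of this section, multiplying an index by 0.01 gives a time in seconds).

```python
def spot_sounds(posteriors, classes, threshold=0.5):
    """Turn frame-wise posteriors into labeled segments.

    posteriors: one list of class probabilities per frame, in the
                same order as `classes`.
    Returns (label, first_frame, end_frame) tuples, where end_frame
    is exclusive. Frames whose best posterior does not exceed the
    threshold are treated as background and produce no segment.
    """
    segments = []
    current, start = None, 0
    for t, frame in enumerate(posteriors):
        best = max(range(len(classes)), key=lambda c: frame[c])
        label = classes[best] if frame[best] > threshold else None
        if label != current:
            # The label changed: close the running segment, if any.
            if current is not None:
                segments.append((current, start, t))
            current, start = label, t
    if current is not None:
        segments.append((current, start, len(posteriors)))
    return segments


classes = ["barks", "grunts", "yaks"]
posteriors = [
    [0.40, 0.30, 0.30],   # no class above 0.5: background
    [0.80, 0.10, 0.10],   # barks
    [0.70, 0.20, 0.10],   # barks
    [0.20, 0.70, 0.10],   # grunts
    [0.30, 0.60, 0.10],   # grunts
    [0.34, 0.33, 0.33],   # background
]
segments = spot_sounds(posteriors, classes)
```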
Figure 1: Example of automatic joint segmentation and classification using the SVMs of Section 4.2
(see text for details).
Conclusion
In this paper we have addressed the problem of automatic classification of Guinea baboon vocalizations. Six classes of sounds have been considered, and experiments have shown that several
types of classifier (KNN, GMM, SVM) lead to correct classification scores higher than 90% for
pre-segmented elementary vocalizations. The highest scores were obtained with SVMs applied on
average MFCC vectors (94.1% accuracy), and the principal remaining confusions were observed
to be between grunts and copulation grunts. It is not entirely surprising that the classifiers have
difficulty in distinguishing these two vocalizations; of all the sound classes, the call units of these
two are the most similar from both an auditory perception and an acoustic structure standpoint. This
study has also shown that the fundamental frequency F0 (alone or coupled with the harmonicity index)
has a significant discriminative power: several classifiers applied on these features provided approximately 70% correct classification. Indeed, analysis of the baboon vocal repertoire shows that
baboons strongly modulate their F0 between vocalizations, particularly between short- and long-distance vocal categories (Kemp et al., in prep.). However, and quite disappointingly, this information
was not found to be complementary to the spectral envelope information in our study. Finally, although we did not conduct a deep investigation of the Sound-spotting problem in the present study,
the good behavior of classifiers designed on elementary sounds when applied
on continuous audio streams suggests that joint segmentation and recognition should be feasible with a well-grounded time integration process. This time integration can be carried out at the
feature level, at the classifier output level, or at some mid-level within the classifier, echoing the
discussion of Section 4.3 on feature information fusion. Future work will concern this task, which
is essential to design a real-world system. We will also consider increasing the number of classes
and defining confidence measures to help the exploitation of the classification results in primatology
studies.
Acknowledgments: Yannick Becker and the staff of the Rousset-sur-Arc primate center are acknowledged for technical support.
References
[1] K. Hammerschmidt and J. Fischer, Constraints in primate vocal production, The evolution of
communicative creativity: From fixed signals to contextual flexibility, pp. 93-119, 2008.
[2] A. Mielke and K. Zuberbühler, A method for automated individual, species and call type
recognition in free-ranging animals, Animal Behaviour, vol. 86, no. 2, pp. 475-482, 2013.
[3] P. Maciej, J. Fischer, and K. Hammerschmidt, Transmission characteristics of primate vocalizations: implications for acoustic analyses, PLoS ONE, vol. 6, no. 8, p. e23015, 2011.
[4] L. Deng and X. Li, Machine learning paradigms for speech recognition: An overview, IEEE
Trans. Audio, Speech, Language Process., vol. 21, no. 5, pp. 1060-1089, 2013.
[5] C. M. Bishop, Pattern recognition and machine learning. Springer New York, 2006.
[6] G. Guo and S. Z. Li, Content-based audio classification and retrieval by support vector machines, IEEE Trans. on Neural Networks, vol. 14, no. 1, pp. 209-215, 2003.
[7] J. Fagot and E. Bonte, Automated testing of cognitive performance in monkeys: Use of a
battery of computerized test systems by a troop of semi-free-ranging baboons (Papio papio),
Behavior Research Methods, vol. 42, no. 2, pp. 507-516, 2010.
[8] G. Peeters, A large set of audio features for sound description (similarity and classification)
in the CUIDADO project, 2004.
[9] B. Mathieu, S. Essid, T. Fillon, J. Prado, and G. Richard, Yaafe, an easy to use and efficient
audio feature extraction software, in Int. Conf. for Music Information Retrieval (ISMIR), 2010.
[10] L. R. Rabiner, A tutorial on Hidden Markov Models and selected applications in speech recognition, Proc. IEEE, vol. 77, no. 2, pp. 257-286, 1989.
[11] A. Temko and C. Nadeu, Classification of acoustic events using SVM-based clustering
schemes, Pattern Recognition, vol. 39, no. 4, pp. 682-694, 2006.
[12] K. P. Murphy, Machine learning: A probabilistic perspective. MIT Press Boston, 2012.
[13] C.-C. Chang and C.-J. Lin, LIBSVM: A library for support vector machines, ACM Trans. on
Intelligent Systems and Technology, vol. 2, pp. 27:1-27:27, 2011.
[14] L. I. Kuncheva, J. C. Bezdek, and R. P. Duin, Decision templates for multiple classifier fusion:
an experimental comparison, Pattern Recognition, vol. 34, no. 2, pp. 299-314, 2001.
[15] M. Gönen and E. Alpaydın, Multiple kernel learning algorithms, The Journal of Machine
Learning Research, pp. 2211-2268, 2011.
[16] T.-F. Wu, C.-J. Lin, and R. C. Weng, Probability estimates for multi-class classification by
pairwise coupling, The Journal of Machine Learning Research, vol. 5, pp. 975-1005, 2004.
Sarah Rotschafer
Psychology
UC Riverside
[email protected]
Eamonn Keogh
Computer Science
UC Riverside
[email protected]
Abstract
Identifying structure in mice ultrasonic vocalizations (USV) is a useful tool
for investigating the role of genetics in human disorders by modifying
(knocking out) various genes in mice and examining their vocalizations
for changes that may be linked to those genes, and hence the analogue
genes in humans [1] [2]. Thus far, it appears that all annotation and feature
extraction from USV has been done manually. We believe that the lack of
computational tools has been a major bottleneck in USV research. To
address this problem we have previously developed an intuitive software
suite that can analyze acoustic properties of USV and characterize the
relationships between behavioral segments and calls [2]. Here we present a
novel analytical tool that goes beyond quantifying basic acoustic properties
of USVs, by characterizing the relationship between the USV syllables used
during specific components of social behavior.
Introduction
Identifying structure in mice ultrasonic vocalizations (USV) is a useful tool for investigating
the role of genetics in human disorders by modifying (knocking out) various genes in mice
and examining their vocalizations for changes that may be linked to those genes, and hence
the analogue genes in humans [1] [2]. In recent years this framework has emerged as an
extremely promising tool for understanding human cognitive and memory disorders.
Analyzing the vocal behaviors of mouse models in this manner has led to the discovery of the
genetic cause of Autism [3], and has shown great promise for the study of Alzheimer's
disease [4].
The UCR-USV tool has been implemented in MATLAB, making it easy to extend, and
essentially free for academics. The system performs five main functions: Syllable Extraction
and Idealization, Analysis of Basic Acoustic Properties of USV, Syllable Classification
[5] [6] [7] [8], Visual Representation of Call Rates Annotated by Mice Behaviors and
Measuring the Density of Syllables Obtained during Each Behavior Segment.
In this study we hint at the actionability of audio motif discovery by showing that motifs,
once discovered, can be used to test for changes in vocal repertoire that may be attributable
to genes that were deliberately deleted from the mouse genome.
Notations
UCR-USV Tool
The tool is designed to be user-friendly. The tool performs five functions: Syllable
Extraction and Idealization, Basic Analysis, Syllable Classification, Visual Representation of
Call Rates Annotated by Mice Behaviors and Measuring the Density of Syllables Obtained
during Each Behavior Segment.
Syllable Extraction and Idealization: Converts an audio file of USVs to a spectrogram
representation. Discrete syllables from the spectrogram are then extracted and idealized.
Basic Analysis: Records the duration and range of frequencies in each syllable, and
determines the gaps between the syllables which can be utilized to quantify the rate of USV
calls. This step also generates separate files which include the durations, frequencies and
gaps for all syllables for further analysis.
Syllable Classification: Using the GHT (Generalized Hough Transform) distance measure
[5] [6] [7] [8], all syllables are classified into separate folders. A special "other" class is available
for syllables that our system could not confidently classify. These syllables can later be
classified by a human expert, or simply discarded, as they are very rarely false dismissals
but almost always simply noise/artifacts.
Visual Representation of Call Rates Annotated by Mice Behaviors: This step represents call
rates for each syllable in the dictionary and maps mice behaviors to call rates.
Measuring the Density of Syllables during Each Behavior: Normalizes the number of
syllables of each class obtained during a behavior segment by total number of calls and time
spent in the behavior segment and characterizes the relationships between behavioral
segments and calls.
Methodology
We discuss the modules performed by the UCR-USV tool in greater detail below:
2.1 Syllable Extraction
We use the algorithm in Table 1 to extract all the candidate syllables from the spectrogram of
a mouse vocalization. The algorithm is briefly described below and additional details can be
found in [2].
Table 1: Extract candidate syllables
Algorithm 1 ExtractCandidateSyllables(SP)
Require: spectrogram of a mouse vocalization
Ensure: set of candidate syllables
1:  I ← idealized spectrogram
2:  L ← set of connected components in I
3:  R ← row index of connected points
4:  C ← column index of connected points
5:  V ← value of connected points // value ranges from 1 to |L|
6:  [A B] ← sort(V, ascend) // A has values of V sorted and B has the index
7:  S ← [] // set of candidate syllables in SP, initially empty
8:  c1 ← dmin, c2 ← dmax // min and max duration of a syllable
9:  j ← 1, k ← 1
10: for i ← 1 to |L| do {every connected component li in L}
11:     n ← 1
12:     while A(k) = i do
13:         rows(n) ← R(B(k)) // rows contains row indices of li
14:         cols(n) ← C(B(k)) // cols contains column indices of li
15:         n ← n+1
16:         k ← k+1
17:     m ← L(min(rows):max(rows), min(cols):max(cols)) == i // minimum bounding rectangle (MBR) of li
18:     [r c] ← size of m
19:     if |c| < c1 or |c| > c2
20:         continue // filter out noise
21:     else
22:         Sj ← m
23:         add Sj to S
24:         T1j ← min(cols) // start time of Sj
25:         T2j ← max(cols) // end time of Sj
26:         j ← j+1
27: return S, T1, T2 // candidate syllables in SP with start/end times
Instead of extracting candidate syllables from the original spectrogram (SP) we use an
idealized version (I) of SP, as it produces fewer false negatives to be checked. In line 2, we
convert the matrix I into a set of connected components, L. L has the same size as I, but it
has the connected pixels marked with number 1 to |L|. The set of candidate syllables in SP is
initialized with an empty set in line 7.
A syllable is a contiguous set of pixels in a spectrogram; we can thus consider it as a set of
connected points in I. The for loop in lines 10-26 is used to search for a connected
component li in I. In order to make the search time linear to the number of candidate
syllables, in lines 3-5 while creating L (a set of connected components), the row and column
indices and the values of all the connected points in arrays R, C and V, respectively are
saved. In line 6, the array V is sorted in ascending order and indices in B are saved. In the
while loop in lines 12-16, the indices in B are used to find the row and column indices of
a connected component li in I. The minimum and maximum values of the row and column
indices are used to extract the minimum bounding rectangle (MBR) of li.
It is important to note that not all of the connected components are candidate syllables. The
idealized spectrogram can still contain non-mouse vocalization sounds. In the if block of
lines 19-20, the duration of a connected component li is checked and those li in S which are
within the range of thresholds c1 and c2 are included. Since the minimum and maximum
duration of syllables can vary across different mice, the values of c1 and c2 should be set
after manual inspection of a fraction of the data. In our experiments, the values are set to 10
and 300, respectively. However the exact settings of these parameters are not critical to
subsequent steps. In lines 24-25, the start time and end time of a syllable are saved and used
for subsequent analysis. Figure 1 visually demonstrates the method. Our algorithm runs
faster than real time, and thus does not warrant further optimizations for speed.
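The extraction step can be illustrated with a short pure-Python sketch. It is not the MATLAB implementation of the tool: the breadth-first labeling, the use of 8-connectivity, the function name and the toy matrix are assumptions made for this example. It does follow the same outline, namely finding connected components in a binary idealized spectrogram, taking the minimum bounding rectangle (MBR) of each, and discarding components whose duration in columns falls outside [dmin, dmax].

```python
from collections import deque


def extract_candidates(I, dmin, dmax):
    """Return (mbr, start_col, end_col) for each accepted component.

    I is a binary matrix (list of rows); rows are frequency bins and
    columns are time frames. 8-connectivity is assumed.
    """
    n_rows, n_cols = len(I), len(I[0])
    seen = [[False] * n_cols for _ in range(n_rows)]
    candidates = []
    for r0 in range(n_rows):
        for c0 in range(n_cols):
            if not I[r0][c0] or seen[r0][c0]:
                continue
            # Breadth-first search collects one connected component.
            seen[r0][c0] = True
            queue = deque([(r0, c0)])
            pixels = []
            while queue:
                r, c = queue.popleft()
                pixels.append((r, c))
                for dr in (-1, 0, 1):
                    for dc in (-1, 0, 1):
                        rr, cc = r + dr, c + dc
                        if (0 <= rr < n_rows and 0 <= cc < n_cols
                                and I[rr][cc] and not seen[rr][cc]):
                            seen[rr][cc] = True
                            queue.append((rr, cc))
            rows = [p[0] for p in pixels]
            cols = [p[1] for p in pixels]
            duration = max(cols) - min(cols) + 1
            if duration < dmin or duration > dmax:
                continue  # too short or too long: treated as noise
            member = set(pixels)
            mbr = [
                [1 if (r, c) in member else 0
                 for c in range(min(cols), max(cols) + 1)]
                for r in range(min(rows), max(rows) + 1)
            ]
            candidates.append((mbr, min(cols), max(cols)))
    return candidates


# Toy idealized spectrogram: one 3-column component and one isolated pixel.
spectrogram = [
    [0, 1, 1, 1, 0, 0, 0],
    [0, 0, 1, 0, 0, 0, 1],
    [0, 0, 0, 0, 0, 0, 0],
]
candidates = extract_candidates(spectrogram, dmin=2, dmax=5)
```

Here the isolated pixel is rejected by the duration filter, and the remaining component is returned with its MBR and its first and last column.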
In Figure 1, a snippet spectrogram SP, matrices corresponding to the idealized version of the
spectrogram I and connected components L are presented. For brevity in explanation,
original matrices for I and L are resized to 10x10. Finally, the MBRs of the candidate
syllables in the snippet spectrogram are marked.
Figure 1: (from left to right) A snippet of a spectrogram, the resized matrix corresponding to an idealized
spectrogram I, the resized matrix corresponding to the set of connected components L, and the MBRs of the
candidate syllables
2.2 Basic Analysis
The tool measures basic acoustic properties of syllables such as duration, dynamic range of
frequencies and gaps between syllables (Figure 2). Figure 2 shows a syllable made up of three
notes (left), and one that consists of a single note (right). The maximum possible gap
between single notes is a user-defined threshold (in this study we used 10 msec, which is
also set as the default). Any notes which are closer than the maximum gap, are combined as
one syllable. The tool reports the minimum, maximum and mean durations and produces an
output file including all durations and start and end times for each syllable. Corresponding
frequency dynamic range and gaps between syllables are included in separate output files.
Other information such as the maximum and average gap, and the total number of syllables
produced in the recording are also reported.
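The note-merging rule and the reported statistics can be sketched as below. The function names, the dictionary keys and the toy note times are assumptions for illustration; the only elements taken from the text are the rule that notes closer than the maximum gap are combined into one syllable and the 10 msec default.

```python
def merge_notes(notes, max_gap_ms=10):
    """Combine notes separated by less than max_gap_ms into syllables.

    notes: list of (start_ms, end_ms) pairs.
    Returns a list of (start_ms, end_ms) syllables in time order.
    """
    syllables = []
    for start, end in sorted(notes):
        if syllables and start - syllables[-1][1] < max_gap_ms:
            # Close enough to the previous note: extend that syllable.
            previous_start, previous_end = syllables[-1]
            syllables[-1] = (previous_start, max(previous_end, end))
        else:
            syllables.append((start, end))
    return syllables


def summarize(syllables):
    """Durations, gaps between consecutive syllables, and basic statistics."""
    durations = [end - start for start, end in syllables]
    gaps = [b[0] - a[1] for a, b in zip(syllables, syllables[1:])]
    return {
        "count": len(syllables),
        "min_duration": min(durations),
        "max_duration": max(durations),
        "mean_duration": sum(durations) / len(durations),
        "gaps": gaps,
    }


# Hypothetical notes (in ms): three notes 5 and 6 ms apart, then a distant one.
notes = [(0, 20), (25, 40), (46, 60), (100, 130)]
syllables = merge_notes(notes)
stats = summarize(syllables)
```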
2.3 Classification
Before the classification begins, the tool scans all the notes and combines the notes with a
gap less than the user-defined threshold into single syllables (for example, Figure 2, left).
The algorithm in Table 1 will generate a set of candidate syllables that are not classified in
this step. In order to classify them, a set of annotated syllables termed Ground Truth (G), and
a set of thresholds for each class of syllables are used. The candidate syllable cannot simply
be assigned to the class of its nearest neighbor because a large fraction of the candidate
syllables will inevitably be noise, and it is the thresholds that allow us to reject them.
A ground truth (G) dataset is a set of annotated syllables that have been classified by humans
(authors SR and KR). Each class in the ground truth may be represented by one or multiple
exemplars. The data set includes a small set of robust exemplars for our seven classes. The
Ground Truth table consists of seven syllable classes (Figure 3) against which each candidate
syllable is compared. Our ground truth table is consistent with other studies.
For example, [15] and [16] introduced ten categories of calls, which include almost all
of the calls shown in Figure 3. The exception is the class of multiple notes (class 7 in this
study): we use a single class, while [15] and [16] have more than one class for
multiple notes. This difference does not affect the accuracy of our results, as most of the results
in this paper are based on classes with single notes. The only conclusion
drawn for Class 7 would simply generalize to the combined categories of multiple notes
in [15] and [16].
Furthermore a set of thresholds, one for each class, is required for the purpose of
classification. Thresholds are created by simply computing the GHT distances between every
annotated syllable to its nearest neighbor from the same class. Then the mean plus two
standard deviations is chosen as the threshold distance for that class.
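The threshold rule above (mean plus two standard deviations of the same-class nearest-neighbor distances) can be sketched as follows. Two points are assumptions: a simple city-block distance on toy vectors stands in for the GHT distance, and the population standard deviation is used because the text does not say which estimator was chosen. All names and values are illustrative.

```python
from statistics import mean, pstdev


def toy_distance(a, b):
    """Placeholder distance (city-block); the tool itself uses GHT."""
    return sum(abs(x - y) for x, y in zip(a, b))


def class_thresholds(ground_truth, distance):
    """Per-class rejection threshold: mean + 2 * std of NN distances.

    ground_truth maps each class label to a list of at least two
    annotated exemplars.
    """
    thresholds = {}
    for label, exemplars in ground_truth.items():
        nearest = []
        for i, exemplar in enumerate(exemplars):
            # Distance from this exemplar to its closest same-class neighbor.
            nearest.append(min(
                distance(exemplar, other)
                for j, other in enumerate(exemplars) if j != i
            ))
        thresholds[label] = mean(nearest) + 2 * pstdev(nearest)
    return thresholds


# Hypothetical ground truth with two classes of toy 2-D exemplars.
ground_truth = {
    "up": [(0, 0), (1, 0), (3, 0)],
    "down": [(10, 10), (10, 12)],
}
thresholds = class_thresholds(ground_truth, toy_distance)
```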
Given a set of candidate syllables S and ground truth syllables (G) with their matching
thresholds (T), the algorithm shown in Table 2 classifies syllables in S, and rejects all others
as unclassifiable. A special "other" folder is created for syllables that our system could
not confidently classify. These syllables can later be classified by a human expert, or simply
discarded, as they are very rarely false dismissals but almost always simply noise/artifacts.
Table 2: Syllable classification algorithm
Algorithm 2 ClassifyCandidateSyllables(S, G, T)
Require: candidate syllables, ground truth, set of thresholds
Ensure: set of labeled syllables
[Lines 1-15 of Algorithm 2 are not available in this copy.]
In order to classify a candidate syllable we look for its nearest neighbor in G in the for
loop of lines 8-12. In the if block of lines 13-14, the class label of the nearest neighbor
is assigned to the candidate syllable only if the distance between the candidate and its
nearest neighbor from G is less than the threshold of the nearest neighbor's class. The GHT
distance measure was used for classifying syllables in this study. Although GHT is a
common and popular distance measure for this type of classification, other distance
measures have also been used in this area. Hammerschmidt et al. use the log-likelihood
distance measure and Schwarz's Bayesian criterion (BIC) for clustering mice calls [17].
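The nearest-neighbor rule with rejection described above can be sketched as follows. This is only an illustration under stated assumptions: `classify_candidates` is a hypothetical name, the absolute difference between numbers replaces the GHT distance, and rejected syllables receive the label "other" to mirror the "other" folder mentioned in the text.

```python
def classify_candidates(candidates, ground_truth, thresholds, distance):
    """Label each candidate with its nearest neighbour's class, or "other".

    ground_truth is a list of (syllable, class_label) pairs and thresholds maps
    each class label to its acceptance distance.
    """
    labels = []
    for s in candidates:
        # Nearest neighbour of the candidate in the ground truth set G.
        nn_syllable, nn_class = min(ground_truth, key=lambda g: distance(s, g[0]))
        d = distance(s, nn_syllable)
        # Accept only if the distance is strictly below that class's threshold.
        labels.append(nn_class if d < thresholds[nn_class] else "other")
    return labels


def toy_distance(a, b):
    # Placeholder only: the paper uses the GHT distance.
    return abs(a - b)


toy_ground_truth = [(0.0, "class1"), (10.0, "class7")]
toy_thresholds = {"class1": 1.0, "class7": 2.0}
toy_labels = classify_candidates([0.5, 9.0, 5.0, 1.0], toy_ground_truth,
                                 toy_thresholds, toy_distance)
```

In this toy run the first two candidates fall within the threshold of their nearest class, while the last two are too far from any ground truth example and are rejected.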
This raises the question of why GHT is an appropriate distance measure for deciding whether two sets
of pixels are sufficiently similar. GHT is fast, robust to the inevitable noise left even after
idealization, and at least somewhat invariant to the significant intra-class variability
observed. After careful consideration and provisional tests of dozens of possibilities, we
converged on a distance measure based on the Generalized Hough Transform [5].
The Hough Transform [6] was introduced as a tool for finding well-defined geometric shapes
(lines, curves, rectangles, etc.) in images [7]. Ballard generalized the idea and
introduced the Generalized Hough Transform to detect arbitrary shapes in images [5]. The
computation of Ballard's method is relatively expensive. It takes quadratic time, O(n_b^2),
to calculate the distance between a pair of windows, where n_b is the number of black pixels
in the window. However, Zhu et al. [8] augmented GHT in a way that significantly reduces the amortized
time for a single comparison. Zhu et al. achieve this speed-up by creating a
computationally cheap, tight lower bound to the GHT. Moreover, they present modifications
to the classic definition that make the measure symmetric and make it obey the triangle
inequality, two properties that are highly desirable because they allow the use of various algorithms
that exploit (or at least expect) these properties. We refer the interested reader to [8]
for more details on GHT.
Experiments
Figure 4: (top) Sample instances of a motif discovered from mice vocalizations by applying our algorithm.
(middle) Comparing the number of motifs during S and R behaviors for a sample recording of KO mice
vocalization. (bottom) Comparing the number of motifs during S and G behaviors for a sample recording of
WT mice vocalization.
Figure 5 compares the Fisher's Linear Discriminant score for every class in each behavior.
Values higher than the threshold line may indicate significant classifiers. Figure 5
suggests that Class 5 during sniffing and Class 7 during NoContact could potentially be
significant discriminators between the mice types, KO and WT.
Discussion
In contrast to many other studies, we have designed an algorithm that classifies
syllables by their shape, regardless of their frequencies, mice type or other basic
features. Our UCR-USV tool is capable of automatically extracting syllables from mice
vocalizations, idealizing the calls and classifying them into separate classes in almost real
time. The tool analyzes mice vocalizations by reporting the frequencies, durations, dynamic
ranges and call rates, and finally characterizes correlations between the USV syllables used
during specific components of social behavior.
The algorithm for classifying mice vocalization syllables described in Table 2 classifies
about 90 percent of the syllables by applying the following techniques: (1) assigning
multiple instances to each group in the Ground Truth Table; (2) idealizing the spectrogram
and removing noise; (3) using a dynamic user-defined threshold for idealizing the
spectrogram.
USVs are typically analyzed in isolation from the social behaviors during which they are
elicited. To the best of our knowledge, this is the first analysis of mice calls based on
their social behaviors. We found that, during sniffing behavior, WT mice
overrepresented call types of class 5 (Figure 3) compared to the KO mice.
While the mice did not have any contact, KO mice produced denser calls of combined notes
(class 7 in Figure 3). We have also compared the density of calls between every pair of
behaviors within a specific type of mice; the results for KO and WT mice are shown in
[9]. Higher Fisher's Linear Discriminant scores indicate a more significant
discriminator among the behaviors.
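The paper does not spell out how the Fisher's Linear Discriminant score is computed. As a hedged illustration, one common two-group form for a single feature is the squared difference of the group means divided by the sum of the group variances. The sketch below uses that form; the function name, the use of population variance and the toy numbers are all assumptions.

```python
from statistics import mean, pvariance


def fisher_score(group_a, group_b):
    """One common form of Fisher's criterion for a single feature and two groups."""
    between = (mean(group_a) - mean(group_b)) ** 2
    within = pvariance(group_a) + pvariance(group_b)
    return between / within


# Hypothetical per-recording call densities for one syllable class in two groups.
score = fisher_score([1, 2, 3], [4, 5, 6])
```

A larger score means the two groups are well separated relative to their spread, which matches the reading that higher values indicate a more significant discriminator.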
Many studies have considered the rate of USV calling. Hammerschmidt et al. [17] compare the call rates of
male and female mice; they report that during courtship in response to female intruders,
females called more than males, and that males called more to female than to male intruders. A
comprehensive analysis was done by Roy et al. [29], who compared
isolation-induced USVs generated by pups of Fmr1-KO mice with those of their wild type
(WT) littermates. They report that although the total number of calls was not significantly different
between genotypes, a detailed analysis of 10 different categories of calls revealed that loss
of Fmr1 expression in mice causes limited and call-type specific deficits in ultrasonic
vocalization: the carrier frequency of flat calls was higher, the percentage of downward calls
was lower and the frequency range of complex calls was wider in Fmr1-KO mice
compared to their WT littermates.
References
[1] M. L. Scattoni, S. U. Gandhy, L. Ricceri, J. N. Crawley. Unusual Repertoire of Vocalizations in the
BTBR T+tf/J Mouse Model of Autism. PLoS ONE 3: e3067, 2008.
[2] J. Zakaria, S. Rotschafer, A. Mueen, K. Razak, E. Keogh. Mining Massive Archives of Mice Sounds with
Symbolized Representations. SIAM SDM, 2012. pp 588-599.
[3] R. J. Hagerman, et al. Advances in the Treatment of Fragile X Syndrome. Pediatrics Vol. 123 No. 1,
January, 2009.
[4] C. Menuet, Y. Cazals, C. Gestreau, P. Borghgraef, L. Gielis, et al. (2011) Age-Related Impairment of
Ultrasonic Vocalization in Tau.P301L Mice: Possible Implication for Progressive Language Disorders.
PloS ONE Jan; 6(10).
[5] D. H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Patt. Recognition, 13(2):
111-22 (1981).
[6] P. V. C. Hough, Method and means for recognizing complex patterns, U.S. Patent 3069654, (1962).
[7] R. O. Duda, P. E. Hart, Use of the Hough transform to detect lines and curves in pictures, Comm. ACM
15: 11-15, (1972).
[8] Q. Zhu, X. Wang, E. Keogh, S.H. Lee, Augmenting the Generalized Hough Transform to Enable the
Mining of Petroglyphs, KDD 2009, pp. 1057-1066 (2009).
[9] M. Shokoohi-Yekta, J. Zakaria, S. Rotschafer, S. H. Mirebrahim, K. Razak and E. Keogh. Analysis of
Concomitant Behaviors and Vocalizations Reveal Social Communication Deficits in a Mouse Model of
Fragile X Syndrome. In press The Journal of Neuroscience, 2014.
[10] D. H. Ballard, Generalizing the Hough transform to detect arbitrary shapes, Patt. Recognition, 13(2): 111-22
(1981).
[11] R. O. Duda, P. E. Hart, Use of the Hough transform to detect lines and curves in pictures, Comm. ACM 15:
11-15, (1972).
[12] P. V. C. Hough, Method and means for recognizing complex patterns, U.S. Patent 3069654, (1962).
[13] Q. Zhu, X. Wang, E. Keogh, S.H. Lee, Augmenting the Generalized Hough Transform to Enable the Mining of
Petroglyphs, KDD 2009, pp. 1057-1066 (2009).
[14] T. E. Holy, Z. Guo. Ultrasonic Songs of Male Mice. PLoS Biol 3(12): e386.
doi:10.1371/journal.pbio.0030386, 2005.
[15] Scattoni ML, Gandhy SU, Ricceri L, Crawley JN (2008) Unusual Repertoire of Vocalizations in the
BTBR T+tf/J Mouse Model of Autism. PLoS ONE 3(8): e3067. doi:10.1371/journal.pone.0003067
[16] E. J. Mahrt, D. J. Perkel, L. Tong, E. W. Rubel, C. V. Portfors. Engineered Deafness Reveals That
Mouse Courtship Vocalizations Do Not Require Auditory Experience, The Journal of Neuroscience,
33(13):5573-5583, (2013).
[17] K. Hammerschmidt, K. Radyushkin, H. Ehrenreich, J. Fischer, The Structure and Usage of Female and
Male Mouse Ultrasonic Vocalizations Reveal only Minor Differences. PLoS ONE 7(7): e41133.
doi:10.1371/journal.pone.0041133, 2012.
[18] A. J. Doupe, P. K. Kuhl. Birdsong and human speech: Common themes and mechanisms. Annu Rev
Neurosci 22: 567-631, 1999.
[19] Y. Hao, M. Shokoohi-Yekta, G. Papageorgiou, E. J. Keogh. Parameter-Free Motif Discovery in
Arbitrary Data Archives. Submitted to KDD 2013.
[20] Enard W, Gehre S, Hammerschmidt K, Hölter SM, Blass T, Somel M, Brückner MK, Schreiweis C,
Winter C, Sohr R, Becker L, Wiebe V, Nickel B, Giger T, Müller U, Groszer M, Adler T, Aguilar A,
Bolle I, Calzada-Wack J (2009) A humanized version of FOXP2 affects cortico-basal ganglia circuits in
mice. Cell
[21] Wöhr M, Roullet FI, Hung AY, Sheng M, Crawley JN (2011) Communication impairments in mice
lacking Shank1: reduced levels of ultrasonic vocalizations and scent marking behavior. PLoS One
6:e20631.
[22] Fujita E, Tanabe Y, Imhof BA, Momoi MY, Momoi T (2012) CADM1-expressing synapses on Purkinje
cell dendrites are involved in mouse ultrasonic vocalization activity. PLoS One 7:e30151.
[23] Schmeisser MJ, Ey E, Wegener S, Bockmann J, Stempel AV, Kuebler A, Janssen AL, Udvardi PT,
Shiban E, Spilker C, Balschun D, Skryabin BV, Dieck St, Smalla KH, Montag D, Leblond CS, Faure P,
Torquet N, Le Sourd AM, Toro R, et al. (2012) Autistic-like behaviours and hyperactivity in mice
lacking prosap1/Shank2. Nature 486:256-260.
[24] Srivastava DP, Jones KA, Woolfrey KM, Burgdorf J, Russell TA, Kalmbach A, Lee H, Yang C,
Bradberry MM, Wokosin D, Moskal JR, Casanova MF, Waters J, Penzes P (2012) Social,
communication, and cortical structural impairments in epac2-deficient mice. J Neurosci 32:11864-11878.
[25] Portfors CV (2007) Types and functions of ultrasonic vocalizations in laboratory rats and mice. J Am
Assoc Lab Anim Sci. 2007 Jan;46(1):28-34
[26] Grimsley JMS, Gadziola MA and Wenstrup JJ (2013) Automated classification of mouse pup isolation
syllables: from cluster analysis to an Excel-based mouse pup syllable classification calculator. Front.
Behav. Neurosci. 6:89. doi: 10.3389/fnbeh.2012.00089
[27] R.A. Fisher. The use of multiple measurements in taxonomic problems, Ann. Eugenics, 7 (1936), pp.
179-188.
[28] J. M. S. Grimsley, J. J. M. Monaghan, J. J. Wenstrup. Development of Social Vocalizations in Mice.
PLoS ONE 6(3): e17460, 2007.
[29] S. Roy, N. Watkins, D. Heck. Comprehensive Analysis of Ultrasonic Vocalizations in a Mouse Model of
Fragile X Syndrome Reveals Limited, Call Type Specific Deficits, PLoS ONE 7(9): e44816.
doi:10.1371/journal.pone.0044816, 2012.
Chapter 7
Bird Song Classification Challenge
7.1 Multi-instance multi-label acoustic classification of plurality of animals: birds, insects & amphibian
Dufour O., Glotin H., Giraudet P., Bas Y., Artieres T.
7.5 Ensemble logistic regression and gradient boosting classifiers for multilabel bird song classification in noise (NIPS4B challenge)
Massaron L.
7.1 Multi-Instance Multi-Label Acoustic Classification of Plurality of Animals: birds, insects & amphibian
O. Dufour, H. Glotin, P. Giraudet, Y. Bas, T. Artières
25/11/2013
LSIS, Université du Sud Toulon Var.
Aix-Marseille Université, CNRS, ENSAM, LSIS, UMR 7296, 13397 Marseille, France.
Université du Sud Toulon Var.
BIOTOPE.
LIP6, Université Paris 6.
Introduction
[Figure: four panels plotted against time (s): (A) frequency (Hz); (B) cepstral coefficient number; (C) 1st cepstral coefficient; (D) convolved 1st cepstral coefficient.]
2.2 Feature extraction
The final step of the preprocessing consists in computing a reduced set of features for each segment. Recall that each segment consists of a series of n 16-dimensional feature vectors (with n = 32).
96 coefficients featuring. To get new feature vectors that are representative of longer segments, our feature extraction first consists in computing
6 values that represent the series of n values of each of the 16 MFCC features. Consider a particular MFCC feature v; let $(v_i)_{i=1..n}$ denote the n
values taken by this feature in the n frames of a window, and let $\bar{v}$ denote
the mean value of the $v_i$. Moreover, let d and D denote the velocity and the
acceleration of v, which are approximated along the sequence by
$d_i = v_{i+1} - v_i$ and $D_i = d_{i+1} - d_i$, with mean values $\bar{d}$ and $\bar{D}$. The 6 values we compute are defined
as:

$$f_1 = \frac{1}{n} \sum_{i=1}^{n} |v_i| \tag{3}$$

$$f_2 = \sqrt{\frac{1}{n-1} \sum_{i=1}^{n} (v_i - \bar{v})^2} \tag{4}$$

$$f_3 = \sqrt{\frac{1}{n-2} \sum_{i=1}^{n-1} (d_i - \bar{d})^2} \tag{5}$$

$$f_4 = \sqrt{\frac{1}{n-3} \sum_{i=1}^{n-2} (D_i - \bar{D})^2} \tag{6}$$

$$f_5 = \frac{1}{n-1} \sum_{i=1}^{n-1} |d_i| \tag{7}$$

$$f_6 = \frac{1}{n-2} \sum_{i=1}^{n-2} |D_i| \tag{8}$$

Computing these 6 values for each of the 16 MFCC features gives 16 x 6 = 96 coefficients per segment.
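The six statistics f1 to f6 can be sketched in a few lines. This is a minimal illustration assuming each segment is given as n rows of 16 MFCC values; the function names and the ordering of the 96 output values are assumptions, not the authors' code.

```python
from math import sqrt


def mean_abs(x):
    """Mean of absolute values (used for f1, f5, f6)."""
    return sum(abs(t) for t in x) / len(x)


def sample_std(x):
    """Standard deviation with the (len - 1) denominator (used for f2, f3, f4)."""
    m = sum(x) / len(x)
    return sqrt(sum((t - m) ** 2 for t in x) / (len(x) - 1))


def six_stats(v):
    """Return [f1, ..., f6] for one MFCC feature observed over n >= 4 frames."""
    d = [v[i + 1] - v[i] for i in range(len(v) - 1)]  # velocity, n - 1 values
    D = [d[i + 1] - d[i] for i in range(len(d) - 1)]  # acceleration, n - 2 values
    return [mean_abs(v), sample_std(v), sample_std(d), sample_std(D),
            mean_abs(d), mean_abs(D)]


def segment_features(frames):
    """Concatenate the six statistics of every MFCC feature of a segment."""
    features = []
    for k in range(len(frames[0])):
        features.extend(six_stats([row[k] for row in frames]))
    return features


stats = six_stats([0.0, 1.0, 4.0, 9.0, 16.0])
segment = [[float(i * k) for k in range(16)] for i in range(32)]
vector = segment_features(segment)
```

With n = 32 frames and 16 MFCC features, `segment_features` returns a vector of length 96, matching the "96 features per segment" used in the experiments.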
2.3 Training
In most of the test and training signals, several classes are present. Each audio
file is therefore not only represented by multiple instances but also associated with
multiple class labels.
Let us consider the definitions of Tsoumakas and Katakis [35]. We define problem transformation methods as those methods that transform the multi-label classification
problem into one or more single-label classification problems, for which
there exists a huge bibliography of learning algorithms. We define algorithm
adaptation methods as those methods that extend specific learning algorithms
in order to handle multi-label data directly.
Problem transformation method. The most common problem transformation method learns |L| binary classifiers (|L| = 87), one for each different label
l in L. It transforms the original data set into |L| data sets D_l that contain
all examples of the original data set, labelled as l if the labels of the original
example contained l and as not-l otherwise. It is the same solution used in order
to deal with a single-label multi-class problem using a binary classifier. We used
this approach (dubbed PT) with a Support Vector Machine classifier.
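The data set transformation behind the PT approach can be sketched as follows. This is an illustration only: the function name and toy data are hypothetical, and the step of training one binary classifier (an SVM in the paper) on each derived data set is deliberately left out.

```python
def binary_relevance_datasets(examples, label_sets, all_labels):
    """Build one binary data set D_l per label l.

    Every original example appears in every D_l, with target 1 if l is among
    its labels and 0 otherwise.
    """
    return {
        l: [(x, 1 if l in labels else 0) for x, labels in zip(examples, label_sets)]
        for l in all_labels
    }


# Toy data: three examples, three possible labels.
examples = ["file_a", "file_b", "file_c"]
label_sets = [{"robin", "serin"}, {"serin"}, set()]
datasets = binary_relevance_datasets(examples, label_sets,
                                     ["robin", "serin", "thrush"])
```

Each derived data set has as many rows as the original one, so with 87 labels the method trains 87 independent binary classifiers on the full training set.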
Algorithm adaptation methods. One strategy consists in separating
syllables of different classes in the same training recording during preprocessing,
as in [10, 9]. This is a signal-processing approach. Following [30, 10, 8, 9],
we chose to use a machine learning approach. We rely on learning from bags of instances in order to carry out this tricky task.
Multi-instance multi-label learning (MIML) is a recent learning framework
where each example corresponds to a bag of instances as well as a set of labels
[25, 41]. To handle this MIML task, we tested different MATLAB toolboxes from
Nanjing University [38, 39, 37]:
MIMLRBF (MIML Radial Basis Function) is an innovative neural network
style algorithm. As its name implies, MIMLRBF is derived from the popular
radial basis function (RBF) method [4]. Connections between instances
and labels are directly exploited in the process of first layer clustering and
second layer optimization. Briefly, the first layer of the MIMLRBF neural
network consists of medoids (i.e. bags of instances) formed by performing
k-Medoids clustering on MIML examples for each possible class, where a
variant of the Hausdorff metric [19] is used to measure the distance between
bags [40]. Second layer weights of the MIMLRBF neural network are optimized
by minimizing a sum-of-squares error function and worked out through
singular value decomposition (SVD) [33].
MIML-kNN (k-Nearest Neighbor Based Multi-Instance Multi-Label Learning Algorithm) is proposed for MIML by using the popular k-nearest
neighbor techniques. Given a test example, MIML-kNN not only considers its neighbors, but also considers its citers, which regard it as their own
neighbor. The label set of the test example is determined by exploiting
the labeling information conveyed by its neighbors and citers.
M3MIML (Maximum Margin Method for Multi-instance Multi-label Learning) assumes a linear model for each class, where the output on one class is
set to be the maximum prediction of all the MIML example's instances with
2.4 Inference
At test time an incoming signal is first preprocessed as explained in section 2.2: interesting segments are selected and feature extraction is performed,
so that the input signal is represented as one bag of a variable
number of 96-dimensional vectors.
Second, a learned MIML model is used, as explained in section 2.3, to
compute from the bag of each audio file one K-dimensional prediction vector
(K = 87).
The test set is composed of 1 000 files. The 1 000 bags of vectors obtained after
preprocessing are processed by the MIML-RBF classifier to get probabilistic scores
for each of the 87 labels provided in the training data set.
Results
Short description
5 highest maxima per file + 96 features per segment + PT
5 highest maxima per file + 96 features per segment + MIMLRBF
5 highest maxima per file + 96 features per segment + MIMLkNN
5 highest maxima per file + 96 features per segment + M3MIML
all local maxima in a file + 96 features per segment + MIMLRBF
all local maxima in a file + PCA/LDA + MIMLRBF
5 highest maxima per file + PCA/LDA + MIMLRBF
best team of NIPS4B challenge
Discussion
Figure 2 gives the False Negative Rate (FNR) for each class, computed from
the predictions of model M2 on the test data set. One can see that the global FNR (all
classes included) is about 25%.
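The per-class and global FNR can be computed as in the sketch below. It assumes the predicted scores have already been turned into a set of predicted classes per file; the paper does not state how scores are binarized, so that step, like the function names and toy data, is an assumption.

```python
def false_negative_rates(true_sets, predicted_sets, n_classes):
    """Per-class FNR: fraction of files containing a class where it was missed."""
    rates = {}
    for c in range(n_classes):
        positives = [i for i, truth in enumerate(true_sets) if c in truth]
        if not positives:
            continue  # FNR is undefined for a class absent from the test set
        missed = sum(1 for i in positives if c not in predicted_sets[i])
        rates[c] = missed / len(positives)
    return rates


def global_false_negative_rate(true_sets, predicted_sets):
    """FNR pooled over all classes and all files."""
    total = sum(len(truth) for truth in true_sets)
    missed = sum(len(truth - pred) for truth, pred in zip(true_sets, predicted_sets))
    return missed / total


true_sets = [{0, 1}, {0}, {1}, {0, 2}]
predicted_sets = [{0}, {0}, {1}, {2}]
per_class = false_negative_rates(true_sets, predicted_sets, n_classes=3)
overall = global_false_negative_rate(true_sets, predicted_sets)
```

Note that the global rate pools all (file, class) pairs, so frequent classes weigh more than rare ones; it is not the plain average of the per-class rates.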
[Figure 2: False Negative Rate (y-axis, ticks from 0.1 to 0.8) against class number (x-axis).]
Expected comments
1. Scores are much better for classes corresponding to bird calls than for
classes corresponding to bird songs. For instance, scores of classes number 36, 17, 1, 73, 18 are excellent because the calls concerned consist of
strongly stereotyped signals.
2. Predictions remain generally very good for bird species whose songs stay
simple and vary little (cl. 25, 65, 70).
3. A FNR of 33% for Subalpine Warbler (cl. 76) on a total of 36 test files is
reasonable because, according to an ornithologist, it is one of the 4 most difficult
bird species of the challenge.
4. It is well known that the European Robin produces complex and highly variable
songs (cl. 23). As a consequence, we reach a 65% FNR.
5. Song Thrush and European Serin (cl. 87 & 67) emit complex songs. Their
respective scores are 57% and 43%. Although the European Serin song is distinctive, it is also composed of a lot of syllables (50 per second). This
supports our hypothesis (see section Improvements) that in some cases
our current fixed 130 ms window is far too large.
Unexpected comments
Improvements
1. According to Figure 3, there are 36 classes for which we do not have any
single-label recording. In addition, one can see that the volume of available
training data (in seconds) varies a lot from one class to another. It is
very likely that this imbalance hampers the performance of our classification algorithm. It will be interesting to examine carefully the differences in
classification scores between classes and to explain them: are they due to
the imbalance of the training data set, differences in signal complexity, variable
S.N.R., acoustic properties of biotopes, etc.?
[Figure 3: plot against class number (x-axis); y-axis ticks from 10 to 80.]
Acknowledgments
References
[1] O. Abdel-Hamid, L. Deng, and D. Yu. Exploring convolutional neural
network structures and optimization techniques for speech recognition. In
INTERSPEECH, 2013.
[2] M. Acevedo, C. Corrada-Bravo, H. Corrada-Bravo, L. Villanueva-Rivera,
and T. Aide. Automated classification of bird and amphibian calls using
machine learning: A comparison of methods. Ecological Informatics 4:
206-214, 2009.
[3] Y. Bengio and Y. Lecun. Convolutional networks for images, speech, and
time-series, 1995.
[4] C. M. Bishop. Neural Networks for Pattern Recognition. Oxford University
Press, Inc., New York, NY, USA, 1995.
[5] B. Bogert, M. Healy, and J. Tukey. The quefrency alanysis of time series
for echoes: Cepstrum, pseudo-autocovariance, cross-cepstrum, and saphe cracking. In E. M. Rosenblatt, editor, Symposium on Time Series Analysis,
Chapter 15, pp. 209-243, 1963.
[6] A. Bossus and F. Charron. Guide des chants d'oiseaux d'Europe occidentale
: Description et comparaison des chants et des cris, 2010.
[7] F. Briggs et al. The 9th Annual MLSP Competition: New Methods for
Acoustic Classification of Multiple Simultaneous Bird Species in a Noisy
Environment. In IEEE Workshop on Machine Learning for Signal Processing, MLSP 2013, 2013.
[8] F. Briggs, X. Fern, and R. Raich. Acoustic classification of bird species
from syllables: an empirical study. Technical report, 2009.
[9] F. Briggs, X. Z. Fern, and J. Irvine. Multi-label classifier chains for bird
sound. CoRR, abs/1304.5862, 2013.
[10] F. Briggs, B. Lakshminarayanan, L. Neal, X. Fern, R. Raich, M. Betts,
S. Frey, and A. Hadley. Acoustic classification of multiple simultaneous bird
species: a multi-instance multi-label approach. Journal of the Acoustical
Society of America, 2012.
[11] C.-C. Chang. Libsvm. http://www.csie.ntu.edu.tw/~cjlin/libsvm/,
2008.
[12] L. Chang-Hsing, L. Yeuan-Kuen, and H. Ren-Zhuang. Automatic recognition of bird songs using cepstral coefficients. Journal of Information
Technology and Applications Vol. 1, pp.17-23, 2006.
[13] R. Collobert, K. Kavukcuoglu, and C. Farabet. Torch7: A matlab-like
environment for machine learning. In Big Learning 2011 : NIPS 2011
Workshop on Algorithms, Systems, and Tools for Learning at Scale, 2011.
[14] L. Deng, G. E. Hinton, and B. Kingsbury. New types of deep neural network
learning for speech recognition and related applications: An overview.
In International Conference on Acoustic Speech and Signal Processing
(ICASSP), 2013.
[15] O. Dufour, T. Artières, H. Glotin, and P. Giraudet. Clusterized mel filter
cepstral coefficients and support vector machines for bird song identification. International Machine Learning Conference, 2013.
[16] O. Dufour, P. Giraudet, T. Artières, and H. Glotin. Automatic bird classification based on mfcc clusters, ranked 4th @ icml4b kaggle 2013 competition. In Listening in the Wild, page 11, 2013.
[17] O. Dufour, H. Glotin, T. Artières, and P. Giraudet. Classification de signaux acoustiques : Classification de matrices cepstre par support vector
machine. Technical report, Laboratoire Sciences de l'Information et des
Systèmes, Université du Sud Toulon Var, 2012.
[18] O. Dufour, H. Glotin, T. Artières, and P. Giraudet. Classification de signaux acoustiques : Recherche des valeurs optimales des 17 paramètres
d'entrée de la fonction melfcc. Technical report, Laboratoire Sciences de
l'Information et des Systèmes, Université du Sud Toulon Var, 2012.
[19] G. A. Edgar. Measure, topology, and fractal geometry. Undergraduate
texts in mathematics. Springer-Verlag, New York, Berlin, Paris, 1990.
Reprinted in 1992, 1995.
[20] D. P. W. Ellis. PLP and RASTA (and MFCC, and inversion) in Matlab,
2005. online web resource.
[21] H. Glotin and O. Dufour. Clusterized Mel Filter Cepstral Coefficients and
Support Vector Machines for Bird Song Identification. INTECH, 2013.
[22] H. Glotin and J. Sueur. Overview of the first international challenge on
bird classification, 2013. online web resource.
[23] I. J. Goodfellow, D. Erhan, P. L. Carrier, A. C. Courville, M. Mirza,
B. Hamner, W. Cukierski, Y. Tang, D. Thaler, D.-H. Lee, Y. Zhou, C. Ramaiah, F. Feng, R. Li, X. Wang, D. Athanasakis, J. Shawe-Taylor, M. Milakov, J. Park, R.-T. Ionescu, M. Popescu, C. Grozea, J. Bergstra, J. Xie,
L. Romaszko, B. Xu, C. Zhang, and Y. Bengio. Challenges in representation learning: A report on three machine learning contests. In ICONIP
(3), pages 117-124, 2013.
[24] A. Graves, A. rahman Mohamed, and G. E. Hinton. Speech recognition
with deep recurrent neural networks. CoRR, abs/1303.5778, 2013.
[25] Z.-H. Zhou and M.-L. Zhang. Multi-instance multi-label learning with
application to scene classification. In Advances in Neural Information
Processing Systems 19, 2007.
[26] The 1st International Workshop on Machine Learning for Bioacoustics, joint to Int. Conf. on Machine Learning (ICML 2013), Atlanta, USA, june 2013. Glotin H. et al.
http://sabiod.univtln.fr/ICML4B2013p roceedings.pdf.
[27] E. Kasten, M. Philip, and G. Stuart. Ensemble extraction for classification and
detection of bird species. Ecological Informatics 5: 153-166, 2010.
[28] H. Lee, Y. Largman, P. Pham, and A. Y. Ng. Unsupervised feature learning
for audio classification using convolutional deep belief networks. In Advances in
Neural Information Processing Systems 22, pages 1096-1104. 2009.
[29] A. Michael Noll. Short-time spectrum and cepstrum techniques for vocal-pitch
detection. Journal of the Acoustical Society of America, Vol. 36, No. 2, pp.
296-302, 1964.
[30] L. Neal, F. Briggs, R. Raich, and X. Fern. Time-frequency segmentation of bird
song in noisy acoustic environments. In International Conference on Acoustics,
Speech and Signal Processing, 2011.
[31] A.-V. Oppenheim and R.-W. Schafer. From frequency to quefrency: a history
of the cepstrum. Signal Processing Magazine, Vol 21, Issue 5, pp 95 - 1015,
2004.
[32] J. Placer and C. Slobodchikoff. A method for identifying sounds used in the
classification of alarm calls. Behavioural Processes 67: 87-98, 2004.
[33] W. H. Press, B. P. Flannery, S. A. Teukolsky, and W. T. Vetterling. Numerical
Recipes in C: The Art of Scientific Computing. Cambridge University Press,
New York, NY, USA, 1988.
[34] L. Ranjard and H. Ross. Unsupervised bird song syllable classification
using evolving neural networks. Journal of the Acoustical Society of America,
Volume 123, Issue 6, pp. 4358-4368, 2008.
[35] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. Int J
Data Warehousing and Mining, 2007:1-13, 2007.
[36] H. Yu and J. Yang. A direct lda algorithm for high-dimensional data with
application to face recognition. Pattern Recognition, 34:2067-2070, 2001.
[37] M.-L. Zhang. A k-nearest neighbor based multi-instance multi-label learning
algorithm. In ICTAI (2), pages 207-212. IEEE Computer Society, 2010.
[38] M.-L. Zhang and Z.-J. Wang. Mimlrbf: Rbf neural networks for multi-instance
multi-label learning. Neurocomputing, 72(16-18):3951-3956, 2009.
[39] M.-L. Zhang and Z.-H. Zhou. M3MIML: A Maximum Margin Method for Multi-instance Multi-label Learning. In ICDM '08: Proceedings of the 2008 Eighth
IEEE International Conference on Data Mining, pages 688-697, Washington,
DC, USA, Dec. 2008. IEEE Computer Society.
[40] M.-L. Zhang and Z.-H. Zhou. Multi-instance clustering with applications to
multi-instance prediction. Applied Intelligence, 31(1):47-68, Aug. 2009.
[41] Z.-H. Zhou, M.-L. Zhang, S.-J. Huang, and Y.-F. Li. Miml: A framework for
learning with ambiguous objects. CoRR, abs/0808.3231, 2008.
Abstract
The challenge of the NIPS4B competition is to identify 87 sound classes of
birds and other animals present in 1000 audio recordings collected in the
field. The difficulty of this task lies in the large number of species and
sounds that have to be identified in various contexts, with different
levels of background noise and simultaneously vocalizing animals. The
solution presented here ranks first on the Kaggle private leaderboard
and achieves an Area Under the Curve (AUC) of 91.7%.
Introduction
The audio data was recorded at different places in Provence, France, and is provided by the
BIOTOPE society, which has one of the largest collections of wildlife recordings of birds in
Europe. The nearly 2 hours of recordings are split into smaller clips ranging from 0.25 to
5.75 seconds. The recordings were made with Wildlife Acoustics SM2 recorders and are presented in
uncompressed WAV format with a sample rate of 44.1 kHz. The 87 individual sound classes
within these recordings represent different bird species and their songs, calls and drumming.
Other animal species living in the same environment, such as insects and one amphibian, are also
included. The training set consists of 687 audio files. Each file is paired with the subset of sound
classes present in that recording. Some recordings are empty, containing only background noise;
others contain up to 6 different simultaneously vocalizing birds or insects. Each species is
represented by about 10 training files within various contexts, with different background noises and an
arbitrary number of other species. The goal of the competition is to identify which of the 87 sound
classes of birds and amphibians are present in 1000 continuous wildlife recordings, using only the
provided audio files and machine learning algorithms for automatic pattern recognition.
The method of segmentation has a big influence on classification results. Several different
approaches were tested. The one that works best regarding leaderboard score is surprisingly
simple. Audio files are first resampled to 22050 Hz. After applying the STFT using a
hanning window with a size of 512 samples and 75% overlap the resulting spectrogram is
normalized to a maximum of 1.0. The 4 lowest and 24 highest frequency bins are removed,
leaving 228 frequency bins or spectrogram rows representing the relevant frequency range of
approximately 170 to 10000 Hz. The narrowed spectrogram of each audio file is treated as a
grayscale image and further processed for noise reduction and segmentation.
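The bin arithmetic above can be checked with a few lines of Python. This is only a sketch: it assumes that the spectrogram has 256 rows (half the window size, Nyquist bin not counted), which is not stated in the text but is the row count that makes the reported 228 remaining bins come out. All constant names are illustrative.

```python
SAMPLE_RATE = 22050        # Hz, after resampling (from the text)
WINDOW_SIZE = 512          # STFT window length in samples (from the text)
N_BINS = WINDOW_SIZE // 2  # ASSUMPTION: 256 spectrogram rows, Nyquist bin not counted
DROP_LOW, DROP_HIGH = 4, 24

bin_width_hz = SAMPLE_RATE / WINDOW_SIZE              # about 43.07 Hz per row
hop_size = WINDOW_SIZE - int(0.75 * WINDOW_SIZE)      # 75% overlap

kept_bins = N_BINS - DROP_LOW - DROP_HIGH
lowest_hz = DROP_LOW * bin_width_hz                   # first row that is kept
highest_hz = (N_BINS - DROP_HIGH - 1) * bin_width_hz  # last row that is kept

print(kept_bins, hop_size, round(lowest_hz, 1), round(highest_hz, 1))
```

Under this assumption the kept rows span roughly 172 Hz to 9948 Hz, in line with the "approximately 170 to 10000 Hz" quoted above.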
To reduce background noise, each pixel value is set to 1 if it is above 3 times the median of
its corresponding row (frequency band) AND above 3 times the median of its corresponding
column (time frame); otherwise it is set to 0. This Median Clipping per frequency band and
time frame already removes most of the background noise. Variable noise levels in different
frequency regions are compensated, and short broadband distortions coming from rain, wind
or microphone handling are attenuated.
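A minimal NumPy sketch of the Median Clipping rule described above is given below. The function name `median_clip` and the synthetic test spectrogram are illustrative assumptions; only the thresholding rule (3 times the row median AND 3 times the column median) is taken from the text.

```python
import numpy as np

def median_clip(spec, factor=3.0):
    """Binarize a spectrogram (rows = frequency bands, columns = time frames)."""
    row_median = np.median(spec, axis=1, keepdims=True)  # one value per frequency band
    col_median = np.median(spec, axis=0, keepdims=True)  # one value per time frame
    mask = (spec > factor * row_median) & (spec > factor * col_median)
    return mask.astype(np.uint8)

# Synthetic example: flat background, one narrow-band call, one broadband burst.
spec = np.full((8, 20), 0.02)
spec[3, 5:9] = 1.0    # short tonal event in frequency band 3
spec[:, 15] = 0.5     # broadband distortion (e.g. microphone handling)

binary = median_clip(spec)
print(binary)
```

In this toy example the tonal event survives, while the broadband burst is suppressed because it raises the median of its own time frame.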
The resulting binary image is further processed using standard image processing techniques
(e.g. closing, dilation, median filter). Finally, all groups of connected pixels exceeding a certain
spatial extent are labeled as segments, and for each segment a bounding rectangle, enlarged by a
small margin in each direction, defines its size and position. Figure 1 gives an example of the
preprocessing steps involved and Figure 2 shows the outcome of a complete segmentation process.
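The labeling step can be sketched as a plain connected-component search. The following is an illustration under stated assumptions, not the author's implementation: "spatial extent" is approximated by a minimum pixel count (`min_pixels`), 8-connectivity is used, and the margin (`pad`) is one pixel; none of these values are given in the text.

```python
from collections import deque
import numpy as np

def find_segments(binary, min_pixels=4, pad=1):
    """Return padded bounding boxes (row0, row1, col0, col1), end-exclusive."""
    n_rows, n_cols = binary.shape
    seen = np.zeros(binary.shape, dtype=bool)
    boxes = []
    for r in range(n_rows):
        for c in range(n_cols):
            if not binary[r, c] or seen[r, c]:
                continue
            # Breadth-first search over the 8-connected neighbourhood.
            queue = deque([(r, c)])
            seen[r, c] = True
            pixels = []
            while queue:
                y, x = queue.popleft()
                pixels.append((y, x))
                for dy in (-1, 0, 1):
                    for dx in (-1, 0, 1):
                        ny, nx = y + dy, x + dx
                        if (0 <= ny < n_rows and 0 <= nx < n_cols
                                and binary[ny, nx] and not seen[ny, nx]):
                            seen[ny, nx] = True
                            queue.append((ny, nx))
            if len(pixels) < min_pixels:
                continue  # too small to count as a segment
            ys = [p[0] for p in pixels]
            xs = [p[1] for p in pixels]
            boxes.append((max(min(ys) - pad, 0), min(max(ys) + 1 + pad, n_rows),
                          max(min(xs) - pad, 0), min(max(xs) + 1 + pad, n_cols)))
    return boxes

binary = np.zeros((10, 12), dtype=np.uint8)
binary[2:5, 3:7] = 1   # 12-pixel blob: kept as a segment
binary[8, 10] = 1      # isolated pixel: discarded
segments = find_segments(binary)
print(segments)
```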
Feature Extraction
Features are calculated for both training and test files and come from three different sources:
File-Statistics, Segment-Statistics and Segment-Probabilities.
File-Statistics include minimum, maximum, mean and standard deviation taken from all
values of the unprocessed spectrogram. Additionally, the spectrogram is divided into 16
equally sized and distributed frequency bands, and their minima, maxima, means and
standard deviations are also included.
For Segment-Statistics, the number of segments per file plus minimum, maximum, mean and
standard deviation for width, height and frequency position of all segments per file are
calculated.
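The two statistics groups can be sketched as follows. This is a hedged illustration: the helper names, the use of `np.array_split` for the 16 bands, the box format `(row0, row1, col0, col1)` and the definition of "frequency position" as the vertical box centre are all assumptions. The resulting counts (4 + 16 x 4 = 68 file statistics and 1 + 3 x 4 = 13 segment statistics) add up to the 81 features reported for File- and Segment-Statistics.

```python
import numpy as np

def basic_stats(values):
    v = np.asarray(values, dtype=float)
    return [v.min(), v.max(), v.mean(), v.std()]

def file_statistics(spec, n_bands=16):
    """Global statistics plus statistics of n_bands frequency bands."""
    feats = basic_stats(spec)
    for band in np.array_split(spec, n_bands, axis=0):
        feats += basic_stats(band)
    return feats

def segment_statistics(boxes):
    """boxes: list of (row0, row1, col0, col1), end-exclusive (assumed format)."""
    if not boxes:
        return [0.0] * 13
    b = np.asarray(boxes, dtype=float)
    heights = b[:, 1] - b[:, 0]
    widths = b[:, 3] - b[:, 2]
    freq_pos = (b[:, 0] + b[:, 1]) / 2.0   # ASSUMPTION: vertical centre of the box
    return ([float(len(boxes))] + basic_stats(widths)
            + basic_stats(heights) + basic_stats(freq_pos))

spec = np.random.default_rng(0).random((228, 100))
boxes = [(1, 6, 2, 8), (100, 120, 30, 50)]
features = file_statistics(spec) + segment_statistics(boxes)
print(len(features))
```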
In order to find Segment-Probabilities, a variation of Fodor's method [1] is used, which was
already successfully applied in the MLSP 2013 Competition. For every segment extracted from
training files associated with one or more sound classes, the highest matching probability
is determined in all files by template matching using normalized cross-correlation [2].
A Gaussian blur with a sigma of 1.5 is applied to segment and target spectrogram before
matching. Best matches are only searched for within the frequency range of the segment (plus a
small tolerance of 4 pixels). Unlike Fodor, the template matching uses only absolute-intensity
spectrograms, and for better performance the OpenCV library [4] is used.
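The band-limited template matching can be illustrated with a deliberately slow pure-NumPy version. This is a sketch only: the paper uses OpenCV for speed, the Gaussian blur is omitted here, and the zero-mean form of normalized cross-correlation as well as the names `ncc` and `best_match` are assumptions made for the example.

```python
import numpy as np

def ncc(patch, template):
    """Zero-mean normalized cross-correlation of two equally sized arrays."""
    p = patch - patch.mean()
    t = template - template.mean()
    denom = np.sqrt((p * p).sum() * (t * t).sum())
    return float((p * t).sum() / denom) if denom > 0 else 0.0

def best_match(spec, template, row0, tol=4):
    """Highest NCC score of `template` in `spec`, searching all time positions
    but only rows within `tol` pixels of the template's original row `row0`."""
    h, w = template.shape
    lo = max(row0 - tol, 0)
    hi = min(row0 + tol, spec.shape[0] - h)
    best = -1.0
    for r in range(lo, hi + 1):
        for c in range(spec.shape[1] - w + 1):
            best = max(best, ncc(spec[r:r + h, c:c + w], template))
    return best

rng = np.random.default_rng(0)
spec = rng.random((64, 100))
template = spec[20:25, 30:38].copy()   # a "segment" cut from rows 20..24

score_in_band = best_match(spec, template, row0=20)
score_off_band = best_match(spec, template, row0=50)
print(round(score_in_band, 3), round(score_off_band, 3))
```

One such score per training segment and per file yields the Segment-Probabilities feature vector.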
File- and Segment-Statistics produce 81 features per file, scaled to the range [0, 1].
Segment-Probabilities create 9198 features per file, one for each segment extracted from the
training set.
Feature Selection
Table 1: Number of selected features and estimators per sound class plus AUC scores
Classification
The scikit-learn library is used for classification [3]. For each sound class an ensemble of
randomized decision trees (sklearn.ensemble.ExtraTreesRegressor) is applied. The number
of estimators is chosen to be twice the number of selected features per class, but not greater
than 500. The winning solution considers 4 features when looking for the best split and
requires a minimum of 3 samples to split an internal node. During 12-fold cross validation
the probability of each sound class in all test files is predicted and, at the end, averaged
after removing the lowest and highest value.
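The two numeric rules in this paragraph (estimator count and trimmed averaging over folds) can be sketched without training any model. The function names below are illustrative. In scikit-learn terms, the split settings quoted above would presumably correspond to `max_features=4` and `min_samples_split=3` of `ExtraTreesRegressor`, but that mapping is an assumption and not confirmed by the text.

```python
import numpy as np

def n_estimators_for(n_selected_features, cap=500):
    """Twice the number of selected features, but not greater than the cap."""
    return min(2 * n_selected_features, cap)

def trimmed_mean(fold_predictions):
    """Average per-file predictions over folds after dropping, for each file,
    the single lowest and the single highest fold prediction."""
    p = np.sort(np.asarray(fold_predictions, dtype=float), axis=0)
    return p[1:-1].mean(axis=0)

# Toy example: 4 folds (rows) predicting one sound class for 2 test files (columns).
folds = np.array([[0.9, 0.1],
                  [0.8, 0.2],
                  [0.0, 0.3],
                  [1.0, 0.0]])
averaged = trimmed_mean(folds)
print(n_estimators_for(40), n_estimators_for(400), averaged)
```

With 12 folds, as in the paper, each final probability is the mean of the 10 remaining fold predictions.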
Good classification results are possible even without calculating File- and Segment-Statistics
and therefore without the need to segment the test recordings. With Segment-Probabilities alone,
using the same parameter settings as mentioned above, a score of 91.6% AUC
on the private leaderboard can be achieved. A score around 84% is achievable using File-
and Segment-Statistics exclusively.
By ranking feature importance returned from the decision trees during training one can find
important segments to identify each sound class. Figures 3 and 4 show the ten most important
segments to identify the songs of Cetti's Warbler (sound class 11) and Common Chiffchaff
(sound class 55). Both sound classes achieve very good classification results with a score
close to 100%. Figure 5 gives an example of a sound class with poor classification results.
The feature ranking returned from decision trees to identify the call of the European Serin
(sound class 66) is partly incorrect and segments are not properly assigned.
To give an idea how well individual species can be identified, a score per sound class is
calculated on one third of the training data during 3-fold cross validation. The average of
this score is listed and visualized in Table 1.
Figure 3: Important segments to identify the song of Cettia cetti (Cetti's Warbler)
Figure 5: Important segments to identify the call of Serinus serinus (European Serin)
Conclusion
This working note describes the winning solution of the NIPS4B 2013 multi-label Bird
Species Classification Challenge. The solution of the MLSP 2013 Competition, implemented
and described by Fodor, was used as a starting point for further development. The method
proposed here includes an efficient way of extracting single sound events and connected
sequences of bird calls and syllables in complex acoustic scenes and noisy environments. An
ensemble of randomized decision trees is used to learn and predict the binary relevance of
each sound class separately with individually selected features per class. The complete
source code to reproduce the classification results and additional figures are available at
www.animalsoundarchive.org/RefSys/Nips4b2013.php.
Acknowledgments
I would like to thank Prof. Hervé Glotin for organizing this competition, BIOTOPE and
ADEME for financing the corpus constitution and Kaggle for providing the competition
platform. I especially thank Gabor Fodor for documenting his approach and publishing his
code for the 2013 MLSP Challenge. I also want to thank Dr. Karl-Heinz Frommolt for
supporting my work, sharing his knowledge and providing me with access to the
resources of the Animal Sound Archive [5] at the Museum für Naturkunde Berlin.
References
[1] Fodor G. (2013) The Ninth Annual MLSP Competition: First place. Machine Learning for Signal
Processing (MLSP), 2013 IEEE International Workshop on, pp. 1-2. Digital Object Identifier:
10.1109/MLSP.2013.6661932
[2] Lewis J.P. (1995) Fast Normalized Cross-Correlation. Industrial Light and Magic
[3] Pedregosa F. et al. (2011) Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12: 2825-2830
[4] Bradski G. (2000) The OpenCV Library. Dr. Dobb's Journal of Software Tools,
http://docs.opencv.org/modules/imgproc/doc/object_detection.html
[5] Animal Sound Archive. URL: http://www.animalsoundarchive.org/ [25.11.2013]
[Figure 1 plots: left panels, Mel band vs. Time (s); right panels, Freq (Hz) vs. Time (s)]
Figure 1: Illustration of features for an excerpt of training file 007, comparing Mel spectra (left)
against peak chirplet data (right). The lower plots show the same features after noise reduction.
Note that the left plots show Mel spectra, which we further process to MFCCs, and the right plots
show chirplets, which we further process to bigram histograms.
Acknowledgments
DS & MP are supported by an EPSRC Leadership Fellowship EP/G007144/1.
The challenge was organised by Prof Hervé Glotin and the SABIOD project team, with data provided
by the BIOTOPE society and ADEME.
References
[1] L. Breiman. Random forests. Machine Learning, 45(1):5-32, 2001.
[2] F. Pedregosa, G. Varoquaux, A. Gramfort, V. Michel, B. Thirion, O. Grisel, M. Blondel,
P. Prettenhofer, R. Weiss, V. Dubourg, et al. Scikit-learn: Machine learning in Python. Journal of
Machine Learning Research, 12:2825-2830, 2011.
[3] D. Stowell, S. Musevic, J. Bonada, and M. D. Plumbley. Improved multiple birdsong tracking
with distribution derivative method and Markov renewal process clustering. In Proceedings of
the International Conference on Audio and Acoustic Signal Processing (ICASSP), 2013. Preprint
arXiv:1302.3642.
[4] D. Stowell and M. D. Plumbley. Framewise heterodyne chirp analysis of birdsong. In Proceedings
of the European Signal Processing Conference (EUSIPCO), pages 2694-2698, 2012.
[5] D. Stowell and M. D. Plumbley. Segregating event streams and noise with a Markov renewal
process model. Journal of Machine Learning Research, 14:1891-1916, 2013.
Jinseok Nam
Knowledge Engineering Group
Technische Universität Darmstadt
[email protected]
Dong-Hyun Lee
[email protected]
Abstract
The Multi-label Bird Species Classification competition provides an excellent opportunity to
analyze the effectiveness of acoustic processing and multilabel learning.
We propose an unsupervised feature extraction and generation approach based on
the latest advances in deep neural network learning, which can be applied generically
to acoustic data. With state-of-the-art approaches from multilabel learning, we
achieved top positions in the competition, surpassed only by teams with profound
expertise in acoustic data processing.
Introduction
Acoustic data is a common and natural example of multilabel data, i.e. data in which an example,
in this case an acoustic sample, can be mapped to several non-exclusive classes or categories.
Examples are the popular emotions benchmark, with the objective of assigning emotions to music, or
the hifind dataset, where the task is to identify the instruments, genres, moods, languages, styles
etc. used in songs [12]. In the Multi-label Bird Species Classification competition (NIPS4B) the task was
to identify 87 birds, insects and amphibians in short audio recordings, several of which could appear
in the same sample. More specifically, the objective was to maximize the area under the ROC curve
(AUC) on 1000 unlabeled recordings, a common measure for the quality of a label ranking.
The challenge was thus twofold. On the one hand, it was necessary to process the data in a way
appropriate for machine learning approaches, since the data was only available in a raw format or
in a very basic preprocessed format. This article presents a combination of recent and
state-of-the-art approaches from neural networks and deep learning which allows the unsupervised
generation of an arbitrary number of features, appropriate for being processed by standard machine
learning algorithms. It basically consists of random patching, a Denoising Autoencoder unit and
subsequent convolution, and represents a general approach for processing acoustic data.
On the other hand, it was essential to learn from the data accurately in order to produce high-quality
predictions and to get the most out of the provided information in the form of input features and binary
(relevant/irrelevant) label information. We tried three approaches. Firstly, we used a pairwise ensemble
of SVMs, which is geared towards the basis of AUC, namely the correct ordering of pairs of labels. The
popular and effective LibSVM library was specifically adapted to allow pairwise learning, and the
modifications are made available. Secondly and thirdly, random decision trees and a single layer
neural network were applied. The diversity of the classifiers ensured that the combination of the
predictions was effective. Our final ranking in the very competitive contest shows that our acoustic
preprocessing provides a good basis for the subsequent machine learning step and that the multilabel
learners exploit it.
The dataset for the Multi-label Bird Species Classification competition contains 687 labeled training
examples, including 100 noise samples which are not labeled with any bird species, and 1,000
unlabeled test examples for measuring the generalization performance. The recordings belong to
87 categories (bird species like the subalpine warbler), each of which is associated with approx.
13 training instances. The average labelset size is 2.00 excluding noise samples, with a maximum of 6, and
there are 265 distinct labelsets in the training data. The dataset comes in two formats: one is the
raw WAV format, in which bird songs and calls are recorded together with distant insects, and the other
consists of Mel-Frequency Cepstral Coefficients (MFCCs) of the WAV files, computed following the
preprocessing steps in [4], where each time frame is represented by 17 coefficients. The audio clips in
both training and test data vary in length.
Let $s \in \mathbb{R}^{m \times T}$ be the input for an audio clip, where $T$ is the total number of
frames in time and each time frame $t$ consists of an $m$-dimensional feature vector. In our first
attempts, we just padded (repeated) shorter samples so that in the end all samples had the same length
$\max_i T_i = 1288$, resulting in 21,896 features in total (referred to as the raw dataset). However,
the results were not satisfactory, so we applied the following operations and methods from unsupervised
feature learning. These had already been applied successfully, e.g., to image data; hence the question
was whether they would work for acoustic data.
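The construction of the raw dataset can be sketched as follows. This is only an illustration: the function name is hypothetical, and it is an assumption that shorter clips were repeated cyclically along the time axis and cut at the target length. The numbers 17 and 1288 come from the text, and 17 x 1288 gives the 21,896 features mentioned above.

```python
import numpy as np

def pad_by_repetition(s, target_len):
    """Repeat an (m, T) clip along the time axis until it has target_len frames."""
    t = s.shape[1]
    reps = -(-target_len // t)          # ceiling division
    return np.tile(s, (1, reps))[:, :target_len]

N_COEFFS, MAX_FRAMES = 17, 1288         # values from the text

clip = np.arange(N_COEFFS * 500, dtype=float).reshape(N_COEFFS, 500)
padded = pad_by_repetition(clip, MAX_FRAMES)
raw_vector = padded.reshape(-1)         # flattened input for a standard learner
print(padded.shape, raw_vector.size)
```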
Firstly, we extract $M_{tr}$ and $M_{ts}$ random patches, $M = M_{tr} + M_{ts}$ in total, of size
$p_{sz} = m \times wnd$ from training and test data, respectively, where $wnd$ denotes the window
size. Each extracted patch is then normalized along the time axis so that each coefficient has zero
mean and unit variance. Secondly, the randomly sampled patches are concatenated to form training
examples $V \in \mathbb{R}^{p_{sz} \times M}$ for a Denoising Autoencoder (DAE) [14], which learns
hidden representations from inputs in an unsupervised way. A DAE is a neural network architecture
consisting of an encoder $f_{enc}$ and a decoder $g_{dec}$ with parameters $\theta = \{W, b, c\}$,
trained to minimize the squared error loss
$\| v - g_{dec}(W^T f_{enc}(W \tilde{v} + b) + c) \|_2^2$, where $W \in \mathbb{R}^{F \times p_{sz}}$
is the weight matrix connecting visible units and hidden units, $b$ and $c$ are the biases for hidden
units and visible units, and $\tilde{v}$ is the corrupted input obtained by adding Gaussian noise
$n \sim N(0, \sigma^2)$ to an input $v$. Once the DAE is trained, each column of $W^T$ acts as a
feature detector. There are $F$ feature detectors in total, and each feature detector has the same
size as the randomly extracted patches, that is, the $k$-th feature detector is
$W_k^T \in \mathbb{R}^{m \times wnd}$. Finally, we obtain a feature representation of fixed size,
independent of $T$, for an input signal $s$ by convolving it with the learned feature detectors:

$$a_k = f_{conv}(s * W_k^T + b_k), \qquad a_{k+F} = f_{conv}(s * (-W_k^T) + b_k) \qquad (1)$$

$$x_k = \sum_{j=1}^{T - wnd + 1} a_{k,j}, \qquad x_{k+F} = \sum_{j=1}^{T - wnd + 1} a_{(k+F),j} \qquad (2)$$

where $*$ stands for a 2D discrete convolution operator and $f_{conv}$ provides non-linearity to the
convolved feature representations. We use ReLUs, $f(x) = \max(0, x)$, as the nonlinear function $f_{conv}$.
In order to make use of the negative part as well as the positive part of the inputs to the ReLUs, we apply
polarity splitting [2] in Eq. 1. Then, we sum up the convolved feature values $a_k$ over time, which
is analogous to accumulating the activations of $s$ with respect to the feature detector $W_k^T$ (Eq. 2).
For instance, $x_k$ will be higher if a feature $k$ defined by $W_k^T$ is detected many times in $s$.
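The feature generation step (convolution over time, ReLU, polarity splitting, summation) can be sketched in NumPy as below. Several points are assumptions rather than the authors' code: the detectors are applied as a sliding dot product (which differs from a true convolution only by a flip of the kernel), polarity splitting is implemented by negating the detector weights while keeping the bias, and the function and variable names are made up for the example.

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def pooled_features(s, W, b):
    """s: (m, T) clip; W: (F, m * wnd) learned detectors (one per row); b: (F,).
    Returns a vector of length 2F that does not depend on T."""
    m, T = s.shape
    wnd = W.shape[1] // m
    # All windows of wnd consecutive frames, flattened: (T - wnd + 1, m * wnd).
    windows = np.stack([s[:, j:j + wnd].reshape(-1) for j in range(T - wnd + 1)])
    a_pos = relu(windows @ W.T + b)      # activations a_k over time
    a_neg = relu(windows @ (-W).T + b)   # polarity-split activations a_{k+F}
    return np.concatenate([a_pos.sum(axis=0), a_neg.sum(axis=0)])

# Tiny example: m = 2 coefficients, one detector (F = 1) spanning wnd = 2 frames.
W = np.ones((1, 4))
b = np.zeros(1)
short_clip = np.ones((2, 5))
long_clip = np.ones((2, 9))
x_short = pooled_features(short_clip, W, b)
x_long = pooled_features(long_clip, W, b)
print(x_short, x_long)
```

Both clips yield a vector of the same length, and the longer clip accumulates a larger value because the pattern is detected in more windows.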
For training the DAE, we randomly extracted 100,000 patches of size $17 \times 80$ from the training
data and 100,000 from the test data in the MFCC format. We then trained two DAE models with Gaussian
noise $\sigma = 0.2$ on the patches; the models have 400 and 800 hidden units, resulting in 800 (small)
and 2000 (big) features, respectively. We used ReLU for the encoder and the sigmoid
$f(x) = 1/(1 + \exp(-x))$ for the decoder.
Multilabel Learning
Multilabel classification refers to the task of learning a function that maps instances $x_i \in X$ to
label subsets or label vectors $y_i = (y_{i,1}, \ldots, y_{i,n}) \in \{0, 1\}^n$, where
$L = \{\lambda_1, \ldots, \lambda_n\}$, $n = |L|$, is a finite set of predefined labels and where each
label attribute $y_i$ corresponds to the absence (0) or presence (1) of label $\lambda_i$. In the
following, we present the different learning algorithms we applied to the NIPS4B dataset.
Pairwise Support Vector Machines The most common approach for multilabel classification is
to use an ensemble of binary classifiers, where each classifier predicts whether an instance belongs to
one specific class or not (binary relevance or BR). An alternative is pairwise decomposition.
Here, one classifier is trained for each pair of classes, i.e., a problem with $n$ different classes is
decomposed into $n(n-1)/2$ smaller subproblems [6]. More precisely, for each pair of classes
$(\lambda_u, \lambda_v)$, $u < v$, we learn a binary base classifier $h_{u,v}$, whose training set is
composed of examples for which $\lambda_u$ is a relevant class and $\lambda_v$ is an irrelevant class,
or vice versa. All other examples are ignored for this particular subproblem [10]. During
classification, each of the $n(n-1)/2$ base classifiers makes a prediction for one of its two
corresponding classes, which is interpreted as a full vote (0 or 1); counting the votes per label
results in a full ranking over the labels.²
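The pairwise decomposition and the vote counting can be sketched in plain Python. The helper names and the toy labelsets are illustrative; the base learners themselves (SVMs in the paper) are left out, and the predictions fed into `count_votes` are made up for the example.

```python
from itertools import combinations

def pairwise_training_sets(labelsets, n_labels):
    """For each label pair (u, v), u < v, collect (example index, target) pairs.
    Target is 1 if u is relevant and v is not, 0 in the opposite case.
    Examples where both or neither label is relevant are ignored."""
    sets = {}
    for u, v in combinations(range(n_labels), 2):
        examples = []
        for idx, labels in enumerate(labelsets):
            if (u in labels) != (v in labels):
                examples.append((idx, 1 if u in labels else 0))
        sets[(u, v)] = examples
    return sets

def count_votes(pair_predictions, n_labels):
    """pair_predictions maps (u, v) to True if the base classifier prefers u."""
    votes = [0] * n_labels
    for (u, v), u_wins in pair_predictions.items():
        votes[u if u_wins else v] += 1
    return votes

labelsets = [{0}, {0, 1}, {2}, set()]          # toy training labelsets
training_sets = pairwise_training_sets(labelsets, n_labels=3)
votes = count_votes({(0, 1): True, (0, 2): True, (1, 2): False}, n_labels=3)
n_pairs_nips4b = len(list(combinations(range(87), 2)))
print(training_sets, votes, n_pairs_nips4b)
```

Sorting the labels by their vote counts gives the ranking; for the 87 classes of this competition the decomposition yields 3741 base classifiers.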
The pairwise learning method is often regarded as superior to BR because it profits from simpler decision
boundaries in the subproblems [6, 8]. The reason is that each of the pairwise classifiers is trained on
fewer examples. In fact, it has also been shown that the complexity of training an ensemble of
pairwise classifiers is comparable to the complexity of training a BR ensemble [6, 10]. During
prediction, however, a quadratic number of classifiers has to be evaluated. Particularly
for support vector machines, this problem is alleviated by the fact that easier (sub-)problems lead to
fewer support vectors and that support vectors can be shared among the pairwise SVMs.
Multilabel LibSVM For these reasons, and because SVMs trained in a pairwise fashion had already
obtained state-of-the-art results on standard benchmark datasets [12] in previous work [11], we
decided to use the very popular and effective SVM software library LibSVM [1] for training our
pairwise SVMs. However, preliminary experiments showed that just plugging in LibSVM was not
feasible, since a simple experiment on the dataset apparently required more than 20
GB of memory. The reason is that the Java interface used copies every training instance
for every base learner. Additionally, each of the 3741 LibSVM instantiations could request up to
40 MB of cache. We thus extended LibSVM directly in order to support the pairwise learning of
multilabel data. Our extension does not copy a training instance more than once and also shares a
common cache for kernel computations, so that we managed to perform an experiment in less than
250 seconds for training on 618 instances and 9 seconds for testing on 68 instances, with less than 100
MB of memory (worst cases, respectively), despite the quadratic number of models to be trained,
stored and evaluated. The LibSVM modifications and interfaces for the multilabel learning toolkit
MULAN [13] are available from http://www.ke.tu-darmstadt.de/resources/multilabellibsvm.
Random Decision Trees Zhang et al. [15] recently proposed using ensembles of random decision
trees (RDTs) for learning multilabel data and also provide a software library.³ The main idea is to
generate $k_1$ RDTs with random attribute tests at the inner nodes and maximal depth $k_2$. Comparably
small values of $k_1$ and $k_2$, around 10 or 20 and at most 100, are sufficient in their experiments.
During the extremely fast training, the leaves incrementally collect statistics about the label
distributions $y$ of the examples which passed all tests on the way to the leaf. Hence, each RDT
predicts an average distribution, which is subsequently averaged over all trees. RDTs are very suitable
for data with a high number of examples and labels, since the costs are bounded by the choice of $k_1$
and $k_2$. However, they may have problems with a high number of features and particularly with sparse
features, which is not the case for NIPS4B.
² Ties in the final vote counting are broken by using the prior probabilities of the labels.
³ http://www.dice4dm.com/
Neural Networks with a Single Hidden Layer Neural networks (NNs) have attracted increasing
interest in recent years thanks to the success of NNs with multiple levels of trainable feature extractors,
namely deep learning, in various domains such as object recognition and speech recognition. In
order to achieve state-of-the-art performance, one usually trains deep neural networks on a large
amount of training examples, or initializes the parameters by pretraining the networks on unlabeled data
in an unsupervised manner, followed by learning all parameters, including a classification layer,
using labeled instances. As the bird species classification dataset has only 587 labeled examples and
only 11 positive training instances are available per label on average, we decided to train NNs with
only a single hidden layer rather than ones with multiple hidden layers.
The single hidden layer NNs perform surprisingly well when we combine them with AdaGrad
[3], which makes it possible to adapt the learning rate per parameter, and Dropout [7], which
prevents overfitting and hence improves generalization performance. The output $\hat{y}$ of the NN
for a given training example $x$ is computed by the following composition of non-linear functions:
$\hat{y} = f_o(W^{(2)} f_h(W^{(1)} x + b^{(1)}) + b^{(2)})$, where $f_o(x) = 1/(1 + \exp(-x))$ and
$f_h(x) = \max(0, x)$ are the activation functions for the output layer and the hidden layer,
respectively. At the output layer, we compute the cross entropy error
$CE(y, \hat{y}) = -\sum_{i=1}^{n} \left( y_i \log \hat{y}_i + (1 - y_i) \log(1 - \hat{y}_i) \right)$, where
$\hat{y}_i$ is the predicted score for label $\lambda_i$. We run Stochastic Gradient Descent (SGD) to
train NNs with 1,000 hidden units for 50,000 epochs, which corresponds to 300,000 parameter updates, and use
mini-batches of size 100 for computing gradients.
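The forward pass and the loss of this network can be written down in a few lines of NumPy. The sketch below is only meant to make the formulas concrete: it leaves out AdaGrad, Dropout and the training loop entirely, it assumes the usual negative-sum form of the cross entropy, and the layer sizes in the example are arbitrary small numbers apart from the 87 output labels.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def forward(x, W1, b1, W2, b2):
    """Single hidden layer: ReLU hidden units, sigmoid output units."""
    hidden = np.maximum(0.0, W1 @ x + b1)
    return sigmoid(W2 @ hidden + b2)

def cross_entropy(y, y_hat, eps=1e-12):
    """Summed binary cross entropy over all labels of one example."""
    y_hat = np.clip(y_hat, eps, 1.0 - eps)   # guard against log(0)
    return float(-np.sum(y * np.log(y_hat) + (1.0 - y) * np.log(1.0 - y_hat)))

n_features, n_hidden, n_labels = 8, 5, 87    # small sizes, for illustration only
W1, b1 = np.zeros((n_hidden, n_features)), np.zeros(n_hidden)
W2, b2 = np.zeros((n_labels, n_hidden)), np.zeros(n_labels)

x = np.ones(n_features)
y = np.zeros(n_labels)
y[[3, 10]] = 1.0                             # two relevant labels

y_hat = forward(x, W1, b1, W2, b2)
loss = cross_entropy(y, y_hat)
print(y_hat[:3], loss)
```

With all-zero weights every label receives the score 0.5, so the loss equals 87 times log 2, a handy sanity check before training.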
Experimentation
In order to estimate the performance on the public and private test set, we performed 10-fold cross
validation on the available labeled training data.
Evaluation Measures The competition submissions were evaluated by computing the area under
the ROC curve of the label rankings and then averaging over the instances. This measure can be
defined as

$$AUC(y, \hat{y}) = \frac{1}{|P||N|} \sum_{i \in P} \sum_{j \in N} \left( [[\hat{y}_i > \hat{y}_j]] + \frac{1}{2} [[\hat{y}_i = \hat{y}_j]] \right) \qquad (3)$$

where $P$ and $N$ denote the sets of relevant and irrelevant labels of the instance, $[[x]]$ denotes the
indicator function and $\hat{y}_i$ the predicted score for $\lambda_i$, e.g. the inverted ranking
position. $1 - AUC$ corresponds to the popular ranking loss used for evaluating multilabel
classification [5]. There are several discrepancies in computing this measure, e.g. sometimes
the second term is skipped and tied pairs are arbitrarily counted as wrong or correct, or sometimes
test instances with an empty labelset are skipped. We compute the score for each instance, i.e. we
additionally set $AUC(\emptyset, \hat{y}) = 1$, but note that for the cross validation results it is
easy to obtain the other, less optimistic version with
$AUC' = (687 \cdot AUC - 100)/587 = 1.17 \cdot AUC - 0.17$.
However, this does not explain the discrepancies between the estimated AUC values and the values
on the test set, since our best 0.94 would only be reduced to 0.93.
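The per-instance AUC with half credit for ties, and the conversion to the less optimistic variant, can be sketched as follows. Function names are illustrative. Returning 1 for an instance without relevant labels follows the text; returning 1 when there are no irrelevant labels is an added assumption to avoid a division by zero.

```python
def instance_auc(relevant, scores):
    """relevant: set of relevant label indices; scores: predicted score per label."""
    pos = [i for i in range(len(scores)) if i in relevant]
    neg = [j for j in range(len(scores)) if j not in relevant]
    if not pos or not neg:
        return 1.0   # empty labelset convention from the text (and its mirror case)
    total = 0.0
    for i in pos:
        for j in neg:
            if scores[i] > scores[j]:
                total += 1.0      # correctly ordered pair
            elif scores[i] == scores[j]:
                total += 0.5      # tie counts half
    return total / (len(pos) * len(neg))

def less_optimistic_auc(auc, n_total=687, n_empty=100):
    """Remove the contribution of the empty-labelset instances scored as 1."""
    return (n_total * auc - n_empty) / (n_total - n_empty)

example = instance_auc({0, 2}, [0.9, 0.8, 0.3, 0.3])
corrected = less_optimistic_auc(0.94)
print(example, round(corrected, 4))
```

In the example, two of the four relevant/irrelevant pairs are ordered correctly and one is tied, giving 2.5/4. The corrected value for 0.94 rounds to 0.93, as stated above.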
Results Table 1 shows our estimated results and the AUC values on the public and private test set.
The first observation is that our preprocessing approach substantially improved the ranking quality
over using the provided raw MFCC features. The achieved improvement is greater than the possible
improvement by any other tried approach or combination of approaches. This demonstrates the
applicability and effectiveness for acoustic data of our neural network based unsupervised feature
generation process. Before heading to the comparison between the used approaches, we also note
that there is an important discrepancy between the CV estimations on the training set and the test set
results which cannot be explained by overfitting or differences in computing AUC (cf. Sec. 5).
We see that the pairwise LibSVM approach (SVM), the random decision trees (RDT) and the neural
network (NN) with a single hidden layer obtain similar results on the test sets, with a small advantage
for the NN approach. This submission obtained the 5th rank on the public test set and the 8th rank
on the private test set. The big dataset arrived on the last day of the competition and was used for
training the RDTs; together with some difficulties in merging teams and results, this meant that only
the predictions of the SVMs with $\gamma = 0.5$ and the RDTs could be merged. With this merged
submission we managed to reach the 4th and 6th
Table 1: Results for the different multilabel approaches and settings in terms of AUC, estimated on the training
data via 10-fold cross validation (with standard deviation) or a train/test split of 600/86, and computed on the
public and private test set. Training and prediction times are given in seconds. The features column indicates
which feature set was used. The second block shows post-competition results.

Approach:   SVM C = 2000, γ = 10^-3 | SVM C = 10^6, γ = 0.5 | SVM C = 10^6, γ = 1 | RDT 50000 trees | NN | SVM & RDT

First block (values per column, top to bottom)
features:   raw | small | small | big | big
CV:         0.8994 | 0.93883 ± 0.0180 | 0.93915 ± 0.0179 | 0.93718 ± 0.0189 | 0.92595 ± 0.0200
public:     0.89202 | 0.89130 | 0.89129 | 0.89650 | 0.90104
private:    0.88967 | 0.88996 | 0.88195 | 0.89374 | 0.89525
training:   681.8 | 60.18 | 60.96 | 13444.44 | 5585.15 (GPU)
predicting: 49.76 | 19.39 | 19.44 | 290.83 | 0.01 (GPU)

Second block (values per column, top to bottom)
features:   big | big
CV:         0.94022 ± 0.0181 | 0.93976 ± 0.0178
public:     0.88699 | 0.88752 | 0.90331 | 0.89903 | 0.89807
private:    0.88710 | 0.88696 | 0.89824 | 0.89279 | 0.89556
training:   119.71 | 109.26
predicting: 45.98 | 45.96
positions on the public and private leaderboard, respectively. Merging the rankings had only a
small impact on the absolute numbers, but a considerable effect on the test set ranking due to the high
competitiveness of the contest.
It seems clear that merging rankings exploits the diversity of the underlying classifiers by reinforcing
predictions if the individual classifiers agree and by tending to correct rankings if some of the
rankers fail for some instances. For binary decision ensembles it can be shown that the accuracy
approaches 1 with an increasing number of voters, assuming a certain diversity (in the sense
of probabilistic independence) [9, Sec. 4.2.1]. We could confirm the importance of diversity when
joining predictions of classifiers of the same family, which did not lead to any improvement. But as
the post-competition results in the second half of Table 1 show, combining different approaches almost
always improved the AUC. Indeed, if we had submitted a joined prediction of all three approaches, we
would have been ranked 4th on both test sets. This is just below the three competitors using advanced
and sophisticated acoustic signal processing techniques that rely on expert knowledge and on their own
processing of the raw acoustic data, as reflected by the relatively big gap between them (with AUC
greater than 0.91) and the rest of the competitors.
However, please note that our best approaches with almost 0.93 AUC had roughly 10 wrongly paired
labels per instance (cf. Eq. 3). This number $e$ equals $\sum_{i=1}^{|P|} r_i - |P|(|P| + 1)/2$, with $r_i$
being the ranks of the positive labels in $P$; thus for examples with $|P| = 1$ the label is on average
ranked at the 11th position, for two labels e.g. on positions 5 and 6. On the other hand, additional
evaluations show that for approx. 64.6% of the instances a relevant label was predicted at the first
position (one-error loss), and that we could obtain an F1-score of 75% using perfect thresholding.
Recall, however, that our CV results are overestimations.
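The relation between the ranks of the relevant labels and the number of wrongly ordered label pairs can be checked with a short function. This is a sketch under the assumptions that ranks are 1-based and free of ties; the function name is illustrative.

```python
def wrongly_ordered_pairs(positive_ranks):
    """Number of (relevant, irrelevant) label pairs in the wrong order, given the
    1-based ranks of the relevant labels in a tie-free ranking."""
    p = len(positive_ranks)
    return sum(positive_ranks) - p * (p + 1) // 2

single_label = wrongly_ordered_pairs([11])      # one relevant label at rank 11
perfect = wrongly_ordered_pairs([1, 2, 3])      # relevant labels ranked on top
print(single_label, perfect)
```

A single relevant label at the 11th position has ten irrelevant labels ranked above it, which is the case mentioned in the text; a perfect ranking gives zero.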
Conclusions
We have presented a general and unsupervised approach for processing acoustic data, particularly
short recordings of bird, insect and amphibian sounds. It is based on recent findings and
state-of-the-art approaches from the field of neural networks and deep learning. The generated features,
which basically are activation signals obtained with learned feature detectors, achieved an important
improvement over the unprocessed MFCCs in terms of AUC.
The three multilabel approaches applied to the data were well suited for the
particular task and carefully optimized, so that we were able to obtain top results in the competition.
By combining the individual approaches we could exploit the diversity among them and obtain the
4th rank on the public test set and the 6th position in the final ranking. Unfortunately, we did not manage
to combine all three classifiers in time; this would have allowed us to obtain 0.90 AUC and
hence the overall 4th rank, right after the three solutions based on expert knowledge, which were
therefore unreachable with our means.
We see some room for improvement in the pairwise learning approach, which currently ignores
examples in the label overlaps, and in the neural network approach, which still has many unexplored
degrees of freedom for optimization. However, the narrow range of AUC results in the top
10 (excluding the top 3) indicates that we already got nearly the most out of the provided acoustic
representation and multilabel learning. Next steps hence include finding new representations directly
from the raw data, and we believe from our work with the bird sounds dataset that supervised and
unsupervised techniques from neural networks and deep learning can make important contributions
to this.
References
[1] Chih-Chung Chang and Chih-Jen Lin. LIBSVM: a library for support vector machines, 2001. Manual.
Software available at http://www.csie.ntu.edu.tw/~cjlin/libsvm.
[2] Adam Coates and Andrew Ng. The importance of encoding versus training with sparse coding and vector
quantization. In Proceedings of the 28th International Conference on Machine Learning, pages 921-928,
2011.
[3] John C. Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and
stochastic optimization. Journal of Machine Learning Research, 12:2121-2159, 2011.
[4] O. Dufour, T. Artières, H. Glotin, and P. Giraudet. Clusterized mel filter cepstral coefficients and support
vector machines for bird song identification. In The 1st International Workshop on Machine Learning for
Bioacoustics (ICML 2013), pages 89-93, Atlanta, USA, June 2013. Glotin H. et al.
[5] Seyda Ertekin and Cynthia Rudin. On equivalence relationships between classification and ranking algorithms. Journal of Machine Learning Research, 12:2905-2929, November 2011. ISSN 1532-4435.
[6] Johannes Fürnkranz. Round robin classification. Journal of Machine Learning Research, 2:721-747,
2002.
[7] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Improving neural networks by preventing co-adaptation of feature detectors. CoRR, abs/1207.0580, 2012.
[8] Chih-Wei Hsu and Chih-Jen Lin. A comparison of methods for multi-class support vector machines.
IEEE Transactions on Neural Networks, 13(2):415-425, 2002.
[9] Ludmila I. Kuncheva. Combining Pattern Classifiers: Methods and Algorithms. Wiley-Interscience,
2004. ISBN 0471210781.
[10] Eneldo Loza Mencía and Johannes Fürnkranz. Pairwise learning of multilabel classifications with perceptrons. In Proceedings of the 2008 IEEE International Joint Conference on Neural Networks (IJCNN-08), pages 2900-2907, Hong Kong, 2008. IEEE. ISBN 978-1-4244-1821-3. doi: 10.1109/IJCNN.2008.4634206.
[11] Eneldo Loza Mencía, Sang-Hyeun Park, and Johannes Fürnkranz. Efficient voting prediction for pairwise
multilabel classification. Neurocomputing, 73(7-9):1164-1176, March 2010. ISSN 0925-2312.
[12] Grigorios Tsoumakas. Mulan: A Java library for multi-label learning, dataset repository. Website, January
2012. URL http://mulan.sourceforge.net/datasets.html. Last accessed 2013-12-01.
[13] Grigorios Tsoumakas, Eleftherios Spyromitros Xioufis, Jozef Vilcek, and Ioannis P. Vlahavas. Mulan: A
Java library for multi-label learning. Journal of Machine Learning Research, 12:2411-2414, 2011.
[14] Pascal Vincent, Hugo Larochelle, Yoshua Bengio, and Pierre-Antoine Manzagol. Extracting and composing robust features with denoising autoencoders. In Proceedings of the 25th International Conference on
Machine Learning, pages 1096-1103, 2008.
[15] Xiatian Zhang, Quan Yuan, Shiwan Zhao, Wei Fan, Wentao Zheng, and Zhong Wang. Multi-label classification without the multi-label cost. In Proceedings of the Tenth SIAM International Conference on Data
Mining, April 2010.
Luca Massaron
Independent marketing research director & data scientist
Verona, Italy
[email protected]
Abstract
This technical report details the author's approach in the NIPS4B competition, which led to a
final area under the ROC curve of 0.89575 on the public leaderboard and 0.89041
on the private one. The described approach involved building an ensemble of generalized
linear models, namely a logistic regression and a hinge-loss classification model, as
provided by Vowpal Wabbit, an open source learning system library and program based
on stochastic gradient descent optimization, and of boosted tree ensembles provided by the Scikit-learn library in Python.
1. Description of the competition
The contest, held on Kaggle, the big data predictive analytics web site (www.kaggle.com),
required participants to identify which of 87 sound classes of birds (for some species the
contest required discriminating the song from the call) and their ecosystem are present in
1000 continuous wild recordings provided by the BIOTOPE society from different locations
in Provence, France.
The training set contained 687 .wav files, each one featuring one or more species. Each
species was represented overall by nearly 10 training files (within various contexts / other
species). The files were recorded at a sampling frequency of 44.1 kHz on an SM2 system.
The test set, matching the training set conditions, was composed of 1000 files. All species in
the test set were also in the training set, posing quite an interesting discrimination challenge
in distinguishing the signals proper to each species.
The organizers of the competition also provided some baseline features for the training and
test .wav files. These were the optimized MFCC features, as described in the ICML4B 2013
bird challenge [1]. The format is a 17xN matrix: 17 cepstral coefficients x N frames (frame
size 11.6 ms, frame shift 3.9 ms, one line per frame).
2. Data preparation
First of all, the presented approach is entirely based on the original MFCC data, without the
creation of any new features. The original MFCC data has only been reshaped in order
to fit the training procedure of the different machine learning algorithms involved.
The MFCC matrices have been transposed into the matrix format Nx17, so that cepstral
coefficients become the columns of the matrix and each row represents a time unit.
Then, for each of the matrices, a sliding window of various sizes, from 3 to 20 rows, has been created by
horizontally stacking contiguous rows of the matrix, thus opening an
observation window that lets the learning algorithm evaluate several instants
of the analyzed sound at the same time.
Empirically, the author found that the best windows to feed a linear model with were in the
range of 15 to 20 horizontally stacked rows. The same observation held for other
learning algorithms, such as gradient boosting classifiers.
With a sliding window of 20 rows, the learning algorithm had 20 x 17 = 340 variables to
learn from for each example (time unit).
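As an illustration, the stacking step could be sketched as follows. This is a minimal sketch assuming NumPy; the helper name `stack_window` is hypothetical, and since the report does not say how the last frames of each file were handled, the sketch simply drops windows that would run past the final row.

```python
import numpy as np

def stack_window(mfcc, width):
    """Horizontally stack `width` contiguous rows of an (N, 17) MFCC matrix.

    Row t of the result is the concatenation of rows t, t+1, ..., t+width-1,
    so the output has shape (N - width + 1, 17 * width).
    """
    n_frames = mfcc.shape[0]
    if n_frames < width:
        raise ValueError("matrix has fewer rows than the window width")
    n_out = n_frames - width + 1
    return np.hstack([mfcc[i:i + n_out] for i in range(width)])

# Hypothetical example: 100 frames of 17 coefficients, window of 20 rows.
mfcc = np.random.default_rng(0).normal(size=(100, 17))
stacked = stack_window(mfcc, 20)
print(stacked.shape)  # (81, 340)
```

Each output row then carries the 340 variables mentioned above.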
3. Training, hyper-parameters choice
The choice was to learn a separate model for each of the 87 species involved in the study,
though the model type and the hyper-parameters were chosen globally, the same for all species.
Therefore, the author trained first 87 logistic (logistic loss: L(p,y) = log(1+exp(-y*p)))
and then 87 hinge (hinge loss: L(p,y) = max(0,1-y*p)) regression models, relying on the
computational speed of the open source software Vowpal Wabbit [2].
In order to let the learning process discriminate the different species as well as possible, for every target
species the author over-weighted its instances in the corresponding model so that the sum of the
weights of the target species was equal to the sum of the weights of the other species under
analysis (a one-against-all approach).
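The balancing rule can be illustrated with a short sketch. It assumes NumPy and that every non-target example keeps a weight of 1, which the report does not state; the helper name `one_vs_all_weights` is hypothetical.

```python
import numpy as np

def one_vs_all_weights(is_target):
    """Per-example weights: target examples are up-weighted so that their
    total weight equals the total weight of the other examples (weight 1)."""
    is_target = np.asarray(is_target, dtype=bool)
    n_pos = is_target.sum()
    n_neg = is_target.size - n_pos
    if n_pos == 0:
        raise ValueError("no example of the target species")
    weights = np.ones(is_target.size)
    weights[is_target] = n_neg / n_pos
    return weights

labels = np.array([1, 0, 0, 0, 0, 0, 1, 0, 0, 0])
w = one_vs_all_weights(labels)
print(w[labels == 1].sum(), w[labels == 0].sum())  # 8.0 8.0
```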
As for the Vowpal Wabbit hyper-parameters, the best results were obtained with 24-bit
hashing and 5 passes over the data. No regularization (L1/L2) was used.
4. Predictions from single models and ensemble
After estimating the probability of each species being present in a time unit of a
sound file (for the logistic regression models by applying the logistic link function, for the hinge
regression models by rescaling and clipping the raw results), the author simply averaged the logistic and hinge
probabilities and thereby obtained a first ensemble forecast of the presence of every
single species in every row of every transposed test MFCC matrix.
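A possible reading of this step is sketched below with NumPy. The logistic part is the standard inverse logit link; the hinge rescaling (mapping the margin range [-1, 1] linearly onto [0, 1] before clipping) is an assumption, since the report only says the results were rescaled and clipped. All function names are hypothetical.

```python
import numpy as np

def logistic_to_prob(raw):
    """Inverse logit link: map raw linear scores to (0, 1)."""
    return 1.0 / (1.0 + np.exp(-np.asarray(raw, dtype=float)))

def hinge_to_prob(raw):
    """Assumed rescaling: map the margin range [-1, 1] linearly onto [0, 1],
    then clip anything that falls outside that range."""
    return np.clip((np.asarray(raw, dtype=float) + 1.0) / 2.0, 0.0, 1.0)

def linear_ensemble(raw_logistic, raw_hinge):
    """Unweighted average of the two probability estimates."""
    return 0.5 * (logistic_to_prob(raw_logistic) + hinge_to_prob(raw_hinge))

print(linear_ensemble([0.0, 4.0], [0.0, 2.5]))
```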
In order to turn the results for single time units into overall probabilities of species
presence in each sound file, the author found empirically that applying to each test
matrix a moving average over 200 rows and retaining for each species the maximum
resulting probability yielded a prediction whose public AUC was 0.87791 and whose
private AUC was 0.87120.
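The aggregation from rows to files can be sketched as a moving average followed by a per-species maximum. This is a minimal NumPy sketch; the helper name `file_level_scores` is hypothetical, and the handling of files shorter than 200 rows (a single window over the whole file) is an assumption.

```python
import numpy as np

def file_level_scores(frame_probs, window=200):
    """Smooth each species column with a moving average of `window` rows,
    then keep the maximum smoothed value per species."""
    frame_probs = np.asarray(frame_probs, dtype=float)
    n_rows, n_species = frame_probs.shape
    w = min(window, n_rows)  # assumption: short files use one single window
    kernel = np.ones(w) / w
    smoothed = np.column_stack([
        np.convolve(frame_probs[:, j], kernel, mode="valid")
        for j in range(n_species)
    ])
    return smoothed.max(axis=0)

# Hypothetical example: 500 rows, 3 species.
p = np.zeros((500, 3))
p[100:300, 0] = 1.0   # sustained presence over 200 rows
p[100:150, 1] = 1.0   # short burst of 50 rows
scores = file_level_scores(p)
```

In this toy case a sustained detection keeps a high score while a short burst is damped, which is the intended effect of smoothing before taking the maximum.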
Direct inspection of the fitted results on the training set and of the test results showed that
the estimates had a high recall (systematically, a large number of
species had high scores for each test MFCC matrix, pointing to a likelihood of many false
positives) but were probably lacking the precision necessary to reach higher scores on the
Kaggle leaderboard. The author therefore decided to complement the linear models with a different
approach based on gradient boosting classifiers [3], as implemented in the Scikit-learn
library [4] in Python (using the class GradientBoostingClassifier).
The underlying idea was that a gradient boosting classifier (GBC), which allows interactions between variables, has
less bias than the linear models (and thus an increased precision) but suffers from
a higher variance in its estimates.
The ensemble approach required creating a new training dataset by resampling the initial one,
in order to obtain, for each target species, all the examples of the target species itself and
5% of the examples available in each training MFCC matrix.
As for the hyper-parameters, a GBC with 30 trees, a learning rate of 0.1, a maximum
depth of 10 (interaction depth) and a minimum sample split of 30 cases was used.
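For reference, these settings map onto Scikit-learn's `GradientBoostingClassifier` roughly as below. The data here are synthetic stand-ins, not the competition features, and `random_state` is added only to make the sketch reproducible.

```python
import numpy as np
from sklearn.ensemble import GradientBoostingClassifier

rng = np.random.default_rng(0)
# Synthetic stand-in for stacked MFCC rows: 400 examples, 340 variables.
X = rng.normal(size=(400, 340))
y = (X[:, 0] + X[:, 1] > 0).astype(int)  # hypothetical target-species flag

gbc = GradientBoostingClassifier(
    n_estimators=30,        # 30 trees
    learning_rate=0.1,
    max_depth=10,
    min_samples_split=30,
    random_state=0,         # assumption, for reproducibility only
)
gbc.fit(X, y)
proba = gbc.predict_proba(X)[:, 1]  # per-row probability of presence
```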
On its own, this model obtained a public AUC of 0.87779
and a private one of 0.87143 when submitted to Kaggle, results analogous to those of the ensemble of logistic and hinge
regressions.
By examining some randomly chosen predictions from the test set, it can be observed that, as
depicted in figure 1 for test sound file no. 500, an ensemble of logistic and hinge
regressions tends to polarize the results towards the high and low probability ends and to mark many
species as possibly present in the sound file.
[Figure 1: estimated probability (vertical axis, 0.0 to 1.0) for test sound file no. 500; horizontal axis ticks from 20 to 80.]
[Figure 2: estimated probability (vertical axis, 0.0 to 1.0); horizontal axis ticks from 20 to 80.]
Figure 3 shows how the previously polarized predictions have arranged
themselves into probability tiers, allowing a better probability estimation with respect to the AUC
measure.
[Figure 3: estimated probability (vertical axis, 0.0 to 1.0); horizontal axis ticks from 20 to 80.]
5. Reflections and open opportunities for improvement
The proposed approach highlights how an ensemble mixing high bias / low variance models
and low bias / high variance ones may prove an effective strategy in bioacoustics problems.
Moreover, gradient boosting classifiers are a tree-based machine learning methodology
that should, in the author's opinion, be better explored. The author recognizes that there are
further open opportunities in tuning the models' hyper-parameters and in
simplifying the ensemble strategy.
Acknowledgments
The Neural Information Processing Scaled for Bioacoustics (NIPS4B) bird song competition
was organized by BIOTOPE, ADEME, SABIOD.ORG and Prof. Hervé Glotin.
References
[1] O. Dufour, T. Artières, H. Glotin, P. Giraudet. "Clusterized Mel Filter Cepstral Coefficients and
Support Vector Machines for Bird Song Identification", The 1st International Workshop on Machine
Learning for Bioacoustics, LSIS, pp. 89-93, ICML 2013, Atlanta, USA, 2013
[2] Vowpal Wabbit by John Langford, Lihong Li, Alex Strehl, 2007
[3] Boosting and Additive Trees, in T. Hastie, R. Tibshirani, J. Friedman. Elements of
Statistical Learning, 2nd ed., 2009
[4] F. Pedregosa et al. Scikit-learn: Machine learning in Python. Journal of Machine Learning
Research, 12:2825-2830, 2011
Wei Chen
Institute for Infocomm Research,
Agency for Science, Technology and Research (A*STAR), Singapore
[email protected]
Gang Zhao
School of Computing, National University of Singapore,
Singapore
[email protected]
Xiaohui Li
Institute for Infocomm Research,
Agency for Science, Technology and Research (A*STAR), Singapore
[email protected]
Abstract
Bioacoustic data science aims at analyzing and modeling animal sounds for neuroethology and biodiversity assessment. The goal of the competition is to automatically
identify which species of bird are present in an audio recording using supervised
learning. Devising effective algorithms for bird species classification is a preliminary step toward extracting useful ecological data from recordings collected in the
field.
In the competition, we analyze a real-world dataset which contains 1000 continuous
wild recordings from different places in Provence, France. We identify prominent
features by windowing MFCCs with overlap and leverage them to build an ensemble classifier which is a blend of different classifiers (Gradient Boosting Tree
models, Random Forest models, Lasso and elastic-net regularized generalized
linear models, etc.). Our evaluation and the final private leaderboard show that our
Team DB2 method was capable of classifying a large number of the bird songs,
which put us in 4th place in the final ranking.
Our preprocessing is based on mel-frequency cepstral coefficients (MFCC), which have proved useful for bird song
recognition [1,2]. A signal is first transformed into a series of frames, where each frame consists of a
feature vector of 17 MFCCs, including energy. Each frame represents
a short duration (e.g. 512 samples of a signal sampled at 44 kHz).
We also follow [1] for the windowing, silence removal and feature extraction steps. In the
final step, a reduced set of features can be computed for any remaining segment / window. In the final
step, each segment consists of a series of n 17-dimensional feature vectors (with n on
the order of hundreds).
Training
Our solution consists of a blend of many single predictors. The standard way of training a single
predictor consists of two steps. In the first step, a validation set is created from the training dataset
and the model is trained on the remaining data. The predictions for the validation set are then stored.
In the second step, training is done on all available data with the same meta-parameters as in the first
step, such as the number of trees in the gradient boosting machine. Last, the predictions for the
test set are stored. We use the following algorithms to train the models:
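The two-step scheme can be sketched generically. The snippet below assumes NumPy; `CentroidModel` is a toy stand-in for a real predictor and `train_single_predictor` is a hypothetical helper, neither taken from the paper.

```python
import numpy as np

class CentroidModel:
    """Toy stand-in for a real predictor (GBM, random forest, ...)."""
    def fit(self, X, y):
        self.c0 = X[y == 0].mean(axis=0)
        self.c1 = X[y == 1].mean(axis=0)
        return self

    def predict(self, X):
        d0 = np.linalg.norm(X - self.c0, axis=1)
        d1 = np.linalg.norm(X - self.c1, axis=1)
        return d0 / (d0 + d1 + 1e-12)  # score in [0, 1], higher means class 1

def train_single_predictor(make_model, X, y, X_test, valid_idx):
    valid_mask = np.zeros(len(y), dtype=bool)
    valid_mask[valid_idx] = True
    # Step 1: fit on the remaining data, store the validation predictions.
    model = make_model().fit(X[~valid_mask], y[~valid_mask])
    valid_pred = model.predict(X[valid_mask])
    # Step 2: refit on all data with the same meta-parameters, predict test.
    test_pred = make_model().fit(X, y).predict(X_test)
    return valid_pred, test_pred

rng = np.random.default_rng(0)
X = np.vstack([rng.normal(-2, 1, (50, 3)), rng.normal(2, 1, (50, 3))])
y = np.array([0] * 50 + [1] * 50)
X_test = rng.normal(0, 2, (20, 3))
valid_pred, test_pred = train_single_predictor(
    CentroidModel, X, y, X_test, valid_idx=np.arange(0, 100, 5))
```

The stored validation predictions are what a later blending stage would use to fit its combination coefficients.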
2.1 Gradient boosting
Gradient boosting is a machine learning technique used for classification problems with a suitable
loss function, which produces a final prediction model in the form of an ensemble of weak
decision-tree predictors [3].
2.2 RandomForest
Random forests [4] are a combination of tree predictors such that each tree depends on the values of
a random vector sampled independently and with the same distribution for all trees in the forest. We
set the number of trees to 500.
2.3 Elastic net
In statistics and, in particular, in the fitting of linear or logistic regression models, the elastic net is a
regularized regression method that combines the L1 and L2 penalties of the lasso and ridge methods.
We used the R package implementations of Generalized Boosted Regression Models (GBM), RandomForest and
Lasso and elastic-net regularized generalized linear models (GLMNET).
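To make the combination of the two penalties concrete, here is a small sketch of the elastic-net penalty term in the parameterization commonly used by glmnet, lambda * (alpha * ||beta||_1 + (1 - alpha) / 2 * ||beta||_2^2). The values of `lam` and `alpha` are illustrative; the paper does not report the ones it used.

```python
import numpy as np

def elastic_net_penalty(beta, lam, alpha):
    """Elastic-net penalty: alpha = 1 gives the lasso (L1) penalty,
    alpha = 0 gives the ridge (L2) penalty."""
    beta = np.asarray(beta, dtype=float)
    l1 = np.abs(beta).sum()
    l2 = 0.5 * np.square(beta).sum()
    return lam * (alpha * l1 + (1.0 - alpha) * l2)

beta = [3.0, -4.0]
print(elastic_net_penalty(beta, lam=1.0, alpha=1.0))  # 7.0  (lasso)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.0))  # 12.5 (ridge)
print(elastic_net_penalty(beta, lam=1.0, alpha=0.5))  # 9.75 (mix)
```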
Blending
As blending algorithm we use simple linear blending, which is easy to
implement. The most basic blending method is to compute the final prediction simply as the mean
over all the predictions in the ensemble. Better results can be obtained if the final prediction is
given by a linear combination of the ensemble predictions. In this case, the combination coefficients
have to be determined by an optimization procedure, in general by regularized linear regression. In our
case, all inputs (the predictions) are normalized to [0, 1]. We use the linear blending approach, and
the coefficients are determined by cross-validation performance on average precision. For validation
we use 10-fold cross-validation. The performance of each individual model and of the blend is shown
in table 1.
Table 1: Performance of the different algorithms
Method          ROC (public)    ROC (private)
GBM             88.762%         88.732%
RandomForest    88.850%         88.239%
GLMNET          86.079%         86.332%
Blending        89.740%         89.624%
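Linear blending as described in this section can be sketched as follows, assuming NumPy. The min-max normalization and the choice of weights proportional to a validation score are assumptions made for illustration; the paper does not spell out how the coefficients were derived from the cross-validation results, and the scores used below are made up.

```python
import numpy as np

def minmax(p):
    """Rescale one model's predictions to [0, 1]."""
    p = np.asarray(p, dtype=float)
    span = p.max() - p.min()
    return (p - p.min()) / span if span > 0 else np.zeros_like(p)

def blend(predictions, weights=None):
    """Weighted mean of normalized predictions; plain mean if no weights."""
    P = np.column_stack([minmax(p) for p in predictions])
    if weights is None:
        weights = np.ones(P.shape[1])
    w = np.asarray(weights, dtype=float)
    return P @ (w / w.sum())

# Hypothetical predictions of two models on three examples.
model_a = [0.0, 2.0, 4.0]
model_b = [4.0, 2.0, 0.0]
mean_blend = blend([model_a, model_b])
weighted_blend = blend([model_a, model_b], weights=[3.0, 1.0])  # made-up scores
```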
Besides the overall performance, we also investigated the performance on individual classes. We found that classes 38, 57 and 80 are easy to predict, with AUC above 98%, whereas
classes 20, 66, 78 and 85 are difficult to predict. We suspect that this is due to data sparsity,
which hurts the performance of the classifiers.
Conclusion
We described our approach to the NIPS4B challenge. In summary, we extract features from
MFCCs with time overlap, using various statistics such as velocity and acceleration. With these
features, we build our classifiers using gradient boosting machines, RandomForest
and Lasso and elastic-net regularized generalized linear models. Each single predictor is trained
individually. A linear blend of the single predictors is used for the final prediction. The experiments
presented in this paper, and the ranking on the private leaderboard¹, suggest that our methods are
effective in multi-label and multi-instance classification tasks.
Acknowledgments
We would like to thank the BIOTOPE society for collecting, labeling and preparing the dataset, and
SABIOD-LSIS with ADEME for having organized this great competition under the supervision of
Pr. Glotin. We would also like to thank the other competitors, who made this competition interesting.
References
[1] Dufour, O., Artières, T., Glotin, H. and Giraudet, P. Clusterized Mel Filter Cepstral Coefficients and
Support Vector Machines for Bird Song Identification. The 1st International Workshop on Machine Learning
for Bioacoustics (ICML 2013).
[2] Briggs, F., Raich, R. and Fern, X.Z. Audio Classification of Bird Species: A Statistical Manifold Approach.
Data Mining, 2009. ICDM '09. Ninth IEEE International Conference.
[3] Jerome H. Friedman. 2002. Stochastic gradient boosting. Comput. Stat. Data Anal.
[4] Leo Breiman. 2001. Random Forests. Mach. Learn.
¹ http://www.kaggle.com/c/multilabel-bird-species-classification-nips2013
Chapter 8
Whale Song Clustering
8.1 Analyzing the temporal structure of sound production modes within
humpback whale sound sequences
Mercado III E.
Introduction
The first step in most past structural analyses of humpback whale songs has
been to sort individual sounds into discrete types (typically based on
subjective visual or aural criteria). Even when automated sorting techniques
such as self-organizing maps have been used to sort individual sounds [1], the
success of these quantitative methods has been judged relative to how human
observers sort the same sounds. This approach is highly problematic because
many sounds that humans judge to be different might be perceptually
equivalent for humpback whales and sounds that humans judge to be highly
Methods
001_26min.wav).
The time-domain waveform was analyzed in its raw form, with no preprocessing. The recording was segmented into sequential 100 ms duration
frames and each frame was converted into a single duty-cycle measure. Duty-cycle was calculated regardless of whether the frame corresponded to a sound
produced by the singer or to a silent interval between sounds (for details of
duty-cycle calculation, see [2,5]). The duty-cycle measure provides a ratio
scale measurement in which zero corresponds to no deflections. Trains of
pulses produce a lower duty-cycle value (near 0) and more sinusoidal signals
produce a higher value (1 for a perfect sinusoid). This measure has previously
been used to analyze the vocalizations of false killer whales [5] and singing
humpback whales [2]. The sequence of duty-cycle (DC) measures is referred
to hereafter as a DC-gram. Conversion of the 26 min recording into a DC-gram (15410 elements, 32 kB file) took ~10 s using Matlab R2010b on a 3.1
GHz Apple iMac.
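The framing step can be illustrated with a short NumPy sketch. The exact duty-cycle formula is given in [2,5] and is not reproduced here; the per-frame measure below is only a stand-in chosen to share the properties stated above (0 for silence, near 0 for sparse pulse trains, 1 for a sinusoid), namely (pi/2) * mean|x| / max|x|. The function name `dc_gram_proxy` is hypothetical, and dropping a trailing partial frame is an assumption.

```python
import numpy as np

def dc_gram_proxy(signal, sample_rate, frame_s=0.1):
    """Split a waveform into sequential frames of `frame_s` seconds and
    compute one value per frame. NOT the published duty-cycle measure:
    a stand-in with the same qualitative behaviour."""
    signal = np.asarray(signal, dtype=float)
    frame_len = int(round(frame_s * sample_rate))
    n_frames = len(signal) // frame_len          # trailing partial frame dropped
    frames = np.abs(signal[:n_frames * frame_len]).reshape(n_frames, frame_len)
    peak = frames.max(axis=1)
    mean_abs = frames.mean(axis=1)
    out = np.zeros(n_frames)                     # silent frames stay at 0
    nonzero = peak > 0
    out[nonzero] = (np.pi / 2.0) * mean_abs[nonzero] / peak[nonzero]
    return out

sr = 8000                                        # hypothetical sample rate
t = np.arange(sr) / sr
sine = np.sin(2 * np.pi * 100 * t)               # 1 s of a 100 Hz tone
pulses = np.zeros(sr)
pulses[::100] = 1.0                              # sparse pulse train
dc_sine = dc_gram_proxy(sine, sr)
dc_pulses = dc_gram_proxy(pulses, sr)
```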
3 Results
Discussion
Figure 2. Spectrographic DC-gram of a 26 min song bout. In the first 9 min of singing, the whale
slows down the rhythm of sound production, while maintaining production rate (~0.5 Hz). After 10 min,
the whale shifts to a rhythm that closely matches the rate of sound production. Near 12 min, the whale
transitions through several rhythms before settling into a new mode near the 13 min mark.
Interestingly, the time spent in this new mode matches that spent in the rate-synchronized mode; later,
near the 17 min mark, the whale repeats this temporal pattern, again producing each mode for ~1 min.
Acknowledgments
I thank S. Handel and H. Glotin for providing useful feedback on an earlier
version of this manuscript.
References
[1] R. Suzuki, J. R. Buck, and P. L. Tyack, "Information entropy of humpback whale songs," J.
Acoust. Soc. Am. 119, 1849-1866 (2006).
[2] E. Mercado, III, J. N. Schneider, A. A. Pack, and L. M. Herman, "Sound production by singing
humpback whales," J. Acoust. Soc. Am. 127, 2678-2691 (2010).
[3] J. Paulus, M. Müller, and A. Klapuri, "Audio-based music structure analysis," Proceedings of the
International Society for Music Information Retrieval, 625-636 (2010).
[4] R. J. Weiss, and J. P. Bello, "Unsupervised discovery of temporal structure in music," IEEE
Journal of Selected Topics in Signal Processing 5, 1240-1251 (2011).
[5] S. O. Murray, E. Mercado, III, and H. L. Roitblat, "Characterizing the graded structure of false
killer whale (Pseudorca crassidens) vocalizations," J. Acoust. Soc. Am. 104, 1680-1688 (1998).
[6] S. Saar, and P. P. Mitra, "A technique for characterizing the development of rhythms in bird song,"
PLoS ONE 3, e1461 (2008).
[7] S. Handel, S. K. Todd, and A. M. Zoidis, "Rhythmic structure in humpback whale (Megaptera
novaeangliae) songs: preliminary implications for song production and perception," J. Acoust. Soc.
Am. 125, EL225-230 (2009).
[8] S. Handel, S. K. Todd, and A. M. Zoidis, "Hierarchical and rhythmic organization in the songs of
humpback whales (Megaptera novaeangliae)," Bioacoustics 21, 141-156 (2012).
[9] E. Mercado, III, and S. Handel, "Understanding the structure of humpback whale songs," J.
Acoust. Soc. Am. 132, 2947-2950 (2012).
[10] F. Pace, F. Benard, H. Glotin, O. Adam, and P. White, "Subunit definition and analysis for
humpback whale call classification," Appl. Acoust. 71, 1107-1112 (2010).
Abstract
In this work we propose to extend the finite parsimonious Gaussian mixture to
the infinite case so that the classification of our data can be performed in one
stage. We applied the eigenvalue decomposition of the covariance matrix
of each cluster to the infinite Gaussian mixture model, making it parsimonious.
We developed an MCMC algorithm (Gibbs sampling) to learn the various models
and we named this approach the Bayesian non-parametric parsimonious approach
for cluster analysis. The new approach is more flexible in terms of modeling
and automatically provides the partition of the data and the number of clusters.
This approach is applied to the challenging problem of whale song decomposition of the NIPS4B challenge. These algorithms could also give efficient clustering
of complex sequences of pulses, and may then allow multi-source/multi-animal
labelling.
Introduction
Clustering is one of the essential tasks in machine learning and statistics. One of the main problems
in data analysis is to estimate the number of clusters that best fits the data. Different
approaches can be found in the literature, one of the most popular being model-based clustering [1, 2].
Finite parsimonious Gaussian mixtures rely on the eigenvalue decomposition of the covariance matrix, allowing the models to range from the simplest spherical one to the most general one
[3]. The model parameters can be estimated in a Maximum Likelihood (ML) framework by the
Expectation Maximization (EM) algorithm [4], in a Maximum A Posteriori (MAP) estimation [5]
framework, or by using MCMC sampling techniques [6, 7]. In this approach, as in standard
model-based clustering techniques, the number of clusters is selected by using
penalized likelihood criteria such as the Bayesian Information Criterion (BIC) [8], the Akaike Information Criterion (AIC) [9], the Integrated Classification Likelihood (ICL) [10], etc. Classification therefore requires two
stages: first estimating the number of clusters and then running the EM algorithm for the
classification of the data.
An alternative, well-principled approach to the difficult problem of model selection is to use
Bayesian Non-Parametric (BNP) [11] methods for clustering, one of them being the infinite Gaussian mixture model (IGMM) [12]. The principle of the IGMM is based on that of the Chinese
Restaurant Process (CRP) [13, 14, 15, 16, 17], which is well suited to the problem of non-parametric
clustering. This alternative makes it possible to obtain the number of clusters in the same
stage as the clustering, so that the number of model parameters can change as new data are observed.
The general (full GMM) model used in the IGMM is not as flexible as
model-based clustering [3, 18], where the covariance matrix can take different forms, depending on
the volume, shape and orientation. Therefore we propose to develop a new approach that remains
an infinite Gaussian mixture model approach, which gives us the possibility to automatically
provide the number of classes, but now with an eigenvalue decomposition of the covariance matrix
giving more flexibility to the model.
The paper is organized as follows. Section 2 briefly discusses previous work on finite Gaussian
mixture clustering, in particular the model-based clustering approach. Section 3
presents the proposed approach, and Section 4 shows experimental results of applying to the
whale song decomposition NIPS4B challenge the EM algorithm in the ML and MAP frameworks
and the proposed Bayesian non-parametric parsimonious approach.
with $f_k$ a distribution with parameters $\theta_k$ and $\pi_k$ the non-negative mixing proportions that sum to
one.
In particular, we assume the multivariate Gaussian Mixture Model (GMM) [1] to cluster the
data $X$, so that $f_k$ is a multivariate Gaussian distribution (equation 2) with
parameters $\theta_k = (\mu_k, \Sigma_k)$, which are respectively the mean vector and the covariance matrix of
the $k$th Gaussian component density.
$$f_k(x_i|\theta_k) = \mathcal{N}_k(x_i|\mu_k, \Sigma_k) = \frac{1}{|2\pi\Sigma_k|^{1/2}} \exp\left(-\frac{1}{2}(x_i - \mu_k)^T \Sigma_k^{-1} (x_i - \mu_k)\right) \qquad (2)$$
The finite parsimonious GMM, through the eigenvalue decomposition of the covariance matrix, makes the
model more flexible, making it possible to vary each cluster density by volume, orientation and
shape. The parametrization of the covariance matrix is given in equation 3.
$$\Sigma_k = \lambda_k D_k A_k D_k^T \qquad (3)$$
where $\lambda_k$ is a scalar that defines the volume, $D_k$ is an orthogonal matrix that defines the orientation and
$A_k$ is a diagonal matrix with determinant 1 which defines the shape. This decomposition leads to
fourteen flexible models [3], going from the simplest spherical models to the complex general one.
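The volume, orientation and shape terms of a given covariance matrix can be recovered numerically. The sketch below assumes NumPy and the usual convention that the volume is the d-th root of the determinant, which is what makes the shape matrix have determinant 1; the helper name `decompose_covariance` is hypothetical.

```python
import numpy as np

def decompose_covariance(sigma):
    """Return (lam, D, A) such that sigma = lam * D @ A @ D.T,
    with D orthogonal and A diagonal with determinant 1."""
    eigvals, eigvecs = np.linalg.eigh(sigma)   # symmetric eigendecomposition
    d = sigma.shape[0]
    lam = np.prod(eigvals) ** (1.0 / d)        # volume: det(sigma)^(1/d)
    A = np.diag(eigvals / lam)                 # shape: normalized eigenvalues
    return lam, eigvecs, A

sigma = np.array([[4.0, 1.0],
                  [1.0, 2.0]])
lam, D, A = decompose_covariance(sigma)
rebuilt = lam * D @ A @ D.T                    # should reproduce sigma
```

Constraining some of these three terms to be equal across clusters, or forcing the orientation or shape to be the identity, is what yields the more parsimonious members of the model family.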
One of the most widely used algorithms for learning the model is the Expectation Maximization (EM) algorithm, which maximizes the likelihood [19, 20]. It is an iterative algorithm consisting of two steps: the
expectation of the complete-data log-likelihood, named the E-step, and the maximization of the expected complete-data log-likelihood, named the M-step. The ML framework
maximizes the mixture likelihood $p(X|\pi_k, \mu_k, \Sigma_k)$:
$$p(X|\pi_k, \mu_k, \Sigma_k) = \prod_{i=1}^{n} \sum_{k=1}^{K} \pi_k \, \mathcal{N}_k(x_i|\mu_k, \Sigma_k)$$
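In practice the product over observations is evaluated in the log domain. The following NumPy sketch computes the mixture log-likelihood for given parameters; it illustrates the quantity being maximized and is not an EM implementation. The function names are hypothetical.

```python
import numpy as np

def gaussian_logpdf(X, mu, sigma):
    """Log-density of a multivariate Gaussian at each row of X."""
    d = X.shape[1]
    diff = X - mu
    _, logdet = np.linalg.slogdet(sigma)
    maha = np.einsum("ij,ij->i", diff, np.linalg.solve(sigma, diff.T).T)
    return -0.5 * (d * np.log(2 * np.pi) + logdet + maha)

def gmm_log_likelihood(X, pis, mus, sigmas):
    """Sum over i of log sum over k of pi_k * N(x_i | mu_k, Sigma_k)."""
    log_terms = np.stack(
        [np.log(p) + gaussian_logpdf(X, m, s)
         for p, m, s in zip(pis, mus, sigmas)], axis=1)
    mx = log_terms.max(axis=1, keepdims=True)  # log-sum-exp for stability
    return float(np.sum(mx[:, 0] + np.log(np.exp(log_terms - mx).sum(axis=1))))

# One standard normal component evaluated at x = 0.
X = np.array([[0.0]])
ll = gmm_log_likelihood(X, [1.0], [np.array([0.0])], [np.array([[1.0]])])
```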
Maximum a posteriori estimation (MAP framework) can also be performed by the EM algorithm [5].
It consists in adding a prior on the mixture parameters, so that the algorithm maximizes the following posterior
parameter distribution $p(\theta|X)$:
$$p(\theta|X) \propto p(\theta)\,p(X|\theta)$$
where the prior $p(\theta)$ on the parameters of the mixture factorizes into two terms. The literature also offers
various extensions of the EM algorithm, such as CEM and GEM, that could also be used to learn
the model. Another alternative for learning the models are Markov Chain Monte Carlo (MCMC)
algorithms (such as Gibbs sampling) [7, 21, 22].
However, before learning the model with one of these finite Gaussian mixture approaches, we must
determine the number of mixture components in our model. To that end we set $K_{max}$, the maximum
possible number of clusters, and we compute a penalized log-likelihood criterion (BIC, AIC, ICL,
etc.) for each candidate number of clusters. After choosing the optimal number of clusters, the one that best fits the data, we can run one of the
learning algorithms.
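This two-stage procedure can be illustrated with Scikit-learn's `GaussianMixture`, which fits a finite GMM by EM and exposes a `bic` method (lower is better in its convention). The data below are synthetic and `k_max = 5` is an arbitrary choice; this is a sketch of the selection loop, not the authors' code.

```python
import numpy as np
from sklearn.mixture import GaussianMixture

rng = np.random.default_rng(0)
# Synthetic data: two well-separated 2-D Gaussian clusters.
X = np.vstack([rng.normal(0.0, 1.0, size=(200, 2)),
               rng.normal(8.0, 1.0, size=(200, 2))])

k_max = 5                       # maximum number of clusters considered
bic = {}
for k in range(1, k_max + 1):
    gmm = GaussianMixture(n_components=k, covariance_type="full",
                          random_state=0).fit(X)
    bic[k] = gmm.bic(X)         # penalized likelihood criterion

best_k = min(bic, key=bic.get)  # stage 1: choose the number of clusters
final = GaussianMixture(n_components=best_k, random_state=0).fit(X)
labels = final.predict(X)       # stage 2: cluster with the chosen K
```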
First of all, we note that the term non-parametric learning does not mean at all that the
model has no parameters; rather, it means that it could have an infinite number of them as
the data grow. In other words, it is assumed that the observed data are governed by an infinite
number of clusters, but that only a finite number of them actually generates the data. Bayesian non-parametric (BNP) mixtures for clustering offer a good alternative for inferring the number of clusters
from the data in one stage, rather than in two stages as in parametric modeling
[11, 23, 24, 12]. The BNP approach places a prior on infinite partitions in such a way that
a finite number of clusters will be active. We could use the Chinese Restaurant Process (CRP) prior
[25, 26, 23] or a Dirichlet Process Mixture (DPM) [27, 23, 28].
In this work we propose to develop the previous work called the infinite Gaussian mixture model
[12], based on the full GMM, by extending it to a more flexible mixture model where the covariance matrix has an eigenvalue decomposition [3, 18]. We call the new approach the Bayesian non-parametric parsimonious approach. We assume the Chinese Restaurant Process (CRP) prior for the
cluster assignments.
Indeed, the CRP provides a distribution on the infinite partitions of the data, that is, a distribution over
the positive integers $1, \ldots, n$. Considering the following joint distribution of the unknown cluster
assignments, $p(z_1, \ldots, z_n) = p(z_1)p(z_2|z_1)p(z_3|z_1, z_2) \ldots p(z_n|z_1, z_2, \ldots, z_{n-1})$, we can compute
each term by using the CRP distribution. The Chinese Restaurant Process can be
illustrated by a real-life situation: suppose a restaurant that can be extended in real time
by adding an unlimited number of tables as the number of customers grows.
The CRP is then explained as follows. A first customer
visits this restaurant, enters and sits at the first table. When the second customer enters
1
if k > K+
i1+
where K+ is the number of tables that have customers sitting on that table nk > 0 or it is also known
as active classes. We note k K+ when the k-th table is occupied or in clustering problem the new
data observed will be associated to the k-th cluster and k > K+ when a new table will be occupied
or the new observation will form a new cluster.
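The seating rule can be illustrated with a short simulation. The following is a minimal sketch, not the authors' implementation: it assumes the standard CRP, in which a customer joins an occupied table with probability proportional to its occupancy n_k and opens a new table with probability proportional to the concentration parameter alpha. The function names are hypothetical, and cluster labels are assumed to be numbered in order of first appearance.

```python
import math
import random


def crp_probabilities(counts, alpha):
    """Seating probabilities for the next customer.

    counts[k] is n_k, the number of customers already at table k.
    Returns len(counts) + 1 values; the last one is for a new table.
    """
    total = sum(counts) + alpha  # equals i - 1 + alpha for the i-th customer
    return [n_k / total for n_k in counts] + [alpha / total]


def sample_crp(n, alpha, rng):
    """Draw assignments z_1..z_n (0-based table labels) from the CRP."""
    counts, z = [], []
    for _ in range(n):
        probs = crp_probabilities(counts, alpha)
        k = rng.choices(range(len(probs)), weights=probs)[0]
        if k == len(counts):
            counts.append(0)  # the customer opens a new table
        counts[k] += 1
        z.append(k)
    return z


def log_joint(z, alpha):
    """log p(z_1..z_n) via the chain rule p(z_1) p(z_2|z_1) ...

    Assumes labels are numbered in order of first appearance.
    """
    counts, logp = [], 0.0
    for k in z:
        logp += math.log(crp_probabilities(counts, alpha)[k])
        if k == len(counts):
            counts.append(0)
        counts[k] += 1
    return logp


z = sample_crp(20, alpha=1.0, rng=random.Random(0))
print(z, "active tables:", len(set(z)))
```

Under this rule the joint probability depends only on the table sizes, not on the arrival order, which is what makes Gibbs sampling over the assignments convenient.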
A prior is also placed on the mixture parameters, as in the MAP approach or in MCMC Gibbs sampling. These priors are chosen to be conjugate; for example, for a full GMM we use the normal inverse-Wishart prior for the mean and the covariance matrix. We denote this prior distribution by $G_0$, which leads to the following generative process:

$$\theta_i \sim G_0 \qquad (5)$$
$$z_i \sim \mathrm{CRP}(z_1, \dots, z_{i-1}; \alpha) \qquad (6)$$
$$x_i \sim p(\,\cdot \mid \theta_{z_i}) \qquad (7)$$
According to this generative process, the $\theta_i$ exhibit a clustering property: the number of unique parameter values is the number of mixture components that fit the data. $G_0$ is called the base distribution [27, 23]. As discussed above, the distribution over the partition $z_i$ is a CRP distribution.
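To make the generative process concrete, here is a minimal sketch under simplifying assumptions that are not those of the paper: observations are one-dimensional, each cluster parameter is only a mean, the base distribution G0 is taken to be a normal distribution (instead of the normal inverse-Wishart), and the observation noise is fixed. All names and default values are hypothetical.

```python
import random


def generate(n, alpha, rng, base_mean=0.0, base_std=10.0, noise_std=1.0):
    """Sample (z, thetas, x) from a toy CRP mixture.

    thetas[k] ~ G0 (here a normal, an assumption for illustration),
    z_i ~ CRP(z_1..z_{i-1}; alpha), and x_i ~ Normal(thetas[z_i], noise_std).
    """
    counts, thetas, z, x = [], [], [], []
    for _ in range(n):
        # Unnormalised CRP weights: n_k for occupied tables, alpha for a new one.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)
            thetas.append(rng.gauss(base_mean, base_std))  # new cluster parameter
        counts[k] += 1
        z.append(k)
        x.append(rng.gauss(thetas[k], noise_std))
    return z, thetas, x


z, thetas, x = generate(200, alpha=1.0, rng=random.Random(1))
print("clusters used:", len(thetas))
```

The number of distinct parameter values (the length of `thetas`) is the number of active clusters, which grows slowly with the number of observations.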
We propose the infinite parsimonious Gaussian mixture, in which the covariance matrix is parameterized in terms of its eigenvalue decomposition to make the model more flexible. The priors on the parameters therefore depend on the type of parsimonious model. Since the model is learned by MCMC Gibbs sampling [12, 29, 16, 23], the sampling scheme differs according to the covariance matrix decomposition.
So far we have investigated seven parsimonious models, covering the three families of mixture models: the general, the diagonal and the spherical family. The parsimonious models
therefore range from the simplest spherical model to the most general full model. Table 1 summarizes the models considered and the prior used for each of them in Gibbs sampling.
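| Nr. | Decomposition | Model type | Prior | Applied to |
|-----|---------------|------------|-------|------------|
| 1 | $\lambda I$ | Spherical | IG | $\lambda$ |
| 2 | $\lambda_k I$ | Spherical | IG | $\lambda_k$ |
| 3 | $\lambda B$ | Diagonal | IG | each diagonal element of $\lambda B$ |
| 4 | $\lambda_k B$ | Diagonal | IG | each diagonal element of $\lambda_k B$ |
| 5 | $\lambda D A D^T$ | General | IW | $\Sigma = \lambda D A D^T$ |
| 6 | $\lambda_k D A D^T$ | General | IG and IW | $\lambda_k$ and $\Sigma = D A D^T$ |
| 7 | $\lambda_k D_k A_k D_k^T$ | General | IW | $\Sigma_k = \lambda_k D_k A_k D_k^T$ |

Table 1: Parsimonious GMMs considered via eigenvalue decomposition, and the associated prior distribution for the covariance. I denotes an inverse distribution, G a Gamma distribution and W a Wishart distribution.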
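The decomposition can be made tangible in two dimensions. The sketch below is an illustration under stated assumptions, not the authors' code: it builds a covariance matrix from a volume (lambda), an orientation (D, a rotation by a given angle) and a shape (A, a diagonal matrix with determinant 1, following the usual convention of [3]). The function name and the parameterization of A as diag(shape, 1/shape) are choices made here for illustration.

```python
import math


def covariance_2d(volume, angle, shape):
    """Return Sigma = volume * D * A * D^T for a 2-D cluster.

    D is the rotation matrix [[c, -s], [s, c]] for `angle` (orientation),
    A = diag(shape, 1 / shape) has determinant 1 (shape),
    and `volume` plays the role of lambda.
    """
    c, s = math.cos(angle), math.sin(angle)
    a1, a2 = shape, 1.0 / shape
    # D A D^T written out entry by entry.
    s11 = volume * (a1 * c * c + a2 * s * s)
    s12 = volume * (a1 - a2) * c * s
    s22 = volume * (a1 * s * s + a2 * c * c)
    return [[s11, s12], [s12, s22]]


spherical = covariance_2d(volume=2.0, angle=0.7, shape=1.0)  # lambda * I
general = covariance_2d(volume=2.0, angle=0.7, shape=3.0)    # lambda * D A D^T
det = general[0][0] * general[1][1] - general[0][1] ** 2
print(spherical, general, det)
```

With shape 1 the orientation is irrelevant and the matrix reduces to the spherical model; in two dimensions the determinant of the general matrix equals the squared volume, which is why the volume can be constrained independently of shape and orientation.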
Experiments
[Figure 1 panels: $\lambda_k D A D^T$, $\lambda_k I$, $\lambda_k B$, $\lambda_k D_k A_k D_k^T$]
Figure 1: Posterior distribution of the number of clusters obtained by the proposed Bayesian non-parametric approach.
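A posterior distribution over the number of clusters such as the one in Figure 1 is typically estimated from the Gibbs sampler output by counting how often each number of active clusters occurs after a burn-in period. The following is a minimal sketch of that counting step; the trace values are invented for illustration and are not results from the paper.

```python
from collections import Counter


def posterior_of_k(k_samples, burn_in=0):
    """Empirical posterior over the number of active clusters.

    k_samples[t] is the number of active clusters at Gibbs iteration t;
    the first `burn_in` iterations are discarded.
    """
    kept = k_samples[burn_in:]
    counts = Counter(kept)
    return {k: c / len(kept) for k, c in sorted(counts.items())}


# Hypothetical trace of the number of active clusters over 10 iterations.
k_trace = [20, 18, 16, 15, 15, 14, 15, 16, 15, 15]
posterior = posterior_of_k(k_trace, burn_in=2)
most_probable_k = max(posterior, key=posterior.get)
print(posterior, most_probable_k)
```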
Table 2 shows the log-likelihood values (divided by $10^6$) and the estimated number of classes obtained with the Expectation Maximization (EM) algorithm combined with an information criterion, and with the proposed Bayesian non-parametric parsimonious method. From these results we conclude that the best solution is given by the most general model, with the eigenvalue decomposition $\lambda_k D_k A_k D_k^T$ of the covariance matrix, in which the volume, the orientation and the shape can vary from cluster to cluster. The best likelihood is obtained by EM in the maximum a posteriori framework, which estimates 18 classes, whereas the Bayesian non-parametric model estimates 15 classes. For the spherical models, one with equal volumes ($\lambda I$) and the other with different volumes ($\lambda_k I$), the finite Gaussian mixture models (GMM) select the maximum number of classes allowed, 60, while the infinite models estimate 9 classes for $\lambda I$ and 23 classes for $\lambda_k I$. Among the diagonal models, the equal-volume model $\lambda B$ yields 22 classes for the finite mixture with either the EM ML or the EM MAP approach under the Integrated Classification Likelihood (ICL) criterion, and 18 classes with the proposed non-parametric Bayesian clustering (see footnote 1 for the missing table entries).
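Table 2: Log-likelihood values (divided by $10^6$) and number of estimated classes K obtained for the whale song data set with the Expectation Maximization approach using maximum likelihood (EM ML), the Expectation Maximization approach using maximum a posteriori (EM MAP), and the proposed Bayesian parsimonious approach (IPGMM).

| Model | EM ML: K | EM ML: log-lik | EM MAP: K | EM MAP: log-lik | IPGMM: K | IPGMM: log-lik |
|-------|----------|----------------|-----------|-----------------|----------|----------------|
| $\lambda I$ | 60 | −2.2198 | 60 | −2.1924 | 9 | −2.3413 |
| $\lambda_k I$ | 60 | −2.1129 | 60 | −2.0858 | 23 | −2.2133 |
| $\lambda B$ | 22 | −2.1435 | 22 | −2.1339 | 18 | −2.1958 |
| $\lambda_k B$ | 59 | −2.0059 | 53 | −1.9595 | 11 | −2.1900 |
| $\lambda D A D^T$ | n/a | n/a | 34 | −2.0815 | 33 | −2.1695 |
| $\lambda_k D A D^T$ | 51 | −1.9811 | n/a | n/a | 24 | −2.1589 |
| $\lambda_k D_k A_k D_k^T$ | 19 | −1.9418 | 18 | −1.9381 | 15 | −2.1234 |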
Figure 2 shows spectrograms of the whale song units obtained with the proposed Bayesian non-parametric approach and the most general model $\lambda_k D_k A_k D_k^T$. We chose this model because it gives the best log-likelihood among the solutions of the new method. The vertical axis shows frequency and the horizontal axis shows frames, each frame lasting 10 ms. As Table 2 shows, the infinite Gaussian mixture model finds 15 clusters for the $\lambda_k D_k A_k D_k^T$ model; Figure 2 shows the 6 song units that last longer than 10 ms.
Classifying the whale song data with the infinite Gaussian mixture model under the most general model $\lambda_k D_k A_k D_k^T$, Figure 3 shows which song unit (class) is assigned to each observation. Classes 8, 12 and 15 are activated uniformly in time, so they probably represent the sea noise, whereas classes 10, 13 and 14 clearly convey information (low entropy).
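One possible way to quantify "uniformly activated in time" versus "low entropy" is the normalised entropy of the temporal distribution of each class. The sketch below is an assumption about how such a measure could be computed, since the paper does not specify its entropy estimator; the number of time bins, the labels and the toy sequence are all hypothetical.

```python
import math


def activation_entropy(frame_labels, cluster, n_bins=10):
    """Normalised entropy (0 to 1) of when `cluster` is active.

    The recording is cut into `n_bins` equal time bins; a value close to 1
    means the class is spread uniformly over time, a value close to 0 means
    it is concentrated in a few moments.
    """
    n = len(frame_labels)
    bins = [0] * n_bins
    for i, label in enumerate(frame_labels):
        if label == cluster:
            bins[min(i * n_bins // n, n_bins - 1)] += 1
    total = sum(bins)
    if total == 0:
        return 0.0
    h = -sum((b / total) * math.log(b / total) for b in bins if b > 0)
    return h / math.log(n_bins)


# Toy sequence of 100 frames: "noise" on every even frame, "unit" in one burst.
labels = ["noise" if i % 2 == 0 else "other" for i in range(100)]
for i in (41, 43, 45, 47, 49):
    labels[i] = "unit"

noise_entropy = activation_entropy(labels, "noise")
unit_entropy = activation_entropy(labels, "unit")
print(noise_entropy, unit_entropy)
```

In this toy example the class present throughout the recording reaches the maximum entropy, while the class confined to a single burst has zero entropy.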
Conclusion
This work presents a new Bayesian non-parametric approach for clustering. It is based on an infinite Gaussian mixture with an eigenvalue decomposition of the cluster covariance matrix and a Chinese Restaurant Process prior. It allows several flexible models to be derived and avoids the model selection problem of maximum-likelihood and Bayesian parametric Gaussian mixtures. We applied this method to the whale song decomposition NIPS4B challenge. The results obtained highlight the interest of parsimonious Bayesian clustering as a good alternative to finite parsimonious GMM clustering. We saw that the infinite parsimonious Gaussian mixture model (IPGMM) is more flexible in terms of modeling and automatically provides both a partition of the data and the number of clusters.

Footnote 1: The missing values for the two state-of-the-art models (the $\lambda D A D^T$ model for EM ML and the $\lambda_k D A D^T$ model for EM MAP) are due to problems encountered when running the EM algorithm on these data, which are currently being fixed.

[Figure 2 panels: song units 1, 9, 10, 11, 13 and 14; vertical axis: Frequency (Hz), from 2150 to 21500; horizontal axis: Frames]
Figure 2: Spectrograms of the whale song units obtained with the proposed Bayesian non-parametric approach and the most general model $\lambda_k D_k A_k D_k^T$.
Figure 3: Cluster activities versus time (sea noise) obtained by the IPGMM with the $\lambda_k D_k A_k D_k^T$ model.
References
[1] G. J. McLachlan and D. Peel. Finite mixture models. New York: Wiley, 2000.
[2] C. Fraley and A. E. Raftery. Model-based clustering, discriminant analysis, and density estimation. Journal of the American Statistical Association, 97:611–631, 2002.
[3] G. Celeux and G. Govaert. Gaussian parsimonious clustering models. Pattern Recognition, 28(5):781–793, 1995.
[4] G. Celeux and G. Govaert. A classification EM algorithm for clustering and two stochastic versions. Comput. Stat. Data Anal., 14(3):315–332, October 1992.
[5] C. Fraley and A. E. Raftery. Bayesian regularization for normal mixture estimation and model-based clustering. Journal of Classification, 24(2):155–181, September 2007.
[6] H. Bensmail. Regularized models in discrimination and Bayesian classification. PhD thesis, University Paris 6, 1995.
[7] H. Bensmail, G. Celeux, A. E. Raftery, and C. P. Robert. Inference in model-based cluster analysis. Statistics and Computing, 7(1):1–10, 1997.
[8] G. Schwarz. Estimating the dimension of a model. Annals of Statistics, 6:461–464, 1978.
[9] H. Akaike. A new look at the statistical model identification. IEEE Transactions on Automatic Control, 19(6):716–723, 1974.
[10] C. Biernacki, G. Celeux, and G. Govaert. Assessing a mixture model for clustering with the integrated completed likelihood. IEEE Transactions on Pattern Analysis and Machine Intelligence, 22(7):719–725, 2000.
[11] N. Hjort, C. Holmes, P. Muller, and S. G. Walker. Bayesian Nonparametrics. Cambridge University Press, 2010.
[12] C. Rasmussen. The infinite Gaussian mixture model. Advances in Neural Information Processing Systems, 10:554–560, 2000.
[13] E. B. Fox. Bayesian Nonparametric Learning of Complex Dynamical Phenomena. PhD thesis, MIT, Cambridge, MA, 2009.
[14] D. Gorur. Nonparametric Bayesian Discrete Latent Variable Methods for Unsupervised Learning. PhD thesis, Berlin Institute of Technology, Berlin, D83, 2007.
[15] X. Yu. Gibbs sampling methods for Dirichlet process mixture model: Technical details. Technical report, University of Maryland, College Park, September 2009.
[16] F. Wood, T. L. Griffiths, and Z. Ghahramani. A non-parametric Bayesian method for inferring hidden causes. In Proceedings of the Conference on Uncertainty in Artificial Intelligence. AUAI Press, 2006.
[17] F. Wood and M. J. Black. A nonparametric Bayesian alternative to spike sorting. Journal of Neuroscience Methods, 173(1):1–12, 2008.
[18] J. D. Banfield and A. E. Raftery. Model-based Gaussian and non-Gaussian clustering. Biometrics, 49(3):803–821, 1993.
[19] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, B, 39(1):1–38, 1977.
[20] G. J. McLachlan and T. Krishnan. The EM algorithm and extensions. New York: Wiley, 1997.
[21] H. Bensmail and J. J. Meulman. Model-based clustering with noise: Bayesian inference and estimation. J. Classification, 20(1):049–076, 2003.
[22] D. Ormoneit and V. Tresp. Averaging, maximum penalized likelihood and Bayesian estimation for improving Gaussian mixture probability density estimates. IEEE Transactions on Neural Networks, 9(4):639–650, 1998.
[23] S. J. Gershman and D. M. Blei. A tutorial on Bayesian nonparametric models. Journal of Mathematical Psychology, 56:1–12, 2012.
[24] E. B. Sudderth. Graphical models for visual object recognition and tracking. PhD thesis, Massachusetts Institute of Technology, Cambridge, MA, USA, 2006.
Dorian Cazau
Institut Jean Le Rond d'Alembert
University UPMC Paris 6, CNRS UMR 7190
Equipe Lutheries Acoustique Musicale (LAM)

Olivier Adam
Institut Jean Le Rond d'Alembert
University UPMC Paris 6, CNRS UMR 7190
Equipe Lutheries Acoustique Musicale (LAM)
Abstract
Following a production-based approach, this paper proposes a new kind of representation of humpback whale songs. Simple acoustic descriptors are used to characterize specific features of vocal sounds (e.g. fundamental frequency, formants, chaos), which can be traced back to the underlying vocal mechanisms. Such a representation of songs makes it possible to interpret acoustic features of vocalizations in terms of the characteristics of the vocal organs. It may be a very useful tool for researchers dealing with the acoustic behavior of humpback whales.
1 Introduction
The vocal repertoire of humpback whales (Megaptera novaeangliae) ranges widely, with a great variety of bandwidths, durations and intensities of the emitted sounds. More specifically, the vocal diversity of humpback whales includes acoustic features such as harmonic sounds with a very wide fundamental frequency range, noise-like sounds, formant structures [1], pulse-like sound units and various non-linearities (e.g. frequency jumps). Scientists have been particularly interested in the complex stereotyped songs that male humpback whales, one of the most studied species of mysticetes, emit during the winter-spring breeding season. One major research topic in the analysis of these songs is the characterization and classification of their constitutive sounds. Two main approaches can be distinguished in this task. The first follows the well-known hierarchical framework proposed by [2] (songs - themes - phrases - sub-phrases - sound units), within which the temporal structure of these songs has long been studied. The method consists in performing a spectrogram analysis to determine salient acoustic features characterizing discrete sound patterns, which are then used to build an organized structure according to their relative presence or prevalence within a song. Numerous studies (see the review in [3]) have followed this approach, including [4], who studied temporal song evolution, and [5], who proposed a manual categorization of humpback whale sounds. The second approach is the use of machine learning methods, with the obvious advantage of building objective, automatic and time-saving analysis tools for humpback whale songs. Baseline studies include [6], who used information theory techniques to study song structure, and [7], who developed a cluster-based learning method to classify sound units.

(Author affiliation footnote: Also at Centre de Neurosciences Paris Sud - CNRS UMR 8195 - The Bioacoustics Team - 91400 Orsay, France.)
Recently, [8] have called for the rejection of the framework of [2], prompted by the involvement of human subjectivity in the characterization of information-carrying vocal sounds whose semantics and syntax are unknown to humans. In particular, how can we tell whether this hierarchical descriptive structure is relevant to the whales when they produce their sounds? A similar criticism could be formulated for machine learning techniques, whose representations of humpback whale songs do not directly provide an understandable insight into the whales' acoustic communication.
To deal with this issue, some authors [9, 10, 8, 11] have proposed to adopt a production-based approach by analysing the factors involved in the overall sound production system of humpback whales. This approach consists in studying the sound-producing anatomy of the whale in order to evaluate the acoustic characteristics of potential vocal organs (e.g., the fundamental frequency range of a sound generator, or the maximal airflow amplitude), but also other constraint-like factors that are not directly controlled by the whale, whether physical (e.g., composition of internal respiratory gases) or environmental (e.g., depth-related ambient pressure). Humpback whale vocal production should reflect the physical interaction between all these factors, for example through the temporal sequencing or the acoustically contrasting types of sound patterns. Therefore, by studying the vocal material of humpback whales in direct relation with their sound-producing mechanisms, and with any other factors that may influence them, this work should bring more relevant and objective descriptive information for understanding humpback whale vocal behavior. The general production-based approach has been widely applied to different mammal species [12, 13]. It most often combines predictions based on functional morphology, supported by computational models of sound production systems (e.g., see [14] for doves, or [15] for dolphins). One of the reasons why similar research on mysticete cetaceans in general has fallen largely behind is that their vocal production mechanisms are still under investigation. Although it is agreed that vocal production results from air movements and is part of the process of internal air recirculation, the precise acoustic origins of the different vocal features observed in humpback whale songs have remained elusive. Recent advances have been made with the work of [16, 17], focusing respectively on two laryngeal components: the U-fold, so called for its particular U-like shape and for its similarities with vocal folds, and the laryngeal sac, which is an air sac located at the ventral aspect of the larynx. While [1] and [8] mostly based their studies on computational analyses of humpback whale sound sequences, the authors have used anatomical data to develop a biomechanical model of the vocal production mechanisms and to quantify the acoustic features of synthesized sound units [10, 11].
It is noteworthy that the strong muscular structures surrounding both the laryngeal sac and the lungs (i.e. respectively the abdominal muscles, and the diaphragm and the intercostal muscles) should have three different roles: 1) pressurize the whole system to a specific pressure different from the depth-related ambient pressure, 2) manage the different airflows in the different configurations and 3) allow the airspaces to withstand large variations in hydrostatic pressure and avoid chest squeeze while diving and surfacing. The hypothesis that the airspaces inside the whale are undisturbed by the depth-dependent ambient pressure is therefore anatomically reasonable.
[Figure 1 labels: corniculate cartilage flaps, U-fold]
Figure 1: On the left, simplified anatomical diagram of the respiratory tract. On the right, photo of a dissected whale larynx, highlighting the locations of the two sound generators with black lines. These pictures come from [18].
Considering this overall system through successive respiratory phases, two processes may induce different configurations of the system: the movements of the laryngeal valves (i.e. the epiglottis, the U-fold and the corniculate cartilage flaps) and the air recirculation. The physiological properties of the lungs and the laryngeal sac allow them to perform two opposite functions in alternation: storing the incident airflow, or emitting it back into the respiratory tract for further use. By combining valve states (open when an airflow can pass through, closed otherwise) and air source locations, we identify three mutually exclusive configurations of the respiratory tract, further subdivided by the direction of the airflow, which we now describe.
Configuration 1, where only the U-fold and the epiglottis are open. As shown in the left graph of the first row in figure 2, the airflow in this configuration can come either from the lungs or from the laryngeal sac, and then passes through the U-fold. Here, the folds are parted and the epiglottis is pressed against the wall;
Configuration 2, where only the corniculate cartilage flaps and the epiglottis are open. As shown in the second row of graphs in figure 2, the U-fold is now replaced by the corniculate cartilage flaps, and the pulmonary air passing through them is guided directly into the nasal cavities before being stored in the laryngeal sac. The longitudinal shape of these flaps does not allow an airflow to pass in the opposite direction, which rules out the laryngeal sac as an air source. In this configuration the glottal airflow is more distant from the laryngeal sac than in the first configuration;
Configuration 3, where only the U-fold is open. In this configuration, illustrated by the two bottom diagrams in figure 2, we consider that both the lungs and the laryngeal sac may act alternately as an acoustic source. The epiglottis and the corniculate cartilage are pressed tightly together, making the nasal cavities inaccessible to the airflow. The lungs, the trachea, the U-fold and the laryngeal sac are the only anatomical components to be taken into account here, which gives this configuration a certain symmetry.
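The three configurations can be summarised as a small lookup from valve states to configuration and possible air sources. The sketch below only encodes the textual description above; the class and function names are hypothetical, and valve combinations not described in the text are returned as unknown rather than guessed.

```python
from enum import Enum


class Configuration(Enum):
    ONE = 1
    TWO = 2
    THREE = 3


# Key: (u_fold_open, flaps_open, epiglottis_open), as described in the text.
VALVE_TABLE = {
    (True, False, True): Configuration.ONE,     # U-fold and epiglottis open
    (False, True, True): Configuration.TWO,     # flaps and epiglottis open
    (True, False, False): Configuration.THREE,  # only the U-fold open
}

# Possible air sources for each configuration, as stated in the text.
AIR_SOURCES = {
    Configuration.ONE: ("lungs", "laryngeal sac"),
    Configuration.TWO: ("lungs",),
    Configuration.THREE: ("lungs", "laryngeal sac"),
}


def classify_valves(u_fold_open, flaps_open, epiglottis_open):
    """Return the matching configuration, or None if the text does not describe it."""
    return VALVE_TABLE.get((u_fold_open, flaps_open, epiglottis_open))


config = classify_valves(u_fold_open=False, flaps_open=True, epiglottis_open=True)
print(config, AIR_SOURCES[config])
```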
2.2 Acoustic characterization of each sound unit
Previous acoustic models developed by the authors [11, 18] have made it possible to associate quantitative spectral features with the physiological configurations described above. Briefly, in configurations 1 and 2, acoustic resonances inside the respiratory tract result mainly from the parallel acoustic coupling between the nasal cavities and the laryngeal sac. The tube-like nasal cavities, with their rigid walls, generate formants through sharp acoustic resonances, characterized by harmonically related frequencies (whose spacing depends on the tube length) spanning a large spectrum. The large laryngeal sac, with its softer elastic walls, is expected to act as a low-pass filter that poorly sustains higher-frequency resonances. Also, the U-fold is more likely to generate higher fundamental frequencies, through a thickness-to-length ratio and a laryngeal muscle structure quite similar to those of common mammalian vocal folds. This preliminary analysis also allows us to discriminate the three physiological configurations according to the active sound generator, which produces either a low, pulse-like F0 range (below 50 Hz) with the corniculate cartilage flaps (configuration 2), or a medium (50 - 800 Hz) to high (above 800 Hz) frequency range with the U-fold (configurations 1 and 3, with a reversed U-fold for the opposite airflow). Figure 3 illustrates the different types of vocalizations formed through the three physiological configurations, with real vocalizations taken from recorded whale songs. The three proposed types of calls provide an anatomical explanation for the acoustic differences in fundamental frequency range (low / medium-high) and in the presence of formants. Also, the movable, unstable structure of the formant patterns observed across humpback whale vocal displays fits well with the idea of an active temporal dynamic, based on a laryngeal sac whose shape changes over time and affects the overall acoustic behavior of the respiratory system. Further studies will be needed to explain source-related acoustic features, such as the noisy character of a sound and the modifications due to a reversed glottal flow. Basic LPC and F0 tracking [19] have been used to automatically detect the values of these acoustic features within each sound unit.

[Figure 2 panels: a. Configuration 1; b. Configuration 2; c. Configuration 3]
Figure 2: Diagrams of the three physiological configurations. The blue circles stand for the acoustic source. A closed valve is drawn in red, an open one in green. On the right, photos illustrating the different configurations of the pair epiglottis / corniculate cartilage flaps, from top to bottom: epiglottis lowered and flaps closed (config. 1), epiglottis lowered and flaps open (config. 2) and epiglottis lifted and flaps open (config. 3).
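The frequency ranges quoted in this section suggest a simple rule of thumb for labelling a sound unit. The sketch below is only an illustration under simplifying assumptions: the handling of the exact boundary values (50 Hz and 800 Hz) is a choice made here, and the use of formant presence to separate configurations 1 and 3 follows from the statement that the nasal cavities, which generate the formants, are inaccessible in configuration 3. Function names are hypothetical.

```python
LOW_MAX_HZ = 50.0      # below this: low, pulse-like F0 range
MEDIUM_MAX_HZ = 800.0  # above this: high F0 range


def f0_range(f0_hz):
    """Label a fundamental frequency as 'low', 'medium' or 'high'."""
    if f0_hz < LOW_MAX_HZ:
        return "low"
    if f0_hz <= MEDIUM_MAX_HZ:
        return "medium"
    return "high"


def sound_generator(f0_hz):
    """Active sound generator suggested by the F0 range."""
    if f0_range(f0_hz) == "low":
        return "corniculate cartilage flaps"
    return "U-fold"


def likely_configuration(f0_hz, has_formants):
    """Simplified guess of the physiological configuration (1, 2 or 3)."""
    if f0_range(f0_hz) == "low":
        return 2
    return 1 if has_formants else 3


# Hypothetical sound units: (F0 in Hz, formants detected?)
units = [(17.0, False), (300.0, True), (1000.0, False)]
labels = [likely_configuration(f0, formants) for f0, formants in units]
print(labels)
```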
Figure 3: On the left, spectrograms of three types of real sound units extracted from recordings of humpback whales, corresponding to the three configurations (labeled a to c as in figure 2). The first type presents a harmonic structure with a medium fundamental frequency, modulated by formants; the second type presents a pulse sound with a period $T \approx 60$ ms (i.e. $F_0 = 17$ Hz); and the third type presents a harmonic structure with a high fundamental frequency and no formants. From [18]. On the right, illustrations of the detection of chaotic segments.
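As a quick check of the period-to-frequency conversion quoted in the caption, and reading the period as approximately 60 ms, the fundamental frequency is the inverse of the period:

```latex
F_0 = \frac{1}{T} = \frac{1}{0.060\ \mathrm{s}} \approx 16.7\ \mathrm{Hz} \approx 17\ \mathrm{Hz}
```

This value lies well below 50 Hz, which places the pulse sound in the low F0 range attributed to the corniculate cartilage flaps.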
To these three basic sound unit types, discriminated only by fundamental frequency range and by the presence or absence of formants, we can add an acoustic feature reflecting the occurrence of vocal non-linearities. Two main types of vocal non-linearities are present in humpback whale songs: frequency jumps and chaos. These two features are very interesting in a production-based approach, as they are easily traced back to specific vocal mechanisms. Indeed, frequency jumps result either from a crossover between F0 and a formant [20], or from a register transition during an F0 modulation [21]. Chaos results exclusively from chaotic oscillations of the vocal folds. We propose to define a chaotic descriptor as the sum of three classical tools from non-linear dynamics [22]: the entropy, the Lyapunov exponents and the correlation dimension. The entropy E quantifies the rate of loss of information about the state of a dynamic system as it evolves over time [22]. For regular behaviors (i.e. static states, periodic and quasi-periodic oscillations), the entropy is equal to zero. For chaotic systems with finite degrees of freedom, the entropy is finite. Chaotic systems display a sensitive dependence on initial conditions. This property deeply affects the time evolution of trajectories starting from infinitesimally close initial conditions, and Lyapunov exponents are a measure of this dependence. These characteristic exponents give a coordinate-independent measure of the local stability properties of a trajectory. A trajectory is chaotic if there is at least one positive exponent. The correlation dimension D, proposed by [23], quantifies the complexity or irregularity of a trajectory in phase space, describing the geometrical scaling property of a vocal sound in a state space. Figure 3 (right) illustrates the use of such descriptors to discriminate between harmonic and chaotic segments in two different sound units.
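The criterion "chaotic if at least one Lyapunov exponent is positive" can be illustrated on a toy system. The sketch below does not analyse whale sounds: it uses the logistic map x -> r x (1 - x), a standard textbook example chosen here as an assumption for illustration, and estimates its single Lyapunov exponent as the average of log |f'(x)| along a trajectory. Parameter values and the function name are hypothetical.

```python
import math


def logistic_lyapunov(r, x0=0.2, n_transient=200, n_iter=5000):
    """Estimate the Lyapunov exponent of the logistic map x -> r * x * (1 - x).

    The exponent is the trajectory average of log|f'(x)| with
    f'(x) = r * (1 - 2 * x); the first `n_transient` steps are discarded.
    """
    x = x0
    for _ in range(n_transient):
        x = r * x * (1.0 - x)
    total = 0.0
    for _ in range(n_iter):
        # Guard against log(0) if the trajectory lands exactly on x = 0.5.
        total += math.log(max(abs(r * (1.0 - 2.0 * x)), 1e-300))
        x = r * x * (1.0 - x)
    return total / n_iter


periodic = logistic_lyapunov(3.2)  # period-2 regime: regular behavior
chaotic = logistic_lyapunov(3.9)   # chaotic regime
print(periodic, chaotic)
```

The periodic regime gives a negative exponent and the chaotic regime a positive one. For a recorded sound the exponents would have to be estimated from a reconstructed state space rather than from a known map.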
[Figure 4 panels: descriptor D1 (panel label FJ) and descriptor D2 (panel label Ch) plotted against sound units; on the right, F0 plotted as Frequency (Hz) against sound units]
Figure 4: Curves of two descriptors D1 and D2, detecting respectively chaotic vocalizations and formant occurrences. On the right, fundamental frequency tracking over a song.
A higher-level representation could be extracted from figure 4 by identifying the different configurations developed in section 2.2, based on the continuous temporal distribution of formants and fundamental frequencies. This would potentially provide an insight into the internal respiratory mechanisms of the whale and its management of sound production. Also, the temporal distribution of vocal non-linearities gives us an interesting basis for speculating on what they could reflect within the humpback whale communication framework (e.g. exceptional vocal features, lack of vocal skills, pathological signs). The main goal of the proposed representation is thus to provide a useful tool for biologists. This kind of representation offers meaningful support for the direct interpretation and discussion of humpback whale vocal strategies within their communication framework.
References
[1] E. Mercado, J. Schneider, A. A. Pack, and L. M. Herman, "Sound production by singing humpback whales," J. Acoust. Soc. Am., vol. 127, pp. 2678–2691, 2010.
[2] R. Payne and S. McVay, "Songs of humpback whales," Science, vol. 173, pp. 585–597, 1971.
[3] D. M. Cholewiak, R. S. Sousa-Lima, and S. Cerchio, "Humpback whale song hierarchical structure: Historical context and discussion of current classification issues," Marine Mammal Science, vol. 29, pp. 312–332, 2013.
[4] M. J. Noad, D. H. Cato, M. Bryden, M. N. Jenner, and K. C. S. Jenner, "Cultural revolution in whale songs," Nature, vol. 408, p. 537, 2000.
[5] W. W. L. Au, A. A. Pack, M. O. Lammers, L. M. Herman, M. H. Deakos, and K. Andrews, "Acoustic properties of humpback whale songs," J. Acoust. Soc. Am., vol. 120, pp. 1103–1110, 2006.
[6] R. Suzuki, J. R. Buck, and P. L. Tyack, "Information entropy of humpback whale songs," J. Acoust. Soc. Am., vol. 119, pp. 1849–1866, 2006.
[7] H. Ou, W. W. L. Au, L. M. Zurk, and M. O. Lammers, "Automated extraction and classification of time-frequency contours in humpback vocalizations," J. Acoust. Soc. Am., vol. 133, pp. 301–310, 2013.
[8] E. Mercado and S. Handel, "Understanding the structure of humpback whale songs," J. Acoust. Soc. Am., vol. 132, pp. 2947–2950, 2012.
[9] E. Mercado, "Computational models of sound production and reception in the humpback whale," Master's thesis, University of Hawaii, 1998.
[10] D. Cazau, "Acoustics of the mysticete cetacean (baleen whale) vocal production system," Master's thesis, Université Pierre et Marie Curie, 2012.
[11] O. Adam, D. Cazau, N. Gandilhon, B. Fabre, J. T. Laitman, and J. S. Reidenberg, "New acoustic model for humpback whale sound production," Applied Acoustics, vol. 74, pp. 1182–1190, 2013.
[12] W. T. Fitch and M. D. Hauser, "Voice production in non-human primates: acoustics, physiology, and functional constraints on honest advertisement," American Journal of Primatology, vol. 37, pp. 191–219, 1995.
[13] W. T. Fitch, J. Neubauer, and H. Herzel, "Calls out of chaos: the adaptive significance of nonlinear phenomena in mammalian vocal production," Animal Behaviour, vol. 63, pp. 407–418, 2002.
[14] T. Riede, M. J. Owren, and A. C. Arcadi, "Nonlinear acoustics in pant hoots of common chimpanzees (Pan troglodytes): Frequency jumps, subharmonics, biphonation, and deterministic chaos," American Journal of Primatology, vol. 64, pp. 277–291, 2004.
[15] T. W. Cranford, P. Krysl, and J. A. Hildebrand, "Acoustic pathways revealed: simulated sound transmission and reception in Cuvier's beaked whale (Ziphius cavirostris)," Bioinsp. Biomim., vol. 3, pp. 1–10, 2008.
[16] J. S. Reidenberg and J. T. Laitman, "Discovery of a low frequency sound source in mysticeti (baleen whales): anatomical establishment of a vocal fold homolog," Anatomical Record, vol. 290, pp. 745–760, 2007.
[17] J. S. Reidenberg and J. T. Laitman, "Sisters of the sinuses: Cetacean air sacs," Anatomical Record, vol. 291, pp. 1389–1396, 2008.
[18] D. Cazau, O. Adam, J. T. Laitman, and J. S. Reidenberg, "Understanding the intentional acoustic behavior of humpback whales: a production-based approach," J. Acoust. Soc. Am., vol. 134, pp. 2268–2273, 2013.
[19] A. de Cheveigne and H. Kawahara, "Yin, a fundamental frequency estimator for speech and music," J. Acoust. Soc. Am., vol. 111, pp. 1917–1930, 2002.
[20] I. R. Titze, "Nonlinear source-filter coupling in phonation: Theory," J. Acoust. Soc. Am., vol. 123, pp. 2733–2749, 2008.
[21] B. Roubeau, N. Henrich, and M. Castellengo, "Laryngeal vibratory mechanisms: The notion of vocal register revisited," Journal of Voice, vol. 23, pp. 425–438, 2009.
[22] J. J. Jiang, Y. Zhang, and C. McGilligan, "Chaos in voice, from modeling to measurement," Journal of Voice, vol. 20, pp. 2–17, 2006.
[23] P. Grassberger and I. Procaccia, "Measuring the strangeness of strange attractors," Physica D, vol. 9, pp. 189–208, 1983.
In: proc. of int. symp. Neural Information Scaled for Bioacoustics, sabiod.org/nips4b, joint to NIPS, Nevada, dec. 2013, Ed. Glotin H. et al.
Hervé Glotin
Inst. Universitaire de France
Bt St Michel Paris
& Université de Toulon
[email protected] (correspond. author)
1 Introduction
It has been well documented that humpback whales produce songs with a specific structure [Payne].
The NIPS4B challenge provides 26 minutes of a remarkable humpback whale song recorded a few meters from the whale at La Reunion, Indian Ocean, by the "Darewin" research group in 2013, at a sampling frequency of 44.1 kHz, 32 bits, mono, wav format (Fig 1).
Figure 1: Spectrum of around 20 seconds of the given humpback whale song (from about 5'40 to 6'; 0 to 22.05 kHz, frameshift of 10 ms).
Usually, Mel-frequency cepstral coefficients are used as parameters to describe these songs [Pace et al.]. We propose here another efficient representation, the scalogram, and we demonstrate that the sea noise is efficiently removed, even in the case of lower SNR recordings, allowing robust song representations.
The first layer then appears to lose a few units, which are also missing in the other scattering layers; however, it has a strong energy coefficient. Some specific patterns nevertheless appear and could possibly be used to describe and identify the singer. For example, the chirps have a specific length and slope, as shown by examples extracted from 4 different sample recordings in the next sections (the original figures are at: http://sabiod.univ-tln.fr/pimc/rapport/ ).
We give the scalogram and spectrogram of around 2 minutes of each signal. No additional non-linear transformation has been applied to any of the scalograms. This comparison emphasizes the strength of the scattering decomposition compared to the spectrogram, which contains the sea noise.
We illustrate this with different occurrences of some specific patterns, computed on windows of 2^16 samples, which is the maximum window length we can use in the ScatNet toolkit.
3 Challenge results
Results on the NIPS4B_humpback.wav challenge data are in Fig 2.
(http://sabiod.univ-tln.fr/nips4b/challenge2.html)
( http://sabiod.univ-tln.fr/pimc/RAPPORT_NIPS4B_humpback_J62_Q8_T948.0957/ ). We give in figure 3 some extracted examples of a particular recurrent shape (spectrogram window = 128, overlap = 64).
Figure 2: Scalogram and spectrogram of the challenge data including the 20 seconds of the challenge part 8, J=62, Q=8, T=948.0957
Figure 3: Chirps extracted from the same challenge data and corresponding times. Duration of each window: 1.49 sec.
Begin times: Sample 1: 0.11 sec, Sample 2: 38.19 sec, Sample 3: 44.52 sec, Sample 4: 59.04 sec.
3. Results at low SNR on various songs from the same area on different days
In this section we compute the scalogram, with the same parameters, on a noisy recording taken in the New Caledonian Lagoon.
/NAS3/PIMC/SITE/FGAB_WAV_all/20130720_BB_en_plusieurs_points/DECAV_20130720_113312.wav
The full results are at
http://sabiod.univ-tln.fr/pimc/RAPPORT_DECAV_20130720_113312_J62_Q8_T948.0957/
A sample is given in figure 4 below, showing again clear chirps.
Figure 5 shows, for the same file, a particular recurrent chirp:
http://sabiod.univ-tln.fr/pimc/RAPPORT_DECAV_20130722_103948_J62_Q8_T948.0957/
We give one sample below (figure 6), showing another kind of pattern, from another kind of song unit.
Here is a zoom on other units found in this file (figure 7):
/NAS3/PIMC/SITE/FGAB_WAV_all/20130725_triangulation_et_TASCAM/DECAV_20130725_093238.wav
http://sabiod.univ-tln.fr/pimc/RAPPORT_DECAV_20130725_093238_J62_Q8_T948.0957/
A 2-minute sample already shows different patterns (figure 8) :
And here (figure 9) is a recurrent chirp appearing multiple times in this file:
4 Conclusion
We demonstrate the advantage of the Gabor scalogram for humpback whale song analysis: it distinguishes fine details that are possibly linked to an individual signature. This representation may be useful for research on whale identification [Cazau 2013, in this workshop].
Looking at the recurrent units found in each file, we can see that NIPS4B_humpback.wav has some really flat, mid-sized units of approximately 0.7 to 1 second, whereas the DECAV_20130720_113312.wav file (figures 4 and 5) has longer chirps (lasting for the whole time window taken, about 1.35 second) that are also characterized by a small positive slope.
For the DECAV_20130722_103948.wav file (figures 6 and 7), the chirps are shorter (the length is about one third of that in the previous example) and the slope is greater; note also the concave shape.
For another whale, in the recording DECAV_20130725_093238.wav (figures 8 and 9), we see chirps with mid-sized length, a small positive slope, and a convex pattern. These are the kind of signatures we are looking for in individual indexing.
Even if a log spectrogram might also have revealed some interesting patterns, we demonstrate the advantage of the scalogram representation over the spectrogram with respect to the sea noise, which is removed in the scalogram.
References
Pace, F., Benard, F., Glotin, H., Adam, O., and White, P. (2010) Subunit definition for humpback whale call classification, int. journal Applied Acoustics, Elsevier, 71(11)
ScatNet http://www.di.ens.fr/data/software/scatnet/documentation/
Abstract
Male whale vocalizations have the characteristics of a song. Male whales form temporal sequences of different syllabic types making repeated phrases. We provide a method for the automatic quantitative analysis of a single humpback whale song by decomposing the song into its constituent syllabic types and studying their temporal sequencing. This work describes our approach to the humpback whale song processing challenge organized and hosted as part of the 2013 Neural Information Processing Scaled for Bioacoustics workshop (NIPS4B).
Introduction
Animals use vocalization for reasons that are vital for their existence. Mate selection, courtship rituals, coordination, alarming and marking of territory are the most important ones. Cetaceans vocalize for these reasons, although the complete mapping of vocalization to behavioral modes is not yet fully clarified. The growing interest of signal processing and pattern recognition in marine mammals is associated with assessing the impact of anthropogenic noise on cetaceans [1], providing means to avoid collisions with ships [2] and detecting species that are endangered [3]. In this work we focus on a humpback whale song. Both female and male humpback whales produce sounds that are referred to as social sounds, while songs are produced exclusively by male humpbacks. To be clear with the terminology, we adopt the widely accepted definition in which a single vocalization of a humpback whale constitutes a syllable, that is, a distinct compact segment of continuous sound separated from other syllables by silence. A phrase is a sequence of heterogeneous syllables in close succession [4], whereas a song is a structured, stereotyped repetition of phrases. For a more elaborate representation of the types of structural components typically present in sequences of sounds produced by singing humpback whales, please refer to [5]. It has been well documented that humpback whales produce songs with a specific structure [6]. A song typically lasts 10-20 minutes and is repeated continuously, possibly over hours, with small variations in its phrase composition. The song changes gradually from year to year, and the songs of populations around the globe are distinctively different. There is a long list of studies concerning whale songs (see [6] and the references therein).
The 2013 Neural Information Processing Scaled for Bioacoustics workshop (NIPS4B) featured a signal processing and pattern recognition challenge in the domain of whale song processing [7]. The NIPS4B event provides 26 minutes of a single, high-quality humpback whale song recording produced at a few meters distance from the whale in La Reunion, Indian Ocean. The purpose of this study is to propose an efficient representation of a given humpback whale song that helps to study its structure, as well as discover and index its units [8]. Our contribution is an automatic segmentation approach for a whale song and a clustering of the resulting syllables into an approximate alphabet of syllables. Moreover, we discuss a sequence model of the whale song based on modelling the succession of syllables with N-grams.
Signal Analysis
2.1 Elementary Vocalizations and Distinct Syllable Types
We manually examined the spectrogram and listened to the 26-minute song, and ended with 9 distinct syllables. This is in accordance with [6], which also reports 9 distinct syllables of humpback whales. The syllables consist of a series of 200 Hz-19 kHz spectral chunks. Syllables are of varying duration, are uttered at rates of about 30 per minute, and are both amplitude and frequency modulated. A description of the units of this particular humpback whale song follows:
S1: has a flat tonal. It is broadband with very strong harmonics reaching up to 16 kHz. Its mean duration is 1 sec.
S2: has a small low-frequency flat tonal subunit followed by a downsweep of frequencies. Its mean duration is 1.5 sec.
S3: has weak harmonics and an initial downsweep followed by a vibrating flat tonal. Its duration is 1 sec.
S4: is relatively broadband, with strong frequency-modulated harmonics reaching up to 12 kHz, ending with an upsweep. Its mean duration is 2 sec.
S5: is broadband with harmonics plus noise reaching up to 15 kHz. Its mean duration is 2 sec.
S6: is relatively narrowband, with a mostly noisy structure. Its duration is 2 sec.
S7: has bird-like chirps and a strong tonal, and is relatively narrowband. Its mean duration is 1 sec.
S8: is a small low-frequency flat tonal subunit followed by a short-time chirp segment. It possesses strong harmonics. S8 appears alone or with one or two following characteristic subunits. These subunits never appear in isolation; when present, they always come after S8. Its mean duration is 0.75 sec for the main syllable and about 0.5 sec for the two following subunits.
S9: is strongly tonal with higher-frequency harmonics, demonstrating an almost chirp-like character. Its mean duration is 1 sec.
3.1
Let x(n) denote the discrete time-domain signal holding the original recording, where n is the discrete-time index. The recording is sampled at 44.1 kHz, 16 bit, and downsampled to 16 kHz, as we are interested in deriving cluster indices of segments and the signal energy beyond 8 kHz is small. The SNR is quite good as the hydrophones were close to the whale [7]. In order to extract the useful audio events from the background we applied the Hilbert follower, also known as envelope follower, which follows the characteristic shape of the time-domain audio envelope of the vocalizations. We briefly describe its derivation and function.
Let $x_h(n) = \mathrm{Hilbert}(x(n))$ return a complex sequence called the analytic signal of x(n). The analytic signal $x_h(n) = x(n) + j x_i(n)$ has a real part x(n), which is the original data, and an imaginary part $x_i(n)$, which contains the Hilbert transform of x(n). The envelope y(n) of the sampled time-domain recording is calculated as:
$$y(n) = \left( x_h(n) \times x_h^{*}(n) \right)^{1/2} \qquad (1)$$
where $x_h^{*}(n)$ stands for the conjugate of $x_h(n)$ and $\times$ for component-wise multiplication.
The envelope in Eq. 1 is compared against a threshold $\theta$. When $y(n) > \theta$ the sample x(n) is classified as belonging to the activity class, otherwise to the non-activity class (see Fig. 2). The threshold is calculated from the whole recording: the envelope y(n) is sorted by value and a conservative threshold is calculated as $\theta = 3\mu_1$, where $\mu_1$ is the mean of the lowest 90% of the envelope values. Let $x_e(n)$ hold the samples of the recording for which $y(n) > \theta$.
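The segmentation step described above can be sketched in Python as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the FFT-based construction of the analytic signal, and the convention of returning half-open sample ranges are choices made for this example. Only the factor 3 and the 90% fraction come from the text.

```python
import numpy as np

def analytic_signal(x):
    """FFT-based analytic signal: real part is x, imaginary part its Hilbert transform."""
    x = np.asarray(x, dtype=float)
    n = x.size
    spectrum = np.fft.fft(x)
    gain = np.zeros(n)
    gain[0] = 1.0                      # keep the DC bin
    if n % 2 == 0:
        gain[n // 2] = 1.0             # keep the Nyquist bin
        gain[1:n // 2] = 2.0           # double positive frequencies
    else:
        gain[1:(n + 1) // 2] = 2.0
    return np.fft.ifft(spectrum * gain)  # negative frequencies are zeroed

def envelope(x):
    """Envelope of Eq. 1: square root of the analytic signal times its conjugate."""
    xh = analytic_signal(x)
    return np.sqrt((xh * np.conj(xh)).real)

def activity_threshold(y, keep=0.9, factor=3.0):
    """Three times the mean of the lowest 90% of the envelope values."""
    y = np.asarray(y, dtype=float)
    count = max(1, int(round(keep * y.size)))
    return factor * np.sort(y)[:count].mean()

def active_segments(y, theta):
    """Half-open (start, end) sample ranges where the envelope exceeds theta."""
    active = np.asarray(y) > theta
    padded = np.concatenate(([False], active, [False]))
    changes = np.flatnonzero(padded[1:] != padded[:-1])
    return [(int(s), int(e)) for s, e in zip(changes[0::2], changes[1::2])]
```

A typical call chain would be `y = envelope(x)`, `theta = activity_threshold(y)`, `segments = active_segments(y, theta)`; any merging of very short gaps or segments is left out here because the text does not describe it.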
Figure 2: Top: spectrogram of the first 10 syllables. Middle: envelope of the Hilbert follower. Bottom: detection result and segmentation. Time axes in sec.
In Fig. 3 we give a distribution of silence and syllable durations as found by our automatic
segmentation system. As regards inter-syllable silence durations, the mean is 0.66 sec with a
standard deviation of 0.53 sec whereas the syllables have a mean of 1.25 sec and standard
deviation of 0.46 sec.
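As an illustration of how such duration statistics can be obtained from a segmentation, the sketch below takes a list of (start, end) sample indices and a sampling rate and returns the mean and standard deviation of syllable and silence durations. The function name and the input format are assumptions for this example, and the use of the sample standard deviation is also an assumption, since the text does not say which estimator was used.

```python
from statistics import mean, stdev

def duration_stats(segments, fs):
    """Return ((syllable mean, std), (silence mean, std)) in seconds.

    segments: time-ordered list of half-open (start, end) sample indices.
    fs: sampling rate in Hz.
    """
    syllables = [(end - start) / fs for start, end in segments]
    # A silence is the gap between the end of one syllable and the start of the next.
    silences = [(nxt_start - end) / fs
                for (_, end), (nxt_start, _) in zip(segments, segments[1:])]
    return (mean(syllables), stdev(syllables)), (mean(silences), stdev(silences))
```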
[Figure 3: distributions of silence duration and syllable duration; horizontal axes in sec.]
C m t
C m t
where, M is the number of frames before and after the frame t, and in our case M=2.
3. Spectral Bandwidth (1 coefficient): the frequency extent of each segment.
4. Spectral Entropy (8 features): the Shannon entropy, calculated as:
$$H(x) = -\sum_{i=1}^{N} p(x_i) \log_2 p(x_i)$$
where p(x) is the spectrum amplitude normalized so that it can be considered a probability distribution, and N is the total number of spectral amplitudes. Spectral entropy is sensitive to predominant peaks or spectral flatness. We calculated spectral entropy every 1 kHz (8 bands in total) and averaged the results.
5. Spectral Crest (1 feature): a measure of how noisy or tonal the signal is. It is by definition the ratio of the maximum value of the spectral amplitude to the arithmetic mean of the energy spectrum. We calculate the spectral crest independently for 8 bands and sum.
$$SCr(t) = \frac{\max_n A_t(n)}{\frac{1}{N}\sum_{n=1}^{N} A_t(n)}$$
6. Spectral Centroid (1 feature): the center of mass of the distribution of spectral amplitude. It has a robust connection with the impression of "brightness" of a sound.
$$SC(t) = \frac{\sum_{n=1}^{N} n A_t^2(n)}{\sum_{n=1}^{N} A_t^2(n)}$$
where $A_t(n)$ is the magnitude of the Fourier transform at frame t and frequency bin n. A higher centroid corresponds to a spectrum with dominant high frequencies.
7. Spectral Flatness (1 feature): a measure of how noisy or tonal the signal is. It is by definition the ratio of the geometric mean to the arithmetic mean of the energy spectrum.
$$A_{L,t}(n) = \log(A_t(n)), \qquad SF(t) = \frac{\exp\left( \frac{1}{N}\sum_{n=1}^{N} A_{L,t}(n) \right)}{\frac{1}{N}\sum_{n=1}^{N} A_t(n)}$$
$$\sum_{i=1}^{C} A_t(i) = 0.85 \sum_{i=1}^{N} A_t(i)$$
$$F(t) = \sum_{i=1}^{N} \left| A_t(i) - A_{t-1}(i) \right|$$
where $A_t$ and $A_{t-1}$ are the normalized magnitudes of the Fourier transform at the current time frame t and the previous time frame t-1, respectively. The spectral flux is a measure of the amount of local spectral change.
11. Segment duration (1 feature): the duration of a segment in seconds. It is derived from the signal segmentation procedure described in par. 3.1.
The feature extraction procedure results in a 39-dimensional array per segment, and therefore the segmentation and feature extraction produce a real matrix S of dimension 612x39.
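Several of the spectral descriptors listed above can be sketched directly from their definitions. The Python fragment below is an illustration only: it operates on a single magnitude spectrum `a` (one frame, or one band of a frame), the function names are hypothetical, and the small constant `eps` guarding the logarithm in the flatness is an assumption not mentioned in the text. Per-band evaluation and the averaging or summing over 8 bands are omitted.

```python
import numpy as np

def spectral_entropy(a):
    """Shannon entropy (bits) of the amplitude spectrum normalized to sum to 1."""
    p = a / a.sum()
    p = p[p > 0]                       # 0 * log2(0) is taken as 0
    return float(-(p * np.log2(p)).sum())

def spectral_crest(a):
    """Maximum spectral amplitude divided by the arithmetic mean."""
    return float(a.max() / a.mean())

def spectral_centroid(a):
    """Center of mass of the squared magnitudes, in bins counted from 1."""
    n = np.arange(1, a.size + 1)
    return float((n * a ** 2).sum() / (a ** 2).sum())

def spectral_flatness(a, eps=1e-12):
    """Geometric mean over arithmetic mean; eps avoids log(0)."""
    return float(np.exp(np.log(a + eps).mean()) / a.mean())
```

For a perfectly flat spectrum of 8 bins these give an entropy of 3 bits, a crest of 1 and a flatness close to 1, while a single spectral peak gives zero entropy and a flatness close to 0, which matches the noisy versus tonal interpretation given in the list.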
3.3
We manually selected the distinct syllables of the song by carefully observing the spectrogram of each syllable and listening to every syllable. However, this is not possible for recordings of many whales, or for recordings scaling up to months or years. Since we wanted a generic approach to automatically label a song, we tried some methods that do not require the number of clusters to be set a priori but try to discover it themselves. We analyzed these syllables using an approach that embeds a set of high-dimensional data points, estimating the intrinsic geometry of the data manifold from a rough estimate of each data point's neighbors in order to represent high-dimensional data in lower dimensions. What we need is a visualization technique that maps the high-dimensional feature space to 2 or 3 dimensions, so that we have an idea of the presence of
several clusters indicating distinct syllable types. We would then expect the number of clusters to match (in principle) the number of classes. t-Distributed Stochastic Neighbor Embedding (t-SNE) is a powerful technique for dimensionality reduction that is particularly well suited for the visualization of high-dimensional datasets [10]. In Fig. 4 we see the results of applying this technique to the humpback syllable segments. The method is indeed capable of bringing this real-life data set into clusters and giving an indication of the number of syllable types. Although the correct number of clusters is unknown to us and the manual labelling of Fig. 4 can be done in many ways, we can see that we have a number of clusters between 9 and 15. This estimate now allows us to set the number of classes in several clustering methods and examine the outcome.
[Figure 4: two-dimensional t-SNE embedding of the humpback syllable segments.]
k-means is a well-known method of vector quantization and a popular means of clustering a dataset. It aims to partition the observations into clusters, where the mean of each cluster serves as its prototype. Once the observations are assigned to clusters, a new mean is calculated for each cluster. The process is repeated until no significant change is detected. k-means requires that the number of clusters is set a priori. The hand-labelled syllables and the t-SNE embedding lead us to set the number of clusters in k-means from 8 to 11 and to perform a series of experiments.
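The assign-then-update loop described above can be sketched as follows. This is a plain Lloyd-style k-means written for illustration, not the implementation used in the experiments; the random initialization from data points, the iteration cap and the `seed` parameter are assumptions of this example.

```python
import numpy as np

def kmeans(X, k, n_iter=100, seed=0):
    """Cluster the rows of X into k groups; returns (labels, centers)."""
    X = np.asarray(X, dtype=float)
    rng = np.random.default_rng(seed)
    # Initialize the prototypes with k distinct observations (assumed choice).
    centers = X[rng.choice(len(X), size=k, replace=False)].copy()
    for _ in range(n_iter):
        # Assignment step: each observation goes to its nearest prototype.
        dist = ((X[:, None, :] - centers[None, :, :]) ** 2).sum(axis=2)
        labels = dist.argmin(axis=1)
        # Update step: each prototype becomes the mean of its members.
        new_centers = centers.copy()
        for j in range(k):
            members = X[labels == j]
            if len(members):
                new_centers[j] = members.mean(axis=0)
        if np.allclose(new_centers, centers):
            break                      # no significant change detected
        centers = new_centers
    return labels, centers
```

With the feature matrix S of the previous section, one would call `kmeans(S, k)` for each k in the range under study; feature scaling before clustering is a separate design decision that the text does not specify.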
Affinity Propagation is based on the idea of examining the suitability of one observation to be the exemplar of another. Messages between observations are sent until convergence. The dataset is then described by a small number of exemplars [11]. Affinity Propagation is quite suitable to our problem because, rather than requiring the number of clusters to be known a priori, it discovers it itself. A preference of -8 is set in this algorithm, which discovers 10 clusters [12].
For visualization clarity, we show in Fig. 5 the first 34 members of the S1, S2 and S5 clusters as classified by k-means, and in Fig. 6 the S8, S5 and S1 syllable clusters found using Affinity Propagation.
Fig. 5: k-means as applied to the syllables dataset. The first 34 syllables of each cluster (panels: Syllable S5, Syllable S2, Syllable S1).
[Fig. 6: Affinity Propagation as applied to the syllables dataset (panels: Syllable S8, Syllable S5, Syllable S1).]
The humpback whale song demonstrates strong temporal regularities. That is, the syllables do not appear in random order but seem to construct phrases that are uttered as regular and repeated temporal sequences. In order to model this effect we used the cluster labels of k-means as an alphabet and tagged the whole recording. We then calculated the transition probabilities from syllable to syllable (bigrams). Silence duration is modelled as Gaussian distributed, with mean and variance as derived from Fig. 3. Although the duration is clearly bimodal, as a first approach we sampled silence duration from a Gaussian with mean 0.66 sec and standard deviation 0.534 sec.
In Fig. 7 (left) we give the unigrams, that is, the frequency of each syllable in the alphabet, and in Fig. 7 (right) the bigram frequencies in a Hinton diagram. A Hinton diagram gives an immediate view of the probability of moving from one syllable to another. Bigram frequencies are calculated by counting the number of occurrences of every transition from syllable to syllable and normalizing by the number of all transitions. The larger the box, the larger the transition probability. One should note that the modelling of temporal evolution is based on the derived alphabet tagged by the k-means and Affinity Propagation methods and cannot correct their mistakes. The results shown in Fig. 7 are based directly on the raw classification and are only shown as a proof of concept. A detailed n-gram analysis of manual labels, as well as a more accurate clustering of the syllables in order to have a realistic model of syllable transitions, will be shown elsewhere. We propose that bigrams are an indispensable tool for studying the phrase composition of whale phrases and subsequently of songs based on phrase transitions.
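The counting described above is simple enough to sketch in a few lines of Python. The function names and the use of string labels are assumptions for this illustration. The first two functions follow the normalization stated in the text (by the number of all syllables, and by the number of all transitions); the third, which normalizes each row by the count of its starting syllable to give conditional probabilities, is an added variant that is convenient when sampling a next syllable.

```python
from collections import Counter

def unigram_frequencies(labels):
    """Relative frequency of each syllable label in the tagged sequence."""
    counts = Counter(labels)
    return {s: c / len(labels) for s, c in counts.items()}

def bigram_frequencies(labels):
    """Count of each transition divided by the number of all transitions."""
    pairs = Counter(zip(labels, labels[1:]))
    total = sum(pairs.values())
    return {pair: c / total for pair, c in pairs.items()}

def transition_probabilities(labels):
    """Conditional probability of the next label given the current one."""
    pairs = Counter(zip(labels, labels[1:]))
    row_totals = Counter()
    for (current, _), c in pairs.items():
        row_totals[current] += c
    return {(a, b): c / row_totals[a] for (a, b), c in pairs.items()}
```

For the toy sequence S1, S2, S1, S2, S3 the transition S1 to S2 accounts for half of all transitions, and S1 is always followed by S2.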
[Figure 7 panels: Unigrams (syllables S1 to S9; values as extracted, in order: 0.13378, 0.08919, 0.13649, 0.07973, 0.08108, 0.16622, 0.17973, 0.05270, 0.08108) and Bigrams (Hinton diagram, S1 to S9 on both axes).]
Fig. 7 Left: Frequency of humpback whale alphabet (unigrams). Right: the alphabet with
transition probabilities from syllable to syllable (bigrams).
Discussion
We have shown that it is possible to obtain an accurate decomposition of a raw recording of a whale song into its constituent syllables, and a way to model the transitions between syllables in order to study the sequencing of the song. The whole process is quite practical, as it takes under half a minute from the raw 26-minute recording to the final clustered segments on an i7, 16 GB RAM computer. One practical significance of temporal modelling, beyond studying the transition patterns and phrases of whales, is that it can be used in games, either as standalone devices or programs, or in audio-books. These usually use pre-recorded signal segments that are very short, due to the memory constraints that the device imposes, and are therefore completely predictable to the point of being annoying after a while. With our approach the device only needs to store the syllables, and an endless song can be derived on the spot by first sampling a syllable from its cluster with uniform probability, then a silence duration from the Gaussian distribution that we fitted on real data, and then moving to the next syllable with the probability given by the bigram calculations. This process can be repeated as long as needed and produces a rich repertoire that is quite realistic and actually costs less memory than a single recording of a song.
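The generation procedure just described can be sketched as a small sampling loop. The sketch below is illustrative only: the data structures (a dictionary of stored syllables per cluster and a nested dictionary of next-syllable probabilities), the function name, and the clipping of negative silence durations to zero are assumptions of this example. The default mean and standard deviation of the silence are the values quoted earlier in the paper.

```python
import random

def generate_song(clusters, transitions, n_syllables, start,
                  silence_mean=0.66, silence_std=0.534, seed=None):
    """Return a list of (label, syllable, silence_seconds) events.

    clusters: dict mapping a label to the list of stored syllables in that cluster.
    transitions: dict mapping a label to {next_label: probability}.
    """
    rng = random.Random(seed)
    label = start
    events = []
    for _ in range(n_syllables):
        # Pick one stored syllable of the current type with uniform probability.
        syllable = rng.choice(clusters[label])
        # Draw the following silence from the fitted Gaussian (clipped at 0, assumed).
        silence = max(0.0, rng.gauss(silence_mean, silence_std))
        events.append((label, syllable, silence))
        # Move to the next syllable type according to the bigram probabilities.
        successors = transitions[label]
        label = rng.choices(list(successors), weights=list(successors.values()))[0]
    return events
```

Because only the syllables and a small table of probabilities are stored, the memory cost is independent of the length of the generated song.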
References
[1] Cox T. et al., Understanding the impacts of anthropogenic sound on beaked whales, Journal of Cetacean Research and Management 7(3): 177-187, 2006.
[2] Laist D., Knowlton A., Mead J., Collet A., Podesta M., Collisions between ships and whales, Marine Mammal Science 17(1): 35-75, 2001.
[3] Adam O. & Samaran F., Detection, Classification and Localization of Marine Mammals using passive acoustics. 2003-2013: 10 years of international research, 2013.
[4] Payne R. S., McVay S., Songs of humpback whales. Science 173: 585-597, 1971.
[5] Mercado E. III, Herman L. M., & Pack A. A., Stereotypical sound patterns in humpback whale songs: Usage and function. Aquatic Mammals 29: 37-52, 2003.
[6] Au W., Pack A., Lammers M., Herman L., Deakos M., Andrews K., Acoustic properties of humpback whale songs. Journal of the Acoustical Society of America 120(2): 1103-10, 2006.
[7] http://sabiod.univ-tln.fr/nips4b/challenge2.html (date last viewed 21/11/2013)
[8] Pace, F., Benard, F., Glotin, H., Adam, O., and White, P. Subunit definition for humpback
Chapter 9
Big Bio-Acoustic Data
9.2 A challenge for computational bioacoustic
Kindermann L.
CABLED OBSERVATORY ACOUSTIC DATA: CHALLENGES AND OPPORTUNITIES
WHAT THE OCEAN IS TELLING US ABOUT CLIMATE CHANGE
APRIL 23, 2013
World-wide expansion
Ocean Observatories Initiative (OOI)
Continuous presence
High sampling frequency
Co-located sensors
Interactivity
Abundant power
Event detection
Hydrophones (example)
OceanSonics icListen Smart Hydrophones
Ethernet-ready
24-bit HF (10 Hz - 200 kHz)
24-bit LF (1 Hz - 1600 Hz)
23 GB / day
Hydrophone data:
> 2,000,000 files
> 80 TB
+ active acoustic instruments
Stream to 5-minute files
Serial/IP
Timesync
Archive
> 5 kHz
Filter
0.5 Hz to 5kHz
Mark Malleson
Visit http://www.orchive.net
DEMO
Visualization ideas?
On-line annotation?
Event detection?
Collaborative tools?
Data formats?
OCEANNETWORKS.CA
Explore Ocean Networks Canada's online resources
Plotting Utility | Data Search | Sea Tube | Digital Fishers
Contact
[email protected]
Abstract
Antarctic Minke whales are the most abundant baleen whale species on earth. As the main target of today's controversial scientific whaling, and possibly of a re-established commercial whaling enterprise as proposed by some countries, they are in the focus of interest for many NGOs and the public. Until a few months ago nothing was known about their vocal behavior, so they had no voice of their own and no bioacoustic methods could be used to investigate the many open questions about them. On the other hand, for several decades a strange sound of unknown origin has been recorded repeatedly in the Southern Ocean, but only during polar winter when the sea is covered almost completely by a dense layer of ice. Long-term recordings from our acoustic observatory at the ice shelf show it is in fact the dominant acoustic emission around Antarctica during that time. Tens of thousands of hours of this sound have been recorded during the last 8 years and are published under an open access policy. Recently, during a winter expedition to Antarctica, we could finally assign this sound to the Minkes. We invite everybody to look into these data using advanced methods to extract genuinely new knowledge about this important species.
Throughout the Southern Ocean a unique rhythmic underwater sound with a frequency range of 100 Hz to 20 kHz has been recorded repeatedly by many researchers and navy sonar officers [1-6]. The first published evidence of its existence dates back to 1964, when it appeared in an audio recording as an "unidentified signal in the background" [1]. The crew of an Australian submarine designated the sound "the bioduck" because of its auditory impression [3,4]. The PALAOA observatory, located north of Neumayer Station on the Antarctic Ekström ice shelf [8], and several moored audio recorders throughout the Weddell Sea pick up this sound regularly, but strictly during austral winter only, which explains much of the difficulty in its investigation. From the end of April to the beginning of November it is continuously audible, and most of that time it even constitutes the most intense sound source in the Southern Ocean. However, the source of this signal remained a mystery until recently.
The largest inhabitant of the winter pack ice is the Antarctic Minke whale, Balaenoptera bonaerensis. Up to 10 meters long and weighing 10 tons, it is a rather small member of the baleen whale family. While its larger relatives like blue, fin, and humpback whales mostly leave Antarctica during winter for their subtropical mating grounds, this species has adapted to a permanent life in the ice. Little is known about this most abundant of all great whale species: population estimates range from 360,000 to 1,000,000 individuals and there are contradicting opinions on whether the stock is growing or shrinking. To the public it became famous as the main target of the controversially discussed contemporary whaling. Especially during polar night the study of these animals is extremely difficult.
While it had been suggested that the unidentified rhythmic sound might be produced by Antarctic minke whales [2], in addition to their known irregular downsweeps [7], this was confirmed by several
parties only recently, in 2013 [e.g. 9].
So far, no detailed study of the acoustic behavior of this species has been performed. The 8-year
continuous acoustic recordings from the PALAOA observatory thus provide a unique opportunity
to investigate this species for the very first time. As the minke sounds are present continuously
from April to November, more than 20,000 hours are available in total, making the data set a good candidate
for modern computational methods.
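As one illustration of what such a computational method could look like, the following minimal sketch estimates the repetition interval of a rhythmic sound from the autocorrelation of an energy envelope. It is not the method used by the author. The function name, the frame rate, the lag search range, and the synthetic 3-second pulse train are all assumptions chosen for illustration; for real recordings the envelope would first have to be computed from band-limited audio.

```python
import numpy as np


def estimate_repetition_interval(envelope, frame_rate, min_s=1.0, max_s=10.0):
    """Return the lag (in seconds) of the highest autocorrelation peak
    of an energy envelope, searched between min_s and max_s."""
    x = np.asarray(envelope, dtype=float)
    x = x - x.mean()
    n = len(x)
    # FFT-based autocorrelation, zero-padded so it is linear, not circular
    spec = np.fft.rfft(x, 2 * n)
    acf = np.fft.irfft(spec * np.conj(spec))[:n]
    lo = int(round(min_s * frame_rate))
    hi = int(round(max_s * frame_rate))
    lag = lo + int(np.argmax(acf[lo:hi + 1]))
    return lag / frame_rate


# Synthetic test signal (hypothetical values, not taken from the data set):
# a 0.3 s pulse every 3 s on top of uniform noise, sampled at 100 frames/s.
frame_rate = 100.0
period_s = 3.0
t = np.arange(int(120 * frame_rate)) / frame_rate
rng = np.random.default_rng(0)
envelope = 0.2 * rng.random(t.size)
envelope[(t % period_s) < 0.3] += 1.0

estimated = estimate_repetition_interval(envelope, frame_rate)
print(f"estimated repetition interval: {estimated:.2f} s")
```

Restricting the search to a plausible lag range avoids the trivial peak at lag zero; the range itself would need to be tuned to the signal under study.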
A live stream of the under-ice hydrophone is available at http://www.awi.de/PALAOA and the
complete data set (2005-2013) is published in the PANGAEA database of the World Data Center
for Marine Environmental Sciences [10]: http://doi.pangaea.de/10.1594/PANGAEA.773610
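To give a rough sense of scale for anyone planning to process the archive, the sketch below converts recording hours into storage volume. It assumes a constant bitrate of 192 kbit/s, the mp3 setting named in the recording chain of the accompanying slides, with 1 kbit taken as 1000 bits; the function name is hypothetical and the real file sizes may differ.

```python
def archive_size_bytes(hours, bitrate_kbit_per_s):
    """Storage needed for constant-bitrate audio, in bytes."""
    seconds = hours * 3600
    bits = seconds * bitrate_kbit_per_s * 1000  # assumes 1 kbit = 1000 bits
    return bits / 8


# The 20,000 hours of minke sounds mentioned in the text, at 192 kbit/s
size_tb = archive_size_bytes(20_000, 192) / 1e12
print(f"approximate volume: {size_tb:.3f} TB")
```

Under these assumptions the minke season alone amounts to somewhat under 2 TB of compressed audio, which is small enough to process on a single workstation.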
Underwater Acoustics in Antarctica
Presenting a unique open Dataset
Lars Kindermann
NIPS Scaled for Bioacoustics Workshop, Lake Tahoe 2013
Alfred Wegener Institute for Polar and Marine Research (AWI), Germany, 2013
[Slide: The Southern Ocean - One of the most productive ecosystems on earth during austral summer]
[Slide: Ice finger at the tip of the Ekström Ice Shelf; PALAOA site; drilling camp; Atka Bay, Antarctica]
[Slides: PALAOA hydrophone array; Ross seals; "Woosh". AWI Ocean Acoustics Group, www.awi.de/PALAOA]
[Figure: PALAOA time-series plots for the seasons 2005/2006 and 2006/2007 (AWI Ocean Acoustics, Anna-Maria Seibert). Horizontal axes: Date (19.12 to 07.02; 23.01.2007 to 02.02.2007) and Time (hours of day). Vertical axes: 0 to 60, label not recoverable.]
[Slide: PALAOA recording chain and data flow. Labels: Reson VP2000 amplifier/filter; BARIX Instreamer (mp3, 48 kHz, 192 kBit); bidirectional control; PALAOA Station; Neumayer Base; server at Neumayer Base; portable USB hard disks as buffer/storage; all files once a year on LTO tape; continuous stream, 24 kBit OGG-Vorbis; ship; satellite; server at AWI Bremerhaven, Germany. Additional data: seismology, infrasound, weather, human observations.]
[Slides: Cetaceans: humpback whale (Megaptera novaeangliae), orcas, blue whales. Leopard seal. RD Climate Sciences.]
[Slides: www.iwc.org; 2008, 2009, 2010; Sonobuoy]
[Slide annotations: 3.15 sec; 1/6]
References
1. Poulter, T. Recording # 120342 (December 26th, 1964). Macaulay Library (1964).
2. Matthews, D., Macleod, R. & McCauley, R. D. Bio-Duck Activity in the Perth Canyon. An Automatic Detection Algorithm. Proceedings of Acoustics 14 (2004).
3. McCauley, R. Western Australian Exercise Area Blue Whale Project: Final Summary Report.
CMST Report R2004-29, Project - 350 173 (2004).
4. Dolman, S. J., Swift, R. J., Asmus, K. & Thiele, D. Preliminary analysis of passive acoustic recordings made in the Ross Sea during ANSLOPE III in 2004. Paper SC/57/E10 presented to the
Scientific Committee of the International Whaling Commission 18 (2005).
5. Klinck, H. & Burkhardt, E. Marine mammal automated perimeter surveillance - MAPS. Reports
on Polar and Marine Research 580, 114-121 (2008).
6. van Opzeeland, I., Rettig, S., Thomisch, K., Preis, L., Lefering, I., Menze, S., Zitterbart, D.,
Monsees, M., Kindermann, L. & Boebel, O. in Boebel: Cruise Report of the Expedition of the
Research Vessel "Polarstern" to the Antarctic in 2012/13 (ANT-XXIX/2). To appear in: Reports
on Polar and Marine Research.
7. Schevill, W. E. & Watkins, W. A. Intense low-frequency sounds from an antarctic minke whale,
Balaenoptera acutorostrata. Breviora 388, 1-8 (1972).
8. Kindermann, L; Boebel, O; Bornemann, H et al. (2007): A perennial acoustic observatory in the
Antarctic Ocean. Computational bioacoustics for assessing biodiversity: proceedings of the international expert meeting on IT-based detection of bioacoustical patterns, December 7th until
December 10th, 2007 at the International Academy for Nature Conservation (INA), Isle of Vilm
9. Kindermann L. & Cabreira, A. Acoustic Ecology of Antarctic Minke Whales. In: Lemke, P.,
Cruise Report of the Expedition of the Research Vessel "Polarstern" to the Antarctic in 2013
(ANT-XXIX/6). To appear in: Reports on Polar and Marine Research.
10. Kindermann, L. (2013): Acoustic records of the underwater soundscape at PALAOA with links
to audio stream files, 2005-2011. Alfred Wegener Institute, Helmholtz Center for Polar and Marine Research, Bremerhaven, doi:10.1594/PANGAEA.773610
ISBN : 979-10-90821-04-0