
University of Groningen

A real-time system for audio source localization with cheap sensor device
Saggese, Alessia; Strisciuglio, Nicola; Vento, Mario; Petkov, Nicolai

Published in:
2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017

DOI:
10.1109/AVSS.2017.8078461

IMPORTANT NOTE: You are advised to consult the publisher's version (publisher's PDF) if you wish to cite from
it. Please check the document version below.

Document Version
Publisher's PDF, also known as Version of record

Publication date:
2017

Link to publication in University of Groningen/UMCG research database

Citation for published version (APA):


Saggese, A., Strisciuglio, N., Vento, M., & Petkov, N. (2017). A real-time system for audio source
localization with cheap sensor device. In 2017 14th IEEE International Conference on Advanced Video and
Signal Based Surveillance, AVSS 2017 (Article 8078461). Institute of Electrical and Electronics Engineers
Inc. https://doi.org/10.1109/AVSS.2017.8078461

Copyright
Other than for strictly personal use, it is not permitted to download or to forward/distribute the text or part of it without the consent of the
author(s) and/or copyright holder(s), unless the work is under an open content license (like Creative Commons).

The publication may also be distributed here under the terms of Article 25fa of the Dutch Copyright Act, indicated by the "Taverne" license.
More information can be found on the University of Groningen website: https://www.rug.nl/library/open-access/self-archiving-pure/taverne-amendment

Take-down policy
If you believe that this document breaches copyright please contact us providing details, and we will remove access to the work immediately
and investigate your claim.

Downloaded from the University of Groningen/UMCG research database (Pure): http://www.rug.nl/research/portal. For technical reasons the
number of authors shown on this cover page is limited to 10 maximum.

Download date: 24-12-2024


A real-time system for audio source localization with cheap sensor device

Alessia Saggese¹, Nicola Strisciuglio², Mario Vento¹, Nicolai Petkov²

¹ University of Salerno - DIEM, Italy
² University of Groningen - JBI, The Netherlands
[email protected], [email protected]

Abstract

We propose an architecture for real-time audio source localization based on the integration of localization methodologies within a framework that employs a cheap acquisition sensor. The architecture that we present takes as input the audio signals from two calibrated microphones. Then, it computes biologically inspired features of the sound signal and estimates its direction by means of a Gaussian Mixture Model estimator. We carried out an extensive experimental analysis on four data sets, one of which we realized and made publicly available. We evaluated several characteristics of the sound localization architecture and its use in real scenarios.

1. Introduction

When we perceive sounds, we simultaneously identify their direction and are able to recognize the type of sound. From psychological studies we know that humans feel more comfortable when the sound source (e.g. a speaker) can be located accurately. Generally, in the case of surveillance applications, a system that knows the source of a sound provides better information to improve the safety of the environment.

In recent years, applications of audio source localization have received growing interest from research and industry in the field of intelligent audio surveillance, with the aim of localizing the sound source of hazardous events. Audio analytic systems can be deployed together with, and are a complementary tool to, existing video surveillance infrastructures. Many IP surveillance cameras are, indeed, already equipped with or ready to be connected to a microphone, making possible a combined analysis of audio and video streams [20].

Two important applications in intelligent audio surveillance are abnormal event detection [3, 4] and sound source localization [13]. A comprehensive review of methods for audio surveillance has been recently published [2].

Existing methods for sound source localization can be organized in two groups depending on the acquisition device that they employ. The first group contains methods for binaural localization, which use two microphones as input devices. Such methods are based on the computation of bio-inspired cues: the interaural level difference (ILD) and the interaural time difference (ITD). The former measures the decay of the volume of the sound with distance; it also considers the absorption effect introduced by the head, which stops the propagation of the sound waves [23]. The latter concerns the differences in the arrival time of the sound waves between the two microphones. The computation of ILD is generally approached by considering the energy ratio between the two audio streams at frequencies higher than 1500 Hz [1], while the computation of ITD is instead based on the evaluation of the cross-correlation of the two audio streams at frequencies lower than 1500 Hz [14]. The estimation of the auditory cues is done with a probabilistic approach in [23] and [15], or by comparing input values with theoretically generated curves in [7]; the latter approach shows less flexibility and lower robustness to noise with respect to the former ones. Other methods integrate audio and visual information to improve the reliability of localization [18, 8]. In [17, 19], auditory epipolar geometry, inspired by epipolar geometry in stereo vision, was introduced. An artificial human-like dummy head has been proposed for binaural hearing in telepresence operations [21].

The second group collects methods that employ arrays of more than two microphones. Such methods aim at improving the reliability of binaural localization approaches and are designed to perform 3-dimensional localization. A localization system that uses 4 microphones arranged on the surface of a spherical robot head was suggested in [9]; the time-delay of arrival was computed on 6 pairs of microphones and the location estimations were combined so as to improve the reliability of the decision. In [22], the estimation of ITD is performed by cross-correlation of the audio signals of an array of 8 microphones in order to perform 3-dimensional sound source localization.
[Figure 1 (block diagram): kinect sensor → data acquisition → gammatone filters → ILD/ITD estimator → GMM estimator (with training model) → azimuth estimator → results visualization]

Figure 1: Architectural overview of the localization system. A Kinect sensor (a) captures the audio signal, which is read by the data acquisition module (b). The localization method (c, d, e, g, h) is composed of the modules represented within the two pairs of straight lines, which perform a sound direction estimation at frame level. An azimuth estimator (f) integrates the frame-level decisions over a time window Wt and the results are presented to the user through a result visualization module (i).

A study about the optimization of the number of microphones in the array was presented in [6]; the authors exploit the redundancy of spatial information to scale the system linearly with the number of microphones. Beamforming techniques and Kalman filters have been used to reduce the errors in tracking multiple moving speakers in noisy and echoic environments [16]. Such methods are more accurate and robust to noise than binaural localization methods, but require larger hardware resources for processing the input signals.

In this paper, we propose a real-time architecture for sound source localization based on a cheap microphone array, namely the audio acquisition card provided together with the Kinect sensor. Our contribution is in the design and implementation of a system for hardware-software integration of the frame-level estimates provided by the binaural sound source localization method proposed in [15]. The proposed architecture is modular and, in principle, can be used together with any binaural localization method. The real-time implementation of the system and the use of a cheap audio sensor give the possibility of deploying large installations of the localization system with reduced costs. Furthermore, the proposed solution provides means for distributed analysis of sound sources and, eventually, for tracking them within the monitored areas. We evaluate the performance and the characteristics of the proposed architecture in different conditions by carrying out experiments on four data sets, namely the Surrey, Oldenburg, Aachen and MIVIA data sets. In this work, we constructed and made publicly available for benchmarking purposes the MIVIA localization data set¹.

The paper is organized as follows: in Section 2 we present the proposed hardware and software architecture; in Section 3 we describe the data sets that we used and discuss the results that we achieved in the experiments. Finally, we draw conclusions in Section 4.

¹The data set is available at the url http://OUR.–

2. System Architecture

In Figure 1, we depict the architectural overview of the proposed system. The presented framework integrates a cheap sensor device, namely the microphone array of the Kinect sensor, with a localization method based on probabilistic estimation of sound directions [15]. Together with the processing architecture, we propose an integration rule for the short-time estimations that improves the reliability of the overall system.

2.1. Localization method

The considered localization method is inspired by how the human auditory system processes sounds. The cochlea membrane in the inner ear vibrates according to the energy contained in frequency sub-bands of the input sound. In order to model this behavior, a Gammatone filterbank was adopted. As suggested in [5], the input signals of each microphone are decomposed into Nc = 32 auditory channels using a fourth-order gammatone filterbank, where the channel center frequencies are distributed on the equivalent rectangular bandwidth (ERB) scale between 80 Hz and 5 kHz. Successively, binaural cues are estimated using a rectangular window of 20 ms at a sampling frequency of fs = 44.1 kHz. The impulse response g_fc(t) of a Gammatone filter at time t and central frequency fc is:

    g_fc(t) = D t^(η−1) cos(2π fc t + φ) e^(−2π t b(fc)) u(t),    (1)

where D is an arbitrary scaling constant, η is the order of the filter (fourth in our case), φ is the phase, u(t) is the unit step function (1 for t > 0 and 0 otherwise), and b(fc) is a function which determines the bandwidth for a given center frequency. It is formally defined as b(fc) = a(24.7 + 0.108 fc), where a is a proportionality constant.
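To make Eq. (1) concrete, the following is a minimal sketch (ours, not the authors' implementation) that samples the gammatone impulse response and places the 32 channel center frequencies on the ERB-rate scale. The value a = 1.019 is a common choice in the gammatone literature and is an assumption here, since the paper leaves the proportionality constant unspecified; all function names are ours.

    import numpy as np

    def gammatone_ir(fc, fs=44100, duration=0.02, eta=4, phi=0.0, a=1.019, D=1.0):
        """Sampled impulse response g_fc(t) of a gammatone filter, Eq. (1)."""
        t = np.arange(int(duration * fs)) / fs           # t >= 0, so u(t) = 1
        b = a * (24.7 + 0.108 * fc)                      # bandwidth b(fc)
        return D * t**(eta - 1) * np.cos(2*np.pi*fc*t + phi) * np.exp(-2*np.pi*t*b)

    def erb_space(f_lo=80.0, f_hi=5000.0, n=32):
        """Center frequencies equally spaced on the ERB-rate scale (Glasberg & Moore [5])."""
        erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)       # Hz -> ERB rate
        inv = lambda e: (10**(e / 21.4) - 1) / 0.00437         # ERB rate -> Hz
        return inv(np.linspace(erb(f_lo), erb(f_hi), n))

    # 32-channel filterbank between 80 Hz and 5 kHz, as in the paper;
    # a signal x is decomposed by convolution with each impulse response:
    bank = [gammatone_ir(fc) for fc in erb_space()]
    # channels = [np.convolve(x, g)[:len(x)] for g in bank]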
Given the responses of the Gammatone filterbank for both microphones, the Interaural Level Difference (ILD) and the Interaural Time Difference (ITD) are computed. ILD and ITD contain complementary information about the sound source position with respect to the microphone array. Thus, they are combined into a two-dimensional feature vector, called binaural vector, in order to represent a single frame within a specific Gammatone channel.
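The binaural cues can be extracted per frame and per channel along the lines sketched below: an energy ratio for the ILD and the cross-correlation peak for the ITD, as described in Section 1. This is a hedged sketch under our own conventions; the dB scaling, the lag-search range and all names are assumptions, not details given in the paper.

    import numpy as np

    def ild(left, right, eps=1e-12):
        """Interaural level difference (dB) for one frame of one channel."""
        return 10.0 * np.log10((np.sum(left**2) + eps) / (np.sum(right**2) + eps))

    def itd(left, right, fs=44100, max_delay=0.001):
        """Interaural time difference (s) as the lag maximizing the
        cross-correlation, restricted to physically plausible lags
        (about +/-0.7 ms for a 23 cm baseline; sign convention is ours)."""
        max_lag = int(max_delay * fs)
        corr = np.correlate(left, right, mode='full')   # lags -(N-1)..(N-1)
        mid = len(left) - 1                             # index of zero lag
        window = corr[mid - max_lag: mid + max_lag + 1]
        return (np.argmax(window) - max_lag) / fs

    def binaural_vector(left, right, fs=44100):
        """Two-dimensional feature for one 20 ms frame of one channel."""
        return np.array([itd(left, right, fs), ild(left, right)])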
Finally, a Gaussian mixture model (GMM) is used to estimate the position of the sound source from the set of binaural feature vectors. In particular, considering that the binaural features corresponding to different channels tend to cluster in the feature space, the azimuth-dependent pdf is modeled by summing up the superimposed Gaussian components. During a preliminary training phase, a set of training sounds is used to learn the GMM, which is then used during the operating phase for estimating the source direction. For more details about the localization method we refer the reader to [15].
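The sketch below illustrates one plausible realization of this estimation step with scikit-learn: one mixture is trained per candidate azimuth on labeled binaural vectors, and at run time the azimuth whose model best explains the observed vectors of a frame is selected. This is our reading of the approach of [15], not the authors' code; the class name and parameters are hypothetical.

    import numpy as np
    from sklearn.mixture import GaussianMixture

    class GmmAzimuthEstimator:
        def __init__(self, azimuths, n_components=4):
            # one mixture per candidate azimuth (e.g. -90..90 in 5-degree steps)
            self.models = {az: GaussianMixture(n_components) for az in azimuths}

        def fit(self, features_by_azimuth):
            """features_by_azimuth: dict azimuth -> (n_vectors, 2) array of
            binaural vectors collected from the training sounds."""
            for az, X in features_by_azimuth.items():
                self.models[az].fit(X)

        def estimate(self, X):
            """Return (azimuth, mean log-likelihood) for the binaural vectors
            of one frame, pooled over the gammatone channels."""
            scores = {az: m.score(X) for az, m in self.models.items()}
            best = max(scores, key=scores.get)
            return best, scores[best]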
2.2. Integration of azimuth estimation time RT60 was estimated according to BS EN ISO 3382
standard.
The GMM provides an estimation of the direction of the The Oldenburg data set [12] was recorded by a HATS
sound source for small overlapped (50% of their length) in five semi-controlled environments.
frames of the input audio signal of duration 20ms. The com- The Aachen data set [11] was recorded in two rooms,
bination of the estimation for the N channels of the Gam- with sound sources at a distance up to 10m from the micro-
matone filterbank provides a reliable measure of short-time phones. The characteristics of the rooms for such data sets
characteristics of the input signal. We integrate the short- are reported in Table 1.
time estimations on a larger time scale, using a sliding time We recorded the MIVIA data set and made it available
window of length Wt that forward shifts on the audio signal for research purposes. It was acquired with a double aim:
by half of its size. This allows for more robust estimations (1) to evaluate the performance of localization systems with
to noise and outliers. The direction of the input sound is cheap and less accurate devices, instead of very expensive
thus taken as the one of the audio frame with the highest devices which are more difficult to be used in real applica-
estimated GMM likelihood within the time window Wt . tions; (2) to evaluate the performance of the proposed ap-
proach by varying the distance between the microphones.
We considered two different environments, namely a living
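A minimal sketch of this integration rule follows, reusing the (azimuth, log-likelihood) pairs produced by a frame-level estimator such as the GmmAzimuthEstimator sketch above; the function name and interface are our assumptions.

    def integrate_azimuth(frame_estimates, frames_per_window):
        """frame_estimates: list of (azimuth, log_likelihood), one per 20 ms frame.
        Yields one azimuth per window Wt; windows overlap by 50%, i.e. the
        window shifts forward by half of its size."""
        hop = max(1, frames_per_window // 2)
        for start in range(0, len(frame_estimates) - frames_per_window + 1, hop):
            window = frame_estimates[start:start + frames_per_window]
            yield max(window, key=lambda e: e[1])[0]   # most likely frame wins

    # e.g. Wt = 0.8 s with 20 ms frames at 50% overlap -> about 80 frames per window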
2.3. System implementation

The proposed architecture acquires the audio signal from the microphone array of the Microsoft Kinect sensor (Figure 1a). The software module for hardware interfacing and data acquisition (Figure 1b) is developed in C++ and communicates, through a bridge interface, with a Matlab implementation of the method for the frame-level estimation of the sound source direction (Figure 1c,d,e,g,h). The final estimated direction is obtained by integration of the frame-level decisions (Figure 1f) and the result is made available by a visualization module (Figure 1i).

3. Experimental analysis

3.1. Data sets

We performed experiments on four publicly available data sets, namely the Surrey [10], Oldenburg [12], Aachen [11] and MIVIA data sets. The first three data sets contain sound sources at different angles and distances from the microphone array, and sounds recorded in environments with different reverberation.

The Surrey data set was recorded by a Cortex Instruments Mk.2 Head and Torso Simulator (HATS). The loudspeakers were placed around the HATS on an arc with a 1.5 m radius, between −90° and +90°. The sounds were recorded at intervals of 5° and four different room configurations were considered. For each room, a reverberation time RT60 was estimated according to the BS EN ISO 3382 standard.

The Oldenburg data set [12] was recorded by a HATS in five semi-controlled environments.

The Aachen data set [11] was recorded in two rooms, with sound sources at a distance of up to 10 m from the microphones. The characteristics of the rooms for these data sets are reported in Table 1.

We recorded the MIVIA data set and made it available for research purposes. It was acquired with a double aim: (1) to evaluate the performance of localization systems with cheap and less accurate devices, instead of very expensive devices which are more difficult to use in real applications; (2) to evaluate the performance of the proposed approach by varying the distance between the microphones. We considered two different environments, namely a living room of 6 × 4 × 2.7 m (M.LR) and a laboratory of our university of 8 × 8 × 7 m (M.L). The sounds were recorded by using the microphone array of a Kinect sensor. The hardware setup was expanded so as to build an array of four microphones, where the maximum distance between them is 138 cm, as shown in Figure 2. Note that the distance between the two central microphones was fixed to 23 cm, as in the original setup of the Kinect sensor. The recordings were made considering the following acquisition angles: −90°, −60°, −45°, −20°, 0°, 20°, 45°, 60°, 90°. As for M.LR, a man repeated the same word aloud for 10 seconds at a distance of 2 m from the microphones.
In the second scenario, a Bose loudspeaker was used to reproduce part of a song (10 seconds for each recording) at two distances from the array center: 1 m and 3 m.

Figure 2: Setup of the Kinect sensor for the realization of the MIVIA data set.

Surrey data set
  S.AR  The room size was 17.04 × 14.53 × 6.5 m (l × w × h); anechoic conditions were simulated by truncating the first reflected waves.
  S.A   A small office (5.72 × 6.64 × 2.31 m) with seats for 8 persons; RT60 = 320 ms.
  S.B   A medium-sized classroom (4.65 × 9.6 × 2.68 m); RT60 = 470 ms.
  S.C   A large room (23.5 × 18.8 × 2.31 m) used as a cinema hall for more than 400 people; RT60 = 680 ms.
  S.D   A medium-sized presentation room (8.72 × 8.02 × 4.25 m) with a high ceiling; RT60 = 890 ms.

Oldenburg data set
  O.AR  Anechoic chamber; the HATS was mounted in front of the speaker, at a distance varying from 0.8 m to 3 m. The azimuth angle of the source ranged in the interval [−90°, +90°]; RT60 < 50 ms.
  O.O1  Office room (3.20 × 4.55 m). The HATS was mounted on a desk close to the center of the room, while the speaker was moved in the front hemisphere [−90°, +90°] at a distance of 1 m, with an elevation angle of 0°; RT60 = 300 ms.
  O.O2  Small office room (3.30 × 6.00 m), using the same setup as O.O1; the door and the window were left open.
  O.OC  Busy cafeteria at lunch time. The HATS was moved in different positions; average reverberation time of 1250 ms and average SNR of 75.6 dB.
  O.CY  Courtyard crossed by pedestrians and bicycles. The setup is similar to the one in the cafeteria; average reverberation time of 900 ms and average SNR of 86.1 dB.

Aachen data set
  A.MR  Medium-sized room (8.00 × 5.00 × 3.10 m) suitable for a meeting. The distance between the loudspeaker and the receiver ranged from 1.45 m up to 2.80 m; average reverberation time of 0.23 s.
  A.LR  Lecture room of 10.80 × 10.90 × 3.15 m. The loudspeaker was placed in different positions, at distances ranging from 4 m up to 10.20 m; average reverberation time of 0.78 s.

Table 1: Description of the rooms where the sounds contained in the data sets were recorded.
3.2. Performance evaluation

We organized the experiments in four groups, so as to evaluate the performance of the system when the sound source is (1) at different angles and (2) at different distances from the microphones, (3) with different SNR values and (4) with different configurations of the binaural microphone set. Each group of tests involves specific rooms from the four data sets, which we report in the following:

• TEST1: S.AR, S.A, S.B, S.C, S.D from the Surrey data set and the rooms O.AR and O.O1 from the Oldenburg data set.

• TEST2: A.MR and A.LR from the Aachen data set.

• TEST3: O.O2, O.C and O.CY from the Oldenburg data set.

• TEST4: M.LR and M.L from the MIVIA data set.

We measured the performance of the real-time sound localization system by computing the accuracy (Acc) and the mean absolute error (MAE) of the angle estimation, as follows:

    Acc[%] = 100 − (Ne · 100) / N,    MAE = (Σ_{i=1..N} |x_i − x̂_i|) / N,

where N is the number of considered sounds in a particular test and Ne is the number of wrongly estimated angles. For the computation of the MAE, x̂_i is the estimated direction while x_i is the nominal one.
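As an illustration, the sketch below evaluates both metrics from lists of estimated and nominal angles, with the tolerance interval introduced in Section 3.3 deciding which estimations count as wrong. Since the remark in Section 3.3 states that only wrongly estimated directions enter the MAE (which matches the zero MAE values at larger tolerances in Tables 2 and 3), the sketch sums only the errors exceeding the tolerance, normalized by N; the names and this interpretation of the formula are ours.

    import numpy as np

    def acc_and_mae(estimated, nominal, tolerance=5.0):
        """Accuracy (%) and MAE (degrees) of the angle estimation.
        A source counts as correctly localized when the estimated angle
        falls within +/-tolerance of the nominal direction."""
        errors = np.abs(np.asarray(estimated, float) - np.asarray(nominal, float))
        wrong = errors > tolerance                      # the Ne wrong estimations
        n, n_e = len(errors), np.count_nonzero(wrong)
        acc = 100.0 - (n_e * 100.0) / n
        mae = errors[wrong].sum() / n                   # only wrong errors summed
        return acc, mae

    # Example: three sounds, one estimate outside the 5-degree tolerance
    print(acc_and_mae([0.0, -42.0, 88.0], [0.0, -45.0, 60.0]))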
3.3. Results discussion

In the analysis of the performance of the real-time localization method, we considered several tolerance values for the localization error. Specifically, we considered different margins for the maximum acceptable error on the estimation of the sound source position: a sound source is correctly localized if the estimated angle falls into a tolerance interval around the nominal value of the sound direction. The value of the tolerance influences both accuracy and MAE; indeed, only wrongly estimated directions are considered for the computation of the MAE.

For the experiments in the groups "TEST1" and "TEST2" we performed a quantitative analysis of the results by computing the accuracy and MAE. In Table 2 and Table 3 we report the results that we achieved for the different environments in "TEST 1" and "TEST 2", respectively. We observed generally high performance and precise estimations; the results decrease for highly noisy environments.

    Room    Metric    Tolerance 5°    Tolerance 10°    Tolerance 15°
    S.AR    Acc       100%            100%             100%
            MAE       0°              0°               0°
    S.A     Acc       78.37%          100%             100%
            MAE       1.08°           0°               0°
    S.B     Acc       81.08%          97.29%           100%
            MAE       1.08°           0°               0°
    S.C     Acc       86.48%          94.59%           94.59%
            MAE       2.83°           2.43°            2.43°
    S.D     Acc       51.35%          70.27%           70.27%
            MAE       14°             13°              13°
    O.AR    Acc       97.29%          100%             100%
            MAE       0°              0°               0°
    O.O1    Acc       48.64%          70.27%           75.67%
            MAE       11.41°          10.33°           9.79°

Table 2: Performance results for the group of experiments "TEST 1" with different values of the error tolerance interval.

    Room    Metric    Tolerance 5°    Tolerance 10°    Tolerance 15°
    A.MR    Acc       100%            100%             100%
            MAE       0°              0°               0°
    A.LR    Acc       50%             100%             100%
            MAE       3.53°           0°               0°

Table 3: Performance results for the group of experiments "TEST 2" with different values of the error tolerance interval.

In the experiment "TEST3", we evaluated the performance of the system in environments with several values of signal-to-noise ratio. For the rooms in the Oldenburg data set, precise measures of the nominal sound source direction are not provided, but rather only an indication of the direction of the sound. In Table 4, we report the angle estimated by the real-time localization system together with the indication of the direction of the sound. For each room, various positions of the sound source are considered.

    Room    Configuration    Expected direction    Estimated angle
    O.O2    1C               Frontal               −5°
    O.O2    1D               Frontal               5°
    O.O2    1B               Right                 35°
    O.O2    2A               Right                 15°
    O.O2    2B               Left                  −10°
    O.C     1A               Frontal               0°
    O.C     2D               Frontal               5°
    O.C     2E               Right                 60°
    O.C     1D               Right                 90°
    O.C     1B               Left                  −30°
    O.C     1C               Left                  −90°
    O.CY    1A               Right                 35°
    O.CY    1B               Right                 10°
    O.CY    1C               Left                  −20°
    O.CY    1D               Left                  −50°
    O.CY    1E               Left                  −60°

Table 4: Qualitative analysis of the results for the group of experiments "TEST 3".

For the experiments in "TEST4", we evaluated the performance of the method when the physical configuration of the microphone set changes. In particular, we compared the estimated directions when the pair of microphones from the Kinect sensor are placed at a distance of 23 cm (CONF1) and of 138 cm (CONF2). We observed a decrease of the performance of the localization system for sound sources at directions farther than 60° from the frontal direction; this is mainly due to the directionality of the microphones of the Kinect sensor. When considering the configuration CONF2, we observed that the reliability of the direction estimation decreases. The model proposed in [15] is constructed by taking inspiration from the functions of the human auditory system, which are not effectively applicable to large binaural inter-distances.

    Room                  Expected angle    CONF1    CONF2
    Living Room (M.LR)    −90°              −58°     −45°
                          −60°              −45°     −33°
                          −45°              −40°     −37°
                          −20°              −26°     −21°
                          0°                0°       0°
                          20°               23°      19°
                          45°               30°      37°
                          60°               55°      40°
                          90°               56°      55°
    Laboratory (M.L)      −90°              −62°     −62°
                          −60°              −55°     −52°
                          −45°              −40°     −38°
                          −20°              −23°     −21°
                          0°                0°       0°
                          20°               20°      19°
                          45°               44°      41°
                          60°               57°      55°
                          90°               56°      62°

Table 5: Qualitative analysis of the results for the group of experiments "TEST 4".

Furthermore, we studied the average required processing time with respect to different sizes of the time window Wt. We report in Table 6 the processing time required by each block of the architecture and the total processing time of the system. Block #1 refers to the acquisition module, blocks #2 and #3 are the features computation and frame-level angle estimation, while block #4 refers to the integration module. We used audio chunks of 4 seconds to be analyzed and observed that, in order to obtain real-time response on a machine with an Intel i5 CPU, a reasonable choice for the size of Wt is 0.8 seconds.

    Time analysis
    Wt     #1      #2     #3      #4       Total
    0.8    0.51    0.5    2.63    0.002    3.642
    1.1    0.51    0.5    2.92    0.002    3.932
    1.4    0.51    0.5    3.24    0.002    4.252
    1.7    0.51    0.5    3.56    0.002    4.572
    2.0    0.51    0.5    3.97    0.002    4.982

Table 6: Contribution to the processing time of the modules of the architecture for an input sound of 4 seconds. #1 is the acquisition module, #2 and #3 are the features computation and frame-level angle estimation, while #4 refers to the integration module. Time values are in seconds.

The real-time response and the low cost of the audio sensor are key factors for the deployment of the proposed system in real scenarios. The sound directions estimated by several sensors can be combined in order to obtain more precise localization and, eventually, to track the movements of the sound sources within the test environment, as in the sketch below.
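The combination of several sensors is only outlined in the paper; the following is a hypothetical sketch of how two bearing estimates from sensors at known positions could be intersected to locate a source in the plane. The geometry convention (0° along each sensor's frontal axis, here aligned with the global y axis) and all names are our assumptions.

    import numpy as np

    def intersect_bearings(p1, theta1, p2, theta2):
        """Intersect two bearing rays: p1, p2 are 2-D sensor positions;
        theta1, theta2 are azimuths in degrees. Fails (singular matrix)
        for parallel bearings, where no unique intersection exists."""
        d1 = np.array([np.sin(np.radians(theta1)), np.cos(np.radians(theta1))])
        d2 = np.array([np.sin(np.radians(theta2)), np.cos(np.radians(theta2))])
        # solve p1 + t1*d1 = p2 + t2*d2 for the ray parameters t1, t2
        A = np.column_stack([d1, -d2])
        t = np.linalg.solve(A, np.asarray(p2, float) - np.asarray(p1, float))
        return np.asarray(p1, float) + t[0] * d1

    # Two sensors 4 m apart hearing the source at +30 and -30 degrees:
    print(intersect_bearings((0, 0), 30.0, (4, 0), -30.0))   # -> [2.0, 3.46]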
4. Conclusions

We proposed a real-time architecture for sound source localization that combines existing localization techniques with a cheap audio sensor, namely the Microsoft Kinect sensor. The architecture that we presented provides real-time responses. Moreover, it is flexible and modular, which makes it usable with any localization method that requires up to four microphones. We carried out an extensive experimental analysis on four data sets and evaluated different conditions that a localization method has to face in real scenarios. We realized and made publicly available one of the four data sets used for the experimental analysis.

References

[1] J. Blauert. Spatial Hearing: The Psychophysics of Human Sound Localization. MIT Press, Cambridge, MA, 1997.
[2] M. Crocco, M. Cristani, A. Trucco, and V. Murino. Audio surveillance: A systematic review. ACM Comput. Surv., 48(4):52:1–52:46, Feb. 2016.
[3] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento. Audio surveillance of roads: A system for detecting anomalous sounds. IEEE Trans. on Intelligent Transportation Systems, PP(99):1–10, 2015.
[4] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento. Reliable detection of audio events in highly noisy environments. Pattern Recognition Letters, 65:22–28, 2015.
[5] B. R. Glasberg and B. C. Moore. Derivation of auditory filter shapes from notched-noise data. Hearing Research, 47(1):103–138, 1990.
[6] A. Handzel. Planar spherical diffraction-arrays: Linear sound localization algorithms. In Sensor Array and Multichannel Processing, Fourth IEEE Workshop on, pages 655–658, July 2006.
[7] A. Handzel and P. Krishnaprasad. Biomimetic sound-source localization. IEEE Sensors J., 2(6):607–616, Dec 2002.
[8] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda. Sound localization for humanoid robots - building audio-motor maps based on the HRTF. In IROS 2006, IEEE/RSJ Intern. Conf. on, pages 1170–1176, Oct 2006.
[9] J. Huang, K. Kume, A. Saji, M. Nishihashi, T. Watanabe, and W. L. Martens. Robotic spatial sound localization and its 3-D sound human interface. pages 191–, Washington, DC, USA, 2002. IEEE Computer Society.
[10] C. Hummersone, R. Mason, and T. Brookes. Dynamic precedence effect modeling for source separation in reverberant environments. IEEE Trans. on Audio, Speech, and Language Processing, 18(7):1867–1871, Sept 2010.
[11] M. Jeub, M. Schafer, and P. Vary. A binaural room impulse response database for the evaluation of dereverberation algorithms. In Digital Signal Processing, 16th Intern. Conf. on, pages 1–5, July 2009.
[12] H. Kayser, S. D. Ewert, J. Anemüller, T. Rohdenburg, V. Hohmann, and B. Kollmeier. Database of multichannel in-ear and behind-the-ear head-related and binaural room impulse responses. EURASIP J. Adv. Signal Process., 2009:6:1–6:10, Jan. 2009.
[13] F. Keyrouz, K. Diepold, and S. Keyrouz. High performance 3D sound localization for surveillance applications. In Advanced Video and Signal Based Surveillance, AVSS 2007, IEEE Conference on, pages 563–566, Sept 2007.
[14] C. Knapp and G. Carter. The generalized correlation method for estimation of time delay. IEEE Trans. on Acoustics, Speech and Signal Processing, 24(4):320–327, Aug 1976.
[15] T. May, S. van de Par, and A. Kohlrausch. A probabilistic model for robust localization based on a binaural auditory front-end. IEEE Trans. on Audio, Speech, and Language Processing, 19(1):1–13, Jan 2011.
[16] M. Murase, S. Yamamoto, J.-M. Valin, K. Nakadai, K. Yamada, K. Komatani, T. Ogata, and H. G. Okuno. Multiple moving speaker tracking by microphone array on mobile robot. 1:143–145, 2005.
[17] K. Nakadai, H. Okuno, and H. Kitano. Auditory fovea based speech separation and its application to dialog system. In IROS 2002, IEEE/RSJ Intern. Conf. on, volume 2, pages 1320–1325, 2002.
[18] K. Nakadai, H. G. Okuno, and H. Kitano. Real-time sound source localization and separation for robot audition. In Proceedings IEEE ICSLP 2002, pages 193–196, 2002.
[19] H. Okuno, K. Nakadai, K. Hidai, H. Mizoguchi, and H. Kitano. Human-robot interaction through real-time auditory and visual multiple-talker tracking. In IROS 2001, IEEE/RSJ Intern. Conf. on, volume 3, pages 1402–1409, 2001.
[20] S. T. Shivappa, B. D. Rao, and M. M. Trivedi. Audio-visual fusion and tracking with multilevel iterative decoding: Framework and experimental evaluation. IEEE J. Sel. Topics Signal Process., 4(5):882–894, Oct 2010.
[21] I. Toshima, S. Aoki, and T. Hirahara. An acoustical telepresence robot: TeleHead II. In IROS 2004, IEEE/RSJ Intern. Conf. on, volume 3, pages 2105–2110, Sept 2004.
[22] J. Valin, F. Michaud, J. Rouat, and D. Letourneau. Robust sound source localization using a microphone array on a mobile robot. In IROS 2003, IEEE/RSJ Intern. Conf. on, volume 2, pages 1228–1233, Oct 2003.
[23] V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner. A probabilistic model for binaural sound localization. IEEE Trans. on Systems, Man, and Cybernetics, Part B, 36(5):982–994, Oct 2006.
