University of Groningen: 10.1109/AVSS.2017.8078461
A real-time system for audio source localization with cheap sensor device
Saggese, Alessia; Strisciuglio, Nicola; Vento, Mario; Petkov, Nicolai
Published in:
2017 14th IEEE International Conference on Advanced Video and Signal Based Surveillance, AVSS 2017
DOI:
10.1109/AVSS.2017.8078461
Document Version
Publisher's PDF, also known as Version of record
Publication date:
2017
978-1-5386-2939-0/17/$31.00 © 2017 IEEE. IEEE AVSS 2017, August 2017, Lecce, Italy.
[Figure 1: architectural overview of the proposed system — (a)-(f): Kinect sensor and data acquisition, gammatone filters, ILD/ITD estimator, GMM-based azimuth estimator with its training model (output θe), and results visualization.]
presented in [6]. The authors exploit redundancy of spatial information to scale the system linearly with the number of microphones. Beamforming techniques and Kalman filters have been used to reduce the errors in tracking multiple moving speakers in noisy and echoic environments [16]. Such methods are more accurate and robust to noise than binaural localization methods, but require larger hardware resources for processing the input signals.

In this paper, we propose a real-time architecture for sound source localization based on a cheap microphone array, namely the audio acquisition card provided together with the Kinect sensor. Our contribution is the design and implementation of a system for hardware-software integration of the frame-level estimates provided by the binaural sound source localization method proposed in [15]. The proposed architecture is modular and, in principle, can be used together with any binaural localization method. The real-time implementation of the system and the use of a cheap audio sensor make it possible to deploy large installations of the localization system at reduced cost. Furthermore, the proposed solution provides means for distributed analysis of sound sources and, eventually, for tracking them within the monitored areas. We evaluate the performance and the characteristics of the proposed architecture in different conditions by carrying out experiments on four data sets, namely the Surrey, Oldenburg, Aachen and MIVIA data sets (the final name will be released after anonymous submission and revision). In this work, we constructed and made publicly available for benchmarking purposes the OUR localization data set¹.

The paper is organized as follows: in Section 2 we present the proposed hardware and software architecture, while in Section 3 we describe the data sets that we used and discuss the results that we achieved in the experiments. Finally, we draw conclusions in Section 4.

¹The data set is available at the url http://OUR.–

2. System Architecture

In Figure 1, we depict the architectural overview of the proposed system. The presented framework integrates a cheap sensor device, namely the microphone array of the Kinect sensor, with a localization method based on probabilistic estimation of sound directions [15]. Together with the processing architecture, we propose an integration rule for the short-time estimations that improves the reliability of the overall system.

2.1. Localization method

The considered localization method is inspired by how the human auditory system processes sounds. The cochlear membrane in the inner ear vibrates according to the energy contained in frequency sub-bands of the input sound. In order to model this behavior, a Gammatone filterbank was adopted. As suggested in [5], the input signals to each microphone are decomposed into Nc = 32 auditory channels using a fourth-order gammatone filterbank, where the channel center frequencies are distributed on the equivalent rectangular bandwidth (ERB) scale between 80 Hz and 5 kHz. Successively, binaural cues are estimated using a rectangular window of 20 ms at a sampling frequency of fs = 44.1 kHz. The impulse response gfc(t) of a Gammatone filter at time t and central frequency fc is:

gfc(t) = D t^(η−1) cos(2πfc t + φ) e^(−2πt b(fc)) u(t),   (1)

where D is an arbitrary scaling constant, η is the order of the filter (fourth in our case), φ is the phase, u(t) is the unit step function (1 for t > 0 and 0 otherwise), and b(fc) is a function that determines the bandwidth for a given center frequency. It is formally defined as b(fc) = a(24.7 + 0.108 fc), where a is a proportionality constant.
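As an illustration, Equation (1) can be implemented directly. The sketch below (Python with NumPy) also spaces 32 center frequencies on the ERB-rate scale between 80 Hz and 5 kHz; the ERB-rate mapping and the value a = 1.019 are common choices from the psychoacoustics literature, assumed here rather than specified in the paper.

```python
import numpy as np

def gammatone_ir(fc, fs=44100, duration=0.02, eta=4, D=1.0, phi=0.0, a=1.019):
    """Sampled impulse response of a Gammatone filter, following Eq. (1).

    fc: center frequency in Hz; eta: filter order (fourth in the paper);
    a: proportionality constant of b(fc) (1.019 is an assumed common value).
    """
    t = np.arange(int(duration * fs)) / fs        # time axis, t >= 0, so u(t) = 1
    b = a * (24.7 + 0.108 * fc)                   # bandwidth b(fc)
    return D * t**(eta - 1) * np.cos(2 * np.pi * fc * t + phi) \
             * np.exp(-2 * np.pi * b * t)

def erb_space(f_lo=80.0, f_hi=5000.0, n=32):
    """n center frequencies equally spaced on the ERB-rate scale (assumed
    Glasberg-Moore form of the ERB-rate mapping)."""
    erb = lambda f: 21.4 * np.log10(1 + 0.00437 * f)       # Hz -> ERB-rate
    inv = lambda e: (10 ** (e / 21.4) - 1) / 0.00437       # ERB-rate -> Hz
    return inv(np.linspace(erb(f_lo), erb(f_hi), n))

ir = gammatone_ir(1000.0)          # 20 ms response of the 1 kHz channel
cf = erb_space()                   # 32 channel center frequencies
print(len(ir), cf[0].round(1), cf[-1].round(1))
```

Filtering a signal with each of the 32 responses (e.g. by convolution) yields the per-channel decomposition from which the binaural cues are computed.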
Given the response of the Gammatone filterbank for both microphones, the Interaural Level Difference (ILD) and the Interaural Time Difference (ITD) are computed. ILD and ITD contain complementary information about the sound source position with respect to the microphone array. Thus they are combined into a two-dimensional feature vector, called binaural vector, in order to represent a single frame within a specific Gammatone channel.

Finally, a Gaussian mixture model (GMM) is used to estimate the position of the sound source from the set of binaural feature vectors. In particular, considering that the binaural features corresponding to different channels tend to cluster in the feature space, the azimuth-dependent pdf is modeled by summing up the superimposed Gaussian components. During a preliminary training phase, a set of training sounds is used to learn the GMM model, which is then used during the operating phase for estimating the source direction. For more details about the localization method we refer the reader to [15].

2.2. Integration of azimuth estimation

The GMM provides an estimation of the direction of the sound source for small overlapped (by 50% of their length) frames of the input audio signal of duration 20 ms. The combination of the estimations for the N channels of the Gammatone filterbank provides a reliable measure of the short-time characteristics of the input signal. We integrate the short-time estimations on a larger time scale, using a sliding time window of length Wt that shifts forward on the audio signal by half of its size. This allows for estimations that are more robust to noise and outliers. The direction of the input sound is thus taken as that of the audio frame with the highest estimated GMM likelihood within the time window Wt.

2.3. System implementation

The proposed architecture acquires the audio signal from the microphone array of the Microsoft Kinect sensor (Figure 1a). The software module for hardware interface and data acquisition (Figure 1b) is developed in C++ and communicates, through a bridge interface, with a Matlab implementation of the method for the frame-level estimation of the sound source direction (Figure 1c,d,e,g,h). The final estimated direction is obtained by integration of the frame-level decisions (Figure 1f) and the result is made available by a visualization module (Figure 1i).

3. Experimental analysis

3.1. Data sets

We performed experiments on four publicly available data sets, namely the Surrey [10], Oldenburg [12], Aachen [11] and MIVIA data sets. The first three data sets contain sound sources at different angles and distances from the microphone array and sounds recorded in environments with different reverberation.

The Surrey data set was recorded by a Cortex Instruments Mk.2 Head and Torso Simulator (HATS). The loudspeakers were placed around the HATS on an arc with a 1.5 m radius between −90° and +90°. The sounds were recorded at intervals of 5° and four different room configurations were considered. For each room a reverberation time RT60 was estimated according to the BS EN ISO 3382 standard.

The Oldenburg data set [12] was recorded by a HATS in five semi-controlled environments.

The Aachen data set [11] was recorded in two rooms, with sound sources at a distance of up to 10 m from the microphones. The characteristics of the rooms for these data sets are reported in Table 1.

Surrey data set
S.AR   The room size was 17.04 × 14.53 × 6.5 m (l × w × h); anechoic conditions were simulated by truncating the first reflected waves.
S.A    The room is a small (5.72 × 6.64 × 2.31 m) office with seats for 8 persons. The reverberation time is RT60 = 320 ms.
S.B    Medium sized classroom (4.65 × 9.6 × 2.68 m) with RT60 = 470 ms.
S.C    Large room (23.5 × 18.8 × 2.31 m) used as a cinema hall for more than 400 people, with RT60 = 680 ms.
S.D    A medium sized room (8.72 × 8.02 × 4.25 m) for presentations, with a high ceiling and RT60 = 890 ms.

Table 1: Description of the rooms where the sounds contained in the data sets were recorded.

We recorded the MIVIA data set and made it available for research purposes. It was acquired with a double aim: (1) to evaluate the performance of localization systems with cheap and less accurate devices, instead of very expensive devices which are more difficult to use in real applications; (2) to evaluate the performance of the proposed approach by varying the distance between the microphones. We considered two different environments, namely a living room of 6 × 4 × 2.7 m (M.LR) and a laboratory of our university of 8 × 8 × 7 m (M.L). The sounds were recorded by using the microphone array of a Kinect sensor. The hardware set was expanded so as to build an array of four microphones, where the maximum distance between them is 138 cm, as shown in Figure 2. Note that the distance between the two central microphones was fixed to 23 cm, as in the original setup of the Kinect sensor. The recordings were made considering the following acquisition angles: −90°, −60°, −45°, −20°, 0°, 20°, 45°, 60°, 90°. As for M.LR, a man repeated the same word aloud for 10 seconds at a distance of 2 m from the microphones. In the second scenario, a Bose loudspeaker was used to reproduce part of a song (10 seconds for each recording) at two distances from the array center: 1 m and 3 m.

Figure 2: Setup of the kinect sensor for the realization of the MIVIA data set.

3.2. Performance evaluation

We organized the experiments in four groups, so as to evaluate the performance of the system when the sound source is (1) at different angles and (2) at different distances from the microphones, and under (3) different SNR values and (4) different configurations of the binaural microphone set. Each group of tests involves specific rooms from the four data sets, which we report in the following:

• TEST1: S.AR, S.A, S.B, S.C, S.D from the Surrey data set and the rooms O.AR and O.O1 from the Oldenburg data set.

• TEST2: A.MR and A.LR from the Aachen data set.

• TEST3: O.O2, O.C and O.CY from the Oldenburg data set.

• TEST4: M.LR and M.L from the MIVIA data set.

We measured the performance of the real-time sound localization system by computing the accuracy (Acc) and the mean absolute error (MAE) of the angle estimation, as follows:

Acc[%] = 100 − (Ne · 100) / N,   MAE = (1/N) Σi |xi − x̂i|,

where N is the number of considered sounds in a particular test, Ne is the number of wrongly estimated angles, and the MAE sum runs over the Ne wrongly estimated angles. For the computation of the MAE, x̂i is the estimated direction while xi is the nominal one.

3.3. Results discussion

In the analysis of the performance of the real-time localization method we considered several tolerance values for the localization error. Specifically, we considered different margins for the maximum acceptable error on the estimation of the sound source position. A sound source is correctly localized if the estimated angle falls into a tolerance interval around the nominal value of the sound direction. The value of the tolerance influences both accuracy and MAE. Indeed, only wrongly estimated directions are considered for the computation of the MAE.

For the experiments in the groups "TEST1" and "TEST2" we performed a quantitative analysis of the results by computing the accuracy and MAE. In Table 2 and Table 3 we report the results that we achieved for the different environments in "TEST 1" and "TEST 2", respectively. We observed generally high performance and precise estimations. The performance decreases in highly noisy environments.

In the experiment "TEST3", we evaluated the performance of the system in environments with several values of signal-to-noise ratio. For the rooms in the Oldenburg data set, precise measurements of the nominal sound source direction are not provided, but rather only an indication of the direction of the sound. In Table 5, we report the angle estimated
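A minimal sketch of the two metrics above, assuming that an estimate counts as wrong when it deviates from the nominal angle by more than the chosen tolerance, and that only the wrongly estimated angles contribute to the MAE sum (both quantities normalized by N, as in the definition):

```python
import numpy as np

def acc_and_mae(nominal, estimated, tolerance=5.0):
    """Accuracy (Acc, in %) and mean absolute error (MAE, in degrees)
    of angle estimates, as sketched from Sec. 3.2."""
    nominal = np.asarray(nominal, dtype=float)
    estimated = np.asarray(estimated, dtype=float)
    err = np.abs(nominal - estimated)
    wrong = err > tolerance                 # wrongly localized sounds
    N, Ne = len(nominal), int(wrong.sum())
    acc = 100.0 - Ne * 100.0 / N            # Acc[%] = 100 - Ne*100/N
    mae = err[wrong].sum() / N              # wrong-angle errors averaged over N
    return acc, mae

# Hypothetical example: one estimate off by 6 degrees, three within tolerance.
acc, mae = acc_and_mae([0, 20, 45, 90], [2, 19, 51, 90], tolerance=5.0)
print(acc, mae)  # -> 75.0 1.5
```

With this reading, a perfectly localized test set yields Acc = 100% and MAE = 0°, matching the anechoic rooms in Tables 2 and 3.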
Test 1
                     Tolerance
Room            5°        10°       15°
S.AR    Acc     100%      100%      100%
        MAE     0°        0°        0°
S.A     Acc     78.37%    100%      100%
        MAE     1.08°     0°        0°
S.B     Acc     81.08%    97.29%    100%
        MAE     1.08°     0°        0°
S.C     Acc     86.48%    94.59%    94.59%
        MAE     2.83°     2.43°     2.43°
S.D     Acc     51.35%    70.27%    70.27%
        MAE     14°       13°       13°
O.AR    Acc     97.29%    100%      100%
        MAE     0°        0°        0°
O.O1    Acc     48.64%    70.27%    75.67%
        MAE     11.41°    10.33°    9.79°

Table 2: Performance results for the group of experiments "TEST 1" with different values of the error tolerance interval.

Test 3
Room    Configuration    Expected direction    Estimated angle
O.O2    1C               Frontal               −5°
O.O2    1D               Frontal               5°
O.O2    1B               Right                 35°
O.O2    2A               Right                 15°
O.O2    2B               Left                  −10°
O.C     1A               Frontal               0°
O.C     2D               Frontal               5°
O.C     2E               Right                 60°
O.C     1D               Right                 90°
O.C     1B               Left                  −30°
O.C     1C               Left                  −90°
O.CY    1A               Right                 35°
O.CY    1B               Right                 10°
O.CY    1C               Left                  −20°
O.CY    1D               Left                  −50°
O.CY    1E               Left                  −60°

Table 4: Qualitative analysis of the results for the group of experiments "TEST 3".
Test 2
                     Tolerance
Room            5°        10°       15°
A.MR    Acc     100%      100%      100%
        MAE     0°        0°        0°
A.LR    Acc     50%       100%      100%
        MAE     3.53°     0°        0°

Table 3: Performance results for the group of experiments "TEST 2" with different values of the error tolerance interval.

Test 4
Room                   Expected Angle    CONF1    CONF2
Living Room (M.LR)     −90°              −58°     −45°
                       −60°              −45°     −33°
                       −45°              −40°     −37°
                       −20°              −26°     −21°
                       0°                0°       0°
                       20°               23°      19°
                       45°               30°      37°
                       60°               55°      40°
                       90°               56°      55°
Laboratory (M.L)       −90°              −62°     −62°
                       −60°              −55°     −52°