Exploiting Deep Neural Networks and Head Movements For Robust Binaural Localization of Multiple Sources in Reverberant Environments
Abstract—This paper presents a novel machine-hearing system that exploits deep neural networks (DNNs) and head movements for robust binaural localization of multiple sources in reverberant environments. DNNs are used to learn the relationship between the source azimuth and binaural cues, consisting of the complete cross-correlation function (CCF) and interaural level differences (ILDs). In contrast to many previous binaural hearing systems, the proposed approach is not restricted to localization of sound sources in the frontal hemifield. Due to the similarity of binaural cues in the frontal and rear hemifields, front–back confusions often occur. To address this, a head movement strategy is incorporated in the localization model to help reduce the front–back errors. The proposed DNN system is compared to a Gaussian-mixture-model-based system that employs interaural time differences (ITDs) and ILDs as localization features. Our experiments show that the DNN is able to exploit information in the CCF that is not available in the ITD cue, which together with head movements substantially improves localization accuracies under challenging acoustic scenarios, in which multiple talkers and room reverberation are present.

Index Terms—Binaural sound source localisation, deep neural networks, head movements, machine hearing, multi-conditional training, reverberation.

I. INTRODUCTION

THIS paper aims to reduce the gap in performance between human and machine sound localisation, in conditions where multiple sound sources and room reverberation are present. Human listeners have little difficulty in localising sounds under such conditions; they are able to decode the complex acoustic mixture that arrives at each ear with apparent ease [1]. In contrast, sound localisation by machine systems is usually unreliable in the presence of interfering sources and reverberation. This is the case even when an array of multiple microphones is employed [2], as opposed to the two (binaural) sensors available to human listeners.

The human auditory system determines the azimuth of sounds in the horizontal plane by using two principal cues: interaural time differences (ITDs) and interaural level differences (ILDs). A number of authors have proposed binaural sound localisation systems that use the same approach, by extracting ITDs and ILDs from acoustic recordings made at each ear of an artificial head [3]–[6]. Typically, these systems first use a bank of cochlear filters to split the incoming sound into a number of frequency bands. The ITD and ILD are then estimated in each band, and statistical models such as Gaussian mixture models (GMMs) are used to determine the source azimuth from the corresponding binaural cues [6]. Furthermore, the robustness of this approach to varying acoustic conditions can be improved by using multi-conditional training (MCT). This introduces uncertainty into the statistical models of the binaural cues, enabling them to handle the effects of reverberation and interfering sound sources [4]–[7].

In contrast to many previous machine systems, the approach proposed here is not restricted to sound localisation in the frontal hemifield; we consider source positions in the 360◦ azimuth range around the head. In this unconstrained case, the location of a sound cannot be uniquely determined by ITDs and ILDs; due to the similarity of these cues in the frontal and rear hemifields, front–back confusions occur [8]. Although machine listening studies have noted this as a problem [6], [9], listeners rarely make such confusions because head movements, as well as spectral cues due to the pinnae, play an important role in resolving front–back confusions [8], [10], [11].

Relatively few machine localisation systems have attempted to incorporate head movements. Braasch et al. [12] averaged cross-correlation patterns across different head orientations in order to resolve front–back confusions in anechoic conditions. More recently, May et al. [6] combined head movements and MCT in a system that achieved robust sound localisation performance in reverberant conditions. In their approach, the localisation system included a hypothesis-driven feedback stage which triggered a head movement when the azimuth could not be unambiguously estimated. Subsequently, Ma et al. [9] evaluated the effectiveness of different head movement strategies, using a complex acoustic environment that included multiple sources and room reverberation. In agreement with studies on human sound localisation [13], they found that localisation errors were minimised by a strategy that rotated the head towards the target sound source.

Manuscript received April 3, 2017; revised July 4, 2017; accepted August 28, 2017. Date of publication October 27, 2017; date of current version November 27, 2017. This work was supported by the European Union FP7 project TWO!EARS (http://www.twoears.eu) under Grant 618075. The associate editor coordinating the review of this manuscript and approving it for publication was Dr. Tuomas Virtanen. (Corresponding author: Ning Ma.)
N. Ma and G. J. Brown are with the Department of Computer Science, University of Sheffield, Sheffield S1 4DP, U.K. (e-mail: [email protected]; [email protected]).
T. May is with the Hearing Systems Group, Technical University of Denmark, DK-2800 Kgs. Lyngby, Denmark (e-mail: [email protected]).
Color versions of one or more of the figures in this paper are available online at http://ieeexplore.ieee.org.
Digital Object Identifier 10.1109/TASLP.2017.2750760
2329-9290 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications_standards/publications/rights/index.html for more information.
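The two principal cues discussed in the introduction can be estimated from a frame of a band-filtered binaural signal: the ITD as the lag of the cross-correlation peak, and the ILD as the energy ratio between the two ears in dB. The following is a minimal single-band sketch; the function name, the ±1 ms lag limit, and the frame-based formulation are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def estimate_itd_ild(left, right, fs=16000, max_itd=1e-3):
    """Estimate ITD (seconds) and ILD (dB) for one frame of a binaural signal.

    ITD: lag of the cross-correlation peak, restricted to physically
    plausible lags (about +/-1 ms for a human-sized head).
    ILD: energy ratio between the two ear signals in dB.
    """
    max_lag = int(max_itd * fs)
    # Full cross-correlation; keep only lags in [-max_lag, max_lag].
    ccf = np.correlate(left, right, mode="full")
    mid = len(left) - 1                       # index of the zero-lag term
    lags = np.arange(-max_lag, max_lag + 1)
    ccf = ccf[mid - max_lag: mid + max_lag + 1]
    itd = lags[np.argmax(ccf)] / fs           # peak-picking over lags
    eps = np.finfo(float).eps
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps)
                          / (np.sum(right ** 2) + eps))
    return itd, ild
```

With a 16 kHz sampling rate and a ±1 ms lag limit, the peak search covers 33 discrete lags; a positive ITD here means the left-ear signal leads.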
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2445
TABLE I
ROOM CHARACTERISTICS OF THE SURREY BRIR DATABASE [21]
loading corresponding BRIRs for the relative source azimuths. Such simulation is only approximate for the reverberant room conditions because the Surrey BRIR database was measured by moving loudspeakers around a fixed dummy head. With the Auditorium3 BRIRs, more realistic head movements were simulated by loading the corresponding BRIR for a desired head orientation. For all experiments, head movements were limited to the range of ±30◦.

C. Multi-conditional Training

The proposed systems assumed no prior knowledge of room conditions. The localisation models were trained using only anechoic HRIRs with added diffuse noise, and no reverberant BRIRs were used during training.

Previous studies [4]–[7] have shown that MCT features can increase the robustness of localisation systems in reverberant multi-source conditions. Binaural MCT features were created by mixing a target signal at a specified azimuth with diffuse noise at various signal-to-noise ratios (SNRs). The diffuse noise is the sum of 72 uncorrelated, white Gaussian noise sources, each of which was spatialised across the full 360◦ azimuth range in steps of 5◦. Both the directional target signals and the diffuse noise were created using the same anechoic HRIR recorded using a KEMAR dummy head [20]. This approach was used in preference to adding reverberation during training, since previous studies (e.g., [5]) suggested that it was more likely to generalise well across a wide range of reverberant test conditions.

The training material consisted of speech sentences from the TIMIT database [22]. A set of 30 sentences was randomly selected for each of the 72 azimuth locations. For each spatialised training sentence, the anechoic signal was corrupted with diffuse noise at three SNRs (20, 10 and 0 dB). The corresponding binaural features (ITDs, CCFs and ILDs) were then extracted. Only those features for which the a priori SNR between the target and the diffuse noise exceeded −5 dB were used for training. This negative SNR criterion ensured that the multi-modal clusters in the binaural feature space at higher frequencies, which are caused by periodic ambiguities in the cross-correlation analysis, were properly captured.

D. Experimental Setup

The GRID corpus [23] was used to create three evaluation sets of 50 acoustic mixtures which consisted of one, two or three simultaneous talkers, respectively. Each GRID sentence is approximately 1.5 s long and was spoken by one of 34 native British-English talkers. The sentences were normalised to the same root mean square (RMS) value prior to spatialisation. For the two-talker and three-talker mixtures, the additional azimuth directions were randomly selected from the same azimuth range while ensuring an angular distance of at least 10◦ between all sources. Each evaluation set included 50 acoustic mixtures which were kept the same for all the evaluated azimuths and room conditions in order to ensure any performance difference was due to test conditions rather than signal variation. Since the duration of each GRID sentence was different, and there was silence of various lengths at the beginning of each sentence, the central 1 s segment of each sentence was selected for evaluation.

Note that although the models were trained and evaluated using speech signals, our systems are not intended to localise only speech sources. Therefore a frequency range from 80 Hz to 8 kHz was selected for the signals sampled at 16 kHz. Our previous studies [6], [15] also show that 32 Gammatone filters (see Section II-A) provide a good tradeoff between frequency resolution and computational cost. As the evaluation included localisation of up to three overlapping talkers, using too few filters would result in insufficient frequency resolution to reliably localise multiple talkers.

The baseline system was a state-of-the-art localisation system [6] that modelled both ITD and ILD features within a GMM framework. As in [6], the GMM modelled the binaural features using 16 Gaussian components and diagonal covariance matrices for each azimuth and each frequency band. The GMM parameters were initialised by 15 iterations of the k-means clustering algorithm and further refined using 5 iterations of the expectation-maximization (EM) algorithm. The second localisation model was the proposed DNN system using the CCF and ILD features. Each DNN employed four layers including two hidden layers each consisting of 128 hidden nodes (see Section II-B).

Both localisation systems were evaluated using different training strategies (clean training and MCT), various localisation feature sets (ITD, ILD and CCF), and with or without head movements. When no head movement was employed, the source azimuths were estimated using the entire 1 s segment from each acoustic mixture. If head movement was used, the 1 s segment was divided into two 0.5 s long blocks and the second block was provided to the system after completion of a head movement. Therefore in both conditions the same signal duration was used for localisation.

The gross accuracy of localisation was measured by comparing true source azimuths with the estimated azimuths. The number of active speech sources N was assumed to be known a priori and the N azimuths for which the posterior probabilities were the largest were selected as the estimated azimuths. Localisation of a source was considered accurate if the estimated azimuth was less than or equal to 5◦ away from the true source azimuth:

    LocAcc = N_{dist(φ, φ̂) ≤ θ} / N    (5)

where dist(·) is the angular distance between two azimuths, N_{dist(φ, φ̂) ≤ θ} is the number of sources for which this distance does not exceed the threshold, φ is the true source azimuth, φ̂ is the estimated azimuth, and θ is the threshold in degrees (5◦ in this study). This metric is preferred to RMS error because our study is concerned with full 360◦ localisation, and localisation errors in degrees are often large due to front–back confusions.

IV. RESULTS AND DISCUSSION

A. Influence of MCT

The first experiment investigated the impact of MCT on the localisation accuracy of the proposed systems. Two scenarios were
TABLE II
GROSS LOCALIZATION ACCURACY IN % FOR VARIOUS SETS OF BRIRS WHEN LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE
FRONTAL HEMIFIELD ONLY AND IN THE FULL 360◦ RANGE
                       Anechoic         Room A           Room B           Room C           Room D          Avg.
Range    System  MCT   1    2    3     1    2    3      1    2    3      1    2    3      1    2    3
Frontal  GMM     no    100  99.0 90.5  84.0 63.1 52.8   81.5 59.8 51.8   100  82.5 65.5   88.2 61.2 53.5  75.6
         GMM     yes   100  99.9 98.7  99.2 97.1 90.7   100  97.7 91.6   100  99.3 96.5   100  98.4 91.5  97.4
         DNN     no    100  100  99.6  100  99.2 92.2   100  99.0 90.4   100  99.9 96.7   99.9 98.7 91.1  97.8
         DNN     yes   100  100  99.7  100  99.5 96.3   100  99.7 96.2   100  99.9 98.2   100  99.6 95.3  99.0
360◦     GMM     no    100  97.1 82.6  82.6 48.9 30.7   65.6 38.3 25.3   98.4 70.3 50.2   77.2 46.3 30.0  62.9
         GMM     yes   100  100  97.8  99.0 94.2 80.7   97.0 89.0 77.6   100  97.6 88.7   97.3 90.6 79.0  92.6
         DNN     no    100  100  97.4  100  87.0 68.4   94.5 79.0 63.9   97.7 92.5 78.9   94.4 83.4 67.9  87.0
         DNN     yes   100  100  98.6  99.7 97.3 87.9   97.2 93.7 86.7   100  97.3 90.2   97.3 94.0 85.0  95.0
The models were trained using either clean training or the MCT method.
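The gross accuracies reported above follow the metric in (5): a source counts as correctly localised when an estimate lies within θ = 5◦ of its true azimuth. A small sketch; matching each true azimuth to its nearest estimate is an assumption about how dist(·) is applied when several sources are active:

```python
def loc_acc(true_az, est_az, theta=5.0):
    """Gross localisation accuracy of (5): the fraction of true sources
    whose nearest estimate lies within `theta` degrees, using the
    circular angular distance between azimuths (degrees, 0-360)."""
    def dist(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)           # shortest way around the circle
    hits = sum(
        1 for t in true_az
        if min(dist(t, e) for e in est_az) <= theta
    )
    return hits / len(true_az)
```

For example, with true azimuths (0◦, 90◦, 180◦) and estimates (3◦, 100◦, 182◦), only the 90◦ source falls outside the 5◦ threshold, giving an accuracy of 2/3.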
Fig. 5. Localization error rates produced by various systems using either clean training or MCT. Localization was performed in the full 360◦ range, so that
front–back errors could occur, as shown by the white bars for each system. No head movement strategy was employed.
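The MCT material contrasted with clean training here is built as described in Section III-C: a diffuse background summed from 72 spatialised white Gaussian noise sources, mixed with the target at a prescribed SNR. A sketch of those two steps; `spatialise` is a hypothetical placeholder for HRIR convolution, not the paper's code:

```python
import numpy as np

def mix_at_snr(target, noise, snr_db):
    """Scale `noise` so the target-to-noise energy ratio equals
    `snr_db`, then return the mixture (target + scaled noise)."""
    pt = np.mean(target ** 2)
    pn = np.mean(noise ** 2)
    gain = np.sqrt(pt / (pn * 10.0 ** (snr_db / 10.0)))
    return target + gain * noise

def diffuse_noise(n_samples, spatialise, n_dirs=72, seed=0):
    """Sum of `n_dirs` uncorrelated white noise sources spatialised in
    5-degree steps over the full 360-degree range. `spatialise` maps
    (mono signal, azimuth) -> binaural pair of shape (2, n_samples)
    and stands in for anechoic HRIR convolution."""
    rng = np.random.default_rng(seed)
    out = np.zeros((2, n_samples))
    for az in range(0, 360, 360 // n_dirs):
        out += spatialise(rng.standard_normal(n_samples), az)
    return out
```

Training features would then be extracted from `mix_at_snr(spatialised_target, diffuse, snr)` at each of the 20, 10 and 0 dB conditions.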
considered: i) sound localisation was restricted to the frontal hemifield, so that the systems estimated source azimuths within the range [−90◦, 90◦]; ii) the systems were not informed that the sources lay only in the frontal hemifield and were free to report the azimuth in the full 360◦ azimuth range. In the second scenario front–back confusions could occur.

Table II lists gross localisation accuracies of all the systems evaluated using various BRIR sets from the Surrey database. First consider the scenario of localisation in the frontal hemifield. For the GMM baseline system, the MCT approach substantially improved the robustness across all conditions, with an average localisation accuracy of 97.4% compared to only 75.6% using clean training. The improvement with MCT was particularly large in multi-talker scenarios and in the presence of room reverberation. For the DNN system, the improvement with MCT over clean training was not as large as that for the GMM system and was only observed in the multi-talker scenarios. The limited improvement is partly because with clean training the performance of the DNN system is already very robust in most conditions, with an average accuracy of 97.8%, which is already better than the GMM system with MCT. This suggests that when localisation was restricted to the frontal hemifield, the DNN could effectively extract cues from the clean CCF-ILD features that are robust in the presence of reverberation.

The case of full 360◦ localisation is more challenging, since front–back errors could occur. The GMM system with clean training failed to localise the talkers accurately, with error rates greater than 50% when localising multiple simultaneous talkers. The DNN system with clean training was substantially more robust than the GMM system, but its performance also decreased significantly when multiple talkers were present. The benefit of the MCT method became more apparent for both systems in this scenario: the average localisation accuracy increased from 62.9% to 92.6% for the GMM system and from 87.0% to 95.0% for the DNN system. Across all the room conditions the largest benefits were observed in room B, where the direct-to-reverberant ratio was the lowest, and in room D, where the reverberation time T60 was the longest.

Errors made in 360◦ localisation could be due to front–back confusion as well as interference caused by reverberation and overlapping talkers. Fig. 5 shows errors made by both the GMM and the DNN systems using either clean training or MCT in different room conditions; the errors due to front–back confusions are indicated by white bars for each system. Here a localisation error is considered to be a front–back confusion when the estimated azimuth is within ±20◦ of the azimuth that would produce the same ITDs in the rear hemifield. It is clear that front–back confusions contributed a large portion of the localisation errors for both systems, in particular when clean training was used. When the MCT method was used, not only were the errors due to interference from reverberation and overlapping talkers (the non-white portion of each bar in Fig. 5) greatly reduced, but the systems also produced substantially fewer front–back errors (white bars in Fig. 5). As will be discussed in the next section, without head movements the main cues distinguishing between front–back azimuth pairs lie in the combination of interaural level and time differences (or ITD-related features such as the cross-correlation function). MCT provides the training
2450 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017
TABLE III
GROSS LOCALIZATION ACCURACY IN % USING VARIOUS FEATURE SETS FOR LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE FULL 360◦ RANGE
                    Anechoic         Room A           Room B           Room C           Room D          Avg.
System  Features    1    2    3      1    2    3      1    2    3      1    2    3      1    2    3
GMM     ITD         100  99.8 96.2   99.2 81.6 67.7   91.4 76.6 64.9   97.2 89.4 76.6   89.1 76.6 65.8  84.8
GMM     ITD-ILD*    100  100  97.8   99.0 94.2 80.7   97.0 89.0 77.6   100  97.6 88.7   97.3 90.6 79.0  92.6
GMM     CCF-ILD     100  100  98.4   100  87.2 73.9   92.1 81.7 71.5   99.9 93.8 81.6   92.6 83.2 72.3  88.5
DNN     CCF         100  100  99.0   99.8 95.8 86.7   91.8 89.5 83.7   98.3 95.8 89.0   91.6 87.8 80.8  92.7
DNN     CCF-ILD*    100  100  98.6   99.7 97.3 87.9   97.2 93.7 86.7   100  97.3 90.2   97.3 94.0 85.0  95.0
The models were trained using the MCT method. The best feature set for each system is marked with an asterisk.
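The CCF-ILD feature compared here concatenates, for each Gammatone band, the cross-correlation function over physiologically plausible lags with the band-level ILD. A sketch of one band's feature vector; the ±1 ms lag range and the energy normalisation are plausible assumptions rather than the paper's exact choices:

```python
import numpy as np

def band_features(left, right, fs=16000, max_itd=1e-3):
    """Assemble a per-band localisation feature: the normalised CCF
    over lags of +/-1 ms concatenated with the band ILD in dB."""
    eps = np.finfo(float).eps
    max_lag = int(max_itd * fs)
    mid = len(left) - 1                       # zero-lag index of the full CCF
    ccf = np.correlate(left, right, mode="full")[mid - max_lag: mid + max_lag + 1]
    # Normalise by the geometric mean of the ear energies -> values in [-1, 1].
    norm = np.sqrt(np.sum(left ** 2) * np.sum(right ** 2)) + eps
    ccf = ccf / norm
    ild = 10.0 * np.log10((np.sum(left ** 2) + eps) / (np.sum(right ** 2) + eps))
    return np.concatenate([ccf, [ild]])       # 2 * max_lag + 2 values
```

At 16 kHz this yields a 34-dimensional vector per band (33 CCF lags plus the ILD); unlike a single picked ITD, the full CCF preserves the systematic multi-dimensional structure that the DNN can exploit.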
Fig. 6. Comparison of localization error rates produced by various systems using different spatial features. Localization was not restricted to the frontal hemifield, so that front–back errors can occur, as indicated by the white bars for each system. No head movement strategy was employed.
stage with better regularisation of the features, which is able to improve the generalisation of the learned models and better discriminate between the front–back confusable azimuths.

It is also worth noting that the training and testing stages used HRTFs collected with different dummy heads (the KEMAR was used for training and the HATS was used for testing). However, with MCT the localisation accuracy in the anechoic condition for localising one or two sources was 100%, which suggests that MCT also reduced the sensitivity to mismatches of the receiver.

B. Contribution of the ILD Cue

The second experiment investigated the influence of different localisation features, in particular the contribution of the ILD cue. Table III lists the gross localisation accuracies using various feature sets. Here all models were trained using the MCT method and the active head movement strategy was not applied. When ILDs were not used, the GMM performance using just ITDs suffered greatly in reverberant rooms and when localising overlapping talkers; the average localisation accuracy decreased from 92.6% to 84.8%. The performance drop was particularly pronounced in rooms B and D, where the reverberation was strong. For the DNN system, excluding the ILDs also decreased the localisation performance, but the drop was more moderate, with the average accuracy reduced from 95.0% to 92.7%. The DNN system using the CCF feature exhibited more robustness in the reverberant multi-talker conditions than the GMM system using the ITD feature. As previously discussed, computation of the ITD involves a peak-picking operation that can be less reliable in challenging conditions, whereas the systematic changes in the CCF with the source azimuth provide richer information that can be exploited by the DNN.

When ILDs were not used, the localisation errors were largely due to an increased number of front–back errors, as suggested by Fig. 6. For single-talker localisation in rooms B and D, without using ILDs almost all the errors made by the systems were front–back errors. When ILDs were used, the number of front–back errors was greatly reduced in all conditions. This suggests that the ILD cue plays a major role in resolving front–back confusions. ITDs alone may appear largely symmetric between the front and back hemifields, but together with ILDs they create the necessary asymmetries (due to the KEMAR head with pinnae) for the models to learn the differences between front and back azimuths.

Table III also lists localisation results for the GMM system when using the same CCF-ILD feature set as used by the DNN system. The GMM failed to extract the systematic structure in the CCF spanning multiple feature dimensions, most likely due to its inferior ability to model correlated features. Its average localisation accuracy is only 88.5%, compared to 95.0% for the DNN system, and again it suffered the most in more reverberant conditions such as rooms B and D.

C. Benefit of the Head Movement Strategy

Table IV lists the gross localisation accuracies with or without head movement. All systems were trained using the MCT method and employed their respective best performing features (GMM ITD-ILD and DNN CCF-ILD).

Both the GMM and DNN systems benefitted from the use of head movements. It is clear from Fig. 7 that the localisation errors were almost entirely due to front–back confusions in one-talker localisation. By exploiting the head movement, the systems managed to remove most of the front–back errors and achieved near 100% localisation accuracies. In two- or three-talker localisation, the number of front–back errors was also
TABLE IV
GROSS LOCALIZATION ACCURACIES IN % WITH OR WITHOUT THE HEAD MOVEMENT WHEN LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE
FULL 360◦ AZIMUTH RANGE
                     Anechoic         Room A           Room B           Room C           Room D          Avg.
System  Head mov.    1    2    3      1    2    3      1    2    3      1    2    3      1    2    3
GMM     no           100  100  97.8   99.0 94.2 80.7   97.0 89.0 77.6   100  97.6 88.7   97.3 90.6 79.0  92.6
GMM     yes          100  100  97.5   100  97.3 83.4   99.8 93.1 79.9   99.9 99.3 90.8   99.9 93.0 79.5  94.2
DNN     no           100  100  98.6   99.7 97.3 87.9   97.2 93.7 86.7   100  97.3 90.2   97.3 94.0 85.0  95.0
DNN     yes          100  100  98.4   100  99.2 90.0   99.8 96.1 86.9   100  99.0 91.6   99.5 94.7 84.7  96.0
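The head-movement strategy behind these accuracies exploits a simple geometric fact: a real source keeps its world azimuth across a head rotation, while its front–back ghost does not. The function below is a simplified illustration of that disambiguation idea, not the paper's hypothesis-driven implementation; it assumes 0◦ is straight ahead and azimuths increase counter-clockwise:

```python
def resolve_front_back(est_before, est_after, rotation):
    """Resolve a front-back ambiguity using a head rotation.

    `est_before` and `est_after` are head-relative azimuth estimates
    (degrees) taken before and after rotating the head by `rotation`
    degrees. Each estimate is ambiguous between itself and its mirror
    across the interaural axis; the world-frame candidate that stays
    consistent across the two head orientations is returned.
    """
    def mirror(az):
        # Rear azimuth producing (approximately) the same ITD.
        return (180.0 - az) % 360.0

    def circ_dist(a, b):
        d = abs(a - b) % 360.0
        return min(d, 360.0 - d)

    # World-frame candidates for each orientation (head at 0, then at `rotation`).
    before = [est_before % 360.0, mirror(est_before)]
    after = [(est_after + rotation) % 360.0,
             (mirror(est_after) + rotation) % 360.0]
    # Keep the pair of candidates that agrees best across orientations.
    return min(((b, a) for b in before for a in after),
               key=lambda p: circ_dist(*p))[0]
```

For a world source at 30◦ and a +30◦ head rotation, the estimate pair (150◦, 0◦), i.e. the ghost picked first, still resolves to 30◦, because only that candidate is consistent across both orientations.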
Fig. 7. Localization error rates produced by various systems with or without head movement when localizing one, two, or three overlapping talkers. Localization
was performed in the 360◦ azimuth range so that front–back errors can occur, as indicated by the white bars for each system.
Fig. 8. Localization error rates produced by various systems with or without head movement, as a function of the azimuth. The histogram bin width is 20◦ . Here
the error rates were averaged across the 1-, 2- and 3-talker localization tasks. Localization was performed in the full 360◦ azimuth range so that front–back errors
can occur, as indicated by the white bars for each system.
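The white bars in these figures follow the classification rule given in Section IV-A: an error counts as a front–back confusion when the estimate falls within ±20◦ of the azimuth mirrored across the interaural axis (the rear azimuth producing roughly the same ITD). A sketch, assuming 0◦ straight ahead with azimuths increasing counter-clockwise:

```python
def front_back_mirror(az):
    """Azimuth in the opposite hemifield yielding (approximately)
    the same ITD: the reflection across the interaural axis."""
    return (180.0 - az) % 360.0

def is_front_back_error(true_az, est_az, tol=20.0):
    """Classify an error as a front-back confusion if the estimate
    lies within `tol` degrees of the mirror of the true azimuth."""
    d = abs(est_az - front_back_mirror(true_az)) % 360.0
    return min(d, 360.0 - d) <= tol
```

For example, an estimate of 155◦ for a true source at 30◦ (mirror 150◦) is counted as a front–back confusion, whereas an estimate of 60◦ is an ordinary localisation error.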
reduced with the use of head movements. When overlapping talkers were present, the systems produced many localisation errors other than front–back errors, due to the partial evidence available to localise each talker. By removing most front–back errors, the systems were able to further improve the accuracy of localising overlapping sound sources.

Fig. 8 shows the localisation error rates as a function of the azimuth. The error rates here were averaged across the 1-, 2- and 3-talker localisation tasks. Across most room conditions, sound localisation was generally more reliable at more central locations than at lateral source locations. This is particularly the case for the GMM system, as shown in Fig. 8, where the
Fig. 9. Localization error rates produced by various systems as a function of the azimuth for the Auditorium3 task. Localization was performed in the full 360◦
azimuth range so that front–back errors can occur, as indicated by the white bars for each system.
localisation error rates for sources at the sides were above 20% even in the least reverberant room A. It is also clear from Fig. 8 (white bars) that localisation errors at lateral azimuths were mostly not due to front–back confusions, and in this case the proposed DNN system significantly outperformed the GMM system.

At the central azimuths, on the other hand, almost all the localisation errors were due to front–back confusions. It is noticeable that in more reverberant conditions (such as rooms B and D), the error rates at the central azimuths [−10◦, 10◦] were particularly high due to front–back errors for both the GMM and the DNN systems when head movement was not used. The front–back errors were concentrated at central azimuths, probably because the binaural features (interaural time and level differences) were less discriminative between 0◦ and 180◦ than between the more lateral azimuth pairs.

Finally, Fig. 9 shows the localisation error rates using the Auditorium3 BRIRs, in which head movements were more accurately simulated by loading the corresponding BRIR for a given head orientation. Overall the DNN systems significantly outperformed the GMM systems. For single-source localisation the DNN system achieved near 100% localisation accuracy for all source locations, including the one at 131◦ in the rear hemifield. The GMM system produced about a 5% error rate for the rear source but performed well for the other locations. For two- and three-source localisation, both the GMM and DNN systems benefitted from head movements across most azimuth locations. For the GMM system the benefit was particularly pronounced for the source at 51◦, with the localisation error rate reduced from 14% to 4% in two-source localisation and from 36% to 14% in three-source localisation. The rear source at 131◦ appeared to be difficult for the GMM system to localise even with head movement, with a 20% error rate in two-source localisation. The DNN system with head movements was able to reduce the error rate for the rear source at 131◦ to 8%.

In general the performance of the models for the 51◦ and 131◦ locations was worse than for the other source locations when multiple sources were present at the same time. This is most likely due to the nature of the room acoustics at these locations, e.g., they are further away from the listener and closer to walls. When the sources overlap with each other, fewer glimpses are left for localisation of each source, and with stronger reverberation the sources at 51◦ and 131◦ became more difficult to localise.

V. CONCLUSION

This paper presented a machine-hearing framework that combines DNNs and head movements for robust localisation of multiple sources in reverberant conditions. Since simultaneous talkers were located in the full 360◦ azimuth range, front–back confusions occurred. Compared to a GMM-based system, the proposed DNN system was able to exploit the rich information provided by the entire CCF, and thus substantially reduced localisation errors. The MCT method was effective in combatting reverberation, and allowed anechoic signals to be used for training a robust localisation model that generalised well to unseen reverberant conditions and to mismatched artificial heads used in training and testing. It was also found that the inclusion of ILDs was necessary for reducing front–back confusions in reverberant rooms. The use of head rotation further increased the robustness of the proposed system, with an average localisation accuracy of 96% under acoustic scenarios where up to three competing talkers and room reverberation were present.

In the current study, the use of DNNs allowed higher-dimensional feature vectors to be exploited for localisation, in comparison with previous studies [4]–[6]. This could be carried further, by exploiting additional context within the DNN in either the time or the frequency dimension. Moreover, it is possible to complement the features used here with other binaural features, e.g., a measure of interaural coherence [24], as well as monaural localisation cues, which are known to be important for judgment of elevation angles [25], [26]. Visual features might also be combined with acoustic features in order to achieve audio-visual source localisation.

The proposed system has been realised in a real-world human-robot interaction scenario. The azimuth posterior distributions from the DNN for each processing block were temporally smoothed using a leaky integrator, and head rotation was triggered if a front–back confusion was detected in the integrated posterior distribution. Audio signals acquired during head rotation were not processed. Such a scheme can be more practical
for a robotic platform, as head rotation often produces self-noise which makes the audio unusable.

One limitation of the current systems is that the number of active sources is assumed to be known a priori. This can be improved by including a source number estimator that is either learned from the azimuth posterior distribution output by the DNN, or provided directly as an output node in the DNN. The current study only deals with the situation where sound sources are static. Future studies will relax this constraint and address the localisation and tracking of moving sound sources within the DNN framework.

REFERENCES

[1] J. Blauert, Spatial Hearing—The Psychophysics of Human Sound Localization. Cambridge, MA, USA: MIT Press, 1997.
[2] O. Nadiri and B. Rafaely, "Localization of multiple speakers under high reverberation using a spherical microphone array and the direct-path dominance test," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 10, pp. 1494–1505, Oct. 2014.
[3] V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner, "A probabilistic model for binaural sound localization," IEEE Trans. Syst., Man, Cybern. B, Cybern., vol. 36, no. 5, pp. 982–994, Oct. 2006.
[4] T. May, S. van de Par, and A. Kohlrausch, "A probabilistic model for robust localization based on a binaural auditory front-end," IEEE Trans. Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 1–13, Jan. 2011.
[5] J. Woodruff and D. L. Wang, "Binaural localization of multiple sources in reverberant and noisy environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 20, no. 5, pp. 1503–1512, Jul. 2012.
[6] T. May, N. Ma, and G. J. Brown, "Robust localisation of multiple speakers exploiting head movements and multi-conditional training of binaural cues," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, pp. 2679–2683.
[7] T. May, S. van de Par, and A. Kohlrausch, "Binaural localization and detection of speakers in complex acoustic scenes," in The Technology of Binaural Listening, J. Blauert, Ed. New York, NY, USA: Springer, 2013, ch. 15, pp. 397–425.
[8] F. L. Wightman and D. J. Kistler, "Resolution of front–back ambiguity in spatial hearing by listener and source movement," J. Acoust. Soc. Amer., vol. 105, no. 5, pp. 2841–2853, 1999.
[9] N. Ma, T. May, H. Wierstorf, and G. J. Brown, "A machine-hearing system exploiting head movements for binaural sound localisation in reverberant conditions," in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, pp. 2699–2703.
[10] H. Wallach, "The role of head movements and vestibular and visual cues
[19] A. Bregman, Auditory Scene Analysis. Cambridge, MA, USA: MIT Press, 1990.
[20] H. Wierstorf, M. Geier, A. Raake, and S. Spors, "A free database of head-related impulse response measurements in the horizontal plane with multiple distances," Audio Eng. Soc. Conv. 130.
[21] C. Hummersone, R. Mason, and T. Brookes, "Dynamic precedence effect modeling for source separation in reverberant environments," IEEE Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1867–1871, Sep. 2010.
[22] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and N. L. Dahlgren, "DARPA TIMIT acoustic-phonetic continuous speech corpus CD-ROM," Nat. Inst. Standards Technol., Gaithersburg, MD, USA, Internal Rep. 4930, 1993.
[23] M. Cooke, J. Barker, S. Cunningham, and X. Shao, "An audio-visual corpus for speech perception and automatic speech recognition," J. Acoust. Soc. Amer., vol. 120, pp. 2421–2424, 2006.
[24] C. Faller and J. Merimaa, "Sound localization in complex listening situations: Selection of binaural cues based on interaural coherence," J. Acoust. Soc. Amer., vol. 116, pp. 3075–3089, 2004.
[25] F. Asano, Y. Suzuki, and T. Sone, "Role of spectral cues in median plane localization," J. Acoust. Soc. Amer., vol. 88, no. 1, pp. 159–168, 1990.
[26] P. Zakarauskas and M. S. Cynader, "A computational theory of spectral cue localization," J. Acoust. Soc. Amer., vol. 94, no. 3, pp. 1323–1331, 1993.

Ning Ma obtained the M.Sc. degree with distinction in advanced computer science in 2003 and the Ph.D. degree in hearing-inspired approaches to automatic speech recognition in 2008, both from the University of Sheffield, Sheffield, U.K. He has been a Visiting Research Scientist at the University of Washington, Seattle, WA, USA, and a Research Fellow at the MRC Institute of Hearing Research, Nottingham, U.K., working on auditory scene analysis with cochlear implants. Since 2015, he has been a Research Fellow at the University of Sheffield, working on computational hearing. His research interests include robust automatic speech recognition, computational auditory scene analysis, and hearing impairment. He has authored or coauthored more than 40 papers in these areas.

Tobias May studied hearing technology and audiology and received the M.Sc. degree from the University of Oldenburg, Oldenburg, Germany, in 2007 and the binational Ph.D. degree from the University of
in sound localization,” J. Exp. Psychol., vol. 27, no. 4, pp. 339–368, 1940. Oldenburg in collaboration with the Eindhoven Uni-
[11] K. I. McAnally and R. L. Martin, “Sound localization with head move- versity of Technology, Eindhoven, The Netherlands.
ments: Implications for 3D audio displays,” Front. Neurosci., vol. 8, Since 2013, he has been with the Department of Elec-
pp. 1–6, 2014. trical Engineering, Technical University of Denmark,
[12] J. Braasch, S. Clapp, A. Parks, T. Pastore, and N. Xiang, “A binaural first as a Postdoctoral Researcher (2013–2017), and
model that analyses acoustic spaces and stereophonic reproduction sys- since 2017 as an Assistant Professor. His research
tems by utilizing head rotations,” in The Technology of Binaural Listening, interests include computational auditory scene anal-
J. Blauert, Ed. Berlin, Germany: Springer, 2013, pp. 201–223. ysis, binaural signal processing, noise-robust speaker identification, and hearing
[13] S. Perrett and W. Noble, “The effect of head rotations on vertical plane aid processing.
sound localization,” J. Acoust. Soc. Amer., vol. 102, no. 4, pp. 2325–2332,
1997.
[14] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach.
Learn., vol. 2, no. 1, pp. 1–127, 2009. Guy J. Brown received the B.Sc. (Hons.) degree
[15] N. Ma, G. J. Brown, and T. May, “Exploiting deep neural networks and in applied science from Sheffield City Polytech-
head movements for binaural localisation of multiple speakers in rever- nic, Sheffield, U.K., in 1984, and the Ph.D. de-
berant conditions,” in Proc. Interspeech, 2015, pp. 3302–3306. gree in computer science from the University of
[16] Y. Jiang, D. Wang, R. Liu, and Z. Feng, “Binaural classification for re- Sheffield, Sheffield, in 1992. He was appointed a
verberant speech segregation using deep neural networks,” IEEE/ACM Chair of the Department of Computer Science, Uni-
Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2112–2121, versity of Sheffield, in 2013. He has held visiting
Dec. 2014. appointments at LIMSI-CNRS (France), Ohio State
[17] Y. Yu, W. Wang, and P. Han, “Localization based stereo speech source University (USA), Helsinki University of Technol-
separation using probabilistic time-frequency masking and deep neural ogy (Finland), and ATR (Japan). He has authored
networks,” EURASIP J. Audio, Speech, Music Process., vol. 2016, no. 1, more than 100 papers and is the co-Editor (with Prof.
pp. 1–18, 2016. DeLiang Wang) of the IEEE book entitled Computational Auditory Scene Anal-
[18] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Anal- ysis: Principles, Algorithms and Applications. His research interests include
ysis: Principles, Algorithms and Applications. New York, NY, USA: computational auditory scene analysis, speech perception, hearing impairment,
Wiley/IEEE Press, 2006. and acoustic monitoring for medical applications.