0% found this document useful (0 votes)

30 views10 pages

Exploiting Deep Neural Networks and Head Movements For Robust Binaural Localization of Multiple Sources in Reverberant Environments

The document describes a machine hearing system that uses deep neural networks (DNNs) and head movements to localize multiple sound sources in reverberant environments. DNNs are trained to learn the relationship between binaural cues (including cross-correlation functions and interaural level differences) and source azimuth. Head movements are incorporated to help reduce front-back localization errors that commonly occur due to similar binaural cues from the front and back hemispheres. Experiments show the DNN system, combined with head movements, substantially outperforms a Gaussian mixture model system in challenging acoustic scenarios with multiple talkers and reverberation present.

Uploaded by

Aiping Zhang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

30 views10 pages

Exploiting Deep Neural Networks and Head Movements For Robust Binaural Localization of Multiple Sources in Reverberant Environments

Uploaded by

Aiping Zhang

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 10

2444 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO.

12, DECEMBER 2017

Exploiting Deep Neural Networks and Head

Movements for Robust Binaural Localization of
Multiple Sources in Reverberant Environments
Ning Ma, Tobias May, and Guy J. Brown

Abstract—This paper presents a novel machine-hearing system The human auditory system determines the azimuth of sounds
that exploits deep neural networks (DNNs) and head movements in the horizontal plane by using two principal cues: interaural
for robust binaural localization of multiple sources in reverberant time differences (ITDs) and interaural level differences (ILDs).
environments. DNNs are used to learn the relationship between
the source azimuth and binaural cues, consisting of the complete A number of authors have proposed binaural sound localisation
cross-correlation function (CCF) and interaural level differences systems that use the same approach, by extracting ITDs and
(ILDs). In contrast to many previous binaural hearing systems, the ILDs from acoustic recordings made at each ear of an artifi-
proposed approach is not restricted to localization of sound sources cial head [3]–[6]. Typically, these systems first use a bank of
in the frontal hemifield. Due to the similarity of binaural cues in the cochlear filters to split the incoming sound into a number of
frontal and rear hemifields, front–back confusions often occur. To
address this, a head movement strategy is incorporated in the local- frequency bands. The ITD and ILD are then estimated in each
ization model to help reduce the front–back errors. The proposed band, and statistical models such as Gaussian mixture model
DNN system is compared to a Gaussian-mixture-model-based sys- (GMM) are used to determine the source azimuth from the
tem that employs interaural time differences (ITDs) and ILDs as corresponding binaural cues [6]. Furthermore, the robustness of
localization features. Our experiments show that the DNN is able to this approach to varying acoustic conditions can be improved by
exploit information in the CCF that is not available in the ITD cue,
which together with head movements substantially improves local- using multi-conditional training (MCT). This introduces uncer-
ization accuracies under challenging acoustic scenarios, in which tainty into the statistical models of the binaural cues, enabling
multiple talkers and room reverberation are present. them to handle the effects of reverberation and interfering sound
Index Terms—Binaural sound source localisation, deep neural
sources [4]–[7].
networks, head movements, machine hearing, multi-conditional In contrast to many previous machine systems, the approach
training, reverberation. proposed here is not restricted to sound localisation in the frontal
hemifield; we consider source positions in the 360◦ azimuth
I. INTRODUCTION
range around the head. In this unconstrained case, the loca-
HIS paper aims to reduce the gap in performance be- tion of a sound cannot be uniquely determined by ITDs and
T tween human and machine sound localisation, in condi-
tions where multiple sound sources and room reverberation
ILDs; due to the similarity of these cues in the frontal and rear
hemifields, front-back confusions occur [8]. Although machine
are present. Human listeners have little difficulty in localis- listening studies have noted this as a problem [6], [9], listeners
ing sounds under such conditions; they are able to decode the rarely make such confusions because head movements, as well
complex acoustic mixture that arrives at each ear with appar- as spectral cues due to the pinnae, play an important role in
ent ease [1]. In contrast, sound localisation by machine systems resolving front-back confusions [8], [10], [11].
is usually unreliable in the presence of interfering sources and Relatively few machine localisation systems have attempted
reverberation. This is the case even when an array of multiple to incorporate head movements. Braasch et al. [12] averaged
microphones is employed [2], as opposed to the two (binaural) cross-correlation patterns across different head orientations in
sensors available to human listeners. order to resolve front-back confusions in anechoic conditions.
More recently, May et al. [6] combined head movements and
Manuscript received April 3, 2017; revised July 4, 2017; accepted August 28, MCT in a system that achieved robust sound localisation perfor-
2017. Date of publication October 27, 2017; date of current version Novem- mance in reverberant conditions. In their approach, the localisa-
ber 27, 2017. This work was supported by the European Union FP7 project tion system included a hypothesis-driven feedback stage which
TWO!EARS (http://www.twoears.eu) under Grant 618075. The associate edi-
tor coordinating the review of this manuscript and approving it for publication triggered a head movement when the azimuth could not be un-
was Dr. Tuomas Virtanen. (Corresponding author: Ning Ma.) ambiguously estimated. Subsequently, Ma et al. [9] evaluated
N. Ma and G. J. Brown are with the Department of Computer Science, the effectiveness of different head movement strategies, using
University of Sheffield, Sheffield S1 4DP, U.K. (e-mail: [email protected];
[email protected]). a complex acoustic environment that included multiple sources
T. May is with the Hearing Systems Group, Technical University of Denmark, and room reverberation. In agreement with studies on human
DK-2800 Kgs. Lyngby, Denmark (e-mail: [email protected]). sound localisation [13], they found that localisation errors were
Color versions of one or more of the figures in this paper are available online
at http://ieeexplore.ieee.org. minimised by a strategy that rotated the head towards the target
Digital Object Identifier 10.1109/TASLP.2017.2750760 sound source.
2329-9290 © 2017 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission.
See http://www.ieee.org/publications standards/publications/rights/index.html for more information.
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2445

to simulate the head rotation of a human listener. The output

from the DNN is combined with a head movement strategy to
robustly localise multiple talkers in reverberant environments.

A. Binaural Feature Extraction

An auditory front-end was employed to analyse binaural ear
signals with a bank of 32 overlapping Gammatone filters, with
centre frequencies uniformly spaced on the equivalent rectan-
gular bandwidth (ERB) scale between 80 Hz and 8 kHz [18].
Inner-hair-cell processing was approximated by half-wave recti-
fication. No low-pass filtering was employed to simulate the loss
of phase-locking at high frequencies as previous studies have
shown that in general classifiers are able to exploit the high-
frequency structure [4]. Afterwards, the CCF between the right
Fig. 1. Schematic diagram of the proposed system, showing steps during and left ears was computed independently for each frequency
training (top) and testing (bottom). During testing, sound mixtures consisting band using overlapping frames of 20 ms with a 10 ms shift. The
of several talkers are rendered in a virtual acoustic environment, in which a
binaural receiver is moved in order to simulate the head rotation of a listener.
CCF was further normalised by the auto-correlation value at lag
zero [4] and evaluated for time lags in the range of ± 1.1 ms.
This paper describes a novel machine-hearing system that Two binaural features, ITDs and ILDs, are typically used in
robustly localises multiple talkers in reverberant environments, binaural localisation systems [1]. The ITD is estimated as the
by combining deep neural network (DNN) classifiers and head lag corresponding to the maximum in the CCF. The ILD corre-
movements. Recently, DNNs have been shown to give state- sponds to the energy ratio between the left and right ears within
of-the-art performance in a variety of speech recognition and the analysis window, expressed in dB. In this study, instead of
acoustic signal processing tasks [14]. In this study, we use DNNs estimating the ITD the entire CCF was used as localisation fea-
to map binaural features, obtained from an auditory model, to tures. This approach was motivated by two observations. First,
the corresponding source azimuth. Within each frequency band, computation of ITDs involves a peak-picking operation which
a DNN takes as input features the cross-correlation function may not be robust in the presence of noise and reverberation.
(CCF) (as opposed to a single estimate of ITD) and the ILD. Second, there are systematic changes in the CCF with source
Using the whole cross-correlation function provides the clas- azimuth (in particular, changes in the main peak with respect
sifier with rich information for classifying the azimuth of the to its side peaks). Even in multi-source scenarios, these can be
sound source [15]. A similar approach was used by [16] and exploited by a suitable classifier. For signals sampled at 16 kHz,
[17] in binaural speech segregation systems. However, neither the CCF with a lag range of ± 1 ms produced a 33-dimensional
study specifically addressed source localisation because it was binaural feature space for each frequency band. This was sup-
assumed that the target source was fixed at zero degrees azimuth. plemented by the ILD, forming a final 34-dimensional (34D)
The proposed binaural sound localisation system is described feature vector.
in detail in Section II. Section III describes the evaluation frame-
work and presents a number of source localisation experiments, B. DNN Localization
in which head movements are simulated by using binaural room
DNNs were used to map the 34D binaural feature set to corre-
impulse responses (BRIRs) to generate direction-dependent bin-
sponding azimuth angles. A separate DNN was trained for each
aural sound mixtures. Localisation results are presented in Sec-
of the 32 frequency bands. Employing frequency-dependent
tion IV, which compares our DNN-based approach to a baseline
DNNs was found to be effective for localising simultaneous
method that uses GMM, and assesses the contribution that var-
sound sources. Although simultaneous sources overlap in time,
ious components make to performance. The paper concludes
within a local time frame each frequency band is mostly dom-
with Section V, which proposes some avenues for future re-
inated by a single source (Bregman’s [19] notion of ‘exclusive
search.
allocation’). Hence, this allows training using single-source data
and removes the need to include multi-source data for training.
II. SYSTEM
The DNN consists of an input layer, two hidden layers, and
Figure 1 shows a schematic diagram of the proposed binau- an output layer. The input layer contained 34 nodes and each
ral sound localisation system in the full 360 ◦ azimuth range. node was assumed to be a Gaussian random variable with zero
During training, clean speech signals were spatialised using mean and unit variance. The 34D binaural feature inputs for
head related impulse responses (HRIRs), and diffuse noise each frequency band were Gaussian normalised, and white
was added before being processed by a binaural model for Gaussian noise (variance 0.4) was added to avoid overfitting,
feature extraction. The noisy binaural features were used to before being used as input to the DNN. The hidden layers had
train DNNs to learn the relationship between binaural cues sigmoid activation functions, and each layer contained 128
and sound azimuths. During testing, sound mixtures con- hidden nodes. The number of hidden nodes was heuristically
sisting of several talkers are rendered in a virtual acoustic selected – more hidden nodes increased the computation time
environment, in which a binaural receiver is moved in order but did not improve localisation accuracy. The output layer
2446 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017

contained 72 nodes corresponding to the 72 azimuth angles

in the full 360◦ azimuth range, with a 5◦ step. A ‘softmax’
activation function was applied at the output layer. The same
DNN architecture was used for all frequency bands and we did
not optimise it for individual frequencies.
The neural network was initialised with a single hidden layer,
and the number of hidden layers was gradually increased in later
training phases. In each training phase, mini-batch gradient de-
scent with a batch size of 128 was used, including a momentum
term with the momentum rate set to 0.5. The initial learning rate Fig. 2. Illustration of the head movement strategy. Top: posterior probabilities
was set to 1, which gradually decreased to 0.05 after 20 epochs. where two candidate azimuths at 60◦ and 120◦ are identified. Bottom: after head
rotation by 30◦ , only the azimuth candidate at 30◦ agrees with the azimuth-
After the learning rate decreased to 0.05, it was held constant shifted candidate from the first signal block (dotted line).
for a further 5 epochs. We also included a validation set and the
training procedure was stopped earlier if no new lower error on
the validation set could be achieved within the last 5 epochs. At
the end of each training phase, an extra hidden layer was added the front and rear hemifields, phantom sources may also become
between the last hidden layer and the output layer, and the train- apparent as peaks in the azimuth posterior distribution. Such an
ing phase was repeated until the desired number of hidden layers ambiguous posterior distribution is shown in the top panel of
was reached (two hidden layers in this study). Fig. 2. In this case, a random head movement within the range
Given the observed feature set xt,f at time frame t and fre- of [−30◦ , 30◦ ] is triggered to solve the localisation confusion.
quency band f , the 72 ‘softmax’ output values from the DNN Other possible strategies for head movement are discussed in [9].
for frequency band f were considered as posterior probabilities A second posterior distribution is computed for the signal
block after the completion of the head movement. If a peak
P(k|xt,f ), where k is the azimuth angle and k P(k|xt,f ) = 1.
The posteriors were then integrated across frequency to yield the in the first posterior distribution corresponds to a true source
probability of azimuth k, given features of the entire frequency position, then it will appear in the second posterior distribution
range at time t and will be shifted by an amount corresponding to the angle
of head rotation (assuming that sources are stationary before
P (k) f P(k|xt,f ) and after the head movement). On the other hand, if a peak
P(k|xt ) = , (1)
k P (k) f P(k|xt,f ) is due to a phantom source, it will not occur in the second
posterior distribution, as shown in the bottom panel of Fig. 2.
where P (k) is the prior probability of each azimuth k. Assuming By exploiting this relationship, potential phantom source peaks
no prior knowledge of source positions and equal probabilities are identified and eliminated from both posterior distributions.
for all source directions, Eq. (1) becomes After the phantom sources have been removed, the two posterior

f P(k|xt,f )
distributions were averaged to further emphasise the local peaks
P(k|xt ) = . (2) corresponding to true sources. The most prominent peaks in the
k f P(k|xt,f )
averaged posterior distribution were assumed to correspond to
Sound localisation was performed for a signal block consisting active source positions. Here the number of active sources was
of T time frames. Therefore the frame posteriors were further assumed to be known a priori.
averaged across time to produce a posterior distribution P(k) The proposed approach to exploiting head movements is
of sound source activity based on late information fusion – the information from the
t+T −1
model predictions is integrated. This is in contrast to the ap-
1 proach in [12] which adopted early fusion at the feature level by
P(k) = P(k|xt ). (3)
T t averaging cross-correlation patterns across different head ori-
entations. Late fusion is preferred here for a couple of reasons:
The target location was given by the azimuth k that maximised i) the use of head rotation is not needed during model training
P(k) and thus it is more straightforward to generate data for train-
k̂ = argmax P(k) (4) ing robust localisation models (DNNs); ii) early feature fusion
k tends to lose information which can otherwise be exploited by
the system. As a result, the proposed system is able to deal with
C. Localisation With Head Movements overlapping sound sources in reverberant conditions, while the
In order to reduce the number of front-back confusions, the system reported in [12] was tested in anechoic conditions with
proposed localisation model employs a hypothesis-driven feed- a single source.
back stage that triggers a head movement if the source location
cannot be unambiguously estimated. A signal block is used to III. EVALUATION
compute an initial posterior distribution of the source azimuth
A. Binaural Simulation
using the trained DNNs. In an ideal situation, the local peaks
in the posterior distribution correspond to the azimuths of true Binaural audio signals were created by convolving monaural
sources. However, due to the similarity of binaural features in sounds with HRIRs or BRIRs. For training, an anechoic HRIR
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2447

TABLE I
ROOM CHARACTERISTICS OF THE SURREY BRIR DATABASE [21]

Room A Room B Room C Room D

T 6 0 (s) 0.32 0.47 0.68 0.89

DRR (dB) 6.09 5.31 8.82 6.12

catalog based on the Knowles Electronic Manikin for Acoustic

Research (KEMAR) head and torso simulator with pinnae [20]
was used for simulating the anechoic training signals. The HRIR
catalog catalog included impulse responses for the full 360 ◦
azimuth range, allowing us to train localisation models for 72
azimuths between 0◦ and 355◦ with a 5◦ step. The models were
trained using only the anechoic HRIRs and were not retrained
for any room conditions. See Section III-C for more details Fig. 3. Schematic diagram of the Surrey BRIR room configuration. Actual
about training. source positions were always between ±90◦ , but the system could report a
source azimuth at any of 72 possible azimuths around the head (open circles).
For evaluation, the Surrey BRIR database [21] and a BRIR Black circles indicate actual source azimuths in a typical three-talker mixture
set recorded at TU Berlin [9] were used to reflect different re- (in this example, at −50◦ , −30◦ , and 15◦ ). During testing, head movements
verberant room conditions. The Surrey database was recorded were limited to the range [−30◦ , 30 ◦ ] as shown by the shaded area.
using a Cortex head and torso simulator (HATS) and includes
four room conditions with various amounts of reverberation.
The loudspeakers were placed around the HATS on an arc in the
median plane, with a 1.5 m radius between ±90◦ and measured
at 5◦ intervals. Table I lists the reverberation time (T60 ) and
the direct-to-reverberant ratio (DRR) of each room. The ane-
choic HRIRs used for training were also included to simulate
an anechoic condition.
A second set of BRIRs, recorded in the “Auditorium3” room
at TU Berlin,1 was also included particularly for evaluating the
benefit of head movements (Section IV-C). The Auditorium3
room is a mid-size lecture room of dimensions 9.3 m × 9 m,
with a trapezium shape and an estimated reverberation time T60
of 0.7 s. The BRIR measurements were made for different head
orientations ranging from −90◦ to 90◦ with an angular resolution
of 1◦ . BRIRs for six different source positions, including one in
the rear hemifield, were recorded and five of them were selected
for this study (two 0◦ positions are available and the one at
1.5 m away from the head was excluded for simplicity). The
Fig. 4. Schematic diagram of the TUB Auditorium3 configuration. The source
five selected source positions with respect to the dummy head distance, azimuth angle and respective T6 0 time are shown for each source.
are illustrated in Fig. 4.
Note that the anechoic HRIRs used for training and the Surrey
positions. Furthermore there is a difference in impulse response
BRIRs were recorded using two different dummy heads (KE-
amplitude level even for sources of the equal distance to the
MAR and Cortex HATS). We use data from two dummy heads
listener, likely due to the microphone response difference across
because this study is concerned with sound localisation in the
recording sessions. To compensate the level difference a scaling
360◦ azimuth range; the Surrey HATS HRIRs catalog is only
factor was computed for each source position by averaging the
available for the frontal azimuth angles and therefore cannot
maximum levels in the impulse responses between left and right
be used to train the full 360◦ localisation models. However, as
ears. The scaling factors were used to adjust the level for each
the experiment results will show in Section IV, with MCT our
source before spatialisation. As a result the direct sound level of
proposed systems generalised well despite the HRIR mismatch
each source when mixed together was approximately the same.
between training and testing.
For the Surrey BRIR set the level difference did not exist and
Binaural mixtures of multiple competing sources were cre-
thus this preprocessing was not applied. The spatialised signals
ated by spatialising each source separately at the respective
were finally resampled to 16 kHz for training and testing.
BRIR sampling rate, before adding them together in each of the
two binaural channels. In the Auditorium3 BRIRs there is vary-
ing distance between the listener position and different source B. Head Movement Simulation
For the Surrey BRIRs, head movements were simulated by
1 The BRIRs are freely available at http://tinyurl.com/lt76yqs computing source azimuths relative to the head orientation, and
2448 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017

loading corresponding BRIRs for the relative source azimuths. silence of various lengths at the beginning of each sentence, the
Such simulation is only approximate for the reverberant room central 1 s segment of each sentence was selected for evaluation.
conditions because the Surrey BRIR database was measured Note that although the models were trained and evaluated
by moving loudspeakers around a fixed dummy head. With using speech signals, our systems are not intended to localise
the Auditorium3 BRIRs, more realistic head movements were only speech sources. Therefore a frequency range from 80 Hz
simulated by loading the corresponding BRIR for a desired head to 8 kHz was selected for the signals sampled at 16 kHz. Our
orientation. For all experiments, head movements were limited previous studies [6], [15] also show that 32 Gammatone filters
to the range of ±30◦ . (see Section II-A) provide a good tradeoff between frequency
resolutions and computational cost. As the evaluation included
localisation of up to three overlapping talkers, using too few fil-
C. Multi-conditional Training ters would result in insufficient frequency resolution to reliably
The proposed systems assumed no prior knowledge of room localise multiple talkers.
conditions. The localisation models were trained using only The baseline system was a state-of-the-art localisation sys-
anechoic HRIRs with added diffuse noise, and no reverberant tem [6] that modelled both ITDs and ILDs features within a
BRIRs were used during training. GMM framework. As in [6], the GMM modelled the binaural
Previous studies [4]–[7] have shown that MCT features can features using 16 Gaussian components and diagonal covari-
increase the robustness of localisation systems in reverberant ance matrices for each azimuth and each frequency band. The
multi-source conditions. Binaural MCT features were created by GMM parameters were initialised by 15 iterations of the k-
mixing a target signal at a specified azimuth with diffuse noise means clustering algorithm and further refined using 5 iterations
at various signal-to-noise ratios (SNRs). The diffuse noise is the of the expectation-maximization (EM) algorithm. The second
sum of 72 uncorrelated, white Gaussian noise sources, each of localisation model was the proposed DNN system using the
which was spatialised across the full 360◦ azimuth range in steps CCF and ILD features. Each DNN employed four layers includ-
of 5◦ . Both the directional target signals and the diffuse noise ing two hidden layers each consisting of 128 hidden nodes (see
were created using the same anechoic HRIR recorded using a Section II-B).
KEMAR dummy head [20]. This approach was used in pref- Both localisation systems were evaluated using different
erence to adding reverberation during training, since previous training strategies (clean training and MCT), various locali-
studies (e.g., [5]) suggested that it was more likely to generalise sation feature sets (ITD, ILD and CCF), and with or without
well across a wide range of reverberant test conditions. head movements. When no head movement was employed, the
The training material consisted of speech sentences from the source azimuths were estimated using the entire 1 s segment
TIMIT database [22]. A set of 30 sentences was randomly se- from each acoustic mixture. If head movement was used, the
lected for each of the 72 azimuth locations. For each spatialised 1 s segment was divided into two 0.5 s long blocks and the
training sentence, the anechoic signal was corrupted with dif- second block was provided to the system after completion of a
fuse noise at three SNRs (20, 10 and 0 dB SNR). The corre- head movement. Therefore in both conditions the same signal
sponding binaural features (ITDs, CCFs, and ILDs) and ILDs) duration was used for localisation.
were then extracted. Only those features for which the a priori The gross accuracy of localisation was measured by com-
SNR between the target and the diffuse noise exceeded − 5 dB paring true source azimuths with the estimated azimuths. The
were used for training. This negative SNR criterion ensured that number of active speech sources N was assumed to be known a
the multi-modal clusters in the binaural feature space at higher priori and the N azimuths for which the posterior probabilities
frequencies, which are caused by periodic ambiguities in the were the largest were selected as the estimated azimuths. Lo-
cross-correlation analysis, were properly captured. calisation of a source was considered accurate if the estimated
azimuth was less than or equal to 5◦ away from the true source
azimuth:
D. Experimental Setup
Ndist(φ, φ̂)≤θ
The GRID corpus [23] was used to create three evaluation LocAcc = (5)
N
sets of 50 acoustic mixtures which consisted of one, two or
three simultaneous talkers, respectively. Each GRID sentence where dist(.) is the angular distance between two azimuths, φ is
is approximately 1.5 s long and was spoken by one of 34 na- the true source azimuth, φ̂ is the estimated azimuth, and θ is the
tive British-English talkers. The sentences were normalised to threshold in degrees (5◦ in this study). This metric is preferred
the same root mean square (RMS) value prior to spatialisation. to RMS error because our study is concerned with full 360◦
For the two-talker and three-talker mixtures, the additional az- localisation, and localisation errors in degrees are often large
imuth directions were randomly selected from the same azimuth due to front-back confusions.
range while ensuring an angular distance of at least 10◦ between
all sources. Each evaluation set included 50 acoustic mixtures IV. RESULTS AND DISCUSSION
which were kept the same for all the evaluated azimuths and
A. Influence of MCT
room conditions in order to ensure any performance difference
was due to test conditions rather than signal variation. Since the The first experiment investigated the impact of MCT on the lo-
duration of each GRID sentence was different, and there was calisation accuracy of the proposed systems. Two scenarios were
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2449

TABLE II
GROSS LOCALIZATION ACCURACY IN % FOR VARIOUS SETS OF BRIRS WHEN LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE
FRONTAL HEMIFIELD ONLY AND IN THE FULL 360◦ RANGE

Anechoic Room A Room B Room C Room D

Hemifiled Model MCT 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Avg.

no 100 99.0 90.5 84.0 63.1 52.8 81.5 59.8 51.8 100 82.5 65.5 88.2 61.2 53.5 75.6
GMM yes 100 99.9 98.7 99.2 97.1 90.7 100 97.7 91.6 100 99.3 96.5 100 98.4 91.5 97.4
Frontal
no 100 100 99.6 100 99.2 92.2 100 99.0 90.4 100 99.9 96.7 99.9 98.7 91.1 97.8
DNN yes 100 100 99.7 100 99.5 96.3 100 99.7 96.2 100 99.9 98.2 100 99.6 95.3 99.0
no 100 97.1 82.6 82.6 48.9 30.7 65.6 38.3 25.3 98.4 70.3 50.2 77.2 46.3 30.0 62.9
GMM
yes 100 100 97.8 99.0 94.2 80.7 97.0 89.0 77.6 100 97.6 88.7 97.3 90.6 79.0 92.6
360◦
no 100 100 97.4 100 87.0 68.4 94.5 79.0 63.9 97.7 92.5 78.9 94.4 83.4 67.9 87.0
DNN
yes 100 100 98.6 99.7 97.3 87.9 97.2 93.7 86.7 100 97.3 90.2 97.3 94.0 85.0 95.0

The models were trained using either clean training or the MCT method.

Fig. 5. Localization error rates produced by various systems using either clean training or MCT. Localization was performed in the full 360◦ range, so that
front–back errors could occur, as shown by the white bars for each system. No head movement strategy was employed.

considered: i) sound localisation was restricted to the frontal was substantially more robust than the GMM system, but the
hemifield so that the systems estimated source azimuths within performance also decreased significantly when multiple talk-
the range [−90◦ , 90◦ ]; ii) the systems were not informed that ers were present. The benefit of the MCT method became more
the sources lay only in the frontal hemifield and were free to apparent for both systems in this scenario – the average localisa-
report the azimuth in the full 360◦ azimuth range. In the second tion accuracy was increased from 62.9% to 92.6% for the GMM
scenario front-back confusions could occur. system and from 87% to 95% for the DNN system. Across all
Table II lists gross localisation accuracies of all the systems the room conditions the largest benefits were observed in room
evaluated using various BRIR sets from the Surrey database. B where the direct-to-reverberant ratio was the lowest, and in
First consider the scenario of localisation in the frontal hemi- room D where the reverberation time T60 was the longest.
field. For the GMM baseline system, the MCT approach sub- Errors made in 360◦ localisation could be due to front-back
stantially improved the robustness across all conditions, with confusion as well as interference caused by reverberation and
an average localisation accuracy of 97.4% compared to only overlapping talkers. Figure 5 shows errors made by both the
75.6% using clean training. The improvement with MCT was GMM and the DNN systems using either clean training or MCT
particularly large in multi-talker scenarios and in the presence in different room conditions. The errors due to front-back con-
of room reverberation. For the DNN system, the improvement fusions were indicated by white bars for each system. Here a
with MCT over clean training was not as large as that for the localisation error is considered to be a front-back confusion
GMM system and is only observed in the multi-talker scenarios. when the estimated azimuth is within ±20 degrees of the az-
The limited improvement is partly because with clean training imuth that would produce the same ITDs in the rear hemifield.
the performance of the DNN system is already very robust in It is clear that front-back confusions contributed a large portion
most conditions, with an average accuracy of 97.8%, which is of localisation errors for both systems, in particular when clean
already better than the GMM system with MCT. This suggests training was used. When the MCT method was used, not only
that when localisation was restricted to the frontal hemifield, the errors due to interference of reverberation and overlapping
the DNN can effectively extract cues from the clean CCF-ILD talkers (non-white bar portion in Fig. 5) were greatly reduced,
features that are robust in the presence of reverberation. but also the systems produced substantially fewer front-back
Considering the case of full 360◦ localisation, the scenario is errors (white bars in Fig. 5). As will be discussed in the next
more challenging and front-back errors could occur. The GMM section, without head movements the main cues distinguishing
system with clean training failed to localise the talkers accu- between front-back azimuth pairs lie in the combination of in-
rately, with error rates greater than 50% when localising multi- teaural level and time differences (or ITD-related features such
ple simultaneous talkers. The DNN system with clean training as the cross-correlation function). MCT provides the training
2450 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017

TABLE III
GROSS LOCALIZATION ACCURACY IN % USING VARIOUS FEATURE SETS FOR LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE FULL 360◦ RANGE

Anechoic Room A Room B Room C Room D

Model Feature 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Avg.

ITD 100 99.8 96.2 99.2 81.6 67.7 91.4 76.6 64.9 97.2 89.4 76.6 89.1 76.6 65.8 84.8
GMM ITD-ILD 100 100 97.8 99.0 94.2 80.7 97.0 89.0 77.6 100 97.6 88.7 97.3 90.6 79.0 92.6
CCF-ILD 100 100 98.4 100 87.2 73.9 92.1 81.7 71.5 99.9 93.8 81.6 92.6 83.2 72.3 88.5
CCF 100 100 99.0 99.8 95.8 86.7 91.8 89.5 83.7 98.3 95.8 89.0 91.6 87.8 80.8 92.7
DNN
CCF-ILD 100 100 98.6 99.7 97.3 87.9 97.2 93.7 86.7 100 97.3 90.2 97.3 94.0 85.0 95.0

The models were trained using the MCT method. The best feature set for each system is marked in bold font.

Fig. 6. Comparison of localization error rates produced by various systems using different spatial features. Localization was not restricted in the frontal hemifield
so that front–back errors can occur, as indicated by the white bars for each system. No head movement strategy was employed.

stage with better regularisation of the features, which is able When ILDs were not used, the localisation errors were largely
to improve the generalisation of the learned models and better due to an increased number of front-back errors as suggested by
discriminate the front-back confusing azimuths. Fig. 6. For single-talker localisation in rooms B and D, without
It is also worth noting that the training and testing stages used using ILDs almost all the errors made by the systems were
HRTFs collected with different dummy heads (the KEMAR was front-back errors. When ILDs were used, the number of front-
used for training and the HATS was used for testing). However, back errors were greatly reduced in all conditions. This suggests
with MCT the localisation accuracy in the anechoic condition that the ILD cue plays a major role in solving the front-back
for localising one or two sources was 100%, which suggests that confusions. ITDs or ILDs alone may appear more symmetric
MCT also reduced the sensitivity to mismatches of the receiver. between the front and back hemifields, but together with ILDs
they create the necessary asymmetries (due to the KEMAR head
with pinnae) for the models to learn the differences between
B. Contribution of the ILD Cue front and back azimuths.
The second experiment investigated the influence of differ- Table III also lists localisation results of the GMM system
ent localisation features, in particular the contribution of the when using the same CCF-ILD feature set as used by the DNN
ILD cue. Table III lists the gross localisation accuracies us- system. The GMM failed to extract the systematic structure in
ing various feature sets. Here all models were trained using the CCF spanning multiple feature dimensions, most likely due
the MCT method and the active head movement strategy was to its inferior ability to model correlated features. The average
not applied. When ILDs were not used, the GMM performance localisation accuracy is only 88.5% compared to 95% for the
using just ITDs suffered greatly in reverberant rooms and when DNN system, and again it suffered the most in more reverberant
localising overlapping talkers; the average localisation accuracy conditions such as rooms B and D.
decreased from 92.6% to 84.8%. The performance drop was
C. Benefit of the Head Movement Strategy
particularly pronounced in rooms B and D, where the reverber-
ation was strong. For the DNN system, excluding the ILDs also Table IV lists the gross localisation accuracies with or with-
decreased the localisation performance but the performance out head movement. All systems were trained using the MCT
drop was more moderate, with the average accuracy reduced method and employed the respective best performing features
from 95% to 92.7%. The DNN system using the CCF feature (GMM ITD-ILD and DNN CCF-ILD).
exhibited more robustness in the reverberant multi-talker condi- Both the GMM and DNN systems benefitted from the use
tions than the GMM system using the ITD feature. As previously of head movements. It is clear from Fig. 7 that the localisa-
discussed, computation of the ITD involved a peak-picking op- tion errors were almost entirely due to front-back confusions in
eration that could be less reliable in challenging conditions, one-talker localisation. By exploiting the head movement, the
and the systematic changes in the CCF with the source az- systems managed to reduce most of the front-back errors and
imuth provided richer information that could be exploited by achieved near 100% localisation accuracies. In two- or three-
the DNN. talker localisation, the number of front-back errors was also
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2451

TABLE IV
GROSS LOCALIZATION ACCURACIES IN % WITH OR WITHOUT THE HEAD MOVEMENT WHEN LOCALIZING ONE, TWO, AND THREE COMPETING TALKERS IN THE
FULL 360◦ AZIMUTH RANGE

Anechoic Room A Room B Room C Room D

Head
Model move 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Avg.

no 100 100 97.8 99.0 94.2 80.7 97.0 89.0 77.6 100 97.6 88.7 97.3 90.6 79.0 92.6
GMM yes 100 100 97.5 100 97.3 83.4 99.8 93.1 79.9 99.9 99.3 90.8 99.9 93.0 79.5 94.2
no 100 100 98.6 99.7 97.3 87.9 97.2 93.7 86.7 100 97.3 90.2 97.3 94.0 85.0 95.0
DNN
yes 100 100 98.4 100 99.2 90.0 99.8 96.1 86.9 100 99.0 91.6 99.5 94.7 84.7 96.0

All systems were trained using the MCT method.

Fig. 7. Localization error rates produced by various systems with or without head movement when localizing one, two, or three overlapping talkers. Localization
was performed in the 360◦ azimuth range so that front–back errors can occur, as indicated by the white bars for each system.

Fig. 8. Localization error rates produced by various systems with or without head movement, as a function of the azimuth. The histogram bin width is 20◦ . Here
the error rates were averaged across the 1-, 2- and 3-talker localization tasks. Localization was performed in the full 360◦ azimuth range so that front–back errors
can occur, as indicated by the white bars for each system.

reduced with the use of head movements. When overlapping Fig. 8 shows the localisation error rates as a function of the
talkers were present, the systems produced many localisation azimuth. The error rates here were averaged across the 1-, 2-
errors other than front-back errors, due to the partial evidence and 3-talker localisation tasks. Across most room conditions,
available to localise each talker. By removing most front-back sound localisation was generally more reliable at more central
errors, the systems were able to further improve the accuracy of locations than at lateral source locations. This is particularly
localising overlapping sound sources. the case for the GMM system, as shown in Fig. 8, where the
2452 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO. 12, DECEMBER 2017

Fig. 9. Localization error rates produced by various systems as a function of the azimuth for the Auditorium3 task. Localization was performed in the full 360◦
azimuth range so that front–back errors can occur, as indicated by the white bars for each system.

localisation error rates for sources at the sides were above 20% glimpses left for localisation of each source and with stronger
even in the least reverberant Room A. It is also clear from reverberation the sources at 51◦ and 131◦ became more difficult
Fig. 8 (white bars) that localisation errors were mostly not to localise.
due to front-back confusions at lateral azimuths, and in this
case the proposed DNN system outperformed the GMM system V. CONCLUSION
significantly. This paper presented a machine-hearing framework that com-
At the central azimuths, on the other hand, almost all the local- bines DNNs and head movements for robust localisation of
isation errors were due to front-back confusions. It is noticeable multiple sources in reverberant conditions. Since simultaneous
that in more reverberant conditions (such as Rooms B and D), the talkers were located in a full 360◦ azimuth range, front-back
error rates at the central azimuths [−10◦ , 10◦ ] were particularly confusions occurred. Compared to a GMM-based system, the
high due to front-back errors for both the GMM and the DNN proposed DNN system was able to exploit the rich information
systems when head movement was not used. The front-back provided by the entire CCF, and thus substantially reduced lo-
errors were concentrated at central azimuths, probably because calisation errors. The MCT method was effective in combatting
binaural features (interaural time and level differences) were reverberation, and allowed anechoic signals to be used for train-
less discriminative between 0◦ and 180◦ than between the more ing a robust localisation model that generalised well to unseen
lateral azimuth pairs. reverberant conditions and to mismatched artificial heads used
Finally, Fig. 9 shows the localisation error rates using the in training and testing conditions. It was also found that the
Auditorium3 BRIRs in which head movements were more ac- inclusion of ILDs was necessary for reducing front-back confu-
curately simulated by loading the corresponding BRIR for a sions in reverberant rooms. The use of head rotation further in-
given head orientation. Overall the DNN systems significantly creased the robustness of the proposed system, with an average
outperformed the GMM systems. For single-source localisation localisation accuracy of 96% under acoustic scenarios where
the DNN system achieved near 100% localisation accuracy for up to three competing talkers and room reverberation were
all source locations including the one at 131◦ in the rear hemi- present.
field. The GMM system produced about 5% error rate for rear In the current study, the use of DNNs allowed higher-
source but performed well for the other locations. For two- and dimensional feature vectors to be exploited for localisation, in
three-source localisation, both GMM and DNN systems ben- comparison with previous studies [4]–[6]. This could be carried
efitted from head movements across most azimuth locations. further, by exploiting additional context within the DNN either
For the GMM system the benefit is particularly pronounced for in the time or the frequency dimension. Moreover, it is possi-
the source at 51◦ , with localisation reduced from 14% to 4% ble to complement the features used here with other binaural
in two-source localisation and from 36% to 14% in two-source features, e.g., a measure of interaural coherence [24], as well as
localisation. The rear source at 131◦ appeared to be difficult to monaural localisation cues, which are known to be important for
localise for the GMM system even with head movement, with judgment of elevation angles [25], [26]. Visual features might
20% error rate in two-source localisation. The DNN system with also be combined with acoustic features in order to achieve
head movements was able to reduce the error rate for the rear audio-visual source localisation.
source at 131◦ to 8%. The proposed system has been realised in a real world human-
In general the performance of the models for the 51◦ and robot interaction scenario. The azimuth posterior distributions
131◦ locations is worse than the other source locations when from the DNN for each processing block were temporally
there are multiple sources present at the same time. This is more smoothed using a leaky integrator and head rotation was trig-
likely due to the nature of the room acoustics at these locations, gered if a front-back confusion was detected in the integrated
e.g., they are further away from the listener and closer to walls. posterior distribution. Audio signals acquired during head rota-
When the sources are overlapping with each other, there are less tion were not processed. Such a scheme can be more practical
MA et al.: EXPLOITING DNNs AND HEAD MOVEMENTS FOR ROBUST BINAURAL LOCALIZATION OF MULTIPLE SOURCES 2453

for a robotic platform as head rotation often produces self-noise [19] A. Bregman, Auditory Scene Analysis. Cambridge, MA, USA: MIT Press,
which makes the audio unusable. 1990.
[20] H. Wierstorf, M. Geier, A. Raake, and S. Spors, “A free database of
One limitation of the current systems is that the number of head-related impulse response measurements in the horizontal plane with
active sources is assumed to be known a priori. This can be multiple distances,” Audio Eng. Soc. Conv. 130.
improved by including a source number estimator that is either [21] C. Hummersone, R. Mason, and T. Brookes, “Dynamic precedence ef-
fect modeling for source separation in reverberant environments,” IEEE
learned from the azimuth posterior distribution output by the Trans. Audio, Speech, Lang. Process., vol. 18, no. 7, pp. 1867–1871,
DNN, or provided directly as an output node in the DNN. The Sep. 2010.
current study only deals with the situation where sound sources [22] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and
N. L. Dahlgren, “DARPA TIMIT acoustic-phonetic continuous speech
are static. Future studies will relax this constraint and address corpus CD-ROM,” Nat. Inst. Standards Technol., Gaithersburg, MD, USA,
the localisation and tracking of moving sound sources within Internal Rep. 4930, 1993.
the DNN framework. [23] M. Cooke, J. Barker, S. Cunningham, and X. Shao, “An audio-visual
corpus for speech perception and automatic speech recognition,” J. Acoust.
Soc. Amer., vol. 120, pp. 2421–2424, 2006.
REFERENCES [24] C. Faller and J. Merimaa, “Sound localization in complex listening situa-
tions: Selection of binaural cues based on interaural coherence,” J. Acoust.
[1] J. Blauert, Spatial Hearing—The Psychophysics of Human Sound Local- Soc. Amer., vol. 116, pp. 3075–3089, 2004.
ization. Cambridge, MA, USA: MIT Press, 1997. [25] F. Asano, Y. Suzuki, and T. Sone, “Role of spectral cues in median plane
[2] O. Nadiri and B. Rafaely, “Localization of multiple speakers under high localization.” J. Acoust. Soc. Amer., vol. 88, no. 1, pp. 159–168, 1990.
reverberation using a spherical microphone array and the direct-path dom- [26] P. Zakarauskas and M. S. Cynader, “A computational theory of spectral
inance test,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, cue localization,” J. Acoust. Soc. Amer., vol. 94, no. 3, pp. 1323–1331,
no. 10, pp. 1494–1505, Oct. 2014. 1993.
[3] V. Willert, J. Eggert, J. Adamy, R. Stahl, and E. Korner, “A probabilistic
model for binaural sound localization,” IEEE Trans. Syst., Man, Cybern.
B, Cybern., vol. 36, no. 5, pp. 982–994, Oct. 2006.
[4] T. May, S. van de Par, and A. Kohlrausch, “A probabilistic model for Ning Ma obtained the M.Sc. degree with distinction
robust localization based on a binaural auditory front-end,” IEEE Trans. in advanced computer science in 2003 and the Ph.D.
Audio, Speech, Lang. Process., vol. 19, no. 1, pp. 1–13, Jan. 2011. degree in hearing-inspired approaches to automatic
[5] J. Woodruff and D. L. Wang, “Binaural localization of multiple sources in speech recognition in 2008, both from the Univer-
reverberant and noisy environments,” IEEE Trans. Audio, Speech, Lang. sity of Sheffield, Sheffield, U.K. He has been a Vis-
Process., vol. 20, no. 5, pp. 1503–1512, Jul. 2012. iting Research Scientist at the University of Wash-
[6] T. May, N. Ma, and G. J. Brown, “Robust localisation of multiple speakers ington, Seattle, WA, USA, and a Research Fellow
exploiting head movements and multi-conditional training of binaural at the MRC Institute of Hearing Research, Notting-
cues,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., 2015, ham, U.K., working on auditory scene analysis with
pp. 2679–2683. cochlear implants. Since 2015, he has been a Re-
[7] T. May, S. van de Par, and A. Kohlrausch, “Binaural localization and search Fellow at the University of Sheffield, working
detection of speakers in complex acoustic scenes,” in The Technology of on computational hearing. His research interests include robust automatic speech
Binaural Listening, J. Blauert, Ed. New York, NY, USA: Springer, 2013, recognition, computational auditory scene analysis, and hearing impairment. He
ch. 15, pp. 397–425. has authored or coauthored more than 40 papers in these areas.
[8] F. L. Wightman and D. J. Kistler, “Resolution of front–back ambiguity in
spatial hearing by listener and source movement,” J. Acoust. Soc. Amer.,
vol. 105, no. 5, pp. 2841–2853, 1999.
[9] N. Ma, T. May, H. Wierstorf, and G. J. Brown, “A machine-hearing system
exploiting head movements for binaural sound localisation in reverberant Tobias May studied hearing technology and audiol-
conditions,” in Proc. IEEE Int. Conf. Acoust., Speech, Signal Process., ogy and received the M.Sc. degree from the Univer-
2015, pp. 2699–2703. sity of Oldenburg, Oldenburg, Germany, in 2007 and
[10] H. Wallach, “The role of head movements and vestibular and visual cues the binational Ph.D. degree from the University of
in sound localization,” J. Exp. Psychol., vol. 27, no. 4, pp. 339–368, 1940. Oldenburg in collaboration with the Eindhoven Uni-
[11] K. I. McAnally and R. L. Martin, “Sound localization with head move- versity of Technology, Eindhoven, The Netherlands.
ments: Implications for 3D audio displays,” Front. Neurosci., vol. 8, Since 2013, he has been with the Department of Elec-
pp. 1–6, 2014. trical Engineering, Technical University of Denmark,
[12] J. Braasch, S. Clapp, A. Parks, T. Pastore, and N. Xiang, “A binaural first as a Postdoctoral Researcher (2013–2017), and
model that analyses acoustic spaces and stereophonic reproduction sys- since 2017 as an Assistant Professor. His research
tems by utilizing head rotations,” in The Technology of Binaural Listening, interests include computational auditory scene anal-
J. Blauert, Ed. Berlin, Germany: Springer, 2013, pp. 201–223. ysis, binaural signal processing, noise-robust speaker identification, and hearing
[13] S. Perrett and W. Noble, “The effect of head rotations on vertical plane aid processing.
sound localization,” J. Acoust. Soc. Amer., vol. 102, no. 4, pp. 2325–2332,
1997.
[14] Y. Bengio, “Learning deep architectures for AI,” Found. Trends Mach.
Learn., vol. 2, no. 1, pp. 1–127, 2009. Guy J. Brown received the B.Sc. (Hons.) degree
[15] N. Ma, G. J. Brown, and T. May, “Exploiting deep neural networks and in applied science from Sheffield City Polytech-
head movements for binaural localisation of multiple speakers in rever- nic, Sheffield, U.K., in 1984, and the Ph.D. de-
berant conditions,” in Proc. Interspeech, 2015, pp. 3302–3306. gree in computer science from the University of
[16] Y. Jiang, D. Wang, R. Liu, and Z. Feng, “Binaural classification for re- Sheffield, Sheffield, in 1992. He was appointed a
verberant speech segregation using deep neural networks,” IEEE/ACM Chair of the Department of Computer Science, Uni-
Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2112–2121, versity of Sheffield, in 2013. He has held visiting
Dec. 2014. appointments at LIMSI-CNRS (France), Ohio State
[17] Y. Yu, W. Wang, and P. Han, “Localization based stereo speech source University (USA), Helsinki University of Technol-
separation using probabilistic time-frequency masking and deep neural ogy (Finland), and ATR (Japan). He has authored
networks,” EURASIP J. Audio, Speech, Music Process., vol. 2016, no. 1, more than 100 papers and is the co-Editor (with Prof.
pp. 1–18, 2016. DeLiang Wang) of the IEEE book entitled Computational Auditory Scene Anal-
[18] D. L. Wang and G. J. Brown, Eds., Computational Auditory Scene Anal- ysis: Principles, Algorithms and Applications. His research interests include
ysis: Principles, Algorithms and Applications. New York, NY, USA: computational auditory scene analysis, speech perception, hearing impairment,
Wiley/IEEE Press, 2006. and acoustic monitoring for medical applications.

CO - Earth and Life Science (Detailed Lesson Plan)
No ratings yet
CO - Earth and Life Science (Detailed Lesson Plan)
6 pages
TNT
100% (1)
TNT
1 page
The Nature and Goals of Anthropology, Sociology and Political Science
No ratings yet
The Nature and Goals of Anthropology, Sociology and Political Science
12 pages
Prime MX FIRA 6250 2018
No ratings yet
Prime MX FIRA 6250 2018
4 pages
Prospectus 2023-2024 SSC
No ratings yet
Prospectus 2023-2024 SSC
88 pages
WK 6 Strategic Planning Policy Analysis
No ratings yet
WK 6 Strategic Planning Policy Analysis
47 pages
Final Paper Answer
No ratings yet
Final Paper Answer
11 pages
Assignment 3
No ratings yet
Assignment 3
4 pages
Tajneen Islam - Write-Up 3 Soils
No ratings yet
Tajneen Islam - Write-Up 3 Soils
9 pages
Exetastai-The Discourses of Identity in Hellenistic Erythrai
100% (1)
Exetastai-The Discourses of Identity in Hellenistic Erythrai
34 pages
Hydrogen Safety & Risk Analysis - A Critical Review
No ratings yet
Hydrogen Safety & Risk Analysis - A Critical Review
19 pages
Black Holes and Beyond
No ratings yet
Black Holes and Beyond
140 pages
WORK
No ratings yet
WORK
17 pages
Lecture Slides GGR The Role of The Board in Innovation Ver1.0 110224
No ratings yet
Lecture Slides GGR The Role of The Board in Innovation Ver1.0 110224
46 pages
Business Overview
No ratings yet
Business Overview
42 pages
Ideal Gas and Real Gas Deviation Van Der Waals Equation
No ratings yet
Ideal Gas and Real Gas Deviation Van Der Waals Equation
8 pages
EE401 Class Desc
No ratings yet
EE401 Class Desc
8 pages
CS Project File
No ratings yet
CS Project File
8 pages
7673 Final Report - ST-2016-7673-1
No ratings yet
7673 Final Report - ST-2016-7673-1
58 pages
Lyrics
No ratings yet
Lyrics
42 pages
Gender Responsiveness in Local Government Unit of San Ildefonso Ilocos Sur
No ratings yet
Gender Responsiveness in Local Government Unit of San Ildefonso Ilocos Sur
16 pages
Nursing Management and Leadership Approaches From The Perspective of Registered Nurses in Portugal
No ratings yet
Nursing Management and Leadership Approaches From The Perspective of Registered Nurses in Portugal
8 pages
MC Granahan Anthropologyas Theoretical Storytelling 2020
No ratings yet
MC Granahan Anthropologyas Theoretical Storytelling 2020
8 pages
Fluvial Processes
No ratings yet
Fluvial Processes
35 pages
Worksheet 2
No ratings yet
Worksheet 2
2 pages
Exp - S5 - Vapour Liquid Equilibrium - Corrected
No ratings yet
Exp - S5 - Vapour Liquid Equilibrium - Corrected
6 pages
Arcos Del Sol-Digital-V2
No ratings yet
Arcos Del Sol-Digital-V2
2 pages
Test Score Descriptions
No ratings yet
Test Score Descriptions
1 page
6 MM Carbon Potentiometer: Automotive / Appliance Control
No ratings yet
6 MM Carbon Potentiometer: Automotive / Appliance Control
7 pages
Program: Worksheet 1.2 (Statement of Area-Program Specific Problem)
No ratings yet
Program: Worksheet 1.2 (Statement of Area-Program Specific Problem)
5 pages
Principles: Life and Work
From Everand
Principles: Life and Work
Ray Dalio
4/5 (648)
The Glass Castle: A Memoir
From Everand
The Glass Castle: A Memoir
Jeannette Walls
4.5/5 (1856)
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
From Everand
The Gifts of Imperfection: Let Go of Who You Think You're Supposed to Be and Embrace Who You Are
Brené Brown
4/5 (1175)
The Perks of Being a Wallflower
From Everand
The Perks of Being a Wallflower
Stephen Chbosky
4.5/5 (4103)
The Emperor of All Maladies: A Biography of Cancer
From Everand
The Emperor of All Maladies: A Biography of Cancer
Siddhartha Mukherjee
4.5/5 (298)
Steve Jobs
From Everand
Steve Jobs
Walter Isaacson
4.5/5 (1139)
Angela's Ashes: A Memoir
From Everand
Angela's Ashes: A Memoir
Frank McCourt
4.5/5 (943)
Shoe Dog: A Memoir by the Creator of Nike
From Everand
Shoe Dog: A Memoir by the Creator of Nike
Phil Knight
4.5/5 (629)
The Outsider: A Novel
From Everand
The Outsider: A Novel
Stephen King
4/5 (2886)
The Light Between Oceans: A Novel
From Everand
The Light Between Oceans: A Novel
M.L. Stedman
4.5/5 (815)
Fear: Trump in the White House
From Everand
Fear: Trump in the White House
Bob Woodward
3.5/5 (836)
Rise of ISIS: A Threat We Can't Ignore
From Everand
Rise of ISIS: A Threat We Can't Ignore
Jay Sekulow
3.5/5 (144)
Team of Rivals: The Political Genius of Abraham Lincoln
From Everand
Team of Rivals: The Political Genius of Abraham Lincoln
Doris Kearns Goodwin
4.5/5 (244)
Little Women
From Everand
Little Women
Louisa May Alcott
4.5/5 (2369)
The World Is Flat 3.0: A Brief History of the Twenty-first Century
From Everand
The World Is Flat 3.0: A Brief History of the Twenty-first Century
Thomas L. Friedman
3.5/5 (2289)
The Unwinding: An Inner History of the New America
From Everand
The Unwinding: An Inner History of the New America
George Packer
4/5 (45)
The Yellow House: A Memoir (2019 National Book Award Winner)
From Everand
The Yellow House: A Memoir (2019 National Book Award Winner)
Sarah M. Broom
4/5 (100)
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
From Everand
A Heartbreaking Work Of Staggering Genius: A Memoir Based on a True Story
Dave Eggers
3.5/5 (233)
Sing, Unburied, Sing: A Novel
From Everand
Sing, Unburied, Sing: A Novel
Jesmyn Ward
4/5 (1267)
Her Body and Other Parties: Stories
From Everand
Her Body and Other Parties: Stories
Carmen Maria Machado
4/5 (903)
John Adams
From Everand
John Adams
David McCullough
4.5/5 (2546)
Manhattan Beach: A Novel
From Everand
Manhattan Beach: A Novel
Jennifer Egan
3.5/5 (919)

Exploiting Deep Neural Networks and Head Movements For Robust Binaural Localization of Multiple Sources in Reverberant Environments

Uploaded by

Exploiting Deep Neural Networks and Head Movements For Robust Binaural Localization of Multiple Sources in Reverberant Environments

Uploaded by

2444 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 25, NO.

12, DECEMBER 2017

Exploiting Deep Neural Networks and Head

to simulate the head rotation of a human listener. The output

A. Binaural Feature Extraction

contained 72 nodes corresponding to the 72 azimuth angles

Room A Room B Room C Room D

T 6 0 (s) 0.32 0.47 0.68 0.89

catalog based on the Knowles Electronic Manikin for Acoustic

Anechoic Room A Room B Room C Room D

Hemifiled Model MCT 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Avg.

Anechoic Room A Room B Room C Room D

Model Feature 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3 Avg.

Anechoic Room A Room B Room C Room D

All systems were trained using the MCT method.

You might also like