Article
Sound Source Localization Using a Convolutional Neural
Network and Regression Model
Tan-Hsu Tan, Yu-Tang Lin, Yang-Lang Chang and Mohammad Alkhaleefah *
Department of Electrical Engineering, National Taipei University of Technology, Taipei 10608, Taiwan;
[email protected] (T.-H.T.); [email protected] (Y.-T.L.); [email protected] (Y.-L.C.)
* Correspondence: [email protected]
Abstract: In this research, a novel sound source localization model is introduced that integrates
a convolutional neural network with a regression model (CNN-R) to estimate the sound source
angle and distance based on the acoustic characteristics of the interaural phase difference (IPD). The
IPD features of the sound signal are first extracted from the time-frequency domain by the short-time Fourier transform (STFT). Then, the IPD feature map is fed to the CNN-R model as an image for
sound source localization. The Pyroomacoustics platform and the multichannel impulse response
database (MIRD) are used to generate both simulated and real room impulse response (RIR) datasets.
The experimental results show that average accuracies of 98.96% and 98.31% are achieved by the proposed CNN-R for angle and distance estimation in the simulated scenario at SNR = 30 dB and
RT60 = 0.16 s, respectively. Moreover, in the real environment, the average accuracies of the angle
and distance estimations are 99.85% and 99.38% at SNR = 30 dB and RT60 = 0.16 s, respectively. The
performance obtained in both scenarios is superior to that of existing models, indicating the potential
of the proposed CNN-R model for real-life applications.
Keywords: deep learning; sound source localization; convolutional neural network; regression model
1. Introduction

Localization technologies are widely used in everyday applications, such as navigation, human–computer interaction, surveillance, rescue, and smart monitoring [1,2]. The global positioning system (GPS) is the most frequently used technology for outdoor positioning [3,4]. However, GPS accuracy is degraded when it is used in indoor environments due to obstacles blocking the signal's propagation [5,6]. Consequently, a number of technologies, such as infrared (IR), Bluetooth, and Wi-Fi, have been developed to address the challenge of indoor positioning, and they have become widely used for indoor localization and positioning in recent years [7]. The propagation path of radio signals can be line-of-sight (LOS) or non-line-of-sight (NLOS) in indoor environments [8]. However, the signals of indoor positioning technologies must propagate under LOS conditions in order to produce accurate location estimates [9]. Although IR offers high localization accuracy, its signal can be easily obscured by obstacles [10]. Bluetooth and Wi-Fi have the advantage of strong penetrating power, which allows their signals to pass through indoor obstacles [11,12]. Nevertheless, Bluetooth is disadvantaged by its short range, and Wi-Fi requires high hardware installation and maintenance costs [13]. Sound has the advantages of strong penetrating power, simple construction, and low cost [14]. Additionally, sound carries tone, timbre, and other features, which make it more effective than other technologies [15]. For example, the frequency of sound emitted from different locations can be distinguished efficiently, and multiple sound sources can be located at the same time. Therefore, sound source localization (SSL) has attracted much attention in recent years [16–18].

Currently, two types of sound source localization methods are generally used in the literature. First, microphone array methods use a microphone array as the receiving end to determine the direction of the sound source. The microphone arrays can be divided
into linear arrays, circular arrays, and distributed arrays. Second, human ear analysis
methods identify the sound source via simulating the signal received by the human ear.
It was shown in [19–24] that binaural beamforming-based methods can achieve high
noise reduction as well as sound source preservation and localization. Microphone array-based
methods can be further divided into four approaches under different acoustic characteristics
as follows [25–28]:
1. Beamforming: calculate the input signal power, phase, and amplitude of each receiv-
ing point through beamforming technology, and calculate the azimuth angle of the
sound source with the greatest probability.
2. Time difference of arrival (TDOA): the time difference between the signals' arrival at two or more receiving points is combined with the spatial information of these receiving points to infer the azimuth of the sound source [29] (a minimal sketch of this idea is given after the list).
3. High-resolution spectrum estimation (HRSE): the signal at the receiving point is used
to calculate the correlation between the spatial and spectral characteristics to obtain
the azimuth angle of the sound source [30].
4. Neural network (NN): train an NN model using a large amount of data to find audio patterns for multiple acoustic source localization [31].
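As an illustration of the TDOA approach in item 2, the following minimal sketch (not part of the original paper) estimates a two-microphone azimuth using the generalized cross-correlation with phase transform (GCC-PHAT); the function name, the far-field assumption, and the angle convention (0° to 180° from the microphone axis) are ours for illustration only.

```python
import numpy as np

def gcc_phat_azimuth(x1, x2, fs, mic_distance, c=343.0):
    """Estimate the source azimuth from the TDOA between two microphones via GCC-PHAT."""
    n = len(x1) + len(x2)
    X1, X2 = np.fft.rfft(x1, n=n), np.fft.rfft(x2, n=n)
    cross = X1 * np.conj(X2)
    cross /= np.abs(cross) + 1e-12                  # PHAT weighting: keep phase only
    cc = np.fft.irfft(cross, n=n)
    max_shift = int(fs * mic_distance / c)          # largest physically possible delay (samples)
    cc = np.concatenate((cc[-max_shift:], cc[:max_shift + 1]))
    tau = (np.argmax(np.abs(cc)) - max_shift) / fs  # estimated TDOA in seconds
    # Far-field assumption: tau = mic_distance * cos(theta) / c
    cos_theta = np.clip(c * tau / mic_distance, -1.0, 1.0)
    return np.degrees(np.arccos(cos_theta))         # azimuth in [0, 180] degrees
```

With a 0.2 m microphone spacing and a 16 kHz sampling rate, for example, the physically possible delay spans only about ±9 samples, which is why the correlation search range is clipped to that window.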
Recently, various deep neural networks (DNNs) were employed for sound source
localization. Chakrabarty et al. [32] proposed a CNN-based supervised learning (CNN-
SL) approach to estimate the direction of arrival (DOA) of multiple speakers. The phase
component of the STFT coefficients of the received microphone signals is directly fed into the CNN, and the features for DOA estimation are learned during the training process. The method adapts robustly to unseen acoustic conditions. However, it is highly dependent on the time-varying source
signal [33]. Yiwere et al. [34] presented a sound source distance estimation (SSDE) approach
by using a convolutional recurrent neural network (CRNN). The CRNN is trained using log-
scaled mel spectrograms extracted from single-channel audio signals as input features. The
transformation of the audio signals to images allows the convolutional layers of the network
to extract distance-dependent features from the audio signals. The experimental results
showed that the CRNN model can achieve a high level of accuracy. Another interesting
research work [35] proposed an indoor sound source regional localization method based on
a convolutional neural network (CNN). The sound source signal is converted into a spectral
map and fed into the CNN for regional localization. The simulation results showed that the
CNN can bring better robustness and generalization with different SNRs. Pang et al. [36]
introduced a binaural sound source localization (SSL) method based on time–frequency
CNN (TF-CNN) with multitask learning to simultaneously localize azimuth and elevation
under various acoustic conditions. The IPD and interaural level difference (ILD) are first
extracted from the received binaural signals, then each or both of them are fed to the SSL
neural network. The experimental results illustrated that the proposed method can achieve
comparable localization performance. Nevertheless, such methods are restricted to certain
ranges or areas.
This research aims to construct an indoor localization model based on the charac-
teristics of the sound spectrum, which can estimate the azimuth angle and distance of
the indoor speaker. The CNN is used to automatically extract features and increase their
versatility and robustness by training the model to resist noise. Previous works used
classification functions to normalize the output of a CNN to a probability distribution over
the output target. However, the output of classification functions is a discrete value, and
hence, it does not predict the exact value in the case of continuous variables. Unlike the
previous studies, our CNN uses a regression function instead of a classification function
because it is better suited for the continuous variable output. Additionally, this research
uses the Pyroomacoustics [37] platform to quickly construct a virtual three-dimensional
space and generate a room impulse response (RIR) with spatial sound signals. Moreover,
real space signals are synthesized with a multi-channel impulse response database [38].
In addition, the signal dataset is converted into a time-frequency domain signal through
STFT and then is converted to an IPD feature map to be fed to the CNN model. Finally,
the distribution of output values with the regression model are observed to find the best
configuration of the model through training and evaluate the performance of the model in
different environments.
2. Proposed Methods
The overall flow chart of the proposed sound source localization system is demon-
strated in Figure 1. The sound database signal is firstly convolved with the real and
simulated RIR to obtain a new signal with spatial effect. Then the STFT of the new signal
is obtained, and the IPD features are extracted. Finally, the CNN-R model is trained on
the IPD features to estimate the angle and distance of the sound source. Notably, the IPD
image sets are divided into 70% training set, 10% validation set, and 20% test set.
[Figure 1. The overall flow chart of the proposed sound source localization system: model training and model testing with the convolutional neural network + regression model.]
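The room simulation step can be sketched as follows with Pyroomacoustics [37]: a dry speech signal is placed in a simulated shoebox room and convolved with the generated RIRs to yield a two-channel spatial signal. The room height, source and microphone coordinates, file name, and sampling rate below are illustrative assumptions, not values taken from the paper.

```python
import numpy as np
import pyroomacoustics as pra
from scipy.io import wavfile

fs, speech = wavfile.read("arctic_a0001.wav")          # illustrative CMU_ARCTIC utterance

# Simulated 6 x 6 m room with RT60 = 0.16 s (a 2.5 m ceiling height is assumed)
rt60, room_dim = 0.16, [6.0, 6.0, 2.5]
absorption, max_order = pra.inverse_sabine(rt60, room_dim)
room = pra.ShoeBox(room_dim, fs=fs, materials=pra.Material(absorption), max_order=max_order)

# Two-microphone pair and one source point (coordinates are illustrative)
mics = np.c_[[2.9, 3.0, 1.5], [3.1, 3.0, 1.5]]
room.add_microphone_array(pra.MicrophoneArray(mics, room.fs))
room.add_source([3.0, 4.0, 1.5], signal=speech)

# Convolve the source with the simulated RIRs to obtain the spatial two-channel signal
room.simulate()
left, right = room.mic_array.signals                    # shape: (2, n_samples)
```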
The sound source point is located 1 and 2 m from the midpoint of the two microphones. The angles are distributed from 0° to 180° in steps of 15°. In total, there are 26 source points. The sound database adopts the CMU_ARCTIC database. A total of 100 audio files are taken from the corpus of 4 participants. Convolution operations are performed at each sampling point to generate spatial sound effects, and RT60 and SNR are adjusted to achieve data diversity.
Figure 3. Multichannel Impulse Response Database measurement space configuration diagram [38].
ϕ(ω, k) = ∠[ Y_l(ω, k) / Y_r(ω, k) ]                (2)

where Y_l(ω, k) and Y_r(ω, k) are the left and right received signals in the time-frequency domain. The IPD can be obtained by subtracting their phase components. In other words, the IPD is computed as the difference of the phase angles, and phase unwrapping is applied to the phase image. Figure 4 is an example of the actual output of the IPD.
Figure 4. An example of the actual output of IPD where the color map represents the angle (°).
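A minimal sketch of the IPD feature extraction described by Equation (2), assuming SciPy's STFT and NumPy phase unwrapping along the frequency axis; the FFT size, hop length, and function name are assumptions for illustration.

```python
import numpy as np
from scipy.signal import stft

def ipd_feature_map(left, right, fs=16000, n_fft=512, hop=256):
    """Compute an interaural phase difference (IPD) map from a two-channel signal."""
    _, _, Yl = stft(left, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    _, _, Yr = stft(right, fs=fs, nperseg=n_fft, noverlap=n_fft - hop)
    # Eq. (2): phase of the ratio of the left and right STFTs
    ipd = np.angle(Yl * np.conj(Yr))      # equivalent to angle(Yl / Yr)
    ipd = np.unwrap(ipd, axis=0)          # unwrap along the frequency axis (assumed)
    return ipd                            # 2-D map fed to the CNN-R as an image
```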
The dataset of the simulated IPD includes 400 audio records, 5 spatial sizes (5 × 5, 5 × 6, 6 × 6, 7 × 6, and 7 × 7 m²), 26 sampling points, 5 SNRs (0, 5, 10, 20, and 30 dB), and 3 RT60 values (0.16, 0.36, and 0.61 s), for a total of 780,000 images. The dataset of the real IPD includes 400 audio records, 1 spatial size (6 × 6 m²), 26 sampling points, 5 SNRs (0, 5, 10, 20, and 30 dB), and 3 RT60 values (0.16, 0.36, and 0.61 s), for a total of 156,000 images. The noise is independent Gaussian white noise added to each channel, and the SNR is computed as follows:
SNR = 10 log₁₀ (SignalPower / NoisePower) dB                (3)
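The noise injection of Equation (3) can be sketched as follows: the signal power is estimated per channel and independent white Gaussian noise is scaled to reach the target SNR (the function name and random generator are illustrative).

```python
import numpy as np

def add_white_noise(channel, snr_db, rng=None):
    """Add independent white Gaussian noise to a single channel at a target SNR (Eq. 3)."""
    rng = rng or np.random.default_rng(0)
    channel = np.asarray(channel, dtype=np.float64)
    signal_power = np.mean(channel ** 2)
    noise_power = signal_power / (10.0 ** (snr_db / 10.0))   # from SNR = 10 log10(Ps/Pn)
    noise = rng.normal(0.0, np.sqrt(noise_power), size=channel.shape)
    return channel + noise

# noisy_left  = add_white_noise(left,  snr_db=30)
# noisy_right = add_white_noise(right, snr_db=30)            # independent noise per channel
```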
[Figure 5. The architecture of the proposed CNN-R: a CNN feature extractor with ReLU activations followed by a dense regression layer with a linear output.]
Table 1. The settings of the training option used in the single acoustic environment.
Hyperparameters Configurations
Optimizer Adam
Loss Function MAE
Learning Rate 0.001
Decay 10⁻³
Execution Environment GPU
Batch Size 64
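A minimal sketch of a CNN-R configured with the Table 1 settings in TensorFlow/Keras is shown below. The convolutional layer sizes, the IPD input shape, and the use of a single regression output per model (angle or distance) are assumptions for illustration; the paper's exact architecture is not reproduced here.

```python
import tensorflow as tf
from tensorflow.keras import layers, models

def build_cnn_r(input_shape=(257, 251, 1)):     # hypothetical IPD map size
    """CNN feature extractor followed by a dense regression head with a linear output."""
    model = models.Sequential([
        layers.Input(shape=input_shape),
        layers.Conv2D(32, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Conv2D(64, 3, activation="relu"),
        layers.MaxPooling2D(2),
        layers.Flatten(),
        layers.Dense(128, activation="relu"),
        layers.Dense(1, activation="linear"),    # regression output (angle or distance)
    ])
    # Table 1: Adam optimizer, MAE loss, learning rate 0.001, decay 10^-3
    lr = tf.keras.optimizers.schedules.InverseTimeDecay(0.001, decay_steps=1, decay_rate=1e-3)
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=lr), loss="mae")
    return model

# model.fit(x_train, y_train, validation_data=(x_val, y_val), batch_size=64)
```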
3. Experimental Results
The generated IPD dataset was divided into three parts: training, validation, and
testing. We used the validation dataset to monitor the performance of our model during
the training and adjust the model hyperparameters to find the optimal configuration. All
the experiments were performed using a PC with an Intel Core i7-7700 CPU at 3.6 GHz and
32 GB of RAM. The GPU was NVIDIA GeForce RTX 2070 with 8 GB of memory. The model
was implemented in Python with TensorFlow.
Figure 6. Performance evaluation criteria where yellow dots are the predicted values. (1) is high
accuracy and high precision, which is the best scenario. (2) is low accuracy and high precision. (3) is
high accuracy and low precision. (4) is low accuracy and low precision, which is the worst scenario.
Because the output of the regression model is a continuous value, this research uses
the formula in (5) to evaluate the accuracy of the proposed CNN-R:
Accuracy = N_fine / N_T                (5)
where N_T is the total number of trials and N_fine is the number of predictions whose error is smaller than half of the unit scale. In this research, the angle estimation unit scale is 15° and the distance unit scale is 1 m; therefore, the baseline values are 7.5° (15/2) for the angle and 0.5 m (1/2) for the distance. When the prediction error is less than the baseline value, the prediction is considered correct (a minimal sketch of this criterion is given after the list below). In this research, the experiments are mainly divided into two parts as follows:
1. Experiment 1 in a simulated environment consists of two parts: (i) a single acoustic
environment is used to train the model for angle and distance estimation, and (ii) a
multiple acoustic environment is used to train the model for angle and distance esti-
mation.
2. Experiment 2 uses a real spatial sound dataset to train the model for angle and
distance estimation.
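A minimal sketch of the accuracy criterion in Equation (5), using the 7.5° and 0.5 m thresholds described above (function and variable names are illustrative):

```python
import numpy as np

def accuracy(pred, true, unit_scale):
    """Eq. (5): fraction of predictions whose absolute error is below half the unit scale."""
    pred, true = np.asarray(pred), np.asarray(true)
    threshold = unit_scale / 2.0                       # 7.5 deg for angle, 0.5 m for distance
    n_fine = np.sum(np.abs(pred - true) < threshold)   # N_fine
    return n_fine / len(true)                          # N_fine / N_T

# angle_acc = accuracy(pred_angles, true_angles, unit_scale=15.0)
# dist_acc  = accuracy(pred_dists,  true_dists,  unit_scale=1.0)
```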
3.2. Experiment 1
3.2.1. Model Performance in a Single Acoustic Environment
In the experiment, the goal is to show the ability of the proposed CNN-R architecture
to make correct predictions in different room dimensions under the same RT60. The use
of different room dimensions is to avoid data overfitting and validate the performance of
CNN-R. Additionally, the same RT60 is used to avoid environmental parameter changes.
The experimental results show that the proposed CNN-R can be generalized and used in
multiple acoustic environments. Table 2 shows the single acoustic environment configura-
tion. The training set rooms are 5 × 5, 6 × 5, 6 × 7, and 7 × 7 m², with SNRs of 0 dB and 5 dB and an RT60 of 0.16 s. The testing set room is 6 × 6 m², with SNRs of 10 dB, 20 dB, and 30 dB and an RT60 of 0.16 s.
Tables 3 and 4 show the model performance for angle and distance estimation in the
single acoustic environment under three SNR scenarios, respectively.
Figure 7 shows the average of accuracy and MAE for the angle and distance estimation
in the single acoustic environment under SNR = 10 dB, 20 dB, 30 dB, and RT60 = 0.16 s. In
the single acoustic environment, the accuracy of the angle and distance estimation increases
as the SNR increases. When the SNR is greater than 20 dB, the angle and distance accuracy
can reach 99.42% and 96.12%, respectively. Additionally, the MAE is reduced to 0.8° and
0.18 m, and the RMSE is reduced to 1.32° and 0.14 m. The accuracy of the angle estimation
model in each SNR is better than the distance estimation model.
Figure 7. The average accuracy and MAE of (a) angle and (b) distance estimation by CNN-R in a
single acoustic environment, where SNR = 10 dB, 20 dB, 30 dB, and RT60 = 0.16 s.
Table 6 shows the model performance for angle estimation in the multiple acoustic
environment under SNR = 10, 20, 30 dB, and three RT60 scenarios. Table 7 shows the model
performance for distance estimation in the multiple acoustic environment under SNR = 10,
20, and 30 dB, and three RT60 scenarios.
Figures 8 and 9 show the model performance for the angle and distance estimation in
the multiple acoustic environments under SNR = 10 dB, 20 dB, 30 dB, and RT60 = 0.16 s,
0.36 s, 0.61 s, respectively. In the multiple acoustic environment, two acoustic spaces,
RT60 = 0.36 s and RT60 = 0.61 s, were added. The overall accuracy increases with the
increase of SNR and the decrease of RT60. When RT60 = 0.61 s, the performance of angle
estimation is not satisfactory, and the average accuracy is 61.19%. However, if RT60 is
reduced to 0.36 s, the accuracy can be greatly increased by about 20%. Moreover, MAE
and RMSE drop sharply at RT60 = 0.16 s. The best performance for angle estimation is achieved at SNR = 30 dB, with MAE = 1.96°, RMSE = 1.64°, and an accuracy of 98.96%.
[Figure 8. The average accuracy and MAE of angle estimation by CNN-R in the multiple acoustic environment under RT60 = 0.16 s, 0.36 s, and 0.61 s.]
[Figure 9. The average accuracy and MAE of distance estimation by CNN-R in the multiple acoustic environment under RT60 = 0.16 s, 0.36 s, and 0.61 s.]
3.3. Experiment 2
The Model Performance in a Real Acoustic Environment.
Table 8 shows the real acoustic environment configuration. The training set room is 6 × 6 m², with SNRs of 0 dB, 5 dB, and 10 dB and RT60 values of 0.16 s, 0.36 s, and 0.61 s. The test set room is also 6 × 6 m², with SNRs of 10 dB, 20 dB, and 30 dB and the same RT60 values of 0.16 s, 0.36 s, and 0.61 s as the training set.
Table 9 shows the model performance for distance estimation in the real acoustic envi-
ronment under SNR = 10, 20, and 30 dB, and RT60 = 0.16 s, 0.36 s, and 0.61 s, respectively.
Table 10 shows the model performance for angle estimation in the real acoustic envi-
ronment under SNR = 10, 20, and 30 dB, and RT60 = 0.16 s, 0.36 s, and 0.61 s, respectively.
Table 9. Performance of distance estimation by CNN-R in a real acoustic environment at SNRs of 10,
20, and 30 dB, respectively.
Table 10. Performance of angle estimation by CNN-R in a real acoustic environment at SNRs of 10,
20, and 30 dB, and RT60 = 0.16 s, 0.36 s, and 0.61 s.
Table 11 shows the average Acc and MAE of the proposed model for angle and
distance estimation in real acoustic environments, where SNR = 10 dB, 20 dB, and 30 dB,
and RT60 = 0.16 s, 0.36 s, and 0.61 s, respectively. In a real acoustic environment, the angle
estimation accuracy increases and the error decreases as SNR increases and RT60 decreases.
Moreover, when the SNR is greater than 20 dB, the accuracy obtained is higher than 96%,
and the MAE is less than 1.7°. The accuracy of distance estimation is also improved with
the increase in SNR. Overall, the accuracy is higher than 95% when SNR = 20 dB and 30 dB.
The Acc. and MAE of each RT60 are stable when SNR is greater than 20 dB. Table 12 shows
the accuracy of CNN-R for angle and distance estimation compared to other methods based
on the multi-channel impulse response database [38].
Table 11. The average Acc. and MAE of angle and distance estimation by CNN-R in a real acoustic
environment where SNR = 10 dB, 20 dB, and 30 dB, and RT60 = 0.16 s, 0.36 s, and 0.61 s, respectively.
RT60 (s)   SNR (dB)   Angle Acc. (%)   Angle MAE (°)   Distance Acc. (%)   Distance MAE (m)
0.16       10         87.38            4.55            89.15               0.28
0.16       20         97.92            1.18            98.08               0.16
0.16       30         99.85            0.48            99.38               0.14
0.36       10         86.00            6.02            92.38               0.22
0.36       20         98.46            1.18            98.58               0.15
0.36       30         99.85            0.46            99.81               0.13
0.61       10         79.04            9.60            93.42               0.22
0.61       20         96.69            1.64            98.27               0.15
0.61       30         99.73            0.59            99.50               0.14
Table 12. Comparative results of angle and distance estimation based on the multi-channel impulse
response database in a real acoustic environment at SNR = 30 dB and RT60 = 0.16 s.
The training–validation loss curves for the proposed CNN-R in a single acoustic
environment, multiple acoustic environment, and real acoustic environment are shown
in Figure 10. Unlike the single acoustic environment and multiple acoustic environment,
the loss in real acoustic environment gradually reduces and slowly converges as the
number of epochs increases. Moreover, note in Figure 10c that the training loss curve and
validation loss curve behave similarly, which implies that the proposed CNN-R model can
be generalized and does not suffer from overfitting.
4. Discussion
This research aims to establish a general sound localization model. The results of a
single acoustic environment in Experiment 1 show that under different room conditions,
the test model can still effectively estimate the angle and distance in a single acoustic
environment, and it will be more accurate as the SNR increases. In the multiple acoustic
environments, good estimation performance can also be obtained under different room
conditions. When RT60 = 0.61 s, the accuracy is relatively insufficient. However, as the SNR
increases, the accuracy can be effectively improved. The model proposed in this research
has the best performance in the simulated room where the RT60 is less than 0.36 s and the
SNR is greater than 20 dB. In addition, in the real acoustic environment of Experiment 2,
the overall accuracy is enhanced significantly, verifying the practicability of our proposed
model in a real acoustic environment. The experimental results show that the MAE of the
model for the angle estimation is smaller than that of the distance estimation, which means that the error between the predicted value and the actual value is small. Nonetheless, the RMSE of the model for angle estimation is greater than that of the distance estimation, which means that a small number of predicted values have large deviations; hence, the angle estimation model has high accuracy but low precision. On the other hand, the distance estimation model has both high accuracy and high precision.
Comparing the results of the proposed CNN-R in the multiple acoustic environment in
Experiment 1 with the results in the real acoustic environment in Experiment 2, it can clearly
be seen that under the same environmental acoustic parameters, the accuracy of the model
trained in the real environment is higher than that of the simulated acoustic environment.
The reason for this result is that when generating simulated room sound effects, the only
parameters we can adjust are SNR and RT60. However, in the real environment, the
parameters that affect sound propagation are more complex. Therefore, the model trained
with the simulation dataset has insufficient features, which affects the learning of the model,
resulting in a decrease in accuracy. The experimental results show that the accuracy of the
distance estimation is better than that of the angle estimation. The reason is that there are
13 target values for the angle estimation and only 2 target values for the distance estimation,
which increases the complexity of the angle estimation model weight training and makes
the weight distribution uneven.
Taking Tables 3 and 4 as an example, when SNR = 10 dB, the accuracy of the angle
estimation is between 71% and 100%. The accuracy close to 90° is higher, and the accuracy
close to 0° or 180° decreases on both sides. The accuracy of the distance estimation is
distributed between 87.08% and 94.15%, and the distribution of the accuracy of the distance
estimation is more concentrated than that of the angle estimation. Moreover, the Acc. is
low and MAE is high due to the small number of training samples in the single acoustic
environment compared to the multiple acoustic environment. Additionally, in general, the
accuracy drops significantly when the value of RT60 increases, except when the angle is
90 degrees in the multiple acoustic environment. One limitation of the proposed model
might be the offline design. Future work will focus on improving the proposed model for
real-time positioning. Additionally, the proposed model still needs further enhancement
for multiple sound source localization.
conditions. When SNR is greater than 10 dB and RT60 is less than 0.61 s, the accuracy of
the angle and distance estimations can reach, on average, more than 95%. Additionally,
when SNR = 30 dB and RT60 = 0.16 s, the accuracies of the angle and distance estimations
can reach 98.96% and 98.31%, respectively. On the other hand, the experimental results in
the real acoustic scenarios showed that when the SNR is greater than 20 dB, the accuracy
of the angle and distance estimation exceeds 96%. Furthermore, when SNR = 30 dB and
RT60 = 0.16 s, the accuracies of the angle and distance estimations reach 99.85% and 99.38%,
respectively. In comparison to the existing methods, the experimental results also showed
that the proposed CNN-R outperforms the existing methods in terms of the angle and
distance estimation accuracies. Future work will study the combination of other acoustic
features, such as ILD, to make the features richer. Moreover, the impact of more acoustic
environments on the accuracy will also be investigated.
Author Contributions: Conceptualization, T.-H.T., Y.-T.L. and Y.-L.C.; methodology, Y.-L.C. and
Y.-T.L.; software, Y.-T.L.; validation, M.A., T.-H.T. and Y.-L.C.; formal analysis, Y.-L.C. and M.A.;
investigation, Y.-T.L. and M.A.; data curation, Y.-T.L.; writing—review and editing, T.-H.T., M.A. and
Y.-L.C.; supervision, T.-H.T. and M.A. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was funded by the Ministry of Science and Technology, Taiwan, No. MOST
110-2221-E-027-089.
Conflicts of Interest: The authors declare no conflict of interest.
References
1. Morar, A.; Moldoveanu, A.; Mocanu, I.; Moldoveanu, F.; Radoi, I.E.; Asavei, V.; Gradinaru, A.; Butean, A. A comprehensive
survey of indoor localization methods based on computer vision. Sensors 2020, 20, 2641. [CrossRef]
2. Bai, Y.; Lu, L.; Cheng, J.; Liu, J.; Chen, Y.; Yu, J. Acoustic-based sensing and applications: A survey. Comput. Netw. 2020,
181, 107447. [CrossRef]
3. Zhu, R.; Wang, Y.; Cao, H.; Yu, B.; Gan, X.; Huang, L.; Zhang, H.; Li, S.; Jia, H.; Chen, J. RTK/pseudolite/LAHDE/IMU-PDR
integrated pedestrian navigation system for urban and indoor environments. Sensors 2020, 20, 1791. [CrossRef] [PubMed]
4. Cengiz, K. Comprehensive Analysis on Least-Squares Lateration for Indoor Positioning Systems. IEEE Internet Things J. 2020,
8, 2842–2856. [CrossRef]
5. Uzun, A.; Ghani, F.A.; Ahmadi Najafabadi, A.M.; Yenigün, H.; Tekin, İ. Indoor Positioning System Based on Global Positioning
System Signals with Down-and Up-Converters in 433 MHz ISM Band. Sensors 2021, 21, 4338. [CrossRef] [PubMed]
6. Bhat, S.J.; Venkata, S.K. An optimization based localization with area minimization for heterogeneous wireless sensor networks
in anisotropic fields. Comput. Netw. 2020, 179, 107371. [CrossRef]
7. Zhu, X.; Qu, W.; Qiu, T.; Zhao, L.; Atiquzzaman, M.; Wu, D.O. Indoor intelligent fingerprint-based localization: Principles,
approaches and challenges. IEEE Commun. Surv. Tutor. 2020, 22, 2634–2657. [CrossRef]
8. Jo, H.J.; Kim, S. Indoor smartphone localization based on LOS and NLOS identification. Sensors 2018, 18, 3987. [CrossRef]
[PubMed]
9. Gu, Y.; Lo, A.; Niemegeers, I. A survey of indoor positioning systems for wireless personal networks. IEEE Commun. Surv. Tutor.
2009, 11, 13–32. [CrossRef]
10. Ye, T. Algorithms for Indoor Localization Based on IEEE 802.15.4-2011 UWB and Inertial Sensors. Ph.D. Thesis, University
College Cork, Cork, Ireland, 2015.
11. Li, W.; Su, Z.; Zhang, K.; Benslimane, A.; Fang, D. Defending malicious check-in using big data analysis of indoor positioning
system: An access point selection approach. IEEE Trans. Netw. Sci. Eng. 2020, 7, 2642–2655. [CrossRef]
12. Maheepala, M.; Kouzani, A.Z.; Joordens, M.A. Light-based indoor positioning systems: A review. IEEE Sens. J. 2020, 20, 3971–3995.
[CrossRef]
13. Mirowski, P.; Ho, T.K.; Yi, S.; MacDonald, M. SignalSLAM: Simultaneous localization and mapping with mixed WiFi, Bluetooth,
LTE and magnetic signals. In Proceedings of the International Conference on Indoor Positioning and Indoor Navigation, IEEE,
Montbeliard, France, 28–31 October 2013; pp. 1–10.
14. Chen, X.; Sun, H.; Zhang, H. A new method of simultaneous localization and mapping for mobile robots using acoustic
landmarks. Appl. Sci. 2019, 9, 1352. [CrossRef]
15. Yang, K.; Wang, K.; Bergasa, L.M.; Romera, E.; Hu, W.; Sun, D.; Sun, J.; Cheng, R.; Chen, T.; López, E. Unifying terrain awareness
for the visually impaired through real-time semantic segmentation. Sensors 2018, 18, 1506. [CrossRef] [PubMed]
16. Sun, Y.; Chen, J.; Yuen, C.; Rahardja, S. Indoor sound source localization with probabilistic neural network. IEEE Trans. Ind.
Electron. 2017, 65, 6403–6413. [CrossRef]
17. Rahaman, A.; Kim, B. Sound source localization in 2D using a pair of bio–inspired MEMS directional microphones. IEEE
Sens. J. 2020, 21, 1369–1377. [CrossRef]
18. Park, Y.; Choi, A.; Kim, K. Single-Channel Multiple-Receiver Sound Source Localization System with Homomorphic Deconvolu-
tion and Linear Regression. Sensors 2021, 21, 760. [CrossRef]
19. Zohourian, M.; Enzner, G.; Martin, R. Binaural speaker localization integrated into an adaptive beamformer for hearing aids.
IEEE ACM Trans. Audio Speech Lang. Process. 2017, 26, 515–528. [CrossRef]
20. Amini, J.; Hendriks, R.C.; Heusdens, R.; Guo, M.; Jensen, J. Asymmetric coding for rate-constrained noise reduction in binaural
hearing aids. IEEE ACM Trans. Audio Speech Lang. Process. 2018, 27, 154–167. [CrossRef]
21. Koutrouvelis, A.I.; Hendriks, R.C.; Heusdens, R.; Jensen, J. A convex approximation of the relaxed binaural beamforming
optimization problem. IEEE ACM Trans. Audio Speech Lang. Process. 2018, 27, 321–331. [CrossRef]
22. Koutrouvelis, A.I.; Hendriks, R.C.; Heusdens, R.; Jensen, J. Relaxed binaural LCMV beamforming. IEEE ACM Trans. Audio Speech
Lang. Process. 2016, 25, 137–152. [CrossRef]
23. Jeffet, M.; Shabtai, N.R.; Rafaely, B. Theory and perceptual evaluation of the binaural reproduction and beamforming tradeoff in
the generalized spherical array beamformer. IEEE ACM Trans. Audio Speech Lang. Process. 2016, 24, 708–718. [CrossRef]
24. Marquardt, D.; Hohmann, V.; Doclo, S. Interaural coherence preservation in multi-channel Wiener filtering-based noise reduction
for binaural hearing aids. IEEE ACM Trans. Audio Speech Lang. Process. 2015, 23, 2162–2176. [CrossRef]
25. Rascon, C.; Meza, I. Localization of sound sources in robotics: A review. Robot. Auton. Syst. 2017, 96, 184–210. [CrossRef]
26. Cobos, M.; Antonacci, F.; Alexandridis, A.; Mouchtaris, A.; Lee, B. A survey of sound source localization methods in wireless
acoustic sensor networks. Wirel. Commun. Mob. Comput. 2017, 2017. [CrossRef]
27. Argentieri, S.; Danes, P.; Souères, P. A survey on sound source localization in robotics: From binaural to array processing methods.
Comput. Speech Lang. 2015, 34, 87–112. [CrossRef]
28. Pak, J.; Shin, J.W. Sound localization based on phase difference enhancement using deep neural networks. IEEE ACM Trans.
Audio Speech Lang. Process. 2019, 27, 1335–1345. [CrossRef]
29. Scheuing, J.; Yang, B. Disambiguation of TDOA estimation for multiple sources in reverberant environments. IEEE Trans. Audio
Speech Lang. Process. 2008, 16, 1479–1489. [CrossRef]
30. Farmani, M. Informed Sound Source Localization for Hearing Aid Applications. Ph.D. Thesis, Aalborg University, Aalborg,
Denmark, 2017.
31. Youssef, K.; Argentieri, S.; Zarader, J.L. A learning-based approach to robust binaural sound localization. In Proceedings
of the 2013 IEEE/RSJ International Conference on Intelligent Robots and Systems, IEEE, Tokyo, Japan, 3–7 November 2013;
pp. 2927–2932.
32. Chakrabarty, S.; Habets, E.A. Multi-speaker DOA estimation using deep convolutional networks trained with noise signals. IEEE
J. Sel. Top. Signal Process. 2019, 13, 8–21. [CrossRef]
33. Hu, Y.; Samarasinghe, P.N.; Gannot, S.; Abhayapala, T.D. Semi-supervised multiple source localization using relative harmonic
coefficients under noisy and reverberant environments. IEEE ACM Trans. Audio Speech Lang. Process. 2020, 28, 3108–3123.
[CrossRef]
34. Yiwere, M.; Rhee, E.J. Sound source distance estimation using deep learning: an image classification approach. Sensors 2020,
20, 172. [CrossRef] [PubMed]
35. Zhang, X.; Sun, H.; Wang, S.; Xu, J. A new regional localization method for indoor sound source based on convolutional neural
networks. IEEE Access 2018, 6, 72073–72082. [CrossRef]
36. Pang, C.; Liu, H.; Li, X. Multitask learning of time-frequency CNN for sound source localization. IEEE Access 2019, 7, 40725–40737.
[CrossRef]
37. Scheibler, R.; Bezzam, E.; Dokmanić, I. Pyroomacoustics: A python package for audio room simulation and array processing
algorithms. In Proceedings of the 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP),
Calgary, AB, Canada, 15–20 April 2018; pp. 351–355.
38. Hadad, E.; Heese, F.; Vary, P.; Gannot, S. Multichannel audio database in various acoustic environments. In Proceedings of the
2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC), IEEE, Juan les Pins, France, 8–11 September 2014;
pp. 313–317.
39. Kominek, J.; Black, A.W.; Ver, V. CMU ARCTIC databases for speech synthesis. In Proceedings of the Fifth ISCA ITRW on Speech
Synthesis, Pittsburgh, PA, USA, 14–16 June 2004; pp. 223–224.
40. Chollet, F. Deep Learning with Python; Simon and Schuster: Shelter Island, NY, USA, 2021.