L3DAS23 Learning 3D Audio Sources For Audio-Visual Extended Reality
Received 8 August 2023; revised 2 November 2023; accepted 3 November 2023. Date of publication 12 March 2024;
date of current version 18 June 2024. The review of this article was arranged by Associate Editor R. Serizel.
Digital Object Identifier 10.1109/OJSP.2024.3376297
We support the dataset download and the use of the baseline models via extensive instructions provided on the official GitHub repository at
https://github.com/l3das/L3DAS23. For more comprehensive information and in-depth details about the challenge, we invite the reader to visit the L3DAS Project
website at http://www.l3das.com/icassp2023.
ABSTRACT The primary goal of the L3DAS (Learning 3D Audio Sources) project is to stimulate and
support collaborative research studies concerning machine learning techniques applied to 3D audio signal
processing. To this end, the L3DAS23 Challenge, presented at IEEE ICASSP 2023, focuses on two spatial
audio tasks of paramount interest for practical uses: 3D speech enhancement (3DSE) and 3D sound event
localization and detection (3DSELD). Both tasks are evaluated within augmented reality applications. The
aim of this paper is to describe the main results obtained from this challenge. We provide the L3DAS23
dataset, which comprises a collection of first-order Ambisonics recordings in reverberant simulated envi-
ronments. Indeed, we maintain some general characteristics of the previous L3DAS challenges, featuring
a pair of first-order Ambisonics microphones to capture the audio signals and involving multiple-source
and multiple-perspective Ambisonics recordings. However, in this new edition, we introduce audio-visual
scenarios by including images that depict the frontal view of the environments as captured from the
perspective of the microphones. This addition aims to enrich the challenge experience, giving participants
tools for exploring a combination of audio and images for solving the 3DSE and 3DSELD tasks. In addition
to a brand-new dataset, we provide updated baseline models designed to take advantage of audio-image pairs.
To ensure accessibility and reproducibility, we also supply a supporting API for effortless replication of our
results. Lastly, we present the results achieved by the participants of the L3DAS23 Challenge.
INDEX TERMS 3D audio, ambisonics, data challenge, sound event localization and detection, speech
enhancement.
FIGURE 3. Schematic overview of the 3D sound event localization and detection task. Ambisonics microphones record the sound mixture of the acoustic
environment, and the 3DSELD model must estimate, for each time interval, the labels of the active sound sources (detection) and their DOA (localization).
In this example, there are two active sound sources, identified by the labels telephone and printer.
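To make the task output concrete, the following Python sketch shows one possible per-frame representation of a 3DSELD prediction: a binary detection vector over the sound classes and one position/DOA estimate per class. The class list, array layout, and function name are illustrative assumptions, not the official L3DAS23 label format.

```python
import numpy as np

# Illustrative only: a toy per-frame SELD representation, not the official
# L3DAS23 label format. One activity flag and one (x, y, z) estimate per class.
CLASSES = ["telephone", "printer", "knock"]  # illustrative subset of classes

def frame_prediction(active, positions):
    """Build a toy per-frame SELD output.

    active:    dict class_name -> bool (is the class sounding in this frame?)
    positions: dict class_name -> (x, y, z) in metres, centred on the microphone
    """
    detection = np.array([float(active.get(c, False)) for c in CLASSES])
    location = np.array([positions.get(c, (0.0, 0.0, 0.0)) for c in CLASSES])
    return detection, location

# The situation depicted in Fig. 3: telephone and printer are active.
det, loc = frame_prediction(
    active={"telephone": True, "printer": True},
    positions={"telephone": (1.0, 2.0, 0.4), "printer": (-1.5, 1.0, -0.8)},
)
print(det)        # [1. 1. 0.]
print(loc.shape)  # (3, 3)
```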
to generate the spatial sound scenes. The noise sound event dataset we used for Task 1 is the well-known FSD50K dataset [28]. In particular, we have selected 12 transient classes, representative of the noise sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger-snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, and 4 continuous noise classes: alarm, crackle, mechanical fan and microwave oven. Furthermore, we extracted clean speech signals (without background noise) from Librispeech [29], selecting only sound files up to 12 seconds.
For the 3DSELD task, the measured ARIRs are convolved with clean sound samples belonging to distinct sound classes. Sound events for Task 2 are taken again from the FSD50K dataset [28]. We have selected 14 classes, most representative of the sounds that can be heard in an office: the 12 classes already used for 3DSE, plus female speech and male speech.

B. RECORDING PROCEDURE
We placed two Ambisonics microphones in 443 random positions of 68 houses and generated B-Format ACN/SN3D impulse responses of the rooms by placing the sound sources in random locations of a cylindrical grid defining all possible positions. Microphone and sound positions have been selected according to specific criteria, such as minimum distance from walls and objects, and minimum distance between mic positions in the same environment (RT60 between 0.3 and 0.8 s). One microphone (mic A) lies in the exact selected position, and the other (mic B) is 20 cm distant towards the x dimension from mic A. Both are shown as blue and orange dots in the top-down map in Fig. 4. The two microphones are positioned at the same height of 1.6 m, which can be considered as the average ear height of a standing person. The capsules of both mics have the same orientation.
In every room, the speaker placement is performed according to five concentric cylinders centered in mic A, where the single positions are defined following a grid that guarantees a minimum Euclidean distance of 50 cm between two sound sources placed at the same height. The radius of the cylinders ranges from 1 m to 3 m with a 50 cm step, and all have 6 position layers in the height dimension at 0.4 m, 0.8 m, 1.2 m, 1.6 m, 2 m, 2.4 m from the floor, as shown in Fig. 5.
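As a rough illustration of how such spatial scenes can be assembled from the impulse responses described above (a minimal sketch, not the official L3DAS23 generation pipeline), the snippet below convolves mono dry signals with 4-channel first-order Ambisonics impulse responses and mixes speech and noise at a target SNR; the function names and the SNR convention are our own assumptions.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(dry, foa_ir):
    """Convolve a mono dry signal with a 4-channel first-order Ambisonics
    impulse response (foa_ir shape: [4, ir_len]) to get a B-format signal."""
    return np.stack([fftconvolve(dry, foa_ir[ch]) for ch in range(4)])

def mix_scene(speech, speech_ir, noises, noise_irs, snr_db=6.0):
    """Toy scene generator: spatialized speech plus spatialized noise events,
    with the noise scaled to reach a target speech-to-noise ratio in dB.
    Illustrative convention only."""
    target = spatialize(speech, speech_ir)
    noise = np.zeros_like(target)
    for dry, ir in zip(noises, noise_irs):
        n = spatialize(dry, ir)
        m = min(noise.shape[1], n.shape[1])
        noise[:, :m] += n[:, :m]
    # gain such that 10*log10(P_target / (gain^2 * P_noise)) == snr_db
    gain = np.sqrt((target ** 2).mean() /
                   ((noise ** 2).mean() * 10 ** (snr_db / 10) + 1e-12))
    return target + gain * noise
```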
FIGURE 4. Top-down map showing mic A (blue dot) and mic B (orange dot). Microphones can only be placed in the gray area (i.e., the area where no obstacles are located, namely the navigable area). On the contrary, sounds can also be placed outside the gray area, as long as they do not collide with objects and remain within the perimeter of the environment.
FIGURE 5. All the concentric cylinders. The partially visible red and blue dots represent mic A and mic B.
FIGURE 6. Accepted source positions in green, discarded positions in red for one height level.
FIGURE 7. (ρ, θ, z) for one speaker position (black dot). Mic A is represented as an orange dot.

A sound can therefore be reproduced in a room in any of the 700+ available positions (300 k+ total positions in the selected environments), from which all positions that collide with objects or exceed the room space must be subtracted. Fig. 6 shows an example of source positioning at 0.4 m above the floor: the green dots represent accepted positions (in this case, all positions within the room that do not collide with a sofa and two armchairs), while the red dots show discarded positions. No constraint is placed on the need to have the sound source in the microphone's view (and thus a direct sound). A sound could then be placed behind an obstacle (such as a column in the center of a room). SoundSpaces natively supports all these scenarios as it propagates sounds according to a bidirectional path tracing algorithm. Therefore, sound sources in SoundSpaces 2.0 are to be considered omnidirectional, meaning that sound can propagate in all directions.
Each speaker position is identified in cylindrical coordinates w.r.t. microphone A by a tuple (ρ, θ, z), where ρ is in the range [1.0, 3.0] (with a 0.5 step) and z in [−1.2, +0.8] (with a 0.4 step). θ is in the range [0°, 360°), with a step that depends on the value of ρ and is chosen so as to satisfy the minimum Euclidean distance between sound sources (θ = 0° for frontal sounds). All labels are consistent with this notation; elevation and azimuth or Euclidean coordinates are, however, easily obtainable.
Fig. 7 visually represents the tuple (ρ, θ, z). The orange dot in the picture is mic A and the black dot is a speaker placed on one of the concentric cylinders. ρ represents the distance of a sound source from mic A, θ is the angle from the y-axis, and z is the height relative to mic A. Mic B is on the x-axis and thus in position (0.2, 0, 0) of a local coordinate system. Being frontal to the hearer, sounds placed on the y-axis have θ = 0 in the dataset; θ is therefore calculated with respect to the y-axis to comply with this principle. This has a direct impact on the way in which it is possible to switch from one notation to another.
The dataset is divided into two main sections, respectively dedicated to the challenge tasks. We provide normalized raw waveforms of all Ambisonics channels (8 signals in total) as predictor data for both sections, while the target data varies significantly. For 3DSE, the corresponding dataset is composed of 16 kHz 16-bit AmbiX wav files. For 3DSELD, the audio files of the dataset are 32 kHz 16-bit AmbiX wav files.
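Under the convention just described (θ measured from the y-axis, mic A at the origin, mic B on the x-axis), switching from the (ρ, θ, z) labels to Cartesian or azimuth/elevation coordinates can be done as in the following sketch. The rotation direction of θ and the helper names are assumptions for illustration and should be checked against the dataset documentation.

```python
import numpy as np

def cylindrical_to_cartesian(rho, theta_deg, z):
    """(rho, theta, z) -> (x, y, z) centred on mic A. Theta is measured from
    the y-axis (theta = 0 is frontal), so x = rho*sin(theta), y = rho*cos(theta).
    The rotation direction is an assumption made for this example."""
    theta = np.deg2rad(theta_deg)
    return rho * np.sin(theta), rho * np.cos(theta), z

def cartesian_to_az_el(x, y, z):
    """Azimuth/elevation in degrees w.r.t. mic A, with azimuth measured from
    the frontal (y-axis) direction."""
    azimuth = np.rad2deg(np.arctan2(x, y))
    elevation = np.rad2deg(np.arctan2(z, np.hypot(x, y)))
    return azimuth, elevation

def min_theta_step_deg(rho, min_dist=0.5):
    """Smallest angular step keeping two neighbouring positions on the same
    cylinder and height at least min_dist apart (chord = 2*rho*sin(step/2))."""
    return np.rad2deg(2.0 * np.arcsin(min_dist / (2.0 * rho)))

# A frontal source 2 m away at mic height:
x, y, z = cylindrical_to_cartesian(2.0, 0.0, 0.0)   # -> (0.0, 2.0, 0.0)
print(cartesian_to_az_el(x, y, z))                  # -> (0.0, 0.0)
print(round(min_theta_step_deg(1.0), 1))            # -> ~29.0 degrees
```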
by the encoder part of the U-Net. The visual features allow a sensible decrease in the number of epochs required to achieve results comparable to those of the audio-only track.

C. 3DSELD BASELINE
For Task 2 (3DSELD), instead, we used a variant of the SELDnet architecture [3], with small changes with respect to the one used in the L3DAS22 Challenge. We ported the original Keras implementation to PyTorch and modified its structure to make it compatible with the L3DAS23 dataset. The objective of this network is to output a continuous estimation (within a fixed temporal grid) of the sounds present in the environment and their respective location. The original SELDnet architecture is conceived for processing sound spectrograms (including both magnitude and phase information) and uses a convolutional-recurrent feature extractor based on 3 convolution layers followed by a bidirectional GRU layer. In the end, the network is split into two separate branches that predict the detection (which classes are active) and location (where the sounds are) information for each target time step.
We augmented the capacity of the network by increasing the number of channels and layers, while maintaining the original data flow. Moreover, we discard the phase information and we perform max-pooling on both the time and the frequency dimensions, as opposed to the original implementation, where only frequency-wise max-pooling is performed. In addition, we added the ability to detect multiple sound sources of the same class that may be active at the same time (3 at maximum in our case). To obtain this behavior, we tripled the size of the network's output matrix, in order to predict separate location and detection information for all possible simultaneous sounds of the same class.
This network obtains a baseline test F-score of 0.147, with a precision of 0.176 and a recall of 0.126. We adapted this model to the audio-visual task by using a CNN-based extension whose output features are concatenated to those of our augmented 3DSELD just before passing them to the two separate branches. This simple change resulted in a 7% improvement in the F-score (0.158), with a precision of 0.182 and a recall of 0.140.
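The fusion strategy described above can be sketched as follows (a minimal PyTorch-style illustration, not the released baseline code): image features from a small CNN are broadcast over time and concatenated with the audio feature sequence right before the detection and localization branches, whose output sizes account for up to three simultaneous sources per class. All layer sizes, dimensions, and names are illustrative assumptions.

```python
import torch
import torch.nn as nn

class AudioVisualSELDHead(nn.Module):
    """Minimal sketch of the audio-visual fusion idea: image features are
    concatenated with per-frame audio features before the two output branches.
    Not the official L3DAS23 baseline implementation."""

    def __init__(self, audio_dim=256, image_dim=128, n_classes=14, max_overlap=3):
        super().__init__()
        # Small CNN producing a single embedding per input image.
        self.image_cnn = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, image_dim),
        )
        fused = audio_dim + image_dim
        # Detection: one activity value per (class, overlapping instance).
        self.sed_branch = nn.Linear(fused, n_classes * max_overlap)
        # Localization: one (x, y, z) triplet per (class, overlapping instance).
        self.doa_branch = nn.Linear(fused, n_classes * max_overlap * 3)

    def forward(self, audio_feats, image):
        # audio_feats: (batch, time, audio_dim); image: (batch, 3, H, W)
        img = self.image_cnn(image)                         # (batch, image_dim)
        img = img.unsqueeze(1).expand(-1, audio_feats.size(1), -1)
        fused = torch.cat([audio_feats, img], dim=-1)       # (batch, time, fused)
        return torch.sigmoid(self.sed_branch(fused)), torch.tanh(self.doa_branch(fused))
```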
V. CHALLENGE RESULTS
Among the challenge participants, those who presented models capable of beating the proposed baselines were SEU Speech, JLESS, and CCA Speech for the 3DSE, and JLESS and NERCSLIP-USTC for the 3DSELD. The fifth best-performing team, although below the baseline, is SpeechLab410 with a model for 3DSE. The main contributions of these teams are briefly summarised below:
1) SEU Speech proposed a dual-path convolutional recurrent network with group attention for 3DSE [33]. The model is structured as a convolutional encoder-decoder with frequency-time blocks based on group attention introduced in the middle. The encoder extracts the local representation from the spectrogram, the correlation between frequency and time axes is captured through groups of time-frequency processing modules, and the key information in the feature flow is extracted by the group attention.
2) JLESS team proposed a two-stage system based on DPRNN and U-Net for the 3DSE task and a Conformer-based system for the 3DSELD task [34]. This is the only team to have participated in both tasks of the challenge and also to have developed a model for the audio-visual track as part of the 3DSELD. In the two-stage U-Net for the audio-only 3DSE, the amplitude of the STFT is fed into the network for estimating the mask, and the phase of the mixed signal is used for speech reconstruction. They add 4 DPRNN modules between the encoder and decoder of the U-Net for transient modeling and extraction of dynamic voice information. The STFTs of the multi-channel speech signals are first fed into the U-Net with DPRNN, and the estimated STFT is formed using beamforming. Then, the estimated STFT is sent into the second U-Net, without DPRNN, for estimating a finer mask. A sigmoid activation is used for the mask of the output layer; after that, the masked estimation results of the first stage are combined with those of the second stage through a residual connection. Regarding the Conformer-based SELD system, the log-Mel and intensity vectors are calculated for both mic-A and mic-B audio signals. Then, the time difference of arrival (TDOA) between the two mics is computed using kernel density estimator (KDE) theory [35]. For the visual signal, images are resized to 224 × 224 px and normalized for fine-tuning the pretrained model. A Res-Conformer-based SELD model is adapted to the audio-visual scenario. Audio features are fed into four residual convolution blocks followed by two Conformer encoder blocks. Images are fed into a ResNet18 with pre-trained weights. The image embedding is then concatenated with the audio features before the last output layer. The authors applied some data augmentation methods, such as cutout, frequency shift, time shift, mixing, and brightness, hue, saturation, and contrast jitter.
3) CCA Speech team developed a stream attention-based U-Net to remove background noise and reverberation for 3DSE [36]. Their model consists of three parts: an encoder, a decoder, and a channel fusion module. They proposed a stream attention mechanism to fuse the various channels in order to fully exploit the inter-channel information, and this is also done in the encoder stage. Key, query, and value are generated by three convolutional networks. A softmax function is applied to the last dimension of the product of the key and query. The decoder part is composed of only convolutional blocks, while an LSTM block is used in the encoder part.
4) NERCSLIP-USTC proposed a method based on combinations of ResNet and Conformer architectures to model both local and global patterns [37]. ResNet blocks are used to extract high-dimension feature
REFERENCES
[6] E. Guizzo et al., "L3DAS22 challenge: Learning 3D audio sources in a real office environment," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9186–9190.
[7] Y.-J. Lu et al., "Towards low-distortion multi-channel speech enhancement: The ESPNET-SE submission to the L3DAS22 challenge," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9201–9205.
[8] J. Hu et al., "A track-wise ensemble event independent network for polyphonic sound event localization and detection," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9196–9200.
[9] C. Chen et al., "SoundSpaces 2.0: A simulation platform for visual-acoustic learning," Neural Inf. Process. Syst. Datasets Benchmarks Track, 2022.
[10] H. Zhu, M. Luo, R. Wang, A. Zheng, and R. He, "Deep audio-visual learning: A survey," Int. J. Automat. Comput., vol. 18, pp. 351–376, 2020.
[11] M. A. Gerzon, "The design of precisely coincident microphone arrays for stereo and surround sound," J. Audio Eng. Soc. Conv., 1975.
[12] D. Wang and J. Chen, "Supervised speech separation based on deep learning: An overview," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[13] H. R. Guimarães, H. Nagano, and D. W. Silva, "Monaural speech enhancement through deep Wave-U-Net," Expert Syst. Appl., vol. 158, 2020, Art. no. 113582.
[14] C. Macartney and T. Weyde, "Improved speech enhancement with the Wave-U-Net," 2018, arXiv:1811.11307.
[15] A. Bosca, A. Guérin, L. Perotin, and S. Kitic, "Dilated U-net based approach for multichannel speech enhancement from first-order Ambisonics recordings," in Proc. 28th Eur. Signal Process. Conf., 2021, pp. 216–220.
[16] P.-S. Huang, M. Kim, M. A. Hasegawa-Johnson, and P. Smaragdis, "Deep learning for monaural speech separation," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 1562–1566.
[17] X. Yan, Z. Yang, T. Wang, and H. Guo, "An iterative graph spectral subtraction method for speech enhancement," Speech Commun., vol. 123, pp. 35–42, 2020.
[18] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, "Discriminative learning for monaural speech separation using deep embedding features," in Proc. Interspeech, 2019.
[19] Y. Luo and N. Mesgarani, "Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[20] O. Yazdanbakhsh and S. Dick, "Multivariate time series classification using dilated convolutional neural network," 2019, arXiv:1905.01697.
[21] A. Politis, S. Adavanne, and T. Virtanen, "A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection," in Proc. Workshop Detection Classification Acoustic Scenes Events, 2020.
[22] K. Guirguis, C. Schorn, A. Guntoro, S. Abdulatif, and B. Yang, "SELD-TCN: Sound event localization & detection via temporal convolutional networks," in Proc. 28th Eur. Signal Process. Conf., 2020, pp. 16–20.
[23] S. P. Chytas and G. Potamianos, "Hierarchical detection of sound events and their localization using convolutional neural networks with adaptive thresholds," in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[24] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, "Polyphonic sound event detection and localization using a two-stage strategy," in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[25] L. Mazzon, Y. Koizumi, M. Yasuda, and N. Harada, "First order Ambisonics domain spatial augmentation for DNN-based direction of arrival estimation," in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[26] P. Pratik, W. J. Jee, S. Nagisetty, R. Mars, and C. S. Lim, "Sound event localization and detection using CRNN architecture with mixup for model generalization," in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[27] S. K. Ramakrishnan et al., "Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI," Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[28] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, "FSD50K: An open dataset of human-labeled sound events," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 5206–5210.
[30] A. Baevski, H. Zhou, A. R. Mohamed, and M. Auli, "wav2vec 2.0: A framework for self-supervised learning of speech representations," in Proc. Adv. Neural Inf. Process. Syst., 2020.
[31] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, "Joint measurement of localization and detection of sound events," in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2019, pp. 333–337.
[32] O. Ronneberger, P. Fischer, and T. Brox, "U-Net: Convolutional networks for biomedical image segmentation," in Proc. Int. Conf. Med. Image Comput. Comput.-Assist. Intervention, 2015.
[33] J. Cheng, C. Pang, R. Liang, J. Fan, and L. Zhao, "Dual-path dilated convolutional recurrent network with group attention for multi-channel speech enhancement," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[34] J. Bai, S. W. Huang, H. Yin, Y. Jia, M. Wang, and J. Chen, "3D audio signal processing systems for speech enhancement and sound localization and detection," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[35] V. V. Reddy, A. W. H. Khong, and B. P. Ng, "Unambiguous speech DOA estimation under spatial aliasing conditions," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2133–2145, Dec. 2014.
[36] H. Wang, Y. Fu, J. Li, M. Ge, L. Wang, and X. Qian, "Stream attention based U-Net for L3DAS23 challenge," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[37] H. Yan, H. Xu, Q. Wang, and J. Zhang, "The NERCSLIP-USTC system for the L3DAS23 challenge Task2: 3D sound event localization and detection (SELD)," in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
Open Access funding provided by Università degli Studi di Roma "La Sapienza" within the CRUI CARE Agreement.