
ICASSP 2023 Signal Processing Grand Challenges

Received 8 August 2023; revised 2 November 2023; accepted 3 November 2023. Date of publication 12 March 2024;
date of current version 18 June 2024. The review of this article was arranged by Associate Editor R. Serizel.
Digital Object Identifier 10.1109/OJSP.2024.3376297

L3DAS23: Learning 3D Audio Sources for Audio-Visual Extended Reality
RICCARDO F. GRAMACCIONI¹ (Graduate Student Member, IEEE), CHRISTIAN MARINONI¹ (Graduate Student Member, IEEE), CHANGAN CHEN², AURELIO UNCINI¹ (Senior Member, IEEE), AND DANILO COMMINIELLO¹ (Senior Member, IEEE)
¹Sapienza University of Rome, 00185 Roma, Italy
²UT Austin, Austin, TX 78712 USA
CORRESPONDING AUTHORS: RICCARDO F. GRAMACCIONI; CHRISTIAN MARINONI (e-mail: [email protected], [email protected]).
The work of Riccardo F. Gramaccioni and Christian Marinoni was supported by the European Union through the Italian National Recovery and Resilience Plan (NRRP) of NextGenerationEU, partnership on the National Centre for HPC, Big Data and Quantum Computing (CN00000013 - Spoke 6: Multiscale Modelling & Engineering Applications). The work of Aurelio Uncini was supported by the European Union under the Italian National Recovery and Resilience Plan (NRRP) of NextGenerationEU, partnership on Future Artificial Intelligence Research (PE0000013 - FAIR - Spoke 5: High Quality AI). The work of Danilo Comminiello was supported by Progetti di Ricerca Medi of Sapienza University of Rome for the project SAID: Solving Audio Inverse problems with diffusion models, under Grant RM123188F75F8072.

We support the dataset download and the use of the baseline models via extensive instructions provided on the official GitHub repository at https://github.com/l3das/L3DAS23. For more comprehensive information and in-depth details about the challenge, we invite the reader to visit the L3DAS Project website at http://www.l3das.com/icassp2023.

ABSTRACT The primary goal of the L3DAS (Learning 3D Audio Sources) project is to stimulate and
support collaborative research studies concerning machine learning techniques applied to 3D audio signal
processing. To this end, the L3DAS23 Challenge, presented at IEEE ICASSP 2023, focuses on two spatial
audio tasks of paramount interest for practical uses: 3D speech enhancement (3DSE) and 3D sound event
localization and detection (3DSELD). Both tasks are evaluated within augmented reality applications. The
aim of this paper is to describe the main results obtained from this challenge. We provide the L3DAS23
dataset, which comprises a collection of first-order Ambisonics recordings in reverberant simulated envi-
ronments. Indeed, we maintain some general characteristics of the previous L3DAS challenges, featuring
a pair of first-order Ambisonics microphones to capture the audio signals and involving multiple-source
and multiple-perspective Ambisonics recordings. However, in this new edition, we introduce audio-visual
scenarios by including images that depict the frontal view of the environments as captured from the
perspective of the microphones. This addition aims to enrich the challenge experience, giving participants
tools for exploring a combination of audio and images for solving the 3DSE and 3DSELD tasks. In addition
to a brand-new dataset, we provide updated baseline models designed to take advantage of audio-image pairs.
To ensure accessibility and reproducibility, we also supply a supporting API for an effortless replication of our
results. Lastly, we present the results achieved by the participants of the L3DAS23 Challenge.

INDEX TERMS 3D audio, ambisonics, data challenge, sound event localization and detection, speech
enhancement.

I. INTRODUCTION

Nowadays, 3D immersive audio is becoming a widespread reality thanks to new emerging technologies and commercial devices. The use of spatial audio can benefit a multitude of applications, including virtual and real conferencing, game development, music production, augmented reality and immersive technologies in virtual environments, speech communication, home assistants, multimedia services, audio surveillance in public spaces, and various other potential domains.

The widespread adoption of 3D audio has not only brought practical benefits but has also fostered intriguing scientific advancements, particularly regarding deep learning methodologies for audio signal processing.

However, the development of efficient deep learning algorithms necessitates a substantial amount of data, which may not always be accessible for 3D audio applications. Recognizing this limitation, the L3DAS (Learning 3D Audio Sources) project aims to fill this gap by facilitating the availability of 3D audio datasets. Thus, the primary goal of the project is to encourage the rise of novel deep learning techniques for spatial audio applications.
In the first edition of this project, L3DAS21 [1], we proposed a novel multi-channel audio configuration based on multiple-source, multiple-perspective (MSMP) Ambisonics recordings made with an array of two first-order Ambisonics microphones. As far as we know, that was the first time that a two-microphone Ambisonics configuration had been used for the tasks of 3D sound event localization and detection (3DSELD) and 3D speech enhancement (3DSE). The baselines used for the 2021 challenge were FaSNet [2] for 3DSE and SELDnet [3] for 3DSELD. Recordings were made in an office room with approximate dimensions of 6 × 5 × 3 m. We placed two first-order A-format Ambisonics microphones in the center of the room and we moved a speaker reproducing an analytic signal in 252 fixed spatial positions. The resulting L3DAS21 dataset contains approximately 65 hours of MSMP B-format Ambisonics recordings. The winning team for 3DSE of the L3DAS21 Challenge was the 1024 k Team with the work presented in [4], while the work with the best results [5] for 3DSELD was submitted by the EPUSPL Team. Detailed information can be found on the L3DAS project website for the 2021 edition.¹

For the second edition of this project, L3DAS22 [6], we maintained a similar setting to that proposed in L3DAS21 but with some substantial improvements. Firstly, we generated a new dataset containing an augmented number of datapoints, increasing the total length of the dataset from 65 to more than 94 hours. Then, we modified the dataset synthesis pipeline in order to promote less resource-demanding training and facilitate both tasks. In addition, we updated the baseline for 3DSE, using a beamforming U-Net architecture [4], which provided the best metrics for the L3DAS21 Challenge on the 3DSE task. This network uses a convolutional U-Net to estimate B-format beamforming filters. The winning teams for the L3DAS22 edition of the Challenge were ESP-SE [7] for 3DSE and Lab9 DSP411 [8] for 3DSELD. Further information can be found on the L3DAS website.²

¹[Online]. Available: https://www.l3das.com/mlsp2021/results.html
²[Online]. Available: https://www.l3das.com/icassp2022/results

Our latest edition of the L3DAS project, presented as a Signal Processing Grand Challenge at IEEE ICASSP 2023, is strongly inspired by the growing interest in augmented and virtual reality (AR & VR). In this context, enhancing speech and localizing sound events can be fundamental to ensure credible and safe experiences. The L3DAS23 Challenge, described in this paper, uses SoundSpaces 2.0 [9] to integrate 3D sound sources captured by first-order Ambisonics microphones and extended reality environments.

Moreover, it introduces two substantial evolutions from previous versions: a) the 3D audio recordings were not made in a physical location, but rather in 68 distinct simulated environments; b) an additional track was introduced taking into account the multimodal scenario, where information from RGB images of the simulated acoustic environments can be added to the audio recordings. As a result, participants had the choice of either submitting results using only the information from the 3D audio recordings (named the audio-only track) or taking up the audio-visual track, in which both the audio recordings and images of the environments were made available to them. This choice was made because visual information proved to enhance the performance of deep learning models [10] and we believe it can also improve the results in the proposed 3DSE and 3DSELD tasks. The decision to participate in the audio-only track or the audio-visual one was left to the participants. The use of a very large number of simulated acoustic environments allowed us to extend the total duration of the dataset to approximately 100 hours. We supply baseline models, 3D audio datasets for each task, and a Python-based API that facilitates the data download and preprocessing, the training of the baseline models and the submission of results.

II. BACKGROUND

A. AMBISONICS MICROPHONES
Ambisonics is a multi-channel audio technology first introduced by M. A. Gerzon in [11]. This technology allows audio signals to be recorded, encoded and reproduced while fully preserving their spatial information. In fact, through such technology, the codification of the sound field also includes its directional characterization. Ambisonics is still one of the most complete microphone and sound reproduction systems available, since it captures all spatial information of a sound source and permits multiple possible decodings of the signal based on the number of loudspeakers used during reproduction. Reproduction is also stereo compatible, since it can be performed with 4, 3 or 2 loudspeakers depending on specific needs. The microphones used in the L3DAS project are first-order Ambisonics microphones, which have four channels. Three of these channels correspond to three figure-of-eight capsules oriented along the three orthogonal Cartesian axes X, Y and Z; a fourth channel (W) is associated with an omnidirectional microphone that assigns equal gain to all directions. The set of four signals recorded by the capsules is called A-format. Once processed and mixed together, they form the Ambisonics signal defined as B-format, whose channels are denoted W, X, Y, Z. The polar diagram of a first-order Ambisonics microphone is shown in Fig. 1.

FIGURE 1. Polar diagram of the four microphones W, X, Y, Z of a first-order Ambisonics microphone in the 3D space.

Through such a structure, Ambisonics microphones allow sounds to be represented as spherical harmonics, enabling a spatially coherent representation. For this reason, this type of signal is extremely beneficial for tasks such as 3DSE and 3DSELD, where the objective is to extract or recognize sound sources placed in noisy environments.
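To make the A-format/B-format relation above concrete, the following sketch converts the four capsule signals of an idealized tetrahedral first-order microphone into the W, X, Y, Z components. The capsule ordering (FLU, FRD, BLD, BRU), the absence of capsule equalization and the lack of any normalization convention are simplifying assumptions and not details of the L3DAS23 recording chain.

```python
import numpy as np

def a_to_b_format(flu, frd, bld, bru):
    """Convert tetrahedral A-format capsule signals to first-order B-format (WXYZ).

    flu, frd, bld, bru: 1-D arrays holding the front-left-up, front-right-down,
    back-left-down and back-right-up capsule signals (idealized, unequalized).
    Returns an array of shape (4, num_samples) ordered as W, X, Y, Z.
    """
    w = flu + frd + bld + bru   # omnidirectional component
    x = flu + frd - bld - bru   # front/back figure-of-eight
    y = flu - frd + bld - bru   # left/right figure-of-eight
    z = flu - frd - bld + bru   # up/down figure-of-eight
    return np.stack([w, x, y, z])
```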

B. 3D SPEECH ENHANCEMENT
Let us consider a target signal of interest reproduced simultaneously with other sound sources in the same environment. Its rendering will probably be unintelligible. The case in which the target signal immersed in a noisy environment is a speech signal is referred to as a cocktail party problem. The objective of speech enhancement methods is precisely to extract the target speech signal from a sound mixture composed of ambient sounds and speech sounds of other speakers and make it intelligible. In the case of 3D speech enhancement (3DSE), the aim is to use the additional spatial information captured by the Ambisonics microphone in order to perform a more precise extraction of the target signal from the sound mixture. More formally, the 3DSE task can be explained as follows: let us consider the corrupted Ambisonics signals $x_p(n) = \sum_{i=0}^{M-1} h_{p,i}(n)\,s(n-i) + \epsilon(n)$, resulting from a clean speech signal $s(n)$ affected by an acoustic impulse response $h_p(n) \in \mathbb{R}^M$ of $M$ coefficients, where $p \in \{W, X, Y, Z\}$ indexes the four channels of the Ambisonics microphone, and corrupted by additive noise $\epsilon(n)$. We seek a mapping function $f(\cdot)$ that is able to estimate the target speech signal $s(n)$ from $x_p(n)$, i.e., $\hat{s}(n) = f(x_W(n), x_X(n), x_Y(n), x_Z(n))$. In L3DAS23, the additional information, provided by an RGB image of the environment in which the sound mixture is reproduced, can be used as input to the deep learning models developed by the challenge participants. Such a situation is schematized in Fig. 2.

FIGURE 2. Schematic overview of the 3D speech enhancement task. Ambisonics microphones record the target speech signal along with other noisy sources in the environment. The 3DSE model recovers this target speech signal from the noisy mixture and produces a clean monaural speech signal.

One commonly employed strategy for conducting speech enhancement involves utilizing deep neural networks (DNNs) to estimate a time-frequency mask in the Fourier domain. This mask is designed to isolate clean speech signals from noisy spectra [12]. Cutting-edge results in Ambisonics-based 3DSE can be achieved through neural beamforming techniques such as Filter and Sum Networks (FaSNet), which are particularly well suited for low-latency scenarios. Additionally, U-Net-based approaches demonstrate competitive outcomes in both monaural [13], [14] and multichannel SE tasks [15], albeit with increased computational requirements. Alternative techniques for the SE task include recurrent neural networks (RNNs) [16], graph-based spectral subtraction [17], discriminative learning [18] and dilated convolutions [19], [20]. For the 3DSE task, we use a beamforming U-Net architecture, which provided the best metrics for L3DAS21 on the 3DSE task. In L3DAS23, we consider a monaural speech signal as output.

C. 3D SOUND EVENT LOCALIZATION AND DETECTION
For the 3D Sound Event Localization and Detection (3DSELD) task, the setting is very similar to that of 3DSE: some target sounds - in this case, not necessarily speech signals - are played in a noisy environment in which other sound sources may be active simultaneously with the target sources. The goal here is to recognize the target source in the sound mixture and to be able to detect when and where it is active. In other words, in addition to correctly labeling the target sound, a model for 3DSELD must be able to provide temporal and spatial information about its specific source.

Modern deep learning methods have proven to solve this task efficiently [21]. The SELDnet [3] used as a baseline for L3DAS21 and taken up in later editions of the project is based on a convolutional-recurrent design with two distinct branches for localization and detection. An alternative version based on time convolutions has been proposed in [22]. Other solutions for this task include ensemble models [23], multi-stage training [24] and bespoke augmentation strategies [25], [26].

As a baseline for the current version of the L3DAS project, we used a variant of the SELDnet architecture, with small changes. We ported the original Keras implementation to PyTorch and modified its structure to make it compatible with the L3DAS23 dataset. To achieve more consistent spatio-temporal descriptions of the 3D acoustic scene, we modified this network so that it could accept as additional input an RGB image of the virtual environment in which the sounds are reproduced, similar to what was done for 3DSE. This situation is illustrated in Fig. 3.

FIGURE 3. Schematic overview of the 3D sound event localization and detection task. Ambisonics microphones record the sound mixture of the acoustic environment and the 3DSELD model must be able to estimate the labels of the active sound sources in each time interval for the detection and their DOA to localize them: in this example, there are two active sound sources and they are identified by the labels telephone and printer.

III. L3DAS23 DATASET

A. GENERAL DESCRIPTION
Each of the two tasks is supported by an appropriate dataset. The L3DAS23 datasets contain multiple-source and multiple-perspective B-format Ambisonics audio recordings. We sampled the acoustic field of multiple simulated environments, placing two first-order Ambisonics microphones in random points of the rooms and capturing up to 737 room impulse responses in each one. The datasets also contain multiple RGB pictures showing the frontal view from the main microphone. We aimed at creating plausible and variegated 3D scenarios to reflect possible real-life situations in which sound and disparate types of background noise coexist in the same 3D reverberant environment.

The datasets of both Task 1 (3DSE) and Task 2 (3DSELD) share a common basis: the techniques adopted for generating them. Indeed, we used SoundSpaces 2.0 [9] to generate Ambisonics Room Impulse Responses (ARIRs) and images in a selection of simulated 3D houses from the Habitat-Matterport 3D Research Dataset [27]. Each simulated environment has a different size and shape, and includes multiple objects and surfaces characterized by specific acoustic properties (i.e., absorption, scattering, transmission, damping).
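The ARIRs play the role of the impulse responses $h_p(n)$ in the signal model of Section II-B: a spatial scene is obtained by convolving a clean mono source with a four-channel ARIR and adding noise or other spatialized sources. A minimal sketch of this rendering step follows; the function name and array shapes are illustrative and are not taken from the official L3DAS23 API.

```python
import numpy as np
from scipy.signal import fftconvolve

def spatialize(source, arir, noise=None):
    """Render one B-format scene from a mono source and a 4-channel ARIR.

    source: (num_samples,) clean signal s(n)
    arir:   (4, rir_len) Ambisonics room impulse response, channels W, X, Y, Z
    noise:  optional (4, num_samples) additive noise term
    Returns the corrupted B-format mixture x_p(n) with shape (4, num_samples).
    """
    wet = np.stack(
        [fftconvolve(source, arir[p])[: source.shape[0]] for p in range(4)]
    )
    return wet if noise is None else wet + noise
```

Scenes with several simultaneous events then follow by summing such renderings, each produced with its own ARIR.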


For the 3DSE task, the computed ARIRs are convolved with clean sound samples belonging to distinct sound classes to generate the spatial sound scenes. The noise sound event dataset we used for Task 1 is the well-known FSD50K dataset [28]. In particular, we have selected 12 transient classes, representative of the noise sounds that can be heard in an office: computer keyboard, drawer open/close, cupboard open/close, finger-snapping, keys jangling, knock, laughter, scissors, telephone, writing, chink and clink, printer, and 4 continuous noise classes: alarm, crackle, mechanical fan and microwave oven. Furthermore, we extracted clean speech signals (without background noise) from Librispeech [29], selecting only sound files up to 12 seconds.

For the 3DSELD task, the measured ARIRs are convolved with clean sound samples belonging to distinct sound classes. Sound events for Task 2 are taken again from the FSD50K dataset [28]. We have selected 14 classes, representative of the sounds that can be heard in an office: the 12 classes already used for 3DSE, plus female speech and male speech.

B. RECORDING PROCEDURE
We placed two Ambisonics microphones in 443 random positions of 68 houses and generated B-format ACN/SN3D impulse responses of the rooms by placing the sound sources in random locations of a cylindrical grid defining all possible positions. Microphone and sound positions have been selected according to specific criteria, such as a minimum distance from walls and objects, and a minimum distance between mic positions in the same environment (RT60 between 0.3 and 0.8 s). One microphone (mic A) lies in the exact selected position, and the other (mic B) is 20 cm away from mic A along the x dimension. Both are shown as blue and orange dots in the topdown map in Fig. 4. The two microphones are positioned at the same height of 1.6 m, which can be considered as the average ear height of a standing person. The capsules of both mics have the same orientation.

In every room, the speaker placement is performed according to five concentric cylinders centered in mic A, where the single positions are defined following a grid that guarantees a minimum Euclidean distance of 50 cm between two sound sources placed at the same height. The radius of the cylinders ranges from 1 m to 3 m with a 50 cm step, and all have 6 position layers in the height dimension at 0.4 m, 0.8 m, 1.2 m, 1.6 m, 2 m, 2.4 m from the floor, as shown in Fig. 5.
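The sketch below enumerates candidate positions on such a grid, choosing for each cylinder the densest set of equally spaced angles whose chord spacing still respects the 50 cm minimum. The exact angular steps used to generate L3DAS23 are not stated here, so this is only an illustrative reconstruction.

```python
import numpy as np

RADII = np.arange(1.0, 3.01, 0.5)         # five concentric cylinders (m)
HEIGHTS = [0.4, 0.8, 1.2, 1.6, 2.0, 2.4]  # position layers above the floor (m)
MIN_DIST = 0.5                            # minimum spacing at equal height (m)

def candidate_positions():
    """Yield (rho, theta_deg, z) tuples on the concentric-cylinder grid."""
    for rho in RADII:
        # Smallest angular step whose chord 2*rho*sin(step/2) between
        # neighbouring positions is at least MIN_DIST, then fit as many
        # equally spaced angles as possible around the full circle.
        min_step = 2.0 * np.arcsin(MIN_DIST / (2.0 * rho))   # radians
        n_angles = int(np.floor(2.0 * np.pi / min_step))
        step_deg = 360.0 / n_angles
        for k in range(n_angles):
            for z in HEIGHTS:
                yield float(rho), k * step_deg, z
```

For example, at ρ = 1 m this yields 12 angles (a 30° step), while larger rings admit proportionally more positions.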

FIGURE 4. Topdown map showing mic A (blue dot) and mic B (orange dot). Microphones can only be placed in the gray area (i.e., the area where no obstacles are located, namely the navigable area). On the contrary, sounds can be placed also outside the gray area, as long as they do not collide with objects and remain within the perimeter of the environment.

FIGURE 5. All the concentric cylinders. The partially visible red and blue dots represent mic A and B.

A sound can therefore be reproduced in a room in any of the 700+ available positions (300k+ total positions in the selected environments), from which should be subtracted all those positions that collide with objects or exceed the room space. Fig. 6 shows an example of source positioning at 0.4 m above the floor: the green dots represent accepted positions (in this case, all positions within the room that do not collide with a sofa and two armchairs), while the red dots show discarded positions. No constraint is placed on the need to have the sound source in the microphone's view (and thus a direct sound). A sound could then be placed behind an obstacle (such as a column in the center of a room). SoundSpaces natively supports all these scenarios as it propagates sounds according to a bidirectional path tracing algorithm. Therefore, sound sources in SoundSpaces 2.0 are to be considered omnidirectional, meaning that sound can propagate in all directions.

FIGURE 6. Accepted source positions in green, discarded positions in red for one height level.

Each speaker position is identified in cylindrical coordinates w.r.t. microphone A by a tuple (ρ, θ, z), where ρ is in the range [1.0, 3.0] (with a 0.5 step) and z in [−1.2, +0.8] (with a 0.4 step). θ is in the range [0°, 360°), with a step that depends on the value of ρ and is chosen so as to satisfy the minimum Euclidean distance between sound sources (θ = 0° for frontal sounds). All labels are consistent with this notation; elevation and azimuth or Euclidean coordinates are, however, easily obtainable.

FIGURE 7. (ρ, θ, z) for one speaker position (black dot). Mic A is represented as an orange dot.

Fig. 7 visually represents the tuple (ρ, θ, z). The orange dot in the picture is mic A and the black dot is a speaker placed on one of the concentric cylinders. ρ represents the distance of a sound source from mic A, θ is the angle from the y-axis, and z is the height relative to mic A. Mic B is on the x-axis and thus in position (0.2, 0, 0) of a local coordinate system. Being frontal to the hearer, sounds placed on the y-axis have θ = 0 in the dataset; θ is therefore calculated with respect to the y-axis to comply with this principle. This has a direct impact on the way in which it is possible to switch from one notation to another.
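A small helper along these lines converts a label tuple into Cartesian or azimuth/elevation form. It is only a sketch: the official API may expose equivalent utilities, and the assumption that θ grows towards the positive x-axis (the side where mic B lies) is ours, not a documented convention.

```python
import numpy as np

def cylindrical_to_cartesian(rho, theta_deg, z):
    """Map an L3DAS23 label (rho, theta, z) to Cartesian (x, y, z) w.r.t. mic A.

    theta is measured from the frontal y-axis (theta = 0 means straight ahead);
    here it is assumed to increase towards the positive x-axis, where mic B lies.
    """
    theta = np.deg2rad(theta_deg)
    x = rho * np.sin(theta)   # lateral offset
    y = rho * np.cos(theta)   # frontal distance
    return np.array([x, y, z])

def cartesian_to_az_el(x, y, z):
    """Return (azimuth, elevation) in degrees, azimuth measured from the y-axis."""
    azimuth = np.degrees(np.arctan2(x, y))
    elevation = np.degrees(np.arctan2(z, np.hypot(x, y)))
    return azimuth, elevation
```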


The dataset is divided into two main sections, respectively dedicated to the two challenge tasks. We provide normalized raw waveforms of all Ambisonics channels (8 signals in total) as predictor data for both sections, while the target data varies significantly. For 3DSE, the corresponding dataset is composed of 16 kHz 16-bit AmbiX wav files. For 3DSELD, the audio files of the dataset are 32 kHz 16-bit AmbiX wav files. Moreover, we created different types of acoustic scenarios, optimized for each specific task.

We split both dataset sections into a training set (80 hours for 3DSE and 5 hours for 3DSELD) and a test set (7 hours for 3DSE and 2.5 hours for 3DSELD), paying attention to creating similar distributions. The train set of the 3DSE section is divided into two partitions, train360 and train100, and contains speech samples extracted from the corresponding partitions of Librispeech [29] (only the samples up to 12 seconds). All sets of the 3DSELD section are divided into OV1, OV2 and OV3. These partitions refer to the maximum amount of possible overlapping sounds, which are 1, 2, or 3, respectively.

C. AUDIO-VISUAL TRACKS
In addition to the Ambisonics recordings, the dataset provides, for each microphone position in the rooms, an image of size 512 × 512 representing the environment in front of the main microphone (mic A). We derived these images by virtually placing an RGB sensor at the same height and orientation as mic A, and with a 90-degree field of view. An example is shown in Fig. 8. Since the microphone is placed in multiple different environments, the models will have to perform the tasks by adapting to different reverberation conditions.

FIGURE 8. Example of a simulated view of the environment in front of the microphone.

Both the audio-only track and the audio-visual track were composed of two subtracks, namely the 1-mic configuration and the 2-mic configuration. In fact, participants could choose to use recordings from only one microphone or from both of them.

IV. BASELINES

A. METRICS
For 3DSE, we adopted a metric, $M_{\mathrm{3DSE}}$, that combines two distinct metrics: the short-time objective intelligibility (STOI), which estimates the intelligibility of the output speech signal, and the word error rate (WER), which indicates the ratio of errors in a speech-to-text transposition, computed to assess the effects of the enhancement for speech recognition purposes. We use a Wav2Vec [30] architecture pre-trained on Librispeech 960 h to compute the WER. The final metric for this task is given by

\[ M_{\mathrm{3DSE}} = \frac{\mathrm{STOI} + (1 - \mathrm{WER})}{2}, \tag{1} \]

which lies in the [0, 1] range, where the higher the better.

For 3DSELD, we use a joint metric for localization and detection: an F-score based on location-sensitive detection [31]. The F-score combines the precision and recall of a model, where the precision is the number of true positives predicted by the model divided by the number of false positives plus true positives, and the recall is the number of true positives divided by the number of true positives plus false negatives. The F-score is given by

\[ F_1 = \frac{2 \cdot \mathrm{Precision} \cdot \mathrm{Recall}}{\mathrm{Precision} + \mathrm{Recall}} = \frac{2\,\mathrm{TP}}{2\,\mathrm{TP} + \mathrm{FP} + \mathrm{FN}}, \tag{2} \]

where TP is the number of true positives classified by the model, and FP and FN are respectively the numbers of false positives and false negatives classified by the model. This metric considers a true positive only if a sound class is correctly predicted in a temporal frame and if its predicted location lies within a Cartesian distance of at most 1.75 m from the true position.

B. 3DSE BASELINE
For Task 1 (3DSE), we use a beamforming U-Net architecture [4], which provided the best metrics for the L3DAS21 Challenge on the 3DSE task. This network uses a convolutional U-Net [32] to estimate B-format beamforming filters. It is composed of three main modules: 1) an encoder path for gradually extracting high-level features, 2) the corresponding decoder for reconstructing the original size of the input features from the output of the encoder, and 3) skip connections for concatenating each layer in the encoder with its corresponding layer in the decoder. The input of the model is the B-format audio signals, of dimension $\mathbb{R}^{C \times (T \times S)}$, where $C$ is the number of channels (equal to 4 for the 1-mic configuration and 8 for the 2-mic configuration), $T$ is the duration in seconds of the audio signal and $S$ is the sample rate. These signals are first transported into the time-frequency domain via an STFT, resulting in a representation of dimension $\mathbb{C}^{C \times (L \times F)}$, where $L = 600$ is the number of frames and $F = 256$ is the number of retained frequency bins (the first 256 bins of the complex spectrogram). The enhancement process is performed as in traditional signal beamforming: we multiply the complex spectrogram of the noisy B-format signal with the filters estimated by the U-Net, $W \in \mathbb{C}^{C \times (L \times F)}$, through element-wise multiplication, and then sum the result over the channel axis to estimate a single-channel enhanced complex spectrogram, $\hat{S} \in \mathbb{C}^{L \times F}$. In the end, the iSTFT is performed to obtain the enhanced time-domain signal.

With this baseline model, we obtained a baseline test metric for Task 1 of 0.557, with a WER of 0.57 and an STOI of 0.68.
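The filter-and-sum step at the core of this baseline can be sketched as follows; tensor shapes follow the $(C, L, F)$ convention used above, and the snippet is an illustration rather than the exact code of the official baseline.

```python
import torch

def filter_and_sum(noisy_stft: torch.Tensor, filters: torch.Tensor) -> torch.Tensor:
    """Apply estimated B-format beamforming filters to a noisy complex spectrogram.

    noisy_stft: complex tensor of shape (C, L, F), STFT of the B-format mixture
    filters:    complex tensor of shape (C, L, F), filters estimated by the U-Net
    Returns a single-channel enhanced complex spectrogram of shape (L, F).
    """
    # Element-wise multiplication followed by a sum over the channel axis,
    # as in classical filter-and-sum beamforming; the enhanced waveform is
    # then recovered with an inverse STFT of the resulting spectrogram.
    return (noisy_stft * filters).sum(dim=0)
```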


We adapted this model to the audio-visual task by using a CNN-based extension whose output features are concatenated along the filter dimension with those generated by the encoder part of the U-Net. The visual features allow a sensible decrease in the number of epochs required to achieve results comparable to those of the audio-only track.

C. 3DSELD BASELINE
For Task 2 (3DSELD), instead, we used a variant of the SELDnet architecture [3], with small changes with respect to the one used in the L3DAS22 Challenge. We ported the original Keras implementation to PyTorch and modified its structure in order to make it compatible with the L3DAS23 dataset. The objective of this network is to output a continuous estimation (within a fixed temporal grid) of the sounds present in the environment and their respective location. The original SELDnet architecture is conceived for processing sound spectrograms (including both magnitude and phase information) and uses a convolutional-recurrent feature extractor based on 3 convolution layers followed by a bidirectional GRU layer. In the end, the network is split into two separate branches that predict the detection (which classes are active) and location (where the sounds are) information for each target time step.

We augmented the capacity of the network by increasing the number of channels and layers, while maintaining the original data flow. Moreover, we discard the phase information and we perform max-pooling on both the time and the frequency dimensions, as opposed to the original implementation, where only frequency-wise max-pooling is performed. In addition, we added the ability to detect multiple sound sources of the same class that may be active at the same time (3 at maximum in our case). To obtain this behavior, we tripled the size of the network's output matrix, in order to predict separate location and detection information for all possible simultaneous sounds of the same class.

This network obtains a baseline test F-score of 0.147, with a precision of 0.176 and a recall of 0.126. We adapted this model to the audio-visual task by using a CNN-based extension whose output features are concatenated to the ones of our augmented 3DSELD just before passing them to the two separate branches. This simple change resulted in a 7% improvement in the F-score (0.158), with a precision of 0.182 and a recall of 0.140.

V. CHALLENGE RESULTS
Among the challenge participants, those who presented models capable of beating the proposed baselines were SEU Speech, JLESS and CCA Speech for 3DSE, and JLESS and NERCSLIP-USTC for 3DSELD. The fifth best-performing team, although below the baseline, is SpeechLab410 with a model for 3DSE. The main contributions of these teams are briefly summarised below:
1) SEU Speech proposed a dual-path convolutional recurrent network with group attention for 3DSE [33]. The model is structured as a convolutional encoder-decoder with frequency-time blocks based on group attention introduced in the middle. The encoder extracts the local representation from the spectrogram, the correlations between frequency and time axes are captured through groups of time-frequency processing modules, and the key information in the feature flow is extracted by the group attention.
2) The JLESS team proposed a two-stage system based on DPRNN and U-Net for the 3DSE task and a Conformer-based system for the 3DSELD task [34]. This is the only team to have participated in both tasks of the challenge and also to have developed a model for the audio-visual track as part of the 3DSELD. In the two-stage U-Net for the audio-only 3DSE, the amplitude of the STFT is fed into the network for estimating the mask, and the phase of the mixed signal is used for speech reconstruction. They add 4 DPRNN modules between the encoder and decoder of the U-Net for transient modeling and extraction of dynamic voice information. The STFTs of the multi-channel speech signals are first fed into the U-Net with DPRNN, and the estimated STFT is formed using beamforming. Then, the estimated STFT is sent into the second U-Net, without DPRNN, for estimating a finer mask. A sigmoid is used to activate the mask of the output layer; after that, the masked estimation results of the first level are connected with those of the second level through a residual connection. Regarding the Conformer-based SELD system, log-Mel and intensity vectors are calculated for both the mic-A and mic-B audio signals. Then, the time difference of arrival (TDOA) between the 2 mics is computed using kernel density estimator (KDE) theory [35]. For the visual signal, images are resized to 224 × 224 px and normalized for fine-tuning the pretrained model. A Res-Conformer-based SELD model is adapted to the audio-visual scene. Audio features are fed into four residual convolution blocks followed by two Conformer encoder blocks. Images are fed into a ResNet18 with pre-trained weights. The embedding of the images is then concatenated with the audio features before the last output layer. The authors applied several data augmentation methods, such as cutout, frequency shift, time shift, mixing, and brightness, hue, saturation and contrast jitter.
3) The CCA Speech team developed a stream-attention-based U-Net to remove background noise and reverberation for 3DSE [36]. Their model consists of three parts: encoder, decoder, and channel fusion module. They proposed stream attention to fuse the various channels in order to fully use the information between channels, and this is done also in the encoder stage. Key, query, and value are generated by three convolutional networks. A softmax function is applied to the last dimension of the product of the key and query. The decoder part is composed of only convolutional blocks, while an LSTM block is used in the encoder part.


4) NERCSLIP-USTC proposed a method based on combinations of ResNet and Conformer architectures to model both local and global patterns [37]. ResNet blocks are used to extract a high-dimension feature representation from the input features, while Conformer blocks are effective at extracting local fine-grained features and long-range global information, respectively. The authors also adopted several data augmentation techniques (SpecAugment, Mixup and ACS) to expand the official dataset.
5) SpeechLab410 proposed a refine-beamformer system to enhance 3D speech signals. The beamforming network consists of two U-Net beamforming networks. In the first stage, they employed a neural beamforming network to initially enhance the 3D speech signal. Then the generative characteristics of a diffusion model were utilized to further enhance the speech signal. The two stages of this enhancement model were trained separately.

Tables 1 and 2 show the results obtained by the participants on the test set. For Task 1 (3DSE) the models had to predict monaural sound waveforms, containing the enhanced speech signals extracted from the multichannel noisy mixtures, with a sampling rate of 16 kHz. For Task 2 (3DSELD) the models were expected to predict the spatial coordinates and class of the sound events active in a multichannel audio mixture. Such information had to be generated for each frame of a discrete temporal grid with 100-millisecond non-overlapping frames. Each submitted file for this task was a csv table listing, for every time frame, the class and spatial coordinates of each predicted sound event. All participants worked with the 2-mic configuration, so the results shown in Tables 1 and 2 refer to the 2-mic configuration.

TABLE 1 Results of Task 1 Participants

TABLE 2 Results of Task 2 Participants

Participants had to submit only the results obtained on the blind test set. The submission had to contain up to two zip archives, one for the audio-only track and one for the audio-visual track, enclosing two separate folders for the challenge tasks, named task1 and task2. From these results, we derived an overall ranking of the participants, as reported in Table 3. All these teams were allowed to submit their work as a 2-page paper to ICASSP 2023.

TABLE 3 Final Rankings of the L3DAS23 Challenge

VI. CONCLUSION
This paper presented the details of the L3DAS23 Signal Processing Grand Challenge at ICASSP 2023, including the L3DAS23 dataset, the challenge tasks, the baseline models and the results obtained by the winning participants. The current version of the L3DAS project introduces the use of visual information for the 3DSE and 3DSELD tasks, given the growing and stimulating interest in AR & VR. The introduction of visual input extracted from the analyzed acoustic environments, whether simulated or not, can drastically benefit research in the field of 3D audio signal processing. For this reason, future work of the L3DAS team will primarily involve the study of new methods to improve the interaction of visual information with Ambisonics audio signals, in order to further improve the results obtained with this challenge. Then, we plan to incorporate new 3D acoustic scenarios, diverse microphone configurations, and novel tasks that could be of great relevance in the context of augmented and virtual reality applications. Moreover, tasks other than 3DSE and 3DSELD will definitely be taken into account, together with the collection of real-recorded data.

ACKNOWLEDGMENT
The authors support the dataset download and the use of the baseline models via extensive instructions provided on the official GitHub repository at https://github.com/l3das/L3DAS23. For more comprehensive information and in-depth details about the challenge, we invite the reader to visit the L3DAS Project website at http://www.l3das.com/icassp2023.

REFERENCES
[1] E. Guizzo et al., “L3DAS21 challenge: Machine learning for 3D audio signal processing,” in Proc. IEEE 31st Int. Workshop Mach. Learn. Signal Process., 2021, pp. 1–6.
[2] Y. Luo, C. Han, N. Mesgarani, E. Ceolini, and S.-C. Liu, “FaSNet: Low-latency adaptive beamforming for multi-microphone audio processing,” in Proc. IEEE Autom. Speech Recognit. Understanding Workshop, 2019, pp. 260–267.
[3] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound event localization and detection of overlapping sources using convolutional recurrent neural networks,” IEEE J. Sel. Topics Signal Process., vol. 13, no. 1, pp. 34–48, Mar. 2019.
[4] X. Ren et al., “A neural beamforming network for B-format 3D speech enhancement and recognition,” in Proc. IEEE 31st Int. Workshop Mach. Learn. Signal Process., 2021, pp. 1–6.
[5] H. R. Guimarães, W. Beccaro, and M. A. Ramírez, “Optimizing time domain fully convolutional networks for 3D speech enhancement in a reverberant environment using perceptual losses,” in Proc. IEEE 31st Int. Workshop Mach. Learn. Signal Process., 2021, pp. 1–6.


[6] E. Guizzo et al., “L3DAS22 challenge: Learning 3D audio sources in a real office environment,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9186–9190.
[7] Y.-J. Lu et al., “Towards low-distortion multi-channel speech enhancement: The ESPNET-SE submission to the L3DAS22 challenge,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9201–9205.
[8] J. Hu et al., “A track-wise ensemble event independent network for polyphonic sound event localization and detection,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2022, pp. 9196–9200.
[9] C. Chen et al., “SoundSpaces 2.0: A simulation platform for visual-acoustic learning,” Neural Inf. Process. Syst. Datasets Benchmarks Track, 2022.
[10] H. Zhu, M. Luo, R. Wang, A. Zheng, and R. He, “Deep audio-visual learning: A survey,” Int. J. Automat. Comput., vol. 18, pp. 351–376, 2020.
[11] M. A. Gerzon, “The design of precisely coincident microphone arrays for stereo and surround sound,” J. Audio Eng. Soc. Conv., 1975.
[12] D. Wang and J. Chen, “Supervised speech separation based on deep learning: An overview,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 10, pp. 1702–1726, Oct. 2018.
[13] H. R. Guimarães, H. Nagano, and D. W. Silva, “Monaural speech enhancement through deep Wave-U-Net,” Expert Syst. Appl., vol. 158, 2020, Art. no. 113582.
[14] C. Macartney and T. Weyde, “Improved speech enhancement with the Wave-U-Net,” 2018, arXiv:1811.11307.
[15] A. Bosca, A. Guérin, L. Perotin, and S. Kitic, “Dilated U-net based approach for multichannel speech enhancement from first-order ambisonics recordings,” in Proc. 28th Eur. Signal Process. Conf., 2021, pp. 216–220.
[16] P.-S. Huang, M. Kim, M. A. Hasegawa-Johnson, and P. Smaragdis, “Deep learning for monaural speech separation,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2014, pp. 1562–1566.
[17] X. Yan, Z. Yang, T. Wang, and H. Guo, “An iterative graph spectral subtraction method for speech enhancement,” Speech Commun., vol. 123, pp. 35–42, 2020.
[18] C. Fan, B. Liu, J. Tao, J. Yi, and Z. Wen, “Discriminative learning for monaural speech separation using deep embedding features,” in Proc. Interspeech, 2019.
[19] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time–frequency magnitude masking for speech separation,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 27, no. 8, pp. 1256–1266, Aug. 2019.
[20] O. Yazdanbakhsh and S. Dick, “Multivariate time series classification using dilated convolutional neural network,” 2019, arXiv:1905.01697.
[21] A. Politis, S. Adavanne, and T. Virtanen, “A dataset of reverberant spatial sound scenes with moving sources for sound event localization and detection,” in Proc. Workshop Detection Classification Acoustic Scenes Events, 2020.
[22] K. Guirguis, C. Schorn, A. Guntoro, S. Abdulatif, and B. Yang, “SELD-TCN: Sound event localization & detection via temporal convolutional networks,” in Proc. 28th Eur. Signal Process. Conf., 2020, pp. 16–20.
[23] S. P. Chytas and G. Potamianos, “Hierarchical detection of sound events and their localization using convolutional neural networks with adaptive thresholds,” in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[24] Y. Cao, Q. Kong, T. Iqbal, F. An, W. Wang, and M. D. Plumbley, “Polyphonic sound event detection and localization using a two-stage strategy,” in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[25] L. Mazzon, Y. Koizumi, M. Yasuda, and N. Harada, “First order ambisonics domain spatial augmentation for DNN-based direction of arrival estimation,” in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[26] P. Pratik, W. J. Jee, S. Nagisetty, R. Mars, and C. S. Lim, “Sound event localization and detection using CRNN architecture with mixup for model generalization,” in Proc. Workshop Detection Classification Acoustic Scenes Events, 2019.
[27] S. K. Ramakrishnan et al., “Habitat-Matterport 3D dataset (HM3D): 1000 large-scale 3D environments for embodied AI,” Neural Inf. Process. Syst. Datasets Benchmarks Track, 2021.
[28] E. Fonseca, X. Favory, J. Pons, F. Font, and X. Serra, “FSD50K: An open dataset of human-labeled sound events,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 30, pp. 829–852, 2022.
[29] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: An ASR corpus based on public domain audio books,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2015, pp. 5206–5210.
[30] A. Baevski, H. Zhou, A. R. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” in Proc. Adv. Neural Inf. Process. Syst., 2020.
[31] A. Mesaros, S. Adavanne, A. Politis, T. Heittola, and T. Virtanen, “Joint measurement of localization and detection of sound events,” in Proc. IEEE Workshop Appl. Signal Process. Audio Acoust., 2019, pp. 333–337.
[32] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional networks for biomedical image segmentation,” in Proc. Int. Conf. Med. Image Comput. Comput. Assist.-Intervention, 2015.
[33] J. Cheng, C. Pang, R. Liang, J. Fan, and L. Zhao, “Dual-path dilated convolutional recurrent network with group attention for multi-channel speech enhancement,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[34] J. Bai, S. W. Huang, H. Yin, Y. Jia, M. Wang, and J. Chen, “3D audio signal processing systems for speech enhancement and sound localization and detection,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[35] V. V. Reddy, A. W. H. Khong, and B. P. Ng, “Unambiguous speech DOA estimation under spatial aliasing conditions,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 12, pp. 2133–2145, Dec. 2014.
[36] H. Wang, Y. Fu, J. Li, M. Ge, L. Wang, and X. Qian, “Stream attention based U-Net for L3DAS23 challenge,” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.
[37] H. Yan, H. Xu, Q. Wang, and J. Zhang, “The NERCSLIP-USTC system for the L3DAS23 challenge Task2: 3D sound event localization and detection (SELD),” in Proc. IEEE Int. Conf. Acoust., Speech Signal Process., 2023, pp. 1–2.

© 2024 The Authors. This work is licensed under a Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 License. For more information, see https://creativecommons.org/licenses/by-nc-nd/4.0/.
Open Access funding provided by ‘Università degli Studi di Roma “La Sapienza”’ within the CRUI CARE Agreement.

