

ARTICLE

AI-based soundscape analysis: Jointly identifying sound sources and predicting annoyance a)
Yuanbo Hou,1,b) Qiaoqiao Ren,2 Huizhong Zhang,3 Andrew Mitchell,3 Francesco Aletta,3 Jian Kang,3 and Dick Botteldooren1

1 Wireless, Acoustics, Environmental, and Expert Systems Research Group, Department of Information Technology, Ghent University, Gent, 9052, Belgium
2 AI and Robotics, Internet Technology and Data Science Lab, Department of Electronics and Information Systems, Interuniversity Microelectronics Centre, Ghent University, Gent, 9052, Belgium
3 Institute for Environmental Design and Engineering, The Bartlett, University College London, London, WC1H 0NN, United Kingdom

ABSTRACT:
Soundscape studies typically attempt to capture the perception and understanding of sonic environments by
surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are
required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition.
Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial
intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to
analyze automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset
containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is proposed to perform
sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings
indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only
one of them. (2) The proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models
and soundscape-related traditional machine learning methods on the SSC and ARP tasks. (3) Correlation analysis
reveals that the relationship between sound sources and annoyance is similar for humans and the proposed AI-based
DCNN-CaF model. (4) Generalization tests show that the proposed model’s ARP in the presence of model-unknown
sound sources is consistent with expert expectations and can explain previous findings from the literature on sound-
scape augmentation. © 2023 Author(s). All article content, except where otherwise noted, is licensed under a

Creative Commons Attribution (CC BY) license (http://creativecommons.org/licenses/by/4.0/).


https://doi.org/10.1121/10.0022408
(Received 22 May 2023; revised 6 October 2023; accepted 31 October 2023; published online 15 November 2023)
[Editor: James F. Lynch] Pages: 3145–3157

I. INTRODUCTION

To mitigate the effect of urban sound on the health and well-being of city dwellers, previous research has classically focused on treating noise as a pollutant. For a couple of decades, researchers have gradually changed their focus to a more holistic approach to urban sound, referred to as the soundscape approach (Brambilla and Maffei, 2010; Kang et al., 2016; Nilsson and Berglund, 2006; Raimbault and Dubois, 2005). Overall, the soundscape approach offers a more comprehensive understanding of urban sound and has the potential to lead to more effective interventions for improving the health and well-being of city dwellers (Abraham et al., 2010; Tsaligopoulos et al., 2021).

Previous studies on the categorization and quantification of soundscapes mostly rely on assessments of participant perceptions. In these studies (Acun and Yilmazer, 2018; Bruce and Davies, 2014; Mackrill et al., 2013; Maristany et al., 2016), participants are usually guided to participate in questionnaires about soundscapes. For example, based on investigations, Yilmazer and Acun (2018) explore the relationship among the sound factors, spatial functions, and properties of soundscapes. Using field questionnaires, Fang et al. (2021) explore how different participants' perceptions and preferences for soundscapes differed. Questionnaires may include a direct assessment of the soundscape quality, but the appraisal is often indicated in the two-dimensional plane spanned by pleasantness and eventfulness (Axelsson et al., 2010). To benchmark soundscape emotion recognition in a valence-arousal plane, Fan et al. (2017) created the Emo-soundscapes dataset based on 6-s excerpts from Freesound.org and online labeling by 1182 annotators. They later used it for constructing a deep learning model for automatic classification (Fan et al., 2018). To automatically recognize the eventfulness and pleasantness of the soundscape, Fan et al. (2015) builds a gold standard model and tests the correlation between the level of pleasure and the level of eventfulness.

a) This paper is part of a special issue on Advances in Soundscape: Emerging Trends and Challenges in Research and Practice.
b) Email: [email protected]


In an everyday context, uneventful and unpleasant soundscapes are often not noticed and do not contribute to the experience of the place. Hence, Sun et al. (2019) propose a soundscape classification that acknowledges that sonic environments can be pushed into the background. Only foregrounded soundscapes contribute to the appraisal and are classified as disruptive and supportive, the latter being either calming or stimulating (Sun et al., 2019). Based on audio recordings containing implicit information in soundscapes, Thorogood et al. (2016) established the background and foreground classification task within a musicological and soundscape context. For urban park soundscapes, Gong et al. (2022) introduce the concepts of "importance" and "performance" and position the soundscape elements in this two-dimensional plane. The importance dimension reflects to what extent a particular sound is an essential part of this soundscape. The perception study underlying this paper (Mitchell et al., 2022) can be seen as a foregrounded soundscape assessment with annoyance as its primary dimension, which is a negative dimension of soundscape assessment. This type of assessment is often used to identify sources of noise, and it allows researchers to identify sources of annoyance that can cause negative health reflections.

Research on annoyance has been carried out based on non-acoustic approaches from different perspectives in the fields of psychology, sociology, medicine, human-computer interaction, and vision. In psychology, researchers primarily focus on the effects of emotion and mood on annoyance (Timmons et al., 2023). The findings of the DEBATS study (Lefèvre et al., 2020) also confirm that considering non-acoustic factors such as situational, personal, and attitudinal factors will improve annoyance predictions. Sociological studies tend to pay more attention to the impact of social support, social relationships, and cultural factors on annoyance (Beyer et al., 2017). In medical studies, the relationship between annoyance and health is emphasised (Eek et al., 2010). The study of Carlsson et al. (2005) indicates that the correlation between subjective health and functional ability increases with increasing annoyance levels. Human-computer interaction usually utilises user experience studies, visual eye tracking, and virtual reality techniques to analyse and predict the annoyance of users when interacting with machines (Mount et al., 2012). On the other hand, acoustic-based annoyance research focuses more on the effects of sound and auditory stimulation on an individual's psychological and emotional state (Nering et al., 2020). The relationship between the appraisal of the soundscape and the assessment of annoyance on the community level is still under-researched, although it was first explored in 2003 (Lercher and Schulte-Fortkamp, 2003). Several non-acoustic factors influence community noise annoyance, and some of them, such as noise sensitivity (Das et al., 2021), are so strongly rooted in human auditory perception (Kliuchko et al., 2016) that they probably also contribute to soundscape appraisal. The formal definition of "soundscape" refers to an understanding of the sonic environment, hence recognizing sources. The influence of perceived sounds on the appraisal of soundscapes is found to depend on contexts (Hong and Jeon, 2015). The sounds that people hear in their environment can have a substantial impact on their overall appraisal of that environment, and it can influence people's emotional and cognitive responses to their living surroundings. Hence, any acoustic signal processing attempting to predict this is very likely to benefit from automatic sound recognition. Automatic sound recognition in predicting people's perception of the soundscape has the potential to improve our understanding of how the acoustic environment affects our perceptions, and can inform the development of more effective interventions to promote positive outcomes. Therefore, Boes et al. (2018) propose to use an artificial neural network to predict both the sound source (human, natural, and mechanical) perceived by users of public parks as well as their appraisal of the soundscape quality. It was shown that sound recognition outperforms psychoacoustic indicators in predicting each of these perceptual outcomes. Identifying specific sounds in urban soundscapes is relevant for assisting drivers or self-driving cars (Marchegiani and Posner, 2017), and for urban planning and environment improvement (Ma et al., 2021).

In this paper, a new artificial intelligence (AI) method, inspired by the approach presented in Mitchell et al. (2023), is introduced to identify various sound sources and predict one of the components in a circumplex appraisal of the sonic environment: annoyance. More specifically, this paper proposes a deep-learning model based on the cross-attention mechanism to simultaneously perform sound source classification (SSC) and annoyance rating prediction (ARP) for end-to-end inference of sound sources and annoyance rates in soundscapes. SSC has been widely used for audio event recognition (Kong et al., 2020; Ren et al., 2017) and acoustic scene classification (Barchiesi et al., 2015; Hou et al., 2022a; Mesaros et al., 2018b). In this work, we will augment it with ARP, aiming to predict the overall appraisal of the soundscape along the annoyance axis.

In soundscapes with complex acoustic environments, source-related SSC and human-perception-related ARP are commonly used techniques for understanding how people perceive and respond to sounds in soundscapes. To accurately identify these various audio events, deep learning-based convolutional neural networks (Li et al., 2019; Xu et al., 2017), recurrent neural networks (Parascandolo et al., 2016), convolutional recurrent neural networks (Li et al., 2020), and Transformer (Vaswani et al., 2017) with multi-head attention are used in SSC-related detection and classification of acoustic scenes and events (DCASE) challenges (Mesaros et al., 2018a; Politis et al., 2021). Recently, with the aid of large-scale audio datasets, e.g., AudioSet (Gemmeke et al., 2017), and diverse audio pre-trained models [such as convolution-based PANNs (Kong et al., 2020) and Transformer-based AST (Gong et al., 2021)], deep learning-based approaches have made great improvement in SSC tasks. However, most of these SSC-related studies focus on recognizing sound sources without considering whether they are annoying to humans. This paper proposes a joint SSC and ARP approach, expanding SSC to include subjective human perception.

An intuitive observation is that in real-life soundscapes, loud sounds naturally attract more human attention than quieter sounds. For example, on the side of the street, the sound of roaring cars will capture people's attention more than the sound of small conversations on the corner. Therefore, this paper exploits the loudness-related root mean square value (RMS) (Mulimani and Koolagudi, 2018) and Mel spectrograms (Bala et al., 2010) features, which conform to human hearing characteristics, to predict the objective sound sources and perceived annoyance ratings. The proposed model uses convolutional blocks to extract high-level representations of the two features and a cross-attention module to fuse their semantic representations. Based on the proposed model, this paper explores the following research questions (RQs):

(1) RQ1: Can the model's performance be improved using two acoustic features?
(2) RQ2: How does the performance of the proposed model compare with other models on the ARP task and the SSC task, as well as the joint ARP and SSC tasks? Does the cross-attention-based fusion module in the model work well?
(3) RQ3: Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels?
(4) RQ4: How does the proposed model respond to adding unknown sounds to the soundscape?

The paper is organized as follows. Section II introduces the proposed method. Section III describes the baselines, dataset and training setup. Section IV analyzes and discusses the results with research questions. Section V draws conclusions.

II. METHOD

This section introduces the proposed model DCNN-CaF: the dual-branch convolutional neural network (DCNN) with cross-attention-based fusion (CaF). First, we introduce how to extract audio representations from the input audio clips, and then perform CaF on audio representations. Finally, we use different loss functions to train task-dependent branches of the model to complete the classification-based SSC task and the regression-based ARP task.

A. Audio representation extraction

Since the Mel spectrograms common in sound source-related tasks and RMS features that can reflect the energy of sound sources are used in this paper, there are two branches of inputs to the DCNN-CaF model to extract high-level representations of the two acoustic features separately, as shown in Fig. 1. Inspired by the excellent performance of pure convolution-based pretrained audio neural networks (PANNs) (Kong et al., 2020) in audio-related tasks, a convolutional structure similar to that in PANNs is used in Fig. 1 to extract the representation of the input acoustic features. Specifically, the dual-input model in Fig. 1 uses 4-layer convolutional blocks. Each convolutional block contains two convolutional layers with global average pooling (GAP). The representations of Mel spectrograms and RMS features generated by the convolution block, R_m and R_r, are fed to the attention-based fusion module to generate representations suitable for the ARP task. The embeddings of the sound source generated by the mapping of R_m through the embedding layer will be input into the final sound source classification layer to complete the SSC task.

B. Cross-attention-based fusion

The cross-attention fusion module in this paper is based on the multi-headed attention (MHA) in Transformer (Vaswani et al., 2017). MHA allows models to jointly focus on representations at different positions in different subspaces.

FIG. 1. (Color online) The proposed dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF). The dimension of the output
of each layer is shown.
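To make the structure in Fig. 1 concrete, the following is a minimal PyTorch sketch of a dual-branch network with cross-attention-based fusion. It mirrors the description in Secs. II A-II C (two four-block convolutional branches for the log-Mel and RMS inputs, two multi-head attention modules with swapped query/key/value roles, concatenation and fusion for ARP, and an embedding plus classification layer for SSC), but the kernel sizes, pooling choices, and layer widths are illustrative assumptions rather than the authors' released implementation; see the project webpage (Hou, 2023) for the official code.

```python
# Minimal sketch (not the official implementation) of the DCNN-CaF structure in Fig. 1.
import torch
import torch.nn as nn

class ConvBlock(nn.Module):
    """Two convolutional layers followed by time pooling (loosely PANN-style)."""
    def __init__(self, in_ch, out_ch):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.Conv2d(out_ch, out_ch, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_ch), nn.ReLU(),
            nn.AvgPool2d(kernel_size=(2, 1)),   # halve the time axis only
        )

    def forward(self, x):
        return self.net(x)

class DCNNCaF(nn.Module):
    def __init__(self, n_classes=24, d_model=512, n_heads=8):
        super().__init__()
        def branch():
            chans, layers, in_ch = [64, 128, 256, d_model], [], 1
            for c in chans:                      # four convolutional blocks per branch
                layers.append(ConvBlock(in_ch, c)); in_ch = c
            return nn.Sequential(*layers)
        self.mel_branch = branch()               # input: (B, 1, 480, 64) log-Mel frames
        self.rms_branch = branch()               # input: (B, 1, 480, 1) frame-level RMS
        self.mha1 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # Q from Mel, K=V from RMS
        self.mha2 = nn.MultiheadAttention(d_model, n_heads, batch_first=True)  # Q from RMS, K=V from Mel
        self.fusion = nn.Linear(2 * d_model, d_model)
        self.arp_head = nn.Linear(d_model, 1)            # annoyance rating (regression)
        self.embedding = nn.Linear(d_model, d_model)
        self.ssc_head = nn.Linear(d_model, n_classes)    # 24 sound-source logits

    def forward(self, mel, rms):
        # Each branch yields a (B, 30, d_model) sequence: 480 frames pooled to 30 steps.
        rm = self.mel_branch(mel).mean(dim=3).transpose(1, 2)
        rr = self.rms_branch(rms).mean(dim=3).transpose(1, 2)
        a1, _ = self.mha1(rm, rr, rr)            # cross-attention adjusting R_m with R_r
        a2, _ = self.mha2(rr, rm, rm)            # cross-attention adjusting R_r with R_m
        fused = self.fusion(torch.cat([a1, a2], dim=-1)).mean(dim=1)
        annoyance = self.arp_head(fused).squeeze(-1)
        ssc_logits = self.ssc_head(self.embedding(rm.mean(dim=1)))
        return ssc_logits, annoyance

# Joint training objective in the spirit of Sec. II C: BCE for SSC plus MSE for ARP.
model = DCNNCaF()
mel, rms = torch.randn(4, 1, 480, 64), torch.randn(4, 1, 480, 1)
ssc_logits, annoyance = model(mel, rms)
y_ssc = torch.randint(0, 2, (4, 24)).float()     # multi-hot sound-source labels
y_arp = torch.rand(4) * 9 + 1                    # annoyance ratings in [1, 10]
loss = nn.BCEWithLogitsLoss()(ssc_logits, y_ssc) + nn.MSELoss()(annoyance, y_arp)
```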


Following the description in Transformer (Vaswani et al., 2017), MHA is calculated on a set of queries (Q), keys (K), and values (V),

MHA(Q, K, V) = Concat(head_1, ..., head_h) W^O,    (1)

where

head_i = A(Q W_i^Q, K W_i^K, V W_i^V),    (2)

A(Q W_i^Q, K W_i^K, V W_i^V) = softmax( Q W_i^Q (K W_i^K)^T / sqrt(d_k) ) V W_i^V,    (3)

where head_i represents the output of the ith attention head for a total number of h heads. W_i^Q, W_i^K, W_i^V, and W^O are learnable weights. For MHA in the encoder, Q, K, and V come from the same place; at this point, the attention in MHA is called self-attention (Vaswani et al., 2017). All the parameters (such as h = 8, d_k, and d_v = d_model/h = 512/8 = 64, etc.) follow the default settings of Transformer (Vaswani et al., 2017).

From the corresponding dimensions of the output of each layer in Fig. 1, it can be seen that the dimensions of R_m and R_r are both (512, 30), which correspond to the number of filters of the previous convolutional layer and the number of frames, respectively. After a series of convolutional layer operations, the input 480 frames of Mel spectrograms and RMS features are extracted into audio representations with a time length of 30 frames. This means that in MHA, the time step of each head head_i is also 30. To obtain the representation of R_m and R_r based on the mutual attention of R_m and R_r collaboratively, in MHA1 in Fig. 1,

Q = R_m, K = V = R_r.    (4)

In contrast, in MHA2,

Q = R_r, K = V = R_m.    (5)

The cross-attention-adjusted representations of R_m and R_r are simply concatenated together and fed into the fusion layer to obtain higher acoustic representations containing the semantics of R_m and R_r.

C. The loss function of the DCNN-CaF model

The model proposed in this paper performs two tasks simultaneously, SSC and ARP. Given that the output of the sound source classification layer is ŷ_s, and its corresponding label is y_s, referring to the previous work (Hou and Botteldooren, 2022), the binary cross-entropy (BCE) is used as the loss function for the SSC task,

L_SSC = BCE(ŷ_s, y_s).    (6)

Given the prediction output from the annoyance rating prediction layer is ŷ_a and its corresponding label is y_a, the mean squared error (MSE) (Wallach and Goffinet, 1989) is used as a loss function for the ARP task to measure the distance between the predicted and the human-annotated annoyance ratings,

L_ARP = MSE(ŷ_a, y_a).    (7)

Then, the final loss function of the DCNN-CaF model in this paper is

L = L_SSC + L_ARP.    (8)

III. DATASET, BASELINE, AND EXPERIMENTAL SETUP

A. Dataset

To the best of our knowledge, DeLTA (Mitchell et al., 2022) is the only publicly available dataset that includes both ground-truth sound source labels and human annoyance rating scores, so we use it in the paper. DeLTA comprises 2890 15-s binaural audio clips collected in urban public spaces across London, Venice, Granada, and Groningen. A remote listening experiment performed by 1221 participants was used to label the DeLTA recordings. In the listening experiment, participants listened to 10 15-s binaural recordings of urban environments, assessed whether they contained any of the 24 classes of sound sources, and then provided an annoyance rating (continuously from 1 to 10). Participants were given labels for 24 classes of sound sources, including: Aircraft, Bells, Bird tweets, Bus, Car, Children, Construction, Dog bark, Footsteps, General traffic, Horn, Laughter, Motorcycle, Music, Non-identifiable, Rail, Rustling leaves, Screeching brakes, Shouting, Siren, Speech, Ventilation, Water, and Other, adapted from the taxonomy developed by Salamon et al. (2014). In the listening experiment, each recording was evaluated by two to four participants, with an average of 3.1 recognized sound sources per recording. For more detailed information about DeLTA, please see (Mitchell et al., 2022). During the training of models in this paper, the training, validation, and test sets contain 2081, 231, and 578 audio clips, respectively.

B. Baseline for annoyance rating prediction (ARP) task

To compare the performance of the proposed deep-learning-based method with traditional approaches in soundscape-related studies, we employ five regression methods inspired by their performance in annoyance prediction in soundscape research (Al-Shargabi et al., 2023; Iannace et al., 2019; Morrison et al., 2003; Szychowska et al., 2018; Zhou et al., 2018) to perform the ARP task based on A-weighted equivalent sound pressure levels. They are linear regression, support vector regression (SVR), decision tree (DT), k-nearest neighbours (KNN), and random forest.
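As an illustration only, the sketch below fits these five level-based regressors with scikit-learn. The LAeq features and annoyance labels are random placeholders standing in for the DeLTA data described in Sec. III A, and the hyperparameters are library defaults rather than the settings behind Table IV.

```python
# Sketch of the five LAeq-based regression baselines for the ARP task (Sec. III B).
# X holds one A-weighted equivalent level per clip; y holds annoyance ratings (1-10).
# Both are random placeholders; hyperparameters are scikit-learn defaults.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.svm import SVR
from sklearn.tree import DecisionTreeRegressor
from sklearn.neighbors import KNeighborsRegressor
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_absolute_error, mean_squared_error

rng = np.random.default_rng(0)
X_train, X_test = rng.uniform(50, 90, (2081, 1)), rng.uniform(50, 90, (578, 1))
y_train, y_test = rng.uniform(1, 10, 2081), rng.uniform(1, 10, 578)

baselines = {
    "Linear regression": LinearRegression(),
    "SVR": SVR(),
    "Decision tree": DecisionTreeRegressor(),
    "KNN": KNeighborsRegressor(),
    "Random forest": RandomForestRegressor(),
}
for name, model in baselines.items():
    model.fit(X_train, y_train)
    pred = model.predict(X_test)
    mae = mean_absolute_error(y_test, pred)
    rmse = mean_squared_error(y_test, pred) ** 0.5
    print(f"{name}: MAE = {mae:.2f}, RMSE = {rmse:.2f}")
```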

Linear regression is a fundamental and interpretable model that assumes a linear relationship between input features (in this case, sound levels) and the target variable (annoyance ratings). SVR is particularly effective when dealing with complex relationships between input features and target variables. Decision tree regression is known for its ability to handle non-linear relationships and interactions among features. Random forest regression is an ensemble method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. KNN regression can work well when there is a relatively small dataset and in low-dimensional spaces.

C. Baseline for sound source classification (SSC) task

In SSC-related research, deep learning convolutional neural network (CNN)-based models have achieved widespread success, and recently, Transformer-based models become dominant. Therefore, for the SSC task, the classical CNN-based YAMNet (Plakal and Ellis, 2023) and PANN (Kong et al., 2020), and Transformer-based AST (Gong et al., 2021) are used as baselines. Since YAMNet, PANN, and AST are trained on the large-scale AudioSet (Gemmeke et al., 2017), the last layer of YAMNet, PANN, and AST has 527, 521, and 527 units for output, respectively. In contrast, the SSC task in this paper has only 24 classes of audio events, so we modify the number of units in the last layer of all three to 24, and then fine-tune the models on the DeLTA dataset.

D. Baseline for joint ARP and SSC task

This paper first attempts to use the artificial intelligence (AI)-based model to simultaneously perform sound source classification and annoyance rating prediction. Therefore, this paper adopts deep neural networks (DNN), convolutional neural networks (CNN), and CNN-Transformer as baselines for comparison.

1. Deep neural networks (DNN)

The DNN consists of two branches. Each branch contains four fully connected layers and ReLU functions (Boob et al., 2022), where the number of units in each layer is 64, 128, 256, and 512, respectively. The outputs of the final fully connected layer of the two branches are concatenated and combined to feed to the SSC and ARP layers, respectively.

2. Convolutional neural networks (CNN)

Similar to DNN, the compared CNN also consists of two branches. Each branch includes two convolutional layers, where the number of filters in each convolutional layer is 32 and 64, respectively. The outputs of the convolutional layers are concatenated and combined to feed to the SSC and ARP layers, respectively.

3. CNN-Transformer

The CNN-Transformer is based on CNN, and an Encoder from Transformer (Vaswani et al., 2017) is added after the final convolutional layer in CNN. After the output of the Encoder is flattened, it is fed to the SSC and ARP layers, respectively.

E. Training setup and metric

The 64-filter banks logarithmic Mel-scale spectrograms (Bala et al., 2010) and frame-level root mean square values (RMS) (Mulimani and Koolagudi, 2018) are used as the acoustic features in this paper. A Hamming window length of 46 ms and a window overlap of 1/3 (Hou et al., 2022a) are used for each frame. A batch size of 64 and Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 are used to minimize the loss in the proposed model. The model is trained for 100 epochs.

The SSC is a classification task, so accuracy (Acc), F-score, and threshold-free area under curve (AUC) are used to evaluate the classification results. The ARP is viewed as a regression task in this paper, so mean absolute error (MAE) and root mean square error (RMSE) are used to measure the regression results. Higher Acc, F-score, AUC and lower RMSE, MAE indicate better performance. Models and more details are available on the project webpage (Hou, 2023).

IV. RESULTS AND ANALYSIS

This section analyzes the performance of the proposed method based on the following research questions.

A. Can the model's performance be improved using two acoustic features?

Two kinds of acoustic features are used in this paper, the Mel spectrograms that approximate the characteristics of human hearing and the RMS features that characterise the acoustic level. Table I shows the ablation experiments of the two acoustic features on the proposed DCNN-CaF model to specifically present the performance of the DCNN-CaF model based on different features. When only a single feature is used, the input of the DCNN-CaF model is the corresponding single branch.

TABLE I. Ablation study on the acoustic features.

                        ARP                           SSC
#   Mel   RMS           MAE           RMSE            AUC           F-score (%)     Acc (%)
1   yes   --            1.00 ± 0.15   1.18 ± 0.12     0.89 ± 0.01   61.12 ± 5.45    90.98 ± 0.99
2   --    yes           1.08 ± 0.14   1.27 ± 0.10     0.79 ± 0.02   53.64 ± 3.20    89.33 ± 1.73
3   yes   yes           0.84 ± 0.12   1.05 ± 0.13     0.90 ± 0.01   67.20 ± 3.16    92.52 ± 0.87
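The two input features compared in Table I can be computed, for instance, with librosa as sketched below. The 46-ms Hamming window, 1/3 window overlap, and 64 Mel bands follow Sec. III E, while the file name, sampling rate, and normalization are placeholders and assumptions.

```python
# Sketch of the log-Mel spectrogram and frame-level RMS features (Secs. II A and III E).
# "clip.wav" stands in for a 15-s DeLTA recording; the sampling rate is an assumption.
import librosa

sr = 32000                                     # assumed sampling rate
win = int(0.046 * sr)                          # 46-ms Hamming window
hop = win - win // 3                           # 1/3 window overlap

y, _ = librosa.load("clip.wav", sr=sr, mono=True)

mel = librosa.feature.melspectrogram(
    y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
    window="hamming", n_mels=64)
log_mel = librosa.power_to_db(mel).T           # (T frames, 64 Mel bands)

rms = librosa.feature.rms(
    y=y, frame_length=win, hop_length=hop).T   # (T frames, 1)

print(log_mel.shape, rms.shape)                # the two features share the frame axis T
```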


As shown in Table I, the DCNN-CaF model performs the worst on the ARP and SSC tasks when using only the 46 ms interval RMS features, which are related to instantaneous loudness. This is apparently caused by the lack of spectral information, which is embedded in the Mel spectrograms and omitted from the RMS features. The dimension of the frame-level RMS used in this paper is (T, 1), where T is the number of frames. Compared with Mel spectrograms with a dimension of (T, 64), the spectral information contained in the loudness-related one-dimensional RMS features is also scarcer. This factor makes it difficult for the model to distinguish the 24 types of sound sources and predict annoyance from real-life different sound sources in the DeLTA dataset based only on the RMS features alone. The DCNN-CaF using Mel spectrograms outperforms the results of its corresponding RMS features overall. Meanwhile, DCNN-CaF combining Mel spectrograms and RMS features achieves the best results, which clarifies that using these two acoustic features benefits the model's performance on SSC and ARP tasks. Thus, adding energy level-related information to the sound recognition improves annoyance prediction as expected, but it also slightly improves sound source recognition.

B. How does the performance of the proposed model compare with other models on the ARP task and the SSC task, as well as the joint ARP and SSC tasks? Does the cross-attention-based fusion module in the model work well?

Table II presents the results of classical pure convolution-based YAMNet and PANN (Kong et al., 2020), and Transformer-based AST (Gong et al., 2021), on the SSC task. YAMNet, PANN, and AST are trained based on Mel spectrograms. For a fair comparison, the proposed DCNN-CaF only uses the left SSC branch of the input Mel spectrograms.

TABLE II. Performance of models trained only for the SSC task.

Model                   YAMNet         PANN           AST            The proposed DCNN-CaF (only SSC branch)
Parameters (Million)    3.231          79.723         86.207         4.961
AUC                     0.87 ± 0.02    0.90 ± 0.02    0.85 ± 0.3     0.92 ± 0.01
F-score (%)             63.95 ± 2.67   66.52 ± 2.31   56.91 ± 2.84   67.08 ± 2.23
Acc (%)                 90.56 ± 1.96   91.69 ± 1.44   89.39 ± 1.92   92.34 ± 1.58

In Table II, both YAMNet and DCNN-CaF are lightweight models compared to PANN and AST. Relative to Transformer-based AST, the number of parameters of DCNN-CaF is reduced by (86.207 - 4.961)/86.207 × 100% ≈ 94%. Compared to YAMNet, PANN, and AST, which have deeper layers than DCNN-CaF, the shallow DCNN-CaF achieves better results on the SSC task, which may be due to the relatively small dataset used in this paper, and large and deep models are prone to overfitting during the training process.

Table III shows the joint ARP and SSC baselines proposed in Sec. III D. For a fairer comparison, the DNN, CNN, and CNN-Transformer in Sec. III D also use a dual-input branch structure to simultaneously use the two acoustic features of Mel spectrograms and RMS to complete the SSC and ARP tasks.

TABLE III. Comparison of different models for joint SSC and ARP tasks on the DeLTA dataset.

                                ARP                           SSC
Model              Param. (M)   MAE           RMSE            AUC           F-score (%)    Acc (%)
DNN                0.376        1.06 ± 0.19   1.32 ± 0.08     0.87 ± 0.01   49.57 ± 8.78   89.30 ± 2.21
CNN                1.271        0.94 ± 0.06   1.15 ± 0.06     0.88 ± 0.01   52.03 ± 3.09   90.34 ± 0.46
CNN-Transformer    17.971       1.00 ± 0.13   1.14 ± 0.09     0.86 ± 0.02   46.89 ± 6.32   88.56 ± 1.24
DCNN-CaF           7.614        0.84 ± 0.12   1.05 ± 0.13     0.90 ± 0.01   67.20 ± 3.16   92.52 ± 0.87

As shown in Table III, the CNN based on the convolutional structure outperforms the DNN based on multi-layer perceptrons (MLP) (Kruse et al., 2022) on both tasks, which reflects that the convolutional structure is more effective than the MLP structure in extracting acoustic representations. While performing well on the ARP regression task, the CNN-Transformer combining convolution and an Encoder from Transformer has the worst result corresponding to the SSC task for real-life 24-class sound source recognition. This may be because the DeLTA dataset used in this paper is not large enough to allow the Transformer Encoder with MHA (Vaswani et al., 2017) to play its expected role. Previous work has also shown that Transformer-based models tend to perform less well on small datasets (Hou et al., 2022b). Finally, compared to these common baseline models, DCNN-CaF achieves better results on both SSC and ARP.

Next, we explore the performance of traditional A-weighted equivalent sound pressure level (LAeq)-based methods for annoyance prediction (ARP task). Thus, we extract the sound levels of audio clips in the DeLTA dataset and utilize them as features to predict annoyance ratings, as shown in Table IV. Note that sound level is a clip-level feature, while the proposed DCNN-CaF only accepts frame-level features as input. Therefore, the proposed DCNN-CaF, which cannot input coarse-grained clip-level LAeq-based sound level features, is omitted in Table IV.
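For completeness, the metrics defined in Sec. III E and reported in Tables I-IV can be computed as sketched below with scikit-learn. The arrays are random placeholders, and the macro averaging and 0.5 decision threshold are assumptions, not necessarily the exact choices behind the reported numbers.

```python
# Sketch of the evaluation metrics from Sec. III E: AUC, F-score, and accuracy for the
# 24-class multi-label SSC output, and MAE/RMSE for the ARP regression output.
import numpy as np
from sklearn.metrics import (roc_auc_score, f1_score, accuracy_score,
                             mean_absolute_error, mean_squared_error)

rng = np.random.default_rng(0)
y_true = rng.integers(0, 2, size=(578, 24))           # multi-hot sound-source labels
y_prob = rng.random((578, 24))                        # predicted source probabilities
y_pred = (y_prob >= 0.5).astype(int)                  # assumed 0.5 decision threshold

auc = roc_auc_score(y_true, y_prob, average="macro")  # threshold-free AUC
f_score = f1_score(y_true, y_pred, average="macro")
acc = accuracy_score(y_true.ravel(), y_pred.ravel())  # label-wise accuracy

a_true = rng.uniform(1, 10, 578)                      # annoyance ratings
a_pred = a_true + rng.normal(0, 1, 578)               # placeholder predictions
mae = mean_absolute_error(a_true, a_pred)
rmse = mean_squared_error(a_true, a_pred) ** 0.5

print(f"AUC={auc:.2f}  F-score={f_score:.2f}  Acc={acc:.2f}  MAE={mae:.2f}  RMSE={rmse:.2f}")
```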


TABLE IV. Comparison of sound level-based approaches on the ARP task.

Method   Linear regression   Decision tree   SVR    Random forest   KNN
MAE      1.03                1.45            1.01   1.27            1.13
RMSE     1.26                1.84            1.26   1.60            1.41

Compared to the other models in Table IV, the support vector regression (SVR) achieves the best performance on the ARP task. This may be attributed to its robustness in handling outliers and its ability to effectively model nonlinear relationships (Izonin et al., 2021). In summary, the LAeq-based traditional approaches in Table IV show competitive performance on the ARP task, and their performance is close to the deep learning neural network-based methods in Table III.

To intuitively present the results of DCNN-CaF for the annoyance rating prediction, Fig. 2 visualizes the gap between the annoyance ratings predicted by DCNN-CaF and the corresponding ground-truth annoyance ratings. The red point representing the predicted value and the blue point indicating the true label in Fig. 2 mostly match well, indicating that the proposed model successfully regresses the annoyance ratings in the real-life soundscape.

For an in-depth analysis of the performance of the DCNN-CaF, Fig. 3 further visualizes the attention distribution from the cross-attention-based fusion module on some test samples. As described in Sec. III B, the number of time steps of each head in the multi-head attention (MHA) is 30, which comes from 30 frames in the dimensions of the representations of Mel spectrograms and RMS features before MHA. Therefore, in DCNN-CaF, the dimension of the attention matrix of each head in MHA is 30 × 30. Figure 3 visualizes the distribution of attention in the same head number from MHA1 and MHA2. From the distribution of attention in the subgraphs of Fig. 3, it can be seen that MHA1, which uses R_r to adjust R_m, and MHA2, which uses R_m to adjust R_r, complement each other. For example, for sample #1 in Fig. 3, the attention of MHA1 in subfigure (1) is mainly distributed on the left side, while the attention of MHA2 in subfigure (2) is predominantly distributed on the right side. For the same sample, MHA1 and MHA2 with different attention perspectives match each other well. The results in Fig. 3 illustrate that the proposed DCNN-CaF model successfully pays different attention to the information of different locations of two kinds of acoustic features based on the cross-attention module, which is beneficial for the fusion of these acoustic features. For more visualizations of all the attention distributions of the 8 heads of MHA1 and MHA2, please see the project webpage.

C. Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels?

To identify which of the 24 classes of sounds is most likely to cause annoyance, we first analyze the relationship between the sound identified by the model and the annoyance it predicts. Then, the predictions from the model are compared to the human classification. Specifically, we first use Spearman's rho (David and Mallows, 1961) to analyze the correlation between the probability of various sound sources predicted by the model and the corresponding annoyance ratings. Then, we calculate the distribution of sound sources at different annoyance ratings, and further verify the model's predictions based on human-annotated sound sources and annoyance rating labels.

Correlation analysis between the model's sound source classification and annoyance ratings. A Shapiro-Wilk (with α = 0.05) statistic test (Hanusz et al., 2016) is performed before a correlation analysis of the model's predicted sound sources and annoyance ratings on the test set. The results of the Shapiro-Wilk statistic test showed no evidence that the model's predictions conform to a normal distribution. Therefore, a non-parametric method named Spearman's rho (David and Mallows, 1961) is used for correlation analysis. The Spearman's rho correlation analysis in Table V shows that the recognition of some sounds is significantly correlated with the predicted annoyance rating. Specifically, the presence of sound sources such as Children, Water, Rail, Construction, Siren, Shouting, Bells, Motorcycle, Music, Car, General traffic, Screeching brakes, Horn, and Bus is positively correlated with the annoyance rating. The presence of sound sources such as Ventilation, Footsteps, Dog bark, Bird tweet, Rustling leaves, Non-identifiable, and Other is negatively correlated with the annoyance rating. As for sound sources such as Speech, Aircraft, and Laughter, there is no significant correlation between them being present and annoyance rating. Further correlation analysis indicates that the sound source Bus shows the highest positive correlation with the annoyance rating, with a correlation coefficient of 0.712.

FIG. 2. (Color online) Scatter plot of annoyance ratings of model predictions and human-annotated labels on some samples in the test set. Black vertical
lines indicate gaps between two points.
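The correlation analysis described above can be reproduced in outline with scipy, as sketched below. The model outputs are random placeholders, and the per-source loop simply mirrors the Shapiro-Wilk check (α = 0.05) followed by Spearman's rho described in the text.

```python
# Sketch of the per-source analysis in Sec. IV C: Shapiro-Wilk normality check followed
# by Spearman's rho between predicted source probability and predicted annoyance rating.
import numpy as np
from scipy.stats import shapiro, spearmanr

rng = np.random.default_rng(0)
classes = ["Children", "Water", "Rail", "Bus", "Rustling leaves"]   # subset for brevity
source_prob = rng.random((578, len(classes)))   # model-predicted source probabilities
annoyance = rng.uniform(1, 10, 578)             # model-predicted annoyance ratings

for k, name in enumerate(classes):
    p_normal = shapiro(source_prob[:, k]).pvalue          # normality pre-check
    rho, p_value = spearmanr(source_prob[:, k], annoyance)
    flag = "significant at 0.01" if p_value < 0.01 else "n.s."
    print(f"{name:>16s}: Shapiro p={p_normal:.3f}, Spearman rho={rho:+.2f} ({flag})")
```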


FIG. 3. (Color online) Attention distributions of the cross-attention-based fusion module in DCNN-CaF on audio clips from the test set. The subgraphs (1,
2) and (3, 4) are from (MHA1, MHA2) of sample #1 and #2, respectively. Brighter colours represent larger values.

In contrast, the sound source Rustling leaves shows the highest negative correlation with the annoyance rating, with a coefficient of -0.731.

Verifying the model's predictions based on human-perceived manually annotated labels. Based on the correlation analysis of the model's predictions, some sound sources are more likely to cause people annoyance than others. To investigate the consistency of the correlation analysis results between the model-based and the human-annotated labels-based, we calculate the distribution of sound sources at different annoyance rating levels based on the human-annotated labels to explore the correlations between the sound source and the annoyance levels, as shown in Fig. 4.

Given that the mean value of the annoyance rating by humans on the test set is μ, for the ith class of sound source s_i, the total number of occurrences in audio samples with an annoyance rating less than or equal to μ is n_{i,l}, and the total number of occurrences in audio samples with an annoyance rating greater than μ is n_{i,h}. N_i = n_{i,l} + n_{i,h}; N_i is the total number of samples containing the sound source s_i. Then, the probability of the sound source occurring in the samples where annoyance is lower than or equal to μ is

P(x ≤ μ | s_i) = n_{i,l} / (n_{i,l} + n_{i,h}) = n_{i,l} / N_i,    (9)

where x represents the annoyance rating for fragments containing the sound source s_i. Correspondingly, the probability of it occurring in samples higher than μ is

P(x > μ | s_i) = n_{i,h} / N_i = 1 - P(x ≤ μ | s_i).    (10)
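A compact way to compute these conditional probabilities from the human annotations is sketched below with numpy; the multi-hot label matrix and the ratings are placeholders for the DeLTA test-set annotations.

```python
# Sketch of Eqs. (9) and (10): for each sound source s_i, the share of clips containing
# it whose human annoyance rating is at or below the test-set mean versus above it.
import numpy as np

rng = np.random.default_rng(0)
labels = rng.integers(0, 2, size=(578, 24)).astype(bool)   # clip x source presence
ratings = rng.uniform(1, 10, 578)                          # human annoyance ratings
mu = ratings.mean()                                        # mean annoyance on the test set

for i in range(labels.shape[1]):
    present = labels[:, i]
    n_i = int(present.sum())                               # N_i = n_{i,l} + n_{i,h}
    if n_i == 0:
        continue
    p_low = float((ratings[present] <= mu).mean())         # P(x <= mu | s_i), Eq. (9)
    p_high = 1.0 - p_low                                   # P(x > mu | s_i), Eq. (10)
    print(f"source {i:2d}: N={n_i:3d}  P(x<=mu)={p_low:.2f}  P(x>mu)={p_high:.2f}")
```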

TABLE V. Spearman's rho correlation coefficients on DeLTA.

                     Model predicted annoyance   Annotated by human        Samples   Sound levels
Sound source         Correlation (r)             P(x ≤ μ)    P(x > μ)      (N)       Correlation (r)
Children             0.12^a                      0.35        0.65          71        0.01
Water                0.18^a                      0.39        0.61          69        0.02
Rail                 0.26^a                      0.33        0.67          15        0.03
Construction         0.27^a                      0.43        0.57          54        0.01
Siren                0.27^a                      0.57        0.43          21        0.02
Shouting             0.35^a                      0.20        0.80          35        0.01
Bells                0.33^a                      0.37        0.63          19        0.01
Motorcycle           0.47^a                      0.24        0.76          37        0.05
Music                0.50^a                      0.21        0.79          43        0.04
Car                  0.52^a                      0.35        0.65          79        0.04
General traffic      0.58^a                      0.38        0.62          240       0.02
Screeching brakes    0.63^a                      0.25        0.75          8         0.05
Horn                 0.70^a                      0.28        0.72          21        0.06
Bus                  0.71^a                      0.11        0.89          19        0.04
Ventilation          -0.20^a                     0.50        0.50          34        0.01
Footsteps            -0.42^a                     0.63        0.37          223       0.03
Non-identifiable     -0.43^a                     0.65        0.35          52        0.01
Other                -0.49^a                     0.59        0.41          85        0.04
Dog bark             -0.56^a                     0.75        0.25          8         0.00
Bird tweet           -0.69^a                     0.74        0.26          160       0.01
Rustling leaves      -0.73^a                     0.83        0.17          18        0.06
Speech               0.05                        0.46        0.54          368       0.02
Aircraft             0.02                        0.45        0.55          20        0.03
Laughter             0.05                        0.42        0.58          78        0.02

^a Statistical significance at the 0.01 level.


FIG. 4. (Color online) Distribution of the number of samples from different sources at low and high annoyance rating levels, that is, n_{i,l} and n_{i,h} (shown as N_l and N_h), and the corresponding P(x ≤ μ | s_i) curves.

Table V comprehensively shows the probability distribution of 24 classes of sound sources at different levels of annoyance rating according to human perception.

The probability distribution of sound sources at different annoyance rating levels in Table V reveals that according to people's real feelings, Rustling leaves sounds have the highest probability in the low annoyance rating level (x ≤ μ), while Bus sounds have the highest probability in the high annoyance rating level (x > μ). This successfully verifies the correctness of the above model-based correlation analysis between sound sources and annoyance ratings. Furthermore, Table V also shows that Children, Water, Rail, Construction, Shouting, Bells, Motorcycle, Music, Car, General traffic, Screeching brakes, Horn, and Bus are more likely to occur in the high annoyance rating level, while Footsteps, Dog bark, Bird tweet, Rustling leaves, Non-identifiable, and Other are more likely to occur in the low annoyance rating level. Speech, Aircraft, and Ventilation have a similar probability of occurring in the high and low annoyance levels, implying that they may be more prevalent in the soundscape of the test set. In short, both the proposed model-based and human-perceived-based analyses showed similar trends regarding which sound sources are most strongly associated with annoyance levels. The consistency between the two analyses in identifying sound sources most strongly associated with annoyance ratings indicates that the proposed model performs well in predicting the relationships between sound sources and annoyance ratings.

Correlations between sound level and annoyance rating. In addition to exploring the correlation between sound sources and annoyance ratings, we further analyze the correlation between fragment-level A-weighted equivalent sound pressure level (LAeq) and human-perceived annoyance based on Kendall's Tau (often referred to as Kendall's Tau rank correlation coefficient). The corresponding result is (τ = 0.42, p < 0.001). That is, there is a significant correlation between sound level and annoyance rating in the DeLTA dataset. This is not unexpected as the ARP baseline models based on LAeq only in Table IV have some predictive power.

Next, we delve into the relationship between the probability of the presence of sound sources predicted by the model and sound levels. Table V shows that there is no significant Pearson correlation between sound sources and sound levels in the DeLTA dataset. That is, the 24 different classes of sound sources in the DeLTA dataset cannot be identified solely by relying on fragment-level sound level information.

Case study. Notably, there is a significant positive correlation between Music and annoyance ratings in Table V. According to the statistical results in DeLTA (Mitchell et al., 2022), the average annoyance score for clips with Music sources is 4.01, while the average annoyance score for clips without Music sources is 3.29, which implies that most of the presence of Music in DeLTA causes an increase in annoyance rather than being relaxing. Previous studies also show that there are various types of annoying music in daily life (Trotta, 2020).

In order to analyze it in depth, we further filter out all audio clips containing music in DeLTA, totaling 222 15-s clips, with an average sound level of 81.8 dBA. We then analyze the relationship between the sound levels and annoyance ratings of these 222 music clips. The results show that under the condition of music source, there is a significant positive correlation between sound level and human-perceived annoyance (τ = 0.18, p < 0.001) in the DeLTA dataset. In summary, even though the Music is not significantly correlated with the sound level, it is weakly positively correlated with the sound level in Table V.
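The sound-level correlations reported in this subsection can be computed as in the sketch below; the LAeq values, annoyance ratings, and Music-presence mask are random placeholders for the corresponding DeLTA quantities.

```python
# Sketch of the Kendall's tau analysis in Sec. IV C: clip-level LAeq versus human
# annoyance over all test clips, and again restricted to clips containing Music.
import numpy as np
from scipy.stats import kendalltau

rng = np.random.default_rng(0)
laeq = rng.uniform(50, 90, 578)              # clip-level A-weighted levels (dBA)
annoyance = rng.uniform(1, 10, 578)          # human-perceived annoyance ratings
has_music = rng.random(578) < 0.4            # placeholder Music-presence mask

tau_all, p_all = kendalltau(laeq, annoyance)
tau_music, p_music = kendalltau(laeq[has_music], annoyance[has_music])
print(f"all clips:   tau = {tau_all:+.2f}, p = {p_all:.3g}")
print(f"music clips: tau = {tau_music:+.2f}, p = {p_music:.3g}")
```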

In addition, the overall sound level in the audio clips where the Music source exists is high, and the sound level is significantly related to annoyance, which may contribute to the significant positive correlation between Music and annoyance presented in Table V.

In addition to sound level, characteristics of music, such as its style or genre, give people different listening experiences. Previous research reveals the role of music in inciting or mitigating antisocial behaviour, and that certain music genres can soothe or agitate individuals. Additionally, the perception of annoyance may also be affected, depending on the choice of music. For example, genres featuring heavy rhythms, known for their potential to evoke angry emotions (Areni, 2003; Cowen et al., 2020), are often not favoured by listeners in an urban environment context. As highlighted in the study (Landström et al., 1995), a key contributor to annoyance is the presence of a tonal component in the noise. Individuals exposed to intruding noise containing tonal elements tend to report higher levels of annoyance than those exposed to non-tonal noise. Furthermore, reported levels of annoyance tend to increase when the noise contains multiple tonal components. This observation suggests that tonal characteristics present in the sound source (and possibly also in the music) may be a contributing factor to the positive correlation between music and annoyance ratings in Table V.

D. How does the proposed model respond to adding unknown sounds to the soundscape?

To investigate the generalization performance of the proposed DCNN-CaF, we randomly add 20 classes of sound sources as noise to the test set in this paper to explore the model's performance in predicting annoyance ratings in soundscapes with added unknown sound sources. To add a variety of sound source samples to the 578 audio clips in the test set, we first use 20 sound sources from the public ESC-50 dataset (Piczak, 2015) as additional noise sources, each source containing 40 5-s audio samples. Then, we randomly add the 5-s noise source samples to the 15-s audio files in the test set, and each 15-s audio file is randomly assigned 1 to 3 5-s audio samples from the same noise source. During the synthesis process, the signal-to-noise ratio (SNR) defaults to 0. In this way, we get 20 test sets containing different types of noise sources. Therefore, the total number of audio clips containing model-unknown noise is 20 × 578 = 11,560, and the corresponding audio duration is about 48.2 h (15 s × 11,560 = 173,400 s).

Figure 5 shows the average human-annotated annoyance rating for sounds in the test set, the average annoyance rating predicted by the model for the test set without external noise added, and the average annoyance rating predicted by the model for the test set added with 20 classes of noise. As shown in Fig. 5, without additional noise, the average annoyance rating of the 578 15-s audio clips in the test set in the soundscape predicted by the model is similar to that of human-perceived annoyance ratings. The standard deviation of our model prediction (the yellow line on the bars in Fig. 5) is smaller than the corresponding human-perceived annoyance, which intuitively demonstrates that the proposed model achieves a similar effect on the test set as the annoyance ratings from human perception.

Adding the 20 types of sources at an SNR of 0 increases the sound level (i.e., RMS) and, therefore, would most probably increase the annoyance rating. If the model purely relied on the RMS value, as some other noise annoyance models do, it would predict the same increase for all sources. However, different annoyance levels are predicted depending on the source added, which intuitively corresponds better to human perception. The subtle sounds, such as Pouring water and Keyboard typing, are less likely to increase much annoyance. Compared to sound sources that are less likely to introduce human annoyance, such as Water drops and Clapping, the results in Fig. 5 illustrate that sound sources related to machines or engines increase annoyance ratings more strongly for the same increase in average sound level. Overall, the model's predictions in Fig. 5 are consistent with what can be expected, but its validity is not confirmed by experiments with human participation.

Figure 5 presents the performance of the DCNN-CaF model under artificially added unknown noise sources. However, the synthetic dataset in Fig. 5 is difficult to compare with real-life audio in terms of the realism and naturalness of the sound.
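The 0-dB mixing used for this generalization test can be sketched as follows; the file names, sampling rate, and the random placement of the 5-s ESC-50 excerpt inside the 15-s clip are illustrative assumptions rather than the exact synthesis script.

```python
# Sketch of the noise-addition procedure in Sec. IV D: insert a 5-s ESC-50 excerpt into
# a 15-s test clip at a signal-to-noise ratio of 0 dB (equal RMS over the mixed segment).
import numpy as np
import librosa
import soundfile as sf

def mix_at_snr(segment, noise, snr_db=0.0):
    """Scale `noise` so that the segment/noise RMS ratio equals snr_db, then add it."""
    rms_seg = np.sqrt(np.mean(segment ** 2))
    rms_noise = np.sqrt(np.mean(noise ** 2)) + 1e-12
    gain = rms_seg / (rms_noise * 10 ** (snr_db / 20.0))
    return segment + gain * noise

sr = 32000                                                        # assumed sampling rate
clip, _ = librosa.load("delta_test_clip.wav", sr=sr, mono=True)   # 15-s test clip (placeholder)
noise, _ = librosa.load("esc50_excerpt.wav", sr=sr, mono=True)    # 5-s ESC-50 sample (placeholder)

start = np.random.randint(0, len(clip) - len(noise) + 1)          # random placement
mixed = clip.copy()
mixed[start:start + len(noise)] = mix_at_snr(clip[start:start + len(noise)], noise)
sf.write("mixed_clip.wav", mixed, sr)                             # one of the 20 x 578 mixtures
```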

FIG. 5. (Color online) The mean and standard deviation of the predicted annoyance rating under different model-unknown noise sources.


FIG. 6. (Color online) The performance of the proposed DCNN-CaF on real-life audio recordings.

To compare the performance of the proposed model on unknown data and, in particular, investigate its performance for predicting positive effects of soundscape augmentation, we test the DCNN-CaF on a real-world acoustic dataset from a road traffic noise environment (denoted as RTNoise) (Coensel et al., 2011). The experiment in Coensel et al. (2011) shows that adding bird song and fountain sound can reduce human-perceived traffic noise loudness and increase perceived pleasantness. RTNoise contains recordings of freeways, major roads, and minor roads sounds and mixtures of these sounds with two bird choruses and two fountain sounds.

As shown in Fig. 6, whether it is on the freeways, major roads, or minor roads, compared to the source audio clips, the predicted annoyance ratings of the audio clips with added bird sounds or fountain sounds will be reduced to varying degrees. Using the same sounds, tests with human listeners in Coensel et al. (2011) show a similar tendency for perceived traffic noise loudness for freeway sound, but this effect was not so prominent for major roads and minor roads. Human-rated pleasantness in the same experiment shows an opposite trend to the predicted trend in annoyance, which can be seen as the opposite. However, this prior experimental work also showed lower perceived loudness and higher pleasantness for the minor and major roads compared to freeway sound. The DCNN-CaF model does not seem to be able to distinguish between these types of sound. This could be caused by the poor control of noise levels in the online playback during the DeLTA data collection or by the shortness of the audio fragments (15 s) that do not allow the model to learn the difference between short car passages and continuous traffic. Other studies show that natural sounds, especially birdsong, can relax people (Van Renterghem, 2019); and under similar noise exposure conditions, respondents in neighbourhoods with more bird songs and fountains reported lower levels of annoyance (Qu et al., 2023). The response of the proposed model in Fig. 6 to the sounds of birdsong and fountains in a real soundscape successfully matches this existing research.

V. CONCLUSION

Soundscape characterization involves identifying sound sources and assessing human-perceived emotional qualities along multiple dimensions. This can be a time-consuming and expensive process when relying solely on human listening tests and questionnaires. This paper investigates the feasibility of using artificial intelligence (AI) to perform soundscape characterization without human questionnaires. Predictive soundscape models based on measurable features, such as the model proposed here, can enable perception-focused soundscape assessment and design in an automated and distributed manner, beyond what traditional soundscape methods can achieve. This paper proposes the cross-attention-based DCNN-CaF using two kinds of acoustic features to ensure the accuracy and reliability of the AI model, and simultaneously perform both the sound source classification task and the annoyance rating prediction task. The proposed AI model in this paper is trained on the DeLTA dataset, which contains sound source labels and human-annotated labels along one of the emotional dimensions of perception: annoyance.

Our experimental analysis demonstrates the following findings: (1) the proposed DCNN-CaF with dual-input branches using Mel spectrograms and loudness-related RMS features outperforms models using only one of these features. (2) On the sound source classification and annoyance rating prediction tasks, the DCNN-CaF with the attention-based fusion of two features outperforms DNN, CNN, and CNN-Transformer, which concatenate two features directly. In addition, attention visualization in the DCNN-CaF model shows that the cross-attention-based fusion module successfully pays different attention to the information of different acoustic features, which is beneficial for the fusion of these acoustic features. (3) Correlation analysis shows that the model successfully predicts the relationships between various sound sources and annoyance ratings, and these predicted relationships are consistent with those perceived by humans in the soundscape. (4) Generalization tests show that the model's ARP in the presence of model-unknown sources is consistent with expert expectations and can explain previous findings from the literature on soundscape augmentation. Future work involves extending the soundscape appraisal with other dimensions and taking into account more practical factors, such as participants' hearing and cultural and linguistic differences, to expand the

training dataset to cover more acoustic scenarios. Fan, J., Thorogood, M., and Pasquier, P. (2017). “Emo-soundscapes: A
dataset for soundscape emotion recognition,” in International Conference
Furthermore, to improve the interpretability of the proposed
on Affective Computing and Intelligent Interaction, pp. 196–201.
model, the following work will try to visualize the learned Fan, J., Thorogood, M., Riecke, B. E., and Pasquier, P. (2015). “Automatic
weights of the model through heatmap analysis to clarify recognition of eventfulness and pleasantness of soundscape,” in
which neurons play a more critical role in learning to help Proceedings of the Audio Mostly 2015 on Interaction with Sound, pp.
1–6.
explain the decision-making process of the model. Fan, J., Tung, F., Li, W., and Pasquier, P. (2018). “Soundscape emotion rec-
ognition via deep learning,” in Proceedings of the Sound and Music
Computing.
ACKNOWLEDGMENTS Fang, X., Gao, T., Hedblom, M., Xu, N., Xiang, Y., Hu, M., Chen, Y., and
Qiu, L. (2021). “Soundscape perceptions and preferences for different
The WAVES Research Group received funding from groups of users in urban recreational forest parks,” Forests 12(4), 468.
the Flemish Government under the “Onderzoeksprogramma Gemmeke, J., Ellis, D., Freedman, D., Jansen, A., Lawrence, W., Moore, R.
Artifici€ele Intelligentie (AI) Vlaanderen” programme. C., Plakal, M., and Ritter, M. (2017). “AudioSet: An ontology and
human-labeled dataset for audio events,” in Proceedings of ICASSP, pp.
776–780.