AI-based Soundscape Analysis
ABSTRACT:
Soundscape studies typically attempt to capture the perception and understanding of sonic environments by
surveying users. However, for long-term monitoring or assessing interventions, sound-signal-based approaches are
required. To this end, most previous research focused on psycho-acoustic quantities or automatic sound recognition.
Few attempts were made to include appraisal (e.g., in circumplex frameworks). This paper proposes an artificial
intelligence (AI)-based dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF) to
perform automatic soundscape characterization, including sound recognition and appraisal. Using the DeLTA dataset
containing human-annotated sound source labels and perceived annoyance, the DCNN-CaF is trained to perform
sound source classification (SSC) and human-perceived annoyance rating prediction (ARP). Experimental findings
indicate that (1) the proposed DCNN-CaF using loudness and Mel features outperforms the DCNN-CaF using only
one of them. (2) The proposed DCNN-CaF with cross-attention fusion outperforms other typical AI-based models
and soundscape-related traditional machine learning methods on the SSC and ARP tasks. (3) Correlation analysis
everyday context, uneventful and unpleasant soundscapes are often not noticed and do not contribute to the experience of the place. Hence, Sun et al. (2019) propose a soundscape classification that acknowledges that sonic environments can be pushed into the background. Only foregrounded soundscapes contribute to the appraisal and are classified as disruptive and supportive, the latter being either calming or stimulating (Sun et al., 2019). Based on audio recordings containing implicit information in soundscapes, Thorogood et al. (2016) established the background and foreground classification task within a musicological and soundscape context. For urban park soundscapes, Gong et al. (2022) introduce the concepts of "importance" and "performance" and position the soundscape elements in this two-dimensional plane. The importance dimension reflects to what extent a particular sound is an essential part of this soundscape. The perception study underlying this paper (Mitchell et al., 2022) can be seen as a foregrounded soundscape assessment with annoyance as its primary dimension, which is a negative dimension of soundscape assessment. This type of assessment is often used to identify sources of noise, and it allows researchers to identify sources of annoyance that can cause negative health effects.

Research on annoyance has been carried out based on non-acoustic approaches from different perspectives in the

soundscapes is found to depend on contexts (Hong and Jeon, 2015). The sounds that people hear in their environment can have a substantial impact on their overall appraisal of that environment, and they can influence people's emotional and cognitive responses to their living surroundings. Hence, any acoustic signal processing attempting to predict this is very likely to benefit from automatic sound recognition. Automatic sound recognition in predicting people's perception of the soundscape has the potential to improve our understanding of how the acoustic environment affects our perceptions, and can inform the development of more effective interventions to promote positive outcomes. Therefore, Boes et al. (2018) propose to use an artificial neural network to predict both the sound source (human, natural, and mechanical) perceived by users of public parks as well as their appraisal of the soundscape quality. It was shown that sound recognition outperforms psychoacoustic indicators in predicting each of these perceptual outcomes. Identifying specific sounds in urban soundscapes is relevant for assisting drivers or self-driving cars (Marchegiani and Posner, 2017), and for urban planning and environment improvement (Ma et al., 2021).

In this paper, a new artificial intelligence (AI) method, inspired by the approach presented in Mitchell et al. (2023), is introduced to identify various sound sources and predict one of the components in a circumplex appraisal of the sonic environment: annoyance. More specifically, this paper pro-
An intuitive observation is that in real-life soundscapes, loud sounds naturally attract more human attention than quieter sounds. For example, on the side of the street, the sound of roaring cars will capture people's attention more than the sound of small conversations on the corner. Therefore, this paper exploits the loudness-related root mean square value (RMS) (Mulimani and Koolagudi, 2018) and Mel spectrogram (Bala et al., 2010) features, which conform to human hearing characteristics, to predict the objective sound sources and perceived annoyance ratings. The proposed model uses convolutional blocks to extract high-level representations of the two features and a cross-attention module to fuse their semantic representations. Based on the proposed model, this paper explores the following research questions (RQs):

(1) RQ1: Can the model's performance be improved using two acoustic features?
(2) RQ2: How does the performance of the proposed model compare with other models on the ARP task and the SSC task, as well as the joint ARP and SSC tasks? Does the cross-attention-based fusion module in the model work well?
(3) RQ3: Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels?
(4) RQ4: How does the proposed model respond to adding

how to extract audio representations from the input audio clips, and then perform CaF on the audio representations. Finally, we use different loss functions to train the task-dependent branches of the model to complete the classification-based SSC task and the regression-based ARP task.

A. Audio representation extraction

Since the Mel spectrograms common in sound-source-related tasks and the RMS features that can reflect the energy of sound sources are both used in this paper, there are two branches of inputs to the DCNN-CaF model to extract high-level representations of the two acoustic features separately, as shown in Fig. 1. Inspired by the excellent performance of purely convolution-based pretrained audio neural networks (PANNs) (Kong et al., 2020) in audio-related tasks, a convolutional structure similar to that in PANNs is used in Fig. 1 to extract the representation of the input acoustic features. Specifically, the dual-input model in Fig. 1 uses 4-layer convolutional blocks. Each convolutional block contains two convolutional layers with global average pooling (GAP). The representations of the Mel spectrograms and RMS features generated by the convolutional blocks, Rm and Rr, are fed to the attention-based fusion module to generate representa-
FIG. 1. (Color online) The proposed dual-branch convolutional neural network with cross-attention-based fusion (DCNN-CaF). The dimension of the output
of each layer is shown.
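As a concrete illustration of the dual-branch extraction stage in Fig. 1 (four convolutional blocks per branch, each with two convolutional layers, reducing the 480 input frames to the (512, 30) representations Rm and Rr), a minimal PyTorch sketch is given below. The 1-D-over-time formulation, kernel sizes, batch normalisation, and pooling schedule are assumptions chosen so that the stated input and output shapes match; they are not taken from the authors' released code.

    import torch
    import torch.nn as nn

    class ConvBlock(nn.Module):
        # Two convolutions over time followed by average pooling by a factor
        # of 2, loosely in the style of PANNs (Kong et al., 2020).
        def __init__(self, in_ch, out_ch):
            super().__init__()
            self.net = nn.Sequential(
                nn.Conv1d(in_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm1d(out_ch), nn.ReLU(),
                nn.Conv1d(out_ch, out_ch, kernel_size=3, padding=1),
                nn.BatchNorm1d(out_ch), nn.ReLU(),
                nn.AvgPool1d(2),
            )

        def forward(self, x):
            return self.net(x)

    class Branch(nn.Module):
        # Four stacked blocks: 480 frames -> 30 time steps, 512 channels.
        def __init__(self, n_feat):
            super().__init__()
            self.blocks = nn.Sequential(
                ConvBlock(n_feat, 64), ConvBlock(64, 128),
                ConvBlock(128, 256), ConvBlock(256, 512),
            )

        def forward(self, x):           # x: (batch, n_feat, 480)
            return self.blocks(x)       # (batch, 512, 30)

    mel = torch.randn(8, 64, 480)       # 64-band log-Mel input
    rms = torch.randn(8, 1, 480)        # frame-level RMS input
    Rm, Rr = Branch(64)(mel), Branch(1)(rms)   # both (8, 512, 30)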
et al., 2017), MHA is calculated on a set of queries (Q), keys (K), and values (V), where

head_i = A(Q W_i^Q, K W_i^K, V W_i^V),   (2)

A(Q W_i^Q, K W_i^K, V W_i^V) = softmax(Q W_i^Q (K W_i^K)^T / √d) V W_i^V,   (3)

where head_i represents the output of the ith attention head for a total number of h heads. W_i^Q, W_i^K, W_i^V, and W^O are learnable weights. For MHA in the encoder, Q, K, and V come from the same place; in that case, the attention in MHA is called self-attention (Vaswani et al., 2017). All the parameters (such as h = 8 and d_k = d_v = d_model/h = 512/8 = 64) follow the default settings of the Transformer (Vaswani et al., 2017).

From the corresponding dimensions of the output of each layer in Fig. 1, it can be seen that the dimensions of Rm and Rr are both (512, 30), which correspond to the number of filters of the previous convolutional layer and the number of frames, respectively. After a series of convolutional layer operations, the input 480 frames of Mel spectrograms

distance between the predicted and the human-annotated annoyance ratings. Then, the final loss function of the DCNN-CaF model in this paper is

L = L_SSC + L_ARP.   (8)

III. DATASET, BASELINE, AND EXPERIMENTAL SETUP

A. Dataset

To the best of our knowledge, DeLTA (Mitchell et al., 2022) is the only publicly available dataset that includes both ground-truth sound source labels and human annoyance rating scores, so we use it in this paper. DeLTA comprises 2890 15-s binaural audio clips collected in urban public spaces across London, Venice, Granada, and Groningen. A remote listening experiment performed by 1221 participants was used to label the DeLTA recordings. In the listening experiment, participants listened to 10 15-s binaural recordings of urban environments, assessed whether they contained any of the 24 classes of sound sources, and then
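Equations (2) and (3) above are the standard multi-head scaled dot-product attention of Vaswani et al. (2017). Returning to the fusion module, a minimal PyTorch sketch of cross-attention between the two branch representations Rm and Rr, under the stated defaults (h = 8, d_model = 512), is shown below together with the joint loss of Eq. (8). The bidirectional MHA1/MHA2 arrangement, the multi-label binary cross-entropy for L_SSC, and the mean squared error for L_ARP are plausible assumptions rather than the authors' exact implementation.

    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    d_model, heads = 512, 8
    mha1 = nn.MultiheadAttention(d_model, heads, batch_first=True)  # Mel queries attend to RMS
    mha2 = nn.MultiheadAttention(d_model, heads, batch_first=True)  # RMS queries attend to Mel

    # Rm, Rr: (batch, 512, 30) from the two branches; attention expects
    # (batch, time, channels), so the channel and time axes are swapped.
    Rm = torch.randn(4, 512, 30).transpose(1, 2)
    Rr = torch.randn(4, 512, 30).transpose(1, 2)

    Fm, attn1 = mha1(query=Rm, key=Rr, value=Rr)
    Fr, attn2 = mha2(query=Rr, key=Rm, value=Rm)
    fused = torch.cat([Fm, Fr], dim=-1)    # fed to the SSC and ARP output layers

    def total_loss(ssc_logits, ssc_labels, arp_pred, arp_target):
        # Eq. (8): L = L_SSC + L_ARP, with task-dependent loss functions.
        l_ssc = F.binary_cross_entropy_with_logits(ssc_logits, ssc_labels)
        l_arp = F.mse_loss(arp_pred, arp_target)
        return l_ssc + l_arp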
complex relationships between input features and target variables. Decision tree regression is known for its ability to handle non-linear relationships and interactions among features. Random forest regression is an ensemble method that combines multiple decision trees to improve predictive accuracy and reduce overfitting. KNN regression can work well when there is a relatively small dataset and in low-dimensional spaces.

C. Baseline for sound source classification (SSC) task

In SSC-related research, deep learning convolutional neural network (CNN)-based models have achieved widespread success, and recently, Transformer-based models have become dominant. Therefore, for the SSC task, the classical CNN-based YAMNet (Plakal and Ellis, 2023) and PANN (Kong et al., 2020), and the Transformer-based AST (Gong et al., 2021) are used as baselines. Since YAMNet, PANN, and AST are trained on the large-scale AudioSet (Gemmeke et al., 2017), the last layer of YAMNet, PANN, and AST has 527, 521, and 527 output units, respectively. In contrast, the SSC task in this paper has only 24 classes of audio events, so we modify the number of units in the last layer of all three models to 24 and then fine-tune them on the DeLTA dataset.

D. Baseline for joint SSC and ARP tasks

1. Deep neural networks (DNN)

The DNN consists of two branches. Each branch contains four fully connected layers and ReLU functions (Boob et al., 2022), where the number of units in each layer is 64, 128, 256, and 512, respectively. The outputs of the final fully connected layer of the two branches are concatenated and combined to feed to the SSC and ARP layers, respectively.

2. Convolutional neural networks (CNN)

Similar to the DNN, the compared CNN also consists of two branches. Each branch includes two convolutional layers, where the number of filters in each convolutional layer is 32 and 64, respectively. The outputs of the convolutional layers are concatenated and combined to feed to the SSC and ARP layers, respectively.

3. CNN-Transformer

The CNN-Transformer is based on the CNN, and an Encoder from the Transformer (Vaswani et al., 2017) is added after the final convolutional layer of the CNN. After the output of the Encoder is flattened, it is fed to the SSC and ARP layers, respectively.

E. Training setup and metric

The 64-filter-bank logarithmic Mel-scale spectrograms (Bala et al., 2010) and frame-level root mean square values (RMS) (Mulimani and Koolagudi, 2018) are used as the acoustic features in this paper. A Hamming window length of 46 ms and a window overlap of 1/3 (Hou et al., 2022a) are used for each frame. A batch size of 64 and the Adam optimizer (Kingma and Ba, 2015) with a learning rate of 1e-3 are used to minimize the loss of the proposed model. The model is trained for 100 epochs.

The SSC is a classification task, so accuracy (Acc), F-score, and the threshold-free area under the curve (AUC) are used to evaluate the classification results. The ARP is viewed as a regression task in this paper, so mean absolute error (MAE) and root mean square error (RMSE) are used to measure the

Two kinds of acoustic features are used in this paper: the Mel spectrograms that approximate the characteristics of human hearing and the RMS features that characterise the acoustic level. Table I shows the ablation experiments of the two acoustic features on the proposed DCNN-CaF model to specifically present the performance of the DCNN-CaF model based on different features. When only a single feature is used, the input of the DCNN-CaF model is the corresponding single branch.
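The two acoustic features described in Sec. III E can be computed with librosa roughly as follows; the sampling rate and the rounding of the hop length are assumptions, while the 64 Mel bands, the 46 ms Hamming window, and the 1/3 overlap follow the text above.

    import librosa

    def extract_features(path, sr=32000):          # sampling rate assumed
        y, sr = librosa.load(path, sr=sr, mono=True)
        win = int(0.046 * sr)                      # 46 ms Hamming window
        hop = int(win * 2 / 3)                     # 1/3 overlap between frames
        mel = librosa.feature.melspectrogram(
            y=y, sr=sr, n_fft=win, win_length=win, hop_length=hop,
            window="hamming", n_mels=64)
        log_mel = librosa.power_to_db(mel)         # (64, T) log-Mel spectrogram
        rms = librosa.feature.rms(y=y, frame_length=win, hop_length=hop)  # (1, T)
        return log_mel, rms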
TABLE I. Ablation study on the acoustic features.

No.   Features    MAE           RMSE          AUC           F-score (%)    Acc (%)
1     Mel         1.00 ± 0.15   1.18 ± 0.12   0.89 ± 0.01   61.12 ± 5.45   90.98 ± 0.99
2     RMS         1.08 ± 0.14   1.27 ± 0.10   0.79 ± 0.02   53.64 ± 3.20   89.33 ± 1.73
3     Mel + RMS   0.84 ± 0.12   1.05 ± 0.13   0.90 ± 0.01   67.20 ± 3.16   92.52 ± 0.87
TABLE II. Performance of models trained only for the SSC task.
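Adapting the pretrained SSC baselines of Sec. III C, i.e., replacing the AudioSet output layer (527 or 521 units) with a 24-unit layer before fine-tuning on DeLTA, can be sketched as follows; the backbone object and the attribute name of its classification head are placeholders, not the actual YAMNet, PANN, or AST interfaces.

    import torch.nn as nn

    def adapt_for_delta(model, num_classes=24):
        # Placeholder: assumes the backbone exposes its AudioSet output layer
        # as `model.classifier`; the real attribute name differs per model.
        in_features = model.classifier.in_features
        model.classifier = nn.Linear(in_features, num_classes)
        return model

    # Fine-tuning then proceeds on DeLTA with a multi-label loss (e.g., binary
    # cross-entropy) over the 24 sound-source classes.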
As shown in Table I, the DCNN-CaF model performs the worst on the ARP and SSC tasks when using only the 46 ms interval RMS features, which are related to instantaneous loudness. This is apparently caused by the lack of spectral information, which is embedded in the Mel spectrograms and omitted from the RMS features. The dimension of the frame-level RMS used in this paper is (T, 1), where T is the number of frames. Compared with the Mel spectrograms with a dimension of (T, 64), the spectral information contained in the loudness-related one-dimensional RMS features is much scarcer. This factor makes it difficult for the model to distinguish the 24 types of sound sources and predict annoyance for the varied real-life sound sources in the DeLTA dataset based on the RMS features alone. The DCNN-CaF using Mel spectrograms outperforms the corresponding results with RMS features overall, while the DCNN-CaF combining Mel spectrograms and RMS features achieves the best results, which shows that using these two acoustic features together benefits the model's performance on the SSC and ARP tasks. Thus, adding energy level-related informa-

small dataset used in this paper, and large and deep models are prone to overfitting during the training process.

Table III shows the joint ARP and SSC baselines proposed in Sec. III D. For a fairer comparison, the DNN, CNN, and CNN-Transformer in Sec. III D also use a dual-input branch structure to simultaneously use the two acoustic features of Mel spectrograms and RMS to complete the SSC and ARP tasks.

As shown in Table III, the CNN based on the convolutional structure outperforms the DNN based on multi-layer perceptrons (MLP) (Kruse et al., 2022) on both tasks, which reflects that the convolutional structure is more effective than the MLP structure in extracting acoustic representations. While performing well on the ARP regression task, the CNN-Transformer combining convolution and an Encoder from the Transformer has the worst result correspond-
TABLE III. Comparison of different models for joint SSC and ARP tasks on the DeLTA dataset.
Model             Param. (M)   MAE (ARP)     RMSE (ARP)    AUC (SSC)     F-score (%, SSC)   Acc (%, SSC)
DNN               0.376        1.06 ± 0.19   1.32 ± 0.08   0.87 ± 0.01   49.57 ± 8.78       89.30 ± 2.21
CNN               1.271        0.94 ± 0.06   1.15 ± 0.06   0.88 ± 0.01   52.03 ± 3.09       90.34 ± 0.46
CNN-Transformer   17.971       1.00 ± 0.13   1.14 ± 0.09   0.86 ± 0.02   46.89 ± 6.32       88.56 ± 1.24
DCNN-CaF          7.614        0.84 ± 0.12   1.05 ± 0.13   0.90 ± 0.01   67.20 ± 3.16       92.52 ± 0.87
TABLE IV. Comparison of sound level-based approaches on the ARP task.

Method   Linear regression   Decision tree   SVR    Random forest   KNN
MAE      1.03                1.45            1.01   1.27            1.13
RMSE     1.26                1.84            1.26   1.60            1.41

handling outliers and its ability to effectively model nonlinear relationships (Izonin et al., 2021). In summary, the LAeq-based traditional approaches in Table IV show competitive performance on the ARP task, and their performance is close to that of the deep learning neural network-based methods in Table III.

To intuitively present the results of DCNN-CaF for annoyance rating prediction, Fig. 2 visualizes the gap between the annoyance ratings predicted by DCNN-CaF and the corresponding ground-truth annoyance ratings. The red points representing the predicted values and the blue points indicating the true labels in Fig. 2 mostly match well, indicating that the proposed model successfully regresses the annoyance ratings in the real-life soundscape.

For an in-depth analysis of the performance of the DCNN-CaF, Fig. 3 further visualizes the attention distribution from the cross-attention-based fusion module on some test samples. As described in Sec. III B, the number of time steps of each head in the multi-head attention (MHA) is 30,

of all the attention distributions of the 8 heads of MHA1 and MHA2, please see the project webpage.

C. Does the proposed model capture the relationships between sound sources and annoyance ratings? What are the relationships between sound sources, annoyance ratings, and sound levels?

To identify which of the 24 classes of sounds is most likely to cause annoyance, we first analyze the relationship between the sounds identified by the model and the annoyance it predicts. Then, the predictions from the model are compared to the human classification. Specifically, we first use Spearman's rho (David and Mallows, 1961) to analyze the correlation between the probability of the various sound sources predicted by the model and the corresponding annoyance ratings. Then, we calculate the distribution of sound sources at different annoyance ratings, and further verify the model's predictions based on the human-annotated sound sources and annoyance rating labels.

Correlation analysis between the model's sound source classification and annoyance ratings. A Shapiro-Wilk (with α = 0.05) statistical test (Hanusz et al., 2016) is performed before the correlation analysis of the model's predicted sound sources and annoyance ratings on the test set. The results of the Shapiro-Wilk test showed no evidence that the
FIG. 2. (Color online) Scatter plot of annoyance ratings of model predictions and human-annotated labels on some samples in the test set. Black vertical
lines indicate gaps between two points.
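The sound-level-based ARP baselines of Table IV can be reproduced with scikit-learn along the following lines; the random arrays stand in for per-clip sound-level (e.g., LAeq) features and annoyance ratings, and no hyperparameters from the paper are implied.

    import numpy as np
    from sklearn.linear_model import LinearRegression
    from sklearn.tree import DecisionTreeRegressor
    from sklearn.svm import SVR
    from sklearn.ensemble import RandomForestRegressor
    from sklearn.neighbors import KNeighborsRegressor
    from sklearn.metrics import mean_absolute_error, mean_squared_error

    rng = np.random.default_rng(0)                 # dummy stand-in data
    X_train, y_train = rng.normal(60, 5, (100, 1)), rng.uniform(1, 10, 100)
    X_test, y_test = rng.normal(60, 5, (20, 1)), rng.uniform(1, 10, 20)

    models = {
        "Linear regression": LinearRegression(),
        "Decision tree": DecisionTreeRegressor(),
        "SVR": SVR(),
        "Random forest": RandomForestRegressor(),
        "KNN": KNeighborsRegressor(),
    }
    for name, model in models.items():
        pred = model.fit(X_train, y_train).predict(X_test)
        mae = mean_absolute_error(y_test, pred)
        rmse = mean_squared_error(y_test, pred) ** 0.5
        print(f"{name}: MAE={mae:.2f}, RMSE={rmse:.2f}")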
FIG. 3. (Color online) Attention distributions of the cross-attention-based fusion module in DCNN-CaF on audio clips from the test set. The subgraphs (1,
2) and (3, 4) are from (MHA1, MHA2) of sample #1 and #2, respectively. Brighter colours represent larger values.
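The normality check and rank correlation described above (a Shapiro-Wilk test at α = 0.05 followed by Spearman's rho between the predicted probability of a sound source and the predicted annoyance rating) can be sketched with SciPy; the arrays below are illustrative stand-ins for the model's outputs on the test set.

    import numpy as np
    from scipy.stats import shapiro, spearmanr

    rng = np.random.default_rng(0)
    source_prob = rng.uniform(size=200)    # predicted probability of one sound source per clip
    annoyance = rng.uniform(1, 10, 200)    # predicted annoyance rating per clip

    # A small Shapiro-Wilk p-value (< 0.05) indicates non-normal data,
    # motivating the rank-based (Spearman) correlation used in the paper.
    for name, x in [("probability", source_prob), ("annoyance", annoyance)]:
        stat, p = shapiro(x)
        print(f"Shapiro-Wilk for {name}: W={stat:.3f}, p={p:.3g}")

    rho, p = spearmanr(source_prob, annoyance)
    print(f"Spearman rho={rho:.3f}, p={p:.3g}")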
the highest negative correlation with the annoyance rating, with a coefficient of -0.731.

Verifying the model's predictions based on human-perceived manually annotated labels. Based on the correlation analysis of the model's predictions, some sound sources are more likely to cause annoyance than others. To investigate the consistency of the correlation analysis results between the model-based predictions and the human-annotated labels, we calculate the distribution of sound sources at different annoyance rating levels based on the human-annotated labels to explore the correlations between the sound sources and the annoyance levels, as shown in Fig. 4.

annoyance rating less than or equal to l is n_{i,l}, and the total number of occurrences in audio samples with an annoyance rating greater than l is n_{i,h}. N_i = n_{i,l} + n_{i,h}, where N_i is the total number of samples containing the sound source s_i. Then, the probability of the sound source occurring in the samples where the annoyance is lower than or equal to l is

P(x ≤ l | s_i) = n_{i,l} / (n_{i,l} + n_{i,h}) = n_{i,l} / N_i,   (9)

where x represents the annoyance rating for fragments containing the sound source s_i. Correspondingly, the probability of it occurring in samples higher than l is P(x > l | s_i) = n_{i,h} / N_i.
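Equation (9) simply counts, for each sound source s_i, the fraction of the clips containing it whose annoyance rating does not exceed the threshold l; a short sketch with illustrative array names:

    import numpy as np

    def p_low_annoyance(contains_source, annoyance, l):
        # contains_source: boolean per clip (does the clip contain s_i?);
        # annoyance: human-annotated rating per clip; l: rating threshold.
        ratings = annoyance[contains_source]   # the N_i clips containing s_i
        n_low = np.sum(ratings <= l)           # n_{i,l}
        return n_low / ratings.size            # P(x <= l | s_i) = n_{i,l} / N_i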
FIG. 4. (Color online) Distribution of the number of samples from different sources at low and high annoyance rating levels, that is, n_{i,l} and n_{i,h} (shown as Nl and Nh), and the corresponding P(x ≤ l | s_i) curves.
Table V comprehensively shows the probability distribution of the 24 classes of sound sources at different levels of annoyance rating according to human perception.

Music source exists is high, and the sound level is significantly related to annoyance, which may contribute to the significant positive correlation between Music and annoyance presented in Table V.

In addition to sound level, characteristics of music, such as its style or genre, give people different listening experiences. Previous research reveals the role of music in inciting or mitigating antisocial behaviour, and that certain music genres can soothe or agitate individuals. Additionally, the perception of annoyance may also be affected, depending on the choice of music. For example, genres featuring heavy rhythms, known for their potential to evoke angry emotions (Areni, 2003; Cowen et al., 2020), are often not favoured by listeners in an urban environment context. As highlighted in Landström et al. (1995), a key contributor to annoyance is the presence of a tonal component in the noise. Individuals exposed to intruding noise containing tonal elements tend to report higher levels of annoyance than those exposed to non-tonal noise. Furthermore, reported levels of annoyance tend to increase when the noise contains multiple tonal components. This observation suggests that tonal characteristics present in the sound source (and possibly also in the music) may be a contributing factor to the positive correlation between music and annoyance ratings in Table V.

(τ = 0.42, p < 0.001). That is, there is a significant correlation between sound level and annoyance rating in the DeLTA dataset. This is not unexpected as the ARP baseline

defaults to 0. In this way, we get 20 test sets containing different types of noise sources. Therefore, the total number of audio clips containing model-unknown noise is 20 × 578 = 11 560, and the corresponding audio duration is about 48.2 h (15 s × 11 560 = 173 400 s).

Figure 5 shows the average human-annotated annoyance rating for the sounds in the test set, the average annoyance rating predicted by the model for the test set without external noise added, and the average annoyance rating predicted by the model for the test set with the 20 classes of noise added. As shown in Fig. 5, without additional noise, the average annoyance rating of the 578 15-s audio clips in the test set predicted by the model is similar to that of the human-perceived annoyance ratings. The standard deviation of our model's predictions (the yellow line on the bars in Fig. 5) is smaller than that of the corresponding human-perceived annoyance, which intuitively demonstrates that the proposed model achieves a similar effect on the test set to the annoyance ratings from human perception.

Adding the 20 types of sources at an SNR of 0 increases the sound level (i.e., RMS) and, therefore, would most probably increase the annoyance rating. If the model purely relied on the RMS value, as some other noise annoyance models do, it would predict the same increase for all sources. However, different annoyance levels are predicted
FIG. 5. (Color online) The mean and standard deviation of the predicted annoyance rating under different model-unknown noise sources.
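Mixing a model-unknown noise source into a 15-s test clip at an SNR of 0 dB, as in the experiment above, amounts to scaling the noise so that both signals have equal power before adding them; a minimal sketch:

    import numpy as np

    def mix_at_snr(clean, noise, snr_db=0.0):
        # Scale `noise` so that the clean-to-noise power ratio equals snr_db,
        # then add it to the clip; both inputs are 1-D arrays of equal length.
        p_clean = np.mean(clean ** 2)
        p_noise = np.mean(noise ** 2)
        gain = np.sqrt(p_clean / (p_noise * 10 ** (snr_db / 10)))
        return clean + gain * noise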
FIG. 6. (Color online) The performance of the proposed DCNN-CaF on real-life audio recordings.
proposed model on unknown data and, in particular, investigate its performance for predicting positive effects of soundscape augmentation, we test the DCNN-CaF on a real-world acoustic dataset from a road traffic noise environment (denoted as RTNoise) (Coensel et al., 2011). The experiment in Coensel et al. (2011) shows that adding bird song and fountain sounds can reduce the human-perceived traffic noise loudness and increase the perceived pleasantness. RTNoise contains recordings of freeway, major road, and minor road sounds, and mixtures of these sounds with two bird choruses and two fountain sounds.

along multiple dimensions. This can be a time-consuming and expensive process when relying solely on human listening tests and questionnaires. This paper investigates the feasibility of using artificial intelligence (AI) to perform soundscape characterization without human questionnaires. Predictive soundscape models based on measurable features, such as the model proposed here, can enable perception-focused soundscape assessment and design in an automated and distributed manner, beyond what traditional soundscape methods can achieve. This paper proposes the cross-attention-based DCNN-CaF using two kinds of acoustic features
training dataset to cover more acoustic scenarios. Furthermore, to improve the interpretability of the proposed model, the following work will try to visualize the learned weights of the model through heatmap analysis to clarify which neurons play a more critical role in learning, to help explain the decision-making process of the model.

ACKNOWLEDGMENTS

The WAVES Research Group received funding from the Flemish Government under the "Onderzoeksprogramma Artificiële Intelligentie (AI) Vlaanderen" programme.

Abraham, A., Sommerhalder, K., and Abel, T. (2010). "Landscape and well-being: A scoping study on the health-promoting impact of outdoor environments," Int. J. Public Health 55, 59–69.
Acun, V., and Yilmazer, S. (2018). "Understanding the indoor soundscape of study areas in terms of users' satisfaction, coping methods and perceptual dimensions," Noise Cont. Eng. J. 66(1), 66–75.
Al-Shargabi, A. A., Almhafdy, A., AlSaleem, S. S., Berardi, U., and Ali, A. A. M. (2023). "Optimizing regression models for predicting noise pollution caused by road traffic," Sustainability 15(13), 10020.
Areni, C. S. (2003). "Examining managers' theories of how atmospheric music affects perception, behaviour and financial performance," J. Retail. Consumer Serv. 10(5), 263–274.
Axelsson, Ö., Nilsson, M. E., and Berglund, B. (2010). "A principal components model of soundscape perception," J. Acoust. Soc. Am. 128(5).
Fan, J., Thorogood, M., and Pasquier, P. (2017). "Emo-soundscapes: A dataset for soundscape emotion recognition," in International Conference on Affective Computing and Intelligent Interaction, pp. 196–201.
Fan, J., Thorogood, M., Riecke, B. E., and Pasquier, P. (2015). "Automatic recognition of eventfulness and pleasantness of soundscape," in Proceedings of the Audio Mostly 2015 on Interaction with Sound, pp. 1–6.
Fan, J., Tung, F., Li, W., and Pasquier, P. (2018). "Soundscape emotion recognition via deep learning," in Proceedings of the Sound and Music Computing.
Fang, X., Gao, T., Hedblom, M., Xu, N., Xiang, Y., Hu, M., Chen, Y., and Qiu, L. (2021). "Soundscape perceptions and preferences for different groups of users in urban recreational forest parks," Forests 12(4), 468.
Gemmeke, J., Ellis, D., Freedman, D., Jansen, A., Lawrence, W., Moore, R. C., Plakal, M., and Ritter, M. (2017). "AudioSet: An ontology and human-labeled dataset for audio events," in Proceedings of ICASSP, pp. 776–780.
Gong, Y., Chung, Y. A., and Glass, J. (2021). "AST: Audio Spectrogram Transformer," in Proceedings of INTERSPEECH, pp. 571–575.
Gong, Y., Cui, C., Cai, M., Dong, Z., Zhao, Z., and Wang, A. (2022). "Residents' preferences to multiple sound sources in urban park: Integrating soundscape measurements and semantic differences," Forests 13(11), 1754.
Hanusz, Z., Tarasinska, J., and Zielinski, W. (2016). "Shapiro–Wilk test with known mean," REVSTAT-Stat. J. 14(1), 89–100.
Hong, J. Y., and Jeon, J. Y. (2015). "Influence of urban contexts on soundscape perceptions: A structural equation modeling approach," Landscape Urban Plann. 141, 78–87.
Hou, Y. (2023). "AI-Soundscape," https://github.com/Yuanbo2020/AI-Soundscape (Last viewed 9/11/2023).
Hou, Y., and Botteldooren, D. (2022). "Event-related data conditioning for
level," in Proceedings of International Congress on Noise as a Public Health Problem, pp. 225–231.
Li, Y., Liu, M., Drossos, K., and Virtanen, T. (2020). "Sound event detection via dilated convolutional recurrent neural networks," in Proceedings of ICASSP, pp. 286–290.
Li, Z., Hou, Y., Xie, X., Li, S., Zhang, L., Du, S., and Liu, W. (2019). "Multi-level attention model with deep scattering spectrum for acoustic scene classification," in IEEE International Conference on Multimedia & Expo Workshops (ICMEW), pp. 396–401.
Ma, K. W., Mak, C. M., and Wong, H. M. (2021). "Effects of environmental sound quality on soundscape preference in a public urban space," Appl. Acoust. 171, 107570.
Mackrill, J., Cain, R., and Jennings, P. (2013). "Experiencing the hospital ward soundscape: Towards a model," J. Environ. Psychol. 36, 1–8.
Marchegiani, L., and Posner, I. (2017). "Leveraging the urban soundscape: Auditory perception for smart vehicles," in IEEE International Conference on Robotics and Automation, pp. 6547–6554.
Maristany, A., Lopez, M. R., and Rivera, C. A. (2016). "Soundscape quality analysis by fuzzy logic: A field study in Cordoba, Argentina," Appl. Acoust. 111, 106–115.
Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., and Plumbley, M. D. (2018a). "Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge," IEEE/ACM Trans. Audio. Speech. Lang. Process. 26(2), 379–393.
Mesaros, A., Heittola, T., and Virtanen, T. (2018b). "Acoustic scene classification: An overview of DCASE 2017 challenge entries," in International Workshop on Acoustic Signal Enhancement (IWAENC), pp. 411–415.
Mitchell, A., Brown, E., Deo, R., Hou, Y., Kirton-Wingate, J., Liang, J., Sheinkman, A., Soelistyo, C., Sood, H., and Wongprommoon, A. (2023). "Deep learning techniques for noise annoyance detection: Results from an intensive workshop at the Alan Turing Institute," J. Acoust. Soc. Am. 153(3), A262.
Mitchell, A., Erfanian, M., Soelistyo, C., Oberman, T., Kang, J., Aldridge,
Plakal, M., and Ellis, D. (2023). "YAMNet," https://github.com/tensorflow/models/tree/master/research/audioset/yamnet (Last viewed 9/11/2023).
Politis, A., Mesaros, A., Adavanne, S., Heittola, T., and Virtanen, T. (2021). "Overview and evaluation of sound event localization and detection in DCASE 2019," IEEE/ACM Trans. Audio. Speech. Lang. Process. 29, 684–698.
Qu, F., Li, Z., Zhang, T., and Huang, W. (2023). "Soundscape and subjective factors affecting residents' evaluation of aircraft noise in the communities under flight routes," Front. Psychol. 14, 1197820.
Raimbault, M., and Dubois, D. (2005). "Urban soundscapes: Experiences and knowledge," Cities 22(5), 339–350.
Ren, J., Jiang, X., Yuan, J., and Magnenat-Thalmann, N. (2017). "Sound-event classification using robust texture features for robot hearing," IEEE Trans. Multimedia 19(3), 447–458.
Salamon, J., Jacoby, C., and Bello, J. P. (2014). "A dataset and taxonomy for urban sound research," in Proceedings of the ACM International Conference on Multimedia, pp. 1041–1044.
Sun, K., De Coensel, B., Filipan, K., Aletta, F., Van Renterghem, T., De Pessemier, T., Joseph, W., and Botteldooren, D. (2019). "Classification of soundscapes of urban public open spaces," Landscape Urban Plann. 189, 139–155.
Szychowska, M., Hafke-Dys, H., Preis, A., Kociński, J., and Kleka, P. (2018). "The influence of audio-visual interactions on the annoyance ratings for wind turbines," Appl. Acoust. 129, 190–203.
Thorogood, M., Fan, J., and Pasquier, P. (2016). "Soundscape audio signal classification and segmentation using listeners perception of background and foreground sound," J. Audio Eng. Soc. 64(7/8), 484–492.
Timmons, A. C., Han, S. C., Chaspari, T., Kim, Y., Narayanan, S., Duong, J. B., Simo Fiallo, N., and Margolin, G. (2023). "Relationship satisfaction, feelings of closeness and annoyance, and linkage in electrodermal activity," Emotion 23, 1815.