Review
There is a large body of research on ASC, and in recent years it has been growing rapidly. DCASE also organizes different types of competitions for the submission of papers solving the various ASC tasks, giving research scholars the opportunity either to participate in those competitions or to provide enhanced models for ASC. This paper provides details about the various recent approaches, along with the block diagrams, for the pre-processing required before model development for ASC. It also includes a description of different recent techniques used for the classification of different sounds in ASC tasks. A comparative analysis of the different recent techniques, both for pre-processing and for classification, has been carried out and summarized in this paper. The paper also describes its contributions to the survey literature on ASC by comparing itself with some existing survey papers on several parameters, such as functionality described separately, results description with quantifiable values, a dataset with proper quantifiable analysis, and pictorial representation of the models discussed. Finally, considering the benefits for eminent research scholars, this paper also focuses on future directions for both pre-processing and classification for ASC.

Keywords: Acoustic scene classification; Accuracy; Audio sound; CNN; Data curation; DCASE; DNN; Feature extraction; ML; MFCC; Pre-processing; Receptive field; Sound event detection
∗ Corresponding author.
E-mail addresses: vikash.s@smit.smu.edu.in (V.K. Singh), kalpana.s@smit.smu.edu.in (K. Sharma), samar.sur@gmail.com (S.N. Sur).
https://doi.org/10.1016/j.eswa.2023.120520
Received 27 September 2022; Received in revised form 12 May 2023; Accepted 18 May 2023
Available online 25 May 2023
0957-4174/© 2023 Elsevier Ltd. All rights reserved.
with the existing methodologies. In order to resolve these challenges, a model should be created with the help of suitable technique(s) and audio signal features. The developed model should reliably recognize the class of sound produced from the environment of acoustic scenes in the presence of multiple sounds.

As stated above, there are various challenges in classifying audio sounds into appropriate classes for the environment in which they were recorded. To overcome these challenges, not only the DCASE community but also researchers from different organizations across the globe have contributed a lot. It was identified that the stated challenges can be overcome with suitable pre-processing techniques, such as front-end Deep Neural Networks (DNN) for feature extraction and SCalable Automatic REpairing (SCARE), a machine learning technique, and by building appropriate classifiers with the help of different types of neural networks, for example ASC with Convolutional Neural Network (CNN) variants, Receptive Field (RF)-Regularized CNNs, Support Vector Machines (SVM), and integrated pre-trained DNNs.

The authors have reviewed many papers and compared this work with some recent survey papers. The comparison has been made on several parameters, such as the abstract, the number of papers reviewed, results discussion with a proper outcome, the evaluation metrics used, graphical representation of the models for understanding, and the baseline set for further advancements in the field of DCASE tasks. The comparison summary is detailed in Table 1. Additionally, some of the highlights of the authors' contributions in this study are listed below:

1. Review on pre-processing for acoustic scene classification — Several recent papers are reviewed, and their details, such as findings, methodology, implementation details, results obtained, evaluation parameters, future directions, and dataset used, are summarized. Many recent papers are reviewed in detail with proper block diagrams.
2. Review on classification for acoustic scene classification — Several recent papers are reviewed, and their details, such as findings, implementation details, results obtained, evaluation parameters, future directions, and dataset used, are summarized. Numerous recent papers are reviewed in depth with clean, appropriate block diagrams.
3. Comparative analysis of the different pre-processing and classification techniques for acoustic scene classification, together with their accuracy, has been highlighted.
4. Future directions for both pre-processing and classification for ASC are covered in both paragraph and tabular format.
5. Comparison of different existing survey papers with the authors' survey paper has been carried out on several parameters, such as functionality described separately, result analysis with quantifiable values, a dataset with proper quantifiable analysis, pictorial representation of the models discussed, and future scope for eminent researchers.

There are many acronyms, such as ASC, DCASE, ER, GAN, IoT, ML, MFCC, TAG, and WER, employed in this article; their details are enumerated in Table 2.

This paper summarizes, in a systematic way, the details of the various recent approaches and deep learning and neural network-based algorithms, along with the block diagrams used for pre-processing and classification of different sounds for ASC tasks. This systematic approach is illustrated in Fig. 1. Section 2 concentrates on the different methods of pre-processing for acoustic scene classification — the authors have reviewed many recent papers, and their details, such as findings, methodology, implementation details, results obtained, evaluation parameters, future directions, and dataset used, are condensed in Tables 3 and 4. Some recent papers on different pre-processing techniques are reviewed in detail with proper, neat block diagrams.
Table 1
A comprehensive list of the existing survey papers and comparison. For each survey, the table lists the author details/year, the abstract (scope), the number of papers reviewed, the number of papers reviewed in detail and summarized, and whether the survey provides: (a) methodology and functionality described in detail; (b) results described in general; (c) future scope listed in general; (d) evaluation metrics discussed; (e) future scope listed in detail with quantifiable values; (f) results description with quantifiable values; (g) dataset general description; (h) dataset with proper quantifiable analysis; (i) pictorial representation of the models discussed; (j) baseline identified for future advancement in DCASE.

Author details/year | Abstract | Papers reviewed | Reviewed in detail | (a) | (b) | (c) | (d) | (e) | (f) | (g) | (h) | (i) | (j)
This paper | Review on data preprocessing and classification for ASC tasks | 179 | 47 (29 for preprocessing, 18 for classification) | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes
Nogueira et al. (2022) | Summarizes the most recent works on sound classification models | 75 | 26 | Yes | Yes | Yes | Yes | No | Yes | Yes | Yes | Yes | No
Singh et al. (2022) | Survey on speech recognition techniques for processing speech acoustics | 366 | 142 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No
Liu et al. (2022) | Overview of self-supervised learning techniques used in audio and speech processing | 195 | 57 | Yes | Yes | Yes | Yes | No | No | Yes | No | Yes | No
Abeßer (2020) | Review on various DL methods for ASC | 99 (general description) | 0 | Yes | Yes | Yes | Yes | Yes | No | Yes | No | No | Yes
Sharma et al. (2020) | Different feature extraction techniques are discussed | 151 | 0 | Yes | Yes | Yes | No | No | Yes | No | No | Yes | No
Thirumuruganathan et al. (2020) | Data curation using DL techniques is reviewed | 52 (short description) | 0 (general DL techniques for DC discussed) | Yes | Yes | No | No | Yes | Yes | No | No | Yes (very few generic models) | Yes
Purwins et al. (2019) | Review on the DL techniques for audio signal processing | 151 (excluding application) | 0 (very short; no detailed description of even a single paper) | Yes | Yes | No | Yes | Yes | No | Yes | No | No | No
Xia et al. (2019) | Survey of the NN-based DL approaches for acoustic event detection | 94 (only DL methods discussed) | 36 | Yes | Yes | Yes | Yes | Yes | No | Yes | No | Yes | No
Mesaros et al. (2018a) | Tasks and outcomes of the DCASE 2016 challenge are discussed | 64 (general description of 4 DCASE tasks) | 36 | Yes | Yes | Yes | Yes | Yes | No | Yes | Yes | No | Yes
Barchiesi et al. (2015) | Review on ASC (tutorial on ASC) | 55 | 21 | Yes | Yes | Yes | Yes | Yes | Yes | Yes | No | No | Yes
Similar to Section 2, Section 3 also describes various techniques for the classification of the acoustic scene for ASC — several recent papers are reviewed, and their details, such as findings, implementation details, results obtained, evaluation parameters, future directions, and dataset used, are summarized in Table 24. Some recent papers on different classification techniques used for acoustic scene classification are reviewed in detail with proper, neat block diagrams. Future directions for both pre-processing and classification for ASC are detailed in both paragraph and tabular format in Section 4. Finally, the conclusion of this review is presented in Section 5.

2. Review on pre-processing

2.1. Data curation with deep learning

Methods for data curation with deep learning have been proposed by Thirumuruganathan et al. (2020).
Table 2
List of acronyms.
Phrase Acronym Phrase Acronym
Acoustic Scene Classification ASC Hierarchical Regularized Logistic Regression HR-LR
Acoustic Event Detection AED Internet of Things IoT
Accuracy Acc Information Flow of Things IFoT
Area Under the Receiver Operating characteristic Curve AUROC Keyword Spotting KWS
Atrous Convolutional Neural Networks ACNNs Knowledge-Base KB
Audio Tagging System ATS Local Binary Patterns LBP
Bidirectional Gated Recurrent Units BiGRU Long Short Term Memory LSTM
Convolutional Neural Network CNN Label-Weighted Label-Ranking Precision lwlrap
Convolutional Recurrent Neural Network CRNN Machine Learning ML
Constraint Mutual Subspace Method CMSM Mean Opinion Score MOS
Constant Q-Transform CQT Maximum Mean Discrepancy MMD
Data Curation DC Minimum Reconstruction Error MRE
Difference of Gaussian DoG Majority Voting MV
Deep Learning DL Maximum Likelihood ML
Detection and Classification of Acoustic Scenes and Events DCASE Multi-Class Open-set Evolving Recognition MCOSR
Deep Convolutional Generative Adversarial Network DCGAN Multi-Input MI
Deep Neural Network DNN Mean Squared Error MSE
Deeper CNN DCNN Mel-Frequency Cepstral Coefficients MFCC
Deep Convolutional Auto Encoders DCAEs Natural Language Processing NLP
Deep Self-Adaptive-Genetic-Algorithm DeepSAGA Ordinary Differential Equations ODE
Domain Adaptation DA Per-Channel Energy Normalization PCEN
Environment Sound Classification ESC Repeating Pattern Extraction Technique REPET
Entity Resolution ER Receiver Operating Characteristic ROC
Error Rate ER Receptive Field RF
Extreme Value Machine EVM Receptive Field Regularized RFR
False-Positive Rate FPR SCalable Automatic REpairing SCARE
Fast Fourier Transformation FFT Singular Value Decomposition SVD
Feedforward Neural Networks FNN Sliced Wasserstein Distance SWD
Focal Loss FL Stochastic Gradient Descent SGD
F-Score F1 Short Time Fourier Transform STFT
Factorized CNN FCNN Sound Event Detection SED
Gated Recurrent Unit GRU Stochastic Weight Averaging SWA
Gated Recurrent Neural Networks GRNN Support Vector Data Description SVDD
Gaussian Mixture Model GMM Support Vector Machine SVM
Gammatone Frequency Cepstral Coefficients GFCC Audio Tagging TAG
Generative Adversarial Networks GAN Time-Delay Neural Networks TDNN
Global Average Pooling GAP True-Positive Rate TPR
Harmonic Percussive Source Separation HPSS Weibull-Calibrated SVM W-SVM
Histogram of Oriented Gradients HOG Word Error Rate WER
Data curation, or the procedure of searching, incorporating, and cleansing data for analytical operations, is a necessary step for any researcher who wishes to obtain correct facts and statistics from the data. The data curation pathway is depicted in Fig. 2. The authors have looked into exciting research opportunities for addressing various issues, such as what is required to significantly advance a difficult field like DC, how to apply DL techniques to DC, and how to find the most relevant and intriguing leads for DC given the numerous DL research initiatives. Data Ocean, Data Discovery (Table Search), Data Assimilation (Outline Representation, Schema Matching, Object Determination), Data Cleaning (Data Imputation, Data Alteration), and Professional Significance are some of the general steps used in data curation. The authors have shown the DL architectures for DC only for entity matching and object determination (the DeepER and DeepMatcher architectures, respectively). To manage the burgeoning big data ecosystem, creative solutions to the venerable problem of data curation are needed. Many academic fields, both inside and beyond the purview of computer science, are responding favorably to deep learning. The fusion of these two fields will spark several research initiatives that will result in practical answers for various DC challenges. The block diagrams for the different DC challenges and their procedures are illustrated in Figs. 3A and 3B. The authors have indicated areas for future research that include developing DC-aware DL architectures and learning distributed representations for database-aware pieces such as tuples or columns. The authors have discussed several interesting solutions to the data-hungry nature of disciplined DL. They have highlighted the early accomplishments in applying DL to crucial DC tasks, including entity matching and data discovery. Additionally, they issued a rallying cry to the DC community to grasp this opportunity to make significant progress in this field (Thirumuruganathan et al., 2020).

2.2. Analyses of machine learning techniques for data curation in diverse contexts

A data stream curation technique that can be used for a better understanding of the data has been proposed by Salah et al. (2019). Understanding, analyzing, examining, and organizing the data is called curation. If there is a large set of data, data curation is important for analyzing the data so as to reduce the cost of big data analysis. Data curation is very important in every field where large sets of data are handled daily, such as health, medicine, and the environment. The useful data must be obtained from the big data storage pool to get better output with better accuracy. In a real-life working environment, data with highly correlated features are expected so that decisions can be made accurately. With a large set of data available for analysis, the best ML classifier should be chosen to analyze the data; for that, all classifiers whose functionality may be related to the features of the data should be tried, and their accuracy should be determined for the analysis. Based on the given data, its class should be identified, and various feature selection processes should be used to pick the most adequate features to be used in the ML model. In healthcare, the given data are properly extracted, and based on the analysis and feature selection process, highly correlated data are used.
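To make the classifier-comparison and feature-selection procedure described above concrete, the following minimal scikit-learn sketch (an illustrative assumption, not code from any reviewed paper) scores a few candidate classifiers with cross-validation after univariate feature selection on a synthetic dataset:

```python
# Illustrative sketch of "try several ML classifiers and keep the most accurate one"
# combined with univariate feature selection; data and model choices are assumptions.
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, f_classif
from sklearn.model_selection import cross_val_score
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.ensemble import RandomForestClassifier

# Synthetic stand-in for a curated dataset with a few informative (highly correlated) features.
X, y = make_classification(n_samples=500, n_features=30, n_informative=6,
                           n_redundant=4, random_state=0)

candidates = {
    "logistic_regression": LogisticRegression(max_iter=1000),
    "svm_rbf": SVC(),
    "random_forest": RandomForestClassifier(n_estimators=200, random_state=0),
}

for name, clf in candidates.items():
    # Select the 10 most informative features, scale them, then fit the classifier.
    pipe = make_pipeline(SelectKBest(f_classif, k=10), StandardScaler(), clf)
    scores = cross_val_score(pipe, X, y, cv=5, scoring="accuracy")
    print(f"{name}: mean accuracy = {scores.mean():.3f} (+/- {scores.std():.3f})")
```

In practice, the candidate set and the number of selected features would be chosen according to the characteristics of the curated data rather than fixed in advance.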
Fig. 2. DC pathway (Fernandez et al., 2017; Heer et al., 2015; Mille, 2014; Stonebrake & Ilyas, 2018; Stonebraker et al., 2013; Thirumuruganathan et al., 2020).
Fig. 3. A: Data curation challenges. B: Block diagram for data curation procedure (Cicco et al., 2019; Ebaid et al., 2019; Ebraheem et al., 2018; Mudgal et al., 2018; Szegedy
et al., 2013; Thirumuruganathan et al., 2020).
Fig. 4. Block diagram for data stream curation (Banerjee et al., 2018; Fujisawa et al., 2015; Kudo et al., 2016; Miyamoto et al., 2017; Pezoulas et al., 2019; Salah et al., 2019;
Sowe & Zettsu, 2013; Yang et al., 2017; Yasumoto et al., 2016).
The process of building new features from already-existing data to train the ML model is referred to as feature engineering, also known as feature generation. This is more significant than the actual model itself, because it makes use of the ML method, learns from the data given to it, and is essential for developing features that are pertinent to a task. If a machine is trained in such a way that it automates its features, runs them, and can add new features to the existing ones, then improved results can be obtained. It must be understood that technology is evolving, and so is this world. The main issues and different techniques for data stream curation, along with the comparative study and different applications, are shown in Fig. 4. In relevance to the future research direction, the basic data curation and feature engineering concepts can be applied with suitable newly available ML techniques to improve the accuracy obtained by the existing model.

2.3. Processes, techniques, and issues with data cleansing for enormous data

Ridzuan and Zainon (2019) review "Data Cleansing Methods for Big Data". The authors claim that, while there are numerous data curation methods that can be used to address the problem, and despite the different challenges associated with it, it is still difficult to meet the requirements of big data. Various steps are required for data cleansing: data analysis, mapping rules, verification, transformation, and backflow of cleaned data. The details of all these steps are given in Fig. 5. The authors have reviewed several methods for data cleansing, such as cleansing of data using standard methods and cleansing methods for big data; all of these methods are depicted in Fig. 6. Among all these methods of data cleansing, "SCalable Automatic REpairing (SCARE)", a machine learning technique for data cleansing, has been observed to be a very useful and efficient method. SCARE concentrates on the scalability concern of data cleansing, and it does not require a domain specialist throughout the process; an expert is only required to validate the dataset modifications. For the data to be covered, a set of rules developed by subject-matter experts is necessary. SCARE was shown to be dependent on the consistency of the training data quality and on the value of the threshold, and addressing this dependence could enhance SCARE's functionality. The cleaning procedure is done in parallel.
Fig. 5. Data cleansing process (Cohen et al., 2015; Rahm & Do, 2000; Ridzuan & Zainon, 2019; Sidi et al., 2012).
Fig. 6. Data cleansing methods (Chu et al., 2015a, 2015b; Khayyat et al., 2015; Lee et al., 2000; Ridzuan & Zainon, 2019; Wang et al., 2016; Yakout et al., 2013).
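As a rough illustration of the thresholded, ML-based repair idea behind SCARE (this is a toy sketch under assumed data and column names, not the actual SCARE system), a classifier trained on presumably clean records can predict one attribute from the others, and a suspect value is replaced only when the predicted probability exceeds a chosen threshold:

```python
# Hypothetical illustration of thresholded ML-based value repair in the spirit of SCARE;
# the dataset, column names, and threshold are assumptions, not the original system.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Toy records: "city" is predicted from the other attributes.
clean = pd.DataFrame({
    "zip":  ["10115", "10115", "80331", "80331", "20095", "20095"],
    "area": ["Mitte", "Mitte", "Altstadt", "Altstadt", "Altstadt", "Altstadt"],
    "city": ["Berlin", "Berlin", "Munich", "Munich", "Hamburg", "Hamburg"],
})
dirty = pd.DataFrame({
    "zip":  ["10115", "80331"],
    "area": ["Mitte", "Altstadt"],
    "city": ["Berlyn", "???"],          # suspected erroneous values
})

features = ["zip", "area"]
# One-hot encode the predictive attributes so both frames share the same columns.
X_clean = pd.get_dummies(clean[features])
X_dirty = pd.get_dummies(dirty[features]).reindex(columns=X_clean.columns, fill_value=0)

model = RandomForestClassifier(n_estimators=100, random_state=0)
model.fit(X_clean, clean["city"])

THRESHOLD = 0.8                          # repair only confident predictions
proba = model.predict_proba(X_dirty)
for i, row_proba in enumerate(proba):
    best = row_proba.argmax()
    if row_proba[best] >= THRESHOLD:
        print(f"row {i}: replace {dirty.loc[i, 'city']!r} with {model.classes_[best]!r} "
              f"(confidence {row_proba[best]:.2f})")
    else:
        print(f"row {i}: low confidence, keep {dirty.loc[i, 'city']!r} for expert review")
```

The threshold plays the same role as the threshold ML parameter discussed above: raising it reduces erroneous automatic repairs at the cost of leaving more values for expert review.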
Table 3
Summary of existing work on different pre-processing techniques for ASC.
Author details Year Findings Methodol- Implementation/ Results Evaluation Metric / Limitation/Future Datasets
ogy/Functionality Evaluation Tools scope
mechanism
Jung et al. (2020c) 2020 1. Three integrated DNN 1. 128-dimensional 1. Optimization 1. DcaseNet-v3 1. Overall 1. For joint training 1. For ASC Task -
architectures, DcaseNets are Mel-spectrogram. Technique - Adam performance is Classification degradation of DCASE 2020
proposed for three task – Number of Bins - 2048, with highest Accuracy for ASC accuracy in Task 1-A dataset
ASC, Audio Tagging(AT) and 2. Extraction Mechanism learning rate : 0.001 2. Accuracy for ASC 2. Label-Weighted ASC 2. For TAG Task -
Sound Event Detection(SED). - FFT 2. Hyper-parameters after fine Label DCASE 2019
2. Developed DNN 3. Window Size/Overlap for Training – tuning: 70.35% and -Ranking Precision Task 2 dataset
architecture based on human - 40ms/20ms Fixed for all ASC, ER : 31.85% for 3. For SED Task -
cognitive processes, such as 4. Number of Interactions TAG and SED 3. Architecture TAG DCASE 2020
adults’ scene perception - 500 as one 3. NN Used - CRNN trained on a single 3. F-Score and Error Task 3 dataset
and newborns’ long-term epoch, Number of - 8 layers with 512 task - DcaseNet-v1 - Rate
learning. Epochs - 160 output filters in last F1:79.62%, for SED
3. Fine-tuning for the 5. Batch Size - ASC:32, layer ER:34.86%.
integrated DNNs for a TAG:24, 4. Memorization DcaseNet-v2 -
particular SED:32, Segment Sensitivity Technique Acc:69.54%
task boosts performance. Duration for SED - Used - DcaseNet-v3 -
5 and 30 s Mix-up LWLRP:70.62%
Spoorthy et al. (2021) 2021 1. Audio recordings are 1. CNN for classification 1. Log-Mel 1. The CRNN model 1. Accuracy 1. The different 1. DCASE 2019 ASC
divided into classes based on 2. CRNN for classification Spectrogram features performed 2. Confusion matrices feature Task 1.A
the 3. Activation functions: 2. STFT before well for the representations can Dataset - 10 acoustic
scenes and environments in ReLU, extraction of features mel-spectrogram be scenes
which they were recorded. LeakyReLU, and ELU are 3. Window Size : feature. used to improve the 14400 number audio
2. Audio segments are used used to 40 ms with 50% 2. CRNN – 90.96% performance of the recordings
to extract feature evaluate the model overlap accuracy with ASC 1440 audio
representations, which are 4. Window Function ELU activation system. recordings in each
then fed into deep learning : Hamming function for class
models. Asymmetric classification - Sampling Rate :
5. Mel-bands number Improvement 10.9%. 48000 Hz ,
: 40 3. CNN – 73.61% Resolution: 24-bit
with LeakyReLU
Kumpawat and Dey (2021) 2021 1. The spectrograms helps to 1. ASC tasks: CNN (ReLu 1. Data loaded as 1. 68% accuracy 1. Accuracy 1. Grid search can 1. DCASE 2019
mitigate the problem of and Softmax batch of 32. obtained in 24th be used for Challenge Dataset
changing of frequency and activation function), 2. Frequency epoch. optimizing 10 s audio signals
amplitude of audio. pre-trained VGG16. masking and random 2. 67.6% accuracy the parameters used for 10
2. The data augmentation 2. Spatiotemporal factor frequency with Testing for data acoustic scenes.
techniques aid in the better consideration -time stretching are dataset. augmentation. 1440 segments for
fitting of examined data into for temporal factors. done for data 2. Domain specific each scene
the model. 3. Pre-Trained Neural analysis results are obtained 40 h audio dataset
3. Problem revolving around Network for and data but
land and for that model Classification augmentation. novel solution in
is built to classify some 3. Pre-processing: various different
pre-defined acoustic scene. Array using sampling domain
of diverse
technique and linear background is
algebra are used. required.
Table 3 (continued).
Thirumuruganathan 2020 1. Measure the Data 1. Representation 1. Word Embeddings 1. DC-aware DL 1. Accuracy 1. How can values, 1. ImageNet-14M
et al. (2020) Curation(DC) Challenges Learning for Data based Approaches, architectures will integrity restrictions, images over
Ebaid et al. (2019) Curation Combining Word and give better result. and 20k categories
2. DL Architectures for Graph Embeddings. other metadata be
Data Curation 2. Take specific DC used when designing
Task tasks and break an
them algorithm for
into simpler learning cell
components and embeddings?
evaluate if 2. Weakly Supervised
there are existing DL DL Models through
architectures that data profiling.
could be reused. 3. Research
opportunities for
answering the
different questions
during every steps of
Data
Curation Pipeline.
Abeßer (2020) 2020 1. Deep learning based ASC 1. Current ASC - mono 1. Preprocessing - 1. Rapid increase of 1. Accuracy 1. DCASE challenges 1. DCASE 2018 and
algorithms, data or stereo signals Gradient Descent scientific 2. Error Rate still focus on fixed 2019
preparation, and data 2. Log-spectrogram is based publications by 3. Confusion matrix spectrogram based challenges dataset is
modeling are summarized. used as network algorithms, PCEN, recent advances 4. ROC curve signal representation used.
2. ASC algorithm from input for Fixed Signal 2. Filtering-Repeating in the field of deep 5. F1-Score etc instead of learnable
DCASE takes spectrogram Transformations Pattern Extraction learning such signal transformation.
as network input based on 3. DCGAN and recurrent Technique (REPET) as transfer learning, 2. State-of-the-art
Mel spectrogram. Sequence to and Harmonic attention ASC algorithms have
3. Data Preparation task - Sequence Auto Encoder Percussive Source mechanisms, and
Signal Representation, representations Separation (HPSS) multitask matured and can be
Preprocessing and Data for Learnable Signal algorithms, learning, as well as applied in
Augmentation. Transformations 3. Difference of the release of context-aware
4. Signal Representation - Gaussians(DoG) and various public devices. In real-world
Monaural vs. Multi- Sobel Filtering datasets - State-of- scenarios, novel
Channel Signals, Fixed Signal the-art ASC challenges need to
Transformations algorithms have been be faced like
and Learnable Signal microphone
Transformations. given better results. mismatch and DA,
open-set
classification,
model complexity
and processing
constraints.
Shuyang et al. (2020) 2020 1. SED technique requires a 1. Splitting each 1. Change Point 1. Annotating only 1. Segment-based 1. Optimal 1. TUT Rare Sound
huge amount of labeled recording into segments. Detection is used for 2% of the training Error Rate(ER) combination of 2017(DCASE
data for training purpose. 2. Sample selection Active data - SED 2. Accuracy active learning 2017 Task 2
2. An active learning system Process - Mismatch- Learning Process. performance is and semi-supervised Challenge) - Target
obtains the accuracy of First Farthest-Traversal 2. Sample selection similar to learning methods event classes: baby
SED model with less amount Mechanism. is done from annotating all the could cry, gunshot,
of annotation process. 3. TUT Rare Sound - Candidate training data. be more convenient and glass breaking.
3. Audio change point Sampling Rate:44.1 KHz, Segment. 2. With rare events for reducing 2. TAU Spatial
detection is used for Event-to-Background 3. Annotator is used dataset, more than annotation Sound 2019
generating Ratio: −6 or 0 or to label the 90% of labeling task. (DCASE 2019 Task 3
segments. 6 dB. TAU Spatial Sound recording budget can be saved 2. Labeling with Challenge)-
- from sample clustered data is not sound events from
Sampling Rate:48 kHz, selection. by using the SED considered 11 classes, with
EBR: 30 dB 4. Model Network system. 20 instances in each
consists of six blocks 3. Best performance class.
of - annotating only
gated CNNs. 5% of the training
set.
Suh et al. (2020) 2020 1. Acoustic scene 1. For Subtask A : 1. Use of Focal loss 1. Task 1 Subtask A 1. Accuracy 1. Issues with the 1. For Subtask A :
classification systems for 23,040 samples, which attenuates the - Accuracy : model for Subtask B TAU Urban
DCASE audio samples collected 73.7% in - Acoustic Scene 2020
2020 challenge Task 1. from three real log-loss generated classification for test Data with multiple Mobile
2. For Subtask A : Log Mel devices and six simulated with the aid of split labels are not dataset.
spectrogram with deltas devices. using well 2. Task 1 Subtask B considered. 2. For Subtask B :
/delta-deltas is used for data 2. For Subtask B : 14400 -trained samples. - Accuracy : 2. The duration of TAU Urban
preprocessing. samples, 2. Optimizer: SGD 97.6% the data is of 10 s Acoustic Scenes 2020
3. For Subtask B : Log Mel different acoustic scenes optimizer with a which may contains dataset
spectrogram is used for from twelve momentum of 0.9. useful information
data preprocessing. European cities, 3. For Subtask A very
4. Spectrum of 431 frames Three labels - indoor, :models trained at rarely.
was yielded from 10 s outdoor and 62, 126,
audio file, and each spectrum transportation and 254 epochs
- 256 bins of Mel 4. For Subtask B
frequency scale :models trained at
254 and
510 epochs
Table 3 (continued).
Koutini et al. (2020) 2020 1. Description of Task 1 of 1. Adaptive Pruning for 1. Architecture - CP 1. The baseline 1. Accuracy 1. Using the 1. DCASE 2020
the DCASE-2020 challenge. parameter ResNet ensemble of ResNet with decomposed Challenge –
2. Subtask 1.A – Receptive reduction. smaller models for frequency convolutional layer, Dataset - TAU
Field (RF) Regularized CNN 2. Decomposed pruning and CP -damping achieves slightly decrease the Ur-ban Acoustic
model as a baseline. convolutional layer for ResNet 97.6% accuracy on accuracy compared Scenes 2019 dataset
3. Subtask 1.B – Different reduction of non zero Decomposed for the test data. to - containing
parameter reduction methods, parameters. decomposition. 2. With 13 times full-parameter recording from
3. Pruning- The fewer parameters, baseline. multiple European
such as pruning, while resulting network the cities and ten
keeping the networks’ has 249386 pruned network acoustic scenes
receptive parameters and a (CP-Res Damp.-GP)
field. total size of 487.082 achieves accuracy of
KB. 97%.
4. Decomposed 3. Decomposed
Network - 17520 convolutions –
parameters 95.95%
and a total size of
34.21875 KB.
Jung et al. (2020b) 2020 1. ASC model developed with 1. 128 Mel-frequency 1. Augmentation 1. Classification 1. Accuracy 1. Increase the 1. DCASE 2019 Task
Audio Tagging System channels Type - SpecAugment accuracy for dimensionality of the 2 for Audio
(ATS) inspired by human 2. Number of Audio and concatenation and code and Tagging System to
perception mechanisms. Clips - 4970 for Slicing multi-head attention two representations extract Tag
2. Tag vector is obtained curated subset and 2. Memorization and is 75.66% and separately exists in Vectors
from ATS and it is 19815 with noisy Sensitivity Technique 75.58%, respectively, each 2. DCASE 2019 Task
concatenated labels. Number of Sound subspace of the 1 A dataset
with ASC system. Events – 80, Used - Mixup : compared to 73.63% concatenated for ASC Task
3. Multi-head attention was 3. Audio Recording Beyond Empirical for the baseline. representation. Audio recording
adopted to emphasize the Duration - 40 h Risk 2. The system’s 2. Separate from 12 different
feature map of the ASC 4. Recording Type - Minimization accuracy is 76.75% transformation layers European cities.
using Tag vector. Stereo and 3. Model for Model- Multihead led to worse Number of Acoustic
Sampling Rate - 48000 Architecture - Attention Using performance because Scene - 10
ResNet. Tag Vector with ASC of over-fitting due to
ASC architecture – Features too
Front-end DNN for many parameters of
feature extraction fully connected
and SVM layers.
Salah et al. (2019) 2019 1. Importance of IoT 1. Enhanced the error 1. On-cloud IoT 1. After resolving all 1. Accuracy 1. Research towards 1. Crowd
Yang et al. (2017) Generated data Streams detection, curation framework issues related to having curated sourced/structured
Curation minimized the data and IFoT data stream curation, data-driven open datasets
stream curation techniques redundancy and -Curator framework better results can advancements.
for better understanding of preserving 2. Second-Order be obtained using
the data are highlighted. Autoregressive time suitable machine
series learning techniques.
modeling, Ordinary
Differential Equations
(ODE), Functional
regressions, Principle
Differential Analysis
using ODE
Salah et al. (2019) 2019 1. IoT Generated Streams 1. Fast learning on 1. Automating 1. After resolving all 1. Accuracy 1. All existing issues 1. Relational datasets
Fujisawa et al. Curation technique is real-time data. Decision Making issues related to 2. Confusion Matrix related to DC need (Feature
(2015) discussed 2. Utilizing the predictive Analytics- data stream curation, collaboration and Tools) - Automated
data analytic to Random Forest better results can further proposals to Feature
enhance the process of Algorithm be obtained using be solved
automation ML techniques. to enhance the
automation using ML
schemes.
Salah et al. (2019) 2019 1. IoT Generated Streams 1. Disaster Automated 1. Require actionable 1. Mini batch k 1. Accuracy 1. To improve the 1. Data Stream
Sowe and Zettsu Curation technique is Responses Intelligence means gives better automation utilizing Curation in NLP
(2013) discussed 2. Streams Anomaly 2. Require advanced result for accuracy ML and Automated
Detection ML techniques of clustering methods, all current Features
3. Feature 3. Feature algorithms using DC-related challenges
recommendation Engineering unsupervised
architecture extracted features – require collaboration
to tackle the IoT analytic more than 35% and new solutions.
Salah et al. (2019) 2019 1. Streams Curation 1. Apache Hadoop based 1. Unsupervised 1. Better results can 1. Accuracy 1. All existing 1. Crowd
Banerjee et al. Techniques cloud feature extraction be obtained using DC-related difficulties sourced/structured
(2018) 2. Automate the process of environment and and ML algorithms after demand open datasets, IoT
feature extraction and MapReduce for data clustering against the resolving the collaboration and Data with
selection- including preprocessing normal supervised problems associated novel solutions in Automated Feature
construction and selection for 2. Feature Engineering approaches for the with data stream order to
the statistical feature curation. improve automation
features selection. using ML techniques.
2. Bench-marking
methodology for
different
automated ML open
source platforms like
Tree-Based Pipeline
Optimization Tool
Table 3 (continued).
Ridzuan and Zainon 2019 1. Enhanced the error 1. Data Cleansing - 1. SCARE - The recall of all the 1. Accuracy 1. Threshold ML 1. DCASE dataset -
(2019) detection, minimized the data Machine Learning probabilistic repairing 2. Threshold parameter value is very large
Yakout et al. (2013) technique principles technique approaches is in the hard to set real-world datasets,
redundancy and preserving to provide range of 30 precisely due to data SCARE
2. Repairing and cleansing predictions for to 65%. SCARE with redundancy. - Intel Lab Data, US
process Key Features - multiple attributes the described 2. Scalability is the Census Data
Scalability 2. SCARE depends tuple repair selection main issue when
3. Four approaches for on threshold ML shows the designing
SCARE - SCARE with the parameter. highest precision. the data cleansing
described tuple repair techniques to deal
selection strategy, SCARE with the
with volume and variety
the majority voting, the of data.
single model approach and
constraint-based repair
approaches
Lostanlen et al. (2019) 2019 1. The pointwise logarithm 1. Distributions of 1. PCEN on various PCEN is an interface 1. Covariance 1. Unlike Principal 1. The SONYC
of mel-frequency magnitudes in the datasets of natural for robustly matrices Component Analysis dataset - 66 ten
spectrogram (logmelspec) as mel-frequency acoustic detecting and of frequency (PCA), -second recordings,
an acoustic frontend. spectrogram. environments. classifying acoustic channels, PCEN can be 22 urban
2. Preserves the locality 2. Gaussianizes events in 2. Bode plot of the implemented in real sound classes, 7.3M
structure of harmonic distributions of heterogeneous filter time and coefficients.
patterns magnitudes contexts distributed across 2. The DCASE 2013
along the mel-frequency axis. while decorrelating that is sensors Scene
frequency bands. computationally Classification dataset
efficient. - 100 half-
minute recordings
from 10
soundscape classes,
100 × 30 = 3000 s
of audio
Wu and Lee (2019) 2019 1. Explained how an audio 1. Class Activation 1. Difference of 1. LogMel-128 - 1. Accuracy 1. The regions of 1. DCASE 2017 ASC
scene is perceived in CNN. Mapping (CAM) to Gaussian (DoG) and CNN-FC(65.8%) distinct sound events challenge
2. Described how the provide visualization of Sobel CNN-GAP(68.1%), in the dataset. Audio
log-magnitude Mel-scale filter CNN activation operator to DoG CNN-FC log-MEL images have segment is created
-bank (log-Mel) features of behavior to input pre-process the (72.0%), small activation by mixing 3 image
different acoustic scenes are features. log-Mel feature CNN-GAP(72.2%), intensity, which components.
learned in a CNN classifier 2. Make the sound to make the Sobel might be 10-second audio
texture more salient. background texture CNN-FC(70.1%), counter-intuitive. samples are from
3. Process the log-Mel information CNN-GAP(71.6%) However, further the training set of
features and more salient. 4. Background-drift- investigation is DCASE 2017
enhance edge information 2. Remove removed LogMel needed to dataset.
of the time- Background Drift −128 feature-CNN- find out if these 2. TUT Acoustic
frequency image. Using Medium FC(75.7%) sound events are Scenes 2017
Filter. CNN-GAP(75.4%) really trivial database is used for
3. CNN-FC model 5. Computation using for classification, or evaluation.
and CNN-GAP model. g DoG and Sobel it is because the
operator takes 0.46 CNN
and 0.30 s models fail to learn
these patterns.
Roletscheck et al. (2019) 2019 1. Handle the task of ASC 1. Log mel spectrograms 1. The stereo audio 1. Accuracy 74.7% 1. Accuracy 1. DeepSAGA’s 1. TUT Urban
for DCASE 2018 Challenge. with 100 mel samples were on the development 2. Fitness function - methodology is Acoustic Scenes
2. Generate the input features bands cover a frequency converted dataset for the score genetic. 2018 dataset from
for Neural Networks. range of up to into mono channels. population vote However, in order to subtask A,
3. Extract the log mel 22050 Hz. 2. The librosa library (’’Pop. better optimize it 2. 10-seconds audio
spectrogram. 2. Hamming window (v0.6.1) to extract vote’’) strategy. and segments from
4. Spectrograms were divided with a size of 2048 log 2. Accuracy 72.8% move closer to the 10 different acoustic
into sequences with a samples (43 ms) and a mel spectrograms. on the development aim of rendering scenes
certain number of frames hop size of 1024 3. Short-Time Fourier dataset for CNN hand-
defining the sequence length samples (21 ms) Transform (STFT) (’’DeepSAGA CNN’’) crafted NN systems
- strategy. obsolete, more
50% overlap research
will be needed in
the future to test its
performance with
additional types of
datasets.
Wilkinghoff and Kurth (2019) 2019 1. A system for open-set ASC 1. Input features - 1. Used median 1. The Auto 1. Mean Squared Since the optimized 1. DCASE Challenge
is presented. log-mel spectrograms. filtering for Encoders system Error loss function does 2019 dataset
2. Use of CNNs for closed-set 2. During outliers Harmonic significantly 2. Accuracy not -Task 1C for
classification and DCAEs detection, only data Percussive Source outperforms the specifically attempt Open-set
for rejecting unknown belonging to a single Separation via baseline to reject unknown classification and
acoustic scenes via outlier known class was Librosa. system and improves samples, Subtask A of task
detection. used to compute the 2. For closed-set the overall score using the MSE of 1 for closed-set
3. ReLU activation function mean and standard classification, the 47.6% to 62.1% on DCAEs for outlier classification.
is used to improve the deviation. mean is the evaluation detection
performance of outlier 3. Hanning window size subtracted and dataset. is essentially a
detection of 1024, a hop divided by the heuristic. One
size of 500 and 64 mel standard alternative to
bins deviation of all DCAEs is to train a
training data, which NN with a different
belongs loss
to any of the ten function designed
known classes specifically for
one-class
classification.
Table 4
Summary of existing work on different pre-processing techniques for ASC (continued).
Author Findings Methodology/Functionality Implementation/ Results Evaluation Metric/ Limitation/Future Datasets
details Evaluation Mechanism Tools Scope
Purwins et al. (2019) 1. The feature 1. With available data, deep 1. Audio Analysis - 1. Results depends 1. WER - to count 1. Are mel 1. Large training
representation like learning Sequence Classification, on the environment the spectrograms indeed datasets for
log-mel spectra and models outperforms the Multilabel Classification, and the selected features of word the best -ImageNet −14
raw waveform, are traditional Sequence model architectures. error representation for million hand
reviewed. methods such as Gaussian Regression, Sequence 2. F-Score audio analysis? labeled images,
2. Problem mixture models Labeling and Event 3. Signal to 2. Under what Speech recognition
categorization - hidden Markov models to Detection. Distortion circumstances mel datasets,
Reviewed on kind of extract the 2. Audio Synthesis - Ratio, Interference spectrogram 2. Song datasets,
target useful features. Sequence Transduction Ratio, Artifacts Ratio is better to use the MusicNet datasets
to be predicted from 2. Data-driven filters and and Audio Similarity - raw waveform? , AudioSet datasets-
the input. MFCCs are Estimation. to measure source 3. Can exploring the for
3. Audio features – used for acoustic feature 3. Audio Features - Mel separation quality middle ground, a environmental sound
Building an extraction. Frequency Cepstral 6. MOS- to evaluate spectrogram with classification,
appropriate feature Coefficients(MFCCs), quality of learnable 3. Data Generation
representation and Constant-Q synthesized hyper-parameters, and Data
designing an Spectrogram and audio in speech, be better? Augmentation
appropriate classifier Raw-wave form 7. Turing Test - For 4. Can there be an
have been treated as representation audio audio dataset
separate problem in generation covering speech
audio , music, and
processing. environmental
sounds, used for
transfer learning,
solving a great range
of audio
classification
problems?
5. Explore other
paradigms for
learning more
complex models from
scarce labeled data,
like
semisupervised
learning, active
learning etc.
Ebraheem et al. 1. Measure the Data 1. DL architecture for DC for 1. DeepER and 1. DC-aware DL 1. Accuracy 1. Explore how to 1. Diverse set of
(2018) Curation(DC) Entity DeepMatcher architecture architectures will instantiate and datasets such as
Challenges. Matching and Entity for entity resolution and give better result combine DL structured,
2. Identify the most Resolution entity matching models for matching, unstructured and
promising leads that respectively error detection, data noisy
are most repair, imputation, , provides useful rule
relevant to DC. ranking, discovery, of thumb on
and when to use DL for
syntactic/semantic ER
transformations.
Singh et al. (2018) 1. Issue pointed out: 1. Feature depends on input 1. Pooling 1. Non-linear SVM 1. Accuracy 1. Proposed 1. DCASE 2016
nature of the signal length. Operation-Reduce gives 93% 2. Trained model - framework with acoustic scene
environment and 2. Scores from each layer are dimension. accuracy on Evaluated with fewer data as categorization dataset
multiple overlapping combined to 2. Sum/Max operation to evaluation set of feature well as training the
acoustic events. form the strategy for represent feature DCASE vectors DNN to incorporate
2. ASC problem is classification. map into scalar value. 2016 data. the
addressed using 3. Pooling strategy is used to 3. N dimension feature 2. Relative hidden layer
SoundNet which is reduce the vector-Scalar values improvement of information.
pre-trained on raw dimension of features are converted to feature 30.85%
audio signals. extracted from vector. by best individual
3. Layer-wise different layers of SoundNet. layer of SoundNet.
ensemble approach is Fixed length
used for vector for each audio signal.
effectiveness on
DCASE 2016 dataset.
Nguyen and 1. Convolutional 1. Features are preprocessed 1. CNN models for 1. Accuracy using 1. Accuracy using 1. Feature 1. Task 1A : TUT
Pernkopf (2018) Neural Network by splitting Single-Input (SI) and ensemble selection: majority voting and characteristics that Urban Acoustic
(CNN) ensembles the acoustic scene into Multi-Input (MI) 69.3% for task 1A average voting are extracted Scene 2018 - 8640
for acoustic scene chunks of 1s. channels. and 63.6% for task from different segments with
classification of tasks 2. Ensemble selection - to 2. Combination of the 1B devices’ recording 6122 segments for
1A and 1B combine MI-CNN structures files of training and
of the DCASE 2018 several features and CNN using both of log-mel task 1B dataset are 2518 segments for
challenge to settings to features and nearest more complex than testing.
emphasize the provide a vote for 10s data neighbor filtering – For that 2. Task 1B : TUT
similar patterns of chunks. Task 1.A. of task 1A. So Urban Acoustic
sound events in a 3. 128 log-mel energies of 3. Nearest Neighbor complicated models Scene 2018. Mobile
scene. spectrogram. Filter (NNF) applied on i.e., the recorded in six
4. Window size:40ms, STFT: spectrograms – Repeating (MI db) CNNs tend European cities for
40ms Pattern Extraction to overfit for task 10 scenes -
Hop size: 20ms Technique(REPET) 1B. 10080 segments with
7202
segments for training
and 2878
segments for testing
Table 4 (continued).
Author Findings Methodology/Functionality Implementation/ Results Evaluation Metric/ Limitation/Future Datasets
details Evaluation Mechanism Tools Scope
Mariotti et al. (2018) 1. Audio recognition 1. Four types of spectrograms 1. Harmonic Percussive 1. Accuracy of 1. Accuracy 1. Cepstral features 1. DCASE 2018 -
process is described. created and Source Separation ensembles: accuracy 2. Confusion matrix Models were also Task 1 dataset
2. Methods to reduce used to train networks: (HPSS) of 79.3% on the trained
the number of Mono, Stereo, 2. Global average pooling development dataset using cepstral
parameters with no Mid/Side and 3. Deep vision models – 2. VGG – 77.7% features, namely
apparent loss on Harmonic/Percussive. VGG and Resnet 3. Resnet – 73.7% MFCC obtained
accuracy are 2. Mel-spectrograms: audio from the mel
addressed. file converted spectrograms, but
3. Ensemble methods from 24 to 16-bit encoding, this yielded a
using voting or sample rate significant drop in
fitting an auxiliary -48kHz, window-2,048samples accuracy.
classifier long with 2. Work with Raw
a hop length of 1,024, for audio Models -
128 mel bins difficult to
train, because of the
time and memory
needed
to run them, and
hyperparameter
optimization
was fairly complex.
Wang et al. (2017) 1. keyword 1. A novel frontend called 1. Use of an automatic 1. PCEN significantly 1. Receiver 1. Signal processing 1. The noise sources
spotting(KWS) task is PCEN is used. gain control based improves Operating components can be - sounds
done to improve 2. Log-mel uses stabilized log dynamic compression to recognition Characteristic (ROC) viewed sampled from daily
robustness to with replace the widely performance on large curve as structural life
loudness variation. offset = 0.1 and PCEN uses s used static compression. 2. Plotting False regularizations to environments and
2. Issues of log = 0.025, 2. Implements simple rerecorded noisy and Rejection develop the YouTube videos.
compression has 𝛼 = 0.98, 𝛿 = 2 and 𝑟 = 0.5. feed-forward far-field eval (FR) rates against neural network 2. The far-field
been solved. Automatic Gain Control sets. False model. evaluation sets are
(AGC), dynamically 2. At 0.1 FA per Alarm (FA) rates. rerecorded in real
stabilizes signal level. hour, PCEN reduces environments
FR rate by about
14% absolute over
log-mel with
multi-loudness
training.
Wang et al. (2016) 1. Data Curation 1. Issues like abnormal value 1. Cleanix - web 1. Cleanix supports 1. Data Quality 1. Scalability is the 1. Data integrated
technique is detection, interface for the user to effective and Report- main issue when from multiple
addressed. incomplete data filling, input efficient data Accuracy designing data sources
2. Big Data duplication, and the information of data cleaning at the large the data cleansing
Cleansing - Rule conflict resolution are source, parameters, scale techniques to deal
selection approach. resolved. and the rule selections with the
2. Key Features - Scalability, volume and variety
unification of data.
Chu et al. (2015a) 1. Big Data 1. Key Features - Easy 1. KATARA - end-to-end 1. Very effective and 1. Precision 1. KBs might not be 1. Katara can be
Chu et al. (2015b) Cleansing methods - specification, data cleansing efficient for high 2. Recall available to cover all applied to various
Knowledge-Base (KB) pattern validation, data systems that use quality data cleaning. 3. F-Measure data. datasets and KBs.
and Crowdsourcing annotation trustworthy KBs and 2. Crowdsourcing
are addressed. crowdsourcing for data can lead to
2. Introduce a cleansing - produce uncertainty
technique for accurate repairs by and noisy data.
efficiently annotate relying on KBs and Crowdsourcing is
data domain expert inefficient
and suggest possible for large dataset.
repairs.
Table 4 (continued).
Author Findings Methodology/Functionality Implementation/ Results Evaluation Metric/ Limitation/Future Datasets
details Evaluation Mechanism Tools Scope
Khayyat et al. (2015) 1. Big Data Key Features -Efficiency, 1. BigDansing - address 1. BigDansing 1. Precision 1. Data quality rule 1. Both synthetic
Cleansing - Rule scalability, and the scalability and outperforms existing 2. Recall optimization is not and real datasets
selection approach ease of use abstraction problems baseline systems up considered for many are used.
for data when designing a to more than two rules and their
curation is distributed data cleansing orders of magnitude constraints.
highlighted. system without sacrificing 2. Scalability of
the quality provided. existing parallel data
processing
frameworks.
Lee et al. (2000) 1. Traditional data 1. Integrates data 1. Potter’s Wheel and 1. Efficient 1. Precision 1. Custom domains 1. Real and large
cleansing used for transformation and Intelliclean data representation and 2. Recall and the datasets
Data Curation error detection. cleansing system easy corresponding
2. Duplicate elimination, management of algorithms can be
anomaly knowledge for data enforced with
detection, and removal using curation domain
constraints.
However, for various constraints and rules to be used, correct optimization for the quality of data may be necessary. Moreover, a better ML method than SCARE may be developed for data cleansing for big data.

3. Review on classification

3.1. An analysis of ASC techniques based on deep learning

Abeßer (2020) has evaluated the idea of DL-based approaches for ASC. The author claims that, during the past few years, there has been a steady rise in articles on ASC. The author compiled and sorted existing methods for data preprocessing and data modeling, with emphasis on DL-based ASC procedures. Data preparation encompasses a variety of tasks, including how to represent features and how to perform various DC tasks; data modeling includes learning paradigms and neural network structures. The ability to distinguish between various indoor and outdoor acoustic environments from recorded signals is a research area that has drawn much interest recently. The detection of audio events that are temporarily present in an acoustic scene is a particularly difficult ASC problem; vehicles, car horns, and footfall are a few examples of such audible events. Acoustic Event Detection (AED) is applied in various applications, including hearing assistance, wellness programs, safety inspection, observing wildlife in natural surroundings, the Internet of Things (IoT), and autonomous navigation. Additionally, the author covered pre-processing, augmentation, and representational strategies for audio signals for ASC. An acoustic scene classification algorithm typically goes through the following steps: audio recording, data preparation, data modeling, evaluation, and deployment. This processing flow is shown in Fig. 7. In the context of data preparation, audio data must first be represented in the form of a signal representation; the different kinds of signal representations are shown in Fig. 8. Data pre-processing is also required for data preparation. There are several ways to perform data pre-processing, such as pre-processing using gradient-descent-based algorithms and filtering; these techniques are depicted in Fig. 9. Data augmentation is also required to provide a large dataset to the model; the different types of augmentation techniques are illustrated in Fig. 10. Once the data is prepared, data modeling is needed, where recognizing the appropriate network architecture is very important. The architecture can be created using convolutional neural networks, feed-forward neural networks, or even a CRNN; the details of all these networks are described in Fig. 11. The ASC system can be further improved by understanding the learning paradigm: multitask learning and transfer learning are two important training approaches, shown in Fig. 12, for improving ASC. There are some open challenges, such as domain adaptation, ambiguous allocation between sound events and scenes, model interpretability, and real-world deployment. The audio from the environment is recorded, and classifiers are used to detect the type of sounds or audio. One potential challenge, according to the author, is to expand the present evaluation-driven tasks so that they measure the classification performance and the processing needs of ASC methods simultaneously. The challenging task "Low-Complexity ASC" from DCASE is the first to address this issue. In relevance to the future research direction, work on the prediction or classification of several ASC tasks can be carried out with the best available deep learning algorithms. For example, with the advanced technology of IoT, data science can collaborate, and models can be designed to predict the sound of any acoustic scene. The proposed technique can use the basic processing flow of ASC algorithms and address some open challenges such as domain adaptation and ambiguous allocation between sound events and scenes.

3.2. An overview of deep learning methods for analyzing audio signals

Purwins et al. (2019) have reviewed the most recent deep learning methods for handling audio signals. To highlight common approaches, issues, important references, and the possibility of cross-fertilization among fields, both speech and environmental sound processing are taken into consideration. According to the authors, building a suitable feature representation and designing a suitable classifier for these features have frequently been approached as separate problems in audio processing. This approach has the disadvantage that the created features might not be the best ones for the given classification purpose. As of the beginning of 2019, there is no comparable dataset for speech, music, and environmental sounds that is sufficiently large and appropriately labeled to serve all three domains. There are additional approaches to the problem of insufficient training data, including data generation and data augmentation. For some tasks, data with known synthesis parameters and labels that resemble genuine data can be created; generated data make it easier to understand, troubleshoot, and enhance machine learning techniques. However, if an algorithm is exclusively trained on created data, its performance on real data may be poor. There are various application models which can be created using deep learning or any other classifier. Those application areas are speech, music, environmental sounds, localization and tracking, audio enhancement (by eliminating noise), and source separation (which is crucial in audio signal processing because multiple sources are frequently present in realistic environments, adding up to a mixture signal that adversely affects further signal processing tasks). Fig. 13 shows an overview of challenges with audio processing, input representations for audio features, models used by various application domains, data description, and evaluation methods. In relevance to the future research direction, the collection of data from the source is a very important and tedious task that must be carried out carefully. The proposed model should focus on environmental sounds, and extracting the different features from those datasets can be considered a challenging task, with source separation also taken into consideration.
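To make the feature-representation discussion concrete, the sketch below computes two front-ends that recur throughout the reviewed work, a log-mel spectrogram and its PCEN counterpart, using librosa (a library several reviewed systems employ); the synthetic chirp signal and the parameter values are illustrative assumptions rather than settings taken from any particular paper:

```python
# Minimal sketch of two common ASC front-ends: log-mel spectrogram and PCEN.
# The test signal and parameter values are assumptions for illustration only.
import numpy as np
import librosa

sr = 22050
t = np.linspace(0, 10, 10 * sr, endpoint=False)
y = 0.5 * np.sin(2 * np.pi * (200 + 50 * t) * t).astype(np.float32)  # synthetic chirp

# Mel power spectrogram: STFT magnitude squared, pooled into 128 mel bands.
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                     hop_length=512, n_mels=128)

# Front-end 1: log compression (the "logmelspec" baseline).
log_mel = librosa.power_to_db(mel, ref=np.max)

# Front-end 2: per-channel energy normalization (PCEN), using the
# gain/bias/power form of dynamic compression discussed in the reviewed work.
pcen = librosa.pcen(mel * (2 ** 31), sr=sr, hop_length=512,
                    gain=0.98, bias=2.0, power=0.5)

print("log-mel shape:", log_mel.shape)   # (n_mels, n_frames)
print("PCEN shape:", pcen.shape)
```

Either representation can then be fed, frame-stacked, to the classifiers discussed in the following subsections.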
Fig. 8. Block diagram for deep learning based methods for ASC — Signal Representation (Abeßer, 2020; Arniriparian et al., 2018; Chen et al., 2018, 2019b; Fonseca et al., 2017;
Han et al., 2017; Li et al., 2019; Maka, 2018; Mars et al., 2019; Qian et al., 2017; Ren et al., 2017; Singh et al., 2018; Ye et al., 2018; Zieliński & Lee, 2018).
Fig. 9. Block diagram for deep learning based methods for ASC — Preprocessing (Abeßer, 2020; Lostanlen et al., 2019; Mariotti et al., 2018; Nguyen & Pernkopf, 2018; Rafii &
Pardo, 2012; Wang et al., 2017; Wu & Lee, 2019).
Fig. 10. Block diagram for deep learning based methods for ASC — Data Augmentation (Abeßer, 2020; Abeßer et al., 2017; Boss et al., 2015; Chen et al., 2019a; Gemmeke et al.,
2017; Goodfellow et al., 2014; Kong et al., 2019; Mun et al., 2017; Salamon & Bello, 2017; Xu et al., 2018; Zhong et al., 2017).
3.3. Enhancing SED model accuracy with minimal annotation effort

According to Shuyang et al. (2020), the detection of a sound event such as a gunshot or glass smashing is a necessity for awareness and, in some cases, for the prevention of accidental incidents. There are many applications of Sound Event Detection (SED), such as noise monitoring, healthcare monitoring, wildlife monitoring, urban analysis, and more. Since SED is based on supervised learning, it normally requires a huge amount of labeled data for training purposes. The authors have proposed an active learning system to maximize the accuracy of the SED model with the least amount of annotation effort.
Fig. 11. Block diagram for deep learning based methods for ASC — Network Architectures (Abeßer, 2020; Basbug & Sert, 2019; Bisot et al., 2017; Cho et al., 2019; Jati et al.,
2019; Marchi et al., 2016; Ren et al., 2019; Roletscheck et al., 2019; Sharma et al., 2019; Soo Hyun Bae & Kim, 2016; Takahashi et al., 2017; Yang et al., 2018).
Fig. 12. Block Diagram for deep learning based methods for ASC — Learning Paradigm (Abeßer, 2020; Lehner et al., 2019; Saki et al., 2019; Wilkinghoff & Kurth, 2019).
Fig. 13. Block diagram for conceptual overview of audio signal processing (Bittner et al., 2017; Davis & Mermelstein, 1980; Furui, 1986; Goodfellow et al., 2016; Hoshen et al.,
2015; Jaitly & Hinton, 2011; Lostanlen & Cella, 2016; Mohamed et al., 2012; Purwins et al., 2019; Thickstun et al., 2016).
Annotation can be described as a process of adding extra information to data; it is a process of labeling data. Currently, the process of annotating audio data is time-consuming. One way to minimize the effort is to use weak labels that do not require the onset/offset of each sound event. Even using weakly supervised learning, the process of annotation on a large dataset is still time-consuming. AudioSet is a large, publicly available sound event dataset with weak labels. Fig. 14 depicts an overview of the authors' suggested active learning system. The three processing units in this system are weakly supervised learning, sample selection, and change point detection. According to the authors, "the audio change point detection has previously not been used for active learning". The change points are used in the process of generating segments that are easy to recognize. The "mismatch-first farthest traversal" principle is used for the selection of candidate segments, which proves to be effective in sound classification. According to the mismatch-first criterion, a model may benefit more from an example where it makes an incorrect prediction than from a correct one. The selection principle is generalized to all labeling processes that do not require cluster numbers, making it simpler altogether. The proposed active learning system first takes a set of unlabeled audio recordings as input and then performs change point detection, which is used for segmentation. The segments are called candidate segments, as they are candidates for annotation. Then the sample selection process based on mismatch-first farthest traversal is implemented, which targets the samples that were previously predicted wrongly. A human annotator then assigns partial labels to the selected segments. This active learning process is iterative and works on a batch in each iteration. The system requires only weak labels that are assigned to individual segments, so it results in minimal annotation effort. Annotating segments has two advantages over annotating full recordings: first, weak labels for a full recording are sometimes insufficient for training SED models; secondly, only the selected segment needs to be annotated, reducing annotation effort. The minimum length of a segment is kept at one second; this is made possible by ignoring any change point detected within one second of the previous change point. The authors have shown that, by taking original recordings as input, the SED model easily outperforms other models that use only annotated segments as input for training. Also, annotating segments is more efficient compared to annotating full recordings. The SED system uses a neural network consisting of six blocks of gated CNNs and takes the original recording as input so that the contextual information in the original recordings can be used. There are various advantages to keeping contextual information. Firstly, the model can learn features that distinguish events from the background sound, assuming background sound to be contextual information. Secondly, the model may be able to learn dependencies between sound events and scenes. To evaluate the performance of a SED model, a segment-based Error Rate (ER) is used.
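As an illustration of the mismatch-first farthest-traversal selection described above, the following Python sketch picks a batch of candidate segments for annotation. The feature pooling and the way reference labels are propagated are assumptions made for the example, not the exact procedure of Shuyang et al. (2020).

import numpy as np

def mismatch_first_farthest_traversal(features, predicted, reference, budget):
    # features  : (N, D) pooled feature vectors of candidate segments
    # predicted : (N,) labels predicted by the current model
    # reference : (N,) labels propagated from existing annotations (assumed available)
    # budget    : number of segments the annotator can label in this round
    n = len(features)
    mismatch = (predicted != reference).astype(int)      # 1 = model disagrees with propagation
    selected, remaining = [], list(range(n))
    min_dist = np.full(n, np.inf)                        # distance to the closest selected segment
    for _ in range(min(budget, n)):
        # mismatch-first: restrict to disagreeing segments while any remain
        pool = [i for i in remaining if mismatch[i]] or remaining
        if not selected:
            pick = pool[0]                               # arbitrary starting segment
        else:
            pick = max(pool, key=lambda i: min_dist[i])  # farthest traversal in feature space
        selected.append(pick)
        remaining.remove(pick)
        d = np.linalg.norm(features - features[pick], axis=1)
        min_dist = np.minimum(min_dist, d)
        mismatch[pick] = 0
    return selected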
Fig. 14. Block diagram for active learning for SED model (Aytar et al., 2016; Cheng et al., 2008; Hakkani-Tur et al., 2002; Han et al., 2016; Kotti et al., 2008; Kumar et al.,
2018; Riccardi & Hakkani-Tur, 2005; Shuyang et al., 2017, 2018, 2020).
The labeling budget refers to the total duration of audio that can be manually labeled. Due to a large label distribution bias that has a detrimental impact on the accuracy of the learned models, the accuracy of the active learning system does not always increase as the labeling budget is increased. The proposed active learning system could achieve, by annotating only 2% of the training data, a performance that some other systems could only achieve after annotating all the data. The training set with 5% of the data annotated yields the best results. Additionally, on datasets containing infrequent events, more than 90% of the labeling expense can be saved. The active learning system proposed by the authors needs human interaction with the system; automating it by replacing the human effort with a machine learning technique such as semi-supervised learning could be more convenient.

3.4. Procedures for mining of audio big data to address the problem of identifying anthropogenic disasters

An innovative method of investigating ambient sound for safeguarding people, property, and the economy from anthropogenic disasters was provided by Ye et al. (2020). In the environment there are different types of noises, and to differentiate all those noises, different frequency-domain and pitch values are used. Certain sounds, such as screams, shouts, gunshots, and explosions, are closely associated with anthropogenic catastrophes involving violence and conflict. The authors have created an emergency sound recognition framework based on acoustic feature learning and data-driven taxonomy construction to efficiently define acoustic clues for early catastrophe detection. Fig. 15 shows a block diagram of the suggested acoustic scene classification method, which uses the jeopardy sound recognition technique to handle man-made disasters. The authors have used the Constraint Mutual Subspace Method (CMSM) to remove the background noise. The dataset used by the authors was a crowd sounding audio dataset. Fig. 16 shows a block diagram of the audio-based hazard event recognition system. The model gives 90% accuracy for sound identification and 87% Mean Average Precision. In relevance to the future research direction, the effect of variations that are irrelevant to the subspace difference must be eliminated, since ambient sound always blends with background noises; CMSM is used for this, but to improve the accuracy other methods can also be compared with the existing one. The CMSM can even be improved by differentiating the audio features of different creatures of nature: different creatures have different frequency ranges and pitch values, and a class or dictionary containing these frequency ranges and pitch values can be used to enhance the model. Further, the limited data collection may lead to high-variance issues in classification model training; consequently, it is expected that, by offering more clips, the recognition accuracy will increase. Addressing the issue of insufficient audio data is another challenge that can be focused on further. Thus, the problem of unseen pattern recognition is one of the challenges that need to be overcome.

3.5. Trident ResNet and shallow inception models for task 1 and subtask B of the DCASE2020 challenge

Suh et al. (2020) have described the design of Acoustic Scene Classification (ASC) models with CNN variants in their technical report. The authors have described acoustic scene classification systems for DCASE 2020 challenge task 1, which consists of two subtasks. The multiple-devices dataset is used in subtask A, and a low-complexity model is designed for subtask B. They created a single model for subtask A and used three parallel ResNets to implement it; this ResNet is named "Trident ResNet". The proposed structure is beneficial when the samples are collected from unseen devices, and it obtains 73.7% classification accuracy on the test split. The "Shallow Inception" model built for subtask B has fewer parameters, and the proposed model has obtained 97.6% accuracy.
Fig. 15. Block diagram for hazard sound recognition method (Crocco et al., 2014; Foggia et al., 2015; Ye et al., 2020).
Fig. 16. Block diagram for audio-based hazard event recognition scheme (Coates & Ng, 2011, 2012; Edelman et al., 1998; Mattys et al., 2012; Silla & Freitas, 2010; Wu & Lange,
2008; Yamaguchi et al., 2002; Ye et al., 2020).
ASC's task is to classify audio data according to where it was recorded. The following considerations were used by Suh et al. (2020) while capturing the data and further for its classification:

1. Data with multiple labels are not considered.
2. The duration of each recording is 10 s, which only rarely contains useful information throughout.
3. There are ten classes, and each recording belongs to one of them. For subtask B, the ten classes were merged into three.

The classifier for subtask A should be designed in such a way that it works soundly across various microphone types. The CNN model and the spectrum correction approach were used to solve this problem as described by Kosmider (2019). The main challenge with subtask B is creating models that are smaller than 500 kilobytes, which is restrictive given that models with millions or billions of parameters are submitted for the same task.

3.5.1. Datasets used for subtask A and subtask B
For Subtask A: The dataset used by the author is detailed in Table 24. Some more details about the used dataset are given below:

1. It contains 23,040 samples. The audio samples were gathered from six simulated devices and three real ones.
2. A binaural microphone, Samsung Galaxy S7, and iPhone SE devices are used for data collection.
3. Samples from a GoPro Hero5 session will be included in the evaluation dataset.
4. 13,965 and 2,970 samples are used for training and testing in the basic training/test split.

For Subtask B: Subtask B utilizes the TAU Urban Acoustic Scenes 2020 3Class development dataset. The details about the said dataset are given below:

1. It contains 14,400 samples.
2. Three labels are used: indoor, outdoor, and transportation. 9185 and 4185 samples were used for training and testing.

3.5.2. Data preprocessing for Subtask A and Subtask B
For Subtask A: Data preprocessing for subtask A uses a logarithmic spectrogram. The data were converted into a power spectrogram with a hop of 1024 samples and a Hann window of length 2048. The log Mel spectrogram was used to determine the deltas and delta-deltas, which were then stacked onto the channel axis.
For Subtask B: The log Mel spectrogram is also used for data preprocessing for subtask B. Using the same method as for subtask A, the author converted the data to a log Mel spectrogram but removed the deltas and delta-deltas.

3.5.3. Data augmentations for subtask A and subtask B
During the training phase, 64 samples from each mini-batch were generated in real time to create the augmented data. The author's list of data augmentation techniques is presented in Table 5.

Table 5
Data augmentation strategies [Suh et al. (2020)].
Strategy             Parameter
Time-related snip    snip length = five seconds
Mixup                Alpha = 2.0
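A minimal sketch of the preprocessing and mixup augmentation just described is given below, assuming librosa for feature extraction. The sampling rate and number of mel bands are illustrative assumptions, while the 2048-sample Hann window, 1024-sample hop, stacked deltas, and mixup alpha of 2.0 follow the description above.

import numpy as np
import librosa

def logmel_with_deltas(path, sr=44100, n_fft=2048, hop=1024, n_mels=128):
    # Log-mel spectrogram with deltas and delta-deltas stacked on the channel axis
    y, sr = librosa.load(path, sr=sr, mono=True)
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft,
                                         hop_length=hop, window="hann",
                                         n_mels=n_mels, power=2.0)
    logmel = librosa.power_to_db(mel)
    delta = librosa.feature.delta(logmel, order=1)
    delta2 = librosa.feature.delta(logmel, order=2)
    return np.stack([logmel, delta, delta2], axis=-1)   # shape: (n_mels, frames, 3)

def mixup(x, y, alpha=2.0):
    # Mixup: convex combination of a batch with a shuffled copy of itself (y assumed one-hot)
    lam = np.random.beta(alpha, alpha)
    idx = np.random.permutation(len(x))
    return lam * x + (1 - lam) * x[idx], lam * y + (1 - lam) * y[idx]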
3.5.4. Composition of model
For Subtask A: A Trident ResNet model based on ResNet (He et al., 2016a) is used for subtask A. In the suggested model, a residual block is made up of three stacked convolution blocks and, after common pooling, an identity path with zero padding. The normal distribution is used to initialize the kernels, and a regularization factor of 5 × 10⁻⁴ is used for L2 regularization.
Table 6
Test split results for subtask A and B [Suh et al. (2020)].
System name                        Accuracy % for Subtask A    Accuracy % for Subtask B
Baseline: DCASE2020 Task 1         54.1                        87.3
TridentResNet DevSet               73.7                        97.6
TridentResNet EvalSet              73.7                        97.6
TridentResNet Ensemble             74.2                        97.5
TridentResNet Weighted Ensemble    74.4                        97.7
Each convolution block is a pre-activation convolution consisting of ReLU followed by convolution, as suggested by Koutini et al. (2019c), McDonnell and Gao (2019) and Liu et al. (2019).
The correct receptive field (RF) size is vital for the ASC task (Koutini et al., 2019b). The author suggested a way to improve the classification performance by reducing the receptive field, because a CNN overfits ASC data when its receptive field is too large. Additionally, the effectiveness of the model as a whole depends on the size of the frequency axis. The model structure proposed by McDonnell and Gao (2019) follows a similar concept: with the use of strided convolution filters, their model reduces the time information while maintaining the frequency bins. To get the best RF for the input parameters, the author utilized a grid search during data preprocessing for subtask A using the logarithmic spectrogram, drawing inspiration from the model structure presented by McDonnell and Gao (2019). By stacking the residual blocks, the authors Suh et al. (2020) were able to change the ResNet's receptive field size — the size of the RF depends on the size of the network.
The authors Suh et al. (2020) had also set up the ResNets for classification. The same has also been proposed by McDonnell and Gao (2019) and Phaye et al. (2018), in their work using deep residual networks and SubSpectralNet respectively, to learn diverse characteristics from various frequency bands. Three paths are included in the suggested model for the Mel bins: 0–63, 64–127, and 128–255. Splitting the Mel bins in half in a dual parallel arrangement was also tested by the author, but the triple design outperformed it for minority/unseen devices. Two blocks of 1 × 1 convolution and GAP compute the classification scores after concatenating the outputs from each network.
For Subtask B: The authors Suh et al. (2020) have used the "Shallow Inception" model for subtask B. It is implemented to prevent over-fitting by lowering the number of parameters through sparse connectivity, as suggested by Szegedy et al. (2015). The inception module proposed by Suh et al. (2019) for ASC using CNN performed better than GoogleNet's original architecture; the suggested module adds an average pooling path. The model is trained using an SGD optimizer with a momentum of 0.9.

3.5.5. Focal loss (FL)
Lin et al. (2017) claimed that the focal loss attenuates the log-loss generated by the well-classified samples, so that the model can focus on the imperfectly learned samples. The authors Suh et al. (2020) have used the values 2.0 and 0.25 for the focusing parameter and the balancing parameter respectively.
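With these parameter values the focal loss can be written as FL(p_t) = −α (1 − p_t)^γ log(p_t). A small NumPy sketch follows; the array shapes are assumed for illustration only.

import numpy as np

def focal_loss(probs, targets, gamma=2.0, alpha=0.25, eps=1e-7):
    # Multi-class focal loss, FL(p_t) = -alpha * (1 - p_t)**gamma * log(p_t)
    # probs   : (N, C) softmax outputs of the classifier
    # targets : (N,) integer class indices
    # gamma   : focusing parameter (2.0 in the reviewed work)
    # alpha   : balancing parameter (0.25 in the reviewed work)
    p_t = np.clip(probs[np.arange(len(targets)), targets], eps, 1.0)
    return np.mean(-alpha * (1.0 - p_t) ** gamma * np.log(p_t))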
3.5.6. Snapshot ensemble and results
Snapshots from each cycle of the training process were recorded and combined by the authors Suh et al. (2020); Huang et al. (2017) showed that such an ensemble model can outperform a single model. Models trained for three different numbers of epochs, namely 62, 126, and 254 for subtask A, and for 254 and 510 epochs for subtask B, made up the ensemble systems that were submitted. To produce ensemble predictions, the scores were averaged. The results of the A and B subtasks are shown in Table 6.

3.6. Illustration of RF regularized CNN model and different parameter reduction methods

Koutini et al. (2020) describe Task 1 — ASC with Multiple Devices for subtask A and ASC with Lower Complexity for subtask B. Koutini et al. (2020) and Heittola et al. (2020) used the RF regularized CNN model as a baseline for Subtask 1.A, and they also looked at using two distinct domain adaptation objectives, the MMD and the SWD, which were suggested by Kolouri et al. (2019). For the reduction of parameters for Subtask 1.B, the author looked into techniques such as pruning that preserve the receptive field of the networks. In addition, the authors have included a decomposed convolutional layer in their models; compared to the full-parameter baseline, it decreases the number of non-zero parameters with only marginal accuracy loss. Receptive-Field Regularized CNNs (RFR-CNNs) have been shown to be very effective for ASC as proposed by Koutini et al. (2019b, 2019c), for device-invariant ASC by Koutini et al. (2019c), Mesaros et al. (2019) and Primus et al. (2019), for open-set ASC by Mesaros et al. (2019) and Lehner et al. (2019), for tagging of audio with noisy labels and little supervision by Koutini et al. (2019c) and Fonseca et al. (2019), and for emotion and music identification by Koutini et al. (2019a). The author has incorporated RFR-CNN architectures as a result of these accomplishments. In addition, the authors Koutini et al. (2019b) and Luo et al. (2017) have proposed a new strategy that allows them to manage the Effective Receptive Field in deep CNNs as well as limit the RF of the models. They use Domain Adaptation for Task 1A, notably SWD and MMD, to adapt the networks for the challenge tasks. When a compact model is needed for Task 1B, weight pruning as suggested by Li et al. (2016), layer decomposition as pointed out by Lebedev et al. (2014), and basic network width/depth reduction are used.

3.6.1. Architectures
In their work on ASC using RF-regularized CNN variants, Koutini et al. (2019d) presented a ResNet architecture, and the tests on this ResNet architecture were set up by the author. Initial investigations revealed that the best RF size is between 6 and 7 for task 1A. The author Koutini et al. (2020) used the ResNet CP-Res for both tasks 1.A and 1.B. The author's research on DCNNs for ASC demonstrated that constraining the CNNs' RF, notably across the frequency dimension, improves generalization on various ASC datasets. By reducing the effective receptive field of the convolutional layers, the author has improved the performance of the proposed architectures: the input is weighted toward the center of each neuron's RF, and the resulting networks are called Damped CNNs. In practice, the author damps the filters of a convolutional layer by multiplying the weights element-wise with a constant matrix, called a damping matrix, and uses 𝜆 = 0.1 in the architecture. The said architecture is referred to as CP-Res Damp in the proposed model.

3.6.2. Realm transformation for Task 1A
According to the literature by Primus et al. (2019) and Eghbal-Zadeh et al. (2017), models trained on one device frequently struggle to generalize to other devices because sound distributions from different devices vary. The author has used two Domain Adaptation (DA) objectives, SWD and MMD, to lessen the discrepancy between the target devices in the unknown test set and the source devices in the training set.
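A sketch of the filter damping idea from Section 3.6.1 is given below. The exact decay profile of the damping matrix in the cited work is not reproduced here, so the linear fall-off toward 𝜆 = 0.1 at the kernel border is an assumption made for illustration.

import numpy as np

def damping_matrix(kh, kw, lam=0.1):
    # Constant damping matrix for a kh x kw kernel: factor 1.0 at the kernel centre,
    # decaying toward lam at the border, which limits the effective receptive field.
    # The linear decay profile is an assumption for illustration.
    cy, cx = (kh - 1) / 2.0, (kw - 1) / 2.0
    dist = np.zeros((kh, kw))
    for y in range(kh):
        for x in range(kw):
            dist[y, x] = max(abs(y - cy) / max(cy, 1), abs(x - cx) / max(cx, 1))
    return 1.0 - (1.0 - lam) * dist

def damp_conv_weights(weights, lam=0.1):
    # Element-wise damping of a conv kernel of shape (out_ch, in_ch, kh, kw)
    d = damping_matrix(weights.shape[2], weights.shape[3], lam)
    return weights * d            # broadcasts over the channel dimensions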
Table 7
Comparison of domain adaptation [Koutini et al. (2020)].
SN   Model          Domain adaptation   Accuracy %
1    CP-Res         x                   69.81
2    CP-Res Damp    x                   71.07
3    CP-Res Damp    MMD                 70.80
4    CP-Res Damp    SWD                 71.80

Table 8
Comparison of parameter reduction methods [Koutini et al. (2020)].
Model             Number of NZ/Total   Size (KB)   Accuracy %
CP-Res            3415K/3431K          6671.5      96.85
CP-Res Damp       3415K/3431K          6671.5      97.61
CP-Res Damp-R     224K/227K            437.8       97.09
CP-Res Damp-GP    247K/345K            483.5       97.37
CP-Res Dec        17.5K/18.3K          34.2        95.83
CP-Res Damp-Dec   17.5K/18.3K          34.2        95.95
3.6.3. Approaches used for Task 1B
The purpose of Task 1B is to classify three different acoustic scenes using models with low complexity, that is, a small number of parameters. Width and depth restriction, parameter pruning, and decomposed convolutions are the three major approaches used for task 1B. The author Koutini et al. (2020) has looked into several approaches for maintaining a high level of classification accuracy while keeping the number of parameters under the permissible limit. The author learns a pruning mask for each convolutional layer by replacing each convolution with a combination of the convolution operator, an element-wise multiplication operator, the output of the previous layer, the filter's trainable weights, learnable pruning mask weights, and a gating function g(x). Inspired by the proposal of Lebedev et al. (2014) to directly train decomposed convolutional layers obtained with SVD for CNNs, in their work on fine-tuned CP-Decomposition, the author proposed to directly train the decomposed layers. With Z as a compression factor, a regular convolutional layer, whose dimensionality is given by the number of input and output filters and the kernel size, can be broken down into three convolutional layers, as pointed out by Lebedev et al. (2014).
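The decomposition idea can be sketched as replacing one k × k convolution by three smaller convolutions. In the sketch below the bottleneck width chosen from the compression factor Z is an illustrative simplification; the cited work derives the ranks from a CP/SVD decomposition of the trained kernel instead.

import tensorflow as tf

def decomposed_conv(x, c_in, c_out, k=3, Z=4):
    # Replace one k x k convolution with three smaller convolutions
    r = max(c_in // Z, 1)
    x = tf.keras.layers.Conv2D(r, 1, padding="same", use_bias=False)(x)      # channel reduction
    x = tf.keras.layers.Conv2D(r, k, padding="same", use_bias=False)(x)      # spatial filtering at low rank
    x = tf.keras.layers.Conv2D(c_out, 1, padding="same", use_bias=False)(x)  # channel expansion
    return x

inp = tf.keras.Input(shape=(64, 64, 128))
model = tf.keras.Model(inp, decomposed_conv(inp, c_in=128, c_out=128))

# Parameter comparison for c_in = c_out = 128, k = 3, Z = 4:
# full convolution      : 128 * 128 * 3 * 3              = 147,456 weights
# decomposed convolution: 128*32 + 32*32*3*3 + 32*128    =  17,408 weights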
can be utilized to improve ASC. The ATS’s output is denoted to as
3.6.4. Results for Task 1.A and Task 1.B a tag vector, which signifies the occurrence of distinct sound events.
Table 7 contrasts the outcomes of the recommended models on the Figs. 17 and 18 compare the proposed approach to orthodox DNN-
development set. The indicated damping of frequency enhances the based ASC systems to demonstrate how it functions. The suggested
accuracy of the baseline ResNet by approximately 1%. In the case of framework makes use of a proficient audio tagging system that excelled
SWD, using a DA goal improves results slightly, whereas MMD produces in the DCASE 2019 task as described by Akiyama and Sato (2019).
results that are on average poorer than the reference baseline. The The authors present several approaches for using tag vectors in the
accuracy of the last 10 epochs is shown in the table for each model. ASC task. The authors Bahdanau et al. (2014) and Chan et al. (2016)
Table 8 compares the results of proposed models on the development proposed initially by concatenating a tag vector with an ASC task code.
set with varying numbers of parameters. Three strategies are investigated and empirically validated to obtain a
concatenated representation. They also propose utilizing a tag vector
3.6.5. Submissions summary to implement the attention technique on the feature map of the ASC
The detailed report on the submission summary observed by the task. With or without the use of external representations, an attention
current review paper is shown in Table 9. technique primarily highlights a DNN’s intermediate illustration which
is also highlighted by Chan et al. (2016), India et al. (2019) and Ren
3.7. Acoustic scene classification using audio tagging et al. (2018). The authors employ the former case, in which a tag
vector-derived attention map appears as a feature map. They also look
As pointed out by Koutini et al. (2019d) and Huang et al. (2019), at the multi-head attention system proposed by Vaswani et al. (2017)
DNNs are used in ASC systems to categorize recordings into pre-defined and Ren et al. (2018), which shows reasonable results. Finally, by
classes. The authors Jung et al. (2020b) in their work using audio tag- merging the two approaches provided, the authors were able to make
ging for ASC, offer a unique strategy for ASC that uses an ATS modeled even more improvements.
after the human perception process. When individuals recognize an
acoustic scene, the presence of various sound occurrences gives discrim- 3.7.1. ASC architecture
inative information that influences the decision. Various approaches The ASC system consists of a front-end DNN for feature extraction,
are used in the proposed framework to simulate this mechanism. To which is also called as code extraction, and a back-end SVM for
begin, the author uses three ways to combine tag vectors obtained from classification. The authors Jung et al. (2018) and Jung et al. (2019)
ATS with an acoustic scene categorization system’s intermediate hidden used a DNN that accepts raw waveforms straightforward. The results
layer. They also used tag vectors to inspect the effects of multi-head obtained by He et al. (2016a, 2016b) shows that using convolutional
attention. The different works on ASC and events in the 2019 task 1- layers with residual connections, the system initially obtains frame-
a dataset show that the proposed technique is effective. Classification level representations. Frame-level features are aggregated and sent to a
accuracy for concatenation is 75.66% and for multi-head attention fully-connected layer using a global max pooling and a global average
is 75.58% respectively. The system’s accuracy is 76.75% when the pooling. The layer’s output is used as the code. The general architecture
proposed two approaches are combined. is depicted in Table 10 (Jung et al., 2018).
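A compressed sketch of this front-end/back-end split is shown below. For brevity it starts from frame-level features rather than the raw stereo waveform, and the placeholder data, layer sizes, and linear SVM kernel are assumptions rather than the configuration of Jung et al. (2020b).

import numpy as np
import tensorflow as tf
from sklearn.svm import SVC

def build_code_extractor(n_frames=18, n_filters=128, code_dim=64, n_classes=10):
    # Front-end DNN: frame-level features -> global max + average pooling -> code
    inp = tf.keras.Input(shape=(n_frames, n_filters))
    gmp = tf.keras.layers.GlobalMaxPooling1D()(inp)
    gap = tf.keras.layers.GlobalAveragePooling1D()(inp)
    merged = tf.keras.layers.Concatenate()([gmp, gap])
    code = tf.keras.layers.Dense(code_dim, activation="relu", name="code")(merged)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(code)
    return tf.keras.Model(inp, out)

# After training the DNN end to end, codes are extracted and an SVM is fitted on them.
model = build_code_extractor()
code_model = tf.keras.Model(model.input, model.get_layer("code").output)
X = np.random.rand(32, 18, 128).astype("float32")     # placeholder frame-level features
y = np.random.randint(0, 10, size=32)                 # placeholder scene labels
svm = SVC(kernel="linear").fit(code_model.predict(X), y)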
Table 9
Submission summary for Task 1.A and Task 1.B [Koutini et al. (2019c, 2020)].
Submission Number Architecture Observation
Task 1. A : Acoustic Scene Classification with Multiple Devices
1 CP ResNet Frequency-Damped A single network is trained with a p value of 7 and frequency
damping.
The average of the last five epochs of training is submitted.
2 CP ResNet F-damped SWD A single network is trained with a p value of 6 and frequency damping
with an additional SWD as a domain adaptation loss.
The average of the last five epochs of training along with last 5 SWA
models are submitted.
3 CP ResNet DA and non-DA ensemble There are 14 models with average predictions (p = 6 or p = 7,
damping or
basis network, without DA, with SWD, or with MMD).
The average of the last five SWA and non-SWA models for each model
is also
taken into account.
4 CP ResNet DA ensemble Same as CP ResNet DA and non-DA ensemble submission, but only
average predictions of the models trained with domain adaptation loss
is considered.
Task 1.B: Low-Complexity Acoustic Scene Classification
1 CP ResNet Decomposed Convolutional layers of the network (p = 4) is decomposed.
The resulting network has 17520 parameters and a total size of
34.21875 KB.
2 CP ResNet RF-Damp Gate Prune The frequency damped Resnet (p = 4) with adaptive pruning is
employed.
Snip 1 × 1 convolutional layer weights.
The network comprises a total of 345990 parameters, including 77824
convolutional parameters and the same number of pruning mask
weights parameters.
The total number of non-zero parameters in the network after training
and adaptive pruning is 247562, 483.520 KB in total size.
3 CP ResNet RF-Damp small width/depth The base network’s breadth and depth are lowered to fit inside the
size constraint.
The final network has 242592 parameters and is 473.813 KB in size.
4 CP ResNet ensemble of smaller models The average of three smaller models is calculated so that their total
size fits under the size restriction of 487.082 KB and 249386
parameters.
The following are the three models:
a) Adaptive pruning and decomposing weights in a damped ResNet.
Total non-zero parameters are 87168.
b) ResNet with 100288 parameters.
c) ResNet with 61930 non-zero parameters and adaptive pruning.
3.7.2. The working of the ATS
The task of multi-label audio tagging is to detect whether an input audio recording contains several pre-defined sound events. The result of this task is a vector whose dimension is equal to the total number of pre-defined sound events. "Tag vector" is the name given to the output vector, where each dimension indicates, as a real value ranging from 0 to 1, whether a specific sound event is present. The system, which uses an architecture for multi-task learning and a framework for soft pseudo-labels, was suggested by Akiyama and Sato (2019). A mel-scale spectrogram input combined with a ResNet design (He et al., 2016a) was adopted in the system, because it is the single system that performs the best.

3.7.3. ASC system concatenation with tag vector
The authors Jung et al. (2020b) have addressed three approaches for joining the feature map of the ASC system to the tag vector. The easiest approach is to utilize an ASC system's feature map after global pooling as a code representation, concatenating it with the tag vector. The second technique transforms the features with a fully connected layer after joining the ASC code to the tag vector. In addition, the authors propose that before concatenation a feature transformation be applied to the tag vector. The proposed model is depicted in Fig. 19. Table 11 shows the results of the baseline technique, which does not employ tag vectors, as well as the three recommended methods, which combine an ASC code and a tag vector.
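A minimal sketch of the concatenation variant with a few transform layers applied to the tag vector before concatenation is given below; the dimensions and number of layers are illustrative assumptions.

import tensorflow as tf

def concat_code_with_tags(code_dim=64, n_tags=80, n_transform_layers=3, n_classes=10):
    # ASC code concatenated with a tag vector, with transform layers before concatenation
    code = tf.keras.Input(shape=(code_dim,), name="asc_code")   # pooled ASC feature / code
    tags = tf.keras.Input(shape=(n_tags,), name="tag_vector")   # per-event probabilities in [0, 1]
    x = tags
    for _ in range(n_transform_layers):                         # transform the tag vector first
        x = tf.keras.layers.Dense(code_dim, activation="relu")(x)
    merged = tf.keras.layers.Concatenate()([code, x])
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(merged)
    return tf.keras.Model([code, tags], out)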
Fig. 18. ASC model using audio tagging system (Jung et al., 2020b).
Table 10
General DNN architecture for ASC [Jung et al. (2018, 2020b)].
Layer Input Filter Stride Number Memorization & Output Shape
(479999,2) Length Size of Sensitivity
Filters Technique Used
Strided Convolution 12 12 128 Mix-up : Beyond (39999,128)
Convolutional Batch Empirical
(Stride size equal Normalization Risk Minimization
Filter length) LeakyReLU
Res Block Convolution 3 1 128 Mix-up : Beyond (18,128)
Batch Empirical
Normalization Risk Minimization
LeakyReLU
MaxPool(3) x 7
Global Average Global avg – – – Mix-up : Beyond (128, )
pool() Empirical
Risk Minimization
Global Max Global max – – – Mix-up : Beyond (128, )
pool() Empirical
Risk Minimization
Concat – – – – Mix-up : Beyond (256, )
Empirical
Risk Minimization
Code Fully – – – Mix-up : Beyond (64, )
Connected(64) Empirical
Risk Minimization
Output Fully – – – Mix-up : Beyond (10, )
Connected(10) Empirical
Risk Minimization
Fig. 19. Model-concatenation of tag vector with ASC feature (Jung et al., 2020b).
The performance improves in all of the proposed configurations. It can also be demonstrated that if three layers are utilized before the code, the best performance (75.76%) is achieved.

3.7.4. Multi-head attention using tag vectors
Attention is a widely employed method that focuses solely on distinguishing characteristics, as pointed out by Fonseca et al. (2019). It employs a vector known as an attention map, where each item is assigned a value between 0 and 1 and the softmax function makes the items sum to 1. Attention is applied by multiplying a given feature by the attention map. The authors also propose an ASC system with a tag vector as the attention vector. Fig. 20 shows the model used by the authors.
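The tag-vector attention idea can be sketched as below, where each head maps the tag vector to a softmax attention map that is multiplied element-wise with the ASC feature; the head count and dimensions are assumptions for illustration.

import tensorflow as tf

def tag_vector_attention(feat_dim=64, n_tags=80, n_heads=4, n_classes=10):
    # Attention maps derived from the tag vector and applied to the ASC feature
    feat = tf.keras.Input(shape=(feat_dim,), name="asc_feature")
    tags = tf.keras.Input(shape=(n_tags,), name="tag_vector")
    attended = []
    for _ in range(n_heads):
        # each head turns the tag vector into a softmax attention map over the feature dims
        attn = tf.keras.layers.Dense(feat_dim, activation="softmax")(tags)
        attended.append(tf.keras.layers.Multiply()([feat, attn]))
    x = tf.keras.layers.Concatenate()(attended)
    out = tf.keras.layers.Dense(n_classes, activation="softmax")(x)
    return tf.keras.Model([feat, tags], out)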
Table 11
Results of the baseline technique and three recommended methods, which combine a tag vector with an ASC Code [Jung et al. (2020b)].
Model architecture Transform after Concat Number of layers before transform Accuracy %
Baseline x – 73.63
Codecat x – 74.15
Before code 0 x 74.36
Before code 0 3 75.66
Fig. 20. Model-multihead attention using tag vector with ASC features (Jung et al., 2020b).
Table 12
Outcomes from multi-head attention with tag vectors [Jung et al. (2020b)].
Number of head Number of transform Accuracy Number of Number of Accuracy
layers % head transform layers %
for attention map for attention map
2 1 76.58 16 3 75.96
4 2 76.24 32 1 75.57
8 3 75.17 32 2 76.24
Table 13
Results of combining attention map and a tag vector utilizing two distinct fully-connected layers [Jung et al. (2020b)].
Number of head Number of layers used for Accuracy Number of Number of layers used for Accuracy
concat,number of attention % head concat,number of attention %
2 3, 3 74.76 4 3, 3 75.31
2 3, 4 75.74 4 3, 4 75.84
2 4, 3 75.17 4 4,3 76.24
2 4, 4 75.14 4 4, 4 75.34
Table 14
Results of the two approaches using shared layers [Jung et al. (2020b)].
Number of Head Number of Accuracy Number of Number of Accuracy
Transform % Head Transform Layers %
Layers
2 0 76.12 16 0 76.00
4 3 76.75 32 2 76.24
8 2 75.96 32 3 76.27
Table 12 shows the results of applying multi-head attention by creating an attention map for ASC using the tag vector. In most situations, the results show that multi-head attention enhances performance more than simple concatenation.

3.7.5. Combining multi-head attention with concatenation
The authors Jung et al. (2020b) also suggested combining concatenation with multi-head attention. Table 13 demonstrates the outcomes of transforming a tag vector for attention mapping and for concatenation using separate fully-connected layers, whereas Table 14 shows the results of the two approaches using shared layers. According to the findings, the result with separate transform layers is 76.24%, which is lower than the result of employing merely multi-head attention, which is 76.58%. But after combining concatenation with multi-head attention, the obtained accuracy is 76.75%.

3.7.6. Dataset details and the experimental configuration
Tables 15 and 16 show the details of the datasets used by the authors, from Michael Mandel and Ellis (2019) and Fonseca et al. (2019). Stereo raw waveforms of shape (479999, 2) are accepted by the ASC system. Table 17 contains information on the audio tagging system.

3.7.7. Results and comparison to similar systems
Jung et al. (2020b) compare the top-performing system to three other systems that displayed the best performance when given a raw waveform as input; the details are shown in Table 18. The system developed by Huang et al. (2019) uses the SincNet architecture, which was earlier proposed by Ravanelli and Bengio (2018). The method of Jung et al. (2019) combined specialized DNNs with teacher–student learning. The system developed by Zheng (2019) employed an end-to-end DNN architecture with improvements to random cropping and padding. According to the results, the suggested audio tagging system outperforms competing systems that employ the raw waveform as input. On the DCASE task 1-a dataset, the suggested system has an accuracy of 76.75%.
Table 15
Dataset details 1 for audio tagging system [Jung et al. (2020b), Michael Mandel and Ellis (2019)].
Purpose Dataset source Audio recording Audio recording Number of Total number Recording type Recording
duration source acoustic scene of recordings sampling rate
(Class)
ASC Task DCASE 2019 40 h 12 different 10 14400 Stereo 48000
Experiments Task 1.a European (10 s for
cities each recording)
Table 16
Dataset details 2 for audio tagging system [Fonseca et al. (2019), Jung et al. (2020b)].
Purpose Dataset source Number of Duration for Number of sound
audio clips curated subset events defined
Extraction of tag DCASE 2019 Task 4970 for curated 0.3 to 30 s for 80
vectors 2 subset manual labels
using with manual (Total 10.5 h)
audio tagging labels. 1 to 15 s for noisy
system 19815 with noisy labels
labels. (Total 80 h)
Table 17
Experimental configuration details for audio tagging system [Jung et al. (2020b)].
Frequency spectrum Number of Augmentation Memorization and sensitivity Model architecture
channels type technique used
Mel-Spectrogram 128 SpecAugment and Mixup : Beyond Empirical Risk ResNet (44 million trainable parameters
Mel-frequency Slicing Minimization exploited with few modification)
Table 18
Comparison of the work [Jung et al. (2020b)].
Architecture System proposed by Input Number of parameters Accuracy(%)
SincNet Huang et al. (2019) Raw waveform 53452k 76.08
Teacher–Student Learning Jung et al. (2019) Raw waveform 636k 75.81
with specialist DNNs
End-to-end DNN Zheng (2019) Raw waveform – 69.23
Feature extraction using Front-end DNN Jung et al. (2020b) Raw waveform 676k 76.75
and Classification using Back-end SVM
3.8. Cohesive deep neural networks for sound event recognition, audio tagging, and ASC

Jung et al. (2020c) proposed three distinct DNN designs to integrate three tasks; two of them are inspired by human cognitive processes and the third extends the second architecture. Using a single DNN that learns and performs ASC, TAG, and SED, the integrated framework performs two segment-level tasks and one frame-level task. The first design is similar to the design proposed by Jung et al. (2020b); in that technique, it reflects adults' short-term perception of ASC. The authors suggest an integrated framework that includes SED, ASC, and TAG. The second architecture is inspired by newborns' long-term learning. In comparison to SED, which needs frame-wise multi-label binary classification, the authors assumed that segment-wise multi-class ASC would require a lower abstraction level. Each task's abstraction level is considered in the proposed architecture, which performs relatively early scene classification followed by customized SED. Comparative tests show that the second architecture performs well, so it has been expanded to include a few different layers for each activity. The authors created a single DNN that incorporates ASC, TAG, and SED and can be fine-tuned to emphasize the performance of any of these tasks. DcaseNets are the suggested DNNs for establishing pre-trained models for a wide range of acoustic tasks. This research makes three major contributions:

1. Using DcaseNets, a DNN architecture, to conduct ASC, TAG, and SED at the same time.
2. Creating DNN architecture based on human cognitive processes.
3. Demonstrating that fine-tuning the integrated DNNs for a particular task boosts performance.

3.8.1. Dataset and experimental set-up for DcaseNet architecture
The authors Jung et al. (2019), Jung et al. (2020a) and Primus and Eitelsebner (2019) have used various datasets whose specifications are shown in Table 19. Table 20 shows details of the experimental set-up for the DcaseNet architecture.

3.8.2. Architecture 1: DcaseNet-v1
As shown in Fig. 21, the authors Jung et al. (2020c) proposed a DNN architecture for performing ASC, TAG, and SED simultaneously. The first design uses a CRNN to complete SED and subsequently segment-level scene classification.

3.8.3. Architecture 2: DcaseNet-v2
The second architecture is similar to a baby's long-term learning, as proposed by Jing and Tian (2019), Doersch et al. (2015) and Deng et al. (2009). It is considered that ASC requires a lesser level of abstraction than TAG and SED, according to Kim et al. (2020) and Imoto et al. (2020). As a result, after a few convolutional neural network (CNN) blocks, the second architecture conducts ASC, followed by TAG and SED after a Gated Recurrent Unit (GRU) layer. The complete architecture is shown in Fig. 22. A simplified sketch of this layout is given after the next subsection.

3.8.4. Architecture 3: DcaseNet-v3
The third architecture, shown in Fig. 23, adds layers to the second architecture before performing each activity. These individual layers are also concatenated and sent to further layers. The authors maintain an information path while designating a few layers to only focus on accomplishing a single task, following He et al. (2016b) and Kosmider (2019).
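A rough sketch of the DcaseNet-v2 idea — one shared trunk, an ASC head after the convolutional blocks, and TAG/SED heads after a GRU layer — is shown below. All layer widths and input sizes are illustrative assumptions and do not reproduce the architecture of Jung et al. (2020c).

import tensorflow as tf

def dcasenet_v2_sketch(n_mels=128, n_frames=500, n_scenes=10, n_tags=80, n_events=14):
    # Shared trunk with one head per task, loosely following the DcaseNet-v2 idea
    inp = tf.keras.Input(shape=(n_frames, n_mels, 1))
    x = inp
    for filters in (64, 128):                                   # a few CNN blocks (illustrative sizes)
        x = tf.keras.layers.Conv2D(filters, 3, padding="same", activation="relu")(x)
        x = tf.keras.layers.MaxPooling2D((1, 2))(x)
    # segment-level ASC head directly after the CNN blocks (lower abstraction level)
    asc = tf.keras.layers.Dense(n_scenes, activation="softmax",
                                name="asc")(tf.keras.layers.GlobalAveragePooling2D()(x))
    # collapse the frequency axis, then a GRU layer for the TAG and SED heads
    t = tf.keras.layers.Reshape((n_frames, -1))(x)
    t = tf.keras.layers.GRU(128, return_sequences=True)(t)
    sed = tf.keras.layers.TimeDistributed(
        tf.keras.layers.Dense(n_events, activation="sigmoid"), name="sed")(t)   # frame-level SED
    tag = tf.keras.layers.Dense(n_tags, activation="sigmoid",
                                name="tag")(tf.keras.layers.GlobalAveragePooling1D()(t))  # segment-level TAG
    return tf.keras.Model(inp, [asc, tag, sed])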
Table 19
Dataset specification for DcaseNet architecture [Jung et al. (2020c)].
Dataset Specification Dataset Training Number of Segment Number of Number of Sampling Resolution and
Task duration (h) segments duration (s) evaluation Classes rate (kHz) Monaural audio
segment (Bits)
Acoustic Scene DCASE 2020 Task 1-a 43.0 13,965 10 2,970 10 24 16
Classification (ASC) dataset
Audion Tagging (TAG) DCASE 2019 Task 2 10.5 3,976 0.3–30 994 80 24 16
dataset
Sound Event Detection DCASE 2020 Task 3 10.0 600 60 100 14 24 16
(SED) dataset
Table 20
Experimental set-up for DcaseNet [Jung et al. (2020c)].
Input Number of Extraction Window Number of Number of Batch size Segment Optimiza- Hyperpa- Neural
features to Bins in mechanism size / interactions epochs duration for tion rameters for network
each DNN extracted overlap SED technique training used
Spectro-
gram
128- 2048 FFT 40ms/20ms 500 as one 160 ASC - 32 5 and 30 s Adam [30] ASC : Fixed CRNN - 8
dimensional epoch TAG - 24 with TAG : Fixed layers with
SED - 32 learning SED : Fixed 512 output
Mel- rate : 0.001 filters in
spectrogram last layer
SED and TAG are performed in order by the hidden layers, assuming that TAG, a segment-level task, requires a higher abstraction level than SED, a frame-level task, as was also suggested by Nguyen et al. (2020) and Zhang et al. (2017). This paper shows the overall comparative analysis for the three DcaseNet architectures in Table 21.

3.8.5. Results from the DcaseNet architectures
For TAG, 20% of the dataset is used for evaluating performance. The performance of the three DcaseNet architectures when trained on a single task is listed in Table 22. The three architectures are used to explore the influence of Mix-up according to Zhang et al. (2017). Table 23 shows the performance of the three proposed architectures after joint training and fine-tuning. The results of training the model using various task combinations are listed in the columns ending with the phrase 'Joint Training'. The results of initializing the DNN using the jointly trained model and fine-tuning it for each job are listed in the columns ending with the phrase 'Fine-Tune'.

3.9. An approach for solving the ASC problem using SoundNet by integrating the results from each layer of the CNN

Scene classification using acoustic information is a challenging process because of numerous factors, such as the environment's non-stationary nature and numerous overlapping audio events, which was also pointed out by Waldekar and Saha (2018). To overcome the ASC difficulty, the authors employed SoundNet, a DCNN that was trained on raw audio signals. They presented a classification technique that merges the scores from each layer. This is predicated on the notion that a DCNN's layers learn complementary facts, and that integrating this layer-wise knowledge improves classification over individual layer characteristics.
Table 21
DcaseNet architecture [Jung et al. (2020c)].
DcaseNet DcaseNet-v1 DcaseNet-v2 DcaseNet-v3
Architecture
Concept Perception Mechanism of an Long-term learning of babies. DcaseNet-v2 architecture with additional
adult layers for
conducting each task.
Basic 1. Event detection using 1. Abstarct notion like gravity, dimensions 1. Separate layers are concatenated and feed
Functionality Convolutional Recurrent etc. are acquired forwarded
Neural Network (CRNN). and perform the specific tasks like moving to subsequent layers.
2. Segment level scene things. 2. Maintain an information path to show how
classification for ASC. 2. For each task representation, abstraction the layers are
3. Short event detection level is considered. solely concentrating
improves the overall 3. Few CNN blocks are required for ASC task on performing individual task.
detection throughout an audio and Gated 3. ASC task is completed using layers of CNN.
segment. Recurrent Unit (GRU) is used for TAG and SED is
SED tasks. completed with CNN and GRU. TAG task is
completed using
CNN, GRU and then Dense.
Result Analysis 1. Without mix-up, TAG and 1. After five tuning, it shows higher 1. Highest performance among all three
SED performance improve. performance than the architectures.
2. With mix-up, ASC and TAG corresponding baseline consistently. 2. After fine tuning for each task, it
performance improve. 2. Consideration of abstraction level - more outperforms the baseline
effective. except Error Rate(ER).
3. Obtained value for ER is 31.85%
4. Accuracy (Acc) for ASC after fine tuning is
70.35%.
5. Result is consistent with the hidden layers
also.
Drawback The jointly trained model does Without five tuning, the jointly trained model For joint training, degradation of accuracy in
not generated well across has a lower ASC requires
the three tasks(ASC, TAG and overall performance then the corresponding further investigation.
SED) after fine tuning. baseline for
each task.
Table 22
Performane of DcaseNet architecture trained on a single task [Jung et al. (2020c)].
Task Performance Best DcaseNet DCASE Reference Memorization
metric performance Architecture Baseline system sensitivity
% name % % technique used
(Mix-up)
ASC Acc 69.54 DcaseNet-v2 54.10 65.30 Yes
TAG lwlrap 70.62 DcaseNet-v3 – – Yes
F1 79.62 60.60 76.20 Yes
SED DcaseNet-v1
ER 34.86 54.00 30.50 No
Furthermore, the authors offer a pooling technique for reducing the features derived from the various SoundNet layers. The success of this layer-wise ensemble technique is demonstrated by trials on the DCASE classification dataset. According to Singh et al. (2018), the suggested method improves classification accuracy by around 30.85% over the finest single layer of SoundNet.

3.9.1. Proposed methodology
The author Huzaifah (2017) noted that the number of features in each layer's 1-D feature maps is high and also depends on the length of the input signal. A pooling method is therefore used to reduce the features, representing each feature map as a scalar number using the sum or max operator. For each audio stream, the scalar values of all feature maps are merged to create a fixed-length vector. The sum operator retrieves the summation over a feature map, whereas the max operator computes the largest value over the feature map. A given layer has N feature maps, as shown in Fig. 24. The pooling operator generates an N-dimensional feature vector by computing a real value for each of the feature maps.
Table 23
Performane of DcaseNet architecture trained after joint training and fine tuning [Jung et al. (2020c)].
Task Performance Best perfor- DcaseNet Best perfor- DcaseNet Combination of Memorization
metric mance(%) architecture mance(%) architecture Task sensitivity
after joint Name for joint after fine name for fine technique used
training training tuning tuning (Mix-up)
ASC Acc 61.08 DcaseNet-v3 70.35 DcaseNet-v3 ASC and TAG Yes
TAG lwlrap 76.23 DcaseNet-v3 75.99 DcaseNet-v3 TAG and SED Yes
F1 77.93 DcaseNet-v3 81.32 DcaseNet-v2 TAG and SED Yes
SED
ER 53.67 DcaseNet-v1 34.70 DcaseNet-v1 ASC, TAG and NO
SED
Fig. 24. Block diagram for layer-wise score level ensemble framework for ASC (Bisot et al., 2016; Singh et al., 2018).
Fig. 25. Steps for acoustic scene classification using auditory datasets (Kumpawat & Dey, 2021).
The layer-wise analysis is performed after gathering the pooled feature vectors from the 15 hidden layers. A unique classifier model is developed for each layer utilizing the feature vectors; as a consequence, 15 distinct classifier models are produced. SVM and MRE-based classifiers are employed for this work. The feature vectors of the testing samples are used to evaluate the layer-wise trained models. Based on the layer-wise analysis, a fusion-based classification system is proposed: the classification results of the 15 models that were trained using the 15 hidden layers' feature vectors are pooled. Each classifier model's fused scores are estimated using two methods: (i) majority voting (MV) of the labels and (ii) ML estimation. In the case of majority voting, the final class labels are determined by the majority vote of the labels acquired from each classifier. When using ML, the class-wise scores from each model are linearly integrated to produce fused scores, and the test example is assigned to the class with the highest score. The complete framework is shown in Fig. 24.
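The layer-wise pooling and the two fusion rules can be sketched as follows. Array shapes and the choice of per-layer classifier are left out, so the functions below only illustrate the pooling and score-fusion steps under these assumptions.

import numpy as np

def pool_layer(feature_maps, mode="max"):
    # Collapse each 1-D feature map of one SoundNet layer to a scalar (sum or max pooling)
    op = np.max if mode == "max" else np.sum
    return np.array([op(fm) for fm in feature_maps])      # N feature maps -> N-dimensional vector

def fuse_majority_vote(layer_predictions):
    # Majority vote over the labels predicted by the per-layer classifiers
    labels, counts = np.unique(layer_predictions, return_counts=True)
    return labels[np.argmax(counts)]

def fuse_scores(layer_scores):
    # Linear integration of class-wise scores from all per-layer classifiers
    fused = np.sum(np.stack(layer_scores), axis=0)        # (n_layers, n_classes) -> (n_classes,)
    return int(np.argmax(fused))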
3.10. Acoustic scene classification using auditory datasets

According to Barchiesi et al. (2015) and Mesaros et al. (2016), in recent years the technique of using audio datasets to tackle problems involving land-use patterns has received a lot of research attention. Despite being a burgeoning sector in the study of intelligent data analysis, there is still a long way to go. The authors Kumpawat and Dey (2021) have suggested how to classify a pre-determined acoustic scene.

3.10.1. Proposed methodology
To begin with, various attempts have been made in the research and development sector to overcome difficulties linked to land-use pattern classification. However, the majority of the early work in this area was exclusively dependent on imagery datasets, according to Hershey et al. (2017) and Kong et al. (2017). This resulted in serious public privacy difficulties, and the general public was not in favor of such studies, as was also observed by Mesaros et al. (2017) and Mesaros et al. (2018b). This work is built on the idea of solving the difficulties outlined by utilizing auditory datasets. Second, if one examines the mathematical techniques employed in the initial and current studies, it may be concluded that limited concepts were utilized from different fields. The scientists typically employed linear algebra methods such as matrix manipulation, which could not handle the expected computing complexities, as per the suggestion given by Zeinali et al. (2019). The authors used spectrograms, which have been shown to solve existing computation problems. Spectrograms may handle many audio characteristics at once; this allows the computational complexities to be kept within the predicted ranges. Furthermore, a significant difficulty for efforts in the domain thus far has been the lack of uniform vocabulary in the datasets in use. Simply put, the character of audio captured from various locations, such as New York, changes significantly for a given acoustic scene such as an office. The overall characteristics of audio, such as frequency, average amplitude, and so on, vary widely.
Fig. 26. Block diagram for acoustic scene classification using DL architectures (Spoorthy et al., 2021).
The use of spectrograms helps to mitigate this problem because the audio features are evenly represented. The data augmentation techniques aid in better fitting the examined data into the model. The classification problem is solved using a pre-trained neural network. Furthermore, earlier techniques for classifying audio overlooked, for technical reasons, the time factors that comprise the audio dataset. Here, the spatiotemporal factor is taken into account by using a problem-specific convolutional neural network, which increases the quality of the result. The steps for acoustic scene classification using auditory datasets are depicted in Fig. 25.

3.11. An analysis of the procedures of CNN and CRNN models to solve the ASC problem

The authors Spoorthy et al. (2021) and Wang et al. (2021) have divided audio recordings into classes based on the scenes and environments in which they were recorded. DL is one of the most modern developments in most applications. CNN and CRNN are two DL techniques utilized by the authors to classify acoustic scenes. Three activation functions, ReLU, LeakyReLU, and ELU, are employed to assess the model. The CRNN model scored the greatest recognition accuracy of 90.96% for the ASC task. The model performed well on the fundamental convolution architecture, with a 10.9% improvement over the standard system in this task.

3.11.1. Proposed methodology
Bear et al. (2022) and Nanni et al. (2019) suggested that feature representations are extracted from the audio segments and then fed into deep learning models. Because of the short duration of acoustic scenes, it is necessary to express the segments as Time–Frequency Representations (TFR) to obtain more distinct information. Before extracting log-mel spectrogram features, the audio segments are subjected to a standard STFT. The suggested approach incorporates features from the standard model of the DCASE 2019 challenge. ReLU, ELU, and LeakyReLU functions are employed in the hidden layers of the deep learning models, while softmax is utilized in the final layer. A tiny non-zero constant gradient is allowed by the LeakyReLU activation function; typically, a value of 0.01 is used. By maintaining a small negative slope, this activation function addresses the problem of dying ReLU. ELU smoothing occurs gradually until the output equals negative alpha. The categorical cross-entropy loss function, which is best suited for multi-class classification, is used in both the CNN and CRNN models. The Adam optimizer is used to train the network, with a learning rate of 0.001. 80% of the data is used to train the models, while 20% is used to test them. According to Koutini et al. (2019d), convolution and max-pooling techniques are employed in the CNN architecture for the ASC job. To avoid bias, a batch normalization layer is included in the network. The flattening layer is utilized to flatten the various blocks and reduce them to one dimension, and the network employs two dense layers after flattening. The final dense layer must include a number of dense units equal to the dataset's classes. To prevent the model from overfitting, a dropout layer is added to the network, and the activation layer is the final layer. The convolution and max-pooling blocks of the CRNN model utilized in this study are followed by the activation layer, batch normalization, and drop-out layer. Permute and reshape layers are also required in this case: permute layers are used to modify the direction of the axes of the feature vectors, and the feature vector is then converted to a two-dimensional feature vector using a reshape layer. Two GRUs are used in the proposed network; past timestamps are taken into account by these GRU levels. Finally, the dense layers receive the output of the bidirectional layers, which is delivered as the fully connected layer's input. The CRNN model with the ELU activation function achieves the highest accuracy in recognizing the ASC task, which is 90.96%. For the LeakyReLU function, 73.61% is the best recognition accuracy for the CNN model proposed by Spoorthy et al. (2021). The complete functionalities for ASC using DL architectures are shown in Fig. 26.
3.11.1. Proposed methodology of acoustic scene classification tasks. Future Directions for both pre-
Bear et al. (2022) and Nanni et al. (2019) suggested that the processing and classification for ASC are also listed in Table 24.
audio segments are used to extract feature representations, which are
then fed into deep learning models. Because of the short duration 1. Integrated Deep Neural Network(DNN) works fine for detect-
of acoustic scenes, it is required to express the segments in Time– ing and classifying acoustic scenes and events, but accuracy
Frequency Representations(TFR) to obtain more distinct information. is degraded for joint training. It is possible to implement self-
Before extracting log-mel spectrogram features, the audio segments are supervised learning to enhance their performance. Moreover,
subjected to a standard STFT. The suggested approach incorporates knowledge distillation and soft-label training can be applied to
features from the standard model for DCASE 2019 challenge. ReLU, enable cross-dataset training.
ReLU, ELU, and LeakyReLU functions are employed in the hidden layers of the deep learning models, while softmax is utilized in the final layer. The LeakyReLU activation function allows a tiny non-zero constant gradient; typically, a value of 0.01 is used. By maintaining a small negative slope, this activation function addresses the problem of dying ReLUs. ELU saturates smoothly and gradually until the output equals negative alpha. The categorical cross-entropy loss function, which is best suited for multi-class classification, is used in both the CNN and CRNN models. The Adam optimizer is used to train the networks with a learning rate of 0.001; 80% of the data is used to train the models, while 20% is used to test them. According to Koutini et al. (2019d), convolution and max-pooling operations are employed in the CNN architecture for the ASC task. To avoid bias, a batch normalization layer is included in the network. A flattening layer is utilized to flatten the various blocks and reduce them to one dimension, and the network employs two dense layers after flattening the axes of the feature vectors. For the CRNN, the feature vector is converted to a two-dimensional feature vector using a reshape layer, and two GRUs are used in the proposed network; these GRU layers take past timestamps into account. Finally, the output of the bidirectional layers is delivered to the dense layers as the fully connected layers' input. The CRNN model with the ELU activation function achieves the highest accuracy on the ASC task, 90.96%, while 73.61% with the LeakyReLU function is the best recognition result for the CNN model proposed by Spoorthy et al. (2021). The complete functionality for ASC using DL architectures is shown in Fig. 26.
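To make the pipeline above concrete, the sketch below builds a small CRNN in Keras with the ingredients just described: convolution, max-pooling and batch-normalization blocks, a reshape into a time sequence, two bidirectional GRUs, dense layers with a softmax output, the categorical cross-entropy loss, and the Adam optimizer with a learning rate of 0.001. It is an illustrative sketch, not the authors' exact implementation; the input shape, filter counts and layer widths are assumptions.

    import tensorflow as tf
    from tensorflow.keras import layers, models

    def build_crnn(input_shape=(40, 500, 1), n_classes=10):
        # Convolutional front end: convolution, max-pooling and
        # batch normalization, as described in the text.
        inp = layers.Input(shape=input_shape)
        x = inp
        for n_filters in (32, 64):
            x = layers.Conv2D(n_filters, (3, 3), padding="same")(x)
            x = layers.BatchNormalization()(x)
            x = layers.ELU()(x)          # ELU gave the best CRNN accuracy
            x = layers.MaxPooling2D((2, 2))(x)
        # Reshape the feature maps into a (time, features) sequence and
        # model past timestamps with two bidirectional GRU layers.
        shape = x.shape                  # (batch, mels, frames, channels)
        x = layers.Reshape((shape[2], shape[1] * shape[3]))(x)
        x = layers.Bidirectional(layers.GRU(64, return_sequences=True))(x)
        x = layers.Bidirectional(layers.GRU(64))(x)
        # Dense layers with a softmax output over the scene classes.
        x = layers.Dense(128, activation="relu")(x)
        out = layers.Dense(n_classes, activation="softmax")(x)
        model = models.Model(inp, out)
        model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=0.001),
                      loss="categorical_crossentropy", metrics=["accuracy"])
        return model

The plain CNN variant reported above would replace the GRU block with a Flatten layer and use LeakyReLU (slope 0.01) in the hidden layers; training on the 80%/20% train/test split then follows via model.fit.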
4. Future direction

In the context of the current status of work on acoustic scene classification, the following important facts have been observed. It is an opportunity and a challenge for eminent research scholars to grab these facts and contribute towards the enhancement of acoustic scene classification tasks. Future directions for both pre-processing and classification for ASC are also listed in Table 24.

1. Integrated Deep Neural Networks (DNN) work fine for detecting and classifying acoustic scenes and events, but accuracy is degraded under joint training. It is possible to implement self-supervised learning to enhance their performance. Moreover, knowledge distillation and soft-label training can be applied to enable cross-dataset training.
2. Acoustic scene classification can be done using audio tagging systems, but this increases the dimensionality of the code and creates two independent representations in each fragment of the concatenated representation. It was also observed that, due to overfitting brought on by too many parameters in fully connected layers, individual transformation layers show worse performance. The overfitting problem must be resolved.
3. Acoustic scenes can also be classified using multiple devices, but a decomposed convolutional layer decreases the accuracy compared to the full-parameter baseline. Moreover, Maximum Mean Discrepancy produces results that are poorer than the reference baseline.
4. Convolutional Recurrent Neural Networks (CRNN) and grid search can be used for acoustic scene classification. CRNN can give better results, and grid search can be used for optimizing the parameters for data augmentation.
Table 24
Summary of existing work on different classification techniques for ASC.

Jung et al. (2020c), 2020.
Findings: 1. Three integrated DNN architectures, DcaseNets, are proposed for three tasks – ASC, Audio Tagging (AT) and Sound Event Detection (SED). 2. Developed DNN architectures based on human cognitive processes, such as adults' scene perception and newborns' long-term learning. 3. Fine-tuning the integrated DNNs for a particular task boosts performance.
Implementation/evaluation mechanism: 1. DcaseNet-v1 – event detection using a Convolutional Recurrent Neural Network (CRNN) and segment-level scene classification for ASC. 2. DcaseNet-v2 – a few CNN blocks for ASC, and a Gated Recurrent Unit (GRU) giving AT and SED. 3. DcaseNet-v3 – DcaseNet-v2 plus additional layers.
Results: 1. Highest performance among all three architectures – DcaseNet-v3. 2. Accuracy for ASC after fine-tuning: 70.35%. 3. After fine-tuning for each task, error rate: 34.70%. 4. Architectures trained on a single task – DcaseNet-v1: F1 79.62%, ER 34.86%; v2: Acc 69.54%; v3: LWLRP 70.62%.
Evaluation metric/tools: 1. Overall classification accuracy (Acc) for ASC. 2. Label-Weighted Label-Ranking Average Precision (LWLRP) for TAG. 3. F-Score (F1) and error rate for SED.
Future scope: 1. Degradation of ASC accuracy for joint training in DcaseNet-v3. 2. Self-supervised learning can be adopted to improve task performance. 3. Knowledge distillation and soft-label training can be applied to enable cross-dataset training. 4. Further investigation and improvements can be done, as DcaseNet is a first approach to integrate the three tasks together.
Datasets: 1. ASC task – DCASE 2020 Task 1-a dataset. 2. TAG task – DCASE 2019 Task 2 dataset. 3. SED task – DCASE 2020 Task 3 dataset.

Spoorthy et al. (2021), 2021.
Findings: 1. Audio recordings are divided into classes based on the scenes and environments in which they were recorded. 2. Audio segments are used to extract feature representations, which are then fed into deep learning models.
Implementation/evaluation mechanism: 1. Convolutional Neural Network (CNN) and Convolutional Recurrent Neural Network (CRNN) for classification. 2. Categorical cross-entropy loss function. 3. Optimizer: Adam, learning rate 0.001.
Results: 1. CRNN – 90.96% accuracy with the ELU activation function for classification; improvement of 10.9%. 2. CNN – 73.61% with the LeakyReLU function for classification.
Evaluation metric/tools: 1. Accuracy. 2. Confusion matrices. 3. Activation functions ReLU, LeakyReLU, and ELU are used to evaluate the model.
Future scope: 1. The activation function and the depth of the neural network architecture play an important role in the performance of the network. 2. Different DL architectures and feature representations can be used to improve the performance of the ASC system.
Datasets: DCASE 2019 ASC Task 1.a development dataset; 10 acoustic scenes; 14,400 audio recordings (1,440 per class); sampling rate 48,000 Hz; resolution 24-bit.

Kumpawat and Dey (2021), 2021.
Findings: 1. The use of spectrograms helps to mitigate the problem of changing frequency and amplitude of audio. 2. The classification problem is solved using a pre-trained NN. 3. A model is built to classify some pre-defined acoustic scenes. 4. Input: audio comprising a few scenes. 5. Output: scene class. 6. ASC tasks: CNN (ReLU and Softmax), pre-trained VGG16.
Implementation/evaluation mechanism: 1. Pre-trained neural network for classification, with spatiotemporal factors considered for temporal modeling. 2. Pre-trained VGG16 model with custom classifiers. 3. Data loaded in batches of 32. 4. Auditory masking and time stretching are done for data analysis and augmentation.
Results: 1. 68% accuracy obtained in the 24th epoch. 2. 67.6% accuracy on the testing dataset.
Evaluation metric/tools: Accuracy; ReLU and Softmax activation functions.
Future scope: 1. For ASC tasks, a Convolutional Recurrent Neural Network (CRNN) can give better results. 2. Domain-specific results are obtained, but a novel solution for various domains with diverse backgrounds is required.
Datasets: DCASE 2019 Challenge dataset; 10-s audio signals for 10 acoustic scenes; 1,440 segments per scene; 40 h of audio.
5. A spectrogram with learnable hyper-parameters has not been addressed for the processing of audio data.
6. Active learning and semi-supervised learning approaches in combination are not discussed for labeling the training audio data. The annotation task can be reduced with the combined approach.
7. Labeling with clustered data has not been determined in detail for the annotation of training audio data.
8. The loss function is a heuristic method for detecting outliers. Using a different loss function when training a neural network created separately for one-class classification is an alternative to Deep Convolutional Auto Encoders.
9. Instead of learning how to change signals, DCASE challenges emphasize a fixed spectrogram-based signal representation. Therefore, researchers should always first focus on using a fixed spectrogram to represent the audio signal.
10. To boost robustness to loudness change, utilize PCEN and address the problems with log compression. On sizable recorded noisy and far-field evaluation sets, PCEN greatly enhances identification performance. Contrary to principal component analysis, PCEN can be distributed among multiple sensors and implemented in real time. The performance of the keyword-spotting acoustic model, which uses PCEN, can be further enhanced by learning data-dependent parameter values (a minimal PCEN sketch is given after this list).
11. To visualize CNNs more deeply, feature-level attention models can be studied. To examine the temporal information in acoustic scenes, combining CNNs with phase-wise learning strategies and 3D CNNs is one option.
12. A layer-wise ensemble approach can be used with a DNN for acoustic scene classification. The DNN can be trained as effectively as a layer-wise ensemble framework with fewer data to incorporate the hidden-layer knowledge.
13. The depth of the utilized NN architecture and the activation function have a significant impact on how well the network performs. Therefore, the ASC system's performance can be enhanced by using various DL architectures with appropriate depth and feature representations.
14. There is still an opportunity to improve the Factorized CNN model combined with various approaches to cut down the number of features, because if the results are compared to DCASE 2019 Rank 1 (even though the model parameters are different), DCASE 2019 Rank 1 gives better results.
15. Researchers should focus on developing models using genetic algorithms. Although the Deep Self-Adaptive Genetic Algorithm does not give better performance (72.8% accuracy), more research will be needed in the future to evaluate its performance with different types of datasets.
Table 24 (continued).

Abeßer (2020), 2020.
Findings: 1. DL-based ASC algorithms, data preparation, and data modeling are summarized. 2. ASC task – the detection of the audio events present in a scene. 3. Data modeling learning paradigm – to analyze ASC systems. 4. Open challenges arise from deploying ASC algorithms to real-world application scenarios – like model interpretability. 5. Review of DL techniques for audio signal processing. 6. Feature representations like log-mel spectra and raw waveform, and DL models like audio-specific CNN and Long Short-Term Memory architectures, are reviewed. 7. Data-driven filters and Mel Frequency Cepstral Coefficients (MFCCs) are used for acoustic feature extraction.
Implementation/evaluation mechanism: 1. CNN – multi-scale feature approach, Xception network architecture. 2. The ASC algorithm from DCASE takes a spectrogram as network input, represented as a Mel spectrogram. 3. ASC algorithms use CNN-based network architectures. 4. Feed-Forward Neural Networks (FFNN). 5. CRNN – Gated Recurrent Neural Networks (GRNN) and Time-Delay Neural Networks (TDNN), Bidirectional Gated Recurrent Units (BiGRU). 6. Acoustic Event Detection (AED) algorithms use CRNN-based network architectures.
Results: 1. With available data, deep learning models outperform traditional methods such as Gaussian mixture models and hidden Markov models at extracting useful features. 2. Rapid increase in scientific publications driven by recent advances in the field of deep learning, such as transfer learning, attention mechanisms, and multitask learning, as well as the release of various public datasets – state-of-the-art ASC algorithms have given better results.
Evaluation metric/tools: Accuracy, error rate, confusion matrix, ROC curve, F1-score, etc.
Future scope: 1. State-of-the-art ASC algorithms have matured and can be applied in context-aware devices. In real-world application scenarios, novel challenges need to be faced, such as microphone mismatch and domain adaptation, open-set classification, as well as model complexity and real-time processing constraints. 2. ASC algorithms with novel techniques from unsupervised and self-supervised learning to classify novel sounds while maintaining knowledge about previously learned acoustic scenes and events.
Datasets: DCASE 2018 and 2019 challenge datasets.

Shuyang et al. (2020), 2020.
Findings: 1. The SED technique requires a huge amount of labeled data for training purposes. 2. An active learning system obtains the accuracy of the SED model with the least amount of annotation. 3. Sound event detection model – trained with annotated samples.
Implementation/evaluation mechanism: 1. Gated CNNs containing 6 blocks. 2. Linear and sigmoid layers. 3. Weakly supervised, based on attention.
Results: 1. Annotation result: 2% from training. 2. Budget saved for labeling: 90%. 3. Best performance – annotating only 5%.
Evaluation metric/tools: Segment-based error rate (ER); accuracy.
Future scope: 1. The accuracy of the active learning system does not improve with an increasing labeling budget due to high label distribution bias, which has a negative effect on the accuracy of the learned models. 2. Active learning with semi-supervised learning techniques has not been addressed so far.
Datasets: TUT Rare Sound 2017 (DCASE 2017 Challenge) and TAU Spatial Sound 2019 (DCASE 2019 Task 3).

Ye et al. (2020), 2020.
Findings: 1. A method to look into ambient sound for saving lives. 2. Development of an emergency sound recognition framework based on acoustic feature learning and data-driven taxonomy construction to depict acoustic clues for early disaster detection. 3. An automated method of creating a taxonomy to make it easier to identify multiple classes of hazardous sound.
Implementation/evaluation mechanism: 1. Hazard sound event feature learning based on sparse coding – dictionary learning. 2. At the classification stage, results using Mean Average Precision (MAP) for all hazard sound categories showed that HR-LR worked. 3. CMSM to remove the background noise.
Results: The model gives 90% accuracy for sound identification and 87% Mean Average Precision (crowd-sourced audio dataset).
Evaluation metric/tools: Accuracy; Mean Average Precision.
Future scope: 1. The limited data collection may lead to a high-variance issue; therefore, the identification accuracy of the two classes can be further enhanced by supplying more clips. 2. Employ transfer learning methodologies for DNN-based systems, where a portion of the DNN can first be learnt from enormous external audio data, to address the issue of limited audio data.
Datasets: Evaluation set constructed from the BBC Sound Effects Library, UrbanSound8K, the ESC dataset, and sound effects from internet sources; 3,275 audio clips in 10 categories of emergency sound incidents.

Suh et al. (2020), 2020.
Findings: 1. Acoustic scene classification systems for DCASE 2020 Challenge Task 1. 2. Task 1 – ASC with multiple devices for Subtask A, and ASC with lower complexity for Subtask B. 3. Acoustic scene classification using multiple devices. 4. Use of focal loss, which attenuates the log-loss generated with the aid of well-trained samples.
Implementation/evaluation mechanism: 1. Task 1 Subtask A: Trident ResNet model (three parallel ResNets). 2. Task 1 Subtask B: shallow Inception model. 3. Optimizer: SGD with momentum 0.9.
Results: 1. For Subtask A: models trained at 62, 126, and 254 epochs. 2. For Subtask B: models trained at 254 and 510 epochs. 3. Task 1 Subtask A – accuracy 73.7% in classification for the test split. 4. Task 1 Subtask B – accuracy 97.6%.
Evaluation metric/tools: Accuracy.
Future scope: 1. Different issues with the model for Subtask B can be addressed more efficiently. 2. Data with multiple labels are not considered. 3. The duration of the data is 10 s, which contains useful information very rarely.
Datasets: 1. For Subtask A: TAU Urban 2020 Mobile dataset containing 23,040 samples. 2. For Subtask B: TAU Urban 2020 dataset containing 14,400 samples of different acoustic scenes from twelve European cities, with three labels – indoor, outdoor and transportation.
Table 24 (continued).

Koutini et al. (2020), 2020.
Findings: 1. Description of Task 1 of the DCASE 2020 challenge. 2. Task 1A: Acoustic Scene Classification with Multiple Devices. 3. Task 1B: Low-Complexity Acoustic Scene Classification. 4. Explores two domain-adaptation objectives – to reduce the mismatch between source devices and target devices. 5. Four different architectures are designed and developed for both Task 1A and Task 1B.
Implementation/evaluation mechanism: 1. Subtask 1.A – RFR-CNN model as a baseline. 2. Models developed using Maximum Mean Discrepancy (MMD) and the Sliced Wasserstein Distance (SWD) for Task 1A. 3. Subtask 1.B – weight pruning, layer decomposition, and width/depth reduction of the base network.
Results: 1. Accuracy of 70.80% for MMD and 71.80% for SWD with the CP-Res Damp architecture. 2. Accuracy of 71.07% with the CP-Res Damp architecture without MMD and SWD.
Evaluation metric/tools: Accuracy.
Future scope: 1. MMD produces results (70.80%) that are on average poorer than the reference baseline. 2. Devices should be robust.
Datasets: DCASE 2020 Challenge dataset – 10 acoustic scenes.

Jung et al. (2020b), 2020.
Findings: 1. ASC model developed with an audio tagging system (ATS) inspired by human perception mechanisms. 2. ATS – to detect sound occurrences in recorded audio. 3. A tag vector is obtained from the ATS and concatenated with the ASC system. 4. The tag vector also generates an attention map to classify the sound. 5. Multi-head attention was adopted to emphasize the feature map of the ASC using the tag vector.
Implementation/evaluation mechanism: 1. ASC architecture – DNN and SVM. 2. An SVM with a radial basis function or sigmoid kernel is used for back-end classification.
Results: 1. Classification accuracy for concatenation and multi-head attention is 75.66% and 75.58%, respectively, compared to 73.63% for the baseline. 2. The system's accuracy is 76.75% for the model using multi-head attention with the tag vector and ASC features.
Evaluation metric/tools: Accuracy.
Future scope: 1. Increasing the dimensionality of the code means two representations exist separately in each subspace of the concatenated representation. 2. Separate transformation layers led to worse performance because of overfitting due to too many parameters in the fully connected layers.
Datasets: 1. DCASE 2019 Task 1A dataset – audio recordings from 12 different European cities; number of acoustic scenes (classes): 10; audio recording duration: 40 h; recording type: stereo; sampling rate: 48,000 Hz. 2. DCASE 2019 Task 2 for the audio tagging system.

Roletscheck et al. (2019), 2019.
Findings: 1. An approach to automatically generate a suitable NN architecture and hyper-parameters for a given classification problem. 2. A genetic algorithm is used to generate and evaluate a variety of deep convolutional networks. 3. Handles the ASC task of the DCASE 2018 Challenge. 4. Explores the capabilities of neuro-evolution to discover an optimal DNN topology and its hyper-parameters for the ASC task.
Implementation/evaluation mechanism: Developed a genetic algorithm – the Deep Self-Adaptive Genetic Algorithm (DeepSAGA) – to automatically generate CNNs (consisting of a series of convolutional layers followed by a series of dense layers) from scratch.
Results: 1. Accuracy of 74.7% on the development dataset for the population-vote ("Pop. vote") strategy. 2. Accuracy of 72.8% on the development dataset for the CNN ("DeepSAGA CNN") strategy.
Evaluation metric/tools: Accuracy; fitness function – score.
Future scope: 1. The addition of a convolutional layer followed by a dense layer, followed by a convolutional layer and recurrent layers, could increase the classification accuracy. 2. DeepSAGA's methodology is generic and not just for audio; however, in order to better optimize it and move closer to the aim of rendering hand-crafted NN systems obsolete, more research will be needed in the future to test its performance.
Datasets: TUT Urban Acoustic Scenes 2018 dataset from Subtask A; 10-second audio segments from 10 different acoustic scenes.

Ren et al. (2019), 2019.
Findings: 1. The size of the receptive field (RF) is used to reduce the size of the feature maps. 2. Internal feature maps of the attention model are explained. 3. Four dilated convolutional layers followed by a global attention pooling model were used to fix the size of feature maps for the visualization.
Implementation/evaluation mechanism: 1. Atrous Convolutional Neural Networks (ACNNs) with global attention pooling used as the classification model. 2. ACNN with a large receptive field instead of local pooling to fix the size of feature maps.
Results: 1. Accuracy of 72.7%, outperforming the CNNs without dilation at 60.4%. 2. Learnt feature maps contain rich information on acoustic scenes in the time–frequency domain.
Evaluation metric/tools: Accuracy.
Future scope: 1. Feature-level attention models can be investigated to reach a deeper visualization of CNNs. 2. CNNs followed by sequence-to-sequence learning methods and 3D CNNs can be considered to investigate the temporal information in acoustic scenes.
Datasets: DCASE 2018 dataset; the dataset contains 10 acoustic scene classes, and each audio recording has a duration of 10 seconds.
16. Modern cutting-edge ASC algorithms have developed to the point where they can be applied to context-sensitive devices. Novel challenges must be overcome in real-world application contexts, including microphone mismatch and domain adaptation, open-set categorization, model complexity, and time restrictions for real-time processing.
17. Researchers should always focus on, and give more priority to, understanding how the behavior of a network or a sub-network might aid in improving the model's underlying framework to handle failure circumstances, and what the computational complexity will be for particular real-time audio signal processing requirements. Although doing this is a very challenging task, it will enhance the performance of the developed model for ASC.
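To make point 10 above concrete, the following minimal sketch applies per-channel energy normalization with librosa; the file name is illustrative, and the PCEN parameters shown are librosa's defaults rather than values tuned for ASC.

    import librosa

    # Illustrative file name; any scene recording can be used here.
    y, sr = librosa.load("scene_example.wav", sr=44100, mono=True)

    # Mel filter-bank energies without log compression (power=1), since
    # PCEN is applied in place of the static log-compression step.
    S = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=2048,
                                       hop_length=512, power=1.0)

    # PCEN replaces log compression with an adaptive gain control, which
    # is what makes the representation robust to loudness changes.
    P = librosa.pcen(S * (2 ** 31), sr=sr, hop_length=512,
                     gain=0.98, bias=2.0, power=0.5, time_constant=0.4)

The gain, bias, power and time-constant arguments are exactly the data-dependent parameters that item 10 suggests learning rather than fixing by hand.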
5. Conclusion

There have been a huge number of research paper publications in recent years in the field of solving the issues of acoustic scene classification. There are several reasons for achieving this milestone. Firstly, datasets are widely available from DCASE, or they can be obtained from various public dataset repositories. Secondly, due to the evolving technology of DL, DNN, convolutional neural network variants, support vector machines, integrated pre-trained deep neural networks, etc., the model for the ASC task can be enhanced. The DCASE community and other research forums provide opportunities, such as organizing competitions or annual evaluations of research papers, to enhance the performance of ASC tasks. The block diagram utilized for the pre-processing and classification of various sounds for ASC tasks is included in this paper, along with a systematic summary of the many contemporary techniques and deep learning and neural network-based algorithms.
Table 24 (continued).

Cho et al. (2019), 2019.
Findings: 1. An FCNN is used to learn the patterns by factorizing the 2D kernel into two separate 1D kernels. 2. The basic components of the two primary patterns of audio scene classification are learned as a result of the factorized kernel. 3. The cross-entropy loss function is paired with a modified triplet loss to improve the learnt model's performance.
Implementation/evaluation mechanism: 1. The samples from the same audio scene are grouped separately from the environment – a Factorized CNN with improved generalization capabilities in an unseen environment. 2. Large-margin loss function – cross entropy. 3. The large-margin FCNN has better generalization ability.
Results: 1. FCNN: 71.15% and 75.97% for the 40 and 200 log-mel cases, respectively; FCNN-triplet-spec: 73.14% and 77.19% for the 40 and 200 log-mel cases, respectively. 2. Reduces the number of parameters gradually.
Evaluation metric/tools: Accuracy.
Future scope: If the result of the FCNN is compared with the DCASE 2019 Rank 1 challenge cases (although the model parameters are different), the latter model gives better results; therefore, there is still an opportunity to enhance the FCNN model, integrated with different methodologies, to reduce the number of parameters.
Datasets: 1. The DCASE 2019 Challenge Task 1A dataset. 2. TAU Urban Acoustic Scenes dataset – acoustic scene samples recorded in ten different European cities – 10 audio scenes.

Sharma et al. (2019), 2019.
Findings: 1. Multiple feature channels consisting of MFCC, GFCC, CQT and Chromagram are used for the ESC task. 2. Separable convolutions, pooling and batch normalization layers, along with Leaky ReLU activation, dropout and L2 regularization, are used with the CNN. 3. Number of channels: 4.
Implementation/evaluation mechanism: 1. DCNN working on the time and feature domains separately, with max-pooling layers that downsample the time and feature domains separately. 2. The DCNN consists of eight repetitions of Conv2D-Conv2D-MaxPool-BatchNorm with different numbers of kernels and kernel sizes.
Results: Accuracy on the UrbanSound8K dataset (98.60%), the ESC-10 dataset (97.25%) and the ESC-50 dataset (95.50%).
Evaluation metric/tools: Accuracy.
Future scope: More feature channels may be added to further improve the discrimination strength of the input, but doing so may also raise the computational complexity of the entire model, pre-processing costs, and the number of kernels needed to handle the multi-channel input.
Datasets: UrbanSound8K dataset – 8,732 short audio clips of various environmental sound sources; ESC-10 dataset and ESC-50 dataset.

Jati et al. (2019), 2019.
Findings: 1. The challenge of describing acoustic scenes in a workplace environment is addressed. 2. The effect of speaker speech in acoustic scenes is investigated. 3. The impact of foreground speech is examined.
Implementation/evaluation mechanism: 1. A DL framework based on a TDNN to handle the long egocentric recordings. 2. TDNN – long-term temporal dependencies with much fewer parameters. 3. TDNN-small: 128 filters at each CNN layer. 4. TDNN-big: 256 filters at each CNN layer. 5. 280k and 954k parameters for TDNN-small and TDNN-big.
Results: 1. TDNN-small and TDNN-big outperform the baseline DNN by an absolute 4.41% and 3.16%, respectively. 2. The TDNN-big model is more accurate in predicting nurse stations and patient rooms.
Evaluation metric/tools: Accuracy; mean test confusion matrix.
Future scope: 1. Acoustic scene prediction through wearable or mobile devices. 2. How the egocentric acoustic patterns are related to individual mental states, such as stress, can be investigated. 3. Segment-level modeling helps compress the data, and therefore higher layers of temporal systems can be designed for end-to-end learning.
Datasets: Sampled long audio recordings collected from clinical providers in a hospital, who wore audio badges during multiple work shifts – multi-modal sensory data (audio, physiology, continuous location, etc.) from 350 nurses and other direct clinical providers in a critical care hospital.

Saki et al. (2019), 2019.
Findings: 1. An open-set evolving audio scene classification technique, MCOSR, to recognize and learn unknown acoustic scenes. 2. Tasks – recognize sound signals and associate them with known classes, detect the hidden unknown classes among the rejected sound samples, and learn those newly detected classes. 3. Developed an approach to detect unknown sound classes, compared to EVM and W-SVM.
Implementation/evaluation mechanism: 1. Multi-class Open-set Evolving Recognition (MCOSR) model. 2. Multi-class open-set evolving recognition function – Support Vector Data Description (SVDD) for classification. 3. An L3-Net-based audio embedding network is used as the feature extractor.
Results: 1. MCOSR: TPR = 88.22 and 1-FPR = 93.03. 2. Rejection threshold < 0.05 for MCOSR.
Evaluation metric/tools: Accuracy; True-Positive Rate (TPR); False-Positive Rate (FPR); rejection threshold.
Future scope: A classifier can be developed to address the practical issues related to the open-set audio scene classification technique at run time.
Datasets: DCASE challenge dataset, TUT Acoustic Scenes 2017, and music files from AudioSet. The dataset comprises 12 acoustic scenes – "absence", "cooking", "dishwashing", "eating", "social activity", "watching TV", "car", "music", "restaurant", "transport", "vacuum cleaner", and "working".
The authors have reviewed several recent papers and found that the general tasks for the pre-processing of audio sound are: to extract feature representations from audio segments, which are fed into deep learning models; to use spectrograms, which help to mitigate the problem of changing frequency and amplitude of audio; to use data augmentation techniques like mixup to fit the data to the model; to measure the data curation challenges; to minimize data redundancy while preserving the data; to use the spectrogram as network input; etc. Several techniques are used for these tasks, such as a deep neural network for feature extraction; SCalable Automatic Repairing (SCARE), a machine learning method that offers predictions for several attributes simultaneously; a tag vector used to enhance the feature map of the ASC employing multi-head attention; and PCEN, which can be used to increase robustness to loudness variation and address the problems with log compression.
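One of the pre-processing steps listed above, mixup augmentation, can be summarized in a few lines; the sketch below is illustrative only (the array shapes and the Beta parameter alpha=0.2 are assumptions, not values taken from the reviewed papers).

    import numpy as np

    def mixup_batch(x, y, alpha=0.2, rng=np.random.default_rng(0)):
        # x: batch of (log-mel) spectrograms, y: one-hot scene labels.
        lam = rng.beta(alpha, alpha)
        idx = rng.permutation(len(x))
        # Convex combination of pairs of examples and of their labels.
        return lam * x + (1.0 - lam) * x[idx], lam * y + (1.0 - lam) * y[idx]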
Table 24 (continued).

Wilkinghoff and Kurth (2019), 2019.
Findings: 1. A system for open-set acoustic scene classification is presented. 2. Use of CNNs for closed-set classification and Deep Convolutional Auto Encoders (DCAEs) for rejecting unknown acoustic scenes via outlier detection. 3. An effective way to combine a closed-set classification system and outlier detection models into a single open-set system is presented.
Implementation/evaluation mechanism: 1. Combination of CNNs and deep convolutional auto encoders for outlier detection. 2. CNNs implemented using Keras with TensorFlow. 3. Each CNN is trained for 6000 epochs with a batch size of 32. 4. To detect outliers, one-class classification models (DCAEs) are used. 5. Logistic regression model – a threshold of 0.5 is used to determine outliers.
Results: 1. Closed-set classification accuracies on the public/private leaderboard and the evaluation set: 86.5%, 85.5% and 85.2%, respectively. 2. Open-set classification accuracies for the combined score, known classes and unknown classes: 67.4%, 53.1% and 81.8%, respectively. 3. The proposed system improves the overall score from 47.6% to 62.1% on the evaluation dataset.
Evaluation metric/tools: Mean Squared Error (MSE); accuracy.
Future scope: 1. Since the loss function that needs to be optimized does not specifically attempt to reject unknown samples, using the mean squared error of DCAEs for outlier detection is essentially a heuristic; one alternative to DCAEs is to train a neural network with a different loss function designed specifically for one-class classification. 2. Open-set performance can be enhanced by using external data to train the models or by enhancing the closed-set classification models using an attention mechanism.
Datasets: DCASE Challenge 2019 dataset – Task 1C for open-set classification and Subtask A of Task 1 for closed-set classification.

Purwins et al. (2019), 2019.
Findings: 1. Review of state-of-the-art deep learning techniques for audio signal processing. 2. Feature representations like log-mel spectra and raw waveform, and deep learning models like audio-specific CNN and Long Short-Term Memory architectures, are reviewed. 3. With available data, deep learning models outperform traditional methods such as Gaussian mixture models and hidden Markov models at extracting useful features. 4. Data-driven filters and MFCCs are used for feature extraction.
Implementation/evaluation mechanism: Neural networks with stacked layers – Convolutional Neural Networks, Recurrent Neural Networks, Convolutional Recurrent Neural Networks, sequence-to-sequence models, Connectionist Temporal Classification models, Generative Adversarial Networks.
Results: Results depend on the environment and the selected model architectures.
Evaluation metric/tools: 1. WER – to count word errors. 2. AUROC – to evaluate binary classification. 3. MOS – to evaluate the quality of synthesized audio in speech. 4. Turing test – for audio generation; F-Score.
Future scope: Research work can be carried out to: 1. find how the behavior of a network could help in improving the model structure to address failure cases; 2. investigate the computational complexity for specific requirements of real-time audio signal processing; 3. know which model is superior in which setting of environment and dataset – very hard to answer, since different research groups yield state-of-the-art results with different models.
Datasets: Large training datasets – ImageNet (14 million hand-labeled images), speech recognition datasets, song datasets, MusicNet datasets, and AudioSet datasets for environmental sound classification, data generation and data augmentation.

Singh et al. (2018), 2018.
Findings: 1. Issues pointed out: the nature of the environment and multiple overlapping acoustic events. 2. The ASC problem is addressed using SoundNet, which is pre-trained on raw audio signals. 3. Scores from each layer are combined for classification. 4. A layer-wise ensemble approach is used for effectiveness on the DCASE 2016 dataset.
Implementation/evaluation mechanism: 1. 15 hidden layers, and a separate classifier model is trained using feature vectors. 2. 15 different classifier models obtained. 3. SVM- and MRE-based models. 4. Fusion operation – majority-voting label fusion scheme or maximum-likelihood score fusion scheme.
Results: 1. A non-linear SVM gives 93% accuracy on the evaluation set of the DCASE 2016 data. 2. Relative improvement of 30.85% by the best individual layer of SoundNet.
Evaluation metric/tools: 1. Accuracy. 2. Trained model evaluated with feature vectors.
Future scope: 1. The proposed framework with less data, as well as training the Deep Neural Network (DNN) to incorporate the hidden-layer information. 2. A layer-wise ensemble approach can be used with the DNN.
Datasets: DCASE 2016 acoustic scene categorization dataset.
In this paper, many papers are also reviewed for the classification of sounds during acoustic scene creation. Some of the findings are: three integrated DNN architectures, DcaseNets, are proposed for three tasks – ASC, TAG, and SED; use of an audio tagging system inspired by human perception mechanisms; classification using multiple devices; solving the DCASE 2020 tasks (in many cases Task 1 and Task 2); etc. All these findings indicate the several tasks that must be solved for ASC, and to do so several techniques are used, such as designing ASC with CNN variants, RF-regularized CNNs, SVM, integrated pre-trained DNNs, etc.

Although the best accuracy for classification depends on how the different issues of the ASC tasks are resolved with the available datasets, based on the current review the best accuracies obtained in recent years are 98.60% (UrbanSound8K dataset), 97.25% (ESC-10 dataset) and 95.50% (ESC-50 dataset) using a DCNN architecture with the Leaky ReLU activation function, and 90.96% using a CRNN with the ELU activation function (refer to Table 24).

The researchers should constantly pay attention to, and place a higher priority on, comprehending how the behavior of a network or a sub-network could help in improving the model structure to address failure cases, and what the computational complexity will be for particular requirements of real-time audio signal processing. Although it is a difficult task, completing it will undoubtedly improve the performance of the ASC model.

Finally, based on various criteria, including functionality described separately, results described with quantifiable values, a dataset with appropriate quantifiable analysis, and pictorial representation of the model under discussion, this review paper is contrasted with some other survey papers that have already been published (refer to Table 1).
CRediT authorship contribution statement

Vikash Kumar Singh: Conceptualization, Writing – original draft, Visualization. Kalpana Sharma: Methodology, Writing – review & editing, Supervision. Samarendra Nath Sur: Conceptualization, Writing – review & editing, Resources.

Declaration of competing interest

The authors declare that they have no known competing financial interests or personal relationships that could have appeared to influence the work reported in this paper.

Data availability

No data was used for the research described in the article.

References

Chen, H., Zhang, P., & Yan, Y. (2019). An audio scene classification framework with embedded filters and a DCT-based temporal module. In ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2019.8683636.
Cheng, S.-S., Wang, H.-M., & Fu, H.-C. (2008). BIC-based audio segmentation by divide-and-conquer. In 2008 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2008.4518741.
Cho, J., Yun, S., Park, H., Eum, J., & Hwang, K. (2019). Acoustic scene classification based on a large-margin factorized CNN. In Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/8xh4-jm46.
Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: A data cleaning system powered by knowledge bases and crowdsourcing. In Proceedings of the 2015 ACM SIGMOD international conference on management of data. ACM, http://dx.doi.org/10.1145/2723372.2749431.
Chu, X., Morcos, J., Ilyas, I. F., Ouzzani, M., Papotti, P., Tang, N., & Ye, Y. (2015). KATARA: Reliable data cleaning with knowledge bases and crowdsourcing. Proceedings of the VLDB Endowment, 8(12), 1952–1955. http://dx.doi.org/10.14778/2824032.2824109.
Cicco, V. D., Firmani, D., Koudas, N., Merialdo, P., & Srivastava, D. (2019). Interpreting deep learning models for entity resolution. In Proceedings of the second international workshop on exploiting artificial intelligence techniques for data management. ACM Press, http://dx.doi.org/10.1145/3329859.3329878.
Abeßer, J. (2020). A review of deep learning based methods for acoustic scene Coates, A., & Ng, A. Y. (2011). The importance of encoding versus training with sparse
classification. Applied Sciences, 10(6), http://dx.doi.org/10.3390/app10062020. coding and vector quantization. In 28th International conference on machine learning
Abeßer, J., Mimilakis, S. I., Grafe, R., & Lukashevich, H. (2017). Acoustic scene classifi- (pp. 921–928). ACM.
cation by combining autoencoder-based dimensionality reduction and convolutional Coates, A., & Ng, A. Y. (2012). Learning feature representations with K-means. In
neural networks. In Detection and classification of acoustic scenes and events workshop Lecture notes in computer science (pp. 561–580). Springer Berlin Heidelberg, http:
(DCASE), Munich, Germany. //dx.doi.org/10.1007/978-3-642-35289-8_30.
Akiyama, O., & Sato, J. (2019). DCASE 2019 task 2: Multitask learning, semi-supervised Cohen, B., Vawdrey, D. K., Liu, J., Caplan, D., Furuya, E. Y., Mis, F. W., & Larson, E.
learning and model ensemble with noisy data for audio tagging. In Proceedings of (2015). Challenges associated with using large data sets for quality assessment
the detection and classification of acoustic scenes and events 2019 workshop. New York and research in clinical settings. Policy, Politics, &Amp Nursing Practice, 16(3–4),
University, http://dx.doi.org/10.33682/0avf-bm61. 117–124. http://dx.doi.org/10.1177/1527154415603358.
Arniriparian, S., Freitag, M., Cummins, N., Gerczuk, M., Pugachevskiy, S., & Schuller, B. Crocco, M., Cristani, M., Trucco, A., & Murino, V. (2014). Audio surveillance: A
(2018). A fusion of deep convolutional generative adversarial networks and systematic review. ACM Computing Surveys, 48(4), 52:1–52:46, arXiv:1409.7787.
sequence to sequence autoencoders for acoustic scene classification. In 2018 26th Dang, A., Vu, T. H., & Wang, J.-C. (2017). A survey of deep learning for polyphonic
European signal processing conference. IEEE, http://dx.doi.org/10.23919/eusipco. sound event detection. In 2017 International conference on orange technologies. IEEE,
2018.8553225. http://dx.doi.org/10.1109/icot.2017.8336092.
Aytar, Y., Vondrick, C., & Torralba, A. (2016). SoundNet: Learning sound representa- Davis, S., & Mermelstein, P. (1980). Comparison of parametric representations for
tions from unlabeled video. Adv. Neural Inf. Process. Syst.29: Annu. Conf. Neural Inf. monosyllabic word recognition in continuously spoken sentences. IEEE Transactions
Process. Syst., 892–900, arXiv:1610.09001. on Acoustics, Speech, and Signal Processing, 28(4), 357–366. http://dx.doi.org/10.
Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural machine translation by jointly 1109/tassp.1980.1163420.
learning to align and translate. arXiv:1409.0473. Deng, J., Dong, W., Socher, R., Li, L.-J., Li, K., & Fei-Fei, L. (2009). ImageNet: A
Banerjee, S., Chattopadhyay, T., Pal, A., & Garain, U. (2018). Automation of feature large-scale hierarchical image database. In 2009 IEEE conference on computer vision
engineering for IoT analytics. ACM SIGBED Rev., 15(2), 24–30. http://dx.doi.org/ and pattern recognition. IEEE, http://dx.doi.org/10.1109/cvpr.2009.5206848.
10.1145/3231535.3231538. Doersch, C., Gupta, A., & Efros, A. A. (2015). Unsupervised visual representation
Barchiesi, D., Giannoulis, D., Stowell, D., & Plumbley, M. D. (2015). Acoustic scene learning by context prediction. arXiv:1505.05192.
classification: Classifying environments from the sounds they produce. IEEE Signal Ebaid, A., Thirumuruganathan, S., Aref, W. G., Elmagarmid, A., & Ouzzani, M.
Process. Mag., 32(3), 16–34. http://dx.doi.org/10.1109/msp.2014.2326181. (2019). EXPLAINER: Entity resolution explanations. In 2019 IEEE 35th international
Basbug, A. M., & Sert, M. (2019). Acoustic scene classification using spatial conference on data engineering. IEEE, http://dx.doi.org/10.1109/icde.2019.00224.
pyramid pooling with convolutional neural networks. In 2019 IEEE 13th interna- Ebraheem, M., Thirumuruganathan, S., Joty, S., Ouzzani, M., & Tang, N. (2018).
tional conference on semantic computing. IEEE, http://dx.doi.org/10.1109/icosc.2019. Distributed representations of tuples for entity resolution. Proceedings of the VLDB
8665547. Endowment, 11(11), 1454–1467. http://dx.doi.org/10.14778/3236187.3236198.
Bear, H. L., Morf, V., & Benetos, E. (2022). An evaluation of data augmentation methods Edelman, A., Arias, T. A., & Smith, S. T. (1998). The geometry of algorithms with
for sound scene geotagging. http://dx.doi.org/10.31219/osf.io/cbxqe. orthogonality constraints. SIAM Journal on Matrix Analysis and Applications, 20(2),
Bear, H. L., Nolasco, I., & Benetos, E. (2019). Towards joint sound scene and polyphonic 303–353. http://dx.doi.org/10.1137/s0895479895290954.
sound event recognition. arXiv:1904.10408. Eghbal-Zadeh, H., Dorfer, M., & Widmer, G. (2017). Deep within-class covariance
Bisot, V., Serizel, R., Essid, S., & Richard, G. (2016). Supervised non negative matrix analysis for robust audio representation learning. arXiv:1711.04022.
factorization for acoustic scene classification. In Detection and classification of Fernandez, R. C., Deng, D., Mansour, E., Qahtan, A. A., Tao, W., Abedjan, Z.,
acoustic scenes and events 2016. Elmagarmid, A., Ilyas, I. F., Madden, S., Ouzzani, M., Stonebraker, M., & Tang, N.
Bisot, V., Serizel, R., Essid, S., & Richard, G. (2017). Non negative feature learning (2017). A demo of the data civilizer system. In Proceedings of the 2017 ACM
methods for acoustic SceneClassification. In Detection and classification of acoustic international conference on management of data. ACM, http://dx.doi.org/10.1145/
scenes and events workshop(DCASE), Munich, Germany. 3035918.3058740.
Bittner, R. M., McFee, B., Salamon, J., Li, P., & Bello, J. P. (2017). Deep salience Foggia, P., Petkov, N., Saggese, A., Strisciuglio, N., & Vento, M. (2015). Reliable
representations for F0 estimation in polyphonic music. In 19th International society detection of audio events in highly noisy environments. Pattern Recognition Letters,
for music informationretrieval conference (ISMIR), Suzhou, China, 63–70. 65, 22–28. http://dx.doi.org/10.1016/j.patrec.2015.06.026.
Boss, J. D., Shah, C. T., Elner, V. M., & Hassan, A. S. (2015). Assessment of Fonseca, E., Gong, R., Bogdanov, D., Slizovskaia, O., Gomez, E., & Serra, X. (2017).
office-based practice patterns on protective eyewear counseling for patients with Acoustic scene classification by ensembling gradient boosting machine and convo-
monocular vision. Ophthalmic Plastic &Amp Reconstructive Surgery, 31(5), 361–363. lutional neural networks. In Detection and classification of acoustic scenes and events
http://dx.doi.org/10.1097/iop.0000000000000348. workshop (DCASE), Munich, Germany.
Chan, W., Jaitly, N., Le, Q., & Vinyals, O. (2016). Listen, attend and spell: A neural Fonseca, E., Plakal, M., Font, F., Ellis, D. P. W., & Serra, X. (2019). Audio tagging with
network for large vocabulary conversational speech recognition. In 2016 IEEE noisy labels and minimal supervision. arXiv:1906.02975.
international conference on acoustics, speech and signal processing. IEEE, http://dx. Fujisawa, K., Hirabe, Y., Suwa, H., Arakawa, Y., & Yasumoto, K. (2015). Automatic
doi.org/10.1109/icassp.2016.7472621. content curation system for multiple live sport video streams. In 2015 IEEE
Chen, H., Liu, Z., Liu, Z., Zhang, P., & Yan, Y. (2019). Integrating the data augmentation international symposium on multimedia. IEEE, http://dx.doi.org/10.1109/ism.2015.
scheme with various classifiers for acoustic scene modeling. In Detection and 17.
classification of acoustic scenes and events workshop (DCASE), New York, NY, USA. Furui, S. (1986). Speaker-independent isolated word recognition based on emphasized
Chen, H., Zhang, P., Bai, H., Yuan, Q., Bao, X., & Yan, Y. (2018). Deep convolutional spectral dynamics. In ICASSP ’86. IEEE international conference on acoustics, speech,
neural network with scalogram for audio scene modeling. In Interspeech 2018. ISCA, and signal processing. Institute of Electrical and Electronics Engineers, http://dx.doi.
http://dx.doi.org/10.21437/interspeech.2018-1524. org/10.1109/icassp.1986.1168654.
Gemmeke, J. F., Ellis, D. P. W., Freedman, D., Jansen, A., Lawrence, W., Moore, R. Khayyat, Z., Ilyas, I. F., Jindal, A., Madden, S., Ouzzani, M., Papotti, P., Quiané-Ruiz, J.-
C., Plakal, M., & Ritter, M. (2017). Audio set: An ontology and human-labeled A., Tang, N., & Yin, S. (2015). BigDansing. In Proceedings of the 2015 ACM SIGMOD
dataset for audio events. In IEEE international conference on acoustics, speech and international conference on management of data. ACM, http://dx.doi.org/10.1145/
signal processing (ICASSP), New Orleans, la, USA, 776–780. 2723372.2747646.
Goodfellow, I., Bengio, Y., & Courville, A. (2016). Deep learning. MIT Press. Kim, J.-H., Jung, J.-W., Shim, H.-J., & Yu, H.-J. (2020). Audio tag representation guided
Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., dual attention network for acousticscene classification. In Detection and classification
Courville, A., & Bengio, Y. (2014). Generative adversarial nets. In Advances in of acoustic scenes and events 2020.
neural information processing systems (pp. 2672–2680). Red Hook, NY, USA: Curran Kolouri, S., Nadjahi, K., Simsekli, U., Badeau, R., & Rohde, G. K. (2019). Generalized
Associates,Inc.. sliced wasserstein distances. arXiv:1902.00434.
Hakkani-Tur, D., Riccardi, G., & Gorin, A. (2002). Active learning for automatic speech Kong, Q., Xu, Y., Iqbal, T., Cao, Y., Wang, W., & Plumbley, M. D. (2019). Acoustic
recognition. In IEEE international conference on acoustics speech and signal processing. scene generation with conditional sample RNN. In IEEE international conference on
IEEE, http://dx.doi.org/10.1109/icassp.2002.5745510. acoustics, speech and signal processing(ICASSP), Brighton, UK, 925–929.
Han, W., Coutinho, E., Ruan, H., Li, H., Schuller, B., Yu, X., & Zhu, X. (2016). Semi- Kong, Q., Xu, Y., Wang, W., & Plumbley, M. D. (2017). A joint detection-classification
supervised active learning for sound classification in hybrid learning environments. model for audio tagging of weakly labelled data. In 2017 IEEE international
In F. Schwenker (Ed.), PLOS ONE, 11(9), Article e0162075. http://dx.doi.org/10. conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.
1371/journal.pone.0162075. 1109/icassp.2017.7952234.
Han, Y., Park, J., & Lee, K. (2017). Convolutional neural networks with binaural Kosmider, M. (2019). Calibrating neural networks for secondary recording devices: Technical
representations and background subtraction for acoustic scene classification. In report, Detection and Classification of Acoustic Scenes and Events 2019, Samsung
Detection and classification of acousticscenes and events workshop (DCASE), Munich, R&D Institute Poland Artificial Intelligence Warsaw, Poland.
Germany. Kotti, M., Benetos, E., & Kotropoulos, C. (2008). Computationally efficient and robust
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Deep residual learning for image BIC-based speaker segmentation. IEEE Transactions on Audio, Speech, and Language
recognition. In 2016 IEEE conference on computer vision and pattern recognition. IEEE, Processing, 16(5), 920–933. http://dx.doi.org/10.1109/tasl.2008.925152.
http://dx.doi.org/10.1109/cvpr.2016.90. Koutini, K., Chowdhury, S., Haunschmid, V., Eghbal-zadeh, H., & Widmer, G. (2019).
He, K., Zhang, X., Ren, S., & Sun, J. (2016). Identity mappings in deep residual Emotion and theme recognition in music with frequency-aware RF-regularized
networks. arXiv:1603.05027. CNNs. arXiv:1911.05833.
Heer, J., Hellerstein, J., & Kandel, S. (2015). Predictive interaction for data transfor- Koutini, K., Eghbal-zadeh, H., Dorfer, M., & Widmer, G. (2019). The receptive field as a
mation. In 7th Biennial conference on innovative data systems research (CIDR ’15), regularizer in deep convolutional neural networks for acoustic scene classification.
Asilomar, California, USA. In 2019 27th European signal processing conference. IEEE, http://dx.doi.org/10.
Heittola, T., Mesaros, A., & Virtanen, T. (2020). Acoustic scene classification in DCASE 23919/eusipco.2019.8902732.
2020 challenge: generalization across devices and low complexity solutions. arXiv: Koutini, K., Eghbal-Zadeh, H., & Widmer, G. (2019). CP-JKU submissions to DCASE’19:
2005.14623. Acoustic scene classification and audio tagging with receptive-field-regularized CNNs:
Hershey, S., Chaudhuri, S., Ellis, D. P. W., Gemmeke, J. F., Jansen, A., Moore, R. Technical report, Detection and Classification of Acoustic Scenes and Events 2019,
C., Plakal, M., Platt, D., Saurous, R. A., Seybold, B., Slaney, M., Weiss, R. J., Institute of Computational Perception (CP-JKU) and LIT Artificial Intelligence
& Wilson, K. (2017). CNN architectures for large-scale audio classification. In Lab,Johannes Kepler University Linz, Austria.
2017 IEEE international conference on acoustics, speech and signal processing. IEEE, Koutini, K., Eghbal-zadeh, H., & Widmer, G. (2019). Receptive-field-regularized CNN
http://dx.doi.org/10.1109/icassp.2017.7952132. variants for acoustic scene classification. In Proceedings of the detection and
Hoshen, Y., Weiss, R. J., & Wilson, K. W. (2015). Speech acoustic modeling from raw classification of acoustic scenes and events 2019 workshop. New York University,
multichannel waveforms. In 2015 IEEE international conference on acoustics, speech http://dx.doi.org/10.33682/cjd9-kc43.
and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2015.7178847. Koutini, K., Henkel, F., Eghbal-Zadeh, H., & Widmer, G. (2020). CP-JKU submissions to
Huang, G., Li, Y., Pleiss, G., Liu, Z., Hopcroft, J. E., & Weinberger, K. Q. (2017). DCASE’20: Low-complexity cross-device acoustic scene classification with RF-regularized
Snapshot ensembles: Train 1, get M for free. arXiv:1704.00109. CNNs: Technical report, Detection and Classification of Acoustic Scenes and Events
Huang, J., Lu, H., Meyer, P. L., Cordourier, H., & Ontiveros, J. D. H. (2019). Acoustic 2020, Institute of Computational Perception (CP-JKU) and LIT Artificial Intelligence
scene classification using deep learning-based ensemble averaging. In Proceedings Lab,Johannes Kepler University Linz, Austria.
of the detection and classification of acoustic scenes and events 2019 workshop (DCASE Kudo, M., Maeda, K., & Satoh, F. (2016). Adaptable privacy-preserving data curation for
2019). New York University, http://dx.doi.org/10.33682/8rd2-g787. business process analysis services. In 2016 IEEE international conference on services
Huzaifah, M. (2017). Comparison of time-frequency representations for environmental computing. IEEE, http://dx.doi.org/10.1109/scc.2016.60.
sound classification using convolutional neural networks. arXiv:1706.07156. Kumar, A., Khadkevich, M., & Fugen, C. (2018). Knowledge transfer from weakly
Imoto, K., & Shimauchi, S. (2016). Acoustic scene analysis based on hierarchical labeled audio using convolutional neural network for sound events and scenes.
generative model of acoustic event sequence. IEICE Transactions on Information and In 2018 IEEE international conference on acoustics, speech and signal processing. IEEE,
Systems, E99.D(10), 2539–2549. http://dx.doi.org/10.1587/transinf.2016slp0004. http://dx.doi.org/10.1109/icassp.2018.8462200.
Imoto, K., Tonami, N., Koizumi, Y., Yasuda, M., Yamanishi, R., & Yamashita, Y. (2020). Kumpawat, J., & Dey, S. (2021). Acoustic scene classification using auditory datasets.
Sound event detection by multitask learning of sound events and scenes with soft arXiv:2112.13450.
scene labels. arXiv:2002.05848. Lebedev, V., Ganin, Y., Rakhuba, M., Oseledets, I., & Lempitsky, V. (2014). Speeding-
India, M., Safari, P., & Hernando, J. (2019). Self multi-head attention for speaker up convolutional neural networks using fine-tuned CP-decomposition. arXiv:1412.
recognition. In Interspeech 2019. ISCA, http://dx.doi.org/10.21437/interspeech. 6553.
2019-2616. Lee, M. L., Ling, T. W., & Low, W. L. (2000). IntelliClean:A knowledge-based intelligent
Jaitly, N., & Hinton, G. (2011). Learning a better representation of speech soundwaves data cleaner. In Proceedings of the sixth ACM SIGKDD international conference on
using restricted boltzmann machines. In 2011 IEEE international conference on knowledge discovery and data mining. ACM Press, http://dx.doi.org/10.1145/347090.
acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2011. 347154.
5947700. Lehner, B., Koutini, K., Schwarzlmüller, C. H., Gallien, T., & Widmer, G. (2019).
Jati, A., Nadarajan, A., Mundnich, K., & Narayanan, S. (2019). Characterizing dynami- Acoustic scene classification with reject option based on resnets. In Detection and
cally varying acoustic scenes from egocentric audio recordings in workplace setting. classification of acoustic scenes and events workshop (DCASE), New York, NY, USA.
arXiv:1911.03843. Li, Z., Hou, Y., Xie, X., Li, S., Zhang, L., Du, S., & Liu, W. (2019). Multi-level
Jing, L., & Tian, Y. (2019). Self-supervised visual feature learning with deep neural attention model with deep scattering spectrum for acoustic scene classification.
networks: A survey. arXiv:1902.06162. In 2019 IEEE international conference on multimedia & expo workshops. IEEE,
Jung, J.-W., Heo, H.-S., Shim, H.-J., & Yu, H.-J. (2018). DNN based multi-level http://dx.doi.org/10.1109/icmew.2019.00074.
features ensemble for acoustic scene classification. In Proceedings of the detection Li, H., Kadav, A., Durdanovic, I., Samet, H., & Graf, H. P. (2016). Pruning filters for
and classification of acoustic scenes and events 2018 workshop. efficient ConvNets. arXiv:1608.08710.
Jung, J.-W., Heo, H.-S., jin Shim, H., & Yu, H.-J. (2019). Distilling the knowledge of Lin, T.-Y., Goyal, P., Girshick, R., He, K., & Dollar, P. (2017). Focal loss for dense
specialist deep neural networks in acoustic scene classification. In Proceedings of the object detection. In 2017 IEEE international conference on computer vision. IEEE,
detection and classification of acoustic scenes and events 2019 workshop. New York http://dx.doi.org/10.1109/iccv.2017.324.
University, http://dx.doi.org/10.33682/gqpj-ac63. Liu, S., Mallol-Ragolta, A., Parada-Cabaleiro, E., Qian, K., Jing, X., Kathan, A., Hu, B.,
Jung, J.-W., Heo, H.-S., Shim, H.-J., & Yu, H.-J. (2020). Knowledge distillation in & Schuller, B. W. (2022). Audio self-supervised learning: A survey. Patterns, 3(12),
acoustic scene classification. IEEE Access, 8, 166870–166879. http://dx.doi.org/10. Article 100616. http://dx.doi.org/10.1016/j.patter.2022.100616.
1109/access.2020.3021711. Liu, M., Wang, W., & Li, Y. (2019). The system for acoustic scene classification using resnet:
Jung, J.-W., jin Shim, H., ho Kim, J., bin Kim, S., & Yu, H.-J. (2020). Acoustic scene Technical report, (DCASE2019) School of Electronic and Information Engineering,
classification using audio tagging. arXiv:2003.09164. South China University of Technology, Guangzhou, China.
Jung, J.-W., Shim, H.-J., Kim, J.-H., & Yu, H.-J. (2020). DcaseNet: An integrated Lostanlen, V., & Cella, C.-E. (2016). Deep convolutional networks on the pitch spiral
pretrained deep neural network for detecting and classifying acoustic scenes and for music instrument recognition. In 17th International society for music information
events. arXiv:2009.09642. retrieval conference (ISMIR), New York City, United States, 612–618.
Lostanlen, V., Salamon, J., Cartwright, M., McFee, B., Farnsworth, A., Kelling, S., & Bello, J. P. (2019). Per-channel energy normalization: Why and how. IEEE Signal Processing Letters, 26(1), 39–43. http://dx.doi.org/10.1109/lsp.2018.2878620.
Luo, W., Li, Y., Urtasun, R., & Zemel, R. (2017). Understanding the effective receptive field in deep convolutional neural networks. arXiv:1701.04128.
Maka, T. (2018). Audio feature space analysis for acoustic scene classification. In Detection and classification of acoustic scenes and events workshop (DCASE), Surrey, UK.
Marchi, E., Tonelli, D., Xu, X., Ringeval, F., Deng, J., Squartini, S., & Schuller, B. (2016). Pairwise decomposition with deep neural networks and multiscale kernel subspace learning for acoustic scene classification. In Detection and classification of acoustic scenes and events workshop (DCASE), Budapest, Hungary.
Mariotti, O., Cord, M., & Schwander, O. (2018). Exploring deep vision models for acoustic scene classification. In Detection and classification of acoustic scenes and events workshop (DCASE), Surrey, UK.
Mars, R., Pratik, P., Nagisetty, S., & Lim, C. (2019). Acoustic scene classification from binaural signals using convolutional neural networks. In Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/6c9z-gd15.
Mattys, S. L., Davis, M. H., Bradlow, A. R., & Scott, S. K. (2012). Speech recognition in adverse conditions: A review. Language and Cognitive Processes, 27(7–8), 953–978. http://dx.doi.org/10.1080/01690965.2012.705006.
McDonnell, M. D., & Gao, W. (2019). Acoustic scene classification using deep residual networks with late fusion of separated high and low frequency paths: Technical report, Detection and Classification of Acoustic Scenes and Events 2019, Computational Learning Systems Laboratory, School of Information Technology and Mathematical Sciences, University of South Australia, Mawson Lakes SA 5095, Australia.
Mesaros, A., Heittola, T., Benetos, E., Foster, P., Lagrange, M., Virtanen, T., & Plumbley, M. D. (2018). Detection and classification of acoustic scenes and events: Outcome of the DCASE 2016 challenge. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26(2), 379–393. http://dx.doi.org/10.1109/taslp.2017.2778423.
Mesaros, A., Heittola, T., & Virtanen, T. (2016). TUT database for acoustic scene classification and sound event detection. In 2016 24th European signal processing conference. IEEE, http://dx.doi.org/10.1109/eusipco.2016.7760424.
Mesaros, A., Heittola, T., & Virtanen, T. (2017). Assessment of human and machine performance in acoustic scene classification: DCASE 2016 case study. In 2017 IEEE workshop on applications of signal processing to audio and acoustics. IEEE, http://dx.doi.org/10.1109/waspaa.2017.8170047.
Mesaros, A., Heittola, T., & Virtanen, T. (2018). A multi-device dataset for urban acoustic scene classification. arXiv:1807.09840.
Mesaros, A., Heittola, T., & Virtanen, T. (2019). Acoustic scene classification in DCASE 2019 challenge: Closed and open set classification and data mismatch setups. In Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/m5kp-fa97.
Michael Mandel, J. S., & Ellis, D. P. (2019). Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/1syg-dy60.
Mille, R. (2014). Big data curation. In 20th International conference on management of data (COMAD), 17th–19th Dec 2014, Hyderabad, India.
Miyamoto, K., Koseki, A., & Ohno, M. (2017). Effective data curation for frequently asked questions. In 2017 IEEE international conference on service operations and logistics, and informatics. IEEE, http://dx.doi.org/10.1109/soli.2017.8120960.
Mohamed, A.-R., Hinton, G., & Penn, G. (2012). Understanding how deep belief networks perform acoustic modelling. In ICASSP.
Mudgal, S., Li, H., Rekatsinas, T., Doan, A., Park, Y., Krishnan, G., Deep, R., Arcaute, E., & Raghavendra, V. (2018). Deep learning for entity matching. In Proceedings of the 2018 international conference on management of data. ACM, http://dx.doi.org/10.1145/3183713.3196926.
Mun, S., Park, S., Han, D. K., & Ko, H. (2017). Generative adversarial networks based acoustic scene training set augmentation and selection using SVM hyperplane. In Detection and classification of acoustic scenes and events workshop (DCASE), Munich, Germany.
Nanni, L., Maguolo, G., & Paci, M. (2019). Data augmentation approaches for improving animal audio classification. arXiv:1912.07756.
Nguyen, T. N. T., Jones, D. L., & Gan, W. (2020). DCASE 2020 task 3: Ensemble of sequence matching networks for dynamic sound event localization, detection, and tracking: Technical report, Detection and Classification of Acoustic Scenes and Events, Nanyang Technological University, School of Electrical and Electronic Engineering, Singapore, University of Illinois at Urbana-Champaign, Dept. of Electrical and Computer Engineering, Illinois, USA.
Nguyen, T., & Pernkopf, F. (2018). Acoustic scene classification using a convolutional neural network ensemble and nearest neighbor filters. In Detection and classification of acoustic scenes and events workshop (DCASE), Surrey, UK.
Nogueira, A. F. R., Oliveira, H. S., Machado, J. J. M., & Tavares, J. M. R. S. (2022). Sound classification and processing of urban environments: A systematic literature review. Sensors, 22(22), 8608. http://dx.doi.org/10.3390/s22228608.
Pezoulas, V. C., Kourou, K. D., Kalatzis, F., Exarchos, T. P., Venetsanopoulou, A., Zampeli, E., Gandolfo, S., Skopouli, F., Vita, S. D., Tzioufas, A. G., & Fotiadis, D. I. (2019). Medical data quality assessment: On the development of an automated framework for medical data curation. Computers in Biology and Medicine, 107, 270–283. http://dx.doi.org/10.1016/j.compbiomed.2019.03.001.
Phaye, S. S. R., Benetos, E., & Wang, Y. (2018). SubSpectralNet - using sub-spectrogram based convolutional neural networks for acoustic scene classification. arXiv:1810.12642.
Plumbley, M. D., Kroos, C., Bello, J. P., Richard, G., Ellis, D. P., & Mesaros, A. (2018). Detection and classification of acoustic scenes and events 2018 workshop (DCASE2018). In Detection and classification of acoustic scenes and events 2018 workshop (DCASE2018), Tampere University of Technology, Laboratory of Signal Processing.
Primus, P., Eghbal-zadeh, H., Eitelsebner, D., Koutini, K., Arzt, A., & Widmer, G. (2019). Exploiting parallel audio recordings to enforce device invariance in CNN-based acoustic scene classification. arXiv:1909.02869.
Primus, P., & Eitelsebner, D. (2019). Acoustic scene classification with mismatched recording devices: Technical report, Institute of Computational Perception (CP-JKU), Johannes Kepler University Linz, Austria, Detection and Classification of Acoustic Scenes and Events.
Purwins, H., Li, B., Virtanen, T., Schluter, J., Chang, S.-Y., & Sainath, T. (2019). Deep learning for audio signal processing. IEEE Journal of Selected Topics in Signal Processing, 13(2), 206–219. http://dx.doi.org/10.1109/jstsp.2019.2908700.
Qian, K., Ren, Z., Pandit, V., Yang, Z., Zhang, Z., & Schuller, B. (2017). Wavelets revisited for the classification of acoustic scenes. In Detection and classification of acoustic scenes and events workshop (DCASE), Munich, Germany.
Rafii, Z., & Pardo, B. (2012). Music/voice separation using the similarity matrix. In 13th International society for music information retrieval conference (ISMIR), Porto, Portugal, 583–588.
Rahm, E., & Do, H. H. (2000). Data cleaning: Problems and current approaches. In IEEE bulletin of the technical committee on data engineering (pp. 3–13).
Ravanelli, M., & Bengio, Y. (2018). Speaker recognition from raw waveform with SincNet. arXiv:1808.00158.
Ren, Z., Kong, Q., Han, J., Plumbley, M. D., & Schuller, B. W. (2019). Attention-based atrous convolutional neural networks: Visualisation and understanding perspectives of acoustic scenes. In ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2019.8683434.
Ren, Z., Kong, Q., Qian, K., Plumbley, M. D., & Schuller, B. W. (2018). Attention-based convolutional neural networks for acoustic scene classification. In Detection and classification of acoustic scenes and events.
Ren, Z., Pandit, V., Qian, K., Yang, Z., Zhang, Z., & Schuller, B. (2017). Deep sequential image features for acoustic scene classification. In Detection and classification of acoustic scenes and events workshop (DCASE), Munich, Germany.
Riccardi, G., & Hakkani-Tur, D. (2005). Active learning: Theory and applications to automatic speech recognition. IEEE Transactions on Speech and Audio Processing, 13(4), 504–511. http://dx.doi.org/10.1109/tsa.2005.848882.
Ridzuan, F., & Zainon, W. M. N. W. (2019). A review on data cleansing methods for big data. Procedia Computer Science, 161, 731–738. http://dx.doi.org/10.1016/j.procs.2019.11.177.
Roletscheck, C., Watzka, T., Seiderer, A., Schiller, D., & Andre, E. (2019). Using an evolutionary approach to explore convolutional neural networks for acoustic scene classification. In Detection and classification of acoustic scenes and events workshop (DCASE), New York, NY, USA.
Saki, F., Guo, Y., Hung, C.-Y., hoon Kim, L., Deshpande, M., Moon, S., Koh, E., & Visser, E. (2019). Open-set evolving acoustic scene classification system. In Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/en2t-9m14.
Salah, H., Al-Omari, I., Alwidian, J., Al-Hamadin, R., & Tawalbeh, T. (2019). Data streams curation for better machine learning functionality and result to serve IoT and other applications: A survey. Journal of Computer Science, 15(10), 1572–1584. http://dx.doi.org/10.3844/jcssp.2019.1572.1584.
Salamon, J., & Bello, J. P. (2017). Deep convolutional neural networks and data augmentation for environmental sound classification. IEEE Signal Processing Letters, 24(3), 279–283. http://dx.doi.org/10.1109/lsp.2017.2657381.
Seo, H., Park, J., & Park, Y. (2019). Acoustic scene classification using various pre-processed features and convolutional neural networks. In Detection and classification of acoustic scenes and events 2019.
Sharma, J., Granmo, O.-C., & Goodwin, M. (2019). Environment sound classification using multiple feature channels and attention based deep convolutional neural network. arXiv:1908.11219.
Sharma, G., Umapathy, K., & Krishnan, S. (2020). Trends in audio signal feature extraction methods. Applied Acoustics, 158. http://dx.doi.org/10.1016/j.apacoust.2019.107020.
Shuyang, Z., Heittola, T., & Virtanen, T. (2017). Active learning for sound event classification by clustering unlabeled data. In 2017 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2017.7952256.
Shuyang, Z., Heittola, T., & Virtanen, T. (2018). An active learning method using clustering and committee-based sample selection for sound event classification. In 2018 16th international workshop on acoustic signal enhancement. IEEE, http://dx.doi.org/10.1109/iwaenc.2018.8521336.
Shuyang, Z., Heittola, T., & Virtanen, T. (2020). Active learning for sound event detection. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 28.
Sidi, F., Panahy, P. H. S., Affendey, L. S., Jabar, M. A., Ibrahim, H., & Mustapha, A. (2012). Data quality: A survey of data quality dimensions. In 2012 International conference on information retrieval & knowledge management. IEEE, http://dx.doi.org/10.1109/infrkm.2012.6204995.
Silla, C. N., & Freitas, A. A. (2010). A survey of hierarchical classification across different application domains. Data Mining and Knowledge Discovery, 22(1–2), 31–72. http://dx.doi.org/10.1007/s10618-010-0175-9.
Singh, A., Kaur, N., Kukreja, V., Kadyan, V., & Kumar, M. (2022). Computational intelligence in processing of speech acoustics: A survey. Complex & Intelligent Systems, 8(3), 2623–2661. http://dx.doi.org/10.1007/s40747-022-00665-1.
Singh, A., Thakur, A., Rajan, P., & Bhavsar, A. (2018). A layer-wise score level ensemble framework for acoustic scene classification. In 2018 26th European signal processing conference. IEEE, http://dx.doi.org/10.23919/eusipco.2018.8553052.
Soo Hyun Bae, I. C., & Kim, N. S. (2016). Acoustic scene classification using parallel combination of LSTM and CNN. In Detection and classification of acoustic scenes and events workshop (DCASE), Budapest, Hungary, 3 September 2016.
Sowe, S. K., & Zettsu, K. (2013). The architecture and design of a community-based cloud platform for curating big data. In 2013 International conference on cyber-enabled distributed computing and knowledge discovery. IEEE, http://dx.doi.org/10.1109/cyberc.2013.35.
Spoorthy, V., Mulimani, M., & Koolagudi, S. G. (2021). Acoustic scene classification using deep learning architectures. In 2021 6th international conference for convergence in technology. IEEE, http://dx.doi.org/10.1109/i2ct51068.2021.9418177.
Stonebraker, M., & Ilyas, I. F. (2018). Data integration: The current status and the way forward. IEEE Data Engineering Bulletin, 41(2), 3–9.
Stonebraker, M., Bruckner, D., Ilyas, I. F., Beskales, G., Cherniack, M., & Zdonik, S. (2013). Data curation at scale: The data tamer system. In 6th Biennial conference on innovative data systems research (CIDR '13), Asilomar, California, USA.
Suh, S., Lim, W., Park, S., & Jeong, Y. (2019). Acoustic scene classification using SpecAugment and convolutional neural network with inception modules: Technical report, Detection and Classification of Acoustic Scenes and Events 2019, Realistic AV Research Group, Electronics and Telecommunications Research Institute, 218 Gajeong-ro, Yuseong-gu, Daejeon, Korea.
Suh, S., Park, S., Jeong, Y., & Lee, T. (2020). Designing acoustic scene classification models with CNN variants: Technical report, Detection and Classification of Acoustic Scenes and Events 2020, Media Coding Research Section, Electronics and Telecommunications Research Institute.
Szegedy, C., Liu, W., Jia, Y., Sermanet, P., Reed, S., Anguelov, D., Erhan, D., Vanhoucke, V., & Rabinovich, A. (2015). Going deeper with convolutions. In 2015 IEEE conference on computer vision and pattern recognition. IEEE, http://dx.doi.org/10.1109/cvpr.2015.7298594.
Szegedy, C., Zaremba, W., Sutskever, I., Bruna, J., Erhan, D., Goodfellow, I., & Fergus, R. (2013). Intriguing properties of neural networks. In International conference on learning representations. arXiv:1312.6199.
Takahashi, G., Yamada, T., Ono, N., & Makino, S. (2017). Performance evaluation of acoustic scene classification using DNN-GMM and frame-concatenated acoustic features. In 2017 Asia-Pacific signal and information processing association annual summit and conference. IEEE, http://dx.doi.org/10.1109/apsipa.2017.8282314.
Thickstun, J., Harchaoui, Z., & Kakade, S. (2016). Learning features of music from scratch. In International conference on learning representations. arXiv:1611.09827.
Thirumuruganathan, S., Tang, N., Ouzzani, M., & Doan, A. (2020). Data curation with deep learning. Open Proceedings.
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, L., & Polosukhin, I. (2017). Attention is all you need. arXiv:1706.03762.
Virtanen, T., Mesaros, A., Heittola, T., Diment, A., Vincent, E., Benetos, E., & Elizalde, B. M. (2017). Detection and classification of acoustic scenes and events 2017 workshop (DCASE2017). In Proceedings of the detection and classification of acoustic scenes and events 2017 workshop.
Waldekar, S., & Saha, G. (2018). Classification of audio scenes with novel features in a fused system framework. Digital Signal Processing, 75, 71–82. http://dx.doi.org/10.1016/j.dsp.2017.12.012.
Wang, Y., Getreuer, P., Hughes, T., Lyon, R. F., & Saurous, R. A. (2017). Trainable frontend for robust and far-field keyword spotting. In 2017 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2017.7953242.
Wang, H., Li, M., Bu, Y., Li, J., Gao, H., & Zhang, J. (2016). Cleanix: A big data cleaning parfait. ACM SIGMOD Record, 44(4), 35–40. http://dx.doi.org/10.1145/2935694.2935702.
Wang, H., Zou, Y., & Wang, W. (2021). SpecAugment++: A hidden space data augmentation method for acoustic scene classification. arXiv:2103.16858.
Wilkinghoff, K., & Kurth, F. (2019). Open-set acoustic scene classification with deep convolutional autoencoders. In Proceedings of the detection and classification of acoustic scenes and events 2019 workshop. New York University, http://dx.doi.org/10.33682/340j-wd27.
Wu, T. T., & Lange, K. (2008). Coordinate descent algorithms for lasso penalized regression. The Annals of Applied Statistics, 2(1). http://dx.doi.org/10.1214/07-aoas147.
Wu, Y., & Lee, T. (2019). Enhancing sound texture in CNN-based acoustic scene classification. In ICASSP 2019 - 2019 IEEE international conference on acoustics, speech and signal processing. IEEE, http://dx.doi.org/10.1109/icassp.2019.8683490.
Xia, X., Togneri, R., Sohel, F., Zhao, Y., & Huang, D. (2019). A survey: Neural network-based deep learning for acoustic event detection. Circuits, Systems, and Signal Processing, 38(8), 3433–3453. http://dx.doi.org/10.1007/s00034-019-01094-1.
Xu, J.-X., Lin, T.-C., Yu, T.-C., Tai, T.-C., & Chang, P.-C. (2018). Acoustic scene classification using reduced MobileNet architecture. In 2018 IEEE international symposium on multimedia. IEEE, http://dx.doi.org/10.1109/ism.2018.00038.
Yakout, M., Berti-Équille, L., & Elmagarmid, A. K. (2013). Don't be SCAREd. In Proceedings of the 2013 international conference on management of data. ACM Press, http://dx.doi.org/10.1145/2463676.2463706.
Yamaguchi, O., Fukui, E., & Maeda, K. (2002). Face recognition using temporal image sequence. In Proceedings third IEEE international conference on automatic face and gesture recognition. IEEE Comput. Soc, http://dx.doi.org/10.1109/afgr.1998.670968.
Yang, L., Chen, X., & Tao, L. (2018). Acoustic scene classification using multi-scale features. In Detection and classification of acoustic scenes and events workshop (DCASE), Surrey, UK.
Yang, C., Puthal, D., Mohanty, S. P., & Kougianos, E. (2017). Big-sensing-data curation for the cloud is coming: A promise of scalable cloud-data-center mitigation for next-generation IoT and wireless sensor networks. IEEE Consumer Electronics Magazine, 6(4), 48–56. http://dx.doi.org/10.1109/mce.2017.2714695.
Yasumoto, K., Yamaguchi, H., & Shigeno, H. (2016). Survey of real-time processing technologies of IoT data streams. Journal of Information Processing, 24(2), 195–202. http://dx.doi.org/10.2197/ipsjjip.24.195.
Ye, J., Kobayashi, T., Toyama, N., Tsuda, H., & Murakawa, M. (2018). Acoustic scene classification using efficient summary statistics and multiple spectro-temporal descriptor fusion. Applied Sciences, 8(8), 1363. http://dx.doi.org/10.3390/app8081363.
Ye, J., Kobayashi, T., Wang, X., Tsuda, H., & Murakawa, M. (2020). Audio data mining for anthropogenic disaster identification: An automatic taxonomy approach. IEEE Transactions on Emerging Topics in Computing, 8(1), 126–136. http://dx.doi.org/10.1109/tetc.2017.2700843.
Zeinali, H., Burget, L., & Černocký, J. H. (2019). Acoustic scene classification using fusion of attentive convolutional neural networks for DCASE2019 challenge. arXiv:1907.07127.
Zhang, H., Cisse, M., Dauphin, Y. N., & Lopez-Paz, D. (2017). mixup: Beyond empirical risk minimization. arXiv:1710.09412.
Zheng, X. (2019). Acoustic scene classification combining log-mel CNN model and end-to-end model: Technical report, National Engineering Laboratory for Speech and Language Information Processing, University of Science and Technology of China, Hefei, China.
Zhong, Z., Zheng, L., Kang, G., Li, S., & Yang, Y. (2017). Random erasing data augmentation. arXiv:1708.04896.
Zieliński, S., & Lee, H. (2018). Feature extraction of binaural recordings for acoustic scene classification. In Proceedings of the 2018 federated conference on computer science and information systems. IEEE, http://dx.doi.org/10.15439/2018f182.