
Urban Noise Classification Using Machine Learning Techniques Comparative Analysis and Future

This research paper evaluates various machine learning models for urban noise classification, focusing on DNN, CNN, LSTM, and RF using the UrbanSound8K dataset. The study finds that DNN achieves the highest accuracy at 94.5%, followed by CNN at 90%, while discussing challenges and future directions for improving urban noise monitoring systems. The paper emphasizes the potential of machine learning in enhancing urban living conditions through effective noise classification and community engagement.

Uploaded by

Antonio Sánchez

Urban Noise Classification Using Machine Learning Techniques: Comparative Analysis and Future


Tasmiya Mujawar
[email protected]

Abstract: This research paper investigates the effectiveness of various machine learning models for the classification of urban noise, focusing on Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Long Short-Term Memory Networks (LSTM), and Random Forest (RF). Utilizing the UrbanSound8K dataset, the study aims to determine which model offers the highest accuracy and performance for categorizing urban sounds. The results reveal that the DNN model achieved the highest accuracy at 94.5%, followed by CNN at 90%, RF at 87%, and LSTM at 79%. The DNN's superior performance is attributed to its deep hierarchical learning capabilities, while the CNN excels at spatial feature extraction from spectrograms. The RF model demonstrated robust generalization capabilities, and the LSTM model highlighted the need for further optimization in capturing temporal dependencies. The paper discusses the challenges faced, including data quality, computational limitations, and the need for efficient feature extraction, and suggests future research directions. These include advancing automated sound event detection, optimizing feature selection, exploring hybrid neural network architectures, and deploying models on edge devices. The findings emphasize the potential of deep learning models in enhancing urban noise monitoring systems and improving urban living conditions.

I. INTRODUCTION

Urban noise has grown to be a major problem for both policymakers and city inhabitants as an inevitable outcome of urbanization and industry. Noise pollution, also known as the "unseen pollutant," has an adverse effect on people's well-being, health, and quality of life. According to the World Health Organization (WHO), noise pollution is one of the biggest environmental risks to public health. [1] Extended exposure to elevated urban noise levels can result in a range of detrimental health consequences, such as hearing impairment, heart-related disorders, sleep disruptions, and psychological problems like tension and anxiety. Therefore, establishing healthier and more pleasant urban environments depends critically on comprehending and reducing urban noise pollution.

Machine learning (ML) algorithms have become extremely effective at evaluating and understanding vast amounts of data in a variety of fields, including environmental monitoring. By automating the noise detection, classification, and analysis processes, ML algorithms provide a number of advantages over classical methods when it comes to urban noise categorization. These algorithms are capable of handling a wide range of complicated information, learning from past data, and gradually becoming more efficient. [2] By utilizing cutting-edge machine learning algorithms, it is feasible to create reliable and scalable solutions for real-time urban noise monitoring and classification.

When classifying urban noise using machine learning techniques, signal processing is essential. Signal processing techniques are used to preprocess raw audio signals and convert them into useful features for classification. [3] These features frequently consist of the audio signals' time-domain and frequency-domain properties, such as amplitude, power spectral density, and Mel-frequency cepstral coefficients (MFCCs). Wavelet transforms and short-time Fourier transforms are examples of advanced signal processing techniques that can further improve the relevance and quality of the extracted features, increasing the accuracy of noise classification models. [4]

Furthermore, the automation and scalability of ML models enable their widespread deployment throughout vast metropolitan regions with little to no human involvement. Noise monitoring becomes more effective and economical as a result of the decreased dependence on manual data collection and processing. Moreover, ML-based noise classification systems can encourage community involvement and support for noise reduction activities by increasing public awareness of noise pollution and its impacts through the provision of transparent and easily accessible noise data. [5]

Classifying urban noise using machine learning techniques is still a relatively new topic with a lot of room for development. In the future, the accuracy and resilience of noise classification models can be strengthened, ML can be integrated with other technologies like edge computing and the Internet of Things (IoT), and new use cases like noise prediction and simulation can be investigated. [6] Furthermore, the efficiency and influence of ML-driven noise control solutions can be increased by interdisciplinary partnerships between computer scientists, urban planners, public health specialists, and legislators.

The following sections of this report take the reader through the background and related work, explaining the state of the art; the implemented algorithms, detailing the dataset, algorithms, code, system setting, and key performance indicators used; and the results, followed by a discussion of the challenges we faced during the project. Lastly, we outline future work that could further grow the field of urban noise classification.

II. BACKGROUND AND RELATED WORK

Albaji et al. (2023) conducted a study on noise pollution mapping in urban areas using machine learning algorithms. [7] They
implemented different types of algorithms to map and classify noises in voices, aiming to provide a comprehensive assessment of urban noise pollution. [8] They applied machine learning to predict noise pollution patterns from data that they collected themselves in different urban areas. The study demonstrates the effectiveness of machine learning algorithms in noise classification.

Ali, Rashid, and Hamid (2022) studied machine learning algorithms to classify environmental noise in smart cities. [9] They implemented a system that collects data, processes it using feature extraction methods, and applies different machine learning algorithms to classify different types of urban noise. The paper highlights the possibilities of managing and mitigating noise pollution in smart cities using machine learning. [10]

A study by Renaud et al. (2023) explored making long-term predictions of noise levels based on data collected in an English city. They implemented several deep learning models (Transformer, TFT, CNN-LSTM, LSTM) and Gradient Boosting algorithms and obtained long-term and short-term predictions. The paper also proposes an approach for detecting noise level anomalies based on predictions. [11]

The UrbanSound8K dataset, which is also used in this paper, was utilized by Bubashait and Hewahi (2021) to compare the performance of Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) networks. This collection of recordings of urban sounds provides a standard against which to compare different machine learning models. [12] They found that while DNNs are very good at handling structured data, they have trouble with temporal dependencies; CNNs are very good at extracting spatial features from audio spectrograms; and LSTMs are very good at capturing the temporal patterns within sequential data. The thorough comparison revealed that although CNNs were generally better at learning hierarchical features, LSTMs were also able to achieve competitive results by using their ability to represent long-term dependencies in audio signals. This comparative analysis highlights the significance of choosing suitable neural network designs depending on the particulars of the dataset and the type of classification task. [13]

Zambon et al. (2018) explored methods for monitoring and predicting traffic noise in large urban areas. [14] They developed models to analyze noise data, providing insights into current noise levels and predicting future trends. This research supports urban noise management by enabling more informed decision-making and effective mitigation strategies. The study demonstrates the importance of integrating monitoring systems with predictive analytics for improved urban noise control. [15]

III. IMPLEMENTED ALGORITHMS

A. Deep Neural Networks

One of the most fascinating innovations in artificial intelligence is the neural network, which draws inspiration from the human brain. [16] Deep neural networks (DNNs) have shown amazing performance in a range of complex tasks, which has transformed numerous fields, including natural language processing, speech recognition, and computer vision. Through numerous layers of nonlinear transformations, DNNs can automatically learn hierarchical representations of data, which is what gives them their power. This feature makes it possible for DNNs to effectively describe complex patterns and dependencies in data that are difficult for traditional machine learning algorithms to capture. [17]

As seen in Fig. 1, a standard DNN is made up of three kinds of layers: an input layer, several hidden layers, and an output layer. [18] The neurons that make up each layer are linked to neurons in the layers above and below. Weights are associated with the connections between neurons and are learned during training. In order to reduce the error between the target values and the predictions made by the network, these weights must be adjusted during the learning process. The backpropagation algorithm is usually used for this, computing the gradient of the loss function with respect to each weight and updating the weights based on the result.

Fig. 1: Overview of a DNN architecture: This architecture, suitable for classification tasks thanks to its softmax output layer, is used throughout the paper along with its notations. [19]

The capacity of DNNs to learn hierarchical representations is one of their main advantages. [17] Generally, a DNN's upper layers capture more abstract properties like objects or shapes, while its bottom layers record low-level features like edges in an image. Through this hierarchical learning process, DNNs can construct complex features from simpler ones, which is essential for tasks requiring high-dimensional data interpretation. [20]

Practical applications of DNNs are significantly impacted by how efficiently they process data. Efficient processing makes it possible to implement
complicated models on devices with limited resources, like edge computing platforms, IoT devices, and smartphones. This is vital for real-time applications where latency and power consumption are important considerations, such as autonomous driving. [21] Furthermore, by lowering operating costs and energy consumption, effective processing methods enable the use of DNNs in large-scale applications, like cloud-based services and data centers. [22]

B. Random Forest

Random Forest is a classification approach that creates an ensemble using many univariate classification trees as a complicated composite classifier. [23] This ensemble learning technique has become very popular because of its precision, dependability, and user-friendliness. Leo Breiman first presented it in 2001, and since then it has grown to be one of the most effective and adaptable machine learning techniques. [24] Decision trees are basic, understandable models that divide the feature space into discrete areas according to the characteristics of the input data. The Random Forest algorithm expands on this idea. Decision trees can be highly variable and prone to overfitting, but Random Forest uses ensemble learning to reduce these problems.

Essentially, a Random Forest is made up of several decision trees that are built using various subsets of the training set, as seen in Fig. 2. In order to ensure that every tree is trained on a distinct set of observations, this procedure, referred to as bootstrapping, entails sampling the data with replacement. Additionally, only a random subset of features is taken into account when splitting nodes during the building of each tree. This adds another layer of randomness that decorrelates the trees, improving the model's resilience and capacity for generalization. [25]

Fig. 2: Random Forest structure [23]

Random Forests are popular because of a number of appealing characteristics. They can handle both numerical and categorical data without requiring a lot of preprocessing, and they can handle huge datasets with high-dimensional features. [26] Additionally, the technique withstands data noise and outliers rather well. Random Forests can also be used to evaluate each feature's significance for the prediction process. A metric of feature relevance can be obtained by examining the decrease in impurity (such as entropy or Gini impurity) that each feature contributes across all trees. This is very helpful for performing feature selection and comprehending the underlying data structure.

C. Long Short-Term Memory (LSTM)

A kind of recurrent neural network (RNN), Long Short-Term Memory (LSTM) networks have emerged as a keystone of deep learning, especially for tasks involving sequential input. In order to overcome the drawbacks of conventional RNNs, particularly the vanishing gradient issue, Hochreiter and Schmidhuber invented LSTMs in 1997. [27] RNNs have difficulty learning long-term dependencies because of this issue, which arises when the gradients used in training shrink exponentially as they propagate back through time.

Fig. 3 depicts the general architecture, and the setup specifics of the LSTM hyperparameters are presented below. Effective maintenance and updating of long-term dependencies is made possible by the distinct cell structure incorporated into LSTM networks. Input, forget, and output gates are the three main gates found in each LSTM cell. [28] By regulating the information flow into, through, and out of the cell, these gates help the network to hold onto useful data for longer periods of time while removing unnecessary data.

Fig. 3: Sequential Long Short-Term Memory (LSTM) architecture. [29]

Capturing long-term dependencies is a vital capability of LSTM networks and is one of their key advantages for many sequential data applications. For example, LSTMs are utilized in natural language processing (NLP) for tasks like language modeling, machine translation, and text generation, where semantic sequence maintenance and context awareness are very important. [27] Likewise, LSTMs have considerable utility in the financial, meteorological, and medical fields because they can forecast future values in time series analysis by using existing data. [30]

LSTMs have demonstrated great potential in the urban sound classification setting because of their capacity to gradually learn and identify patterns in audio data. [31] Urban
sound classification is a difficult problem because it requires separating out and classifying a wide range of sound events from complex and chaotic acoustic settings, such as human activities, road noise, and construction sounds. The temporal dependencies in audio data, which are crucial for differentiating between different kinds of sounds, are frequently difficult for traditional sound classification techniques to capture.

D. Convolutional Neural Networks (CNN)

Deep learning has been transformed by Convolutional Neural Networks (CNNs), especially in the area of classification tasks. They are highly useful in a wide range of applications due to their ability to automatically and adaptively learn spatial hierarchies of features from input data. [32] CNNs play a key role in accurately classifying traffic patterns and detecting anomalies in the field of traffic anomaly detection in Internet of Things networks.

According to Kang et al., CNNs perform well when managing missing data, which is a typical problem in a lot of real-world Internet of Things applications. They suggest a deep similarity metric approach that makes use of CNNs to extract pertinent characteristics from traffic data, improving anomaly detection. It has been observed that CNNs' convolutional layers have the ability to identify local dependencies in the data, which is very helpful in deciphering intricate traffic patterns. [33]

Fig. 4: Architecture of a Convolutional Neural Network (CNN). The traditional CNN structure is mainly composed of convolution layers, pooling layers, fully connected layers, and some activation functions. Each convolution kernel is connected to a part of the feature maps. The input is connected to all of the output elements in the fully connected layer. [33]

A standard CNN architecture, shown in Fig. 4, includes a number of layers: convolutional layers, pooling layers, and fully connected layers. Convolutional layers apply filters to the input data to produce feature maps, highlighting important patterns [33]. Pooling layers then decrease the dimensionality of these feature maps, retaining the most important information while reducing the computational load. Fully connected layers combine the extracted features for either prediction or classification purposes, thereby completing the classification process.

Convolutional Neural Networks have demonstrated exceptional performance in the noise classification domain, particularly in the task of identifying sounds in cities. [13] [34] One may argue that CNNs are capable of automatically extracting and learning hierarchical characteristics from unprocessed audio data, which are essential for differentiating between various urban noise classes. According to Bubashait and Hewahi's study, CNNs are particularly good at handling the spectral representation of audio inputs. Once unprocessed audio data is converted into spectrograms, CNNs can identify and categorize sound patterns with high accuracy by exploiting the spatial hierarchies in those spectrograms. By combining the best features of image recognition with audio classification, this transformation enables CNNs to bridge the gap between visual and aural data processing methods.

IV. METHODOLOGY

The UrbanSound8K dataset is made up of 8,732 labeled sound snippets, each 4 seconds or less in length, from 10 different classes (air conditioner, vehicle horn, children playing, dog barking, drilling, engine idling, gunshot, jackhammer, siren, and street music). The objective is to categorize these sounds according to their properties using a variety of machine learning models, including Random Forest, Deep Neural Networks, Convolutional Neural Networks, and Long Short-Term Memory Networks. The methodology outlines the general procedures required to pre-process the data, extract significant features, choose and use suitable algorithms, and assess their effectiveness.

The process begins with the loading of data. The UrbanSound8K dataset's audio files and metadata are all imported. Data cleaning is carried out in order to handle any missing or corrupted audio files and make sure that all audio files have the same sampling rate and are in the same format (e.g., .wav files). Techniques for reducing noise are used to improve the quality of the audio signals; they may include normalizing audio levels or filtering out background noise. To ensure a thorough evaluation, the data is then separated into training, validation, and test sets, usually at a ratio of 70% for training, 15% for validation, and 15% for testing.

Converting audio data into a format that machine learning algorithms can understand requires a step called audio feature extraction. Mel-frequency Cepstral Coefficients, which describe the audio signal in a compact form and capture the sound's power spectrum, are among the most commonly extracted features. [35] Chromagram features, which represent 12 distinct pitch classes, make it possible to identify musical elements in the audio. The timbral texture of a sound is captured by spectral contrast, which quantifies the amplitude difference between peaks and troughs in the spectrum. The Zero-Crossing Rate, which measures how quickly a signal changes sign, is used to differentiate between tonal and noisy signals. In addition, Root Mean Square is used to assess the audio signal's energy, which gives information about its volume.

Next, the machine learning algorithms best suited to classifying the audio data are chosen. Given that the features have been acquired, the DNN, CNN, Random Forest, and LSTM
algorithms are suitable for this task due to their effectiveness in handling audio data.

Fully connected neural networks, or DNNs, may be a suitable choice for extracting high-level abstractions from the information. CNNs can be applied to spectrograms or other image-like feature representations because they are good at capturing spatial hierarchies in the data. [36] Random Forest, a type of ensemble learning, is resistant to overfitting and operates by building several decision trees. Recurrent neural networks, such as long short-term memory networks (LSTMs), are useful for modeling time-series audio properties because they can extract temporal dependencies from sequential data. [37]

The input features of the CNN and DNN models are normalized, and the number of layers and neurons in each layer of the network architecture is set initially. The models are then trained using the Adam optimizer and the categorical cross-entropy loss function, and their hyperparameters are tuned based on validation set performance. For the Random Forest, numerous decision trees are trained using bootstrapped samples of the training data, and a majority voting method is used to combine the predictions from the many trees into a single prediction. To use the LSTM, input features must be reshaped into network-appropriate sequences, the LSTM architecture must be initialized with the necessary layers and units, and the Adam optimizer and categorical cross-entropy loss function must be used for training.

To determine how well the models perform across classes, they are then assessed using a variety of performance metrics, including accuracy, precision, recall, F1-score, and the confusion matrix. The robustness and generalizability of the models are then confirmed via conventional k-fold cross-validation. The advantages and disadvantages of each strategy were compared, and the assessment metrics allowed for the selection of the top-performing model.

Ultimately, the optimal model is selected and stored in the appropriate format (for neural networks, HDF5). After that, the model is used in a production setting where it is able to categorize audio input in real time. The deployment makes possible applications in fields such as surveillance, smart city systems, and noise monitoring.

Fig. 5: Process Flowchart

A. Dataset

During dataset selection, the Urban Sound 8K dataset was chosen for this project from the options shown in Table I due to its high-quality .wav audio format, which is ideal for detailed audio analysis and machine learning tasks. With a manageable size of approximately 7.5 GB, it provides a substantial amount of data for training and testing models without being overwhelmingly large. The dataset includes 10 distinct classes, balancing variety and simplicity, and is specifically focused on urban sounds, making it highly relevant for practical applications in urban environments such as noise monitoring and smart city implementations. Additionally, being a widely used dataset in the research community, it offers ample resources, baseline models, and research papers to support the development of new models and techniques. These characteristics make the Urban Sound 8K dataset an excellent choice for developing and testing sound classification models in urban settings.

TABLE I: Comparison of Audio Datasets

Dataset               Urban Sound 8K   ESC-50     FSD50K
Classes               10               50         200
Audio format          .wav             .wav       .wav
Size                  ~7.5 GB          ~600 MB    ~23 GB
Separated test/ver.   ×                ×          ×

The data set applied for environmental noise classification in this project contains 8732 labeled sound excerpts of urban areas from 10 classes: car horn, air conditioner, dog bark, children playing, engine idling, drilling, jackhammer, gun shot, street music, and siren. This data set, called "UrbanSound8K", was obtained from Kaggle. [38] The distribution of the noises is shown in Fig. 6.

Fig. 6: Distribution of noise types

All excerpts in this data set are taken from the Freesound website, a repository of sound recordings. [39] The audio excerpt files are in WAV format and are sorted into ten different folds; these folders contain 8732 sound excerpts in total. Moreover, the data set includes a CSV file containing metadata about each recording. The CSV file has metadata in 8 columns: the name of the audio file, the Freesound ID of the recording, the start time of the slice in the original recording, the end time of the slice in the original recording, the salience rating of the sound, the fold number, a numeric identifier of the sound class, and the class name.

The sample rate, bit depth, and number of channels are the same as those of the original file uploaded to Freesound. This ensures that the audio quality and characteristics remain consistent with the original recordings.

Fig. 7 illustrates a waveform and a spectrogram, two essential audio signal visualizations for comprehending various signal components. On the left side of the picture is the audio signal's waveform. The waveform is a representation in the time domain. It shows how the signal's amplitude varies with time. The sample index, which represents the discrete time points at which the audio signal was recorded, is represented on the x-axis. The amplitude, or signal strength, at each time point is shown by the y-axis. The waveform's peaks and troughs show how the audio signal's loudness varies: higher peaks correspond to louder sounds and deeper valleys to quieter times. The audio signal's spectrogram is displayed on the right side. A spectrogram is a time-frequency diagram that shows how the signal's frequency spectrum changes over time. In this case, frequency is shown in hertz on the y-axis and time is represented in seconds on the x-axis. The color intensity at each point in the spectrogram indicates the magnitude of a certain frequency at a given time. Higher amplitudes and stronger frequency components are represented by brighter colors, whilst lower amplitudes and weaker frequency components are represented by less bright colors. Because it displays the precise frequency content of a signal across time, a spectrogram is an excellent tool for in-depth analysis of audio signals. The spectrogram offers information on how various frequencies vary and interact in a signal, in contrast to the waveform, which displays amplitude variations. This in-depth frequency domain data is crucial for tasks involving audio detection and classification, where a thorough understanding of the signal's properties is required. From the spectrogram, patterns, harmonic structures, and transient events can be seen that are difficult to see from the waveform by itself.

This is followed in Fig. 8 by a spectrogram of the same sound signal, computed using the Librosa library. The color scale indicates the amplitude in decibels, and the x- and y-axes display the time in seconds and the frequency in hertz, respectively. This spectrogram provides a more thorough understanding of the frequency content over time and also clearly demonstrates patterns for various sound classes. For example, sirens are represented by sweeping frequency patterns, car horns appear as sharp, noticeable lines at higher frequencies between 1000 and 6000 Hz, and street music is made up of more complex patterns over a wider frequency range. As these visuals highlight, each class of sounds has distinct acoustic characteristics on which a deep convolutional neural network must be trained in order to precisely classify urban noises. Together, the waveform and spectrogram provide complete insight into the temporal and spectral properties of the audio signals, which is needed to build reliable and efficient models for the task of audio recognition.

Through the usage of this dataset, models that can differentiate between different urban sound types can be developed for the purpose of classifying noise. In this study, the dataset is used to develop artificial intelligence models that perform well in classifying various environmental noise types, thereby contributing to the larger field of urban sound analysis and classification.

B. Algorithms
Fig. 7: Waveform and Spectrogram of an Urban Sound Clip
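The two views in Fig. 7 can be reproduced in outline without any audio library. The following is a minimal sketch, assuming a synthetic 1000 Hz tone in place of a real UrbanSound8K clip and a naive short-time Fourier transform in place of the Librosa routines used in this study; the names `hann`, `stft_db`, `clip`, `frame_len`, and `hop` are illustrative and do not come from the paper's code.

```python
import cmath
import math


def hann(n):
    """Hann window of length n, used to reduce spectral leakage."""
    return [0.5 - 0.5 * math.cos(2 * math.pi * i / (n - 1)) for i in range(n)]


def stft_db(signal, frame_len=64, hop=32):
    """Return one row per time frame; each row holds magnitudes in dB per bin."""
    window = hann(frame_len)
    spectrogram = []
    for start in range(0, len(signal) - frame_len + 1, hop):
        frame = [signal[start + i] * window[i] for i in range(frame_len)]
        row = []
        for k in range(frame_len // 2 + 1):  # bins from 0 Hz up to Nyquist
            acc = sum(
                frame[n] * cmath.exp(-2j * math.pi * k * n / frame_len)
                for n in range(frame_len)
            )
            # Small offset avoids log10(0) for silent bins.
            row.append(20 * math.log10(abs(acc) + 1e-10))
        spectrogram.append(row)
    return spectrogram


# Assumed stand-in for an audio clip: a 1000 Hz tone sampled at 8000 Hz.
sample_rate = 8000
clip = [math.sin(2 * math.pi * 1000 * t / sample_rate) for t in range(512)]
spec = stft_db(clip)

# Frequency (in Hz) of the strongest bin in the first frame.
peak_bin = max(range(len(spec[0])), key=lambda k: spec[0][k])
peak_hz = peak_bin * sample_rate / 64
```

Plotting `clip` against its sample index gives the waveform view, while plotting `spec` with time frames on the x-axis and frequency bins on the y-axis gives a spectrogram in decibels.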

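Sections III-B and IV describe Random Forest as many decision trees trained on bootstrapped samples and combined by majority vote. A minimal sketch of those two steps follows, assuming toy two-class data and one-split "stumps" in place of full decision trees, with one random feature per stump rather than a random feature subset per split; all names and values here are hypothetical and are not taken from the paper's implementation.

```python
import random
from collections import Counter


def bootstrap_sample(rows, rng):
    """Draw len(rows) items with replacement (a bootstrap sample)."""
    return [rng.choice(rows) for _ in rows]


def fit_stump(sample, feature):
    """Fit a one-split 'tree' for two classes on a single feature.

    The threshold is the midpoint between the two class means.
    """
    by_label = {}
    for features, label in sample:
        by_label.setdefault(label, []).append(features[feature])
    means = {label: sum(v) / len(v) for label, v in by_label.items()}
    if len(means) < 2:  # the bootstrap drew a single class: always predict it
        only = next(iter(means))
        return lambda features: only
    low, high = sorted(means, key=means.get)  # assumes exactly two classes
    threshold = (means[low] + means[high]) / 2
    return lambda features: high if features[feature] > threshold else low


def majority_vote(trees, features):
    """Combine the individual predictions into a single label."""
    votes = Counter(tree(features) for tree in trees)
    return votes.most_common(1)[0][0]


# Toy (feature vector, label) pairs; the numbers are made up for illustration.
data = [
    ((0.1, 0.9), "siren"), ((0.2, 0.8), "siren"), ((0.3, 0.7), "siren"),
    ((0.8, 0.2), "drilling"), ((0.9, 0.1), "drilling"), ((0.7, 0.3), "drilling"),
]

rng = random.Random(0)
forest = []
for _ in range(25):
    sample = bootstrap_sample(data, rng)   # each stump sees a different resample
    feature = rng.randrange(2)             # each stump uses one random feature
    forest.append(fit_stump(sample, feature))

prediction = majority_vote(forest, (0.15, 0.85))
```

Because every stump is fitted on a different resample and a randomly chosen feature, individual stumps may disagree, and the vote is what yields the single prediction described in the methodology.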
Fig. 8: Spectrogram of noises

C. Code

1) Deep Neural Network (DNN): Deep Neural Networks' capacity to represent intricate patterns and relationships in data has allowed them to demonstrate outstanding performance in a number of domains, including image and sound recognition. [40] Classifying urban sounds in the UrbanSound8K dataset with a DNN involved several crucial stages: data preprocessing, model construction, training, and evaluation.

a) Data Preprocessing: The UrbanSound8K dataset, which includes audio clips from ten different classes, first required features to be extracted. Mel-spectrogram coefficients were calculated in order to extract significant features from these audio files. These coefficients capture significant features that are more instructive for classification tasks and represent the sound's power spectrum.

b) Feature Extraction: The audio clips were subjected to Mel-spectrogram feature extraction using the Librosa library. Mel-spectrograms convert time-domain signals into a time-frequency representation and give an in-depth picture of the properties of the audio signal. [41]

c) Normalization: After that, the extracted features were normalized to ensure dataset consistency. Normalization plays a crucial role in accelerating the training process by preventing any one feature from outweighing the others in terms of helping the model learn.

d) Data Splitting: The dataset was split into two parts: 80% for training and the rest for testing. This split allows the model to learn from a large amount of data, while its predictions on unseen data test its generalization capability.

Fig. 9: A deep neural network (DNN) composed of an input layer of 3 nodes, 3 hidden layers of 5 nodes each, and an output layer of 1 node [42]

e) Model Building: The DNN's architecture was created using TensorFlow/Keras' Sequential API. The model is composed of multiple layers, each of which plays a distinct role:
• Input Layer: The input layer receives the preprocessed audio features, with the size of this layer corresponding to the number of features extracted from each audio clip.
• Hidden Layers: Several dense, or fully connected, layers with the ReLU activation function were employed. The non-linearity that ReLU offers enables the model to map more intricate patterns. To prevent overfitting, a dropout layer was added after every dense layer. In each training step, dropout layers randomly deactivate a portion of the neurons, forcing the network to acquire more robust features.
• Output Layer: Ten neurons, one for each of the ten
classes, make up the output layer which has a softmax
activation function. This makes the output applicable
to multi-class classification tasks by enabling it to be
converted into a probability distribution over the ten
classes.
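As a brief illustration of the softmax output described above, the following minimal sketch converts ten raw scores into a probability distribution. The function name softmax and the example scores are assumptions made for illustration only; in the study this step is performed by the Keras output layer.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    # Subtracting the maximum keeps exp() numerically stable.
    highest = max(logits)
    exps = [math.exp(z - highest) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


# Hypothetical raw outputs for the ten UrbanSound8K classes.
logits = [2.0, 0.5, -1.0, 0.0, 3.0, 0.1, -0.5, 1.2, 0.3, -2.0]
probs = softmax(logits)

# The predicted class is the index with the highest probability.
predicted_class = probs.index(max(probs))
```

The largest raw score always receives the largest probability, so taking the arg-max of the softmax output gives the same class as taking the arg-max of the raw scores; the probabilities matter mainly for the loss function and for judging confidence.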
f) Training: The model was compiled with the Adam optimizer, a well-established, efficient, and adaptive method for training deep learning models. Sparse categorical cross-entropy, a suitable loss function for multi-class classification with integer class labels, was used. A validation set comprising a portion of the training data was set aside to track the model’s performance on unseen data. The model was trained with a batch size of 32 for 50 epochs.
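To make the loss function concrete, the sketch below computes sparse categorical cross-entropy by hand for a toy batch. It is a plain-Python illustration under stated assumptions, not the Keras implementation: the function name, the three-class example, and the eps guard are all hypothetical.

```python
import math


def sparse_categorical_crossentropy(y_true, y_prob, eps=1e-12):
    """Mean negative log-probability assigned to the true class index."""
    losses = []
    for label, probs in zip(y_true, y_prob):
        # eps guards against log(0) when a probability is exactly zero.
        losses.append(-math.log(max(probs[label], eps)))
    return sum(losses) / len(losses)


# Hypothetical three-class batch: integer labels and predicted distributions.
y_true = [0, 2]
y_prob = [
    [0.7, 0.2, 0.1],
    [0.1, 0.3, 0.6],
]
loss = sparse_categorical_crossentropy(y_true, y_prob)
```

"Sparse" refers only to the label format: the true class is given as an integer index rather than a one-hot vector, and the loss value is the same as for categorical cross-entropy on the equivalent one-hot labels.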
Fig. 10: DNN Confusion Matrix
g) Evaluation: After training, the model is evaluated on the test set. Several metrics are used to assess the model’s performance:
• Accuracy: The primary metric, the proportion of correctly classified samples, is 94.5% on the test set, indicating high accuracy for the DNN.
• Precision, Recall, and F1-Score: These metrics offer a more detailed assessment of the model’s effectiveness on the individual classes. Precision is the proportion of true positive predictions among all positive predictions. Recall is the proportion of actual positives that are correctly predicted. The F1-score is the harmonic mean of precision and recall.
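The definitions above can be checked with a small worked example. The sketch below is a minimal plain-Python illustration; the helper names and the toy labels are assumptions and are unrelated to the study's data.

```python
def per_class_metrics(y_true, y_pred, cls):
    """Precision, recall, and F1-score with `cls` treated as the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == cls and p == cls)  # true positives
    fp = sum(1 for t, p in pairs if t != cls and p == cls)  # false positives
    fn = sum(1 for t, p in pairs if t == cls and p != cls)  # false negatives
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1


def accuracy(y_true, y_pred):
    """Fraction of samples whose predicted label equals the true label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


# Hypothetical labels for a three-class toy problem.
y_true = [0, 0, 0, 1, 1, 2, 2, 2]
y_pred = [0, 0, 1, 1, 1, 2, 0, 2]

p0, r0, f0 = per_class_metrics(y_true, y_pred, cls=0)
acc = accuracy(y_true, y_pred)
```

Here class 0 has two true positives, one false positive, and one false negative, so precision, recall, and F1-score all equal 2/3, while overall accuracy is 6/8.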

              Precision  Recall  F1-Score  Support
Class 0          0.98     0.99     0.98      250
Class 1          0.96     0.93     0.95      107
Class 2          0.90     0.96     0.92      250
Class 3          0.90     0.89     0.90      250
Class 4          0.96     0.93     0.95      250
Class 5          0.98     1.00     0.99      250
Class 6          0.94     0.82     0.88       94
Class 7          0.95     0.98     0.97      250
Class 8          0.97     0.97     0.97      232
Class 9          0.92     0.91     0.91      250
Accuracy                           0.95     2183
Macro avg        0.95     0.94     0.94     2183
Weighted avg     0.95     0.95     0.95     2183
TABLE II: DNN Precision, Recall, and F1-Score Table
• Confusion Matrix: A confusion matrix was generated to visualize the model’s performance across the different classes, showing the number of correct and incorrect predictions for each class.
Fig. 11: DNN
h) Results and Discussion: The DNN model’s high accuracy shows how well it can identify and understand intricate patterns in audio data. There are several reasons for the model’s success:
• Hierarchical Learning: The DNN learns to represent the input features hierarchically, capturing low-level information in the earlier layers and high-level, abstract information in the deeper layers.
• Regularization: The dropout layers helped prevent overfitting and enabled the model to generalize to new data.
• Optimization: The model’s training and convergence were facilitated by the Adam optimizer’s adaptive learning rate.
2) Random Forest: The UrbanSound8K dataset, which contains 8732 audio clips from 10 urban sound classes (such as sirens, dog barks, and car horns), is the data source for this study. The metadata of these audio files is loaded, containing details such as the file name, fold number, and class label. The preprocessing phase involves several crucial tasks:
a) Feature Extraction: Feature extraction converts raw audio data into a set of representative features for the Random Forest classifier. Several audio features are extracted:
• Mel-frequency Cepstral Coefficients (MFCCs): MFCCs, widely used in speech and audio processing, summarize the power spectrum of the audio signal and are primarily used to describe timbre.
• Chroma Features: These features capture the harmonic content and the energy distribution among the twelve pitch classes.
• Spectral Contrast: This feature helps differentiate between harmonic and noisy sounds by measuring the amplitude difference between peaks and troughs in the sound spectrum.
To lower the dataset’s dimensionality and provide a fixed-size feature vector for every audio recording, the extracted features are averaged over time. By removing less significant temporal irregularities, this averaging helps preserve the main qualities of the audio signal.
b) Data Splitting and Label Encoding: Once features are extracted, the dataset is split into training and testing sets. This is often done in a stratified manner so that each subset has a similar class balance, which is crucial for training a balanced model. Typically, 20% is set aside for testing and 80% for training.
Next, the class of each audio file is mapped to a numeric value using label encoding. This step is needed because machine learning models require numerical input. The Random Forest classifier then uses the encoded labels as its categorical targets.
c) Building and Training the Random Forest Model: Random Forest is an ensemble learning technique that uses a large number of decision trees to increase robustness and accuracy. The steps involved in implementing Random Forest are listed below:
• Initialization: The Random Forest model is initialized with a specified number of trees (n_estimators). In this study, 100 trees were used, which is a common choice that balances performance and computational cost.
• Training: Next, the model is fitted to the training set. Each tree is built on a bootstrap sample of the data, and a random subset of features is considered at each node. This randomness improves the model’s predictive power and reduces overfitting.
• Aggregation: For classification tasks, majority voting is used to aggregate the predictions of all trees. Collectively, the trees’ strengths are maximized while their individual weaknesses are mitigated.
d) Model Evaluation: The performance of the trained Random Forest model is evaluated on the test set using several metrics:
• Accuracy: This score assesses how accurate the model’s predictions are overall. It is the ratio of correctly classified instances to the total number of instances.
• Precision, Recall, and F1-Score: These metrics offer a more thorough assessment of the model’s performance in each of the classes. Precision is the percentage of positive predictions that were actually positive. Recall is the percentage of actual positives that were predicted as positive. The F1-score is the harmonic mean of precision and recall.
• Confusion Matrix: A confusion matrix is produced to show how well the model performs across the classes. It shows which predictions are correct and which are incorrect for every class, making it possible to pinpoint where the model requires improvement.
e) Summary: Implementing the Random Forest classifier for urban noise classification involves several crucial phases that ensure the accuracy and dependability of the model. The process begins with loading and preparing the UrbanSound8K dataset, after which significant audio features are extracted. After the labels are encoded into numerical values, the data is divided into training and testing subsets.
The Random Forest model is built and trained as an ensemble in which multiple decision trees are combined to increase performance. The trained model’s accuracy in classifying urban sounds is then assessed using several metrics.
This implementation emphasizes the significance of every stage involved and shows how well the Random Forest classifier handles the challenging task of urban noise classification. The outcomes highlight the model’s dependability and its possible uses in real-world urban sound monitoring and classification systems.
              Precision  Recall  F1-Score  Support
Class 0          0.86     0.94     0.90      203
Class 1          1.00     0.69     0.81       86
Class 2          0.72     0.80     0.76      183
Class 3          0.89     0.87     0.88      201
Class 4          0.86     0.86     0.86      206
Class 5          0.95     0.97     0.96      193
Class 6          0.89     0.76     0.81       72
Class 7          0.93     0.91     0.92      208
Class 8          0.89     0.93     0.91      205
Class 9          0.80     0.76     0.78      230
Accuracy                           0.87     1747
Macro avg        0.85     0.84     0.85     1747
Weighted avg     0.87     0.87     0.87     1747
TABLE III: Random Forest Classification Report
3) Long Short Term Memory (LSTM): This section covers the implementation details of the Long Short-Term Memory network for the task of identifying urban sounds using the UrbanSound8K dataset. Because it can capture the temporal patterns and long-term relationships present in audio signals, the LSTM network is a good fit for this purpose.
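To make the idea of long-term relationships more concrete, the sketch below implements one step of a scalar LSTM cell using the standard gate equations. It is a toy illustration under stated assumptions, not the study's Keras model: the function names, the single-unit setup, and the hand-picked weights are all hypothetical.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def lstm_step(x, h_prev, c_prev, w):
    """One time step of a scalar LSTM cell.

    `w` maps each gate name to a tuple (input weight, recurrent weight, bias).
    """
    def pre(gate):
        w_x, w_h, b = w[gate]
        return w_x * x + w_h * h_prev + b

    f = sigmoid(pre("forget"))       # how much of the old cell state to keep
    i = sigmoid(pre("input"))        # how much new information to write
    g = math.tanh(pre("candidate"))  # candidate value to write
    o = sigmoid(pre("output"))       # how much of the cell state to expose
    c = f * c_prev + i * g           # updated cell state (the long-term memory)
    h = o * math.tanh(c)             # updated hidden state (the output)
    return h, c


# Hypothetical weights: forget gate nearly open, input gate nearly closed.
weights = {
    "forget": (0.0, 0.0, 10.0),
    "input": (0.0, 0.0, -10.0),
    "candidate": (1.0, 0.0, 0.0),
    "output": (0.0, 0.0, 0.0),
}

h, c = 0.0, 1.0  # start with a value of 1.0 stored in the cell state
for x in [0.3, -0.8, 0.5] * 10:  # 30 time steps of arbitrary input
    h, c = lstm_step(x, h, c, weights)
```

With the forget gate close to 1 and the input gate close to 0, the cell state stays near its initial value across all 30 steps, which is the mechanism that lets an LSTM carry information over long sequences. In a trained network the gate values are learned and depend on the input at each step.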
Fig. 12: Random Forest Confusion Matrix
The sequential structure of the audio data makes the LSTM model well-suited to learn from it. The LSTM model’s architecture includes the following elements:
a) Input Layer: The input layer receives the preprocessed audio features. Because LSTMs require three-dimensional input (samples, time steps, features), the features are reshaped accordingly so that the model can process them step by step.
b) LSTM Layers: The LSTM layers form the core of the model and capture the temporal dependencies of the audio data. The model includes:
• First LSTM Layer: This layer has 128 units and is configured to return sequences. Returning sequences enables the next LSTM layer to process the whole data sequence.
• Second LSTM Layer: This 64-unit layer returns only the last output in the sequence. This arrangement aids in dimensionality reduction while preserving the temporal information learned by the previous layer.
c) Dropout Layers: To prevent overfitting, a dropout layer is inserted after every LSTM layer. During training, a portion of the input units is randomly set to zero. This regularization technique prevents the model from becoming overly reliant on particular neurons, so that it generalizes better to unseen data.
d) Fully Connected Layers: The dense (fully connected) layers follow the LSTM layers. These layers further process the learned features and aid in classification.
• First Dense Layer: This layer has thirty-two units with ReLU activation functions. ReLU gives the model non-linearity, which enables it to learn intricate patterns.
• Output Layer: This layer has ten units, one per sound class, with a softmax activation function. Because the softmax function outputs probabilities across classes, it is suitable for multi-class classification.
The following setup is used to compile and train the model:
e) Loss Function: Categorical cross-entropy serves as the loss function and is appropriate for multi-class classification. This function guides the optimization process by measuring the discrepancy between the true class labels and the predicted probabilities.
f) Optimizer: The Adam optimizer was chosen because it is efficient and well-suited to handling sparse gradients. It uses estimates of the first and second moments of the gradient to adapt the learning rate.
g) Metrics: Accuracy is the main metric used to assess the model’s performance during training and testing. Accuracy is the percentage of correctly classified samples relative to all samples.
h) Training Process: The model is trained with a batch size of 32 across many epochs. To improve training efficiency and avoid overfitting, early stopping and learning rate reduction on plateau are used. Early stopping tracks the validation loss and ends training when it stops improving, while learning rate reduction lowers the learning rate when the validation loss reaches a plateau.
The performance of the trained LSTM model is assessed using the test set. Several key metrics are employed:
i) Accuracy: The model’s overall accuracy on the test set shows how well the model can correctly classify the urban sound excerpts.
Fig. 13: LSTM Graph
j) Confusion Matrix: To give a thorough analysis of the model’s performance across the various sound classes, a confusion matrix is created. The number of accurate and
inaccurate predictions for each class is displayed in this matrix, providing information about the model’s advantages and disadvantages.
Fig. 14: LSTM Confusion Matrix
k) Precision, Recall, and F1-Score: For every class, these metrics are computed to evaluate the model’s ability to correctly identify positive instances (precision), its ability to capture all relevant instances (recall), and the balance between the two (F1-score).
The capacity of the LSTM model to recognize and learn from the temporal patterns in audio data is demonstrated by its application to the classification of urban sounds. The architecture of the model makes use of the sequential nature of the input features by placing multiple LSTM layers in front of dense layers. The model delivers effective classification performance through sound training and evaluation methods, as shown by the confusion matrix, the model’s accuracy, and the other evaluation metrics. This application demonstrates the capabilities of LSTM networks in audio identification tasks, especially in intricate and dynamic urban settings.
4) Convolutional Neural Networks (CNN): The CNN model learns from the spatial hierarchy present in the spectrograms of the audio data. [43] The CNN model’s architecture consists of the following elements:
a) Input Layer: The input layer receives the preprocessed spectrogram features. Given that CNNs require four-dimensional input (samples, height, width, channels), the spectrogram features are reshaped accordingly.
b) Convolutional Layers: The core of the model consists of convolutional layers. These layers are responsible for detecting local patterns and features in the spectrograms. The model includes:
• First Convolutional Layer: This layer consists of 32 filters with a kernel size of 3x3 and a ReLU activation function. It captures low-level features such as edges.
• Pooling Layer: A max-pooling layer with a pool size of 2x2 is applied to reduce the dimensionality and retain the most important features.
• Second Convolutional Layer: This layer consists of 64 filters with a kernel size of 3x3 and a ReLU activation function. It captures more complex patterns.
• Pooling Layer: Another max-pooling layer with a pool size of 2x2 is applied to further reduce dimensionality.
c) Dropout Layers: Dropout layers are incorporated after each pooling layer to prevent overfitting by randomly setting a fraction of input units to zero during training. This regularization technique helps in ensuring that the model does not become too dependent on specific neurons and can generalize better to unseen data.
d) Fully Connected Layers: Following the convolutional layers, the model includes dense (fully connected) layers. These layers further process the learned features and aid in classification.
• First Dense Layer: This layer consists of 128 units with a Rectified Linear Unit (ReLU) activation function. ReLU introduces non-linearity into the model, enabling it to learn complex patterns.
• Output Layer: This layer consists of 10 units, corresponding to the ten sound classes, with a softmax activation function. The softmax function outputs probability distributions over the classes, making it suitable for multi-class classification tasks.
The model is compiled and trained using the following configurations:
e) Loss Function: Categorical cross-entropy is used as the loss function, appropriate for multi-class classification tasks. This function measures the difference between the true class labels and the predicted probabilities, guiding the optimization process.
f) Optimizer: The Adam optimizer is selected for its efficiency and capability to handle sparse gradients. Adam adjusts the learning rate dynamically based on the first and second moments of the gradient, which typically speeds up convergence.
g) Metrics: Accuracy is used as the primary metric to evaluate the model’s performance during training and testing. Accuracy measures the proportion of correctly classified samples out of the total samples.
h) Training Process: The model is trained over multiple epochs with a batch size of 32. Early stopping and learning rate reduction on plateau are employed as callbacks to enhance training efficiency and prevent overfitting. Early stopping monitors the validation loss and stops training when it stops improving, while learning rate reduction reduces the learning rate when the validation loss plateaus.
The trained CNN model is evaluated on the test set to measure its performance. Several key metrics are used:
i) Accuracy: The overall accuracy of the model on the test set provides a measure of how well the model can correctly classify the urban sound excerpts.
Fig. 15: CNN Graph
j) Confusion Matrix: A confusion matrix is generated to provide a detailed breakdown of the model’s performance across the different sound classes. This matrix highlights the number of correct and incorrect predictions for each class, offering insights into the model’s strengths and areas for improvement.
Fig. 16: CNN Confusion Matrix
k) Precision, Recall, and F1-Score: These metrics are calculated for each class to assess the model’s ability to correctly identify positive instances (precision), its ability to capture all relevant instances (recall), and the balance between precision and recall (F1-score).
The CNN model’s capacity to recognize and learn from the spatial hierarchies in the spectrogram data is demonstrated by its application to the categorization of urban sounds. [44] The architecture of the model makes use of the local patterns and characteristics seen in the spectrograms. It consists of many convolutional layers followed by dense layers. The model delivers effective classification performance through the use of strong training and assessment methodologies, as demonstrated by the confusion matrix and other evaluation metrics as well as the model’s accuracy and extensive analysis. This implementation demonstrates CNNs’ promise for audio identification tasks, especially in intricate and dynamic metropolitan settings.
Class             Precision  Recall  F1-score  Support
air conditioner      0.94     0.96     0.95      203
car horn             0.95     0.84     0.89       86
children playing     0.75     0.86     0.80      183
dog bark             0.89     0.86     0.87      201
drilling             0.92     0.84     0.88      206
engine idling        0.93     0.96     0.95      193
gun shot             0.87     0.85     0.86       72
jackhammer           0.97     0.96     0.96      208
siren                0.87     0.97     0.92      165
street music         0.88     0.81     0.84      230
accuracy                               0.89     1747
macro avg            0.90     0.89     0.89     1747
weighted avg         0.90     0.89     0.89     1747
TABLE IV: Classification Report for CNN

D. System Settings
The underlying software and hardware settings have a significant impact on the repeatability and performance of machine learning studies. Thorough explanations of the system configurations guarantee consistent outcomes and offer background information for any performance measurements that are disclosed. Every experiment was carried out on a device that met the following requirements:
• Processor: Intel Core i7-13650HX (20 CPUs) @ 2.60GHz
• Memory: 16 GB RAM
• Graphics: NVIDIA GeForce RTX 4060
• Operating System: Windows 11
• Storage: 1 TB SSD

The software environment was configured as follows:
• Python Version: 3.11.9
• Libraries:
– librosa - 0.10.2.post1
– numpy - 1.26.4
– pandas - 2.2.2
– scikit-learn - 1.4.2
– seaborn - 0.13.2
– matplotlib - 3.9.0
– tensorflow - 2.16.1
– keras - 3.3.3
– warnings (part of the Python standard library)
• IDE: Visual Studio Code

V. FUTURE DIRECTIONS
There are numerous areas where machine learning for the classification of urban noise can be further explored, with the goal of improving the practicality, effectiveness, and resilience
of existing approaches. The first is the automation of sound event detection (SED). Currently, segmenting a continuous stream of sounds involves numerous manual steps. The creation of completely automated SED systems would significantly increase the scalability and efficiency of applications used in the classification of urban noise. [45] [46] Automatic SED can use modern machine learning techniques, particularly deep learning models, to precisely detect and segment sound events in real time with less need for human interaction, increasing system accuracy.
The optimization of the feature selection and extraction procedures is a crucial area that requires additional effort. Reducing the number of parameters used in classification is necessary since IoT sensors and other equipment used for monitoring urban noise often have limited processing capabilities. [6] [47] In addition to reducing the computational burden, feature reduction ensures that the features chosen actually contribute to the classification models’ accuracy. Subsequent investigations ought to concentrate on pinpointing the most crucial features and exploring methods for achieving feature optimization via dimensionality reduction, feature engineering, and the application of algorithms such as PCA and t-SNE. [6]
Furthermore, even more complex neural network architectures, such as long short-term memory networks and convolutional neural networks, will be incorporated to improve the performance of the urban noise classification models. [45] Because these architectures can capture the temporal and spatial relationships in the data, they are well suited to processing audio data. By utilizing CNNs and LSTMs, future models can therefore classify the complex and overlapping sound events that are typical in urban areas with more accuracy.
Another area where further effort is needed is the deployment of advanced models onto edge devices. [45] By allowing real-time noise classification to be performed directly on the devices that collect the data, edge computing could reduce response time and latency. It is therefore necessary to create compact, efficient models that can operate on the limited resources that edge devices provide while still achieving good classification performance.
Finally, in order to diversify and enlarge the dataset, new cooperative and crowdsourced methods of data collection need to be investigated. By including the community in the data gathering process, researchers will be able to obtain a representative sample of urban noises that will enable the creation of more robust and generalized classification models. [48] It will be equally crucial to put privacy-preserving measures into practice to guarantee the ethical use of the information gathered from public areas.

VI. CHALLENGES
In the process of developing and implementing machine learning models for urban noise classification, several challenges were encountered that impacted the overall effectiveness and efficiency of the models. These challenges can be broadly categorized into data-related issues, computational limitations, model-specific difficulties, and deployment concerns.

A. Data-Related Issues
1) Quality and Quantity of Data: The UrbanSound8K dataset, while comprehensive, still poses challenges due to the limited quantity of labeled data for certain classes. Imbalanced data can lead to biased models that perform well on majority classes but poorly on minority classes. Additionally, the presence of noisy or mislabeled data can negatively impact model training and accuracy. Enhancing data augmentation techniques is vital in order to generate more diverse training samples and mitigate data imbalance. Furthermore, transfer learning and pre-trained models offer a reliable way to leverage existing knowledge and improve model performance with limited data.
2) Feature Extraction and Selection: Extracting meaningful features from audio data is a complex task. The performance of the models heavily depends on the quality of features extracted. Mel-frequency cepstral coefficients (MFCCs), chroma features, and spectral contrasts are commonly used, but determining the optimal set of features for classification remains a challenge.

B. Computational Limitations
1) Processing Power and Memory: Training deep learning models like DNNs, CNNs, and LSTMs requires significant computational power and memory. Limited access to high-performance computing resources can slow down the training process and restrict the ability to experiment with larger models or more complex architectures. Implementing model compression techniques to reduce computational requirements and enable real-time processing on edge devices can help address these challenges.
2) Hyperparameter Tuning: Optimizing hyperparameters is computationally intensive. Techniques like grid search or random search can be exhaustive and time-consuming, often requiring multiple runs to identify the best configurations. Exploring advanced optimization algorithms for more efficient hyperparameter tuning is a useful way to mitigate this issue.

C. Model-Specific Difficulties
1) Overfitting and Underfitting: Striking a balance between overfitting and underfitting is a persistent challenge. Models like DNNs and CNNs, with their high capacity, are prone to overfitting, especially when trained on small datasets. Conversely, simpler models may underfit, failing to capture the complexities of the data.
2) Temporal Dependencies: Capturing temporal dependencies in audio data is crucial for accurate classification. While LSTM networks are designed for this purpose, they still struggle with long-range dependencies and require careful tuning of parameters like sequence length and number of units. Investigating hybrid models that combine the strengths
of different architectures, such as CNNs and LSTMs, to better capture both spatial and temporal features is a convenient way to mitigate this problem.

D. Deployment Concerns
1) Real-Time Processing: Deploying models for real-time noise classification in urban environments requires efficient algorithms that can process data quickly. Ensuring low latency and high throughput in real-time applications is challenging, particularly with resource-constrained devices like edge computing platforms.
2) Robustness and Adaptability: Models need to be robust to variations in environmental conditions, such as changes in background noise, recording quality, and the presence of multiple sound sources. Developing models that can adapt to different urban settings without significant performance degradation is an ongoing challenge. Conducting extensive cross-validation and robustness testing to ensure models perform well under varying conditions is one way to address this challenge.
By addressing these challenges, the field of urban noise classification can advance towards more accurate, efficient, and deployable solutions, contributing to better noise management and improved urban living environments.

VII. RESULTS AND DISCUSSION

A. Introduction
The aim of this study is to evaluate the effectiveness of various machine learning algorithms in classifying urban noise. Specifically, we compared the performance of Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Long Short-Term Memory (LSTM), and Random Forest (RF) classifiers using the UrbanSound8K dataset. The models were assessed based on accuracy, precision, recall, and F1-score metrics.

B. Results Presentation
The performance metrics for each model are summarized in Table V. The DNN achieved the highest accuracy at 94.5%, followed by the CNN at 90%, RF at 87%, and LSTM at 79%.

TABLE V: Performance Metrics of Machine Learning Models
Model          Accuracy  Precision  Recall  F1-score
DNN             94.5%     95.1%     94.0%    94.6%
CNN             90%       91.0%     92.1%    91.5%
Random Forest   87%       86.5%     87.8%    87.1%
LSTM            79%       78.7%     79.1%    77.4%

1) Deep Neural Networks (DNN): The DNN model demonstrated superior performance with the highest accuracy (94.5%) and F1-score (94.6%). The high precision (95.1%) and recall (94.0%) values indicate the DNN’s effectiveness in correctly identifying and classifying urban noise types. The hierarchical learning capability of DNNs, which allows them to capture both low-level and high-level features, was a significant factor in this success.
2) Convolutional Neural Networks (CNN): The CNN model performed well, achieving a 90% accuracy. CNNs are particularly effective in extracting spatial features from audio spectrograms. The high precision (91.0%) and recall (92.1%) reflect the model’s robustness in handling complex audio data. The convolutional layers in CNNs detect local patterns and structures in the spectrograms, enhancing their ability to classify sounds accurately.
3) Random Forest (RF): The RF model achieved a commendable accuracy of 87%, highlighting its robustness and generalization capabilities. RF’s ensemble learning approach, which combines multiple decision trees, helps reduce overfitting and improve model stability. The balanced precision (86.5%) and recall (87.8%) values indicate the model’s reliability in urban noise classification.
4) Long Short-Term Memory Networks (LSTM): The LSTM model, designed to capture temporal dependencies in sequential data, achieved an accuracy of 79%. The precision (78.7%) and recall (79.1%) suggest that the current architecture and preprocessing techniques might need further optimization. Despite being well-suited for sequential data, the LSTM model’s performance indicates a need for better handling of long-term dependencies and potential improvements in hyperparameter tuning.

C. Comparative Analysis
Comparing the models, DNN and CNN outperformed RF and LSTM in terms of accuracy and F1-score. The DNN’s deep architecture allowed it to capture intricate patterns, while the CNN’s convolutional layers excelled at identifying essential features in spectrograms. The RF model demonstrated good generalization abilities, making it a practical choice for applications requiring interpretability and robustness. The LSTM model’s relatively lower performance suggests the need for optimization in capturing temporal patterns.

D. Implications and Interpretations
The results underscore the effectiveness of deep learning models, particularly DNN and CNN, in urban noise classification tasks. These models leverage hierarchical and spatial feature learning capabilities to achieve high classification accuracy. The findings align with existing literature, such as the work by Bubashait and Hewahi (2021), which highlights the efficacy of CNNs and DNNs in audio classification.
The high performance of DNN and CNN models suggests that these architectures are well-suited for urban noise classification, potentially aiding in the development of robust and scalable noise monitoring systems. The RF model’s balanced performance makes it a reliable choice for scenarios requiring model interpretability and robustness. Its ability to handle various types of data and provide feature importance insights is valuable for understanding the underlying data structure and making informed decisions.
The LSTM model, despite its lower performance, provides valuable insights into the importance of capturing temporal dependencies in audio data. Future work could explore hybrid
Fig. 17: Challenges Encountered

models combining CNNs and LSTMs to improve performance further.

In conclusion, this study demonstrates the potential of machine learning models, particularly deep learning architectures, in addressing the challenges of urban noise classification. The findings contribute to the ongoing efforts in developing effective and efficient noise monitoring systems, ultimately enhancing urban living environments.

VIII. CONCLUSION

The study's main objective was to classify urban noise using Random Forest, Long Short-Term Memory networks, Convolutional Neural Networks, and Deep Neural Networks. We aimed to build models that deliver high accuracy in classifying various forms of urban noise, using the UrbanSound8K dataset.

In terms of accuracy and F1-score, the DNN and CNN models outperformed the RF and LSTM models. The DNN model provided the best accuracy, 94.5%, demonstrating the ability of hierarchical learning to capture complex patterns in the audio data. The CNN model came next with an accuracy of 90%; its convolutional layers extracted the key spatial features from the audio spectrograms. With an accuracy of 87%, the RF model showed balanced performance, combining numerous decision trees to enhance generalization and reduce overfitting. The LSTM model, which was intended to capture temporal dependencies, reached an accuracy of 79%, suggesting that further model tuning is necessary for sequential audio data.

Even though the results appear promising, a number of difficulties were encountered, such as poor data quality, computational limitations, and the need for efficient feature extraction and selection techniques. Resolving these issues is essential to enhancing the efficacy and relevance of urban noise classification models.

Furthermore, scalability, integration with existing urban monitoring systems, and real-time processing pose additional challenges when deploying these models in practical settings. Further work is needed on lightweight models that can be deployed on edge devices while meeting the low-latency and high-throughput requirements of real-time applications.

The development of sophisticated urban noise monitoring systems will greatly benefit from the findings of this study. By addressing noise more effectively, advanced machine learning algorithms can help improve public health and the overall quality of life in cities. The strong performance of the DNN and CNN models highlights their promise in offering accurate and scalable solutions to the urban noise classification problem.

To sum up, this research provides insight into the use of machine learning in the classification of urban noise. The findings open the door to further developments in this area and emphasize how important it is to choose the right model for the demands of a given application. To tackle the increasing problems of urban noise pollution and improve the livability of urban areas, further research and development of these approaches is needed.

REFERENCES

[1] World Health Organization, Environmental Noise Guidelines for the European Region. WHO Regional Office for Europe, 2018. [Online]. Available: https://iris.who.int/bitstream/handle/10665/279952/9789289053563-eng.pdf?sequence=1
[2] C. Sarkar and C. Webster, “Urban environments and human health: current trends and future directions,” Current Opinion in Environmental Sustainability, vol. 25, pp. 33–44, 04 2017.
[3] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, “Tensor decomposition for signal processing and machine learning,” IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[4] M. McKinney, J. Breebaart, and P. (wy, “Features for audio and music classification,” 11 2003.
[5] V. Vijayakumar, S. Ummar, T. J. Varghese, and A. E. Shibu, “ECG noise classification using deep learning with feature extraction,” Signal, Image and Video Processing, vol. 16, no. 8, pp. 2287–2293, 2022.
[6] P. Patil, “Smart IoT based system for vehicle noise and pollution monitoring,” in 2017 International Conference on Trends in Electronics and Informatics (ICEI), 2017, pp. 322–326.
[7] A. O. Albaji, R. B. A. Rashid, S. Z. Abdul Hamid et al., “Investigation on machine learning approaches for environmental noise classifications,” Journal of Electrical and Computer Engineering, vol. 2023, 2023.
[8] N. H. Tandel, H. B. Prajapati, and V. K. Dabhi, “Voice recognition and voice comparison using machine learning techniques: A survey,” in 2020 6th International Conference on Advanced Computing and Communication Systems (ICACCS). IEEE, 2020, pp. 459–465.
[9] Y. H. Ali, R. A. Rashid, and S. Z. A. Hamid, “A machine learning for environmental noise classification in smart cities,” Indonesian Journal of Electrical Engineering and Computer Science, vol. 25, no. 3, pp. 1777–1786, 2022.
[10] S. Boonprong, C. Cao, W. Chen, X. Ni, M. Xu, and B. K. Acharya, “The classification of noise-afflicted remotely sensed data using three machine-learning techniques: effect of different levels and types of noise on accuracy,” ISPRS International Journal of Geo-Information, vol. 7, no. 7, p. 274, 2018.
[11] J. Renaud, R. Karam, M. Salomon, and R. Couturier, “Deep learning and gradient boosting for urban environmental noise monitoring in smart cities,” Expert Systems with Applications, vol. 218, p. 119568, 2023. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S0957417423000696
[12] S. Gupta and A. Gupta, “Dealing with noise problem in machine learning data-sets: A systematic review,” Procedia Computer Science, vol. 161, pp. 466–474, 2019.
[13] M. Bubashait and N. Hewahi, “Urban sound classification using DNN, CNN & LSTM a comparative approach,” in 2021 International Conference on Innovation and Intelligence for Informatics, Computing, and Technologies (3ICT). IEEE, 2021, pp. 46–50.
[14] G. Zambon, H. E. Roman, M. Smiraglia, and R. Benocci, “Monitoring and prediction of traffic noise in large urban areas,” Applied Sciences, vol. 8, no. 2, 2018. [Online]. Available: https://www.mdpi.com/2076-3417/8/2/251
[15] J. Salamon and J. Bello, “Deep convolutional neural networks and data augmentation for environmental sound classification,” IEEE Signal Processing Letters, vol. PP, 01 2017.
[16] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, A. M. Umar, O. U. Linus, H. Arshad, A. A. Kazaure, U. Gana, and M. U. Kiru, “Comprehensive review of artificial neural network applications to pattern recognition,” IEEE Access, vol. 7, pp. 158820–158846, 2019.
[17] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller, “Explaining deep neural networks and beyond: A review of methods and applications,” Proceedings of the IEEE, vol. 109, no. 3, pp. 247–278, 2021.
[18] R. M. Cichy and D. Kaiser, “Deep neural networks as scientific models,” Trends in Cognitive Sciences, vol. 23, no. 4, pp. 305–317, 2019.
[19] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation as a defense to adversarial perturbations against deep neural networks,” 11 2015.
[20] U. Prakruthi, D. Kiran, and H. R., “High performance neural network based acoustic scene classification,” 01 2018, pp. 781–784.
[21] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of deep neural networks: A tutorial and survey,” Proceedings of the IEEE, vol. 105, no. 12, pp. 2295–2329, 2017.
[22] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, “Classifying environmental sounds using image recognition networks,” Procedia Computer Science, vol. 112, pp. 2048–2056, 2017, Knowledge-Based and Intelligent Information Engineering Systems: Proceedings of the 21st International Conference, KES-2017, 6-8 September 2017, Marseille, France. [Online]. Available: https://www.sciencedirect.com/science/article/pii/S1877050917316599
[23] N. H. Agjee, O. Mutanga, K. Peerbhay, R. Ismail et al., “The impact of simulated spectral noise on random forest and oblique random forest classification performance,” Journal of Spectroscopy, vol. 2018, 2018.
[24] S. J. Rigatti, “Random forest,” Journal of Insurance Medicine, vol. 47, no. 1, pp. 31–39, 2017.
[25] G. Louppe, “Understanding random forests: From theory to practice,” Ph.D. dissertation, 10 2014.
[26] G. Biau and E. Scornet, “A random forest guided tour,” Test, vol. 25, pp. 197–227, 2016.
[27] R. DiPietro and G. D. Hager, “Deep learning: RNNs and LSTM,” in Handbook of Medical Image Computing and Computer Assisted Intervention. Elsevier, 2020, pp. 503–519.
[28] K. Kawakami, “Supervised sequence labelling with recurrent neural networks,” Ph.D. dissertation, Technical University of Munich, 2008.
[29] O. Surakhi, M. A. Zaidan, P. L. Fung, N. Hossein Motlagh, S. Serhan, M. Alkhanafseh, R. Ghoniem, and T. Hussein, “Time-lag selection for time-series forecasting using neural network and heuristic algorithm,” Electronics, vol. 10, 10 2021.
[30] G. Van Houdt, C. Mosquera, and G. Nápoles, “A review on the long short-term memory model,” Artificial Intelligence Review, vol. 53, 12 2020.
[31] I. Lezhenin, N. Bogach, and E. Pyshkin, “Urban sound classification using long short-term memory neural network,” in 2019 Federated Conference on Computer Science and Information Systems (FedCSIS). IEEE, 2019, pp. 57–60.
[32] J. Lu, L. Tan, and H. Jiang, “Review on convolutional neural network (CNN) applied to plant leaf disease classification,” Agriculture, vol. 11, no. 8, p. 707, 2021.
[33] X. Kang, B. Song, and F. Sun, “A deep similarity metric method based on incomplete data for traffic anomaly detection in IoT,” Applied Sciences, vol. 9, p. 135, 01 2019.
[34] G. Song, X. Guo, W. Wang, Q. Ren, J. Li, and L. Ma, “A machine learning-based underwater noise classification method,” Applied Acoustics, vol. 184, p. 108333, 2021.
[35] Z. Mushtaq and S.-F. Su, “Efficient classification of environmental sounds through multiple features aggregation and data enhancement techniques for spectrogram images,” Symmetry, vol. 12, 11 2020.
[36] G. Algan and I. Ulusoy, “Image classification with deep learning in the presence of noisy labels: A survey,” Knowledge-Based Systems, vol. 215, p. 106771, 2021.
[37] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very deep convolutional neural networks for raw waveforms,” 10 2016.
[38] Kaggle, “Kaggle: UrbanSound8K,” accessed: 2024-05-24. [Online]. Available: https://www.kaggle.com/datasets/chrisfilo/urbansound8k
[39] Freesound, “Freesound: Collaborative Database of Creative Commons Licensed Sounds,” 2024, accessed: 2024-05-24. [Online]. Available: https://freesound.org/
[40] M. A. S. M. M. A. A. Sanjoy Barua, Tahmina Akter, “A deep learning approach for urban sound classification,” International Journal of Computer Applications, vol. 185, no. 24, pp. 8–14, Jul 2023. [Online]. Available: https://ijcaonline.org/archives/volume185/number24/32838-2023922991/
[41] P. Raguraman, R. Mohan, and M. Vijayan, “Librosa based assessment tool for music information retrieval systems,” 03 2019, pp. 109–114.
[42] L. Morse, L. Cartabia, and V. Mallardo, “Reliability-based bottom-up manufacturing cost optimisation for composite aircraft structures,” Structural and Multidisciplinary Optimization, vol. 65, 05 2022.
[43] E. Fonseca, A. Ferraro, and X. Serra, “Improving sound event classification by increasing shift invariance in convolutional neural networks,” 07 2021.
[44] J. Sharma, O.-C. Granmo, and M. Goodwin, “Environment sound classification using multiple feature channels and attention based deep convolutional neural network,” 10 2020, pp. 1186–1190.
[45] E. Tsalera, A. Papadakis, and M. Samarakou, “Monitoring, profiling and classification of urban environmental noise using sound characteristics and the kNN algorithm,” Energy Reports, vol. 6, pp. 223–230, 2020.
[46] S. Kim, B. Yoon, J.-T. Lim, and M. Kim, “Data-driven signal–noise classification for microseismic data using machine learning,” Energies, vol. 14, no. 5, p. 1499, 2021.
[47] Y. Alsouda, S. Pllana, and A. Kurti, “A machine learning driven IoT solution for noise classification in smart cities,” arXiv preprint arXiv:1809.00238, 2018.
[48] B. Mishachandar and S. Vairamuthu, “Diverse ocean noise classification using deep learning,” Applied Acoustics, vol. 181, p. 108141, 2021.

LIST OF ABBREVIATIONS
Abbreviation  Full Form
AI            Artificial Intelligence
ML            Machine Learning
DL            Deep Learning
RF            Random Forest
DNN           Deep Neural Network
CNN           Convolutional Neural Network
LSTM          Long Short-Term Memory
MFCC          Mel-Frequency Cepstral Coefficient
SED           Sound Event Detection
WHO           World Health Organization
IoT           Internet of Things

TABLE VI: List of Abbreviations
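
As a supplementary illustration of the evaluation metrics reported in Table V, the sketch below shows one common way to derive accuracy and macro-averaged precision, recall, and F1-score from a multi-class confusion matrix. The study does not state which averaging scheme was used, so macro averaging is an assumption here; the function name `macro_metrics` and the 3-class matrix are hypothetical and are not taken from the study's data.

```python
from typing import Dict, List


def macro_metrics(cm: List[List[int]]) -> Dict[str, float]:
    """Accuracy and macro-averaged precision, recall, and F1-score.

    cm[i][j] counts the clips whose true class is i and predicted class is j.
    Macro averaging (an assumption, not confirmed by the study) computes each
    metric per class and then takes the unweighted mean over classes.
    """
    n = len(cm)
    total = sum(sum(row) for row in cm)
    correct = sum(cm[k][k] for k in range(n))

    precisions, recalls, f1s = [], [], []
    for k in range(n):
        tp = cm[k][k]
        predicted_k = sum(cm[i][k] for i in range(n))  # column sum: TP + FP
        actual_k = sum(cm[k])                          # row sum: TP + FN
        p = tp / predicted_k if predicted_k else 0.0
        r = tp / actual_k if actual_k else 0.0
        f1 = 2 * p * r / (p + r) if (p + r) else 0.0
        precisions.append(p)
        recalls.append(r)
        f1s.append(f1)

    return {
        "accuracy": correct / total,
        "precision": sum(precisions) / n,
        "recall": sum(recalls) / n,
        "f1": sum(f1s) / n,
    }


# Hypothetical 3-class confusion matrix (illustrative only, not study data).
example_cm = [
    [8, 2, 0],
    [1, 6, 3],
    [0, 1, 9],
]
metrics = macro_metrics(example_cm)
for name, value in metrics.items():
    print(f"{name}: {value:.4f}")
```

For this made-up matrix, the macro F1-score (about 0.764) falls below both the macro precision (about 0.769) and the macro recall (about 0.767). Under macro averaging, the aggregate F1-score is the mean of the per-class F1-scores, not the harmonic mean of the aggregate precision and recall, so the three values need not be mutually consistent in the way they are for a single class.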
