Urban Noise Classification Using Machine Learning Techniques: Comparative Analysis and Future Directions
Abstract—This research paper investigates the effectiveness of various machine learning models for the classification of urban noise, focusing on Convolutional Neural Networks (CNN), Deep Neural Networks (DNN), Long Short-Term Memory networks (LSTM), and Random Forest (RF). Utilizing the UrbanSound8K dataset, the study aims to determine which model offers the highest accuracy and performance for categorizing urban sounds. The results reveal that the DNN model achieved the highest accuracy at 94.5%, followed by CNN at 90%, RF at 87%, and LSTM at 79%. The DNN's superior performance is attributed to its deep hierarchical learning capabilities, while the CNN excels at spatial feature extraction from spectrograms. The RF model demonstrated robust generalization capabilities, and the LSTM model highlighted the need for further optimization in capturing temporal dependencies. The paper discusses the challenges faced, including data quality, computational limitations, and the need for efficient feature extraction, and suggests future research directions. These include advancing automated sound event detection, optimizing feature selection, exploring hybrid neural network architectures, and deploying models on edge devices. The findings emphasize the potential of deep learning models in enhancing urban noise monitoring systems and improving urban living conditions.

I. INTRODUCTION

Urban noise has grown to be a major problem for both policymakers and city inhabitants as an inevitable outcome of urbanization and industry. Noise pollution, also known as the "unseen pollutant," has an adverse effect on people's well-being, health, and quality of life. One of the biggest environmental risks to public health, according to the World Health Organization (WHO), is noise pollution. [1] Extended exposure to elevated urban noise levels can result in a range of detrimental health consequences, such as hearing impairment, heart-related disorders, sleep disruptions, and psychological problems like tension and anxiety. Therefore, establishing healthier and more pleasant urban environments depends critically on comprehending and reducing urban noise pollution.

Machine learning (ML) algorithms have become extremely effective at evaluating and understanding vast amounts of data in a variety of fields, including environmental monitoring. By automating the noise detection, classification, and analysis processes, ML algorithms provide a number of advantages over classical methods when it comes to urban noise categorization. These algorithms are capable of handling a wide range of complicated information, picking up knowledge from past data and gradually becoming more efficient. [2] It is feasible to create reliable and scalable solutions for real-time urban noise monitoring and classification by utilizing cutting-edge machine learning algorithms.

When classifying urban noise using machine learning techniques, signal processing is essential. Signal processing techniques are used to preprocess and convert unprocessed audio signals into useful features that may be applied to classification. [3] These features frequently consist of the audio signals' time-domain and frequency-domain properties, such as amplitude, power spectral density, and Mel-frequency cepstral coefficients (MFCCs). Wavelet transforms and short-time Fourier transforms are examples of advanced signal processing techniques that can further improve the relevance and quality of the retrieved features, increasing the accuracy of noise classification models. [4]

Furthermore, the automation and scalability of ML models enable their widespread deployment throughout vast metropolitan regions with little to no human involvement. Noise monitoring becomes more effective and economical as a result of the decreased dependence on manual data collection and processing. Moreover, ML-based noise classification systems can encourage community involvement and support for noise reduction activities by increasing public awareness of noise pollution and its impacts through the provision of transparent and easily accessible noise data. [5]

Classifying urban noise using machine learning techniques is still a relatively new topic with a lot of room for development. The accuracy and resilience of noise classification models can be strengthened in the future, ML can be integrated with other technologies like edge computing and the Internet of Things (IoT), and new use cases like noise prediction and simulation can be investigated. [6] Furthermore, the efficiency and influence of ML-driven noise control solutions can be increased by interdisciplinary partnerships between computer scientists, urban planners, public health specialists, and legislators.

The following sections of this report take the reader through an in-depth background and related work explaining the state of the art; the implemented algorithms, detailing the dataset, algorithms, code, system settings, and key performance indicator methodologies used; followed by the results and a discussion of the challenges we faced during the project. Lastly, we outline future work aimed at further growing the field of urban noise classification.

II. BACKGROUND AND RELATED WORK
Albaji et al. (2023) conducted a study on noise pollution mapping in urban areas using machine learning algorithms. [7] They implemented different types of algorithms to map and classify noises in voices, aiming to provide a comprehensive assessment of urban noise pollution. [8] They applied machine learning to predict noise pollution patterns from data they collected themselves in different urban areas. The study demonstrates the effectiveness of machine learning algorithms in noise classification.

Ali, Rashid, and Hamid (2022) studied machine learning algorithms to classify environmental noise in smart cities. [9] They implemented a system that collects data, processes it using feature extraction methods, and applies different machine learning algorithms to classify different types of urban noise. This paper highlights the possibilities of managing and mitigating noise pollution in smart cities using machine learning. [10]

A study by Renaud et al. (2023) explored making long-term predictions of noise levels based on data collected in an English city. They implemented several deep learning models (Transformer, TFT, CNN-LSTM, LSTM) and Gradient Boosting algorithms and obtained long-term and short-term predictions. This paper also proposes an approach for detecting noise level anomalies based on predictions. [11]

The UrbanSound8K dataset, which is also used in this paper, was utilized by Bubashait and Hewahi (2021) to compare the performance of Deep Neural Networks (DNNs), Convolutional Neural Networks (CNNs), and Long Short-Term Memory (LSTM) networks. This collection of recordings of urban sounds provides a standard against which to compare different machine learning models. [12] They found that while DNNs are very good at handling structured data, they have trouble with temporal dependencies; CNNs are very good at extracting spatial features from audio spectrograms; and LSTMs are very good at capturing the temporal patterns within sequential data. The thorough comparison revealed that although CNNs were generally better at learning hierarchical features, LSTMs were also able to achieve competitive results by using their ability to represent long-term dependencies in audio signals. The significance of choosing suitable neural network designs depending on the particulars of the dataset and the type of classification task is highlighted by this comparative analysis. [13]

Zambon et al. (2018) explored methods for monitoring and predicting traffic noise in large urban areas. [14] They developed models to analyze noise data, providing insights into current noise levels and predicting future trends. This research supports urban noise management by enabling more informed decision-making and effective mitigation strategies. The study demonstrates the importance of integrating monitoring systems with predictive analytics for improved urban noise control. [15]

III. IMPLEMENTED ALGORITHMS

A. Deep Neural Networks

One of the most fascinating innovations in artificial intelligence is the neural network, which draws inspiration from the human brain. [16] Deep neural networks (DNNs) have shown amazing performance in a range of complex tasks, which has transformed numerous fields, including natural language processing, speech recognition, and computer vision. Through numerous layers of nonlinear transformations, DNNs can automatically learn hierarchical representations of data, which is what gives them their power. This feature makes it possible for DNNs to effectively describe complex patterns and dependencies in data that are difficult for traditional machine learning algorithms to capture. [17]

As seen in Fig. 1, three layers make up a standard DNN: an input layer, several hidden layers, and an output layer. [18] The neurons that make up each layer are linked to neurons in the layers above and below. Weights are associated with the connections between neurons and are acquired during training. In order to reduce the error between the target values and the predictions made by the network, these weights must be adjusted during the learning process. The backpropagation algorithm is usually used for this, computing the gradient of the loss function with respect to each weight and updating the weights based on the result.

Fig. 1: Overview of a DNN architecture: This architecture, suitable for classification tasks thanks to its softmax output layer, is used throughout the paper along with its notations. [19]
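As a brief illustration of the update step described above (a standard formulation, not taken from the paper), each weight is moved against the gradient of the loss, scaled by a learning rate eta:

```latex
w_{ij} \leftarrow w_{ij} - \eta \, \frac{\partial L}{\partial w_{ij}}
```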
The capacity of DNNs to learn hierarchical representations is one of their main advantages. [17] Generally, a DNN's upper layers catch more abstract properties like objects or shapes, while its bottom layers record low-level features like edges in an image. DNNs may construct complex features from smaller ones through this hierarchical learning process, which is essential for jobs requiring high-dimensional data interpretation. [20]

Practical applications of DNNs are significantly impacted by how well they process data. Efficient processing makes it possible to implement complicated models on devices with limited resources, like edge computing platforms, IoT devices, and smartphones. For real-time applications where latency and power consumption are important considerations, such as autonomous driving, this is vital. [21] Furthermore, by lowering operating costs and energy consumption, effective processing methods enable the use of DNNs in large-scale applications, like cloud-based services and data centers. [22]

B. Random Forest

Random Forest is a classification approach that creates an ensemble using many univariate classification trees as a complicated composite classifier. [23] The ensemble learning technique known as Random Forest has become very popular because of its precision, dependability, and user-friendliness. Leo Breiman first presented it in 2001, and since then, it has grown to be one of the most effective and adaptable machine learning techniques. [24] Decision trees are basic, understandable models that divide the feature space into discrete areas according to the characteristics of the input data. The Random Forest algorithm expands on this idea. Decision trees can be highly variable and prone to overfitting, but Random Forest uses ensemble learning to reduce these problems.

Essentially, a Random Forest is made up of several decision trees that were built using various subsets of the training set, as seen in Fig. 2. In order to guarantee that every tree is trained on a distinct set of observations, this procedure, referred to as bootstrapping, entails sampling the data with replacement. Additionally, only a random subset of features is taken into account when splitting nodes during the building of each tree. By adding another layer of randomness, this improves the model's resilience and capacity for generalization by decorrelating the trees. [25]

A metric of feature relevance can be obtained by examining the decrease in impurity (such as entropy or Gini impurity) that each feature contributes across all trees. This is very helpful for performing feature selection and comprehending the underlying data structure.

C. Long Short-Term Memory (LSTM)

A kind of recurrent neural network (RNN), Long Short-Term Memory (LSTM) networks have emerged as a keystone of deep learning, especially for tasks involving sequential input. In order to overcome the drawbacks of conventional RNNs, particularly the vanishing gradient issue, Hochreiter and Schmidhuber invented LSTMs in 1997. [27] RNNs have difficulty learning long-term dependencies because of this issue, which arises when gradients employed in training shrink exponentially as they propagate back over time.

Fig. 3 depicts the general architecture, and the setup specifics of the LSTM hyperparameters are presented below. Effective maintenance and updating of long-term dependencies is made possible by the distinct cell structure incorporated into LSTM networks. Input, forget, and output gates are the three main gates found in each LSTM cell. [28] By regulating the information flow into, through, and out of the cell, these gates help the network hold onto useful data for longer periods of time while removing unnecessary data.
D. Code

1) Deep Neural Network (DNN): Deep Neural Networks' capacity to represent intricate patterns and relationships in data has allowed them to demonstrate outstanding performance in a number of domains, including picture and sound recognition. [40] Data preprocessing, model construction, training, and evaluation are the crucial stages through which a DNN was used to classify urban sounds in the UrbanSound8K dataset.

a) Data Preprocessing: The UrbanSound8K dataset, which includes audio clips from ten different classes, required features to be extracted in the first place. Mel-spectrogram coefficients were calculated in order to extract significant features from these audio files; these coefficients capture the spectral content of each clip.

Fig. 9: A deep neural network (DNN) composed of an input layer of 3 nodes, 3 hidden layers of 5 nodes each, and an output layer of 1 node [42]

• Hidden Layers: Several dense, or fully connected, layers with the ReLU activation function were employed. The non-linearity that ReLU offers enables the model to map
more intricate patterns. To prevent overfitting, a dropout
layer was added after every dense layer. In each training
step dropout layers arbitrarily deactivate a portion of
the neurons, forcing the network to acquire more robust
features.
• Output Layer: Ten neurons, one for each of the ten
classes, make up the output layer which has a softmax
activation function. This makes the output applicable
to multi-class classification tasks by enabling it to be
converted into a probability distribution over the ten
classes.
f) Training: The model was compiled with the Adam optimizer, a tried-and-true, effective, and flexible technique for training deep learning models. Sparse categorical cross-entropy, a suitable loss function for multi-class classification tasks, was used. A validation set comprising a portion of the training data was set aside to track the model's performance on unobserved data. The model was trained with a batch size of 32 over 50 epochs.
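The following is a minimal sketch, using the Keras API of the TensorFlow version listed under System Settings, of a DNN with the settings described above (ReLU hidden layers each followed by dropout, a ten-unit softmax output, the Adam optimizer, sparse categorical cross-entropy, a validation split, batch size 32, 50 epochs). The hidden-layer sizes, dropout rate, and feature dimensionality are illustrative assumptions rather than the study's exact values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in arrays; in practice these come from the feature-extraction step
# (one fixed-size feature vector per clip and one integer label per clip).
X_train = np.random.rand(100, 40).astype("float32")
y_train = np.random.randint(0, 10, size=100)

model = models.Sequential([
    layers.Input(shape=(X_train.shape[1],)),
    layers.Dense(256, activation="relu"),    # hidden-layer sizes are assumptions
    layers.Dropout(0.3),                     # dropout after every dense layer
    layers.Dense(128, activation="relu"),
    layers.Dropout(0.3),
    layers.Dense(10, activation="softmax"),  # one neuron per UrbanSound8K class
])

model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])

# A portion of the training data is held out as a validation split.
model.fit(X_train, y_train, validation_split=0.2, batch_size=32, epochs=50)
```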
Fig. 10: DNN Confusion Matrix

g) Evaluation: The test set is used to evaluate the model after it has been trained. A few metrics are used to assess the
model’s performance:
• Accuracy: The primary metric, indicating the proportion of correctly identified samples. The DNN reached 94.5% on the test set, a respectable accuracy.
• Precision, Recall, and F1-Score: These metrics offer a thorough assessment of the model's effectiveness across the various classes. The percentage of true positive predictions among all positive predictions is known as precision. Recall is the percentage of real positives that were correctly predicted. The F1-score is the harmonic mean of precision and recall.
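A brief sketch of how these metrics and the confusion matrix of Fig. 10 can be computed with scikit-learn; the variables are placeholders for the test labels and the model's predictions.

```python
import numpy as np
from sklearn.metrics import classification_report, confusion_matrix

# Stand-ins; in practice y_pred comes from the trained model on the test set.
y_test = np.random.randint(0, 10, size=50)
y_pred = np.random.randint(0, 10, size=50)

# Per-class precision, recall, and F1-score plus overall accuracy.
print(classification_report(y_test, y_pred))

# Rows are true classes, columns are predicted classes.
print(confusion_matrix(y_test, y_pred))
```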
2) Random Forest (RF): For the Random Forest classifier, the following audio features were extracted from each clip:

• Mel-frequency Cepstral Coefficients (MFCCs): The description of the timbre texture is one of the primary applications of the MFCCs in speech and audio processing, which are used to record the power spectrum of the audio input.

• Chroma Features: These characteristics are helpful in capturing the harmonic content and the energy distribution among the twelve pitch classes.

• Spectral Contrast: This feature assists in differentiating between harmonic and noisy sounds by calculating the amplitude difference between peaks and troughs in the sound spectrum.

To lower the dataset's dimensionality and provide a fixed-size feature vector for every audio recording, the extracted features are averaged over time. By removing less significant temporal irregularities, this averaging aids in preserving the main qualities of the audio input.
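A minimal sketch of this feature-extraction step using librosa (one of the libraries listed under System Settings); the number of MFCCs and the file path are illustrative assumptions.

```python
import librosa
import numpy as np

def extract_features(path, n_mfcc=40):
    """Load a clip and return one fixed-size feature vector averaged over time."""
    y, sr = librosa.load(path)                                # decode and resample
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)    # timbre / power spectrum
    chroma = librosa.feature.chroma_stft(y=y, sr=sr)          # twelve pitch classes
    contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # peaks vs. troughs
    # Average each feature matrix over the time axis and concatenate.
    return np.concatenate([mfcc.mean(axis=1),
                           chroma.mean(axis=1),
                           contrast.mean(axis=1)])

features = extract_features("path/to/clip.wav")
print(features.shape)
```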
b) Data Splitting and Label Encoding: Once features are extracted, the dataset is split into training and testing groups. In order to get a nearly identical class balance in each subset, this is often done in a stratified manner, which is crucial for training a balanced model. Typically, 80% of the data is used for training and 20% for testing.

Next, the class of each audio file is mapped onto a numeric value using label encoding. Machine learning models require numerical input, which makes this step vital. The Random Forest classifier then uses the encoded labels to analyze and process the categorical input.
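A short sketch of the stratified split and label encoding with scikit-learn, under the 80/20 split described above; the stand-in data is only for illustration.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder

# Stand-ins; in practice X holds one feature vector per clip and labels holds class names.
X = np.random.rand(200, 40)
labels = np.random.choice(["dog_bark", "siren", "drilling"], size=200)

# Map class names to integers, since the classifiers require numerical targets.
encoder = LabelEncoder()
y = encoder.fit_transform(labels)

# 80/20 split, stratified so each subset keeps a similar class balance.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42)
```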
c) Building and Training the Random Forest Model: Using a large number of decision trees, the Random Forest ensemble learning technique aims to increase resilience and accuracy. The steps involved in implementing Random Forest are listed below; a brief code sketch follows the list.

• Initialization: The Random Forest model is initialized with a specified number of trees (n_estimators). In this study, 100 trees were used, which is a common choice that balances performance and computational cost.

• Training: Next, the training set is fitted to the model. A random subset of characteristics is considered at each node in the forest when constructing each tree by taking a bootstrap sample of the data. The model's predictive power is enhanced and overfitting is reduced by this randomness.

• Aggregation: For classification tasks, majority voting is used to aggregate the individual tree forecasts. Collectively, these trees' strengths are maximized while their flaws are mitigated.
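Below is the code sketch referred to above, assuming scikit-learn and feature vectors with encoded labels as produced in the preceding preprocessing steps; apart from the 100 trees, the settings shown are scikit-learn defaults rather than the study's exact configuration.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score, classification_report
from sklearn.model_selection import train_test_split

# Stand-in data; in practice the stratified split from the preprocessing step is reused.
X = np.random.rand(200, 40)
y = np.random.randint(0, 10, size=200)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Initialization: 100 trees, as described above.
rf = RandomForestClassifier(n_estimators=100, random_state=42)

# Training: each tree is fitted on a bootstrap sample with random feature subsets.
rf.fit(X_train, y_train)

# Aggregation happens inside predict(): majority voting over the trees.
y_pred = rf.predict(X_test)
print(accuracy_score(y_test, y_pred))
print(classification_report(y_test, y_pred))

# Impurity-based feature relevance, as discussed for Random Forest in Section III.
print(rf.feature_importances_)
```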
d) Model Evaluation: The performance of the trained Random Forest model is evaluated on the test set using several metrics:

• Precision, Recall, and F1-Score: These metrics give a more thorough assessment of the model's performance in each of the classes. The F1 score is the harmonic mean of precision and recall. Precision may be used to determine what percentage of the positive predictions were really positive. Recall can be used to determine what percentage of the actual positives were predicted correctly.

• Confusion Matrix: A confusion matrix is made to show how well the model performs in various classes. It makes clear which forecasts are accurate and which are inaccurate for every class, making it possible to pinpoint the precise areas in the model that require improvement.

e) Summary: In order to guarantee the accuracy and dependability of the model, there are several crucial phases involved in implementing the Random Forest classifier for urban noise categorization. The loading and preparation of the UrbanSound8K dataset is the first step in the process, after which significant audio characteristics are extracted. After the labels are encoded into numerical values, the data is divided into training and testing subsets.

Multiple decision trees are joined to increase performance during the building and training of the Random Forest model, which is done using an ensemble technique. The trained model's accuracy in classifying urban sounds is comprehensively assessed through the use of many criteria.

This thorough implementation emphasizes the significance of every stage involved and shows how well the Random Forest classifier handles the challenging urban noise categorization task. The outcomes highlight the model's dependability and possible uses in real-world urban sound monitoring and categorization systems.

Class Precision Recall F1-Score Support
Class 0 0.86 0.94 0.90 203
Class 1 1.00 0.69 0.81 86
Class 2 0.72 0.80 0.76 183
Class 3 0.89 0.87 0.88 201
Class 4 0.86 0.86 0.86 206
Class 5 0.95 0.97 0.96 193
Class 6 0.89 0.76 0.81 72
Class 7 0.93 0.91 0.92 208
Class 8 0.89 0.93 0.91 205
Class 9 0.80 0.76 0.78 230
Accuracy 0.87 1747
Macro avg 0.85 0.84 0.85 1747
Weighted avg 0.87 0.87 0.87 1747
TABLE III: Random Forest Classification Report

3) Long Short Term Memory (LSTM): The Long Short-Term Memory network's implementation details for the task of identifying urban sounds using the UrbanSound8K dataset are covered in this section. Because it can capture the temporal patterns and long-term relationships present in audio signals, the LSTM network is a good fit for this purpose.
• Output Layer: Ten units, or the number of sound classes,
with a softmax activation function are included in this
layer. Because the softmax function offers probabilities
across classes, it may be used to solve multi-class clas-
sification issues.
The following setup is used to compile and train the model:
e) Loss Function: Categorical cross-entropy serves as the
loss function and is useful in multi-class classification. This
function guides the optimization process by measuring the
discrepancy between the true class labels and the anticipated
probability.
f) Optimizer: Since the Adam optimizer is effective and
well-suited to managing sparse gradients, we have chosen it.
The estimated first and second moments of the gradient are
used to modify the learning rate.
g) Metrics: The main statistic used to assess the model’s
performance during training and testing is accuracy. The per-
centage of properly identified samples relative to all samples
is known as accuracy.

Fig. 12: Random Forest Confusion Matrix
h) Training Process: The model is trained with a batch
size of 32 across many epochs. In order to improve training
efficiency and avoid overfitting, countermeasures such as early stopping and learning rate reduction on plateau are used. While learning rate reduction lowers the learning rate when the validation loss reaches a plateau, early stopping tracks the validation loss and ends training when it stops getting better.

The performance of the trained LSTM model is assessed using the test set. Various important measures are employed:

i) Accuracy: The degree to which the model can accurately categorize the urban sound snippets is shown by the model's overall accuracy on the test set.

The sequential structure of the audio data makes the LSTM model well-suited to learn from it. An LSTM model's architecture may include one or more of the following elements:

a) Input Layer: The audio's preprocessed features enter the network at the input layer. The features—that is, samples, time steps, and features—are reshaped into three dimensions, since LSTMs require three-dimensional input; reshaping ensures the model can handle every feature in incremental steps.

b) LSTM Layers: The temporal dependencies of the
audio data are captured by the LSTM layers that make up
the core. The following are included in the model:
• First LSTM Layer: In addition to having 128 units, the
layer is configured to return sequences. Sequences that
are returned enable the next LSTM layer to process the
whole data sequence.
• Second LSTM Layer: The last output in the series
is what this 64-unit layer is supposed to return. This
arrangement aids in dimensionality reduction while pre-
serving temporal information discovered in the layer
before.
c) Dropout Layers: To prevent overfitting, a dropout
layer is inserted after every LSTM layer. Randomly set to
zero is a portion of the input units during training. One of the
regularizations used to make sure the model can more broadly
apply to unknown data is to prevent it from being overly reliant
on a particular neuron.
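A minimal sketch of an LSTM classifier of the kind described in this subsection (three-dimensional input, a 128-unit LSTM layer returning sequences, a 64-unit LSTM layer, dropout after each, dense layers, and a ten-unit softmax output, trained with categorical cross-entropy, Adam, early stopping, and learning-rate reduction). The input shape, dropout rates, dense-layer size, and epoch count are assumptions, not the study's exact values.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in input: 100 clips, each reshaped to 40 time steps with 1 feature per step.
X_train = np.random.rand(100, 40, 1).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 10, 100), num_classes=10)

model = models.Sequential([
    layers.Input(shape=(40, 1)),
    layers.LSTM(128, return_sequences=True),  # first LSTM layer returns full sequences
    layers.Dropout(0.3),
    layers.LSTM(64),                           # second LSTM layer returns the last output
    layers.Dropout(0.3),
    layers.Dense(32, activation="relu"),
    layers.Dense(10, activation="softmax"),    # one unit per sound class
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(patience=3),
]
model.fit(X_train, y_train, validation_split=0.2,
          batch_size=32, epochs=50, callbacks=callbacks)
```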
d) Fully Connected Layers: The dense layers, which are fully connected, appear after the LSTM layers. These layers aid in the categorization step and in the subsequent processing of the learnt information.

• First Dense Layer: There are thirty-two units with ReLU activation functions in this layer. ReLU gives the model non-linearity, which enables it to learn intricate patterns.

Fig. 13: LSTM Graph

j) Confusion Matrix: To give a thorough analysis of the model's performance across the various sound classes, a confusion matrix is created. The number of accurate and inaccurate predictions for each class is displayed in this matrix, providing information about the model's advantages and disadvantages.

• First Convolutional Layer: This layer consists of 32 filters with a kernel size of 3x3 and a ReLU activation function. It captures low-level features such as edges.
• Pooling Layer: A max-pooling layer with a pool size of
2x2 is applied to reduce the dimensionality and retain the
most important features.
• Second Convolutional Layer: This layer consists of 64
filters with a kernel size of 3x3 and a ReLU activation
function. It captures more complex patterns.
• Pooling Layer: Another max-pooling layer with a pool
size of 2x2 is applied to further reduce dimensionality.
c) Dropout Layers: Dropout layers are incorporated after
each pooling layer to prevent overfitting by randomly setting
a fraction of input units to zero during training. This regular-
ization technique helps in ensuring that the model does not
become too dependent on specific neurons and can generalize
better to unseen data.
d) Fully Connected Layers: Following the convolutional
layers, the model includes dense (fully connected) layers.
These layers further process the learned features and aid in
classification.
• First Dense Layer: This layer consists of 128 units with
a Rectified Linear Unit (ReLU) activation function. ReLU
introduces non-linearity into the model enabling it to
learn complex patterns.

• Output Layer: This layer consists of 10 units, corresponding to the ten sound classes, with a softmax activation function. The softmax function outputs probability distributions over the classes, making it suitable for multi-class classification tasks.

Fig. 14: LSTM Confusion Matrix

k) Precision, Recall, and F1-Score: For every class, these metrics are computed to evaluate the model's precision, recall, and balance (F1-score) between capturing all relevant examples and accurately identifying positive occurrences.

The capacity of the LSTM model to recognize and learn from the temporal patterns in audio data is demonstrated by its application to the categorization of urban sounds. The architecture of the model takes advantage of the sequential nature of the input characteristics by arranging numerous LSTM layers in front of dense layers. The model delivers effective classification performance through the use of strong training and assessment methodologies, as demonstrated by the confusion matrix and other evaluation metrics as well as the model's accuracy and extensive analysis. The capabilities of LSTM networks in audio identification tasks, especially in intricate and dynamic metropolitan situations, are demonstrated by this application.

The model is compiled and trained using the following configurations:

e) Loss Function: Categorical cross-entropy is used as the loss function, appropriate for multi-class classification tasks. This function measures the difference between the true class labels and the predicted probabilities, guiding the optimization process.

f) Optimizer: The Adam optimizer is selected for its efficiency and capability to handle sparse gradients. Adam adjusts the learning rate dynamically based on the first and second moments of the gradient, ensuring faster convergence.

g) Metrics: Accuracy is used as the primary metric to evaluate the model's performance during training and testing. Accuracy measures the proportion of correctly classified samples out of the total samples.
4) Convolutional Neural Networks (CNN): The spectrograms of the audio data's spatial hierarchy are a valuable source of learning information for the CNN model. [43] The CNN model's architecture consists of the following elements:

a) Input Layer: The input layer receives the preprocessed spectrogram features. Given that CNNs require three-dimensional input (samples, height, width, and channels), the spectrogram features are reshaped accordingly.

b) Convolutional Layers: The core of the model consists of convolutional layers. These layers are responsible for detecting local patterns and features in the spectrograms. The model includes the convolutional and pooling layers listed above.

h) Training Process: The model is trained over multiple epochs with a batch size of 32. Early stopping and learning rate reduction on plateau are employed as callbacks to enhance the training efficiency and prevent overfitting. Early stopping monitors the validation loss and stops training when it stops improving, while learning rate reduction reduces the learning rate when the validation loss plateaus.

The trained CNN model is evaluated on the test set to measure its performance. Several key metrics are used:

i) Accuracy: The overall accuracy of the model on the test set provides a measure of how well the model can correctly classify the urban sound excerpts.
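A hedged sketch of a CNN of the kind described in this subsection (two 3x3 convolutional blocks with 32 and 64 filters, 2x2 max pooling, dropout after each pooling layer, a 128-unit dense layer, and a ten-unit softmax output, trained with categorical cross-entropy, Adam, batch size 32, early stopping, and learning-rate reduction). The spectrogram dimensions, dropout rates, and epoch count are assumptions, and a Flatten layer is added here to connect the convolutional blocks to the dense layers.

```python
import numpy as np
import tensorflow as tf
from tensorflow.keras import layers, models

# Stand-in spectrogram batch: 100 clips of shape (128, 128, 1); the actual
# spectrogram dimensions used in the study are an assumption here.
X_train = np.random.rand(100, 128, 128, 1).astype("float32")
y_train = tf.keras.utils.to_categorical(np.random.randint(0, 10, 100), num_classes=10)

model = models.Sequential([
    layers.Input(shape=(128, 128, 1)),
    layers.Conv2D(32, (3, 3), activation="relu"),   # low-level features such as edges
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Conv2D(64, (3, 3), activation="relu"),   # more complex patterns
    layers.MaxPooling2D((2, 2)),
    layers.Dropout(0.25),
    layers.Flatten(),
    layers.Dense(128, activation="relu"),
    layers.Dense(10, activation="softmax"),
])

model.compile(optimizer="adam",
              loss="categorical_crossentropy",
              metrics=["accuracy"])

callbacks = [
    tf.keras.callbacks.EarlyStopping(patience=5, restore_best_weights=True),
    tf.keras.callbacks.ReduceLROnPlateau(patience=3),
]
model.fit(X_train, y_train, validation_split=0.2,
          batch_size=32, epochs=30, callbacks=callbacks)
```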
Class Precision Recall F1-score Support
air conditioner 0.94 0.96 0.95 203
car horn 0.95 0.84 0.89 86
children playing 0.75 0.86 0.80 183
dog bark 0.89 0.86 0.87 201
drilling 0.92 0.84 0.88 206
engine idling 0.93 0.96 0.95 193
gun shot 0.87 0.85 0.86 72
jackhammer 0.97 0.96 0.96 208
siren 0.87 0.97 0.92 165
street music 0.88 0.81 0.84 230
accuracy 0.89 (1747)
macro avg 0.90 0.89 0.89 1747
weighted avg 0.90 0.89 0.89 1747
TABLE IV: Classification Report for CNN

Fig. 15: CNN Graph

j) Confusion Matrix: A confusion matrix is generated to provide a detailed breakdown of the model's performance across the different sound classes. This matrix highlights the number of correct and incorrect predictions for each class,
number of correct and incorrect predictions for each class, TABLE IV: Classification Report for CNN
offering insights into the model’s strengths and areas for
improvement.
D. System Settings
The underlying software and hardware settings have a
significant impact on the repeatability and performance of
machine learning studies. Thorough explanations of the system
configurations guarantee consistent outcomes and offer back-
ground information for any performance measurements that
are disclosed. Every experiment was carried out on a device
that met the following requirements:
• Processor: Intel Core i7-13650HX (20 CPUs) @
2.60GHz
• Memory: 16 GB RAM
• Graphics: NVIDIA GeForce RTX 4060
• Operating System: Windows 11
• Storage: 1 TB SSD
• Python libraries:
– librosa - 0.10.2.post1
– numpy - 1.26.4
– pandas - 2.2.2
– scikit-learn - 1.4.2
– seaborn - 0.13.2
– matplotlib - 3.9.0
– tensorflow - 2.16.1
– keras - 3.3.3
– warnings (part of the Python standard library)
• IDE: Visual Studio Code

Fig. 16: CNN Confusion Matrix

k) Precision, Recall, and F1-Score: These metrics are calculated for each class to assess the model's ability to correctly identify positive instances (precision), its ability to capture all relevant instances (recall), and the balance between precision and recall (F1-score).
The CNN model's capacity to recognize and learn from the spatial hierarchies in the spectrogram data is demonstrated by its application to the categorization of urban sounds. [44] The architecture of the model makes use of the local patterns and characteristics seen in the spectrograms. It consists of many convolutional layers followed by dense layers. The model delivers effective classification performance through the use of strong training and assessment methodologies, as demonstrated by the confusion matrix and other evaluation metrics as well as the model's accuracy and extensive analysis. This implementation demonstrates CNNs' promise for audio identification tasks, especially in intricate and dynamic metropolitan settings.

V. FUTURE DIRECTIONS

There are numerous areas where machine learning for the classification of urban noise can be further explored, with the goal of improving the practicality, effectiveness, and resilience of existing approaches. The first is sound event detection automation. Currently, segmenting a continuous stream of sounds involves numerous manual steps. The creation of completely automated SED systems would significantly increase the scalability and efficiency of applications used in the classification of urban noise. [45] [46] Modern machine learning techniques, particularly deep learning models, can be used by automatic SED to precisely detect and segment sound events in real time with less need for human interaction, increasing system accuracy.

The optimization of the feature selection and extraction procedures is a crucial area that requires additional effort. Reducing the number of parameters used in categorization is necessary since IoT sensors and other equipment used for monitoring urban noise often have limited processing capabilities. [6] [47] In addition to reducing the computational burden, feature reduction makes sure that the characteristics chosen are actually useful for enhancing the classification models' accuracy. Subsequent investigations ought to concentrate on pinpointing the most crucial features and exploring methods for achieving feature optimization via dimensionality reduction, feature engineering, and the application of sophisticated algorithms such as PCA and t-SNE. [6]

Furthermore, even more complex neural network architectures, such as long short-term memory networks and convolutional neural networks, will be incorporated to improve the performance of the urban noise categorization models. [45] Because these structures can capture the temporal and spatial connections in the data, they are highly suited for processing audio data. Therefore, future models can categorize the complex and overlapping sound events that are typical in urban areas with more accuracy by utilizing CNNs and LSTMs.

Moreover, another area where further effort is needed is the deployment of advanced models onto edge devices. [45] By allowing real-time noise categorization to be performed directly on the devices from which the data was collected, edge computing might potentially reduce response time and latency. Thus, it is necessary to create compact, effective models that can operate on the limited resources that edge devices provide while still achieving good classification performance.

Finally, in order to support the dataset's diversification and enlarge it, new cooperative and crowdsourced methods of data collection need to be investigated. By including the community in the data-gathering process, researchers will be able to obtain a representative sample of urban noises that will enable the creation of more robust and generalized categorization models. [48] It will be equally crucial to put privacy-preserving measures into practice to guarantee the ethical use of the information gathered from public areas.

VI. CHALLENGES

In the process of developing and implementing machine learning models for urban noise classification, several challenges were encountered that impacted the overall effectiveness and efficiency of the models. These challenges can be broadly categorized into data-related issues, computational limitations, model-specific difficulties, and deployment concerns.

A. Data-Related Issues

1) Quality and Quantity of Data: The UrbanSound8K dataset, while comprehensive, still poses challenges due to the limited quantity of labeled data for certain classes. Imbalanced data can lead to biased models that perform well on majority classes but poorly on minority classes. Additionally, the presence of noisy or mislabeled data can negatively impact model training and accuracy. Enhancing data augmentation techniques is vital in order to generate more diverse training samples and mitigate data imbalance. Furthermore, utilizing transfer learning and pre-trained models to leverage existing knowledge and improve model performance with limited data is a very reliable approach.
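As a hedged illustration of the augmentation idea mentioned above (not the study's pipeline), simple waveform-level transformations can be generated with librosa; the specific transformations, parameters, and file path are assumptions.

```python
import numpy as np
import librosa

def augment(y, sr):
    """Return simple augmented variants of a clip: noise, time stretch, pitch shift."""
    noisy = y + 0.005 * np.random.randn(len(y))                  # additive Gaussian noise
    stretched = librosa.effects.time_stretch(y, rate=1.1)        # slightly faster playback
    shifted = librosa.effects.pitch_shift(y, sr=sr, n_steps=2)   # up two semitones
    return [noisy, stretched, shifted]

y, sr = librosa.load("path/to/clip.wav")
variants = augment(y, sr)
print(len(variants))
```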
2) Feature Extraction and Selection: Extracting meaningful features from audio data is a complex task. The performance of the models heavily depends on the quality of the features extracted. Mel-frequency cepstral coefficients (MFCCs), chroma features, and spectral contrasts are commonly used, but determining the optimal set of features for classification remains a challenge.

B. Computational Limitations

1) Processing Power and Memory: Training deep learning models like DNNs, CNNs, and LSTMs requires significant computational power and memory. Limited access to high-performance computing resources can slow down the training process and restrict the ability to experiment with larger models or more complex architectures. Implementing model compression techniques to reduce computational requirements and enable real-time processing on edge devices can help address these challenges.

2) Hyperparameter Tuning: Optimizing hyperparameters is computationally intensive. Techniques like grid search or random search can be exhaustive and time-consuming, often requiring multiple runs to identify the best configurations. Exploring advanced optimization algorithms for more efficient hyperparameter tuning is a useful method for mitigating this issue.
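As a hedged illustration of the search strategies mentioned above, a small randomized search over Random Forest hyperparameters can be run with scikit-learn; the parameter grid and data are placeholders, not the study's settings.

```python
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import RandomizedSearchCV

# Stand-in data; in practice the extracted feature vectors and encoded labels are used.
X = np.random.rand(200, 40)
y = np.random.randint(0, 10, size=200)

# A small random search over a few Random Forest hyperparameters.
param_distributions = {
    "n_estimators": [100, 200, 400],
    "max_depth": [None, 10, 20],
    "max_features": ["sqrt", "log2"],
}
search = RandomizedSearchCV(RandomForestClassifier(random_state=42),
                            param_distributions, n_iter=5, cv=3, random_state=42)
search.fit(X, y)
print(search.best_params_)
```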
C. Model-Specific Difficulties

1) Overfitting and Underfitting: Striking a balance between overfitting and underfitting is a persistent challenge. Models like DNNs and CNNs, with their high capacity, are prone to overfitting, especially when trained on small datasets. Conversely, simpler models may underfit, failing to capture the complexities of the data.

2) Temporal Dependencies: Capturing temporal dependencies in audio data is crucial for accurate classification. While LSTM networks are designed for this purpose, they still struggle with long-range dependencies and require careful tuning of parameters like sequence length and number of units. Investigating hybrid models that combine the strengths of different architectures, such as CNNs and LSTMs, to better capture both spatial and temporal features is a convenient way of mitigating this problem.

D. Deployment Concerns

1) Real-Time Processing: Deploying models for real-time noise classification in urban environments requires efficient algorithms that can process data quickly. Ensuring low latency and high throughput in real-time applications is challenging, particularly with resource-constrained devices like edge computing platforms.

2) Robustness and Adaptability: Models need to be robust to variations in environmental conditions, such as changes in background noise, recording quality, and the presence of multiple sound sources. Developing models that can adapt to different urban settings without significant performance degradation is an ongoing challenge. Conducting extensive cross-validation and robustness testing to ensure models perform well under varying conditions could be the real solution for this challenge.

By addressing these challenges, the field of urban noise classification can advance towards more accurate, efficient, and deployable solutions, contributing to better noise management and improved urban living environments.

2) Convolutional Neural Networks (CNN): The CNN model performed well, achieving a 90% accuracy. CNNs are particularly effective in extracting spatial features from audio spectrograms. The high precision (91.0%) and recall (92.1%) reflect the model's robustness in handling complex audio data. The convolutional layers in CNNs detect local patterns and structures in the spectrograms, enhancing their ability to classify sounds accurately.

3) Random Forest (RF): The RF model achieved a commendable accuracy of 87%, highlighting its robustness and generalization capabilities. RF's ensemble learning approach, which combines multiple decision trees, helps reduce overfitting and improve model stability. The balanced precision (86.5%) and recall (87.8%) values indicate the model's reliability in urban noise classification.

4) Long Short-Term Memory Networks (LSTM): The LSTM model, designed to capture temporal dependencies in sequential data, achieved an accuracy of 79%. The precision (78.7%) and recall (79.1%) suggest that the current architecture and preprocessing techniques might need further optimization. Despite being well-suited for sequential data, the LSTM model's performance indicates a need for better handling of long-term dependencies and potential improvements in hyperparameter tuning.
Future work could explore hybrid models combining CNNs and LSTMs to improve performance further.

In conclusion, this study demonstrates the potential of machine learning models, particularly deep learning architectures, in addressing the challenges of urban noise classification. The findings contribute to the ongoing efforts in developing effective and efficient noise monitoring systems, ultimately enhancing urban living environments.

VIII. CONCLUSION

The study's main objective was to categorize urban noise utilizing Random Forest, Long Short-Term Memory networks, Convolutional Neural Networks, and Deep Neural Networks. We aimed to build models that might deliver high accuracy in the categorization of various forms of urban noise, using the UrbanSound8K dataset as our target.

In terms of accuracy and F1-score, the results from the DNN and CNN models outperformed those from the RF and LSTM models. The DNN model provided the best accuracy, 94.5%, demonstrating the ability of hierarchical learning to reflect complicated patterns in the audio data. The convolutional layers of the CNN model, which came next, were able to extract the crucial spatial characteristics from the audio spectrograms with an accuracy of 90%. With an accuracy of 87%, the RF model demonstrated its ability to maintain a balanced approach and combine numerous decision trees to enhance generalization and minimize overfitting. With a score of 79%, the LSTM model, which was intended to detect temporal dependencies, suggests that more model tuning is necessary for the sequential audio data.

Even though the results appear promising, a number of difficulties have been encountered, such as poor data quality, computational limitations, and the need for efficient feature extraction and selection techniques. Resolving these issues is essential to enhancing the efficacy and relevance of the urban noise categorization models.

Furthermore, scalability, interaction with current urban monitoring systems, and real-time processing provide new hurdles when implementing these models in practical settings. Work has to be done on creating lightweight models that can be deployed on edge devices while taking into account very low latencies and high throughput in real-time applications.

The development of sophisticated urban noise monitoring systems will greatly benefit from the findings of this study. By addressing noise more effectively, advanced machine learning algorithms improve public health and the overall quality of life in cities. The exceptional efficacy of the DNN and CNN models highlights their promise in offering precise and expandable resolutions to the urban noise categorization issue.

To sum up, this research provides insightful information about the use of machine learning in the classification of urban noise. The findings open the door for more developments in this area and emphasize how crucial it is to choose the right models depending on the demands of a certain application. To tackle the increasing problems of urban noise pollution and improve the livability of urban areas, further research and development of these approaches is needed.

REFERENCES

[1] World Health Organization, Environmental Noise Guidelines for the European Region. WHO Regional Office for Europe, 2018. [Online]. Available: https://iris.who.int/bitstream/handle/10665/279952/9789289053563-eng.pdf?sequence=1
[2] C. Sarkar and C. Webster, "Urban environments and human health: current trends and future directions," Current Opinion in Environmental Sustainability, vol. 25, pp. 33–44, 04 2017.
[3] N. D. Sidiropoulos, L. De Lathauwer, X. Fu, K. Huang, E. E. Papalexakis, and C. Faloutsos, "Tensor decomposition for signal processing and machine learning," IEEE Transactions on Signal Processing, vol. 65, no. 13, pp. 3551–3582, 2017.
[4] M. McKinney, J. Breebaart, and P. (wy, “Features for audio and music [25] G. Louppe, “Understanding random forests: From theory to practice,”
classification,” 11 2003. Ph.D. dissertation, 10 2014.
[5] V. Vijayakumar, S. Ummar, T. J. Varghese, and A. E. Shibu, “Ecg noise [26] G. Biau and E. Scornet, “A random forest guided tour,” Test, vol. 25,
classification using deep learning with feature extraction,” Signal, Image pp. 197–227, 2016.
and Video Processing, vol. 16, no. 8, pp. 2287–2293, 2022. [27] R. DiPietro and G. D. Hager, “Deep learning: Rnns and lstm,” in Hand-
[6] P. Patil, “Smart iot based system for vehicle noise and pollution book of medical image computing and computer assisted intervention.
monitoring,” in 2017 International Conference on Trends in Electronics Elsevier, 2020, pp. 503–519.
and Informatics (ICEI), 2017, pp. 322–326. [28] K. Kawakami, “Supervised sequence labelling with recurrent neural
[7] A. O. Albaji, R. B. A. Rashid, S. Z. Abdul Hamid et al., “Investigation networks,” Ph.D. dissertation, Technical University of Munich, 2008.
on machine learning approaches for environmental noise classifications,” [29] O. Surakhi, M. A. Zaidan, P. L. Fung, N. Hossein Motlagh, S. Serhan,
Journal of Electrical and Computer Engineering, vol. 2023, 2023. M. .Alkhanafseh, R. Ghoniem, and T. Hussein, “Time-lag selection for
[8] N. H. Tandel, H. B. Prajapati, and V. K. Dabhi, “Voice recognition time-series forecasting using neural network and heuristic algorithm,”
and voice comparison using machine learning techniques: A survey,” Electronics, vol. 10, 10 2021.
in 2020 6th International Conference on Advanced Computing and [30] G. Van Houdt, C. Mosquera, and G. Nápoles, “A review on the long
Communication Systems (ICACCS). IEEE, 2020, pp. 459–465. short-term memory model,” Artificial Intelligence Review, vol. 53, 12
[9] Y. H. Ali, R. A. Rashid, and S. Z. A. Hamid, “A machine learning for 2020.
environmental noise classification in smart cities,” Indonesian Journal [31] I. Lezhenin, N. Bogach, and E. Pyshkin, “Urban sound classification
of Electrical Engineering and Computer Science, vol. 25, no. 3, pp. using long short-term memory neural network,” in 2019 federated
1777–1786, 2022. conference on computer science and information systems (FedCSIS).
[10] S. Boonprong, C. Cao, W. Chen, X. Ni, M. Xu, and B. K. Acharya, IEEE, 2019, pp. 57–60.
“The classification of noise-afflicted remotely sensed data using three [32] J. Lu, L. Tan, and H. Jiang, “Review on convolutional neural network
machine-learning techniques: effect of different levels and types of noise (cnn) applied to plant leaf disease classification,” Agriculture, vol. 11,
on accuracy,” ISPRS International Journal of Geo-Information, vol. 7, no. 8, p. 707, 2021.
no. 7, p. 274, 2018. [33] X. Kang, B. Song, and F. Sun, “A deep similarity metric method
based on incomplete data for traffic anomaly detection in iot,” Applied
[11] J. Renaud, R. Karam, M. Salomon, and R. Couturier, “Deep learning
Sciences, vol. 9, p. 135, 01 2019.
and gradient boosting for urban environmental noise monitoring in smart
[34] G. Song, X. Guo, W. Wang, Q. Ren, J. Li, and L. Ma, “A machine
cities,” Expert Systems with Applications, vol. 218, p. 119568, 2023.
learning-based underwater noise classification method,” Applied Acous-
[Online]. Available: https://www.sciencedirect.com/science/article/pii/
tics, vol. 184, p. 108333, 2021.
S0957417423000696
[35] Z. Mushtaq and S.-F. Su, “Efficient classification of environmental
[12] S. Gupta and A. Gupta, “Dealing with noise problem in machine learning
sounds through multiple features aggregation and data enhancement
data-sets: A systematic review,” Procedia Computer Science, vol. 161,
techniques for spectrogram images,” Symmetry, vol. 12, 11 2020.
pp. 466–474, 2019.
[36] G. Algan and I. Ulusoy, “Image classification with deep learning in the
[13] M. Bubashait and N. Hewahi, “Urban sound classification using dnn, presence of noisy labels: A survey,” Knowledge-Based Systems, vol. 215,
cnn & lstm a comparative approach,” in 2021 International Conference p. 106771, 2021.
on Innovation and Intelligence for Informatics, Computing, and Tech- [37] W. Dai, C. Dai, S. Qu, J. Li, and S. Das, “Very deep convolutional
nologies (3ICT). IEEE, 2021, pp. 46–50. neural networks for raw waveforms,” 10 2016.
[14] G. Zambon, H. E. Roman, M. Smiraglia, and R. Benocci, [38] Kaggle, “Kaggle: Urbansound8k,” accessed: 2024-05-24. [Online].
“Monitoring and prediction of traffic noise in large urban areas,” Available: https://www.kaggle.com/datasets/chrisfilo/urbansound8k
Applied Sciences, vol. 8, no. 2, 2018. [Online]. Available: https: [39] Freesound, “Freesound: Collaborative Database of Creative Commons
//www.mdpi.com/2076-3417/8/2/251 Licensed Sounds,” 2024, accessed: 2024-05-24. [Online]. Available:
[15] J. Salamon and J. Bello, “Deep convolutional neural networks and https://freesound.org/
data augmentation for environmental sound classification,” IEEE Signal [40] M. A. S. M. M. A. A. Sanjoy Barua, Tahmina Akter, “A
Processing Letters, vol. PP, 01 2017. deep learning approach for urban sound classification,” International
[16] O. I. Abiodun, A. Jantan, A. E. Omolara, K. V. Dada, A. M. Umar, Journal of Computer Applications, vol. 185, no. 24, pp. 8–14,
O. U. Linus, H. Arshad, A. A. Kazaure, U. Gana, and M. U. Kiru, Jul 2023. [Online]. Available: https://ijcaonline.org/archives/volume185/
“Comprehensive review of artificial neural network applications to number24/32838-2023922991/
pattern recognition,” IEEE Access, vol. 7, pp. 158 820–158 846, 2019. [41] P. Raguraman, R. Mohan, and M. Vijayan, “Librosa based assessment
[17] W. Samek, G. Montavon, S. Lapuschkin, C. J. Anders, and K.-R. Müller, tool for music information retrieval systems,” 03 2019, pp. 109–114.
“Explaining deep neural networks and beyond: A review of methods and [42] L. Morse, L. Cartabia, and V. Mallardo, “Reliability-based bottom-
applications,” Proceedings of the IEEE, vol. 109, no. 3, pp. 247–278, up manufacturing cost optimisation for composite aircraft structures,”
2021. Structural and Multidisciplinary Optimization, vol. 65, 05 2022.
[18] R. M. Cichy and D. Kaiser, “Deep neural networks as scientific models,” [43] E. Fonseca, A. Ferraro, and X. Serra, “Improving sound event classifi-
Trends in cognitive sciences, vol. 23, no. 4, pp. 305–317, 2019. cation by increasing shift invariance in convolutional neural networks,”
[19] N. Papernot, P. McDaniel, X. Wu, S. Jha, and A. Swami, “Distillation 07 2021.
as a defense to adversarial perturbations against deep neural networks,” [44] J. Sharma, O.-C. Granmo, and M. Goodwin, “Environment sound
11 2015. classification using multiple feature channels and attention based deep
[20] U. Prakruthi, D. Kiran, and H. R., “High performance neural network convolutional neural network,” 10 2020, pp. 1186–1190.
based acoustic scene classification,” 01 2018, pp. 781–784. [45] E. Tsalera, A. Papadakis, and M. Samarakou, “Monitoring, profiling and
[21] V. Sze, Y.-H. Chen, T.-J. Yang, and J. S. Emer, “Efficient processing of classification of urban environmental noise using sound characteristics
deep neural networks: A tutorial and survey,” Proceedings of the IEEE, and the knn algorithm,” Energy Reports, vol. 6, pp. 223–230, 2020.
vol. 105, no. 12, pp. 2295–2329, 2017. [46] S. Kim, B. Yoon, J.-T. Lim, and M. Kim, “Data-driven signal–noise
[22] V. Boddapati, A. Petef, J. Rasmusson, and L. Lundberg, “Classifying classification for microseismic data using machine learning,” Energies,
environmental sounds using image recognition networks,” Procedia vol. 14, no. 5, p. 1499, 2021.
Computer Science, vol. 112, pp. 2048–2056, 2017, knowledge-Based [47] Y. Alsouda, S. Pllana, and A. Kurti, “A machine learning driven
and Intelligent Information Engineering Systems: Proceedings of iot solution for noise classification in smart cities,” arXiv preprint
the 21st International Conference, KES-20176-8 September 2017, arXiv:1809.00238, 2018.
Marseille, France. [Online]. Available: https://www.sciencedirect.com/ [48] B. Mishachandar and S. Vairamuthu, “Diverse ocean noise classification
science/article/pii/S1877050917316599 using deep learning,” Applied Acoustics, vol. 181, p. 108141, 2021.
[23] N. H. Agjee, O. Mutanga, K. Peerbhay, R. Ismail et al., “The impact
of simulated spectral noise on random forest and oblique random forest L IST OF A BBREVIATIONS
classification performance,” Journal of Spectroscopy, vol. 2018, 2018.
[24] S. J. Rigatti, “Random forest,” Journal of Insurance Medicine, vol. 47,
no. 1, pp. 31–39, 2017.
Abbreviation Full Form
AI Artificial Intelligence
ML Machine Learning
DL Deep Learning
RF Random Forest
DNN Deep Neural Network
CNN Convolutional Neural Network
LSTM Long Short-Term Memory
MFCC Mel-Frequency Cepstral Coefficient
SED Sound Event Detection
WHO World Health Organization
IoT Internet of Things