
Received 13 May 2024; accepted 12 July 2024. Date of publication 19 July 2024; date of current version 26 September 2024. The review of this article was arranged by Associate Editor Peng Li.
Digital Object Identifier 10.1109/OJCS.2024.3431229

Musical Genre Classification Using Advanced Audio Analysis and Deep Learning Techniques

MUMTAHINA AHMED 1, ULAND ROZARIO 1, MD MOHSHIN KABIR 2, ZEYAR AUNG 3 (Senior Member, IEEE), JUNGPIL SHIN 4, AND M. F. MRIDHA 5 (Senior Member, IEEE)

1 Department of Computer Science and Engineering, Bangladesh University of Business and Technology, Dhaka 1216, Bangladesh
2 Faculty of Informatics, Eötvös Loránd University, H-1117 Budapest, Hungary
3 Department of Computer Science, Khalifa University, Abu Dhabi 127788, UAE
4 School of Computer Science and Engineering, The University of Aizu, Aizu-Wakamatsu 965-8580, Japan
5 Department of Computer Science and Engineering, American International University-Bangladesh, Dhaka 1229, Bangladesh

CORRESPONDING AUTHORS: M. F. MRIDHA; ZEYAR AUNG (e-mail: [email protected]; [email protected])

ABSTRACT Classifying music genres has become a significant problem in the era of seamless music streaming platforms and ever-growing content creation. Accurate music genre classification is a fundamental task with applications in music recommendation, content organization, and understanding musical trends. This study presents a comprehensive approach to music genre classification using deep learning and advanced audio analysis techniques, with the GTZAN dataset chosen as the primary benchmark. The study examines the challenge of music genre categorization using Convolutional Neural Network (CNN), Feedforward Neural Network (FNN), Support Vector Machine (SVM), k-Nearest Neighbors (kNN), and Long Short-Term Memory (LSTM) models on the dataset. The models' outputs are cross-validated after feature extraction from pre-processed audio data, and their performance is then evaluated. The modified CNN model performs better than conventional neural network models by exploiting its capacity to capture complex spectrogram patterns. These results highlight how deep learning algorithms may improve systems for categorizing music genres, with implications for various music-related applications and user interfaces. The proposed approach achieves an accuracy of 92.7% on the GTZAN dataset and 91.6% on the ISMIR2004 Ballroom dataset.

INDEX TERMS Convolutional neural networks, long short-term memory, support vector machine, k-nearest
neighbors, genre classification.

I. INTRODUCTION
Humans classify music into genres based on what they think of the music, how comfortable they are with the style, and their capacity to make decisions between genres that are unclear. This makes classifying music genres in the field of Music Information Retrieval (MIR) difficult. Sound processing, audio synthesis, audio effect creation, and music information retrieval depend on the extraction of audio features. Moffat et al. [19] proposed a method to assess current audio feature extraction toolboxes and libraries by thoroughly examining their coverage, effort, presentation, and latency. There are several toolboxes for extracting audio features, such as "Essentia," recommended by Bogdanov et al. [3], and "Librosa," described by McFee et al. [18] as a Python package for audio and music signal processing, among others. Purwins et al. [25] discussed convolutional neural networks, long short-term memory architectural variations among the deep learning models studied, and other popular feature representations such as log-mel spectra and raw waveforms. Previous research also indicated that participants listened to music more often than any of the other surveyed activities in Pachet et al. [22] (i.e., watching television, reading books, and watching movies). The Pachet et al. [22] study examines the connections between various factors, including individual characteristics, self-perceptions, cognitive capacities, and musical preferences. Music is essentially subjective because many people experience the same song differently.


This study focuses on the model's ability to identify music based on objective audio elements but does not examine how well these categories match human perceptions of genre. However, the current problem is to organize and manage the millions of music titles produced by society, as suggested by Rentfrow et al. [26]. In the context of artists and content creators, genre classification helps make the production process more efficient and enables artists to reach targeted audiences and refine their craft. Moreover, the online music streaming platforms discussed by Prey et al. [24] and the film and video game industries benefit significantly from accurate genre classification. At the same time, this subjectivity makes it difficult to fully understand and simulate people's musical tastes. The fingerprinting (FP) approach was implemented by Unal et al. [31] to solve the difficult problem of resilient data extraction in query-by-humming (QBH) systems under unpredictable conditions. The use of Mel Frequency Cepstral Coefficients in the classification of musical instruments is discussed by Loughran et al. [17]. MFCCs represent the spectral characteristics of audio signals, providing a compact yet highly informative feature set for music analysis. Previous research by Jensen et al. [10] summarizes how the MFCC is calculated. MFCCs capture key attributes of sound, such as timbre and pitch, which are instrumental in discerning musical genres.

However, to improve the model's generalization over a greater range of musical styles, bigger and more diverse music datasets should be used for testing and training. Improving the model's ability to capture the complex musical features that characterize genres may require developing more advanced feature extraction techniques. The model improves its knowledge of musical differences by exploring audio data and identifying hidden components. Experimenting with various machine learning models and hyperparameter optimization techniques can help identify the best architecture for a specific genre classification problem, optimizing performance on the selected dataset and genres. Beyond the obvious classifications, this work explores the subtle characteristics that give each genre its persona.

The contributions of this article are as follows:
- This study has extracted the features of the WAV-formatted 30-second music files provided in the dataset so that those extracted features can be re-used to train the proposed model and to test and evaluate it.
- This study applied CNN, LSTM, SVM, kNN, and FNN models to classify music genres and then proposed the modified Convolutional Neural Network model, which gives the best accuracy among all the trained models on both the GTZAN music genre classification dataset and the ISMIR2004 Ballroom dataset.
- The study has evaluated the proposed Convolutional Neural Network model and compared it to the other models and to some well-known published papers.

The rest of this study is organized as follows. A thorough overview of relevant studies in the realm of music genre classification is presented in Section II. Section III thoroughly explains the suggested dataset, algorithms, and model architectures and designs. Section IV contains the analysis results and a description of the dataset, the experimental environment, and the experiment results. Section V outlines the arguments made for the suggested system and recommends possible directions for further study. Section VI brings this article to a close.

II. RELATED WORK
This study explores music genre classification using a combination of machine learning and deep learning methods. Inspired by the traditional approach of analyzing instruments in music, the study turns to automated methods due to the advancements in machine learning.

Cheng et al. [4] used Convolutional Neural Networks with five convolutional layers to classify the genres of music. The accuracy they obtained was 83.3%. Their hop size was set to 256 with a fast Fourier transform on 1024 frames. There was another approach by Ndou et al. [21] using traditional machine learning and deep learning. They presented a thoroughly reviewed paper on those approaches and concluded their study with an accuracy of 92.69% by k-Nearest Neighbors, while their Convolutional Neural Network provided an accuracy of 72.40%. Sugianto et al. studied voting-based music genre classification [28]. They also used the GTZAN dataset for music genre classification, obtaining a voting-scheme accuracy of 71.87% and a single-scheme accuracy of 63.49%, showing that the voting scheme offers greater accuracy than the single scheme. Prabhakar et al. [23] mentioned five different approaches, namely WVG-ELNSC, SDA, RA-TSM, TSVM, and BAG, proposed for music genre classification in their study. They obtained 93.51% accuracy using the proposed deep learning BAG model on the GTZAN, ISMIR 2004, and MagnaTagATune data sets. Another work was featured by Ashraf et al. [1], where a hybrid CNN and RNN variant model was implemented. They achieved an accuracy of 89.30% by using a hybrid approach that combines CNN and Bi-GRU when the features were Mel-spectrograms, whereas with MFCC they got 73.69% on the same hybrid model. They also enriched their study with the use of hybrid models such as CNN-GRU, CNN-BiLSTM, and CNN-LSTM, using both MFCC and Mel-spectrogram features on the hybrid models. Kostrzewa et al. [12] studied the classification of music genres; in their study, the FMA tiny dataset was used in several tests to examine the effectiveness and performance of various models. The CNN and 1-D CRNN models produced the best outcomes, but the 2-D CRNN and LSTM models considerably underperformed. A growing need for sophisticated Music Information Retrieval (MIR) approaches to categorize digital music files into various genres was discussed in the paper by Mutiara et al. [20]; the study emphasizes the necessity for an automatic genre tagging system rather than manual genre labeling. The optimal audio feature mixture, combining musical surface, Mel-Frequency Cepstrum Coefficients, tonality, and LPC, achieved a remarkable classification accuracy of 76.6%.



Another study, by Farajzadeh et al. [7], presents PMG-Net, a customized deep neural network-based technique for automatically classifying Persian music genres. The researchers built a dataset called PMG-Data, which includes 500 songs from a variety of musical genres, including pop, rap, traditional, rock, and monody, to assess PMG-Net's performance. Other researchers can access this freely available dataset, which has been labeled for genre classification. The method's performance in classifying Persian music genres is satisfactory, as evidenced by PMG-Net's reported 86% accuracy on PMG-Data. This study fills a research gap by utilizing deep neural network-based methods to categorize Persian music genres, a gap left by previous work focusing on western music.

Li et al. [13] utilized convolutional neural network techniques to extract musical highlights from sound music. They found that CNN can identify relevant information from the tangents of musical examples, improving over time. This approach is adaptable and validates the innate characteristics of audiovisual data collection, revealing the optimal parameter set for sound music arrangement. Fulzele et al. [8] highlighted the necessity of automatic music genre classification in the digital age, where many music files are readily available online. They used a hybrid model for classifying music genres that combines a Long Short-Term Memory (LSTM) and a Support Vector Machine (SVM) classifier. Compared to the individual accuracies of the LSTM (69%) and SVM (84%) classifiers, the hybrid model comprising LSTM and SVM classifiers produced an 89% success rate in classification for musical genres. Schindler et al. [27] compared the effectiveness of several neural network topologies for automatically classifying music genres.

In conclusion, recent advancements in music genre classification, employing machine learning and deep learning, showcase significant progress. Notable research, such as Ashraf et al. [1] achieving 89.30% accuracy with a hybrid CNN and RNN variant and Ndou et al. [21] reaching 92.69% using k-Nearest Neighbors, exemplifies the capabilities of various models. The Schindler et al. [27] study underscores the superiority of CNN-based approaches over manual feature creation. These developments have profound implications for the digital age, enhancing music streaming services with more accurate recommendations and personalized playlists and enriching user exploration. The future of music genre classification appears promising, with ongoing advancements in models and methodologies fueled by the dynamic intersection of machine learning and music. This study anticipates even more robust genre classification algorithms in the years ahead.

III. METHODOLOGIES
The methodology used for the research is discussed in this part, encompassing aspects such as data collection, data preprocessing procedures, and the foundational architectures constituting the baseline for the proposed musical genre classification system. A complete diagram of this study's workflow is presented in Fig. 3.

A. DATA COLLECTION
This study began by assembling a dataset, combining the well-known GTZAN dataset and the Ballroom dataset designed for ISMIR 2004's rhythm description contest. Following the Tzanetakis et al. [30] methodology, the GTZAN dataset provides a diverse representation of ten musical genres, each with precisely 100 music files in WAV format, lasting 30 seconds. The Music Technology Group at Pompeu Fabra University created ISMIR2004, a genre classification dataset for a music data analysis contest. It consists of ten different genres, including quickstep, jive, rumba-international, rumba-misc, chachacha, viennese waltz, samba, and waltz. The quality and diversity of this dataset significantly influence the performance of the proposed models. The standardized approach ensures a robust foundation for developing accurate models in music genre classification. Wave plots of samples from the GTZAN dataset and the ISMIR2004 dataset are shown in Figs. 1 and 2, respectively.

FIGURE 1. Waveform of a music file from each genre from the GTZAN genre classification dataset.

FIGURE 2. Waveform of a music file from each genre from the ISMIR 2004 genre classification dataset.




FIGURE 3. Workflow diagram of the proposed framework.

B. DATA PREPROCESSING
In this experiment, each 30-second audio segment undergoes a comprehensive preprocessing stage. The audio is transformed into its corresponding Mel spectrograms, quantized into audio signals at a sampling rate of 22,050 Hz, and subjected to a Fast Fourier Transform (FFT) process applied to 2048-sample frames (visually represented in Fig. 4). This study initializes the preprocessing by extracting the Mel-Frequency Cepstral Coefficients from audio files of each genre. MFCC focuses on the perceptually important parts of the audio using a mel scale, similar to human hearing. This compressed data is less affected by noise and allows machine learning to efficiently recognize genres based on their unique sound characteristics.

FIGURE 4. Extraction of the Short-Time Fourier Transform from the WAV-formatted audio data.

Feature selection plays an important role in machine learning for two key reasons. First, it improves model performance by removing irrelevant or redundant data. This prevents the model from making noise-based decisions and leads to more accurate predictions. Second, it reduces computational complexity: less data means faster training times and lower resource demands, making the entire machine-learning process more efficient. The challenge in feature selection is the large dimensionality of audio data. Extracting several characteristics can result in a complicated feature space, making it difficult to determine which ones are most useful. Furthermore, finding a balance between informative and redundant features is critical. Removing too much data may result in the loss of important information for genre separation, while maintaining duplicate attributes may hamper performance and increase calculation time. Bergstra et al. [2] used an ensemble learner called ADABOOST to select from a set of audio features that had been extracted from segmented audio and then aggregated. Fig. 5 shows the spectrogram of a WAV file from each genre; the spectrogram was plotted by extracting a Short-Time Fourier Transform and then visualizing it with Librosa's specshow, and Figs. 6 and 7 show the amplitude vs. frequency graph after the Fast Fourier Transform on audio data in both datasets, respectively.
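To make the pipeline concrete, the following is a minimal sketch of the preprocessing described above, assuming the Librosa library and the stated parameters (22,050 Hz sampling, 2048-sample FFT frames, 512-sample hop); the function name, the choice of 13 MFCC coefficients, and the variable names are illustrative rather than taken from the paper.

```python
import librosa
import numpy as np

def extract_features(path, sr=22050, n_fft=2048, hop_length=512, n_mfcc=13):
    """Load a 30-second clip and compute its STFT magnitude, Mel spectrogram, and MFCCs."""
    # Load and resample the waveform; duration limits the clip to 30 seconds.
    y, sr = librosa.load(path, sr=sr, duration=30.0)

    # Short-Time Fourier Transform magnitude (the basis of the spectrogram plots).
    stft = np.abs(librosa.stft(y, n_fft=n_fft, hop_length=hop_length))

    # Mel spectrogram and MFCCs computed with the same framing parameters.
    mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=n_fft, hop_length=hop_length)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc, n_fft=n_fft, hop_length=hop_length)
    return stft, mel, mfcc
```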



FIGURE 5. Time vs. frequency (Hz) spectrograms of a WAV file from each genre.

FIGURE 6. Amplitude vs. frequency graph after the Fast Fourier Transform on audio data in the GTZAN dataset.

FIGURE 7. Amplitude vs. frequency graph after the Fast Fourier Transform on audio data in the ISMIR2004 dataset.

The STFT calculation is represented by the formula X(n\Delta t, m\Delta f), where n and m represent time and frequency, respectively. Here \Delta t signifies the time resolution, \Delta f denotes the frequency resolution, and the summation considers windowed signal values along with a complex exponential term to account for frequency components. This method offers information on how the frequency content of a signal varies over time, which is helpful for tasks like spectrogram synthesis and audio analysis.

X(n\Delta t, m\Delta f) = \sum_{p=n-Q}^{n+Q} w((n-p)\Delta t)\, x(p\Delta t)\, e^{-j 2\pi p m \Delta t \Delta f}   (1)

The discrete Fourier transform of a signal is

X_k = \sum_{n=0}^{N-1} x_n\, e^{-i 2\pi k n / N}

where x_k represents the k-th coefficient in the frequency domain, n is the index variable used for the summation, x_n represents the n-th data point in the original signal, i is the imaginary unit used to create the complex exponential term, N represents the total number of data points in the original signal, and e is the base of the natural logarithm (approximately 2.71). The term e^{-i 2\pi k n / N} is a complex exponential that plays a crucial role in converting the time-domain signal to the frequency domain. The equation below shows this sum separated into simpler compounds based on the operations on x_k, namely the sums over the even and odd samples:

X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-i 2\pi k (2m)/N} + \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-i 2\pi k (2m+1)/N}
    = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-i 2\pi k m/(N/2)} + e^{-i 2\pi k / N} \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-i 2\pi k m/(N/2)}   (2)

The hop size, set at 512 samples, is vital in controlling the analysis frequency. The preprocessing is facilitated through the utilization of the Librosa library, a powerful tool in the field of audio analysis. The preprocessing pipeline outlined in this experiment encompasses a series of carefully designed steps that transform raw audio data into a structured and informative representation.

C(x(t)) = F^{-1}[\log(F[x(t)])]   (3)

Here, x(t) represents a signal, likely in the time domain, where t denotes time; signals in music are audio waveforms, representing sound over time. F is a Fourier transform, which is used in signal processing to convert signals from the time domain to the frequency domain. Taking log(F[x(t)]) compresses the dynamic range of the signal. We used the inverse transformation F^{-1} for feature extraction, so after calculating the logarithm of the transformed signal, we use an inverse transformation to return to a new domain.
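As an illustration of (3), the following NumPy sketch applies the forward transform, log compression, and the inverse transform; reading log(F[x(t)]) as the log of the magnitude spectrum (plus a small epsilon) is an assumption made here to keep the computation real-valued.

```python
import numpy as np

def cepstrum(x, eps=1e-10):
    """Illustrative reading of (3): FFT, log compression of the magnitude spectrum, inverse FFT."""
    spectrum = np.fft.fft(x)                       # F[x(t)]
    log_spectrum = np.log(np.abs(spectrum) + eps)  # log(F[x(t)]), dynamic-range compression
    return np.fft.ifft(log_spectrum).real          # F^{-1}[log(F[x(t)])]
```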
Fig. 8 shows a visual representation of a rock audio file's spectrogram and MFCC. These Mel Frequency Cepstral Coefficients serve as the basis for subsequent analysis and classification tasks, ultimately contributing to the understanding and categorization of musical genres.

FIGURE 8. 'Rock' audio file's spectrogram and MFCC visualization.




C. ARCHITECTURE OF FNN MODEL
Feedforward Neural Networks (FNNs), also known as Artificial Neural Networks (ANNs), are the foundation of deep learning architectures. FNNs are a class of neural networks where the information flows in one direction, from the input layer to the output layer, with the help of concealed intermediary layers. Unlike CNNs, which are used for image classification, object detection [15], and similar tasks, FNNs are widely used for many machine learning tasks, including classification, regression, and function approximation.

\frac{\partial L}{\partial w_{ik}^{[L]}} = \delta_i^{[L](j)}\, a_k^{[L-1](j)}   (4)

This paper presents an advanced neural network model designed for categorizing music genres using the Keras Sequential API. The model starts with an input layer flattening the data and adds densely connected hidden layers, from 2048 down to 64 neurons. These layers record low-level and high-level data representations, capturing complex relationships between traits and musical genres. Dropout layers follow each dense layer to prevent overfitting, while softmax activation is used in the output layer. The Adam optimizer helps modify the model's weights throughout training to minimize the loss. The model also undergoes L2 regularization to manage complexity. The goal is to accurately classify music genres based on complex feature correlations in audio data.
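The following Keras sketch is one way to realize the FNN just described (flattened input, dense layers from 2048 down to 64 units, dropout after each dense layer, L2 regularization, softmax output, Adam optimizer); the intermediate layer widths, the dropout rate, and the loss function are assumptions, since the paper does not list them explicitly.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_fnn(input_shape, num_classes=10, l2=0.01, dropout=0.3):
    """Feedforward baseline: flattened MFCC input, dense layers 2048 -> 64, softmax output."""
    model = keras.Sequential([layers.Flatten(input_shape=input_shape)])
    for units in (2048, 1024, 512, 256, 128, 64):  # assumed intermediate widths
        model.add(layers.Dense(units, activation="relu",
                               kernel_regularizer=regularizers.l2(l2)))
        model.add(layers.Dropout(dropout))        # dropout after each dense layer
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adam(),
                  loss="sparse_categorical_crossentropy",  # assumes integer genre labels
                  metrics=["accuracy"])
    return model
```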

D. ARCHITECTURE OF CNN MODEL
In a Convolutional Neural Network architecture, the network typically comprises a stack of convolutional layers, followed by optional pooling layers, fully connected layers, and an output layer. The 2D convolution formula is given below.

G[i, j] = \sum_{u=-k}^{k} \sum_{v=-k}^{k} H[u, v]\, F[i-u, j-v]   (5)

G[i, j] represents the output of the convolution operation at position (i, j). H[u, v] is the kernel or filter being applied to the input signal F during convolution; it represents the weights or coefficients of the filter. F[i-u, j-v] represents the input signal F being shifted by (u, v) during the convolution operation.

Convolutional layers are designed to detect spatial patterns and features in the input data. These layers use learnable filters to perform convolution operations over the input, capturing local features. CNN models are also being improved through modification in many recent works [9]. Dong [6] used CNNs to extract musical pattern features from the mel-scale spectrogram of audio signals. Multiple convolutional layers in Convolutional Neural Networks identify hierarchical features, and pooling layers then perform spatial reduction and downsampling. For high-level feature extraction and classification, the output is flattened and fed into fully connected layers; for multi-class problems, softmax activation is frequently used. Dropout and batch normalization layers are examples of extra elements that can be included for regularization and stability during training. The CNN used in this work begins with an initial layer of 128 filters (3 × 3 kernel) with ReLU activation. Max-pooling, batch normalization, and another set of convolutional layers were then added. Feature extraction is improved with a final convolutional layer with 64 filters (2 × 2 kernel). After that, the architecture switches to fully connected layers, which include a dense layer for high-level feature processing with 64 units, ReLU activation, and dropout regularization.

f(x) = \max(0, x)   (6)

Finally, the output layer, equipped with softmax activation, provides predictions across 10 classes for the classification task. The softmax function is described below.

P(y_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}   (7)

Here, the softmax function P(y_i) converts class scores to probabilities, showing the probability that the input belongs to each class i. It exponentiates the scores z_i to ensure positivity before normalizing them to produce a probability distribution over all classes. This carefully designed CNN architecture, also represented in Fig. 9, showcases a hierarchy of feature extraction capabilities, making it well-suited for a range of image classification tasks.

E. ARCHITECTURE OF MODIFIED CNN MODEL
This study uses a Convolutional Neural Network architecture to classify music genres. The model uses Keras' Sequential technique. Our model starts with three convolutional layers, each with a kernel size of (3, 3) and ReLU activation. These layers generate local features from input Mel-frequency cepstral coefficients (MFCCs), whose dimensionality (input shape) is determined by the training data. The first layer utilizes 256 filters, while the second layer has 128 filters. For additional feature extraction, an optional third layer of 64 filters can be added. For each convolutional layer, MaxPooling2D layers with pool sizes (3, 3) and strides (2, 2) have been used to reduce the feature maps.



FIGURE 9. CNN Model to classify the genres of music on GTZAN’s data set.

Batch normalization layers are introduced after each pooling layer to improve training stability. Two more convolutional layers have been added, applying kernel sizes of (3, 3) and (2, 2) with ReLU activation. These layers refine higher-level features extracted from the music data. Downsampling and training stability are obtained using MaxPooling2D and batch normalization layers, as in the previous sections. A flattening layer converts the pooled feature maps into one-dimensional vectors that may be fed into fully connected layers. A dense layer with 64 units and ReLU activation is employed for further processing. A dropout layer with a rate of 0.2 helps to prevent overfitting. The last layer is dense, consisting of 10 units, and activated with softmax. This layer predicts the probability distribution of music falling into one of the 10 genre classifications. Our model is compiled using the Adam optimizer with a learning rate of 0.0001.

FIGURE 10. Proposed Modified CNN Architecture to Classify the Genre of Music.
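A Keras sketch consistent with the modified CNN described above is shown below. The filter counts of the two refinement layers, the use of 'same' padding (to keep small MFCC feature maps from collapsing), and the loss function are assumptions not stated in the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def build_modified_cnn(input_shape, num_classes=10):
    """Sketch of the modified CNN: Conv2D 256/128/64 with 3x3 kernels, pooling and batch norm,
    two refinement conv layers (3x3 and 2x2), Dense(64) + Dropout(0.2), softmax output."""
    model = keras.Sequential([keras.Input(shape=input_shape)])
    for filters in (256, 128, 64):
        model.add(layers.Conv2D(filters, (3, 3), activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding="same"))
        model.add(layers.BatchNormalization())
    # Two refinement conv layers; 64 filters each is an assumption (counts are not stated).
    for kernel in ((3, 3), (2, 2)):
        model.add(layers.Conv2D(64, kernel, activation="relu", padding="same"))
        model.add(layers.MaxPooling2D(pool_size=(3, 3), strides=(2, 2), padding="same"))
        model.add(layers.BatchNormalization())
    model.add(layers.Flatten())
    model.add(layers.Dense(64, activation="relu"))
    model.add(layers.Dropout(0.2))
    model.add(layers.Dense(num_classes, activation="softmax"))
    model.compile(optimizer=keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```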
Both the traditional CNN and the modified CNN are CNN structures for music genre categorization; however, their complexity varies. The traditional CNN has a simpler structure with two Conv2D layers, each with 128 filters and a kernel size of (3, 3); this process pulls characteristics from the MFCC input. In comparison, the modified CNN utilizes a possibly more powerful design. It begins with three Conv2D layers, the first having 256 filters, the second 128 filters, and the third 64 filters, as shown in Fig. 10.




This augmentation is especially notable since it represents an honest attempt to refine the higher-level characteristics collected from the music data, which might lead to increased classification accuracy and richer feature extraction. The modified CNN gains higher model capacity from the extra convolutional layers and filters, which may lead to greater performance, but it must be monitored for overfitting. The traditional CNN delivers computational efficiency, but its limitations may restrict its ability to capture complicated genre-specific properties.

F. ARCHITECTURE OF RNN-LSTM MODEL
This work presents an advanced Recurrent Neural Network (RNN) architecture designed to represent sequential data and extract subtle information from audio inputs, enabling more precise genre categorization. Each layer in the architecture has been thoughtfully created to play a distinct function in the categorization process. Because of their fundamental connection with music's sequential and temporal aspects, RNNs can capture both short- and long-term dependencies, as stated in [29], and can be flexible enough to operate with a variety of input representations. The input layer of the RNN serves as the point of entry for audio data, which is presented in the form of sequential feature vectors. This study opts for Mel-frequency cepstral coefficients as the primary feature representation, encapsulating essential spectral information.

This article emphasizes the use of recurrent layers, specifically Long Short-Term Memory (LSTM) cells, to capture long-term temporal dependencies in audio data. The flattened feature vector is then connected to fully connected layers with Rectified Linear Units for non-linearity. To prevent overfitting, dropout layers (0.3) are strategically placed between fully connected layers, promoting diversity in feature reliance during training. L2 regularization (coefficient: 0.01) further ensures model robustness. The final output layer, matching the music genre count, employs softmax activation for probability distributions, with the highest probability determining the model's genre classification.

The proposed Recurrent Neural Network architecture shown in Fig. 11 defines a deep neural network with LSTM layers for processing sequential data. The network has numerous LSTM layers, including two with 128 and 64 units, respectively, all of which are fitted with L2 regularization to improve generalization and reduce overfitting. Following these recurrent layers are dense layers with Rectified Linear Unit activation functions, which are supplemented with dropout layers to counteract overfitting. The model culminates in an output layer with softmax activation, allowing for multi-class categorization into 10 separate music genres. This architectural arrangement uses the temporal correlations inherent in audio data, proving its potential to improve the accuracy and performance of music genre categorization tasks.

FIGURE 11. Proposed LSTM Model to Classify the Genre of Music.
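A minimal Keras sketch of the RNN-LSTM model follows, using the stated 128- and 64-unit LSTM layers, L2 regularization, 0.3 dropout, and a softmax output; the widths of the dense layers and the loss function are assumptions.

```python
from tensorflow import keras
from tensorflow.keras import layers, regularizers

def build_rnn_lstm(input_shape, num_classes=10, l2=0.01):
    """Stacked LSTM layers (128 and 64 units) with L2 regularization, dense ReLU layers
    with 0.3 dropout, and a softmax output over the genre classes."""
    model = keras.Sequential([
        keras.Input(shape=input_shape),              # (time_steps, n_mfcc)
        layers.LSTM(128, return_sequences=True,
                    kernel_regularizer=regularizers.l2(l2)),
        layers.LSTM(64, kernel_regularizer=regularizers.l2(l2)),
        layers.Dense(64, activation="relu"),         # dense widths are assumptions
        layers.Dropout(0.3),
        layers.Dense(32, activation="relu"),
        layers.Dropout(0.3),
        layers.Dense(num_classes, activation="softmax"),
    ])
    model.compile(optimizer="adam",
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model
```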

G. ARCHITECTURE OF SVM MODEL
In our study, the SVM architecture operates on the Mel-Frequency Cepstral Coefficient (MFCC) features obtained from audio samples for music genre classification. Initially, the audio files are sampled at a rate of 22,050 Hz, and each track is divided into parts lasting 30 seconds each. MFCC features are then computed for each segment via the Librosa package. The generated MFCC vectors are flattened into one-dimensional arrays and divided into training and testing sets. Xu et al. [32] propose effective algorithms to automatically classify and summarize music content, in which an SVM is used to classify music. Selecting the hyperplane in the feature space that best divides several classes is the foundation of SVMs, as opposed to neural networks. A linear SVM model is trained on the training set using the scikit-learn library's SVC class, with the regularization parameter (C) set to one. Subsequently, the trained model is used to predict genre labels for the test set. The accuracy_score function from scikit-learn is used to evaluate the SVM model's performance in classifying music genres. This architecture shows the process of feature extraction, model training, prediction, and evaluation in SVM-based music genre classification, with a focus on the audio sample rate and track duration parameters.
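The steps just listed map directly onto a short scikit-learn sketch; the helper names and the assumption that all clips yield equally sized MFCC matrices (fixed 30-second duration) are illustrative.

```python
import numpy as np
import librosa
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC
from sklearn.metrics import accuracy_score

def mfcc_vector(path, sr=22050, duration=30.0, n_mfcc=13):
    """Flatten the MFCC matrix of one 30-second track into a 1-D feature vector."""
    y, sr = librosa.load(path, sr=sr, duration=duration)
    mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=n_mfcc)
    return mfcc.flatten()  # assumes equal-length clips, so vectors share one size

def train_svm(paths, labels):
    """Linear SVM with C = 1 on flattened MFCC vectors, evaluated by accuracy."""
    X = np.array([mfcc_vector(p) for p in paths])
    y = np.array(labels)
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=0)
    clf = SVC(kernel="linear", C=1.0)
    clf.fit(X_train, y_train)
    return accuracy_score(y_test, clf.predict(X_test))
```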
H. ARCHITECTURE OF KNN MODEL
In our study, the k-Nearest Neighbors (kNN) approach for music genre classification starts by modeling each music sample as a set of statistical data taken from its audio signal, namely the mean and covariance matrices of its Mel-Frequency Cepstral Coefficients (MFCC). These attributes capture key audio elements, such as spectral content and timbral properties. The program then determines the distance between each pair of samples using a mathematical calculation known as the Kullback-Leibler divergence, which evaluates the difference in their feature distributions. The Kullback-Leibler divergence formula is as follows:

KL(\hat{y} \,\|\, y) = \sum_{c=1}^{M} \hat{y}_c \log \frac{\hat{y}_c}{y_c}   (8)

In the context of the kNN algorithm, \hat{y} represents the estimated or predicted probabilities of a data point belonging to each of the M classes, while y represents the true probabilities or the ground-truth distribution. \hat{y}_c and y_c denote the probabilities associated with category c in the distributions \hat{y} and y, respectively. The divergence is used as a distance metric to determine related data points for classification based on class probabilities.

\text{Euclidean Distance} = \sqrt{(p_1 - q_1)^2 + \cdots + (p_n - q_n)^2}   (9)
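For reference, a direct NumPy rendering of (8) and (9) is given below; the small epsilon added before the logarithm is an implementation detail, not part of the paper's formulation.

```python
import numpy as np

def kl_divergence(y_hat, y, eps=1e-12):
    """Discrete Kullback-Leibler divergence KL(y_hat || y) over M categories, as in (8)."""
    y_hat = np.asarray(y_hat, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(y_hat * np.log((y_hat + eps) / (y + eps))))

def euclidean_distance(p, q):
    """Euclidean distance between two feature vectors, as in (9)."""
    p, q = np.asarray(p, dtype=float), np.asarray(q, dtype=float)
    return float(np.sqrt(np.sum((p - q) ** 2)))
```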



In kNN, the distance is calculated between each pair of data points in the dataset using these formulas. Based on these distances, the kNN algorithm chooses the K nearest neighbors for each test sample. The genre labels of these neighbors are then utilized to estimate the genre of the test sample via a voting process, with the most frequently occurring genre among the neighbors selected as the predicted genre label. Finally, the model's accuracy is assessed by comparing the predicted genre labels to the actual genre labels of the test samples. This approach enables the kNN algorithm to categorize music samples into distinct genres based on auditory attributes and similarities to other samples in the database.
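A compact sketch of this neighbor-voting step is shown below; the choice of k = 5 and the default Euclidean metric are illustrative, and the distance argument can be swapped for the KL-based measure sketched earlier.

```python
import numpy as np
from collections import Counter

def knn_predict(x, train_features, train_labels, k=5, distance=None):
    """Predict a genre by majority vote among the k nearest training samples."""
    if distance is None:
        # Default to Euclidean distance, as in (9).
        distance = lambda p, q: float(np.linalg.norm(np.asarray(p) - np.asarray(q)))
    dists = [distance(x, f) for f in train_features]
    nearest = np.argsort(dists)[:k]                    # indices of the k closest samples
    votes = Counter(train_labels[i] for i in nearest)  # count genre labels among neighbors
    return votes.most_common(1)[0][0]                  # most frequent genre wins
```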
IV. RESULT & ANALYSIS
After training and evaluating the models, this study found that the modified CNN model demonstrated exceptional performance, achieving an impressive accuracy rate of 92.7% (shown in Fig. 13) on the GTZAN music genre classification dataset, where the base Convolutional Neural Network provided an accuracy of 85.56%, the SVM attained an accuracy of 79%, the FNN gave an accuracy of 76%, the Long Short-Term Memory model gave an accuracy of 73%, and kNN followed with an accuracy of 70%. Table 1 shows a detailed comparison of results. These results indicate that the proposed CNN model performs best on GTZAN's music genre classification dataset.

TABLE 1. Comparing Different Approaches and Their Accuracy on the GTZAN Dataset.

FIGURE 12. Modified Convolutional Neural Network model's loss after 100 epochs.

FIGURE 13. Obtained accuracy of the modified CNN model after 100 epochs.

The study also tested extensively on the Ballroom genre classification dataset, consisting of 698 occurrences of 10 genres. Mel-frequency cepstral coefficient data from the corresponding WAV files was used to train the models. The modified Convolutional Neural Network achieved the highest accuracy at 91.6%, followed closely by the typical CNN with 90%. The Feedforward Neural Network and Support Vector Machine models achieved accuracies of 74.1% and 76%, respectively. The k-Nearest Neighbors model had an accuracy of 68.5%, and the Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) achieved a performance of 71.6%.

TABLE 2. Comparing Different Approaches and Their Accuracy on the Ballroom Dataset.

These findings highlight the research's efficacy in using the modified CNN model to classify genres more accurately than the other examined models. The various accuracy results show the advantages and disadvantages of each model, offering insightful information for additional improvement and optimization in subsequent iterations of the classification system.

Precision in this study refers to the percentage of accurate positive predictions among all positive predictions generated by the model.

\text{Precision} = \frac{TP}{TP + FP}   (10)




The goal of recall is to find the most appropriate items from all the available options. It quantifies the percentage of predictions that came true out of the dataset's actual positive cases, where TP means true positive and FN refers to a false negative.

\text{Recall} = \frac{TP}{TP + FN}   (11)

The F1 score is an essential metric for evaluating the accuracy and reliability of classification models in various fields because it can balance the trade-off between precision and recall and effectively addresses imbalanced datasets. The F1 score was computed using the following formula:

F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN}   (12)
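In practice, (10)-(12) can be obtained per class and averaged with scikit-learn; the macro-averaging choice in the sketch below is an assumption, since the paper does not state how per-class scores are aggregated.

```python
from sklearn.metrics import precision_score, recall_score, f1_score

def report_metrics(y_true, y_pred):
    """Macro-averaged precision, recall, and F1 over the genre classes, following (10)-(12)."""
    return {
        "precision": precision_score(y_true, y_pred, average="macro"),
        "recall": recall_score(y_true, y_pred, average="macro"),
        "f1": f1_score(y_true, y_pred, average="macro"),
    }
```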
A. DATASET DESCRIPTION
In this research, the data were divided into two sets, one for training and the other for testing, at 70% and 30%, respectively, for both the GTZAN and ISMIR2004 datasets. This work splits the dataset 70/30 (70% used to train, 30% used to test), and the total number of iterations performed in the experiment is 2,180. The batch size is set to 32, and the number of epochs is set to 100. By comparison, De Sousa et al. [5] randomly sorted the GTZAN dataset of 1000 pieces and selected the first 667 to train and the last 333 to test the model created, meaning two-thirds (66.7%) of the dataset was used for training and one-third (33.3%) for testing; this process was repeated 30 times.
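A sketch of this split and training regime is given below, reusing a model factory such as the build_modified_cnn sketch shown earlier; the random seed and the use of the test split for validation during fitting are assumptions.

```python
from sklearn.model_selection import train_test_split

def train_and_evaluate(X, y, build_model, batch_size=32, epochs=100):
    """70/30 split with the training settings described above (batch size 32, 100 epochs).

    X holds per-track MFCC feature arrays and y the integer genre labels;
    build_model is a model factory, e.g. the build_modified_cnn sketch above.
    """
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=42)
    model = build_model(input_shape=X_train.shape[1:], num_classes=len(set(y)))
    model.fit(X_train, y_train, validation_data=(X_test, y_test),
              batch_size=batch_size, epochs=epochs)
    return model.evaluate(X_test, y_test, verbose=0)
```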
B. EXPERIMENTAL ENVIRONMENT
The experiment was conducted on a Linux-based virtual machine with an NVIDIA Tesla T4 GPU, executing 2048 iterations and 100 epochs. Adam [16], an optimization technique, was used to reduce difficulties in deep neural network training, with a 0.0001 learning rate in the modified architecture.

C. EXPERIMENTAL RESULT
The architecture of the proposed model reaches a remarkable accuracy rate of 92.7% based on the modified CNN model. Table 3 shows the result of the method compared with Cheng et al. [4], De Sousa et al. [5], Elbir et al. [33], Lidy et al. [14], and Bergstra et al. [2]. All these research studies split the GTZAN dataset into 70% for the training set and 30% for the test set; this study converted all the audio in the dataset into the corresponding MFCCs and sent them to the proposed CNN model for training. It is important to consider alternative approaches: for example, N. Karunakaran and A. Arya [11] employed Principal Component Analysis (PCA) for dimensionality reduction, selected the first 30 principal components, and trained and tested different machine learning models with 10-fold stratified sampling (9 folds for training and 1 fold for testing). Prabhakar et al. [23] explored a deep learning approach using a BAG model, achieving 93.51% accuracy across three datasets: GTZAN, ISMIR 2004, and MagnaTagATune. Figs. 12 and 13 show the loss curve and accuracy curve of the model, respectively.

TABLE 3. Comparison With the State of the Art.

V. DISCUSSION AND FUTURE WORK
The usefulness of many machine learning models, such as k-Nearest Neighbors, Feedforward Neural Networks, Convolutional Neural Networks, Support Vector Machines, and Recurrent Neural Networks with Long Short-Term Memory (RNN-LSTM), for the categorization of musical genres was investigated in this study. Compared to other models, CNNs obtained better results, demonstrating their capacity to identify complex spectrogram patterns, a crucial skill for genre classification. However, the interpretability of CNNs is limited because of their black-box character, highlighting the need for more study to improve model comprehension. The choice of the dataset, traditionally GTZAN, should be reconsidered in favor of more extensive, diverse, and real-world datasets to better represent modern music genres. Addressing dataset bias is crucial for improving model generalizability. Music genre classification models not only serve their primary purpose but also impact music recommendation systems, content organization, and staying updated with musical trends. Further research can expand these models' applications, incorporating user behavior analysis, cross-cultural music knowledge, and additional features like lyrics analysis, enhancing the overall listening experience on digital platforms. To ensure fair comparisons, continuous improvement of evaluation criteria and benchmarking standards is necessary. Defined criteria and methods will facilitate impartial model evaluations. In conclusion, this study highlights the potential of deep learning models, particularly CNNs, for music genre classification.



Future research should prioritize interpretability, diverse datasets, multimodal techniques, and practical applications to enhance precision and utility in music genre categorization systems, benefiting both the music industry and users.

VI. CONCLUSION
This study demonstrates the efficacy of convolutional neural networks by classifying music genres with an impressive 92.7% accuracy. However, difficulties remain, since genre separation is complex and impacted by personal, cultural, and historical variables. The study recognizes the intricacy of music genre classification and the necessity for more investigation. Even with exceptional results on the GTZAN dataset, there is still a lot of new ground to be explored. Future research should broaden its scope and concentrate on other aspects of audio, such as spectral quality, rhythmic patterns, and lyric analysis. To improve accuracy and capture a variety of genre features, more complex model architectures can be used, including ensemble learning and attention mechanisms. Classification across genres and cultures is a growing field of study that presents fascinating problems. In summary, this work represents a development in categorizing musical genres and highlights the need for more research. Investigating the proposed paths can enhance our knowledge of musical genres, impacting information structure, music suggestion programs, and wider uses in the constantly changing music sector and worldwide consumer inclinations.

ACKNOWLEDGMENT
The authors would like to sincerely thank the Advanced Machine Intelligence Research Lab (AMIR Lab) for its continuous support and guidance in achieving the goals of this work.
[24] R. Prey, “Nothing personal: Algorithmic individuation on mu-
wanted to achieve. sic streaming platforms. media,” Culture Soc., vol. 400, no. 7,
pp. 1086–1100, 2018.
[25] H. Purwins, B. Li, T. Virtanen, J. Schlüter, S.-Y. Chang, and T. Sainath,
REFERENCES “Deep learning for audio signal processing,” IEEE J. Sel. Topics Signal
[1] M. Ashraf et al., “A hybrid CNN and RNN variant model for music Process., vol. 13, no. 2, pp. 206–219, May 2019.
classification,” Appl. Sci., vol. 130, no. 3, 2023, Art. no. 476. [26] P. J. Rentfrow and S. D. Gosling, “The do re mi’s of everyday life:
[2] J. Bergstra, N. Casagrande, D. Erhan, D. Eck, and B. Kégl, “Aggre- The structure and personality correlates of music preferences,” J. Pers.
gate features and ADABOOST for music classification,” Mach. Learn., Social Psychol., vol. 840, no. 6, 2003, Art. no. 1236.
vol. 65, pp. 473–484, 2006. [27] A. Schindler, T. Lidy, and A. Rauber, “Comparing shallow versus deep
[3] D. Bogdanov et al., “Essentia: An audio analysis library for music neural network architectures for automatic music genre classification,”
information retrieval,” in Proc. 14th Conf. Int. Soc. Music Inf. Retrieval, in Proc. 9th Forum Media Technol., 2016, pp. 17–21.
A. Britto, F. Gouyon, and S. Dixon, eds., Curitiba, Brazil, Nov. 2013, [28] S. Sugianto and S. Suyanto, “Voting-based music genre classifica-
pp. 493–498. tion using melspectogram and convolutional neural network,” in Proc.
[4] Y.-H. Cheng, P.-C. Chang, and C.-N. Kuo, “Convolutional neural net- IEEE Int. Seminar Res. Inf. Technol. Intell. Syst., 2019, pp. 330–333,
works approach for music genre classification,” in Proc. IEEE Int. doi: 10.1109/ISRITI48646.2019.9034644.
Symp. Comput., Consum. Control, 2020, pp. 309–403. [29] C. P. Tang, K. L. Chui, Y. K. Yu, Z. Zeng, and K. H. Wong, “Music
[5] J. M. de Sousa, E. T. Pereira, and L. R. Veloso, “A robust music genre genre classification using a hierarchical long short term memory (lstm)
classification approach for global and regional music datasets evalua- model,” Proc. SPIE, vol. 10828, pp. 334–340, 2018.
tion,” in Proc. IEEE Int. Conf. Digit. Signal Process., 2016, vol. 35, [30] G. Tzanetakis and P. Cook, “Musical genre classification of audio sig-
pp. 109–113. nals,” IEEE Trans. Speech Audio Process., vol. 10, no. 5, pp. 293–302,
[6] M. Dong, “Convolutional neural network achieves human-level accu- Jul. 2002.
racy in music genre classification,” 2018, arXiv:1802.09697. [31] E. Unal, E. Chew, P. G. Georgiou, and S. S. Narayanan, “Chal-
[7] N. Farajzadeh, N. Sadeghzadeh, and M. Hashemzadeh, “PMG-Net: lenging uncertainty in query by humming systems: A fingerprinting
Persian music genre classification using deep neural networks,” Enter- approach,” IEEE Trans. Audio, Speech, Lang. Process., vol. 160, no. 2,
tainment Comput., vol. 44, 2023, Art. no. 100518. pp. 359–371, Feb. 2008.
[8] P. Fulzele, R. Singh, N. Kaushik, and K. Pandey, “A hybrid model for [32] C. Xu, N. C. Maddage, and Xi. Shao, “Automatic music classification
music genre classification using LSTM and SVM,” in Proc. IEEE 11th and summarization,” IEEE Trans. Speech Audio Process., vol. 13, no. 3,
Int. Conf. Contemporary Comput., 2018, pp. 1–3. pp. 441–450, May 2005.
[9] A. Ishraq, A. A. Lima, M. M. Kabir, M. S. Rahman, and M. F. Mridha, [33] A. E. H. B. Çam, M. E. Iyican, B. Öztürk, and N. Aydin, “Music
“Assessment of building damage on post-hurricane satellite imagery genre classification and recommendation by using machine learning
using improved CNN,” in Proc. IEEE Int. Conf. Decis. Aid Sci. Appl., techniques,” in Proc. IEEE Innovations Intell. Syst. Appl. Conf., 2018,
2022, pp. 665–669. pp. 1–5.
