Musical Genre Classification Using Advanced Audio Analysis and Deep Learning Techniques
ABSTRACT Classifying music by genre has become a significant problem in the era of seamless music streaming platforms and ever-growing content creation. Accurate music genre classification is a fundamental task with applications in music recommendation, content organization, and understanding musical trends. This study presents a comprehensive approach to music genre classification using deep learning and advanced audio analysis techniques. It examines the task on the GTZAN dataset using Convolutional Neural Network (CNN), Feedforward Neural Network (FNN), Support Vector Machine (SVM), k-Nearest Neighbors (kNN), and Long Short-Term Memory (LSTM) models. Features are extracted from pre-processed audio data, the models are cross-validated, and their performance is then evaluated. The modified CNN model performs better than the conventional NN models by using its capacity to capture complex spectrogram patterns. These results highlight how deep learning algorithms may improve systems for categorizing music genres, with implications for various music-related applications and user interfaces. The proposed model achieves an accuracy of 92.7% on the GTZAN dataset and 91.6% on the ISMIR2004 Ballroom dataset.
INDEX TERMS Convolutional neural networks, long short-term memory, support vector machine, k-nearest
neighbors, genre classification.
I. INTRODUCTION
Humans classify music into genres based on what they think of the music, how comfortable they are with the style, and their capacity to make decisions between genres that are unclear. This makes classifying music genres in the field of Music Information Retrieval (MIR) difficult. Sound processing, audio synthesis, audio effect creation, and music information retrieval depend on the extraction of audio features. David et al. [19] proposed a method to assess current audio feature extraction toolboxes and libraries by thoroughly examining their coverage, effort, presentation, and latency. There are several toolboxes for extracting audio features, such as "Essentia," which is recommended by Bogdanov et al. [3], and "Librosa," described in McFee et al. [18] as a Python package for audio and music signal processing. Purwins et al. [25] discussed convolutional neural network and long short-term memory architectural variations among the deep learning models studied, as well as other popular feature representations such as log-mel spectra and raw waveforms. Previous research also indicated that participants listened to music more often than any of the other surveyed activities in Pachet et al. [22] (i.e., watching television, reading books, and watching movies). The Pachet et al. [22] study examines the connections between various factors, including individual characteristics, self-perceptions, cognitive capacities, and musical preferences. Music is essentially subjective because many people experience the same song differently.
This study focuses on the model's ability to identify music based on objective audio elements but does not examine how well these categories match human perceptions of genre. However, the current problem is to organize and manage the millions of music titles produced by society, as suggested by Rentfrow et al. [26]. In the context of artists and content creators, genre classification makes the production process more efficient and enables artists to target specific audiences and refine their craft. Moreover, online music streaming platforms (Prey et al. [24]) and the film and video game industries benefit significantly from accurate genre classification. More significantly, this makes it difficult to fully understand and simulate people's musical tastes.

The fingerprinting (FP) approach was implemented by Unal et al. [31] to solve the difficult problem of resilient data extraction in query-by-humming (QBH) systems under unpredictable conditions. The use of Mel Frequency Cepstral Coefficients in the classification of musical instruments is discussed by Loughran et al. [17]. MFCCs represent the spectral characteristics of audio signals, providing a compact yet highly informative feature set for music analysis. Previous research by Jensen et al. [10] summarizes how the MFCC is calculated. MFCCs capture key attributes of sound, such as timbre and pitch, which are instrumental in discerning musical genres.

However, to improve the model's generalization over a greater range of musical styles, bigger and more diverse music datasets should be used for testing and training. Improving the model's ability to capture the complex musical features that characterize genres may require developing more advanced feature extraction techniques. The model improves its knowledge of musical differences by exploring audio data and identifying hidden components. Experimenting with various machine learning models and hyperparameter optimization techniques can help identify the best architecture for a specific genre classification problem, optimizing performance on the selected dataset and genres. Beyond the obvious classifications, this work explores the subtle characteristics that give each genre its persona.

The contributions of this article are as follows:
• This study extracted features from the WAV-formatted 30-second music files provided in the dataset so that those extracted features can be re-used to train the proposed model and to test and evaluate it.
• This study attempted CNN, LSTM, SVM, kNN, and FNN models to classify music genres and then proposed the modified Convolutional Neural Network model, which gives the best accuracy among all the trained models on both the GTZAN music genre classification dataset and the ISMIR2004 Ballroom dataset.
• The study evaluated the proposed Convolutional Neural Network model and compared it to the other models and to some well-known published papers.

The rest of this article is organized as follows. Section II presents a thorough overview of relevant studies in the realm of music genre classification. Section III thoroughly explains the suggested dataset, algorithm, and model architectures and designs. Section IV contains the analysis results and a description of the dataset, the experimental environment, and the experiment results. Section V outlines the arguments made for the suggested system and recommends possible directions for further study. Section VI brings this article to a close.

II. RELATED WORK
This study explores music genre classification using a combination of machine learning and deep learning methods. Inspired by the traditional approach of analyzing instruments in music, the study turns to automated methods due to the advancements in machine learning.

Cheng et al. [4] used Convolutional Neural Networks with five convolutional layers to classify the genres of music. The accuracy they obtained was 83.3%; their hop size was set to 256 with a fast Fourier transform on 1024 frames. There was another approach by Ndou et al. [21] using traditional machine learning and deep learning. They presented a thoroughly reviewed paper on those approaches and concluded their study with an accuracy of 92.69% using k-Nearest Neighbors, while their Convolutional Neural Network provided an accuracy of 72.40%. Sugianto et al. studied voting-based music genre classification [28]. They also used the GTZAN dataset for music genre classification and obtained a voting-scheme accuracy of 71.87% and a single-scheme accuracy of 63.49%, showing that the voting scheme offers greater accuracy than the single scheme. Prabhakar et al. [23] mentioned five different approaches, namely WVG-ELNSC, SDA, RA-TSM, TSVM, and BAG, proposed for music genre classification in their study. They obtained 93.51% accuracy using the proposed deep learning BAG model on the GTZAN, ISMIR 2004, and MagnaTagATune datasets. Another work was featured by Ashraf et al. [1], where a hybrid CNN and RNN variant model was implemented. They achieved an accuracy of 89.30% by using a hybrid approach that combines CNN and Bi-GRU when the features were Mel-spectrograms, whereas with MFCC they got 73.69% on the same hybrid model. They also enriched their study with the use of hybrid models such as CNN-GRU, CNN-BiLSTM, and CNN-LSTM, applying both MFCC and Mel-spectrogram features on the hybrid models. Kostrzewa et al. [12] studied the classification of music genres; in their study, the FMA tiny dataset was used in several tests to examine the effectiveness and performance of various models. The CNN and 1-D CRNN models produced the best outcomes, but the 2-D CRNN and LSTM models considerably underperformed. A growing need for sophisticated Music Information Retrieval (MIR) approaches to categorize digital music files into various genres was discussed in the paper by Mutiara et al. [20]; the study emphasizes the necessity for an automatic genre tagging system rather than manual genre labeling. The optimal audio feature mixture, combining musical surface, Mel-Frequency Cepstrum Coefficients, tonality, and LPC, achieved a remarkable classification accuracy of 76.6%. Another study by Farajzadeh et al. [7] presents PMG-Net, a customized deep neural network-based technique for music genre classification.
FIGURE 6. Amplitude vs. Frequency graph after Fast Fourier Transform on audio data from the GTZAN dataset.

The STFT calculation is represented by the formula X(nΔt, mΔf), where n and m index time and frequency, respectively. Δt signifies the time resolution, Δf denotes the frequency resolution, and the summation considers windowed signal values along with a complex exponential term to account for frequency components. This method offers information on how the frequency content of a signal varies over time, which is helpful for tasks like spectrogram synthesis and audio analysis.

$$X(n\Delta t, m\Delta f) = \sum_{p=n-Q}^{n+Q} w((n-p)\Delta t)\, x(p\Delta t)\, e^{-j2\pi p m \Delta t \Delta f} \quad (1)$$

The discrete Fourier transform of a signal is

$$X_k = \sum_{n=0}^{N-1} x_n \, e^{-i2\pi kn/N}$$

X_k represents the k-th coefficient in the frequency domain, n is the index variable used for the summation, and x_n represents the n-th data point in the original signal. i is the imaginary unit, used to create the complex exponential term, and N represents the total number of samples. Splitting the sum into even- and odd-indexed samples gives the radix-2 fast Fourier transform decomposition:

$$X_k = \sum_{m=0}^{N/2-1} x_{2m}\, e^{-i2\pi km/(N/2)} + e^{-i2\pi k/N} \sum_{m=0}^{N/2-1} x_{2m+1}\, e^{-i2\pi km/(N/2)} \quad (2)$$
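Equation (2) can be checked numerically; the short NumPy sketch below is illustrative only and is not part of the authors' pipeline.

```python
import numpy as np

# Numerical check of the radix-2 split in (2): the DFT of a length-N signal
# equals the DFT of its even-indexed samples plus a twiddle factor times the
# DFT of its odd-indexed samples (both sub-DFTs have period N/2 in k).
rng = np.random.default_rng(0)
N = 1024
x = rng.standard_normal(N)

k = np.arange(N)
X_direct = np.fft.fft(x)                  # reference DFT of the full signal

X_even = np.fft.fft(x[0::2])              # DFT of even-indexed samples, length N/2
X_odd = np.fft.fft(x[1::2])               # DFT of odd-indexed samples, length N/2
twiddle = np.exp(-2j * np.pi * k / N)     # e^{-i 2 pi k / N}

X_split = np.tile(X_even, 2) + twiddle * np.tile(X_odd, 2)

print(np.allclose(X_direct, X_split))     # True: (2) reproduces the full DFT
```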
The hop size, set at 512 samples, is vital in controlling the analysis frequency. The preprocessing is facilitated through the utilization of the Librosa library, a powerful tool in the field of audio analysis. The preprocessing pipeline outlined in this experiment encompasses a series of carefully designed steps that transform raw audio data into a structured and informative representation.

$$C(x(t)) = F^{-1}\left[\log\left(F[x(t)]\right)\right] \quad (3)$$

x(t) represents a signal in the time domain, where t denotes time; in music, signals are audio waveforms representing sound over time. F is the Fourier transform, which is used in signal processing to convert signals from the time domain to the frequency domain. Taking log(F[x(t)]) compresses the dynamic range of the signal. We used the inverse transformation F^-1 for feature extraction, so after calculating the logarithm of the transformed signal, an inverse transformation returns it to a new domain (the cepstral domain).

Fig. 8 shows a visual representation of a rock audio file's spectrogram and MFCC. These Mel Frequency Cepstral Coefficients serve as the basis for the subsequent analysis.
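The preprocessing pipeline described above can be reproduced along the following lines with Librosa; this is a minimal sketch in which the file path, sampling rate, frame size (n_fft = 2048), and the number of MFCC coefficients are assumptions, since the text only states the 512-sample hop size and the use of Librosa.

```python
import numpy as np
import librosa

# Minimal preprocessing sketch for one 30-second clip. The file name is a
# placeholder for a GTZAN WAV file; hop_length=512 follows the text, while
# sr, n_fft, and n_mfcc are assumed values.
y, sr = librosa.load("blues.00000.wav", sr=22050, duration=30.0)

# Short-time Fourier transform and a dB-scaled spectrogram.
stft = librosa.stft(y, n_fft=2048, hop_length=512)
spectrogram_db = librosa.amplitude_to_db(np.abs(stft), ref=np.max)

# Mel Frequency Cepstral Coefficients used as model input features.
mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13, n_fft=2048, hop_length=512)

print(spectrogram_db.shape, mfcc.shape)   # (frequency bins, frames), (13, frames)
```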
The dense layers range from 2048 to 64 neurons. These layers record low-level and high-level data representations, capturing complex relationships between traits and musical genres. A dropout layer follows each dense layer to prevent overfitting, while softmax activation is used in the output layer. Adam, an optimizer, helps modify the model's weights throughout training to minimize loss. The model also undergoes L2 regularization to manage complexity. The goal is to accurately classify music genres based on complex feature correlations in audio data.

Here, the softmax function P(y_i) = e^{z_i} / Σ_j e^{z_j} converts class scores to probabilities, showing the probability that the input belongs to each class i. It exponentiates the scores z_i to ensure positivity before normalizing them to produce a probability distribution over all classes.

This carefully designed CNN architecture, also represented in Fig. 9, showcases a hierarchy of feature extraction capabilities, making it well-suited for a range of image classification tasks.
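A sketch of such an architecture in Keras is given below. Only the overall pattern is taken from the text (dense layers shrinking from 2048 to 64 units, dropout after each dense layer, L2 regularization, a softmax output over the 10 genres, and the Adam optimizer); the convolutional front end, the intermediate dense sizes, the dropout rate, and the input shape are assumptions, not the authors' exact configuration.

```python
import tensorflow as tf
from tensorflow.keras import layers, models, regularizers

NUM_GENRES = 10  # GTZAN covers 10 genres

def build_cnn(input_shape=(13, 130, 1)):
    """Illustrative CNN: an assumed convolutional front end followed by the
    dense stack described in the text (2048 down to 64 units, dropout after
    each dense layer, L2 regularization, softmax output)."""
    model = models.Sequential([layers.Input(shape=input_shape)])
    for filters in (32, 64, 128):                       # filter counts are assumed
        model.add(layers.Conv2D(filters, (3, 3), padding="same", activation="relu"))
        model.add(layers.MaxPooling2D((2, 2)))
    model.add(layers.Flatten())
    for units in (2048, 512, 128, 64):                  # intermediate sizes assumed
        model.add(layers.Dense(units, activation="relu",
                               kernel_regularizer=regularizers.l2(1e-4)))
        model.add(layers.Dropout(0.3))                  # dropout rate assumed
    model.add(layers.Dense(NUM_GENRES, activation="softmax"))
    model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),
                  loss="sparse_categorical_crossentropy",
                  metrics=["accuracy"])
    return model

model = build_cnn()   # input shape (13, 130, 1): e.g., MFCCs of a short segment
model.summary()
```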
FIGURE 13. Obtained accuracy of the modified CNN model after 100 epochs.
For kNN, the distance is calculated between each pair of data points in the dataset using this formula. Based on these distances, the kNN algorithm chooses the K nearest neighbors for each test sample. The genre labels of these neighbors are then utilized to estimate the genre of the test sample via a voting process, with the most frequently occurring genre among the neighbors selected as the predicted genre label. Finally, the model's accuracy is assessed by comparing the predicted genre labels to the actual genre labels of the test samples. This approach enables the kNN algorithm to categorize music samples into distinct genres based on auditory attributes and similarities to other samples in the database.
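As a concrete illustration of this voting procedure, the following NumPy sketch computes Euclidean distances, picks the k nearest training samples, and takes a majority vote; the toy feature vectors (standing in for per-clip MFCC statistics) and k = 5 are assumptions made for the example.

```python
import numpy as np
from collections import Counter

def knn_predict(train_feats, train_labels, test_feats, k=5):
    """Predict a genre for each test vector by majority vote among the
    k nearest training vectors (Euclidean distance)."""
    predictions = []
    for x in test_feats:
        dists = np.linalg.norm(train_feats - x, axis=1)    # distance to every training point
        nearest = np.argsort(dists)[:k]                     # indices of the k nearest neighbors
        votes = Counter(train_labels[i] for i in nearest)   # tally the neighbors' genre labels
        predictions.append(votes.most_common(1)[0][0])      # most frequent genre wins
    return predictions

# Toy usage: random vectors standing in for per-clip MFCC features.
rng = np.random.default_rng(1)
X_train = rng.standard_normal((30, 13))
y_train = np.array(["blues", "rock", "jazz"] * 10)
X_test = rng.standard_normal((4, 13))
print(knn_predict(X_train, y_train, X_test, k=5))
```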
IV. RESULT & ANALYSIS
After the training and evaluation of the models, this study found that the modified CNN model demonstrated exceptional performance, achieving an impressive accuracy rate of 92.7% (shown in Fig. 13) on the GTZAN music genre classification dataset, where the base Convolutional Neural Network provided an accuracy of 85.56%, the SVM attained an accuracy of 79%, the FNN gave an accuracy of 76%, the Long Short-Term Memory gave an accuracy of 73%, and kNN followed with an accuracy of 70%. Table 1 shows a detailed comparison of results. The results state that the proposed CNN model performs better on GTZAN's music genre classification dataset.

The study also tested extensively on the Ballroom genre classification dataset, consisting of 698 occurrences of 10 genres. Mel-frequency cepstral coefficient data from the corresponding WAV files was used to train the models. The modified Convolutional Neural Network achieved the highest accuracy at 91.6%, followed closely by the typical CNN with 90%. The Feedforward Neural Network and Support Vector Machine models achieved accuracies of 74.1% and 76%, respectively. The k-Nearest Neighbors model had an accuracy of 68.5%, and the Recurrent Neural Network with Long Short-Term Memory (RNN-LSTM) achieved a performance of 71.6%.

These findings highlight the research's efficacy in using the modified CNN model to classify genres more accurately than the other examined models. The various accuracy results show the advantages and disadvantages of each model, offering insightful information for further improvement and optimization in subsequent iterations of the classification system.

Precision in this study refers to the percentage of accurate positive predictions among all positive predictions generated by the model.

$$\text{Precision} = \frac{TP}{TP + FP} \quad (10)$$

The goal of recall is to find the most appropriate items from all the available options. It quantifies the percentage of predictions that came true out of the dataset's actual positive cases, where TP means true positive and FN refers to a false negative.

$$\text{Recall} = \frac{TP}{TP + FN} \quad (11)$$

The F1 score is an essential metric for evaluating the accuracy and reliability of classification models in various fields because it balances the trade-off between precision and recall and effectively addresses imbalanced datasets. The F1 score was computed using the following formula:

$$F1 = \frac{2 \cdot \text{Precision} \cdot \text{Recall}}{\text{Precision} + \text{Recall}} = \frac{2 \cdot TP}{2 \cdot TP + FP + FN} \quad (12)$$
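To make the use of (10)-(12) concrete, the sketch below computes the three metrics from predicted and true genre labels with scikit-learn; the toy labels and the macro-averaging over classes are illustrative assumptions, not the authors' exact evaluation code.

```python
import numpy as np
from sklearn.metrics import precision_score, recall_score, f1_score

# Toy predictions over three genres. With more than two classes, TP, FP, and FN
# are counted per class; macro-averaging (an assumption here) then averages the
# per-class precision, recall, and F1 values.
y_true = np.array(["blues", "rock", "jazz", "rock", "blues", "jazz"])
y_pred = np.array(["blues", "rock", "rock", "rock", "jazz", "jazz"])

precision = precision_score(y_true, y_pred, average="macro", zero_division=0)
recall = recall_score(y_true, y_pred, average="macro", zero_division=0)
f1 = f1_score(y_true, y_pred, average="macro", zero_division=0)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f}")
```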
cases. Where TP means true positive, and FN refers to a false the test set, converted all the audio in the dataset into their
negative. respective MFCC, and sent them to the proposed CNN model
TP for training. It is important to consider alternative approaches,
Recall = (11) such as N. Karunakaran and A. Arya [11] employed Princi-
T P + FN
pal Component Analysis (PCA) for dimensionality reduction
The F1 score is an essential metric for evaluating the accuracy and selected the 30 principle components, different Machine
and reliability of classification models in various fields be- learning models are trained and tested with 10-fold stratified
cause it can balance the trade-off between precision and recall samples, 9 folds for training, and 1 fold for testing. Prab-
and effectively addresses imbalanced datasets. The F1 score hakar et al. [23] explored a deep learning approach using a
was computed using the following formula: BAG model, achieving 93.51% accuracy across three datasets:
2 ∗ Precision ∗ Recall 2 ∗ TP GTZAN, ISMIR 2004, and MagnaTagATune. Figs. 12 and
F1 = = (12) 13 show the loss curve and accuracy curve of the model,
Precision + Recall 2 ∗ T P + FP + FN
respectively.
A. DATASET DESCRIPTION
In this research, the data were divided into two sets. One is for V. DISCUSSION AND FUTURE WORK
training and the other is for testing, which is 70% and 30% The usefulness of many machine learning models, such as
for both GTZAN and ISMIR2004 datasets, respectively. The K-nearest neighbors, Feedforward Neural Networks, Con-
following papers split the dataset 70/30 (70% used to train, volutional Neural Networks, Support Vector Machines, and
30% used to test), and the total number of iterations performed Recurrent Neural Networks with Long Short-Term Memory
in the experiment is 2,180. The batch size is set to 32, and the (RNN-LSTM), for the categorization of musical genres
epochs are set to 100. De Sousa et al. [5] randomly sorted the was investigated in this study. Compared to other models,
GTZAN dataset with 1000 pieces and selected the first 667 to CNNs obtained better results, demonstrating their capacity
train and the latest 333 to test the model created. Representing to identify complex spectrogram patterns–a crucial skill for
two-thirds (66.7%) of the dataset was used for training, and genre classification. However, the interpretability of CNNs is
one-third (33.3%) for testing. This process was repeated 30 limited because of their black-box character, highlighting the
times. need for more study to improve model comprehension. The
choice of the dataset, traditionally GTZAN, should be recon-
B. EXPERIMENTAL ENVIRONMENT sidered for more extensive, diverse, and real-world datasets to
The experiment was conducted on a Linux-based virtual better represent modern music genres. Addressing dataset bias
computer with an NVIDIA Tesla T4 GPU, executing 2048 is crucial for improving model generalizability. Music genre
iterations and 100 epochs. Adam [16], an optimization tech- classification models not only serve their primary purpose
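The training configuration described in Sections IV-A and IV-B (a 70/30 split, batch size 32, 100 epochs, and Adam with a 0.0001 learning rate) can be sketched as follows; the synthetic features and the small placeholder network are stand-ins for the real MFCC inputs and the modified CNN.

```python
import numpy as np
import tensorflow as tf
from sklearn.model_selection import train_test_split

# Synthetic stand-ins: 500 "clips" of 13 averaged MFCCs and 10 genre labels.
rng = np.random.default_rng(0)
X = rng.standard_normal((500, 13)).astype("float32")
y = rng.integers(0, 10, size=500)

# 70% training / 30% testing, as stated in the text.
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.30, random_state=42)

# Placeholder network; in the study this is the modified CNN.
model = tf.keras.Sequential([
    tf.keras.Input(shape=(13,)),
    tf.keras.layers.Dense(64, activation="relu"),
    tf.keras.layers.Dense(10, activation="softmax"),
])
model.compile(optimizer=tf.keras.optimizers.Adam(learning_rate=1e-4),  # 0.0001, from the text
              loss="sparse_categorical_crossentropy", metrics=["accuracy"])

model.fit(X_train, y_train, batch_size=32, epochs=100, verbose=0)      # batch size and epochs from the text
print(model.evaluate(X_test, y_test, verbose=0))
```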
C. EXPERIMENTAL RESULT
The architecture of the proposed model reaches a remarkable accuracy rate of 92.7% based on the modified CNN model. Table 3 shows the result of the method compared with Cheng et al. [4], De Sousa et al. [5], Elbir et al. [33], Lidy et al. [14], and Bergstra et al. [2]. All these research studies split the GTZAN dataset into 70% for the training set and 30% for the test set, converted all the audio in the dataset into their respective MFCCs, and sent them to the proposed CNN model for training. It is important to consider alternative approaches: for example, N. Karunakaran and A. Arya [11] employed Principal Component Analysis (PCA) for dimensionality reduction and selected the 30 principal components, and different machine learning models were trained and tested with 10-fold stratified samples, 9 folds for training and 1 fold for testing. Prabhakar et al. [23] explored a deep learning approach using a BAG model, achieving 93.51% accuracy across three datasets: GTZAN, ISMIR 2004, and MagnaTagATune. Figs. 12 and 13 show the loss curve and accuracy curve of the model, respectively.

V. DISCUSSION AND FUTURE WORK
The usefulness of several machine learning models, such as k-Nearest Neighbors, Feedforward Neural Networks, Convolutional Neural Networks, Support Vector Machines, and Recurrent Neural Networks with Long Short-Term Memory (RNN-LSTM), for the categorization of musical genres was investigated in this study. Compared to the other models, CNNs obtained better results, demonstrating their capacity to identify complex spectrogram patterns, a crucial skill for genre classification. However, the interpretability of CNNs is limited because of their black-box character, highlighting the need for more study to improve model comprehension. The choice of dataset, traditionally GTZAN, should be reconsidered in favor of more extensive, diverse, and real-world datasets to better represent modern music genres. Addressing dataset bias is crucial for improving model generalizability. Music genre classification models not only serve their primary purpose but also impact music recommendation systems, content organization, and staying updated with musical trends. Further research can expand these models' applications, incorporating user behavior analysis, cross-cultural music knowledge, and additional features like lyrics analysis, enhancing the overall listening experience on digital platforms. To ensure fair comparisons, continuous improvement of evaluation criteria and benchmarking standards is necessary; defined criteria and methods will facilitate impartial model evaluations. In conclusion, this study highlights the potential of deep learning models, particularly CNNs, for music genre classification.