The Implementation of A Proposed Deep-Learning Alg
Research Article
Lili Liu*
According to the analysis of the music signal, the prelude, the end, and other relatively less important moments could be given less attention, whereas the climax section of a song could highlight the rhythm style of the whole piece. There, a higher degree of attention should be given so that the network can employ its limited resources and pay more attention to the most salient and closely related information in an input sequence. Thus, an attention mechanism is also added to RNNs to improve their performance, so that different time series features can be assigned different weights when the model is trained. Hence, the attention probability distribution corresponding to the feature representation is calculated through the attention mechanism. Therefore, the obtained feature representation could more accurately characterize musical characteristics and improve classification accuracy.
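To make the mechanism concrete, the following minimal PyTorch sketch pools GRU states with a learned attention distribution, so that salient (e.g., climactic) frames receive larger weights than prelude or ending frames. The feature dimension, hidden size, and genre count are illustrative assumptions, not the paper's configuration.

```python
import torch
import torch.nn as nn

class AttentiveRNNClassifier(nn.Module):
    """GRU encoder with additive attention pooling over time steps."""

    def __init__(self, n_features: int, hidden: int, n_genres: int):
        super().__init__()
        self.rnn = nn.GRU(n_features, hidden, batch_first=True)
        self.score = nn.Linear(hidden, 1)      # one relevance score per time step
        self.head = nn.Linear(hidden, n_genres)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, time, n_features), e.g. frame-level audio features
        states, _ = self.rnn(x)                             # (batch, time, hidden)
        weights = torch.softmax(self.score(states), dim=1)  # attention distribution over time
        context = (weights * states).sum(dim=1)             # weighted summary of the sequence
        return self.head(context)                           # genre logits

model = AttentiveRNNClassifier(n_features=40, hidden=64, n_genres=10)
logits = model(torch.randn(8, 120, 40))   # 8 clips, 120 frames, 40 features each
print(logits.shape)                       # torch.Size([8, 10])
```

Because the attention weights form a probability distribution over time steps, frames judged irrelevant contribute almost nothing to the pooled representation, which is the behavior the passage above describes.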
Even though the body of related research covering deep-learning methods grows exponentially, the number of studies on this task has still been limited. Nevertheless, deep-learning-based methods have recently gained momentum for deriving music features and classifying music genres. Prabhakar and Lee [4] proposed five interesting and novel approaches for music genre classification: a weighted visibility graph-based elastic net sparse classifier, a stacked denoising autoencoder classifier, Riemannian alliance tangent space mapping transfer learning, a transfer support vector machine algorithm, and lastly a bidirectional long short-term memory attention model with a graphical convolution network. Hongdan et al. [5] developed a deep-learning method that takes into account the disparities in spectrums and can predict and classify song genres better. Foleis and Tavares [6] presented a novel texture selector based on K-means aimed at identifying diverse sound textures within each track. The results show that capturing texture diversity within tracks is important for improving classification performance. Salazar [7] proposed a music genre classification system using two levels of hierarchical mining, gray-level co-occurrence matrix networks generated from the Mel-spectrogram, and a multi-hybrid feature strategy. Yu et al. [8] proposed a new model incorporating an attention mechanism based on a bidirectional recurrent neural network. Furthermore, two attention-based models (serial attention and parallelized attention) were implemented to obtain better classification outcomes. Folorunso et al. [9] implemented the global mean (tree SHAP) method to determine feature importance and impact on the classification model. Further analysis of the individual genres found some nearness in the timbral properties between some of the genres. Chapaneri et al. [10] studied features extracted from the music signal for an effective representation to aid in genre classification. The feature set comprises dynamic, rhythm, tonal, and spectral features, a total of several useful features. Kumaraswamy and Poonacha [11] proposed a new music genre classification model that includes two major processes: feature extraction and classification. In the feature extraction phase, features like “non-negative matrix factorization features, short-time Fourier transform features, and pitch features” are extracted. Farajzadeh et al. [12] suggested a tailored deep neural network-based method, termed PMG-Net, that automatically classifies Persian music genres. Singh and Biswas [13] assessed and compared the robustness of some commonly used musical and non-musical features on deep-learning models for the music genre classification (MGC) task by evaluating the performance of selected models on multiple employed features extracted from various datasets accounting for billions of segmented data samples.

Deep-learning algorithms help concurrently design processes that conduct feature extraction and run classification. Thus, designing a system that could automatically classify music genres and improve the accuracy of the classification process as much as possible is a broadly examined research subject. An attention mechanism is implemented in the network so that limited resources are directed to the most salient and closely related information in the input sequence. An RNN model with the attention mechanism assigns distinct weights to different time series features when the model is trained. Hence, the attention probability distribution corresponding to the feature representation is calculated through the attention mechanism. Besides, linear time-varying forgetting factors are employed to improve the stability of the system, and enhancement processing improves the feature recognition of music genres. The results suggest that the obtained feature representation characterizes music genres more accurately and improves classification accuracy.

2 Related work

The literature has research mainly focusing on music genre classifications. A convolutional deep belief network (CDBN) was employed to pre-train the entire dataset in an unsupervised manner on the Million-Song dataset and then used the mastered parameters to initialize a convolutional multilayer perceptron with the same architecture. A decent accuracy in music genre classification and artist identification tasks was achieved [14]. Partesotti et al. [15] proposed a music genre classification method based on the segment features of a long short-term memory (LSTM) network that was used to master the representation of frame-level features to obtain segment features, and then combined the LSTM segment features with initial frame features to attain fused segment features. Evaluations on the ISMIR database showed that the LSTM segment features outperformed frame features.
Babich [16] applied a CNN to music genre classification and compared the results with those obtained with hand-crafted features and support vector machine classifiers. Gonçalves and Schiavoni [17] fused CNN-learned features and hand-crafted features to evaluate the complementarity between these representations in the task of music genre classification. Gorbunova [18] took a small set of eight musical features that embodied dynamics, timbre, and pitch as input to the CNN. The CNN was then trained in such a way that the filter dimensions were interpretable in the time and frequency domains, and the results on the GTZAN dataset showed that the eight musical features based on dynamics, timbre, and pitch performed better than Mel-spectrograms. Dickens et al. [19] and Vereshchahina-Biliavska et al. [20] proposed the utilization of masked CNNs for music genre recognition. Thus, conditional neural networks preserved inter-frame relationships. Then, the multivariate conditional neural network extended the conditional neural network by performing masking operations on network links. The masking process induced the network to learn and automatically explore a range of feature combinations within the frequency band and helped neurons in hidden layers become feature vector local region experts. Unlike typical frame-level feature representations, Tabuena [21] proposed a CNN architecture that used sample-level filters to learn feature representations and mainly conducted three aspects of work: reducing the sampling frequency of audio signals to shorten the training time, combining transfer learning to expand multi-level and multi-scale aggregation features, and visualizing the filters learned by the sample-layer CNN and explaining the learned features. Turchet and Barthet [22] proposed a deep RNN automatic labeling algorithm based on scattering transformation features. The five-layer RNN with a gated recurrent unit (GRU) could fully utilize the scattering transform spectrogram, and the effect was better than that based on features such as the MFCC and Mel spectrogram. Khulusi et al. [23] applied an LSTM to music genre classification, extracted three features (MFCC, spectral centroid, and spectral contrast) from the GTZAN dataset, and then trained the LSTM on these features. Cano et al. [24] compared the bidirectional recurrent neural network (BRNN), GRU, parallel GRU, and serial attention models based on the BRNN through experiments and verified the effectiveness of the attention mechanism. Magnusson [25] designed Dense Inception, a novel CNN architecture for music genre classification that could improve the transfer of information between the input and output and use the multi-scale fusion feature to choose the kernel size independently. Gorbunova and Petrova [26] proposed a music classification method based on RNNs and attention mechanisms. The music was segmented, and the feature sequence was extracted from the main melody of the segment. Then, an RNN was used to learn the semantic features of the audio from the feature sequence of the musical segment, and an attention mechanism was added to assign different attention weights to the mastered features. Amarillas [27] applied CDBNs to unlabeled auditory data such as speech and music and evaluated unsupervised mastered feature representations in multiple audio classification tasks. Scavone and Smith [28] used a complex network to model music, where each vertex represented a song, and the edges between vertices were represented by conditional probability vectors computed by first-order Markov chains. Finally, the rhythm features were extracted from the community detection method in the complex network for hierarchical clustering to resolve the problem of automatic classification of music genres. Turchet et al. [29] extracted local patches from time-frequency transformed music signals, which were then preprocessed and used for K-means clustering for unsupervised learning of a local feature dictionary. The local feature dictionary was further convoluted to extract feature responses for classification. Way [30] decomposed the data matrix of unlabeled samples into basis and activation matrices through sparse coding. Each sample was represented as a linear combination of the columns in the basis matrix, followed by a basis that remained fixed to obtain activations for labeled data. Finally, these activations were utilized for music genre classification.
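Several of the surveyed systems ([22], [23]) consume frame-level descriptors such as the MFCC, spectral centroid, and spectral contrast. As a hedged illustration only (the synthetic signal, sampling rate, and feature sizes below are arbitrary assumptions, not values from the surveyed papers), such descriptors can be extracted with librosa and stacked into the (time, features) layout a sequence model expects:

```python
import numpy as np
import librosa

sr = 22050
# Stand-in 3 s tone; in practice: y, sr = librosa.load("track.wav", sr=sr)
y = 0.5 * np.sin(2 * np.pi * 440 * np.arange(3 * sr) / sr)

mfcc = librosa.feature.mfcc(y=y, sr=sr, n_mfcc=13)        # (13, frames)
centroid = librosa.feature.spectral_centroid(y=y, sr=sr)  # (1, frames)
contrast = librosa.feature.spectral_contrast(y=y, sr=sr)  # (7, frames)

# Stack descriptors per frame, then transpose to (time, features)
features = np.concatenate([mfcc, centroid, contrast]).T   # (frames, 21)
print(features.shape)
```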
3 Error classification model of music genre

A summary of the proposed method is articulated as follows: Noise added to the dual-rate discrete state space model leads to the output error model of the dual-rate system by introducing an intermediate variable. Since the output data contain unmeasured data, the general method of identifying single-rate systems could not be used. Thus, the dual-rate system is implemented and transformed into an equivalent form using a polynomial transformation method, and the output signal in the information vector of the equivalent model becomes the dual-rate output signal. However, the parameter estimates of the transformed model are biased. Therefore, the article proposes a method based on deviation compensation to successfully resolve the problem. Also, the polynomial transformation method increases the number of parameters that need to be identified.
q ∈ Z}, that is, the dual-rate output error model was investigated. Since the output data contained unmeasured items y(kqh − ih), (i = 1, 2, …), the general method of identifying single-rate systems could not be used.

The study of the dual-rate model transformed into an equivalent form uses a polynomial transformation method, and the output signal in the information vector of the equivalent model is the dual-rate output signal. However, the parameter estimates of the transformed model are biased. Therefore, the article proposes a method based on deviation compensation to successfully resolve the problem. However, the polynomial transformation method increases the number of parameters that need to be identified, so the accuracy is low, and its convergence is difficult to prove.

The recursive algorithm only uses the limited output and input datasets {u(ih), y(iqh), i = 0, 1, 2, …, k} but does not use the data {u(ih), y(iqh), i = k + 1, k + 2, …, L}. However, the iterative algorithm uses a large amount of data, thereby improving the accuracy of parameter estimation. The schematic diagram of the equivalent model constructed by the intermediate variables of the output error class is depicted in Figure 1. With the aid of the auxiliary-based equivalent model, an estimate of the data that can be measured in the dual-rate system could be easily attained. However, this processing will result in a large number of estimates in the parameter vector.

The intermediate variable model is examined by

x(kqh) = \frac{B(z)}{A(z)} u(kqh) = \frac{b_1 z^{-h} + b_2 z^{-2h} + \cdots + b_n z^{-nh}}{1 + a_1 z^{-h} + a_2 z^{-2h} + \cdots + a_n z^{-nh}} u(kqh).   (3)

The algorithm converts the intermediate variable x(kqh) into the following form:

x(kqh) = \varphi^T(kqh)\,\theta,
\varphi^T(kqh) = [-x(kqh - h), -x(kqh - 2h), \ldots, -x(kqh - nh), u(kqh - h), u(kqh - 2h), \ldots, u(kqh - nh)],   (4)
\theta = [a_1, a_2, \ldots, a_n, b_1, b_2, \ldots, b_n]^T.

The expression φ(kqh) is the information vector of the x(kqh) model, and θ is the parameter vector of the intermediate variable x(kqh) model.
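As a sketch only, the following numpy snippet simulates the output-error structure of equations (3)–(5) at the fast rate (h = 1) and then keeps every q-th output, which is exactly what makes the intermediate variable x(kqh) unmeasurable in practice. The orders, coefficients, and noise level are arbitrary illustrative choices, not values from the article.

```python
import numpy as np

# Illustrative choices only: model order n = 2, output rate ratio q = 2, h = 1
n, q = 2, 2
a = np.array([0.3, -0.1])            # a_1 ... a_n (stable A(z))
b = np.array([0.8, 0.4])             # b_1 ... b_n
theta = np.concatenate([a, b])       # θ = [a_1..a_n, b_1..b_n]^T, as in eq. (4)

rng = np.random.default_rng(0)
N = 200
u = rng.standard_normal(N)           # input u(kh), available every period
x = np.zeros(N)                      # unmeasured intermediate variable x(kh)
for k in range(N):
    past_x = [-x[k - i] if k - i >= 0 else 0.0 for i in range(1, n + 1)]
    past_u = [u[k - i] if k - i >= 0 else 0.0 for i in range(1, n + 1)]
    x[k] = np.concatenate([past_x, past_u]) @ theta   # x = φ^T θ, eq. (4)

y = x + 0.05 * rng.standard_normal(N)   # measured output y = x + v, eq. (5)
y_dual = y[::q]                         # only every q-th output is observed
```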
Then, equation (2) can be rewritten in the following form:

y(kqh) = \varphi^T(kqh)\,\theta + v(kqh).   (5)

Here, θ̂(kqh) is the estimate of the parameter vector θ at time kqh, and ‖X‖² = tr[XXᵀ] represents the norm of X. A criterion function is defined by

J(\theta) = \| y(kqh) - \varphi^T(kqh)\,\theta \|^2.   (6)

The optimal parameter estimation can be obtained by minimizing equation (6). However, an issue exists. The information vector φ(kqh) contains the unmeasurable x(kqh − ih), i = 1, 2, …, n, and the incomplete data volume makes it impossible to optimize equation (6). The estimate x̂(kqh − ih) of x(kqh − ih) is equivalent to it, and the model of the intermediate variable x̂(kqh − ih) after the equivalence is represented as follows:

\hat{x}(kqh - ih) = \hat{\varphi}^T(kqh - ih)\,\hat{\theta}(kqh - ih),
\hat{\varphi}^T(kqh - ih) = [-\hat{x}(kqh - (i + 1)h), -\hat{x}(kqh - (i + 2)h), \ldots, -\hat{x}(kqh - (i + n)h), u(kqh - (i + 1)h), u(kqh - (i + 2)h), \ldots, u(kqh - (i + n)h)].   (7)

The expression e(kqh) = y(kqh) − φ̂ᵀ(kqh) θ̂(kqh − qh) is the innovation at time kqh.

For the least squares identification algorithm and the recursive augmented stochastic gradient algorithm using the auxiliary (estimated) model, the parameter vector recursive equations of these two algorithms contain innovation scalars. The recurrence expression of the parameter vector of the dual-rate multi-innovation identification method is proposed that contains the innovation vector. The following vector matrix is assumed:

\Phi(p, kqh) = [\hat{\varphi}(kqh), \hat{\varphi}(kqh - qh), \ldots, \hat{\varphi}(kqh - (p - 1)qh)],
Y(p, kqh) = [y(kqh), y(kqh - qh), \ldots, y(kqh - (p - 1)qh)]^T,   (11)
V(p, kqh) = [v(kqh), v(kqh - qh), \ldots, v(kqh - (p - 1)qh)]^T.

The expressions Φ(p, kqh), Y(p, kqh), and V(p, kqh) are vector matrices containing p vectors. In this way, the output vector Y(p, kqh) model is obtained as follows:

Y(p, kqh) = \Phi^T(p, kqh)\,\theta + V(p, kqh).   (12)

When the following criterion function is considered
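The criterion function and the exact recursion are not recoverable from this part of the text, but a generic auxiliary-model multi-innovation stochastic-gradient step built on equations (11) and (12), with a forgetting factor λ of the kind examined in Figures 2–5 below, would look roughly as follows. The function name, normalizer, and update form are assumptions consistent with standard multi-innovation identification, not the article's exact algorithm.

```python
import numpy as np

def dr_ammsg_step(theta_hat, Phi, Y, r, lam=1.0):
    """One auxiliary-model multi-innovation stochastic-gradient update (sketch).

    Phi : (2n, p) matrix Φ(p, kqh) of stacked information vectors
    Y   : (p,)    stacked outputs Y(p, kqh)
    r   : scalar  running normalizer (covariance-like term)
    lam : forgetting factor λ; λ = 1 keeps all history, λ < 1 discounts it
    """
    E = Y - Phi.T @ theta_hat               # innovation vector E(p, kqh)
    r = lam * r + np.linalg.norm(Phi) ** 2  # forgetting-factor weighted energy
    theta_hat = theta_hat + Phi @ E / r     # gradient step toward eq. (12)
    return theta_hat, r
```

A larger innovation length p feeds more stacked data into each step, which is consistent with the faster convergence reported below for growing p.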
Figure 2: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 1 varies with time t.
Figure 3: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.95 varies with time t.
Figure 4: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.9 varies with time t.
Figure 5: The variation curve of the parameter estimation error δ with time t when the DR-AMMSG algorithm has a time-varying λ.
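The recovered text does not spell out the article's linear time-varying λ. A plausible minimal form, assumed here purely for illustration, ramps λ linearly from a small initial value (fast early convergence) toward 1 (low steady-state fluctuation):

```python
import numpy as np

def lam_linear(t: int, T: int, lam0: float = 0.9) -> float:
    """Linear time-varying forgetting factor (assumed form, for illustration).

    Early on (small λ) old samples are discounted quickly so the estimate
    converges fast; as λ → 1 the update smooths out and the parameter
    estimates stop fluctuating.
    """
    return min(1.0, lam0 + (1.0 - lam0) * t / T)

print([round(lam_linear(t, 100), 3) for t in (0, 50, 100)])  # [0.9, 0.95, 1.0]
```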
the scale of the neural network can be reduced. Thus, the generalization ability can be guaranteed. On the other hand, it gives the RNN both memory and learning abilities and stores useful information in the parameter matrices U, V, and W.

The input and output modes of the cyclic neural network are very flexible; there can be multiple modes, including one-to-many, many-to-many, and many-to-one, which can be adapted to a variety of tasks, as shown in Figure 7. Figure 7(a) shows the one-to-many mode, which is often used for decoder modeling. It is suitable for inputting a code vector, decoding it, and outputting the corresponding decoding sequence. Figure 7(b) shows the many-to-one mode, which is often used for sequence classifier modeling. It is suitable for adjusting the parameters of the classifier.
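As a brief illustration of the two modes in Figure 7 (all shapes and sizes below are arbitrary assumptions), a many-to-one GRU keeps only its final hidden state as the sequence summary, while a one-to-many decoder feeds each output back as the next input:

```python
import torch
import torch.nn as nn

# Toy GRU; hidden size equals input size so outputs can be fed back
gru = nn.GRU(input_size=8, hidden_size=8, batch_first=True)
head = nn.Linear(8, 10)

# Many-to-one (Figure 7(b), sequence classification): read the whole
# sequence and keep the final hidden state as its summary.
x = torch.randn(4, 50, 8)               # (batch, time, features)
_, h_last = gru(x)                      # (1, batch, hidden)
genre_logits = head(h_last.squeeze(0))  # (batch, 10)

# One-to-many (Figure 7(a), decoder): start from one code vector and
# feed each output back in as the next input step.
step, h = gru(torch.randn(4, 1, 8))     # decode the first step from the code
outputs = [step]
for _ in range(4):
    step, h = gru(step, h)              # previous output becomes next input
    outputs.append(step)
decoded = torch.cat(outputs, dim=1)     # (batch, 5, 8) decoded sequence
```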
The best model obtained by training is implemented to realize the discrimination of the genre using test music samples. Figure 8 shows a music genre classification system based on the RNN.

Since the data are limited, a method for data set enhancement is required, as shown in Figure 9. As mentioned above, in the two data sets used for classification, the duration of each song excerpt segment C is 30 s. In the article, each segment is cut, and the duration of each sub-segment ci after cutting is 3 s. Moreover, there is a 50% overlap between two adjacent sub-segments. The excerpts of each song are cut into 18 sub-segments with a duration of 3 s (because the sample duration is about 30 s, the last slice may be shorter than 3 s, so it is discarded). In addition, each sub-segment carries the same genre tag as the source segment. The structure of the specific features is shown in Figure 10.
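A minimal sketch of this slicing scheme follows (the sampling rate and the stand-in signal are assumptions; the paper keeps 18 complete 3 s sub-segments per roughly 30 s excerpt, and the exact count depends on the excerpt length):

```python
import numpy as np

def slice_excerpt(signal: np.ndarray, sr: int, win_s: float = 3.0,
                  overlap: float = 0.5) -> list[np.ndarray]:
    """Cut a song excerpt into fixed-length sub-segments with 50% overlap.

    Only complete windows are kept; the short trailing slice is discarded,
    and each sub-segment inherits the genre tag of the source excerpt.
    """
    win = int(win_s * sr)
    hop = int(win * (1.0 - overlap))
    return [signal[start:start + win]
            for start in range(0, len(signal) - win + 1, hop)]

sr = 22050
excerpt = np.zeros(int(29.5 * sr))   # stand-in for an "about 30 s" excerpt
subs = slice_excerpt(excerpt, sr)
print(len(subs))                     # 18 complete 3 s sub-segments
```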
4.1 Result of music genre classification

The dual-rate system and auxiliary (estimated) model are used to generate errors and also to optimize the parameters of the RNN model. A simulation experiment is carried out using test parameters, namely, the audio recognition error, the feature classification effect of the music signal, and the classification effect of the music genre. The results are shown in Tables 1–3. The model is trained on the two data sets, using 0.80 of the data for training and 0.20 for testing. Simulation data are used to generate the predictions for Tables 1–3. When the average audio error rate (the mean of Table 1) is 0.33, the feature classification of the musical signal reaches 0.8945, and the genre classification reaches 82.55%.

The results imply that even though an audio error with a large mean occurs in the system, the proposed system could handle it in feature and genre classifications.

5 Conclusion

With the rapid development of network technology and multimedia platforms, the amount of digital music has increased rapidly, and it is difficult for listeners to manage these huge amounts of stored music. Moreover, listeners need to quickly and accurately retrieve the music they are interested in from a huge music database.
Music genres are different musical styles formed by different melodies, instruments, rhythms, and other characteristics under different periods and different cultural backgrounds. Therefore, the classification of music genres has become a very important research direction in the field of music information retrieval. Designing a system that automatically classifies music genres and improves the accuracy of the classification process as much as possible has been a highly desired outcome. The article employed a deep-learning algorithm (RNN) to study music genre classification and proposed a music genre classification system with an intelligent music feature recognition function. Hence, the RNN helped design processes that conducted feature extraction and ran classification concurrently. Thus, an attention mechanism was implemented in the network so that limited resources were directed to the most salient and closely related information in the input sequence. The RNN model with the attention mechanism assigned distinct weights to different time series features when the model was trained. Hence, the attention probability distribution corresponding to the feature representation was calculated through the attention mechanism. Moreover, the obtained feature representation found musical characteristics more accurately and improved classification accuracy.
The RNN shared parameters at different moments and positions, which had two advantages. On the one hand, as the parameter space was reduced, the scale of the neural network was decreased; thus, the generalization ability was guaranteed. On the other hand, the shared parameters gave the RNN both memory and learning abilities and stored useful information in the parameter matrices. Moreover, adjusting the forgetting factor could speed up the convergence of the parameter estimation, but the addition of the forgetting factor would increase the fluctuation of the system parameter estimation. Therefore, the article also considered a linear time-varying forgetting factor. Even though improvement has been based on the forgetting factor λ, the fluctuating estimates of the system parameters required the use of another forgetting factor, called the linear time-varying forgetting factor, which is a linear function of time. Hence, the stabilization of the classification system is better reached. So, instead of using a fixed value, a linear time-varying form of the forgetting factor is implemented. When the forgetting factor is λ = 1 and the innovation length p increases, the convergence speed of the parameter estimation system and the identification accuracy gradually become faster and higher. When the forgetting factor takes λ = 0.95 and λ = 0.9, respectively, adjusting the forgetting factor could speed up the convergence of the parameter estimation, but the addition of the forgetting factor will increase the fluctuation of the system parameter estimation. Finally, the proposed model has provided a good music genre classification effect.

Future work will be based on the performance comparison of the proposed method with other implementable methods to classify music genres.

Funding information: The research did not receive any funding.

Author contributions: The article is written by a single author.

Conflict of interest: Author declares no conflict of interest.

Ethical approval: No ethical approval is needed.

Informed consent: No consent is needed.

References

[1] F. Calegario, M. Wanderley, S. Huot, G. Cabral, and G. Ramalho, “A method and toolkit for digital musical instruments: generating ideas and prototypes,” IEEE Multimed., vol. 24, no. 1, pp. 63–71, 2017.
[2] D. Tomašević, S. Wells, I. Y. Ren, A. Volk, and M. Pesek, “Exploring annotations for musical pattern discovery gathered with digital annotation tools,” J. Math. Music, vol. 15, no. 2, pp. 194–207, 2021.
[3] X. Serra, “The computational study of musical culture through its digital traces,” Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017.
[4] S. K. Prabhakar and S. W. Lee, “Holistic approaches to music genre classification using efficient transfer and deep learning techniques,” Expert Syst. Appl., vol. 211, p. 118636, 2023.
[5] W. Hongdan, S. SalmiJamali, C. Zhengping, S. Qiaojuan, and R. Le, “An intelligent music genre analysis using feature extraction and classification using deep learning techniques,” Comput. Electr. Eng., vol. 100, p. 107978, 2022.
[6] J. H. Foleis and T. F. Tavares, “Texture selection for automatic music genre classification,” Appl. Soft Comput., vol. 89, p. 10612, 2022.
[7] A. E. C. Salazar, “Hierarchical mining with complex networks for music genre classification,” Digital Signal Process., vol. 127, p. 103559, 2022.
[8] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng, “Deep attention based music genre classification,” Neurocomputing, vol. 372, pp. 84–91, 2020.
[9] S. O. Folorunso, S. A. Afolabi, and A. B. Owodeyi, “Dissecting the genre of Nigerian music with machine learning models,” J. King Saud Univ. – Comput. Inf. Sci., vol. 34, no. 8, Part B, pp. 6266–6279, 2022.
[10] S. Chapaneri, R. Lopes, and D. Jayaswal, “Evaluation of music features for PUK kernel based genre classification,” Procedia Comput. Sci., vol. 45, pp. 186–196, 2021.
[11] B. Kumaraswamy and P. G. Poonacha, “Deep convolutional neural network for musical genre classification via new self adaptive sea lion optimization,” Appl. Soft Comput., vol. 108, p. 107446, 2021.
[12] N. Farajzadeh, N. Sadeghzadeh, and M. Hashemzadeh, “PMG-Net: Persian music genre classification using deep neural networks,” Entertain. Comput., vol. 44, p. 100518, 2023.
[13] Y. Singh and A. Biswas, “Robustness of musical features on deep learning models for music genre classification,” Expert Syst. Appl., vol. 199, p. 116879, 2022.
[14] I. B. Gorbunova and N. N. Petrova, “Digital sets of instruments in the system of contemporary artistic education in music: socio-cultural aspect,” J. Crit. Rev., vol. 7, no. 19, pp. 982–989, 2022.
[15] E. Partesotti, A. Peñalba, and J. Manzolli, “Digital instruments and their uses in music therapy,” Nordic J. Music Ther., vol. 27, no. 5, pp. 399–418, 2018.
[16] B. Babich, “Musical ‘Covers’ and the culture industry: From antiquity to the age of digital reproducibility,” Res. Phenomenol., vol. 48, no. 3, pp. 385–407, 2018.
[17] L. L. Gonçalves and F. L. Schiavoni, “Creating digital musical instruments with lib mosaic-sound and mosaicode,” Rev. de Inform. Teórica e Apl., vol. 27, no. 4, pp. 95–107, 2020.
[18] I. B. Gorbunova, “Music computer technologies in the perspective of digital humanities, arts, and research,” Opcion, vol. 35, no. SpecialEdition24, pp. 360–375, 2018.
[19] A. Dickens, C. Greenhalgh, and B. Koleva, “Facilitating accessibility in performance: participatory design for digital musical instruments,” J. Audio Eng. Soc., vol. 66, no. 4, pp. 211–219, 2018.
[20] O. Y. Vereshchahina-Biliavska, O. V. Cherkashyna, Y. O. Moskvichova, O. M. Yakymchuk, and O. V. Lys, “Anthropological view on the history of musical art,” Linguist. Cult. Rev., vol. 5, no. S2, pp. 108–120, 2021.
[21] A. C. Tabuena, “Chord-interval, direct-familiarization, musical instrument digital interface, circle of fifths, and functions as basic piano accompaniment transposition techniques,” Int. J. Res. Publ., vol. 66, no. 1, pp. 1–11, 2021.
[22] L. Turchet and M. Barthet, “A ubiquitous smart guitar system for collaborative musical practice,” J. N. Music Res., vol. 48, no. 4, pp. 352–365, 2019.
[23] R. Khulusi, J. Kusnick, C. Meinecke, C. Gillmann, J. Focht, and S. Jänicke, “A survey on visualizations for musical data,” Comput. Graph. Forum, vol. 39, no. 6, pp. 82–110, 2020.
[24] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. R. Stöter, “Musical source separation: An introduction,” IEEE Signal Process. Mag., vol. 36, no. 1, pp. 31–40, 2020.
[25] T. Magnusson, “The migration of musical instruments: On the socio-technological conditions of musical evolution,” J. N. Music Res., vol. 50, no. 2, pp. 175–183, 2020.
[26] I. B. Gorbunova and N. N. Petrova, “Music computer technologies, supply chain strategy, and transformation processes in the socio-cultural paradigm of performing art: Using digital button accordion,” Int. J. Supply Chain Manag., vol. 8, no. 6, pp. 436–445, 2020.
[27] J. A. A. Amarillas, “Marketing musical: música, industria y promoción en la era digital,” INTERdisciplina, vol. 9, no. 25, pp. 333–335, 2021.
[28] G. Scavone and J. O. Smith, “A landmark article on nonlinear time-domain modeling in musical acoustics,” J. Acoust. Soc. Am., vol. 150, no. 2, pp. R3–R4, 2021.
[29] L. Turchet, T. West, and M. M. Wanderley, “Touching the audience: musical haptic wearables for augmented and participatory live music performances,” Personal Ubiquitous Comput., vol. 25, no. 4, pp. 749–769, 2021.
[30] C. J. Way, “Populism in musical mash-ups: recontextualizing Brexit,” Soc. Semiotics, vol. 31, no. 3, pp. 489–506, 2021.