0% found this document useful (0 votes)
17 views13 pages

The Implementation of A Proposed Deep-Learning Alg

Uploaded by

anisanuriljanah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
17 views13 pages

The Implementation of A Proposed Deep-Learning Alg

Uploaded by

anisanuriljanah
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 13

Open Computer Science 2024; 14: 20230106

Research Article

Lili Liu*

The implementation of a proposed deep-learning


algorithm to classify music genres
https://doi.org/10.1515/comp-2023-0106 artificial and consist of large subjective factors. Thus, there
received May 8, 2023; accepted November 14, 2023 is no clear absolute standard that could be used for
Abstract: To improve the classification effect of music dividing music genres without error [1].
genres in the digital music era, the article employs deep- The extraction of music features is an indispensable
learning algorithms to improve the performance of the step in a classification task, and the quality of the featured
classification of music genres. An auxiliary (estimated) music is a crucial factor that affects the accuracy of the
model is constructed to estimate the amount of unmea- classification system. Hence, the conventional method of
sured data in the dual-rate system to enhance the recogni- extracting features requires rich prior knowledge and
tion effect of music features. Moreover, a dual-rate output complex mathematical tools, which makes it difficult to
error model to identify such impacts is proposed to elim- break through the bottleneck.
inate the impact of corrupt data caused by the estimation, With the popularity of deep-learning algorithms, it is
which eventually leads to the further improvement of the possible to train data to obtain valuable information and accu-
proposed model called dual-rate multi-innovation forget- rately characterize the musical features of music genres.
ting gradient algorithm based on the auxiliary model. In Hence, there is no need to design separate processes that
addition, the article employs linear time-varying forgetting deal with extracting features and running classifications [2].
factors to improve the stability of the system, advances the According to the literature, convolutional neural networks
recognition effect of music features through enhancement (CNNs) and recurrent neural networks (RNNs) are the most
processing, and combines a deep-learning algorithm to widely implemented deep-learning models in the research of
construct a classification system of music genres. The music genre classification. Due to the different structures of
result shows that the classification of the music genre CNNs and RNNs, the focus of the mastered features could not
system based on a deep-learning algorithm has a good be the same [3].
music genre classification effect. Designing a system that could automatically classify
music genres and improve the accuracy of the classifica-
Keywords: deep learning, music genre, classification, model tion process as much as possible is a significant research
subject. The deep-learning models have proved their cap-
ability in many disciplines and research fields. Generally,
the spectrogram of the music signals can be regarded as a
1 Introduction picture in essence, and spectrograms that are used as inputs
into CNNs help reach a better classification effect. As a result, a
Music genres are constituted of various music types, forms, growing number of studies have introduced several deep-
and styles composed by several different composers or learning models to classify music genres due to the high
instrumentalists in distinct periods and have come from achievements of those methods. Although CNNs have achieved
several cultural backgrounds. Music specialists usually very good results in some fields, especially in image proces-
divide a piece of music into genres based on attributes of sing, the classification tasks of music genres using CNNs have
musical instruments, forms of expression, regional culture, rarely been implemented. The same is true for RNNs.
and content of expression. However, when conventional In the article, the input of the deep-learning model is a
classification methods are substituted, results could be piece of music signal, which is essentially represented by
an input sequence called a time series. Even though con-
ventional networks pay the same attention to classifying

* Corresponding author: Lili Liu, College of Art, Hainan Tropical Ocean
music genres, RNNs are more suitable than others to
University, Sanya, Hainan, 572022, China, master relationships when input has time series character-
e-mail: [email protected] istics, which are called sequenced inputs.

Open Access. © 2024 the author(s), published by De Gruyter. This work is licensed under the Creative Commons Attribution 4.0 International License.
2  Lili Liu

According to the analysis of the music signal, the pre- dynamic, rhythm, tonal, and spectral features comprising a
lude, the end, and other relatively less important moments total of several useful features. Kumaraswamy and Poonacha
could be given less attention. However, the section could [11] proposed a new music genre classification model that
highlight the rhythm style of the whole music in the climax includes two major processes: feature extraction and classifica-
section of the song. Then, a higher degree of attention tion. In the feature extraction phase, features like “non-
should be given so that the network can employ limited negative matrix factorization features, short-time Fourier
resources and pay more attention to the most salient and transform features, and pitch features” are extracted. Far-
closely related information in an input sequence. Thus, an ajzadeh et al. [12] suggested a tailored deep neural network-
attention mechanism is also added to RNNs to improve its based method, termed PMG-Net, that automatically classifies
performance, so that different time series features could be Persian music genres. Singh and Biswas [13] assessed and
assigned to different weights when the model is trained. compared the robustness of some commonly used musical
Hence, the attention probability distribution corresponding and non-musical features on deep-learning models for the
to the feature representation is calculated through the MGC task by evaluating the performance of selected models
attention mechanism. Therefore, obtaining a feature repre- on multiple employed features extracted from various data-
sentation could more accurately characterize musical char- sets accounting for billions of segmented data samples.
acteristics and improve classification accuracy. Deep-learning algorithms help concurrently design pro-
Even though the number of related research covering cesses that conduct feature extractions and run classifica-
deep-learning methods grows exponentially, the number of tions. Thus, designing a system that could automatically
studies has still been limited. Nevertheless, deep-learning- classify music genres and improve the accuracy of the clas-
based methods have gained momentum to derive music sification process as much as possible is a broadly examined
features and classify music genres recently. Prabhakar research subject. An attention mechanism is implemented
and Lee [4] proposed five interesting and novel approaches in the network so limited resources that are characterized as
for music genre classifications, which are called weighted the most salient and closely related information in the input
visibility graph-based elastic net sparse classifier, stacked sequence are captured. An RNN model with the attention
denoising autoencoder classifier, Riemannian alliance tan- mechanism assigns different time series features to distinct
gent space mapping transfer learning, transfer support weights when the model is trained. Hence, the attention
vector machine algorithm, and lastly bidirectional long probability distribution corresponding to the feature repre-
short-term memory attention model with graphical convo- sentation is calculated through the attention mechanism.
lution network. Hongdan et al. [5] developed a deep-learning Besides, linear time-varying forgetting factors are employed
method taking into account the disparities in spectrums that to improve the stability of the system, and the effect of
can predict and classify song genres better. Foleis and feature recognition of music genres through enhancement
Tavares [6] presented a novel texture selector based on processing is progressed. The results suggest that obtaining a
K-Means aimed to identify diverse sound textures within feature representation more accurately characterizes music
each track. The results show that capturing texture diversity genres and improves classification accuracy in the article.
within tracks is important for improving classification per-
formance. Salazar [7] proposed an music genre classification
system by using two levels of hierarchical mining, gray-level
co-occurrence matrix networks generated by the Mel-spec- 2 Related work
trogram, and a multi-hybrid feature strategy. Yu et al. [8]
proposed a new model incorporating with attention The literature has research mainly focusing on music
mechanism based on a bidirectional recurrent neural net- genre classifications. A convolutional deep belief network
work. Furthermore, two attention-based models (serial (CDBN) was employed to pre-train the entire dataset in an
attention and parallelized attention) are implemented to unsupervised manner on the Million-Song dataset and then
have better classification outcomes. Folorunso et al. [9] used the mastered parameters to initialize a convolutional
implemented the global mean (tree SHAP) method to deter- multilayer perceptron with the same architecture. A decent
mine feature importance and impact on the classification accuracy in music genre classification and artist identifica-
model. Further analysis of the individual genres found some tion tasks was achieved [14]. Partesotti et al. [15] proposed a
nearness in the timbral properties between some of the music genre classification method based on the segment
genres. Chapaneri et al. [10] studied features that are features of a long short-term memory (LSTM) that was
extracted from the music signal for an effective representa- used to master the representation of frame-level features
tion to aid in genre classification. The feature set comprises to obtain segment features and then combined the LSTM
Deep-learning music genre categorization research  3

segment features with initial frame features to attain fused proposed a music classification method based on RNNs and
segment features. Evaluations on the ISMIR database showed attention mechanisms. The music was segmented, and the
that the LSTM segment features outperformed frame fea- feature sequence was extracted from the main melody of
tures. Babich [16] applied a CNN to music genre classification the segment. Then, an RNN was used to learn the semantic
and compared the results with those obtained with hand- features of the audio from the feature sequence of the
crafted features and Support vector machine classifiers. Gon- musical segment, and an attention mechanism was added
çalves and Schiavoni [17] fused CNN-learned features and to assign different attention weights to the mastered features.
hand-crafted features to evaluate the complementarity Amarillas [27] applied CDBNs to unlabeled auditory data
between these representations in the task of music genre such as speech and music and evaluated unsupervised mas-
classification. Gorbunova [18] took a small set of eight musical tered feature representations in multiple audio classification
features that embodied dynamics, timbre, and pitch as input tasks. Scavone and Smith [28] used a complex network to
to the CNN. The CNN was then trained in such a way that the model music, where each vertex represented a song, and
filter dimensions were interpretable in time and frequency the edges between vertices were represented by conditional
domains, and the results on the GTZAN dataset showed that probability vectors computed by first-order Markov chains.
eight musical features based on dynamics, timbre, and pitch Finally, the rhythm features were extracted from the com-
perform better than Mel-spectrograms were found. Dickens munity detection method in the complex network for hier-
et al. [19] and Vereshchahina-Biliavska et al. [20] proposed the archical clustering to resolve the problem of automatic
utilization of masked CNNs for music genre recognition. Thus, classification of music genres. Turchet et al. [29] extracted
conditional neural networks preserved inter-frame relation- local patches from time-frequency transformed music signals,
ships. Then, the multivariate conditional neural network which were then preprocessed and used for the K-means clus-
extended conditional neural network by performing masking tering for unsupervised learning of a local feature dictionary.
operations on network links. The masking process induced The local feature dictionary was further convoluted to extract
the network to learn and automatically explore a range of feature responses for classification. Way [30] decomposed
feature combinations within the frequency band and helped the data matrix of unlabeled samples into basis and acti-
neurons in hidden layers become feature vector local region vation matrices through sparse coding. Each sample was
experts. Unlike typical frame-level feature representations, represented as a linear combination of the columns in the
Tabuena [21] proposed a CNN architecture that used basis matrix, followed by a basis that remained fixed to
sample-level filters to learn feature representations and obtain activations for labeled data. Finally, these activa-
mainly conducted three aspects of work: reducing the sam- tions were utilized for music genre classification.
pling frequency of audio signals to shorten the training time,
combining transfer learning to expand multi-level and multi-
scale aggregation features, and visualizing the filters learned
by the sample layer CNN, and explaining the learned features.
Turchet and Barthet [22] proposed a deep RNN automatic
3 Error classification model of
labeling algorithm based on scattering transformation fea- music genre
tures. The five-layer RNN with a gated recurrent unit (GRU)
could fully utilize the scattering transform spectrogram, and A summary of the proposed method is articulated as fol-
the effect was better than that based on features such as the lows: Noise added to the dual-rate discrete state space
MFCC and Mel spectrogram. Khulusi et al. [23] applied an model leads to the output error model of the dual-rate
LSTM to music genre classification, extracted three features system by introducing an intermediate variable. Since
of MFCC, spectral centroid, and spectral contrast from the the output data contained unmeasured data, the general
GTZAN dataset, and then trained the LSTM based on this method of identifying single-rate systems could not be
feature. Cano et al. [24] compared Bi-directional recurrent used. Thus, the dual-rate system is implemented and trans-
neural network (BRNN), GRU, GRU parallel, and serial atten- formed into an equivalent form that uses a polynomial
tion models based on the BRNN through experiments and transformation method, and the output signal in the infor-
verified the effectiveness of the attention mechanism. mation vector of the equivalent model will become the
Magnusson [25] designed dense inception, which is a dual-rate output signal. However, the parameter estimates
novel CNN architecture for music genre classification that of the transformed model are biased. Therefore, the article
could improve the transfer of information between the input proposes a method based on deviation compensation to
and output and the multi-scale fusion feature to choose the successfully resolve the problem. Also, the polynomial trans-
kernel size independently. Gorbunova and Petrnova [26] formation method increases the number of parameters that
4  Lili Liu

need to be identified, so the accuracy would be low, and its


convergence would be difficult to prove. Hence, the iterative
algorithm is implemented and could use a large amount of
data, thereby improving the accuracy of parameter estima-
tion. Finally, the estimated model is called an auxiliary
model. An estimate of the data could be easily attained
that could be measured in the dual-rate system. To reach
optimal parameter scores, the recursive augmented sto-
chastic gradient search algorithm is employed. Eventually,
the dual-rate multi-innovation forgetting gradient algorithm
based on the auxiliary model (DR-AMMSG) is proposed.
Finally, the steps of the proposed method are presented in
an algorithm to be followed easily.
By adding noise to the dual-rate discrete state space
model, the output error model of the dual-rate system is
obtained as follows:
b1z−h + b2z−2h + … + bnz−nh
y(kqh) = u(kqh)
1 + a1z−h + a2z−2h + … + anz−nh
(1)
B (z ) Figure 1: An output error system with an auxiliary model.
+ v(kqh) = u(kqh) + v(kqh) .
A(z )
By introducing an intermediate variable x (t ), equation The idea of the auxiliary model is to establish an
B (z)
(1) can be transformed into equation (2) as follows: equivalent model Aa (z) with the structure same as the struc-
a
B(z)
B (z ) ture of the original system A(z)
for the unavailable inter-
x (kqh) = u(kqh) , y(kqh) = x (kqh) + v(kqh) . (2)
A (z ) mediate variable x (t ) in the system, and the equivalent
Ba (z)
Here, y(t ) is the output of the disturbed system. Without intermediate variable is represented by xa (t ) = Aa (z)
u (t ) .
loss of generality, when t ≤ 0, u(t ) = 0, y ⇔ 0, ∀, the input Bˆ (z) B(z)
The estimated model of is usually used as the
and output data are represented by {u(kh) , y(kqh) , q > 1, Aˆ (z) A(z)

q ∈ z} , that is, the dual-rate output error model was investi- equivalent model. With the aid of the auxiliary-based
gated. Since the output data contained an unmeasured item equivalent model, an estimate of the data could be easily
y(kq − h)i , h(i 1, 2, − ), the general method of identifying attained that could be measured in the dual-rate system.
single-rate systems could not be used. However, this processing will result in a large number of
The study of the dual-rate model transformed into an estimates in the parameter vector.
equivalent form uses a polynomial transformation method, The intermediate variable model is examined by
and the output signal in the information vector of the B (z )
equivalent model is the dual-rate output signal. However, x (kqh) = u(kqh)
A (z )
the parameter estimates of the transformed model are (3)
b1z−h + b2z−2h + … + bnz−nh
biased. Therefore, the article proposes a method based on = u(kqh) .
1 + a1z−h + a2z−2h + … + anz−nh
deviation compensation to successfully resolve the problem.
However, the polynomial transformation method increases The algorithm converts the intermediate variable x (kqh)
the number of parameters that need to be identified, so the into the following form:
accuracy is low, and its convergence is difficult to prove. x (kqh) = φT (kqh)θ ,
The recursive algorithm only uses limited output and
φT (kqh) = [− x (kqh − h) , − x (kqh − 2h),…, − x (kqh
input datasets {u(ih) , y(iqh) , i = 0, 1, 2, …, k } but does not
use data {u(ih) , y(iqh) , i = k + 1, k + 2, …, L} . However, − nh)u(kqh − h) , u(kqh − 2h),…, u(kqh (4)
the iterative algorithm uses a large amount of data, thereby − nh)]
improving the accuracy of parameter estimation. The sche-
θ = [a1, a2 ,…, an , b1 , b2 ,…, bn]T .
matic diagram of the equivalent model constructed by the
intermediate variables of the output error class is depicted in The expression φ(kqh) is the information vector of the
Figure 1. x (kqh) model and θ is the parameter vector of the
Deep-learning music genre categorization research  5

intermediate variable x (kqh) model. Then, equation (2) can The expression e(kqh) = y(kqh) − φT (kqh)θ (kqh − qh)
be rewritten in the following form: is the innovation at the time kqh .
y(kqh) = φT (kqh)θ + v(kqh) . (5) For the least squares identification algorithm and the
recursive augmented stochastic gradient algorithm using
 (kqh) is the estimate of the parameter vector θ
Here, θ the auxiliary (estimated) model, the parameter vector
at a time kqh and ∥X∥2 = tr[XXT ] represents the norm of X . recursive equations of these two algorithms contain
A criterion function is defined by innovation scalars. The recurrence expression of the
parameter vector of the dual-rate multiple innovation
J (θ) = ∥y(kqh) − φT (kqh)θ∥2 . (6)
identification method is proposed that contains the inno-
The optimal parameter estimation can be obtained vation vector.
by minimizing equation (6). However, an issue exists. The following vector matrix is assumed:
The information vector φ(qt ) contains unmeasurable  (kqh ), φ
 (kqh − qh ),…, φ
 (kqh − ( p − 1)qh )],
Φ( p , kqh ) = [φ
x (kqh − ih) , i = 1, 2,…, n , and the incomplete data volume
Y( p, kqh ) = [ y(kqh ), y(kqh − qh ),…, y(kqh − ( p − 1)qh )]T , (11)
makes it impossible to optimize equation (6). The estimate V( p, kqh ) = [v(kqh ), v(kqh − qh ),…, v(kqh − ( p − 1)qh )]T .
xˆ(kqh − ih) of x (kqh − ih) is equivalent to it, and the
model of the intermediate variable xˆ(kqh − ih) after the The expression Φ( p , kqh) , Y( p , kqh) , V( p , kqh) is a vector
equivalence is represented as follows: matrix containing p vectors. In this way, the output vector
Y( p , kqh) model is obtained as follows:
 (kqh − ih) ,
T (kqh − ih)θ
xˆ (kqh − ih) = φ
Y( p , kqh) = ΦT ( p , kqh)θ + V( p , kqh) . (12)
T (kqh − ih) = [− xˆ (kqh − (i + 1)h) , − xˆ (kqh − (i
φ
+ 2)h),…, − xˆ (kqh − (i + n)h)u(kqh − (i (7) When the following criterion function is considered

+ 1)h) , u(kqh − (i + 2)h),…, u(kqh − (i J (θ) = ∥Y( p , kqh) − ΦT ( p , kqh)θ∥2 . (13)


+ n)h)]. The gradient search method is used to minimize the
 (kqh − ih) in equa- function:
The unknown parameter vector θ
tion (7) is calculated by using the following equation: μkqh
 (kqh) = θ
θ  (kqh − qh) −  (kqh − qh))]
grad[J (θ
2
⎧θ (kqh − qh) , i = 1, 2,…, q − 1,  (kqh − qh) + μ Φ (14)
 (kqh − ih) = ⎨ 
θ (8) =θ kqh ( p , kqh )[Y( p , kqh )
⎩θ (kqh) , i = 0.  (kqh − qh)].
T ( p , kqh)θ
−Φ
Therefore, the auxiliary (estimated) model of
The expression μkqh is the step length of the iterative
xˆ(kqh − ih) is presented as follows:
search of the gradient of kqh sampling points. Considering
⎧φ  (kqh − qh ) , i = 1, 2 …, q − 1,
T (kqh − ih )θ the step size of the recursive gradient search, the forgetting
xˆ (kqh − ih ) = ⎨
⎩φ  (kqh ) , i = 0,
T (kqh )θ factor λ is added, and the choices are presented as follows:

T (kqh − ih ) = [− xˆ (kqh − (i + 1)h ) , − xˆ (kqh


φ 1
(9) μkqh = ,
− (i + 2)h ),…, − xˆ (kqh
r (kqh) (15)
( p , kqh)∥2 ,
r (kqh) = λr (kqh − qh) + ∥Φ r (0) = 1.
− (i + n )h ) , u(kqh − (i + 1)h ) , u(kqh
− (i + 2)h ),…, u(kqh − (i + n )h )] ∈ R 2n . In summary, the steps of the DR-AMMSG based on the
auxiliary (estimated) model are presented as follows:
After establishing such an auxiliary (estimated) model, ( p, kqh )
it is impossible to obtain a measurable approximation.  (kqh) = θ
(1) θ  (kqh − qh) + Φ
E( p , kqh)
r (kqh )
The method of stochastic gradient search is chosen to  (kqh − qh)
T ( p , kqh)θ
(2) E( p , kqh) = Y( p , kqh) − Φ
optimize equation (6). The parameter estimation algorithm
⎡ y(kqh) − φT (kqh)θ  (kqh − qh) ⎤
is obtained based on gradient search as follows:
⎢  (kqh − qh) ⎥
y(kqh − qh) − φ T (kqh − qh)θ

 (kqh − qh) + φ(kqh) [y(kqh)
 (kqh) = θ
θ

= ⋯⋯

r (kqh) ⎢ ⎥
T
 (kqh − qh)], (10) ⎢ y(kqh − ( p − 1)qh) − φ (kqh − ( p ⎥
−φ  (kqh)θ
T
 (kqh − qh)
⎢⎣ − 1)qh)θ ⎥⎦
 (kqh)∥2 ,
r (kqh) = r (kqh − qh) + ∥φ r (0) = 1.
(3) r (kqh) = λr (kqh − qh) + ∥Φ ( p , kqh)∥ , r (0) = 1
2
6  Lili Liu

Figure 2: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 1 varies with time t.

(4) Φ( p , kqh) = [φ


 (kqh) , φ
 (kqh − qh),…, φ (kqh − ( p − 1)qh)] 2. The algorithm collects input and output datasets
T
(5) φˆ (kqh) = [− xˆ (kqh − h) , − xˆ (kqh − 2h),…, − xˆ (kqh {u(kh) , y(kqh)} and finds φˆ(kqh) them according to
− nh) , u(kqh − h) , u(kqh − 2h),…, u(kqh − nh)] equation (5), where the unmeasurable x (kqh − ih) is
∈ R 2n ( p , kqh) is
calculated by equations (6) and (7), and Φ
(6) calculated according to equation (4).
 (kqh − qh) , i = 1, 2 …, q − 1 3. The algorithm obtains the innovation vector E( p , kqh)
T (kqh − ih)θ
⎧φ
xˆ (kqh − ih) = ⎨ and r (kqh) equations (2) and (3), respectively.
T (kqh)θ
⎩φ  (kqh) , i = 0,
4. The algorithm finds θ (kqh) from equation (1).
5. The algorithm increases the time k by 1 and jumps to step 2.
(7) φˆ T (kqh − ih ) = [−xˆ (kqh − (i + 1)h ), −xˆ (kqh − (i
+ 2)h ),…, −xˆ (kqh − (i + n )h )u(kqh The dual-rate sampling data system model is presented
(16)
− (i + 1)h ), u(kqh − (i + 2)h ),…, u(kqh as follows:
− (i + n )h )] ∈ R 2n . B (z − 1 )
y(2t ) = u(2t ) + v(2t ) ,
A( z − 1 )
The expression E(p,kqh) is the innovation vector of
length p. A(z−1) = 1 + a1z−1 + a2z−2 = 1 − 1.1z−1 + 0.5z−2 , (17)
The flow of the proposed algorithm is presented as B (z − 1 )
= b1 + b2 z −1 z −2 = 0.16z−1 − 0.8z−2 ,
follows: θ = [a1, a2 , b1 , b2].
1. The algorithm is initialized and k = 1, θ  (0) = In /p ,
0 The parameter scores q = 2, h = 1 are used in the dual-
r (0) = 1, p0 = 106 is set. rate system, and the sampled dual-rate signal is denoted

Figure 3: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.95 varies with time t.
Deep-learning music genre categorization research  7

Figure 4: The variation curve of the parameter estimation error of the DR-AMMSG algorithm when λ = 0.9 varies with time t.

Figure 5: The variation curve of parameter estimation error δ with time t when the DR-AM-MSG algorithm has a time-varying λ.

by {u(t ) , y(2t )} . The two-rate least squares identification 3.1 Implementation


algorithm (AMLS) using the auxiliary (estimated) model
and the two-rate multiple innovation stochastic gradient (1) When the forgetting factor is represented by λ = 1 , the
identification algorithm with a forgetting factor (auxilary system parameter estimation error is shown in Figure 2.
model multi-innovation forgetting gradient) based on the aux-
iliary model are compared to identify the system. The error of Figure 2 depicts that as the innovation length p increases,
the parameter estimations is calculated by δ = ∥θ  − θ∥/∥θ∥, the convergence speed of the parameter estimation system
 represents the estimation of the parameter vector, and
where θ and the identification accuracy gradually become faster and
θ represents the true value of the system parameter vector. higher.

Figure 6: A schematic diagram of the basic structure of an RNN.


8  Lili Liu

(2) When the forgetting factor takes λ = 0.95 and 0.9,


respectively, the schematic diagrams of the system
parameter estimation error are shown in Figures 3
and 4.

From Figures 3–5, adjusting the forgetting factor could


speed up the convergence speed of the parameter estima-
tion, but the addition of the forgetting factor will increase
the fluctuation of the system parameter estimation. There-
fore, the article considers a linear time-varying forgetting
0.1
factor λ(t ) = 0.9 + N t , where N is the number of recursive
steps, which is set to 3,000 in the example. Thus, the stability
of the system is improved, and the effect of music feature
recognition through enhancement processing is advanced.
The simulation results are shown in Figure 6.

4 Music genre classification based


on a deep-learning algorithm
In this section, we briefly explain what are the components
of an RNN structure and how it will be implemented in the
classification of the music genres based on the proposed
algorithm in the previous section. The RNN is a neural net-
work that specializes in processing time series sequences. Its
basic structure is shown in Figure 6. The left side of Figure 6
shows the structure of the RNN. Module A represents the
hidden nodes in the network, and xt is the value of the input
sequence x at the t-th time. Ot is the output of the hidden
node at the t-th time, ht is the hidden state of the hidden
node at the t-th time, and U, V, and W are the parameter
matrices in the network.
Due to the limitations of the shallow structures, the
classifier makes it difficult to express the music sequence
and semantic information at a deeper level, which affects the
performance of the classification. The proposed method,
according to the feature sequence of the input music, employs
both the RNN and the attention mechanism concurrently.
Hence, the Bi-GRU and attention mechanisms are imple-
mented to design the network to classify music. The Bi-GRU is
good at processing sequenced data and automatically masters
music context semantics and high-level features from the
sequenced features.
By expanding the loop edges of the hidden node along
the time axis, the chain structure shown on the right side of
Figure 6 can be obtained. RNNs share parameters at dif-
ferent moments and positions and have two advantages. Figure 7: A schematic diagram of the input and output of a cyclic neural
network.
On the one hand, as the parameter space can be reduced,
Deep-learning music genre categorization research  9

Figure 8: Music genre classification system based on the RNN.

Figure 9: A schematic diagram of data enhancement.

the scale of the neural network can be reduced. Thus, the including one-to-many, many-to-many, many-to-one, and
generalization ability can be guaranteed. On the other can be adapted to a variety of tasks, as shown in Figure 7.
hand, it gives the RNN both memory and learning abilities Figure 7(a) shows the one-to-many mode, which is often
and stores useful information in the parameter matrix U, used for decoder modeling. It is suitable for inputting a code
V, and W. vector, decoding it, and outputting the corresponding decoding
The input and output modes of the cyclic neural net- sequence. Figure 7(b) shows the many-to-one model, which is
work are very flexible, and there can be multiple modes, often used for sequence classifier modeling. It is suitable for
10  Lili Liu

deep-learning tasks of the input sequence and single output.


The output node of the recurrent unit passes directly through the
classifier. Figure 7(c) shows the synchronous many-to-many
mode. Each time step of the sequence corresponds to an output,
which can be used for tasks such as text generation and music
synthesis. Figure 7(d) shows the asynchronous many-to-many
mode, which is often used for encoder–decoder modeling, which
can be obtained by coupling and connecting two RNNs based on
context. Both the input data sequence and the output target
sequence are variable in length, and the lengths are allowed to
be unequal, which is suitable for machine translation problems.
On the other hand, preprocessing is an indispensable
stage in processing music signals, and its main purpose is to
facilitate feature extraction in the next stage. The extracted
characteristics are another form of expression of the music
signals. Because the music signal originally contains a lot of
redundancy, if the time domain audio signal is directly input
into the classification system, the amount of calculation would
be catastrophic. Finally, the extracted feature parameters are
Figure 10: Feature classification of a music signal.
input into the classifier, and then the feature is modeled by

Table 2: Feature classification effect of music signal


Table 1: Audio recognition error
Num Feature Num Feature Num Feature
Num Audio Num Audio Num Audio recognition recognition recognition
error (%) error (%) error (%) (%) (%) (%)

1 0.0280 27 0.0783 53 0.4244 1 88.47 27 80.51 53 86.81


2 0.4582 28 0.4826 54 0.2415 2 91.84 28 83.59 54 87.99
3 0.5972 29 0.1616 55 0.3695 3 94.93 29 91.00 55 94.38
4 0.0205 30 0.5913 56 0.0267 4 82.63 30 80.42 56 80.02
5 0.5946 31 0.5011 57 0.5165 5 82.87 31 82.67 57 88.24
6 0.2911 32 0.5818 58 0.4447 6 87.81 32 91.68 58 82.03
7 0.0151 33 0.2085 59 0.5772 7 92.88 33 92.35 59 85.70
8 0.4425 34 0.3836 60 0.1382 8 81.18 34 81.69 60 92.88
9 0.5340 35 0.1526 61 0.0896 9 82.44 35 84.39 61 85.30
10 0.0458 36 0.2599 62 0.3846 10 88.98 36 92.30 62 88.55
11 0.2517 37 0.0804 63 0.0417 11 87.66 37 79.15 63 85.16
12 0.3598 38 0.1525 64 0.3797 12 94.12 38 90.05 64 79.51
13 0.3856 39 0.4676 65 0.0691 13 90.27 39 81.64 65 90.08
14 0.3569 40 0.4867 66 0.3118 14 90.11 40 80.33 66 94.18
15 0.3577 41 0.4868 67 0.3589 15 85.08 41 93.25 67 83.60
16 0.1420 42 0.3443 68 0.5953 16 89.27 42 86.59 68 87.32
17 0.4973 43 0.1902 69 0.4151 17 86.58 43 87.90 69 80.45
18 0.0627 44 0.2261 70 0.1155 18 79.80 44 91.03 70 93.19
19 0.3401 45 0.4719 71 0.5692 19 79.26 45 84.37 71 90.66
20 0.5157 46 0.3057 72 0.3158 20 82.48 46 79.09 72 92.75
21 0.3039 47 0.4056 73 0.5894 21 87.55 47 80.57 73 84.21
22 0.2468 48 0.2800 74 0.2500 22 90.32 48 90.63 74 85.62
23 0.0631 49 0.4487 75 0.1793 23 86.91 49 80.73 75 81.72
24 0.0880 50 0.5337 76 0.5303 24 85.49 50 79.15 76 87.95
25 0.3448 51 0.0426 77 0.5372 25 82.86 51 85.89 77 93.57
26 0.5567 52 0.5400 78 0.5618 26 94.39 52 83.22 78 81.10
Deep-learning music genre categorization research  11

adjusting the parameters of the classifier. The best model parameters of the RNN model. A simulation experiment
obtained by training is implemented to realize the discrimina- is carried out using test parameters, namely, audio recog-
tion of the genre utilizing test music samples. nition error, feature classification effect of music signal,
Figure 8 shows a music genre classification system and classification effect of music genre. The results are
based on the RNN. shown in Tables 1–3. The model is trained based on the
Since the data are limited, the method employed for the two data sets using 0.80 for training and 0.20 for the test.
data set enhancement is required and is shown in Figure 9. As Simulation data are used to generate the predictions for
mentioned above, in the two data sets used for classification, Tables 1–3. When the average audio error rate (the mean of
the duration of each song excerpt segment C is 30 s. In the Table 1) is 0.33, the feature classification of the musical
article, each segment is cut, and the duration of each sub- signal reaches 0.8945, and the genre classification reaches
segment ci, after being cut, is 3 s. Moreover, there is a 50% 82.55.
overlap between two adjacent sub-segments. The excerpts of The results imply that even though the audio error
each song are cut into 18 sub-segments with a duration of 3 s with a large mean occurs in the system, the proposed
(because the sample duration is about 30 s, the last slice may system could handle it in feature and genre classifications.
be less than 3 s, so it is discarded). In addition, each sub-
segment carries the same genre tag as the source segment.
The structure of the specific features is shown in Figure 10.
5 Conclusion
4.1 Result of music genre classification With the rapid development of network technology and
multimedia platforms, the amount of digital music has
The dual-rate system and auxiliary (estimated) model are increased rapidly, and it is difficult for listeners to manage
used to generate errors and also to optimize the these huge amounts of stored music. Moreover, listeners

Table 3: Classification effect of music genre

Num Genre classification (%) Num Genre classification (%) Num Genre classification (%)

1 79.16 27 90.02 53 83.95


2 77.89 28 79.89 54 76.58
3 78.28 29 91.61 55 87.26
4 85.66 30 76.30 56 75.51
5 86.15 31 78.38 57 75.27
6 75.88 32 76.16 58 82.38
7 75.45 33 75.65 59 87.05
8 83.74 34 90.70 60 76.49
9 83.82 35 91.08 61 78.79
10 77.80 36 79.11 62 75.68
11 87.94 37 83.77 63 81.13
12 78.96 38 87.05 64 89.59
13 78.04 39 88.90 65 87.73
14 76.13 40 81.78 66 88.81
15 79.98 41 81.39 67 83.09
16 80.16 42 87.48 68 89.29
17 83.06 43 89.92 69 82.32
18 81.78 44 87.67 70 81.69
19 85.79 45 90.31 71 82.90
20 84.62 46 75.55 72 81.76
21 85.37 47 90.83 73 90.78
22 77.27 48 78.04 74 87.98
23 83.05 49 78.17 75 79.07
24 77.89 50 81.98 76 87.43
25 78.37 51 87.22 77 82.45
26 79.40 52 75.03 78 91.80
12  Lili Liu

need to quickly and accurately retrieve the music they are parameter estimation, but the addition of the forgetting factor
interested in from a huge music database. Music genres are will increase the fluctuation of the system parameter estima-
different musical styles formed by different melodies, instru- tion. Finally, the proposed model has provided a good music
ments, rhythms, and other characteristics under different genre classification effect.
periods and different cultural backgrounds. Therefore, the Future direction will be based on the performance
classification of music genres has become a very important comparison of the proposed method with other implemen-
research direction in the field of music information retrieval. table methods to classify music genres.
Designing a system that automatically classified music
genres and improved the accuracy of the classification Funding information: The research did not receive any
process as much as possible has been a highly desired funding.
consequence. The article employed a deep-learning algo-
rithm (RNN) to study music genre classification and Author contributions: The article is written by a single
proposed a music genre classification system with an intel- author.
ligent music feature recognition function. Hence, the RNN
helped separately design processes that conducted feature Conflict of interest: Author declares no conflict of interest.
extractions and ran classifications concurrently. Thus, an
attention mechanism was implemented in the network so Ethical approval: No ethical approval is needed.
limited resources that were characterized as the most salient
and closely related information in the input sequence was cap- Informed consent: No consent is needed.
tured. The RNN model with the attention mechanism assigned
different time series features to distinct weights when the
model was trained. Hence, the attention probability distribution
corresponding to the feature representation was calculated References
through the attention mechanism. Moreover, obtaining a fea-
ture representation more accurately found musical character- [1] F. Calegario, M. Wanderley, S. Huot, G. Cabral, and G. Ramalho,
“A method and toolkit for digital musical instruments: generating
istics and improved classification accuracy in the article.
ideas and prototypes,” IEEE Multimed, vol. 24, no. 1, pp. 63–71, 2017.
The RNN shared parameters at different moments and [2] D. Tomašević, S. Wells, I. Y. Ren, A. Volk, and M. Pesek, “Exploring
positions and had two advantages. On the one hand, as the annotations for musical pattern discovery gathered with digital
parameter space was reduced, the scale of the neural net- annotation tools,” J. Math. Music., vol. 15, no. 2, pp. 194–207, 2021.
work was deceased. Thus, the generalization ability was [3] X. Serra, “The computational study of musical culture through its
digital traces,” Acta Musicologica, vol. 89, no. 1, pp. 24–44, 2017.
guaranteed. On the other hand, it gave the RNN both
[4] S. K. Prabhakar and S. W. Lee, “Holistic approaches to music genre
memory and learning abilities and stored useful informa- classification using efficient transfer and deep learning techni-
tion in the parameter matrices. Moreover, adjusting the ques,” Expert. Syst. Appl., vol. 211, p. 118636, 2023.
forgetting factor could speed up the convergence speed [5] W. Hongdan, S. SalmiJamali, C. Zhengping, S. Qiaojuan, and R. Le,
of the parameter estimation, but the addition of the forget- “An intelligent music genre analysis using feature extraction and
ting factor would increase the fluctuation of the system classification using deep learning techniques,” Comput. Electr. Eng.,
vol. 100, p. 107978, 2022.
parameter estimation. Therefore, the article also consid-
[6] J. H. Foleis and T. F. Tavares, “Texture selection for automatic music
ered a linear time-varying forgetting factor. Even though genre classification,” Appl. Soft Comput., vol. 89, p. 10612, 2022.
improvement has been based on the forgetting factor λ, [7] A. E. C. Salazar, “Hierarchical mining with complex networks for music
fluctuated estimations of the system parameters required genre classification,” Digital Signal. Process., vol. 127, p. 103559, 2022.
to use of another forgetting factor called a linear time- [8] Y. Yu, S. Luo, S. Liu, H. Qiao, Y. Liu, and L. Feng, “Deep attention
based music genre classification,” Neurocomputing, vol. 372,
varying forgetting factor, which is the same function for
pp. 84–91, 2020.
a time. Hence, the stabilization of the classification system [9] S. O. Folorunso, S. A. Afolabi, and A. B. Owodeyi, “Dissecting the
is better reached. So, instead of using a fixed value, a linear genre of Nigerian music with machine learning models,” J. King
time-varying form of the forgetting factor is implemented. Saud. Univ. – Comput. Inf. Sci., vol. 34, no. 8, Part B,
When the forgetting factor is denoted by λ = 1 , and the pp. 6266–6279, 2022.
innovation length p increases, the convergence speed of [10] S. Chapaneri, R. Lopes, and D. Jayaswal, “Evaluation of music fea-
tures for PUK kernel based genre classification,” Procedia Comput.
the parameter estimation system and the identification accu-
Sci., vol. 45, pp. 186–196, 2021.
racy gradually become faster and higher. When the forgetting [11] B. Kumaraswamy and P. G. Poonacha, “Deep convolutional neural
factor takes λ = 0.95 and λ = 0.9, respectively, adjusting the network for musical genre classification via new self adaptive sea
forgetting factor could speed up the convergence speed of the lion optimization,” Appl. Soft Comput., vol. 108, p. 107446, 2021.
Deep-learning music genre categorization research  13

[12] N. Farajzadeh, N. Sadeghzadeh, and M. Hashemzadeh, “PMG-Net: piano accompaniment transposition techniques,” Int. J. Res. Publ.,
Persian music genre classification using deep neural networks,” vol. 66, no. 1, pp. 1–11, 2021.
Entertain. Comput., vol. 44, p. 100518, 2023. [22] L. Turchet and M. Barthet, “A ubiquitous smart guitar system for
[13] Y. Singh and A. Biswas, “Robustness of musical features on deep collaborative musical practice,” J. N. Music. Res., vol. 48, no. 4,
learning models for music genre classification,” Expert. Syst. Appl., pp. 352–365, 2019.
vol. 199, p. 116879, 2022. [23] R. Khulusi, J. Kusnick, C. Meinecke, C. Gillmann, J. Focht, and S.
[14] I. B. Gorbunova and N. N. Petrova, “Digital sets of instruments in Jänicke, “A survey on visualizations for musical data,” Comput.
the system of contemporary artistic education in music: socio-cul- Graph. Forum, vol. 39, no. 6, pp. 82–110, 2020.
tural aspect,” J. Crit. Rev., vol. 7, no. 19, pp. 982–989, 2022. [24] E. Cano, D. FitzGerald, A. Liutkus, M. D. Plumbley, and F. R. Stöter,
[15] E. Partesotti, A. Peñalba, and J. Manzolli, “Digital instruments and “Musical source separation: An introduction,” IEEE Signal. Process.
their uses in music therapy,” Nordic J. Music. Ther., vol. 27, no. 5, Mag., vol. 36, no. 1, pp. 31–40, 2020.
pp. 399–418, 2018. [25] T. Magnusson, “The migration of musical instruments: On the
[16] B. Babich, “Musical “Covers” and the culture industry: From anti- socio-technological conditions of musical evolution,” J. N. Music.
quity to the age of digital reproducibility,” Res. Phenomenol., vol. 48, Res., vol. 50, no. 2, pp. 175–183, 2020.
no. 3, pp. 385–407, 2018. [26] I. B. Gorbunova and N. N. Petrova, “Music computer technologies,
[17] L. L. Gonçalves and F. L. Schiavoni, “Creating digital musical supply chain strategy, and transformation processes in the socio-
instruments with lib mosaic-sound and mosaicode,” Rev. de. Inform. cultural paradigm of performing art: Using digital button accor-
Teórica e Apl., vol. 27, no. 4, pp. 95–107, 2020. dion,” Int. J. Supply Chain Manag., vol. 8, no. 6, pp. 436–445, 2020.
[18] I. B. Gorbunova, “Music computer technologies in the perspective [27] J. A. A. Amarillas, “Marketing musical: música, industria y
of digital humanities, arts, and research,” Opcion, vol. 35, no. promoción en la era digital,” INTERdisciplina, vol. 9, no. 25,
SpecialEdition24, pp. 360–375, 2018. pp. 333–335, 2021.
[19] A. Dickens, C. Greenhalgh, and B. Koleva, “Facilitating accessibility [28] G. Scavone and J. O. Smith, “A landmark article on nonlinear time-
in performance: participatory design for digital musical instru- domain modeling in musical acoustics,” J. Acoust. Soc. Am., vol. 150,
ments,” J. Audio Eng. Soc., vol. 66, no. 4, pp. 211–219, 2018. no. 2, pp. R3–R4, 2021.
[20] O. Y. Vereshchahina-Biliavska, O. V. Cherkashyna, Y. O. [29] L. Turchet, T. West, and M. M. Wanderley, “Touching the audience:
Moskvichova, O. M. Yakymchuk, and O. V. Lys, “Anthropological musical haptic wearables for augmented and participatory live
view on the history of musical art,” Linguist. Cult. Rev., vol. 5, no. S2, music performances,” Personal. Ubiquitous Comput., vol. 25, no. 4,
pp. 108–120, 2021. pp. 749–769, 2021.
[21] A. C. Tabuena, “Chord-interval, direct-familiarization, musical [30] C. J. Way, “Populism in musical mash-ups: recontextualizing Brexit,”
instrument digital interface, circle of fifths, and functions as basic Soc. Semiotics, vol. 31, no. 3, pp. 489–506, 2021.

You might also like