Music Generation System for Adversarial Training Based on
Deep Learning
Jun Min 1 , Zhaoqi Liu 1 , Lei Wang 1, *, Dongyang Li 1 , Maoqing Zhang 1 and Yantai Huang 2
1 College of Electronics and Information Engineering, Tongji University, Shanghai 201804, China
2 College of Automation and Electrical Engineering, Zhejiang University of Science and Technology,
Hangzhou 310023, China
* Correspondence: [email protected]
Abstract: With the rapid development of artificial intelligence, the application of this new technology to music generation has attracted increasing attention and achieved encouraging results. This study proposes a method that combines the transformer deep-learning model with generative adversarial networks (GANs) to explore a more competitive music generation algorithm. Ideas from text generation in natural language processing (NLP) were used for reference, and a unique loss function was designed for the model; the training process solves the problem of non-differentiable gradients when generating music. Unlike LSTM, which cannot handle long music sequences, the model based on the transformer and GANs can extract the relationships between the notes of long music samples and learn the rules of composition well. At the same time, the optimized transformer and GANs model has clear advantages in system complexity and in the accuracy of the generated notes.
Keywords: artificial intelligence (AI); music generation; natural language processing; transformer;
GANs
new model tries to generate longer sequences by obtaining information from past time steps, which is similar to how humans generate music. Convolutional neural networks (CNNs) [6] are a basic model commonly used in music generation. The CNN prototype [7] was proposed in the late 1990s, but it was not until AlexNet [8] in 2012 that CNNs as we know them gained wide recognition. Building on this, Google's DeepMind artificial intelligence laboratory in London developed WaveNet [9], one of the most successful CNN-based music generation applications. The recurrent neural network (RNN) has become the most popular model for processing typical serialized data such as music.
Considering the similarity between music and language, some language generation models can be adapted to music generation. This approach represents notes as embedding vectors. Embedding [10] is a concept that originated in NLP: the vector representation of each word, note, or chord is obtained by training a neural network on a corpus, and the result is then used as the input to networks for downstream tasks. In music, each major or minor mode has a tonic chord of three notes; in other words, the whole song revolves around the tonic and its related chords [11,12]. In addition to such explicit relationships between notes, there are also implicit relationships in music [13], corresponding to anaphora in language. This is why it is important to choose a transformer as the core model for music generation. In addition, GANs are widely used for images, language, and music: by training the discriminator and generator against each other, the network pushes the generated results toward authenticity [14]. GANs are now coming into use for generating music sequences [15]. A critical difficulty is that, because of the adversarial architecture, the generated sequence must be fed into a discriminator. Therefore, this study proposes a new model structure that combines a transformer [16] and GANs [17] to create music. The study also presents a unique loss function that enables the system to learn and update its parameters along the two gradient descent directions of “real music” and the target sequence [18].
The contributions of this study are as follows:
• Establishing a new music generation system that combines the transformer and GANs.
• Proposing a unique loss function for the proposed model to learn from the descending directions of “real music” and the target sequence and to update the parameters over time.
• Improving the input and output structures of the discriminator and the generator
and solving the problems of gradient non-differentiability and mode collapse in
the discriminator.
• Applying the vocabulary matching method to perfect the intricate melody generated
in the time domain and generate a real and controllable long-term structure.
• Presenting a relatively objective suggestion to evaluate music based on Euler’s music
evaluation mechanism.
The rest of the study is organized as follows: Section 2 introduces related work
on music generation. Section 3 puts forward the method proposed in this study and
introduces the construction and training process of the model. Section 4 presents the results
of the experiment.
2. Related Work
Long short-term memory (LSTM) networks developed rapidly and have been applied extensively to music processing and music generation. With LSTM, automatically generated music can achieve high-quality, high-fidelity results. Ycart et al. [19] and Sheykhivand et al. [20] both used LSTM as the cornerstone neural network for music generation. Borodin et al. [21] proposed a multi-channel data-processing method for chords using many-hot encoding; the input was upgraded from single-note encoding to a multi-dimensional representation vector, and the next note combination was predicted by LSTM, which enriched the generated output. Chen et al. [22] combined LSTM with chaos theory to optimize tone shifts in music without deformation, reduce the amount of computation, and improve training efficiency. Lehner et al. [23]
combined LSTM with a restricted Boltzmann machine (RBM) [24]. These and other probabilistic techniques are combined with deep neural networks to help people compose music. However, LSTM-based methods fail to generate long-term sequences. The authors of [25] proposed combining biaxial LSTM with GANs, which improved music quality significantly over ordinary LSTM. Many experiments have found that these models only allow the network to learn the relationships between note features from actual music data; they do not learn harmony from the music as a whole or the rules composers need to follow.
Language and music have similar characteristics [26]. In natural language processing, Google put forward the transformer [27] in 2017, which used a self-attention mechanism as its main component and became a state-of-the-art (SOTA) method in many tasks. The transformer is a typical encoder-decoder model (i.e., a sequence-to-sequence (seq2seq) model) [28] mainly used in generation scenarios such as question-answering systems and machine translation. Given the similarity between language and music, Google applied the transformer to music generation in 2018 [29]. Because of its reduced memory footprint, the music transformer could generate longer, more coherent music; however, it was imperfect and produced too many redundant and sparse musical notes. Based on the transformer decoder, OpenAI proposed the second-generation Generative Pre-Training (GPT) model [30] in 2019, which was considered a dangerous machine learning model because of its capabilities. In April of the same year, the music generation system MuseNet appeared, based on GPT-2 [31]. It could generate works in any genre and style, and even in the style of a particular composer or pianist; the generated music could be confused with official versions. In 2020, Jin et al. [32] proposed a new scheme combining the transformer and GPT. Since then, artificial intelligence composition has reached a more mature stage. However, the quality of the music generated by these methods has not reached an acceptable level, because the neural network cannot understand the complexities of the language of music. The information in the notes needs to be transmitted to the system as part of the input, for example through explicit markings. Many experiments have found that these models only allow a network to learn the relationships between note features from music data; they do not follow the rules of music and composition.
3. Proposed Method
The first problem is generating note sequences and determining how notes are converted. The most commonly used approach is to convert notes into one-hot encodings, but because the samples involve many distinct notes, the resulting digital matrices are sparse. Although one-hot encoding can reduce the dimensionality of the model input, the predicted label probability distribution cannot be directly and accurately transformed back into notes, so the sorted note sequence is instead digitally coded according to the order in which notes occur. Although this encoding ignores the original relationships between notes (e.g., the relationship between tone C and the chord C-E-G), the transformer first passes the note IDs through a trainable embedding layer, which finds a reasonable mapping and converts each note ID into a vector that captures note relationships. In addition, because pieces range in length from dozens of notes to thousands of notes, the seq_length parameter is set to cut the input and output into two equal-length continuous sequences. For example, the first ten notes of a song are converted as follows:
[6.11, E-5, E5, F#4, B3, E-5, B4, 2.6, E-5, 11.3.6].
Here, “6.11” is a polyphonic chord whose components are separated by periods, and the rest are single notes. There are more than 1500 distinct notes or chords in the 2786 music samples of the GiantMIDI-Piano dataset. To reduce the difficulty of model prediction, all notes are sorted by frequency of occurrence, and notes occurring less often than a set threshold are marked [unk]. The process for note feature extraction is shown in Figure 1.
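For concreteness, the sketch below shows one way to build such a frequency-sorted vocabulary with an [unk] bucket and to encode a song into the integer IDs fed to the embedding layer. The function names and the min_count threshold value are illustrative assumptions; the paper does not specify its exact threshold.

```python
from collections import Counter

def build_note_vocab(songs, min_count=10):
    """Map notes/chords to integer IDs, replacing rare symbols with '[unk]'."""
    counts = Counter(tok for song in songs for tok in song)
    # Sort by descending frequency so frequent notes get small IDs.
    kept = [tok for tok, c in counts.most_common() if c >= min_count]
    vocab = {'[unk]': 0}
    for tok in kept:
        vocab[tok] = len(vocab)
    return vocab

def encode(song, vocab):
    """Convert a note/chord sequence into the integer IDs fed to the embedding layer."""
    return [vocab.get(tok, vocab['[unk]']) for tok in song]

song = ['6.11', 'E-5', 'E5', 'F#4', 'B3', 'E-5', 'B4', '2.6', 'E-5', '11.3.6']
vocab = build_note_vocab([song], min_count=1)
print(encode(song, vocab))
```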
\mathrm{Attention}(Q, K, V) = \mathrm{softmax}\left(\frac{QK^{T}}{\sqrt{d_k}}\right)V \qquad (3)
where d_k is the dimension of the input vector. There are two common attention functions: additive attention and dot-product attention. The mechanism used here is dot-product attention, which is not only faster than additive attention but also more space-efficient. The attention weights are calculated from Q and K, and the product must be scaled by \sqrt{d_k}; otherwise, when the dot product is too large, the gradient computed through the softmax function becomes very small, which is not conducive to backpropagation.
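As a minimal illustration of Equation (3), the NumPy sketch below computes scaled dot-product attention; the function name and the optional mask argument are assumptions for illustration, not the paper's implementation.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)         # numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def scaled_dot_product_attention(Q, K, V, mask=None):
    """Equation (3): Attention(Q, K, V) = softmax(Q K^T / sqrt(d_k)) V."""
    d_k = Q.shape[-1]
    scores = Q @ K.swapaxes(-1, -2) / np.sqrt(d_k)   # (..., len_q, len_k)
    if mask is not None:
        scores = np.where(mask, scores, -1e9)        # block disallowed positions
    weights = softmax(scores, axis=-1)
    return weights @ V, weights
```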
Multihead attention is composed of h self-attention heads. The calculation formula is shown in Equation (4):

\mathrm{MultiHead}(Q, K, V) = \mathrm{Concat}(\mathrm{head}_1, \ldots, \mathrm{head}_h)W^{O} \qquad (4)

where

\mathrm{head}_i = \mathrm{Attention}(QW_i^{Q}, KW_i^{K}, VW_i^{V}) \qquad (5)
Because a single self-attention head learns from only one perspective, it may be biased. Therefore, h different weight combinations are designed. Before attention is calculated, Q, K, and V are each linearly transformed with these weight combinations; the outputs of the h attention heads are spliced together, and a linear transformation with a new weight matrix produces the final output. Multihead attention splits the Q, K, and V vectors of each note unit into h vectors of dimensionality d_model/h for self-attention calculation, splices the results, merges and adjusts them with a fully connected layer, and then outputs the result. The decoder structure is roughly the same as that of the encoder and is composed of N sub-decoders. The difference is that the Q and K vectors in the multihead attention input of the decoder come from the encoder. To obtain the relationships between units in the note sequence from training, the dot product of Q and K formed by self-attention in the encoding process becomes the weight of V in the decoding process. Compared with the encoder layer, a masked multihead attention unit is added to each sub-decoder layer, because when generating a note sequence, the next note unit must be predicted only after the current note unit; otherwise, the model would effectively see the answer before learning, and training would be meaningless.
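A minimal sketch of such a look-ahead (causal) mask, compatible with the attention sketch above, is shown below; it is illustrative rather than the paper's implementation.

```python
import numpy as np

def look_ahead_mask(seq_len):
    """Lower-triangular mask: position i may attend only to positions 0..i."""
    return np.tril(np.ones((seq_len, seq_len), dtype=bool))

# Passing this mask to scaled_dot_product_attention prevents the decoder from
# "seeing" future note units while predicting the current one.
print(look_ahead_mask(4).astype(int))
```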
A frame diagram of the whole model is shown in Figure 2.
where LearningRate is the learning rate at the current training step, d_model is the dimension of the model input vector, and StepNum is the number of steps completed so far. The warm-up learning-rate strategy is used because the compensation mechanism needs to be combined at an early stage of training: a fast update rate lets the model quickly learn the parameter characteristics of the notes, while in the middle and late stages of training the learning rate is slowly reduced so that the model can better learn the detailed characteristics of the note distribution. Combined with the GAN, the predicted note sequence is input to the discriminator to determine whether it was sampled from the dataset or generated. Equation (7) is the optimization function of GANs:

\min_{G}\max_{D} V(D, G) = \mathbb{E}_{x \sim p_{data}}[\log D(x)] + \mathbb{E}_{z \sim p_{z}}[\log(1 - D(G(z)))] \qquad (7)
where x ∼ p_data represents input drawn from the real note distribution, and z ∼ p_z represents the analog (noise) distribution. G is the generator, and D is the discriminator. As shown in Equation (7), through the terms E_{x∼p_data}[log D(x)] and E_{z∼p_z}[log(1 − D(G(z)))], the discriminator is trained to maximize the probability assigned to sampled (real) sequences and to minimize the probability assigned to generated sequences. Accordingly, the output of the discriminator is the probability that the input is real music. Consequently, the target label for D(x) is 1, while for D(G(z)) it is 0, so Equation (7) can be seen as a variant of cross-entropy. When the generator G is fixed, taking the partial derivative of the objective function V(G, D) with respect to the discriminator D yields Equation (8):
D^{*}(x) = \frac{p_{data}(x)}{p_{g}(x) + p_{data}(x)} \qquad (8)
Substituting the optimal discriminator of Equation (8) into Equation (7) turns the optimization goal into minimizing the Jensen–Shannon divergence (JSD) between p_g(x) and p_data(x). When p_g(x) = p_data(x), the two networks reach a Nash equilibrium; at that point, the discriminator D assigns a probability of 50% to both real samples and generated samples.
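To make the two optimization directions concrete, the sketch below expresses the two sides of Equation (7) as losses over discriminator outputs; it assumes D outputs probabilities and is an illustration, not the paper's code.

```python
import numpy as np

EPS = 1e-8  # avoid log(0)

def discriminator_loss(d_real, d_fake):
    """Discriminator side of Equation (7): push D(x) toward 1 and D(G(z)) toward 0."""
    return -(np.log(d_real + EPS).mean() + np.log(1.0 - d_fake + EPS).mean())

def generator_loss(d_fake):
    """Generator side of Equation (7): minimize log(1 - D(G(z)))."""
    return np.log(1.0 - d_fake + EPS).mean()

d_real = np.array([0.9, 0.8, 0.95])   # D's outputs on real note sequences
d_fake = np.array([0.2, 0.1, 0.3])    # D's outputs on generated sequences
print(discriminator_loss(d_real, d_fake), generator_loss(d_fake))
```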
This process has two problems: gradient non-differentiability and mode collapse. The
reason why the gradient is not differentiable is that GANs need to input the generated note
sequence into the discriminator to determine authenticity, but the output of the generated
where y_i is the normalized output probability of the i-th note, τ is the inverse temperature parameter, K is a global scalar, and \sum_{j=1}^{K} \exp(y_j/\tau) is the normalizing term. As τ → 0, the probability distribution produced by Equation (9) approaches a one-hot encoding vector; as τ → +∞, the output approaches a uniform distribution. When τ is a finite positive value, the sample produced by Equation (9) is smooth and differentiable for the generator. In short, the inverse temperature parameter is expected to learn the relationship between the probability distribution and the extraction of its maximum value. During training, τ is set to a large value and slowly decreased almost to zero.
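The temperature behaviour described above can be illustrated with a short sketch; the variable names are assumptions, and Equation (9) itself (including the role of the global scalar K) follows the paper rather than this snippet.

```python
import numpy as np

def relaxed_softmax(y, tau):
    """Temperature-relaxed softmax: small tau approaches one-hot, large tau approaches uniform."""
    z = y / tau
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

y = np.array([2.0, 1.0, 0.1])
print(relaxed_softmax(y, tau=0.05))  # nearly one-hot
print(relaxed_softmax(y, tau=50.0))  # nearly uniform
```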
The real sequence is represented by standard one-hot embeddings, while the generator's output is a probability distribution over the predicted labels. The two representations differ greatly in form, and the discriminator in the GAN easily captures this difference. Therefore, the one-hot embedding of the real sequence needs to be optimized by adding noise, as shown in Equation (10):
y_i = \mathrm{softmax}\left(\frac{\mathrm{onehot}(y_i) + g_i}{\lambda}\right) \qquad (10)
where y_i is the sequence of real notes, g_i is random noise drawn from the interval (−ε, ε), and λ is a constant less than 1 that amplifies the effect of the noise. The purpose of adding λ is to make the vector after the softmax conversion closer to the form of a one-hot encoding.
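A small sketch of this softened one-hot target is given below; the values of eps and lam are illustrative assumptions, since the paper does not state them.

```python
import numpy as np

def noisy_onehot(index, vocab_size, eps=0.1, lam=0.1, seed=0):
    """Softened real-sequence target in the spirit of Equation (10)."""
    rng = np.random.default_rng(seed)
    onehot = np.zeros(vocab_size)
    onehot[index] = 1.0
    g = rng.uniform(-eps, eps, size=vocab_size)  # noise g_i in (-eps, eps)
    z = (onehot + g) / lam                       # lambda < 1 sharpens toward one-hot
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()

print(noisy_onehot(index=2, vocab_size=5))
```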
Another problem is mode collapse. The generator is evaluated by calculating the accuracy between the generated result and the actual result, but the system mistakes the one-hot encoding pattern of the real sequence for one of its real features, resulting in low accuracy of the generative model and ultimately leading to mode collapse.
A root mean square error (RMSE) term between the predicted and real sequences is added to accelerate convergence and avoid mode collapse. When the difference between the predicted label and the real label exceeds a reasonable value, the RMSE term corrects the learning direction of the gradient.
Assuming real target sequence samples {x_1, . . . , x_K} and real input samples {z_1, . . . , z_K}, the loss function is calculated as follows:

L = \frac{1}{K}\sum_{i=1}^{K}\left[\log D(x_i) + \log(1 - D(G(z_i)))\right] \qquad (11)

where G is the generator, D is the discriminator, z_i is a real input sample, and K is the number of samples.
Therefore, the calculation formula of the loss function L is as shown in Equation (12):

L = \alpha \cdot \frac{1}{K}\sum_{j=1}^{K}\left\| \hat{y}_j - G(z_j) \right\|_2 + \exp\left(\frac{1}{K}\sum_{i=1}^{K}\log\left(1 - D(G(z_i))\right)\right) \qquad (12)

where α (0 < α < 1) is a preset weight coefficient, \hat{y}_j represents the real sequence label of the j-th sequence, and the first term is the RMSE between the predicted sequence and the real sequence.
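A sketch of this combined objective is shown below. The array shapes, the exact norm and averaging used for the RMSE term, and the placement of α follow a plausible reading of Equation (12) and are assumptions rather than the paper's exact code.

```python
import numpy as np

EPS = 1e-8

def combined_loss(y_true, y_pred, d_fake, alpha=0.3):
    """Combined generator loss in the spirit of Equation (12).

    y_true: (K, T, V) real-sequence targets (one-hot or noisy one-hot)
    y_pred: (K, T, V) generator output probability distributions G(z_j)
    d_fake: (K,)      discriminator outputs D(G(z_i)) on the generated sequences
    alpha:  weight of the RMSE term (Table 2 selects alpha = 0.3)
    """
    rmse = np.sqrt(np.mean((y_true - y_pred) ** 2))     # first (RMSE) term
    adv = np.exp(np.mean(np.log(1.0 - d_fake + EPS)))   # exponentiated adversarial term
    return alpha * rmse + adv
```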
The calculation steps of the algorithm are as follows (Algorithm 1):
4. Experimental Summary
The GiantMIDI-Piano dataset, published in 2020, was used to develop the proposed method [31]. Table 1 compares several major MIDI-format music datasets. The GiantMIDI-Piano dataset is dramatically improved in quantity and richness compared to the others: it contains more than 10,000 piano pieces, with a total duration of more than 1200 h, that can be played by algorithms, making it the most extensive classical piano dataset in the world.
In Equation (12), the first term is the root mean square error, and the second is the cross-entropy loss. The cross-entropy term is exponentially amplified, which makes the model converge faster during gradient training. In addition, α prevents an excessive RMSE from destabilizing the convergence of authenticity learning. The value of α is selected according to the training accuracy after five epochs, and the results are shown in Table 2. Note that since α is a hyperparameter used during the training of the generator, the accuracy being compared here is the accuracy of the generator.
Table 2. Influence of different values of α on the accuracy of the model.
According to Table 2, the accuracy changes as different values of α are substituted. The accuracy is highest when α equals 0.3, so α is set to 0.3 (rounded to one decimal place).
The proposed music generation model based on the transformer and GANs has two loss optimization functions, corresponding to the optimization updates of discriminator D and generator G in the generative adversarial network. For discriminator D, the output is only 0 or 1, and the accuracy is the ratio of correctly predicted labels to all labels. At each prediction time step, the generator solves a multi-classification problem whose label dimension is vocab_size; for each predicted unit, the output is the probability distribution over the unit's label.
Figure 3a shows the discriminator loss, which decreases rapidly at the beginning of training and eventually stabilizes. Figure 3b shows the change in discriminator accuracy, which also rises rapidly to about 50% after training begins, reaching the optimal state. Figure 3c shows the accuracy on the validation set, which rises during training and finally reaches about 90%.
The multihead attention mechanism in the transformer recognizes the relationship between input and output units. Figure 4a shows the input-output attention of the model trained without the optimized loss function, and Figure 4b shows the attention after training with the optimized loss function. Comparing the two, the pattern in Figure 4b is more complex and, by visual inspection, its colors are also darker. These images reflect the relationship between the notes of the generated melody and the input samples: the darker the color, the stronger the relationship between the input notes (or chords). The GAN model without the optimized loss function does not capture the relationship between input and output, whereas the optimized model learns the coupling between input and output units far better. The final musical notation is shown in Figure 5.
Figure 5. The results generated by the system based on transformer and GANs.
To compensate for the shortcomings of root mean square error in music evaluation, several commonly used elements of music evaluation, based on Euler's music evaluation elements and listed in Table 3, were selected to quantify the strengths and weaknesses of the output music samples.
The absolute interval gradient is calculated as shown in Equation (13):

s_g = \begin{cases} 1, & x \le 6 \\ \lg(-x + 16), & 6 < x \le 15 \\ 0, & x > 15 \end{cases} \qquad (13)

The average number of extreme notes is calculated as shown in Equation (14):

n_{neibor} = \frac{2\, n_{min}\, n_{max}}{n_{min} + n_{max}} \qquad (14)

where n_min is the number of notes in the bass range when extreme note differences are present, n_max is the number of notes in the treble range, and n_neibor is the average number of extreme notes.
The calculation formula for dissonance is shown in Equation (15):
where n_leap indicates the number of dissonances, and p_i is the tone of the i-th note.
The chord-to-single-note ratio is calculated as shown in Equation (16):

r_{chord} = \frac{n_{chord}}{n_{chord} + n_{note}} \qquad (16)

where n_chord indicates the number of chords, and n_note is the number of single notes.
The calculation formula for note diversity is shown in Equation (17):
r_{div} = \frac{n_{dif}}{n} \qquad (17)

where n_dif indicates the number of non-repeated notes, and n is the total length of the sequence. The final score output is shown in Equation (18):
s = \frac{1}{\sum_{i=1}^{5} w_i}\sum_{i=1}^{5} w_i s_i \qquad (18)

where w_i is the weight of each output item, s_i is the value of each output item, and s is the final score.
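The final score is therefore a weighted average, as the short sketch below illustrates; the element scores and weights in the example call are made up, since the paper does not publish the weights w_i.

```python
import numpy as np

def final_score(s, w):
    """Equation (18): weighted average of the five evaluation elements."""
    s = np.asarray(s, dtype=float)
    w = np.asarray(w, dtype=float)
    return float((w * s).sum() / w.sum())

# Hypothetical element scores (out of 100) with equal weights.
print(final_score([80, 65, 70, 55, 60], [1, 1, 1, 1, 1]))
```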
The same sample dataset (GiantMIDI-Piano) [31] was input into several systems, as
shown in Tables 4 and 5, to compare output results (full score is 100).
Finally, we selected 30 volunteers with musical backgrounds from the Shanghai Conservatory of Music and 30 volunteers from the College of Electronics and Information Engineering at Tongji University for the test. The volunteers from the Shanghai Conservatory of Music are regarded as professionals and those from Tongji University as nonprofessionals in the field of music. Based on Table 3, the total score is 100 points, and the scoring results are shown in Table 5. The results in Table 4 are calculated from the elements listed in Table 3, with the calculation process following Equations (13)-(18). Table 5 is the
result of the volunteers’ manual evaluation according to the elements involved in Table 3.
The original samples for Tables 4 and 5 all used melodies included in the same sample
dataset (GiantMIDI-Piano).
Combining Tables 4 and 5, it can be seen that, whether for the data-based evaluation using Euler's music evaluation elements or for the manual evaluation, the optimized transformer and GANs model achieves the highest scores. At the same time, compared with the other models, the optimized transformer and GANs model also achieves the best note accuracy.
5. Conclusions
Taking aim at the challenge of music generation, this study addressed the obstacles facing sequence generation models and GANs, proposed a music generation model based on the transformer and GANs, and optimized the structure of the GAN. Experiments showed that the chord processing reported in this study is not yet ideal, as reflected in the high proportion of chords and the excessive number of discordant notes. A character dictionary was constructed from the notes and chords in real music. To reduce the size of the dictionary and improve prediction, word frequency was used as the filtering criterion, and low-frequency notes and chords, extreme notes, and complex chords were filtered out. Still, the number of chord labels is much larger than the number of individual notes. Although single notes occur far more frequently in the training data than chords, the transformer is not sensitive to this frequency during computation and disregards the relative probability of single notes and chords. The dataset selected in this study is based on piano compositions that contain many different chords and notes; using only major and minor chords is not enough to express the music in this dataset. Therefore, preprocessing of music datasets and the use of algorithms that effectively summarize the rules governing the occurrence of notes should be primary goals of future music generation work.
Author Contributions: Conceptualization, J.M. and L.W.; methodology, Z.L.; software, D.L.; valida-
tion, M.Z., J.M. and L.W.; formal analysis, Z.L. and J.M.; investigation, D.L. and M.Z.; resources, J.M.
and D.L.; data curation, J.M.; writing—original draft preparation, J.M. and L.W.; writing—review
and editing, D.L., Y.H. and M.Z.; visualization, Y.H.; supervision, Y.H.; project administration, J.M.
and L.W.; funding acquisition, Y.H. All authors have read and agreed to the published version of
the manuscript.
Funding: This work was funded by Science and Technology Winter Olympic Project (Grant num-
ber 2018YFF0300505) and Joint Fund of Zhejiang Provincial Natural Science Foundation (Grant
number LHY20F030001).
Institutional Review Board Statement: The study did not require ethical approval.
Informed Consent Statement: Not applicable.
Data Availability Statement: Not applicable.
References
1. Olson, H.F.; Belar, H. Electronic music synthesizer. J. Acoust. Soc. Am. 1955, 27, 595–612. [CrossRef]
2. Steedman, M.J. A generative grammar for jazz chord sequences. Music. Percept. 1984, 2, 52–77. [CrossRef]
3. Ebcioğlu, K. An expert system for harmonizing four-part chorales. Comput. Music. J. 1988, 12, 43–51. [CrossRef]
4. Boulanger-Lewandowski, N.; Bengio, Y.; Vincent, P. Modeling temporal dependencies in high-dimensional sequences: Application
to polyphonic music generation and transcription. arXiv 2012, arXiv:1206.6392.
5. Gao, Z.; Chen, M.Z.; Zhang, D. Special Issue on “Advances in condition monitoring, optimization and control for complex
industrial processes”. Processes 2021, 9, 664. [CrossRef]
6. O’Hanlon, K.; Sandler, M.B. Fifthnet: Structured compact neural networks for automatic chord recognition. IEEE/ACM Trans.
Audio Speech Lang. Process. 2021, 29, 2671–2682. [CrossRef]
7. Zou, F.; Schwarz, S.; Nossek, J.A. Cellular neural network design using a learning algorithm. In Proceedings of the IEEE
International Workshop on Cellular Neural Networks and Their Applications, Budapest, Hungary, 16–19 December 1990;
pp. 73–81.
8. Kingma, D.P.; Welling, M. Auto-encoding variational bayes. arXiv 2013, arXiv:1312.6114.
9. Chorowski, J.; Weiss, R.J.; Bengio, S.; Van Den Oord, A. Unsupervised speech representation learning using wavenet autoencoders.
IEEE/ACM Trans. Audio Speech Lang. Process. 2019, 27, 2041–2053. [CrossRef]
10. Hinton, G.E.; Salakhutdinov, R.R. Reducing the dimensionality of data with neural networks. Science 2006, 313, 504–507.
[CrossRef] [PubMed]
11. Johnson, D. Composing Music with Recurrent Neural Networks. August 2015. Available online: http://www.hexahedria.com/
2015/08/03/composing-musicwith-recurrent-neural-networks/ (accessed on 26 October 2015).
12. Gao, Z.; Liu, X. An overview on fault diagnosis, prognosis and resilient control for wind turbine systems. Processes 2021, 9, 300.
[CrossRef]
13. Choi, K.; Fazekas, G.; Cho, K.; Sandler, M. The effects of noisy labels on deep convolutional neural networks for music tagging.
IEEE Trans. Emerg. Top. Comput. Intell. 2018, 2, 139–149. [CrossRef]
14. Pelchat, N.; Gelowitz, C.M. Neural network music genre classification. Can. J. Electr. Comput. Eng. 2020, 43, 170–173. [CrossRef]
15. Lu, L.; Xu, L.; Xu, B.; Li, G.; Cai, H. Fog computing approach for music cognition system based on machine learning algorithm.
IEEE Trans. Comput. Soc. Syst. 2018, 5, 1142–1151. [CrossRef]
16. Liu, C.H.; Ting, C.K. Computational intelligence in music composition: A survey. IEEE Trans. Emerg. Top. Comput. Intell. 2016, 1,
2–15. [CrossRef]
17. Sigtia, S.; Benetos, E.; Dixon, S. An end-to-end neural network for polyphonic piano music transcription. IEEE/ACM Trans. Audio
Speech Lang. Process. 2016, 24, 927–939. [CrossRef]
18. Thalmann, F.; Wiggins, G.A.; Ycart, A.; Benetos, E.; Sandler, M. Representing Modifiable and Reusable Musical Content on the
Web with Constrained Multi-Hierarchical Structures. IEEE Trans. Multimed. 2020, 22, 2645–2658. [CrossRef]
19. Ycart, A.; Benetos, E. Learning and Evaluation Methodologies for Polyphonic Music Sequence Prediction with LSTMs. IEEE/ACM
Trans. Audio Speech Lang. Process. 2020, 28, 1328–1341. [CrossRef]
20. Sheykhivand, S.; Mousavi, Z.; Rezaii, T.Y.; Farzamnia, A. Recognizing emotions evoked by music using CNN-LSTM networks on
EEG signals. IEEE Access 2020, 8, 139332–139345. [CrossRef]
21. Borodin, A.; Rabani, Y.; Schieber, B. Deterministic many-to-many hot potato routing. IEEE Trans. Parallel Distrib. Syst. 1997, 8,
587–596. [CrossRef]
22. Chen, J.; Pan, F.; Zhong, P.; He, T.; Qi, L.; Lu, J.; He, P.; Zheng, Y. An automatic method to develop music with music segment and
long short term memory for tinnitus music therapy. IEEE Access 2020, 8, 141860–141871. [CrossRef]
23. Lehner, B.; Schlüter, J.; Widmer, G. Online, loudness-invariant vocal detection in mixed music signals. IEEE/ACM Trans. Audio
Speech Lang. Process. 2018, 26, 1369–1380. [CrossRef]
24. Wang, C.; Xu, C.; Yao, X.; Tao, D. Evolutionary generative adversarial networks. IEEE Trans. Evol. Comput. 2019, 23, 921–934.
[CrossRef]
25. Liang, Z.; Zhang, S. Generating and Measuring Similar Sentences Using Long Short-Term Memory and Generative Adversarial
Networks. IEEE Access 2021, 9, 112637–112654. [CrossRef]
26. Arora, C.; Sabetzadeh, M.; Briand, L.; Zimmer, F. Automated checking of conformance to requirements templates using natural
language processing. IEEE Trans. Softw. Eng. 2015, 41, 944–968. [CrossRef]
27. Vaswani, A.; Shazeer, N.; Parmar, N.; Uszkoreit, J.; Jones, L.; Gomez, A.N.; Kaiser, Ł.; Polosukhin, I. Attention is all you need. In
Proceedings of the Advances in Neural Information Processing Systems, Long Beach, CA, USA, 4–9 December 2017; Volume 30.
28. Sutskever, I.; Vinyals, O.; Le, Q.V. Sequence to sequence learning with neural networks. In Proceedings of the Advances in Neural
Information Processing Systems, Montreal, QC, Canada, 8–13 December 2014; Volume 27.
29. Huang, C.Z.A.; Vaswani, A.; Uszkoreit, J.; Shazeer, N.; Simon, I.; Hawthorne, C.; Dai, A.M.; Hoffman, M.D.; Dinculescu, M.; Eck,
D. Music transformer. arXiv 2018, arXiv:1809.04281.
30. Radford, A. Language Models are Unsupervised Multitask Learners. OpenAI Blog 2019, 1, 9–33.
31. Payne, C. MuseNet. OpenAI Blog 2019, 3.
32. Jin, C.; Wang, T.; Liu, S.; Tie, Y.; Li, J.; Li, X.; Lui, S. A transformer-based model for multi-track music generation. Int. J. Multimed.
Data Eng. Manag. (IJMDEM) 2020, 11, 36–54. [CrossRef]
33. Bahdanau, D.; Cho, K.; Bengio, Y. Neural machine translation by jointly learning to align and translate. arXiv 2014, arXiv:1409.0473.