Audio Representations for Deep Learning in Sound Synthesis: A Review
Abstract—The rise of deep learning algorithms has led many researchers to move away from classic signal processing methods for sound generation. Deep learning models have achieved expressive voice synthesis, realistic sound textures, and musical notes from virtual instruments. However, the most suitable deep learning architecture is still under investigation. The choice of architecture is tightly coupled to the audio representation. A sound's original waveform can be too dense and rich for deep learning models to deal with efficiently, and this complexity increases training time and computational cost. It also does not represent sound in the manner in which it is perceived. Therefore, in many cases, the raw audio has been transformed into a compressed and more meaningful form using upsampling, feature extraction, or even by adopting a higher-level illustration of the waveform. Furthermore, depending on the form chosen, additional conditioning representations, different model architectures, and numerous metrics for evaluating the reconstructed sound have been investigated. This paper provides an overview of audio representations applied to sound synthesis using deep learning. Additionally, it presents the most significant methods for developing and evaluating a sound synthesis architecture with deep learning models, always depending on the audio representation.

Index Terms—Sound representations, Deep learning, Generative models, Sound synthesis.

I. INTRODUCTION

Sound generation algorithms synthesize a time-domain waveform. This waveform should be coherent and appropriate for its intended use. These waveforms can convey complex and varied information. Deep generative networks [1] have demonstrated great potential for such tasks, having been used for the synthesis of a range of sounds, from pleasant pieces of music to natural speech [2]. These models discover latent representations based on the distribution of the initial data and then sample from this distribution to generate new acoustic signals with the same properties as the original ones. In many cases, deep learning models can operate along with signal processing algorithms and enhance their expressive capabilities [3] [4].

The representation of the sound embraced by the deep neural network plays a major role in the development of the algorithm. Raw time-domain audio is a rich representation that carries a massive amount of information, making the network computationally expensive and therefore slow. Compressed time-frequency representations based on spectrograms can decrease the computational power needed, but parameter detection and resynthesis of the sound is usually a challenging task, and the loss of information can cause significant reconstruction error [5]. Parameters extracted from state-of-the-art vocoders have also been proposed for deep neural network applications [6]. These parameters show potential for marrying deep generative models with statistical parametric synthesizers. Finally, contemporary investigations allow the network itself to determine the features needed for the task [7]. Linguistic and acoustic features can be encoded into latent representations such as embeddings.

Apart from an overview of the audio representations used in sound synthesis implementations, this paper additionally surveys popular schemes for conditioning a deep generative network with auxiliary data. Conditioning in generative models can control aspects of the synthesis and lead to new samples with specific characteristics [8]. Furthermore, the paper highlights examples of deep generative models for audio generation applications; deep neural networks have demonstrated remarkable progress in this field, producing impressive results. A final section discusses evaluation processes for synthesised sound. Subjective evaluation via listening tests is generally considered the most reliable measure of quality. However, multiple other metrics for assessing a generative model have been proposed, usually converting both acoustic signals (original and generated) into intermediate representations to be examined. Consequently, audio representations play an essential role not only as input data but also in the choice of network architecture, the conditioning technique, and the evaluation process.

II. INPUT REPRESENTATIONS

In the literature, numerous audio representations have proved beneficial for audio synthesis applications. Many times, comparisons have been conducted between different forms of the sound to reveal the most appropriate representation for a specific deep learning architecture. Raw audio and time-frequency representations usually represent the first attempts in such experiments. However, recent studies also look to higher-level forms that offer a more meaningful description, such as embeddings, or to multiple sound features like the fundamental frequency, loudness, and features extracted by state-of-the-art vocoders such as WORLD [9]. Table I summarizes the advantages and disadvantages of each sound representation.
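As an illustrative sketch of how such representations are typically derived in practice, the fragment below converts a raw waveform into a log-mel-spectrogram and a fundamental-frequency contour with the librosa library; the file name, sample rate, FFT size and mel-band count are arbitrary assumptions for illustration and are not taken from any of the cited works.

# Sketch: deriving common input representations from a raw waveform.
# Parameter choices (file, sample rate, FFT size, mel bands) are illustrative only.
import librosa
import numpy as np

y, sr = librosa.load("example.wav", sr=16000)            # raw time-domain audio

# Compressed time-frequency representation: log-mel-spectrogram
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024,
                                     hop_length=256, n_mels=80)
log_mel = librosa.power_to_db(mel)                        # shape: (80, frames)

# Higher-level acoustic features: fundamental frequency and a loudness proxy
f0, voiced_flag, voiced_prob = librosa.pyin(
    y, fmin=librosa.note_to_hz("C2"), fmax=librosa.note_to_hz("C7"), sr=sr)
loudness = librosa.amplitude_to_db(librosa.feature.rms(y=y, hop_length=256))

print(y.shape, log_mel.shape, f0.shape, loudness.shape)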
a real-valued vector. This approach assisted the processing of text in deep learning applications by ensuring that words with similar meanings are encoded as embeddings that lie close together in vector space. The same approach has been adopted in sound processing to reduce the dimensionality of the signal [40] [22], enhance timbre synthesis [3] or even generate a more interpretable representation [41] [42] from which parameters for a synthesizer can be effectively extracted. In [7] an autoencoder generates a latent representation to condition a WaveNet model, while Dhariwal et al. [43] implemented three separate encoders to generate vectors with different temporal resolutions.

E. Symbolic

In music processing, the term symbolic refers to the use of representations such as the Musical Instrument Digital Interface (MIDI) or piano rolls. MIDI is a technical standard that describes a protocol, a digital interface and the connections for the simultaneous operation of multiple electronic musical instruments. A MIDI file records the notes being played at every time step. Usually this file contains information about the instrument being played, the pitch and its velocity. MidiNet [44] is one of the most popular implementations using MIDI to generate music pieces.

The piano roll constitutes a denser representation of MIDI. A piece of music can be represented by a binary N × T matrix, where N is the number of playable notes and T is the number of timesteps. In MuseGAN [45], Generative Adversarial Networks (GANs) have been applied to music generation using a multiple-track piano-roll representation. Also, in DeepJ [46], the representation matrix was scaled between 0 and 1 to capture the notes' dynamics. The most notable disadvantage of symbolic representations is that holding a note and replaying a note are encoded identically. To differentiate these two cases, DeepJ included a second matrix called replay along with the original matrix play, as sketched below.
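The play/replay encoding can be illustrated with a minimal sketch; the note range, time resolution and the toy note list below are assumptions chosen purely for illustration.

# Sketch: binary piano-roll matrices in the spirit of DeepJ's play/replay scheme.
# N playable notes x T timesteps; the example events are made up.
import numpy as np

N, T = 88, 16                                    # 88 piano keys, 16 timesteps
play = np.zeros((N, T), dtype=np.int8)           # 1 while a note is sounding
replay = np.zeros((N, T), dtype=np.int8)         # 1 only at a note's onset

# (pitch index, onset step, offset step): a held C4, a re-struck C4, then a G4
events = [(39, 0, 4), (39, 4, 8), (46, 8, 16)]
for pitch, onset, offset in events:
    play[pitch, onset:offset] = 1
    replay[pitch, onset] = 1                     # distinguishes re-striking from holding

# Without `replay`, the two consecutive C4 notes would be indistinguishable
# from a single note held for eight steps.
print(play[39, :8], replay[39, :8])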
III. CONDITIONING REPRESENTATIONS

Neural networks are able to generate sound based on the statistical distribution of the training data. The more uniform the input data to the network, the more natural the outcome that can be achieved. However, in cases where the amount of training data is not sufficient, additional data with similar properties can be included by applying conditioning methods. Following these techniques, the generated sound can be conditioned on specific traits such as a speaker's voice [47] [27], independent pitch [3] [48] [36], linguistic features [49] [17] or latent representations [4] [45]. Instead of one-hot embeddings, some implementations have also used a confusion matrix to capture a range of emotions [39], while others provided supplementary positional information for each segment, conditioning music on the artist or genre [43]. After training, the user is able to choose between the conditioning properties of the synthesised sound.

A. Additional Input

The simplest strategy for applying conditioning to deep learning architectures is to include auxiliary input data while training. Two types of conditioning have been proposed, global and local [11] [50]. In global conditioning, additional latent representations can be appended across all training data; global conditioning can encode a speaker's voice or linguistic features. Local conditioning usually refers to supplementary time series with a lower sampling rate than the original waveform, such as mel-spectrograms, the logarithmic fundamental frequency or auxiliary pitch information.

WaveNet has achieved one of the most effective strategies for conditioning deep neural networks [51]. Therefore, later sound generation schemes adopted a WaveNet network for conditioning. The majority of these works conditioned their model on spectrograms [31] [52] [49] [53] [33] [54] [28] [55], while others included linguistic features and pitch information [12] [56], phoneme encodings [6], features extracted from the STRAIGHT vocoder [57] or even MIDI representations [58].

Although it has been proven that convolutional networks are capable of effective conditioning, other architectures can use auxiliary input data as well. Recurrent neural networks such as LSTMs have adopted conditioning in the form of frame-level auxiliary feature vectors [59] or a one-hot representation encoding music style [46]. Autoencoders can be conditioned by including additional input to the encoder [23] [36] or as input only to the decoder [40]. A simplified sketch of global and local conditioning follows.
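The sketch below illustrates the global/local distinction in a single convolutional layer over raw audio. It is an illustrative simplification, not the exact WaveNet formulation: the layer sizes, the nearest-neighbour upsampling of the mel frames and the additive injection of the conditions are all assumptions.

# Sketch: injecting global (per-utterance) and local (per-frame) conditioning
# into a convolutional layer over raw audio. Shapes and layers are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ConditionedConvLayer(nn.Module):
    def __init__(self, channels=64, n_speakers=10, n_mels=80):
        super().__init__()
        self.conv = nn.Conv1d(channels, channels, kernel_size=3, padding=1)
        self.speaker_emb = nn.Embedding(n_speakers, channels)      # global condition
        self.mel_proj = nn.Conv1d(n_mels, channels, kernel_size=1)  # local condition

    def forward(self, x, speaker_id, mel):
        # x: (batch, channels, T_audio); mel: (batch, n_mels, T_frames)
        h = self.conv(x)
        # Global conditioning: one bias vector broadcast across all timesteps.
        h = h + self.speaker_emb(speaker_id).unsqueeze(-1)
        # Local conditioning: upsample the slower mel time series to the audio rate.
        mel_up = F.interpolate(self.mel_proj(mel), size=x.shape[-1], mode="nearest")
        return torch.tanh(h + mel_up)

layer = ConditionedConvLayer()
audio_feat = torch.randn(2, 64, 4000)
mel = torch.randn(2, 80, 25)
out = layer(audio_feat, torch.tensor([0, 3]), mel)    # (2, 64, 4000)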
B. Input to the Generator

Generative Adversarial Networks (GANs) consist of two separate networks, the Generator and the Discriminator. Following the fundamental properties of GANs, the Generator converts random noise to structured data while the Discriminator endeavors to classify a signal as original or generated. For applying conditioning in GANs, the most common technique is to bias the input to the Generator. In sound synthesis, a well-established conditioning method includes the mel-spectrogram as input to the Generator [60] [16] [19]. This way, the synthesised sound is not just a product of a specific distribution but also obtains desirable properties. For example, it can be forced to condition on a predetermined instrument or voice. Furthermore, a Generator conditioned on spectrograms can also be used as a vocoder [61]. In addition to the mel-spectrogram, other implementations have been conditioned on raw audio [62], one-hot vectors encoding musical pitch [30], linguistic features [13], or latent representations identifying the speaker [63].

C. Other

Finally, other variations of conditioning have been introduced as well. Kim et al. [14] adjusted conditioning through the loss function: they estimated an auxiliary probability density using mel-spectrograms for local conditioning. Ping et al. [64] applied bias terms in every layer of the convolutional network, also using mel-spectrograms. Extra bias to the network has also been proposed by [65] to encode linguistic features, while in [7] every layer was biased with a different linear projection of the temporal embeddings. In [38] linguistic features were added to the output of each hidden layer in
the Generator, while in [44] a new network named conditioner CNN was introduced to work alongside the Generator, encoding chords for melody generation. Finally, Juvela et al. [37] conducted a comparative study of conditioning methods.

TABLE I
OVERVIEW OF SOUND REPRESENTATIONS

IV. METHODS

During recent years, deep learning models have significantly contributed to research on sound generation. Using a variety of deep learning algorithms, multiple representations have been applied. The most common architectures include autoregressive methods, variational autoencoders (VAE), adversarial networks and normalising flows. However, many approaches fall into more than one category.

A. Autoregressive

Autoregressive models define a category of generative models where every new sample in a sequence of data depends on previous samples. Autoregressive deep neural networks can be represented by architectures that demonstrate this dependence implicitly or explicitly. Conventional methods that implicitly capture this time-related behaviour are recurrent neural networks. These models are able to recall previous data dynamically using a complex hidden state. SampleRNN [15] is one well-established research work that applies hierarchical recurrent neural networks such as GRUs and LSTMs at different temporal resolutions for sound synthesis. In order to illustrate the temporal behaviour of the network, Mehri et al. conducted experiments to test the model's memory by injecting one second of silence between two random sequential samples. Other significant papers on autoregressive models using recurrent neural networks include WaveRNN [17], MelNet [27] and LPCNet [4]. In WaveRNN, a method was introduced for reducing the sampling time by using a batch of short sequences instead of a single long sequence while maintaining high-quality sound.

Generative models where the synthesis of the sequential samples follows a conditional probability distribution like the one in Eq. 5 are able to explicitly demonstrate temporal dependencies.

p(X) = \prod_{t=1}^{T} p(x_t \mid x_1, \ldots, x_{t-1})    (5)

WaveNet [11] presents the most influential architecture among explicit autoregressive models. The probability distribution can be imitated by a stack of convolutional layers. However, to improve efficiency, the sequential data passes through a stack of dilated causal convolutional layers, where the input data are masked to skip some dependencies. Following a similar scheme, FFTNet [48] takes advantage of convolutional networks mimicking the FFT algorithm while upsampling the input data. Note, however, that convolutional networks do not always lead to autoregressive models [20].
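A dilated causal convolution stack of the kind described above can be sketched as follows; the channel count and dilation pattern are assumptions, and the gated activations and skip connections of the full WaveNet are omitted for brevity.

# Sketch: a stack of dilated causal 1-D convolutions over a sequence.
# Left-padding by (kernel_size - 1) * dilation keeps each output sample
# dependent only on current and past inputs.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CausalConvStack(nn.Module):
    def __init__(self, channels=32, kernel_size=2, dilations=(1, 2, 4, 8, 16)):
        super().__init__()
        self.kernel_size = kernel_size
        self.dilations = dilations
        self.convs = nn.ModuleList(
            nn.Conv1d(channels, channels, kernel_size, dilation=d) for d in dilations)

    def forward(self, x):                        # x: (batch, channels, T)
        for conv, d in zip(self.convs, self.dilations):
            pad = (self.kernel_size - 1) * d
            x = torch.relu(conv(F.pad(x, (pad, 0))))   # pad on the left only
        return x                                 # receptive field grows exponentially

net = CausalConvStack()
y = net(torch.randn(1, 32, 1000))                # (1, 32, 1000)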
Autoregressive models were initially proposed for sequential data. Therefore, in sound synthesis, raw audio is ordinarily used as the input representation. However, many auxiliary representations have been applied to condition the audio generation on a variety of properties. More details about conditioning techniques for autoregressive models have already been presented in Section III.

Since autoregressive models can be applied to sequential data, they are well established in sound-generation-related topics. Autoregressive models are easy to train and they can manipulate data in real time. Furthermore, convolutional-based models can be trained in parallel. Nevertheless, although these models can be parallelised during training, generation is sequential and therefore slow. Synthesised samples are affected only by previous samples, providing only one-way (past-to-future) dependencies. Finally, the generation remains consistent with specific properties only for a limited number of samples, and the outcome often lacks global structure.
B. Normalizing Flow

Normalizing flows constitute a family of generative models in which a chain of simple transformations maps input data to latent representations. Starting from a simple distribution z ~ p(z), a sequence of simple, invertible and computationally inexpensive mappings can model a complex, reversible one. This complex transformation is presented in Eq. 6, and the inverse can be achieved by repeatedly changing the variables as shown in Eq. 7. The mapping functions, then, can be parametrised by a deep neural network.

x = f_0 \circ f_1 \circ \ldots \circ f_k(z)    (6)

z = f_k^{-1} \circ f_{k-1}^{-1} \circ \ldots \circ f_0^{-1}(x)    (7)

WaveGlow [21], a flow-based generative network, can synthesise sound from its mel-spectrogram. By applying an affine coupling layer and a 1x1 invertible convolution, the model aims to maximise the likelihood of the training data. The implementation was proposed by NVIDIA and is able to generate sound in real time. Insightful alternatives have also been proposed for normalising flows, for example using only a single loss function without any auxiliary loss terms [14] or applying dilated 2-D convolutional layers [64].

Finally, in order to reduce the number of repeated iterations needed by normalising flows, they have been merged with autoregressive methods. This architecture manages to increase the performance of autoregressive models, since the sampling can be processed in parallel. Using Inverse Autoregressive Flows (IAF), Oord et al. increased the efficiency of WaveNet [12]. Their implementation follows a "probability density distillation" scheme, where a pre-trained WaveNet model is used as a teacher and scores the samples a WaveNet student outputs. This way, the student can be trained in accordance with the distribution of the teacher. A similar approach has been adopted by ClariNet [31], where a Gaussian inverse autoregressive flow is applied on WaveNet to train a text-to-wave neural architecture.
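The invertible mappings of Eqs. 6 and 7 are typically built from coupling layers. The sketch below shows one affine coupling step in the spirit of WaveGlow; the channel split and the small conditioning network are arbitrary choices for illustration, not the published architecture.

# Sketch: one affine coupling layer. Half of the channels pass through unchanged
# and parameterise a scale/shift applied to the other half, so the mapping is
# invertible and its Jacobian log-determinant is cheap to compute.
import torch
import torch.nn as nn

class AffineCoupling(nn.Module):
    def __init__(self, channels=8, hidden=32):
        super().__init__()
        self.net = nn.Sequential(                       # predicts log-scale and shift
            nn.Conv1d(channels // 2, hidden, 3, padding=1), nn.ReLU(),
            nn.Conv1d(hidden, channels, 3, padding=1))

    def forward(self, x):                               # x: (batch, channels, T)
        xa, xb = x.chunk(2, dim=1)
        log_s, t = self.net(xa).chunk(2, dim=1)
        yb = xb * torch.exp(log_s) + t
        log_det = log_s.sum(dim=(1, 2))                 # contribution to the likelihood
        return torch.cat([xa, yb], dim=1), log_det

    def inverse(self, y):                               # exact inverse, in the spirit of Eq. 7
        ya, yb = y.chunk(2, dim=1)
        log_s, t = self.net(ya).chunk(2, dim=1)
        xb = (yb - t) * torch.exp(-log_s)
        return torch.cat([ya, xb], dim=1)

layer = AffineCoupling()
x = torch.randn(4, 8, 100)
y, log_det = layer(x)
assert torch.allclose(layer.inverse(y), x, atol=1e-4)   # invertibility check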
C. Adversarial Learning

Unlike Inverse Autoregressive Flows, where a pre-trained teacher network assists a student model, in adversarial learning two neural networks compete against each other in a two-player minimax game. The fundamental architecture of Generative Adversarial Networks (GANs) is based on two models, the Generator (G) and the Discriminator (D). The Generator maps a latent representation to the data space. In a vanilla GAN, the Generator maps random noise to a desirable representation; for sound synthesis this representation could be raw audio or a spectrogram. This desired representation, original or generated, is used as input to the Discriminator, which is trained to distinguish between real and fake data. The maximum benefit from GANs is acquired when the Generator produces perfect data and the Discriminator is not able to differentiate between real and fake data.

From a more technical point of view, the Discriminator is trained using only the distribution of the original data. Its purpose is to maximise the probability of correctly identifying real and generated data. The Generator, on the other hand, is trained through the Discriminator: information about the original distribution of the dataset is concealed from it, and its aim is to maximise the error of the Discriminator. This minimax game can be summarised by Eq. 8.

\min_G \max_D V(D, G) = \mathbb{E}_{x \sim p_{data}(x)}[\log D(x)] + \mathbb{E}_{z \sim p_z(z)}[\log(1 - D(G(z)))]    (8)
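A minimal training step implementing the value function of Eq. 8 is sketched below. The two tiny networks and the data shapes are placeholders; in practice audio GANs replace them with convolutional generators and discriminators and often use alternative losses such as Wasserstein or hinge objectives.

# Sketch: one optimisation step of the minimax game in Eq. 8, written with the
# standard binary cross-entropy formulation. Networks and shapes are toy stand-ins.
import torch
import torch.nn as nn

G = nn.Sequential(nn.Linear(16, 64), nn.ReLU(), nn.Linear(64, 128))   # noise -> "audio"
D = nn.Sequential(nn.Linear(128, 64), nn.ReLU(), nn.Linear(64, 1))    # "audio" -> logit
bce = nn.BCEWithLogitsLoss()
opt_g = torch.optim.Adam(G.parameters(), lr=1e-4)
opt_d = torch.optim.Adam(D.parameters(), lr=1e-4)

real = torch.randn(8, 128)                        # placeholder for real training data
z = torch.randn(8, 16)

# Discriminator step: maximise log D(x) + log(1 - D(G(z)))
opt_d.zero_grad()
d_loss = bce(D(real), torch.ones(8, 1)) + bce(D(G(z).detach()), torch.zeros(8, 1))
d_loss.backward()
opt_d.step()

# Generator step: fool the Discriminator (non-saturating form of min log(1 - D(G(z))))
opt_g.zero_grad()
g_loss = bce(D(G(z)), torch.ones(8, 1))
g_loss.backward()
opt_g.step()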
In the field of sound generation, a variety of implementations have been proposed using numerous representations. In [30], spectrograms were generated using upsampling convolutions for fast generation, while in [32] the authors investigated whether waveforms or spectrograms are more effective for GANs, applying the Wasserstein loss function. In Parallel WaveGAN [60], a teacher-student scheme was adopted using a non-autoregressive WaveNet in order to improve WaveGAN's efficiency. Yamamoto et al. [16] applied GANs using an IAF generator optimised by a probability density distillation algorithm. Also, in GAN-TTS [13], an ensemble of Discriminators was examined to generate acoustic features using the Hinge loss function, along with [61] [63]. Lastly, GANs have also been applied in a variety of applications such as text-to-speech [19], speech synthesis [66] [67], speech enhancement [62] and symbolic music generation [45].

D. Variational Autoencoders

An autoencoder is one of the fundamental deep learning architectures, consisting of two separate networks, an encoder and a decoder. The encoder compresses the input data into a latent representation while the decoder synthesises data from the learned latent space. The original scheme of an autoencoder was created for dimensionality reduction purposes. Although theoretically the decoder bears some resemblance to the generator of a GAN, the model is not well qualified for the synthesis of new examples: the network endeavors to reconstruct the original input and therefore lacks expressiveness.

To use autoencoders as generative models, variational autoencoders have been proposed [68]. In this architecture, the encoder first models a latent distribution and then the network samples from this distribution to generate latent examples. The success of variational autoencoders is mostly based on the Kullback–Leibler (KL) divergence used as a loss function. The encoder introduces a new distribution q(z|X) to approximate p(z|X) as closely as possible by minimising the KL divergence. The complete loss function is shown in Eq. 9, where the first term (the reconstruction loss) is applied on the final layer and the second term (the regularization loss) adjusts the latent layer.

L = \mathbb{E}_{z \sim q(z|X)}[\log p(X|z)] - D_{KL}[q(z|X) \,\|\, p(z)]    (9)
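The loss in Eq. 9 can be written down directly. The sketch below assumes a Gaussian encoder with the usual closed-form KL term against a standard normal prior and uses a mean-squared reconstruction term as a stand-in for log p(X|z); the tensors are placeholders for encoder and decoder outputs.

# Sketch: the two terms of Eq. 9 for a Gaussian encoder q(z|X) = N(mu, diag(sigma^2))
# and a standard normal prior p(z). Encoder/decoder outputs are placeholders.
import torch

def vae_loss(x, x_recon, mu, log_var):
    # Reconstruction term (negative log-likelihood up to a constant, here MSE).
    recon = torch.sum((x - x_recon) ** 2, dim=1)
    # Closed-form KL divergence D_KL[q(z|X) || N(0, I)].
    kl = -0.5 * torch.sum(1 + log_var - mu.pow(2) - log_var.exp(), dim=1)
    return (recon + kl).mean()

# Reparameterisation trick used to sample z before decoding:
mu, log_var = torch.zeros(4, 16), torch.zeros(4, 16)
z = mu + torch.exp(0.5 * log_var) * torch.randn_like(mu)
loss = vae_loss(torch.randn(4, 128), torch.randn(4, 128), mu, log_var)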
Many variations of the VAE have been applied to sound generation tasks. In [3] they used a VAE with feedforward
networks and an additive synthesiser to reproduce monophonic musical notes. In [69] and [40] convolutional layers were applied, while in [36] a Variational Parametric Synthesiser was proposed using a conditional VAE.

A modification of variational autoencoders proposed for music synthesis is the VQ-VAE [43]. In this approach, the network is trained to encode the input data into a sequence of discrete tokens. Jukebox introduces this method to flatten the data and process it using autoregressive Transformers.

V. EVALUATION

Although generative models have presented significant improvements in the last decade, a definitive evaluation process still remains an open question. Many mathematical metrics have been proposed for perceptually evaluating the generated sound, and usually a transformation to another audio representation has been adopted. However, despite the numerous attempts, none of these metrics is as reliable as the subjective evaluation of human listeners.

A. Perceptual Evaluation

Human evaluation usually amounts to the mean opinion score across a group of listeners. To conduct the study, many researchers used crowdMOS [70], a user-friendly toolkit for performing listening evaluations. As well as the mean opinion score, a confidence interval is also computed. Furthermore, in order to attract an adequate number of subjects with specific characteristics, Amazon Mechanical Turk has been widely used. In many cases, raters have been asked to pass a hearing test [21] or keep headphones on [61] [11], or only native speakers have been recruited to evaluate speech [60] [61] [54].

In these mean opinion score tests, subjects have been asked to rate a sound on a five-point Likert scale in terms of pleasantness [21], naturalness [11] [13] [63], sound quality [17] or speaker diversity [32]. In addition, subjects have been requested to express a preference between sounds from two generative models given the same pitch [30] or speech [17] [11]. Finally, for evaluating WaveGAN [32], humans listened to digits from one to ten and were asked to indicate which number they heard.

B. Number of Statistically-Different Bins

The Number of Statistically-Different Bins (NDB) is a metric for unconditional generative models that estimates the diversity of the synthesised examples. Clustering techniques are applied to the training data, creating cells of similar properties. Then, the same algorithm tries to categorise the generated data into these cells. If a generated example does not belong to a predefined cluster, the generated sound is statistically significantly different.

GANSynth [30] used k-means to map the log spectrogram of the generated sound into k = 50 Voronoi cells. As well as the Mean Opinion Score and the Number of Statistically-Different Bins, GANSynth also used the Inception Score, Pitch Accuracy, Pitch Entropy and the Fréchet Inception Distance for evaluation purposes; the remaining metrics are analysed in the following sections. A similar set of evaluation metrics, including NDB, has also been adopted by [50]. A simplified NDB computation is sketched below.
Turk has been widely used. In many cases, raters have been This evaluation category includes metrics that measure the
asked to pass a hearing test [21], keep headphones on [61] distance between representations of the original data and
[11], or only native speakers for evaluating speech have been the distribution of the generated examples. Binkowski et
asked [60] [61] [54]. al. proposed two distance-based metrics, the Fréchet Deep-
In these mean opinion score tests, subjects have been asked Speech Distance (FDSD) and the Kernel DeepSpeech Distance
to rate a sound in a five-point Likert scale in terms of (KDSD) [13] for evaluating their text-to-speech model. The
pleasantness [21], naturalness [11] [13] [63], sound quality two metrics make use of the the Fréchet distance and the
[17] or speaker diversity [32]. In addition, subjects have been Maximum Mean Discrepancy respectively on audio features
requested to express a preference between sounds of two extracted by a speech recognition model.
generative models hearing the same pitch [30] or speech [17] The Fréchet or 2-Wasserstein distance has been proposed
[11]. Finally, for evaluating WaveGAN [32], humans listened by other research papers as well. Engel et al [30] applied
to digits between one to ten and were asked to indicate which the Fréchet Inception Distance on features extracted by a
number they heard. pitch classifier while Kilgour et al [72] used this distance
to measure the intensity of a distortion in generated sound.
B. Number of Statistically-Different Bins However, although many researchers report successful results
The Number of Statistically-Different Bins (NDB) is a using 2-Wasserstein, Donahue et al [63] reported that a similar
metric for unconditional generative models in order to es- evaluation metric did not produce a desirable outcome in their
timate the diversity of the synthesised examples. Clustering experiments.
techniques are applied on the training data creating cells of Distances-based measurements have also been investigated
similar properties. Then, the same algorithm tries to categorise individually by separate parameter estimations. In [3] distances
the generated data into the cells. If a generated example does between the generated loudness and fundamental frequency of
not belong to a predefined cluster, then the generated sound synthesised and training data are used.
is statistically significantly different.
GANSynth [30] used k-means to map the log spectrogram E. Spectral Convergence
of the generated sound into k = 50 Voronoi cells. As well The Spectral Convergence expresses the mean difference
as Mean Opinion Score and the Number of Statistically- between the original and the generated spectrogram. It has
Different Bins, GANSynth also used Inception Score, Pitch been applied by [43] [20] [57] [40] in order to evaluate their
Accuracy and Pitch Entropy and Frechet Inception Distance synthesised music. The Spectral Convergence can be expressed
by Eq. 11, which is also identified as the quantity minimised by the Griffin-Lim algorithm.

SC = \sqrt{ \frac{ \sum_{n,m} \lvert S(n, m) - \tilde{S}(n, m) \rvert^2 }{ \sum_{n,m} S(n, m) } }    (11)
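Eq. 11 translates directly into a few lines of numpy. Note that the formula is reproduced here as printed in the paper (a squared-magnitude numerator over an unsquared denominator), whereas many implementations use a ratio of Frobenius norms instead; the magnitude spectrograms below are random placeholders.

# Sketch: Spectral Convergence between an original magnitude spectrogram S
# and a generated/estimated one S_tilde, following Eq. 11 as printed.
import numpy as np

def spectral_convergence(S, S_tilde, eps=1e-12):
    return float(np.sqrt(np.sum(np.abs(S - S_tilde) ** 2) / (np.sum(S) + eps)))

S = np.abs(np.random.randn(513, 200))            # placeholder magnitude spectrograms
S_tilde = S + 0.01 * np.random.randn(513, 200)
print("SC:", spectral_convergence(S, S_tilde))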
F. Log Likelihood

A final group of evaluation metrics includes the Negative Log-Likelihood (NLL) [17] [15] and an objective Conditional Log-Likelihood (CLL) [14], usually measured in bits per audio sample.

VI. CONCLUSION

The choice of audio representation is one of the most significant factors in the development of deep learning models for sound synthesis. Numerous representations have been proposed by previous researchers, focusing on different properties. Raw audio is a direct representation demanding notable memory and computational cost. It is also not considered for evaluation purposes, since different waveforms can perceptually produce the same sound. Spectrograms can overcome some of the disadvantages of raw audio and have been considered as an alternative for training as well as for evaluation. However, reconstructing the original sound from its spectrogram is a challenging task, since it may produce sound suffering from distortions and a lack of phase coherence. Recently, other audio representations have received much attention, such as latent representations, embeddings and acoustic features, but they all require a powerful decoder. The choice of audio representation is still very much dependent on the application.

VII. ACKNOWLEDGMENTS

This publication has emanated from research supported in part by a grant from Science Foundation Ireland under Grant number 18/CRT/6183. For the purpose of Open Access, the author has applied a CC BY public copyright licence to any Author Accepted Manuscript version arising from this submission.
REFERENCES

[1] H. Gm, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, "A comprehensive survey and analysis of generative models in machine learning," Computer Science Review, vol. 38, p. 100285, Nov. 2020.
[2] M. Huzaifah and L. Wyse, "Deep generative models for musical audio synthesis," arXiv:2006.06426 [cs, eess, stat], Nov. 2020.
[3] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, "DDSP: Differentiable Digital Signal Processing," arXiv:2001.04643 [cs, eess, stat], Jan. 2020.
[4] J.-M. Valin and J. Skoglund, "LPCNet: Improving Neural Speech Synthesis through Linear Prediction," in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (Brighton, United Kingdom), pp. 5891–5895, IEEE, May 2019.
[5] P. Govalkar, J. Fischer, F. Zalkow, and C. Dittmar, "A Comparison of Recent Neural Vocoders for Speech Signal Reconstruction," in 10th ISCA Speech Synthesis Workshop, pp. 7–12, ISCA, Sept. 2019.
[6] M. Blaauw and J. Bonada, "A Neural Parametric Singing Synthesizer," arXiv:1704.03809 [cs], Aug. 2017.
[7] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and M. Norouzi, "Neural Audio Synthesis of Musical Notes with WaveNet Autoencoders," arXiv:1704.01279 [cs], Apr. 2017.
[8] R. Manzelli, V. Thakkar, A. Siahkamari, and B. Kulis, "Conditioning Deep Generative Raw Audio Models for Structured Automatic Music," arXiv:1806.09905 [cs, eess, stat], June 2018.
[9] M. Morise, F. Yokomori, and K. Ozawa, "WORLD: A Vocoder-Based High-Quality Speech Synthesis System for Real-Time Applications," IEICE Transactions on Information and Systems, vol. E99.D, no. 7, pp. 1877–1884, 2016.
[10] C. E. Shannon, "Communication in the Presence of Noise," Proceedings of the I.R.E., p. 12, 1949.
[11] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A Generative Model for Raw Audio," arXiv:1609.03499 [cs], Sept. 2016.
[12] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, "Parallel WaveNet: Fast High-Fidelity Speech Synthesis," arXiv:1711.10433 [cs], Nov. 2017.
[13] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen, N. Casagrande, L. C. Cobo, and K. Simonyan, "High Fidelity Speech Synthesis with Adversarial Networks," arXiv:1909.11646 [cs, eess], Sept. 2019.
[14] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, "FloWaveNet: A Generative Flow for Raw Audio," arXiv:1811.02155 [cs, eess], May 2019.
[15] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, "SampleRNN: An Unconditional End-to-End Neural Audio Generation Model," arXiv:1612.07837 [cs], Feb. 2017.
[16] R. Yamamoto, E. Song, and J.-M. Kim, "Probability density distillation with generative adversarial networks for high-quality parallel waveform generation," arXiv:1904.04472 [cs, eess], Aug. 2019.
[17] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and K. Kavukcuoglu, "Efficient Neural Audio Synthesis," arXiv:1802.08435 [cs, eess], June 2018.
[18] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis, R. Clark, and R. A. Saurous, "Tacotron: Towards End-to-End Speech Synthesis," arXiv:1703.10135 [cs], Apr. 2017.
[19] P. Neekhara, C. Donahue, M. Puckette, S. Dubnov, and J. McAuley, "Expediting TTS Synthesis with Adversarial Vocoding," arXiv:1904.07944 [cs, eess], July 2019.
[20] S. O. Arik, H. Jun, and G. Diamos, "Fast Spectrogram Inversion using Multi-head Convolutional Neural Networks," IEEE Signal Processing Letters, vol. 26, pp. 94–98, Jan. 2019. arXiv: 1808.06719.
[21] R. Prenger, R. Valle, and B. Catanzaro, "WaveGlow: A Flow-based Generative Network for Speech Synthesis," arXiv:1811.00002 [cs, eess, stat], Oct. 2018.
[22] K. Peng, W. Ping, Z. Song, and K. Zhao, "Non-Autoregressive Neural Text-to-Speech," arXiv:1905.08459 [cs, eess], June 2020.
[23] C. Aouameur, P. Esling, and G. Hadjeres, "Neural Drum Machine: An Interactive System for Real-time Synthesis of Drum Sounds," arXiv:1907.02637 [cs, eess], Nov. 2019.
[24] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech: Fast, Robust and Controllable Text to Speech," arXiv:1905.09263 [cs, eess], Nov. 2019.
[25] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu, "FastSpeech 2: Fast and High-Quality End-to-End Text to Speech," arXiv:2006.04558 [cs, eess], Mar. 2021.
[26] R. Liu, B. Sisman, F. Bao, G. Gao, and H. Li, "WaveTTS: Tacotron-based TTS with Joint Time-Frequency Domain Loss," arXiv:2002.00417 [cs, eess], Apr. 2020.
[27] S. Vasquez and M. Lewis, "MelNet: A Generative Model for Audio in the Frequency Domain," arXiv:1906.01083 [cs, eess, stat], June 2019.
[28] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, "TimbreTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical Timbre Transfer," arXiv:1811.09620 [cs, eess, stat], May 2019.
[29] G. A. Velasco, N. Holighaus, M. Dörfler, and T. Grill, "Constructing an Invertible Constant-Q Transform with Non-stationary Gabor Frames," p. 8, 2011.
[30] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and A. Roberts, "GANSynth: Adversarial Neural Audio Synthesis," p. 17, 2019.
[31] W. Ping, K. Peng, and J. Chen, "ClariNet: Parallel Wave Generation in End-to-End Text-to-Speech," p. 13.
[32] C. Donahue, J. McAuley, and M. Puckette, "Adversarial Audio Synthesis," p. 16, 2019.
[33] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, "Style Tokens: Unsupervised Style Modeling, Control and Transfer in End-to-End Speech Synthesis," p. 10.
[34] X. Wang, S. Takaki, and J. Yamagishi, "Neural source-filter waveform models for statistical parametric speech synthesis," arXiv:1904.12088 [cs, eess, stat], Nov. 2019.
[35] A. Defossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, "SING: Symbol-to-Instrument Neural Generator," p. 11.
[36] K. Subramani, P. Rao, and A. D'Hooge, "Vapar Synth - A Variational Parametric Model for Audio Synthesis," in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (Barcelona, Spain), pp. 796–800, IEEE, May 2020.
[37] L. Juvela, B. Bollepalli, V. Tsiaras, and P. Alku, "GlotNet—A Raw Waveform Model for the Glottal Excitation in Statistical Parametric Speech Synthesis," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1019–1030, June 2019.
[38] S. Yang, L. Xie, X. Chen, X. Lou, X. Zhu, D. Huang, and H. Li, "Statistical Parametric Speech Synthesis Using Generative Adversarial Networks Under A Multi-task Learning Framework," arXiv:1707.01670 [cs], July 2017.
[39] G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, "Deep Encoder-Decoder Models for Unsupervised Learning of Controllable Speech Synthesis," arXiv:1807.11470 [cs, eess, stat], Sept. 2018.
[40] A. Bitton, P. Esling, and T. Harada, "Neural Granular Sound Synthesis," arXiv:2008.01393 [cs, eess], Aug. 2020.
[41] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, "Generative timbre spaces: regularizing variational auto-encoders with perceptual metrics," arXiv:1805.08501 [cs, eess], Oct. 2018.
[42] P. Esling, N. Masuda, A. Bardet, R. Despres, and A. Chemla-Romeu-Santos, "Universal audio synthesizer control with normalizing flows," arXiv:1907.00971 [cs, eess, stat], July 2019.
[43] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, "Jukebox: A Generative Model for Music," arXiv:2005.00341 [cs, eess, stat], Apr. 2020.
[44] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, "MidiNet: A Convolutional Generative Adversarial Network for Symbolic-domain Music Generation," arXiv:1703.10847 [cs], July 2017.
[45] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, "MuseGAN: Multi-track Sequential Generative Adversarial Networks for Symbolic Music Generation and Accompaniment," arXiv:1709.06298 [cs, eess, stat], Nov. 2017.
[46] H. H. Mao, T. Shin, and G. W. Cottrell, "DeepJ: Style-Specific Music Generation," 2018 IEEE 12th International Conference on Semantic Computing (ICSC), pp. 377–382, Jan. 2018. arXiv: 1801.00887.
[47] Y. Zhao, S. Takaki, H.-T. Luong, J. Yamagishi, D. Saito, and N. Minematsu, "Wasserstein GAN and Waveform Loss-Based Acoustic Model Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a WaveNet Vocoder," IEEE Access, vol. 6, pp. 60478–60488, 2018.
[48] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, "FFTNet: A Real-Time Speaker-Dependent Neural Vocoder," in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (Calgary, AB), pp. 2251–2255, IEEE, Apr. 2018.
[49] S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, J. Raiman, and Y. Zhou, "Deep Voice 2: Multi-Speaker Neural Text-to-Speech," arXiv:1705.08947 [cs], Sept. 2017.
[50] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, "DiffWave: A Versatile Diffusion Model for Audio Synthesis," arXiv:2009.09761 [cs, eess, stat], Mar. 2021.
[51] J. Boilard, P. Gournay, and R. Lefebvre, "A Literature Review of WaveNet: Theory, Application and Optimization," p. 17.
[52] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, "Natural TTS Synthesis by Conditioning WaveNet on Mel Spectrogram Predictions," arXiv:1712.05884 [cs], Feb. 2018.
[53] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, J. Raiman, and J. Miller, "Deep Voice 3: Scaling Text-to-Speech with Convolutional Sequence Learning," arXiv:1710.07654 [cs, eess], Feb. 2018.
[54] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, "Neural Speech Synthesis with Transformer Network," arXiv:1809.08895 [cs], Jan. 2019.
[55] M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J. Krusienski, and T. Schultz, "Speech synthesis from ECoG using densely connected 3D convolutional neural networks," Journal of Neural Engineering, vol. 16, p. 036019, June 2019.
[56] S. O. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and M. Shoeybi, "Deep Voice: Real-time Neural Text-to-Speech," p. 10.
[57] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, "Speaker-Dependent WaveNet Vocoder," in Interspeech 2017, pp. 1118–1122, ISCA, Aug. 2017.
[58] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang, S. Dieleman, E. Elsen, J. Engel, and D. Eck, "Enabling Factorized Piano Music Modeling and Generation with the MAESTRO Dataset," arXiv:1810.12247 [cs, eess, stat], Jan. 2019.
[59] Z.-H. Ling, Y. Ai, Y. Gu, and L.-R. Dai, "Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 883–894, May 2018.
[60] R. Yamamoto, E. Song, and J.-M. Kim, "Parallel WaveGAN: A fast waveform generation model based on generative adversarial networks with multi-resolution spectrogram," arXiv:1910.11480 [cs, eess], Feb. 2020.
[61] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, "MelGAN: Generative Adversarial Networks for Conditional Waveform Synthesis," arXiv:1910.06711 [cs, eess], Dec. 2019.
[62] S. Pascual, A. Bonafonte, and J. Serrà, "SEGAN: Speech Enhancement Generative Adversarial Network," arXiv:1703.09452 [cs], June 2017.
[63] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan, "End-to-End Adversarial Text-to-Speech," arXiv:2006.03575 [cs, eess], Mar. 2021.
[64] W. Ping, K. Peng, K. Zhao, and Z. Song, "WaveFlow: A Compact Flow-based Model for Raw Audio," p. 11.
[65] K. Rao, F. Peng, H. Sak, and F. Beaufays, "Grapheme-to-phoneme conversion using Long Short-Term Memory recurrent neural networks," in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), (South Brisbane, Queensland, Australia), pp. 4225–4229, IEEE, Apr. 2015.
[66] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando, "Generative adversarial network-based approach to signal reconstruction from magnitude spectrogram," in 2018 26th European Signal Processing Conference (EUSIPCO), (Rome), pp. 2514–2518, IEEE, Sept. 2018.
[67] Y. Saito, S. Takamichi, and H. Saruwatari, "Statistical Parametric Speech Synthesis Incorporating Generative Adversarial Networks," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, pp. 84–96, Jan. 2018.
[68] D. P. Kingma and M. Welling, "Auto-Encoding Variational Bayes," arXiv:1312.6114 [cs, stat], May 2014.
[69] A. Pandey and D. Wang, "A New Framework for CNN-Based Speech Enhancement in the Time Domain," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, pp. 1179–1188, July 2019.
[70] F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer, "crowdMOS: An Approach for Crowdsourcing Mean Opinion Score Studies," p. 4.
[71] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and X. Chen, "Improved Techniques for Training GANs," arXiv:1606.03498 [cs], June 2016.
[72] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, "Fréchet Audio Distance: A Metric for Evaluating Music Enhancement Algorithms," arXiv:1812.08466 [cs, eess], Jan. 2019.