Audio Representations For Deep Learning in Sound Synthesis A Review

Uploaded by

YAOZHONG ZHANG

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

0% found this document useful (0 votes)

16 views8 pages

Audio Representations For Deep Learning in Sound Synthesis A Review

Uploaded by

YAOZHONG ZHANG

We take content rights seriously. If you suspect this is your content, claim it here.

Available Formats

Download as PDF, TXT or read online on Scribd

You are on page 1/ 8

Audio representations for deep learning in sound

synthesis: A review
2021 IEEE/ACS 18th International Conference on Computer Systems and Applications (AICCSA) | 978-1-6654-0969-8/21/$31.00 ©2021 IEEE | DOI: 10.1109/AICCSA53542.2021.9686838

Anastasia Natsiou Seán O’Leary

Technological University of Dublin Technological University of Dublin
Dublin, Ireland Dublin, Ireland
[email protected] [email protected]

Abstract—The rise of deep learning algorithms has led many synthesis of the sound is usually a challenging task and the loss
researchers to withdraw from using classic signal processing of information can cause significant reconstruction error [5].
methods for sound generation. Deep learning models have Parameters extracted from state-of-the-art vocoders have also
achieved expressive voice synthesis, realistic sound textures,
and musical notes from virtual instruments. However, the most been proposed for deep neural network applications [6]. These
suitable deep learning architecture is still under investigation. parameters demonstrate a potential in marrying the deep gener-
The choice of architecture is tightly coupled to the audio ative models with statistical parametric synthesizers. Finally,
representations. A sound’s original waveform can be too dense contemporary investigations allow the network to determine
and rich for deep learning models to deal with efficiently - the feature needed for the task [7]. Linguistic and acoustic
and complexity increases training time and computational cost.
Also, it does not represent sound in the manner in which it is features can be encoded into latent representations such as
perceived. Therefore, in many cases, the raw audio has been embeddings.
transformed into a compressed and more meaningful form using Apart from an overview of the audio representations existing
upsampling, feature-extraction, or even by adopting a higher level in sound synthesis implementations, this paper additionally
illustration of the waveform. Furthermore, conditional on the
quotes popular schemes for conditioning a deep generative
form chosen, additional conditioning representations, different
model architectures, and numerous metrics for evaluating the network with auxiliary data. Conditioning in generative models
reconstructed sound have been investigated. This paper provides can control the aspects of the synthesis and lead to new
an overview of audio representations applied to sound synthesis samples with specific characteristics [8]. Furthermore, the
using deep learning. Additionally, it presents the most signifi- paper highlights examples of deep generative models for audio
cant methods for developing and evaluating a sound synthesis
generation applications. Deep neural networks have demon-
architecture using deep learning models, always depending on
the audio representation. strated remarkable progress in the field demonstrating im-
Index Terms—Sound representations, Deep learning, Genera- pressive results. A final section discusses evaluation processes
tive models, Sound synthesis. for synthesised sound. Subjective evaluation via listening tests
are generally considered the most reliable measure of quality.
I. I NTRODUCTION However, multiple other metrics for assessing a generative
Sound generation algorithms synthesize a time domain model have been proposed converting both acoustic signals
waveform. This waveform should be coherent and appropriate to intermediate representations of them to be examined. Con-
for its intended use. These waveforms can convey complex sequently, audio representations assume an essential role not
and varied information. Deep generative networks [1] have only as input data but also influence the network architecture,
demonstrated great potential for such tasks having been used the conditioning technique as well as the evaluation process.
for the synthesis of a range of sounds, from pleasant pieces
of music to natural speech [2]. These models discover latent II. I NPUT R EPRESENTATIONS
representations based on the distribution of the initial data and
then sample from this distribution to generate new acoustic In the literature, numerous audio representations have been
signals with the same properties as the original ones. In proved beneficial for audio synthesis applications. Many times,
many cases, the deep learning models can operate along with comparisons have been conducted between different forms of
signal processing algorithms and enhance their expression the sound to reveal the most appropriate representation for
capabilities [3] [4]. a specific deep learning architecture. Raw audio and time-
The representation of the sound embraced by the deep frequency representations usually present the first attempts in
neural network plays a major role on the development of the such experiments. However, recent studies also look to higher
algorithm. Raw time domain audio is a rich representation level forms that offer more meaningful description such as
which leads to massive information making the network com- embeddings, or multiple sound features like the fundamental
putationally expensive and therefore slow. Compressed Time- frequency, loudness and features extracted by state-of-the-art
frequency representations based on spectrograms can decrease vocoders such as WORLD [9]. The table I summarizes the
the computer power needed but the parameter detection and advantages and disadvantages of each sound representation.

978-1-6654-0969-8/21/$31.00 ©2021 IEEE

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
A. Waveform - Raw Audio mel-spectrogram is generated by the application of perceptual
The term raw audio is often used to refer to a waveform filters on the DFT called mel-filter bands. The most common
encoded using pulse code modulation (PCM). This is the formula for encoding mel-filter bands is presented by Eq. 3
sampling of a continuous waveform in time and amplitude. where f is the frequency in Hertz. However, other models
It represents the waveform as a sequence of numbers, each have captured the perceptual transformation applying a linear
number representing an amplitude sample at a chosen sam- transformation until 1kHz and a logarithmic above this thresh-
pling frequency. In order for this discrete sequence of samples old.
to capture all the necessary information, the highest frequency f
in the signal should adhere to the Nyquist-Shannon theorem mel = 2595log10(1 + ) (3)
700
[10]. According to this theorem, only frequencies of less
than half the sampling frequency can be reproduced from CQT is another time-frequency representation where the
the sampled signal. A typical sampling frequency for audio frequencies are geometrically spaced. The centre frequencies
applications is 44.1kHz. Each real number is assigned to the of the filters are calculated from the result of the formula
k
approximate fixed value in a finite set of discrete numbers. ωk = 2 b ω0 where k = 1, 2, ..kmax and b is a constant
The most common levels for quantization are stored in 8 number. The bandwidth of each frequency then comes as
bits (256 levels), 16 bits (65536 levels) and 24 bits (16.8 δk = ωk+1 − ωk = ωk (2 1b − 1) and therefore the frequency
million levels). Therefore, a sound with a duration of one resolution is determined by the Eq. 4 where Q is the quality
second sampled at 44.1kHz generates 44100 samples. This factor.
representations is considered extremely informative even for ωk 1
deep learning networks. Q= = (2 b − 1)−1 (4)
δk
In order for the outcome of the deep learning model to be
more effective, a pre-processing step can be used to reduce the CQT is a representation with different frequency resolution
quantization range of the raw audio. Many research approaches in low and high frequencies. However, the phase part is
[11] [12] [13] apply µ-law to decrease the possible values discarded and the representation is in most of the cases
of each prediction. µ-law is presented in Eq. 1 where −1 < irreversible. Following this argument, Velasco et al [29] pro-
x < 1 and µ equals the number of levels created after the posed an invertible CQT based on nonstationary Gabor frames.
transformation. Another variation of CQT is rainbowgrams. Rainbowgrams
proposed by Engel et al [7] using colors to encode time
ln(1 + µ|x|) derivatives of the phase.
f (x) = sgn(x) (1)
ln(1 + µ) In addition, more complicated spectrogram-based represen-
Although non-linear quantization processes such as µ-law tations have also been investigated. GANSynth [30] conducted
received much attention the last years, the majority of the experiments with numerous spectrograms including scaled
existing papers use a normalized high resolution signal as input logarithmic amplitude and phase of the STFT, increased reso-
[14]. Finally, other applications include linear quantization of lution of the original spectrogram or applied mel-filters. Also,
the input waveform [15] [16] and different designs for most they examined an Instantaneous Frequency based spectrogram
and less significant bits [17]. where the phase of the STFT is scaled and unwraped (add −π)
and then the finite difference between the frames is computed.
B. Spectrograms Other applications [31] [32], also, made comparisons between
A spectrogram is a time/frequency visual representation of raw audio and spectrogram to uncover the most functional
sound. A spectrogram can be obtained via the Short Time representation for their deep learning model.
Fourier Transform (STFT), where the Fourier Transform is ap-
C. Acoustic Features
plied to overlapping segments of the waveform. The Discrete
Fourier Transform (DFT) is presented by the equation Eq. 2 Overcoming the wealth of acoustic information presenting
for k = 0, 1, .., N − 1 where N is the number of samples and in a sound waveform, various studies extract perceptual fea-
k is the number of segments. The spectrogram uses just the tures from the original signal. These acoustic features can be
absolute values of the STFT, discarding the phase. This type represented by phoneme inputs [33], fundamental frequency
of spectrograms has been used in many by a variety of papers and spectral features [34] or multiple information such as
[18] [19] [20]. the velocity, instrument, pitch and time [35]. Other imple-
mentations have included cepstral coefficients [4] [36] or a
N
X −1 variety of linguistic and acoustic features [37] [13]. Finally,
X(k) = x(n)e−jωk n (2) widely recommended parameters have also been extracted by
n=0 the WORLD vocoder [6] [38] [39].
Apart from the original spectrogram, deep learning ar-
chitectures have also experimented with non linear spectro- D. Embeddings
grams such as mel-spectrograms [21] [22] [23] [24] [25] Embeddings initially introduced by Natural Language Pro-
[26] [27] or Constant-Q Transformations (CQT) [28]. The cessing (NLP) in order to convert a word or sentence into

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
a real-valued vector. This approach assisted the process of while training. Two types of conditioning have been proposed,
text in deep learning applications by inserting the property global and local [11] [50]. In global conditioning, additional
of closer embeddings in vector space to encode words with latent representations can be appended across all training data.
similar meaning. The same approach has been adopted by Global conditioning can encode speaker’s voice or linguistics
sound processing to reduce the dimensionality of the signal features. Local conditioning usually refers to supplementary
[40] [22], enhance the timbre synthesis [3] or even generate timeseries with lower sampling rate than the original waveform
a more interpretable representation [41] [42] to effectively or even mel-spectrograms, logarithmic fundamental frequency
extract parameters for a synthesizer. In [7] an autoencoder or auxiliary pitch information.
generates a latent representation to condition a WaveNet model WaveNet has achieved one of the most effective strategies
while Dhariwal et al [43] implemented three separate encoders for conditioning deep neural networks [51]. Therefore, later
to generate vectors with different temporal resolutions. sound generation schemes adopted a WaveNet network for
conditioning. The majority of these works conditioned their
E. Symbolic
model to spectrograms [31] [52] [49] [53] [33] [54] [28] [55]
In music processing, the term symbolic refers to the use of while others included linguistic features and pitch information
representations such as Musical Instrument Digital Interface [12] [56], phoneme encodings [6], features extracted from the
(MIDI) or piano rolls. MIDI is a technical standard that STRAIGHT vocoder [57] or even MIDI representations [58].
describes a protocol, a digital interface or the link for the Although it has been proven that convolutional networks
simultaneous operation between multiple electronic musical are capable of effective conditioning, other architectures can
instruments. A MIDI file demonstrates the notes being played use auxiliary input data as well. Recurrent neural networks
in every time step. Usually this file consists of information of such as LSTMs have been adopted conditioning as frame-
the instrument being played, the pitch and its velocity. MidiNet level auxiliary feature vectors [59] or as one-hot representation
[44] is one of the most popular implementations using MIDI encoding music style [46]. Autoencoders can be conditioned
to generate music pieces. including additional input to the encoder [23] [36] but also as
Piano roll constitutes a more dense representation of MIDI. input only to the decoder [40].
A piece of music can be represented by a binary N × T
matrix where N is the number of playable notes and T is the B. Input to the Generator
number of timesteps. In MuseGAN [45], Generative Adversar- Generative Adversarial Networks (GANs) consist of two
ial Networks (GANs) have been applied for music generation separate networks, the Generator and the Discriminator. Fol-
using multiple-track piano-roll representation. Also, in DeepJ lowing the fundamental properties of GANs, the Generator
[46], they scaled the representation matrix between 0 and 1 to converts random noise to structured data while the Discrimi-
capture the note’s dynamics. The most notable disadvantage nator endeavors to classify a signal as original or generated.
of symbolic representations is that holding a note and replay- For applying conditioning in GANs, the most common tech-
ing a note are demonstrated by the same representation. To nique constitutes a biased input to the Generator. In sound
differentiate these two stages, DeepJ included a second matrix synthesis, a well established conditioning method includes the
called replay along with the original matrix play. mel-spectrogram as input to the Generator [60] [16] [19].
III. C ONDITIONING R EPRESENTATIONS This way, the synthesised sound is not just a product of a
Neural networks are able to generate sound based on the specific distribution but it also obtains desirable properties. For
statistical distribution of the training data. The more uniform example, it can be enforced to conditioning on predetermined
the input data to the network is, the more natural outcome can instrument or voice. Furthermore, a Generator conditioned on
be achieved. However, in cases where the amount of training spectrograms can also be used as a vocoder [61]. In addition
data is not sufficient, additional data with similar properties to the mel-spectrogram, other implementations have been con-
can be included by applying conditioning methods. Following ditioned on raw audio [62], one-hot vectors to encode musical
these techniques, the generated sound can be conditioned on pitch [30], linguistic features [13], or latent representations to
specific traits such as speaker’s voice [47] [27], independent identify speaker [63].
pitch [3] [48] [36], linguistic features [49] [17] or latent C. Other
representations [4] [45]. Instead of one-hot-embeddings, some
implementations have also used a confusion matrix to capture At last, other variations of conditioning have been intro-
a variation of emotions [39], while others provided supple- duced as well. Kim et al [14] adjusted conditioning through the
mentary positional information of each segment conditioning loss function. They estimated an auxiliary probability density
music to the artist or genre [43]. After training, the user using mel-spectrograms for local conditioning. Pink et al [64]
is able to decide between the conditioning properties of the applied bias terms in every layer of the convolutional network
synthesised sound. using also mel-spectrograms. Extra bias to the network has
been also proposed by [65] to encode linguistic features
A. Additional Input while in [7] every layer was biased with a different linear
The simplest strategy for applying conditioning to deep projection of the temporal embeddings. In [38] linguistic
learning architectures is by including auxiliary input data features were added to the output of each hidden layer in

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
TABLE I
OVERVIEW OF SOUND REPRESENTATIONS

Representation Papers Advantages Disadvantages Comments

[11] [12] -Computationally expensive. Used as input or

-Completely describes the waveform.
Waveform [13] [14] [15] -Unstructured representation that does not conditional
-Directly generates the output waveform.
[16] [17] reflect sound perception. representation
[18] [19]
-Interpretable representations that are Used as input or
[20] [21] [22] -Typically phase is discarded meaning it is
Spectrograms related to sound perception. conditional
[23] [24] [25] not directly invertible.
-Easy to illustrate/plot. representation
[26] [27] [28]
[33] [34]
-Difficult to synthesize waveforms with
[35] [4] [36] -Compressed, descriptive representation of Mostly used for
Acoustic features long term coherence.
[37] [13] [6] aspects of sound. conditioning
-Typically don’t fully specify sounds.
[38] [39]
[3] [22] [40] -Similar sounds lead to smaller distance in
Latent -Losses in decoding.
[41] [42] [7] multi-dimensional space. Mostly VAEs
representations -Can be difficult to interpret.
[43] -Compressed representation.
[44] [45] -Meaningful description of musical
Symbolic -Very high level description of audio.
[46] content.

the Generator while in [44] a new network was introduced by the one in Eq.5 are able to explicitly demonstrate temporal
the name conditionerCNN to work along with the Generator dependencies.
encoding chords for melody generation. Finally, Juvela et al
[37] conducted a comparative study of conditioning methods.
T
Y
IV. M ETHODS p(X) = p(xt |x1 , ..., xt−1 ) (5)
t−1
During recent years, deep learning models have significantly
contributed to research on sound generation. Using a variety
of deep learning algorithms, multiple representations have WaveNet [11] presents the most influential architecture of
been applied. The most common architectures include autore- explicit autoregressive models. The probability distribution
gressive methods, variational autoencoders (VAE), adversarial can be imitated by a stack of convolutional layers. However,
networks and normalising flows. However, many approaches to improve efficiency, the sequential data passes through a
can fall in more than one category. stack of dilated causal convolutional layers where the input
data are masked to skip some dependencies. Following a
A. Autoregressive similar scheme, FFTNet [48] takes advantage of convolutional
Autoregressive models define a category of generative mod- networks mimicking the FFT algorithm while upsampling the
els where every new sample in a sequence of data depends input data. However, to clear up the confusion, convolutional
on previous samples. Autoregressive deep neural networks networks do not always lead to autoregressive models [20].
can be represented by architectures that demonstrate this Autoregressive models have been initially proposed for
continuation implicitely or explicitely. Conventional methods sequential data. Therefore, in sound synthesis, raw audio is
that implicitely indicate a time-related manner are the recurrent ordinarily used as the input representation. However, many
neural networks. These models are able to recall previous data auxiliary representations have been applied conditioning the
dynamically using complex hidden state. SampleRNN [15] is audio generation on a variety of properties. More details about
one well established research work that applies hierarchical conditioning techniques for autoregressive models have been
recurrent neural networks such as GRU and LSTM on different already presented in Section III.
temporal resolutions for sound synthesis. In order to illustrate Since autoregressive models can be applied on sequential
the temporal behaviour of the network, Mehri et al conducted data, they are well established in sound generation related
experiments to test the model’s memory by injecting one sec- topics. Autoregressive models are easy to train and they can
ond of silence between two random sequential samples. Other manipulate data in real time. Furthermore, convolutional-based
significant papers on autoregressive models using recurrent models can be trained in parallel. Nevertheless, although these
neural networks are WaveRNN [17], MelNet [27] or LPCNet models can be paralleled during training, the generation is
[4]. In WaveRNN, they introduced a method for reducing the sequential and therefore slow. Synthesised data are affected
sampling time by using a batch of short sequences instead of only by previous samples, providing half way dependencies.
a unique long sequence while maintaining high quality sound. Finally, the generation can be consistent to specific properties
Generative models where the synthesis of the sequential for a definite number of samples and the outcome often lacks
samples follows a conditional probability distribution like global structure.

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
B. Normalizing Flow Its purpose is to maximise the probability of identifying
Normalizing flows constitute a family of generative models real and generated data. On the other hand, the Generator
consisting of multiple simple distributions for transforming is trained through the Discriminator. Information about the
input data to latent representations. A sequence of simple, original distribution of the dataset are concealed from it and
invertible and computationally inexpensive mappings z ∽ p(z) its aim is to minimise the error of the Discriminator. This
can model a reversible complex one. This complex transfor- minimax game can be summarised by the Eq. 8.
mation is presented in Eq. 6 and the inverse can be achieved
min max V (D, G) = Ex∽pdata (x) [log D(x)]
by repeatedly changing the variables as shown in Eq. 7. The G D (8)
mapping functions, then, can be parametrised by a deep neural +Ez∽pz (z) [log(1 − D(G(z)))]
network.
On the field of sound generation a variety of implemen-
tations have been proposed using numerous representations.
x = f0 ◦ f1 ◦ ... ◦ fk (z) (6)
In [30], spectrograms were generated using upsampling con-
z= fk−1 ◦ −1
fk−1 ◦ ... ◦ f0−1 (x) (7) volutions for fast generation while in [32], they investigated
whether waveform or spectrograms are more effective for
WaveGlow [21], a flow-based generative network, can syn-
GANs applying the Wasserstein loss function. In Parallel
thesise sound from its mel-spectrogram. By applying an Affine
WaveGAN [60], a teacher-student scheme was adopted using
Coupling Layer and a 1x1 Invertible Convolution, the model
non-autoregressive WaveNet in order to improve WaveGAN’s
aims to maximise the likelihood of the training data. The
efficiency. Yamamoto et al [16] applied GANs using a IAF
implementation has been proposed by NVIDIA and it is able
generator optimised by a probability density distillation algo-
to generate sound in real time. Insightful alternatives have also
rithm. Also, in GAN-TTS [13], they examined an ensemble of
been proposed on normalising flows by using only a single loss
Discriminators to generate acoustic features using the Hinge
function, without any auxiliary loss terms [14] or by applying
loss function along with [61] [63]. Lastly, GANs have also
dilated 2-D convolutional layers [64].
been applied in a variety of applications such as text-to-
Finally, in order to reduce the number of repeated iterations
speech applications [19], speech synthesis [66] [67], speech
needed by normalising flows, they have been merged with
enhancement [62] or symbolic music generation [45].
autoregressive methods. This architecture manages to increase
the performance of autoregressive models since the sampling D. Variational Autoencoders
can be processed in parallel. Using Inverse Autoregressive
An autoencoder is one of the fundamental deep learning
Flows (IAF), Oord et al increased the efficiency of WaveNet
architectures consisting of two separate networks, an encoder
[12]. Their implementation follows a ”probability density
and a decoder. The encoder compresses the input data into
distillation” where a pre-trained WaveNet model is used as
a latent representation while the decoder synthesises data
a teacher and scores the samples a WaveNet student outputs.
from the learned latent space. The original scheme of an
This way, the student can be trained in accordance with the dis-
autoencoder was initially created for dimensionality reduction
tribution of the teacher. A similar approach has been adopted
purposes. Although theoretically the decoder bear some re-
by ClariNet [31], where a Gaussian inverse autoregressive
semblance to the generator of GANs, the model is not well
flow is applied on WaveNet to train a text-to-wave neural
qualified for the synthesis of new examples. The network
architecture.
endeavors to reconstruct the original input, therefore it lacks
C. Adversarial Learning of expressiveness.
Unlike the Inverse Autoregressive Flows where a pre-trained To use autoencoders as generative models, variational au-
teacher network assist a student model, in adversarial learning, toencoders have been proposed [68]. In this architecture, the
two neural networks match against each other in a two-player encoder first models a latent distribution and then the network
minimax game. The fundamental architecture of Generative samples from the distribution to generate latent examples. The
Adversarial Networks (GANs) is based on two models, the success of the variational autoencoders is mostly based on
Generator (G) and the Discriminator (D). The Generator maps the Kullback–Leibler (KL) divergence used as a loss function.
a latent representation to the data space. In a vanilla GAN, the The encoder introduces a new distribution q(z|X) to estimate
Generator maps random noise to a desirable representation. p(z|X) as much as possible by minimising the KL divergence.
For sound synthesis this representation could be raw audio The complete loss function is demonstrated in the Eq. 9 where
or spectrogram. This desired representation, original or gener- the first term (called reconstruction loss) is applied on the final
ated, is used as input to the Discriminator which is trained to layer and the second term (called regularization loss) adjusts
distinguish between real and fake data. The maximum benefit the latent layer.
from GANs is acquired when the Generator produces perfect
data and the Discriminator is not able to differentiate between L = Ez∽q(z|X) [log p(X|z)] − DKL [q(z|X)||p(z)] (9)
real and fake data.
From a more technical point of view, the Discriminator Many variations of VAE have been applied for sound
is trained using only the distribution of the original data. generation topics. In [3] they used VAE with feedforward

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
networks and an additive synthesiser to reproduce monophonic for evaluation purposes. The rest of the metrics will be
musical notes. In [69] and [40] they applied convolutional analysed in the following sections. A similar set of evaluation
layers while in [36] a Variational Parametric Synthesiser was metrics has also been adopted by [50] including NDB.
proposed using a conditional VAE.
A modification of variational autoencoders proposed for C. Inception Score
music synthesis is VQ-VAE [43]. In this approach, the network The Inception Score (IS) is another perceptual metric which
is trained to encode the input data into a sequence of discrete correlates with human evaluation and is mostly adopted by
tokens. Jukebox introduces this method to flatten the data and GANs. For the Inception Score, a pre-trained Inception classi-
process it using autoregressive Transformers. fier is applied to the output of the generative model. In order to
measure the diversity of the synthesised data, the IS calculates
V. E VALUATION the KL divergence between the model scores P (y|x) and the
Although in the last decade generative models presented marginal distribution P (y) as it can be expressed by the Eq.10
significant improvements, a definitive evaluation process still for every possible class [71]. The IS is maximised when the
remains an open question. Many mathematical metrics have generated examples belong to only one class and every class
been proposed for perceptually evaluating the generative sound is predicted equally often.
and usually a transformation to another audio representation
have been adopted. However, despite the numerous attempts, IS = exp(Ex DKL (P (y|x)||P (y))) (10)
none of these metrics are as reliable as the subjective evalution
of human listeners. In [30] and [7], a pitch classifier is trained on spectrograms
of the NSynth dataset while in WaveGAN [32], the classifier
A. Perceptual Evaluation is trained on normalised log mel-spectrograms having zero
Human evaluation usually accounts for the mean opinion mean and unit variance. Finally, metrics like Pitch Accuracy
score between a group of listeners. To conduct the study, and Pitch Entropy or a nearest neighbour technique have been
many researchers used crowdMOS [70], a user-friendly toolkit adopted by GANSynth and WaveGAN respectively to further
for performing listening evaluations. As well as the mean evaluate the efficiency of their Inception Score. Finally, in [50]
opinion score, a confidence interval is also been computed. they also applied a modified inception score.
Furthermore, in order to attract an accountable number of
D. Distances-based measurements
subjects with specific characteristics, Amazon Mechanical
Turk has been widely used. In many cases, raters have been This evaluation category includes metrics that measure the
asked to pass a hearing test [21], keep headphones on [61] distance between representations of the original data and
[11], or only native speakers for evaluating speech have been the distribution of the generated examples. Binkowski et
asked [60] [61] [54]. al. proposed two distance-based metrics, the Fréchet Deep-
In these mean opinion score tests, subjects have been asked Speech Distance (FDSD) and the Kernel DeepSpeech Distance
to rate a sound in a five-point Likert scale in terms of (KDSD) [13] for evaluating their text-to-speech model. The
pleasantness [21], naturalness [11] [13] [63], sound quality two metrics make use of the the Fréchet distance and the
[17] or speaker diversity [32]. In addition, subjects have been Maximum Mean Discrepancy respectively on audio features
requested to express a preference between sounds of two extracted by a speech recognition model.
generative models hearing the same pitch [30] or speech [17] The Fréchet or 2-Wasserstein distance has been proposed
[11]. Finally, for evaluating WaveGAN [32], humans listened by other research papers as well. Engel et al [30] applied
to digits between one to ten and were asked to indicate which the Fréchet Inception Distance on features extracted by a
number they heard. pitch classifier while Kilgour et al [72] used this distance
to measure the intensity of a distortion in generated sound.
B. Number of Statistically-Different Bins However, although many researchers report successful results
The Number of Statistically-Different Bins (NDB) is a using 2-Wasserstein, Donahue et al [63] reported that a similar
metric for unconditional generative models in order to es- evaluation metric did not produce a desirable outcome in their
timate the diversity of the synthesised examples. Clustering experiments.
techniques are applied on the training data creating cells of Distances-based measurements have also been investigated
similar properties. Then, the same algorithm tries to categorise individually by separate parameter estimations. In [3] distances
the generated data into the cells. If a generated example does between the generated loudness and fundamental frequency of
not belong to a predefined cluster, then the generated sound synthesised and training data are used.
is statistically significantly different.
GANSynth [30] used k-means to map the log spectrogram E. Spectral Convergence
of the generated sound into k = 50 Voronoi cells. As well The Spectral Convergence expresses the mean difference
as Mean Opinion Score and the Number of Statistically- between the original and the generated spectrogram. It has
Different Bins, GANSynth also used Inception Score, Pitch been applied by [43] [20] [57] [40] in order to evaluate their
Accuracy and Pitch Entropy and Frechet Inception Distance synthesised music. The Spectral Convergence can be expressed

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
by the Eq.11 which is also identified as the minimization [7] J. Engel, C. Resnick, A. Roberts, S. Dieleman, D. Eck, K. Simonyan, and
process of the Griffin-Lim algorithm. M. Norouzi, “Neural Audio Synthesis of Musical Notes with WaveNet
Autoencoders,” arXiv:1704.01279 [cs], Apr. 2017. arXiv: 1704.01279.
v [8] R. Manzelli, V. Thakkar, A. Siahkamari, and B. Kulis, “Conditioning
uP
u e 2 Deep Generative Raw Audio Models for Structured Automatic Music,”
n,m |S(n, m) − S(n, m)|
SC = t P (11) arXiv:1806.09905 [cs, eess, stat], June 2018. arXiv: 1806.09905.
n,m S(n, m)
[9] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A Vocoder-Based
High-Quality Speech Synthesis System for Real-Time Applications,”
IEICE Transactions on Information and Systems, vol. E99.D, no. 7,
F. Log Likelihood pp. 1877–1884, 2016.
A final evaluation metric includes a Negative Log Like- [10] C. E. Shannont, “Communication in the Presence of Noise,” PROCEED-
INGS OF THE I.R.E., p. 12, 1949.
lihood (NLL) [17] [15] and an objective Conditional Log [11] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves,
Likelihood (CLL) [14] usually measured in bits per audio N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A Gener-
sample. ative Model for Raw Audio,” arXiv:1609.03499 [cs], Sept. 2016. arXiv:
1609.03499.
[12] A. v. d. Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals,
VI. C ONCLUSION K. Kavukcuoglu, G. v. d. Driessche, E. Lockhart, L. C. Cobo, F. Stim-
berg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen,
The choice of audio representation is one of the most N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and
significant factors in the development of deep learning models D. Hassabis, “Parallel WaveNet: Fast High-Fidelity Speech Synthesis,”
for sound synthesis. Numerous representations have been pro- arXiv:1711.10433 [cs], Nov. 2017. arXiv: 1711.10433.
[13] M. Bińkowski, J. Donahue, S. Dieleman, A. Clark, E. Elsen,
posed by previous researchers focusing on different properties. N. Casagrande, L. C. Cobo, and K. Simonyan, “High Fidelity Speech
Raw audio is a direct representation demanding notable mem- Synthesis with Adversarial Networks,” arXiv:1909.11646 [cs, eess],
ory and computational cost. It is also not considered for eval- Sept. 2019. arXiv: 1909.11646.
[14] S. Kim, S.-g. Lee, J. Song, J. Kim, and S. Yoon, “FloWaveNet : A
uating purposes since different waveforms can perceptually Generative Flow for Raw Audio,” arXiv:1811.02155 [cs, eess], May
produce the same sound. Spectrograms can overcome some of 2019. arXiv: 1811.02155.
the disadvantages of raw audio and have been considered as [15] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo,
A. Courville, and Y. Bengio, “SampleRNN: An Unconditional End-
an alternative for training as well as for evaluation. However, to-End Neural Audio Generation Model,” arXiv:1612.07837 [cs], Feb.
reconstructing the original sound from its spectrogram is a 2017. arXiv: 1612.07837.
challenging task since it may produce sound suffering from [16] R. Yamamoto, E. Song, and J.-M. Kim, “Probability density distil-
lation with generative adversarial networks for high-quality parallel
distortions and lack of phase coherence. Recently, other audio waveform generation,” arXiv:1904.04472 [cs, eess], Aug. 2019. arXiv:
representations have received much attention such as latent 1904.04472.
representations, embeddings and acoustic features but they all [17] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande,
E. Lockhart, F. Stimberg, A. v. d. Oord, S. Dieleman, and
require a powerful decoder. The choice of audio representation K. Kavukcuoglu, “Efficient Neural Audio Synthesis,” arXiv:1802.08435
is still very much dependent on the application. [cs, eess], June 2018. arXiv: 1802.08435.
[18] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly,
VII. ACKNOWLEDGMENTS Z. Yang, Y. Xiao, Z. Chen, S. Bengio, Q. Le, Y. Agiomyrgiannakis,
R. Clark, and R. A. Saurous, “Tacotron: Towards End-to-End Speech
This publication has emanated from research supported in Synthesis,” arXiv:1703.10135 [cs], Apr. 2017. arXiv: 1703.10135.
[19] P. Neekhara, C. Donahue, M. Puckette, S. Dubnov, and J. McAuley, “Ex-
part by a grant from Science Foundation Ireland under Grant pediting TTS Synthesis with Adversarial Vocoding,” arXiv:1904.07944
number 18/CRT/6183. For the purpose of Open Access, the [cs, eess], July 2019. arXiv: 1904.07944.
author has applied a CC BY public copyright licence to [20] S. O. Arik, H. Jun, and G. Diamos, “Fast Spectrogram Inversion using
any Author Accepted Manuscript version arising from this Multi-head Convolutional Neural Networks,” IEEE Signal Processing
Letters, vol. 26, pp. 94–98, Jan. 2019. arXiv: 1808.06719.
submission’. [21] R. Prenger, R. Valle, and B. Catanzaro, “WaveGlow: A Flow-based
Generative Network for Speech Synthesis,” arXiv:1811.00002 [cs, eess,
R EFERENCES stat], Oct. 2018. arXiv: 1811.00002.
[22] K. Peng, W. Ping, Z. Song, and K. Zhao, “Non-Autoregressive Neu-
[1] H. Gm, M. K. Gourisaria, M. Pandey, and S. S. Rautaray, “A compre- ral Text-to-Speech,” arXiv:1905.08459 [cs, eess], June 2020. arXiv:
hensive survey and analysis of generative models in machine learning,” 1905.08459.
Computer Science Review, vol. 38, p. 100285, Nov. 2020. [23] C. Aouameur, P. Esling, and G. Hadjeres, “Neural Drum Machine
[2] M. Huzaifah and L. Wyse, “Deep generative models for musical au- : An Interactive System for Real-time Synthesis of Drum Sounds,”
dio synthesis,” arXiv:2006.06426 [cs, eess, stat], Nov. 2020. arXiv: arXiv:1907.02637 [cs, eess], Nov. 2019. arXiv: 1907.02637.
2006.06426. [24] Y. Ren, Y. Ruan, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y.
[3] J. Engel, L. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable Liu, “FastSpeech: Fast, Robust and Controllable Text to Speech,”
Digital Signal Processing,” arXiv:2001.04643 [cs, eess, stat], Jan. 2020. arXiv:1905.09263 [cs, eess], Nov. 2019. arXiv: 1905.09263.
arXiv: 2001.04643. [25] Y. Ren, C. Hu, X. Tan, T. Qin, S. Zhao, Z. Zhao, and T.-Y. Liu,
[4] J.-M. Valin and J. Skoglund, “LPCNET: Improving Neural Speech “FastSpeech 2: Fast and High-Quality End-to-End Text to Speech,”
Synthesis through Linear Prediction,” in ICASSP 2019 - 2019 IEEE arXiv:2006.04558 [cs, eess], Mar. 2021. arXiv: 2006.04558.
International Conference on Acoustics, Speech and Signal Processing [26] R. Liu, B. Sisman, F. Bao, G. Gao, and H. Li, “WaveTTS: Tacotron-
(ICASSP), (Brighton, United Kingdom), pp. 5891–5895, IEEE, May based TTS with Joint Time-Frequency Domain Loss,” arXiv:2002.00417
2019. [cs, eess], Apr. 2020. arXiv: 2002.00417.
[5] P. Govalkar, J. Fischer, F. Zalkow, and C. Dittmar, “A Comparison of [27] S. Vasquez and M. Lewis, “MelNet: A Generative Model for Audio in
Recent Neural Vocoders for Speech Signal Reconstruction,” in 10th the Frequency Domain,” arXiv:1906.01083 [cs, eess, stat], June 2019.
ISCA Speech Synthesis Workshop, pp. 7–12, ISCA, Sept. 2019. arXiv: 1906.01083.
[6] M. Blaauw and J. Bonada, “A Neural Parametric Singing Synthesizer,” [28] S. Huang, Q. Li, C. Anil, X. Bao, S. Oore, and R. B. Grosse, “Tim-
arXiv:1704.03809 [cs], Aug. 2017. arXiv: 1704.03809. breTron: A WaveNet(CycleGAN(CQT(Audio))) Pipeline for Musical

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.
Timbre Transfer,” arXiv:1811.09620 [cs, eess, stat], May 2019. arXiv: [52] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen,
1811.09620. Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgian-
[29] G. A. Velasco, N. Holighaus, M. Dörfler, and T. Grill, “CONSTRUCT- nakis, and Y. Wu, “Natural TTS Synthesis by Conditioning WaveNet
ING AN INVERTIBLE CONSTANT-Q TRANSFORM WITH NON- on Mel Spectrogram Predictions,” arXiv:1712.05884 [cs], Feb. 2018.
STATIONARY GABOR FRAMES,” p. 8, 2011. arXiv: 1712.05884.
[30] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, C. Donahue, and [53] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang,
A. Roberts, “GANSYNTH: ADVERSARIAL NEURAL AUDIO SYN- J. Raiman, and J. Miller, “Deep Voice 3: Scaling Text-to-Speech with
THESIS,” p. 17, 2019. Convolutional Sequence Learning,” arXiv:1710.07654 [cs, eess], Feb.
[31] W. Ping, K. Peng, and J. Chen, “ClariNet: Parallel Wave Generation in 2018. arXiv: 1710.07654.
End-to-End Text-to-Speech,” p. 13. [54] N. Li, S. Liu, Y. Liu, S. Zhao, M. Liu, and M. Zhou, “Neural Speech
[32] C. Donahue, J. McAuley, and M. Puckette, “ADVERSARIAL AUDIO Synthesis with Transformer Network,” arXiv:1809.08895 [cs], Jan.
SYNTHESIS,” p. 16, 2019. 2019. arXiv: 1809.08895.
[33] Y. Wang, D. Stanton, Y. Zhang, R. Skerry-Ryan, E. Battenberg, J. Shor, [55] M. Angrick, C. Herff, E. Mugler, M. C. Tate, M. W. Slutzky, D. J.
Y. Xiao, F. Ren, Y. Jia, and R. A. Saurous, “Style Tokens: Unsupervised Krusienski, and T. Schultz, “Speech synthesis from ECoG using densely
Style Modeling, Control and Transfer in End-to-End Speech Synthesis,” connected 3D convolutional neural networks,” Journal of Neural Engi-
p. 10. neering, vol. 16, p. 036019, June 2019.
[34] X. Wang, S. Takaki, and J. Yamagishi, “Neural source-filter waveform [56] S. O. Arık, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky,
models for statistical parametric speech synthesis,” arXiv:1904.12088 Y. Kang, X. Li, J. Miller, A. Ng, J. Raiman, S. Sengupta, and
[cs, eess, stat], Nov. 2019. arXiv: 1904.12088. M. Shoeybi, “Deep Voice: Real-time Neural Text-to-Speech,” p. 10.
[35] A. Defossez, N. Zeghidour, N. Usunier, L. Bottou, and F. Bach, “SING: [57] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda,
Symbol-to-Instrument Neural Generator,” p. 11. “Speaker-Dependent WaveNet Vocoder,” in Interspeech 2017, pp. 1118–
[36] K. Subramani, P. Rao, and A. D’Hooge, “Vapar Synth - A Variational 1122, ISCA, Aug. 2017.
Parametric Model for Audio Synthesis,” in ICASSP 2020 - 2020 IEEE [58] C. Hawthorne, A. Stasyuk, A. Roberts, I. Simon, C.-Z. A. Huang,
International Conference on Acoustics, Speech and Signal Processing S. Dieleman, E. Elsen, J. Engel, and D. Eck, “Enabling Factorized
(ICASSP), (Barcelona, Spain), pp. 796–800, IEEE, May 2020. Piano Music Modeling and Generation with the MAESTRO Dataset,”
[37] L. Juvela, B. Bollepalli, V. Tsiaras, and P. Alku, “GlotNet—A Raw arXiv:1810.12247 [cs, eess, stat], Jan. 2019. arXiv: 1810.12247.
Waveform Model for the Glottal Excitation in Statistical Parametric [59] Z.-H. Ling, Y. Ai, Y. Gu, and L.-R. Dai, “Waveform Modeling and
Speech Synthesis,” IEEE/ACM Transactions on Audio, Speech, and Generation Using Hierarchical Recurrent Neural Networks for Speech
Language Processing, vol. 27, pp. 1019–1030, June 2019. Bandwidth Extension,” IEEE/ACM Transactions on Audio, Speech, and
[38] S. Yang, L. Xie, X. Chen, X. Lou, X. Zhu, D. Huang, and H. Li, Language Processing, vol. 26, pp. 883–894, May 2018.
“Statistical Parametric Speech Synthesis Using Generative Adversarial [60] R. Yamamoto, E. Song, and J.-M. Kim, “Parallel WaveGAN: A fast
Networks Under A Multi-task Learning Framework,” arXiv:1707.01670 waveform generation model based on generative adversarial networks
[cs], July 2017. arXiv: 1707.01670. with multi-resolution spectrogram,” arXiv:1910.11480 [cs, eess], Feb.
[39] G. E. Henter, J. Lorenzo-Trueba, X. Wang, and J. Yamagishi, “Deep 2020. arXiv: 1910.11480.
Encoder-Decoder Models for Unsupervised Learning of Controllable [61] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh,
Speech Synthesis,” arXiv:1807.11470 [cs, eess, stat], Sept. 2018. arXiv: J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, “MelGAN:
1807.11470. Generative Adversarial Networks for Conditional Waveform Synthesis,”
[40] A. Bitton, P. Esling, and T. Harada, “Neural Granular Sound Synthesis,” arXiv:1910.06711 [cs, eess], Dec. 2019. arXiv: 1910.06711.
arXiv:2008.01393 [cs, eess], Aug. 2020. arXiv: 2008.01393. [62] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement
[41] P. Esling, A. Chemla-Romeu-Santos, and A. Bitton, “Generative timbre Generative Adversarial Network,” arXiv:1703.09452 [cs], June 2017.
spaces: regularizing variational auto-encoders with perceptual metrics,” arXiv: 1703.09452.
arXiv:1805.08501 [cs, eess], Oct. 2018. arXiv: 1805.08501. [63] J. Donahue, S. Dieleman, M. Bińkowski, E. Elsen, and K. Simonyan,
[42] P. Esling, N. Masuda, A. Bardet, R. Despres, and A. Chemla-Romeu- “End-to-End Adversarial Text-to-Speech,” arXiv:2006.03575 [cs, eess],
Santos, “Universal audio synthesizer control with normalizing flows,” Mar. 2021. arXiv: 2006.03575.
arXiv:1907.00971 [cs, eess, stat], July 2019. arXiv: 1907.00971. [64] W. Ping, K. Peng, K. Zhao, and Z. Song, “WaveFlow: A Compact Flow-
[43] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, based Model for Raw Audio,” p. 11.
“Jukebox: A Generative Model for Music,” arXiv:2005.00341 [cs, eess, [65] K. Rao, F. Peng, H. Sak, and F. Beaufays, “Grapheme-to-phoneme
stat], Apr. 2020. arXiv: 2005.00341. conversion using Long Short-Term Memory recurrent neural networks,”
[44] L.-C. Yang, S.-Y. Chou, and Y.-H. Yang, “MidiNet: A Convolutional in 2015 IEEE International Conference on Acoustics, Speech and
Generative Adversarial Network for Symbolic-domain Music Genera- Signal Processing (ICASSP), (South Brisbane, Queensland, Australia),
tion,” arXiv:1703.10847 [cs], July 2017. arXiv: 1703.10847. pp. 4225–4229, IEEE, Apr. 2015.
[45] H.-W. Dong, W.-Y. Hsiao, L.-C. Yang, and Y.-H. Yang, “MuseGAN: [66] K. Oyamada, H. Kameoka, T. Kaneko, K. Tanaka, N. Hojo, and H. Ando,
Multi-track Sequential Generative Adversarial Networks for Symbolic “Generative adversarial network-based approach to signal reconstruction
Music Generation and Accompaniment,” arXiv:1709.06298 [cs, eess, from magnitude spectrogram,” in 2018 26th European Signal Processing
stat], Nov. 2017. arXiv: 1709.06298. Conference (EUSIPCO), (Rome), pp. 2514–2518, IEEE, Sept. 2018.
[46] H. H. Mao, T. Shin, and G. W. Cottrell, “DeepJ: Style-Specific Music [67] Y. Saito, S. Takamichi, and H. Saruwatari, “Statistical Parametric Speech
Generation,” 2018 IEEE 12th International Conference on Semantic Synthesis Incorporating Generative Adversarial Networks,” IEEE/ACM
Computing (ICSC), pp. 377–382, Jan. 2018. arXiv: 1801.00887. Transactions on Audio, Speech, and Language Processing, vol. 26,
[47] Y. Zhao, S. Takaki, H.-T. Luong, J. Yamagishi, D. Saito, and N. Mine- pp. 84–96, Jan. 2018.
matsu, “Wasserstein GAN and Waveform Loss-Based Acoustic Model [68] D. P. Kingma and M. Welling, “Auto-Encoding Variational Bayes,”
Training for Multi-Speaker Text-to-Speech Synthesis Systems Using a arXiv:1312.6114 [cs, stat], May 2014. arXiv: 1312.6114.
WaveNet Vocoder,” IEEE Access, vol. 6, pp. 60478–60488, 2018. [69] A. Pandey and D. Wang, “A New Framework for CNN-Based Speech
[48] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “Fftnet: A Real- Enhancement in the Time Domain,” IEEE/ACM Transactions on Audio,
Time Speaker-Dependent Neural Vocoder,” in 2018 IEEE International Speech, and Language Processing, vol. 27, pp. 1179–1188, July 2019.
Conference on Acoustics, Speech and Signal Processing (ICASSP), [70] F. Ribeiro, D. Florêncio, C. Zhang, and M. Seltzer, “crowdMOS: An
(Calgary, AB), pp. 2251–2255, IEEE, Apr. 2018. Approach for Crowdsourcing Mean Opinion Score Studies,” p. 4.
[49] S. Arik, G. Diamos, A. Gibiansky, J. Miller, K. Peng, W. Ping, [71] T. Salimans, I. Goodfellow, W. Zaremba, V. Cheung, A. Radford, and
J. Raiman, and Y. Zhou, “Deep Voice 2: Multi-Speaker Neural Text- X. Chen, “Improved Techniques for Training GANs,” arXiv:1606.03498
to-Speech,” arXiv:1705.08947 [cs], Sept. 2017. arXiv: 1705.08947. [cs], June 2016. arXiv: 1606.03498.
[50] Z. Kong, W. Ping, J. Huang, K. Zhao, and B. Catanzaro, “DiffWave: A [72] K. Kilgour, M. Zuluaga, D. Roblek, and M. Sharifi, “Fr\’echet Audio
Versatile Diffusion Model for Audio Synthesis,” arXiv:2009.09761 [cs, Distance: A Metric for Evaluating Music Enhancement Algorithms,”
eess, stat], Mar. 2021. arXiv: 2009.09761. arXiv:1812.08466 [cs, eess], Jan. 2019. arXiv: 1812.08466.
[51] J. Boilard, P. Gournay, and R. Lefebvre, “A Literature Review of
WaveNet: Theory, Application and Optimization,” p. 17.

Authorized licensed use limited to: UNIVERSITY OF NOTTINGHAM. Downloaded on October 26,2024 at 09:49:18 UTC from IEEE Xplore. Restrictions apply.

A I For Children
100% (1)
A I For Children
140 pages
Intelligent Traffic
No ratings yet
Intelligent Traffic
78 pages
Deep Learning KCS078
0% (1)
Deep Learning KCS078
2 pages
Game Ontology - Encylopedia of Ludic Terms
No ratings yet
Game Ontology - Encylopedia of Ludic Terms
16 pages
Deep Learning For Signal Processing
No ratings yet
Deep Learning For Signal Processing
19 pages
High Fidelity Neural Audio Compression: Alexandre Défossez
No ratings yet
High Fidelity Neural Audio Compression: Alexandre Défossez
19 pages
Artificial Intelligence Enabled Marketing Solution
No ratings yet
Artificial Intelligence Enabled Marketing Solution
14 pages
Arabic Sign Language Recognition: A Deep Learning Approach
No ratings yet
Arabic Sign Language Recognition: A Deep Learning Approach
107 pages
Catch-A-Waveform: Learning To Generate Audio From A Single Short Example
No ratings yet
Catch-A-Waveform: Learning To Generate Audio From A Single Short Example
16 pages
RRSGAN Reference-Based Super-Resolution For Remote Sensing Image
No ratings yet
RRSGAN Reference-Based Super-Resolution For Remote Sensing Image
17 pages
A Comprehensive Survey On Deep Music Generation
No ratings yet
A Comprehensive Survey On Deep Music Generation
96 pages
QB DL
No ratings yet
QB DL
2 pages
Large Scale Deep Learning With TensorFlow (PDFDrive)
No ratings yet
Large Scale Deep Learning With TensorFlow (PDFDrive)
240 pages
Soundstream: An End-To-End Neural Audio Codec
No ratings yet
Soundstream: An End-To-End Neural Audio Codec
12 pages
Empire of Great Brightness Visual and Material Cultures of Ming China, 1368-1644 (Craig Clunas) (Z-Library)
No ratings yet
Empire of Great Brightness Visual and Material Cultures of Ming China, 1368-1644 (Craig Clunas) (Z-Library)
288 pages
Handwritten Digit Recognition Using Machine and Deep Learning Algorithms
No ratings yet
Handwritten Digit Recognition Using Machine and Deep Learning Algorithms
6 pages
WaveNet: A Generative Model For Raw Audio
No ratings yet
WaveNet: A Generative Model For Raw Audio
16 pages
Jukebox: A Generative Model For Music
No ratings yet
Jukebox: A Generative Model For Music
20 pages
Learning Representations From Audio Using Autoencoders
No ratings yet
Learning Representations From Audio Using Autoencoders
11 pages
Multi-Instrument Music Synthesis With Spectrogram Diffusion
No ratings yet
Multi-Instrument Music Synthesis With Spectrogram Diffusion
12 pages
44 E2248
No ratings yet
44 E2248
11 pages
Melgan: Generative Adversarial Networks For Conditional Waveform Synthesis
No ratings yet
Melgan: Generative Adversarial Networks For Conditional Waveform Synthesis
14 pages
Simple and Controllable Music Generation: : Equal Contributions, : Core Team
No ratings yet
Simple and Controllable Music Generation: : Equal Contributions, : Core Team
15 pages
Example-Guided Physically Based Modal Sound Synthesis
No ratings yet
Example-Guided Physically Based Modal Sound Synthesis
16 pages
Ijeti 02 00017
No ratings yet
Ijeti 02 00017
4 pages
Efficient and Accurate Sound Propagation Using Adaptive Rectangular Decomposition
No ratings yet
Efficient and Accurate Sound Propagation Using Adaptive Rectangular Decomposition
13 pages
Datasheet PDF
No ratings yet
Datasheet PDF
6 pages
2015 WeissMueller TonalComplexity ICASSP1
No ratings yet
2015 WeissMueller TonalComplexity ICASSP1
6 pages
SIVE Toolkit Manual
No ratings yet
SIVE Toolkit Manual
20 pages
Citizen Care
No ratings yet
Citizen Care
7 pages
Tech Talk Supplementary
No ratings yet
Tech Talk Supplementary
4 pages
Shier Jordie MSC 2021
No ratings yet
Shier Jordie MSC 2021
153 pages
Internet of Medical Things (IoMT) For Cardio-Vascular Disease
No ratings yet
Internet of Medical Things (IoMT) For Cardio-Vascular Disease
15 pages
A Social History of The Chinese Book Books and Literati Culture in Late Imperial China (Joseph McDermott) (Z-Library)
100% (1)
A Social History of The Chinese Book Books and Literati Culture in Late Imperial China (Joseph McDermott) (Z-Library)
311 pages
A Stacked Multi-Connection Simple Reducing Net For
No ratings yet
A Stacked Multi-Connection Simple Reducing Net For
14 pages
VR Game Ontology - Encylopedia of Ludic Terms
No ratings yet
VR Game Ontology - Encylopedia of Ludic Terms
18 pages
SoundStorm: Efficient Parallel Audio Generation - Arxiv:2305.09636
No ratings yet
SoundStorm: Efficient Parallel Audio Generation - Arxiv:2305.09636
9 pages
DDSP Differentiable Digital Signal Processing
No ratings yet
DDSP Differentiable Digital Signal Processing
19 pages
Melnet: A Generative Model For Audio in The Frequency Domain
No ratings yet
Melnet: A Generative Model For Audio in The Frequency Domain
14 pages
Speech Chapter 4
No ratings yet
Speech Chapter 4
41 pages
Rave
No ratings yet
Rave
15 pages
NeurIPS 2020 Hifi Gan Generative Adversarial Networks For Efficient and High Fidelity Speech Synthesis Paper
No ratings yet
NeurIPS 2020 Hifi Gan Generative Adversarial Networks For Efficient and High Fidelity Speech Synthesis Paper
12 pages
Convention Paper 5452: Audio Engineering Society
100% (1)
Convention Paper 5452: Audio Engineering Society
10 pages
Project Sketch
No ratings yet
Project Sketch
3 pages
Video Game Ontology Competency Questions
No ratings yet
Video Game Ontology Competency Questions
4 pages
Artificial Intelligence
No ratings yet
Artificial Intelligence
16 pages
Octave System Sound Processing Library: Lóránt Oroszlány
No ratings yet
Octave System Sound Processing Library: Lóránt Oroszlány
39 pages
The Emergence of Deep Learning: New Opportunities For Music and Audio Technologies
No ratings yet
The Emergence of Deep Learning: New Opportunities For Music and Audio Technologies
2 pages
Deep Learning For Audio Signal Processing
No ratings yet
Deep Learning For Audio Signal Processing
14 pages
Fibinet: Combining Feature Importance and Bilinear Feature Interaction For Click-Through Rate Prediction
No ratings yet
Fibinet: Combining Feature Importance and Bilinear Feature Interaction For Click-Through Rate Prediction
8 pages
Audio Classification Using Deep Learning Report
No ratings yet
Audio Classification Using Deep Learning Report
25 pages
Nietjet 0602S 2018 003
No ratings yet
Nietjet 0602S 2018 003
5 pages
NLPReport Phase 1
No ratings yet
NLPReport Phase 1
5 pages
Generating Black Metal and Math Rock
No ratings yet
Generating Black Metal and Math Rock
3 pages
Suoni
No ratings yet
Suoni
38 pages
Intro To AI - 1 MCQ Ans
No ratings yet
Intro To AI - 1 MCQ Ans
8 pages
TSP 4.0 Capstone Project Student Report GK
No ratings yet
TSP 4.0 Capstone Project Student Report GK
47 pages
Nvidia Learning Learning Path Developers It Administrators
No ratings yet
Nvidia Learning Learning Path Developers It Administrators
19 pages
How To Write A CHI Paper
No ratings yet
How To Write A CHI Paper
42 pages
D - B A I: Iffusion Ased Udio Npainting
No ratings yet
D - B A I: Iffusion Ased Udio Npainting
15 pages
Neuero Rack AI MODULAR
No ratings yet
Neuero Rack AI MODULAR
8 pages
Smart Outpass Management System
No ratings yet
Smart Outpass Management System
9 pages
Unit 4 Notes
No ratings yet
Unit 4 Notes
21 pages
2021 Deep Learning Audio Book
No ratings yet
2021 Deep Learning Audio Book
38 pages
Audio Gen
No ratings yet
Audio Gen
16 pages
Acapella-Based Music Generation With Sequential Models Utilizing Discrete Cosine Transform
No ratings yet
Acapella-Based Music Generation With Sequential Models Utilizing Discrete Cosine Transform
10 pages
Report
No ratings yet
Report
7 pages
DL For Acoustics
No ratings yet
DL For Acoustics
4 pages
Deepfake Audio Detection and Justification With Ex
No ratings yet
Deepfake Audio Detection and Justification With Ex
19 pages
CBM342 BCI Unit III
No ratings yet
CBM342 BCI Unit III
16 pages
PACHET - BRIOT - Deeplearningformusicgeneration
No ratings yet
PACHET - BRIOT - Deeplearningformusicgeneration
14 pages
Genai Mini Project Report
No ratings yet
Genai Mini Project Report
8 pages
Info Resume
No ratings yet
Info Resume
2 pages
AI Audio Deepfake
No ratings yet
AI Audio Deepfake
18 pages
Applications of Deep Learning To Audio Generation
No ratings yet
Applications of Deep Learning To Audio Generation
16 pages
Audio Noise Detection
No ratings yet
Audio Noise Detection
29 pages
Music Generation With NLP-1
No ratings yet
Music Generation With NLP-1
15 pages
A Survey of Deep Learning Audio Generation Methods
No ratings yet
A Survey of Deep Learning Audio Generation Methods
14 pages
Computers 13 00256
No ratings yet
Computers 13 00256
13 pages
AI Text To Music
No ratings yet
AI Text To Music
2 pages
Test 2 35
No ratings yet
Test 2 35
25 pages
From Theory To Practice The Evolution of Artificial Intelligence in Business
No ratings yet
From Theory To Practice The Evolution of Artificial Intelligence in Business
15 pages
Pfa Inr
No ratings yet
Pfa Inr
75 pages
Ji Yang Luo Survey Symbolic Music Generation
No ratings yet
Ji Yang Luo Survey Symbolic Music Generation
39 pages
2209 03143v2AudioLM
No ratings yet
2209 03143v2AudioLM
11 pages
Interview Questions
No ratings yet
Interview Questions
18 pages
Gen-Ai ppt1
No ratings yet
Gen-Ai ppt1
5 pages
Fai Question Bank
No ratings yet
Fai Question Bank
2 pages
ECCV 2020 Paper Digests
No ratings yet
ECCV 2020 Paper Digests
138 pages
SingGANGenerativeAdversarialNetworkForHigh Fidelity
No ratings yet
SingGANGenerativeAdversarialNetworkForHigh Fidelity
11 pages
Final Intro AIReport
No ratings yet
Final Intro AIReport
9 pages
AI Final Exam
No ratings yet
AI Final Exam
8 pages
Tech 3
No ratings yet
Tech 3
9 pages
Dversarial Udio Ynthesis: Chris Donahue Julian Mcauley Miller Puckette
No ratings yet
Dversarial Udio Ynthesis: Chris Donahue Julian Mcauley Miller Puckette
16 pages
NVIDIA NeMo Audio Codec 44khz
No ratings yet
NVIDIA NeMo Audio Codec 44khz
7 pages
HiFi GAN
No ratings yet
HiFi GAN
14 pages
Fast and Flexible Neural Audio Synthesis: Lamtharn Hantrakul Jesse Engel Adam Roberts Chenjie Gu
No ratings yet
Fast and Flexible Neural Audio Synthesis: Lamtharn Hantrakul Jesse Engel Adam Roberts Chenjie Gu
7 pages
High-Fidelity Audio Compression With Improved RVQGAN: Rithesh Kumar Prem Seetharaman
No ratings yet
High-Fidelity Audio Compression With Improved RVQGAN: Rithesh Kumar Prem Seetharaman
14 pages
Deepfake Basepaper
No ratings yet
Deepfake Basepaper
3 pages
Deep Learning: Fundamentals and Applications
From Everand
Deep Learning: Fundamentals and Applications
Fouad Sabry
No ratings yet
Deep Learning
From Everand
Deep Learning
Manish Soni
No ratings yet

Audio Representations For Deep Learning in Sound Synthesis A Review

Uploaded by

Audio Representations For Deep Learning in Sound Synthesis A Review

Uploaded by

Audio representations for deep learning in sound

Anastasia Natsiou Seán O’Leary

978-1-6654-0969-8/21/$31.00 ©2021 IEEE

Representation Papers Advantages Disadvantages Comments

[11] [12] -Computationally expensive. Used as input or

You might also like