0% found this document useful (0 votes)
71 views6 pages

Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks

The document discusses a new technique called multi-head convolutional neural networks (MCNN) for fast waveform synthesis from spectrograms without iterative algorithms or autoregression. MCNN enables significantly faster processing than commonly used techniques like Griffin-Lim and can synthesize high quality speech over 300 times faster than real-time.

Uploaded by

filase8959
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
0% found this document useful (0 votes)
71 views6 pages

Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks

The document discusses a new technique called multi-head convolutional neural networks (MCNN) for fast waveform synthesis from spectrograms without iterative algorithms or autoregression. MCNN enables significantly faster processing than commonly used techniques like Griffin-Lim and can synthesize high quality speech over 300 times faster than real-time.

Uploaded by

filase8959
Copyright
© © All Rights Reserved
We take content rights seriously. If you suspect this is your content, claim it here.
Available Formats
Download as PDF, TXT or read online on Scribd
You are on page 1/ 6

1

Fast Spectrogram Inversion using Multi-head


Convolutional Neural Networks
Sercan Ö. Arık∗ , Heewoo Jun∗ , Gregory Diamos

Abstract—We propose the multi-head convolutional neural these generic spectrogram inversion techniques is their fixed
network (MCNN) for waveform synthesis from spectrograms. objectives, rendering them inflexible to adapt for a particular
Nonlinear interpolation in MCNN is employed with transposed domain like human speech.
convolution layers in parallel heads. MCNN enables signifi-
One common use case of spectrograms is the audio domain,
arXiv:1808.06719v2 [cs.SD] 6 Nov 2018

cantly better utilization of modern multi-core processors than


commonly-used iterative algorithms like Griffin-Lim, and yields which is also the focus of this paper. Autoregressive modeling
very fast (more than 300x real-time) runtime. For training of of waveforms, in particular for audio, is a common approach.
MCNN, we use a large-scale speech recognition dataset and State-of-the-art results in generative speech modeling use neu-
losses defined on waveforms that are related to perceptual audio ral networks [6][7] that employ autoregression at the sample
quality. We demonstrate that MCNN constitutes a very promising
approach for high-quality speech synthesis, without any iterative rate. Yet, these models bring challenges for deployment, as
algorithms or autoregression in computations. they need to run inference ∼16k-24k times every second. One
approach is to approximate autoregression with an inference-
Index Terms—Phase reconstruction, deep learning, convolu-
tional neural networks, short-time Fourier transform, spectro- efficient model which can be trained by learning an inverse-
gram, time-frequency signal processing, speech synthesis. autoregressive flow using distillation [6]. Recently, autoregres-
I. I NTRODUCTION sive neural networks have also been adapted for spectrogram
A spectrogram contains intensity information of time- inversion. [8] uses the WaveNet architecture [9], which is com-
varying spectrum of a waveform. Waveform to spectrogram posed of stacked dilated convolution layers with spectrogram
conversion is fundamentally lossy, because the magnitude frames as external conditioner. But autoregression at sample
calculation removes the phase from the short-time Fourier rate is employed, resulting in slow synthesis. A fundamental
transform (STFT). Spectrogram inversion has been studied question is whether high quality synthesis necessitates explicit
widely in literature. Yet, there is no known algorithm that autoregressive modeling. Some generative models, e.g. [10],
guarantees a globally optimal solution at a low computational [11], synthesize audio by applying autoregression at the rate
complexity. A fundamental challenge is the non-convexity of of spectrogram timeframes (100s of samples), and still does
intensity constraints with an unknown phase. not yield a noticeable decrease in audio quality.
The most popular technique for spectrogram inversion is We propose the multi-head convolutional neural network
the Griffin-Lim (GL) algorithm [1]. GL is based on itera- (MCNN) that employs non-autoregressive modeling for the
tively estimating the unknown phases by repeatedly converting perennial spectrogram inversion problem. Our study is mainly
between frequency and time domain using the STFT and motivated by two trends. Firstly, modern multi-core proces-
its inverse, substituting the magnitude of each frequency sors, such as GPUs or TPUs [12], achieve their peak perfor-
component to the predicted magnitude at each step. Although mance for algorithms with high compute intensity [13]. Com-
the simplicity of GL is appealing, it can be slow due to the pute intensity (also known as operational intensity) is defined
sequentiality of operations. In [2], a fast variant is studied as the average number of operations per data access. Secondly,
by modifying its update step with a term that depends on many recent generative audio models, such as text-to-speech
the magnitude of the previous update step. In [3], the single- [10][11], audio style transfer [14], or speech enhancement
pass spectrogram inversion (SPSI) algorithm is introduced, [15], output spectrograms (that are typically converted to
which can synthesize waveforms in a single fully deterministic waveforms using GL), and can potentially benefit from direct
pass and can be further improved with extra GL iterations. waveform synthesis by integrating trainable models into their
SPSI estimates the instantaneous frequency of each frame by end-to-end frameworks. MCNN achieves very high audio qual-
peak-picking and quadratic interpolation. In [4], another non- ity (quantified by human raters and conventional metrics like
iterative spectrogram inversion technique is proposed, based spectral convergence (SC) and speaker classification accuracy),
on the partial derivatives with respect to a Gaussian window, while achieving more than 300x real-time synthesis, and has
which allows analytical derivations. In [5], a convex relaxation the potential to be integrated with end-to-end training in audio
is applied to express spectrogram inversion as a semidefinite processing.
program with a convergence guarantee, at the expense of the
II. M ULTI - HEAD C ONVOLUTIONAL N EURAL N ETWORK
increased dimensionality. Overall, one common drawback for
We assume the STFT-magnitude input for the waveform
Baidu Silicon Valley Artificial Intelligence Lab 1195 Bordeaux Dr. Sunny- s, |STFT(s)|, has a dimension of Tspec × Fspec and the
vale, CA 94089.
∗ Equal contribution corresponding waveform has a dimension of Twave , where
Manuscript received August, 2018. Tspec is the number of spectrogram timeframes, Fspec is the
2

number of frequency channels, and Twave is the number of number of frequency channels in the processed representation,
waveform samples. The ratio Twave /Tspec is determined by and should be gradually reduced to 1 to produce the time-
the spectrogram parameters, the hop length and the window domain waveform. As the convolutional filters are shared
length. We assume these parameters are known a priori. in channel dimension for different timesteps, MCNN can
input a spectrogram with an arbitrary duration. A trainable
Spectrogram
(Tspec x Fspec) scalar is multiplied to the output of each head to match the
overall scaling of inverse STFT operation and to determine
Head 1
... Head i ... Head n the relative weights of different heads. Lastly, all head outputs
4& × 4# × 45 × are summed and passed through a scaled softsign nonlinearity,
+ f (x) = ax/(1 + |bx|), where a and b are trainable scalars, to
Scaled softsign: ) * = ,*/(1 + |2*|) Waveform bound the output waveform.
(Twave x 1)

III. AUDIO L OSSES

Head i Loss functions that are correlated with the perceptual quality
Transposed convolution layer 1
(width: w1, stride: s1, # filters: c1) should be used to train generative models. We consider a linear
ELU combination of the below loss terms between the estimated
Hidden
waveform ŝ and the ground truth waveform s, presented in
(s1Tspec x c1) the order of observed empirical significance:
Transposed convolution layer 2 (i) Spectral convergence (SC):
(width: w2, stride: s2, # filters: c2)

ELU k|STFT(s)| − |STFT(ŝ)|kF /k|STFT(s)|kF , (1)


Hidden
...

(∏$%&
#'& "# Tspec x cL-1)
where k·kF is the Frobenius norm over time and frequency. SC
loss emphasizes highly on large spectral components, which
Transposed convolution layer L
(width: wL, stride: sL, # filters: 1) helps especially in early phases of training.
ELU (ii) Log-scale STFT-magnitude loss:

Fig. 1. Proposed MCNN architecture for spectrogram inversion. k log(|STFT(s)| + ) − log(|STFT(ŝ)| + )k1 , (2)
To synthesize a waveform from the spectrogram, a function where k · k1 is the L1 norm and  is a small number. The goal
parameterized by a neural network needs to perform nonlinear with log-scale STFT-magnitude loss is to accurately fit small-
upsampling in time domain, while utilizing the spectral infor- amplitude components (as opposed to the SC), which tends to
mation in different channels. Typically, the window length is be more important towards the later phases of training.
much longer than the hop length, and it is important to utilize (iii) Instantaneous frequency loss:
this extra information in neighboring time frames. For fast

inference, we need a neural network architecture that achieves φ(STFT(s)) − ∂ φ(STFT(ŝ)) ,

∂t (3)
a high compute intensity by repeatedly applied computations ∂t
1
with the same kernel.
where φ(·) is the phase argument function. The time derivative
Based on these motivations, we propose the multi-head ∂ ∂f f (t+∆t)−f (t)
convolutional neural network (MCNN) architecture. MCNN ∂t is estimated with finite difference ∂t = ∆t .
has multiple heads that use the same types of layers but Spectral phase is highly unstructured along either time or
with different weights and initialization, and they learn to frequency domain, so fitting raw phase values is very chal-
cooperate as a form of ensemble learning. By using multiple lenging and does not improve training. Instead, instantaneous
heads, we allow each model to allocate different upsampling frequency is a smooth phase-dependent metric, which can be
kernels to different components of the waveform which is more accurately fit.
analyzed further in Appendix B. Each head is composed of L (iv) Weighted phase loss:
transposed convolution layers (please see [16] for more details k|STFT(s)| |STFT(ŝ)| − <{STFT(s)} <{STFT(ŝ)}
about transposed convolutional layers), as shown in Fig. 1.
Each transposed convolution layer consists of a 1-D temporal − ={STFT(s)} ={STFT(ŝ)}k1 , (4)
convolution operation, followed by an exponential linear unit where is element-wise product, < is the real part and =
[17]1 . For the lth layer, wl is the filter width, sl is the stride, is the imaginary part. When a circular normal distribution is
and cl is the number of output filters (channels). Striding in assumed for the phase, the log-likelihood function is propor-
convolutions determines the amount QL of temporal upsampling, tional to L(s, ŝ) = cos(φ(STFT(s)) − φ(STFT(ŝ))) [18]. We
and should be chosen to satisfy l=1 sl ·Tspec = Twave . Filter can correspondingly define a loss as W (s, ŝ) = 1 − L(s, ŝ),
widths control the amount of local neighborhood information which is minimized (W (s, ŝ) = 0) when φ(STFT(s)) =
used while upsampling. The number of filters determine the φ(STFT(ŝ)). To focus on the high-amplitude components
1 It was empirically found to produce superior audio quality than other more and for better numerical stability, we further modify
nonlinearities we tried, such as ReLU and softsign. W (s, ŝ) by scaling it with |STFT(s)| |STFT(ŝ)|, which yields
Eq. 4 after L1 norm.
3

IV. E XPERIMENTAL R ESULTS


A. Experimental setup
We use the LibriSpeech dataset [19], after a preprocessing
pipeline, composed of segmentation and denoising, similar
to [10]. LibriSpeech contains 960 hours of public-domain
audiobooks from 2484 speakers sampled at 16 KHz. It is
originally constructed for automatic speech recognition and
the audio quality is thus lower compared to speech synthesis
datasets.
As the spectrogram parameters, a hop length of 256 (16 ms
duration), a Hanning window with a length of 1024 (64 ms
duration), and an FFT size of 2048 are assumed. MCNN has 8
transposed convolution layers, with (si , wi , ci ) = (2, 13, 28−i )
Fig. 3. Log-STFT of synthesized sample for MCNN trained with only SC
for 1 ≤ i ≤ 8, i.e. halving in the number of channels is loss (top) and all losses (bottom).
balanced with temporal upsampling by a factor of two. The
coefficients of the loss functions in Sec. III are chosen as 1,
6, 10 and 1 respectively, optimized for the audio quality by According to the subjective human ratings (MOS), MCNN
employing a random grid search. The model is trained using outperforms GL, even with a high number of iterations and
the Adam optimizer [20]. The initial learning rate of 0.0005 is SPSI initialization. When trained only on spectral convergence
annealed at a rate of 0.94 every 5000 iterations. The model is (SC), MCNN is on par with GL. Indeed, merely having SC
trained for ∼600k iterations with a batch size of 16 distributed loss as the training objective yields even slightly better SC
across 4 GPUs with synchronous updates. We compare our for test samples. Yet, with only SC loss, lower audio quality
results to conventional implementations of GL [1] and SPSI is observed for some samples due to generated background
[3] with and without extra GL iterations. noise and less clear high frequency harmonics, as exemplified
B. Synthesized audio waveform quality in Fig. 3. To further improve the audio quality, flexibility of
MCNN for integration of other losses is beneficial, as seen
A synthesized audio waveform is exemplified in Fig. 2. We from Table I. Ablation studies also show sufficiently large filter
observe that complicated patterns can be fit, and there is a width and sufficiently high number of heads are important.
small phase error between relevant high-amplitude spectral Transposed convolutions tend to produce checkerboard-like
components (the amount of shift between the peaks is low). patterns [22], and a single-head may not be able to generate
all frequencies efficiently. In an ensemble, however, different
heads cooperate to cancel out artifacts and cover different
frequency bands, as further elaborated in Appendix B. Lastly,
high speaker classification accuracy shows that MCNN can
efficiently preserve the characteristics of speakers (e.g. pitch,
accent, etc.) without any conditioning, showing potential for
direct integration into training for applications like voice
cloning.

C. Generalization and optimization to a particular speaker


The audio quality is maintained even when the MCNN
trained on LibriSpeech is used for an unseen speaker (from
a high-quality text-to-speech dataset [23]), as shown in Table
II. To evaluate how much the quality can be improved, we
also train a separate MCNN model using only that particular
speaker’s audio data, with reoptimized hyperparameters.4 The
Fig. 2. Comparison of the waveform (entire utterance and a zoomed portion) single-speaker MCNN model yields a very small quality gap
and its spectrogram, for the ground truth (left) and MCNN-generated (right). with the ground truth.

We evaluate the quality of synthesis on the held-out Lib- D. Representation learning of the frequency basis
riSpeech samples (Table I) using mean opinion score (MOS)2 , MCNN is trained only with human speech, which is com-
SC, and classification accuracy (we use the speaker classifier posed of time-varying signals at many frequencies. Inter-
model from [21]) to measure the distinguishability of 2484 estingly, MCNN learns the Fourier basis representation in
speakers.3
4 Filter width is increased to 19 to improve the resolution for modeling
2 Human ratings are collected via Amazon Mechanical Turk framework of more clear high frequency components. Lower learning rate and more
independently for each evaluation, as in [21]. Multiple votes on the same aggressive annealing are applied due to the small size of the dataset, which is
sample are aggregated by a majority voting rule. ∼20 hours in total. Loss coefficient of Eq. 2 is increased because the dataset
3 Audio samples can be found in https://mcnnaudiodemos.github.io/. is higher in quality and yields lower SC.
4

TABLE I
MOS WITH 95% CONFIDENCE INTERVAL , AVERAGE SPECTRAL CONVERGENCE AND SPEAKER CLASSIFICATION ACCURACY FOR L IBRI S PEECH TEST
SAMPLES .

Model MOS (out of 5) Spectral convergence (dB) Classification accuracy (%)


MCNN (filter width of 13, 8 heads, all losses) 3.50 ± 0.18 −12.9 76.8
MCNN (filter width of 9) 3.26 ± 0.18 −11.9 73.2
MCNN (2 heads) 2.78 ± 0.17 −10.7 71.4
MCNN (loss: Eq. (1)) 3.32 ± 0.16 −13.3 69.6
MCNN (loss: Eq. (1) & Eq. (2)) 3.35 ± 0.18 −12.6 73.2
GL (3 iterations) 2.55 ± 0.26 −5.9 76.8
GL (50 iterations) 3.28 ± 0.24 −10.1 78.6
GL (150 iterations) 3.41 ± 0.21 −13.6 82.1
SPSI 2.52 ± 0.28 −4.9 75.0
SPSI + GL (3 iterations) 3.18 ± 0.23 −8.7 78.6
SPSI + GL (50 iterations) 3.41 ± 0.19 −11.8 78.6
Ground truth 4.20 ± 0.16 −∞ 85.7

TABLE II benchmark runtime on a Nvidia Tesla P100 GPU.5 The


MOS WITH 95% CONFIDENCE INTERVAL FOR SINGLE - SPEAKER SAMPLES baseline MCNN model from Table 1 (the one in bold font)
( FROM AN INTERNAL DATASET [23]).
can generate ∼5.2M samples/sec, yielding ∼330 times faster-
Model MOS (out of 5) than-real-time waveform synthesis. Compared to MCNN, the
MCNN (trained on LibriSpeech) 3.55 ± 0.17 runtime of GL is ∼20 times slower for 50 iterations, and ∼60
MCNN (trained on single-speaker 3.91 ± 0.17 times slower for 150 iterations. The computational complexity
GL (150 iterations) 3.84 ± 0.16 of MCNN is ∼2.2 GFLOPs/sec, and indeed is slightly higher
SPSI + GL (50 iterations) 3.69 ± 0.17 than the complexity of 150 GL iterations. However, much
Ground truth 4.28 ± 0.14 shorter runtime is due to the properties of the neural network
architecture that render it very well suited for modern multi-
core processors like GPUs or TPUs. First and foremost,
MCNN requires much less DRAM bandwidth (in byte/s) - the
compute intensity of MCNN, 61 FLOPs/byte, is more than an
order of magnitude higher than that of GL, 1.9 FLOPs/byte.
In addition, MCNN has a shorter critical path of dependent
operations in its compute graph compared to GL, yielding
parallelization and utilization. Efficient inference with such
a highly-specialized model is enabled by learning from large-
scale training data, which is not possible for signal processing
algorithms like GL.

V. C ONCLUSIONS
We propose the MCNN architecture for the spectrogram
inversion problem. MCNN achieves very fast waveform syn-
Fig. 4. Synthesized waveforms by MCNN (trained on LibriSpeech), for thesis without noticeably sacrificing the perceptual quality.
spectrogram inputs corresponding to sinusoids at 500, 1000 and 2000 Hz,
and for a spectrogram input of superposed sinusoids at 1000 and 2000 Hz.
MCNN is trained on a large-scale speech dataset and can
generalize well to unseen speech or speakers. MCNN and
its variants will benefit even more from future hardware in
the spectral range of human speech, as shown in Fig. 4 ways that autoregressive neural network models and traditional
(representations get poorer for higher frequencies beyond iterative signal processing techniques like GL cannot take
human speech, due to the increased train-test mismatch). When advantage of. In addition, they will benefit from larger scale
the input spectrograms correspond to constant frequencies, audio datasets, which are expected to close the gap in quality
sinusoidal waveforms at those frequencies are synthesized. with ground truth. An important future direction is to integrate
When the input spectrograms correspond to a few frequency MCNN into end-to-end training of other generative audio
bands, the synthesized waveforms are superpositions of pure models, such as text-to-speech or audio style transfer systems.
sinusoids of constituent frequencies. For all cases, phase
coherence over a long time window is observed.

E. Deployment considerations 5 We consider the Tensorflow implementation of operations without specific


kernel optimizations, which can yield to further improvements specific to the
We evaluate the inference complexity and compute intensity hardware. For a fair comparison, we consider the GPU implementation of GL
(based on the assumptions presented in Appendix A) and using Tensorflow FFT/inverse FFT operations.
5

R EFERENCES A PPENDIX
[1] D. Griffin and J. Lim, “Signal estimation from modified short-time A. Complexity modeling
fourier transform,” IEEE Transactions on Acoustics, Speech, and Signal
Processing, vol. 32, no. 2, pp. 236–243, Apr 1984. Computational complexity of operations is represented by
[2] N. Perraudin, P. Balazs, and P. L. Sndergaard, “A fast griffin-lim algo- the total number of algorithmic FLOPs without consider-
rithm,” in 2013 IEEE Workshop on Applications of Signal Processing ing hardware-specific logic-level implementations. (Such a
to Audio and Acoustics, Oct 2013, pp. 1–4.
[3] G. T. Beauregard, M. Harish, and L. Wyse, “Single pass spectrogram complexity metric also has limitations of representing some
inversion,” in 2015 IEEE International Conference on Digital Signal major sources of power consumption, such as loading and
Processing (DSP), July 2015, pp. 427–431. storing data.) We count all point-wise operations (including
[4] Z. Prusa, P. Balazs, and P. L. Sondergaard, “A noniterative method for
reconstruction of phase from stft magnitude,” IEEE/ACM Transactions nonlinearities) as 1 FLOP, which is motivated with the trend
on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154– of implementing most mathematical operations as a single
1164, May 2017. instruction. We ignore the complexities of register memory-
[5] D. L. Sun and J. O. Smith, III, “Estimating a Signal from a Magnitude
Spectrogram via Convex Optimization,” arXiv: 1209.2076, 2012. move operations. We assume that a matrix-matrix multiply,
[6] A. van den Oord, Y. Li, I. Babuschkin, Simonyan et al., “Parallel between W , an m × n matrix and X, an n × p matrix takes
WaveNet: Fast High-Fidelity Speech Synthesis,” arXiv:1711.10433, Nov. 2mnp FLOPs. Similar expression is generalized for multi-
2017.
[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, dimensional tensors, that are used in convolutional layers.
Y. Zhang, Y. Wang, R. Skerry-Ryan et al., “Natural tts synthesis by For real-valued fast Fourier transform (FFT), we assume the
conditioning wavenet on mel spectrogram predictions,” arXiv preprint complexity of 2.5N log2 (N ) FLOPs for a vector of length
arXiv:1712.05884, 2017.
[8] S. Ö. Arik, G. F. Diamos, A. Gibiansky et al., “Deep voice 2: Multi- N [24]. For most operations used in this paper, Tensorflow
speaker neural text-to-speech,” arXiv: 1705.08947, 2017. profiling tool [25] includes FLOP counts, which we directly
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, adapted.
A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
A gnerative model for raw audio,” arXiv:1609.03499, 2016.
[10] W. Ping, K. Peng, A. Gibiansky et al., “Deep voice 3: 2000-speaker
neural text-to-speech,” arXiv: 1710.07654, 2017.
B. Analysis of contributions of multiple heads
[11] Y. Wang, R. J. Skerry-Ryan, D. Stanton et al., “Tacotron: A fully end- Fig. 5 shows the outputs of individual heads along with the
to-end text-to-speech synthesis model,” arXiv: 1703.10135, 2017.
[12] N. P. Jouppi, C. Young, N. Patil et al., “In-datacenter performance
overall waveform. We observe that multiple heads focus on
analysis of a tensor processing unit,” SIGARCH Comput. Archit. News, different portions of the waveform in time, and also on differ-
vol. 45, no. 2, pp. 1–12, Jun. 2017. ent frequency bands. For example, head 2 mostly focuses on
[13] H. Jia, Y. Zhang, G. Long, J. Xu, S. Yan, and Y. Li, “Gpuroofline:
A model for guiding performance optimizations on gpus,” in Euro-Par
low-frequency components. While training, individual heads
2012 Parallel Processing, C. Kaklamanis, T. Papatheodorou, and P. G. are not constrained for such a behavior. In fact, different heads
Spirakis, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012. share the same architecture, but initial random weights of the
[14] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, “Audio style
transfer,” arXiv: 1710.11385, 2017.
heads determine which portions of the waveform they will
[15] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement focus on in the later phases of training. The structure of the
with generative adversarial networks for robust speech recognition,” network promotes cooperation with the end-to-end objective.
arXiv: 1711.05747, 2017.
[16] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for deep
Hence, initialization with the same weights would nullify the
learning,” arXiv: 1603.07285, 2016. benefit of the multi-head architecture. Although intelligibility
[17] D. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep of individual waveform outputs is very low (we also note
network learning by exponential linear units,” arXiv: 1511.07289, 2015.
[18] J. Engel, C. Resnick, A. Roberts et al., “Neural audio synthesis of
that a nonlinear combination of these waveforms can also
musical notes with wavenet autoencoders,” arXiv: 1704.01279, 2017. generate new frequencies that do not exist in these individual
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: an outputs.), their combination can yield highly natural-sounding
ASR corpus based on public domain audio books,” in Acoustics, Speech
and Signal Processing (ICASSP), 2015 IEEE International Conference
waveforms.
on. IEEE, 2015, pp. 5206–5210.
[20] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,”
arXiv: 1412.6980, vol. abs/1412.6980, 2014.
[21] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural Voice
Cloning with a Few Samples,” arXiv: 1802.06006, 2018.
[22] A. Odena, V. Dumoulin, and C. Olah, “Deconvolution and checkerboard
artifacts,” Distill, 2016. [Online]. Available: http://distill.pub/2016/
deconv-checkerboard
[23] S. Ö. Arik, M. Chrzanowski, A. Coates et al., “Deep voice: Real-time
neural text-to-speech,” arXiv: 1702.07825, 2017.
[24] “Fft benchmark methodology,” http://www.fftw.org/speed/method.html,
accessed: 2018-07-30.
[25] “Tensorflow profiler and advisor,” https://github.com/tensorflow/
tensorflow/blob/master/tensorflow/core/profiler/README.md, accessed:
2018-07-30.
6

Fig. 5. Top row: An example synthesized waveform and its log-STFT. Bottom 8 rows: Outputs of the waveforms of each of the constituent heads. For better
visualization, waveforms are normalized in each head and small-amplitude components in STFTs are discarded after applying a threshold.

You might also like