Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks
Abstract—We propose the multi-head convolutional neural network (MCNN) for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads. MCNN enables significantly [...]

[...] these generic spectrogram inversion techniques is their fixed objectives, rendering them inflexible to adapt for a particular domain like human speech.

One common use case of spectrograms is the audio domain, [...]
[...] number of frequency channels, and Twave is the number of waveform samples. The ratio Twave/Tspec is determined by the spectrogram parameters, the hop length and the window length. We assume these parameters are known a priori.

To synthesize a waveform from the spectrogram, a function parameterized by a neural network needs to perform nonlinear upsampling in time domain, while utilizing the spectral information in different channels. Typically, the window length is much longer than the hop length, and it is important to utilize this extra information in neighboring time frames. For fast inference, we need a neural network architecture that achieves a high compute intensity by repeatedly applied computations with the same kernel.

Based on these motivations, we propose the multi-head convolutional neural network (MCNN) architecture. MCNN has multiple heads that use the same types of layers but with different weights and initialization, and they learn to cooperate as a form of ensemble learning. By using multiple heads, we allow each model to allocate different upsampling kernels to different components of the waveform, which is analyzed further in Appendix B. Each head is composed of L transposed convolution layers (please see [16] for more details about transposed convolutional layers), as shown in Fig. 1. Each transposed convolution layer consists of a 1-D temporal convolution operation, followed by an exponential linear unit [17].¹ For the lth layer, wl is the filter width, sl is the stride, and cl is the number of output filters (channels). Striding in convolutions determines the amount of temporal upsampling, and should be chosen to satisfy ∏_{l=1}^{L} sl · Tspec = Twave. Filter widths control the amount of local neighborhood information used while upsampling. The number of filters determines the number of frequency channels in the processed representation, and should be gradually reduced to 1 to produce the time-domain waveform. As the convolutional filters are shared in the channel dimension for different timesteps, MCNN can input a spectrogram with an arbitrary duration. A trainable scalar is multiplied to the output of each head to match the overall scaling of the inverse STFT operation and to determine the relative weights of different heads. Lastly, all head outputs are summed and passed through a scaled softsign nonlinearity, f(x) = ax/(1 + |bx|), where a and b are trainable scalars, to bound the output waveform.

¹ It was empirically found to produce superior audio quality than other nonlinearities we tried, such as ReLU and softsign.

Fig. 1. Proposed MCNN architecture for spectrogram inversion. (Diagram summary: the input spectrogram of size Tspec × Fspec is processed by n parallel heads; head i applies transposed convolution layers l = 1, ..., L with width wl, stride sl, and cl filters, each followed by an ELU, so the hidden representation after layer l has size (∏_{k=1}^{l} sk)·Tspec × cl; the scaled head outputs are summed and passed through the scaled softsign to produce the waveform of size Twave × 1.)
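To make the head structure concrete, the following is a minimal sketch of one head and the multi-head combination in TensorFlow/Keras. The number of layers, filter widths, strides, and channel counts are illustrative placeholders rather than the paper's settings; the product of the strides is set to an assumed hop length of 256 so that the upsampling constraint above is satisfied.

import tensorflow as tf

def make_head(widths=(13, 13, 13, 13), strides=(4, 4, 4, 4), channels=(256, 128, 64, 1)):
    # One head: L transposed 1-D convolutions, each followed by an ELU.
    # prod(strides) must equal the hop length so that prod(s_l) * T_spec == T_wave.
    layers = [tf.keras.layers.Conv1DTranspose(filters=c, kernel_size=w, strides=s,
                                              padding="same", activation="elu")
              for w, s, c in zip(widths, strides, channels)]
    return tf.keras.Sequential(layers)

class MCNN(tf.keras.Model):
    def __init__(self, n_heads=8):
        super().__init__()
        self.heads = [make_head() for _ in range(n_heads)]
        # Trainable per-head scalars set the relative weight of each head.
        self.head_scales = [tf.Variable(1.0, name=f"scale_{i}") for i in range(n_heads)]
        # Trainable scalars a, b of the scaled softsign f(x) = a*x / (1 + |b*x|).
        self.a = tf.Variable(1.0, name="a")
        self.b = tf.Variable(1.0, name="b")

    def call(self, spec):  # spec: (batch, T_spec, F_spec)
        x = tf.add_n([g * head(spec) for head, g in zip(self.heads, self.head_scales)])
        return self.a * x / (1.0 + tf.abs(self.b * x))  # bounded waveform, (batch, T_wave, 1)

Such a model would then be trained end to end with the loss terms described next.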
Loss functions that are correlated with the perceptual quality should be used to train generative models. We consider a linear combination of the below loss terms between the estimated waveform ŝ and the ground truth waveform s, presented in the order of observed empirical significance:

(i) Spectral convergence (SC):

    ‖ |STFT(s)| − |STFT(ŝ)| ‖_F / ‖ |STFT(s)| ‖_F        (1)

where ‖·‖_F is the Frobenius norm over time and frequency. The SC loss emphasizes large spectral components, which helps especially in early phases of training.

(ii) Log-scale STFT-magnitude loss:

    ‖ log(|STFT(s)| + ε) − log(|STFT(ŝ)| + ε) ‖_1        (2)

where ‖·‖_1 is the L1 norm and ε is a small number. The goal with the log-scale STFT-magnitude loss is to accurately fit small-amplitude components (as opposed to the SC), which tends to be more important towards the later phases of training.

(iii) Instantaneous frequency loss:

    ‖ ∂φ(STFT(s))/∂t − ∂φ(STFT(ŝ))/∂t ‖_1        (3)

where φ(·) is the phase argument function. The time derivative ∂/∂t is estimated with the finite difference ∂f/∂t = (f(t+Δt) − f(t))/Δt. Spectral phase is highly unstructured along either the time or the frequency domain, so fitting raw phase values is very challenging and does not improve training. Instead, instantaneous frequency is a smooth phase-dependent metric, which can be more accurately fit.

(iv) Weighted phase loss:

    ‖ |STFT(s)| ⊙ |STFT(ŝ)| − ℜ{STFT(s)} ⊙ ℜ{STFT(ŝ)} − ℑ{STFT(s)} ⊙ ℑ{STFT(ŝ)} ‖_1        (4)

where ⊙ is the element-wise product, ℜ is the real part and ℑ is the imaginary part. When a circular normal distribution is assumed for the phase, the log-likelihood function is proportional to L(s, ŝ) = cos(φ(STFT(s)) − φ(STFT(ŝ))) [18]. We can correspondingly define a loss as W(s, ŝ) = 1 − L(s, ŝ), which is minimized (W(s, ŝ) = 0) when φ(STFT(s)) = φ(STFT(ŝ)). To focus on the high-amplitude components more and for better numerical stability, we further modify W(s, ŝ) by scaling it with |STFT(s)| ⊙ |STFT(ŝ)|, which yields Eq. 4 after the L1 norm.
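A minimal sketch of these four loss terms follows, assuming TensorFlow and a hypothetical STFT configuration (frame length 1024, hop 256); the relative weights and the value of ε are placeholders, not the paper's coefficients.

import tensorflow as tf

def stft(x, frame_length=1024, frame_step=256):
    # Complex STFT of a batch of waveforms with shape (batch, T_wave).
    return tf.signal.stft(x, frame_length=frame_length, frame_step=frame_step)

def spectral_convergence(s, s_hat):
    S, S_hat = tf.abs(stft(s)), tf.abs(stft(s_hat))
    return tf.norm(S - S_hat, ord="fro", axis=[-2, -1]) / tf.norm(S, ord="fro", axis=[-2, -1])

def log_stft_magnitude(s, s_hat, eps=1e-5):
    S, S_hat = tf.abs(stft(s)), tf.abs(stft(s_hat))
    return tf.reduce_sum(tf.abs(tf.math.log(S + eps) - tf.math.log(S_hat + eps)), axis=[-2, -1])

def instantaneous_frequency(s, s_hat):
    # Finite difference of the STFT phase along the frame (time) axis.
    phi, phi_hat = tf.math.angle(stft(s)), tf.math.angle(stft(s_hat))
    d_phi = phi[:, 1:, :] - phi[:, :-1, :]
    d_phi_hat = phi_hat[:, 1:, :] - phi_hat[:, :-1, :]
    return tf.reduce_sum(tf.abs(d_phi - d_phi_hat), axis=[-2, -1])

def weighted_phase(s, s_hat):
    S, S_hat = stft(s), stft(s_hat)
    return tf.reduce_sum(tf.abs(
        tf.abs(S) * tf.abs(S_hat)
        - tf.math.real(S) * tf.math.real(S_hat)
        - tf.math.imag(S) * tf.math.imag(S_hat)), axis=[-2, -1])

def total_loss(s, s_hat, w=(1.0, 1.0, 1.0, 1.0)):
    # Linear combination of the four terms, averaged over the batch.
    return tf.reduce_mean(w[0] * spectral_convergence(s, s_hat)
                          + w[1] * log_stft_magnitude(s, s_hat)
                          + w[2] * instantaneous_frequency(s, s_hat)
                          + w[3] * weighted_phase(s, s_hat))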
We evaluate the quality of synthesis on the held-out LibriSpeech samples (Table I) using mean opinion score (MOS)², SC, and classification accuracy (we use the speaker classifier model from [21]) to measure the distinguishability of 2484 speakers.³

² Human ratings are collected via the Amazon Mechanical Turk framework independently for each evaluation, as in [21]. Multiple votes on the same sample are aggregated by a majority voting rule.
³ Audio samples can be found at https://mcnnaudiodemos.github.io/.
⁴ Filter width is increased to 19 to improve the resolution for modeling of more clear high frequency components. Lower learning rate and more aggressive annealing are applied due to the small size of the dataset, which is ∼20 hours in total. The loss coefficient of Eq. 2 is increased because the dataset is higher in quality and yields lower SC.
TABLE I
MOS WITH 95% CONFIDENCE INTERVAL, AVERAGE SPECTRAL CONVERGENCE AND SPEAKER CLASSIFICATION ACCURACY FOR LIBRISPEECH TEST SAMPLES.
D. Representation learning of the frequency basis

MCNN is trained only with human speech, which is composed of time-varying signals at many frequencies. Interestingly, MCNN learns the Fourier basis representation in the spectral range of human speech, as shown in Fig. 4 (representations get poorer for higher frequencies beyond human speech, due to the increased train-test mismatch). When the input spectrograms correspond to constant frequencies, sinusoidal waveforms at those frequencies are synthesized. When the input spectrograms correspond to a few frequency bands, the synthesized waveforms are superpositions of pure sinusoids of constituent frequencies. For all cases, phase coherence over a long time window is observed.

Fig. 4. Synthesized waveforms by MCNN (trained on LibriSpeech), for spectrogram inputs corresponding to sinusoids at 500, 1000 and 2000 Hz, and for a spectrogram input of superposed sinusoids at 1000 and 2000 Hz.
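As a sketch of how such probe inputs can be constructed, the snippet below builds the magnitude spectrogram of a synthetic sinusoid; the sample rate, STFT parameters, and the trained model handle mcnn (from the earlier sketch) are illustrative assumptions.

import numpy as np
import tensorflow as tf

def sinusoid_spectrogram(freq_hz, duration_s=1.0, sample_rate=16000,
                         frame_length=1024, frame_step=256):
    # Magnitude spectrogram of a pure sinusoid, used to probe the trained model.
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    wave = np.sin(2 * np.pi * freq_hz * t).astype(np.float32)
    spec = tf.abs(tf.signal.stft(wave, frame_length=frame_length, frame_step=frame_step))
    return spec[tf.newaxis]  # add a batch dimension: (1, T_spec, F_spec)

# e.g., probe the network at 1000 Hz, assuming a trained model instance `mcnn`:
# waveform = mcnn(sinusoid_spectrogram(1000.0))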
V. CONCLUSIONS

We propose the MCNN architecture for the spectrogram inversion problem. MCNN achieves very fast waveform synthesis without noticeably sacrificing the perceptual quality. MCNN is trained on a large-scale speech dataset and can generalize well to unseen speech or speakers. MCNN and its variants will benefit even more from future hardware in ways that autoregressive neural network models and traditional iterative signal processing techniques like GL cannot take advantage of. In addition, they will benefit from larger-scale audio datasets, which are expected to close the gap in quality with ground truth. An important future direction is to integrate MCNN into end-to-end training of other generative audio models, such as text-to-speech or audio style transfer systems.
REFERENCES

[1] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, Apr. 1984.
[2] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2013, pp. 1–4.
[3] G. T. Beauregard, M. Harish, and L. Wyse, "Single pass spectrogram inversion," in 2015 IEEE International Conference on Digital Signal Processing (DSP), July 2015, pp. 427–431.
[4] Z. Prusa, P. Balazs, and P. L. Søndergaard, "A noniterative method for reconstruction of phase from STFT magnitude," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154–1164, May 2017.
[5] D. L. Sun and J. O. Smith, III, "Estimating a signal from a magnitude spectrogram via convex optimization," arXiv:1209.2076, 2012.
[6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv:1711.10433, 2017.
[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv:1712.05884, 2017.
[8] S. Ö. Arik, G. F. Diamos, A. Gibiansky et al., "Deep Voice 2: Multi-speaker neural text-to-speech," arXiv:1705.08947, 2017.
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016.
[10] W. Ping, K. Peng, A. Gibiansky et al., "Deep Voice 3: 2000-speaker neural text-to-speech," arXiv:1710.07654, 2017.
[11] Y. Wang, R. J. Skerry-Ryan, D. Stanton et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv:1703.10135, 2017.
[12] N. P. Jouppi, C. Young, N. Patil et al., "In-datacenter performance analysis of a tensor processing unit," SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 1–12, Jun. 2017.
[13] H. Jia, Y. Zhang, G. Long, J. Xu, S. Yan, and Y. Li, "GPURoofline: A model for guiding performance optimizations on GPUs," in Euro-Par 2012 Parallel Processing, C. Kaklamanis, T. Papatheodorou, and P. G. Spirakis, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
[14] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," arXiv:1710.11385, 2017.
[15] C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," arXiv:1711.05747, 2017.
[16] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," arXiv:1603.07285, 2016.
[17] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units," arXiv:1511.07289, 2015.
[18] J. Engel, C. Resnick, A. Roberts et al., "Neural audio synthesis of musical notes with WaveNet autoencoders," arXiv:1704.01279, 2017.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[21] S. Ö. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," arXiv:1802.06006, 2018.
[22] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[23] S. Ö. Arik, M. Chrzanowski, A. Coates et al., "Deep Voice: Real-time neural text-to-speech," arXiv:1702.07825, 2017.
[24] "FFT benchmark methodology," http://www.fftw.org/speed/method.html, accessed: 2018-07-30.
[25] "TensorFlow profiler and advisor," https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md, accessed: 2018-07-30.

APPENDIX

A. Complexity modeling

Computational complexity of operations is represented by the total number of algorithmic FLOPs, without considering hardware-specific logic-level implementations. (Such a complexity metric also has limitations in representing some major sources of power consumption, such as loading and storing data.) We count all point-wise operations (including nonlinearities) as 1 FLOP, which is motivated by the trend of implementing most mathematical operations as a single instruction. We ignore the complexities of register memory-move operations. We assume that a matrix-matrix multiply between W, an m × n matrix, and X, an n × p matrix, takes 2mnp FLOPs. A similar expression is generalized to the multi-dimensional tensors that are used in convolutional layers. For the real-valued fast Fourier transform (FFT), we assume a complexity of 2.5 N log2(N) FLOPs for a vector of length N [24]. For most operations used in this paper, the TensorFlow profiling tool [25] includes FLOP counts, which we directly adapted.
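Under these conventions, per-layer FLOP estimates can be written down directly. The sketch below is one possible accounting (not the paper's profiling code): a transposed 1-D convolution is treated as a matrix multiply of the (t_in x c_in) input with a (c_in x width*c_out) kernel at 2mnp FLOPs, plus one point-wise FLOP per output element for the nonlinearity; the layer sizes are hypothetical.

import math

def conv1d_transpose_flops(t_in, c_in, c_out, width, stride):
    # 2*m*n*p matrix-multiply convention applied to a transposed 1-D convolution,
    # plus one point-wise operation per output element for the ELU.
    return 2 * t_in * c_in * width * c_out + t_in * stride * c_out

def real_fft_flops(n):
    # Assumed real-valued FFT cost of 2.5 * N * log2(N) FLOPs [24].
    return 2.5 * n * math.log2(n)

# Example with hypothetical sizes: 400 spectrogram frames, 80 input channels,
# and one head with four layers of stride 4 and width 13 (as in the earlier sketch).
t, c = 400, 80
total = 0
for c_out, s, w in [(256, 4, 13), (128, 4, 13), (64, 4, 13), (1, 4, 13)]:
    total += conv1d_transpose_flops(t, c, c_out, w, s)
    t, c = t * s, c_out
print(f"~{total / 1e9:.2f} GFLOPs per head for this configuration")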
B. Analysis of contributions of multiple heads

Fig. 5 shows the outputs of individual heads along with the overall waveform. We observe that multiple heads focus on different portions of the waveform in time, and also on different frequency bands. For example, head 2 mostly focuses on low-frequency components. While training, individual heads are not constrained to such a behavior. In fact, different heads share the same architecture, but the initial random weights of the heads determine which portions of the waveform they will focus on in the later phases of training. The structure of the network promotes cooperation with the end-to-end objective. Hence, initialization with the same weights would nullify the benefit of the multi-head architecture. Although the intelligibility of individual waveform outputs is very low (we also note that a nonlinear combination of these waveforms can also generate new frequencies that do not exist in these individual outputs), their combination can yield highly natural-sounding waveforms.
Fig. 5. Top row: An example synthesized waveform and its log-STFT. Bottom 8 rows: Waveform outputs of each of the constituent heads. For better visualization, waveforms are normalized in each head and small-amplitude components in the STFTs are discarded after applying a threshold.