Fast Spectrogram Inversion Using Multi-Head Convolutional Neural Networks
Abstract—We propose the multi-head convolutional neural network (MCNN) for waveform synthesis from spectrograms. Nonlinear interpolation in MCNN is employed with transposed convolution layers in parallel heads. MCNN enables significantly [...]

[...] these generic spectrogram inversion techniques is their fixed objectives, rendering them inflexible to adapt for a particular domain like human speech.

One common use case of spectrograms is the audio domain, [...]
[...] number of frequency channels, and Twave is the number of waveform samples. The ratio Twave/Tspec is determined by the spectrogram parameters, the hop length and the window length. We assume these parameters are known a priori.

To synthesize a waveform from the spectrogram, a function parameterized by a neural network needs to perform nonlinear upsampling in time domain, while utilizing the spectral information in different channels. Typically, the window length is much longer than the hop length, and it is important to utilize this extra information in neighboring time frames. For fast inference, we need a neural network architecture that achieves a high compute intensity by repeatedly applied computations with the same kernel.

Based on these motivations, we propose the multi-head convolutional neural network (MCNN) architecture. MCNN has multiple heads that use the same types of layers but with different weights and initialization, and they learn to cooperate as a form of ensemble learning. By using multiple heads, we allow each model to allocate different upsampling kernels to different components of the waveform, which is analyzed further in Appendix B. Each head is composed of L transposed convolution layers (please see [16] for more details about transposed convolutional layers), as shown in Fig. 1. Each transposed convolution layer consists of a 1-D temporal convolution operation, followed by an exponential linear unit [17].¹ For the lth layer, wl is the filter width, sl is the stride, and cl is the number of output filters (channels). Striding in convolutions determines the amount of temporal upsampling, and should be chosen to satisfy ∏_{l=1}^{L} sl · Tspec = Twave. Filter widths control the amount of local neighborhood information used while upsampling. The number of filters determines the number of frequency channels in the processed representation, and should be gradually reduced to 1 to produce the time-domain waveform. As the convolutional filters are shared in the channel dimension for different timesteps, MCNN can input a spectrogram with an arbitrary duration. A trainable scalar is multiplied to the output of each head to match the overall scaling of the inverse STFT operation and to determine the relative weights of different heads. Lastly, all head outputs are summed and passed through a scaled softsign nonlinearity, f(x) = ax/(1 + |bx|), where a and b are trainable scalars, to bound the output waveform.

¹ It was empirically found to produce superior audio quality than other nonlinearities we tried, such as ReLU and softsign.

Fig. 1. Proposed MCNN architecture for spectrogram inversion. (Diagram summary: the input spectrogram of size Tspec × Fspec is processed by n parallel heads; head i applies transposed convolution layers l = 1, ..., L with width wl, stride sl, and cl filters, each followed by an ELU, so the hidden representation after layer l has size (∏_{k=1}^{l} sk)·Tspec × cl; the scaled head outputs are summed and passed through the scaled softsign to produce the waveform of size Twave × 1.)
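To make the head structure concrete, the following is a minimal sketch of one head and the multi-head combination in TensorFlow/Keras. The number of layers, filter widths, strides, and channel counts are illustrative placeholders rather than the paper's settings; the product of the strides is set to an assumed hop length of 256 so that the upsampling constraint above is satisfied.

import tensorflow as tf

def make_head(widths=(13, 13, 13, 13), strides=(4, 4, 4, 4), channels=(256, 128, 64, 1)):
    # One head: L transposed 1-D convolutions, each followed by an ELU.
    # prod(strides) must equal the hop length so that prod(s_l) * T_spec == T_wave.
    layers = [tf.keras.layers.Conv1DTranspose(filters=c, kernel_size=w, strides=s,
                                              padding="same", activation="elu")
              for w, s, c in zip(widths, strides, channels)]
    return tf.keras.Sequential(layers)

class MCNN(tf.keras.Model):
    def __init__(self, n_heads=8):
        super().__init__()
        self.heads = [make_head() for _ in range(n_heads)]
        # Trainable per-head scalars set the relative weight of each head.
        self.head_scales = [tf.Variable(1.0, name=f"scale_{i}") for i in range(n_heads)]
        # Trainable scalars a, b of the scaled softsign f(x) = a*x / (1 + |b*x|).
        self.a = tf.Variable(1.0, name="a")
        self.b = tf.Variable(1.0, name="b")

    def call(self, spec):  # spec: (batch, T_spec, F_spec)
        x = tf.add_n([g * head(spec) for head, g in zip(self.heads, self.head_scales)])
        return self.a * x / (1.0 + tf.abs(self.b * x))  # bounded waveform, (batch, T_wave, 1)

Such a model would then be trained end to end with the loss terms described next.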
Loss functions that are correlated with the perceptual quality should be used to train generative models. We consider a linear combination of the below loss terms between the estimated waveform ŝ and the ground truth waveform s, presented in the order of observed empirical significance:

(i) Spectral convergence (SC):

    ‖ |STFT(s)| − |STFT(ŝ)| ‖_F / ‖ |STFT(s)| ‖_F        (1)

where ‖·‖_F is the Frobenius norm over time and frequency. The SC loss emphasizes large spectral components, which helps especially in early phases of training.

(ii) Log-scale STFT-magnitude loss:

    ‖ log(|STFT(s)| + ε) − log(|STFT(ŝ)| + ε) ‖_1        (2)

where ‖·‖_1 is the L1 norm and ε is a small number. The goal with the log-scale STFT-magnitude loss is to accurately fit small-amplitude components (as opposed to the SC), which tends to be more important towards the later phases of training.

(iii) Instantaneous frequency loss:

    ‖ ∂φ(STFT(s))/∂t − ∂φ(STFT(ŝ))/∂t ‖_1        (3)

where φ(·) is the phase argument function. The time derivative ∂/∂t is estimated with the finite difference ∂f/∂t = (f(t+Δt) − f(t))/Δt. Spectral phase is highly unstructured along either the time or the frequency domain, so fitting raw phase values is very challenging and does not improve training. Instead, instantaneous frequency is a smooth phase-dependent metric, which can be more accurately fit.

(iv) Weighted phase loss:

    ‖ |STFT(s)| ⊙ |STFT(ŝ)| − ℜ{STFT(s)} ⊙ ℜ{STFT(ŝ)} − ℑ{STFT(s)} ⊙ ℑ{STFT(ŝ)} ‖_1        (4)

where ⊙ is the element-wise product, ℜ is the real part and ℑ is the imaginary part. When a circular normal distribution is assumed for the phase, the log-likelihood function is proportional to L(s, ŝ) = cos(φ(STFT(s)) − φ(STFT(ŝ))) [18]. We can correspondingly define a loss as W(s, ŝ) = 1 − L(s, ŝ), which is minimized (W(s, ŝ) = 0) when φ(STFT(s)) = φ(STFT(ŝ)). To focus on the high-amplitude components more and for better numerical stability, we further modify W(s, ŝ) by scaling it with |STFT(s)| ⊙ |STFT(ŝ)|, which yields Eq. 4 after the L1 norm.
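A minimal sketch of these four loss terms follows, assuming TensorFlow and a hypothetical STFT configuration (frame length 1024, hop 256); the relative weights and the value of ε are placeholders, not the paper's coefficients.

import tensorflow as tf

def stft(x, frame_length=1024, frame_step=256):
    # Complex STFT of a batch of waveforms with shape (batch, T_wave).
    return tf.signal.stft(x, frame_length=frame_length, frame_step=frame_step)

def spectral_convergence(s, s_hat):
    S, S_hat = tf.abs(stft(s)), tf.abs(stft(s_hat))
    return tf.norm(S - S_hat, ord="fro", axis=[-2, -1]) / tf.norm(S, ord="fro", axis=[-2, -1])

def log_stft_magnitude(s, s_hat, eps=1e-5):
    S, S_hat = tf.abs(stft(s)), tf.abs(stft(s_hat))
    return tf.reduce_sum(tf.abs(tf.math.log(S + eps) - tf.math.log(S_hat + eps)), axis=[-2, -1])

def instantaneous_frequency(s, s_hat):
    # Finite difference of the STFT phase along the frame (time) axis.
    phi, phi_hat = tf.math.angle(stft(s)), tf.math.angle(stft(s_hat))
    d_phi = phi[:, 1:, :] - phi[:, :-1, :]
    d_phi_hat = phi_hat[:, 1:, :] - phi_hat[:, :-1, :]
    return tf.reduce_sum(tf.abs(d_phi - d_phi_hat), axis=[-2, -1])

def weighted_phase(s, s_hat):
    S, S_hat = stft(s), stft(s_hat)
    return tf.reduce_sum(tf.abs(
        tf.abs(S) * tf.abs(S_hat)
        - tf.math.real(S) * tf.math.real(S_hat)
        - tf.math.imag(S) * tf.math.imag(S_hat)), axis=[-2, -1])

def total_loss(s, s_hat, w=(1.0, 1.0, 1.0, 1.0)):
    # Linear combination of the four terms, averaged over the batch.
    return tf.reduce_mean(w[0] * spectral_convergence(s, s_hat)
                          + w[1] * log_stft_magnitude(s, s_hat)
                          + w[2] * instantaneous_frequency(s, s_hat)
                          + w[3] * weighted_phase(s, s_hat))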
We evaluate the quality of synthesis on the held-out LibriSpeech samples (Table I) using mean opinion score (MOS)², SC, and classification accuracy (we use the speaker classifier model from [21]) to measure the distinguishability of 2484 speakers.³

² Human ratings are collected via the Amazon Mechanical Turk framework independently for each evaluation, as in [21]. Multiple votes on the same sample are aggregated by a majority voting rule.
³ Audio samples can be found at https://mcnnaudiodemos.github.io/.
⁴ Filter width is increased to 19 to improve the resolution for modeling of more clear high frequency components. Lower learning rate and more aggressive annealing are applied due to the small size of the dataset, which is ∼20 hours in total. The loss coefficient of Eq. 2 is increased because the dataset is higher in quality and yields lower SC.
TABLE I
MOS WITH 95% CONFIDENCE INTERVAL, AVERAGE SPECTRAL CONVERGENCE AND SPEAKER CLASSIFICATION ACCURACY FOR LIBRISPEECH TEST SAMPLES.
D. Representation learning of the frequency basis

MCNN is trained only with human speech, which is composed of time-varying signals at many frequencies. Interestingly, MCNN learns the Fourier basis representation in the spectral range of human speech, as shown in Fig. 4 (representations get poorer for higher frequencies beyond human speech, due to the increased train-test mismatch). When the input spectrograms correspond to constant frequencies, sinusoidal waveforms at those frequencies are synthesized. When the input spectrograms correspond to a few frequency bands, the synthesized waveforms are superpositions of pure sinusoids of constituent frequencies. For all cases, phase coherence over a long time window is observed.

Fig. 4. Synthesized waveforms by MCNN (trained on LibriSpeech), for spectrogram inputs corresponding to sinusoids at 500, 1000 and 2000 Hz, and for a spectrogram input of superposed sinusoids at 1000 and 2000 Hz.
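As a sketch of how such probe inputs can be constructed, the snippet below builds the magnitude spectrogram of a synthetic sinusoid; the sample rate, STFT parameters, and the trained model handle mcnn (from the earlier sketch) are illustrative assumptions.

import numpy as np
import tensorflow as tf

def sinusoid_spectrogram(freq_hz, duration_s=1.0, sample_rate=16000,
                         frame_length=1024, frame_step=256):
    # Magnitude spectrogram of a pure sinusoid, used to probe the trained model.
    t = np.arange(int(duration_s * sample_rate)) / sample_rate
    wave = np.sin(2 * np.pi * freq_hz * t).astype(np.float32)
    spec = tf.abs(tf.signal.stft(wave, frame_length=frame_length, frame_step=frame_step))
    return spec[tf.newaxis]  # add a batch dimension: (1, T_spec, F_spec)

# e.g., probe the network at 1000 Hz, assuming a trained model instance `mcnn`:
# waveform = mcnn(sinusoid_spectrogram(1000.0))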
V. CONCLUSIONS

We propose the MCNN architecture for the spectrogram inversion problem. MCNN achieves very fast waveform synthesis without noticeably sacrificing the perceptual quality. MCNN is trained on a large-scale speech dataset and can generalize well to unseen speech or speakers. MCNN and its variants will benefit even more from future hardware in ways that autoregressive neural network models and traditional iterative signal processing techniques like GL cannot take advantage of. In addition, they will benefit from larger-scale audio datasets, which are expected to close the gap in quality with ground truth. An important future direction is to integrate MCNN into end-to-end training of other generative audio models, such as text-to-speech or audio style transfer systems.
REFERENCES

[1] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 2, pp. 236–243, Apr. 1984.
[2] N. Perraudin, P. Balazs, and P. L. Søndergaard, "A fast Griffin-Lim algorithm," in 2013 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2013, pp. 1–4.
[3] G. T. Beauregard, M. Harish, and L. Wyse, "Single pass spectrogram inversion," in 2015 IEEE International Conference on Digital Signal Processing (DSP), July 2015, pp. 427–431.
[4] Z. Prusa, P. Balazs, and P. L. Søndergaard, "A noniterative method for reconstruction of phase from STFT magnitude," IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 5, pp. 1154–1164, May 2017.
[5] D. L. Sun and J. O. Smith, III, "Estimating a signal from a magnitude spectrogram via convex optimization," arXiv:1209.2076, 2012.
[6] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan et al., "Parallel WaveNet: Fast high-fidelity speech synthesis," arXiv:1711.10433, 2017.
[7] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry-Ryan et al., "Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions," arXiv:1712.05884, 2017.
[8] S. Ö. Arik, G. F. Diamos, A. Gibiansky et al., "Deep Voice 2: Multi-speaker neural text-to-speech," arXiv:1705.08947, 2017.
[9] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, "WaveNet: A generative model for raw audio," arXiv:1609.03499, 2016.
[10] W. Ping, K. Peng, A. Gibiansky et al., "Deep Voice 3: 2000-speaker neural text-to-speech," arXiv:1710.07654, 2017.
[11] Y. Wang, R. J. Skerry-Ryan, D. Stanton et al., "Tacotron: A fully end-to-end text-to-speech synthesis model," arXiv:1703.10135, 2017.
[12] N. P. Jouppi, C. Young, N. Patil et al., "In-datacenter performance analysis of a tensor processing unit," SIGARCH Comput. Archit. News, vol. 45, no. 2, pp. 1–12, Jun. 2017.
[13] H. Jia, Y. Zhang, G. Long, J. Xu, S. Yan, and Y. Li, "GPURoofline: A model for guiding performance optimizations on GPUs," in Euro-Par 2012 Parallel Processing, C. Kaklamanis, T. Papatheodorou, and P. G. Spirakis, Eds. Berlin, Heidelberg: Springer Berlin Heidelberg, 2012.
[14] E. Grinstein, N. Q. K. Duong, A. Ozerov, and P. Pérez, "Audio style transfer," arXiv:1710.11385, 2017.
[15] C. Donahue, B. Li, and R. Prabhavalkar, "Exploring speech enhancement with generative adversarial networks for robust speech recognition," arXiv:1711.05747, 2017.
[16] V. Dumoulin and F. Visin, "A guide to convolution arithmetic for deep learning," arXiv:1603.07285, 2016.
[17] D. Clevert, T. Unterthiner, and S. Hochreiter, "Fast and accurate deep network learning by exponential linear units," arXiv:1511.07289, 2015.
[18] J. Engel, C. Resnick, A. Roberts et al., "Neural audio synthesis of musical notes with WaveNet autoencoders," arXiv:1704.01279, 2017.
[19] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, "LibriSpeech: An ASR corpus based on public domain audio books," in Acoustics, Speech and Signal Processing (ICASSP), 2015 IEEE International Conference on. IEEE, 2015, pp. 5206–5210.
[20] D. P. Kingma and J. Ba, "Adam: A method for stochastic optimization," arXiv:1412.6980, 2014.
[21] S. Ö. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, "Neural voice cloning with a few samples," arXiv:1802.06006, 2018.
[22] A. Odena, V. Dumoulin, and C. Olah, "Deconvolution and checkerboard artifacts," Distill, 2016. [Online]. Available: http://distill.pub/2016/deconv-checkerboard
[23] S. Ö. Arik, M. Chrzanowski, A. Coates et al., "Deep Voice: Real-time neural text-to-speech," arXiv:1702.07825, 2017.
[24] "FFT benchmark methodology," http://www.fftw.org/speed/method.html, accessed: 2018-07-30.
[25] "TensorFlow profiler and advisor," https://github.com/tensorflow/tensorflow/blob/master/tensorflow/core/profiler/README.md, accessed: 2018-07-30.

APPENDIX

A. Complexity modeling

Computational complexity of operations is represented by the total number of algorithmic FLOPs, without considering hardware-specific logic-level implementations. (Such a complexity metric also has limitations in representing some major sources of power consumption, such as loading and storing data.) We count all point-wise operations (including nonlinearities) as 1 FLOP, which is motivated by the trend of implementing most mathematical operations as a single instruction. We ignore the complexities of register memory-move operations. We assume that a matrix-matrix multiply between W, an m × n matrix, and X, an n × p matrix, takes 2mnp FLOPs. A similar expression is generalized to the multi-dimensional tensors that are used in convolutional layers. For the real-valued fast Fourier transform (FFT), we assume a complexity of 2.5 N log2(N) FLOPs for a vector of length N [24]. For most operations used in this paper, the TensorFlow profiling tool [25] includes FLOP counts, which we directly adapted.
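Under these conventions, per-layer FLOP estimates can be written down directly. The sketch below is one possible accounting (not the paper's profiling code): a transposed 1-D convolution is treated as a matrix multiply of the (t_in x c_in) input with a (c_in x width*c_out) kernel at 2mnp FLOPs, plus one point-wise FLOP per output element for the nonlinearity; the layer sizes are hypothetical.

import math

def conv1d_transpose_flops(t_in, c_in, c_out, width, stride):
    # 2*m*n*p matrix-multiply convention applied to a transposed 1-D convolution,
    # plus one point-wise operation per output element for the ELU.
    return 2 * t_in * c_in * width * c_out + t_in * stride * c_out

def real_fft_flops(n):
    # Assumed real-valued FFT cost of 2.5 * N * log2(N) FLOPs [24].
    return 2.5 * n * math.log2(n)

# Example with hypothetical sizes: 400 spectrogram frames, 80 input channels,
# and one head with four layers of stride 4 and width 13 (as in the earlier sketch).
t, c = 400, 80
total = 0
for c_out, s, w in [(256, 4, 13), (128, 4, 13), (64, 4, 13), (1, 4, 13)]:
    total += conv1d_transpose_flops(t, c, c_out, w, s)
    t, c = t * s, c_out
print(f"~{total / 1e9:.2f} GFLOPs per head for this configuration")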
B. Analysis of contributions of multiple heads

Fig. 5 shows the outputs of individual heads along with the overall waveform. We observe that multiple heads focus on different portions of the waveform in time, and also on different frequency bands. For example, head 2 mostly focuses on low-frequency components. While training, individual heads are not constrained to such a behavior. In fact, different heads share the same architecture, but the initial random weights of the heads determine which portions of the waveform they will focus on in the later phases of training. The structure of the network promotes cooperation with the end-to-end objective. Hence, initialization with the same weights would nullify the benefit of the multi-head architecture. Although the intelligibility of individual waveform outputs is very low (we also note that a nonlinear combination of these waveforms can also generate new frequencies that do not exist in these individual outputs), their combination can yield highly natural-sounding waveforms.
Fig. 5. Top row: An example synthesized waveform and its log-STFT. Bottom 8 rows: Waveform outputs of each of the constituent heads. For better visualization, waveforms are normalized in each head and small-amplitude components in the STFTs are discarded after applying a threshold.