Soundstream: An End-To-End Neural Audio Codec

SoundStream is a neural audio codec that can efficiently compress speech, music, and general audio at bitrates normally targeted by speech-tailored codecs. It relies on a fully convolutional encoder/decoder architecture and a learnable residual vector quantizer trained end-to-end. Subjective evaluations found that SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps for audio sampled at 24 kHz. It can also perform joint compression and enhancement, such as background noise suppression for speech.

SoundStream: An End-to-End Neural Audio Codec

Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi

arXiv:2107.03312v1 [cs.SD] 7 Jul 2021

Abstract: We present SoundStream, a novel neural audio codec that can efficiently compress speech, music and general audio at bitrates normally targeted by speech-tailored codecs. SoundStream relies on a model architecture composed of a fully convolutional encoder/decoder network and a residual vector quantizer, which are trained jointly end-to-end. Training leverages recent advances in text-to-speech and speech enhancement, which combine adversarial and reconstruction losses to allow the generation of high-quality audio content from quantized embeddings. By training with structured dropout applied to quantizer layers, a single model can operate across variable bitrates from 3 kbps to 18 kbps, with a negligible quality loss when compared with models trained at fixed bitrates. In addition, the model is amenable to a low latency implementation, which supports streamable inference and runs in real time on a smartphone CPU. In subjective evaluations using audio at 24 kHz sampling rate, SoundStream at 3 kbps outperforms Opus at 12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to perform joint compression and enhancement either at the encoder or at the decoder side with no additional latency, which we demonstrate through background noise suppression for speech.

[Figure 1: plot of MUSHRA score versus bitrate (kbps) for SoundStream, SoundStream - scalable, EVS, Opus and Lyra.]
Fig. 1: SoundStream @3 kbps vs. state-of-the-art codecs.
I. INTRODUCTION
Audio codecs can be partitioned into two broad categories: waveform codecs and parametric codecs. Waveform codecs aim at producing at the decoder side a faithful reconstruction of the input audio samples. In most cases, these codecs rely on transform coding techniques: a (usually invertible) transform is used to map an input time-domain waveform to the time-frequency domain. Then, transform coefficients are quantized and entropy coded. At the decoder side the transform is inverted to reconstruct a time-domain waveform. Often the bit allocation at the encoder is driven by a perceptual model, which determines the quantization process. Generally, waveform codecs make little or no assumptions about the type of audio content and can thus operate on general audio. As a consequence of this, they produce very high-quality audio at medium-to-high bitrates, but they tend to introduce coding artifacts when operating at low bitrates. Parametric codecs aim at overcoming this problem by making specific assumptions about the source audio to be encoded (in most cases, speech) and introducing strong priors in the form of a parametric model that describes the audio synthesis process. The encoder estimates the parameters of the model, which are then quantized. The decoder generates a time-domain waveform using a synthesis model driven by quantized parameters. Unlike waveform codecs, the goal is not to obtain a faithful reconstruction on a sample-by-sample basis, but rather to generate audio that is perceptually similar to the original.
Traditional waveform and parametric codecs rely on signal processing pipelines and carefully engineered design choices, which exploit in-domain knowledge on psycho-acoustics and speech synthesis to improve coding efficiency. More recently, machine learning models have been successfully applied in the field of audio compression, demonstrating the additional value brought by data-driven solutions. For example, it is possible to apply them as a post-processing step to improve the quality of existing codecs. This can be accomplished either via audio superresolution, i.e., extending the frequency bandwidth [1], via audio denoising, i.e., removing lossy coding artifacts [2], or via packet loss concealment [3].
Other solutions adopt ML-based models as an integral part of the audio codec architecture. In these areas, recent advances in text-to-speech (TTS) technology proved to be a key ingredient. For example, WaveNet [4], a strong generative model originally applied to generate speech from text, was adopted as a decoder in a neural codec [5], [6]. Other neural audio codecs adopt different model architectures, e.g., WaveRNN in LPCNet [7] and WaveGRU in Lyra [8], all targeting speech at low bitrates.
In this paper we propose SoundStream, a novel audio codec that can compress speech, music and general audio more efficiently than previous codecs, as illustrated in Figure 1. SoundStream leverages state-of-the-art solutions in the field of neural audio synthesis, and introduces a new learnable quantization module, to deliver audio at high perceptual quality, while operating at low-to-medium bitrates. Figure 2 illustrates the high level model architecture of the codec. A fully convolutional encoder receives as input a time-domain waveform and produces a sequence of embeddings at a lower sampling rate, which are quantized by a residual vector quantizer. A fully convolutional decoder receives the quantized embeddings and reconstructs an approximation of the original waveform. The model is trained end-to-end using both reconstruction and adversarial losses. To this end, one (or more) discriminators are trained jointly, with the goal of distinguishing the decoded audio from the original audio and, as a by-product, provide a space where a feature-based reconstruction loss can be computed. Both the encoder and the decoder only use causal convolutions, so the overall architectural latency of the model is determined solely by the temporal resampling ratio between the original time-domain waveform and the embeddings.

[Figure 2: block diagram. Labels recovered from the figure: Training (Encoder, RVQ with quantizers Q1, Q2, ..., QNq, Decoder, Discriminator, Denoising on/off); Inference (Transmitter: Encoder, RVQ; Receiver: RVQ, Decoder).]
Fig. 2: SoundStream model architecture. A convolutional encoder produces a latent representation of the input audio samples, which is quantized using a variable number $n_q$ of residual vector quantizers (RVQ). During training, the model parameters are optimized using a combination of reconstruction and adversarial losses. An optional conditioning input can be used to indicate whether background noise has to be removed from the audio. When deploying the model, the encoder and quantizer on a transmitter client send the compressed bitstream to a receiver client that can then decode the audio signal.

In summary, this paper makes the following key contributions:
• We propose SoundStream, a neural audio codec in which all the constituent components (encoder, decoder and quantizer) are trained end-to-end with a mix of reconstruction and adversarial losses to achieve superior audio quality.
• We introduce a new residual vector quantizer, and investigate the rate-distortion-complexity trade-offs implied by its design. In addition, we propose a novel “quantizer dropout” technique for training the residual vector quantizer, which enables a single model to handle different bitrates.
• We demonstrate that learning the encoder brings a very significant coding efficiency improvement, with respect to a solution that adopts mel-spectrogram features.
• We demonstrate by means of subjective quality metrics that SoundStream outperforms both Opus and EVS over a wide range of bitrates.
• We design our model to support streamable inference, which can operate at low-latency. When deployed on a smartphone, it runs in real-time on a single CPU thread.
• We propose a variant of the SoundStream codec that performs jointly audio compression and enhancement, without introducing additional latency.

II. RELATED WORK
Traditional audio codecs – Opus [9] and EVS [10] are state-of-the-art audio codecs, which combine traditional coding tools, such as LPC, CELP and MDCT, to deliver high coding efficiency over different content types, bitrates and sampling rates, while ensuring low-latency for real-time audio communications. We compare SoundStream with both Opus and EVS in our subjective evaluation.
Audio generative models – Several generative models have been developed for converting text or coded features into audio waveforms. WaveNet [4] allows for global and local signal conditioning to synthesize both speech and music. SampleRNN [11] uses recurrent networks in a similar fashion, but it relies on previous samples at different scales. These auto-regressive models deliver very high-quality audio, at the cost of increased computational complexity, since samples are generated one by one. To overcome this issue, Parallel WaveNet [12] allows for parallel computation, yielding considerable speedup during inference. Other approaches involve lightweight and sparse models [13] and networks mimicking the fast Fourier transform as part of the model [7], [14]. More recently, generative adversarial models have emerged as a solution able to deliver high-quality audio with a lower computational complexity. MelGAN [15] is trained to produce audio waveforms when conditioned on mel-spectrograms, training a multi-scale waveform discriminator together with the generator. HiFiGAN [16] takes a similar approach but it applies discriminators to both multiple scales and multiple periods of the audio samples. The design of the decoder and the losses in SoundStream is based on this class of audio generative models.
Audio enhancement – Deep neural networks have been applied to different audio enhancement tasks, ranging from denoising [17]–[21] to dereverberation [22], [23], lossy coding denoising [2] and frequency bandwidth extension [1], [24]. In this paper we show that it is possible to jointly perform audio enhancement and compression with a single model, without introducing additional latency.
Vector quantization – Learning the optimal quantizer is a key element to achieve high coding efficiency. Optimal scalar quantization based on Lloyd’s algorithm [25] can be extended to a high-dimensional space via the generalized Lloyd algorithm
(GLA) [26], which is very similar to k-means clustering [27]. In vector quantization [28], a point in a high-dimensional space is mapped onto a discrete set of code vectors. Vector quantization has been commonly used as a building block of traditional audio codecs [29]. For example, CELP [30] adopts an excitation signal encoded via a vector quantizer codebook. More recently, vector quantization has been applied in the context of neural network models to compress the latent representation of input features. For example, in variational autoencoders, vector quantization has been used to generate images [31], [32] and music [33], [34]. Vector quantization can become prohibitively expensive, as the size of the codebook grows exponentially when rate is increased. For this reason, structured vector quantizers [35], [36] (e.g., residual, product, lattice vector quantizers, etc.) have been proposed to obtain a trade-off between computational complexity and coding efficiency in traditional codecs. In SoundStream, we extend the learnable vector quantizer of VQ-VAE [31] and introduce a residual (a.k.a. multi-stage) vector quantizer, which is learned end-to-end with the rest of the model. To the best of the authors' knowledge, this is the first time that this form of vector quantization is used in the context of neural networks and trained end-to-end with the rest of the model.
Neural audio codecs – End-to-end neural audio codecs rely on data-driven methods to learn efficient audio representations, instead of relying on handcrafted signal processing components. Autoencoder networks with quantization of hidden features were applied to speech coding early on [37]. More recently, a more sophisticated deep convolutional network for speech compression was described in [38]. Efficient compression of audio using neural networks has been demonstrated in several works, mostly targeting speech coding at low bitrates. A VQ-VAE speech codec was proposed in [6], operating at 1.6 kbps. Lyra [8] is a generative model that encodes quantized mel-spectrogram features of speech, which are decoded with an auto-regressive WaveGRU model to achieve state-of-the-art results at 3 kbps. A very low-bitrate codec was proposed in [39] by decoding speech representations obtained via self-supervised learning. An end-to-end audio codec targeting general audio at high bitrates (i.e., above 64 kbps) was proposed in [40]. The model architecture adopts a residual coding pipeline, which consists of multiple autoencoding modules, and a psycho-acoustic model is used to drive the loss function during training. Unlike [39], which specifically targets speech by combining speaker, phonetic and pitch embeddings, SoundStream does not make assumptions on the nature of the signal it encodes, and thus works for diverse audio content types. While [8] learns a decoder on fixed features, SoundStream is trained in an end-to-end fashion. Our experiments (see Section IV) show that learning the encoder increases the audio quality substantially. SoundStream achieves bitrate scalability, i.e., the ability of a single model to operate at different bitrates at no additional cost, thanks to its residual vector quantizer and to our original quantizer dropout training scheme (see Section III-C). This is unlike [38] and [40], which enforce a specific bitrate during training and require training a different model for each target bitrate. A single SoundStream model is able to compress speech, music and general audio, while operating at a 24 kHz sampling rate and low-to-medium bitrates (3 kbps to 18 kbps in our experiments), in real time on a smartphone CPU. This is the first time that a neural audio codec is shown to outperform state-of-the-art codecs like Opus and EVS over this broad range of bitrates.
Joint compression and enhancement – Recent work has explored joint compression and enhancement. The work in [41] trains a speech enhancement system with a quantized bottleneck. Instead, SoundStream integrates a time-dependent conditioning layer, which allows for real-time controllable denoising. As we design SoundStream as a general-purpose audio codec, controlling when to denoise allows for encoding acoustic scenes and natural sounds that would be otherwise removed.

III. MODEL
We consider a single channel recording $x \in \mathbb{R}^T$, sampled at $f_s$. The SoundStream model consists of a sequence of three building blocks, as illustrated in Figure 2:
• an encoder, which maps $x$ to a sequence of embeddings (see Section III-A),
• a residual vector quantizer, which replaces each embedding by the sum of vectors from a set of finite codebooks, thus compressing the representation with a target number of bits (see Section III-C),
• a decoder, which produces a lossy reconstruction $\hat{x} \in \mathbb{R}^T$ from quantized embeddings (see Section III-B).
The model is trained end-to-end together with a discriminator (see Section III-D), using the mix of adversarial and reconstruction losses described in Section III-E. Optionally, a conditioning signal can be added, which determines whether denoising is applied at the encoder or decoder side, as detailed in Section III-F.

A. Encoder architecture
The encoder architecture is illustrated in Figure 3 and follows the same structure as the streaming SEANet encoder described in [1], but without skip connections. It consists of a 1D convolution layer (with $C_{enc}$ channels), followed by $B_{enc}$ convolution blocks. Each of the blocks consists of three residual units, containing dilated convolutions with dilation rates of 1, 3, and 9, respectively, followed by a down-sampling layer in the form of a strided convolution. The number of channels is doubled whenever down-sampling, starting from $C_{enc}$. A final 1D convolution layer with a kernel of length 3 and a stride of 1 is used to set the dimensionality of the embeddings to $D$. To guarantee real-time inference, all convolutions are causal. This means that padding is only applied to the past but not the future in both training and offline inference, whereas no padding is used in streaming inference. We use the ELU activation [42] and we do not apply any normalization. The number $B_{enc}$ of convolution blocks and the corresponding striding sequence determines the temporal resampling ratio between the input waveform and the embeddings. For example, when $B_{enc} = 4$ and using (2, 4, 5, 8) as strides, one embedding is computed every $M = 2 \cdot 4 \cdot 5 \cdot 8 = 320$ input samples. Thus, the encoder outputs $\mathrm{enc}(x) \in \mathbb{R}^{S \times D}$, with $S = T/M$.
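The stride arithmetic and the causal-padding rule described for the encoder can be illustrated with a small plain-Python sketch. It is not the authors' implementation: the function names `resampling_ratio`, `frames_per_second` and `causal_conv1d` are hypothetical, and the convolution is a single-channel, stride-1 toy that only shows what "padding applied to the past but not the future" means.

```python
from math import prod


def resampling_ratio(strides):
    """M: number of input samples per embedding (product of the block strides)."""
    return prod(strides)


def frames_per_second(sample_rate_hz, strides):
    """Embedding rate at the encoder output for a given input sampling rate."""
    return sample_rate_hz / resampling_ratio(strides)


def causal_conv1d(x, kernel, dilation=1):
    """Single-channel causal convolution (cross-correlation form), stride 1.

    The input is zero-padded on the left only, so output[t] depends on
    x[t], x[t - dilation], x[t - 2 * dilation], ... and never on future samples.
    """
    pad = (len(kernel) - 1) * dilation
    padded = [0.0] * pad + list(x)
    return [
        sum(w * padded[t + j * dilation] for j, w in enumerate(kernel))
        for t in range(len(x))
    ]


strides = (2, 4, 5, 8)                      # striding sequence from the text
M = resampling_ratio(strides)               # 320 samples per embedding
rate = frames_per_second(24_000, strides)   # 75.0 embeddings per second
print(M, rate)
```

With the strides (2, 4, 5, 8) and a 24 kHz input, the sketch reproduces M = 320 and 75 embeddings per second, the same rate shown for the embeddings in Figure 3.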
[Figure 3: block diagrams of the encoder and decoder. Layers recovered from the figure:
Encoder: Waveform @ 24 kHz → Conv1D (k=7, n=C) → EncoderBlock (N=2C, S=2) → EncoderBlock (N=4C, S=4) → EncoderBlock (N=8C, S=5) → EncoderBlock (N=16C, S=8) → Conv1D (k=3, n=K) → FiLM conditioning → Embeddings @ 75 Hz.
EncoderBlock (N, S): ResidualUnit (N/2, dilation=1) → ResidualUnit (N/2, dilation=3) → ResidualUnit (N/2, dilation=9) → Conv1D (k=2S, n=N, stride=S).
ResidualUnit (N, dilation): Conv1D (k=7, n=N, dilation) → Conv1D (k=1, n=N).
Decoder: Embeddings @ 75 Hz → FiLM conditioning → Conv1D (k=7, n=16C) → DecoderBlock (N=8C, S=8) → DecoderBlock (N=4C, S=5) → DecoderBlock (N=2C, S=4) → DecoderBlock (N=C, S=2) → Conv1D (k=7, n=1) → Waveform @ 24 kHz.
DecoderBlock (N, S): (Conv1D)^T (k=2S, n=N, stride=S) → ResidualUnit (N/2, dilation=1) → ResidualUnit (N/2, dilation=3) → ResidualUnit (N/2, dilation=9).]

Fig. 3: Encoder and decoder model architecture.
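Algorithm 1 and the bitrate arithmetic of Section III-C can be sketched in plain Python. This is a toy illustration under stated assumptions, not the paper's implementation: each quantizer is assumed to return the nearest code vector under squared Euclidean distance, the two tiny codebooks are made up for the example, and the names `nearest`, `rvq_quantize`, `sample_nq` and `bitrate_bps` are hypothetical.

```python
import math
import random


def nearest(codebook, v):
    """Assumed quantizer Q_i: the code vector closest to v (squared Euclidean)."""
    return min(codebook, key=lambda c: sum((a - b) ** 2 for a, b in zip(c, v)))


def rvq_quantize(y, codebooks, nq=None):
    """Algorithm 1 for one frame y, using only the first nq quantizers.

    nq=None uses all quantizers; a smaller nq gives a coarser result with
    the same shape, which is what makes the bitrate adjustable.
    """
    active = codebooks if nq is None else codebooks[:nq]
    y_hat = [0.0] * len(y)
    residual = list(y)
    for codebook in active:
        q = nearest(codebook, residual)
        y_hat = [a + b for a, b in zip(y_hat, q)]
        residual = [a - b for a, b in zip(residual, q)]
    return y_hat


def sample_nq(num_quantizers, rng=random):
    """Quantizer dropout: draw n_q uniformly in [1, N_q] for a training example."""
    return rng.randint(1, num_quantizers)


def bitrate_bps(nq, codebook_size, frames_per_second):
    """Each active quantizer spends log2(N) bits per frame."""
    return nq * math.log2(codebook_size) * frames_per_second


# Made-up 2-D codebooks, coarse then fine (illustration only).
CODEBOOKS = [
    [[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0], [0.0, -1.0]],
    [[0.25, 0.25], [-0.25, -0.25], [0.25, -0.25], [-0.25, 0.25]],
]
frame = [0.9, -0.3]
coarse = rvq_quantize(frame, CODEBOOKS, nq=1)   # first quantizer only
fine = rvq_quantize(frame, CODEBOOKS)           # both quantizers
print(coarse, fine)
print(bitrate_bps(8, 1024, 75))                 # N_q = 8, N = 1024, 75 frames/s
```

In this toy case the second quantizer refines the first estimate without changing its dimensionality, and the last line reproduces the worked example in the text: 8 quantizers with 1024-entry codebooks at 75 frames per second give 6000 bps.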

Algorithm 1: Residual Vector Quantization
  Input: y = enc(x) the output of the encoder, vector quantizers Q_i for i = 1..N_q
  Output: the quantized ŷ
  ŷ ← 0.0
  residual ← y
  for i = 1 to N_q do
      ŷ += Q_i(residual)
      residual −= Q_i(residual)
  return ŷ

B. Decoder architecture
The decoder architecture follows a similar design, as illustrated in Figure 3. A 1D convolution layer is followed by a sequence of $B_{dec}$ convolution blocks. The decoder block mirrors the encoder block, and consists of a transposed convolution for up-sampling followed by the same three residual units. We use the same strides as the encoder, but in reverse order, to reconstruct a waveform with the same resolution as the input waveform. The number of channels is halved whenever up-sampling, so that the last decoder block outputs $C_{dec}$ channels. A final 1D convolution layer with one filter, a kernel of size 7 and stride 1 projects the embeddings back to the waveform domain to produce $\hat{x}$. In Figure 3, the same number of channels in both the encoder and the decoder is controlled by the same parameter, i.e., $C_{enc} = C_{dec} = C$. We also investigate cases in which $C_{enc} \neq C_{dec}$, which results in a computationally lighter encoder and a heavier decoder, or vice-versa (see Section V-D).

C. Residual Vector Quantizer
The goal of the quantizer is to compress the output of the encoder $\mathrm{enc}(x)$ to a target bitrate $R$, expressed in bits/second (bps). In order to train SoundStream in an end-to-end fashion, the quantizer needs to be jointly trained with the encoder and the decoder by backpropagation. The vector quantizer (VQ) proposed in [31], [32] in the context of VQ-VAEs meets this requirement. This vector quantizer learns a codebook of $N$ vectors to encode each $D$-dimensional frame of $\mathrm{enc}(x)$. The encoded audio $\mathrm{enc}(x) \in \mathbb{R}^{S \times D}$ is then mapped to a sequence of one-hot vectors of shape $S \times N$, which can be represented using $S \log_2 N$ bits.
Limitations of Vector Quantization – As a concrete example, let us consider a codec targeting a bitrate $R = 6000$ bps. When using a striding factor $M = 320$, each second of audio at sampling rate $f_s = 24000$ Hz is represented by $S = 75$ frames at the output of the encoder. This corresponds to $r = 6000/75 = 80$ bits allocated to each frame. Using a plain vector quantizer, this requires storing a codebook with $N = 2^{80}$ vectors, which is obviously unfeasible.
Residual Vector Quantizer – To address this issue we adopt a Residual Vector Quantizer (a.k.a. multi-stage vector quantizer [36]), which cascades $N_q$ layers of VQ as follows. The unquantized input vector is passed through a first VQ and quantization residuals are computed. The residuals are then iteratively quantized by a sequence of additional $N_q - 1$ vector quantizers, as described in Algorithm 1. The total rate budget is uniformly allocated to each VQ, i.e., $r_i = r/N_q = \log_2 N$. For example, when using $N_q = 8$, each quantizer uses a codebook of size $N = 2^{r/N_q} = 2^{80/8} = 1024$. For a target rate budget $r$, the parameter $N_q$ controls the tradeoff between computational complexity and coding efficiency, which we investigate in Section V-D.
The codebook of each quantizer is trained with exponential moving average updates, following the method proposed in VQ-VAE-2 [32]. To improve the usage of the codebooks we use two additional methods. First, instead of using a random
initialization for the codebook vectors, we run the k-means algorithm on the first training batch and use the learned centroids as initialization. This allows the codebook to be close to the distribution of its inputs and improves its usage. Second, as proposed in [34], when a codebook vector has not been assigned any input frame for several batches, we replace it with an input frame randomly sampled within the current batch. More precisely, we track the exponential moving average of the assignments to each vector (with a decay factor of 0.99) and replace the vectors of which this statistic falls below 2.
Enabling bitrate scalability with quantizer dropout – Residual vector quantization provides a convenient framework for controlling the bitrate. For a fixed size $N$ of each codebook, the number of VQ layers $N_q$ determines the bitrate. Since the vector quantizers are trained jointly with the encoder/decoder, in principle a different SoundStream model should be trained for each target bitrate. Instead, having a single bitrate scalable model that can operate at several target bitrates is much more practical, since this reduces the memory footprint needed to store model parameters both at the encoder and decoder side. To train such a model, we modify Algorithm 1 in the following way: for each input example, we sample $n_q$ uniformly at random in $[1; N_q]$ and only use quantizers $Q_i$ for $i = 1 \dots n_q$. This can be seen as a form of structured dropout [43] applied to quantization layers. Consequently, the model is trained to encode and decode audio for all target bitrates corresponding to the range $n_q = 1 \dots N_q$. During inference, the value of $n_q$ is selected based on the desired bitrate. Previous models for neural compression have relied on product quantization (wav2vec 2.0 [44]), or on concatenating the output of several VQ layers [5], [6]. With such approaches, changing the bitrate requires either changing the architecture of the encoder and/or the decoder, as the dimensionality changes, or retraining an appropriate codebook. A key advantage of our residual vector quantizer is that the dimensionality of the embeddings does not change with the bitrate. Indeed, the additive composition of the outputs of each VQ layer progressively refines the quantized embeddings, while keeping the same shape. Hence, no architectural changes are needed in either the encoder or the decoder to accommodate different bitrates. In Section V-C, we show that this method allows one to train a single SoundStream model, which matches the performance of models trained specifically for a given bitrate.

D. Discriminator architecture
To compute the adversarial losses described in Section III-E, we define two different discriminators: i) a wave-based discriminator, which receives as input a single waveform; ii) an STFT-based discriminator, which receives as input the complex-valued STFT of the input waveform, expressed in terms of real and imaginary parts. Since both discriminators are fully convolutional, the number of logits in the output is proportional to the length of the input audio.
For the wave-based discriminator, we use the same multi-resolution convolutional discriminator proposed in [15] and adopted in [45]. Three structurally identical models are applied to the input audio at different resolutions: original, 2-times down-sampled, and 4-times down-sampled. Each single-scale discriminator consists of an initial plain convolution followed by four grouped convolutions, each of which has a group size of 4, a down-sampling factor of 4, and a channel multiplier of 4 up to a maximum of 1024 output channels. They are followed by two more plain convolution layers to produce the final output, i.e., the logits.
The STFT-based discriminator is illustrated in Figure 4 and operates on a single scale, computing the STFT with a window length of $W = 1024$ samples and a hop length of $H = 256$ samples. A 2D-convolution (with kernel size 7 × 7 and 32 channels) is followed by a sequence of residual blocks. Each block starts with a 3×3 convolution, followed by a 3×4 or a 4×4 convolution, with strides equal to (1, 2) or (2, 2), where $(s_t, s_f)$ indicates the down-sampling factor along the time axis and the frequency axis. We alternate between (1, 2) and (2, 2) strides, for a total of 6 residual blocks. The number of channels is progressively increased with the depth of the network. At the output of the last residual block, the activations have shape $T/(H \cdot 2^3) \times F/2^6$, where $T$ is the number of samples in the time domain and $F = W/2$ is the number of frequency bins. The last layer aggregates the logits across the (down-sampled) frequency bins with a fully connected layer (implemented as a $1 \times F/2^6$ convolution), to obtain a 1-dimensional signal in the (down-sampled) time domain.

[Figure 4: block diagram of STFTDiscriminator(C). Layers recovered from the figure:
Waveform @ 24 kHz → STFT(w, h) → ResidualUnit (N=C, m=2, s=(1, 2)) → ResidualUnit (N=2C, m=2, s=(2, 2)) → ResidualUnit (N=4C, m=1, s=(1, 2)) → ResidualUnit (N=4C, m=2, s=(2, 2)) → ResidualUnit (N=8C, m=1, s=(1, 2)) → ResidualUnit (N=8C, m=2, s=(2, 2)) → Conv2D (k=1×F/2^6, n=1) → Logits.
ResidualUnit (N, m, (s_t, s_f)): Conv2D (k=3×3, n=N) → Conv2D (k=(s_t+2)×(s_f+2), n=mN).]
Fig. 4: STFT-based discriminator architecture.

E. Training objective
Let $G(x) = \mathrm{dec}(Q(\mathrm{enc}(x)))$ denote the SoundStream generator, which processes the input waveform $x$ through the encoder, the quantizer and the decoder, and $\hat{x} = G(x)$ be the decoded waveform. We train SoundStream with a mix of losses
to achieve both signal reconstruction fidelity and perceptual quality, following the principles of the perception-distortion trade-off discussed in [46].
The adversarial loss is used to promote perceptual quality and it is defined as a hinge loss over the logits of the discriminator, averaged over multiple discriminators and over time. More formally, let $k \in \{0, \dots, K\}$ index over the individual discriminators, where $k = 0$ denotes the STFT-based discriminator and $k \in \{1, \dots, K\}$ the different resolutions of the waveform-based discriminator ($K = 3$ in our case). Let $T_k$ denote the number of logits at the output of the $k$-th discriminator along the time dimension. The discriminator is trained to classify original vs. decoded audio, by minimizing

$$\mathcal{L}_D = \mathbb{E}_x\left[\frac{1}{K}\sum_k \frac{1}{T_k}\sum_t \max\left(0,\, 1 - D_{k,t}(x)\right)\right] + \mathbb{E}_x\left[\frac{1}{K}\sum_k \frac{1}{T_k}\sum_t \max\left(0,\, 1 + D_{k,t}(G(x))\right)\right], \tag{1}$$

while the adversarial loss for the generator is

$$\mathcal{L}_G^{adv} = \mathbb{E}_x\left[\frac{1}{K}\sum_{k,t} \frac{1}{T_k} \max\left(0,\, 1 - D_{k,t}(G(x))\right)\right]. \tag{2}$$

To promote fidelity of the decoded signal $\hat{x}$ with respect to the original $x$ we adopt two additional losses: i) a “feature” loss $\mathcal{L}_G^{feat}$, computed in the feature space defined by the discriminator(s) [15]; ii) a multi-scale spectral reconstruction loss $\mathcal{L}_G^{rec}$ [47].
More specifically, the feature loss is computed by taking the average absolute difference between the discriminator’s internal layer outputs for the generated audio and those for the corresponding target audio.

$$\mathcal{L}_G^{feat} = \mathbb{E}_x\left[\frac{1}{KL}\sum_{k,l} \frac{1}{T_{k,l}}\sum_t \left| D_{k,t}^{(l)}(x) - D_{k,t}^{(l)}(G(x)) \right|\right], \tag{3}$$

where $L$ is the number of internal layers, $D_{k,t}^{(l)}$ ($l \in \{1, \dots, L\}$) is the $t$-th output of layer $l$ of discriminator $k$, and $T_{k,l}$ denotes the length of the layer in the time dimension.
The multi-scale spectral reconstruction loss follows the specifications described in [48]:

$$\mathcal{L}_G^{rec} = \sum_{s \in 2^6, \dots, 2^{11}} \sum_t \left\| S_t^s(x) - S_t^s(G(x)) \right\|_1 + \tag{4}$$
$$\alpha_s \sum_t \left\| \log S_t^s(x) - \log S_t^s(G(x)) \right\|_2, \tag{5}$$

where $S_t^s(x)$ denotes the $t$-th frame of a 64-bin mel-spectrogram computed with window length equal to $s$ and hop length equal to $s/4$. We set $\alpha_s = \sqrt{s/2}$ as in [48].
The overall generator loss is a weighted sum of the different loss components:

$$\mathcal{L}_G = \lambda_{adv}\,\mathcal{L}_G^{adv} + \lambda_{feat} \cdot \mathcal{L}_G^{feat} + \lambda_{rec} \cdot \mathcal{L}_G^{rec}. \tag{6}$$

In all our experiments we set $\lambda_{adv} = 1$, $\lambda_{feat} = 100$ and $\lambda_{rec} = 1$.

F. Joint compression and enhancement
In traditional audio processing pipelines, compression and enhancement are typically performed by different modules. For example, it is possible to apply an audio enhancement algorithm at the transmitter side, before audio is compressed, or at the receiver side, after audio is decoded. In this setup, each processing step contributes to the end-to-end latency, e.g., due to buffering the input audio to the expected frame length determined by the specific algorithm adopted. Conversely, we design SoundStream in such a way that compression and enhancement can be carried out jointly by the same model, without increasing the overall latency.
The nature of the enhancement can be determined by the choice of the training data. As a concrete example, in this paper we show that it is possible to combine compression with background noise suppression. More specifically, we train a model in such a way that one can flexibly enable or disable denoising at inference time, by feeding a conditioning signal that represents the two modes (denoising enabled or disabled). To this end, we prepare the training data to consist of tuples of the form: (inputs, targets, denoise). When denoise = false, targets = inputs; when denoise = true, targets contain the clean speech component of the corresponding inputs. Hence, the network is trained to reconstruct noisy speech if the conditioning signal is disabled, and to produce a clean version of the noisy input if it is enabled. Note that when inputs consist of clean audio (speech or music), targets = inputs and denoise can be either true or false. This is done to prevent SoundStream from adversely affecting clean audio when denoising is enabled.
To process the conditioning signal, we use Feature-wise Linear Modulation (FiLM) layers [49] in between residual units, which take network features as inputs and transform them as

$$\tilde{a}_{n,c} = \gamma_{n,c}\, a_{n,c} + \beta_{n,c}, \tag{7}$$

where $a_{n,c}$ is the $n$th activation in the $c$th channel. The coefficients $\gamma_{n,c}$ and $\beta_{n,c}$ are computed by a linear layer that takes as input a (potentially time-varying) two-dimensional one-hot encoding that determines the denoising mode. This allows one to adjust the level of denoising over time.
In principle, FiLM layers can be used anywhere throughout the encoder and decoder architecture. However, in our preliminary experiments, we found that applying conditioning at the bottleneck either at the encoder or at the decoder side (as illustrated in Figure 3) was effective and no further improvements were observed by applying FiLM layers at different depths. In Section V-E, we quantify the impact of enabling denoising at either the encoder or decoder side both in terms of audio quality and bitrate.

IV. EVALUATION SETUP
A. Datasets
We train SoundStream on three types of audio content: clean speech, noisy speech and music, all at 24 kHz sampling rate. For clean speech, we use the LibriTTS dataset [50]. For noisy speech, we synthesize samples by mixing speech
from LibriTTS with noise from Freesound [51]. We apply peak normalization to randomly selected crops of 3 seconds and adjust the mixing gain of the noise component sampling uniformly in the interval [−30 dB, 0 dB]. For music, we use the MagnaTagATune dataset [52]. We evaluate our models on disjoint test splits of the datasets above. In addition, we collected a real-world dataset, which contains both near-field and far-field (reverberant) speech, with background noise in some of the examples. Unless stated otherwise, objective and subjective metrics are computed on a set of 200 audio clips 2-4 seconds long, with 50 samples from each of the four datasets listed above (i.e., clean speech, noisy speech, music, noisy/reverberant speech).

B. Evaluation metrics

To evaluate SoundStream, we perform subjective evaluations by human raters. We have chosen a crowd-sourced methodology inspired by MUSHRA [53], with a hidden reference but no lowpass-filtered anchor. Each of the 200 samples of the evaluation dataset, which include clean, noisy and reverberant speech, as well as music, was rated 20 times. The raters were required to be native English speakers and be using headphones. Additionally, to avoid noisy data, a post-screening was put in place to exclude ratings by listeners who rated the reference below 90 more than 20% of the time or rated non-reference samples above 90 more than 50% of the time.

For development and hyperparameter selection, we rely on computational, objective metrics. Numerous metrics have been developed in the past for assessing the perceived similarity between a reference and a processed audio signal. The ITU-T standards PESQ [54] and its replacement POLQA [55] are commonly used metrics. However, both are inconvenient to use owing to licensing restrictions. We choose the freely available and recently open-sourced ViSQOL [56], [57] metric, which has previously shown comparable performance to POLQA. In early experiments, we found this metric to be strongly correlated with subjective evaluations. We thus use it for model selection and ablation studies.

C. Baselines

Opus [9] is a versatile speech and audio codec supporting signal bandwidths from 4 kHz to 24 kHz and bitrates from 6 kbps to 510 kbps. Since its standardization by the IETF in 2012 it has been widely deployed for speech communication over the internet. As the audio codec in applications such as Zoom and applications based on WebRTC [58], [59], such as Microsoft Teams and Google Meet, Opus has hundreds of millions of daily users. Opus is also one of the main audio codecs used in YouTube for streaming. Enhanced Voice Services (EVS) [10] is the latest codec standardized by the 3GPP and was primarily designed for Voice over LTE (VoLTE). Like Opus, it is a versatile codec operating at multiple signal bandwidths, 4 kHz to 20 kHz, and bitrates, 5.9 kbps to 128 kbps. It is replacing AMR-WB [60] and retains full backward operability. In this paper we utilize these two systems as baselines for comparison with the SoundStream codec. For the lowest bitrates, we also compare the performance of the recently presented Lyra codec [8] which is an autoregressive generative codec operating at 3 kbps. We provide audio processed by SoundStream and baselines at different bitrates on a public webpage¹.

¹ https://google-research.github.io/seanet/soundstream/examples/

V. RESULTS

A. Comparison with other codecs

Figure 5 reports the main result of the paper, where we compare SoundStream to Opus and EVS at different bitrates. Namely, we repeated a subjective evaluation based on a MUSHRA-inspired crowdsourced scheme, when SoundStream operates at three different bitrates: i) low (3 kbps); ii) medium (6 kbps); iii) high (12 kbps). Figure 5a shows that SoundStream at 3 kbps significantly outperforms both Opus at 6 kbps and EVS at 5.9 kbps (i.e., the lowest bitrates at which these codecs can operate), despite using half of the bitrate. To match the quality of SoundStream, EVS needs to use at least 9.6 kbps and Opus at least 12 kbps, i.e., 3.2× to 4× more bits than SoundStream. We also observe that SoundStream outperforms Lyra when they both operate at 3 kbps. We observe similar results when SoundStream operates at 6 kbps and 12 kbps. At medium bitrates, EVS and Opus require, respectively, 2.2× to 2.6× more bits to match the same quality. At high bitrates, 1.3× to 1.6× more bits.

Figure 6 illustrates the results of the subjective evaluation by content type. We observe that the quality of SoundStream remains consistent when encoding clean speech and noisy speech. In addition, SoundStream can encode music when using as little as 3 kbps, with quality significantly better than Opus at 12 kbps and EVS at 5.9 kbps. This is the first time that a codec is shown to operate on diverse content types at such a low bitrate.

B. Objective quality metrics

Figure 7a shows the rate-quality curve of SoundStream over a wide range of bitrates, from 3 kbps to 18 kbps. We observe that quality, as measured by means of ViSQOL, gracefully decreases as the bitrate is reduced and it remains above 3.7 even at the lowest bitrate. In our work, SoundStream operates at constant bitrate, i.e., the same number of bits is allocated to each encoded frame. At the same time, we measure the bitrate lower bound by computing the empirical entropy of the quantization symbols of the vector quantizers, assuming each vector quantizer to be a discrete memoryless source, i.e., no statistical redundancy is exploited across different layers of the residual vector quantizer, nor across time. Figure 7a indicates a potential rate saving between 7% and 20%.

We also investigate the rate-quality tradeoff achieved when encoding different content types, as illustrated in Figure 7b. Unsurprisingly, the highest quality is achieved when encoding clean speech. Music represents a more challenging case, due to its inherent diversity of content.
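As a rough illustration of the entropy bound discussed in Section V-B, the sketch below computes the empirical entropy of each quantizer layer independently (treating every layer as a discrete memoryless source) and compares the sum with the constant per-frame bit budget. The function names, the list-of-frames layout and the toy symbol stream are assumptions made for illustration; they are not taken from the paper.

```python
import math
from collections import Counter


def empirical_entropy_bits(symbols):
    """Entropy, in bits per symbol, of the empirical distribution of `symbols`."""
    counts = Counter(symbols)
    total = len(symbols)
    return -sum((c / total) * math.log2(c / total) for c in counts.values())


def entropy_bound_bits_per_frame(frames):
    """Sum of per-layer entropies.

    `frames` is a list of frames; each frame is a tuple with one codebook
    index per quantizer layer. Layers are treated independently and no
    dependency across time is exploited.
    """
    num_quantizers = len(frames[0])
    return sum(
        empirical_entropy_bits([frame[q] for frame in frames])
        for q in range(num_quantizers)
    )


def constant_bits_per_frame(num_quantizers, codebook_size):
    """Bits spent per frame when every index is stored with a fixed length."""
    return num_quantizers * math.log2(codebook_size)


# Hypothetical toy stream: 2 quantizer layers, codebook size 4.
frames = [(0, 1), (0, 2), (0, 1), (1, 3), (0, 1), (2, 1), (0, 0), (3, 1)]
bound = entropy_bound_bits_per_frame(frames)
budget = constant_bits_per_frame(2, 4)
saving = 1.0 - bound / budget  # fraction of bits an ideal entropy coder could save
print(f"budget={budget:.2f} bits, bound={bound:.2f} bits, saving={saving:.1%}")
```

Because the per-layer entropies are simply added, the figure is an estimate of what independent entropy coding of each layer could reach; a coder that modelled dependencies across layers or time could in principle go lower.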
[Figure 5 plots: MUSHRA score versus bitrate (kbps). Series: SoundStream, SoundStream - scalable, Opus, EVS, Lyra.]
(a) Low bitrate. (b) Medium bitrate. (c) High bitrate.
Fig. 5: Subjective evaluation results. Error bars denote 95% confidence intervals.
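The ratings behind these results were post-screened with the rule stated in Section IV-B: a listener is excluded when they rate the hidden reference below 90 more than 20% of the time, or rate non-reference samples above 90 more than 50% of the time. A minimal sketch of that rule follows; the data layout (a list of `(is_reference, score)` pairs per listener) and the function name are assumptions, since the paper does not describe its tooling.

```python
def keep_listener(ratings, threshold=90, max_low_ref=0.20, max_high_other=0.50):
    """Return True if a listener passes the post-screening rule.

    `ratings` is a list of (is_reference, score) pairs for one listener
    (hypothetical layout). Scores are on the 0-100 MUSHRA scale.
    """
    ref_scores = [score for is_ref, score in ratings if is_ref]
    other_scores = [score for is_ref, score in ratings if not is_ref]

    # Fraction of reference clips rated below the threshold.
    low_ref = (
        sum(score < threshold for score in ref_scores) / len(ref_scores)
        if ref_scores else 0.0
    )
    # Fraction of non-reference clips rated above the threshold.
    high_other = (
        sum(score > threshold for score in other_scores) / len(other_scores)
        if other_scores else 0.0
    )
    return low_ref <= max_low_ref and high_other <= max_high_other


careful = [(True, 100), (True, 95), (False, 40), (False, 60)]
misses_reference = [(True, 50), (True, 95), (False, 40)]
rates_everything_high = [(True, 100), (False, 95), (False, 99), (False, 30)]
kept = [keep_listener(r) for r in (careful, misses_reference, rates_everything_high)]
```

Treating a listener with no reference (or no non-reference) ratings as passing that half of the rule is a design choice of this sketch, not something the paper specifies.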

[Figure 6 plots: MUSHRA score by content type (clean speech, noisy speech, music, noisy/reverberant speech) for the reference and for Opus, EVS, Lyra, SoundStream and SoundStream - scalable at the bitrates (kbps) compared in each panel.]

(a) Low bitrate. (b) Medium bitrate. (c) High bitrate.

Fig. 6: Subjective evaluation results by content type. Error bars denote 95% confidence intervals.

C. Bitrate scalability

We investigate the bitrate scalability provided by training a single model that can serve different bitrates. To evaluate this aspect, for each bitrate R we consider three SoundStream configurations: a) a non-scalable model trained and evaluated at bitrate R (bitrate specific); b) a non-scalable model trained at 18 kbps and evaluated at bitrate R by using only the first $n_q$ quantizers during inference (18 kbps - no dropout); c) a scalable model trained with quantizer dropout and evaluated at bitrate R (bitrate scalable). Figure 7c shows the ViSQOL scores for these three scenarios. Remarkably, a model trained specifically at 18 kbps retains good performance when evaluated at lower bitrates, even though the model was not trained in these conditions. Unsurprisingly, the quality drop increases as the bitrate decreases, i.e., when there is a more significant difference between training and inference. This gap vanishes when using the quantizer dropout strategy described in Section III-C. Surprisingly, the bitrate scalable model seems to marginally outperform bitrate specific models at 9 kbps and 12 kbps. This suggests that quantizer dropout, beyond providing bitrate scalability, may act as a regularizer.

We confirm these results by including the bitrate scalable variant of SoundStream in the MUSHRA subjective evaluation (see Figure 5). When operating at 3 kbps, the bitrate scalable variant of SoundStream is only slightly worse than the bitrate specific variant. Conversely, both at 6 kbps and 12 kbps it matches the same quality as the bitrate specific variant.

D. Ablation studies

We carried out several additional experiments to evaluate the impact of some of the design choices applied to SoundStream. Unless stated otherwise, all these experiments operate at 6 kbps.

Advantage of learning the encoder – We explored the impact of replacing the learnable encoder of SoundStream with a fixed mel-filterbank, similarly to Lyra [8]. We learned both the quantizer and the decoder and observed a significant drop in objective quality, with ViSQOL going from 3.96 to 3.33. Note that this is significantly worse than what can be achieved when learning the encoder and halving the bitrate (i.e., ViSQOL equal to 3.76 at 3 kbps). This demonstrates that the additional complexity of having a learnable encoder translates to a very significant improvement in the rate-quality trade-off.

Encoder and decoder capacity – The main drawback of using a learnable encoder is the computational cost of the neural architecture, which can be significantly higher than computing fixed, non-learnable features such as mel-filterbanks.
[Figure 7 plots: three panels (a), (b), (c) showing ViSQOL versus bitrate (kbps). Legend entries: SoundStream, empirical entropy bound; clean speech, noisy speech, music, noisy/reverberant speech; bitrate scalable, not bitrate scalable, bitrate specific.]
Fig. 7: ViSQOL vs. bitrate. a) SoundStream performance on test data, comparing the actual bitrate with the potential bitrate
savings achievable by entropy coding b) ViSQOL scores by content type c) Comparison of SoundStream models that are
trained at 18 kbps with quantizer dropout (bitrate scalable), without quantizer dropout (not bitrate scalable) and evaluated with a
variable number of quantizers, or trained and evaluated at a fixed bitrate (bitrate specific). Error bars denote 95% confidence
intervals.
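The evaluation "with a variable number of quantizers" mentioned in the caption can be illustrated with a toy residual vector quantizer: each layer quantizes the residual left by the previous layers, so decoding from only the first n_q indices gives a coarser reconstruction at a lower bitrate. This is a minimal sketch under stated assumptions: the codebooks are random rather than learned, every toy codebook contains the zero vector (which guarantees that adding a layer never increases the error, a property learned codebooks need not have), and names such as `rvq_encode` are hypothetical.

```python
import random


def nearest(codebook, vector):
    """Index of the codebook entry closest to `vector` (squared L2 distance)."""
    def dist(entry):
        return sum((e - v) ** 2 for e, v in zip(entry, vector))
    return min(range(len(codebook)), key=lambda i: dist(codebook[i]))


def rvq_encode(codebooks, vector, num_quantizers=None):
    """Greedy residual quantization using the first `num_quantizers` layers."""
    residual = list(vector)
    indices = []
    for codebook in codebooks[:num_quantizers]:
        idx = nearest(codebook, residual)
        indices.append(idx)
        residual = [r - c for r, c in zip(residual, codebook[idx])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the selected entries of the first len(indices) layers."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, idx in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[idx])]
    return out


def squared_error(a, b):
    return sum((x - y) ** 2 for x, y in zip(a, b))


# Toy setup (assumed values): 6 layers, 8 entries each, 4-dimensional vectors.
rng = random.Random(0)
dim, codebook_size, depth = 4, 8, 6
codebooks = []
for layer in range(depth):
    scale = 0.5 ** layer  # later layers model finer residuals
    entries = [[0.0] * dim]
    entries += [[rng.uniform(-scale, scale) for _ in range(dim)]
                for _ in range(codebook_size - 1)]
    codebooks.append(entries)

x = [rng.uniform(-1.0, 1.0) for _ in range(dim)]
errors = []
for nq in range(1, depth + 1):
    indices = rvq_encode(codebooks, x, nq)
    errors.append(squared_error(x, rvq_decode(codebooks, indices)))
```

Since the encoding is greedy, the indices produced with n_q layers are a prefix of those produced with more layers, which is what allows a single bitstream layout to serve several bitrates.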

TABLE I: Audio quality (ViSQOL) and model complexity (number of parameters and real-time factor) for different capacity trade-offs between encoder and decoder, at 6 kbps.

| Cenc | Cdec | #Params | RTF (enc) | RTF (dec) | ViSQOL |
|---|---|---|---|---|---|
| 32 | 32 | 8.4 M | 2.4× | 2.3× | 4.01 ± 0.03 |
| 16 | 16 | 2.4 M | 7.5× | 7.1× | 3.98 ± 0.03 |
| Smaller encoder | | | | | |
| 16 | 32 | 5.5 M | 7.5× | 2.3× | 4.02 ± 0.03 |
| 8 | 32 | 4.8 M | 18.6× | 2.3× | 3.99 ± 0.03 |
| Smaller decoder | | | | | |
| 32 | 16 | 5.3 M | 2.4× | 7.1× | 3.97 ± 0.03 |
| 32 | 8 | 4.4 M | 2.4× | 17.1× | 3.90 ± 0.03 |

TABLE II: Trade-off between residual vector quantizer depth and codebook size at 6 kbps.

| Number of quantizers Nq | 8 | 16 | 80 |
|---|---|---|---|
| Codebook size N | 1024 | 32 | 2 |
| ViSQOL | 4.01 ± 0.03 | 3.98 ± 0.03 | 3.92 ± 0.03 |

TABLE III: Audio quality (ViSQOL) and real-time factor for different levels of architectural latency, defined by the total striding factor of the encoder/decoder, at 6 kbps.

| Strides | Latency | Nq | RTF (enc) | RTF (dec) | ViSQOL |
|---|---|---|---|---|---|
| (1, 4, 5, 8) | 7.5 ms | 4 | 1.6× | 1.5× | 4.01 ± 0.02 |
| (2, 4, 5, 8) | 13 ms | 8 | 2.4× | 2.3× | 4.01 ± 0.03 |
| (4, 4, 5, 8) | 26 ms | 16 | 4.1× | 4.0× | 4.01 ± 0.03 |
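To see why the configurations in Tables II and III all correspond to 6 kbps, one can multiply the per-frame bit budget, Nq log2 N, by the frame rate, i.e. the sampling rate divided by the product of the strides. The sketch below assumes the 24 kHz sampling rate stated in Section IV-A. For Table II it assumes the default strides (2, 4, 5, 8); for Table III it assumes a codebook size of 1024, which is inferred from the Nq = 8 row that the two tables share rather than stated in Table III. The helper name is made up.

```python
import math


def bitrate_bps(strides, num_quantizers, codebook_size, sample_rate=24000):
    """Bitrate = (bits per frame) * (frames per second)."""
    samples_per_frame = math.prod(strides)          # total striding factor
    bits_per_frame = num_quantizers * math.log2(codebook_size)
    return bits_per_frame * sample_rate / samples_per_frame


# (Nq, N) pairs from Table II, evaluated with the default strides (assumption).
table_ii = [(8, 1024), (16, 32), (80, 2)]
rates_ii = [bitrate_bps((2, 4, 5, 8), nq, n) for nq, n in table_ii]

# (strides, Nq) pairs from Table III, with N = 1024 (assumption).
table_iii = [((1, 4, 5, 8), 4), ((2, 4, 5, 8), 8), ((4, 4, 5, 8), 16)]
rates_iii = [bitrate_bps(strides, nq, 1024) for strides, nq in table_iii]

print(rates_ii, rates_iii)
```

Each Table II row spends 80 bits per frame (8 × 10, 16 × 5, 80 × 1), while the Table III rows scale the number of quantizers in proportion to the frame length so that the product of bits per frame and frame rate stays constant.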
For SoundStream to be competitive with traditional codecs, not only should it provide a better perceptual quality at an equivalent bitrate, but it must also run in real-time on resource-limited hardware. Table I shows how computational efficiency and audio quality are impacted by the number of channels in the encoder $C_{enc}$ and the decoder $C_{dec}$. We measured the real-time factor (RTF), defined as the ratio between the temporal length of the input audio and the time needed for encoding/decoding it with SoundStream. We profiled these models on a single CPU thread of a Pixel4 smartphone. We observe that the default model ($C_{enc} = C_{dec} = 32$) runs in real-time (RTF > 2.3×). Decreasing the model capacity by setting $C_{enc} = C_{dec} = 16$ only marginally affects the reconstruction quality while increasing the real-time factor significantly (RTF > 7.1×). We also investigated configurations with asymmetric model capacities. Using a smaller encoder, it is possible to achieve a significant speedup without sacrificing quality (ViSQOL drops from 3.96 to 3.94, while the encoder RTF increases to 18.6×). Instead, decreasing the capacity of the decoder has a more significant impact on quality (ViSQOL drops from 3.96 to 3.84). This is aligned with recent findings in the field of neural image compression [61], which also adopt a lighter encoder and a heavier decoder.

Vector quantizer depth and codebook size – The number of bits used to encode a single frame is equal to $N_q \log_2 N$, where $N_q$ denotes the number of quantizers and $N$ the codebook size. Hence, it is possible to achieve the same target bitrate for different combinations of $N_q$ and $N$. Table II shows three configurations, all operating at 6 kbps. As expected, using fewer vector quantizers, each with a larger codebook, achieves the highest coding efficiency at the cost of higher computational complexity. Remarkably, using a sequence of 80 1-bit quantizers leads only to a modest quality degradation. This demonstrates that it is possible to successfully train very deep residual vector quantizers without facing optimization issues. On the other side, as discussed in Section III-C, growing the codebook size can quickly lead to unmanageable memory requirements. Thus, the proposed residual vector quantizer offers a practical and effective solution for learning neural codecs operating at high bitrates, as it scales gracefully when using many quantizers, each with a smaller codebook.

Latency – The architectural latency $M$ of the model is defined by the product of the strides, as explained in Section III-A. In our default configuration, $M = 2 \cdot 4 \cdot 5 \cdot 8 = 320$ samples, which means that one frame corresponds to 13.3 ms of audio at 24 kHz. The bit budget allocated to the residual vector quantizer needs to be adjusted based on the target architectural latency. For example, when operating at 6 kbps, the residual
[Figure 8 plots: three panels, (a) encoder conditioning, (b) decoder conditioning, (c) fixed denoiser, showing ViSQOL versus bitrate (kbps). Legend entries: denoising on, denoising on (EC), denoising off, denoising off (EC) in panels (a) and (b); fixed denoiser, fixed denoiser (EC) in panel (c).]
Fig. 8: Performance of SoundStream when performing joint compression and background noise suppression, measured by
ViSQOL scores at different bitrates. We compare three variants: a) flexible denoising, where the conditioning is added at the
encoder side; b) flexible denoising, where the conditioning is added at the decoder side; and c) fixed denoising, where the
model was trained to always produce clean outputs. For all models we also report the potential bitrate savings achievable by
entropy coding (EC). Error bars denote 95% confidence intervals.
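The flexible-denoising variants compared in this figure rely on FiLM conditioning, in which each activation is scaled and shifted by coefficients that a linear layer derives from a two-dimensional one-hot denoising flag. The sketch below is a minimal pure-Python illustration under simplifying assumptions: the coefficients are per channel and shared across the time steps passed in one call, the weights are arbitrary numbers rather than trained parameters (chosen so that the "off" mode is the identity, which is a toy choice and not a claim about the trained model), and all names are hypothetical.

```python
def linear(weights, bias, inputs):
    """y = W @ inputs + b, with W given as a list of rows."""
    return [
        sum(w * x for w, x in zip(row, inputs)) + b
        for row, b in zip(weights, bias)
    ]


def film(activations, one_hot, gamma_params, beta_params):
    """Apply a~[n][c] = gamma[c] * a[n][c] + beta[c].

    `activations` is a list of time steps, each a list of channel values.
    `gamma_params` and `beta_params` are (weights, bias) pairs of the linear
    layers that map the one-hot mode to per-channel coefficients.
    """
    gamma = linear(*gamma_params, one_hot)
    beta = linear(*beta_params, one_hot)
    return [
        [g * a + b for a, g, b in zip(step, gamma, beta)]
        for step in activations
    ]


DENOISE_OFF = [1.0, 0.0]
DENOISE_ON = [0.0, 1.0]

# Two channels; column 0 of each weight matrix is used when denoising is off,
# column 1 when it is on (illustrative values only).
gamma_params = ([[1.0, 0.5], [1.0, 2.0]], [0.0, 0.0])
beta_params = ([[0.0, 0.1], [0.0, -0.2]], [0.0, 0.0])

activations = [[1.0, 2.0], [3.0, 4.0]]
out_off = film(activations, DENOISE_OFF, gamma_params, beta_params)
out_on = film(activations, DENOISE_ON, gamma_params, beta_params)
```

Calling `film` on successive chunks with different one-hot vectors is one way to mimic a time-varying conditioning signal.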

vector quantizer has a budget of 80 bits per frame. If we double the latency, one frame corresponds to 26.6 ms, so the per-frame budget needs to be increased to 160 bits. Table III compares three configurations, all operating at 6 kbps, where the budget is adjusted by changing the number of quantizers, while keeping the codebook size fixed. We observe that these three configurations are equivalent in terms of audio quality. At the same time, increasing the latency of the model significantly increases the real-time factor, as encoding/decoding of a single frame corresponds to a longer audio sample.

E. Joint compression and enhancement

We evaluate a variant of SoundStream that is able to jointly perform compression and background noise suppression, which was trained as described in Section III-F. We consider two configurations, in which the conditioning signal is applied to the embeddings: i) one where the conditioning signal is added at the encoder side, just before quantization; ii) another where it is added at the decoder side. For each configuration, we train models at different bitrates. For evaluation we use 1000 samples of noisy speech, generated as described in Section IV-A, and compute ViSQOL scores when denoising is enabled or disabled, using clean speech references as targets. Figure 8 shows a substantial improvement of quality when denoising is enabled, with no significant difference between denoising either at the encoder or at the decoder. We observe that the proposed model, which is able to flexibly enable or disable denoising at inference time, does not incur a cost in performance, when compared with a model in which denoising is always enabled. This can be seen by comparing Figure 8c with Figure 8a and Figure 8b.

We also investigate whether denoising affects the potential bitrate savings that would be achievable by entropy coding. To evaluate this aspect, we first measured the empirical probability distributions $p_i^{(q)}$, $i = 1 \dots N$, $q = 1 \dots N_q$ on 3200 samples of training data. Then, we measured the empirical distribution $r_i^{(q)}$ on the 1000 test samples and computed the cross-entropy $H(r, p) = -\sum_{i,q} r_i^{(q)} \log_2 p_i^{(q)}$, as an estimate of the bitrate lower bound needed to encode the test samples. Figure 8 shows that both the encoder-side denoising and fixed denoising offer substantial bitrate savings when compared with decoder-side denoising. Hence, applying denoising before quantization leads to a representation that can be encoded with fewer bits.

F. Joint vs. disjoint compression and enhancement

We compare the proposed model, which is able to perform joint compression and enhancement, with a configuration in which compression is performed by SoundStream (with denoising disabled) and enhancement by a dedicated denoising model. For the latter, we adopt SEANet [45], which features a very similar model architecture, with the notable exception of skip connections between encoder and decoder layers and the absence of quantization. We consider two variants: i) one in which compression is followed by denoising (i.e., denoising is applied at the decoder side); ii) another one in which denoising is followed by compression (i.e., denoising is applied at the encoder side).

We evaluate the different models using the VCTK dataset [62], which was neither used for training SoundStream nor SEANet. The input samples are 2 s clips of noisy speech cropped to reduce periods of silence and resampled at 24 kHz. For each of the four input signal-to-noise ratios (0 dB, 5 dB, 10 dB and 15 dB), we run inference on 1000 samples and compute ViSQOL scores. As shown in Table IV, one single model trained for joint compression and enhancement achieves a level of quality that is almost on par with using two disjoint models. Also, the former requires only half of the computational cost and incurs no additional architectural latency, which would be introduced when stacking disjoint models. We also observe that the performance gap decreases as the input SNR increases.

VI. CONCLUSIONS

We propose SoundStream, a novel neural audio codec that outperforms state-of-the-art audio codecs over a wide range of bitrates and content types. SoundStream consists of an
encoder, a residual vector quantizer and a decoder, which are trained end-to-end using a mix of adversarial and reconstruction losses to achieve superior audio quality. The model supports streamable inference and can run in real-time on a single smartphone CPU. When trained with quantizer dropout, a single SoundStream model achieves bitrate scalability with a minimal loss in performance when compared with bitrate-specific models. In addition, we show that it is possible to combine compression and enhancement in a single model without introducing additional latency.

TABLE IV: Comparison of SoundStream as a joint denoiser and codec with SEANet as a denoiser compressed by a SoundStream codec at different signal-to-noise ratios. Uncertainties denote 95% confidence intervals.

| Input SNR | ViSQOL: SoundStream | ViSQOL: SoundStream → SEANet | ViSQOL: SEANet → SoundStream |
|---|---|---|---|
| 0 dB | 2.93 ± 0.02 | 3.02 ± 0.03 | 3.05 ± 0.02 |
| 5 dB | 3.18 ± 0.02 | 3.30 ± 0.02 | 3.31 ± 0.02 |
| 10 dB | 3.42 ± 0.02 | 3.51 ± 0.02 | 3.50 ± 0.02 |
| 15 dB | 3.58 ± 0.02 | 3.64 ± 0.02 | 3.63 ± 0.02 |

ACKNOWLEDGMENTS

The authors thank Yunpeng Li, Dominik Roblek, Félix de Chaumont Quitry and Dick Lyon for their feedback on this work.

REFERENCES

[1] Y. Li, M. Tagliasacchi, O. Rybakov, V. Ungureanu, and D. Roblek, “Real-time speech frequency bandwidth extension,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 691–695.
[2] A. Biswas and D. Jia, “Audio codec enhancement with generative adversarial networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2020, pp. 356–360.
[3] F. Stimberg, A. Narest, A. Bazzica, L. Kolmodin, P. Barrera González, O. Sharonova, H. Lundin, and T. C. Walters, “WaveNetEQ — Packet loss concealment with WaveRNN,” in 54th Asilomar Conference on Signals, Systems, and Computers, 2020, pp. 672–676.
[4] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv:1609.03499, 2016.
[5] W. B. Kleijn, F. S. Lim, A. Luebs, J. Skoglund, F. Stimberg, Q. Wang, and T. C. Walters, “Wavenet based low rate speech coding,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 676–680.
[6] C. Gârbacea, A. van den Oord, Y. Li, F. S. C. Lim, A. Luebs, O. Vinyals, and T. C. Walters, “Low bit-rate speech coding with VQ-VAE and a WaveNet decoder,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 735–739.
[7] J.-M. Valin and J. Skoglund, “LPCNet: improving neural speech synthesis through linear prediction,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2019, pp. 5891–5895.
[8] W. B. Kleijn, A. Storus, M. Chinen, T. Denton, F. S. C. Lim, A. Luebs, J. Skoglund, and H. Yeh, “Generative speech coding with predictive variance regularization,” arXiv:2102.09660, 2021.
[9] J.-M. Valin, K. Vos, and T. B. Terriberry, “Definition of the Opus Audio Codec,” IETF RFC 6716, 2012, https://tools.ietf.org/html/rfc6716.
[10] M. Dietz, M. Multrus, V. Eksler, V. Malenovsky, E. Norvell, H. Pobloth, L. Miao, Z. Wang, L. Laaksonen, A. Vasilache, Y. Kamamoto, K. Kikuiri, S. Ragot, J. Faure, H. Ehara, V. Rajendran, V. Atti, H. Sung, E. Oh, H. Yuan, and C. Zhu, “Overview of the EVS codec architecture,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 5698–5702.
[11] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to-end neural audio generation model,” arXiv:1612.07837, 2017.
[12] A. van den Oord, Y. Li, I. Babuschkin, K. Simonyan, O. Vinyals, K. Kavukcuoglu, G. van den Driessche, E. Lockhart, L. Cobo, F. Stimberg, N. Casagrande, D. Grewe, S. Noury, S. Dieleman, E. Elsen, N. Kalchbrenner, H. Zen, A. Graves, H. King, T. Walters, D. Belov, and D. Hassabis, “Parallel WaveNet: Fast high-fidelity speech synthesis,” in Proceedings of the 35th International Conference on Machine Learning, 2018, pp. 3918–3926.
[13] N. Kalchbrenner, E. Elsen, K. Simonyan, S. Noury, N. Casagrande, E. Lockhart, F. Stimberg, A. van den Oord, S. Dieleman, and K. Kavukcuoglu, “Efficient neural audio synthesis,” arXiv:1802.08435, 2018.
[14] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, “FFTNet: a real-time speaker-dependent neural vocoder,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2251–2255.
[15] K. Kumar, R. Kumar, T. de Boissiere, L. Gestin, W. Z. Teoh, J. Sotelo, A. de Brebisson, Y. Bengio, and A. Courville, “MelGAN: Generative adversarial networks for conditional waveform synthesis,” in Advances in Neural Information Processing Systems, 2019.
[16] J. Kong, J. Kim, and J. Bae, “HiFi-GAN: Generative Adversarial Networks for efficient and high fidelity speech synthesis,” arXiv:2010.05646, 2020.
[17] X. Feng, Y. Zhang, and J. Glass, “Speech feature denoising and dereverberation via deep autoencoders for noisy reverberant speech recognition,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 1759–1763.
[18] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech enhancement generative adversarial network,” arXiv:1703.09452, 2017.
[19] F. G. Germain, Q. Chen, and V. Koltun, “Speech denoising with deep feature losses,” arXiv:1806.10522, 2018.
[20] D. Rethage, J. Pons, and X. Serra, “A WaveNet for speech denoising,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5069–5073.
[21] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement with generative adversarial networks for robust speech recognition,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5024–5028.
[22] T. Ishii, H. Komiyama, T. Shinozaki, Y. Horiuchi, and S. Kuroiwa, “Reverberant speech recognition based on denoising autoencoder,” in Interspeech, 2013, pp. 3512–3516.
[23] D. S. Williamson and D. Wang, “Time-frequency masking in the complex domain for speech dereverberation and denoising,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, pp. 1492–1501, 2017.
[24] T. Y. Lim, R. A. Yeh, Y. Xu, M. N. Do, and M. Hasegawa-Johnson, “Time-frequency networks for audio super-resolution,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 646–650.
[25] S. Lloyd, “Least squares quantization in PCM,” IEEE Transactions on Information Theory, vol. 28, pp. 129–137, 1982.
[26] Y. Linde, A. Buzo, and R. Gray, “An algorithm for vector quantizer design,” IEEE Transactions on Communications, vol. 28, pp. 84–95, 1980.
[27] J. MacQueen, “Some methods for classification and analysis of multivariate observations,” Proceedings of the Fifth Berkeley Symposium on Mathematical Statistics and Probability, pp. 281–297, 1967.
[28] R. Gray, “Vector quantization,” IEEE ASSP Magazine, vol. 1, pp. 4–29, 1984.
[29] J. Makhoul, S. Roucos, and H. Gish, “Vector quantization in speech coding,” Proceedings of the IEEE, vol. 73, pp. 1551–1588, 1985.
[30] M. Schroeder and B. Atal, “Code-excited linear prediction (CELP): High-quality speech at very low bit rates,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1985, pp. 937–940.
[31] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” arXiv:1711.00937, 2017.
[32] A. Razavi, A. van den Oord, and O. Vinyals, “Generating diverse high-fidelity images with VQ-VAE-2,” arXiv:1906.00446, 2019.
[33] S. Dieleman, A. van den Oord, and K. Simonyan, “The challenge of realistic music generation: Modelling raw audio at scale,” arXiv:1806.10474, 2018.
[34] P. Dhariwal, H. Jun, C. Payne, J. W. Kim, A. Radford, and I. Sutskever, “Jukebox: A generative model for music,” arXiv:2005.00341, 2020.
[35] B.-H. Juang and A. Gray, “Multiple stage vector quantization for speech coding,” in IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), 1982, pp. 597–600.
[36] A. Vasuki and P. Vanathi, “A review of vector quantization techniques,” IEEE Potentials, vol. 25, pp. 39–47, 2006.
[37] S. Morishima, H. Harashima, and Y. Katayama, “Speech coding based on a multi-layer neural network,” in IEEE International Conference on Communications, Including Supercomm Technical Sessions, 1990, pp. 429–433.
[38] S. Kankanahalli, “End-to-end optimized speech coding with deep neural networks,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 2521–2525.
[39] A. Polyak, Y. Adi, J. Copet, E. Kharitonov, K. Lakhotia, W.-N. Hsu, A. Mohamed, and E. Dupoux, “Speech resynthesis from discrete disentangled self-supervised representations,” arXiv:2104.00355, 2021.
[40] K. Zhen, J. Sung, M. S. Lee, S. Beack, and M. Kim, “Cascaded cross-module residual learning towards lightweight end-to-end speech coding,” arXiv:1906.07769, 2019.
[41] J. Casebeer, V. Vale, U. Isik, J.-M. Valin, R. Giri, and A. Krishnaswamy, “Enhancing into the codec: Noise robust speech coding with vector-quantized autoencoders,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2021, pp. 711–715.
[42] D.-A. Clevert, T. Unterthiner, and S. Hochreiter, “Fast and accurate deep network learning by exponential linear units (ELUs),” arXiv:1511.07289, 2015.
[43] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: a simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, pp. 1929–1958, 2014.
[44] A. Baevski, H. Zhou, A. Mohamed, and M. Auli, “wav2vec 2.0: A framework for self-supervised learning of speech representations,” arXiv:2006.11477, 2020.
[45] M. Tagliasacchi, Y. Li, K. Misiunas, and D. Roblek, “SEANet: A multi-modal speech enhancement network,” in Interspeech, 2020, pp. 1126–1130.
[46] Y. Blau and T. Michaeli, “The perception-distortion tradeoff,” in IEEE/CVF Conference on Computer Vision and Pattern Recognition, 2018, pp. 6228–6237.
[47] J. Engel, L. H. Hantrakul, C. Gu, and A. Roberts, “DDSP: Differentiable digital signal processing,” arXiv:2001.04643, 2020.
[48] A. A. Gritsenko, T. Salimans, R. van den Berg, J. Snoek, and N. Kalchbrenner, “A spectral energy distance for parallel speech synthesis,” arXiv:2008.01160, 2020.
[49] E. Perez, F. Strub, H. de Vries, V. Dumoulin, and A. Courville, “FiLM: Visual reasoning with a general conditioning layer,” Proceedings of the AAAI Conference on Artificial Intelligence, vol. 32, pp. 3942–3951, 2018.
[50] H. Zen, V. Dang, R. Clark, Y. Zhang, R. J. Weiss, Y. Jia, Z. Chen, and Y. Wu, “LibriTTS: a corpus derived from LibriSpeech for text-to-speech,” arXiv:1904.02882, 2019.
[51] E. Fonseca, J. Pons Puig, X. Favory, F. Font Corbera, D. Bogdanov, A. Ferraro, S. Oramas, A. Porter, and X. Serra, “Freesound datasets: a platform for the creation of open audio datasets,” in Proceedings of the 18th ISMIR Conference, 2017, pp. 486–493.
[52] E. Law, K. West, M. I. Mandel, M. Bay, and J. S. Downie, “Evaluation of algorithms using games: The case of music tagging,” in ISMIR, 2009, pp. 387–392.
[53] ITU-R, Recommendation BS.1534-1: Method for the subjective assessment of intermediate quality level of coding systems, International Telecommunications Union, 2001.
[54] ITU, “Perceptual evaluation of speech quality (PESQ): an objective method for end-to-end speech quality assessment of narrow-band telephone networks and speech codecs,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.862, 2001.
[55] ITU, “Perceptual objective listening quality assessment,” Int. Telecomm. Union, Geneva, Switzerland, ITU-T Rec. P.863, 2018.
[56] A. Hines, J. Skoglund, A. Kokaram, and N. Harte, “ViSQOL: The virtual speech quality objective listener,” in International Workshop on Acoustic Signal Enhancement (IWAENC), 2012, pp. 1–4.
[57] M. Chinen, F. S. C. Lim, J. Skoglund, N. Gureev, F. O’Gorman, and A. Hines, “ViSQOL v3: an open source production ready objective speech and audio metric,” in Twelfth International Conference on Quality of Multimedia Experience (QoMEX), 2020, pp. 1–6.
[58] W3C, “WebRTC 1.0: Real-time communication between browsers,” 2019, https://www.w3.org/TR/webrtc/.
[59] C. Holmberg, S. Håkansson, and G. Eriksson, “Web real-time communication use cases and requirements,” IETF RFC 7478, Mar. 2015, https://tools.ietf.org/html/rfc7478.
[60] B. Bessette, R. Salami, R. Lefebvre, M. Jelinek, J. Rotola-Pukkila, J. Vainio, H. Mikkola, and K. Jarvinen, “The adaptive multirate wideband speech codec (AMR-WB),” IEEE Transactions on Speech and Audio Processing, vol. 10, pp. 620–636, 2002.
[61] F. Mentzer, G. Toderici, M. Tschannen, and E. Agustsson, “High-fidelity generative image compression,” arXiv:2006.09965, 2020.
[62] J. Yamagishi, C. Veaux, and K. MacDonald, “CSTR VCTK Corpus: English multi-speaker corpus for CSTR voice cloning toolkit (version 0.92),” 2019.

