SoundStream: An End-to-End Neural Audio Codec
Neil Zeghidour, Alejandro Luebs, Ahmed Omran, Jan Skoglund, Marco Tagliasacchi
Abstract—We present SoundStream, a novel neural audio
codec that can efficiently compress speech, music and general
audio at bitrates normally targeted by speech-tailored codecs.
SoundStream relies on a model architecture composed of a fully
convolutional encoder/decoder network and a residual vector
quantizer, which are trained jointly end-to-end. Training leverages
recent advances in text-to-speech and speech enhancement,
which combine adversarial and reconstruction losses to allow
the generation of high-quality audio content from quantized
embeddings. By training with structured dropout applied to
quantizer layers, a single model can operate across variable
bitrates from 3 kbps to 18 kbps, with a negligible quality loss
when compared with models trained at fixed bitrates. In addition,
the model is amenable to a low-latency implementation, which
supports streamable inference and runs in real time on a
smartphone CPU. In subjective evaluations using audio at a 24 kHz
sampling rate, SoundStream at 3 kbps outperforms Opus at
12 kbps and approaches EVS at 9.6 kbps. Moreover, we are able to
perform joint compression and enhancement either at the encoder
or at the decoder side with no additional latency, which we
demonstrate through background noise suppression for speech.
arXiv:2107.03312v1 [cs.SD] 7 Jul 2021
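To make the idea of a residual vector quantizer with a variable number of layers concrete, the following is a minimal sketch in plain Python, not the model's actual implementation. Each layer quantizes the residual left by the previous layers, so using only the first few layers yields a coarser reconstruction at a lower bitrate. The toy 2-D codebooks built by `grid`, all function names, and the values used in the bitrate arithmetic (a codebook of 1024 entries and 75 embeddings per second) are illustrative assumptions that are not stated in the abstract. In training with quantizer dropout, the number of active layers would be drawn at random for each example; here it is simply passed as `n_layers`.

```python
import math


def grid(scale):
    """Toy 4-entry 2-D codebook (hypothetical; real codebooks are learned)."""
    return [[sx * scale, sy * scale] for sx in (-1, 1) for sy in (-1, 1)]


def squared_error(a, b):
    """Squared Euclidean distance between two vectors."""
    return sum((u - v) ** 2 for u, v in zip(a, b))


def nearest(codebook, x):
    """Index of the codebook vector closest to x."""
    return min(range(len(codebook)), key=lambda i: squared_error(codebook[i], x))


def rvq_encode(codebooks, x, n_layers=None):
    """Quantize x with the first n_layers codebooks (all of them if None).

    Each layer encodes the residual left over by the previous layers.
    """
    residual = list(x)
    indices = []
    for codebook in codebooks[:n_layers]:
        i = nearest(codebook, residual)
        indices.append(i)
        residual = [r - c for r, c in zip(residual, codebook[i])]
    return indices


def rvq_decode(codebooks, indices):
    """Sum the selected codebook vectors, one per layer that was used."""
    out = [0.0] * len(codebooks[0][0])
    for codebook, i in zip(codebooks, indices):
        out = [o + c for o, c in zip(out, codebook[i])]
    return out


def bitrate_bps(n_layers, codebook_size, frames_per_second):
    """Bits per second when each layer emits one index per embedding."""
    return n_layers * math.log2(codebook_size) * frames_per_second


if __name__ == "__main__":
    codebooks = [grid(1.0), grid(0.5), grid(0.25)]  # coarse to fine
    x = [0.8, -0.3]
    for n in (1, 2, 3):
        idx = rvq_encode(codebooks, x, n_layers=n)
        err = squared_error(x, rvq_decode(codebooks, idx))
        print(n, idx, round(err, 4))
    # Assumed configuration: 1024-entry codebooks, 75 embeddings per second.
    for n in (4, 8, 24):
        print(n, "layers:", bitrate_bps(n, 1024, 75), "bps")
```

Because the indices produced with fewer layers are a prefix of those produced with more layers, a decoder can reconstruct audio from whatever number of layers the target bitrate allows. Under the assumed configuration, 4 layers correspond to 3 kbps and 24 layers to 18 kbps, which matches the bitrate range quoted above.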