The Future of
DALL-E 2: a record player that receives text on one side and outputs sound waves on the other side, digital art.
Hello readers,
In this article, we will dive deep into a new and exciting text-to-speech model developed by Microsoft Research, called VALL-E. The paper presenting the work was released on January 5, 2023, and has since been gaining much attention online. It is worth noting that, as of writing this article, no pre-trained model has been released, and currently the only way to battle-test this model is to train it yourself.
Nevertheless, the idea presented in this paper is novel and interesting and worth
digging into, regardless of whether I can immediately clone my voice with it or not.
This article will be organized as follows: first, a short overview of how modern text-to-speech systems work; then a high-level look at VALL-E; a deeper dive into the Encodec model it builds on; a look at VALL-E’s training procedure and data; and finally a hands-on attempt to run the unofficial implementation with my own voice.
Fast-forward to the era of deep learning. Nowadays, the dominant strategy in text-
to-speech synthesis is summarized in Figure 1. Let’s go over its different parts.
First, we have a phonemizer that transforms text into phonemes. Phonemes are a textual representation of the pronunciation of words (for example, the word tomato has different phonemes in an American and a British accent), and this representation helps the downstream model achieve better results.
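As a small aside, this step is easy to play with on its own. Below is a minimal sketch using the open-source phonemizer package (assuming it and an espeak backend are installed; the printed outputs are only indicative):

# Phonemization sketch: the same word maps to different phoneme strings per accent.
from phonemizer import phonemize

print(phonemize("tomato", language="en-us", backend="espeak"))  # American pronunciation
print(phonemize("tomato", language="en-gb", backend="espeak"))  # British pronunciation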
Afterward, we have an acoustic model that transforms these phonemes into a Mel spectrogram, which is a representation of audio in the time × frequency domain. A spectrogram is obtained by applying a short-time Fourier transform (STFT) to overlapping time windows of a raw audio waveform (here is an excellent explanation of the Mel spectrogram: https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53). Of course, in this case the spectrogram is created by a statistical model, as no input audio exists at synthesis time in text-to-speech. Examples of recent model architectures include Tacotron2, DeepVoice 3, and TransformerTTS.
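To make this concrete, here is a minimal sketch of computing a Mel spectrogram with librosa (the file name and STFT parameters are illustrative choices, not values from the paper):

# Compute a Mel spectrogram: STFT over overlapping windows, then a Mel filter bank.
import librosa
import numpy as np

y, sr = librosa.load("speech.wav", sr=24000)  # raw waveform at 24 kHz
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as TTS models typically use
print(mel_db.shape)  # (n_mels, n_frames): frequency x time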
The final stage is the conversion of the Mel spectrogram into a waveform. A waveform is usually sampled at 24/48 kHz, where each sample is digitized into a 16-bit number. These numbers represent the air pressure at each moment in time, which is the sound we eventually hear in our ears. Why can’t we just deterministically convert the spectrogram into a waveform? Because it requires major upsampling in the time domain, which means creating information that doesn’t explicitly exist in the spectrogram, and because spectrograms don’t contain phase information (only the magnitude of each frequency). So, as in the conversion of phonemes to a Mel spectrogram, here as well we need a statistical model to convert the spectrogram into a waveform, and these models are called vocoders. Examples of vocoders include WaveNet, WaveRNN, and MelGAN.
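To hear why this step really benefits from a learned model, you can try the deterministic route yourself: librosa can invert a Mel spectrogram with Griffin-Lim phase estimation, and the result typically sounds noticeably robotic compared to a neural vocoder (again, file names and parameters here are illustrative):

# Deterministic inversion: estimate the missing phase with Griffin-Lim and resynthesize.
import librosa
import soundfile as sf

y, sr = librosa.load("speech.wav", sr=24000)
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
y_rec = librosa.feature.inverse.mel_to_audio(mel, sr=sr, n_fft=1024, hop_length=256)
sf.write("speech_griffin_lim.wav", y_rec, sr)  # compare this by ear to the original file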
Additionally, there are recent models such as VITS and YourTTS, which employ an end-to-end model to generate waveforms from text input. Another example of such an end-to-end system is the paper “End-to-End Adversarial Text-to-Speech” by DeepMind (excellently explained by Yannic Kilcher here: https://www.youtube.com/watch?v=WTB2p4bqtXU). In this paper, they employ a GAN-like training procedure to produce realistic speech sound waves. They also need to tackle the alignment problem, which is the degree to which word utterances in the generated samples align in time with the same word utterances in the ground-truth samples. This problem does not “solve on its own” and requires explicit handling in the model architecture.
The main drawback of these end-to-end TTS models is their complexity. Text and speech are very different modalities, which forces these models to tackle problems such as alignment, speaker identity, and language in an explicit manner, making them highly complex. The charm of VALL-E, which we will soon dive into, is that it takes the relative simplicity of generative language models and employs it creatively in the field of speech generation. For people like me, who are new to the field of TTS and speech in general but have some experience in NLP, it offers a good entry point into this fascinating field.
This short overview did not do justice to the immense field of TTS, which one can
spend a lifetime studying and understanding (I do encourage you to dive a bit
deeper). Yet, we are here today to talk about VALL-E, so allow me to jump straight to
it.
Figure 2. The high-level structure of VALL-E. Image taken from the original paper [1].
The 3-second acoustic prompt, which the output speech is conditioned on, is fed into an audio codec encoder. In VALL-E they use a pre-trained audio encoder for this: Encodec, developed by Facebook Research (https://arxiv.org/abs/2210.13438). Encodec takes a speech waveform as input and outputs a compressed discrete representation of it via residual vector quantization (RVQ), using an encoder-decoder neural architecture. We will dive into Encodec in Part 3 of this article, but for now, let’s just assume that it outputs a discrete representation of the audio signal by splitting it into fixed time windows and assigning each window a representation from a known vocabulary of audio embeddings (conceptually, very similar to word embeddings).
Once the model receives these two inputs, it can act as an autoregressive language
model and output the next discrete audio representation. Because the audio
representations come from a fixed vocabulary that was learned by Encodec, we can think of this simply as predicting the next word in a sentence out of a fixed vocabulary of words (a fixed vocabulary of sound representations, in our case). After these sound
representations are predicted they are transformed back into the original waveform
representation using the Decoder part of the Encodec model.
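As a preview of Part 3, here is roughly what that encode/decode round trip looks like with the encodec package released by Facebook Research; I am following its published usage example from memory, so treat the details (and the prompt.wav file name) as illustrative:

# Encode a waveform into discrete codes with a pre-trained Encodec model, then decode it back.
import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()  # pre-trained 24 kHz model
model.set_target_bandwidth(6.0)             # 6 kbps, which corresponds to 8 codebooks

wav, sr = torchaudio.load("prompt.wav")
wav = convert_audio(wav, sr, model.sample_rate, model.channels).unsqueeze(0)

with torch.no_grad():
    encoded_frames = model.encode(wav)                                 # list of (codes, scale) tuples
    codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)  # [batch, n_codebooks, n_timesteps]
    wav_rec = model.decode(encoded_frames)                             # back to a waveform tensor

print(codes.shape)  # roughly 75 timesteps per second of audio, 8 codes per timestep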
In the results section of the VALL-E paper, they have shown that they outperform
the previous state-of-the-art zero-shot TTS model, YourTTS, on the LibriSpeech dataset on several metrics, which include human-based evaluations such as similarity mean opinion score (SMOS) and algorithm-based evaluations such as word error rate
(WER). In an interesting ablation study, they show that the phoneme prompt
contributes to the content of generation (by reducing WER) and the audio prompt
contributes to speaker similarity (by improving a speaker similarity metric).
We will now dive into the Encodec model which is responsible for converting audio
to discrete tokens and back and is the enabler for using a language model approach
to audio generation in this paper.
On the far left we have our original waveform, which is sampled at 24/48 kHz, where each sample is represented by 16 bits (65,536 possible values). The raw signal gets passed into the encoder, which includes 1D convolution operations for downsampling and a two-layer LSTM for sequence modeling. The output of the encoder is 75/150 latent timesteps per second (compare this to the original 24/48K samples per second!), with a depth dimension of 128.
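A quick back-of-the-envelope calculation (mine, not the paper’s) shows how aggressive this compression is for the 24 kHz case:

# 24,000 samples per second in, 75 latent timesteps per second out:
sample_rate = 24_000
latent_rate = 75
print(sample_rate // latent_rate)  # 320 raw samples summarized by each 128-dimensional latent vector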
The interesting bit here is, of course, the quantizer. How does Encodec quantize the
continuous domain of sound? Using a technique called residual vector quantization
(RVQ) which consists of projecting an input vector onto the closest entry in a
codebook of a given size. Let’s break that sentence down. First, what is a codebook?
In the case of VALL-E, a codebook is a dictionary with 1024 entries, where each entry is a vector of size 128. Our goal in vector quantization is to map a certain vector to the closest vector in the codebook (by Euclidean distance, for example), after which it can be represented by the index of that vector in the codebook (assuming everyone has access to the codebook). Of course, in this way,
we lose a lot of information. What if no vector in the codebook accurately resembles
our vector? Hence the “residual” in RVQ!
In Figure 5, I show how a vector is quantized using residual vector quantization. In
the example, we have 3 codebooks. The input vector is compared to each of the
vectors in the first codebook and assigned to the closest one (C1,1). Then, the
residual between C1,1 and the input is calculated, and we try to match that
residual to the next codebook, and so on until we reach the end of our codebooks.
The final RVQ representation is the indices that were matched in each of the
codebooks (1, 3, 2 in our example). This encoding method is extremely efficient. If we have 8 codebooks, each containing 1024 entries, we can represent 1024⁸≈1.2e+24 different vectors while storing only 1024*8=8192 codebook entries, and each encoded vector needs just 8 indices! Of course, the sender and receiver must hold the same codebooks for this quantization method to
work. If you want to learn more about RVQ, such as how the codebooks are trained,
I recommend reading another paper that Encodec is based on, called SoundStream: https://arxiv.org/abs/2107.03312 (yes, this is a rabbit hole).
Figure 5. Example of Residual Vector Quantization. Image by author.
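If you prefer code to diagrams, below is a toy NumPy sketch of the same idea: each codebook quantizes the residual left over by the previous ones, and decoding just sums the selected entries (random codebooks here, purely for illustration):

# Toy residual vector quantization: encode a vector as one index per codebook.
import numpy as np

rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(8)]  # 8 codebooks, 1024 entries of dim 128

def rvq_encode(x, codebooks):
    residual = x.copy()
    indices = []
    for cb in codebooks:
        idx = int(np.argmin(np.linalg.norm(cb - residual, axis=1)))  # closest entry by Euclidean distance
        indices.append(idx)
        residual = residual - cb[idx]  # what is left for the next codebook to explain
    return indices

def rvq_decode(indices, codebooks):
    return sum(cb[i] for cb, i in zip(codebooks, indices))  # approximate reconstruction

x = rng.normal(size=128)
indices = rvq_encode(x, codebooks)
x_hat = rvq_decode(indices, codebooks)
print(indices, np.linalg.norm(x - x_hat) / np.linalg.norm(x))  # 8 indices and the relative error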
Back to the Encodec pipeline in Figure 4, let’s note three additional details that are relevant to its training procedure:
1. Mel spectrograms are created both from the input audio and from the generated audio. These spectrograms are compared, and the signal from the comparison is used as a loss to guide the model’s training.
Eight different codebooks are used in VALL-E as part of the Encodec model, where each codebook consists of 1024 entries. The codes from the first quantizer (codebook) are processed by the AR model according to Equation 1. Let’s first clarify some terminology here: x denotes the phoneme prompt, C̃ denotes the codes of the 3-second acoustic prompt, c_t,1 denotes the code of timestep t in the first quantizer, and c_<t,1 denotes the first-quantizer codes of all earlier timesteps.
So, Equation 1 shows that the output for the first quantizer is conditioned on the
input data, and on the previous timesteps’ outputs for the first quantizer (just like
an autoregressive language model).
Equation 1. The autoregressive model, applied to the first quantizer. Image taken from the original paper [1].
Equation 2. The non-autoregressive model, applied to the second to eighth quantizers. Image taken from the original paper [1].
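Since the equations appear as images in the original post, here is my transcription of the two factorizations in the paper’s notation (check the paper [1] for the exact form):

% Equation 1 (AR): the first-quantizer codes, generated one timestep at a time.
p(\mathbf{c}_{:,1} \mid \mathbf{x}, \tilde{\mathbf{C}}_{:,1}; \theta_{AR})
  = \prod_{t} p(c_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{C}}_{:,1}, \mathbf{x}; \theta_{AR})

% Equation 2 (NAR): quantizers 2 to 8, each generated for all timesteps at once,
% conditioned on the phonemes, the full prompt, and all previous quantizers.
p(\mathbf{C}_{:,2:8} \mid \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR})
  = \prod_{j=2}^{8} p(\mathbf{c}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR})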
Equations 1 and 2 are visually depicted in Figure 6, which shows the AR and NAR models together and highlights the differences between them. We see that the AR transformer is used to predict only C:,₁, the tokens of the first quantizer. While doing so, it attends to the previous tokens it has generated. The NAR transformer, in contrast, attends to the previous quantizers and not to previous timesteps (the previous tokens of the current quantizer are not available in the NAR model).
Figure 6. AR and NAR models in VALL-E. Image taken from the original paper [1].
VALL-E has been trained on 60K hours of audio from the LibriLight dataset, containing 7,000 distinct speakers (over 100 times more data than previous state-of-the-art systems used). The dataset is audio-only, so for labeling, an automatic speech recognition model was used. The Encodec model is used as a pre-trained component, and as far as I could understand, no fine-tuning is performed on it for VALL-E.
For training, random 10 to 20 second samples were taken from LibriLight. For the acoustic prompt, another 3 seconds were taken from the same utterance. They used 16 Tesla V100 GPUs to train the model, which is very modest compared to large SOTA language models!
We learned about the procedure and data, so now let’s try to use the unofficial PyTorch implementation of VALL-E on GitHub. I had two goals:
1. I wanted to replicate their “hello world” experiment with my own voice, just to see that the pipeline is working properly.
2. I wanted to replicate the Paperspace experiment, that is, to train the model on recordings of my own voice and see whether it could clone it.
Why the limited experiments? Because training this model from scratch takes many
resources which I don’t currently have, plus I assume a pre-trained model will be
released sooner or later.
So, did I succeed? I managed to replicate the “hello world” experiment, but unfortunately, I didn’t manage to replicate the Paperspace experiment: I just got a model that creates a garbled sound which vaguely resembles my voice. This is probably because of a lack of resources (I am training it on a Google Colab instance which times out after 12 hours). But still, I want to go over the process with you. My version of the VALL-E notebook is here: https://colab.research.google.com/drive/1NNOsvfiOfGeV-BBgGkwf0pyGAwAgx3Gi#scrollTo=SbWtNBVg_Tfd.
You will see a directory called vall-e in your file browser. The path content/vall-e/data/test contains the data for the “hello world” experiment. Notice that it contains two files, because for some reason it breaks with only one. To replicate this experiment, simply delete the files in the data directory using !rm content/vall-e/data/test/*, record yourself saying “hello world”, and save it as two .wav files with different names. Put the .wav files in the data directory along with two text files containing the words “hello world” (the text files should have the same names as the .wav files, with a .normalized.txt suffix).
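For reference, after this step my data directory looked roughly like the listing below; the file names are just the ones I picked, and anything works as long as each .wav has a matching .normalized.txt:

content/vall-e/data/test/
├── hello1.wav                # first "hello world" recording
├── hello1.normalized.txt     # contains the text: hello world
├── hello2.wav                # second recording
└── hello2.normalized.txt     # contains the text: hello world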
The first cell will run the Encodec model on your own data and perform
quantization, just as we discussed earlier. The second cell will convert the text “hello
world” into phonemes.
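If I recall the unofficial repo’s README correctly (do double-check it there), these two cells boil down to the following commands, run from inside the vall-e directory:

!python -m vall_e.emb.qnt data/test   # Encodec quantization of the .wav files
!python -m vall_e.emb.g2p data/test   # grapheme-to-phoneme conversion of the .normalized.txt files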
Afterward, the processed data is ready and you can run the cells that perform the training procedure. There is separate training for the NAR and AR models (remember that, as we saw earlier, the NAR model depends on the AR model, but the AR model uses and produces only the first quantizer’s data and is thus independent of the NAR model).
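Again going by the unofficial repo’s README (verify the exact paths there), the two training cells are simply the train entry point pointed at the two configs:

!python -m vall_e.train yaml=config/test/ar.yml    # trains the AR model (first quantizer)
!python -m vall_e.train yaml=config/test/nar.yml   # trains the NAR model (quantizers 2 to 8)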
After the model has finished training, run this cell:
!mkdir -p zoo
!python -m vall_e.export zoo/ar.pt yaml=config/test/ar.yml
!python -m vall_e.export zoo/nar.pt yaml=config/test/nar.yml
This saves the latest AR and NAR checkpoints (which are created automatically during training) into a directory called zoo.
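The generation cell then looks roughly like this, as far as I recall from the repo’s README (the prompt path is whichever .wav you recorded earlier, so adjust it to your file names):

!python -m vall_e "hello world" data/test/hello1.wav toy.wav --ar-ckpt zoo/ar.pt --nar-ckpt zoo/nar.pt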
This will run the model with a text prompt of “hello world” and an audio prompt of the same utterance. It will save the generated sample as toy.wav, which you can then listen to right inside the notebook.
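A minimal way to do that in Colab is IPython’s built-in audio widget:

from IPython.display import Audio
Audio("toy.wav")  # renders an inline player for the generated sample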
And that’s it! You have created your own VALL-E “hello world”. Unless you have a lot of computing resources, it is probably best to wait for a pre-trained model to come around before making further use of this model.
We also talked about the Encodec model, which performs audio quantization and is
used as a pre-trained model in the training of VALL-E. Encodec is fascinating in
itself and manages to create super-condensed audio representations using residual
vector quantization. The creators of VALL-E leveraged this feature and built a
generative “language” model on top of this quantization.
Finally, we saw some code and replicated the “hello world” experiment from the
unofficial code with our own voice. The official code for this paper hasn’t been
released, nor has a model checkpoint been released. It would be interesting to see
and use a pre-trained model for VALL-E, which I assume will turn up sooner or later.
Nevertheless, this was an interesting learning journey.
Elad
References
[1] Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers (the VALL-E paper): https://arxiv.org/abs/2301.02111