
VALL-E — The Future of Text to Speech?

A paper walkthrough of the new text-to-speech model by Microsoft Research

Elad Rapaport


Published in Towards Data Science
15 min read · Apr 15

Header image generated with DALL-E 2: a record player that receives text on one side and outputs sound waves on the other side, digital art.

Hello readers,

In this article, we will dive deep into a new and exciting text-to-speech model
developed by Microsoft Research, called VALL-E. The paper presenting the work was
released on January 5, 2023, and has since gained much attention online. It is worth
noting that, as of this writing, no pre-trained model has been released, so the only
way to battle-test this model is to train it yourself.

Nevertheless, the idea presented in this paper is novel and interesting and worth
digging into, regardless of whether I can immediately clone my voice with it or not.
This article will be organized as follows:

Part 1 — Introduction to Text to Speech, Basic Concepts

Part 2 — VALL-E: Text to Speech as a Language Model

Part 3 — Encodec: The Workhorse Behind VALL-E

Part 4 — Problem formulation and training of VALL-E

Part 5 — Some Coding

Part 6 — Conclusions & Thoughts Ahead

Part 1 — Introduction to Text to Speech, Basic Concepts


The technology of text-to-speech is not new and has been around since the “Voder”
— the first electronic voice synthesizer from Bell Labs in 1939 which required
manual operation. Since then, the field has seen incredible developments and up
until ~2017, the dominant technology was concatenative speech synthesis. This
technology is based on the concatenation of pre-recorded speech segments to
create intelligible speech. Although this technology can produce lifelike results, its
drawbacks are obvious — it cannot generate new voices which don’t exist in the pre-
recorded database, and it cannot generate speech with a different tone or emotion.

Fast-forward to the era of deep learning. Nowadays, the dominant strategy in text-
to-speech synthesis is summarized in Figure 1. Let’s go over its different parts.

Figure 1. A modern neural text-to-speech pipeline. Image by author.

First, we have a phonemizer that transforms text into phonemes. Phonemes are
a textual representation of the pronunciation of words (for example — the word
tomato will have different phonemes in an American and British accent), and
this representation helps the downstream model achieve better results.
Afterward, we have an acoustic model which transforms these phonemes into a
Mel spectrogram, a representation of audio in the time × frequency domain.
A spectrogram is obtained by applying a short-time Fourier transform
(STFT) on overlapping time windows of a raw audio waveform (here is an
excellent explanation of the Mel spectrogram:
https://medium.com/analytics-vidhya/understanding-the-mel-spectrogram-fca2afa2ce53).
Of course, in this case the spectrogram is created by a statistical model, since
there is no input audio to compute it from in text-to-speech. Examples
of recent acoustic model architectures include Tacotron 2, Deep Voice 3, and
TransformerTTS.

The final stage is the conversion of the Mel spectrogram into a waveform. A
waveform is usually sampled at 24/48 kHz, where each sample is digitized into a
16-bit number. These numbers represent the amount of air pressure at each
moment in time, which is the sound we eventually hear in our ears. Why can't
we just deterministically convert the spectrogram into a waveform? Because it
requires major upsampling in the time domain, which means creating
information that doesn't explicitly exist in the spectrogram, and also because
spectrograms contain only magnitude information and no phase. So, as in the
conversion of phonemes to a Mel spectrogram, here as well we need a statistical
model to convert the spectrogram into a waveform; these models are called
vocoders. Examples of vocoders include WaveNet, WaveRNN, and MelGAN.
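To make the Mel spectrogram representation concrete, here is a minimal sketch of computing one from a waveform with librosa. The parameter values (n_fft, hop_length, n_mels) are illustrative and not taken from any specific TTS system:

import librosa
import numpy as np

# Load any mono waveform (librosa ships a few example clips)
y, sr = librosa.load(librosa.example("trumpet"), sr=24000)

# Mel spectrogram: STFT over overlapping windows, then a Mel filter bank
mel = librosa.feature.melspectrogram(y=y, sr=sr, n_fft=1024, hop_length=256, n_mels=80)
mel_db = librosa.power_to_db(mel, ref=np.max)  # log scale, as acoustic models typically use

print(mel_db.shape)  # (n_mels, n_frames): the time x frequency representation described above

A vocoder's job is to invert exactly this kind of representation back into a waveform, filling in the phase and fine time detail that the spectrogram does not contain.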

Additionally, there are recent models such as VITS and YourTTS, which employ an
end-to-end model to generate waveforms from text input. Another example of such
an end-to-end system is a paper titled “End-to-End Adversarial Text-to-Speech” by
DeepMind (which is excellently explained by Yannic Kilcher here:
https://www.youtube.com/watch?v=WTB2p4bqtXU). In this paper, they employ a
GAN-like training procedure to produce realistic speech sound waves. They also
need to tackle the alignment problem, which is the degree to which word utterances
in the generated samples align in time with the same word utterances in the ground
truth samples. This problem does not “solve on its own” and requires explicit
handling in the model architecture.

The main drawback of these end-to-end TTS models is their complexity. Text and
speech are very different modalities, and bridging them requires models that tackle
problems such as alignment, speaker identity, and language in an explicit manner,
which makes these models highly intricate. The charm of VALL-E, which
we will soon dive into, is that it takes the relative simplicity of generative language
models and employs them creatively in the field of speech generation. For people
like me who are new to the field of TTS and speech in general and have some
experience in NLP, it allows a good entry point into this fascinating field.

This short overview did not do justice to the immense field of TTS, which one can
spend a lifetime studying and understanding (I do encourage you to dive a bit
deeper). Yet, we are here today to talk about VALL-E, so allow me to jump straight to
it.

Part 2 — VALL-E: Text to Speech as a Language Model


As in other text-to-speech systems, the input to VALL-E is phonemicized text, and
the output is the corresponding sound waveform. Additionally, VALL-E employs a
prompting mechanism in which a 3-second audio sample is fed as additional input
to the model. This allows the generation of a speech utterance of the input text
which is conditioned on the given audio prompt — in practice, this means the ability
to perform zero-shot speech generation, which is the generation of speech from a
voice unseen in the training data. The high-level structure of VALL-E is presented in
Figure 2.

Figure 2. The high-level structure of VALL-E. Image taken from the original paper [1].

Let’s understand what happens in this pipeline. First, we have a phoneme
conversion of the text, which, as we saw already, is a standard procedure and
doesn’t require any learning mechanism. In order to process these phonemes by the
model we have a phoneme embedding layer that takes as input a vector of indices
into the phoneme vocabulary and outputs a matrix of embeddings corresponding to
the input indices.
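As a rough sketch, such a phoneme embedding layer can be thought of as a standard embedding lookup, for example in PyTorch. The vocabulary size and embedding dimension below are made up for illustration and are not the values used in the paper:

import torch
import torch.nn as nn

phoneme_vocab_size = 100   # number of distinct phoneme symbols (illustrative)
embedding_dim = 512        # embedding width (illustrative)
phoneme_embedding = nn.Embedding(phoneme_vocab_size, embedding_dim)

phoneme_ids = torch.tensor([[13, 27, 4, 56]])     # one phonemized utterance, as indices into the vocabulary
phoneme_vectors = phoneme_embedding(phoneme_ids)  # shape: (1, 4, 512), one vector per phoneme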

The 3-second acoustic prompt, which the output speech is conditioned on, is fed
into an audio codec encoder. In VALL-E they use a pre-trained audio encoder for
this: Encodec (developed by Facebook Research,
https://arxiv.org/abs/2210.13438). Encodec takes as input a speech waveform and
outputs a compressed discrete representation of it via residual vector quantization
(RVQ) using an encoder-decoder neural architecture. We will dive into Encodec in
Part 3 of this article, but for now, let’s just assume that it outputs a discrete
representation of the audio signal by splitting it into fixed time windows and
assigning each window a representation from a known vocabulary of audio
embeddings (conceptually, very similar to word embeddings).
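To get a feel for what "discrete representation" means in practice, here is a sketch of encoding a waveform with the open-source Encodec package, roughly following the usage example in its repository (the file name is a placeholder; double-check the repository README for the exact, current API):

import torch
import torchaudio
from encodec import EncodecModel
from encodec.utils import convert_audio

model = EncodecModel.encodec_model_24khz()   # pre-trained 24 kHz model
model.set_target_bandwidth(6.0)              # 6 kbps target, i.e. 8 codebooks per frame

wav, sr = torchaudio.load("some_speech.wav")                     # any speech clip (placeholder path)
wav = convert_audio(wav, sr, model.sample_rate, model.channels)  # resample/remix to match the model
with torch.no_grad():
    encoded_frames = model.encode(wav.unsqueeze(0))

codes = torch.cat([frame[0] for frame in encoded_frames], dim=-1)
print(codes.shape)  # (batch, num_codebooks, num_frames): integer codebook indices, not raw audio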

Once the model receives these two inputs, it can act as an autoregressive language
model and output the next discrete audio representation. Because the audio
representations come from a fixed vocabulary that was learned by Encodec, we can think of
this simply as predicting the next word in a sentence out of a fixed vocabulary of
words (a fixed vocabulary of sound representations, in our case). After these sound
representations are predicted they are transformed back into the original waveform
representation using the Decoder part of the Encodec model.

In Figure 3 we compare the pipeline of VALL-E to the traditional neural TTS
pipeline. We see that the main difference is the intermediate representation of
audio. In VALL-E they gave up on the Mel spectrogram and used the representation
created by the Encodec model. It is worth noting though that under the hood
Encodec uses a spectrogram representation as well, so it is still somewhat in use in
this architecture, albeit less prominently.
Figure 3. The VALL-E pipeline VS the traditional neural TTS pipeline. Image by author.

In the results section of the VALL-E paper, they have shown that they outperform
the previous state-of-the-art zero-shot TTS model, YourTTS, on the LibriSpeech data
on several metrics, which include human-based evaluations such as similarity mean
opinion score (SMOS) and algorithm-based evaluations such as word error rate
(WER). In an interesting ablation study, they show that the phoneme prompt
contributes to the content of generation (by reducing WER) and the audio prompt
contributes to speaker similarity (by improving a speaker similarity metric).

We will now dive into the Encodec model which is responsible for converting audio
to discrete tokens and back and is the enabler for using a language model approach
to audio generation in this paper.

Part 3 — Encodec: The Workhorse Behind VALL-E


In Figure 4 we can see the Encodec architecture. It is an encoder-decoder
architecture that learns a condensed representation of the audio signal via the task
of reconstruction. Let’s go over its different parts to understand what is going on
under the hood.
Figure 4. The Encodec architecture. Image taken from the original paper [2].

On the far left we have our original waveform, which is sampled at 24/48 kHz, and
each sample is represented by 16 bits (65,536 possible values). The raw signal gets passed
into the encoder, which includes 1D convolution operations for downsampling and a
two-layer LSTM for sequence modeling. The output of the encoder is 75/150 latent
timesteps per second (compare this to the original 24,000/48,000 samples per second!), with a depth dimension of 128.
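A quick back-of-the-envelope calculation (assuming the 24 kHz configuration and the 8 codebooks of 1024 entries mentioned later) shows how aggressive this compression is:

sample_rate = 24_000
bits_per_sample = 16
raw_bitrate = sample_rate * bits_per_sample                   # 384,000 bits per second of raw PCM

frames_per_second = 75
num_codebooks = 8
bits_per_code = 10                                            # 1024 entries -> 10 bits per index
code_bitrate = frames_per_second * num_codebooks * bits_per_code  # 6,000 bits per second

print(raw_bitrate / code_bitrate)                             # ~64x fewer bits than the raw waveform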

The decoder is simply a mirrored version of the encoder, using transposed
convolutions in order to upsample the latent space and reconstruct the audio
waveform (here is a good explanation of transposed convolutions:
https://towardsdatascience.com/what-is-transposed-convolutional-layer-40e5e6e31c11).

The interesting bit here is, of course, the quantizer. How does Encodec quantize the
continuous domain of sound? Using a technique called residual vector quantization
(RVQ) which consists of projecting an input vector onto the closest entry in a
codebook of a given size. Let’s break that sentence down. First, what is a codebook?

In the case of VALL-E, a codebook is a dictionary of 1024 entries, where each
entry is a vector of size 128. Our goal in vector quantization is to map a
given vector to the closest vector in the codebook (by Euclidean distance, for
example); thereafter it can be represented by the index of that vector in the
codebook (assuming everyone has access to the codebook). Of course, in this way,
we lose a lot of information. What if no vector in the codebook accurately resembles
our vector? Hence the “residual” in RVQ!
In Figure 5, I show how a vector is quantized using residual vector quantization. In
the example, we have 3 codebooks. The input vector is compared to each of the
vectors in the first codebook and assigned to the closest one (C1,1). Then, the
residual between C1,1 and the input is calculated, and we try to match that
residual to the next codebook, and so on until we reach the end of our codebooks.
The final RVQ representation is the list of indices that were matched in each of the
codebooks (1, 3, 2 in our example). This encoding method is extremely efficient:
with 8 codebooks of 1024 entries each, we can represent 1024⁸ ≈ 1.2e+24 distinct
vectors while storing only 1024*8 = 8192 codebook entries, and each quantized
vector is transmitted as just 8 indices. Of course, the sender and receiver must hold
the same codebooks for this quantization method to work. If you want to learn more
about RVQ, such as how the codebooks are trained, I recommend reading another
paper that Encodec is based on, called SoundStream
(https://arxiv.org/abs/2107.03312); yes, this is a rabbit hole.
Figure 5. Example of Residual Vector Quantization. Image by author.
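Here is a small, self-contained sketch of the procedure in Figure 5, assuming the codebooks are already trained (in reality they are learned jointly with the encoder, as described in the SoundStream paper):

import numpy as np

def rvq_encode(x, codebooks):
    """Residual vector quantization: at each stage, pick the codebook entry
    closest to the current residual, then quantize what is left at the next stage."""
    indices = []
    residual = x.copy()
    for cb in codebooks:                             # cb shape: (num_entries, dim)
        dists = np.linalg.norm(cb - residual, axis=1)
        idx = int(np.argmin(dists))
        indices.append(idx)
        residual = residual - cb[idx]
    return indices

def rvq_decode(indices, codebooks):
    """Reconstruct an approximation of x by summing the selected entries."""
    return sum(cb[i] for cb, i in zip(codebooks, indices))

# toy example: 3 codebooks of 1024 entries, vectors of dimension 128
rng = np.random.default_rng(0)
codebooks = [rng.normal(size=(1024, 128)) for _ in range(3)]
x = rng.normal(size=128)
codes = rvq_encode(x, codebooks)
x_hat = rvq_decode(codes, codebooks)
print(codes, np.linalg.norm(x - x_hat))  # 3 indices, plus the remaining quantization error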

Back to the Encodec pipeline in Figure 4, let’s notice 3 additional details which are
relevant to its training procedure:

1. Mel spectrograms are created both from the input audio and from the generated
audio. These spectrograms are compared and the signal from the comparison is
used as a loss to direct the model training.

2. Several discriminators are used in order to compare short-time Fourier
transforms (STFT) of the original and synthetic waveforms. This GAN loss gives a
different signal than the Mel spectrogram comparison and was found useful for
Encodec.

3. Encodec also includes small transformer models over the quantized codes that are
used for further compression of the audio signal. This is not the transformer in
VALL-E that predicts the next token of speech, as confusing as it may be. For a deeper
understanding of the transformers in Encodec, I recommend reading the paper or
watching the video by Aleksa Gordic: https://www.youtube.com/watch?v=mV7bhf6b2Hs.

Let’s summarize what we know so far. VALL-E is a text-to-speech model that
resembles language models in its mode of operation, in that it predicts the next
discrete audio token for a given prompt, which consists of phonemicized text and
audio input. These discrete tokens are learned by another model called Encodec
(which itself is based on SoundStream) that uses an encoder-decoder architecture
with residual vector quantization to convert audio into discrete codes.

Part 4 — Problem formulation and training of VALL-E


VALL-E contains two transformer models which are used to process the input data
(phonemicized text and audio) — an autoregressive (AR) transformer that attends
only to past data, and a non-autoregressive (NAR) transformer that attends to all
points in time. Let’s see why.

Eight different codebooks are used in VALL-E as part of the Encodec model, where
each codebook consists of 1024 entries. The codes from the first quantizer
(codebook) are processed by the AR model according to Equation 1. Let’s first
clarify some notation:

C represents the generated output — as discrete audio codes

C~ is the 3-second input acoustic prompt

x is the input text as a phoneme sequence

C:,₁ represents the codes of C that belong to the first quantizer/codebook

So, Equation 1 shows that the output for the first quantizer is conditioned on the
input data, and on the previous timesteps’ outputs for the first quantizer (just like
an autoregressive language model).
Equation 1. The autoregressive model, applied to the first quantizer. Image taken from the original paper
[1].
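Since the equation image is not reproduced here, the factorization just described can be written out in LaTeX as a reconstruction from the prose above, using the notation just defined (see the paper for the exact typeset formula):

p(\mathbf{C}_{:,1} \mid \mathbf{x}, \tilde{\mathbf{C}}_{:,1}) = \prod_{t} p(\mathbf{c}_{t,1} \mid \mathbf{c}_{<t,1}, \tilde{\mathbf{C}}_{:,1}, \mathbf{x}; \theta_{AR})

In words: the first-quantizer code at time t depends on the phonemes, the prompt's first-quantizer codes, and all previously generated first-quantizer codes.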

In Equation 2 we see the generation of codes for quantizers 2 to 8. Unlike the
previous case, here the output for each quantizer is conditioned on all of the
timesteps from the previous quantizers (when calculating codes for quantizer #7,
the model is conditioned on data generated for quantizers 1 to 6). Unlike the AR
model, this allows parallel generation of all timesteps in a single quantizer because
it is only dependent on previous quantizer codes and not on previous timesteps of
the same quantizer. The authors stressed this point because fast inference is
especially important in text-to-speech models which need to generate speech in
real-time scenarios.

Equation 2. The non-autoregressive model, applied to the second to eighth quantizers. Image taken from
the original paper [1].
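Again as a reconstruction from the description above rather than a copy of the paper's typeset formula, the NAR factorization is roughly:

p(\mathbf{C}_{:,2:8} \mid \mathbf{x}, \tilde{\mathbf{C}}) = \prod_{j=2}^{8} p(\mathbf{C}_{:,j} \mid \mathbf{C}_{:,<j}, \mathbf{x}, \tilde{\mathbf{C}}; \theta_{NAR})

so all timesteps of quantizer j are produced together, conditioned on the full code sequences of quantizers 1 to j-1.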

Equations 1 and 2 are visually depicted in Figure 6, which shows the AR and NAR
models together and highlights the differences between them. We see that the AR
transformer is used to predict only C:,₁ which are the tokens for the first quantizer.
While doing so, it attends to the previous tokens it has generated. The NAR
transformer, in contrast, attends to the previous quantizers and not to previous
timesteps (the previous tokens of the current quantizer are not available in the NAR model).
Figure 6. AR and NAR models in VALL-E. Image taken from the original paper [1].
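The difference between the two transformers essentially comes down to their attention masks. A minimal sketch, illustrative only (the real implementation also handles the phoneme and acoustic prompt segments):

import torch

T = 6  # number of timesteps in the code sequence

# AR transformer: causal mask, position t may only attend to positions up to and including t
ar_mask = torch.triu(torch.ones(T, T, dtype=torch.bool), diagonal=1)  # True = masked out

# NAR transformer: no temporal masking, every position attends to every timestep
nar_mask = torch.zeros(T, T, dtype=torch.bool)

print(ar_mask)
print(nar_mask)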

VALL-E has been trained on 60K hours of audio from the LibriLight dataset,
containing 7000 distinct speakers (which is over 100 times more data than previous
state-of-the-art). The dataset is audio-only, so an automatic speech recognition
model was used to produce the transcriptions. The Encodec model is used as a pre-trained
model, and as far as I could tell no fine-tuning is performed on it for VALL-E.

For training, random 10–20 second samples were taken from LibriLight. For the
acoustic prompt, another 3 seconds were taken from the same utterance. They used
16 Tesla V-100 GPUs to train the model, which is very modest compared to large
SOTA language models!

We learned about the procedure and data; now let’s try the unofficial PyTorch
implementation of VALL-E on GitHub.

Part 5 — Some Coding


VALL-E doesn’t have an official implementation on GitHub, so for my
experiments I will rely on the unofficial version that was released at
https://github.com/enhuiz/vall-e. Moreover, no model checkpoint has been released,
so you have to train it from scratch.
There is also a Google Colab notebook to follow along with a simple training
example:
https://colab.research.google.com/drive/1wEze0kQ0gt9B3bQmmbtbSXCoCTpq5vg-?usp=sharing.
In this example, they overfit the model on a single utterance of “hello
world” and show that the model is able to reproduce this single utterance. I was
interested in two things:

1. I wanted to replicate their “hello world” experiment with my own voice, just to
see that the pipeline is working properly.

2. I wanted to replicate the experiment done by James Skelton from Paperspace
(https://blog.paperspace.com/training-vall-e-from-scratch-on-your-own-voice-samples/),
where he trained a model on a very small subset of his own recordings and
managed to replicate his voice with it (on something he had already recorded).

Why the limited experiments? Because training this model from scratch takes many
resources which I don’t currently have, plus I assume a pre-trained model will be
released sooner or later.

So how did it go? I managed to replicate the “hello world” experiment, but
unfortunately I didn’t manage to replicate the Paperspace experiment; I just got a
model which creates a garbled sound that vaguely resembles my voice. This is
probably because of a lack of resources (I am training it on a Google Colab instance
which times out after 12 hours). Still, I want to go over the process with you. My
version of the VALL-E notebook is here:
https://colab.research.google.com/drive/1NNOsvfiOfGeV-BBgGkwf0pyGAwAgx3Gi#scrollTo=SbWtNBVg_Tfd.

Once you run the following line in the Colab notebook:

!git clone --recurse-submodules https://github.com/enhuiz/vall-e.git

you will see a directory called vall-e in your file browser. The path
content/vall-e/data/test contains the data for the “hello world” experiment. Notice that it
contains two example files, because for some reason it breaks with only one. To replicate this
experiment, simply delete the files in the data directory using !rm content/vall-e/data/test/*,
record yourself saying “Hello world”, and save it as two .wav files with
different names. Put the .wav files in the data directory along with two text files
containing the words “hello world” (the text files should have the same names as the
.wav files, with a .normalized.txt suffix).
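If you prefer to do this step in code rather than through the file browser, a small hypothetical helper like the one below writes the matching transcript files for you; adjust data_dir to wherever the repository's data/test directory actually lives in your Colab session:

from pathlib import Path

data_dir = Path("/content/vall-e/data/test")   # assumed location; adjust if needed
for wav_file in data_dir.glob("*.wav"):
    transcript = data_dir / f"{wav_file.stem}.normalized.txt"
    transcript.write_text("hello world")       # same base name as the .wav, with the expected suffix
    print(f"wrote {transcript}")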

Following that, you will run these two cells:

!python -m vall_e.emb.qnt data/test

!python -m vall_e.emb.g2p data/test

The first cell will run the Encodec model on your own data and perform
quantization, just as we discussed earlier. The second cell will convert the text “hello
world” into phonemes.

Afterward, the processed data is ready and you can run the cells that perform the
training procedure. The AR and NAR models are trained separately
(remember that, as we saw earlier, the NAR model is conditioned on the first-quantizer
codes that the AR model produces, while the AR model uses and produces only the
first-quantizer data and is thus independent of the NAR model).

!python -m vall_e.train yaml=config/test/ar.yml

!python -m vall_e.train yaml=config/test/nar.yml

After the model has finished training you will run this cell:

!mkdir -p zoo
!python -m vall_e.export zoo/ar.pt yaml=config/test/ar.yml
!python -m vall_e.export zoo/nar.pt yaml=config/test/nar.yml

which saves the latest model checkpoints (created automatically during training) into
a directory called zoo.

Finally, you will perform inference with the model using:



!python -m vall_e 'hello world' /content/vall-e/data/test/hello_world.wav toy.wav

This will run the model with a text prompt of “Hello world”, and an audio prompt of
the same utterance. It will save the generated sample as toy.wav which you can then
listen to using:

from IPython.display import Audio


Audio('toy.wav')

And that’s it! You created your own VALL-E “Hello world”. Unless you have a lot of
computing resources, it is probably best to wait for a pre-trained model to come
around before making further use of this model.

Part 6 — Conclusions & Thoughts Ahead


In this article, we saw VALL-E, a new text-to-speech architecture by Microsoft
Research. VALL-E generates audio in a language-model-like manner, which
differentiates it from recent state-of-the-art methods that are usually end-to-end or
follow a text->spectrogram->waveform creation pipeline.

We also talked about the Encodec model, which performs audio quantization and is
used as a pre-trained model in the training of VALL-E. Encodec is fascinating in
itself and manages to create super-condensed audio representations using residual
vector quantization. The creators of VALL-E leveraged this feature and built a
generative “language” model on top of this quantization.

Finally, we saw some code and replicated the “hello world” experiment from the
unofficial code with our own voice. The official code for this paper hasn’t been
released, nor has a model checkpoint been released. It would be interesting to see
and use a pre-trained model for VALL-E, which I assume will turn up sooner or later.
Nevertheless, this was an interesting learning journey.

See you next time!

Elad
References

[1] https://arxiv.org/abs/2301.02111 — The VALL-E paper (Neural Codec Language Models are Zero-Shot Text to Speech Synthesizers)

[2] https://arxiv.org/abs/2210.13438 — The Encodec paper (High Fidelity Neural Audio Compression)

[3] https://wiki.aalto.fi/display/ITSP/Concatenative+speech+synthesis — Explanation of concatenative speech synthesis

[4] https://www.youtube.com/watch?v=aLBedWj-5CQ&t=1s — Deep dive into speech synthesis meetup (HuggingFace)

[5] https://www.youtube.com/watch?v=MA8PCvmr8B0 — Pushing the frontier of neural text to speech (Microsoft Research)

[6] https://www.youtube.com/watch?v=G9k-2mYl6Vo&t=5593s — Excellent video by John Tan Chong Min about VALL-E

[7] https://www.youtube.com/watch?v=mV7bhf6b2Hs — Excellent video by Aleksa Gordic about Encodec
