Building Audiobooks Using The Open-Source XTTS-V2 Model - by Jaimon Jacob - Oct, 2024 - Medium
I have to admit, I’m more of an audiobook/podcast person these days. With most of
my hybrid/office time spent in Bangalore traffic, it’s much easier to listen to books
while I’m on the go instead of finding time to sit down and read. But one thing that’s
always bugged me is how some of the books I’m interested in don’t have audiobook
versions, especially older ones from Project Gutenberg. So, I started researching
https://medium.com/@jaimonjk/building-audiobooks-using-the-open-source-xtts-v2-model-6bfbbd412fee 1/14
2024/11/5 晚上11:06 Building audiobooks using the open-source XTTS-V2 model | by Jaimon Jacob | Oct, 2024 | Medium
ways to create audiobooks using TTS (Text-to-Speech) technology. And that’s how I
came across open-source TTS models like Coqui’s XTTS-v2.
TTS has come a long way from the early days when it sounded robotic and, well,
kind of awkward. Back then, it used to rely on stringing together bits of pre-
recorded audio or using basic models that didn’t really “get” how human speech
works. But now, with neural networks and deep learning, TTS can sound way more
natural. Think of Google’s WaveNet — it was one of the first models to make speech
sound almost human, and now, neural TTS is everywhere.
Then there’s the open-source side, which is what I am really into. With open-source
models, you can experiment, customize, and basically do whatever you want
without restrictions.
Discovering XTTS-v2
That brings me to Coqui and their XTTS-v2 model. This open-source model seems
like the perfect solution for my audiobook project. It’s powerful, flexible, and free to
use — ideal if you want to create something like custom audiobooks or any other
long text-to-audio, which otherwise would require a paid option. Some highlights:
Emotion and style transfer: One of the standout features of XTTS-v2 is its ability
to handle emotion and style transfer. When you clone a voice, it captures
not just the sound of the voice but also its unique style and tone. This
makes it perfect for creating consistent, personalized voices for audiobooks.
Imagine narrating a whole series of books with the same voice that you’ve
cloned and customized to sound exactly how you want.
Low latency: It’s fast enough for real-time applications, which means it’s not
only useful for creating audiobooks but also for live applications like voice
My process
I started by setting up the model to run on the GPU. It also runs on a CPU,
but a GPU is, of course, faster.
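from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts

# Load the model configuration and checkpoint from the local XTTS-v2 directory.
config = XttsConfig()
config.load_json("/content/XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/content/XTTS-v2/", eval=True)
model.cuda()  # move the model to the GPU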
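The setup above calls `model.cuda()` directly, which fails on a machine without a GPU. As a minimal sketch of a CPU fallback, the hypothetical helper `pick_device` below (not part of the original post) returns `"cuda"` only when PyTorch reports a usable GPU:

```python
def pick_device():
    """Return "cuda" when PyTorch can see a GPU, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works without PyTorch
    except ImportError:
        return "cpu"
    return "cuda" if torch.cuda.is_available() else "cpu"


device = pick_device()
print(f"Running on: {device}")
```

Assuming the model is a regular PyTorch module, `model.to(pick_device())` could then stand in for `model.cuda()`.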
I could then use a reference audio file and sample text to test.
text_en = """The quick brown fox jumps over the lazy dog near the idyllic river
In the silence of the ancient library, amidst the scent of old books, a whisper"""
reference_audio = "/content/XTTS-v2/samples/female_en.wav"
process_tts(text_en, reference_audio, "en", "audio_output.wav")
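The post does not show the body of `process_tts`, so the following is only a sketch of what such a helper might look like, not the author's implementation. It assumes the loaded model exposes `get_conditioning_latents` and `inference` as in Coqui's XTTS usage examples, that `inference` returns a dict with a `"wav"` entry of float samples, and that the output is 24 kHz audio. Unlike the call above, this version takes the model as an explicit fifth argument (an assumption made so the function has no hidden globals).

```python
import sys
import wave
from array import array

SAMPLE_RATE = 24000  # assumed XTTS-v2 output sample rate


def process_tts(text, reference_audio, language, output_file, model):
    """Hypothetical helper: speak `text` in the voice of `reference_audio`."""
    # Derive the speaker conditioning from the reference clip.
    gpt_cond_latent, speaker_embedding = model.get_conditioning_latents(
        audio_path=[reference_audio]
    )
    # Synthesize; the result is assumed to be a dict with float samples in "wav".
    out = model.inference(text, language, gpt_cond_latent, speaker_embedding)
    # Clip each sample to [-1, 1] and convert it to 16-bit PCM.
    pcm = array("h", (int(max(-1.0, min(1.0, float(s))) * 32767) for s in out["wav"]))
    if sys.byteorder == "big":
        pcm.byteswap()  # WAV data is little-endian
    with wave.open(output_file, "wb") as wav_file:
        wav_file.setnchannels(1)  # mono
        wav_file.setsampwidth(2)  # 2 bytes per sample
        wav_file.setframerate(SAMPLE_RATE)
        wav_file.writeframes(pcm.tobytes())
    return output_file
```

With the model loaded earlier, the call would become `process_tts(text_en, reference_audio, "en", "audio_output.wav", model)`.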
This is excellent quality for a free solution. I then tested two more reference audio
files: a male English voice and a male Hindi voice.
The sampled audio and the generated output for the male English voice are below:
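आज का दिन बहुत सुंदर है, और मुझे आशा है कि आप इसे खुशी और सुकून के साथ बिता रहे हैं
(English: "Today is a very beautiful day, and I hope you are spending it with happiness and peace.")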
All of these again have excellent quality for an open-source, free model.
Now it was time to move to my goal, which was to see how good this model is at
creating audiobooks, in other words, at converting longer texts. My next test was
converting a book to audio: David Copperfield by Charles Dickens. I picked the
first chapter for my test.
I split the text into sentences so that each chunk fits within the token limit of the
model, then processed the chunks in a batch.
import os
import nltk
from nltk.tokenize import sent_tokenize
from pydub import AudioSegment

nltk.download('punkt')  # sentence tokenizer data

# Assumption: chapter_text holds the chapter as a string (loading it is not shown here).
sentences = sent_tokenize(chapter_text)

# Synthesize each sentence to its own WAV file.
audio_files = []
for i, sentence in enumerate(sentences):
    output_file = f"audio_sentence_{i}.wav"
    process_tts(sentence, reference_audio, "en", output_file)
    audio_files.append(output_file)

# Stitch the clips together with a 500 ms pause after each sentence.
combined = AudioSegment.empty()
silence = AudioSegment.silent(duration=500)
for audio_file in audio_files:
    combined += AudioSegment.from_wav(audio_file) + silence
combined.export("combined_audio.wav", format="wav")
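Synthesizing one sentence per call keeps every chunk short, but it also means many small model calls and a fixed pause after every sentence. One possible variation, sketched here under assumptions, is to merge consecutive sentences into chunks that stay under a character budget. The helper name `pack_sentences` and the default budget of 250 characters are illustrative choices, not values taken from the post.

```python
def pack_sentences(sentences, max_chars=250):
    """Greedily merge consecutive sentences into chunks of at most max_chars."""
    chunks = []
    current = ""
    for sentence in sentences:
        sentence = sentence.strip()
        if not sentence:
            continue  # skip empty entries
        candidate = f"{current} {sentence}" if current else sentence
        if len(candidate) <= max_chars:
            current = candidate  # still within budget, keep growing the chunk
        else:
            if current:
                chunks.append(current)
            # A single sentence longer than the budget is kept whole here.
            current = sentence
    if current:
        chunks.append(current)
    return chunks


print(pack_sentences(["I am born.", "It was a Friday.", "The clock struck."], max_chars=30))
```

The resulting chunks could be passed to the synthesis loop in place of the individual sentences. Sentences longer than the budget are left intact in this sketch and would need further splitting.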
And here is the first chapter in audio. I am overall impressed with the quality of the
audio 😊
Final thoughts
I’m pretty excited about these possibilities with XTTS-v2. I also think this could be
an excellent choice if you’re working on a more complex project like a voice
assistant. And the best part? You’re not stuck with rigid, expensive closed-source
systems; you can tweak it to fit your exact needs. The audio quality should also
improve if you fine-tune the model.
References
Sample audio files for speech recognition. (2020, August 14). Kaggle.
https://www.kaggle.com/datasets/pavanelisetty/sample-audio-files-for-speech-recognition