Building Audiobooks Using The Open-Source XTTS-V2 Model - by Jaimon Jacob - Oct, 2024 - Medium

Building audiobooks using the open-source XTTS-V2 model
Jaimon Jacob
5 min read · Oct 18, 2024

Photo by Distingué CiDDiQi on Unsplash

I have to admit, I’m more of an audiobook/podcast person these days. With most of
my hybrid/office time spent in Bangalore traffic, it’s much easier to listen to books
while I’m on the go instead of finding time to sit down and read. But one thing that’s
always bugged me is how some of the books I’m interested in don’t have audiobook
versions, especially older ones from Project Gutenberg. So, I started researching
ways to create audiobooks using TTS (Text-to-Speech) technology. And that’s how I
came across open-source TTS models like Coqui’s XTTS-v2.

TTS has come a long way from the early days when it sounded robotic and, well,
kind of awkward. Back then, it used to rely on stringing together bits of pre-
recorded audio or using basic models that didn’t really “get” how human speech
works. But now, with neural networks and deep learning, TTS can sound way more
natural. Think of Google’s WaveNet — it was one of the first models to make speech
sound almost human, and now, neural TTS is everywhere.

Closed-source vs. Open-source TTS


So, I started digging into TTS models, and there are a lot of options out there. Some
are closed-source; you can’t really customize them or tinker with them because
they’re locked behind paywalls. Companies like Google, Amazon, and Microsoft
have their own TTS services that are really good, but they’re also expensive if you’re
thinking of long-term use or customization.

Then there’s the open-source side, which is what I am really into. With open-source
models, you can experiment, customize, and basically do whatever you want
without restrictions.

Discovering XTTS-v2
That brings me to Coqui and their XTTS-v2 model. This open-source model seems
like the perfect solution for my audiobook project. It’s powerful, flexible, and free to
use — ideal if you want to create something like custom audiobooks or any other
long text-to-audio, which otherwise would require a paid option. Some highlights:

Emotion and style transfer: One of the standout features of XTTS-v2 is its ability
to handle emotion and style transfer. When you clone a voice, the model captures
not just how the voice sounds but also its style and tone. This makes it a good
fit for creating consistent, personalized voices for audiobooks. Imagine
narrating a whole series of books with the same voice that you’ve cloned and
customized to sound exactly how you want.

Multilingual support: XTTS-v2 supports multiple languages (17), so if you’ve got
texts in different languages, you can switch seamlessly between them.

Low latency: It’s fast enough for real-time applications, which means it’s not
only useful for creating audiobooks but also for live applications like voice
assistants. This model also runs easily on CPU.
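Because the language is passed to the model as a short code, it can help to validate that code before starting a long synthesis run. The sketch below is only an illustration: the set of 17 codes is my reading of the XTTS-v2 model card, and `check_language` is a hypothetical helper, so verify the list against the model version you use.

```python
# Language codes assumed from the XTTS-v2 model card (17 entries); verify before relying on them.
XTTS_V2_LANGUAGES = {
    "en", "es", "fr", "de", "it", "pt", "pl", "tr", "ru",
    "nl", "cs", "ar", "zh-cn", "ja", "hu", "ko", "hi",
}


def check_language(code):
    """Return the normalized language code, or raise if it is not in the assumed list."""
    normalized = code.strip().lower()
    if normalized not in XTTS_V2_LANGUAGES:
        raise ValueError(f"Unsupported language code: {code!r}")
    return normalized
```

Calling `check_language("hi")` before a Hindi run, for example, catches a mistyped code early instead of after the model has loaded.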

My process
I started by setting up the model to run on the GPU. It also runs on a CPU, but
the GPU is, of course, faster.

!pip -q install TTS soundfile

# Not used below; the model files are fetched with git clone instead.
from huggingface_hub import snapshot_download

!git clone https://huggingface.co/coqui/XTTS-v2

from TTS.tts.configs.xtts_config import XttsConfig
from TTS.tts.models.xtts import Xtts
import soundfile as sf
import os

# Load the model configuration and weights from the cloned repository.
config = XttsConfig()
config.load_json("/content/XTTS-v2/config.json")
model = Xtts.init_from_config(config)
model.load_checkpoint(config, checkpoint_dir="/content/XTTS-v2/", eval=True)
model.cuda()  # move the model to the GPU

def process_tts(input_text, speaker_audio, lang, output_file):
    # Synthesize input_text in the voice of the speaker_audio reference clip.
    outputs = model.synthesize(
        input_text,
        config,
        speaker_wav=speaker_audio,
        gpt_cond_len=3,
        language=lang,
    )
    audio_data = outputs["wav"]
    sample_rate = 24000  # XTTS-v2 generates 24 kHz audio
    sf.write(output_file, audio_data, sample_rate)
    print("Audio saved successfully!")
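The setup above calls `model.cuda()`, which raises an error on a machine without a CUDA GPU. Since the model also runs on a CPU, a small fallback can make the same notebook work in both cases. This is a minimal sketch, assuming PyTorch is the backend; `pick_device` is a hypothetical helper name, not part of the TTS library.

```python
def pick_device(prefer_gpu=True):
    """Return "cuda" when a GPU is usable and wanted, otherwise "cpu"."""
    try:
        import torch  # imported lazily so the helper also works where PyTorch is absent
    except ImportError:
        return "cpu"
    if prefer_gpu and torch.cuda.is_available():
        return "cuda"
    return "cpu"
```

With this helper, `model.to(pick_device())` could stand in for `model.cuda()` in the setup code.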

I could then use a reference audio file and sample text to test.

text_en = """The quick brown fox jumps over the lazy dog near the idyllic river…
In the silence of the ancient library, amidst the scent of old books, a whisper…"""

reference_audio = "/content/XTTS-v2/samples/female_en.wav"
process_tts(text_en, reference_audio, "en", "audio_output.wav")

This was the reference audio that I used.

The generated audio from the text is below.

This is excellent quality for a free solution. I tested it with two more reference
audio files: a male English voice and a male Hindi voice.

The sampled audio and the generated output for the male English voice are below:

I also tried the TTS in Hindi. The text used was:

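आज का दिन बहुत सुंदर है, और मुझे आशा है कि आप इसे खुशी और सुकून के साथ बिता रहे हैं
(In English: "Today is a very beautiful day, and I hope you are spending it with happiness and peace.")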

The reference audio was:

The output was:

All of these again have excellent quality for an open-source, free model.

Now it was time to move to my goal, which was to see how good this model is for
creating audiobooks, in other words, for converting longer texts. My next test was
converting a book to audio: David Copperfield by Charles Dickens. I picked the
first chapter for my test.

I split the chapter into sentence-sized chunks so that each call fits within the
token limit of the model, and then processed the chunks in a batch.
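Splitting on sentence boundaries covers most text, but one very long sentence (not rare in Dickens) can still be too long for a single call. Here is a minimal sketch of a fallback splitter. It is not part of the original notebook: the name `split_long_sentence` and the 250-character budget are assumptions chosen for illustration, so check the limit that applies to your language and model version.

```python
import re


def split_long_sentence(sentence, max_chars=250):
    """Split one sentence into pieces of at most max_chars characters.

    Clause boundaries (comma, semicolon, colon) are preferred; a clause that is
    still too long is split between words. A single word longer than max_chars
    is kept whole, so that piece can exceed the budget.
    """
    if len(sentence) <= max_chars:
        return [sentence]
    clauses = re.split(r"(?<=[,;:])\s+", sentence)
    pieces, current = [], ""
    for clause in clauses:
        # Fall back to individual words only when the clause itself is too long.
        units = [clause] if len(clause) <= max_chars else clause.split()
        for unit in units:
            candidate = f"{current} {unit}".strip()
            if len(candidate) <= max_chars:
                current = candidate
            else:
                if current:
                    pieces.append(current)
                current = unit
    if current:
        pieces.append(current)
    return pieces
```

Each piece could then be synthesized on its own in place of the full sentence. The author's own loop, which follows, splits on sentences with NLTK.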

import os
from pydub import AudioSegment
import nltk
nltk.download('punkt')
nltk.download('punkt_tab')  # newer NLTK releases look for this resource instead
from nltk.tokenize import sent_tokenize

# Read the chapter and split it into sentences.
with open('chapter_1.txt', 'r', encoding="utf-8") as file:
    text = file.read()
sentences = sent_tokenize(text)

# Synthesize one WAV file per sentence.
audio_files = []
for i, sentence in enumerate(sentences):
    output_file = f"audio_sentence_{i}.wav"
    process_tts(sentence, reference_audio, "en", output_file)
    audio_files.append(output_file)

# Join the clips, adding 500 ms of silence after each one.
combined = AudioSegment.empty()
silence = AudioSegment.silent(duration=500)

for audio_file in audio_files:
    audio = AudioSegment.from_wav(audio_file)
    combined += audio + silence

combined.export("combined_audio.wav", format="wav")

# Delete the per-sentence files.
for audio_file in audio_files:
    os.remove(audio_file)
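The joining step above relies on pydub. As a rough alternative, the same concatenation can be sketched with Python's standard `wave` module. This assumes every clip is a PCM WAV file with the same channel count, sample width and sample rate, which should hold when all clips come from the same `process_tts` function. `join_wavs` is a hypothetical helper, not part of the original notebook.

```python
import wave


def join_wavs(paths, out_path, pause_ms=500):
    """Concatenate same-format WAV files, adding pause_ms of silence after each."""
    params = None
    frames = []
    for path in paths:
        with wave.open(path, "rb") as wf:
            fmt = (wf.getnchannels(), wf.getsampwidth(), wf.getframerate())
            if params is None:
                params = fmt
            elif fmt != params:
                raise ValueError(f"{path} has a different audio format: {fmt} != {params}")
            frames.append(wf.readframes(wf.getnframes()))
    if params is None:
        raise ValueError("no input files given")
    channels, width, rate = params
    # All-zero bytes are silence for signed PCM (sample width of 2 bytes or more).
    silence = b"\x00" * (int(rate * pause_ms / 1000) * channels * width)
    with wave.open(out_path, "wb") as out:
        out.setnchannels(channels)
        out.setsampwidth(width)
        out.setframerate(rate)
        for chunk in frames:
            out.writeframes(chunk + silence)
```

A call such as `join_wavs(audio_files, "combined_audio.wav")` would then replace the pydub loop and export. Like the original loop, it leaves a pause after the last clip as well.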

And here is the first chapter in audio. I am overall impressed with the quality of the
audio 😊

Final thoughts
I’m pretty excited about these possibilities with XTTS-v2. I also think this could be
an excellent choice if you’re working on a more complex project like a voice
assistant. And the best part? You’re not stuck with rigid, expensive closed-source
systems; you can tweak it to fit your exact needs. The audio quality should also
improve if you fine-tune the model.

References
Coqui-Ai. (n.d.). GitHub - coqui-ai/TTS: 🐸💬 - a deep learning toolkit for Text-to-Speech, battle-tested in research and production. GitHub. https://github.com/coqui-ai/TTS

sample audio files for speech recognition. (2020, August 14). Kaggle. https://www.kaggle.com/datasets/pavanelisetty/sample-audio-files-for-speech-recognition

Audio Dataset with 10 Indian Languages. (2021, August 31). Kaggle. https://www.kaggle.com/datasets/hbchaitanyabharadwaj/audio-dataset-with-10-indian-languages
