
A Major Project Report on

“Multi-Speaker Voice Cloner”

Submitted for the partial fulfillment of the requirement for the award
of the degree of

Bachelor of Technology
in
Computer Science & Engineering

Submitted to:
Dr. Ritu
Dept. of CSE
GJUS&T, Hisar

Submitted by:
Mohit Kumar (200010130070)
Ishan Singh (200010130047)
B.Tech (CSE) – 7th Sem

Department of Computer Science & Engineering
Guru Jambheshwar University of Science & Technology, Hisar
January, 2024

CANDIDATE’S DECLARATION

We, Mohit Kumar (200010130070) and Ishan Singh (200010130047), certify that the work contained in this project synopsis is original and has been carried out by us under the guidance of our supervisor. This work has not been submitted to any other institute for the award of any degree or diploma, and we have followed the ethical practices and other guidelines provided by the Department of Computer Science and Engineering in preparing the report. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources wherever necessary.

Signature

Mohit Kumar (200010130070)

Ishan Singh(200010130047)

CERTIFICATE

This is to certify that Mohit Kumar (200010130070) and Ishan Singh (200010130047), students of B.Tech (CSE), Department of Computer Science & Engineering, Guru Jambheshwar University of Science & Technology, Hisar, have completed the project entitled “Multi-Speaker Voice Cloner”.

Associate Prof. Dr. Ritu


Dept. of CSE
GJUS&T, Hisar

Contents
“Multi-Speaker Voice Cloner”

1. Introduction
2. Literature Survey
3. Objective
4. Research Methodology
5. Tools Used
6. References

Introduction
Recent advances in deep learning have shown impressive results in the domain of text-to-speech. To this end, a deep neural network is usually trained on a corpus of several hours of professionally recorded speech from a single speaker. Giving a new voice to such a model is highly expensive, as it requires recording a new dataset and retraining the model. Recent research introduced a three-stage pipeline that allows a voice unseen during training to be cloned from only a few seconds of reference speech, without retraining the model. The authors report remarkably natural-sounding results, but provide no implementation. We aim to reproduce this framework and make it open-source, as the first public implementation of it. We also aim to adapt the framework with a newer vocoder model, so as to make it run in real time.

Deep learning models have become predominant in many fields of applied machine learning. Text-to-speech (TTS), the process of synthesizing artificial speech from a text prompt, is no exception. Deep models that produce more natural-sounding speech than the traditional concatenative approaches began appearing in 2016. Since then, much of the research focus has been on making these deep models more efficient, sound more natural, or trainable in an end-to-end fashion. Inference has gone from being hundreds of times slower than real time on a GPU to being possible in real time on a mobile CPU. Interestingly, speech naturalness is best rated with subjective metrics, and comparison with actual human speech leads to the conclusion that there might be such a thing as “speech more natural than human speech”. In fact, some argue that the human naturalness threshold has already been crossed.

Datasets of professionally recorded speech are a scarce resource. Synthesizing a natural voice with correct pronunciation, lively intonation and a minimum of background noise requires training data with the same qualities. Furthermore, data efficiency remains a core issue of deep learning. Training a common text-to-speech model such as Tacotron [1] typically requires hundreds of hours of speech. Yet the ability to generate speech with any voice is attractive for a range of applications, be they useful or merely a matter of customization. Research has led to frameworks for voice conversion and voice cloning. They differ in that voice conversion is a form of style transfer on a speech segment from one voice to another, whereas voice cloning consists of capturing the voice of a speaker in order to perform text-to-speech on arbitrary inputs.

While the complete training of a single-speaker TTS model is technically a form of voice cloning, the interest rather lies in creating a fixed model able to incorporate new voices with little data. The common approach is to condition a TTS model trained to generalize [2] to new speakers on an embedding of the voice to clone.

Literature Survey
The multi-speaker generative model and the speaker encoder will be trained on the LibriSpeech dataset, which contains 16 kHz audio for 2,484 speakers, totalling 820 hours. LibriSpeech is a dataset for automatic speech recognition, and its audio quality is lower than that of typical speech synthesis datasets. Voice cloning will be performed on the VCTK dataset, which consists of audio sampled at 48 kHz from 108 native speakers of English with various accents. To be consistent with LibriSpeech, the VCTK audio is downsampled to 16 kHz, as sketched below. For a chosen speaker, a few cloning audios will be randomly sampled for each experiment.
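
A minimal sketch of this downsampling step, assuming Librosa is used for resampling (the file name is hypothetical):

    # Minimal sketch (assumed preprocessing step, hypothetical file name): bring a
    # 48 kHz VCTK recording down to the 16 kHz sampling rate used by LibriSpeech.
    import librosa

    wav_48k, sr = librosa.load("vctk_sample.wav", sr=None)            # keep native 48 kHz
    wav_16k = librosa.resample(wav_48k, orig_sr=sr, target_sr=16000)  # match LibriSpeech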

Objective
Our main objective is to achieve a powerful form of voice cloning. The resulting framework must be able to operate in a zero-shot setting, that is, for speakers unseen during training. It should incorporate a speaker’s voice from only a few seconds of reference speech and run in real time, i.e., generate speech in a time shorter than or equal to the duration of the produced speech. We also aim to make the implementation open-source and to integrate previously developed frameworks (if any) into our implementation.
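
To make the real-time criterion concrete, a small illustrative sketch (the timing numbers are made up) of the real-time factor we will track:

    # Minimal sketch of the real-time criterion: synthesis counts as real-time when
    # the real-time factor (generation time / audio duration) is at most 1.
    # The numbers below are illustrative only.
    def real_time_factor(generation_seconds, audio_seconds):
        return generation_seconds / audio_seconds

    print(real_time_factor(1.5, 3.0))   # 0.5 -> twice as fast as real time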

Research Methodology
Our approach can be divided into four phases:
Phase 1: Problem Definition
Consider a dataset of utterances grouped by their speaker. We denote the j-th utterance of the i-th speaker as u_ij. Utterances are in the waveform domain. We denote by x_ij the log-mel spectrogram of the utterance u_ij. A log-mel spectrogram is computed by a deterministic, non-invertible (lossy) function that extracts speech features from a waveform, so as to handle speech in a more tractable fashion in machine learning. By later devising functions for the speaker encoder, synthesizer and vocoder, we can derive a loss function L_v from the resulting objective function.
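
A minimal sketch of how x_ij could be computed from u_ij with Librosa; the STFT and mel parameters here are our own assumptions, not necessarily those of SV2TTS:

    # Minimal sketch (parameter values are our assumptions, file name is hypothetical):
    # compute the log-mel spectrogram x_ij of a 16 kHz utterance waveform u_ij.
    import numpy as np
    import librosa

    u_ij, sr = librosa.load("utterance.wav", sr=16000)      # waveform u_ij
    mel = librosa.feature.melspectrogram(y=u_ij, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    x_ij = np.log(mel + 1e-6)                               # log-mel spectrogram x_ij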
This approach may have drawbacks:
• It requires training all three models on the same dataset, meaning that this dataset would ideally need to meet the requirements of all models: a large number of speakers for the encoder, but at the same time transcripts for the synthesizer, a low noise level for the synthesizer, and a moderate noise level for the encoder (so that it can handle noisy input speech). These conflicting requirements would lead to training models that could perform better if trained separately on distinct datasets.
• The convergence of the combined model could be very hard to reach. In particular, the Tacotron [1] synthesizer could take a significant time before producing correct alignments.

Figure 1: Sequential three-stage approach for training SV2TTS (a three-stage deep learning framework that allows creating a numerical representation of a voice from a few seconds of audio and using it to condition a text-to-speech model trained to generalize to new voices).

Phase 2: Speaker Encoder
The encoder model and its training procedure are described across several papers [4]. We reproduce this model with a PyTorch implementation of our own, and summarize the parts that are pertinent to SV2TTS as well as our implementation choices.
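
A minimal PyTorch sketch of the kind of encoder we have in mind; the layer count, hidden size and embedding size are our assumptions rather than the exact values from the papers:

    # Minimal sketch of a GE2E-style speaker encoder (sizes are our assumptions):
    # a stack of LSTMs maps the frames of a log-mel spectrogram to a fixed-size,
    # L2-normalized speaker embedding.
    import torch
    import torch.nn as nn

    class SpeakerEncoder(nn.Module):
        def __init__(self, n_mels=40, hidden=256, emb_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
            self.proj = nn.Linear(hidden, emb_dim)

        def forward(self, mels):                           # mels: (batch, frames, n_mels)
            _, (hidden, _) = self.lstm(mels)
            emb = torch.relu(self.proj(hidden[-1]))        # final state of the last layer
            return emb / emb.norm(dim=1, keepdim=True)     # unit-length embedding

    # Usage: a batch of 4 utterances, 160 frames of 40 mel channels each.
    embeddings = SpeakerEncoder()(torch.randn(4, 160, 40))   # shape (4, 256)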

Phase 3: Synthesizer
The synthesizer is Tacotron 2 without WaveNet [5]. We will use an open-source TensorFlow implementation of Tacotron 2, from which we strip WaveNet and to which we add the modifications introduced by SV2TTS. WaveRNN [3] will also be implemented for synthesizing speech from a given transcript.
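
As an illustration of the SV2TTS modification we plan to add, here is a minimal sketch (tensor names and shapes are our assumptions) of conditioning the synthesizer by concatenating the speaker embedding to every encoder output frame:

    # Minimal sketch (shapes are assumptions) of SV2TTS-style conditioning: the
    # speaker embedding is broadcast and concatenated to each encoder output
    # frame before the attention/decoder stage of the synthesizer.
    import torch

    def condition_on_speaker(encoder_outputs, speaker_embedding):
        # encoder_outputs: (batch, text_steps, enc_dim); speaker_embedding: (batch, emb_dim)
        batch, steps, _ = encoder_outputs.shape
        tiled = speaker_embedding.unsqueeze(1).expand(batch, steps, -1)
        return torch.cat([encoder_outputs, tiled], dim=-1)   # (batch, steps, enc_dim + emb_dim)

    conditioned = condition_on_speaker(torch.randn(2, 50, 512), torch.randn(2, 256))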

Phase 4: Vocoder
In SV2TTS and in Tacotron 2, WaveNet is the vocoder. WaveNet has been at the heart of deep learning with audio since its release and remains state of the art when it comes to voice naturalness in TTS. It is, however, also known for being the slowest practical deep learning architecture at inference time.
Nonetheless, WaveNet remains the vocoder in SV2TTS, as speed is not the main concern and because Google’s own WaveNet implementation, with various improvements, already generates at 8,000 samples per second [4]. This is in contrast with “vanilla” WaveNet, which generates at 172 steps per second at best. At the time of writing this synopsis, most open-source implementations of WaveNet are still vanilla implementations.
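
To put these figures in perspective, a back-of-the-envelope comparison, assuming 16 kHz output audio:

    # Back-of-the-envelope comparison (assuming 16 kHz output audio) of how far
    # each WaveNet variant quoted above is from real-time generation.
    sample_rate = 16000                   # samples needed per second of audio
    for name, rate in [("vanilla WaveNet", 172), ("improved WaveNet", 8000)]:
        print(f"{name}: about {sample_rate / rate:.0f}x slower than real time")
    # vanilla WaveNet: about 93x slower than real time
    # improved WaveNet: about 2x slower than real time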

Gantt Chart

Tools Used
Deep Learning Frameworks:
TensorFlow: TensorFlow is a popular open-source deep learning framework. Many TTS
models, including Tacotron, have been implemented using TensorFlow.
PyTorch: PyTorch is another widely used deep learning framework that supports
dynamic computation graphs, making it suitable for TTS research and implementation.

TTS Models and Architectures:


Tacotron: Tacotron is a popular TTS model that uses a sequence-to-sequence
architecture with attention mechanisms. Tacotron has several versions (e.g., Tacotron
2), and its implementations can be found in TensorFlow and PyTorch.
WaveNet: WaveNet, developed by DeepMind, is another TTS model known for
producing high-quality and natural-sounding speech. It is based on a deep generative
model for raw audio waveforms.

Voice Conversion Frameworks:


StarGAN-VC: StarGAN-VC is a framework for voice conversion based on the StarGAN
model. It allows for multi-domain voice conversion, enabling the transformation of a
source speaker's voice to sound like a target speaker's voice.
CycleGAN-VC: CycleGAN-VC is an adaptation of the CycleGAN model for voice
conversion. It uses a cyclic consistency loss to ensure that the converted voice can be
reversed back to the original voice.

Preprocessing Tools:
Librosa: Librosa is a Python package for music and audio analysis. It is often used for
audio preprocessing tasks, such as extracting features from audio signals, which can
be useful in TTS model training.
NLTK (Natural Language Toolkit): For processing and handling text data, NLTK is a powerful library that can be employed for tasks like tokenization and text cleaning, as sketched after this list.
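
A minimal sketch of the kind of text clean-up NLTK could handle before synthesis; the cleaning choices here are illustrative assumptions, not a fixed part of our pipeline:

    # Minimal sketch of text preprocessing with NLTK before synthesis.
    import nltk

    nltk.download("punkt", quiet=True)        # tokenizer models (newer NLTK may also need "punkt_tab")
    text = "Dr. Smith lives at 221B Baker St. He works on TTS."
    sentences = nltk.sent_tokenize(text)      # split the prompt into sentences
    tokens = [nltk.word_tokenize(s.lower()) for s in sentences]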

Voice Cloning Tools:


Descript’s Overdub: Descript's Overdub is a voice cloning tool that allows users to
generate natural-sounding speech with a given voice model. It is designed for
voiceover applications and customization of generated voices.

Speech Synthesis Markup Language (SSML):


SSML tools and libraries: SSML is a standard for speech synthesis markup, allowing
control over aspects like pitch, rate, and volume in synthesized speech. Tools and
libraries supporting SSML can be essential for fine-tuning the characteristics of
generated speech.

References

[1] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: A fully end-to-end text-to-speech synthesis model.

[2] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for
speaker verification.

[3] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis.

[4] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick
Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu. Transfer learning from speaker
verification to multispeaker text-to-speech synthesis.

[5] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio.

