
A Major Project Report on

“Multi-Speaker Voice Cloner”

Submitted for the partial fulfillment of the requirement for the award
of the degree of

Bachelor of Technology
in
Computer Science & Engineering

Submitted to:
Dr. Ritu
Dept. of CSE
GJUS&T, Hisar

Submitted by:
Mohit Kumar (200010130070)
Ishan Singh (200010130047)
B.Tech (CSE) – 7th Sem

Department of Computer Science & Engineering
Guru Jambheshwar University of Science & Technology, Hisar
January, 2024

CANDIDATE’S DECLARATION

We, Mohit Kumar (200010130070) and Ishan Singh (200010130047), certify that the work contained in this project synopsis is original and has been carried out by us under the guidance of our supervisor. This work has not been submitted to any other institute for the award of any degree or diploma, and we have followed the ethical practices and other guidelines provided by the Department of Computer Science and Engineering in preparing the report. Whenever we have used materials (data, theoretical analysis, figures, and text) from other sources, we have given due credit to them by citing them in the text of the report and giving their details in the references. Further, we have taken permission from the copyright owners of the sources wherever necessary.

Signature

Mohit Kumar (200010130070)

Ishan Singh(200010130047)

CERTIFICATE

This is to certify that Mohit Kumar (200010130070) and Ishan Singh (200010130047), students of B.Tech (CSE), Department of Computer Science & Engineering, Guru Jambheshwar University of Science & Technology, Hisar, have completed the project entitled “Multi-Speaker Voice Cloner”.

Associate Prof. Dr. Ritu


Dept. of CSE
GJUS&T, Hisar

Contents
“Multi-Speaker Voice Cloner”

1. Introduction
2. Literature Survey
3. Objective
4. Research Methodology
5. Tools Used
6. References

Introduction
Recent advances in deep learning have shown impressive results in the domain of text-to-speech. To this end, a deep neural network is usually trained on a corpus of several hours of professionally recorded speech from a single speaker. Giving a new voice to such a model is highly expensive, as it requires recording a new dataset and retraining the model. Recent research introduced a three-stage pipeline that allows a voice unseen during training to be cloned from only a few seconds of reference speech, without retraining the model. The authors report remarkably natural-sounding results, but provide no implementation. We aim to reproduce this framework and make it open-source, as the first public implementation of it. We also aim to adapt the framework with a newer vocoder model, so as to make it run in real time.

Deep learning models have become predominant in many fields of applied machine learning. Text-to-speech (TTS), the process of synthesizing artificial speech from a text prompt, is no exception. Deep models that produce more natural-sounding speech than the traditional concatenative approaches began appearing in 2016. Since then, much of the research focus has been on making these deep models more efficient, sound more natural, or trainable in an end-to-end fashion. Inference has gone from being hundreds of times slower than real time on a GPU to being possible in real time on a mobile CPU. Interestingly, speech naturalness is best rated with subjective metrics, and comparison with actual human speech leads to the conclusion that there might be such a thing as “speech more natural than human speech”. In fact, some argue that the human naturalness threshold has already been crossed.

Datasets of professionally recorded speech are a scarce resource. Synthesizing a natural voice with correct pronunciation, lively intonation and a minimum of background noise requires training data with the same qualities. Furthermore, data efficiency remains a core issue of deep learning. Training a common text-to-speech model such as Tacotron [1] typically requires hundreds of hours of speech. Yet the ability to generate speech with any voice is attractive for a range of applications, be they useful or merely a matter of customization. Research has led to frameworks for voice conversion and voice cloning. They differ in that voice conversion is a form of style transfer on a speech segment from one voice to another, whereas voice cloning consists of capturing the voice of a speaker in order to perform text-to-speech on arbitrary inputs.

While the complete training of a single-speaker TTS model is technically a form of voice cloning, the interest rather lies in creating a fixed model able to incorporate new voices with little data. The common approach is to condition a TTS model trained to generalize [2] to new speakers on an embedding of the voice to clone.

Literature Survey
The multi-speaker generative model and the speaker encoder will be trained on the LibriSpeech dataset, which contains 16 kHz audio for 2,484 speakers, totalling 820 hours. LibriSpeech is a dataset for automatic speech recognition, and its audio quality is lower than that of typical speech synthesis datasets. Voice cloning will be performed on the VCTK dataset, which consists of audio sampled at 48 kHz from 108 native speakers of English with various accents. To be consistent with LibriSpeech, the VCTK audio is downsampled to 16 kHz, as sketched below. For a chosen speaker, a few cloning audios will be randomly sampled for each experiment.
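
A minimal sketch of this downsampling step, assuming Librosa is used for resampling (the file name is hypothetical):

    # Minimal sketch (assumed preprocessing step, hypothetical file name): bring a
    # 48 kHz VCTK recording down to the 16 kHz sampling rate used by LibriSpeech.
    import librosa

    wav_48k, sr = librosa.load("vctk_sample.wav", sr=None)            # keep native 48 kHz
    wav_16k = librosa.resample(wav_48k, orig_sr=sr, target_sr=16000)  # match LibriSpeech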

Objective
Our main objective is to achieve a powerful form of voice cloning. The resulting framework must be able to operate in a zero-shot setting, that is, for speakers unseen during training. It should incorporate a speaker’s voice from only a few seconds of reference speech and run in real time, i.e., generate speech in a time shorter than or equal to the duration of the produced speech. We also aim to make the implementation open-source and to integrate previously developed frameworks (if any) into our implementation.
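
To make the real-time criterion concrete, a small illustrative sketch (the timing numbers are made up) of the real-time factor we will track:

    # Minimal sketch of the real-time criterion: synthesis counts as real-time when
    # the real-time factor (generation time / audio duration) is at most 1.
    # The numbers below are illustrative only.
    def real_time_factor(generation_seconds, audio_seconds):
        return generation_seconds / audio_seconds

    print(real_time_factor(1.5, 3.0))   # 0.5 -> twice as fast as real time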

Research Methodology
Our approach can be divided into four phases:
Phase 1: Problem Definition
Consider a dataset of utterances grouped by their speaker. We denote the j-th utterance of the i-th speaker as u_ij. Utterances are in the waveform domain. We denote by x_ij the log-mel spectrogram of the utterance u_ij. A log-mel spectrogram is computed by a deterministic, non-invertible (lossy) function that extracts speech features from a waveform, so as to handle speech in a more tractable fashion in machine learning. By later devising functions for the speaker encoder, synthesizer and vocoder, we can derive a loss function L_v from the resulting objective function.
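
A minimal sketch of how x_ij could be computed from u_ij with Librosa; the STFT and mel parameters here are our own assumptions, not necessarily those of SV2TTS:

    # Minimal sketch (parameter values are our assumptions, file name is hypothetical):
    # compute the log-mel spectrogram x_ij of a 16 kHz utterance waveform u_ij.
    import numpy as np
    import librosa

    u_ij, sr = librosa.load("utterance.wav", sr=16000)      # waveform u_ij
    mel = librosa.feature.melspectrogram(y=u_ij, sr=sr, n_fft=400,
                                         hop_length=160, n_mels=40)
    x_ij = np.log(mel + 1e-6)                               # log-mel spectrogram x_ij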
This approach may have drawbacks:
• It requires training all three models on the same dataset, meaning that this dataset would ideally need to meet the requirements of all models: a large number of speakers for the encoder, but at the same time transcripts for the synthesizer, a low noise level for the synthesizer, and a moderate noise level for the encoder (so that it can handle noisy input speech). These conflicting requirements would lead to training models that could perform better if trained separately on distinct datasets.
• The convergence of the combined model could be very hard to reach. In particular, the Tacotron [1] synthesizer could take a significant time before producing correct alignments.

Figure 1: Sequential three-stage approach for training SV2TTS (a three-stage deep learning framework that allows creating a numerical representation of a voice from a few seconds of audio and using it to condition a text-to-speech model trained to generalize to new voices).

Phase 2: Speaker Encoder
The encoder model and its training procedure are described across several papers [4]. We reproduce this model with a PyTorch implementation of our own, and summarize the parts that are pertinent to SV2TTS as well as our implementation choices.
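
A minimal PyTorch sketch of the kind of encoder we have in mind; the layer count, hidden size and embedding size are our assumptions rather than the exact values from the papers:

    # Minimal sketch of a GE2E-style speaker encoder (sizes are our assumptions):
    # a stack of LSTMs maps the frames of a log-mel spectrogram to a fixed-size,
    # L2-normalized speaker embedding.
    import torch
    import torch.nn as nn

    class SpeakerEncoder(nn.Module):
        def __init__(self, n_mels=40, hidden=256, emb_dim=256):
            super().__init__()
            self.lstm = nn.LSTM(n_mels, hidden, num_layers=3, batch_first=True)
            self.proj = nn.Linear(hidden, emb_dim)

        def forward(self, mels):                           # mels: (batch, frames, n_mels)
            _, (hidden, _) = self.lstm(mels)
            emb = torch.relu(self.proj(hidden[-1]))        # final state of the last layer
            return emb / emb.norm(dim=1, keepdim=True)     # unit-length embedding

    # Usage: a batch of 4 utterances, 160 frames of 40 mel channels each.
    embeddings = SpeakerEncoder()(torch.randn(4, 160, 40))   # shape (4, 256)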

Phase 3: Synthesizer
The synthesizer is Tacotron 2 without WaveNet [5]. We will use an open-source TensorFlow implementation of Tacotron 2, from which we strip WaveNet and to which we add the modifications introduced by SV2TTS. WaveRNN [3] will also be implemented for synthesizing speech from a given transcript.
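
As an illustration of the SV2TTS modification we plan to add, here is a minimal sketch (tensor names and shapes are our assumptions) of conditioning the synthesizer by concatenating the speaker embedding to every encoder output frame:

    # Minimal sketch (shapes are assumptions) of SV2TTS-style conditioning: the
    # speaker embedding is broadcast and concatenated to each encoder output
    # frame before the attention/decoder stage of the synthesizer.
    import torch

    def condition_on_speaker(encoder_outputs, speaker_embedding):
        # encoder_outputs: (batch, text_steps, enc_dim); speaker_embedding: (batch, emb_dim)
        batch, steps, _ = encoder_outputs.shape
        tiled = speaker_embedding.unsqueeze(1).expand(batch, steps, -1)
        return torch.cat([encoder_outputs, tiled], dim=-1)   # (batch, steps, enc_dim + emb_dim)

    conditioned = condition_on_speaker(torch.randn(2, 50, 512), torch.randn(2, 256))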

Phase 4: Vocoder
In SV2TTS and in Tacotron 2, WaveNet is the vocoder. WaveNet has been at the heart of deep learning with audio since its release and remains state of the art when it comes to voice naturalness in TTS. It is, however, also known for being the slowest practical deep learning architecture at inference time.
Nonetheless, WaveNet remains the vocoder in SV2TTS, as speed is not the main concern and because Google’s own WaveNet implementation, with various improvements, already generates at 8,000 samples per second [4]. This is in contrast with “vanilla” WaveNet, which generates at 172 steps per second at best. At the time of writing this synopsis, most open-source implementations of WaveNet are still vanilla implementations.
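
To put these figures in perspective, a back-of-the-envelope comparison, assuming 16 kHz output audio:

    # Back-of-the-envelope comparison (assuming 16 kHz output audio) of how far
    # each WaveNet variant quoted above is from real-time generation.
    sample_rate = 16000                   # samples needed per second of audio
    for name, rate in [("vanilla WaveNet", 172), ("improved WaveNet", 8000)]:
        print(f"{name}: about {sample_rate / rate:.0f}x slower than real time")
    # vanilla WaveNet: about 93x slower than real time
    # improved WaveNet: about 2x slower than real time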

Gantt Chart

Tools Used
Deep Learning Frameworks:
TensorFlow: TensorFlow is a popular open-source deep learning framework. Many TTS
models, including Tacotron, have been implemented using TensorFlow.
PyTorch: PyTorch is another widely used deep learning framework that supports
dynamic computation graphs, making it suitable for TTS research and implementation.

TTS Models and Architectures:


Tacotron: Tacotron is a popular TTS model that uses a sequence-to-sequence
architecture with attention mechanisms. Tacotron has several versions (e.g., Tacotron
2), and its implementations can be found in TensorFlow and PyTorch.
WaveNet: WaveNet, developed by DeepMind, is another TTS model known for
producing high-quality and natural-sounding speech. It is based on a deep generative
model for raw audio waveforms.

Voice Conversion Frameworks:


StarGAN-VC: StarGAN-VC is a framework for voice conversion based on the StarGAN
model. It allows for multi-domain voice conversion, enabling the transformation of a
source speaker's voice to sound like a target speaker's voice.
CycleGAN-VC: CycleGAN-VC is an adaptation of the CycleGAN model for voice
conversion. It uses a cyclic consistency loss to ensure that the converted voice can be
reversed back to the original voice.

Preprocessing Tools:
Librosa: Librosa is a Python package for music and audio analysis. It is often used for
audio preprocessing tasks, such as extracting features from audio signals, which can
be useful in TTS model training.
NLTK (Natural Language Toolkit): For processing and handling text data, NLTK is a powerful library that can be employed for tasks like tokenization and text cleaning, as sketched after this list.
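
A minimal sketch of the kind of text clean-up NLTK could handle before synthesis; the cleaning choices here are illustrative assumptions, not a fixed part of our pipeline:

    # Minimal sketch of text preprocessing with NLTK before synthesis.
    import nltk

    nltk.download("punkt", quiet=True)        # tokenizer models (newer NLTK may also need "punkt_tab")
    text = "Dr. Smith lives at 221B Baker St. He works on TTS."
    sentences = nltk.sent_tokenize(text)      # split the prompt into sentences
    tokens = [nltk.word_tokenize(s.lower()) for s in sentences]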

Voice Cloning Tools:


Descript’s Overdub: Descript's Overdub is a voice cloning tool that allows users to
generate natural-sounding speech with a given voice model. It is designed for
voiceover applications and customization of generated voices.

Speech Synthesis Markup Language (SSML):


SSML tools and libraries: SSML is a standard for speech synthesis markup, allowing
control over aspects like pitch, rate, and volume in synthesized speech. Tools and
libraries supporting SSML can be essential for fine-tuning the characteristics of
generated speech.

References

[1] Yuxuan Wang, R. J. Skerry-Ryan, Daisy Stanton, Yonghui Wu, Ron J. Weiss, Navdeep Jaitly, Zongheng Yang, Ying Xiao, Zhifeng Chen, Samy Bengio, Quoc V. Le, Yannis Agiomyrgiannakis, Rob Clark, and Rif A. Saurous. Tacotron: A fully end-to-end text-to-speech synthesis model.

[2] Li Wan, Quan Wang, Alan Papir, and Ignacio Lopez Moreno. Generalized end-to-end loss for
speaker verification.

[3] Nal Kalchbrenner, Erich Elsen, Karen Simonyan, Seb Noury, Norman Casagrande, Edward Lockhart, Florian Stimberg, Aäron van den Oord, Sander Dieleman, and Koray Kavukcuoglu. Efficient neural audio synthesis.

[4] Ye Jia, Yu Zhang, Ron J. Weiss, Quan Wang, Jonathan Shen, Fei Ren, Zhifeng Chen, Patrick
Nguyen, Ruoming Pang, Ignacio Lopez-Moreno, and Yonghui Wu. Transfer learning from speaker
verification to multispeaker text-to-speech synthesis.

[5] Aäron van den Oord, Sander Dieleman, Heiga Zen, Karen Simonyan, Oriol Vinyals, Alex Graves, Nal Kalchbrenner, Andrew W. Senior, and Koray Kavukcuoglu. WaveNet: A generative model for raw audio.

