Project Report
Bachelor of Technology in
INFORMATION TECHNOLOGY
Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDHYALAYA, BHOPAL (M.P)
Submitted by
CANDIDATE’S DECLARATION
I hereby declare that the Minor Project report on “Text to Speech”, which is being
presented here for the partial fulfillment of the requirements of the degree of “Bachelor of
Technology”, has been carried out at the Department of Information Technology, Oriental
College of Technology, Bhopal. The technical information provided in this report is
presented with due permission of the authorities of the training organization.
Signature of Student
ARCHIT GUPTA (0126IT201022)
ISHITA TOUGDE (0126IT201047)
CERTIFICATE OF INSTITUTE
This is to certify that Mr. ARCHIT GUPTA and Ms. ISHITA TOUGDE of B.Tech. Information Technology,
Enrollment Nos. 0126IT201022 and 0126IT201047, have successfully completed their Minor
Project during the academic year 2022-2023 as partial fulfillment of the Bachelor of Technology in
Information Technology.
IT DEPARTMENT IT DEPARTMENT
ACKNOWLEDGEMENT
I owe sincere thanks to Dr. Amita Mahor, Director, OCT, for providing
me with moral support and necessary help during my project work in
the department. At the same time, I would like to thank Prof. Amit
Kanskar (HOD, IT) and all other faculty members and non-teaching
staff of the Department of Information Technology for their valuable co-operation.
I would also like to thank my institution, faculty members and staff, without
whom this project would have been a distant reality. I also extend my
heartfelt thanks to my family and well-wishers.
1. Front Page
2. Candidate’s Declaration
3. Certificate of Institute
4. Acknowledgement
5. List of Figures
6. List of Tables
7. List of Symbols & Abbreviations
8. Executive Summary
CONTENTS:
1. INTRODUCTION
2. LITERATURE SURVEY
REFERENCES
APPENDICES
INTRODUCTION
A few years ago, the creation of software and hardware image processing systems was
mainly limited to the development of the user interface, which occupied most of the
programmers at each firm. The situation changed significantly with the advent of
the Windows operating system, when the majority of developers switched to solving the
problems of image processing itself. However, this has not yet led to cardinal progress in
solving typical tasks of recognizing faces, car number plates, and road signs, or analyzing remote-sensing and
medical images. Each of these "eternal" problems is still solved by trial and error through the
efforts of numerous groups of engineers and scientists. As modern technical solutions
turn out to be excessively expensive, the task of automating the creation of software tools
for solving intellectual problems has been formulated and is being intensively pursued abroad. In the field of
image processing, the required toolkit should support the analysis and recognition of
images of previously unknown content and ensure the effective development of applications
by ordinary programmers, just as the Windows toolkit supports the creation of interfaces for
solving various applied problems.
Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It’s
sometimes called “read aloud” technology. TTS can take words on a computer or other
digital device and convert them into audio. TTS is very helpful for kids who struggle with
reading, but it can also help kids with writing and editing, and even focusing.
The voice in TTS is computer-generated, and reading speed can usually be sped up or slowed
down. Voice quality varies, but some voices sound human. There are even computer-
generated voices that sound like children speaking.
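For instance, the Web Speech API built into most modern browsers exposes exactly these controls; the snippet below is a minimal sketch of adjusting the reading speed, not production code:

const utterance = new SpeechSynthesisUtterance(
  "Text to speech reads digital text aloud."
);
utterance.rate = 0.8;  // slower than the default speed of 1.0
utterance.pitch = 1.0; // default pitch
window.speechSynthesis.speak(utterance); // the browser reads the text aloud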
Many TTS tools highlight words as they are read aloud. This allows kids to see text and hear
it at the same time.
Some TTS tools also have a technology called optical character recognition (OCR). OCR
allows TTS tools to read text aloud from images. For example, your child could take a photo
of a street sign and have the words on the sign turned into audio.
1.2.1 Depending on the device your child uses, there are many different TTS tools:
Built-in text-to-speech: Many devices have built-in TTS tools, including desktop and
laptop computers, smartphones, digital tablets, and Chromebooks. Your child can use this TTS
without purchasing special apps or software.
Web-based tools: Some websites have TTS tools on-site. For instance, you can turn on our
website’s “Reading Assist” tool, located in the lower left corner of your screen, to have this
web page read aloud. Also, kids with dyslexia may qualify for a free Bookshare account with
digital books that can be read with TTS.
Text-to-speech apps: Kids can also download TTS apps on smartphones and digital tablets.
These apps often have special features like text highlighting in different colors and OCR.
Some examples include Voice Dream Reader, Claro Scan Pen and Office Lens.
Chrome tools: The Chrome browser is a platform with several TTS tools. These include
Read&Write for Google Chrome and Snap&Read Universal. You can use these tools on a
Chromebook or any computer with the Chrome browser.
Text-to-speech software programs: There are also several literacy software programs for
desktop and laptop computers. In addition to other reading and writing tools, many of these
programs have TTS. Examples include Kurzweil 3000, ClaroRead and Read&Write.
Microsoft’s Immersive Reader tool also has TTS. It can be found in programs like OneNote
and Word.
Despite its growing popularity, the research on text-to-speech is somewhat inconclusive.
While this technology allows students to access the classroom material, some researchers
have found mixed results on how well students are able to comprehend the text being read to
them (Dalton & Strangman, 2006). Furthermore, another team of researchers found that text-to-speech
technologies did not impact adolescent students' ability to comprehend the reading;
however, the students did report that they valued the increased independence that the TTS
software gave them (Meyer, 2014).
However, one study found that students who had been diagnosed with dyslexia did benefit
from the use of TTS software. This team offered students training in TTS software in a
small-group format for six weeks, and saw improvements in motivation to read,
comprehension, and fluency (White, 2014). Similarly, positive results were found
in another study in which TTS was found to be effective in allowing students to access the
reading material; it was also perceived favorably by the students who used it, especially
students in grades 6-8.
Problem definition:-
1. People with different learning styles:- Some people are auditory learners, some are
visual learners, and some are kinesthetic learners; most learn best through a combination of
the three. Universal Design for Learning is a plan for teaching which, through the use of
technology and adaptable lesson plans, aims to help the maximum number of learners
comprehend and retain information by appealing to all learning styles.
2. More Accommodating:- TTS software eliminates the need to collaborate with voiceover
professionals. Although it is quite easy to find voiceover artists today, you cannot vouch for
their authenticity and quality before collaborating with them. Since you need to enter into a
contract before using their voice, you may still have to pay if their quality of voice is not as
you expected. Although tying up with voiceover artists may seem easy (and it indeed is), the
entire process of execution might be strenuous and time-consuming. Thanks to online TTS
services, businesses and individuals have found a more convenient way to get their text
translated into human voice (read, speech). Moreover, you can use the TTS software to re-
narrate the text as often as you want until you obtain the best results.
3. Instant results and cost efficiency:- If you do not have the time or knowhow to find a
voiceover artist, sign up a contract, specify the requirement, do the recording, edit the voice
clip, redo if required, and pay the money, TTS services are meant for you. TTS services let
you do all of these at a fraction of the cost you’d spend otherwise. Understanding TTS
services is as easy as ABC. All you need to do is to sign up for free, upload the text, and get
the output in the blink of an eye. TTS service providers generally offer voice generators for
free. However, the free plan provides you with limited access to their services. You may
have to pay a minimal fee to get more extensive sounds and effects. The little amount you
pay to avail of a good quality TTS service can help you save a massive amount.
BACKGROUND
The aim of text-to-speech technology is to take text input and convert it into digital
audio signals. Text-to-speech synthesis (TTS) is the automatic conversion of a text into
speech that resembles, as closely as possible, a native speaker of the language reading that
text. A text-to-speech synthesizer is the technology which lets a computer speak to you.
The TTS system gets text as the input, and then a computer algorithm called the
TTS engine analyses the text, pre-processes it, and synthesizes the speech with
mathematical models. The TTS engine usually generates sound data in an audio format as
the output. In the early days of TTS, it was not very efficient; however, the advent of
deep learning entirely changed the scenario. As it stands, modern computers are capable of
concatenating speech from various databases. This synthesized speech closely resembles
natural speech in pitch, pronunciation, frequency, etc. Considering the fact
that text-to-speech assistive technology excellently interprets the text and the associated
speech constraints, it is widely employed by businesses to enhance the user experience.
One of the conspicuous technologies used for text-to-speech conversion is optical
character recognition (OCR), which converts text in images or handwritten
documents into machine-encoded text. This machine-encoded text can then be read aloud
by the TTS tools. Prominent TTS tools encompass web-based tools, Chrome tools, text-to-speech
apps, text-to-speech software, etc.
The input text is given to the system, and then the preferred language in
which we want the text to be spoken is selected. The converted audio is the output,
which can be listened to. Thus the output is the digital audio signal
processed from the input text provided. The audio can then be downloaded and
listened to for various purposes. For example, suppose a person does not know
French and wants to listen to some text in French. He will provide the text
that he wants to hear at the input stage. After this, the person will select the
language in which he wants to listen to the text; in this example, French.
After the selection, the system will process the input text, convert it into
sound signals, and provide the output as audio in the chosen language.
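As an illustrative sketch, the browser's Web Speech API can perform this language selection step, assuming a French voice is installed on the device:

function speakInLanguage(text, langPrefix) {
  // Voices load asynchronously in some browsers; call again after the
  // "voiceschanged" event if this list comes back empty.
  const voices = window.speechSynthesis.getVoices();
  const voice = voices.find((v) => v.lang.startsWith(langPrefix));
  const utterance = new SpeechSynthesisUtterance(text);
  if (voice) utterance.voice = voice;
  utterance.lang = langPrefix; // language hint even without an exact voice
  window.speechSynthesis.speak(utterance);
}

speakInLanguage("Bonjour tout le monde", "fr"); // read the text in French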
Speech synthesis is the building block of this technology. Speech synthesis is the artificial
production of human speech. A computer system used for this purpose is called a speech
synthesizer, and can be implemented in software or hardware products. A text-to-speech
(TTS) system converts normal language text into speech; other systems render symbolic
linguistic representations like phonetic transcriptions into speech. The reverse process is
speech recognition.
Synthesized speech can be created by concatenating pieces of recorded speech that are
stored in a database. Systems differ in the size of the stored speech units; a system that
stores phones or diphones provides the largest output range, but may lack clarity. For
specific usage domains, the storage of entire words or sentences allows for high-quality
output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other
human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice and by
its ability to be understood clearly. An intelligible text-to-speech program allows people with
visual impairments or reading disabilities to listen to written words on a home computer.
Many computer operating systems have included speech synthesizers since the early
1990s.
A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-
end. The front-end has two major tasks. First, it converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-out words. This process is often
called text normalization, pre-processing, or tokenization. The front-end then
assigns phonetic transcriptions to each word, and divides and marks the text into prosodic
units, like phrases, clauses, and sentences. The process of assigning phonetic
transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.
Phonetic transcriptions and prosody information together make up the symbolic linguistic
representation that is output by the front-end. The back-end—often referred to as
the synthesizer—then converts the symbolic linguistic representation into sound. In certain
systems, this part includes the computation of the target prosody (pitch contour, phoneme
durations), which is then imposed on the output speech. There are different ways to
perform speech synthesis. The choice depends on the task, but the most
widely used method is Concatenative Synthesis, because it generally produces the most
natural-sounding synthesized speech.
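A minimal sketch of this front-end/back-end split is given below; the helper logic consists of toy stand-ins for real normalization, grapheme-to-phoneme, and synthesis steps:

function frontEnd(rawText) {
  // Text normalization: lower-case the text and split it into word tokens.
  const words = rawText.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  // Grapheme-to-phoneme conversion (toy rule: one letter per phoneme).
  const phonemes = words.map((word) => word.split(""));
  // Prosodic marking (toy rule: a phrase boundary after the last word).
  const prosody = words.map((_, i) => ({ boundary: i === words.length - 1 }));
  return { phonemes, prosody }; // the symbolic linguistic representation
}

function backEnd({ phonemes, prosody }) {
  // A real back-end renders audio; this stub only reports what it would do.
  return "synthesizing " + phonemes.length + " words, " +
         prosody.filter((p) => p.boundary).length + " phrase boundary";
}

console.log(backEnd(frontEnd("Hello world"))); // synthesizing 2 words, 1 phrase boundary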
Unit Selection Synthesis: Unit selection synthesis uses large databases of recorded
speech. During database creation, each recorded utterance is segmented into some or all of
the following: individual phones, diphones, half-phones, syllables, morphemes, words,
phrases, and sentences. Typically, the division into segments is done using a specially
modified speech recognizer set to a "forced alignment" mode with some manual
correction afterward, using visual representations such as the waveform and spectrogram.
An index of the units in the speech database is then created based on the segmentation and
acoustic parameters like the fundamental frequency (pitch), duration, position in the
syllable, and neighboring phones. At runtime, the desired target utterance is created by
determining the best chain of candidate units from the database (unit selection). This
process is typically achieved using a specially weighted decision tree. Unit selection
provides the greatest naturalness, because it applies only a small amount of digital
signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound
less natural, although some systems use a small amount of signal processing at the point
of concatenation to smooth the waveform. The output from the best unit-selection systems
is often indistinguishable from real human voices, especially in contexts for which the
TTS system has been tuned. However, maximum naturalness typically requires
unit-selection speech databases to be very large, in some systems ranging into the
gigabytes of recorded data, representing dozens of hours of speech [12]. Also, unit
selection algorithms have been known to select segments from a place that results in less
than ideal synthesis (e.g. minor words become unclear) even when a better choice exists
in the database.
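The sketch below illustrates the idea of unit selection with a toy database: for each target phone, the candidate minimizing target cost plus the cost of joining it to the previous unit is chosen. Real systems search over all candidate chains (e.g. with a Viterbi search) and use many more acoustic features; this greedy version with pitch as the only feature is purely illustrative:

const DB = {
  h: [{ phone: "h", pitch: 100 }, { phone: "h", pitch: 140 }],
  e: [{ phone: "e", pitch: 110 }, { phone: "e", pitch: 180 }],
};

function selectUnits(targetPhones, targetPitch) {
  const chain = [];
  for (const phone of targetPhones) {
    let best = null, bestScore = Infinity;
    for (const unit of DB[phone]) {
      const targetCost = Math.abs(unit.pitch - targetPitch);
      const joinCost = chain.length
        ? Math.abs(unit.pitch - chain[chain.length - 1].pitch)
        : 0;
      if (targetCost + joinCost < bestScore) {
        bestScore = targetCost + joinCost;
        best = unit;
      }
    }
    chain.push(best);
  }
  return chain; // units to concatenate, with light smoothing at the joins
}

console.log(selectUnits(["h", "e"], 105)); // picks the 100 Hz and 110 Hz units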
Diphone Synthesis: Diphone synthesis uses a minimal speech database containing all
the diphones (sound-to-sound transitions) occurring in a language. The number of
diphones depends on the phonotactics of the language: for example, Spanish has about
800 diphones, and German about 2500. In diphone synthesis, only one example of each
diphone is contained in the speech database. At runtime, the target prosody of a sentence
is superimposed on these minimal units by means of digital signal processing techniques
such as linear predictive coding, PSOLA[12] or MBROLA. The quality of the resulting
speech is generally worse than that of unit-selection systems, but more natural-sounding
than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches
of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has
few of the advantages of either approach other than small size. As such, its use in
commercial applications is declining, although it continues to be used in research because
there are a number of freely available software implementations.
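A short sketch of how a phoneme sequence maps onto the diphone units such a database stores (the phoneme names here are illustrative):

function toDiphones(phonemes) {
  const padded = ["_", ...phonemes, "_"]; // "_" marks silence at the edges
  const diphones = [];
  for (let i = 0; i < padded.length - 1; i++) {
    // each diphone spans the transition between two neighbouring sounds
    diphones.push(padded[i] + "-" + padded[i + 1]);
  }
  return diphones;
}

console.log(toDiphones(["h", "e", "l", "ou"]));
// ["_-h", "h-e", "e-l", "l-ou", "ou-_"]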
Structure of A Text-To-Speech Synthesizer System:
Text-to-speech synthesis takes place in several steps. The TTS system gets a text as input,
which it first must analyze and then transform into a phonetic description. In a
further step it generates the prosody. From the information now available, it can produce a
speech signal. The structure of the text-to-speech synthesizer can be broken down into the following
major modules:
Text Analysis: First the text is segmented into tokens. The token-to-word conversion
creates the orthographic form of each token. For the token “Mr” the orthographic form
“Mister” is formed by expansion, the token “12” gets the orthographic form “twelve”, and
“1997” is transformed into “nineteen ninety seven”.
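A toy sketch of this token-to-word conversion follows; the abbreviation and number tables are illustrative stand-ins for a full normalizer:

const ABBREVIATIONS = { Mr: "Mister", Dr: "Doctor" };
const NUMBERS = { 12: "twelve", 1997: "nineteen ninety seven" };

function expandToken(token) {
  const bare = token.replace(/\.$/, ""); // drop a trailing period
  if (ABBREVIATIONS[bare]) return ABBREVIATIONS[bare];
  // a real system would spell out any number, not just the ones listed here
  if (/^\d+$/.test(bare)) return NUMBERS[bare] ?? bare;
  return token;
}

console.log(["Mr", "12", "1997"].map(expandToken));
// ["Mister", "twelve", "nineteen ninety seven"]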
Application of Pronunciation Rules: After the text analysis has been completed,
pronunciation rules can be applied. Letters cannot be transformed 1:1 into phonemes
because the correspondence is not always one-to-one. In certain environments, a single letter can
correspond to no phoneme (for example, “h” in “caught”) or to several phonemes (“x”
in “maximum”). In addition, several letters can correspond to a single phoneme (“ch” in
“rich”). There are two strategies to determine pronunciation: a dictionary-based strategy, in
which pronunciations are looked up in a large lexicon, and a rule-based strategy, in which
letter-to-sound rules derive the pronunciation of words not found in the lexicon.
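A sketch combining both strategies: look the word up in a small exception dictionary first, then fall back to letter-to-sound rules. The lexicon entries and rules below are illustrative, not a real phone set:

const LEXICON = { caught: ["k", "ao", "t"] }; // dictionary strategy
const RULES = [
  { from: "ch", to: ["ch"] },   // two letters -> one phoneme
  { from: "x", to: ["k", "s"] } // one letter -> two phonemes
];

function pronounce(word) {
  if (LEXICON[word]) return LEXICON[word];
  // rule strategy: scan the word, applying the first matching rule
  const phones = [];
  let i = 0;
  while (i < word.length) {
    const rule = RULES.find((r) => word.startsWith(r.from, i));
    if (rule) {
      phones.push(...rule.to);
      i += rule.from.length;
    } else {
      phones.push(word[i]); // naive letter-as-phoneme fallback
      i += 1;
    }
  }
  return phones;
}

console.log(pronounce("rich"));   // ["r", "i", "ch"]
console.log(pronounce("caught")); // ["k", "ao", "t"], from the lexicon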
Prosody Generation: After the pronunciation has been determined, the prosody is generated.
The degree of naturalness of a TTS system is dependent on prosodic factors like intonation
modelling (phrasing and accentuation), amplitude modelling and duration modelling
(including the duration of sound and the duration of pauses, which determines the length of
the syllable and the tempos of the speech).
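In the browser, the Web Speech API exposes only coarse versions of these prosodic parameters, but they illustrate what a prosody model manipulates:

const u = new SpeechSynthesisUtterance("Prosody shapes how natural speech sounds.");
u.rate = 0.9;   // tempo / duration modelling (1.0 is the default)
u.pitch = 1.2;  // intonation: raise the overall pitch slightly
u.volume = 1.0; // amplitude modelling (range 0.0 to 1.0)
window.speechSynthesis.speak(u);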
LITERATURE SURVEY
Speech recognition systems described in the literature differ along several dimensions:
• Speaker: All speakers have a different kind of voice. Models are hence either
designed for a specific speaker or are speaker-independent.
• Vocal Sound: The way the speaker speaks also plays a role in speech recognition. Some
models can recognize only single utterances, or separate utterances with a pause in between.
• Vocabulary: The size of the vocabulary plays an important role in determining the
complexity, performance, and precision of the system.
1. Basic Speech Recognition Model: Each speech recognition system follows some
standard steps, as shown in the figure.
(i) Pre-processing: The analog speech signal is transformed into a digital signal for later
processing. This digital signal is passed through first-order filters to spectrally flatten the
signal, which helps increase the signal's energy at higher frequencies.
(ii) Feature Extraction: This step finds the set of parameters of utterances that have a
correlation with the speech signal. These parameters, known as features, are computed
by processing the acoustic waveform. The main focus is to compute a sequence of
feature vectors (relevant information) providing a compact representation of the given
input signal. Commonly used feature extraction techniques are discussed below:
• Linear Predictive Coding (LPC): The basic idea is that a speech sample can be approximated as
a linear combination of past speech samples. Figure 2 shows the LPC process. The
digitized signal is blocked into frames of N samples. Each frame is then windowed to
minimize signal discontinuities, and each windowed frame is auto-correlated. The last step is
the LPC analysis, which converts each frame of auto-correlations into an LPC parameter set.
• Mel-Frequency Cepstrum Coefficients (MFCC): A very powerful technique that models the
human auditory perception system. MFCC applies a series of steps to the input signal.
Framing: the speech waveform is cropped to remove interference if present. Windowing:
minimizes the discontinuities in the signal. Discrete Fourier Transform: converts each frame
from the time domain to the frequency domain. Mel Filter Bank: the signal is mapped onto
the Mel spectrum to mimic human hearing.
• Dynamic Time Warping (DTW): An algorithm, based on dynamic programming, for measuring
the similarity between two time series which may vary in speed. It aims at aligning two
sequences of feature vectors (one from each series) iteratively until an
optimal match (according to a suitable metric) between them is found.
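A compact sketch of DTW over two sequences of feature vectors, using squared Euclidean distance as the local metric:

function dtw(seqA, seqB) {
  const dist = (a, b) => a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
  const n = seqA.length, m = seqB.length;
  // cost[i][j] = minimal accumulated cost of aligning the first i frames
  // of seqA with the first j frames of seqB
  const cost = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      cost[i][j] = dist(seqA[i - 1], seqB[j - 1]) +
        Math.min(cost[i - 1][j],      // insertion
                 cost[i][j - 1],      // deletion
                 cost[i - 1][j - 1]); // match
    }
  }
  return cost[n][m]; // lower cost = more similar utterances
}

// the slower sequence aligns to the faster one despite the extra frame
console.log(dtw([[0], [1], [2]], [[0], [1], [1], [2]])); // 0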
Acoustic Models: This is the fundamental part of an Automated Speech Recognition (ASR)
system, where a connection between the acoustic information and the phonetics is established.
Training establishes a correlation between the basic speech units and the acoustic
observations.
Language Models: This model estimates the probability of a word occurring after a given
word sequence. It captures the structural constraints of the language to generate
the probabilities of occurrence. The language model distinguishes words and phrases that
sound similar.
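A toy bigram model makes this concrete: it estimates the probability of a word given the previous word from counts over a training corpus. Real recognizers use smoothed n-grams or neural models; this only shows the idea:

function buildBigramModel(corpus) {
  const pairCounts = new Map();
  const prevCounts = new Map();
  for (const sentence of corpus) {
    const words = ["<s>", ...sentence.toLowerCase().split(/\s+/)];
    for (let i = 0; i + 1 < words.length; i++) {
      const pair = words[i] + " " + words[i + 1];
      prevCounts.set(words[i], (prevCounts.get(words[i]) ?? 0) + 1);
      pairCounts.set(pair, (pairCounts.get(pair) ?? 0) + 1);
    }
  }
  // P(next | prev) = count(prev next) / count(prev)
  return (prev, next) =>
    (pairCounts.get(prev + " " + next) ?? 0) / (prevCounts.get(prev) ?? 1);
}

const p = buildBigramModel(["recognize speech", "recognize speech easily"]);
console.log(p("recognize", "speech")); // 1: "speech" always follows "recognize"
console.log(p("recognize", "beach"));  // 0: never observed, so it is ruled out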
Pattern Classification: It is the process of comparing an unknown pattern with the existing
sound reference patterns and computing the similarity between them. After the system
has been trained, patterns are classified at testing time to recognize the
speech. Different approaches to pattern matching are:
Template Based Approach: This approach keeps a collection of speech patterns which are
stored as references representing dictionary words. Speech is recognized by matching the
uttered word against the reference templates.
Knowledge Based Approach: This approach takes a set of features from the speech and
then trains the system to generate a set of production rules automatically from the samples.
Text Processing: The input text is analysed, normalized (handling acronyms and
abbreviations) and transcribed into a phonetic or linguistic representation.
• Speech Synthesis: Some of the speech synthesis techniques are:
Articulatory Synthesis: Uses mechanical and acoustic models for speech generation. It
produces intelligible synthetic speech, but the result is far from natural-sounding and hence
this method is not widely used.
With globalization, people travel from one place to another for work and
leisure. They might not be fluent in the language spoken in different
areas or countries, so with the help of this technology people can speak in their
native language and can get it translated according to the country in which
they are travelling. TTS allows people to enjoy content, and also provides an option for content
consumption on the go, taking content away from the computer screen and
into any environment that is convenient for the consumer. For people with
visual impairment, text to speech can be a very useful tool as well. For those
users, reading from a small screen is not always easy; having text-to-speech software do the
work is much easier. It allows people to get the information they want without
having to read it from the screen.
SOFTWARE REQUIREMENTS:-
HARDWARE REQUIREMENTS:-
LANGUAGES USED:-
HTML
CSS
JAVASCRIPT
USE CASE DIAGRAM
PROPOSED METHODOLOGY
A text-to-speech device consists of two main modules: the image processing module and
the voice processing module. The image processing module captures an image using a camera and
converts the image into text. The voice processing module changes the text into sound and
processes it with specific physical characteristics so that the sound can be understood.
Figure 1 shows the block diagram of the Text-To-Speech device: the first block is the image processing
module, where OCR converts the .jpg image into .txt form; the second is the voice processing module,
which converts the text to speech.
OCR is an important element in this module. OCR, or Optical Character Recognition, is a
technology that automatically recognizes characters through an optical mechanism. This
technology imitates the ability of the human sense of sight, where the camera becomes a
replacement for the eye and image processing is done in the computer engine as a substitute
for the human brain. Tesseract OCR is a type of OCR engine with matrix matching. Tesseract
was selected because of its flexibility and extensibility across machines, because many
communities of active researchers continue to develop this OCR engine, and because
Tesseract OCR can support 149 languages. In this project we identify English alphabets.
Before the image is fed to the OCR, it is converted to a binary image to increase the
recognition accuracy. The binary conversion is done using the ImageMagick software,
another open-source tool for image manipulation. The output of the OCR is the text, which
is stored in a file (speech.txt). Machines still have defects such as distortion at the edges
and dim-light effects, so it is still difficult for most OCR engines to achieve high-accuracy
text. Supporting conditions are needed in order to keep such defects minimal.
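As a browser-based sketch of this two-module pipeline, assuming the open-source Tesseract.js port of the Tesseract engine is loaded on the page:

async function imageToSpeech(imageUrl) {
  // Image processing module: OCR the photo into plain text.
  const { data } = await Tesseract.recognize(imageUrl, "eng");
  // Voice processing module: convert the recognized text into audio.
  const utterance = new SpeechSynthesisUtterance(data.text);
  window.speechSynthesis.speak(utterance);
  return data.text;
}

imageToSpeech("sign.jpg"); // e.g. a photo of a street sign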
REFERENCES
Suendermann, D., Höge, H., and Black, A., 2010. Challenges in Speech Synthesis. In: Chen, F., Jokinen, K. (eds.), Speech Technology. Springer Science + Business Media LLC.
Allen, J., Hunnicutt, M. S., and Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press.
Rubin, P., Baer, T., and Mermelstein, P., 1981. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321–328.
Van Santen, J. P. H., Sproat, R. W., Olive, J. P., and Hirschberg, J., 1997. Progress in Speech Synthesis. Springer.
Lamel, L. F., Gauvain, J. L., Prouts, B., Bouhier, C., and Boesch, R., 1993. Generation and synthesis of broadcast messages. Proceedings, ESCA-NATO Workshop on Applications of Speech Technology.
Van Truc, T., Le Quang, P., Van Thuyen, V., Hieu, L. T., Tuan, N. M., and Hung, P. D., 2013. Vietnamese Synthesis System. Capstone Project Document, FPT University.
Kominek, J., and Black, A. W., 2003. CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.