Project Report
Bachelor of Technology in
INFORMATION TECHNOLOGY
Submitted to
RAJIV GANDHI PROUDYOGIKI VISHWAVIDHYALAYA, BHOPAL (M.P)
Submitted by
CANDIDATE’S DECLARATION
I hereby declare that the Minor Project report on “Text to Speech”, which is being
presented here for the partial fulfillment of the requirements of the degree of “Bachelor of
Technology”, has been carried out at the Department of Information Technology, Oriental
College of Technology, Bhopal. The technical information provided in this report is
presented with due permission of the authorities of the training organization.
Signature of Student
ARCHIT GUPTA (0126IT201022)
ISHITA TOUGDE (0126IT201047)
CERTIFICATE OF INSTITUTE
This is to certify that Mr. ARCHIT GUPTA and Ms. ISHITA TOUGDE of B.Tech. Information Technology,
Enrollment Nos. 0126IT201022 and 0126IT201047, have successfully completed their Minor
Project during the academic year 2022-2023 as partial fulfillment of the Bachelor of Technology in
Information Technology.
IT DEPARTMENT IT DEPARTMENT
ACKNOWLEDGEMENT
I owe sincere thanks to Dr. Amita Mahor, Director, OCT, for providing
me with moral support and necessary help during my project work in
the department. At the same time, I would like to thank Prof. Amit
Kanskar (HOD, IT) and all other faculty members and non-teaching
staff of the Department of Information Technology for their valuable co-operation.
I would also like to thank my institution, faculty members and staff, without
whom this project would have been a distant reality. I also extend my
heartfelt thanks to my family and well-wishers.
1. Front Page
2. Candidate’s Declaration
3. Certificate of Institute
4. Acknowledgement
5. List of Figures
6. List of Tables
7. List of Symbols & Abbreviations
8. Executive Summary
CONTENTS:
1. INTRODUCTION
2. LITERATURE SURVEY
REFERENCES
APPENDICES
INTRODUCTION
A few years ago, the creation of software and hardware image processing systems was
mainly limited to the development of the user interface, which occupied most of the
programmers at each firm. The situation changed significantly with the advent of
the Windows operating system, when the majority of developers switched to solving the
problems of image processing itself. However, this has not yet led to cardinal progress in
solving typical tasks of recognizing faces, car number plates, and road signs, or analyzing remote-sensing and
medical images. Each of these "eternal" problems is still solved by trial and error through the
efforts of numerous groups of engineers and scientists. As modern technical solutions
turn out to be excessively expensive, the task of automating the creation of software tools
for solving intellectual problems has been formulated and is being intensively pursued abroad. In the field of
image processing, the required toolkit should support the analysis and recognition of
images of previously unknown content and ensure the effective development of applications
by ordinary programmers, just as the Windows toolkit supports the creation of interfaces for
solving various applied problems.
Text-to-speech (TTS) is a type of assistive technology that reads digital text aloud. It’s
sometimes called “read aloud” technology. TTS can take words on a computer or other
digital device and convert them into audio. TTS is very helpful for kids who struggle with
reading, but it can also help kids with writing and editing, and even focusing.
The voice in TTS is computer-generated, and reading speed can usually be sped up or slowed
down. Voice quality varies, but some voices sound human. There are even computer-
generated voices that sound like children speaking.
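For instance, the Web Speech API built into most modern browsers exposes exactly these controls; the snippet below is a minimal sketch of adjusting the reading speed, not production code:

const utterance = new SpeechSynthesisUtterance(
  "Text to speech reads digital text aloud."
);
utterance.rate = 0.8;  // slower than the default speed of 1.0
utterance.pitch = 1.0; // default pitch
window.speechSynthesis.speak(utterance); // the browser reads the text aloud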
Many TTS tools highlight words as they are read aloud. This allows kids to see text and hear
it at the same time.
Some TTS tools also have a technology called optical character recognition (OCR). OCR
allows TTS tools to read text aloud from images. For example, your child could take a photo
of a street sign and have the words on the sign turned into audio.
1.2.1 Depending on the device your child uses, there are many different TTS tools:
Built-in text-to-speech: Many devices have built-in TTS tools, including desktop and
laptop computers, smartphones, digital tablets, and Chromebooks. Your child can use this TTS
without purchasing special apps or software.
Web-based tools: Some websites have TTS tools on-site. For instance, you can turn on our
website’s “Reading Assist” tool, located in the lower left corner of your screen, to have this
web page read aloud. Also, kids with dyslexia may qualify for a free Bookshare account with
digital books that can be read with TTS.
Text-to-speech apps: Kids can also download TTS apps on smartphones and digital tablets.
These apps often have special features like text highlighting in different colors and OCR.
Some examples include Voice Dream Reader, Claro Scan Pen and Office Lens.
Chrome tools: The Chrome browser is a platform with several TTS tools. These include
Read&Write for Google Chrome and Snap&Read Universal. You can use these tools on a
Chromebook or any computer with the Chrome browser.
Text-to-speech software programs: There are also several literacy software programs for
desktop and laptop computers. In addition to other reading and writing tools, many of these
programs have TTS. Examples include Kurzweil 3000, ClaroRead and Read&Write.
Microsoft’s Immersive Reader tool also has TTS. It can be found in programs like OneNote
and Word.
Despite its growing popularity, the research on text-to-speech is somewhat inconclusive.
While this technology allows students to access the classroom material, some researchers
have found mixed results on how well students are able to comprehend the text being read to
them (Dalton & Strangman, 2006). Furthermore, another team of researchers found that text-to-speech
technologies did not impact adolescent students' ability to comprehend the reading;
however, the students did report that they valued the increased independence that the TTS
software gave them (Meyer, 2014).
However, one study found that students who had been diagnosed with dyslexia did benefit
from the use of TTS software. This team offered students training in TTS software in a
small-group format for six weeks, and saw improvements in motivation to read,
comprehension, and fluency (White, 2014). Similarly, positive results were found
in another study in which TTS was found to be effective in allowing students to access the
reading material; it was also perceived favorably by the students who used it, especially
students in grades 6-8.
Problem definition:-
1. People with different learning styles:- Some people are auditory learners, some are
visual learners, and some are kinesthetic learners; most learn best through a combination of
the three. Universal Design for Learning is a plan for teaching which, through the use of
technology and adaptable lesson plans, aims to help the maximum number of learners
comprehend and retain information by appealing to all learning styles.
2. More Accommodating:- TTS software eliminates the need to collaborate with voiceover
professionals. Although it is quite easy to find voiceover artists today, you cannot vouch for
their authenticity and quality before collaborating with them. Since you need to enter into a
contract before using their voice, you may still have to pay if their quality of voice is not as
you expected. Although tying up with voiceover artists may seem easy (and it indeed is), the
entire process of execution might be strenuous and time-consuming. Thanks to online TTS
services, businesses and individuals have found a more convenient way to get their text
translated into human voice (read, speech). Moreover, you can use the TTS software to re-
narrate the text as often as you want until you obtain the best results.
3. Instant results and cost efficiency:- If you do not have the time or knowhow to find a
voiceover artist, sign up a contract, specify the requirement, do the recording, edit the voice
clip, redo if required, and pay the money, TTS services are meant for you. TTS services let
you do all of these at a fraction of the cost you’d spend otherwise. Understanding TTS
services is as easy as ABC. All you need to do is to sign up for free, upload the text, and get
the output in the blink of an eye. TTS service providers generally offer voice generators for
free. However, the free plan provides you with limited access to their services. You may
have to pay a minimal fee to get more extensive sounds and effects. The little amount you
pay to avail of a good quality TTS service can help you save a massive amount.
BACKGROUND
The aim of text-to-speech technology is to take text input and convert it into digital
audio signals. Text-to-speech synthesis (TTS) is the automatic conversion of a text into
speech that resembles, as closely as possible, a native speaker of the language reading that
text. A text-to-speech synthesizer is the technology which lets a computer speak to you.
The TTS system gets text as the input, and then a computer algorithm called the
TTS engine analyses the text, pre-processes it, and synthesizes the speech with
mathematical models. The TTS engine usually generates sound data in an audio format as
the output. In the early days of TTS, it was not very efficient; however, the advent of
deep learning entirely changed the scenario. As it stands, modern computers are capable of
concatenating speech from various databases. This synthesized speech closely resembles
natural speech in pitch, pronunciation, frequency, etc. Considering the fact
that text-to-speech assistive technology excellently interprets the text and the associated
speech constraints, it is widely employed by businesses to enhance the user experience.
One of the conspicuous technologies used for text-to-speech conversion is optical
character recognition (OCR), which converts text in images or handwritten
documents into machine-encoded text. This machine-encoded text can then be read aloud
by the TTS tools. Prominent TTS tools encompass web-based tools, Chrome tools, text-to-speech
apps, text-to-speech software, etc.
The input text is given to the system, and then the preferred language in
which we want the text to be spoken is selected. The converted audio is the output,
which can be listened to. Thus the output is the digital audio signal
processed from the input text provided. The audio can then be downloaded and
listened to for various purposes. For example, suppose a person does not know
French and wants to listen to some text in French. He will provide the text
that he wants to hear at the input stage. After this, the person will select the
language in which he wants to listen to the text; in this example, French.
After the selection, the system will process the input text, convert it into
sound signals, and provide the output as audio in the chosen language.
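As an illustrative sketch, the browser's Web Speech API can perform this language selection step, assuming a French voice is installed on the device:

function speakInLanguage(text, langPrefix) {
  // Voices load asynchronously in some browsers; call again after the
  // "voiceschanged" event if this list comes back empty.
  const voices = window.speechSynthesis.getVoices();
  const voice = voices.find((v) => v.lang.startsWith(langPrefix));
  const utterance = new SpeechSynthesisUtterance(text);
  if (voice) utterance.voice = voice;
  utterance.lang = langPrefix; // language hint even without an exact voice
  window.speechSynthesis.speak(utterance);
}

speakInLanguage("Bonjour tout le monde", "fr"); // read the text in French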
Speech synthesis is the building block of this technology. Speech synthesis is the artificial
production of human speech. A computer system used for this purpose is called a speech
synthesizer, and can be implemented in software or hardware products. A text-to-speech
(TTS) system converts normal language text into speech; other systems render symbolic
linguistic representations like phonetic transcriptions into speech. The reverse process is
speech recognition.
Synthesized speech can be created by concatenating pieces of recorded speech that are
stored in a database. Systems differ in the size of the stored speech units; a system that
stores phones or diphones provides the largest output range, but may lack clarity. For
specific usage domains, the storage of entire words or sentences allows for high-quality
output. Alternatively, a synthesizer can incorporate a model of the vocal tract and other
human voice characteristics to create a completely "synthetic" voice output.
The quality of a speech synthesizer is judged by its similarity to the human voice and by
its ability to be understood clearly. An intelligible text-to-speech program allows people with
visual impairments or reading disabilities to listen to written words on a home computer.
Many computer operating systems have included speech synthesizers since the early
1990s.
A text-to-speech system (or "engine") is composed of two parts: a front-end and a back-
end. The front-end has two major tasks. First, it converts raw text containing symbols like
numbers and abbreviations into the equivalent of written-out words. This process is often
called text normalization, pre-processing, or tokenization. The front-end then
assigns phonetic transcriptions to each word, and divides and marks the text into prosodic
units, like phrases, clauses, and sentences. The process of assigning phonetic
transcriptions to words is called text-to-phoneme or grapheme-to-phoneme conversion.
Phonetic transcriptions and prosody information together make up the symbolic linguistic
representation that is output by the front-end. The back-end—often referred to as
the synthesizer—then converts the symbolic linguistic representation into sound. In certain
systems, this part includes the computation of the target prosody (pitch contour, phoneme
durations), which is then imposed on the output speech. There are different ways to
perform speech synthesis. The choice depends on the task, but the most
widely used method is Concatenative Synthesis, because it generally produces the most
natural-sounding synthesized speech.
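A minimal sketch of this front-end/back-end split is given below; the helper logic consists of toy stand-ins for real normalization, grapheme-to-phoneme, and synthesis steps:

function frontEnd(rawText) {
  // Text normalization: lower-case the text and split it into word tokens.
  const words = rawText.toLowerCase().match(/[a-z0-9']+/g) ?? [];
  // Grapheme-to-phoneme conversion (toy rule: one letter per phoneme).
  const phonemes = words.map((word) => word.split(""));
  // Prosodic marking (toy rule: a phrase boundary after the last word).
  const prosody = words.map((_, i) => ({ boundary: i === words.length - 1 }));
  return { phonemes, prosody }; // the symbolic linguistic representation
}

function backEnd({ phonemes, prosody }) {
  // A real back-end renders audio; this stub only reports what it would do.
  return "synthesizing " + phonemes.length + " words, " +
         prosody.filter((p) => p.boundary).length + " phrase boundary";
}

console.log(backEnd(frontEnd("Hello world"))); // synthesizing 2 words, 1 phrase boundary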
Unit Selection Synthesis: Unit selection synthesis uses large databases of recorded
speech. During database creation, each recorded utterance is segmented into some or all of
the following: individual phones, diphones, half-phones, syllables, morphemes, words,
phrases, and sentences. Typically, the division into segments is done using a specially
modified speech recognizer set to a "forced alignment" mode with some manual
correction afterward, using visual representations such as the waveform and spectrogram.
An index of the units in the speech database is then created based on the segmentation and
acoustic parameters like the fundamental frequency (pitch), duration, position in the
syllable, and neighboring phones. At runtime, the desired target utterance is created by
determining the best chain of candidate units from the database (unit selection). This
process is typically achieved using a specially weighted decision tree. Unit selection
provides the greatest naturalness, because it applies only a small amount of digital
signal processing (DSP) to the recorded speech. DSP often makes recorded speech sound
less natural, although some systems use a small amount of signal processing at the point
of concatenation to smooth the waveform. The output from the best unit-selection systems
is often indistinguishable from real human voices, especially in contexts for which the
TTS system has been tuned. However, maximum naturalness typically requires
unit-selection speech databases to be very large, in some systems ranging into the
gigabytes of recorded data, representing dozens of hours of speech [12]. Also, unit
selection algorithms have been known to select segments from a place that results in less
than ideal synthesis (e.g. minor words become unclear) even when a better choice exists
in the database.
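The sketch below illustrates the idea of unit selection with a toy database: for each target phone, the candidate minimizing target cost plus the cost of joining it to the previous unit is chosen. Real systems search over all candidate chains (e.g. with a Viterbi search) and use many more acoustic features; this greedy version with pitch as the only feature is purely illustrative:

const DB = {
  h: [{ phone: "h", pitch: 100 }, { phone: "h", pitch: 140 }],
  e: [{ phone: "e", pitch: 110 }, { phone: "e", pitch: 180 }],
};

function selectUnits(targetPhones, targetPitch) {
  const chain = [];
  for (const phone of targetPhones) {
    let best = null, bestScore = Infinity;
    for (const unit of DB[phone]) {
      const targetCost = Math.abs(unit.pitch - targetPitch);
      const joinCost = chain.length
        ? Math.abs(unit.pitch - chain[chain.length - 1].pitch)
        : 0;
      if (targetCost + joinCost < bestScore) {
        bestScore = targetCost + joinCost;
        best = unit;
      }
    }
    chain.push(best);
  }
  return chain; // units to concatenate, with light smoothing at the joins
}

console.log(selectUnits(["h", "e"], 105)); // picks the 100 Hz and 110 Hz units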
Diphone Synthesis: Diphone synthesis uses a minimal speech database containing all
the diphones (sound-to-sound transitions) occurring in a language. The number of
diphones depends on the phonotactics of the language: for example, Spanish has about
800 diphones, and German about 2500. In diphone synthesis, only one example of each
diphone is contained in the speech database. At runtime, the target prosody of a sentence
is superimposed on these minimal units by means of digital signal processing techniques
such as linear predictive coding, PSOLA[12] or MBROLA. The quality of the resulting
speech is generally worse than that of unit-selection systems, but more natural-sounding
than the output of formant synthesizers. Diphone synthesis suffers from the sonic glitches
of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has
few of the advantages of either approach other than small size. As such, its use in
commercial applications is declining, although it continues to be used in research because
there are a number of freely available software implementations.
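A short sketch of how a phoneme sequence maps onto the diphone units such a database stores (the phoneme names here are illustrative):

function toDiphones(phonemes) {
  const padded = ["_", ...phonemes, "_"]; // "_" marks silence at the edges
  const diphones = [];
  for (let i = 0; i < padded.length - 1; i++) {
    // each diphone spans the transition between two neighbouring sounds
    diphones.push(padded[i] + "-" + padded[i + 1]);
  }
  return diphones;
}

console.log(toDiphones(["h", "e", "l", "ou"]));
// ["_-h", "h-e", "e-l", "l-ou", "ou-_"]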
Structure of A Text-To-Speech Synthesizer System:
Text-to-speech synthesis takes place in several steps. The TTS system gets a text as input,
which it first must analyze and then transform into a phonetic description. In a
further step it generates the prosody. From the information now available, it can produce a
speech signal. The structure of the text-to-speech synthesizer can be broken down into the following
major modules:
Text Analysis: First the text is segmented into tokens. The token-to-word conversion
creates the orthographic form of each token. For the token “Mr” the orthographic form
“Mister” is formed by expansion, the token “12” gets the orthographic form “twelve”, and
“1997” is transformed into “nineteen ninety seven”.
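A toy sketch of this token-to-word conversion follows; the abbreviation and number tables are illustrative stand-ins for a full normalizer:

const ABBREVIATIONS = { Mr: "Mister", Dr: "Doctor" };
const NUMBERS = { 12: "twelve", 1997: "nineteen ninety seven" };

function expandToken(token) {
  const bare = token.replace(/\.$/, ""); // drop a trailing period
  if (ABBREVIATIONS[bare]) return ABBREVIATIONS[bare];
  // a real system would spell out any number, not just the ones listed here
  if (/^\d+$/.test(bare)) return NUMBERS[bare] ?? bare;
  return token;
}

console.log(["Mr", "12", "1997"].map(expandToken));
// ["Mister", "twelve", "nineteen ninety seven"]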
Application of Pronunciation Rules: After the text analysis has been completed,
pronunciation rules can be applied. Letters cannot be transformed 1:1 into phonemes
because the correspondence is not always one-to-one. In certain environments, a single letter can
correspond to no phoneme (for example, “h” in “caught”) or to several phonemes (“x”
in “maximum”). In addition, several letters can correspond to a single phoneme (“ch” in
“rich”). There are two strategies to determine pronunciation: a dictionary-based strategy, in
which pronunciations are looked up in a large lexicon, and a rule-based strategy, in which
letter-to-sound rules derive the pronunciation of words not found in the lexicon.
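A sketch combining both strategies: look the word up in a small exception dictionary first, then fall back to letter-to-sound rules. The lexicon entries and rules below are illustrative, not a real phone set:

const LEXICON = { caught: ["k", "ao", "t"] }; // dictionary strategy
const RULES = [
  { from: "ch", to: ["ch"] },   // two letters -> one phoneme
  { from: "x", to: ["k", "s"] } // one letter -> two phonemes
];

function pronounce(word) {
  if (LEXICON[word]) return LEXICON[word];
  // rule strategy: scan the word, applying the first matching rule
  const phones = [];
  let i = 0;
  while (i < word.length) {
    const rule = RULES.find((r) => word.startsWith(r.from, i));
    if (rule) {
      phones.push(...rule.to);
      i += rule.from.length;
    } else {
      phones.push(word[i]); // naive letter-as-phoneme fallback
      i += 1;
    }
  }
  return phones;
}

console.log(pronounce("rich"));   // ["r", "i", "ch"]
console.log(pronounce("caught")); // ["k", "ao", "t"], from the lexicon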
Prosody Generation: After the pronunciation has been determined, the prosody is generated.
The degree of naturalness of a TTS system is dependent on prosodic factors like intonation
modelling (phrasing and accentuation), amplitude modelling and duration modelling
(including the duration of sound and the duration of pauses, which determines the length of
the syllable and the tempos of the speech).
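In the browser, the Web Speech API exposes only coarse versions of these prosodic parameters, but they illustrate what a prosody model manipulates:

const u = new SpeechSynthesisUtterance("Prosody shapes how natural speech sounds.");
u.rate = 0.9;   // tempo / duration modelling (1.0 is the default)
u.pitch = 1.2;  // intonation: raise the overall pitch slightly
u.volume = 1.0; // amplitude modelling (range 0.0 to 1.0)
window.speechSynthesis.speak(u);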
LITERATURE SURVEY
Speech recognition systems described in the literature differ along several dimensions:
• Speaker: All speakers have a different kind of voice. Models are hence either
designed for a specific speaker or are speaker-independent.
• Vocal Sound: The way the speaker speaks also plays a role in speech recognition. Some
models can recognize only single utterances, or separate utterances with a pause in between.
• Vocabulary: The size of the vocabulary plays an important role in determining the
complexity, performance, and precision of the system.
1. Basic Speech Recognition Model: Each speech recognition system follows some
standard steps, as shown in the figure.
(i) Pre-processing: The analog speech signal is transformed into a digital signal for later
processing. This digital signal is passed through first-order filters to spectrally flatten the
signal, which helps increase the signal's energy at higher frequencies.
(ii) Feature Extraction: This step finds the set of parameters of utterances that have a
correlation with the speech signal. These parameters, known as features, are computed
by processing the acoustic waveform. The main focus is to compute a sequence of
feature vectors (relevant information) providing a compact representation of the given
input signal. Commonly used feature extraction techniques are discussed below:
• Linear Predictive Coding (LPC): The basic idea is that a speech sample can be approximated as
a linear combination of past speech samples. Figure 2 shows the LPC process. The
digitized signal is blocked into frames of N samples. Each frame is then windowed to
minimize signal discontinuities, and each windowed frame is auto-correlated. The last step is
the LPC analysis, which converts each frame of auto-correlations into an LPC parameter set.
• Mel-Frequency Cepstrum Coefficients (MFCC): A very powerful technique that models the
human auditory perception system. MFCC applies a series of steps to the input signal.
Framing: the speech waveform is cropped to remove interference if present. Windowing:
minimizes the discontinuities in the signal. Discrete Fourier Transform: converts each frame
from the time domain to the frequency domain. Mel Filter Bank: the signal is mapped onto
the Mel spectrum to mimic human hearing.
• Dynamic Time Warping (DTW): An algorithm, based on dynamic programming, for measuring
the similarity between two time series which may vary in speed. It aims at aligning two
sequences of feature vectors (one from each series) iteratively until an
optimal match (according to a suitable metric) between them is found.
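A compact sketch of DTW over two sequences of feature vectors, using squared Euclidean distance as the local metric:

function dtw(seqA, seqB) {
  const dist = (a, b) => a.reduce((sum, x, i) => sum + (x - b[i]) ** 2, 0);
  const n = seqA.length, m = seqB.length;
  // cost[i][j] = minimal accumulated cost of aligning the first i frames
  // of seqA with the first j frames of seqB
  const cost = Array.from({ length: n + 1 }, () => new Array(m + 1).fill(Infinity));
  cost[0][0] = 0;
  for (let i = 1; i <= n; i++) {
    for (let j = 1; j <= m; j++) {
      cost[i][j] = dist(seqA[i - 1], seqB[j - 1]) +
        Math.min(cost[i - 1][j],      // insertion
                 cost[i][j - 1],      // deletion
                 cost[i - 1][j - 1]); // match
    }
  }
  return cost[n][m]; // lower cost = more similar utterances
}

// the slower sequence aligns to the faster one despite the extra frame
console.log(dtw([[0], [1], [2]], [[0], [1], [1], [2]])); // 0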
Acoustic Models: This is the fundamental part of an Automated Speech Recognition (ASR)
system, where a connection between the acoustic information and the phonetics is established.
Training establishes a correlation between the basic speech units and the acoustic
observations.
Language Models: This model estimates the probability of a word occurring after a given
word sequence. It captures the structural constraints of the language to generate
the probabilities of occurrence. The language model distinguishes words and phrases that
sound similar.
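A toy bigram model makes this concrete: it estimates the probability of a word given the previous word from counts over a training corpus. Real recognizers use smoothed n-grams or neural models; this only shows the idea:

function buildBigramModel(corpus) {
  const pairCounts = new Map();
  const prevCounts = new Map();
  for (const sentence of corpus) {
    const words = ["<s>", ...sentence.toLowerCase().split(/\s+/)];
    for (let i = 0; i + 1 < words.length; i++) {
      const pair = words[i] + " " + words[i + 1];
      prevCounts.set(words[i], (prevCounts.get(words[i]) ?? 0) + 1);
      pairCounts.set(pair, (pairCounts.get(pair) ?? 0) + 1);
    }
  }
  // P(next | prev) = count(prev next) / count(prev)
  return (prev, next) =>
    (pairCounts.get(prev + " " + next) ?? 0) / (prevCounts.get(prev) ?? 1);
}

const p = buildBigramModel(["recognize speech", "recognize speech easily"]);
console.log(p("recognize", "speech")); // 1: "speech" always follows "recognize"
console.log(p("recognize", "beach"));  // 0: never observed, so it is ruled out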
Pattern Classification: It is the process of comparing an unknown pattern with the existing
sound reference patterns and computing the similarity between them. After the system
has been trained, patterns are classified at testing time to recognize the
speech. Different approaches to pattern matching are:
Template Based Approach: This approach keeps a collection of speech patterns which are
stored as references representing dictionary words. Speech is recognized by matching the
uttered word against the reference templates.
Knowledge Based Approach: This approach takes a set of features from the speech and
then trains the system to generate a set of production rules automatically from the samples.
Text Processing: The input text is analysed, normalized (handling acronyms and
abbreviations) and transcribed into a phonetic or linguistic representation.
• Speech Synthesis: Some of the speech synthesis techniques are:
Articulatory Synthesis: Uses mechanical and acoustic models for speech generation. It
produces intelligible synthetic speech, but the result is far from natural-sounding and hence
this method is not widely used.
With globalization, people travel from one place to another for work and
leisure. They might not be fluent in the language spoken in different
areas or countries, so with the help of this technology people can speak in their
native language and can get it translated according to the country in which
they are travelling. TTS allows people to enjoy content, and also provides an option for content
consumption on the go, taking content away from the computer screen and
into any environment that is convenient for the consumer. For people with
visual impairment, text to speech can be a very useful tool as well. For those
users, reading from a small screen is not always easy; having text-to-speech software do the
work is much easier. It allows people to get the information they want without
having to read it from the screen.
SOFTWARE REQUIREMENTS:-
HARDWARE REQUIREMENTS:-
LANGUAGES USED:-
HTML
CSS
JAVASCRIPT
USE CASE DIAGRAM
PROPOSED METHODOLOGY
A text-to-speech device consists of two main modules: the image processing module and
the voice processing module. The image processing module captures an image using a camera and
converts the image into text. The voice processing module changes the text into sound and
processes it with specific physical characteristics so that the sound can be understood.
Figure 1 shows the block diagram of the Text-To-Speech device: the first block is the image processing
module, where OCR converts the .jpg image into .txt form; the second is the voice processing module,
which converts the text to speech.
OCR is an important element in this module. OCR, or Optical Character Recognition, is a
technology that automatically recognizes characters through an optical mechanism. This
technology imitates the ability of the human sense of sight, where the camera becomes a
replacement for the eye and image processing is done in the computer engine as a substitute
for the human brain. Tesseract OCR is a type of OCR engine with matrix matching. Tesseract
was selected because of its flexibility and extensibility across machines, because many
communities of active researchers continue to develop this OCR engine, and because
Tesseract OCR can support 149 languages. In this project we identify English alphabets.
Before the image is fed to the OCR, it is converted to a binary image to increase the
recognition accuracy. The binary conversion is done using the ImageMagick software,
another open-source tool for image manipulation. The output of the OCR is the text, which
is stored in a file (speech.txt). Machines still have defects such as distortion at the edges
and dim-light effects, so it is still difficult for most OCR engines to achieve high-accuracy
text. Supporting conditions are needed in order to keep such defects minimal.
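As a browser-based sketch of this two-module pipeline, assuming the open-source Tesseract.js port of the Tesseract engine is loaded on the page:

async function imageToSpeech(imageUrl) {
  // Image processing module: OCR the photo into plain text.
  const { data } = await Tesseract.recognize(imageUrl, "eng");
  // Voice processing module: convert the recognized text into audio.
  const utterance = new SpeechSynthesisUtterance(data.text);
  window.speechSynthesis.speak(utterance);
  return data.text;
}

imageToSpeech("sign.jpg"); // e.g. a photo of a street sign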
REFERENCES
Suendermann, D., Höge, H., and Black, A., 2010. Challenges in Speech Synthesis. In: Chen, F., Jokinen, K. (eds.), Speech Technology. Springer Science + Business Media LLC.
Allen, J., Hunnicutt, M. S., and Klatt, D., 1987. From Text to Speech: The MITalk System. Cambridge University Press.
Rubin, P., Baer, T., and Mermelstein, P., 1981. An articulatory synthesizer for perceptual research. Journal of the Acoustical Society of America 70: 321–328.
Van Santen, J. P. H., Sproat, R. W., Olive, J. P., and Hirschberg, J., 1997. Progress in Speech Synthesis. Springer.
Lamel, L. F., Gauvain, J. L., Prouts, B., Bouhier, C., and Boesch, R., 1993. Generation and synthesis of broadcast messages. Proceedings, ESCA-NATO Workshop on Applications of Speech Technology.
Van Truc, T., Le Quang, P., Van Thuyen, V., Hieu, L. T., Tuan, N. M., and Hung, P. D., 2013. Vietnamese Synthesis System. Capstone Project Document, FPT University.
Kominek, J., and Black, A. W., 2003. CMU ARCTIC databases for speech synthesis. CMU-LTI-03-177, Language Technologies Institute, School of Computer Science, Carnegie Mellon University.