A PROJECT REPORT
ON
SPEECH SYNTHESIZER
------------------------------------------------------------------------------------------------------------------------
By
ADITYA GURAV
------------------------------------------------------------------------------------------------------------------------
Tilak Maharashtra Vidyapeeth, Pune
Department of Computer Science
------------------------------------------------------------------------------------------------------------------------
CERTIFICATE
This is to certify that the project entitled “Speech Synthesizer”
has been satisfactorily completed by
ADITYA GURAV
------------------------------------------------------------------------------------------------------------------------
[2020-2021]
------------------------------------------------------------------------------------------------------------------------
ACKNOWLEDGEMENT
I express my profound thanks to our Head of Department, Mrs. Asmita Namjoshi, to my project
guide and project in-charge, Mr. Rakesh Patil, and to all those who have directly or indirectly
guided and helped me in the preparation of this project.
ADITYA GURAV
------------------------------------------------------------------------------------------------------------------------
PROJECT SYNOPSIS
Speech synthesis can be described as the artificial production of human speech. A computer
system used for this purpose is called a speech synthesizer, and it can be implemented in software
or hardware. A speech synthesizer converts normal language text into speech. Synthesized speech
can be created by concatenating pieces of recorded speech that are stored in a database. Systems
differ in the size of the stored speech units: a system that stores phones or diphones provides the
largest output range, but may lack clarity. For specific usage domains, the storage of entire words
or sentences allows for high-quality output. Alternatively, a synthesizer can incorporate a model
of the vocal tract and other human voice characteristics to create a completely "synthetic" voice
output. The quality of a speech synthesizer is judged by its similarity to the human voice and by
its ability to be understood. An intelligible speech synthesizer allows people with visual
impairments or reading disabilities to listen to written works on a home computer. The structure
of a speech synthesizer can be broken down into two major modules:
• Natural Language Processing (NLP) module: produces a phonetic transcription of the input
text, together with prosody.
• Digital Signal Processing (DSP) module: transforms the symbolic information it receives from
the NLP module into audible and intelligible speech.
The major operations of the NLP module are as follows:
Text Analysis: First, the text is segmented into tokens. The token-to-word conversion then creates
the orthographic form of each token. For the token “Mr” the orthographic form “Mister” is formed
by expansion, the token “12” gets the orthographic form “twelve”, and “1997” is transformed to
“nineteen ninety seven”.
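To make the token-to-word step concrete, here is a toy Python sketch. The lookup tables and the
expand_token helper are illustrative assumptions only; real systems use far larger rule sets and
lexica.

    # Toy token-to-word expansion using small hand-written lookup tables.
    ABBREVIATIONS = {"Mr": "Mister", "Dr": "Doctor"}
    NUMBERS = {"12": "twelve", "1997": "nineteen ninety seven"}

    def expand_token(token: str) -> str:
        """Return the orthographic (spoken) form of a token."""
        if token in ABBREVIATIONS:
            return ABBREVIATIONS[token]
        if token in NUMBERS:
            return NUMBERS[token]
        return token  # ordinary words pass through unchanged

    print([expand_token(t) for t in "Mr Smith was born in 1997".split()])
    # ['Mister', 'Smith', 'was', 'born', 'in', 'nineteen ninety seven']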
Application of Pronunciation Rules: After the text analysis has been completed, pronunciation
rules can be applied. Letters cannot be transformed 1:1 into phonemes because the correspondence
is not always one-to-one: in certain environments, a single letter can correspond to no phoneme at
all, or a group of letters to a single phoneme.
Prosody Generation: After the pronunciation has been determined, the prosody is generated. The
degree of naturalness of a speech synthesizer depends on prosodic factors like intonation
modelling, amplitude modelling, and duration modelling.
------------------------------------------------------------------------------------------------------------------------
SPEECH SYNTHESIZER
------------------------------------------------------------------------------------------------------------------------
Introduction:-
The project Speech Synthesizer is developed as part of the VI semester, in partial fulfilment of the
BCA degree. A speech synthesizer is a Python-based tool that converts text into the spoken word
by analyzing and processing the text using Natural Language Processing (NLP) and then using
Digital Signal Processing (DSP) technology to convert the processed text into a synthesized
speech representation. Here, I developed a useful speech synthesizer in the form of a simple
application that converts input text into synthesized speech, reads it out to the user, and can save
the result as an MP3 file. The development of a speech synthesizer will be of great help to people
with visual impairment and will make working through large volumes of text easier.

Speech synthesis is the automatic conversion of a text into speech that resembles, as closely as
possible, a native speaker of the language reading that text. It is the technology which lets a
computer speak to you. A text-to-speech (TTS) system takes text as input; a computer algorithm
called the TTS engine then analyses the text, pre-processes it, and synthesizes the speech with
mathematical models. The TTS engine usually generates sound data in an audio format as output.
The speech synthesis procedure consists of two main phases. The first is text analysis, where the
input text is transcribed into a phonetic or some other linguistic representation, and the second is
the generation of speech waveforms, where the output is produced from this phonetic and
prosodic information. These two phases are usually called high-level and low-level synthesis. The
input text might be, for example, data from a word processor, standard ASCII from email, a
mobile text message, or scanned text from a newspaper. The character string is then pre-processed
and analyzed into a phonetic representation, which is usually a string of phonemes with some
additional information for correct intonation, duration, and stress. Speech sound is finally
generated by the low-level synthesizer from the information provided by the high-level one. The
artificial production of speech-like sounds has a long history, with documented mechanical
attempts dating to the eighteenth century.
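As a minimal illustration of this pipeline from the user's point of view, the core text-to-MP3
conversion can be written in a few lines of Python, assuming the gTTS package is installed (pip
install gTTS); the file name is an arbitrary choice. The full application with a window and buttons
is sketched in the implementation section below.

    # Minimal text-to-speech conversion with gTTS (assumes: pip install gTTS).
    from gtts import gTTS

    tts = gTTS(text="Hello, world", lang="en")
    tts.save("hello.mp3")  # the synthesized speech, saved as an MP3 file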
Domain-specific Synthesis:-
Domain-specific synthesis concatenates pre-recorded words and
phrases to create complete utterances. It is used in applications where the variety of texts the system
will output is limited to a particular domain, like transit schedule announcements or weather reports.
The technology is very simple to implement, and has been in commercial use for a long time, in
devices like talking clocks and calculators. The level of naturalness of these systems can be very
high because the variety of sentence types is limited, and they closely match the prosody and
intonation of the original recordings. Because these systems are limited by the words and phrases in
their databases, they are not general-purpose and can only synthesize the combinations of words
and phrases with which they have been pre-programmed. The blending of words within naturally
spoken language however can still cause problems unless many variations are taken into account.
For example, in non-rhotic dialects of English the "r" in words like "clear" /ˈklɪə/ is usually only
pronounced when the following word has a vowel as its first letter (e.g. "clear out" is realized as
/ˌklɪəɾˈʌʊt/). Likewise in French, many final consonants become no longer silent if followed by
a word that begins with a vowel, an effect called liaison. This alternation cannot be reproduced by
a simple word-concatenation system, which would require additional complexity to be
context-sensitive. In practice, this approach involves recording the voice of a person speaking the
desired words and phrases. It is useful if only a restricted volume of phrases and sentences is
needed and the variety of texts the system will output is limited to a particular domain, e.g. a
message in a train station, weather reports, or checking a telephone subscriber’s account balance.
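As a rough sketch of the idea, a talking clock might assemble each utterance from a fixed
inventory of pre-recorded phrase files. The file names and the clock_utterance helper below are
purely illustrative assumptions; any audio tool could then concatenate the recordings.

    # Domain-specific synthesis sketch: a talking clock builds an utterance
    # from pre-recorded audio files (file names are hypothetical).
    def clock_utterance(hour: int, minute: int) -> list:
        """Recordings to concatenate for "the time is <hour> <minute>"."""
        return ["the_time_is.wav", f"hour_{hour}.wav", f"minute_{minute:02d}.wav"]

    print(clock_utterance(9, 5))
    # ['the_time_is.wav', 'hour_9.wav', 'minute_05.wav']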
Diphone Synthesis:-
Diphone synthesis uses a minimal speech database containing all the diphones
(sound-to-sound transitions) occurring in a language. The number of diphones depends on the
phonotactics of the language: for example, Spanish has about 800 diphones, and German about
2500. In diphone synthesis, only one example of each diphone is contained in the speech database.
At runtime, the target prosody of a sentence is superimposed on these minimal units by means of
digital signal processing techniques such as linear predictive coding, PSOLA or MBROLA. The
quality of the resulting speech is generally worse than that of unit-selection systems, but more
natural-sounding than the output of formant synthesizers. Diphone synthesis suffers from the sonic
glitches of concatenative synthesis and the robotic-sounding nature of formant synthesis, and has
few of the advantages of either approach other than small size. As such, its use in commercial
applications is declining, although it continues to be used in research because there are a number
of freely available software implementations.
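To illustrate what diphone units look like, the sketch below splits a word's phoneme sequence into
its sound-to-sound transitions; the phoneme symbols and the "_" silence marker are illustrative
assumptions.

    # Map a phoneme sequence onto diphone units; "_" marks silence at the edges.
    def to_diphones(phonemes):
        padded = ["_"] + phonemes + ["_"]
        return [f"{a}-{b}" for a, b in zip(padded, padded[1:])]

    print(to_diphones(["h", "e", "l", "ou"]))  # a rough rendering of "hello"
    # ['_-h', 'h-e', 'e-l', 'l-ou', 'ou-_']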
• In a rule-based solution, pronunciation rules are generated from the phonological knowledge
of dictionaries; only words whose pronunciation is a complete exception are included in the
dictionary. The two approaches differ significantly in the size of their dictionaries: the
dictionary-based solution's dictionary is many times larger than the rule-based solution's
dictionary of exceptions. However, dictionary-based solutions can be more exact than
rule-based solutions if they have a large enough phonetic dictionary available (see the sketch
after this list).
• Prosody Generation: After the pronunciation has been determined, the prosody is generated.
The degree of naturalness of a TTS system depends on prosodic factors like intonation
modelling (phrasing and accentuation), amplitude modelling, and duration modelling
(including the duration of sounds and of pauses, which determines the length of the syllables
and the tempo of the speech). The output of the NLP module is passed to the DSP module,
where the actual synthesis of the speech signal happens. In concatenative synthesis, the
selection and linking of speech segments take place: for each individual sound the best option
(where several appropriate options are available) is selected from a database and concatenated.
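The following toy sketch contrasts the two approaches described in the first bullet above: an
exception dictionary is consulted first, with a fall-back to naive letter-to-phoneme rules. Both
tables and the phoneme notation are illustrative assumptions; real systems use far richer rules and
much larger lexica.

    # Toy hybrid pronunciation: exception dictionary first, letter rules second.
    EXCEPTIONS = {"colonel": "k er n ah l"}  # pronunciations that break the rules
    LETTER_RULES = {"c": "k", "a": "ae", "t": "t"}

    def pronounce(word: str) -> str:
        if word in EXCEPTIONS:
            return EXCEPTIONS[word]
        return " ".join(LETTER_RULES.get(ch, ch) for ch in word)

    print(pronounce("cat"))      # "k ae t"      -- from the letter rules
    print(pronounce("colonel"))  # "k er n ah l" -- from the exception dictionary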
IMPLEMENTATION / PREREQUISITE:-
Tkinter is the standard GUI library for Python. Python, when combined with Tkinter, provides a
fast and easy way to create GUI applications. This library is a compelling choice for building GUI
applications in Python, especially for applications where a modern sheen is unnecessary and the
top priority is to build something functional and cross-platform quickly.
Python has many libraries; one of them is gTTS (Google Text-to-Speech), a Python library and
CLI tool to interface with Google Translate's text-to-speech API. The steps are: import the
libraries tkinter, gTTS, and playsound; initialize the window; write a function to convert text to
speech; write a function to exit and a function to reset; define the buttons; and we are done. A
sketch of these steps follows.
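This is a minimal sketch of the application described above, assuming gTTS and playsound are
installed (pip install gTTS playsound). The widget layout, window size, and the temporary file
name speech.mp3 are illustrative choices, not fixed requirements.

    # A minimal Tkinter + gTTS text-to-speech application (sketch).
    from tkinter import Tk, Label, Entry, Button
    from gtts import gTTS
    from playsound import playsound

    root = Tk()
    root.title("Speech Synthesizer")
    root.geometry("350x160")

    Label(root, text="Enter text:").pack(pady=5)
    entry = Entry(root, width=40)
    entry.pack(pady=5)

    def text_to_speech():
        # Synthesize the entered text, save it as an MP3, and play it back.
        text = entry.get()
        if text:
            gTTS(text=text, lang="en").save("speech.mp3")
            playsound("speech.mp3")

    def reset():
        # Clear the input field.
        entry.delete(0, "end")

    Button(root, text="Play", command=text_to_speech).pack(pady=2)
    Button(root, text="Reset", command=reset).pack(pady=2)
    Button(root, text="Exit", command=root.destroy).pack(pady=2)

    root.mainloop()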
Domain Knowledge:-
A domain-specific synthesizer concatenates pre-recorded words and phrases to create complete
utterances. It is used in applications where the variety of texts the system will output is limited to
a particular domain, like transit schedule announcements or weather reports. The technology is
very simple to implement and has been in commercial use for a long time, in devices like talking
clocks and calculators. It is useful if only a restricted volume of phrases and sentences is needed
and the variety of texts the system will output is limited to a particular domain, e.g. a message in
a train station, weather reports, or checking a telephone subscriber's account balance.
Advantages:-
• People with learning disabilities who have difficulty reading large amounts of text due to
dyslexia or other problems benefit greatly from speech synthesis, which offers them an easier
option for experiencing website content.
• Speech synthesis allows people to enjoy content in audio form, and also provides an option
for content consumption on the go.
• Speech synthesis offers many benefits for content owners and publishers as well.
• Speech synthesis makes it easier in general for all people to access online content on mobile
devices, and strengthens corporate social responsibility by ensuring that information is
available in both written and audio format.
Hardware Requirement:-
• Computer / Laptop
• Keyboard
• Mouse
Software Requirement:-
• Python
• Web Browser
Future Scope:-
Accuracy will become better and better, and dictation speech recognition will gradually become
accepted. Greater use will be made of "intelligent systems" which will attempt to guess what the
speaker intended to say, rather than what was actually said, as people often misspeak and make
unintentional mistakes. Microphones and sound systems will be designed to adapt more quickly
to changing background noise levels and different environments, with better recognition of
extraneous material to be discarded. More and more elderly people will benefit from voice
interfaces. They are also becoming more familiar with computer technology; however, they have
problems with understanding synthesized speech, particularly if they have hearing problems and
when they miss the contextual clues that compensate for weakened acoustic stimuli.
Unfortunately, most of the research investigating potential reasons for these problems has been
carried out not on unit-selection synthesis but on formant synthesis. Formant synthesis lacks
acoustic information in the signal and exhibits incorrect prosody. Since concatenative approaches
preserve far more of the acoustic signal than formant synthesizers, lack of information should no
longer be a problem. Instead, there are problems with spectral mismatches between units, spectral
distortion due to signal processing, and temporal distortion due to wrong durations.
Limitations or Boundaries:-
It can often be seen that online speech synthesizers do not recognize special characters and
symbols such as the dot ".", the question mark "?", or the hash "#". Their databases usually
contain only a few prerecorded voices that are used for synthesis. Different modern software
packages often pronounce the same text differently. What is more, there is a limit to the number
of words of input text that can be converted into speech. For certain languages synthetic speech is
easier to produce than for others. Also, the number of potential users and the size of the market
vary greatly between countries and languages, which also affects how many resources are
available for developing speech synthesis. Most languages also have some special features which
can make the development process either much easier or considerably harder. Some languages,
such as Finnish, Italian, and Spanish, have very regular pronunciation, sometimes with an almost
one-to-one correspondence between letters and sounds. At the other end is, for example, French,
with very irregular pronunciation. Many languages, such as French, German, Danish, and
Portuguese, also contain lots of special stress markers and other non-ASCII characters (Oliveira et
al. 1992). In German, the sentence structure differs largely from other languages. For text
analysis, the use of capitalized letters with nouns may cause some problems because capitalized
words are usually analyzed differently from others.
References:-