
International Journal of Soft Computing and Engineering (IJSCE)

ISSN: 2231-2307, Volume-2, Issue-1, March 2012

TEXT TO SPEECH: A SIMPLE TUTORIAL


D.Sasirekha, E.Chandra

Abstract: Research on Text to Speech (TTS) conversion is a large enterprise that has shown impressive improvement over the last couple of decades. This article has two main goals. The first goal is to summarize the published literature on Text to Speech (TTS) and discuss the efforts made in each paper. The second goal is to describe the specific tasks involved in Text to Speech (TTS) conversion, namely preprocessing and text detection, linearization, text normalization, prosodic phrasing, OCR, acoustic processing and intonation. We illustrate these topics by describing TTS synthesis. Such a system is highly useful for illiterate and vision-impaired people, who face many problems in daily life because of differences in script systems, to hear and understand content. The paper starts with an introduction to some basic concepts of TTS synthesis, which will be useful for readers who are less familiar with this area of research.

Index Terms—TTS.

D.Sasirekha, Research Scholar, Computer Science, Karpagam University, Coimbatore, India, [email protected]
Dr.E.Chandra, Dean, School of Computer Studies, Dr. S.N.S. Rajalakshmi College of Arts and Science, Coimbatore, India, [email protected]

I. INTRODUCTION

A text to speech (TTS) synthesizer is a computer-based system that can read text aloud automatically, whether the text is introduced by a computer input stream or comes from a scanned document submitted to an Optical Character Recognition (OCR) engine. A speech synthesizer can be implemented in both hardware and software. The field has improved very quickly over the last couple of decades, and many high-quality TTS systems are now available for commercial use.

Synthesized speech is often based on concatenation of natural speech, i.e. units taken from natural speech are put together to form a word or sentence. Concatenative speech synthesis [1] has become very popular in recent years due to its improved sensitivity to unit context over simpler predecessors.

Rhythm [2] is an important factor that makes the synthesized speech of a TTS system more natural and understandable. The prosodic structure provides important information for the prosody generation model to produce effects in synthesized speech.

Many TTS systems are developed on the principle of corpus-based speech synthesis [3] [10], which is very popular for its high-quality and natural speech output.

According to [4] [5], next-generation TTS systems are asked to deal with emotions and speaking styles, and there has been growing interest in developing commercial systems based on Limited Domain TTS (LD-TTS) [6], which restricts the scope of the input text so as to obtain high-quality speech synthesis.

A number of research prototypes of TTS systems have been developed, but none has been compared with commercial-grade TTS systems for quality. The main reason is that this requires closer collaboration between linguists and technologists.

To build a user-friendly speech synthesizer, text to speech should audibly communicate information to the user when digital audio recordings are inadequate. Such a system is widely helpful in developing computer-human interaction, for example voice annotations to files, speech-enabled applications, and talking computer systems (GPS, phone-based) etc.

Section II describes the evolution of TTS systems and Section III describes the steps involved in developing an effective text to speech (TTS) system.

II. EVOLUTION OF TTS

Let us start with the progression of text to speech (TTS) systems. In 1779, the Danish scientist Christian Kratzenstein, working at the Russian Academy of Sciences, built models of the human vocal tract that could produce the five long vowel sounds [a], [e], [i], [o] and [u]. In 1791, an Austrian scientist developed a system based on the previous one that included tongue, lips and a "mouth" made of rubber and a "nose" with two nostrils, and was able to pronounce consonants. In 1837, Joseph Faber developed a system that implemented a pharyngeal cavity and could be used for singing; it was controlled by a keyboard.

Bell Labs developed the VOCODER, a clearly intelligible, keyboard-operated electronic speech analyzer and synthesizer. In 1939, Homer Dudley developed the VODER, which was an improvement over the VOCODER.

The Pattern Playback was built by Dr. Franklin S. Cooper and his colleagues at Haskins Laboratories. The first electronic-based TTS system was designed in 1968. The concatenation technique was developed by the 1970s. Many computer operating systems have included speech synthesizers since the early 1980s. From the 1990s onward, there was progress in unit selection and diphone synthesis.


III. ARCHITECTURE OF TTS

The TTS system comprises five fundamental components:
A. Text Analysis and Detection
B. Text Normalization and Linearization
C. Phonetic Analysis
D. Prosodic Modeling and Intonation
E. Acoustic Processing
The input text is passed through these phases to obtain the speech.

Fig 1: System Overview of TTS (Input Text → Text Analysis & Text Detection → Text Normalization & Text Linearization → Phonetic Analysis → Prosodic Modeling & Intonation → Acoustic Processing → Speech as output)
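The pipeline of Fig 1 can be pictured as a chain of stages, each consuming the output of the previous one. The following is a minimal Python sketch of that flow; the stage functions (analyze_text, normalize_text, to_phonemes, add_prosody, synthesize_waveform) are hypothetical placeholders used only to show the data flow, not part of any published implementation.

```python
# Minimal sketch of the five-stage TTS pipeline from Fig 1.
# All stage functions are hypothetical stubs used only to show the data flow.

def analyze_text(raw: str) -> str:
    """Text analysis and detection: strip markup, locate the readable text."""
    return raw.strip()

def normalize_text(text: str) -> str:
    """Text normalization and linearization: expand numbers, abbreviations, acronyms."""
    return text.replace("Mrs.", "Misses")   # placeholder for the real rules

def to_phonemes(text: str) -> list[str]:
    """Phonetic analysis: grapheme-to-phoneme conversion."""
    return text.lower().split()             # placeholder: one "phoneme" per word

def add_prosody(phonemes: list[str]) -> list[tuple[str, float]]:
    """Prosodic modeling and intonation: attach a pitch value to each unit."""
    return [(p, 120.0) for p in phonemes]   # flat 120 Hz contour as a stand-in

def synthesize_waveform(units: list[tuple[str, float]]) -> bytes:
    """Acoustic processing: turn prosody-annotated units into audio samples."""
    return b""                              # a real system would return PCM audio

def text_to_speech(raw_text: str) -> bytes:
    """Run the full pipeline: text in, speech samples out."""
    return synthesize_waveform(add_prosody(to_phonemes(normalize_text(analyze_text(raw_text)))))
```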
A. Text Analysis and Detection

The text analysis part is a preprocessing step which analyses the input text and organizes it into a manageable list of words. The input may contain numbers, abbreviations, acronyms and idiomatic forms, and these are transformed into full text when needed. An important problem is encountered at the character level: punctuation ambiguity (sentence end detection). It can be solved, to some extent, with elementary regular grammars.

Text detection localizes [8] the text areas in any kind of printed document. Most previous research concentrated on extracting text from video. We aim at developing a technique that works for all kinds of documents like newspapers, books etc.
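As an illustration of the sentence-end ambiguity mentioned above, the sketch below uses a simple regular expression, in the spirit of the elementary regular grammars the authors refer to, and refuses to split after a few common abbreviations. The abbreviation list and the regex are illustrative assumptions, not the authors' actual rules.

```python
import re

# Periods after these abbreviations are not sentence ends.
# The list is illustrative only; a real system would need a much larger one.
ABBREVIATIONS = {"Mr", "Mrs", "Dr", "St", "Prof"}

def split_sentences(text: str) -> list[str]:
    """Split at '.', '!' or '?' unless the period follows a known abbreviation."""
    sentences, start = [], 0
    for match in re.finditer(r"[.!?]\s+", text):
        words = text[start:match.start()].split()
        if words and words[-1] in ABBREVIATIONS:
            continue  # period belongs to an abbreviation, keep scanning
        sentences.append(text[start:match.end()].strip())
        start = match.end()
    tail = text[start:].strip()
    if tail:
        sentences.append(tail)
    return sentences

print(split_sentences("Mr. Brown met Dr. Smith. They discussed TTS systems. Great!"))
# ['Mr. Brown met Dr. Smith.', 'They discussed TTS systems.', 'Great!']
```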
B. Text Normalization and Linearization

Text normalization is the transformation of text into a pronounceable form. It is often performed before the text is processed in some other way, such as generating synthesized speech or automated language translation. The main objective of this process is to identify punctuation marks and pauses between words. Usually the text normalization process also converts all letters to lowercase or uppercase and removes punctuation, accent marks, stopwords or "too common" words and other diacritics from letters.

Text normalization is useful, for example, for comparing two sequences of characters which are represented differently but mean the same thing: "Don't" vs "Do not", "I'm" vs "I am", "Can't" vs "cannot" are some examples.

The four main phases of text normalization are:
(i) Number converter: a number is pronounced differently in different situations, for example:
1772 (date): seventeen seventy two
1772 (phone number): one seven seven two
1772 (quantifier): one thousand seven hundred and seventy two
Fractional and decimal numbers are also handled:
0.302 (number): point three nought two
(ii) Abbreviation converter: abbreviations are changed to their full textual form.
Mrs. - Misses
St. Joseph St. - Saint Joseph Street
(iii) Acronym converter: acronyms are replaced by their single-letter components.
S. I. - S I
(iv) Word segmentation: sentences are a group of word segments, separated by a special delimiter (e.g. "||"). A segment can be an acronym, a single word or a numeral.
Examples of acronyms:
"NATO" - "nayto"
"HIV" - "aitch eye vee"
"Henry IV" - "Henry the fourth"
"Chapter IV" - "Chapter four"
Punctuation marks are also identified.
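To make the four phases concrete, here is a toy normalizer covering phases (i), (ii) and (iv) for the examples above. The "||" delimiter and the expansion entries follow the paper's examples; the function names and the (very small) rule set are assumptions made purely for illustration.

```python
import re

# Tiny expansion tables built from the examples in the text; a real system
# would need full number grammars and much larger abbreviation lists.
ABBREVIATIONS = {"Mrs.": "Misses", "St.": "Saint"}     # phase (ii)
DIGIT_NAMES = "zero one two three four five six seven eight nine".split()

def number_as_digits(token: str) -> str:
    """Phase (i), phone-number style: read a number digit by digit."""
    return " ".join(DIGIT_NAMES[int(d)] for d in token if d.isdigit())

def normalize(sentence: str) -> str:
    """Expand abbreviations and digit strings, then join segments with '||'."""
    segments = []
    for token in sentence.split():
        if token in ABBREVIATIONS:
            segments.append(ABBREVIATIONS[token])
        elif re.fullmatch(r"\d+", token):
            segments.append(number_as_digits(token))
        else:
            segments.append(token)
    return " || ".join(segments)            # phase (iv): word segmentation

print(normalize("Mrs. Smith called 1772"))
# Misses || Smith || called || one seven seven two
```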
Linearization is the process of presenting the hypertext of a page as a linear list of links, giving the user a quick overview of the page. The TTS system then reads out the linearized data. This feature helps in selecting and reading the text and also in listing the links in the hypertext.
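A simple way to picture linearization is extracting the link text of a page so the synthesizer can read the links as a flat list. The sketch below uses Python's standard html.parser for this; it is only an illustrative assumption about how the linearization step might be realized, not the authors' implementation.

```python
from html.parser import HTMLParser

class LinkLinearizer(HTMLParser):
    """Collect the visible text of every <a> element as a flat, readable list."""

    def __init__(self):
        super().__init__()
        self.links = []
        self._in_link = False

    def handle_starttag(self, tag, attrs):
        if tag == "a":
            self._in_link = True
            self.links.append("")

    def handle_endtag(self, tag):
        if tag == "a":
            self._in_link = False

    def handle_data(self, data):
        if self._in_link:
            self.links[-1] += data

parser = LinkLinearizer()
parser.feed('<p>See <a href="/intro">the introduction</a> and <a href="/refs">references</a>.</p>')
print(parser.links)   # ['the introduction', 'references'] -> handed to the TTS reader
```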
C. Phonetic Analysis

Phonetic analysis converts the orthographic symbols into phonological ones using a phonetic alphabet; this is basically known as "grapheme-to-phoneme" conversion.

A phone is the smallest sound unit, a sound that has a definite shape as a sound wave. A collection of phones that constitutes a minimal distinctive phonetic unit is called a phoneme. The number of phonemes, only 44, is relatively smaller than the number of graphemes.

Phoneme Set (English)
• Vowels (19): /a/, /ae/, /air/, /ar/, /e/, /ee/, /i/, /ie/, /o/, /oe/, /oi/, /oo/, /ow/, /or/, /u/, /ur/, /ue/, /uh/, /w/.
• Consonants (25): /b/, /ks/gz/, /c/k/, /ch/, /d/, /f/, /g/, /h/, /j/, /l/, /m/, /n/, /ng/, /p/, /kw/, /r/, /s/, /sh/, /t/, /th/, /th/, /v/, /y/, /z/, /zh/.
Examples:
o /air/: square, bear.
o /ow/: down, house.
o /ks/gz/: box, exist.

Pronouncing a word from its spelling can be approached in two ways:
(a) Dictionary-based approach
(b) Rule-based approach
In the dictionary-based approach, a dictionary that stores all words with their correct pronunciation is kept, and synthesis is a matter of looking up each word and spelling it out with the correct pronunciation. This approach is very quick and accurate and the pronunciation quality is better, but the major drawback is that it needs a large database to store all words, and the system stops if a word is not found in the dictionary.
In the rule-based approach, the letter sounds of a word are blended together to form a pronunciation based on rules. The main advantage is that it requires no database and it works on any type of input; at the same time, the complexity grows for irregular inputs.
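A common way to combine the two approaches is to try a pronunciation dictionary first and fall back to letter-to-sound rules when a word is missing, which avoids the "system stops on unknown word" drawback noted above. The sketch below shows that idea; the tiny lexicon, the phoneme labels and the fallback rules are illustrative assumptions only.

```python
# Dictionary-first grapheme-to-phoneme lookup with a crude rule-based fallback.
# The entries and letter rules are made up for illustration; a real system would
# use a full pronunciation lexicon and proper letter-to-sound rules.
LEXICON = {
    "square": ["s", "kw", "air"],
    "house":  ["h", "ow", "s"],
    "box":    ["b", "o", "ks"],
}

LETTER_RULES = {"c": "k", "q": "kw", "x": "ks"}   # default: the letter itself

def grapheme_to_phoneme(word: str) -> list[str]:
    """Return phonemes from the lexicon if present, else from per-letter rules."""
    word = word.lower()
    if word in LEXICON:                           # (a) dictionary-based approach
        return LEXICON[word]
    return [LETTER_RULES.get(ch, ch) for ch in word if ch.isalpha()]  # (b) rule-based fallback

print(grapheme_to_phoneme("square"))   # ['s', 'kw', 'air']  (dictionary hit)
print(grapheme_to_phoneme("quick"))    # ['kw', 'u', 'i', 'k', 'k']  (rule fallback)
```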

D. PROSODIC MODELLING AND INTONATION

Prosody is the combination of stress pattern, rhythm and intonation in speech. Prosodic modeling describes the speaker's emotion. Recent investigations suggest that identifying the vocal features which signal emotional content may help to create very natural [9] synthesized speech.

Intonation is simply the variation of pitch while speaking. All languages use pitch as intonation, for instance to express happiness or to raise a question. Modelling of intonation is an important task that affects the intelligibility and naturalness of the speech; to achieve high-quality text to speech conversion, a good model of intonation is needed.

Generally, intonations are distinguished as:
(i) Rising intonation (when the pitch of the voice increases)
(ii) Falling intonation (when the pitch of the voice decreases)
(iii) Dipping intonation (when the pitch of the voice falls and then rises)
(iv) Peaking intonation (when the pitch of the voice rises and then falls)
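One simple way to realize these contour shapes is to interpolate a fundamental-frequency (F0) track between a start, middle and end target for each phrase. The sketch below does exactly that; the target values in hertz and the contour table are illustrative assumptions, not values taken from the paper.

```python
# Piecewise-linear F0 contours for the four intonation patterns listed above.
# Target values (in Hz) are made up for illustration.
CONTOUR_TARGETS = {
    "rising":  (100.0, 130.0, 160.0),
    "falling": (160.0, 130.0, 100.0),
    "dipping": (150.0, 100.0, 150.0),
    "peaking": (100.0, 160.0, 100.0),
}

def f0_contour(pattern: str, n_frames: int) -> list[float]:
    """Interpolate start -> middle -> end F0 targets over n_frames frames."""
    start, middle, end = CONTOUR_TARGETS[pattern]
    half = n_frames // 2
    contour = []
    for i in range(n_frames):
        if i < half:                      # first half: start -> middle
            t = i / max(half - 1, 1)
            contour.append(start + t * (middle - start))
        else:                             # second half: middle -> end
            t = (i - half) / max(n_frames - half - 1, 1)
            contour.append(middle + t * (end - middle))
    return contour

print([round(v) for v in f0_contour("rising", 6)])   # [100, 115, 130, 130, 145, 160]
```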
E. Acoustic Processing

The speech is spoken according to the voice characteristics of a person. There are three types of acoustic synthesis available:
(i) Concatenative synthesis
(ii) Formant synthesis
(iii) Articulatory synthesis

The concatenation of prerecorded human voice is called concatenative synthesis; this process needs a database holding all the prerecorded words. Natural-sounding speech is the main advantage, and the main drawback is building and using the large database.

Formant-synthesized speech can be made consistently intelligible. It does not use a database of speech samples, so the speech sounds artificial and robotic.

Speech organs are called articulators. In articulatory synthesis, techniques for synthesizing speech based on models of the human vocal tract are developed. It produces completely synthetic output, typically based on mathematical models.
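The concatenative approach can be illustrated with a few lines of NumPy: look up a prerecorded waveform for each normalized unit and join them end to end. The unit database below is a stand-in (short sine tones instead of recorded words) and the function names are assumptions for illustration only.

```python
import numpy as np

SAMPLE_RATE = 16000  # Hz

def fake_recording(freq: float, duration: float = 0.2) -> np.ndarray:
    """Stand-in for a prerecorded word: a short sine tone instead of real speech."""
    t = np.arange(int(SAMPLE_RATE * duration)) / SAMPLE_RATE
    return 0.3 * np.sin(2 * np.pi * freq * t)

# Unit database: in a real concatenative system these would be recorded words or diphones.
UNIT_DB = {"hello": fake_recording(220.0), "world": fake_recording(330.0)}

def concatenate_units(words: list[str]) -> np.ndarray:
    """Concatenative synthesis: join the stored waveform of each known unit."""
    pieces = [UNIT_DB[w] for w in words if w in UNIT_DB]
    return np.concatenate(pieces) if pieces else np.zeros(0)

audio = concatenate_units(["hello", "world"])
print(audio.shape)   # (6400,) -> two 0.2 s units at 16 kHz
```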
IV. CONCLUSION

This paper has given a clear and simple overview of the working of a text to speech (TTS) system in a step-by-step manner. Many text to speech (TTS) systems are available in the market, and much research is ongoing to make the speech more effective and natural, with stress and emotions. We expect synthesizers to continue to improve through research in prosodic phrasing, improving the quality of speech, voice, emotions and expressiveness, and to simplify the conversion process so as to avoid complexity in the program.

REFERENCES
[1] Francesc Alias, Xavier Sevillano, Joan Claudi Socoro and Xavier Gonzalvo, "Towards High-Quality Next-Generation Text-to-Speech Synthesis: A Multidomain Approach by Automatic Domain Classification", IEEE Transactions on Audio, Speech and Language Processing, vol. 16, no. 7, September 2008.
[2] Qing Guo, Jie Zhang, Nobuyuki Katae, Hao Yu, "High-Quality Prosody Generation in Mandarin Text-to-Speech System", Fujitsu Sci. Tech. J., vol. 46, no. 1, pp. 40-46, 2010.
[3] Gopalakrishna Anumanchipalli, Rahul Chitturi, Sachin Joshi, Rohit Kumar, Satinder Pal Singh, R. N. V. Sitaram, D. P. Kishore, "Development of Indian Language Speech Databases for Large Vocabulary Speech Recognition Systems".
[4] A. Black, H. Zen and K. Tokuda, "Statistical Parametric Speech Synthesis", in Proc. ICASSP, Honolulu, HI, 2007, vol. IV, pp. 1229-1232.
[5] G. Bailly, N. Campbell and B. Mobius, "ISCA Special Session: Hot Topics in Speech Synthesis", in Proc. Eurospeech, Geneva, Switzerland, 2003, pp. 37-40.
[6] M. Ostendorf and I. Bulyko, "The Impact of Speech Recognition on Speech Synthesis", in Proc. IEEE Workshop on Speech Synthesis, Santa Monica, 2002, pp. 99-106.
[7] Text To Speech Synthesis, a knol by Jaibatrik Dutta.


[8] Silvio Ferreira, Celine Thillou, Bernard Gosselin, "From Picture to Speech: An Innovative Application for Embedded Environment".
[9] M. Nageshwara Rao, Samuel Thomas, T. Nagarajan and Hema A. Murthy, "Text-to-Speech Synthesis Using Syllable-Like Units".
[10] Jindrich Matousek, Josef Psutka, Jiri Kruta, "Design of Speech Corpus for Text-to-Speech Synthesis".

D. Sasirekha completed her B.Sc. (CS) in 2003 at Avinashilingam University for Women, Coimbatore, and her M.Sc. (CS) in 2005 at Annamalai University. She is currently pursuing a part-time Ph.D. (CS) at Karpagam University, Coimbatore, and working as a staff member at Avinashilingam University for Women, Coimbatore, India.

Dr. E. Chandra received her B.Sc. from Bharathiar University, Coimbatore in 1992 and her M.Sc. from Avinashilingam University, Coimbatore in 1994. She obtained her M.Phil. in the area of Neural Networks from Bharathiar University in 1999 and her Ph.D. in the area of speech recognition systems from Alagappa University, Karaikudi in 2007. She has 16 years of teaching experience, including 6 months in industry. At present she is working as Director, School of Computer Studies, Dr. SNS Rajalakshmi College of Arts & Science, Coimbatore. She has published more than 30 research papers in national and international journals and conferences in India and abroad, and has guided more than 20 M.Phil. research scholars. At present 3 M.Phil. scholars and 8 Ph.D. scholars are working under her guidance. She has delivered lectures at various colleges in Tamil Nadu and Kerala and is a Board of Studies member at various colleges. Her research interests lie in the areas of neural networks, speech recognition systems, fuzzy logic and machine learning techniques. She is a life member of CSI and of the Society of Statistics and Computer Applications, and is currently a Management Committee member of the CSI Coimbatore Chapter.
