Text To Speech Conversion Module

International Journal of Pure and Applied Mathematics Special Issue

Abstract: This paper proposes a method of developing a complete system in which text can be converted to speech, a text file can be converted to speech, text in various languages can be converted to speech, an image can be converted to text, and an image can be converted to speech, using MATLAB as a programming tool. The methods used are pre-processing, Unicode conversion, segmentation, concatenation, and prosody and smoothing, combined into a single application for easy access and usability. The motivation behind developing this system is to combine the various modules, using a modular approach, into a simple yet effective way for differently abled people to interact with others, thereby making society better.

Keywords: Text-to-Speech (TTS), Foreign Language, Image-to-Text (ITS), Segmentation, Gray, Application.

1. Introduction

The Text-to-Speech module aims to provide a user-friendly application to general users. The main modules used in this application are the Text-to-Speech converter and the Image-to-Text converter. The application provides a multi-functionality platform for users to communicate, listen, or narrate conveniently. Users can choose to convert readable images to text files or to read text directly.

The text-to-speech mode converts a text file or typed-in text to speech, which is then narrated/read using the voice database of Microsoft SAPI [4]. The application integrates a narrator to help users operate the software. The processing works on phonemes, concatenating syllables with an optimal coupling algorithm.

The image-to-text mode converts readable images to a text file, which can then be used for speech conversion. Readable images are images with low foreground complexity, which makes it possible to extract the letters in an orderly fashion. The user is provided with multiple options in the software and can select the mode of operation. When Text-to-Speech mode is selected, the user inputs text through the text input box or a text file; the text is processed and the resulting speech is produced. When Image-to-Text mode is selected, the user inputs a readable image file, which is processed, and the text in the image is stored in a text file. The user is then given an option to read this generated text file.

The paper is organized as follows: Section II reviews related work and existing methods relevant to this application. Section III explains how the application is built, broken down into the various functions it provides. Section IV gives an insight into the project by showing the simulation of the application in MATLAB and the results in the form of snapshots. Section V concludes the paper with future possibilities for the project.

2. Related Work

Currently, the main methods used for text-to-speech conversion are concatenation synthesis, which includes unit selection, diphone, and domain-specific synthesis, along with other methods such as formant, articulatory, HMM-based, and sine-wave synthesis.

The threading together of fragments of speech is done using concatenation synthesis [1]. This produces natural, life-like synthesized speech. However, owing to the differences between the human voice and a synthetically produced machine voice, there is a chance of audible glitches in the output. There are three key sub-types of concatenation synthesis [3].

Formant synthesis produces an output by additive synthesis of an acoustic model; it differs from other techniques in that no human speech is used at runtime.

Articulatory synthesis is a method of synthesizing machine speech based on simulations of the human vocal tract and its articulatory processes.

These methods are prone to various challenges. Text normalization, the process of standardizing text, is seldom straightforward and stands as an obstacle to any speech module: texts often contain numbers and acronyms that must be expanded into a phonetic representation for further processing, and many words in the English vocabulary require different pronunciations depending on context.

There are also text-to-phoneme challenges, such as determining the tone and pronunciation of a word or phrase from its spelling in context. Evaluation challenges occur while processing speech. Hence, there is always a compromise between the
production proficiency and the replay prerequisites required in speech synthesis.

Prosody and emotional content are also important for producing the vocal features that humans use to convey their emotions and the context of a phrase, and are therefore needed to produce more natural synthesized speech.

Image processing [2] is done by desaturating the colour image to grayscale using a set of grayscale thresholds. Image segmentation is then applied to the noiseless grayscale image using framelets to extract the characters from the image; the extracted characters are compared with the database and the text is produced. The size of the frames is set so as to identify the characters.

3. Proposed Method and Algorithm

The input text is first pre-processed and converted to Unicode. Unicode has the explicit aim of transcending the limitations of traditional character encodings. Here the pre-processed text is used to identify the fonts of the input text before conversion to Unicode. The encoded text is then segmented into syllables and duplicates are removed.

The syllabled text is then mapped to the pre-recorded syllable sound files in the database. These syllables are concatenated and smoothened to produce the resultant output, using the optimal coupling algorithm with a Hamming window (refer to Eq. (1)):

w(n) = 0.54 − 0.46 cos(2πn / (N − 1)), 0 ≤ n ≤ N − 1 … (1)

This gives a smooth, human-like speech output.

Variations can be applied to the resulting output: the voice, rate, and volume of the output speech can be changed by the user as desired.
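The concatenate-and-smooth step can be sketched as follows. This is an illustrative Python/NumPy analogue (the paper itself works in MATLAB): the overlap length, the crossfade construction, and the function names are assumptions for illustration; only the Hamming coefficients of Eq. (1) come from the text.

```python
import numpy as np

def hamming(n_len):
    # Hamming window, Eq. (1): w(n) = 0.54 - 0.46*cos(2*pi*n/(N-1))
    n = np.arange(n_len)
    return 0.54 - 0.46 * np.cos(2 * np.pi * n / (n_len - 1))

def concatenate_syllables(syllables, overlap=64):
    """Join syllable waveforms, crossfading `overlap` samples at each
    boundary with the halves of a Hamming window to smooth the joins.
    Assumes every syllable is longer than `overlap` samples."""
    w = hamming(2 * overlap)
    fade_in = w[:overlap]    # rising half of the window
    fade_out = w[overlap:]   # falling half of the window
    out = syllables[0].astype(float)
    for syl in syllables[1:]:
        syl = syl.astype(float)
        head, tail = out[:-overlap], out[-overlap:]
        blended = tail * fade_out + syl[:overlap] * fade_in
        out = np.concatenate([head, blended, syl[overlap:]])
    return out
```

Each join consumes `overlap` samples, so two syllables of lengths L1 and L2 yield L1 + L2 − overlap output samples.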
Now the user gets the option between exiting the app or reusing it to operate other functions (as shown in Fig. 3).

Pseudo code for the TTS module:

1. Initialize a function with five input arguments (text, voice, pace, volume, sample rate).
2. Check whether SAPI exists; if not, raise an error.
3. Create the local speech server's default interface (SV).
4. Invoke the voices from the default interface.
5. List the voices recognized in the system.
6. If the number of arguments > 1, also select the voice.
7. If the number of arguments > 2, also set the pace.
8. If the number of arguments > 3, also set the volume.
9. If the number of arguments > 4, also set the frequency.
10. Invoke speech from the default interface.
11. Clear SV.
12. End the function.

Figure 4. Image-to-Speech Problem Formulation.

The character templates are read into character variables, which are stored in the form of an array. The contents of the array are converted to cells of dimension 42×24 and stored in a variable named templates. The templates can now be called by the OCR program:

1. Load the required image in any given format (jpg, jpeg, png) from the system.
2. Pre-process the selected image by adjusting its size and resolution; this prepares the image for the next stages of processing.
3. Convert the image from RGB to grayscale in order to remove noise by eliminating hue and saturation (refer to Eq. (2)):

g = 1 − im2bw(f, t) … (2)

where g is the converted image, t is the threshold calculated from the image, and f is the input image (MATLAB line: g = ~im2bw(f, t)).

4. The resultant image consists of pixels representing only the intensity values of the gray gradient.
5. The image is then made to undergo a binary transformation that converts the grayscale image into a black-and-white image.
6. This is done by hard thresholding (refer to Fig. 5), where a gray threshold value is chosen: anything above the threshold is represented as white and anything below as black.
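Steps 3–6 can be sketched as follows — an illustrative Python/NumPy analogue of MATLAB's rgb2gray, graythresh, and ~im2bw pipeline. The luminance weights and the Otsu threshold are the standard choices behind those MATLAB functions, assumed here rather than stated explicitly in the paper; the image is assumed to hold floats in [0, 1].

```python
import numpy as np

def rgb_to_gray(img):
    # ITU-R BT.601 luminance weights, as used by MATLAB's rgb2gray
    return img[..., 0] * 0.2989 + img[..., 1] * 0.5870 + img[..., 2] * 0.1140

def otsu_threshold(gray, bins=256):
    # Otsu's method: choose the threshold maximizing between-class variance,
    # i.e. "the threshold calculated from the image" (graythresh analogue).
    hist, edges = np.histogram(gray, bins=bins, range=(0.0, 1.0))
    p = hist / hist.sum()
    omega = np.cumsum(p)                  # class-0 probability up to each bin
    mu = np.cumsum(p * np.arange(bins))   # class-0 cumulative mean (bin index)
    mu_t = mu[-1]
    with np.errstate(divide="ignore", invalid="ignore"):
        sigma_b = (mu_t * omega - mu) ** 2 / (omega * (1.0 - omega))
    # class 0 spans bins 0..k, so the threshold is the upper edge of bin k
    return edges[np.nanargmax(sigma_b) + 1]

def binarize_inverted(img):
    """Eq. (2): g = 1 - im2bw(f, t) -- dark text becomes 1 (foreground)."""
    gray = rgb_to_gray(img)
    t = otsu_threshold(gray)
    return (gray <= t).astype(np.uint8)   # ~im2bw: at or below threshold -> 1
```

The inversion makes the dark characters the foreground (value 1), which is what the later segmentation and template-matching steps operate on.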
9. After the text zones are detected, line segmentation is performed to separate the lines of text.
10. The text is then identified by character recognition, in which the identified characters are compared with the templates in the database.
11. All extracted characters are resized to the template dimensions and converted to individual character codes.
12. These characters are then saved in an output format: TXT, DOC, or PDF.
13. The text file can now be read by our first TTS module to give a speech output.

The user is then prompted by the program about the text file and receives the option to view the text file, followed by the option of having it read aloud. The program then uses the text-to-speech conversion module to produce the speech output for the respective text file. The user then gets the option of exiting the app or reusing it to operate other functions.

4. Simulation and Results

The simulation of the Text to Speech conversion module is done using MATLAB, which gives the following results. First, the application is opened and an introduction page is shown to the user (refer to Fig. 8).
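Steps 10–11 of the OCR procedure — resizing each extracted character to the template dimensions and comparing it with the stored templates — can be sketched as below. This is an illustrative Python analogue; the 42×24 template size comes from the paper, while the nearest-neighbour resize and the pixel-agreement matching score are assumptions for illustration.

```python
import numpy as np

TEMPLATE_SHAPE = (42, 24)  # template cell dimensions from the paper

def resize_nearest(char_img, shape=TEMPLATE_SHAPE):
    # Nearest-neighbour resize of a binary character image to template size
    rows = np.arange(shape[0]) * char_img.shape[0] // shape[0]
    cols = np.arange(shape[1]) * char_img.shape[1] // shape[1]
    return char_img[np.ix_(rows, cols)]

def match_character(char_img, templates):
    """Return the label of the template most similar to char_img,
    scored by the fraction of pixels that agree (step 10)."""
    resized = resize_nearest(char_img)        # step 11: resize to template dims
    best_label, best_score = None, -1.0
    for label, tmpl in templates.items():
        score = np.mean(resized == tmpl)
        if score > best_score:
            best_label, best_score = label, score
    return best_label
```

The recognized labels can then be appended to a string and written out as the TXT/DOC/PDF file of step 12.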
The selected image is then shown to the user, after which the image is converted to gray with specific threshold values (refer to Fig. 12) in order to extract the text from the image.

The application can also play the text file in the form of a speech output once the user selects that option. Once the user has finished working with the application, it can be closed with the Quit option, followed by a "Thank you" prompt (refer to Fig. 17).
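The playback path follows the 12-step TTS pseudo code given earlier. A rough Python sketch of that call sequence is shown below; the `SpVoice` members used (GetVoices, Voice, Rate, Volume, Speak) are the real Microsoft SAPI automation interface, but the `engine` injection is an assumption added here so the logic can be exercised without Windows (on Windows one would pass `win32com.client.Dispatch("SAPI.SpVoice")`), and sample-rate handling is omitted since SAPI sets it on the audio output stream rather than on the voice.

```python
def speak(text, voice=None, pace=None, volume=None, engine=None):
    """Mirror of the TTS pseudo code: optional arguments select the voice
    and set the pace and volume before the text is spoken."""
    if engine is None:
        raise RuntimeError("SAPI speech server not available")  # step 2
    voices = engine.GetVoices()          # steps 4-5: enumerate system voices
    if voice is not None:                # step 6: optional voice selection
        engine.Voice = voices.Item(voice)
    if pace is not None:                 # step 7: optional pace (SAPI Rate)
        engine.Rate = pace
    if volume is not None:               # step 8: optional volume (0-100)
        engine.Volume = volume
    engine.Speak(text)                   # step 10: produce the speech output
```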
5. Conclusion

The results obtained from the work in this paper help us conclude that text can be segmented into syllables and mapped to sounds to produce speech, and that the speech output originating from the system can be manipulated as per the user's objectives. We also cover the processing of images to extract machine-printed text, which can be further converted to speech. The product's future goal is to help the differently abled population of society [14]. The application is coded, using modular programming, into independent functions, which allows users in the future to add more functionality, making it a more comprehensive and powerful tool providing a smart solution to the challenges we face. More refined algorithms will improve this application in the future.

References

[1] Chucai Yi, Yingli Tian, K. Anuradha, "Text to Speech Conversion," IEEE Transactions, vol. 19, pp. 269-278, 2013.

[2] S. Jayaraman, S. Esakkirajan, T. Veerakumar, Digital Image Processing, Tata McGraw Hill Education Private Limited.

[3] Zhang, J., 2004. Language Generation and Speech Synthesis in Dialogues for Language Learning. Masters Dissertation, Massachusetts Institute of Technology.

[4] Text-to-speech (TTS) Overview. In Voice RSS Website. Retrieved February 21, 2014, from http://www.voicerss.org/tts/

[9] Chaw Su Thu Thu, Theingi Zin, "Implementation of Text to Speech Conversion," International Journal of Engineering Research & Technology (IJERT), vol. 3, issue 3, March 2014.

[10] Zhen Li, Xi Zhou, Thomas S. Huang, 2009 16th IEEE International Conference on Image Processing (ICIP), 2009.

[11] Manabu Ohta, Toshihiro Hachiki, Atsuhiro Takasu, Fourth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT 2011), 2011.

[12] Tu Bui, John Collomosse, 2015 IEEE International Conference on Image Processing (ICIP), 2015.

[13] Breen, A.P., "The future role of text to speech synthesis in automated services," Advances in Interactive Voice Technologies for Telecommunication Services (Digest No: 1997/147), IEEE Colloquium on, pp. 6/1-6/5, 12 Jun 1997.

[14] H. Li, D. Doerman, and O. Kia, "A system for converting English Text into Speech," IEEE Transactions on Image Processing, pp. 147-156, 2004.

[15] T. Padmapriya and V. Saminadan, "Improving Throughput for Downlink Multi user MIMO-LTE Advanced Networks using SINR approximation and Hierarchical CSI feedback," International Journal of Mobile Design Network and Innovation, Inderscience Publisher, ISSN: 1744-2850, vol. 6, no. 1, pp. 14-23, May 2015.