
International Journal of Pure and Applied Mathematics
Volume 115 No. 6 2017, 389-395
ISSN: 1311-8080 (printed version); ISSN: 1314-3395 (on-line version)
url: http://www.ijpam.eu
Special Issue

TEXT TO SPEECH CONVERSION MODULE

Hussain Rangoonwala1, Vishal Kaushik2, P Mohith3 and Dhanalakshmi Samiappan4

1,2,3 Department of Electronics and Communication Engineering, SRM University, Chennai, India.
4 Assistant Professor, Department of Electronics and Communication Engineering, SRM University, Chennai, India.
4 [email protected]

Abstract: This paper proposes a method aimed at developing a complete system in which text can be converted to speech, a text file can be converted to speech, text in various languages can be converted to speech, an image can be converted to text, and an image can be converted to speech, using MATLAB as the programming tool. The methods used are preprocessing, Unicode conversion, segmentation, concatenation, prosody and smoothing, which are then combined in an application for easy access and usability. The motivation behind developing this system is to combine various modules using a modular approach in order to get a simple yet effective way for differently abled people to interact with others, thereby making society better.

Keywords: Text-to-Speech (TTS), Foreign Language, Image-to-Text (ITS), Segmentation, Gray, Application.

1. Introduction

The Text-to-Speech module aims to provide a user-friendly application to general users. The main modules used in this application are the Text-to-Speech convertor and the Image-to-Text convertor. The application provides a multi-functionality platform for users to communicate, listen or narrate conveniently. The users can choose to convert readable images to text files or read text as such.
The text-to-speech mode converts a text file or inputted text to speech, which is then narrated/read using the voice database used by Microsoft SAPI [4]. The application integrates a narrator to help the users use the software. This processing is done using phonemes and by concatenating syllables with the optimal coupling algorithm.
The image-to-text mode converts readable images to a text file, which can further be used for speech conversion. Readable images are images that have less complexity in the foreground, making it possible to extract the letters in a grammatical fashion. The user is provided with multiple options in this software, where he/she can select the mode of operation. When Text-to-Speech mode is selected, the user has to input text using the text input box or a text file. The text is processed and the resulting speech is produced. When Image-to-Text mode is selected, the user has to input a readable image file, which is processed, and the text in the image is stored in a text file. The user is given an option to read this generated text file.
The paper is organized as follows: Section II provides the related work and the existing methods related to this application. Section III explains how the application is made and is broken down into the various functions provided by the application. Section IV provides an insight into the project by showing the simulation of the application using MATLAB and the results in the form of snapshots. Section V concludes the paper with the future possibilities of the project.

2. Related Work

Currently, the methods used for text-to-speech conversion include concatenation synthesis, which covers unit selection, diphone and domain-specific synthesis. Other methods include formant, articulatory, HMM-based and sine wave synthesis.
The threading of fragments of speech is done using concatenation synthesis [1]. This produces natural, life-like system-created speech. However, due to the differences in the nature of the human voice and a synthetically produced machine voice, there is a chance of recognizable glitches in the output. There are three key sub-types of concatenation synthesis [3].
Formant synthesis produces an output by using additive synthesis of an acoustic representation in the form of a model. This differs from other techniques in that human speech is not used during runtime.
Articulatory synthesis is a method of synthesizing machine speech based on simulations of the human vocal tract and its articulatory processes.
These methods are prone to various challenges, such as text normalization difficulties: text normalization, the process of standardizing text, is seldom straightforward and stands as an obstacle to any speech module. Texts often contain numbers and acronyms, which must be expanded into a phonetic depiction for further processing. Moreover, there are many words in the English vocabulary which contextually require a different pronunciation.
There are also text-to-phoneme challenges, like determining the tone and pronunciation of a word/phrase based on the spelling in context. Evaluation challenges occur while processing speech. Hence, there is always a compromise between the

production proficiency and the replay prerequisite required in speech synthesis.
Prosodics and emotional content are also important for producing the vocal features that humans use to showcase their emotions and the context of the phrase/text, and hence are used to produce a more natural synthesized speech.
Image processing [2] is done by saturating the colour of the image to grayscale using a set of grayscale thresholds. Image segmentation is done on the noiseless grayscale image using framelets to extract the characters out of the image; the extracted characters are compared to the database and the text is produced. The size of the frames is set to identify the characters.

3. Proposed Method and Algorithm

Figure 1. Block Diagram of the modules

The basic block diagram (refer with: Fig. 1) of this project follows a modular approach in which the input text is manipulated in the text manipulation module, where variations of voice, rate and volume are made. The next block is the text-to-speech module, where the manipulated text is converted into speech by unit selection synthesis.
The image-to-text module takes an image as input, which is converted to text. Using the above two modules combined, we can have an image-to-speech module that gives speech as output.

Figure 2. Text-to-Speech Problem Formulation

Figure 2 explains the flow of the text-to-speech module, which is explained in detail as follows: text-to-speech conversion starts with pre-processing of the input text. Here the text abbreviations, acronyms and numbers are expanded [10]. The pre-processed text will then be converted to Unicode. Unicode has the explicit aim of transcending the limitations of traditional character encodings. Here the pre-processed text is used to identify the fonts of the input text and is converted to Unicode. Now, the encoded text is segmented into syllables and the duplicates are removed.
The syllabled text is then mapped to the pre-recorded syllable sound files in the database. These syllables are then concatenated and smoothened for the resultant output. This is done by the optimal coupling algorithm using a Hamming window (refer with: Eq. (1)):

w(n) = 0.54 − 0.46 cos(2πn/(N − 1)) … (1)

This gives a smooth, human-like speech output.
Variations can be applied to the resulting output. The voice, rate and volume of the output speech can be suitably changed by the user.

Figure 3. Flowchart for Text-to-Speech Convertor

The program starts with the introduction page, accompanied by a narrated welcome. On clicking the next button, the user is directed to the help page, where the user gets the option to invoke the narrator, which reads out the contents of the help page, by clicking on the read button.
By clicking the enter button, the user is presented with a dialogue box containing options for manipulating the voice, rate and pace of the text to be read. The user will be directed to the page of his choice of function where, along with application of the function, the user can also control the volume of the speech output. The user enters the text into the text field and the resultant speech output is produced upon clicking the speak button.
The Voice function allows the user to choose between the language voice packs installed on the system. Hence, the function provides the provision of generating a multi-lingual output. The Rate function provides the provision to change the sampling rate of the speech output. The Pace function allows the user to control the speed of the resultant speech output. After completion of the various processes, the user gets the

option between exiting the app or reusing it to operate other functions (as shown in: Fig. 3).

Pseudo Code for TTS Module

1. Initialize a function with five input arguments (text, voice, pace, volume, sample rate).
2. Check if SAPI exists; if not, raise an error.
3. Create the local speech server's default interface (SV).
4. Invoke the voices from the default interface.
5. List the voices recognized by the system.
6. If the number of arguments > 1, select the voice.
7. If the number of arguments > 2, also set the pace.
8. If the number of arguments > 3, also set the volume.
9. If the number of arguments < 5, also set the frequency.
10. Invoke speech from the default interface.
11. Clear SV.
12. End the function.

Figure 4. Image-to-Speech Problem Formulation

An allocation for image-to-speech conversion is also provided (refer with: Fig. 4), where the contents of the image are converted to text and thereafter read out as speech.
Images containing handwritten or printed text are converted to Unicode text; the image can be a scanned file, a photograph of a document or a photograph taken in real time.
The image is converted into an inverted grayscale image. The noise, if any, is removed using the threshold. The foreground is made white and the background black due to inverse gray-scaling.
This is further segmented using frames to extract characters from the image, which are mapped to a matrix that allows the image to be read line by line and character by character [7]. The extracted text is stored into the matrix as it is read, allowing the extracted text to be appropriately related to the image, so the saved text is not randomly ordered. Further, the text is written to a text file, and the contents of the text file can be read aloud using the text-to-speech convertor of the application.

Algorithm for ITS Module

Initially we create a database which consists of all character templates, stored in character variables. These character variables are stored in the form of an array. The contents of the array are converted to cells of dimension 42x24 and stored in a variable named templates. Now the templates can be called by the OCR program.

1. Load the required image in any given format (jpg, jpeg, png) from the system.
2. Pre-processing of the selected image is done by adjusting the size and resolution. This prepares the image for the next stages of processing.
3. The image is then converted from RGB to a grayscale image in order to remove noise by eliminating hue and saturation (refer to: Eqn. (2)):

g = ~im2bw(f, t) … (2)

where g is the converted image, t is the threshold calculated from the image, and f is the input image; the corresponding MATLAB function line is g = ~im2bw(f, t).
4. The resultant image now consists of pixels representing only the intensity values of the gray gradient.
5. This image is now made to undergo a binary transformation to convert the grayscale image into a black and white image.
6. This is done using hard thresholding (refer to: Fig. 5), where a gray threshold value is chosen. Anything above the threshold is represented as white and anything below is represented as black.

Figure 5. Graph and Eqn. depicting Hard Thresholding

7. This is followed by an inverse transformation (refer with: Fig. 6), where the black and white intensity values are inverted.

Figure 6. Graph and Eqn. depicting Inverse Transformation

8. After pre-processing of the image, layout analysis is done to find the various places where text is present in the image.

9. After the detection of text zones, line segmentation is done to separate the various lines of text.
10. Texts are now identified with the help of character recognition, in which the identified characters are compared with the templates in the database.
11. All extracted characters are resized to the template dimensions and converted to individual character codes.
12. These characters are then saved into an output format, either TXT, DOC or PDF.
13. The text file can now be read by our first TTS module to give a speech output.

Figure 7. Flowchart for Image-to-Text Convertor

The program starts with the introduction page, accompanied by a narrated welcome. On clicking the next button, the user is directed to the help page, where the user gets the option to invoke the narrator, which reads out the contents of the help page, by clicking on the read button. By clicking the enter button, the program asks the user to input an image containing the text to be synthesized. The program then displays the gray-scaled version of the image, followed by the black and white form of the image. The lines containing text are cropped out into images and displayed, followed by the display of the cropped image of each letter forming the text in the respective line. Upon completion of this segment of the program, the detected text is stored in a text file.
The user is then prompted by the program about the text file and receives the option to view the text file, followed by the option of reading the file. The program then uses the text-to-speech conversion module to produce the speech output for the respective text file. Now the user gets the option of exiting the app or reusing it to operate other functions.

4. Simulation and Results

The simulation of the Text-to-Speech conversion module is done using MATLAB, which gives the following results. Firstly, the application is opened and an introduction page is shown to the user (refer with: Fig. 8).

Figure 8. Introduction page of the application

The user can now move ahead in the application by clicking next. In the text-to-speech module the user has a choice among voice, rate and pace, as follows:

Figure 9. Selection of various modes

If the voice mode is selected, then the user can choose among various languages such as English, French, Spanish, Chinese, Russian, etc. (refer with: Fig. 10). Here, text in any font is read with the correct accent of the chosen language. The rate and pace functionalities can also be used by the user to alter how the text is spoken.
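The rate and pace controls above alter the playback of the synthesized output. Purely as an illustrative sketch (the paper's actual implementation goes through MATLAB and Microsoft SAPI), a pace change can be modeled as resampling the synthesized waveform by linear interpolation; the `change_pace` helper below is hypothetical:

```python
def change_pace(samples, factor):
    """Resample a waveform so it plays back `factor` times faster.

    samples: sequence of floats (the synthesized waveform).
    factor:  > 1.0 speeds speech up, < 1.0 slows it down.
    Uses linear interpolation between neighbouring samples.
    """
    if factor <= 0:
        raise ValueError("factor must be positive")
    n_out = max(1, int(len(samples) / factor))
    out = []
    for i in range(n_out):
        pos = i * factor          # position in the original signal
        lo = int(pos)
        hi = min(lo + 1, len(samples) - 1)
        frac = pos - lo
        out.append(samples[lo] * (1 - frac) + samples[hi] * frac)
    return out

# Doubling the pace halves the number of samples (the duration)
# while keeping the sample rate, so the speech plays twice as fast.
ramp = [i / 100.0 for i in range(100)]
fast = change_pace(ramp, 2.0)
print(len(fast))  # 50
```

A rate change, by contrast, would alter the output sampling rate itself rather than the number of samples.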

Figure 10. Page for selection of Voice

In the next module, image-to-text conversion, the user can select the image to be processed in order to extract the text from it, as shown in the figure below (refer with: Fig. 11).

Figure 11. Selecting the image to be processed

The selected image is then shown to the user, after which the image is converted to gray with specific threshold values (refer with: Fig. 12) in order to extract the text from the image.

Figure 12. Gray threshold of the selected image

The text is first processed by extracting individual lines from the image, as shown in Figure 13, after which the letters are extracted and displayed individually, as shown in Figure 14.

Figure 13. Individually extracted line

Figure 14. Character extraction

The same is done for numbers as well (refer to: Fig. 15). By comparing the letters and numbers against the already saved template files of the images of individual numbers and letters, we can convert the image into a text file once the user selects the convert-to-text-file option.

Figure 15. Processing of numbers in a line of the selected image

The application can also play the text file in the form of a speech output once the user selects that option. Once the user has finished working with the application, the application can be closed with the Quit option, followed by a "Thank you" prompt (refer to: Fig. 17).

Figure 16. Quitting the application

Figure 17. Thank you message
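The gray thresholding, inversion and template comparison used in this section can be sketched end-to-end as follows. This is an illustrative reconstruction in Python rather than the paper's MATLAB code; the tiny 3x3 templates (standing in for the 42x24 database) and the `binarize_invert`/`match_character` helpers are hypothetical:

```python
def binarize_invert(gray, t):
    """Hard-threshold a grayscale image at t, then invert, so dark
    foreground text becomes 1 (white) on a 0 (black) background.
    Mirrors the g = ~im2bw(f, t) step of the ITS algorithm."""
    return [[0 if px > t else 1 for px in row] for row in gray]

def match_character(glyph, templates):
    """Return the label of the template with the most agreeing pixels."""
    def score(tmpl):
        return sum(
            g == p
            for row_g, row_t in zip(glyph, tmpl)
            for g, p in zip(row_g, row_t)
        )
    return max(templates, key=lambda label: score(templates[label]))

# Hypothetical 3x3 binary templates for the letters "I" and "O".
templates = {
    "I": [[0, 1, 0], [0, 1, 0], [0, 1, 0]],
    "O": [[1, 1, 1], [1, 0, 1], [1, 1, 1]],
}
# A dark "I"-shaped stroke (low intensity) on a bright background.
gray = [[200, 30, 200], [200, 30, 200], [200, 30, 200]]
glyph = binarize_invert(gray, t=128)
print(match_character(glyph, templates))  # I
```

In the actual application each extracted glyph would first be resized to the 42x24 template dimensions before this comparison.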

5. Conclusion and Future Work


The results obtained from the derivatives of the paper help us conclude that text can be segmented into syllables and mapped to sounds to produce speech. The speech output originating from the system can be manipulated as per the user's objectives.
We also covered the processing of images in order to extract machine-printed text from the image, which can be further converted to speech. The product's future goal is to help the differently abled population of society [14]. This application is coded, using modular programming, into independent functions, which allows users in the future to add more functionalities, making it a more comprehensive and powerful tool that provides a smart solution to the problems/challenges faced by us. More refined algorithms will improve this application in the future.

References

[1] Chucai Yi, Yingli Tian, K. Anuradha, "Text to Speech Conversion", IEEE Transactions, vol. 19, pp. 269-278, 2013.

[2] S. Jayaraman, S. Esakkirajan, T. Veerakumar, Digital Image Processing, Tata McGraw Hill Education Private Limited.

[3] Zhang, J., "Language Generation and Speech Synthesis in Dialogues for Language Learning", Master's Dissertation, Massachusetts Institute of Technology, 2004.

[4] "Text-to-speech (TTS) Overview", Voice RSS website, retrieved February 21, 2014, from http://www.voicerss.org/tts/

[5] Rhead, M., "Accuracy of automatic number plate recognition (ANPR) and real world UK number plate problems", IEEE International Carnahan Conference on Security Technology (ICCST), 2012.

[6] Rakesh Kumar Mandal, N. R. Manna, "Hand Written English Character Recognition using Row-wise Segmentation Technique", International Symposium on Devices MEMS, Intelligent Systems & Communication, pp. 5-9, 2011.

[7] A. K. Jain and B. Yu, "Automatic Text Location in Images and Video Frames", in Proc. of the International Conference on Pattern Recognition (ICPR), Brisbane, pp. 1497-1499, 1998.

[8] "Text to Speech Synthesis System in Indian English", 2016 IEEE Region 10 Conference (TENCON); Shantha Selva Kumari, R. Sangeetha, "Conversion of English Text-to-Speech (TTS) Using Indian Speech Signal", International Journal of Scientific Engineering and Technology, vol. 4, no. 8, 1 August 2015.

[9] Chaw Su Thu Thu, Theingi Zin, "Implementation of Text to Speech Conversion", International Journal of Engineering Research & Technology (IJERT), vol. 3, issue 3, March 2014.

[10] Zhen Li, Xi Zhou, Thomas S. Huang, 2009 16th IEEE International Conference on Image Processing (ICIP), 2009.

[11] Manabu Ohta, Toshihiro Hachiki, Atsuhiro Takasu, Fourth International Conference on the Applications of Digital Information and Web Technologies (ICADIWT), 2011.

[12] Tu Bui, John Collomosse, 2015 IEEE International Conference on Image Processing (ICIP), 2015.

[13] Breen, A. P., "The future role of text to speech synthesis in automated services", Advances in Interactive Voice Technologies for Telecommunication Services (Digest No: 1997/147), IEEE Colloquium on, pp. 6/1-6/5, 12 June 1997.

[14] H. Li, D. Doermann, and O. Kia, "A system for converting English Text into Speech", IEEE Transactions on Image Processing, pp. 147-156, 2004.

[15] T. Padmapriya and V. Saminadan, "Improving Throughput for Downlink Multi-user MIMO-LTE Advanced Networks using SINR approximation and Hierarchical CSI feedback", International Journal of Mobile Design Network and Innovation, Inderscience Publishers, ISSN: 1744-2850, vol. 6, no. 1, pp. 14-23, May 2015.